
Unsupervised / Semi-Supervised / Contrastive Learning


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-05

C-LEAD: Contrastive Learning for Enhanced Adversarial Defense

Authors:Suklav Ghosh, Sonal Kumar, Arijit Sur

Deep neural networks (DNNs) have achieved remarkable success in computer vision tasks such as image classification, segmentation, and object detection. However, they are vulnerable to adversarial attacks, which can cause incorrect predictions with small perturbations in input images. Addressing this issue is crucial for deploying robust deep-learning systems. This paper presents a novel approach that utilizes contrastive learning for adversarial defense, a previously unexplored area. Our method leverages the contrastive loss function to enhance the robustness of classification models by training them with both clean and adversarially perturbed images. By optimizing the model’s parameters alongside the perturbations, our approach enables the network to learn robust representations that are less susceptible to adversarial attacks. Experimental results show significant improvements in the model’s robustness against various types of adversarial perturbations. This suggests that contrastive loss helps extract more informative and resilient features, contributing to the field of adversarial robustness in deep learning.
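As a rough, hypothetical illustration of pairing clean and adversarially perturbed views in a contrastive objective (not necessarily the authors' exact formulation), the sketch below generates PGD perturbations and applies an NT-Xent-style loss that treats the clean and adversarial embeddings of the same image as a positive pair; the attack budget, temperature, and loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8/255, alpha=2/255, steps=5):
    """Basic PGD attack used to create the adversarial view (assumed settings)."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = torch.min(torch.max(x_adv.detach() + alpha * grad.sign(), x - eps), x + eps)
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv

def nt_xent(z_clean, z_adv, tau=0.5):
    """NT-Xent loss: the (clean, adversarial) embeddings of the same image are the
    positive pair; every other image in the batch acts as a negative."""
    n = z_clean.size(0)
    z = F.normalize(torch.cat([z_clean, z_adv]), dim=1)            # (2N, d)
    sim = z @ z.t() / tau
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool, device=z.device), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Training step sketch: classification loss on clean images plus the contrastive term.
# loss = F.cross_entropy(head(feat(x)), y) + lam * nt_xent(feat(x), feat(pgd_perturb(model, x, y)))
```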


Paper and Project Links

PDF

Summary

This paper proposes a new method that uses contrastive learning to strengthen model robustness against adversarial attacks. The model is trained on both clean and adversarially perturbed images, and a contrastive loss is used to optimize its parameters alongside the perturbations, so the network learns robust representations that are less susceptible to attack. Experiments show that the contrastive loss significantly improves robustness against various types of adversarial perturbations.

Key Takeaways

  1. Contrastive learning is used to strengthen the robustness of deep neural networks against adversarial attacks.
  2. The method trains the model with a contrastive loss function on both clean and adversarially perturbed images.
  3. By optimizing its parameters alongside the perturbations, the model learns more robust representations.
  4. The contrastive loss helps extract more informative, attack-resilient features.
  5. Experiments show that the method significantly improves robustness against various adversarial perturbations.
  6. The work offers a new perspective on adversarial robustness in deep learning.


Comparative Study of UNet-based Architectures for Liver Tumor Segmentation in Multi-Phase Contrast-Enhanced Computed Tomography

Authors:Doan-Van-Anh Ly, Thi-Thu-Hien Pham, Thanh-Hai Le

Segmentation of liver structures in multi-phase contrast-enhanced computed tomography (CECT) plays a crucial role in computer-aided diagnosis and treatment planning for liver diseases, including tumor detection. In this study, we investigate the performance of UNet-based architectures for liver tumor segmentation, starting from the original UNet and extending to UNet3+ with various backbone networks. We evaluate ResNet, Transformer-based, and State-space (Mamba) backbones, all initialized with pretrained weights. Surprisingly, despite the advances in modern architecture, ResNet-based models consistently outperform Transformer- and Mamba-based alternatives across multiple evaluation metrics. To further improve segmentation quality, we introduce attention mechanisms into the backbone and observe that incorporating the Convolutional Block Attention Module (CBAM) yields the best performance. ResNetUNet3+ with CBAM module not only produced the best overlap metrics with a Dice score of 0.755 and IoU of 0.662, but also achieved the most precise boundary delineation, evidenced by the lowest HD95 distance of 77.911. The model’s superiority was further cemented by its leading overall accuracy of 0.925 and specificity of 0.926, showcasing its robust capability in accurately identifying both lesion and healthy tissue. To further enhance interpretability, Grad-CAM visualizations were employed to highlight the region’s most influential predictions, providing insights into its decision-making process. These findings demonstrate that classical ResNet architecture, when combined with modern attention modules, remain highly competitive for medical image segmentation tasks, offering a promising direction for liver tumor detection in clinical practice.
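To make the attention component concrete, here is a minimal CBAM block (channel attention followed by spatial attention), written from the generally published CBAM design rather than the authors' code; the reduction ratio and 7x7 spatial kernel are the common defaults and should be read as assumptions.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: channel attention followed by spatial attention."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                         # shared MLP for channel attention
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        chan = torch.sigmoid(self.mlp(x.mean(dim=(2, 3))) + self.mlp(x.amax(dim=(2, 3))))
        x = x * chan.view(b, c, 1, 1)                     # channel re-weighting
        smap = torch.cat([x.mean(dim=1, keepdim=True),    # (B, 2, H, W): avg- and max-pooled maps
                          x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(smap))      # spatial re-weighting

# Usage sketch: refine a ResNet encoder stage before the UNet3+ decoder, e.g.
# feat = CBAM(channels=512)(feat)
```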


Paper and Project Links

PDF 27 pages, 8 figures

Summary

This study evaluates UNet-based architectures for liver tumor segmentation, from the original UNet to UNet3+ with ResNet, Transformer-based, and State-space (Mamba) backbones. Despite the advances of modern architectures, ResNet-based models consistently outperform the Transformer- and Mamba-based alternatives across multiple evaluation metrics. After attention mechanisms are introduced, ResNetUNet3+ with a CBAM module not only achieves the best Dice and IoU scores but also the most precise boundary delineation, demonstrating its robustness and accuracy for liver tumor detection.

Key Takeaways

  1. UNet-based architectures play an important role in liver tumor segmentation.
  2. The study covers variants from UNet to UNet3+ combined with several pretrained backbone networks.
  3. ResNet-based models perform best in the evaluation, indicating their robustness for liver tumor detection.
  4. With attention mechanisms added, ResNetUNet3+ with CBAM achieves the best performance.
  5. The model attains the best Dice and IoU scores as well as the most precise boundary delineation.
  6. Grad-CAM visualizations are used to improve interpretability by highlighting the regions most influential for the predictions.


An Enhanced Dual Transformer Contrastive Network for Multimodal Sentiment Analysis

Authors:Phuong Q. Dao, Mark Roantree, Vuong M. Ngo

Multimodal Sentiment Analysis (MSA) seeks to understand human emotions by jointly analyzing data from multiple modalities typically text and images offering a richer and more accurate interpretation than unimodal approaches. In this paper, we first propose BERT-ViT-EF, a novel model that combines powerful Transformer-based encoders BERT for textual input and ViT for visual input through an early fusion strategy. This approach facilitates deeper cross-modal interactions and more effective joint representation learning. To further enhance the model’s capability, we propose an extension called the Dual Transformer Contrastive Network (DTCN), which builds upon BERT-ViT-EF. DTCN incorporates an additional Transformer encoder layer after BERT to refine textual context (before fusion) and employs contrastive learning to align text and image representations, fostering robust multimodal feature learning. Empirical results on two widely used MSA benchmarks MVSA-Single and TumEmo demonstrate the effectiveness of our approach. DTCN achieves best accuracy (78.4%) and F1-score (78.3%) on TumEmo, and delivers competitive performance on MVSA-Single, with 76.6% accuracy and 75.9% F1-score. These improvements highlight the benefits of early fusion and deeper contextual modeling in Transformer-based multimodal sentiment analysis.
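The sketch below illustrates the kind of symmetric image-text contrastive alignment DTCN relies on, using generic encoder outputs in place of the paper's BERT/ViT stacks; the temperature and the concatenation-based classifier are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def text_image_alignment(text_emb, img_emb, tau=0.07):
    """Symmetric InfoNCE loss: matched text/image pairs are pulled together,
    mismatched pairs within the batch are pushed apart."""
    t = F.normalize(text_emb, dim=1)
    v = F.normalize(img_emb, dim=1)
    logits = t @ v.t() / tau                       # (N, N) similarity matrix
    targets = torch.arange(t.size(0), device=t.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Joint objective sketch: fused features feed a sentiment classifier, and the
# alignment term regularizes the two encoders (classifier/encoders are placeholders).
# loss = F.cross_entropy(classifier(torch.cat([text_emb, img_emb], dim=1)), labels) \
#        + lambda_align * text_image_alignment(text_emb, img_emb)
```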


Paper and Project Links

PDF The paper has been accepted for presentation at the MEDES 2025 conference

Summary

This paper proposes BERT-ViT-EF, a multimodal sentiment analysis (MSA) model that combines a BERT text encoder and a ViT image encoder through an early fusion strategy. Building on it, the Dual Transformer Contrastive Network (DTCN) adds an extra Transformer encoder layer after BERT to refine textual context before fusion and uses contrastive learning to align text and image representations for robust multimodal feature learning. Empirical results on two widely used MSA benchmarks demonstrate the effectiveness of the approach.

Key Takeaways

  1. The paper introduces a new approach to multimodal sentiment analysis (MSA) that jointly analyzes text and image data for a richer, more accurate interpretation of sentiment.
  2. BERT-ViT-EF combines BERT and ViT Transformer encoders for text and image inputs through an early fusion strategy.
  3. Its extension DTCN adds a Transformer encoder layer to refine textual context before fusion and uses contrastive learning to align text and image representations.
  4. DTCN achieves the best results on TumEmo and competitive results on MVSA-Single, highlighting the benefits of early fusion and deeper contextual modeling in Transformer-based multimodal sentiment analysis.
  5. On TumEmo, DTCN reaches the best accuracy and F1-score (78.4% and 78.3%, respectively).
  6. On MVSA-Single, DTCN delivers competitive performance with 76.6% accuracy and a 75.9% F1-score.


Bi-Encoder Contrastive Learning for Fingerprint and Iris Biometrics

Authors:Matthew So, Judah Goldfeder, Mark Lis, Hod Lipson

There has been a historic assumption that the biometrics of an individual are statistically uncorrelated. We test this assumption by training Bi-Encoder networks on three verification tasks, including fingerprint-to-fingerprint matching, iris-to-iris matching, and cross-modal fingerprint-to-iris matching using 274 subjects with $\sim$100k fingerprints and 7k iris images. We trained ResNet-50 and Vision Transformer backbones in Bi-Encoder architectures such that the contrastive loss between images sampled from the same individual is minimized. The iris ResNet architecture reaches a 91 ROC AUC score for iris-to-iris matching, providing clear evidence that the left and right irises of an individual are correlated. Fingerprint models reproduce the positive intra-subject correlation suggested by prior work in this space. This is the first work attempting to use Vision Transformers for this matching. Cross-modal matching rises only slightly above chance, which suggests that more data and a more sophisticated pipeline are needed to obtain compelling results. These findings continue to challenge the independence assumptions of biometrics, and we plan to extend this work to other biometrics in the future. Code available: https://github.com/MatthewSo/bio_fingerprints_iris.
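A minimal sketch of the bi-encoder setup: two backbones embed the two inputs, a contrastive loss pulls together embeddings from the same subject, and verification pairs are scored by cosine similarity for ROC AUC. torchvision's ResNet-50 stands in for the backbones, and the embedding size, InfoNCE form, and evaluation are assumptions rather than the paper's exact pipeline.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50
from sklearn.metrics import roc_auc_score

class BiEncoder(nn.Module):
    """Two independent image encoders projected into a shared embedding space."""
    def __init__(self, dim=256):
        super().__init__()
        self.enc_a, self.enc_b = resnet50(weights=None), resnet50(weights=None)
        self.enc_a.fc = nn.Linear(2048, dim)
        self.enc_b.fc = nn.Linear(2048, dim)

    def forward(self, xa, xb):
        return F.normalize(self.enc_a(xa), dim=1), F.normalize(self.enc_b(xb), dim=1)

def info_nce(za, zb, tau=0.1):
    """Same-subject pairs sit on the diagonal; other subjects in the batch act as negatives."""
    logits = za @ zb.t() / tau
    targets = torch.arange(za.size(0), device=za.device)
    return F.cross_entropy(logits, targets)

def verification_auc(za, zb, same_subject):
    """Score each pair by cosine similarity and report ROC AUC for the verification task."""
    scores = (za * zb).sum(dim=1).detach().cpu().numpy()
    return roc_auc_score(same_subject, scores)
```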


Paper and Project Links

PDF

Summary

This paper tests the assumption that an individual's biometrics are statistically uncorrelated. Bi-Encoder networks are trained on three verification tasks, fingerprint-to-fingerprint, iris-to-iris, and cross-modal fingerprint-to-iris matching, using about 100k fingerprints and 7k iris images from 274 subjects. ResNet-50 and Vision Transformer backbones are trained so that the contrastive loss between images from the same individual is minimized. The iris ResNet model reaches a 91 ROC AUC score for iris-to-iris matching, clear evidence that a person's left and right irises are correlated, and the fingerprint models reproduce the positive intra-subject correlation reported in prior work. This is the first attempt to use Vision Transformers for this matching task. Cross-modal matching is only slightly above chance, suggesting that more data and a more sophisticated pipeline are needed for compelling results. The findings further challenge the independence assumption of biometrics, and the authors plan to extend the work to other biometrics.

Key Takeaways

  1. The paper tests the assumption that an individual's biometric traits are statistically uncorrelated, using Bi-Encoder networks on three verification tasks.
  2. ResNet-50 and Vision Transformer backbones are trained in a Bi-Encoder architecture to minimize the contrastive loss between images from the same individual.
  3. Iris-to-iris matching results show that a person's left and right irises are correlated.
  4. Fingerprint models reproduce the positive intra-subject correlation reported in prior work.
  5. This is the first attempt to use Vision Transformers for this matching task.
  6. Cross-modal matching is only slightly above chance, indicating that more data and a more sophisticated pipeline are needed to improve the results.


AutoSciDACT: Automated Scientific Discovery through Contrastive Embedding and Hypothesis Testing

Authors:Samuel Bright-Thonney, Christina Reissel, Gaia Grosso, Nathaniel Woodward, Katya Govorkova, Andrzej Novak, Sang Eon Park, Eric Moreno, Philip Harris

Novelty detection in large scientific datasets faces two key challenges: the noisy and high-dimensional nature of experimental data, and the necessity of making statistically robust statements about any observed outliers. While there is a wealth of literature on anomaly detection via dimensionality reduction, most methods do not produce outputs compatible with quantifiable claims of scientific discovery. In this work we directly address these challenges, presenting the first step towards a unified pipeline for novelty detection adapted for the rigorous statistical demands of science. We introduce AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT begins by creating expressive low-dimensional data representations using a contrastive pre-training, leveraging the abundance of high-quality simulated data in many scientific domains alongside expertise that can guide principled data augmentation strategies. These compact embeddings then enable an extremely sensitive machine learning-based two-sample test using the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations in observed data relative to a reference distribution (null hypothesis). We perform experiments across a range of astronomical, physical, biological, image, and synthetic datasets, demonstrating strong sensitivity to small injections of anomalous data across all domains.
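As a toy stand-in for the machine-learning two-sample test stage (the paper uses the NPLM framework, whose test statistic differs), the sketch below trains a classifier to separate reference from observed embeddings and turns its held-out AUC into a permutation p-value; every detail here is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def classifier_two_sample_test(ref_emb, obs_emb, n_perm=200, seed=0):
    """Classifier two-sample test on contrastive embeddings: an AUC well above 0.5
    signals that the observed sample deviates from the reference distribution."""
    rng = np.random.default_rng(seed)
    X = np.vstack([ref_emb, obs_emb])
    y = np.concatenate([np.zeros(len(ref_emb)), np.ones(len(obs_emb))])

    def held_out_auc(labels):
        Xtr, Xte, ytr, yte = train_test_split(
            X, labels, test_size=0.3, random_state=seed, stratify=labels)
        clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
        return roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

    observed = held_out_auc(y)
    null = [held_out_auc(rng.permutation(y)) for _ in range(n_perm)]  # permutation null
    p_value = (1 + sum(a >= observed for a in null)) / (n_perm + 1)
    return observed, p_value
```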


Paper and Project Links

PDF Accepted at NeurIPS 2025; 32 pages, 16 figures

Summary

Novelty detection in large scientific datasets faces two challenges: noisy, high-dimensional experimental data and the need for statistically robust statements about observed outliers. To address them, the paper introduces AutoSciDACT (Automated Scientific Discovery with Anomalous Contrastive Testing), a general-purpose pipeline for detecting novelty in scientific data. AutoSciDACT first builds expressive low-dimensional representations with contrastive pre-training, exploiting the abundant high-quality simulated data available in many scientific domains together with domain expertise that guides principled data augmentation. These compact embeddings then feed a highly sensitive machine-learning two-sample test based on the New Physics Learning Machine (NPLM) framework, which identifies and statistically quantifies deviations of observed data from a reference distribution (the null hypothesis). Experiments on astronomical, physical, biological, image, and synthetic datasets show strong sensitivity to small injections of anomalous data.

Key Takeaways

  1. Novelty detection in scientific datasets must cope with noisy, high-dimensional data and the need for statistically robust statements about observed outliers.
  2. AutoSciDACT is a unified, general-purpose pipeline proposed to address these challenges in scientific novelty detection.
  3. It builds expressive low-dimensional data representations through contrastive pre-training.
  4. The pre-training leverages the abundance of high-quality simulated data in many scientific domains together with expert knowledge that guides principled data augmentation strategies.
  5. A machine-learning two-sample test based on the NPLM framework identifies and statistically quantifies deviations of the observed data from the reference distribution.


Digital Contrast CT Pulmonary Angiography Synthesis from Non-contrast CT for Pulmonary Vascular Disease

Authors:Ying Ming, Yue Lin, Longfei Zhao, Gengwan Li, Zuopeng Tan, Bing Li, Sheng Xie, Wei Song, Qiqi Xu

Computed Tomography Pulmonary Angiography (CTPA) is the reference standard for diagnosing pulmonary vascular diseases such as Pulmonary Embolism (PE) and Chronic Thromboembolic Pulmonary Hypertension (CTEPH). However, its reliance on iodinated contrast agents poses risks including nephrotoxicity and allergic reactions, particularly in high-risk patients. This study proposes a method to generate Digital Contrast CTPA (DCCTPA) from Non-Contrast CT (NCCT) scans using a cascaded synthesizer based on Cycle-Consistent Generative Adversarial Networks (CycleGAN). A total of 410 retrospective paired CTPA and NCCT scans were obtained from three centers. The model was trained and validated internally on 249 paired images. An additional dataset comprising 161 paired images was used as the test set for model generalization evaluation and downstream clinical task validation. Compared with state-of-the-art (SOTA) methods, the proposed method achieved the best comprehensive performance by evaluating quantitative metrics (For validation, MAE: 156.28, PSNR: 20.71 and SSIM: 0.98; For test, MAE: 165.12, PSNR: 20.27 and SSIM: 0.98) and qualitative visualization, demonstrating valid vessel enhancement, superior image fidelity and structural preservation. The approach was further applied to downstream tasks of pulmonary vessel segmentation and vascular quantification. On the test set, the average Dice, clDice, and clRecall of artery and vein pulmonary segmentation were 0.70, 0.71, 0.73 and 0.70, 0.72, 0.75 respectively, all markedly improved compared with NCCT inputs. The Inter-class Correlation Coefficient (ICC) for vessel volume between DCCTPA and CTPA was significantly better than that between NCCT and CTPA (average ICC: 0.81 vs 0.70), indicating effective vascular enhancement in DCCTPA, especially for small vessels.
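For readers unfamiliar with the backbone, a skeletal CycleGAN objective for the NCCT-to-CTPA direction (adversarial plus cycle-consistency terms) is sketched below with placeholder generator and discriminator modules; the paper's cascaded synthesizer, network definitions, and loss weights are not reproduced here.

```python
import torch
import torch.nn as nn

def cyclegan_losses(G_nc2ct, G_ct2nc, D_ct, real_nc, real_ct, lambda_cyc=10.0):
    """One direction of a CycleGAN objective: NCCT -> synthetic contrast-enhanced CT.
    G_* are generators, D_ct is the discriminator on the contrast-enhanced domain."""
    l1, bce = nn.L1Loss(), nn.BCEWithLogitsLoss()

    fake_ct = G_nc2ct(real_nc)                        # synthesize the contrast-enhanced image
    pred_fake = D_ct(fake_ct)
    adv = bce(pred_fake, torch.ones_like(pred_fake))  # generator tries to fool the discriminator
    cyc = l1(G_ct2nc(fake_ct), real_nc)               # cycle back to the non-contrast domain
    gen_loss = adv + lambda_cyc * cyc

    pred_real = D_ct(real_ct)
    pred_fake_detached = D_ct(fake_ct.detach())
    disc_loss = 0.5 * (bce(pred_real, torch.ones_like(pred_real)) +
                       bce(pred_fake_detached, torch.zeros_like(pred_fake_detached)))
    return gen_loss, disc_loss
```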


Paper and Project Links

PDF

Summary

This study proposes a method that synthesizes digital contrast CTPA (DCCTPA) from non-contrast CT (NCCT) scans using a cascaded CycleGAN-based synthesizer, targeting pulmonary vascular diseases such as pulmonary embolism and chronic thromboembolic pulmonary hypertension. The model was trained and validated internally on 249 paired scans and evaluated on an additional 161 pairs for generalization and downstream clinical tasks. Compared with state-of-the-art methods, it achieves the best overall performance on quantitative metrics (validation MAE 156.28, PSNR 20.71, SSIM 0.98; test MAE 165.12, PSNR 20.27, SSIM 0.98) and in qualitative visualization. On downstream pulmonary vessel segmentation, the average Dice, clDice, and clRecall for arteries and veins reach 0.70, 0.71, 0.73 and 0.70, 0.72, 0.75 respectively, clearly improving over NCCT inputs. The ICC of vessel volume between DCCTPA and CTPA is markedly higher than that between NCCT and CTPA (average 0.81 vs 0.70), indicating effective vascular enhancement, especially for small vessels.

Key Takeaways

  1. The study proposes a CycleGAN-based method for generating DCCTPA images from NCCT scans.
  2. The method performs well on quantitative metrics and qualitative visualization, achieving effective vessel enhancement.
  3. Its overall performance is the best among the compared state-of-the-art methods.
  4. The method shows good potential in downstream tasks such as pulmonary vessel segmentation and vascular quantification.
  5. Artery and vein segmentation from DCCTPA clearly improves over NCCT inputs on the test set.
  6. The ICC of vessel volume between DCCTPA and CTPA is significantly higher than that between NCCT and CTPA, indicating an advantage in enhancing small vessels.


Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Authors:Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
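The snippet below is a highly simplified sketch of training a latent bridge between two modalities: paired samples are encoded into a shared latent space, a noisy interpolation between source and target latents is formed at a random time, a predictor is trained to recover the target latent, and a contrastive term keeps paired latents aligned. The interpolation schedule, the predictor signature, and the loss weights are assumptions and do not follow the paper's exact DDBM formulation.

```python
import torch
import torch.nn.functional as F

def bridge_training_step(enc_src, enc_tgt, predictor, x_src, x_tgt, lambda_c=0.1):
    """One training step for a toy latent bridge between two modalities."""
    z_src, z_tgt = enc_src(x_src), enc_tgt(x_tgt)            # shared latent space, shape (N, d)
    t = torch.rand(z_src.size(0), 1, device=z_src.device)    # bridge time in (0, 1)
    noise = torch.randn_like(z_src)
    z_t = (1 - t) * z_src + t * z_tgt + torch.sqrt(t * (1 - t)) * noise  # Brownian-bridge-style interpolant

    recon = F.mse_loss(predictor(z_t, t), z_tgt)             # predict the clean target latent

    # contrastive alignment: paired (source, target) latents are positives within the batch
    a, b = F.normalize(z_src, dim=1), F.normalize(z_tgt, dim=1)
    logits = a @ b.t() / 0.1
    targets = torch.arange(a.size(0), device=a.device)
    align = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
    return recon + lambda_c * align
```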


Paper and Project Links

PDF Accepted as a poster at NeurIPS 2025

Summary

This paper applies diffusion models to modality translation (MT). To overcome the limitations of existing methods, it proposes the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework based on a latent-variable extension of denoising diffusion bridge models that learns a bridge between arbitrary modalities in a shared latent space without requiring aligned dimensions. With a contrastive alignment loss and a domain-agnostic encoder-decoder designed for noise prediction in latent space, the method supports arbitrary modality pairs and performs strongly on a range of MT tasks.

Key Takeaways

  1. Diffusion models have achieved remarkable success in single-modality domains such as images and audio, but extending them to modality translation (MT) remains challenging.
  2. Existing methods often rely on restrictive assumptions such as shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding.
  3. The proposed Latent Denoising Diffusion Bridge Model (LDDBM) is a general-purpose framework for modality translation based on a latent-variable extension of denoising diffusion bridge models.
  4. LDDBM learns a bridge between modalities in a shared latent space without requiring aligned dimensions.
  5. A contrastive alignment loss is introduced to enforce semantic consistency between paired samples.
  6. A domain-agnostic encoder-decoder architecture is designed for noise prediction in the latent space.


CARE: Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams

Authors:Junhao Zhao, Zishuai Liu, Ruili Fang, Jin Lu, Linghan Zhang, Fei Dou

The recognition of Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, yet existing methods remain constrained by representation-level limitations. Sequence-based approaches preserve temporal order of sensor activations but are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion (e.g., feature concatenation) fail to enforce alignment between sequence- and image-based representation views, underutilizing their complementary strengths. We propose Contrastive Alignment for ADL Recognition from Event-Triggered Sensor Streams (CARE), an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring both cross-representation alignment and task-specific discriminability. CARE integrates (i) time-aware, noise-resilient sequence encoding with (ii) spatially-informed and frequency-sensitive image representations, and employs (iii) a joint contrastive-classification objective for end-to-end learning of aligned and discriminative embeddings. Evaluated on three CASAS datasets, CARE achieves state-of-the-art performance (89.8% on Milan, 88.9% on Cairo, and 73.3% on Kyoto7) and demonstrates robustness to sensor malfunctions and layout variability, highlighting its potential for reliable ADL recognition in smart homes.
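A compact sketch of a joint contrastive-alignment-plus-classification objective: sequence and image features of the same activity window are projected into a shared space, an InfoNCE term aligns them, and a cross-entropy term trains a classifier on the fused features. The projection sizes, fusion by concatenation, and loss weight are illustrative assumptions, not the paper's exact SICA design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAlignClassifyHead(nn.Module):
    """Joint contrastive-alignment + classification head over paired sequence/image features."""
    def __init__(self, seq_dim, img_dim, emb_dim, n_classes):
        super().__init__()
        self.proj_seq = nn.Linear(seq_dim, emb_dim)
        self.proj_img = nn.Linear(img_dim, emb_dim)
        self.classifier = nn.Linear(2 * emb_dim, n_classes)

    def forward(self, seq_feat, img_feat, labels, tau=0.1, lambda_align=0.5):
        zs = F.normalize(self.proj_seq(seq_feat), dim=1)
        zi = F.normalize(self.proj_img(img_feat), dim=1)

        logits = zs @ zi.t() / tau                    # align the two views of the same window
        targets = torch.arange(zs.size(0), device=zs.device)
        align = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

        cls_logits = self.classifier(torch.cat([zs, zi], dim=1))
        return F.cross_entropy(cls_logits, labels) + lambda_align * align, cls_logits
```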


Paper and Project Links

PDF

Summary

Recognizing Activities of Daily Living (ADLs) from event-triggered ambient sensors is an essential task in Ambient Assisted Living, but existing methods are limited at the representation level. Sequence-based approaches preserve the temporal order of sensor activations yet are sensitive to noise and lack spatial awareness, while image-based approaches capture global patterns and implicit spatial correlations but compress fine-grained temporal dynamics and distort sensor layouts. Naive fusion such as feature concatenation fails to align the two representation views and underuses their complementary strengths. The proposed CARE framework jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification via cross-entropy, ensuring cross-representation alignment and task-specific discriminability. It integrates time-aware, noise-resilient sequence encoding with spatially informed, frequency-sensitive image representations and uses a joint contrastive-classification objective for end-to-end learning of aligned, discriminative embeddings. On three CASAS datasets, CARE achieves state-of-the-art accuracy (89.8% on Milan, 88.9% on Cairo, 73.3% on Kyoto7) and is robust to sensor malfunctions and layout variability, underscoring its potential for reliable ADL recognition in smart homes.

Key Takeaways

  1. Recognizing Activities of Daily Living (ADLs) is a core task in Ambient Assisted Living.
  2. Existing methods have representation-level limitations such as noise sensitivity, lack of spatial awareness, or distorted sensor layouts.
  3. Naive fusion of sequence- and image-based views fails to exploit their complementary strengths.
  4. CARE is an end-to-end framework that jointly optimizes representation learning via Sequence-Image Contrastive Alignment (SICA) and classification.
  5. CARE integrates time-aware, noise-resilient sequence encoding with spatially informed, frequency-sensitive image representations.
  6. A joint contrastive-classification objective is used for end-to-end learning of aligned, discriminative embeddings.


Style-Aware Blending and Prototype-Based Cross-Contrast Consistency for Semi-Supervised Medical Image Segmentation

Authors:Chaowei Chen, Xiang Zhang, Honglie Guo, Shunfang Wang

Weak-strong consistency learning strategies are widely employed in semi-supervised medical image segmentation to train models by leveraging limited labeled data and enforcing weak-to-strong consistency. However, existing methods primarily focus on designing and combining various perturbation schemes, overlooking the inherent potential and limitations within the framework itself. In this paper, we first identify two critical deficiencies: (1) separated training data streams, which lead to confirmation bias dominated by the labeled stream; and (2) incomplete utilization of supervisory information, which limits exploration of strong-to-weak consistency. To tackle these challenges, we propose a style-aware blending and prototype-based cross-contrast consistency learning framework. Specifically, inspired by the empirical observation that the distribution mismatch between labeled and unlabeled data can be characterized by statistical moments, we design a style-guided distribution blending module to break the independent training data streams. Meanwhile, considering the potential noise in strong pseudo-labels, we introduce a prototype-based cross-contrast strategy to encourage the model to learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions, while mitigating the adverse effects of noise. Experimental results demonstrate the effectiveness and superiority of our framework across multiple medical segmentation benchmarks under various semi-supervised settings.
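The abstract notes that the labeled/unlabeled distribution mismatch can be characterized by statistical moments; the sketch below shows a generic moment-based style blending of two image batches (AdaIN-style mean/std mixing), which is one plausible reading of such a module and not the paper's implementation.

```python
import torch

def style_blend(x_labeled, x_unlabeled, alpha=0.5, eps=1e-6):
    """Blend first/second-order statistics (per-channel mean, std) of the labeled batch
    toward those of the unlabeled batch while keeping the labeled content."""
    mu_l = x_labeled.mean(dim=(2, 3), keepdim=True)
    std_l = x_labeled.flatten(2).std(dim=2)[:, :, None, None] + eps
    mu_u = x_unlabeled.mean(dim=(2, 3), keepdim=True)
    std_u = x_unlabeled.flatten(2).std(dim=2)[:, :, None, None] + eps

    mu_mix = alpha * mu_l + (1 - alpha) * mu_u        # interpolated style statistics
    std_mix = alpha * std_l + (1 - alpha) * std_u
    return (x_labeled - mu_l) / std_l * std_mix + mu_mix

# Usage sketch: feed the blended images through the student branch so that the
# labeled and unlabeled training streams are no longer fully separated.
```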


Paper and Project Links

PDF

Summary

This paper examines the limitations of weak-strong consistency learning in semi-supervised medical image segmentation and proposes a new framework that addresses two key deficiencies of existing methods: separated training data streams and incomplete utilization of supervisory information. Through style-aware blending and prototype-based cross-contrast consistency learning, the framework breaks the independent training data streams and learns informative supervisory signals from both weak-to-strong and strong-to-weak predictions while mitigating the adverse effects of noise. Experiments demonstrate its effectiveness and superiority across multiple medical segmentation benchmarks.

Key Takeaways

  1. Existing weak-strong consistency strategies focus on designing perturbation schemes and overlook the inherent potential and limitations of the framework itself.
  2. Separated training data streams lead to confirmation bias dominated by the labeled stream.
  3. Incomplete utilization of supervisory information limits the exploration of strong-to-weak consistency.
  4. The new framework combines style-aware blending with a prototype-based cross-contrast strategy to address these issues.
  5. The style-guided distribution blending module breaks the independent training data streams to handle the distribution mismatch between labeled and unlabeled data.
  6. The prototype-based cross-contrast strategy helps the model learn informative supervisory signals from both weak-to-strong and strong-to-weak predictions.


Contrastive Conditional-Unconditional Alignment for Long-tailed Diffusion Model

Authors:Fang Chen, Alex Villa, Gongbo Liang, Xiaoyi Lu, Meng Tang

Training data for class-conditional image synthesis often exhibit a long-tailed distribution with limited images for tail classes. Such an imbalance causes mode collapse and reduces the diversity of synthesized images for tail classes. For class-conditional diffusion models trained on imbalanced data, we aim to improve the diversity and fidelity of tail class images without compromising the quality of head class images. We achieve this by introducing two simple but highly effective loss functions. Firstly, we employ an Unsupervised Contrastive Loss (UCL) utilizing negative samples to increase the distance/dissimilarity among synthetic images. Such regularization is coupled with a standard trick of batch resampling to further diversify tail-class images. Our second loss is an Alignment Loss (AL) that aligns class-conditional generation with unconditional generation at large timesteps. This second loss makes the denoising process insensitive to class conditions for the initial steps, which enriches tail classes through knowledge sharing from head classes. We successfully leverage contrastive learning and conditional-unconditional alignment for class-imbalanced diffusion models. Our framework is easy to implement as demonstrated on both U-Net based architecture and Diffusion Transformer. Our method outperforms vanilla denoising diffusion probabilistic models, score-based diffusion model, and alternative methods for class-imbalanced image generation across various datasets, in particular ImageNet-LT with 256x256 resolution.
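To illustrate the conditional-unconditional alignment idea, the sketch below penalizes the difference between class-conditional and unconditional noise predictions only at large timesteps, so the initial denoising steps become insensitive to the class condition; the eps_model signature, the null-label trick for the unconditional branch, and the timestep threshold are assumptions.

```python
import torch

def alignment_loss(eps_model, x_t, t, y, null_class, t_threshold=800):
    """Penalize the gap between class-conditional and unconditional noise predictions,
    but only at large (very noisy) timesteps, so head-class knowledge can be shared
    with tail classes during the first denoising steps."""
    eps_cond = eps_model(x_t, t, y)                              # assumed signature: (x_t, t, class)
    with torch.no_grad():
        eps_uncond = eps_model(x_t, t, torch.full_like(y, null_class))  # unconditional via a null label
    mask = (t >= t_threshold).float().view(-1, *([1] * (x_t.dim() - 1)))
    return (mask * (eps_cond - eps_uncond) ** 2).mean()

# Total objective sketch: standard denoising loss + this alignment term + an
# unsupervised contrastive loss (UCL) that spreads out synthetic tail-class images.
```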


Paper and Project Links

PDF 20 pages, 11 figures

Summary

For class-conditional image synthesis on long-tailed data, the authors improve the diversity and fidelity of tail-class images with two simple loss functions, without compromising head-class quality. An Unsupervised Contrastive Loss (UCL) uses negative samples to increase the dissimilarity among synthetic images, and an Alignment Loss (AL) aligns class-conditional generation with unconditional generation at large timesteps. Combining contrastive learning with conditional-unconditional alignment improves class-imbalanced diffusion models, is easy to implement on both U-Net based architectures and the Diffusion Transformer, and performs strongly across several datasets.

Key Takeaways

  1. The work targets class imbalance in class-conditional image synthesis and improves the generation of tail-class images.
  2. Two loss functions are introduced: an Unsupervised Contrastive Loss (UCL) and an Alignment Loss (AL).
  3. UCL uses negative samples to increase the dissimilarity among synthetic images and is coupled with batch resampling to further diversify tail-class images.
  4. AL aligns class-conditional with unconditional generation at large timesteps, making the initial denoising steps insensitive to class conditions and enriching tail classes with knowledge shared from head classes.
  5. The study successfully applies contrastive learning and conditional-unconditional alignment to class-imbalanced diffusion models.
  6. The method is easy to implement on both U-Net based architectures and the Diffusion Transformer.


AdFair-CLIP: Adversarial Fair Contrastive Language-Image Pre-training for Chest X-rays

Authors:Chenlang Yi, Zizhan Xiong, Qi Qi, Xiyuan Wei, Girish Bathla, Ching-Long Lin, Bobak Jack Mortazavi, Tianbao Yang

Contrastive Language-Image Pre-training (CLIP) models have demonstrated superior performance across various visual tasks including medical image classification. However, fairness concerns, including demographic biases, have received limited attention for CLIP models. This oversight leads to critical issues, particularly those related to race and gender, resulting in disparities in diagnostic outcomes and reduced reliability for underrepresented groups. To address these challenges, we introduce AdFair-CLIP, a novel framework employing adversarial feature intervention to suppress sensitive attributes, thereby mitigating spurious correlations and improving prediction fairness. We conduct comprehensive experiments on chest X-ray (CXR) datasets, and show that AdFair-CLIP significantly enhances both fairness and diagnostic accuracy, while maintaining robust generalization in zero-shot and few-shot scenarios. These results establish new benchmarks for fairness-aware learning in CLIP-based medical diagnostic models, particularly for CXR analysis.
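Adversarial feature intervention is often implemented with a gradient-reversal layer plus an adversary that tries to predict the sensitive attribute from image features; the sketch below shows that generic pattern, which is an interpretation of the abstract rather than AdFair-CLIP's published objective.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class SensitiveAdversary(nn.Module):
    """Predicts a sensitive attribute (e.g., race or sex) from image features; the
    reversed gradient pushes the encoder to suppress that information."""
    def __init__(self, feat_dim, n_groups):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_groups))

    def forward(self, features, lambd=1.0):
        return self.head(GradReverse.apply(features, lambd))

# Training sketch: total loss = CLIP contrastive loss
#                               + CE(adversary(image_features), sensitive_group_labels)
```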


Paper and Project Links

PDF This preprint has been accepted by MICCAI 2025

Summary

Contrastive Language-Image Pre-training (CLIP) models perform well on many visual tasks, but fairness concerns such as demographic biases related to race and gender have received limited attention, particularly in medical image classification. AdFair-CLIP addresses this with adversarial feature intervention that suppresses sensitive attributes, mitigating spurious correlations and improving prediction fairness. Experiments on chest X-ray (CXR) datasets show that AdFair-CLIP significantly improves both fairness and diagnostic accuracy while maintaining robust generalization in zero-shot and few-shot settings, setting a new benchmark for fairness-aware learning in CLIP-based medical diagnostic models.

Key Takeaways

  1. CLIP models perform well on many visual tasks but raise fairness concerns in tasks such as medical image classification.
  2. These fairness issues can lead to disparities in diagnostic outcomes and reduced reliability for underrepresented groups.
  3. AdFair-CLIP is proposed to address these problems via adversarial feature intervention that suppresses sensitive attributes.
  4. Experiments on chest X-ray (CXR) datasets show that AdFair-CLIP significantly improves both fairness and diagnostic accuracy.
  5. AdFair-CLIP maintains robust generalization in zero-shot and few-shot scenarios.
  6. AdFair-CLIP establishes a new benchmark for fairness-aware learning in CLIP-based medical diagnostic models.


MoralCLIP: Contrastive Alignment of Vision-and-Language Representations with Moral Foundations Theory

Authors:Ana Carolina Condez, Diogo Tavares, João Magalhães

Recent advances in vision-language models have enabled rich semantic understanding across modalities. However, these encoding methods lack the ability to interpret or reason about the moral dimensions of content-a crucial aspect of human cognition. In this paper, we address this gap by introducing MoralCLIP, a novel embedding representation method that extends multimodal learning with explicit moral grounding based on Moral Foundations Theory (MFT). Our approach integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. MoralCLIP is grounded on the multi-label dataset Social-Moral Image Database to identify co-occurring moral foundations in visual content. For MoralCLIP training, we design a moral data augmentation strategy to scale our annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Our results demonstrate that explicit moral supervision improves both unimodal and multimodal understanding of moral content, establishing a foundation for morally-aware AI systems capable of recognizing and aligning with human moral values.
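One way to ground a contrastive objective in Moral Foundations Theory labels is to weight image-text pairs by the overlap of their multi-label moral annotations; the sketch below uses a Jaccard-weighted soft contrastive loss, which is an illustrative construction and not the loss published in the paper.

```python
import torch
import torch.nn.functional as F

def moral_weighted_contrastive(img_emb, txt_emb, moral_labels, tau=0.07, eps=1e-8):
    """Soft contrastive loss: the target distribution over candidates in the batch is
    proportional to the Jaccard overlap of multi-hot moral-foundation label vectors."""
    v = F.normalize(img_emb, dim=1)
    t = F.normalize(txt_emb, dim=1)
    logits = v @ t.t() / tau                                  # (N, N) image-to-text similarities

    m = moral_labels.float()                                  # (N, K) multi-hot MFT labels
    inter = m @ m.t()                                         # shared foundations per pair
    union = m.sum(1, keepdim=True) + m.sum(1) - inter
    targets = inter / union.clamp_min(eps)
    targets = targets / targets.sum(dim=1, keepdim=True).clamp_min(eps)  # rows sum to 1

    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```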


Paper and Project Links

PDF Updated version: corresponds to the ACM MM ‘25 published paper and includes full appendix material

Summary

This paper introduces MoralCLIP, a multimodal embedding method grounded in Moral Foundations Theory (MFT) that addresses the inability of current vision-language models to interpret or reason about the moral dimensions of content. MoralCLIP integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment. It builds on the multi-label Social-Moral Image Database to identify co-occurring moral foundations in visual content, and a moral data augmentation strategy scales the annotated dataset to 15,000 image-text pairs labeled with MFT-aligned dimensions. Results show that explicit moral supervision improves both unimodal and multimodal understanding of moral content, laying a foundation for morally aware AI systems that can recognize and align with human moral values.

Key Takeaways

  1. Current vision-language models lack the ability to interpret the moral dimensions of content.
  2. MoralCLIP is a multimodal learning method grounded in Moral Foundations Theory (MFT).
  3. It integrates visual and textual moral cues into a unified embedding space, enabling cross-modal moral alignment.
  4. The multi-label Social-Moral Image Database is used to identify co-occurring moral foundations in visual content.
  5. A moral data augmentation strategy expands the annotated dataset to 15,000 MFT-labeled image-text pairs.
  6. Explicit moral supervision improves both unimodal and multimodal understanding of moral content.


Decoupling Contrastive Decoding: Robust Hallucination Mitigation in Multimodal Large Language Models

Authors:Wei Chen, Xin Yan, Bin Wen, Fan Yang, Tingting Gao, Di Zhang, Long Chen

Although multimodal large language models (MLLMs) exhibit remarkable reasoning capabilities on complex multimodal understanding tasks, they still suffer from the notorious hallucination issue: generating outputs misaligned with obvious visual or factual evidence. Currently, training-based solutions, like direct preference optimization (DPO), leverage paired preference data to suppress hallucinations. However, they risk sacrificing general reasoning capabilities due to the likelihood displacement. Meanwhile, training-free solutions, like contrastive decoding, achieve this goal by subtracting the estimated hallucination pattern from a distorted input. Yet, these handcrafted perturbations (e.g., add noise to images) may poorly capture authentic hallucination patterns. To avoid these weaknesses of existing methods, and realize robust hallucination mitigation (i.e., maintaining general reasoning performance), we propose a novel framework: Decoupling Contrastive Decoding (DCD). Specifically, DCD decouples the learning of positive and negative samples in preference datasets, and trains separate positive and negative image projections within the MLLM. The negative projection implicitly models real hallucination patterns, which enables vision-aware negative images in the contrastive decoding inference stage. Our DCD alleviates likelihood displacement by avoiding pairwise optimization and generalizes robustly without handcrafted degradation. Extensive ablations across hallucination benchmarks and general reasoning tasks demonstrate the effectiveness of DCD, i.e., it matches DPO’s hallucination suppression while preserving general capabilities and outperforms the handcrafted contrastive decoding methods.
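Contrastive decoding itself is easy to state: the next-token distribution is sharpened by amplifying the "positive" prediction and subtracting a scaled "negative" one. The sketch below shows that generic inference-time step; in DCD the negative logits would come from the learned negative image projection, and the alpha and plausibility-threshold values here are assumptions.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def contrastive_decode_step(logits_pos, logits_neg, alpha=1.0, plausibility=0.1):
    """One decoding step: amplify the positive branch and subtract the negative branch,
    restricted to tokens that are plausible under the positive distribution."""
    combined = (1 + alpha) * logits_pos - alpha * logits_neg
    # adaptive plausibility mask: keep tokens whose positive probability is within a
    # factor of the best token's probability, to avoid promoting implausible tokens
    probs = F.softmax(logits_pos, dim=-1)
    mask = probs >= plausibility * probs.max(dim=-1, keepdim=True).values
    combined = combined.masked_fill(~mask, float('-inf'))
    return torch.argmax(combined, dim=-1)    # greedy pick; sampling also works
```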


Paper and Project Links

PDF 17 pages, 4 figures

Summary

Multimodal large language models show strong reasoning on complex multimodal tasks but still suffer from hallucination: generating outputs that contradict obvious visual or factual evidence. Training-based solutions such as direct preference optimization (DPO) use paired preference data to suppress hallucinations but risk sacrificing general reasoning due to likelihood displacement, while training-free contrastive decoding relies on handcrafted perturbations (e.g., adding noise to images) that may capture real hallucination patterns poorly. The proposed Decoupling Contrastive Decoding (DCD) framework decouples the learning of positive and negative samples in preference datasets and trains separate positive and negative image projections inside the MLLM; the negative projection implicitly models real hallucination patterns, enabling vision-aware negative images at the contrastive-decoding inference stage. DCD avoids pairwise optimization, thereby alleviating likelihood displacement, and generalizes robustly without handcrafted degradation. Extensive ablations show it matches DPO's hallucination suppression while preserving general capabilities and outperforms handcrafted contrastive decoding methods.

Key Takeaways

  1. Multimodal large language models show strong reasoning on complex tasks but still suffer from hallucination.
  2. Existing training-based methods such as direct preference optimization may sacrifice general reasoning ability while suppressing hallucinations.
  3. Training-free contrastive decoding relies on handcrafted perturbations that may fail to capture real hallucination patterns.
  4. The proposed Decoupling Contrastive Decoding (DCD) framework aims for robust hallucination mitigation while maintaining general reasoning performance.
  5. DCD decouples the learning of positive and negative samples and trains separate positive and negative image projections to suppress hallucinations.
  6. By avoiding pairwise optimization, DCD alleviates likelihood displacement and generalizes robustly.


Data Matters Most: Auditing Social Bias in Contrastive Vision Language Models

Authors:Zahraa Al Sahili, Ioannis Patras, Matthew Purver

Vision-language models (VLMs) deliver strong zero-shot recognition but frequently inherit social biases from their training data. We systematically disentangle three design factors – model size, training-data scale, and training-data source – by comparing CLIP and OpenCLIP, two models that share an identical contrastive objective yet differ in encoder width and in the image-text corpora on which they are pre-trained (400M proprietary pairs vs. 400M/2B LAION). Across balanced face-analysis benchmarks, enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP; increasing the LAION corpus from 400M to 2B further increases OpenCLIP bias. At matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew, underscoring data source as the primary driver of bias patterns. We also evaluate three post-hoc, test-time debiasing strategies – Bias Prompts, Prompt Array, and SANER. Debiasing reduces but does not eliminate harm, and its effectiveness is source- and size-dependent: Bias Prompts most effectively reduce gender skew in CLIP at smaller model sizes, whereas Prompt Array and SANER more reliably reduce racial skew in OpenCLIP; scaling LAION reconfigures which method is most fair. Taken together, these findings challenge the assumption that bigger models or datasets are automatically fairer and foreground training data source as the key determinant of both bias and mitigation efficacy. We release code and evaluation scripts to enable transparent, reproducible auditing of future VLMs.
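A lightweight way to audit this kind of bias is to compare, for each demographic group, how often a zero-shot classifier assigns a given label and summarize the disparity as a skew score; the sketch below computes a simple max/min selection-rate ratio, a generic audit metric rather than the exact measure used in the paper, and the usage example is hypothetical.

```python
from collections import defaultdict

def selection_rate_skew(predictions, groups, target_label):
    """Per-group rate at which `target_label` is predicted, plus a max/min skew ratio.
    `predictions` and `groups` are parallel sequences, one entry per image."""
    totals, hits = defaultdict(int), defaultdict(int)
    for pred, grp in zip(predictions, groups):
        totals[grp] += 1
        hits[grp] += int(pred == target_label)
    rates = {g: hits[g] / totals[g] for g in totals}
    lowest = min(rates.values())
    skew = max(rates.values()) / lowest if lowest > 0 else float("inf")
    return rates, skew

# Hypothetical usage: rates, skew = selection_rate_skew(zero_shot_preds, gender_labels, "nurse")
```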


Paper and Project Links

PDF Accepted at TMLR

Summary

This paper audits the social biases that vision-language models (VLMs) inherit from their training data despite strong zero-shot recognition. By comparing CLIP and OpenCLIP, which share the same contrastive objective but differ in encoder width and pre-training corpora, it disentangles how model size, training-data scale, and training-data source shape gender and racial skew, and it evaluates three post-hoc, test-time debiasing strategies. The study finds that the training-data source is the primary driver of bias patterns and that the effectiveness of debiasing depends on the data source and model size.

Key Takeaways

  1. VLMs deliver strong zero-shot recognition but frequently inherit social biases from their training data.
  2. Model size, training-data scale, and training-data source are the three design factors examined for their effect on bias.
  3. Enlarging the encoder reduces gender skew in CLIP but amplifies both gender and racial skew in OpenCLIP, and growing the LAION corpus from 400M to 2B further increases OpenCLIP's bias.
  4. Training-data source is the primary driver of bias patterns: at matched model and data budgets, substituting proprietary data with LAION improves gender fairness while increasing racial skew.
  5. The three debiasing strategies (Bias Prompts, Prompt Array, and SANER) reduce but do not eliminate harm, and their effectiveness depends on data source and model size.
  6. Bigger models or datasets are not automatically fairer.


