Published: 2025-10-02

Updated: 2025-11-27


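⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are only suitable for screening papers before reading.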

2025-10-02 Update

Generalized Contrastive Learning for Universal Multimodal Retrieval

Authors:Jungsoo Lee, Janghoon Cho, Hyojin Park, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi

Despite their consistent performance improvements, cross-modal retrieval models (e.g., CLIP) show degraded performances with retrieving keys composed of fused image-text modality (e.g., Wikipedia pages with both images and text). To address this critical challenge, multimodal retrieval has been recently explored to develop a unified single retrieval model capable of retrieving keys across diverse modality combinations. A common approach involves constructing new composed sets of image-text triplets (e.g., retrieving a pair of image and text given a query image). However, such an approach requires careful curation to ensure the dataset quality and fails to generalize to unseen modality combinations. To overcome these limitations, this paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the burdensome need for new dataset curation. Specifically, GCL operates by enforcing contrastive learning across all modalities within a mini-batch, utilizing existing image-caption paired datasets to learn a unified representation space. We demonstrate the effectiveness of GCL by showing consistent performance improvements on off-the-shelf multimodal retrieval models (e.g., VISTA, CLIP, and TinyCLIP) using the M-BEIR, MMEB, and CoVR benchmarks.


Paper and project links

PDF Accepted to NeurIPS 2025

Summary

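This paper proposes Generalized Contrastive Learning (GCL) to improve multimodal retrieval. Existing cross-modal retrieval models degrade when the keys fuse image and text. GCL addresses this by applying contrastive learning across all modalities within a mini-batch, using existing image-caption paired datasets to learn a unified representation space. The method needs no new dataset curation and yields consistent improvements on off-the-shelf retrieval models.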

Key Takeaways

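Current cross-modal retrieval models (e.g., CLIP) degrade when retrieving keys that fuse image and text.
Multimodal retrieval aims to build a single unified model that can retrieve keys across diverse modality combinations.
A common approach constructs new composed image-text triplet datasets, but this requires careful curation to ensure quality and fails to generalize to unseen modality combinations.
The paper proposes Generalized Contrastive Learning (GCL), a loss formulation that enforces contrastive learning across all modalities within a mini-batch.
GCL uses existing image-caption paired datasets to learn a unified representation space.
GCL consistently improves off-the-shelf models (VISTA, CLIP, and TinyCLIP) on the M-BEIR, MMEB, and CoVR benchmarks without new dataset curation.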
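
To make "contrastive learning across all modalities within a mini-batch" more concrete, here is a minimal sketch in plain Python. It is not the authors' loss. It assumes each image-caption sample yields three embeddings (image, text, and a fused image-text embedding) and applies a standard InfoNCE term to every cross-modality pair. The function name, the three-modality setup, and the temperature value are all illustrative assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def all_modality_contrastive_loss(embeddings, temperature=0.07):
    """InfoNCE-style loss over every cross-modality pair in a mini-batch.

    embeddings: dict mapping a modality name to a list of B vectors, where
    index i in every list comes from the same image-caption sample.
    Needs at least two modalities. Hypothetical helper, not the paper's code.
    """
    modalities = list(embeddings)
    batch_size = len(embeddings[modalities[0]])
    # Pool every embedding as (sample index, modality, unit vector).
    pool = [
        (i, m, normalize(embeddings[m][i]))
        for m in modalities
        for i in range(batch_size)
    ]
    total, pairs = 0.0, 0
    for q_idx, q_mod, q in pool:
        # Negatives: embeddings of other samples, in any modality.
        negatives = [dot(q, k) / temperature for k_idx, _, k in pool if k_idx != q_idx]
        for p_idx, p_mod, p in pool:
            if p_idx != q_idx or p_mod == q_mod:
                continue  # positives: same sample, different modality
            positive = dot(q, p) / temperature
            logits = [positive] + negatives
            top = max(logits)  # subtract the max for a stable log-sum-exp
            log_sum = top + math.log(sum(math.exp(x - top) for x in logits))
            total += log_sum - positive
            pairs += 1
    return total / pairs


# Toy batch of two samples whose three embeddings agree per sample.
aligned = {
    "image": [[1.0, 0.0], [0.0, 1.0]],
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "fused": [[1.0, 0.0], [0.0, 1.0]],
}
# Same batch with the captions swapped between the two samples.
shuffled = dict(aligned, text=[[0.0, 1.0], [1.0, 0.0]])
print(all_modality_contrastive_loss(aligned), all_modality_contrastive_loss(shuffled))
```

The aligned batch should score a much lower loss than the shuffled one, because every modality of a sample is pulled toward the others while all embeddings of other samples act as negatives.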


Neural Fields for Highly Accelerated 2D Cine Phase Contrast MRI

Authors:Pablo Arratia, Martin J. Graves, Mary McLean, Carolin Pirkl, Carola-Bibiane Schönlieb, Timo Schirmer, Florian Wiesinger, Matthias J. Ehrhardt

2D cine phase contrast (CPC) MRI provides quantitative information on blood velocity and flow within the human vasculature. However, data acquisition is time-consuming, motivating the reconstruction of the velocity field from undersampled measurements to reduce scan times. In this work, we propose using neural fields to parametrize the complex-valued images, leveraging their inductive bias for the reconstruction of the velocity data. Additionally, to mitigate the inherent over-smoothing of neural fields, we introduce a simple voxel-based postprocessing step. We validate our method numerically in Cartesian and radial k-space with both high and low temporal resolution data. Our approach achieves accurate reconstructions at high acceleration factors, with low errors even at 16x and 32x undersampling, and consistently outperforms classical locally low-rank regularized voxel-based methods in both flow estimates and anatomical depiction.


Paper and project links

PDF

Summary
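The paper reconstructs velocity data for 2D cine phase contrast (CPC) MRI, which quantifies blood velocity and flow, from undersampled measurements. It parametrizes the complex-valued images with neural fields and adds a simple voxel-based postprocessing step to counter the over-smoothing inherent in neural fields. The method reconstructs accurately at high acceleration factors and consistently outperforms classical locally low-rank regularized voxel-based methods.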

Key Takeaways

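2D cine phase contrast MRI provides quantitative information on blood velocity and flow within the human vasculature.
Because data acquisition is time-consuming, the velocity field is reconstructed from undersampled measurements to shorten scan times.
Neural fields parametrize the complex-valued images, and their inductive bias aids the reconstruction of the velocity data.
A simple voxel-based postprocessing step mitigates the inherent over-smoothing of neural fields.
The method is validated numerically in Cartesian and radial k-space with both high and low temporal resolution data.
Reconstructions remain accurate at high acceleration factors, with low errors even at 16x and 32x undersampling.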
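
As a rough illustration of what "parametrizing the complex-valued images with a neural field" can look like, the sketch below maps a space-time coordinate to one complex voxel value with a tiny coordinate MLP, then converts a phase difference into velocity using the standard phase-contrast relation v = VENC · Δφ / π. The network is untrained and its weights are random; fitting against undersampled k-space and the voxel-based postprocessing are omitted. The architecture, the Fourier-feature frequencies, and all names (`TinyNeuralField`, `velocity_from_phase`, `venc`) are assumptions made for illustration, not details from the paper.

```python
import cmath
import math
import random

FREQUENCIES = (1.0, 2.0, 4.0)  # hypothetical Fourier-feature frequencies


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fourier_features(coords):
    """Encode (x, y, t) with sines and cosines of several frequencies."""
    feats = []
    for freq in FREQUENCIES:
        for c in coords:
            feats.append(math.sin(2 * math.pi * freq * c))
            feats.append(math.cos(2 * math.pi * freq * c))
    return feats


class TinyNeuralField:
    """Coordinate MLP: (x, y, t) in [0, 1]^3 -> one complex voxel value."""

    def __init__(self, hidden=16, seed=0):
        rng = random.Random(seed)
        in_dim = 2 * 3 * len(FREQUENCIES)  # sin and cos, three coordinates
        self.w1 = [[rng.gauss(0, in_dim ** -0.5) for _ in range(in_dim)]
                   for _ in range(hidden)]
        self.w2 = [[rng.gauss(0, hidden ** -0.5) for _ in range(hidden)]
                   for _ in range(2)]

    def __call__(self, x, y, t):
        feats = fourier_features((x, y, t))
        hidden = [max(0.0, dot(row, feats)) for row in self.w1]  # ReLU layer
        real, imag = (dot(row, hidden) for row in self.w2)
        return complex(real, imag)


def velocity_from_phase(reference, encoded, venc):
    """Velocity from the phase difference between two complex voxel values.

    venc is the velocity that maps to a phase shift of pi.
    """
    phase_diff = cmath.phase(encoded * reference.conjugate())
    return venc * phase_diff / math.pi


# Two untrained fields stand in for the reference and flow-encoded images.
reference_field = TinyNeuralField(seed=0)
encoded_field = TinyNeuralField(seed=1)
voxel = (0.25, 0.5, 0.1)  # normalized (x, y, t)
print(velocity_from_phase(reference_field(*voxel), encoded_field(*voxel), venc=100.0))
```

Because the field is a continuous function of (x, y, t), it can be queried at any resolution, which is one reason such representations are attractive when the k-space data are heavily undersampled.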


Mitigating Hallucination in Multimodal LLMs with Layer Contrastive Decoding

Authors:Bingkui Tong, Jiaer Xia, Kaiyang Zhou

Multimodal Large Language Models (MLLMs) have shown impressive perception and reasoning capabilities, yet they often suffer from hallucinations – generating outputs that are linguistically coherent but inconsistent with the context of the input image, including inaccuracies in objects, attributes, and relations. To address this challenge, we propose a simple approach called Layer Contrastive Decoding (LayerCD). Our design is motivated by the observation that shallow visual features are much more likely than deep visual features to cause an MLLM to hallucinate as they only capture biased, low-level information that is insufficient for high-level reasoning. Therefore, LayerCD aims to filter out hallucinations by contrasting the output distributions generated from visual features of different levels, specifically those from the shallow and deep layers of the vision encoder, respectively. We conduct extensive experiments on two hallucination benchmarks and show that LayerCD significantly outperforms current state-of-the-art. The code for LayerCD is available at https://github.com/maifoundations/LayerCD .


Paper and project links

PDF

Summary
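Multimodal Large Language Models (MLLMs) perceive and reason well but often hallucinate: they produce output that is linguistically coherent yet inconsistent with the input image, with errors in objects, attributes, and relations. The paper proposes a simple method, Layer Contrastive Decoding (LayerCD). It builds on the observation that shallow visual features are more likely than deep ones to cause hallucination, because they capture only biased, low-level information that is insufficient for high-level reasoning. LayerCD therefore filters out hallucinations by contrasting the output distributions generated from the shallow and deep layers of the vision encoder. Experiments on two hallucination benchmarks show that LayerCD significantly outperforms the current state of the art. Code: https://github.com/maifoundations/LayerCD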

Key Takeaways

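MLLMs have strong perception and reasoning capabilities but are prone to hallucination.
Hallucinated output is linguistically coherent but inconsistent with the context of the input image.
Layer Contrastive Decoding (LayerCD) filters out hallucinations by contrasting output distributions generated from visual features of different levels.
Shallow visual features are more likely than deep visual features to cause an MLLM to hallucinate.
The design is motivated by the observation that shallow features capture only biased, low-level information that is insufficient for high-level reasoning.
LayerCD significantly outperforms the current state of the art on two hallucination benchmarks.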
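
The idea of contrasting two output distributions can be sketched for a single decoding step as follows. This is a generic contrastive-decoding rule, not the formula from the LayerCD paper or repository. It assumes the model has been run twice, once with deep and once with shallow vision-encoder features, and that both next-token logit vectors are available. The weights `alpha` and `beta` and the plausibility cutoff are borrowed from common contrastive-decoding practice and are assumptions here.

```python
import math


def log_softmax(logits):
    top = max(logits)
    log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
    return [x - log_norm for x in logits]


def contrastive_next_token(deep_logits, shallow_logits, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting deep-feature and shallow-feature outputs.

    deep_logits / shallow_logits: next-token logits from the same MLLM when it is
    conditioned on deep or shallow visual features (hypothetical inputs).
    """
    deep = log_softmax(deep_logits)
    shallow = log_softmax(shallow_logits)
    # Keep only tokens the deep branch finds plausible, so the subtraction
    # cannot promote tokens that both branches consider unlikely.
    cutoff = max(deep) + math.log(beta)
    scores = [
        (1 + alpha) * d - alpha * s if d >= cutoff else float("-inf")
        for d, s in zip(deep, shallow)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Token 0 is strongly preferred by the shallow branch, token 1 mostly by the deep one.
deep_logits = [2.0, 1.9, -5.0]
shallow_logits = [3.0, 0.0, -5.0]
print(contrastive_next_token(deep_logits, shallow_logits))
```

In this toy case greedy decoding on the deep logits alone would pick token 0, while the contrastive score penalizes it because the shallow branch favors it even more, so token 1 is chosen instead.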


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-02/%E6%97%A0%E7%9B%91%E7%9D%A3_%E5%8D%8A%E7%9B%91%E7%9D%A3_%E5%AF%B9%E6%AF%94%E5%AD%A6%E4%B9%A0/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Unsupervised / Semi-supervised / Contrastive Learning
