⚠️ All summaries below are generated by large language models; they may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on these summaries in serious academic settings; use them only for initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-18
More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era
Authors:Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma, Wei Wei, Shaohua Kevin Zhou
The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that vision encoders trained on this “silver-standard” dataset achieve performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate **more performant and scalable** medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.
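The labeling step lends itself to a very small script. Below is a minimal sketch of LLM-based “silver-standard” label extraction, assuming the OpenAI chat-completions SDK; the prompt wording, model name, and finding list are illustrative assumptions rather than the authors' actual setup.

```python
# Minimal sketch of LLM-based "silver-standard" label extraction from a
# radiology report. Prompt wording, model name, and finding list are
# illustrative assumptions, not the configuration used in the paper.
import json
from openai import OpenAI

FINDINGS = ["atelectasis", "cardiomegaly", "pleural effusion", "lung nodule"]  # hypothetical subset

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def extract_labels(report_text: str) -> dict:
    """Ask the LLM for a binary label per finding and parse its JSON reply."""
    prompt = (
        "Read the radiology report below and reply with only a JSON object mapping "
        f"each of these findings to 1 (present) or 0 (absent): {', '.join(FINDINGS)}.\n\n"
        f"Report:\n{report_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",   # assumption: any capable instruction-tuned LLM
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # deterministic labeling
    )
    # Assumes the model returns bare JSON; a real pipeline should add retries/validation.
    return json.loads(response.choices[0].message.content)

labels = extract_labels("Mild cardiomegaly. No pleural effusion. Lungs are clear.")
print(labels)  # e.g. {"atelectasis": 0, "cardiomegaly": 1, ...}
```

Per the abstract, labels produced this way can then drive ordinary multi-label supervised pre-training of the vision encoder before the vanilla CLIP stage.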
Paper and project links
PDF MICCAI 2025
Summary
This paper explores the potential of Large Language Models (LLMs) for medical contrastive vision-language pre-training. The study finds that LLMs can automatically extract diagnostic labels from radiology reports to build a large-scale “silver-standard” dataset, enabling large-scale supervised pre-training at low cost. Building on this, LLM-enabled supervised pre-training substantially improves contrastive vision-language alignment, yielding state-of-the-art diagnosis and cross-modal retrieval results. Overall, the work demonstrates the potential of LLMs for building more performant and scalable medical AI systems.
Key Takeaways
- Large Language Models (LLMs) can automatically extract diagnostic labels from radiology reports to create large-scale “silver-standard” datasets.
- LLMs make large-scale supervised pre-training simpler and cheaper.
- LLM-enabled supervised pre-training improves contrastive vision-language alignment.
- The approach achieves state-of-the-art diagnostic results, including high AUC for zero-shot diagnosis.
- Cross-modal retrieval performance is substantially improved.
- The work points toward more performant and scalable medical AI systems.
Click here to view paper screenshots




Contrastive Learning with Enhanced Abstract Representations using Grouped Loss of Abstract Semantic Supervision
Authors:Omri Suissa, Muhiim Ali, Shengmai Chen, Yinuo Cai, Shekhar Pradhan
Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and their relationships. In this paper, we investigate (1) the extent to which VLMs have this concept abstraction capacity, and (2) strategies for encoding the sort of higher-concept information in images that would enable the resulting VLM model (CLEAR GLASS model) to have this capability to a greater degree. To this end, we introduce a grouped image-caption dataset (MAGIC), which consists of several groups of image captions and for each group a set of associated images and higher-level conceptual labels. We use a novel contrastive loss technique to induce the model to encode in the representation of each image (caption) in a group the information that is common to all members of the image-caption group. Our main contribution is a grouped contrastive loss function based on text-image contrastive groups (outer contrastive loss) as well as an inner loss which measures the distances between image-caption instances in the group. Our training methodology results in the CLEAR GLASS model having the concept abstraction capacity as an emergent capacity because the model is not exposed to the higher-level concepts associated with each group. Instead, the training forces the model to create for each image-caption group a semantic representation that brings it closer to the semantic representation of the higher-level concepts in the latent semantic space. Our experiments show that this training methodology results in a model which shows improvement in abstract concept recognition compared to SOTA models.
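Since the abstract describes the loss only at a high level, here is one plausible PyTorch reading of it: a CLIP-style outer contrastive term over group-level embeddings (assumed here to be mean-pooled instance embeddings) plus an inner term over image-caption distances within each group. The pooling choice and weighting are assumptions, not the paper's exact formulation.

```python
# Sketch of a grouped contrastive loss: a CLIP-style outer InfoNCE term over
# group-level image/text embeddings plus an inner term over image-caption
# distances inside each group. One plausible reading of the abstract, not the
# authors' exact formulation.
import torch
import torch.nn.functional as F

def grouped_contrastive_loss(img_emb, txt_emb, group_ids,
                             temperature=0.07, inner_weight=0.5):
    # img_emb, txt_emb: (N, d) L2-normalized instance embeddings
    # group_ids: (N,) integer group id for each image-caption pair
    groups = group_ids.unique()

    # Group-level embeddings via mean pooling of the members (assumption).
    img_g = F.normalize(torch.stack([img_emb[group_ids == g].mean(0) for g in groups]), dim=-1)
    txt_g = F.normalize(torch.stack([txt_emb[group_ids == g].mean(0) for g in groups]), dim=-1)

    # Outer loss: symmetric InfoNCE between group-level image and text embeddings.
    logits = img_g @ txt_g.t() / temperature
    targets = torch.arange(len(groups), device=logits.device)
    outer = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

    # Inner loss: mean pairwise distance between image and caption instances in a group.
    inner = torch.stack([torch.cdist(img_emb[group_ids == g], txt_emb[group_ids == g]).mean()
                         for g in groups]).mean()

    return outer + inner_weight * inner

# Toy usage: 8 image-caption pairs in 4 groups of 2.
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
gid = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(grouped_contrastive_loss(img, txt, gid))
```

Minimizing the inner term pushes the members of a group toward a shared representation, which is how the abstract explains the emergent concept abstraction.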
Paper and project links
Summary
This paper examines the concept abstraction capacity of vision-language models (VLMs) and studies how to encode higher-level conceptual information in images. To this end, it introduces a grouped image-caption dataset (MAGIC) and trains with a grouped contrastive loss built from text-image contrastive groups together with an inner loss. Experiments show that this training strategy improves abstract concept recognition compared to SOTA models.
Key Takeaways
- Humans can recognize an image as an instance of a general concept, beyond simply identifying its objects and relationships.
- The paper investigates the concept abstraction capacity of vision-language models (VLMs).
- It introduces a grouped image-caption dataset (MAGIC) consisting of groups of image captions, each with associated images and higher-level conceptual labels.
- A novel contrastive loss technique induces the model to encode the higher-level concept information shared within each group.
- Training combines a grouped contrastive loss over text-image contrastive groups (outer loss) with an inner loss measuring distances between image-caption instances within a group.
- The model is never exposed to each group's higher-level concepts; training instead forces it to create a semantic representation for each image-caption group that moves closer to the higher-level concepts in the latent semantic space.
Click here to view paper screenshots




Contra4: Evaluating Contrastive Cross-Modal Reasoning in Audio, Video, Image, and 3D
Authors:Artemis Panagopoulou, Le Xue, Honglu Zhou, Silvio Savarese, Ran Xu, Caiming Xiong, Chris Callison-Burch, Mark Yatskar, Juan Carlos Niebles
Real-world decision-making often begins with identifying which modality contains the most relevant information for a given query. While recent multimodal models have made impressive progress in processing diverse inputs, it remains unclear whether they can reason contrastively across multiple modalities to select the one that best satisfies a natural language prompt. We argue this capability is foundational, especially in retrieval-augmented and decision-time contexts, where systems must evaluate multiple signals and identify which one conveys the relevant information. To evaluate this skill, we introduce Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Each example presents a natural language question alongside multiple candidate modality instances, and the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision, resulting in 174k training examples and a manually verified test set of 2.3k samples. While task-specific fine-tuning helps improve performance by 56% relative to baseline, state-of-the-art models still achieve only 56% absolute accuracy overall and 42% in the four-modality setting, underscoring a significant limitation in current multimodal models.
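To make the task format concrete, here is a small sketch of how a Contra4-style multiple-choice item might be represented and scored; the field names and the `predict` hook are hypothetical, not the released dataset schema or the official evaluator.

```python
# Sketch of scoring a Contra4-style task: each item pairs a question with
# candidate instances from different modalities, and the model must pick the
# one that matches the prompt. Field names and the `predict` hook are
# hypothetical, not the released dataset schema or an official evaluator.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Contra4Item:
    question: str          # natural language prompt
    candidates: List[str]  # paths/ids of candidate instances (image, audio, video, 3D)
    modalities: List[str]  # modality tag for each candidate
    answer_idx: int        # index of the candidate that satisfies the prompt

def accuracy(items: List[Contra4Item], predict: Callable[[Contra4Item], int]) -> float:
    """`predict` is any model wrapper that returns the chosen candidate index."""
    correct = sum(predict(item) == item.answer_idx for item in items)
    return correct / max(len(items), 1)

# Example with a trivial baseline that always picks the first candidate.
items = [
    Contra4Item("Which clip contains a barking dog?",
                ["img_001.png", "aud_114.wav", "vid_220.mp4", "mesh_078.obj"],
                ["image", "audio", "video", "3d"], answer_idx=1),
]
print(accuracy(items, predict=lambda item: 0))  # 0.0 for this toy item
```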
Paper and project links
Summary
Recent multimodal models have made impressive progress in processing diverse inputs, but it remains unclear whether they can reason contrastively across modalities, especially when a system must weigh multiple signals and decide which one best matches a natural language prompt. To evaluate this, the paper introduces Contra4, a dataset for contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D. Given a natural language question and several candidate modality instances, the model must select the one that semantically aligns with the prompt. Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision. Although task-specific fine-tuning improves performance, state-of-the-art models reach only 56% accuracy overall and 42% in the four-modality setting, underscoring a major limitation of current multimodal models.
Key Takeaways
- Multimodal models have made impressive progress on diverse inputs but still fall short at contrastive cross-modal reasoning.
- Contra4 evaluates contrastive cross-modal reasoning across four modalities: image, audio, video, and 3D.
- The model must select, from multiple candidate modality instances, the one that semantically aligns with a natural language question.
- Contra4 combines human-annotated captions with a mixture-of-models round-trip-consistency filter to ensure high-quality supervision.
- Task-specific fine-tuning helps improve model performance.
- Current models reach only 56% accuracy overall and 42% in the four-modality setting, indicating a significant limitation.
Click here to view paper screenshots




