Published: 2025-11-08

Updated: 2025-11-27

Word count: 3.2k

Reading time: 13 min

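⚠️ All of the summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for initial screening before reading a paper!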

Updated 2025-11-08

PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Authors:Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li, Lingxiang Wu, Haiyun Guo, Jinqiao Wang

While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model’s fine-grained vision-language alignment. However, the inherent token length limitation of CLIP’s text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP’s original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.

Paper and project links

PDF

Summary

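CLIP performs well on downstream vision-language understanding tasks, but fine-grained image-text alignment remains a challenge. To raise the granularity of both visual and textual processing, the authors propose PixCLIP, a framework that accepts visual prompt inputs and processes long text descriptions at the same time. They build an automated annotation pipeline that generates pixel-level localized, long-form descriptions of images, and use it to construct LongGRIT, a high-quality dataset of nearly 1.5 million samples. They also replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework that aligns image regions with their text descriptions at arbitrary granularity. According to the reported experiments, PixCLIP reaches state-of-the-art performance in pixel-level interaction and long-text handling.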

Key Takeaways

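CLIP performs well on downstream vision-language understanding tasks, but its fine-grained image-text alignment still needs improvement.
Most existing work increases the granularity of visual processing, for example by using visual prompts to guide the model to focus on specific local regions of the image.
Research on Multimodal Large Language Models shows that training with long, detailed text descriptions can effectively improve fine-grained vision-language alignment.
The token length limit of CLIP's text encoder prevents it from processing the finer-grained information embedded in long text sequences.
PixCLIP accepts visual prompts and long text descriptions together; its automated annotation pipeline generates pixel-level localized, long-form descriptions, which are used to build the LongGRIT dataset of nearly 1.5 million samples.
PixCLIP replaces CLIP's original text encoder with an LLM and uses a three-branch pixel-text alignment learning framework to align image regions with text descriptions at arbitrary granularity.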
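
The region-text alignment described above builds on CLIP-style contrastive learning. This summary does not detail the paper's three-branch framework, so the following is only a minimal sketch of the generic symmetric contrastive (InfoNCE) objective over paired region and text embeddings. The function names, the toy embeddings, and the temperature of 0.07 are illustrative assumptions, not PixCLIP's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cross_entropy_row(logits, target_index):
    """Cross-entropy of one row of logits against the index of the true pair."""
    largest = max(logits)  # subtract the max for numerical stability
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_sum_exp - logits[target_index]


def symmetric_contrastive_loss(region_embeddings, text_embeddings, temperature=0.07):
    """Generic CLIP-style loss: region i is assumed to be paired with text i."""
    n = len(region_embeddings)
    logits = [
        [cosine_similarity(r, t) / temperature for t in text_embeddings]
        for r in region_embeddings
    ]
    # Region-to-text direction: each row should peak on its diagonal entry.
    region_to_text = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Text-to-region direction: each column should peak on its diagonal entry.
    text_to_region = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (region_to_text + text_to_region)


# Toy 2-D embeddings (hypothetical): correctly paired vs. deliberately swapped.
regions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(regions, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(regions, [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the correctly paired embeddings give a loss close to zero, while the swapped pairing gives a large loss, which is the behaviour a contrastive alignment objective is meant to reward.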

MedDChest: A Content-Aware Multimodal Foundational Vision Model for Thoracic Imaging

Authors:Mahmoud Soliman, Islam Osman, Mohamed S. Shehata, Rasika Rajapakshe

The performance of vision models in medical imaging is often hindered by the prevailing paradigm of fine-tuning backbones pre-trained on out-of-domain natural images. To address this fundamental domain gap, we propose MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. We pre-trained MedDChest from scratch on a massive, curated, multimodal dataset of over 1.2 million images, encompassing different modalities including Chest X-ray and Computed Tomography (CT) compiled from 10 public sources. A core technical contribution of our work is Guided Random Resized Crops, a novel content-aware data augmentation strategy that biases sampling towards anatomically relevant regions, overcoming the inefficiency of standard cropping techniques on medical scans. We validate our model’s effectiveness by fine-tuning it on a diverse set of downstream diagnostic tasks. Comprehensive experiments empirically demonstrate that MedDChest significantly outperforms strong, publicly available ImageNet-pretrained models. By establishing the superiority of large-scale, in-domain pre-training combined with domain-specific data augmentation, MedDChest provides a powerful and robust feature extractor that serves as a significantly better starting point for a wide array of thoracic diagnostic tasks. The model weights will be made publicly available to foster future research and applications.

Paper and project links

PDF 10 pages, 2 figures

Summary

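This paper proposes MedDChest, a new foundational Vision Transformer (ViT) model optimized specifically for thoracic imaging. The model is pre-trained from scratch on a large, curated, multimodal dataset of over 1.2 million chest X-ray and CT images compiled from 10 public sources, addressing the domain gap that arises when fine-tuning backbones pre-trained on natural images. Its core technical contribution is Guided Random Resized Crops, a content-aware data augmentation strategy that biases crop sampling towards anatomically relevant regions. The model is validated by fine-tuning on a diverse set of downstream diagnostic tasks, where it is reported to significantly outperform publicly available ImageNet-pretrained models. The authors state that the model weights will be made publicly available.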

Key Takeaways

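MedDChest is a foundational Vision Transformer model optimized for thoracic imaging.
It is pre-trained from scratch on a large multimodal dataset of over 1.2 million chest X-ray and CT images.
MedDChest uses a novel content-aware data augmentation strategy, Guided Random Resized Crops.
Guided Random Resized Crops biases sampling towards anatomically relevant regions, avoiding the inefficiency of standard cropping on medical scans.
The model's effectiveness is validated by fine-tuning it on downstream diagnostic tasks.
MedDChest is reported to outperform publicly available ImageNet-pretrained models.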
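
The idea of biasing crops towards relevant regions can be illustrated with simple rejection sampling. The abstract does not specify how Guided Random Resized Crops selects regions, so this is a sketch under stated assumptions: a binary foreground mask is available, crops keep the image's aspect ratio, and a crop is accepted once enough of it is foreground. The names `guided_random_crop` and `min_foreground` are hypothetical, not the authors' API.

```python
import random


def foreground_fraction(mask, top, left, height, width):
    """Fraction of pixels inside the crop box that are marked 1 in the mask."""
    total = 0
    for row in mask[top:top + height]:
        total += sum(row[left:left + width])
    return total / (height * width)


def guided_random_crop(mask, scale=(0.2, 1.0), min_foreground=0.5,
                       max_tries=50, rng=None):
    """Sample a crop box (top, left, height, width) biased towards foreground.

    Assumption: candidate boxes are drawn at random and rejected until one
    covers enough foreground; the full image is returned if none qualifies.
    """
    rng = rng or random.Random()
    rows, cols = len(mask), len(mask[0])
    for _ in range(max_tries):
        area_fraction = rng.uniform(*scale)   # fraction of the image area
        side_scale = area_fraction ** 0.5     # same scale on both axes
        height = max(1, int(rows * side_scale))
        width = max(1, int(cols * side_scale))
        top = rng.randint(0, rows - height)
        left = rng.randint(0, cols - width)
        if foreground_fraction(mask, top, left, height, width) >= min_foreground:
            return top, left, height, width
    return 0, 0, rows, cols  # fallback: use the whole image


# Toy 8x8 mask (hypothetical) with a 4x4 block of "anatomy" in the centre.
toy_mask = [[1 if 2 <= r <= 5 and 2 <= c <= 5 else 0 for c in range(8)]
            for r in range(8)]
crop_box = guided_random_crop(toy_mask, rng=random.Random(0))
```

A standard random resized crop would accept the first sampled box regardless of content; the only change here is the acceptance test on the mask, which is what makes the sampling content-aware.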

DAMRO: Dive into the Attention Mechanism of LVLM to Reduce Object Hallucination

Authors:Xuan Gong, Tianshi Ming, Xinpeng Wang, Zhihua Wei

Despite the great success of Large Vision-Language Models (LVLMs), they inevitably suffer from hallucination. As we know, both the visual encoder and the Large Language Model (LLM) decoder in LVLMs are Transformer-based, allowing the model to extract visual information and generate text outputs via attention mechanisms. We find that the attention distribution of the LLM decoder on image tokens is highly consistent with that of the visual encoder, and both distributions tend to focus on particular background tokens rather than the referred objects in the image. We attribute the unexpected attention distribution to an inherent flaw in the visual encoder itself, which misguides LLMs to overemphasize the redundant information and generate object hallucination. To address the issue, we propose DAMRO, a novel training-free strategy that $D$ives into the $A$ttention $M$echanism of LVLMs to $R$educe $O$bject Hallucination. Specifically, our approach employs the classification token (CLS) of ViT to filter out high-attention outlier tokens scattered in the background and then eliminates their influence during the decoding stage. We evaluate our method on LVLMs including LLaVA-1.5, LLaVA-NeXT and InstructBLIP, using various benchmarks such as POPE, CHAIR, MME and GPT-4V Aided Evaluation. The results demonstrate that our approach significantly reduces the impact of these outlier tokens, thus effectively alleviating the hallucination of LVLMs. The code is released at https://github.com/coder-gx/DAMRO.

Paper and project links

PDF Accepted by EMNLP2024 (Main Conference), add GitHub link

Summary

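This paper examines hallucination in Large Vision-Language Models (LVLMs). The authors find that the LLM decoder's attention distribution over image tokens is highly consistent with that of the visual encoder, and that both tend to focus on particular background tokens rather than the referred objects. They attribute this to an inherent flaw in the visual encoder, which misleads the LLM into overemphasizing redundant information. To address it, they propose DAMRO, a training-free strategy that uses the ViT classification token (CLS) to filter out high-attention outlier tokens in the background and eliminates their influence during decoding. Evaluated on LLaVA-1.5, LLaVA-NeXT and InstructBLIP with benchmarks including POPE, CHAIR, MME and GPT-4V Aided Evaluation, the method is reported to significantly reduce the impact of these outlier tokens and to alleviate hallucination in LVLMs.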
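
The filtering step can be pictured as ranking image tokens by how much attention the CLS token gives them. The sketch below assumes outliers are simply the top-k tokens by CLS attention, and it zeroes and renormalizes a weight vector as a stand-in for removing their influence. The abstract does not describe the actual decoding-stage procedure, so the suppression step, the function names, and the toy numbers are illustrative assumptions only; the released code at https://github.com/coder-gx/DAMRO is the authoritative reference.

```python
def select_outlier_tokens(cls_attention, top_k=2):
    """Indices of the top_k image tokens that receive the most CLS attention.

    Assumption: "high-attention outlier tokens" are approximated here by a
    plain top-k ranking over the CLS-to-image-token attention weights.
    """
    ranked = sorted(range(len(cls_attention)),
                    key=lambda i: cls_attention[i], reverse=True)
    return sorted(ranked[:top_k])


def suppress_tokens(token_weights, outlier_indices):
    """Zero the outlier weights and renormalize the rest to sum to 1.

    This is an illustrative stand-in, not the paper's decoding procedure.
    """
    outliers = set(outlier_indices)
    kept = [0.0 if i in outliers else w for i, w in enumerate(token_weights)]
    total = sum(kept)
    if total == 0:
        return kept  # nothing left to renormalize
    return [w / total for w in kept]


# Toy attention values (hypothetical): tokens 1 and 4 dominate CLS attention.
cls_attention = [0.02, 0.40, 0.03, 0.05, 0.35, 0.15]
decoder_weights = [0.1, 0.3, 0.1, 0.1, 0.3, 0.1]

outliers = select_outlier_tokens(cls_attention, top_k=2)
filtered_weights = suppress_tokens(decoder_weights, outliers)
```

With these toy values, tokens 1 and 4 are selected and their weights drop to zero, so the remaining tokens share the decoder's attention.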

Key Takeaways