
Talking Head Generation


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not rely on these summaries in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-09-18

Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation

Authors: Luca Barsellotti, Lorenzo Bianchi, Nicola Messina, Fabio Carrara, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Rita Cucchiara

Open-Vocabulary Segmentation (OVS) aims at segmenting images from free-form textual concepts without predefined training classes. While existing vision-language models such as CLIP can generate segmentation masks by leveraging coarse spatial information from Vision Transformers, they face challenges in spatial localization due to their global alignment of image and text features. Conversely, self-supervised visual models like DINO excel in fine-grained visual encoding but lack integration with language. To bridge this gap, we present Talk2DINO, a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. Our approach aligns the textual embeddings of CLIP to the patch-level features of DINOv2 through a learned mapping function without the need to fine-tune the underlying backbones. At training time, we exploit the attention maps of DINOv2 to selectively align local visual patches with textual embeddings. We show that the powerful semantic and localization abilities of Talk2DINO can enhance the segmentation process, resulting in more natural and less noisy segmentations, and that our approach can also effectively distinguish foreground objects from the background. Experimental results demonstrate that Talk2DINO achieves state-of-the-art performance across several unsupervised OVS benchmarks. Source code and models are publicly available at: https://lorebianchi98.github.io/Talk2DINO/.
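The abstract outlines the core mechanism: a learned mapping function aligns CLIP's textual embeddings with DINOv2's patch-level features while both backbones stay frozen, and DINOv2's attention maps decide which visual patches each caption is aligned with. The PyTorch-style sketch below is only an illustration of that idea; the module name, layer sizes, and the symmetric contrastive objective are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextToPatchProjector(nn.Module):
    """Hypothetical learned mapping from CLIP text space to DINOv2 patch space.
    Both backbones stay frozen; only this small module is trained."""
    def __init__(self, clip_dim=512, dino_dim=768, hidden_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, dino_dim),
        )

    def forward(self, text_emb):            # (B, clip_dim)
        return self.mlp(text_emb)           # (B, dino_dim)


def attention_weighted_alignment_loss(text_emb, patch_feats, attn_maps, temperature=0.07):
    """Align each projected caption with the attention-selected patches of its
    paired image (illustrative InfoNCE-style objective, assumed rather than
    taken from the paper).

    text_emb:    (B, dino_dim)    projected CLIP text embeddings
    patch_feats: (B, N, dino_dim) frozen DINOv2 patch features
    attn_maps:   (B, N)           DINOv2 [CLS] self-attention over the patches
    """
    # Pool patches with the attention maps so salient regions dominate the image embedding.
    weights = F.softmax(attn_maps, dim=-1).unsqueeze(-1)         # (B, N, 1)
    pooled = (weights * patch_feats).sum(dim=1)                  # (B, dino_dim)

    # Symmetric contrastive loss between captions and attention-pooled visual features.
    text_emb = F.normalize(text_emb, dim=-1)
    pooled = F.normalize(pooled, dim=-1)
    logits = text_emb @ pooled.t() / temperature                 # (B, B)
    labels = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```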


Paper and Project Links

PDF ICCV 2025

Summary: Talk2DINO is a novel hybrid approach that combines the spatial accuracy of DINOv2 with the language understanding of CLIP. It aligns CLIP's textual embeddings to DINOv2's patch-level features through a learned mapping function, without fine-tuning the underlying backbones. During training, it exploits DINOv2's attention maps to selectively align local visual patches with the textual embeddings. This strengthens both semantic understanding and localization, yielding more natural, less noisy segmentations and an effective separation of foreground objects from the background. Experiments show that Talk2DINO achieves state-of-the-art performance across several unsupervised open-vocabulary segmentation benchmarks.

Key Takeaways

  1. Talk2DINO combines the spatial accuracy of DINOv2 with the language understanding of CLIP.
  2. CLIP's textual embeddings are aligned to DINOv2's patch-level features through a learned mapping function, without fine-tuning the underlying backbones.
  3. During training, Talk2DINO exploits DINOv2's attention maps to selectively align local visual patches with the textual embeddings.
  4. Talk2DINO strengthens semantic understanding and localization, improving the segmentation process (see the inference-time sketch after this list).
  5. The method produces more natural, less noisy segmentations and effectively separates foreground objects from the background.
  6. Experimental results show that Talk2DINO achieves state-of-the-art performance across several unsupervised open-vocabulary segmentation benchmarks.
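At inference time, the same alignment is typically used to score every patch against each open-vocabulary concept and to turn the resulting similarity maps into dense masks. The sketch below shows one plausible pipeline under the same assumptions as the training sketch above (hypothetical function and argument names; not the official Talk2DINO code).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def open_vocab_segment(patch_feats, class_text_embs, projector, image_hw, patch_grid):
    """Illustrative open-vocabulary segmentation step (assumed pipeline).

    patch_feats:     (N, dino_dim)  DINOv2 patch features for one image
    class_text_embs: (C, clip_dim)  CLIP text embeddings of the query concepts
    projector:       trained TextToPatchProjector from the sketch above
    image_hw:        (H, W)         output resolution
    patch_grid:      (h, w)         patch-token grid, with h * w == N
    """
    # Map each textual concept into DINOv2 patch space and normalize.
    text_in_patch_space = F.normalize(projector(class_text_embs), dim=-1)  # (C, dino_dim)
    patches = F.normalize(patch_feats, dim=-1)                             # (N, dino_dim)

    # Per-patch similarity to every concept, reshaped onto the patch grid.
    sims = patches @ text_in_patch_space.t()                               # (N, C)
    h, w = patch_grid
    sims = sims.t().reshape(1, -1, h, w)                                   # (1, C, h, w)

    # Upsample to pixel resolution and assign each pixel its best-matching concept.
    sims = F.interpolate(sims, size=image_hw, mode="bilinear", align_corners=False)
    return sims.argmax(dim=1).squeeze(0)                                   # (H, W) label map
```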



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting!