Published: 2025-09-28

Updated: 2025-11-27

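⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading the papers!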

2025-09-28 Update

Large Pre-Trained Models for Bimanual Manipulation in 3D

Authors:Hanna Yurchyk, Wei-Di Chang, Gregory Dudek, David Meger

We investigate the integration of attention maps from a pre-trained Vision Transformer into voxel representations to enhance bimanual robotic manipulation. Specifically, we extract attention maps from DINOv2, a self-supervised ViT model, and interpret them as pixel-level saliency scores over RGB images. These maps are lifted into a 3D voxel grid, resulting in voxel-level semantic cues that are incorporated into a behavior cloning policy. When integrated into a state-of-the-art voxel-based policy, our attention-guided featurization yields an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks in the RLBench bimanual benchmark.

Paper and project links

PDF Accepted to 2025 IEEE-RAS 24th International Conference on Humanoid Robots

Summary

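This paper studies how attention maps from a pre-trained Vision Transformer can be incorporated into voxel representations to improve bimanual robotic manipulation. Attention maps are extracted from DINOv2, a self-supervised ViT, interpreted as pixel-level saliency scores over RGB images, and lifted into a 3D voxel grid, yielding voxel-level semantic cues that feed a behavior cloning policy. On the RLBench bimanual benchmark, the attention-guided featurization gives an average absolute improvement of 8.2% and a relative gain of 21.9% across all tasks.

Key Takeaways

Studies how attention maps from a pre-trained Vision Transformer can be incorporated into robotic manipulation.
Extracts attention maps from DINOv2 and interprets them as pixel-level saliency scores.
Lifts the attention maps into a 3D voxel grid to obtain voxel-level semantic cues.
Feeds these semantic cues into a behavior cloning policy.
Evaluates on the RLBench bimanual benchmark.
The attention-guided featurization yields an average absolute improvement of 8.2% across all tasks.
Relative to the baseline voxel-based policy, the gain is 21.9%.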
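
The lifting step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the availability of a depth map, the camera-frame grid, and the choice to keep the maximum saliency per voxel are all assumptions that the abstract does not confirm.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def lift_saliency_to_voxels(
    saliency: List[List[float]],  # H x W per-pixel attention scores
    depth: List[List[float]],     # H x W depth in metres (<= 0 means no reading)
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
    origin: Tuple[float, float, float],          # minimum corner of the grid
    voxel_size: float,
    grid_dim: int,
) -> Dict[Voxel, float]:
    """Back-project each pixel and keep the max saliency seen per voxel."""
    grid: Dict[Voxel, float] = {}
    for v, row in enumerate(saliency):
        for u, score in enumerate(row):
            z = depth[v][u]
            if z <= 0.0:
                continue  # skip pixels without a valid depth reading
            # Pinhole back-projection into the camera frame.
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            idx = tuple(
                int((p - o) // voxel_size) for p, o in zip((x, y, z), origin)
            )
            # Drop points that fall outside the cubic grid.
            if all(0 <= i < grid_dim for i in idx):
                grid[idx] = max(grid.get(idx, 0.0), score)  # assumed aggregation
    return grid
```

The sparse dictionary could then be written into an extra channel of the voxel features. How the paper merges this channel with the policy's existing voxel features is not stated in the abstract.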

ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression

Authors:Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir

The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance towards texture. Code is available at https://github.com/tomburgert/feature-reliance.

Paper and project links

PDF Accepted at NeurIPS 2025 (oral)

Summary

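This paper asks whether convolutional neural networks (CNNs) are inherently biased towards texture, revisiting the limitations of the cue-conflict experiment by Geirhos et al. To address them, it proposes a domain-agnostic framework that quantifies a model's reliance on shape, texture, and color by systematically suppressing each cue. Under controlled suppression, CNNs turn out not to be inherently texture-biased; they rely predominantly on local shape features. Modern training strategies and architectures (ConvNeXt, ViTs) substantially mitigate this reliance. Reliance patterns also differ systematically across computer vision, medical imaging, and remote sensing.

Key Takeaways

The claim that CNNs are texture-biased needs to be revisited.
A framework is proposed that quantifies reliance on shape, texture, and color features.
Under controlled suppression, CNNs rely mainly on local shape features and are not inherently texture-biased.
Modern training strategies and architectures can substantially mitigate this reliance.
Reliance patterns differ systematically across domains.
Computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models rely more strongly on texture.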
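
To make "cue suppression" concrete, here is a small sketch of two possible suppression transforms. The abstract does not say which transforms the framework uses, so both are stand-ins chosen for illustration: greyscale conversion to suppress color, and patch shuffling to scramble global shape while leaving local structure inside each patch intact. The function names are hypothetical.

```python
import random
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def suppress_color(img: Image) -> Image:
    """Remove color cues by replacing each pixel with its channel mean."""
    return [[(sum(p) // 3,) * 3 for p in row] for row in img]


def suppress_global_shape(img: Image, patch: int, seed: int = 0) -> Image:
    """Scramble global shape by shuffling non-overlapping patch x patch tiles.

    Assumes the height and width are multiples of `patch`.
    """
    h, w = len(img), len(img[0])
    tiles = [(r, c) for r in range(0, h, patch) for c in range(0, w, patch)]
    shuffled = tiles[:]
    random.Random(seed).shuffle(shuffled)
    out: Image = [[(0, 0, 0)] * w for _ in range(h)]
    # Copy each source tile to its new destination position.
    for (dr, dc), (sr, sc) in zip(tiles, shuffled):
        for i in range(patch):
            for j in range(patch):
                out[dr + i][dc + j] = img[sr + i][sc + j]
    return out
```

Reliance on a cue could then be estimated as the accuracy drop when that cue is suppressed, compared with accuracy on the unmodified images. That comparison is a plausible reading of the abstract, not a confirmed description of the paper's metric.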

Algorithms for Adversarially Robust Deep Learning

Authors:Alexander Robey

Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.

Paper and project links

PDF PhD thesis

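Summary

Deep learning models are widely used in safety-critical applications, so the robustness of their decisions is of fundamental importance. This thesis discusses recent progress toward algorithms with desirable robustness properties. It covers adversarial examples in computer vision, introducing new technical results, training paradigms, and certification algorithms; domain generalization, with new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification; and jailbreaking of large language models, proposing new attacks and defenses toward robust language-based agents.

Key Takeaways

Robust decisions are of fundamental importance for deep learning models in safety-critical applications.
For adversarial examples in computer vision, the thesis introduces new technical results, training paradigms, and certification algorithms.
For domain generalization, it presents new algorithms with state-of-the-art results in medical imaging, molecular identification, and image classification.
For jailbreaking of large language models (LLMs), it proposes new attacks and defenses.
The thesis stresses designing algorithms with desirable robustness properties to protect deep learning models against adversarial exploitation.
New techniques and training paradigms are aimed at improving the generalization and robustness of deep learning models.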

No Labels Needed: Zero-Shot Image Classification with Collaborative Self-Learning

Authors:Matheus Vinícius Todescato, Joel Luís Carbonera

While deep learning, including Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), has significantly advanced classification performance, its typical reliance on extensive annotated datasets presents a major obstacle in many practical scenarios where such data is scarce. Vision-language models (VLMs) and transfer learning with pre-trained visual models appear as promising techniques to deal with this problem. This paper proposes a novel zero-shot image classification framework that combines a VLM and a pre-trained visual model within a self-learning cycle. Requiring only the set of class names and no labeled training data, our method utilizes a confidence-based pseudo-labeling strategy to train a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations. These enhanced features then iteratively train the classifier, allowing the system to capture complementary semantic and visual cues without supervision. Notably, our approach avoids VLM fine-tuning and the use of large language models, relying on the visual-only model to reduce the dependence on semantic representation. Experimental evaluations on ten diverse datasets demonstrate that our approach outperforms the baseline zero-shot method.

Paper and project links

PDF This paper was accepted at International Conference on Tools with Artificial Intelligence (ICTAI) 2025

Summary

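This paper proposes a zero-shot image classification framework that combines a vision-language model (VLM) with a pre-trained visual model. It needs only the class names and no labeled training data: a confidence-based pseudo-labeling strategy trains a lightweight classifier directly on the test data, enabling dynamic adaptation. The VLM identifies high-confidence samples and the pre-trained visual model enhances their visual representations. These enhanced features iteratively train the classifier, so the system captures complementary semantic and visual cues without supervision. Experiments on ten diverse datasets show that the approach outperforms the baseline zero-shot method.

Key Takeaways

The paper proposes a zero-shot image classification framework combining a vision-language model (VLM) with a pre-trained visual model.
It needs only the class names and no labeled training data; a confidence-based pseudo-labeling strategy trains the classifier directly on the test data.
The VLM identifies high-confidence samples, and the pre-trained visual model enhances their visual representations.
Iterative training lets the classifier capture complementary semantic and visual cues.
The approach avoids VLM fine-tuning and large language models, reducing dependence on semantic representation.
Experiments on ten diverse datasets show the framework outperforms the baseline zero-shot method.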
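
The self-learning cycle can be sketched roughly as below. This is an illustration under assumptions, not the paper's algorithm: the VLM probabilities and visual features are taken as given inputs, the "lightweight classifier" is replaced by a nearest-centroid classifier, and the `threshold` and `rounds` parameters and the relabel-everything step are hypothetical choices.

```python
from typing import Dict, List


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


def sq_dist(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def self_learn(
    vlm_probs: List[List[float]],     # per-sample class probabilities from a VLM
    visual_feats: List[List[float]],  # per-sample features from a visual-only model
    threshold: float = 0.8,           # hypothetical confidence cut-off
    rounds: int = 5,                  # hypothetical cap on self-learning rounds
) -> List[int]:
    n_classes = len(vlm_probs[0])
    preds = [argmax(p) for p in vlm_probs]
    # Seed the pseudo-labels with the samples the VLM is confident about.
    labels: Dict[int, int] = {
        i: preds[i] for i, p in enumerate(vlm_probs) if max(p) >= threshold
    }
    for _ in range(rounds):
        # "Train" the stand-in classifier: one centroid per pseudo-labelled class.
        centroids: Dict[int, List[float]] = {}
        for c in range(n_classes):
            members = [visual_feats[i] for i, y in labels.items() if y == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if not centroids:
            break  # the VLM was never confident; keep its raw predictions
        new_preds = [
            min(centroids, key=lambda c: sq_dist(f, centroids[c]))
            for f in visual_feats
        ]
        if new_preds == preds:
            break  # predictions stopped changing
        preds = new_preds
        labels = dict(enumerate(preds))  # relabel every sample for the next round
    return preds
```

In this simplified version a class with no confident seed sample never receives a centroid and therefore is never predicted; a fuller implementation would need a fallback for such classes.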

Benchmarking Vision-Language and Multimodal Large Language Models in Zero-shot and Few-shot Scenarios: A study on Christian Iconography

Authors:Gianmarco Spinaci, Lukas Klic, Giovanni Colavizza

This study evaluates the capabilities of Multimodal Large Language Models (LLMs) and Vision Language Models (VLMs) in the task of single-label classification of Christian Iconography. The goal was to assess whether general-purpose VLMs (CLIP and SigLIP) and LLMs, such as GPT-4o and Gemini 2.5, can interpret the Iconography, typically addressed by supervised classifiers, and evaluate their performance. Two research questions guided the analysis: (RQ1) How do multimodal LLMs perform on image classification of Christian saints? and (RQ2) How does performance vary when enriching input with contextual information or few-shot exemplars? We conducted a benchmarking study using three datasets supporting Iconclass natively: ArtDL, ICONCLASS, and Wikidata, filtered to include the top 10 most frequent classes. Models were tested under three conditions: (1) classification using class labels, (2) classification with Iconclass descriptions, and (3) few-shot learning with five exemplars. Results were compared against ResNet50 baselines fine-tuned on the same datasets. The findings show that Gemini-2.5 Pro and GPT-4o outperformed the ResNet50 baselines. Accuracy dropped significantly on the Wikidata dataset, where SigLIP reached the highest accuracy score, suggesting model sensitivity to image size and metadata alignment. Enriching prompts with class descriptions generally improved zero-shot performance, while few-shot learning produced lower results, with only occasional and minimal increments in accuracy. We conclude that general-purpose multimodal LLMs are capable of classification in visually complex cultural heritage domains. These results support the application of LLMs as metadata curation tools in digital humanities workflows, suggesting future research on prompt optimization and the expansion of the study to other classification strategies and models.

Paper and project links

PDF 11 pages, 2 figures

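Summary

This study evaluates multimodal large language models (LLMs) and vision-language models (VLMs) on single-label classification of Christian iconography. It asks whether general-purpose VLMs (CLIP and SigLIP) and LLMs such as GPT-4o and Gemini 2.5 can interpret iconography that is typically handled by supervised classifiers, guided by two questions: (RQ1) how do multimodal LLMs perform on image classification of Christian saints, and (RQ2) how does performance change when the input is enriched with contextual information or few-shot exemplars? The benchmark uses three datasets that support Iconclass natively (ArtDL, ICONCLASS, and Wikidata), filtered to the 10 most frequent classes, under three conditions: class labels, Iconclass descriptions, and few-shot learning with five exemplars. Results are compared with ResNet50 baselines fine-tuned on the same datasets. Gemini-2.5 Pro and GPT-4o outperform the ResNet50 baselines. Accuracy drops significantly on Wikidata, where SigLIP scores highest, suggesting sensitivity to image size and metadata alignment. Class descriptions generally improve zero-shot performance, while few-shot learning gives lower results with only occasional, minimal gains. The authors conclude that general-purpose multimodal LLMs can classify images in visually complex cultural heritage domains, which supports their use as metadata curation tools in digital humanities workflows.

Key Takeaways

Multimodal large language models (LLMs) and vision-language models (VLMs) prove capable of single-label classification of Christian iconography.
The study compares VLMs (CLIP, SigLIP) with LLMs such as GPT-4o and Gemini 2.5 on image classification.
Gemini-2.5 Pro and GPT-4o outperform the fine-tuned ResNet50 baselines.
Accuracy drops significantly on the Wikidata dataset, suggesting sensitivity to image size and metadata alignment.
Enriching prompts with class descriptions generally improves zero-shot performance.
Few-shot learning produces lower results, with only occasional and minimal gains in accuracy.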
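
For the CLIP-style models in the benchmark, zero-shot classification is commonly done by comparing an image embedding with one text embedding per class and picking the most similar class. The sketch below shows that comparison with toy vectors. The embeddings, the placeholder class names, and the idea that conditions (1) and (2) differ only in which text is embedded are assumptions for illustration; the study's actual prompts and pipeline are not given in the abstract.

```python
import math
from typing import Dict, List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(image_emb: List[float], text_embs: Dict[str, List[float]]) -> str:
    """Return the class whose text embedding is most similar to the image."""
    return max(text_embs, key=lambda name: cosine(image_emb, text_embs[name]))


# Toy 3-d vectors standing in for encoder outputs; class names are placeholders.
label_embs = {"class_a": [1.0, 0.0, 0.0], "class_b": [0.0, 1.0, 0.0]}
description_embs = {"class_a": [0.9, 0.1, 0.3], "class_b": [0.1, 0.9, 0.0]}
image_emb = [0.7, 0.2, 0.6]

by_label = zero_shot_classify(image_emb, label_embs)              # condition (1)
by_description = zero_shot_classify(image_emb, description_embs)  # condition (2)
```

Swapping `label_embs` for `description_embs` changes only the text side of the comparison, which is why richer class descriptions can shift zero-shot accuracy without any retraining.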

Visual Instruction Pretraining for Domain-Specific Foundation Models

Authors:Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang

Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.

Paper and project links

PDF

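Summary

Visual insTruction Pretraining (ViTP) is a new pretraining paradigm that uses reasoning to directly enhance perception. It embeds a Vision Transformer (ViT) backbone within a vision-language model and pretrains it end-to-end on a rich corpus of visual instruction data curated from the target downstream domains. ViTP is powered by the proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust, domain-relevant features from a sparse set of visual tokens. On 16 challenging remote sensing and medical imaging benchmarks, ViTP sets new state-of-the-art performance.

Key Takeaways

Modern computer vision is converging on a closed loop in which perception, reasoning, and generation reinforce each other, but the top-down influence of high-level reasoning on learning low-level perceptual features remains underexplored.
Visual insTruction Pretraining (ViTP) addresses this gap by using reasoning to directly enhance perception.
ViTP embeds a Vision Transformer (ViT) backbone within a vision-language model and pretrains it end-to-end.
The pretraining corpus is visual instruction data curated from the target downstream domains.
Visual Robustness Learning (VRL) drives ViTP, compelling the ViT to learn robust, domain-relevant features from a sparse set of visual tokens.
On 16 remote sensing and medical imaging benchmarks, ViTP establishes new state-of-the-art performance.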

ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding

Authors:Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen

Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model’s attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model’s hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.

Paper and project links

PDF NeurIPS 2025

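Summary

Speculative decoding for vision-language models (VLMs) has so far achieved only modest speedups (<1.5x). This paper introduces Vision-Aware Speculative Decoding (ViSpec), a framework tailored to VLMs. A lightweight vision adaptor compresses image tokens into a compact representation that is integrated into the draft model's attention mechanism while preserving the original image positional information. ViSpec also extracts a global feature vector for each input image and augments all subsequent text tokens with it to improve multimodal coherence. To cope with the scarcity of multimodal datasets with long assistant responses, the authors curate a specialized training dataset, and their training strategy mitigates shortcut learning by the draft model. The result is, to the authors' knowledge, the first substantial speedup in VLM speculative decoding. The code is available on GitHub.

Key Takeaways

The paper proposes Vision-Aware Speculative Decoding (ViSpec), an inference acceleration framework for vision-language models (VLMs).
ViSpec compresses image tokens with a lightweight vision adaptor module while preserving the original image positional information.
ViSpec extracts a global feature vector per image and adds it to all subsequent text tokens to improve multimodal coherence.
To address the scarcity of multimodal datasets with long assistant responses, the authors build a training set by repurposing existing datasets and generating extended outputs from the target VLM with modified prompts.
The training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning.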
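
For readers unfamiliar with the underlying technique, the generic draft-then-verify loop of greedy speculative decoding can be sketched as follows. This shows only the general idea that ViSpec builds on; it does not model ViSpec's vision adaptor or global image feature. The toy models, the `k` parameter, and the function names are illustrative assumptions.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the greedy next token


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], max_new: int, k: int = 4
) -> List[int]:
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target model checks every proposed position. A real system does
        #    this in one batched forward pass; a loop is used here for clarity.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                out.extend(proposal[:i])
                out.append(expected)  # keep the target's token at the first mismatch
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    # Agrees with the target except after token 4, where it guesses wrong.
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

With greedy verification the output is identical to decoding with the target model alone; the speedup comes from accepting several drafted tokens per target pass, so it depends on how often the draft model agrees with the target.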

Technical report on label-informed logit redistribution for better domain generalization in low-shot classification with foundation models

Authors:Behraj Khan, Tahir Syed

Confidence calibration is an emerging challenge in real-world decision systems based on foundation models when used for downstream vision classification tasks. For various reasons, logit scores on the CLIP head remain large irrespective of whether the image-language pairs reconcile. This is difficult to address in data space, given the few-shot regime. We propose a penalty incorporated into the loss objective that penalizes incorrect classifications whenever one is made during finetuning, by moving an amount of log-likelihood to the true class commensurate to the relative amplitudes of the two likelihoods. We refer to it as the confidence misalignment penalty (CMP). Extensive experiments on 12 vision datasets and 5 domain generalization datasets support the calibration performance of our method against the state of the art. CMP outperforms the benchmarked prompt learning methods, improving Expected Calibration Error (ECE) by 6.01% on average, 4.01% at minimum, and 9.72% at maximum.

Paper and project links

PDF

Summary

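Confidence calibration is an emerging challenge when foundation models are used for downstream vision classification. Logit scores on the CLIP head stay large whether or not the image-language pair matches, and the few-shot regime makes this hard to fix in data space. The paper proposes a penalty in the loss objective: whenever a misclassification occurs during finetuning, an amount of log-likelihood is moved to the true class, commensurate with the relative amplitudes of the two likelihoods. This is called the confidence misalignment penalty (CMP). Experiments on 12 vision datasets and 5 domain generalization datasets support its calibration performance against the state of the art. CMP outperforms the benchmarked prompt learning methods, improving Expected Calibration Error (ECE) by 6.01% on average, 4.01% at minimum, and 9.72% at maximum.

Key Takeaways

Confidence calibration is an emerging challenge for foundation models in vision classification tasks.
Logit scores on the CLIP head remain large even when the image-language pair does not match.
The confidence misalignment penalty (CMP) is proposed to address this.
CMP is a penalty term in the loss objective that penalizes incorrect classifications during finetuning.
It works by moving an amount of log-likelihood to the true class.
On 12 vision datasets and 5 domain generalization datasets, CMP outperforms the benchmarked prompt learning methods on calibration.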
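
The reported metric, Expected Calibration Error (ECE), can be computed as in the sketch below. It follows the common equal-width binning definition; the number of bins and the half-open bin boundaries are assumptions, since the abstract does not state the paper's evaluation settings. The CMP loss itself is not reproduced here because its formula is not given in the abstract.

```python
from typing import List, Tuple


def expected_calibration_error(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> float:
    """ECE = sum over bins of (bin_size / N) * |accuracy(bin) - mean_confidence(bin)|."""
    n = len(confidences)
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        bins[b].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four predictions but gets only three right has a gap of 0.15 between confidence and accuracy in that bin; an overconfident CLIP head of the kind the paper describes would show such gaps in its high-confidence bins.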

Vim-F: Visual State Space Model Benefiting from Learning in the Frequency Domain

Authors:Juntao Zhang, Shaogeng Liu, Jun Zhou, Kun Bian, You Zhou, Jianning Liu, Pei Zhang, Bingyan Liu

In recent years, State Space Models (SSMs) with efficient hardware-aware designs, known as the Mamba deep learning models, have made significant progress in modeling long sequences such as language understanding. Therefore, building efficient and general-purpose visual backbones based on SSMs is a promising direction. Compared to traditional convolutional neural networks (CNNs) and Vision Transformers (ViTs), the performance of Vision Mamba (ViM) methods is not yet fully competitive. To enable SSMs to process image data, ViMs typically flatten 2D images into 1D sequences, inevitably ignoring some 2D local dependencies, thereby weakening the model’s ability to interpret spatial relationships from a global perspective. We use Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and add it to the original feature map, enabling ViM to model a unified visual representation in both frequency and spatial domains. The introduction of frequency domain information enables ViM to have a global receptive field during scanning. We propose a novel model called Vim-F, which employs pure Mamba encoders and scans in both the frequency and spatial domains. Moreover, we question the necessity of position embedding in ViM and remove it accordingly in Vim-F, which helps to fully utilize the efficient long-sequence modeling capability of ViM. Finally, we redesign a patch embedding for Vim-F, leveraging a convolutional stem to capture more local correlations, further improving the performance of Vim-F. Code is available at: https://github.com/yws-wxs/Vim-F.

Paper and project links

PDF

Summary

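State Space Models (SSMs) with efficient hardware-aware designs, known as Mamba models, have made significant progress in modeling long sequences, which makes SSM-based general-purpose visual backbones a promising direction. Vision Mamba (ViM) methods are not yet fully competitive with CNNs and Vision Transformers (ViTs): flattening 2D images into 1D sequences ignores some 2D local dependencies and weakens the model's ability to interpret spatial relationships globally. The paper uses the Fast Fourier Transform (FFT) to obtain the spectrum of the feature map and adds it to the original feature map, giving ViM a global receptive field during scanning. The resulting model, Vim-F, uses pure Mamba encoders and scans in both the frequency and spatial domains. It also removes position embedding, questioning its necessity in ViM, and redesigns the patch embedding with a convolutional stem to capture more local correlations. The code is publicly available.

Key Takeaways

State Space Models (SSMs), in the form of Mamba models, have made significant progress in modeling long sequences.
Building efficient, general-purpose visual backbones on SSMs is a promising direction.
Vision Mamba (ViM) methods flatten 2D images into 1D sequences and are not yet fully competitive with CNNs and ViTs.
Adding the feature map's spectrum to the feature map strengthens ViM's global interpretation of spatial relationships.
Vim-F uses pure Mamba encoders and scans in both the frequency and spatial domains.
Vim-F redesigns the patch embedding with a convolutional stem and removes position embedding to improve performance.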
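
The "add the spectrum to the feature map" step can be illustrated with a tiny single-channel example. This is a sketch under assumptions: the abstract does not say how the complex spectrum is converted to real values or how it is scaled, so the code uses the normalised magnitude, and it uses a naive discrete Fourier transform from the standard library in place of a real FFT.

```python
import cmath
from typing import List

Matrix = List[List[float]]


def dft2_magnitude(x: Matrix) -> Matrix:
    """Naive O((HW)^2) 2-D DFT magnitude, divided by H*W (assumed normalisation)."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for u in range(h):
        for v in range(w):
            acc = 0j
            for m in range(h):
                for n in range(w):
                    angle = -2j * cmath.pi * (u * m / h + v * n / w)
                    acc += x[m][n] * cmath.exp(angle)
            out[u][v] = abs(acc) / (h * w)
    return out


def fuse_frequency(x: Matrix) -> Matrix:
    """Add the magnitude spectrum element-wise to the original feature map."""
    spectrum = dft2_magnitude(x)
    return [[a + s for a, s in zip(row, srow)] for row, srow in zip(x, spectrum)]
```

Every spectrum entry is a weighted sum over all spatial positions, which is why adding it gives each position access to global information. In practice a library FFT such as `numpy.fft.fft2` would replace the quadruple loop.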

CLIP Can Understand Depth

Authors:Sohee Kim, Jisu Kang, Dunam Kim, Seokju Lee

In this paper, we demonstrate that CLIP can also be adapted to downstream tasks where its vision-language alignment is suboptimally learned during pre-training on web-crawled data, all without requiring fine-tuning. We explore the case of monocular depth estimation, where CLIP’s contrastive prior struggles to generalize, compared to its success in domains such as generative modeling and semantic segmentation. Since CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, we eliminate the use of its pre-trained natural language token embeddings and distill the semantic prior of its frozen text encoder into a single learnable embedding matrix called “mirror”. The main design goal of mirror is to derive a non-human language prompt that approximates an optimal natural language prompt: “How far is this location from the camera?” Using this approach, we jointly train two lightweight modules, a mirror and a compact decoder, on top of a frozen CLIP for dense depth prediction. Compared to conventional depth models, our framework is significantly more efficient in terms of parameters and computation. The resulting model exhibits impressive performance, matching several state-of-the-art vision models on the NYU Depth v2 and KITTI benchmark datasets, while outperforming all vision-language depth models based on a frozen CLIP prior. Experiments demonstrate that the suboptimal depth understanding of CLIP in terms of spatial and temporal consistency can be significantly corrected without either fine-tuning it or concatenating mirror with its pre-trained subword token embeddings. Furthermore, an ablation study on the convergence status of mirror shows that it is implicitly trained to capture objects, such as humans and windows, where semantic cues play an important role in detection.

Paper and project links

PDF Accepted in Pattern Recognition, 2025

Summary

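This paper explores adapting CLIP to monocular depth estimation, a downstream task where its vision-language alignment was learned suboptimally during pre-training. Because CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, the authors drop its pre-trained natural language token embeddings and distill the semantic prior of the frozen text encoder into a single learnable embedding matrix called "mirror". The mirror and a compact decoder are trained jointly on top of a frozen CLIP for dense depth prediction. The resulting model matches several state-of-the-art vision models on NYU Depth v2 and KITTI and outperforms all vision-language depth models based on a frozen CLIP prior. The experiments also show that CLIP's suboptimal depth understanding can be corrected significantly without fine-tuning it or concatenating mirror with its pre-trained subword token embeddings. An ablation on mirror's convergence shows that it implicitly learns to capture objects such as humans and windows, where semantic cues play an important role in detection.

Key Takeaways

The main points of the paper are as follows:

CLIP can be adapted to downstream tasks even where its vision-language alignment was learned suboptimally during pre-training on web-crawled data.
In monocular depth estimation, CLIP's contrastive prior struggles to generalize, in contrast to its success in generative modeling and semantic segmentation.
Because CLIP fails to consistently capture similarities between image patches and natural language prompts describing distance, the authors distill the text encoder's semantic prior into a learnable embedding matrix called "mirror".
The mirror and a compact decoder are trained jointly on top of a frozen CLIP for dense depth prediction.
The method outperforms all vision-language depth models based on a frozen CLIP prior and matches several state-of-the-art vision models on NYU Depth v2 and KITTI.
CLIP's depth understanding can be improved without fine-tuning it or concatenating mirror with its pre-trained subword token embeddings.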

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-28/Vision%20Transformer/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!

Vision Transformer
