⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not rely on these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-24
Degradation-Aware All-in-One Image Restoration via Latent Prior Encoding
Authors:S M A Sharif, Abdur Rehman, Fayaz Ali Dharejo, Radu Timofte, Rizwan Ali Naqvi
Real-world images often suffer from spatially diverse degradations such as haze, rain, snow, and low-light, significantly impacting visual quality and downstream vision tasks. Existing all-in-one restoration (AIR) approaches either depend on external text prompts or embed hand-crafted architectural priors (e.g., frequency heuristics); both impose discrete, brittle assumptions that weaken generalization to unseen or mixed degradations. To address this limitation, we propose to reframe AIR as learned latent prior inference, where degradation-aware representations are automatically inferred from the input without explicit task cues. Based on latent priors, we formulate AIR as a structured reasoning paradigm: (1) which features to route (adaptive feature selection), (2) where to restore (spatial localization), and (3) what to restore (degradation semantics). We design a lightweight decoding module that efficiently leverages these latent encoded cues for spatially-adaptive restoration. Extensive experiments across six common degradation tasks, five compound settings, and previously unseen degradations demonstrate that our method outperforms state-of-the-art (SOTA) approaches, achieving an average PSNR improvement of 1.68 dB while being three times more efficient.
Paper and project links
Summary
Spatially diverse real-world degradations such as haze, rain, snow, and low light significantly affect image quality and downstream vision tasks. Existing all-in-one restoration methods rely on external text prompts or hand-crafted architectural priors and therefore generalize poorly. The authors reframe the problem as latent prior inference, automatically inferring degradation-aware representations from the input and formulating restoration as a structured reasoning paradigm covering feature routing, spatial localization, and degradation semantics. A lightweight decoding module efficiently exploits these latent cues for spatially adaptive restoration. Experiments on six common degradation tasks, five compound settings, and previously unseen degradations show that the method outperforms existing approaches, improving average PSNR by 1.68 dB while being three times more efficient.
Key Takeaways
- Real-world images frequently suffer from multiple, spatially diverse degradations.
- Existing all-in-one image restoration methods generalize poorly.
- The paper proposes latent prior inference, automatically inferring degradation-aware representations from the input.
- Restoration is formulated as a structured reasoning paradigm covering feature routing, spatial localization, and degradation semantics.
- A lightweight decoding module uses the latent cues for spatially adaptive restoration.
- Experiments show the method outperforms existing approaches across a range of degradation tasks.
- The method improves restoration quality while also improving efficiency.
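To make the three-way decomposition above concrete, here is a minimal PyTorch-style sketch of a decoder modulated by a latent prior: a channel gate for which features to route, a spatial mask for where to restore, and a soft degradation code for what to restore. All module names, shapes, and the overall layout are illustrative assumptions, not the authors' architecture.

```python
# Illustrative sketch only: a latent prior z yields three cues that modulate decoding.
import torch
import torch.nn as nn

class LatentPriorDecoder(nn.Module):
    def __init__(self, channels=64, latent_dim=128, num_semantics=16):
        super().__init__()
        self.prior_enc = nn.Sequential(                       # infer a latent prior from the degraded input
            nn.Conv2d(3, channels, 3, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, latent_dim))
        self.route = nn.Linear(latent_dim, channels)          # which features to route (channel gate)
        self.where = nn.Conv2d(channels, 1, 1)                # where to restore (spatial mask)
        self.what = nn.Linear(latent_dim, num_semantics)      # what to restore (degradation semantics)
        self.body = nn.Conv2d(3, channels, 3, padding=1)
        self.head = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, x):
        z = self.prior_enc(x)                                 # latent prior, no explicit task prompt
        gate = torch.sigmoid(self.route(z))[:, :, None, None] # per-channel routing weights
        feat = self.body(x) * gate
        mask = torch.sigmoid(self.where(feat))                # spatial localization of degradation
        sem = self.what(z).softmax(dim=-1)                    # soft degradation-type code
        return x + self.head(feat * mask), sem

x = torch.randn(1, 3, 64, 64)
restored, sem = LatentPriorDecoder()(x)
print(restored.shape, sem.shape)
```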
Click here to view paper screenshots



Dual-View Alignment Learning with Hierarchical-Prompt for Class-Imbalance Multi-Label Classification
Authors:Sheng Huang, Jiexuan Yan, Beiyan Liu, Bo Liu, Richang Hong
Real-world datasets often exhibit class imbalance across multiple categories, manifesting as long-tailed distributions and few-shot scenarios. This is especially challenging in Class-Imbalanced Multi-Label Image Classification (CI-MLIC) tasks, where data imbalance and multi-object recognition present significant obstacles. To address these challenges, we propose a novel method termed Dual-View Alignment Learning with Hierarchical Prompt (HP-DVAL), which leverages multi-modal knowledge from vision-language pretrained (VLP) models to mitigate the class-imbalance problem in multi-label settings. Specifically, HP-DVAL employs dual-view alignment learning to transfer the powerful feature representation capabilities from VLP models by extracting complementary features for accurate image-text alignment. To better adapt VLP models for CI-MLIC tasks, we introduce a hierarchical prompt-tuning strategy that utilizes global and local prompts to learn task-specific and context-related prior knowledge. Additionally, we design a semantic consistency loss during prompt tuning to prevent learned prompts from deviating from general knowledge embedded in VLP models. The effectiveness of our approach is validated on two CI-MLIC benchmarks: MS-COCO and VOC2007. Extensive experimental results demonstrate the superiority of our method over SOTA approaches, achieving mAP improvements of 10.0% and 5.2% on the long-tailed multi-label image classification task, and 6.8% and 2.9% on the multi-label few-shot image classification task.
Paper and project links
PDF accepted by IEEE Transactions on Image Processing
Summary
This work addresses class-imbalanced multi-label image classification (CI-MLIC) and introduces HP-DVAL, a dual-view alignment learning method with hierarchical prompts. It transfers multi-modal knowledge from vision-language pretrained (VLP) models, extracting complementary features through dual-view alignment learning for accurate image-text alignment. To adapt VLP models to CI-MLIC, a hierarchical prompt-tuning strategy uses global and local prompts to learn task-specific and context-related knowledge. Experiments on the MS-COCO and VOC2007 CI-MLIC benchmarks show the method outperforms existing approaches, improving mAP by 10.0% and 5.2% on long-tailed multi-label classification and by 6.8% and 2.9% on multi-label few-shot classification.
Key Takeaways
- Data imbalance and multi-object recognition are major challenges in CI-MLIC tasks.
- HP-DVAL, a dual-view alignment learning method, is proposed to tackle these challenges with vision-language pretrained models.
- HP-DVAL extracts complementary features through dual-view alignment learning for accurate image-text alignment.
- A hierarchical prompt-tuning strategy uses global and local prompts to learn task-specific and context-related knowledge.
- A semantic consistency loss prevents the learned prompts from drifting away from the general knowledge embedded in the pretrained model.
- Experiments on the MS-COCO and VOC2007 benchmarks show the method clearly outperforms existing approaches.
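As an illustration of the semantic consistency idea, the sketch below keeps tuned prompt text features close to the frozen VLP text features with a cosine penalty. This is an assumed form of the loss, not necessarily the paper's exact formulation; the shapes and class count are placeholders.

```python
# Sketch of a semantic-consistency regularizer (assumed form): keep tuned prompt text
# features close to the frozen VLP text features for each class.
import torch
import torch.nn.functional as F

def semantic_consistency_loss(tuned_text_feats, frozen_text_feats):
    """Both tensors: (num_classes, dim); returns mean (1 - cosine similarity)."""
    tuned = F.normalize(tuned_text_feats, dim=-1)
    frozen = F.normalize(frozen_text_feats, dim=-1)
    return (1.0 - (tuned * frozen).sum(dim=-1)).mean()

num_classes, dim = 80, 512                       # e.g., MS-COCO classes, CLIP embedding dim
tuned = torch.randn(num_classes, dim, requires_grad=True)
frozen = torch.randn(num_classes, dim)           # from the frozen VLP text encoder
loss = semantic_consistency_loss(tuned, frozen)
loss.backward()
print(float(loss))
```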
Click here to view paper screenshots



Visual Instruction Pretraining for Domain-Specific Foundation Models
Authors:Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features remains underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at github.com/zcablii/ViTP.
Paper and project links
Summary
ViTP offers a new paradigm for pretraining Vision Transformer foundation models in downstream domains. Visual Instruction Pretraining (ViTP) directly uses reasoning to enhance perception: a ViT backbone is embedded in a vision-language model and pretrained end-to-end on a large corpus of visual instruction data curated from target downstream domains. The proposed Visual Robustness Learning (VRL) drives the ViT to learn robust, domain-relevant features from a sparse set of visual tokens. The approach performs strongly on remote sensing and medical imaging benchmarks, and the code has been released on GitHub.
Key Takeaways
- Modern computer vision is converging on a closed loop in which perception, reasoning, and generation reinforce each other, but the influence of high-level reasoning on the learning of foundational perceptual features remains underexplored.
- ViTP is proposed to fill this gap by using reasoning to enhance perception.
- ViTP embeds a Vision Transformer (ViT) backbone within a vision-language model for pretraining in downstream domains.
- ViTP pretrains the model on rich visual instruction data collected from target downstream domains.
- The proposed VRL drives the ViT to learn robust, domain-relevant features from sparse visual tokens.
- ViTP shows excellent performance across multiple benchmarks, particularly in remote sensing and medical imaging.
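The sketch below illustrates what "learning from a sparse set of visual tokens" could look like at the token level: a random subset of ViT patch tokens is kept and the rest are dropped. The selection rule is an assumption for illustration; VRL's actual mechanism is not reproduced here.

```python
# Sketch: keep only a sparse random subset of ViT patch tokens, in the spirit of
# "learning from a sparse set of visual tokens" (the actual VRL selection rule is not shown).
import torch

def sparse_token_subset(patch_tokens, keep_ratio=0.25):
    """patch_tokens: (batch, num_patches, dim) -> (batch, kept, dim)."""
    b, n, d = patch_tokens.shape
    k = max(1, int(n * keep_ratio))
    idx = torch.rand(b, n).argsort(dim=1)[:, :k]          # random permutation, take the first k
    return patch_tokens.gather(1, idx.unsqueeze(-1).expand(b, k, d))

tokens = torch.randn(2, 196, 768)                         # e.g., 14x14 patches from a ViT-B
print(sparse_token_subset(tokens).shape)                  # torch.Size([2, 49, 768])
```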
Click here to view paper screenshots




Informative Text-Image Alignment for Visual Affordance Learning with Foundation Models
Authors:Qian Zhang, Lin Zhang, Xing Fang, Mingxin Zhang, Zhiyuan Wei, Ran Song, Wei Zhang
Visual affordance learning is crucial for robots to understand and interact effectively with the physical world. Recent advances in this field attempt to leverage pre-trained knowledge of vision-language foundation models to learn affordance properties with limited training data, providing a novel paradigm for visual affordance learning. However, these methods overlook the significance of maintaining feature alignment between visual images and language descriptions for identifying affordance areas with textual guidance, and thus may lead to suboptimal results. In this paper, we present an informative framework for text-guided affordance learning, which involves information-based constraints to achieve text-image alignment at feature level. Specifically, we design an affordance mutual information constraint that helps learn appropriate textual prompts and task-oriented visual features simultaneously by maximizing the mutual information between the features of the affordance areas in the input images and the corresponding textual prompts. In addition, we propose an object-level information constraint that maximizes the mutual information between the visual features of a given object and the text features of the category it belongs to. This enables the model to capture high-quality representations for the object, providing more reliable semantic priors for identifying affordance regions. Experimental results on the AGD20K dataset show that the proposed method outperforms existing approaches and achieves the new state-of-the-art in one-shot affordance learning.
Paper and project links
PDF Submitted to the IEEE International Conference on Robotics and Automation (ICRA) 2026
Summary
The paper presents a text-guided visual affordance learning framework that uses information-based constraints to align text and images at the feature level. An affordance mutual-information constraint maximizes the mutual information between affordance-region features in the input image and the corresponding textual prompts, jointly learning suitable prompts and task-oriented visual features. An object-level information constraint further maximizes the mutual information between an object's visual features and the text features of its category, providing reliable semantic priors for identifying affordance regions. Experiments on the AGD20K dataset show the method outperforms existing approaches and sets a new state of the art in one-shot affordance learning.
Key Takeaways
- Visual affordance learning is crucial for robots to understand and interact effectively with the physical world.
- Recent work leverages pretrained knowledge in vision-language foundation models to learn affordance properties with limited training data.
- Existing methods overlook the importance of maintaining feature alignment between visual images and language descriptions when identifying affordance regions.
- The paper proposes a text-guided affordance learning framework that aligns text and images at the feature level through information-based constraints.
- An affordance mutual-information constraint maximizes the mutual information between affordance-region features and textual prompts while jointly learning prompts and visual features.
- An object-level information constraint captures high-quality object representations, providing more reliable semantic priors for identifying affordance regions.
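A common way to maximize a mutual-information lower bound between paired features is an InfoNCE-style contrastive loss; the sketch below applies that estimator to affordance-region features and their text-prompt features. It is a generic illustration and may differ from the paper's exact constraints.

```python
# Sketch: an InfoNCE-style lower bound on mutual information between affordance-region
# features and their text-prompt features (a common estimator, not the paper's exact form).
import torch
import torch.nn.functional as F

def info_nce(region_feats, text_feats, temperature=0.07):
    """Both (batch, dim); matching rows are treated as positive pairs."""
    r = F.normalize(region_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    logits = r @ t.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(r.size(0))
    return F.cross_entropy(logits, labels)              # maximizing MI ~= minimizing this loss

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))
```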
Click here to view paper screenshots






When Color-Space Decoupling Meets Diffusion for Adverse-Weather Image Restoration
Authors:Wenxuan Fang, Jili Fan, Chao Wang, Xiantao Hu, Jiangwei Weng, Ying Tai, Jian Yang, Jun Li
Adverse Weather Image Restoration (AWIR) is a highly challenging task due to the unpredictable and dynamic nature of weather-related degradations. Traditional task-specific methods often fail to generalize to unseen or complex degradation types, while recent prompt-learning approaches depend heavily on the degradation estimation capabilities of vision-language models, resulting in inconsistent restorations. In this paper, we propose LCDiff, a novel framework comprising two key components: Lumina-Chroma Decomposition Network (LCDN) and Lumina-Guided Diffusion Model (LGDM). LCDN processes degraded images in the YCbCr color space, separately handling degradation-related luminance and degradation-invariant chrominance components. This decomposition effectively mitigates weather-induced degradation while preserving color fidelity. To further enhance restoration quality, LGDM leverages degradation-related luminance information as a guiding condition, eliminating the need for explicit degradation prompts. Additionally, LGDM incorporates a Dynamic Time Step Loss to optimize the denoising network, ensuring a balanced recovery of both low- and high-frequency features in the image. Finally, we present DriveWeather, a comprehensive all-weather driving dataset designed to enable robust evaluation. Extensive experiments demonstrate that our approach surpasses state-of-the-art methods, setting a new benchmark in AWIR. The dataset and code are available at: https://github.com/fiwy0527/LCDiff.
Paper and project links
Summary
The paper proposes LCDiff, a new framework for adverse-weather image restoration (AWIR) with two key components: a Lumina-Chroma Decomposition Network (LCDN) and a Lumina-Guided Diffusion Model (LGDM). LCDN processes degraded images in the YCbCr color space, handling the degradation-related luminance and degradation-invariant chrominance separately, which mitigates weather-induced degradation while preserving color fidelity. LGDM uses the degradation-related luminance as a guiding condition to further improve restoration quality, and a Dynamic Time Step Loss optimizes the denoising network so that both low- and high-frequency content is recovered in a balanced way. The authors also release DriveWeather, an all-weather driving dataset for robust evaluation. Experiments show the method surpasses the state of the art and sets a new benchmark for AWIR.
Key Takeaways
- The LCDiff framework is proposed for adverse-weather image restoration (AWIR).
- LCDiff has two key components: LCDN and LGDM.
- LCDN processes images in the YCbCr color space, separating luminance and chrominance to improve restoration.
- LGDM uses luminance information as a guiding condition, and a dynamic time-step loss optimizes the denoising network.
- The DriveWeather dataset provides a robust evaluation platform for AWIR.
- Experiments show LCDiff outperforms existing methods.
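The luminance-chrominance split itself is standard; the sketch below converts an RGB batch to Y, Cb, and Cr with the BT.601 coefficients so that a restoration network could act mainly on the degradation-related luminance channel. The placeholder "restoration" step is purely illustrative.

```python
# Sketch: split an RGB image into luminance (Y) and chrominance (Cb, Cr) with the
# standard BT.601 conversion, so a restoration network can act mainly on Y.
import torch

def rgb_to_ycbcr(rgb):
    """rgb: (batch, 3, H, W) in [0, 1] -> (y, cb, cr), each (batch, 1, H, W)."""
    r, g, b = rgb[:, 0:1], rgb[:, 1:2], rgb[:, 2:3]
    y  = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 0.5
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 0.5
    return y, cb, cr

img = torch.rand(1, 3, 32, 32)
y, cb, cr = rgb_to_ycbcr(img)
restored_y = y  # placeholder: a luminance-only restoration network would be applied here
print(y.shape, cb.shape, cr.shape)
```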
Click here to view paper screenshots





V-CECE: Visual Counterfactual Explanations via Conceptual Edits
Authors:Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou
Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing both Convolutional Neural Network (CNN), Vision Transformer (ViT) and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.
Paper and project links
PDF Accepted in NeurIPS 2025
Summary
The paper proposes a novel, training-free, plug-and-play black-box counterfactual generation framework. Based on theoretical guarantees of optimal edits, it suggests step-by-step edits using a pre-trained image-editing diffusion model and produces human-level counterfactual explanations without access to the classifier's internals. The framework highlights the explanatory gap between human reasoning and neural model behavior, substantiated by a comprehensive human evaluation with CNN, Vision Transformer, and large vision-language model (LVLM) classifiers.
Key Takeaways
- Existing black-box counterfactual generation frameworks ignore the semantic content of the proposed edits and rely heavily on training to guide generation.
- A novel zero-training black-box counterfactual generation framework is proposed that suggests edits with theoretical guarantees of optimality, producing human-level counterfactual explanations.
- It operates with a pre-trained image-editing diffusion model and needs no access to the classifier's internals.
- The framework exposes the explanatory gap between human reasoning and neural model behavior.
- Its effectiveness is verified through a comprehensive human evaluation with CNN, Vision Transformer, and LVLM classifiers.
- The framework is plug-and-play and can be integrated easily into other systems.
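A minimal sketch of the black-box search pattern described above: apply conceptual edits one at a time and stop when the classifier's prediction flips. The classifier and the `apply_edit` editor here are toy stand-ins (the real system would call a pre-trained image-editing diffusion model), and the loop is not the paper's exact procedure.

```python
# Sketch of a black-box counterfactual search: apply conceptual edits one at a time until
# the classifier's prediction flips. `apply_edit` is a hypothetical stand-in for a
# pre-trained image-editing diffusion model.
import torch

def find_counterfactual(image, classifier, candidate_edits, apply_edit, target_class):
    """classifier: callable image -> logits; candidate_edits: ordered conceptual edits."""
    current, applied = image, []
    for edit in candidate_edits:
        current = apply_edit(current, edit)              # black-box edit, no classifier gradients used
        applied.append(edit)
        if classifier(current).argmax(dim=-1).item() == target_class:
            return current, applied                      # sequence of edits that flips the prediction
    return None, applied                                 # no counterfactual found with these edits

# Toy usage with stand-ins for the classifier and the diffusion editor.
classifier = lambda x: torch.tensor([[float(x.mean()), 1.0 - float(x.mean())]])
apply_edit = lambda x, e: torch.clamp(x + 0.25, 0.0, 1.0)   # hypothetical "edit"
cf, edits = find_counterfactual(torch.zeros(1, 3, 8, 8), classifier,
                                ["add snow", "brighten sky", "remove shadow"], apply_edit, 0)
print(edits, cf is not None)
```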
Click here to view paper screenshots


Which Direction to Choose? An Analysis on the Representation Power of Self-Supervised ViTs in Downstream Tasks
Authors:Yannis Kaltampanidis, Alexandros Doumanoglou, Dimitrios Zarpalas
Self-Supervised Learning (SSL) for Vision Transformers (ViTs) has recently demonstrated considerable potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, both in standard and few-shot downstream contexts. Two pre-training objectives dominate the landscape of SSL techniques: Contrastive Learning and Masked Image Modeling. Features (or tokens) extracted from the final transformer attention block – specifically, the keys, queries, and values – as well as features obtained after the final block’s feed-forward layer, have become a common foundation for addressing downstream tasks. However, in many existing approaches, these pre-trained ViT features are further processed through additional transformation layers, often involving lightweight heads or combined with distillation, to achieve superior task performance. Although such methods can improve task outcomes, to the best of our knowledge, a comprehensive analysis of the intrinsic representation capabilities of unaltered ViT features has yet to be conducted. This study aims to bridge this gap by systematically evaluating the use of these unmodified features across image classification and segmentation tasks, in both standard and few-shot contexts. The classification and segmentation rules that we use are either hyperplane based (as in logistic regression) or cosine-similarity based, both of which rely on the presence of interpretable directions in the ViT’s latent space. Based on the previous rules and without the use of additional feature transformations, we conduct an analysis across token types, tasks, and pre-trained ViT models. This study provides insights into the optimal choice for token type and decision rule based on the task, context, and the pre-training objective, while reporting detailed findings on two widely-used datasets.
Paper and project links
PDF 24 pages, XAI 2025
Summary
Self-supervised learning (SSL) for Vision Transformers has shown strong potential as a pre-training strategy for a variety of computer vision tasks, including image classification and segmentation, in both standard and few-shot downstream settings. The dominant SSL pre-training objectives are contrastive learning and masked image modeling. This study systematically evaluates the use of unmodified ViT features for image classification and segmentation, analyzing token types, tasks, and pre-trained ViT models without any additional feature transformations, and examining how the optimal choice of token type and decision rule depends on the task, context, and pre-training objective. Detailed findings are reported on two widely used datasets, shedding light on the intrinsic representation capabilities of ViTs and offering useful guidance for future research.
Key Takeaways
- Self-supervised learning (SSL) has become an effective pre-training strategy for Vision Transformers (ViTs) across many computer vision tasks.
- The dominant SSL pre-training objectives are contrastive learning and masked image modeling.
- Unmodified ViT features are promising for image classification and segmentation tasks.
- The study systematically evaluates choices of token type, task, and pre-trained model.
- The analysis shows that the best token type and decision rule should be chosen according to the task, context, and pre-training objective.
- Detailed findings on two widely used datasets provide valuable information for understanding the intrinsic representation capabilities of ViTs.
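The cosine-similarity decision rule mentioned above can be as simple as a nearest class-mean classifier on frozen features; the sketch below shows that rule with random stand-in features. Dimensions and data are placeholders, not results from the paper.

```python
# Sketch: a cosine-similarity decision rule on frozen ViT features - classify by the
# nearest class-mean direction. Features here are random stand-ins for extracted tokens.
import torch
import torch.nn.functional as F

def fit_class_means(feats, labels, num_classes):
    """feats: (N, dim); returns L2-normalized per-class mean directions (num_classes, dim)."""
    means = torch.stack([feats[labels == c].mean(dim=0) for c in range(num_classes)])
    return F.normalize(means, dim=-1)

def cosine_predict(feats, class_means):
    return (F.normalize(feats, dim=-1) @ class_means.T).argmax(dim=-1)

num_classes, dim = 5, 384                               # e.g., ViT-S key/query/value token dim
train_x, train_y = torch.randn(100, dim), torch.randint(0, num_classes, (100,))
test_x = torch.randn(10, dim)
means = fit_class_means(train_x, train_y, num_classes)
print(cosine_predict(test_x, means))
```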
Click here to view paper screenshots


ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Authors:Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups (<1.5x). This gap is increasingly significant as multimodal capabilities become central to large-scale models. We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model’s attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence. To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model’s hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding. Code is available at https://github.com/KangJialiang/ViSpec.
Paper and project links
PDF NeurIPS 2025
Summary
Speculative decoding for vision-language models (VLMs) remains underexplored, with existing methods giving only modest speedups (below 1.5x). The paper proposes ViSpec (Vision-Aware Speculative Decoding), a framework tailored to VLMs. A lightweight vision adaptor compresses image tokens into a compact representation that is integrated into the draft model's attention mechanism while preserving the original image positional information. In addition, a global feature vector extracted from each input image is added to all subsequent text tokens to strengthen multimodal coherence. Experiments validate ViSpec, achieving the first substantial speedup for VLM speculative decoding.
Key Takeaways
- Speculative decoding for vision-language models (VLMs) is underexplored, and existing methods achieve only limited speedups.
- ViSpec (Vision-Aware Speculative Decoding), a framework tailored to VLMs, is proposed.
- ViSpec uses a lightweight vision adaptor module to compress image tokens and accelerate inference.
- ViSpec preserves the original image positional information and integrates the compressed tokens into the draft model's attention mechanism.
- A global feature vector extracted from each input image is added to the text tokens to enhance multimodal coherence.
- A specialized training dataset mitigates the risk of the draft model exploiting the target model's hidden states, which would otherwise lead to shortcut learning.
- Experiments validate ViSpec, achieving the first substantial speedup for VLM speculative decoding.
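One plausible way to compress image tokens into a compact representation is cross-attention from a few learned query slots, sketched below. This is an assumed design for illustration; ViSpec's actual adaptor and its integration into the draft model may differ.

```python
# Sketch: compress a long sequence of image tokens into a few compact slots with
# cross-attention from learned queries (an assumed design; ViSpec's adaptor may differ).
import torch
import torch.nn as nn

class VisionAdaptor(nn.Module):
    def __init__(self, dim=768, num_slots=8, num_heads=8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)  # learned query slots
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, image_tokens):
        """image_tokens: (batch, num_image_tokens, dim) -> (batch, num_slots, dim)."""
        q = self.slots.unsqueeze(0).expand(image_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, image_tokens, image_tokens)
        return compressed

tokens = torch.randn(2, 576, 768)     # e.g., 24x24 visual tokens from a VLM vision encoder
print(VisionAdaptor()(tokens).shape)  # torch.Size([2, 8, 768])
```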
Click here to view paper screenshots




SAIL-VL2 Technical Report
Authors:Weijie Yin, Yongjie Ye, Fangxun Shu, Yue Liao, Zijian Kang, Hongyuan Dong, Haiyang Yu, Dingkang Yang, Jiacong Wang, Han Wang, Wenzhuo Liu, Xiao Liang, Shuicheng Yan, Chao Feng
We introduce SAIL-VL2, an open-suite vision-language foundation model (LVM) for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, SAIL-VL2 achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Its effectiveness is driven by three core innovations. First, a large-scale data curation pipeline with scoring and filtering strategies enhances both quality and distribution across captioning, OCR, QA, and video data, improving training efficiency. Second, a progressive training framework begins with a powerful pre-trained vision encoder (SAIL-ViT), advances through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities. Third, architectural advances extend beyond dense LLMs to efficient sparse Mixture-of-Experts (MoE) designs. With these contributions, SAIL-VL2 demonstrates competitive performance across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista. Furthermore, on the OpenCompass leaderboard, SAIL-VL2-2B ranks first among officially released open-source models under the 4B parameter scale, while serving as an efficient and extensible foundation for the open-source multimodal community.
Paper and project links
PDF Technical Report
Summary
SAIL-VL2 is an open-suite vision-language foundation model for comprehensive multimodal understanding and reasoning. As the successor to SAIL-VL, it achieves state-of-the-art performance at the 2B and 8B parameter scales across diverse image and video benchmarks, demonstrating strong capabilities from fine-grained perception to complex reasoning. Its effectiveness rests on three core innovations: a large-scale data curation pipeline, a progressive training framework, and architectural advances.
Key Takeaways
- SAIL-VL2 is an open-suite vision-language foundation model for comprehensive multimodal understanding and reasoning.
- Compared with its predecessor, SAIL-VL2 achieves state-of-the-art performance at its parameter scales.
- A large-scale data curation pipeline with scoring and filtering improves the quality and distribution of captioning, OCR, QA, and video data, raising training efficiency.
- A progressive training framework starts from a powerful pre-trained vision encoder, proceeds through multimodal pre-training, and culminates in a thinking-fusion SFT-RL hybrid paradigm that systematically strengthens model capabilities.
- Architecturally, SAIL-VL2 goes beyond dense LLMs with an efficient sparse Mixture-of-Experts (MoE) design.
- SAIL-VL2 is competitive across 106 datasets and achieves state-of-the-art results on challenging reasoning benchmarks such as MMMU and MathVista.
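For readers unfamiliar with sparse Mixture-of-Experts layers, the sketch below shows a generic top-k routed feed-forward MoE. Expert count, hidden sizes, and routing details are illustrative assumptions; SAIL-VL2's actual MoE configuration is not described at this level in the summary.

```python
# Sketch: a generic top-k sparse Mixture-of-Experts feed-forward layer (illustrative only).
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    def __init__(self, dim=512, num_experts=8, top_k=2, hidden=1024):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts))
        self.top_k = top_k

    def forward(self, x):
        """x: (tokens, dim); each token is routed to its top-k experts."""
        scores = self.gate(x).softmax(dim=-1)                   # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)          # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the kept weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(SparseMoE()(tokens).shape)
```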
Click here to view paper screenshots





Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers
Authors:Ian Chuang, Jinyu Zou, Andrew Lee, Dechen Gao, Iman Soltani
Human vision is a highly active process driven by gaze, which directs attention to task-relevant regions through foveation, dramatically reducing visual processing. In contrast, robot learning systems typically rely on passive, uniform processing of raw camera images. In this work, we explore how incorporating human-like active gaze into robotic policies can enhance efficiency and robustness. We develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement, and gaze adjustment for foveated processing. Extending the AV-ALOHA robot platform, we introduce a framework for simultaneously collecting eye-tracking, perspective control, and robot manipulation demonstration data from a human operator. We also open-source a simulation benchmark and dataset for training robot policies that incorporate human gaze. Inspired by recent work in foveated image segmentation and given the widespread use of Vision Transformers (ViTs) in robot learning, we integrate gaze information into ViTs using a foveated patch tokenization scheme. Compared to uniform patch tokenization, this significantly reduces the number of tokens, and thus computation. Our results show that our method for foveated robot vision drastically reduces computational overhead, and enhances robustness to background distractors. Notably, on certain high-precision tasks, foveated vision also improves performance, as reflected in higher success rates. Together, these findings suggest that human-inspired foveated visual processing offers untapped potential and should be further considered as a useful inductive bias in robotic vision systems. https://ian-chuang.github.io/gaze-av-aloha/
Paper and project links
PDF Project page: https://ian-chuang.github.io/gaze-av-aloha/
Summary
This work explores incorporating human-like active gaze into robot policies to improve efficiency and robustness. The authors develop GIAVA (Gaze Integrated Active-Vision ALOHA), a robot vision system that emulates human head and neck movement and gaze adjustment for foveated processing. Gaze information is integrated into Vision Transformers (ViTs) through a foveated patch tokenization scheme, which greatly reduces the number of tokens, and therefore the computation, relative to uniform patch tokenization. Results show that foveated robot vision drastically reduces computational overhead, improves robustness to background distractors, and improves performance on certain high-precision tasks.
Key Takeaways
- Human vision is an active, gaze-driven process that focuses on task-relevant regions and dramatically reduces visual processing.
- Robot learning systems typically process raw camera images passively and uniformly; this work explores adding human-like active gaze to robot policies to improve efficiency and robustness.
- GIAVA (Gaze Integrated Active-Vision ALOHA) emulates human head and neck movement and gaze adjustment for foveated processing.
- A framework is introduced for simultaneously collecting eye-tracking, perspective-control, and robot manipulation demonstration data from a human operator.
- A simulation benchmark and dataset are open-sourced for training robot policies that incorporate human gaze.
- Gaze information is integrated into Vision Transformers (ViTs) via a foveated patch tokenization scheme, which significantly reduces the number of tokens and the computation.
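A toy version of foveated patch tokenization is sketched below: full-resolution patches are kept near the gaze point and a coarse, average-pooled grid covers the periphery, so far fewer tokens reach the ViT than with uniform tokenization. Patch size, fovea radius, and pooling factor are arbitrary assumptions, not the paper's settings.

```python
# Sketch: simple foveated patch tokenization - keep full-resolution patch tokens near the
# gaze point and a coarse, average-pooled grid for the periphery. Illustrative only.
import torch
import torch.nn.functional as F

def foveated_tokens(img, gaze_xy, patch=16, fovea_radius=48, coarse_factor=4):
    """img: (3, H, W); gaze_xy: (x, y) in pixels. Returns (num_tokens, 3*patch*patch)."""
    c, h, w = img.shape
    fine = F.unfold(img.unsqueeze(0), patch, stride=patch)[0].T           # (H/p * W/p, 3*p*p)
    ys, xs = torch.meshgrid(torch.arange(h // patch), torch.arange(w // patch), indexing="ij")
    centers = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float() * patch + patch / 2
    near = (centers - torch.tensor(gaze_xy, dtype=torch.float)).norm(dim=-1) < fovea_radius
    coarse_img = F.avg_pool2d(img.unsqueeze(0), coarse_factor)            # low-resolution periphery
    coarse = F.unfold(coarse_img, patch, stride=patch)[0].T
    return torch.cat([fine[near], coarse], dim=0)

img = torch.rand(3, 224, 224)
tokens = foveated_tokens(img, gaze_xy=(112, 112))
print(tokens.shape)   # far fewer tokens than the 196 uniform patches
```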
Click here to view paper screenshots







Few-Shot Image Quality Assessment via Adaptation of Vision-Language Models
Authors:Xudong Li, Zihao Huang, Yan Zhang, Yunhang Shen, Ke Li, Xiawu Zheng, Liujuan Cao, Rongrong Ji
Image Quality Assessment (IQA) remains an unresolved challenge in computer vision due to complex distortions, diverse image content, and limited data availability. Existing Blind IQA (BIQA) methods largely rely on extensive human annotations, which are labor-intensive and costly due to the demanding nature of creating IQA datasets. To reduce this dependency, we propose the Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA), designed to efficiently adapt the visual-language pre-trained model, CLIP, to IQA tasks, achieving high accuracy even with limited data. GRMP-IQA consists of two core modules: (i) Meta-Prompt Pre-training Module and (ii) Quality-Aware Gradient Regularization. The Meta Prompt Pre-training Module leverages a meta-learning paradigm to pre-train soft prompts with shared meta-knowledge across different distortions, enabling rapid adaptation to various IQA tasks. On the other hand, the Quality-Aware Gradient Regularization is designed to adjust the update gradients during fine-tuning, focusing the model’s attention on quality-relevant features and preventing overfitting to semantic information. Extensive experiments on standard BIQA datasets demonstrate the superior performance to the state-of-the-art BIQA methods under limited data setting. Notably, utilizing just 20% of the training data, GRMP-IQA is competitive with most existing fully supervised BIQA approaches.
Paper and project links
Summary
The paper reviews the challenges of image quality assessment (IQA) and the limitations of existing blind IQA (BIQA) methods, which depend on extensive annotation. It proposes the Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA), which adapts the vision-language pretrained model CLIP to IQA and achieves high accuracy with limited data. The framework has two core modules: a Meta-Prompt Pre-training Module, which uses a meta-learning paradigm to pre-train soft prompts, and Quality-Aware Gradient Regularization, which adjusts the update gradients during fine-tuning to focus on quality-relevant features and prevent overfitting to semantic information. Experiments on standard BIQA datasets show that GRMP-IQA outperforms state-of-the-art BIQA methods in the limited-data setting; with only 20% of the training data it is competitive with most existing fully supervised BIQA approaches.
Key Takeaways
- Image quality assessment (IQA) remains unsolved due to complex distortions, diverse image content, and limited data availability.
- Existing blind IQA (BIQA) methods rely heavily on human annotations, which are labor-intensive and costly.
- The Gradient-Regulated Meta-Prompt IQA Framework (GRMP-IQA) adapts the vision-language pretrained model CLIP to IQA tasks and achieves high accuracy with limited data.
- GRMP-IQA has two core modules: a Meta-Prompt Pre-training Module and Quality-Aware Gradient Regularization.
- The Meta-Prompt Pre-training Module pre-trains soft prompts via meta-learning so the model adapts quickly across distortions and IQA tasks.
- Quality-Aware Gradient Regularization adjusts the update gradients during fine-tuning, focusing on quality-relevant features and preventing overfitting to semantic information.
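Quality-aware gradient regularization is described only at a high level; one plausible mechanism, sketched below, damps the component of a parameter gradient that aligns with a "semantic" reference gradient. This is an assumption for illustration and is not the paper's actual rule.

```python
# Sketch: one plausible form of quality-aware gradient regulation - damp the component of a
# parameter gradient that aligns with a "semantic" reference gradient, keeping the rest.
# This is an assumed illustration, not the paper's exact rule.
import torch

def regulate_gradient(grad, semantic_grad, damping=0.9, eps=1e-8):
    """Remove `damping` of the projection of `grad` onto `semantic_grad` (same shapes)."""
    g, s = grad.flatten(), semantic_grad.flatten()
    proj = (g @ s) / (s @ s + eps) * s                 # component of grad along the semantic direction
    return (g - damping * proj).view_as(grad)

param_grad = torch.randn(4, 4)
semantic_grad = torch.randn(4, 4)                      # e.g., gradient of a semantic-classification loss
print(regulate_gradient(param_grad, semantic_grad).shape)
```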
Click here to view paper screenshots



