
Vision Transformer


⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use these summaries in serious academic contexts; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-10-19

UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning

Authors: Maoxun Yuan, Bo Cui, Tianyi Zhao, Jiayi Wang, Shan Fu, Xue Yang, Xingxing Wei

Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to their enhanced accuracy and robustness under challenging conditions such as low illumination and adverse weather. However, due to the lack of foundation models pre-trained on large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adapter to effectively complement the ViT features with contextual multi-scale features. During the training process, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we use ViT-Base as the pre-trained foundation model in extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method achieves state-of-the-art performance.


Paper and Project Links

PDF 10 pages, 6 figures, Accepted by ACM MM 2025

Summary

This paper presents UniRGB-IR, a framework for semantic analysis of RGB and infrared images. To address the lack of foundation models pre-trained on large-scale infrared image datasets, the framework introduces a Multi-modal Feature Pool (MFP) module and a Supplementary Feature Injector (SFI) module that work together with a Vision Transformer (ViT) foundation model to effectively incorporate rich multi-modal features. By freezing the entire foundation model and optimizing only the MFP and SFI modules, the framework achieves excellent performance on RGB-IR semantic tasks.

Key Takeaways

  1. Semantic analysis of RGB and infrared images offers enhanced accuracy and robustness under challenging conditions such as low illumination and adverse weather.
  2. The lack of foundation models pre-trained on large-scale infrared image datasets is a bottleneck for existing methods.
  3. The UniRGB-IR framework introduces a novel adapter mechanism that effectively incorporates rich multi-modal features into a pre-trained RGB-based foundation model.
  4. The framework consists of three key components: a Vision Transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module.
  5. The MFP and SFI modules cooperate as an adapter to effectively complement the ViT features with contextual multi-scale features.
  6. During training, the entire foundation model is frozen to inherit prior knowledge, and only the MFP and SFI modules are optimized (see the sketch below).
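
To make the adapter-tuning recipe above concrete, here is a minimal PyTorch sketch of the freeze-the-backbone, train-the-adapter setup. This is not the authors' implementation: the internals of `MultiModalFeaturePool` and `SupplementaryFeatureInjector` below are hypothetical stand-ins (the abstract describes the MFP and SFI modules only at a high level), and `timm`'s `vit_base_patch16_224` is assumed as the ViT-Base foundation model.

```python
# Minimal sketch of adapter tuning for RGB-IR inputs (assumptions noted above).
import torch
import torch.nn as nn
import timm


class MultiModalFeaturePool(nn.Module):
    """Hypothetical stand-in for MFP: extracts tokens from concatenated RGB-IR input."""

    def __init__(self, dim=768):
        super().__init__()
        # 6 input channels: RGB (3) + IR (3, replicated if single-channel).
        self.stem = nn.Conv2d(6, dim, kernel_size=16, stride=16)

    def forward(self, rgb, ir):
        x = torch.cat([rgb, ir], dim=1)          # (B, 6, H, W)
        feats = self.stem(x)                     # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)  # (B, N, dim) token layout


class SupplementaryFeatureInjector(nn.Module):
    """Hypothetical stand-in for SFI: injects multi-modal tokens into ViT tokens."""

    def __init__(self, dim=768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        # Zero-initialized gate: injection starts as an identity mapping.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, vit_tokens, mm_tokens):
        return vit_tokens + self.gate * self.proj(mm_tokens)


vit = timm.create_model("vit_base_patch16_224", pretrained=True)
for p in vit.parameters():
    p.requires_grad = False  # freeze the entire foundation model

mfp = MultiModalFeaturePool()
sfi = SupplementaryFeatureInjector()

# Only the adapter (MFP + SFI) parameters are optimized.
optimizer = torch.optim.AdamW(
    list(mfp.parameters()) + list(sfi.parameters()), lr=1e-4
)

# Toy forward pass with random tensors in place of real RGB-IR pairs.
rgb = torch.randn(2, 3, 224, 224)
ir = torch.randn(2, 3, 224, 224)
vit_tokens = vit.patch_embed(rgb)      # (2, 196, 768) RGB patch tokens
fused = sfi(vit_tokens, mfp(rgb, ir))  # tokens enriched with IR information
```

In a full pipeline, the fused tokens would continue through the frozen ViT blocks into a task head; because only the small MFP/SFI modules receive gradients, the pre-trained knowledge of the backbone is preserved while the adapter learns to supply what the RGB-only features are missing.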

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!