
Detection / Segmentation / Tracking


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on these for serious academic work; they are intended only as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-16

Multimodal SAM-adapter for Semantic Segmentation

Authors: Iacopo Curti, Pierluigi Zama Ramirez, Alioscia Petrelli, Luigi Di Stefano

Semantic segmentation, a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics, has advanced substantially with deep learning. Nevertheless, current approaches remain vulnerable to challenging conditions such as poor lighting, occlusions, and adverse weather. To address these limitations, multimodal methods that integrate auxiliary sensor data (e.g., LiDAR, infrared) have recently emerged, providing complementary information that enhances robustness. In this work, we present MM SAM-adapter, a novel framework that extends the capabilities of the Segment Anything Model (SAM) for multimodal semantic segmentation. The proposed method employs an adapter network that injects fused multimodal features into SAM’s rich RGB features. This design enables the model to retain the strong generalization ability of RGB features while selectively incorporating auxiliary modalities only when they contribute additional cues. As a result, MM SAM-adapter achieves a balanced and efficient use of multimodal information. We evaluate our approach on three challenging benchmarks, DeLiVER, FMB, and MUSES, where MM SAM-adapter delivers state-of-the-art performance. To further analyze modality contributions, we partition DeLiVER and FMB into RGB-easy and RGB-hard subsets. Results consistently demonstrate that our framework outperforms competing methods in both favorable and adverse conditions, highlighting the effectiveness of multimodal adaptation for robust scene understanding. The code is available at the following link: https://github.com/iacopo97/Multimodal-SAM-Adapter.


Paper and project links

PDF

Summary

To address the limitations of existing semantic segmentation methods under adverse conditions, this paper proposes MM SAM-adapter, a multimodal approach that integrates auxiliary sensor data to improve robustness. MM SAM-adapter uses an adapter network to inject fused multimodal features into SAM’s rich RGB features, so the model keeps the strong generalization of the RGB features while selectively incorporating auxiliary modalities. Experiments show that the method achieves state-of-the-art performance on three challenging benchmarks. A rough code sketch of the adapter idea is given after the takeaways below.

Key Takeaways

  1. Semantic segmentation is a key task in computer vision with broad applications in autonomous driving, medical imaging, and robotics.
  2. Current semantic segmentation methods remain fragile under challenging conditions such as low light, occlusion, and adverse weather.
  3. To address this, the authors propose MM SAM-adapter, a multimodal method that incorporates auxiliary sensor data such as LiDAR and infrared to improve robustness.
  4. MM SAM-adapter uses an adapter network to fuse multimodal features into SAM’s RGB features, retaining RGB generalization while selectively incorporating auxiliary modalities.
  5. Experiments on three challenging benchmarks show that MM SAM-adapter achieves state-of-the-art performance.
  6. Performance analysis under different conditions (RGB-easy vs. RGB-hard subsets) confirms its advantage in difficult environments.
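To make the adapter idea concrete, below is a minimal PyTorch-style sketch. It is not the authors’ implementation: the module name, feature dimensions, fusion operator, and zero-initialized gate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """Toy adapter that fuses auxiliary-modality features and injects them
    into (frozen) RGB backbone features through a gated residual path.
    Sizes and the fusion scheme are assumptions, not the paper's design."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Fuse concatenated RGB + auxiliary features back down to `dim` channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * dim, dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=1),
        )
        # Zero-initialized gate: training starts from the pure RGB path.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, rgb_feat: torch.Tensor, aux_feat: torch.Tensor) -> torch.Tensor:
        fused = self.fuse(torch.cat([rgb_feat, aux_feat], dim=1))
        # Inject fused multimodal cues into the RGB features.
        return rgb_feat + self.gate * fused

# Usage with dummy feature maps of shape (B, C, H, W):
rgb = torch.randn(1, 256, 64, 64)  # e.g. features from SAM's image encoder
aux = torch.randn(1, 256, 64, 64)  # e.g. projected LiDAR/infrared features
print(MultimodalAdapter(256)(rgb, aux).shape)  # torch.Size([1, 256, 64, 64])
```

The zero-initialized gate keeps the RGB path unchanged at the start of training and only gradually admits multimodal cues, which is one simple way to realize the paper’s stated goal of using auxiliary modalities only when they add information.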

Cool Papers

Click here to view paper screenshots

TUNI: Real-time RGB-T Semantic Segmentation with Unified Multi-Modal Feature Extraction and Cross-Modal Feature Fusion

Authors: Xiaodong Guo, Tong Liu, Yike Li, Zi’ang Lin, Zhihong Deng

RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing models employ encoders pre-trained on RGB images to extract features from both RGB and infrared inputs, and design additional modules to achieve cross-modal feature fusion. This results in limited thermal feature extraction and suboptimal cross-modal fusion, while the redundant encoders further compromise the model’s real-time efficiency. To address the above issues, we propose TUNI, with an RGB-T encoder consisting of multiple stacked blocks that simultaneously perform multi-modal feature extraction and cross-modal fusion. By leveraging large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder learns to integrate feature extraction and fusion in a unified manner. By slimming down the thermal branch, the encoder achieves a more compact architecture. Moreover, we introduce an RGB-T local module to strengthen the encoder’s capacity for cross-modal local feature fusion. The RGB-T local module employs adaptive cosine similarity to selectively emphasize salient consistent and distinct local features across RGB-T modalities. Experimental results show that TUNI achieves competitive performance with state-of-the-art models on FMB, PST900 and CART, with fewer parameters and lower computational cost. Meanwhile, it achieves an inference speed of 27 FPS on a Jetson Orin NX, demonstrating its real-time capability in deployment. Codes are available at https://github.com/xiaodonguo/TUNI.


Paper and project links

PDF

Summary

This paper targets RGB-thermal (RGB-T) semantic segmentation for environmental perception on autonomous platforms. To overcome the limited thermal feature extraction and suboptimal cross-modal fusion of existing models, the authors propose TUNI, whose RGB-T encoder stacks blocks that perform multi-modal feature extraction and cross-modal fusion simultaneously. Through large-scale pre-training with RGB and pseudo-thermal data, the encoder integrates feature extraction and fusion in a unified manner, and an RGB-T local module further strengthens cross-modal local feature fusion. Experiments show that TUNI matches state-of-the-art models on FMB, PST900, and CART with fewer parameters and lower computational cost, and it reaches 27 FPS on a Jetson Orin NX, demonstrating real-time capability in deployment. A rough sketch of the cosine-similarity gating idea is given after the takeaways below.

Key Takeaways

  1. RGB-thermal (RGB-T) semantic segmentation improves the perception of autonomous platforms in harsh environments.
  2. Existing models are limited in thermal feature extraction and cross-modal fusion.
  3. The TUNI model uses an RGB-T encoder that unifies multi-modal feature extraction and cross-modal fusion.
  4. Through large-scale pre-training with RGB and pseudo-thermal data, the RGB-T encoder handles both in a unified manner.
  5. Slimming down the thermal branch gives TUNI a compact architecture and more efficient inference.
  6. An RGB-T local module based on adaptive cosine similarity strengthens cross-modal local feature fusion.
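The abstract names “adaptive cosine similarity” without giving a formula, so the sketch below is one plausible reading rather than TUNI’s published module: per-pixel cosine similarity gates how much of the shared versus complementary RGB-T signal is kept. The function name, the sigmoid gate, and the temperature tau are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_gated_fusion(rgb_feat: torch.Tensor, thermal_feat: torch.Tensor,
                        tau: torch.Tensor) -> torch.Tensor:
    """Weight local features by per-pixel RGB-T cosine similarity.
    This gating scheme is an assumption, not TUNI's actual module."""
    # Per-pixel cosine similarity over channels: (B, H, W) -> (B, 1, H, W).
    sim = F.cosine_similarity(rgb_feat, thermal_feat, dim=1, eps=1e-6).unsqueeze(1)
    w = torch.sigmoid(sim / tau)
    # High similarity -> modalities agree: emphasize the shared signal.
    consistent = w * (rgb_feat + thermal_feat)
    # Low similarity -> modalities differ: emphasize the complementary signal.
    distinct = (1.0 - w) * (rgb_feat - thermal_feat).abs()
    return consistent + distinct

rgb = torch.randn(1, 64, 32, 32)
thermal = torch.randn(1, 64, 32, 32)
tau = torch.tensor(0.5)  # would be a learnable nn.Parameter in practice
print(cosine_gated_fusion(rgb, thermal, tau).shape)  # torch.Size([1, 64, 32, 32])
```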

Cool Papers

Click here to view paper screenshots

Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures

Authors: Waqar Ahmad, Evan Murphy, Vladimir A. Krylov

Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture, combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID on the CUHK03 and Market-1501 datasets, and for vehicle Re-ID on the VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: github.com/waqar3411/Beta-SOD


Paper and project links

PDF

Summary

This paper tackles the performance degradation caused by label noise in object re-identification (Re-ID). The authors reframe Re-ID as a supervised image-similarity task with a Siamese network that captures discriminative pairwise relationships, and introduce Beta-SOD, a novel statistical outlier-detection (OD) framework that models the distribution of cosine similarities between embedding pairs with a two-component Beta mixture. On CUHK03, Market-1501, and VeRi-776, Beta-SOD delivers strong de-noising and Re-ID performance, outperforming state-of-the-art methods across noise levels and demonstrating robustness and broad applicability in noisy Re-ID scenarios. A simplified sketch of the Beta-mixture outlier detector is given after the takeaways below.

Key Takeaways

  1. Object re-identification (Re-ID) suffers performance degradation from label noise.
  2. The paper reframes Re-ID as a supervised image-similarity task built on a Siamese network architecture.
  3. It introduces Beta-SOD, a statistical outlier-detection framework that models the cosine-similarity distribution with a two-component Beta mixture.
  4. Beta-SOD improves performance on both de-noising and Re-ID tasks.
  5. On the CUHK03, Market-1501, and VeRi-776 datasets, Beta-SOD outperforms state-of-the-art methods.
  6. Beta-SOD remains robust across a range of noise levels (10-30%).
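As a simplified stand-in for the Beta-SOD idea (the actual estimator and losses are in the authors’ repository), here is a sketch that fits a two-component Beta mixture to pairwise cosine similarities with EM, using a weighted method-of-moments M-step instead of full maximum likelihood. The initialization, the moment-matching update, and the thresholding rule are assumptions for illustration.

```python
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(s: np.ndarray, n_iter: int = 100):
    """Fit a two-component Beta mixture to similarity scores in (0, 1)
    via EM with a weighted method-of-moments M-step (an approximation
    to the maximum-likelihood update)."""
    s = np.clip(s, 1e-4, 1 - 1e-4)
    # Component 0 starts low-mean (noisy pairs), component 1 high-mean (clean pairs).
    params = [(2.0, 5.0), (5.0, 2.0)]
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample.
        dens = np.stack([pi[k] * beta.pdf(s, *params[k]) for k in range(2)])
        resp = dens / (dens.sum(axis=0, keepdims=True) + 1e-12)
        # M-step: mixture weights and per-component moment matching.
        pi = resp.mean(axis=1)
        for k in range(2):
            w = resp[k] / resp[k].sum()
            m = np.sum(w * s)                      # weighted mean
            v = np.sum(w * (s - m) ** 2) + 1e-12   # weighted variance
            c = max(m * (1 - m) / v - 1, 1e-2)
            params[k] = (m * c, (1 - m) * c)
    return pi, params

# Synthetic demo: 90% clean (high-similarity) pairs, 10% noisy pairs.
rng = np.random.default_rng(0)
s = np.concatenate([rng.beta(8, 2, size=900), rng.beta(2, 6, size=100)])
pi, params = fit_beta_mixture(s)
# Flag a pair as an outlier when the low-similarity component dominates.
outliers = pi[0] * beta.pdf(s, *params[0]) > pi[1] * beta.pdf(s, *params[1])
print(f"flagged {outliers.mean():.1%} of pairs as noisy")
```

Pairs assigned to the low-similarity component would then be treated as noisy labels and down-weighted or removed before Re-ID training.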

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit the source, Kedreamix, when reposting!