⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings; they are only meant as a first-pass screen before actually reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-19
Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention
Authors:Yu Wen, Shuyong Gao, Shuping Zhang, Miao Huang, Lili Tao, Han Yang, Haozhe Xing, Lihe Zhang, Boxue Hou
Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.
Paper and project links
PDF 12 pages, 7 figures. This work is supported by the National Natural Science Foundation of China (Grant No. 62203291)
Summary: Referring camouflaged object detection (Ref-COD) identifies hidden objects by incorporating reference information such as images and text descriptions. This work explores multi-context fusion of rich salient-image features with camouflaged-object features to improve performance. The proposed RFMNet takes features from multiple encoding stages of the reference salient images and fuses them interactively with the camouflage features at the corresponding stages. Because fusing features within local regions captures more object-related detail, an Overlapped Windows Cross-attention mechanism directs the model toward local information matching based on the reference features. A Referring Feature Aggregation (RFA) module then decodes and segments the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark show state-of-the-art performance.
Key Takeaways:
- Ref-COD identifies hidden objects by incorporating reference information such as images and text descriptions.
- RFMNet boosts performance through multi-context fusion of rich salient-image features and camouflaged-object features.
- RFMNet takes features from multiple encoding stages of the reference images and fuses them interactively with the camouflage features at the corresponding stages.
- Feature fusion within local regions helps capture more object-related detail.
- The Overlapped Windows Cross-attention mechanism focuses the model on local information matching guided by the reference features (a toy sketch follows this list).
- The Referring Feature Aggregation (RFA) module decodes and segments the camouflaged objects progressively.
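The abstract does not spell out the window size, overlap, or exact fusion used by the Overlapped Windows Cross-attention, so the following is only a minimal, hypothetical sketch of the general idea: camouflage-image features query reference (salient-image) features inside overlapping local windows. All module names, shapes, and hyper-parameters here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OverlappedWindowCrossAttention(nn.Module):
    """Toy cross-attention within overlapping local windows.

    Camouflage features attend to reference (salient-image) features inside
    each window. Window size, stride (overlap) and head count are guesses,
    not the paper's settings.
    """
    def __init__(self, dim, window=8, stride=4, heads=4):
        super().__init__()
        self.window, self.stride = window, stride
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _windows(self, x):
        # (B, C, H, W) -> (B * num_windows, window*window, C)
        b, c, _, _ = x.shape
        patches = F.unfold(x, self.window, stride=self.stride)        # (B, C*w*w, N)
        patches = patches.transpose(1, 2)                             # (B, N, C*w*w)
        patches = patches.reshape(b, -1, c, self.window * self.window)
        return patches.permute(0, 1, 3, 2).reshape(-1, self.window * self.window, c)

    def forward(self, camo_feat, ref_feat):
        # Queries come from camouflage features, keys/values from the reference.
        q = self._windows(camo_feat)
        kv = self._windows(ref_feat)
        out, _ = self.attn(q, kv, kv)
        return out  # per-window fused tokens; folding back to a map is omitted for brevity

# Tiny smoke test with random feature maps.
camo = torch.randn(2, 32, 16, 16)
ref = torch.randn(2, 32, 16, 16)
print(OverlappedWindowCrossAttention(32)(camo, ref).shape)
```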
View paper screenshots
C3Net: Context-Contrast Network for Camouflaged Object Detection
Authors:Baber Jan, Aiman H. El-Maleh, Abdul Jabbar Siddiqui, Abdul Bais, Saeed Anwar
Camouflaged object detection identifies objects that blend seamlessly with their surroundings through similar colors, textures, and patterns. This task challenges both traditional segmentation methods and modern foundation models, which fail dramatically on camouflaged objects. We identify six fundamental challenges in COD: Intrinsic Similarity, Edge Disruption, Extreme Scale Variation, Environmental Complexities, Contextual Dependencies, and Salient-Camouflaged Object Disambiguation. These challenges frequently co-occur and compound the difficulty of detection, requiring comprehensive architectural solutions. We propose C3Net, which addresses all challenges through a specialized dual-pathway decoder architecture. The Edge Refinement Pathway employs gradient-initialized Edge Enhancement Modules to recover precise boundaries from early features. The Contextual Localization Pathway utilizes our novel Image-based Context Guidance mechanism to achieve intrinsic saliency suppression without external models. An Attentive Fusion Module synergistically combines the two pathways via spatial gating. C3Net achieves state-of-the-art performance with S-measures of 0.898 on COD10K, 0.904 on CAMO, and 0.913 on NC4K, while maintaining efficient processing. C3Net demonstrates that complex, multifaceted detection challenges require architectural innovation, with specialized components working synergistically to achieve comprehensive coverage beyond isolated improvements. Code, model weights, and results are available at https://github.com/Baber-Jan/C3Net.
Paper and project links
Summary: Camouflaged object detection identifies objects that blend into their surroundings through similar colors, textures, and patterns, a task that challenges both traditional segmentation methods and modern foundation models. The paper identifies six fundamental challenges in COD and proposes C3Net, a specialized dual-pathway decoder architecture with an Edge Refinement Pathway and a Contextual Localization Pathway, which together enable efficient, accurate camouflaged object detection and state-of-the-art results on three benchmarks.
Key Takeaways:
- Camouflaged object detection targets objects that blend into their surroundings through similar colors, textures, and patterns.
- Both traditional segmentation methods and modern foundation models struggle on camouflaged objects.
- The paper identifies six fundamental challenges, including intrinsic similarity and edge disruption.
- C3Net addresses all of these challenges with a specialized dual-pathway decoder architecture.
- C3Net combines an Edge Refinement Pathway and a Contextual Localization Pathway through an Attentive Fusion Module with spatial gating (a toy sketch follows this list).
- C3Net delivers efficient, accurate camouflaged object detection and state-of-the-art results on COD10K, CAMO, and NC4K.
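The abstract only states that the Attentive Fusion Module combines the two decoder pathways "via spatial gating". As a hedged illustration of what spatial gating typically looks like (not C3Net's actual module), a per-pixel gate can be predicted from both pathways and used to blend them:

```python
import torch
import torch.nn as nn

class SpatialGatedFusion(nn.Module):
    """Generic spatial-gating fusion of two decoder pathways.

    A 1-channel gate map is predicted from both inputs and used to blend the
    edge pathway with the contextual pathway per pixel. This is only one
    plausible reading of "spatial gating", not C3Net's published code.
    """
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, edge_feat, context_feat):
        g = self.gate(torch.cat([edge_feat, context_feat], dim=1))  # (B,1,H,W) in [0,1]
        return g * edge_feat + (1.0 - g) * context_feat

x = torch.randn(1, 64, 32, 32)
y = torch.randn(1, 64, 32, 32)
print(SpatialGatedFusion(64)(x, y).shape)  # torch.Size([1, 64, 32, 32])
```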
View paper screenshots
Evaluation of Attention Mechanisms in U-Net Architectures for Semantic Segmentation of Brazilian Rock Art Petroglyphs
Authors:Leonardi Melo, Luís Gustavo, Dimmy Magalhães, Lucciani Vieira, Mauro Araújo
This study presents a comparative analysis of three U-Net-based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. The investigated architectures were: (1) BEGL-UNet with Border-Enhanced Gaussian Loss function; (2) Attention-Residual BEGL-UNet, incorporating residual blocks and gated attention mechanisms; and (3) Spatial Channel Attention BEGL-UNet, which employs spatial-channel attention modules based on Convolutional Block Attention Module. All implementations employed the BEGL loss function combining binary cross-entropy with Gaussian edge enhancement. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex, Piauí, Brazil, using 5-fold cross-validation. Among the architectures, Attention-Residual BEGL-UNet achieved the best overall performance with Dice Score of 0.710, validation loss of 0.067, and highest recall of 0.854. Spatial Channel Attention BEGL-UNet obtained comparable performance with DSC of 0.707 and recall of 0.857. The baseline BEGL-UNet registered DSC of 0.690. These results demonstrate the effectiveness of attention mechanisms for archaeological heritage digital preservation, with Dice Score improvements of 2.5-2.9% over the baseline.
Paper and project links
PDF 14 pages, 8 figures. Preprint submitted to arXiv
Summary:
This study compares three U-Net-based architectures for semantic segmentation of rock art petroglyphs from Brazilian archaeological sites. Experiments were conducted on images from the Poço da Bebidinha Archaeological Complex in Piauí, Brazil, using 5-fold cross-validation. The Attention-Residual BEGL-UNet, which adds residual blocks and gated attention, achieved the best overall performance, with a Dice score of 0.710, a validation loss of 0.067, and a recall of 0.854; the Spatial Channel Attention BEGL-UNet performed comparably. The results indicate that attention mechanisms are effective for digital preservation of archaeological heritage, improving the Dice score by 2.5-2.9% over the baseline.
Key Takeaways:
- The study compares three U-Net-based architectures for semantic segmentation of rock art petroglyphs.
- The three architectures are BEGL-UNet, Attention-Residual BEGL-UNet, and Spatial Channel Attention BEGL-UNet; all use the BEGL loss, which combines binary cross-entropy with Gaussian edge enhancement (a toy sketch follows this list).
- Experiments were run on images from a Brazilian archaeological site using 5-fold cross-validation.
- Attention-Residual BEGL-UNet performed best, with a Dice score of 0.710, a validation loss of 0.067, and a recall of 0.854.
- Spatial Channel Attention BEGL-UNet performed comparably to the best model.
- The results show that attention mechanisms are effective for digital preservation of cultural heritage.
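The BEGL loss is described as binary cross-entropy combined with Gaussian edge enhancement, but its exact formulation is not given in the abstract. The sketch below is one plausible reading under stated assumptions: a boundary band is extracted from the ground truth, blurred with a Gaussian kernel, and used to up-weight the BCE term near edges. The kernel size, sigma, and weighting scheme are illustrative, not the published loss.

```python
import torch
import torch.nn.functional as F

def gaussian_kernel(size=7, sigma=2.0):
    """Separable 2D Gaussian kernel, shape (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    g = g / g.sum()
    return (g[:, None] * g[None, :]).view(1, 1, size, size)

def begl_like_loss(logits, target, edge_weight=2.0, size=7, sigma=2.0):
    """BCE with extra weight on a Gaussian-blurred band around boundaries.

    edge_weight, size, and sigma are illustrative hyper-parameters; the
    published BEGL loss may combine its terms differently.
    """
    # Boundary map from the ground truth via a morphological gradient.
    dil = F.max_pool2d(target, 3, stride=1, padding=1)
    ero = -F.max_pool2d(-target, 3, stride=1, padding=1)
    edges = (dil - ero).clamp(0, 1)
    # Blur the edges so nearby pixels also get extra weight.
    band = F.conv2d(edges, gaussian_kernel(size, sigma), padding=size // 2)
    weights = 1.0 + edge_weight * band / (band.max() + 1e-6)
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    return (weights * bce).mean()

logits = torch.randn(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
print(begl_like_loss(logits, target).item())
```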
View paper screenshots
SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection
Authors:Xin Zuo, Chenyu Qu, Haibo Zhan, Jifeng Shen, Wankou Yang
Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain features remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency features of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model's adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion. Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in the UAV multispectral object perception task. Code will be available at https://github.com/qchenyu1027/SFFR.
Paper and project links
PDF 11 pages, 8 figures, accepted by IEEE TGRS
Summary
This paper proposes a Spatial and Frequency Feature Reconstruction (SFFR) method for multispectral object detection. It uses the spatial-frequency feature representation of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in the spatial and frequency domains before feature fusion. The core modules are the Frequency Component Exchange KAN (FCEKAN) and the Multi-Scale Gaussian KAN (MSGKAN). FCEKAN selectively exchanges frequency components to strengthen the complementarity and consistency of cross-modal features, while MSGKAN provides strong nonlinear feature modeling in the spatial domain, using multi-scale Gaussian basis functions to capture feature variations caused by different UAV flight altitudes and improving robustness to scale changes. Experiments show the two modules are complementary, capturing frequency and spatial semantic features respectively and improving UAV multispectral object perception.
Key Takeaways
- The paper proposes SFFR, a multispectral object detection method that combines spatial- and frequency-domain feature information.
- The core idea is to use the spatial-frequency feature representation mechanism of the Kolmogorov-Arnold Network (KAN) to reconstruct features before fusion.
- Two key modules are introduced: the Frequency Component Exchange KAN (FCEKAN) and the Multi-Scale Gaussian KAN (MSGKAN).
- FCEKAN uses a selective frequency component exchange strategy to strengthen the complementarity and consistency of cross-modal features (a toy sketch of frequency exchange follows this list).
- MSGKAN provides strong nonlinear feature modeling in the spatial domain and captures feature variations across different UAV flight altitudes.
- Experiments confirm that FCEKAN and MSGKAN are complementary and markedly improve UAV multispectral object perception.
- The method outperforms alternatives on the SeaDroneSee, DroneVehicle, and DVTOD datasets.
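FCEKAN's frequency component exchange is selective and learned, and the abstract does not describe the selection rule. As a rough illustration only, the sketch below swaps a fixed low-frequency band between RGB and IR feature maps with torch.fft; the cutoff radius and the decision to swap low frequencies are assumptions, not the paper's strategy.

```python
import torch

def exchange_low_frequencies(rgb_feat, ir_feat, cutoff=0.1):
    """Swap the low-frequency components of two feature maps (illustrative only).

    cutoff is the normalized radius of the swapped band; FCEKAN's selection is
    learned and modality-aware rather than a fixed radius like this.
    """
    h, w = rgb_feat.shape[-2:]
    fy = torch.fft.fftfreq(h).view(h, 1)
    fx = torch.fft.fftfreq(w).view(1, w)
    low = (fy ** 2 + fx ** 2).sqrt() <= cutoff           # (H, W) boolean mask

    rgb_f = torch.fft.fft2(rgb_feat)
    ir_f = torch.fft.fft2(ir_feat)
    rgb_new = torch.where(low, ir_f, rgb_f)              # RGB takes IR low frequencies
    ir_new = torch.where(low, rgb_f, ir_f)               # IR takes RGB low frequencies
    return torch.fft.ifft2(rgb_new).real, torch.fft.ifft2(ir_new).real

rgb = torch.randn(1, 16, 32, 32)
ir = torch.randn(1, 16, 32, 32)
a, b = exchange_low_frequencies(rgb, ir)
print(a.shape, b.shape)
```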
View paper screenshots
Probabilistic Robustness Analysis in High Dimensional Space: Application to Semantic Segmentation Network
Authors:Navid Hashemi, Samuel Sasaki, Diego Manzanas Lopez, Lars Lindemann, Ipek Oguz, Meiyi Ma, Taylor T. Johnson
Semantic segmentation networks (SSNs) are central to safety-critical applications such as medical imaging and autonomous driving, where robustness under uncertainty is essential. However, existing probabilistic verification methods often fail to scale with the complexity and dimensionality of modern segmentation tasks, producing guarantees that are overly conservative and of limited practical value. We propose a probabilistic verification framework that is architecture-agnostic and scalable to high-dimensional input-output spaces. Our approach employs conformal inference (CI), enhanced by a novel technique that we call the \textbf{clipping block}, to provide provable guarantees while mitigating the excessive conservatism of prior methods. Experiments on large-scale segmentation models across CamVid, OCTA-500, Lung Segmentation, and Cityscapes demonstrate that our framework delivers reliable safety guarantees while substantially reducing conservatism compared to state-of-the-art approaches on segmentation tasks. We also provide a public GitHub repository (https://github.com/Navidhashemicodes/SSN_Reach_CLP_Surrogate) for this approach, to support reproducibility.
Paper and project links
Summary
This paper proposes a probabilistic verification framework that is architecture-agnostic and scales to high-dimensional input-output spaces. It uses conformal inference (CI) together with a new technique called the clipping block to provide provable guarantees while mitigating the excessive conservatism of prior methods. Experiments on large segmentation models over CamVid, OCTA-500, Lung Segmentation, and Cityscapes show that the framework delivers reliable safety guarantees while substantially reducing conservatism compared with state-of-the-art approaches. A public GitHub repository is provided to support reproducibility.
Key Takeaways
- SSNs are central to safety-critical applications such as medical imaging and autonomous driving, where robustness under uncertainty is essential.
- Existing probabilistic verification methods often fail to scale with the complexity and dimensionality of modern segmentation tasks, producing guarantees that are overly conservative and of limited practical value.
- The proposed framework combines conformal inference (CI) with a clipping block to give provable guarantees while mitigating the over-conservatism of prior methods (a toy sketch of split conformal calibration follows this list).
- Experiments on several large segmentation models demonstrate reliable safety guarantees with substantially less conservatism than state-of-the-art approaches.
- The framework is architecture-agnostic and scales to high-dimensional input-output spaces.
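The clipping block is the paper's novel ingredient and is not described here in enough detail to reproduce, so the sketch below shows only the standard split conformal calibration step that such frameworks build on: compute nonconformity scores on a held-out calibration set (per-image pixel error rates are an assumed example) and take a finite-sample-corrected quantile as the guaranteed bound.

```python
import numpy as np

def split_conformal_threshold(calib_scores, alpha=0.05):
    """Split conformal calibration: return a threshold t such that, under
    exchangeability, a fresh score exceeds t with probability at most alpha.

    calib_scores are nonconformity scores on held-out calibration data, e.g.
    per-image pixel error rates of the segmentation network (an assumption).
    """
    scores = np.sort(np.asarray(calib_scores, dtype=float))
    n = len(scores)
    # 1-based index of the ceil((n + 1) * (1 - alpha))-th smallest score, capped at n.
    k = min(n, int(np.ceil((n + 1) * (1 - alpha))))
    return float(scores[k - 1])

# Toy example: pretend these are per-image error rates on a calibration set.
rng = np.random.default_rng(0)
calib = rng.beta(2, 20, size=500)
t = split_conformal_threshold(calib, alpha=0.05)
print(f"With >= 95% probability, a new image's error rate should not exceed {t:.3f}")
```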
View paper screenshots
SynSeg: Feature Synergy for Multi-Category Contrastive Learning in End-to-End Open-Vocabulary Semantic Segmentation
Authors:Weichen Zhang, Kebin Liu, Fan Dang, Zhui Zhu, Xikai Sun, Yunhao Liu
Semantic segmentation in open-vocabulary scenarios presents significant challenges due to the wide range and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and ill-suited feature construction methods for contrastive learning, leading to semantic misalignment and poor performance. In this work, we propose a novel weakly-supervised approach, SynSeg, to address the challenges. SynSeg performs Multi-Category Contrastive Learning (MCCL) as a stronger training signal with a new feature reconstruction framework named Feature Synergy Structure (FSS). Specifically, MCCL strategy robustly combines both intra- and inter-category alignment and separation in order to make the model learn the knowledge of correlations from different categories within the same image. Moreover, FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, effectively avoiding the foreground bias introduced by the visual encoder. Furthermore, SynSeg is a lightweight end-to-end solution without using any mid-term output from large-scale pretrained models and capable for real-time inference. In general, SynSeg effectively improves the abilities in semantic localization and discrimination under weak supervision in an efficient manner. Extensive experiments on benchmarks demonstrate that our method outperforms state-of-the-art (SOTA) performance. Particularly, SynSeg achieves higher accuracy than SOTA baselines with a ratio from 6.9% up to 26.2%.
Paper and project links
Summary
Open-vocabulary semantic segmentation is challenging because of the breadth and granularity of semantic categories. Existing weakly-supervised methods often rely on category-specific supervision and poorly suited feature construction for contrastive learning, leading to semantic misalignment and weak performance. This work proposes SynSeg, a weakly-supervised approach built on Multi-Category Contrastive Learning (MCCL) and a Feature Synergy Structure (FSS). MCCL robustly combines intra- and inter-category alignment and separation so the model learns correlations among categories within the same image, while FSS reconstructs discriminative features for contrastive learning through prior fusion and semantic-activation-map enhancement, avoiding the foreground bias introduced by the visual encoder. SynSeg is a lightweight end-to-end solution that does not use intermediate outputs of large pretrained models and supports real-time inference. Experiments show it outperforms state-of-the-art methods, with accuracy gains of 6.9% to 26.2% over SOTA baselines.
Key Takeaways
- Open-vocabulary semantic segmentation is challenging because semantic categories are broad and fine-grained.
- Existing weakly-supervised methods suffer from semantic misalignment and poor performance.
- SynSeg addresses these issues with Multi-Category Contrastive Learning (MCCL) and a Feature Synergy Structure (FSS).
- MCCL robustly combines intra- and inter-category alignment and separation so the model learns cross-category correlations within the same image (a rough analogue is sketched after this list).
- FSS reconstructs discriminative features for contrastive learning and avoids the foreground bias introduced by the visual encoder.
- SynSeg is a lightweight end-to-end solution suited to real-time inference.
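The abstract does not give MCCL's exact objective. As a rough analogue of "intra- and inter-category alignment and separation", the sketch below implements a standard supervised contrastive loss over category-labeled embeddings (Khosla et al., 2020); it is not SynSeg's loss, only a reference point for the kind of signal MCCL strengthens.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Standard supervised contrastive loss, shown only as a rough analogue
    of intra-/inter-category alignment and separation; NOT the MCCL objective.

    features: (N, D) embeddings, labels: (N,) integer category ids.
    """
    z = F.normalize(features, dim=1)
    sim = z @ z.t() / temperature                       # (N, N) cosine similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask

    sim = sim.masked_fill(self_mask, float("-inf"))     # never contrast a sample with itself
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    pos_counts = pos_mask.sum(1).clamp(min=1)
    mean_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1) / pos_counts
    loss = -mean_log_prob_pos
    return loss[pos_mask.any(1)].mean()                 # skip anchors with no positives

feats = torch.randn(8, 32)
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(feats, labels).item())
```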
View paper screenshots
SANSA: Unleashing the Hidden Semantics in SAM2 for Few-Shot Segmentation
Authors:Claudia Cuttano, Gabriele Trivigno, Giuseppe Averta, Carlo Masone
Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples. This requires mechanisms that can both identify semantically related objects across images and accurately produce segmentation masks. We note that Segment Anything 2 (SAM2), with its prompt-and-propagate mechanism, offers both strong segmentation capabilities and a built-in feature matching process. However, we show that its representations are entangled with task-specific cues optimized for object tracking, which impairs its use for tasks requiring higher level semantic understanding. Our key insight is that, despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features. We propose SANSA (Semantically AligNed Segment Anything 2), a framework that makes this latent structure explicit, and repurposes SAM2 for few-shot segmentation through minimal task-specific modifications. SANSA achieves state-of-the-art performance on few-shot segmentation benchmarks specifically designed to assess generalization, outperforms generalist methods in the popular in-context setting, supports flexible interaction via various prompts such as points, boxes, or scribbles, and remains significantly faster and more compact than prior approaches. Code is available at https://github.com/ClaudiaCuttano/SANSA.
Paper and project links
PDF Accepted to NeurIPS 2025 as Spotlight
Summary
SANSA (Semantically AligNed Segment Anything 2) repurposes SAM2 for few-shot segmentation by making the semantic structure already latent in its features explicit, using only minimal task-specific modifications. It identifies semantically related objects across images while producing accurate masks, achieves state-of-the-art performance on few-shot segmentation benchmarks, supports prompting via points, boxes, or scribbles, and remains faster and more compact than prior approaches.
Key Takeaways
- Few-shot segmentation aims to segment unseen object categories from just a handful of annotated examples.
- SAM2 offers strong segmentation and a built-in feature matching process, but its representations are entangled with tracking-specific cues, which limits its use for tasks that need higher-level semantic understanding.
- Despite its class-agnostic pretraining, SAM2 already encodes rich semantic structure in its features.
- SANSA makes this latent structure explicit and repurposes SAM2 for few-shot segmentation with minimal task-specific modifications, achieving state-of-the-art performance.
- SANSA excels on few-shot segmentation benchmarks specifically designed to assess generalization.
- SANSA outperforms generalist methods in the popular in-context setting.
- SANSA supports flexible interaction via points, boxes, or scribbles, and remains faster and more compact than prior approaches.
View paper screenshots
SAM2MOT: A Novel Paradigm of Multi-Object Tracking by Segmentation
Authors:Junjie Jiang, Zelin Wang, Manqi Zhao, Yin Li, DongSheng Jiang
Inspired by Segment Anything 2, which generalizes segmentation from images to videos, we propose SAM2MOT–a novel segmentation-driven paradigm for multi-object tracking that breaks away from the conventional detection-association framework. In contrast to previous approaches that treat segmentation as auxiliary information, SAM2MOT places it at the heart of the tracking process, systematically tackling challenges like false positives and occlusions. Its effectiveness has been thoroughly validated on major MOT benchmarks. Furthermore, SAM2MOT integrates pre-trained detector, pre-trained segmentor with tracking logic into a zero-shot MOT system that requires no fine-tuning. This significantly reduces dependence on labeled data and paves the way for transitioning MOT research from task-specific solutions to general-purpose systems. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Notably, SAM2MOT outperforms existing methods on DanceTrack by +2.1 HOTA and +4.5 IDF1, highlighting its effectiveness in MOT. Code is available at https://github.com/TripleJoy/SAM2MOT.
Paper and project links
Summary
Inspired by Segment Anything 2, which generalizes segmentation from images to videos, SAM2MOT is a segmentation-driven paradigm for multi-object tracking that breaks away from the conventional detection-association framework. Rather than treating segmentation as auxiliary information, SAM2MOT places it at the heart of the tracking process, systematically tackling false positives and occlusions, and its effectiveness is validated on major MOT benchmarks. By integrating a pre-trained detector and a pre-trained segmentor with tracking logic, it forms a zero-shot MOT system that requires no fine-tuning, reducing dependence on labeled data and paving the way toward general-purpose systems. Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results. Code is available on GitHub.
Key Takeaways
- SAM2MOT is a segmentation-driven multi-object tracking paradigm inspired by Segment Anything 2.
- It breaks away from the conventional detection-association framework and places segmentation at the heart of the tracking process.
- It systematically tackles common tracking challenges such as false positives and occlusions.
- Its effectiveness has been thoroughly validated on major multi-object tracking benchmarks.
- A pre-trained detector and a pre-trained segmentor are integrated with tracking logic into a zero-shot MOT system that needs no fine-tuning.
- This reduces dependence on labeled data and paves the way from task-specific solutions toward general-purpose systems.
- Experiments on DanceTrack, UAVDT, and BDD100K show state-of-the-art results, including +2.1 HOTA and +4.5 IDF1 on DanceTrack.
View paper screenshots
SimROD: A Simple Baseline for Raw Object Detection with Global and Local Enhancements
Authors:Haiyang Xie, Xi Shen, Shihua Huang, Qirui Wang, Zheng Wang
Most visual models are designed for sRGB images, yet RAW data offers significant advantages for object detection by preserving sensor information before ISP processing. This enables improved detection accuracy and more efficient hardware designs by bypassing the ISP. However, RAW object detection is challenging due to limited training data, unbalanced pixel distributions, and sensor noise. To address this, we propose SimROD, a lightweight and effective approach for RAW object detection. We introduce a Global Gamma Enhancement (GGE) module, which applies a learnable global gamma transformation with only four parameters, improving feature representation while keeping the model efficient. Additionally, we leverage the green channel’s richer signal to enhance local details, aligning with the human eye’s sensitivity and Bayer filter design. Extensive experiments on multiple RAW object detection datasets and detectors demonstrate that SimROD outperforms state-of-the-art methods like RAW-Adapter and DIAP while maintaining efficiency. Our work highlights the potential of RAW data for real-world object detection. Code is available at https://ocean146.github.io/SimROD2025/.
Paper and project links
PDF Accepted by AAAI 2026. Code is available at https://ocean146.github.io/SimROD2025/
Summary
Most visual models are designed for sRGB images, but RAW data preserves sensor information before ISP processing, enabling more accurate object detection and more efficient hardware designs. To handle the challenges of RAW object detection, such as limited training data, unbalanced pixel distributions, and sensor noise, the paper proposes SimROD. It introduces a Global Gamma Enhancement (GGE) module with a learnable global gamma transformation and exploits the green channel's richer signal to enhance local details. Experiments on multiple RAW object detection datasets and detectors show that SimROD outperforms existing methods such as RAW-Adapter and DIAP while remaining efficient.
Key Takeaways
- Object detection on RAW data is a promising research direction.
- RAW data preserves sensor information before ISP processing, which helps detection accuracy and hardware efficiency.
- SimROD addresses the challenges of RAW object detection with a Global Gamma Enhancement module and by exploiting the green channel's richer signal (a toy sketch of a learnable gamma transform follows this list).
- SimROD performs strongly across multiple RAW object detection datasets and detectors.
- SimROD is more efficient and accurate than methods such as RAW-Adapter and DIAP.
- The design takes cues from the human visual system, such as the eye's sensitivity to green, and from the Bayer filter layout.
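The abstract states that the Global Gamma Enhancement module applies a learnable global gamma transformation with only four parameters, without naming them. The sketch below picks gain, gamma, offset, and a blend weight as one hypothetical four-parameter choice; it illustrates the idea rather than SimROD's actual module.

```python
import torch
import torch.nn as nn

class GlobalGammaEnhancement(nn.Module):
    """Learnable global gamma curve with four scalar parameters.

    The parameterization (gain, gamma, offset, blend) is an assumption; the
    paper only states that the transform has four learnable parameters.
    """
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.tensor(1.0))
        self.gamma = nn.Parameter(torch.tensor(0.45))   # start near the common 1/2.2 curve
        self.offset = nn.Parameter(torch.tensor(0.0))
        self.blend = nn.Parameter(torch.tensor(0.5))

    def forward(self, raw):
        # raw: (B, C, H, W), assumed normalized to [0, 1]
        curved = self.gain * raw.clamp(min=1e-6) ** self.gamma + self.offset
        w = torch.sigmoid(self.blend)                   # keep the blend weight in (0, 1)
        return w * curved + (1 - w) * raw

x = torch.rand(2, 4, 64, 64)                            # fake packed RAW in [0, 1]
print(GlobalGammaEnhancement()(x).shape)
```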
View paper screenshots
SAQ-SAM: Semantically-Aligned Quantization for Segment Anything Model
Authors:Jing Zhang, Zhikai Li, Chengzhi Hu, Xuewen Liu, Qingyi Gu
Segment Anything Model (SAM) exhibits remarkable zero-shot segmentation capability; however, its prohibitive computational costs make edge deployment challenging. Although post-training quantization (PTQ) offers a promising compression solution, existing methods yield unsatisfactory results when applied to SAM, owing to its specialized model components and promptable workflow: (i) The mask decoder’s attention exhibits extreme activation outliers, and we find that aggressive clipping (even 100x), without smoothing or isolation, is effective in suppressing outliers while maintaining performance. Unfortunately, traditional distribution-based metrics (e.g., MSE) fail to provide such large-scale clipping. (ii) Existing quantization reconstruction methods neglect semantic interactivity of SAM, leading to misalignment between image feature and prompt intention. To address the above issues, we propose SAQ-SAM in this paper, which boosts PTQ for SAM from the perspective of semantic alignment. Specifically, we propose Perceptual-Consistency Clipping, which exploits attention focus overlap to promote aggressive clipping while preserving semantic capabilities. Furthermore, we propose Prompt-Aware Reconstruction, which incorporates image-prompt interactions by leveraging cross-attention in mask decoder, thus facilitating alignment in both distribution and semantic. Moreover, to ensure the interaction efficiency, we design a layer-skipping strategy for image tokens in encoder. Extensive experiments are conducted on various SAM sizes and tasks, including instance segmentation, oriented object detection, and semantic segmentation, and the results show that our method consistently exhibits advantages. For example, when quantizing SAM-B to 4-bit, SAQ-SAM achieves 11.7% higher mAP than the baseline in instance segmentation task.
Paper and project links
PDF AAAI 2026. Code is available at https://github.com/jingjing0419/SAQ-SAM
Summary
The Segment Anything Model (SAM) has strong zero-shot segmentation ability, but its computational cost makes edge deployment difficult. Post-training quantization (PTQ) is a promising compression route, yet existing methods perform poorly on SAM because of its specialized components and promptable workflow. SAQ-SAM improves PTQ for SAM from the perspective of semantic alignment: Perceptual-Consistency Clipping exploits attention focus overlap to allow aggressive clipping of activation outliers while preserving semantic capability, and Prompt-Aware Reconstruction leverages cross-attention in the mask decoder to align image features with prompt intention in both distribution and semantics. A layer-skipping strategy for image tokens in the encoder keeps the interaction efficient. Experiments across SAM sizes and tasks show consistent advantages; for example, when quantizing SAM-B to 4-bit, SAQ-SAM achieves 11.7% higher mAP than the baseline on instance segmentation.
Key Takeaways
- SAM offers zero-shot segmentation, but its computational cost hinders edge deployment.
- Existing quantization methods fall short on SAM because of its specialized model components and promptable workflow.
- SAQ-SAM improves post-training quantization for SAM from the perspective of semantic alignment.
- Perceptual-Consistency Clipping suppresses extreme activation outliers through aggressive clipping while maintaining performance (a toy sketch of clipped quantization follows this list).
- Prompt-Aware Reconstruction uses cross-attention in the mask decoder to incorporate image-prompt interaction, aligning both distribution and semantics.
- A layer-skipping strategy for image tokens in the encoder keeps the interaction efficient.
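The abstract reports that aggressively clipping the mask decoder's activation outliers (even by around 100x) before quantization preserves performance, but the criterion SAQ-SAM uses to pick the clip (attention-focus overlap) is not reproduced here. The sketch shows plain symmetric uniform fake-quantization with a manual clipping ratio, just to make the idea of large-scale clipping concrete.

```python
import torch

def clip_and_quantize(x, bits=4, clip_ratio=0.01):
    """Symmetric uniform fake-quantization with aggressive range clipping.

    clip_ratio=0.01 means the quantization range is 1% of |x|.max(), i.e. a
    100x clip of the outliers. The attention-based criterion SAQ-SAM uses to
    choose this range is not implemented here.
    """
    max_val = x.abs().max() * clip_ratio
    levels = 2 ** (bits - 1) - 1                  # e.g. 7 levels per side for 4-bit signed
    scale = max_val / levels
    x_clipped = x.clamp(-max_val, max_val)
    x_int = torch.round(x_clipped / scale)
    return x_int * scale                          # de-quantized ("fake quant") output

# Toy activations with a few large outliers every 100 entries.
acts = torch.randn(1024) * torch.tensor([50.0 if i % 100 == 0 else 1.0 for i in range(1024)])
q = clip_and_quantize(acts, bits=4, clip_ratio=0.01)
print(acts.abs().max().item(), q.abs().max().item())
```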
View paper screenshots
A Framework for Real-Time Volcano-Seismic Event Recognition Based on Multi-Station Seismograms and Semantic Segmentation Models
Authors:Camilo Espinosa-Curilem, Millaray Curilem, Daniel Basualto
In volcano monitoring, effective recognition of seismic events is essential for understanding volcanic activity and raising timely warning alerts. Traditional methods rely on manual analysis, which can be subjective and labor-intensive. Furthermore, current automatic approaches often tackle detection and classification separately, mostly rely on single-station information and generally require tailored preprocessing and representations to perform predictions. These limitations often hinder their application to real-time monitoring and utilization across different volcano conditions. This study introduces a novel approach that utilizes Semantic Segmentation models to automate seismic event recognition by applying a straightforward transformation of multi-channel 1D signals into 2D representations, enabling their use as images. Our framework employs a data-driven, end-to-end design that integrates multi-station seismic data with minimal preprocessing, performing both detection and classification simultaneously for five seismic event classes. We evaluated four state-of-the-art segmentation models (UNet, UNet++, DeepLabV3+ and SwinUNet) on approximately 25,000 seismic events recorded at four different Chilean volcanoes: Nevados del Chillán Volcanic Complex, Laguna del Maule, Villarrica and Puyehue-Cordón Caulle. Among these models, the UNet architecture was identified as the most effective model, achieving mean F1 and Intersection over Union (IoU) scores of up to 0.91 and 0.88, respectively, and demonstrating superior noise robustness and model flexibility to unseen volcano datasets.
Paper and project links
PDF 10 pages, 9 figures. This is a preprint; it is currently under review for publication
Summary: This study uses semantic segmentation models to automate seismic event recognition by transforming multi-channel 1D signals into 2D representations that can be treated as images. The data-driven, end-to-end framework integrates multi-station seismic data with minimal preprocessing and performs detection and classification of five seismic event classes simultaneously. Four state-of-the-art segmentation models were evaluated on seismic events recorded at four Chilean volcanoes; the UNet architecture performed best, with mean F1 and IoU scores of up to 0.91 and 0.88, and showed strong noise robustness and flexibility on unseen volcano data.
Key Takeaways:
- Background: effective recognition of seismic events is essential in volcano monitoring for understanding activity and issuing timely warnings; traditional manual analysis is subjective and labor-intensive.
- Current challenge: existing automatic approaches usually handle detection and classification separately, rely on single-station information, and need tailored preprocessing and representations, which hinders real-time use across different volcanoes.
- New approach: semantic segmentation models recognize seismic events automatically after a straightforward transformation of multi-channel 1D signals into 2D image-like representations (a toy sketch follows this list).
- Data-driven design: the end-to-end framework integrates multi-station data with minimal preprocessing and performs detection and classification of five event classes simultaneously.
- Model evaluation: four segmentation models were tested on events from four Chilean volcanoes, and the UNet architecture performed best.
- Performance: UNet reached mean F1 and IoU scores of up to 0.91 and 0.88.
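The abstract says multi-channel 1D seismograms are transformed into 2D representations so segmentation models can treat them as images, without detailing the mapping. The sketch below stacks per-station channels as image rows after per-channel standardization, which is one straightforward reading of that transformation; the representation actually used in the paper may differ.

```python
import numpy as np

def seismograms_to_image(waveforms, target_height=64):
    """Stack multi-station 1D seismograms into a 2D "image".

    waveforms: (n_channels, n_samples) array. Each channel is standardized and
    repeated so the result has roughly target_height rows. This is only one
    simple way to turn multi-channel 1D signals into a 2D representation.
    """
    x = np.asarray(waveforms, dtype=np.float32)
    # Per-channel standardization so quiet and loud stations are comparable.
    x = (x - x.mean(axis=1, keepdims=True)) / (x.std(axis=1, keepdims=True) + 1e-8)
    rows_per_channel = max(1, target_height // x.shape[0])
    image = np.repeat(x, rows_per_channel, axis=0)      # (rows, n_samples)
    return image[None]                                  # add a channel axis: (1, rows, n_samples)

fake = np.random.randn(12, 3000)                        # 12 channels, 3000 samples (~30 s at 100 Hz)
img = seismograms_to_image(fake)
print(img.shape)  # (1, 60, 3000)
```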