⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-09-24 更新
An Empirical Study on the Robustness of YOLO Models for Underwater Object Detection
Authors:Edwine Nabahirwa, Wei Song, Minghua Zhang, Shufan Chen
Underwater object detection (UOD) remains a critical challenge in computer vision due to underwater distortions which degrade low-level features and compromise the reliability of even state-of-the-art detectors. While YOLO models have become the backbone of real-time object detection, little work has systematically examined their robustness under these uniquely challenging conditions. This raises a critical question: Are YOLO models genuinely robust when operating under the chaotic and unpredictable conditions of underwater environments? In this study, we present one of the first comprehensive evaluations of recent YOLO variants (YOLOv8-YOLOv12) across six simulated underwater environments. Using a unified dataset of 10,000 annotated images from DUO and Roboflow100, we not only benchmark model robustness but also analyze how distortions affect key low-level features such as texture, edges, and color. Our findings show that (1) YOLOv12 delivers the strongest overall performance but is highly vulnerable to noise, and (2) noise disrupts edge and texture features, explaining the poor detection performance in noisy images. Class imbalance is a persistent challenge in UOD. Experiments revealed that (3) image counts and instance frequency primarily drive detection performance, while object appearance exerts only a secondary influence. Finally, we evaluated lightweight training-aware strategies: noise-aware sample injection, which improves robustness in both noisy and real-world conditions, and fine-tuning with advanced enhancement, which boosts accuracy in enhanced domains but slightly lowers performance in original data, demonstrating strong potential for domain adaptation, respectively. Together, these insights provide practical guidance for building resilient and cost-efficient UOD systems.
由于水下失真会破坏低级特征并损害即使是最先进检测器的可靠性,水下目标检测(UOD)仍然是计算机视觉中的一个关键挑战。虽然YOLO模型已成为实时目标检测的核心,但很少有工作系统地研究它们在这些独特挑战条件下的稳健性。这引发了一个关键问题:YOLO模型在水下环境混乱且不可预测的条件下真的稳健吗?在这项研究中,我们在六种模拟水下环境中对最近的YOLO变体(YOLOv8-YOLOv12)进行了最早的全面评估之一。我们使用来自DUO和Roboflow100、包含10,000张标注图像的统一数据集,不仅评估了模型的稳健性,还分析了失真如何影响纹理、边缘和颜色等关键低级特征。我们的研究发现:(1)YOLOv12总体性能最强,但对噪声高度敏感;(2)噪声会扰乱边缘和纹理特征,这解释了噪声图像中检测性能不佳的原因。类别不平衡是UOD中的持久挑战。实验表明,(3)图像数量和实例频率是检测性能的主要驱动因素,而对象外观的影响较为次要。最后,我们评估了两种轻量级的训练感知策略:噪声感知样本注入提高了模型在噪声和真实环境下的稳健性;使用先进增强技术进行微调则提高了增强域上的准确性,但略微降低了在原始数据上的性能,展现出较强的域适应潜力。这些见解共同为构建具有弹性且成本高效的水下目标检测系统提供了实用指导。
论文及项目相关链接
PDF 28 Pages, 12 Figures
摘要
本文研究了水下目标检测(UOD)的挑战性问题,探讨了YOLO模型在水下环境中的稳健性。研究通过对YOLO的不同变体(YOLOv8至YOLOv12)在六个模拟水下环境中的综合评估,发现YOLOv12整体性能最佳但易受噪声影响。同时,研究分析了水下失真对纹理、边缘和颜色等关键低层次特征的影响。此外,实验表明图像数量和实例频率对检测性能有主要影响,而物体外观的影响较小。最后,研究评估了轻量级训练感知策略,如噪声感知样本注入和高级增强微调等,为提高水下目标检测的鲁棒性和效率提供了实际指导。
关键见解
- YOLOv12在水下环境中整体性能最佳,但在噪声干扰方面存在较高脆弱性。
- 噪声对边缘和纹理特征的破坏影响了水下目标检测的准确性。
- 在水下目标检测中,图像数量和实例频率对检测性能起主要作用。
- 物体外观对检测性能的影响相对次要。
- 噪声感知样本注入策略提高了模型在噪声和真实环境下的稳健性。
- 采用高级增强微调策略在增强领域提高了精度,但略微降低了原始数据的性能。
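针对上文提到的"噪声感知样本注入"策略,下面给出一个极简的 Python 草图(假设性实现,并非论文官方代码):按一定比例向部分训练图像注入高斯或椒盐噪声,再与原始样本混合用于训练;噪声类型、强度和注入比例等参数均为示例取值。

```python
import numpy as np

def inject_noise(img: np.ndarray, kind: str = "gaussian", sigma: float = 25.0,
                 sp_ratio: float = 0.02) -> np.ndarray:
    """对单张 uint8 图像注入噪声(高斯或椒盐),返回新图像。"""
    out = img.astype(np.float32)
    if kind == "gaussian":
        out += np.random.normal(0.0, sigma, img.shape)
    elif kind == "salt_pepper":
        mask = np.random.rand(*img.shape[:2])
        out[mask < sp_ratio / 2] = 0          # 椒噪声
        out[mask > 1 - sp_ratio / 2] = 255    # 盐噪声
    return np.clip(out, 0, 255).astype(np.uint8)

def build_noise_aware_set(images, inject_ratio: float = 0.3, seed: int = 0):
    """按 inject_ratio 比例随机挑选样本注入噪声,其余保持原样(标注框不变)。"""
    rng = np.random.default_rng(seed)
    out = []
    for img in images:
        if rng.random() < inject_ratio:
            kind = rng.choice(["gaussian", "salt_pepper"])
            out.append(inject_noise(img, kind=kind))
        else:
            out.append(img)
    return out
```

实际使用时,注入比例与噪声强度需在验证集上调参;由于噪声不改变目标位置,标注框可直接沿用。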
点此查看论文截图




DepTR-MOT: Unveiling the Potential of Depth-Informed Trajectory Refinement for Multi-Object Tracking
Authors:Buyin Deng, Lingxin Huang, Kai Luo, Fei Teng, Kailun Yang
Visual Multi-Object Tracking (MOT) is a crucial component of robotic perception, yet existing Tracking-By-Detection (TBD) methods often rely on 2D cues, such as bounding boxes and motion modeling, which struggle under occlusions and close-proximity interactions. Trackers relying on these 2D cues are particularly unreliable in robotic environments, where dense targets and frequent occlusions are common. While depth information has the potential to alleviate these issues, most existing MOT datasets lack depth annotations, leading to its underexploited role in the domain. To unveil the potential of depth-informed trajectory refinement, we introduce DepTR-MOT, a DETR-based detector enhanced with instance-level depth information. Specifically, we propose two key innovations: (i) foundation model-based instance-level soft depth label supervision, which refines depth prediction, and (ii) the distillation of dense depth maps to maintain global depth consistency. These strategies enable DepTR-MOT to output instance-level depth during inference, without requiring foundation models and without additional computational cost. By incorporating depth cues, our method enhances the robustness of the TBD paradigm, effectively resolving occlusion and close-proximity challenges. Experiments on both the QuadTrack and DanceTrack datasets demonstrate the effectiveness of our approach, achieving HOTA scores of 27.59 and 44.47, respectively. In particular, results on QuadTrack, a robotic platform MOT dataset, highlight the advantages of our method in handling occlusion and close-proximity challenges in robotic tracking. The source code will be made publicly available at https://github.com/warriordby/DepTR-MOT.
视觉多目标跟踪(MOT)是机器人感知的重要组成部分,但现有的基于检测的跟踪(TBD)方法往往依赖于边界框和运动建模等二维线索,在遮挡和近距离交互的情况下表现不佳。依赖这些二维线索的跟踪器在密集目标和频繁遮挡常见的机器人环境中尤其不可靠。虽然深度信息有潜力缓解这些问题,但大多数现有的MOT数据集缺乏深度标注,导致深度信息在该领域未被充分利用。为了揭示深度信息辅助轨迹修正的潜力,我们引入了DepTR-MOT,一种利用实例级深度信息增强的基于DETR的检测器。具体来说,我们提出了两项关键创新:(i)基于基础模型的实例级软深度标签监督,用于优化深度预测;(ii)稠密深度图的蒸馏,以保持全局深度一致性。这些策略使DepTR-MOT能够在推理过程中输出实例级深度,而无需依赖基础模型,也不增加额外的计算成本。通过融入深度线索,我们的方法提高了TBD范式的稳健性,有效地解决了遮挡和近距离交互带来的挑战。在QuadTrack和DanceTrack数据集上的实验证明了我们方法的有效性,分别取得了27.59和44.47的HOTA分数。特别是在机器人平台MOT数据集QuadTrack上的结果,突显了我们的方法在机器人跟踪中处理遮挡和近距离交互挑战方面的优势。源代码将公开发布在https://github.com/warriordby/DepTR-MOT。
论文及项目相关链接
PDF The source code will be made publicly available at https://github.com/warriordby/DepTR-MOT
Summary
DepTR-MOT 是一种基于 DETR、利用深度信息增强的视觉多目标跟踪检测器,在解决机器人环境中物体遮挡与近距离交互的跟踪问题方面效果显著。它通过引入实例级深度信息来优化轨迹,并借助基于基础模型的实例级软深度标签监督和稠密深度图蒸馏来保持全局深度一致性。在QuadTrack和DanceTrack数据集上的实验结果表明了该方法的有效性。
Key Takeaways
- 视觉多目标跟踪(MOT)是机器人感知的重要组成部分。现有的Tracking-By-Detection(TBD)方法主要依赖二维线索如边界框和运动建模,这在遮挡和近距离交互情况下存在挑战。
- 深度信息能增强跟踪器的性能,但大多数现有的MOT数据集缺乏深度标注,限制了其在该领域的应用。
- DepTR-MOT是一种基于DETR的检测器,引入了实例级深度信息来提升轨迹精细度。其两大关键创新为基础模型辅助的实例级软深度标签监督和用于维持全局深度一致性的深度图蒸馏方法。
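下面用一个假设性的损失组合来示意"实例级软深度标签监督 + 稠密深度图蒸馏"的思路(张量形状与权重均为示例,非 DepTR-MOT 官方实现):

```python
import torch
import torch.nn.functional as F

def depth_losses(pred_inst_depth, soft_inst_depth, pred_dense_depth, teacher_dense_depth,
                 w_inst: float = 1.0, w_distill: float = 0.5):
    """
    pred_inst_depth:     (N,) 检测头为每个实例预测的深度
    soft_inst_depth:     (N,) 由深度基础模型生成的实例级软深度标签
    pred_dense_depth:    (B,1,H,W) 模型预测的稠密深度图
    teacher_dense_depth: (B,1,H,W) 教师(深度基础模型)给出的稠密深度图
    """
    # 实例级软标签监督:对每个目标的深度做回归
    loss_inst = F.smooth_l1_loss(pred_inst_depth, soft_inst_depth)
    # 稠密深度蒸馏:保持全局深度一致性
    loss_distill = F.l1_loss(pred_dense_depth, teacher_dense_depth)
    return w_inst * loss_inst + w_distill * loss_distill
```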
点此查看论文截图





Enhanced Detection of Tiny Objects in Aerial Images
Authors:Kihyun Kim, Michalis Lazarou, Tania Stathaki
While one-stage detectors like YOLOv8 offer fast training speed, they often under-perform on detecting small objects as a trade-off. This becomes even more critical when detecting tiny objects in aerial imagery due to low-resolution targets and cluttered backgrounds. To address this, we introduce three enhancement strategies – input image resolution adjustment, data augmentation, and attention mechanisms – that can be easily implemented on YOLOv8. We demonstrate that image size enlargement and the proper use of augmentation can lead to enhancement. Additionally, we designed a Mixture of Orthogonal Neural-modules Network (MoonNet) pipeline which consists of attention-augmented CNNs. Two well-known attention modules, the Squeeze-and-Excitation Block (SE Block) and the Convolutional Block Attention Module (CBAM), were integrated into the backbone of YOLOv8 with an increased number of channels, and the MoonNet backbone obtained improved detection accuracy compared to the original YOLOv8. MoonNet further proved its adaptability and potential by achieving state-of-the-art performance on a tiny-object benchmark when integrated with the YOLC model. Our codes are available at: https://github.com/Kihyun11/MoonNet
尽管像YOLOv8这样的单阶段检测器训练速度快,但作为代价,它们在检测小物体时往往表现不佳。在检测航拍图像中的微小物体时,这一问题因目标分辨率低且背景杂乱而变得更加突出。为了解决这个问题,我们引入了三种可以很容易在YOLOv8上实现的增强策略:输入图像分辨率调整、数据增强和注意力机制。我们证明,增大图像尺寸和恰当使用数据增强可以带来性能提升。此外,我们设计了一个由注意力增强CNN组成的正交神经模块混合网络(MoonNet)流水线。我们将两个知名的注意力模块,即挤压-激发块(SE Block)和卷积块注意力模块(CBAM),集成到通道数有所增加的YOLOv8主干网络中,MoonNet主干网络相比原始YOLOv8获得了更高的检测精度。当与YOLC模型集成时,MoonNet在一个微小物体基准测试上达到了最先进的性能,进一步证明了其适应性和潜力。我们的代码可在https://github.com/Kihyun11/MoonNet获取。
论文及项目相关链接
Summary
针对YOLOv8在检测小目标时的不足,提出三种增强策略:调整输入图像分辨率、数据增强和注意力机制。设计了一种名为MoonNet的神经网络管道,由增强注意力CNN组成。将两种知名注意力模块——压缩激励块(SE Block)和卷积块注意力模块(CBAM)——融入YOLOv8的骨干网络,提高检测精度。MoonNet与YOLC模型结合后,在小目标基准测试中达到了卓越的性能。
Key Takeaways
- YOLOv8虽然训练速度快,但在检测小目标时性能较差。
- 为解决这一问题,提出了三种增强策略:调整输入图像分辨率、数据增强和引入注意力机制。
- 通过增大图像尺寸和适当的数据增强,可以实现性能提升。
- 设计了名为MoonNet的神经网络管道,由增强注意力的CNN组成。
- 将SE Block和CBAM两种注意力模块融入YOLOv8的骨干网络。
- MoonNet骨干网络相较于原始YOLOv8,提高了检测精度。
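作为参考,下面给出 SE Block(Squeeze-and-Excitation)的一个常见 PyTorch 写法,仅用于说明通道注意力如何插入主干网络,并非 MoonNet 的官方实现;CBAM 可理解为在此基础上再串联一个空间注意力分支。

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation:先全局池化"压缩"空间信息,再用两层全连接"激励"出逐通道权重。"""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                      # 按通道重新加权特征图

# 用法示意:插入到主干网络某个卷积块之后
feat = torch.randn(2, 64, 80, 80)
print(SEBlock(64)(feat).shape)            # torch.Size([2, 64, 80, 80])
```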
点此查看论文截图




Prototype-Based Pseudo-Label Denoising for Source-Free Domain Adaptation in Remote Sensing Semantic Segmentation
Authors:Bin Wang, Fei Deng, Zeyu Chen, Zhicheng Yu, Yiguang Liu
Source-Free Domain Adaptation (SFDA) enables domain adaptation for semantic segmentation of Remote Sensing Images (RSIs) using only a well-trained source model and unlabeled target domain data. However, the lack of ground-truth labels in the target domain often leads to the generation of noisy pseudo-labels. Such noise impedes the effective mitigation of domain shift (DS). To address this challenge, we propose ProSFDA, a prototype-guided SFDA framework. It employs prototype-weighted pseudo-labels to facilitate reliable self-training (ST) under pseudo-labels noise. We, in addition, introduce a prototype-contrast strategy that encourages the aggregation of features belonging to the same class, enabling the model to learn discriminative target domain representations without relying on ground-truth supervision. Extensive experiments show that our approach substantially outperforms existing methods.
无源域自适应(SFDA)技术仅利用预先训练好的源模型和未标记的目标域数据,实现了遥感图像语义分割的域自适应。然而,目标域中缺乏真实标签通常会导致生成带有噪声的伪标签。这种噪声阻碍了域偏移(DS)的有效缓解。为了解决这一挑战,我们提出了ProSFDA,一种原型引导的无源域自适应(SFDA)框架。它采用原型加权伪标签,在伪标签噪声下实现可靠的自训练(ST)。此外,我们还引入了一种原型对比策略,该策略鼓励属于同一类的特征进行聚合,使模型能够在不依赖真实标签监督的情况下,学习具有区分度的目标域表示。大量实验表明,我们的方法显著优于现有方法。
论文及项目相关链接
Summary
本文探讨了无源域自适应(SFDA)在遥感图像语义分割中的应用,提出了ProSFDA框架,采用原型引导的伪标签加权方法,在伪标签噪声下实现可靠的自训练(ST),同时引入原型对比策略,鼓励同类特征的聚合,学习具有鉴别力的目标域表示。该框架在不依赖真实标签监督的情况下表现优越。
Key Takeaways
- 无源域自适应(SFDA)能够实现遥感图像语义分割的域适应。
- 无标签目标域数据产生的伪标签噪声是域适应中的一个挑战。
- ProSFDA框架采用原型引导的伪标签加权方法,实现可靠自训练(ST)。
- 原型对比策略有助于同类特征的聚合,提高模型性能。
- 该框架能够在没有真实标签监督的情况下学习具有鉴别力的目标域表示。
- ProSFDA框架在大量实验中表现出优越性能,显著优于现有方法。
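"原型加权伪标签"这一思想可以用下面的简化草图来示意(假设性实现,非 ProSFDA 官方代码):以各类特征均值为原型,用像素特征与其伪标签类别原型的余弦相似度给自训练损失加权,从而降低噪声伪标签的影响。

```python
import torch
import torch.nn.functional as F

def prototype_weighted_ce(feats, logits, pseudo_labels, num_classes: int):
    """
    feats:         (B,C,H,W) 目标域特征
    logits:        (B,K,H,W) 分割头输出
    pseudo_labels: (B,H,W)   源模型生成的伪标签(long 类型,可能含噪声)
    """
    B, C, H, W = feats.shape
    f = F.normalize(feats, dim=1).permute(0, 2, 3, 1).reshape(-1, C)   # (BHW,C)
    y = pseudo_labels.reshape(-1)                                       # (BHW,)

    # 1) 用伪标签估计各类原型(类内特征均值)
    protos = torch.zeros(num_classes, C, device=feats.device)
    for k in range(num_classes):
        m = (y == k)
        if m.any():
            protos[k] = f[m].mean(0)
    protos = F.normalize(protos, dim=1)

    # 2) 像素特征与其伪标签类原型的相似度作为权重(低相似度像素被降权)
    w = (f * protos[y]).sum(1).clamp(min=0)                             # (BHW,)

    # 3) 加权交叉熵自训练损失
    ce = F.cross_entropy(logits.permute(0, 2, 3, 1).reshape(-1, num_classes),
                         y, reduction="none")
    return (w * ce).sum() / (w.sum() + 1e-6)
```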
点此查看论文截图



SegDINO3D: 3D Instance Segmentation Empowered by Both Image-Level and Object-Level 2D Features
Authors:Jinyuan Qu, Hongyang Li, Xingyu Chen, Shilong Liu, Yukai Shi, Tianhe Ren, Ruitao Jing, Lei Zhang
In this paper, we present SegDINO3D, a novel Transformer encoder-decoder framework for 3D instance segmentation. As 3D training data is generally not as sufficient as 2D training images, SegDINO3D is designed to fully leverage 2D representation from a pre-trained 2D detection model, including both image-level and object-level features, for improving 3D representation. SegDINO3D takes both a point cloud and its associated 2D images as input. In the encoder stage, it first enriches each 3D point by retrieving 2D image features from its corresponding image views and then leverages a 3D encoder for 3D context fusion. In the decoder stage, it formulates 3D object queries as 3D anchor boxes and performs cross-attention from 3D queries to 2D object queries obtained from 2D images using the 2D detection model. These 2D object queries serve as a compact object-level representation of 2D images, effectively avoiding the challenge of keeping thousands of image feature maps in the memory while faithfully preserving the knowledge of the pre-trained 2D model. The introducing of 3D box queries also enables the model to modulate cross-attention using the predicted boxes for more precise querying. SegDINO3D achieves the state-of-the-art performance on the ScanNetV2 and ScanNet200 3D instance segmentation benchmarks. Notably, on the challenging ScanNet200 dataset, SegDINO3D significantly outperforms prior methods by +8.7 and +6.8 mAP on the validation and hidden test sets, respectively, demonstrating its superiority.
本文提出了一种用于3D实例分割的新型Transformer编码器-解码器框架SegDINO3D。由于3D训练数据通常不如2D训练图像充足,SegDINO3D旨在充分利用预训练2D检测模型的2D表示(包括图像级和对象级特征)来改进3D表示。SegDINO3D以点云及其关联的2D图像作为输入。在编码器阶段,它首先通过从对应的图像视图中检索2D图像特征来丰富每个3D点,然后利用3D编码器进行3D上下文融合。在解码器阶段,它将3D对象查询表述为3D锚框,并对由2D检测模型从2D图像中获得的2D对象查询执行交叉注意力。这些2D对象查询作为2D图像的紧凑对象级表示,有效地避免了在内存中保留数千张图像特征图的难题,同时忠实保留了预训练2D模型的知识。3D框查询的引入还使模型能够利用预测的框调制交叉注意力,从而实现更精确的查询。SegDINO3D在ScanNetV2和ScanNet200 3D实例分割基准测试中取得了最先进的性能。值得注意的是,在具有挑战性的ScanNet200数据集上,SegDINO3D在验证集和隐藏测试集上的mAP分别比先前方法高出8.7和6.8,显示了其优越性。
论文及项目相关链接
Summary
SegDINO3D是一个用于三维实例分割的新型Transformer编码器-解码器框架。它充分利用二维检测模型的图像级和对象级特征来提升三维表示。该框架接受点云和相关的二维图像作为输入,通过从相应图像视图检索二维图像特征来丰富每个三维点,并通过三维编码器进行三维上下文融合。在解码阶段,它将三维对象查询表述为三维锚框,并对借助二维检测模型从二维图像中获得的二维对象查询执行交叉注意力。SegDINO3D在ScanNetV2和ScanNet200三维实例分割基准测试中取得了最先进的性能,尤其在具有挑战性的ScanNet200数据集上表现出显著优势。
Key Takeaways
- SegDINO3D是一个基于Transformer的编码器-解码器框架,用于三维实例分割。
- 该框架旨在充分利用二维检测模型的图像级和对象级特征来提高三维表现。
- SegDINO3D接受点云和相关的二维图像作为输入,并通过三维编码器进行上下文融合。
- 解码阶段利用三维对象查询和交叉注意力机制结合二维对象查询,避免保持大量图像特征图在内存中的挑战。
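下面用 PyTorch 自带的 MultiheadAttention 给出"3D 查询对 2D 对象查询做交叉注意力"的最小示意(维度与层结构均为假设,非 SegDINO3D 官方实现),其中 2D 对象查询充当图像的紧凑对象级表示。

```python
import torch
import torch.nn as nn

class Query3DTo2DCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q3d: torch.Tensor, q2d: torch.Tensor) -> torch.Tensor:
        """
        q3d: (B, N3, C) 由3D锚框初始化的3D对象查询
        q2d: (B, N2, C) 2D检测模型输出的对象查询(紧凑的对象级图像表示)
        """
        out, _ = self.attn(query=q3d, key=q2d, value=q2d)
        return self.norm(q3d + out)       # 残差 + 归一化

# 用法示意
q3d = torch.randn(2, 100, 256)            # 每个场景 100 个 3D 查询
q2d = torch.randn(2, 300, 256)            # 由多视角图像聚合的 2D 对象查询
print(Query3DTo2DCrossAttention()(q3d, q2d).shape)   # torch.Size([2, 100, 256])
```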
点此查看论文截图



The Missing Piece: A Case for Pre-Training in 3D Medical Object Detection
Authors:Katharina Eckstein, Constantin Ulrich, Michael Baumgartner, Jessica Kächele, Dimitrios Bounias, Tassilo Wald, Ralf Floca, Klaus H. Maier-Hein
Large-scale pre-training holds the promise to advance 3D medical object detection, a crucial component of accurate computer-aided diagnosis. Yet, it remains underexplored compared to segmentation, where pre-training has already demonstrated significant benefits. Existing pre-training approaches for 3D object detection rely on 2D medical data or natural image pre-training, failing to fully leverage 3D volumetric information. In this work, we present the first systematic study of how existing pre-training methods can be integrated into state-of-the-art detection architectures, covering both CNNs and Transformers. Our results show that pre-training consistently improves detection performance across various tasks and datasets. Notably, reconstruction-based self-supervised pre-training outperforms supervised pre-training, while contrastive pre-training provides no clear benefit for 3D medical object detection. Our code is publicly available at: https://github.com/MIC-DKFZ/nnDetection-finetuning.
大规模预训练有望推动三维医学目标检测的发展,而后者是准确的计算机辅助诊断的关键组成部分。然而,与预训练已展现出显著收益的分割任务相比,这一方向仍然缺乏深入研究。现有的三维目标检测预训练方法依赖于二维医学数据或自然图像预训练,未能充分利用三维体数据信息。在这项工作中,我们首次系统地研究了如何将现有预训练方法集成到最先进的检测架构中,涵盖卷积神经网络和Transformer。我们的结果表明,预训练在各种任务和数据集上都能稳定提升检测性能。值得注意的是,基于重建的自监督预训练优于有监督预训练,而对比式预训练对三维医学目标检测没有带来明显收益。我们的代码公开在:https://github.com/MIC-DKFZ/nnDetection-finetuning。
论文及项目相关链接
PDF MICCAI 2025
Summary
大规模预训练有望推动3D医疗目标检测的发展,这是计算机辅助诊断的重要组成部分。然而,与分割相比,预训练在目标检测方面的应用仍被较少探索,尽管在分割任务中预训练已经显示出明显的优势。现有的预训练策略主要依赖于二维医学数据或自然图像预训练,未能充分利用三维体积信息。本研究首次系统地探讨了如何将现有预训练技术融入先进的检测架构中,包括卷积神经网络和Transformer。研究结果表明,预训练在各种任务和数据集上都能提高检测性能。值得注意的是,基于重建的自监督预训练方法表现优于监督预训练,而对比预训练对3D医疗目标检测没有明确的优势。我们的代码已公开在:https://github.com/MIC-DKFZ/nnDetection-finetuning上可用。
Key Takeaways
- 大型预训练对提升3D医疗目标检测至关重要。
- 与分割任务相比,预训练在目标检测中的应用仍被较少探索。
- 当前预训练策略主要依赖二维医学数据或自然图像,未充分利用三维体积信息。
- 研究系统地探讨了预训练技术与先进检测架构的结合。
- 预训练能提高各种任务和数据集的检测性能。
- 基于重建的自监督预训练方法表现优于监督预训练。
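作为"基于重建的自监督预训练"的直观说明,下面给出一个对 3D 体数据做随机遮挡再重建的最小训练步骤草图(遮挡块大小、比例与损失形式均为假设,并非论文实际采用的具体方法):

```python
import torch
import torch.nn.functional as F

def mask_volume(vol: torch.Tensor, patch: int = 16, ratio: float = 0.6):
    """随机将 3D 体数据中约 ratio 比例的立方块置零,返回(被遮挡的输入, 遮挡掩码)。vol: (B,C,D,H,W)。"""
    B, C, D, H, W = vol.shape
    coarse = torch.rand(B, 1, D // patch, H // patch, W // patch, device=vol.device)
    mask = F.interpolate((coarse < ratio).float(), size=(D, H, W), mode="nearest")
    return vol * (1 - mask), mask

def reconstruction_pretrain_step(encoder_decoder, vol, optimizer):
    """一次预训练迭代:只在被遮挡区域上计算重建损失。encoder_decoder 为任意 3D U-Net 风格网络。"""
    masked, mask = mask_volume(vol)
    recon = encoder_decoder(masked)
    loss = ((recon - vol) ** 2 * mask).sum() / (mask.sum() + 1e-6)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```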
点此查看论文截图



MCOD: The First Challenging Benchmark for Multispectral Camouflaged Object Detection
Authors:Yang Li, Tingfa Xu, Shuyan Bai, Peifu Liu, Jianan Li
Camouflaged Object Detection (COD) aims to identify objects that blend seamlessly into natural scenes. Although RGB-based methods have advanced, their performance remains limited under challenging conditions. Multispectral imagery, providing rich spectral information, offers a promising alternative for enhanced foreground-background discrimination. However, existing COD benchmark datasets are exclusively RGB-based, lacking essential support for multispectral approaches, which has impeded progress in this area. To address this gap, we introduce MCOD, the first challenging benchmark dataset specifically designed for multispectral camouflaged object detection. MCOD features three key advantages: (i) Comprehensive challenge attributes: It captures real-world difficulties such as small object sizes and extreme lighting conditions commonly encountered in COD tasks. (ii) Diverse real-world scenarios: The dataset spans a wide range of natural environments to better reflect practical applications. (iii) High-quality pixel-level annotations: Each image is manually annotated with precise object masks and corresponding challenge attribute labels. We benchmark eleven representative COD methods on MCOD, observing a consistent performance drop due to increased task difficulty. Notably, integrating multispectral modalities substantially alleviates this degradation, highlighting the value of spectral information in enhancing detection robustness. We anticipate MCOD will provide a strong foundation for future research in multispectral camouflaged object detection. The dataset is publicly accessible at https://github.com/yl2900260-bit/MCOD.
伪装目标检测(COD)旨在识别无缝融入自然场景的物体。尽管基于RGB的方法已经取得进展,但在具有挑战性的条件下,其性能仍然受限。多光谱成像提供了丰富的光谱信息,为增强前景-背景判别提供了有前景的替代方案。然而,现有的COD基准数据集均基于RGB,缺乏对多光谱方法的必要支持,这阻碍了该领域的进步。为了弥补这一空白,我们引入了MCOD,这是首个专为多光谱伪装目标检测设计的具有挑战性的基准数据集。MCOD具有三个主要优点:(i)全面的挑战属性:它涵盖了COD任务中常见的现实难点,例如小目标尺寸和极端光照条件;(ii)多样化的现实场景:数据集覆盖广泛的自然环境,以更好地反映实际应用;(iii)高质量的像素级标注:每张图像都经过人工标注,提供精确的物体掩膜和相应的挑战属性标签。我们在MCOD上对十一种代表性的COD方法进行了基准测试,观察到由于任务难度增加而导致的性能普遍下降。值得注意的是,引入多光谱模态可显著缓解这种性能退化,凸显了光谱信息在增强检测稳健性方面的价值。我们预计MCOD将为未来多光谱伪装目标检测研究提供坚实的基础。数据集可在https://github.com/yl2900260-bit/MCOD公开获取。
论文及项目相关链接
Summary:针对伪装物体检测的挑战性任务,引入了首个专为多光谱伪装物体检测设计的基准数据集MCOD。MCOD具有三大优势:全面的挑战属性、多样化的现实世界场景以及高质量像素级别的标注。研究中观察到现有COD方法在MCOD上的性能下降,而多光谱模态的整合有助于提高检测的鲁棒性。
Key Takeaways:
- COD的目标:识别融入自然场景中的物体。
- RGB方法在面临挑战性条件下性能受限,需寻找其他途径改进前景背景识别技术。
- 多光谱成像可以提供丰富的光谱信息,有助于增强前景背景鉴别能力。
- 当前COD基准数据集缺乏针对多光谱方法的支持,限制了该领域的进展。
- MCOD数据集的特点包括全面的挑战属性、多样化的现实世界场景和高质量像素级别的标注。
- MCOD对代表COD方法进行评估时,观察到性能下降,多光谱模态整合可缓解此问题。
点此查看论文截图








Towards Size-invariant Salient Object Detection: A Generic Evaluation and Optimization Approach
Authors:Shilong Bao, Qianqian Xu, Feiran Li, Boyu Han, Zhiyong Yang, Xiaochun Cao, Qingming Huang
This paper investigates a fundamental yet underexplored issue in Salient Object Detection (SOD): the size-invariant property for evaluation protocols, particularly in scenarios when multiple salient objects of significantly different sizes appear within a single image. We first present a novel perspective to expose the inherent size sensitivity of existing widely used SOD metrics. Through careful theoretical derivations, we show that the evaluation outcome of an image under current SOD metrics can be essentially decomposed into a sum of several separable terms, with the contribution of each term being directly proportional to its corresponding region size. Consequently, the prediction errors would be dominated by the larger regions, while smaller yet potentially more semantically important objects are often overlooked, leading to biased performance assessments and practical degradation. To address this challenge, a generic Size-Invariant Evaluation (SIEva) framework is proposed. The core idea is to evaluate each separable component individually and then aggregate the results, thereby effectively mitigating the impact of size imbalance across objects. Building upon this, we further develop a dedicated optimization framework (SIOpt), which adheres to the size-invariant principle and significantly enhances the detection of salient objects across a broad range of sizes. Notably, SIOpt is model-agnostic and can be seamlessly integrated with a wide range of SOD backbones. Theoretically, we also present generalization analysis of SOD methods and provide evidence supporting the validity of our new evaluation protocols. Finally, comprehensive experiments speak to the efficacy of our proposed approach. The code is available at https://github.com/Ferry-Li/SI-SOD.
本文研究了显著性目标检测(SOD)中一个基础但尚未被充分研究的问题:评估协议的尺寸不变性,特别是在单张图像中出现多个大小差异显著的显著目标时。我们首先从一个新颖的角度揭示了现有广泛使用的SOD指标固有的尺寸敏感性。通过细致的理论推导,我们表明在当前SOD指标下,一张图像的评估结果本质上可以分解为若干可分离项之和,其中每一项的贡献与其对应区域的大小成正比。因此,预测误差会被较大的区域主导,而较小但可能在语义上更重要的目标往往被忽视,从而导致有偏的性能评估和实际性能退化。为了解决这一挑战,我们提出了通用的尺寸不变评估(SIEva)框架,其核心思想是对每个可分离的组成部分单独评估后再汇总结果,从而有效减轻目标间尺寸不平衡的影响。在此基础上,我们进一步开发了一个遵循尺寸不变原则的专用优化框架(SIOpt),显著提升了对各种尺寸显著目标的检测能力。值得注意的是,SIOpt与模型无关,可以无缝集成到各种SOD主干网络中。在理论上,我们还给出了SOD方法的泛化分析,并为新评估协议的有效性提供了证据。最后,全面的实验验证了所提方法的有效性。代码可在https://github.com/Ferry-Li/SI-SOD获取。
论文及项目相关链接
Summary
本文探讨了显著目标检测(SOD)中一个基础但尚未被充分研究的问题:评估协议中的尺寸不变性,特别是在单个图像中出现多个大小显著不同的目标时的场景。文章首先提出一种新的视角来揭示现有广泛使用的SOD指标固有的尺寸敏感性。通过仔细的理论推导,文章表明当前SOD指标下的图像评估结果可以分解为几个可分离项的和,每个项的贡献与其相应区域的大小直接成比例。因此,预测误差主要由较大的区域主导,而较小但可能更具语义重要性的对象往往被忽视,导致性能评估偏差和实际性能下降。为解决此问题,文章提出了通用的尺寸不变评估(SIEva)框架。核心思想是对每个可分离组件进行个别评估,然后汇总结果,从而有效缓解对象之间尺寸不平衡的影响。在此基础上,文章进一步开发了一个专用的优化框架(SIOpt),它遵循尺寸不变原则,并显著提高了各种尺寸下显著目标的检测效果。值得注意的是,SIOpt与模型无关,可以无缝集成到广泛的SOD主干网络中。最后,大量的实验验证了所提出方法的有效性。
Key Takeaways
- 文章指出了现有显著目标检测(SOD)评估指标在尺寸敏感性方面存在的问题。
- 文章通过理论推导揭示了当前SOD指标在评估时的尺寸偏差现象。
- 文章提出了一个通用的尺寸不变评估(SIEva)框架来解决尺寸不平衡问题。
- SIEva框架能有效评估每个可分离组件并汇总结果,减少尺寸对评估的影响。
- 文章进一步提出了一个专用优化框架(SIOpt),能够显著提高各种尺寸下显著目标的检测效果。
- SIOpt与多种SOD模型兼容,可广泛集成到各种SOD方法中。
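"对每个可分离组件单独评估再汇总"的核心思想可以用下面的简化示例体会(此处用逐连通域的 IoU 代替论文中的具体指标,属于假设性实现,非 SIEva 官方代码):

```python
import numpy as np
from scipy import ndimage

def size_invariant_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """
    pred, gt: (H,W) 二值显著图。
    思路:把 GT 按连通域拆成若干独立目标,逐目标计算 IoU 后取平均,
    避免大目标主导整张图的评估结果。
    """
    labels, num = ndimage.label(gt)
    if num == 0:
        return float(pred.sum() == 0)
    ious = []
    for k in range(1, num + 1):
        comp = labels == k
        # 只在该目标的包围盒内评估,削弱其它区域的影响(简化处理)
        ys, xs = np.where(comp)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        p, g = pred[y0:y1, x0:x1] > 0, comp[y0:y1, x0:x1]
        inter, union = np.logical_and(p, g).sum(), np.logical_or(p, g).sum()
        ious.append(inter / union if union else 0.0)
    return float(np.mean(ious))
```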
点此查看论文截图



GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity
Authors:Seongheon Park, Yixuan Li
Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.
在大型视觉语言模型中,对象幻觉对其在现实应用中的安全部署构成了重大挑战。近期的研究提出了对象级幻觉评分来估计发生对象幻觉的可能性;然而,这些方法通常孤立地采用全局或局部视角,这可能限制检测的可靠性。在本文中,我们介绍了GLSim,一种新颖的免训练对象幻觉检测框架,它利用图像与文本模态之间互补的全局和局部嵌入相似度信号,从而在多样化场景中实现更准确、更可靠的幻觉检测。我们对现有的对象幻觉检测方法进行了全面的基准测试,并证明GLSim取得了更优的检测性能,以显著优势超越有竞争力的基线方法。
论文及项目相关链接
PDF NeurIPS 2025
Summary
本文介绍了对象幻觉给大型视觉语言模型带来的挑战。现有方法通常从全局或局部单一角度评估对象幻觉的可能性,可能限制检测的可靠性。本文提出了一种新型的免训练对象幻觉检测框架GLSim,它利用图像和文本模态之间的全局和局部嵌入相似度信号,提高了在各种场景下检测的准确性和可靠性。实验表明,GLSim在对象幻觉检测方面的性能优于现有方法。
Key Takeaways
- 对象幻觉在大型视觉语言模型中是一个挑战,影响其在真实世界应用的安全性。
- 现有对象幻觉检测方法主要从全局或局部单一角度进行评估,存在检测可靠性问题。
- GLSim是一种新型的免训练对象幻觉检测框架,结合了全局和局部嵌入相似性信号,提高了检测的准确性和可靠性。
- 作者对现有对象幻觉检测方法进行了全面基准测试,结果表明GLSim取得了更优的检测性能。
- GLSim在多种场景下表现出优异的性能,显著优于其他基线方法。
- 该框架具有广泛的应用前景,可应用于各种视觉语言模型的安全部署。
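下面给出"全局-局部嵌入相似度"打分思路的一个极简草图(假设已获得图像全局嵌入、patch 级局部嵌入与对象名称的文本嵌入;打分形式与 alpha 权重均为假设,非 GLSim 官方实现):

```python
import torch
import torch.nn.functional as F

def gl_similarity_score(global_img: torch.Tensor, patch_img: torch.Tensor,
                        text_emb: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """
    global_img: (C,)   图像全局嵌入
    patch_img:  (P,C)  图像局部(patch)嵌入
    text_emb:   (C,)   待检验对象名称的文本嵌入
    返回的分数越低,该对象越可能是幻觉。
    """
    g = F.cosine_similarity(global_img, text_emb, dim=0)                     # 全局相似度
    l = F.cosine_similarity(patch_img, text_emb.unsqueeze(0), dim=1).max()   # 最相关 patch 的局部相似度
    return alpha * g + (1 - alpha) * l

# 用法示意:分数低于某个阈值则判为幻觉(阈值需在验证集上标定)
score = gl_similarity_score(torch.randn(512), torch.randn(196, 512), torch.randn(512))
```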
点此查看论文截图



A Semantic Segmentation Algorithm for Pleural Effusion Based on DBIF-AUNet
Authors:Ruixiang Tang, Mingda Zhang, Jianglong Qin, Yan Song, Yi Wu, Wei Wu
Pleural effusion semantic segmentation can significantly enhance the accuracy and timeliness of clinical diagnosis and treatment by precisely identifying disease severity and lesion areas. Currently, semantic segmentation of pleural effusion CT images faces multiple challenges. These include similar gray levels between effusion and surrounding tissues, blurred edges, and variable morphology. Existing methods often struggle with diverse image variations and complex edges, primarily because direct feature concatenation causes semantic gaps. To address these challenges, we propose the Dual-Branch Interactive Fusion Attention model (DBIF-AUNet). This model constructs a densely nested skip-connection network and innovatively refines the Dual-Domain Feature Disentanglement module (DDFD). The DDFD module orthogonally decouples the functions of dual-domain modules to achieve multi-scale feature complementarity and enhance characteristics at different levels. Concurrently, we design a Branch Interaction Attention Fusion module (BIAF) that works synergistically with the DDFD. This module dynamically weights and fuses global, local, and frequency band features, thereby improving segmentation robustness. Furthermore, we implement a nested deep supervision mechanism with hierarchical adaptive hybrid loss to effectively address class imbalance. Through validation on 1,622 pleural effusion CT images from Southwest Hospital, DBIF-AUNet achieved IoU and Dice scores of 80.1% and 89.0% respectively. These results outperform state-of-the-art medical image segmentation models U-Net++ and Swin-UNet by 5.7%/2.7% and 2.2%/1.5% respectively, demonstrating significant optimization in segmentation accuracy for complex pleural effusion CT images.
胸腔积液语义分割能够精确识别疾病严重程度和病变区域,从而显著提高临床诊断和治疗准确性和及时性。目前,胸腔积液CT图像的语义分割面临多重挑战,包括积液与周围组织的灰度相似、边缘模糊以及形态多变等。现有方法往往难以应对图像多样性和复杂边缘情况,主要是因为直接特征拼接会造成语义鸿沟。为了解决这些挑战,我们提出了双分支交互融合注意力模型(DBIF-AUNet)。该模型构建了一个密集嵌套跳跃连接网络,并创新地改进了双域特征分解模块(DDFD)。DDFD模块正交地解耦了双域模块的功能,以实现多尺度特征互补并增强不同级别的特征。同时,我们设计了分支交互注意力融合模块(BIAF),它与DDFD协同工作。该模块动态加权并融合全局、局部和频带特征,提高分割稳健性。此外,我们实现了具有分层自适应混合损失的嵌套深度监督机制,以有效解决类别不平衡问题。通过对西南医院1622张胸腔积液CT图像的验证,DBIF-AUNet的IoU和Dice得分分别为80.1%和89.0%,相比最先进的医疗图像分割模型U-Net++和Swin-UNet分别高出5.7%/2.7%和2.2%/1.5%。这证明了DBIF-AUNet在复杂胸腔积液CT图像分割精度上的显著优化。
论文及项目相关链接
PDF 12 pages, 6 figures, 2 tables
Summary:
胸膜积液语义分割能够精确识别疾病严重程度和病灶区域,从而提高临床诊断和治疗准确性和及时性。针对胸膜积液CT图像语义分割所面临的挑战,如灰度相似、边缘模糊和形态多变等问题,我们提出了双分支交互融合注意力模型(DBIF-AUNet)。该模型通过改进特征分解和融合方式,实现了多尺度特征互补和不同层次特征增强。在西南医院的1622张胸膜积液CT图像上进行验证,DBIF-AUNet的IoU和Dice得分分别为80.1%和89.0%,较其他先进医学图像分割模型有显著提高。
Key Takeaways:
- 胸膜积液语义分割对临床诊断和治疗具有重要意义,能精确识别疾病严重程度和病灶区域。
- 胸膜积液CT图像语义分割面临灰度相似、边缘模糊和形态多变等挑战。
- 现有方法常常难以应对多样的图像变化和复杂的边缘,主要原因是直接特征拼接会导致语义鸿沟。
- 双分支交互融合注意力模型(DBIF-AUNet)被提出以解决这些挑战,包括改进的特征分解和融合方式。
- DBIF-AUNet通过构建密集嵌套跳跃连接网络和创新地改进双域特征分解模块(DDFD),实现多尺度特征互补和不同层次特征增强。
- DBIF-AUNet还设计了分支交互注意力融合模块(BIAF),能够动态加权和融合全局、局部和频带特征,提高分割稳健性。
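"动态加权并融合全局、局部和频带特征"可以用下面的简化模块来示意(分支权重的生成方式为假设性设计,并非 BIAF 的官方实现):

```python
import torch
import torch.nn as nn

class BranchWeightedFusion(nn.Module):
    """对三个分支特征学习逐分支权重后加权求和。"""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels * 3, 3, kernel_size=1),   # 为三个分支各输出一个权重
            nn.Softmax(dim=1),
        )

    def forward(self, f_global, f_local, f_freq):
        w = self.gate(torch.cat([f_global, f_local, f_freq], dim=1))  # (B,3,1,1)
        return w[:, 0:1] * f_global + w[:, 1:2] * f_local + w[:, 2:3] * f_freq

# 用法示意
x = [torch.randn(1, 32, 64, 64) for _ in range(3)]
print(BranchWeightedFusion(32)(*x).shape)     # torch.Size([1, 32, 64, 64])
```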
点此查看论文截图




DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction
Authors:Zhen Yang, Yanpeng Dong, Jiayu Wang, Heng Wang, Lichao Ma, Zijian Cui, Qi Liu, Haoran Pei, Kexin Zhang, Chao Zhang
Multi-sensor fusion significantly enhances the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics. However, most existing approaches depend on high-resolution images and complex networks to achieve top performance, hindering their deployment in practical scenarios. Moreover, current multi-sensor fusion approaches mainly focus on improving feature fusion while largely neglecting effective supervision strategies for those features. To address these issues, we propose DAOcc, a novel multi-modal occupancy prediction framework that leverages 3D object detection supervision to assist in achieving superior performance, while using a deployment-friendly image backbone and practical input resolution. In addition, we introduce a BEV View Range Extension strategy to mitigate performance degradation caused by lower image resolution. Extensive experiments demonstrate that DAOcc achieves new state-of-the-art results on both the Occ3D-nuScenes and Occ3D-Waymo benchmarks, and outperforms previous state-of-the-art methods by a significant margin using only a ResNet-50 backbone and 256*704 input resolution. With TensorRT optimization, DAOcc reaches 104.9 FPS while maintaining 54.2 mIoU on an NVIDIA RTX 4090 GPU. Code is available at https://github.com/AlphaPlusTT/DAOcc.
多传感器融合显著提高了3D语义占用预测的准确性和鲁棒性,这对自动驾驶和机器人技术至关重要。然而,大多数现有方法依赖高分辨率图像和复杂网络来实现最佳性能,阻碍了它们在实际场景中的部署。此外,当前的多传感器融合方法主要关注改进特征融合,而在很大程度上忽视了对这些特征的有效监督策略。为了解决这些问题,我们提出了DAOcc,一种新型的多模态占用预测框架,它利用3D目标检测监督来帮助实现更优的性能,同时使用便于部署的图像主干网络和实用的输入分辨率。此外,我们还引入了BEV View Range Extension策略,以缓解较低图像分辨率导致的性能下降。大量实验表明,DAOcc在Occ3D-nuScenes和Occ3D-Waymo基准测试上均取得了最先进的结果,并且仅使用ResNet-50主干网络和256*704的输入分辨率,就以显著优势超越了以前的最先进方法。经TensorRT优化后,DAOcc在NVIDIA RTX 4090 GPU上达到104.9 FPS,同时保持54.2 mIoU。相关代码可访问https://github.com/AlphaPlusTT/DAOcc获取。
论文及项目相关链接
PDF TCSVT Accepted version (not the final published version)
Summary
多传感器融合能提高三维语义占用预测的准确性及鲁棒性,对自动驾驶和机器人技术至关重要。但现有方法依赖高分辨率图像和复杂网络实现最佳性能,阻碍了实际应用场景的部署。为解决这个问题,我们提出了DAOcc框架,利用三维物体检测的监督帮助获得卓越性能,同时采用部署友好的图像骨干网及实用输入分辨率。实验表明,DAOcc在Occ3D-nuScenes和Occ3D-Waymo两个基准测试上都达到最新技术水平,仅使用ResNet-50骨干网和256*704输入分辨率便大幅超越先前技术。经TensorRT优化后,DAOcc在NVIDIA RTX 4090 GPU上达到每秒104.9帧的处理速度,同时保持54.2 mIoU。
Key Takeaways
- 多传感器融合能提高三维语义占用预测的准确性及鲁棒性。
- 当前方法依赖高分辨率图像和复杂网络,影响实际应用部署。
- DAOcc框架利用三维物体检测监督来提升性能,并采用部署友好的图像骨干网及实用输入分辨率。
- DAOcc在基准测试中表现优越,超越先前技术。
- 使用ResNet-50骨干网和256*704输入分辨率即可实现良好性能。
- DAOcc经TensorRT优化后处理速度提升。
- DAOcc在NVIDIA RTX 4090 GPU上达到每秒104.9帧处理速度,同时保持高mIoU值。
点此查看论文截图






Investigating Long-term Training for Remote Sensing Object Detection
Authors:JongHyun Park, Yechan Kim, Moongu Jeon
Recently, numerous methods have achieved impressive performance in remote sensing object detection, relying on convolution or transformer architectures. Such detectors typically have a feature backbone to extract useful features from raw input images. A common practice in current detectors is initializing the backbone with pre-trained weights available online. Fine-tuning the backbone is typically required to generate features suitable for remote-sensing images. While the prolonged training could lead to over-fitting, hindering the extraction of basic visual features, it can enable models to gradually extract deeper insights and richer representations from remote sensing data. Striking a balance between these competing factors is critical for achieving optimal performance. In this study, we aim to investigate the performance and characteristics of remote sensing object detection models under very long training schedules, and propose a novel method named Dynamic Backbone Freezing (DBF) for feature backbone fine-tuning on remote sensing object detection under long-term training. Our method addresses the dilemma of whether the backbone should extract low-level generic features or possess specific knowledge of the remote sensing domain, by introducing a module called ‘Freezing Scheduler’ to manage the update of backbone features during long-term training dynamically. Extensive experiments on DOTA and DIOR-R show that our approach enables more accurate model learning while substantially reducing computational costs in long-term training. Besides, it can be seamlessly adopted without additional effort due to its straightforward design. The code is available at https://github.com/unique-chan/dbf.
最近,许多依赖卷积或Transformer架构的方法已在遥感目标检测方面取得了令人印象深刻的性能。此类检测器通常具有特征主干,可从原始输入图像中提取有用特征。当前检测器中的一种常见做法是使用在线可用的预训练权重来初始化主干,并对主干进行微调,以生成适用于遥感图像的特征。虽然长时间的训练可能会导致过拟合,从而阻碍基本视觉特征的提取,但它也可以使模型逐渐从遥感数据中提取更深入的见解和更丰富的表示。在这两种相互竞争的因素之间取得平衡对于实现最佳性能至关重要。在这项研究中,我们旨在研究遥感目标检测模型在超长训练计划下的性能和特点,并提出了一种名为动态主干冻结(DBF)的新方法,用于在长期训练中对遥感目标检测的特征主干进行微调。我们的方法通过引入名为“冻结调度器”(Freezing Scheduler)的模块来动态管理长期训练期间主干特征的更新,从而解决了主干应该提取低级通用特征还是具备遥感领域特定知识的两难问题。在DOTA和DIOR-R上的大量实验表明,我们的方法使模型学习更加准确,同时大幅降低了长期训练的计算成本。此外,由于设计简洁直观,它可以无需额外工作即无缝采用。代码可在https://github.com/unique-chan/dbf找到。
论文及项目相关链接
Summary
本研究旨在探讨遥感目标检测模型在长期训练下的性能与特性,并提出一种名为动态主干冻结(DBF)的新方法,用于在长期训练中对遥感目标检测的特征主干进行微调。该方法通过引入名为“冻结调度器”的模块,实现主干特征更新的动态管理,解决了主干应提取低层次通用特征还是具备遥感领域特定知识的问题。在DOTA和DIOR-R上的实验表明,该方法能更准确地学习模型,同时大幅降低长期训练的计算成本,且设计简洁,可无缝采用而无需额外努力。
Key Takeaways
- 遥感目标检测近期取得显著进展,主要依赖于卷积或transformer架构。
- 当前检测器通常使用在线预训练权重来初始化主干,并进行微调以生成适合遥感图像的特征。
- 长期训练可能导致过拟合,影响基本视觉特征的提取,但也能使模型逐渐提取更深入、更丰富的遥感数据表示。
- 研究旨在解决遥感目标检测模型在长期训练下的性能问题,并提出动态主干冻结(DBF)方法。
- DBF方法通过引入“冻结调度器”模块,实现主干特征更新的动态管理,在提取通用低层次特征和遥感领域特定知识之间取得平衡。
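"冻结调度器"的行为可以用下面的最小草图来理解(按固定间隔在冻结与解冻主干之间切换;切换策略与间隔为假设值,官方实现请参见上文代码仓库):

```python
import torch.nn as nn

class FreezingScheduler:
    """长期训练中按 epoch 动态冻结/解冻特征主干,在通用特征与领域特征之间取得平衡。"""
    def __init__(self, backbone: nn.Module, freeze_every: int = 2):
        self.backbone = backbone
        self.freeze_every = freeze_every

    def step(self, epoch: int) -> None:
        freeze = (epoch % self.freeze_every) != 0       # 示例策略:每 freeze_every 个 epoch 解冻一次
        for p in self.backbone.parameters():
            p.requires_grad = not freeze
        # 冻结时同时固定 BN 统计量,避免其随长期训练漂移
        self.backbone.train(not freeze)

# 用法示意(detector.backbone 为任意检测器的特征主干)
# scheduler = FreezingScheduler(detector.backbone)
# for epoch in range(num_epochs):
#     scheduler.step(epoch)
#     train_one_epoch(detector, loader)
```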
点此查看论文截图



A re-calibration method for object detection with multi-modal alignment bias in autonomous driving
Authors:Zhihang Song, Dingyi Yao, Ruibo Ming, Lihui Peng, Danya Yao, Yi Zhang
Multi-modal object detection in autonomous driving has achieved great breakthroughs due to the usage of fusing complementary information from different sensors. The calibration in fusion between sensors such as LiDAR and camera was always supposed to be precise in previous work. However, in reality, calibration matrices are fixed when the vehicles leave the factory, but mechanical vibration, road bumps, and data lags may cause calibration bias. As there is relatively limited research on the impact of calibration on fusion detection performance, multi-sensor detection methods with flexible calibration dependency have remained a key objective. In this paper, we systematically evaluate the sensitivity of the SOTA EPNet++ detection framework and prove that even slight bias on calibration can reduce the performance seriously. To address this vulnerability, we propose a re-calibration model to re-calibrate the misalignment in detection tasks. This model integrates LiDAR point cloud, camera image, and initial calibration matrix as inputs, generating re-calibrated bias through semantic segmentation guidance and a tailored loss function design. The re-calibration model can operate with existing detection algorithms, enhancing both robustness against calibration bias and overall object detection performance. Our approach establishes a foundational methodology for maintaining reliability in multi-modal perception systems under real-world calibration uncertainties.
得益于融合不同传感器的互补信息,多模态物体检测在自动驾驶领域取得了重大突破。在以往工作中,激光雷达与相机等传感器之间的融合校准通常被假定是精确的。然而在现实中,校准矩阵在车辆出厂时即被固定,而机械振动、路面颠簸和数据延迟都可能导致校准偏差。由于关于校准对融合检测性能影响的研究相对有限,对校准依赖更灵活的多传感器检测方法仍是一个关键目标。在本文中,我们系统地评估了最先进的EPNet++检测框架的敏感性,并证明即使轻微的校准偏差也会严重降低性能。为了解决这一脆弱性,我们提出了一种重新校准模型,对检测任务中的不对齐进行重新校准。该模型以激光雷达点云、相机图像和初始校准矩阵作为输入,通过语义分割引导和定制的损失函数设计生成重新校准的偏差。该重新校准模型可与现有检测算法配合使用,同时增强对校准偏差的鲁棒性和整体目标检测性能。我们的方法为在真实世界校准不确定性下保持多模态感知系统的可靠性奠定了基础性方法。
论文及项目相关链接
PDF Accepted for publication in IST 2025. Official IEEE Xplore entry will be available once published
Summary
本文探讨了自主驾驶中多模态物体检测的突破,特别是在融合不同传感器信息方面的应用。由于车辆出厂时校准矩阵是固定的,但现实环境中的机械振动、路面颠簸和数据延迟可能导致校准偏差。针对现有研究中关于校准对融合检测性能影响的研究相对有限,本文系统地评估了当前先进的EPNet++检测框架的敏感性,并提出了一种重新校准模型来解决检测任务中的不对齐问题。该模型通过语义分割指导和定制的损失函数设计,将激光雷达点云、相机图像和初始校准矩阵作为输入,生成重新校准的偏差。重新校准模型可以与现有的检测算法一起操作,增强了对抗校准偏差的鲁棒性和整体物体检测性能。本文的方法为现实世界中校准不确定性下的多模态感知系统保持可靠性建立了基础方法。
Key Takeaways
- 多模态物体检测在自主驾驶中取得突破,得益于融合不同传感器的互补信息。
- 现实中,车辆出厂时的固定校准矩阵可能受到机械振动、路面颠簸和数据延迟等因素的影响,导致校准偏差。
- 当前研究对校准对融合检测性能的影响有限,因此多传感器检测方法需要具备灵活的校准依赖性。
- 本文系统地评估了EPNet++检测框架的敏感性,并指出轻微校准偏差即可严重影响性能。
- 针对上述问题,提出了重新校准模型,通过语义分割指导和定制损失函数设计来解决检测任务中的不对齐问题。
- 重新校准模型可以与现有检测算法结合,增强对校准偏差的鲁棒性和整体物体检测性能。
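为直观说明校准偏差如何影响激光雷达-相机融合,下面给出把激光雷达点投影到图像平面、并人为加入外参扰动的最小示例(矩阵记号与扰动幅度均为假设;论文中的重新校准模型旨在回归并补偿这类偏差):

```python
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K):
    """points: (N,3) 激光雷达坐标;T_cam_lidar: (4,4) 外参;K: (3,3) 内参。返回像素坐标 (N,2)。"""
    pts_h = np.hstack([points, np.ones((len(points), 1))])   # 齐次坐标
    cam = (T_cam_lidar @ pts_h.T).T[:, :3]                    # 变换到相机坐标系
    uv = (K @ cam.T).T
    return uv[:, :2] / uv[:, 2:3]                             # 透视除法

def perturb_extrinsics(T, rot_deg: float = 1.0, trans_m: float = 0.05):
    """给外参加入小幅旋转/平移扰动,模拟出厂后逐渐产生的校准偏差。"""
    a = np.deg2rad(rot_deg)
    Rz = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
    T_bias = np.eye(4)
    T_bias[:3, :3] = Rz
    T_bias[:3, 3] = trans_m
    return T_bias @ T
```

用扰动前后的外参分别投影同一帧点云,即可看到投影点相对图像目标的偏移,这正是论文中导致融合检测性能下降的错位来源。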
点此查看论文截图







