⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-11-05 更新
Overcoming Prompts Pool Confusion via Parameterized Prompt for Incremental Object Detection
Authors:Zijia An, Boyu Diao, Ruiqi Liu, Libo Huang, Chuanguang Yang, Fei Wang, Zhulin An, Yongjun Xu
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompts pool based approaches assume disjoint class sets across incremental tasks, which are unsuitable for IOD as they overlook the inherent co-occurrence phenomenon in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current task images, leading to confusion in the prompts pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging neural networks’ global evolution properties, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompts structure updates, P$^2$IOD further engages a parameterized prompts fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO datasets demonstrate P$^2$IOD’s effectiveness in IOD and its state-of-the-art performance among existing baselines.
最近的研究表明,将可训练提示融入预训练模型,可以实现有效的增量学习。然而,提示在增量目标检测(IOD)中的应用仍然被探索得不够充分。现有的基于提示池的方法假设增量任务之间的类别集是不相交的,这不适用于IOD,因为它们忽视了检测图像中固有的共现现象。在共现场景中,来自先前任务的未标记目标可能会出现在当前任务图像中,从而导致提示池中的混淆。在本文中,我们认为提示结构应在任务之间表现出自适应整合特性,并通过约束更新来防止灾难性遗忘。受此启发,我们引入了用于增量目标检测的参数化提示(P$^2$IOD)。利用神经网络的全局演化特性,P$^2$IOD利用网络作为参数化提示,以自适应地整合任务间的知识。为了约束提示结构更新,P$^2$IOD进一步采用参数化提示融合策略。在PASCAL VOC2007和MS COCO数据集上的大量实验验证了P$^2$IOD在IOD中的有效性,其性能在现有基线方法中达到最先进水平。
论文及项目相关链接
Summary
近期研究表明,将可训练提示融入预训练模型可实现有效的增量学习。然而,提示在增量目标检测(IOD)中的应用仍未得到充分探索。当前基于提示池的方法假设增量任务之间的类别集互不相交,这不适用于IOD,因为其忽视了检测图像中固有的共现现象。在共现场景中,来自先前任务的无标签目标可能出现在当前任务图像中,导致提示池混淆。本文认为,提示结构应展现跨任务的自适应整合特性,并通过受约束的更新来防止灾难性遗忘。因此,我们提出用于增量目标检测的参数化提示(P²IOD):利用神经网络的全局演化特性,将网络本身作为参数化提示,跨任务自适应地整合知识;为了约束提示结构的更新,P²IOD进一步采用参数化提示融合策略。在PASCAL VOC2007和MS COCO数据集上的广泛实验验证了P²IOD在IOD中的有效性,其性能超越了现有基线方法。
Key Takeaways
- 近期研究将可训练提示融入预训练模型以实现增量学习,但其在增量目标检测(IOD)中的应用仍未得到充分探索。
- 当前基于提示池的方法忽略了检测图像中的共现现象,导致在共现场景中提示池出现混淆。
- 本文主张提示结构应展现跨任务的自适应整合特性,并通过受约束的更新防止灾难性遗忘。
- 提出用于增量目标检测的参数化提示(P²IOD)。
- P²IOD利用神经网络的全局演化特性来跨任务自适应整合知识。
- P²IOD采用参数化提示融合策略来约束提示结构的更新(下文给出一个示意性代码草图)。
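参数化提示融合的具体形式在摘要中并未给出。下面是一个示意性的最小草图(PyTorch),假设"参数化提示"是一个以残差方式注入检测特征的小型网络,融合采用参数级加权平均来约束结构更新;其中的类名、网络结构、特征维度与 alpha 取值均为假设,并非论文原实现。

```python
import torch
import torch.nn as nn

class ParameterizedPrompt(nn.Module):
    """小型网络充当"参数化提示",以残差方式向检测特征注入提示信息。"""
    def __init__(self, dim: int = 256, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Linear(hidden, dim))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return feat + self.net(feat)   # 残差式注入

@torch.no_grad()
def fuse_prompts(prev: ParameterizedPrompt, curr: ParameterizedPrompt, alpha: float = 0.7):
    """参数级融合:用加权平均约束当前任务对提示结构的更新(alpha 越大越保守)。"""
    for p_prev, p_curr in zip(prev.parameters(), curr.parameters()):
        p_curr.copy_(alpha * p_prev + (1.0 - alpha) * p_curr)

# 用法示意:任务 t 训练结束后,将任务 t-1 的提示参数与当前提示参数融合
prompt_prev, prompt_curr = ParameterizedPrompt(), ParameterizedPrompt()
fuse_prompts(prompt_prev, prompt_curr, alpha=0.7)
feat = torch.randn(2, 100, 256)            # 假设的查询/特征张量
print(prompt_curr(feat).shape)             # torch.Size([2, 100, 256])
```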
M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar
Authors:Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang
Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. Consequently, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs the tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module guided by radar information to align global features across candidate proposals and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.
近期4D成像雷达技术的进展使得在恶劣天气下的稳健感知成为可能,而相机传感器提供了密集语义信息。融合这些互补模态在成本效益高的3D感知方面具有巨大潜力。然而,现有的相机雷达融合方法大多仅限于单帧输入,只能捕捉场景的部分视图。不完整的场景信息加上图像退化和4D雷达稀疏性,阻碍了整体检测性能。相比之下,多帧融合提供了更丰富的时空信息,但面临两个挑战:实现跨帧和跨模态的稳健有效的目标特征融合,并减少冗余特征提取的计算成本。因此,我们提出了M^3Detection,这是一个统一的多帧3D目标检测框架,对来自相机和4D成像雷达的多模态数据进行多层次特征融合。我们的框架利用基线检测器的中间特征,并使用跟踪器生成参考轨迹,提高计算效率并为第二阶段提供更丰富的信息。在第二阶段,我们设计了一个受雷达信息引导的全局级目标间特征聚合模块,以对齐候选提案的全局特征,以及一个局部级网格间特征聚合模块,该模块沿着参考轨迹扩展局部特征,以增强精细目标表示。然后,聚集的特征通过轨迹级多帧时空推理模块进行处理,以编码跨帧交互并增强时间表示。在VoD和TJ4DRadSet数据集上的大量实验表明,M^3Detection达到了最先进的3D检测性能,验证了其在多帧检测与相机4D成像雷达融合方面的有效性。
论文及项目相关链接
PDF 16 pages, 9 figures
Summary
近期4D成像雷达技术取得进展,能在恶劣天气中实现稳健感知,而相机传感器提供丰富的语义信息。融合这两种互补技术具有实现成本效益高的三维感知的潜力。但现有的摄像头雷达融合方法仅限于单帧输入,仅捕获场景的局部视图,导致检测性能受限。相比之下,多帧融合提供更丰富的时空信息,但面临跨帧和跨模态的稳健有效特征融合以及冗余特征提取的计算成本挑战。为此,我们提出M^3Detection统一多帧三维目标检测框架,在来自相机和4D成像雷达的多模态数据上进行多层次特征融合。该框架利用基线检测器的中间特征,并通过跟踪器生成参考轨迹,提高计算效率并为第二阶段提供更丰富的信息。在第二阶段,我们设计了一个全局级别的跨目标特征聚合模块和局部级别的跨网格特征聚合模块,以沿参考轨迹扩展局部特征,增强精细目标表示。聚合特征随后通过轨迹级多帧时空推理模块进行处理,以编码跨帧交互并增强时间表示。在VoD和TJ4DRadSet数据集上的实验表明,M^3Detection实现了最先进的三维检测性能,验证了其在多帧检测与摄像头4D成像雷达融合中的有效性。
Key Takeaways
- 近期4D成像雷达技术能在恶劣天气中实现稳健感知,与相机传感器信息融合具有巨大潜力。
- 现有摄像头雷达融合方法主要局限于单帧输入,导致场景信息不完整。
- 多帧融合提供更丰富的时空信息,但面临特征融合和计算成本挑战。
- M^3Detection框架实现多层次特征融合,结合相机和4D成像雷达数据。
- 该框架利用基线检测器的中间特征和跟踪器生成的参考轨迹。
- M^3Detection通过设计全局和局部特征聚合模块以及轨迹级多帧推理模块增强检测性能。
Test-Time Adaptive Object Detection with Foundation Model
Authors:Yingjie Gao, Yanan Zhang, Zhi Cai, Di Huang
In recent years, test-time adaptive object detection has attracted increasing attention due to its unique advantages in online domain adaptation, which aligns more closely with real-world application scenarios. However, existing approaches heavily rely on source-derived statistical characteristics while making the strong assumption that the source and target domains share an identical category space. In this paper, we propose the first foundation model-powered test-time adaptive object detection method that eliminates the need for source data entirely and overcomes traditional closed-set limitations. Specifically, we design a Multi-modal Prompt-based Mean-Teacher framework for vision-language detector-driven test-time adaptation, which incorporates text and visual prompt tuning to adapt both language and vision representation spaces on the test data in a parameter-efficient manner. Correspondingly, we propose a Test-time Warm-start strategy tailored for the visual prompts to effectively preserve the representation capability of the vision branch. Furthermore, to guarantee high-quality pseudo-labels in every test batch, we maintain an Instance Dynamic Memory (IDM) module that stores high-quality pseudo-labels from previous test samples, and propose two novel strategies-Memory Enhancement and Memory Hallucination-to leverage IDM’s high-quality instances for enhancing original predictions and hallucinating images without available pseudo-labels, respectively. Extensive experiments on cross-corruption and cross-dataset benchmarks demonstrate that our method consistently outperforms previous state-of-the-art methods, and can adapt to arbitrary cross-domain and cross-category target data. Code is available at https://github.com/gaoyingjay/ttaod_foundation.
近年来,测试时自适应目标检测由于其在在线域自适应方面的独特优势而日益受到关注,也更贴近实际应用场景。然而,现有方法严重依赖源域统计特征,并强假设源域和目标域共享相同的类别空间。在本文中,我们提出了首个由基础模型驱动的测试时自适应目标检测方法,该方法完全无需源数据,并突破了传统的封闭集限制。具体来说,我们设计了一个基于多模态提示的均值教师(Mean-Teacher)框架,用于视觉语言检测器驱动的测试时自适应,它通过文本和视觉提示调优,以参数高效的方式在测试数据上适应语言和视觉表示空间。相应地,我们针对视觉提示提出了一种测试时预热(Warm-start)策略,以有效保留视觉分支的表示能力。此外,为了保证每个测试批次中伪标签的高质量,我们维护了一个实例动态记忆模块(IDM)来存储来自先前测试样本的高质量伪标签,并提出了记忆增强与记忆幻觉两种新策略,分别利用IDM中的高质量实例来增强原始预测,以及为没有可用伪标签的图像生成幻觉实例。在跨退化(corruption)和跨数据集基准上的大量实验表明,我们的方法始终优于先前的最先进方法,并且可以适应任意跨域和跨类别的目标数据。代码可通过 https://github.com/gaoyingjay/ttaod_foundation 获取。
论文及项目相关链接
PDF Accepted by NeurIPS 2025
Summary:
近年来,测试时自适应目标检测因其在在线域自适应中的独特优势而备受关注。然而,现有方法高度依赖源域统计特征,并假设源域和目标域具有相同的类别空间,这限制了其实际应用。本文提出了一种基于基础模型的测试时自适应目标检测方法,完全无需源数据,突破了传统封闭集的局限。该方法设计了一个基于多模态提示的Mean-Teacher框架,结合文本和视觉提示调优,以参数高效的方式在测试数据上适应语言和视觉表示空间。同时,针对视觉提示提出了测试时预热策略,以有效保留视觉分支的表示能力。为保证每批测试样本的伪标签质量,维护了一个实例动态记忆模块(IDM),并提出记忆增强与记忆幻觉两种新策略,分别利用高质量实例增强原始预测,以及为缺少伪标签的图像生成幻觉实例。实验证明,该方法在跨退化和跨数据集基准测试上始终优于先前方法,并可适应任意跨域和跨类别的目标数据。
Key Takeaways:
- 测试时自适应目标检测在现实应用中的价值及其近年来受到的关注。
- 现有方法依赖于源数据并存在封闭集限制。
- 提出了一种基于基础模型的测试时自适应目标检测方法,无需源数据。
- 设计了基于多模态提示的Mean-Teacher框架,结合文本和视觉提示调优。
- 提出了测试时预热策略,有效保留视觉分支的表示能力。
- 通过实例动态记忆模块(IDM)保证高质量伪标签。
- 提出记忆增强与记忆幻觉两种新策略,分别利用高质量实例增强原始预测,以及为缺少伪标签的图像生成幻觉实例。
- 在多个基准测试中表现优异,具有广泛的适用性。
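下面给出该类方法中两个通用组件的示意草图(PyTorch):Mean-Teacher 的 EMA 参数更新,以及一个按类别缓存高置信度伪标签的"实例动态记忆"。阈值、容量等超参数与数据均为假设,仅用于说明思路,并非论文实现。

```python
import torch
import torch.nn as nn
from collections import deque

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999):
    """Mean-Teacher:教师参数取学生参数的指数滑动平均,提供更稳定的伪标签。"""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

class InstanceDynamicMemory:
    """示意性的实例动态记忆(IDM):按类别缓存历史测试样本中的高置信度伪标签实例。"""
    def __init__(self, per_class: int = 20, score_thr: float = 0.6):
        self.per_class, self.score_thr = per_class, score_thr
        self.bank = {}  # class_id -> deque[(box, score)]

    def update(self, boxes, scores, labels):
        for box, score, cls in zip(boxes, scores, labels):
            if float(score) >= self.score_thr:
                q = self.bank.setdefault(int(cls), deque(maxlen=self.per_class))
                q.append((box.clone(), float(score)))

    def sample(self, cls: int, k: int = 1):
        """取该类别最近的 k 个高质量实例,用于增强当前预测或构造额外监督。"""
        return list(self.bank.get(cls, []))[-k:]

# 用法示意(随机数据)
teacher, student = nn.Linear(8, 2), nn.Linear(8, 2)
ema_update(teacher, student, momentum=0.99)
mem = InstanceDynamicMemory()
mem.update(torch.rand(5, 4), torch.tensor([0.9, 0.4, 0.8, 0.7, 0.5]), torch.tensor([1, 1, 2, 2, 3]))
print(len(mem.sample(2, k=2)))   # 2
```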
Classifier Enhancement Using Extended Context and Domain Experts for Semantic Segmentation
Authors:Huadong Tang, Youpeng Zhao, Min Xu, Jun Wang, Qiang Wu
Prevalent semantic segmentation methods generally adopt a vanilla classifier to categorize each pixel into specific classes. Although such a classifier learns global information from the training data, this information is represented by a set of fixed parameters (weights and biases). However, each image has a different class distribution, which prevents the classifier from addressing the unique characteristics of individual images. At the dataset level, class imbalance leads to segmentation results being biased towards majority classes, limiting the model’s effectiveness in identifying and segmenting minority class regions. In this paper, we propose an Extended Context-Aware Classifier (ECAC) that dynamically adjusts the classifier using global (dataset-level) and local (image-level) contextual information. Specifically, we leverage a memory bank to learn dataset-level contextual information of each class, incorporating the class-specific contextual information from the current image to improve the classifier for precise pixel labeling. Additionally, a teacher-student network paradigm is adopted, where the domain expert (teacher network) dynamically adjusts contextual information with ground truth and transfers knowledge to the student network. Comprehensive experiments illustrate that the proposed ECAC can achieve state-of-the-art performance across several datasets, including ADE20K, COCO-Stuff10K, and Pascal-Context.
当前流行的语义分割方法通常采用普通分类器,将每个像素分类为特定的类别。虽然这样的分类器从训练数据中学习全局信息,但这些信息由一组固定的参数(权重和偏置)表示。然而,每张图像的类别分布都是不同的,这导致分类器无法应对单个图像的独特特征。在数据集层面,类别不平衡导致分割结果偏向于多数类别,限制了模型在识别和分割少数类别区域方面的有效性。在本文中,我们提出了一种扩展上下文感知分类器(ECAC),利用全局(数据集级别)和局部(图像级别)的上下文信息来动态调整分类器。具体来说,我们利用记忆库(memory bank)学习每个类别的数据集级别上下文信息,并结合当前图像中的类别特定上下文信息改进分类器,以实现精确的像素级标注。此外,我们采用教师-学生网络范式,由领域专家(教师网络)借助真实标注动态调整上下文信息,并将知识迁移给学生网络。综合实验表明,所提出的ECAC在ADE20K、COCO-Stuff10K和Pascal-Context等多个数据集上均能达到最先进的性能。
论文及项目相关链接
PDF Accepted at IEEE TRANSACTIONS ON MULTIMEDIA (TMM)
Summary
在本文中,为了解决现有语义分割方法的局限性,提出了一个名为扩展上下文感知分类器(ECAC)的新方法。传统方法通常使用固定参数的分类器对像素进行分类,忽视了每幅图像的类分布差异以及数据集级别的类不平衡问题。而ECAC利用全局和局部上下文信息动态调整分类器,借助记忆库学习每个类的数据集级别上下文信息,并结合当前图像的类特定上下文信息来提高分类器的精度。此外,还采用了教师-学生网络范式,由教师网络借助真实标注动态调整上下文信息,并向学生网络传递知识。实验证明,ECAC在多个数据集上取得了最先进的性能。
Key Takeaways
- 传统语义分割方法采用固定参数分类器,难以适应每幅图像的不同类分布。
- 数据集级别的类不平衡导致模型对多数类的分割结果偏向,影响对少数类的识别和分割效果。
- ECAC通过结合全局和局部上下文信息动态调整分类器,以提高像素标记的精度。
- ECAC利用记忆库学习每个类的数据集级别上下文信息。
- 教师-学生网络范式中,教师网络借助真实标注动态调整上下文信息,并将知识传递给学生网络。
- ECAC在多个数据集上实现了最先进的性能。
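下面是上下文感知分类器思路的一个示意草图(PyTorch),假设数据集级上下文以动量更新的类均值特征保存在记忆库中,图像级上下文由粗分类概率对特征加权汇聚得到,二者共同调整分类器权重;具体的组合方式与维度均为假设,并非论文原实现。

```python
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    """示意:结合数据集级记忆库与图像级类上下文,动态调整像素分类器的权重。"""
    def __init__(self, num_classes: int, dim: int, momentum: float = 0.99):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, dim) * 0.02)   # 固定参数的"普通分类器"
        self.register_buffer("memory", torch.zeros(num_classes, dim))      # 数据集级类上下文
        self.momentum = momentum

    @torch.no_grad()
    def update_memory(self, feats: torch.Tensor, labels: torch.Tensor):
        """教师分支可借助真实标注,把各类均值特征以动量方式写入记忆库。feats: (N, dim), labels: (N,)"""
        for c in labels.unique().tolist():
            ctx = feats[labels == c].mean(0)
            self.memory[c] = self.momentum * self.memory[c] + (1 - self.momentum) * ctx

    def forward(self, feat_map: torch.Tensor, coarse_logits: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, D, H, W);coarse_logits: (B, C, H, W),用于提取当前图像的类特定上下文
        B, D, H, W = feat_map.shape
        flat = feat_map.flatten(2)                                  # (B, D, HW)
        attn = coarse_logits.flatten(2).softmax(dim=-1)             # (B, C, HW) 每类的空间注意力
        img_ctx = torch.einsum("bcn,bdn->bcd", attn, flat)          # 图像级类上下文 (B, C, D)
        dyn_w = (self.weight + self.memory).unsqueeze(0) + img_ctx  # 数据集级 + 图像级调整 -> (B, C, D)
        logits = torch.einsum("bcd,bdn->bcn", dyn_w, flat)          # 逐像素打分
        return logits.view(B, -1, H, W)

# 用法示意(随机数据)
clf = ContextAwareClassifier(num_classes=19, dim=64)
clf.update_memory(torch.randn(100, 64), torch.randint(0, 19, (100,)))
x, coarse = torch.randn(2, 64, 32, 32), torch.randn(2, 19, 32, 32)
print(clf(x, coarse).shape)   # torch.Size([2, 19, 32, 32])
```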
DINO-YOLO: Self-Supervised Pre-training for Data-Efficient Object Detection in Civil Engineering Applications
Authors:Malaisree P, Youwai S, Kitkobsin T, Janrungautai S, Amorndechaphon D, Rojanavasu P
Object detection in civil engineering applications is constrained by limited annotated data in specialized domains. We introduce DINO-YOLO, a hybrid architecture combining YOLOv12 with DINOv3 self-supervised vision transformers for data-efficient detection. DINOv3 features are strategically integrated at two locations: input preprocessing (P0) and mid-backbone enhancement (P3). Experimental validation demonstrates substantial improvements: Tunnel Segment Crack detection (648 images) achieves 12.4% improvement, Construction PPE (1K images) gains 13.7%, and KITTI (7K images) shows 88.6% improvement, while maintaining real-time inference (30-47 FPS). Systematic ablation across five YOLO scales and nine DINOv3 variants reveals that Medium-scale architectures achieve optimal performance with DualP0P3 integration (55.77% mAP@0.5), while Small-scale requires Triple Integration (53.63%). The 2-4x inference overhead (21-33ms versus 8-16ms baseline) remains acceptable for field deployment on NVIDIA RTX 5090. DINO-YOLO establishes state-of-the-art performance for civil engineering datasets (<10K images) while preserving computational efficiency, providing practical solutions for construction safety monitoring and infrastructure inspection in data-constrained environments.
在土木工程应用中,目标检测受限于特定领域标注数据的稀缺。我们引入了DINO-YOLO,这是一种结合YOLOv12与DINOv3自监督视觉Transformer的混合架构,用于数据高效的目标检测。DINOv3特征被战略性地整合在两个位置:输入预处理(P0)和中段主干增强(P3)。实验验证显示出显著的提升:隧道管片裂缝检测(648张图像)提升了12.4%,施工个人防护装备检测(1000张图像)提升了13.7%,KITTI数据集(7000张图像)提升了88.6%,同时保持实时推理(30-47 FPS)。对五个YOLO规模和九个DINOv3变体的系统消融研究表明,中等规模架构在采用DualP0P3集成时达到最佳性能(mAP@0.5为55.77%),而小规模架构则需要三重集成(53.63%)。2-4倍的推理开销(21-33ms,基线为8-16ms)对于在NVIDIA RTX 5090上的现场部署仍然可以接受。DINO-YOLO在保持计算效率的同时,为土木工程数据集(少于1万张图像)建立了最先进的性能,为数据受限环境下的施工安全监测与基础设施检查提供了实用解决方案。
论文及项目相关链接
Summary
针对土木工程应用中目标检测受限于专业领域标注数据不足的问题,提出了DINO-YOLO混合架构。该架构结合了YOLOv12和DINOv3自监督视觉Transformer,以实现数据高效的检测。在隧道管片裂缝检测、施工个人防护装备检测和KITTI数据集上的实验验证显示,DINO-YOLO均有显著提升,同时保持实时推理速度。研究还通过系统性消融实验揭示了达到最佳性能的组合策略。DINO-YOLO在数据受限环境中为土木工程施工安全监测和基础设施检测提供了实用解决方案。
Key Takeaways
- 引入DINO-YOLO混合架构,结合了YOLOv12和DINOv3技术,适用于土木工程中的目标检测。
- DINOv3的特性被整合在输入预处理(P0)和中期主干增强(P3)两个位置。
- 在隧道管片裂缝检测、施工个人防护装备检测和KITTI数据集上,DINO-YOLO的性能均有显著提升。
- 系统性消融实验揭示了Medium-scale架构与DualP0P3整合策略达到最优性能。
- Small-scale架构需要Triple Integration策略。
- DINO-YOLO的推理开销为基线的2-4倍,但仍在可接受范围内,适合现场部署。
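摘要只说明 DINOv3 特征被注入 P0(输入预处理)与 P3(中段主干)两处,未给出注入方式。下面用一个占位的冻结编码器代替 DINOv3,演示一种常见做法:上采样后按通道拼接,再用 1x1 卷积投影回原通道数;编码器结构、通道数与分辨率均为假设,并非论文原实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureInjection(nn.Module):
    """示意:把冻结的自监督特征上采样后与检测特征按通道拼接,再用1x1卷积投影回原通道数。"""
    def __init__(self, det_ch: int, ssl_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(det_ch + ssl_ch, det_ch, kernel_size=1)

    def forward(self, det_feat: torch.Tensor, ssl_feat: torch.Tensor) -> torch.Tensor:
        ssl_feat = F.interpolate(ssl_feat, size=det_feat.shape[-2:], mode="bilinear", align_corners=False)
        return self.proj(torch.cat([det_feat, ssl_feat], dim=1))

# 占位的"冻结自监督编码器"(假设结构),实际应为 DINOv3 的 ViT 特征
ssl_encoder = nn.Sequential(nn.Conv2d(3, 96, kernel_size=8, stride=8), nn.GELU()).eval()
for p in ssl_encoder.parameters():
    p.requires_grad_(False)

inject_p0 = FeatureInjection(det_ch=3, ssl_ch=96)      # 输入预处理处(P0)
inject_p3 = FeatureInjection(det_ch=256, ssl_ch=96)    # 中段主干处(P3)

img = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    ssl_feat = ssl_encoder(img)                        # (1, 96, 32, 32)
x0 = inject_p0(img, ssl_feat)                          # 增强后的输入,尺寸与原图一致
p3 = torch.randn(1, 256, 32, 32)                       # 假设的 P3 主干特征
x3 = inject_p3(p3, ssl_feat)
print(x0.shape, x3.shape)                              # (1, 3, 256, 256) (1, 256, 32, 32)
```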
MIC-BEV: Multi-Infrastructure Camera Bird’s-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Authors:Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, Tianhui Cai, Jiaqi Ma
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird’s-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
基于基础设施的感知在智能交通系统中起着至关重要的作用,它提供全局态势感知并促进协同自主。然而,由于多视角基础设施设置、不同摄像头配置、视觉输入退化以及各种道路布局等挑战,现有的基于摄像头的检测模型在这种情况下通常表现不佳。我们引入了MIC-BEV,这是一个基于Transformer的鸟瞰图(BEV)感知框架,用于基于基础设施的多摄像头3D目标检测。MIC-BEV灵活支持数量可变、内参和外参各异的摄像头,并在传感器退化的情况下表现出强大的稳健性。MIC-BEV中提出的图增强融合模块通过利用摄像头和BEV单元之间的几何关系以及潜在视觉线索,将多视角图像特征集成到BEV空间中。为了支持训练和评估,我们引入了M2I,这是一个用于基础设施目标检测的合成数据集,具有各种摄像头配置、道路布局和环境条件。在M2I和真实世界数据集RoScenes上的大量实验表明,MIC-BEV在3D目标检测方面达到了最先进的水平,并在极端天气和传感器退化等挑战条件下保持稳健。这些结果突出了MIC-BEV在现实世界部署中的潜力。数据集和源代码可在 https://github.com/HandsomeYun/MIC-BEV 获取。
论文及项目相关链接
Summary
基于基础设施的感知在智能交通系统中扮演重要角色,提供全局态势感知并实现协同自主。然而,现有基于摄像头的检测模型在这种场景下表现不佳,面临多视角基础设施设置、不同摄像头配置、视觉输入退化以及道路布局多样等挑战。为此,我们提出了基于Transformer的俯视图感知框架MIC-BEV,支持多种基础设施摄像头的三维物体检测。MIC-BEV灵活支持不同数量的摄像头和多样化的内外参数,在传感器退化情况下表现出强大的稳健性。其创新的图增强融合模块利用摄像头与俯视图单元之间的几何关系以及潜在视觉线索,将多视角图像特征融合到俯视图空间中。为了支持和评估模型,我们创建了M2I这一合成数据集,涵盖不同的摄像头配置、道路布局和环境条件。在M2I和真实世界数据集RoScenes上的大量实验表明,MIC-BEV在三维物体检测方面达到最新性能水平,并在极端天气和传感器退化等挑战条件下保持稳健。
Key Takeaways
- 基于基础设施的感知在智能交通系统中重要,提供全局态势感知。
- 现有摄像头检测模型在多视角、多样化道路布局等场景下表现不佳。
- MIC-BEV是一个基于Transformer的俯视图感知框架,用于支持多摄像头三维物体检测。
- MIC-BEV表现出强大的稳健性,可灵活支持不同摄像头和内外参数。
- 图增强融合模块利用几何关系和潜在视觉线索进行多视角图像特征融合。
- M2I数据集用于支持训练和评估,模拟多样化场景和环境条件。
Delving into Cascaded Instability: A Lipschitz Continuity View on Image Restoration and Object Detection Synergy
Authors:Qing Zhao, Weijian Deng, Pengxu Wei, ZiYi Dong, Hannan Lu, Xiangyang Ji, Liang Lin
To improve detection robustness in adverse conditions (e.g., haze and low light), image restoration is commonly applied as a pre-processing step to enhance image quality for the detector. However, the functional mismatch between restoration and detection networks can introduce instability and hinder effective integration – an issue that remains underexplored. We revisit this limitation through the lens of Lipschitz continuity, analyzing the functional differences between restoration and detection networks in both the input space and the parameter space. Our analysis shows that restoration networks perform smooth, continuous transformations, while object detectors operate with discontinuous decision boundaries, making them highly sensitive to minor perturbations. This mismatch introduces instability in traditional cascade frameworks, where even imperceptible noise from restoration is amplified during detection, disrupting gradient flow and hindering optimization. To address this, we propose Lipschitz-regularized object detection (LROD), a simple yet effective framework that integrates image restoration directly into the detector’s feature learning, harmonizing the Lipschitz continuity of both tasks during training. We implement this framework as Lipschitz-regularized YOLO (LR-YOLO), extending seamlessly to existing YOLO detectors. Extensive experiments on haze and low-light benchmarks demonstrate that LR-YOLO consistently improves detection stability, optimization smoothness, and overall accuracy.
在恶劣条件(例如雾霾和低光)下,为了提高检测稳健性,通常将图像恢复作为预处理步骤应用于检测器,以提高图像质量。然而,恢复网络和检测网络之间的功能不匹配可能会引入不稳定因素并阻碍有效的集成,这是一个尚未得到充分研究的问题。我们通过利普希茨连续性来分析恢复网络和检测网络在输入空间和参数空间上的功能差异。我们的分析表明,恢复网络执行平滑、连续的转换,而对象检测器则具有不连续的决策边界,使其对微小扰动极为敏感。这种不匹配会给传统的级联框架带来不稳定因素,其中恢复过程中的几乎不可察觉的噪声在检测过程中会被放大,破坏梯度流并阻碍优化。为了解决这一问题,我们提出了Lipschitz正则化对象检测(LROD),这是一个简单有效的框架,直接将图像恢复集成到检测器的特征学习中,在训练过程中协调两个任务之间的Lipschitz连续性。我们将此框架实现为Lipschitz正则化YOLO(LR-YOLO),无缝扩展到现有的YOLO检测器。在雾霾和低光基准测试上的大量实验表明,LR-YOLO在检测稳定性、优化平滑度和总体准确性方面均有所改进。
论文及项目相关链接
PDF NeurIPS 2025
Summary
在恶劣环境下(如雾霾和低光照)提高检测稳健性的研究中,图像修复作为预处理步骤被广泛应用于提高图像质量以供检测器使用。然而,修复和检测网络之间的功能不匹配会引发不稳定并阻碍有效集成,这一问题尚未得到充分探索。本文通过Lipschitz连续性来分析修复和检测网络在输入空间和参数空间中的功能差异。分析表明,修复网络执行平滑、连续的转换,而对象检测器则具有不连续的决策边界,使其对微小扰动高度敏感。这种不匹配在传统级联框架中引入了不稳定性,其中来自修复的几乎无法察觉的噪声在检测过程中会被放大,破坏梯度流并阻碍优化。针对这一问题,我们提出了Lipschitz正则化目标检测(LROD),这是一种简单有效的框架,直接将图像修复集成到检测器的特征学习中,在训练过程中协调两个任务的Lipschitz连续性。我们在YOLO检测器上实现了此框架的Lipschitz正则化版本(LR-YOLO),可无缝集成到现有YOLO检测器中。在雾霾和低光基准测试上的大量实验表明,LR-YOLO在检测稳定性、优化平滑度和总体准确性方面均有所提高。
Key Takeaways
- 图像修复作为预处理步骤在提高检测稳健性方面具有重要意义,特别是在恶劣环境下如雾霾和低光照条件下。
- 修复和检测网络之间的功能不匹配是制约有效集成的主要障碍之一,这一问题尚未得到充分探索。
- 通过Lipschitz连续性分析了修复和检测网络的功能差异,揭示了它们在处理微小扰动时的不同表现。
- 这种功能不匹配可能导致传统级联框架中的不稳定,其中噪声在检测过程中被放大并影响梯度流和优化过程。
- 提出了Lipschitz正则化目标检测(LROD)框架来直接集成图像修复到检测器的特征学习中,通过训练过程协调两个任务的Lipschitz连续性。
- LR-YOLO是LROD框架在YOLO检测器上的实现,可无缝集成到现有YOLO检测器中。
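论文将图像复原整合进检测器的特征学习,并在训练中协调两者的 Lipschitz 连续性,具体损失形式摘要未给出。下面给出一个通用的输入梯度范数惩罚草图(PyTorch),这是鼓励网络在输入空间更平滑(局部 Lipschitz 意义)的常见做法之一,仅作示意,并非 LROD 的原始实现;其中的占位模型与系数均为假设。

```python
import torch
import torch.nn as nn

def input_gradient_penalty(model: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """示意:惩罚输出关于输入的梯度范数,鼓励检测器对微小输入扰动更不敏感。"""
    images = images.clone().requires_grad_(True)
    out = model(images)
    grad = torch.autograd.grad(out.norm(), images, create_graph=True)[0]
    return grad.flatten(1).norm(dim=1).mean()

# 用法示意:总损失 = 检测损失 + lambda * 平滑性惩罚(此处仅演示惩罚项,模型为占位小网络)
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 4, 1))
imgs = torch.randn(2, 3, 64, 64)
penalty = input_gradient_penalty(model, imgs)
(0.1 * penalty).backward()
print(float(penalty))
```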
Improving Visual Discriminability of CLIP for Training-Free Open-Vocabulary Semantic Segmentation
Authors:Jinxin Zhou, Jiachen Jiang, Zhihui Zhu
Extending CLIP models to semantic segmentation remains challenging due to the misalignment between their image-level pre-training objectives and the pixel-level visual understanding required for dense prediction. While prior efforts have achieved encouraging results by reorganizing the final layer and features, they often inherit the global alignment bias of preceding layers, leading to suboptimal segmentation performance. In this work, we propose LHT-CLIP, a novel training-free framework that systematically exploits the visual discriminability of CLIP across layer, head, and token levels. Through comprehensive analysis, we reveal three key insights: (i) the final layers primarily strengthen image-text alignment with sacrifice of visual discriminability (e.g., last 3 layers in ViT-B/16 and 8 layers in ViT-L/14), partly due to the emergence of anomalous tokens; (ii) a subset of attention heads (e.g., 10 out of 144 in ViT-B/16) display consistently strong visual discriminability across datasets; (iii) abnormal tokens display sparse and consistent activation pattern compared to normal tokens. Based on these findings, we propose three complementary techniques: semantic-spatial reweighting, selective head enhancement, and abnormal token replacement to effectively restore visual discriminability and improve segmentation performance without any additional training, auxiliary pre-trained networks, or extensive hyperparameter tuning. Extensive experiments on 8 common semantic segmentation benchmarks demonstrate that LHT-CLIP achieves state-of-the-art performance across diverse scenarios, highlighting its effectiveness and practicality for real-world deployment.
将CLIP模型扩展到语义分割仍然是一个挑战,因为图像级的预训练目标与密集预测所需的像素级视觉理解之间存在不匹配。尽管先前的努力通过重新组织最后一层和特征取得了令人鼓舞的结果,但它们往往继承了前面各层的全局对齐偏差,导致分割性能不佳。在这项工作中,我们提出了LHT-CLIP,这是一种新的无训练框架,系统地利用CLIP在层、头和令牌三个层次上的视觉辨别力。通过综合分析,我们揭示了三个关键见解:(i)最后几层主要强化图像-文本对齐,但以牺牲视觉辨别力为代价(例如ViT-B/16的最后3层和ViT-L/14的最后8层),部分原因是异常令牌的出现;(ii)一小部分注意力头(例如ViT-B/16的144个注意力头中的10个)在各数据集上始终表现出强大的视觉辨别力;(iii)异常令牌与正常令牌相比,显示出稀疏且一致的激活模式。基于这些发现,我们提出了三种互补技术:语义空间重加权、选择性头部增强和异常令牌替换,在无需任何额外训练、辅助预训练网络或大量超参数调整的情况下,有效恢复视觉辨别力并提高分割性能。在8个常见的语义分割基准测试上的实验表明,LHT-CLIP在多种场景中都达到了最先进的性能,凸显了其在现实世界部署中的有效性和实用性。
论文及项目相关链接
PDF 23 pages, 10 figures, 14 tables
Summary
本文提出一种无需训练的LHT-CLIP框架,通过系统地利用CLIP在层、头和令牌三个层级的视觉辨别能力,来应对CLIP模型在语义分割上的挑战。以往方法虽通过重组最终层及其特征取得了一定成果,但仍继承了前面各层的全局对齐偏差。本文经过全面分析揭示了三个关键见解,并在此基础上提出三种互补技术,在无需额外训练的情况下恢复视觉辨别能力并提升分割性能。
Key Takeaways
- CLIP模型在语义分割上仍面临挑战,因为图像级别的预训练目标与像素级别的视觉理解之间存在不匹配。
- 通过对最终层及特征的重组,之前的努力已经取得了一些成果,但仍存在全局对齐偏差的问题,导致分割性能不佳。
- 本文提出的LHT-CLIP框架系统地利用CLIP在不同层级(层、头和令牌)的视觉辨别能力。
- 研究发现:最终层主要强化图像文本对齐,但牺牲了视觉辨别能力;部分注意力头在所有数据集中始终表现出强大的视觉辨别能力;异常令牌具有稀疏且一致的激活模式。
- 基于上述发现,本文提出三种互补技术:语义空间重新加权、选择性头部增强和异常令牌替换,以恢复视觉辨别能力并提高分割性能。
- 在8个常见的语义分割基准测试上,LHT-CLIP实现了最先进的性能,展示了其在现实世界部署中的有效性和实用性。
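"异常令牌替换"的具体判据摘要中未给出。下面是一个示意草图(PyTorch),假设以 token 范数显著偏离均值作为异常判据,并用 3x3 空间邻域平均值替换异常 token;阈值与判据均为假设,并非论文原实现。

```python
import torch
import torch.nn.functional as F

def replace_abnormal_tokens(tokens: torch.Tensor, h: int, w: int, thr: float = 3.0) -> torch.Tensor:
    """示意:把范数显著偏离均值的"异常token"替换为其3x3空间邻域的平均值。
    tokens: (B, N, D),N = h*w 个patch token(不含CLS);thr 为标准差倍数阈值(假设的判据)。"""
    B, N, D = tokens.shape
    norms = tokens.norm(dim=-1)                                       # (B, N)
    mean, std = norms.mean(dim=1, keepdim=True), norms.std(dim=1, keepdim=True)
    abnormal = norms > mean + thr * std                               # (B, N) 异常掩码
    grid = tokens.view(B, h, w, D).permute(0, 3, 1, 2)                # (B, D, h, w)
    neighbor_avg = F.avg_pool2d(grid, kernel_size=3, stride=1, padding=1)
    neighbor_avg = neighbor_avg.permute(0, 2, 3, 1).reshape(B, N, D)
    return torch.where(abnormal.unsqueeze(-1), neighbor_avg, tokens)

# 用法示意:ViT-B/16 在 224x224 输入下有 14x14 个patch token
x = torch.randn(2, 196, 768)
x[0, 5] *= 50                                                         # 人为制造一个异常token
print(replace_abnormal_tokens(x, 14, 14).shape)                       # torch.Size([2, 196, 768])
```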
DQ3D: Depth-guided Query for Transformer-Based 3D Object Detection in Traffic Scenarios
Authors:Ziyu Wang, Wenhao Li, Ji Wu
3D object detection from multi-view images in traffic scenarios has garnered significant attention in recent years. Many existing approaches rely on object queries that are generated from 3D reference points to localize objects. However, a limitation of these methods is that some reference points are often far from the target object, which can lead to false positive detections. In this paper, we propose a depth-guided query generator for 3D object detection (DQ3D) that leverages depth information and 2D detections to ensure that reference points are sampled from the surface or interior of the object. Furthermore, to address partially occluded objects in current frame, we introduce a hybrid attention mechanism that fuses historical detection results with depth-guided queries, thereby forming hybrid queries. Evaluation on the nuScenes dataset demonstrates that our method outperforms the baseline by 6.3% in terms of mean Average Precision (mAP) and 4.3% in the NuScenes Detection Score (NDS).
近年来,交通场景中从多视角图像进行3D目标检测已经引起了广泛关注。许多现有方法依赖于从3D参考点生成的目标查询来定位目标。然而,这些方法的局限性在于,某些参考点往往远离目标物体,从而导致误报。在本文中,我们提出了一种用于3D目标检测的深度引导查询生成器(DQ3D),它利用深度信息和2D检测结果,确保参考点从物体表面或内部采样。此外,为了解决当前帧中部分遮挡物体的问题,我们引入了一种混合注意机制,该机制将历史检测结果与深度引导查询相融合,形成混合查询。在nuScenes数据集上的评估表明,我们的方法在平均精度(mAP)上比基线高出6.3%,在NuScenes检测分数(NDS)上高出4.3%。
论文及项目相关链接
Summary
本文提出了一种深度引导查询生成器(DQ3D)用于交通场景中的三维物体检测。该方法利用深度信息和二维检测结果确保参考点从物体表面或内部采样,并引入混合注意力机制融合历史检测结果与深度引导查询,以处理当前帧中的部分遮挡物体。在nuScenes数据集上的评估表明,该方法在平均精度(mAP)和NuScenes检测得分(NDS)上分别提高了6.3%和4.3%。
Key Takeaways
- 该论文提出了一种新的深度引导查询生成器(DQ3D)用于三维物体检测。
- DQ3D利用深度信息和二维检测结果确保参考点更准确,减少误检。
- 针对部分遮挡物体,引入混合注意力机制融合历史检测结果与深度引导查询,形成混合查询。
- 方法在nuScenes数据集上进行评估。
- 与基线方法相比,该方法在平均精度(mAP)上提高了6.3%。
- 在NuScenes检测得分(NDS)上提高了4.3%。
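下面给出"深度引导参考点"的一个几何示意草图(PyTorch):取 2D 检测框中心处的深度,用相机内参把像素坐标反投影为相机坐标系下的 3D 参考点。采样位置(框中心)与内参数值均为假设,仅用于说明思路,并非论文原实现。

```python
import torch

def depth_guided_reference_points(boxes_2d: torch.Tensor, depth_map: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """示意:取2D检测框中心处的深度,用相机内参把像素坐标反投影为相机坐标系下的3D参考点。
    boxes_2d: (N, 4) [x1, y1, x2, y2];depth_map: (H, W);K: (3, 3) 相机内参。"""
    cx = (boxes_2d[:, 0] + boxes_2d[:, 2]) / 2
    cy = (boxes_2d[:, 1] + boxes_2d[:, 3]) / 2
    u = cx.long().clamp(0, depth_map.shape[1] - 1)
    v = cy.long().clamp(0, depth_map.shape[0] - 1)
    z = depth_map[v, u]                                   # 框中心处的深度,使参考点落在目标表面或内部
    x = (cx - K[0, 2]) * z / K[0, 0]
    y = (cy - K[1, 2]) * z / K[1, 1]
    return torch.stack([x, y, z], dim=1)                  # (N, 3) 3D参考点

# 用法示意(随机深度图,内参为假设值)
K = torch.tensor([[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]])
boxes = torch.tensor([[100.0, 200.0, 180.0, 320.0], [640.0, 300.0, 720.0, 420.0]])
depth = torch.rand(900, 1600) * 50
print(depth_guided_reference_points(boxes, depth, K).shape)   # torch.Size([2, 3])
```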
Diffusion-Driven Two-Stage Active Learning for Low-Budget Semantic Segmentation
Authors:Jeongin Kim, Wonho Bae, YouLee Han, Giyeong Oh, Youngjae Yu, Danica J. Sutherland, Junhyug Noh
Semantic segmentation demands dense pixel-level annotations, which can be prohibitively expensive - especially under extremely constrained labeling budgets. In this paper, we address the problem of low-budget active learning for semantic segmentation by proposing a novel two-stage selection pipeline. Our approach leverages a pre-trained diffusion model to extract rich multi-scale features that capture both global structure and fine details. In the first stage, we perform a hierarchical, representation-based candidate selection by first choosing a small subset of representative pixels per image using MaxHerding, and then refining these into a diverse global pool. In the second stage, we compute an entropy-augmented disagreement score (eDALD) over noisy multi-scale diffusion features to capture both epistemic uncertainty and prediction confidence, selecting the most informative pixels for annotation. This decoupling of diversity and uncertainty lets us achieve high segmentation accuracy with only a tiny fraction of labeled pixels. Extensive experiments on four benchmarks (CamVid, ADE-Bed, Cityscapes, and Pascal-Context) demonstrate that our method significantly outperforms existing baselines under extreme pixel-budget regimes. Our code is available at https://github.com/jn-kim/two-stage-edald.
语义分割需要密集的像素级标注,这可能会非常昂贵——特别是在标注预算极其有限的情况下。本文解决了低预算主动学习中语义分割的问题,通过提出一种新型的两阶段选择流程。我们的方法利用预训练的扩散模型来提取丰富的多尺度特征,这些特征既能捕捉全局结构,又能捕捉细节。在第一阶段,我们基于表示进行分层候选选择,首先使用MaxHerding选择每幅图像中具有代表性的小部分像素,然后将其细化为多样化的全局池。在第二阶段,我们计算带有熵增强的分歧分数(eDALD),通过对噪声多尺度扩散特征进行建模,以捕捉认知不确定性和预测置信度,选择信息量最大的像素进行标注。多样性和不确定性的这种解耦让我们仅使用一小部分标记像素就能实现高分割精度。在CamVid、ADE-Bed、Cityscapes和Pascal-Context四个基准测试上的广泛实验表明,在极端像素预算条件下,我们的方法显著优于现有基线。我们的代码位于https://github.com/jn-kim/two-stage-edald。
论文及项目相关链接
PDF Accepted to NeurIPS 2025
Summary
本文解决了低预算下语义分割的主动学习问题,提出了一种新颖的两阶段选择管道。该方法利用预训练的扩散模型提取丰富的多尺度特征,第一阶段进行基于表示的分层候选选择,第二阶段计算包含认知不确定性和预测置信度的增强分歧分数(eDALD),选择最具信息量的像素进行标注。这种方法实现了高分割精度,且仅使用极少量的标注像素。
Key Takeaways
- 本文解决了低预算下语义分割的主动学习问题。
- 提出了一种新颖的两阶段选择管道,利用预训练的扩散模型提取多尺度特征。
- 第一阶段进行基于表示的分层候选选择。
- 第二阶段计算包含认知不确定性和预测置信度的增强分歧分数(eDALD)。
- 通过这种方式,实现了高分割精度,并且仅使用极少量的标注像素。
- 在四个基准数据集上的实验证明,该方法在极端像素预算下显著优于现有基线。
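eDALD 的精确定义摘要中未给出。下面是一个示意草图(PyTorch),假设对来自多尺度/带噪扩散特征的多组预测,同时累加平均预测的熵(预测置信度)与各组预测相对均值的偏差(认知不确定性),并据此选取信息量最大的像素;打分形式为假设,并非论文原实现。

```python
import torch
import torch.nn.functional as F

def entropy_augmented_disagreement(logits_list) -> torch.Tensor:
    """示意:对来自多尺度/带噪特征的多组逐像素logits,累加平均预测的熵与各组预测相对均值的偏差。
    logits_list: 长度为S的列表,每个元素形状 (C, H, W);返回 (H, W) 的逐像素得分。"""
    probs = torch.stack([F.softmax(l, dim=0) for l in logits_list])    # (S, C, H, W)
    mean_p = probs.mean(0)
    entropy = -(mean_p * mean_p.clamp_min(1e-8).log()).sum(0)          # 平均预测的熵(预测置信度)
    disagreement = (probs - mean_p).pow(2).sum(1).mean(0)              # 多组预测间的分歧(认知不确定性)
    return entropy + disagreement

# 用法示意:为每张图选取得分最高的 k 个像素送去标注
scores = entropy_augmented_disagreement([torch.randn(21, 64, 64) for _ in range(3)])
topk = scores.flatten().topk(10).indices
print(topk.shape)   # torch.Size([10])
```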
Comparative Analysis of Object Detection Algorithms for Surface Defect Detection
Authors:Arpan Maity, Tamal Ghosh
This article compares the performance of six prominent object detection algorithms, YOLOv11, RetinaNet, Fast R-CNN, YOLOv8, RT-DETR, and DETR, on the NEU-DET surface defect detection dataset, comprising images representing various metal surface defects, a crucial application in industrial quality control. Each model’s performance was assessed regarding detection accuracy, speed, and robustness across different defect types such as scratches, inclusions, and rolled-in scales. YOLOv11, a state-of-the-art real-time object detection algorithm, demonstrated superior performance compared to the other methods, achieving a remarkable 70% higher accuracy on average. This improvement can be attributed to YOLOv11s enhanced feature extraction capabilities and ability to process the entire image in a single forward pass, making it faster and more efficient in detecting minor surface defects. Additionally, YOLOv11’s architecture optimizations, such as improved anchor box generation and deeper convolutional layers, contributed to more precise localization of defects. In conclusion, YOLOv11’s outstanding performance in accuracy and speed solidifies its position as the most effective model for surface defect detection on the NEU dataset, surpassing competing algorithms by a substantial margin.
本文在NEU-DET表面缺陷检测数据集上比较了六种主流目标检测算法(YOLOv11、RetinaNet、Fast R-CNN、YOLOv8、RT-DETR和DETR)的性能。该数据集包含代表各种金属表面缺陷的图像,是工业质量控制中的一项重要应用。文中评估了每个模型在检测精度、速度以及对划痕、夹杂物和轧入氧化皮等不同缺陷类型的稳健性方面的表现。YOLOv11是最先进的实时目标检测算法,与其他方法相比性能更为优越,平均准确度高出70%。这一改进可归因于YOLOv11增强的特征提取能力,以及能够在单次前向传递中处理整个图像的能力,使其在检测微小表面缺陷时更快、更高效。此外,YOLOv11的架构优化,如改进的锚框生成和更深的卷积层,有助于更精确地定位缺陷。总之,YOLOv11在精度和速度方面的出色表现,使其成为NEU数据集上表面缺陷检测的最有效模型,大幅超越了其他算法。
论文及项目相关链接
PDF 14 pages, 8 figures
Summary
本文主要比较了六种主流目标检测算法在NEU-DET表面缺陷检测数据集上的表现,包括YOLOv11、RetinaNet、Fast R-CNN、YOLOv8、RT-DETR和DETR。研究结果显示,YOLOv11在检测精度和速度方面表现出卓越性能,平均准确率比其它方法高出70%。这得益于YOLOv11强大的特征提取能力、快速高效的图像处理能力以及优化的架构,如改进锚框生成和更深的卷积层。因此,YOLOv11被确认为NEU数据集上表面缺陷检测的最有效模型。
Key Takeaways
- 文章对比了六种主流目标检测算法在表面缺陷检测数据集上的表现。
- YOLOv11在检测精度和速度方面表现出卓越性能。
- YOLOv11的平均准确率比其它方法高出70%。
- YOLOv11强大的特征提取能力和快速高效的图像处理能力是提高性能的关键。
- YOLOv11的架构优化,如改进锚框生成和更深的卷积层,有助于更精确地定位缺陷。
- YOLOv11被确认为NEU数据集上表面缺陷检测的最有效模型。
S3OD: Towards Generalizable Salient Object Detection with Synthetic Data
Authors:Orest Kupyn, Hirokatsu Kataoka, Christian Rupprecht
Salient object detection exemplifies data-bounded tasks where expensive pixel-precise annotations force separate model training for related subtasks like DIS and HR-SOD. We present a method that dramatically improves generalization through large-scale synthetic data generation and ambiguity-aware architecture. We introduce S3OD, a dataset of over 139,000 high-resolution images created through our multi-modal diffusion pipeline that extracts labels from diffusion and DINO-v3 features. The iterative generation framework prioritizes challenging categories based on model performance. We propose a streamlined multi-mask decoder that naturally handles the inherent ambiguity in salient object detection by predicting multiple valid interpretations. Models trained solely on synthetic data achieve 20-50% error reduction in cross-dataset generalization, while fine-tuned versions reach state-of-the-art performance across DIS and HR-SOD benchmarks.
显著性目标检测是数据受限任务的典型示例,昂贵的像素级精确标注迫使人们为DIS和HR-SOD等相关子任务分别训练模型。我们提出了一种通过大规模合成数据生成和歧义感知架构来显著提高泛化能力的方法。我们引入了S3OD,这是一个由我们的多模态扩散管道创建的高分辨率图像数据集,包含超过139,000张图像,其标签从扩散特征和DINO-v3特征中提取。迭代生成框架根据模型性能优先处理具有挑战性的类别。我们提出了一种简化的多掩膜解码器,通过预测多个有效解释来自然处理显著性目标检测中的固有歧义。仅使用合成数据训练的模型在跨数据集泛化方面实现了20-50%的误差降低,而微调版本在DIS和HR-SOD基准测试中达到了最先进的性能。
论文及项目相关链接
Summary
本文针对显著目标检测这一数据受限任务,通过大规模合成数据生成和歧义感知架构提高了模型的泛化能力。提出了S3OD数据集,包含超过13.9万张高分辨率图像,通过多模态扩散管道提取标签。仅用合成数据训练的模型在跨数据集泛化方面实现了20-50%的误差降低,微调后的版本在DIS和HR-SOD基准测试中达到了最先进的性能。
Key Takeaways
- 显著目标检测是数据受限任务的典型示例,需要昂贵的像素级精确标注来训练模型。
- 提出了一种通过大规模合成数据生成提高模型泛化能力的方法。
- 介绍了S3OD数据集,包含超过13万张高分辨率图像,通过多模态扩散管道和DINO-v3特性提取标签。
- 迭代生成框架优先处理基于模型性能的具有挑战性的类别。
- 提出了一种简化的多掩膜解码器,通过预测多个有效解释来自然处理显著目标检测中的固有歧义。
- 仅在合成数据上训练的模型在跨数据集泛化方面实现了20-50%的错误率降低。
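下面是"多掩膜解码器"一种常见实现思路的示意草图(PyTorch):一次预测 K 个候选掩膜,训练时只对与真值最匹配的那个回传损失(winner-takes-all),从而允许多种合理解释共存;K 的取值与损失形式均为假设,并非论文原实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiMaskHead(nn.Module):
    """示意:一次预测 K 个候选掩膜,用于表达显著目标的固有歧义。"""
    def __init__(self, in_ch: int, num_masks: int = 3):
        super().__init__()
        self.head = nn.Conv2d(in_ch, num_masks, kernel_size=1)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        return self.head(feat)                            # (B, K, H, W) 掩膜logits

def min_over_masks_loss(mask_logits: torch.Tensor, gt_mask: torch.Tensor) -> torch.Tensor:
    """对 K 个预测分别计算BCE,只保留最小者(winner-takes-all),允许多种合理解释共存。"""
    B, K, H, W = mask_logits.shape
    gt = gt_mask.unsqueeze(1).expand(-1, K, -1, -1)
    losses = F.binary_cross_entropy_with_logits(mask_logits, gt, reduction="none").mean(dim=(2, 3))
    return losses.min(dim=1).values.mean()

# 用法示意(随机数据)
head = MultiMaskHead(in_ch=64, num_masks=3)
feat, gt = torch.randn(2, 64, 128, 128), (torch.rand(2, 128, 128) > 0.5).float()
loss = min_over_masks_loss(head(feat), gt)
loss.backward()
print(float(loss))
```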
BioDet: Boosting Industrial Object Detection with Image Preprocessing Strategies
Authors:Jiaqi Hu, Hongli Xu, Junwen Huang, Peter KT Yu, Slobodan Ilic, Benjamin Busam
Accurate 6D pose estimation is essential for robotic manipulation in industrial environments. Existing pipelines typically rely on off-the-shelf object detectors followed by cropping and pose refinement, but their performance degrades under challenging conditions such as clutter, poor lighting, and complex backgrounds, making detection the critical bottleneck. In this work, we introduce a standardized and plug-in pipeline for 2D detection of unseen objects in industrial settings. Based on current SOTA baselines, our approach reduces domain shift and background artifacts through low-light image enhancement and background removal guided by open-vocabulary detection with foundation models. This design suppresses the false positives prevalent in raw SAM outputs, yielding more reliable detections for downstream pose estimation. Extensive experiments on real-world industrial bin-picking benchmarks from BOP demonstrate that our method significantly boosts detection accuracy while incurring negligible inference overhead, showing the effectiveness and practicality of the proposed method.
精确的6D姿态估计对于工业环境中的机器人操作至关重要。现有的流程通常依赖现成的目标检测器,随后进行裁剪和姿态优化,但在杂乱、光线不足和复杂背景等挑战条件下,它们的性能会下降,使检测成为关键瓶颈。在这项工作中,我们为工业环境中未见对象的二维检测引入了一种标准化、可插拔的流程。基于当前最先进的基线方法,我们的方法通过低光图像增强,以及由基础模型的开放词汇检测引导的背景去除,来减少领域偏移和背景伪影。这种设计抑制了原始SAM输出中普遍存在的误报,为下游姿态估计提供了更可靠的检测结果。在来自BOP的真实世界工业料箱抓取(bin-picking)基准测试上的大量实验表明,我们的方法在几乎不增加推理开销的情况下显著提高了检测精度,证明了所提出方法的有效性和实用性。
论文及项目相关链接
PDF 8 pages, accepted by ICCV 2025 R6D
Summary
本论文提出了一种针对工业环境中未见物体标准化与插件式的二维检测方案。该方案解决了现有流程在复杂环境下性能下降的问题,如混乱、低光照和复杂背景等。通过低光照图像增强和背景去除技术,结合开放词汇检测与基础模型引导,减少了领域偏移和背景伪影。此设计抑制了原始SAM输出中的误报,为下游姿态估计提供了更可靠的检测结果。在真实工业环境中的拣选基准测试显示,该方法显著提高了检测精度,同时推理开销几乎可以忽略不计。
Key Takeaways
- 提出了一种针对工业环境中未见物体的标准化和插件式二维检测方案。
- 该方案解决了现有流程在复杂环境下的性能下降问题。
- 通过低光照图像增强和背景去除技术减少领域偏移和背景伪影。
- 利用开放词汇检测与基础模型引导技术提高了检测准确性。
- 减少了原始SAM输出的误报,为下游姿态估计提供了更可靠的检测结果。
- 在真实工业环境中的实验显示,该方法显著提高了检测精度。
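下面用纯 NumPy 给出预处理思路的极简示意:低光增强(此处用最简单的伽马校正代替)与基于前景掩码的背景抹除。实际方法中的增强模块与掩码来自基础模型的开放词汇检测与分割,这里的函数、参数与占位掩码均为假设。

```python
import numpy as np

def enhance_low_light(img: np.ndarray, gamma: float = 0.5) -> np.ndarray:
    """示意:最简单的伽马校正提亮,img 为 HxWx3 的 uint8 图像;实际的低光增强模块可替换此处。"""
    x = (img.astype(np.float32) / 255.0) ** gamma
    return (x * 255.0).astype(np.uint8)

def remove_background(img: np.ndarray, fg_mask: np.ndarray, fill: int = 127) -> np.ndarray:
    """示意:用前景掩码把背景抹成常数;实际中掩码应来自开放词汇检测与分割基础模型。"""
    out = img.copy()
    out[~fg_mask.astype(bool)] = fill
    return out

# 用法示意(随机"偏暗"图像与占位前景掩码)
img = (np.random.rand(480, 640, 3) * 60).astype(np.uint8)
mask = np.zeros((480, 640), dtype=np.uint8)
mask[100:300, 200:400] = 1
clean = remove_background(enhance_low_light(img), mask)
print(clean.shape, clean.dtype)   # (480, 640, 3) uint8
```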
LEGNet: A Lightweight Edge-Gaussian Network for Low-Quality Remote Sensing Image Object Detection
Authors:Wei Lu, Si-Bao Chen, Hui-Dong Li, Qing-Ling Shu, Chris H. Q. Ding, Jin Tang, Bin Luo
Remote sensing object detection (RSOD) often suffers from degradations such as low spatial resolution, sensor noise, motion blur, and adverse illumination. These factors diminish feature distinctiveness, leading to ambiguous object representations and inadequate foreground-background separation. Existing RSOD methods exhibit limitations in robust detection of low-quality objects. To address these pressing challenges, we introduce LEGNet, a lightweight backbone network featuring a novel Edge-Gaussian Aggregation (EGA) module specifically engineered to enhance feature representation derived from low-quality remote sensing images. EGA module integrates: (a) orientation-aware Scharr filters to sharpen crucial edge details often lost in low-contrast or blurred objects, and (b) Gaussian-prior-based feature refinement to suppress noise and regularize ambiguous feature responses, enhancing foreground saliency under challenging conditions. EGA module alleviates prevalent problems in reduced contrast, structural discontinuities, and ambiguous feature responses prevalent in degraded images, effectively improving model robustness while maintaining computational efficiency. Comprehensive evaluations across five benchmarks (DOTA-v1.0, v1.5, DIOR-R, FAIR1M-v1.0, and VisDrone2019) demonstrate that LEGNet achieves state-of-the-art performance, particularly in detecting low-quality objects.The code is available at https://github.com/AeroVILab-AHU/LEGNet.
遥感目标检测(RSOD)经常受到空间分辨率低、传感器噪声、运动模糊和恶劣照明等问题的困扰。这些因素降低了特征的区分度,导致目标表示不明确,前景与背景分离不足。现有的RSOD方法在检测低质量目标时表现出局限性。为了应对这些紧迫挑战,我们引入了LEGNet,这是一个具有新型边缘高斯聚合(EGA)模块的轻量级主干网络,专门设计用于增强从低质量遥感图像中派生的特征表示。EGA模块集成了:(a)定向感知的Scharr滤波器,用于锐化低对比度或模糊目标中经常丢失的关键边缘细节;(b)基于高斯先验的特征细化,用于抑制噪声并规范模糊的特征响应,从而在具有挑战性的条件下增强前景显著性。EGA模块减轻了退化图像中常见的对比度降低、结构不连续和特征响应模糊的问题,在保持计算效率的同时有效提高了模型的鲁棒性。在五个基准测试(DOTA-v1.0、v1.5、DIOR-R、FAIR1M-v1.0和VisDrone2019)上的综合评估表明,LEGNet实现了最先进的性能,特别是在检测低质量目标方面。代码可在https://github.com/AeroVILab-AHU/LEGNet找到。
论文及项目相关链接
PDF 19 pages, 9 figures. Accepted by ICCV 2025 Workshop
Summary
本文介绍了针对遥感物体检测中遇到的低空间分辨率、传感器噪声、运动模糊和不良照明等问题,提出了一种名为LEGNet的轻量级骨干网络。该网络包含一个新型的边缘高斯聚合(EGA)模块,用于增强从低质量遥感图像中提取的特征表示。EGA模块通过集成方向感知的Scharr滤波器和基于高斯先验的特征优化,提高了边缘细节的清晰度,并抑制噪声和模糊特征响应,从而提高了前景在挑战条件下的显著性。在五个基准测试集上的综合评估表明,LEGNet在检测低质量物体方面取得了最新技术水平的性能。
Key Takeaways
- 遥感物体检测(RSOD)面临低空间分辨率、传感器噪声、运动模糊和不良照明等挑战。
- 特征辨识度的降低导致物体表示不明确,前景背景分离不足。
- 现有RSOD方法在检测低质量物体时存在局限性。
- LEGNet是一种轻量级骨干网络,包含新型EGA模块,旨在增强从低质量遥感图像中派生的特征表示。
- EGA模块通过集成方向感知的Scharr滤波器,提高了边缘细节的清晰度。
- 基于高斯先验的特征优化有助于抑制噪声和模糊特征响应。
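下面是 EGA 模块思路的一个示意草图(PyTorch):用固定的 Scharr 核提取 x/y 方向的边缘强度,用固定高斯核做平滑以体现高斯先验,再与原特征拼接后以 1x1 卷积残差融合;通道组织与融合方式均为假设,并非论文原实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def _gaussian_kernel(ksize: int = 5, sigma: float = 1.0) -> torch.Tensor:
    ax = torch.arange(ksize, dtype=torch.float32) - ksize // 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return (k / k.sum()).view(1, 1, ksize, ksize)

class EdgeGaussianAggregation(nn.Module):
    """示意:固定Scharr核提取方向边缘强度 + 固定高斯核平滑(高斯先验),再与原特征残差融合。"""
    def __init__(self, channels: int):
        super().__init__()
        scharr_x = torch.tensor([[-3.0, 0.0, 3.0], [-10.0, 0.0, 10.0], [-3.0, 0.0, 3.0]])
        self.register_buffer("kx", scharr_x.view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer("ky", scharr_x.t().view(1, 1, 3, 3).repeat(channels, 1, 1, 1))
        self.register_buffer("gauss", _gaussian_kernel().repeat(channels, 1, 1, 1))
        self.fuse = nn.Conv2d(channels * 2, channels, kernel_size=1)
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gx = F.conv2d(x, self.kx, padding=1, groups=self.channels)
        gy = F.conv2d(x, self.ky, padding=1, groups=self.channels)
        edge = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)                            # 边缘强度,锐化弱对比目标
        smooth = F.conv2d(x, self.gauss, padding=2, groups=self.channels)      # 高斯平滑,抑制噪声响应
        return x + self.fuse(torch.cat([edge, smooth], dim=1))                 # 残差式聚合

# 用法示意
ega = EdgeGaussianAggregation(channels=32)
print(ega(torch.randn(1, 32, 64, 64)).shape)   # torch.Size([1, 32, 64, 64])
```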
Circle Representation for Medical Instance Object Segmentation
Authors:Juming Xiong, Ethan H. Nguyen, Yilin Liu, Ruining Deng, Regina N Tyree, Hernan Correa, Girish Hiremath, Yaohong Wang, Haichun Yang, Agnes B. Fogo, Yuankai Huo
Recently, circle representation has been introduced for medical imaging, designed specifically to enhance the detection of instance objects that are spherically shaped (e.g., cells, glomeruli, and nuclei). Given its outstanding effectiveness in instance detection, it is compelling to consider the application of circle representation for segmenting instance medical objects. In this study, we introduce CircleSnake, a simple end-to-end segmentation approach that utilizes circle contour deformation for segmenting ball-shaped medical objects at the instance level. The innovation of CircleSnake lies in these three areas: (1) It substitutes the complex bounding box-to-octagon contour transformation with a more consistent and rotation-invariant bounding circle-to-circle contour adaptation. This adaptation specifically targets ball-shaped medical objects. (2) The circle representation employed in CircleSnake significantly reduces the degrees of freedom to two, compared to eight in the octagon representation. This reduction enhances both the robustness of the segmentation performance and the rotational consistency of the method. (3) CircleSnake is the first end-to-end deep instance segmentation pipeline to incorporate circle representation, encompassing consistent circle detection, circle contour proposal, and circular convolution in a unified framework. This integration is achieved through the novel application of circular graph convolution within the context of circle detection and instance segmentation. In practical applications, such as the detection of glomeruli, nuclei, and eosinophils in pathological images, CircleSnake has demonstrated superior performance and greater rotation invariance when compared to benchmarks. The code has been made publicly available: https://github.com/hrlblab/CircleSnake.
最近,圆形表示法被引入到医学成像中,专门用于增强球形实例对象(如细胞、肾小球和细胞核)的检测。鉴于其在实例检测中的出色效果,考虑将圆形表示法应用于医学实例对象的分割是非常有吸引力的。在本研究中,我们引入了CircleSnake,这是一种简单的端到端分割方法,利用圆形轮廓变形来对球形医学对象进行实例级别的分割。CircleSnake的创新之处在于以下三个方面:(1)它用更一致且旋转不变的外接圆到圆形轮廓适应,替代了复杂的边界框到八边形轮廓变换,这种适应专门针对球形医学对象。(2)CircleSnake使用的圆形表示法将自由度从八边形表示的八个降低到两个,这种降低增强了分割性能的鲁棒性和方法的旋转一致性。(3)CircleSnake是第一个采用圆形表示的端到端深度实例分割管道,在统一框架中涵盖一致的圆形检测、圆形轮廓建议和圆形卷积。这种集成是通过在圆形检测和实例分割中新颖地应用圆形图卷积来实现的。在实际应用中,如在病理图像中检测肾小球、细胞核和嗜酸性粒细胞,与基准方法相比,CircleSnake表现出更优的性能和更强的旋转不变性。代码已公开发布:https://github.com/hrlblab/CircleSnake
论文及项目相关链接
Summary
本文介绍了CircleSnake方法,这是一种针对球形医疗对象实例级分割的端到端分割方法。它采用圆形轮廓变形技术,以外接圆到圆形轮廓的适应取代了复杂的边界框到八边形轮廓变换,显著提高了分割性能的鲁棒性和旋转一致性。CircleSnake是首个将圆形检测、圆形轮廓提议和圆形卷积整合在统一框架中的深度实例分割管道。在病理图像中的肾小球、细胞核和嗜酸性粒细胞检测等实际应用中,与基准方法相比,CircleSnake表现出了卓越的性能和更高的旋转不变性。
Key Takeaways
- CircleSnake是一种针对球形医疗对象实例级分割的端到端方法。
- 它采用圆形轮廓变形技术,以外接圆到圆形轮廓的适应取代了边界框到八边形轮廓变换。
- 圆形表示法减少了自由度,提高了分割性能的鲁棒性和旋转一致性。
- CircleSnake是首个将圆形表示整合到深度实例分割管道中的方法。
- 在实际应用中,CircleSnake相对于基准测试表现出卓越性能和旋转不变性。
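下面给出圆形轮廓初始化的一个几何示意草图(PyTorch):由检测到的圆心与半径在圆周上均匀采样 N 个顶点,作为后续轮廓变形(如圆形图卷积)的初始轮廓;N 的取值与示例数据均为假设,并非论文原实现。

```python
import math
import torch

def init_circle_contour(center: torch.Tensor, radius: torch.Tensor, num_points: int = 32) -> torch.Tensor:
    """在圆周上均匀采样轮廓点。center: (B, 2) 圆心 (x, y);radius: (B,);返回 (B, N, 2)。"""
    theta = torch.arange(num_points) * (2 * math.pi / num_points)          # N 个等间隔角度
    ring = torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)       # (N, 2) 单位圆
    return center[:, None, :] + radius[:, None, None] * ring[None]         # 广播到 (B, N, 2)

# 用法示意:两个检测到的圆(圆心与半径为假设数据)
pts = init_circle_contour(torch.tensor([[64.0, 64.0], [30.0, 40.0]]), torch.tensor([20.0, 8.0]))
print(pts.shape)   # torch.Size([2, 32, 2])
```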