⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-11-16 更新
DGFusion: Dual-guided Fusion for Robust Multi-Modal 3D Object Detection
Authors:Feiyang Jia, Caiyan Jia, Ailin Liu, Shaoqing Xu, Qiming Xia, Lin Liu, Lei Yang, Yan Gong, Ziying Song
As a critical task in autonomous driving perception systems, 3D object detection is used to identify and track key objects, such as vehicles and pedestrians. However, detecting distant, small, or occluded objects (hard instances) remains a challenge, which directly compromises the safety of autonomous driving systems. We observe that existing multi-modal 3D object detection methods often follow a single-guided paradigm, failing to account for the differences in information density of hard instances between modalities. In this work, we propose DGFusion, based on the Dual-guided paradigm, which fully inherits the advantages of the Point-guide-Image paradigm and integrates the Image-guide-Point paradigm to address the limitations of the single paradigms. The core of DGFusion, the Difficulty-aware Instance Pair Matcher (DIPM), performs instance-level feature matching based on difficulty to generate easy and hard instance pairs, while the Dual-guided Modules exploit the advantages of both pair types to enable effective multi-modal feature fusion. Experimental results demonstrate that our DGFusion outperforms the baseline methods, with respective improvements of +1.0% mAP, +0.8% NDS, and +1.3% average recall on nuScenes. Extensive experiments demonstrate consistent robustness gains for hard instance detection across ego-distance, size, visibility, and small-scale training scenarios.
在自动驾驶感知系统中,作为关键任务之一的3D物体检测用于识别和跟踪车辆、行人等关键物体。然而,检测远距离、小型或被遮挡的物体(硬实例)仍然是一个挑战,这直接影响自动驾驶系统的安全性。我们观察到,现有的多模态3D物体检测方法通常遵循单一指导范式,未能考虑不同模态之间硬实例信息密度的差异。在这项工作中,我们提出了基于双重指导范式的DGFusion,它充分继承了Point-guide-Image范式的优势,并融合了Image-guide-Point范式来解决单一范式的局限性。DGFusion的核心是难度感知实例配对匹配器(DIPM),它基于难度进行实例级特征匹配,以生成容易和困难的实例对;双重指导模块则利用这两类实例对的优势来实现有效的多模态特征融合。实验结果表明,我们的DGFusion优于基线方法,在nuScenes上mAP提高了+1.0%,NDS提高了+0.8%,平均召回率提高了+1.3%。大量实验表明,在自车距离(ego-distance)、尺寸、可见性和小规模训练等场景下,硬实例检测的稳健性均有一致的提升。
论文及项目相关链接
Summary
3D对象检测是自动驾驶感知系统的关键任务之一,主要用于识别和跟踪关键对象。然而,检测远处、小型或遮挡的物体(硬实例)仍然是一个挑战,直接影响自动驾驶系统的安全性。现有的多模态三维对象检测方法多遵循单一指导范式,无法兼顾不同模态下硬实例信息密度的差异。本研究提出了基于双重指导范式的DGFusion方法,结合了Point-guide-Image和Image-guide-Point两种范式的优点,并引入难度感知实例配对匹配器(DIPM)实现有效的多模态特征融合。实验表明,DGFusion在硬实例检测方面取得了优于基线方法的结果,在不同场景下表现出稳定的稳健性增益。
Key Takeaways
- 自动驾驶中的三维对象检测对于识别和跟踪关键对象至关重要。
- 检测远处、小型或遮挡的物体(硬实例)是自动驾驶感知系统的挑战之一。
- 目前的多模态三维对象检测方法多遵循单一指导范式,忽略了不同模态下硬实例信息密度的差异。
- 本研究提出的DGFusion方法结合了Point-guide-Image和Image-guide-Point两种模式的优点。
- DGFusion引入了难度感知实例配对匹配器(DIPM),以进行多模态特征融合和更好的实例配对。
- 实验显示DGFusion在硬实例检测方面优于基线方法,并展现出强大的稳健性增益。
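为直观说明“按难度生成易/难实例对”的思路,下面给出一个极简的 PyTorch 草图:先用余弦相似度为点云实例匹配图像实例,再按难度分数把实例对划分为 easy/hard 两组。这只是基于摘要理解的示意性写法,其中的特征维度、阈值以及 `difficulty` 的定义均为假设,并非论文的官方实现。

```python
import torch
import torch.nn.functional as F

def match_instance_pairs(lidar_feats, image_feats, difficulty, thresh=0.5):
    """按难度划分跨模态实例对的示意实现(假设性草图)。

    lidar_feats:  (N, C) 点云分支的实例特征
    image_feats:  (M, C) 图像分支的实例特征
    difficulty:   (N,)   每个点云实例的难度分数(0~1,越大越难;
                         论文中可能由距离/尺寸/可见性等得到,这里仅作假设)
    """
    # 余弦相似度矩阵,按最近邻为每个点云实例匹配一个图像实例
    sim = F.cosine_similarity(lidar_feats.unsqueeze(1),
                              image_feats.unsqueeze(0), dim=-1)   # (N, M)
    matched = sim.argmax(dim=1)                                    # (N,)

    pairs = torch.stack([torch.arange(len(lidar_feats)), matched], dim=1)
    hard_mask = difficulty > thresh
    easy_pairs, hard_pairs = pairs[~hard_mask], pairs[hard_mask]
    return easy_pairs, hard_pairs

if __name__ == "__main__":
    torch.manual_seed(0)
    lidar = torch.randn(6, 64)
    image = torch.randn(8, 64)
    diff = torch.rand(6)
    easy, hard = match_instance_pairs(lidar, image, diff)
    print("easy pairs:", easy.shape, "hard pairs:", hard.shape)
```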
点此查看论文截图
DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation
Authors:Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang
Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially as the growing scale of data and high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.
弱监督3D实例分割对于3D场景理解至关重要,尤其是在数据规模不断增长、全监督方法标注成本高昂的背景下。现有方法主要依赖于两种形式的弱监督:一物一点击标注和边界框标注,两者的目标都是减少标注工作量。然而,这些方法仍面临局限性,包括标注过程劳动强度高、复杂性高以及依赖专家标注人员。为了应对这些挑战,我们提出DBGroup,这是一个两阶段的弱监督3D实例分割框架,它利用场景级标注作为更高效和可扩展的替代方案。在第一阶段,我们引入双分支点分组模块,根据从多视角图像中提取的语义和掩膜线索生成伪标签。为了进一步提高标签质量,我们开发了两种优化策略:粒度感知实例合并和语义选择与传播。第二阶段使用精炼后的伪标签,在端到端实例分割网络上进行多轮自训练。此外,我们还引入了实例掩膜过滤策略,以解决伪标签内部的不一致性。大量实验表明,DBGroup与稀疏点级监督的3D实例分割方法相比具有竞争力,同时超越了最先进的场景级监督的3D语义分割方法。代码可在https://github.com/liuxuexun/DBGroup找到。
论文及项目相关链接
Summary
本文提出了一种名为DBGroup的两阶段弱监督3D实例分割框架,利用场景级注释作为更高效和可扩展的替代方案。第一阶段通过双分支点分组模块生成伪标签,该模块由多视角图像提取的语义和掩膜线索引导。为提高标签质量,采用粒度感知实例合并和语义选择与传播两种优化策略。第二阶段利用优化后的伪标签进行多轮自训练,实现对实例分割网络的端到端训练。此外,还引入了实例掩膜过滤策略来解决伪标签内部的不一致性。实验表明,DBGroup与稀疏点级监督的3D实例分割方法相比具有竞争力,并超越了最先进的场景级监督的3D语义分割方法。
Key Takeaways
- DBGroup是一种两阶段的弱监督3D实例分割框架,旨在解决现有方法面临的高标注成本和复杂性挑战。
- 第一阶段通过双分支点分组模块生成伪标签,利用语义和掩膜线索提高标签质量。
- 引入两种优化策略——粒度感知实例合并和语义选择与传播——以提高伪标签质量。
- 第二阶段采用多轮自训练,利用优化后的伪标签进行端到端的实例分割网络训练。
- 提出实例掩膜过滤策略以解决伪标签内部的不一致性。
点此查看论文截图
Robust Object Detection with Pseudo Labels from VLMs using Per-Object Co-teaching
Authors:Uday Bhaskar, Rishabh Bhattacharya, Avinash Patel, Sarthak Khoche, Praveen Anil Kulkarni, Naresh Manwani
Foundation models, especially vision-language models (VLMs), offer compelling zero-shot object detection for applications like autonomous driving, a domain where manual labelling is prohibitively expensive. However, their detection latency and tendency to hallucinate predictions render them unsuitable for direct deployment. This work introduces a novel pipeline that addresses this challenge by leveraging VLMs to automatically generate pseudo-labels for training efficient, real-time object detectors. Our key innovation is a per-object co-teaching-based training strategy that mitigates the inherent noise in VLM-generated labels. The proposed per-object coteaching approach filters noisy bounding boxes from training instead of filtering the entire image. Specifically, two YOLO models learn collaboratively, filtering out unreliable boxes from each mini-batch based on their peers’ per-object loss values. Overall, our pipeline provides an efficient, robust, and scalable approach to train high-performance object detectors for autonomous driving, significantly reducing reliance on costly human annotation. Experimental results on the KITTI dataset demonstrate that our method outperforms a baseline YOLOv5m model, achieving a significant mAP@0.5 boost (31.12% to 46.61%) while maintaining real-time detection latency. Furthermore, we show that supplementing our pseudo-labelled data with a small fraction of ground truth labels (10%) leads to further performance gains, reaching 57.97% mAP@0.5 on the KITTI dataset. We observe similar performance improvements for the ACDC and BDD100k datasets.
基础模型,特别是视觉语言模型(VLMs),为自动驾驶等应用提供了引人注目的零样本目标检测能力,而该领域的人工标注成本极其高昂。然而,它们的检测延迟和产生幻觉预测的倾向使其不适合直接部署。这项工作引入了一种新流程来解决这一挑战,利用VLMs自动生成伪标签来训练高效的实时目标检测器。我们的关键创新之处在于逐目标(per-object)的协同教学训练策略,它减轻了VLM生成标签中的固有噪声。所提出的逐目标协同教学方法从训练中过滤掉带噪的边界框,而不是过滤整张图像。具体来说,两个YOLO模型协作学习,依据对方模型给出的逐目标损失值,从每个小批量数据中筛除不可靠的框。总的来说,我们的流程提供了一种高效、稳健且可扩展的方法来训练用于自动驾驶的高性能目标检测器,大大降低了对昂贵人工标注的依赖。在KITTI数据集上的实验结果表明,我们的方法优于基线YOLOv5m模型,在保持实时检测延迟的同时,实现了显著的mAP@0.5提升(从31.12%提高到46.61%)。此外,我们还表明,用一小部分真实标签(10%)补充我们的伪标签数据可以带来进一步的性能提升,在KITTI数据集上达到57.97%的mAP@0.5。我们在ACDC和BDD100k数据集上也观察到了类似的性能提升。
论文及项目相关链接
Summary
本文提出一种利用视觉语言模型(VLMs)自动生成伪标签,训练高效实时目标检测器的新颖流程。通过采用逐目标(per-object)协同教学的训练策略,减轻VLM生成标签的固有噪声问题。该策略能够过滤掉训练中的不可靠边界框,而非过滤整个图像。在KITTI数据集上的实验表明,该方法相较于基线YOLOv5m模型显著提升mAP@0.5指标,从31.12%提升至46.61%,同时保持实时检测速度。通过补充小部分真实标签数据(仅占10%),性能可进一步提升至57.97%。该流程为自动驾驶领域高效、稳健、可扩展的目标检测器训练提供了有效方法。
Key Takeaways
- 利用视觉语言模型(VLMs)自动生成伪标签以训练目标检测器。
- 提出逐目标(per-object)协同教学的训练策略,过滤不可靠的边界框。
- 在KITTI数据集上实现显著的性能提升,mAP@0.5从31.12%提升至46.61%。
- 通过结合少量真实标签数据,性能进一步提升至57.97%。
- 方法具有高效性、稳健性和可扩展性,适用于自动驾驶领域。
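下面用一个极简的 PyTorch 草图说明“逐目标(per-object)协同教学”的核心机制:两个模型互相为对方按逐框损失筛选边界框,丢弃对方认为损失最大(最可能是噪声伪标签)的一部分框。这里用两个线性回归头代替 YOLO,`keep_ratio` 等参数均为假设,仅作示意,并非论文实现。

```python
import torch
import torch.nn as nn

def per_object_coteach_step(model_a, model_b, feats, boxes, keep_ratio=0.7):
    """逐框协同教学的一步(示意草图,非论文官方实现)。

    feats: (N, C) 每个候选目标的特征;boxes: (N, 4) 伪标签框。
    每个模型只保留“同伴认为损失较小”的那部分框参与自己的更新。
    """
    loss_fn = nn.SmoothL1Loss(reduction="none")

    pred_a, pred_b = model_a(feats), model_b(feats)                 # (N, 4)
    loss_a = loss_fn(pred_a, boxes).mean(dim=1)                     # 逐框损失
    loss_b = loss_fn(pred_b, boxes).mean(dim=1)

    k = max(1, int(keep_ratio * len(boxes)))
    keep_for_a = torch.topk(loss_b, k, largest=False).indices       # B 为 A 选框
    keep_for_b = torch.topk(loss_a, k, largest=False).indices       # A 为 B 选框

    return loss_a[keep_for_a].mean(), loss_b[keep_for_b].mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    model_a, model_b = nn.Linear(32, 4), nn.Linear(32, 4)
    feats, boxes = torch.randn(16, 32), torch.rand(16, 4)
    la, lb = per_object_coteach_step(model_a, model_b, feats, boxes)
    (la + lb).backward()
    print(float(la), float(lb))
```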
点此查看论文截图
SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection
Authors:Jia Lin, Xiaofei Zhou, Jiyuan Liu, Runmin Cong, Guodao Zhang, Zhi Liu, Jiyong Zhang
Recently segment anything model (SAM) has attracted widespread concerns, and it is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to directly apply the foundation model to the RGB-D video salient object detection (RGB-D VSOD) task, which often encounters three challenges, including the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address the limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop-out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection way. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPA utilizes depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging both frame-level queries and video-level queries simultaneously, the QTM module can not only selectively extract temporal consistency features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.
最近,分割一切模型(SAM)引起了广泛关注,通常被视为通用分割的视觉基础模型。一些研究人员尝试直接将该基础模型应用于RGB-D视频显著目标检测(RGB-D VSOD)任务,这常常面临三个挑战:依赖手动提示、串行适配器的高内存消耗,以及记忆注意力(memory attention)的计算负担。为了解决这些限制,我们提出了一种新方法,即带深度引导自适应查询的分割一切模型(SAM-DAQ),它通过在统一框架内无缝集成深度和时序线索,使SAM2能够从视频中分割出显著目标。首先,我们部署了基于并行适配器的多模态图像编码器(PAMIE),它以跳跃连接的方式结合了多个深度引导并行适配器(DPA)。值得注意的是,我们在无提示条件下对冻结的SAM编码器进行了微调,DPA利用深度线索来促进多模态特征的融合。其次,我们部署了一个查询驱动的时序记忆(QTM)模块,它将记忆库与提示嵌入统一到一个可学习的流程中。具体来说,通过同时利用帧级查询和视频级查询,QTM模块不仅可以有选择地提取时序一致性特征,还可以迭代更新查询的时序表示。在三个RGB-D VSOD数据集上进行了大量实验,结果表明,所提出的SAM-DAQ在所有评估指标上均优于最先进的方法。
论文及项目相关链接
PDF Accepted to 40th AAAI Conference on Artificial Intelligence (AAAI 2026)
Summary
最近,分割一切模型(SAM)受到广泛关注,被视为通用分割的视觉基础模型。研究者尝试将其直接应用于RGB-D视频显著目标检测任务,但面临依赖手动提示、串行适配器高内存消耗和记忆注意力计算负担等三大挑战。为应对这些局限,我们提出一种名为SAM-DAQ的新方法,它通过深度引导自适应查询使SAM2适用于视频中的显著目标检测。首先,我们部署并行适配器多模态图像编码器(PAMIE),以跳跃连接的方式融入多个深度引导并行适配器(DPAs)。值得注意的是,我们在无提示条件下微调了冻结的SAM编码器,DPA利用深度线索促进多模态特征的融合。其次,我们引入查询驱动的时序记忆模块(QTM),将记忆库与提示嵌入统一到可学习管道中。具体来说,通过同时利用帧级查询和视频级查询,QTM模块不仅可以选择性提取时序一致性特征,还可以迭代更新查询的时序表示。在三个RGB-D VSOD数据集上的大量实验表明,所提出的SAM-DAQ在所有评价指标上均优于最新方法。
Key Takeaways
- SAM模型作为通用分割的视觉基础模型受到关注。
- 直接将SAM应用于RGB-D视频显著目标检测任务面临三大挑战。
- 提出SAM-DAQ方法,通过深度引导自适应查询适应SAM模型以检测视频中的显著目标。
- PAMIE和DPAs的引入促进了多模态特征的融合。
- QTM模块能够提取时间一致性特征并更新查询的时间表示。
- 在多个数据集上的实验表明,SAM-DAQ在各项评价指标上表现优异。
点此查看论文截图
High-Quality Proposal Encoding and Cascade Denoising for Imaginary Supervised Object Detection
Authors:Zhiyuan Chen, Yuelin Guo, Zitong Huang, Haoyu He, Renhao Lu, Weizhe Zhang
Object detection models demand large-scale annotated datasets, which are costly and labor-intensive to create. This motivated Imaginary Supervised Object Detection (ISOD), where models train on synthetic images and test on real images. However, existing methods face three limitations: (1) synthetic datasets suffer from simplistic prompts, poor image quality, and weak supervision; (2) DETR-based detectors, due to their random query initialization, struggle with slow convergence and overfitting to synthetic patterns, hindering real-world generalization; (3) uniform denoising pressure promotes model overfitting to pseudo-label noise. We propose Cascade HQP-DETR to address these limitations. First, we introduce a high-quality data pipeline using LLaMA-3, Flux, and Grounding DINO to generate the FluxVOC and FluxCOCO datasets, advancing ISOD from weak to full supervision. Second, our High-Quality Proposal guided query encoding initializes object queries with image-specific priors from SAM-generated proposals and RoI-pooled features, accelerating convergence while steering the model to learn transferable features instead of overfitting to synthetic patterns. Third, our cascade denoising algorithm dynamically adjusts training weights through progressively increasing IoU thresholds across decoder layers, guiding the model to learn robust boundaries from reliable visual cues rather than overfitting to noisy labels. Trained for just 12 epochs solely on FluxVOC, Cascade HQP-DETR achieves a SOTA 61.04% mAP@0.5 on PASCAL VOC 2007, outperforming strong baselines, with its competitive real-data performance confirming the architecture’s universal applicability.
目标检测模型需要大规模标注数据集,而这些数据集的创建成本高昂且费时费力。这催生了想象监督目标检测(Imaginary Supervised Object Detection,ISOD):模型在合成图像上训练,在真实图像上测试。然而,现有方法面临三个局限性:(1)合成数据集存在提示过于简单、图像质量差和监督弱的问题;(2)基于DETR的检测器由于随机的查询初始化,存在收敛慢和过拟合合成模式的问题,阻碍了其在现实世界中的泛化能力;(3)均匀的去噪压力促使模型对伪标签噪声过拟合。我们提出Cascade HQP-DETR来解决这些局限性。首先,我们利用LLaMA-3、Flux和Grounding DINO构建高质量数据管道,生成FluxVOC和FluxCOCO数据集,将ISOD从弱监督推进到全监督。其次,我们的高质量提案引导查询编码利用SAM生成的提案和RoI池化特征所提供的图像特定先验来初始化对象查询,加速收敛,并引导模型学习可迁移特征,而不是过拟合于合成模式。第三,我们的级联去噪算法通过在解码器各层逐步提高IoU阈值来动态调整训练权重,引导模型从可靠的视觉线索中学习稳健的边界,而不是对带噪标签过拟合。仅在FluxVOC上训练12个周期,Cascade HQP-DETR就在PASCAL VOC 2007上取得了61.04% mAP@0.5的领先水平,超越了强大的基线,其有竞争力的真实数据性能也证实了该架构的通用适用性。
论文及项目相关链接
PDF This work has been submitted to Pattern Recognition for possible publication
Summary
针对现有对象检测模型在合成图像训练时面临的三大问题,提出了Cascade HQP-DETR模型。通过引入高质量数据管道、高质量提议引导查询编码以及级联去噪算法,解决了合成图像训练中的弱监督、收敛慢和过拟合等问题。在PASCAL VOC 2007数据集上取得了较高的性能表现。
Key Takeaways
- Cascade HQP-DETR解决了现有对象检测模型在合成图像训练时面临的挑战,包括弱监督、收敛速度慢和过拟合问题。
- 引入了一个高质量的数据管道,利用LLaMA-3、Flux和Grounding DINO等技术生成FluxVOC和FluxCOCO数据集,实现ISOD从弱监督到全监督的进展。
- 通过高质量提议引导查询编码,使用SAM生成的提议和RoI池化特征来初始化对象查询,提高了模型的收敛速度并促进了模型的泛化能力。
- 级联去噪算法可以根据解码器层的IoU阈值动态调整训练权重,帮助模型从可靠的视觉线索中学习鲁棒的边界信息,减少了对噪声标签的过拟合。
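下面的草图演示“级联去噪”中按解码层逐步提高 IoU 阈值、并据此调整每个查询训练权重的思路。其中 IoU 计算是标准写法,而阈值序列与软权重的具体形式是为说明而假设的,并非论文公式。

```python
import torch

def box_iou(a, b):
    """逐对计算 (x1, y1, x2, y2) 框的 IoU,a、b 形状均为 (N, 4)。"""
    lt = torch.maximum(a[:, :2], b[:, :2])
    rb = torch.minimum(a[:, 2:], b[:, 2:])
    inter = (rb - lt).clamp(min=0).prod(dim=1)
    area_a = (a[:, 2:] - a[:, :2]).clamp(min=0).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).clamp(min=0).prod(dim=1)
    return inter / (area_a + area_b - inter + 1e-6)

def cascade_denoise_weights(pred_boxes_per_layer, gt_boxes,
                            thresholds=(0.5, 0.6, 0.7)):
    """为每个解码层返回逐查询的训练权重(假设性示意)。

    pred_boxes_per_layer: 长度为 L 的列表,每项 (N, 4) 为该层预测框;
    gt_boxes:             (N, 4) 与之一一对应的(可能带噪的)标签框。
    IoU 低于该层阈值的查询被降权,层数越深阈值越高。
    """
    weights = []
    for boxes, thr in zip(pred_boxes_per_layer, thresholds):
        iou = box_iou(boxes, gt_boxes)
        weights.append(torch.where(iou >= thr, torch.ones_like(iou), iou / thr))
    return weights

if __name__ == "__main__":
    torch.manual_seed(0)
    gt = torch.tensor([[0.0, 0.0, 1.0, 1.0], [2.0, 2.0, 3.0, 3.0]])
    preds = [gt + 0.3 * torch.randn_like(gt) for _ in range(3)]
    for layer, w in enumerate(cascade_denoise_weights(preds, gt)):
        print(f"layer {layer}: weights = {w.tolist()}")
```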
点此查看论文截图
Multi-Modal Assistance for Unsupervised Domain Adaptation on Point Cloud 3D Object Detection
Authors:Shenao Zhao, Pengpeng Liang, Zhoufan Yang
Unsupervised domain adaptation for LiDAR-based 3D object detection (3D UDA) based on the teacher-student architecture with pseudo labels has achieved notable improvements in recent years. Although it is quite popular to collect point clouds and images simultaneously, little attention has been paid to the usefulness of image data in 3D UDA when training the models. In this paper, we propose an approach named MMAssist that improves the performance of 3D UDA with multi-modal assistance. A method is designed to align 3D features between the source domain and the target domain by using image and text features as bridges. More specifically, we project the ground truth labels or pseudo labels to the images to get a set of 2D bounding boxes. For each 2D box, we extract its image feature from a pre-trained vision backbone. A large vision-language model (LVLM) is adopted to extract the box’s text description, and a pre-trained text encoder is used to obtain its text feature. During the training of the model in the source domain and the student model in the target domain, we align the 3D features of the predicted boxes with their corresponding image and text features, and the 3D features and the aligned features are fused with learned weights for the final prediction. The features between the student branch and the teacher branch in the target domain are aligned as well. To enhance the pseudo labels, we use an off-the-shelf 2D object detector to generate 2D bounding boxes from images and estimate their corresponding 3D boxes with the aid of point cloud, and these 3D boxes are combined with the pseudo labels generated by the teacher model. Experimental results show that our approach achieves promising performance compared with state-of-the-art methods in three domain adaptation tasks on three popular 3D object detection datasets. The code is available at https://github.com/liangp/MMAssist.
近年来,基于带伪标签的教师-学生架构、面向激光雷达(LiDAR)3D目标检测的无监督域自适应(3D UDA)取得了显著进展。虽然同时采集点云和图像的做法非常普遍,但在训练模型时,图像数据对3D UDA的作用却鲜有人关注。本文提出了一种名为MMAssist的方法,通过多模态辅助提高3D UDA的性能。我们设计了一种以图像和文本特征为桥梁、对齐源域和目标域之间3D特征的方法。具体来说,我们将真实标签或伪标签投影到图像上得到一组二维边界框。对于每个二维框,我们从预训练的视觉骨干网络中提取其图像特征;使用大型视觉语言模型(LVLM)提取该框的文本描述,并使用预训练的文本编码器获取其文本特征。在源域训练模型和训练目标域的学生模型时,我们将预测框的3D特征与其对应的图像和文本特征对齐,并将3D特征与对齐后的特征以学习到的权重融合,用于最终预测。同时,目标域中学生分支和教师分支之间的特征也会被对齐。为了增强伪标签,我们使用现成的二维目标检测器从图像生成二维边界框,并借助点云估计其对应的三维框,再将这些三维框与教师模型生成的伪标签相结合。实验结果表明,与最先进的方法相比,我们的方法在三个流行的三维目标检测数据集上的三个域适应任务中取得了有前景的表现。相关代码可从https://github.com/liangp/MMAssist获得。
论文及项目相关链接
PDF Accepted to AAAI-26
Summary
本文提出一种名为MMAssist的方法,用于改进基于激光雷达的3D目标检测的域自适应性能。该方法利用图像和文本特征作为桥梁,通过多模态辅助实现对源域和目标域之间的3D特征对齐。通过结合图像和文本特征,提高伪标签质量,并在三个流行的3D目标检测数据集上进行实验验证,取得显著成果。
Key Takeaways
- 文章提出了名为MMAssist的方法,用于改进基于激光雷达的3D目标检测的无监督域自适应(UDA)。
- 利用图像和文本特征作为桥梁,实现源域和目标域之间的3D特征对齐。
- 通过结合图像和文本特征,提高伪标签质量。
- 采用多模态辅助提高目标检测性能。
- 在三个流行的数据集上进行实验验证,包括不同领域适应性任务。
- 该方法与目前主流方法相比表现优越。
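下面给出“以图像与文本特征为桥梁对齐 3D 特征”这一思想的极简示意:用余弦相似度损失把预测框的 3D 特征分别拉近其对应的图像特征与文本特征,再以可学习权重融合三路特征。各投影维度与损失形式均为本示例的假设,并非论文的原始实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAlign(nn.Module):
    """把 3D 框特征对齐到图像/文本特征并加权融合(示意草图)。"""

    def __init__(self, dim_3d=128, dim_img=256, dim_txt=512, dim=128):
        super().__init__()
        self.proj_3d = nn.Linear(dim_3d, dim)
        self.proj_img = nn.Linear(dim_img, dim)
        self.proj_txt = nn.Linear(dim_txt, dim)
        # 三路特征的融合权重(可学习,经 softmax 归一化)
        self.fuse_logits = nn.Parameter(torch.zeros(3))

    def forward(self, feat_3d, feat_img, feat_txt):
        f3, fi, ft = (self.proj_3d(feat_3d), self.proj_img(feat_img),
                      self.proj_txt(feat_txt))
        # 对齐损失:1 - 余弦相似度
        align_loss = (1 - F.cosine_similarity(f3, fi, dim=-1)).mean() + \
                     (1 - F.cosine_similarity(f3, ft, dim=-1)).mean()
        w = torch.softmax(self.fuse_logits, dim=0)
        fused = w[0] * f3 + w[1] * fi + w[2] * ft      # 用于最终预测的融合特征
        return fused, align_loss

if __name__ == "__main__":
    m = CrossModalAlign()
    fused, loss = m(torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 512))
    print(fused.shape, float(loss))
```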
点此查看论文截图
MonoCLUE : Object-Aware Clustering Enhances Monocular 3D Object Detection
Authors:Sunghun Yang, Minhyeok Lee, Jungho Lee, Sangyoun Lee
Monocular 3D object detection offers a cost-effective solution for autonomous driving but suffers from ill-posed depth and limited field of view. These constraints cause a lack of geometric cues and reduced accuracy in occluded or truncated scenes. While recent approaches incorporate additional depth information to address geometric ambiguity, they overlook the visual cues crucial for robust recognition. We propose MonoCLUE, which enhances monocular 3D detection by leveraging both local clustering and generalized scene memory of visual features. First, we perform K-means clustering on visual features to capture distinct object-level appearance parts (e.g., bonnet, car roof), improving detection of partially visible objects. The clustered features are propagated across regions to capture objects with similar appearances. Second, we construct a generalized scene memory by aggregating clustered features across images, providing consistent representations that generalize across scenes. This improves object-level feature consistency, enabling stable detection across varying environments. Lastly, we integrate both local cluster features and generalized scene memory into object queries, guiding attention toward informative regions. Exploiting a unified local clustering and generalized scene memory strategy, MonoCLUE enables robust monocular 3D detection under occlusion and limited visibility, achieving state-of-the-art performance on the KITTI benchmark.
单目3D目标检测为自动驾驶提供了经济高效的解决方案,但存在深度不明确和视野有限的缺点。这些约束导致了缺乏几何线索,以及在遮挡或截断场景中的精度降低。虽然最近的方法通过加入深度信息来解决几何歧义问题,但它们忽略了对于稳健识别至关重要的视觉线索。我们提出MonoCLUE,它通过利用局部聚类和视觉特征的广义场景记忆来增强单目3D检测。首先,我们对视觉特征执行K-means聚类,以捕获独特的对象级外观部分(例如引擎盖、汽车顶部),从而提高对部分可见对象的检测能力。聚类特征被传播到各个区域,以捕获具有相似外观的对象。其次,我们通过聚集图像中的聚类特征来构建广义场景记忆,提供一致的表示形式,这些表示形式可以推广到各个场景。这提高了对象级特征的一致性,能够在各种环境中实现稳定的检测。最后,我们将局部聚类特征和广义场景记忆整合到对象查询中,引导注意力关注信息区域。通过利用统一的局部聚类和广义场景记忆策略,MonoCLUE在遮挡和有限可见度的情况下实现了稳健的单目3D检测,并在KITTI基准测试中达到了最先进的性能。
论文及项目相关链接
PDF Accepted to AAAI 2026
Summary
单目3D对象检测为自动驾驶提供了高性价比的解决方案,但因深度估计病态及视野受限而缺乏几何线索,在遮挡或截断场景中精度下降。近期方法虽融入额外深度信息以解决几何歧义,却忽略了对稳健识别至关重要的视觉线索。本文提出MonoCLUE,通过结合局部聚类和广义场景记忆强化单目3D检测。首先,对视觉特征进行K-means聚类,捕捉独特的对象级外观部分(如引擎盖、车顶),改进部分可见对象的检测。聚类特征跨区域传播以捕捉具有相似外观的对象。其次,通过聚集图像中的聚类特征构建广义场景记忆,提供跨场景的通用表示。这提高了对象级别的特征一致性,实现了不同环境下的稳定检测。最后,将局部聚类特征和广义场景记忆整合到对象查询中,引导注意力关注信息区域。借助统一的局部聚类和广义场景记忆策略,MonoCLUE在遮挡和有限可见度下实现了稳健的单目3D检测,在KITTI基准测试中达到最新性能。
Key Takeaways
- 单目3D对象检测面临深度模糊和视野受限的挑战,导致几何线索缺失和精度下降。
- 近期方法虽融入深度信息,但忽略了对稳健识别至关重要的视觉线索。
- MonoCLUE通过结合局部聚类和广义场景记忆强化单目3D检测。
- K-means聚类用于捕捉独特的对象级外观部分,改进部分可见对象的检测。
- 聚类特征跨区域传播以识别具有相似外观的对象。
- 通过构建广义场景记忆提供跨场景的通用表示,提高对象级别的特征一致性。
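下面的草图展示“对视觉特征做 K-means 聚类以获得部件级外观特征”这一步:把特征图展平为像素级特征,聚成 K 类,并以各簇均值作为聚类特征。聚类数 K、迭代次数与特征维度均为假设值,仅用于说明思路。

```python
import torch

def kmeans_cluster_features(feat_map, k=4, iters=10):
    """对单张特征图做 K-means,返回 (K, C) 聚类中心与逐像素标签(示意)。

    feat_map: (C, H, W) 的视觉特征图。
    """
    c, h, w = feat_map.shape
    x = feat_map.permute(1, 2, 0).reshape(-1, c)          # (H*W, C)
    centers = x[torch.randperm(len(x))[:k]].clone()       # 随机初始化中心
    for _ in range(iters):
        dist = torch.cdist(x, centers)                    # (H*W, K)
        labels = dist.argmin(dim=1)
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = x[mask].mean(dim=0)          # 更新簇中心
    return centers, labels.reshape(h, w)

if __name__ == "__main__":
    torch.manual_seed(0)
    feat = torch.randn(64, 20, 20)
    centers, labels = kmeans_cluster_features(feat, k=4)
    print(centers.shape, labels.shape)   # torch.Size([4, 64]) torch.Size([20, 20])
```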
点此查看论文截图
EIDSeg: A Pixel-Level Semantic Segmentation Dataset for Post-Earthquake Damage Assessment from Social Media Images
Authors:Huili Huang, Chengeng Liu, Danrong Zhang, Shail Patel, Anastasiya Masalava, Sagar Sadak, Parisa Babolhavaeji, WeiHong Low, Max Mahdi Roozbahani, J. David Frost
Rapid post-earthquake damage assessment is crucial for rescue and resource planning. Still, existing remote sensing methods depend on costly aerial images, expert labeling, and produce only binary damage maps for early-stage evaluation. Although ground-level images from social networks provide a valuable source to fill this gap, a large pixel-level annotated dataset for this task is still unavailable. We introduce EIDSeg, the first large-scale semantic segmentation dataset specifically for post-earthquake social media imagery. The dataset comprises 3,266 images from nine major earthquakes (2008-2023), annotated across five classes of infrastructure damage: Undamaged Building, Damaged Building, Destroyed Building, Undamaged Road, and Damaged Road. We propose a practical three-phase cross-disciplinary annotation protocol with labeling guidelines that enables consistent segmentation by non-expert annotators, achieving over 70% inter-annotator agreement. We benchmark several state-of-the-art segmentation models, identifying Encoder-only Mask Transformer (EoMT) as the top-performing method with a Mean Intersection over Union (mIoU) of 80.8%. By unlocking social networks’ rich ground-level perspective, our work paves the way for a faster, finer-grained damage assessment in the post-earthquake scenario.
快速的震后损毁评估对于救援和资源规划至关重要。然而,现有的遥感方法依赖于昂贵的航空图像和专家标注,并且仅能提供用于早期评估的二元损毁地图。虽然来自社交网络的地面图像为填补这一空白提供了宝贵的资源,但针对此项任务的大规模像素级标注数据集仍然缺失。我们介绍了EIDSeg,这是首个专门面向震后社交媒体图像的大型语义分割数据集。该数据集包含来自九次大地震(2008年至2023年)的3,266张图像,按五类基础设施损坏进行标注:未损坏建筑、损坏建筑、毁坏建筑、未损坏道路和损坏道路。我们提出了一种实用的三阶段跨学科标注协议,并配有标注指南,使非专业标注人员也能进行一致的分割,标注者间一致性超过70%。我们对多个先进的分割模型进行了基准测试,发现Encoder-only Mask Transformer(EoMT)是表现最好的方法,其平均交并比(mIoU)为80.8%。通过解锁社交网络丰富的地面视角,我们的研究为震后更快、更精细的损毁评估铺平了道路。
论文及项目相关链接
PDF Camera-Ready for AAAI-AISI26
Summary
引入了一个针对震后社交媒体影像的大型语义分割数据集EIDSeg。该数据集包含来自九次主要地震的3,266张图片,涵盖五种基础设施损坏类别。提出了一种实用的三阶段跨学科标注协议,使非专家标注员也能做出一致的标注。通过解锁社交媒体丰富的地面视角,该研究为更快速、更精细的震后损毁评估铺平了道路。
Key Takeaways
- 灾后快速损伤评估对救援和资源规划至关重要,但现有的遥感方法存在成本高昂和只提供二元损伤地图的问题。
- 社交媒体的地面视角图像为解决此问题提供了有价值的资源。
- 引入了名为EIDSeg的大型语义分割数据集,专门用于灾后的社交媒体影像。
- 数据集包含来自九个主要地震的3,266张图片,涵盖五种基础设施损坏类别(无损建筑、损坏建筑、被毁建筑、无损道路和损坏道路)。
- 提出了一种实用的跨学科标注协议,并制定了标注指南,使非专家标注员也能进行一致的分割。
- Encoder-only Mask Transformer(EoMT)表现最佳,其Mean Intersection over Union(mIoU)达到80.8%。
点此查看论文截图
SFFR: Spatial-Frequency Feature Reconstruction for Multispectral Aerial Object Detection
Authors:Xin Zuo, Yuchen Qu, Haibo Zhan, Jifeng Shen, Wankou Yang
Recent multispectral object detection methods have primarily focused on spatial-domain feature fusion based on CNNs or Transformers, while the potential of frequency-domain feature remains underexplored. In this work, we propose a novel Spatial and Frequency Feature Reconstruction method (SFFR) method, which leverages the spatial-frequency feature representation mechanisms of the Kolmogorov-Arnold Network (KAN) to reconstruct complementary representations in both spatial and frequency domains prior to feature fusion. The core components of SFFR are the proposed Frequency Component Exchange KAN (FCEKAN) module and Multi-Scale Gaussian KAN (MSGKAN) module. The FCEKAN introduces an innovative selective frequency component exchange strategy that effectively enhances the complementarity and consistency of cross-modal features based on the frequency feature of RGB and IR images. The MSGKAN module demonstrates excellent nonlinear feature modeling capability in the spatial domain. By leveraging multi-scale Gaussian basis functions, it effectively captures the feature variations caused by scale changes at different UAV flight altitudes, significantly enhancing the model’s adaptability and robustness to scale variations. It is experimentally validated that our proposed FCEKAN and MSGKAN modules are complementary and can effectively capture the frequency and spatial semantic features respectively for better feature fusion. Extensive experiments on the SeaDroneSee, DroneVehicle and DVTOD datasets demonstrate the superior performance and significant advantages of the proposed method in UAV multispectral object perception task. Code will be available at https://github.com/qchenyu1027/SFFR.
最近的多光谱目标检测方法主要关注基于CNN或Transformer的空间域特征融合,而频域特征的潜力尚未得到充分探索。在这项工作中,我们提出了一种新的空间与频率特征重建方法(SFFR),该方法利用Kolmogorov-Arnold网络(KAN)的空间-频率特征表示机制,在特征融合之前重建空间域和频域中的互补表示。SFFR的核心组件是提出的频率分量交换KAN(FCEKAN)模块和多尺度高斯KAN(MSGKAN)模块。FCEKAN引入了一种创新的选择性频率分量交换策略,该策略基于RGB和红外(IR)图像的频率特征,有效增强了跨模态特征的互补性和一致性。MSGKAN模块在空间域表现出卓越的非线性特征建模能力。通过利用多尺度高斯基函数,它有效捕获了不同无人机飞行高度下尺度变化所导致的特征变化,显著增强了模型对尺度变化的适应性和稳健性。实验验证了我们提出的FCEKAN和MSGKAN模块是互补的,能够分别有效捕捉频率和空间语义特征,以实现更好的特征融合。在SeaDroneSee、DroneVehicle和DVTOD数据集上的大量实验表明,所提出的方法在无人机多光谱目标感知任务中具有卓越的性能和显著的优势。代码将在https://github.com/qchenyu1027/SFFR上提供。
论文及项目相关链接
PDF 11 pages,8 figures, accepted by IEEE TGRS
Summary
本文提出了一种新型的多光谱目标检测方法——空间与频率特征重建方法(SFFR)。该方法利用Kolmogorov-Arnold网络(KAN)的空间-频率特征表示机制,在空间域和频域进行特征融合前的特征重建。核心模块包括频率分量交换KAN(FCEKAN)和多尺度高斯KAN(MSGKAN)。FCEKAN通过选择性频率分量交换策略增强了跨模态特征的互补性和一致性,而MSGKAN模块在空间域中展现了出色的非线性特征建模能力。通过多尺度高斯基函数,该方法能有效捕捉不同无人机飞行高度所引起的尺度变化特征,提高了模型对尺度变化的适应性和稳健性。实验验证显示,所提出的FCEKAN和MSGKAN模块可互补地捕捉频率和空间语义特征,实现更好的特征融合,在无人机多光谱目标检测任务中表现卓越。
Key Takeaways
- 提出了一种新型的多光谱目标检测方法——空间与频率特征重建方法(SFFR)。
- 利用Kolmogorov-Arnold网络(KAN)进行空间频率特征表示。
- 引入频率分量交换策略以增强跨模态特征的互补性和一致性。
- 多尺度高斯基函数用于捕捉不同飞行高度下的尺度变化特征。
- FCEKAN和MSGKAN模块可互补捕捉频率和空间语义特征。
- 方法在无人机多光谱目标检测任务中表现优越。
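下面用 torch.fft 给出“选择性频率分量交换”的一个极简示意:对 RGB 与红外特征分别做 2D FFT,在指定的低频带内交换分量后再逆变换回空间域。频带的选择方式(按比例取中心低频)是本示例的假设,论文中的选择性策略可能不同。

```python
import torch

def exchange_frequency_components(feat_rgb, feat_ir, ratio=0.25):
    """在低频带内交换两路特征的频率分量(示意草图)。

    feat_rgb / feat_ir: (C, H, W) 的 RGB 与红外特征;
    ratio: 参与交换的低频带占边长的比例(假设值)。
    """
    fr = torch.fft.fftshift(torch.fft.fft2(feat_rgb), dim=(-2, -1))
    fi = torch.fft.fftshift(torch.fft.fft2(feat_ir), dim=(-2, -1))

    _, h, w = feat_rgb.shape
    ch, cw = h // 2, w // 2
    rh, rw = int(h * ratio / 2), int(w * ratio / 2)
    sl = (slice(None), slice(ch - rh, ch + rh), slice(cw - rw, cw + rw))

    fr_swapped, fi_swapped = fr.clone(), fi.clone()
    fr_swapped[sl], fi_swapped[sl] = fi[sl], fr[sl]       # 低频分量互换

    def back(x):
        return torch.fft.ifft2(torch.fft.ifftshift(x, dim=(-2, -1))).real

    return back(fr_swapped), back(fi_swapped)

if __name__ == "__main__":
    rgb, ir = torch.randn(8, 32, 32), torch.randn(8, 32, 32)
    out_rgb, out_ir = exchange_frequency_components(rgb, ir)
    print(out_rgb.shape, out_ir.shape)
```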
点此查看论文截图
TCSA-UDA: Text-Driven Cross-Semantic Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation
Authors:Lalit Maurya, Honghai Liu, Reyer Zwiggelaar
Unsupervised domain adaptation for medical image segmentation remains a significant challenge due to substantial domain shifts across imaging modalities, such as CT and MRI. While recent vision-language representation learning methods have shown promise, their potential in UDA segmentation tasks remains underexplored. To address this gap, we propose TCSA-UDA, a Text-driven Cross-Semantic Alignment framework that leverages domain-invariant textual class descriptions to guide visual representation learning. Our approach introduces a vision-language covariance cosine loss to directly align image encoder features with inter-class textual semantic relations, encouraging semantically meaningful and modality-invariant feature representations. Additionally, we incorporate a prototype alignment module that aligns class-wise pixel-level feature distributions across domains using high-level semantic prototypes. This mitigates residual category-level discrepancies and enhances cross-modal consistency. Extensive experiments on challenging cross-modality cardiac, abdominal, and brain tumor segmentation benchmarks demonstrate that our TCSA-UDA framework significantly reduces domain shift and consistently outperforms state-of-the-art UDA methods, establishing a new paradigm for integrating language-driven semantics into domain-adaptive medical image analysis.
由于CT与MRI等成像模态之间存在显著的域偏移,医学图像分割的无监督域自适应(UDA)仍然是一个重大挑战。虽然近期的视觉-语言表示学习方法展现出了潜力,但它们在UDA分割任务中的潜力仍未得到充分探索。为弥补这一空白,我们提出了TCSA-UDA,一个文本驱动的跨语义对齐框架,利用域不变的文本类别描述来指导视觉表示学习。我们的方法引入了一种视觉-语言协方差余弦损失,将图像编码器特征与类间文本语义关系直接对齐,鼓励学习具有语义意义且与模态无关的特征表示。此外,我们加入了一个原型对齐模块,利用高层语义原型对齐跨域的逐类像素级特征分布,从而减轻残余的类别级差异并增强跨模态一致性。在具有挑战性的跨模态心脏、腹部和脑肿瘤分割基准上的大量实验表明,我们的TCSA-UDA框架显著减少了域偏移,并持续优于最先进的UDA方法,为将语言驱动的语义融入域自适应医学图像分析建立了新的范式。
论文及项目相关链接
Summary
本文提出TCSA-UDA框架,利用文本驱动的跨语义对齐技术来解决医学影像分割中的无监督域适应问题。该框架通过引入视觉语言协方差余弦损失和原型对齐模块,利用跨模态文本语义描述来指导视觉表示学习,并有效减少域偏移,提高跨模态一致性。实验证明,TCSA-UDA在心脏、腹部和脑部肿瘤分割的跨模态任务上显著优于现有无监督域适应方法。
Key Takeaways
- TCSA-UDA框架解决了医学影像分割中的无监督域适应问题。
- 利用文本驱动的跨语义对齐技术,通过引入视觉语言协方差余弦损失来指导视觉表示学习。
- 引入原型对齐模块,通过高层次的语义原型来对齐跨域的类级像素级特征分布。
- 有效减少域偏移并增强跨模态一致性。
- 在心脏、腹部和脑部肿瘤分割的跨模态任务上进行了广泛的实验验证。
- TCSA-UDA框架显著优于现有的无监督域适应方法。
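下面是对“视觉-语言协方差余弦损失”这一思想的一个简化示意:以各类文本嵌入的两两余弦相似度矩阵为目标,让图像类原型之间的相似度结构与之对齐。具体的损失形式是本示例的假设写法,并非论文的原始定义。

```python
import torch
import torch.nn.functional as F

def cross_semantic_alignment_loss(class_prototypes, text_embeds):
    """让图像类原型间的相似度结构贴近文本类间语义关系(示意草图)。

    class_prototypes: (K, C) 图像编码器得到的各类原型特征;
    text_embeds:      (K, D) 各类文本描述的嵌入。
    """
    p = F.normalize(class_prototypes, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sim_img = p @ p.t()          # (K, K) 图像类间相似度
    sim_txt = t @ t.t()          # (K, K) 文本类间语义关系
    # 按行做余弦相似度,再取 1 - 平均值作为对齐损失
    return (1 - F.cosine_similarity(sim_img, sim_txt, dim=-1)).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    loss = cross_semantic_alignment_loss(torch.randn(5, 128), torch.randn(5, 512))
    print(float(loss))
```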
点此查看论文截图
DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation
Authors:Boyi Li, Ce Zhang, Richard M. Timmerman, Wenxuan Bao
The emergence of vision language models (VLMs) bridges the gap between vision and language, enabling multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the large domain gap and the diversity of RS inputs across tasks, particularly in open-vocabulary semantic segmentation (OVSS) and referring expression segmentation (RES). Here, we propose a training-free unified framework, termed DGL-RSIS, which decouples visual and textual representations and performs visual-language alignment at both local semantic and global contextual levels. Specifically, a Global-Local Decoupling (GLD) module decomposes textual inputs into local semantic tokens and global contextual tokens, while image inputs are partitioned into class-agnostic mask proposals. Then, a Local Visual-Textual Alignment (LVTA) module adaptively extracts context-aware visual features from the mask proposals and enriches textual features through knowledge-guided prompt engineering, achieving OVSS from a local perspective. Furthermore, a Global Visual-Textual Alignment (GVTA) module employs a global-enhanced Grad-CAM mechanism to capture contextual cues for referring expressions, followed by a mask selection module that integrates pixel-level activations into mask-level segmentation outputs, thereby achieving RES from a global perspective. Experiments on the iSAID (OVSS) and RRSIS-D (RES) benchmarks demonstrate that DGL-RSIS outperforms existing training-free approaches. Ablation studies further validate the effectiveness of each module. To the best of our knowledge, this is the first unified training-free framework for RS image segmentation, which effectively transfers the semantic capability of VLMs trained on natural images to the RS domain without additional training.
视觉语言模型(VLMs)的出现,缩小了视觉与语言之间的差距,实现了超越传统仅视觉深度学习模型的多模态理解。然而,将VLMs从自然图像领域迁移到遥感(RS)分割仍然具有挑战性,这主要是由于领域差距大以及遥感输入任务的多样性,特别是在开放词汇语义分割(OVSS)和指代表达分割(RES)中。在这里,我们提出了一种无需训练的统一框架,名为DGL-RSIS,该框架可以解耦视觉和文本表示,并在局部语义和全局上下文级别进行视觉语言对齐。具体而言,全局局部解耦(GLD)模块将文本输入分解为局部语义标记和全局上下文标记,而图像输入则被划分为类无关掩膜提案。然后,局部视觉文本对齐(LVTA)模块自适应地从掩膜提案中提取上下文感知的视觉特征,并通过知识引导提示工程丰富文本特征,从而实现从局部角度的OVSS。此外,全局视觉文本对齐(GVTA)模块采用全局增强Grad-CAM机制来捕获指代表达的上下文线索,随后是掩膜选择模块,该模块将像素级激活整合到掩膜级分割输出中,从而实现从全局角度的RES。在iSAID(OVSS)和RRSIS-D(RES)基准测试上的实验表明,DGL-RSIS优于现有的无需训练的方法。消融研究进一步验证了每个模块的有效性。据我们所知,这是第一个无需训练的统一遥感图像分割框架,能够有效地将基于自然图像训练的VLMs的语义能力转移到遥感领域而无需额外的训练。
论文及项目相关链接
Summary
基于视觉语言模型(VLMs)的多模态理解在遥感(RS)图像分割领域展现出巨大潜力。然而,由于领域差距大及遥感输入任务的多样性,VLMs在自然图像领域的应用难以直接迁移到RS分割。本研究提出了一种无需训练的统一框架DGL-RSIS,该框架可在局部语义和全局语境层面进行视觉与语言对齐。实验证明,该框架在iSAID和RRSIS-D基准测试中表现优异。
Key Takeaways
- 视觉语言模型(VLMs)能够弥合视觉与语言之间的鸿沟,实现多模态理解。
- 将VLMs从自然图像领域迁移到遥感(RS)分割存在挑战,主要由于领域差距大和任务多样性。
- 提出的DGL-RSIS框架是一种无需训练的统一方法,能够在局部语义和全局语境层面进行视觉与语言对齐。
- DGL-RSIS框架包括全局-局部解耦(GLD)模块、局部视觉-文本对齐(LVTA)模块和全局视觉-文本对齐(GVTA)模块。
- 实验证明,DGL-RSIS框架在iSAID和RRSIS-D基准测试中表现优于现有无需训练的方法。
- DGL-RSIS框架是首个将VLMs在自然图像领域的语义能力有效迁移到遥感领域的无需训练的统一框架。
点此查看论文截图
An Instance-Aware Prompting Framework for Training-free Camouflaged Object Segmentation
Authors:Chao Yin, Jide Li, Hang Yao, Xiaoqiang Li
Training-free Camouflaged Object Segmentation (COS) seeks to segment camouflaged objects without task-specific training, by automatically generating visual prompts to guide the Segment Anything Model (SAM). However, existing pipelines mostly yield semantic-level prompts, which drive SAM to coarse semantic masks and struggle to handle multiple discrete camouflaged instances effectively. To address this critical limitation, we propose an Instance-Aware Prompting Framework (IAPF) tailored for the first training-free COS that upgrades prompt granularity from semantic to instance-level while keeping all components frozen. The centerpiece is an Instance Mask Generator that (i) leverages a detector-agnostic enumerator to produce precise instance-level box prompts for the foreground tag, and (ii) introduces the Single-Foreground Multi-Background Prompting (SFMBP) strategy to sample region-constrained point prompts within each box prompt, enabling SAM to output instance masks. The pipeline is supported by a simple text prompt generator that produces image-specific tags and a self-consistency vote across synonymous task-generic prompts to stabilize inference. Extensive evaluations on three COS benchmarks, two CIS benchmarks, and two downstream datasets demonstrate state-of-the-art performance among training-free methods. Code will be released upon acceptance.
无训练伪装目标分割(COS)旨在不进行特定任务训练的情况下,通过自动生成视觉提示引导分割一切模型(SAM)来分割伪装目标。然而,现有流程大多生成语义级别的提示,这使SAM只能得到粗略的语义掩膜,并且难以有效处理多个离散的伪装实例。为了解决这一关键限制,我们提出了实例感知提示框架(IAPF),这是首个在保持所有组件冻结的同时、将提示粒度从语义级升级到实例级的无训练COS框架。其核心是一个实例掩膜生成器,它(i)利用与检测器无关的枚举器,为前景标签生成精确的实例级框提示,(ii)引入单前景多背景提示(SFMBP)策略,在每个框提示内采样受区域约束的点提示,使SAM能够输出实例掩膜。该流程由一个简单的文本提示生成器支持,用于生成特定于图像的标签,并通过对同义的任务通用提示进行自我一致性投票来稳定推断。在三个COS基准、两个CIS基准和两个下游数据集上的广泛评估表明,该方法在无训练方法中处于领先水平。代码将在论文接收后发布。
论文及项目相关链接
PDF under review
Summary
无训练伪装目标分割(COS)方法通过自动生成视觉提示来引导分割一切模型(SAM),无需特定任务训练。但现有流程主要产生语义级别的提示,导致SAM产生粗糙的语义掩膜,难以有效处理多个离散隐蔽实例。为解决此限制,我们提出实例感知提示框架(IAPF),这是首个将提示粒度从语义升级到实例级别、同时保持所有组件冻结的无训练COS框架。通过实例掩膜生成器(Instance Mask Generator)产生精确的实例级框提示,并结合单前景多背景提示策略(SFMBP),使SAM输出实例掩膜。该流程由文本提示生成器支持,产生图像特定标签,并通过对同义的任务通用提示进行自我一致性投票来稳定推断。在三个COS基准测试、两个CIS基准测试和两个下游数据集上的广泛评估显示,该方法在无训练方法中表现最佳。
Key Takeaways
- 无训练伪装目标分割(COS)方法通过自动生成视觉提示引导Segment Anything Model (SAM),无需特定任务训练。
- 现有流程主要产生语义级别的提示,难以处理多个离散隐蔽实例。
- 提出面向无训练COS的实例感知提示框架(IAPF),将提示粒度从语义升级到实例级别。
- IAPF包括一个实例掩膜生成器,产生精确实例级框提示和单前景多背景提示策略。
- 文本提示生成器支持该流程,产生图像特定标签,通过自我一致性投票稳定推断。
- 在多个基准测试和下游数据集上的评估显示,该方法在无训练方法中表现最佳。
点此查看论文截图
RaGS: Unleashing 3D Gaussian Splatting from 4D Radar and Monocular Cues for 3D Object Detection
Authors:Xiaokai Bai, Chenxu Zhou, Lianqing Zheng, Si-Yuan Cao, Jianan Liu, Xiaohan Zhang, Yiming Li, Zhengzhuang Zhang, Hui-liang Shen
4D millimeter-wave radar is a promising sensing modality for autonomous driving, yet effective 3D object detection from 4D radar and monocular images remains challenging. Existing fusion approaches either rely on instance proposals lacking global context or dense BEV grids constrained by rigid structures, lacking a flexible and adaptive representation for diverse scenes. To address this, we propose RaGS, the first framework that leverages 3D Gaussian Splatting (GS) to fuse 4D radar and monocular cues for 3D object detection. 3D GS models the scene as a continuous field of Gaussians, enabling dynamic resource allocation to foreground objects while maintaining flexibility and efficiency. Moreover, the velocity dimension of 4D radar provides motion cues that help anchor and refine the spatial distribution of Gaussians. Specifically, RaGS adopts a cascaded pipeline to construct and progressively refine the Gaussian field. It begins with Frustum-based Localization Initiation (FLI), which unprojects foreground pixels to initialize coarse Gaussian centers. Then, Iterative Multimodal Aggregation (IMA) explicitly exploits image semantics and implicitly integrates 4D radar velocity geometry to refine the Gaussians within regions of interest. Finally, Multi-level Gaussian Fusion (MGF) renders the Gaussian field into hierarchical BEV features for 3D object detection. By dynamically focusing on sparse and informative regions, RaGS achieves object-centric precision and comprehensive scene perception. Extensive experiments on View-of-Delft, TJ4DRadSet, and OmniHD-Scenes demonstrate its robustness and SOTA performance. Code will be released.
4D毫米波雷达是一种有前景的自动驾驶感知方式,但从4D雷达和单目图像进行有效的3D目标检测仍然具有挑战性。现有的融合方法要么依赖缺乏全局上下文的实例提案,要么受限于刚性结构的密集BEV网格,缺乏能灵活适应多样场景的表示。为了解决这一问题,我们提出了RaGS,这是首个利用3D高斯泼溅(3D Gaussian Splatting,GS)融合4D雷达和单目线索进行3D目标检测的框架。3D GS将场景建模为连续的高斯场,能够将资源动态分配给前景目标,同时保持灵活性和效率。此外,4D雷达的速度维度提供了运动线索,有助于锚定并优化高斯的空间分布。具体来说,RaGS采用级联流程来构建并逐步细化高斯场:首先由基于视锥的定位初始化(FLI)将前景像素反投影,以初始化粗略的高斯中心;然后由迭代多模态聚合(IMA)显式利用图像语义并隐式融合4D雷达速度几何,细化感兴趣区域内的高斯;最后由多级高斯融合(MGF)将高斯场渲染为分层BEV特征,用于3D目标检测。通过动态关注稀疏且信息丰富的区域,RaGS实现了以目标为中心的精度和全面的场景感知。在View-of-Delft、TJ4DRadSet和OmniHD-Scenes上的大量实验证明了其稳健性和SOTA性能。代码将会发布。
论文及项目相关链接
Summary
本文针对从4D毫米波雷达与单目图像进行3D目标检测的难题,提出了RaGS框架,首次利用3D高斯泼溅(3D GS)来融合4D雷达与单目线索。3D GS将场景建模为连续的高斯场,可将资源动态分配给前景目标,同时保持灵活性与效率;4D雷达的速度维度则提供运动线索,帮助锚定并优化高斯的空间分布。RaGS采用级联流程构建并逐步细化高斯场:先由基于视锥的定位初始化(FLI)反投影前景像素以初始化粗略的高斯中心,再由迭代多模态聚合(IMA)结合图像语义与雷达速度几何细化感兴趣区域内的高斯,最后由多级高斯融合(MGF)将高斯场渲染为分层BEV特征用于3D目标检测。在View-of-Delft、TJ4DRadSet和OmniHD-Scenes上的实验证明了其稳健性和领先性能,代码即将发布。
Key Takeaways
- 4D毫米波雷达是有前景的自动驾驶感知方式,但与单目图像融合进行3D目标检测仍具挑战,现有方法缺乏灵活自适应的场景表示。
- RaGS首次利用3D高斯泼溅融合4D雷达与单目线索,以连续高斯场灵活表示场景,并借助雷达速度维度优化高斯分布。
- 级联流程由FLI初始化高斯中心、IMA结合图像语义与雷达速度几何细化高斯、MGF渲染为分层BEV特征用于检测。
- 在View-of-Delft、TJ4DRadSet和OmniHD-Scenes上取得稳健且领先的性能,代码将发布。
点此查看论文截图
Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control
Authors:Danfeng Li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu
Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity’s image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
尽管扩散模型最近有进展,但顶尖的文生图(T2I)模型仍然难以实现精确的空间布局控制,即准确生成具有指定属性和位置的实体。分割掩膜到图像(S2I)生成通过融入像素级空间指导和区域文本提示,已成为一种有前景的解决方案。然而,现有的S2I方法无法同时确保语义一致性和形状一致性。为了应对这些挑战,我们提出了Seg2Any,这是一种新型S2I框架,建立在先进的多模态扩散Transformer(如FLUX)之上。首先,为了实现语义和形状的一致性,我们将分割掩膜条件分解为区域语义和高频形状两部分。通过语义对齐注意力掩膜引入区域语义条件,确保生成的实体符合其分配到的文本提示;代表实体边界的高频形状条件被编码为实体轮廓图,再作为额外模态通过多模态注意力引入,以指导图像的空间结构。其次,为了防止多实体场景中属性在实体之间泄露,我们引入了属性隔离注意力掩膜机制,该机制约束每个实体的图像令牌在图像自注意力中只关注自身。为了支持开放集S2I生成,我们构建了SACap-1M,这是一个包含100万张图像、590万个分割实体和详细区域描述的大规模数据集,以及用于全面S2I评价的SACap-Eval基准测试。大量实验表明,Seg2Any在开放集和封闭集的S2I基准测试中均达到了最先进的性能,尤其是在实体的细粒度空间和属性控制方面。
论文及项目相关链接
Summary
本文提出一种新颖的S2I框架Seg2Any,旨在解决顶级文本到图像(T2I)模型在空间布局控制上的不足。它通过引入多重机制如语义对齐注意力掩码、实体轮廓地图和多模态注意力,实现了语义和形状的一致性。同时,为防范多实体场景中属性泄露,引入属性隔离注意力掩码机制。Seg2Any在开放和封闭集合的S2I基准测试中均达到最佳性能,尤其在实体的精细空间控制方面表现突出。
Key Takeaways
- Seg2Any框架旨在解决顶级文本到图像(T2I)模型在空间布局控制上的挑战。
- 通过引入多重机制如语义对齐注意力掩码和实体轮廓地图,实现语义和形状的一致性。
- 引入属性隔离注意力掩码机制,防止多实体场景中属性泄露。
- 构建大型数据集SACap-1M和SACap-Eval基准测试,以支持开放集S2I生成和全面S2I评估。
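下面的草图演示“属性隔离注意力掩膜”的一种构造方式:给定每个图像 token 所属的实体 id,生成一个布尔掩膜,使同一实体的 token 只彼此可见,从而在自注意力中阻断属性在实体间泄露。实体 id 的来源(如由分割掩膜栅格化得到)以及背景 token 的处理方式均为假设。

```python
import torch

def attribute_isolation_mask(entity_ids):
    """根据每个图像 token 的实体 id 构造自注意力布尔掩膜(示意)。

    entity_ids: (N,) 每个 token 所属实体的编号,-1 表示背景。
    返回 (N, N) 掩膜,True 表示允许注意;背景 token 不受限制(假设的处理方式)。
    """
    same_entity = entity_ids.unsqueeze(0) == entity_ids.unsqueeze(1)   # (N, N)
    is_background = entity_ids == -1
    allow = same_entity | is_background.unsqueeze(0) | is_background.unsqueeze(1)
    return allow

if __name__ == "__main__":
    ids = torch.tensor([0, 0, 1, 1, -1, 2])
    mask = attribute_isolation_mask(ids)
    print(mask.int())
```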
点此查看论文截图
Test-Time Adaptation of Vision-Language Models for Open-Vocabulary Semantic Segmentation
Authors:Mehrdad Noori, David Osowiechi, Gustavo Adolfo Vargas Hakim, Ali Bahri, Moslem Yazdanpanah, Sahar Dastani, Farzad Beizaee, Ismail Ben Ayed, Christian Desrosiers
Recently, test-time adaptation has attracted wide interest in the context of vision-language models for image classification. However, to the best of our knowledge, the problem is completely overlooked in dense prediction tasks such as Open-Vocabulary Semantic Segmentation (OVSS). In response, we propose a novel TTA method tailored to adapting VLMs for segmentation during test time. Unlike TTA methods for image classification, our Multi-Level and Multi-Prompt (MLMP) entropy minimization integrates features from intermediate vision-encoder layers and is performed with different text-prompt templates at both the global CLS token and local pixel-wise levels. Our approach could be used as plug-and-play for any segmentation network, does not require additional training data or labels, and remains effective even with a single test sample. Furthermore, we introduce a comprehensive OVSS TTA benchmark suite, which integrates a rigorous evaluation protocol, nine segmentation datasets, 15 common synthetic corruptions, and additional real and rendered domain shifts, with a total of 87 distinct test scenarios, establishing a standardized and comprehensive testbed for future TTA research in open-vocabulary segmentation. Our experiments on this suite demonstrate that our segmentation-tailored method consistently delivers significant gains over direct adoption of TTA classification baselines. Code and data are available at https://github.com/dosowiechi/MLMP.
近期,测试时适应(Test-Time Adaptation,TTA)在用于图像分类的视觉语言模型背景下引起了广泛关注。然而,据我们所知,该问题在开放词汇语义分割(OVSS)等密集预测任务中完全被忽视。为此,我们提出了一种新型的TTA方法,专门用于在测试时为分割任务适配VLMs。与用于图像分类的TTA方法不同,我们的多层次多提示(Multi-Level and Multi-Prompt,MLMP)熵最小化结合了视觉编码器中间层的特征,并在全局CLS标记和局部像素级分别使用不同的文本提示模板。我们的方法可以作为任何分割网络的即插即用工具,无需额外的训练数据或标签,即使只有单个测试样本也能保持有效。此外,我们引入了一个全面的OVSS TTA基准套件,其中包括严格的评估协议、九个分割数据集、15种常见的合成扰动,以及额外的真实和渲染域偏移,总共有87种不同的测试场景,为未来开放词汇分割的TTA研究建立了标准化且全面的测试平台。在该套件上的实验表明,我们为分割量身定制的方法相较于直接套用TTA分类基线始终取得显著提升。代码和数据可在https://github.com/dosowiechi/MLMP获取。
论文及项目相关链接
Summary
针对开放词汇语义分割(OVSS)任务中的测试时间自适应问题,提出了一个新型的多层次多提示(MLMP)熵最小化方法。该方法结合了中间视觉编码器层的特征,并在全局CLS标记和局部像素级使用不同的文本提示模板进行。此方法可作为任何分割网络的即插即用模块,无需额外的训练数据或标签,甚至在单个测试样本上也能保持有效。此外,还引入了一个全面的OVSS TTA基准套件,为未来TTA研究提供了标准化和全面的测试平台。实验表明,该方法在分割任务上相对于直接采用TTA分类基准线具有显著优势。
Key Takeaways
- 测试时间自适应(TTA)在视觉语言模型中的图像分类中已受到广泛关注,但在密集预测任务如开放词汇语义分割(OVSS)中却被忽视。
- 提出了一种新型的TTA方法,名为多层次多提示(MLMP)熵最小化,专门用于测试时在语义分割任务中适应视觉语言模型。
- MLMP方法结合了中间视觉编码器层的特征,并在全局和局部级别使用不同的文本提示模板进行。
- 该方法可作为任何分割网络的即插即用模块,无需额外的训练数据或标签。
- 引入了一个全面的OVSS TTA基准套件,包含严格的评估协议、九个分割数据集、15种常见的合成扰动以及额外的真实和渲染域偏移,总共有87种不同的测试场景。
- 实验表明,对于所提出的分割任务,该方法相对于直接采用TTA分类基准线具有显著的优势。
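下面给出测试时“多提示熵最小化”思想的极简示意:对同一输入用多个文本提示模板得到的逐像素类别分布取平均,再最小化其熵,且只更新极少量参数(此处仅一个缩放参数,纯属假设)。这不是 MLMP 的官方实现,也未体现其多层级特征设计。

```python
import torch
import torch.nn.functional as F

def multi_prompt_entropy_loss(logits_per_prompt):
    """对多提示的逐像素预测取平均后最小化熵(示意草图)。

    logits_per_prompt: (P, K, H, W),P 个提示模板、K 个类别的逐像素 logits。
    """
    probs = F.softmax(logits_per_prompt, dim=1).mean(dim=0)    # (K, H, W)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=0)    # (H, W)
    return entropy.mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # 假设冻结的 VLM 已给出 3 个提示模板的逐像素 logits,这里只学习一个缩放参数
    base_logits = torch.randn(3, 8, 16, 16)
    scale = torch.ones(1, requires_grad=True)
    opt = torch.optim.SGD([scale], lr=0.1)
    for _ in range(5):
        loss = multi_prompt_entropy_loss(base_logits * scale)
        opt.zero_grad()
        loss.backward()
        opt.step()
    print("final entropy:", float(loss), "scale:", float(scale))
```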
点此查看论文截图
RAFT – A Domain Adaptation Framework for RGB & LiDAR Semantic Segmentation
Authors:Edward Humes, Xiaomin Lin, Boxun Hu, Rithvik Jonna, Tinoosh Mohsenin
Image segmentation is a powerful computer vision technique for scene understanding. However, real-world deployment is stymied by the need for high-quality, meticulously labeled datasets. Synthetic data provides high-quality labels while reducing the need for manual data collection and annotation. However, deep neural networks trained on synthetic data often face the Syn2Real problem, leading to poor performance in real-world deployments. To mitigate the aforementioned gap in image segmentation, we propose RAFT, a novel framework for adapting image segmentation models using minimal labeled real-world data through data and feature augmentations, as well as active learning. To validate RAFT, we perform experiments on the synthetic-to-real “SYNTHIA->Cityscapes” and “GTAV->Cityscapes” benchmarks. We managed to surpass the previous state of the art, HALO. SYNTHIA->Cityscapes experiences an improvement in mIoU* upon domain adaptation of 2.1%/79.9%, and GTAV->Cityscapes experiences a 0.4%/78.2% improvement in mIoU. Furthermore, we test our approach on the real-to-real benchmark of “Cityscapes->ACDC”, and again surpass HALO, with a gain in mIoU upon adaptation of 1.3%/73.2%. Finally, we examine the effect of the allocated annotation budget and various components of RAFT upon the final transfer mIoU.
图像分割是一种用于场景理解的强大计算机视觉技术。然而,实际部署受到需要高质量、精心标注数据集的阻碍。合成数据提供了高质量标签,同时减少了手动数据收集和标注的需求;然而,在合成数据上训练的深度神经网络常常面临Syn2Real问题,导致在实际部署中的性能不佳。为缓解图像分割中的上述差距,我们提出了RAFT,一种通过数据与特征增强以及主动学习、仅使用少量有标签真实数据来适配图像分割模型的新框架。为验证RAFT,我们在合成到真实的“SYNTHIA->Cityscapes”和“GTAV->Cityscapes”基准上进行了实验,超越了此前的最优方法HALO:SYNTHIA->Cityscapes在域适应后mIoU*提升2.1%(达到79.9%),GTAV->Cityscapes的mIoU提升0.4%(达到78.2%)。此外,我们在真实到真实的“Cityscapes->ACDC”基准上测试了该方法,再次超越HALO,适应后mIoU提升1.3%(达到73.2%)。最后,我们考察了标注预算以及RAFT各组成部分对最终迁移mIoU的影响。
论文及项目相关链接
PDF Submitted to RA-L
Summary
图像分割技术是一种强大的计算机视觉场景理解技术,但其在现实世界的部署受限于高质量、精细标注的数据集的需求。合成数据虽然可以提供高质量标签,减少手动数据收集和标注的需要,但深度神经网络在合成数据上常常面临合成到现实(Syn2Real)问题,导致在现实世界部署时表现不佳。为缓解图像分割中的上述问题,我们提出了RAFT框架,通过数据和特征增强以及主动学习,使用最少的有标签现实世界数据来适应图像分割模型。实验表明,RAFT在合成到现实的“SYNTHIA→Cityscapes”和“GTAV→Cityscapes”基准测试上超越了之前的最佳方法HALO,并在真实到真实的“Cityscapes→ACDC”基准上同样优于HALO;文中还分析了标注预算与各组件对最终迁移mIoU的影响。该框架为图像分割技术的实际应用提供了新的方向。
Key Takeaways
- 图像分割是计算机视觉中的关键技术,但在现实世界的部署受限于数据集的标注质量。
- 合成数据能够提供高质量标签,减少手动数据收集和标注的需求。
- 深度神经网络在合成数据上常面临合成到现实问题(Syn2Real),导致在现实世界表现不佳。
- 为解决上述问题,提出了RAFT框架,通过数据和特征增强以及主动学习适应图像分割模型。
- RAFT在合成到现实的基准测试上超越了之前的最佳水平,如“SYNTHIA→Cityscapes”和“GTAV→Cityscapes”。
- RAFT在实际部署中也表现出优越性,例如在“Cityscapes→ACDC”基准测试上超越了HALO。
点此查看论文截图
SM3Det: A Unified Model for Multi-Modal Remote Sensing Object Detection
Authors:Yuxuan Li, Xiang Li, Yunheng Li, Yicheng Zhang, Yimian Dai, Qibin Hou, Ming-Ming Cheng, Jian Yang
With the rapid advancement of remote sensing technology, high-resolution multi-modal imagery is now more widely accessible. Conventional Object detection models are trained on a single dataset, often restricted to a specific imaging modality and annotation format. However, such an approach overlooks the valuable shared knowledge across multi-modalities and limits the model’s applicability in more versatile scenarios. This paper introduces a new task called Multi-Modal Datasets and Multi-Task Object Detection (M2Det) for remote sensing, designed to accurately detect horizontal or oriented objects from any sensor modality. This task poses challenges due to 1) the trade-offs involved in managing multi-modal modelling and 2) the complexities of multi-task optimization. To address these, we establish a benchmark dataset and propose a unified model, SM3Det (Single Model for Multi-Modal datasets and Multi-Task object Detection). SM3Det leverages a grid-level sparse MoE backbone to enable joint knowledge learning while preserving distinct feature representations for different modalities. Furthermore, it integrates a consistency and synchronization optimization strategy using dynamic learning rate adjustment, allowing it to effectively handle varying levels of learning difficulty across modalities and tasks. Extensive experiments demonstrate SM3Det’s effectiveness and generalizability, consistently outperforming specialized models on individual datasets. The code is available at https://github.com/zcablii/SM3Det.
随着遥感技术的快速发展,高分辨率多模态图像现在更加易于获取。传统的目标检测模型是在单一数据集上进行训练的,通常局限于特定的成像模式和注释格式。然而,这种方法忽略了多模态之间宝贵的知识共享,并限制了模型在更通用场景中的应用。本文引入了一个名为遥感多模态数据集和多任务目标检测(M2Det)的新任务,旨在从任何传感器模态准确检测水平或定向目标。这项任务由于1)涉及多模态建模的权衡和2)多任务优化的复杂性而具有挑战性。为了解决这些问题,我们建立了一个基准数据集,并提出了一种统一模型SM3Det(用于多模态数据集和多任务目标检测的单模型)。SM3Det利用网格级稀疏MoE骨干网,实现联合知识学习,同时保留不同模态的独特特征表示。此外,它采用一致性同步优化策略,通过动态调整学习率,能够有效处理不同模态和任务之间不同程度的学习难度。大量实验证明了SM3Det的有效性和通用性,在单个数据集上始终优于专业模型。相关代码可通过https://github.com/zcablii/SM3Det获取。
论文及项目相关链接
PDF Accepted as Oral in AAAI 2026
Summary
随着遥感技术的快速发展,高分辨率多模态影像现在更加普及。本文提出了遥感中的多模态数据集多任务目标检测(M2Det)新任务,旨在准确检测任何传感器模态的水平或定向目标。为应对多模态建模和多任务优化的挑战,建立了基准数据集,并提出了统一模型SM3Det。SM3Det利用网格级稀疏MoE骨干网实现联合知识学习,同时保留不同模态的特定特征表示。通过动态调整学习速率,采用一致性同步优化策略,有效处理不同模态和任务的学习难度差异。实验证明SM3Det的有效性和泛化能力,在多个数据集上均优于专项模型。
Key Takeaways
- 高分辨率多模态影像的普及推动了遥感领域的新任务——多模态数据集多任务目标检测(M2Det)。
- M2Det旨在准确检测任何传感器模态的水平或定向目标。
- 在处理多模态建模和多任务优化时面临的挑战通过提出SM3Det模型得到解决。
- SM3Det利用网格级稀疏MoE骨干网实现联合知识学习,同时考虑不同模态的特定特征。
- SM3Det通过动态调整学习速率,采用一致性同步优化策略,以处理不同模态和任务的学习难度差异。
- 实验证明SM3Det在多个数据集上的表现优于专项模型。
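下面用 PyTorch 给出“网格级稀疏 MoE”路由思想的极简示意:对特征图上每个网格位置,用路由器选出 top-k 个专家并按门控权重加权求和。专家数量、top-k 与专家结构均为说明而设的假设,与论文实现没有必然对应;为简洁起见,这里对所有专家做了稠密计算,真实实现会按路由结果做稀疏分发。

```python
import torch
import torch.nn as nn

class GridSparseMoE(nn.Module):
    """网格级稀疏 MoE 层:每个空间位置只激活 top-k 个专家(示意草图)。"""

    def __init__(self, dim=64, num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: (B, H, W, C),把每个网格位置视为一个 token
        gate = torch.softmax(self.router(x), dim=-1)               # (B, H, W, E)
        topv, topi = gate.topk(self.top_k, dim=-1)                 # 选出 top-k 专家
        topv = topv / topv.sum(dim=-1, keepdim=True)               # 重新归一化门控权重

        out = torch.zeros_like(x)
        # 示意起见对所有专家做稠密前向,(B, H, W, E, C)
        all_expert_out = torch.stack([e(x) for e in self.experts], dim=-2)
        for k in range(self.top_k):
            idx = topi[..., k:k + 1].unsqueeze(-1).expand(*x.shape[:-1], 1, x.shape[-1])
            out = out + topv[..., k:k + 1] * all_expert_out.gather(-2, idx).squeeze(-2)
        return out

if __name__ == "__main__":
    layer = GridSparseMoE()
    y = layer(torch.randn(2, 8, 8, 64))
    print(y.shape)   # torch.Size([2, 8, 8, 64])
```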
点此查看论文截图
MMCL: Correcting Content Query Distributions for Improved Anti-Overlapping X-Ray Object Detection
Authors:Mingyuan Li, Tong Jia, Hui Lu, Hao Wang, Bowen Ma, Shiyi Guo, Shuyang Lin, Dongyue Chen, Haoran Wang, Baosheng Yu
Unlike natural images with occlusion-based overlap, X-ray images exhibit depth-induced superimposition and semi-transparent appearances, where objects at different depths overlap and their features blend together. These characteristics demand specialized mechanisms to disentangle mixed representations between target objects (e.g., prohibited items) and irrelevant backgrounds. While recent studies have explored adapting detection transformers (DETR) for anti-overlapping object detection, the importance of well-distributed content queries that represent object hypotheses remains underexplored. In this paper, we introduce a multi-class min-margin contrastive learning (MMCL) framework to correct the distribution of content queries, achieving balanced intra-class diversity and inter-class separability. The framework first groups content queries by object category and then applies two proposed complementary loss components: a multi-class exclusion loss to enhance inter-class separability, and a min-margin clustering loss to encourage intra-class diversity. We evaluate the proposed method on three widely used X-ray prohibited-item detection datasets, PIXray, OPIXray, and PIDray, using two backbone networks and four DETR variants. Experimental results demonstrate that MMCL effectively enhances anti-overlapping object detection and achieves state-of-the-art performance on both datasets. Code will be made publicly available on GitHub.
不同于以遮挡导致重叠为主的自然图像,X射线图像表现出深度引起的叠加和半透明外观,不同深度的物体彼此重叠,其特征融合在一起。这些特点需要专门的机制来解开目标对象(如违禁物品)和无关背景之间的混合表示。虽然近期研究已经探索了将检测Transformer(DETR)用于反重叠目标检测,但代表目标假设的内容查询是否分布良好这一问题仍未得到充分探索。在本文中,我们引入了一种多类最小边距对比学习(MMCL)框架,以校正内容查询的分布,实现类内多样性与类间可分性的平衡。该框架首先按对象类别对内容查询进行分组,然后应用两个互补的损失组件:增强类间可分性的多类排除损失,以及促进类内多样性的最小边距聚类损失。我们在三个广泛使用的X射线违禁物品检测数据集PIXray、OPIXray和PIDray上评估了所提出的方法,使用了两个主干网络和四个DETR变体。实验结果表明,MMCL有效地提高了反重叠目标检测性能,并在这些数据集上均达到了最先进的性能。代码将在GitHub上公开提供。
论文及项目相关链接
PDF 16 pages,8 figures
Summary
本文提出了一种基于多类最小边距对比学习(MMCL)的框架,用于改善内容查询的分布,实现类内多样性和类间可分性的平衡。该框架通过分组内容查询和采用两种互补的损失组件——多类排除损失和最小边距聚类损失,来提升反重叠物体的检测性能。在三个广泛使用的X光违禁物品检测数据集上的实验结果表明,MMCL有效地增强了反重叠物体检测性能,达到了先进的表现水平。
Key Takeaways
- X光图像具有深度引起的叠加和半透明外观,需要特殊机制来解开目标物体与无关背景之间的混合表示。
- 近期研究已探索了使用检测Transformer(DETR)进行反重叠物体检测,但内容查询分布的重要性仍未得到充分探索。
- 本文引入了多类最小边距对比学习(MMCL)框架,以纠正内容查询的分布,实现类内多样性和类间可分性的平衡。
- MMCL框架通过分组内容查询并采用多类排除损失和最小边距聚类损失来提高反重叠物体检测性能。
- 实验结果表明,MMCL在X光违禁物品检测方面达到了先进的表现水平,适用于多种数据集和backbone网络以及DETR变体。
- MMCL框架将在GitHub上公开可用。
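下面给出对两个损失项直觉的简化示意:多类排除损失压低不同类别内容查询之间的相似度以增强类间可分性,最小边距聚类损失则限制同类查询之间的相似度上限,以保留类内多样性。具体公式为本示例的假设写法,并非论文原始定义。

```python
import torch
import torch.nn.functional as F

def mmcl_losses(queries, labels, margin=0.2):
    """多类排除 + 最小边距聚类损失的简化示意(非论文原始公式)。

    queries: (N, C) 内容查询特征;labels: (N,) 每个查询的类别。
    """
    q = F.normalize(queries, dim=-1)
    sim = q @ q.t()                                            # (N, N) 余弦相似度
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(q), dtype=torch.bool)

    # 多类排除:压低不同类查询之间的相似度
    exclusion = sim[~same].clamp(min=0).mean()
    # 最小边距聚类:同类查询相似度不超过 1 - margin,鼓励类内保持多样性
    intra = sim[same & ~eye]
    min_margin = ((intra - (1 - margin)).clamp(min=0).mean()
                  if intra.numel() else sim.new_zeros(()))
    return exclusion, min_margin

if __name__ == "__main__":
    torch.manual_seed(0)
    q = torch.randn(12, 64, requires_grad=True)
    labels = torch.randint(0, 3, (12,))
    l_exc, l_mm = mmcl_losses(q, labels)
    (l_exc + l_mm).backward()
    print(float(l_exc), float(l_mm))
```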