
检测/分割/跟踪


⚠️ 以下所有内容总结均由大语言模型生成,如有错误仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验

2025-10-18 更新

CoT-PL: Visual Chain-of-Thought Reasoning Meets Pseudo-Labeling for Open-Vocabulary Object Detection

Authors:Hojun Choi, Youngsun Lim, Jaeyo Shin, Hyunjung Shim

Open-vocabulary object detection (OVD) seeks to recognize and localize object categories beyond those seen during training. Recent approaches typically leverage vision-language models (VLMs) to generate pseudo-labels using image-text alignment, allowing detectors to generalize to unseen classes without explicit supervision. However, these methods depend heavily on direct image-text matching, neglecting the intermediate reasoning steps essential for interpreting semantically complex scenes. This results in limited robustness when confronted with crowded or occluded visual contexts. In this paper, we introduce CoT-PL, a new framework that employs structured visual chain-of-thought (CoT) reasoning into the pseudo-labeling process. CoT-PL decomposes object understanding into three interpretable steps: (1) region perception even for unseen objects, (2) category recognition via zero-shot reasoning, and (3) background grounding to separate semantically complex objects. Crucially, the third step naturally motivates our contrastive background learning (CBL) that uses the pre-computed background cues as negatives to promote feature disentanglement between objects and background. In this way, CoT reasoning and CBL form an integrated pipeline tailored to robust pseudo-labeling in crowded or occluded scenes. Notably, in these two settings, our novel-class pseudo-label quality achieves relative improvements of 103.4% and 168.4% over the best prior, respectively. Our extensive experiments demonstrate that CoT-PL achieves +7.7 AP50 on open-vocabulary COCO and +2.9 mask AP on LVIS for novel classes, setting a new state of the art.

开放词汇对象检测(OVD)旨在识别和定位训练期间未见过的对象类别。最近的方法通常利用视觉语言模型(VLM)通过图像-文本对齐生成伪标签,使检测器能够在没有显式监督的情况下泛化到未见类别。然而,这些方法严重依赖直接的图像-文本匹配,忽略了理解语义复杂场景所必需的中间推理步骤,导致在面对拥挤或遮挡的视觉环境时稳健性有限。在本文中,我们提出了CoT-PL,一个将结构化的视觉思维链(CoT)推理引入伪标签生成过程的新框架。CoT-PL将对象理解分解为三个可解释的步骤:(1)对未见对象的区域感知,(2)通过零样本推理进行类别识别,以及(3)背景定位以分离语义复杂的对象。关键的是,第三步自然地引出了我们的对比背景学习(CBL):它使用预先计算的背景线索作为负样本,促进对象与背景特征之间的解耦。通过这种方式,CoT推理和CBL构成了一条面向拥挤或遮挡场景中稳健伪标签生成的集成管道。值得注意的是,在这两种设置下,我们的新类别(novel-class)伪标签质量相对于此前最佳方法分别取得了103.4%和168.4%的相对提升。大量实验表明,CoT-PL在开放词汇COCO上对新类别实现了+7.7 AP50,在LVIS上实现了+2.9的掩码AP,创造了新的最先进水平。

论文及项目相关链接

PDF 28 pages, 13 Figures, 12 Tables

Summary

该文提出了一种新的开放词汇对象检测框架CoT-PL,它将结构化视觉思维链(CoT)推理与对比背景学习(CBL)相结合,以解决现有方法在拥挤或遮挡等语义复杂场景中的局限性。CoT-PL将对象理解分解为区域感知、类别识别和背景定位三个可解释的步骤,从而提升对未见类别对象的伪标签质量与检测性能。

Key Takeaways

  1. 开放词汇对象检测(OVD)能识别训练期间未见过的对象类别。
  2. 现有方法主要依赖图像文本匹配生成伪标签,忽视了复杂的场景理解步骤。
  3. CoT-PL框架引入结构化视觉思维链(CoT)推理,分解对象理解为三个步骤:区域感知、类别识别和背景定位。
  4. 对比背景学习(CBL)方法被提出,利用预计算的背景线索作为负样本,促进对象和背景特征的分离。
  5. CoT-PL框架提高了在拥挤或遮挡场景中的伪标签质量,相对现有最佳方法分别提高了103.4%和168.4%。
  6. 在开放词汇COCO和LVIS数据集上,CoT-PL实现了先进性能,分别提高了7.7 AP50和2.9 mask AP。
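
论文未在摘要中给出对比背景学习(CBL)的公式,下面给出一个非官方的极简示意(PyTorch):假设已预先计算好背景区域特征 bg_feats 作为负样本,将各区域特征与其伪标签类别的文本嵌入作为正样本对,按 InfoNCE 形式计算损失;函数名与温度参数 tau 均为示意性假设。

```python
import torch
import torch.nn.functional as F

def contrastive_background_loss(obj_feats, text_feats, bg_feats, tau=0.07):
    """对比背景学习(CBL)的示意实现(非官方)。

    obj_feats:  (N, D) 候选目标区域特征
    text_feats: (N, D) 与各区域伪标签类别对应的文本嵌入(正样本)
    bg_feats:   (M, D) 预先计算的背景线索特征(负样本)
    """
    obj = F.normalize(obj_feats, dim=-1)
    pos = F.normalize(text_feats, dim=-1)
    neg = F.normalize(bg_feats, dim=-1)

    pos_logit = (obj * pos).sum(dim=-1, keepdim=True) / tau       # (N, 1)
    neg_logits = obj @ neg.t() / tau                              # (N, M)

    logits = torch.cat([pos_logit, neg_logits], dim=1)            # (N, 1+M)
    target = torch.zeros(obj.size(0), dtype=torch.long, device=obj.device)  # 正样本位于第0列
    return F.cross_entropy(logits, target)

# 用法示意
loss = contrastive_background_loss(torch.randn(8, 512), torch.randn(8, 512), torch.randn(32, 512))
```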

Cool Papers

点此查看论文截图

Multiplicative Loss for Enhancing Semantic Segmentation in Medical and Cellular Images

Authors:Yuto Yokoi, Kazuhiro Hotta

We propose two novel loss functions, Multiplicative Loss and Confidence-Adaptive Multiplicative Loss, for semantic segmentation in medical and cellular images. Although Cross Entropy and Dice Loss are widely used, their additive combination is sensitive to hyperparameters and often performs suboptimally, especially with limited data. Medical images suffer from data scarcity due to privacy, ethics, and costly annotations, requiring robust and efficient training objectives. Our Multiplicative Loss combines Cross Entropy and Dice losses multiplicatively, dynamically modulating gradients based on prediction confidence. This reduces penalties for confident correct predictions and amplifies gradients for incorrect overconfident ones, stabilizing optimization. Building on this, Confidence-Adaptive Multiplicative Loss applies a confidence-driven exponential scaling inspired by Focal Loss, integrating predicted probabilities and Dice coefficients to emphasize difficult samples. This enhances learning under extreme data scarcity by strengthening gradients when confidence is low. Experiments on cellular and medical segmentation benchmarks show our framework consistently outperforms tuned additive and existing loss functions, offering a simple, effective, and hyperparameter-free mechanism for robust segmentation under challenging data limitations.

我们针对医学和细胞图像的语义分割提出了两种新的损失函数:乘法损失(Multiplicative Loss)和置信自适应乘法损失(Confidence-Adaptive Multiplicative Loss)。尽管交叉熵和Dice损失已被广泛使用,但二者的加性组合对超参数敏感,在数据有限时往往表现欠佳。由于隐私、伦理和高昂的标注成本,医学图像普遍面临数据稀缺问题,因此需要稳健且高效的训练目标。我们的乘法损失以相乘的方式组合交叉熵和Dice损失,基于预测置信度动态调节梯度:减少对高置信度正确预测的惩罚,并放大对过度自信的错误预测的梯度,从而稳定优化。在此基础上,置信自适应乘法损失借鉴Focal Loss,引入由置信度驱动的指数缩放,结合预测概率与Dice系数来强调困难样本,在数据极度稀缺时通过增强低置信度处的梯度来改善学习。在细胞和医学分割基准上的实验表明,我们的框架始终优于经过调参的加性组合及现有损失函数,为具有挑战性的数据受限场景提供了一种简单、有效且无需超参数的稳健分割机制。

论文及项目相关链接

PDF Accepted by ICCV2025 Workshop “Third Workshop on Computer Vision for Automated Medical Diagnosis”

Summary

本文提出两种新颖的适用于医学和细胞图像语义分割的损失函数:乘法损失和置信自适应乘法损失。虽然交叉熵和Dice损失广泛应用于图像分割任务,但二者的加性组合对超参数敏感,在数据有限时性能欠佳。本文提出的乘法损失以相乘方式结合交叉熵和Dice损失,根据预测置信度动态调节梯度,减少对高置信度正确预测的惩罚,并放大过度自信的错误预测的梯度,从而稳定优化过程。在此基础上,置信自适应乘法损失应用基于置信度的指数缩放策略,结合预测概率和Dice系数以突出困难样本。在细胞和医学图像分割基准测试上的实验表明,该框架在具有挑战性的数据受限条件下始终优于调参后的加性组合和现有损失函数,为稳健分割提供了一个简单、有效且无需超参数的新机制。

Key Takeaways

  1. 提出两种新的损失函数:乘法损失和置信自适应乘法损失,用于医学和细胞图像的语义分割。
  2. 针对数据稀缺的医学图像问题,需要稳健和高效的训练目标。
  3. 乘法损失结合交叉熵和Dice损失,通过乘法方式组合,根据预测置信度动态调节梯度。
  4. 置信自适应乘法损失应用了一种基于信心的指数缩放策略,集成预测概率和Dice系数,强调困难样本的学习。
  5. 提出的框架在细胞和医学图像分割基准测试上表现优越,相较于其他损失函数具有优势。
  6. 新框架适用于数据有限的情况,对于挑战性的数据限制条件具有稳健性。
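
摘要只说明交叉熵与Dice以乘法方式组合,并未给出精确公式;下面是基于这一描述的极简示意(PyTorch),假设组合形式为 L = L_CE × L_Dice,Dice 按 softmax 概率的 soft Dice 计算,函数与变量名均为假设。

```python
import torch
import torch.nn.functional as F

def soft_dice_loss(probs, target_onehot, eps=1e-6):
    # probs / target_onehot: (B, C, H, W)
    dims = (0, 2, 3)
    inter = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    dice = (2 * inter + eps) / (union + eps)
    return 1.0 - dice.mean()

def multiplicative_loss(logits, target):
    """乘法损失的示意实现:L = L_CE * L_Dice(非官方公式)。"""
    ce = F.cross_entropy(logits, target)
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, num_classes=logits.size(1)).permute(0, 3, 1, 2).float()
    dice = soft_dice_loss(probs, onehot)
    return ce * dice

# 用法示意
logits = torch.randn(2, 3, 64, 64, requires_grad=True)
target = torch.randint(0, 3, (2, 64, 64))
loss = multiplicative_loss(logits, target)
loss.backward()
```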

Cool Papers

点此查看论文截图

APGNet: Adaptive Prior-Guided for Underwater Camouflaged Object Detection

Authors:Xinxin Huang, Han Sun, Junmin Cai, Ningzhong Liu, Huiyu Zhou

Detecting camouflaged objects in underwater environments is crucial for marine ecological research and resource exploration. However, existing methods face two key challenges: underwater image degradation, including low contrast and color distortion, and the natural camouflage of marine organisms. Traditional image enhancement techniques struggle to restore critical features in degraded images, while camouflaged object detection (COD) methods developed for terrestrial scenes often fail to adapt to underwater environments due to the lack of consideration for underwater optical characteristics. To address these issues, we propose APGNet, an Adaptive Prior-Guided Network, which integrates a Siamese architecture with a novel prior-guided mechanism to enhance robustness and detection accuracy. First, we employ the Multi-Scale Retinex with Color Restoration (MSRCR) algorithm for data augmentation, generating illumination-invariant images to mitigate degradation effects. Second, we design an Extended Receptive Field (ERF) module combined with a Multi-Scale Progressive Decoder (MPD) to capture multi-scale contextual information and refine feature representations. Furthermore, we propose an adaptive prior-guided mechanism that hierarchically fuses position and boundary priors by embedding spatial attention in high-level features for coarse localization and using deformable convolution to refine contours in low-level features. Extensive experimental results on two public MAS datasets demonstrate that our proposed method APGNet outperforms 15 state-of-art methods under widely used evaluation metrics.

在水下环境中检测伪装物体对海洋生态研究和资源勘探至关重要。然而,现有方法面临两大挑战:水下图像退化(包括低对比度和颜色失真)以及海洋生物的天然伪装。传统图像增强技术难以恢复退化图像中的关键特征,而针对陆地场景开发的伪装目标检测(COD)方法由于未考虑水下光学特性,往往难以适应水下环境。为了解决这些问题,我们提出了自适应先验引导网络APGNet,它将Siamese架构与一种新颖的先验引导机制相结合,以提升稳健性和检测精度。首先,我们采用带颜色恢复的多尺度Retinex算法(MSRCR)进行数据增强,生成光照不变图像以缓解退化影响。其次,我们设计了扩展感受野(ERF)模块并结合多尺度渐进解码器(MPD),以捕获多尺度上下文信息并细化特征表示。此外,我们提出一种自适应先验引导机制,分层融合位置先验与边界先验:在高层特征中嵌入空间注意力实现粗定位,并在低层特征中使用可变形卷积细化轮廓。在两个公开MAS数据集上的大量实验结果表明,APGNet在常用评价指标下优于15种最先进的方法。

论文及项目相关链接

PDF 6 pages. accepted by ACM MM Asia 2025

Summary
水下伪装物体检测对于海洋生态研究和资源探索至关重要。当前方法面临两大挑战:水下图像退化和海洋生物的天然伪装。提出的APGNet网络结合Siamese架构和新颖先验引导机制,提高稳健性和检测精度。采用MSRCR算法进行数据增强,生成光照不变图像以减轻退化影响。设计ERF模块和MPD解码器,捕捉多尺度上下文信息并优化特征表示。提出自适应先验引导机制,分层融合位置和边界先验,嵌入高级特征中的空间注意力实现粗略定位,并使用可变形卷积优化低级特征中的轮廓。在公共MAS数据集上的实验表明,APGNet优于其他15种先进方法。

Key Takeaways

  1. 水下伪装物体检测在海洋生态研究和资源探索中具有重要作用。
  2. 当前方法面临水下图像退化和海洋生物自然伪装两大挑战。
  3. APGNet网络通过结合Siamese架构和新颖先验引导机制来解决这些问题。
  4. 使用MSRCR算法进行数据增强,以减轻图像退化的影响。
  5. ERF模块和MPD解码器的设计用于捕捉多尺度上下文信息并优化特征表示。
  6. 自适应先验引导机制结合了位置和边界先验,以实现粗略定位和轮廓优化。
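
MSRCR(带颜色恢复的多尺度Retinex)是经典的图像增强算法,论文将其用于数据增强;下面给出一个简化示意(NumPy/SciPy),其中尺度 sigmas 与系数 alpha、beta 取常见经验值,并非论文设定,输出做简单的线性拉伸。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def msrcr(img, sigmas=(15, 80, 250), alpha=125.0, beta=46.0, eps=1.0):
    """简化版 MSRCR:多尺度 Retinex + 颜色恢复(示意实现)。
    img: float 数组,形状 (H, W, 3),取值范围 [0, 255]。
    """
    img = img.astype(np.float64) + eps
    # 多尺度 Retinex:log(I) - log(高斯模糊后的 I),各尺度取平均
    msr = np.zeros_like(img)
    for sigma in sigmas:
        blurred = np.stack([gaussian_filter(img[..., c], sigma) for c in range(3)], axis=-1)
        msr += np.log(img) - np.log(blurred + eps)
    msr /= len(sigmas)
    # 颜色恢复因子
    color = beta * (np.log(alpha * img) - np.log(img.sum(axis=-1, keepdims=True)))
    out = msr * color
    # 线性拉伸回 [0, 255]
    out = (out - out.min()) / (out.max() - out.min() + 1e-12) * 255.0
    return out.astype(np.uint8)

# 用法示意:对一张水下图像做光照归一化增强
enhanced = msrcr(np.random.rand(128, 128, 3) * 255)
```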

Cool Papers

点此查看论文截图

A Framework for Low-Effort Training Data Generation for Urban Semantic Segmentation

Authors:Denis Zavadski, Damjan Kalšan, Tim Küchler, Haebom Lee, Stefan Roth, Carsten Rother

Synthetic datasets are widely used for training urban scene recognition models, but even highly realistic renderings show a noticeable gap to real imagery. This gap is particularly pronounced when adapting to a specific target domain, such as Cityscapes, where differences in architecture, vegetation, object appearance, and camera characteristics limit downstream performance. Closing this gap with more detailed 3D modelling would require expensive asset and scene design, defeating the purpose of low-cost labelled data. To address this, we present a new framework that adapts an off-the-shelf diffusion model to a target domain using only imperfect pseudo-labels. Once trained, it generates high-fidelity, target-aligned images from semantic maps of any synthetic dataset, including low-effort sources created in hours rather than months. The method filters suboptimal generations, rectifies image-label misalignments, and standardises semantics across datasets, transforming weak synthetic data into competitive real-domain training sets. Experiments on five synthetic datasets and two real target datasets show segmentation gains of up to +8.0%pt. mIoU over state-of-the-art translation methods, making rapidly constructed synthetic datasets as effective as high-effort, time-intensive synthetic datasets requiring extensive manual design. This work highlights a valuable collaborative paradigm where fast semantic prototyping, combined with generative models, enables scalable, high-quality training data creation for urban scene understanding.

合成数据集被广泛用于训练城市场景识别模型,但即使是高度逼真的渲染,与真实图像之间仍存在明显差距。在适配特定目标域(例如Cityscapes)时,这一差距尤为突出:建筑、植被、物体外观和相机特性的差异限制了下游性能。若要通过更精细的3D建模来弥合差距,则需要昂贵的素材与场景设计,这违背了低成本标注数据的初衷。为此,我们提出了一个新框架,仅使用不完美的伪标签即可将现成的扩散模型适配到目标域。训练完成后,它可以从任意合成数据集(包括数小时而非数月构建的低投入数据源)的语义图生成高保真、与目标域对齐的图像。该方法会过滤次优的生成结果,纠正图像与标签的错位,并在数据集之间统一语义,将薄弱的合成数据转化为有竞争力的真实域训练集。在五个合成数据集和两个真实目标数据集上的实验表明,相比最先进的图像翻译方法,分割性能最高提升+8.0个百分点的mIoU,使快速构建的合成数据集达到与需要大量人工设计、耗时费力的高投入合成数据集相当的效果。这项工作展示了一种有价值的协作范式:快速语义原型设计与生成模型相结合,为城市场景理解实现可扩展的高质量训练数据构建。

论文及项目相关链接

PDF

Summary

本文提出了一种新框架,仅利用不完美的伪标签即可将现成的扩散模型适配到目标域。训练完成后,该框架可以从任意合成数据集(包括低成本快速构建的数据源)的语义图生成高保真、与目标域对齐的图像。该方法能过滤不佳的生成图像,修正图像与标签不对齐的问题,并统一跨数据集的语义,从而将弱合成数据转化为有竞争力的真实域训练集。实验表明,在五个合成数据集和两个真实目标数据集上,相比最先进的图像翻译方法,该方法带来最高+8.0个百分点的mIoU分割提升,使快速构建的合成数据集达到与需要大量人工设计的高投入合成数据集相当的效果。这项研究凸显了一种协作范式:快速语义原型设计与生成模型相结合,实现面向城市场景理解的可扩展、高质量训练数据构建。

Key Takeaways

  1. 合成数据集广泛用于训练城市场景识别模型,但与现实图像存在差距。
  2. 现有方法通过更详细的3D建模来缩小差距会增加成本和耗时,不符合低成本标签数据的初衷。
  3. 新框架适应目标域仅使用不完美的伪标签,能快速从语义地图生成高保真、目标对齐的图像。
  4. 方法能够过滤不佳生成图像,修正图像标签不对齐问题,标准化跨数据集的语义。
  5. 与现有翻译方法相比,新框架在多个数据集上的分割增益显著。
  6. 快速语义原型与生成模型的结合提高了合成数据的质量,使其接近真实数据的效果。
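
摘要提到框架会"过滤次优生成并纠正图像-标签错位",但未给出具体准则;下面是一个假设性的示意:用一个预训练分割模型对生成图像重新预测,再与作为生成条件的语义图计算 mIoU,低于阈值即丢弃。其中 segment_fn 接口与阈值 thr 均为假设,与论文实现无关。

```python
import numpy as np

def miou(pred, label, num_classes):
    """逐类 IoU 的平均值(忽略未出现的类)。pred/label: (H, W) 的类别索引图。"""
    ious = []
    for c in range(num_classes):
        p, l = pred == c, label == c
        union = np.logical_or(p, l).sum()
        if union == 0:
            continue
        ious.append(np.logical_and(p, l).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

def filter_generations(images, semantic_maps, segment_fn, num_classes, thr=0.5):
    """假设性的生成样本过滤:与条件语义图的一致性(mIoU)低于 thr 的样本被丢弃。"""
    kept = []
    for img, sem in zip(images, semantic_maps):
        pred = segment_fn(img)   # 假设的预训练分割模型接口,返回 (H, W) 类别图
        if miou(pred, sem, num_classes) >= thr:
            kept.append((img, sem))
    return kept
```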

Cool Papers

点此查看论文截图

Bridging Perspectives: Foundation Model Guided BEV Maps for 3D Object Detection and Tracking

Authors:Markus Käppeler, Özgün Çiçek, Daniele Cattaneo, Claudius Gläser, Yakov Miron, Abhinav Valada

Camera-based 3D object detection and tracking are essential for perception in autonomous driving. Current state-of-the-art approaches often rely exclusively on either perspective-view (PV) or bird’s-eye-view (BEV) features, limiting their ability to leverage both fine-grained object details and spatially structured scene representations. In this work, we propose DualViewDistill, a hybrid detection and tracking framework that incorporates both PV and BEV camera image features to leverage their complementary strengths. Our approach introduces BEV maps guided by foundation models, leveraging descriptive DINOv2 features that are distilled into BEV representations through a novel distillation process. By integrating PV features with BEV maps enriched with semantic and geometric features from DINOv2, our model leverages this hybrid representation via deformable aggregation to enhance 3D object detection and tracking. Extensive experiments on the nuScenes and Argoverse 2 benchmarks demonstrate that DualViewDistill achieves state-of-the-art performance. The results showcase the potential of foundation model BEV maps to enable more reliable perception for autonomous driving. We make the code and pre-trained models available at https://dualviewdistill.cs.uni-freiburg.de .

基于摄像头的3D目标检测与跟踪是自动驾驶感知的关键。当前最先进的方法通常只依赖透视视图(PV)或鸟瞰视图(BEV)特征之一,限制了其同时利用细粒度目标细节和空间结构化场景表示的能力。在这项工作中,我们提出了DualViewDistill,一种同时利用PV和BEV摄像头图像特征、发挥二者互补优势的混合检测与跟踪框架。我们的方法引入了由基础模型引导的BEV地图:通过一种新颖的蒸馏过程,将描述性的DINOv2特征蒸馏到BEV表示中。通过把PV特征与富含DINOv2语义和几何信息的BEV地图相结合,我们的模型借助可变形聚合利用这种混合表示,以增强3D目标检测与跟踪。在nuScenes和Argoverse 2基准上的大量实验表明,DualViewDistill达到了最先进的性能。结果展示了基础模型BEV地图在实现更可靠的自动驾驶感知方面的潜力。我们在 https://dualviewdistill.cs.uni-freiburg.de 上提供了代码和预训练模型。

论文及项目相关链接

PDF

Summary

本文提出了一种基于DualViewDistill的混合检测与跟踪框架,该框架结合了透视视图(PV)和鸟瞰视图(BEV)相机图像特征,以提高自主驾驶中的三维物体检测与跟踪性能。通过引入由基础模型引导的BEV地图,并结合透视视图特征,实现了对三维物体检测与跟踪的增强。在nuScenes和Argoverse 2基准测试中,该框架达到了先进性能。

Key Takeaways

  • 自主驾驶中,三维物体检测与跟踪至关重要。
  • 当前方法主要依赖透视视图(PV)或鸟瞰视图(BEV)特征,各有局限性。
  • DualViewDistill框架结合了PV和BEV特征,发挥两者优势。
  • 引入基础模型引导的BEV地图,通过新颖蒸馏过程获取DINOv2特征。
  • 结合PV特征与丰富的BEV地图,通过可变形聚合增强三维物体检测与跟踪。
  • 在nuScenes和Argoverse 2基准测试中,DualViewDistill达到先进性能。
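
蒸馏过程的具体设计以原文为准;下面是把 DINOv2 特征蒸馏到 BEV 表示的一个极简示意(PyTorch):假设教师特征已被投影/池化到与学生 BEV 网格对齐的张量 teacher_bev,用逐网格余弦相似度损失对齐,张量形状与 valid_mask 均为示意性假设。

```python
import torch
import torch.nn.functional as F

def bev_distill_loss(student_bev, teacher_bev, valid_mask=None):
    """BEV 特征蒸馏的示意损失:逐网格余弦相似度(非官方实现)。

    student_bev / teacher_bev: (B, C, H, W),teacher 假设已由 DINOv2 特征投影到 BEV 网格。
    valid_mask: (B, 1, H, W),可选,标记有图像特征覆盖的网格。
    """
    s = F.normalize(student_bev, dim=1)
    t = F.normalize(teacher_bev, dim=1)
    sim = (s * t).sum(dim=1, keepdim=True)          # (B, 1, H, W),取值 [-1, 1]
    loss = 1.0 - sim
    if valid_mask is not None:
        loss = loss * valid_mask
        return loss.sum() / valid_mask.sum().clamp(min=1)
    return loss.mean()

# 用法示意
loss = bev_distill_loss(torch.randn(2, 256, 128, 128), torch.randn(2, 256, 128, 128))
```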

Cool Papers

点此查看论文截图

Exploring Single Domain Generalization of LiDAR-based Semantic Segmentation under Imperfect Labels

Authors:Weitong Kong, Zichao Zeng, Di Wen, Jiale Wei, Kunyu Peng, June Moh Goo, Jan Boehm, Rainer Stiefelhagen

Accurate perception is critical for vehicle safety, with LiDAR as a key enabler in autonomous driving. To ensure robust performance across environments, sensor types, and weather conditions without costly re-annotation, domain generalization in LiDAR-based 3D semantic segmentation is essential. However, LiDAR annotations are often noisy due to sensor imperfections, occlusions, and human errors. Such noise degrades segmentation accuracy and is further amplified under domain shifts, threatening system reliability. While noisy-label learning is well-studied in images, its extension to 3D LiDAR segmentation under domain generalization remains largely unexplored, as the sparse and irregular structure of point clouds limits direct use of 2D methods. To address this gap, we introduce the novel task Domain Generalization for LiDAR Semantic Segmentation under Noisy Labels (DGLSS-NL) and establish the first benchmark by adapting three representative noisy-label learning strategies from image classification to 3D segmentation. However, we find that existing noisy-label learning approaches adapt poorly to LiDAR data. We therefore propose DuNe, a dual-view framework with strong and weak branches that enforce feature-level consistency and apply cross-entropy loss based on confidence-aware filtering of predictions. Our approach shows state-of-the-art performance by achieving 56.86% mIoU on SemanticKITTI, 42.28% on nuScenes, and 52.58% on SemanticPOSS under 10% symmetric label noise, with an overall Arithmetic Mean (AM) of 49.57% and Harmonic Mean (HM) of 48.50%, thereby demonstrating robust domain generalization in DGLSS-NL tasks. The code is available on our project page.

精确感知对车辆安全至关重要,而激光雷达是自动驾驶的关键传感器。为了在不同环境、传感器类型和天气条件下保持稳健性能且无需昂贵的重新标注,基于激光雷达的3D语义分割的域泛化能力必不可少。然而,由于传感器缺陷、遮挡和人为错误,激光雷达标注往往带有噪声。这种噪声会降低分割精度,并在域偏移下被进一步放大,威胁系统可靠性。虽然带噪标签学习在图像领域已有充分研究,但将其扩展到域泛化下的3D激光雷达分割仍基本处于空白,点云的稀疏性与不规则结构也限制了2D方法的直接套用。为填补这一空白,我们提出了新任务——噪声标签下的激光雷达语义分割域泛化(DGLSS-NL),并通过将三种有代表性的带噪标签学习策略从图像分类迁移到3D分割,建立了第一个基准。然而,我们发现现有的带噪标签学习方法难以适配激光雷达数据。因此,我们提出了DuNe,一个包含强、弱两个分支的双视图框架,它强制实施特征级一致性,并在经置信度感知过滤后的预测上施加交叉熵损失。我们的方法达到了最先进的性能:在10%对称标签噪声下,SemanticKITTI上取得56.86% mIoU,nuScenes上42.28%,SemanticPOSS上52.58%,整体算术平均(AM)为49.57%,调和平均(HM)为48.50%,从而在DGLSS-NL任务中展现了稳健的域泛化能力。代码可在我们的项目页面获取。

论文及项目相关链接

PDF

Summary

本文指出LiDAR感知对自动驾驶安全的重要性,并针对LiDAR标注中的噪声问题,提出了噪声标签下的LiDAR语义分割域泛化新任务(DGLSS-NL)。由于现有带噪标签学习方法对LiDAR数据的适配性较差,作者提出了双视图框架DuNe,通过强、弱分支实现特征级一致性,并在经置信度过滤的预测上应用交叉熵损失。该方法在多个数据集上取得了最先进的性能。

Key Takeaways

  • LiDAR在自动驾驶车辆安全中起到关键作用,但环境、传感器类型和天气条件的差异要求传感器具有域泛化能力。
  • LiDAR标注存在噪声问题,影响分割精度,且在域变化下问题加剧。
  • 现有噪声标签学习方法在LiDAR数据上的适应性较差。
  • 提出了DGLSS-NL任务,并建立了第一个基准测试。
  • 提出了DuNe框架,通过强弱分支实现特征一致性,并应用置信度过滤的预测交叉熵损失。
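
DuNe 的强弱双分支与基于置信度过滤的交叉熵在摘要中只给出思路;下面是一个极简示意(PyTorch):弱分支产生伪标签,仅保留置信度高于阈值的点参与强分支的交叉熵,并附一个特征级一致性项,阈值与函数名均为假设。

```python
import torch
import torch.nn.functional as F

def confidence_filtered_ce(strong_logits, weak_logits, conf_thr=0.9):
    """基于置信度过滤的交叉熵(示意):弱分支给出伪标签,强分支只在高置信点上学习。

    strong_logits / weak_logits: (N, C),N 为点数,C 为类别数。
    """
    with torch.no_grad():
        weak_probs = F.softmax(weak_logits, dim=-1)
        conf, pseudo = weak_probs.max(dim=-1)          # (N,), (N,)
        keep = conf >= conf_thr
    if keep.sum() == 0:
        return strong_logits.new_zeros(())
    return F.cross_entropy(strong_logits[keep], pseudo[keep])

def feature_consistency(strong_feat, weak_feat):
    """强弱分支特征级一致性(示意):平均余弦距离。feat: (N, D)。"""
    return (1 - F.cosine_similarity(strong_feat, weak_feat, dim=-1)).mean()
```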

Cool Papers

点此查看论文截图

FOLK: Fast Open-Vocabulary 3D Instance Segmentation via Label-guided Knowledge Distillation

Authors:Hongrui Wu, Zhicheng Gao, Jin Cao, Kelu Yao, Wen Shen, Zhihua Wei

Open-vocabulary 3D instance segmentation seeks to segment and classify instances beyond the annotated label space. Existing methods typically map 3D instances to 2D RGB-D images, and then employ vision-language models (VLMs) for classification. However, such a mapping strategy usually introduces noise from 2D occlusions and incurs substantial computational and memory costs during inference, slowing down the inference speed. To address the above problems, we propose a Fast Open-vocabulary 3D instance segmentation method via Label-guided Knowledge distillation (FOLK). Our core idea is to design a teacher model that extracts high-quality instance embeddings and distills its open-vocabulary knowledge into a 3D student model. In this way, during inference, the distilled 3D model can directly classify instances from the 3D point cloud, avoiding noise caused by occlusions and significantly accelerating the inference process. Specifically, we first design a teacher model to generate a 2D CLIP embedding for each 3D instance, incorporating both visibility and viewpoint diversity, which serves as the learning target for distillation. We then develop a 3D student model that directly produces a 3D embedding for each 3D instance. During training, we propose a label-guided distillation algorithm to distill open-vocabulary knowledge from label-consistent 2D embeddings into the student model. FOLK conducted experiments on the ScanNet200 and Replica datasets, achieving state-of-the-art performance on the ScanNet200 dataset with an AP50 score of 35.7, while running approximately 6.0x to 152.2x faster than previous methods. All codes will be released after the paper is accepted.

开放词汇表下的三维实例分割旨在分割并分类标注标签空间之外的实例。现有方法通常将三维实例映射到二维RGB-D图像上,然后采用视觉语言模型(VLMs)进行分类。然而,这种映射策略通常会引入来自二维遮挡的噪声,并在推理过程中产生巨大的计算和内存成本,从而降低了推理速度。为了解决上述问题,我们提出了一种基于标签引导知识蒸馏的快速开放词汇表三维实例分割方法(FOLK)。我们的核心思想是设计一种教师模型,提取高质量实例嵌入,并将其开放词汇表知识蒸馏到三维学生模型中。通过这种方式,在推理过程中,蒸馏后的三维模型可以直接对三维点云中的实例进行分类,避免了遮挡引起的噪声,并显著加速了推理过程。具体来说,我们首先设计了一个教师模型,为每个三维实例生成一个二维CLIP嵌入,同时考虑可见性和视点多样性,作为蒸馏的学习目标。然后我们开发了一个三维学生模型,该模型直接为每个三维实例生成一个三维嵌入。在训练过程中,我们提出了一种标签引导的蒸馏算法,将标签一致性的二维嵌入中的开放词汇表知识蒸馏到学生模型中。FOLK在ScanNet200和Replica数据集上进行了实验,在ScanNet200数据集上取得了最先进的性能,AP50得分为35.7,同时运行速度比以前的方法快大约6.0倍到152.2倍。论文被接受后将发布所有代码。

论文及项目相关链接

PDF

摘要

本文提出了一种基于标签引导知识蒸馏的快速开放词汇表三维实例分割方法(FOLK)。该方法旨在解决现有开放词汇表三维实例分割方法在计算效率和识别精度方面的问题。通过建立教师模型,提取高质量实例嵌入,并蒸馏其开放词汇表知识到三维学生模型中,使得在推理过程中,蒸馏后的三维模型可以直接对三维点云中的实例进行分类,避免了二维遮挡引起的噪声,并显著加速了推理过程。实验结果表明,FOLK方法在ScanNet200数据集上达到了先进性能,同时显著提高了计算效率。

关键见解

  1. 提出了基于标签引导知识蒸馏的开放词汇表三维实例分割方法(FOLK)。
  2. 通过建立教师模型提取高质量实例嵌入,解决了现有方法在计算效率和识别精度方面的问题。
  3. 蒸馏后的三维模型可以直接对三维点云中的实例进行分类,避免了二维遮挡引起的噪声。
  4. FOLK方法显著加速了推理过程。
  5. 在ScanNet200数据集上达到了先进性能,AP50分数为35.7。
  6. 与之前的方法相比,FOLK方法的计算效率更高,运行速度提高了大约6.0至152.2倍。
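
标签引导蒸馏的具体算法以论文为准;这里给出一个极简示意(PyTorch):仅当教师侧 2D 嵌入的(伪)标签与学生侧实例的伪标签一致时,才用余弦损失把 3D 学生嵌入拉向 2D CLIP 教师嵌入。函数名与张量形状均为假设。

```python
import torch
import torch.nn.functional as F

def label_guided_distill_loss(student_emb, teacher_emb, student_labels, teacher_labels):
    """标签引导的知识蒸馏损失(示意,非官方)。

    student_emb: (N, D) 3D 学生模型为每个实例产生的嵌入
    teacher_emb: (N, D) 教师模型为对应实例生成的 2D CLIP 嵌入
    *_labels:    (N,)   两侧各自的(伪)类别标签,只有一致的实例参与蒸馏
    """
    consistent = student_labels == teacher_labels
    if consistent.sum() == 0:
        return student_emb.new_zeros(())
    s = F.normalize(student_emb[consistent], dim=-1)
    t = F.normalize(teacher_emb[consistent], dim=-1)
    return (1 - (s * t).sum(dim=-1)).mean()
```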

Cool Papers

点此查看论文截图

ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving

Authors:Yongxuan Lyu, Guangfeng Jiang, Hongsi Liu, Jun Liu

The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes by a margin of 2.53% in mAP (50.95% vs. 48.42%).

室外激光雷达点云实例分割的人工标注成本极高且耗时。当前的方法试图减少这一负担,但仍依赖某种形式的人工标注。为了完全消除这种依赖,我们引入了ALISE这一全新框架,该框架可在无需任何标注的情况下执行激光雷达实例分割。核心挑战在于以完全无监督的方式生成高质量伪标签。我们的方法首先使用文本和图像引导的视觉基础模型(VFMs)生成初始伪标签;然后通过专用的时空投票模块对这些标签进行精炼,该模块结合二维和三维语义,用于离线和在线优化。为了实现更好的特征学习,我们还引入了两种形式的语义监督:一组基于二维先验的损失,将视觉知识注入三维网络;以及一种新颖的基于原型的对比损失,通过利用三维语义一致性构建判别性特征空间。这一全面的设计带来了显著的性能提升,为无监督三维实例分割确立了新的最先进水平。值得注意的是,我们的方法甚至超越了使用真实(GT)二维边界框进行监督的MWSIS方法,在mAP上高出2.53%(50.95%对比48.42%)。

论文及项目相关链接

PDF

Summary
室外激光雷达点云实例分割的人工标注耗时且成本高昂。研究者们一直在尝试降低这种负担,但大部分方法仍依赖某种形式的人工标注。本文提出一种名为ALISE的新框架,它无需任何标注即可进行激光雷达实例分割。该框架通过视觉基础模型生成高质量伪标签,并通过时空投票模块进行精细化处理,结合二维和三维语义进行离线和在线优化。通过引入两种语义监督方式,该框架实现了卓越的特征学习能力,甚至在mAP上超越了使用真实2D边界框监督的MWSIS方法。

Key Takeaways

  1. 激光雷达点云实例手动标注成本高昂且耗时。
  2. 当前方法虽尝试减少人工标注的依赖,但仍需某种形式的标注。
  3. ALISE框架无需任何标注即可进行激光雷达实例分割。
  4. ALISE框架通过视觉基础模型生成伪标签,并采用时空投票模块进行精细化处理。
  5. 该框架结合二维和三维语义进行在线和离线优化,实现卓越的特征学习能力。
  6. 引入两种语义监督方式,提升特征学习效果。
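
基于原型的对比损失在摘要中只描述了思想;下面是一个示意(PyTorch):用伪标签对点特征做类均值得到原型,再以 InfoNCE 形式把每个点的特征拉向自己类别的原型、推离其他类别原型。温度 tau 等超参数为假设值。

```python
import torch
import torch.nn.functional as F

def prototype_contrastive_loss(feats, pseudo_labels, num_classes, tau=0.1):
    """基于原型的对比损失(示意,非官方)。

    feats: (N, D) 点特征;pseudo_labels: (N,) 伪标签,取值 [0, num_classes)。
    """
    feats = F.normalize(feats, dim=-1)
    protos = []
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.sum() == 0:
            protos.append(feats.new_zeros(feats.size(1)))   # 空类别用零向量占位
        else:
            protos.append(F.normalize(feats[mask].mean(0), dim=-1))
    protos = torch.stack(protos)                 # (C, D)
    logits = feats @ protos.t() / tau            # (N, C)
    return F.cross_entropy(logits, pseudo_labels)
```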

Cool Papers

点此查看论文截图

Evaluating the Impact of Radiographic Noise on Chest X-ray Semantic Segmentation and Disease Classification Using a Scalable Noise Injection Framework

Authors:Derek Jiu, Kiran Nijjer, Nishant Chinta, Ryan Bui, Kevin Zhu

Deep learning models are increasingly used for radiographic analysis, but their reliability is challenged by the stochastic noise inherent in clinical imaging. A systematic, cross-task understanding of how different noise types impact these models is lacking. Here, we evaluate the robustness of state-of-the-art convolutional neural networks (CNNs) to simulated quantum (Poisson) and electronic (Gaussian) noise in two key chest X-ray tasks: semantic segmentation and pulmonary disease classification. Using a novel, scalable noise injection framework, we applied controlled, clinically-motivated noise severities to common architectures (UNet, DeepLabV3, FPN; ResNet, DenseNet, EfficientNet) on public datasets (Landmark, ChestX-ray14). Our results reveal a stark dichotomy in task robustness. Semantic segmentation models proved highly vulnerable, with lung segmentation performance collapsing under severe electronic noise (Dice Similarity Coefficient drop of 0.843), signifying a near-total model failure. In contrast, classification tasks demonstrated greater overall resilience, but this robustness was not uniform. We discovered a differential vulnerability: certain tasks, such as distinguishing Pneumothorax from Atelectasis, failed catastrophically under quantum noise (AUROC drop of 0.355), while others were more susceptible to electronic noise. These findings demonstrate that while classification models possess a degree of inherent robustness, pixel-level segmentation tasks are far more brittle. The task- and noise-specific nature of model failure underscores the critical need for targeted validation and mitigation strategies before the safe clinical deployment of diagnostic AI.

深度学习模型在放射学分析中的应用越来越广泛,但临床成像中固有的随机噪声对其可靠性构成挑战。目前仍缺乏对不同噪声类型如何影响这些模型的系统性、跨任务的认识。在此,我们评估了最先进的卷积神经网络(CNN)在两类关键胸部X光任务(语义分割和肺部疾病分类)中对模拟量子(泊松)噪声和电子(高斯)噪声的稳健性。借助一个新颖、可扩展的噪声注入框架,我们在公共数据集(Landmark、ChestX-ray14)上,对常见架构(UNet、DeepLabV3、FPN;ResNet、DenseNet、EfficientNet)施加了受控的、具有临床依据的不同噪声严重程度。我们的结果揭示了任务稳健性上的鲜明反差:语义分割模型非常脆弱,在严重电子噪声下肺部分割性能崩溃(Dice相似系数下降0.843),意味着模型近乎完全失效;相比之下,分类任务总体上更具韧性,但这种稳健性并不均匀。我们发现了差异性脆弱:某些任务(如区分气胸与肺不张)在量子噪声下灾难性失败(AUROC下降0.355),而另一些任务则更易受电子噪声影响。这些发现表明,分类模型虽具有一定的固有稳健性,但像素级分割任务要脆弱得多。模型失效呈现出任务和噪声类型特异性,这凸显了在诊断型人工智能安全临床部署之前进行针对性验证并制定缓解策略的迫切需求。

论文及项目相关链接

PDF Accepted to ARRS 2026 Annual Meeting

Summary:深度学习模型在放射学分析中的应用日益广泛,但在临床图像中存在的随机噪声对其可靠性提出了挑战。本研究评估了最先进的卷积神经网络(CNN)对模拟量子(泊松)和电子(高斯)噪声的鲁棒性,涉及两个关键胸部X射线任务:语义分割和肺部疾病分类。通过使用新型的可扩展噪声注入框架,在公共数据集上应用了受控的临床噪声严重程度。结果显示任务鲁棒性存在明显差异。语义分割模型高度脆弱,在严重电子噪声下肺部分割性能大幅下降(Dice相似系数下降0.843),几乎导致模型完全失效。分类任务则表现出更大的整体稳健性,但这种稳健性并不统一。研究发现不同任务的脆弱性存在差异,某些任务如区分气胸与肺不张在量子噪声下失效(AUROC下降0.355),而其他任务则更易受电子噪声影响。这些发现表明,虽然分类模型具有一定的固有稳健性,但像素级分割任务更为脆弱。模型失败的任务和噪声特定性质强调了在部署诊断人工智能之前需要有针对性的验证和缓解策略。

Key Takeaways

  1. 深度学习模型在临床图像分析中的应用受到随机噪声的影响。
  2. 本研究评估了卷积神经网络对量子噪声和电子噪声的鲁棒性。
  3. 语义分割模型对电子噪声高度脆弱,肺部分割性能大幅下降。
  4. 分类任务表现出更大的整体稳健性,但不同任务间的脆弱性存在差异。
  5. 某些分类任务在量子噪声下会失效,而其他任务更易受电子噪声影响。
  6. 分类模型具有一定的固有稳健性,但像素级分割任务更为脆弱。
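
论文的噪声注入框架是可扩展且受临床启发的;这里给出模拟量子(泊松)与电子(高斯)噪声的极简示意(NumPy),其中光子数 photons 与标准差 sigma 的具体标定为假设,仅用于说明两类噪声的注入方式。

```python
import numpy as np

def add_quantum_noise(img, photons=1000):
    """量子(泊松)噪声示意:photons 越小,噪声越强。img 取值 [0, 1]。"""
    img = np.clip(img, 0.0, 1.0)
    return np.clip(np.random.poisson(img * photons) / photons, 0.0, 1.0)

def add_electronic_noise(img, sigma=0.05):
    """电子(高斯)噪声示意:sigma 为相对 [0, 1] 灰度的标准差。"""
    return np.clip(img + np.random.normal(0.0, sigma, img.shape), 0.0, 1.0)

# 用法示意:对一张胸片按不同严重程度注入两类噪声
xray = np.random.rand(256, 256)
noisy_q = add_quantum_noise(xray, photons=200)     # 较严重的量子噪声
noisy_e = add_electronic_noise(xray, sigma=0.1)    # 较严重的电子噪声
```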

Cool Papers

点此查看论文截图

GLSim: Detecting Object Hallucinations in LVLMs via Global-Local Similarity

Authors:Seongheon Park, Sharon Li

Object hallucination in large vision-language models presents a significant challenge to their safe deployment in real-world applications. Recent works have proposed object-level hallucination scores to estimate the likelihood of object hallucination; however, these methods typically adopt either a global or local perspective in isolation, which may limit detection reliability. In this paper, we introduce GLSim, a novel training-free object hallucination detection framework that leverages complementary global and local embedding similarity signals between image and text modalities, enabling more accurate and reliable hallucination detection in diverse scenarios. We comprehensively benchmark existing object hallucination detection methods and demonstrate that GLSim achieves superior detection performance, outperforming competitive baselines by a significant margin.

大型视觉语言模型中的对象幻觉对其在现实应用中的安全部署构成重大挑战。近期工作提出了对象级幻觉分数来估计对象幻觉发生的可能性;然而,这些方法通常孤立地采用全局或局部视角,可能限制检测的可靠性。在本文中,我们提出GLSim,一种新颖的免训练对象幻觉检测框架,它利用图像与文本模态之间互补的全局和局部嵌入相似度信号,从而在多样化场景中实现更准确、更可靠的幻觉检测。我们对现有的对象幻觉检测方法进行了全面的基准测试,结果表明GLSim取得了更优的检测性能,以显著优势超过了有竞争力的基线方法。

论文及项目相关链接

PDF NeurIPS 2025

Summary

本文指出对象幻觉是大型视觉语言模型安全部署面临的重要挑战。现有方法通过对象级幻觉分数估计幻觉发生的可能性,但往往只孤立地采用全局或局部单一视角,可能影响检测可靠性。本文提出一种全新的免训练对象幻觉检测框架GLSim,利用图像和文本模态之间互补的全局与局部嵌入相似度信号,提高不同场景下幻觉检测的准确性和可靠性。经过对现有对象幻觉检测方法的全面评估,GLSim表现出优越的检测性能,显著超越了其他基线方法。

Key Takeaways

  • 对象幻觉是大型视觉语言模型在现实应用中的一大挑战。
  • 现有方法主要通过全局或局部单一视角检测对象幻觉,可能限制检测可靠性。
  • GLSim是一种全新的免训练对象幻觉检测框架,结合全局和局部嵌入相似度信号,提高检测准确性。
  • GLSim对现有对象幻觉检测方法进行了全面评估,并展现出优越的检测性能。
  • GLSim在多种场景下都能有效检测对象幻觉,性能远超其他基线方法。
  • 该框架对于确保视觉语言模型在真实环境中的安全部署具有重要意义。
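
GLSim 融合全局与局部相似度信号的具体公式见原文;这里给出一个假设性的示意:全局分数取整图嵌入与对象文本嵌入的余弦相似度,局部分数取各图像块嵌入相似度的最大值,再加权组合作为打分(分数越低越可能是幻觉)。权重 w_global 与形状均为假设。

```python
import torch
import torch.nn.functional as F

def glsim_score(global_img_emb, patch_embs, object_text_emb, w_global=0.5):
    """全局-局部相似度融合的示意打分(非官方公式)。

    global_img_emb: (D,)   整图嵌入
    patch_embs:     (P, D) 图像块(局部)嵌入
    object_text_emb:(D,)   被检查对象的文本嵌入
    返回值越低,越可能是幻觉对象。
    """
    g = F.cosine_similarity(global_img_emb, object_text_emb, dim=0)
    l = F.cosine_similarity(patch_embs, object_text_emb.unsqueeze(0), dim=-1).max()
    return w_global * g + (1 - w_global) * l

# 用法示意:分数低于某个阈值(需在验证集上标定)即判为幻觉
score = glsim_score(torch.randn(512), torch.randn(196, 512), torch.randn(512))
```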

Cool Papers

点此查看论文截图

Contour Errors: An Ego-Centric Metric for Reliable 3D Multi-Object Tracking

Authors:Sharang Kaul, Mario Berk, Thiemo Gerbich, Abhinav Valada

Finding reliable matches is essential in multi-object tracking to ensure the accuracy and reliability of perception systems in safety-critical applications such as autonomous vehicles. Effective matching mitigates perception errors, enhancing object identification and tracking for improved performance and safety. However, traditional metrics such as Intersection over Union (IoU) and Center Point Distances (CPDs), which are effective in 2D image planes, often fail to find critical matches in complex 3D scenes. To address this limitation, we introduce Contour Errors (CEs), an ego or object-centric metric for identifying matches of interest in tracking scenarios from a functional perspective. By comparing bounding boxes in the ego vehicle’s frame, contour errors provide a more functionally relevant assessment of object matches. Extensive experiments on the nuScenes dataset demonstrate that contour errors improve the reliability of matches over the state-of-the-art 2D IoU and CPD metrics in tracking-by-detection methods. In 3D car tracking, our results show that Contour Errors reduce functional failures (FPs/FNs) by 80% at close ranges and 60% at far ranges compared to IoU in the evaluation stage.

在多目标跟踪中,找到可靠的匹配对确保自动驾驶等安全关键应用中感知系统的准确性和可靠性至关重要。有效的匹配可以减少感知错误,改进目标识别与跟踪,从而提升性能与安全性。然而,交并比(IoU)和中心点距离(CPD)等传统度量虽然在2D图像平面上有效,却往往无法在复杂的3D场景中找到关键匹配。为了解决这一局限,我们引入了轮廓误差(CE),一种以自车(ego)或目标为中心的度量,从功能角度识别跟踪场景中值得关注的匹配。通过在自车坐标系下比较边界框,轮廓误差为目标匹配提供了更贴近功能需求的评估。在nuScenes数据集上的大量实验表明,在基于检测的跟踪(tracking-by-detection)方法中,轮廓误差相比最先进的2D IoU和CPD度量提高了匹配的可靠性。在3D车辆跟踪中,我们的结果显示,在评估阶段与IoU相比,轮廓误差在近距离将功能性失效(FP/FN)减少了80%,在远距离减少了60%。

论文及项目相关链接

PDF

Summary

该文介绍了多目标跟踪中可靠匹配的重要性,并指出传统度量标准如IoU和CPD在复杂3D场景中难以找到关键匹配的问题。为解决此问题,引入了轮廓误差(CEs)这一以自我或对象为中心的度量标准,从功能角度评估跟踪场景中的感兴趣匹配。轮廓误差通过比较以自我车辆为框架的边界框,提供了对目标匹配的更功能相关的评估。实验证明,轮廓误差在跟踪检测方法和三维车辆跟踪中提高了匹配的可靠性,减少了功能故障。

Key Takeaways

  1. 多目标跟踪中的可靠匹配对于确保安全关键应用(如自动驾驶车辆)的感知系统准确性和可靠性至关重要。
  2. 传统度量标准如IoU和CPD在复杂3D场景中难以找到关键匹配。
  3. 轮廓误差(CEs)是一种新型的匹配度量标准,通过比较以自我车辆为框架的边界框来评估目标匹配。
  4. 轮廓误差从功能角度评估匹配,提供更功能相关的评估结果。
  5. 在跟踪检测方法中,轮廓误差提高了匹配的可靠性,优于现有的IoU和CPD度量标准。
  6. 在三维车辆跟踪中,轮廓误差显著减少了功能故障(FPs/FNs)。
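
轮廓误差(CE)的精确定义以论文为准;下面给出一个假设性的解读示意:在自车坐标系下比较两个 BEV 边界框的四个角点,取对应角点距离的均值作为匹配代价。角点的顺序对齐方式是简化假设,仅用于说明"以自车坐标系比较框轮廓"的思路。

```python
import numpy as np

def bev_corners(x, y, length, width, yaw):
    """在自车坐标系下计算框的 BEV 四角点,返回 (4, 2)。"""
    c, s = np.cos(yaw), np.sin(yaw)
    dx, dy = length / 2.0, width / 2.0
    local = np.array([[dx, dy], [dx, -dy], [-dx, -dy], [-dx, dy]])
    rot = np.array([[c, -s], [s, c]])
    return local @ rot.T + np.array([x, y])

def contour_error(box_a, box_b):
    """假设性的轮廓误差:对应角点欧氏距离的均值。box = (x, y, length, width, yaw)。"""
    ca, cb = bev_corners(*box_a), bev_corners(*box_b)
    return float(np.linalg.norm(ca - cb, axis=1).mean())

# 用法示意:误差越小,匹配越可靠(可替代 IoU / 中心点距离作为匹配代价)
err = contour_error((10.0, 2.0, 4.5, 1.8, 0.1), (10.3, 2.1, 4.6, 1.8, 0.05))
```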

Cool Papers

点此查看论文截图

Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control

Authors:Danfeng li, Hui Zhang, Sheng Wang, Jiacheng Li, Zuxuan Wu

Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity’s image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.

尽管扩散模型近来取得了进展,顶级的文本到图像(T2I)模型仍难以实现精确的空间布局控制,即按照指定属性和位置准确生成实体。分割掩膜到图像(S2I)生成通过引入像素级空间引导和区域文本提示,成为一种有前景的解决方案。然而,现有S2I方法无法同时保证语义一致性和形状一致性。为了解决这些挑战,我们提出了Seg2Any,一个基于先进多模态扩散Transformer(如FLUX)构建的新型S2I框架。首先,为了同时实现语义与形状一致性,我们将分割掩膜条件解耦为区域语义和高频形状两个分量:区域语义条件通过语义对齐注意力掩膜引入,确保生成的实体遵循各自的文本提示;代表实体边界的高频形状条件被编码为实体轮廓图,并作为额外模态经由多模态注意力引入,以引导图像的空间结构。其次,为防止多实体场景中属性在实体之间泄漏,我们引入属性隔离注意力掩膜机制,约束每个实体的图像token在图像自注意力中只关注自身。为支持开放集S2I生成,我们构建了SACap-1M,一个包含100万张图像、590万个分割实体及详细区域描述的大规模数据集,并提供了用于全面S2I评估的SACap-Eval基准。大量实验表明,Seg2Any在开放集和封闭集S2I基准上均达到最先进的性能,尤其是在实体的细粒度空间与属性控制方面。

论文及项目相关链接

PDF

Summary

本文提出了一种新的分割掩膜到图像(S2I)框架Seg2Any,用于解决文本到图像生成中精确空间布局控制的问题。Seg2Any利用先进的多模态扩散变压器(如FLUX),通过解耦分割掩膜条件实现语义和形状的一致性。同时,引入语义对齐注意力掩膜和实体轮廓图,分别负责区域语义和高频形状条件。为预防多实体场景中的属性泄漏,采用了属性隔离注意力掩膜机制。Seg2Any在开放集和封闭集的S2I基准测试中均取得最佳性能,尤其擅长实体细粒度空间和属性的控制。

Key Takeaways

  1. Seg2Any是一个新兴的S2I框架,旨在解决T2I模型在精确空间布局控制方面的挑战。
  2. Seg2Any利用多模态扩散变压器(如FLUX)实现语义和形状的一致性。
  3. 语义对齐注意力掩膜和实体轮廓图被引入以处理区域语义和高频形状条件。
  4. 属性隔离注意力掩膜机制防止了多实体场景中的属性泄漏。
  5. Seg2Any构建了SACap-1M数据集,用于支持开放集S2I生成,并提供了SACap-Eval基准测试。
  6. 实验表明,Seg2Any在S2I基准测试中表现最佳,特别是在实体细粒度空间和属性的控制方面。
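
属性隔离注意力掩膜的思想是:自注意力中每个实体的图像 token 只关注属于同一实体的 token;下面是构造这种掩膜的极简示意(PyTorch),entity_ids 为每个图像 token 所属实体的编号(此处假设背景统一记 0、背景 token 互相可见),掩膜转成加性偏置的方式也是常见做法而非论文细节。

```python
import torch

def attribute_isolation_mask(entity_ids):
    """构造属性隔离注意力掩膜(示意):同一实体内部可见,跨实体不可见。

    entity_ids: (N,) 每个图像 token 的实体编号(背景记 0)。
    返回 (N, N) 的 bool 掩膜,True 表示允许注意。
    """
    return entity_ids.unsqueeze(0) == entity_ids.unsqueeze(1)   # (N, N)

# 用法示意:转为加性偏置后加到缩放点积注意力的打分上
ids = torch.tensor([1, 1, 2, 2, 2, 0, 0])
mask = attribute_isolation_mask(ids)
bias = torch.zeros_like(mask, dtype=torch.float).masked_fill(~mask, float("-inf"))
# attn = softmax(q @ k.T / sqrt(d) + bias) —— 跨实体的注意力被屏蔽
```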

Cool Papers

点此查看论文截图

TMT: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation

Authors:Enming Zhang, Zhengyu Li, Yanru Wu, Jingge Wang, Yang Tan, Guan Wang, Yang Li, Xiaoping Zhang

Recent advances in Vision Transformers (ViTs) have significantly advanced semantic segmentation performance. However, their adaptation to new target domains remains challenged by distribution shifts, which often disrupt global attention mechanisms. While existing global and patch-level adaptation methods offer some improvements, they overlook the spatially varying transferability inherent in different image regions. To address this, we propose the Transferable Mask Transformer (TMT), a region-adaptive framework designed to enhance cross-domain representation learning through transferability guidance. First, we dynamically partition the image into coherent regions, grouped by structural and semantic similarity, and estimates their domain transferability at a localized level. Then, we incorporate region-level transferability maps directly into the self-attention mechanism of ViTs, allowing the model to adaptively focus attention on areas with lower transferability and higher semantic uncertainty. Extensive experiments across 20 diverse cross-domain settings demonstrate that TMT not only mitigates the performance degradation typically associated with domain shift but also consistently outperforms existing approaches.

近期在视觉Transformer(ViTs)方面的进展已经显著提高了语义分割的性能。然而,它们在适应新目标域时仍然面临分布转移的难题,这通常会破坏全局注意力机制。虽然现有的全局和补丁级自适应方法提供了一些改进,但它们忽略了不同图像区域固有的空间可转移性。为了解决这一问题,我们提出了可转移掩膜Transformer(TMT),这是一种区域自适应框架,旨在通过可转移性指导增强跨域表示学习。首先,我们根据结构性和语义相似性将图像动态划分为连贯的区域,并对这些区域进行分组,然后在局部级别估计它们的域可转移性。接着,我们将区域级可转移性图直接融入ViT的自注意力机制中,使模型能够自适应地关注可转移性较低、语义不确定性较高的区域。在20个不同的跨域设置上进行的广泛实验表明,TMT不仅减轻了与域偏移相关的性能下降问题,而且始终优于现有方法。

论文及项目相关链接

PDF

Summary

ViT的最新进展极大地提高了语义分割性能,但在适配新目标域时仍面临分布偏移的挑战,这往往会破坏全局注意力机制。现有的全局和补丁级适配方法虽有所改进,但忽略了不同图像区域固有的、随空间变化的可转移性。为解决此问题,我们提出了可转移掩膜Transformer(TMT),一种区域自适应框架,旨在通过可转移性指导增强跨域表示学习。首先,我们根据结构和语义相似性将图像动态划分为连贯区域,并在局部层面估计其域可转移性。然后,我们将区域级可转移性图直接融入ViT的自注意力机制,使模型能够自适应地将注意力集中在可转移性较低、语义不确定性较高的区域。在20种不同跨域设置上的大量实验表明,TMT不仅缓解了通常伴随域偏移而来的性能下降,而且始终优于现有方法。

Key Takeaways

  1. Vision Transformers (ViTs) 在语义分割性能上取得了显著进展。
  2. 分布转移在ViT的目标域适应中仍然是一个挑战,可能影响全局注意力机制。
  3. 现有适应方法虽有所提升,但忽略了不同图像区域的空间可转移性。
  4. 提出了可转移掩膜转换器(TMT)框架,通过区域自适应增强跨域表示学习。
  5. TMT通过动态划分图像区域并估算其局部域可转移性来解决现有问题。
  6. TMT将区域级可转移性图融入ViT的自注意力机制中,使模型能自适应关注关键区域。
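
将区域级可转移性图融入自注意力的具体方式以论文为准;这里给出一个假设性的示意:把每个 token 所在区域的(低)可转移性转换为注意力打分上的加性偏置,使模型更多关注可转移性低、语义不确定性高的区域。偏置形式与系数 alpha 均为假设。

```python
import torch
import torch.nn.functional as F

def transferability_biased_attention(q, k, v, token_transferability, alpha=1.0):
    """可转移性引导的自注意力(示意,非官方)。

    q, k, v: (N, D);token_transferability: (N,),取值 [0, 1],越低表示越难迁移。
    偏置 = alpha * (1 - 可转移性),加在被关注 token(key)一侧。
    """
    d = q.size(-1)
    scores = q @ k.t() / d ** 0.5                      # (N, N)
    bias = alpha * (1.0 - token_transferability)       # (N,)
    attn = F.softmax(scores + bias.unsqueeze(0), dim=-1)
    return attn @ v

# 用法示意
out = transferability_biased_attention(torch.randn(16, 64), torch.randn(16, 64),
                                        torch.randn(16, 64), torch.rand(16))
```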

Cool Papers

点此查看论文截图

CQ-DINO: Mitigating Gradient Dilution via Category Queries for Vast Vocabulary Object Detection

Authors:Zhichao Sun, Huazhang Hu, Yidong Ma, Gang Liu, Yibo Chen, Xu Tang, Yao Hu, Yongchao Xu

With the exponential growth of data, traditional object detection methods are increasingly struggling to handle vast vocabulary object detection tasks effectively. We analyze two key limitations of classification-based detectors: positive gradient dilution, where rare positive categories receive insufficient learning signals, and hard negative gradient dilution, where discriminative gradients are overwhelmed by numerous easy negatives. To address these challenges, we propose CQ-DINO, a category query-based object detection framework that reformulates classification as a contrastive task between object queries and learnable category queries. Our method introduces image-guided query selection, which reduces the negative space by adaptively retrieving top-K relevant categories per image via cross-attention, thereby rebalancing gradient distributions and facilitating implicit hard example mining. Furthermore, CQ-DINO flexibly integrates explicit hierarchical category relationships in structured datasets (e.g., V3Det) or learns implicit category correlations via self-attention in generic datasets (e.g., COCO). Experiments demonstrate that CQ-DINO achieves superior performance on the challenging V3Det benchmark (surpassing previous methods by 2.1% AP) while maintaining competitiveness in COCO. Our work provides a scalable solution for real-world detection systems requiring wide category coverage. The code is publicly at https://github.com/FireRedTeam/CQ-DINO.

随着数据的指数级增长,传统目标检测方法越来越难以有效处理超大词汇量的目标检测任务。我们分析了基于分类的检测器的两个关键局限:正样本梯度稀释,即稀有的正类别获得的学习信号不足;以及困难负样本梯度稀释,即判别性梯度被大量简单负样本淹没。为应对这些挑战,我们提出了CQ-DINO,一个基于类别查询的目标检测框架,它把分类重新表述为目标查询与可学习类别查询之间的对比任务。我们的方法引入图像引导的查询选择,通过交叉注意力为每张图像自适应地检索前K个相关类别,从而缩小负样本空间、重新平衡梯度分布,并促进隐式的困难样本挖掘。此外,CQ-DINO既可以在结构化数据集(如V3Det)中灵活融入显式的层次类别关系,也可以在通用数据集(如COCO)中通过自注意力学习隐式的类别关联。实验表明,CQ-DINO在具有挑战性的V3Det基准上取得了更优的性能(较此前方法提升2.1% AP),同时在COCO上保持竞争力。我们的工作为需要广泛类别覆盖的真实世界检测系统提供了可扩展的解决方案。代码公开在 https://github.com/FireRedTeam/CQ-DINO。

论文及项目相关链接

PDF Accepted at NeurIPS 2025

Summary

随着数据量的指数级增长,传统目标检测方法在应对大规模词汇目标检测任务时面临挑战。本文分析了基于分类的检测器的两个主要局限性:正梯度稀释和硬负梯度稀释。为此,我们提出了CQ-DINO,一种基于类别查询的目标检测框架,将分类重新构建为对象查询和可学习类别查询之间的对比任务。通过图像引导查询选择,我们的方法减少了负空间,通过跨注意力自适应地检索每张图像的前K个相关类别,从而平衡梯度分布并促进隐式硬样本挖掘。此外,CQ-DINO灵活整合了结构化数据集(如V3Det)中的显式层次类别关系,或在通用数据集(如COCO)中学习通过自注意力实现的隐式类别关联。实验表明,CQ-DINO在具有挑战性的V3Det基准测试中取得了优于其他方法(提高2.1%的AP)的性能,同时在COCO中保持竞争力。我们的研究为需要广泛类别覆盖的真实世界检测系统提供了可扩展的解决方案。

Key Takeaways

  1. 传统目标检测方法在应对大规模数据挑战时存在困难。
  2. 分析了基于分类的检测器的两个主要局限性:正梯度稀释和硬负梯度稀释。
  3. 提出了CQ-DINO框架,将分类转化为对比任务,提高目标检测的准确性。
  4. 通过图像引导查询选择减少负空间,平衡梯度分布,促进隐式硬样本挖掘。
  5. CQ-DINO能够灵活适应不同的数据集,整合层次类别关系或学习隐式类别关联。
  6. 在V3Det基准测试中,CQ-DINO性能优越,较之前的方法提高了2.1%的AP。
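
图像引导的类别查询选择可以这样示意(PyTorch):用图像全局特征与全部可学习类别查询计算相似度,仅保留 Top-K 个相关类别参与后续"对象查询 vs 类别查询"的对比分类,从而缩小负样本空间。此处把交叉注意力简化为余弦相似度,K、类别数与温度 tau 均为示意性假设。

```python
import torch
import torch.nn.functional as F

def select_topk_categories(image_feat, category_queries, k=100):
    """图像引导的类别查询选择(示意):按相似度保留每张图像的 Top-K 类别。

    image_feat:       (D,)   图像全局特征
    category_queries: (C, D) 可学习的类别查询(C 可以非常大)
    返回被选类别的索引与对应查询。
    """
    sim = F.cosine_similarity(category_queries, image_feat.unsqueeze(0), dim=-1)  # (C,)
    topk = sim.topk(k=min(k, category_queries.size(0)))
    return topk.indices, category_queries[topk.indices]

def contrastive_classification(object_queries, selected_queries, tau=0.07):
    """对象查询与被选类别查询之间的对比打分(示意)。"""
    o = F.normalize(object_queries, dim=-1)              # (Q, D)
    c = F.normalize(selected_queries, dim=-1)             # (K, D)
    return o @ c.t() / tau                                 # (Q, K) 分类 logits

# 用法示意(类别数 20000 为假设的超大词汇量)
idx, cats = select_topk_categories(torch.randn(256), torch.randn(20000, 256), k=100)
logits = contrastive_classification(torch.randn(900, 256), cats)
```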

Cool Papers

点此查看论文截图

OmniSAM: Omnidirectional Segment Anything Model for UDA in Panoramic Semantic Segmentation

Authors:Ding Zhong, Xu Zheng, Chenfei Liao, Yuanhuiyi Lyu, Jialei Chen, Shengyang Wu, Linfeng Zhang, Xuming Hu

Segment Anything Model 2 (SAM2) has emerged as a strong base model in various pinhole imaging segmentation tasks. However, when applying it to $360^\circ$ domain, the significant field-of-view (FoV) gap between pinhole ($70^\circ \times 70^\circ$) and panoramic images ($180^\circ \times 360^\circ$) poses unique challenges. Two major concerns for this application includes 1) inevitable distortion and object deformation brought by the large FoV disparity between domains; 2) the lack of pixel-level semantic understanding that the original SAM2 cannot provide. To address these issues, we propose a novel OmniSAM framework, which makes the first attempt to apply SAM2 for panoramic semantic segmentation. Specifically, to bridge the first gap, OmniSAM first divides the panorama into sequences of patches. These patches are then treated as image sequences in similar manners as in video segmentation tasks. We then leverage the SAM2’s memory mechanism to extract cross-patch correspondences that embeds the cross-FoV dependencies, improving feature continuity and the prediction consistency along mask boundaries. For the second gap, OmniSAM fine-tunes the pretrained image encoder and reutilize the mask decoder for semantic prediction. An FoV-based prototypical adaptation module with dynamic pseudo label update mechanism is also introduced to facilitate the alignment of memory and backbone features, thereby improving model generalization ability across different sizes of source models. Extensive experimental results demonstrate that OmniSAM outperforms the state-of-the-art methods by large margins, e.g., 79.06% (+10.22%) on SPin8-to-SPan8, 62.46% (+6.58%) on CS13-to-DP13.

Segment Anything Model 2(SAM2)已成为各类针孔成像分割任务中强大的基础模型。然而,将其应用于$360^\circ$场景时,针孔图像($70^\circ \times 70^\circ$)与全景图像($180^\circ \times 360^\circ$)之间巨大的视场(FoV)差距带来了独特的挑战。该应用主要面临两个问题:1)域间巨大的视场差异带来的不可避免的畸变和物体形变;2)原始SAM2无法提供像素级的语义理解。为了解决这些问题,我们提出了全新的OmniSAM框架,首次尝试将SAM2应用于全景语义分割。具体而言,为弥合第一个差距,OmniSAM先把全景图划分为补丁序列,并以类似视频分割任务的方式将这些补丁作为图像序列处理;随后利用SAM2的记忆机制提取蕴含跨视场依赖的跨补丁对应关系,从而提升特征连续性以及掩膜边界处预测的一致性。针对第二个差距,OmniSAM微调了预训练图像编码器,并复用掩膜解码器进行语义预测。我们还引入了带动态伪标签更新机制的基于视场的原型自适应模块,以促进记忆特征与主干特征的对齐,从而提升模型在不同规模源模型上的泛化能力。大量实验结果表明,OmniSAM大幅超越了现有最先进方法,例如在SPin8-to-SPan8上达到79.06%(+10.22%),在CS13-to-DP13上达到62.46%(+6.58%)。

论文及项目相关链接

PDF

Summary:针对全景语义分割面临的挑战,OmniSAM框架通过应用SAM2模型提出了一系列创新解决方案。OmniSAM通过划分全景图像为多个补丁序列,并利用SAM2的记忆机制提取跨补丁对应关系,以解决不同视场角之间的差异性问题和语义理解的缺失。经过微调图像编码器和重新利用遮罩解码器进行语义预测,OmniSAM显著提高了模型在不同源模型大小上的泛化能力。实验结果表明,OmniSAM在SPin8-to-SPan8和CS13-to-DP13任务上的性能远超现有方法。

Key Takeaways

  1. OmniSAM框架首次将SAM2模型应用于全景语义分割,解决在$360^\circ$领域中应用时面临的挑战。
  2. OmniSAM通过划分全景图像为补丁序列,利用SAM2的记忆机制提取跨补丁对应关系,以弥补视场角差异和语义理解的不足。
  3. OmniSAM微调了预训练的图像编码器并重新利用遮罩解码器进行语义预测,提高了模型的泛化能力。
  4. 引入了一个基于视场的原型适应模块,带有动态伪标签更新机制,促进记忆和主干特征的对齐。
  5. 实验结果表明,OmniSAM在SPin8-to-SPan8和CS13-to-DP13任务上的性能远超现有方法,显示出其强大的实际应用潜力。
  6. OmniSAM框架在桥接不同视场角差异和增强语义理解方面的创新策略为全景成像分割任务提供了新的研究方向。
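
把全景图划分为补丁序列的具体参数以论文为准;这里给出一个沿经度方向把等距柱状(equirectangular)全景切成重叠补丁序列的极简示意(NumPy),补丁宽度与步长为假设值,切出的序列可像"视频帧"一样送入带记忆机制的分割模型。

```python
import numpy as np

def panorama_to_patch_sequence(pano, patch_w=512, stride=256):
    """把 (H, W, 3) 的等距柱状全景沿水平方向切成有重叠的补丁序列(示意)。

    利用全景在水平方向的周期性,通过 np.roll 处理跨越 0°/360° 边界的补丁。
    """
    h, w, _ = pano.shape
    patches = []
    for x in range(0, w, stride):
        if x + patch_w <= w:
            patch = pano[:, x:x + patch_w]
        else:                                   # 跨边界:先水平平移再裁剪
            patch = np.roll(pano, -x, axis=1)[:, :patch_w]
        patches.append(patch)
    return patches                              # 作为"视频帧"序列送入记忆机制

# 用法示意
seq = panorama_to_patch_sequence(np.zeros((1024, 2048, 3), dtype=np.uint8))
```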

Cool Papers

点此查看论文截图

SynDiff-AD: Improving Semantic Segmentation and End-to-End Autonomous Driving with Synthetic Data from Latent Diffusion Models

Authors:Harsh Goel, Sai Shankar Narasimhan, Oguzhan Akcin, Sandeep Chinchali

In recent years, significant progress has been made in collecting large-scale datasets to improve segmentation and autonomous driving models. These large-scale datasets are often dominated by common environmental conditions such as “Clear and Day” weather, leading to decreased performance in under-represented conditions like “Rainy and Night”. To address this issue, we introduce SynDiff-AD, a novel data augmentation pipeline that leverages diffusion models (DMs) to generate realistic images for such subgroups. SynDiff-AD uses ControlNet-a DM that guides data generation conditioned on semantic maps-along with a novel prompting scheme that generates subgroup-specific, semantically dense prompts. By augmenting datasets with SynDiff-AD, we improve the performance of segmentation models like Mask2Former and SegFormer by up to 1.2% and 2.3% on the Waymo dataset, and up to 1.4% and 0.7% on the DeepDrive dataset, respectively. Additionally, we demonstrate that our SynDiff-AD pipeline enhances the driving performance of end-to-end autonomous driving models, like AIM-2D and AIM-BEV, by up to 20% across diverse environmental conditions in the CARLA autonomous driving simulator, providing a more robust model. We release our code and pipeline at https://github.com/UTAustin-SwarmLab/SynDiff-AD.

近年来,在收集大规模数据集以改进分割和自动驾驶模型方面取得了显著进展。这些大规模数据集往往以“晴天白昼”等常见环境条件为主,导致模型在“雨天夜晚”等代表性不足的条件下性能下降。为了解决这一问题,我们提出了SynDiff-AD,一种利用扩散模型(DM)为此类子组生成逼真图像的新型数据增强管道。SynDiff-AD使用以语义图为条件引导数据生成的ControlNet,并配合一种能够生成子组特定、语义密集提示的新颖提示方案。通过用SynDiff-AD增强数据集,我们将Mask2Former和SegFormer等分割模型在Waymo数据集上的性能分别提升最多1.2%和2.3%,在DeepDrive数据集上分别提升最多1.4%和0.7%。此外,我们还证明SynDiff-AD管道能在CARLA自动驾驶模拟器的多种环境条件下,将AIM-2D和AIM-BEV等端到端自动驾驶模型的驾驶性能提升最多20%,从而得到更稳健的模型。我们在 https://github.com/UTAustin-SwarmLab/SynDiff-AD 发布了代码和数据增强管道。

论文及项目相关链接

PDF 15 pages, 10 figures

Summary

本文介绍了SynDiff-AD数据增强管道的研究,该管道利用扩散模型生成针对特定子组的真实图像,解决了大规模数据集在复杂环境条件下的性能下降问题。通过SynDiff-AD增强数据集,改善了分割模型的性能,提高了端到端自动驾驶模型在CARLA自动驾驶模拟器中的驾驶性能。

Key Takeaways

  1. SynDiff-AD是一种新的数据增强管道,旨在解决大规模数据集中特定环境条件下的性能下降问题。
  2. 该方法利用扩散模型(DMs)生成真实图像,特别针对那些被忽视的、特定的环境条件如“雨天夜晚”。
  3. 通过SynDiff-AD增强数据集,提高了分割模型的性能,如Mask2Former和SegFormer在Waymo和DeepDrive数据集上的表现。
  4. SynDiff-AD管道不仅提高了分割模型的性能,还增强了端到端自动驾驶模型在模拟环境中的驾驶性能。
  5. 该研究在CARLA自动驾驶模拟器上测试了SynDiff-AD的效能,结果显示性能提升显著。
  6. 研究人员发布了SynDiff-AD的代码和管道在https://github.com/UTAustin-SwarmLab/SynDiff-AD。
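
SynDiff-AD 的提示生成方案细节见原文;这里给出一个假设性的示意:从语义图的类别直方图构造语义密集的场景描述,并拼接子组(天气/时段)条件,用作 ControlNet 式生成的文本提示。类别表 CLASS_NAMES 与提示模板均为假设,并非论文设定。

```python
import numpy as np

# 假设的类别表(示意,并非论文设定)
CLASS_NAMES = {0: "road", 1: "sidewalk", 2: "building", 3: "vegetation",
               4: "car", 5: "pedestrian", 6: "traffic sign", 7: "sky"}

def subgroup_prompt(semantic_map, subgroup="rainy, night", top_n=5):
    """根据语义图中占比最高的类别构造语义密集提示,并附加子组条件(示意)。"""
    ids, counts = np.unique(semantic_map, return_counts=True)
    order = np.argsort(-counts)
    names = [CLASS_NAMES.get(int(ids[i]), "object") for i in order[:top_n]]
    scene = ", ".join(names)
    return f"a photorealistic driving scene with {scene}, {subgroup} conditions"

# 用法示意
sem = np.random.randint(0, 8, (512, 1024))
prompt = subgroup_prompt(sem, subgroup="rainy, night")
```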

Cool Papers

点此查看论文截图

Diffusion-based RGB-D Semantic Segmentation with Deformable Attention Transformer

Authors:Minh Bui, Kostas Alexis

Vision-based perception and reasoning is essential for scene understanding in any autonomous system. RGB and depth images are commonly used to capture both the semantic and geometric features of the environment. Developing methods to reliably interpret this data is critical for real-world applications, where noisy measurements are often unavoidable. In this work, we introduce a diffusion-based framework to address the RGB-D semantic segmentation problem. Additionally, we demonstrate that utilizing a Deformable Attention Transformer as the encoder to extract features from depth images effectively captures the characteristics of invalid regions in depth measurements. Our generative framework shows a greater capacity to model the underlying distribution of RGB-D images, achieving robust performance in challenging scenarios with significantly less training time compared to discriminative methods. Experimental results indicate that our approach achieves State-of-the-Art performance on both the NYUv2 and SUN-RGBD datasets in general and especially in the most challenging of their image data. Our project page will be available at https://diffusionmms.github.io/

基于视觉的感知与推理对任何自主系统的场景理解都至关重要。RGB图像与深度图像通常被联合用于捕捉环境的语义和几何特征。在真实应用中,带噪声的测量往往不可避免,因此开发能够可靠解读这类数据的方法十分关键。在这项工作中,我们提出了一个基于扩散的框架来解决RGB-D语义分割问题。此外,我们证明,使用可变形注意力Transformer作为编码器从深度图像中提取特征,可以有效刻画深度测量中无效区域的特性。我们的生成式框架在建模RGB-D图像底层分布方面能力更强,在具有挑战性的场景中取得了稳健的性能,且相比判别式方法所需的训练时间大幅减少。实验结果表明,我们的方法在NYUv2和SUN-RGBD数据集上总体达到了最先进的性能,在其中最具挑战性的图像数据上表现尤为突出。项目页面:https://diffusionmms.github.io/

论文及项目相关链接

PDF

Summary
本文引入基于扩散的框架解决RGB-D语义分割问题,并利用可变形注意力转换器作为编码器提取深度图像特征,有效捕捉深度测量中无效区域的特性。实验结果表明,该方法在NYUv2和SUN-RGBD数据集上实现了最新性能,尤其在具有挑战性的图像数据上表现更出色。

Key Takeaways

  1. 介绍了基于扩散的框架来解决RGB-D语义分割问题的重要性。
  2. 可变形注意力转换器能有效提取深度图像特征,并捕捉深度测量中无效区域的特性。
  3. 与判别式方法相比,所提生成式框架具有更强的建模RGB-D图像基础分布的能力,并且在训练时间上更具优势。
  4. 实验结果证明该方法在主流数据集NYUv2和SUN-RGBD上达到了最先进的性能。
  5. 挑战性场景下该方法的稳健性能表现。
  6. 该方法对于自主系统中场景理解的重要性,特别是在处理带有噪声的测量数据时。
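
用扩散框架做语义分割的一种常见思路是对标签图(one-hot 连续化)加噪并学习去噪;下面给出一个极简的单步训练示意(PyTorch),其中噪声调度、时间步数 T 以及 model(x_t, t, rgb, depth) 的接口均为示意性假设,与论文的具体实现无关。

```python
import torch
import torch.nn.functional as F

def diffusion_seg_training_step(model, rgb, depth, label, num_classes, T=1000):
    """扩散式语义分割的单步训练示意(非官方):
    对 one-hot 标签图加高斯噪声,模型以 RGB-D 为条件预测噪声。"""
    x0 = F.one_hot(label, num_classes).permute(0, 3, 1, 2).float() * 2 - 1   # 映射到 [-1, 1]
    t = torch.randint(0, T, (x0.size(0),), device=x0.device)
    # 线性 beta 调度下的累计 alpha(示意)
    betas = torch.linspace(1e-4, 2e-2, T, device=x0.device)
    alpha_bar = torch.cumprod(1 - betas, dim=0)[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    # 模型接口为假设:输入加噪标签图、时间步以及 RGB-D 条件,输出对噪声的预测
    pred_noise = model(x_t, t, rgb, depth)
    return F.mse_loss(pred_noise, noise)
```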

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !