⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them for serious academic work; they are intended only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-27
Zoo3D: Zero-Shot 3D Object Detection at Scene Level
Authors: Andrey Lemeshko, Bulat Gabdullin, Nikita Drozdov, Anton Konushin, Danila Rukhovich, Maksim Kolodiazhnyi
3D object detection is fundamental for spatial understanding. Real-world environments demand models capable of recognizing diverse, previously unseen objects, which remains a major limitation of closed-set methods. Existing open-vocabulary 3D detectors relax annotation requirements but still depend on training scenes, either as point clouds or images. We take this a step further by introducing Zoo3D, the first training-free 3D object detection framework. Our method constructs 3D bounding boxes via graph clustering of 2D instance masks, then assigns semantic labels using a novel open-vocabulary module with best-view selection and view-consensus mask generation. Zoo3D operates in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction by training a class-agnostic detector on Zoo3D$_0$-generated pseudo labels. Furthermore, we extend Zoo3D beyond point clouds to work directly with posed and even unposed images. Across ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ achieve state-of-the-art results in open-vocabulary 3D object detection. Remarkably, our zero-shot Zoo3D$_0$ outperforms all existing self-supervised methods, hence demonstrating the power and adaptability of training-free, off-the-shelf approaches for real-world 3D understanding. Code is available at https://github.com/col14m/zoo3d .
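The box-construction step (lifting each 2D instance mask into 3D with depth and camera pose, then merging masks of the same object by graph clustering over their 3D overlap) can be illustrated with a minimal sketch. This is not the authors' code: the voxel-based overlap measure, the 0.3 threshold, and the axis-aligned box output are assumptions made for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components


def mask_to_points(mask, depth, K, cam_to_world):
    """Back-project the depth pixels covered by one 2D instance mask into world-space 3D points."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (pts_cam @ cam_to_world.T)[:, :3]


def overlap_3d(a, b, voxel=0.05):
    """Rough overlap between two point sets, measured on a shared voxel grid."""
    va = {tuple(p) for p in np.floor(a / voxel).astype(int)}
    vb = {tuple(p) for p in np.floor(b / voxel).astype(int)}
    return len(va & vb) / max(min(len(va), len(vb)), 1)


def cluster_masks_to_boxes(point_sets, thr=0.3):
    """Connect per-view mask point sets that overlap in 3D and emit one
    axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax) per connected component."""
    n = len(point_sets)
    rows, cols = [], []
    for i in range(n):
        for j in range(i + 1, n):
            if overlap_3d(point_sets[i], point_sets[j]) > thr:
                rows += [i, j]
                cols += [j, i]
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    n_comp, labels = connected_components(adj, directed=False)
    boxes = []
    for c in range(n_comp):
        pts = np.concatenate([point_sets[i] for i in range(n) if labels[i] == c])
        boxes.append(np.concatenate([pts.min(axis=0), pts.max(axis=0)]))
    return boxes
```

Connected components over the mask graph stand in for per-object groups; the paper's open-vocabulary labeling with best-view selection would then run on top of these boxes.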
Paper and project links
Summary
This paper presents Zoo3D, a training-free 3D object detection framework. It constructs 3D bounding boxes by graph clustering of 2D instance masks and assigns semantic labels with a novel open-vocabulary module. Zoo3D comes in two modes: the zero-shot Zoo3D$_0$, which requires no training at all, and the self-supervised Zoo3D$_1$, which refines 3D box prediction using pseudo labels generated by Zoo3D$_0$. Beyond point clouds, Zoo3D also works directly on posed and even unposed images. On the ScanNet200 and ARKitScenes benchmarks, both Zoo3D$_0$ and Zoo3D$_1$ set the state of the art in open-vocabulary 3D object detection, and the zero-shot Zoo3D$_0$ even outperforms all existing self-supervised methods, demonstrating the power and adaptability of training-free, off-the-shelf approaches to real-world 3D understanding.
Key Takeaways
- Zoo3D is the first training-free 3D object detection framework for spatial understanding.
- 3D bounding boxes are constructed via graph clustering of 2D instance masks.
- A novel open-vocabulary module is introduced for semantic label assignment.
- Zoo3D operates in two modes, zero-shot and self-supervised; the zero-shot mode requires no training at all.
- Beyond point clouds, Zoo3D works directly on posed and even unposed images.
- On the benchmarks, Zoo3D achieves state-of-the-art results; notably, the zero-shot mode outperforms all existing self-supervised methods.
DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Semantic Instance Segmentation
Authors: Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang
Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially as the growing scale of data and high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose \textbf{DBGroup}, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.
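A minimal sketch of the first-stage idea, projecting scene points into multi-view 2D predictions and letting per-view semantic cues vote for per-point pseudo labels, is shown below. It is a simplified illustration rather than the paper's Dual-Branch module; the pinhole projection, the majority vote, and the input dictionary layout are assumptions.

```python
import numpy as np


def project(points, K, world_to_cam, h, w):
    """Project world-space points into one view; return integer pixel coords and a visibility mask."""
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = homo @ world_to_cam.T
    z = cam[:, 2]
    safe_z = np.where(z > 1e-6, z, 1e-6)  # avoid dividing by zero for points behind the camera
    u = K[0, 0] * cam[:, 0] / safe_z + K[0, 2]
    v = K[1, 1] * cam[:, 1] / safe_z + K[1, 2]
    vis = (z > 1e-6) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return u.astype(int), v.astype(int), vis


def vote_pseudo_labels(points, views, num_classes):
    """Accumulate semantic votes for every 3D point from the 2D predictions of all views.
    `views` is a list of dicts with keys "K", "world_to_cam", and "semantic" (an HxW label map)."""
    votes = np.zeros((len(points), num_classes), dtype=np.int64)
    for view in views:
        h, w = view["semantic"].shape
        u, v, vis = project(points, view["K"], view["world_to_cam"], h, w)
        labels = view["semantic"][v[vis], u[vis]]
        np.add.at(votes, (np.nonzero(vis)[0], labels), 1)
        # The second (mask) branch would analogously record per-view instance-mask ids here
        # and merge points that repeatedly land in the same 2D instance mask.
    pseudo = votes.argmax(axis=1)
    pseudo[votes.sum(axis=1) == 0] = -1  # points never observed keep an ignore label
    return pseudo
```

The paper's refinement strategies (Granularity-Aware Instance Merging, Semantic Selection and Propagation) would then clean up these raw voted labels before the self-training stage.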
Paper and project links
Summary
This paper proposes DBGroup, a two-stage weakly supervised 3D instance segmentation framework that uses scene-level annotations as a more efficient and scalable alternative. In the first stage, a Dual-Branch Point Grouping module generates pseudo labels guided by semantic and mask cues extracted from multi-view images. Two refinement strategies, Granularity-Aware Instance Merging and Semantic Selection and Propagation, further improve label quality. The second stage performs multi-round self-training of an end-to-end instance segmentation network on the refined pseudo labels, and an Instance Mask Filter strategy addresses inconsistencies within the pseudo labels. Experiments show that DBGroup is competitive with sparse-point-level supervised 3D instance segmentation methods while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches.
Key Takeaways
- DBGroup is a two-stage weakly supervised 3D instance segmentation framework that addresses growing data scale and high annotation costs.
- The first stage generates pseudo labels with a Dual-Branch Point Grouping module, guided by semantic and mask cues from multi-view images.
- Two refinement strategies, Granularity-Aware Instance Merging and Semantic Selection and Propagation, improve label quality.
- The second stage performs multi-round self-training of an end-to-end instance segmentation network on the refined pseudo labels.
- An Instance Mask Filter strategy resolves inconsistencies within the pseudo labels.
- Experiments show DBGroup is competitive with sparse-point-level supervised 3D instance segmentation methods and surpasses state-of-the-art scene-level supervised 3D semantic segmentation approaches.
InstaDA: Augmenting Instance Segmentation Data with Dual-Agent System
Authors: Xianbao Hou, Yonghao He, Zeyd Boukhers, John See, Hu Su, Wei Sui, Cong Yang
Acquiring high-quality instance segmentation data is challenging due to the labor-intensive nature of the annotation process and significant class imbalances within datasets. Recent studies have utilized the integration of Copy-Paste and diffusion models to create more diverse datasets. However, these studies often lack deep collaboration between large language models (LLMs) and diffusion models, and underutilize the rich information within the existing training data. To address these limitations, we propose InstaDA, a novel, training-free Dual-Agent system designed to augment instance segmentation datasets. First, we introduce a Text-Agent (T-Agent) that enhances data diversity through collaboration between LLMs and diffusion models. This agent features a novel Prompt Rethink mechanism, which iteratively refines prompts based on the generated images. This process not only fosters collaboration but also increases image utilization and optimizes the prompts themselves. Additionally, we present an Image-Agent (I-Agent) aimed at enriching the overall data distribution. This agent augments the training set by generating new instances conditioned on the training images. To ensure practicality and efficiency, both agents operate as independent and automated workflows, enhancing usability. Experiments conducted on the LVIS 1.0 validation set indicate that InstaDA achieves significant improvements, with an increase of +4.0 in box average precision (AP) and +3.3 in mask AP compared to the baseline. Furthermore, it outperforms the leading model, DiverGen, by +0.3 in box AP and +0.1 in mask AP, with a notable +0.7 gain in box AP on common categories and mask AP gains of +0.2 on common categories and +0.5 on frequent categories.
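A control-flow sketch of the Text-Agent loop with Prompt Rethink is given below. The three callables (`llm_refine`, `diffusion_generate`, `score_image`) are hypothetical placeholders for an LLM, a text-to-image diffusion model, and a quality check; only the loop structure is taken from the abstract, everything else is an assumption.

```python
from typing import Any, Callable, List


def text_agent(seed_prompt: str,
               llm_refine: Callable[[str, str], str],     # hypothetical LLM call: (prompt, feedback) -> new prompt
               diffusion_generate: Callable[[str], Any],  # hypothetical diffusion call: prompt -> image
               score_image: Callable[[Any], float],       # hypothetical quality/consistency check
               rounds: int = 3,
               keep_threshold: float = 0.5) -> List[Any]:
    """Prompt-Rethink-style loop: generate an image, inspect the outcome, and let the LLM
    rewrite the prompt before the next round, keeping images that pass the quality check."""
    kept, prompt = [], seed_prompt
    for _ in range(rounds):
        image = diffusion_generate(prompt)
        score = score_image(image)
        if score >= keep_threshold:
            kept.append(image)  # usable synthetic instance, e.g. for Copy-Paste augmentation
        feedback = f"previous prompt: {prompt!r}, image quality score: {score:.2f}"
        prompt = llm_refine(prompt, feedback)  # the LLM "rethinks" the prompt from the result
    return kept
```

The Image-Agent would run as a second, independent workflow that conditions generation on existing training images rather than on text prompts alone.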
Paper and project links
Summary
Acquiring high-quality instance segmentation data is challenging because annotation is labor-intensive and datasets suffer from severe class imbalance. Recent work combines Copy-Paste with diffusion models to build more diverse datasets, but often lacks deep collaboration with large language models and underuses the rich information in existing training data. To address these limitations, the paper proposes InstaDA, a novel, training-free Dual-Agent system for augmenting instance segmentation datasets. A Text-Agent (T-Agent) improves data diversity through collaboration between LLMs and diffusion models and introduces a Prompt Rethink mechanism that iteratively refines prompts while increasing image utilization. An Image-Agent (I-Agent) enriches the overall data distribution by generating new instances conditioned on the training images. On the LVIS 1.0 validation set, InstaDA improves box average precision (AP) by +4.0 and mask AP by +3.3 over the baseline, and it also outperforms the leading model DiverGen (+0.3 box AP, +0.1 mask AP).
Key Takeaways
- Collecting high-quality instance segmentation data is hampered by labor-intensive annotation and class imbalance.
- Recent work combines Copy-Paste with diffusion models to create diverse datasets, but lacks deep collaboration with large language models and underuses the existing training data.
- InstaDA is a novel, training-free Dual-Agent system for augmenting instance segmentation datasets.
- InstaDA comprises a Text-Agent (T-Agent) and an Image-Agent (I-Agent), which enrich data diversity via LLM and diffusion-model collaboration and broaden the overall data distribution by generating new instances conditioned on the training images, respectively.
- A novel Prompt Rethink mechanism iteratively refines prompts and increases image utilization.
- On the LVIS 1.0 validation set, InstaDA significantly improves box AP and mask AP, outperforming both the baseline and the leading model DiverGen.
Unleashing the Power of Chain-of-Prediction for Monocular 3D Object Detection
Authors: Zhihao Zhang, Abhinav Kumar, Girish Chandar Ganesan, Xiaoming Liu
Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, this task is inherently ill-posed since the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. However, simply enforcing such correlations through sequential prediction can propagate errors across attributes, especially when objects are occluded or truncated, where inaccurate size or orientation predictions can further amplify depth errors. Therefore, neither parallel nor sequential prediction is optimal. In this paper, we propose MonoCoP, an adaptive framework that learns when and how to leverage inter-attribute correlations with two complementary designs. A Chain-of-Prediction (CoP) explores inter-attribute correlations through feature-level learning, propagation, and aggregation, while an Uncertainty-Guided Selector (GS) dynamically switches between CoP and parallel paradigms for each object based on the predicted uncertainty. By combining their strengths, MonoCoP achieves state-of-the-art (SOTA) performance on KITTI, nuScenes, and Waymo, significantly improving depth accuracy, particularly for distant objects.
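The interplay between parallel heads, chained heads, and an uncertainty-based switch can be sketched as a small PyTorch module. This is a reconstruction from the abstract, not the released model; the feature size, the size-then-orientation-then-depth chain order, the sigmoid threshold, and the rule that low-uncertainty objects use the chain are all assumptions.

```python
import torch
import torch.nn as nn


class ChainOfPredictionHead(nn.Module):
    """Toy sketch: per-object features are decoded either in parallel or along a
    size -> orientation -> depth chain, chosen per object from a predicted uncertainty."""

    def __init__(self, dim: int = 256, tau: float = 0.5):
        super().__init__()
        self.tau = tau
        # parallel heads: every attribute is predicted from the shared feature alone
        self.par_size, self.par_yaw, self.par_depth = nn.Linear(dim, 3), nn.Linear(dim, 1), nn.Linear(dim, 1)
        # chained heads: later attributes also aggregate the outputs of earlier ones
        self.chain_size = nn.Linear(dim, 3)
        self.chain_yaw = nn.Linear(dim + 3, 1)
        self.chain_depth = nn.Linear(dim + 3 + 1, 1)
        # per-object uncertainty used by the selector
        self.uncertainty = nn.Linear(dim, 1)

    def forward(self, feat: torch.Tensor):
        # parallel branch
        p_size, p_yaw, p_depth = self.par_size(feat), self.par_yaw(feat), self.par_depth(feat)
        # chained branch: propagate earlier predictions into later heads
        c_size = self.chain_size(feat)
        c_yaw = self.chain_yaw(torch.cat([feat, c_size], dim=-1))
        c_depth = self.chain_depth(torch.cat([feat, c_size, c_yaw], dim=-1))
        # selector: confident objects trust the chain, uncertain ones fall back to the parallel heads
        sigma = torch.sigmoid(self.uncertainty(feat))
        use_chain = (sigma < self.tau).float()
        size = use_chain * c_size + (1 - use_chain) * p_size
        yaw = use_chain * c_yaw + (1 - use_chain) * p_yaw
        depth = use_chain * c_depth + (1 - use_chain) * p_depth
        return size, yaw, depth, sigma


# usage with 8 dummy per-object features
head = ChainOfPredictionHead()
size, yaw, depth, sigma = head(torch.randn(8, 256))
```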
Paper and project links
Summary
Monocular 3D detection (Mono3D) aims to infer 3D bounding boxes from a single RGB image. Without auxiliary sensors such as LiDAR, the task is inherently ill-posed because the 3D-to-2D projection introduces depth ambiguity. Previous works often predict 3D attributes (e.g., depth, size, and orientation) in parallel, overlooking that these attributes are inherently correlated through the 3D-to-2D projection. Simply enforcing such correlations via sequential prediction, however, can propagate errors across attributes, especially for occluded or truncated objects, where inaccurate size or orientation predictions further amplify depth errors; hence neither parallel nor sequential prediction is optimal. This paper proposes MonoCoP, an adaptive framework that learns when and how to exploit inter-attribute correlations through two complementary designs: a Chain-of-Prediction (CoP) that explores the correlations via feature-level learning, propagation, and aggregation, and an Uncertainty-Guided Selector (GS) that dynamically switches between the CoP and parallel paradigms for each object based on the predicted uncertainty. Combining their strengths, MonoCoP achieves state-of-the-art performance on KITTI, nuScenes, and Waymo, with particularly strong gains in depth accuracy for distant objects.
Key Takeaways
- Monocular 3D detection is inherently challenging due to depth ambiguity, especially without auxiliary sensors.
- Prior solutions based on purely parallel or purely sequential prediction have limitations and handle occluded or truncated objects poorly.
- MonoCoP combines a Chain-of-Prediction (CoP) with an Uncertainty-Guided Selector (GS) to adapt attribute prediction to each object.
- CoP explores inter-attribute correlations through feature-level learning, propagation, and aggregation.
- The selector dynamically switches the prediction paradigm for each object based on the predicted uncertainty.
- MonoCoP achieves state-of-the-art performance on KITTI, nuScenes, and Waymo, particularly in depth accuracy.