⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on these summaries for serious academic work; they are intended only as a first-pass screening before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-21
ReCon: Region-Controllable Data Augmentation with Rectification and Alignment for Object Detection
Authors:Haowei Zhu, Tianxiang Pan, Rui Qin, Jun-Hai Yong, Bin Wang
The scale and quality of datasets are crucial for training robust perception models. However, obtaining large-scale annotated data is both costly and time-consuming. Generative models have emerged as a powerful tool for data augmentation by synthesizing samples that adhere to desired distributions. However, current generative approaches often rely on complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results, and they remain prone to content-position mismatches and semantic leakage. To overcome these limitations, we introduce ReCon, a novel augmentation framework that enhances the capacity of structure-controllable generative models for object detection. ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to rectify misgenerated regions. We further propose region-aligned cross-attention to enforce spatial-semantic alignment between image regions and their textual cues, thereby improving both semantic consistency and overall image fidelity. Extensive experiments demonstrate that ReCon substantially improves the quality and trainability of generated data, achieving consistent performance gains across various datasets, backbone architectures, and data scales. Our code is available at https://github.com/haoweiz23/ReCon.
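As a rough illustration of the region-guided rectification idea described above, here is a minimal sketch, not the authors' implementation: during reverse diffusion, intermediate latents are periodically decoded, each annotated box is checked with a pre-trained detector, and regions whose content disagrees with the intended label are re-noised so that later steps regenerate them. All callables (`denoiser`, `detector`, `decode`, `scheduler`) and the latent-stride mapping are hypothetical stand-ins.

```python
import torch

@torch.no_grad()
def sample_with_rectification(denoiser, detector, decode, scheduler,
                              latents, boxes, labels, timesteps,
                              check_every=5, conf_thresh=0.5, latent_stride=8):
    """boxes are (x0, y0, x1, y1) in image pixels; labels hold the intended class per box."""
    for i, t in enumerate(timesteps):
        eps = denoiser(latents, t, boxes, labels)       # layout-conditioned noise prediction
        latents = scheduler.step(eps, t, latents)       # hypothetical scheduler returning updated latents

        if i and i % check_every == 0:
            image = decode(latents)                     # rough intermediate image
            for (x0, y0, x1, y1), label in zip(boxes, labels):
                pred, conf = detector(image, (x0, y0, x1, y1))   # classify the box content
                if pred != label or conf < conf_thresh:
                    # Map the box to latent coordinates and re-noise only that region,
                    # so that the remaining denoising steps regenerate it.
                    lx0, ly0, lx1, ly1 = (v // latent_stride for v in (x0, y0, x1, y1))
                    region = latents[..., ly0:ly1, lx0:lx1]
                    latents[..., ly0:ly1, lx0:lx1] = scheduler.add_noise(
                        region, torch.randn_like(region), t)
    return decode(latents)
```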
Paper and Project Links
PDF Accepted to NeurIPS 2025 (spotlight)
Summary
This work proposes ReCon, a novel data augmentation framework that strengthens structure-controllable generative models for object detection. It integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to correct misgenerated regions, and it introduces a region-aligned cross-attention mechanism that enforces spatial-semantic alignment between image regions and their textual cues, improving semantic consistency and overall image fidelity. Experiments show that ReCon substantially improves the quality and trainability of generated data, with consistent performance gains across datasets, architectures, and data scales.
Key Takeaways
- Dataset scale and quality are crucial for training robust perception models, but large-scale annotated data is costly and time-consuming to obtain.
- Generative models have become a powerful tool for data augmentation, synthesizing samples that follow the desired distribution.
- Current generative approaches suffer from content-position mismatches and semantic leakage, and need complex post-processing or extensive fine-tuning on massive datasets to achieve satisfactory results.
- ReCon integrates region-guided rectification into the diffusion sampling process, using feedback from a pre-trained perception model to correct misgenerated regions.
- ReCon introduces a region-aligned cross-attention mechanism that improves spatial-semantic alignment between image regions and textual cues (see the sketch after this list).
- ReCon substantially improves the quality and trainability of generated data, with consistent performance gains across datasets, architectures, and data scales.
- The ReCon code is publicly available.
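The sketch below illustrates one way a region-aligned cross-attention layer could work: image queries inside a region's box are restricted to the text tokens describing that region, while background queries may attend to all tokens. This is an illustration of the idea stated in the abstract, not the authors' implementation; the tensor layout and masking rule are assumptions.

```python
import torch
import torch.nn.functional as F

def region_aligned_cross_attention(img_feats, txt_feats, region_masks, token_groups):
    """img_feats:    (HW, C) flattened image queries
    txt_feats:    (T, C)  text keys/values
    region_masks: (R, HW) bool, True where a pixel belongs to region r
    token_groups: (R, T)  bool, True where token t describes region r
    (Each region is assumed to be described by at least one token.)"""
    scores = img_feats @ txt_feats.t() / img_feats.shape[-1] ** 0.5      # (HW, T)

    # allowed[p, t] is True if some region containing pixel p is described by token t.
    allowed = (region_masks.t().float() @ token_groups.float()) > 0      # (HW, T)
    background = ~region_masks.any(dim=0)                                # (HW,)
    allowed[background] = True                                           # background attends to everything

    scores = scores.masked_fill(~allowed, float("-inf"))
    attn = F.softmax(scores, dim=-1)
    return attn @ txt_feats                                              # (HW, C) region-aligned features
```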
Semantic segmentation with coarse annotations
Authors:Jort de Jong, Mike Holenderski
Semantic segmentation is the task of classifying each pixel in an image. Training a segmentation model achieves the best results with annotated images, where each pixel is labeled with its corresponding class. When obtaining fine annotations is difficult or expensive, it may be possible to acquire coarse annotations, e.g. by roughly annotating pixels in an image and leaving some pixels around the boundaries between classes unlabeled. Segmentation with coarse annotations is difficult, in particular when the objective is to optimize the alignment of boundaries between classes. This paper proposes a regularization method for models with an encoder-decoder architecture and superpixel-based upsampling. It encourages the segmented pixels in the decoded image to form SLIC superpixels, which are based on pixel color and position, independent of the segmentation annotation. The method is applied to the FCN-16 fully convolutional network architecture and evaluated on the SUIM, Cityscapes, and PanNuke datasets. It is shown that boundary recall improves significantly compared to state-of-the-art models when trained on coarse annotations.
Paper and Project Links
Summary
This paper addresses training semantic segmentation models from coarse annotations. It proposes a regularization method for encoder-decoder models with superpixel-based upsampling, which encourages the segmentation in the decoded image to follow SLIC superpixels computed from pixel color and position, independently of the annotation. Applied to the FCN-16 fully convolutional architecture and evaluated on the SUIM, Cityscapes, and PanNuke datasets, the method significantly improves boundary recall when training on coarse annotations.
Key Takeaways
- Semantic segmentation classifies every pixel in an image.
- Training a segmentation model on coarse annotations is challenging, particularly when the goal is to optimize the alignment of class boundaries.
- The paper proposes a regularization method for encoder-decoder models that uses superpixel-based upsampling.
- The method encourages the segmented pixels in the decoded image to form SLIC superpixels, which depend only on pixel color and position, not on the segmentation annotation (see the sketch after this list).
- The method is applied to the FCN-16 fully convolutional network architecture.
- Experiments on several datasets show that boundary recall improves significantly when training on coarse annotations.
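To make the regularization idea concrete, here is a minimal sketch under my own assumptions about the formulation: each pixel's predicted class distribution is pulled toward the mean prediction of the SLIC superpixel it falls in, so decoded segments tend to respect color- and position-based superpixel boundaries even where the coarse annotation is silent. The loss form (mean-squared deviation from the superpixel average) is an assumption, not necessarily the paper's exact regularizer; `skimage.segmentation.slic` is a real library call.

```python
import torch
import torch.nn.functional as F
from skimage.segmentation import slic

def superpixel_consistency_loss(logits, image, n_segments=200, compactness=10.0):
    """logits: (C, H, W) unnormalized class scores for one image
    image:  (H, W, 3) RGB numpy array in [0, 1]"""
    segments = torch.from_numpy(slic(image, n_segments=n_segments,
                                     compactness=compactness)).long()     # (H, W) superpixel ids
    probs = F.softmax(logits, dim=0)                                      # (C, H, W)
    C = probs.shape[0]
    flat = probs.reshape(C, -1)                                           # (C, HW)
    seg = segments.reshape(-1)                                            # (HW,)

    # Average the predicted distribution within each superpixel.
    n_sp = int(seg.max()) + 1
    sums = torch.zeros(C, n_sp).index_add_(1, seg, flat)
    counts = torch.zeros(n_sp).index_add_(0, seg, torch.ones_like(seg, dtype=torch.float))
    sp_mean = sums / counts.clamp(min=1)                                  # (C, n_sp)

    # Encourage each pixel to match its superpixel's mean prediction.
    return F.mse_loss(flat, sp_mean[:, seg])
```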
MARIS: Marine Open-Vocabulary Instance Segmentation with Geometric Enhancement and Semantic Alignment
Authors:Bingyu Li, Feiyu Wang, Da Zhang, Zhiyuan Zhao, Junyu Gao, Xuelong Li
Most existing underwater instance segmentation approaches are constrained by closed-vocabulary prediction, limiting their ability to recognize novel marine categories. To support evaluation, we introduce MARIS (Marine Open-Vocabulary Instance Segmentation), the first large-scale fine-grained benchmark for underwater Open-Vocabulary (OV) segmentation, featuring a limited set of seen categories and diverse unseen categories. Although OV segmentation has shown promise on natural images, our analysis reveals that transfer to underwater scenes suffers from severe visual degradation (e.g., color attenuation) and semantic misalignment caused by the lack of underwater class definitions. To address these issues, we propose a unified framework with two complementary components. The Geometric Prior Enhancement Module (GPEM) leverages stable part-level and structural cues to maintain object consistency under degraded visual conditions. The Semantic Alignment Injection Mechanism (SAIM) enriches language embeddings with domain-specific priors, mitigating semantic ambiguity and improving recognition of unseen categories. Experiments show that our framework consistently outperforms existing OV baselines in both In-Domain and Cross-Domain settings on MARIS, establishing a strong foundation for future underwater perception research.
Paper and Project Links
Summary
This paper addresses underwater open-vocabulary (OV) instance segmentation. To overcome the limitations of existing closed-vocabulary underwater segmentation methods, it introduces MARIS, the first large-scale benchmark for underwater open-vocabulary segmentation. The analysis identifies severe visual degradation and semantic misalignment in underwater scenes, and the paper proposes a unified framework combining a Geometric Prior Enhancement Module (GPEM) and a Semantic Alignment Injection Mechanism (SAIM). Experiments show that the framework consistently outperforms existing OV baselines on MARIS in both in-domain and cross-domain settings, laying a solid foundation for future underwater perception research.
Key Takeaways
- Existing underwater instance segmentation methods are limited to closed-vocabulary prediction and struggle to recognize novel marine categories.
- MARIS is introduced as the first large-scale underwater open-vocabulary segmentation benchmark, with a limited set of seen categories and diverse unseen categories.
- Underwater scenes suffer from severe visual degradation and semantic misalignment.
- The proposed unified framework combines a Geometric Prior Enhancement Module (GPEM) and a Semantic Alignment Injection Mechanism (SAIM).
- GPEM uses stable part-level and structural cues to keep objects consistent under degraded visual conditions.
- SAIM enriches language embeddings with domain-specific priors, reducing semantic ambiguity and improving recognition of unseen categories (see the sketch after this list).
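As a loose illustration of what injecting domain-specific priors into language embeddings might look like, here is a sketch under my own assumptions; it is not the actual SAIM design. Each class embedding is blended with the averaged embedding of descriptive marine phrases for that class, and `encode_text` stands in for any CLIP-style text encoder.

```python
import torch

def inject_semantic_priors(class_names, prior_texts, encode_text, alpha=0.5):
    """class_names: list[str], e.g. ["sea cucumber", "scallop"]
    prior_texts: dict[str, list[str]] mapping a class name to descriptive phrases
    encode_text: hypothetical text encoder returning a (C,) embedding per string"""
    embeddings = []
    for name in class_names:
        base = encode_text(f"a photo of a {name}")                    # plain class embedding
        priors = prior_texts.get(name, [])
        if priors:
            # Blend in the average embedding of domain-specific descriptions.
            prior = torch.stack([encode_text(p) for p in priors]).mean(dim=0)
            emb = alpha * base + (1 - alpha) * prior
        else:
            emb = base
        embeddings.append(emb / emb.norm())                           # keep unit norm for matching
    return torch.stack(embeddings)                                    # (num_classes, C)
```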
FreqPDE: Rethinking Positional Depth Embedding for Multi-View 3D Object Detection Transformers
Authors:Haisheng Su, Junjie Zhang, Feixiang Song, Sanping Zhou, Wei Wu, Nanning Zheng, Junchi Yan
Detecting 3D objects accurately from multi-view 2D images is a challenging yet essential task in the field of autonomous driving. Current methods resort to integrating depth prediction to recover the spatial information for object query decoding, which necessitates explicit supervision from LiDAR points during the training phase. However, the predicted depth quality remains unsatisfactory, with issues such as depth discontinuities at object boundaries and poorly distinguished small objects, mainly caused by the sparse supervision of projected points and the use of high-level image features for depth prediction. Besides, cross-view consistency and scale invariance are also overlooked in previous methods. In this paper, we introduce Frequency-aware Positional Depth Embedding (FreqPDE) to equip 2D image features with spatial information for the 3D detection transformer decoder, obtained through three main modules. Specifically, the Frequency-aware Spatial Pyramid Encoder (FSPE) constructs a feature pyramid by combining high-frequency edge clues and low-frequency semantics from different levels. Then the Cross-view Scale-invariant Depth Predictor (CSDP) estimates the pixel-level depth distribution with cross-view and efficient channel attention mechanisms. Finally, the Positional Depth Encoder (PDE) combines the 2D image features and 3D position embeddings to generate the 3D depth-aware features for query decoding. Additionally, hybrid depth supervision is adopted for complementary depth learning from both metric and distribution aspects. Extensive experiments conducted on the nuScenes dataset demonstrate the effectiveness and superiority of our proposed method.
Paper and Project Links
PDF Accepted to ICCV 2025
Summary
To address the challenge of accurately detecting 3D objects from multi-view 2D images in autonomous driving, where current methods rely on depth prediction to recover spatial information for object query decoding, this paper proposes Frequency-aware Positional Depth Embedding (FreqPDE). Three main modules equip 2D image features with spatial information for the 3D detection transformer decoder. Experiments on the nuScenes dataset demonstrate the effectiveness and superiority of the method.
Key Takeaways
- Current methods rely on depth prediction to recover spatial information for 3D object detection, but the predicted depth quality is still unsatisfactory.
- Previous methods overlook cross-view consistency and scale invariance.
- FreqPDE equips 2D image features with spatial information for 3D detection through three main modules (FSPE, CSDP, and PDE).
- FSPE builds a feature pyramid that combines high-frequency edge cues and low-frequency semantics from different levels.
- CSDP estimates a pixel-level depth distribution using cross-view and efficient channel attention mechanisms.
- PDE combines 2D image features with 3D position embeddings to produce depth-aware features for query decoding (see the sketch after this list).
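The following is a rough sketch of how a positional depth embedding could fuse a predicted depth distribution with 3D position information; it illustrates the PDE idea only, and the module name, depth range, and additive fusion are my assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class PositionalDepthEmbedding(nn.Module):
    """Turn a per-pixel depth distribution into an expected 3D point along each
    camera ray, embed it, and fuse it with the 2D image feature so the
    transformer decoder sees depth-aware features (hedged sketch)."""
    def __init__(self, feat_dim, depth_bins):
        super().__init__()
        # Assumed metric depth for each bin, e.g. 1 m .. 60 m.
        self.register_buffer("bin_centers", torch.linspace(1.0, 60.0, depth_bins))
        self.pos_mlp = nn.Sequential(nn.Linear(3, feat_dim), nn.ReLU(),
                                     nn.Linear(feat_dim, feat_dim))

    def forward(self, img_feats, depth_logits, ray_dirs, cam_origins):
        # img_feats:    (B, N, feat_dim)  flattened multi-view image features
        # depth_logits: (B, N, D)         per-pixel depth distribution logits
        # ray_dirs:     (B, N, 3)         unit ray direction for each pixel
        # cam_origins:  (B, N, 3)         camera center for each pixel's view
        depth_prob = depth_logits.softmax(dim=-1)                           # (B, N, D)
        exp_depth = (depth_prob * self.bin_centers).sum(-1, keepdim=True)   # expected depth (B, N, 1)
        points_3d = cam_origins + exp_depth * ray_dirs                      # expected 3D location
        return img_feats + self.pos_mlp(points_3d)                          # depth-aware features
```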