⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on these summaries in serious academic settings; use them only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-17
FS-SAM2: Adapting Segment Anything Model 2 for Few-Shot Semantic Segmentation via Low-Rank Adaptation
Authors:Bernardo Forni, Gabriele Lombardi, Federico Pozzi, Mirco Planamente
Few-shot semantic segmentation has recently attracted great attention. The goal is to develop a model capable of segmenting unseen classes using only a few annotated samples. Most existing approaches adapt a pre-trained model by training from scratch an additional module. Achieving optimal performance with these approaches requires extensive training on large-scale datasets. The Segment Anything Model 2 (SAM2) is a foundational model for zero-shot image and video segmentation with a modular design. In this paper, we propose a Few-Shot segmentation method based on SAM2 (FS-SAM2), where SAM2’s video capabilities are directly repurposed for the few-shot task. Moreover, we apply a Low-Rank Adaptation (LoRA) to the original modules in order to handle the diverse images typically found in standard datasets, unlike the temporally connected frames used in SAM2’s pre-training. With this approach, only a small number of parameters is meta-trained, which effectively adapts SAM2 while benefiting from its impressive segmentation performance. Our method supports any K-shot configuration. We evaluate FS-SAM2 on the PASCAL-5$^i$, COCO-20$^i$ and FSS-1000 datasets, achieving remarkable results and demonstrating excellent computational efficiency during inference. Code is available at https://github.com/fornib/FS-SAM2
Paper and project links
PDF Accepted at ICIAP 2025
Summary
This paper presents FS-SAM2, a few-shot semantic segmentation method built on the Segment Anything Model 2 (SAM2). The method repurposes SAM2's video capabilities for the few-shot task and applies Low-Rank Adaptation (LoRA) to the original modules. It supports any K-shot configuration, achieves strong results on the PASCAL-5i, COCO-20i, and FSS-1000 datasets, and is computationally efficient at inference. See the linked GitHub repository for details.
Key Takeaways
- Few-shot semantic segmentation aims to segment unseen classes using only a few annotated samples.
- SAM2 is a foundation model for zero-shot image and video segmentation with a modular design.
- FS-SAM2 repurposes SAM2's video capabilities for the few-shot segmentation task.
- LoRA is applied to the original modules (see the sketch after this list) to handle the diverse images found in standard datasets, which differ from the temporally connected frames used in SAM2's pre-training.
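
As a concrete illustration of the LoRA mechanism referenced above, below is a minimal PyTorch sketch of a trainable low-rank update added to a frozen linear layer. The wrapper name `LoRALinear`, the rank, and the scaling are illustrative assumptions, not the authors' FS-SAM2 implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (LoRA sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # keep the pre-trained weights frozen
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Hypothetical usage: wrap a projection layer inside a SAM2-like block.
proj = nn.Linear(256, 256)
lora_proj = LoRALinear(proj, rank=4)
out = lora_proj(torch.randn(2, 100, 256))     # only lora_a / lora_b receive gradients
print(out.shape)
```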



Probabilistic Robustness Analysis in High Dimensional Space: Application to Semantic Segmentation Network
Authors:Navid Hashemi, Samuel Sasaki, Diego Manzanas Lopez, Ipek Oguz, Meiyi Ma, Taylor T. Johnson
Semantic segmentation networks (SSNs) play a critical role in domains such as medical imaging, autonomous driving, and environmental monitoring, where safety hinges on reliable model behavior under uncertainty. Yet, existing probabilistic verification approaches struggle to scale with the complexity and dimensionality of modern segmentation tasks, often yielding guarantees that are too conservative to be practical. We introduce a probabilistic verification framework that is both architecture-agnostic and scalable to high-dimensional outputs. Our approach combines sampling-based reachability analysis with conformal inference (CI) to deliver provable guarantees while avoiding the excessive conservatism of prior methods. To counteract CI’s limitations in high-dimensional settings, we propose novel strategies that reduce conservatism without compromising rigor. Empirical evaluation on large-scale segmentation models across CamVid, OCTA-500, Lung Segmentation, and Cityscapes demonstrates that our method provides reliable safety guarantees while substantially tightening bounds compared to SOTA. We also provide a toolbox implementing this technique, available on Github.
Paper and project links
Summary
This paper introduces a probabilistic verification framework for semantic segmentation networks (SSNs), which are critical in domains such as medical imaging, autonomous driving, and environmental monitoring. The framework is architecture-agnostic and scales to high-dimensional outputs. It combines sampling-based reachability analysis with conformal inference (CI) to provide provable guarantees while avoiding the excessive conservatism of prior methods. To counteract CI's limitations in high-dimensional settings, the authors propose new strategies that reduce conservatism without sacrificing rigor. Empirical evaluation on large-scale segmentation models shows that the method delivers reliable safety guarantees with substantially tighter bounds than the state of the art.
Key Takeaways
- Semantic segmentation networks (SSNs) play a key role in domains such as medical imaging, autonomous driving, and environmental monitoring.
- Existing probabilistic verification methods struggle with the complexity and dimensionality of modern segmentation tasks.
- The proposed probabilistic verification framework is both architecture-agnostic and scalable to high-dimensional outputs.
- It combines sampling-based reachability analysis with conformal inference (CI) to deliver provable guarantees (see the sketch after this list).
- Novel strategies reduce CI's conservatism in high-dimensional settings without compromising rigor.
- Empirical evaluation on large-scale segmentation models demonstrates the method's reliability and substantially tighter bounds.
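
For readers unfamiliar with conformal inference, the snippet below sketches a standard split-conformal quantile computed on sampled nonconformity scores: under exchangeability, a fresh score falls below the returned threshold with probability at least 1 − α. The score definition and the synthetic data are placeholders, not the paper's reduced-conservatism strategies.

```python
import numpy as np

def conformal_threshold(scores: np.ndarray, alpha: float = 0.05) -> float:
    """Split-conformal quantile: with probability >= 1 - alpha, a fresh
    exchangeable score falls below the returned threshold."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1.0 - alpha)))     # rank of the conformal quantile
    return float(np.sort(scores)[min(k, n) - 1])

# Hypothetical calibration: per-sample nonconformity = worst-case pixel deviation
# between a perturbed segmentation output and the nominal one.
rng = np.random.default_rng(0)
nominal = rng.random((200, 64, 64))               # placeholder per-pixel scores
perturbed = nominal + rng.normal(0, 0.02, nominal.shape)
scores = np.abs(perturbed - nominal).reshape(200, -1).max(axis=1)
print("95% conformal bound on pixel deviation:", conformal_threshold(scores, 0.05))
```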

MAFS: Masked Autoencoder for Infrared-Visible Image Fusion and Semantic Segmentation
Authors:Liying Wang, Xiaoli Zhang, Chuanmin Jia, Siwei Ma
Infrared-visible image fusion methods aim at generating fused images with good visual quality and also facilitate the performance of high-level tasks. Indeed, existing semantic-driven methods have considered semantic information injection for downstream applications. However, none of them investigates the potential for reciprocal promotion between pixel-wise image fusion and cross-modal feature fusion perception tasks from a macroscopic task-level perspective. To address this limitation, we propose a unified network for image fusion and semantic segmentation. MAFS is a parallel structure, containing a fusion sub-network and a segmentation sub-network. On the one hand, We devise a heterogeneous feature fusion strategy to enhance semantic-aware capabilities for image fusion. On the other hand, by cascading the fusion sub-network and a segmentation backbone, segmentation-related knowledge is transferred to promote feature-level fusion-based segmentation. Within the framework, we design a novel multi-stage Transformer decoder to aggregate fine-grained multi-scale fused features efficiently. Additionally, a dynamic factor based on the max-min fairness allocation principle is introduced to generate adaptive weights of two tasks and guarantee smooth training in a multi-task manner. Extensive experiments demonstrate that our approach achieves competitive results compared with state-of-the-art methods. The code is available at https://github.com/Abraham-Einstein/MAFS/.
Paper and project links
PDF Accepted by TIP 2025
Summary
This paper proposes a unified network for infrared-visible image fusion and semantic segmentation that aims to produce fused images with good visual quality while also benefiting high-level tasks. From a task-level perspective, it explores the mutual promotion between pixel-wise image fusion and cross-modal feature-fusion perception. A heterogeneous feature fusion strategy enhances the semantic awareness of image fusion, and cascading the fusion sub-network with a segmentation backbone transfers segmentation knowledge to feature-level fusion-based segmentation. A novel multi-stage Transformer decoder efficiently aggregates fine-grained multi-scale fused features, and a dynamic factor based on the max-min fairness allocation principle generates adaptive weights for the two tasks to keep multi-task training smooth. Experiments show the method is competitive with state-of-the-art approaches.
Key Takeaways
- Infrared-visible image fusion aims to generate fused images with good visual quality while also facilitating high-level tasks.
- Existing semantic-driven methods inject semantic information for downstream applications, but do not explore, from a macroscopic task-level perspective, the mutual promotion between pixel-wise image fusion and cross-modal feature-fusion perception.
- A unified network for image fusion and semantic segmentation is proposed, with parallel fusion and segmentation sub-networks.
- A heterogeneous feature fusion strategy enhances the semantic-aware capability of image fusion.
- Cascading the fusion sub-network with a segmentation backbone transfers segmentation knowledge to feature-level fusion-based segmentation.
- A novel multi-stage Transformer decoder efficiently aggregates fine-grained multi-scale fused features, and a dynamic factor based on the max-min fairness allocation principle balances the two task losses (see the sketch after this list).
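
The max-min-fairness-based dynamic weighting mentioned above can be pictured with a toy rule in which the currently worse-off task receives the larger weight. This is only a hedged sketch of the general idea; the function name, the weighting formula, and the loss values are assumptions, not the formulation used in MAFS.

```python
import torch

def maxmin_task_weights(loss_fusion: torch.Tensor, loss_seg: torch.Tensor,
                        eps: float = 1e-8) -> torch.Tensor:
    """Toy max-min-fairness-style weighting: the task with the larger current
    loss (the 'worst-off' task) gets the larger weight. Illustrative only."""
    losses = torch.stack([loss_fusion.detach(), loss_seg.detach()])
    return losses / (losses.sum() + eps)          # normalised, worst-off favoured

# Hypothetical usage inside one training step of a two-task network.
loss_fusion = torch.tensor(0.8)
loss_seg = torch.tensor(0.2)
w = maxmin_task_weights(loss_fusion, loss_seg)
total = w[0] * loss_fusion + w[1] * loss_seg
print(w, total)
```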





ActivePose: Active 6D Object Pose Estimation and Tracking for Robotic Manipulation
Authors:Sheng Liu, Zhe Li, Weiheng Wang, Han Sun, Heng Zhang, Hongpeng Chen, Yusen Qin, Arash Ajoudani, Yizhao Wang
Accurate 6-DoF object pose estimation and tracking are critical for reliable robotic manipulation. However, zero-shot methods often fail under viewpoint-induced ambiguities and fixed-camera setups struggle when objects move or become self-occluded. To address these challenges, we propose an active pose estimation pipeline that combines a Vision-Language Model (VLM) with “robotic imagination” to dynamically detect and resolve ambiguities in real time. In an offline stage, we render a dense set of views of the CAD model, compute the FoundationPose entropy for each view, and construct a geometric-aware prompt that includes low-entropy (unambiguous) and high-entropy (ambiguous) examples. At runtime, the system: (1) queries the VLM on the live image for an ambiguity score; (2) if ambiguity is detected, imagines a discrete set of candidate camera poses by rendering virtual views, scores each based on a weighted combination of VLM ambiguity probability and FoundationPose entropy, and then moves the camera to the Next-Best-View (NBV) to obtain a disambiguated pose estimation. Furthermore, since moving objects may leave the camera’s field of view, we introduce an active pose tracking module: a diffusion-policy trained via imitation learning, which generates camera trajectories that preserve object visibility and minimize pose ambiguity. Experiments in simulation and real-world show that our approach significantly outperforms classical baselines.
Paper and project links
PDF 6D Pose, Diffusion Policy
Summary
This paper proposes an active pose estimation pipeline that combines a Vision-Language Model (VLM) with "robotic imagination" to detect and resolve viewpoint-induced ambiguities in real time. In an offline stage, a dense set of views of the CAD model is rendered, the FoundationPose entropy of each view is computed, and a geometry-aware prompt containing low-entropy and high-entropy examples is constructed. At runtime, the system queries the VLM on the live image for an ambiguity score; if ambiguity is detected, it renders virtual views to imagine a discrete set of candidate camera poses, scores each by a weighted combination of VLM ambiguity probability and FoundationPose entropy, and moves the camera to the Next-Best-View (NBV) to obtain a disambiguated pose estimate. Because moving objects may leave the camera's field of view, an active pose tracking module is introduced: a diffusion policy trained via imitation learning generates camera trajectories that preserve object visibility and minimize pose ambiguity. Experiments show the approach significantly outperforms classical baselines.
Key Takeaways
- Accurate 6-DoF object pose estimation and tracking are critical for reliable robotic manipulation.
- Zero-shot methods often fail under viewpoint-induced ambiguities, and fixed-camera setups struggle when objects move or become self-occluded.
- An active pose estimation pipeline combining a Vision-Language Model (VLM) with "robotic imagination" detects and resolves ambiguities in real time.
- Offline, dense views of the CAD model are rendered, FoundationPose entropy is computed per view, and a geometry-aware prompt with low- and high-entropy examples is constructed.
- At runtime, the system queries the VLM for an ambiguity score, renders virtual views as candidate camera poses, and scores them by a weighted combination of VLM ambiguity probability and FoundationPose entropy to select the Next-Best-View (see the sketch after this list).
- An active pose tracking module, a diffusion policy trained via imitation learning, generates camera trajectories that keep the object visible and reduce pose ambiguity.
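
At its core, the Next-Best-View selection described above amounts to ranking candidate views by a weighted combination of two signals. The following is a minimal sketch under that reading; the `CandidateView` structure, the weights, and the example values are assumptions rather than the paper's actual scoring code.

```python
from dataclasses import dataclass

@dataclass
class CandidateView:
    name: str
    vlm_ambiguity: float      # in [0, 1], queried from the VLM on a rendered view
    pose_entropy: float       # FoundationPose entropy pre-computed offline

def next_best_view(candidates: list[CandidateView],
                   w_vlm: float = 0.6, w_entropy: float = 0.4) -> CandidateView:
    """Pick the candidate with the lowest weighted ambiguity score.
    The weights here are placeholders, not the paper's values."""
    return min(candidates, key=lambda c: w_vlm * c.vlm_ambiguity + w_entropy * c.pose_entropy)

views = [CandidateView("front", 0.9, 0.7),
         CandidateView("left_45", 0.3, 0.4),
         CandidateView("top", 0.5, 0.2)]
print(next_best_view(views).name)   # -> "left_45"
```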







Motion Estimation for Multi-Object Tracking using KalmanNet with Semantic-Independent Encoding
Authors:Jian Song, Wei Mei, Yunfeng Xu, Qiang Fu, Renke Kou, Lina Bu, Yucheng Long
Motion estimation is a crucial component in multi-object tracking (MOT). It predicts the trajectory of objects by analyzing the changes in their positions in consecutive frames of images, reducing tracking failures and identity switches. The Kalman filter (KF) based on the linear constant-velocity model is one of the most commonly used methods in MOT. However, it may yield unsatisfactory results when KF’s parameters are mismatched and objects move in non-stationary. In this work, we utilize the learning-aided filter to handle the motion estimation of MOT. In particular, we propose a novel method named Semantic-Independent KalmanNet (SIKNet), which encodes the state vector (the input feature) using a Semantic-Independent Encoder (SIE) by two steps. First, the SIE uses a 1D convolution with a kernel size of 1, which convolves along the dimension of homogeneous-semantic elements across different state vectors to encode independent semantic information. Then it employs a fully-connected layer and a nonlinear activation layer to encode nonlinear and cross-dependency information between heterogeneous-semantic elements. To independently evaluate the performance of the motion estimation module in MOT, we constructed a large-scale semi-simulated dataset from several open-source MOT datasets. Experimental results demonstrate that the proposed SIKNet outperforms the traditional KF and achieves superior robustness and accuracy than existing learning-aided filters. The code is available at (https://github.com/SongJgit/filternet and https://github.com/SongJgit/TBDTracker).
Paper and project links
Summary
This paper studies motion estimation in multi-object tracking (MOT), where the traditional Kalman filter can perform poorly in some situations. It proposes SIKNet, a learning-aided filter that processes the state vector with a Semantic-Independent Encoder (SIE), and evaluates it on a large-scale semi-simulated dataset. Experimental results show that SIKNet surpasses the traditional Kalman filter and existing learning-aided filters in robustness and accuracy.
Key Takeaways
- Motion estimation is crucial in multi-object tracking: predicting object trajectories reduces tracking failures and identity switches.
- The Kalman filter based on the linear constant-velocity model is widely used, but it can perform poorly when its parameters are mismatched or when objects move non-stationarily.
- The proposed SIKNet encodes the state vector with a Semantic-Independent Encoder (SIE).
- The SIE works in two steps: a 1D convolution with kernel size 1 encodes independent semantic information, and a fully-connected layer with a nonlinear activation encodes nonlinear and cross-dependency information between heterogeneous-semantic elements (see the sketch after this list).
- A large-scale semi-simulated dataset, built from several open-source MOT datasets, is used to independently evaluate the motion estimation module.
- Experiments show SIKNet outperforms the traditional Kalman filter and achieves better robustness and accuracy than existing learning-aided filters.
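
The two-step encoder described above can be pictured with the PyTorch sketch below. The tensor layout (state vectors stacked as channels), the layer sizes, and the activation are assumptions made for illustration; the actual SIKNet architecture is in the linked repositories.

```python
import torch
import torch.nn as nn

class SemanticIndependentEncoder(nn.Module):
    """Sketch of the two-step SIE described in the abstract: a kernel-size-1
    Conv1d that mixes information across state vectors independently for each
    semantic element, followed by a fully-connected layer with a nonlinearity.
    Tensor layout and layer sizes are assumptions for illustration."""

    def __init__(self, num_vectors: int = 2, state_dim: int = 8,
                 conv_channels: int = 16, out_dim: int = 64):
        super().__init__()
        self.conv = nn.Conv1d(num_vectors, conv_channels, kernel_size=1)
        self.fc = nn.Linear(conv_channels * state_dim, out_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_vectors, state_dim); kernel size 1 keeps semantic
        # elements (positions, velocities, ...) from mixing at this stage.
        h = self.conv(x)
        return self.act(self.fc(h.flatten(1)))

enc = SemanticIndependentEncoder()
feat = enc(torch.randn(4, 2, 8))   # e.g. predicted state and measurement
print(feat.shape)                  # torch.Size([4, 64])
```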


OpenUrban3D: Annotation-Free Open-Vocabulary Semantic Segmentation of Large-Scale Urban Point Clouds
Authors:Chongyu Wang, Kunlei Jing, Jihua Zhu, Di Wang
Open-vocabulary semantic segmentation enables models to recognize and segment objects from arbitrary natural language descriptions, offering the flexibility to handle novel, fine-grained, or functionally defined categories beyond fixed label sets. While this capability is crucial for large-scale urban point clouds that support applications such as digital twins, smart city management, and urban analytics, it remains largely unexplored in this domain. The main obstacles are the frequent absence of high-quality, well-aligned multi-view imagery in large-scale urban point cloud datasets and the poor generalization of existing three-dimensional (3D) segmentation pipelines across diverse urban environments with substantial variation in geometry, scale, and appearance. To address these challenges, we present OpenUrban3D, the first 3D open-vocabulary semantic segmentation framework for large-scale urban scenes that operates without aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations. Our approach generates robust semantic features directly from raw point clouds through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, followed by distillation into a 3D backbone model. This design enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. Extensive experiments on large-scale urban benchmarks, including SensatUrban and SUM, show that OpenUrban3D achieves significant improvements in both segmentation accuracy and cross-scene generalization over existing methods, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.
Paper and project links
Summary
This paper presents OpenUrban3D, an open-vocabulary 3D semantic segmentation framework for large-scale urban scenes that requires no aligned multi-view images, no pre-trained point cloud segmentation network, and no manual annotations. It generates robust semantic features directly from raw point clouds via multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, then distills them into a 3D backbone. This enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors. Experiments on large-scale urban benchmarks show significant improvements in segmentation accuracy and cross-scene generalization, demonstrating its potential as a flexible and scalable solution for 3D urban scene understanding.
Key Takeaways
- Open-vocabulary semantic segmentation handles novel, fine-grained, or functionally defined categories beyond fixed label sets, a capability crucial for large-scale urban point clouds.
- Existing methods generalize poorly across diverse urban environments and are hindered by the frequent absence of high-quality, well-aligned multi-view imagery in large-scale urban point cloud datasets.
- OpenUrban3D is the first 3D open-vocabulary semantic segmentation framework for large-scale urban scenes that needs no aligned multi-view images, pre-trained point cloud segmentation networks, or manual annotations.
- Through multi-view, multi-granularity rendering, mask-level vision-language feature extraction, and sample-balanced fusion, OpenUrban3D generates robust semantic features directly from raw point clouds and distills them into a 3D backbone.
- The design enables zero-shot segmentation for arbitrary text queries while capturing both semantic richness and geometric priors (see the sketch after this list).
- On large-scale urban benchmarks such as SensatUrban and SUM, OpenUrban3D achieves significant improvements in segmentation accuracy and cross-scene generalization.
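
The zero-shot text-query step can be reduced to a cosine-similarity lookup between distilled per-point features and text embeddings, as in the hedged sketch below. The feature dimensionality, the random stand-in tensors, and the example query names are assumptions; the real framework obtains these features from its rendering-and-distillation pipeline.

```python
import torch
import torch.nn.functional as F

def label_points_by_text(point_feats: torch.Tensor,
                         text_feats: torch.Tensor) -> torch.Tensor:
    """Assign each point the text query with the highest cosine similarity.
    point_feats: (N, D) features distilled into the 3D backbone;
    text_feats:  (C, D) embeddings of the open-vocabulary prompts.
    Both are placeholders here; a real pipeline would use vision-language
    embeddings from a CLIP-style model."""
    p = F.normalize(point_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)
    return (p @ t.T).argmax(dim=-1)     # (N,) predicted class index per point

# Hypothetical usage with random stand-in features for three text queries.
points = torch.randn(1000, 512)
queries = torch.randn(3, 512)           # e.g. "building", "tree", "road"
labels = label_points_by_text(points, queries)
print(labels.shape, labels.unique())
```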



Similarity-based Outlier Detection for Noisy Object Re-Identification Using Beta Mixtures
Authors:Waqar Ahmad, Evan Murphy, Vladimir A. Krylov
Object re-identification (Re-ID) methods are highly sensitive to label noise, which typically leads to significant performance degradation. We address this challenge by reframing Re-ID as a supervised image similarity task and adopting a Siamese network architecture trained to capture discriminative pairwise relationships. Central to our approach is a novel statistical outlier detection (OD) framework, termed Beta-SOD (Beta mixture Similarity-based Outlier Detection), which models the distribution of cosine similarities between embedding pairs using a two-component Beta distribution mixture model. We establish a novel identifiability result for mixtures of two Beta distributions, ensuring that our learning task is well-posed. The proposed OD step complements the Re-ID architecture combining binary cross-entropy, contrastive, and cosine embedding losses that jointly optimize feature-level similarity learning. We demonstrate the effectiveness of Beta-SOD in de-noising and Re-ID tasks for person Re-ID, on CUHK03 and Market-1501 datasets, and vehicle Re-ID, on VeRi-776 dataset. Our method shows superior performance compared to the state-of-the-art methods across various noise levels (10-30%), demonstrating both robustness and broad applicability in noisy Re-ID scenarios. The implementation of Beta-SOD is available at: github.com/waqar3411/Beta-SOD
Paper and project links
Summary
This paper tackles the performance degradation that label noise causes in object re-identification (Re-ID). Re-ID is reframed as a supervised image-similarity task, and a Siamese network is trained to capture discriminative pairwise relationships. A novel statistical outlier detection framework, Beta-SOD, models the distribution of cosine similarities between embedding pairs with a two-component Beta mixture. The method effectively removes noisy labels and achieves superior performance on person and vehicle Re-ID datasets across noise levels of 10-30%, demonstrating robustness and broad applicability in noisy Re-ID scenarios.
Key Takeaways
- Object re-identification (Re-ID) is highly sensitive to label noise, which degrades performance.
- Re-ID is reframed as a supervised image-similarity task and trained with a Siamese network architecture.
- The Beta-SOD outlier detection framework models the distribution of cosine similarities with a two-component Beta mixture (see the sketch after this list).
- A novel identifiability result for mixtures of two Beta distributions ensures the learning task is well-posed.
- Beta-SOD complements the Re-ID architecture, which jointly optimizes binary cross-entropy, contrastive, and cosine embedding losses for feature-level similarity learning.
- Effectiveness is demonstrated for person Re-ID on CUHK03 and Market-1501 and for vehicle Re-ID on VeRi-776.
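
To make the Beta-mixture idea concrete, the snippet below fits a two-component Beta mixture to similarity scores with a crude EM loop that uses weighted method-of-moments updates in the M-step. It is only a sketch under the assumption that similarities lie in (0, 1); the paper's estimator, initialization, and thresholding may differ (see the linked repository).

```python
import numpy as np
from scipy.stats import beta

def fit_beta_mixture(s: np.ndarray, iters: int = 100, eps: float = 1e-6):
    """Crude EM for a two-component Beta mixture on scores in (0, 1),
    with moment-matched M-step updates. Illustrative stand-in only."""
    s = np.clip(s, eps, 1 - eps)
    params = [(2.0, 5.0), (5.0, 2.0)]      # one component low, one high
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities of each component for each score
        dens = np.stack([pi[k] * beta.pdf(s, *params[k]) for k in range(2)])
        resp = dens / (dens.sum(axis=0, keepdims=True) + eps)
        # M-step: mixing weights and moment-matched Beta parameters
        pi = resp.mean(axis=1)
        new_params = []
        for k in range(2):
            w = resp[k] / (resp[k].sum() + eps)
            m = np.sum(w * s)
            v = np.sum(w * (s - m) ** 2) + eps
            common = m * (1 - m) / v - 1
            new_params.append((max(m * common, eps), max((1 - m) * common, eps)))
        params = new_params
    return pi, params

# Hypothetical similarities: clean pairs near 0.8, noisy pairs near 0.25.
rng = np.random.default_rng(0)
sims = np.concatenate([rng.beta(8, 2, 900), rng.beta(2, 6, 100)])
print(fit_beta_mixture(sims))
```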




DAOcc: 3D Object Detection Assisted Multi-Sensor Fusion for 3D Occupancy Prediction
Authors:Zhen Yang, Yanpeng Dong, Jiayu Wang, Heng Wang, Lichao Ma, Zijian Cui, Qi Liu, Haoran Pei, Kexin Zhang, Chao Zhang
Multi-sensor fusion significantly enhances the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics. However, most existing approaches depend on high-resolution images and complex networks to achieve top performance, hindering their deployment in practical scenarios. Moreover, current multi-sensor fusion approaches mainly focus on improving feature fusion while largely neglecting effective supervision strategies for those features. To address these issues, we propose DAOcc, a novel multi-modal occupancy prediction framework that leverages 3D object detection supervision to assist in achieving superior performance, while using a deployment-friendly image backbone and practical input resolution. In addition, we introduce a BEV View Range Extension strategy to mitigate performance degradation caused by lower image resolution. Extensive experiments demonstrate that DAOcc achieves new state-of-the-art results on both the Occ3D-nuScenes and Occ3D-Waymo benchmarks, and outperforms previous state-of-the-art methods by a significant margin using only a ResNet-50 backbone and 256*704 input resolution. With TensorRT optimization, DAOcc reaches 104.9 FPS while maintaining 54.2 mIoU on an NVIDIA RTX 4090 GPU. Code is available at https://github.com/AlphaPlusTT/DAOcc.
Paper and project links
PDF TCSVT Accepted version (not the final published version)
Summary
Multi-sensor fusion improves the accuracy and robustness of 3D semantic occupancy prediction, which is crucial for autonomous driving and robotics, but existing methods rely on high-resolution images and complex networks, limiting practical deployment. DAOcc is a multi-modal occupancy prediction framework that uses 3D object detection supervision to achieve superior performance with a deployment-friendly image backbone and a practical input resolution. A BEV View Range Extension strategy mitigates the performance degradation caused by lower image resolution. DAOcc sets new state-of-the-art results on the Occ3D-nuScenes and Occ3D-Waymo benchmarks using only a ResNet-50 backbone and 256*704 input resolution, and with TensorRT optimization reaches 104.9 FPS while maintaining 54.2 mIoU on an NVIDIA RTX 4090 GPU.
Key Takeaways
- Multi-sensor fusion is essential for accurate and robust 3D semantic occupancy prediction.
- Existing methods depend on high-resolution images and complex networks, which limits practical deployment.
- DAOcc uses 3D object detection supervision to reach high performance while remaining deployment-friendly.
- A BEV View Range Extension strategy mitigates the performance drop caused by lower image resolution.
- DAOcc surpasses previous state-of-the-art results on the Occ3D-nuScenes and Occ3D-Waymo benchmarks.
- DAOcc performs strongly with only a ResNet-50 backbone and a low input resolution; the mIoU metric used to report these results is sketched after this list.
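
Occupancy results above are reported as mIoU over semantic voxel grids. As a reference, here is a hedged sketch of how such a metric is commonly computed; the grid shape, class count, and ignore value are assumptions and are not tied to the Occ3D evaluation protocol.

```python
import numpy as np

def occupancy_miou(pred: np.ndarray, gt: np.ndarray, num_classes: int,
                   ignore_index: int = 255) -> float:
    """Mean IoU over semantic classes of a voxel grid. `ignore_index` marks
    unobserved voxels; class ids and the ignore value are assumptions."""
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        p, g = (pred == c) & valid, (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent from both: skip
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# Hypothetical 3D occupancy grids with 16 semantic classes.
rng = np.random.default_rng(0)
gt = rng.integers(0, 16, size=(200, 200, 16))
pred = gt.copy()
pred[rng.random(pred.shape) < 0.1] = 0    # corrupt ~10% of voxels to class 0
print(round(occupancy_miou(pred, gt, 16), 3))
```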





