Published: 2025-10-10

Updated: 2025-11-27

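⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading the papers.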

2025-10-10 Update

Semantic Segmentation Algorithm Based on Light Field and LiDAR Fusion

Authors:Jie Luo, Yuxuan Jiang, Xin Jin, Mingyu Liu, Yihui Fan

Semantic segmentation serves as a cornerstone of scene understanding in autonomous driving but continues to face significant challenges under complex conditions such as occlusion. Light field and LiDAR modalities provide complementary visual and spatial cues that are beneficial for robust perception; however, their effective integration is hindered by limited viewpoint diversity and inherent modality discrepancies. To address these challenges, the first multimodal semantic segmentation dataset integrating light field data and point cloud data is proposed. Based on this dataset, we propose a multi-modal light field point-cloud fusion segmentation network (Mlpfseg), incorporating feature completion and depth perception to segment both camera images and LiDAR point clouds simultaneously. The feature completion module addresses the density mismatch between point clouds and image pixels by performing differential reconstruction of point-cloud feature maps, enhancing the fusion of these modalities. The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness. Our method outperforms image-only segmentation by 1.71 Mean Intersection over Union (mIoU) and point cloud-only segmentation by 2.38 mIoU, demonstrating its effectiveness.

Paper and project links

PDF

Summary

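This paper addresses semantic segmentation for autonomous driving and proposes Mlpfseg, a multi-modal light field point-cloud fusion segmentation network. The network combines light field data with point cloud data and, through its feature completion and depth perception modules, improves the segmentation of occluded objects. In the reported experiments, the method outperforms image-only segmentation by 1.71 mIoU (Mean Intersection over Union) and point cloud-only segmentation by 2.38 mIoU.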

Key Takeaways

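Semantic segmentation plays a key role in scene understanding for autonomous driving but still struggles under complex conditions such as occlusion.
Light field and LiDAR modalities provide complementary visual and spatial cues that benefit robust perception.
The paper proposes the first multimodal semantic segmentation dataset that integrates light field data and point cloud data, providing a basis for tackling limited viewpoint diversity and modality discrepancies.
The Mlpfseg network segments camera images and LiDAR point clouds simultaneously.
The feature completion module addresses the density mismatch between point clouds and image pixels by differentially reconstructing point-cloud feature maps, which strengthens the fusion of the two modalities.
The depth perception module improves the segmentation of occluded objects by reinforcing attention scores for better occlusion awareness.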
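
The gains above are reported in mIoU. As a reminder of what that metric measures, here is a minimal sketch in plain Python, assuming flat lists of integer class labels. The function name `mean_iou` and the toy labels are illustrative and are not taken from the paper, whose evaluation protocol may differ in details such as ignored classes.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union over classes that occur in pred or target."""
    inter = [0] * num_classes  # per-class intersection counts
    union = [0] * num_classes  # per-class union counts
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            # A mismatch enlarges the union of both the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


# Toy example with three classes (hypothetical labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
miou = mean_iou(pred, target, num_classes=3)
```

In this toy case the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.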

ALISE: Annotation-Free LiDAR Instance Segmentation for Autonomous Driving

Authors:Yongxuan Lyu, Guangfeng Jiang, Hongsi Liu, Jun Liu

The manual annotation of outdoor LiDAR point clouds for instance segmentation is extremely costly and time-consuming. Current methods attempt to reduce this burden but still rely on some form of human labeling. To completely eliminate this dependency, we introduce ALISE, a novel framework that performs LiDAR instance segmentation without any annotations. The central challenge is to generate high-quality pseudo-labels in a fully unsupervised manner. Our approach starts by employing Vision Foundation Models (VFMs), guided by text and images, to produce initial pseudo-labels. We then refine these labels through a dedicated spatio-temporal voting module, which combines 2D and 3D semantics for both offline and online optimization. To achieve superior feature learning, we further introduce two forms of semantic supervision: a set of 2D prior-based losses that inject visual knowledge into the 3D network, and a novel prototype-based contrastive loss that builds a discriminative feature space by exploiting 3D semantic consistency. This comprehensive design results in significant performance gains, establishing a new state-of-the-art for unsupervised 3D instance segmentation. Remarkably, our approach even outperforms MWSIS, a method that operates with supervision from ground-truth (GT) 2D bounding boxes by a margin of 2.53% in mAP (50.95% vs. 48.42%).

Paper and project links

PDF

Summary
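Annotating LiDAR point clouds for instance segmentation is costly and time-consuming. The authors introduce ALISE, a framework that performs LiDAR instance segmentation without any human annotation. It generates initial pseudo-labels with Vision Foundation Models guided by text and images, refines them with a spatio-temporal voting module that combines 2D and 3D semantics for offline and online optimization, and adds two forms of semantic supervision (2D prior-based losses and a prototype-based contrastive loss) to improve feature learning. The result is a new state of the art for unsupervised 3D instance segmentation. ALISE even surpasses MWSIS, which relies on ground-truth 2D bounding boxes, by 2.53% mAP (50.95% vs. 48.42%).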

Key Takeaways

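Annotating outdoor LiDAR point clouds for instance segmentation is costly and time-consuming.
Existing methods try to reduce the reliance on manual annotation but still require some form of labeling.
The ALISE framework performs LiDAR instance segmentation without any annotations.
Vision Foundation Models generate the initial pseudo-labels, which a spatio-temporal voting module then refines.
Offline and online optimization is combined with two forms of semantic supervision to improve feature learning.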
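
The abstract mentions a prototype-based contrastive loss but does not give its form. The sketch below shows one common formulation, under stated assumptions: each prototype is the normalized mean feature of a pseudo-label class, and each point feature is pulled toward its own prototype with a softmax over cosine similarities. The function names and the `temperature` parameter are hypothetical, and this should not be read as ALISE's actual loss.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def class_prototypes(features, labels):
    """Assumed prototype definition: normalized mean feature per pseudo-label."""
    sums, counts = {}, {}
    for feat, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(feat))
        for i, x in enumerate(feat):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {y: normalize([x / counts[y] for x in s]) for y, s in sums.items()}


def prototype_contrastive_loss(features, labels, temperature=0.1):
    """Cross-entropy over cosine similarities between features and prototypes."""
    protos = class_prototypes(features, labels)
    total = 0.0
    for feat, label in zip(features, labels):
        f = normalize(feat)
        logits = {c: sum(a * b for a, b in zip(f, p)) / temperature
                  for c, p in protos.items()}
        peak = max(logits.values())  # subtract the max for numerical stability
        log_denom = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += log_denom - logits[label]
    return total / len(features)


# Toy features: two clusters; consistent pseudo-labels should give a lower loss.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_consistent = prototype_contrastive_loss(features, [0, 0, 1, 1])
loss_scrambled = prototype_contrastive_loss(features, [0, 1, 0, 1])
```

With scrambled labels the two prototypes coincide, so the loss sits at about ln 2. With labels that follow the clusters it is close to zero.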

Incremental Object Detection with Prompt-based Methods

Authors:Matthias Neuwirth-Trapp, Maarten Bieshaar, Danda Pani Paudel, Luc Van Gool

Visual prompt-based methods have seen growing interest in incremental learning (IL) for image classification. These approaches learn additional embedding vectors while keeping the model frozen, making them efficient to train. However, no prior work has applied such methods to incremental object detection (IOD), leaving their generalizability unclear. In this paper, we analyze three different prompt-based methods under a complex domain-incremental learning setting. We additionally provide a wide range of reference baselines for comparison. Empirically, we show that the prompt-based approaches we tested underperform in this setting. However, a strong yet practical method, combining visual prompts with replaying a small portion of previous data, achieves the best results. Together with additional experiments on prompt length and initialization, our findings offer valuable insights for advancing prompt-based IL in IOD.

Paper and project links

PDF Accepted to ICCV Workshops 2025: v2 update affiliation

Summary

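Visual prompt-based methods are attracting growing interest in incremental learning (IL) for image classification. They learn additional embedding vectors while keeping the model frozen, which makes training efficient. No prior work, however, has applied them to incremental object detection (IOD), so their generalizability is unclear. This paper analyzes three prompt-based methods in a complex domain-incremental learning setting and provides a wide range of reference baselines for comparison. Empirically, the prompt-based approaches tested underperform in this setting, whereas a strong yet practical method that combines visual prompts with replaying a small portion of previous data achieves the best results. Additional experiments on prompt length and initialization offer further insight for advancing prompt-based IL in IOD.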

Key Takeaways

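Visual prompt-based methods are gaining attention in incremental learning (IL), especially for image classification.
These methods train efficiently because they learn additional embedding vectors while keeping the model frozen.
No prior work has applied visual prompt-based methods to incremental object detection (IOD).
In a complex domain-incremental learning setting, the three prompt-based methods tested underperform.
A practical method that combines visual prompts with replay of a small portion of previous data achieves the best results.
Additional experiments on prompt length and initialization offer further insight into prompt-based IL for IOD.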
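
The best-performing recipe replays a small portion of previous data. The abstract does not say how that portion is chosen or mixed, so the following is only a sketch under assumptions: a fixed-capacity buffer filled by reservoir sampling, and training batches that append a fraction of replayed samples. `ReplayBuffer`, `mixed_batch`, and `replay_fraction` are hypothetical names, and the prompt-learning part is omitted entirely.

```python
import random


class ReplayBuffer:
    """Fixed-size memory of past samples, filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling keeps a uniform subset of everything seen so far.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            slot = self.rng.randrange(self.seen)
            if slot < self.capacity:
                self.items[slot] = sample

    def mixed_batch(self, new_samples, replay_fraction=0.25):
        """Return the new samples plus a few replayed ones from earlier domains."""
        k = min(len(self.items), int(len(new_samples) * replay_fraction))
        return list(new_samples) + self.rng.sample(self.items, k)


# Usage: store samples from an earlier domain, then train on a new domain.
buffer = ReplayBuffer(capacity=10)
for i in range(100):
    buffer.add(("domain_a", i))
batch = buffer.mixed_batch([("domain_b", i) for i in range(8)])
```

Here each batch of eight new-domain samples is extended with two replayed samples from the earlier domain.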

SurfDist: Interpretable Three-Dimensional Instance Segmentation Using Curved Surface Patches

Authors:Jackson Borchardt, Saul Kato

We present SurfDist, a convolutional neural network architecture for three-dimensional volumetric instance segmentation. SurfDist enables prediction of instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist is a modification of the popular model architecture StarDist-3D which breaks StarDist-3D’s coupling of instance parameterization dimension and instance voxel resolution, and it produces predictions which may be upsampled to arbitrarily high resolutions without introduction of voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. We detail SurfDist’s technical implementation and show one synthetic and one real-world dataset for which it outperforms StarDist-3D. These results demonstrate that interpretable instance surface models can be learned effectively alongside instance membership.

Paper and project links

PDF 8 pages, 6 figures

Summary

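SurfDist is a convolutional neural network architecture for three-dimensional volumetric instance segmentation. It predicts instances represented as closed surfaces composed of smooth parametric surface patches, specifically bicubic Bézier triangles. SurfDist modifies the popular StarDist-3D architecture by breaking the coupling between instance parameterization dimension and instance voxel resolution, and its predictions can be upsampled to arbitrarily high resolutions without introducing voxelization artifacts. For datasets with blob-shaped instances, common in biomedical imaging, SurfDist can outperform StarDist-3D with more compact instance parameterizations. The authors detail the technical implementation and show one synthetic and one real-world dataset on which SurfDist outperforms StarDist-3D.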

Key Takeaways

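SurfDist is a convolutional neural network architecture for three-dimensional volumetric instance segmentation.
SurfDist predicts instances as closed surfaces composed of smooth parametric surface patches.
SurfDist modifies StarDist-3D and breaks the coupling between instance parameterization dimension and instance voxel resolution.
SurfDist predictions can be upsampled to arbitrarily high resolutions without introducing voxelization artifacts.
On datasets with blob-shaped instances, SurfDist can outperform StarDist-3D.
The paper details SurfDist's technical implementation.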
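
To make the surface-patch idea concrete, the sketch below evaluates a point on a Bézier triangle from its control points using Bernstein polynomials in barycentric coordinates. It assumes a degree-3 triangle with 10 control points. How SurfDist predicts and arranges its control points is not described in this digest, so the `control` layout and function name are illustrative only.

```python
from math import factorial


def bezier_triangle_point(control, u, v, degree=3):
    """Evaluate a Bezier triangle at barycentric coordinates (u, v, 1 - u - v).

    control maps index triples (i, j, k) with i + j + k == degree to 3D points.
    """
    w = 1.0 - u - v
    point = [0.0, 0.0, 0.0]
    for (i, j, k), ctrl in control.items():
        # Bernstein basis: degree! / (i! j! k!) * u^i * v^j * w^k
        coeff = factorial(degree) // (factorial(i) * factorial(j) * factorial(k))
        weight = coeff * (u ** i) * (v ** j) * (w ** k)
        for axis in range(3):
            point[axis] += weight * ctrl[axis]
    return point


# A flat patch whose control points lie evenly on the triangle
# (1, 0, 0), (0, 1, 0), (0, 0, 0); hypothetical data for illustration.
control = {(i, j, 3 - i - j): (i / 3, j / 3, 0.0)
           for i in range(4) for j in range(4 - i)}
center = bezier_triangle_point(control, 1 / 3, 1 / 3)
corner = bezier_triangle_point(control, 1.0, 0.0)
```

Because the patch is a continuous function of (u, v), it can be sampled as densely as needed, which is the property behind upsampling without voxelization artifacts.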

Fully Spiking Neural Networks for Unified Frame-Event Object Tracking

Authors:Jingjun Yang, Liangwei Fan, Jinpu Zhang, Xiangkai Lian, Hui Shen, Dewen Hu

The integration of image and event streams offers a promising approach for achieving robust visual object tracking in complex environments. However, current fusion methods achieve high performance at the cost of significant computational overhead and struggle to efficiently extract the sparse, asynchronous information from event streams, failing to leverage the energy-efficient advantages of event-driven spiking paradigms. To address this challenge, we propose the first fully Spiking Frame-Event Tracking framework called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm, effectively fusing frame and event data. To overcome the degradation of translation invariance caused by convolutional padding, we introduce a Random Patchwork Module (RPM) that eliminates positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. Furthermore, we propose a Spatial-Temporal Regularization (STR) strategy that overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space. Extensive experiments across multiple benchmarks demonstrate that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption, attaining an optimal balance between performance and efficiency.

Paper and project links

PDF Accepted by NeurIPS2025

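Summary

Fusing image and event streams is a promising route to robust visual object tracking in complex environments. Current fusion methods achieve high performance at a significant computational cost, struggle to extract sparse, asynchronous information from event streams efficiently, and fail to exploit the energy efficiency of event-driven spiking paradigms. To address this, the authors propose SpikeFET, the first fully spiking frame-event tracking framework, which combines convolutional local feature extraction with Transformer-based global modeling inside the spiking paradigm. To counter the loss of translation invariance caused by convolutional padding, a Random Patchwork Module (RPM) removes positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures. A Spatial-Temporal Regularization (STR) strategy additionally enforces spatio-temporal consistency among temporal template features in latent space, overcoming the similarity metric degradation caused by asymmetric features. Experiments on multiple benchmarks show higher tracking accuracy than existing methods at significantly lower power consumption.

Key Takeaways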

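Fusing image and event streams is promising for robust visual object tracking in complex environments.
Current fusion methods carry a large computational overhead and struggle to process sparse, asynchronous event-stream information efficiently.
SpikeFET combines convolutional local feature extraction with Transformer-based global modeling and effectively fuses frame and event data.
The Random Patchwork Module (RPM) removes positional bias through randomized spatial reorganization and learnable type encoding while preserving residual structures.
The Spatial-Temporal Regularization (STR) strategy overcomes similarity metric degradation from asymmetric features by enforcing spatio-temporal consistency among temporal template features in latent space.
The proposed framework achieves superior tracking accuracy on multiple benchmarks while significantly reducing power consumption.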
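
For readers unfamiliar with the spiking paradigm, the sketch below simulates a single leaky integrate-and-fire (LIF) neuron, a widely used spiking unit. It is an assumption that a LIF-style neuron is a fair mental model here: the abstract does not state which neuron model SpikeFET uses, and the `threshold` and `decay` values are arbitrary.

```python
def lif_neuron(inputs, threshold=1.0, decay=0.5):
    """Simulate one leaky integrate-and-fire neuron over discrete time steps.

    Returns a list of binary spikes, one per input current.
    """
    membrane = 0.0
    spikes = []
    for current in inputs:
        # Leak part of the previous potential, then integrate the new input.
        membrane = decay * membrane + current
        if membrane >= threshold:
            spikes.append(1)
            membrane = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


spikes = lif_neuron([0.6, 0.6, 0.0, 1.2, 0.1])
```

The neuron stays silent while its potential leaks below the threshold and fires once, at the fourth step. Activity is therefore sparse and binary, which is where the energy savings of spiking networks are usually expected to come from.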

Roboflow100-VL: A Multi-Domain Object Detection Benchmark for Vision-Language Models

Authors:Peter Robicheaux, Matvei Popov, Anish Madan, Isaac Robinson, Joseph Nelson, Deva Ramanan, Neehar Peri

Vision-language models (VLMs) trained on internet-scale data achieve remarkable zero-shot detection performance on common objects like car, truck, and pedestrian. However, state-of-the-art models still struggle to generalize to out-of-distribution classes, tasks and imaging modalities not typically found in their pre-training. Rather than simply re-training VLMs on more visual data, we argue that one should align VLMs to new concepts with annotation instructions containing a few visual examples and rich textual descriptions. To this end, we introduce Roboflow100-VL, a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training. We evaluate state-of-the-art models on our benchmark in zero-shot, few-shot, semi-supervised, and fully-supervised settings, allowing for comparison across data regimes. Notably, we find that VLMs like GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, demonstrating the need for few-shot concept alignment. Lastly, we discuss our recent CVPR 2025 Foundational FSOD competition and share insights from the community. Notably, the winning team significantly outperforms our baseline by 17 mAP! Our code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.

Paper and project links

PDF The first two authors contributed equally. This work has been accepted to the Neural Information Processing Systems (NeurIPS) 2025 Datasets & Benchmark Track. Project Page: https://rf100-vl.org/

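Summary

Vision-language models (VLMs) trained on internet-scale data achieve strong zero-shot detection on common objects, but they generalize poorly to out-of-distribution classes, tasks, and imaging modalities. Rather than simply retraining on more visual data, the authors argue for aligning VLMs to new concepts with annotation instructions that contain a few visual examples and rich textual descriptions. To evaluate this, they introduce Roboflow100-VL, a collection of 100 multi-modal object detection datasets covering concepts rarely found in VLM pre-training, and benchmark state-of-the-art models in zero-shot, few-shot, semi-supervised, and fully-supervised settings. They also discuss the CVPR 2025 Foundational FSOD competition and release the code and dataset.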

Key Takeaways

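VLMs perform well at zero-shot detection on common objects but generalize poorly to out-of-distribution classes, tasks, and imaging modalities.
Roboflow100-VL is a large-scale collection of 100 multi-modal object detection datasets with diverse concepts not commonly found in VLM pre-training.
The authors argue that simply retraining VLMs on more visual data is not the right way to improve generalization to new concepts.
Instead, VLMs should be aligned to new concepts with annotation instructions containing a few visual examples and rich textual descriptions.
Evaluating models in zero-shot, few-shot, semi-supervised, and fully-supervised settings allows comparison across data regimes.
VLMs such as GroundingDINO and Qwen2.5-VL achieve less than 2% zero-shot accuracy on challenging medical imaging datasets within Roboflow100-VL, which demonstrates the need for few-shot concept alignment.
In the CVPR 2025 Foundational FSOD competition, the winning team outperformed the authors' baseline by 17 mAP.
Code and dataset are available at https://github.com/roboflow/rf100-vl and https://universe.roboflow.com/rf100-vl/.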
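
The figures above are detection metrics such as mAP, which rest on matching predicted boxes to ground-truth boxes by IoU. The sketch below shows that matching step only, assuming boxes in `(x1, y1, x2, y2)` format and a single IoU threshold of 0.5. The benchmark's exact evaluation protocol is not given in this digest, and the function names are illustrative.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_true_positives(preds, gts, iou_thr=0.5):
    """Greedy matching in score order; each ground-truth box is matched once."""
    matched = set()
    true_positives = 0
    for box, _score in sorted(preds, key=lambda p: -p[1]):
        best_idx, best_iou = None, iou_thr
        for idx, gt in enumerate(gts):
            if idx in matched:
                continue
            iou = box_iou(box, gt)
            if iou >= best_iou:
                best_idx, best_iou = idx, iou
        if best_idx is not None:
            matched.add(best_idx)
            true_positives += 1
    return true_positives


# Hypothetical predictions as (box, score) pairs and ground-truth boxes.
preds = [((0, 0, 2, 2), 0.9), ((0, 0, 2, 2.2), 0.8), ((5, 5, 6, 6), 0.7)]
gts = [(0, 0, 2, 2), (10, 10, 12, 12)]
tp = count_true_positives(preds, gts)
```

Only the highest-scoring prediction counts as a true positive here. The second is a duplicate of an already matched object and the third overlaps nothing.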

Leveraging Confident Image Regions for Source-Free Domain-Adaptive Object Detection

Authors:Mohamed Lamine Mekhalfi, Davide Boscaini, Fabio Poiesi

Source-free domain-adaptive object detection is an interesting but scarcely addressed topic. It aims at adapting a source-pretrained detector to a distinct target domain without resorting to source data during adaptation. So far, there is no data augmentation scheme tailored to source-free domain-adaptive object detection. To this end, this paper presents a novel data augmentation approach that cuts out target image regions where the detector is confident, augments them along with their respective pseudo-labels, and joins them into a challenging target image to adapt the detector. As the source data is out of reach during adaptation, we implement our approach within a teacher-student learning paradigm to ensure that the model does not collapse during the adaptation procedure. We evaluated our approach on three adaptation benchmarks of traffic scenes, scoring new state-of-the-art on two of them.

Paper and project links

PDF

Summary
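Source-free domain-adaptive object detection adapts a source-pretrained detector to a new target domain without access to source data during adaptation. This paper proposes a data augmentation method that cuts out target image regions where the detector is confident, augments them together with their pseudo-labels, and joins them into a challenging target image used for adaptation. A teacher-student learning paradigm keeps the model from collapsing during adaptation. Evaluated on three traffic-scene adaptation benchmarks, the method sets a new state of the art on two of them.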
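
The cut, augment, and join idea can be sketched as follows. This is a toy version under stated assumptions: images are nested lists of pixel values, the only augmentation is a horizontal flip, and crops are placed side by side on a blank canvas. The abstract does not specify the actual augmentations, composition layout, or confidence threshold, and the teacher-student update is omitted. All names are hypothetical.

```python
def crop(image, box):
    """Cut the region (x1, y1, x2, y2) out of a row-major 2D image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def hflip(patch):
    """Horizontal flip, standing in for the augmentation step."""
    return [row[::-1] for row in patch]


def compose_confident_regions(image, detections, score_thr=0.8, canvas_hw=(8, 8)):
    """Paste augmented confident crops side by side and carry their pseudo-labels."""
    height, width = canvas_hw
    canvas = [[0] * width for _ in range(height)]
    pseudo_labels = []
    cursor_x = 0
    for box, label, score in detections:
        if score < score_thr:
            continue  # keep only regions where the detector is confident
        patch = hflip(crop(image, box))
        patch_h, patch_w = len(patch), len(patch[0])
        if patch_h > height or cursor_x + patch_w > width:
            break  # no room left on the canvas
        for dy in range(patch_h):
            canvas[dy][cursor_x:cursor_x + patch_w] = patch[dy]
        pseudo_labels.append(((cursor_x, 0, cursor_x + patch_w, patch_h), label))
        cursor_x += patch_w
    return canvas, pseudo_labels


# Toy 4x6 image and hypothetical detections as (box, label, score) triples.
image = [[r * 10 + c for c in range(6)] for r in range(4)]
detections = [((0, 0, 2, 2), "car", 0.95),
              ((2, 1, 5, 3), "person", 0.40),
              ((3, 0, 6, 3), "bus", 0.90)]
canvas, pseudo_labels = compose_confident_regions(image, detections)
```

The low-confidence "person" detection is dropped, while the "car" and "bus" crops are flipped, pasted, and given boxes in the coordinates of the new composite image.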

Key Takeaways