
Detection / Segmentation / Tracking


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: do not rely on these summaries in serious academic settings; use them only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-09-28

SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images

Authors:Qinfeng Zhu, Han Li, Liang He, Lei Fan

Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model’s perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.
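
SwinMamba's localized scanning can be pictured with a small sketch. The following minimal NumPy example is not code from the paper (window size and shift are illustrative assumptions); it simply enumerates the token orders that overlapping shifted windows would feed to a Mamba-style scan: one pass over an aligned window grid and a second pass over a grid shifted by half a window, so neighbouring regions overlap across the two passes.

```python
import numpy as np

def window_scan_orders(H, W, win=8, shift=4):
    """Enumerate local scan orders over an H x W feature map.

    Each window is flattened row-major into a 1-D index sequence (the order in
    which a Mamba-style state-space scan would consume its tokens). Two passes
    are produced: one on the regular window grid and one shifted by `shift`
    pixels, so windows from the two passes overlap and exchange information.
    """
    idx = np.arange(H * W).reshape(H, W)
    orders = []
    for off in (0, shift):              # pass 1: aligned grid, pass 2: shifted grid
        for y0 in range(-off, H, win):
            for x0 in range(-off, W, win):
                block = idx[max(y0, 0):y0 + win, max(x0, 0):x0 + win]
                if block.size:
                    orders.append(block.ravel())
    return orders

if __name__ == "__main__":
    seqs = window_scan_orders(16, 16)
    print(len(seqs), "local scan sequences; first sequence starts with", seqs[0][:8])
```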


Paper and project links

PDF

Summary

This paper addresses the challenges of semantic segmentation for remote sensing imagery. To remedy the tendency of existing models such as Vision Mamba to overlook local features, the authors propose SwinMamba, which combines localized scanning with a global receptive field so that the model captures fine local details while fusing global contextual information, strengthening its perception of both local and global features. Experiments on the LoveDA and ISPRS Potsdam datasets show that SwinMamba performs strongly and is a promising solution for semantic segmentation of remote sensing imagery.

Key Takeaways

  1. Semantic segmentation of remote sensing imagery is a fundamental computer vision task, widely used in land-use classification, urban planning, and environmental monitoring.
  2. Existing models such as Vision Mamba tend to overlook local features, such as textures and edges, when processing remote sensing imagery.
  3. SwinMamba addresses this by combining local scanning with a global receptive field, capturing local detail while fusing global information.
  4. The first two stages of SwinMamba perform local scanning to capture fine-grained detail, while the last two stages use global scanning to fuse broader contextual information.
  5. SwinMamba uses overlapping shifted windows to strengthen inter-region information exchange and enable robust feature integration across the whole image.
  6. Experiments on the LoveDA and ISPRS Potsdam datasets show that SwinMamba outperforms existing methods.

Cool Papers

Click here to view paper screenshots

Neptune-X: Active X-to-Maritime Generation for Universal Maritime Object Detection

Authors:Yu Guo, Shengfeng He, Yuxu Lu, Haonan An, Yihang Tao, Huilin Zhu, Jingxian Liu, Yuguang Fang

Maritime object detection is essential for navigation safety, surveillance, and autonomous operations, yet constrained by two key challenges: the scarcity of annotated maritime data and poor generalization across various maritime attributes (e.g., object category, viewpoint, location, and imaging environment). In particular, models trained on existing datasets often underperform in underrepresented scenarios such as open-sea environments. To address these challenges, we propose Neptune-X, a data-centric generative-selection framework that enhances training effectiveness by leveraging synthetic data generation with task-aware sample selection. From the generation perspective, we develop X-to-Maritime, a multi-modality-conditioned generative model that synthesizes diverse and realistic maritime scenes. A key component is the Bidirectional Object-Water Attention module, which captures boundary interactions between objects and their aquatic surroundings to improve visual fidelity. To further improve downstream task performance, we propose Attribute-correlated Active Sampling, which dynamically selects synthetic samples based on their task relevance. To support robust benchmarking, we construct the Maritime Generation Dataset, the first dataset tailored for generative maritime learning, encompassing a wide range of semantic conditions. Extensive experiments demonstrate that our approach sets a new benchmark in maritime scene synthesis, significantly improving detection accuracy, particularly in challenging and previously underrepresented settings. The code is available at https://github.com/gy65896/Neptune-X.
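
The abstract only names Attribute-correlated Active Sampling, so the snippet below is a generic, heavily simplified illustration of task-aware sample selection rather than the paper's algorithm: each synthetic sample is scored by the detector's validation error on its attributes (all attribute names and error values here are hypothetical), and the highest-scoring samples are kept for training.

```python
import numpy as np

# Hypothetical attribute vocabulary; per-attribute detector error rates would
# normally be measured on a validation split (the values here are made up).
attr_error = {"open_sea": 0.42, "harbor": 0.15, "night": 0.38,
              "buoy": 0.30, "cargo_ship": 0.12}

def relevance(sample_attrs):
    """Task relevance = mean validation error over the sample's attributes."""
    errs = [attr_error.get(a, 0.0) for a in sample_attrs]
    return float(np.mean(errs)) if errs else 0.0

def select_synthetic(samples, k):
    """Keep the k generated samples whose attributes target the weakest cases."""
    scored = sorted(samples, key=lambda s: relevance(s["attrs"]), reverse=True)
    return scored[:k]

pool = [{"id": 0, "attrs": ["harbor", "cargo_ship"]},
        {"id": 1, "attrs": ["open_sea", "buoy", "night"]},
        {"id": 2, "attrs": ["open_sea", "cargo_ship"]}]
print([s["id"] for s in select_synthetic(pool, 2)])   # -> [1, 2]
```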


Paper and project links

PDF

Summary

To address the challenges of maritime object detection, namely the scarcity of annotated maritime data and poor generalization across maritime attributes, the authors propose Neptune-X, a data-centric generation-and-selection framework that improves training effectiveness by combining synthetic data generation with task-aware sample selection. A multi-modality-conditioned generative model, X-to-Maritime, is developed, with a Bidirectional Object-Water Attention module to improve visual fidelity. Attribute-correlated Active Sampling dynamically selects synthetic samples according to task relevance. The Maritime Generation Dataset, the first dataset tailored for generative maritime learning, is built to support robust benchmarking. Experiments show the approach sets a new benchmark in maritime scene synthesis and markedly improves detection accuracy, especially in challenging, previously underrepresented scenarios. The code is publicly available at https://github.com/gy65896/Neptune-X.

Key Takeaways

  1. The two key challenges of maritime object detection are data scarcity and poor generalization across attributes.
  2. The Neptune-X framework improves training effectiveness through synthetic data generation and task-aware sample selection.
  3. The X-to-Maritime generative model synthesizes diverse and realistic maritime scenes under multi-modality conditioning.
  4. The Bidirectional Object-Water Attention module improves visual fidelity.
  5. Attribute-correlated Active Sampling selects synthetic samples according to task relevance.
  6. The Maritime Generation Dataset is constructed specifically for generative maritime learning.

Cool Papers

Click here to view paper screenshots

Boosting LiDAR-Based Localization with Semantic Insight: Camera Projection versus Direct LiDAR Segmentation

Authors:Sven Ochs, Philip Schörner, Marc René Zofka, J. Marius Zöllner

Semantic segmentation of LiDAR data presents considerable challenges, particularly when dealing with diverse sensor types and configurations. However, incorporating semantic information can significantly enhance the accuracy and robustness of LiDAR-based localization techniques for autonomous mobile systems. We propose an approach that integrates semantic camera data with LiDAR segmentation to address this challenge. By projecting LiDAR points into the semantic segmentation space of the camera, our method enhances the precision and reliability of the LiDAR-based localization pipeline. For validation, we utilize the CoCar NextGen platform from the FZI Research Center for Information Technology, which offers diverse sensor modalities and configurations. The sensor setup of CoCar NextGen enables a thorough analysis of different sensor types. Our evaluation leverages the state-of-the-art Depth-Anything network for camera image segmentation and an adaptive segmentation network for LiDAR segmentation. To establish a reliable ground truth for LiDAR-based localization, we make use of a Global Navigation Satellite System (GNSS) solution with Real-Time Kinematic corrections (RTK). Additionally, we conduct an extensive 55 km drive through the city of Karlsruhe, Germany, covering a variety of environments, including urban areas, multi-lane roads, and rural highways. This multimodal approach paves the way for more reliable and precise autonomous navigation systems, particularly in complex real-world environments.
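
The projection step described in the abstract follows the standard pinhole camera model. Below is a minimal NumPy sketch under assumed calibration (the extrinsic matrix, intrinsics, and label map are placeholders, not CoCar NextGen values): LiDAR points are transformed into the camera frame, projected with the intrinsics, and assigned the semantic class of the pixel they land on.

```python
import numpy as np

def label_lidar_points(points_lidar, T_cam_lidar, K, seg_mask):
    """Attach a semantic class from a camera segmentation mask to each LiDAR point.

    points_lidar : (N, 3) xyz in the LiDAR frame
    T_cam_lidar  : (4, 4) extrinsic transform LiDAR -> camera
    K            : (3, 3) pinhole intrinsics
    seg_mask     : (H, W) integer class map from the camera segmentation network
    Returns an (N,) array of class ids, -1 where the point is not visible.
    """
    N = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((N, 1))])       # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]                # points in the camera frame
    labels = np.full(N, -1, dtype=int)
    in_front = pts_cam[:, 2] > 0.1                            # keep points ahead of the camera
    uvw = (K @ pts_cam[in_front].T).T
    uv = (uvw[:, :2] / uvw[:, 2:3]).round().astype(int)       # pixel coordinates (u, v)
    H, W = seg_mask.shape
    valid = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    vis_idx = np.flatnonzero(in_front)[valid]
    labels[vis_idx] = seg_mask[uv[valid, 1], uv[valid, 0]]    # row = v, column = u
    return labels
```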


Paper and project links

PDF

Summary

Semantic segmentation of LiDAR data faces many challenges, particularly across different sensor types and configurations. This work addresses the challenge by integrating semantic camera data with LiDAR segmentation: projecting LiDAR points into the camera's semantic segmentation space improves the precision and reliability of LiDAR-based localization. The approach is validated on the CoCar NextGen platform of the FZI Research Center for Information Technology, which offers diverse sensor modalities and configurations, using the state-of-the-art Depth-Anything network for camera image segmentation and an adaptive segmentation network for LiDAR segmentation. A GNSS solution with Real-Time Kinematic (RTK) corrections provides reliable ground truth, and an extensive 55 km drive through Karlsruhe, Germany covers a variety of environments, including urban areas, multi-lane roads, and rural highways. This multimodal approach paves the way for more reliable and precise autonomous navigation systems, especially in complex real-world environments.

Key Takeaways

  1. Semantic segmentation of LiDAR data is challenged by diverse sensor types and configurations.
  2. Integrating semantic camera data with LiDAR segmentation improves the accuracy and robustness of LiDAR-based localization.
  3. The proposed method improves localization precision by projecting LiDAR points into the camera's semantic segmentation space.
  4. The method is validated on the CoCar NextGen platform, which offers diverse sensor modalities and configurations.
  5. The Depth-Anything network is used for camera image segmentation and an adaptive segmentation network for LiDAR segmentation.
  6. A GNSS solution with RTK corrections provides reliable ground truth.

Cool Papers

Click here to view paper screenshots

Hyperspectral Adapter for Semantic Segmentation with Vision Foundation Models

Authors:Juana Valeria Hurtado, Rohit Mohan, Abhinav Valada

Hyperspectral imaging (HSI) captures spatial information along with dense spectral measurements across numerous narrow wavelength bands. This rich spectral content has the potential to facilitate robust robotic perception, particularly in environments with complex material compositions, varying illumination, or other visually challenging conditions. However, current HSI semantic segmentation methods underperform due to their reliance on architectures and learning frameworks optimized for RGB inputs. In this work, we propose a novel hyperspectral adapter that leverages pretrained vision foundation models to effectively learn from hyperspectral data. Our architecture incorporates a spectral transformer and a spectrum-aware spatial prior module to extract rich spatial-spectral features. Additionally, we introduce a modality-aware interaction block that facilitates effective integration of hyperspectral representations and frozen vision Transformer features through dedicated extraction and injection mechanisms. Extensive evaluations on three benchmark autonomous driving datasets demonstrate that our architecture achieves state-of-the-art semantic segmentation performance while directly using HSI inputs, outperforming both vision-based and hyperspectral segmentation methods. We make the code available at https://hsi-adapter.cs.uni-freiburg.de.
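
The extraction and injection mechanisms are only named in the abstract; the toy NumPy sketch below shows the general shape such a modality-aware interaction can take, using plain single-head cross-attention in both directions. Token counts, the feature dimension, the additive residual and its 0.1 scale are all assumptions for illustration, not the paper's design.

```python
import numpy as np

def cross_attention(queries, keys_values, dim):
    """Single-head cross-attention (no learned projections, illustration only)."""
    scores = queries @ keys_values.T / np.sqrt(dim)
    scores -= scores.max(axis=1, keepdims=True)       # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn @ keys_values

rng = np.random.default_rng(0)
vit_tokens = rng.normal(size=(196, 64))   # hypothetical frozen ViT patch features
hsi_tokens = rng.normal(size=(196, 64))   # hypothetical spectral-transformer features

# "Extraction": the hyperspectral branch reads context from the frozen backbone ...
extracted = hsi_tokens + cross_attention(hsi_tokens, vit_tokens, 64)
# ... "injection": the frozen stream is enriched with spectral information additively,
# so the pretrained ViT weights themselves never change.
injected = vit_tokens + 0.1 * cross_attention(vit_tokens, extracted, 64)
print(injected.shape)   # (196, 64)
```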


Paper and project links

PDF

Summary
Hyperspectral imaging, with its rich spectral content, can substantially improve machine perception in complex environments when combined with pretrained vision foundation models. This work proposes a novel hyperspectral adapter that pairs a spectral transformer with a spectrum-aware spatial prior module to extract rich spatial-spectral features. A modality-aware interaction block fuses the hyperspectral representations with frozen vision Transformer features. Evaluations on three autonomous driving datasets show that the architecture achieves state-of-the-art semantic segmentation performance.

Key Takeaways

  1. Hyperspectral imaging captures spatial information together with dense spectral measurements, which can improve machine perception in complex environments.
  2. Current HSI semantic segmentation methods underperform because they rely on architectures and learning frameworks optimized for RGB inputs.
  3. A novel hyperspectral adapter is proposed that leverages pretrained vision foundation models to learn effectively from hyperspectral data.
  4. The architecture combines a spectral transformer, a spectrum-aware spatial prior module, and a modality-aware interaction block for rich spatial-spectral feature extraction and effective feature fusion.
  5. Evaluations on three autonomous driving datasets show state-of-the-art semantic segmentation performance with direct HSI inputs, outperforming both vision-based and hyperspectral segmentation methods.
  6. The code has been made publicly available.

Cool Papers

Click here to view paper screenshots

MLF-4DRCNet: Multi-Level Fusion with 4D Radar and Camera for 3D Object Detection in Autonomous Driving

Authors:Yuzhi Wu, Li Xiao, Jun Liu, Guangfeng Jiang, XiangGen Xia

The emerging 4D millimeter-wave radar, measuring the range, azimuth, elevation, and Doppler velocity of objects, is recognized for its cost-effectiveness and robustness in autonomous driving. Nevertheless, its point clouds exhibit significant sparsity and noise, restricting its standalone application in 3D object detection. Recent 4D radar-camera fusion methods have provided effective perception. Most existing approaches, however, adopt explicit Bird’s-Eye-View fusion paradigms originally designed for LiDAR-camera fusion, neglecting radar’s inherent drawbacks. Specifically, they overlook the sparse and incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address these problems, we propose MLF-4DRCNet, a novel two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images. Our model incorporates the point-, scene-, and proposal-level multi-modal information, enabling comprehensive feature representation. It comprises three crucial components: the Enhanced Radar Point Encoder (ERPE) module, the Hierarchical Scene Fusion Pooling (HSFP) module, and the Proposal-Level Fusion Enhancement (PLFE) module. Operating at the point level, ERPE densifies radar point clouds with 2D image instances and encodes them into voxels via the proposed Triple-Attention Voxel Feature Encoder. HSFP dynamically integrates multi-scale voxel features with 2D image features using deformable attention to capture scene context and applies pooling to the fused features. PLFE refines region proposals by fusing image features, and further integrates with the pooled features from HSFP. Experimental results on the View-of-Delft (VoD) and TJ4DRadSet datasets demonstrate that MLF-4DRCNet achieves the state-of-the-art performance. Notably, it attains performance comparable to LiDAR-based models on the VoD dataset.
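
The Triple-Attention Voxel Feature Encoder itself is not detailed in the abstract, but the voxelization stage it builds on is standard. The sketch below (voxel size, ranges, and feature channels are illustrative assumptions) shows how sparse radar points with per-point features such as RCS and Doppler velocity are grouped into voxels and mean-pooled, the kind of representation an encoder like ERPE would then refine.

```python
import numpy as np

def voxelize_radar(points, feats, voxel=(0.4, 0.4, 0.4), lower=(0.0, -40.0, -3.0)):
    """Group sparse radar points into voxels and mean-pool their per-point features.

    points : (N, 3) xyz in metres; feats : (N, F) per-point features (e.g. RCS, Doppler).
    Returns integer voxel coordinates and one pooled feature vector per occupied voxel.
    """
    coords = np.floor((points - np.array(lower)) / np.array(voxel)).astype(int)
    keys, inv = np.unique(coords, axis=0, return_inverse=True)
    inv = inv.ravel()
    pooled = np.zeros((len(keys), feats.shape[1]))
    np.add.at(pooled, inv, feats)                         # sum features per voxel
    counts = np.bincount(inv, minlength=len(keys))[:, None]
    return keys, pooled / counts

rng = np.random.default_rng(0)
pts = rng.uniform([0, -40, -3], [70, 40, 3], size=(200, 3))   # sparse radar returns
ft = rng.normal(size=(200, 2))                                # e.g. (RCS, Doppler velocity)
vox, vox_feat = voxelize_radar(pts, ft)
print(vox.shape, vox_feat.shape)
```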


Paper and project links

PDF

Summary
4D millimeter-wave radar is valued in autonomous driving for its cost-effectiveness and robustness, but its point clouds are sparse and noisy, which limits its standalone use in 3D object detection. Recent 4D radar-camera fusion methods provide effective perception, yet most existing approaches ignore the sparse, incomplete geometry of radar point clouds and restrict fusion to coarse scene-level integration. To address this, MLF-4DRCNet is proposed, a two-stage framework for 3D object detection via multi-level fusion of 4D radar and camera images, combining point-, scene-, and proposal-level multi-modal information for comprehensive feature representation.

Key Takeaways

  1. 4D millimeter-wave radar offers cost-effectiveness and robustness for autonomous driving, but its point clouds are sparse and noisy.
  2. Existing radar-camera fusion methods mostly adopt Bird's-Eye-View fusion paradigms designed for LiDAR-camera fusion, overlooking radar's inherent drawbacks.
  3. MLF-4DRCNet is a two-stage framework that performs 3D object detection through multi-level fusion.
  4. MLF-4DRCNet comprises three key components: the Enhanced Radar Point Encoder (ERPE), Hierarchical Scene Fusion Pooling (HSFP), and Proposal-Level Fusion Enhancement (PLFE).
  5. The ERPE module densifies radar point clouds with 2D image instances at the point level and encodes them into voxels via the Triple-Attention Voxel Feature Encoder.
  6. The HSFP module dynamically integrates multi-scale voxel features with 2D image features using deformable attention and applies pooling to the fused features.
  7. The PLFE module refines region proposals by fusing image features and the pooled features from HSFP. Experiments on the View-of-Delft (VoD) and TJ4DRadSet datasets show that MLF-4DRCNet achieves state-of-the-art performance and is comparable to LiDAR-based models on the VoD dataset.

Cool Papers

Click here to view paper screenshots

An Analysis of Kalman Filter based Object Tracking Methods for Fast-Moving Tiny Objects

Authors:Prithvi Raj Singh, Raju Gottumukkala, Anthony Maida

Unpredictable movement patterns and a small visual mark make precise tracking of fast-moving tiny objects, such as a racquetball, one of the challenging problems in computer vision. This challenge is particularly relevant for sports robotics applications, where lightweight and accurate tracking systems can improve robot perception and planning capabilities. While Kalman filter-based tracking methods have shown success in general object tracking scenarios, their performance degrades substantially when dealing with rapidly moving objects that exhibit irregular bouncing behavior. In this study, we evaluate the performance of five state-of-the-art Kalman filter-based tracking methods (OCSORT, DeepOCSORT, ByteTrack, BoTSORT, and StrongSORT) using a custom dataset containing 10,000 annotated racquetball frames captured at 720p-1280p resolution. We focus our analysis on two critical performance factors: inference speed and update frequency per image, examining how these parameters affect tracking accuracy and reliability for fast-moving tiny objects. Our experimental evaluation across four distinct scenarios reveals that DeepOCSORT achieves the lowest tracking error with an average ADE of 31.15 pixels compared to ByteTrack’s 114.3 pixels, while ByteTrack demonstrates the fastest processing at 26.6ms average inference time versus DeepOCSORT’s 26.8ms. However, our results show that all Kalman filter-based trackers exhibit significant tracking drift with spatial errors ranging from 3-11cm (ADE values: 31-114 pixels), indicating fundamental limitations in handling the unpredictable motion patterns of fast-moving tiny objects like racquetballs. Our analysis demonstrates that current tracking approaches require substantial improvements, with error rates 3-4x higher than standard object tracking benchmarks, highlighting the need for specialized methodologies for fast-moving tiny object tracking applications.
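
For readers unfamiliar with the machinery shared by these trackers, here is a minimal, self-contained sketch of a 2-D constant-velocity Kalman filter plus the average displacement error (ADE) metric quoted above. It is a textbook filter with made-up noise settings, not the configuration of any of the five evaluated trackers; the bouncing toy trajectory merely illustrates why a constant-velocity motion model drifts on irregular bounces.

```python
import numpy as np

class ConstantVelocityKF:
    """Minimal 2-D constant-velocity Kalman filter (state: x, y, vx, vy)."""
    def __init__(self, q=1.0, r=5.0, dt=1.0):
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)
        self.Q = q * np.eye(4)            # process noise
        self.R = r * np.eye(2)            # measurement noise
        self.x = np.zeros(4)
        self.P = np.eye(4) * 100.0

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2].copy()

    def update(self, z):
        y = np.asarray(z, float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

def ade(pred, gt):
    """Average displacement error in pixels between two (T, 2) trajectories."""
    return float(np.mean(np.linalg.norm(np.asarray(pred) - np.asarray(gt), axis=1)))

# A ball that bounces (reverses vy) is exactly where a constant-velocity model drifts.
gt, preds, kf = [], [], ConstantVelocityKF()
pos, vel = np.array([0.0, 100.0]), np.array([8.0, -6.0])
for t in range(30):
    preds.append(kf.predict())
    kf.update(pos + np.random.randn(2) * 2.0)   # noisy detection of the true position
    gt.append(pos.copy())
    pos = pos + vel
    if pos[1] < 0:                               # bounce off the floor
        pos[1], vel[1] = -pos[1], -vel[1]
print("ADE (px):", ade(preds, gt))
```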


Paper and project links

PDF

Summary

This work studies the problem of tracking fast-moving tiny objects such as a racquetball and shows that Kalman filter-based tracking methods degrade when handling irregular bouncing behavior. Comparing five state-of-the-art Kalman filter-based trackers, DeepOCSORT achieves the lowest tracking error, yet all methods exhibit significant tracking drift, indicating the need for specialized approaches for fast-moving tiny object tracking applications.

Key Takeaways

  • Tracking fast-moving tiny objects is a challenging problem in computer vision.
  • Kalman filter-based tracking methods degrade when handling irregular bouncing behavior.
  • DeepOCSORT achieves the lowest tracking error, but all trackers still suffer from tracking drift.

Cool Papers

Click here to view paper screenshots

SpaRC: Sparse Radar-Camera Fusion for 3D Object Detection

Authors:Philipp Wolters, Johannes Gilg, Torben Teepe, Fabian Herzog, Felix Fent, Gerhard Rigoll

In this work, we present SpaRC, a novel Sparse fusion transformer for 3D perception that integrates multi-view image semantics with Radar and Camera point features. The fusion of radar and camera modalities has emerged as an efficient perception paradigm for autonomous driving systems. While conventional approaches utilize dense Bird’s Eye View (BEV)-based architectures for depth estimation, contemporary query-based transformers excel in camera-only detection through object-centric methodology. However, these query-based approaches exhibit limitations in false positive detections and localization precision due to implicit depth modeling. We address these challenges through three key contributions: (1) sparse frustum fusion (SFF) for cross-modal feature alignment, (2) range-adaptive radar aggregation (RAR) for precise object localization, and (3) local self-attention (LSA) for focused query aggregation. In contrast to existing methods requiring computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, yielding substantial improvements in efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks demonstrate that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors. Our method achieves state-of-the-art performance metrics of 67.1 NDS and 63.1 AMOTA. The code and pretrained models are available at https://github.com/phi-wol/sparc.
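
Range-adaptive radar aggregation is only named in the abstract; the toy sketch below illustrates the underlying idea under simple assumptions (a fixed base radius plus a range-proportional term, both invented): each object query pools the features of radar points falling inside a gate whose size grows with the query's distance from the sensor.

```python
import numpy as np

def range_adaptive_aggregate(query_xy, radar_xy, radar_feat, r0=1.0, k=0.02):
    """Average radar point features inside a radius that grows with the query's range.

    Far-away object queries get a wider gate (r0 + k * range) to compensate for the
    increasing sparsity of radar returns with distance. r0 and k are illustrative
    constants, not values from the paper.
    """
    out = np.zeros((len(query_xy), radar_feat.shape[1]))
    for i, q in enumerate(query_xy):
        radius = r0 + k * np.linalg.norm(q)
        mask = np.linalg.norm(radar_xy - q, axis=1) <= radius
        if mask.any():
            out[i] = radar_feat[mask].mean(axis=0)
    return out

rng = np.random.default_rng(1)
queries = np.array([[5.0, 0.0], [60.0, 10.0]])    # a near and a far object query (BEV metres)
radar = rng.uniform(-10, 70, size=(300, 2))       # radar point positions
feats = rng.normal(size=(300, 8))                 # encoded radar point features
print(range_adaptive_aggregate(queries, radar, feats).shape)   # (2, 8)
```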


Paper and project links

PDF 18 pages, 11 figures

Summary
This paper presents SpaRC, a novel sparse fusion transformer for 3D perception that integrates multi-view image semantics with radar and camera point features. SpaRC addresses the false-positive and localization-precision limitations of existing methods through three key techniques: sparse frustum fusion (SFF), range-adaptive radar aggregation (RAR), and local self-attention (LSA). Unlike existing methods that require computationally intensive BEV-grid rendering, SpaRC operates directly on encoded point features, improving both efficiency and accuracy. Empirical evaluations on the nuScenes and TruckScenes benchmarks show that SpaRC significantly outperforms existing dense BEV-based and sparse query-based detectors, reaching state-of-the-art performance.

Key Takeaways

  1. SpaRC is a novel sparse fusion transformer for 3D perception.
  2. It integrates multi-view image semantics with radar and camera point features.
  3. SpaRC addresses false-positive detections and localization precision through three key techniques.
  4. Unlike methods that require dense BEV-grid rendering, SpaRC operates directly on encoded point features.
  5. SpaRC performs strongly on the nuScenes and TruckScenes benchmarks, significantly outperforming existing methods.
  6. SpaRC reaches state-of-the-art performance while being both efficient and accurate.

Cool Papers

Click here to view paper screenshots

CrossEarth: Geospatial Vision Foundation Model for Domain Generalizable Remote Sensing Semantic Segmentation

Authors:Ziyang Gong, Zhixiang Wei, Di Wang, Xiaoxing Hu, Xianzheng Ma, Hongruixuan Chen, Yuru Jia, Yupeng Deng, Zhenming Ji, Xiangwei Zhu, Xue Yang, Naoto Yokoya, Jing Zhang, Bo Du, Junchi Yan, Liangpei Zhang

The field of Remote Sensing Domain Generalization (RSDG) has emerged as a critical and valuable research frontier, focusing on developing models that generalize effectively across diverse scenarios. Despite the substantial domain gaps in RS images that are characterized by variabilities such as location, wavelength, and sensor type, research in this area remains underexplored: (1) Current cross-domain methods primarily focus on Domain Adaptation (DA), which adapts models to predefined domains rather than to unseen ones; (2) Few studies targeting the RSDG issue, especially for semantic segmentation tasks, where existing models are developed for specific unknown domains, struggling with issues of underfitting on other unknown scenarios; (3) Existing RS foundation models tend to prioritize in-domain performance over cross-domain generalization. To this end, we introduce the first vision foundation model for RSDG semantic segmentation, CrossEarth. CrossEarth demonstrates strong cross-domain generalization through a specially designed data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. In addition, for the semantic segmentation task, we have curated an RSDG benchmark comprising 32 cross-domain settings across various regions, spectral bands, platforms, and climates, providing a comprehensive framework for testing the generalizability of future RSDG models. Extensive experiments on this benchmark demonstrate the superiority of CrossEarth over existing state-of-the-art methods.
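
The Earth-Style Injection pipeline is only named in the abstract. As a rough analogue, the sketch below applies the common AdaIN-style trick of re-normalising an image's per-channel statistics to those of a reference image, one generic way to simulate cross-domain appearance shifts during training; it should not be read as CrossEarth's actual pipeline.

```python
import numpy as np

def inject_style(content, style, alpha=1.0, eps=1e-6):
    """Re-normalise `content` (C, H, W) to the per-channel mean/std of `style`.

    A standard AdaIN-style operation often used to simulate appearance (domain)
    shifts for data augmentation; alpha blends original and injected statistics.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    stylised = (content - c_mean) / c_std * s_std + s_mean
    return alpha * stylised + (1 - alpha) * content

# Example: push a hypothetical "temperate" tile toward the statistics of an "arid" tile.
rng = np.random.default_rng(0)
src = rng.normal(0.4, 0.1, (3, 64, 64))     # hypothetical source-domain image
ref = rng.normal(0.7, 0.2, (3, 64, 64))     # hypothetical target-style image
aug = inject_style(src, ref, alpha=0.8)
print(aug.mean(axis=(1, 2)))                # channel means pulled toward the reference (~0.64)
```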


Paper and project links

PDF The codes and models will be available at https://github.com/Cuzyoung/CrossEarth

Summary

Remote Sensing Domain Generalization (RSDG) is an emerging research frontier aiming to develop models that generalize across diverse scenarios. Current challenges include the limitations of cross-domain methods and poor model generalization for semantic segmentation tasks. To address them, this work introduces CrossEarth, the first vision foundation model for RSDG semantic segmentation, which improves generalization through a data-level Earth-Style Injection pipeline and a model-level Multi-Task Training pipeline. The authors also curate an RSDG benchmark covering diverse regions, spectral bands, platforms, and climates, providing a comprehensive framework for testing the generalizability of future RSDG models. Experiments on this benchmark show that CrossEarth outperforms existing state-of-the-art methods.

Key Takeaways

  1. Remote Sensing Domain Generalization (RSDG) is a valuable research frontier aimed at developing models that generalize across diverse scenarios.
  2. Research on RSDG still faces challenges, such as the limitations of cross-domain (domain adaptation) methods and poor model generalization for semantic segmentation tasks.
  3. CrossEarth improves generalization through a data-level Earth-Style Injection pipeline and model-level Multi-Task Training.
  4. CrossEarth outperforms existing state-of-the-art methods on the benchmark.
  5. A comprehensive RSDG benchmark covering diverse regions, spectral bands, platforms, and climates is constructed to test the generalizability of future models.
  6. Further research is needed in RSDG to develop more effective models and algorithms for cross-domain problems in remote sensing imagery.

Cool Papers

Click here to view paper screenshots

SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection

Authors:Dingkang Liang, Wei Hua, Chunsheng Shi, Zhikang Zou, Xiaoqing Ye, Xiang Bai

Semi-supervised object detection (SSOD), leveraging unlabeled data to boost object detectors, has become a hot topic recently. However, existing SSOD approaches mainly focus on horizontal objects, leaving oriented objects common in aerial images unexplored. At the same time, the annotation cost of oriented objects is significantly higher than that of their horizontal counterparts. Therefore, in this paper, we propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++. Specifically, we observe that objects from aerial images usually have arbitrary orientations, small scales, and dense distribution, which inspires the following core designs: a Simple Instance-aware Dense Sampling (SIDS) strategy is used to generate comprehensive dense pseudo-labels; the Geometry-aware Adaptive Weighting (GAW) loss dynamically modulates the importance of each pair between pseudo-label and corresponding prediction by leveraging the intricate geometric information of aerial objects; we treat aerial images as global layouts and explicitly build the many-to-many relationship between the sets of pseudo-labels and predictions via the proposed Noise-driven Global Consistency (NGC). Extensive experiments conducted on various oriented object datasets under various labeled settings demonstrate the effectiveness of our method. For example, on the DOTA-V2.0/DOTA-V1.5 benchmark, the proposed method outperforms previous state-of-the-art (SOTA) by a large margin (+2.90/2.14, +2.16/2.18, and +2.66/2.32) mAP under 10%, 20%, and 30% labeled data settings, respectively, with single-scale training and testing. More importantly, it still improves upon a strong supervised baseline with 70.66 mAP, trained using the full DOTA-V1.5 train-val set, by +1.82 mAP, resulting in a 72.48 mAP, pushing the new state-of-the-art. The project page is at https://dk-liang.github.io/SOODv2/
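
The GAW loss is described above only at a high level. The snippet below is one plausible minimal form, shown purely for illustration (the exact formulation in SOOD++ may differ): each pseudo-label/prediction pair of oriented boxes is weighted by a geometric mismatch combining centre distance, normalised by box scale, with the wrapped orientation gap, so noisy pairs contribute less to the unsupervised loss.

```python
import numpy as np

def gaw_weights(pseudo, pred, sigma=1.0):
    """Illustrative geometry-aware weights for oriented-box pairs.

    pseudo, pred : (N, 5) arrays of (cx, cy, w, h, angle_rad).
    A pair whose prediction already agrees with its pseudo-label in position and
    orientation keeps a high weight; badly mismatched pairs are down-weighted.
    """
    centre_gap = np.linalg.norm(pseudo[:, :2] - pred[:, :2], axis=1)
    scale = np.sqrt(pseudo[:, 2] * pseudo[:, 3]) + 1e-6             # box size as a scale normaliser
    angle_gap = np.abs(np.angle(np.exp(1j * (pseudo[:, 4] - pred[:, 4]))))  # wrapped to [0, pi]
    mismatch = centre_gap / scale + angle_gap / np.pi
    return np.exp(-mismatch / sigma)

pseudo = np.array([[50, 50, 20, 10, 0.1], [120, 80, 30, 12, 1.2]], float)
pred   = np.array([[52, 49, 20, 10, 0.15], [140, 95, 28, 14, 0.2]], float)
print(gaw_weights(pseudo, pred))   # the well-aligned first pair is weighted far above the second
```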


Paper and project links

PDF Accepted by IEEE TPAMI. The project page is at https://dk-liang.github.io/SOODv2/

Summary

Semi-supervised object detection (SSOD) leverages unlabeled data to boost detectors, but existing work focuses on horizontal objects and leaves oriented objects in aerial images under-explored. This paper proposes SOOD++, a simple yet effective semi-supervised oriented object detection method: a Simple Instance-aware Dense Sampling strategy generates comprehensive dense pseudo-labels, a Geometry-aware Adaptive Weighting loss exploits the fine-grained geometry of aerial objects to dynamically weight each pseudo-label/prediction pair, and a many-to-many relationship between the sets of pseudo-labels and predictions is built explicitly. Experiments show the method outperforms previous approaches on multiple oriented object datasets.

Key Takeaways

  1. Current SSOD methods focus on horizontal objects, leaving oriented object detection in aerial images under-explored.
  2. The proposed SOOD++ uses a Simple Instance-aware Dense Sampling strategy to generate comprehensive dense pseudo-labels.
  3. The GAW loss exploits the fine-grained geometry of aerial objects to dynamically weight pseudo-label/prediction pairs.
  4. The proposed Noise-driven Global Consistency (NGC) explicitly builds a many-to-many relationship between pseudo-labels and predictions.
  5. On benchmarks such as DOTA-V2.0/DOTA-V1.5, SOOD++ shows superiority under various labeled-data settings.
  6. SOOD++ not only excels with limited labeled data but also improves upon a strong supervised baseline trained on the full training set.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when republishing!