⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-06
Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
Authors:Tao Liu, Kan Ren, Qian Chen
With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, in GNSS-denied areas, satellite-based localization methods are prone to failure. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aimed at effectively addressing cross-temporal, cross-view, heterogeneous aerial image matching. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it to a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences using polar-coordinate reprojection, perspective transformations, or generative adversarial networks; however, they can suffer from misalignment, content loss, and limited realism. In contrast, we leverage modern object detection to accurately extract salient instances from UAV and satellite images, and integrate a graph neural network to reason about inter-image and intra-image node relationships. Using a fine-grained, graph-based node-similarity metric, our method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that our approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. Our dataset will be publicly available at the following URL: https://github.com/liutao23/ODGNNLoc.git.
With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, satellite-based localization methods are prone to failure in GNSS-denied areas. This paper presents a cross-view UAV localization framework that performs map matching via object detection, aiming to effectively address cross-temporal, cross-view matching of heterogeneous aerial images. In typical pipelines, UAV visual localization is formulated as an image-retrieval problem: features are extracted to build a localization map, and the pose of a query image is estimated by matching it against a reference database with known poses. Because publicly available UAV localization datasets are limited, many approaches recast localization as a classification task and rely on the scene labels in these datasets to ensure accuracy. Other methods seek to reduce cross-domain differences via polar-coordinate reprojection, perspective transformations, or generative adversarial networks, but these can suffer from misalignment, content loss, and limited realism. In contrast, this work uses modern object detection to accurately extract salient instances from UAV and satellite images and integrates a graph neural network to reason about intra-image and inter-image node relationships. Using a fine-grained, graph-based node-similarity metric, the method achieves strong retrieval and localization performance. Extensive experiments on public and real-world datasets show that the approach handles heterogeneous appearance differences effectively and generalizes well, making it applicable to scenarios with larger modality gaps, such as infrared-visible image matching. The dataset will be publicly available at https://github.com/liutao23/ODGNNLoc.git.
Paper and project links
PDF 20 pages, Submitted to IEEE TIM
Summary: With the rapid growth of the low-altitude economy, UAVs have become crucial for measurement and tracking in patrol systems. However, satellite-based localization methods are prone to failure in GNSS-denied areas. This paper proposes a cross-view UAV localization framework that performs map matching via object detection, aiming to effectively address cross-temporal, cross-view matching of heterogeneous aerial images. The study formulates UAV visual localization as an image-retrieval problem and uses modern object detection to accurately extract salient instances from UAV and satellite images. Combined with a graph neural network that reasons about intra-image and inter-image node relationships and a fine-grained, graph-based node-similarity metric, the method achieves strong retrieval and localization performance.
Key Takeaways:
- UAVs are increasingly important in the low-altitude economy, with measurement and tracking applications in patrol systems.
- Satellite-based localization methods are prone to failure in GNSS-denied areas.
- A cross-view UAV localization framework is proposed that performs map matching via object detection.
- The method effectively addresses cross-temporal, cross-view matching of heterogeneous aerial images.
- UAV visual localization is formulated as an image-retrieval problem.
- Modern object detection is used to extract salient instances from UAV and satellite images (a minimal graph-matching sketch follows the list).
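To make the graph-matching idea concrete, here is a minimal, self-contained sketch (not the authors' code) of scoring a UAV/satellite image pair from detected objects: each detection becomes a graph node built from a class embedding plus box geometry, one round of message passing mixes intra-image context, and the image-level score aggregates the node-similarity matrix. The feature dimensions, class count, and the max-mean aggregation are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): match two images by comparing graphs
# built from detected objects. Node features = class embedding + box geometry;
# one message-passing step; image score = mean of best node similarities.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyGraphEncoder(nn.Module):
    def __init__(self, num_classes=16, dim=64):
        super().__init__()
        self.cls_emb = nn.Embedding(num_classes, dim)
        self.geo_proj = nn.Linear(4, dim)          # (cx, cy, w, h) of each box
        self.msg = nn.Linear(dim, dim)             # one message-passing step

    def forward(self, labels, boxes):
        x = self.cls_emb(labels) + self.geo_proj(boxes)          # (N, dim)
        # fully connected graph: aggregate messages from all other nodes
        attn = torch.softmax(x @ x.t() / x.shape[-1] ** 0.5, dim=-1)
        return F.normalize(x + self.msg(attn @ x), dim=-1)       # (N, dim)

def image_match_score(enc, det_a, det_b):
    """Graph-based node-similarity score between two detection sets."""
    za = enc(*det_a)                    # nodes of the UAV image
    zb = enc(*det_b)                    # nodes of the satellite tile
    sim = za @ zb.t()                   # (Na, Nb) node-similarity matrix
    # symmetric "best match" aggregation as the retrieval score
    return 0.5 * (sim.max(dim=1).values.mean() + sim.max(dim=0).values.mean())

enc = TinyGraphEncoder()
uav = (torch.randint(0, 16, (5,)), torch.rand(5, 4))
sat = (torch.randint(0, 16, (7,)), torch.rand(7, 4))
print(image_match_score(enc, uav, sat).item())
```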
OmniTrack++: Omnidirectional Multi-Object Tracking by Learning Large-FoV Trajectory Feedback
Authors:Kai Luo, Hao Shi, Kunyu Peng, Fei Teng, Sheng Wu, Kaiwei Wang, Kailun Yang
This paper investigates Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° Field of View (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras generalize unsatisfactorily under these conditions. To address panoramic distortion, large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, we establish the EmboTrack benchmark, a comprehensive dataset for panoramic MOT that includes QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together, these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack demonstrate that OmniTrack++ achieves state-of-the-art performance, yielding substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.
This paper studies Multi-Object Tracking (MOT) in panoramic imagery, which introduces unique challenges including a 360° field of view (FoV), resolution dilution, and severe view-dependent distortions. Conventional MOT methods designed for narrow-FoV pinhole cameras perform poorly under these conditions. To address panoramic distortion, the large search space, and identity ambiguity under a 360° FoV, OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block first stabilizes panoramic features, implicitly alleviating geometric distortion. On top of the normalized representations, FlexiTrack Instances use trajectory-informed feedback for flexible localization and reliable short-term association. To ensure long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, enabling recovery from fragmented tracks and reducing identity drift. Finally, a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics, offering a balanced and scalable solution for panoramic MOT. To support rigorous evaluation, the authors establish the EmboTrack benchmark, a comprehensive panoramic MOT dataset comprising QuadTrack, captured with a quadruped robot, and BipTrack, collected with a bipedal wheel-legged robot. Together these datasets span wide-angle environments and diverse motion patterns, providing a challenging testbed for real-world panoramic perception. Extensive experiments on JRDB and EmboTrack show that OmniTrack++ achieves state-of-the-art performance, with substantial HOTA improvements of +25.5% on JRDB and +43.07% on QuadTrack over the original OmniTrack. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack.
Paper and project links
PDF Extended version of CVPR 2025 paper arXiv:2503.04565. Datasets and code will be made publicly available at https://github.com/xifen523/OmniTrack
Summary
This paper studies Multi-Object Tracking (MOT) in panoramic imagery and proposes OmniTrack++ to handle the particular challenges of panoramic images, such as the 360° field of view, resolution dilution, and severe view-dependent distortion. The method adopts a feedback-driven framework that progressively refines perception with trajectory cues. A DynamicSSM block stabilizes panoramic features, and FlexiTrack Instances provide flexible localization and short-term association. For long-term robustness, an ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design to recover fragmented tracks and reduce identity drift, while a Tracklet Management module adaptively switches between end-to-end and tracking-by-detection modes according to scene dynamics. The authors also establish the EmboTrack benchmark, comprising the QuadTrack and BipTrack datasets, as a challenging testbed for panoramic MOT. Experiments show that OmniTrack++ achieves state-of-the-art performance on JRDB and EmboTrack.
Key Takeaways
- The paper studies Multi-Object Tracking (MOT) in panoramic imagery and points out the shortcomings of conventional MOT methods in this setting.
- OmniTrack++ adopts a feedback-driven framework that progressively refines perception with trajectory cues, addressing panoramic distortion, the large search space, and identity ambiguity.
- The DynamicSSM block stabilizes panoramic features and alleviates geometric distortion.
- FlexiTrack Instances use trajectory-informed feedback for flexible localization and short-term association.
- The ExpertTrack Memory consolidates appearance cues via a Mixture-of-Experts design, improving long-term robustness and enabling recovery from fragmented tracks.
- The Tracklet Management module adaptively switches tracking modes according to scene dynamics (see the mode-switching sketch below).
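As a rough illustration of the mode-switching idea in the Tracklet Management module, the following sketch picks a tracking mode from a simple scene-dynamics statistic (mean per-frame displacement of the tracklets). The threshold rule and the displacement statistic are assumptions made for illustration; the paper's actual switching criterion may differ.

```python
# Hypothetical sketch of adaptive mode switching for tracklet management.
# The switching rule (mean track displacement vs. a threshold) is an assumption
# for illustration; the paper's actual criterion may differ.
from dataclasses import dataclass, field

@dataclass
class Tracklet:
    track_id: int
    history: list = field(default_factory=list)   # list of (cx, cy) centers

    def displacement(self) -> float:
        if len(self.history) < 2:
            return 0.0
        (x0, y0), (x1, y1) = self.history[-2], self.history[-1]
        return ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5

def choose_mode(tracklets, motion_threshold=15.0) -> str:
    """Pick a tracking mode from current scene dynamics."""
    if not tracklets:
        return "end_to_end"
    mean_motion = sum(t.displacement() for t in tracklets) / len(tracklets)
    # fast, cluttered scenes fall back to tracking-by-detection
    return "tracking_by_detection" if mean_motion > motion_threshold else "end_to_end"

tracks = [Tracklet(0, [(10, 10), (40, 12)]), Tracklet(1, [(100, 50), (103, 52)])]
print(choose_mode(tracks))
```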
HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation
Authors:Panwang Pan, Tingting Shen, Chenxin Li, Yunlong Lin, Kairun Wen, Jingjing Zhao, Yixuan Yuan
Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks (e.g., human 3D segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture modeling fidelity and semantic consistency. Extensive experiments demonstrate that HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.
Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks such as 3D human segmentation remains limited. HumanCrafter is a unified framework that jointly models appearance and human-part semantics from a single image in a feed-forward manner. It integrates human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, the authors develop an interactive annotation procedure that generates high-quality data-label pairs. Pixel-aligned aggregation enables cross-task synergy, and a multi-task objective simultaneously optimizes texture-modeling fidelity and semantic consistency. Extensive experiments show that HumanCrafter surpasses existing state-of-the-art methods in both single-image 3D human-part segmentation and 3D human reconstruction.
Paper and project links
PDF Accepted to NeurIPS 2025; Project page: this URL
Summary
Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their use for specific tasks such as 3D human segmentation remains limited. HumanCrafter is a unified framework that jointly models appearance and human-part semantics from a single image in a feed-forward manner. It integrates human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human data, an interactive annotation procedure is developed to generate high-quality data-label pairs. Pixel-aligned aggregation enables cross-task synergy, and a multi-task objective simultaneously optimizes texture-modeling fidelity and semantic consistency. Extensive experiments show that HumanCrafter surpasses existing state-of-the-art methods in both single-image 3D human-part segmentation and 3D human reconstruction.
Key Takeaways
- The HumanCrafter framework jointly models appearance and human-part semantics from a single image.
- Human geometric priors and self-supervised semantic priors are integrated into the reconstruction and segmentation stages, respectively.
- An interactive annotation procedure is developed to address the scarcity of labeled 3D human datasets.
- Pixel-aligned aggregation enables cross-task synergy.
- The multi-task objective jointly optimizes texture-modeling fidelity and semantic consistency (a toy version is sketched below).
- HumanCrafter surpasses existing state-of-the-art methods in single-image 3D human reconstruction and 3D human-part segmentation.
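A toy version of such a multi-task objective, assuming an L1 reconstruction term and a per-pixel cross-entropy term with hand-picked weights (neither taken from the paper), could look like this:

```python
# Toy sketch of a multi-task objective that balances reconstruction fidelity and
# semantic consistency. The loss terms and weights are illustrative assumptions,
# not HumanCrafter's actual objective.
import torch
import torch.nn.functional as F

def multitask_loss(pred_rgb, gt_rgb, part_logits, gt_parts,
                   w_recon=1.0, w_sem=0.5):
    # appearance / texture fidelity term (per-pixel L1)
    recon = F.l1_loss(pred_rgb, gt_rgb)
    # human-part semantic consistency term (per-pixel cross-entropy)
    sem = F.cross_entropy(part_logits, gt_parts)
    return w_recon * recon + w_sem * sem

pred_rgb = torch.rand(2, 3, 64, 64)            # predicted texture
gt_rgb = torch.rand(2, 3, 64, 64)
part_logits = torch.randn(2, 10, 64, 64)       # 10 hypothetical body parts
gt_parts = torch.randint(0, 10, (2, 64, 64))
print(multitask_loss(pred_rgb, gt_rgb, part_logits, gt_parts).item())
```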
Parameterized Prompt for Incremental Object Detection
Authors:Zijia An, Boyu Diao, Ruiqi Liu, Libo Huang, Chuanguang Yang, Fei Wang, Zhulin An, Yongjun Xu
Recent studies have demonstrated that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the application of prompts in incremental object detection (IOD) remains underexplored. Existing prompt-pool-based approaches assume disjoint class sets across incremental tasks, which is unsuitable for IOD because it overlooks the co-occurrence phenomenon inherent in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current-task images, leading to confusion in the prompt pool. In this paper, we hold that prompt structures should exhibit adaptive consolidation properties across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, we introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging the global evolution properties of neural networks, P$^2$IOD employs networks as the parameterized prompts to adaptively consolidate knowledge across tasks. To constrain prompt-structure updates, P$^2$IOD further employs a parameterized-prompt fusion strategy. Extensive experiments on the PASCAL VOC2007 and MS COCO datasets demonstrate P$^2$IOD's effectiveness in IOD and show that it achieves state-of-the-art performance among existing baselines.
Recent studies have shown that incorporating trainable prompts into pretrained models enables effective incremental learning. However, the use of prompts in incremental object detection (IOD) remains underexplored. Existing prompt-pool-based approaches assume disjoint class sets across incremental tasks, which is unsuitable for IOD because it ignores the co-occurrence inherent in detection images. In co-occurring scenarios, unlabeled objects from previous tasks may appear in current-task images, causing confusion in the prompt pool. This paper argues that prompt structures should exhibit adaptive consolidation across tasks, with constrained updates to prevent catastrophic forgetting. Motivated by this, the authors introduce Parameterized Prompts for Incremental Object Detection (P$^2$IOD). Leveraging the global evolution properties of neural networks, P$^2$IOD uses networks as parameterized prompts to adaptively consolidate knowledge across tasks, and further employs a parameterized-prompt fusion strategy to constrain prompt-structure updates. Extensive experiments on the PASCAL VOC2007 and MS COCO datasets demonstrate P$^2$IOD's effectiveness in IOD and show that it achieves state-of-the-art performance among existing baselines.
Paper and project links
Summary
This paper studies the use of trainable prompts in pretrained models for incremental learning, focusing on incremental object detection (IOD). Existing prompt-pool-based methods assume disjoint class sets across tasks, which does not hold for detection images where objects co-occur: unlabeled objects from previous tasks can appear in current-task images and confuse the prompt pool. The paper proposes Parameterized Prompts for Incremental Object Detection (P²IOD), which leverages the global evolution properties of neural networks by using networks themselves as parameterized prompts to adaptively consolidate knowledge across tasks, and constrains prompt-structure updates with a parameterized-prompt fusion strategy. Extensive experiments on PASCAL VOC2007 and MS COCO show that P²IOD is effective for IOD and achieves state-of-the-art performance among existing baselines.
Key Takeaways
- Incorporating trainable prompts into pretrained models benefits incremental learning.
- Prompt-based approaches for incremental object detection (IOD) remain underexplored, especially once the co-occurrence of objects in detection images is taken into account.
- In co-occurring scenes, unlabeled objects from previous tasks can appear in current-task images and confuse the prompt pool.
- A parameterized prompt structure (P²IOD) is proposed that adaptively consolidates knowledge across tasks.
- The parameterized prompts are realized with neural networks, exploiting their global evolution properties.
- P²IOD constrains prompt-structure updates through a parameterized-prompt fusion strategy (a toy fusion sketch follows).
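One simple way to realize "constrained updates through fusion" is to interpolate the parameters of the previous-task prompt network with the newly trained one. The sketch below uses plain convex parameter averaging; this is an illustrative assumption, not necessarily the fusion rule used by P$^2$IOD.

```python
# Toy sketch of constraining prompt-structure updates by fusing the parameters
# of the previous-task prompt network with the newly trained one. The convex
# parameter averaging used here is an illustrative assumption, not necessarily
# P^2IOD's actual fusion rule.
import copy
import torch
import torch.nn as nn

def fuse_prompt_networks(old_prompt: nn.Module, new_prompt: nn.Module,
                         keep_old: float = 0.7) -> nn.Module:
    """Return a prompt network whose weights interpolate old and new prompts."""
    fused = copy.deepcopy(new_prompt)
    with torch.no_grad():
        for p_fused, p_old, p_new in zip(fused.parameters(),
                                         old_prompt.parameters(),
                                         new_prompt.parameters()):
            p_fused.copy_(keep_old * p_old + (1.0 - keep_old) * p_new)
    return fused

old_prompt = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
new_prompt = copy.deepcopy(old_prompt)          # stand-in for the task-t prompt
fused = fuse_prompt_networks(old_prompt, new_prompt)
print(sum(p.numel() for p in fused.parameters()))
```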
WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
Authors:Quan Chen, Xiong Yang, Bolun Zheng, Rongfeng Lu, Xiaokai Yang, Qianyu Zhang, Yu Liu, Xiaofei Zhou
Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise, and tend to leverage multi-modal information (e.g., depth and infrared) to enhance accuracy. However, few studies are concerned with the impact of weather noise on SOD performance due to the lack of datasets with pixel-wise annotations. To bridge this gap, this paper introduces a novel Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of 14,945 RGB images with diverse weather noise, along with the corresponding ground truth annotations and weather labels. To verify algorithm generalization, WXSOD contains two test sets, i.e., a synthesized test set and a real test set. The former is generated by adding weather noise to clean images, while the latter contains real-world weather noise. Based on WXSOD, we propose an efficient baseline, termed Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture. Specifically, the weather prediction branch mines weather-related deep features, while the saliency detection branch fuses semantic features extracted from the backbone with weather features for SOD. Comprehensive comparisons against 17 SOD methods show that our WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD
Salient object detection (SOD) in complex environments remains a challenging research topic. Most existing methods perform well in natural scenes with negligible noise and tend to exploit multi-modal information (e.g., depth and infrared) to improve accuracy. However, few studies examine how weather noise degrades SOD performance, largely because no dataset with pixel-wise annotations has been available. To bridge this gap, this paper introduces the Weather-eXtended Salient Object Detection (WXSOD) dataset, which contains 14,945 RGB images with diverse weather noise, together with the corresponding ground-truth annotations and weather labels. To verify generalization, WXSOD includes two test sets: a synthesized test set, generated by adding weather noise to clean images, and a real test set containing real-world weather noise. Based on WXSOD, the authors propose an efficient baseline, the Weather-aware Feature Aggregation Network (WFANet), which adopts a fully supervised two-branch architecture: the weather-prediction branch mines weather-related deep features, while the saliency-detection branch fuses semantic features extracted from the backbone with the weather features for SOD. Comprehensive comparisons against 17 SOD methods show that WFANet achieves superior performance on WXSOD. The code and benchmark results will be made publicly available at https://github.com/C-water/WXSOD.
Paper and project links
PDF Under review
Summary
Salient object detection (SOD) in complex environments remains a challenging research topic. Existing methods perform well in natural scenes, but the effect of weather noise has received little attention. To address this, the paper introduces the Weather-eXtended Salient Object Detection (WXSOD) dataset of RGB images with pixel-wise annotations and weather labels, and proposes an efficient baseline, the Weather-aware Feature Aggregation Network (WFANet). WFANet uses a fully supervised two-branch architecture: a weather-prediction branch mines weather-related deep features, and a saliency-detection branch fuses semantic features with the weather features for SOD. Compared with existing methods, WFANet performs best on WXSOD.
Key Takeaways
- Salient object detection (SOD) in complex environments remains challenging.
- Existing methods mostly perform well in natural, low-noise scenes, and the impact of weather noise on SOD performance has been under-studied.
- The new WXSOD dataset provides RGB images with pixel-wise annotations under diverse weather noise to study this impact.
- The WFANet baseline, built on WXSOD, uses a two-branch architecture to handle weather-aware salient object detection (a schematic sketch follows the list).
- Under weather noise, WFANet outperforms 17 other SOD methods on WXSOD.
- The code and benchmark results will be released publicly to support follow-up research.
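A minimal sketch of such a two-branch design is shown below: a shared backbone, a weather branch that classifies the weather condition, and a saliency branch that fuses backbone features with the weather features. All layer sizes and the concatenation-based fusion are assumptions made for illustration, not the paper's architecture.

```python
# Minimal two-branch sketch in the spirit of WFANet: one branch classifies the
# weather condition, the other fuses backbone features with weather features to
# predict a saliency map. Layer sizes and fusion-by-concatenation are assumptions.
import torch
import torch.nn as nn

class TwoBranchSOD(nn.Module):
    def __init__(self, num_weather=5):
        super().__init__()
        self.backbone = nn.Sequential(                 # tiny stand-in backbone
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.weather_feat = nn.Conv2d(64, 32, 3, padding=1)
        self.weather_head = nn.Linear(32, num_weather)
        self.saliency_head = nn.Conv2d(64 + 32, 1, 3, padding=1)

    def forward(self, x):
        feats = self.backbone(x)                       # semantic features
        wfeat = self.weather_feat(feats)               # weather-related features
        weather_logits = self.weather_head(wfeat.mean(dim=(2, 3)))
        fused = torch.cat([feats, wfeat], dim=1)       # fuse for saliency
        saliency = torch.sigmoid(self.saliency_head(fused))
        return saliency, weather_logits

model = TwoBranchSOD()
sal, weather = model(torch.rand(1, 3, 128, 128))
print(sal.shape, weather.shape)                        # (1,1,128,128), (1,5)
```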
Label tree semantic losses for rich multi-class medical image segmentation
Authors:Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren
Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical image segmentation tasks, namely head MRI for whole brain parcellation (WBP) with full supervision and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. Results demonstrate that our proposed method reaches state-of-the-art performance in both cases.
Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, the learning methods commonly used for medical and surgical image segmentation penalise all errors equally and therefore cannot exploit inter-class semantics in the label space. This becomes particularly problematic as the number and richness of labels grow to include subtly different classes. This work proposes two tree-based semantic loss functions that take advantage of a hierarchical organisation of the labels, and further incorporates these losses into a recently proposed approach for training with sparse, background-free annotations to broaden their applicability. Extensive experiments are reported on two medical and surgical segmentation tasks: head MRI for whole brain parcellation (WBP) with full supervision, and neurosurgical hyperspectral imaging (HSI) for scene understanding with sparse annotations. The results show that the proposed method reaches state-of-the-art performance in both cases.
Paper and project links
Summary
Accurate medical image segmentation is central to the next generation of AI-driven clinical practice. However, conventional learning methods for medical and surgical image segmentation treat all errors as equivalent and ignore inter-class semantics in the label space. This work proposes two tree-based semantic loss functions that exploit the hierarchical structure of the labels, and incorporates them into a recently proposed approach for training with sparse, background-free annotations, extending their applicability. Extensive experiments on two tasks, fully supervised whole brain parcellation from head MRI and sparsely annotated scene understanding from neurosurgical hyperspectral imaging, show that the method reaches state-of-the-art performance in both settings.
Key Takeaways
- Medical image segmentation is central to AI-driven clinical practice, supporting pre-operative planning, intra-operative navigation, and post-operative assessment.
- Conventional learning methods for medical and surgical image segmentation penalise all errors equally and ignore inter-class semantics.
- Two tree-based semantic loss functions are proposed that exploit the hierarchical structure of the labels to improve segmentation (a toy tree-distance loss is sketched below).
- The losses are incorporated into a training approach for sparse, background-free annotations, extending their applicability.
- Experiments cover whole brain parcellation from head MRI and scene understanding from neurosurgical hyperspectral imaging.
- The method reaches state-of-the-art performance in both the fully supervised and the sparsely annotated settings.
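The core idea of a label-tree semantic loss can be illustrated with a toy hierarchy: confusing two sibling classes should cost less than confusing classes from distant branches. The sketch below weights the predicted class distribution by tree distance to the ground-truth label; the hierarchy and the expected-cost formulation are illustrative assumptions, not the paper's exact losses.

```python
# Toy sketch of a label-tree semantic loss: mistakes between semantically close
# classes (siblings in the tree) cost less than mistakes across distant branches.
# The tiny hierarchy and the expected-tree-cost formulation are assumptions.
import torch
import torch.nn.functional as F

# parent of each class in a small hypothetical hierarchy (root = -1)
PARENT = {0: -1, 1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}

def ancestors(c):
    path = []
    while c != -1:
        path.append(c)
        c = PARENT[c]
    return path

def tree_distance(a, b):
    """Number of tree edges between two classes."""
    pa, pb = ancestors(a), ancestors(b)
    lca_depth = max(len(ancestors(c)) for c in set(pa) & set(pb))
    return (len(pa) - lca_depth) + (len(pb) - lca_depth)

NUM_CLASSES = len(PARENT)
COST = torch.tensor([[float(tree_distance(i, j)) for j in range(NUM_CLASSES)]
                     for i in range(NUM_CLASSES)])

def tree_semantic_loss(logits, target):
    """Expected tree distance between the predicted distribution and the label."""
    probs = F.softmax(logits, dim=1)               # (N, C, H, W)
    cost = COST[target]                            # (N, H, W, C) cost per class
    return (probs.permute(0, 2, 3, 1) * cost).sum(dim=-1).mean()

logits = torch.randn(2, NUM_CLASSES, 8, 8)
target = torch.randint(0, NUM_CLASSES, (2, 8, 8))
print(tree_semantic_loss(logits, target).item())
```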
Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
Authors:Kangning Cui, Rongkun Zhu, Manqi Wang, Wei Tang, Gregory D. Larsen, Victor P. Pauca, Sarra Alqahtani, Fan Yang, David Segurado, David Lutz, Jean-Michel Morel, Miles R. Silman
Palms are ecological and economic indicators of tropical forest health, biodiversity, and human impact that support local economies and global forest product supply chains. While palm detection in plantations is well-studied, efforts to map naturally occurring palms in dense forests remain limited by overlapping crowns, uneven shading, and heterogeneous landscapes. We develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests using large orthomosaic images. Orthomosaics are created from thousands of aerial images and span several to hundreds of gigabytes. Our contributions are threefold. First, we construct a large UAV-derived orthomosaic dataset collected across 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, we evaluate multiple state-of-the-art object detectors based on efficiency and performance, integrating zero-shot SAM 2 as the segmentation backbone, and refining the results for precise geographic mapping. Third, we apply calibration methods to align confidence scores with IoU and explore saliency maps for feature explainability. Though optimized for palms, PRISM is adaptable for identifying other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution datasets (0.5 to 1 m).
Palms are ecological and economic indicators of tropical forest health, biodiversity, and human impact, supporting local economies and global forest product supply chains. While palm detection in plantations is well studied, mapping naturally occurring palms in dense forests remains limited by overlapping crowns, uneven shading, and heterogeneous landscapes. The authors develop PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline for detecting and localizing palms in dense tropical forests from large orthomosaic images, which are stitched from thousands of aerial photographs and span several to hundreds of gigabytes. The contributions are threefold. First, a large UAV-derived orthomosaic dataset is built from 21 ecologically diverse sites in western Ecuador, annotated with 8,830 bounding boxes and 5,026 palm center points. Second, multiple state-of-the-art object detectors are evaluated for efficiency and performance, zero-shot SAM 2 is integrated as the segmentation backbone, and the results are refined for precise geographic mapping. Third, calibration methods are applied to align confidence scores with IoU, and saliency maps are explored for feature explainability. Although optimized for palms, PRISM is adaptable to other natural objects, such as eastern white pines. Future work will explore transfer learning for lower-resolution (0.5 to 1 m) datasets.
Paper and project links
PDF 15 pages, 8 figures, 4 tables
Summary
This paper addresses the detection of palms in tropical forests with the PRISM pipeline. Palms are important indicators of the ecological and economic health of tropical forests, but detecting them in dense forests is difficult. PRISM detects and localizes palms from large orthomosaic images. A large UAV-derived orthomosaic dataset was collected and annotated across ecologically diverse sites, several state-of-the-art object detectors were evaluated, and zero-shot SAM 2 was adopted as the segmentation backbone. The pipeline also applies calibration to align confidence scores with IoU and explores saliency maps for feature explainability. Although optimized for palms, PRISM can be applied to other natural objects, and future work will explore transfer learning on lower-resolution datasets.
Key Takeaways
The main points of the paper are:
- Palms are indicators of tropical forest ecological and economic health, supporting local economies and global forest product supply chains.
- Detecting naturally occurring palms in dense forests is hampered by overlapping crowns, uneven shading, and heterogeneous landscapes.
- PRISM is a flexible pipeline for detecting palms in tropical forests from large orthomosaic images.
- PRISM builds on a large UAV-derived orthomosaic dataset collected and annotated across 21 ecologically diverse sites in western Ecuador.
- Several state-of-the-art object detectors were evaluated for accuracy and efficiency, with zero-shot SAM 2 integrated as the segmentation backbone.
- PRISM applies calibration to align confidence scores with IoU and uses saliency maps for feature explainability (a tiling-and-georeferencing sketch follows).
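Running a detector over a multi-gigabyte orthomosaic typically requires tiling the image and georeferencing the detections. The sketch below illustrates that pattern with a stub detector and a GDAL-style affine geotransform; the tile size, overlap, and detector are assumptions, not PRISM's actual settings.

```python
# Sketch of the tiling + georeferencing step needed to run a detector over a
# large orthomosaic: slide overlapping windows, detect per window, and map pixel
# coordinates to geographic coordinates with an affine geotransform.
# The stub detector and the tile/overlap sizes are assumptions for illustration.
import numpy as np

def run_detector(tile: np.ndarray):
    """Stand-in for a real detector: returns [(x, y, score), ...] in tile pixels."""
    return [(tile.shape[1] / 2.0, tile.shape[0] / 2.0, 0.9)]

def pixel_to_geo(col, row, geotransform):
    """GDAL-style affine: (origin_x, pixel_w, 0, origin_y, 0, -pixel_h)."""
    ox, pw, _, oy, _, ph = geotransform
    return ox + col * pw, oy + row * ph

def detect_over_orthomosaic(image, geotransform, tile=1024, overlap=128):
    h, w = image.shape[:2]
    step = tile - overlap
    detections = []
    for r0 in range(0, max(h - overlap, 1), step):
        for c0 in range(0, max(w - overlap, 1), step):
            window = image[r0:r0 + tile, c0:c0 + tile]
            for x, y, score in run_detector(window):
                lon, lat = pixel_to_geo(c0 + x, r0 + y, geotransform)
                detections.append((lon, lat, score))
    return detections

mosaic = np.zeros((2048, 3072, 3), dtype=np.uint8)   # tiny fake mosaic
gt = (-79.5, 1e-6, 0.0, -1.2, 0.0, -1e-6)            # illustrative geotransform
print(len(detect_over_orthomosaic(mosaic, gt)))
```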
Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection
Authors:Pengfei Lyu, Xiaosheng Yu, Pak-Hei Yeung, Chengdong Wu, Jagath C. Rajapakse
The rapid development of deep learning has significantly improved salient object detection (SOD) combining both RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models with quadratic complexity are memory-intensive, limiting their application in high-resolution bimodal feature fusion. To overcome this limitation, we propose a purely Fourier Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier Transform with linear complexity to design three key components: (1) To fuse RGB and thermal modalities, we propose Modal-coordinated Perception Attention, which aligns and enhances bimodal Fourier representation in multiple dimensions; (2) To clarify object edges and suppress noise, we design Frequency-decomposed Edge-aware Block, which deeply decomposes and filters Fourier components of low-level features; (3) To accurately decode features, we propose Fourier Residual Channel Attention Block, which prioritizes high-frequency information while aligning channel-wise global relationships. Additionally, even when converged, existing deep learning-based SOD models’ predictions still exhibit frequency gaps relative to ground-truth. To address this problem, we propose Co-focus Frequency Loss, which dynamically weights hard frequencies during edge frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine existing state-of-the-art bimodal SOD models. Comprehensive ablation studies further validate the value and effectiveness of our newly proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal.
The rapid development of deep learning has significantly improved salient object detection (SOD) that combines RGB and thermal (RGB-T) images. However, existing Transformer-based RGB-T SOD models have quadratic complexity and are memory-intensive, which limits their use for high-resolution bimodal feature fusion. To overcome this limitation, the authors propose a purely Fourier-transform-based model, the Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. Exploiting the efficiency of the Fast Fourier Transform, whose complexity is linear, they design three key components: (1) to fuse the RGB and thermal modalities, a Modal-coordinated Perception Attention aligns and enhances the bimodal Fourier representation along multiple dimensions; (2) to sharpen object edges and suppress noise, a Frequency-decomposed Edge-aware Block deeply decomposes and filters the Fourier components of low-level features; (3) to decode features accurately, a Fourier Residual Channel Attention Block prioritizes high-frequency information while aligning channel-wise global relationships. In addition, even after convergence, the predictions of existing deep-learning-based SOD models still show frequency gaps relative to the ground truth. To address this, the authors propose a Co-focus Frequency Loss, which dynamically weights hard frequencies during edge-frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain. Extensive experiments on ten bimodal SOD benchmark datasets show that FreqSal outperforms twenty-nine state-of-the-art bimodal SOD models, and comprehensive ablation studies further validate the value and effectiveness of the proposed components. The code is available at https://github.com/JoshuaLPF/FreqSal.
Paper and project links
PDF Accepted by TCSVT2025
Summary
The rapid development of deep learning has significantly improved salient object detection (SOD) that combines RGB and thermal (RGB-T) images. To overcome the memory cost and quadratic complexity of existing Transformer-based RGB-T SOD models, the authors propose a purely Fourier-transform-based model, the Deep Fourier-embedded Network (FreqSal), for accurate RGB-T SOD. The model exploits the linear complexity of the Fast Fourier Transform and introduces three key components to improve performance, together with a Co-focus Frequency Loss that dynamically weights hard frequencies during edge-frequency reconstruction. Experiments show that FreqSal outperforms twenty-nine state-of-the-art bimodal SOD models on ten bimodal SOD benchmark datasets.
Key Takeaways
- Advances in deep learning have driven progress in salient object detection that combines RGB and thermal imagery.
- Existing Transformer-based RGB-T SOD models are memory-intensive with quadratic complexity, which limits high-resolution bimodal feature fusion.
- The proposed Deep Fourier-embedded Network (FreqSal) is built purely on the Fourier transform, exploiting the linear complexity of the Fast Fourier Transform for efficiency.
- FreqSal introduces three key components to improve performance: Modal-coordinated Perception Attention, a Frequency-decomposed Edge-aware Block, and a Fourier Residual Channel Attention Block.
- A Co-focus Frequency Loss dynamically weights hard frequencies during edge-frequency reconstruction by cross-referencing bimodal edge information in the Fourier domain (a simplified frequency-domain loss is sketched below).
- Experiments show that FreqSal outperforms other state-of-the-art models on multiple bimodal SOD benchmarks.
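To illustrate the flavor of a frequency-domain loss that emphasizes hard frequencies, here is a simplified stand-in: it compares prediction and ground truth via an FFT and up-weights the frequency bins with the largest error. It omits the bimodal edge cross-referencing of the actual Co-focus Frequency Loss, so it is only an assumption-laden sketch.

```python
# Simplified frequency-domain loss in the spirit of weighting "hard" frequencies:
# compare prediction and ground truth in the Fourier domain and up-weight the
# frequency bins with the largest error. This is a generic stand-in, not the
# paper's exact Co-focus Frequency Loss.
import torch

def frequency_loss(pred, target, gamma=1.0):
    """pred, target: (N, 1, H, W) saliency maps in [0, 1]."""
    fp = torch.fft.rfft2(pred, norm="ortho")       # complex spectra
    ft = torch.fft.rfft2(target, norm="ortho")
    err = (fp - ft).abs()                          # per-frequency error magnitude
    # dynamic weights: harder (larger-error) frequencies get more weight
    e = err.detach()
    weight = (e / (e.amax(dim=(-2, -1), keepdim=True) + 1e-8)) ** gamma
    return (weight * err ** 2).mean()

pred = torch.rand(2, 1, 64, 64, requires_grad=True)
target = (torch.rand(2, 1, 64, 64) > 0.5).float()
loss = frequency_loss(pred, target)
loss.backward()
print(loss.item())
```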