⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use these summaries in serious academic settings; they are intended only for initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-02-27 Update
OpenFly: A Versatile Toolchain and Large-scale Benchmark for Aerial Vision-Language Navigation
Authors:Yunpeng Gao, Chenhui Li, Zhongrui You, Junli Liu, Zhen Li, Pengan Chen, Qizhi Chen, Zhonghan Tang, Liansheng Wang, Penghui Yang, Yiwen Tang, Yuhang Tang, Shuai Liang, Songyi Zhu, Ziqin Xiong, Yifei Su, Xinyi Ye, Jianan Li, Yan Ding, Dong Wang, Zhigang Wang, Bin Zhao, Xuelong Li
Vision-Language Navigation (VLN) aims to guide agents through an environment by leveraging both language instructions and visual cues, playing a pivotal role in embodied AI. Indoor VLN has been extensively studied, whereas outdoor aerial VLN remains underexplored. A likely reason is that outdoor aerial views encompass vast areas, making data collection more challenging and resulting in a lack of benchmarks. To address this problem, we propose OpenFly, a platform comprising a versatile toolchain and a large-scale benchmark for aerial VLN. Firstly, we develop a highly automated toolchain for data collection, enabling automatic point cloud acquisition, scene semantic segmentation, flight trajectory creation, and instruction generation. Secondly, based on the toolchain, we construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes. The corresponding visual data are generated using various rendering engines and advanced techniques, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS). All data exhibit high visual quality. In particular, 3D GS supports real-to-sim rendering, further enhancing the realism of the dataset. Thirdly, we propose OpenFly-Agent, a keyframe-aware VLN model, which takes language instructions, current observations, and historical keyframes as input, and outputs flight actions directly. Extensive analyses and experiments showcase the superiority of our OpenFly platform and OpenFly-Agent. The toolchain, dataset, and code will be open-sourced.
Paper and Project Links
Summary
This work presents OpenFly, a platform comprising a versatile toolchain and a large-scale benchmark that targets outdoor aerial Vision-Language Navigation (VLN). A highly automated toolchain enables data collection, a large-scale aerial VLN dataset is built on top of it, and the proposed OpenFly-Agent model takes language instructions, current observations, and historical keyframes as input and directly outputs flight actions.
Key Takeaways
- The OpenFly platform targets outdoor aerial Vision-Language Navigation (VLN), filling a gap in this area.
- OpenFly includes a highly automated toolchain for data collection, scene semantic segmentation, flight trajectory creation, and instruction generation.
- A large-scale aerial VLN dataset is constructed, containing 100k trajectories across 18 scenes with diverse heights and lengths.
- The dataset is generated with multiple rendering engines and advanced techniques, including Unreal Engine, GTA V, Google Earth, and 3D Gaussian Splatting (3D GS), all with high visual quality.
- 3D GS supports real-to-sim rendering, further enhancing the realism of the dataset.
- OpenFly-Agent is a keyframe-aware model that takes language instructions, current observations, and historical keyframes as input and directly outputs flight actions (see the sketch after this list).
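Since the abstract describes the agent only at a high level, the following is a minimal, hypothetical Python sketch of a keyframe-aware navigation loop in the spirit of OpenFly-Agent. The discrete action set, the pixel-difference keyframe test, and the `policy`/`env` interfaces are illustrative assumptions, not the paper's actual design.

```python
# Hypothetical keyframe-aware VLN rollout; all interfaces are assumptions.
from dataclasses import dataclass, field
from enum import Enum
import numpy as np

class FlightAction(Enum):
    FORWARD = 0
    TURN_LEFT = 1
    TURN_RIGHT = 2
    ASCEND = 3
    DESCEND = 4
    STOP = 5

@dataclass
class KeyframeBuffer:
    """Keeps only observations that differ enough from the last keyframe."""
    threshold: float = 0.25          # assumed novelty threshold
    frames: list = field(default_factory=list)

    def maybe_add(self, obs: np.ndarray) -> None:
        if not self.frames:
            self.frames.append(obs)
            return
        # Mean absolute pixel difference as a crude novelty score (assumption).
        diff = np.abs(obs.astype(np.float32)
                      - self.frames[-1].astype(np.float32)).mean() / 255.0
        if diff > self.threshold:
            self.frames.append(obs)

def navigate(policy, instruction: str, env, max_steps: int = 200) -> None:
    """One episode: (instruction, current view, keyframes) -> flight action."""
    buffer = KeyframeBuffer()
    obs = env.reset()
    for _ in range(max_steps):
        buffer.maybe_add(obs)
        action = policy(instruction, obs, buffer.frames)  # -> FlightAction
        if action is FlightAction.STOP:
            break
        obs = env.step(action)
```

The keyframe buffer is the only state carried across steps, which is what lets the policy condition on history without re-encoding every past frame.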
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting
Authors:Haoyuan Li, Yanpeng Zhou, Tao Tang, Jifei Song, Yihan Zeng, Michael Kampffmeyer, Hang Xu, Xiaodan Liang
Recent advancements in multi-modal 3D pre-training methods have shown promising efficacy in learning joint representations of text, images, and point clouds. However, adopting point clouds as 3D representation fails to fully capture the intricacies of the 3D world and exhibits a noticeable gap between the discrete points and the dense 2D pixels of images. To tackle this issue, we propose UniGS, integrating 3D Gaussian Splatting (3DGS) into multi-modal pre-training to enhance the 3D representation. We first rely on the 3DGS representation to model the 3D world as a collection of 3D Gaussians with color and opacity, incorporating all the information of the 3D scene while establishing a strong connection with 2D images. Then, to achieve Language-Image-3D pre-training, UniGS starts with a pre-trained vision-language model to establish a shared visual and textual space through extensive real-world image-text pairs. Subsequently, UniGS employs a 3D encoder to align the optimized 3DGS with the Language-Image representations to learn unified multi-modal representations. To facilitate the extraction of global explicit 3D features by the 3D encoder and achieve better cross-modal alignment, we additionally introduce a novel Gaussian-Aware Guidance module that guides the learning of fine-grained representations of the 3D domain. Through extensive experiments across the Objaverse, ABO, MVImgNet and SUN RGBD datasets with zero-shot classification, text-driven retrieval and open-world understanding tasks, we demonstrate the effectiveness of UniGS in learning a more general and more strongly aligned multi-modal representation. Specifically, UniGS achieves leading results across different 3D tasks with remarkable improvements over the previous SOTA, Uni3D, including zero-shot classification (+9.36%), text-driven retrieval (+4.3%) and open-world understanding (+7.92%).
Paper and Project Links
PDF ICLR 2025
Summary
This paper proposes UniGS, which integrates 3D Gaussian Splatting (3DGS) into multi-modal pre-training to strengthen 3D representations. The 3DGS representation models the 3D world as a collection of 3D Gaussians with color and opacity, establishing a strong connection with 2D images. Starting from a pre-trained vision-language model, UniGS builds a shared visual and textual space, then employs a 3D encoder to align the optimized 3DGS with language-image representations and learn unified multi-modal representations. A novel Gaussian-Aware Guidance module further facilitates the extraction of global explicit 3D features and improves cross-modal alignment. Experiments show that UniGS achieves leading results, improving over the previous SOTA Uni3D by +9.36% on zero-shot classification, +4.3% on text-driven retrieval, and +7.92% on open-world understanding.
Key Takeaways
- UniGS integrates 3D Gaussian Splatting (3DGS) into multi-modal pre-training to improve 3D representations.
- UniGS models the 3D world as 3D Gaussians with color and opacity, better connecting 2D images with the 3D world.
- Starting from a pre-trained vision-language model, UniGS establishes a shared visual-textual space linking language, images, and 3D.
- A 3D encoder and a Gaussian-Aware Guidance module improve the extraction of global explicit 3D features and cross-modal alignment (a toy version of this alignment objective is sketched after this list).
- UniGS performs strongly across multiple datasets, with notable gains on zero-shot classification, text-driven retrieval, and open-world understanding.
- UniGS clearly outperforms the previous SOTA, Uni3D, on all of these tasks.
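To make the alignment idea concrete, here is a hedged PyTorch sketch of a Language-Image-3D objective: a deliberately simple 3D encoder (a mean-pooled MLP stand-in, not UniGS's actual network) maps a set of Gaussians to a scene embedding, which a symmetric InfoNCE loss pulls toward matched image and text embeddings from a frozen CLIP-style model. The 14-dimensional per-Gaussian parameterization is a common 3DGS layout, assumed here for illustration.

```python
# Toy Language-Image-3D alignment; the encoder and loss are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Toy3DGSEncoder(nn.Module):
    """Maps (B, N, 14) Gaussian params -> (B, D) scene embedding.
    Per-Gaussian features: xyz(3) + rgb(3) + opacity(1) + scale(3) + quat(4)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(14, 256), nn.ReLU(),
                                 nn.Linear(256, dim))

    def forward(self, gaussians: torch.Tensor) -> torch.Tensor:
        return self.mlp(gaussians).mean(dim=1)  # permutation-invariant pooling

def alignment_loss(z_3d, z_img, z_txt, tau: float = 0.07):
    """Symmetric InfoNCE pulling 3D embeddings toward matched image/text ones."""
    z_3d, z_img, z_txt = (F.normalize(z, dim=-1) for z in (z_3d, z_img, z_txt))
    labels = torch.arange(z_3d.size(0), device=z_3d.device)
    loss = 0.0
    for z_other in (z_img, z_txt):
        logits = z_3d @ z_other.t() / tau           # (B, B) similarity matrix
        loss = loss + 0.5 * (F.cross_entropy(logits, labels)
                             + F.cross_entropy(logits.t(), labels))
    return loss / 2

# Usage with random stand-in data:
# z_3d = Toy3DGSEncoder()(torch.randn(8, 1024, 14))
# loss = alignment_loss(z_3d, torch.randn(8, 512), torch.randn(8, 512))
```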
GaussianFlowOcc: Sparse and Weakly Supervised Occupancy Estimation using Gaussian Splatting and Temporal Flow
Authors:Simon Boeder, Fabian Gigengack, Benjamin Risse
Occupancy estimation has become a prominent task in 3D computer vision, particularly within the autonomous driving community. In this paper, we present a novel approach to occupancy estimation, termed GaussianFlowOcc, which is inspired by Gaussian Splatting and replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Our efficient model architecture based on a Gaussian Transformer significantly reduces computational and memory requirements by eliminating the expensive 3D convolutions required by inefficient voxel-based representations, which predominantly represent empty 3D space. GaussianFlowOcc effectively captures scene dynamics by estimating temporal flow for each Gaussian during the overall network training process, offering a straightforward solution to a complex problem that is often neglected by existing methods. Moreover, GaussianFlowOcc is designed for scalability, as it employs weak supervision and does not require costly dense 3D voxel annotations based on additional data (e.g., LiDAR). Through extensive experimentation, we demonstrate that GaussianFlowOcc significantly outperforms all previous methods for weakly supervised occupancy estimation on the nuScenes dataset while featuring an inference speed that is 50 times faster than the current SOTA.
Paper and Project Links
Summary
This paper presents GaussianFlowOcc, a novel occupancy estimation method inspired by Gaussian Splatting that replaces traditional dense voxel grids with a sparse 3D Gaussian representation. Its Gaussian-Transformer-based architecture significantly reduces computation and memory requirements by avoiding the expensive 3D convolutions needed by inefficient voxel representations that mostly encode empty space. GaussianFlowOcc captures scene dynamics by estimating a temporal flow for each Gaussian during training, offering a straightforward solution to a complex problem often neglected by existing methods. It is also scalable: it uses weak supervision and requires no costly dense 3D voxel annotations from additional data such as LiDAR. Extensive experiments on the nuScenes dataset show that it significantly outperforms all previous weakly supervised occupancy estimation methods, with inference 50 times faster than the current SOTA.
Key Takeaways
- GaussianFlowOcc is a new occupancy estimation method inspired by Gaussian Splatting.
- It replaces traditional dense voxel grids with a sparse 3D Gaussian representation, improving efficiency.
- A Gaussian-Transformer-based architecture significantly reduces computation and memory requirements.
- GaussianFlowOcc captures scene dynamics by estimating a temporal flow for each Gaussian during training (see the sketch after this list).
- The method uses weak supervision and needs no costly dense 3D voxel annotations.
- On the nuScenes dataset, GaussianFlowOcc delivers outstanding weakly supervised occupancy estimation performance.
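The following NumPy sketch illustrates, under simplifying assumptions, the two ideas the abstract emphasizes: querying occupancy from a sparse set of Gaussians instead of a dense voxel grid, and advecting each Gaussian along its own temporal flow vector. Isotropic Gaussians and this particular soft-occupancy query rule are assumptions for illustration, not the paper's formulation.

```python
# Sparse-Gaussian occupancy with per-Gaussian temporal flow (illustrative).
import numpy as np

def occupancy(points, means, scales, opacities):
    """Soft occupancy at query points: opacity-weighted sum of Gaussian densities.
    points: (P, 3), means: (G, 3), scales: (G,), opacities: (G,)."""
    d2 = ((points[:, None, :] - means[None, :, :]) ** 2).sum(-1)   # (P, G)
    dens = opacities[None, :] * np.exp(-0.5 * d2 / scales[None, :] ** 2)
    return np.clip(dens.sum(axis=1), 0.0, 1.0)

def advect(means, flows, dt):
    """Move each Gaussian along its estimated flow vector for dt seconds."""
    return means + dt * flows

# 200 random Gaussians stand in for a scene that a dense grid would need
# hundreds of thousands of voxels to cover.
rng = np.random.default_rng(0)
means = rng.uniform(-10.0, 10.0, size=(200, 3))
scales = rng.uniform(0.2, 1.0, size=200)
alphas = rng.uniform(0.1, 1.0, size=200)
flows = rng.normal(0.0, 0.5, size=(200, 3))       # per-Gaussian scene flow

queries = rng.uniform(-10.0, 10.0, size=(5, 3))
print(occupancy(queries, means, scales, alphas))                       # now
print(occupancy(queries, advect(means, flows, 0.5), scales, alphas))   # +0.5 s
```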