
3DGS


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them for serious academic work; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-06-26

ManiGaussian++: General Robotic Bimanual Manipulation with Hierarchical Gaussian World Model

Authors:Tengbo Yu, Guanxing Lu, Zaijia Yang, Haoyuan Deng, Season Si Chen, Jiwen Lu, Wenbo Ding, Guoqiang Hu, Yansong Tang, Ziwei Wang

Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to understanding the multi-body spatiotemporal dynamics. An existing method ManiGaussian pioneers encoding the spatiotemporal dynamics into the visual representation via Gaussian world model for single-arm settings, which ignores the interaction of multiple embodiments for dual-arm systems with significant performance drop. In this paper, we propose ManiGaussian++, an extension of ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. To be specific, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with the leader-follower architecture, where the multi-body spatiotemporal dynamics is mined for intermediate visual representation via future scene prediction. The leader predicts Gaussian Splatting deformation caused by motions of the stabilizing arm, through which the follower generates the physical consequences resulted from the movement of the acting arm. As a result, our method significantly outperforms the current state-of-the-art bimanual manipulation techniques by an improvement of 20.2% in 10 simulated tasks, and achieves 60% success rate on average in 9 challenging real-world tasks. Our code is available at https://github.com/April-Yz/ManiGaussian_Bimanual.


Paper and Project Links

PDF

Summary

This paper discusses the challenges of multi-task bimanual manipulation and proposes a new method, ManiGaussian++. Building on the ManiGaussian framework, it digests multi-body scene dynamics through a hierarchical Gaussian world model to improve multi-task bimanual manipulation. ManiGaussian++ generates task-oriented Gaussian Splatting and builds a hierarchical Gaussian world model with a leader-follower architecture, modeling multi-body spatiotemporal dynamics via future scene prediction. The method performs strongly, with high success rates, in both simulated and real-world tasks.

Key Takeaways

  1. Multi-task bimanual manipulation enables complex tasks and requires understanding multi-body spatiotemporal dynamics.
  2. The existing ManiGaussian method models spatiotemporal dynamics with a Gaussian world model in single-arm settings, but ignores the multi-body interaction of dual-arm systems.
  3. ManiGaussian++ extends ManiGaussian with a hierarchical Gaussian world model that handles multi-body scene dynamics, improving multi-task bimanual manipulation.
  4. ManiGaussian++ generates task-oriented Gaussian Splatting to differentiate the acting and stabilizing arms for multi-body spatiotemporal dynamics modeling.
  5. A leader-follower architecture builds the hierarchical Gaussian world model, mining multi-body spatiotemporal dynamics via future scene prediction (see the sketch after this list).
  6. On 10 simulated tasks, ManiGaussian++ improves over current state-of-the-art bimanual manipulation techniques by 20.2%.
  7. It achieves an average success rate of 60% on challenging real-world tasks.
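
To make the leader-follower idea concrete, here is a minimal, illustrative Python sketch (not the authors' code): the leader predicts the Gaussian deformation caused by the stabilizing arm, and the follower predicts the consequences of the acting arm conditioned on the leader's output. All shapes, feature dimensions, and the use of plain MLPs are assumptions for illustration only.

```python
# Toy leader-follower hierarchical world model over per-Gaussian features.
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, a stand-in for the learned predictors."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

rng = np.random.default_rng(0)
N, D_g, D_a = 1024, 14, 8                    # num Gaussians, feature dim, per-arm action dim (assumed)
gaussians_t = rng.normal(size=(N, D_g))      # current task-oriented Gaussian features
a_stab = rng.normal(size=(D_a,))             # stabilizing-arm action
a_act = rng.normal(size=(D_a,))              # acting-arm action

# Leader: deformation driven by the stabilizing arm.
w1 = rng.normal(size=(D_g + D_a, 64)); b1 = np.zeros(64)
w2 = rng.normal(size=(64, D_g));       b2 = np.zeros(D_g)
leader_in = np.concatenate([gaussians_t, np.tile(a_stab, (N, 1))], axis=1)
delta_stab = mlp(leader_in, w1, b1, w2, b2)

# Follower: consequences of the acting arm, conditioned on the leader's prediction.
w3 = rng.normal(size=(D_g + D_g + D_a, 64)); b3 = np.zeros(64)
w4 = rng.normal(size=(64, D_g));             b4 = np.zeros(D_g)
follower_in = np.concatenate([gaussians_t, delta_stab, np.tile(a_act, (N, 1))], axis=1)
delta_act = mlp(follower_in, w3, b3, w4, b4)

gaussians_t1 = gaussians_t + delta_stab + delta_act   # predicted next-step Gaussian features
print(gaussians_t1.shape)
```

In the paper, the predicted future scene is supervised by rendering the deformed Gaussians and comparing against observed future frames; the sketch only shows the hierarchical data flow.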

Cool Papers

Click here to view paper screenshots

HoliGS: Holistic Gaussian Splatting for Embodied View Synthesis

Authors:Xiaoyuan Wang, Yizhou Zhao, Botao Ye, Xiaojun Shan, Weijie Lyu, Lu Qi, Kelvin C. K. Chan, Yinxiao Li, Ming-Hsuan Yang

We propose HoliGS, a novel deformable Gaussian splatting framework that addresses embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, which struggle with training overhead in minute-long captures, our method leverages invertible Gaussian Splatting deformation networks to reconstruct large-scale, dynamic environments accurately. Specifically, we decompose each scene into a static background plus time-varying objects, each represented by learned Gaussian primitives undergoing global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations via an invertible neural flow. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories by attaching Gaussians to a complete canonical foreground shape (e.g., egocentric or third-person follow), which may involve substantial viewpoint changes and interactions between multiple actors. Our experiments demonstrate that HoliGS achieves superior reconstruction quality on challenging datasets while significantly reducing both training and rendering time compared to state-of-the-art monocular deformable NeRFs. These results highlight a practical and scalable solution for EVS in real-world scenarios. The source code will be released.


Paper and Project Links

PDF

Summary

This paper proposes HoliGS, a novel deformable Gaussian splatting framework for embodied view synthesis from long monocular RGB videos. Unlike prior 4D Gaussian splatting and dynamic NeRF pipelines, it uses invertible Gaussian splatting deformation networks to accurately reconstruct large-scale dynamic environments. Each scene is decomposed into a static background and time-varying objects, represented by learned Gaussian primitives that undergo global rigid transformations, skeleton-driven articulation, and subtle non-rigid deformations. This hierarchical warping strategy enables robust free-viewpoint novel-view rendering from various embodied camera trajectories. Experiments show high-quality reconstruction on challenging datasets with significantly reduced training and rendering time, making HoliGS a practical and scalable alternative to monocular deformable NeRFs.

Key Takeaways

  1. HoliGS is a novel deformable Gaussian splatting framework for embodied view synthesis from long monocular RGB videos.
  2. Unlike prior pipelines, HoliGS uses invertible Gaussian splatting deformation networks to accurately reconstruct large-scale dynamic environments.
  3. HoliGS decomposes each scene into a static background and time-varying objects, applying several kinds of transformations and deformations to learned Gaussian primitives (see the sketch after this list).
  4. The method enables robust free-viewpoint rendering from various embodied camera trajectories.
  5. HoliGS achieves high-quality reconstruction on challenging datasets while reducing training and rendering time.
  6. Compared with monocular deformable NeRFs, HoliGS offers a more practical and scalable solution.
  7. The HoliGS source code will be released.
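
The hierarchical warping can be pictured as a composition of three stages. Below is a minimal sketch under stated assumptions (not the authors' implementation): canonical foreground Gaussians are first moved by a global rigid transform, then by skeleton-driven articulation (plain linear blend skinning as a stand-in), then by a small non-rigid residual standing in for the paper's invertible neural flow.

```python
# Toy hierarchical warp: rigid -> articulation -> non-rigid residual.
import numpy as np

rng = np.random.default_rng(0)
N, J = 500, 4                                # num Gaussians and joints (assumed)
canonical = rng.normal(size=(N, 3))          # canonical Gaussian centers

# 1) Global rigid transform of the whole foreground object.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.1, 0.0, 0.2])

# 2) Skeleton-driven articulation via per-joint transforms and skinning weights.
skin_w = rng.random(size=(N, J)); skin_w /= skin_w.sum(axis=1, keepdims=True)
joint_R = [np.eye(3)] * J                    # identity rotations in this toy example
joint_t = rng.normal(scale=0.05, size=(J, 3))

def warp(x):
    x = x @ R.T + t                                              # global rigid motion
    per_joint = np.stack([x @ jr.T + jt for jr, jt in zip(joint_R, joint_t)], axis=1)
    x = (skin_w[..., None] * per_joint).sum(axis=1)              # linear blend skinning
    return x + 0.01 * np.tanh(x)                                 # toy non-rigid residual

deformed = warp(canonical)                   # Gaussians posed at the current time for rendering
print(deformed.shape)
```

Because each stage here is a simple invertible-style map, warping back to the canonical shape is straightforward, which is the property the paper's invertible deformation networks exploit.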

Cool Papers

Click here to view paper screenshots

RA-NeRF: Robust Neural Radiance Field Reconstruction with Accurate Camera Pose Estimation under Complex Trajectories

Authors:Qingsong Yan, Qiang Wang, Kaiyong Zhao, Jie Chen, Bo Li, Xiaowen Chu, Fei Deng

Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.


Paper and Project Links

PDF IROS 2025

Summary

NeRF and 3DGS are powerful for 3D reconstruction and SLAM tasks, but their performance depends heavily on accurate camera poses. Existing methods introduce external constraints, yet their accuracy under complex camera trajectories remains unsatisfactory. This paper proposes RA-NeRF, which predicts accurate camera poses even under complex trajectories. Following an incremental pipeline, RA-NeRF reconstructs the scene with NeRF using photometric consistency and adds flow-driven pose regulation to improve robustness during initialization and localization. It further employs an implicit pose filter to capture the camera motion pattern and remove noise from pose estimation. Experiments on the Tanks&Temple and NeRFBuster datasets show that RA-NeRF achieves state-of-the-art camera pose estimation and visual quality, demonstrating its effectiveness and robustness for scene reconstruction under complex pose trajectories.

Key Takeaways

  1. NeRF and 3DGS are powerful for 3D reconstruction and SLAM tasks, but they depend on accurate camera poses.
  2. Existing methods address this with external constraints, but accuracy under complex trajectories remains limited.
  3. RA-NeRF predicts accurate camera poses even for complex trajectories.
  4. RA-NeRF reconstructs the scene with NeRF using photometric consistency and adds flow-driven pose regulation for robustness.
  5. An implicit pose filter removes noise from pose estimation and improves accuracy (see the sketch after this list).
  6. Experiments on the Tanks&Temple and NeRFBuster datasets validate RA-NeRF's effectiveness.
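
To give a feel for what filtering noisy pose estimates with a motion pattern means, here is a minimal sketch under assumptions (not RA-NeRF's actual implementation): a constant-velocity prediction plays the role of a learned pose filter, and a blend weight stands in for the regulation that pulls raw per-frame estimates toward the camera's motion pattern.

```python
# Toy pose smoothing: blend raw translations with a constant-velocity prediction.
import numpy as np

def filter_poses(raw_translations, blend=0.5):
    """Blend each raw camera translation with a constant-velocity prediction."""
    filtered = [raw_translations[0], raw_translations[1]]
    for i in range(2, len(raw_translations)):
        predicted = 2 * filtered[-1] - filtered[-2]              # constant-velocity motion model
        filtered.append(blend * raw_translations[i] + (1 - blend) * predicted)
    return np.stack(filtered)

rng = np.random.default_rng(0)
true_traj = np.stack([np.array([0.1 * k, np.sin(0.1 * k), 0.0]) for k in range(50)])
noisy = true_traj + rng.normal(scale=0.05, size=true_traj.shape)  # noisy pose priors
smoothed = filter_poses(noisy)

print("mean error before:", np.abs(noisy - true_traj).mean())
print("mean error after: ", np.abs(smoothed - true_traj).mean())
```

RA-NeRF itself couples this kind of regulation with optical flow and photometric consistency inside an incremental NeRF pipeline; the sketch only illustrates the noise-suppression idea on translations.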

Cool Papers

Click here to view paper screenshots

ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

Authors:Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan

Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency struggles to be accurately preserved in directly generated video frames from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are both detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.


Paper and Project Links

PDF Project page: https://liuff19.github.io/ReconX

Summary
This paper proposes ReconX, a new reconstruction method that reframes sparse-view 3D reconstruction as a temporal generation task and exploits the generative prior of a pre-trained video diffusion model. The method builds a global point cloud and encodes it into a contextual space as a 3D structure condition, which guides the diffusion model to synthesize video frames that preserve detail while remaining highly 3D-consistent. Finally, the 3D scene is recovered from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme.

Key Takeaways

  1. The paper proposes ReconX, a new 3D scene reconstruction method that reframes sparse-view reconstruction as a temporal generation task.
  2. ReconX exploits the generative prior of a pre-trained video diffusion model for 3D reconstruction.
  3. A global point cloud is built and encoded into a contextual space as the 3D structure condition, addressing the difficulty of preserving 3D view consistency in directly generated video frames.
  4. ReconX synthesizes video frames that both preserve detail and remain highly 3D-consistent.
  5. A confidence-aware 3D Gaussian Splatting optimization scheme recovers the 3D scene from the generated video (see the sketch after this list).
  6. Extensive experiments on various real-world datasets show that ReconX outperforms state-of-the-art methods in quality and generalizability.
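
One way to read "confidence-aware" optimization is that frames synthesized by the video diffusion model are not all equally reliable, so each frame's photometric loss is downweighted by a confidence score. The sketch below is an illustrative assumption, not ReconX's code: frame shapes, the confidence values, and the plain L2 loss are placeholders.

```python
# Toy confidence-weighted photometric objective for splat optimization.
import numpy as np

def confidence_weighted_loss(rendered, generated, confidence):
    """Per-frame L2 photometric loss, weighted by (assumed) per-frame confidence."""
    per_frame = ((rendered - generated) ** 2).mean(axis=(1, 2, 3))   # shape (F,)
    weights = confidence / confidence.sum()
    return float((weights * per_frame).sum())

rng = np.random.default_rng(0)
F, H, W = 8, 32, 32
generated = rng.random(size=(F, H, W, 3))                            # frames from the video diffusion model
rendered = generated + rng.normal(scale=0.02, size=generated.shape)  # current Gaussian Splatting render
confidence = np.linspace(1.0, 0.3, F)                                # e.g. frames far from the inputs trusted less

print(confidence_weighted_loss(rendered, generated, confidence))
```

In the full method this scalar would be minimized with respect to the Gaussian parameters behind `rendered`, so low-confidence generated frames pull the reconstruction less strongly.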

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!