NeRF

Published: 2025-10-02

Updated: 2025-11-27

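⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries for serious academic work; they are suitable only for initial screening before reading the papers.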

2025-10-02 Update

VGGT-X: When VGGT Meets Dense Novel View Synthesis

Authors: Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, Zhaoxiang Zhang

We study the problem of applying 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Despite significant progress in Novel View Synthesis powered by NeRF and 3DGS, current approaches remain reliant on accurate 3D attributes (e.g., camera poses and point clouds) acquired from Structure-from-Motion (SfM), which is often slow and fragile in low-texture or low-overlap captures. Recent 3DFMs showcase orders of magnitude speedup over the traditional pipeline and great potential for online NVS. But most of the validation and conclusions are confined to sparse-view settings. Our study reveals that naively scaling 3DFMs to dense views encounters two fundamental barriers: dramatically increasing VRAM burden and imperfect outputs that degrade initialization-sensitive 3D training. To address these barriers, we introduce VGGT-X, incorporating a memory-efficient VGGT implementation that scales to 1,000+ images, an adaptive global alignment for VGGT output enhancement, and robust 3DGS training practices. Extensive experiments show that these measures substantially close the fidelity gap with COLMAP-initialized pipelines, achieving state-of-the-art results in dense COLMAP-free NVS and pose estimation. Additionally, we analyze the causes of remaining gaps with COLMAP-initialized rendering, providing insights for the future development of 3D foundation models and dense NVS. Our project page is available at https://dekuliutesla.github.io/vggt-x.github.io/

Summary

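This paper studies the application of 3D Foundation Models (3DFMs) to dense Novel View Synthesis (NVS). Existing methods depend on 3D attributes (such as camera poses and point clouds) obtained from Structure-from-Motion (SfM), which is slow and error-prone in low-texture or low-overlap captures. To address this, the paper proposes VGGT-X, which combines a memory-efficient VGGT implementation, adaptive global alignment, and robust 3DGS training practices to tackle the growing VRAM burden and the imperfect outputs that hurt initialization-sensitive training. The method achieves state-of-the-art results in dense COLMAP-free NVS and pose estimation. The paper also analyzes the remaining gap with COLMAP-initialized rendering, offering insights for future work on 3D foundation models and dense NVS.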
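The summary mentions pose-estimation results without saying how poses are scored. As a rough illustration only, the sketch below computes the geodesic angle between two rotation matrices, a common way to compare an estimated camera rotation against a reference such as a COLMAP pose. It is usually applied to relative rotations between image pairs so that the arbitrary global frame cancels out. The function names `rotation_error_deg` and `rot_z` are hypothetical, and nothing here is taken from the VGGT-X paper or its code.

```python
import math


def rotation_error_deg(R_est, R_ref):
    """Geodesic angle in degrees between two 3x3 rotation matrices (nested lists)."""
    # trace(R_est^T @ R_ref) equals the sum of element-wise products.
    trace = sum(R_est[i][j] * R_ref[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def rot_z(deg):
    """Hypothetical helper: rotation about the z-axis by `deg` degrees."""
    a = math.radians(deg)
    return [
        [math.cos(a), -math.sin(a), 0.0],
        [math.sin(a), math.cos(a), 0.0],
        [0.0, 0.0, 1.0],
    ]


if __name__ == "__main__":
    # An estimate that is off by a 30-degree rotation about z.
    print(rotation_error_deg(rot_z(30.0), rot_z(0.0)))
```

A smaller angle means the estimated rotation is closer to the reference; translation error needs a separate measure and is not covered by this sketch.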

Key Takeaways