
3DGS


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use these summaries in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-10-18

Terra: Explorable Native 3D World Model with Point Latents

Authors:Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu

World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency and diminish the modeling efficiency of world models. In this paper, we present Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. Specifically, we propose a novel point-to-Gaussian variational autoencoder (P2G-VAE) that encodes 3D inputs into a latent point representation, which is subsequently decoded as 3D Gaussian primitives to jointly model geometry and appearance. We then introduce a sparse point flow matching network (SPFlow) for generating the latent point representation, which simultaneously denoises the positions and features of the point latents. Our Terra enables exact multi-view consistency with native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process. Furthermore, Terra achieves explorable world modeling through progressive generation in the point latent space. We conduct extensive experiments on the challenging indoor scenes from ScanNet v2. Terra achieves state-of-the-art performance in both reconstruction and generation with high 3D consistency.


Paper and Project Links

PDF Project Page: https://huang-yh.github.io/terra/

Summary

This paper presents Terra, a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space. By introducing P2G-VAE (a point-to-Gaussian variational autoencoder) and SPFlow (a sparse point flow matching network), Terra jointly models geometry and appearance in the 3D latent space and supports flexible rendering and progressive generation. On indoor scenes from ScanNet v2, Terra achieves excellent reconstruction and generation performance with high 3D consistency.

Key Takeaways

  1. World models are attracting growing attention for comprehensive modeling of the real world, but most existing methods neglect the inherent 3D nature of the physical world, which can undermine 3D consistency and modeling efficiency.
  2. Terra is a native 3D world model that represents and generates explorable environments in an intrinsic 3D latent space.
  3. P2G-VAE encodes 3D inputs into a latent point representation, which is then decoded into 3D Gaussian primitives to jointly model geometry and appearance.
  4. SPFlow generates the latent point representation by simultaneously denoising the positions and features of the point latents (a minimal code sketch follows this list).
  5. Terra achieves exact multi-view consistency through its native 3D representation and architecture, and supports flexible rendering from any viewpoint with only a single generation process.
  6. Terra enables explorable world modeling through progressive generation in the point latent space.
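The paper does not give implementation details, but the idea of jointly denoising the positions and features of point latents with flow matching can be made concrete with a minimal sketch. The `velocity_net` interface, latent sizes, and Euler schedule below are illustrative assumptions, not SPFlow's actual architecture.

```python
import torch

def sample_point_latents(velocity_net, num_points=2048, feat_dim=32, steps=50, device="cpu"):
    """Minimal rectified-flow style sampler over a point latent (positions + features).

    velocity_net(x, t) -> predicted velocity with the same shape as x.
    This is an illustrative sketch, not the SPFlow architecture from the paper.
    """
    # Start from Gaussian noise for both 3D positions and per-point features.
    x = torch.randn(num_points, 3 + feat_dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((num_points, 1), i * dt, device=device)
        v = velocity_net(x, t)          # joint velocity for positions and features
        x = x + dt * v                  # Euler step from noise (t=0) toward data (t=1)
    positions, features = x[:, :3], x[:, 3:]
    return positions, features          # decoded later into 3D Gaussians by the VAE decoder

# Example with a dummy velocity network (a real model would be a sparse point network):
net = lambda x, t: -x
pos, feat = sample_point_latents(net, num_points=16, feat_dim=8, steps=10)
print(pos.shape, feat.shape)            # torch.Size([16, 3]) torch.Size([16, 8])
```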


Leveraging Learned Image Prior for 3D Gaussian Compression

Authors:Seungjoo Shin, Jaesik Park, Sunghyun Cho

Compression techniques for 3D Gaussian Splatting (3DGS) have recently achieved considerable success in minimizing storage overhead for 3D Gaussians while preserving high rendering quality. Despite the impressive storage reduction, the lack of learned priors restricts further advances in the rate-distortion trade-off for 3DGS compression tasks. To address this, we introduce a novel 3DGS compression framework that leverages the powerful representational capacity of learned image priors to recover compression-induced quality degradation. Built upon initially compressed Gaussians, our restoration network effectively models the compression artifacts in the image space between degraded and original Gaussians. To enhance the rate-distortion performance, we provide coarse rendering residuals into the restoration network as side information. By leveraging the supervision of restored images, the compressed Gaussians are refined, resulting in a highly compact representation with enhanced rendering performance. Our framework is designed to be compatible with existing Gaussian compression methods, making it broadly applicable across different baselines. Extensive experiments validate the effectiveness of our framework, demonstrating superior rate-distortion performance and outperforming the rendering quality of state-of-the-art 3DGS compression methods while requiring substantially less storage.


Paper and Project Links

PDF Accepted to ICCV 2025 Workshop on ECLR

Summary

This work presents a 3DGS compression framework that leverages learned image priors to recover compression-induced quality degradation. A restoration network built on top of initially compressed Gaussians models the compression artifacts in image space, with coarse rendering residuals provided as side information to improve rate-distortion performance. Supervised by the restored images, the compressed Gaussians are further refined into a highly compact representation with enhanced rendering quality. The framework is compatible with existing Gaussian compression methods and, in experiments, outperforms state-of-the-art 3DGS compression approaches in rate-distortion performance and rendering quality while requiring substantially less storage.

Key Takeaways

  • Introduces learned image priors to improve 3D Gaussian Splatting (3DGS) compression.
  • Models the quality degradation of initially compressed Gaussians to improve both storage efficiency and rendering quality.
  • A restoration network built on the compressed Gaussians effectively handles compression-induced artifacts in image space.
  • Coarse rendering residuals serve as side information to enhance rate-distortion performance.
  • Compatible with existing Gaussian compression methods, making it broadly applicable across baselines.
  • Experiments validate superior rate-distortion performance and rendering quality at lower storage cost.


GauSSmart: Enhanced 3D Reconstruction through 2D Foundation Models and Geometric Filtering

Authors:Alexander Valverde, Brian Xu, Yuyin Zhou, Meng Xu, Hongyun Wang

Scene reconstruction has emerged as a central challenge in computer vision, with approaches such as Neural Radiance Fields (NeRF) and Gaussian Splatting achieving remarkable progress. While Gaussian Splatting demonstrates strong performance on large-scale datasets, it often struggles to capture fine details or maintain realism in regions with sparse coverage, largely due to the inherent limitations of sparse 3D training data. In this work, we propose GauSSmart, a hybrid method that effectively bridges 2D foundational models and 3D Gaussian Splatting reconstruction. Our approach integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundational models such as DINO, to enhance Gaussian-based scene reconstruction. By leveraging 2D segmentation priors and high-dimensional feature embeddings, our method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. We validate our approach across three datasets, where GauSSmart consistently outperforms existing Gaussian Splatting in the majority of evaluated scenes. Our results demonstrate the significant potential of hybrid 2D-3D approaches, highlighting how the thoughtful combination of 2D foundational models with 3D reconstruction pipelines can overcome the limitations inherent in either approach alone.


Paper and Project Links

PDF

Summary

GauSSmart is a hybrid method that bridges 2D foundation models and 3D Gaussian Splatting reconstruction. By integrating established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundation models such as DINO, it enhances Gaussian-based scene reconstruction. Leveraging 2D segmentation priors and high-dimensional feature embeddings, the method guides the densification and refinement of Gaussian splats, improving coverage in underrepresented areas and preserving intricate structural details. Across three datasets, GauSSmart outperforms existing Gaussian Splatting in the majority of evaluated scenes, demonstrating the potential of combining 2D foundation models with 3D reconstruction pipelines.

Key Takeaways

  1. Gaussian Splatting performs well on large-scale datasets but struggles to capture fine details and maintain realism in sparsely covered regions, largely due to the inherent limitations of sparse 3D training data.
  2. GauSSmart is a hybrid method that bridges 2D foundation models and 3D Gaussian Splatting reconstruction.
  3. It integrates established 2D computer vision techniques, including convex filtering and semantic feature supervision from foundation models such as DINO (see the sketch after this list).
  4. Using 2D segmentation priors and high-dimensional feature embeddings, GauSSmart guides the densification and refinement of Gaussian splats.
  5. Validation on three datasets shows that GauSSmart outperforms standard Gaussian Splatting in most evaluated scenes.
  6. The results highlight the potential of hybrid 2D-3D approaches: combining 2D foundation models with 3D reconstruction pipelines overcomes the limitations of either approach alone.
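The exact convex filtering used by GauSSmart is not specified in the abstract; the sketch below shows one plausible geometric reading, keeping only Gaussian centers that lie inside the convex hull of a trusted reference point set (e.g., SfM points). The function name and the reference cloud are illustrative assumptions.

```python
import numpy as np
from scipy.spatial import Delaunay

def convex_filter(gaussian_centers, reference_points):
    """Keep only Gaussian centers that fall inside the convex hull of a trusted
    point set (e.g. SfM points). A simple geometric filter in the spirit of the
    paper's convex filtering; the exact criterion used by GauSSmart may differ.
    """
    hull = Delaunay(reference_points)                    # triangulated convex hull
    inside = hull.find_simplex(gaussian_centers) >= 0    # -1 means outside the hull
    return gaussian_centers[inside], inside

# Toy usage: reference cloud is a unit cube, candidates are scattered more widely.
ref = np.random.rand(500, 3)
cand = np.random.rand(200, 3) * 2.0 - 0.5
kept, mask = convex_filter(cand, ref)
print(f"kept {mask.sum()} of {len(cand)} Gaussians")
```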


Virtually Being: Customizing Camera-Controllable Video Diffusion Models with Multi-View Performance Captures

Authors:Yuancheng Xu, Wenqi Xian, Li Ma, Julien Philip, Ahmet Levent Taşel, Yiwei Zhao, Ryan Burgert, Mingming He, Oliver Hermann, Oliver Pilarski, Rahul Garg, Paul Debevec, Ning Yu

We introduce a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. We train the character consistency component with recorded volumetric capture performances re-rendered with diverse camera trajectories via 4D Gaussian Splatting (4DGS), lighting variability obtained with a video relighting model. We fine-tune state-of-the-art open-source video diffusion models on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. Our framework also supports core capabilities for virtual production, including multi-subject generation using two approaches: joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time; it also achieves scene and real-life video customization as well as control over motion and spatial layout during customization. Extensive experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production. Our project page is available at: https://eyeline-labs.github.io/Virtually-Being.


Paper and Project Links

PDF Accepted to SIGGRAPH Asia 2025

Summary

This paper introduces a framework that enables both multi-view character consistency and 3D camera control in video diffusion models through a novel customization data pipeline. The character consistency component is trained on volumetric capture performances re-rendered along diverse camera trajectories via 4D Gaussian Splatting (4DGS), combined with lighting variability obtained from a video relighting model. State-of-the-art open-source video diffusion models are fine-tuned on this data to provide strong multi-view identity preservation, precise camera control, and lighting adaptability. The framework also supports core virtual production capabilities, including multi-subject generation via joint training and noise blending (the latter enabling efficient composition of independently customized models at inference time), scene and real-life video customization, and control over motion and spatial layout during customization. Experiments show improved video quality, higher personalization accuracy, and enhanced camera control and lighting adaptability, advancing the integration of video generation into virtual production.

Key Takeaways

  1. Introduces a framework that enables both multi-view character consistency and 3D camera control in video diffusion models.
  2. The character consistency component is trained on volumetric captures re-rendered via 4D Gaussian Splatting (4DGS) with lighting variability from a video relighting model.
  3. Open-source video diffusion models are fine-tuned to provide multi-view identity preservation, camera control, and lighting adaptability.
  4. The framework supports core virtual production capabilities, including multi-subject generation and customization of scenes and real-life videos.
  5. Two approaches to multi-subject generation are used, joint training and noise blending, the latter enabling efficient composition of independently customized models at inference time (see the sketch after this list).
  6. The framework supports control over motion and spatial layout during customization.
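The abstract names noise blending as a way to compose independently customized models at inference time without specifying the weighting scheme. The sketch below is a minimal illustrative reading, blending the noise predictions of stand-in models with scalar weights; per-region masks or other schedules may be what the paper actually uses.

```python
import torch

def blended_denoise_step(models, weights, x_t, t, cond_list):
    """Blend the noise predictions of independently customized diffusion models
    at inference time. This is an illustrative reading of the paper's
    "noise blending" for multi-subject generation; the actual weighting scheme
    (e.g. per-region masks) may differ.
    """
    eps = torch.zeros_like(x_t)
    for model, w, cond in zip(models, weights, cond_list):
        eps = eps + w * model(x_t, t, cond)   # each model sees its own subject conditioning
    return eps                                 # plugged into the usual sampler update

# Dummy usage with stand-in "models":
f1 = lambda x, t, c: x * 0.1
f2 = lambda x, t, c: -x * 0.1
x = torch.randn(1, 4, 8, 8)
eps = blended_denoise_step([f1, f2], [0.5, 0.5], x, t=10, cond_list=[None, None])
print(eps.shape)
```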


Instant Skinned Gaussian Avatars for Web, Mobile and VR Applications

Authors:Naruya Kondo, Yuto Asano, Yoichi Ochiai

We present Instant Skinned Gaussian Avatars, a real-time and cross-platform 3D avatar system. Many approaches have been proposed to animate Gaussian Splatting, but they often require camera arrays, long preprocessing times, or high-end GPUs. Some methods attempt to convert Gaussian Splatting into mesh-based representations, achieving lightweight performance but sacrificing visual fidelity. In contrast, our system efficiently animates Gaussian Splatting by leveraging parallel splat-wise processing to dynamically follow the underlying skinned mesh in real time while preserving high visual fidelity. From smartphone-based 3D scanning to on-device preprocessing, the entire process takes just around five minutes, with the avatar generation step itself completed in only about 30 seconds. Our system enables users to instantly transform their real-world appearance into a 3D avatar, making it ideal for seamless integration with social media and metaverse applications. Website: https://sites.google.com/view/gaussian-vrm


Paper and Project Links

PDF Accepted to SUI 2025 Demo Track

Summary

Instant Skinned Gaussian Avatars is a real-time, cross-platform 3D avatar system. It animates Gaussian Splatting efficiently by leveraging parallel splat-wise processing to dynamically follow the underlying skinned mesh in real time while preserving high visual fidelity. From smartphone-based 3D scanning to on-device preprocessing, the entire pipeline takes only about five minutes, with the avatar generation step itself completed in roughly 30 seconds. The system lets users instantly turn their real-world appearance into a 3D avatar, making it well suited for seamless integration with social media and metaverse applications.

Key Takeaways

  1. Introduces Instant Skinned Gaussian Avatars, a system for creating real-time, cross-platform 3D avatars.
  2. Many existing approaches to animating Gaussian Splatting require camera arrays, long preprocessing times, or high-end GPUs.
  3. Some methods convert Gaussian Splatting into mesh-based representations, achieving lightweight performance but sacrificing visual fidelity.
  4. The system uses parallel splat-wise processing to follow the underlying skinned mesh in real time while preserving high visual fidelity (a sketch of splat-wise skinning follows this list).
  5. The full pipeline, from smartphone-based 3D scanning to on-device preprocessing, takes only about five minutes.
  6. The avatar generation step itself is fast, completing in roughly 30 seconds.
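Splat-wise following of a skinned mesh is naturally expressed as linear blend skinning (LBS) applied to every Gaussian in parallel. The sketch below shows LBS on Gaussian centers only (rotations and scales would be updated with the blended 3x3 blocks in the same way); the data layout is an assumption for illustration, not the system's actual implementation.

```python
import numpy as np

def lbs_transform_splats(means, skin_weights, bone_transforms):
    """Apply linear blend skinning to Gaussian centers in parallel (splat-wise).

    means:           (N, 3) canonical Gaussian centers
    skin_weights:    (N, B) per-splat bone weights (rows sum to 1)
    bone_transforms: (B, 4, 4) current bone transforms (canonical -> posed)
    """
    # Blend the 4x4 bone matrices per splat: (N, 4, 4)
    blended = np.einsum("nb,bij->nij", skin_weights, bone_transforms)
    homo = np.concatenate([means, np.ones((means.shape[0], 1))], axis=1)   # (N, 4)
    posed = np.einsum("nij,nj->ni", blended, homo)[:, :3]
    return posed

# Toy example: two bones, identity and a translation along y.
T = np.stack([np.eye(4), np.eye(4)])
T[1, :3, 3] = [0.0, 1.0, 0.0]
means = np.zeros((4, 3))
w = np.array([[1.0, 0.0], [0.5, 0.5], [0.25, 0.75], [0.0, 1.0]])
print(lbs_transform_splats(means, w, T))   # centers move by 0, 0.5, 0.75, 1.0 along y
```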


FlashWorld: High-quality 3D Scene Generation within Seconds

Authors:Xinyang Li, Tengfei Wang, Zixiao Gu, Shengchuan Zhang, Chunchao Guo, Liujuan Cao

We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while possessing superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented method typically suffers poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose a cross-mode post-training distillation by matching distribution from consistent 3D-oriented mode to high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the required denoising steps for inference. Also, we propose a strategy to leverage massive single-view images and text prompts during this process to enhance the model’s generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.


Paper and Project Links

PDF Project Page: https://imlixinyang.github.io/FlashWorld-Project-Page/

Summary

FlashWorld is a generative model that produces 3D scenes from a single image or text prompt in seconds, 10-100x faster than previous work while achieving superior rendering quality. It shifts from the conventional multi-view-oriented (MV-oriented) paradigm to a 3D-oriented approach that directly produces 3D Gaussian representations during multi-view generation. To obtain both visual quality and 3D consistency, FlashWorld combines a dual-mode pre-training phase with a cross-mode post-training phase: leveraging the prior of a video diffusion model, it first pre-trains a dual-mode multi-view diffusion model supporting both MV-oriented and 3D-oriented generation, and a cross-mode post-training distillation then matches the distribution of the consistent 3D-oriented mode to the high-quality MV-oriented mode, improving visual quality while reducing the denoising steps needed at inference. Massive single-view images and text prompts are further used to improve generalization to out-of-distribution inputs. Experiments demonstrate the superiority and efficiency of the method.

Key Takeaways

  1. FlashWorld generates 3D scenes from a single image or text prompt in seconds, 10-100x faster than previous methods.
  2. It moves away from the conventional multi-view-oriented paradigm and directly produces 3D Gaussian representations, improving efficiency and rendering quality.
  3. Dual-mode pre-training and cross-mode post-training combine the strengths of both paradigms, improving visual quality while maintaining 3D consistency.
  4. Pre-training leverages the prior of a video diffusion model to strengthen the model.
  5. Cross-mode post-training distillation matches the distribution of the 3D-oriented mode to the high-quality MV-oriented mode, further improving 3D-oriented generation.
  6. Massive single-view images and text prompts are leveraged to improve generalization to out-of-distribution inputs.


VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

Authors:Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler

The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as “generator” with the geometric abilities of a recent (feedforward) 3D reconstruction system as “decoder”. We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.


Paper and Project Links

PDF Project page: https://gohyojun15.github.io/VIST3A/

Summary

VIST3A is a general framework that combines the strengths of a latent text-to-video generator with those of a feedforward 3D reconstruction model, addressing two main challenges. First, the two components are joined via model stitching: the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator is identified and the two parts are stitched together, requiring only a small dataset and no labels. Second, the text-to-video generator is aligned with the stitched 3D decoder through direct reward finetuning, ensuring that generated latents decode into consistent, perceptually convincing 3D scene geometry. Experiments with different video generators and 3D reconstruction models show that VIST3A markedly improves over prior text-to-3D models that output Gaussian splats.

Key Takeaways

  1. VIST3A combines the strengths of text-to-video generation models and 3D reconstruction systems.
  2. Model stitching joins the two components while preserving the knowledge encoded in their weights, requiring only a small dataset and no labels (see the sketch after this list).
  3. The text-to-video generator is aligned with the stitched 3D decoder, via direct reward finetuning, so that generated latents decode into consistent 3D scene geometry.
  4. VIST3A markedly improves over prior text-to-3D models that output Gaussian splats.
  5. With a suitable 3D base model, the framework also enables high-quality text-to-pointmap generation.
  6. The framework bridges the gap between text-driven generation and 3D reconstruction.
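Model stitching requires picking the decoder layer whose activations are best explained by the generator's latents. A minimal way to score candidate layers is a least-squares linear probe, sketched below; the scoring criterion and adapter form actually used by VIST3A may differ.

```python
import torch

def stitching_score(gen_latents, layer_acts):
    """Fit a linear map from generator latents to a candidate decoder layer's
    activations and return the residual error. Lower means a better stitching point.
    gen_latents: (N, d_g), layer_acts: (N, d_l). Illustrative of the layer-matching
    idea only; VIST3A's actual criterion may differ.
    """
    W = torch.linalg.lstsq(gen_latents, layer_acts).solution   # least-squares probe
    pred = gen_latents @ W
    return ((pred - layer_acts) ** 2).mean().item(), W

# Choose the best layer among candidates and keep its adapter W as the "stitch".
N, d_g = 256, 64
gen = torch.randn(N, d_g)
candidates = {f"layer{i}": torch.randn(N, 128) for i in range(3)}
scores = {name: stitching_score(gen, acts)[0] for name, acts in candidates.items()}
best = min(scores, key=scores.get)
print("stitch at", best, scores)
```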


Leveraging 2D Priors and SDF Guidance for Dynamic Urban Scene Rendering

Authors:Siddharth Tourani, Jayaram Reddy, Akash Kumbar, Satyajit Tourani, Nishant Goyal, Madhava Krishna, N. Dinesh Reddy, Muhammad Haris Khan

Dynamic scene rendering and reconstruction play a crucial role in computer vision and augmented reality. Recent methods based on 3D Gaussian Splatting (3DGS), have enabled accurate modeling of dynamic urban scenes, but for urban scenes they require both camera and LiDAR data, ground-truth 3D segmentations and motion data in the form of tracklets or pre-defined object templates such as SMPL. In this work, we explore whether a combination of 2D object agnostic priors in the form of depth and point tracking coupled with a signed distance function (SDF) representation for dynamic objects can be used to relax some of these requirements. We present a novel approach that integrates Signed Distance Functions (SDFs) with 3D Gaussian Splatting (3DGS) to create a more robust object representation by harnessing the strengths of both methods. Our unified optimization framework enhances the geometric accuracy of 3D Gaussian splatting and improves deformation modeling within the SDF, resulting in a more adaptable and precise representation. We demonstrate that our method achieves state-of-the-art performance in rendering metrics even without LiDAR data on urban scenes. When incorporating LiDAR, our approach improved further in reconstructing and generating novel views across diverse object categories, without ground-truth 3D motion annotation. Additionally, our method enables various scene editing tasks, including scene decomposition, and scene composition.


Paper and Project Links

PDF Accepted at ICCV-2025, project page: https://dynamic-ugsdf.github.io/

Summary

Recent 3D Gaussian Splatting (3DGS) methods enable accurate modeling of dynamic urban scenes but require camera and LiDAR data, ground-truth 3D segmentations, and motion data. This work explores whether 2D object-agnostic priors, in the form of depth and point tracking, combined with a signed distance function (SDF) representation for dynamic objects can relax some of these requirements. Integrating SDFs with 3DGS creates a more robust object representation: the unified optimization framework improves the geometric accuracy of the Gaussians and the deformation modeling within the SDF. The method achieves state-of-the-art rendering metrics on urban scenes even without LiDAR data; when LiDAR is incorporated, it further improves reconstruction and novel view synthesis across diverse object categories without ground-truth 3D motion annotation, and it also enables scene editing tasks such as scene decomposition and composition.

Key Takeaways

  1. 3DGS plays a key role in dynamic scene rendering and reconstruction.
  2. Combining 2D object-agnostic priors (depth and point tracking) with an SDF representation for dynamic objects relaxes the need for specific data and supervision.
  3. Integrating SDFs with 3DGS yields a more robust object representation.
  4. A unified optimization framework improves geometric accuracy and deformation modeling.
  5. The method reaches state-of-the-art rendering on urban scenes even without LiDAR data.
  6. When LiDAR data is incorporated, reconstruction and novel view synthesis across diverse object categories improve further.


STT-GS: Sample-Then-Transmit Edge Gaussian Splatting with Joint Client Selection and Power Control

Authors:Zhen Li, Xibin Jin, Guoliang Li, Shuai Wang, Miaowen Wen, Huseyin Arslan, Derrick Wing Kwan Ng, Chengzhong Xu

Edge Gaussian splatting (EGS), which aggregates data from distributed clients and trains a global GS model at the edge server, is an emerging paradigm for scene reconstruction. Unlike traditional edge resource management methods that emphasize communication throughput or general-purpose learning performance, EGS explicitly aims to maximize the GS qualities, rendering existing approaches inapplicable. To address this problem, this paper formulates a novel GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. However, evaluating this function in turn requires clients’ images, leading to a causality dilemma. To this end, this paper further proposes a sample-then-transmit EGS (or STT-GS for short) strategy, which first samples a subset of images as pilot data from each client for loss prediction. Based on the first-stage evaluation, communication resources are then prioritized towards more valuable clients. To achieve efficient sampling, a feature-domain clustering (FDC) scheme is proposed to select the most representative data and pilot transmission time minimization (PTTM) is adopted to reduce the pilot overhead.Subsequently, we develop a joint client selection and power control (JCSPC) framework to maximize the GS-oriented function under communication resource constraints. Despite the nonconvexity of the problem, we propose a low-complexity efficient solution based on the penalty alternating majorization minimization (PAMM) algorithm. Experiments unveil that the proposed scheme significantly outperforms existing benchmarks on real-world datasets. It is found that the GS-oriented objective can be accurately predicted with low sampling ratios (e.g.,10%), and our method achieves an excellent tradeoff between view contributions and communication costs.


Paper and Project Links

PDF

Summary

Edge Gaussian splatting (EGS), which aggregates data from distributed clients and trains a global GS model at the edge server, is an emerging paradigm for scene reconstruction. Because existing edge resource management methods do not directly maximize GS quality, this paper formulates a GS-oriented objective function that distinguishes the heterogeneous view contributions of different clients. Evaluating this objective requires the clients' images, creating a causality dilemma, so a sample-then-transmit (STT-GS) strategy is proposed: a subset of images is first sampled from each client as pilot data for loss prediction, and communication resources are then prioritized toward the more valuable clients. Feature-domain clustering (FDC) selects the most representative pilot data, and pilot transmission time minimization (PTTM) reduces the pilot overhead. A joint client selection and power control (JCSPC) framework then maximizes the GS-oriented objective under communication resource constraints, solved with a low-complexity penalty alternating majorization minimization (PAMM) algorithm despite the nonconvexity of the problem. Experiments on real-world datasets show that the scheme significantly outperforms existing benchmarks: the GS-oriented objective can be predicted accurately at low sampling ratios (e.g., 10%), and the method achieves an excellent trade-off between view contributions and communication costs.

Key Takeaways

  1. EGS is an emerging scene reconstruction paradigm in which data from distributed clients is aggregated to train a global GS model at the edge server.
  2. Existing edge resource management methods do not maximize GS quality, so a GS-oriented objective function is formulated.
  3. Evaluating the objective requires client images, creating a causality dilemma, which motivates the sample-then-transmit (STT-GS) strategy.
  4. Feature-domain clustering (FDC) selects the most representative pilot data for efficient sampling (see the sketch after this list).
  5. A joint client selection and power control (JCSPC) framework maximizes the GS-oriented objective under communication resource constraints.
  6. The proposed solution is based on the penalty alternating majorization minimization (PAMM) algorithm and has low complexity.
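As a rough illustration of FDC-style pilot selection, the sketch below clusters per-image feature vectors with k-means and keeps the image closest to each centroid. The feature extractor, the number of clusters, and the exact FDC objective are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_pilot_images(features, sampling_ratio=0.1, seed=0):
    """Feature-domain clustering (FDC)-style pilot selection: cluster per-image
    features and keep the image closest to each centroid. The feature extractor
    and the exact FDC objective from the paper are not specified here.
    features: (num_images, feat_dim) array.
    """
    k = max(1, int(round(len(features) * sampling_ratio)))
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
    pilots = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        d = np.linalg.norm(features[idx] - km.cluster_centers_[c], axis=1)
        pilots.append(idx[np.argmin(d)])       # most representative image of the cluster
    return sorted(pilots)

feats = np.random.rand(100, 16)
print(select_pilot_images(feats, sampling_ratio=0.1))   # ~10 pilot images out of 100
```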


SimULi: Real-Time LiDAR and Camera Simulation with Unscented Transforms

Authors:Haithem Turki, Qi Wu, Xin Kang, Janick Martinez Esturo, Shengyu Huang, Ruilong Li, Zan Gojcic, Riccardo de Lutio

Rigorous testing of autonomous robots, such as self-driving vehicles, is essential to ensure their safety in real-world deployments. This requires building high-fidelity simulators to test scenarios beyond those that can be safely or exhaustively collected in the real-world. Existing neural rendering methods based on NeRF and 3DGS hold promise but suffer from low rendering speeds or can only render pinhole camera models, hindering their suitability to applications that commonly require high-distortion lenses and LiDAR data. Multi-sensor simulation poses additional challenges as existing methods handle cross-sensor inconsistencies by favoring the quality of one modality at the expense of others. To overcome these limitations, we propose SimULi, the first method capable of rendering arbitrary camera models and LiDAR data in real-time. Our method extends 3DGUT, which natively supports complex camera models, with LiDAR support, via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, we design a factorized 3D Gaussian representation and anchoring strategy that reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work (and handles a wider range of camera models). When evaluated on two widely benchmarked autonomous driving datasets, SimULi matches or exceeds the fidelity of existing state-of-the-art methods across numerous camera and LiDAR metrics.


Paper and Project Links

PDF Project page: https://research.nvidia.com/labs/sil/projects/simuli

Summary

Rigorous testing of autonomous robots requires high-fidelity simulators, but existing NeRF- and 3DGS-based neural rendering methods either render slowly or support only pinhole camera models, limiting their use with high-distortion lenses and LiDAR. SimULi is the first method capable of rendering arbitrary camera models and LiDAR data in real time. It extends 3DGUT, which natively supports complex camera models, with LiDAR support via an automated tiling strategy for arbitrary spinning LiDAR models and ray-based culling. To address cross-sensor inconsistencies, a factorized 3D Gaussian representation and anchoring strategy reduces mean camera and depth error by up to 40% compared to existing methods. SimULi renders 10-20x faster than ray tracing approaches and 1.5-10x faster than prior rasterization-based work while handling a wider range of camera models, and it matches or exceeds the fidelity of state-of-the-art methods across numerous camera and LiDAR metrics on two widely benchmarked autonomous driving datasets.

Key Takeaways

  1. Rigorous testing of autonomous robots is essential to ensure their safety in real-world deployments.
  2. High-fidelity simulators enable testing of scenarios beyond those that can be safely or exhaustively collected in the real world.
  3. Existing neural rendering methods suffer from low rendering speeds or can only render pinhole camera models.
  4. SimULi renders arbitrary camera models and LiDAR data in real time by extending 3DGUT (a sketch of the underlying unscented transform follows this list).
  5. SimULi supports LiDAR through an automated tiling strategy and ray-based culling, and addresses cross-sensor inconsistencies.
  6. A factorized 3D Gaussian representation and anchoring strategy reduces mean camera and depth error by up to 40%.
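SimULi builds on 3DGUT, whose central idea is propagating Gaussians through nonlinear (possibly distorted) camera models with the unscented transform rather than a linearized projection. The sketch below shows generic sigma-point propagation of a single 3D Gaussian through a hypothetical distortion camera; it illustrates the idea only and is not SimULi's exact formulation.

```python
import numpy as np

def unscented_project(mean, cov, project, alpha=1.0, beta=2.0, kappa=0.0):
    """Propagate a 3D Gaussian through a nonlinear camera projection with the
    unscented transform. `project` maps (3,) world points to (2,) pixels and may
    include lens distortion. Illustrative of the sigma-point idea behind
    3DGUT/SimULi, not their exact formulation.
    """
    n = mean.shape[0]
    lam = alpha ** 2 * (n + kappa) - n
    sqrt_cov = np.linalg.cholesky((n + lam) * cov)
    sigma = [mean] + [mean + sqrt_cov[:, i] for i in range(n)] + \
            [mean - sqrt_cov[:, i] for i in range(n)]
    wm = np.full(2 * n + 1, 1.0 / (2 * (n + lam)))
    wm[0] = lam / (n + lam)
    wc = wm.copy()
    wc[0] += 1.0 - alpha ** 2 + beta
    pts = np.array([project(s) for s in sigma])               # (2n+1, 2)
    mu = (wm[:, None] * pts).sum(0)
    diff = pts - mu
    cov2d = (wc[:, None, None] * np.einsum("ni,nj->nij", diff, diff)).sum(0)
    return mu, cov2d                                           # 2D mean and covariance of the splat

# Toy pinhole-with-radial-distortion projection (hypothetical intrinsics).
def project(p, f=500.0):
    x, y = p[0] / p[2], p[1] / p[2]
    d = 1.0 + 0.1 * (x * x + y * y)                            # simple radial distortion
    return np.array([f * x * d, f * y * d])

mu2d, cov2d = unscented_project(np.array([0.2, 0.1, 2.0]), np.eye(3) * 0.01, project)
print(mu2d, "\n", cov2d)
```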


BSGS: Bi-stage 3D Gaussian Splatting for Camera Motion Deblurring

Authors:An Zhao, Piaopiao Yu, Zhe Zhu, Mingqiang Wei

3D Gaussian Splatting has exhibited remarkable capabilities in 3D scene reconstruction.However, reconstructing high-quality 3D scenes from motion-blurred images caused by camera motion poses a significant challenge.The performance of existing 3DGS-based deblurring methods are limited due to their inherent mechanisms, such as extreme dependence on the accuracy of camera poses and inability to effectively control erroneous Gaussian primitives densification caused by motion blur.To solve these problems, we introduce a novel framework, Bi-Stage 3D Gaussian Splatting, to accurately reconstruct 3D scenes from motion-blurred images.BSGS contains two stages. First, Camera Pose Refinement roughly optimizes camera poses to reduce motion-induced distortions. Second, with fixed rough camera poses, Global RigidTransformation further corrects motion-induced blur distortions.To alleviate multi-subframe gradient conflicts, we propose a subframe gradient aggregation strategy to optimize both stages.Furthermore, a space-time bi-stage optimization strategy is introduced to dynamically adjust primitive densification thresholds and prevent premature noisy Gaussian generation in blurred regions. Comprehensive experiments verify the effectiveness of our proposed deblurring method and show its superiority over the state of the arts.


Paper and Project Links

PDF

Summary

3D Gaussian Splatting performs well in 3D scene reconstruction, but reconstructing high-quality scenes from motion-blurred images caused by camera motion remains challenging: existing 3DGS-based deblurring methods depend heavily on accurate camera poses and cannot effectively control the erroneous densification of Gaussian primitives caused by motion blur. Bi-Stage 3D Gaussian Splatting (BSGS) addresses this with two stages: Camera Pose Refinement first roughly optimizes camera poses to reduce motion-induced distortions, and Global Rigid Transformation then further corrects motion-induced blur with the rough poses fixed. A subframe gradient aggregation strategy alleviates multi-subframe gradient conflicts in both stages, and a space-time bi-stage optimization strategy dynamically adjusts densification thresholds to prevent premature noisy Gaussians in blurred regions. Comprehensive experiments verify the effectiveness of the deblurring method and show its superiority over the state of the art.

Key Takeaways

  1. Reconstructing 3D scenes from motion-blurred images caused by camera motion is a major challenge for 3DGS-based deblurring.
  2. The new framework, Bi-Stage 3D Gaussian Splatting (BSGS), is designed to overcome the limitations of existing methods.
  3. BSGS has two stages: the first optimizes camera poses to reduce motion-induced distortion, and the second further corrects motion blur via a global rigid transformation.
  4. A subframe gradient aggregation strategy resolves multi-subframe gradient conflicts (see the sketch after this list).
  5. A space-time bi-stage optimization strategy dynamically adjusts densification thresholds to prevent premature noisy Gaussians.
  6. Comprehensive experiments verify the superiority of the proposed method.
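The abstract does not detail the aggregation rule itself. The sketch below shows the simplest version of the idea: model the blurry frame as the average of sharp renders along the camera trajectory and gather all subframe contributions into a single aggregated optimizer step, rather than stepping per subframe. The stand-in renderer, poses, and loss are illustrative assumptions, not the BSGS implementation.

```python
import torch

def aggregated_update(gaussian_params, optimizer, render, subframe_poses, blurry_gt):
    """Illustrative subframe gradient aggregation: the blurry frame is modeled as the
    average of sharp renders over the subframe poses; one backward pass gathers every
    subframe's gradient contribution into a single update of the shared Gaussians.
    BSGS's exact aggregation rule is not specified here.
    """
    optimizer.zero_grad()
    blur_pred = torch.stack([render(gaussian_params, T) for T in subframe_poses]).mean(0)
    loss = torch.nn.functional.l1_loss(blur_pred, blurry_gt)
    loss.backward()                      # aggregates all subframe gradients at once
    optimizer.step()                     # single update instead of conflicting per-subframe steps
    return loss.item()

# Toy usage with a stand-in differentiable "renderer".
params = torch.nn.Parameter(torch.rand(3, 32, 32))
opt = torch.optim.Adam([params], lr=1e-2)
render = lambda p, T: p + 0.01 * T      # hypothetical: pose T just shifts intensities here
poses = [torch.tensor(float(i)) for i in range(5)]
gt = torch.rand(3, 32, 32)
print(aggregated_update(params, opt, render, poses, gt))
```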


Hybrid Gaussian Splatting for Novel Urban View Synthesis

Authors:Mohamed Omran, Farhad Zanjani, Davide Abati, Jens Petersen, Amirhossein Habibian

This paper describes the Qualcomm AI Research solution to the RealADSim-NVS challenge, hosted at the RealADSim Workshop at ICCV 2025. The challenge concerns novel view synthesis in street scenes, and participants are required to generate, starting from car-centric frames captured during some training traversals, renders of the same urban environment as viewed from a different traversal (e.g. different street lane or car direction). Our solution is inspired by hybrid methods in scene generation and generative simulators merging gaussian splatting and diffusion models, and it is composed of two stages: First, we fit a 3D reconstruction of the scene and render novel views as seen from the target cameras. Then, we enhance the resulting frames with a dedicated single-step diffusion model. We discuss specific choices made in the initialization of gaussian primitives as well as the finetuning of the enhancer model and its training data curation. We report the performance of our model design and we ablate its components in terms of novel view quality as measured by PSNR, SSIM and LPIPS. On the public leaderboard reporting test results, our proposal reaches an aggregated score of 0.432, achieving the second place overall.


Paper and Project Links

PDF ICCV 2025 RealADSim Workshop

Summary

This paper describes the Qualcomm AI Research solution to the RealADSim-NVS challenge hosted at the RealADSim Workshop at ICCV 2025. The challenge concerns novel view synthesis in street scenes: starting from car-centric frames captured during training traversals, participants must render the same urban environment as seen from a different traversal (e.g., a different lane or driving direction). The solution is inspired by hybrid scene-generation methods and generative simulators that merge Gaussian splatting and diffusion models, and it consists of two stages: a 3D reconstruction of the scene is fitted and novel views are rendered from the target cameras, and the resulting frames are then enhanced with a dedicated single-step diffusion model. The paper discusses the initialization of Gaussian primitives, the finetuning of the enhancer model and the curation of its training data, and ablates the components in terms of novel view quality measured by PSNR, SSIM, and LPIPS. On the public leaderboard, the proposal reaches an aggregated score of 0.432, placing second overall.

Key Takeaways

  1. Qualcomm AI Research participated in the RealADSim-NVS challenge on novel view synthesis in street scenes.
  2. The solution is a hybrid of scene generation and generative simulation, merging Gaussian splatting and diffusion models.
  3. It consists of two stages: 3D scene reconstruction with novel view rendering, followed by frame enhancement with a single-step diffusion model.
  4. The paper discusses the initialization of Gaussian primitives as well as the finetuning and training data curation of the enhancer model.
  5. The proposal reached second place overall on the public leaderboard of the ICCV 2025 challenge, with an aggregated score of 0.432.
  6. Novel view quality is evaluated with PSNR, SSIM, and LPIPS.


UniGS: Unified Geometry-Aware Gaussian Splatting for Multimodal Rendering

Authors:Yusen Xie, Zhenmin Huang, Jianhao Jiao, Dimitrios Kanoulas, Jun Ma

In this paper, we propose UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. Our framework integrates a CUDA-accelerated rasterization pipeline capable of rendering photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits simultaneously. We redesign the rasterization to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attribute through analytic depth gradients. Furthermore, we derive the analytic gradient formulation for surface normal rendering, ensuring geometric consistency among reconstructed 3D scenes. To improve computational and storage efficiency, we introduce a learnable attribute that enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities, validating the efficacy of our geometry-aware paradigm. Source code and multimodal viewer will be available on GitHub.


Paper and Project Links

PDF

Summary

UniGS is a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction based on 3D Gaussian Splatting. It integrates a CUDA-accelerated rasterization pipeline that simultaneously renders photo-realistic RGB images, geometrically accurate depth maps, consistent surface normals, and semantic logits. The rasterization is redesigned to render depth via differentiable ray-ellipsoid intersection rather than using Gaussian centers, enabling effective optimization of rotation and scale attributes through analytic depth gradients, and an analytic gradient formulation for surface normal rendering ensures geometric consistency of the reconstructed scenes. To improve computational and storage efficiency, a learnable attribute enables differentiable pruning of Gaussians with minimal contribution during training. Quantitative and qualitative experiments demonstrate state-of-the-art reconstruction accuracy across all modalities.

Key Takeaways

  1. Proposes UniGS, a unified map representation and differentiable framework for high-fidelity multimodal 3D reconstruction.
  2. Integrates a CUDA-accelerated rasterization pipeline that simultaneously renders RGB images, depth maps, surface normals, and semantic logits.
  3. Renders depth via differentiable ray-ellipsoid intersection, enabling optimization of rotation and scale attributes through analytic depth gradients (see the sketch after this list).
  4. Derives an analytic gradient formulation for surface normal rendering, improving the geometric consistency of the reconstruction.
  5. A learnable attribute enables differentiable pruning of low-contribution Gaussians, improving computational and storage efficiency.
  6. Experiments show state-of-the-art reconstruction accuracy across all modalities.
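Rendering depth from the ray-ellipsoid intersection rather than the Gaussian center is standard analytic geometry, sketched below for a single ray and a single Gaussian ellipsoid (UniGS additionally derives analytic gradients through this intersection inside its CUDA rasterizer). The interface and the choice of the nearest root are illustrative.

```python
import numpy as np

def ray_ellipsoid_depth(origin, direction, center, R, scales):
    """Analytic ray-ellipsoid intersection, illustrating depth rendered from the
    Gaussian's ellipsoidal extent instead of its center. R is the 3x3 rotation,
    scales the per-axis radii. Returns the nearest hit distance or None.
    (A sketch of the geometric idea; UniGS makes this differentiable.)
    """
    # Express the ray in the ellipsoid's local, axis-aligned frame.
    o = R.T @ (origin - center) / scales
    d = R.T @ direction / scales
    # Solve |o + t d|^2 = 1  ->  a t^2 + b t + c = 0 (unit sphere in local frame).
    a, b, c = d @ d, 2.0 * o @ d, o @ o - 1.0
    disc = b * b - 4 * a * c
    if disc < 0:
        return None                              # ray misses the ellipsoid
    t = (-b - np.sqrt(disc)) / (2 * a)           # nearest root
    return t if t > 0 else None

# A Gaussian ellipsoid of radius 0.5 placed 3 m in front of the camera, viewed along +z.
depth = ray_ellipsoid_depth(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                            np.array([0.0, 0.0, 3.0]), np.eye(3), np.array([0.5, 0.5, 0.5]))
print(depth)    # 2.5: front surface of the ellipsoid, not its center at 3.0
```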


G4Splat: Geometry-Guided Gaussian Splatting with Generative Prior

Authors:Junfeng Ni, Yixin Chen, Zhifei Yang, Yu Liu, Ruijie Lu, Song-Chun Zhu, Siyuan Huang

Despite recent advances in leveraging generative prior from pre-trained diffusion models for 3D scene reconstruction, existing methods still face two critical limitations. First, due to the lack of reliable geometric supervision, they struggle to produce high-quality reconstructions even in observed regions, let alone in unobserved areas. Second, they lack effective mechanisms to mitigate multi-view inconsistencies in the generated images, leading to severe shape-appearance ambiguities and degraded scene geometry. In this paper, we identify accurate geometry as the fundamental prerequisite for effectively exploiting generative models to enhance 3D scene reconstruction. We first propose to leverage the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. Furthermore, we incorporate this geometry guidance throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Extensive experiments on Replica, ScanNet++, and DeepBlending show that our method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly for unobserved regions. Moreover, our method naturally supports single-view inputs and unposed videos, with strong generalizability in both indoor and outdoor scenarios with practical real-world applicability. The project page is available at https://dali-jack.github.io/g4splat-web/.


Paper and Project Links

PDF Project page: https://dali-jack.github.io/g4splat-web/

Summary

Existing methods that leverage generative priors for 3D scene reconstruction suffer from insufficient geometric supervision and multi-view inconsistencies in the generated images. This work identifies accurate geometry as the prerequisite for exploiting generative models effectively: it leverages the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in both observed and unobserved regions. This geometry guidance is incorporated throughout the generative pipeline to improve visibility mask estimation, guide novel view selection, and enhance multi-view consistency when inpainting with video diffusion models, resulting in accurate and consistent scene completion. Experiments show that the method consistently outperforms existing baselines in both geometry and appearance reconstruction, particularly in unobserved regions, and it naturally supports single-view inputs and unposed videos with strong generalizability in indoor and outdoor scenarios.

Key Takeaways

  1. Existing generative-prior-based 3D scene reconstruction methods lack reliable geometric supervision and suffer from multi-view inconsistencies.
  2. The method leverages the prevalence of planar structures to derive accurate metric-scale depth maps, providing reliable supervision in observed and unobserved regions.
  3. Geometry guidance is incorporated throughout the generative pipeline to improve visibility mask estimation, novel view selection, and multi-view consistency when inpainting with video diffusion models.
  4. The method achieves accurate and consistent scene completion, outperforming existing baselines in both geometry and appearance reconstruction.
  5. It supports single-view inputs and unposed videos, generalizing well to indoor and outdoor real-world scenarios.
  6. Extensive experiments on Replica, ScanNet++, and DeepBlending validate its effectiveness.


GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality

Authors:Anastasiya Pechko, Piotr Borycki, Joanna Waczyńska, Daniel Barczyk, Agata Szymańska, Sławomir Tadeja, Przemysław Spurek

As the demand for immersive 3D content grows, the need for intuitive and efficient interaction methods becomes paramount. Current techniques for physically manipulating 3D content within Virtual Reality (VR) often face significant limitations, including reliance on engineering-intensive processes and simplified geometric representations, such as tetrahedral cages, which can compromise visual fidelity and physical accuracy. In this paper, we introduce GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing), a novel method designed to overcome these challenges by directly integrating an object’s mesh with a Gaussian Splatting (GS) representation. Our approach enables more precise surface approximation, leading to highly realistic deformations and interactions. By leveraging existing 3D mesh assets, GS-Verse facilitates seamless content reuse and simplifies the development workflow. Moreover, our system is designed to be physics-engine-agnostic, granting developers robust deployment flexibility. This versatile architecture delivers a highly realistic, adaptable, and intuitive approach to interactive 3D manipulation. We rigorously validate our method against the current state-of-the-art technique that couples VR with GS in a comparative user study involving 18 participants. Specifically, we demonstrate that our approach is statistically significantly better for physics-aware stretching manipulation and is also more consistent in other physics-based manipulations like twisting and shaking. Further evaluation across various interactions and scenes confirms that our method consistently delivers high and reliable performance, showing its potential as a plausible alternative to existing methods.


Paper and Project Links

PDF

Summary

GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene Editing) is a method for physics-aware interaction in VR that directly integrates an object's mesh with a Gaussian Splatting (GS) representation, overcoming the limitations of current techniques that rely on engineering-intensive processes and simplified geometric representations such as tetrahedral cages. The approach enables more precise surface approximation, leading to highly realistic deformations and interactions, and by leveraging existing 3D mesh assets it facilitates seamless content reuse and simplifies the development workflow. The system is also physics-engine-agnostic, giving developers robust deployment flexibility. In a comparative user study with 18 participants, the method was statistically significantly better for physics-aware stretching manipulation and more consistent in other physics-based manipulations such as twisting and shaking.

Key Takeaways

  1. Current physical interaction methods in VR rely on engineering-intensive processes and simplified geometric representations, which can compromise visual fidelity and physical accuracy.
  2. GS-Verse integrates an object's mesh with a Gaussian Splatting representation, enabling more precise surface approximation and highly realistic deformations and interactions.
  3. It supports seamless content reuse and a simplified development workflow by leveraging existing 3D mesh assets.
  4. The system is physics-engine-agnostic, giving developers flexible deployment options.
  5. A user study with 18 participants shows statistically significant advantages for physics-aware stretching manipulation.
  6. Evaluation across various interactions and scenes confirms consistently high and reliable performance.


Ev4DGS: Novel-view Rendering of Non-Rigid Objects from Monocular Event Streams

Authors:Takuya Nakabayashi, Navami Kairanda, Hideo Saito, Vladislav Golyanik

Event cameras offer various advantages for novel view rendering compared to synchronously operating RGB cameras, and efficient event-based techniques supporting rigid scenes have been recently demonstrated in the literature. In the case of non-rigid objects, however, existing approaches additionally require sparse RGB inputs, which can be a substantial practical limitation; it remains unknown if similar models could be learned from event streams only. This paper sheds light on this challenging open question and introduces Ev4DGS, i.e., the first approach for novel view rendering of non-rigidly deforming objects in the explicit observation space (i.e., as RGB or greyscale images) from monocular event streams. Our method regresses a deformable 3D Gaussian Splatting representation through 1) a loss relating the outputs of the estimated model with the 2D event observation space, and 2) a coarse 3D deformation model trained from binary masks generated from events. We perform experimental comparisons on existing synthetic and newly recorded real datasets with non-rigid objects. The results demonstrate the validity of Ev4DGS and its superior performance compared to multiple naive baselines that can be applied in our setting. We will release our models and the datasets used in the evaluation for research purposes; see the project webpage: https://4dqv.mpi-inf.mpg.de/Ev4DGS/.


Paper and Project Links

PDF

Summary
Event cameras offer several advantages for novel view rendering over synchronously operating RGB cameras, and efficient event-based techniques for rigid scenes have recently been demonstrated. For non-rigid objects, however, existing approaches additionally require sparse RGB inputs. Ev4DGS is the first approach for novel view rendering of non-rigidly deforming objects in the explicit observation space (RGB or greyscale images) from monocular event streams alone. It regresses a deformable 3D Gaussian Splatting representation through a loss relating the estimated model's outputs to the 2D event observation space and a coarse 3D deformation model trained from binary masks generated from events. Experiments on synthetic and newly recorded real datasets with non-rigid objects demonstrate the validity of Ev4DGS and its superior performance over several naive baselines.

Key Takeaways

  • Event cameras offer advantages for novel view rendering over RGB cameras and show particular promise for non-rigid scenes.
  • The paper addresses the challenging open problem of rendering non-rigid objects from event streams only, introducing Ev4DGS.
  • Ev4DGS regresses a deformable 3D Gaussian Splatting representation using a loss tied to the 2D event observation space together with a coarse 3D deformation model trained from event-generated binary masks.
  • Results demonstrate the validity of Ev4DGS and its superior performance compared to multiple baselines applicable in this setting.


Phys2Real: Fusing VLM Priors with Interactive Online Adaptation for Uncertainty-Aware Sim-to-Real Manipulation

Authors:Maggie Wang, Stephen Tian, Aiden Swann, Ola Shorinwa, Jiajun Wu, Mac Schwager

Learning robotic manipulation policies directly in the real world can be expensive and time-consuming. While reinforcement learning (RL) policies trained in simulation present a scalable alternative, effective sim-to-real transfer remains challenging, particularly for tasks that require precise dynamics. To address this, we propose Phys2Real, a real-to-sim-to-real RL pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive adaptation through uncertainty-aware fusion. Our approach consists of three core components: (1) high-fidelity geometric reconstruction with 3D Gaussian splatting, (2) VLM-inferred prior distributions over physical parameters, and (3) online physical parameter estimation from interaction data. Phys2Real conditions policies on interpretable physical parameters, refining VLM predictions with online estimates via ensemble-based uncertainty quantification. On planar pushing tasks of a T-block with varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real achieves substantial improvements over a domain randomization baseline: 100% vs 79% success rate for the bottom-weighted T-block, 57% vs 23% in the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing. Ablation studies indicate that the combination of VLM and interaction information is essential for success. Project website: https://phys2real.github.io/ .


Paper and Project Links

PDF

Summary
Phys2Real is a real-to-sim-to-real reinforcement learning pipeline that combines vision-language model (VLM)-inferred physical parameter estimates with interactive online adaptation through uncertainty-aware fusion. Its three core components are high-fidelity geometric reconstruction with 3D Gaussian splatting, VLM-inferred prior distributions over physical parameters, and online physical parameter estimation from interaction data. On planar pushing tasks it achieves substantial improvements over a domain randomization baseline.

Key Takeaways

  1. The method combines VLM-inferred physical parameter estimates with interactive adaptation, forming a real-to-sim-to-real RL pipeline.
  2. It has three core components: high-fidelity geometric reconstruction with 3D Gaussian splatting, VLM-inferred prior distributions over physical parameters, and online physical parameter estimation from interaction data.
  3. Policies are conditioned on interpretable physical parameters, and VLM predictions are refined with online estimates via ensemble-based uncertainty quantification (see the sketch after this list).
  4. On planar pushing tasks with a T-block of varying center of mass (CoM) and a hammer with an off-center mass distribution, Phys2Real substantially outperforms a domain randomization baseline.
  5. It reaches a 100% success rate on the bottom-weighted T-block (vs. 79% for the baseline), 57% vs. 23% on the challenging top-weighted T-block, and 15% faster average task completion for hammer pushing.
  6. Ablation studies indicate that the combination of VLM priors and interaction information is essential for success.
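Uncertainty-aware fusion of a VLM prior with online estimates can be illustrated with precision-weighted (product-of-Gaussians) fusion, using ensemble disagreement as the online uncertainty. The 1-D parameterization and the numbers below are toy assumptions; Phys2Real's exact fusion rule may differ.

```python
import numpy as np

def fuse_estimates(prior_mean, prior_var, online_means):
    """Uncertainty-aware fusion of a VLM prior over a physical parameter (e.g. a CoM
    offset) with an online estimate from interaction data. The online uncertainty
    comes from the spread of an ensemble; fusion is precision-weighted
    (product of Gaussians). Illustrative only; Phys2Real's exact scheme may differ.
    """
    online_mean = np.mean(online_means)
    online_var = np.var(online_means) + 1e-8            # ensemble disagreement as uncertainty
    w_prior, w_online = 1.0 / prior_var, 1.0 / online_var
    fused_mean = (w_prior * prior_mean + w_online * online_mean) / (w_prior + w_online)
    fused_var = 1.0 / (w_prior + w_online)
    return fused_mean, fused_var

# Hypothetical example: VLM guesses a 2 cm CoM offset with high variance;
# five ensemble estimators trained on interaction data agree on roughly 3.1 cm.
print(fuse_estimates(0.02, 0.0025, [0.031, 0.030, 0.032, 0.031, 0.030]))
```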


VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment

Authors:Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.


Paper and Project Links

PDF Accepted by NeurIPS 2025

Summary

3D Gaussian Splatting is efficient for high-quality, real-time novel view synthesis, but its capability for accurate surface reconstruction is underexplored: supervision based solely on image rendering loss often yields inaccurate geometry and inconsistent multi-view alignment. This work enhances the geometric representation of 3D Gaussians through view alignment. Edge-aware image cues are incorporated into the rendering loss to improve surface boundary delineation, a visibility-aware photometric alignment loss models occlusions and enforces accurate spatial relationships among Gaussians, normal-based constraints refine the spatial orientation of Gaussians and improve local surface estimation, and deep image feature embeddings enforce cross-view consistency under varying viewpoints and illumination. Experiments on standard benchmarks show state-of-the-art performance in both surface reconstruction and novel view synthesis.

Key Takeaways

  1. 3D Gaussian Splatting is an efficient solution for high-quality, real-time novel view synthesis, but its potential for accurate surface reconstruction remains underexplored.
  2. Supervision based solely on image rendering loss leads to inaccurate geometry and inconsistent multi-view alignment.
  3. Edge-aware image cues are incorporated into the rendering loss to improve surface boundary delineation (see the sketch after this list).
  4. A visibility-aware photometric alignment loss models occlusions and strengthens the spatial relationships among Gaussians.
  5. Normal-based constraints refine the spatial orientation of Gaussians and improve local surface estimation.
  6. Deep image feature embeddings enforce cross-view consistency, improving the robustness of the learned geometry.
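One simple instance of an edge-aware image cue is to up-weight the rendering loss on pixels with strong image gradients, sketched below with a Sobel filter on the target image. The weighting form and constants are assumptions; the paper's actual edge term may differ.

```python
import torch
import torch.nn.functional as F

def edge_aware_l1(rendered, target, edge_weight=2.0):
    """Edge-aware image loss: pixels on strong target-image edges (estimated with a
    Sobel filter) are up-weighted so that surface boundaries are better delineated.
    A simple instance of the "edge-aware image cues" idea; the paper's exact
    formulation may differ. Inputs are (B, 3, H, W) tensors in [0, 1].
    """
    gray = target.mean(1, keepdim=True)
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    edges = torch.sqrt(gx ** 2 + gy ** 2)
    edges = edges / (edges.amax(dim=(2, 3), keepdim=True) + 1e-8)   # normalize to [0, 1]
    weight = 1.0 + edge_weight * edges
    return (weight * (rendered - target).abs()).mean()

loss = edge_aware_l1(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(loss.item())
```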


MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference

Authors:Wenyuan Zhang, Jimin Tang, Weiqi Zhang, Yi Fang, Yu-Shen Liu, Zhizhong Han

Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Recent approaches enhance Gaussian primitives with reflection-related material attributes to enable physically based rendering (PBR) with Gaussian Splatting. However, the material inference often lacks sufficient constraints, especially under limited environment modeling, resulting in illumination aliasing and reduced generalization. In this work, we revisit the problem from a multi-view perspective and show that multi-view consistent material inference with more physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting. To this end, we enforce 2D Gaussians to produce multi-view consistent material maps during deferred shading. We also track photometric variations across views to identify highly reflective regions, which serve as strong priors for reflection strength terms. To handle indirect illumination caused by inter-object occlusions, we further introduce an environment modeling strategy through ray tracing with 2DGS, enabling photorealistic rendering of indirect radiance. Experiments on widely used benchmarks show that our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel views synthesis.


Paper and Project Links

PDF Accepted by NeurIPS 2025. Project Page: https://wen-yuan-zhang.github.io/MaterialRefGS

Summary

Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Existing approaches that attach reflection-related material attributes to Gaussian primitives often lack sufficient constraints on material inference, especially under limited environment modeling, leading to illumination aliasing and reduced generalization. This work revisits the problem from a multi-view perspective and shows that multi-view consistent material inference with more physically based environment modeling is key to learning accurate reflections with Gaussian Splatting: 2D Gaussians are enforced to produce multi-view consistent material maps during deferred shading, photometric variations across views are tracked to identify highly reflective regions that serve as strong priors for reflection strength, and an environment modeling strategy based on ray tracing with 2DGS handles indirect illumination caused by inter-object occlusions. Experiments on widely used benchmarks show that the method faithfully recovers both illumination and geometry and achieves state-of-the-art rendering quality in novel view synthesis.

Key Takeaways

  • Augmenting Gaussian primitives with reflection-related material attributes enables physically based rendering with Gaussian Splatting.
  • Material inference lacks sufficient constraints under limited environment modeling, causing illumination aliasing and reduced generalization.
  • Multi-view consistent material inference with more physically based environment modeling is proposed as the key to accurate reflections.
  • 2D Gaussians are enforced to produce multi-view consistent material maps during deferred shading.
  • Photometric variations across views are tracked to identify highly reflective regions, providing strong priors for reflection strength.
  • Ray tracing with 2DGS models indirect illumination caused by inter-object occlusions, enabling photorealistic rendering of indirect radiance.


Towards Efficient 3D Gaussian Human Avatar Compression: A Prior-Guided Framework

Authors:Shanzhi Yin, Bolin Chen, Xinju Wu, Ru-Ling Liao, Jie Chen, Shiqi Wang, Yan Ye

This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. The framework begins by training a canonical Gaussian avatar using articulated splatting in a network-free manner, which serves as the foundation for avatar appearance modeling. Simultaneously, a human-prior template is employed to capture temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy, enabling efficient compression: the canonical avatar is shared across the sequence, requiring compression only once, while the temporal parameters, consisting of just 94 parameters per frame, are transmitted with minimal bit-rate. For each frame, the target human avatar is generated by deforming canonical avatar via Linear Blend Skinning transformation, facilitating temporal coherent video reconstruction and novel view synthesis. Experimental results demonstrate that the proposed method significantly outperforms conventional 2D/3D codecs and existing learnable dynamic 3D Gaussian splatting compression method in terms of rate-distortion performance on mainstream multi-view human video datasets, paving the way for seamless immersive multimedia experiences in meta-verse applications.


Paper and Project Links

PDF 10 pages, 4 figures

Summary

This paper proposes an efficient 3D avatar coding framework that leverages compact human priors and canonical-to-target transformation to enable high-quality 3D human avatar video compression at ultra-low bit rates. A canonical Gaussian avatar is trained with articulated splatting in a network-free manner as the foundation for appearance modeling, while a human-prior template captures temporal body movements through compact parametric representations. This decomposition of appearance and temporal evolution minimizes redundancy: the canonical avatar is shared across the sequence and compressed only once, while the temporal parameters, just 94 per frame, are transmitted at minimal bit rate. For each frame, the target avatar is generated by deforming the canonical avatar via a Linear Blend Skinning transformation, enabling temporally coherent video reconstruction and novel view synthesis. Experiments show that the method significantly outperforms conventional 2D/3D codecs and an existing learnable dynamic 3D Gaussian splatting compression method in rate-distortion performance on mainstream multi-view human video datasets.

Key Takeaways

  1. Proposes a new 3D avatar coding framework for efficient compression of high-quality 3D human avatar video.
  2. A canonical Gaussian avatar, trained with articulated splatting in a network-free manner, serves as the foundation for appearance modeling.
  3. A human-prior template captures temporal body movements through compact parametric representations.
  4. Decomposing appearance and temporal evolution enables efficient compression: the canonical avatar is shared across the sequence and compressed only once, and each frame's temporal parameters consist of just 94 values.
  5. The target avatar is obtained by deforming the canonical avatar via Linear Blend Skinning, supporting temporally coherent video reconstruction and novel view synthesis.
  6. Experimental results show rate-distortion performance superior to conventional codecs and existing learnable dynamic 3DGS compression.



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!