⚠️ All summaries below are generated by a large language model, may contain errors, and are provided for reference only; use them with caution.
🔴 Please note: do not rely on them for serious academic work; they are only meant as a first-pass screening before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-18
Dream3DAvatar: Text-Controlled 3D Avatar Reconstruction from a Single Image
Authors:Gaofeng Liu, Hengsen Li, Ruoyu Gao, Xuetong Li, Zhiyuan Ma, Tao Fang
With the rapid advancement of 3D representation techniques and generative models, substantial progress has been made in reconstructing full-body 3D avatars from a single image. However, this task remains fundamentally ill-posed due to the limited information available from monocular input, making it difficult to control the geometry and texture of occluded regions during generation. To address these challenges, we redesign the reconstruction pipeline and propose Dream3DAvatar, an efficient and text-controllable two-stage framework for 3D avatar generation. In the first stage, we develop a lightweight, adapter-enhanced multi-view generation model. Specifically, we introduce the Pose-Adapter to inject SMPL-X renderings and skeletal information into SDXL, enforcing geometric and pose consistency across views. To preserve facial identity, we incorporate ID-Adapter-G, which injects high-resolution facial features into the generation process. Additionally, we leverage BLIP2 to generate high-quality textual descriptions of the multi-view images, enhancing text-driven controllability in occluded regions. In the second stage, we design a feedforward Transformer model equipped with a multi-view feature fusion module to reconstruct high-fidelity 3D Gaussian Splatting (3DGS) representations from the generated images. Furthermore, we introduce ID-Adapter-R, which utilizes a gating mechanism to effectively fuse facial features into the reconstruction process, improving high-frequency detail recovery. Extensive experiments demonstrate that our method can generate realistic, animation-ready 3D avatars without any post-processing and consistently outperforms existing baselines across multiple evaluation metrics.
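The ID-Adapter-R described above is characterized only as a gating mechanism that fuses facial features into the reconstruction. Below is a minimal PyTorch sketch of one way such a gated fusion block could look, assuming cross-attention between reconstruction tokens and projected face features; the module name, dimensions, and gating formulation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical gated identity-fusion block, in the spirit of the abstract's
# ID-Adapter-R. Names, dimensions, and the gating rule are assumptions.
import torch
import torch.nn as nn

class GatedIDFusion(nn.Module):
    def __init__(self, token_dim: int = 768, id_dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.id_proj = nn.Linear(id_dim, token_dim)          # map face features into token space
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * token_dim, token_dim), nn.Sigmoid())
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens: torch.Tensor, id_feats: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, N, token_dim) multi-view reconstruction tokens
        # id_feats: (B, M, id_dim)    high-resolution facial features
        idk = self.id_proj(id_feats)
        attended, _ = self.cross_attn(tokens, idk, idk)       # queries = tokens, keys/values = face features
        g = self.gate(torch.cat([tokens, attended], dim=-1))  # per-token gate in [0, 1]
        return self.norm(tokens + g * attended)               # gated residual fusion

if __name__ == "__main__":
    fusion = GatedIDFusion()
    out = fusion(torch.randn(2, 1024, 768), torch.randn(2, 16, 512))
    print(out.shape)  # torch.Size([2, 1024, 768])
```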
Paper and project links
Summary
With the rapid development of 3D representation techniques and generative models, reconstructing full-body 3D avatars from a single image has advanced considerably. However, the task remains fundamentally ill-posed because monocular input carries limited information, making it difficult to control the geometry and texture of occluded regions during generation. To address this, the paper proposes Dream3DAvatar, a two-stage framework for efficient, text-controllable 3D avatar generation. The first stage uses a lightweight multi-view generation model with a Pose-Adapter and an ID-Adapter-G to maintain geometric and identity consistency; the second stage reconstructs a high-fidelity 3D Gaussian Splatting (3DGS) representation with a feedforward Transformer equipped with a multi-view feature fusion module. The method produces realistic, directly animatable 3D avatars without any post-processing and outperforms existing baselines across multiple evaluation metrics.
Key Takeaways
- 3D avatar generation has advanced substantially thanks to progress in 3D representation techniques and generative models.
- Reconstructing a full-body 3D avatar from a monocular image is fundamentally ill-posed; the core challenge is controlling the geometry and texture of occluded regions.
- Dream3DAvatar is proposed as a two-stage framework for efficient, text-controllable 3D avatar generation.
- The first stage maintains geometric and identity consistency via the Pose-Adapter and ID-Adapter-G.
- The second stage reconstructs a high-fidelity 3D Gaussian Splatting (3DGS) representation with a feedforward Transformer equipped with a multi-view feature fusion module.
- The generated 3D avatars are realistic and directly animatable, with no post-processing required.
- The method outperforms existing baselines across multiple evaluation metrics.
Click here to view paper screenshots

Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings
Authors:Abdalla Arafa, Didier Stricker
Novel view synthesis has seen significant advancements with 3D Gaussian Splatting (3DGS), enabling real-time photorealistic rendering. However, the inherent fuzziness of Gaussian Splatting presents challenges for 3D scene understanding, restricting its broader applications in AR/VR and robotics. While recent works attempt to learn semantics via 2D foundation model distillation, they inherit fundamental limitations: alpha blending averages semantics across objects, making 3D-level understanding impossible. We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation, creating comprehensive “bags of embeddings” that holistically describe objects. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction. Experiments demonstrate that our method effectively overcomes the challenges of 3D open-vocabulary object extraction while remaining comparable to state-of-the-art performance in 2D open-vocabulary segmentation, ensuring minimal compromise.
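The retrieval step described above compares a text query to object-level "bags of embeddings" rather than per-Gaussian features. The sketch below illustrates that idea, assuming CLIP image embeddings for each object's multi-view crops and a CLIP text embedding have already been computed upstream; the max-over-views scoring rule and the array layout are assumptions made for illustration.

```python
# Hypothetical open-vocabulary retrieval over per-object "bags" of multi-view
# CLIP embeddings. Feature extraction is assumed to have happened upstream.
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(object_bags: dict[str, np.ndarray], text_emb: np.ndarray) -> list[tuple[str, float]]:
    """Rank objects by the best cosine similarity between the text embedding
    and any view-level embedding in the object's bag."""
    t = normalize(text_emb)
    scores = {}
    for obj_id, bag in object_bags.items():   # bag: (num_views, dim) CLIP features
        sims = normalize(bag) @ t              # cosine similarity per view
        scores[obj_id] = float(sims.max())     # assumed aggregation: best view wins
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bags = {f"object_{i}": rng.normal(size=(6, 512)) for i in range(3)}  # 6 views, 512-d stand-ins
    query = rng.normal(size=512)                                          # stand-in for a CLIP text embedding
    print(retrieve(bags, query))
```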
Paper and project links
Summary
By operating on pre-decomposed object-level Gaussians and aggregating multi-view CLIP features into comprehensive "bags of embeddings", the proposed method enables open-vocabulary object retrieval and seamless task adaptation, addressing 3D scene understanding on top of real-time 3DGS rendering.
Key Takeaways
- 3DGS has driven major progress in real-time photorealistic rendering.
- The inherent fuzziness of Gaussian Splatting makes 3D scene understanding difficult, limiting its broader use in AR/VR and robotics.
- Existing approaches that learn semantics by distilling 2D foundation models have a fundamental limitation: alpha blending averages semantics across objects, so object-level 3D understanding is not possible.
- A new approach is proposed that bypasses differentiable rendering for semantics, instead using pre-decomposed object-level Gaussians and representing each object through multi-view CLIP feature aggregation.
- This enables accurate open-vocabulary object retrieval by comparing text queries against object-level embeddings rather than Gaussian-level ones.
- The method supports seamless task adaptation: object IDs can be propagated to pixels for 2D segmentation or to Gaussians for 3D extraction.
Click here to view paper screenshots

Effective Gaussian Management for High-fidelity Object Reconstruction
Authors:Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Feng Xu
This paper proposes an effective Gaussian management approach for high-fidelity object reconstruction. Departing from recent Gaussian Splatting (GS) methods that employ indiscriminate attribute assignment, our approach introduces a novel densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, which effectively mitigates the gradient conflicts caused by dual supervision and achieves superior reconstruction results. To further improve representation efficiency, we develop a lightweight Gaussian representation that adaptively adjusts the SH orders of each Gaussian based on gradient magnitudes and performs task-decoupled pruning to remove Gaussians with minimal impact on a reconstruction task without sacrificing others, which balances representational capacity with parameter quantity. Notably, our management approach is model-agnostic and can be seamlessly integrated into other frameworks, enhancing performance while reducing model size. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art approaches in both reconstruction quality and efficiency, achieving superior performance with significantly fewer parameters.
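Two of the management ideas, gradient-driven SH order adjustment and task-decoupled pruning, can be illustrated with a schematic sketch. The thresholds, the per-Gaussian "impact" scores, and the update rule below are placeholders chosen for illustration, not the paper's actual criteria.

```python
# Hypothetical sketch of per-Gaussian SH-order adjustment and task-decoupled
# pruning. Thresholds and impact scores are illustrative assumptions.
import torch

def adjust_sh_orders(sh_orders: torch.Tensor, grad_mag: torch.Tensor,
                     up_thresh: float = 1e-3, down_thresh: float = 1e-5,
                     max_order: int = 3) -> torch.Tensor:
    """Give high-gradient Gaussians more SH capacity and simplify low-gradient ones."""
    orders = sh_orders.clone()
    orders[grad_mag > up_thresh] += 1
    orders[grad_mag < down_thresh] -= 1
    return orders.clamp_(0, max_order)

def task_decoupled_prune(impact_recon: torch.Tensor, impact_other: torch.Tensor,
                         eps: float = 1e-4) -> torch.Tensor:
    """Keep a Gaussian unless it is negligible for the reconstruction task
    and its removal also leaves the other supervised task unaffected."""
    return (impact_recon > eps) | (impact_other > eps)

if __name__ == "__main__":
    n = 8
    orders = torch.zeros(n, dtype=torch.long)
    grads = torch.rand(n) * 1e-2
    print(adjust_sh_orders(orders, grads))
    print(task_decoupled_prune(torch.rand(n) * 1e-3, torch.rand(n) * 1e-3))
```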
Paper and project links
Summary
This paper proposes an effective Gaussian management strategy for high-fidelity object reconstruction. Unlike recent Gaussian Splatting (GS) methods that assign attributes indiscriminately, it introduces a densification strategy that dynamically activates spherical harmonics (SHs) or normals under the supervision of a surface reconstruction module, mitigating the gradient conflicts caused by dual supervision and improving reconstruction quality. To raise representation efficiency, a lightweight Gaussian representation adaptively adjusts each Gaussian's SH order based on gradient magnitude and applies task-decoupled pruning. The management approach is model-agnostic and can be integrated into other frameworks, improving performance while shrinking model size. Experiments show it surpasses state-of-the-art methods in both reconstruction quality and efficiency.
Key Takeaways
- Introduces a new Gaussian management strategy for high-fidelity object reconstruction.
- Proposes a densification strategy that dynamically activates spherical harmonics (SHs) or normals to improve reconstruction quality.
- Supervision from a surface reconstruction module mitigates gradient conflicts caused by dual supervision.
- Develops a lightweight Gaussian representation that improves representation efficiency.
- Adaptively adjusts each Gaussian's SH order and performs task-decoupled pruning.
- The management approach is model-agnostic and can be integrated into other frameworks.
Click here to view paper screenshots

WorldExplorer: Towards Generating Fully Navigable 3D Scenes
Authors:Manuel-Andreas Schneider, Lukas Höllein, Matthias Nießner
Generating 3D worlds from text is a highly anticipated goal in computer vision. Existing works are limited by the degree of exploration they allow inside of a scene, i.e., they produce stretched-out and noisy artifacts when moving beyond central or panoramic perspectives. To this end, we propose WorldExplorer, a novel method based on autoregressive video trajectory generation, which builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. We initialize our scenes by creating multi-view consistent images corresponding to a 360 degree panorama. Then, we expand it by leveraging video diffusion models in an iterative scene generation pipeline. Concretely, we generate multiple videos along short, pre-defined trajectories that explore the scene in depth, including motion around objects. Our novel scene memory conditions each video on the most relevant prior views, while a collision-detection mechanism prevents degenerate results, like moving into objects. Finally, we fuse all generated views into a unified 3D representation via 3D Gaussian Splatting optimization. Compared to prior approaches, WorldExplorer produces high-quality scenes that remain stable under large camera motion, enabling for the first time realistic and unrestricted exploration. We believe this marks a significant step toward generating immersive and truly explorable virtual 3D environments.
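Two control mechanisms in the pipeline, the scene memory that conditions each new video on the most relevant prior views and the collision check that rejects degenerate camera motion, can be sketched as follows. The relevance metric (camera-position distance), the clearance threshold, and all names are assumptions made for illustration, not the authors' definitions.

```python
# Toy sketch: pick conditioning views by camera proximity and reject camera
# positions that would move inside existing scene geometry. All heuristics
# here are assumptions, not the paper's actual criteria.
import numpy as np

def select_memory_views(prev_cam_positions: np.ndarray, next_cam_position: np.ndarray,
                        k: int = 4) -> np.ndarray:
    """Return indices of the k prior views whose cameras are closest to the
    camera of the view about to be generated."""
    dists = np.linalg.norm(prev_cam_positions - next_cam_position, axis=1)
    return np.argsort(dists)[:k]

def collides(cam_position: np.ndarray, scene_points: np.ndarray,
             min_clearance: float = 0.2) -> bool:
    """Flag camera poses that would move into reconstructed geometry."""
    return bool(np.linalg.norm(scene_points - cam_position, axis=1).min() < min_clearance)

if __name__ == "__main__":
    cams = np.random.rand(20, 3)       # previously generated camera positions
    points = np.random.rand(5000, 3)   # fused scene geometry (e.g., Gaussian centers)
    target = np.array([0.5, 0.5, 0.5])
    idx = select_memory_views(cams, target)
    print("conditioning views:", idx, "collision:", collides(target, points))
```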
Paper and project links
PDF: Accepted to SIGGRAPH Asia 2025. Project page: https://mschneider456.github.io/world-explorer; video: https://youtu.be/N6NJsNyiv6I; code: https://github.com/mschneider456/WorldExplorer
Summary
The paper presents WorldExplorer, a method based on autoregressive video trajectory generation that builds fully navigable 3D scenes with consistent visual quality across a wide range of viewpoints. Scenes are initialized from multi-view consistent images corresponding to a 360-degree panorama and then expanded with video diffusion models in an iterative scene generation pipeline. Multiple videos are generated along pre-defined trajectories that explore the scene in depth, including motion around objects. A scene memory conditions each video on the most relevant prior views, and a collision-detection mechanism prevents degenerate results. Finally, all generated views are fused into a unified 3D representation via 3D Gaussian Splatting optimization. WorldExplorer produces high-quality scenes that stay stable under large camera motion, enabling realistic and unrestricted exploration for the first time, a significant step toward immersive, truly explorable virtual 3D environments.
Key Takeaways
- WorldExplorer is a method based on autoregressive video trajectory generation for creating fully navigable 3D scenes.
- Scenes are initialized by creating multi-view consistent images corresponding to a 360-degree panorama.
- Video diffusion models expand the scene within an iterative generation pipeline.
- Multiple videos are generated along pre-defined trajectories that explore the scene in depth.
- A scene memory mechanism conditions each video on the most relevant prior views.
- A collision-detection mechanism prevents degenerate results.
Click here to view paper screenshots
