Published: 2025-09-17

Updated: 2025-10-07

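⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.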

2025-09-17 Update

Avat3r: Large Animatable Gaussian Reconstruction Model for High-fidelity 3D Head Avatars

Authors: Tobias Kirschstein, Javier Romero, Artem Sevastopolsky, Matthias Nießner, Shunsuke Saito

Traditionally, creating photo-realistic 3D head avatars requires a studio-level multi-view capture setup and expensive optimization during test-time, limiting the use of digital human doubles to the VFX industry or offline renderings. To address this shortcoming, we present Avat3r, which regresses a high-quality and animatable 3D head avatar from just a few input images, vastly reducing compute requirements during inference. More specifically, we make Large Reconstruction Models animatable and learn a powerful prior over 3D human heads from a large multi-view video dataset. For better 3D head reconstructions, we employ position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. To animate the 3D head, our key discovery is that simple cross-attention to an expression code is already sufficient. Finally, we increase robustness by feeding input images with different expressions to our model during training, enabling the reconstruction of 3D head avatars from inconsistent inputs, e.g., an imperfect phone capture with accidental movement, or frames from a monocular video. We compare Avat3r with current state-of-the-art methods for few-input and single-input scenarios, and find that our method has a competitive advantage in both tasks. Finally, we demonstrate the wide applicability of our proposed model, creating 3D head avatars from images of different sources, smartphone captures, single images, and even out-of-domain inputs like antique busts. Project website: https://tobias-kirschstein.github.io/avat3r/

Summary

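This paper presents Avat3r, a method that regresses a high-quality, animatable 3D head avatar from just a few input images, greatly reducing compute requirements at inference time. Avat3r makes Large Reconstruction Models animatable and learns a strong prior over 3D human heads from a large multi-view video dataset. To improve reconstruction quality, it uses position maps from DUSt3R and generalized feature maps from the human foundation model Sapiens. Animation is driven by simple cross-attention to an expression code. Feeding input images with different expressions during training makes the model robust to inconsistent inputs, such as an imperfect phone capture with accidental movement or frames from a monocular video. The authors show that Avat3r works on images from a range of sources, including smartphone captures, single images, and out-of-domain inputs such as antique busts.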
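
To make "cross-attention to an expression code" concrete, here is a minimal single-head sketch in plain Python. It is not the authors' implementation: the token counts, feature dimensions, random weights, the residual update, and every function name (`cross_attend_to_expression`, `rand_matrix`, and so on) are assumptions chosen for illustration. It only shows the general mechanism, in which feature tokens act as queries and an expression code supplies the keys and values.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random Gaussian matrix; stands in for learned weights (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def cross_attend_to_expression(tokens, expr_tokens, w_q, w_k, w_v):
    """Single-head cross-attention from feature tokens to an expression code.

    tokens:      N x d   feature tokens (hypothetical shape)
    expr_tokens: M x d_e expression code split into M tokens (hypothetical)
    w_q: d x d_h, w_k: d_e x d_h, w_v: d_e x d
    Returns the updated tokens (N x d) and the attention weights (N x M).
    """
    q = matmul(tokens, w_q)          # queries come from the feature tokens
    k = matmul(expr_tokens, w_k)     # keys come from the expression code
    v = matmul(expr_tokens, w_v)     # values come from the expression code
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)          # N x M similarity scores
    weights = [softmax([s / scale for s in row]) for row in scores]
    attended = matmul(weights, v)    # N x d expression-dependent update
    # Residual connection (an assumption of this sketch).
    out = [[t + a for t, a in zip(t_row, a_row)]
           for t_row, a_row in zip(tokens, attended)]
    return out, weights


rng = random.Random(0)
N, M, d, d_e, d_h = 6, 4, 8, 5, 8
tokens = rand_matrix(N, d, rng)
w_q = rand_matrix(d, d_h, rng)
w_k = rand_matrix(d_e, d_h, rng)
w_v = rand_matrix(d_e, d, rng)

expr_a = rand_matrix(M, d_e, rng)    # one expression code
expr_b = rand_matrix(M, d_e, rng)    # a different expression code

out_a, weights_a = cross_attend_to_expression(tokens, expr_a, w_q, w_k, w_v)
out_b, _ = cross_attend_to_expression(tokens, expr_b, w_q, w_k, w_v)
```

In this sketch the same feature tokens yield different outputs for `expr_a` and `expr_b`, which is the property that would let a single reconstruction be driven by different expressions. How the actual model tokenizes the expression code and where it places these layers is not stated in the abstract.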

Key Takeaways