Published: 2025-03-22

Updated: 2025-05-14

Word count: 3.2k

Reading time: 13 min

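⚠️ All summaries below are generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for screening papers before reading.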

Updated 2025-03-22

Zero-1-to-A: Zero-Shot One Image to Animatable Head Avatars Using Video Diffusion

Authors: Zhou Zhenglin, Ma Fan, Fan Hehe, Chua Tat-Seng

Animatable head avatar generation typically requires extensive data for training. To reduce the data requirements, a natural solution is to leverage existing data-free static avatar generation methods, such as pre-trained diffusion models with score distillation sampling (SDS), which align avatars with pseudo ground-truth outputs from the diffusion model. However, directly distilling 4D avatars from video diffusion often leads to over-smooth results due to spatial and temporal inconsistencies in the generated video. To address this issue, we propose Zero-1-to-A, a robust method that synthesizes a spatial and temporal consistency dataset for 4D avatar reconstruction using the video diffusion model. Specifically, Zero-1-to-A iteratively constructs video datasets and optimizes animatable avatars in a progressive manner, ensuring that avatar quality increases smoothly and consistently throughout the learning process. This progressive learning involves two stages: (1) Spatial Consistency Learning fixes expressions and learns from front-to-side views, and (2) Temporal Consistency Learning fixes views and learns from relaxed to exaggerated expressions, generating 4D avatars in a simple-to-complex manner. Extensive experiments demonstrate that Zero-1-to-A improves fidelity, animation quality, and rendering speed compared to existing diffusion-based methods, providing a solution for lifelike avatar creation. Code is publicly available at: https://github.com/ZhenglinZhou/Zero-1-to-A.

Paper and project links

PDF Accepted by CVPR 2025, project page: https://zhenglinzhou.github.io/zero-1-to-a

Summary

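This paper proposes Zero-1-to-A, a method that uses a video diffusion model to synthesize a spatially and temporally consistent dataset for 4D avatar reconstruction. By iteratively constructing video datasets and optimizing the animatable avatar, Zero-1-to-A ensures that avatar quality improves steadily throughout training. The method has two stages, Spatial Consistency Learning and Temporal Consistency Learning, which proceed from simple to complex. Compared with existing diffusion-based methods, Zero-1-to-A improves fidelity, animation quality, and rendering speed. The code is publicly available.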

Key Takeaways

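Zero-1-to-A improves animatable head avatar generation by synthesizing a spatially and temporally consistent dataset.
The method uses a video diffusion model for 4D avatar reconstruction and has two stages: Spatial Consistency Learning and Temporal Consistency Learning.
Iteratively constructing video datasets and optimizing the animatable avatar lets avatar quality improve steadily during training.
Compared with existing diffusion-based methods, Zero-1-to-A improves fidelity, animation quality, and rendering speed.
The code is publicly available at https://github.com/ZhenglinZhou/Zero-1-to-A.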
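
The simple-to-complex ordering described in the abstract (fixed expression with front-to-side views, then fixed view with relaxed-to-exaggerated expressions) can be illustrated with a small schedule generator. This is only a sketch of the ordering idea, not the authors' implementation: the `Sample` fields, the `progressive_schedule` name, the use of a single frontal view in the second stage, and the linear spacing of yaw angles and expression intensities are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical training condition in the curriculum."""
    stage: str         # "spatial" or "temporal"
    yaw_deg: float     # camera yaw; 0 means a frontal view
    expression: float  # 0 = relaxed/neutral, 1 = most exaggerated


def progressive_schedule(max_yaw_deg=90.0, yaw_steps=4, expr_steps=4):
    """Return conditions ordered from simple to complex.

    Stage 1 (spatial): expression fixed at neutral, views move front -> side.
    Stage 2 (temporal): view fixed (assumed frontal here), expressions move
    from relaxed to exaggerated.
    """
    schedule = []
    for i in range(yaw_steps + 1):
        yaw = max_yaw_deg * i / yaw_steps
        schedule.append(Sample("spatial", yaw, 0.0))
    for j in range(1, expr_steps + 1):
        schedule.append(Sample("temporal", 0.0, j / expr_steps))
    return schedule


if __name__ == "__main__":
    for sample in progressive_schedule(max_yaw_deg=90.0, yaw_steps=3, expr_steps=2):
        print(sample)
```

In a real pipeline each scheduled condition would presumably drive one round of video generation and avatar optimization; that loop is omitted here because the abstract does not describe it.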

Controlling Avatar Diffusion with Learnable Gaussian Embedding

Authors: Xuan Gao, Jingtao Zhou, Dongyu Liu, Yuqi Zhou, Juyong Zhang

Recent advances in diffusion models have made significant progress in digital human generation. However, most existing models still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. A key reason for these shortcomings is the limited representation ability of commonly used control signals (e.g., landmarks, depth maps, etc.). In addition, the lack of diversity in identity and pose variations in public datasets further hinders progress in this area. In this paper, we analyze the shortcomings of current control signals and introduce a novel control signal representation that is optimizable, dense, expressive, and 3D consistent. Our method embeds a learnable neural Gaussian onto a parametric head surface, which greatly enhances the consistency and expressiveness of diffusion-based head models. Regarding the dataset, we synthesize a large-scale dataset with multiple poses and identities. In addition, we use real/synthetic labels to effectively distinguish real and synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images. Extensive experiments show that our model outperforms existing methods in terms of realism, expressiveness, and 3D consistency. Our code, synthetic datasets, and pre-trained models will be released on our project page: https://ustc3dv.github.io/Learn2Control/

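Summary

Diffusion models have made significant progress in digital human generation, but most still struggle to maintain 3D consistency, temporal coherence, and motion accuracy. This paper analyzes the shortcomings of current control signals and proposes a new control signal representation that is optimizable, dense, expressive, and 3D consistent: a learnable neural Gaussian embedded onto a parametric head surface. The authors also synthesize a large-scale dataset with multiple poses and identities, and use real/synthetic labels to distinguish real from synthetic data. Experiments show that the model outperforms existing methods in realism, expressiveness, and 3D consistency.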

Key Takeaways

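Diffusion models have made significant progress in digital human generation but still struggle with 3D consistency, temporal coherence, and motion accuracy.
Commonly used control signals (e.g., landmarks and depth maps) have limited representation ability, which is a key reason for these shortcomings.
The paper introduces a new control signal representation that is optimizable, dense, expressive, and 3D consistent.
The synthesized large-scale dataset covers multiple poses and identities.
Real/synthetic labels distinguish real from synthetic data, minimizing the impact of imperfections in synthetic data on the generated head images.
Experiments show that the model outperforms existing methods in realism, expressiveness, and 3D consistency.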
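
The abstract states that real/synthetic labels are used to distinguish the two data sources but does not say how the label enters the model. One common way to use such a label, shown here purely as an assumption, is to append it to the conditioning vector during training and always request the "real" domain at sampling time. The names `build_condition` and `inference_condition`, and the one-hot encoding, are hypothetical and not taken from the paper.

```python
REAL, SYNTHETIC = 1, 0


def build_condition(control_features, domain_label):
    """Append a one-hot real/synthetic flag to a control-feature vector.

    Assumed layout of the flag: [is_synthetic, is_real].
    """
    if domain_label not in (REAL, SYNTHETIC):
        raise ValueError("domain_label must be REAL or SYNTHETIC")
    one_hot = [0.0, 1.0] if domain_label == REAL else [1.0, 0.0]
    return list(control_features) + one_hot


def inference_condition(control_features):
    """At sampling time, always condition on the 'real' domain (assumption)."""
    return build_condition(control_features, REAL)


if __name__ == "__main__":
    features = [0.2, -0.7, 1.3]  # placeholder control features
    print(build_condition(features, SYNTHETIC))
    print(inference_condition(features))
```

The intent of such a design would be that the model can still learn pose and identity diversity from synthetic samples while attributing their rendering artifacts to the synthetic flag.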

Synthetic Prior for Few-Shot Drivable Head Avatar Inversion

Authors: Wojciech Zielonka, Stephan J. Garbin, Alexandros Lattas, George Kopanas, Paulo Gotardo, Thabo Beeler, Justus Thies, Timo Bolkart

We present SynShot, a novel method for the few-shot inversion of a drivable head avatar based on a synthetic prior. We tackle three major challenges. First, training a controllable 3D generative network requires a large number of diverse sequences, for which pairs of images and high-quality tracked meshes are not always available. Second, the use of real data is strictly regulated (e.g., under the General Data Protection Regulation, which mandates frequent deletion of models and data to accommodate a situation when a participant’s consent is withdrawn). Synthetic data, free from these constraints, is an appealing alternative. Third, state-of-the-art monocular avatar models struggle to generalize to new views and expressions, lacking a strong prior and often overfitting to a specific viewpoint distribution. Inspired by machine learning models trained solely on synthetic data, we propose a method that learns a prior model from a large dataset of synthetic heads with diverse identities, expressions, and viewpoints. With few input images, SynShot fine-tunes the pretrained synthetic prior to bridge the domain gap, modeling a photorealistic head avatar that generalizes to novel expressions and viewpoints. We model the head avatar using 3D Gaussian splatting and a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space. To account for the different modeling complexities over parts of the head (e.g., skin vs hair), we embed the prior with explicit control for upsampling the number of per-part primitives. Compared to SOTA monocular and GAN-based methods, SynShot significantly improves novel view and expression synthesis.

Paper and project links

PDF Accepted to CVPR25 Website: https://zielon.github.io/synshot/

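Summary

To address the scarcity of high-quality real data while respecting privacy regulations, this work proposes SynShot, which builds a drivable head avatar from a few input images using a prior learned from synthetic data. The diversity of the synthetic data addresses the need for a large number of diverse sequences when training a controllable 3D generative network, and the prior learned from synthetic heads helps the model generalize to novel views and expressions. SynShot combines 3D Gaussian splatting with a convolutional encoder-decoder that outputs Gaussian parameters in UV texture space, and provides explicit control over the number of primitives per head part (e.g., skin vs. hair). Compared with SOTA monocular and GAN-based methods, SynShot significantly improves novel view and expression synthesis.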
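
To make the idea of per-part control over primitive counts concrete, the sketch below counts how many Gaussian primitives each head part would receive if every UV texel of a part were split into a part-specific number of primitives. This is an illustrative assumption only: the abstract does not specify the upsampling rule, and the `primitives_per_part` name, the part labels, and the "factor squared per texel" rule are hypothetical.

```python
def primitives_per_part(part_map, upsample):
    """Count primitives per part for a UV map of part labels.

    part_map: 2D list of part names, one per UV texel.
    upsample: dict mapping part name -> integer upsampling factor f;
              each texel is assumed to become f * f primitives.
    """
    counts = {}
    for row in part_map:
        for part in row:
            factor = upsample[part]
            counts[part] = counts.get(part, 0) + factor * factor
    return counts


if __name__ == "__main__":
    uv_parts = [
        ["skin", "skin", "hair"],
        ["skin", "hair", "hair"],
    ]
    # Hair is given a larger factor than skin in this toy example.
    print(primitives_per_part(uv_parts, {"skin": 1, "hair": 2}))
```

With these toy inputs, hair texels receive four primitives each and skin texels one, which mirrors the motivation in the abstract that different parts of the head have different modeling complexity.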

Key Takeaways