Published: 2025-06-14

Updated: 2025-07-06

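⚠️ All summaries below were generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for initial screening before reading a paper.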

2025-06-14 Update

Controllable Expressive 3D Facial Animation via Diffusion in a Unified Multimodal Space

Authors:Kangwei Liu, Junwu Liu, Xiaowei Yi, Jinlin Guo, Yun Cao

Audio-driven emotional 3D facial animation encounters two significant challenges: (1) reliance on single-modal control signals (videos, text, or emotion labels) without leveraging their complementary strengths for comprehensive emotion manipulation, and (2) deterministic regression-based mapping that constrains the stochastic nature of emotional expressions and non-verbal behaviors, limiting the expressiveness of synthesized animations. To address these challenges, we present a diffusion-based framework for controllable expressive 3D facial animation. Our approach introduces two key innovations: (1) a FLAME-centered multimodal emotion binding strategy that aligns diverse modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control from multiple signal sources, and (2) an attention-based latent diffusion model with content-aware attention and emotion-guided layers, which enriches motion diversity while maintaining temporal coherence and natural facial dynamics. Extensive experiments demonstrate that our method outperforms existing approaches across most metrics, achieving a 21.6% improvement in emotion similarity while preserving physiologically plausible facial dynamics. Project Page: https://kangweiiliu.github.io/Control_3D_Animation.

Paper and project links

PDF Accepted by ICME2025

Summary
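To address the challenges facing audio-driven 3D facial animation, the authors propose a diffusion-based framework for controllable, expressive animation with two key innovations. The first is a FLAME-centered multimodal emotion binding strategy that aligns different modalities (text, audio, and emotion labels) through contrastive learning, enabling flexible emotion control. The second is an attention-based latent diffusion model whose content-aware attention and emotion-guided layers enrich motion diversity while maintaining temporal coherence and natural facial dynamics. The method outperforms existing approaches on most metrics and improves emotion similarity by 21.6%.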

Key Takeaways

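Audio-driven 3D facial animation faces two challenges: reliance on single-modal control signals, and deterministic regression-based mapping that limits the expressiveness of the synthesized animation.
The authors propose a diffusion-based framework with two key innovations to address these problems.
The first innovation is a FLAME-centered multimodal emotion binding strategy that uses contrastive learning to enable flexible emotion control from multiple signal sources.
The second innovation is an attention-based latent diffusion model whose content-aware attention and emotion-guided layers increase motion diversity while maintaining temporal coherence and natural facial dynamics.
The method supports multimodal emotion control, making the animated expressions richer and more natural.
Experiments show that the method outperforms existing approaches on most metrics, with a 21.6% improvement in emotion similarity.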
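
The contrastive alignment of modalities mentioned above can be illustrated with a generic symmetric InfoNCE loss. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the loss, the temperature, or the encoders, so `info_nce`, the temperature of 0.07, and the toy 2-D embeddings are all hypothetical. Each row of the first batch is treated as the positive match for the same row of the second batch (for example, a FLAME-space embedding and the text embedding of the same clip).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def info_nce(anchor_embs, other_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each batch is assumed to be a matched pair."""
    a = [l2_normalize(v) for v in anchor_embs]
    b = [l2_normalize(v) for v in other_embs]
    n = len(a)
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Anchor-to-other direction: the correct "class" for row i is column i.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Other-to-anchor direction uses the transposed similarity matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)


# Toy check: matched pairs give a much lower loss than mismatched pairs.
anchor = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(anchor, [[1.0, 0.0], [0.0, 1.0]]))  # matched
print(info_nce(anchor, [[0.0, 1.0], [1.0, 0.0]]))  # mismatched
```

Minimizing such a loss pulls paired embeddings from different modalities together and pushes unpaired ones apart, which is the general mechanism that lets any aligned modality act as a control signal.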

A Unit Enhancement and Guidance Framework for Audio-Driven Avatar Video Generation

Authors:S. Z. Zhou, Y. B. Wang, J. F. Wu, T. Hu, J. N. Zhang

Audio-driven human animation technology is widely used in human-computer interaction, and the emergence of diffusion models has further advanced its development. Currently, most methods rely on multi-stage generation and intermediate representations, resulting in long inference time and issues with generation quality in specific foreground regions and audio-motion consistency. These shortcomings are primarily due to the lack of localized fine-grained supervised guidance. To address above challenges, we propose Parts-aware Audio-driven Human Animation, PAHA, a unit enhancement and guidance framework for audio-driven upper-body animation. We introduce two key methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAR dynamically adjusts regional training loss weights based on pose confidence scores, effectively improving visual quality. PCE constructs and trains diffusion-based regional audio-visual classifiers to improve the consistency of motion and co-speech audio. Afterwards, we design two novel inference guidance methods for the foregoing classifiers, Sequential Guidance (SG) and Differential Guidance (DG), to balance efficiency and quality respectively. Additionally, we build CNAS, the first public Chinese News Anchor Speech dataset, to advance research and validation in this field. Extensive experimental results and user studies demonstrate that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations. The codes and CNAS dataset will be released upon acceptance.
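
As a rough illustration of the PAR idea in the abstract (regional loss weights driven by pose confidence scores), the sketch below maps per-region confidences to weights with mean 1 and applies them to per-region losses. Everything specific here is an assumption: the abstract does not give the weighting formula or even its direction, so the choice that higher confidence yields a higher weight, the `floor` value, and the region names `face`, `hands`, and `body` are hypothetical.

```python
def region_weights(confidences, floor=0.1):
    """Map per-region pose confidences in [0, 1] to loss weights with mean 1.

    Assumption: regions with more reliable pose estimates get larger weights.
    The floor keeps low-confidence regions from being ignored entirely.
    """
    raw = {name: max(conf, floor) for name, conf in confidences.items()}
    mean = sum(raw.values()) / len(raw)
    return {name: value / mean for name, value in raw.items()}


def reweighted_loss(region_losses, confidences):
    """Average the per-region losses after applying confidence-based weights."""
    weights = region_weights(confidences)
    total = sum(weights[name] * region_losses[name] for name in region_losses)
    return total / len(region_losses)


# Hypothetical per-region values for one training sample.
confidences = {"face": 0.9, "hands": 0.3, "body": 0.6}
losses = {"face": 2.0, "hands": 4.0, "body": 6.0}
print(region_weights(confidences))
print(reweighted_loss(losses, confidences))
```

Normalizing the weights to mean 1 keeps the overall loss scale comparable to an unweighted average, so the re-weighting changes the emphasis between regions without changing the effective learning rate.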

Paper and project links

PDF revised

Summary

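Audio-driven human animation is widely used in human-computer interaction. To address the shortcomings of existing methods, such as long inference time, poor generation quality in specific foreground regions, and weak audio-motion consistency, the authors propose Parts-aware Audio-driven Human Animation (PAHA), a framework built on two core methods: Parts-Aware Re-weighting (PAR) and Parts Consistency Enhancement (PCE). PAHA addresses these problems while balancing efficiency and quality. The authors also build CNAS, the first public Chinese News Anchor Speech dataset. Experiments and user studies show that PAHA significantly outperforms existing methods in audio-motion alignment and video-related evaluations.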
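
PCE trains regional audio-visual classifiers that are then used to guide inference. The abstract does not describe how Sequential Guidance or Differential Guidance work, so the sketch below shows only the generic classifier-guidance idea they presumably build on: add a scaled gradient of the classifier's log-probability to the unconditional score. It is a deterministic 1-D toy (gradient ascent on the guided log-density), not a diffusion sampler, and `prior_score`, `classifier_grad`, the target of 2.0, and the step size are all hypothetical.

```python
def prior_score(x):
    """Score (gradient of the log-density) of a standard normal prior N(0, 1)."""
    return -x


def classifier_grad(x, target=2.0, sigma=1.0):
    """Gradient of a toy log p(y | x) that peaks at `target` (hypothetical classifier)."""
    return -(x - target) / sigma**2


def guided_score(x, scale):
    """Classifier guidance: unconditional score plus a scaled classifier gradient."""
    return prior_score(x) + scale * classifier_grad(x)


def guided_ascent(x0, scale, step=0.1, n_steps=200):
    """Follow the guided score with plain gradient ascent (no noise, for clarity)."""
    x = x0
    for _ in range(n_steps):
        x += step * guided_score(x, scale)
    return x


# scale=0 ignores the classifier; larger scales pull the result toward its target.
print(guided_ascent(x0=0.0, scale=0.0))
print(guided_ascent(x0=0.0, scale=1.0))
print(guided_ascent(x0=0.0, scale=4.0))
```

With `scale=1.0` the iterate settles near 1.0, midway between the prior's mode at 0 and the classifier's mode at 2, which shows how the guidance scale trades off the unconditional model against the classifier signal.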

Key Takeaways