Published: 2025-09-24

Updated: 2025-11-27

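⚠️ All summaries below were generated by a large language model and may contain errors; treat them as a reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic work. They are intended only for screening papers before reading them.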

Updated 2025-09-24

Beat on Gaze: Learning Stylized Generation of Gaze and Head Dynamics

Authors:Chengwei Shi, Chong Cao, Xin Tong, Xukun Shen

Head and gaze dynamics are crucial in expressive 3D facial animation for conveying emotion and intention. However, existing methods frequently address facial components in isolation, overlooking the intricate coordination between gaze, head motion, and speech. The scarcity of high-quality gaze-annotated datasets hinders the development of data-driven models capable of capturing realistic, personalized gaze control. To address these challenges, we propose StyGazeTalk, an audio-driven method that generates synchronized gaze and head motion styles. We extract speaker-specific motion traits from gaze-head sequences with a multi-layer LSTM structure incorporating a style encoder, enabling the generation of diverse animation styles. We also introduce a high-precision multimodal dataset comprising eye-tracked gaze, audio, head pose, and 3D facial parameters, providing a valuable resource for training and evaluating head and gaze control models. Experimental results demonstrate that our method generates realistic, temporally coherent, and style-aware head-gaze motions, significantly advancing the state-of-the-art in audio-driven facial animation.

Paper and project links

PDF arXiv submission

Summary

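This paper proposes StyGazeTalk, an audio-driven method that generates synchronized, stylized gaze and head motion. It argues that head and gaze dynamics are central to expressive 3D facial animation, and that existing methods handle facial components in isolation, overlooking the coordination among gaze, head motion, and speech. A multi-layer LSTM structure with a style encoder extracts speaker-specific motion traits from gaze-head sequences, which enables diverse animation styles. Because high-quality gaze-annotated data is scarce, the authors also introduce a high-precision multimodal dataset comprising eye-tracked gaze, audio, head pose, and 3D facial parameters for training and evaluating head and gaze control models. The reported experiments show head-gaze motion that is realistic, temporally coherent, and style-aware.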

Key Takeaways

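Head and gaze dynamics are crucial for conveying emotion and intention in expressive 3D facial animation.
Existing methods often treat facial components in isolation and overlook the coordination among gaze, head motion, and speech.
The scarcity of high-quality gaze-annotated datasets hinders data-driven models that could capture realistic, personalized gaze control.
StyGazeTalk is an audio-driven method that generates synchronized gaze and head motion styles.
A multi-layer LSTM structure with a style encoder extracts speaker-specific motion traits from gaze-head sequences.
The authors introduce a high-precision multimodal dataset with eye-tracked gaze, audio, head pose, and 3D facial parameters.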

PGSTalker: Real-Time Audio-Driven Talking Head Generation via 3D Gaussian Splatting with Pixel-Aware Density Control

Authors:Tianheng Zhu, Yinfeng Yu, Liejun Wang, Fuchun Sun, Wendong Zheng

Audio-driven talking head generation is crucial for applications in virtual reality, digital avatars, and film production. While NeRF-based methods enable high-fidelity reconstruction, they suffer from low rendering efficiency and suboptimal audio-visual synchronization. This work presents PGSTalker, a real-time audio-driven talking head synthesis framework based on 3D Gaussian Splatting (3DGS). To improve rendering performance, we propose a pixel-aware density control strategy that adaptively allocates point density, enhancing detail in dynamic facial regions while reducing redundancy elsewhere. Additionally, we introduce a lightweight Multimodal Gated Fusion Module to effectively fuse audio and spatial features, thereby improving the accuracy of Gaussian deformation prediction. Extensive experiments on public datasets demonstrate that PGSTalker outperforms existing NeRF- and 3DGS-based approaches in rendering quality, lip-sync precision, and inference speed. Our method exhibits strong generalization capabilities and practical potential for real-world deployment.

Paper and project links

PDF Main paper (15 pages). Accepted for publication by ICONIP (International Conference on Neural Information Processing) 2025

Summary

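Audio-driven talking head generation has applications in virtual reality, digital avatars, and film production. This work presents PGSTalker, a real-time audio-driven talking head synthesis framework built on 3D Gaussian Splatting (3DGS). To improve rendering performance, a pixel-aware density control strategy adaptively allocates point density, adding detail in dynamic facial regions and reducing redundancy elsewhere. A lightweight Multimodal Gated Fusion Module fuses audio and spatial features to improve the accuracy of Gaussian deformation prediction. Experiments on public datasets show that PGSTalker outperforms existing NeRF- and 3DGS-based approaches in rendering quality, lip-sync precision, and inference speed, and the authors report strong generalization and practical potential for deployment.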

Key Takeaways

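Audio-driven talking head generation matters for virtual reality, digital avatars, and film production.
PGSTalker is a real-time audio-driven talking head synthesis framework based on 3D Gaussian Splatting.
A pixel-aware density control strategy improves rendering performance by adaptively allocating point density: more detail in dynamic facial regions, less redundancy elsewhere.
A lightweight Multimodal Gated Fusion Module fuses audio and spatial features to improve Gaussian deformation prediction.
PGSTalker outperforms existing NeRF- and 3DGS-based methods in rendering quality, lip-sync precision, and inference speed.
The authors report strong generalization and practical potential for real-world deployment.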
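
The abstract does not spell out how the Multimodal Gated Fusion Module is built, so the following is only a minimal sketch of the general gated-fusion idea, not PGSTalker's actual module. It assumes two feature vectors of equal length and a per-channel sigmoid gate computed from their concatenation; the function `gated_fusion`, the weights `gate_w` and `gate_b`, and the feature values are all hypothetical placeholders.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(audio, spatial, gate_w, gate_b):
    """Blend two equal-length feature vectors with a per-channel gate.

    gate_w holds one row per channel; each row scores the concatenation
    [audio; spatial]. The weights are placeholders, not trained values.
    """
    joint = audio + spatial  # list concatenation
    fused = []
    for i, (a, s) in enumerate(zip(audio, spatial)):
        score = sum(w * x for w, x in zip(gate_w[i], joint)) + gate_b[i]
        g = sigmoid(score)  # g near 1 keeps audio, g near 0 keeps spatial
        fused.append(g * a + (1.0 - g) * s)
    return fused


rng = random.Random(0)
dim = 4
audio_feat = [rng.uniform(-1, 1) for _ in range(dim)]
spatial_feat = [rng.uniform(-1, 1) for _ in range(dim)]
gate_w = [[rng.uniform(-0.5, 0.5) for _ in range(2 * dim)] for _ in range(dim)]
gate_b = [0.0] * dim

fused_feat = gated_fusion(audio_feat, spatial_feat, gate_w, gate_b)
print(fused_feat)
```

Because the gate is a convex weight, each fused channel always lies between the corresponding audio and spatial values; a trained gate would learn which modality to trust per channel.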

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS

Authors:Ziqi Dai, Yiting Chen, Jiacheng Xu, Liufei Xie, Yuchen Wang, Zhenchuan Yang, Bingsong Bai, Yangsheng Gao, Wenjiang Zhou, Weifeng Zhao, Ruohua Zhou

The pipeline for multi-participant audiobook production primarily consists of three stages: script analysis, character voice timbre selection, and speech synthesis. Among these, script analysis can be automated with high accuracy using NLP models, whereas character voice timbre selection still relies on manual effort. Speech synthesis uses either manual dubbing or text-to-speech (TTS). While TTS boosts efficiency, it struggles with emotional expression, intonation control, and contextual scene adaptation. To address these challenges, we propose DeepDubbing, an end-to-end automated system for multi-participant audiobook production. The system comprises two main components: a Text-to-Timbre (TTT) model and a Context-Aware Instruct-TTS (CA-Instruct-TTS) model. The TTT model generates role-specific timbre embeddings conditioned on text descriptions. The CA-Instruct-TTS model synthesizes expressive speech by analyzing contextual dialogue and incorporating fine-grained emotional instructions. This system enables the automated generation of multi-participant audiobooks with both timbre-matched character voices and emotionally expressive narration, offering a novel solution for audiobook production.

Paper and project links

PDF Submitted to ICASSP 2026. Copyright 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work. DOI will be added upon IEEE Xplore publication

Summary
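The paper describes the pipeline for multi-participant audiobook production, which has three stages: script analysis, character voice timbre selection, and speech synthesis. Script analysis can be automated with high accuracy using NLP models, while timbre selection still relies on manual effort. Speech synthesis uses either manual dubbing or TTS, and TTS struggles with emotional expression, intonation control, and contextual scene adaptation. To address this, the authors propose DeepDubbing, an end-to-end system made up of a Text-to-Timbre (TTT) model and a Context-Aware Instruct-TTS (CA-Instruct-TTS) model, which automatically generates multi-participant audiobooks with timbre-matched character voices and emotionally expressive narration.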

Key Takeaways

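Multi-participant audiobook production consists of script analysis, character voice timbre selection, and speech synthesis.
Script analysis can be automated with high accuracy using NLP models.
Timbre selection still relies on manual effort.
Speech synthesis uses manual dubbing or TTS, and TTS struggles with emotional expression, intonation control, and contextual scene adaptation.
DeepDubbing combines a Text-to-Timbre model and a Context-Aware Instruct-TTS model to automate multi-participant audiobook generation.
The Text-to-Timbre model generates role-specific timbre embeddings conditioned on text descriptions.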

Tiny is not small enough: High-quality, low-resource facial animation models through hybrid knowledge distillation

Authors:Zhen Han, Mattias Teye, Derek Yadgaroff, Judith Bütepage

The training of high-quality, robust machine learning models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs. To overcome the lack of such a dataset, recent work has introduced large pre-trained speech encoders that are robust to variations in the input audio and, therefore, enable the facial animation model to generalize across speakers, audio quality, and languages. However, the resulting facial animation models are prohibitively large and lend themselves only to offline inference on a dedicated machine. In this work, we explore on-device, real-time facial animation models in the context of game development. We overcome the lack of large datasets by using hybrid knowledge distillation with pseudo-labeling. Given a large audio dataset, we employ a high-performing teacher model to train very small student models. In contrast to the pre-trained speech encoders, our student models only consist of convolutional and fully-connected layers, removing the need for attention context or recurrent updates. In our experiments, we demonstrate that we can reduce the memory footprint to up to 3.4 MB and required future audio context to up to 81 ms while maintaining high-quality animations. This paves the way for on-device inference, an important step towards realistic, model-driven digital characters.

Paper and project links

PDF Accepted to ACM TOG 2025 (SIGGRAPH journal track); Project page: https://electronicarts.github.io/tiny-voice2face/

Summary

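This paper studies on-device, real-time facial animation models for game development. Training such models normally requires a large dataset of high-quality audio-animation pairs; the authors work around the lack of one with hybrid knowledge distillation and pseudo-labeling, using a high-performing teacher model to train very small student models on a large audio dataset. The student models consist only of convolutional and fully-connected layers, with no attention context or recurrent updates, which shrinks the memory footprint while maintaining high-quality animation.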

Key Takeaways

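Training high-quality, robust models for speech-driven 3D facial animation requires a large, diverse dataset of high-quality audio-animation pairs.
Recent work introduced large pre-trained speech encoders that are robust to variations in the input audio, letting facial animation models generalize across speakers, audio quality, and languages.
The resulting facial animation models are prohibitively large and suited only to offline inference on a dedicated machine, which rules out on-device use in games.
The authors use hybrid knowledge distillation with pseudo-labeling to overcome the lack of large datasets.
A high-performing teacher model trains very small student models that consist only of convolutional and fully-connected layers.
Experiments show the memory footprint can be reduced to as little as 3.4 MB and the required future audio context to as little as 81 ms while maintaining high-quality animation.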
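
The pseudo-labeling part of the distillation recipe can be sketched as follows. This is a toy illustration under loud assumptions, not the paper's setup: the `teacher` here is a fixed linear function standing in for a large pre-trained model, the `student` is a three-parameter linear model rather than a convolutional network, and the inputs are random pairs standing in for audio features. It does not attempt to reproduce the "hybrid" aspect of the authors' scheme, which the abstract does not detail.

```python
import random


def teacher(x):
    """Stand-in for a large, accurate model (hypothetical fixed function)."""
    return 2.0 * x[0] - 1.0 * x[1] + 0.5


def student(params, x):
    """Tiny student: two weights and a bias."""
    w0, w1, b = params
    return w0 * x[0] + w1 * x[1] + b


def mse(params, data):
    return sum((student(params, x) - y) ** 2 for x, y in data) / len(data)


rng = random.Random(0)

# 1) Unlabeled inputs (stand-ins for audio without animation labels).
unlabeled = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200)]

# 2) Pseudo-labeling: the teacher's predictions become training targets.
pseudo_labeled = [(x, teacher(x)) for x in unlabeled]

# 3) Train the student on the pseudo-labels with plain SGD
#    (gradient of half the squared error, one sample at a time).
params = [0.0, 0.0, 0.0]
loss_before = mse(params, pseudo_labeled)
lr = 0.1
for epoch in range(200):
    for x, y in pseudo_labeled:
        err = student(params, x) - y
        params[0] -= lr * err * x[0]
        params[1] -= lr * err * x[1]
        params[2] -= lr * err
loss_after = mse(params, pseudo_labeled)

print(loss_before, loss_after, params)
```

The point of the pattern is that step 1 needs only audio, which is far easier to collect than paired audio-animation data; the teacher supplies the missing labels.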

Revisiting Speech-Lip Alignment: A Phoneme-Aware Speech Encoder for Robust Talking Head Synthesis

Authors:Yihuan Huang, Jiajun Liu, Yanzhen Ren, Wuyang Liu, Zongkun Sun

Speech-driven talking head synthesis tasks commonly use general acoustic features as guided speech features. However, we discovered that these features suffer from phoneme-viseme alignment ambiguity, which refers to the uncertainty and imprecision in matching phonemes with visemes. To overcome this limitation, we propose a phoneme-aware speech encoder (PASE) that explicitly enforces accurate phoneme-viseme correspondence. PASE first captures fine-grained speech and visual features, then introduces a prediction-reconstruction task to improve robustness under noise and modality absence. Furthermore, a phoneme-level alignment module guided by phoneme embeddings and contrastive learning ensures discriminative audio and visual alignment. Experimental results show that PASE achieves state-of-the-art performance in both NeRF and 3DGS rendering models. Its lip sync accuracy improves by 13.7% and 14.2% compared to the acoustic feature, producing results close to the ground truth videos.

Paper and project links

PDF

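Summary

Speech-driven talking head synthesis commonly uses general acoustic features as the guiding speech features. The authors find that these features suffer from phoneme-viseme alignment ambiguity, meaning uncertainty and imprecision in matching phonemes with visemes. To overcome this, they propose a phoneme-aware speech encoder (PASE) that explicitly enforces accurate phoneme-viseme correspondence. PASE first captures fine-grained speech and visual features, then adds a prediction-reconstruction task to improve robustness under noise and missing modalities. A phoneme-level alignment module, guided by phoneme embeddings and contrastive learning, ensures discriminative audio-visual alignment. PASE achieves state-of-the-art performance with both NeRF and 3DGS rendering models: lip-sync accuracy improves by 13.7% and 14.2% over the acoustic-feature baseline, and the results are close to the ground-truth videos.

Key Takeaways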

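Speech-driven talking head synthesis currently relies on general acoustic features as the guiding speech features.
These features suffer from uncertainty and imprecision when matching phonemes with visemes.
A phoneme-aware speech encoder (PASE) is proposed to enforce accurate phoneme-viseme correspondence.
PASE captures fine-grained speech and visual features.
A prediction-reconstruction task improves robustness under noise and missing modalities.
A phoneme-level alignment module, guided by phoneme embeddings and contrastive learning, makes the audio-visual alignment more discriminative.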
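
To make the contrastive-learning idea concrete, here is a minimal sketch of an InfoNCE-style loss, a common contrastive objective. It is an assumption that PASE uses anything of this form: the abstract says only that alignment is guided by phoneme embeddings and contrastive learning. The function `info_nce`, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def info_nce(audio_embs, visual_embs, temperature=0.1):
    """One-directional InfoNCE: audio i should match visual i, not visual j."""
    audio = [normalize(a) for a in audio_embs]
    visual = [normalize(v) for v in visual_embs]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, v) / temperature for v in visual]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(audio)


# Toy embeddings: three audio segments and their matching mouth shapes.
audio_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
misaligned = aligned[1:] + aligned[:1]  # every pair shifted by one

loss_aligned = info_nce(audio_embs, aligned)
loss_misaligned = info_nce(audio_embs, misaligned)
print(loss_aligned, loss_misaligned)
```

The loss is low when each audio embedding is closest to its own visual embedding and high when the pairing is wrong, which is the pressure that pushes a model toward discriminative audio-visual alignment.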

FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait

Authors:Taekyung Ki, Dongchan Min, Gyeongsu Chae

With the rapid advancement of diffusion-based generative models, portrait image animation has achieved remarkable results. However, it still faces challenges in temporally consistent video generation and fast sampling due to its iterative sampling nature. This paper presents FLOAT, an audio-driven talking portrait video generation method based on flow matching generative model. Instead of a pixel-based latent space, we take advantage of a learned orthogonal motion latent space, enabling efficient generation and editing of temporally consistent motion. To achieve this, we introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. Additionally, our method supports speech-driven emotion enhancement, enabling a natural incorporation of expressive motions. Extensive experiments demonstrate that our method outperforms state-of-the-art audio-driven talking portrait methods in terms of visual quality, motion fidelity, and efficiency.

Paper and project links

PDF ICCV 2025. Project page: https://deepbrainai-research.github.io/float/

Summary

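Diffusion-based generative models have driven remarkable progress in portrait image animation, but their iterative sampling makes temporally consistent video generation and fast sampling difficult. This paper presents FLOAT, an audio-driven talking portrait video generation method based on a flow matching generative model. Instead of a pixel-based latent space, it uses a learned orthogonal motion latent space, which enables efficient generation and editing of temporally consistent motion. To this end, the authors introduce a transformer-based vector field predictor with an effective frame-wise conditioning mechanism. The method also supports speech-driven emotion enhancement, so expressive motion can be incorporated naturally. Experiments show that FLOAT outperforms state-of-the-art audio-driven talking portrait methods in visual quality, motion fidelity, and efficiency.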
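
For readers unfamiliar with flow matching, the sketch below shows the generic recipe in its simplest form: a straight path from noise to data defines a target velocity, a predictor is trained to regress that velocity, and sampling integrates the predicted field with a few Euler steps. This illustrates the general technique under stated assumptions and is not FLOAT's implementation: `ideal_field` is a closed-form stand-in for the paper's transformer-based predictor, it targets a single hypothetical `data_point`, and the vectors are plain lists rather than motion latents.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the vector field along that straight path."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict, x0, x1, t):
    """Squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    v_pred = predict(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true))


def euler_sample(predict, x0, steps=10):
    """Integrate dx/dt = predict(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = predict(x, k * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


data_point = [0.3, -0.7, 1.2]  # hypothetical single training target


def ideal_field(x, t):
    """Exact vector field when the data distribution is one point."""
    return [(d - a) / (1.0 - t) for a, d in zip(x, data_point)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in data_point]

loss = flow_matching_loss(ideal_field, noise, data_point, t=0.4)
sample = euler_sample(ideal_field, noise, steps=10)
print(loss, sample)
```

Sampling here takes a fixed, small number of deterministic steps along a straight path, which is the kind of efficiency advantage over iterative diffusion sampling that the summary above alludes to.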

Key Takeaways