Published: 2025-11-05

Updated: 2025-11-27

Word count: 9.9k

Reading time: 40 min

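⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading a paper.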

2025-11-05 update

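See the Speaker: Crafting High-Resolution Talking Faces from Speech with Prior Guidance and Region Refinement

Authors:Jinting Wang, Jun Wang, Hei Victor Cheng, Li Liu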

Unlike existing methods that rely on source images as appearance references and use source speech to generate motion, this work proposes a novel approach that directly extracts information from the speech, addressing key challenges in speech-to-talking face. Specifically, we first employ a speech-to-face portrait generation stage, utilizing a speech-conditioned diffusion model combined with statistical facial prior and a sample-adaptive weighting module to achieve high-quality portrait generation. In the subsequent speech-driven talking face generation stage, we embed expressive dynamics such as lip movement, facial expressions, and eye movements into the latent space of the diffusion model and further optimize lip synchronization using a region-enhancement module. To generate high-resolution outputs, we integrate a pre-trained Transformer-based discrete codebook with an image rendering network, enhancing video frame details in an end-to-end manner. Experimental results demonstrate that our method outperforms existing approaches on the HDTF, VoxCeleb, and AVSpeech datasets. Notably, this is the first method capable of generating high-resolution, high-quality talking face videos exclusively from a single speech input.

Paper and project links

PDF 16 pages, 15 figures, accepted by TASLP

Summary

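This paper proposes a method that generates a talking-face video directly from speech, without a source image as an appearance reference. In the first stage, a speech-conditioned diffusion model, combined with a statistical facial prior and a sample-adaptive weighting module, generates the portrait. In the second stage, expressive dynamics such as lip movement, facial expressions, and eye movements are embedded into the diffusion model's latent space, and a region-enhancement module further refines lip synchronization. Finally, a pre-trained Transformer-based discrete codebook is integrated with an image rendering network to enhance frame detail end to end and produce high-resolution output. The authors report that the method outperforms existing approaches on HDTF, VoxCeleb, and AVSpeech, and describe it as the first to generate high-resolution, high-quality talking-face videos from speech alone.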

Key Takeaways

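Proposes a method that generates a speaker's talking-face video directly from speech.
Combines a speech-conditioned diffusion model, a statistical facial prior, and a sample-adaptive weighting module for high-quality portrait generation.
Embeds expressive dynamics such as lip movement, facial expressions, and eye movements in the speech-driven animation stage.
Uses a region-enhancement module to refine lip synchronization.
Integrates a pre-trained Transformer-based discrete codebook with an image rendering network to enhance frame detail and resolution.
Reports better results than existing methods on HDTF, VoxCeleb, and AVSpeech.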

Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

Authors:Frederik Broy, Maike Züfle, Jan Niehues

Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long, and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk’s corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.

Paper and project links

PDF

Summary

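This paper introduces Reference Prediction from Talks (RPT), a new task that maps long, unstructured scientific talks to relevant papers. To support research on RPT, it presents the Talk2Ref dataset, which contains 6,279 talks and 43,429 cited papers (26 per talk on average); relevance is approximated by the papers cited in each talk's source publication. The authors evaluate state-of-the-art text embedding models in zero-shot retrieval and propose a dual-encoder architecture trained on Talk2Ref. They also explore strategies for handling long transcripts and training for domain adaptation. The results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, which demonstrates both the difficulty of the task and the value of the dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to support future work on integrating spoken scientific communication into citation recommendation systems.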

Key Takeaways

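Introduces Reference Prediction from Talks (RPT), a new task that matches scientific talks to relevant papers.
Presents the Talk2Ref dataset of 6,279 scientific talks and their 43,429 cited papers.
Evaluates text embedding models in zero-shot retrieval scenarios.
Proposes a dual-encoder architecture trained on Talk2Ref.
Explores strategies for handling long transcripts and for domain-adaptation training.
Fine-tuning on Talk2Ref significantly improves citation prediction performance.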
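
As a rough illustration of the retrieval setup summarized above, the sketch below ranks candidate papers for a talk by cosine similarity between embeddings and handles a long transcript by splitting it into chunks and averaging the chunk embeddings. Everything here is an assumption for illustration: `embed` is a toy bag-of-words stand-in rather than the paper's encoder, a single function embeds both talks and papers (a trained dual encoder would use two), and chunk-then-mean-pool is only one possible long-transcript strategy, not necessarily the one the authors used.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy bag-of-words embedding (hypothetical stand-in for a real text encoder)."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def chunk(words, size):
    """Split a word list into consecutive chunks of at most `size` words."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def rank_papers(transcript, papers, vocab, chunk_size=8):
    """Return paper titles sorted by similarity to the (non-empty) transcript."""
    chunks = chunk(transcript.split(), chunk_size)
    talk_vec = mean_pool([embed(c, vocab) for c in chunks])
    scored = [(cosine(talk_vec, embed(abstract, vocab)), title)
              for title, abstract in papers.items()]
    return [title for _, title in sorted(scored, reverse=True)]
```

In practice the toy `embed` would be replaced by a pretrained text embedding model for the zero-shot baseline, or by encoders fine-tuned on Talk2Ref.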

Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation

Authors:Junyoung Seo, Rodrigo Mira, Alexandros Haliassos, Stella Bounareli, Honglie Chen, Linh Tran, Seungryong Kim, Zoe Landgraf, Jie Shen

Audio-driven human animation models often suffer from identity drift during temporal autoregressive generation, where characters gradually lose their identity over time. One solution is to generate keyframes as intermediate temporal anchors that prevent degradation, but this requires an additional keyframe generation stage and can restrict natural motion dynamics. To address this, we propose Lookahead Anchoring, which leverages keyframes from future timesteps ahead of the current generation window, rather than within it. This transforms keyframes from fixed boundaries into directional beacons: the model continuously pursues these future anchors while responding to immediate audio cues, maintaining consistent identity through persistent guidance. This also enables self-keyframing, where the reference image serves as the lookahead target, eliminating the need for keyframe generation entirely. We find that the temporal lookahead distance naturally controls the balance between expressivity and consistency: larger distances allow for greater motion freedom, while smaller ones strengthen identity adherence. When applied to three recent human animation models, Lookahead Anchoring achieves superior lip synchronization, identity preservation, and visual quality, demonstrating improved temporal conditioning across several different architectures. Video results are available at the following link: https://lookahead-anchoring.github.io.

Summary

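Audio-driven human animation models suffer from identity drift during temporal autoregressive generation: characters gradually lose their identity over time. Generating keyframes as intermediate temporal anchors prevents this degradation, but it requires an additional keyframe generation stage and can restrict natural motion dynamics. Lookahead Anchoring instead uses keyframes from future timesteps beyond the current generation window, which turns keyframes from fixed boundaries into directional beacons. The model keeps pursuing these future anchors while responding to immediate audio cues, and this persistent guidance maintains a consistent identity. The approach also enables self-keyframing, in which the reference image serves as the lookahead target and no keyframe generation is needed. The lookahead distance controls the balance between expressivity and consistency: larger distances allow greater motion freedom, while smaller ones strengthen identity adherence. Applied to three recent human animation models, Lookahead Anchoring achieves better lip synchronization, identity preservation, and visual quality, showing improved temporal conditioning across different architectures. Video results are available at https://lookahead-anchoring.github.io.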

Key Takeaways

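Audio-driven human animation models suffer from identity drift.
Keyframes used as intermediate temporal anchors prevent degradation, but they require an extra generation stage and can restrict natural motion.
Lookahead Anchoring uses keyframes from future timesteps beyond the current generation window to keep identity consistent.
With self-keyframing, the reference image serves as the lookahead target, which removes the need for keyframe generation.
The lookahead distance controls the trade-off between expressivity and consistency.
Applied to three recent human animation models, it improves lip synchronization, identity preservation, and visual quality.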
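
To make the "anchor ahead of the window" idea concrete, here is a minimal scheduling sketch. It only computes, for each autoregressive generation window, the timestep at which the anchor keyframe would be placed. The function name, and the choice to measure the lookahead distance from the last frame of the window, are assumptions for illustration; the abstract does not state how the distance is defined or how the model consumes the anchor.

```python
def lookahead_schedule(total_frames, window, lookahead):
    """Return (start, end, anchor_t) for each generation window.

    `end` is exclusive. `anchor_t` is the timestep assigned to the anchor
    keyframe: `lookahead` frames past the last frame of the window, so it
    always lies outside the window being generated (assumes lookahead >= 1).
    """
    if window < 1 or lookahead < 1:
        raise ValueError("window and lookahead must be positive")
    schedule = []
    start = 0
    while start < total_frames:
        end = min(start + window, total_frames)
        anchor_t = (end - 1) + lookahead
        schedule.append((start, end, anchor_t))
        start = end
    return schedule
```

With self-keyframing, every `anchor_t` would simply be paired with the reference image. A larger `lookahead` pushes the anchor further from the frames being generated, which is the knob the summary describes as trading identity adherence for motion freedom.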

MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control

Authors:Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler

Audio-driven talking face generation has gained significant attention for applications in digital media and virtual avatars. While recent methods improve audio-lip synchronization, they often struggle with temporal consistency, identity preservation, and customization, especially in long video generation. To address these issues, we propose MAGIC-Talk, a one-shot diffusion-based framework for customizable and temporally stable talking face generation. MAGIC-Talk consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which enhances motion coherence using structured motion priors. Unlike previous methods requiring multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. Additionally, a progressive latent fusion strategy is introduced to improve long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments demonstrate that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy, offering a robust solution for talking face generation.

Paper and project links

PDF

Summary
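Audio-driven talking face generation has drawn significant attention for applications in digital media and virtual avatars. To address temporal consistency, identity preservation, and customization in long video generation, the paper proposes MAGIC-Talk, a one-shot, diffusion-based framework. It consists of ReferenceNet, which preserves identity and enables fine-grained facial editing via text prompts, and AnimateNet, which improves motion coherence using structured motion priors. Unlike methods that need multiple reference images or fine-tuning, MAGIC-Talk maintains identity from a single image while ensuring smooth transitions across frames. A progressive latent fusion strategy further improves long-form video quality by reducing motion inconsistencies and flickering. Extensive experiments show that MAGIC-Talk outperforms state-of-the-art methods in visual quality, identity preservation, and synchronization accuracy.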

Key Takeaways

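Audio-driven talking face generation is gaining attention for digital media and virtual avatar applications.
MAGIC-Talk targets temporal consistency, identity preservation, and customization.
It consists of ReferenceNet, which preserves identity, and AnimateNet, which improves motion coherence.
Unlike earlier methods, MAGIC-Talk maintains identity from a single image and keeps transitions across frames smooth.
A progressive latent fusion strategy improves long-video quality by reducing motion inconsistencies and flickering.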
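
The abstract does not describe how progressive latent fusion works internally. As a loosely related illustration only, the sketch below shows a common generic technique for joining separately generated segments: linear cross-fading of the overlapping latent frames, so that the transition between segments is gradual rather than abrupt. The function name and the weighting scheme are assumptions and may differ from what MAGIC-Talk actually does.

```python
def fuse_segments(segments, overlap):
    """Concatenate latent segments, cross-fading `overlap` frames at each join.

    Each segment is a list of frames; each frame is a list of floats.
    The first `overlap` frames of a segment are assumed to cover the same
    timesteps as the last `overlap` frames of the previous one.
    """
    fused = [list(frame) for frame in segments[0]]
    for seg in segments[1:]:
        base = len(fused) - overlap
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)  # weight of the new segment ramps up
            prev = fused[base + i]
            fused[base + i] = [(1 - w) * p + w * q for p, q in zip(prev, seg[i])]
        fused.extend(list(frame) for frame in seg[overlap:])
    return fused
```

Blending in latent space before decoding, rather than blending decoded frames, is the design choice this sketch is meant to illustrate; abrupt joins between segments are one source of the flicker the summary mentions.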

LSF-Animation: Label-Free Speech-Driven Facial Animation via Implicit Feature Representation

Authors:Xin Lu, Chuanqing Zhuang, Chenxi Jin, Zhengda Lu, Yiqun Wang, Wu Liu, Jun Xiao

Speech-driven 3D facial animation has attracted increasing interest since its potential to generate expressive and temporally synchronized digital humans. While recent works have begun to explore emotion-aware animation, they still depend on explicit one-hot encodings to represent identity and emotion with given emotion and identity labels, which limits their ability to generalize to unseen speakers. Moreover, the emotional cues inherently present in speech are often neglected, limiting the naturalness and adaptability of generated animations. In this work, we propose LSF-Animation, a novel framework that eliminates the reliance on explicit emotion and identity feature representations. Specifically, LSF-Animation implicitly extracts emotion information from speech and captures the identity features from a neutral facial mesh, enabling improved generalization to unseen speakers and emotional states without requiring manual labels. Furthermore, we introduce a Hierarchical Interaction Fusion Block (HIFB), which employs a fusion token to integrate dual transformer features and more effectively integrate emotional, motion-related and identity-related cues. Extensive experiments conducted on the 3DMEAD dataset demonstrate that our method surpasses recent state-of-the-art approaches in terms of emotional expressiveness, identity generalization, and animation realism. The source code will be released at: https://github.com/Dogter521/LSF-Animation.

Paper and project links

PDF

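Summary

The paper proposes LSF-Animation, a framework that implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh, which improves generalization to unseen speakers and emotional states without manual labels. It introduces a Hierarchical Interaction Fusion Block (HIFB) that integrates emotional, motion-related, and identity-related cues. Experiments on the 3DMEAD dataset show that the method surpasses recent state-of-the-art approaches in emotional expressiveness, identity generalization, and animation realism.

Key Takeaways

LSF-Animation implicitly extracts emotion information from speech and captures identity features from a neutral facial mesh.
The framework improves generalization to unseen speakers and emotional states.
LSF-Animation does not require manual emotion or identity labels.
It introduces a new component, the Hierarchical Interaction Fusion Block (HIFB).
HIFB uses a fusion token to integrate emotional, motion-related, and identity-related cues.
On the 3DMEAD dataset, LSF-Animation performs strongly in emotional expressiveness, identity generalization, and animation realism.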

Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing

Authors:Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm

AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim’s likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.

Paper and project links

PDF

Summary

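AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, which lets an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic-video detectors fail outright. The authors observe that the pose-expression latent inherently contains biometric information about the driving identity. They propose a defense that never looks at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues in the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Experiments on multiple talking-head generation models show that the method consistently outperforms existing puppeteering defenses, runs in real time, and generalizes well to out-of-distribution scenarios.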

Key Takeaways

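AI-based talking-head videoconferencing reduces bandwidth, but the transmitted latent can be puppeteered.
Because every frame is synthetic, conventional deepfake and synthetic-video detectors fail.
The pose-expression latent contains biometric information that a defense can exploit.
The proposed defense uses a pose-conditioned, large-margin contrastive encoder to detect illicit identity swaps.
The method works without looking at the reconstructed RGB video and runs in real time.
It outperforms existing defenses across multiple models and generalizes to out-of-distribution scenarios.
The technique is relevant to protecting identity in virtual meetings and online communication.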
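
The "simple cosine test" mentioned above can be sketched as follows. The embeddings are assumed to come from an identity encoder like the one the paper describes; that encoder is not reproduced here, and the function names, the enrolled-reference setup, and the threshold value are hypothetical choices for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def flag_identity_swaps(enrolled, frame_embeddings, threshold=0.5):
    """Return indices of frames whose identity embedding is too dissimilar
    to the enrolled identity embedding (similarity below `threshold`)."""
    return [i for i, emb in enumerate(frame_embeddings)
            if cosine(enrolled, emb) < threshold]
```

Because the check is a single dot product per frame on a small embedding, it is cheap enough to run alongside rendering, which is consistent with the real-time claim in the summary.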

Audio Driven Real-Time Facial Animation for Social Telepresence

Authors:Jiye Lee, Chenghui Li, Linh Tran, Shih-En Wei, Jason Saragih, Alexander Richard, Hanbyul Joo, Shaojie Bai

We present an audio-driven real-time system for animating photorealistic 3D facial avatars with minimal latency, designed for social interactions in virtual reality for anyone. Central to our approach is an encoder model that transforms audio signals into latent facial expression sequences in real time, which are then decoded as photorealistic 3D facial avatars. Leveraging the generative capabilities of diffusion models, we capture the rich spectrum of facial expressions necessary for natural communication while achieving real-time performance (<15ms GPU time). Our novel architecture minimizes latency through two key innovations: an online transformer that eliminates dependency on future inputs and a distillation pipeline that accelerates iterative denoising into a single step. We further address critical design challenges in live scenarios for processing continuous audio signals frame-by-frame while maintaining consistent animation quality. The versatility of our framework extends to multimodal applications, including semantic modalities such as emotion conditions and multimodal sensors with head-mounted eye cameras on VR headsets. Experimental results demonstrate significant improvements in facial animation accuracy over existing offline state-of-the-art baselines, achieving 100 to 1000 times faster inference speed. We validate our approach through live VR demonstrations and across various scenarios such as multilingual speeches.

Paper and project links

PDF SIGGRAPH Asia 2025. Project page: https://jiyewise.github.io/projects/AudioRTA

Summary

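An audio-driven real-time system animates photorealistic 3D facial avatars with minimal latency for social interaction in virtual reality. An encoder model transforms audio signals into latent facial expression sequences in real time, which are then decoded into photorealistic 3D facial avatars. Using the generative capabilities of diffusion models, the system captures a rich range of facial expressions while running in real time (<15 ms GPU time). Two innovations minimize latency: an online transformer that removes the dependency on future inputs, and a distillation pipeline that reduces iterative denoising to a single step. The system processes continuous audio frame by frame while keeping animation quality consistent, and it extends to multimodal applications.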

Key Takeaways

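An audio-driven real-time system for animating photorealistic 3D facial avatars, designed for social interaction in virtual reality.
An encoder model converts audio signals into latent facial expression sequences in real time.
Diffusion models let the system capture a rich range of facial expressions while still running in real time.
An online transformer and a distillation pipeline are the two key innovations that minimize latency.
The system processes continuous audio frame by frame while keeping animation quality consistent.
The framework extends to multimodal applications, including emotion conditions and multimodal sensors.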
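
One standard way to remove a transformer's dependency on future inputs is causal masking, where each frame attends only to itself and earlier frames. The sketch below computes masked attention weights in plain Python. It illustrates the general idea only; the abstract does not specify the paper's online transformer design, so the function and its masking scheme are assumptions.

```python
import math


def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][s] with future positions (s > t) masked.

    `scores` is a square list of lists of raw attention scores.
    Row t of the result sums to 1 and is zero for every s > t.
    """
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]              # only current and past frames
        m = max(visible)                    # subtract max for numerical stability
        exps = [math.exp(x - m) for x in visible]
        z = sum(exps)
        weights.append([e / z for e in exps] + [0.0] * (len(row) - t - 1))
    return weights
```

Since row `t` never reads positions after `t`, the output for the newest audio frame can be computed as soon as that frame arrives, which is what frame-by-frame streaming requires.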

HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis

Authors:Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng

Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily due to their reliance on implicit modeling of audio-facial motion correlations–an approach lacking explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation combining both implicit and explicit motion cues. Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker’s superiority over state-of-the-art methods in visual quality and lip-sync accuracy.

Paper and project links

PDF

Summary

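The paper proposes HM-Talker, a framework for generating high-fidelity, temporally coherent talking-head videos. HM-Talker uses a hybrid motion representation that combines implicit features with explicit cues based on Action Units (AUs), anatomically defined facial muscle movements, to minimize phoneme-viseme misalignment. The Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit and explicit motion features and predicts AUs directly from audio aligned to visual cues, while the Hybrid Motion Modeling Module (HMMM) mitigates identity-dependent bias in the explicit features and improves cross-subject generalization. Experiments show that HM-Talker outperforms state-of-the-art methods in visual quality and lip-sync accuracy.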

Key Takeaways

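HM-Talker generates high-fidelity, temporally coherent talking-head videos and targets the motion blur and lip jitter produced by existing methods.
The framework uses a hybrid motion representation that combines implicit and explicit motion cues.
Explicit cues are Action Units (AUs), anatomically defined facial muscle movements, used alongside implicit features to reduce phoneme-viseme misalignment.
The Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit and explicit motion features and predicts AUs directly from audio aligned to visual cues.
The Hybrid Motion Modeling Module (HMMM) mitigates identity-dependent bias in the explicit features and improves cross-subject generalization.
It does so by dynamically merging randomly paired implicit and explicit features, which enforces identity-agnostic learning.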
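
The random-pairing idea can be illustrated with a small batch-level sketch: each sample's implicit feature is paired with the explicit feature of a randomly chosen sample in the batch, and the pair is merged. The fixed-weight blend, the function name, and the `alpha` parameter are assumptions for illustration; the abstract says the merge is dynamic but does not say how HMMM computes it.

```python
import random


def randomly_pair_and_merge(implicit, explicit, rng, alpha=0.5):
    """Pair implicit[i] with explicit[perm[i]] for a random permutation `perm`
    of the batch, then blend each pair with weight `alpha` on the implicit part.

    Returns (merged_features, perm).
    """
    perm = list(range(len(explicit)))
    rng.shuffle(perm)  # random pairing across samples in the batch
    merged = [
        [alpha * a + (1 - alpha) * b for a, b in zip(implicit[i], explicit[j])]
        for i, j in enumerate(perm)
    ]
    return merged, perm
```

A call such as `randomly_pair_and_merge(implicit, explicit, random.Random(0))` gives a reproducible pairing. Mixing features across samples during training is one plausible way to discourage a model from tying explicit motion cues to a particular identity.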

MOSPA: Human Motion Generation Driven by Spatial Audio

Authors:Shuyang Xu, Zhiyang Dou, Mingyi Shi, Liang Pan, Leo Ho, Jingbo Wang, Yuan Liu, Cheng Lin, Yuexin Ma, Wenping Wang, Taku Komura

Enabling virtual humans to dynamically and realistically respond to diverse auditory stimuli remains a key challenge in character animation, demanding the integration of perceptual modeling and motion synthesis. Despite its significance, this task remains largely unexplored. Most previous works have primarily focused on mapping modalities like speech, audio, and music to generate human motion. As of yet, these models typically overlook the impact of spatial features encoded in spatial audio signals on human motion. To bridge this gap and enable high-quality modeling of human movements in response to spatial audio, we introduce the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which contains diverse and high-quality spatial audio and motion data. For benchmarking, we develop a simple yet effective diffusion-based generative framework for human MOtion generation driven by SPatial Audio, termed MOSPA, which faithfully captures the relationship between body motion and spatial audio through an effective fusion mechanism. Once trained, MOSPA can generate diverse, realistic human motions conditioned on varying spatial audio inputs. We perform a thorough investigation of the proposed dataset and conduct extensive experiments for benchmarking, where our method achieves state-of-the-art performance on this task. Our code and model are publicly available at https://github.com/xsy27/Mospa-Acoustic-driven-Motion-Generation

Paper and project links

PDF NeurIPS 2025 (Spotlight)

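Summary

The paper addresses the challenge of making virtual humans respond dynamically and realistically to diverse auditory stimuli. It introduces the first comprehensive Spatial Audio-Driven Human Motion (SAM) dataset, which fills a gap left by existing models that overlook the effect of spatial features in spatial audio on human motion. It also proposes MOSPA, a simple yet effective diffusion-based framework for human motion generation driven by spatial audio, which captures the relationship between body motion and spatial audio through a fusion mechanism.

Key Takeaways

Existing models overlook the effect of spatial audio on human motion; the SAM dataset is introduced to fill this gap.
MOSPA is a diffusion-based generative framework that produces human motion from spatial audio.
MOSPA faithfully captures the relationship between body motion and spatial audio.
Once trained, MOSPA generates diverse, realistic human motions conditioned on varying spatial audio inputs.
The authors analyze the dataset thoroughly and run extensive benchmark experiments, in which the method achieves state-of-the-art performance on this task.
The code and model are publicly available.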

Talk, Listen, Connect: How Humans and AI Evaluate Empathy in Responses to Emotionally Charged Narratives

Authors:Mahnaz Roshanaei, Rezvaneh Rezapour, Magy Seif El-Nasr

Social interactions promote well-being, yet barriers like geographic distance, time limitations, and mental health conditions can limit face-to-face interactions. Emotionally responsive AI systems, such as chatbots, offer new opportunities for social and emotional support, but raise critical questions about how empathy is perceived and experienced in human-AI interactions. This study examines how empathy is evaluated in AI-generated versus human responses. Using personal narratives, we explored how persona attributes (e.g., gender, empathic traits, shared experiences) and story qualities affect empathy ratings. We compared responses from standard and fine-tuned AI models with human judgments. Results show that while humans are highly sensitive to emotional vividness and shared experience, AI-responses are less influenced by these cues, often lack nuance in empathic expression. These findings highlight challenges in designing emotionally intelligent systems that respond meaningfully across diverse users and contexts, and informs the design of ethically aware tools to support social connection and well-being.

Paper and project links

PDF 18 pages, 4 figures, 6 tables. Title updated from “Talk, Listen, Connect: Navigating Empathy in Human-AI Interactions” to “Talk, Listen, Connect: How Humans and AI Evaluate Empathy in Responses to Emotionally Charged Narratives” in this version. This is version 3 (v3) of the paper. All previous citations of arXiv:2409.15550 with the old title still refer to the same paper

Summary
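The study examines how empathy is evaluated in AI-generated versus human responses to personal narratives. Although emotionally responsive AI systems can offer social and emotional support, their empathy ratings differ from human judgments. Humans are highly sensitive to emotional vividness and shared experience, whereas AI responses are less influenced by these cues and often lack nuance in empathic expression. These findings highlight the challenge of designing emotionally intelligent systems that respond meaningfully across diverse users and contexts, and they inform the design of ethically aware tools.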

Key Takeaways

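Social interaction promotes well-being, but barriers such as geographic distance, time limitations, and mental health conditions can limit face-to-face contact.
AI systems such as chatbots offer new opportunities for social and emotional support, but they raise critical questions about how empathy is perceived and experienced in human-AI interaction.
Compared with AI, humans are highly sensitive to emotional vividness and shared experience; AI responses are less influenced by these cues.
AI responses often lack nuance in empathic expression, which limits how well they respond across different users and contexts.
Designers of emotionally intelligent systems need to address these challenges to support social connection and well-being effectively.
Ethical awareness should inform the design of such systems so that they respond appropriately to the needs of different users.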

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-05/Talking%20Head%20Generation/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!

Talking Head Generation
