Published: 2025-10-03

Updated: 2025-11-27


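⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use this content for serious academic purposes. It is intended only for initial screening before reading the paper.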

Updated 2025-10-03

PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation

Authors: Fatemeh Nazarieh, Zhenhua Feng, Diptesh Kanojia, Muhammad Awais, Josef Kittler

Audio-driven talking face generation is a challenging task in digital communication. Despite significant progress in the area, most existing methods concentrate on audio-lip synchronization, often overlooking aspects such as visual quality, customization, and generalization that are crucial to producing realistic talking faces. To address these limitations, we introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. IdentityNet is designed to preserve identity features consistently across the generated video frames, while AnimateNet aims to enhance temporal coherence and motion consistency. This framework also integrates an audio input with the reference images, thereby reducing the reliance on reference-style videos prevalent in existing approaches. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms, which significantly expands creative control over the generated videos. Through extensive experiments, including a newly developed evaluation metric, our model demonstrates superior performance over the state-of-the-art methods, setting a new standard for the generation of customizable realistic talking faces suitable for real-world applications.


Paper and project links

PDF

Summary

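This paper introduces PortraitTalk, a customizable one-shot framework for audio-driven talking face generation. It builds on a latent diffusion model with two main components: IdentityNet, which keeps identity features consistent across the generated video frames, and AnimateNet, which improves temporal coherence and motion consistency. Unlike existing methods that depend on reference-style videos, PortraitTalk combines an audio input with reference images. It also incorporates text prompts through a decoupled cross-attention mechanism, which significantly expands creative control over the generated videos. The authors report that, in their experiments, the model outperforms state-of-the-art methods and sets a new standard for customizable, realistic talking faces suitable for real-world applications.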

Key Takeaways

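PortraitTalk is an audio-driven talking face generation framework that targets a challenging task in digital communication.
The framework uses a latent diffusion model with two components: IdentityNet, which preserves identity features across frames, and AnimateNet, which improves temporal coherence and motion consistency.
PortraitTalk combines an audio input with reference images, reducing the reliance on reference-style videos.
Text prompts are incorporated through a decoupled cross-attention mechanism, which expands creative control over the generated videos.
In the authors' experiments, which include a newly developed evaluation metric, the model outperforms state-of-the-art methods.
The authors present PortraitTalk as suitable for real-world applications.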
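
The abstract says only that text prompts enter through decoupled cross-attention; it does not describe the layer itself. The sketch below illustrates the general idea under a common assumption: each conditioning stream gets its own key/value projections, the latent query is shared, and the two attention outputs are summed. Every name here (`decoupled_cross_attention`, `other_tokens`, `other_scale`, the weight keys) is hypothetical and is not taken from the paper, and the second stream standing in for identity or audio features is likewise an assumption.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def decoupled_cross_attention(latent, text_tokens, other_tokens, w, other_scale=1.0):
    """Attend to two conditioning streams separately, then sum the results.

    The query projection is shared; each stream has its own key and value
    projections (hypothetical names, assumed layout). `other_scale` weights
    the second stream relative to the text stream.
    """
    q = matmul(latent, w["q"])
    text_out = attention(
        q, matmul(text_tokens, w["k_text"]), matmul(text_tokens, w["v_text"])
    )
    other_out = attention(
        q, matmul(other_tokens, w["k_other"]), matmul(other_tokens, w["v_other"])
    )
    return [
        [t + other_scale * o for t, o in zip(t_row, o_row)]
        for t_row, o_row in zip(text_out, other_out)
    ]


# Toy usage: identity projections, three latent positions, one token per stream.
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {name: identity for name in ("q", "k_text", "v_text", "k_other", "v_other")}
latent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[1.0, 2.0]]
other_tokens = [[3.0, 4.0]]
out = decoupled_cross_attention(latent, text_tokens, other_tokens, weights, other_scale=0.5)
```

Keeping the two streams in separate attention branches means the influence of one stream can be scaled, or switched off with `other_scale=0.0`, without touching the text branch. Whether PortraitTalk exposes such a control is not stated in the abstract.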


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-03/Talking%20Head%20Generation/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Talking Head Generation
