Published: 2025-11-12

Updated: 2025-11-27

Word count: 2.1k

Reading time: 8 min

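⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use these summaries in serious academic contexts. They are intended only for initial screening before reading the papers.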

Updated 2025-11-12

ConsistTalk: Intensity Controllable Temporally Consistent Talking Head Generation with Diffusion Noise Search

Authors: Zhenjie Liu, Jianzhang Lu, Renjie Lu, Cong Liang, Shangfei Wang

Recent advancements in video diffusion models have significantly enhanced audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization. These issues primarily stem from entangled appearance-motion representations and unstable inference strategies. In this paper, we introduce ConsistTalk, a novel intensity-controllable and temporally consistent talking head generation framework with diffusion noise search inference. First, we propose an optical flow-guided temporal module (OFT) that decouples motion features from static appearance by leveraging facial optical flow, thereby reducing visual flicker and improving temporal consistency. Second, we present an Audio-to-Intensity (A2I) model obtained through multimodal teacher-student knowledge distillation. By transforming audio and facial velocity features into a frame-wise intensity sequence, the A2I model enables joint modeling of audio and visual motion, resulting in more natural dynamics. This further enables fine-grained, frame-wise control of motion dynamics while maintaining tight audio-visual synchronization. Third, we introduce a diffusion noise initialization strategy (IC-Init). By enforcing explicit constraints on background coherence and motion continuity during inference-time noise search, we achieve better identity preservation and refine motion dynamics compared to the current autoregressive strategy. Extensive experiments demonstrate that ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and delivering temporally stable, high-fidelity talking head videos.

Paper and project links

PDF AAAI26 poster

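Summary
Advances in video diffusion models have markedly improved audio-driven portrait animation. However, current methods still suffer from flickering, identity drift, and poor audio-visual synchronization, which stem mainly from entangled appearance-motion representations and unstable inference strategies. This paper proposes ConsistTalk, an intensity-controllable, temporally consistent talking head generation framework that uses diffusion noise search at inference. First, an optical flow-guided temporal module (OFT) uses facial optical flow to decouple motion features from static appearance, reducing visual flicker and improving temporal consistency. Second, an Audio-to-Intensity (A2I) model, obtained through multimodal teacher-student knowledge distillation, converts audio and facial velocity features into a frame-wise intensity sequence. This lets the model capture audio and visual motion jointly, yields more natural dynamics, and allows fine-grained, frame-wise control of motion while keeping audio-visual synchronization tight. Third, a diffusion noise initialization strategy (IC-Init) enforces explicit constraints on background coherence and motion continuity during the inference-time noise search, improving identity preservation and refining motion dynamics relative to the current autoregressive strategy. The authors report that, in extensive experiments, ConsistTalk significantly outperforms prior methods in reducing flicker, preserving identity, and producing temporally stable, high-fidelity talking head videos.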

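Key Takeaways

ConsistTalk is a new talking head generation framework that pairs a video diffusion model with diffusion noise search at inference to improve audio-driven portrait animation.
The optical flow-guided temporal module (OFT) decouples motion features from static appearance, reducing visual flicker and improving temporal consistency.
The Audio-to-Intensity (A2I) model jointly models audio and visual motion, producing more natural dynamics and enabling fine-grained, frame-wise control of motion intensity.
The diffusion noise initialization strategy (IC-Init) imposes explicit constraints on background coherence and motion continuity during inference, improving identity preservation and motion dynamics.
The authors report that ConsistTalk significantly reduces flicker, preserves identity better, and produces temporally stable, high-fidelity talking head videos.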
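
The idea of inference-time noise search can be illustrated with a generic best-of-N selection loop. The following is a minimal sketch under stated assumptions, not the paper's IC-Init procedure: `continuity_cost`, `search_initial_noise`, and the flat list-of-floats "latent" are hypothetical names and simplifications, and the cost function is only a stand-in for the background-coherence and motion-continuity constraints described above.

```python
import random


def continuity_cost(candidate, previous_tail):
    """Toy cost: mean squared gap between the start of a candidate
    and the tail of the previous clip (hypothetical stand-in for a
    motion-continuity constraint)."""
    n = len(previous_tail)
    return sum((c - p) ** 2 for c, p in zip(candidate[:n], previous_tail)) / n


def search_initial_noise(score_fn, length, num_candidates=16, seed=0):
    """Draw several Gaussian noise candidates and keep the one
    with the lowest score under score_fn."""
    rng = random.Random(seed)
    best_noise, best_cost = None, float("inf")
    for _ in range(num_candidates):
        noise = [rng.gauss(0.0, 1.0) for _ in range(length)]
        cost = score_fn(noise)
        if cost < best_cost:
            best_noise, best_cost = noise, cost
    return best_noise, best_cost


if __name__ == "__main__":
    tail = [0.5, -0.2, 0.1]  # made-up values standing in for the previous clip
    noise, cost = search_initial_noise(
        lambda z: continuity_cost(z, tail), length=8
    )
    print(f"selected candidate cost: {cost:.4f}")
```

In a real diffusion pipeline the score would presumably be computed on decoded or partially denoised frames rather than on raw noise; the sketch only shows the select-the-best-candidate structure.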

Interpreting the Role of Visemes in Audio-Visual Speech Recognition

Authors: Aristeidis Papadopoulos, Naomi Harte

Audio-Visual Speech Recognition (AVSR) models have surpassed their audio-only counterparts in terms of performance. However, the interpretability of AVSR systems, particularly the role of the visual modality, remains under-explored. In this paper, we apply several interpretability techniques to examine how visemes are encoded in AV-HuBERT, a state-of-the-art AVSR model. First, we use t-distributed Stochastic Neighbour Embedding (t-SNE) to visualize learned features, revealing natural clustering driven by visual cues, which is further refined by the presence of audio. Then, we employ probing to show how audio contributes to refining feature representations, particularly for visemes that are visually ambiguous or under-represented. Our findings shed light on the interplay between modalities in AVSR and could point to new strategies for leveraging visual information to improve AVSR performance.

Paper and project links

PDF Accepted into Automatic Speech Recognition and Understanding (ASRU 2025)

Summary
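This paper applies several interpretability techniques to examine how visemes are encoded in AV-HuBERT, a state-of-the-art audio-visual speech recognition (AVSR) model. A t-SNE visualization of the learned features reveals natural clustering driven by visual cues, and adding audio refines these clusters further. Probing then shows how audio improves the feature representations, particularly for visemes that are visually ambiguous or under-represented. The findings shed light on the interplay between modalities in AVSR and could point to new strategies for using visual information to improve AVSR performance.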
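
Probing, as mentioned above, generally means fitting a simple classifier on frozen model features and reading its accuracy as a measure of how much label information those features carry. The sketch below illustrates that idea with a nearest-centroid probe on synthetic 2-D vectors. It is an assumption-laden toy, not the authors' setup: the feature generator, the two class names, and all function names are hypothetical, and no AV-HuBERT features are involved.

```python
import random


def fit_centroids(features, labels):
    """Compute the mean feature vector for each label."""
    sums, counts = {}, {}
    for vec, lab in zip(features, labels):
        acc = sums.setdefault(lab, [0.0] * len(vec))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest in squared distance."""
    def dist(center):
        return sum((a - b) ** 2 for a, b in zip(vec, center))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def probe_accuracy(train_x, train_y, test_x, test_y):
    """Fit the probe on training features, report accuracy on held-out ones."""
    centroids = fit_centroids(train_x, train_y)
    hits = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
    return hits / len(test_y)


def make_toy_features(n_per_class, seed):
    """Synthetic stand-in for frozen encoder features (hypothetical classes)."""
    rng = random.Random(seed)
    centers = {"bilabial": (2.0, 0.0), "open_vowel": (-2.0, 0.0)}
    xs, ys = [], []
    for lab, (cx, cy) in centers.items():
        for _ in range(n_per_class):
            xs.append([cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)])
            ys.append(lab)
    return xs, ys


if __name__ == "__main__":
    train_x, train_y = make_toy_features(50, seed=0)
    test_x, test_y = make_toy_features(50, seed=1)
    print("probe accuracy:", probe_accuracy(train_x, train_y, test_x, test_y))
```

Comparing the accuracy of such a probe on visual-only features against audio-visual features is one way to quantify how much audio refines a representation; a linear or logistic-regression probe is the more common choice in practice.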

Key Takeaways