
Talking Head Generation


⚠️ All summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Note: never use these summaries in serious academic settings; they are only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-02-01 Update

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Authors: Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, Qingming Huang

Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to simultaneously maintain audio-visual sync and achieve clear pronunciation; (2) they lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify the emotion type and intensity while achieving high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which learns the inherent consistency between lip motion and prosody variation via duration-level contrastive learning to achieve reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses video-level phoneme sequences with an efficient Conformer to improve speech intelligibility. Next, a speaker identity adapting module decodes the acoustic prior and injects the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) synthesizes the waveform with a flow-matching prediction network conditioned on the acoustic prior. In this process, FUEC determines the gradient direction and guidance scale from the user's emotion instructions via a positive and negative guidance mechanism, which amplifies the desired emotion while suppressing others. Extensive experimental results on three benchmark datasets demonstrate favorable performance compared to several state-of-the-art methods.
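The positive and negative guidance in FUEC reads like classifier-free guidance with two conditioning branches: the predicted velocity is steered toward the desired emotion and away from the competing ones. Below is a minimal sketch of one Euler sampling step under that reading; the callable `flow_net`, its signature, the unconditional branch, and the scales `alpha`/`beta` are all illustrative assumptions, not the paper's actual interface.

```python
import torch

def fuec_guided_step(flow_net, x_t, t, acoustic_prior,
                     pos_emb, neg_embs, alpha=2.0, beta=1.0, dt=1e-2):
    """One Euler step of flow-matching sampling with positive/negative
    emotion guidance (a hypothetical reading of FUEC, not the paper's code).

    flow_net(x, t, prior, emo) -> predicted velocity field
    pos_emb:  embedding of the user-specified emotion
    neg_embs: embeddings of the emotions to suppress
    alpha, beta: guidance scales for amplification / suppression
    """
    v_uncond = flow_net(x_t, t, acoustic_prior, emo=None)   # no emotion condition
    v_pos = flow_net(x_t, t, acoustic_prior, emo=pos_emb)   # desired emotion
    # Average velocity under the competing emotions to be suppressed.
    v_neg = torch.stack(
        [flow_net(x_t, t, acoustic_prior, emo=e) for e in neg_embs]
    ).mean(dim=0)
    # Push toward the positive emotion, away from the negative ones.
    v = v_uncond + alpha * (v_pos - v_uncond) - beta * (v_neg - v_uncond)
    return x_t + dt * v  # Euler update along the guided velocity
```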


Paper and project links

PDF (under review)

Summary
For the movie dubbing task, this work proposes EmoDubber, an emotion-controllable dubbing architecture that lets users specify the emotion type and intensity while achieving high-quality lip sync and pronunciation. By designing Lip-related Prosody Aligning (LPA), a Pronunciation Enhancing (PE) strategy, and Flow-based User Emotion Controlling (FUEC), it achieves audio-visual synchronization and expression of the specified emotion.

Key Takeaways

  1. The movie dubbing task aims to generate speech that matches the video while cloning the desired voice.
  2. Existing methods struggle to simultaneously maintain audio-visual sync and clear pronunciation, and lack the ability to express user-defined emotions.
  3. The proposed EmoDubber architecture addresses these problems, allowing users to specify the emotion type and intensity.
  4. LPA achieves reasonable alignment by learning the inherent consistency between lip motion and prosody variation (see the sketch after this list).
  5. The PE strategy fuses video-level phoneme sequences to improve speech intelligibility.
  6. FUEC synthesizes the waveform with a flow-matching prediction network conditioned on the acoustic prior, and determines the gradient direction and guidance scale from the user's emotion instructions.
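Takeaway 4's duration-level contrastive learning can be pictured as a symmetric InfoNCE objective over duration-aligned lip-motion and prosody segments: matching pairs from the same time span are pulled together, mismatched pairs pushed apart. The sketch below is a hypothetical rendering under that assumption; the feature shapes and pairing scheme are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def duration_contrastive_loss(lip_feats, prosody_feats, temperature=0.07):
    """InfoNCE-style loss over duration-aligned segments (a hypothetical
    rendering of LPA's duration-level contrastive learning).

    lip_feats, prosody_feats: (N, D) features for N aligned duration
    segments; row i of each tensor comes from the same time span.
    """
    lip = F.normalize(lip_feats, dim=-1)
    pro = F.normalize(prosody_feats, dim=-1)
    logits = lip @ pro.t() / temperature        # (N, N) similarity matrix
    targets = torch.arange(lip.size(0), device=lip.device)
    # Symmetric loss: lip -> prosody and prosody -> lip matching.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```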

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!