
Talking Head Generation

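⚠️ All summaries below are generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are intended only for screening papers before reading them.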

Updated 2025-02-01

EmoDubber: Towards High Quality and Emotion Controllable Movie Dubbing

Authors: Gaoxiang Cong, Jiadong Pan, Liang Li, Yuankai Qi, Yuxin Peng, Anton van den Hengel, Jian Yang, Qingming Huang

Given a piece of text, a video clip, and a reference audio, the movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice. Existing methods have two primary deficiencies: (1) they struggle to maintain audio-visual sync while achieving clear pronunciation; (2) they lack the capacity to express user-defined emotions. To address these problems, we propose EmoDubber, an emotion-controllable dubbing architecture that allows users to specify emotion type and emotional intensity while maintaining high-quality lip sync and pronunciation. Specifically, we first design Lip-related Prosody Aligning (LPA), which learns the inherent consistency between lip motion and prosody variation through duration-level contrastive learning to obtain a reasonable alignment. Then, we design a Pronunciation Enhancing (PE) strategy that fuses the video-level phoneme sequences with an efficient conformer to improve speech intelligibility. Next, the speaker identity adapting module decodes the acoustics prior and injects the speaker style embedding. After that, the proposed Flow-based User Emotion Controlling (FUEC) synthesizes the waveform with a flow-matching prediction network conditioned on the acoustics prior. In this process, FUEC determines the gradient direction and guidance scale from the user's emotion instructions through a positive and negative guidance mechanism, which amplifies the desired emotion while suppressing the others. Extensive experimental results on three benchmark datasets demonstrate favorable performance compared to several state-of-the-art methods.
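The positive and negative guidance idea in the abstract can be illustrated with a small sketch. The abstract does not give FUEC's actual formulation, so the code below swaps in a classifier-free-guidance-style combination of velocity fields as an assumption: the unconditional velocity is pushed toward the target emotion with scale `alpha` and away from the other emotions with scale `beta`, then integrated with plain Euler steps. The names `guided_velocity`, `euler_sample`, `toy_velocity`, `alpha`, and `beta` are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
# Hypothetical interface: velocity(x, t, emotion) -> vector.
# emotion=None stands for the prediction without any emotion condition.
VelocityFn = Callable[[Vector, float, Optional[str]], Vector]


def guided_velocity(velocity: VelocityFn, x: Vector, t: float, target: str,
                    others: Sequence[str], alpha: float, beta: float) -> Vector:
    """Combine velocities: amplify the target emotion, suppress the others.

    Assumed rule (not taken from the paper):
        v = v_base + alpha * (v_target - v_base)
                   - (beta / K) * sum_k (v_other_k - v_base)
    """
    base = velocity(x, t, None)
    pos = velocity(x, t, target)
    out = [b + alpha * (p - b) for b, p in zip(base, pos)]
    if others:
        share = beta / len(others)  # spread the negative scale over all other emotions
        for emotion in others:
            neg = velocity(x, t, emotion)
            out = [o - share * (n - b) for o, n, b in zip(out, neg, base)]
    return out


def euler_sample(velocity: VelocityFn, x0: Vector, target: str,
                 others: Sequence[str], alpha: float, beta: float,
                 steps: int = 10) -> Vector:
    """Integrate dx/dt = guided velocity from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = guided_velocity(velocity, x, i * dt, target, others, alpha, beta)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, emotion: Optional[str]) -> Vector:
    """Toy constant fields standing in for a trained prediction network."""
    fields = {None: [0.0, 0.0], "happy": [1.0, 0.0], "sad": [0.0, 1.0]}
    return list(fields[emotion])


sample = euler_sample(toy_velocity, [0.0, 0.0], "happy", ["sad"], alpha=2.0, beta=1.0)
```

In this toy setting, raising `alpha` moves the sample further along the "happy" direction and raising `beta` moves it further from the "sad" direction, which loosely mirrors the intensity control the abstract describes. A real system would replace `toy_velocity` with a trained network conditioned on the acoustics prior.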



Under review


Key Takeaways

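  1. The movie dubbing task aims to generate speech that aligns with the video while cloning the desired voice.
  2. Existing methods struggle to maintain audio-visual sync and clear pronunciation at the same time, and they cannot express user-defined emotions.
  3. The proposed EmoDubber architecture addresses these problems by letting users specify the emotion type and intensity.
  4. Lip-related Prosody Aligning (LPA) learns the inherent consistency between lip motion and prosody variation through duration-level contrastive learning, which yields a reasonable alignment.
  5. The Pronunciation Enhancing (PE) strategy fuses video-level phoneme sequences with an efficient conformer to improve speech intelligibility.
  6. Flow-based User Emotion Controlling (FUEC) synthesizes the waveform with a flow-matching prediction network conditioned on the acoustics prior, and sets the gradient direction and guidance scale from the user's emotion instructions.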
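As a rough illustration of the contrastive alignment idea behind LPA, the sketch below computes an InfoNCE-style loss between lip-motion embeddings and prosody embeddings. The summary does not specify the paper's loss, so this is an assumption: it treats the i-th lip segment and the i-th prosody segment as a positive pair and all other pairings as negatives. The names `cosine`, `info_nce`, and `temperature`, and the idea of one embedding per duration segment, are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def info_nce(lip: List[Vector], prosody: List[Vector],
             temperature: float = 0.1) -> float:
    """Mean InfoNCE loss where lip[i] and prosody[i] form the positive pair."""
    n = len(lip)
    total = 0.0
    for i in range(n):
        logits = [cosine(lip[i], prosody[j]) / temperature for j in range(n)]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / n


# Toy segment embeddings: aligned pairs should give a lower loss than swapped pairs.
lip_segments = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(lip_segments, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(lip_segments, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing a loss of this kind pulls matching lip and prosody segments together and pushes mismatched ones apart, which is one plausible reading of "learning the inherent consistency between lip motion and prosody variation".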



Post author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.