
Talking Head Generation

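⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for preliminary screening before reading the papers.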

Updated 2025-02-13

What Is That Talk About? A Video-to-Text Summarization Dataset for Scientific Presentations

Authors: Dongqi Liu, Chenxi Whitehouse, Xi Yu, Louis Mahon, Rohit Saxena, Zheng Zhao, Yifu Qiu, Mirella Lapata, Vera Demberg

Transforming recorded videos into concise and accurate textual summaries is a growing challenge in multimodal learning. This paper introduces VISTA, a dataset specifically designed for video-to-text summarization in scientific domains. VISTA contains 18,599 recorded AI conference presentations paired with their corresponding paper abstracts. We benchmark the performance of state-of-the-art large models and apply a plan-based framework to better capture the structured nature of abstracts. Both human and automated evaluations confirm that explicit planning enhances summary quality and factual consistency. However, a considerable gap remains between models and human performance, highlighting the challenges of scientific video summarization.



arXiv admin note: text overlap with arXiv:2306.02873 by other authors



Key Takeaways

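  1. The VISTA dataset targets video-to-text summarization in scientific domains, pairing recorded AI conference presentations with their corresponding paper abstracts.
  2. The paper benchmarks state-of-the-art large models and finds that a plan-based framework helps capture the structured nature of abstracts.
  3. Explicit planning improves both summary quality and factual consistency.
  4. A considerable gap remains between current models and human performance, which underscores how hard scientific video summarization is.
  5. Video-to-text summarization is a growing challenge, and further research is needed to overcome the existing difficulties.
  6. The task requires not only natural language processing techniques but also the ability to understand video content and its context.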



Playmate: Flexible Control of Portrait Animation via 3D-Implicit Space Guided Diffusion

Authors: Xingpei Ma, Jiaran Cai, Yuansheng Guan, Shenneng Huang, Qiang Zhang, Shunsi Zhang

Recent diffusion-based talking face generation models have demonstrated impressive potential in synthesizing videos that accurately match a speech audio clip with a given reference identity. However, existing approaches still encounter significant challenges due to uncontrollable factors, such as inaccurate lip-sync, inappropriate head posture and the lack of fine-grained control over facial expressions. In order to introduce more face-guided conditions beyond speech audio clips, a novel two-stage training framework Playmate is proposed to generate more lifelike facial expressions and talking faces. In the first stage, we introduce a decoupled implicit 3D representation along with a meticulously designed motion-decoupled module to facilitate more accurate attribute disentanglement and generate expressive talking videos directly from audio cues. Then, in the second stage, we introduce an emotion-control module to encode emotion control information into the latent space, enabling fine-grained control over emotions and thereby achieving the ability to generate talking videos with desired emotion. Extensive experiments demonstrate that Playmate outperforms existing state-of-the-art methods in terms of video quality and lip-synchronization, and improves flexibility in controlling emotion and head pose. The code will be available at https://playmate111.github.io.






Key Takeaways

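  1. Playmate is a new two-stage training framework for generating talking-face videos that match a speech audio clip.
  2. The first stage introduces a decoupled implicit 3D representation and a motion-decoupled module, enabling more accurate attribute disentanglement and expressive talking videos generated directly from audio.
  3. The second stage introduces an emotion-control module that encodes emotion control information into the latent space, so the model can generate talking videos with a desired emotion.
  4. Playmate outperforms existing state-of-the-art methods in video quality and lip synchronization, and is more flexible in controlling emotion and head pose.
  5. The code will be made available at https://playmate111.github.io.
  6. The framework aims to produce more lifelike facial expressions and talking faces.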



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.