Published: 2025-11-08

Updated: 2025-11-27

Word count: 1.1k

Reading time: 4 min

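⚠️ All summaries below are generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: Never use these summaries in serious academic settings. They are intended only for initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it free on HuggingFace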

2025-11-08 Update

THEval. Evaluation Framework for Talking Head Video Generation

Authors: Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva

Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation relies primarily on limited metrics that evaluate general video quality and lip synchronization, and on user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency as well as alignment with human preferences. Based on these considerations, we streamline our analysis to the fine-grained dynamics of the head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they struggle to generate expressiveness and artifact-free details. These videos were generated from a novel real dataset that we curated in order to mitigate training-data bias. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset, and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.
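The abstract groups eight metrics into three dimensions and mentions leaderboards, but it does not say how (or whether) the individual scores are combined into a single ranking. As a rough illustration only, the sketch below assumes every metric has already been normalized to [0, 1] with higher meaning better, averages the metrics within each dimension, and ranks models by the unweighted mean of the three dimension scores. The functions `dimension_scores` and `rank_models`, the equal weighting, the 3/3/2 split of metrics across dimensions, and all the toy numbers are hypothetical; they are not THEval's method or results.

```python
from statistics import mean

# THEval's three evaluation dimensions, as named in the abstract.
DIMENSIONS = ("quality", "naturalness", "synchronization")

# Hypothetical toy data: per-model metric scores, already normalized to [0, 1]
# with higher = better. The 3/3/2 split of the 8 metrics and all values are
# made up for illustration; they are NOT THEval results.
TOY_RESULTS = {
    "model_a": {
        "quality": [0.7, 0.6, 0.8],
        "naturalness": [0.4, 0.5, 0.3],
        "synchronization": [0.9, 0.85],
    },
    "model_b": {
        "quality": [0.6, 0.6, 0.7],
        "naturalness": [0.6, 0.5, 0.6],
        "synchronization": [0.8, 0.8],
    },
}


def dimension_scores(metric_scores: dict[str, list[float]]) -> dict[str, float]:
    """Average the normalized metric scores inside each dimension (assumed scheme)."""
    return {dim: mean(metric_scores[dim]) for dim in DIMENSIONS}


def rank_models(results: dict[str, dict[str, list[float]]]) -> list[tuple[str, float]]:
    """Rank models by the unweighted mean of their three dimension scores, best first."""
    overall = {
        model: mean(dimension_scores(scores).values())
        for model, scores in results.items()
    }
    return sorted(overall.items(), key=lambda item: item[1], reverse=True)


for model, score in rank_models(TOY_RESULTS):
    per_dim = dimension_scores(TOY_RESULTS[model])
    print(f"{model}: overall={score:.3f}", {d: round(v, 3) for d, v in per_dim.items()})
```

With these made-up numbers, `model_a` has the higher synchronization score but ranks below `model_b` because its naturalness score is low, which mirrors the kind of trade-off the abstract describes (strong lip sync, weaker expressiveness). Equal weighting is simply the easiest choice to read; a real benchmark could instead weight dimensions by how well each agrees with human preferences.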


Paper and project links

PDF

Summary

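Video generation technology is advancing rapidly and producing increasingly realistic results. However, evaluation metrics have lagged behind this progress, particularly for talking head generation. This paper proposes a new evaluation framework with eight metrics spanning three dimensions: (i) quality, (ii) naturalness, and (iii) synchronization. The metrics were selected for efficiency and for alignment with human preferences. By analyzing the fine-grained dynamics of the head, mouth, and eyebrows together with face quality, the authors' extensive experiments on 85,000 videos from 17 state-of-the-art models show that many existing models excel at lip synchronization but still struggle to generate expressive, artifact-free details. These videos were generated from a newly curated real dataset intended to mitigate training-data bias. The proposed benchmark framework is designed to track improvements in generative methods.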

Key Takeaways