Published: 2025-10-11

Updated: 2025-11-27


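⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use this content in serious academic settings. It is intended only for screening papers before reading them.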

Updated 2025-10-11

Paper2Video: Automatic Video Generation from Scientific Papers

Authors: Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou

Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2- to 10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with effective layout refinement by a novel tree search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.


Summary

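This paper discusses the role of academic presentation videos as a medium for research communication and the challenges of producing them. To address those challenges, it proposes a dataset-based benchmark with dedicated evaluation metrics, together with a new framework. The framework automates the generation of academic presentation videos by integrating modules such as slide generation, layout refinement, subtitling, speech synthesis, and talking-head rendering, improving both how well the videos convey information and how efficiently they are produced.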
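
To make the idea of slide-wise parallel generation more concrete, here is a minimal Python sketch. It is not PaperTalker's actual code or API: `SlideSegment`, `build_segment`, and `generate_video_plan` are hypothetical names, and the per-slide stages are placeholder strings standing in for real slide, subtitle, and speech models. It only illustrates that, if each slide's assets can be produced independently, the per-slide work can be run in parallel and reassembled in order.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class SlideSegment:
    """Assets for one slide (hypothetical structure, for illustration only)."""
    index: int
    slide: str
    subtitle: str
    audio: str


def build_segment(index: int, section_text: str) -> SlideSegment:
    """Hypothetical per-slide stage; placeholder strings stand in for model calls."""
    slide = f"slide[{index}]: {section_text[:20]}"
    subtitle = f"subtitle[{index}]: {section_text}"
    audio = f"audio[{index}] ({len(section_text.split())} words)"
    return SlideSegment(index, slide, subtitle, audio)


def generate_video_plan(sections: list[str], max_workers: int = 4) -> list[SlideSegment]:
    """Build every slide segment in parallel, assuming slides are independent."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so slide order is preserved.
        return list(pool.map(build_segment, range(len(sections)), sections))


if __name__ == "__main__":
    for segment in generate_video_plan(["Intro to the benchmark", "Method overview", "Results"]):
        print(segment)
```

A thread pool is used here on the assumption that the real stages would mostly wait on model or network calls. A process pool or an async client would be a reasonable alternative for CPU-bound rendering.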

Key Takeaways