⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on them in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-09
Paper2Video: Automatic Video Generation from Scientific Papers
Authors:Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2-to-10-minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics (Meta Similarity, PresentArena, PresentQuiz, and IP Memory) to measure how well the videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with layout refinement via a novel tree-search visual choice, together with cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
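To make the evaluation idea concrete, here is a deliberately simple stand-in for a content-similarity score between generated and author-created presentation material. It only illustrates the general idea of comparing generated output against the author's ground truth; it is not the paper's Meta Similarity metric, and the example strings and function name are hypothetical.

```python
# Illustrative stand-in only: a crude text-similarity score between generated
# and author-created presentation content. NOT the paper's Meta Similarity metric.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rough_content_similarity(generated_subtitles: str, author_subtitles: str) -> float:
    """Cosine similarity of TF-IDF vectors for two subtitle transcripts."""
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform([generated_subtitles, author_subtitles])
    return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

if __name__ == "__main__":
    gen = "We introduce a benchmark of research papers paired with presentation videos."
    ref = "Our benchmark pairs research papers with author-created presentation videos."
    print(f"rough similarity: {rough_content_similarity(gen, ref):.3f}")
```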
Paper and project links
PDF 20 pages, 8 figures
Summary
This paper highlights the importance of academic presentation videos and the complexity of producing them. To address these challenges, the authors introduce Paper2Video, a benchmark that pairs research papers with author-created presentation videos, slides, and speaker metadata. They also design four evaluation metrics to measure how well a video conveys the paper's information. On this foundation they propose PaperTalker, a multi-agent framework that combines slide generation and layout refinement with subtitling, speech synthesis, and talking-head rendering, and parallelizes slide-wise generation for efficiency. Experiments show that the presentation videos it produces are more faithful and informative than those of existing methods.
Key Takeaways
- Academic presentation videos have become an important medium for research communication, but producing them is highly labor-intensive.
- Paper2Video is a benchmark that pairs research papers with author-created presentation videos, slides, and speaker metadata.
- Four tailored evaluation metrics measure how well the videos convey the paper's information.
- The PaperTalker multi-agent framework covers the full pipeline for academic presentation video generation.
- The framework parallelizes slide-wise generation to improve efficiency (see the sketch after this list).
- Experiments show that the generated presentation videos are more faithful and informative.
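The slide-wise parallelism mentioned above can be pictured with a minimal sketch: if each slide's subtitle, speech, and cursor trace depend only on that slide, the per-slide work can run concurrently and be reassembled in order. `build_slide_assets` below is a hypothetical placeholder, not the authors' implementation.

```python
# Minimal sketch of slide-wise parallelism: each slide's subtitle, speech, and
# cursor trace depend only on that slide, so per-slide work can run concurrently.
# `build_slide_assets` is a hypothetical placeholder for that per-slide work.
from concurrent.futures import ThreadPoolExecutor

def build_slide_assets(slide_index: int, slide_text: str) -> dict:
    # Placeholder for subtitling, speech synthesis, and cursor grounding.
    return {"index": slide_index, "subtitle": slide_text[:80], "audio": None}

def generate_presentation(slides: list[str]) -> list[dict]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(build_slide_assets, i, text) for i, text in enumerate(slides)]
        # Collect in slide order so the final video assembles deterministically.
        return [f.result() for f in futures]

if __name__ == "__main__":
    demo_slides = ["Introduction and motivation.", "Method overview.", "Results and ablations."]
    for assets in generate_presentation(demo_slides):
        print(assets["index"], assets["subtitle"])
```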
Click to view paper screenshots







Unmasking Puppeteers: Leveraging Biometric Leakage to Disarm Impersonation in AI-based Videoconferencing
Authors:Danial Samadi Vahdati, Tai Duc Nguyen, Ekta Prashnani, Koki Nagano, David Luebke, Orazio Gallo, Matthew Stamm
AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim’s likeness in real time. Because every frame is synthetic, deepfake and synthetic video detectors fail outright. To address this security problem, we exploit a key observation: the pose-expression latent inherently contains biometric information of the driving identity. Therefore, we introduce the first biometric leakage defense without ever looking at the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues inside the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Our experiments on multiple talking-head generation models show that our method consistently outperforms existing puppeteering defenses, operates in real-time, and shows strong generalization to out-of-distribution scenarios.
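One way to read the "pose-conditioned, large-margin contrastive encoder" is as a margin-based embedding objective: latents driven by the same identity are pulled together regardless of pose and expression, while different identities are pushed at least a margin apart. The PyTorch sketch below illustrates that objective under these assumptions; the architecture, dimensions, and margin are illustrative, not the authors' design.

```python
# Hedged sketch of a margin-based contrastive objective on latent embeddings:
# same-identity pairs are pulled together, different-identity pairs are pushed
# at least `margin` apart. Illustration only, not the paper's model or recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentIdentityEncoder(nn.Module):
    """Maps a pose-expression latent to an identity embedding (toy dimensions)."""
    def __init__(self, latent_dim: int = 128, embed_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(z), dim=-1)  # unit-norm identity embeddings

def contrastive_margin_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                            same_identity: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """same_identity: 1.0 where the two latents share a driving identity, else 0.0."""
    dist = 1.0 - F.cosine_similarity(emb_a, emb_b)              # cosine distance per pair
    pos = same_identity * dist.pow(2)                            # pull matching identities together
    neg = (1.0 - same_identity) * F.relu(margin - dist).pow(2)   # push other identities apart
    return (pos + neg).mean()
```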
Paper and project links
Summary
AI-based talking-head videoconferencing systems reduce bandwidth by sending a compact pose-expression latent and re-synthesizing RGB at the receiver, but this latent can be puppeteered, letting an attacker hijack a victim's likeness in real time. Because every frame is synthetic, deepfake and synthetic-video detectors fail outright. To address this security problem, the authors exploit a key observation: the pose-expression latent inherently contains biometric information about the driving identity. They introduce the first biometric-leakage defense that never inspects the reconstructed RGB video: a pose-conditioned, large-margin contrastive encoder that isolates persistent identity cues in the transmitted latent while cancelling transient pose and expression. A simple cosine test on this disentangled embedding flags illicit identity swaps as the video is rendered. Experiments on multiple talking-head generation models show that the method consistently outperforms existing puppeteering defenses, runs in real time, and generalizes well to out-of-distribution scenarios.
Key Takeaways
- AI-based talking-head videoconferencing systems reduce bandwidth by sending pose-expression latent.
- The pose-expression latent can be puppeteered, allowing attackers to hijack victims’ likeness in real time.
- Because videos are synthetic, traditional deepfake and synthetic video detectors fail.
- The pose-expression latent contains biometric information of the driving identity.
- A pose-conditioned contrastive encoder is introduced to isolate identity cues in the transmitted latent and cancel pose and expression.
- A simple cosine test on the disentangled embedding can detect illicit identity swaps during video rendering (see the sketch below).
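A rough picture of the cosine test from the last takeaway: extract an identity embedding from the incoming latent stream, compare it against the embedding enrolled for the expected speaker, and flag a possible puppeteering swap when similarity drops below a threshold. The function names and the threshold below are illustrative assumptions, not the paper's settings.

```python
# Illustrative cosine check: compare the identity embedding extracted from the
# incoming stream against the enrolled reference; flag a possible puppeteering
# swap if similarity falls below a (hypothetical) threshold.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def flag_identity_swap(stream_embedding: np.ndarray,
                       enrolled_embedding: np.ndarray,
                       threshold: float = 0.7) -> bool:
    """Return True if the stream no longer matches the enrolled identity."""
    return cosine_similarity(stream_embedding, enrolled_embedding) < threshold

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    enrolled = rng.normal(size=64)
    genuine = enrolled + 0.05 * rng.normal(size=64)   # small drift: same person
    attacker = rng.normal(size=64)                    # unrelated identity
    print("genuine flagged:", flag_identity_swap(genuine, enrolled))    # expect False
    print("attacker flagged:", flag_identity_swap(attacker, enrolled))  # expect True
```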
Click to view paper screenshots


