Published: 2025-02-07

Updated: 2025-05-14

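⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for preliminary screening before reading the papers.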

2025-02-07 Update

Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Authors: Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang

A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate the shots with each other to understand the story behind them. In this work, we present a new multi-shot video understanding benchmark, Shot2Story, with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments show that generating a long and comprehensive video summary for multi-shot videos remains challenging. Nevertheless, the generated imperfect summaries can already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Paper and project links

ICLR 2025. Extended annotation with 43K multi-shot videos in total. See https://mingfei.info/shot2story for updates and more information.

Summary

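Shot2Story is a new video understanding benchmark that provides detailed shot-level captions, comprehensive video summaries, and question-answer pairs. Its tasks (single-shot video captioning, multi-shot video summarization, and multi-shot video question answering) expose both the challenges and the potential of deeper semantic video understanding. Preliminary experiments show that generating long, comprehensive summaries for multi-shot videos remains challenging, yet the generated summaries already achieve competitive performance on existing video understanding tasks. The work promotes the under-explored setting of video understanding with detailed summaries.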

Key Takeaways

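Shot2Story is a new video understanding benchmark with multiple tasks aimed at improving semantic understanding of videos.
It provides detailed shot-level captions and comprehensive video summaries that help in following a video's story line.
The tasks are single-shot video captioning, multi-shot video summarization, and multi-shot video question answering.
Generating long, comprehensive summaries for multi-shot videos remains challenging.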
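
To make the three tasks concrete, here is a minimal Python sketch of how one annotated multi-shot video could be represented and turned into task samples. This is an illustration under stated assumptions, not the benchmark's actual release format: every class, field, and function name (`Shot`, `QAPair`, `MultiShotVideo`, `captioning_samples`, and so on) is hypothetical, and the example record is invented toy data.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


# All names below are hypothetical; the real Shot2Story schema may differ.
@dataclass
class Shot:
    start_sec: float
    end_sec: float
    visual_caption: str     # describes what is seen in the shot
    narration_caption: str  # describes the human narration; may be empty


@dataclass
class QAPair:
    question: str
    answer: str


@dataclass
class MultiShotVideo:
    video_id: str
    shots: List[Shot]
    summary: str
    qa_pairs: List[QAPair] = field(default_factory=list)


def captioning_samples(video: MultiShotVideo) -> List[Tuple[Tuple[float, float], str]]:
    """Single-shot captioning: one (time span, target caption) pair per shot."""
    return [((s.start_sec, s.end_sec), s.visual_caption) for s in video.shots]


def summarization_sample(video: MultiShotVideo) -> Tuple[str, str]:
    """Multi-shot summarization: the whole video maps to one summary."""
    return (video.video_id, video.summary)


def qa_samples(video: MultiShotVideo) -> List[Tuple[str, str, str]]:
    """Multi-shot QA: one (video, question, answer) triple per QA pair."""
    return [(video.video_id, qa.question, qa.answer) for qa in video.qa_pairs]


def summary_qa_prompt(video: MultiShotVideo, question: str) -> str:
    """Text-only QA prompt that uses the summary in place of the video."""
    return f"Video summary: {video.summary}\nQuestion: {question}\nAnswer:"


# Invented toy record, for illustration only.
example = MultiShotVideo(
    video_id="toy-0001",
    shots=[
        Shot(0.0, 4.2, "A woman mixes batter in a bowl.", "She lists the ingredients."),
        Shot(4.2, 9.0, "She slides a pan into the oven.", ""),
    ],
    summary="A woman prepares batter and then bakes a cake.",
    qa_pairs=[QAPair("What does the woman bake?", "A cake")],
)

if __name__ == "__main__":
    print(captioning_samples(example))
    print(summary_qa_prompt(example, example.qa_pairs[0].question))
```

The last helper mirrors the observation in the abstract that generated summaries can serve as input for video question answering: the prompt contains only text, so any text-only model could consume it.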

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-02-07/%E8%A7%86%E9%A2%91%E7%90%86%E8%A7%A3/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting.

Video Understanding
