Published: 2025-10-10

Updated: 2025-11-27

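⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only as a first-pass screen before reading the papers.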

Updated 2025-10-10

Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow

Authors:Ruyang Liu, Shangkun Sun, Haoran Tang, Ge Li, Wei Gao

Long-form video understanding has always been a challenging problem due to the significant redundancy in both temporal and spatial contents. This challenge is further exacerbated by the limited context length of Multimodal Large Language Models (MLLMs). To address this issue, many previous works have attempted to extract key video information, where the “key” is typically semantic-aware and heavily dependent on the CLIP model as prior. In this paper, we propose Flow4Agent, a novel framework that pioneeringly incorporates motion priors from optical flow to facilitate LLM-based long video understanding. Flow4Agent mitigates the redundancy in long videos at both temporal and spatial levels through two core modules: Temporal Granularity Optimization (TGO) adaptively refines frame-level hierarchies, which first leverages coarse flow priors to group similar visual contents and then applies semantic priors to filter out highly irrelevant scene information. Motion Token Pruning (MTP) further refines the intra-frame visual representations, pruning high-redundancy video tokens using fine-grained optical flow information. Extensive experiments demonstrate that our Flow4Agent outperforms existing methods across a wide range of video MLLM benchmarks, especially for hour-level video understanding tasks, achieving 64.7% on Video-MME, 71.4% on MLVU and 60.4% on LongVideoBench.

Paper and project links

PDF Accepted to ICCV 2025

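Summary

This paper proposes Flow4Agent, a framework that uses motion priors from optical flow to reduce redundancy in LLM-based long-form video understanding. Its two core modules, Temporal Granularity Optimization (TGO) and Motion Token Pruning (MTP), reduce redundancy at the temporal and spatial levels respectively, improving long-video understanding. Experiments show that Flow4Agent performs strongly on multiple video MLLM benchmarks, particularly on hour-level video understanding tasks.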

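Key Takeaways

Flow4Agent introduces motion priors from optical flow to address the challenges of LLM-based long-form video understanding.
The framework has two core modules: Temporal Granularity Optimization (TGO) and Motion Token Pruning (MTP).
TGO adaptively refines frame-level hierarchies: it first uses coarse flow priors to group similar visual content, then applies semantic priors to filter out highly irrelevant scenes.
MTP further refines intra-frame visual representations by using fine-grained optical flow to prune highly redundant video tokens.
Experiments show that Flow4Agent outperforms existing methods on multiple video MLLM benchmarks.
Flow4Agent is particularly strong on hour-level video understanding, reaching 64.7% on Video-MME, 71.4% on MLVU, and 60.4% on LongVideoBench.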
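
The two ideas above (grouping frames by coarse motion, then pruning tokens by fine-grained motion) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the threshold, the keep ratio, and the assumption that low-motion tokens are the redundant ones are all hypothetical, and the flow magnitudes are assumed to be precomputed by some optical-flow estimator.

```python
from typing import List, Sequence


def group_frames_by_motion(mean_flow: Sequence[float], threshold: float) -> List[List[int]]:
    """Group consecutive frame indices into segments of low inter-frame motion.

    mean_flow[i] is an assumed mean flow magnitude between frame i-1 and
    frame i; mean_flow[0] is ignored. A value above `threshold` starts a
    new segment.
    """
    if not mean_flow:
        return []
    groups = [[0]]
    for i in range(1, len(mean_flow)):
        if mean_flow[i] > threshold:
            groups.append([i])       # large motion: start a new segment
        else:
            groups[-1].append(i)     # small motion: same segment as before
    return groups


def prune_tokens_by_motion(token_flow: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the sorted indices of the tokens with the largest flow magnitude.

    token_flow[j] is an assumed per-token (per-patch) flow magnitude.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(token_flow) * keep_ratio))
    ranked = sorted(range(len(token_flow)), key=lambda j: token_flow[j], reverse=True)
    return sorted(ranked[:k])
```

For example, `group_frames_by_motion([0.0, 0.1, 0.2, 3.0, 0.1, 2.5], 1.0)` splits six frames into three segments, and `prune_tokens_by_motion([0.1, 2.0, 0.0, 1.5], 0.5)` keeps the two tokens with the most motion. The semantic filtering step that TGO applies after grouping is not covered here.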

Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs

Authors:Sameep Vani, Shreyas Jena, Maitreya Patel, Chitta Baral, Somak Aditya, Yezhou Yang

While Video Large Language Models (Video-LLMs) have demonstrated remarkable performance across general video understanding benchmarks, particularly in video captioning and descriptive tasks, they consistently underperform on tasks that require fine-grained temporal understanding. This limitation arises due to the lack of visual complexity and temporal nuance in current fine-tuning datasets, leading these models to rely heavily on language-based reasoning rather than truly understanding video dynamics. In this work, we propose TimeWarp, a systematic method to create a targeted synthetic temporal dataset to fine-tune the model’s responses to encourage it to focus on the given input video. We introduce a large-scale preference dataset, created using TimeWarp, that captures intricate temporal dynamics often overlooked, grounding the model’s responses to visual and temporal information. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks, highlighting the effectiveness of our proposed datasets in advancing temporal understanding in Video-LLMs, resulting in an absolute improvement in performance across seven benchmarks. Code is available at https://github.com/sameepv21/timewarp.

Paper and project links

PDF 17 pages, 9 figures, 6 tables. Presents TimeWarp, a synthetic preference data framework to improve temporal understanding in Video-LLMs, showing consistent gains across seven benchmarks. Includes supplementary material in the Appendix

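Summary

To address the weakness of Video-LLMs on fine-grained temporal understanding tasks, the paper proposes TimeWarp, a systematic method for creating a targeted synthetic temporal dataset that is used to fine-tune a model's responses so that it attends to the given input video. A large-scale preference dataset built with TimeWarp captures fine-grained temporal dynamics that are often overlooked, grounding the model's responses in visual and temporal information. Applied to existing models, the method significantly improves performance on temporal understanding benchmarks, with absolute gains across seven benchmarks. The code is available on GitHub.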

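Key Takeaways

Video-LLMs underperform on video tasks that require fine-grained temporal understanding.
Current fine-tuning datasets lack visual complexity and temporal nuance, so models rely heavily on language-based reasoning rather than on the video's actual dynamics.
TimeWarp is a systematic method for creating a synthetic temporal dataset that improves how a model responds to temporal dynamics in video.
The large-scale preference dataset built with TimeWarp captures fine-grained temporal dynamics, grounding model responses in visual and temporal information.
Applied to existing models, the method significantly improves performance on temporal understanding benchmarks.
The proposed dataset is effective at improving temporal understanding in Video-LLMs, with absolute gains across seven benchmarks.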
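
To make the notion of a temporal preference pair concrete, here is a generic sketch. It does not reproduce TimeWarp's actual pipeline, which the abstract does not detail. The `PreferencePair` record, the `make_reversal_pair` helper, the prompt text, and the choice of frame reversal as the temporal perturbation are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """Hypothetical record for one preference-tuning example."""
    frames: List[str]   # frame identifiers in the order shown to the model
    prompt: str
    chosen: str         # response consistent with the frames as shown
    rejected: str       # response that ignores the temporal change


def make_reversal_pair(frames: List[str], forward_caption: str,
                       reversed_caption: str) -> PreferencePair:
    """Show the clip reversed and prefer the caption describing the reversed order."""
    return PreferencePair(
        frames=list(reversed(frames)),   # assumed perturbation: play the clip backwards
        prompt="Describe what happens in the video, in order.",
        chosen=reversed_caption,
        rejected=forward_caption,
    )
```

A pair built this way rewards the response that matches what the video actually shows and penalizes the one a language prior alone would favor, which is the kind of visual grounding the summary describes.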

Empowering LLMs with Pseudo-Untrimmed Videos for Audio-Visual Temporal Understanding

Authors:Yolo Yunlong Tang, Daiki Shimada, Jing Bi, Mingqian Feng, Hang Hua, Chenliang Xu

Large language models (LLMs) have demonstrated remarkable capabilities in natural language and multimodal domains. By fine-tuning multimodal LLMs with temporal annotations from well-annotated datasets, e.g., dense video captioning datasets, their temporal understanding capacity in video-language tasks can be obtained. However, there is a notable lack of untrimmed audio-visual video datasets with precise temporal annotations for events. This deficiency hinders LLMs from learning the alignment between time, audio-visual events, and text tokens, thus impairing their ability to temporally localize audio-visual events in videos. To address this gap, we introduce PU-VALOR, a comprehensive audio-visual dataset comprising over 114,000 pseudo-untrimmed videos with detailed temporal annotations. PU-VALOR is derived from the large-scale but coarse-annotated audio-visual dataset VALOR, through a subtle method involving event-based video clustering, random temporal scaling, and permutation. By fine-tuning a multimodal LLM on PU-VALOR, we developed AVicuna, a model capable of aligning audio-visual events with temporal intervals and corresponding text tokens. AVicuna excels in temporal localization and time-aware dialogue capabilities. Our experiments demonstrate that AVicuna effectively handles temporal understanding in audio-visual videos and achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.

Paper and project links

PDF Accepted to AAAI 2025

Summary

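Large language models (LLMs) have shown strong capabilities in multimodal domains. Fine-tuning a multimodal LLM on well-annotated datasets with temporal annotations, such as dense video captioning datasets, gives it temporal understanding in video-language tasks. However, the lack of untrimmed audio-visual datasets with precise temporal annotations keeps LLMs from learning the alignment between time, audio-visual events, and text tokens. To address this, the authors introduce PU-VALOR, a comprehensive audio-visual dataset of over 114,000 pseudo-untrimmed videos with detailed temporal annotations. It is derived from the large-scale but coarsely annotated VALOR dataset through event-based video clustering, random temporal scaling, and permutation. Fine-tuning a multimodal LLM on PU-VALOR yields AVicuna, a model that aligns audio-visual events with temporal intervals and the corresponding text tokens. AVicuna performs well at temporal localization and time-aware dialogue. Experiments show that it achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.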

Key Takeaways

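Large language models have strong multimodal capabilities, and fine-tuning can give them temporal understanding in video-language tasks.
The lack of untrimmed audio-visual datasets with precise temporal annotations is a major factor limiting what LLMs can learn.
PU-VALOR is a comprehensive audio-visual dataset of pseudo-untrimmed videos with detailed temporal annotations, derived from the VALOR dataset through event-based clustering, random temporal scaling, and permutation.
AVicuna is obtained by fine-tuning a multimodal LLM on PU-VALOR and aligns audio-visual events with temporal intervals and the corresponding text tokens.
AVicuna is strong at temporal localization and time-aware dialogue.
AVicuna achieves state-of-the-art performance on open-ended video QA, audio-visual QA, and audio-visual event dense localization tasks.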
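
The construction described above (permuting trimmed clips, rescaling them in time, and recording where each event lands) can be sketched as follows. This is an illustration under assumptions, not the PU-VALOR pipeline itself. The clip representation, the `build_pseudo_untrimmed` name, and the scaling range are hypothetical, and the event-based clustering step that selects which clips to combine is omitted.

```python
import random
from typing import List, Sequence, Tuple

Clip = Tuple[str, float]               # (caption, duration in seconds)
Annotation = Tuple[str, float, float]  # (caption, start, end) in the long video


def build_pseudo_untrimmed(clips: Sequence[Clip], rng: random.Random,
                           scale_range: Tuple[float, float] = (0.5, 2.0)) -> List[Annotation]:
    """Concatenate clips in random order with random temporal scaling.

    Returns one (caption, start, end) annotation per clip, giving the
    interval that clip occupies in the resulting pseudo-untrimmed video.
    """
    order = list(clips)
    rng.shuffle(order)                                   # permutation
    annotations: List[Annotation] = []
    t = 0.0
    for caption, duration in order:
        scaled = duration * rng.uniform(*scale_range)    # random temporal scaling
        annotations.append((caption, t, t + scaled))
        t += scaled
    return annotations
```

Because each clip already carries a caption, the start and end times fall out of the concatenation for free, which is how coarse clip-level labels can become precise temporal annotations.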

Video Understanding with Large Language Models: A Survey

Authors:Yolo Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu

With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

Paper and project links

PDF Accepted to IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

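Summary

The rapid growth of online video platforms and of video content has markedly increased the demand for capable video understanding tools. Given the strong capabilities of large language models (LLMs) in multimodal tasks, this survey gives a detailed overview of recent advances in video understanding that build on LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, especially open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, which suggests a promising path for future video understanding. The survey examines the characteristics and capabilities of Vid-LLMs and groups the approaches into three types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Based on the role the LLM plays, it also identifies five sub-types: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. It further reviews the tasks, datasets, benchmarks, and evaluation methods for Vid-LLMs, and explores their applications across domains, highlighting their scalability and versatility on real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. More information is available at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.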

Key Takeaways