Published: 2025-11-16

Updated: 2025-11-27

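⚠️ All summaries below were generated by a large language model and may contain errors. Treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings; they are intended only for initial screening before reading the papers.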

Updated 2025-11-16

MVU-Eval: Towards Multi-Video Understanding Evaluation for Multimodal LLMs

Authors: Tianhao Peng, Haochen Wang, Yuanxing Zhang, Zekun Wang, Zili Wang, Gavin Chang, Jian Yang, Shihao Li, Yanghai Wang, Xintao Wang, Houyi Li, Wei Ji, Pengfei Wan, Steven Huang, Zhaoxiang Zhang, Jiaheng Liu

The advent of Multimodal Large Language Models (MLLMs) has expanded AI capabilities to visual modalities, yet existing evaluation benchmarks remain limited to single-video understanding, overlooking the critical need for multi-video understanding in real-world scenarios (e.g., sports analytics and autonomous driving). To address this significant gap, we introduce MVU-Eval, the first comprehensive benchmark for evaluating Multi-Video Understanding for MLLMs. Specifically, our MVU-Eval mainly assesses eight core competencies through 1,824 meticulously curated question-answer pairs spanning 4,959 videos from diverse domains, addressing both fundamental perception tasks and high-order reasoning tasks. These capabilities are rigorously aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. Through extensive evaluation of state-of-the-art open-source and closed-source models, we reveal significant performance discrepancies and limitations in current MLLMs’ ability to perform understanding across multiple videos. The benchmark will be made publicly available to foster future research.

Paper and project links

PDF

Summary
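Multimodal large language models (MLLMs) have extended AI to the visual modality, but existing evaluation benchmarks remain limited to single-video understanding and overlook the need for multi-video understanding in real-world scenarios such as sports analytics and autonomous driving. To close this gap, the authors introduce MVU-Eval, which they describe as the first comprehensive benchmark for multi-video understanding in MLLMs. It assesses eight core competencies through 1,824 curated question-answer pairs spanning 4,959 videos, covering both fundamental perception tasks and high-order reasoning tasks. These competencies are aligned with real-world applications such as multi-sensor synthesis in autonomous systems and cross-angle sports analytics. An evaluation of state-of-the-art open-source and closed-source models reveals significant performance discrepancies and limitations in multi-video understanding. The benchmark will be released publicly to support future research.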

Key Takeaways

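MLLMs have been extended to the visual modality, but evaluation benchmarks are still limited to single-video understanding.
Real-world scenarios such as sports analytics and autonomous driving require multi-video understanding.
MVU-Eval is presented as the first comprehensive benchmark for evaluating multi-video understanding in MLLMs.
MVU-Eval evaluates eight core competencies, covering both fundamental perception tasks and high-order reasoning tasks.
The competencies are aligned with real-world applications such as multi-sensor synthesis and cross-angle analysis.
The evaluation shows significant performance discrepancies and limitations in current MLLMs on multi-video understanding.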
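
A benchmark organized around named competencies is usually scored per competency as well as overall. The sketch below shows one way such scoring could look. It is not MVU-Eval's evaluation code: the item fields (id, competency, videos, answer), the competency names, and the exact-match comparison are all assumptions made for illustration, since the digest does not describe the answer format.

```python
from collections import defaultdict


def accuracy_by_competency(items, predictions):
    """Return {competency: accuracy} using exact-match scoring.

    items: list of dicts with hypothetical keys 'id', 'competency', 'answer'.
    predictions: dict mapping item id -> predicted answer string.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item in items:
        total[item["competency"]] += 1
        # A missing prediction is counted as a wrong answer.
        if predictions.get(item["id"]) == item["answer"]:
            correct[item["competency"]] += 1
    return {name: correct[name] / total[name] for name in total}


# Made-up items: each question refers to several videos at once.
items = [
    {"id": "q1", "competency": "counting", "videos": ["a.mp4", "b.mp4"], "answer": "B"},
    {"id": "q2", "competency": "counting", "videos": ["c.mp4", "d.mp4"], "answer": "A"},
    {"id": "q3", "competency": "temporal reasoning",
     "videos": ["e.mp4", "f.mp4", "g.mp4"], "answer": "C"},
]
predictions = {"q1": "B", "q2": "D", "q3": "C"}

scores = accuracy_by_competency(items, predictions)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

Reporting one number per competency, rather than a single average, is what makes gaps between perception and reasoning tasks visible.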

TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding

Authors: Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, Hao Sun

Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision-language tasks, yet they still face challenges when processing long-duration video inputs. The limitation arises from MLLMs’ context limit and training costs, necessitating sparse frame sampling before feeding videos into MLLMs. However, building a trainable sampling method remains challenging due to the unsupervised and non-differentiable nature of sparse frame sampling in Video-MLLMs. To address these problems, we propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs’ long-form video-language understanding via reinforcement learning. Specifically, we first propose a trainable event-aware temporal agent, which captures event-query correlation for performing probabilistic keyframe selection. Then, we propose the TSPO reinforcement learning paradigm, which models keyframe selection and language generation as a joint decision-making process, enabling end-to-end group relative optimization for the temporal sampling policy. Furthermore, we propose a dual-style long video training data construction pipeline, balancing comprehensive temporal understanding and key segment localization. Finally, we incorporate rule-based answering accuracy and temporal locating reward mechanisms to optimize the temporal sampling policy. Comprehensive experiments show that our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs. Our code is available at https://github.com/Hui-design/TSPO

Paper and project links

PDF Accepted by AAAI 2026

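Summary
TSPO uses reinforcement learning to improve how multimodal large language models understand long videos. A trainable event-aware temporal agent performs probabilistic keyframe selection, and the TSPO reinforcement learning paradigm treats keyframe selection and language generation as a joint decision-making process. The authors also introduce a dual-style long-video training data construction pipeline and combine rule-based answer-accuracy and temporal-locating rewards to optimize the temporal sampling policy. Experiments show state-of-the-art results on multiple long video understanding benchmarks and transfer across different Video-MLLMs. The code is available at https://github.com/Hui-design/TSPO. If these results generalize, the approach could improve the accuracy and applicability of video understanding.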

Key Takeaways

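TSPO uses reinforcement learning to address the limits MLLMs face on long-video tasks, which stem from context length and training cost.
An event-aware temporal agent captures event-query correlation and performs probabilistic keyframe selection.
The TSPO paradigm models keyframe selection and language generation as a joint decision, enabling end-to-end group relative optimization of the temporal sampling policy.
A dual-style long-video training data pipeline balances comprehensive temporal understanding with key segment localization.
Rule-based answer-accuracy and temporal-locating rewards are combined to optimize the sampling policy.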
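
The points above can be made concrete with a small sketch: sample several keyframe sets from a probability distribution over frames, score each set with a reward that mixes answer correctness and temporal localization, and normalize the rewards within the group. This is not the authors' implementation (see their repository for that). The relevance scores, the reward weighting `loc_weight`, the mean/standard-deviation normalization, and the stand-in for "the MLLM answered correctly" are all assumptions for illustration.

```python
import math
import random
import statistics


def softmax(scores):
    """Turn per-frame relevance scores into selection probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def sample_keyframes(probs, k, rng):
    """Sample k distinct frame indices, each draw proportional to probs."""
    remaining = list(range(len(probs)))
    chosen = []
    for _ in range(k):
        weights = [probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return sorted(chosen)


def reward(frames, answer_correct, target_span, loc_weight=0.5):
    """Answer-accuracy reward plus a weighted temporal-locating reward.

    The locating term is the fraction of sampled frames inside target_span.
    loc_weight is a hypothetical knob, not a value from the paper.
    """
    start, end = target_span
    hit_rate = sum(start <= f <= end for f in frames) / len(frames)
    return float(answer_correct) + loc_weight * hit_rate


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group (zero mean, unit std)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
probs = softmax([0.1, 2.0, 0.3, 1.5, 0.2, 0.0])  # made-up relevance scores
group = [sample_keyframes(probs, k=3, rng=rng) for _ in range(4)]
# Stand-in for the MLLM: pretend the answer is right when frame 1 was sampled.
rewards = [reward(f, answer_correct=(1 in f), target_span=(1, 3)) for f in group]
advantages = group_relative_advantages(rewards)
```

In a real training loop, each advantage would weight the log-probability of the corresponding sampled frame set. That is how a non-differentiable sampling step can still receive a learning signal.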

Controllable Hybrid Captioner for Improved Long-form Video Understanding

Authors: Kuleen Sasse, Efsun Sarioglu Kayi, Arun Reddy

Video data, especially long-form video, is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent query-relevant content in a much more compact manner than raw video. In addition, textual representations are easily ingested by state-of-the-art large language models (LLMs), which enable reasoning over video content to answer complex natural language queries. To solve this issue, we rely on the progressive construction of a text-based memory by a video captioner operating on shorter chunks of the video, where spatio-temporal modeling is computationally feasible. We explore ways to improve the quality of the activity log comprised solely of short video captions. Because the video captions tend to be focused on human actions, and questions may pertain to other information in the scene, we seek to enrich the memory with static scene descriptions using Vision Language Models (VLMs). Our video understanding system relies on the LaViLa video captioner in combination with a LLM to answer questions about videos. We first explored different ways of partitioning the video into meaningful segments such that the textual descriptions more accurately reflect the structure of the video content. Furthermore, we incorporated static scene descriptions into the captioning pipeline using LLaVA VLM, resulting in a more detailed and complete caption log and expanding the space of questions that are answerable from the textual memory. Finally, we have successfully fine-tuned the LaViLa video captioner to produce both action and scene captions, significantly improving the efficiency of the captioning pipeline compared to using separate captioning models for the two tasks. Our model, controllable hybrid captioner, can alternate between different types of captions according to special input tokens that signals scene changes detected in the video.

Paper and project links

PDF

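Summary

Long-form video is extremely dense and high-dimensional, so the authors represent it as text: summaries are far more compact than raw video and are easily ingested by large language models (LLMs), which can then reason over the content to answer complex natural language queries. Their system builds a text-based memory progressively, with a video captioner running on short chunks of the video where spatio-temporal modeling is computationally feasible. Because captions in this activity log tend to focus on human actions while questions may concern other information in the scene, the memory is enriched with static scene descriptions from a Vision Language Model (VLM). The system combines the LaViLa video captioner with an LLM to answer questions about videos. The authors first explore different ways of partitioning the video into meaningful segments so that the text better reflects the structure of the content. They then add scene descriptions from the LLaVA VLM to the captioning pipeline, which yields a more detailed and complete caption log and expands the set of questions that can be answered from the textual memory. Finally, they fine-tune LaViLa to produce both action and scene captions, which makes the pipeline significantly more efficient than using separate captioning models for the two tasks. The resulting model, the controllable hybrid captioner, switches between caption types according to special input tokens that signal scene changes detected in the video.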
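
The pipeline described above can be sketched as a loop over video chunks in which one captioner is steered by a control token. This is only an illustration of the idea, not the paper's code: the token strings `<scene>` and `<action>`, the rule of emitting a scene caption whenever a scene change is detected, and the stub captioner and detector are all hypothetical.

```python
def build_caption_log(chunks, captioner, scene_changed):
    """Build a text memory from short video chunks with a single captioner.

    captioner(chunk, control_token) returns a caption string.
    scene_changed(previous_chunk, current_chunk) returns True on a scene change.
    """
    log = []
    for index, chunk in enumerate(chunks):
        # Describe the static scene at the start and after every detected change.
        if index == 0 or scene_changed(chunks[index - 1], chunk):
            log.append((index, "scene", captioner(chunk, "<scene>")))
        # Every chunk also gets an action caption.
        log.append((index, "action", captioner(chunk, "<action>")))
    return log


# Stubs standing in for the fine-tuned captioner and a scene-change detector.
def fake_captioner(chunk, control_token):
    if control_token == "<scene>":
        return f"a {chunk['place']}"
    return f"person {chunk['action']}"


def place_changed(previous, current):
    return previous["place"] != current["place"]


chunks = [
    {"place": "kitchen", "action": "chops onions"},
    {"place": "kitchen", "action": "stirs a pot"},
    {"place": "garden", "action": "waters plants"},
]
log = build_caption_log(chunks, fake_captioner, place_changed)
# The joined log is the textual memory an LLM would be prompted with.
memory = "\n".join(f"[{i}] {kind}: {text}" for i, kind, text in log)
print(memory)
```

Using one model with a control token, instead of two separate captioners, is the efficiency gain the abstract points to: each chunk is processed by a single network whichever caption type is requested.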

Key Insights