Published: 2025-09-30

Updated: 2025-11-27

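⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for preliminary screening before reading the papers.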

2025-09-30 Update

VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding

Authors: Abdul Waheed, Zhen Wu, Dareen Alharthi, Seungone Kim, Bhiksha Raj

Precisely evaluating video understanding models remains challenging: commonly used metrics such as BLEU, ROUGE, and BERTScore fail to capture the fineness of human judgment, while obtaining such judgments through manual evaluation is costly. Recent work has explored using large language models (LLMs) or multimodal LLMs (MLLMs) as evaluators, but their extension to video understanding remains relatively unexplored. In this work, we introduce VideoJudge, an MLLM judge at 3B and 7B scales specialized to evaluate outputs from video understanding models (i.e., text responses conditioned on videos). To train VideoJudge, our recipe builds on the interplay between a generator and an evaluator: the generator is prompted to produce responses conditioned on a target rating, and responses not matching the evaluator's rating are discarded. Across three out of four meta-evaluation benchmarks, VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B). Notably, we find that LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL) and that long chain-of-thought reasoning does not improve performance, indicating that providing video inputs is crucial for evaluation of video understanding tasks.

Paper and project links

PDF Work in progress

Summary

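This paper addresses the difficulty of evaluating video understanding models: common metrics such as BLEU, ROUGE, and BERTScore fail to capture the nuances of human judgment, and manual evaluation is costly. In response, it proposes VideoJudge, a multimodal large language model (MLLM) judge at 3B and 7B scales, specialized to evaluate the outputs of video understanding models. VideoJudge is trained through the interplay of a generator and an evaluator, and VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B) on three of four meta-evaluation benchmarks. The study also finds that text-only LLM judges perform worse than MLLM judges and that long chain-of-thought reasoning does not improve performance, indicating that video input is crucial when evaluating video understanding tasks.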

Key Takeaways

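Evaluating video understanding models is difficult; existing metrics do not accurately reflect human judgment.
VideoJudge is a multimodal large language model (MLLM) judge specialized to evaluate the outputs of video understanding models.
VideoJudge is trained through the interplay of a generator and an evaluator and performs strongly on meta-evaluation benchmarks.
VideoJudge-7B outperforms larger MLLM judge baselines such as Qwen2.5-VL (32B and 72B) on three of four benchmarks.
Text-only LLM judges (Qwen3) perform worse than MLLM judges (Qwen2.5-VL).
Long chain-of-thought reasoning does not improve judging performance on video understanding tasks.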
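
The generator/evaluator filtering step described in the VideoJudge abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the 1 to 5 rating scale, the `Example` record, and the `toy_generate` / `toy_evaluate` stand-ins are all hypothetical, and a real pipeline would call MLLMs where the toy functions are used here. The sketch only shows the control flow: ask for a response at a target rating, score it, and keep it only if the two ratings agree.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Example:
    """One kept training example (hypothetical record layout)."""
    video_id: str
    question: str
    response: str
    rating: int


def bootstrap_examples(
    items: Iterable[Tuple[str, str]],
    generate: Callable[[str, str, int], str],
    evaluate: Callable[[str, str, str], int],
    ratings: Iterable[int] = (1, 2, 3, 4, 5),  # assumed scale, not from the paper
) -> List[Example]:
    """Keep only responses whose evaluator rating matches the target rating."""
    ratings = tuple(ratings)
    kept: List[Example] = []
    for video_id, question in items:
        for target in ratings:
            # Generator is asked for a response of a given target quality.
            response = generate(video_id, question, target)
            # Evaluator independently rates that response.
            predicted = evaluate(video_id, question, response)
            # Disagreement means the pair is discarded.
            if predicted == target:
                kept.append(Example(video_id, question, response, target))
    return kept


# Toy stand-ins for the two models (assumptions for illustration only).
def toy_generate(video_id: str, question: str, target: int) -> str:
    return f"answer to '{question}' at quality {target}"


def toy_evaluate(video_id: str, question: str, response: str) -> int:
    quality = int(response.rsplit(" ", 1)[-1])
    # Pretend the evaluator over-rates mid-quality answers.
    return 4 if quality == 3 else quality


if __name__ == "__main__":
    data = bootstrap_examples([("vid0", "What happens?")], toy_generate, toy_evaluate)
    for example in data:
        print(example.rating, example.response)
```

With the toy evaluator, the response generated for target rating 3 is scored 4 and therefore dropped, while the other four survive.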

Think With Videos For Agentic Long-Video Understanding

Authors: Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of "thinking with video", which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer's significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).

Paper and project links

PDF

Summary
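Handling long videos is one of the open challenges in video understanding. Existing methods either downsample frames and lose fine-grained detail, or rely on textual reasoning over task-agnostic representations, which limits task-specific perception and exploration. This paper proposes VideoExplorer, a framework built on the principle of "thinking with video" that weaves planning, temporal grounding, and scalable perception into one coherent reasoning process. VideoExplorer iteratively formulates sub-questions, locates the relevant moments, and performs task-oriented, temporally scalable video understanding until it reaches a final answer, which makes its reasoning faithful, efficient, and interpretable. To address the scarcity of training resources for long-video understanding, the authors construct a long-video reasoning dataset using difficulty-adaptive sampling and design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, which encourages adaptive temporal grounding and iterative information integration. Extensive evaluations on popular long-video understanding and reasoning benchmarks show that VideoExplorer has a significant advantage over existing baselines, reflecting its robustness, adaptability, and efficiency.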
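
The iterative loop summarized above (formulate a sub-question, locate the relevant moment, perceive that segment, repeat until an answer is reached) can be sketched as follows. This is a toy illustration, not the code in the VideoDeepResearch repository: `Plan`, `Step`, `explore`, the `max_steps` cap, and the three toy callables are hypothetical names introduced here, and in a real system each callable would be backed by a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec) of a video segment


@dataclass
class Step:
    """One round of the loop: what was asked, where we looked, what we saw."""
    sub_question: str
    span: Span
    observation: str


@dataclass
class Plan:
    """Planner output: either a next sub-question or a final answer."""
    sub_question: Optional[str] = None
    answer: Optional[str] = None


def explore(
    question: str,
    plan: Callable[[str, List[Step]], Plan],
    ground: Callable[[str], Span],
    perceive: Callable[[Span, str], str],
    max_steps: int = 8,  # assumed safety cap, not from the paper
) -> Tuple[Optional[str], List[Step]]:
    """Run plan -> ground -> perceive until the planner returns an answer."""
    history: List[Step] = []
    for _ in range(max_steps):
        decision = plan(question, history)
        if decision.answer is not None:
            return decision.answer, history
        span = ground(decision.sub_question)                # temporal grounding
        observation = perceive(span, decision.sub_question)  # look at that segment
        history.append(Step(decision.sub_question, span, observation))
    return None, history  # step budget exhausted without an answer


# Toy stand-ins (assumptions for illustration only).
def toy_plan(question: str, history: List[Step]) -> Plan:
    if not history:
        return Plan(sub_question="When does the goal happen?")
    return Plan(answer=history[-1].observation)


def toy_ground(sub_question: str) -> Span:
    return (120.0, 130.0)


def toy_perceive(span: Span, sub_question: str) -> str:
    return f"player in red scores between {span[0]:.0f}s and {span[1]:.0f}s"


if __name__ == "__main__":
    answer, trace = explore("Who scores?", toy_plan, toy_ground, toy_perceive)
    print(answer)
    print(trace)
```

Returning the step history alongside the answer is one way to keep the reasoning trace inspectable, in the spirit of the interpretability claim; the paper's own trajectory format may differ.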

Key Takeaways