⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so please use them with caution.
🔴 Note: do not rely on these summaries for serious academic work; they are intended only for a first pass before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-26
Alternating Perception-Reasoning for Hallucination-Resistant Video Understanding
Authors: Bowei Pu, Chuanbin Liu, Yifan Ge, Peichen Zhou, Yiwei Sun, Zhiyin Lu, Jiankang Wang, Hongtao Xie
Sufficient visual perception is the foundation of video reasoning. Nevertheless, existing Video Reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. This paradigm describes the video and then conducts reasoning, which runs the risk of insufficient evidence and emergent hallucinations. To address these issues, we introduce a new framework that integrates a loop-based paradigm with an anti-hallucination reward. First, to address the insufficient evidence, we introduce the Perception Loop Reasoning (PLR) paradigm. Instead of describing the video at once, each loop requires the model to describe a video segment with precise timestamps, analyze this segment, and decide the next action. Second, for the risk of hallucinations, the Factual-Aware Evaluator (FAE) evaluates each perception result as a reliable anti-hallucination reward. This reward encourages the model to provide sufficient and precise video evidence. Our FAE, which performs comparably to GPT-4o, is tuned on our AnetHallu-117K, a large-scale hallucination judgment preference dataset. Extensive experiments show that our Video-PLR achieves the state-of-the-art in both 3B and 7B parameter scales and has the best data efficiency. Our code, models, and datasets are released on: https://github.com/BoweiPu/VideoPLR.
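The abstract describes the perception loop only at a high level. As a minimal sketch (assuming a hypothetical model interface with `describe_segment`, `analyze`, `decide_next_action`, and `answer` methods, none of which are specified in the paper), a perceive-analyze-decide loop could look like this:

```python
# Hypothetical sketch of a perception-loop reasoning driver (not the paper's code).
# Each iteration perceives one timestamped segment, analyzes it, and decides whether
# to keep looping or to answer.

from dataclasses import dataclass


@dataclass
class LoopStep:
    start: float          # segment start time in seconds
    end: float            # segment end time in seconds
    description: str      # perception result for this segment
    analysis: str         # reasoning over the newly gathered evidence


def perception_loop(video, question, model, max_loops=8):
    """Run perceive -> analyze -> decide until the model chooses to answer."""
    steps = []
    for _ in range(max_loops):
        # 1. Perceive: describe one segment with explicit timestamps.
        start, end, description = model.describe_segment(video, question, steps)
        # 2. Reason: analyze the evidence collected so far.
        analysis = model.analyze(question, steps, description)
        steps.append(LoopStep(start, end, description, analysis))
        # 3. Decide: continue perceiving or stop and answer.
        action = model.decide_next_action(question, steps)  # "continue" or "answer"
        if action == "answer":
            break
    return model.answer(question, steps)
```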
Paper and project links
PDF 32 pages, 36 figures
Summary
Existing video reasoning LLMs suffer from perception shortcuts, relying on a flawed single-step perception paradigm. To address this, we propose a new framework that combines loop-based perception with an anti-hallucination reward. First, the Perception Loop Reasoning (PLR) paradigm has the model describe and analyze the video segment by segment with precise timestamps, avoiding the risk of insufficient evidence. Second, a Factual-Aware Evaluator (FAE) scores each perception result as an anti-hallucination reward. Trained on our hallucination-judgment preference dataset AnetHallu-117K, the FAE performs comparably to GPT-4o. Experiments show that Video-PLR achieves state-of-the-art results at both the 3B and 7B parameter scales with the best data efficiency. The code, models, and datasets are available at: https://github.com/BoweiPu/VideoPLR.
Key Insights
- Sufficient perception is the foundation of video understanding, yet existing video reasoning LLMs suffer from perception shortcuts.
- The Perception Loop Reasoning (PLR) paradigm addresses insufficient evidence by describing video segments with precise timestamps and analyzing them in depth.
- A Factual-Aware Evaluator (FAE) scores perception results to reduce hallucination risk and to encourage accurate, sufficient video evidence.
- The AnetHallu-117K hallucination-judgment preference dataset is built to provide the training foundation for the evaluator.
Click here to view paper screenshots
Test-Time Temporal Sampling for Efficient MLLM Video Understanding
Authors: Kaibin Wang, Mingbao Lin
Processing long videos with multimodal large language models (MLLMs) poses a significant computational challenge, as the model’s self-attention mechanism scales quadratically with the number of video tokens, resulting in high computational demand and slow inference speed. Current solutions, such as rule-based sub-sampling, learned frame selector, or memory-based summarization, often introduce their own trade-offs: they compromise accuracy, necessitate additional training, or decrease inference speed. In this paper, we propose Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference wrapper that enables MLLMs to process long videos both efficiently and effectively. T3S exploits spatiotemporal redundancy by generating multiple short and diverse subsequences of video tokens at inference time, packing them within a single forward pass, and aggregating their predictions. This multi-subsequence formulation broadens visual coverage while reducing the computational cost of self-attention from $O(L^2)$ to $O(\sum_{i=1}^m α_i^2L^2)$, where $\sum_{i=1}^m α_i^2 < 1$. Extensive experiments on long video understanding benchmarks demonstrate that T3S improves accuracy by up to 3.1% and reduces first token delay by $2.04\times$, all with minimal integration effort. Our approach operates entirely at inference time, requires no model modifications or fine-tuning, and is compatible with a wide range of pretrained MLLMs. T3S turns video redundancy into a computational advantage, offering a scalable solution for long-video understanding. The code is available at https://github.com/kaibinwang3/T3S.
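To make the cost claim concrete, the toy calculation below evaluates $\sum_{i=1}^m \alpha_i^2$ for a made-up split of the token sequence into three subsequences; only the formula itself comes from the abstract, the fractions are illustrative:

```python
# Toy arithmetic for the T3S cost claim: splitting L tokens into m subsequences of
# lengths alpha_i * L turns the O(L^2) attention cost into O(sum_i alpha_i^2 * L^2).

def attention_cost_ratio(alphas):
    """Return sum_i alpha_i^2, i.e. the multi-subsequence cost relative to L^2."""
    return sum(a * a for a in alphas)


# Example: three subsequences covering 40%, 30%, and 30% of the token sequence.
alphas = [0.4, 0.3, 0.3]
ratio = attention_cost_ratio(alphas)
print(f"relative self-attention cost: {ratio:.2f} of the full-sequence cost")
# -> 0.34, i.e. roughly a 3x reduction of the quadratic term for these made-up fractions.
```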
Paper and project links
Summary
Multimodal large language models (MLLMs) face a significant computational challenge when processing long videos, and existing solutions trade off accuracy, extra training, or inference speed. This paper proposes Test-Time Temporal Sampling (T3S), a training-free, plug-and-play inference wrapper that lets MLLMs process long videos efficiently and effectively. T3S exploits spatiotemporal redundancy by generating multiple short, diverse subsequences of video tokens, packing them into a single forward pass, and aggregating their predictions. This broadens visual coverage while reducing the self-attention cost from $O(L^2)$ to $O(\sum_{i=1}^m \alpha_i^2 L^2)$, where $\sum_{i=1}^m \alpha_i^2 < 1$. Experiments show that T3S improves accuracy by up to 3.1% and reduces first-token latency by $2.04\times$, with minimal integration effort. The method requires no model modification or fine-tuning, is compatible with a wide range of pretrained MLLMs, and turns video redundancy into a computational advantage, offering a scalable solution for long-video understanding.
Key Takeaways
- When processing long videos, MLLMs face a computational challenge: the self-attention cost grows quadratically with the number of video tokens.
- Existing solutions such as rule-based sub-sampling, learned frame selectors, or memory-based summarization each come with trade-offs.
- Test-Time Temporal Sampling (T3S) improves the efficiency and effectiveness of long-video processing without any training.
- T3S exploits spatiotemporal redundancy by generating multiple short, diverse subsequences of video tokens, broadening visual coverage while reducing computational cost.
- T3S improves accuracy by up to 3.1% and reduces first-token latency.
- T3S integrates easily, requires no model modification or fine-tuning, and is compatible with a wide range of pretrained MLLMs.
Click here to view paper screenshots
SciEducator: Scientific Video Understanding and Educating via Deming-Cycle Multi-Agent System
Authors: Zhiyu Xu, Weilong Yan, Yufei Shi, Xin Meng, Tao He, Huiping Zhuang, Ming Li, Hehe Fan
Recent advancements in multimodal large language models (MLLMs) and video agent systems have significantly improved general video understanding. However, when applied to scientific video understanding and educating, a domain that demands external professional knowledge integration and rigorous step-wise reasoning, existing approaches often struggle. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, our design reformulates its Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism, which facilitates the interpretation of intricate scientific activities in videos. Moreover, SciEducator can produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark consisting of 500 expert-verified and literature-grounded science QA pairs across five categories, covering physical, chemical, and everyday phenomena. Extensive experiments demonstrate that SciEducator substantially outperforms leading closed-source MLLMs (e.g., Gemini, GPT-4o) and state-of-the-art video agents on the benchmark, establishing a new paradigm for the community.
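The abstract only names the Plan-Do-Study-Act reformulation without giving details; the sketch below is a generic, hypothetical PDSA-style agent loop meant to make the idea concrete (the `planner`, `executor`, and `reviewer` roles and the stopping criterion are assumptions, not the paper's design):

```python
# Hypothetical Plan-Do-Study-Act loop for interpreting a scientific video (illustrative only).

def pdsa_loop(video, question, planner, executor, reviewer, max_cycles=5):
    """Iterate Plan -> Do -> Study -> Act until the reviewer accepts an answer."""
    feedback = None
    for _ in range(max_cycles):
        plan = planner.plan(question, feedback)            # Plan: decompose the task
        observations = executor.run(video, plan)           # Do: gather evidence / call tools
        critique = reviewer.study(question, observations)  # Study: check against external knowledge
        if critique.accepted:                              # Act: finish or revise with feedback
            return critique.answer
        feedback = critique.suggestions
    return reviewer.best_effort_answer(question)
```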
Paper and project links
Summary
Recent advances in multimodal large language models and video agent systems have substantially improved general video understanding. However, existing methods often fall short in scientific video understanding and education, a domain that requires integrating external professional knowledge and rigorous step-wise reasoning. To bridge this gap, we propose SciEducator, the first iterative self-evolving multi-agent system for scientific video comprehension and education. Rooted in the classical Deming Cycle from management science, the design reformulates the Plan-Do-Study-Act philosophy into a self-evolving reasoning and feedback mechanism that helps interpret intricate scientific activities in videos. SciEducator can also produce multimodal educational content tailored to specific scientific processes, including textual instructions, visual guides, audio narrations, and interactive references. To support evaluation, we construct SciVBench, a benchmark of 500 expert-verified, literature-grounded science QA pairs across five categories covering physical, chemical, and everyday phenomena. Experiments show that SciEducator substantially outperforms leading multimodal large language models and video agent systems on the benchmark, establishing a new paradigm for the community.
Key Takeaways
- Multimodal large language models and video agent systems have substantially improved general video understanding.
- Scientific video understanding and education require integrating external professional knowledge and rigorous step-wise reasoning, where existing methods fall short.
- SciEducator, a self-evolving multi-agent system based on the Deming Cycle, is proposed for scientific video understanding and education.
- SciEducator can generate multimodal educational content, including text, visuals, and audio.
- SciVBench, a benchmark of 500 science QA pairs, is constructed to evaluate scientific video understanding.
- SciEducator performs strongly on the benchmark, surpassing existing multimodal large language models and video agent systems.
Click here to view paper screenshots
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Authors: Zirui Zhu, Hailun Xu, Yang Luo, Yong Liu, Kanchan Sarkar, Zhenheng Yang, Yang You
Multimodal large language models (MLLMs) represent images and video frames as visual tokens. Scaling from single images to hour-long videos, however, inflates the token budget far beyond practical limits. Popular pipelines therefore either uniformly subsample or apply keyframe selection with retrieval-style scoring using smaller vision-language models. However, these keyframe selection methods still rely on pre-filtering before selection to reduce the inference cost and can miss the most informative moments. We propose FOCUS, Frame-Optimistic Confidence Upper-bound Selection, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: it treats short temporal clips as arms, and uses empirical means and Bernstein confidence radius to identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure reduces from a sequential policy with theoretical guarantees, first identifying high-value temporal regions, then selecting top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% gain in accuracy on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple and general solution for scalable long-video understanding with MLLMs. Code is available at https://github.com/NUS-HPC-AI-Lab/FOCUS.
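As a rough illustration of the bandit formulation (a sketch under assumed details: the per-clip relevance scores, the choice of an Audibert-style empirical Bernstein radius, and the region budget are not taken from the paper), clip scoring and region selection could look like this:

```python
# Hypothetical sketch of optimistic clip scoring with an empirical-Bernstein-style
# confidence radius (one common form of the bound; the paper's exact constants may differ).

import math
import statistics


def bernstein_radius(scores, delta=0.05, score_range=1.0):
    """Upper-confidence radius from empirical variance (Audibert-style empirical Bernstein bound)."""
    n = len(scores)
    variance = statistics.pvariance(scores) if n > 1 else score_range ** 2
    log_term = math.log(3.0 / delta)
    return math.sqrt(2.0 * variance * log_term / n) + 3.0 * score_range * log_term / n


def select_regions(clip_scores, num_regions):
    """Rank clips by optimistic upper bound (mean + radius) and keep the top regions."""
    upper_bounds = {
        clip_id: statistics.mean(scores) + bernstein_radius(scores)
        for clip_id, scores in clip_scores.items()
    }
    return sorted(upper_bounds, key=upper_bounds.get, reverse=True)[:num_regions]


# Example: per-clip relevance scores from a small vision-language scorer (made-up numbers).
clip_scores = {"clip_0": [0.2, 0.3, 0.25], "clip_1": [0.7, 0.6, 0.8], "clip_2": [0.4, 0.9]}
print(select_regions(clip_scores, num_regions=2))
```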
Paper and project links
Summary
The paper proposes FOCUS, a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget. FOCUS formulates keyframe selection as a combinatorial pure-exploration (CPE) problem in multi-armed bandits: short temporal clips are treated as arms, and empirical means with Bernstein confidence radii identify informative regions while preserving exploration of uncertain areas. The resulting two-stage exploration-exploitation procedure first identifies high-value temporal regions and then selects the top-scoring frames within each region. On two long-video question-answering benchmarks, FOCUS delivers substantial accuracy improvements while processing less than 2% of video frames. For videos longer than 20 minutes, it achieves an 11.9% accuracy gain on LongVideoBench, demonstrating its effectiveness as a keyframe selection method and providing a simple, general solution for scalable long-video understanding with MLLMs.
Key Findings
- MLLMs face prohibitively large visual-token budgets when scaling from single images to hour-long videos.
- Current keyframe selection methods still rely on pre-filtering before selection to reduce inference cost and can miss the most informative moments.
- FOCUS is introduced as a training-free, model-agnostic keyframe selection module that selects query-relevant frames under a strict token budget.
- FOCUS formulates keyframe selection as a combinatorial pure-exploration problem, using Bernstein confidence radii to identify informative regions while preserving exploration of uncertain areas.
- A two-stage exploration-exploitation procedure first identifies high-value temporal regions and then selects the top-scoring frames within each region.
- On long-video question-answering benchmarks, FOCUS substantially improves accuracy while processing less than 2% of video frames.
Click here to view paper screenshots
AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding
Authors: Zhucun Xue, Jiangning Zhang, Xurong Xie, Yuxuan Cai, Yong Liu, Xiangtai Li, Dacheng Tao
Multimodal Large Language Models (MLLMs) perform well in video understanding but degrade on long videos due to fixed-length context and weak long-term dependency modeling. Retrieval-Augmented Generation (RAG) can expand knowledge dynamically, yet existing video RAG schemes adopt fixed retrieval paradigms that ignore query difficulty. This uniform design causes redundant computation and latency for simple queries, while coarse retrieval for complex, multi-hop reasoning can miss key information. Such single-step retrieval severely limits the trade-off between efficiency and cognitive depth. We propose AdaVideoRAG, an adaptive RAG framework for long-video understanding. A lightweight intent classifier dynamically selects suitable retrieval schemes according to query complexity from the simplest to the most sophisticated. We design an Omni-Knowledge Indexing module that extracts and organizes multi-modal information into three databases: (1) a text base built from clip captions, ASR, and OCR; (2) a visual base; and (3) a knowledge graph for deep semantic understanding. This supports hierarchical knowledge access, from naive retrieval to graph-based retrieval, balancing resource cost and reasoning ability. To evaluate deep understanding, we further construct the HiVU benchmark. Experiments show that AdaVideoRAG significantly improves both efficiency and accuracy on long-video QA tasks and can be seamlessly plugged into existing MLLMs through lightweight APIs, establishing a new paradigm for adaptive retrieval-augmented video analysis.
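A minimal sketch of the adaptive-dispatch idea, assuming hypothetical `intent_classifier`, index, and knowledge-graph interfaces (the tier names and retrieval calls are placeholders, not AdaVideoRAG's actual API):

```python
# Hypothetical adaptive retrieval dispatcher (illustrative; not AdaVideoRAG's real interface).
# A lightweight intent classifier picks a retrieval path matched to query complexity.

def adaptive_retrieve(query, intent_classifier, text_index, visual_index, knowledge_graph):
    """Route the query to naive, multimodal, or graph-based retrieval by complexity."""
    tier = intent_classifier.classify(query)  # e.g. "simple", "moderate", "multi_hop"
    if tier == "simple":
        # Cheap path: lookup over the text base built from clip captions, ASR, and OCR.
        return text_index.search(query, top_k=5)
    if tier == "moderate":
        # Mid path: combine text hits with visually similar clips from the visual base.
        return text_index.search(query, top_k=5) + visual_index.search(query, top_k=5)
    # Expensive path: traverse the knowledge graph for multi-hop reasoning,
    # then pull the supporting text and visual evidence.
    entities = knowledge_graph.link_entities(query)
    subgraph = knowledge_graph.expand(entities, hops=2)
    return subgraph.collect_evidence(text_index, visual_index)
```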
Paper and project links
PDF NeurIPS 2025
Summary
Multimodal large language models perform well on video understanding but degrade on long videos due to fixed-length context and weak long-term dependency modeling. Although Retrieval-Augmented Generation can expand knowledge dynamically, existing video RAG schemes adopt fixed retrieval paradigms that ignore query complexity: simple queries incur redundant computation and latency, while coarse retrieval for complex multi-hop reasoning can miss key information. To address this, the paper proposes AdaVideoRAG, an adaptive RAG framework for long-video understanding. A lightweight intent classifier dynamically selects a suitable retrieval scheme according to query complexity, and an Omni-Knowledge Indexing module extracts and organizes multimodal information into a text base (built from clip captions, ASR, and OCR), a visual base, and a knowledge graph, supporting hierarchical knowledge access from naive retrieval to graph-based retrieval. Experiments show that AdaVideoRAG significantly improves both efficiency and accuracy on long-video QA tasks and can be integrated into existing MLLMs through lightweight APIs.
Key Takeaways
- MLLMs processing long videos are limited by fixed-length context and weak long-term dependency modeling.
- Existing video RAG schemes adopt fixed retrieval paradigms that ignore query difficulty, leading to uneven use of computational resources.
- AdaVideoRAG dynamically selects a retrieval scheme according to query complexity, improving both efficiency and accuracy.
- The Omni-Knowledge Indexing module supports multi-level knowledge access, from naive retrieval to graph-based retrieval.
- AdaVideoRAG balances resource cost and reasoning ability through a lightweight intent classifier.
- The HiVU benchmark is constructed to evaluate deep understanding.