⚠️ All of the summaries below are generated by a large language model and may contain errors. They are for reference only; use with caution.
🔴 Note: never rely on these summaries in serious academic settings; use them only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-18
Shot2Tactic-Caption: Multi-Scale Captioning of Badminton Videos for Tactical Understanding
Authors:Ning Ding, Keisuke Fujii, Toru Tamaki
Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are dynamically executed over time. In this paper, we propose Shot2Tactic-Caption, a novel framework for semantic and temporal multi-scale video captioning in badminton, capable of generating shot-level captions that describe individual actions and tactic-level captions that capture how these actions unfold over time within a tactical execution. We also introduce the Shot2Tactic-Caption Dataset, the first badminton captioning dataset containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design, with both branches including a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder to generate shot and tactic captions. To support tactic captioning, we additionally introduce a Tactic Unit Detector that identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). For tactic captioning, we further incorporate a shot-wise prompt-guided mechanism, where the predicted tactic type and state are embedded as prompts and injected into the decoder via cross-attention. The shot-wise prompt-guided mechanism enables our system not only to describe successfully executed tactics but also to capture tactical executions that are temporarily interrupted and later resumed. Experimental results demonstrate the effectiveness of our framework in generating both shot and tactic captions. Ablation studies show that the ResNet50-based spatio-temporal encoder outperforms other variants, and that shot-wise prompt structuring leads to more coherent and accurate tactic captioning.
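As a rough illustration of the shot-wise prompt-guided mechanism described above, the sketch below embeds a predicted tactic type and state as two prompt tokens and appends them to the visual memory that a Transformer caption decoder cross-attends to. This is a minimal reading of the abstract, not the authors' implementation; all module names, vocabulary and embedding sizes are assumptions.

```python
import torch
import torch.nn as nn

class PromptGuidedCaptionDecoder(nn.Module):
    def __init__(self, vocab_size=8000, d_model=512, n_tactic_types=6, n_states=3):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.type_emb = nn.Embedding(n_tactic_types, d_model)   # tactic type id -> prompt vector
        self.state_emb = nn.Embedding(n_states, d_model)        # e.g. Execute / Interrupt / Resume
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, caption_tokens, visual_feats, tactic_type, tactic_state):
        # visual_feats: (B, T, d_model) from the spatio-temporal encoder
        prompts = torch.stack([self.type_emb(tactic_type),
                               self.state_emb(tactic_state)], dim=1)   # (B, 2, d_model) prompt tokens
        memory = torch.cat([visual_feats, prompts], dim=1)             # decoder cross-attends to both
        tgt = self.token_emb(caption_tokens)                           # (B, L, d_model)
        L = tgt.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)                                    # next-token logits

model = PromptGuidedCaptionDecoder()
logits = model(torch.randint(0, 8000, (2, 12)),   # caption prefix tokens
               torch.randn(2, 16, 512),           # 16 visual tokens per clip
               torch.tensor([1, 4]),              # predicted tactic type ids
               torch.tensor([0, 2]))              # predicted tactic state ids
print(logits.shape)                               # torch.Size([2, 12, 8000])
```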
Paper and project links
PDF 9 pages, 3 figures. Accepted to ACM MMSports 2025
Summary
Tactical understanding in badminton involves interpreting not only individual actions but also how tactics are executed dynamically over time. To this end, the paper proposes Shot2Tactic-Caption, a framework that generates shot-level captions describing individual actions and tactic-level captions capturing how those actions unfold within a tactical execution. It also introduces the Shot2Tactic-Caption Dataset, the first badminton captioning dataset, containing 5,494 shot captions and 544 tactic captions. Shot2Tactic-Caption adopts a dual-branch design in which each branch combines a visual encoder, a spatio-temporal Transformer encoder, and a Transformer-based decoder. To support tactic captioning, a Tactic Unit Detector identifies valid tactic units, tactic types, and tactic states (e.g., Interrupt, Resume). Experiments show the framework is effective at generating tactic captions, and ablations show that the ResNet50-based spatio-temporal encoder outperforms other variants and that shot-wise prompt structuring yields more coherent and accurate tactic captions.
Key Takeaways
- The Shot2Tactic-Caption framework generates shot-level captions for individual actions in badminton matches and tactic-level captions for tactical executions.
- The Shot2Tactic-Caption Dataset, a captioning dataset for badminton videos, is introduced.
- Shot2Tactic-Caption uses a dual-branch design to process the video content.
- A Tactic Unit Detector is introduced to support tactic caption generation.
- A shot-wise prompt-guided mechanism is incorporated into tactic captioning.
- Experimental results demonstrate the effectiveness of the framework.





Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
Authors:Xiaoqian Shen, Wenxuan Zhang, Jun Chen, Mohamed Elhoseiny
Understanding and reasoning over long videos pose significant challenges for large video language models (LVLMs) due to the difficulty in processing intensive video tokens beyond context window and retaining long-term sequential information. Retrieval-Augmented Generation (RAG) has demonstrated effectiveness in processing long context for Large Language Models (LLMs); however, applying RAG to long video faces challenges such as disrupted temporal dependencies and inclusion of irrelevant information that can hinder accurate reasoning. To address these limitations, we propose Vgent, a novel graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding. Our approach introduces two key innovations: (i) It represents videos by structured graphs with semantic relationships across video clips preserved to improve retrieval effectiveness. (ii) It introduces an intermediate reasoning step to mitigate the reasoning limitation of LVLMs, which leverages structured verification to reduce retrieval noise and facilitate the explicit aggregation of relevant information across clips, resulting in more accurate and context-aware responses. We comprehensively evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks. Our approach yielded an overall performance improvement of 3.0%-5.4% over base models on MLVU, and outperformed state-of-the-art video RAG methods by 8.6%. Our code is publicly available at https://xiaoqian-shen.github.io/Vgent.
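To make the clip-graph retrieval idea concrete, here is a small sketch under my own assumptions (it is not Vgent's code): each clip embedding becomes a graph node, edges connect semantically similar clips, and retrieval returns the top-scoring clips plus their graph neighbours before any reasoning step, so temporally and semantically related context is not dropped.

```python
import torch
import torch.nn.functional as F

def build_clip_graph(clip_embs: torch.Tensor, sim_threshold: float = 0.6) -> torch.Tensor:
    """clip_embs: (N, D), one embedding per clip. Returns a boolean (N, N) adjacency matrix."""
    sims = F.cosine_similarity(clip_embs.unsqueeze(1), clip_embs.unsqueeze(0), dim=-1)
    adj = sims > sim_threshold
    return adj & ~torch.eye(clip_embs.size(0), dtype=torch.bool)   # drop self-loops

def retrieve_with_neighbours(query_emb, clip_embs, adj, top_k=3):
    scores = F.cosine_similarity(query_emb.unsqueeze(0), clip_embs, dim=-1)
    seeds = scores.topk(min(top_k, len(scores))).indices.tolist()
    keep = set(seeds)
    for s in seeds:                                                # expand along semantic edges so
        keep.update(torch.nonzero(adj[s]).flatten().tolist())      # related context clips come along
    return sorted(keep)                                            # clip indices for the reasoning step

clips = torch.randn(20, 256)          # toy clip embeddings
adj = build_clip_graph(clips)
print(retrieve_with_neighbours(torch.randn(256), clips, adj))
```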
Paper and project links
PDF NeurIPS 2025 (Spotlight). Webpage at https://xiaoqian-shen.github.io/Vgent
Summary
This paper addresses the challenges of long-video understanding and proposes Vgent, a graph-based retrieval-reasoning-augmented generation framework that enhances large video language models (LVLMs). Vgent represents videos as structured graphs and introduces an intermediate reasoning step to mitigate the reasoning limitations of LVLMs; structured verification reduces retrieval noise and enables explicit aggregation of relevant information across clips, yielding more accurate and context-aware responses. Evaluated on three long-video understanding benchmarks, Vgent improves overall performance by 3.0%-5.4% over base models and outperforms state-of-the-art video RAG methods by 8.6%.
Key Takeaways
- Large video language models (LVLMs) face challenges on long videos, such as processing massive numbers of video tokens and retaining long-term sequential information.
- Retrieval-Augmented Generation (RAG) is effective for long text, but applying it to long video raises issues such as disrupted temporal dependencies and the inclusion of irrelevant information.
- Vgent is a graph-based retrieval-reasoning-augmented generation framework designed to address these limitations of LVLMs for long-video understanding.
- Its key innovations are representing videos as structured graphs to improve retrieval, and adding an intermediate reasoning step to mitigate the reasoning limitations of LVLMs.
- Structured verification reduces retrieval noise and enables explicit aggregation of relevant information across clips, producing more accurate and context-aware responses.
- On three long-video understanding benchmarks, Vgent improves overall performance over the base models.
- The code is publicly available at https://xiaoqian-shen.github.io/Vgent.



VideoLucy: Deep Memory Backtracking for Long Video Understanding
Authors:Jialong Zuo, Yongtai Deng, Lingdong Kong, Jingkang Yang, Rui Jin, Yiwei Zhang, Nong Sang, Liang Pan, Ziwei Liu, Changxin Gao
Recent studies have shown that agent-based systems leveraging large language models (LLMs) for key information retrieval and integration have emerged as a promising approach for long video understanding. However, these systems face two major challenges. First, they typically perform modeling and reasoning on individual frames, struggling to capture the temporal context of consecutive frames. Second, to reduce the cost of dense frame-level captioning, they adopt sparse frame sampling, which risks discarding crucial information. To overcome these limitations, we propose VideoLucy, a deep memory backtracking framework for long video understanding. Inspired by the human recollection process from coarse to fine, VideoLucy employs a hierarchical memory structure with progressive granularity. This structure explicitly defines the detail level and temporal scope of memory at different hierarchical depths. Through an agent-based iterative backtracking mechanism, VideoLucy systematically mines video-wide, question-relevant deep memories until sufficient information is gathered to provide a confident answer. This design enables effective temporal understanding of consecutive frames while preserving critical details. In addition, we introduce EgoMem, a new benchmark for long video understanding. EgoMem is designed to comprehensively evaluate a model’s ability to understand complex events that unfold over time and capture fine-grained details in extremely long videos. Extensive experiments demonstrate the superiority of VideoLucy. Built on open-source models, VideoLucy significantly outperforms state-of-the-art methods on multiple long video understanding benchmarks, achieving performance even surpassing the latest proprietary models such as GPT-4o. Our code and dataset will be made publicly at https://videolucy.github.io
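The coarse-to-fine backtracking can be pictured with the toy sketch below. The `describe` and `relevance` callables are hypothetical stand-ins for the captioning and scoring agents the paper's system would use; the granularities, the number of chunks kept per level, and the confidence threshold are all illustrative assumptions.

```python
from typing import Callable, List, Tuple

def backtrack(question: str,
              frames: List[int],                        # frame indices of the whole video
              describe: Callable[[List[int]], str],     # hypothetical captioner over a frame span
              relevance: Callable[[str, str], float],   # hypothetical question/caption scorer in [0, 1]
              levels: Tuple[int, ...] = (64, 16, 4),    # coarse-to-fine span lengths
              keep: int = 2,                            # spans kept per level
              confident: float = 0.9) -> List[str]:
    spans, evidence = [frames], []
    for span_len in levels:                             # progressively finer granularity
        next_spans = []
        for span in spans:
            chunks = [span[i:i + span_len] for i in range(0, len(span), span_len)]
            chunks.sort(key=lambda c: relevance(question, describe(c)), reverse=True)
            next_spans.extend(chunks[:keep])            # only drill into the most relevant chunks
        spans = next_spans
        evidence = [describe(s) for s in spans]
        if max(relevance(question, e) for e in evidence) >= confident:
            break                                       # enough information gathered, answer now
    return evidence

# toy run with dummy stand-ins
print(backtrack("who opens the door?", list(range(256)),
                describe=lambda span: f"frames {span[0]}-{span[-1]}",
                relevance=lambda q, c: 0.5))
```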
Paper and project links
PDF NeurIPS-2025 Accepted Paper
Summary
Agent-based systems built on large language models (LLMs) have become a promising approach to long-video understanding, but they face two challenges: they struggle to capture the temporal context of consecutive frames, and the sparse frame sampling used to reduce dense captioning costs risks discarding key information. To overcome these limitations, the paper proposes VideoLucy, a deep memory backtracking framework for long-video understanding. It uses a hierarchical memory structure with progressive granularity and an agent-based iterative backtracking mechanism to systematically mine question-relevant deep memories, enabling effective temporal understanding while preserving critical details. The paper also introduces EgoMem, a new benchmark for evaluating a model's ability to understand complex events that unfold over time and to capture fine-grained details in extremely long videos. Experiments show that VideoLucy significantly outperforms existing methods.
Key Takeaways
- Agent-based systems built on large language models offer a promising approach to long-video understanding.
- Current systems struggle with two issues: missing temporal context across consecutive frames, and loss of key information caused by sparse frame sampling.
- VideoLucy addresses these challenges with deep memory backtracking, achieving effective temporal understanding while preserving critical details.
- VideoLucy combines a hierarchical memory structure with an agent-based iterative backtracking mechanism to mine question-relevant deep memories.
- VideoLucy performs strongly on multiple long-video understanding benchmarks, even surpassing the latest proprietary models such as GPT-4o.
- The paper introduces the EgoMem benchmark to evaluate a model's ability to understand complex events and capture fine-grained details in very long videos.





State Space Prompting via Gathering and Spreading Spatio-Temporal Information for Video Understanding
Authors:Jiahuan Zhou, Kai Zhu, Zhenyu Cui, Zichen Liu, Xu Zou, Gang Hua
Recently, pre-trained state space models have shown great potential for video classification, which sequentially compresses visual tokens in videos with linear complexity, thereby improving the processing efficiency of video data while maintaining high performance. To apply powerful pre-trained models to downstream tasks, prompt learning is proposed to achieve efficient downstream task adaptation with only a small number of fine-tuned parameters. However, the sequentially compressed visual prompt tokens fail to capture the spatial and temporal contextual information in the video, thus limiting the effective propagation of spatial information within a video frame and temporal information between frames in the state compression model and the extraction of discriminative information. To tackle the above issue, we proposed a State Space Prompting (SSP) method for video understanding, which combines intra-frame and inter-frame prompts to aggregate and propagate key spatiotemporal information in the video. Specifically, an Intra-Frame Gathering (IFG) module is designed to aggregate spatial key information within each frame. Besides, an Inter-Frame Spreading (IFS) module is designed to spread discriminative spatio-temporal information across different frames. By adaptively balancing and compressing key spatio-temporal information within and between frames, our SSP effectively propagates discriminative information in videos in a complementary manner. Extensive experiments on four video benchmark datasets verify that our SSP significantly outperforms existing SOTA methods by 2.76% on average while reducing the overhead of fine-tuning parameters.
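A loose approximation of the Intra-Frame Gathering / Inter-Frame Spreading idea is sketched below using a learnable query with standard attention for gathering and a GRU for spreading across frames; the actual method operates inside a state space backbone, so treat the module choices and sizes here purely as assumptions.

```python
import torch
import torch.nn as nn

class SSPPrompts(nn.Module):
    """Toy Intra-Frame Gathering (attention pooling per frame) + Inter-Frame Spreading (recurrence over time)."""
    def __init__(self, dim: int = 384):
        super().__init__()
        self.gather_q = nn.Parameter(torch.randn(1, 1, dim))            # learnable intra-frame query
        self.gather = nn.MultiheadAttention(dim, num_heads=6, batch_first=True)
        self.spread = nn.GRU(dim, dim, batch_first=True)                # carries prompts across frames

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        B, T, N, D = tokens.shape                                       # patch tokens per frame
        flat = tokens.reshape(B * T, N, D)
        q = self.gather_q.expand(B * T, 1, D)
        frame_prompt, _ = self.gather(q, flat, flat)                    # Intra-Frame Gathering
        frame_prompt = frame_prompt.reshape(B, T, D)
        spread, _ = self.spread(frame_prompt)                           # Inter-Frame Spreading
        return tokens + spread.unsqueeze(2)                             # inject back into every patch

x = torch.randn(2, 8, 49, 384)       # (batch, frames, patches, dim)
print(SSPPrompts()(x).shape)         # torch.Size([2, 8, 49, 384])
```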
Paper and project links
Summary
Pre-trained state space models show great potential for video classification: they compress visual tokens sequentially with linear complexity, improving processing efficiency while maintaining high performance. Prompt learning adapts such pre-trained models to downstream tasks by fine-tuning only a small number of parameters. However, sequentially compressed visual prompt tokens fail to capture spatial and temporal context in the video, limiting the propagation of spatial information within frames and temporal information across frames, as well as the extraction of discriminative information. To address this, the paper proposes State Space Prompting (SSP) for video understanding, which combines intra-frame and inter-frame prompts to gather and spread key spatio-temporal information. By adaptively balancing and compressing key spatio-temporal information within and between frames, SSP propagates discriminative information in a complementary manner. Extensive experiments on four video benchmark datasets show that SSP outperforms existing state-of-the-art methods by 2.76% on average while reducing the overhead of fine-tuned parameters.
Key Takeaways
- Pre-trained state space models offer both efficiency and strong performance for video classification.
- Sequentially compressed visual tokens cannot capture the spatio-temporal context of the video.
- The State Space Prompting (SSP) method addresses this by combining intra-frame and inter-frame prompts.
- SSP adaptively balances and compresses key spatio-temporal information to propagate discriminative information through the video.
- SSP significantly outperforms existing state-of-the-art methods on multiple video datasets.
- SSP improves performance while reducing the overhead of fine-tuned parameters.





LikePhys: Evaluating Intuitive Physics Understanding in Video Diffusion Models via Likelihood Preference
Authors:Jianhao Yuan, Fabio Pizzati, Francesco Pinto, Lars Kunze, Ivan Laptev, Paul Newman, Philip Torr, Daniele De Martini
Intuitive physics understanding in video diffusion models plays an essential role in building general-purpose physically plausible world simulators, yet accurately evaluating such capacity remains a challenging task due to the difficulty in disentangling physics correctness from visual appearance in generation. To the end, we introduce LikePhys, a training-free method that evaluates intuitive physics in video diffusion models by distinguishing physically valid and impossible videos using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. By testing on our constructed benchmark of twelve scenarios spanning over four physics domains, we show that our evaluation metric, Plausibility Preference Error (PPE), demonstrates strong alignment with human preference, outperforming state-of-the-art evaluator baselines. We then systematically benchmark intuitive physics understanding in current video diffusion models. Our study further analyses how model design and inference settings affect intuitive physics understanding and highlights domain-specific capacity variations across physical laws. Empirical results show that, despite current models struggling with complex and chaotic dynamics, there is a clear trend of improvement in physics understanding as model capacity and inference settings scale.
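Reading the abstract literally, the Plausibility Preference Error could be computed along the lines of the sketch below: for each valid-invalid pair, check whether the model's denoising loss (the ELBO-based likelihood surrogate) prefers the physically impossible video. The exact definition used in the paper may differ, and `denoise_loss` is a hypothetical callable.

```python
def plausibility_preference_error(model, pairs, denoise_loss):
    """pairs: list of (valid_video, invalid_video); denoise_loss(model, video) returns the average
    denoising error, used as a negated likelihood surrogate, so lower means "more plausible"."""
    wrong = sum(1 for valid, invalid in pairs
                if denoise_loss(model, invalid) < denoise_loss(model, valid))
    return wrong / len(pairs)

# toy check with a fake scorer that always prefers the physically valid video
fake_loss = lambda model, video: float(video["impossible"])
pairs = [({"impossible": 0}, {"impossible": 1}) for _ in range(10)]
print(plausibility_preference_error(None, pairs, fake_loss))   # 0.0
```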
Paper and project links
PDF 22 pages, 9 figures
Summary
Intuitive physics understanding in video diffusion models is essential for building general-purpose, physically plausible world simulators, but evaluating it accurately is hard because physics correctness is difficult to disentangle from visual appearance. The paper introduces LikePhys, a training-free method that evaluates intuitive physics by distinguishing physically valid from impossible videos, using the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid pairs. On a benchmark of twelve scenarios spanning four physics domains, the proposed metric, Plausibility Preference Error (PPE), aligns closely with human preference and outperforms state-of-the-art evaluator baselines. The paper then systematically benchmarks intuitive physics understanding in current video diffusion models, analyses how model design and inference settings affect it, and highlights domain-specific capacity variations across physical laws. Empirically, although current models struggle with complex and chaotic dynamics, physics understanding clearly improves as model capacity and inference settings scale.
Key Takeaways
- Intuitive physics understanding in video diffusion models is crucial for building general-purpose physical world simulators.
- LikePhys is a training-free method that evaluates this understanding by distinguishing physically valid from invalid videos.
- LikePhys uses the denoising objective as an ELBO-based likelihood surrogate on a curated dataset of valid-invalid video pairs.
- The proposed metric, Plausibility Preference Error (PPE), aligns strongly with human preference and outperforms state-of-the-art evaluator baselines.
- A systematic benchmark of current video diffusion models shows that, despite remaining challenges, their physics understanding is improving.
- Model design and inference settings both affect intuitive physics understanding.





StreamingVLM: Real-Time Understanding for Infinite Video Streams
Authors:Ruyi Xu, Guangxuan Xiao, Yukang Chen, Liuning He, Kelly Peng, Yao Lu, Song Han
Vision-language models (VLMs) could power real-time assistants and autonomous agents, but they face a critical challenge: understanding near-infinite video streams without escalating latency and memory usage. Processing entire videos with full attention leads to quadratic computational costs and poor performance on long videos. Meanwhile, simple sliding window methods are also flawed, as they either break coherence or suffer from high latency due to redundant recomputation. In this paper, we introduce StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. Our approach is a unified framework that aligns training with streaming inference. During inference, we maintain a compact KV cache by reusing states of attention sinks, a short window of recent vision tokens, and a long window of recent text tokens. This streaming ability is instilled via a simple supervised fine-tuning (SFT) strategy that applies full attention on short, overlapped video chunks, which effectively mimics the inference-time attention pattern without training on prohibitively long contexts. For evaluation, we build Inf-Streams-Eval, a new benchmark with videos averaging over two hours that requires dense, per-second alignment between frames and text. On Inf-Streams-Eval, StreamingVLM achieves a 66.18% win rate against GPT-4O mini and maintains stable, real-time performance at up to 8 FPS on a single NVIDIA H100. Notably, our SFT strategy also enhances general VQA abilities without any VQA-specific fine-tuning, improving performance on LongVideoBench by +4.30 and OVOBench Realtime by +5.96. Code is available at https://github.com/mit-han-lab/streaming-vlm.
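The streaming KV-cache policy described above (attention sinks plus a short window of recent vision tokens and a longer window of recent text tokens) might look roughly like the sketch below; the cache entry format and window sizes are assumptions, not the released implementation.

```python
def prune_kv_cache(entries, n_sink=4, vision_window=256, text_window=1024):
    """entries: list of dicts like {"pos": int, "modality": "vision" or "text", "kv": ...} in arrival
    order. Keeps the first n_sink attention-sink tokens, the most recent vision tokens, and a longer
    window of recent text tokens, then restores positional order for attention."""
    sinks = entries[:n_sink]
    rest = entries[n_sink:]
    vision = [e for e in rest if e["modality"] == "vision"][-vision_window:]
    text = [e for e in rest if e["modality"] == "text"][-text_window:]
    return sorted(sinks + vision + text, key=lambda e: e["pos"])

cache = [{"pos": i, "modality": "vision" if i % 3 else "text", "kv": None} for i in range(5000)]
print(len(prune_kv_cache(cache)))   # 4 sinks + up to 256 vision + up to 1024 text entries
```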
Paper and project links
PDF The first two authors contributed equally to this work
Summary
The paper presents StreamingVLM, a model designed for real-time, stable understanding of infinite visual input. By maintaining a compact KV cache that reuses attention-sink states together with a short window of recent vision tokens and a longer window of recent text tokens, it understands near-infinite video streams without ever-growing latency and memory usage. Experiments show that StreamingVLM performs strongly on the Inf-Streams-Eval benchmark, and its supervised fine-tuning strategy improves not only real-time performance but also general visual question answering ability.
Key Takeaways
- StreamingVLM is designed for real-time understanding of infinite visual input, addressing a key challenge in video stream processing.
- By maintaining a compact KV cache and reusing attention-sink states, it achieves stable real-time understanding of video streams.
- A simple supervised fine-tuning strategy on short, overlapped video chunks mimics the inference-time attention pattern without training on prohibitively long contexts.
- The model performs well on the Inf-Streams-Eval benchmark, demonstrating efficient real-time processing.
- It maintains stable operation at up to 8 FPS on a single NVIDIA H100.
- The supervised fine-tuning strategy also strengthens general VQA ability, improving results on several benchmarks.








MomentSeg: Moment-Centric Sampling for Enhanced Video Pixel Understanding
Authors:Ming Dai, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang
Referring Video Object Segmentation (RefVOS) seeks to segment target objects in videos guided by natural language descriptions, demanding both temporal reasoning and fine-grained visual comprehension. Existing sampling strategies for LLM-based approaches typically rely on either handcrafted heuristics or external keyframe models. The former often overlooks essential temporal cues, while the latter increases system complexity. To address this, we propose a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding capability. During training, we introduce a novel TSG paradigm that employs a dedicated [FIND] token for key moment identification through temporal token similarity matching, thereby avoiding the need for external timestamp encodings. For inference, we design a Moment-Centric Sampling (MCS) strategy that densely samples informative moments while sparsely sampling non-essential frames, preserving both motion details and global context. To further enhance tracking stability, we develop Bidirectional Anchor-updated Propagation (BAP), which leverages the most relevant moment as start point for high-quality mask initialization and dynamically updates at sampled points to mitigate accumulated errors. Code and model will be available at: https://github.com/Dmmm1997/MomentSeg
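A minimal sketch of Moment-Centric Sampling, assuming a per-frame relevance score such as the similarity to the [FIND] token: sample densely around the highest-scoring moment and sparsely everywhere else. The radius and strides below are illustrative, not the paper's settings.

```python
def moment_centric_sampling(scores, dense_radius=8, dense_stride=1, sparse_stride=16):
    """scores: per-frame relevance (e.g. similarity to the [FIND] token). Returns frame indices
    sampled densely around the best-scoring moment and sparsely elsewhere."""
    n = len(scores)
    center = max(range(n), key=lambda i: scores[i])
    dense = set(range(max(0, center - dense_radius),
                      min(n, center + dense_radius + 1), dense_stride))
    sparse = set(range(0, n, sparse_stride))
    return sorted(dense | sparse)

print(moment_centric_sampling([0.1] * 40 + [0.9] * 5 + [0.1] * 55))
```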
Paper and project links
Summary
Referring Video Object Segmentation (RefVOS) segments target objects in videos guided by natural language descriptions, which requires temporal reasoning and fine-grained visual understanding. Existing sampling strategies rely on handcrafted heuristics, which overlook important temporal cues, or on external keyframe models, which add system complexity. The paper proposes a unified framework that jointly optimizes Temporal Sentence Grounding (TSG) and RefVOS, naturally incorporating key moment grounding. During training, a new TSG paradigm uses a dedicated [FIND] token to identify key moments through temporal token similarity matching, avoiding external timestamp encodings. At inference, a Moment-Centric Sampling (MCS) strategy densely samples informative moments and sparsely samples non-essential frames, preserving both motion detail and global context. Bidirectional Anchor-updated Propagation (BAP) uses the most relevant moment as the starting point for high-quality mask initialization and updates dynamically at sampled points to reduce accumulated errors.
Key Takeaways
- RefVOS segments target objects in videos based on natural language descriptions, requiring temporal reasoning and fine-grained visual understanding.
- Existing sampling strategies rely on handcrafted heuristics or external models, which may miss important temporal cues or add system complexity.
- A unified framework jointly optimizes TSG and RefVOS and naturally incorporates key moment grounding.
- The TSG paradigm uses a [FIND] token for key moment identification via temporal token similarity matching, without external timestamp encodings.
- The MCS strategy densely samples informative moments and sparsely samples non-essential frames.
- The BAP technique uses the most relevant moment as a high-quality mask initialization point and updates dynamically at sampled points.
- The approach improves tracking stability and reduces accumulated errors. Code and the model will be released at the link above.





NeMo: Needle in a Montage for Video-Language Understanding
Authors:Zi-Yuan Hu, Shuo Liang, Duo Zheng, Yanyang Li, Yeyao Tao, Shijia Huang, Wei Feng, Jia Qin, Jianguang Yu, Jing Huang, Meng Fang, Yin Li, Liwei Wang
Recent advances in video large language models (VideoLLMs) call for new evaluation protocols and benchmarks for complex temporal reasoning in video-language understanding. Inspired by the needle in a haystack test widely used by LLMs, we introduce a novel task of Needle in a Montage (NeMo), designed to assess VideoLLMs’ critical reasoning capabilities, including long-context recall and temporal grounding. To generate video question answering data for our task, we develop a scalable automated data generation pipeline that facilitates high-quality data synthesis. Built upon the proposed pipeline, we present NeMoBench, a video-language benchmark centered on our task. Specifically, our full set of NeMoBench features 31,378 automatically generated question-answer (QA) pairs from 13,486 videos with various durations ranging from seconds to hours. Experiments demonstrate that our pipeline can reliably and automatically generate high-quality evaluation data, enabling NeMoBench to be continuously updated with the latest videos. We evaluate 20 state-of-the-art models on our benchmark, providing extensive results and key insights into their capabilities and limitations. Our project page is available at: https://lavi-lab.github.io/NeMoBench.
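The needle-in-a-montage construction can be imagined as splicing a short target clip into a longer montage and recording its ground-truth span for temporal-grounding QA, roughly as in the toy sketch below; the field names and item format are made up for illustration and do not reflect the actual pipeline.

```python
import random

def make_nemo_item(haystack_clips, needle_clip, question, answer):
    """Splice the needle clip into the montage at a random position and record its time span."""
    pos = random.randrange(len(haystack_clips) + 1)
    montage = haystack_clips[:pos] + [needle_clip] + haystack_clips[pos:]
    start = sum(c["duration"] for c in haystack_clips[:pos])
    return {"clips": montage,
            "question": question,
            "answer": answer,
            "needle_span": (start, start + needle_clip["duration"])}

item = make_nemo_item([{"duration": 30.0} for _ in range(8)],
                      {"duration": 5.0},
                      "When does the red car appear?",
                      "In the needle clip.")
print(item["needle_span"])
```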
Paper and project links
Summary
The paper introduces Needle in a Montage (NeMo), a new task for video large language models (VideoLLMs) that assesses complex temporal reasoning in video-language understanding, including long-context recall and temporal grounding. A scalable automated data generation pipeline produces video question answering data for the task, and NeMoBench, a video-language benchmark built on this pipeline, contains 31,378 automatically generated question-answer pairs from 13,486 videos ranging from seconds to hours in length. Experiments show that the pipeline reliably generates high-quality evaluation data, so NeMoBench can be continuously updated with the latest videos. The paper evaluates 20 state-of-the-art models and analyses their capabilities and limitations. Project page: https://lavi-lab.github.io/NeMoBench.
Key Takeaways
- Video large language models need new evaluation protocols and benchmarks for complex temporal reasoning.
- The Needle in a Montage (NeMo) task is introduced to assess reasoning in video-language understanding.
- A scalable automated data generation pipeline produces high-quality video question answering data.
- NeMoBench, built on this pipeline, contains a large number of automatically generated question-answer pairs.
- The pipeline reliably and automatically generates high-quality evaluation data, allowing NeMoBench to be continuously updated.
- Twenty state-of-the-art models are evaluated on NeMoBench, with extensive results and analysis.




In the Eye of MLLM: Benchmarking Egocentric Video Intent Understanding with Gaze-Guided Prompting
Authors:Taiying Peng, Jiacheng Hua, Miao Liu, Feng Lu
The emergence of advanced multimodal large language models (MLLMs) has significantly enhanced AI assistants’ ability to process complex information across modalities. Recently, egocentric videos, by directly capturing user focus, actions, and context in an unified coordinate, offer an exciting opportunity to enable proactive and personalized AI user experiences with MLLMs. However, existing benchmarks overlook the crucial role of gaze as an indicator of user intent. To address this gap, we introduce EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that leverages gaze information to improve the understanding of longer daily-life videos. EgoGazeVQA consists of gaze-based QA pairs generated by MLLMs and refined by human annotators. Our experiments reveal that existing MLLMs struggle to accurately interpret user intentions. In contrast, our gaze-guided intent prompting methods significantly enhance performance by integrating spatial, temporal, and intent-related cues. We further conduct experiments on gaze-related fine-tuning and analyze how gaze estimation accuracy impacts prompting effectiveness. These results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings. Project page: https://taiyi98.github.io/projects/EgoGazeVQA
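Gaze-guided intent prompting could be as simple as serializing gaze fixations into extra textual cues for the MLLM, as in the hypothetical sketch below; the fixation format and the prompt wording are assumptions rather than the benchmark's exact protocol.

```python
def gaze_prompt(question, fixations):
    """fixations: list of dicts {"t": seconds, "object": str} produced by a gaze estimator."""
    cues = "; ".join(f'at {f["t"]:.1f}s the wearer looks at the {f["object"]}' for f in fixations)
    return (f"Gaze cues: {cues}.\n"
            f"Using these cues about where the wearer is attending, answer: {question}")

print(gaze_prompt("What is the user about to do?",
                  [{"t": 3.2, "object": "kettle"}, {"t": 5.0, "object": "mug"}]))
```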
Paper and project links
PDF Accepted to NeurIPS 2025
Summary
Advanced multimodal large language models (MLLMs) have greatly improved AI assistants' ability to process complex multimodal information, and egocentric videos offer an opportunity for more proactive, personalized AI experiences. However, existing benchmarks overlook gaze as an indicator of user intent. To fill this gap, the paper introduces EgoGazeVQA, an egocentric gaze-guided video question answering benchmark that uses gaze information to improve understanding of longer daily-life videos. Experiments show that existing MLLMs struggle to interpret user intentions accurately, whereas the proposed gaze-guided intent prompting methods significantly improve performance by integrating spatial, temporal, and intent-related cues. Further experiments on gaze-related fine-tuning analyse how gaze estimation accuracy affects prompting effectiveness. The results underscore the value of gaze for more personalized and effective AI assistants in egocentric settings.
Key Takeaways
- Advanced multimodal large language models (MLLMs) improve AI assistants' ability to process cross-modal information.
- Egocentric videos enable more proactive and personalized AI user experiences.
- Existing benchmarks overlook gaze as a key indicator of user intent.
- EgoGazeVQA is introduced as a gaze-guided benchmark that uses gaze information to improve understanding of long daily-life videos.
- Existing MLLMs struggle to interpret user intentions accurately.
- Gaze-guided intent prompting improves performance by integrating spatial, temporal, and intent-related cues.







StreamAgent: Towards Anticipatory Agents for Streaming Video Understanding
Authors:Haolin Yang, Feilong Tang, Lingxiao Zhao, Xiang An, Ming Hu, Huifa Li, Xinlin Zhuang, Yifan Lu, Xiaofeng Zhang, Abdalla Swikir, Junjun He, Zongyuan Ge, Imran Razzak
Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance poses challenges beyond conventional offline video processing, requiring continuous perception, proactive decision making, and responsive interaction based on dynamically evolving visual content. However, existing methods rely on alternating perception-reaction or asynchronous triggers, lacking task-driven planning and future anticipation, which limits their real-time responsiveness and proactive decision making in evolving video streams. To this end, we propose a StreamAgent that anticipates the temporal intervals and spatial regions expected to contain future task-relevant information to enable proactive and goal-driven responses. Specifically, we integrate question semantics and historical observations through prompting the anticipatory agent to anticipate the temporal progression of key events, align current observations with the expected future evidence, and subsequently adjust the perception action (e.g., attending to task-relevant regions or continuously tracking in subsequent frames). To enable efficient inference, we design a streaming KV-cache memory mechanism that constructs a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens in the traditional KV-cache. Extensive experiments on streaming and long video understanding tasks demonstrate that our method outperforms existing methods in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
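The anticipate-then-perceive loop reads roughly like the sketch below, where `anticipate`, `perceive`, and `answer_ready` are hypothetical stand-ins for the prompted agent, the perception module, and the stopping criterion described in the abstract.

```python
def streaming_loop(question, frame_stream, anticipate, perceive, answer_ready):
    """anticipate(question, history) -> expectation such as {"when": "soon", "where": "left door"};
    perceive(frame, expectation) -> observation focused on the expected region;
    answer_ready(question, history) -> True once enough evidence has been gathered."""
    history = []
    for frame in frame_stream:
        expectation = anticipate(question, history)   # plan ahead instead of reacting frame by frame
        history.append(perceive(frame, expectation))  # attend to task-relevant regions / keep tracking
        if answer_ready(question, history):
            break
    return history

# toy run with trivial stand-ins
print(len(streaming_loop("Did anyone enter?", range(100),
                         anticipate=lambda q, h: {"where": "door"},
                         perceive=lambda f, e: (f, e["where"]),
                         answer_ready=lambda q, h: len(h) >= 10)))   # 10
```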
Paper and project links
Summary
Real-time streaming video understanding in domains such as autonomous driving and intelligent surveillance requires continuous perception, proactive decision making, and responsive interaction with dynamically evolving visual content, going beyond conventional offline video processing. Existing methods rely on alternating perception-reaction or asynchronous triggers and lack task-driven planning and future anticipation, limiting real-time responsiveness and proactive decisions. The paper proposes StreamAgent, which anticipates the temporal intervals and spatial regions expected to contain future task-relevant information, enabling proactive, goal-driven responses. By integrating question semantics and historical observations through prompting, the anticipatory agent predicts the temporal progression of key events, aligns current observations with expected future evidence, and adjusts its perception actions accordingly. For efficient inference, a streaming KV-cache memory mechanism builds a hierarchical memory structure for selective recall of relevant tokens, enabling efficient semantic retrieval while reducing the overhead of storing all tokens as in a conventional KV cache. Extensive experiments on streaming and long-video understanding tasks show that the method outperforms existing approaches in response accuracy and real-time efficiency, highlighting its practical value for real-world streaming scenarios.
Key Takeaways
- Real-time streaming video understanding in autonomous driving and intelligent surveillance goes beyond conventional offline video processing.
- Existing methods rely on alternating perception-reaction or asynchronous triggers and lack task-driven planning and future anticipation.
- StreamAgent anticipates the temporal intervals and spatial regions of future task-relevant information, enabling proactive decision making.
- Integrating question semantics with historical observations improves the agent's anticipation of key events.
- A streaming KV-cache memory mechanism improves semantic retrieval efficiency and reduces storage overhead.
- The method outperforms existing approaches in response accuracy and real-time efficiency.





The Role of Video Generation in Enhancing Data-Limited Action Understanding
Authors:Wei Li, Dezhao Luo, Dongbao Yang, Zhenhang Li, Weiping Wang, Yu Zhou
Video action understanding tasks in real-world scenarios always suffer data limitations. In this paper, we address the data-limited action understanding problem by bridging data scarcity. We propose a novel method that employs a text-to-video diffusion transformer to generate annotated data for model training. This paradigm enables the generation of realistic annotated data on an infinite scale without human intervention. We proposed the information enhancement strategy and the uncertainty-based label smoothing tailored to generate sample training. Through quantitative and qualitative analysis, we observed that real samples generally contain a richer level of information than generated samples. Based on this observation, the information enhancement strategy is proposed to enhance the informative content of the generated samples from two aspects: the environments and the characters. Furthermore, we observed that some low-quality generated samples might negatively affect model training. To address this, we devised the uncertainty-based label smoothing strategy to increase the smoothing of these samples, thus reducing their impact. We demonstrate the effectiveness of the proposed method on four datasets across five tasks and achieve state-of-the-art performance for zero-shot action recognition.
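The uncertainty-based label smoothing can be written down directly: samples judged less reliable get a larger smoothing factor, so their possibly noisy labels pull the model less strongly. The sketch below is one way to implement this idea; the linear mapping from uncertainty to the smoothing strength is an assumption, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_smoothed_ce(logits, targets, uncertainty, max_eps=0.3):
    """logits: (B, C); targets: (B,) class ids; uncertainty: (B,) in [0, 1], larger = less reliable."""
    n_classes = logits.size(1)
    eps = (uncertainty * max_eps).unsqueeze(1)                  # per-sample smoothing strength
    one_hot = F.one_hot(targets, n_classes).float()
    soft = one_hot * (1 - eps) + eps / n_classes                # smoothed target distribution
    return -(soft * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

loss = uncertainty_smoothed_ce(torch.randn(4, 10),
                               torch.tensor([1, 3, 0, 7]),
                               torch.tensor([0.1, 0.9, 0.5, 0.2]))
print(loss.item())
```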
Paper and project links
PDF IJCAI2025
Summary
To address limited data for video action understanding, the paper uses a text-to-video diffusion transformer to generate annotated training data, producing realistic annotated samples at scale without human intervention. An information enhancement strategy enriches the generated samples in terms of environments and characters, and an uncertainty-based label smoothing strategy reduces the impact of low-quality generated samples on training. The method is validated on four datasets across five tasks and achieves state-of-the-art performance for zero-shot action recognition, offering practical value for real-world scenarios with scarce action data.
Key Takeaways
- To tackle data limitations in real-world video action understanding, a text-to-video diffusion transformer is used to generate annotated training data.
- An information enhancement strategy enriches generated samples from two aspects: the environments and the characters.






VideoAds for Fast-Paced Video Understanding
Authors:Zheyuan Zhang, Monica Dou, Linkai Peng, Hongyi Pan, Ulas Bagci, Boqing Gong
Advertisement videos serve as a rich and valuable source of purpose-driven information, encompassing high-quality visual, textual, and contextual cues designed to engage viewers. They are often more complex than general videos of similar duration due to their structured narratives and rapid scene transitions, posing significant challenges to multi-modal large language models (MLLMs). In this work, we introduce VideoAds, the first dataset tailored for benchmarking the performance of MLLMs on advertisement videos. VideoAds comprises well-curated advertisement videos with complex temporal structures, accompanied by manually annotated diverse questions across three core tasks: visual finding, video summary, and visual reasoning. We propose a quantitative measure to compare VideoAds against existing benchmarks in terms of video complexity. Through extensive experiments, we find that Qwen2.5-VL-72B, an opensource MLLM, achieves 73.35% accuracy on VideoAds, outperforming GPT-4o (66.82%) and Gemini-1.5 Pro (69.66%); the two proprietary models especially fall behind the opensource model in video summarization and reasoning, but perform the best in visual finding. Notably, human experts easily achieve a remarkable accuracy of 94.27%. These results underscore the necessity of advancing MLLMs’ temporal modeling capabilities and highlight VideoAds as a potentially pivotal benchmark for future research in understanding video that requires high FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.
Paper and project links
PDF ICCV2025
Summary:
Advertisement videos are a rich source of purpose-driven information, with high-quality visual, textual, and contextual cues designed to engage viewers; their structured narratives and rapid scene transitions make them challenging for multimodal large language models (MLLMs). This work introduces VideoAds, the first dataset for benchmarking MLLMs on advertisement videos. VideoAds contains curated advertisement videos with complex temporal structures and manually annotated questions covering three core tasks: visual finding, video summarization, and visual reasoning. Extensive experiments show that the open-source MLLM Qwen2.5-VL-72B achieves 73.35% accuracy on VideoAds, outperforming GPT-4o and Gemini-1.5 Pro, while human experts easily reach 94.27%. The results highlight the need to improve MLLMs' temporal modeling and position VideoAds as a key benchmark for video understanding that requires high-FPS sampling. The dataset and evaluation code will be publicly available at https://videoadsbenchmark.netlify.app.
Key Takeaways:
- Advertisement videos are information-rich, with visual, textual, and contextual cues, structured narratives, and rapid scene transitions.
- The VideoAds dataset is designed to benchmark multimodal large language models (MLLMs) on advertisement videos.
- VideoAds includes manually annotated questions covering three core tasks: visual finding, video summarization, and visual reasoning.
- The open-source MLLM Qwen2.5-VL-72B performs best on VideoAds, with 73.35% accuracy.
- The two proprietary models perform best on visual finding but fall behind the open-source model on video summarization and reasoning.
- Human experts achieve high accuracy (94.27%) on the benchmark.




