⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings; use them only as a first-pass screen before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-11
TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Authors:Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
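As a concrete reading of the interleaved text-video thinking formulation, the minimal Python sketch below alternates text generation with clip-search actions and feeds the retrieved frames back into the context. The `<search>`/`<answer>` tag format, the `sample_frames` helper, and the context representation are assumptions for illustration, not the authors' actual interface.

```python
import re
from typing import Callable

SEARCH_PAT = re.compile(r"<search>\s*([\d.]+)\s*,\s*([\d.]+)\s*</search>")
ANSWER_PAT = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def interleaved_rollout(
    generate: Callable[[list], str],                # policy step: multimodal context -> next text segment
    sample_frames: Callable[[float, float], list],  # video access: (start_s, end_s) -> frames
    question: str,
    max_turns: int = 8,
) -> tuple[str, list]:
    """Run one rollout that interleaves text reasoning with clip searches."""
    context: list = [{"type": "text", "text": question}]
    searched_frames: list = []  # gathered for the later completeness check
    for _ in range(max_turns):
        segment = generate(context)
        context.append({"type": "text", "text": segment})
        if (m := ANSWER_PAT.search(segment)):   # reasoning finished
            return m.group(1).strip(), searched_frames
        if (m := SEARCH_PAT.search(segment)):   # search action: fetch frames, feed them back
            frames = sample_frames(float(m.group(1)), float(m.group(2)))
            searched_frames.extend(frames)
            context.append({"type": "frames", "frames": frames})
    return "", searched_frames
```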
Paper and project links
PDF 22 pages, 17 figures. Official code: https://github.com/Time-Search/TimeSearch-R
Summary
This paper proposes TimeSearch-R, which reformulates temporal search as interleaved text-video thinking and seamlessly integrates clip searching into the reasoning process through reinforcement learning (RL). To address the unsupervised intermediate search decisions that arise when RL methods such as Group Relative Policy Optimization (GRPO) are applied to video reasoning, the paper introduces GRPO with Completeness Self-Verification (GRPO-CSV), which gathers the video frames searched during reasoning and uses the same policy model to verify whether they are sufficient, thereby improving the completeness of video reasoning. The authors also build dedicated datasets for the SFT cold start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to raise task difficulty and strengthen temporal search. Experiments show that TimeSearch-R achieves significant gains on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D and on long-form video understanding benchmarks such as VideoMME and MLVU. In particular, TimeSearch-R sets a new state of the art on LongVideoBench, improving over the base model Qwen2.5-VL by 4.1% and over the advanced video reasoning model Video-R1 by 2.0%.
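One way to read the completeness self-verification is as an extra reward term layered on top of GRPO's group-normalized advantages, as in the sketch below; the verification prompt, the bonus weight `w_verify`, and the reward composition are illustrative assumptions rather than the paper's exact formulation.

```python
from statistics import mean, pstdev

def csv_reward(answer_reward: float, verified_complete: bool, w_verify: float = 0.5) -> float:
    # Total reward = task correctness + a completeness self-verification bonus.
    # verified_complete is judged by the *same* policy model, shown only the frames
    # it actually searched, asked whether they suffice to answer the question.
    return answer_reward + (w_verify if verified_complete else 0.0)

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Group-relative normalization of rewards, as in standard GRPO.
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma or 1.0) for r in rewards]

# Example: a group of four rollouts for one question, as (answer_reward, verified_complete).
group = [(1.0, True), (1.0, False), (0.0, False), (0.0, True)]
rewards = [csv_reward(a, v) for a, v in group]
print(grpo_advantages(rewards))
```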
Key Insights
- TimeSearch-R reformulates temporal search as an interleaved text-video thinking process.
- GRPO-CSV is proposed: the frames searched during reasoning are gathered and self-verified, improving the completeness of video reasoning.
- Dedicated datasets are built by filtering out samples with weak temporal dependencies, raising task difficulty and strengthening temporal search capability.
- TimeSearch-R performs strongly on multiple benchmarks, with clear gains over the base model.
- TimeSearch-R sets a new state of the art on LongVideoBench.
- TimeSearch-R uses reinforcement learning to optimize the search strategy end to end.
- The code is publicly released for research use.
LiveStar: Live Streaming Assistant for Real-World Online Video Understanding
Authors:Zhenyu Yang, Kairui Zhang, Yuhang Hu, Bing Wang, Shengsheng Qian, Bin Wen, Fan Yang, Tingting Gao, Weiming Dong, Changsheng Xu
Despite significant progress in Video Large Language Models (Video-LLMs) for offline video understanding, existing online Video-LLMs typically struggle to simultaneously process continuous frame-by-frame inputs and determine optimal response timing, often compromising real-time responsiveness and narrative coherence. To address these limitations, we introduce LiveStar, a pioneering live streaming assistant that achieves always-on proactive responses through adaptive streaming decoding. Specifically, LiveStar incorporates: (1) a training strategy enabling incremental video-language alignment for variable-length video streams, preserving temporal consistency across dynamically evolving frame sequences; (2) a response-silence decoding framework that determines optimal proactive response timing via a single forward pass verification; (3) memory-aware acceleration via peak-end memory compression for online inference on 10+ minute videos, combined with streaming key-value cache to achieve 1.53x faster inference. We also construct an OmniStar dataset, a comprehensive dataset for training and benchmarking that encompasses 15 diverse real-world scenarios and 5 evaluation tasks for online video understanding. Extensive experiments across three benchmarks demonstrate LiveStar’s state-of-the-art performance, achieving an average 19.5% improvement in semantic correctness with 18.1% reduced timing difference compared to existing online Video-LLMs, while improving FPS by 12.0% across all five OmniStar tasks. Our model and dataset can be accessed at https://github.com/yzy-bupt/LiveStar.
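The response-silence decoding idea can be pictured as a per-frame gate: after each incoming frame, one forward pass scores whether a response is due, and full decoding runs only when that score clears a threshold. The `score_respond` gate and the fixed threshold below are illustrative assumptions, not LiveStar's actual decision rule.

```python
from typing import Callable, Iterable, Iterator, Optional

def streaming_respond(
    frames: Iterable,                         # incoming video frames, one per time step
    score_respond: Callable[[list], float],   # one forward pass -> confidence that a response is due
    decode_response: Callable[[list], str],   # full decoding, run only when the gate opens
    threshold: float = 0.5,
) -> Iterator[Optional[str]]:
    """Yield None while staying silent, or a narration string when responding."""
    history: list = []
    for frame in frames:
        history.append(frame)
        # A single cheap verification pass decides between responding and staying silent.
        if score_respond(history) >= threshold:
            yield decode_response(history)
        else:
            yield None
```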
Paper and project links
PDF NeurIPS 2025 Accepted
Summary
This article introduces LiveStar, a new model for online video understanding. LiveStar achieves always-on proactive responses through adaptive streaming decoding, addressing the difficulty existing online Video-LLMs have in simultaneously processing continuous frame-by-frame inputs and deciding when to respond. Through its training strategy, response-silence decoding framework, and memory-aware acceleration, LiveStar delivers efficient online video understanding with strong semantic correctness and real-time responsiveness. The article also introduces the OmniStar dataset for training and evaluation.
Key Takeaways
- LiveStar is a model for online video understanding that achieves always-on proactive responses through adaptive streaming decoding.
- LiveStar addresses the difficulty existing online Video-LLMs have in processing continuous frame-by-frame inputs while determining the optimal response timing.
- Its training strategy enables incremental video-language alignment for variable-length video streams.
- The response-silence decoding framework determines the optimal proactive response timing via a single forward-pass verification.
- Memory-aware acceleration improves online inference efficiency, especially for videos longer than ten minutes (a sketch of one possible peak-end compression scheme follows this list).
- The OmniStar dataset, covering diverse real-world scenarios and evaluation tasks, supports training and benchmarking of online video understanding models.
- Experiments show that LiveStar delivers significant gains in semantic correctness and real-time responsiveness.
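Reading "peak-end memory compression" as keeping the highest-salience ("peak") entries plus the most recent ("end") window of the memory bank, a minimal sketch might look like the following; the entry format, salience scores, and budgets are assumptions, not LiveStar's implementation.

```python
def peak_end_compress(memory: list[dict], n_peaks: int = 16, n_recent: int = 16) -> list[dict]:
    """Keep only 'peak' (high-salience) and 'end' (most recent) entries of a frame memory bank.

    Each entry is assumed to look like {"t": seconds, "salience": float, ...}; both the
    entry format and the selection rule are illustrative, not LiveStar's actual scheme.
    """
    if len(memory) <= n_peaks + n_recent:
        return memory
    recent = memory[-n_recent:]                                             # the "end" of the stream
    older = memory[:-n_recent]
    peaks = sorted(older, key=lambda e: e["salience"], reverse=True)[:n_peaks]  # the "peaks"
    return sorted(peaks + recent, key=lambda e: e["t"])                     # restore temporal order

# Example: a 10-minute stream sampled at 1 FPS collapses from 600 entries to 32.
bank = [{"t": float(t), "salience": (t % 37) / 37.0} for t in range(600)]
print(len(peak_end_compress(bank)))  # -> 32
```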