Published: 2025-10-23

Updated: 2025-11-27

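⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for initial screening before reading a paper!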

Updated 2025-10-23

StreamingTOM: Streaming Token Compression for Efficient Video Understanding

Authors: Xueyi Chen, Keda Tao, Kele Shao, Huan Wang

Unlike offline processing, streaming video vision-language models face two fundamental constraints: causality and accumulation. Causality prevents access to future frames that offline methods exploit, while accumulation causes tokens to grow unbounded, creating efficiency bottlenecks. However, existing approaches only regulate post-LLM kv-cache, leaving costly pre-LLM prefill unchanged. We introduce StreamingTOM, a training-free, plug-and-play two-stage framework that addresses both pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction imposes a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, drastically reducing per-frame prefill cost by processing only a compact subset of visual tokens per frame instead of all visual tokens. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments demonstrate our method achieves $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory and $2\times$ faster TTFT compared to prior SOTA. StreamingTOM maintains state-of-the-art accuracy among training-free methods with an average of $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of our two-stage approach for efficient streaming video understanding with bounded growth.

Paper and project links

PDF

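Summary
Video language models that process streams face two fundamental constraints: causality and accumulation. Causality means future frames are unavailable, while accumulation makes the token count grow without bound, creating an efficiency bottleneck. Existing methods regulate only the post-LLM kv-cache and leave the costly pre-LLM prefill unchanged. The paper introduces StreamingTOM, a training-free two-stage framework that addresses both the pre-LLM and post-LLM bottlenecks with predictable latency. Causal Temporal Reduction sets a fixed per-frame budget and selects tokens based on adjacent-frame changes and token saliency, so that only a compact subset of each frame's visual tokens is processed rather than all of them, which greatly reduces per-frame prefill cost. Online Quantized Memory stores tokens in 4-bit format, retrieves relevant groups on demand, and dequantizes them, keeping the active kv-cache bounded regardless of stream length. Experiments show $15.7\times$ kv-cache compression, $1.2\times$ lower peak memory, and $2\times$ faster TTFT compared with the prior state of the art. Among training-free methods, StreamingTOM maintains state-of-the-art accuracy, averaging $63.8\%$ on offline benchmarks and $55.8\%/3.7$ on RVS. These results highlight the practical benefits of the two-stage approach for efficient streaming video understanding with bounded growth.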
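
The per-frame token selection described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `select_tokens`, the mixing weight `alpha`, the use of L2 distance as the adjacent-frame change measure, and the caller-supplied `saliency` scores are all assumptions made for illustration. The sketch only shows the two properties the abstract states: a fixed per-frame budget, and a selection that looks at the current and previous frame only, so it stays causal.

```python
import math
from typing import List, Optional, Sequence

Token = Sequence[float]  # one visual token embedding


def select_tokens(
    frame: List[Token],
    prev_frame: Optional[List[Token]],
    saliency: List[float],
    budget: int,
    alpha: float = 0.5,
) -> List[int]:
    """Return the indices of at most `budget` tokens to keep for this frame.

    Hypothetical scoring rule (not taken from the paper):
        score = alpha * change + (1 - alpha) * saliency
    where `change` is the L2 distance to the token at the same position
    in the previous frame, normalized to [0, 1].
    """
    if prev_frame is None:
        # First frame: there is no earlier frame, so saliency alone decides.
        change = [0.0] * len(frame)
    else:
        change = [math.dist(cur, prev) for cur, prev in zip(frame, prev_frame)]

    # Normalize change so it is on a scale comparable with saliency.
    peak = max(change, default=0.0)
    if peak > 0:
        change = [c / peak for c in change]

    scores = [alpha * c + (1 - alpha) * s for c, s in zip(change, saliency)]
    ranked = sorted(range(len(frame)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])  # restore the original token order


# Toy data: four 2-d tokens; token 2 moved a lot, token 0 is highly salient.
prev = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
cur = [[0.0, 0.0], [1.0, 1.0], [5.0, 2.0], [3.0, 3.1]]
sal = [0.9, 0.1, 0.2, 0.1]
kept = select_tokens(cur, prev, sal, budget=2)
```

Because the budget is fixed, the number of tokens passed to prefill per frame does not depend on how many tokens the vision encoder emits, which is what makes the per-frame cost predictable in this sketch.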

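Key Takeaways

Streaming video models face two fundamental constraints: causality and accumulation.
Existing methods focus on regulating the post-LLM kv-cache, whereas StreamingTOM addresses both the pre-LLM and post-LLM bottlenecks.
StreamingTOM uses a two-stage framework, Causal Temporal Reduction followed by Online Quantized Memory, to improve processing efficiency and memory use.
Experiments show that StreamingTOM outperforms the prior state of the art in kv-cache compression ($15.7\times$), peak memory ($1.2\times$ lower), and TTFT ($2\times$ faster).
StreamingTOM maintains state-of-the-art accuracy among training-free methods, making it suitable for practical streaming video understanding.
The two-stage design keeps growth bounded regardless of stream length.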
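
The 4-bit storage and on-demand retrieval idea can likewise be sketched in a few lines. This is an illustrative assumption rather than the paper's method: the class names `QuantizedGroup` and `QuantizedMemory`, the per-group min/scale quantization scheme, the full-precision `key` kept per group, and the dot-product retrieval are all hypothetical choices. Groups are assumed to be non-empty.

```python
from typing import List


class QuantizedGroup:
    """One non-empty group of values stored as packed 4-bit codes."""

    def __init__(self, values: List[float]) -> None:
        self.n = len(values)
        self.lo = min(values)
        span = max(values) - self.lo
        # 4 bits give 16 levels (codes 0..15).
        self.scale = span / 15 if span > 0 else 1.0
        codes = [
            min(15, max(0, round((v - self.lo) / self.scale))) for v in values
        ]
        if len(codes) % 2:
            codes.append(0)  # pad so that codes pair up evenly
        # Pack two 4-bit codes into each byte.
        self.packed = bytes(
            (codes[i] << 4) | codes[i + 1] for i in range(0, len(codes), 2)
        )

    def dequantize(self) -> List[float]:
        codes: List[int] = []
        for byte in self.packed:
            codes.extend((byte >> 4, byte & 0x0F))
        return [self.lo + c * self.scale for c in codes[: self.n]]


class QuantizedMemory:
    """Stores every group in 4-bit form; only retrieved groups are expanded."""

    def __init__(self) -> None:
        self.keys: List[List[float]] = []  # full-precision key per group (assumption)
        self.groups: List[QuantizedGroup] = []

    def add(self, key: List[float], values: List[float]) -> None:
        self.keys.append(key)
        self.groups.append(QuantizedGroup(values))

    def retrieve(self, query: List[float], top_k: int) -> List[List[float]]:
        # Rank groups by dot product with the query and dequantize the best ones.
        scores = [sum(q * k for q, k in zip(query, key)) for key in self.keys]
        best = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.groups[i].dequantize() for i in best[:top_k]]


memory = QuantizedMemory()
memory.add(key=[1.0, 0.0], values=[0.0, 0.5, 1.0, 1.5])
memory.add(key=[0.0, 1.0], values=[-2.0, 0.0, 2.0])
restored = memory.retrieve(query=[0.9, 0.1], top_k=1)[0]
```

In this sketch the dequantized working set is limited to `top_k` groups no matter how many groups have been stored, which mirrors the claim that the active kv-cache stays bounded as the stream grows. The reconstruction error of each value is at most half a quantization step.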

Authors: ZhaoYang Han, Qihan Lin, Hao Liang, Bowen Chen, Zhou Liu, Wentao Zhang

We introduce LongInsightBench, the first benchmark designed to assess models' ability to understand long videos, with a focus on human language, viewpoints, actions, and other contextual elements, while integrating visual, audio, and text modalities. Our benchmark excels in three key areas: a) Long-Duration, Information-Dense Videos: We carefully select approximately 1,000 videos from the open-source dataset FineVideo based on a duration limit and the information density of both visual and audio modalities, focusing on content like lectures, interviews, and vlogs, which contain rich language elements. b) Diverse and Challenging Task Scenarios: We have designed six challenging task scenarios, including both Intra-Event and Inter-Event Tasks. c) Rigorous and Comprehensive Quality Assurance Pipelines: We have developed a three-step, semi-automated data quality assurance pipeline to ensure the difficulty and validity of the synthesized questions and answer options. Based on LongInsightBench, we designed a series of experiments. Experimental results show that Omni-modal models (OLMs) still face challenges in tasks requiring precise temporal localization (T-Loc) and long-range causal inference (CE-Caus). Extended experiments reveal the information loss and processing bias in multi-modal fusion of OLMs. Our dataset and code are available at https://anonymous.4open.science/r/LongInsightBench-910F/.

Paper and project links

PDF Submitted to ARR Rolling Review

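Summary

The paper proposes LongInsightBench, a new benchmark for evaluating video understanding. It assesses models' ability to understand long videos, focusing on human language, viewpoints, actions, and other contextual elements, and it integrates the visual, audio, and text modalities. It has three main highlights: the selected long, information-dense videos are rich in language elements; the task scenarios are diverse and challenging; and a rigorous, comprehensive quality assurance pipeline ensures the difficulty and validity of the synthesized questions and answer options. Experiments on LongInsightBench show that omni-modal models still struggle with tasks that require precise temporal localization and long-range causal inference.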

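Key Takeaways

LongInsightBench is the first benchmark designed to assess models' ability to understand long videos.
It focuses on understanding human language, viewpoints, actions, and other contextual elements in videos.
The benchmark integrates the visual, audio, and text modalities.
The selected videos are mainly long, information-dense content such as lectures, interviews, and vlogs.
It defines six challenging task scenarios, covering both Intra-Event and Inter-Event tasks.
A three-step, semi-automated quality assurance pipeline ensures the difficulty and validity of the synthesized questions and answer options.
Experiments show that omni-modal models (OLMs) still struggle with precise temporal localization (T-Loc) and long-range causal inference (CE-Caus).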

UniVideo: Unified Understanding, Generation, and Editing for Videos

Authors: Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, Wenhu Chen

Unified multimodal models have shown promising results in multimodal content generation and editing but remain largely limited to the image domain. In this work, we present UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design, combining a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design enables accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is jointly trained across them. Extensive experiments demonstrate that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation and in-context video editing. Notably, the unified design of UniVideo enables two forms of generalization. First, UniVideo supports task composition, such as combining editing with style transfer, by integrating multiple capabilities within a single instruction. Second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data to this setting, handling unseen instructions such as green-screening characters or changing materials within a video. Beyond these core capabilities, UniVideo also supports visual-prompt-based video generation, where the MLLM interprets visual prompts and guides the MMDiT during synthesis. To foster future research, we will release our model and code.

Summary

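This paper presents UniVideo, a versatile framework that extends unified modeling to the video domain. UniVideo adopts a dual-stream design that combines a Multimodal Large Language Model (MLLM) for instruction understanding with a Multimodal DiT (MMDiT) for video generation. This design allows accurate interpretation of complex multimodal instructions while preserving visual consistency. Built on this architecture, UniVideo unifies diverse video generation and editing tasks under a single multimodal instruction paradigm and is trained jointly across them. Experiments show that UniVideo matches or surpasses state-of-the-art task-specific baselines in text/image-to-video generation, in-context video generation, and in-context video editing. The unified design enables two forms of generalization: first, task composition, such as combining editing with style transfer in a single instruction; second, even without explicit training on free-form video editing, UniVideo transfers its editing capability from large-scale image editing data and handles unseen instructions such as green-screening characters or changing materials within a video. UniVideo also supports visual-prompt-based video generation, in which the MLLM interprets visual prompts and guides the MMDiT during synthesis.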

Key Takeaways