Published: 2025-09-28

Updated: 2025-11-27

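⚠️ All of the summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are suitable only for screening papers before reading them.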

Updated 2025-09-28

VIR-Bench: Evaluating Geospatial and Temporal Understanding of MLLMs via Travel Video Itinerary Reconstruction

Authors:Hao Wang, Eiki Murata, Lingfang Zhang, Ayako Sato, So Fukuda, Ziqi Yin, Wentao Hu, Keisuke Nakao, Yusuke Nakamura, Sebastian Zwirner, Yi-Chia Chen, Hiroyuki Otomo, Hiroki Ouchi, Daisuke Kawahara

Recent advances in multimodal large language models (MLLMs) have significantly enhanced video understanding capabilities, opening new possibilities for practical applications. Yet current video benchmarks focus largely on indoor scenes or short-range outdoor activities, leaving the challenges associated with long-distance travel largely unexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs, underpinning real-world tasks such as embodied-AI planning and navigation. To bridge this gap, we present VIR-Bench, a novel benchmark consisting of 200 travel videos that frames itinerary reconstruction as a challenging task designed to evaluate and push forward MLLMs’ geospatial-temporal intelligence. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, struggle to achieve high scores, underscoring the difficulty of handling videos that span extended spatial and temporal scales. Moreover, we conduct an in-depth case study in which we develop a prototype travel-planning agent that leverages the insights gained from VIR-Bench. The agent’s markedly improved itinerary recommendations verify that our evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.



Summary
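Recent advances in multimodal large language models (MLLMs) have greatly improved video understanding and opened new avenues for practical applications. However, current video benchmarks focus mainly on indoor scenes or short-range outdoor activities, and the challenges of long-distance travel remain underexplored. Mastering extended geospatial-temporal trajectories is critical for next-generation MLLMs because it underpins real-world tasks such as embodied-AI planning and navigation. To narrow this gap, we present VIR-Bench, a benchmark of 200 travel videos that frames itinerary reconstruction as a challenging task for evaluating MLLMs' geospatial-temporal intelligence. Experiments show that state-of-the-art MLLMs, including proprietary ones, struggle to score highly, which indicates how difficult it is to handle videos that span large spatial and temporal scales. We also conduct an in-depth case study in which we build a prototype travel-planning agent that uses the insights gained from VIR-Bench. The agent's markedly improved itinerary recommendations show that the evaluation protocol not only benchmarks models effectively but also translates into concrete performance gains in user-facing applications.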

Key Takeaways

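Multimodal large language models (MLLMs) have made significant progress in video understanding, creating new opportunities for practical applications.
Current video benchmarks focus mainly on indoor and short-range outdoor scenes and leave long-distance travel largely unstudied.
VIR-Bench consists of 200 travel videos and is designed to evaluate how well MLLMs handle geospatial-temporal trajectories.
State-of-the-art MLLMs perform poorly on VIR-Bench, which shows that videos covering large geographic and temporal scales remain a major challenge.
The VIR-Bench evaluation protocol not only measures model performance effectively but also translates into performance gains in real applications.
A case study shows the improvement achieved by a travel-planning agent built on insights from VIR-Bench.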
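
To make the itinerary reconstruction task more concrete, here is a minimal sketch of how a predicted itinerary could be compared with a reference one. The abstract does not state the scoring protocol, so everything here is an assumption for illustration: the place-level and transition-level F1 scores, the function names (`transitions`, `f1`, `score_itinerary`), and the toy place names are hypothetical and are not taken from VIR-Bench.

```python
def transitions(itinerary):
    """Return the set of directed (from, to) moves between consecutive stops."""
    return set(zip(itinerary, itinerary[1:]))


def f1(predicted, reference):
    """F1 overlap between two sets; two empty sets count as a perfect match."""
    if not predicted and not reference:
        return 1.0
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def score_itinerary(predicted, reference):
    """Score which places were found and whether they were visited in order.

    Hypothetical scheme: place_f1 ignores order, transition_f1 rewards
    correct consecutive moves, so it is sensitive to the visiting order.
    """
    return {
        "place_f1": f1(set(predicted), set(reference)),
        "transition_f1": f1(transitions(predicted), transitions(reference)),
    }


if __name__ == "__main__":
    # Toy data, not from the benchmark.
    reference = ["Tokyo", "Kyoto", "Osaka", "Hiroshima"]
    swapped = ["Tokyo", "Osaka", "Kyoto", "Hiroshima"]
    print(score_itinerary(swapped, reference))
```

In this toy case the swapped prediction names every place correctly but gets no consecutive move right, which separates recognizing locations from recovering their order over time.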


SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model

Authors:Guankun Wang, Junyi Wang, Wenjin Mo, Long Bai, Kun Yuan, Ming Hu, Jinlin Wu, Junjun He, Yiming Huang, Nicolas Padoy, Zhen Lei, Hongbin Liu, Nassir Navab, Hongliang Ren

Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, facilitating surgeons to understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train our SurgVidLM, we construct the SVU-31K that is a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop the Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.



Summary
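Surgical scene understanding is critical for surgical training and for decision-making in robot-assisted surgery. Recent multimodal large language models (MLLMs) have shown great potential for scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are mainly oriented towards image-based analysis or global video understanding and overlook fine-grained video reasoning, which is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To address this, we propose SurgVidLM, the first video language model that handles both full and fine-grained surgical video understanding. We construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, to support both holistic understanding and detailed analysis of surgical procedures. SurgVidLM uses a two-stage StageFocus mechanism: the first stage extracts global procedural context, and the second stage performs high-frequency local analysis guided by temporal cues. We also develop Multi-frequency Fusion Attention, which integrates low- and high-frequency visual tokens so that critical task-specific details are preserved. Experimental results show that SurgVidLM outperforms state-of-the-art Vid-LLMs of comparable parameter scale on both full and fine-grained video understanding tasks. Our code and dataset will be made public soon.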

Key Takeaways

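Surgical scene understanding is critical for surgical training and robot-assisted surgery.
Multimodal large language models (MLLMs) show strong potential in the medical domain, particularly for surgical scene understanding.
Current methods focus mainly on image-based analysis or global video understanding and overlook fine-grained video reasoning.
SurgVidLM is proposed to address both full and fine-grained surgical video understanding.
The SVU-31K dataset is built to support both holistic understanding and detailed analysis of surgical procedures.
SurgVidLM processes video with a two-stage mechanism and introduces Multi-frequency Fusion Attention to improve performance.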
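
The two-stage idea described above (sparse global context first, dense local analysis second) can be illustrated with a small frame-sampling sketch. This is an assumption-laden illustration, not the authors' StageFocus implementation: the function names (`uniform_indices`, `two_stage_sampling`), the sampling rates, and the way the temporal window is passed in are hypothetical, and Multi-frequency Fusion Attention is not modeled at all.

```python
def uniform_indices(start_s, end_s, fps, rate):
    """Frame indices sampled at `rate` frames per second from [start_s, end_s).

    `fps` is the native frame rate of the video; `rate` must not exceed it.
    """
    if fps <= 0 or rate <= 0 or rate > fps:
        raise ValueError("require 0 < rate <= fps")
    step = fps / rate          # native frames between two sampled frames
    first = start_s * fps
    last = end_s * fps
    indices = []
    i = 0
    while True:
        index = int(first + i * step)
        if index >= last:
            break
        indices.append(index)
        i += 1
    return indices


def two_stage_sampling(duration_s, fps, window, low_rate, high_rate):
    """Return (global_indices, local_indices) for a two-stage pass.

    Stage 1 samples the whole video sparsely for procedural context.
    Stage 2 samples only the (start_s, end_s) `window` densely, standing in
    for a local analysis guided by a temporal cue.
    """
    global_indices = uniform_indices(0, duration_s, fps, low_rate)
    local_indices = uniform_indices(window[0], window[1], fps, high_rate)
    return global_indices, local_indices


if __name__ == "__main__":
    # Hypothetical 60 s clip at 30 fps with a cue pointing at seconds 20-24.
    global_idx, local_idx = two_stage_sampling(60, 30, (20, 24), 0.5, 5)
    print(len(global_idx), global_idx[:3])   # 30 [0, 60, 120]
    print(len(local_idx), local_idx[:3])     # 20 [600, 606, 612]
```

With these assumed rates, the 4-second window contributes 20 frames while the full 60-second clip contributes only 30, so the dense pass stays cheap because it is restricted to the span the temporal cue selects.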


STORM: Token-Efficient Long Video Understanding for Multimodal LLMs

Authors:Jindong Jiang, Xiuyu Li, Zhijian Liu, Muyang Li, Guo Chen, Zhiqi Li, De-An Huang, Guilin Liu, Zhiding Yu, Kurt Keutzer, Sungjin Ahn, Jan Kautz, Hongxu Yin, Yao Lu, Song Han, Wonmin Byeon

Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing the computation costs by up to $8\times$ and the decoding latency by 2.4-2.9$\times$ for the fixed numbers of input frames. Project page is available at https://research.nvidia.com/labs/lpr/storm



Summary

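This paper introduces STORM (Spatiotemporal Token Reduction for Multimodal LLMs), a new technique for video understanding. It inserts a dedicated temporal encoder between the image encoder and the large language model (LLM) and integrates temporal information into the image tokens, producing enriched representations that improve video understanding. Beyond strengthening video reasoning, this enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, which substantially reduce the LLM's computational demands without losing key temporal information. By combining these techniques, the method achieves efficient and robust video understanding over extended temporal contexts and reaches state-of-the-art results on multiple long video understanding benchmarks.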
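
The token reduction strategies named above can be sketched in plain Python to show how they shrink the sequence passed to the LLM. This is a minimal illustration rather than the authors' implementation: it omits the Mamba-based temporal encoder that STORM applies before reduction, and the function names (`temporal_avg_pool`, `temporal_subsample`, `count_tokens`), the nested-list token layout, and the window and stride values are assumptions made for the example.

```python
from statistics import fmean


def temporal_avg_pool(frames, window):
    """Average each group of `window` consecutive frames, token by token.

    `frames` is a list of T frames, each frame a list of N tokens, each
    token a list of D floats. Returns ceil(T / window) pooled frames.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    pooled = []
    for start in range(0, len(frames), window):
        group = frames[start:start + window]
        pooled_frame = [
            # zip(*same_position) pairs up each feature dimension across frames
            [fmean(values) for values in zip(*same_position)]
            for same_position in zip(*group)
        ]
        pooled.append(pooled_frame)
    return pooled


def temporal_subsample(frames, stride):
    """Test-time sampling: keep the tokens of every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return frames[::stride]


def count_tokens(frames):
    """Total number of visual tokens that would be passed to the LLM."""
    return sum(len(frame) for frame in frames)


if __name__ == "__main__":
    # Toy input: 8 frames, 4 tokens per frame, 2 features per token.
    frames = [[[float(t), float(n)] for n in range(4)] for t in range(8)]
    pooled = temporal_avg_pool(frames, window=4)
    print(count_tokens(frames), count_tokens(pooled))   # 32 8
    print(pooled[0][0])                                  # [1.5, 0.0]
    print(count_tokens(temporal_subsample(frames, 2)))   # 16
```

Averaging raw per-frame tokens like this would blur motion. The abstract's point is that the temporal encoder first mixes information across frames, which is what allows reduction without sacrificing key temporal information.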

Key Takeaways