Published: 2025-06-08

Updated: 2025-07-06


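⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic contexts. They are intended only for initial screening before reading the papers.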

Updated 2025-06-08

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos

Authors: Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, Fahad Khan

Mathematical reasoning in real-world video settings presents a fundamentally different challenge than in static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues, often dispersed non-linearly over time. In such multimodal contexts, success hinges not just on perception, but on selectively identifying and integrating the right contextual details from a rich and noisy stream of content. To this end, we introduce VideoMathQA, a benchmark designed to evaluate whether models can perform such temporally extended cross-modal reasoning on videos. The benchmark spans 10 diverse mathematical domains, covering videos ranging from 10 seconds to over 1 hour. It requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across visual, audio, and textual modalities. We employ graduate-level experts to ensure high quality, totaling over $920$ man-hours of annotation. To reflect real-world scenarios, questions are designed around three core reasoning challenges: direct problem solving, where answers are grounded in the presented question; conceptual transfer, which requires applying learned methods to new problems; and deep instructional comprehension, involving multi-step reasoning over extended explanations and partially worked-out solutions. Each question includes multi-step reasoning annotations, enabling fine-grained diagnosis of model capabilities. Through this benchmark, we highlight the limitations of existing approaches and establish a systematic evaluation framework for models that must reason, rather than merely perceive, across temporally extended and modality-rich mathematical problem settings. Our benchmark and evaluation code are available at: https://mbzuai-oryx.github.io/VideoMathQA


Paper and project links

PDF VideoMathQA Technical Report

Summary

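Mathematical reasoning over real-world video differs fundamentally from reasoning over static images or text. It requires interpreting fine-grained visual information, accurately reading handwritten or digital text, and integrating spoken cues that are often scattered non-linearly over time. In this multimodal setting, success depends not only on perception but also on selectively identifying and integrating the right contextual details from a rich, noisy content stream. To this end, the authors introduce VideoMathQA, a benchmark that evaluates whether models can perform this kind of temporally extended cross-modal reasoning on video. It covers 10 mathematical domains, with videos ranging from 10 seconds to over 1 hour, and requires models to interpret structured visual content, understand instructional narratives, and jointly ground concepts across the visual, audio, and textual modalities. Graduate-level experts annotated the data, investing more than 920 person-hours in total. Questions are built around three core reasoning challenges: direct problem solving, conceptual transfer, and deep instructional comprehension, the last of which involves multi-step reasoning over extended explanations and partially worked-out solutions. The benchmark highlights the limitations of existing approaches and establishes a systematic evaluation framework for models that must reason across temporally extended, modality-rich mathematical problem settings.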

Key Takeaways

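VideoMathQA is a video benchmark that evaluates models' mathematical reasoning in real-world settings.
It covers 10 mathematical domains, with videos ranging from 10 seconds to over 1 hour.
Models must interpret structured visual content, understand instructional narratives, and jointly ground concepts across modalities.
Questions are built around three core reasoning challenges: direct problem solving, conceptual transfer, and deep instructional comprehension.
Graduate-level experts annotated the benchmark, investing more than 920 person-hours in total.
The benchmark reflects real-world complexity and exposes the limitations of existing approaches.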
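
Because the questions fall into three reasoning categories, results are most informative when broken down per category rather than reported as one number. The following is a minimal sketch of such a breakdown. It assumes a hypothetical record layout (`category`, `prediction`, `answer`) and exact-match scoring; neither is taken from the benchmark's released evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} using exact-match scoring.

    The record fields used here are hypothetical, chosen for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        cat = rec["category"]
        total[cat] += 1
        if rec["prediction"] == rec["answer"]:
            correct[cat] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Toy records, one dict per question (made-up values, not benchmark data).
records = [
    {"category": "direct_problem_solving", "prediction": "B", "answer": "B"},
    {"category": "direct_problem_solving", "prediction": "A", "answer": "C"},
    {"category": "conceptual_transfer", "prediction": "D", "answer": "D"},
    {"category": "deep_instructional_comprehension", "prediction": "A", "answer": "B"},
]
scores = accuracy_by_category(records)
```

A per-category view like this makes it visible when a model handles direct problem solving but fails on questions that require following a long explanation.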


Unleashing Hour-Scale Video Training for Long Video-Language Understanding

Authors: Jingyang Lin, Jialian Wu, Ximeng Sun, Ze Wang, Jiang Liu, Yusheng Su, Xiaodong Yu, Hao Chen, Jiebo Luo, Zicheng Liu, Emad Barsoum

Recent long-form video-language understanding benchmarks have driven progress in video large multimodal models (Video-LMMs). However, the scarcity of well-annotated long videos has left the training of hour-long Video-LLMs underexplored. To close this gap, we present VideoMarathon, a large-scale hour-long video instruction-following dataset. This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video. Specifically, it contains 3.3M high-quality QA pairs, spanning six fundamental topics: temporality, spatiality, object, action, scene, and event. Compared to existing video instruction datasets, VideoMarathon significantly extends training video durations up to 1 hour, and supports 22 diverse tasks requiring both short- and long-term video comprehension. Building on VideoMarathon, we propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling. It enables hour-long video training and inference at 1-FPS sampling by leveraging a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporal-informative semantics from a cached full video context. In our experiments, Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, demonstrating the high quality of the VideoMarathon dataset and the superiority of the Hour-LLaVA model.


Summary

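This paper introduces VideoMarathon, a large-scale hour-long video instruction-following dataset for long video-language understanding. It contains around 9,700 hours of long videos, each 3 to 60 minutes long, along with 3.3M QA pairs spanning six fundamental topics, and supports 22 diverse tasks. Building on this dataset, the paper also proposes Hour-LLaVA, a Video-LMM that handles hour-long video training and inference at 1-FPS sampling through a memory augmentation module, which adaptively integrates user question-relevant and spatiotemporally informative semantics from a cached full video context. Experiments show that Hour-LLaVA achieves the best performance on multiple long video-language benchmarks, which the authors take as evidence of the quality of VideoMarathon and the strength of Hour-LLaVA.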

Key Takeaways

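VideoMarathon is a large-scale long-video instruction-following dataset that addresses the shortage of training data for hour-long video-language understanding.
The dataset includes videos up to one hour long, covers six fundamental topics, and supports 22 diverse tasks.
Hour-LLaVA is a Video-LMM designed for hour-scale video-language modeling.
Hour-LLaVA uses a memory augmentation module to make hour-long video training and inference feasible.
The module adaptively integrates user question-relevant and spatiotemporally informative semantics from a cached full video context.
Experiments support the quality of the VideoMarathon dataset and the strength of the Hour-LLaVA model.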
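
To give a sense of scale for the 1-FPS sampling mentioned in the abstract, the sketch below computes which frames of a video would be kept when sampling one frame per second. The function name and parameters are hypothetical, and this covers only the sampling arithmetic, not Hour-LLaVA's memory augmentation module.

```python
def sample_frame_indices(num_frames, native_fps, target_fps=1.0):
    """Return indices of frames kept when resampling to target_fps.

    Hypothetical helper for illustration; not taken from the Hour-LLaVA code.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    step = native_fps / target_fps  # native frames per sampled frame
    indices = []
    t = 0
    while True:
        idx = int(round(t * step))
        if idx >= num_frames:
            break
        indices.append(idx)
        t += 1
    return indices


# A one-hour video recorded at 30 FPS has 60 * 60 * 30 native frames.
hour_video = sample_frame_indices(num_frames=60 * 60 * 30, native_fps=30.0)
short_clip = sample_frame_indices(num_frames=300, native_fps=30.0)
```

Even at 1 FPS, an hour of video yields 3,600 frames, each of which expands into many visual tokens, which is why the full video context is cached and queried selectively rather than fed to the language model in one pass.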


APVR: Hour-Level Long Video Understanding with Adaptive Pivot Visual Information Retrieval

Authors: Hong Gao, Yiming Bao, Xuezhan Tu, Bin Zhong, Minling Zhang

Current video-based multimodal large language models struggle with hour-level video understanding due to computational constraints and inefficient information extraction from extensive temporal sequences. We propose APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that addresses the memory wall limitation through hierarchical visual information retrieval. APVR operates via two complementary components: Pivot Frame Retrieval employs semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware attention-driven token selection within the pivot frames. This dual granularity approach enables processing of hour-long videos while maintaining semantic fidelity. Experimental validation on LongVideoBench and VideoMME demonstrates significant performance improvements, establishing state-of-the-art results for not only training-free but also training-based approaches while providing plug-and-play integration capability with existing MLLM architectures.


Paper and project links

PDF

Summary

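This paper proposes APVR (Adaptive Pivot Visual information Retrieval), a training-free framework that targets the difficulty current video multimodal large language models have with hour-level video understanding, which stems from computational constraints and inefficient information extraction from long temporal sequences. APVR addresses the memory wall limitation through hierarchical visual information retrieval and operates via two complementary components: Pivot Frame Retrieval uses semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames, while Pivot Token Retrieval performs query-aware, attention-driven token selection within the pivot frames. This dual-granularity approach makes it possible to process hour-long videos while maintaining semantic fidelity. Experiments on LongVideoBench and VideoMME show significant performance improvements, with state-of-the-art results among not only training-free but also training-based approaches, and the framework offers plug-and-play integration with existing MLLM architectures.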

Key Takeaways

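Current video multimodal large language models struggle with hour-level video understanding, mainly because of computational constraints and inefficient information extraction.
APVR is a training-free framework that addresses this through hierarchical visual information retrieval.
APVR has two complementary components: Pivot Frame Retrieval and Pivot Token Retrieval.
Pivot Frame Retrieval uses semantic expansion and multi-modal confidence scoring to identify semantically relevant video frames.
Pivot Token Retrieval performs query-aware, attention-driven token selection within the pivot frames.
This dual-granularity approach makes it possible to process hour-long videos while maintaining semantic fidelity.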
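
The two-level structure (pick frames first, then pick tokens inside those frames) can be illustrated with a small sketch. This is not APVR's method: plain cosine similarity stands in for both the semantic expansion with multi-modal confidence scoring and the attention-driven token scoring, and all function names and toy vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(ranked)


def retrieve(query, frame_embs, frame_tokens, k_frames, k_tokens):
    """Two-stage retrieval: pivot frames first, then tokens within them."""
    # Stage 1 (coarse): keep the frames most similar to the query.
    frame_scores = [cosine(query, f) for f in frame_embs]
    pivot_frames = top_k(frame_scores, k_frames)
    # Stage 2 (fine): inside each pivot frame, keep the most relevant tokens.
    selected = {}
    for fi in pivot_frames:
        token_scores = [cosine(query, t) for t in frame_tokens[fi]]
        selected[fi] = top_k(token_scores, k_tokens)
    return selected


# Toy 2-D embeddings: 4 frames with 3 tokens each (made-up values).
query = [1.0, 0.0]
frame_embs = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]
frame_tokens = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]],
    [[0.0, 1.0], [0.8, 0.2], [1.0, 0.1]],
]
selected = retrieve(query, frame_embs, frame_tokens, k_frames=2, k_tokens=2)
```

The point of the two stages is budget control: the visual tokens passed to the language model scale with `k_frames * k_tokens` rather than with the length of the video.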


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-06-08/%E8%A7%86%E9%A2%91%E7%90%86%E8%A7%A3/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Video Understanding
