Published: 2025-09-13

Updated: 2025-10-07

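⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.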

2025-09-13 Update

DATE: Dynamic Absolute Time Enhancement for Long Video Understanding

Authors: Chao Yuan, Yang Yang, Yehui Yang, Zach Cheng

Long video understanding remains a fundamental challenge for multimodal large language models (MLLMs), particularly in tasks requiring precise temporal reasoning and event localization. Existing approaches typically adopt uniform frame sampling and rely on implicit position encodings to model temporal order. However, these methods struggle with long-range dependencies, leading to critical information loss and degraded temporal comprehension. In this paper, we propose Dynamic Absolute Time Enhancement (DATE) that enhances temporal awareness in MLLMs through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. Specifically, we interleave video frame embeddings with textual timestamp tokens to construct a continuous temporal reference system. We further reformulate the video sampling problem as a vision-language retrieval task and introduce a two-stage algorithm to ensure both semantic relevance and temporal coverage: enriching each query into a descriptive caption to better align with the vision features, and sampling key events with a similarity-driven temporally regularized greedy strategy. Our method achieves remarkable improvements w.r.t. absolute time understanding and key event localization, resulting in state-of-the-art performance among 7B and 72B models on hour-long video benchmarks. Particularly, our 7B model even exceeds many 72B models on some benchmarks.
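To make the timestamp interleaving idea concrete, here is a minimal sketch. It is not the authors' implementation: the `<time HH:MM:SS>` token format, the choice to place each timestamp before its frame, and the string placeholders standing in for frame embeddings are all assumptions made for illustration.

```python
from typing import List, Sequence


def format_timestamp(seconds: float) -> str:
    """Render a time offset as HH:MM:SS text (this format is an assumption)."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def interleave_timestamps(
    frame_times: Sequence[float], frame_tokens: Sequence[str]
) -> List[str]:
    """Build a sequence in which every frame placeholder is preceded by a
    textual timestamp token, so absolute time is explicit in the input."""
    if len(frame_times) != len(frame_tokens):
        raise ValueError("frame_times and frame_tokens must have equal length")
    sequence: List[str] = []
    for t, token in zip(frame_times, frame_tokens):
        sequence.append(f"<time {format_timestamp(t)}>")  # hypothetical token
        sequence.append(token)  # stands in for the frame's visual embedding
    return sequence


if __name__ == "__main__":
    print(interleave_timestamps([0, 65.4, 3600], ["<f0>", "<f1>", "<f2>"]))
```

In a real model the frame placeholders would be visual embeddings and the timestamp strings would be tokenized as ordinary text; the sketch only shows the ordering of the two streams.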

Paper and project links

PDF

Summary

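This paper proposes Dynamic Absolute Time Enhancement (DATE), a method that improves the temporal awareness of multimodal large language models (MLLMs) through the Timestamp Injection Mechanism (TIM) and a semantically guided Temporal-Aware Similarity Sampling (TASS) strategy. The method interleaves video frame embeddings with textual timestamp tokens to build a continuous temporal reference system. It also reformulates video sampling as a vision-language retrieval task and uses a two-stage algorithm to ensure both semantic relevance and temporal coverage: each query is first enriched into a descriptive caption, and key events are then sampled with a greedy strategy. The method achieves significant improvements in absolute time understanding and key event localization, and reaches state-of-the-art performance among 7B and 72B models on hour-long video benchmarks.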
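The greedy sampling stage can be sketched as follows. This is an illustrative guess at what a similarity-driven, temporally regularized greedy selection might look like, not the paper's algorithm: the minimum-gap rule used as the temporal regularizer, the function name `greedy_temporal_sampling`, and the parameters `k` and `min_gap` are all assumptions. The similarity scores are taken as given; the caption-enrichment stage and the vision-language scoring model are out of scope here.

```python
from typing import List, Sequence


def greedy_temporal_sampling(
    scores: Sequence[float],
    times: Sequence[float],
    k: int,
    min_gap: float,
) -> List[int]:
    """Pick up to k frame indices, preferring high query-frame similarity
    while keeping every pair of picks at least min_gap seconds apart."""
    if len(scores) != len(times):
        raise ValueError("scores and times must have equal length")
    # Visit candidates from most to least similar to the query.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    selected: List[int] = []
    for i in order:
        if len(selected) == k:
            break
        # Assumed temporal regularizer: reject frames too close to earlier picks.
        if all(abs(times[i] - times[j]) >= min_gap for j in selected):
            selected.append(i)
    # Return the picks in chronological order for feeding to the model.
    return sorted(selected, key=lambda i: times[i])


if __name__ == "__main__":
    scores = [0.9, 0.85, 0.2, 0.7, 0.1]
    times = [0.0, 1.0, 10.0, 20.0, 30.0]
    print(greedy_temporal_sampling(scores, times, k=3, min_gap=5.0))
```

In this example the second-highest frame (index 1) is skipped because it lies one second from the top pick, which shows how a gap constraint trades a little similarity for broader temporal coverage.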

Key Takeaways