Published: 2025-11-06

Last updated: 2025-11-27

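⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Please note: do not use them in serious academic settings. They are suitable only for initial screening before reading a paper.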

Updated 2025-11-06

CueBench: Advancing Unified Understanding of Context-Aware Video Anomalies in Real-World

Authors: Yating Yu, Congqi Cao, Zhaoying Wang, Weihua Meng, Jie Li, Yuxin Li, Zihao Wei, Zhongpei Shen, Jiajun Zhang

How far are deep models from real-world video anomaly understanding (VAU)? Current works typically emphasize detecting unexpected occurrences that deviate from normal patterns or comprehending anomalous events with interpretable descriptions. However, they exhibit only a superficial comprehension of real-world anomalies, with limited breadth in complex principles and subtle context that distinguish the anomalies from normalities, e.g., climbing cliffs with safety gear vs. without it. To this end, we introduce CueBench, the first of its kind Benchmark, devoted to Context-aware video anomalies within a Unified Evaluation framework. We comprehensively establish an event-centric hierarchical taxonomy that anchors two core event types: 14 conditional and 18 absolute anomaly events, defined by their refined semantics from diverse contexts across 174 scenes and 198 attributes. Based on this, we propose to unify and benchmark context-aware VAU with various challenging tasks across recognition, temporal grounding, detection, and anticipation. This also serves as a rigorous and fair probing evaluation suite for generative-discriminative as well as generalized-specialized vision-language models (VLMs). To address the challenges underlying CueBench, we further develop Cue-R1 based on R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench reveal that existing VLMs are still far from satisfactory real-world anomaly understanding, while our Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.

Paper and project links

PDF

Summary

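This paper examines how far deep models remain from real-world video anomaly understanding (VAU). Existing work mainly focuses on detecting unexpected occurrences that deviate from normal patterns or on describing anomalous events in interpretable terms. However, such methods grasp real-world anomalies only superficially and lack a deeper understanding of the complex principles and subtle context that separate anomalies from normal events, for example climbing a cliff with safety gear versus without it. To address this, the authors introduce CueBench, a benchmark devoted to context-aware video anomalies within a unified evaluation framework. It establishes an event-centric hierarchical taxonomy anchored on two core event types: 14 conditional and 18 absolute anomaly events, whose semantics are refined by context drawn from 174 scenes and 198 attributes. On this basis, the benchmark unifies and evaluates context-aware VAU across recognition, temporal grounding, detection, and anticipation tasks. It also serves as a rigorous and fair evaluation suite for generative and discriminative, as well as generalized and specialized, vision-language models (VLMs). To tackle the challenges posed by CueBench, the authors further develop Cue-R1, which applies R1-style reinforcement fine-tuning with verifiable, task-aligned, and hierarchy-refined rewards in a unified generative manner. Extensive results on CueBench show that existing VLMs are still far from satisfactory at real-world anomaly understanding, while Cue-R1 surpasses these state-of-the-art approaches by over 24% on average.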
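The abstract does not spell out how the "verifiable, task-aligned, and hierarchy-refined" rewards are computed, so the following is only an illustrative sketch of what rewards of that kind could look like, not the Cue-R1 implementation. The event names, the two-level `TAXONOMY` mapping, the `partial` credit value, and both function names are assumptions made up for this example. Temporal IoU is used because it is a common score for temporal grounding, not because the source confirms it.

```python
from typing import Dict, Tuple

# Hypothetical two-level taxonomy: fine-grained event -> parent event type.
# The event names are illustrative only and are not taken from CueBench.
TAXONOMY: Dict[str, str] = {
    "climbing_without_safety_gear": "conditional",
    "swimming_in_restricted_area": "conditional",
    "explosion": "absolute",
    "car_crash": "absolute",
}


def recognition_reward(pred: str, target: str, partial: float = 0.5) -> float:
    """Assumed hierarchy-aware reward: full credit for the exact event,
    partial credit when only the parent event type matches, else zero."""
    if pred == target:
        return 1.0
    parent_pred = TAXONOMY.get(pred)
    if parent_pred is not None and parent_pred == TAXONOMY.get(target):
        return partial
    return 0.0


def temporal_iou(pred: Tuple[float, float], target: Tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds.
    Assumes start <= end for both intervals."""
    inter = max(0.0, min(pred[1], target[1]) - max(pred[0], target[0]))
    union = (pred[1] - pred[0]) + (target[1] - target[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(recognition_reward("explosion", "car_crash"))
    print(temporal_iou((2.0, 6.0), (4.0, 8.0)))
```

Both functions are verifiable in the sense that they score a model output against ground truth with a fixed rule rather than a learned judge. How the actual method weights or combines such terms is not stated in the abstract.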

Key insights