⚠️ All of the summaries below are generated by a large language model and may contain errors. They are provided for reference only; use with caution.
🔴 Note: never rely on these summaries for serious academic work. They are intended only as a first-pass screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-24
ChronoForge-RL: Chronological Forging through Reinforcement Learning for Enhanced Video Understanding
Authors:Kehua Chen
Current state-of-the-art video understanding methods typically struggle with two critical challenges: (1) the computational infeasibility of processing every frame in dense video content and (2) the difficulty in identifying semantically significant frames through naive uniform sampling strategies. In this paper, we propose a novel video understanding framework, called ChronoForge-RL, which combines Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO) to tackle these issues. Concretely, we introduce a differentiable keyframe selection mechanism that systematically identifies semantic inflection points through a three-stage process to enhance computational efficiency while preserving temporal information. Then, two particular modules are proposed to enable effective temporal reasoning: Firstly, TAD leverages variation scoring, inflection detection, and prioritized distillation to select the most informative frames. Secondly, we introduce KF-GRPO which implements a contrastive learning paradigm with a saliency-enhanced reward mechanism that explicitly incentivizes models to leverage both frame content and temporal relationships. Finally, our proposed ChronoForge-RL achieves 69.1% on VideoMME and 52.7% on LVBench compared to baseline methods, clearly surpassing previous approaches while enabling our 7B parameter model to achieve performance comparable to 72B parameter alternatives.
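This digest contains no code; as a rough, hypothetical illustration of the three-stage idea named above (variation scoring, inflection detection, prioritized distillation), the sketch below scores frames by inter-frame change and keeps the top-k inflection points. The function names and the scoring heuristic are assumptions, not the authors' implementation, and the paper's selector is differentiable and learned rather than hand-crafted.

```python
import numpy as np

def select_keyframes(frames: np.ndarray, k: int = 16) -> np.ndarray:
    """Hypothetical sketch of a TAD-like three-stage keyframe selector.

    frames: (T, H, W, C) array of decoded video frames.
    Returns the indices of k selected frames, sorted chronologically.
    NOTE: an illustrative heuristic only, not the paper's method.
    """
    # Stage 1: variation scoring -- mean absolute difference between
    # consecutive frames as a crude measure of temporal change.
    flat = frames.reshape(len(frames), -1).astype(np.float32)
    variation = np.abs(np.diff(flat, axis=0)).mean(axis=1)  # (T-1,)
    variation = np.concatenate([[0.0], variation])          # pad to length T

    # Stage 2: inflection detection -- local maxima of the variation
    # curve, i.e. frames where the rate of change peaks.
    prev, nxt = np.roll(variation, 1), np.roll(variation, -1)
    candidates = np.where((variation > prev) & (variation > nxt))[0]

    # Stage 3: prioritized distillation -- keep the top-k candidates by
    # score; fall back to uniform sampling if too few inflections exist.
    if len(candidates) >= k:
        keep = candidates[np.argsort(variation[candidates])[-k:]]
    else:
        keep = np.linspace(0, len(frames) - 1, k).astype(int)
    return np.sort(keep)
```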
Paper and project links
PDF 10 pages, 2 figures
Summary
This paper proposes ChronoForge-RL, a novel video understanding framework that tackles the challenges above. By combining two modules, Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO), the framework improves both computational efficiency and video understanding accuracy. Specifically, a differentiable keyframe selection mechanism systematically identifies semantic inflection points through a three-stage process; TAD then uses variation scoring, inflection detection, and prioritized distillation to select the most informative frames, while KF-GRPO introduces a contrastive learning paradigm with a saliency-enhanced reward mechanism that incentivizes the model to exploit both frame content and temporal relationships. ChronoForge-RL reaches 69.1% on VideoMME and 52.7% on LVBench, surpassing baseline methods and allowing a 7B-parameter model to approach the performance of 72B-parameter alternatives.
Key Takeaways
- Current video understanding methods face two challenges: the computational infeasibility of processing every frame of dense video, and the difficulty of identifying semantically important frames with naive uniform sampling.
- The ChronoForge-RL framework combines two modules, Temporal Apex Distillation (TAD) and KeyFrame-aware Group Relative Policy Optimization (KF-GRPO), to address these problems.
- A differentiable keyframe selection mechanism systematically identifies semantic inflection points, improving computational efficiency while preserving temporal information.
- The TAD module uses variation scoring, inflection detection, and prioritized distillation to select the most informative frames.
- The KF-GRPO module introduces a contrastive learning paradigm with a saliency-enhanced reward that incentivizes the model to exploit both frame content and temporal relationships (see the sketch after this list).
- ChronoForge-RL performs strongly on VideoMME and LVBench, surpassing baseline methods and matching the performance of far larger models.
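The digest does not spell out the saliency-enhanced reward, so the following is a minimal sketch of a group-relative advantage with an assumed saliency bonus; the bonus term, its weight beta, and the toy numbers are hypothetical.

```python
import numpy as np

def kf_grpo_advantages(task_rewards: np.ndarray,
                       saliency_scores: np.ndarray,
                       beta: float = 0.1) -> np.ndarray:
    """Hypothetical sketch of a saliency-enhanced, group-relative advantage.

    task_rewards:    (G,) scalar rewards for G sampled answers to one prompt.
    saliency_scores: (G,) assumed bonus for grounding an answer in the
                     selected keyframes (the paper's exact signal is not
                     described in this digest).
    """
    rewards = task_rewards + beta * saliency_scores
    # GRPO-style training normalizes each reward against its own sampling
    # group, removing the need for a learned value function.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy example: four sampled answers to the same video question.
adv = kf_grpo_advantages(np.array([1.0, 0.0, 1.0, 0.0]),
                         np.array([0.8, 0.1, 0.2, 0.0]))
print(adv)  # correct, keyframe-grounded answers get the largest advantage
```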
Click here to view paper screenshots



Data-Efficient Learning for Generalizable Surgical Video Understanding
Authors:Sahar Nasirihaghighi
Advances in surgical video analysis are transforming operating rooms into intelligent, data-driven environments. Computer-assisted systems support full surgical workflow, from preoperative planning to intraoperative guidance and postoperative assessment. However, developing robust and generalizable models for surgical video understanding remains challenging due to (I) annotation scarcity, (II) spatiotemporal complexity, and (III) domain gap across procedures and institutions. This doctoral research aims to bridge the gap between deep learning-based surgical video analysis in research and its real-world clinical deployment. To address the core challenge of recognizing surgical phases, actions, and events, critical for analysis, I benchmarked state-of-the-art neural network architectures to identify the most effective designs for each task. I further improved performance by proposing novel architectures and integrating advanced modules. Given the high cost of expert annotations and the domain gap across surgical video sources, I focused on reducing reliance on labeled data. We developed semi-supervised frameworks that improve model performance across tasks by leveraging large amounts of unlabeled surgical video. We introduced novel semi-supervised frameworks, including DIST, SemiVT-Surge, and ENCORE, that achieved state-of-the-art results on challenging surgical datasets by leveraging minimal labeled data and enhancing model training through dynamic pseudo-labeling. To support reproducibility and advance the field, we released two multi-task datasets: GynSurg, the largest gynecologic laparoscopy dataset, and Cataract-1K, the largest cataract surgery video dataset. Together, this work contributes to robust, data-efficient, and clinically scalable solutions for surgical video analysis, laying the foundation for generalizable AI systems that can meaningfully impact surgical care and training.
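The abstract names DIST, SemiVT-Surge, and ENCORE without detailing them, so the sketch below only illustrates the generic dynamic pseudo-labeling idea they build on: train on the model's own confident predictions for unlabeled clips, with a confidence threshold that relaxes over training. The schedule, model interface, and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, unlabeled_clips: torch.Tensor,
                      step: int, total_steps: int) -> torch.Tensor:
    """Hypothetical sketch of dynamic pseudo-labeling on unlabeled video.

    unlabeled_clips: (B, T, C, H, W) batch of unlabeled surgical clips;
    model maps it to (B, num_classes) logits (e.g. surgical phases).
    """
    # Assumed schedule: start strict (0.95), relax toward 0.70 so that
    # more pseudo-labels are admitted as the model improves.
    threshold = 0.95 - 0.25 * (step / total_steps)

    # Generate pseudo-labels from the model's own confident predictions.
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_clips), dim=-1)
        confidence, pseudo = probs.max(dim=-1)
        mask = (confidence >= threshold).float()

    # Second forward pass with gradients; unconfident clips are masked out.
    logits = model(unlabeled_clips)
    per_clip = F.cross_entropy(logits, pseudo, reduction="none")
    return (per_clip * mask).sum() / mask.sum().clamp(min=1.0)
```

In practice this unsupervised term would be added, with a weight, to the supervised loss on the small labeled set.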
Paper and project links
Summary
This dissertation surveys advances in surgical video analysis and the challenges facing the field. Computer-assisted systems support the full surgical workflow, from preoperative planning through intraoperative guidance to postoperative assessment, yet a gap remains between deep-learning research on surgical video and real-world clinical deployment. To address the core recognition challenges, the author benchmarks state-of-the-art neural network architectures, proposes novel architectures, and integrates advanced modules. To reduce reliance on costly expert annotations and bridge the domain gap across surgical video sources, semi-supervised frameworks such as DIST, SemiVT-Surge, and ENCORE were developed. Two multi-task datasets were also released to support reproducibility and advance the field. Overall, the work contributes robust, data-efficient, and clinically scalable solutions for surgical video analysis, laying the foundation for generalizable AI systems for surgical care and training.
Key Takeaways
- Surgical video analysis is advancing rapidly, turning operating rooms into intelligent, data-driven environments.
- Computer-assisted systems cover the full surgical workflow, from the preoperative to the postoperative stage.
- Building robust and generalizable models faces three challenges: annotation scarcity, spatiotemporal complexity, and the domain gap across procedures and institutions.
- The author benchmarks state-of-the-art neural network architectures and improves on them with novel designs and advanced modules.
- To reduce reliance on expert annotations and bridge domain gaps, semi-supervised frameworks were developed (the sketch after the abstract above illustrates the underlying pseudo-labeling idea).
- Two multi-task datasets, GynSurg and Cataract-1K, were released to support reproducibility and advance the field.
Click here to view paper screenshots




Spatial Understanding from Videos: Structured Prompts Meet Simulation Data
Authors:Haoyu Zhang, Meng Liu, Zaijing Li, Haokun Wen, Weili Guan, Yaowei Wang, Liqiang Nie
Visual-spatial understanding, the ability to infer object relationships and layouts from visual input, is fundamental to downstream tasks such as robotic navigation and embodied interaction. However, existing methods face spatial uncertainty and data scarcity, limiting the 3D spatial reasoning capability of pre-trained vision-language models (VLMs). To address these challenges, we present a unified framework for enhancing 3D spatial reasoning in pre-trained VLMs without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Extensive experiments across multiple benchmarks demonstrate the individual and combined effectiveness of our prompting and fine-tuning strategies, and yield insights that may inspire future research on visual-spatial understanding.
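SpatialMind's actual templates are not reproduced in this digest; the sketch below shows, under assumed stage wording, what decomposing a spatial question into interpretable reasoning steps can look like at the prompt level.

```python
# Hypothetical SpatialMind-style structured prompt: the spatial question
# is decomposed into explicit stages instead of being asked directly.
# The stage wording is assumed, not the paper's actual template.
STAGES = [
    "1. List the objects visible in the scene and their rough positions.",
    "2. State pairwise spatial relations (left/right, in front/behind, above/below).",
    "3. Describe the overall layout implied by those relations.",
    "4. Using steps 1-3, answer the question and cite the relations used.",
]

def build_spatial_prompt(question: str) -> str:
    """Compose a prompt that forces step-by-step spatial reasoning."""
    steps = "\n".join(STAGES)
    return (
        "You will answer a visual-spatial question about the input video.\n"
        "Work through the following steps in order:\n"
        f"{steps}\n\n"
        f"Question: {question}"
    )

print(build_spatial_prompt("Is the sofa closer to the window or the door?"))
```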
Paper and project links
PDF Accepted by NeurIPS 2025 as a Spotlight
Summary
This paper addresses the importance of visual-spatial understanding for downstream tasks such as robotic navigation and embodied interaction. To overcome the spatial uncertainty and data scarcity that limit the 3D spatial reasoning of pre-trained vision-language models (VLMs), it proposes a unified framework that requires no architectural changes. The framework combines the SpatialMind structured prompting strategy with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process designed for fine-tuning. Experiments across multiple benchmarks show that the prompting and fine-tuning strategies are effective both individually and in combination, and yield insights for future research on visual-spatial understanding.
Key Takeaways
- Visual-spatial understanding is fundamental to downstream tasks such as robotic navigation and embodied interaction.
- Existing methods face spatial uncertainty and data scarcity in 3D spatial reasoning.
- A unified framework enhances the 3D spatial reasoning of pre-trained vision-language models (VLMs) via the SpatialMind structured prompting strategy and the ScanForgeQA dataset.
- SpatialMind decomposes complex scenes and questions into interpretable reasoning steps.
- ScanForgeQA is a scalable question-answering dataset built from diverse 3D simulation scenes through an automated construction process (a toy version of that construction is sketched after this list).
- Experiments show that both the prompting and fine-tuning strategies are effective, improving VLMs' 3D spatial reasoning.
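To make the automated construction process concrete, here is a toy sketch that turns a simulated scene's ground-truth layout into a QA pair; the scene format and question template are assumptions, not ScanForgeQA's actual pipeline.

```python
# Toy ScanForgeQA-style construction: derive QA pairs from the known
# geometry of a simulated scene. Layout and templates are hypothetical.
SCENE = {  # object name -> (x, y) position in meters
    "sofa": (1.0, 2.0),
    "window": (1.5, 4.0),
    "door": (5.0, 0.5),
}

def distance(a: str, b: str) -> float:
    (ax, ay), (bx, by) = SCENE[a], SCENE[b]
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def make_qa_pair(anchor: str, ref1: str, ref2: str) -> tuple[str, str]:
    """Generate a 'which is closer' question with a geometry-derived answer."""
    closer = ref1 if distance(anchor, ref1) < distance(anchor, ref2) else ref2
    return f"Is the {anchor} closer to the {ref1} or the {ref2}?", closer

question, answer = make_qa_pair("sofa", "window", "door")
print(question)  # Is the sofa closer to the window or the door?
print(answer)    # window (about 2.06 m away, vs 4.27 m to the door)
```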
Click here to view paper screenshots



