Published: 2025-11-18

Updated: 2025-11-27

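⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: do not use them in serious academic settings. They are meant only for initial screening before reading a paper!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Free to try on HuggingFace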

2025-11-18 Update

VIDEOP2R: Video Understanding from Perception to Reasoning

Authors:Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model’s perception output is information-sufficient for downstream reasoning.

Paper and project links

PDF

Summary

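This paper introduces VideoP2R, a new reinforcement fine-tuning (RFT) method for large video language models (LVLMs). The method models perception and reasoning as distinct processes and improves video reasoning through two stages: supervised fine-tuning (SFT) and reinforcement learning (RL). In the SFT stage, a three-step pipeline generates VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought dataset for perception and reasoning. In the RL stage, a process-aware group relative policy optimization (PA-GRPO) algorithm supplies separate rewards for perception and reasoning. Experiments show that VideoP2R achieves state-of-the-art performance on six of seven video understanding and reasoning benchmarks. Ablation studies further confirm the effectiveness of process-aware modeling and PA-GRPO, and show that the model's perception output carries enough information for downstream reasoning.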

Key Takeaways

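VideoP2R is a new reinforcement fine-tuning (RFT) method for large video language models (LVLMs), aimed at improving video reasoning.
VideoP2R models perception and reasoning as distinct processes and optimizes them in two stages: supervised fine-tuning (SFT) and reinforcement learning (RL).
The SFT stage uses a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought dataset for perception and reasoning.
The RL stage introduces the process-aware group relative policy optimization (PA-GRPO) algorithm, which supplies separate rewards for perception and reasoning.
VideoP2R achieves state-of-the-art performance on six of seven video understanding and reasoning benchmarks.
Ablation studies confirm the effectiveness of process-aware modeling and the PA-GRPO algorithm.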
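
The abstract says only that PA-GRPO supplies separate rewards for perception and reasoning. As a rough illustration of that idea, the sketch below applies the usual GRPO-style group normalization (reward minus group mean, divided by group standard deviation) to each process separately. The function names, the reward values, and the choice to normalize the two reward streams independently are assumptions for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Compute one advantage per response for each process (assumed design)."""
    if len(perception_rewards) != len(reasoning_rewards):
        raise ValueError("both reward lists must cover the same group of responses")
    return {
        "perception": group_relative_advantages(perception_rewards),
        "reasoning": group_relative_advantages(reasoning_rewards),
    }


# Hypothetical rewards for a group of four sampled responses.
adv = process_aware_advantages(
    perception_rewards=[1.0, 0.0, 1.0, 0.0],
    reasoning_rewards=[1.0, 1.0, 0.0, 0.0],
)
```

In this toy group, the second response gets a negative perception advantage but a positive reasoning advantage, which is the kind of per-process credit a single scalar reward could not express.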

EmoVid: A Multimodal Emotion Video Dataset for Emotion-Centric Video Understanding and Generation

Authors:Zongyang Qiu, Bingyuan Wang, Xingbei Chen, Yingqing He, Zeyu Wang

Emotion plays a pivotal role in video-based expression, but existing video generation systems predominantly focus on low-level visual metrics while neglecting affective dimensions. Although emotion analysis has made progress in the visual domain, the video community lacks dedicated resources to bridge emotion understanding with generative tasks, particularly for stylized and non-realistic contexts. To address this gap, we introduce EmoVid, the first multimodal, emotion-annotated video dataset specifically designed for creative media, which includes cartoon animations, movie clips, and animated stickers. Each video is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions. Through systematic analysis, we uncover spatial and temporal patterns linking visual features to emotional perceptions across diverse video forms. Building on these insights, we develop an emotion-conditioned video generation technique by fine-tuning the Wan2.1 model. The results show a significant improvement in both quantitative metrics and the visual quality of generated videos for text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing. Our work not only offers valuable insights into visual emotion analysis in artistically styled videos, but also provides practical methods for enhancing emotional expression in video generation.

Paper and project links

PDF 15 pages, 12 figures. Accepted as an Oral presentation at AAAI 2026. For code and dataset, see https://zane-zyqiu.github.io/EmoVid

Summary

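The paper stresses the importance of emotion in video-based expression and notes that existing video generation systems focus mainly on low-level visual metrics while neglecting affective dimensions. To fill this gap, it introduces EmoVid, a multimodal, emotion-annotated video dataset designed for creative media, covering cartoon animations, movie clips, and animated stickers. Through systematic analysis of the videos, the paper uncovers spatial and temporal patterns linking visual features to emotional perception. Building on this, the authors fine-tune the Wan2.1 model to develop an emotion-conditioned video generation technique that significantly improves both quantitative metrics and visual quality on text-to-video and image-to-video tasks. EmoVid establishes a new benchmark for affective video computing, offering insights into visual emotion analysis in artistically styled videos as well as practical methods for enhancing emotional expression in video generation.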

Key Takeaways

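Emotion plays an important role in video-based expression, yet existing video generation systems focus mainly on low-level visual metrics and neglect affective dimensions.
The paper introduces EmoVid, a multimodal, emotion-annotated video dataset designed for creative media, covering cartoon animations, movie clips, and animated stickers.
Each video in EmoVid is annotated with emotion labels, visual attributes (brightness, colorfulness, hue), and text captions.
Systematic analysis reveals spatial and temporal patterns linking visual features to emotional perception.
An emotion-conditioned video generation technique is developed by fine-tuning the Wan2.1 model.
The technique significantly improves quantitative metrics and visual quality on text-to-video and image-to-video tasks.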
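
To make the three visual attributes concrete, here is a minimal sketch that computes brightness, colorfulness, and hue for a list of RGB pixels. The abstract does not say how EmoVid defines these attributes, so every definition here is an assumption: brightness is the mean HSV value, hue is the circular mean of HSV hue, and colorfulness follows the Hasler and Süsstrunk measure. The function name `visual_attributes` is hypothetical.

```python
import colorsys
import math
from statistics import mean, pstdev


def visual_attributes(pixels):
    """Return brightness, hue (both in 0..1) and colorfulness for (r, g, b) pixels in 0..255."""
    pixels = list(pixels)
    hsv = [colorsys.rgb_to_hsv(r / 255, g / 255, b / 255) for r, g, b in pixels]

    # Assumed definition: brightness is the mean of the HSV value channel.
    brightness = mean(v for _, _, v in hsv)

    # Hue is an angle, so average it on the unit circle instead of arithmetically.
    sin_sum = sum(math.sin(2 * math.pi * h) for h, _, _ in hsv)
    cos_sum = sum(math.cos(2 * math.pi * h) for h, _, _ in hsv)
    hue = (math.atan2(sin_sum, cos_sum) / (2 * math.pi)) % 1.0

    # Assumed definition: Hasler and Suesstrunk colorfulness on opponent channels.
    rg = [r - g for r, g, _ in pixels]
    yb = [0.5 * (r + g) - b for r, g, b in pixels]
    colorfulness = math.hypot(pstdev(rg), pstdev(yb)) + 0.3 * math.hypot(mean(rg), mean(yb))

    return {"brightness": brightness, "hue": hue, "colorfulness": colorfulness}


# A flat gray patch and a flat red patch as toy inputs.
gray = visual_attributes([(128, 128, 128)] * 4)
red = visual_attributes([(255, 0, 0)] * 4)
```

For a real video, the same function could be applied per frame to sampled pixels, giving a time series of attributes to relate to emotion labels.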

SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding

Authors:Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie

Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V – a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.

Summary

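Spatially rich video training data generated in simulation can ease the spatial reasoning bottleneck of multimodal language models. The SIMS-V framework uses the privileged information of 3D simulators to generate simulated data and systematically studies which properties of that data drive transfer to the real world. The study finds that a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) transfers best, outperforming comprehensive coverage despite using fewer question types. These insights enable efficient training: a 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms a larger 72B baseline and is competitive with proprietary models on real-world spatial reasoning benchmarks. The approach maintains general video understanding performance while substantially improving embodied and real-world spatial tasks.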

Key Takeaways

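Simulated environments can generate spatially rich video training data that helps address the spatial reasoning weaknesses of multimodal language models.
The SIMS-V framework uses the privileged information of 3D simulators to generate simulated data that transfers spatial reasoning ability to the real world.
Ablations over the properties of simulated data identify three question categories as most effective: metric measurement, perspective-dependent reasoning, and temporal tracking.
Compared with comprehensive coverage of question types, this minimal set offers a more efficient way to train a model's spatial reasoning ability.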
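
As a small illustration of how a simulator's privileged information can become training data, the sketch below turns ground-truth 3D object positions into metric-measurement question-answer pairs. The input format, the question template, and the function name `metric_distance_questions` are hypothetical and only stand in for the general idea; they are not the SIMS-V pipeline.

```python
import math
from itertools import combinations


def metric_distance_questions(objects):
    """Build distance questions from a mapping of object name -> (x, y, z) in meters."""
    qa_pairs = []
    # Sort by name so the generated questions come out in a deterministic order.
    for (name_a, pos_a), (name_b, pos_b) in combinations(sorted(objects.items()), 2):
        distance = math.dist(pos_a, pos_b)
        qa_pairs.append(
            {
                "question": f"How far apart are the {name_a} and the {name_b}, in meters?",
                "answer": round(distance, 2),
            }
        )
    return qa_pairs


# Hypothetical scene metadata, as a simulator might expose it.
scene = {"chair": (0.0, 0.0, 0.0), "table": (3.0, 4.0, 0.0)}
qa = metric_distance_questions(scene)
```

Because the simulator knows exact positions, the answers need no human annotation, which is the bottleneck the abstract describes for real-world footage.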

NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

Authors:Sahil Shah, S P Sharan, Harsh Goel, Minkyu Choi, Mustafa Munir, Manvik Pasula, Radu Marculescu, Sandeep Chinchali

While vision-language models (VLMs) excel at tasks involving single images or short videos, they still struggle with Long Video Question Answering (LVQA) due to its demand for complex multi-step temporal reasoning. Vanilla approaches, which simply sample frames uniformly and feed them to a VLM along with the question, incur significant token overhead. This forces aggressive downsampling of long videos, causing models to miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent works attempt to overcome these limitations through heuristic approaches; however, they lack explicit mechanisms for encoding temporal relationships and fail to provide any formal guarantees that the sampled context actually encodes the compositional or causal logic required by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates a natural language question into a logic specification that models the temporal relationship between frame-level events. Next, we construct a video automaton to model the video’s frame-by-frame event progression, and finally employ model checking to compare the automaton against the specification to identify all video segments that satisfy the question’s logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS-QA significantly improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning. We open-source our code at https://utaustin-swarmlab.github.io/NeuS-QA/.

Paper and project links

PDF

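Summary

The paper notes that although vision-language models (VLMs) perform well on single-image and short-video tasks, they still struggle with Long Video Question Answering (LVQA), which demands complex multi-step temporal reasoning. Vanilla approaches sample frames uniformly and feed them to a VLM along with the question; the resulting token overhead forces aggressive downsampling of long videos, so models miss fine-grained visual structure, subtle event transitions, and key temporal cues. Recent heuristic approaches try to overcome these limits, but they lack explicit mechanisms for encoding temporal relationships and give no formal guarantee that the sampled context actually encodes the compositional or causal logic the question requires. To close these gaps, the authors introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA first translates the natural language question into a logic specification that models the temporal relationships between frame-level events. It then constructs a video automaton that models the video's frame-by-frame event progression, and uses model checking to compare the automaton against the specification and identify every video segment that satisfies the question's logical requirements. Only these logic-verified segments are submitted to the VLM, which improves interpretability, reduces hallucinations, and enables compositional reasoning without modifying or fine-tuning the model. Experiments on the LongVideoBench and CinePile LVQA benchmarks show that NeuS-QA improves performance by over 10%, particularly on questions involving event ordering, causality, and multi-step reasoning. The code is open-sourced at https://utaustin-swarmlab.github.io/NeuS-QA/.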

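Key Takeaways

Long Video Question Answering (LVQA) requires complex multi-step temporal reasoning, which remains a challenge for existing vision-language models (VLMs).
Vanilla approaches handle long videos by uniformly sampling frames, which forces aggressive downsampling and loses important visual and temporal information.
Recent work tries to overcome these limits with heuristic approaches, but lacks explicit mechanisms for encoding temporal relationships.
NeuS-QA is a training-free, plug-and-play neuro-symbolic pipeline for LVQA that translates the natural language question into a logic specification and checks it against a video automaton.
NeuS-QA improves performance by over 10% on LongVideoBench and CinePile, particularly on questions involving event ordering, causality, and multi-step reasoning.
The method improves interpretability, reduces hallucinations, and enables compositional reasoning without modifying or fine-tuning the model.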
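
The segment-selection step can be illustrated with a deliberately simplified stand-in. The sketch below assumes frame-level events are already available as sets of propositions per frame, and scans them with a two-state machine for the single pattern "event a, later followed by event b". It is not the authors' video automaton or model checker, and the function name, event names, and data format are all hypothetical.

```python
def segments_a_then_b(frames, a, b):
    """Return (start, end) frame index pairs where `a` holds at start and `b` first holds at a later end."""
    segments = []
    start = None  # None means "still waiting for a"; an index means "a seen, waiting for b"
    for index, propositions in enumerate(frames):
        if start is None:
            if a in propositions:
                start = index
        elif b in propositions:
            segments.append((start, index))
            start = None  # reset and look for the next occurrence of the pattern
    return segments


# Hypothetical per-frame detections for the question
# "What happens after the door opens and a person enters?"
frames = [
    set(),
    {"door_opens"},
    set(),
    {"person_enters"},
    set(),
    {"door_opens"},
    {"person_enters"},
]
matches = segments_a_then_b(frames, "door_opens", "person_enters")
```

Only the frames inside the returned segments would then be passed to the VLM, instead of a uniform sample of the whole video.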

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-18/%E8%A7%86%E9%A2%91%E7%90%86%E8%A7%A3/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!

Video Understanding
