
R1_Reasoning


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-12

Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View

Authors:Jianyu Qi, Ding Zou, Wenrui Yan, Rui Ma, Jiaxu Li, Zhijie Zheng, Zhiguo Yang, Rongchang Zhao

Recent advances in Multimodal Large Language Models (MLLMs) have spurred significant progress in Chain-of-Thought (CoT) reasoning. Building on the success of Deepseek-R1, researchers extended multimodal reasoning to post-training paradigms based on reinforcement learning (RL), focusing predominantly on mathematical datasets. However, existing post-training paradigms tend to neglect two critical aspects: (1) The lack of quantifiable difficulty metrics capable of strategically screening samples for post-training optimization. (2) Suboptimal post-training paradigms that fail to jointly optimize perception and reasoning capabilities. To address this gap, we propose two novel difficulty-aware sampling strategies: Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation, while Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity via attention distribution analysis. Leveraging these metrics, we design a hierarchical training framework that incorporates both GRPO-only and SFT+GRPO hybrid training paradigms, and evaluate them across six benchmark datasets. Experiments demonstrate consistent superiority of GRPO applied to difficulty-stratified samples compared to conventional SFT+GRPO pipelines, indicating that strategic data sampling can obviate the need for supervised fine-tuning while improving model accuracy. Our code will be released at https://github.com/qijianyu277/DifficultySampling.
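
To make the two difficulty metrics more concrete, here is a minimal sketch of how scores in the spirit of PISM and CMAB could be computed. The masking schedule, the `answer_is_correct` callback, and the attention-balance formula are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

def pism_difficulty(image: np.ndarray, question: str, answer_is_correct,
                    num_levels: int = 10, patch: int = 16, seed: int = 0) -> float:
    """Progressively mask image patches and return a difficulty score in [0, 1]:
    samples whose answer survives heavy masking are considered easy."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    ph, pw = h // patch, w // patch
    order = rng.permutation(ph * pw)          # random order in which patches get masked
    survived = 0.0
    for level in range(1, num_levels + 1):
        ratio = level / num_levels
        masked = image.copy()
        for idx in order[: int(ratio * ph * pw)]:
            r, c = divmod(int(idx), pw)
            masked[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0
        if not answer_is_correct(masked, question):   # model breaks at this degradation level
            break
        survived = ratio
    return 1.0 - survived                             # low tolerance to masking -> high difficulty

def cmab_score(attn_to_image: np.ndarray, attn_to_text: np.ndarray) -> float:
    """Toy cross-modal balance score: 0 when attention mass is split evenly between
    modalities, approaching 1 when one modality dominates."""
    img, txt = float(attn_to_image.sum()), float(attn_to_text.sum())
    return abs(img - txt) / (img + txt + 1e-9)
```

Samples could then be binned by such scores into difficulty strata before GRPO training, which is the role the two metrics play in the paper's hierarchical framework.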


Paper and Project Links

PDF Accepted by AAAI 2026

Summary
Multimodal Large Language Models (MLLMs) have made notable progress in Chain-of-Thought (CoT) reasoning, but existing post-training work lacks quantifiable difficulty metrics for sample selection and strategies that jointly optimize perception and reasoning. The paper proposes two difficulty-aware sampling strategies and evaluates them on multiple benchmark datasets within a hierarchical training framework. Results show that GRPO applied to difficulty-stratified samples consistently outperforms the conventional SFT+GRPO pipeline, indicating that strategic data sampling improves model accuracy.

Key Takeaways

  1. Multimodal Large Language Models (MLLMs) have made progress in Chain-of-Thought (CoT) reasoning.
  2. Existing work overlooks quantifiable difficulty metrics and strategies that jointly optimize perception and reasoning.
  3. Two difficulty-aware sampling strategies are proposed: Progressive Image Semantic Masking (PISM) and Cross-Modality Attention Balance (CMAB).
  4. Both strategies are evaluated on multiple benchmark datasets within a hierarchical training framework.
  5. Experiments show that GRPO on difficulty-stratified samples outperforms the conventional SFT+GRPO pipeline.
  6. Strategic data sampling improves model accuracy and may obviate the need for supervised fine-tuning.


EgoExo-Con: Exploring View-Invariant Video Temporal Understanding

Authors:Minjoon Jung, Junbin Xiao, Junghyun Kim, Byoung-Tak Zhang, Angela Yao

Can Video-LLMs achieve consistent temporal understanding when videos capture the same event from different viewpoints? To study this, we introduce EgoExo-Con (Consistency), a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined queries in natural language. EgoExo-Con emphasizes two temporal understanding tasks: Temporal Verification and Temporal Grounding. It evaluates not only correctness but consistency across viewpoints. Our analysis reveals two critical limitations of existing Video-LLMs: (1) models often fail to maintain consistency, with results far worse than their single-view performances. (2) When naively finetuned with synchronized videos of both viewpoints, the models show improved consistency but often underperform those trained on a single view. For improvements, we propose View-GRPO, a novel reinforcement learning framework that effectively strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints. Our method demonstrates its superiority over naive SFT and GRPO, especially for improving cross-view consistency. All resources will be made publicly available.
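
As a concrete illustration of what "consistency across viewpoints" can mean for the two tasks, the sketch below scores paired ego/exo predictions: agreement of yes/no verdicts for Temporal Verification and interval overlap for Temporal Grounding. The 0.5 IoU threshold and the dictionary format are assumptions for illustration; the benchmark's official metrics may differ.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) intervals given in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def consistency_scores(pairs, iou_thr=0.5):
    """pairs: list of dicts holding a model's ego/exo predictions for the same query."""
    verif_hits = ground_hits = 0
    for p in pairs:
        # Temporal Verification: the same yes/no judgement from both viewpoints.
        verif_hits += int(p["ego_verdict"] == p["exo_verdict"])
        # Temporal Grounding: the two predicted windows overlap sufficiently.
        ground_hits += int(temporal_iou(p["ego_window"], p["exo_window"]) >= iou_thr)
    n = max(len(pairs), 1)
    return {"verification_consistency": verif_hits / n,
            "grounding_consistency": ground_hits / n}

# Example: one query answered consistently across views, one not.
print(consistency_scores([
    {"ego_verdict": True, "exo_verdict": True,  "ego_window": (3.0, 8.0), "exo_window": (3.5, 8.5)},
    {"ego_verdict": True, "exo_verdict": False, "ego_window": (1.0, 2.0), "exo_window": (9.0, 12.0)},
]))
```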


Paper and Project Links

PDF Project page: https://minjoong507.github.io/projects/EgoExo-Con/

Summary
To study whether Video-LLMs maintain consistent temporal understanding across viewpoints, the paper introduces EgoExo-Con, a benchmark of comprehensively synchronized egocentric and exocentric video pairs with human-refined natural-language queries. EgoExo-Con covers two temporal understanding tasks, Temporal Verification and Temporal Grounding, and evaluates cross-view consistency in addition to correctness. The analysis exposes two limitations of existing Video-LLMs: they often fail to stay consistent, performing far worse than in the single-view setting, and naive fine-tuning on synchronized two-view videos improves consistency but often underperforms single-view training. The proposed View-GRPO reinforcement learning framework strengthens view-specific temporal reasoning while encouraging consistent comprehension across viewpoints, outperforming naive SFT and GRPO, especially on cross-view consistency. All resources will be made publicly available.

Key Takeaways

  1. EgoExo-Con is introduced as a benchmark for evaluating whether Video-LLMs understand the same event consistently across viewpoints.
  2. Existing Video-LLMs struggle to maintain cross-view consistency.
  3. Naive fine-tuning on synchronized two-view videos improves consistency but can sacrifice single-view performance.
  4. View-GRPO is a reinforcement learning framework that strengthens view-specific temporal reasoning while improving cross-view consistency.
  5. View-GRPO outperforms naive SFT and GRPO, especially on cross-view consistency.
  6. The study highlights the importance of cross-view consistency for video understanding.


Geometric-Mean Policy Optimization

Authors:Yuzhong Zhao, Yue Liu, Junpeng Liu, Jingye Chen, Xun Wu, Yaru Hao, Tengchao Lv, Shaohan Huang, Lei Cui, Qixiang Ye, Fang Wan, Furu Wei

Group Relative Policy Optimization (GRPO) has significantly enhanced the reasoning capability of large language models by optimizing the arithmetic mean of token-level rewards. Unfortunately, GRPO is observed to suffer from unstable policy updates when facing tokens with outlier importance-weighted rewards, which manifest as extreme importance sampling ratios during training. In this study, we propose Geometric-Mean Policy Optimization (GMPO), with the aim to improve the stability of GRPO through suppressing token reward outliers. Instead of optimizing the arithmetic mean, GMPO maximizes the geometric mean of token-level rewards, which is inherently less sensitive to outliers and maintains a more stable range of importance sampling ratio. GMPO is plug-and-play: simply replacing GRPO’s arithmetic mean with the geometric mean of token-level rewards, as the latter is inherently less sensitive to outliers. GMPO is theoretically plausible: analysis reveals that both GMPO and GRPO are weighted forms of the policy gradient while the former enjoys more stable weights, which consequently benefits policy optimization and performance. Experiments on multiple mathematical reasoning benchmarks show that GMPO-7B improves the average Pass@1 of GRPO by up to 4.1%, outperforming many state-of-the-art approaches. Code is available at https://github.com/callsys/GMPO.
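
The core change can be written schematically as follows. With r_t the token-level importance ratio and \hat{A} the group-normalized advantage, GRPO averages the weighted terms arithmetically while GMPO takes their geometric mean; the clipping details of both objectives are omitted, so this is a simplified sketch rather than the papers' exact losses.

```latex
% Simplified comparison (clipping and min operators omitted).
% r_t = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{\mathrm{old}}}(o_t \mid q, o_{<t}),
% \hat{A} = group-normalized advantage of the sampled response o.
J_{\mathrm{GRPO}}(\theta) \approx \mathbb{E}\left[\frac{1}{|o|}\sum_{t=1}^{|o|} r_t\,\hat{A}\right],
\qquad
J_{\mathrm{GMPO}}(\theta) \approx \mathbb{E}\left[\Bigl(\prod_{t=1}^{|o|} r_t\Bigr)^{1/|o|}\hat{A}\right].
```

Because the geometric mean equals \exp\bigl(\tfrac{1}{|o|}\sum_t \log r_t\bigr), a single extreme ratio shifts it far less than it shifts the arithmetic mean, which is the stability argument the abstract sketches.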


Paper and Project Links

PDF Code is available at https://github.com/callsys/GMPO

Summary

GRPO improves the reasoning ability of large language models by optimizing the arithmetic mean of token-level rewards, but its policy updates become unstable on tokens with outlier importance-weighted rewards. Geometric-Mean Policy Optimization (GMPO) instead maximizes the geometric mean of token-level rewards, which is less sensitive to outliers and keeps the importance sampling ratio in a more stable range. GMPO is plug-and-play: it simply replaces GRPO's arithmetic mean with the geometric mean of token-level rewards. Theoretical analysis shows that both GMPO and GRPO are weighted forms of the policy gradient, with GMPO enjoying more stable weights that benefit optimization and performance. On multiple mathematical reasoning benchmarks, GMPO-7B improves GRPO's average Pass@1 by up to 4.1%, outperforming many state-of-the-art approaches.

Key Takeaways

  1. Group Relative Policy Optimization (GRPO) enhances the reasoning ability of large language models by optimizing the arithmetic mean of token-level rewards.
  2. GRPO suffers from unstable policy updates when facing tokens with outlier importance-weighted rewards.
  3. Geometric-Mean Policy Optimization (GMPO) improves stability by maximizing the geometric mean of token-level rewards, which is less sensitive to outliers.
  4. GMPO keeps the importance sampling ratio in a more stable range and is implemented by simply replacing GRPO's arithmetic mean with the geometric mean.
  5. Theoretical analysis shows that GMPO enjoys more stable policy-gradient weights, benefiting optimization and performance.
  6. On mathematical reasoning benchmarks, GMPO-7B improves GRPO's average Pass@1 by up to 4.1%.


EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework

Authors:Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yuzhi Zhang, Yue Wang

Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), a lightweight variant of Proximal Policy Optimization (PPO), improves efficiency but suffers from limited exploration and training instability, limiting its effectiveness on complex reasoning tasks. To address these challenges, we introduce EFRame, an Exploration-Filter-Replay framework that augments GRPO across three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare yet informative trajectories for stable convergence. This unified framework establishes a principled training cycle that balances exploration, efficiency, and stability. Experiments on diverse reasoning benchmarks demonstrate that EFRame achieves consistent gains, including a 37.9% relative improvement on Geometry3K over GRPO. EFRame further supports fine-grained sample categorization and precise entropy control, highlighting it as a robust solution for advancing deeper reasoning in LLMs. Our code is available at https://github.com/597358816/EFRame.
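
The sketch below shows how the three components could fit into one training step: extra rollouts for exploration, an online reward filter, and a replay buffer mixed back into the GRPO update. The `Trajectory` fields, the `policy` interface (`generate_rollouts`, `grpo_update`), and the thresholds are assumptions for illustration, not EFRame's actual implementation.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Trajectory:
    text: str
    reward: float
    is_rare: bool = False     # e.g. a correct solution to a rarely-solved prompt

def efr_step(policy, prompts, replay: deque, n_rollouts=16, min_reward=0.1, replay_k=8):
    # 1) Exploration: sample additional rollouts per prompt for deeper, more targeted search.
    batch = [t for p in prompts for t in policy.generate_rollouts(p, n=n_rollouts)]
    # 2) Filter: drop low-quality samples online to stabilize gradients and speed up training.
    kept = [t for t in batch if t.reward >= min_reward]
    # 3) Replay: store rare but informative trajectories and mix them back into the update.
    replay.extend(t for t in kept if t.is_rare)
    mixed = kept + list(replay)[-replay_k:]
    policy.grpo_update(mixed)                 # ordinary GRPO update on the mixed batch
    return len(kept), len(replay)

# Usage (with any policy object exposing generate_rollouts/grpo_update):
# buffer = deque(maxlen=1024)
# efr_step(policy, train_prompts, buffer)
```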


Paper and Project Links

PDF

Summary

Reinforcement learning has substantially improved the reasoning ability of large language models, but Group Relative Policy Optimization (GRPO) suffers from limited exploration and training instability on complex reasoning tasks. EFRame augments GRPO along three dimensions: additional rollouts enable deeper and more targeted exploration, online filtering removes low-quality samples to stabilize gradients and accelerate training, and experience replay amplifies rare but informative trajectories for stable convergence. Across diverse reasoning benchmarks, EFRame achieves consistent gains, including a 37.9% relative improvement over GRPO on Geometry3K. The framework also supports fine-grained sample categorization and precise entropy control, making it a robust solution for deeper reasoning in LLMs. The code is released on GitHub.

Key Takeaways

  1. Reinforcement learning improves the reasoning ability of large language models.
  2. Group Relative Policy Optimization (GRPO) suffers from limited exploration and training instability on complex reasoning tasks.
  3. The EFRame framework augments GRPO along three dimensions: exploration, filtering, and replay.
  4. Additional rollouts enable deeper exploration, while online filtering removes low-quality samples to stabilize gradients and accelerate training.
  5. Experience replay amplifies rare but informative trajectories for more stable convergence.
  6. EFRame achieves consistent gains over GRPO across multiple benchmarks.


SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents

Authors:Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Chunkai Fan, Junyu Lu, Yulin Luo, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang

Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancements in reinforcement fine-tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexplored. Specifically, reinforcement fine-tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi-step reasoning tasks limits effective learning signals, and (ii) reliance on hand-crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self-Evolving Embodied Agents-R1, SEEA-R1, the first RFT framework designed for enabling the self-evolving capabilities of embodied agents. Specifically, to convert sparse delayed rewards into denser intermediate signals that improve multi-step reasoning, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), which integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution, we further introduce a Multi-modal Generative Reward Model (MGRM). To holistically evaluate the effectiveness of SEEA-R1, we evaluate it on the ALFWorld benchmark, surpassing state-of-the-art methods with scores of 85.07% (textual) and 46.27% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% (textual) and 44.03% (multi-modal) without ground truth reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
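
One way to picture how a tree search can densify a sparse terminal reward is a Monte-Carlo-style backup over the search tree, so that every intermediate reasoning step receives a value usable by the policy update. The node structure and averaging rule below are an illustrative assumption, not SEEA-R1's exact Tree-GRPO algorithm.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    action: str
    children: List["Node"] = field(default_factory=list)
    terminal_reward: float = 0.0   # only leaves carry the sparse task reward
    value: float = 0.0             # filled in by backup -> dense intermediate signal

def backup(node: Node) -> float:
    """A node's value is the mean return of its subtree (leaves return their reward)."""
    if not node.children:
        node.value = node.terminal_reward
    else:
        node.value = sum(backup(c) for c in node.children) / len(node.children)
    return node.value

# Example: one successful and one failed branch from the same intermediate step.
root = Node("open drawer", children=[
    Node("take knife", children=[Node("slice apple", terminal_reward=1.0)]),
    Node("take fork",  children=[Node("slice apple", terminal_reward=0.0)]),
])
backup(root)
print(root.value)   # 0.5 -> the intermediate step now carries a graded learning signal
```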


Paper and Project Links

PDF

Summary

Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for long-horizon, real-world embodied tasks. Reinforcement fine-tuning (RFT) has shown strong performance in enhancing LLM reasoning, but its use for self-evolving embodied intelligence with multi-modal interaction remains largely unexplored, facing two obstacles in embodied settings: the lack of accessible intermediate rewards in multi-step reasoning and the reliance on hand-crafted reward functions. SEEA-R1 is the first RFT framework designed for self-evolving embodied agents; it addresses sparse delayed rewards with Tree-GRPO, which integrates Monte Carlo Tree Search into GRPO, and replaces hand-crafted rewards with a Multi-modal Generative Reward Model (MGRM). Experiments show that SEEA-R1 achieves strong results on the ALFWorld benchmark.

Key Takeaways

  1. Self-evolution is essential for long-horizon, real-world embodied tasks.
  2. Reinforcement fine-tuning (RFT) excels at strengthening LLM reasoning, but its use for self-evolving embodied intelligence remains underexplored.
  3. SEEA-R1 is the first RFT framework designed for the self-evolving capabilities of embodied agents.
  4. Tree-GRPO converts sparse delayed rewards into denser intermediate signals that improve multi-step reasoning.
  5. The MGRM generalizes reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution.
  6. On the ALFWorld benchmark, SEEA-R1 achieves strong results, surpassing existing methods.


DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO

Authors:Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim

Recent works have demonstrated the effectiveness of reinforcement learning (RL)-based post-training for enhancing the reasoning capabilities of large language models (LLMs). In particular, Group Relative Policy Optimization (GRPO) has shown impressive success using a PPO-style reinforcement algorithm with group-normalized rewards. However, the effectiveness of GRPO in Video Large Language Models (VideoLLMs) has still been less studied. In this paper, we explore GRPO and identify two problems that deteriorate effective learning: (1) reliance on safeguards, and (2) vanishing advantage. To mitigate these challenges, we propose DeepVideo-R1, a video large language model trained with Reg-GRPO (Regressive GRPO) and difficulty-aware data augmentation. Reg-GRPO reformulates the GRPO loss function into a regression task that directly predicts the advantage in GRPO, eliminating the need for safeguards such as the clipping and min functions. It directly aligns the model with advantages, providing guidance to prefer better ones. The difficulty-aware data augmentation strategy augments input prompts/videos to locate the difficulty of samples at solvable difficulty levels, enabling diverse reward signals. Our experimental results show that our approach significantly improves video reasoning performance across multiple benchmarks.
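
A minimal sketch of what a regression-style reformulation can look like: the group-normalized advantage becomes a regression target that a sequence-level log-probability ratio is fit to, so no clipping or min safeguards are required. The use of MSE and of the plain log-ratio as the predicted quantity are assumptions for illustration; DeepVideo-R1's exact Reg-GRPO loss may differ.

```python
import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: rewards normalized within a sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def reg_grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  rewards: torch.Tensor) -> torch.Tensor:
    """Fit the sequence-level log-ratio directly to the advantage (no clip/min)."""
    adv = group_advantages(rewards).detach()      # regression target
    score = logp_new - logp_old.detach()          # per-response log pi_new - log pi_old
    return torch.mean((score - adv) ** 2)

# Example with a group of 4 sampled responses:
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
logp_new = torch.randn(4, requires_grad=True)
loss = reg_grpo_loss(logp_new, logp_new.detach() + 0.1, rewards)
loss.backward()
```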


Paper and Project Links

PDF NeurIPS 2025

Summary

Reinforcement learning based post-training, in particular Group Relative Policy Optimization (GRPO) with group-normalized rewards, is effective at improving LLM reasoning, but its use in Video Large Language Models (VideoLLMs) is less studied. The paper identifies two problems that hurt learning with GRPO: reliance on safeguards and vanishing advantages. DeepVideo-R1 addresses them with Reg-GRPO, which reformulates the GRPO loss as a regression task that directly predicts the advantage and removes the need for safeguards, and with a difficulty-aware data augmentation strategy that adjusts input prompts/videos so that samples sit at solvable difficulty levels, yielding diverse reward signals. Experiments show significant gains in video reasoning across multiple benchmarks.

Key Takeaways

  1. Reinforcement learning based post-training, especially GRPO, is effective at improving LLM reasoning.
  2. In Video Large Language Models, GRPO suffers from reliance on safeguards and vanishing advantages.
  3. Reg-GRPO reformulates the GRPO loss as a regression task that directly predicts the advantage, removing the need for safeguards.
  4. Difficulty-aware data augmentation adjusts input prompts/videos so that samples sit at solvable difficulty levels.
  5. DeepVideo-R1 combines Reg-GRPO and difficulty-aware augmentation to significantly improve video reasoning performance.
  6. The approach performs well across multiple benchmarks.


AgentThink: A Unified Framework for Tool-Augmented Chain-of-Thought Reasoning in Vision-Language Models for Autonomous Driving

Authors:Kangan Qian, Sicong Jiang, Yang Zhong, Ziang Luo, Zilin Huang, Tianze Zhu, Kun Jiang, Mengmeng Yang, Zheng Fu, Jinyu Miao, Yining Shi, He Zhe Lim, Li Liu, Tianbao Zhou, Huang Yu, Yifei Hu, Guang Li, Guang Chen, Hao Ye, Lijun Sun, Diange Yang

Vision-Language Models (VLMs) show promise for autonomous driving, yet their struggle with hallucinations, inefficient reasoning, and limited real-world validation hinders accurate perception and robust step-by-step reasoning. To overcome this, we introduce AgentThink, a pioneering unified framework that integrates Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation for autonomous driving tasks. AgentThink’s core innovations include: (i) Structured Data Generation, which establishes an autonomous driving tool library to automatically construct structured, self-verified reasoning data explicitly incorporating tool usage for diverse driving scenarios; (ii) A Two-stage Training Pipeline, employing Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO) to equip VLMs with the capability for autonomous tool invocation; and (iii) Agent-style Tool-Usage Evaluation, introducing a novel multi-tool assessment protocol to rigorously evaluate the model’s tool invocation and utilization. Experiments on the DriveLMM-o1 benchmark demonstrate that AgentThink significantly boosts overall reasoning scores by 53.91% and enhances answer accuracy by 33.54%, while markedly improving reasoning quality and consistency. Furthermore, ablation studies and robust zero-shot/few-shot generalization experiments across various benchmarks underscore its powerful capabilities. These findings highlight a promising trajectory for developing trustworthy and tool-aware autonomous driving models. Code is available at https://github.com/curryqka/AgentThink.
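
To illustrate what an agent-style tool-usage check might look like, the sketch below parses tool calls from a model response and scores how well-formed they are and whether the expected tools were covered. The `<tool>{...}</tool>` call format, the tool registry, and the equal weighting are assumptions for illustration, not the paper's actual multi-tool assessment protocol.

```python
import json
import re

TOOL_CALL = re.compile(r"<tool>(.*?)</tool>", re.DOTALL)

def score_tool_usage(response: str, expected_tools: set, registry: dict) -> float:
    calls = TOOL_CALL.findall(response)
    if not calls:
        return 0.0
    valid, used = 0, set()
    for raw in calls:
        try:
            call = json.loads(raw)            # expected shape: {"name": ..., "args": {...}}
            if call["name"] in registry and isinstance(call.get("args"), dict):
                valid += 1
                used.add(call["name"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue
    wellformed = valid / len(calls)                             # fraction of parseable calls
    coverage = len(used & expected_tools) / max(len(expected_tools), 1)
    return 0.5 * wellformed + 0.5 * coverage                    # equal weighting is an assumption

# Example: one valid call to a known tool, one expected tool never invoked -> 0.75.
resp = 'Check the scene first. <tool>{"name": "detector", "args": {"target": "pedestrian"}}</tool>'
print(score_tool_usage(resp, {"detector", "planner"}, {"detector": None, "planner": None}))
```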


Paper and Project Links

PDF 19 pages, 8 figures

Summary

AgentThink is a unified framework that addresses the limitations of Vision-Language Models (VLMs) in autonomous driving, such as hallucination, inefficient reasoning, and limited real-world validation. It combines Chain-of-Thought (CoT) reasoning with dynamic, agent-style tool invocation to improve perception accuracy and the robustness of step-by-step reasoning. Its core innovations are structured data generation, a two-stage training pipeline, and an agent-style tool-usage evaluation protocol. Experiments show that AgentThink substantially boosts reasoning scores and answer accuracy while improving reasoning quality and consistency.

Key Takeaways

  • AgentThink is a unified framework that addresses the limitations of Vision-Language Models (VLMs) in autonomous driving, including hallucination, inefficient reasoning, and limited real-world validation.
  • The framework combines Chain-of-Thought (CoT) reasoning with dynamic tool invocation to strengthen perception and reasoning.
  • Its core innovations are structured data generation, a two-stage training pipeline, and a novel protocol for evaluating tool invocation and usage across diverse driving scenarios.
  • On the DriveLMM-o1 benchmark, AgentThink significantly improves reasoning scores and answer accuracy while improving reasoning quality and consistency.
  • Robust zero-shot and few-shot generalization experiments across benchmarks confirm its strong capabilities.


Fast-Slow Thinking GRPO for Large Vision-Language Model Reasoning

Authors:Wenyi Xiao, Leilei Gan

When reinforcement learning, typically through GRPO, is applied to large vision-language model reasoning, it struggles to effectively scale reasoning length or generates verbose outputs across all tasks with only marginal gains in accuracy. To address this issue, we present FAST-GRPO, a variant of GRPO that dynamically adapts reasoning depth based on question characteristics. Through empirical analysis, we establish the feasibility of fast-slow thinking in LVLMs by investigating how response length and data distribution affect performance. Inspired by these observations, we introduce two complementary metrics to estimate the difficulty of the questions, guiding the model to determine when fast or slow thinking is more appropriate. Next, we incorporate adaptive length-based rewards and difficulty-aware KL divergence into the GRPO algorithm. Experiments across seven reasoning benchmarks demonstrate that FAST achieves state-of-the-art accuracy with over 10% relative improvement compared to the base model, while reducing token usage by 32.7-67.3% compared to previous slow-thinking approaches, effectively balancing reasoning length and accuracy.
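
The sketch below shows one way adaptive length rewards and difficulty-aware KL scaling could be wired together: easy questions get a short token budget and a stronger KL pull toward the reference policy, hard questions get a longer budget and a weaker pull. All coefficients and shaping functions are illustrative assumptions rather than FAST-GRPO's actual formulas.

```python
def length_reward(correct: bool, n_tokens: int, difficulty: float, budget: int = 512) -> float:
    """Correctness reward with a penalty for thinking longer than the difficulty warrants.
    difficulty is assumed to lie in [0, 1] (0 = easy, 1 = hard)."""
    base = 1.0 if correct else 0.0
    target = budget * (0.25 + 0.75 * difficulty)     # easy -> short budget, hard -> long budget
    overshoot = max(0.0, (n_tokens - target) / budget)
    return base - 0.2 * overshoot

def kl_coefficient(difficulty: float, beta_min: float = 0.01, beta_max: float = 0.1) -> float:
    """Stay close to the reference policy on easy questions, explore more on hard ones."""
    return beta_max - (beta_max - beta_min) * difficulty

# Example: a correct but verbose answer to an easy question is mildly penalized.
print(length_reward(correct=True, n_tokens=600, difficulty=0.1), kl_coefficient(0.1))
```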


Paper and Project Links

PDF

Summary

Applying reinforcement learning (typically via GRPO) to large vision-language models either fails to scale reasoning length effectively or produces verbose outputs on all tasks with only marginal accuracy gains. FAST-GRPO is a GRPO variant that dynamically adapts reasoning depth to question characteristics. Empirical analysis of how response length and data distribution affect performance establishes the feasibility of fast-slow thinking in LVLMs. Two complementary metrics estimate question difficulty and guide the model on when fast or slow thinking is more appropriate; adaptive length-based rewards and difficulty-aware KL divergence are then incorporated into GRPO. Across seven reasoning benchmarks, FAST achieves state-of-the-art accuracy with over 10% relative improvement over the base model while reducing token usage by 32.7-67.3% compared with previous slow-thinking approaches, effectively balancing reasoning length and accuracy.

Key Takeaways

  1. Applying reinforcement learning to large vision-language models (LVLMs) struggles with reasoning length and verbose outputs.
  2. FAST-GRPO dynamically adapts reasoning depth to question characteristics.
  3. Empirical analysis establishes the feasibility of fast-slow thinking in LVLMs.
  4. Two complementary difficulty metrics guide the model on when fast or slow thinking is more appropriate.
  5. Adaptive length-based rewards and difficulty-aware KL divergence are incorporated into the GRPO algorithm.
  6. Across multiple benchmarks, FAST achieves high accuracy with a substantial relative improvement while reducing token usage.


Video-R1: Reinforcing Video Reasoning in MLLMs

Authors:Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, Xiangyu Yue

Inspired by DeepSeek-R1’s success in eliciting reasoning abilities through rule-based reinforcement learning (RL), we introduce Video-R1 as the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning within multimodal large language models (MLLMs). However, directly applying RL training with the GRPO algorithm to video reasoning presents two primary challenges: (i) a lack of temporal modeling for video reasoning, and (ii) the scarcity of high-quality video-reasoning data. To address these issues, we first propose the T-GRPO algorithm, which encourages models to utilize temporal information in videos for reasoning. Additionally, instead of relying solely on video data, we incorporate high-quality image-reasoning data into the training process. We have constructed two datasets: Video-R1-CoT-165k for SFT cold start and Video-R1-260k for RL training, both comprising image and video data. Experimental results demonstrate that Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench, as well as on general video benchmarks including MVBench and TempCompass, etc. Notably, Video-R1-7B attains a 37.1% accuracy on video spatial reasoning benchmark VSI-bench, surpassing the commercial proprietary model GPT-4o. All code, models, and data are released in: https://github.com/tulerfeng/Video-R1.
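
A rough sketch of a temporal-contrast reward in the spirit of T-GRPO: rollouts conditioned on temporally ordered frames earn a bonus when they beat rollouts conditioned on shuffled frames, which pushes the model to actually exploit temporal information. The `model.answer` interface, the sample counts, and the bonus rule are assumptions for illustration, not the paper's exact reward design.

```python
import random

def t_grpo_rewards(model, question, frames, gold_answer, n_samples=8, bonus=0.5, seed=0):
    """Return per-rollout rewards for the ordered-frame group, with a group-level
    bonus when temporal order measurably helps over shuffled frames."""
    rng = random.Random(seed)
    shuffled = frames[:]
    rng.shuffle(shuffled)
    ordered_hits = [float(model.answer(question, frames) == gold_answer) for _ in range(n_samples)]
    shuffled_hits = [float(model.answer(question, shuffled) == gold_answer) for _ in range(n_samples)]
    temporal_gain = (sum(ordered_hits) - sum(shuffled_hits)) / n_samples
    # Base correctness reward per ordered rollout, plus a bonus if order helped.
    return [hit + (bonus if temporal_gain > 0 else 0.0) for hit in ordered_hits]
```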


Paper and Project Links

PDF NeurIPS 2025, Project page: https://github.com/tulerfeng/Video-R1

Summary

Inspired by DeepSeek-R1's success in eliciting reasoning through rule-based reinforcement learning, Video-R1 is the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning in multimodal large language models (MLLMs). Directly applying RL training to video reasoning faces two challenges: the lack of temporal modeling and the scarcity of high-quality video-reasoning data. Video-R1 addresses these with the T-GRPO algorithm and by incorporating image-reasoning data, building the Video-R1-CoT-165k and Video-R1-260k datasets. Experiments show significant improvements on video reasoning benchmarks and general video benchmarks, surpassing the commercial proprietary model GPT-4o.

Key Takeaways

  1. Video-R1 is the first attempt to systematically explore the R1 paradigm for incentivizing video reasoning in multimodal large language models (MLLMs).
  2. Directly applying RL training to video reasoning faces two challenges: the lack of temporal modeling and the scarcity of high-quality video-reasoning data.
  3. The T-GRPO algorithm encourages the model to exploit temporal information in videos during reasoning.
  4. In addition to video data, high-quality image-reasoning data is incorporated via the Video-R1-CoT-165k and Video-R1-260k datasets.
  5. Video-R1 achieves significant improvements on video reasoning benchmarks such as VideoMMMU and VSI-Bench.
  6. Video-R1 also performs strongly on general video benchmarks such as MVBench and TempCompass.



Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!