⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:切勿用于严肃的学术场景,仅适合作为论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace免费体验
2025-11-13 更新
SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards
Authors:Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark
Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.
多模态大型语言模型(MLLMs)在视觉语言任务方面取得了显著进步,但在空间理解方面仍存在困难。现有的空间MLLMs通常依赖显式的3D输入或针对特定架构的修改,并受限于大规模数据集或稀疏监督。为了解决这些局限,我们引入了SpatialThinker,这是一个用强化学习(RL)训练的、具备3D感知能力的MLLM,旨在将结构化空间接地与多步推理相结合。该模型通过构建与任务相关的对象及空间关系的场景图来模拟类人空间感知,并借助密集空间奖励推理得出答案。SpatialThinker包含两项关键贡献:(1)一个数据合成管道,生成高质量的空间VQA数据集STVQA-7K;(2)采用多目标密集空间奖励强化空间接地的在线强化学习。SpatialThinker-7B在空间理解和现实世界VQA基准测试上的表现优于监督微调和稀疏RL基线,相比稀疏RL,其在基础模型之上的增益几乎翻倍,并超越了GPT-4o。这些结果表明,将空间监督与奖励对齐的推理相结合,能够在有限数据下实现稳健的3D空间理解,并推动MLLMs迈向人类水平的视觉推理。
论文及项目相关链接
PDF Preprint. Accepted at NeurIPS 2025 Workshops on SPACE in Vision, Language, and Embodied AI (SpaVLE), Embodied World Models for Decision Making (EWM), Aligning Reinforcement Learning Experimentalists and Theorists (ARLET), and Scaling Environments for Agents (SEA)
Summary
多模态大型语言模型(MLLMs)在视觉语言任务中取得了显著进展,但在空间理解方面仍面临挑战。为解决现有空间MLLMs对显式3D输入或特定架构修改的依赖,以及大规模数据集或稀疏监督的限制,我们推出了SpatialThinker。这是一款用强化学习训练而成的具备3D感知能力的MLLM,能够整合结构化空间接地与多步推理。它通过构建任务相关物体和空间关系的场景图来模拟人类的空间感知,并通过密集的空间奖励进行推理得出答案。SpatialThinker的主要贡献包括:(1)数据合成管道,生成高质量的空间VQA数据集STVQA-7K;(2)在线强化学习,采用多目标密集空间奖励来强化空间接地。在空间理解和现实世界VQA基准测试中,SpatialThinker-7B的表现优于监督微调方法和稀疏强化学习基线,相比稀疏RL,其在基础模型之上的增益几乎翻倍,并超越了GPT-4o。这些结果展示了空间监督与奖励对齐推理相结合的有效性,能够在有限数据下实现稳健的3D空间理解,并推动MLLM向人类水平的视觉推理发展。
Key Takeaways
- 多模态大型语言模型(MLLMs)在视觉语言任务中取得显著进展,但在空间理解方面存在挑战。
- 现有空间MLLMs常常依赖明确的3D输入或特定架构修改,并受限于大规模数据集和稀疏监督。
- SpatialThinker是一款用强化学习训练的具备3D感知能力的MLLM,能整合结构化空间接地与多步推理。
- SpatialThinker通过构建包含任务相关物体和空间关系的场景图来模拟人类的空间感知。
- SpatialThinker采用数据合成管道生成高质量的空间VQA数据集STVQA-7K。
- 在线强化学习与多目标密集空间奖励被用于强化空间接地。
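下面给出一个极简的Python示意,说明"多目标密集空间奖励"可以如何把答案正确性、输出格式与空间接地质量(以预测框与标注框的IoU衡量)组合成单一标量奖励。其中的权重、函数名与奖励组成均为假设,并非论文的原始实现。

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def iou(a: Box, b: Box) -> float:
    """计算两个边界框的交并比(IoU)。"""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def dense_spatial_reward(answer_correct: bool, format_ok: bool,
                         pred_boxes: List[Box], gt_boxes: List[Box],
                         w_ans: float = 1.0, w_fmt: float = 0.2,
                         w_ground: float = 0.5) -> float:
    """示意性的多目标密集奖励:答案 + 格式 + 空间接地(IoU);权重为假设值。"""
    ground = 0.0
    if pred_boxes and gt_boxes:
        # 对每个标注框取与之IoU最大的预测框,再取平均,作为接地质量
        ground = sum(max(iou(p, g) for p in pred_boxes) for g in gt_boxes) / len(gt_boxes)
    return w_ans * float(answer_correct) + w_fmt * float(format_ok) + w_ground * ground

print(dense_spatial_reward(True, True, [(0, 0, 10, 10)], [(1, 1, 10, 10)]))
```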
点此查看论文截图
SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
Authors:Zhi Zheng, Wee Sun Lee
The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexities of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects the Gumbel noise into logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% on average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% on average accuracy). Codes and weights are available on https://github.com/zz1358m/SofT-GRPO-master
大型语言模型(LLM)推理的软思考范式在某些场景中能优于传统的离散标记链式思维(CoT)推理,这凸显了它的研究和应用价值。然而,虽然离散标记的CoT推理模式可以通过群体相对策略优化(GRPO)等策略优化算法得到强化,但用强化学习(RL)来扩展软思考模式仍然具有挑战性。这一难点源于向软思考标记注入随机性并相应更新软思考策略的复杂性。因此,先前将软思考与GRPO相结合的尝试通常表现不如离散标记的GRPO对应方法。为了充分释放软思考的潜力,本文提出了一种新的策略优化算法SofT-GRPO,用于在软思考推理模式下强化LLM。SofT-GRPO向logits注入Gumbel噪声,采用Gumbel-Softmax技术避免软思考标记落在预训练嵌入空间之外,并在策略梯度中利用重参数化技巧。我们在参数规模从1.5B到7B的基础LLM上进行了实验,结果表明SofT-GRPO使软思考LLM在Pass@1上略优于离散标记GRPO(平均准确率提高0.13%),而在Pass@32上有显著提升(平均准确率提高2.19%)。相关代码和权重可在https://github.com/zz1358m/SofT-GRPO-master找到。
论文及项目相关链接
Summary
该文探讨了大型语言模型(LLM)的软思考范式与传统离散符号链式思维(CoT)范式的对比。软思考在某些场景下表现优异,但与强化学习(RL)结合时面临挑战。文章提出了一种新的策略优化算法SofT-GRPO,旨在强化软思考模式下的LLM。实验表明,SofT-GRPO在Pass@1上略优于离散符号GRPO,在Pass@32上则有显著提升。
Key Takeaways
- 软思考范式在大型语言模型(LLM)推理中展现出优于传统离散符号链思考(CoT)范式的潜力。
- 离散符号CoT推理模式可通过策略优化算法(如组相对策略优化GRPO)加强。
- 软思考与强化学习(RL)的结合面临挑战,源于向软思考标记注入随机性和相应更新软思考策略的复杂性。
- 尝试结合软思考与GRPO的方法通常表现不如离散符号GRPO。
- SofT-GRPO是一种新的策略优化算法,旨在强化软思考模式下的LLM。
- SofT-GRPO通过注入Gumbel噪声到logits、使用Gumbel-Softmax技术避免软思考标记超出预训练嵌入空间和使用策略梯度中的重参数化技巧来实现其目标。
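下面用PyTorch给出一个极简示意:对logits注入Gumbel噪声并经Gumbel-Softmax得到软思考标记的分布,再与词表嵌入做凸组合,使软思考表示落在预训练嵌入空间的凸包内,且整个过程可重参数化求梯度。这只是对摘要机制的示意性还原,并非官方实现。

```python
import torch
import torch.nn.functional as F

def soft_thinking_step(logits: torch.Tensor, embedding: torch.Tensor,
                       tau: float = 1.0) -> torch.Tensor:
    """logits: (batch, vocab);embedding: (vocab, dim)。
    通过Gumbel-Softmax得到软标记分布,再与词表嵌入做凸组合,
    返回 (batch, dim) 的软思考嵌入;梯度可经重参数化回传到logits。"""
    probs = F.gumbel_softmax(logits, tau=tau, hard=False)  # 注入Gumbel噪声并softmax
    return probs @ embedding                               # 凸组合,落在预训练嵌入的凸包内

# 用法示意(小词表仅为演示)
vocab, dim = 1000, 64
logits = torch.randn(2, vocab, requires_grad=True)
embedding = torch.randn(vocab, dim)
soft_embed = soft_thinking_step(logits, embedding, tau=0.7)
soft_embed.sum().backward()
print(soft_embed.shape, logits.grad is not None)
```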
点此查看论文截图
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection
Authors:Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, Zhe Gan
Multimodal large language models (MLLMs) have achieved remarkable progress on vision-language tasks, yet their reasoning processes remain sometimes unreliable. We introduce PRISM-Bench, a benchmark of puzzle-based visual challenges designed to evaluate not only whether models can solve problems, but how their reasoning unfolds. Unlike prior evaluations that measure only final-answer accuracy, PRISM-Bench introduces a diagnostic task: given a visual puzzle and a step-by-step chain-of-thought (CoT) containing exactly one error, models must identify the first incorrect step. This setting enables fine-grained assessment of logical consistency, error detection, and visual reasoning. The puzzles in PRISM-Bench require multi-step symbolic, geometric, and analogical reasoning, resisting shortcuts based on superficial pattern matching. Evaluations across state-of-the-art MLLMs reveal a persistent gap between fluent generation and faithful reasoning: models that produce plausible CoTs often fail to locate simple logical faults. By disentangling answer generation from reasoning verification, PRISM-Bench offers a sharper lens on multimodal reasoning competence and underscores the need for diagnostic evaluation protocols in the development of trustworthy MLLMs.
多模态大型语言模型(MLLMs)在视觉语言任务方面取得了显著的进步,但它们的推理过程有时仍然不可靠。我们引入了PRISM-Bench,这是一个基于谜题的视觉挑战基准测试,旨在不仅评估模型能否解决问题,还评估其推理过程如何展开。与以往仅测量最终答案准确性的评估不同,PRISM-Bench引入了一项诊断任务:给定一个视觉谜题和一个恰好包含一个错误的逐步思维链(CoT),模型必须识别出第一个错误的步骤。这种设置能够精细地评估逻辑一致性、错误检测和视觉推理。PRISM-Bench中的谜题需要进行多步骤的符号、几何和类比推理,抵制基于表面模式匹配的捷径。对最新MLLMs的评估显示,流畅生成和忠实推理之间存在持续差距:能够产生合理思维链的模型往往无法找出简单的逻辑错误。通过将答案生成与推理验证解耦,PRISM-Bench为评估多模态推理能力提供了更锐利的视角,并强调了在开发可靠MLLMs时需要诊断式评估协议。
论文及项目相关链接
Summary
MLLM在视觉语言任务上取得了显著进展,但其推理过程有时不可靠。为此,我们推出了PRISM-Bench基准测试,它由基于谜题的视觉挑战任务构成,旨在不仅评估模型能否解决问题,还评估其推理过程如何展开。与仅测量最终答案准确性的先前评估不同,PRISM-Bench引入了一项诊断任务:给定一个视觉谜题和一个恰好包含一个错误步骤的逐步推理过程(CoT),模型必须找出第一个错误的步骤。这一设定有助于精细评估逻辑一致性、错误检测和视觉推理能力。PRISM-Bench中的谜题要求多步骤的符号、几何和类比推理,抵抗基于表面模式匹配的捷径。对最新MLLM的评估显示,流畅生成与忠实推理之间存在持久差距:能够生成合理CoT的模型往往无法找到简单的逻辑错误。通过将答案生成与推理验证解耦,PRISM-Bench为评估多模态推理能力提供了更锐利的视角,并强调在开发可信赖的MLLM过程中需要诊断式评估协议。
Key Takeaways
- 多模态大型语言模型(MLLM)在视觉语言任务上取得显著进展,但推理过程有时不可靠。
- PRISM-Bench基准测试旨在不仅评估MLLM能否解决问题,还评估其推理过程如何展开。
- PRISM-Bench引入了一项新的诊断任务,要求模型在恰好包含一个错误的逐步推理过程中识别出第一个错误步骤。
- 这一设定能够精细评估模型的逻辑一致性、错误检测和视觉推理能力。
- PRISM-Bench中的谜题需要多步骤的符号、几何和类比推理,不易通过表面模式匹配来找到解决方案。
- 评估显示,当前MLLM在生成流畅答案和忠实执行推理之间存有明显差距。
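该诊断任务的打分可以简化为"预测的首个错误步骤编号是否与标注一致"。下面是一个示意性的评测函数,字段名为假设,仅用于说明评测形式:

```python
from typing import Dict, List

def first_error_accuracy(samples: List[Dict]) -> float:
    """samples 中每个元素形如
    {"pred_step": 模型预测的首个错误步骤编号, "gold_step": 标注的错误步骤编号}。
    返回命中率;字段名仅作示意,并非基准的官方格式。"""
    if not samples:
        return 0.0
    hits = sum(int(s["pred_step"] == s["gold_step"]) for s in samples)
    return hits / len(samples)

print(first_error_accuracy([
    {"pred_step": 3, "gold_step": 3},
    {"pred_step": 1, "gold_step": 4},
]))  # 0.5
```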
点此查看论文截图
Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
Authors:Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design preserves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points in average on math reasoning benchmarks. It also achieves up to 4.93× fewer rollouts and an up to 4.19× reduction in wall-clock time to match GRPO's best accuracy.
强化学习(RL)在提升大型语言模型(LLM)的推理能力方面发挥着核心作用。然而,诸如组相对策略优化(GRPO)之类的同策略(on-policy)算法在早期训练中经常遭遇困境:来自低质量rollout的噪声梯度导致更新不稳定和低效的探索。我们引入了慢快策略优化(SFPO),这是一种简单而高效的框架,通过将每一步分解为三个阶段来解决这些限制:在同一批次上由若干内步构成的短快速轨迹、控制离策略漂移的重定位机制,以及最后的慢速修正。这种"先重定位、后更新"的设计保持目标函数和rollout过程不变,使SFPO可即插即用地兼容现有的策略梯度管线。大量实验表明,SFPO持续提高稳定性、减少rollout数量,并加速推理RL训练的收敛。具体而言,它在数学推理基准测试中平均最多比GRPO高出2.80分;在达到GRPO的最佳准确率时,所需rollout最多减少4.93倍,墙钟时间最多减少4.19倍。
论文及项目相关链接
Summary
强化学习在提升大型语言模型的推理能力中扮演重要角色。然而,诸如Group Relative Policy Optimization(GRPO)等同策略算法在早期训练中常常面临挑战:来自低质量rollout的噪声梯度导致更新不稳定和低效的探索。为解决这些问题,我们推出Slow-Fast Policy Optimization(SFPO),这是一种将每步分解为三个阶段的简单而有效的框架:在同一批次上进行短时间的快速内步更新、控制离策略漂移的重定位机制,以及最终的慢速修正。这种先重定位后更新的设计保持目标和rollout过程不变,使得SFPO能够兼容现有的策略梯度管线。实验证明,SFPO在稳定性、减少rollout和加速推理训练收敛方面表现优异。特别地,它在数学推理基准测试中平均最高比GRPO高出2.80分,同时在达到GRPO最佳准确率时所需rollout最多减少4.93倍、墙钟时间最多减少4.19倍。
Key Takeaways
- 强化学习在提升大型语言模型推理能力中起关键作用。
- 基于策略的算法如GRPO在早期训练中面临噪声梯度和不稳定更新问题。
- SFPO框架通过分解为三个阶段来解决这些问题:快速轨迹、重新定位机制和慢速修正。
- SFPO与现有策略梯度管道兼容,保持目标值和rollout过程不变。
- 实验证明SFPO在稳定性、减少rollouts和加速训练收敛方面表现优异。
- SFPO在数学推理测试中表现优于GRPO,平均高出2.8点。
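下面给出一个与具体RL框架无关的示意实现:同一批rollout上先做若干"快"内步更新,再把参数向起点方向按系数收缩("重定位",以抑制离策略漂移),最后做一步"慢"修正。其中损失函数为占位实现,重定位的具体形式也是假设,仅用于说明"先重定位、后更新"的流程。

```python
import copy
import torch
import torch.nn as nn

def compute_policy_loss(policy: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    # 占位损失:实际应为GRPO等策略梯度目标,这里仅为让示例可运行
    return policy(batch).pow(2).mean()

def sfpo_step(policy: nn.Module, batch: torch.Tensor,
              optimizer: torch.optim.Optimizer,
              inner_steps: int = 3, alpha: float = 0.5) -> None:
    theta0 = copy.deepcopy(policy.state_dict())       # 记录快轨迹起点

    for _ in range(inner_steps):                      # (1) 快速内步:同一批数据上连续更新
        loss = compute_policy_loss(policy, batch)
        optimizer.zero_grad(); loss.backward(); optimizer.step()

    with torch.no_grad():                             # (2) 重定位:按系数收缩快轨迹位移(假设形式)
        for name, p in policy.named_parameters():
            p.copy_(theta0[name] + alpha * (p - theta0[name]))

    loss = compute_policy_loss(policy, batch)         # (3) 慢速修正:重定位后再做一步更新
    optimizer.zero_grad(); loss.backward(); optimizer.step()

policy = nn.Linear(4, 1)
opt = torch.optim.SGD(policy.parameters(), lr=0.1)
sfpo_step(policy, torch.randn(8, 4), opt)
```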
点此查看论文截图
GRPO-$λ$: Credit Assignment improves LLM Reasoning
Authors:Prasanna Parthasarathi, Mathieu Reymond, Boxing Chen, Yufei Cui, Sarath Chandar
Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. Especially RL based methods using verifiable reward, like the state-of-the-art GRPO, have shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO’s ability to assign fine-grained credit across token sequences. In this work, we present GRPO-$λ$, a novel extension to GRPO that enhances credit assignment in RL finetuning of LLMs for complex reasoning tasks. We approximate learning from $λ$-return with a reformulation of eligibility traces using token-level log-probabilities applied after each sequence generation, and a novel critic-free approximation of the temporal-difference error. We introduce a few variations for the weighting of the $λ$-return, and their applications to the eligibility-trace, where all the variations provide significant gains over GRPO. We compare GRPO-$λ$ against GRPO by training models from 1.5B to 7B parameters on $4$ different math reasoning datasets. The training plots demonstrate 30-40% improved performance during RL training on both LLaMA-3.1 and Qwen-2.5 architectures. Finally, we show that with GRPO-$λ$, the resulting average performance on AIME24, Math500, OlympiadMath, MinervaMath, and AMC improves over GRPO by over $3$ points and a $4.5$ points improvement on the 7B model.
大型语言模型(LLM)越来越多地被部署于需要复杂推理的任务中,这引发了通过后训练提高其推理能力的浓厚兴趣。特别是基于可验证奖励的强化学习方法,如最先进的GRPO,作为后训练方法已证明能极大地改善推理行为。然而,由于缺乏显式的奖励模型或critic模型,GRPO难以在token序列上进行细粒度的信用分配。在这项工作中,我们提出了GRPO-$λ$,这是GRPO的一个新颖扩展,可增强复杂推理任务中LLM强化学习微调的信用分配。我们通过在每次序列生成后应用基于token级对数概率重构的资格迹,以及一种新颖的无需critic的时序差分误差近似,来近似$λ$-回报学习。我们介绍了几种$λ$-回报加权的变体及其在资格迹上的应用,所有变体都相对GRPO带来了显著收益。我们在4个不同的数学推理数据集上训练1.5B到7B参数的模型,以比较GRPO-$λ$和GRPO。训练曲线表明,在LLaMA-3.1和Qwen-2.5架构上,RL训练期间性能提高了30-40%。最后,我们展示了使用GRPO-$λ$,在AIME24、Math500、OlympiadMath、MinervaMath和AMC上的平均性能比GRPO高出3分以上,在7B模型上提高了4.5分。
论文及项目相关链接
Summary
大型语言模型(LLMs)在需要复杂推理的任务中的部署越来越广泛,引发了通过后训练提高其推理能力的兴趣。特别是使用可验证奖励的基于强化学习(RL)的方法,如最新的GRPO,已被证明在后训练阶段能极大地改善推理行为。然而,由于缺乏显式的奖励模型或critic模型,GRPO在跨token序列分配细粒度信用方面存在局限性。本研究提出了GRPO-$λ$,这是GRPO的一个新颖扩展,能够增强LLMs在复杂推理任务中的信用分配能力。通过利用token级对数概率对资格迹进行重构,以及采用无需critic模型的时序差分误差近似,我们实现了从$λ$-回报中学习。我们还介绍了几种加权$λ$-回报的方法及其在资格迹中的应用,所有这些变体相较GRPO均有显著收益。通过在不同数学推理数据集上对规模从1.5B到7B的模型进行训练比较,证明了GRPO-$λ$的有效性。最终结果显示,在AIME24、Math500、OlympiadMath、MinervaMath和AMC等多个数据集上,相较于GRPO,使用GRPO-$λ$可提高平均性能超过3个点,并在7B模型上提高4.5个点。
Key Takeaways
- 大型语言模型(LLMs)在复杂推理任务中的部署引发了对提高推理能力的兴趣。
- 基于强化学习(RL)的方法,如GRPO,已在LLMs的后训练中显示出对改善推理行为的重要性。
- GRPO存在跨令牌序列分配精细粒度信用的局限性。
- 引入GRPO-$λ$扩展来解决此问题,它通过重构资格迹并利用无评论家模型的时序差分误差近似来实现从$λ$-回报中学习。
- 介绍了加权$λ$-回报的不同方法及其在资格迹中的应用。
- 对比实验表明,GRPO-$λ$在多个数学推理数据集上的性能优于GRPO。
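作为资格迹思想的最简示意,下面的函数把整条序列的标量奖励按 λ 的幂次分摊到各个token上(越靠近奖励的token权重越大)。该加权形式为假设,GRPO-$λ$ 的实际做法还结合了token级对数概率与无critic的时序差分误差近似。

```python
import torch

def lambda_weighted_credit(seq_reward: float, num_tokens: int,
                           lam: float = 0.95) -> torch.Tensor:
    """返回长度为 num_tokens 的逐token信用权重:
    第 t 个token的权重为 lam**(T-1-t),越靠近序列末尾(奖励处)权重越大。
    这只是资格迹思想的最简示意,并非GRPO-λ的原始公式。"""
    t = torch.arange(num_tokens)
    weights = lam ** (num_tokens - 1 - t)
    return seq_reward * weights

adv = lambda_weighted_credit(seq_reward=1.0, num_tokens=8, lam=0.9)
print(adv)  # 末端token获得更大的信用
```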
点此查看论文截图
WirelessMathLM: Teaching Mathematical Reasoning for LLMs in Wireless Communications with Reinforcement Learning
Authors:Xin Li, Mengbing Liu, Yiyang Zhu, Wenhe Zhang, Li Wei, Jiancheng An, Chau Yuen
Large language models (LLMs) excel at general mathematical reasoning but fail catastrophically on specialized technical mathematics. In wireless communications, where problems require precise manipulation of information-theoretic bounds, optimization constraints, and signal processing formulations, even state-of-the-art models struggle to achieve competent performance. We present WirelessMathLM, demonstrating that compact models (0.5B-7B parameters) can match or exceed much larger models through domain-specific reinforcement learning with verifiable rewards. Our key insight is that wireless mathematics problems possess a unique property–verifiable correctness–that enables effective reinforcement learning without human feedback. We construct WirelessMathBench-XL, a comprehensive benchmark of 4,027 problems from 970 papers. Using Group Relative Policy Optimization (GRPO) with binary verification rewards, we train models directly from base checkpoints without supervised warm-start. Our 7B model achieves 39.5% accuracy on WirelessMathBench-XL, approaching GPT-4o (40.4%) while using about 100 times fewer parameters than DeepSeek-R1 (671B, 57.4%). Remarkably, GRPO training nearly doubles performance across all model scales (0.5B +11%, 3B +103%, 7B +81%), with positive transfer to general mathematics benchmarks–our models gain +8.4 points on average across MATH, Minerva-Math, OlympiadBench, AMC, and AIME without any training on these tasks.
大型语言模型(LLM)在一般数学推理方面表现出色,但在专业技术数学上却会出现灾难性的失败。在无线通信领域,问题需要对信息论界限、优化约束和信号处理公式进行精确操作,即使是最先进的模型也难以达到合格的性能。我们提出了WirelessMathLM,证明通过领域特定的、带可验证奖励的强化学习,小型模型(0.5B-7B参数)可以媲美甚至超越大得多的模型。我们的关键见解是,无线数学问题具有可验证正确性这一独特属性,使得无需人类反馈即可进行有效的强化学习。我们构建了WirelessMathBench-XL,这是一个包含来自970篇论文的4027个问题的综合基准测试。我们使用带有二元验证奖励的Group Relative Policy Optimization(GRPO),直接从基础检查点训练模型,无需监督预热。我们的7B模型在WirelessMathBench-XL上达到了39.5%的准确率,接近GPT-4o(40.4%),而所用参数约为DeepSeek-R1(671B,57.4%)的1/100。值得注意的是,GRPO训练在所有模型规模上几乎使性能翻倍(0.5B提高11%,3B提高103%,7B提高81%),并正向迁移到一般数学基准测试:我们的模型在MATH、Minerva-Math、OlympiadBench、AMC和AIME上平均提高了8.4分,而无需在这些任务上进行任何训练。
论文及项目相关链接
PDF Project Homepage: https://lixin.ai/WirelessMathLM
Summary
大型语言模型在一般数学推理方面表现出色,但在专业技术数学领域存在显著缺陷。在无线通信等需要精确操作信息论界限、优化约束和信号处理公式的问题中,即使是最先进的模型也很难取得有竞争力的性能。本文提出WirelessMathLM,通过领域特定的、带可验证奖励的强化学习,证明紧凑模型(0.5B-7B参数)可以匹配或超过更大的模型。关键见解是,无线通信数学问题具有可验证的正确性,这使得无需人类反馈即可进行有效的强化学习。作者构建了WirelessMathBench-XL,这是一个包含来自970篇论文的4,027个问题的综合基准测试。使用Group Relative Policy Optimization(GRPO)和二元验证奖励,直接从基础检查点训练模型,无需监督预热。7B模型的准确率达到39.5%,接近GPT-4o(40.4%),而所用参数约为DeepSeek-R1(671B,57.4%)的百分之一。值得注意的是,GRPO训练几乎使所有规模模型的性能翻倍(0.5B+11%,3B+103%,7B+81%),并对一般数学基准测试产生了积极的迁移效果:模型在这些任务上平均提高了8.4分,无需任何针对性训练。
Key Takeaways
- 大型语言模型在一般数学推理上表现出色,但在专业技术数学领域存在缺陷。
- 无线通信中的数学问题具有可验证的正确性,这有助于强化学习的有效性。
- WirelessMathLM通过领域特定的强化学习和可验证奖励,展示紧凑模型可以匹配或超过大型模型性能。
- 构建了WirelessMathBench-XL基准测试,包含来自970篇论文的4,027个问题。
- 使用Group Relative Policy Optimization (GRPO) 和二元验证奖励进行模型训练,无需监督预热。
- 7B模型的准确性接近GPT-4o,同时使用的参数远少于其他模型。
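二元可验证奖励的核心非常简单:答案能被程序化验证为正确则奖励1,否则为0。下面是一个数值答案验证的示意(正则与容差均为简化假设,实际基准中的答案可能是表达式而非单个数值):

```python
import re

def verify_numeric_answer(model_output: str, ground_truth: float,
                          tol: float = 1e-6) -> float:
    """从模型输出中提取最后一个数值并与标准答案比较,返回二元奖励(1.0 / 0.0)。"""
    numbers = re.findall(r"-?\d+(?:\.\d+)?(?:[eE]-?\d+)?", model_output)
    if not numbers:
        return 0.0
    try:
        pred = float(numbers[-1])
    except ValueError:
        return 0.0
    return 1.0 if abs(pred - ground_truth) <= tol * max(1.0, abs(ground_truth)) else 0.0

print(verify_numeric_answer("所以信道容量约为 3.4594 bit/s/Hz", 3.4594))  # 1.0
```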
点此查看论文截图
Variational Reasoning for Language Models
Authors:Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang
We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
我们为语言模型引入了一个变分推理框架,该框架将思考轨迹视为潜在变量,并通过变分推理对其进行优化。我们从证据下界(ELBO)出发,将其扩展为多轨迹目标以获取更紧密界限,并提出前向KL公式以稳定变分后验的训练。我们还表明,拒绝采样微调和二值奖励强化学习(包括GRPO)可以解释为局部前向KL目标,其中模型准确性的隐式加权自然产生于推导中,并揭示了之前未被注意到的对更简单问题的偏向。我们在广泛的推理任务上对Qwen 2.5和Qwen 3模型家族进行了实证验证。总的来说,我们的工作提供了一个有原则的概率视角,将变分推理与RL风格的方法统一起来,并产生了稳定的目标,以提高语言模型的推理能力。我们的代码位于https://github.com/sail-sg/variational-reasoning。
论文及项目相关链接
Summary
本文介绍了一种用于语言模型的变分推理框架,将思考轨迹视为潜在变量并通过变分推断进行优化。作者将证据下界(ELBO)扩展为多轨迹目标以获得更紧的界,并提出前向KL公式来稳定变分后验的训练。同时,本文揭示了拒绝采样微调与二元奖励强化学习(包括GRPO)可解释为局部前向KL目标,其中模型准确率的隐式加权自然产生于推导之中,并揭示了一个先前未被注意到的、偏向较简单问题的偏差。在Qwen 2.5和Qwen 3模型家族上进行的实验验证了该方法在广泛推理任务上的有效性。总体而言,本文提供了一个有原则的概率视角,将变分推断与RL风格方法统一起来,并为提高语言模型的推理能力提供了稳定的优化目标。
Key Takeaways
- 引入变分推理框架,将思考轨迹视为潜在变量,通过变分推断优化。
- 将证据下界(ELBO)扩展至多轨迹目标,以获得更紧的界。
- 提出前向KL公式,稳定变分后验的训练。
- 揭示拒绝采样微调与二元奖励强化学习(包括GRPO)可解释为局部前向KL目标。
- 模型准确率的隐式加权自然产生于推导中,揭示了一个先前未被注意到的、偏向较简单问题的偏差。
- 在多个模型家族上进行的实证验证了该方法在广泛推理任务上的有效性。
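补充一个符号化的示意(根据摘要自行整理,记号系假设,可能与论文原文不同):设 $x$ 为问题、$y$ 为最终答案、$z$ 为思考轨迹(潜变量),$q_\phi$ 为变分后验,则单轨迹ELBO与多轨迹的更紧下界(IWAE式的标准形式)大致如下:

```latex
% 单轨迹 ELBO
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z \sim q_\phi(z \mid x, y)}
  \left[ \log \frac{p_\theta(y, z \mid x)}{q_\phi(z \mid x, y)} \right]

% 多轨迹(K 条思考轨迹)的更紧下界
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{z_1,\dots,z_K \sim q_\phi}
  \left[ \log \frac{1}{K} \sum_{k=1}^{K}
  \frac{p_\theta(y, z_k \mid x)}{q_\phi(z_k \mid x, y)} \right]
```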
点此查看论文截图
Tree Search for LLM Agent Reinforcement Learning
Authors:Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
近期强化学习(RL)的进展极大地增强了大型语言模型(LLM)的代理能力。在长期和多轮代理任务中,仅由结果奖励驱动的现有方法常常面临监督稀疏的问题。为解决这一挑战,我们提出了基于树搜索的树状群组相对策略优化(Tree-GRPO),这是一种分组代理RL方法,其中每个树节点代表完整的代理交互步骤。通过共享公共前缀,树搜索采样增加了在固定的token或工具调用预算内可实现的rollout次数。此外,我们发现树状轨迹结构天然允许仅使用结果奖励即可构建逐步的过程监督信号。基于此,Tree-GRPO在树内和树间两个层级估计分组相对优势。通过理论分析,我们证明树内层级的分组相对策略优化目标与步骤级直接偏好学习的目标等价。在11个数据集和3种问答任务上的实验表明,所提出的基于树的RL方法优于基于链的RL方法。
论文及项目相关链接
Summary
强化学习(RL)的最新进展极大地提升了大型语言模型(LLM)的代理能力。在长期和多轮代理任务中,仅由结果奖励驱动的方法常常面临监督稀疏的问题。为解决此挑战,我们提出了基于树搜索的分组代理RL方法——Tree-GRPO。树中的每个节点代表完整的代理交互步骤。通过共享公共前缀,树搜索采样在固定的token或工具调用预算内增加了可实现的rollout次数。此外,树结构轨迹天然地允许仅使用结果奖励构建逐步的过程监督信号。基于此,Tree-GRPO估计了树内和树间的分组相对优势。理论分析表明,树内层级的群组相对策略优化目标与步骤级直接偏好学习的目标等价。在11个数据集和3种问答任务上的实验表明,基于树的RL优于基于链的RL方法。
Key Takeaways
- 强化学习在提升大型语言模型的代理能力方面取得显著进展。
- 在长期和多轮代理任务中,现有方法面临监督稀疏的问题。
- Tree-GRPO是一种基于树搜索的分组代理RL方法,通过共享公共前缀增加rollout次数并提高监督效率。
- 树结构轨迹可构建逐步过程监督信号,仅使用结果奖励。
- Tree-GRPO可估计树内和树间的分组相对优势。
- 树内级别的群组相对策略优化目标与步骤级别的直接偏好学习目标等效。
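以树内层级为例给出一个示意:把共享同一父节点(即公共前缀)的若干rollout视为一组,用组内奖励的均值和标准差做相对优势归一化;树间层级可在整棵树的粒度上重复同样的操作。以下字段名与分组粒度均为根据摘要的假设:

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

def intra_tree_advantages(rollouts: List[Dict]) -> List[Tuple[str, float]]:
    """rollouts 中每个元素形如 {"id": ..., "parent": 父节点id, "reward": 标量奖励}。
    对共享同一父节点(公共前缀)的rollout做组内标准化,返回 (id, 优势) 列表。"""
    groups: Dict[str, List[Dict]] = defaultdict(list)
    for r in rollouts:
        groups[r["parent"]].append(r)

    advantages = []
    for members in groups.values():
        rewards = [m["reward"] for m in members]
        mu, sigma = mean(rewards), pstdev(rewards)
        for m in members:
            adv = (m["reward"] - mu) / sigma if sigma > 0 else 0.0
            advantages.append((m["id"], adv))
    return advantages

print(intra_tree_advantages([
    {"id": "a", "parent": "n1", "reward": 1.0},
    {"id": "b", "parent": "n1", "reward": 0.0},
    {"id": "c", "parent": "n2", "reward": 1.0},
]))
```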
点此查看论文截图
Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning
Authors:Haozhe Wang, Qixin Xu, Che Liu, Junhong Wu, Fangzhen Lin, Wenhu Chen
Reinforcement Learning (RL) has proven highly effective at enhancing the complex reasoning abilities of Large Language Models (LLMs), yet underlying mechanisms driving this success remain largely opaque. Our analysis reveals that puzzling phenomena like "aha moments", "length-scaling" and entropy dynamics are not disparate occurrences but hallmarks of an emergent reasoning hierarchy, akin to the separation of high-level strategic planning from low-level procedural execution in human cognition. We uncover a compelling two-phase dynamic: initially, a model is constrained by procedural correctness and must improve its low-level skills. The learning bottleneck then decisively shifts, with performance gains being driven by the exploration and mastery of high-level strategic planning. This insight exposes a core inefficiency in prevailing RL algorithms like GRPO, which apply optimization pressure agnostically and dilute the learning signal across all tokens. To address this, we propose Hierarchy-Aware Credit Assignment (HICRA), an algorithm that concentrates optimization efforts on high-impact planning tokens. Our extensive experiments validate that HICRA significantly outperforms strong baselines, and offer deep insights into how reasoning advances through the lens of strategic exploration.
强化学习(RL)已经证明可以有效提高大型语言模型(LLM)的复杂推理能力,但驱动这一成功的潜在机制仍然大多不明确。我们的分析揭示,像"啊哈时刻"、"长度缩放"和熵动力学等令人困惑的现象并不是孤立发生的,而是新兴推理层次的标志,类似于人类认知中高级战略规划与低级程序执行的分离。我们发现了引人注目的两阶段动态过程:最初,模型受到程序正确性的约束,必须提高其低级技能;随后学习瓶颈发生决定性转移,性能增益源于对高级战略规划的探索和掌握。这一洞察揭示了现行RL算法(如GRPO)的核心低效之处:它们不加区分地施加优化压力,并在所有令牌上稀释学习信号。为了解决这一问题,我们提出了Hierarchy-Aware Credit Assignment(HICRA)算法,将优化工作集中在高影响力的规划令牌上。我们的大量实验验证了HICRA显著优于强大的基线,并从战略探索的视角深入揭示了推理能力是如何提升的。
论文及项目相关链接
PDF Preprint
Summary
强化学习(RL)在提升大型语言模型(LLM)的复杂推理能力方面表现出显著效果,但其背后的机制仍大多不明确。本文揭示了一些现象,如“啊哈时刻”、“长度缩放”和熵动力学,它们并非孤立存在,而是新兴推理层次的标志,类似于人类认知中高级战略规划与低级程序执行的分离。研究发现一个引人注目的两阶段动态过程:初期,模型受程序正确性约束,必须改进低级技能。学习瓶颈随后发生决定性转变,性能提升源于高级战略规划的探索与掌握。这揭示了当前RL算法(如GRPO)的核心低效之处,即优化压力的应用具有盲目性,信号分散在所有标记上。为解决这一问题,本文提出了层次感知信用分配(HICRA)算法,该算法集中优化努力于高影响规划标记上。大量实验验证HICRA显著优于强劲基线,并深入探讨了战略探索下推理能力如何提升。
Key Takeaways
- 强化学习在提升大型语言模型的推理能力方面具有显著效果。
- 模型在学习的初始阶段主要关注低级技能的改进。
- 学习过程中会出现从低级到高级的技能转变,表现为一种“层次性的动态”。
- 当前强化学习算法在优化上存在核心低效问题,信号分散在所有标记上。
- 提出了层次感知信用分配(HICRA)算法,专注于优化高影响规划标记。
- HICRA算法在实验中显著优于现有强劲基线。
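HICRA的核心是把优化压力集中到高影响力的"规划token"上。下面的示意把它简化为:对规划token的优势乘以一个放大系数;规划token的识别方式与放大系数均为假设,论文有其自身的判别与加权方法。

```python
import torch

def hierarchy_aware_advantages(token_advantages: torch.Tensor,
                               planning_mask: torch.Tensor,
                               boost: float = 2.0) -> torch.Tensor:
    """token_advantages: (seq_len,) 逐token优势;
    planning_mask: (seq_len,) 0/1 掩码,1表示高层规划token;
    将规划token的优势放大 boost 倍,其余保持不变(仅为示意)。"""
    scale = 1.0 + (boost - 1.0) * planning_mask.float()
    return token_advantages * scale

adv = torch.tensor([0.5, 0.5, 0.5, 0.5])
mask = torch.tensor([1, 0, 0, 1])
print(hierarchy_aware_advantages(adv, mask))  # tensor([1.0, 0.5, 0.5, 1.0])
```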
点此查看论文截图
ARM: Adaptive Reasoning Model
Authors:Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao
While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the “overthinking” problem – excessive and unnecessary reasoning – which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones – Direct Answer, Short CoT, and Code – as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens – ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
尽管大型推理模型在复杂任务上表现出强大的性能,但它们缺乏根据任务难度调整推理令牌使用量的能力。这常常导致"过度思考"问题,即过度且不必要的推理。虽然可以通过人工干预控制令牌预算来部分缓解这一问题,但这仍然与实现完全自主的人工智能的目标相悖。在这项工作中,我们提出了自适应推理模型(ARM),它能够根据手头任务自适应地选择合适的推理格式。这些格式包括三种高效格式——直接回答、简短CoT和代码,以及一种更详细的格式,即长CoT。为了训练ARM,我们引入了Ada-GRPO,它是对群体相对策略优化(GRPO)的改进,解决了传统GRPO中的格式崩溃问题。Ada-GRPO使ARM实现了高令牌效率,平均减少30%、最多减少70%的令牌使用,同时保持与仅依赖长CoT的模型相当的性能。此外,它不仅通过减少令牌生成提高了推理效率,还使训练速度提升了两倍。除了默认的自适应模式外,ARM还支持两种额外的推理模式:1)指令引导模式,允许用户通过特殊令牌显式指定推理格式,当一批任务的合适格式已知时尤为适用;2)共识引导模式,聚合三种高效格式的输出,并在出现分歧时回退到长CoT,以更高的令牌消耗为代价优先保证性能。
论文及项目相关链接
PDF NeurIPS 2025 (Spotlight)
Summary
该文指出大型推理模型在处理复杂任务时表现出强大的性能,但它们缺乏根据任务难度调整推理令牌使用量的能力,这导致了"过度思考"的问题,即过度和不必要的推理。尽管可以通过人为干预控制令牌预算来减轻这一问题,但这与实现完全自主的人工智能的目标相矛盾。为此,本文提出了自适应推理模型(ARM),该模型能够根据任务选择适当的推理格式。ARM引入了Ada-GRPO训练方法,解决了传统GRPO中的格式崩溃问题,使ARM实现了高令牌效率,平均减少30%、最多减少70%的令牌,同时保持与仅依赖Long CoT的模型相当的性能。此外,ARM还提高了推理效率,减少了令牌生成,并实现了2倍的训练速度。ARM还支持两种额外的推理模式:指令引导模式和共识引导模式。
Key Takeaways
- 大型推理模型虽然能处理复杂任务,但缺乏根据任务难度调整推理的能力,导致“过度思考”问题。
- 自适应推理模型(ARM)能根据任务选择适当的推理格式,包括Direct Answer、Short CoT、Code和Long CoT等。
- Ada-GRPO训练方法解决了格式崩溃问题,提高了符号效率,平均减少30%的符号使用,最高可达70%。
- Ada-GRPO训练方法保持了与仅依赖Long CoT的模型相当的性能。
- ARM提高了推理效率,减少了令牌生成,并实现了2倍的训练速度。
- ARM支持指令引导模式,允许用户通过特殊符号明确指定推理格式。
点此查看论文截图
Enhancing Efficiency and Exploration in Reinforcement Learning for LLMs
Authors:Mengqi Liao, Xiangyu Xi, Ruinian Chen, Jia Leng, Yangen Hu, Ke Zeng, Shuai Liu, Huaiyu Wan
Reasoning large language models (LLMs) excel in complex tasks, which has drawn significant attention to reinforcement learning (RL) for LLMs. However, existing approaches allocate an equal number of rollouts to all questions during the RL process, which is inefficient. This inefficiency stems from the fact that training on simple questions yields limited gains, whereas more rollouts are needed for challenging questions to sample correct answers. Furthermore, while RL improves response precision, it limits the model’s exploration ability, potentially resulting in a performance cap below that of the base model prior to RL. To address these issues, we propose a mechanism for dynamically allocating rollout budgets based on the difficulty of the problems, enabling more efficient RL training. Additionally, we introduce an adaptive dynamic temperature adjustment strategy to maintain the entropy at a stable level, thereby encouraging sufficient exploration. This enables LLMs to improve response precision while preserving their exploratory ability to uncover potential correct pathways. The code and data is available on: https://github.com/LiaoMengqi/E3-RL4LLMs
推理大型语言模型(LLMs)在复杂任务上表现出色,这引起了人们对将强化学习(RL)应用于LLMs的关注。然而,现有方法在强化学习过程中为所有问题分配等量的rollout次数,效率较低。这种低效源于:在简单问题上训练所获得的收益有限,而有挑战性的问题需要更多的rollout才能采样到正确答案。此外,虽然强化学习提高了响应精度,但它会限制模型的探索能力,可能导致性能上限低于强化学习之前的基础模型。为了解决这些问题,我们提出了一种基于问题难度动态分配rollout预算的机制,以实现更高效的强化学习训练。此外,我们还引入了一种自适应动态温度调整策略,将熵维持在稳定水平,从而鼓励充分探索。这使得LLMs能够在提高响应精度的同时,保持其探索能力,发现潜在的正确路径。相关代码和数据可在以下网址找到:https://github.com/LiaoMengqi/E3-RL4LLMs。
论文及项目相关链接
PDF Accept by EMNLP 2025 main
Summary
强化学习(RL)在大规模语言模型(LLM)中的应用已引起广泛关注。然而,现有方法为所有问题分配等量的rollout采样,这带来了效率问题:针对简单问题的训练收益有限,而复杂问题需要更多的rollout才能采样到正确答案。为了解决这些问题,我们提出了一种根据问题难度动态分配rollout预算的机制以提高效率。此外,我们还引入了自适应动态温度调整策略,将熵维持在稳定水平,以保持模型的探索能力。该研究有助于在LLM的响应精度和探索能力之间取得平衡。更多信息可访问我们的GitHub仓库:https://github.com/LiaoMengqi/E3-RL4LLMs。
Key Takeaways
- 强化学习在大规模语言模型中的应用是关键的研究领域。
- 当前强化学习方法为所有问题分配等量的rollout,存在效率问题,需要根据问题难度进行动态分配。
- 简单问题的训练收益有限,而复杂问题需要更多的rollout采样。
- 提出了一种动态分配rollout预算的机制来提高强化学习的效率。
- 引入了自适应动态温度调整策略来保持模型的探索能力。
- 该研究有助于平衡大规模语言模型的响应精度和模型探索能力。
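下面给出两个相互独立的小示意:(1) 以历史正确率作为难度代理,按难度比例分配rollout预算,难题获得更多采样;(2) 当策略熵偏离目标值时按比例调节采样温度以维持探索。两处公式均为假设,仅体现摘要所述思路。

```python
from typing import List

def allocate_rollouts(pass_rates: List[float], total_budget: int,
                      min_per_q: int = 2) -> List[int]:
    """pass_rates[i] 为第 i 个问题的历史正确率;正确率越低(越难)分得越多rollout。
    按 (1 - 正确率) 成比例分配,仅为示意。"""
    difficulty = [max(1.0 - p, 1e-3) for p in pass_rates]
    z = sum(difficulty)
    return [max(min_per_q, round(total_budget * d / z)) for d in difficulty]

def adjust_temperature(temp: float, entropy: float, target_entropy: float,
                       lr: float = 0.1, t_min: float = 0.5, t_max: float = 1.5) -> float:
    """熵低于目标则升温鼓励探索,高于目标则降温;更新规则为示意性假设。"""
    temp = temp * (1.0 + lr * (target_entropy - entropy))
    return min(max(temp, t_min), t_max)

print(allocate_rollouts([0.9, 0.5, 0.1], total_budget=24))       # 难题获得更多采样
print(adjust_temperature(1.0, entropy=0.8, target_entropy=1.2))  # 熵过低 -> 升温
```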
点此查看论文截图
Auditing Meta-Cognitive Hallucinations in Reasoning Large Language Models
Authors:Haolang Lu, Yilian Liu, Jingxin Xu, Guoshun Nan, Yuanlong Yu, Zhican Chen, Kun Wang
The development of Reasoning Large Language Models (RLLMs) has significantly improved multi-step reasoning capabilities, but it has also made hallucination problems more frequent and harder to eliminate. While existing approaches mitigate hallucinations through external knowledge integration, model parameter analysis, or self-verification, they often fail to capture how hallucinations emerge and evolve across the reasoning chain. In this work, we study the causality of hallucinations under constrained knowledge domains by auditing the Chain-of-Thought (CoT) trajectory and assessing the model’s cognitive confidence in potentially erroneous or biased claims. Our analysis reveals that in long-CoT settings, RLLMs can iteratively reinforce biases and errors through flawed reflective reasoning, eventually leading to hallucinated reasoning paths. Surprisingly, even direct interventions at the origin of hallucinations often fail to reverse their effects, as reasoning chains exhibit ‘chain disloyalty’ – a resistance to correction and a tendency to preserve flawed logic. Furthermore, we show that existing hallucination detection methods are less reliable and interpretable than previously assumed in complex reasoning scenarios. Unlike methods such as circuit tracing that require access to model internals, our black-box auditing approach supports interpretable long-chain hallucination attribution, offering better generalizability and practical utility. Our code is available at: https://github.com/Winnie-Lian/AHa_Meta_Cognitive
推理大型语言模型(RLLMs)的发展显著提高了多步推理能力,但同时也使幻觉问题更加频繁且难以消除。虽然现有方法通过外部知识整合、模型参数分析或自我验证来缓解幻觉,但它们往往无法捕捉幻觉如何在推理链中产生和演变。在这项工作中,我们通过审计思维链(CoT)轨迹并评估模型对潜在错误或带偏见主张的认知信心,来研究受限知识域下幻觉的因果关系。我们的分析揭示,在长CoT设置中,RLLMs会通过有缺陷的反思推理不断强化偏见和错误,最终导致幻觉推理路径。令人惊讶的是,即使在幻觉的源头进行直接干预也往往无法逆转其影响,因为推理链表现出"链不忠":对修正的抵抗以及保留错误逻辑的倾向。此外,我们表明,在复杂推理场景中,现有幻觉检测方法的可靠性和可解释性不如以往认为的那样高。不同于电路追踪等需要访问模型内部的方法,我们的黑盒审计方法支持可解释的长链幻觉归因,具有更好的泛化性和实用性。我们的代码可在:https://github.com/Winnie-Lian/AHa_Meta_Cognitive 找到。
论文及项目相关链接
PDF Accepted by NeurIPS 2025 (37 pages)
Summary
大型语言模型的推理发展虽提升了多步推理能力,但也增加了出现幻觉问题的频率和消除难度。现有方法通过外部知识整合、模型参数分析或自我验证来减轻幻觉问题,但难以捕捉幻觉在推理链中的产生和演变过程。本研究通过审计思维链轨迹和评估模型对潜在错误或偏见主张的认知信心,探讨幻觉的因果关系。分析显示,在长篇思维链设置中,大型语言模型可通过缺陷反思推理,不断强化的偏见和错误,最终导致幻觉推理路径。即使对幻觉产生的源头进行直接干预,也常无法消除其影响,因为思维链展现出“链不忠”的特性,即抵抗纠正并倾向于维持错误逻辑。此外,现有的幻觉检测方法在复杂推理场景中不如先前假设的那么可靠和可解释。本研究采用的黑盒审计方法支持可解释的长链幻觉归因,提供更好的通用性和实用性。
Key Takeaways
- 大型语言模型的多步推理能力显著提高,但幻觉问题更频繁且难以消除。
- 现有方法难以捕捉幻觉在推理链中的产生和演变。
- 研究通过审计思维链轨迹和评估认知信心来探讨幻觉的因果关系。
- 在长篇思维链设置中,大型语言模型会通过缺陷反思推理强化偏见和错误。
- 思维链展现出“链不忠”特性,抵抗纠正并维持错误逻辑。
- 现有的幻觉检测方法在复杂推理场景中其可靠性和可解释性受限。
- 研究采用的黑盒审计方法支持可解释的长链幻觉归因,具备更好的通用性和实用性。
点此查看论文截图
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Authors:Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, Tianbao Yang
The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for an 1.5B model.
DeepSeek-R1的近期成功和开放性使组相对策略优化(GRPO)作为一种用于大型推理模型(LRMs)的强化学习方法受到广泛关注。在这项工作中,我们在二元奖励设置下分析了GRPO的目标,并揭示了其在问题级难度偏差方面的内在局限。我们还发现了GRPO与传统监督学习中判别式方法之间的联系。受这些见解的启发,我们引入了基于判别学习原理的新型判别约束优化(DisCO)框架,用于强化LRMs。DisCO与GRPO及其近期变体的主要区别在于:(1)它用由评分函数定义的判别目标替代组相对目标;(2)它放弃基于裁剪的代理目标,转而使用非裁剪的RL代理目标作为评分函数;(3)它采用简单有效的约束优化方法来施加KL散度约束。因此,DisCO相对于GRPO及其变体具有显著优势:(i)通过采用判别目标,它完全消除了难度偏差;(ii)它通过非裁剪评分函数和约束优化方法解决了GRPO及其变体中的熵不稳定问题,带来长期稳定的训练动态;(iii)它允许引入先进的判别学习技术来应对数据不平衡问题,即训练过程中大量问题的生成答案中负例多于正例。我们在增强SFT微调模型数学推理能力方面的实验表明,DisCO显著优于GRPO及其改进变体(如DAPO):在六个基准任务上,1.5B模型相对GRPO平均提升7%、相对DAPO平均提升6%。
论文及项目相关链接
PDF Accepted to NeurIPS 2025
Summary
DeepSeek-R1的成功和开放性使Group Relative Policy Optimization(GRPO)作为大型推理模型(LRMs)的强化学习方法受到广泛关注。在这项工作中,我们在二元奖励设置下分析了GRPO的目标,并揭示了其问题级难度偏差的固有局限。我们还发现了GRPO与传统监督学习中判别式方法之间的联系。受这些见解的启发,我们为强化LRMs引入了新的Discriminative Constrained Optimization(DisCO)框架,该框架基于判别学习的原则。DisCO与GRPO及其最近变体之间的主要区别在于:(1)它用由评分函数定义的判别目标替代组相对目标;(2)它放弃基于裁剪的代理目标,转而使用非裁剪的RL代理目标作为评分函数;(3)它采用简单有效的约束优化方法来施加KL散度约束。因此,DisCO相对于GRPO及其变体具有显著优势:(i)通过采用判别目标,它完全消除了难度偏差;(ii)它通过非裁剪评分函数和约束优化方法解决了GRPO及其变体中的熵不稳定问题,产生长期稳定的训练动态;(iii)它允许引入先进的判别学习技术来解决数据不平衡问题,即训练过程中大量问题的生成答案中负例多于正例。实验表明,在提高SFT微调模型的数学推理能力方面,DisCO显著优于GRPO及其改进变体如DAPO:在六个基准任务上,DisCO相对GRPO平均提升7%,相对DAPO平均提升6%。
Key Takeaways
- DeepSeek-R1的成功引领了对Group Relative Policy Optimization(GRPO)强化学习方法的关注,该方法用于大型推理模型(LRMs)。
- 在二元奖励设置下分析了GRPO,揭示了其问题级难度偏差的局限性。
- 发现了GRPO与传统监督学习中的判别方法之间的联系。
- 引入了新的DisCO框架,采用判别学习目标,消除了难度偏见。
- DisCO通过非裁剪评分函数和约束优化方法解决了GRPO中的熵不稳定问题,实现了长期稳定的训练。
- DisCO允许结合先进的判别学习技术处理数据不平衡问题。
点此查看论文截图
Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning
Authors:Hu Wang, Congbo Ma, Ian Reid, Mohammad Yaqub
The advantage function is a central concept in RL that helps reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) was proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group. However, it can lead to high variance when the reward advantage is inaccurately predicted. In this work, we propose Kalman Filter Enhanced Group Relative Policy Optimization (KRPO) model, by using lightweight Kalman filtering to dynamically estimate the latent reward baseline and uncertainty. This filtering technique replaces the naive group mean, enabling more adaptive advantage normalization. Our method does not require additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through the accuracies and rewards obtained from math question answering and reasoning, we show that using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO. The code is available at https://github.com/billhhh/KRPO_LLMs_RL.
优势函数是强化学习中的一个核心概念,有助于减少策略梯度估计中的方差。最近,针对语言建模提出的Group Relative Policy Optimization(GRPO)以组内所有输出的平均奖励作为基线,通过减去该基线来计算每个输出的优势。然而,当奖励优势预测不准确时,这可能导致较高的方差。在本工作中,我们提出了卡尔曼滤波增强的Group Relative Policy Optimization(KRPO)模型,通过轻量级卡尔曼滤波动态估计潜在的奖励基线及其不确定性。这种滤波技术取代了简单的组均值,实现了更自适应的优势归一化。我们的方法不需要在GRPO之上增加额外的可学习参数。该方法为将GRPO的多个输出纳入优势估计提供了一种简单而有效的途径,在语言模型难以建模高度动态奖励信号的场景下改善了策略优化。通过数学问答和推理任务所获得的准确率和奖励,我们证明了使用更自适应的优势估计模型KRPO可以提高GRPO的稳定性和性能。代码可在 https://github.com/billhhh/KRPO_LLMs_RL 找到。
论文及项目相关链接
Summary
基于强化学习中的优势函数概念,提出了Kalman滤波增强组相对策略优化(KRPO)模型。该模型使用轻量级Kalman滤波器动态估计潜在奖励基准值和不确定性,改进了语言建模中的策略优化。通过数学问答和推理任务的准确性和奖励来证明,KRPO能提高GRPO的稳定性和性能。
Key Takeaways
- 优势函数在策略梯度估计中用于减少方差。
- Group Relative Policy Optimization (GRPO)用于计算语言模型的输出优势。
- GRPO在计算优势时采用组均值作为基线,但可能导致高方差。
- KRPO模型使用Kalman滤波器动态估计奖励基线和不确定性,改进了优势估计。
- KRPO方法不需要额外学习参数,简化了优势估算过程。
- KRPO提高了GRPO在动态奖励信号难以建模情况下的稳定性和性能。
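下面用一个标量卡尔曼滤波器示意如何在线估计潜在奖励基线及其不确定性,并用该基线(而非组均值)来计算优势。噪声参数与更新方式均为示意性假设,并非论文原始实现。

```python
class KalmanRewardBaseline:
    """一维卡尔曼滤波器,在线跟踪潜在奖励基线(示意实现,参数为假设值)。"""

    def __init__(self, init_mean: float = 0.0, init_var: float = 1.0,
                 process_var: float = 1e-3, obs_var: float = 0.25):
        self.mean = init_mean   # 基线估计
        self.var = init_var     # 估计不确定性
        self.q = process_var    # 过程噪声
        self.r = obs_var        # 观测噪声

    def update(self, observed_reward: float) -> float:
        self.var += self.q                        # 预测:不确定性随时间增长
        k = self.var / (self.var + self.r)        # 卡尔曼增益
        self.mean += k * (observed_reward - self.mean)
        self.var *= (1.0 - k)
        return self.mean

baseline = KalmanRewardBaseline()
rewards = [0.0, 1.0, 1.0, 0.0, 1.0]
for r in rewards:
    baseline.update(r)
advantages = [r - baseline.mean for r in rewards]  # 用滤波基线替代组均值(示意)
print(round(baseline.mean, 3), [round(a, 3) for a in advantages])
```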
点此查看论文截图
MrGuard: A Multilingual Reasoning Guardrail for Universal LLM Safety
Authors:Yahan Yang, Soham Dan, Shuo Li, Dan Roth, Insup Lee
Large Language Models (LLMs) are susceptible to adversarial attacks such as jailbreaking, which can elicit harmful or unsafe behaviors. This vulnerability is exacerbated in multilingual settings, where multilingual safety-aligned data is often limited. Thus, developing a guardrail capable of detecting and filtering unsafe content across diverse languages is critical for deploying LLMs in real-world applications. In this work, we introduce a multilingual guardrail with reasoning for prompt classification. Our method consists of: (1) synthetic multilingual data generation incorporating culturally and linguistically nuanced variants, (2) supervised fine-tuning, and (3) a curriculum-based Group Relative Policy Optimization (GRPO) framework that further improves performance. Experimental results demonstrate that our multilingual guardrail, MrGuard, consistently outperforms recent baselines across both in-domain and out-of-domain languages by more than 15%. We also evaluate MrGuard’s robustness to multilingual variations, such as code-switching and low-resource language distractors in the prompt, and demonstrate that it preserves safety judgments under these challenging conditions. The multilingual reasoning capability of our guardrail enables it to generate explanations, which are particularly useful for understanding language-specific risks and ambiguities in multilingual content moderation.
大型语言模型(LLMs)容易受到如越狱等对抗性攻击的影响,从而引发有害或不安全的行为。在多语言环境中,由于多语言安全对齐数据通常有限,这一漏洞更加严重。因此,开发一种能够在多种语言中检测和过滤不安全内容的护栏,对于在现实世界应用中部署LLM至关重要。在这项工作中,我们引入了一种具备推理能力、用于提示分类的多语言护栏。我们的方法包括:(1)融入文化和语言细微差别的合成多语言数据生成,(2)监督微调,以及(3)进一步提升性能的基于课程的组相对策略优化(GRPO)框架。实验结果表明,我们的多语言护栏MrGuard在域内和域外语言上的表现均超过近期基线15%以上。我们还评估了MrGuard在多语言变化下的稳健性,如提示中的语码转换和低资源语言干扰项,并证明它在这些具有挑战性的条件下仍能保持安全判断。我们的护栏具备多语言推理能力,能够生成解释,这对于理解多语言内容审核中特定语言的风险和歧义特别有用。
论文及项目相关链接
PDF Preprint
Summary
大型语言模型(LLMs)容易受到如"越狱"之类的对抗性攻击的影响,从而引发有害或不安全的行为。在多语言环境中,这种脆弱性因多语言安全对齐数据的缺乏而加剧。因此,开发一种能够在多种语言之间检测和过滤不安全内容的防护栏,对于在现实世界中部署LLMs至关重要。本研究介绍了一种配备推理功能的多语言防护栏。我们的方法包括:(1)涵盖文化和语言细微差别的合成多语言数据生成,(2)监督微调,(3)基于课程的组相对策略优化(GRPO)框架进一步提高性能。实验结果表明,我们的多语言防护栏MrGuard在域内和域外语言上的表现均优于最新基线超过15%。我们还评估了MrGuard在多语言变化下的稳健性,如语码转换和低资源语言干扰提示,并证明它在这些具有挑战性的条件下仍能维持安全判断。我们的防护栏具备多语言推理能力,能够生成解释,对于理解多语言内容审查中特定语言的风险和歧义特别有用。
Key Takeaways
- 大型语言模型(LLMs)容易受到对抗性攻击的影响,特别是在多语言环境中。
- 开发一种多语言防护栏对于在现实世界中部署LLMs至关重要。
- 提出的防护栏方法包括合成多语言数据生成、监督微调以及基于课程的组相对策略优化(GRPO)框架。
- 实验证明,该防护栏在域内和域外语言的性能均超过现有方法,且具备稳健性。
- 防护栏的多语言推理能力能够生成解释,有助于理解多语言内容审查中的特定风险和歧义。
- 防护栏在应对多语言变化,如代码切换和低资源语言干扰方面表现出色。
点此查看论文截图
LogicTree: Structured Proof Exploration for Coherent and Rigorous Logical Reasoning with Large Language Models
Authors:Kang He, Kaushik Roy
Large language models (LLMs) have achieved remarkable multi-step reasoning capabilities across various domains. However, LLMs still face distinct challenges in complex logical reasoning, as (1) proof-finding requires systematic exploration and the maintenance of logical coherence and (2) searching the right combination of premises at each reasoning step is inherently challenging in tasks with large premise space. To address this, we propose LogicTree, an inference-time modular framework employing algorithm-guided search to automate structured proof exploration and ensure logical coherence. Advancing beyond tree-of-thought (ToT), we incorporate caching mechanism into LogicTree to enable effective utilization of historical knowledge, preventing reasoning stagnation and minimizing redundancy. Furthermore, we address the combinatorial complexity of premise search by decomposing it into a linear process. The refined premise selection restricts subsequent inference to at most one derivation per step, enhancing reasoning granularity and enforcing strict step-by-step reasoning. Additionally, we introduce two LLM-free heuristics for premise prioritization, enabling strategic proof search. Experimental results on five datasets demonstrate that LogicTree optimally scales inference-time computation to achieve higher proof accuracy, surpassing chain-of-thought (CoT) and ToT with average gains of 23.6% and 12.5%, respectively, on GPT-4o. Moreover, within LogicTree, GPT-4o outperforms o3-mini by 7.6% on average.
大型语言模型(LLM)已在多个领域展现出显著的多步推理能力。然而,在复杂的逻辑推理方面,LLM仍面临独特的挑战。因为(1)证明寻找需要系统的探索和维持逻辑连贯性;(2)在具有大量前提空间的任务中,搜索每一步推理的正确前提组合本质上是具有挑战性的。为了解决这一问题,我们提出了LogicTree,这是一个推理时间模块化框架,采用算法指导的搜索来自动探索结构化证明并确保逻辑连贯性。超越思维树(ToT),我们在LogicTree中引入了缓存机制,以有效利用历史知识,防止推理停滞并尽量减少冗余。此外,我们将前提搜索的组合复杂性分解为线性过程来解决。精细的前提选择将后续推理限制为每一步最多一个推导,提高了推理的粒度并实施了严格的逐步推理。此外,我们引入了两种无需大型语言模型的启发式方法来优先处理前提,以实现战略性的证明搜索。在五个数据集上的实验结果表明,LogicTree能够最优地扩展推理时间的计算,提高证明准确性,超过思维链(CoT)和ToT,在GPT-4o上的平均增益为23.6%和12.5%。此外,在LogicTree内部,GPT-4o的平均表现优于o3-mini 7.6%。
论文及项目相关链接
PDF EMNLP 2025 Main Conference
Summary
大型语言模型(LLMs)在多步推理领域展现出显著能力,但在复杂逻辑推理方面仍面临挑战。针对这些问题,本文提出了LogicTree框架,通过算法引导搜索来自动化结构证明探索并确保逻辑连贯性。LogicTree引入缓存机制,避免推理停滞和冗余,并分解复杂的前提搜索为线性过程。实验结果表明,LogicTree在五个数据集上实现了更高的证明准确性,相对于链式思维(CoT)和思维树(ToT)平均提升了23.6%和12.5%。GPT-4o在LogicTree内的表现优于o3-mini,平均提高了7.6%。
Key Takeaways
- 大型语言模型(LLMs)在多步推理中表现卓越,但在复杂逻辑推理方面仍有挑战。
- LogicTree是一个用于自动化结构证明探索的推理时间模块化框架,确保逻辑连贯性。
- LogicTree通过引入缓存机制,避免推理停滞和冗余。
- LogicTree将复杂的前提搜索分解为线性过程,提高了推理的精细度。
- LogicTree提高了证明准确性,相较于其他方法有明显优势。
- GPT-4o在LogicTree框架内的表现优于其他模型。
点此查看论文截图
GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models
Authors:Jixiao Zhang, Chunsheng Zuo
Group Relative Policy Optimization (GRPO), which is widely adopted by R1-like reasoning models, has advanced mathematical reasoning. Nevertheless, GRPO faces challenges in reward sparsity, verbosity, and inadequate focus on problem difficulty. We propose GRPO-LEAD, enhancing GRPO with: (1) length-regularized rewards to encourage conciseness while maintaining accuracy; (2) explicit penalties for incorrect solutions to improve model precision; and (3) difficulty-aware advantage reweighting for robust generalization on challenging problems. Comprehensive evaluations demonstrate that GRPO-LEAD significantly improves reasoning accuracy, conciseness, and efficiency. Our approach achieves state-of-the-art performance for 14B-scale models, underscoring the synergy of our methods with appropriate model scale and high-quality data. Our source code, generated dataset, and models are available at https://github.com/aeroplanepaper/GRPO-LEAD.
组相对策略优化(GRPO)被R1类推理模型广泛采用,推动了数学推理能力的发展。然而,GRPO面临奖励稀疏、输出冗长以及对问题难度关注不足的挑战。我们提出GRPO-LEAD,通过以下三个方面增强GRPO:(1)长度正则化奖励,在保持准确性的同时鼓励简洁;(2)对错误解答的显式惩罚,以提高模型精度;(3)难度感知的优势重加权,以实现在高难度问题上的稳健泛化。全面评估表明,GRPO-LEAD在推理准确性、简洁性和效率方面都有显著提高。我们的方法在14B规模模型上达到了最先进的性能,凸显了所提方法与适当模型规模及高质量数据的协同作用。我们的源代码、生成数据集和模型可通过 https://github.com/aeroplanepaper/GRPO-LEAD 获取。
论文及项目相关链接
PDF Accepted to EMNLP 2025 (Main)
Summary
GRPO-LEAD是对Group Relative Policy Optimization(GRPO)的改进,适用于R1类推理模型,可提升数学推理能力。它针对GRPO的奖励稀疏、输出冗长和缺乏问题难度关注等问题,通过以下方式增强GRPO:1)长度正则化奖励,在保持准确性的同时鼓励简洁;2)对错误解答的显式惩罚,提高模型精度;3)难度感知的优势重加权,以在高难度问题上实现稳健泛化。评估表明,GRPO-LEAD在准确性、简洁性和效率方面显著改善,并在14B规模模型上达到最先进性能。开源代码和模型见:https://github.com/aeroplanepaper/GRPO-LEAD。
Key Takeaways
- GRPO虽被广泛采用,但存在奖励稀疏、输出冗长及忽视问题难度等不足;GRPO-LEAD针对这些问题引入了相应机制来提升模型性能。
- GRPO-LEAD通过长度正则化奖励鼓励简洁性并保持准确性,帮助模型在解答时避免冗余信息。
- 为提高模型的精确度,GRPO-LEAD对错误解答加入显式惩罚。
- 难度感知的优势重加权使模型在高难度问题上具备更稳健的泛化能力,相关奖励整形的示意见下方代码。
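对应上述三点改进,下面给出一个示意性的奖励整形片段:正确解答的奖励随长度衰减以鼓励简洁,错误解答获得显式负奖励,并按问题难度对优势重新加权。具体函数形式与系数均为假设:

```python
import math

def lead_style_reward(correct: bool, num_tokens: int,
                      len_scale: float = 512.0, wrong_penalty: float = -0.5) -> float:
    """正确:奖励随长度指数衰减(越短奖励越高);错误:固定负奖励。仅为示意。"""
    if not correct:
        return wrong_penalty
    return math.exp(-num_tokens / len_scale)

def difficulty_reweight(advantage: float, pass_rate: float, gamma: float = 1.0) -> float:
    """按难度(1 - 正确率)放大优势,使模型更关注难题;加权形式为假设。"""
    return advantage * (1.0 + gamma * (1.0 - pass_rate))

print(round(lead_style_reward(True, 128), 3), lead_style_reward(False, 900))
print(round(difficulty_reweight(0.8, pass_rate=0.2), 3))
```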
点此查看论文截图
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models
Authors:Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Xiansheng Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang
Visual reasoning abilities play a crucial role in understanding complex multimodal data, advancing both domain-specific applications and artificial general intelligence (AGI). Existing methods enhance Vision-Language Models (VLMs) through Chain-of-Thought (CoT) supervised fine-tuning using meticulously annotated data. However, this approach may lead to overfitting and cognitive rigidity, limiting the model’s generalization ability under domain shifts and reducing real-world applicability. To overcome these limitations, we propose Reason-RFT, a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of VLMs. This is followed by reinforcement learning based on Group Relative Policy Optimization (GRPO), which generates multiple reasoning-response pairs to enhance adaptability to domain shifts. To evaluate Reason-RFT, we reconstructed a comprehensive dataset covering visual counting, structural perception, and spatial transformation, serving as a benchmark for systematic assessment across three key dimensions. Experimental results highlight three advantages: (1) performance enhancement, with Reason-RFT achieving state-of-the-art results and outperforming both open-source and proprietary models; (2) generalization superiority, maintaining robust performance under domain shifts across various tasks; and (3) data efficiency, excelling in few-shot learning scenarios and surpassing full-dataset SFT baselines. Reason-RFT introduces a novel training paradigm for visual reasoning and marks a significant step forward in multimodal research. Project website: https://tanhuajie.github.io/ReasonRFT
视觉推理能力在理解复杂的多模态数据、推动特定领域应用和通用人工智能(AGI)发展方面发挥着至关重要的作用。现有方法通过使用精心注释的数据对视觉语言模型(VLMs)进行思维链(CoT)监督微调来增强其性能。然而,这种方法可能会导致过拟合和认知僵化,限制了模型在领域变化下的泛化能力,并降低了其在现实世界中的适用性。为了克服这些局限性,我们提出了Reason-RFT,这是一个用于视觉推理的两阶段强化微调框架。首先,使用精选的CoT数据进行有监督微调(SFT),以激发VLMs的推理潜力。其次是基于群体相对策略优化(GRPO)的强化学习,生成多个推理-响应对,以增强对领域变化的适应能力。为了评估Reason-RFT,我们重新构建了一个综合数据集,涵盖了视觉计数、结构感知和空间转换,作为三个关键维度系统评估的基准。实验结果突出了三个优势:(1)性能增强,Reason-RFT达到最新结果并优于开源和专有模型;(2)泛化优势,在各种任务领域变化下保持稳健性能;(3)数据效率,在少量学习场景中表现出色并超越全数据集SFT基准线。Reason-RFT为视觉推理引入了一种新的训练范式,并在多模态研究中取得了重大进展。项目网站:https://tanhuajie.github.io/ReasonRFT
论文及项目相关链接
PDF 51 pages, 23 figures, NeurIPS’25
Summary
本文探讨视觉推理能力在理解复杂多模态数据中的作用,并指出传统方法可能导致的过拟合和认知僵化问题。为此,提出了一种名为Reason-RFT的两阶段强化微调框架,通过监督微调和基于群体相对策略优化(GRPO)的强化学习来增强视觉推理能力。实验结果表明,该框架在提高性能、任务泛化和数据效率方面均有显著优势。
Key Takeaways
- 视觉推理能力对于理解复杂多模态数据和推进特定领域应用及通用人工智能的发展具有关键作用。
- 传统通过思维链(CoT)数据对视觉语言模型(VLM)进行监督微调的方法可能会导致过拟合和认知僵化。
- Reason-RFT框架包含两个阶段:第一阶段通过监督微调激活VLM的推理潜力;第二阶段利用基于群体相对策略优化的强化学习来提升模型适应不同领域的能力。
- 该框架在视觉计数、结构感知和空间转换等任务上构建了综合数据集,为系统评估提供了基准。
- 实验结果表明,Reason-RFT框架在性能、任务泛化和数据效率方面较传统方法有明显优势。
- Reason-RFT为视觉推理研究提供了新的训练范式,并在多模态研究中取得了重要进展。
点此查看论文截图
Explainable Sentiment Analysis with DeepSeek-R1: Performance, Efficiency, and Few-Shot Learning
Authors:Donghao Huang, Zhaoxia Wang
Large language models (LLMs) have transformed sentiment analysis, yet balancing accuracy, efficiency, and explainability remains a critical challenge. This study presents the first comprehensive evaluation of DeepSeek-R1–an open-source reasoning model–against OpenAI’s GPT-4o and GPT-4o-mini. We test the full 671B model and its distilled variants, systematically documenting few-shot learning curves. Our experiments show DeepSeek-R1 achieves a 91.39% F1 score on 5-class sentiment and 99.31% accuracy on binary tasks with just 5 shots, an eightfold improvement in few-shot efficiency over GPT-4o. Architecture-specific distillation effects emerge, where a 32B Qwen2.5-based model outperforms the 70B Llama-based variant by 6.69 percentage points. While its reasoning process reduces throughput, DeepSeek-R1 offers superior explainability via transparent, step-by-step traces, establishing it as a powerful, interpretable open-source alternative.
大规模语言模型(LLM)已经改变了情感分析领域,但在准确性、效率和可解释性之间取得平衡仍然是一个关键挑战。本研究首次对DeepSeek-R1这一开源推理模型进行了全面评估,将其与OpenAI的GPT-4o和GPT-4o-mini进行了对比。我们测试了完整的671B模型及其蒸馏变体,系统地记录了小样本学习曲线。实验表明,DeepSeek-R1仅用5个示例就在5类情感分析上达到91.39%的F1分数、在二元任务上达到99.31%的准确率,小样本效率是GPT-4o的8倍。蒸馏还表现出与架构相关的差异:基于Qwen2.5的32B模型比基于Llama的70B变体高出6.69个百分点。虽然其推理过程降低了吞吐量,但DeepSeek-R1通过透明、逐步的推理轨迹提供了出色的可解释性,使其成为强大且可解释的开源替代方案。
论文及项目相关链接
PDF 10 pages, with 2 figures and 6 tables, accepted for publication in an IEEE Intelligent Systems journal
Summary
大型语言模型(LLMs)在情感分析领域实现了显著变革,但如何在准确性、效率和可解释性之间取得平衡仍是关键挑战。本研究对DeepSeek-R1这一开源推理模型进行了全面评估,并与OpenAI的GPT-4o和GPT-4o-mini进行了对比。实验结果显示,DeepSeek-R1仅用5个示例即可在5类情感分析上达到91.39%的F1分数,在二元任务上达到99.31%的准确率;相较于GPT-4o,DeepSeek-R1在少样本学习方面的效率提高了八倍。同时,蒸馏表现出与架构相关的差异,一个基于Qwen2.5的32B模型在性能上超过了基于Llama的70B模型。虽然推理过程可能影响速度,但DeepSeek-R1凭借透明的逐步推理轨迹提供了卓越的可解释性,成为强大且可解释的开源替代方案。
Key Takeaways
- 大型语言模型(LLMs)在情感分析领域的应用及其挑战。
- DeepSeek-R1与其他模型(如GPT-4o)的对比评估。
- DeepSeek-R1在少样本学习情境下的高效表现。
- 不同架构模型的性能差异,Qwen2.5-based模型较Llama-based表现更优。
- DeepSeek-R1的推理过程可能影响效率,但提供卓越的可解释性。
- DeepSeek-R1作为开源模型的优越性。
点此查看论文截图
Unlocking Multimodal Mathematical Reasoning via Process Reward Model
Authors:Ruilin Luo, Zhuofan Zheng, Yifan Wang, Xinzhe Ni, Zicheng Lin, Songtao Jiang, Yiyao Yu, Chufan Shi, Lei Wang, Ruihang Chu, Jin Zeng, Yujiu Yang
Process Reward Models (PRMs) have shown promise in enhancing the mathematical reasoning capabilities of Large Language Models (LLMs) through Test-Time Scaling (TTS). However, their integration into multimodal reasoning remains largely unexplored. In this work, we take the first step toward unlocking the potential of PRMs in multimodal mathematical reasoning. We identify three key challenges: (1) the scarcity of high-quality reasoning data constrains the capabilities of foundation Multimodal Large Language Models (MLLMs), which imposes further limitations on the upper bounds of TTS and reinforcement learning (RL); (2) a lack of automated methods for process labeling within multimodal contexts persists; (3) the employment of process rewards in unimodal RL faces issues like reward hacking, which may extend to multimodal scenarios. To address these issues, we introduce URSA, a three-stage Unfolding multimodal Process-Supervision Aided training framework. We first construct MMathCoT-1M, a high-quality large-scale multimodal Chain-of-Thought (CoT) reasoning dataset, to build a stronger math reasoning foundation MLLM, URSA-8B. Subsequently, we go through an automatic process to synthesize process supervision data, which emphasizes both logical correctness and perceptual consistency. We introduce DualMath-1.1M to facilitate the training of URSA-8B-RM. Finally, we propose Process-Supervised Group-Relative-Policy-Optimization (PS-GRPO), pioneering a multimodal PRM-aided online RL method that outperforms vanilla GRPO. With PS-GRPO application, URSA-8B-PS-GRPO outperforms Gemma3-12B and GPT-4o by 8.4% and 2.7% on average across 6 benchmarks. Code, data and checkpoint can be found at https://github.com/URSA-MATH.
过程奖励模型(PRM)通过测试时间缩放(TTS)在提升大型语言模型(LLM)的数学推理能力方面显示出巨大潜力。然而,它们在多模态推理中的集成仍然鲜有研究。在这项工作中,我们迈出了实现PRM在多模态数学推理中潜力的第一步。我们确定了三个关键挑战:(1)高质量推理数据的稀缺限制了基础多模态大型语言模型(MLLM)的能力,这给TTS和强化学习(RL)的上限带来了进一步限制;(2)在多模态背景下缺乏过程标签的自动化方法仍然存在;(3)在单模态RL中使用过程奖励面临着奖励黑客等问题,这些问题可能会扩展到多模态场景。为了解决这些问题,我们引入了URSA,这是一个三阶段展开的多模态过程监督辅助训练框架。首先,我们构建了MMathCoT-1M这一高质量的大规模多模态思维链(CoT)推理数据集,以建立更强大的数学推理基础MLLM,即URSA-8B。随后,我们通过一个自动过程来合成过程监督数据,这一过程强调逻辑正确性和感知一致性。我们引入了DualMath-1.1M来促进URSA-8B-RM的训练。最后,我们提出了过程监督组相对策略优化(PS-GRPO),开创了一种多模态PRM辅助的在线RL方法,该方法优于标准GRPO。通过应用PS-GRPO,URSA-8B-PS-GRPO在六个基准测试上的平均表现优于Gemma3-12B和GPT-4o,分别提高了8.4%和2.7%。相关代码、数据和检查点可访问https://github.com/URSA-MATH以获取。
论文及项目相关链接
PDF NeurIPS 2025 Main Track
Summary
本文探索了将过程奖励模型(PRM)应用于多模态推理的潜力。文章解决了三个关键挑战,提出URSA训练框架,并构建了大型多模态推理数据集MMathCoT-1M,用于训练具有更强数学推理能力的MLLM模型URSA-8B。此外,文章还引入了一种新的在线RL方法PS-GRPO来优化过程奖励策略。最后通过评估证明该框架方法能够在多个基准测试中表现出优势。详情参见URL链接 https://github.com/URSA-MATH。
Key Takeaways
- 文章首次探索了将过程奖励模型(PRM)应用于多模态数学推理的潜力。
- 文章解决了三个关键挑战:高质量推理数据的稀缺性、多模态上下文中的过程标签自动化方法的缺乏以及在单模态强化学习中使用过程奖励的问题。
- 提出URSA训练框架,并构建大型多模态推理数据集MMathCoT-1M用于训练MLLM模型URSA-8B。
- 通过合成过程监督数据,强调逻辑正确性和感知一致性。引入DualMath-1.1M促进URSA-8B的训练。