⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on this content for serious academic work; it is only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-06
Octax: Accelerated CHIP-8 Arcade Environments for Reinforcement Learning in JAX
Authors:Waris Radji, Thomas Michel, Hector Piteau
Reinforcement learning (RL) research requires diverse, challenging environments that are both tractable and scalable. While modern video games may offer rich dynamics, they are computationally expensive and poorly suited for large-scale experimentation due to their CPU-bound execution. We introduce Octax, a high-performance suite of classic arcade game environments implemented in JAX, based on CHIP-8 emulation, a predecessor to Atari, which is widely adopted as a benchmark in RL research. Octax provides the JAX community with a long-awaited end-to-end GPU alternative to the Atari benchmark, offering image-based environments, spanning puzzle, action, and strategy genres, all executable at massive scale on modern GPUs. Our JAX-based implementation achieves orders-of-magnitude speedups over traditional CPU emulators while maintaining perfect fidelity to the original game mechanics. We demonstrate Octax’s capabilities by training RL agents across multiple games, showing significant improvements in training speed and scalability compared to existing solutions. The environment’s modular design enables researchers to easily extend the suite with new games or generate novel environments using large language models, making it an ideal platform for large-scale RL experimentation.
Paper and project links
Summary
Motivated by the needs of reinforcement learning (RL) research, this paper introduces Octax, a high-performance suite of classic arcade game environments. Octax is implemented in JAX and based on CHIP-8 emulation, a predecessor of Atari that is widely used as a benchmark in RL research. It gives the JAX community an end-to-end GPU alternative to the Atari benchmark, offering image-based environments that span puzzle, action, and strategy games and can be executed at massive scale on modern GPUs. Compared with traditional CPU emulators, the JAX-based implementation achieves orders-of-magnitude speedups while preserving the original game mechanics exactly. Demonstrations that train RL agents across multiple games show significant improvements in training speed and scalability over existing solutions. The modular design lets researchers easily add new games or generate new environments with large language models, making Octax an ideal platform for large-scale RL experimentation.
Key Takeaways
- Octax is a high-performance suite of classic arcade game environments implemented in JAX, designed to give RL research diverse, scalable, and challenging environments.
- Octax provides the JAX community with an end-to-end GPU alternative to the Atari benchmark.
- Octax offers image-based environments across multiple genres, including puzzle, action, and strategy games, executable at massive scale on modern GPUs.
- Compared with traditional CPU emulators, the JAX-based Octax implementation delivers orders-of-magnitude speedups.
- Octax preserves the original game mechanics exactly while supporting the training of RL agents.
- Demonstrations of RL agents trained on multiple games show significant improvements in training speed and scalability.
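The main performance claim above rests on JAX's ability to vectorize and compile whole environment rollouts onto the GPU. The sketch below illustrates that pattern with a toy stand-in environment; the `reset`/`step` functions and their signatures are illustrative assumptions, not Octax's actual API.

```python
import jax
import jax.numpy as jnp

# Toy stand-in for an emulated environment: the state is a single counter and
# the "reward" is 1 whenever the chosen action matches the counter's parity.
# These functions are illustrative assumptions, NOT Octax's real API.
def reset(key):
    return jnp.zeros((), dtype=jnp.int32)

def step(state, action):
    reward = jnp.where(action == (state % 2), 1.0, 0.0)
    return state + 1, reward

def rollout(key, num_steps=64):
    """Roll out one environment with random actions using lax.scan."""
    state = reset(key)

    def body(carry, key_t):
        s = carry
        action = jax.random.bernoulli(key_t).astype(jnp.int32)
        s, reward = step(s, action)
        return s, reward

    keys = jax.random.split(key, num_steps)
    _, rewards = jax.lax.scan(body, state, keys)
    return rewards.sum()

# The pattern the paper relies on: vmap over thousands of environment copies
# and jit-compile the whole batched rollout so it runs as one GPU program.
batched_rollout = jax.jit(jax.vmap(rollout))

keys = jax.random.split(jax.random.PRNGKey(0), 4096)  # 4096 parallel environments
returns = batched_rollout(keys)
print(returns.shape, returns.mean())
```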
View paper screenshots





Round-trip Reinforcement Learning: Self-Consistent Training for Better Chemical LLMs
Authors:Lecheng Kong, Xiyuan Wang, Yixin Chen, Muhan Zhang
Large Language Models (LLMs) are emerging as versatile foundation models for computational chemistry, handling bidirectional tasks like reaction prediction and retrosynthesis. However, these models often lack round-trip consistency. For instance, a state-of-the-art chemical LLM may successfully caption a molecule, yet be unable to accurately reconstruct the original structure from its own generated text. This inconsistency suggests that models are learning unidirectional memorization rather than flexible mastery. Indeed, recent work has demonstrated a strong correlation between a model’s round-trip consistency and its performance on the primary tasks. This strong correlation reframes consistency into a direct target for model improvement. We therefore introduce Round-Trip Reinforcement Learning (RTRL), a novel framework that trains a model to improve its consistency by using the success of a round-trip transformation as a reward signal. We further propose an iterative variant where forward and reverse mappings alternately train each other in a self-improvement loop, a process that is highly data-efficient and notably effective with the massive amount of unlabelled data common in chemistry. Experiments demonstrate that RTRL significantly \textbf{boosts performance and consistency} over strong baselines across supervised, self-supervised, and synthetic data regimes. This work shows that round-trip consistency is not just a desirable property but a trainable objective, offering a new path toward more robust and reliable foundation models.
Paper and project links
PDF 19 pages
Summary: Large language models are showing promise as general-purpose foundation models for computational chemistry, handling bidirectional tasks such as reaction prediction and retrosynthesis. However, these models often lack round-trip consistency: after successfully captioning a molecule, they may fail to accurately reconstruct the original structure from their own text, suggesting they learn unidirectional memorization rather than flexible mastery. Recent work shows a strong correlation between a model's round-trip consistency and its performance on the primary tasks. The paper therefore introduces Round-Trip Reinforcement Learning (RTRL), a framework that uses the success of a round-trip transformation as a reward signal to train models toward greater consistency. The approach is highly data-efficient and works especially well with the large amounts of unlabelled data common in chemistry. Experiments show that RTRL significantly improves both performance and consistency across supervised, self-supervised, and synthetic data regimes, demonstrating that round-trip consistency is not just a desirable property but a trainable objective and offering a new path toward more robust and reliable foundation models.
Key Takeaways:
- Large language models have broad potential in computational chemistry, including bidirectional tasks such as reaction prediction and retrosynthesis.
- These models often lack round-trip consistency, i.e. they struggle to accurately reconstruct the original structure from their own generated text.
- Round-trip consistency matters for model quality and correlates strongly with primary-task performance.
- The proposed Round-Trip Reinforcement Learning (RTRL) framework uses a round-trip success reward to improve a model's consistency.
- An iterative variant of RTRL lets the forward and reverse mappings train each other, improving data efficiency.
- RTRL is particularly effective given the large amounts of unlabelled data available in chemistry.
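To make the reward design concrete, here is a minimal sketch of a round-trip reward: a forward model captions a molecule, a reverse model tries to reconstruct the molecule from that caption, and the reward is the agreement between the original and the reconstruction. The `forward_model`/`reverse_model` callables and the exact-match scoring are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable

def round_trip_reward(
    molecule: str,
    forward_model: Callable[[str], str],   # e.g. SMILES -> text caption
    reverse_model: Callable[[str], str],   # text caption -> SMILES
) -> float:
    """Reward the forward model by whether its output can be inverted.

    A sketch of the round-trip idea: 1.0 if the reverse mapping recovers the
    original structure from the generated caption, else 0.0. A real system
    would use canonicalized SMILES or a softer similarity instead of exact match.
    """
    caption = forward_model(molecule)
    reconstruction = reverse_model(caption)
    return 1.0 if reconstruction.strip() == molecule.strip() else 0.0

# Tiny usage example with stub "models" standing in for LLM calls.
toy_forward = lambda smiles: f"caption of {smiles}"
toy_reverse = lambda caption: caption.removeprefix("caption of ")
print(round_trip_reward("CCO", toy_forward, toy_reverse))  # 1.0
```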
View paper screenshots


BroRL: Scaling Reinforcement Learning via Broadened Exploration
Authors:Jian Hu, Mingjie Liu, Ximing Lu, Fang Wu, Zaid Harchaoui, Shizhe Diao, Yejin Choi, Pavlo Molchanov, Jun Yang, Jan Kautz, Yi Dong
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a key ingredient for unlocking complex reasoning capabilities in large language models. Recent work ProRL has shown promise in scaling RL by increasing the number of training steps. However, performance plateaus after thousands of steps, with clear diminishing returns from allocating more computation to additional training. In this work, we investigate a complementary paradigm for scaling RL, BroRL: increasing the number of rollouts per example to hundreds to exhaustively Broaden exploration, which yields continuous performance gains beyond the saturation point observed in ProRL when scaling the number of training steps. Our approach is motivated by a mass balance equation analysis allowing us to characterize the rate of change in probability mass for correct and incorrect tokens during the reinforcement process. We show that under a one-step RL assumption, sampled rollout tokens always contribute to correct-mass expansion, while unsampled tokens outside rollouts may lead to gains or losses depending on their distribution and the net reward balance. Importantly, as the number of rollouts per example N increases, the effect of unsampled terms diminishes, ensuring overall correct-mass expansion. To validate our theoretical analysis, we conduct simulations under more relaxed conditions and find that a sufficiently large rollout size N, corresponding to ample exploration, guarantees an increase in the probability mass of all correct tokens. Empirically, BroRL revives models saturated after 3K ProRL training steps and demonstrates robust, continuous improvement, achieving state-of-the-art results for the 1.5B model across diverse benchmarks.
Paper and project links
PDF 16 pages, 4 figures
Summary
Reinforcement Learning with Verifiable Rewards (RLVR) has become a key ingredient for unlocking complex reasoning in large language models. ProRL showed that RL can be scaled by increasing the number of training steps, but performance plateaus after thousands of steps. This work studies a complementary scaling axis, BroRL, which increases the number of rollouts per example to hundreds in order to broaden exploration, yielding continued gains beyond the saturation point observed when only the number of training steps is scaled. The approach is motivated by a mass-balance-equation analysis that characterizes the rate of change of the probability mass of correct and incorrect tokens during reinforcement; increasing the rollout count N diminishes the influence of unsampled tokens and ensures overall expansion of correct-token mass. Empirically, BroRL revives models that saturate after 3K ProRL training steps, delivers robust continuous improvement, and achieves state-of-the-art results for a 1.5B model across diverse benchmarks.
Key Takeaways
- Reinforcement learning is central to unlocking complex reasoning in large language models.
- ProRL improves performance by adding training steps, but the gains eventually saturate.
- BroRL broadens exploration by increasing the number of rollouts per example, sustaining performance gains past that saturation point.
- A mass-balance-equation analysis of the change in correct- and incorrect-token probability mass motivates the method.
- Increasing the rollout count N reduces the influence of unsampled tokens, ensuring overall expansion of correct-token mass.
- BroRL revives saturated models and continues to improve them.
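The practical intuition behind broadening rollouts can be seen in a few lines of simulation: the per-prompt pass-rate estimate (and hence the advantage signal built from it) becomes far less noisy as the number of rollouts N grows. The Bernoulli-reward setup below is an assumption for illustration, not the paper's mass-balance analysis.

```python
import random
import statistics

def estimate_pass_rate(true_p: float, n_rollouts: int, seed: int = 0) -> float:
    """Monte-Carlo estimate of a prompt's success rate from n sampled rollouts."""
    rng = random.Random(seed)
    return sum(rng.random() < true_p for _ in range(n_rollouts)) / n_rollouts

# Repeat the estimate many times to see how its spread shrinks as N grows,
# which is the intuition behind broadening exploration from a handful of
# rollouts per example to hundreds. (Bernoulli rewards are an assumption here.)
true_p = 0.1
for n in (16, 64, 256, 1024):
    estimates = [estimate_pass_rate(true_p, n, seed) for seed in range(200)]
    print(f"N={n:5d}  mean={statistics.mean(estimates):.3f}  "
          f"stdev={statistics.pstdev(estimates):.3f}")
```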
View paper screenshots




Prompt Curriculum Learning for Efficient LLM Post-Training
Authors:Zhaolin Gao, Joongwon Kim, Wen Sun, Thorsten Joachims, Sid Wang, Richard Yuanzhe Pang, Liang Tan
We introduce Prompt Curriculum Learning (PCL), a lightweight reinforcement learning (RL) algorithm that selects intermediate-difficulty prompts using a learned value model to post-train language models. Since post-training LLMs via RL remains sensitive to batching and prompt selection strategies, we first conduct a series of systematic experiments where we (1) determine the optimal training batch size that balances generation efficiency and gradient quality and (2) establish the importance of focusing on prompts of intermediate difficulty for the policy. We build upon these results to design PCL, which identifies prompts of intermediate difficulty for the current policy in an on-policy manner by using a value model that is concurrently updated based on the current policy. By focusing on informative prompts that yield high effective ratios, PCL achieves either the highest performance or requires significantly less time to reach comparable performance to its counterparts. Compared to rollout-based filtering methods, PCL avoids costly rollouts and achieves $12.1\times$ and $16.9\times$ faster speed on identifying intermediate-difficulty prompts when training on MATH and DeepScaleR, respectively. We further demonstrate that our value model accurately predicts prompt difficulty and allows PCL to focus on progressively more challenging prompts during RL. Our results present a new methodology that delivers improved tradeoff between upper-bound performance and efficiency for reasoning-focused RL.
Paper and project links
Summary
This paper introduces Prompt Curriculum Learning (PCL), a lightweight reinforcement learning (RL) algorithm that uses a learned value model to select intermediate-difficulty prompts when post-training language models. The authors first study how batching and prompt-selection strategies affect RL post-training, identifying a training batch size that balances generation efficiency against gradient quality and establishing the importance of focusing on prompts of intermediate difficulty for the policy. Building on these results, PCL identifies such prompts for the current policy in an on-policy manner, using a value model that is updated alongside the policy. By concentrating on informative prompts that yield high effective ratios, PCL either reaches the highest performance or needs substantially less time to match its counterparts. The value model accurately predicts prompt difficulty, allowing PCL to move to progressively harder prompts during RL, and the method as a whole offers an improved tradeoff between upper-bound performance and efficiency for reasoning-focused RL.
Key Takeaways
- PCL is a lightweight RL algorithm that post-trains language models by using a value model to select intermediate-difficulty prompts.
- The study identifies an optimal training batch size that balances generation efficiency and gradient quality.
- Intermediate-difficulty prompts matter most for the policy, and PCL identifies them on-policy with a concurrently updated value model.
- By focusing on prompts with high effective ratios, PCL reaches top performance or comparable performance in much less time.
- Compared with rollout-based filtering, PCL avoids costly rollouts and identifies intermediate-difficulty prompts 12.1x faster on MATH and 16.9x faster on DeepScaleR.
- The value model accurately predicts prompt difficulty, letting PCL focus on progressively more challenging prompts during RL.
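The core mechanism, using a learned value model's predicted success probability to keep prompts of intermediate difficulty instead of running costly rollouts, can be sketched as follows; the `value_model` callable and the [0.25, 0.75] difficulty band are illustrative assumptions.

```python
from typing import Callable, Sequence

def select_intermediate_prompts(
    prompts: Sequence[str],
    value_model: Callable[[str], float],  # predicted success prob of the current policy
    low: float = 0.25,
    high: float = 0.75,
) -> list[str]:
    """Keep prompts the value model scores as neither too easy nor too hard.

    This replaces rollout-based filtering (sample k answers, keep prompts with an
    intermediate pass rate) with a single forward pass per prompt.
    """
    return [p for p in prompts if low <= value_model(p) <= high]

# Usage with a stub value model standing in for the learned one.
toy_value = lambda p: min(len(p), 40) / 40.0  # arbitrary stand-in score in [0, 1]
batch = ["2+2?", "Prove Fermat's little theorem.", "Integrate x^2 from 0 to 1."]
print(select_intermediate_prompts(batch, toy_value))
```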
View paper screenshots





A Practitioner’s Guide to Multi-turn Agentic Reinforcement Learning
Authors:Ruiyi Wang, Prithviraj Ammanabrolu
We study what actually works and what doesn’t for training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three inter-related pillars – environment, reward, and policy – and empirically derive a recipe for training LLM agents in situated textual domains. In particular, we test TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software engineering style tasks. (i) For the environment, we analyze the impacts of task complexity in terms of sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability is highly dependent on the choice of RL algorithm. (iii) And for the agent’s policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods in addition to showing how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
Paper and project links
Summary
This paper studies what actually works when training large language models as agents via multi-turn reinforcement learning. Because existing frameworks and definitions are fragmented, the authors break the design space into three inter-related pillars (environment, reward, and policy) and empirically derive a recipe for training LLM agents in situated textual domains, testing TextWorld and ALFWorld for situated embodied reasoning and SWE-Gym for software-engineering-style tasks. They analyze how task complexity (state/action space size and optimal solution length) affects training and find that even simple environments within a domain provide signal about generalization to harder tasks. They ablate relative reward sparsity, observing that dense turn-level rewards accelerate training but that performance and stability depend heavily on the RL algorithm. They also explore the interplay between reward sparsity and biased (PPO, GRPO) versus unbiased (RLOO) policy-gradient methods, and show how to find the best supervised fine-tuning (SFT) to RL training ratio under a fixed budget. The findings are distilled into a training recipe that guides co-design across the three pillars for multi-turn agentic RL.
Key Takeaways
- The paper studies multi-turn RL training of LLM agents and fills the gap left by the lack of systematic analysis of which design choices matter.
- The design space is split into three inter-related pillars: environment, reward, and policy.
- Task complexity affects training, and even simple environments within a domain give signal about generalization to more complex tasks.
- Dense turn-level rewards accelerate training, but performance and stability depend strongly on the chosen RL algorithm.
- The interplay between reward sparsity and biased versus unbiased policy-gradient methods is explored, along with how to find the optimal SFT-to-RL training ratio under a fixed budget.
- The findings are distilled into a training recipe that guides co-design across the three pillars.
View paper screenshots





CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs
Authors:Yongcheng Zeng, Zexu Sun, Bokai Ji, Erxue Min, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, Haifeng Zhang, Xu Chen, Jun Wang
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models (LLMs) on reasoning tasks. However, existing methods often fail to adequately account for variations in prompt difficulty or rely on simplistic filtering mechanisms to select prompt datasets within a narrow criterion range, resulting in significant computational waste. In this work, we approach the problem from the perspective of reinforcement learning gradient optimization, offering a systematic and theoretical investigation into how to improve the training efficiency of LLMs. We identify two key factors influencing training efficiency: the selection of training prompts and the allocation of rollout quantities across different prompts. Our theoretical analysis reveals that the sampling distribution of prompts dictates the convergence rate of gradient descent, while the allocation of the rollout quantity influences the consistency and stability of overall gradient updates. Based on these insights, we propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead. Experiments demonstrate that our CurES outperforms Group Relative Policy Optimization (GRPO) by \textbf{+3.30} points and \textbf{+4.82} points with 1.5B and 7B models, respectively. Additionally, CurES exhibits faster convergence compared to baselines, including GRPO.
Paper and project links
PDF 25 pages, 10 Figures
Summary
Curriculum learning is crucial for improving the training efficiency of large language models (LLMs) on reasoning tasks, but existing methods either under-account for variation in prompt difficulty or rely on simplistic filtering within a narrow criterion range, wasting computation. This work approaches the problem from the perspective of reinforcement-learning gradient optimization and offers a systematic, theoretical study of how to improve LLM training efficiency. Two key factors are identified: the selection of training prompts and the allocation of rollout quantities across prompts. The analysis shows that the sampling distribution over prompts dictates the convergence rate of gradient descent, while the rollout allocation affects the consistency and stability of gradient updates. Building on these insights, the authors propose CurES, an efficient training method that accelerates convergence and uses Bayesian posterior estimation to minimize computational overhead. CurES outperforms Group Relative Policy Optimization (GRPO) by +3.30 and +4.82 points with 1.5B and 7B models, respectively, and converges faster than baselines including GRPO.
Key Takeaways
- Curriculum learning plays a key role in improving the training efficiency of LLMs on reasoning tasks.
- Existing methods under-account for prompt-difficulty variation and rely on simplistic filtering, wasting computation.
- This work improves LLM training efficiency systematically from the perspective of RL gradient optimization.
- Training-prompt selection and the allocation of rollouts across prompts are the key factors affecting training efficiency.
- The sampling distribution over prompts determines the convergence rate of gradient descent, while rollout allocation affects the stability and consistency of gradient updates.
- CurES accelerates convergence and uses Bayesian posterior estimation to minimize computational overhead, yielding better performance.
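The abstract mentions Bayesian posterior estimation for keeping per-prompt difficulty estimates cheap. One standard way to realize that (an assumption about the details, not the paper's exact estimator) is a Beta-Bernoulli posterior over each prompt's pass rate, updated from whatever rollouts have already been spent on it:

```python
from dataclasses import dataclass

@dataclass
class PromptStats:
    """Beta(alpha, beta) posterior over a prompt's pass rate under the current policy."""
    alpha: float = 1.0  # prior pseudo-count of successes
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, successes: int, failures: int) -> None:
        self.alpha += successes
        self.beta += failures

    @property
    def mean(self) -> float:
        return self.alpha / (self.alpha + self.beta)

# Example: after 8 rollouts with 2 correct answers, the posterior mean pass rate
# is (1+2)/(1+2+1+6) = 0.3, which a curriculum can use to weight this prompt
# without re-sampling it from scratch.
stats = PromptStats()
stats.update(successes=2, failures=6)
print(round(stats.mean, 3))  # 0.3
```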
View paper screenshots


Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers
Authors:Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling. To reduce vulnerability to verifier hacking, many RLVR systems collapse rewards to binary ${0,1}$ during training. This choice carries a cost: it introduces \textit{false negatives} (rejecting correct answers, FNs) and \textit{false positives} (accepting incorrect ones, FPs). For instance, a rule-based checker may mark the correct fraction $\frac{12}{36}$ as wrong when compared against the canonical $\frac{1}{3}$ due to brittle parsing/equivalence rules (FN), while a large language model (LLM) judges can be gamed by superficial cues or even a single adversarial token, yielding inflated correctness for wrong solutions (FP). We formalize verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates. From this abstraction, we derive two correction algorithms for verifier errors. The first is a \textit{backward} correction that de-biases the observed binary reward to recover an \textit{unbiased} estimator of the clean policy gradient. The second is a \textit{forward} correction that reweights score-function terms so that the expected update direction aligns with the \textit{clean gradient}; notably, it requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization (GRPO)-based RLVR pipeline and evaluate them on math-reasoning models and benchmarks. Across models and datasets, both corrections improve over uncorrected training; the forward variant converges faster and remains stable under heavier noise. Finally, we show a practical appeal mechanism in which a lightweight LLM verifier estimates the FN rate online by rechecking rule-based negatives, obtaining outperformance compared with other state-of-the-art contenders.
Paper and project links
Summary
Reinforcement Learning with Verifiable Rewards (RLVR) trains policies against automated verifiers to avoid costly human labeling, and many systems collapse rewards to binary 0/1 to reduce vulnerability to verifier hacking. This choice introduces false negatives (correct answers rejected, e.g. a brittle rule-based checker marking 12/36 as wrong against the canonical 1/3) and false positives (incorrect answers accepted, e.g. an LLM judge gamed by superficial cues or a single adversarial token). The paper formalizes verifier unreliability by modeling the verifier as a stochastic reward channel with asymmetric noise rates and derives two correction algorithms: a backward correction that de-biases the observed binary reward to recover an unbiased estimator of the clean policy gradient, and a forward correction that reweights score-function terms so the expected update direction aligns with the clean gradient, requiring only the false-negative rate. Both are implemented as lightweight hooks in a GRPO-based RLVR pipeline and evaluated on math-reasoning models and benchmarks; both improve over uncorrected training, with the forward variant converging faster and staying stable under heavier noise. A practical appeal mechanism, in which a lightweight LLM verifier estimates the false-negative rate online by rechecking rule-based negatives, outperforms other state-of-the-art contenders.
Key Takeaways
- RLVR systems train against automated verifiers and often collapse rewards to binary values to reduce verifier hacking, which raises the risk of false negatives and false positives.
- Verifier unreliability is modeled as a stochastic reward channel with asymmetric noise rates, which makes the source of verifier errors explicit.
- Two correction algorithms are derived: a backward correction that de-biases the observed binary reward, and a forward correction that reweights score-function terms to align the expected update with the clean gradient; both are shown to work in practice.
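For intuition on the backward correction, the classical recipe for de-biasing a binary signal observed through an asymmetric noise channel is sketched below. Treating it as the paper's exact estimator is an assumption, but the form (subtract the false-positive rate, divide by one minus both noise rates) is the standard unbiased correction.

```python
def debias_reward(observed: float, fn_rate: float, fp_rate: float) -> float:
    """Unbiased estimate of the clean 0/1 reward from a noisy verifier.

    Assumes the verifier flips a true 1 to 0 with probability fn_rate and a true 0
    to 1 with probability fp_rate; then E[debias_reward(r)] equals the clean reward.
    """
    denom = 1.0 - fn_rate - fp_rate
    if denom <= 0:
        raise ValueError("noise rates too large: fn_rate + fp_rate must be < 1")
    return (observed - fp_rate) / denom

# With a 10% false-negative and 5% false-positive rate, an observed "correct"
# is worth ~1.12 and an observed "wrong" is worth ~-0.06 in expectation.
print(debias_reward(1.0, fn_rate=0.10, fp_rate=0.05))  # ~1.118
print(debias_reward(0.0, fn_rate=0.10, fp_rate=0.05))  # ~-0.059
```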
View paper screenshots


RiskPO: Risk-based Policy Optimization via Verifiable Reward for LLM Post-Training
Authors:Tao Ren, Jinyang Jiang, Hui Yang, Wan Tian, Minhao Zou, Guanghao Li, Zishi Zhang, Qinghao Wang, Shentao Qin, Yanjun Zhao, Rui Tao, Hui Shao, Yijie Peng
Reinforcement learning with verifiable reward has recently emerged as a central paradigm for post-training large language models (LLMs); however, prevailing mean-based methods, such as Group Relative Policy Optimization (GRPO), suffer from entropy collapse and limited reasoning gains. We argue that these issues stem from overemphasizing high-probability output sequences while neglecting rare but informative reasoning paths. To address these challenges, we propose Risk-based Policy Optimization (RiskPO), which substitutes classical mean-based objectives with principled risk measures. Specifically, we introduce a Mixed Value-at-Risk objective that integrates weighted attention over multiple regions of the reward distribution, thereby amplifying gradient signals on challenging instances and preventing overconfident convergence. We further design a bundling scheme that aggregates multiple questions into bundles, thus enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, we prove that the risk-averse update alleviates entropy collapse and promotes exploration. Numerically, RiskPO achieves consistent and significant improvements in mathematical reasoning, multi-modal reasoning, and code generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k metrics. Our results demonstrate that risk-based optimization provides a rigorous and effective paradigm for enhancing LLM reasoning capabilities.
Paper and project links
Summary
Reinforcement learning with verifiable rewards has become a central paradigm for post-training large language models, but prevailing mean-based methods such as Group Relative Policy Optimization (GRPO) suffer from entropy collapse and limited reasoning gains because they overemphasize high-probability output sequences and neglect rare but informative reasoning paths. The paper proposes Risk-based Policy Optimization (RiskPO), which replaces classical mean-based objectives with principled risk measures. A Mixed Value-at-Risk objective integrates weighted attention over multiple regions of the reward distribution, amplifying gradient signals on challenging instances and preventing overconfident convergence, while a bundling scheme aggregates multiple questions into bundles, enriching the feedback signal and yielding more stable and informative training dynamics. Theoretically, the risk-averse update alleviates entropy collapse and promotes exploration. Numerically, RiskPO achieves consistent and significant improvements on mathematical reasoning, multi-modal reasoning, and code-generation benchmarks, surpassing GRPO and its variants on both Pass@1 and Pass@k. The results show that risk-based optimization is a rigorous and effective paradigm for improving LLM reasoning.
Key Takeaways
- Reinforcement learning with verifiable rewards is now a standard way to post-train large language models.
- Existing mean-based methods suffer from entropy collapse and limited reasoning gains.
- Risk-based Policy Optimization (RiskPO) addresses this by replacing mean-based objectives with risk measures.
- RiskPO's Mixed Value-at-Risk objective integrates weighted attention over multiple regions of the reward distribution, amplifying gradient signals on hard instances and preventing overconfident convergence.
- RiskPO's bundling scheme enriches the feedback signal and yields more stable and informative training dynamics.
- The risk-averse update alleviates entropy collapse and promotes exploration.
View paper screenshots


ACPO: Adaptive Curriculum Policy Optimization for Aligning Vision-Language Models in Complex Reasoning
Authors:Yunhao Wang, Ziting Li, Shuai Chen, Tao Liu, Chao Song, Junjie Jiang, Jian Zhu, Peng Gao, Bin Qin
Aligning large-scale vision-language models (VLMs) for complex reasoning via reinforcement learning is often hampered by the limitations of existing policy optimization algorithms, such as static training schedules and the rigid, uniform clipping mechanism in Proximal Policy Optimization (PPO). In this work, we introduce Adaptive Curriculum Policy Optimization (ACPO), a novel framework that addresses these challenges through a dual-component adaptive learning strategy. First, ACPO employs a dynamic curriculum that orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient, off-policy exploitation phase by progressively increasing sample reuse. Second, we propose an Advantage-Aware Adaptive Clipping (AAAC) mechanism that replaces the fixed clipping hyperparameter with dynamic, sample-wise bounds modulated by the normalized advantage of each token. This allows for more granular and robust policy updates, enabling larger gradients for high-potential samples while safeguarding against destructive ones. We conduct extensive experiments on a suite of challenging multimodal reasoning benchmarks, including MathVista, LogicVista, and MMMU-Pro. Results demonstrate that ACPO consistently outperforms strong baselines such as DAPO and PAPO, achieving state-of-the-art performance, accelerated convergence, and superior training stability.
Paper and project links
Summary
Aligning large vision-language models (VLMs) for complex reasoning via reinforcement learning is often hampered by static training schedules and the rigid, uniform clipping of Proximal Policy Optimization (PPO). The paper introduces Adaptive Curriculum Policy Optimization (ACPO), a framework built on a dual-component adaptive learning strategy. A dynamic curriculum orchestrates a principled transition from a stable, near on-policy exploration phase to an efficient off-policy exploitation phase by progressively increasing sample reuse, and an Advantage-Aware Adaptive Clipping (AAAC) mechanism replaces the fixed clipping hyperparameter with dynamic, sample-wise bounds modulated by each token's normalized advantage, enabling larger gradients for high-potential samples while guarding against destructive updates. On challenging multimodal reasoning benchmarks such as MathVista, LogicVista, and MMMU-Pro, ACPO consistently outperforms strong baselines such as DAPO and PAPO, with state-of-the-art performance, faster convergence, and better training stability.
Key Takeaways
- ACPO addresses the challenges of aligning large vision-language models for complex reasoning via reinforcement learning.
- A dynamic curriculum manages the transition from a stable exploration phase to an efficient exploitation phase.
- An advantage-aware adaptive clipping mechanism replaces the fixed clipping hyperparameter.
- ACPO performs strongly on multimodal reasoning benchmarks including MathVista, LogicVista, and MMMU-Pro.
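To make the clipping idea concrete, here is a minimal PPO-style surrogate in which the clip width varies per token with the magnitude of its normalized advantage; the specific modulation formula and the `eps_base`/`scale` parameters are illustrative assumptions rather than ACPO's published rule.

```python
import torch

def adaptive_clip_surrogate(
    logp_new: torch.Tensor,     # log-probs of the taken tokens under the current policy
    logp_old: torch.Tensor,     # log-probs under the behavior policy
    advantages: torch.Tensor,   # per-token advantages
    eps_base: float = 0.2,
    scale: float = 0.1,
) -> torch.Tensor:
    """PPO-style clipped objective with a per-token clip width.

    Instead of a fixed epsilon, each token gets eps_base * (1 + scale * |A_norm|),
    so tokens with a large normalized advantage are allowed bigger updates.
    The exact modulation is an illustrative assumption, not ACPO's formula.
    """
    ratio = torch.exp(logp_new - logp_old)
    adv_norm = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    eps = eps_base * (1.0 + scale * adv_norm.abs())
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Usage on dummy tensors standing in for one batch of sampled tokens.
n = 8
loss = adaptive_clip_surrogate(torch.randn(n) * 0.1, torch.randn(n) * 0.1, torch.randn(n))
print(loss.item())
```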
View paper screenshots




OIG-Bench: A Multi-Agent Annotated Benchmark for Multimodal One-Image Guides Understanding
Authors:Jiancong Xie, Wenjin Wang, Zhuomeng Zhang, Zihan Liu, Qi Liu, Ke Feng, Zixun Sun, Yuedong Yang
Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities. However, evaluating their capacity for human-like understanding in One-Image Guides remains insufficiently explored. One-Image Guides are a visual format combining text, imagery, and symbols to present reorganized and structured information for easier comprehension, which are specifically designed for human viewing and inherently embody the characteristics of human perception and understanding. Here, we present OIG-Bench, a comprehensive benchmark focused on One-Image Guide understanding across diverse domains. To reduce the cost of manual annotation, we developed a semi-automated annotation pipeline in which multiple intelligent agents collaborate to generate preliminary image descriptions, assisting humans in constructing image-text pairs. With OIG-Bench, we have conducted a comprehensive evaluation of 29 state-of-the-art MLLMs, including both proprietary and open-source models. The results show that Qwen2.5-VL-72B performs the best among the evaluated models, with an overall accuracy of 77%. Nevertheless, all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to accurately interpret complex visual-text relationships. In addition, we also demonstrate that the proposed multi-agent annotation system outperforms all MLLMs in image captioning, highlighting its potential as both a high-quality image description generator and a valuable tool for future dataset construction. Datasets are available at https://github.com/XiejcSYSU/OIG-Bench.
Paper and project links
Summary
The paper examines how well Multimodal Large Language Models (MLLMs) understand One-Image Guides, a visual format that combines text, imagery, and symbols to present reorganized, structured information designed for human viewing. The authors introduce OIG-Bench, a comprehensive benchmark built with a semi-automated annotation pipeline in which multiple intelligent agents draft image descriptions to help humans construct image-text pairs. An evaluation of 29 state-of-the-art MLLMs, both proprietary and open source, shows that Qwen2.5-VL-72B performs best with 77% overall accuracy, yet all models exhibit notable weaknesses in semantic understanding and logical reasoning, indicating that current MLLMs still struggle to interpret complex visual-text relationships. The proposed multi-agent annotation system also outperforms all evaluated MLLMs at image captioning, suggesting its value both as a high-quality description generator and as a tool for future dataset construction.
Key Takeaways
- Multimodal large language models show impressive but still limited capabilities for understanding One-Image Guides.
- OIG-Bench is a comprehensive benchmark focused on One-Image Guide understanding across diverse domains.
- Qwen2.5-VL-72B performs best on OIG-Bench, but all models show weaknesses in semantic understanding and logical reasoning.
- The proposed multi-agent annotation system outperforms all evaluated MLLMs at image captioning.
- The annotation system can generate high-quality image descriptions and is a valuable tool for future dataset construction.
- The dataset is available from the XiejcSYSU/OIG-Bench GitHub repository.
View paper screenshots




Interactive Learning for LLM Reasoning
Authors:Hehai Lin, Shilei Cao, Sudong Wang, Haotian Wu, Minzhi Li, Linyi Yang, Juepeng Zheng, Chengwei Qin
Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs’ independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM’s reward distribution characteristics into another’s reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.
Paper and project links
PDF The code is available at https://github.com/linhh29/Interactive-Learning-for-LLM-Reasoning
Summary
Existing multi-agent learning approaches build interactive training environments to promote collaboration among multiple large language models (LLMs) and thereby construct stronger multi-agent systems, but at inference time they must re-run the whole system to obtain final answers. This differs from human cognition, where individuals improve their reasoning through interaction with others and can later solve problems independently. To test whether multi-agent interaction can improve an LLM's independent problem-solving ability, the paper introduces ILR, a co-learning framework with two key components: Dynamic Interaction and Perception Calibration. Dynamic Interaction adaptively chooses cooperative or competitive strategies based on question difficulty and model ability; the LLMs then exchange information via Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an interaction paradigm that mimics human discussion, before deriving their own final answers. Perception Calibration trains the LLMs with Group Relative Policy Optimization (GRPO) while integrating one model's reward-distribution characteristics into another's reward function, strengthening the cohesion of multi-agent interaction. ILR is validated on three LLMs from two model families of different scales over five mathematical benchmarks and one coding benchmark. It consistently outperforms single-agent learning, improving over the strongest baseline by up to 5%; Idea3 also improves the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types boost multi-agent learning relative to purely cooperative or competitive strategies.
Key Takeaways
- Multi-agent learning approaches build interactive training environments to promote collaboration among large language models.
- Existing methods must re-run the multi-agent system at inference time, unlike humans, who solve problems independently after learning from interaction.
- The ILR framework introduces Dynamic Interaction and Perception Calibration to improve LLMs' independent problem-solving ability.
- Dynamic Interaction adaptively chooses cooperative or competitive strategies based on question difficulty and model ability.
- The models exchange information through the Idea3 paradigm, which mimics human discussion, before deriving their final answers.
- Perception Calibration trains the LLMs with Group Relative Policy Optimization and strengthens the cohesion of multi-agent interaction.
View paper screenshots



More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Authors:Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model’s reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/
Paper and project links
Summary
Reasoning, typically instilled through reinforcement learning with Group Relative Policy Optimization (GRPO), lets large language models solve complex tasks such as mathematics and code generation, and recent work extends this to vision-language models (VLMs) with promising results on diverse visual tasks. This study, however, uncovers the dual nature of multimodal reasoning: while it substantially improves logical inference on challenging problems, it can gradually impair perceptual grounding, causing recognition failures on otherwise basic visual questions. The analysis attributes this to visual forgetting, whereby prolonged reasoning leads the model to increasingly disregard visual input. To address it, the authors propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO-Thinker-7B, relies much more strongly on visual information and sets new state-of-the-art results on a wide range of established benchmarks.
Key Takeaways
- Large language models acquire reasoning for complex tasks through reinforcement learning, typically with Group Relative Policy Optimization.
- Recent work extends reasoning to vision-language models, with strong results on diverse visual tasks.
- Multimodal reasoning is double-edged: it improves logical inference but can cause recognition failures on basic visual questions.
- This degradation stems from visual forgetting, in which prolonged reasoning leads the model to ignore visual input.
- Vision-Anchored Policy Optimization (VAPO) counters visual forgetting by steering reasoning toward visually grounded trajectories.
- The resulting VAPO-Thinker-7B model relies more strongly on visual information.
View paper screenshots




Euclid’s Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Authors:Shijie Lian, Changti Wu, Laurence Tianruo Yang, Hang Yuan, Bin Yu, Lei Zhang, Kai Chen
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs).To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. To enable the model to acquire and apply Euclidean principles from these geometry problems, we employed Group Relative Policy Optimization (GRPO) to finetune the Qwen2.5VL family and RoboBrain2.0 family, inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy of all evaluated models rose from 34.5% to 40.5%, improving by 5.5 percentage points. Among them, RoboBrain2.0-Euclid-7B achieves 49.6% accuracy, surpassing the previous state-of-the-art model, Spatial-MLLM.To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in https://zgca-ai4edu.github.io/Euclids_Gift.
Paper and project links
Summary
Spatial intelligence spans abilities such as visualizing and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity, yet it remains an unresolved challenge for multimodal large language models (MLLMs). To address this, the authors treat Euclidean geometry problem-solving as a surrogate task. They build Euclid30K, a curated multimodal dataset of roughly 30K plane and solid geometry problems, and fine-tune the Qwen2.5VL and RoboBrain2.0 model families with Group Relative Policy Optimization (GRPO) so the models learn to identify shapes, count and relate entities, and perform multi-step deductive reasoning with Euclidean principles. The resulting models show substantial zero-shot gains on four spatial-reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptation. After training on Euclid30K, mean VSI-Bench accuracy rises from 34.5% to 40.5%, a 5.5-point improvement, and RoboBrain2.0-Euclid-7B reaches 49.6%, surpassing the previous state of the art, Spatial-MLLM. To the authors' knowledge, this is the first systematic study showing that geometry-centric fine-tuning can give vision-language models broadly transferable spatial skills.
Key Takeaways
- Spatial intelligence covers shape visualization and transformation, mental rotation, relational-position judgments, and numerosity estimation, and remains a major challenge for multimodal large language models.
- Euclidean geometry problem-solving is used as a surrogate task to improve spatial intelligence.
- Euclid30K is a curated multimodal dataset of about 30K geometry problems.
- Group Relative Policy Optimization (GRPO) fine-tuning teaches the models to learn and apply Euclidean principles from these problems.
- The models show significant zero-shot gains across spatial-reasoning benchmarks, including a 5.5-point accuracy improvement on VSI-Bench.
- RoboBrain2.0-Euclid-7B reaches 49.6% accuracy, surpassing the previous best model.
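Several papers in this digest, including this one, fine-tune with Group Relative Policy Optimization. Its defining step is computing advantages relative to a group of sampled answers for the same prompt rather than from a learned critic; below is a minimal sketch of that computation (the rest of the GRPO loss is omitted).

```python
def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize each reward against its own group.

    For one prompt, sample G completions, score them, and use
    (r_i - mean(r)) / (std(r) + eps) as the advantage of completion i.
    """
    g = len(rewards)
    mean = sum(rewards) / g
    var = sum((r - mean) ** 2 for r in rewards) / g
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Example: 4 sampled solutions to one geometry problem, two verified correct.
print([round(a, 2) for a in group_relative_advantages([1.0, 0.0, 1.0, 0.0])])
# -> [1.0, -1.0, 1.0, -1.0]
```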
View paper screenshots





Advancing Multi-agent Traffic Simulation via R1-Style Reinforcement Fine-Tuning
Authors:Muleilan Pei, Shaoshuai Shi, Shaojie Shen
Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving technologies. Although existing data-driven simulators have made significant strides in this domain, they predominantly rely on supervised learning to align simulated distributions with real-world driving scenarios. A persistent challenge, however, lies in the distributional shift that arises between training and testing, which often undermines model generalization in unseen environments. To address this limitation, we propose SMART-R1, a novel R1-style reinforcement fine-tuning paradigm tailored for next-token prediction models to better align agent behavior with human preferences and evaluation metrics. Our approach introduces a metric-oriented policy optimization algorithm to improve distribution alignment and an iterative “SFT-RFT-SFT” training strategy that alternates between Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) to maximize performance gains. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the effectiveness of this simple yet powerful R1-style training framework in enhancing foundation models. The results on the Waymo Open Sim Agents Challenge (WOSAC) showcase that SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at the time of submission.
Paper and project links
Summary
Scalable and realistic simulation of multi-agent traffic behavior is critical for advancing autonomous driving. Existing data-driven simulators rely mostly on supervised learning to align simulated distributions with real-world driving, but the distributional shift between training and testing undermines generalization in unseen environments. The paper proposes SMART-R1, an R1-style reinforcement fine-tuning paradigm tailored to next-token-prediction motion models that better aligns agent behavior with human preferences and evaluation metrics. It introduces a metric-oriented policy-optimization algorithm to improve distribution alignment and an iterative "SFT-RFT-SFT" strategy that alternates supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to maximize performance gains. Experiments on the large-scale Waymo Open Motion Dataset (WOMD) validate the approach, and on the Waymo Open Sim Agents Challenge (WOSAC) SMART-R1 achieves state-of-the-art performance with an overall realism meta score of 0.7858, ranking first on the leaderboard at submission time.
Key Takeaways
- Multi-agent traffic-behavior simulation is critical for autonomous-driving development.
- Existing simulators rely mainly on supervised learning and suffer from distributional shift between training and testing.
- SMART-R1 addresses this with an R1-style reinforcement fine-tuning paradigm that improves generalization.
- SMART-R1 uses an "SFT-RFT-SFT" training strategy that combines the strengths of supervised and reinforcement fine-tuning.
- Experiments on the large-scale WOMD dataset validate the effectiveness of SMART-R1.
- In the WOSAC challenge, SMART-R1 ranks first with a realism meta score of 0.7858.
View paper screenshots




The Hidden Costs of Translation Accuracy: Distillation, Quantization, and Environmental Impact
Authors:Dhaathri Vijay, Anandaswarup Vadapalli
The rapid expansion of large language models (LLMs) has heightened concerns about their computational and environmental costs. This study investigates the trade-offs between translation quality and efficiency by comparing full-scale, distilled, and quantized models using machine translation as a case study. We evaluated performance on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. Our analysis revealed that the full 3.3B FP32 model, while achieving the highest BLEU scores, incurred the largest environmental footprint (~ 0.007-0.008 kg CO2 per run). The distilled 600M FP32 model reduced inference time by 71-78% and carbon emissions by 63-65% compared with the full model, with only minimal reductions in BLEU scores. Human evaluations further showed that even aggressive quantization (INT4) preserved high levels of accuracy and fluency, with differences between models generally minor. These findings demonstrate that model compression strategies can substantially reduce computational demands and environmental impact while maintaining competitive translation quality, though trade-offs are more pronounced in low-resource settings. We argue for evaluation frameworks that integrate efficiency and sustainability alongside accuracy as central dimensions of progress in NLP.
Paper and project links
Summary
The rapid expansion of large language models (LLMs) has raised concerns about their computational and environmental costs. Using machine translation as a case study, this work compares full-scale, distilled, and quantized models on the Flores+ benchmark and through human judgments of conversational translations in French, Hindi, and Kannada. The full 3.3B FP32 model achieves the highest BLEU scores but also the largest environmental footprint (about 0.007-0.008 kg CO2 per run). The distilled 600M FP32 model reduces inference time by 71-78% and carbon emissions by 63-65% relative to the full model with only minimal BLEU losses, and human evaluations show that even aggressive INT4 quantization preserves high accuracy and fluency, with generally minor differences between models. Model compression can therefore substantially cut computational demand and environmental impact while keeping translation quality competitive, although the trade-offs are more pronounced in low-resource settings. The authors argue for evaluation frameworks that treat efficiency and sustainability, alongside accuracy, as central dimensions of progress in NLP.
Key Takeaways
- The rapid growth of large language models raises concerns about computational and environmental costs.
- The study compares full-scale, distilled, and quantized models on machine translation.
- The full 3.3B FP32 model achieves the highest BLEU scores but at the highest environmental cost.
- The distilled model keeps translation quality high while substantially cutting inference time and carbon emissions.
- Quantized (INT4) models preserve translation accuracy and fluency well.
- Model-compression strategies help reduce computational demand and environmental impact.
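To put the reported savings in scale, here is a small back-of-the-envelope calculation using midpoints of the figures quoted above; the per-run number for the distilled model is derived from the stated percentage reduction rather than reported directly.

```python
# Midpoints of the figures quoted in the abstract.
full_co2_per_run = 0.0075     # kg CO2 per run, 3.3B FP32 model (~0.007-0.008)
emission_reduction = 0.64     # distilled 600M model: 63-65% lower emissions
runs = 10_000                 # e.g. a modest evaluation campaign (assumed)

distilled_co2_per_run = full_co2_per_run * (1 - emission_reduction)
saved = (full_co2_per_run - distilled_co2_per_run) * runs
print(f"distilled: {distilled_co2_per_run * 1000:.1f} g/run, "
      f"saved over {runs} runs: {saved:.1f} kg CO2")
# -> distilled: 2.7 g/run, saved over 10000 runs: 48.0 kg CO2
```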
View paper screenshots



From Reasoning to Answer: Empirical, Attention-Based and Mechanistic Insights into Distilled DeepSeek R1 Models
Authors:Jue Zhang, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
Large Reasoning Models (LRMs) generate explicit reasoning traces alongside final answers, yet the extent to which these traces influence answer generation remains unclear. In this work, we conduct a three-stage investigation into the interplay between reasoning and answer generation in three distilled DeepSeek R1 models. First, through empirical evaluation, we demonstrate that including explicit reasoning consistently improves answer quality across diverse domains. Second, attention analysis reveals that answer tokens attend substantially to reasoning tokens, with certain mid-layer Reasoning-Focus Heads (RFHs) closely tracking the reasoning trajectory, including self-reflective cues. Third, we apply mechanistic interventions using activation patching to assess the dependence of answer tokens on reasoning activations. Our results show that perturbations to key reasoning tokens can reliably alter the final answers, confirming a directional and functional flow of information from reasoning to answer. These findings deepen our understanding of how LRMs leverage reasoning tokens for answer generation, highlighting the functional role of intermediate reasoning in shaping model outputs. Our data and code are publicly available at \href{https://aka.ms/R2A-code}{this URL}.
Paper and project links
PDF Accepted by EMNLP’25 (Main)
Summary
This paper studies the interplay between reasoning traces and answer generation in three distilled DeepSeek R1 models through a three-stage investigation. First, empirical evaluation shows that including explicit reasoning consistently improves answer quality across diverse domains. Second, attention analysis reveals that answer tokens attend substantially to reasoning tokens, with certain mid-layer Reasoning-Focus Heads (RFHs) closely tracking the reasoning trajectory, including self-reflective cues. Third, mechanistic interventions via activation patching show that perturbing key reasoning tokens reliably alters the final answers, confirming a directional, functional flow of information from reasoning to answer. The findings deepen our understanding of how large reasoning models use reasoning tokens when generating answers and highlight the functional role of intermediate reasoning in shaping model outputs.
Key Takeaways
- Including explicit reasoning improves answer quality across diverse domains.
- Attention analysis shows that answer tokens depend heavily on reasoning tokens.
- Mid-layer Reasoning-Focus Heads (RFHs) closely track the reasoning trajectory.
- Mechanistic interventions confirm that perturbing key reasoning tokens changes the final answers.
- Information flows from reasoning to answer in a directional, functional way.
- Intermediate reasoning plays an important role in shaping model outputs.
- The data and code are publicly available.
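Activation patching, as used in the third stage above, means overwriting hidden states at chosen (here, reasoning-token) positions during a second forward pass and observing how the output changes. Below is a generic, self-contained sketch using a PyTorch forward hook on a toy block; the toy module and the choice of positions are assumptions for illustration, not the paper's setup.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Stand-in for one decoder block of a large reasoning model."""
    def __init__(self, d: int):
        super().__init__()
        self.proj = nn.Linear(d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, d)
        return torch.relu(self.proj(x))

def patch_positions(block: nn.Module, cached: torch.Tensor, positions: list[int]):
    """Register a forward hook that swaps in cached activations at `positions`."""
    def hook(_module, _inputs, output):
        patched = output.clone()
        patched[:, positions, :] = cached[:, positions, :]
        return patched  # returning a tensor from a forward hook overrides the output
    return block.register_forward_hook(hook)

d, seq = 16, 10
block = ToyBlock(d)
x_clean, x_corrupt = torch.randn(1, seq, d), torch.randn(1, seq, d)
with torch.no_grad():
    cached = block(x_clean)                 # cache activations from the "clean" run
    handle = patch_positions(block, cached, positions=[3, 4, 5])
    patched_out = block(x_corrupt)          # second run with the patch applied
    handle.remove()
print(torch.allclose(patched_out[:, 3:6], cached[:, 3:6]))  # True: those positions were patched
```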
View paper screenshots






$p$-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
Authors:Runyan Tan, Shuang Wu, Phillip Howard
Obtaining high-quality outputs from Large Language Models (LLMs) often depends upon the choice of a sampling-based decoding strategy to probabilistically choose the next token at each generation step. While a variety of such sampling methods have been proposed, their performance can be sensitive to the selection of hyperparameters which may require different settings depending upon the generation task and temperature configuration. In this work, we introduce $p$-less sampling: an information-theoretic approach to sampling which dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing methods, $p$-less sampling has no hyperparameters and consistently produces high-quality outputs as temperature increases. We provide theoretical perspectives on $p$-less sampling to ground our proposed method and conduct experiments to empirically validate its effectiveness across a range of math, logical reasoning, and creative writing tasks. Our results demonstrate how $p$-less sampling consistently outperforms existing sampling approaches while exhibiting much less degradation in text quality at higher temperature values. We further show how $p$-less achieves greater inference-time efficiency than alternative methods through lower average token sampling times and shorter generation lengths, without sacrificing accuracy. Finally, we provide analyses to highlight the benefits of $p$-less through qualitative examples, case studies, and diversity assessments.
Paper and project links
Summary
This paper introduces p-less sampling, an information-theoretic decoding strategy for large language models that dynamically sets a truncation threshold at each decoding step based on the entire token probability distribution. Unlike existing sampling methods, it has no hyperparameters and keeps producing high-quality outputs as temperature increases. The authors give theoretical grounding for the approach and validate it empirically on math, logical reasoning, and creative writing tasks, where p-less sampling consistently outperforms existing sampling approaches and degrades far less at high temperatures. It also achieves greater inference-time efficiency through lower average token-sampling times and shorter generations without sacrificing accuracy, and qualitative examples, case studies, and diversity assessments further illustrate its benefits.
Key Takeaways
- p-less sampling is an information-theoretic sampling method for LLM decoding.
- It dynamically sets a truncation threshold from the entire token probability distribution at each step.
- It has no hyperparameters to tune, so it adapts across generation tasks and temperature settings.
- It maintains high-quality outputs as temperature increases.
- Experiments show it outperforms existing sampling methods across a range of tasks.
- It is also more efficient at inference time, with lower average token-sampling time and shorter generations.
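The paper's exact thresholding rule is its contribution and is not reproduced here; the sketch below only illustrates the general shape of a hyperparameter-free, distribution-dependent cutoff. The specific choice of exp(-entropy) as the cutoff is an illustrative assumption (note that the highest-probability token always clears this cutoff, so the kept set is never empty).

```python
import math

def entropy_truncated_probs(probs: list[float]) -> list[float]:
    """Keep tokens whose probability exceeds a cutoff derived from the
    distribution itself (here exp(-H), with H the Shannon entropy in nats),
    then renormalize. The cutoff adapts to each decoding step's distribution,
    so no top-p / top-k style hyperparameter is needed.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0)
    cutoff = math.exp(-h)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# Peaked distribution: the cutoff is high, so only strong candidates survive.
print([round(p, 3) for p in entropy_truncated_probs([0.7, 0.2, 0.05, 0.05])])
# Flat distribution: the cutoff drops and every token stays in play.
print([round(p, 3) for p in entropy_truncated_probs([0.25, 0.25, 0.25, 0.25])])
```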
View paper screenshots




Critique to Verify: Accurate and Honest Test-Time Scaling with RL-Trained Verifiers
Authors:Zhicheng Yang, Zhijiang Guo, Yinya Huang, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang
Test-time scaling via solution sampling and aggregation has become a key paradigm for improving the reasoning performance of Large Language Models (LLMs). While reward model selection is commonly employed in this approach, it often fails to identify minority-yet-correct answers, which limits its effectiveness beyond that of simple majority voting. We argue that this limitation stems from a lack of informative critique signals during verifier training. To bridge this gap, we introduce Mirror-Critique, a framework that trains a verifier with informative critiques. Our key insight is to leverage the rich critique signal by contrasting model-generated solutions with ground-truth solutions. We deploy a small instruction-tuned model to synthesize high-quality critique data with rejection sampling that teaches the verifier not only what is wrong, but also why. The synthetic data is used to cold-start the LLMs in the RLVR process to further improve the verification ability. The resulting Mirror-Verifier is deployed to evaluate candidate solutions by generating multiple critiques per solution, aggregating them into a verify score used for weighted voting or selective abstention. The experimental results show that our Mirror-Verifier significantly outperforms majority voting in terms of solution accuracy and also improves the solver’s honesty to recognize and abstain from answering beyond its capability boundaries.
Paper and project links
PDF 15 pages, 7 figures
Summary
Test-time scaling via solution sampling and aggregation is a key way to improve LLM reasoning, but reward-model selection often fails to identify minority-yet-correct answers, limiting its effectiveness to roughly that of simple majority voting. The authors attribute this to the lack of informative critique signals during verifier training and introduce Mirror-Critique, a framework that trains a verifier with informative critiques obtained by contrasting model-generated solutions with ground-truth solutions. A small instruction-tuned model synthesizes high-quality critique data with rejection sampling, teaching the verifier not only what is wrong but also why, and the synthetic data is used to cold-start the LLMs in the RLVR process to further improve verification ability. The resulting Mirror-Verifier scores candidate solutions by generating multiple critiques per solution and aggregating them into a verify score used for weighted voting or selective abstention. It significantly outperforms majority voting in solution accuracy and also improves the solver's honesty in recognizing, and abstaining from, questions beyond its capability boundaries.
Key Takeaways
- Solution sampling and aggregation at test time is a key paradigm for improving LLM reasoning performance.
- Reward-model selection often fails to identify minority-yet-correct answers.
- This limitation stems from the lack of informative critique signals during verifier training.
- Mirror-Critique addresses it by contrasting model-generated solutions with ground-truth solutions to train a verifier with critique ability.
- A small instruction-tuned model synthesizes high-quality critique data with rejection sampling.
- The synthetic critique data is used to cold-start the verifier and further improve its verification ability.
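The deployment side described above (several critiques per candidate, aggregated into a verify score, then weighted voting or selective abstention) can be sketched as follows; the 0.5 abstention threshold and the `critique_scores` callable are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Optional, Sequence

def verify_and_vote(
    candidates: Sequence[str],                            # sampled solutions for one problem
    critique_scores: Callable[[str], Sequence[float]],    # k critique verdicts in [0, 1]
    abstain_below: float = 0.5,
) -> Optional[str]:
    """Aggregate per-candidate critique verdicts into verify scores, do
    verify-score-weighted voting over identical answers, and abstain if even
    the best answer's average support looks too weak."""
    votes: dict[str, float] = defaultdict(float)
    for cand in candidates:
        votes[cand] += mean(critique_scores(cand))  # verify score of this sample
    best, support = max(votes.items(), key=lambda kv: kv[1])
    if support / len(candidates) < abstain_below:
        return None                                  # honest abstention
    return best

# Stub critic: pretends "42" is usually judged correct and everything else is not.
toy_critic = lambda ans: [0.9, 0.8, 0.85] if ans == "42" else [0.2, 0.3, 0.1]
print(verify_and_vote(["42", "41", "42", "42"], toy_critic))  # "42"
```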
View paper screenshots



Multiplayer Nash Preference Optimization
Authors:Fang Wu, Xu Huang, Weihao Xuan, Zhiwei Zhang, Yijia Xiao, Guancheng Wan, Xiaomin Li, Bing Hu, Peng Xia, Jure Leskovec, Yejin Choi
Reinforcement learning from human feedback (RLHF) has emerged as the standard paradigm for aligning large language models (LLMs) with human preferences. However, reward-based methods built on the Bradley-Terry assumption struggle to capture the non-transitive and heterogeneous nature of real-world preferences. To address this, recent studies have reframed alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF). While this perspective has inspired algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, they remain fundamentally restricted to two-player interactions, creating a single-opponent bias that fails to capture the full complexity of realistic preference structures. In this work, we introduce Multiplayer Nash Preference Optimization (MNPO), a novel framework that generalizes NLHF to the multiplayer regime. It formulates alignment as an $n$-player game, where each policy competes against a population of opponents while being regularized toward a reference model. Our framework establishes well-defined Nash equilibria in multiplayer settings and extends the concept of duality gap to quantify approximation quality. We demonstrate that MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and improved coverage of diverse preference structures. Through comprehensive empirical evaluation, we show that MNPO consistently outperforms existing NLHF baselines on instruction-following benchmarks, achieving superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation scenarios. Together, these results establish MNPO as a principled and scalable framework for aligning LLMs with complex, non-transitive human preferences. Code is available at https://github.com/smiles724/MNPO.
Paper and project links
Summary
Reinforcement learning from human feedback (RLHF) is the standard paradigm for aligning large language models with human preferences, but reward-based methods built on the Bradley-Terry assumption struggle with the non-transitive, heterogeneous nature of real-world preferences. Recent work reframes alignment as a two-player Nash game (Nash learning from human feedback, NLHF), inspiring algorithms such as INPO, ONPO, and EGPO with strong theoretical and empirical guarantees, yet these remain limited to two-player interactions and a single-opponent bias that cannot capture the full complexity of realistic preference structures. The paper introduces Multiplayer Nash Preference Optimization (MNPO), which generalizes NLHF to the multiplayer regime by formulating alignment as an n-player game in which each policy competes against a population of opponents while being regularized toward a reference model. The framework establishes well-defined Nash equilibria in multiplayer settings and extends the duality gap to quantify approximation quality. MNPO inherits the equilibrium guarantees of two-player methods while enabling richer competitive dynamics and better coverage of diverse preference structures, and it consistently outperforms existing NLHF baselines on instruction-following benchmarks, with superior alignment quality under heterogeneous annotator conditions and mixed-policy evaluation. Code is available at https://github.com/smiles724/MNPO.
Key Takeaways
- Reinforcement learning from human feedback (RLHF) is the standard paradigm for aligning large language models.
- Reward methods based on the Bradley-Terry assumption struggle to capture non-transitive, heterogeneous real-world preferences.
- Recent work recasts alignment as a two-player Nash game, giving rise to Nash learning from human feedback (NLHF).
- Existing algorithms such as INPO, ONPO, and EGPO have strong guarantees but are restricted to two-player interactions and cannot capture full preference structures.
- Multiplayer Nash Preference Optimization (MNPO) generalizes NLHF to the multiplayer regime, allowing richer competitive dynamics and broader coverage of preference structures.
- MNPO establishes well-defined Nash equilibria in multiplayer settings and extends the duality gap concept to quantify approximation quality.
- Empirical evaluation shows MNPO outperforms existing NLHF baselines on instruction-following benchmarks, especially under heterogeneous annotators and mixed-policy evaluation.
View paper screenshots

