
R1_Reasoning


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them for serious academic work; they are only intended as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-15

PROPA: Toward Process-level Optimization in Visual Reasoning via Reinforcement Learning

Authors:Yanbei Jiang, Chao Lei, Yihao Ding, Krista Ehinger, Jey Han Lau

Despite significant progress, Vision-Language Models (VLMs) still struggle with complex visual reasoning, where multi-step dependencies cause early errors to cascade through the reasoning chain. Existing post-training paradigms are limited: Supervised Fine-Tuning (SFT) relies on costly step-level annotations, while Reinforcement Learning with Verifiable Rewards (RLVR) methods like GRPO provide only sparse, outcome-level feedback, hindering stable optimization. We introduce PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a novel framework that integrates Monte Carlo Tree Search (MCTS) with GRPO to generate dense, process-level rewards and optimize reasoning at each intermediate step without human annotations. To overcome the cold-start problem, PROPA interleaves GRPO updates with SFT, enabling the model to learn from both successful and failed reasoning trajectories. A Process Reward Model (PRM) is further trained to guide inference-time search, aligning the test-time search with the training signal. Across seven benchmarks and four VLM backbones, PROPA consistently outperforms both SFT- and RLVR-based baselines. It achieves up to 17.0% gains on in-domain tasks and 21.0% gains on out-of-domain tasks compared to existing state-of-the-art, establishing a strong reasoning and generalization capability for visual reasoning tasks. The code is available at: https://github.com/YanbeiJiang/PROPA.
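The core mechanism described above is converting MCTS statistics into dense, process-level rewards that a GRPO-style update can consume at every intermediate step. The sketch below is not the authors' implementation: the `(G, T)` array layout, the per-step grouping, and the use of MCTS value estimates as step rewards are assumptions made purely to illustrate group-relative advantages computed per step.

```python
import numpy as np

def process_level_advantages(step_rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style group-relative advantages, computed per reasoning step.

    step_rewards: shape (G, T) -- a dense reward for each of G sampled
    continuations at each of T intermediate steps (e.g. MCTS value estimates
    of how often a step eventually leads to a correct answer).
    Each step position is treated as its own group, so every intermediate
    step receives feedback instead of only the final outcome.
    """
    mean = step_rewards.mean(axis=0, keepdims=True)
    std = step_rewards.std(axis=0, keepdims=True)
    return (step_rewards - mean) / (std + eps)

# Toy usage: two candidate reasoning chains, three steps each.
step_rewards = np.array([[0.8, 0.6, 0.9],
                         [0.2, 0.1, 0.0]])
print(process_level_advantages(step_rewards))
```

The interleaved SFT phase and the PRM that guides inference-time search sit on top of this per-step signal and are not sketched here.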


Paper and Project Links

PDF

Summary

This paper introduces PROPA (Process-level Reasoning Optimization with interleaved Policy Alignment), a new optimization framework for visual reasoning tasks. It integrates Monte Carlo Tree Search (MCTS) with the RLVR method GRPO to generate process-level rewards and optimize reasoning at every intermediate step without human annotations, and it interleaves GRPO updates with supervised fine-tuning (SFT) to address the cold-start problem. Together with a trained Process Reward Model (PRM) that guides inference-time search, PROPA improves model performance, especially on reasoning tasks, and outperforms other methods across seven benchmarks and four VLM backbones. The code has been released on GitHub.

Key Takeaways

  • PROPA is a new optimization framework for visual reasoning tasks that combines the strengths of supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR).
  • PROPA uses Monte Carlo Tree Search (MCTS) to generate process-level rewards, optimizing the reasoning process at every intermediate step.
  • The approach overcomes limitations of existing methods, achieving stable optimization without human annotations.
  • By addressing the cold-start problem and training a Process Reward Model (PRM), PROPA further improves model performance.
  • PROPA performs strongly across multiple benchmarks and different VLM backbones.
  • PROPA achieves significant gains over prior state-of-the-art, demonstrating strong reasoning and generalization capability.

Cool Papers

Click here to view paper screenshots

Image Aesthetic Reasoning via HCM-GRPO: Empowering Compact Model for Superior Performance

Authors:Zhiyuan Hu, Zheng Sun, Yi Wei, Long Yu

The performance of image generation has been significantly improved in recent years. However, the study of image screening is rare and its performance with Multimodal Large Language Models (MLLMs) is unsatisfactory due to the lack of data and the weak image aesthetic reasoning ability in MLLMs. In this work, we propose a complete solution to address these problems in terms of data and methodology. For data, we collect a comprehensive image screening dataset with over 128k samples, about 640k images. Each sample consists of an original image and four generated images. The dataset evaluates the image aesthetic reasoning ability under four aspects: appearance deformation, physical shadow, placement layout, and extension rationality. Regarding data annotation, we investigate multiple approaches, including purely manual, fully automated, and answer-driven annotations, to acquire high-quality chains of thought (CoT) data in the most cost-effective manner. Methodologically, we introduce a Hard Cases Mining (HCM) strategy with a Dynamic Proportional Accuracy (DPA) reward into the Group Relative Policy Optimization (GRPO) framework, called HCM-GRPO. This enhanced method demonstrates superior image aesthetic reasoning capabilities compared to the original GRPO. Our experimental results reveal that even state-of-the-art closed-source MLLMs, such as GPT4o and Qwen-VL-Max, exhibit performance akin to random guessing in image aesthetic reasoning. In contrast, by leveraging the HCM-GRPO, we are able to surpass the scores of both large-scale open-source and leading closed-source models with a much smaller model.
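The abstract names two ingredients added to GRPO, a Hard Cases Mining (HCM) strategy and a Dynamic Proportional Accuracy (DPA) reward, without spelling out their formulas. The sketch below is therefore a guess at their shape, not the paper's definition; the reweighting rule in `dpa_reward`, the 0.5 threshold, and both helper names are assumptions for illustration only.

```python
import numpy as np

def dpa_reward(correct: np.ndarray) -> np.ndarray:
    """Hypothetical 'dynamic proportional accuracy' reward: scale per-rollout
    correctness by how low the group accuracy is, so prompts the model mostly
    gets wrong contribute a stronger learning signal (assumed rule)."""
    group_acc = correct.mean()
    scale = 1.0 + (1.0 - group_acc)
    return correct * scale

def mine_hard_cases(group_accuracies: dict[str, float], threshold: float = 0.5) -> list[str]:
    """Hypothetical hard-case mining: keep prompts whose rollout-group accuracy
    falls below a threshold so they can be sampled again in later iterations."""
    return [pid for pid, acc in group_accuracies.items() if acc < threshold]

# Toy usage: one correct rollout out of four, plus two prompts with different accuracies.
correct = np.array([1.0, 0.0, 0.0, 0.0])
print(dpa_reward(correct))                         # the correct rollout is up-weighted
print(mine_hard_cases({"q1": 0.25, "q2": 0.9}))    # ['q1'] is kept as a hard case
```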


Paper and Project Links

PDF

Summary

This paper presents a complete solution to the open problems in image screening. The authors build an image screening dataset with over 128k samples and about 640k images for evaluating image aesthetic reasoning, and propose HCM-GRPO, a Group Relative Policy Optimization (GRPO) framework augmented with Hard Cases Mining (HCM) and a Dynamic Proportional Accuracy (DPA) reward. Experiments show that even advanced large multimodal models perform poorly at image aesthetic reasoning, while a much smaller model trained with HCM-GRPO surpasses their scores.

Key Takeaways

  • Image screening underperforms because of a lack of data and the weak image aesthetic reasoning ability of Multimodal Large Language Models (MLLMs).
  • An image screening dataset with over 128k samples and about 640k images is introduced to evaluate image aesthetic reasoning.
  • HCM-GRPO extends the GRPO optimization framework with a Hard Cases Mining (HCM) strategy and a Dynamic Proportional Accuracy (DPA) reward.
  • Experiments show that advanced large models perform poorly at image aesthetic reasoning, whereas HCM-GRPO lets a much smaller model outperform them.

Cool Papers

Click here to view paper screenshots

Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

Authors:Yujie Zhao, Lanxiang Hu, Yang Wang, Minmin Hou, Hao Zhang, Ke Ding, Jishen Zhao

Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies, such as GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, standard GRPO grouping assumptions break down because prompts vary by role and by turn. System-wise, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models. We propose AT-GRPO, which includes (i) an agent- and turn-wise grouped RL algorithm tailored to MAS and (ii) a training system that supports both single- and multi-policy regimes. Across game, planning, coding, and math tasks, AT-GRPO delivers substantial gains. On long-horizon planning, it increases accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent. It also improves reasoning performance, with average gains of 3.87 to 7.62 percent on coding tasks and 9.0 to 17.93 percent on math. Code and environments are available at: https://github.com/pettingllms-ai/PettingLLMs.
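The algorithmic point in the abstract is that GRPO's grouping assumption fails when prompts differ by agent role and by turn, so advantages should be normalized within (agent, turn) groups. The sketch below illustrates that grouping only; the rollout record layout (`agent`, `turn`, `reward` keys) and the normalization constant are assumptions, not the paper's code.

```python
from collections import defaultdict
import numpy as np

def agent_turn_grouped_advantages(samples: list[dict], eps: float = 1e-6) -> list[dict]:
    """Group rollouts by (agent role, turn index) and compute GRPO-style
    advantages only within each group, since prompts are not comparable
    across roles or turns in a multi-agent workflow."""
    groups = defaultdict(list)
    for i, sample in enumerate(samples):
        groups[(sample["agent"], sample["turn"])].append(i)
    for indices in groups.values():
        rewards = np.array([samples[i]["reward"] for i in indices])
        advantages = (rewards - rewards.mean()) / (rewards.std() + eps)
        for i, adv in zip(indices, advantages):
            samples[i]["advantage"] = float(adv)
    return samples

# Toy usage: planner and coder rollouts are normalized separately.
rollouts = [
    {"agent": "planner", "turn": 0, "reward": 1.0},
    {"agent": "planner", "turn": 0, "reward": 0.0},
    {"agent": "coder",   "turn": 1, "reward": 0.5},
    {"agent": "coder",   "turn": 1, "reward": 0.7},
]
print(agent_turn_grouped_advantages(rollouts))
```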


Paper and Project Links

PDF

Summary

This paper studies the combination of multi-agent systems (MAS) and reinforcement learning (RL) for large language models (LLMs). MAS improves task performance through role-based orchestration, while RL learns stronger policies from environmental rewards, such as GRPO-style optimization. Applying on-policy RL to MAS, however, remains underexplored and poses unique challenges. The authors therefore propose AT-GRPO, consisting of an agent- and turn-wise grouped RL algorithm tailored to MAS and a training system that supports both single-policy and multi-policy regimes. Experiments show substantial gains on game, planning, coding, and math tasks.

Key Takeaways

  1. Multi-agent systems (MAS) and reinforcement learning (RL) are used to enhance the agentic capabilities of large language models (LLMs).
  2. MAS improves task performance through role-based orchestration, while RL uses environmental rewards to learn stronger policies.
  3. Applying on-policy RL to MAS poses unique challenges, because standard GRPO grouping assumptions break down when prompts vary by role and by turn.
  4. AT-GRPO is proposed, comprising a tailored grouped RL algorithm and a training system that supports both single-policy and multi-policy regimes.
  5. AT-GRPO delivers substantial gains on game, planning, coding, and math tasks.
  6. On long-horizon planning, AT-GRPO raises accuracy from a 14.0 to 47.0 percent single-agent RL baseline to 96.0 to 99.5 percent.

Cool Papers

Click here to view paper screenshots

TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

Authors:Jianhui Yang, Yiming Jin, Pengkun Jiao, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang

Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.
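The two mechanisms the abstract introduces, rule-aware reward shaping and adaptive guided replay, can be caricatured as follows. This is only a sketch of the idea: the per-criterion scoring, the `group_accuracy` and `guidance` fields, and the 0.3 threshold are all assumptions, since the paper's actual reward decomposition and replay logic are not given in the abstract.

```python
def rule_aware_reward(prediction: dict, criteria: dict) -> float:
    """Hypothetical rule-aware reward shaping: instead of one sparse terminal
    reward, award partial credit for every domain relevance criterion the
    rollout satisfies (prediction and criteria map criterion name -> label)."""
    hits = [1.0 if prediction.get(name) == target else 0.0
            for name, target in criteria.items()]
    return sum(hits) / max(len(hits), 1)

def adaptive_guided_replay(rollouts: list[dict], accuracy_threshold: float = 0.3) -> list[dict]:
    """Hypothetical adaptive guided replay: rollouts from prompts with low group
    accuracy get targeted ground-truth guidance prepended before being replayed,
    steering the policy away from stagnant, rule-violating reasoning patterns."""
    guided = []
    for rollout in rollouts:
        if rollout["group_accuracy"] < accuracy_threshold:
            rollout = {**rollout, "prompt": rollout["guidance"] + "\n" + rollout["prompt"]}
        guided.append(rollout)
    return guided

# Toy usage of the shaped reward: one of two criteria satisfied -> 0.5.
print(rule_aware_reward({"category_match": "yes", "attribute_match": "no"},
                        {"category_match": "yes", "attribute_match": "yes"}))
```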


Paper and Project Links

PDF

Summary

In the era of AI-powered shopping, query-product relevance prediction is fundamental to e-commerce search, yet conventional supervised fine-tuning and preference optimization struggle with complex business rules and long-tail cases. The paper proposes TaoSR-AGRL, an adaptive guided reinforcement learning framework for LLM-based relevance prediction, built on rule-aware reward shaping and adaptive guided replay. It outperforms baselines on large-scale real-world datasets and in online evaluations on Taobao Search, improving relevance accuracy, rule adherence, and training stability, and the trained model has been deployed in Taobao's main search scenario, serving hundreds of millions of users.

Key Takeaways

  1. Query-product relevance prediction is critical for e-commerce search in the era of AI-powered shopping.
  2. Large Language Models (LLMs) enable generative, reasoning-based approaches, but existing alignment methods cannot cope with complex business rules and long-tail cases.
  3. The TaoSR-AGRL framework addresses these challenges with rule-aware reward shaping and adaptive guided replay.
  4. Rule-aware reward shaping decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria.
  5. Adaptive guided replay targets low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns.
  6. TaoSR-AGRL outperforms baselines on large-scale real-world datasets and in online evaluations, improving relevance accuracy, rule adherence, and training stability.

Cool Papers

Click here to view paper screenshots

Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning

Authors:Yurun Yuan, Fan Chen, Zeyu Jia, Alexander Rakhlin, Tengyang Xie

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. We revisit the classical paradigm of Bellman Residual Minimization and introduce Trajectory Bellman Residual Minimization (TBRM), an algorithm that naturally adapts this idea to LLMs, yielding a simple yet effective off-policy algorithm that optimizes a single trajectory-level Bellman objective using the model’s own logits as $Q$-values. TBRM removes the need for critics, importance-sampling ratios, or clipping, and operates with only one rollout per prompt. We prove convergence to the near-optimal KL-regularized policy from arbitrary off-policy data via an improved change-of-trajectory-measure analysis. Experiments on standard mathematical-reasoning benchmarks show that TBRM consistently outperforms policy-based baselines, like PPO and GRPO, with comparable or lower computational and memory overhead. Our results indicate that value-based RL might be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.
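The abstract says TBRM optimizes a single trajectory-level Bellman objective with the model's own logits read as Q-values, but does not state the exact objective. The PyTorch sketch below is therefore a simplified soft-Q-style residual summed over one trajectory; the log-sum-exp value aggregation, the terminal value of zero, and the absence of a stop-gradient on the bootstrapped target are simplifying assumptions rather than the paper's formulation.

```python
import torch

def trajectory_bellman_residual(q_logits: torch.Tensor, actions: torch.Tensor,
                                rewards: torch.Tensor, gamma: float = 1.0,
                                beta: float = 1.0) -> torch.Tensor:
    """Toy trajectory-level Bellman residual (NOT the paper's exact objective).

    q_logits: (T, vocab) per-step logits, interpreted as Q-values.
    actions:  (T,) chosen token ids.  rewards: (T,) per-step rewards.
    The next-state value is a soft (log-sum-exp) aggregation of the next
    step's logits; a real implementation would typically also stop gradients
    through the bootstrapped target.
    """
    q_taken = q_logits.gather(1, actions.unsqueeze(1)).squeeze(1)      # Q(s_t, a_t)
    next_value = beta * torch.logsumexp(q_logits[1:] / beta, dim=-1)   # soft V(s_{t+1})
    next_value = torch.cat([next_value, torch.zeros(1)])               # terminal state value = 0
    residual = q_taken - (rewards + gamma * next_value)
    return residual.pow(2).sum()                                       # squared residual over the trajectory

# Toy usage with a sparse terminal reward, as in math-answer verification.
T, V = 4, 8
q_logits = torch.randn(T, V, requires_grad=True)
actions = torch.randint(V, (T,))
rewards = torch.zeros(T)
rewards[-1] = 1.0
print(trajectory_bellman_residual(q_logits, actions, rewards))
```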


Paper and Project Links

PDF NeurIPS 2025

Summary

Policy-based methods currently dominate reinforcement learning (RL) pipelines for large language model (LLM) reasoning, leaving value-based approaches largely unexplored. This work revisits the classical paradigm of Bellman residual minimization and introduces Trajectory Bellman Residual Minimization (TBRM), a simple and effective off-policy algorithm that uses the model's own logits as Q-values to optimize a single trajectory-level Bellman objective. An improved change-of-trajectory-measure analysis proves convergence to a near-optimal KL-regularized policy from arbitrary off-policy data. On standard mathematical-reasoning benchmarks, TBRM consistently outperforms policy-based baselines such as PPO and GRPO with comparable or lower compute and memory overhead, indicating that value-based RL may be a principled and efficient alternative for enhancing reasoning capabilities in LLMs.

Key Takeaways

  1. Value-based methods remain underexplored in reinforcement learning for large language model (LLM) reasoning.
  2. Trajectory Bellman Residual Minimization (TBRM) is proposed: a value-based RL algorithm that uses the model's own logits as Q-values for training and optimization.
  3. TBRM needs no critic model, importance-sampling ratios, or clipping, and operates with only one rollout per prompt, which is an advantage for applying RL to large language models.
  4. TBRM converges to a near-optimal KL-regularized policy from arbitrary off-policy data, which adds flexibility and adaptability.
  5. Experiments on standard mathematical-reasoning benchmarks show that TBRM outperforms policy-based methods such as PPO and GRPO, suggesting that value-based RL is an effective way to strengthen LLM reasoning.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!