⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-10-19 更新
RoboGPT-R1: Enhancing Robot Planning with Reinforcement Learning
Authors:Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li
Improving the reasoning capabilities of embodied agents is crucial for robots to complete complex human instructions in long-view manipulation tasks successfully. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue facing challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by RL to address the model’s shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraint in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.
提高实体代理的推理能力对于机器人在长期操控任务中成功完成复杂的人类指令至关重要。尽管基于监督微调(SFT)的大型语言模型和视觉语言模型在规划任务中取得了成功,但它们仍然面临着在复杂的真实环境中执行长期操控任务的挑战,这是由于它们有限的常识和推理能力。考虑到通过监督微调将通用视觉语言模型与机器人规划任务对齐存在泛化性差和物理理解不足的问题,我们提出了RoboGPT-R1,这是一个用于实体规划的两阶段微调框架。在这个框架中,监督训练通过专家序列获得基础知识,然后通过强化学习来解决模型在视觉空间理解和推理方面的不足。为了实现多步推理任务中的物理理解和动作序列一致性,我们设计了一个基于规则的奖励函数,该函数同时考虑了长期性能和环境中的动作约束。在EmbodiedBench基准测试上,该推理模型的性能显著优于更大规模的GPT-4o-mini模型,高出21.33%,并且超过了其他在Qwen2.5-VL-7B上训练的工作,高出20.33%。
论文及项目相关链接
Summary
要让机器人在长程操控任务中成功完成复杂的人类指令,提升具身智能体的推理能力至关重要。虽然基于监督微调(SFT)的大型语言模型和视觉语言模型在规划任务中取得了成功,但由于常识和推理能力有限,它们在复杂真实环境中执行长程操控任务时仍面临挑战。为此,我们提出了RoboGPT-R1,一个用于具身规划的两阶段微调框架:先通过专家序列的监督训练获取基础知识,再通过强化学习弥补模型在视觉空间理解和推理方面的不足。为实现多步推理任务中的物理理解和动作序列一致性,我们设计了一个基于规则的奖励函数,同时考虑长程表现和环境中的动作约束。在EmbodiedBench基准测试中,基于Qwen2.5-VL-3B训练的推理模型显著优于更大规模的GPT-4o-mini(高出21.33%),并超过其他基于Qwen2.5-VL-7B训练的工作20.33%。
Key Takeaways
- 提高机器人的推理能力对于完成复杂的长期操作任务至关重要。
- 现有的大型语言模型和视觉语言模型在真实环境的长期操作任务中面临挑战,因为它们缺乏通用常识和推理能力。
- 提出了RoboGPT-R1框架,结合监督训练和强化学习,以提高具身规划任务的性能。
- RoboGPT-R1通过两阶段训练:第一阶段通过专家序列进行监督训练,第二阶段解决视觉空间理解和推理的不足。
- 设计了基于规则的奖励函数,以实现多步推理任务中的物理理解和动作序列一致性。
- 在EmbodiedBench基准测试中,基于Qwen2.5-VL-3B训练的RoboGPT-R1比GPT-4o-mini高出21.33%,比基于Qwen2.5-VL-7B训练的其他工作高出20.33%。
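下面给出一个基于规则的奖励函数的极简 Python 示意(并非论文的官方实现;动作格式、惩罚系数与函数名均为整理时的假设),用于说明如何同时兼顾长程动作序列的表现与环境中的动作约束:

```python
# 极简示意:同时考虑长程表现与动作约束的规则奖励(动作格式与系数均为假设,非论文官方实现)

def rule_based_reward(pred_actions, expert_actions, valid_actions):
    """pred_actions: 模型生成的动作序列(字符串列表)
    expert_actions: 专家参考序列
    valid_actions: 当前环境允许的动作集合(动作约束)"""
    # 约束项:惩罚环境中不合法的动作
    illegal = sum(1 for a in pred_actions if a not in valid_actions)
    constraint_penalty = -0.5 * illegal

    # 长程项:按顺序逐步比对,奖励与专家序列一致的前缀长度
    matched = 0
    for p, e in zip(pred_actions, expert_actions):
        if p != e:
            break
        matched += 1
    progress_reward = matched / max(len(expert_actions), 1)

    # 完成整条序列时给予额外奖励
    completion_bonus = 1.0 if matched == len(expert_actions) else 0.0
    return progress_reward + completion_bonus + constraint_penalty


if __name__ == "__main__":
    valid = {"pick(apple)", "place(table)", "open(fridge)"}
    expert = ["open(fridge)", "pick(apple)", "place(table)"]
    pred = ["open(fridge)", "pick(apple)", "fly(moon)"]
    print(rule_based_reward(pred, expert, valid))  # 前缀匹配 2/3,含一次非法动作惩罚
```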
点此查看论文截图
ATGen: Adversarial Reinforcement Learning for Test Case Generation
Authors:Qingyao Li, Xinyi Dai, Weiwen Liu, Xiangyang Li, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, for which effective test cases are a critical bottleneck. Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets. This imposes a "fixed-difficulty ceiling", fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope. To overcome this, we introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning. ATGen pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy. This dynamic loop creates a curriculum of increasing difficulty challenging current policy. The test generator is optimized via Reinforcement Learning (RL) to jointly maximize "Output Accuracy" and "Attack Success", enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training. Extensive experiments demonstrate that ATGen significantly outperforms state-of-the-art baselines. We further validate its practical utility, showing it serves as both a more effective filter for Best-of-N inference and a higher-quality reward source for training code generation models. Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.
大型语言模型(LLM)在代码生成方面表现出色,但它们的输出常常含有微妙的错误,有效的测试用例是关键的瓶颈。现有的测试生成方法,无论基于提示还是监督微调,都依赖于静态数据集。这设定了一个“固定难度上限”,从根本上限制了它们发现训练范围之外的新颖或更复杂错误的能力。为了克服这一点,我们引入了ATGen,一个通过对抗性强化学习来训练测试用例生成器的框架。ATGen将测试生成器与对抗性代码生成器相对抗,后者不断制造更难以解决的错误来逃避当前策略。这种动态循环创建了一个难度不断增加的课程,以挑战当前策略。测试生成器通过强化学习(RL)进行优化,以最大化“输出准确性”和“攻击成功性”,使其能够学习一个逐渐强大的策略,打破静态训练的固定难度上限。大量实验表明,ATGen显著优于最新基线。我们进一步验证了其在实践中的实用性,表明它既是更有效的Best-of-N推理过滤器,也是更高质量的代码生成模型训练奖励来源。我们的工作为提高LLM生成代码的可靠性建立了新的动态范式。
论文及项目相关链接
Summary
大型语言模型(LLMs)擅长代码生成,但其输出常含有微妙错误,有效的测试用例是关键的瓶颈。现有的测试生成方法,无论是基于提示还是监督微调,都依赖于静态数据集,这导致它们存在“固定难度上限”,无法发现训练范围之外的新颖或复杂错误。为突破此限制,我们提出ATGen框架,通过对抗性强化学习训练测试用例生成器。ATGen将测试生成器与对抗性代码生成器相对抗,后者不断制造更难以被发现的错误以逃避当前策略。这种动态循环创建了一个不断增长的难度课程,挑战当前策略。测试生成器通过强化学习(RL)进行优化,以最大化“输出准确性”和“攻击成功率”,使其学会逐步强大的策略,打破静态训练的固定难度上限。实验表明,ATGen显著优于最新基线。我们进一步验证了其实用性,表明它既是Best-of-N推理的更有效过滤器,也是训练代码生成模型的高质量奖励来源。我们的工作为提升LLM生成代码的可靠性建立了新的动态范式。
Key Takeaways
- LLMs擅长代码生成,但输出常含错误,需要有效的测试方法来检查其正确性。
- 当前测试生成方法存在“固定难度上限”,难以发现新颖或复杂错误。
- ATGen框架通过对抗性强化学习训练测试生成器来克服此问题。
- ATGen创建一个难度递增的课程来挑战测试生成器的策略。
- ATGen优化的测试生成器能最大化输出准确性和攻击成功率,打破静态训练的难度上限。
- 实验证明ATGen显著优于其他方法。
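为说明"对抗式课程"这一训练循环的结构,下面给出一个高度简化的玩具示意:其中被测程序、bug 注入方式与"策略更新"都用简单的规则代替真实的 LLM 与 RL 优化,奖励按"输出准确性 + 攻击成功"组合,属于假设性草图而非论文代码:

```python
import random

# 玩具示意:测试生成器与不断制造更难 bug 的对抗代码生成器交替迭代(均为假设的简化)
def reference(x):            # 规范实现
    return sorted(x)

def make_buggy(difficulty):
    """对抗方:难度越高,bug 的触发条件越苛刻(仅在输入长度超过阈值时出错)"""
    def buggy(x):
        y = sorted(x)
        if len(x) > difficulty:      # 隐蔽 bug:长输入时丢掉最后一个元素
            return y[:-1]
        return y
    return buggy

def generate_tests(policy_max_len, n=5):
    """测试生成器策略:用其可生成的最大输入长度来刻画,真实方法中由 RL 更新"""
    return [[random.randint(0, 9) for _ in range(random.randint(1, policy_max_len))]
            for _ in range(n)]

random.seed(0)
policy_max_len, difficulty = 3, 2
for step in range(6):
    buggy = make_buggy(difficulty)
    tests = generate_tests(policy_max_len)
    # 奖励 = 输出准确性(此处恒为 1)+ 攻击成功(至少一个测试揭示 bug)
    attack_success = any(buggy(t) != reference(t) for t in tests)
    reward = 1.0 + (1.0 if attack_success else 0.0)
    # 简化的"策略更新":抓到 bug 则对抗方加大难度,否则测试方扩大搜索范围
    if attack_success:
        difficulty += 1
    else:
        policy_max_len += 1
    print(f"step={step} reward={reward} policy_max_len={policy_max_len} difficulty={difficulty}")
```

这个循环体现了摘要所说的"难度不断增加的课程":双方交替变强,测试生成器不会停留在固定难度上。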
点此查看论文截图
Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms
Authors:Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty
Web-based ‘deep research’ agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
基于网络的"深度研究"代理旨在通过与在线工具进行长程交互来解决复杂的问答任务。这些任务仍然具有挑战性,因为底层语言模型通常没有针对长程推理和探索进行优化。先前的工作提出了构建指令微调数据集的工作流程,通常利用知识图谱。然而,这些方法往往缺乏对难度和质量的精细控制,产生的合成数据无法体现长程推理所需的复杂性。此外,许多研究通过比较在不同优化配方下训练的模型,混淆了数据与训练的效果,难以单独评估数据本身的有效性。我们引入了一种双管齐下的数据合成管道,通过逐步增加任务复杂性来生成问答对,直到前沿基线网络代理失败为止。基线代理在此过程中扮演多个角色:尝试回答问题、验证事实性、检查替代答案和执行过滤。为了评估合成方法的有效性,我们采用了基于从强大网络代理进行蒸馏的受控训练设置。在多个网络基准测试上的实验表明,尽管我们的数据集规模更小,但它能训练出比现有数据集更有效的网络代理。特别是,我们的数据在工具使用动作上具有两倍的多样性,使得在其上训练的模型能够取得更强的性能,同时避免重复的工具调用行为。
论文及项目相关链接
PDF Preprint. ICLR 26 submission
Summary
该文本介绍了基于网络的“深度研究”代理,旨在通过在线工具进行长期交互来解决复杂的问答任务。然而,由于底层语言模型通常未针对长期推理和探索进行优化,这些任务仍然具有挑战性。作者提出了一种双管齐下的数据合成管道,通过逐步增加任务复杂性来生成问答对,直到前沿基线网络代理失败。为了评估合成方法的有效性,作者采用了一种基于从强大网络代理中提取知识的受控训练设置。实验表明,尽管数据量较小,但作者在多个网络基准测试上的数据集能够训练出更有效的网络代理,特别是该数据在工具使用行为上表现出更强的多样性,能够避免重复的工具调用行为。
Key Takeaways
- 基于网络的“深度研究”代理致力于通过在线工具进行长期交互完成复杂问答任务。
- 现有语言模型针对长期推理和探索的优化不足,使得这些任务具有挑战性。
- 作者提出了一种数据合成管道,通过逐步增加任务复杂性来生成问答对,直至基线网络代理无法应对。
- 基线代理在此过程中扮演多重角色,包括尝试问题、验证事实、检查替代答案和执行过滤。
- 作者采用了一种基于从强大网络代理中提取知识的受控训练设置来评估数据合成方法的有效性。
- 与现有数据集相比,作者在多个网络基准测试上的数据集表现更优秀,训练出更有效的网络代理。
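下面用一个玩具示例勾勒"逐步提高任务复杂度,直到基线代理失败"这一合成流程的控制逻辑(基线代理用一个只能处理有限步算术的函数模拟,问题形式与能力上限均为假设):

```python
import random

# 玩具示意:不断给问题"加一跳",直到基线代理答错为止,才把样本收入数据集
def make_question(hops):
    """生成需要 hops 步加法推理的问答对"""
    nums = [random.randint(1, 9) for _ in range(hops + 1)]
    question = " + ".join(map(str, nums))
    return question, sum(nums)

def baseline_agent(question, max_hops=3):
    """模拟的前沿基线代理:推理步数超过 max_hops 时放弃(返回 None)"""
    parts = question.split(" + ")
    if len(parts) - 1 > max_hops:
        return None
    return sum(int(p) for p in parts)

def synthesize_one(max_rounds=10):
    random.seed(0)
    hops = 1
    while hops <= max_rounds:
        q, answer = make_question(hops)
        pred = baseline_agent(q)
        if pred != answer:            # 基线失败:该难度的样本被保留
            return {"question": q, "answer": answer, "hops": hops}
        hops += 1                     # 基线还能答对:继续加大复杂度
    return None

print(synthesize_one())
```

真实管道中"基线失败"之外还要经过事实性验证、替代答案检查与过滤,这里仅示意难度递增这一条主线。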
点此查看论文截图
The Art of Scaling Reinforcement Learning Compute for LLMs
Authors:Devvrit Khatri, Lovish Madaan, Rishabh Tiwari, Rachit Bansal, Sai Surya Duvvuri, Manzil Zaheer, Inderjit S. Dhillon, David Brandfonbrener, Rishabh Agarwal
Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe: (1) Not all recipes yield similar asymptotic performance, (2) Details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote, and (3) Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.
强化学习(RL)已成为训练大型语言模型(LLM)的核心环节,然而该领域缺乏类似预训练中已经建立的那种可预测的扩展方法。尽管计算预算迅速增长,人们对如何评估算法改进以扩展RL计算仍缺乏原则性的理解。我们进行了超过40万GPU小时的首个大规模系统研究,定义了一个分析和预测LLM中RL扩展规律的原则性框架。我们为RL训练拟合了S形(sigmoid)计算-性能曲线,并对一系列常见设计选择进行消融,分析它们对渐近性能和计算效率的影响。我们观察到:(1)并非所有训练配方都能达到相似的渐近性能;(2)损失聚合、归一化、课程学习以及离策略算法等细节主要调节计算效率,而不会实质性改变渐近性能;(3)稳定、可扩展的配方遵循可预测的扩展轨迹,可以从较小规模的运行外推。结合这些见解,我们提出了最佳实践配方ScaleRL,并通过将单次RL运行成功扩展到10万GPU小时并预测其验证性能来证明其有效性。我们的工作既为分析RL中的扩展规律提供了科学框架,也提供了一个实用配方,使RL训练更接近预训练早已实现的可预测性。
论文及项目相关链接
PDF 28 pages, 20 figures
Summary
本文针对强化学习(RL)在大型语言模型(LLM)训练中的扩展规律开展了超过40万GPU小时的大规模系统研究,提出了一个用于分析和预测RL扩展性的原则性框架。研究通过拟合S形计算-性能曲线并消融常见设计选择,发现不同训练配方的渐近性能并不相同,而许多设计细节主要影响计算效率;稳定、可扩展的配方遵循可预测的扩展轨迹,可以从小规模运行外推。在此基础上提出最佳实践配方ScaleRL,并在扩展到10万GPU小时的单次RL运行中成功预测了验证性能,使RL训练更接近预训练已具备的可预测性。
Key Takeaways
以下是本文的主要见解:
- 强化学习(RL)在大规模语言模型(LLM)训练中扮演了核心角色,但目前缺乏预测其扩展性的方法。
- 对强化学习训练进行了大规模的系统性研究,并定义了原则性的框架来分析预测其在LLM中的扩展性。
- 并非所有训练配方都能带来相似的渐近性能;稳定、可扩展的配方遵循可预测的扩展轨迹,可以从小规模运行外推。
- 损失聚合、归一化、课程学习和离策略算法等设计细节主要调节计算效率,而不会实质性改变渐近性能。
- 结合这些发现,提出了最佳实践训练配方ScaleRL,并通过将单次RL运行扩展到10万GPU小时并成功预测其验证性能来证明其有效性。
- 该研究不仅提供了一个科学的框架来分析强化学习中的扩展性,还提出了一种实用的训练策略,使强化学习训练更接近长期实现的预训练预测性。
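论文拟合的是"计算量-性能"的S形曲线;下面给出一个用 scipy 拟合此类曲线并外推的最小示意(数据为虚构,参数化采用常见的 logistic 形式,未必与论文的具体公式一致):

```python
import numpy as np
from scipy.optimize import curve_fit

# 最小示意:对虚构的 (GPU小时, 验证分数) 数据拟合 S 形曲线并外推(参数化为常见 logistic 形式)
def sigmoid(log_c, A, k, c0):
    """A: 渐近性能上限;k: 与计算效率相关的斜率;c0: 曲线中点(对数计算量)"""
    return A / (1.0 + np.exp(-k * (log_c - c0)))

compute = np.array([1e2, 3e2, 1e3, 3e3, 1e4, 3e4])      # GPU 小时(虚构)
score = np.array([0.12, 0.21, 0.34, 0.46, 0.55, 0.60])  # 验证性能(虚构)

popt, _ = curve_fit(sigmoid, np.log10(compute), score, p0=[0.7, 1.0, 3.0])
A, k, c0 = popt
print(f"估计的渐近性能 A={A:.3f}, 效率参数 k={k:.3f}, 中点 c0={c0:.3f}")
print("外推到 1e5 GPU 小时:", sigmoid(np.log10(1e5), *popt))
```

按论文的观察,不同配方主要改变 k、c0 这类效率参数,而渐近上限 A 则取决于配方本身是否足够好。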
点此查看论文截图
Beyond Static LLM Policies: Imitation-Enhanced Reinforcement Learning for Recommendation
Authors:Yi Zhang, Lili Xie, Ruihong Qiu, Jiajun Liu, Sen Wang
Recommender systems (RecSys) have become critical tools for enhancing user engagement by delivering personalized content across diverse digital platforms. Recent advancements in large language models (LLMs) demonstrate significant potential for improving RecSys, primarily due to their exceptional generalization capabilities and sophisticated contextual understanding, which facilitate the generation of flexible and interpretable recommendations. However, the direct deployment of LLMs as primary recommendation policies presents notable challenges, including persistent latency issues stemming from frequent API calls and inherent model limitations such as hallucinations and biases. To address these issues, this paper proposes a novel offline reinforcement learning (RL) framework that leverages imitation learning from LLM-generated trajectories. Specifically, inverse reinforcement learning is employed to extract robust reward models from LLM demonstrations. This approach negates the need for LLM fine-tuning, thereby substantially reducing computational overhead. Simultaneously, the RL policy is guided by the cumulative rewards derived from these demonstrations, effectively transferring the semantic insights captured by the LLM. Comprehensive experiments conducted on two benchmark datasets validate the effectiveness of the proposed method, demonstrating superior performance when compared against state-of-the-art RL-based and in-context learning baselines. The code can be found at https://github.com/ArronDZhang/IL-Rec.
推荐系统(RecSys)已成为在各类数字平台上通过提供个性化内容来提升用户参与度的重要工具。大型语言模型(LLM)的最新进展显示出改进推荐系统的巨大潜力,这主要得益于其出色的泛化能力和复杂的上下文理解能力,有助于生成灵活且可解释的推荐。然而,直接将LLM部署为主要的推荐策略面临显著挑战,包括频繁API调用带来的持续延迟问题,以及幻觉和偏见等模型固有局限。为了解决这些问题,本文提出了一种新颖的离线强化学习(RL)框架,利用对LLM生成轨迹的模仿学习。具体来说,采用逆强化学习从LLM演示中提取稳健的奖励模型。这种方法无需对LLM进行微调,从而大大降低了计算开销。同时,RL策略由这些演示得到的累积奖励来引导,有效地迁移了LLM捕获的语义洞察。在两个基准数据集上进行的综合实验验证了所提方法的有效性,与最先进的基于RL的方法和上下文学习基线相比表现更优。代码可在https://github.com/ArronDZhang/IL-Rec找到。
论文及项目相关链接
PDF ICDM 2025 Accepted Paper
Summary
大型语言模型(LLM)凭借强大的泛化能力和丰富的上下文理解,在改进推荐系统方面展现出巨大潜力。然而,直接部署LLM作为主要的推荐策略面临延迟问题和模型固有局限等挑战。本文提出一种新的离线强化学习(RL)框架,通过模仿学习从LLM生成的轨迹中学习。具体而言,利用逆强化学习从LLM演示中提取稳健的奖励模型,无需对LLM进行微调,降低了计算开销。实验证明,该方法在基准数据集上表现出优异的性能。
Key Takeaways
- 推荐系统(RecSys)通过提供个性化内容增强了用户参与度。
- 大型语言模型(LLM)的出色泛化能力和上下文理解潜力提升了推荐系统的性能。
- 直接使用LLM作为推荐策略主要面临延迟和模型局限等挑战。
- 新方法利用离线强化学习(RL)框架,通过模仿学习从LLM生成的轨迹中提取奖励模型。
- 逆向强化学习用于提取稳健的奖励模型,无需对LLM进行微调。
- 该方法通过累计奖励指导RL策略,有效转移LLM捕获的语义见解。
- 实验证明,该方法在基准数据集上表现出卓越性能,相较于其他先进的RL和上下文学习基线有优势。
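下面是一个极简示意,把"从LLM示范轨迹中提取奖励模型"这一思路粗糙地简化为:用逻辑回归区分示范轨迹与随机轨迹,并把学到的打分函数当作奖励。这只是对逆强化学习思想的示意性近似,特征设计与数据均为虚构,并非论文(IL-Rec)的实现:

```python
import numpy as np

# 粗糙简化示意:用区分"LLM 示范轨迹 vs 随机轨迹"的打分函数充当奖励模型(特征与数据均为虚构)
rng = np.random.default_rng(0)

def featurize(trajectory):
    """假设的轨迹特征:例如平均物品评分、类别多样性、点击率"""
    return np.array(trajectory, dtype=float)

# 虚构数据:LLM 示范轨迹的特征整体更高
demos = rng.normal(loc=[0.8, 0.6, 0.7], scale=0.1, size=(200, 3))
random_trajs = rng.normal(loc=[0.4, 0.3, 0.3], scale=0.1, size=(200, 3))
X = np.vstack([demos, random_trajs])
y = np.concatenate([np.ones(200), np.zeros(200)])

# 手写逻辑回归(梯度下降),避免依赖特定库的 API
w, b = np.zeros(3), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

def reward(trajectory):
    """学到的奖励:越像 LLM 示范得分越高,可用其累积值指导离线 RL 策略"""
    x = featurize(trajectory)
    return float(x @ w + b)

print(reward([0.9, 0.7, 0.8]), reward([0.2, 0.1, 0.2]))
```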
点此查看论文截图
CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning
Authors:Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, Huajun Chen
While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.
虽然从先进的大型语言模型(LLM)进行思维链(CoT)蒸馏在一般推理任务中已被证明有效,但在科学领域却表现不佳:由于高度的复杂性和专业知识要求,即使是先进模型也常常产生错误或肤浅的推理。直接从这种有缺陷的输出中进行蒸馏会导致训练数据质量低下,并限制较小的学生模型的性能。为了克服这一问题,我们提出了CoT-Evo,一个进化式的CoT蒸馏框架。它首先从多个LLM思考者构建多样化的推理轨迹池,通过自动检索的领域知识加以丰富,并通过新颖性驱动的选择、反思式重组和变异对轨迹进行迭代精炼。精炼过程由一个适应度函数引导,该函数评估答案的正确性、连贯性和知识利用的有效性。由此得到一个为科学推理量身定制的高质量CoT数据集。我们使用这个进化得到的数据集对一个紧凑模型进行微调,该模型在科学推理基准测试上达到了最先进的性能。我们的工作建立了一种从多样化且易出错的LLM中合成高保真科学推理数据的可扩展方法。
论文及项目相关链接
PDF 28 pages, 3 figures
Summary
从先进大型语言模型(LLM)进行思维链(CoT)蒸馏在通用推理任务中表现出色,但在科学领域存在挑战:由于高度复杂性和专业知识需求,即使是先进模型也会产生错误或肤浅的推理。直接蒸馏这些有缺陷的输出会导致低质量的训练数据,并限制小型学生模型的性能。为此,我们提出了CoT-Evo,一种进化式CoT蒸馏框架。它首先构建来自多个LLM思考者的多样化推理轨迹池,用自动检索的领域知识加以丰富,并通过新颖性驱动的选择、反思式重组和变异迭代优化轨迹。优化由适应度函数引导,该函数评估答案的正确性、连贯性和有效的知识利用。这产生了一个适用于科学推理的高质量CoT数据集。我们利用这个进化得到的数据集对紧凑模型进行微调,在科学推理基准测试中实现了最先进的性能。我们的工作建立了一种从多样化和易出错的LLM中合成高保真科学推理数据的可扩展方法。
Key Takeaways
- 链式思维(CoT)蒸馏在通用推理任务中表现良好,但在科学领域面临挑战。
- 直接从有缺陷的模型蒸馏会导致低质量训练数据。
- CoT-Evo框架通过构建多样化的推理轨迹池来解决问题,这些轨迹来源于多个LLM的思考者。
- 自动检索的领域知识丰富了这些轨迹。
- 通过新颖性驱动的选择、反思性重组和突变进行迭代优化轨迹。
- 适应度函数从答案正确性、连贯性和知识利用的有效性三方面评估并引导轨迹优化。
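下面用一个玩具示例把"适应度函数 + 选择 + 重组 + 变异"的进化循环具体化:把"推理轨迹"简化为加数序列,适应度只示意性地综合正确性与精炼程度,并省略了新颖性驱动的选择与知识检索环节,并非论文实现:

```python
import random

# 玩具示意:进化式精炼"推理轨迹"(此处简化为加数序列,目标是和等于 TARGET 且步数精炼)
random.seed(0)
TARGET = 24

def fitness(traj):
    """示意性适应度:正确性为主(和是否等于目标),其次偏好更短、更精炼的轨迹"""
    correctness = 1.0 if sum(traj) == TARGET else -abs(sum(traj) - TARGET) / TARGET
    brevity = -0.05 * len(traj)
    return correctness + brevity

def recombine(a, b):
    cut_a, cut_b = random.randint(1, len(a)), random.randint(1, len(b))
    return a[:cut_a] + b[cut_b:]

def mutate(traj):
    t = traj[:]
    t[random.randrange(len(t))] += random.choice([-2, -1, 1, 2])
    return t

# 初始轨迹池:模拟来自多个"LLM 思考者"的多样化候选
pool = [[random.randint(1, 9) for _ in range(random.randint(3, 6))] for _ in range(12)]
for generation in range(30):
    pool.sort(key=fitness, reverse=True)
    survivors = pool[:6]                       # 选择(此处省略新颖性项,仅按适应度)
    children = [mutate(recombine(random.choice(survivors), random.choice(survivors)))
                for _ in range(6)]             # 反思式重组 + 变异
    pool = survivors + children

best = max(pool, key=fitness)
print(best, sum(best), fitness(best))
```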
点此查看论文截图
Retrieval-in-the-Chain: Bootstrapping Large Language Models for Generative Retrieval
Authors:Yingchen zhang, Ruqing zhang, Jiafeng Guo, Wenjun Peng, Sen Li, Fuyu Lv
Generative retrieval (GR) is an emerging paradigm that leverages large language models (LLMs) to autoregressively generate document identifiers (docids) relevant to a given query. Prior works have focused on leveraging the generative capabilities of LLMs to improve GR, while overlooking that their reasoning capabilities could likewise help. This raises a key question: Can explicit reasoning benefit GR? To investigate, we first conduct a preliminary study where an LLM is prompted to generate free-form chain-of-thought (CoT) reasoning before performing constrained docid decoding. Although this method outperforms standard GR, the generated reasoning tends to be verbose and poorly aligned with the docid space. These limitations motivate the development of a reasoning mechanism better tailored to GR. Therefore, we propose Reason-for-Retrieval (R4R), a reasoning-augmented framework for GR that converts free-form CoT reasoning into a compact, structured format, and iteratively refines the reasoning during the retrieval process. R4R augments an existing GR method by leveraging a reasoning-capable LLM that has been instruction-tuned for GR. At inference time, R4R first uses the LLM to generate an initial structured reasoning; then the same LLM alternates between (i) constrained decoding with the chosen GR method to produce candidate docids and (ii) updating the reasoning based on retrieval results to improve the next round. R4R does not require additional models or training, and instead a single LLM serves as both the reasoning generator and the retriever. Extensive experiments on Natural Questions, MS MARCO, and a real-world item-search benchmark validate the effectiveness of R4R.
生成式检索(GR)是一种新兴范式,它利用大型语言模型(LLM)来自动生成与给定查询相关的文档标识符(docids)。早期的研究工作主要集中在利用LLM的生成能力来提高GR的性能,却忽视了它们的推理能力同样可以帮助这一点。这引发了一个关键问题:明确的推理能否有益于GR?为了探究这个问题,我们首先进行了一项初步研究,提示LLM在执行约束docid解码之前生成自由形式的思维链(CoT)推理。尽管这种方法优于标准的GR,但生成的推理往往过于冗长,且与docid空间不太对齐。这些局限性促使我们开发一种更适合GR的推理机制。因此,我们提出了Reason-for-Retrieval(R4R),这是一种用于GR的推理增强框架,它将自由形式的CoT推理转换为紧凑的结构化格式,并在检索过程中迭代地优化推理。R4R通过利用具备GR指令调优的具有推理能力的LLM来增强现有的GR方法。在推理阶段,R4R首先使用LLM生成初始的结构化推理;然后,相同的LLM交替执行(i)使用所选的GR方法进行约束解码以产生候选docids,(ii)基于检索结果更新推理以改进下一轮次的检索。R4R不需要额外的模型或训练,而是使用单个LLM同时作为推理生成器和检索器。在自然问题、MS MARCO和现实世界中的商品搜索基准测试上的大量实验验证了R4R的有效性。
论文及项目相关链接
Summary
本文介绍了生成式检索(GR)的新范式,该范式利用大型语言模型(LLM)的生成能力来生成与给定查询相关的文档标识符(docid)。尽管之前的研究已经发现LLM的生成能力可以改善GR,但它们往往忽略了LLM的推理能力也可能对此有所帮助。因此,本文旨在探讨显式推理是否有助于GR。为了研究这个问题,作者首先进行了一项初步研究,发现虽然通过提示LLM进行自由形式的思维链(CoT)推理再进行约束docid解码的方法优于标准GR,但生成的推理往往过于冗长且难以与docid空间对齐。因此,作者提出了针对GR的量身定制的推理机制——Reason-for-Retrieval(R4R)。R4R将自由形式的CoT推理转换为紧凑的结构化格式,并在检索过程中迭代地优化推理。R4R通过在现有的GR方法上增加具备GR指令调优能力的推理型LLM来实现增强效果。在推理过程中,R4R首先利用LLM生成初始结构化推理,然后在选择的GR方法的约束解码与基于检索结果的推理更新之间交替进行,以产生候选docids。实验结果表明,R4R在Natural Questions、MS MARCO以及真实世界的物品搜索基准测试上均有效。
Key Takeaways
- 生成式检索(GR)利用大型语言模型(LLM)生成与查询相关的文档标识符(docid)。
- 此前的研究主要关注LLM的生成能力对GR的改进,但忽视了其推理能力可能带来的益处。
- 初步研究表明,结合LLM的自由形式思维链(CoT)推理和约束docid解码的方法虽优于标准GR,但存在推理过于冗长和与docid空间不对齐的问题。
- 提出Reason-for-Retrieval(R4R)框架,将自由形式的CoT推理转化为紧凑的结构化格式,并在检索过程中迭代优化推理。
- R4R通过增加具备GR指令调优能力的推理型LLM来增强现有GR方法。
- R4R在多个基准测试中表现出有效性,包括Natural Questions、MS MARCO以及真实世界的物品搜索。
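下面用占位函数勾勒 R4R"推理生成、约束解码、基于检索结果更新推理"交替进行的控制流。其中 generate_reasoning / constrained_decode / update_reasoning 均为假设的接口,这里用简单的关键词匹配代替真实的 LLM 与生成式检索解码器:

```python
# 控制流示意:同一个"LLM"交替完成结构化推理生成、约束解码与基于检索结果的推理更新
# (以下函数均为假设接口,用关键词匹配代替真实的 LLM 与 GR 解码)
DOC_STORE = {
    "doc_1": "red running shoes for men",
    "doc_2": "wireless noise cancelling headphones",
    "doc_3": "trail running shoes waterproof",
}

def generate_reasoning(query):
    """初始的结构化推理:此处简化为关键词列表"""
    return {"keywords": query.lower().split(), "exclude": []}

def constrained_decode(reasoning, k=2):
    """约束解码:只能产生合法 docid(用打分 + 排序模拟)"""
    def score(text):
        return sum(kw in text for kw in reasoning["keywords"]) \
             - sum(kw in text for kw in reasoning["exclude"])
    return sorted(DOC_STORE, key=lambda d: score(DOC_STORE[d]), reverse=True)[:k]

def update_reasoning(reasoning, docids, query):
    """根据本轮检索结果修正推理:把与查询无关的词加入排除项(仅为示意)"""
    for d in docids:
        for w in DOC_STORE[d].split():
            if w not in query.lower():
                reasoning["exclude"].append(w)
    return reasoning

query = "running shoes"
reasoning = generate_reasoning(query)
for round_i in range(2):
    docids = constrained_decode(reasoning)
    print(f"round {round_i}: {docids}")
    reasoning = update_reasoning(reasoning, docids, query)
```

重点在于单个模型在"推理"与"检索"两个角色之间迭代,不需要额外的模型或训练,这与摘要的描述一致;更新规则本身是本示意的假设。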
点此查看论文截图
A$^2$FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning
Authors:Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Xin Gui, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A$^2$FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A$^2$FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
大型语言模型主要分为两个家族:以推理为中心的LLM,它们强化了内部思维链推理,但不能调用外部工具;以及代理型LLM,它们学习与环境交互并利用工具,但在深度推理上往往落后。这种分歧源于根本不同的训练目标,导致两者优势不匹配,并且在简单查询上效率低下:两个家族都倾向于过度思考或过度调用工具。在这项工作中,我们提出了自适应代理基础模型(A$^2$FM),这是一个遵循"先路由后对齐"原则的统一框架:模型首先学习任务感知的路由,然后在共享主干下对齐各模式的轨迹。为了解决效率差距,我们引入了第三种模式,即"即时"(instant)模式,直接处理简单查询,避免不必要的推理或工具调用,同时与代理模式和推理模式互补。为了同时提高准确性和效率,我们提出了自适应策略优化(APO),它在各模式间执行自适应采样,并应用成本正则化的奖励。在32B规模上,A$^2$FM在BrowseComp上达到13.4%,在AIME25上达到70.4%,在HLE上达到16.7%,在同类模型中创下新的SOTA,并在代理、推理和通用基准测试中与前沿LLM相竞争。值得注意的是,自适应执行使每个正确答案的通过成本仅为0.00487美元,相对于推理模式降低了45.2%,相对于代理模式降低了33.5%,从而在保持相当准确性的同时实现了显著更高的成本效益。
论文及项目相关链接
PDF 9 pages, 5 figures, submitted to ICLR 2026
Summary
大型语言模型分为两个家族:注重内部推理过程的推理中心型LLM和能够与环境互动并利用工具但深度推理能力较弱的代理型LLM。两者源于不同的训练目标,在处理简单查询时往往存在能力不匹配、效率不高的问题。本文提出了自适应代理基金会模型(A²FM),采用“先路由后对齐”的原则,模型首先学习任务感知路由,然后在共享主干下对齐特定模式轨迹。为解决效率差距问题,引入第三种即时模式,直接处理简单查询,避免不必要的推理或工具调用,同时补充代理和推理模式。通过自适应策略优化(APO),对三种模式进行自适应采样并应用成本正则化奖励,以提高准确性和效率。在32B规模上,A²FM在BrowseComp上达到13.4%,在AIME25上达到70.4%,在HLE上达到16.7%,在同类模型中表现最佳,并在前沿LLM的基准测试中表现良好。特别是自适应执行只需每正确回答一个问题花费0.00487的成本,相较于推理模式和代理模式分别降低了45.2%和33.5%的成本,实现了高成本效益与良好准确性的平衡。
Key Takeaways
- 大型语言模型分为推理中心型LLM和代理型LLM两个家族,各有优缺点。
- 两者处理简单查询时存在能力不匹配和效率问题。
- 提出自适应代理基金会模型(A²FM),采用“先路由后对齐”原则,以提高效率和准确性。
- 引入第三种即时模式来处理简单查询,避免不必要的推理或工具调用。
- 通过自适应策略优化(APO)实现三种模式的自适应采样和奖励。
- A²FM在多个基准测试中表现优异,包括BrowseComp、AIME25和HLE等。
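下面给出"成本正则化奖励"的一个极简写法,并示意三种模式(即时/推理/代理)如何按该奖励比较。其中各模式的答对概率、单次成本与正则系数均为虚构假设,路由规则也只是示意:

```python
# 极简示意:成本正则化奖励 r = 正确性 - λ * 成本,用于在即时/推理/代理三种模式间权衡
# (各模式的正确率与单次成本均为虚构数值,λ 为假设的正则系数)
LAMBDA = 20.0  # 成本惩罚系数(假设值)

MODES = {
    # 模式: (示意的答对概率, 示意的单次成本/美元)
    "instant":   (0.60, 0.0005),
    "reasoning": (0.80, 0.0089),
    "agentic":   (0.85, 0.0073),
}

def cost_regularized_reward(correct_prob, cost, lam=LAMBDA):
    return correct_prob - lam * cost

def route(query_difficulty):
    """示意性的任务感知路由:简单问题走 instant,其余按成本正则化奖励择优"""
    if query_difficulty < 0.3:
        return "instant"
    candidates = {m: cost_regularized_reward(p, c) for m, (p, c) in MODES.items()}
    return max(candidates, key=candidates.get)

for difficulty in (0.1, 0.5, 0.9):
    print(difficulty, route(difficulty))
```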
点此查看论文截图
BanglaMATH : A Bangla benchmark dataset for testing LLM mathematical reasoning at grades 6, 7, and 8
Authors:Tabia Tanzin Prama, Christopher M. Danforth, Peter Sheridan Dodds
Large Language Models (LLMs) have tremendous potential to play a key role in supporting mathematical reasoning, with growing use in education and AI research. However, most existing benchmarks are limited to English, creating a significant gap for low-resource languages. For example, Bangla is spoken by nearly 250 million people who would collectively benefit from LLMs capable of native fluency. To address this, we present BanglaMATH, a dataset of 1.7k Bangla math word problems across topics such as Arithmetic, Algebra, Geometry, and Logical Reasoning, sourced from Bangla elementary school workbooks and annotated with details like grade level and number of reasoning steps. We have designed BanglaMATH to evaluate the mathematical capabilities of both commercial and open-source LLMs in Bangla, and we find that Gemini 2.5 Flash and DeepSeek V3 are the only models to achieve strong performance, with ≥ 80% accuracy across three elementary school grades. Furthermore, we assess the robustness and language bias of these top-performing LLMs by augmenting the original problems with distracting information, and translating the problems into English. We show that both LLMs fail to maintain robustness and exhibit significant performance bias in Bangla. Our study underlines current limitations of LLMs in handling arithmetic and mathematical reasoning in low-resource languages, and highlights the need for further research on multilingual and equitable mathematical understanding. Dataset link: https://github.com/TabiaTanzin/BanglaMATH-A-Bangla-benchmark-dataset-for-testing-LLM-mathematical-reasoning-at-grades-6-7-and-8.git
大型语言模型(LLM)在支持数学推理方面具有巨大的潜力,并在教育和人工智能研究方面得到越来越广泛的应用。然而,大多数现有基准测试仅限于英语,这为低资源语言创造了重大差距。例如,孟加拉语有将近2.5亿人使用,这些人群将共同受益于能够熟练使用孟加拉语的LLM。为了解决这一问题,我们推出了BanglaMATH数据集,其中包含1700个涵盖算术、代数、几何和逻辑推理等主题的孟加拉数学文字问题,这些问题来源于孟加拉小学课本,并详细标注了年级和推理步骤等信息。我们设计BanglaMATH是为了评估商业和开源LLM在孟加拉语的数学能力,我们发现Gemini 2.5 Flash和DeepSeek V3是表现最好的模型,在三个小学年级中准确率≥80%。此外,我们通过添加干扰信息和将问题翻译成英语来评估这些表现最佳的LLM的鲁棒性和语言偏见。我们发现这两个LLM都无法维持其鲁棒性,在孟加拉语表现出显著的性能偏见。我们的研究强调了LLM在处理低资源语言的算术和数学推理方面的当前局限性,并突显了对多元和平等的数学理解进行进一步研究的必要性。数据集链接:[点击此处查看数据集](https://github.com/TabiaTanzin/BanglaMATH-A-Bangla-benchmark-dataset-for-testing-LLM-mathematical-reasoning-at-grades-6-7-and-8.git)
论文及项目相关链接
Summary
本文介绍了大型语言模型(LLM)在数学推理方面的潜力及其在教育和AI研究中的应用。然而,现有的大多数基准测试仅限于英语,对于低资源语言存在显著差距。为此,作者提出了BanglaMATH数据集,包含1700道(1.7k)孟加拉语数学应用题,涵盖算术、代数、几何和逻辑推理等多个主题,并标注了年级和推理步数等信息。该数据集旨在评估商业和开源LLM在孟加拉语中的数学能力,结果发现仅有Gemini 2.5 Flash和DeepSeek V3在三个小学年级上达到≥80%的准确率。进一步的鲁棒性与语言偏见测试表明,这两个模型在加入干扰信息或将题目翻译成英语后均无法保持稳健,并在孟加拉语上表现出显著的性能偏差。研究强调了LLM在处理低资源语言的算术和数学推理方面的局限性,以及开展多语言、公平的数学理解研究的必要性。
Key Takeaways
- 大型语言模型在数学推理方面具有巨大潜力,对教育领域具有广泛应用价值。
- 当前基准测试主要局限于英语,缺乏针对低资源语言的测试标准。
- BanglaMATH数据集为孟加拉语环境下的数学能力评估提供了重要资源。
- Gemini 2.5 Flash和DeepSeek V3在BanglaMATH中表现优异,但存在鲁棒性和语言偏见问题。
- LLM在处理低资源语言的算术和数学推理方面存在局限性。
- 需要进一步开展多语种研究,提高LLM的数学理解能力。
点此查看论文截图
Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics
Authors:Marco Del Tredici, Jacob McCarran, Benjamin Breen, Javier Aspuru Mijares, Weichen Winston Yin, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund
We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover’s assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.
我们推出了Ax-Prover,这是一个用于Lean自动定理证明的多智能体系统,可以解决不同科学领域的各种问题,并可以自主运行或与人类专家协作。为实现这一目标,Ax-Prover通过形式化证明生成来解决科学问题,这一过程既需要创造性的推理又需要严格的句法严谨性。Ax-Prover通过为大型语言模型(LLM)提供知识和推理能力来应对这一挑战,并通过模型上下文协议(MCP)与Lean工具相结合,确保形式正确性。为了评估其作为自主证明器的性能,我们在两个公共数学基准测试以及我们在抽象代数和量子理论领域引入的两个Lean基准测试上,将我们的方法与前沿的LLM和专用证明器模型进行了比较。在公共数据集上,Ax-Prover与最先进的证明器具有竞争力,而在新基准测试上则大大优于它们。这表明,与那些难以推广的专用系统不同,我们基于工具的智能定理证明器方法为跨不同科学领域的正式验证提供了一种可推广的方法。此外,我们还展示了Ax-Prover在实际用例中的助理能力,展示了它如何帮助一位专家数学家形式化一个复杂的密码学定理的证明。
论文及项目相关链接
Summary
Ax-Prover是一个用于Lean自动化定理证明的多智能体系统,可解决不同科学领域的各种问题,并可与人类专家自主或协作工作。它通过形式化证明生成来应对科学问题的求解挑战,这一流程要求具有创造性和严格精确的句法严谨性。Ax-Prover配备了大型语言模型(LLM),借助模型上下文协议(MCP)与Lean工具实现了形式化正确性的保障。对自主证明者性能的评估显示,在公开数据集上Ax-Prover与最新前沿的证明器性能相当,但在我们引入的抽象代数和量子理论的新基准测试中表现优异。此外,Ax-Prover还展示了其作为助手的实用能力,证明了它如何帮助数学家形式化复杂密码定理的证明。简而言之,Ax-Prover不仅提供了通用的定理证明方法,也通过与人类合作展现出强大的助力和效率提升。
Key Takeaways
- Ax-Prover是一个多智能体系统,用于自动化定理证明,适用于多个科学领域。
- 它结合了大型语言模型和Lean工具,通过形式化证明生成来解决问题。
- Ax-Prover能够自主工作,也能与人类专家协作。
- 在公共数学基准测试中,Ax-Prover表现良好,并且在特定领域的新基准测试中显著优于其他证明器。
- Ax-Prover展示了其在实际应用中的有效性,如帮助数学家形式化复杂密码定理的证明。
- Ax-Prover提供了一个通用的方法来解决形式化验证问题,这对于跨不同科学领域的推广具有重要意义。
点此查看论文截图
Hierarchical Alignment: Surgical Fine-Tuning via Functional Layer Specialization in Large Language Models
Authors:Yukun Zhang, Qi Dong
Existing alignment techniques for Large Language Models (LLMs), such as Direct Preference Optimization (DPO), typically treat the model as a monolithic entity, applying uniform optimization pressure across all layers. This approach overlooks the functional specialization within the Transformer architecture, where different layers are known to handle distinct tasks from syntax to abstract reasoning. In this paper, we challenge this one-size-fits-all paradigm by introducing Hierarchical Alignment, a novel method that applies targeted DPO to distinct functional blocks of a model’s layers: local (syntax), intermediate (logic), and global (factuality). Through a series of controlled experiments on state-of-the-art models like Llama-3.1-8B and Qwen1.5-7B using LoRA for surgical fine-tuning, our results, evaluated by a powerful LLM-as-Judge, demonstrate significant and predictable improvements. Specifically, aligning the local layers (Local-Align) enhances grammatical fluency. More importantly, aligning the global layers (Global-Align) not only improves factual consistency as hypothesized but also proves to be the most effective strategy for enhancing logical coherence, outperforming all baselines. Critically, all hierarchical strategies successfully avoid the “alignment tax” observed in standard DPO, where gains in fluency come at the cost of degraded logical reasoning. These findings establish a more resource-efficient, controllable, and interpretable path for model alignment, highlighting the immense potential of shifting from monolithic optimization to structure-aware surgical fine-tuning to build more advanced and reliable LLMs.
现有的针对大型语言模型(LLM)的对齐技术,如直接偏好优化(DPO),通常将模型视为一个单一实体,在所有层上应用统一的优化压力。这种方法忽略了Transformer架构内的功能专业化,不同层负责从语法到抽象推理的不同任务。在本文中,我们通过引入分层对齐,挑战了这种一刀切的模式。这是一种新方法,对模型的层的不同功能块进行有针对性的DPO:本地(语法)、中间(逻辑)和全局(真实性)。我们通过在一系列先进模型上进行控制实验,如使用LoRA进行精细调整的Llama-3.1-8B和Qwen1.5-7B,由强大的LLM-as-Judge进行评估,结果证明了显著且可预测的进步。具体来说,对齐本地层(Local-Align)提高了语法流畅性。更重要的是,对齐全局层(Global-Align)不仅改善了事实一致性,而且被证明是提高逻辑连贯性的最有效策略,超过了所有基线。关键的是,所有分层策略都成功避免了标准DPO中观察到的“对齐税”,即在流畅性提高的同时付出了逻辑推理能力下降的代价。这些发现为模型对齐建立了一条更高效、可控、可解释的道路,突出了从单一优化转向结构感知精细调整的巨大潜力,以构建更先进、更可靠的大型语言模型。
论文及项目相关链接
Summary
本文提出一种针对大型语言模型(LLM)的分层对齐方法,通过定向优化不同功能块(如语法、逻辑和事实性)来提高模型性能。实验结果表明,该方法在改善语法流畅性、提高事实一致性和逻辑连贯性方面效果显著,同时避免了传统DPO方法的“对齐税”问题。
Key Takeaways
- 现有大型语言模型(LLM)对齐技术如直接偏好优化(DPO)通常将模型视为一个整体进行优化,忽略了Transformer架构中的功能专业化。
- 本文提出了一种新的分层对齐方法,针对模型的不同层(局部、中间和全局)进行定向优化。
- 分层对齐方法显著提高了模型的语法流畅性、事实一致性和逻辑连贯性。
- 分层对齐方法避免了传统DPO方法的“对齐税”问题,即提高流畅度时可能导致的逻辑推理能力下降。
- 实验结果表明,分层对齐方法在改进模型性能方面效果显著,尤其是在提高全局层对齐方面,这为更先进、更可靠的大型语言模型的构建提供了更可控和可解释的路径。
- 通过使用LoRA进行有针对性的微调,实验验证了分层对齐方法的有效性和优势。
点此查看论文截图
Are Large Reasoning Models Interruptible?
Authors:Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Large Reasoning Models (LRMs) excel at complex reasoning but are traditionally evaluated in static, “frozen world” settings: model responses are assumed to be instantaneous, and the context of a request is presumed to be immutable over the duration of the response. While generally true for short-term tasks, the “frozen world” assumption breaks down in modern reasoning tasks such as assistive programming, where models may take hours to think through problems and code may change dramatically from the time the model starts thinking to the model’s final output. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the quality of the model’s partial outputs on a limited budget, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades while incorporating updated information. Project Page: http://dynamic-lm.github.io/
大型推理模型(LRM)在复杂推理方面表现出色,但传统上是在静态的"冻结世界"设定中进行评估的:模型响应被假定为瞬时完成,并且请求的上下文在响应期间被假定为不可变。虽然这对短期任务大体成立,但在诸如辅助编程等现代推理任务中,"冻结世界"假设不再成立:模型可能需要数小时来思考问题,而代码从模型开始思考到给出最终输出之间可能发生巨大变化。在这项工作中,我们挑战"冻结世界"假设,并在两种现实的动态场景下评估LRM的稳健性:中断,用于检验模型在有限预算下部分输出的质量;动态上下文,用于检验模型对过程中变化的适应能力。在需要长篇推理的数学和编程基准测试中,静态评估持续高估了稳健性:即使是在静态设置中达到高准确率的最先进LRM,在被中断或面对变化的上下文时也可能出现不可预测的失败,当更新在推理过程后期引入时,性能下降可达60%。我们的分析还揭示了几种新的失败模式,包括推理泄漏,即模型在被中断时把推理内容折叠进最终答案;恐慌,即在时间压力下模型完全放弃推理并返回错误答案;以及自我怀疑,即在纳入更新信息时性能下降。项目页面:http://dynamic-lm.github.io/。
论文及项目相关链接
PDF Project Page: http://dynamic-lm.github.io
Summary
大型推理模型(LRMs)擅长处理复杂的推理任务,但传统的评估方式常常是在静态的“冻结世界”设定下进行。模型响应被假设为瞬时完成,请求的背景在响应过程中保持不变。虽然这一假设在短期任务中大致适用,但在现代推理任务如辅助编程中,模型可能需要数小时来思考问题,代码从模型开始思考到最终输出的过程中可能会发生显著变化。本文挑战了“冻结世界”假设,并在两个现实动态场景下评估了LRM的稳健性:中断和动态上下文。在数学和编程基准测试中的长形式推理任务显示,静态评估往往会高估模型的稳健性:即使在静态环境中表现高超的LRMs,在面对中断或变化上下文时,性能可能会意外下降,在推理过程中后期引入更新时,性能可能会下降高达60%。
Key Takeaways
- 大型推理模型(LRMs)在复杂推理任务上表现出色,但传统评估方法主要基于静态环境假设。
- “冻结世界”假设在现代推理任务如辅助编程中可能不成立。
- 模型在思考过程中可能需要长时间,且代码可能在模型输出前发生显著变化。
- 在中断和动态上下文两个现实场景下评估了LRM的稳健性。
- 静态评估可能高估模型的稳健性,LRMs在面临中断或变化的上下文时性能可能会大幅下降。
- 模型在面对中断时可能出现推理泄露、恐慌和自我怀疑等新型失败模式。
点此查看论文截图
Demystifying Reinforcement Learning in Agentic Reasoning
Authors:Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang
Recently, the emergence of agentic RL has showcased that RL could also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques are crucial for agentic RL, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy could improve the training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models, and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks, including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models could also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: https://github.com/Gen-Verse/Open-AgentRL
最近,代理式强化学习(agentic RL)的出现表明,RL也能有效提升大型语言模型(LLM)的代理式推理能力,但其中的关键设计原则和最佳实践仍不清楚。在这项工作中,我们从数据、算法和推理模式这三个关键角度,对代理式推理中的强化学习进行了全面而系统的研究。我们的关键见解包括:(i)用真实的端到端工具使用轨迹替换拼接的合成轨迹,能带来更强的SFT初始化;高多样性、面向模型的数据集能够维持探索并显著提高RL性能。(ii)探索友好的技术对代理式RL至关重要,例如clip higher、超长奖励塑形(overlong reward shaping)以及保持足够的策略熵,都能提高训练效率。(iii)工具调用次数较少的审慎策略优于频繁的工具调用或冗长的自我推理,能够提高工具效率和最终准确率。总的来说,这些简单的做法持续提升了代理式推理能力和训练效率,使较小的模型也能在具有挑战性的基准测试上取得强劲结果,为未来的代理式RL研究建立了实用基线。除了这些经验见解之外,我们还贡献了一个高质量的真实端到端代理式SFT数据集和一个高质量的RL数据集,并在包括AIME2024/AIME2025、GPQA-Diamond和LiveCodeBench-v6在内的四个具有挑战性的基准测试上证明了这些见解对提升LLM代理式推理能力的有效性。借助我们的方法,4B规模的模型在代理式推理性能上也能超越32B规模的模型。代码和模型:https://github.com/Gen-Verse/Open-AgentRL
论文及项目相关链接
PDF Code and models: https://github.com/Gen-Verse/Open-AgentRL
Summary
本文探讨了强化学习在智能体推理中的应用,从数据、算法和推理模式三个角度进行了全面系统的研究。研究发现,使用真实端到端的工具使用轨迹替换合成轨迹,能提高模型的初始性能;探索友好的技术和较少的工具调用策略能提高训练效率和最终精度。这些实践方法能增强智能体推理能力,并在具有挑战性的基准测试中取得良好效果。
Key Takeaways
- 使用真实端到端的工具使用轨迹替换合成轨迹,能有效提高模型性能。
- 探索友好的技术,如clip higher、超长奖励塑形(overlong reward shaping)和保持适当的策略熵,能提高训练效率。
- 采用较少的工具调用策略,能提高工具效率和最终精度。
- 本文提供了一份高质量的、真实的端到端智能体SFT数据集和一份高质量的RL数据集。
- 这些实践方法能在具有挑战性的基准测试上提高智能体推理能力,包括AIME2024/AIME2025、GPQA-Diamond和LiveCodeBench-v6等。
- 即使是较小的模型,也能通过遵循这些实践方法实现出色的智能体推理性能。
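文中提到的"clip higher"这类探索友好的技巧,可以用非对称裁剪的 PPO/GRPO 式目标来说明;下面是一个用 numpy 写的最小示意,裁剪上下界的具体数值为假设:

```python
import numpy as np

# 最小示意:非对称裁剪("clip higher")的策略梯度目标
# 上界 1 + eps_high 放得更宽,给优势为正的 token 更多提升空间,有利于保持探索
def clipped_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratio * advantage, clipped * advantage)

ratio = np.array([0.7, 1.0, 1.25, 1.4])       # 新旧策略概率比(示例值)
advantage = np.array([1.0, 1.0, 1.0, -1.0])   # 每个 token 的优势(示例值)
print(clipped_objective(ratio, advantage))
# 与对称裁剪 eps=0.2 相比,ratio=1.25 的正优势样本不再被裁掉
print(clipped_objective(ratio, advantage, eps_high=0.2))
```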
点此查看论文截图
Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation
Authors:Siheng Xiong, Ali Payani, Faramarz Fekri
Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.
推理时间缩放通过扩展语言模型的思维链(CoT)提升其推理能力。然而,现有方法通常在一次前向传递中生成整个推理链,这常常导致思维链脱轨,即由于累积错误导致推理轨迹偏离正确方向。对于具有较长思维链的小型语言模型,这个问题尤为严重,因为它们的能力有限。为了解决这个问题,我们分析原始的长思维链并揭示了一个由规划和执行步骤组成的推理层次结构。我们的分析表明,大多数推理错误源于规划不正确。受此观察的启发,我们提出了多路径计划聚合(MPPA)框架,它通过计划探索和聚合来增强单路径推理。遵循基于令牌位置的可变间隔调度,MPPA生成多个候选计划并将其聚合到精细的规划步骤中。为了保持效率,我们采用了最小设计,其中基础语言模型充当主要策略,而轻量级的LoRA模块实现计划聚合策略。我们进一步观察到,对于超过4K令牌的长轨迹,结果奖励强化学习是低效的。为了克服这一点,我们引入了在线Step-DPO,这是一种过程级偏好优化方案,它利用扭曲序贯蒙特卡洛(TSMC)提供可伸缩的逐步监督,并使用小型语言模型。这带来了更有效的训练、更高的稳定性和准确性。在具有挑战性的数学、科学和逻辑推理基准测试上的广泛实验表明,仅使用10%的SFT数据和5%的偏好对,我们的方法在多个基础模型和任务上超越了DeepSeek-R1蒸馏基准测试和结果奖励强化学习基准测试。
论文及项目相关链接
Summary
本文提出了一种名为Multi-Path Plan Aggregation(MPPA)的框架,用于增强语言模型的推理能力。该框架通过计划探索和聚合来改进传统的单一推理过程,解决了长推理链中常见的轨迹偏离问题。通过引入在线Step-DPO和Twisted Sequential Monte Carlo(TSMC),实现了高效、稳定的训练,提高了推理准确性。实验结果表明,该方法在多个基准模型和任务上均优于DeepSeek-R1蒸馏基线和方法结果强化学习基线。
Key Takeaways
- 推理时间缩放可通过扩展语言模型的思维链增强推理能力,但现有方法通常在单次前向传递中生成整个推理链,导致推理轨迹偏离。
- 提出了Multi-Path Plan Aggregation(MPPA)框架,通过计划探索和聚合改进了传统的单一推理过程。
- MPPA框架采用基于令牌位置的变量间隔调度生成多个候选计划,并聚集它们以形成精炼的规划步骤。
- 引入在线Step-DPO和Twisted Sequential Monte Carlo(TSMC)来实现高效、稳定的训练和提高推理准确性。
点此查看论文截图
Vision-LLMs for Spatiotemporal Traffic Forecasting
Authors:Ning Yang, Hengyu Zhong, Haijun Zhang, Randall Berry
Accurate spatiotemporal traffic forecasting is a critical prerequisite for proactive resource management in dense urban mobile networks. While Large Language Models (LLMs) have shown promise in time series analysis, they inherently struggle to model the complex spatial dependencies of grid-based traffic data. Effectively extending LLMs to this domain is challenging, as representing the vast amount of information from dense geographical grids can be inefficient and overwhelm the model’s context. To address these challenges, we propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. Our approach leverages a Vision-LLM visual encoder to process historical global traffic matrices as image sequences, providing the model with a comprehensive global view to inform cell-level predictions. To overcome the inefficiency of LLMs in handling numerical data, we introduce an efficient encoding scheme that represents floating-point values as single tokens via a specialized vocabulary, coupled with a two-stage numerical alignment fine-tuning process. The model is first trained with Supervised Fine-Tuning (SFT) and then further optimized for predictive accuracy using Group Relative Policy Optimization (GRPO), a memory-efficient reinforcement learning method. Evaluations on real-world mobile traffic datasets demonstrate that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain few-shot scenarios. Our extensive experiments validate the model’s strong generalization capabilities across various data-scarce environments.
精确的时空交通预测是密集城市移动网络中进行主动资源管理的关键前提。虽然大型语言模型(LLM)在时间序列分析方面显示出潜力,但它们天然难以对基于网格的交通数据中复杂的空间依赖关系进行建模。将LLM有效扩展到这一领域具有挑战性,因为表示来自密集地理网格的大量信息可能效率低下,并会超出模型的上下文容量。为了解决这些挑战,我们提出了ST-Vision-LLM这一新型框架,它将时空预测重新表述为一个视觉-语言融合问题。我们的方法利用Vision-LLM视觉编码器把历史全局交通矩阵当作图像序列处理,为模型提供全面的全局视图,以支持小区级别的预测。为了解决LLM处理数值数据效率低下的问题,我们引入了一种高效的编码方案,通过专门的词表把浮点数值表示为单个token,并配合两阶段的数值对齐微调流程。模型首先通过监督微调(SFT)进行训练,然后采用组相对策略优化(GRPO)这一内存高效的强化学习方法进一步优化预测精度。在真实移动交通数据集上的评估表明,ST-Vision-LLM的长期预测精度比现有方法高出15.6%,在跨域小样本场景中超过第二名基线30.04%以上。大量实验验证了该模型在各种数据稀缺环境中的强大泛化能力。
论文及项目相关链接
Summary
基于密集城市移动网络中主动资源管理对精准时空交通预测的需求,本研究针对大型语言模型(LLM)难以建模网格交通数据复杂空间依赖关系的问题,提出了ST-Vision-LLM框架。该研究将时空预测重新表述为视觉-语言融合问题,利用Vision-LLM视觉编码器把历史全局交通矩阵作为图像序列处理,为模型提供全局视角以支持小区级别的预测。为解决LLM处理数值数据时的低效问题,研究引入了高效的编码方案,通过专门的词表将浮点数值表示为单个token,并结合两阶段数值对齐微调过程。模型首先通过监督微调(SFT)进行训练,然后采用组相对策略优化(GRPO)这一内存高效的强化学习方法进一步优化预测精度。在真实移动交通数据集上的评估表明,ST-Vision-LLM的长期预测精度比现有方法高出15.6%,在跨域少样本场景中超出第二名基线30.04%以上。广泛实验验证了该模型在不同数据稀缺环境下的强大泛化能力。
Key Takeaways
- 密集城市移动网络的时空交通预测对于资源主动管理至关重要。
- 大型语言模型(LLM)在建模网格交通数据的复杂空间依赖关系方面存在挑战。
- ST-Vision-LLM框架将时空预测重新定位为视觉语言融合问题。
- 利用Vision-LLM视觉编码器处理历史全球交通矩阵作为图像序列。
- 引入高效编码方案,将浮点数表示为单个令牌,解决LLM处理数值数据的低效问题。
- 模型通过监督微调(SFT)和组相对策略优化(GRPO)进行训练和优化。
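摘要提到"通过专门词表把浮点数值表示为单个token";下面给出该思路的一个假设性实现:把数值量化到固定步长后映射进一个离散词表(量化粒度、上限与词表形式均为猜测,仅作说明):

```python
# 假设性示意:把流量矩阵中的浮点值量化后映射为单个专用 token(词表设计为猜测)
STEP = 0.1          # 量化步长(假设)
MAX_VALUE = 50.0    # 假设的流量上限,超出则截断

# 专用词表:<v_0.0>, <v_0.1>, ..., <v_50.0>
VOCAB = [f"<v_{i * STEP:.1f}>" for i in range(int(round(MAX_VALUE / STEP)) + 1)]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode_value(x):
    """浮点数 -> 单个 token(而不是多位数字拆成的字符序列)"""
    x = min(max(x, 0.0), MAX_VALUE)
    idx = round(x / STEP)
    return VOCAB[idx]

def decode_token(tok):
    return TOKEN_TO_ID[tok] * STEP

values = [3.14, 0.07, 49.99]
tokens = [encode_value(v) for v in values]
print(tokens)                                  # ['<v_3.1>', '<v_0.1>', '<v_50.0>']
print([round(decode_token(t), 1) for t in tokens])
```

这样每个数值只占一个 token,有助于在有限上下文里放下更长的历史交通矩阵序列。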
点此查看论文截图
Towards Real-Time Fake News Detection under Evidence Scarcity
Authors:Guangyu Wei, Ke Han, Yueming Lyu, Yu Luo, Yue Jiang, Caifeng Shan, Nicu Sebe
Fake news detection becomes particularly challenging in real-time scenarios, where emerging events often lack sufficient supporting evidence. Existing approaches often rely heavily on external evidence and therefore struggle to generalize under evidence scarcity. To address this issue, we propose Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time fake news detection that dynamically adapts its decision-making process according to the assessed sufficiency of available evidence. EASE introduces a sequential evaluation mechanism comprising three independent perspectives: (1) Evidence-based evaluation, which assesses evidence and incorporates it into decision-making only when the evidence is sufficiently supportive; (2) Reasoning-based evaluation, which leverages the world knowledge of large language models (LLMs) and applies them only when their reliability is adequately established; and (3) Sentiment-based fallback, which integrates sentiment cues when neither evidence nor reasoning is reliable. To enhance the accuracy of evaluation processes, EASE employs instruction tuning with pseudo labels to guide each evaluator in justifying its perspective-specific knowledge through interpretable reasoning. Furthermore, the expert modules integrate the evaluators’ justified assessments with the news content to enable evaluation-aware decision-making, thereby enhancing overall detection accuracy. Moreover, we introduce RealTimeNews-25, a new benchmark comprising recent news for evaluating model generalization on emerging news with limited evidence. Extensive experiments demonstrate that EASE not only achieves state-of-the-art performance across multiple benchmarks, but also significantly improves generalization to real-time news. The code and dataset are available: https://github.com/wgyhhhh/EASE.
实时场景中的假新闻检测变得尤为具有挑战性,因为新兴事件往往缺乏足够的支持证据。现有方法通常严重依赖于外部证据,因此在证据稀缺的情况下很难推广。为了解决这个问题,我们提出了“基于评估的专家选择”(EASE),这是一个用于实时假新闻检测的新型框架,它可以根据现有证据的充足性来动态调整其决策过程。EASE引入了一种顺序评估机制,包括三个独立的角度:(1)基于证据的评价,它只在证据充足支持的情况下评估证据并将其纳入决策;(2)基于推理的评价,它利用大型语言模型(LLM)的世界知识,只在可靠性得到充分证明的情况下应用;(3)基于情感的后备,当证据和推理都不可靠时,它整合情感线索。为了提高评估过程的准确性,EASE采用指令微调与伪标签相结合的方法,引导每个评估者通过可解释的理由来证明其特定视角的知识。此外,专家模块将评估者的合理评估与新闻内容相结合,以实现基于评估的决策制定,从而提高总体检测准确性。此外,我们引入了RealTimeNews-25,这是一个新的基准测试,包含最近的新闻,用于评估模型在证据有限的最新新闻上的泛化能力。大量实验表明,EASE不仅在多个基准测试中实现了最佳性能,而且在实时新闻中的泛化能力也得到了显着提高。相关代码和数据集可通过以下网址获取:https://github.com/wgyhhhh/EASE 。
论文及项目相关链接
Summary:
实时假新闻检测面临证据不足的挑战。现有方法依赖外部证据,难以在证据不足的情况下进行泛化。我们提出了一个名为“基于评估的专家选择”(EASE)的新框架,它能根据现有证据的充足性动态调整决策过程。EASE引入了一个包含三个独立角度的序列评估机制:(1)基于证据的评价只在证据充足时进行评估和决策;(2)基于推理的评价利用大型语言模型的全球知识,只在可靠性得到证实时应用;(3)当证据和推理都不可靠时,基于情感的备选方案会整合情感线索。为提高评估过程的准确性,EASE采用指令微调与伪标签来引导每个评估者通过可解释的理由来证明其特定知识的合理性。此外,专家模块整合评估者的合理评估与新闻内容,实现评估感知的决策制定,从而提高整体检测准确性。我们还引入了RealTimeNews-25基准测试,用于评估模型在证据有限的实时新闻上的泛化能力。实验表明,EASE不仅在多个基准测试中达到最佳性能,而且在实时新闻中显著提高泛化能力。
Key Takeaways:
- 实时假新闻检测面临证据不足的挑战。
- EASE框架能动态适应证据充足性进行决策。
- EASE采用基于三个独立角度的序列评估机制:基于证据、基于推理、基于情感的备选方案。
- EASE通过指令微调与伪标签提高评估过程的准确性。
- 专家模块整合评估结果与新闻内容,实现评估感知的决策。
- RealTimeNews-25基准测试用于评估模型在实时新闻上的泛化能力。
- EASE在多个基准测试中表现最佳,并在实时新闻中显著提高泛化能力。
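下面的伪实现只为说明"证据评估、推理评估、情感回退"这一顺序决策结构:各评估器都用简单的占位函数代替,阈值、特征与判定规则均为假设,并非论文的实际评估器:

```python
# 结构示意:按"证据 -> 推理 -> 情感"顺序逐级回退的实时假新闻判别(占位实现,阈值为假设)
def evidence_evaluator(news, evidence_list):
    """若证据足够支持则返回(判定, 置信度),否则返回 None 表示证据不足"""
    if len(evidence_list) < 2:
        return None
    support = sum(1 for e in evidence_list if e["stance"] == "support")
    confidence = support / len(evidence_list)
    if confidence < 0.3 or confidence > 0.7:      # 证据一边倒时才采纳
        return ("real" if confidence > 0.5 else "fake", abs(confidence - 0.5) * 2)
    return None

def reasoning_evaluator(news):
    """模拟 LLM 世界知识推理;可靠性不足时返回 None"""
    reliability = 0.4 if "breaking" in news.lower() else 0.8   # 假设:突发新闻可靠性打折
    if reliability < 0.6:
        return None
    return ("fake" if "miracle cure" in news.lower() else "real", reliability)

def sentiment_fallback(news):
    """兜底:情感越煽动越倾向判为假(极度简化)"""
    exclamations = news.count("!")
    return ("fake" if exclamations >= 3 else "real", 0.5)

def ease_predict(news, evidence_list):
    for evaluator in (lambda: evidence_evaluator(news, evidence_list),
                      lambda: reasoning_evaluator(news),
                      lambda: sentiment_fallback(news)):
        result = evaluator()
        if result is not None:
            return result

print(ease_predict("Breaking!!! Miracle cure found!!!", []))  # 证据不足、推理不可靠 -> 情感兜底
```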
点此查看论文截图
DyKnow-RAG: Dynamic Knowledge Utilization Reinforcement Framework for Noisy Retrieval-Augmented Generation in E-commerce Search Relevance
Authors:Tingqiao Xu, Shaowei Yao, Chenhe Dong, Yiming Jin, Zerui Huang, Dan Ou, Haihong Tang
Accurately modeling query-item relevance drives e-commerce ranking, yet long-tail, knowledge-heavy, and fast-evolving queries exceed parametric LLM coverage. External context (reviews, attribute encyclopedias, UGC) can help but is noisy, and single-pass latency and cost forbid any clean-then-summarize step. The model must, per query, judge relevance and decide whether to use, partially use, or ignore the context. DyKnow-RAG is a dynamic noisy-RAG framework built on Group Relative Policy Optimization. It trains two rollout groups (no external context vs a single retrieved chunk) and applies posterior-driven inter-group advantage scaling that adaptively reweights their contributions by the per-query correctness gap. This teaches when to trust retrieval versus fall back to parametric knowledge, without process labels, value networks, or extra inference passes, preserving single-pass, single-chunk deployment under production latency. Training combines: (1) supervised initialization with a structured rationale that explicitly records the context-usage decision; (2) an RL pool prioritized by SFT uncertainty to focus where context choice is most consequential; and (3) an optional lightweight DPO warm start to stabilize with-context calibration. Under a unified retrieval/index and fixed latency budget, DyKnow-RAG outperforms SFT, DPO, and vanilla GRPO in offline tests, and delivers consistent lifts on GSB, Query Goodrate, and Item Goodrate in Taobao A/B testing. It is deployed in Taobao’s production relevance system, serving live traffic. To our knowledge, it is among the first single-pass RAG solutions for e-commerce relevance, turning noisy external signals into reliable gains without added online complexity.
准确建模查询-商品相关性是电商排序的核心,然而长尾、知识密集且快速演变的查询超出了参数化LLM的知识覆盖范围。外部上下文(评论、属性百科、UGC)可以提供帮助,但噪声很大,而单次推理的时延和成本也不允许任何"先清洗再总结"的步骤。模型必须针对每个查询判断相关性,并决定使用、部分使用还是忽略上下文。DyKnow-RAG是一个基于组相对策略优化(GRPO)构建的动态噪声RAG框架。它训练两个rollout组(不带外部上下文与带单个检索块),并应用后验驱动的组间优势缩放,按逐查询的正确率差距自适应地重新加权两组的贡献。这教会模型何时信任检索、何时回退到参数化知识,而无需过程标签、价值网络或额外的推理次数,从而在生产时延约束下保持单次推理、单块检索的部署形态。训练包括:(1)带结构化理由的监督初始化,显式记录上下文使用决策;(2)按SFT不确定性排序的RL样本池,聚焦于上下文选择最关键的样本;(3)可选的轻量DPO热启动,用于稳定带上下文的校准。在统一的检索/索引和固定时延预算下,DyKnow-RAG在离线测试中优于SFT、DPO和普通GRPO,并在淘宝A/B测试中对GSB、查询好评率(Query Goodrate)和商品好评率(Item Goodrate)带来一致提升。它已部署在淘宝的生产相关性系统中,服务线上流量。据我们所知,这是电商相关性领域最早的单次推理RAG方案之一,能在不增加在线复杂性的情况下把有噪声的外部信号转化为可靠收益。
论文及项目相关链接
Summary
本文介绍了在电子商务排序中准确建模查询-商品相关性的重要性。针对长尾、知识密集、快速变化的查询,提出了DyKnow-RAG动态噪声RAG框架,该框架能够在不增加在线复杂性的情况下,将有噪声的外部信号转化为可靠的收益。DyKnow-RAG基于组相对策略优化,训练了两个rollout组(不带外部上下文与带单个检索块),并按逐查询的正确率差距自适应地重新加权两组的贡献,从而学会何时信任检索结果、何时回退到参数化知识。在统一的检索/索引和固定时延预算下,DyKnow-RAG在离线测试中优于SFT、DPO和普通GRPO,在淘宝的A/B测试中对GSB、查询好评率和商品好评率带来一致提升,并已部署在淘宝的生产相关性系统中,为实时流量提供服务。
Key Takeaways
- 查询项相关性的精确建模对电子商务排名至关重要。
- 针对长尾、知识密集和快速变化的查询,现有的参数化LLM覆盖不足。
- 外部上下文(如评论、属性百科全书、用户生成内容)有助于弥补这一不足,但存在噪声问题。
- DyKnow-RAG是一个动态噪声RAG框架,能够根据逐查询的正确率差距,自适应地重新加权"不带上下文"与"带单个检索块"两组rollout的贡献。
- DyKnow-RAG结合监督初始化、强化学习和可选的轻量DPO热启动进行训练。
- DyKnow-RAG在离线测试、GSB、查询好评率和商品好评率方面表现优异,并已部署在淘宝生产环境中。
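摘要中的"后验驱动的组间优势缩放"可以粗略理解为:按"带上下文组"与"不带上下文组"的逐查询正确率差距,对两组的优势做自适应加权。下面是基于这种理解的数值示意;具体的缩放函数是整理时的猜测,并非论文公式:

```python
import numpy as np

# 数值示意:按两组(带/不带检索上下文)的逐查询正确率差距,对组内优势自适应加权
# 注意:缩放方式为整理者的猜测,并非论文中的公式
def group_advantages(rewards):
    """GRPO 风格的组内优势:对组内奖励做标准化"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def scaled_advantages(rewards_ctx, rewards_noctx):
    adv_ctx = group_advantages(rewards_ctx)
    adv_noctx = group_advantages(rewards_noctx)
    # 后验正确率差距:上下文帮了多少(范围 -1 到 1)
    gap = np.mean(rewards_ctx) - np.mean(rewards_noctx)
    w_ctx = 0.5 + 0.5 * gap        # 上下文有益时加大带上下文组的权重
    w_noctx = 1.0 - w_ctx
    return w_ctx * adv_ctx, w_noctx * adv_noctx

# 某个查询:带上下文的 4 次 rollout 中 3 次判对,不带上下文只有 1 次判对
ctx_rewards, noctx_rewards = [1, 1, 1, 0], [0, 1, 0, 0]
a_ctx, a_noctx = scaled_advantages(ctx_rewards, noctx_rewards)
print(a_ctx.round(3), a_noctx.round(3))
```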
点此查看论文截图
Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph
Authors:Wentao Wang, Heqing Zou, Tianze Luo, Rui Huang, Yutian Zhao, Zhuochen Wang, Hansheng Zhang, Chengwei Qin, Yan Wang, Lin Zhao, Huaijian Zhang
Recent progress in Multimodal Large Language Models (MLLMs) has demonstrated strong semantic understanding capabilities, but struggles to perform precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with Verifiable Reward (RLVR) to improve model abilities, we introduce a reasoning mechanism using graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support the model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench, and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.
多模态大型语言模型(MLLM)的最新进展展示了较强的语义理解能力,但在精确的时空理解上仍然吃力。现有的时空方法主要关注视频本身,而忽略了视频中的物理信息,例如多物体布局和运动。这些局限限制了MLLM在需要高精度的下游应用中的使用,包括具身智能和VR。为了解决这一问题,我们提出了Video-STR,一种基于图的新型强化方法,用于精确的视频时空推理。基于可验证奖励的强化学习(RLVR)提升模型能力的潜力,我们引入了一种使用基于图的组相对策略优化(GRPO)方法的推理机制,引导模型在思考过程中推断场景底层的时空拓扑。为了解决时空训练数据缺乏的问题,我们构建了包含20.5万(205k)个问答对的STV-205k数据集,覆盖室内和室外环境中的动态多物体场景,以支持模型训练。实验表明,Video-STR在各类基准测试上取得了最先进的结果,在STI-Bench上比基础模型高出13%,证明了我们方法和数据集的有效性。代码、模型和数据将会发布。
论文及项目相关链接
Summary
多模态大型语言模型(MLLM)在语义理解方面取得了显著进展,但在精确时空理解方面仍有困难。为解决此问题,我们提出了基于图的强化方法Video-STR,用于精确的视频时空推理。我们引入了基于图的组相对策略优化(GRPO)方法,引导模型在推理过程中推断场景的时空拓扑。为解决时空训练数据不足的问题,我们构建了STV-205k数据集以支持模型训练。实验表明,Video-STR在各种基准测试上取得了最新结果。
Key Takeaways
- MLLMs展现出强大的语义理解能力,但在精确时空理解方面存在挑战。
- Video-STR是一种基于图的强化方法,旨在解决MLLMs在精确时空推理方面的不足。
- Video-STR引入GRPO方法,指导模型在推理过程中推断场景的时空拓扑。
- 为支持模型训练,构建了STV-205k数据集,包含205k问答对,覆盖室内和室外动态多对象场景。
- Video-STR在各种基准测试上取得最新结果,较基础模型在STI-Bench上提高了13%。
- Video-STR方法、模型和数据将公开发布。
点此查看论文截图
Chart-RVR: Reinforcement Learning with Verifiable Rewards for Explainable Chart Reasoning
Authors:Sanchit Sinha, Oana Frunza, Kashif Rasul, Yuriy Nevmyvaka, Aidong Zhang
The capabilities of Large Vision-Language Models (LVLMs) have reached state-of-the-art on many visual reasoning tasks, including chart reasoning, yet they still falter on out-of-distribution (OOD) data, and degrade further when asked to produce their chain-of-thought (CoT) rationales, limiting explainability. We present Chart-RVR, a general framework that fine-tunes LVLMs to be more robust and explainable for chart reasoning by coupling Group Relative Policy Optimization (GRPO) with automatically verifiable rewards. Our framework comprises of three rewards that maximize: (i) correct chart-type classification, (ii) faithful chart table reconstruction, and (iii) process conformity. Applied to 3-billion-parameter LVLMs, Chart-RVR consistently outperforms standard supervised fine-tuning (SFT) on both in-distribution and out-of-distribution datasets, closing the OOD performance gap while improving rationale fidelity. The resulting models, the Chart-RVR-3B series, achieve state-of-the-art results on six chart-reasoning benchmarks spanning in-domain and OOD settings, surpassing all existing models of comparable size. Beyond accuracy, Chart-RVR yields more interpretable CoT rationales, strengthening trust and reliability - showcasing the power of verifiable rewards with GRPO for training reliable, interpretable chart-reasoning models.
大型视觉语言模型(LVLM)的能力在许多视觉推理任务上已经达到最先进水平,包括图表推理,然而它们在分布外(OOD)数据上仍会失手,而在被要求给出思维链(CoT)推理过程时性能进一步下降,限制了可解释性。我们提出了Chart-RVR,这是一个通用框架,通过将组相对策略优化(GRPO)与可自动验证的奖励相结合,对LVLM进行微调,使其在图表推理上更加稳健和可解释。我们的框架包含三个奖励,分别最大化:(i)正确的图表类型分类,(ii)忠实的图表表格重建,(iii)过程合规性。应用于30亿参数的LVLM时,Chart-RVR在分布内和分布外数据集上均持续优于标准监督微调(SFT),在缩小OOD性能差距的同时提高了推理过程的忠实度。由此得到的Chart-RVR-3B系列模型,在涵盖域内和OOD设置的六个图表推理基准测试上取得了最先进的结果,超越了所有规模相当的现有模型。除了准确性之外,Chart-RVR还产生了更具可解释性的CoT推理过程,增强了信任和可靠性,展示了GRPO与可验证奖励相结合在训练可靠、可解释的图表推理模型方面的威力。
论文及项目相关链接
PDF 23 pages
Summary
大型视觉语言模型(LVLM)在视觉推理任务上达到了最先进水平,但在分布外数据上仍然存在问题,并且在产生思维链(CoT)解释时性能会进一步下降。为了解决这个问题,我们提出了Chart-RVR框架,它通过结合组相对策略优化(GRPO)和可自动验证的奖励来微调LVLM,使其对图表推理更加稳健和可解释。该框架包括三个奖励,以最大化:(i)正确的图表类型分类,(ii)忠实的图表重建,(iii)流程一致性。在30亿参数LVLM上的应用表明,Chart-RVR在分布内和分布外数据集上的性能均优于标准监督微调(SFT),缩小了OOD性能差距,同时提高了解释性。Chart-RVR系列模型在涵盖分布内和分布外设置的六个图表推理基准测试上达到了最先进水平。除了准确性之外,Chart-RVR还产生了更具解释性的思维链解释,增强了信任和可靠性。
Key Takeaways
- 大型视觉语言模型(LVLMs)在许多视觉推理任务上表现卓越,但在处理分布外(OOD)数据时仍有局限。
- Chart-RVR框架旨在通过结合群体相对策略优化(GRPO)和自动验证奖励来提升LVLMs的稳健性和可解释性。
- Chart-RVR框架包含三个奖励,分别关注正确的图表类型分类、忠实的图表重建和流程一致性。
- 在30亿参数规模的LVLM上应用Chart-RVR框架,无论在分布内还是分布外数据集上,其性能均超越标准监督微调(SFT)。
- Chart-RVR框架缩小了OOD性能差距,同时提高了模型解释性。
- Chart-RVR系列模型在多个图表推理基准测试中达到最新技术水平。
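三个可自动验证的奖励可以写成一个简单的加权组合;下面的示意中,表格重建奖励用单元格重合率近似,过程合规奖励用输出格式检查近似,权重与判定规则均为假设,并非论文的官方奖励定义:

```python
# 示意:Chart-RVR 风格的三项可验证奖励的一个假设性组合(权重与判定规则均为假设)
def chart_type_reward(pred_type, gold_type):
    return 1.0 if pred_type == gold_type else 0.0

def table_reconstruction_reward(pred_table, gold_table):
    """用单元格重合率近似表格重建的忠实度"""
    pred_cells = {(i, j, v) for i, row in enumerate(pred_table) for j, v in enumerate(row)}
    gold_cells = {(i, j, v) for i, row in enumerate(gold_table) for j, v in enumerate(row)}
    return len(pred_cells & gold_cells) / max(len(gold_cells), 1)

def process_conformity_reward(output_text):
    """用输出格式检查近似过程合规:要求先出现 <think> 再出现 <answer>"""
    return 1.0 if ("<think>" in output_text and "<answer>" in output_text
                   and output_text.index("<think>") < output_text.index("<answer>")) else 0.0

def total_reward(pred_type, gold_type, pred_table, gold_table, output_text,
                 w=(0.3, 0.4, 0.3)):
    return (w[0] * chart_type_reward(pred_type, gold_type)
            + w[1] * table_reconstruction_reward(pred_table, gold_table)
            + w[2] * process_conformity_reward(output_text))

gold = [["2021", "10"], ["2022", "15"]]
pred = [["2021", "10"], ["2022", "14"]]
print(total_reward("bar", "bar", pred, gold, "<think>比较两年数值</think><answer>2022 更高</answer>"))
```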
点此查看论文截图
Rediscovering Entropy Regularization: Adaptive Coefficient Unlocks Its Potential for LLM Reinforcement Learning
Authors:Xiaoyun Zhang, Xiaojian Yuan, Di Huang, Wang You, Chen Hu, Jingqing Ruan, Kejiang Chen, Xing Hu
Reasoning ability has become a defining capability of Large Language Models (LLMs), with Reinforcement Learning with Verifiable Rewards (RLVR) emerging as a key paradigm to enhance it. However, RLVR training often suffers from policy entropy collapse, where the policy becomes overly deterministic, hindering exploration and limiting reasoning performance. While entropy regularization is a common remedy, its effectiveness is highly sensitive to the fixed coefficient, making it unstable across tasks and models. In this work, we revisit entropy regularization in RLVR and argue that its potential has been largely underestimated. Our analysis shows that (i) tasks of varying difficulty demand distinct exploration intensities, and (ii) balanced exploration may require the policy entropy to be maintained within a moderate range below its initial level. Therefore, we propose Adaptive Entropy Regularization (AER)–a framework that dynamically balances exploration and exploitation via three components: difficulty-aware coefficient allocation, initial-anchored target entropy, and dynamic global coefficient adjustment. Experiments on multiple mathematical reasoning benchmarks show that AER consistently outperforms baselines, improving both reasoning accuracy and exploration capability.
推理能力已成为大型语言模型(LLM)的一种核心功能,而强化学习通过可验证奖励(RLVR)已成为增强这一功能的关键范式。然而,RLVR训练经常受到策略熵崩溃的影响,策略变得过于确定性,阻碍了探索并限制了推理性能。虽然熵正则化是一种常见的补救方法,但其有效性对固定系数高度敏感,导致其在不同任务和模型中的稳定性较差。在这项工作中,我们重新审视了RLVR中的熵正则化,并认为其潜力在很大程度上被低估了。我们的分析表明(i)不同难度的任务需要不同的探索强度;(ii)平衡探索可能需要将策略熵保持在初始水平以下的适度范围内。因此,我们提出了自适应熵正则化(AER)——一个框架,通过难度感知系数分配、初始锚定目标熵和动态全局系数调整三个组件来动态平衡探索与利用。在多个数学推理基准测试上的实验表明,AER始终优于基线,提高了推理准确性和探索能力。
论文及项目相关链接
PDF 16 pages, 4 figures
Summary
大语言模型的推理能力已经成为一种重要的特性,强化学习加可验证奖励(RLVR)是提高这种能力的一种关键方法。然而,RLVR训练常遇到策略熵崩溃的问题,限制了模型的探索能力和推理性能。本文重新审视了RLVR中的熵正则化方法,并提出了自适应熵正则化(AER)框架,通过动态平衡探索与利用,包括难度感知系数分配、初始锚定目标熵和动态全局系数调整三个关键组件。实验证明,AER在多个数学推理基准测试中表现优异,提高了推理准确性和探索能力。
Key Takeaways
- 强化学习加可验证奖励(RLVR)是提高大语言模型推理能力的重要方法。
- RLVR训练中的策略熵崩溃问题限制了模型的探索能力和推理性能。
- 熵正则化是一种解决策略熵崩溃问题的常见方法,但其有效性对固定系数敏感,任务与模型间的稳定性较差。
- 本文提出自适应熵正则化(AER)框架,通过动态平衡探索与利用来提高模型性能。
- AER包括难度感知系数分配、初始锚定目标熵和动态全局系数调整三个关键组件。
- 实验证明,AER在多个数学推理基准测试中表现优异,提高了推理准确性和探索能力。
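"初始锚定的目标熵 + 动态全局系数调整"可以用一个简单的反馈控制来示意:当策略熵低于目标时增大熵正则系数,高于目标时减小。下面的模拟中,更新规则与所有数值均为假设,难度感知的系数分配被省略:

```python
import numpy as np

# 示意:以初始熵的一个适度比例为目标,动态调整熵正则系数(更新规则与数值均为假设)
def policy_entropy(probs):
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

initial_probs = np.full(8, 1.0 / 8)                      # 初始策略:均匀分布
target_entropy = 0.6 * policy_entropy(initial_probs)     # 初始锚定:目标取初始熵的 60%(假设比例)

coef, lr = 0.01, 0.05
probs = initial_probs.copy()
for step in range(50):
    # 模拟训练:策略逐渐变得确定(熵下降),熵系数越大,下降越慢
    sharpen = 0.2 / (1.0 + 10.0 * coef)
    logits = np.log(probs) + sharpen * (np.arange(8) == 0)
    probs = np.exp(logits) / np.exp(logits).sum()

    ent = policy_entropy(probs)
    # 动态全局系数调整:熵低于目标则加大系数,高于目标则减小,并保持非负
    coef = max(0.0, coef + lr * (target_entropy - ent))
    if step % 10 == 0:
        print(f"step={step:02d} entropy={ent:.3f} target={target_entropy:.3f} coef={coef:.3f}")
```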
点此查看论文截图