⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-10-11 更新
MATRIX: Multimodal Agent Tuning for Robust Tool-Use Reasoning
Authors:Tajamul Ashraf, Umair Nawaz, Abdelrahman M. Shaker, Rao Anwer, Philip Torr, Fahad Shahbaz Khan, Salman Khan
Vision language models (VLMs) are increasingly deployed as controllers with access to external tools for complex reasoning and decision-making, yet their effectiveness remains limited by the scarcity of high-quality multimodal trajectories and the cost of manual annotation. We address this challenge with a vision-centric agent tuning framework that automatically synthesizes multimodal trajectories, generates step-wise preference pairs, and trains a VLM controller for robust tool-use reasoning. Our pipeline first constructs M-TRACE, a large-scale dataset of 28.5K multimodal tasks with 177K verified trajectories, enabling imitation-based trajectory tuning. Building on this, we develop MATRIX Agent, a controller finetuned on M-TRACE for step-wise tool reasoning. To achieve finer alignment, we further introduce Pref-X, a set of 11K automatically generated preference pairs, and optimize MATRIX on it via step-wise preference learning. Across three benchmarks, Agent-X, GTA, and GAIA, MATRIX consistently surpasses both open- and closed-source VLMs, demonstrating scalable and effective multimodal tool use. Our data and code are available at https://github.com/mbzuai-oryx/MATRIX.
视觉语言模型(VLMs)越来越多地被部署为控制器,可以访问外部工具进行复杂的推理和决策,然而,其有效性仍然受到高质量多模式轨迹稀缺和手动标注成本高昂的限制。我们通过一个以视觉为中心的代理调整框架来解决这一挑战,该框架能够自动合成多模式轨迹,生成分步偏好对,并训练一个用于稳健工具使用推理的VLM控制器。我们的管道首先构建M-TRACE,这是一个包含28.5K多模式任务的大规模数据集,包含177K个验证过的轨迹,为实现基于模仿的轨迹调整提供了可能。在此基础上,我们开发了MATRIX Agent控制器,在M-TRACE上进行微调以实现分步工具推理。为了实现更精细的对齐,我们进一步引入了Pref-X,这是一组自动生成的包含11K个偏好对的数据集,并通过逐步偏好学习优化MATRIX。在Agent-X、GTA和GAIA三个基准测试中,MATRIX始终超过了开源和闭源的VLMs,证明了其可扩展性和有效的多模式工具使用能力。我们的数据和代码可在https://github.com/mbzuai-oryx/MATRIX找到。
论文及项目相关链接
Summary
视觉语言模型(VLMs)作为控制器时,面临高质量多模态轨迹稀缺和手动标注成本高昂的挑战。为解决此问题,我们提出一种以视觉为中心的代理调整框架,可自动合成多模态轨迹、生成分步偏好对,并训练VLM控制器进行稳健的工具使用推理。我们构建了大型数据集M-TRACE,包含2.85万项多模态任务和17.7万条经过验证的轨迹,为基于模仿的轨迹调整打下基础。在此基础上,我们开发出MATRIX Agent控制器,在M-TRACE上进行微调以实现逐步工具推理。为达到更精细的对齐,我们进一步引入包含1.1万个自动生成偏好对的Pref-X,并通过逐步偏好学习优化MATRIX。在Agent-X、GTA和GAIA三个基准测试中,MATRIX均超越了开源和闭源VLMs,展现出可扩展且有效的多模态工具使用能力。
Key Takeaways
- VLMs作为控制器在复杂推理和决策制定中越来越受欢迎,但其有效性仍受限于高质量多模态轨迹的稀缺和手动标注的高昂成本。
- 提出一种以视觉为中心的代理调整框架,自动合成多模态轨迹和生成分步偏好对,以提高VLMs的有效性。
- 构建大型数据集M-TRACE,为基于模仿的轨迹调整提供基础。
- 开发MATRIX Agent控制器,可在M-TRACE上进行微调并实现逐步工具推理。
- 通过引入自动生成的偏好对Pref-X和逐步偏好学习来优化MATRIX。
- MATRIX在多个基准测试中表现优越,超越开源和闭源的VLMs。
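下面用一个极简代码示意"分步偏好学习"这类方法的常见实现思路:对每个推理步骤,用类似DPO的目标比较策略模型与参考模型在"被偏好步骤"与"被拒绝步骤"上的对数概率差。其中的张量形状、β取值与函数名均为演示假设,并非论文的官方实现。

```python
import torch
import torch.nn.functional as F

def stepwise_dpo_loss(logp_chosen, logp_rejected,
                      ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """对每个推理步骤计算 DPO 风格的偏好损失(示意实现)。

    输入均为形状 [num_steps] 的张量,表示策略模型 / 参考模型
    对"被偏好步骤"与"被拒绝步骤"的对数概率(按步骤求和)。
    """
    # 策略模型与参考模型各自的对数比率
    pi_ratio = logp_chosen - logp_rejected
    ref_ratio = ref_logp_chosen - ref_logp_rejected
    # DPO 目标:最大化被偏好步骤相对参考模型的优势
    losses = -F.logsigmoid(beta * (pi_ratio - ref_ratio))
    return losses.mean()

# 玩具示例:5 个推理步骤的偏好对
torch.manual_seed(0)
logp_c, logp_r = torch.randn(5), torch.randn(5)
loss = stepwise_dpo_loss(logp_c, logp_r,
                         logp_c.detach() - 0.1, logp_r.detach() + 0.1)
print(loss.item())
```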
点此查看论文截图



SciVideoBench: Benchmarking Scientific Video Reasoning in Large Multimodal Models
Authors:Andong Deng, Taojiannan Yang, Shoubin Yu, Lincoln Spencer, Mohit Bansal, Chen Chen, Serena Yeung-Levy, Xiaohan Wang
Large Multimodal Models (LMMs) have achieved remarkable progress across various capabilities; however, complex video reasoning in the scientific domain remains a significant and challenging frontier. Current video benchmarks predominantly target general scenarios that rely heavily on perception/recognition and involve relatively simple reasoning tasks, leading to saturation and thus failing to effectively evaluate advanced multimodal cognitive skills. To address this critical gap, we introduce SciVideoBench, a rigorous benchmark specifically designed to assess advanced video reasoning in scientific contexts. SciVideoBench consists of 1,000 carefully crafted multiple-choice questions derived from cutting-edge scientific experimental videos spanning over 25 specialized academic subjects and verified by a semi-automatic system. Each question demands sophisticated domain-specific knowledge, precise spatiotemporal perception, and intricate logical reasoning, effectively challenging models’ higher-order cognitive abilities. Our evaluation highlights significant performance deficits in state-of-the-art proprietary and open-source LMMs, including Gemini 2.5 Pro and Qwen2.5-VL, indicating substantial room for advancement in video reasoning capabilities. Detailed analyses of critical factors such as reasoning complexity and visual grounding provide valuable insights and clear direction for future developments in LMMs, driving the evolution of truly capable multimodal AI co-scientists. We hope SciVideoBench could fit the interests of the community and help to push the boundary of cutting-edge AI for broader science.
多模态大型模型(LMMs)在各种能力方面取得了显著的进步,但在科学领域的复杂视频推理方面仍存在重大挑战。当前的视频基准测试主要针对高度依赖感知/识别且推理任务相对简单的一般场景,导致性能饱和,无法有效评估高级的多模态认知技能。为了弥补这一关键差距,我们推出了SciVideoBench,这是一个严格的基准测试,专门用于评估科学背景下的高级视频推理能力。SciVideoBench包含从前沿的科学实验视频中精心制作的1000道选择题,涵盖超过25个专业的学术主题,并通过半自动系统进行验证。每个问题都需要专业领域的知识、精确的时空感知和复杂的逻辑推理,有效地挑战了模型的高级认知能力。我们的评估显示,包括Gemini 2.5 Pro和Qwen2.5-VL在内的最新专有和开源大型模型存在显著的性能缺陷,这表明在视频推理能力方面仍有很大的提升空间。对推理复杂性和视觉定位等关键因素的深入分析为大型模型的未来发展提供了宝贵的见解和明确的方向,推动真正有能力的多模态AI协作科学家(co-scientists)的演进。我们希望SciVideoBench能契合社区的兴趣,并帮助推动前沿AI服务于更广泛的科学领域。
论文及项目相关链接
Summary
本文介绍了SciVideoBench基准测试,该测试旨在评估大型多模态模型在科学背景下的高级视频推理能力。SciVideoBench包含1000道从前沿科学实验视频中精心构建的选择题,涵盖超过25个专业学科,每道题都需要领域专业知识、精确的时空感知和复杂的逻辑推理。实验结果表明当前先进的大型多模态模型在视频推理方面存在显著缺陷,这为未来的模型发展提供了方向。
Key Takeaways
- SciVideoBench是专为评估科学背景下的大型多模态模型的高级视频推理能力而设计的基准测试。
- 该测试包含从前沿科学实验视频中精心制作的涵盖多个学科的选择题。
- 每个问题都需要专业领域知识、精确的时空感知和复杂的逻辑推理。
- 当前的大型多模态模型在视频推理方面存在显著缺陷。
- SciVideoBench的实验评估为未来大型多模态模型的发展提供了方向。
点此查看论文截图






Improving Reasoning for Diffusion Language Models via Group Diffusion Policy Optimization
Authors:Kevin Rojas, Jiahe Lin, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Molei Tao, Wei Deng
Diffusion language models (DLMs) enable parallel, order-agnostic generation with iterative refinement, offering a flexible alternative to autoregressive large language models (LLMs). However, adapting reinforcement learning (RL) fine-tuning to DLMs remains an open challenge because of the intractable likelihood. Pioneering work such as diffu-GRPO estimated token-level likelihoods via one-step unmasking. While computationally efficient, this approach is severely biased. A more principled foundation lies in sequence-level likelihoods, where the evidence lower bound (ELBO) serves as a surrogate. Yet, despite this clean mathematical connection, ELBO-based methods have seen limited adoption due to the prohibitive cost of likelihood evaluation. In this work, we revisit ELBO estimation and disentangle its sources of variance. This decomposition motivates reducing variance through fast, deterministic integral approximations along a few pivotal dimensions. Building on this insight, we introduce Group Diffusion Policy Optimization (GDPO), a new RL algorithm tailored for DLMs. GDPO leverages simple yet effective Semi-deterministic Monte Carlo schemes to mitigate the variance explosion of ELBO estimators under vanilla double Monte Carlo sampling, yielding a provably lower-variance estimator under tight evaluation budgets. Empirically, GDPO achieves consistent gains over pretrained checkpoints and outperforms diffu-GRPO, one of the state-of-the-art baselines, on the majority of math, reasoning, and coding benchmarks.
扩散语言模型(DLMs)支持并行、顺序无关的生成和迭代细化,为自回归大型语言模型(LLMs)提供了灵活的替代方案。然而,由于似然难以精确计算,将强化学习(RL)微调适配到DLMs仍然是一个开放性的挑战。开创性工作如diffu-GRPO通过一步去掩码估计令牌级似然。虽然计算效率高,但这种方法存在严重的偏差。更有原则性的基础在于序列级似然,其中证据下界(ELBO)作为替代。然而,尽管有这种清晰的数学联系,基于ELBO的方法由于似然评估的昂贵成本而应用有限。在这项工作中,我们重新审视ELBO估计并分解其方差来源。这种分解促使我们通过几个关键维度上的快速确定性积分近似来减少方差。基于这一见解,我们引入了针对DLMs量身定制的新RL算法——组扩散策略优化(Group Diffusion Policy Optimization,GDPO)。GDPO利用简单有效的半确定性蒙特卡洛方案,缓解了普通双重蒙特卡洛采样下ELBO估计器的方差爆炸问题,在严格的评估预算下给出可证明的更低方差估计器。经验上,GDPO在预训练检查点上实现了持续的收益,并在大多数数学、推理和编码基准测试中超过了最前沿的基线之一diffu-GRPO。
论文及项目相关链接
Summary
扩散语言模型(DLMs)支持并行、顺序无关生成和迭代优化,为大型自回归语言模型(LLMs)提供了灵活的替代方案。然而,由于似然难以精确计算,将强化学习(RL)微调适配到DLMs仍是一个开放挑战。本文重新审视了ELBO估计并分解了其方差来源。基于此,我们引入了针对DLMs定制的新RL算法——组扩散策略优化(GDPO)。GDPO利用简单有效的半确定性蒙特卡洛方案,缓解了ELBO估计器在普通双重蒙特卡洛采样下的方差爆炸问题,在有限评估预算下得到了方差更小的估计器。经验上,GDPO在预训练检查点上取得了持续的收益,并在大多数数学、推理和编码基准测试中超越了最前沿的基线之一——diffu-GRPO。
Key Takeaways
- 扩散语言模型(DLMs)具有并行生成和迭代优化的特点,提供大型自回归语言模型的灵活替代方案。
- 强化学习(RL)对DLMs的微调仍是一个开放挑战,因为其似然难以精确计算。
- ELBO估计被提议作为序列级似然性的基础,但由于似然性评估的昂贵成本,其应用受到限制。
- 本文通过分解ELBO估计的方差来源,提出了针对DLMs的新强化学习算法——组扩散策略优化(GDPO)。
- GDPO利用半确定性蒙特卡洛方案缓解ELBO估计的方差问题,在有限评估预算下表现优越。
- GDPO在预训练检查点上取得持续收益,并在数学、推理和编码基准测试中表现优于当前最前沿的基线方法diffu-GRPO。
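摘要中"沿少数关键维度做快速确定性积分近似以降低方差"的思想,可以用一个与论文无关的玩具积分直观说明:对同一期望,比较纯蒙特卡洛估计与"关键维度用确定性网格、其余维度仍随机采样"的半确定性估计在相同函数评估次数下的方差。以下被积函数与采样数量均为演示假设。

```python
import numpy as np

rng = np.random.default_rng(0)
# 玩具被积函数:关键维度 t 贡献了绝大部分方差,z 只有轻微扰动
f = lambda t, z: np.sin(3 * t) * (1 + 0.1 * z)

def mc_estimate(n):
    """纯蒙特卡洛:t、z 都随机采样,共 n 次函数评估。"""
    t, z = rng.random(n), rng.random(n)
    return f(t, z).mean()

def semi_deterministic_estimate(n_grid, n_z):
    """半确定性:关键维度 t 用中点网格做确定性积分,z 仍随机采样。"""
    t = (np.arange(n_grid) + 0.5) / n_grid
    z = rng.random(n_z)
    return f(t[:, None], z[None, :]).mean()

reps = 2000
mc = np.array([mc_estimate(16) for _ in range(reps)])
sd = np.array([semi_deterministic_estimate(4, 4) for _ in range(reps)])  # 同样 16 次函数评估
print(f"Monte Carlo variance    : {mc.var():.2e}")
print(f"semi-deterministic var. : {sd.var():.2e}")  # 在本例中小一到两个数量级
```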
点此查看论文截图


SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models
Authors:Hongxing Li, Dingming Li, Zixuan Wang, Yuchen Yan, Hang Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
空间推理仍然是视觉语言模型(VLMs)的一个基本挑战。尽管最近取得了进展,但现有方法仍难以实现稳健的性能。我们发现这一局限性源于一个关键差距:现有方法试图直接学习空间推理,而没有建立感知和理解的层次基础。为了应对这一挑战,我们提出了一种逐步构建空间智力的综合方法。我们介绍了SpatialLadder-26k,这是一个包含26610个样本的多模式数据集,涵盖对象定位、单图像、多视图和视频空间推理任务,通过标准化管道构建,确保跨模式的系统覆盖。基于该数据集,我们设计了一个三阶段的渐进式训练框架,其中包括(1)通过对象定位建立空间感知,(2)通过多维空间任务发展空间理解,以及(3)通过可验证奖励的强化学习加强复杂推理。这种方法产生了SpatialLadder模型,一个具有3B参数的模型,在空间推理基准测试上达到最新技术水平,平均比基础模型提高23.4%,比GPT-4o高出20.8%,比Gemini-2.0-Flash高出10.1%。值得注意的是,SpatialLadder在域外基准测试上的改进达到了7.2%,这表明从感知到推理的渐进式训练对于稳健的空间智力至关重要。
论文及项目相关链接
PDF Project Page: https://zju-real.github.io/SpatialLadder/ Code: https://github.com/ZJU-REAL/SpatialLadder
Summary
该文本指出空间推理是视觉语言模型(VLMs)的基本挑战,现有方法直接学习空间推理而缺乏感知和理解层次的建立,因此存在性能局限。为解决此问题,提出一种渐进建立空间智能的综合方法,包括构建SpatialLadder-26k多模态数据集和分阶段训练框架,最终提出SpatialLadder模型,实现了对空间推理基准测试的平均改进率达23.4%,并保持良好的泛化能力。
Key Takeaways
- 空间推理是视觉语言模型(VLMs)的核心挑战,现有方法存在性能局限。
- 局限性的根源在于现有方法试图直接学习空间推理,而没有建立感知和理解的层次结构。
- 为解决此挑战,提出了一种构建空间智能的渐进综合方法。
- 引入SpatialLadder-26k多模态数据集,包含26,610个样本,涵盖对象定位、单图、多视图和视频空间推理任务。
- 设计了一个三阶段的渐进训练框架,包括通过对象定位建立空间感知、通过多维空间任务发展空间理解、以及通过强化学习与可验证奖励加强复杂推理。
- 最终提出SpatialLadder模型,实现对空间推理基准测试的平均改进率达23.4%,超过基准模型、GPT-4o和Gemini-2.0-Flash。
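下面以配置加训练循环的方式,示意"感知→理解→推理"三阶段渐进式训练的组织流程;其中的数据子集名、轮数与 train_one_epoch 接口均为假设,仅用于说明课程式训练的控制逻辑,并非论文的实际实现。

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str
    data: str          # 假设的数据子集标识
    epochs: int
    objective: str     # "sft" 或 "rl"

# 与摘要描述对应的三个阶段(轮数为演示用的占位值)
CURRICULUM: List[Stage] = [
    Stage("perception", "object_localization", epochs=1, objective="sft"),
    Stage("understanding", "multi_dim_spatial_tasks", epochs=2, objective="sft"),
    Stage("reasoning", "verifiable_reward_tasks", epochs=1, objective="rl"),
]

def run_curriculum(train_one_epoch: Callable[[str, str], float]) -> None:
    """依次执行各阶段;train_one_epoch(data, objective) 返回该轮的损失/回报。"""
    for stage in CURRICULUM:
        for epoch in range(stage.epochs):
            metric = train_one_epoch(stage.data, stage.objective)
            print(f"[{stage.name}] epoch {epoch + 1}/{stage.epochs}: metric={metric:.3f}")

# 用一个假的训练函数演示控制流
run_curriculum(lambda data, obj: 1.0 if obj == "sft" else 0.5)
```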
点此查看论文截图




CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards
Authors:Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, Lei Bai
Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent’s policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
自我进化是使基于大语言模型的代理在预训练后能够持续提高其能力的一个核心研究主题。最近的研究见证了从非强化学习向基于强化学习的方法的转变。当前的基于强化学习的方法要么依赖于密集的外部奖励信号,要么从语言模型本身中提取内在奖励信号。然而,这些方法与人类智能中观察到的自我进化机制相悖,个人通过相互讨论和协作来学习和提高。在这项工作中,我们引入了协同进化多智能体系统(CoMAS),这是一个新型框架,它能够让智能体通过从智能体之间的交互中学习来自主提高,无需外部监督。CoMAS从丰富的讨论动态中产生内在奖励,采用语言模型作为法官的机制来制定这些奖励,并通过强化学习优化每个智能体的策略,从而实现去中心化和可扩展的协同进化。实验结果表明,CoMAS始终优于未经训练的智能体,并在大多数评估环境中达到最新技术水平。消融研究证实了基于交互的奖励信号的必要性,并显示出随着智能体数量和多样性的增加,前景十分广阔的可扩展性。这些发现确立了CoMAS在基于语言模型的智能体自我进化中的新型有效范式。
论文及项目相关链接
Summary:
自我进化是研究大型语言模型(LLM)驱动的智能体在预训练后持续提高其能力的重要课题。最近的研究经历了从非强化学习(RL)到基于RL的方法的转变。当前基于RL的方法依赖于密集的外界奖励信号或从LLM本身提取的内在奖励信号。然而,这些方法与人类智能中的自我进化机制相悖,人类通过相互讨论和协作来学习和提高。本研究引入了一种新型框架CoMAS,它使智能体能够通过从智能体间的交互中学习来自主提高,无需外界监督。CoMAS从丰富的讨论动态中产生内在奖励,利用LLM作为裁判来制定这些奖励,并通过RL优化每个智能体的策略,从而实现分散和可扩展的协同进化。实验结果表明,CoMAS持续超越未经训练的智能体,并在大多数评估环境中达到最佳性能。
Key Takeaways:
- CoMAS框架允许LLM驱动的智能体通过智能体间的交互来自主提高能力。
- 该框架结合了丰富的讨论动态以生成内在奖励。
- LLM作为裁判制定奖励信号以促进智能体的进化。
- 通过强化学习优化每个智能体的策略。
- 实验证明CoMAS在多数评估环境中表现优于其他方法。
- 消融研究证实了基于交互的奖励信号的必要性。
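摘要中"由讨论动态产生内在奖励、以LLM作为评判者"的流程,可以抽象为:多个智能体各自给出回答并互相评论,再由评判模型为每条回答打分,并做组内去均值得到相对优势。下面的 judge_fn 接口与打分方式均为假设,仅示意奖励信号的组织方法。

```python
from typing import Callable, Dict, List

def interaction_rewards(
    answers: Dict[str, str],                       # agent_id -> 回答
    critiques: Dict[str, List[str]],               # agent_id -> 其他智能体对它的评论
    judge_fn: Callable[[str, List[str]], float],   # 假设的 LLM-as-judge 打分接口,返回 [0,1]
) -> Dict[str, float]:
    """把讨论动态转化为每个智能体的内在奖励(示意实现)。"""
    rewards = {}
    for agent_id, answer in answers.items():
        rewards[agent_id] = judge_fn(answer, critiques.get(agent_id, []))
    # 组内去均值得到相对优势,便于后续 RL 更新
    baseline = sum(rewards.values()) / len(rewards)
    return {a: r - baseline for a, r in rewards.items()}

# 用一个简单启发式代替真实的 LLM 评判器做演示
fake_judge = lambda ans, crits: min(1.0, 0.2 * len(ans.split()) / 10 + 0.1 * len(crits))
print(interaction_rewards(
    {"A": "The answer is 42 because ...", "B": "I am not sure."},
    {"A": ["clear derivation"], "B": []},
    fake_judge,
))
```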
点此查看论文截图



Video-STAR: Reinforcing Open-Vocabulary Action Recognition with Tools
Authors:Zhenlong Yuan, Xiangyan Qu, Chengxuan Qian, Rui Chen, Jing Tang, Lei Sun, Xiangxiang Chu, Dapeng Zhang, Yiwei Wang, Yujun Cai, Shuo Li
Multimodal large language models (MLLMs) have demonstrated remarkable potential in bridging visual and textual reasoning, yet their reliance on text-centric priors often limits their ability to disentangle semantically similar actions in open-vocabulary scenarios. To address this, we propose Video-STAR, a framework that harmonizes contextual sub-motion decomposition with tool-augmented reinforcement learning for open-vocabulary action recognition (OVAR). Unlike prior methods that treat actions as monolithic entities, our approach innovatively decomposes actions into discriminative sub-motions for fine-grained matching while dynamically invoking domain-specific tools for cross-modal interleaving, thereby enabling category-specific reasoning capacity and reducing cross-modal hallucination. Moreover, by designing a hierarchical reward that balances tool-usage efficiency, sub-motion relevance, and structural coherence in reasoning, our method autonomously leverages external tools to prioritize sub-motion patterns without explicit supervision, transmitting from text-centric reasoning to visually grounded inference. Extensive evaluations on HMDB-51, UCF-101, SSv2, Kinetics-400, and Kinetics-600 datasets demonstrate our state-of-the-art performance, outperforming existing methods in distinguishing fine-grained actions and handling cross-modal hallucination, validating our excellent robustness and generalization.
多模态大型语言模型(MLLMs)在桥接视觉和文本推理方面表现出了显著潜力,然而它们对以文本为中心的先验的依赖常常限制了它们在开放词汇场景中解开语义上相似的动作的能力。为了解决这一问题,我们提出了Video-STAR框架,该框架通过上下文子运动分解与工具增强强化学习实现开放词汇动作识别(OVAR)。不同于将动作视为单一实体的先前方法,我们的方法创新地将动作分解为具有区分力的子运动,进行精细匹配,同时动态调用特定领域的工具进行跨模态交错,从而实现特定类别的推理能力并减少跨模态幻觉。此外,通过设计一种平衡工具使用效率、子运动相关性和推理中结构连贯性的分层奖励,我们的方法能够自主地利用外部工具来优先处理子运动模式,而无需明确监督,实现从以文本为中心的推理向以视觉为基础的推理的转变。在HMDB-51、UCF-101、SSv2、Kinetics-400和Kinetics-600数据集上的广泛评估表明,我们的性能处于最新水平,在区分精细动作和处理跨模态幻觉方面优于现有方法,验证了我们的出色稳健性和泛化能力。
论文及项目相关链接
Summary
本文提出了Video-STAR框架,该框架通过上下文子运动分解与工具增强型强化学习,实现了开放词汇表动作识别(OVAR)。与传统的将动作视为单一实体的方法不同,Video-STAR创新地将动作分解为具有鉴别力的子运动,进行精细匹配,并动态调用特定领域的工具进行跨模态交织,从而提高类别特定的推理能力,减少跨模态幻觉。通过设计平衡工具使用效率、子运动相关性和推理结构连贯性的分层奖励,该方法可自主利用外部工具来优先处理子运动模式,而无需明确的监督。广泛评估结果表明,Video-STAR在HMDB-51、UCF-101、SSv2、Kinetics-400和Kinetics-600数据集上表现优异,展现了其卓越的鲁棒性和泛化能力。
Key Takeaways
- Video-STAR框架融合了上下文子运动分解与工具增强型强化学习,旨在解决开放词汇表动作识别中的挑战。
- 不同于传统方法,Video-STAR将动作分解为子运动进行精细匹配,提高了识别精度。
- Video-STAR通过动态调用特定领域的工具进行跨模态交织,增强了类别特定的推理能力。
- 设计的分层奖励机制可自主利用外部工具处理子运动模式,无需明确监督。
- Video-STAR在处理精细动作识别和跨模态幻觉方面表现出卓越性能。
- Video-STAR在多个数据集上的评估结果均显示其鲁棒性和泛化能力。
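摘要提到的"平衡工具使用效率、子动作相关性与推理结构连贯性的分层奖励",在实现上通常可写成若干项的加权组合。下面给出一个最小示意,其中各项的具体计算方式与权重均为假设,并非论文的官方定义。

```python
def hierarchical_reward(
    tool_calls: int,
    max_tool_calls: int,
    submotion_hits: int,
    submotion_total: int,
    coherent: bool,
    weights=(0.3, 0.5, 0.2),   # 假设的三项权重:效率 / 相关性 / 连贯性
) -> float:
    """组合三个层次的奖励信号(示意实现,非论文官方定义)。"""
    # 工具使用效率:至少调用一次工具,且调用次数越少越好
    efficiency = 1.0 - min(tool_calls, max_tool_calls) / max_tool_calls if tool_calls > 0 else 0.0
    # 子动作相关性:命中的判别性子动作比例
    relevance = submotion_hits / submotion_total if submotion_total else 0.0
    # 结构连贯性:推理链是否通过格式 / 逻辑检查
    coherence = 1.0 if coherent else 0.0
    w_e, w_r, w_c = weights
    return w_e * efficiency + w_r * relevance + w_c * coherence

print(hierarchical_reward(tool_calls=2, max_tool_calls=5,
                          submotion_hits=3, submotion_total=4, coherent=True))
```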
点此查看论文截图




Rethinking Provenance Completeness with a Learning-Based Linux Scheduler
Authors:Jinsong Mao, Benjamin E. Ujcich, Shiqing Ma
Provenance plays a critical role in maintaining traceability of a system’s actions for root cause analysis of security threats and impacts. Provenance collection is often incorporated into the reference monitor of systems to ensure that an audit trail exists of all events, that events are completely captured, and that logging of such events cannot be bypassed. However, recent research has questioned whether existing state-of-the-art provenance collection systems fail to ensure the security guarantees of a true reference monitor due to the ‘super producer threat’ in which provenance generation can overload a system to force the system to drop security-relevant events and allow an attacker to hide their actions. One approach towards solving this threat is to enforce resource isolation, but that does not fully solve the problems resulting from hardware dependencies and performance limitations. In this paper, we show how an operating system’s kernel scheduler can mitigate this threat, and we introduce Venus, a learned scheduler for Linux specifically designed for provenance. Unlike conventional schedulers that ignore provenance completeness requirements, Venus leverages reinforcement learning to learn provenance task behavior and to dynamically optimize resource allocation. We evaluate Venus’s efficacy and show that Venus significantly improves both the completeness and efficiency of provenance collection systems compared to traditional scheduling, while maintaining reasonable overheads and even improving overall runtime in certain cases compared to the default Linux scheduler.
溯源对于保持系统行为的可追溯性、以便对安全威胁及其影响进行根因分析起着关键作用。溯源收集通常被纳入系统的参考监视器中,以确保所有事件都有审计跟踪,事件被完全捕获,并且此类事件的日志记录不能被绕过。然而,最近有研究表明,由于"超级生产者威胁",现有最先进的溯源收集系统无法保证真正参考监视器的安全保证。在这种威胁中,溯源生成会过载系统,迫使系统丢弃与安全相关的事件,并允许攻击者隐藏其行动。解决此威胁的一种方法是强制执行资源隔离,但这并不能完全解决由硬件依赖和性能限制产生的问题。在本文中,我们展示了操作系统内核调度器如何减轻这种威胁,并介绍了Venus,一种专为溯源设计的Linux学习调度器。与忽略溯源完整性要求的传统调度器不同,Venus利用强化学习来了解溯源任务行为,并动态优化资源分配。我们评估了Venus的有效性,并证明与传统调度相比,Venus在显著提高溯源收集系统的完整性和效率的同时,还保持了合理的开销,并且在某些情况下相比Linux默认调度器甚至改善了整体运行时间。
论文及项目相关链接
Summary
系统溯源对于维护系统行为的可追溯性、分析安全威胁的根源至关重要。最新研究表明,现有先进的溯源收集系统存在“超级生产者威胁”,可能导致系统丢失安全相关事件并允许攻击者隐藏行动。本文展示如何通过操作系统内核调度器缓解这一威胁,并介绍专为溯源设计的Linux学习调度器Venus。Venus利用强化学习了解溯源任务行为,并动态优化资源分配,在保持合理开销的同时,显著提高溯源系统的完整性和效率,并在某些情况下相比默认Linux调度器改善了整体运行时间。
Key Takeaways
- 溯源在系统安全中起关键作用,有助于追踪系统行为以进行安全威胁的根源分析。
- 现有溯源收集系统面临“超级生产者威胁”,可能影响其安全性和完整性。
- “Venus”调度器是专为Linux设计的溯源学习调度器,可优化资源分配。
- Venus利用强化学习了解溯源任务行为,以提高系统的溯源完整性和效率。
- 与传统调度器相比,Venus在保持合理开销的同时,显著提高了溯源系统的性能。
- 在某些情况下,Venus甚至可能在整体运行时间上超过默认Linux调度器的表现。
- Venus的引入为缓解系统安全威胁提供了新的解决方案。
点此查看论文截图





Reinforcing Diffusion Models by Direct Group Preference Optimization
Authors:Yihong Luo, Tianyang Hu, Jing Tang
While reinforcement learning methods such as Group Relative Preference Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which utilize relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics. Code is available at https://github.com/Luo-Yihong/DGPO.
虽然集团相对偏好优化(GRPO)等强化学习方法已经大大增强了大型语言模型的效果,但将其适应扩散模型仍然具有挑战性。特别是,GRPO要求随机策略,而成本效益最高的扩散采样器是基于确定性常微分方程(ODE)的。近期的工作通过使用基于随机微分方程(SDE)的采样器来引入随机性来解决这个问题,但这种对模型不可知的高斯噪声的依赖导致了收敛速度慢。为了解决这一冲突,我们提出了直接集团偏好优化(DGPO),这是一种新的在线强化学习算法,它完全摒弃了基于策略梯度的框架。DGPO直接从群体层面的偏好中学习,这利用了群体内样本的相对信息。这种设计消除了对低效随机策略的需求,解锁了高效确定性常微分方程采样器的使用,并加快了训练速度。大量结果表明,DGPO的训练速度比现有最先进的方法快约20倍,在域内和域外奖励指标上均表现出卓越性能。代码可在https://github.com/Luo-Yihong/DGPO找到。
论文及项目相关链接
摘要
强化学习方法如集团相对偏好优化(GRPO)已经显著提高了大型语言模型的性能,但将其适应扩散模型仍然具有挑战性。GRPO要求策略具有随机性,而成本效益最高的扩散采样器却是基于确定性常微分方程(ODEs)。最近的工作通过使用基于随机微分方程(SDEs)的采样器来引入随机性来解决这个问题,但这依赖于模型无关的高斯噪声,导致收敛速度慢。为解决这一冲突,我们提出了直接集团偏好优化(DGPO)算法,这是一种新的在线强化学习算法,它完全摒弃了基于策略梯度的框架。DGPO直接从集团层面的偏好中学习,利用集团内样本的相对信息。这种设计消除了对低效随机策略的需求,实现了高效确定性ODE采样器的使用并加快了训练速度。实验结果显示,DGPO的训练速度比现有最先进的方法快约20倍,在域内和域外奖励指标上均表现出卓越性能。代码可在 https://github.com/Luo-Yihong/DGPO 找到。
要点
- 强化学习在大型语言模型中的应用已经取得了显著进展,但在适应扩散模型时面临挑战。
- 现有方法如GRPO要求策略具有随机性,而最有效的扩散采样器是确定性的。
- 最近的研究通过引入SDE-based采样器来增加随机性,但这导致收敛速度慢且依赖于模型无关的高斯噪声。
- DGPO算法解决了这一问题,它通过直接从集团层面的偏好学习,摒弃了策略梯度框架。
- DGPO的设计消除了对随机策略的需求,允许使用高效的确定性ODE采样器。
- 实验结果显示DGPO训练速度快,性能卓越,相比现有方法优势明显。
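对"直接从组级偏好学习、不依赖策略梯度框架"这一思路,下面给出一种可能的简化示意:在同一提示的一组样本内,按奖励高低构造成对偏好,并用成对logistic目标提升高奖励样本的(可微)对数似然。该写法只是对摘要文字的一种粗略解读,并非DGPO的官方目标函数。

```python
import torch
import torch.nn.functional as F

def group_preference_loss(logps: torch.Tensor, rewards: torch.Tensor, beta: float = 1.0):
    """利用组内相对奖励构造成对偏好损失(示意)。

    logps:   模型对组内每个样本的(可微)对数似然,形状 [G]
    rewards: 每个样本的标量奖励,形状 [G]
    """
    G = logps.shape[0]
    i, j = torch.triu_indices(G, G, offset=1)      # 组内所有样本对
    sign = torch.sign(rewards[i] - rewards[j])     # 哪个样本的奖励更高
    mask = sign != 0                               # 奖励打平的样本对不参与
    # 奖励更高的样本应获得更高的对数似然:成对 logistic 目标
    margin = beta * (logps[i] - logps[j]) * sign
    return -F.logsigmoid(margin[mask]).mean()

torch.manual_seed(0)
logps = torch.randn(4, requires_grad=True)
loss = group_preference_loss(logps, torch.tensor([1.0, 0.0, 0.5, 0.2]))
loss.backward()
print(loss.item(), logps.grad)
```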
点此查看论文截图


FlyLoRA: Boosting Task Decoupling and Parameter Efficiency via Implicit Rank-Wise Mixture-of-Experts
Authors:Heming Zou, Yunliang Zang, Wutong Xu, Yao Zhu, Xiangyang Ji
Low-Rank Adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for foundation models, but it suffers from parameter interference, resulting in suboptimal performance. Although Mixture-of-Experts (MoE)-based LoRA variants show promise in mitigating intra-task correlations in single-task instruction tuning, they introduce additional router parameters and remain ineffective in multi-task model merging where inter-task interference arises. Inspired by the fly olfactory circuit, we propose FlyLoRA, an implicit MoE-based LoRA variant that introduces: (1) rank-wise expert activation in the up-projection matrix, and (2) an implicit router that unifies expert routing and down-projection, where a frozen sparse random projection matrix replaces the traditional dense trainable version. This design resolves the trade-off between intra-task decorrelation and computational efficiency by eliminating the need for an explicit router, while inherently mitigating inter-task interference due to the orthogonality property of random matrices. Extensive experiments across four domains – general knowledge understanding, scientific question answering, mathematical reasoning, and code generation – demonstrate consistent performance improvements over existing methods. Beyond empirical gains, FlyLoRA highlights how biological structures can inspire innovations in AI technologies. Code is available at https://github.com/gfyddha/FlyLoRA.
低秩适应(LoRA)是一种广泛应用于基础模型的参数高效微调方法,但它存在参数干扰的问题,导致性能不佳。虽然基于混合专家(MoE)的LoRA变体在单任务指令微调中显示出缓解任务内关联的潜力,但它们引入了额外的路由器参数,并且在多任务模型合并中仍然无效,因为在这里会出现任务间干扰。受苍蝇嗅觉回路的启发,我们提出了FlyLoRA,这是一种基于隐式MoE的LoRA变体,它引入了:(1)在上投影矩阵中进行按秩(rank-wise)的专家激活;(2)一种隐式路由器,统一专家路由和下投影,其中用冻结的稀疏随机投影矩阵代替传统的密集可训练版本。这种设计通过消除对显式路由器的需求,解决了任务内去相关和计算效率之间的权衡,同时由于随机矩阵的正交性,固有地减轻了任务间的干扰。在四个领域——通用知识理解、科学问答、数学推理和代码生成——的广泛实验证明,与现有方法相比,FlyLoRA的性能得到了一致的提升。除了实证收益外,FlyLoRA还强调了生物结构如何激发AI技术的创新。代码可在https://github.com/gfyddha/FlyLoRA找到。
论文及项目相关链接
PDF NeurIPS 2025 accepted paper
Summary
飞洛拉(FlyLoRA)是一种基于隐混合专家(MoE)的低秩适应(LoRA)变体,通过引入排名专家激活和隐路由机制,解决了参数干扰问题。它在多个领域表现优异,如通用知识理解、科学问答、数学推理和代码生成等。其核心在于利用随机矩阵的正交性,有效缓解任务间干扰。
Key Takeaways
- 飞洛拉(FlyLoRA)是一种针对基础模型的参数效率高的微调方法,基于隐混合专家(MoE)技术。
- 飞洛拉解决了低秩适应(LoRA)中的参数干扰问题。
- 飞洛拉通过引入排名专家激活和隐路由机制,在单任务指令微调中减轻了任务内相关性。
- 飞洛拉利用随机矩阵的固有正交性,有效缓解多任务模型合并中的任务间干扰。
- 飞洛拉在多个领域,包括通用知识理解、科学问答、数学推理和代码生成等,表现出卓越性能。
- 飞洛拉的代码已公开发布,可供研究使用。
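FlyLoRA 的两个要点——用冻结的稀疏随机降维矩阵代替可训练的稠密降维(兼作隐式路由),以及在上投影中按秩做 top-k 专家激活——可以用如下最小模块示意。稀疏率、top-k 取值与初始化方式均为演示假设,并非官方实现。

```python
import torch
import torch.nn as nn

class FlyLoRALikeLayer(nn.Module):
    """示意实现:冻结稀疏随机降维 + 按秩 top-k 激活的上投影。"""

    def __init__(self, d_in, d_out, rank=8, top_k=2, sparsity=0.9):
        super().__init__()
        # 冻结的稀疏随机降维矩阵,兼作"隐式路由器"
        proj = torch.randn(d_in, rank) * (torch.rand(d_in, rank) > sparsity)
        self.register_buffer("down", proj / (rank ** 0.5))
        # 可训练的上投影,每个 rank 视作一个"专家",零初始化使初始增量为 0
        self.up = nn.Parameter(torch.zeros(rank, d_out))
        self.top_k = top_k

    def forward(self, x):                      # x: [batch, d_in]
        h = x @ self.down                      # [batch, rank],激活值即隐式路由得分
        topv, topi = h.abs().topk(self.top_k, dim=-1)
        mask = torch.zeros_like(h).scatter_(-1, topi, 1.0)
        return (h * mask) @ self.up            # 仅被选中的 rank 参与上投影

layer = FlyLoRALikeLayer(d_in=64, d_out=64)
print(layer(torch.randn(2, 64)).shape)         # torch.Size([2, 64])
```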
点此查看论文截图




Beyond Pass@k: Breadth-Depth Metrics for Reasoning Boundaries
Authors:Marius Dragoi, Ioana Pintilie, Florin Gogianu, Florin Brad
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm to improve Large Language Models on reasoning tasks such as coding, math or logic. To assess the reasoning boundary (the fraction of problems a model can solve) researchers often report Pass@k at large sampling budgets. Recent results reveal a crossover phenomenon: while RLVR models outperform the base model at small k values, the base model usually outperforms them when sampling a very large number of completions. This has been interpreted as evidence that base models have a larger reasoning boundary. We argue that on tasks with discrete answer spaces, such as math with numeric outputs, Pass@k at large k reflects the increasingly higher chance of success in the limit of the number of trials rather than genuine reasoning, and can therefore be misleading. We propose Cover@tau, which measures the fraction of problems that a model can solve for which at least a tau proportion of completions are correct. Unlike Pass@k, Cover@tau captures reasoning under an explicit reliability threshold: models that rely on random guessing degrade rapidly as tau increases. We evaluate several RLVR models using Cover@tau-based metrics and illustrate how the relative rankings of popular algorithms change compared to Pass@1, offering a different perspective on reasoning boundaries.
强化学习与可验证奖励(RLVR)已成为一种强大的范式,能够在编程、数学或逻辑等推理任务上改进大型语言模型。为了评估推理边界(模型可解决的问题部分),研究者通常会在大采样预算下报告Pass@k。最近的结果揭示了一种交叉现象:尽管在较小的k值下,RLVR模型的表现优于基础模型,但当采样大量的完成时,基础模型通常表现更好。这被认为是基础模型具有更大推理边界的证据。我们认为,在具有离散答案空间的任务(如具有数字输出的数学)中,大k值的Pass@k反映了在试验次数无限增加的情况下越来越高的成功机会,而不是真正的推理,因此可能会产生误导。我们提出了Cover@tau,它衡量的是至少有tau比例的补全结果正确的问题所占的比例。与Pass@k不同,Cover@tau在明确的可靠性阈值下捕捉推理:依赖随机猜测的模型的表现会随着tau的增加而迅速下降。我们使用基于Cover@tau的度量标准评估了多个RLVR模型,并说明了与Pass@1相比,流行算法相对排名的变化,为推理边界提供了不同的视角。
论文及项目相关链接
PDF 10 pages, 3 figures
Summary
强化学习与可验证奖励(RLVR)范式在提升大型语言模型在编程、数学或逻辑等推理任务上的表现方面展现出强大的潜力。研究中常通过Pass@k指标评估模型解决问题的边界,但在大样本预算下存在交叉现象。我们认为,在离散答案空间的任务中,如数学等具有数值输出的任务,Pass@k指标在大型k值上反映的是尝试次数越多成功概率越高的情况,而非真正的推理能力,可能具有误导性。因此,我们提出Cover@tau指标,衡量至少有tau比例的补全正确的问题所占的比例。与Pass@k不同,Cover@tau在明确的可靠性阈值下衡量推理能力:依赖随机猜测的模型随着tau的增加会迅速下降。我们对使用Cover@tau指标的RLVR模型进行了评估,并展示了与Pass@1相比算法相对排名的变化,为理解推理边界提供了不同视角。
Key Takeaways
- RLVR范式在提升大型语言模型在推理任务上的表现具有显著效果。
- Pass@k指标在大样本预算下存在交叉现象,可能导致对模型解决问题边界的误解。
- 在离散答案空间的任务中,Pass@k指标可能无法真实反映模型的推理能力。
- Cover@tau指标能够衡量模型在明确可靠性阈值下的推理能力。
- 依赖随机猜测的模型在Cover@tau指标下会迅速下降。
- 使用Cover@tau指标的评估结果显示算法相对排名的变化。
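Cover@tau 可以按摘要的文字定义直接计算:对每道题采样若干条补全,统计"正确补全比例不低于阈值 tau"的题目所占份额;下面同时给出标准的无偏 Pass@k 估计作对照。玩具数据与细节处理为演示假设。

```python
import numpy as np

def cover_at_tau(correct: np.ndarray, tau: float) -> float:
    """correct: [num_problems, num_samples] 的 0/1 矩阵;
    返回"正确补全比例 >= tau"的题目份额(按摘要定义的示意实现)。"""
    per_problem_acc = correct.mean(axis=1)
    return float((per_problem_acc >= tau).mean())

def pass_at_k(correct: np.ndarray, k: int) -> float:
    """无偏 Pass@k 估计(Chen et al., 2021 的标准公式)。"""
    n = correct.shape[1]
    vals = []
    for c in correct.sum(axis=1):
        if n - c < k:
            vals.append(1.0)
        else:
            vals.append(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
    return float(np.mean(vals))

rng = np.random.default_rng(0)
# 玩具数据:每道题有自己的"真实"成功率,采样 32 条补全
correct = (rng.random((100, 32)) < rng.random((100, 1)) * 0.6).astype(int)
print("Pass@16   :", round(pass_at_k(correct, 16), 3))
print("Cover@0.5 :", round(cover_at_tau(correct, 0.5), 3))
```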
点此查看论文截图




Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window
Authors:Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, Zhenru Zhang, Jianhong Tu, Hongyu Lin, Junyang Lin
While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.
虽然最近的推理模型进展通过强化学习展示了认知行为,但现有方法很难在具有长期视野交互的多轮代理中激发深度推理能力。我们提出了DeepMiner,这是一个通过引入高难度训练任务和动态上下文窗口来激发此类能力的新型框架。DeepMiner采用反向构建方法,从真实网络来源生成复杂但可验证的问题答案对,这确保了训练数据的挑战性和可靠性,同时将认知能力注入多轮推理场景中。我们还为训练和推理设计了一个优雅而有效的动态上下文管理策略,利用滑动窗口机制,消除对外部摘要模型的依赖,从而有效地增强模型处理不断扩展的长期视野上下文的能力。通过在Qwen3-32B上进行强化学习,我们开发了DeepMiner-32B,在多个搜索代理基准测试中实现了显著的性能提升。DeepMiner在BrowseComp-en上达到了33.5%的准确率,超过了之前最佳的开源代理近20个百分点,并在BrowseComp-zh、XBench-DeepSearch和GAIA上表现出一致的性能提升。值得注意的是,我们的动态上下文管理可在标准的32k上下文长度内支持近100轮的持续交互,有效解决了制约现有多轮交互系统的上下文长度局限。
论文及项目相关链接
Summary
深度挖掘(DeepMiner)是一个新颖框架,通过引入高难度训练任务和动态语境窗口,激发多轮交互中的深度推理能力。该框架使用反向构建方法生成来自真实网络资源的复杂但可验证的问题答案对,确保训练数据的挑战性和可靠性,并注入多轮推理场景的认知能力。通过强化学习在Qwen3-32B上的实验,DeepMiner-32B在多个搜索代理基准测试中实现了显著的性能提升。
Key Takeaways
- DeepMiner框架通过引入高难度训练任务和动态语境窗口,能够激发多轮交互中的深度推理能力。
- 该框架使用反向构建方法生成真实且可验证的问题答案对,确保训练数据的挑战性和可靠性。
- 动态语境管理策略用于训练和推理,能有效处理不断扩展的长期语境。
- DeepMiner框架无需依赖外部摘要模型,使用滑动窗口机制提高效率。
- 通过在Qwen3-32B上进行强化学习,DeepMiner-32B性能显著提升,并在多个搜索代理基准测试中表现优异。
- DeepMiner在BrowseComp-en上的准确率达到了33.5%,远超之前开源代理的准确率。
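"滑动窗口式动态上下文管理、且不依赖外部摘要模型"的思路,可以抽象为:当累计 token 超过预算时,保留系统提示与最近若干轮,丢弃较早的轮次。以下的 token 计数方式(按空白分词)与保留策略均为示意假设。

```python
from typing import List, Dict

def manage_context(messages: List[Dict[str, str]], budget: int,
                   keep_last: int = 6) -> List[Dict[str, str]]:
    """滑动窗口上下文管理的示意实现:超出预算时丢弃较早的交互轮次。

    messages: [{"role": ..., "content": ...}],约定第一条为 system 提示。
    budget:   允许的总 token 预算(这里用空白分词近似计数)。
    """
    count = lambda msgs: sum(len(m["content"].split()) for m in msgs)
    if count(messages) <= budget:
        return messages
    system, history = messages[:1], messages[1:]
    # 从最早的轮次开始丢弃,但始终保留最近 keep_last 条
    while len(history) > keep_last and count(system + history) > budget:
        history = history[1:]
    return system + history

msgs = [{"role": "system", "content": "you are a search agent"}]
msgs += [{"role": "user" if i % 2 == 0 else "assistant",
          "content": f"turn {i} " + "tok " * 50} for i in range(40)]
trimmed = manage_context(msgs, budget=800)
print(len(msgs), "->", len(trimmed))
```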
点此查看论文截图




ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval
Authors:Jianlyu Chen, Junwei Lan, Chaofan Li, Defu Lian, Zheng Liu
In this paper, we introduce ReasonEmbed, a novel text embedding model developed for reasoning-intensive document retrieval. Our work includes three key technical contributions. First, we propose ReMixer, a new data synthesis method that overcomes the triviality problem prevalent in previous synthetic datasets, enabling large-scale production of 82K high-quality training samples. Second, we design Redapter, a self-adaptive learning algorithm that dynamically adjusts training each sample’s weight based on its reasoning intensity. This allows the model to effectively capture the complex semantic relationships between queries and documents. Third, we implement ReasonEmbed across multiple backbones of varying sizes, all of which achieve superior performance on reasoning-intensive retrieval tasks. Notably, our ReasonEmbed-Qwen3-8B model offers a record-high nDCG@10 score of 38.1 on the BRIGHT benchmark, which significantly outperforms existing text embedding models. We will fully open-source our created resources in ReasonEmbed to push forward the research advancement in this field.
本文介绍了ReasonEmbed,这是一款专为推理密集型文档检索而开发的新型文本嵌入模型。我们的工作包括三项关键技术贡献。首先,我们提出了ReMixer,这是一种新的数据合成方法,克服了以前合成数据集中普遍存在的平庸问题,能够实现8.2万高质量训练样本的大规模生产。其次,我们设计了Redapter,一种自适应学习算法,可以根据每个样本的推理强度动态调整其权重。这使得模型能够有效地捕捉查询和文档之间的复杂语义关系。第三,我们在不同大小的主干网络上实现了ReasonEmbed,所有这些模型在推理密集型检索任务上都表现出卓越的性能。值得注意的是,我们的ReasonEmbed-Qwen3-8B模型在BRIGHT基准测试上取得了创纪录的nDCG@10得分38.1,显著优于现有的文本嵌入模型。我们将全面开源ReasonEmbed中创建的资源,以推动该领域的研究进展。
论文及项目相关链接
PDF 17 pages, 3 figures
Summary
本文介绍了ReasonEmbed,一种为推理密集型文档检索开发的新型文本嵌入模型。该模型包含三项关键技术贡献:提出ReMixer新数据合成方法,克服以往合成数据集的平庸问题,生成8.2万高质量训练样本;设计Redapter自适应学习算法,根据样本的推理强度动态调整训练权重,有效捕捉查询和文档之间的复杂语义关系;在多大小不同的backbone上实现ReasonEmbed,在推理密集型检索任务上表现优异。其中,ReasonEmbed-Qwen3-8B模型在BRIGHT基准测试中取得创纪录的高nDCG@10得分38.1,显著优于现有文本嵌入模型。我们将全面开源ReasonEmbed资源,以推动该领域的研究进展。
Key Takeaways
- 引入ReasonEmbed模型,专为推理密集型文档检索设计。
- 提出ReMixer数据合成方法,解决以往合成数据集的平庸问题,生成高质量训练样本。
- 设计Redapter自适应学习算法,根据样本推理强度动态调整训练。
- ReasonEmbed模型在多大小不同的backbone上表现优异。
- ReasonEmbed-Qwen3-8B模型在BRIGHT基准测试中取得显著高得分。
- 该模型显著优于现有文本嵌入模型。
点此查看论文截图




Enhancing Reasoning for Diffusion LLMs via Distribution Matching Policy Optimization
Authors:Yuchen Zhu, Wei Guo, Jaemoo Choi, Petr Molodyk, Bo Yuan, Molei Tao, Yongxin Chen
Diffusion large language models (dLLMs) are promising alternatives to autoregressive large language models (AR-LLMs), as they potentially allow higher inference throughput. Reinforcement learning (RL) is a crucial component for dLLMs to achieve comparable performance with AR-LLMs on important tasks, such as reasoning. However, RL algorithms that are well-suited for dLLMs’ unique characteristics have yet to be developed. This paper proposes Distribution Matching Policy Optimization (DMPO), a principled and theoretically grounded RL fine-tuning method specifically designed to enhance the reasoning capabilities of dLLMs by matching the dLLM policy distribution to the optimal, reward-tilted one through cross-entropy optimization. We identify a key challenge in the implementation with a small training batch size and propose several effective solutions through a novel weight baseline subtraction technique. DMPO exhibits superior performance on multiple reasoning benchmarks without supervised fine-tuning, with an accuracy improvement of up to 42.9% over previously SOTA baselines and 55.8% over the base model, underscoring the effectiveness of the distribution matching framework. Our code is available at https://github.com/yuchen-zhu-zyc/DMPO.
扩散大型语言模型(dLLMs)作为对自回归大型语言模型(AR-LLMs)的有前途的替代方案,具有更高的推理吞吐量潜力。强化学习(RL)是dLLMs在推理等重要任务上实现与AR-LLMs相当性能的关键组成部分。然而,适合dLLMs独特特性的RL算法尚未开发出来。本文提出了分布匹配策略优化(DMPO),这是一种有原则、有理论基础的RL微调方法,专门设计用于通过交叉熵优化,将dLLM策略分布匹配到最优、倾向奖励的分布,从而提高dLLMs的推理能力。我们确定了在实现过程中小训练批次的一个关键挑战,并通过一种新的权重基线减法技术提出了几种有效的解决方案。DMPO在多个推理基准测试上表现出卓越的性能,无需监督微调,其准确率较之前的最佳基线提高了高达42.9%,较基础模型提高了55.8%,突显了分布匹配框架的有效性。我们的代码可在https://github.com/yuchen-zhu-zyc/DMPO找到。
论文及项目相关链接
Summary
扩散大语言模型(dLLMs)作为对自回归大语言模型(AR-LLMs)的潜在替代方案展现出良好前景,它们可实现更高的推理吞吐量。强化学习(RL)对于dLLMs在重要任务(如推理)上实现与AR-LLMs相当的性能至关重要。然而,适合dLLMs独特特性的RL算法仍有待开发。本文提出一种分布匹配策略优化(DMPO)方法,这是一种有原则且理论上有依据的RL微调方法,旨在通过交叉熵优化匹配dLLM策略分布到最佳、倾斜奖励的分布,从而增强dLLMs的推理能力。本研究解决了实现过程中的一个关键挑战,即小训练批次的问题,并提出了一种有效的权重基线减法技术。DMPO在多个推理基准测试上的表现优于之前的最先进基线,对基础模型的准确度提高了高达42.9%,突显了分布匹配框架的有效性。我们的代码可通过https://github.com/yuchen-zhu-zyc/DMPO获取。
Key Takeaways
- 扩散大语言模型(dLLMs)具有作为自回归大语言模型(AR-LLMs)替代方案的前景,因为它们能提高推理吞吐量。
- 强化学习(RL)对于提升dLLMs在重要任务上的性能至关重要。
- 当前尚需开发适合dLLMs独特特性的RL算法。
- 本文提出了分布匹配策略优化(DMPO)方法,旨在通过匹配策略分布增强dLLMs的推理能力。
- DMPO解决了小训练批次的问题,并提出有效的权重基线减法技术。
- DMPO在多个推理基准测试上的表现超越先前最先进基线,对基础模型的准确度有显著提高。
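对"通过交叉熵把策略分布匹配到奖励倾斜的目标分布,并用权重基线扣除缓解小批量下的方差"这一描述,下面给出一个非常粗略的概念示意:在一组采样内用 softmax(r/β) 作为目标权重,减去基线后对(可微的)序列对数似然加权。具体公式与论文定义可能不同,仅供理解框架。

```python
import torch
import torch.nn.functional as F

def distribution_matching_loss(logps: torch.Tensor, rewards: torch.Tensor,
                               beta: float = 1.0) -> torch.Tensor:
    """示意:把策略拉向奖励倾斜的目标分布(组内归一化)。

    logps:   策略对组内每个样本的对数似然(可微),形状 [G]
    rewards: 对应的标量奖励,形状 [G]
    """
    with torch.no_grad():
        target_w = F.softmax(rewards / beta, dim=0)   # 奖励倾斜的目标权重
        baseline = target_w.mean()                    # 简单的权重基线
        weights = target_w - baseline                 # 基线扣除,降低小批量下的波动
    # 交叉熵形式的目标:提升高权重样本的似然、压低低权重样本的似然
    return -(weights * logps).sum()

torch.manual_seed(0)
logps = torch.randn(6, requires_grad=True)
loss = distribution_matching_loss(logps, torch.tensor([2.0, 0.5, 0.1, 1.2, -0.3, 0.0]))
loss.backward()
print(loss.item(), logps.grad)
```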
点此查看论文截图


Training-Free Group Relative Policy Optimization
Authors:Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on a minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.
近期大型语言模型(LLM)代理人的进展展示出了它们具有前景的通用能力。然而,它们在专业现实世界领域的表现往往会因为有效地整合外部工具和特定提示策略的挑战而降低。虽然提出了诸如基于代理的强化学习等方法来解决这个问题,但它们通常依赖于昂贵的参数更新,例如通过采用监督微调(SFT)后的强化学习(RL)阶段进行分组相对策略优化(GRPO)来改变输出分布。然而,我们认为LLM可以通过学习体验知识作为令牌先验来达到类似的输出分布效果,这是一种更轻便的方法,不仅解决了实际数据稀缺的问题,而且避免了过拟合的常见难题。为此,我们提出了无需训练的分组相对策略优化(Training-Free GRPO),这是一种经济实惠的解决方案,可以在无需任何参数更新的情况下提高LLM代理人的性能。我们的方法利用组内的相对语义优势,而不是数值优势,在多轮学习的每个时期中迭代提炼高质量的经验知识,在最少真实数据的基础上。这种知识作为学习的令牌先验,可以无缝地集成到LLM API调用中,以指导模型的行为。在数学推理和网页搜索任务的实验表明,将Training-Free GRPO应用于DeepSeek-V3.1-Terminus时,其跨域性能得到了显著改善。只需几十个训练样本,Training-Free GRPO就能以微小的训练数据和成本优势超越精细调整的小型LLM。
论文及项目相关链接
摘要
近期大型语言模型(LLM)在通用能力上展现出巨大的潜力,但在特定领域的现实世界中常面临性能下降的挑战,这主要是由于难以有效地集成外部工具和特定的提示策略。虽然已有方法如强化学习来解决这一问题,但它们通常依赖于昂贵的参数更新过程,例如通过监督微调(SFT)后采用强化学习(RL)阶段进行组相对策略优化(GRPO)。然而,我们主张LLM能够通过学习经验知识作为令牌先验来实现类似的输出分布效果,这是一种更轻量级的方法,不仅解决了实际数据稀缺的问题,而且避免了过度拟合的常见问题。为此,我们提出了训练免费组相对策略优化(Training-Free GRPO),这是一种增强LLM代理性能且无需任何参数更新的成本效益解决方案。我们的方法利用组内的相对语义优势而不是数值优势,在多轮学习的每个时代中迭代提炼高质量的经验知识,基于少量真实数据进行。这种知识作为学到的令牌先验无缝集成到LLM API调用中,以指导模型行为。在数学推理和网页搜索任务上的实验表明,与DeepSeek-V3.1-Terminus相结合时,训练免费的GRPO能显著提高跨域性能,仅使用几十个训练样本就能超越微调的小型LLM模型。
关键见解
- 大型语言模型(LLM)在特定领域的现实世界中面临性能挑战。
- 当前方法如强化学习虽然能解决这一问题,但需要昂贵的参数更新过程。
- LLM可以通过学习经验知识作为令牌先验来实现输出分布调整,这是一种更轻量级的方法。
- 提出了训练免费组相对策略优化(Training-Free GRPO),能增强LLM性能且无需任何参数更新。
- Training-Free GRPO利用组内的相对语义优势,在多轮学习中提炼经验知识。
- Training-Free GRPO集成了学到的令牌先验,可以无缝集成到LLM API调用中指导模型行为。
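"不更新参数,而是把蒸馏出的经验知识作为提示中的令牌先验"这一思路,可以示意为:维护一个经验库,在每次调用 LLM API 前把与当前任务相关的几条经验拼进提示。其中的经验条目、检索方式与 call_llm 接口均为假设。

```python
from typing import Callable, List

EXPERIENCE_BANK: List[str] = [
    "先列出已知量和目标量,再选择公式。",            # 假设为多轮学习中蒸馏出的经验条目
    "遇到组合计数问题时,优先检查是否需要去重。",
    "搜索类任务先确认时间范围,再缩小关键词。",
]

def build_prompt(task: str, experiences: List[str], top_n: int = 2) -> str:
    """把经验知识作为提示中的先验拼接进去(示意:用朴素的字符重叠做相关性排序)。"""
    scored = sorted(experiences, key=lambda e: -len(set(e) & set(task)))
    notes = "\n".join(f"- {e}" for e in scored[:top_n])
    return f"以下是此前总结的经验,可供参考:\n{notes}\n\n任务:{task}"

def solve(task: str, call_llm: Callable[[str], str]) -> str:
    # call_llm 代表任意冻结的 LLM API 调用,此处不做参数更新
    return call_llm(build_prompt(task, EXPERIENCE_BANK))

# 用一个回显函数代替真实的 LLM API 调用进行演示
print(solve("一个组合计数问题:从5人中选2人", lambda p: p[:80] + "..."))
```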
点此查看论文截图




Beyond Textual CoT: Interleaved Text-Image Chains with Deep Confidence Reasoning for Image Editing
Authors:Zhentao Zou, Zhengrong Yue, Kunpeng Du, Binlei Bao, Hanting Li, Haizhen Xie, Guozheng Xu, Yue Zhou, Yali Wang, Jie Hu, Xue Jiang, Xinghao Chen
Image editing with natural language has gained significant popularity, yet existing methods struggle with intricate object intersections and fine-grained spatial relationships due to the lack of an explicit reasoning process. While Chain-of-Thought (CoT) has been explored to enhance reasoning, purely textual CoT or CoT augmented with coordinate information is fundamentally limited in its ability to represent intricate visual layouts and lacks the necessary visual cues to guide the generation of fine-grained, pixel-level details. To address these challenges, we propose Multimodal Reasoning Edit (MURE), a novel framework that shifts the visual editing process from purely text-based reasoning to a series of interleaved textual and visual rationales. Our framework performs image editing using a natively multimodal, interleaved text-image CoT. This approach generates a step-by-step chain of reasoning where a textual description is followed by a corresponding visual cue, such as a positional mask that defined intended edited regions or a representation of new content. Furthermore, to mitigate the hallucination phenomenon of large language models, we introduce Multimodal Deep Confidence (MMDC) reasoning paradigm. This paradigm explores a tree of visual reasoning paths at each step. By pruning low-quality branches using a deep confidence score from a reward model, it ensures the model consistently follows a high-quality trajectory towards the final edited result. The proposed method decomposes complex editing tasks into interdependent sub-tasks, achieving greater precision at each stage and yielding high-fidelity edited results. We define the formulation for interleaved text-image chains and release the first CoT-Edit-14K dataset, comprising 14K high-quality editing examples. Extensive experiments show that our method yields significant improvements across three image editing benchmarks.
图像编辑与自然语言已经获得了广泛的关注,但现有方法在处理复杂的对象交叉和精细的空间关系时面临困难,因为它们缺乏明确的推理过程。虽然链式思维(CoT)已被探索用于增强推理能力,但纯粹的文本CoT或与坐标信息相结合的方法在表示复杂的视觉布局方面存在根本局限性,并且缺乏必要的视觉线索来指导精细的像素级细节生成。为了应对这些挑战,我们提出了多模态推理编辑(MURE)这一新型框架,它将视觉编辑过程从纯粹的文本推理转变为一系列交织的文本和视觉推理。我们的框架使用天生的多模态、交织的文本图像CoT进行图像编辑。这种方法生成了一个逐步推理的链,其中文本描述之后是相应的视觉线索,如定义意图编辑区域的定位掩膜或新内容的表示。此外,为了减轻大型语言模型的幻觉现象,我们引入了多模态深度置信(MMDC)推理范式。该范式在每一步探索多个视觉推理路径的树结构。通过使用奖励模型的深度置信分数来修剪低质量的分支,它确保模型始终遵循高质量轨迹以达到最终的编辑结果。所提出的方法将复杂的编辑任务分解为相互依存的子任务,在每个阶段实现更高的精度,并产生高保真度的编辑结果。我们定义了交织文本图像链的公式,并发布了首个CoT-Edit-14K数据集,包含14K高质量编辑示例。大量实验表明,我们的方法在三个图像编辑基准测试中取得了显著改进。
论文及项目相关链接
PDF 25pages,20figures
Summary
本文提出一种名为Multimodal Reasoning Edit(MURE)的新框架,用于图像编辑。该框架采用文本与视觉交替的推理过程,实现了精细的像素级编辑。为减轻大型语言模型的幻觉现象,引入Multimodal Deep Confidence(MMDC)推理范式。此方法将复杂的编辑任务分解为相互依赖的子任务,并在每个阶段实现高精度,产生高保真度的编辑结果。同时,发布CoT-Edit-14K数据集用于实验验证。
Key Takeaways
- Multimodal Reasoning Edit (MURE)框架采用文本与视觉交替的推理过程,处理图像编辑中的复杂空间关系。
- MURE通过结合文本描述和相应的视觉线索(如位置掩码和新内容表示),实现了精细的像素级编辑。
- Multimodal Deep Confidence (MMDC)推理范式用于减轻大型语言模型的幻觉现象,确保模型遵循高质量轨迹进行编辑。
- MURE将复杂的编辑任务分解为子任务,每个阶段实现高精度,最终产生高保真度的编辑结果。
- 首次发布CoT-Edit-14K数据集,包含14K高质量编辑示例,用于实验验证和模型训练。
点此查看论文截图




Detecting and Mitigating Insertion Hallucination in Video-to-Audio Generation
Authors:Liyang Chen, Hongkai Chen, Yujun Cai, Sifan Li, Qingwen Ye, Yiwei Wang
Video-to-Audio generation has made remarkable strides in automatically synthesizing sound for video. However, existing evaluation metrics, which focus on semantic and temporal alignment, overlook a critical failure mode: models often generate acoustic events, particularly speech and music, that have no corresponding visual source. We term this phenomenon Insertion Hallucination and identify it as a systemic risk driven by dataset biases, such as the prevalence of off-screen sounds, that remains completely undetected by current metrics. To address this challenge, we first develop a systematic evaluation framework that employs a majority-voting ensemble of multiple audio event detectors. We also introduce two novel metrics to quantify the prevalence and severity of this issue: IH@vid (the fraction of videos with hallucinations) and IH@dur (the fraction of hallucinated duration). Building on this, we propose Posterior Feature Correction, a novel training-free inference-time method that mitigates IH. PFC operates in a two-pass process: it first generates an initial audio output to detect hallucinated segments, and then regenerates the audio after masking the corresponding video features at those timestamps. Experiments on several mainstream V2A benchmarks first reveal that state-of-the-art models suffer from severe IH. In contrast, our PFC method reduces both the prevalence and duration of hallucinations by over 50% on average, without degrading, and in some cases even improving, conventional metrics for audio quality and temporal synchronization. Our work is the first to formally define, systematically measure, and effectively mitigate Insertion Hallucination, paving the way for more reliable and faithful V2A models.
视频到音频生成在自动为视频合成声音方面取得了显著的进步。然而,现有的评估指标主要集中在语义和时间对齐上,忽略了一种关键的失败模式:模型经常生成没有相应视觉来源的声学事件,特别是语音和音乐。我们将这种现象称为“插入幻觉”,并将其识别为由数据集偏差驱动的系统性风险,例如屏幕外声音普遍存在,而当前指标完全无法检测到此风险。为了应对这一挑战,我们首先开发了一个系统评估框架,该框架采用多数投票制的多个音频事件检测器集合。我们还引入了两种新指标来量化此问题的流行程度和严重性:IH@vid(有幻觉的视频比例)和IH@dur(幻觉持续时间比例)。在此基础上,我们提出了“后特征校正”(PFC),这是一种无需训练的新型推断时间方法,可以缓解IH。PFC采用两阶段过程:首先生成初始音频输出以检测幻觉片段,然后在这些时间戳上屏蔽相应视频特征后重新生成音频。在几个主流V2A基准测试上的实验首先表明,最先进的模型存在严重的IH。相比之下,我们的PFC方法平均减少了超过50%的幻觉发生率和持续时间,同时不降低、甚至在某些情况下改进了音频质量和时间同步的常规指标。我们的工作是首次正式定义、系统测量和有效缓解插入幻觉,为更可靠和忠诚的V2A模型铺平了道路。
论文及项目相关链接
摘要
视频转音频生成技术在自动合成声音方面已取得了显著的进展。然而,现有的评估指标主要集中在语义和时序对齐上,忽略了一种关键的失败模式:模型经常生成没有相应视觉来源的声源,特别是语音和音乐。这种现象被称为插入幻觉,并被识别为系统风险,由数据集偏差(如屏幕外声音普遍存在)驱动,而当前指标无法检测到它。为解决此问题,本文首先开发了一个采用多数投票制的音频事件检测器集合的系统评估框架。我们还引入了两个新指标来量化此问题的普遍性和严重性:IH@vid(带有幻觉的视频分数)和IH@dur(幻觉持续时间分数)。在此基础上,我们提出了后特征校正,这是一种无需训练的后推理时间方法,可以缓解IH。PFC采用两阶段过程:首先生成初始音频输出以检测幻觉片段,然后在这些时间戳上屏蔽相应视频特征后重新生成音频。在主流V2A基准测试上的实验表明,最先进模型存在严重的IH问题。相比之下,我们的PFC方法平均减少了超过50%的幻觉存在和持续时间,同时不降低甚至改进了音频质量和时序同步的常规指标。我们的工作是首次正式定义、系统测量和有效缓解插入幻觉,为更可靠和忠诚的V2A模型铺平了道路。
要点
- 视频转音频生成技术存在插入幻觉问题,即模型会生成没有相应视觉来源的声源。
- 插入幻觉是系统风险,由数据集偏差(如屏幕外声音普遍存在)驱动,当前评估指标无法检测。
- 提出了一种新的系统评估框架和音频事件检测器集合来检测插入幻觉。
- 引入了两个新指标IH@vid和IH@dur来量化插入幻觉的普遍性和严重性。
- 提出了后特征校正方法,通过两阶段过程检测和缓解插入幻觉。
- 实验显示,后特征校正方法能有效减少插入幻觉的存在和持续时间,同时不损害音频质量和时序同步的常规指标。
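摘要给出的两个指标——IH@vid(出现幻觉的视频比例)与 IH@dur(幻觉时长占比)——可以按字面定义直接计算。下面的实现假设已经由多检测器投票得到了每个视频中"无对应视觉来源"的音频事件片段,时长的聚合方式为示意假设。

```python
from typing import List, Tuple

Segment = Tuple[float, float]   # 幻觉片段的 (起始秒, 结束秒)

def ih_metrics(hallucinated: List[List[Segment]], durations: List[float]):
    """按摘要定义计算 IH@vid 与 IH@dur(示意实现)。

    hallucinated: 每个视频中被判定为"无对应视觉来源"的音频事件片段
    durations:    每个视频的总时长(秒)
    """
    assert len(hallucinated) == len(durations)
    ih_vid = sum(1 for segs in hallucinated if segs) / len(durations)
    total_hallu = sum(sum(e - s for s, e in segs) for segs in hallucinated)
    ih_dur = total_hallu / sum(durations)
    return ih_vid, ih_dur

segs = [[(1.0, 3.5), (7.0, 8.0)], [], [(0.0, 2.0)]]     # 3 个视频的检测结果(玩具数据)
print(ih_metrics(segs, durations=[10.0, 8.0, 10.0]))    # -> (0.666..., 0.196...)
```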
点此查看论文截图




TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance
Authors:Jianhui Yang, Yiming Jin, Pengkun Jiao, Chenhe Dong, Zerui Huang, Shaowei Yao, Xiaojiang Zhou, Dan Ou, Haihong Tang
Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.
查询-商品相关性预测是电子商务搜索的核心基础,在AI驱动购物的时代变得更加关键:语义理解和复杂推理直接影响用户体验和业务转化。大型语言模型(LLM)使基于生成与推理的方法成为可能,通常通过监督微调(SFT)或直接偏好优化(DPO)等偏好优化方法进行对齐。然而,日益复杂的业务规则和用户查询暴露出现有方法的不足:它们无法让模型在长尾和高难度案例上具备稳健的推理能力。通过组相对策略优化(GRPO)等强化学习策略来解决这一问题的尝试,又常常受困于稀疏的终端奖励,难以为多步推理提供足够的指导,并导致收敛缓慢。
为应对这些挑战,我们提出了TaoSR-AGRL,一种用于淘宝搜索相关性中基于LLM的相关性预测的自适应引导强化学习框架。TaoSR-AGRL引入了两个关键创新:(1)规则感知奖励塑造,将最终的相关性判断分解为与特定领域相关性标准对齐的密集、结构化奖励;(2)自适应引导回放,识别训练过程中准确率较低的rollout,并注入有针对性的真实标签指导,使策略摆脱停滞且违反规则的推理模式,转向合规轨迹。TaoSR-AGRL在大规模真实数据集上进行了评估,并通过淘宝搜索的在线并排人工评估进行了验证。在离线实验中,它始终优于DPO和标准GRPO基线,提高了相关性准确率、规则遵循度和训练稳定性。使用TaoSR-AGRL训练的模型已成功部署在淘宝主搜索场景中,为数亿用户提供服务。
论文及项目相关链接
Summary
本文介绍了在AI驱动的购物时代,查询产品相关性预测在电子商务搜索中的重要性。文章指出大型语言模型(LLMs)能够通过生成和推理的方法来进行预测,但现有方法在处理复杂商业规则和长尾案例时存在局限性。为此,提出了一种自适应引导强化学习框架TaoSR-AGRL,用于淘宝搜索相关性的LLM基于预测。该框架包括两个关键创新点:规则感知奖励塑造和自适应引导回放,旨在解决多步推理和规则遵守问题。TaoSR-AGRL在大型真实数据集和在线人类评估中表现优异,提高了相关性准确性、规则遵守性和训练稳定性,并已成功部署在淘宝主要搜索场景中,为数百万用户提供服务。
Key Takeaways
- 查询产品相关性预测在AI驱动的购物时代对电子商务搜索至关重要。
- 大型语言模型(LLMs)在预测中起到关键作用,但处理复杂商业规则和长尾案例时存在局限性。
- 现有方法如Direct Preference Optimization (DPO)和Group Relative Policy Optimization (GRPO)面临挑战,如稀疏终端奖励和多步推理问题。
- TaoSR-AGRL框架通过规则感知奖励塑造和自适应引导回放两个关键创新点来解决这些问题。
- TaoSR-AGRL在大型真实数据集和在线评估中表现优异,提高相关性准确性、规则遵守性和训练稳定性。
- TaoSR-AGRL已成功部署在淘宝搜索场景中,为数百万用户提供服务。
点此查看论文截图



Do We Really Need SFT? Prompt-as-Policy over Knowledge Graphs for Cold-start Next POI Recommendation
Authors:Jinze Wang, Lu Zhang, Yiyang Cui, Zhishu Shen, Xingjun Ma, Jiong Jin, Tiehua Zhang
Next point-of-interest (POI) recommendation is crucial for smart urban services such as tourism, dining, and transportation, yet most approaches struggle under cold-start conditions where user-POI interactions are sparse. Recent efforts leveraging large language models (LLMs) address this challenge through either supervised fine-tuning (SFT) or in-context learning (ICL). However, SFT demands costly annotations and fails to generalize to inactive users, while static prompts in ICL cannot adapt to diverse user contexts. To overcome these limitations, we propose Prompt-as-Policy over knowledge graphs, a reinforcement-guided prompting framework that learns to construct prompts dynamically through contextual bandit optimization. Our method treats prompt construction as a learnable policy that adaptively determines (i) which relational evidences to include, (ii) the number of evidence per candidate, and (iii) their organization and ordering within prompts. More specifically, we construct a knowledge graph (KG) to discover candidates and mine relational paths, which are transformed into evidence cards that summarize rationales for each candidate POI. The frozen LLM then acts as a reasoning engine, generating recommendations from the KG-discovered candidate set based on the policy-optimized prompts. Experiments on three real-world datasets demonstrate that Prompt-as-Policy consistently outperforms state-of-the-art baselines, achieving average 7.7% relative improvements in Acc@1 for inactive users, while maintaining competitive performance on active users, without requiring model fine-tuning.
下一个兴趣点(POI)推荐对于智慧城市服务(如旅游、餐饮和交通)至关重要。然而,大多数方法在用户与POI交互稀疏的冷启动条件下表现困难。最近利用大型语言模型(LLM)的努力通过监督微调(SFT)或上下文学习(ICL)来解决这一挑战。然而,SFT需要昂贵的注释,并且无法推广到非活跃用户,而ICL中的静态提示无法适应多样化的用户上下文。为了克服这些限制,我们提出了基于知识图的Prompt-as-Policy策略,这是一种强化指导的提示框架,通过上下文强盗优化来学习动态构建提示。我们的方法将提示构建视为可学习的策略,自适应地确定(i)要包含哪些关系证据,(ii)每个候选对象的关系证据数量,(iii)它们在提示中的组织和排序。更具体地说,我们构建了知识图(KG)来发现候选对象并挖掘关系路径,这些路径被转化为证据卡片,总结了每个候选POI的合理性。然后,冻结的LLM充当推理引擎,基于经过政策优化的提示,从KG发现的候选集中生成推荐。在三个真实数据集上的实验表明,Prompt-as-Policy始终优于最新的基线方法,在不活跃用户的准确性方面平均提高了7.7%,同时在活跃用户上保持竞争性能,而无需进行模型微调。
论文及项目相关链接
Summary:
在智能城市服务(如旅游、餐饮和交通)中,下一个兴趣点(POI)的推荐至关重要。在面临冷启动条件时,大多数方法会面临困难,其中用户与POI的互动较少。近期的研究利用大型语言模型(LLM)通过监督微调(SFT)或上下文学习(ICL)来解决这一挑战。然而,SFT需要大量昂贵的注释并且无法推广到非活跃用户,而ICL中的静态提示无法适应多样化的用户上下文。为了克服这些限制,我们提出了基于知识图谱的“提示作为策略”,这是一种强化引导提示框架,它通过上下文强盗优化动态构建提示。我们的方法将提示构建视为可学习的策略,自适应地确定(i)要包含哪些关系证据,(ii)每个候选人的证据数量,(iii)提示内证据的组织和排序。特别是在知识图谱(KG)中,我们构建发现候选人和挖掘关系路径,将其转化为证据卡片,总结每个候选POI的合理性。冻结的LLM则作为推理引擎,根据优化后的策略提示从知识图谱发现的候选集中生成推荐。在三个真实世界数据集上的实验表明,“提示作为策略”始终优于最新技术基线,在不活跃用户上平均提高了7.7%的准确率,同时在活跃用户上保持竞争力性能,且无需模型微调。
Key Takeaways:
- 在智能城市服务中,下一个兴趣点推荐对于旅游、餐饮和交通等服务至关重要。
- 在冷启动条件下,大多数推荐方法面临挑战,需要解决用户与POI之间互动不足的问题。
- 大型语言模型(LLM)被用于解决这一挑战,但现有方法存在限制,如监督微调的高成本和非活跃用户的推广问题。
- 提出了一种新的方法“提示作为策略”,结合知识图谱和强化学习来解决上述问题。
- 该方法动态构建提示,并自适应地确定应包含哪些关系证据、证据数量和组织方式。
- 实验表明,“提示作为策略”在真实世界数据集上表现优异,特别是在不活跃用户上的准确率有显著提高。
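"把提示构建视为可学习策略、用上下文老虎机在线优化"的思路,可以用一个极简的 ε-greedy 老虎机示意:每个"臂"对应一种提示配置(证据条数、排序方式等),依据在线反馈更新各臂的估计收益。这里省略了上下文特征,配置空间与奖励均为演示假设。

```python
import random
from collections import defaultdict

ARMS = [  # 每个臂是一种提示构建配置:证据条数 / 排序方式
    {"evidence_per_candidate": 1, "ordering": "by_path_length"},
    {"evidence_per_candidate": 3, "ordering": "by_path_length"},
    {"evidence_per_candidate": 3, "ordering": "by_recency"},
]

class EpsilonGreedyPromptPolicy:
    """示意:用 ε-greedy 老虎机在提示配置之间选择并在线更新。"""

    def __init__(self, eps=0.1):
        self.eps, self.counts, self.values = eps, defaultdict(int), defaultdict(float)

    def select(self) -> int:
        if random.random() < self.eps or not self.counts:
            return random.randrange(len(ARMS))
        return max(self.values, key=self.values.get)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # 增量式均值更新
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

random.seed(0)
policy = EpsilonGreedyPromptPolicy()
true_ctr = [0.2, 0.5, 0.35]                      # 假设的各配置真实命中率
for _ in range(2000):
    arm = policy.select()
    policy.update(arm, 1.0 if random.random() < true_ctr[arm] else 0.0)
print("learned:", {a: round(v, 2) for a, v in policy.values.items()})
```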
点此查看论文截图




Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning
Authors:Jialu Du, Guiyang Hou, Yihui Fu, Chen Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu
While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through detailed analysis of DeepSeek-R1’s reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like “tricky” and “confused” when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents’ subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.
虽然大型语言模型(LLM)在数学和代码推理方面表现出色,但我们观察到它们在社交推理任务上遇到了困难,表现出认知混淆、逻辑不一致以及客观世界状态和主观信念状态之间的混淆。通过对DeepSeek-R1推理轨迹的详细分析,我们发现LLM经常遇到推理障碍,在处理多参与者和时间线的场景时,倾向于输出“棘手”和“困惑”等矛盾术语,从而导致错误推理或无限循环。核心问题是它们无法将客观现实与代理人的主观信念区分开来。为了解决这一问题,我们提出了一种自适应世界模型增强的推理机制,该机制构建了一个动态的文本世界模型,用于跟踪实体状态和时间序列。它动态监控推理轨迹中的混淆指标,并提供清晰的世界状态描述来及时干预,帮助模型解决认知困境。该机制模仿了人类如何使用隐式世界模型来区分外部事件和内部信念。在三个社交基准测试上的评估表明,该机制在准确性方面实现了显著改进(例如Hi-ToM中的+10%),同时降低了计算成本(最多减少33.8%的令牌),为在社交环境中部署LLM提供了简单而有效的解决方案。
论文及项目相关链接
PDF 15 pages, 10 figures
Summary
大型语言模型(LLMs)在数学和代码推理方面表现出色,但在社会推理任务上遇到困难,存在认知混淆、逻辑不一致等问题。通过分析DeepSeek-R1的推理轨迹,发现LLMs易遭遇推理困境,在处理多参与者和时间线的场景时,会输出“棘手”和“困惑”等矛盾术语,导致错误推理或无限循环。核心问题在于无法区分客观现实和代理的主观信念。为此,提出了一种自适应世界模型增强的推理机制,通过构建动态文本世界模型来跟踪实体状态和时序序列,及时监测推理轨迹中的困惑指标并提供明确的世界状态描述,帮助模型解决认知困境。该机制模仿人类使用隐式世界模型来区分外部事件和内部信念。在三个社会基准测试上的评估显示,该机制在准确性方面显著提高(例如Hi-ToM提高10%),同时降低计算成本(最多减少33.8%的令牌数),为在社会环境中部署LLMs提供了简单有效的解决方案。
Key Takeaways
- 大型语言模型(LLMs)在社会推理任务上表现欠佳,存在认知混淆和逻辑不一致问题。
- LLMs在处理多参与者和时间线的场景时易遭遇推理困境,输出矛盾术语。
- 核心问题在于LLMs无法区分客观现实和代理的主观信念。
- 提出了一种自适应世界模型增强的推理机制,通过构建动态文本世界模型来解决LLMs的推理问题。
- 该机制能实时监测并干预困惑的推理轨迹。
- 机制模仿人类使用隐式世界模型来区分外部事件和内部信念。
点此查看论文截图






LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?
Authors:Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang
Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter’s unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert’s advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
大规模语言模型(LLM)在推理方面取得了显著的进步,这通常是通过有监督的微调(SFT)实现的。然而,SFT资源密集,依赖于大量精选数据集、拒绝采样演示和所有标记的统一优化,尽管只有一小部分携带了有意义的学习价值。在这项工作中,我们探索了一个反直觉的想法:是否可以通过揭示反映大型语言模型独特优势的高价值推理时刻,让小型语言模型(SLM)教授大型语言模型(LLM)?我们提出了LightReasoner,这是一个利用较强的专家模型(LLM)和较弱的业余模型(SLM)之间行为差异的新型框架。LightReasoner分为两个阶段:1)采样阶段,它定位关键的推理时刻,并通过专家与业余模型的对比来构建捕捉专家优势的监督示例;2)微调阶段,使专家模型与这些蒸馏示例对齐,增强其推理能力。在七个数学基准测试中,LightReasoner将准确率最高提升28.1%,同时减少90%的时间消耗、80%的采样问题和99%的微调标记使用量,而且这一切都不依赖于真实标签。通过将较弱的小型语言模型转化为有效的教学信号,LightReasoner提供了一种可扩展且资源高效的方法来提升大型语言模型的推理能力。代码可在:https://github.com/HKUDS/LightReasoner上找到。
论文及项目相关链接
Summary:
大型语言模型(LLM)在推理方面取得了显著进展,通常通过监督微调(SFT)实现。然而,SFT资源密集,依赖大型数据集、拒绝采样演示和统一优化所有标记符。本研究提出了一种相反的想法:是否可以由小型语言模型(SLM)来教导大型语言模型(LLM),揭示反映其独特优势的推理时刻?为此,我们提出了LightReasoner框架,它利用强专家模型(LLM)和弱业余模型(SLM)之间的行为差异。LightReasoner分为两个阶段:一是采样阶段,确定关键推理时刻并通过专家与业余对比构建捕捉专家优势的监督示例;二是微调阶段,用这些提炼的示例对齐专家模型,增强其推理能力。LightReasoner在七个数学基准测试中提高了准确性,最高达28.1%,同时减少了时间消耗、采样问题和微调标记的使用量,且无需依赖真实标签。通过将较弱的小型语言模型转化为有效的教学信号,LightReasoner为推进大型语言模型的推理能力提供了可伸缩和资源高效的方法。
Key Takeaways:
- 大型语言模型(LLM)在推理方面取得显著进展,但仍需资源密集型的监督微调(SFT)。
- 小型语言模型(SLM)可能通过揭示高价值推理时刻来教导LLM,反映其独特优势。
- LightReasoner是一个新型框架,利用专家模型(LLM)和业余模型(SLM)之间的行为差异。
- LightReasoner包括两个主要阶段:采样阶段和微调阶段。
- 采样阶段确定关键推理时刻并构建捕捉专家优势的监督示例。
- 微调阶段使用提炼的示例对齐专家模型,增强其推理能力。
- LightReasoner在多个基准测试中表现出显著提高的准确性和资源效率。
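"用专家-业余模型的行为差异定位高价值推理时刻"可以示意为:比较两个模型在每一步的下一个 token 分布,取 KL 散度最大的若干位置作为构建监督样本的采样点。下面用随机 logits 代替真实模型输出,仅演示计算流程;差异度量的具体选择为假设。

```python
import torch
import torch.nn.functional as F

def critical_steps(expert_logits: torch.Tensor, amateur_logits: torch.Tensor, top_n: int = 3):
    """返回专家 / 业余模型分歧最大的步骤下标(示意实现)。

    两个输入形状均为 [seq_len, vocab],表示每一步的下一个 token 的 logits。
    """
    p = F.log_softmax(expert_logits, dim=-1)
    q = F.log_softmax(amateur_logits, dim=-1)
    # 每一步的 KL(expert || amateur),衡量专家相对业余模型的"优势时刻"
    kl = (p.exp() * (p - q)).sum(dim=-1)
    return kl.topk(top_n).indices.tolist(), kl

torch.manual_seed(0)
seq_len, vocab = 12, 50
expert = torch.randn(seq_len, vocab)
amateur = expert + 0.1 * torch.randn(seq_len, vocab)
amateur[4] = torch.randn(vocab)          # 人为制造一个分歧很大的步骤
idx, kl = critical_steps(expert, amateur)
print("critical steps:", idx)            # 大概率包含步骤 4
```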
点此查看论文截图




