⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: do not rely on this for serious academic work; it is only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-11-08 Update
SIMS-V: Simulated Instruction-Tuning for Spatial Video Understanding
Authors:Ellis Brown, Arijit Ray, Ranjay Krishna, Ross Girshick, Rob Fergus, Saining Xie
Despite impressive high-level video comprehension, multimodal language models struggle with spatial reasoning across time and space. While current spatial training approaches rely on real-world video data, obtaining diverse footage with precise spatial annotations remains a bottleneck. To alleviate this bottleneck, we present SIMS-V – a systematic data-generation framework that leverages the privileged information of 3D simulators to create spatially-rich video training data for multimodal language models. Using this framework, we investigate which properties of simulated data drive effective real-world transfer through systematic ablations of question types, mixes, and scales. We identify a minimal set of three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) that prove most effective for developing transferable spatial intelligence, outperforming comprehensive coverage despite using fewer question types. These insights enable highly efficient training: our 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms the larger 72B baseline and achieves competitive performance with proprietary models on rigorous real-world spatial reasoning benchmarks. Our approach demonstrates robust generalization, maintaining performance on general video understanding while showing substantial improvements on embodied and real-world spatial tasks.
Paper and project links
PDF Project page: https://ellisbrown.github.io/sims-v
Summary
SIMS-V tackles the difficulty multimodal language models have with spatial reasoning across time and space by using 3D simulators to generate spatially rich video training data. Systematic ablations over question types, mixes, and scales identify a minimal set of three question categories that transfer best to the real world: metric measurement, perspective-dependent reasoning, and temporal tracking. These findings enable highly efficient training: fine-tuning a 7B video LLM on a small amount of simulated data yields strong performance on rigorous real-world spatial reasoning benchmarks, while general video understanding is preserved and embodied and real-world spatial tasks improve substantially. Overall, efficient simulation-based training offers a promising route to scaling up spatial reasoning for models operating in complex environments.
Key Takeaways
- Real-world video with precise spatial annotations is scarce; SIMS-V sidesteps this bottleneck by exploiting the privileged information of 3D simulators.
- Three question categories (metric measurement, perspective-dependent reasoning, and temporal tracking) prove most effective for transferable spatial intelligence, outperforming broader question coverage.
- A 7B-parameter video LLM fine-tuned on just 25K simulated examples outperforms a 72B baseline and is competitive with proprietary models on real-world spatial benchmarks.
- General video understanding is maintained while embodied and real-world spatial tasks improve substantially.
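The framework's core idea, turning privileged simulator state into spatial question-answer pairs, can be illustrated with a small sketch. The snippet below builds a metric-measurement question (one of the three effective categories) from ground-truth object positions; the object names, distance convention, and question wording are illustrative assumptions, not the paper's actual generation pipeline.

```python
import math
import random

def metric_distance_question(objects):
    """Build one metric-measurement QA pair from privileged simulator positions.
    `objects` maps object names to ground-truth (x, y, z) coordinates in meters."""
    a, b = random.sample(list(objects), 2)
    dist = math.dist(objects[a], objects[b])  # Euclidean distance from ground truth
    return {
        "question": f"Approximately how far apart are the {a} and the {b}, in meters?",
        "answer": f"{dist:.1f} meters",
    }

# Hypothetical scene metadata exported by the simulator.
qa = metric_distance_question({"sofa": (1.0, 0.0, 2.5), "lamp": (3.5, 0.0, 0.5)})
print(qa["question"], "->", qa["answer"])
```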
Real-to-Sim Robot Policy Evaluation with Gaussian Splatting Simulation of Soft-Body Interactions
Authors:Kaifeng Zhang, Shuo Sha, Hanxiao Jiang, Matthew Loper, Hyunjong Song, Guangyan Cai, Zhuo Xu, Xiaochen Hu, Changxi Zheng, Yunzhu Li
Robotic manipulation policies are advancing rapidly, but their direct evaluation in the real world remains costly, time-consuming, and difficult to reproduce, particularly for tasks involving deformable objects. Simulation provides a scalable and systematic alternative, yet existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions. We present a real-to-sim policy evaluation framework that constructs soft-body digital twins from real-world videos and renders robots, objects, and environments with photorealistic fidelity using 3D Gaussian Splatting. We validate our approach on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing, demonstrating that simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies. Our results suggest that combining physics-informed reconstruction with high-quality rendering enables reproducible, scalable, and accurate evaluation of robotic manipulation policies. Website: https://real2sim-eval.github.io/
Paper and project links
PDF Website: https://real2sim-eval.github.io/
Summary
Directly evaluating robot manipulation policies in the real world is costly, time-consuming, and hard to reproduce, especially for tasks involving deformable objects. Simulation offers a scalable, systematic alternative, but existing simulators rarely capture the coupled visual and physical complexity of soft-body interactions. The authors propose a real-to-sim policy evaluation framework that builds soft-body digital twins from real-world videos and renders robots, objects, and environments photorealistically with 3D Gaussian Splatting. Validated on representative deformable manipulation tasks such as plush toy packing, rope routing, and T-block pushing, simulated rollouts correlate strongly with real-world execution performance and reveal key behavioral patterns of learned policies, suggesting that physics-informed reconstruction combined with high-quality rendering enables reproducible, scalable, and accurate policy evaluation.
Key Takeaways
- Real-world evaluation of robot manipulation policies is challenging, especially for deformable-object tasks; simulation provides an alternative.
- Existing simulators often fail to capture the coupled visual and physical complexity of soft-body interactions.
- The proposed real-to-sim framework constructs soft-body digital twins from real-world videos of robots, objects, and environments.
- 3D Gaussian Splatting provides photorealistic rendering, improving the fidelity and credibility of the simulation.
- The approach is validated on representative deformable manipulation tasks, including plush toy packing, rope routing, and T-block pushing.
- Simulated rollouts correlate strongly with real-world execution performance and accurately reflect policy quality and behavioral patterns.
Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Authors:Mohammad Atif Quamar, Mohammad Areeb
Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30–35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
Paper and project links
PDF Presented at the 1st Workshop on Efficient Reasoning (NeurIPS 2025)
Summary:
Chain-of-Thought (CoT) prompting is a key technique for complex reasoning in large language models, but generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. LEASH is a training-free decoding algorithm that adaptively halts rationale generation by monitoring two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. Once both signals plateau, indicating the model has reached a stable reasoning state, generation stops. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30-35% and latency by 27%, at the cost of a 10 percentage-point accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
Key Takeaways:
- Chain-of-Thought (CoT) prompting is key to complex reasoning in large language models.
- Generating full, fixed-length rationales is computationally wasteful, inflating token usage and latency.
- LEASH is a training-free decoding algorithm that adaptively halts rationale generation.
- LEASH decides when to stop by monitoring two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin.
- Across multiple models and benchmarks, LEASH substantially reduces token generation and latency.
- Accuracy drops by about 10 percentage points relative to CoT, but LEASH remains a simple and efficient alternative.
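Because LEASH relies only on two intrinsic decoding signals, the idea is easy to sketch. Below is one possible training-free stopping check over token-level entropy and the top-logit margin, wrapped around a greedy decoding loop for a Hugging Face-style causal LM; the window size, thresholds, and function names are assumptions, not the paper's reference implementation.

```python
import numpy as np
import torch

def leash_should_stop(entropies, margins, window=8, eps_slope=1e-3, eps_margin=1e-3):
    """Stop once both intrinsic signals plateau over the last `window` steps.
    `entropies`: per-token entropy of the next-token distribution so far.
    `margins`:   per-token gap between the top logit and the runner-up."""
    if len(entropies) < window:
        return False
    x = np.arange(window)
    ent_slope = np.polyfit(x, entropies[-window:], 1)[0]   # slope of token-level entropy
    margin_gain = margins[-1] - margins[-window]           # improvement in top-logit margin
    return abs(ent_slope) < eps_slope and margin_gain < eps_margin

def generate_with_leash(model, tokenizer, prompt, max_new_tokens=512):
    """Greedy decoding that halts rationale generation when the LEASH-style check fires.
    Assumes a Hugging Face-style causal LM running on CPU for simplicity."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    entropies, margins = [], []
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        entropies.append(float(-(probs * probs.clamp_min(1e-12).log()).sum()))
        top2 = torch.topk(logits, 2).values
        margins.append(float(top2[0] - top2[1]))
        next_id = int(torch.argmax(logits))
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
        if next_id == tokenizer.eos_token_id or leash_should_stop(entropies, margins):
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```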
Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
Authors:Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu
“Thinking with Text” and “Thinking with Images” paradigm significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations. (1) Images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) The separation of text and vision as distinct modalities, hindering unified multimodal understanding and generation. To overcome these limitations, we introduce “Thinking with Video”, a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that the video generation model is the potential unified multimodal understanding and generation model, positions “thinking with video” as a unified multimodal reasoning paradigm.
Paper and project links
PDF 36 pages, 14 figures
Summary
"Thinking with Video" is a new paradigm that uses video generation models such as Sora-2 to bridge visual and textual reasoning within a unified temporal framework. To support this exploration, the authors develop the Video Thinking Benchmark (VideoThinkBench), which covers vision-centric and text-centric tasks. The evaluation establishes Sora-2 as a capable reasoner, in several tasks even surpassing state-of-the-art VLMs. Overall, video generation models show potential as unified multimodal understanding and generation models, positioning "thinking with video" as a unified multimodal reasoning paradigm.
Key Takeaways
- The "Thinking with Video" paradigm uses video generation models such as Sora-2 to fuse visual and textual reasoning, overcoming limitations of the earlier "Thinking with Text" and "Thinking with Images" paradigms.
- VideoThinkBench contains two task categories: vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., subsets of GSM8K and MMMU), providing a comprehensive evaluation setting.
- Sora-2 performs on par with, or better than, state-of-the-art VLMs on several tasks, particularly vision-centric ones.
- On text-centric tasks, Sora-2 reaches 92% accuracy on MATH and 75.53% on MMMU, indicating strong text processing ability.
- Self-consistency and in-context learning can further improve Sora-2's performance.
- Video generation models show potential as unified multimodal understanding and generation models, suggesting broad applications in multimodal content understanding and generation.
THEval. Evaluation Framework for Talking Head Video Generation
Authors:Nabyl Quignon, Baptiste Chopin, Yaohui Wang, Antitza Dantcheva
Video generation has achieved remarkable progress, with generated videos increasingly resembling real ones. However, the rapid advance in generation has outpaced the development of adequate evaluation metrics. Currently, the assessment of talking head generation primarily relies on limited metrics, evaluating general video quality, lip synchronization, and on conducting user studies. Motivated by this, we propose a new evaluation framework comprising 8 metrics related to three dimensions (i) quality, (ii) naturalness, and (iii) synchronization. In selecting the metrics, we place emphasis on efficiency, as well as alignment with human preferences. Based on this considerations, we streamline to analyze fine-grained dynamics of head, mouth, and eyebrows, as well as face quality. Our extensive experiments on 85,000 videos generated by 17 state-of-the-art models suggest that while many algorithms excel in lip synchronization, they face challenges with generating expressiveness and artifact-free details. These videos were generated based on a novel real dataset, that we have curated, in order to mitigate bias of training data. Our proposed benchmark framework is aimed at evaluating the improvement of generative methods. Original code, dataset and leaderboards will be publicly released and regularly updated with new methods, in order to reflect progress in the field.
Paper and project links
Summary
Video generation has progressed rapidly and generated videos increasingly resemble real ones, but evaluation metrics have not kept pace. Talking-head generation is currently assessed with a limited set of metrics covering general video quality, lip synchronization, and user studies. The authors therefore propose an evaluation framework of 8 metrics spanning three dimensions: quality, naturalness, and synchronization, selected for efficiency and alignment with human preferences and focused on fine-grained dynamics of the head, mouth, and eyebrows as well as face quality. Experiments on 85,000 videos generated by 17 state-of-the-art models show that while many algorithms excel at lip synchronization, they struggle to produce expressive, artifact-free details. The videos were generated from a newly curated real dataset to mitigate training-data bias, and the code, dataset, and leaderboards will be publicly released and regularly updated.
Key Takeaways
- Video generation has advanced markedly, and generated videos are increasingly realistic.
- Current evaluation of talking-head generation relies on limited metrics, mainly general video quality, lip synchronization, and user studies.
- The proposed framework comprises 8 metrics across three dimensions: quality, naturalness, and synchronization.
- Metric selection emphasizes efficiency and alignment with human preferences.
- Experiments show existing algorithms handle lip synchronization well but still struggle with expressiveness and artifact-free details.
- Videos are generated from a newly curated real dataset to mitigate training-data bias.
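As a rough illustration of how an 8-metric, 3-dimension framework can be rolled up into dimension scores, the snippet below averages per-video metric scores within each dimension. The metric names and equal weighting are assumptions made for illustration; the paper defines its own metrics for head, mouth, and eyebrow dynamics and face quality.

```python
def aggregate_theval_scores(scores):
    """Roll 8 per-video metric scores (assumed to lie in [0, 1]) up into the three
    dimensions used by the framework; metric names and equal weights are illustrative."""
    dimensions = {
        "quality": ["face_quality", "artifact_level"],
        "naturalness": ["head_dynamics", "mouth_dynamics", "eyebrow_dynamics"],
        "synchronization": ["lip_sync", "audio_offset", "expression_sync"],
    }
    return {dim: sum(scores[m] for m in metrics) / len(metrics)
            for dim, metrics in dimensions.items()}

example = aggregate_theval_scores({
    "face_quality": 0.82, "artifact_level": 0.64, "head_dynamics": 0.71,
    "mouth_dynamics": 0.77, "eyebrow_dynamics": 0.58, "lip_sync": 0.93,
    "audio_offset": 0.88, "expression_sync": 0.69,
})
print(example)
```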
V-Thinker: Interactive Thinking with Images
Authors:Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang
Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising “Thinking with Images” paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions-diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
Paper and project links
PDF Working in progress
Summary
Deeply integrating image interaction with long-horizon reasoning remains a challenge for Large Multimodal Models (LMMs). Recent vision-centric work explores a "Thinking with Images" paradigm, but progress is constrained by limited visual tool spaces and task-specific workflow designs. V-Thinker is a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. It has two key components: a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets along three dimensions (diversity, quality, and difficulty), and a Visual Progressive Training Curriculum that first aligns perception via point-level supervision and then integrates interactive reasoning through a two-stage reinforcement learning framework. The authors also introduce VTBench, an expert-verified benchmark for vision-centric interactive reasoning. Experiments show that V-Thinker consistently outperforms strong LMM baselines in both general and interactive reasoning scenarios.
Key Takeaways
- Large Multimodal Models (LMMs) still struggle to integrate image interaction with long-horizon reasoning.
- The "Thinking with Images" paradigm is a milestone that lets models focus on fine-grained image regions.
- Current methods remain constrained by limited visual tool spaces and task-specific workflow designs.
- V-Thinker is a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking via end-to-end reinforcement learning.
- V-Thinker comprises two key components: a Data Evolution Flywheel and a Visual Progressive Training Curriculum.
- VTBench is an expert-verified benchmark targeting vision-centric interactive reasoning tasks.
- Experiments show V-Thinker outperforms strong baselines in both general and interactive reasoning scenarios.
Monitor-Generate-Verify (MGV): Formalising Metacognitive Theory for Language Model Reasoning
Authors:Nick Oh, Fernand Gobet
Test-time reasoning architectures such as those following the Generate-Verify paradigm – where a model iteratively refines or verifies its own generated outputs – prioritise generation and verification but exclude the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, yielding roughly 20% accuracy loss. We address this architectural gap by formalising Flavell’s and Nelson and Narens’ metacognitive theories into computational specifications, proposing the Monitor-Generate-Verify (MGV) framework. MGV extends the Generate-Verify paradigm by adding explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Though we present no empirical validation, this work provides the first systematic computational translation of foundational metacognitive theories, offering a principled vocabulary for understanding reasoning system failures and suggesting specific architectural interventions for future test-time reasoning designs.
Paper and project links
PDF To-be presented at the Workshop on the Foundations of Reasoning in Language Models at NeurIPS 2025 (non-archival)
Summary: Test-time reasoning architectures that follow the Generate-Verify paradigm let a model iteratively refine or verify its own outputs, but they omit the monitoring processes that determine when and how reasoning should begin. This omission may contribute to the prefix dominance trap, in which models commit early to suboptimal reasoning paths and seldom recover, costing roughly 20% accuracy. Drawing on the metacognitive theories of Flavell and of Nelson and Narens, the authors propose the Monitor-Generate-Verify (MGV) framework, which extends Generate-Verify with explicit monitoring that captures metacognitive experiences (from difficulty assessments to confidence judgements) before generation begins and refines future monitoring through verification feedback. Although no empirical validation is presented, the work offers the first systematic computational translation of these foundational metacognitive theories, a principled vocabulary for understanding reasoning-system failures, and concrete architectural suggestions for future test-time reasoning designs.
Key Takeaways:
- Test-time reasoning architectures such as the Generate-Verify paradigm omit monitoring, which can lock models into suboptimal early reasoning paths.
- Ignoring monitoring contributes to roughly 20% accuracy loss (the prefix dominance trap).
- The MGV framework extends Generate-Verify with explicit monitoring that captures metacognitive experiences such as difficulty assessments and confidence judgements.
- In MGV, monitoring occurs before generation and is refined over time through verification feedback.
- The work formalises the metacognitive theories of Flavell and of Nelson and Narens into computational specifications.
- It offers a principled vocabulary for understanding reasoning-system failures, though no empirical validation is presented yet.
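Since MGV is an architectural proposal rather than an implemented system, its control flow can be summarized in a few lines. The sketch below shows a Monitor-Generate-Verify loop in which monitoring happens before generation and is refined by verification feedback; the three callables are placeholders, not components defined by the paper.

```python
def mgv_solve(task, monitor, generate, verify, max_rounds=3):
    """One way to wire an MGV-style loop (all three callables are placeholders):
    monitor(task, feedback) -> plan (difficulty assessment, confidence, strategy)
    generate(task, plan)    -> candidate answer
    verify(task, answer)    -> (ok, feedback)"""
    feedback, answer = None, None
    for _ in range(max_rounds):
        plan = monitor(task, feedback)       # metacognitive monitoring before generation
        answer = generate(task, plan)        # generation conditioned on the monitoring plan
        ok, feedback = verify(task, answer)  # verification feedback refines future monitoring
        if ok:
            break
    return answer
```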
MacroNav: Multi-Task Context Representation Learning Enables Efficient Navigation in Unknown Environments
Authors:Kuankuan Sima, Longbin Tang, Haozhe Ma, Lin Zhao
Autonomous navigation in unknown environments requires compact yet expressive spatial understanding under partial observability to support high-level decision making. Existing approaches struggle to balance rich contextual representation with navigation efficiency. We present MacroNav, a learning-based navigation framework featuring two key components: (1) a lightweight context encoder trained via multi-task self-supervised learning to capture multi-scale, navigation-centric spatial representations; and (2) a reinforcement learning policy that seamlessly integrates these representations with graph-based reasoning for efficient action selection. Extensive experiments demonstrate the context encoder’s efficient and robust environmental understanding. Real-world deployments further validate MacroNav’s effectiveness, yielding significant gains over state-of-the-art navigation methods in both Success Rate (SR) and Success weighted by Path Length (SPL), while maintaining low computational cost. Code will be released upon acceptance.
Paper and project links
Summary
The paper presents MacroNav, a learning-based navigation framework with two key components: a lightweight context encoder trained via multi-task self-supervised learning to capture multi-scale, navigation-centric spatial representations, and a reinforcement learning policy that seamlessly integrates these representations with graph-based reasoning for efficient action selection. Experiments show the context encoder provides efficient and robust environmental understanding. Real-world deployments further validate MacroNav, yielding significant gains over state-of-the-art navigation methods in both Success Rate (SR) and Success weighted by Path Length (SPL) while maintaining low computational cost.
Key Takeaways
- MacroNav is a learning-based framework for autonomous navigation in unknown environments.
- The framework has two key components: a lightweight context encoder and a reinforcement learning policy.
- The context encoder is trained with multi-task self-supervised learning to capture multi-scale, navigation-centric spatial representations.
- The reinforcement learning policy integrates these representations with graph-based reasoning for efficient action selection.
- Experiments demonstrate that the context encoder provides efficient and robust environmental understanding.
- In real-world deployments, MacroNav surpasses existing navigation methods in Success Rate (SR) and Success weighted by Path Length (SPL).
GUI-360: A Comprehensive Dataset and Benchmark for Computer-Using Agents
Authors:Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, Si Qin, Liqun Li, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang
We introduce GUI-360$^\circ$, a large-scale, comprehensive dataset and benchmark suite designed to advance computer-using agents (CUAs). CUAs present unique challenges and is constrained by three persistent gaps: a scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multi-modal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360$^\circ$ addresses these gaps with an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications, and includes full-resolution screenshots, accessibility metadata when available, instantiated goals, intermediate reasoning traces, and both successful and failed action trajectories. The dataset supports three canonical tasks, GUI grounding, screen parsing, and action prediction, and a hybrid GUI+API action space that reflects modern agent designs. Benchmarking state-of-the-art vision–language models on GUI-360$^\circ$ reveals substantial out-of-the-box shortcomings in grounding and action prediction; supervised fine-tuning and reinforcement learning yield significant gains but do not close the gap to human-level reliability. We release GUI-360$^\circ$ and accompanying code to facilitate reproducible research and accelerate progress on robust desktop CUAs. The full dataset has been made public on https://huggingface.co/datasets/vyokky/GUI-360.
Paper and project links
Summary
GUI-360° is a large-scale, comprehensive dataset and benchmark suite for advancing computer-using agents (CUAs). It targets three persistent gaps: the scarcity of real-world CUA tasks, the lack of automated collection-and-annotation pipelines for multimodal trajectories, and the absence of a unified benchmark that jointly evaluates GUI grounding, screen parsing, and action prediction. GUI-360° uses an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering. The released corpus contains over 1.2M executed action steps across thousands of trajectories in popular Windows office applications and supports the three canonical tasks of GUI grounding, screen parsing, and action prediction.
Key Takeaways
- GUI-360° is a large-scale dataset and benchmark suite for computer-using agents (CUAs).
- It addresses three gaps: scarce real-world CUA tasks, missing automated collection-and-annotation pipelines, and the absence of a unified benchmark jointly covering several capabilities.
- The corpus contains over 1.2M executed action steps across thousands of trajectories, spanning GUI grounding, screen parsing, and action prediction.
- Data collection and processing rely on an LLM-augmented, largely automated pipeline.
- State-of-the-art vision-language models show substantial out-of-the-box shortcomings on GUI-360°; supervised fine-tuning and reinforcement learning bring significant gains but still fall short of human-level reliability.
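The released corpus is hosted on the Hugging Face Hub at the link above, so a first look can be as simple as the snippet below. The split names and per-record fields are not documented here and should be checked against the dataset card; loading also assumes the repository exposes a standard `datasets` configuration.

```python
from datasets import load_dataset

# Repo id taken from the paper's link; inspect splits and columns before relying on them.
ds = load_dataset("vyokky/GUI-360")
print(ds)  # available splits and their sizes

first_split = next(iter(ds))
example = ds[first_split][0]
print(example.keys())  # e.g. screenshot, instantiated goal, action trace (assumed fields)
```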
SSPO: Subsentence-level Policy Optimization
Authors:Kun Yang, Zikang chen, Yanmeng Wang, Zhigen Li
As a significant part of post-training of the Large Language Models (LLMs), Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved LLMs’ reasoning skills. However, some RLVR algorithms, such as GRPO (Group Relative Policy Optimization) and GSPO (Group Sequence Policy Optimization), are observed to suffer from unstable policy updates and low usage of sampling data, respectively. The importance ratio of GRPO is calculated at the token level, which focuses more on optimizing a single token. This will be easily affected by outliers, leading to model training collapse. GSPO proposed the calculation of the response level importance ratio, which solves the problem of high variance and training noise accumulation in the calculation of the GRPO importance ratio. However, since all the response tokens share a common importance ratio, extreme values can easily raise or lower the overall mean, leading to the entire response being mistakenly discarded, resulting in a decrease in the utilization of sampled data. This paper introduces SSPO, which applies sentence-level importance ratio, taking the balance between GRPO and GSPO. SSPO not only avoids training collapse and high variance, but also prevents the whole response tokens from being abandoned by the clipping mechanism. Furthermore, we apply sentence entropy to PPO-CLIP to steadily adjust the clipping bounds, encouraging high-entropy tokens to explore and narrow the clipping range of low-entropy tokens. In particular, SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and wins state-of-the-art performance on three datasets. These results highlight SSPO’s effectiveness in leveraging generated data by taking the essence of GSPO but rejecting its shortcomings.
Paper and project links
Summary:
Reinforcement Learning from Verifiable Reward (RLVR) has greatly improved the reasoning skills of large language models (LLMs), but algorithms such as GRPO and GSPO suffer from unstable policy updates and low utilization of sampled data, respectively. The proposed SSPO strikes a balance between the two by computing importance ratios at the sentence level, which avoids training collapse and high variance while preventing entire responses from being discarded by the clipping mechanism. SSPO additionally uses sentence entropy to steadily adjust the PPO-CLIP clipping bounds, encouraging high-entropy tokens to explore and narrowing the clipping range of low-entropy tokens. SSPO achieves an average score of 46.57 across five datasets, surpassing GRPO (43.01) and GSPO (44.42), and reaches state-of-the-art performance on three datasets.
Key Takeaways:
- RLVR plays an important role in improving the reasoning ability of LLMs.
- GRPO and GSPO suffer from unstable policy updates and low utilization of sampled data, respectively.
- SSPO balances GRPO and GSPO by using sentence-level importance ratios, avoiding training collapse and high variance.
- SSPO uses sentence entropy to steadily adjust the clipping bounds, improving the utilization of sampled data.
- SSPO outperforms GRPO and GSPO across multiple datasets and achieves state-of-the-art performance.
- SSPO takes the essence of GSPO while rejecting its shortcomings in leveraging generated data.
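A minimal sketch of the two ideas, sentence-level importance ratios and entropy-adjusted clipping, is given below. The length-normalized per-sentence ratio and the linear entropy scaling of the clip range are assumed formulations for illustration, not the paper's exact objective.

```python
import numpy as np

def sentence_importance_ratios(logp_new, logp_old, sentence_ids):
    """Give every token in a sentence that sentence's importance ratio, computed as the
    length-normalised (geometric-mean) ratio of its token probabilities."""
    logp_new, logp_old = np.asarray(logp_new, float), np.asarray(logp_old, float)
    sentence_ids = np.asarray(sentence_ids)
    ratios = np.empty_like(logp_new)
    for s in np.unique(sentence_ids):
        mask = sentence_ids == s
        ratios[mask] = np.exp((logp_new[mask] - logp_old[mask]).mean())
    return ratios

def sspo_clipped_objective(ratios, advantages, entropies, base_eps=0.2, k=0.1):
    """PPO-CLIP surrogate with entropy-adjusted bounds: high-entropy tokens get a wider
    clip range (more exploration), low-entropy tokens a narrower one."""
    eps = np.clip(base_eps * (1.0 + k * (entropies - entropies.mean())), 0.05, 0.5)
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratios * advantages, clipped * advantages).mean()
```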
Plan of Knowledge: Retrieval-Augmented Large Language Models for Temporal Knowledge Graph Question Answering
Authors:Xinying Qian, Ying Zhang, Yu Zhao, Baohang Zhou, Xuhui Sui, Xiaojie Yuan
Temporal Knowledge Graph Question Answering (TKGQA) aims to answer time-sensitive questions by leveraging factual information from Temporal Knowledge Graphs (TKGs). While previous studies have employed pre-trained TKG embeddings or graph neural networks to inject temporal knowledge, they fail to fully understand the complex semantic information of time constraints. Recently, Large Language Models (LLMs) have shown remarkable progress, benefiting from their strong semantic understanding and reasoning generalization capabilities. However, their temporal reasoning ability remains limited. LLMs frequently suffer from hallucination and a lack of knowledge. To address these limitations, we propose the Plan of Knowledge framework with a contrastive temporal retriever, which is named PoK. Specifically, the proposed Plan of Knowledge module decomposes a complex temporal question into a sequence of sub-objectives from the pre-defined tools, serving as intermediate guidance for reasoning exploration. In parallel, we construct a Temporal Knowledge Store (TKS) with a contrastive retrieval framework, enabling the model to selectively retrieve semantically and temporally aligned facts from TKGs. By combining structured planning with temporal knowledge retrieval, PoK effectively enhances the interpretability and factual consistency of temporal reasoning. Extensive experiments on four benchmark TKGQA datasets demonstrate that PoK significantly improves the retrieval precision and reasoning accuracy of LLMs, surpassing the performance of the state-of-the-art TKGQA methods by 56.0% at most.
Paper and project links
PDF Submitted to the IEEE for possible publication
Summary
Temporal Knowledge Graph Question Answering (TKGQA) answers time-sensitive questions using factual information from Temporal Knowledge Graphs (TKGs). Prior work injects temporal knowledge via pre-trained TKG embeddings or graph neural networks but fails to fully capture the complex semantics of time constraints, while LLMs still have limited temporal reasoning ability and suffer from hallucination and missing knowledge. The proposed PoK (Plan of Knowledge) framework decomposes a complex temporal question into a sequence of sub-objectives drawn from predefined tools, which serve as intermediate guidance for reasoning exploration, and pairs this with a Temporal Knowledge Store (TKS) built on a contrastive retrieval framework so that the model can selectively retrieve semantically and temporally aligned facts from TKGs. Combining structured planning with temporal knowledge retrieval improves the interpretability and factual consistency of temporal reasoning. On four benchmark TKGQA datasets, PoK significantly improves retrieval precision and reasoning accuracy, surpassing state-of-the-art TKGQA methods by up to 56.0%.
Key Takeaways
- TKGQA aims to answer time-sensitive questions using the factual information in TKGs.
- Existing approaches fail to fully capture the complex semantics of time constraints.
- The PoK framework decomposes complex temporal questions into sub-objectives that provide intermediate guidance for reasoning exploration.
- PoK combines structured planning with a contrastive Temporal Knowledge Store for temporal knowledge retrieval.
- PoK improves the interpretability and factual consistency of LLM temporal reasoning.
- On four benchmark TKGQA datasets, PoK performs significantly better than existing methods.
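The planning and retrieval stages can be sketched as two small helpers. The tool vocabulary, planner prompt, and fact format below are illustrative assumptions; `planner_llm` and `embed` stand in for whatever LLM and encoder implement the planner and the contrastive temporal retriever.

```python
import numpy as np

def plan_of_knowledge(question, planner_llm):
    """Decompose a temporal question into ordered sub-objectives over predefined tools."""
    prompt = (
        "Decompose the question into sub-objectives, one per line, each using one tool "
        "from {find_entity, find_time, compare_time, answer}.\n"
        f"Question: {question}"
    )
    return [ln.strip() for ln in planner_llm(prompt).splitlines() if ln.strip()]

def retrieve_temporal_facts(sub_objective, tkg_facts, embed, k=3):
    """Rank (subject, relation, object, timestamp) facts by embedding similarity to the
    sub-objective; a simple stand-in for the contrastive temporal retriever over the TKS."""
    query = embed(sub_objective)
    scores = [float(np.dot(embed(" ".join(map(str, fact))), query)) for fact in tkg_facts]
    ranked = sorted(zip(scores, tkg_facts), key=lambda pair: -pair[0])
    return [fact for _, fact in ranked[:k]]
```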
Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development
Authors:Zhengran Zeng, Yixin Li, Rui Xie, Wei Ye, Shikun Zhang
The development of LLM-based autonomous agents for end-to-end software development represents a significant paradigm shift in software engineering. However, the scientific evaluation of these systems is hampered by significant challenges, including overly simplistic benchmarks and the difficulty of conducting fair comparisons between different agent architectures due to confounding implementation variables. To address these limitations, we first construct a challenging and dynamically curated E2EDevBench to simulate realistic development scenarios. Second, we propose a hybrid evaluation framework that combines test-case-based functional assessment with fine-grained, LLM-based requirement verification. Using this framework, we conduct a controlled empirical study on three representative agent architectures implemented upon a unified foundation to isolate the impact of workflow design. Our findings reveal that state-of-the-art agents can fulfill approximately 50% of requirements on \bench{}, but their success is critically dependent on the architectural strategy for task decomposition and collaboration. Furthermore, our analysis indicates that the primary bottleneck is the omission of requirements and inadequate self-verification. This work provides the community with a more realistic benchmark, a comprehensive evaluation framework, and crucial insights into the current capabilities and core challenges of software development agents, guiding future research toward enhancing requirement comprehension and planning.
Paper and project links
Summary
LLM-based autonomous agents for end-to-end software development mark a significant paradigm shift, but their scientific evaluation is hampered by overly simplistic benchmarks and by the difficulty of fairly comparing agent architectures with confounding implementation variables. The authors construct E2EDevBench, a challenging, dynamically curated benchmark that simulates realistic development scenarios, and propose a hybrid evaluation framework combining test-case-based functional assessment with fine-grained, LLM-based requirement verification. A controlled empirical study of three representative agent architectures built on a unified foundation shows that state-of-the-art agents fulfill roughly 50% of requirements, that success depends critically on the architectural strategy for task decomposition and collaboration, and that the main bottlenecks are omitted requirements and inadequate self-verification. The work provides a more realistic benchmark, a comprehensive evaluation framework, and insights that point future research toward better requirement comprehension and planning.
Key Takeaways
- LLM-based autonomous agents for end-to-end software development represent a significant paradigm shift.
- Evaluating these systems is difficult because of overly simplistic benchmarks and confounded comparisons between agent architectures.
- E2EDevBench is constructed to simulate realistic development scenarios.
- A hybrid evaluation framework combines test-case-based functional assessment with fine-grained, LLM-based requirement verification.
- Empirical results show that state-of-the-art agents satisfy only about half of the requirements.
- Agent success depends critically on the architectural strategy for task decomposition and collaboration.
To See or To Read: User Behavior Reasoning in Multimodal LLMs
Authors:Tianning Dong, Luyi Ma, Varun Vasudevan, Jason Cho, Sushant Kumar, Kannan Achan
Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data. However, whether textual or image representations of user behavior data are more effective for maximizing MLLM performance remains underexplored. We present \texttt{BehaviorLens}, a systematic benchmarking framework for assessing modality trade-offs in user-behavior reasoning across six MLLMs by representing transaction data as (1) a text paragraph, (2) a scatter plot, and (3) a flowchart. Using a real-world purchase-sequence dataset, we find that when data is represented as images, MLLMs next-purchase prediction accuracy is improved by 87.5% compared with an equivalent textual representation without any additional computational cost.
Paper and project links
PDF Accepted by the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Efficient Reasoning
Summary: Multimodal Large Language Models (MLLMs) are reshaping how modern agentic systems reason over sequential user-behavior data, yet it has remained unclear whether textual or image representations of such data better maximize MLLM performance. BehaviorLens is a systematic benchmarking framework that assesses these modality trade-offs across six MLLMs by representing transaction data in three ways: as a text paragraph, a scatter plot, and a flowchart. On a real-world purchase-sequence dataset, representing the data as images improves next-purchase prediction accuracy by 87.5% over an equivalent textual representation, without any additional computational cost.
Key Takeaways:
- Multimodal Large Language Models (MLLMs) are changing how sequential user-behavior data is reasoned over.
- The BehaviorLens framework evaluates how different representations of user-behavior data (text vs. image) affect MLLM performance.
- For user-behavior data, image representations substantially improve MLLM prediction accuracy compared with text representations.
- Representing user behavior as images improves next-purchase prediction accuracy by 87.5% without additional computational cost.
- A real-world purchase-sequence dataset underpins the study.
- Scatter plots and flowcharts are both examined as image-based representations.
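Two of the three representations compared by the benchmark are easy to reproduce for a toy purchase sequence, as in the sketch below; the record fields and plot layout are illustrative, not the benchmark's actual rendering.

```python
import matplotlib.pyplot as plt

# Hypothetical (day, category, price) purchase records for one user.
purchases = [(1, "grocery", 23.5), (4, "electronics", 199.0),
             (9, "grocery", 31.2), (15, "apparel", 58.0)]

# (1) Text-paragraph representation fed to the MLLM as plain text.
text_repr = "; ".join(f"day {d}: bought {c} for ${p:.2f}" for d, c, p in purchases)

# (2) Scatter-plot representation saved as an image and passed to the MLLM.
days = [d for d, _, _ in purchases]
prices = [p for _, _, p in purchases]
fig, ax = plt.subplots(figsize=(4, 3))
ax.scatter(days, prices)
for d, c, p in purchases:
    ax.annotate(c, (d, p))
ax.set_xlabel("day")
ax.set_ylabel("price ($)")
fig.savefig("behavior_scatter.png", bbox_inches="tight")
# (3) The flowchart representation would be rendered similarly, e.g. with graphviz.
```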
Scaling Agent Learning via Experience Synthesis
Authors:Zhaorun Chen, Zhuokai Zhao, Kai Zhang, Bo Liu, Qi Qi, Yifan Wu, Tarun Kalluri, Sara Cao, Yuanhao Xiong, Haibo Tong, Huaxiu Yao, Hengduo Li, Jiacheng Zhu, Xian Li, Dawn Song, Bo Li, Jason Weston, Dat Huynh
While reinforcement learning (RL) can empower large language model (LLM) agents by enabling self-improvement through interaction, its practical adoption remains challenging due to costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity, all of which obstruct the collection of scalable experience data. To address these challenges, we introduce DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind to enable effective online RL training for autonomous agents. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning, enabling scalable agent rollout collection for RL. To improve the stability and quality of transitions, DreamGym leverages an experience replay buffer initialized with offline real-world data and continuously enriched with fresh interactions to actively support agent training. To improve knowledge acquisition, DreamGym adaptively generates new tasks that challenge the current agent policy, enabling more effective online curriculum learning. Experiments across diverse environments and agent backbones demonstrate that DreamGym substantially improves RL training, both in fully synthetic settings and in sim-to-real transfer scenarios. On non-RL-ready tasks like WebArena, DreamGym outperforms all baselines by over 30%. And in RL-ready but costly settings, it matches GRPO and PPO performance using only synthetic interactions. When transferring a policy trained purely on synthetic experiences to real-environment RL, DreamGym yields significant additional performance gains while requiring far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
Paper and project links
Summary
Reinforcement learning (RL) can empower large language model (LLM) agents to improve through interaction, but practical adoption is hindered by costly rollouts, limited task diversity, unreliable reward signals, and infrastructure complexity. DreamGym is a unified framework that synthesizes diverse experiences at scale to enable effective online RL training for autonomous agents. Instead of relying on expensive real-environment rollouts, it distills environment dynamics into a reasoning-based experience model that derives consistent state transitions and feedback signals through step-by-step reasoning. An experience replay buffer, initialized with offline real-world data and continuously enriched with fresh interactions, improves the stability and quality of transitions, while adaptively generated tasks that challenge the current policy enable more effective online curriculum learning. Across diverse environments and agent backbones, DreamGym substantially improves RL training both in fully synthetic settings and in sim-to-real transfer: on non-RL-ready tasks such as WebArena it outperforms all baselines by over 30%, in RL-ready but costly settings it matches GRPO and PPO using only synthetic interactions, and transferring a policy trained purely on synthetic experiences to real-environment RL yields significant additional gains with far fewer real-world interactions, providing a scalable warm-start strategy for general-purpose RL.
Key Takeaways
- RL enables LLM agents to improve through interaction, but costly rollouts, limited task diversity, unreliable rewards, and infrastructure complexity hinder adoption.
- DreamGym is a unified framework that synthesizes diverse, scalable experiences for effective online RL training of autonomous agents.
- A reasoning-based experience model distills environment dynamics, avoiding expensive real-environment rollouts.
- A replay buffer seeded with offline real-world data and continuously enriched with fresh interactions improves the stability and quality of transitions.
- DreamGym adaptively generates new tasks that challenge the current policy, enabling online curriculum learning.
- Experiments across environments and agent backbones show that DreamGym substantially improves RL training.
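At its core, the synthetic-experience idea replaces the real environment with a model that reasons out the next state and reward. The sketch below shows that control flow with placeholder components; the `reset`/`step` interface and the trajectory format are assumptions, not DreamGym's actual API.

```python
def synthetic_rollout(task, agent_policy, experience_model, max_steps=10):
    """Collect one trajectory entirely inside the experience model.
    `agent_policy(state) -> action`; `experience_model` exposes `reset(task)` and
    `step(state, action) -> (next_state, reward, done)` (placeholder interface)."""
    state = experience_model.reset(task)
    trajectory = []
    for _ in range(max_steps):
        action = agent_policy(state)
        # The experience model reasons step by step to derive the next state and reward.
        state, reward, done = experience_model.step(state, action)
        trajectory.append((state, action, reward))
        if done:
            break
    return trajectory  # appended to the replay buffer used for RL updates
```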
CareMedEval dataset: Evaluating Critical Appraisal and Reasoning in the Biomedical Field
Authors:Doria Bonzi, Alexandre Guiggi, Frédéric Béchet, Carlos Ramisch, Benoit Favre
Critical appraisal of scientific literature is an essential skill in the biomedical field. While large language models (LLMs) can offer promising support in this task, their reliability remains limited, particularly for critical reasoning in specialized domains. We introduce CareMedEval, an original dataset designed to evaluate LLMs on biomedical critical appraisal and reasoning tasks. Derived from authentic exams taken by French medical students, the dataset contains 534 questions based on 37 scientific articles. Unlike existing benchmarks, CareMedEval explicitly evaluates critical reading and reasoning grounded in scientific papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions reveals the difficulty of the task: open and commercial models fail to exceed an Exact Match Rate of 0.5 even though generating intermediate reasoning tokens considerably improves the results. Yet, models remain challenged especially on questions about study limitations and statistical analysis. CareMedEval provides a challenging benchmark for grounded reasoning, exposing current LLM limitations and paving the way for future development of automated support for critical appraisal.
Paper and project links
PDF Preprint submitted to LREC 2026 (under review) To access the dataset, see https://github.com/bonzid/CareMedEval
Summary
CareMedEval is an original dataset for evaluating large language models (LLMs) on biomedical critical appraisal and reasoning. Derived from authentic exams taken by French medical students, it contains 534 questions based on 37 scientific articles and explicitly evaluates critical reading and reasoning grounded in the papers. Benchmarking state-of-the-art generalist and biomedical-specialized LLMs under various context conditions shows how hard the task is: open and commercial models do not exceed an Exact Match Rate of 0.5, even though generating intermediate reasoning tokens considerably improves results, and models struggle in particular with questions about study limitations and statistical analysis. CareMedEval thus provides a challenging benchmark for grounded reasoning and paves the way for automated support for critical appraisal.
Key Takeaways
- CareMedEval is a dataset for evaluating LLMs on critical appraisal and reasoning over biomedical literature.
- It contains 534 questions drawn from authentic exams and based on 37 scientific articles.
- The dataset explicitly targets critical reading and reasoning grounded in scientific papers.
- Current LLMs perform modestly on this task, with Exact Match Rates not exceeding 0.5.
- Generating intermediate reasoning tokens improves results considerably, but questions about study limitations and statistical analysis remain challenging.
Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models
Authors:Sriram Balasubramanian, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
Paper and project links
PDF Post-hoc attribution
Summary
Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings where answers synthesize information across passages. The authors reframe post-hoc attribution as a reasoning problem in which answers are decomposed into constituent units, each tied to specific context, and show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, DecompTune is a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps: a diverse set of complex QA tasks is annotated with decompositions by a strong LLM, and Qwen-2.5 (7B and 14B) is post-trained with a two-stage SFT + GRPO pipeline using task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.
Key Takeaways
- LLMs used for long-document question answering need reliable attribution to sources.
- Existing post-hoc attribution methods handle extractive QA well but struggle in multi-hop, abstractive, and semi-extractive settings.
- Post-hoc attribution is reframed as a reasoning problem: answers are decomposed into units, each tied to specific context.
- Prompting models to generate such decompositions alongside attributions improves performance.
- DecompTune is a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps.
- Training uses a diverse dataset of complex QA tasks annotated with decompositions by a strong LLM, followed by a two-stage SFT + GRPO pipeline.
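The decomposition-style prompting that the method builds on can be sketched as a single template asking for an answer, its constituent units, and per-unit citations. The wording below is illustrative, not the paper's actual prompt or training-data format.

```python
DECOMP_PROMPT = """You are given a document (with numbered passages) and a question.
1. Answer the question.
2. Decompose your answer into minimal factual units.
3. For each unit, list the passage number(s) that support it.

Document:
{document}

Question: {question}
"""

def attribution_prompt(document, question):
    """Build a decomposition-plus-attribution prompt for a long-document QA query."""
    return DECOMP_PROMPT.format(document=document, question=question)
```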
Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards
Authors:Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards would diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promotes trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibits prolonged similarity during simulation. Compared with stochastic Sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at https://github.com/starreeze/latr.
Paper and project links
Summary
Reinforcement Learning with Verifiable Rewards (RLVR), with algorithms such as Group Relative Policy Optimization (GRPO), effectively enhances the reasoning capabilities of large language models, but a critical bottleneck is the limited diversity of sampled trajectories during group rollouts: homogeneous trajectories and their associated rewards weaken the return signal for policy updates and hinder effective learning. This lack of diversity stems mainly from token-level stochastic sampling, where local variations tend to collapse into near-identical reasoning paths. Lookahead Tree-Based Rollouts (LATR) explicitly promotes trajectory-level diversity by branching into candidate tokens that are likely to yield distinct continuations. LATR iterates over three stages: branching at high-uncertainty generation steps, performing lookahead simulation for each new branch, and pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) across different reasoning tasks. Code and data are publicly available.
Key Takeaways
- RLVR with algorithms such as GRPO enhances the reasoning ability of large language models.
- A key bottleneck in current pipelines is the limited diversity of sampled trajectories during group rollouts.
- Homogeneous trajectories and their associated rewards diminish the return signal for policy updates.
- LATR promotes trajectory-level diversity by branching into candidate tokens likely to yield distinct continuations.
- LATR operates in three stages: branching at high-uncertainty steps, lookahead simulation for each new branch, and pruning branches that remain too similar.
- Compared with stochastic sampling, LATR markedly accelerates policy learning and improves pass@1 performance.
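The three-stage rollout loop can be sketched as follows, with the policy model hidden behind three assumed helpers; the entropy threshold, lookahead length, branch budget, and similarity cutoff are illustrative values, not the paper's settings.

```python
from dataclasses import dataclass, field

@dataclass
class Branch:
    tokens: list = field(default_factory=list)

def latr_rollout(step_fn, top_candidates_fn, similarity_fn, prompt_tokens,
                 max_steps=256, entropy_threshold=2.0, lookahead=16,
                 max_branches=4, sim_threshold=0.9):
    """step_fn(tokens) -> (next_token, entropy); top_candidates_fn(tokens, k) -> k candidate
    tokens; similarity_fn(a, b) -> similarity in [0, 1] between two token sequences."""
    branches = [Branch(list(prompt_tokens))]
    for _ in range(max_steps):
        expanded = []
        for b in branches:
            token, entropy = step_fn(b.tokens)
            if entropy > entropy_threshold and len(branches) < max_branches:
                # (1) branch at a high-uncertainty step into distinct candidate tokens
                for cand in top_candidates_fn(b.tokens, 2):
                    expanded.append(Branch(b.tokens + [cand]))
            else:
                expanded.append(Branch(b.tokens + [token]))
        # (2) lookahead-simulate a short continuation for every branch
        previews = []
        for nb in expanded:
            t = list(nb.tokens)
            for _ in range(lookahead):
                t.append(step_fn(t)[0])
            previews.append(t)
        # (3) prune branches whose lookahead stays too similar to an already-kept branch
        branches, kept_previews = [], []
        for nb, pv in zip(expanded, previews):
            if all(similarity_fn(pv, q) < sim_threshold for q in kept_previews):
                branches.append(nb)
                kept_previews.append(pv)
    return [b.tokens for b in branches]
```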
PaTaRM: Bridging Pairwise and Pointwise Signals via Preference-Aware Task-Adaptive Reward Modeling
Authors:Ai Jian, Jingqing Ruan, Xing Ma, Dailin Li, QianLin Zhou, Ke Zeng, Xunliang Cai
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), providing the critical supervision signals that align large language models (LLMs) with human preferences. While generative reward models (GRMs) offer greater interpretability than traditional scalar RMs, current training paradigms remain limited. Pair-wise methods rely on binary good-versus-bad labels, which cause mismatches for point-wise inference and necessitate complex pairing strategies for effective application in RLHF. On the other hand, point-wise methods require more elaborate absolute labeling with rubric-driven criteria, resulting in poor adaptability and high annotation costs. In this work, we propose the Preference-Aware Task-Adaptive Reward Model (PaTaRM), a unified framework that integrates a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. PaTaRM leverages relative preference information from pairwise data to construct robust point-wise training signals, eliminating the need for explicit point-wise labels. Simultaneously, it employs a task-adaptive rubric system that flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. This design enables efficient, generalizable, and interpretable reward modeling for RLHF. Extensive experiments show that PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench across Qwen3-8B and Qwen3-14B models. Furthermore, PaTaRM boosts downstream RLHF performance, with an average improvement of 13.6% across IFEval and InFoBench benchmarks, confirming its effectiveness and robustness. Our code is available at https://github.com/JaneEyre0530/PaTaRM.
Paper and project links
Summary
Reward models (RMs) are central to reinforcement learning from human feedback (RLHF), supplying the supervision signals that align large language models with human preferences. Current training paradigms are limited: pairwise methods rely on binary good-versus-bad labels that mismatch point-wise inference and require complex pairing strategies, while point-wise methods need elaborate rubric-driven absolute labels, hurting adaptability and raising annotation cost. PaTaRM (Preference-Aware Task-Adaptive Reward Model) is a unified framework that combines a preference-aware reward (PAR) mechanism with dynamic rubric adaptation. It uses relative preference information from pairwise data to construct robust point-wise training signals, removing the need for explicit point-wise labels, and its task-adaptive rubric system flexibly generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning. PaTaRM achieves an average relative improvement of 4.7% on RewardBench and RMBench with Qwen3-8B and Qwen3-14B, and boosts downstream RLHF performance by an average of 13.6% on IFEval and InFoBench.
Key Takeaways
- Reward models are central to RLHF, providing the supervision signals that align LLMs with human preferences.
- Current training paradigms are limited by the mismatch of pairwise labels with point-wise inference and the cost of rubric-driven point-wise labels.
- PaTaRM is a unified framework that combines a preference-aware reward (PAR) mechanism with dynamic rubric adaptation.
- PaTaRM builds robust point-wise training signals from relative preference information in pairwise data, without complex pairing strategies or costly point-wise labels.
- A task-adaptive rubric system generates evaluation criteria for both global task consistency and instance-specific fine-grained reasoning.
- Experiments show significant improvements on several benchmarks.
- The PaTaRM code is publicly available.
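The core trick of deriving point-wise supervision from pairwise preferences can be illustrated very simply. The labelling scheme below (preferred response scored above the rejected one for the same prompt) is a deliberately simplified stand-in for the preference-aware reward mechanism, not the paper's construction.

```python
def pairwise_to_pointwise(pairs):
    """Turn (prompt, chosen, rejected) preference pairs into point-wise examples,
    labelling each response by its relative preference within the pair."""
    examples = []
    for prompt, chosen, rejected in pairs:
        examples.append({"prompt": prompt, "response": chosen, "target": 1.0})
        examples.append({"prompt": prompt, "response": rejected, "target": 0.0})
    return examples

pointwise_data = pairwise_to_pointwise([
    ("Summarize the report.", "A faithful two-sentence summary.", "An off-topic reply."),
])
```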
Evaluating Long-Term Memory for Long-Context Question Answering
Authors:Alessandra Terranova, Björn Ross, Alexandra Birch
In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning model gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
Paper and project links
PDF 14 pages including appendix, 3 figures. Submitted to October ARR and to Metacognition in Generative AI EurIPS workshop (under review for both)
Summary
For large language models to achieve true conversational continuity and benefit from experiential learning, they need memory, yet it has been unclear which types of memory are most effective for long-context conversational tasks. Using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies, the authors systematically evaluate full-context prompting, semantic memory via retrieval-augmented generation (RAG) and agentic memory, episodic memory via in-context learning, and procedural memory via prompt optimization. Memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory-architecture complexity should scale with model capability: small foundation models benefit most from RAG, whereas strong instruction-tuned reasoning models gain from episodic learning through reflections and from more complex agentic semantic memory. Episodic memory, in particular, can help LLMs recognise the limits of their own knowledge.
Key Takeaways
- Large language models need memory to achieve true conversational continuity and to learn from experience.
- It has been unclear which types of memory work best for long-context conversational tasks.
- The LoCoMo benchmark is used to systematically evaluate memory-augmented methods.
- Memory-augmented approaches cut token usage by over 90% while maintaining competitive accuracy.
- Memory-architecture complexity should scale with model capability.
- Small foundation models benefit most from RAG (retrieval-augmented generation), while strong instruction-tuned reasoning models rely more on episodic learning and more complex agentic semantic memory.
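The semantic-memory (RAG) condition that benefits small models most can be sketched as embed-then-retrieve over past dialogue turns. `embed` and `llm` below are placeholders for any sentence encoder and chat model; the prompt format is an assumption.

```python
import numpy as np

def build_memory(turns, embed):
    """Embed past dialogue turns once; `embed` is any sentence-embedding callable."""
    vecs = np.stack([embed(t) for t in turns]).astype(float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    return turns, vecs

def retrieve(question, memory, embed, k=5):
    """Return the k past turns most similar to the question (cosine similarity)."""
    turns, vecs = memory
    q = np.asarray(embed(question), dtype=float)
    q /= np.linalg.norm(q)
    top = np.argsort(-(vecs @ q))[:k]
    return [turns[i] for i in top]

def answer_with_memory(question, memory, embed, llm, k=5):
    """Condition the chat model on retrieved turns instead of the full dialogue."""
    context = "\n".join(retrieve(question, memory, embed, k))
    prompt = f"Relevant conversation history:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)
```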
VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation
Authors:Walid Bousselham, Hilde Kuehne, Cordelia Schmid
Training vision-language models (VLMs) for complex reasoning remains a challenging task, i.a. due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leveraging them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
Paper and project links
PDF www.walidbousselham.com/VOLD/
Summary
The paper proposes VOLD, a framework for transferring reasoning capabilities from text-only teacher models to vision-language model (VLM) students. VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, so that the student's reasoning traces are guided by the teacher model, yielding a significant gain over using GRPO alone. The authors further show that cold-start alignment is essential for effective transfer during the online training phase: without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. Evaluated on diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, VOLD significantly outperforms the baseline model and improves over the state of the art.
Key Takeaways
- VOLD transfers reasoning capabilities from text-only teacher models to vision-language models (VLMs).
- It combines reinforcement learning with on-policy distillation so the VLM student's reasoning traces are guided by the teacher.
- The reinforcement learning component uses Group Relative Policy Optimization (GRPO).
- Cold-start alignment via SFT is essential for effective transfer during online training.
- VOLD shows clear gains across multiple benchmarks, outperforming the baseline and improving over the state of the art.
- Sufficient distributional alignment between teacher and student is what allows on-policy distillation to provide meaningful guidance.
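The on-policy distillation ingredient can be sketched as a token-level KL term computed on the student's own sampled reasoning trace, scored by the text-only teacher. The tensor shapes and the way this term would be combined with the GRPO objective are assumptions, not the paper's implementation.

```python
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """Token-level KL(teacher || student) over a student-sampled reasoning trace.
    Both tensors have shape [seq_len, vocab_size]; how this term is weighted against
    the GRPO objective is left open here."""
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    return F.kl_div(student_logprobs, teacher_logprobs,
                    log_target=True, reduction="batchmean")
```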