⚠️ 以下所有内容总结均由大语言模型自动生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-09-16 更新
DeepDive: Advancing Deep Search Agents with Knowledge Graphs and Multi-Turn RL
Authors:Rui Lu, Zhenyu Hou, Zihan Wang, Hanchen Zhang, Xiao Liu, Yujiang Li, Shi Feng, Jie Tang, Yuxiao Dong
Augmenting large language models (LLMs) with browsing tools substantially improves their potential as deep search agents to solve complex, real-world tasks. Yet, open LLMs still perform poorly in such settings due to limited long-horizon reasoning capacity with browsing tools and the lack of sufficiently difficult supervised data. To address these challenges, we present DeepDive to advance deep search agents. First, we propose a strategy to automatically synthesize complex, difficult, and hard-to-find questions from open knowledge graphs. Second, we apply end-to-end multi-turn reinforcement learning (RL) to enhance LLMs’ long-horizon reasoning with deep search. Experiments show that DeepDive-32B achieves a new open-source competitive result on BrowseComp, outperforming WebSailor, DeepSeek-R1-Browse, and Search-o1. We demonstrate that multi-turn RL training improves deep search ability and significantly contributes to the performance improvements across multiple benchmarks. We observe that DeepDive enables test-time scaling of tool calls and parallel sampling. All datasets, models, and code are publicly available at https://github.com/THUDM/DeepDive.
通过浏览工具增强大型语言模型(LLM),能显著提升其作为深度搜索代理解决复杂现实任务的潜力。然而,开源LLM在此类场景中仍表现不佳,原因在于其使用浏览工具进行长程推理的能力有限,以及缺乏足够困难的监督数据。为应对这些挑战,我们提出DeepDive来推进深度搜索代理的发展。首先,我们提出一种从开放知识图谱中自动合成复杂、困难且难以检索的问题的策略。其次,我们应用端到端的多轮强化学习(RL),增强LLM结合深度搜索的长程推理能力。实验表明,DeepDive-32B在BrowseComp上取得了开源模型中具有竞争力的新成绩,超越了WebSailor、DeepSeek-R1-Browse和Search-o1。我们证明多轮RL训练能提升深度搜索能力,并对多个基准测试的性能改进作出重要贡献。我们还观察到DeepDive支持测试时在工具调用次数和并行采样上进行扩展。所有数据集、模型和代码均可在https://github.com/THUDM/DeepDive公开获取。
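下面是一段示意性的 Python 草图(并非论文的实际数据合成流程),用来说明"从知识图谱随机游走采样多跳路径、再改写成需要多步检索才能回答的问题"这一思路;其中的玩具知识图谱、实体与改写方式均为假设的示例。

```python
import random

# 玩具知识图谱:{头实体: [(关系, 尾实体), ...]}(示例数据为假设)
KG = {
    "巴黎": [("位于", "法国"), ("有地标", "埃菲尔铁塔")],
    "法国": [("首都是", "巴黎"), ("位于", "欧洲")],
    "埃菲尔铁塔": [("建成于", "1889年"), ("设计者是", "古斯塔夫·埃菲尔")],
    "古斯塔夫·埃菲尔": [("出生于", "第戎")],
}

def sample_path(kg, start, hops, rng):
    """从 start 出发随机游走 hops 步,返回 [实体, 关系, 实体, ...] 路径。"""
    path, node = [start], start
    for _ in range(hops):
        if node not in kg or not kg[node]:
            break
        rel, nxt = rng.choice(kg[node])
        path += [rel, nxt]
        node = nxt
    return path

def path_to_question(path):
    """把多跳路径改写成一个问题,答案为路径末端实体。
    真实流程会用 LLM 做改写与模糊化,这里仅用字符串拼接示意。"""
    question = f"从“{path[0]}”出发:"
    for i in range(1, len(path) - 1, 2):
        question += f"取其“{path[i]}”的对象,"
    question = question.rstrip(",") + ",最终得到的是什么?"
    return question, path[-1]

rng = random.Random(0)
q, a = path_to_question(sample_path(KG, "巴黎", hops=3, rng=rng))
print("合成问题:", q)
print("参考答案:", a)
```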
论文及项目相关链接
Summary
通过浏览工具进行增强后,大型语言模型(LLM)作为深度搜索代理解决复杂现实任务的潜力显著提升。然而,由于使用浏览工具进行长程推理的能力有限,以及缺乏足够困难的监督数据,开源LLM在这类场景下表现仍然较差。为应对这些挑战,我们推出DeepDive以提升深度搜索代理的性能。首先,我们提出一种从开放知识图谱中自动合成复杂、困难且难以检索的问题的策略;其次,我们采用端到端的多轮强化学习(RL)来增强LLM的长程推理能力。实验表明,DeepDive-32B在BrowseComp上取得了开源模型中具有竞争力的新成绩,超越了WebSailor、DeepSeek-R1-Browse和Search-o1。我们证明多轮RL训练提升了深度搜索能力,并对多个基准测试的性能改进有显著贡献。DeepDive还支持测试时在工具调用和并行采样上的扩展。所有数据集、模型和代码均可在https://github.com/THUDM/DeepDive公开获取。
Key Takeaways
- LLMs结合浏览工具可显著提高解决复杂现实世界任务的能力。
- 当前开源LLM面临两大挑战:使用浏览工具进行长程推理的能力有限,以及缺乏足够困难的监督数据。
- DeepDive通过自动合成复杂问题并应用端到端的多轮强化学习来解决这些挑战。
- DeepDive-32B在BrowseComp上的表现超越了其他开源模型。
- 多轮RL训练显著提升了LLMs的深度搜索能力和长期规划推理能力。
- DeepDive支持工具调用的测试时间扩展和并行采样,增强了其实用性。
点此查看论文截图





Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Authors:Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity–their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks–GSM8K, Math500, and AMC–achieving new state-of-the-art results for full-attention masked dLLMs.
掩码扩散大型语言模型(dLLMs)正在成为自回归LLM的有力替代方案,它们在保持有竞争力性能的同时,还支持补全(inpainting)等独特的生成能力。我们探讨了补全如何为dLLM的强化学习算法设计提供启发。将LLM与强化学习对齐面临探索难题:当模型未能发现正确解时,奖励信号稀疏且样本被浪费。这种低效普遍影响各类LLM,而dLLM提供了一个独特的机会:它们的补全能力可以引导探索。我们提出了IGPO(补全引导策略优化),这是一种在在线采样过程中战略性地插入部分真实推理轨迹的强化学习框架。与提供完整解答不同,补全在保留模型自我生成推理的同时,将探索引向有希望的轨迹空间,在监督微调与强化学习之间架起桥梁。我们将IGPO应用于GRPO等基于组的优化方法,在这些方法中,探索失败会导致优势和梯度全为零;IGPO恢复了有意义的梯度,同时提高了样本效率。我们还提出在合成重写的简洁轨迹上进行监督微调,使其更贴合dLLM的生成模式。结合基于熵的过滤等附加技术,我们的训练配方在GSM8K、Math500和AMC三个数学基准测试上取得了显著提升,为全注意力掩码dLLM取得了新的最先进(SOTA)结果。
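为说明"当整组采样全部失败、组优势与梯度退化为零时,用部分真实推理轨迹补全提示来引导探索"这一核心思想,下面给出一个假设性的最小示意(非论文官方实现);其中 generate_fn、reward_fn 等均为占位函数。

```python
import statistics

def group_advantages(rewards):
    """GRPO 风格的组内相对优势:减去组均值,再除以组标准差(标准差为 0 时取 1 避免除零)。"""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

def igpo_style_sampling(prompt, gt_trace, generate_fn, reward_fn,
                        group_size=4, hint_ratio=0.3):
    """IGPO 核心思路的示意:
    1) 先正常采样一组回答;
    2) 若整组奖励全为 0(组优势退化为 0,没有梯度信号),
       则把部分真实推理轨迹"补全式"地拼进提示,引导探索后重新采样。"""
    samples = [generate_fn(prompt) for _ in range(group_size)]
    rewards = [reward_fn(s) for s in samples]
    if all(r == 0 for r in rewards):
        k = max(1, int(len(gt_trace) * hint_ratio))
        hinted_prompt = prompt + "\n(已知的部分推理:" + gt_trace[:k] + ")"
        samples = [generate_fn(hinted_prompt) for _ in range(group_size)]
        rewards = [reward_fn(s) for s in samples]
    return samples, rewards, group_advantages(rewards)

# 用法示意:generate_fn / reward_fn 为假设的占位函数
demo_samples, demo_rewards, demo_adv = igpo_style_sampling(
    prompt="求 12 + 30 的值。",
    gt_trace="先把 12 与 30 相加,得到 42。",
    generate_fn=lambda p: "答案是 42" if "部分推理" in p else "答案是 41",
    reward_fn=lambda s: 1 if "42" in s else 0,
)
print(demo_rewards, [round(a, 2) for a in demo_adv])
```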
论文及项目相关链接
PDF preprint; 21 pages
Summary
掩码扩散大型语言模型(dLLMs)正成为自回归LLM的有力替代,其独特的补全(inpainting)能力可为强化学习算法设计提供新思路:补全可以引导强化学习中的探索过程。我们引入IGPO(补全引导策略优化)框架,在在线采样过程中战略性地插入部分真实推理轨迹。不同于直接提供完整解答,补全将探索引向有希望的轨迹空间,同时保留模型自我生成的推理,弥合了监督微调与强化学习之间的差距。IGPO适用于GRPO等基于组的优化方法。此外,我们提出在合成重写的简洁轨迹上进行监督微调,使训练数据更贴合dLLM的生成模式。结合基于熵的过滤等附加技术,我们的训练配方在GSM8K、Math500和AMC三个数学基准测试上取得了显著提升,为全注意力掩码dLLM创造了新的SOTA结果。
Key Takeaways
- 掩码扩散大型语言模型(dLLMs)是自回归LLM的一种新兴替代方案,具备补全(inpainting)等独特生成能力,并表现出有竞争力的性能。
- LLM与强化学习对齐时面临探索挑战,尤其是奖励信号稀疏、模型未能找到正确解时的样本浪费问题;这种低效普遍影响LLM,而dLLM可以利用其补全能力来指导强化学习中的探索。
- IGPO(补全引导策略优化)是一种新的强化学习框架,通过战略性地插入部分真实推理轨迹来引导探索,有助于弥合监督微调与强化学习之间的差距;它特别适用于GRPO等基于组的优化方法,在探索失败导致零优势和零梯度时恢复有效的梯度信号。
- 在监督微调中使用合成重写的简洁轨迹可以更好地贴合dLLM的生成模式;结合基于熵的过滤等附加技术,该训练策略在多个数学基准测试上取得显著提升,达到了新的SOTA结果。
点此查看论文截图


MagicMirror: A Large-Scale Dataset and Benchmark for Fine-Grained Artifacts Assessment in Text-to-Image Generation
Authors:Jia Wang, Jie Hu, Xiaoqi Ma, Hanghang Ma, Yanbing Zeng, Xiaoming Wei
Text-to-image (T2I) generation has achieved remarkable progress in instruction following and aesthetics. However, a persistent challenge is the prevalence of physical artifacts, such as anatomical and structural flaws, which severely degrade perceptual quality and limit application. Given the diversity and complexity of these artifacts, a systematic and fine-grained evaluation framework is required, which is lacking in current benchmarks. To fill this gap, we introduce MagicMirror, a comprehensive framework for artifacts assessment. We first establish a detailed taxonomy of generated image artifacts. Guided by this taxonomy, we manually annotate MagicData340K, the first human-annotated large-scale dataset of 340K generated images with fine-grained artifact labels. Building on this dataset, we train MagicAssessor, a Vision-Language Model (VLM) that provides detailed assessments and corresponding labels. To overcome challenges like class imbalance and reward hacking, we design a novel data sampling strategy and a multi-level reward system for Group Relative Policy Optimization (GRPO). Finally, we leverage MagicAssessor to construct MagicBench, an automated benchmark for evaluating the image artifacts of current T2I models. Our evaluation with MagicBench reveals that despite their widespread adoption, even top-tier models like GPT-image-1 are consistently plagued by significant artifacts, highlighting artifact reduction as a critical frontier for future T2I development. Project page: https://wj-inf.github.io/MagicMirror-page/.
文本到图像(T2I)生成在指令遵循和美学方面取得了显著的进步。然而,一个持续存在的挑战是物理伪影的普遍存在,例如解剖和结构缺陷,这些缺陷严重降低了感知质量并限制了应用。考虑到这些伪影的多样性和复杂性,需要一个系统和精细的评价框架,而当前基准测试中缺乏这一框架。为了填补这一空白,我们引入了MagicMirror,一个用于伪影评估的综合框架。我们首先建立了生成的图像伪影的详细分类。在此分类的指导下,我们手动注释了MagicData340K,这是第一个由人类注释的大规模数据集,包含340K张生成的图像和精细的伪影标签。在此基础上,我们训练了MagicAssessor,这是一个视觉语言模型(VLM),提供详细的评估和相应标签。为了克服类不平衡和奖励黑客攻击等挑战,我们设计了一种新型数据采样策略和用于组相对策略优化(GRPO)的多级奖励系统。最后,我们利用MagicAssessor构建了MagicBench,一个用于评估当前T2I模型图像伪影的自动化基准测试。我们在MagicBench上的评估显示,尽管广泛使用,但即使是顶级模型如GPT-image-1也始终存在显著的伪影问题,这突出了伪影减少是未来T2I发展的关键前沿。项目页面:https://wj-inf.github.io/MagicMirror-page/。
论文及项目相关链接
Summary
本文介绍了文本到图像生成中的挑战,尤其是生成的图像中的物理瑕疵问题。为了解决这个问题,研究团队引入了MagicMirror框架,包括建立瑕疵的详细分类、手动标注大规模数据集MagicData340K、训练视觉语言模型MagicAssessor、设计数据采样策略和采用多级奖励系统进行群组相对策略优化,并构建自动化基准测试MagicBench。研究指出,尽管顶级模型如GPT-image-1广泛应用,但仍存在大量瑕疵,这成为未来文本到图像发展的一个重要前沿。
Key Takeaways
- 文本到图像生成面临物理瑕疵的挑战。
- 引入MagicMirror框架以评估和减少生成图像中的瑕疵。
- 建立详细的瑕疵分类和手动标注的大规模数据集MagicData340K。
- 训练视觉语言模型MagicAssessor提供详细的评估和相应标签。
- 采用数据采样策略和群组相对策略优化(GRPO)的多级奖励系统。
- 利用MagicAssessor构建自动化基准测试MagicBench。
点此查看论文截图






Empirical Evaluation of Memory-Erasure Protocols
Authors:Reynaldo Gil-Pons, Sjouke Mauw, Rolando Trujillo-Rasua
Software-based memory-erasure protocols are two-party communication protocols where a verifier instructs a computational device to erase its memory and send a proof of erasure. They aim at guaranteeing that low-cost IoT devices are free of malware by putting them back into a safe state without requiring secure hardware or physical manipulation of the device. Several software-based memory-erasure protocols have been introduced and theoretically analysed. Yet, many of them have not been tested for their feasibility, performance and security on real devices, which hinders their industry adoption. This article reports on the first empirical analysis of software-based memory-erasure protocols with respect to their security, erasure guarantees, and performance. The experimental setup consists of 3 modern IoT devices with different computational capabilities, 7 protocols, 6 hash-function implementations, and various performance and security criteria. Our results indicate that existing software-based memory-erasure protocols are feasible, although slow devices may take several seconds to erase their memory and generate a proof of erasure. We found that no protocol dominates across all empirical settings, defined by the computational power and memory size of the device, the network speed, and the required level of security. Interestingly, network speed and hidden constants within the protocol specification played a more prominent role in the performance of these protocols than anticipated based on the related literature. We provide an evaluation framework that, given a desired level of security, determines which protocols offer the best trade-off between performance and erasure guarantees.
基于软件的内存擦除协议是一种双方通信协议,其中验证者指令计算设备擦除其内存并发送擦除证明。它们的目的是通过使低成本物联网设备回到安全状态,保证这些设备不受恶意软件的侵扰,而无需依赖安全硬件或对设备进行物理操控。已经引入并分析了多个基于软件的内存擦除协议。然而,其中许多协议在实际设备上的可行性、性能和安全性尚未得到测试,这阻碍了它们在行业中的应用。本文首次对基于软件的内存擦除协议进行实证分析,涉及安全性、擦除保证和性能等方面。实验设置包括3台具有不同计算能力的现代物联网设备、7个协议、6个哈希函数实现以及各种性能和安全性标准。我们的结果表明,现有的基于软件的内存擦除协议是可行的,尽管较慢的设备可能需要数秒时间来擦除其内存并生成擦除证明。我们发现,没有一个协议能在所有实证环境中占据主导,这些环境由设备的计算能力和内存大小、网络速度以及所需的安全级别所定义。有趣的是,网络速度和协议规范中的隐藏常数对这些协议性能的影响比相关文献预期的更为突出。我们提供了一个评估框架,在给定的安全级别下,确定哪些协议在性能和擦除保证之间提供最佳折衷。
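作为背景示意,下面给出一个常见的"软件内存擦除+擦除证明"思路的极简草图:验证者发送随机 nonce,设备用由 nonce 派生的伪随机数据覆盖内存并返回哈希作为证明。这只是一个教科书式的简化示例,并不对应论文中评测的任何具体协议,也未考虑计时约束、网络开销与保留区等实际问题。

```python
import hashlib
import os

def prf_stream(seed: bytes, length: int) -> bytes:
    """用 SHA-256 计数器模式生成伪随机填充数据(仅作示意,非论文中的具体构造)。"""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

class Device:
    def __init__(self, memory_size: int):
        self.memory = bytearray(os.urandom(memory_size))  # 擦除前的"脏"内存

    def erase_and_prove(self, nonce: bytes) -> bytes:
        # 用验证者提供的 nonce 派生的伪随机数据覆盖全部可用内存,
        # 使设备没有剩余空间保留恶意代码,然后返回内存内容的哈希作为擦除证明。
        self.memory[:] = prf_stream(nonce, len(self.memory))
        return hashlib.sha256(bytes(self.memory)).digest()

class Verifier:
    def check(self, device: Device, memory_size: int) -> bool:
        nonce = os.urandom(16)
        proof = device.erase_and_prove(nonce)
        expected = hashlib.sha256(prf_stream(nonce, memory_size)).digest()
        return proof == expected

device = Device(memory_size=4096)
print("擦除证明通过:", Verifier().check(device, memory_size=4096))
```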
论文及项目相关链接
PDF Published at SECRYPT 2025
Summary
在基于软件的内存擦除协议中,验证者指示计算设备擦除其内存并返回擦除证明。这类协议旨在让低成本物联网设备恢复到安全状态、确保其不含恶意软件,而无需依赖安全硬件或对设备进行物理操控。本文首次对基于软件的内存擦除协议进行了实证研究,涵盖安全性、擦除保证和性能等方面。实验设置包括3台计算能力不同的现代物联网设备、7个协议、6种哈希函数实现以及多项性能与安全标准。结果表明,现有协议是可行的,但较慢的设备可能需要数秒才能完成内存擦除并生成证明。没有任何协议能在所有实验环境中占优,协议表现取决于设备的计算能力与内存大小、网络速度以及所需的安全等级。网络速度和协议规范中的隐藏常数对协议性能的影响比相关文献预期的更为显著。我们提供了一个评估框架,可在给定安全等级下确定哪些协议在性能与擦除保证之间取得最佳折衷。
Key Takeaways
- 基于软件的内存擦除协议旨在让低成本物联网设备恢复到安全状态,无需依赖安全硬件或物理操控。
- 本文首次对这些协议进行实证研究,考察其在真实设备上的可行性、性能和安全性。
- 实验涵盖多种现代物联网设备和内存擦除协议,以及多项性能与安全标准。
- 研究表明基于软件的内存擦除协议具有可行性,但较慢的设备可能需要较长时间完成擦除并生成证明。
- 不同协议的表现受设备计算能力与内存大小、网络速度和所需安全等级的影响,没有协议在所有场景下占优。
- 网络速度和协议规范中的隐藏常数对协议性能的影响比预期更为显著。
点此查看论文截图

Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
Authors:Yue Zhou, Litong Feng, Mengcheng Lan, Xue Yang, Qingyun Li, Yiping Ke, Xue Jiang, Wayne Zhang
Mathematical reasoning is critical for tasks such as precise distance and area computations, trajectory estimations, and spatial analysis in unmanned aerial vehicle (UAV) based remote sensing, yet current vision-language models (VLMs) have not been adequately tested in this domain. To address this gap, we introduce AVI-Math, the first benchmark to rigorously evaluate multimodal mathematical reasoning in aerial vehicle imagery, moving beyond simple counting tasks to include domain-specific knowledge in areas such as geometry, logic, and algebra. The dataset comprises 3,773 high-quality vehicle-related questions captured from UAV views, covering 6 mathematical subjects and 20 topics. The data, collected at varying altitudes and from multiple UAV angles, reflects real-world UAV scenarios, ensuring the diversity and complexity of the constructed mathematical problems. In this paper, we benchmark 14 prominent VLMs through a comprehensive evaluation and demonstrate that, despite their success on previous multimodal benchmarks, these models struggle with the reasoning tasks in AVI-Math. Our detailed analysis highlights significant limitations in the mathematical reasoning capabilities of current VLMs and suggests avenues for future research. Furthermore, we explore the use of Chain-of-Thought prompting and fine-tuning techniques, which show promise in addressing the reasoning challenges in AVI-Math. Our findings not only expose the limitations of VLMs in mathematical reasoning but also offer valuable insights for advancing UAV-based trustworthy VLMs in real-world applications. The code, and datasets will be released at https://github.com/VisionXLab/avi-math
在基于无人机(UAV)的遥感中,精确的距离与面积计算、轨迹估算和空间分析等任务都离不开数学推理,然而当前的视觉语言模型(VLM)在该领域尚未得到充分测试。为弥补这一空白,我们提出了AVI-Math,这是首个严格评估航拍图像中多模态数学推理能力的基准,它超越了简单的计数任务,涵盖几何、逻辑和代数等领域的专业知识。数据集包含从无人机视角拍摄的3773个高质量车辆相关问题,覆盖6个数学学科和20个主题。数据在不同飞行高度和多个无人机视角下采集,反映了真实的无人机应用场景,保证了所构建数学问题的多样性与复杂性。本文对14个主流VLM进行了全面的基准评估,结果表明,尽管这些模型在以往的多模态基准上表现出色,但在AVI-Math的推理任务上仍举步维艰。我们的详细分析揭示了当前VLM在数学推理能力上的重大局限,并指出了未来研究方向。此外,我们探索了思维链(Chain-of-Thought)提示和微调技术,它们在应对AVI-Math中的推理挑战方面展现出潜力。我们的研究不仅揭示了VLM在数学推理上的局限,也为推进真实应用中基于无人机的可信VLM提供了宝贵见解。代码和数据集将在https://github.com/VisionXLab/avi-math发布。
论文及项目相关链接
PDF 17 pages, 16 figures
Summary:
本文介绍了数学推理在无人机遥感中的重要作用,并指出了当前视觉语言模型在该领域的不足。为此,文章引入了AVI-Math基准测试,该测试能够严格评估无人机图像中的多模态数学推理能力,涵盖几何、逻辑和代数等领域。文章对比评估了14个领先的视觉语言模型,发现它们在AVI-Math的推理任务上表现不佳。此外,文章还探讨了使用Chain-of-Thought提示和微调技术的潜力。研究不仅揭示了当前视觉语言模型在数学推理上的局限性,还为在真实应用中推进基于无人机的可信视觉语言模型提供了有价值的见解。代码和数据集将在相应链接发布。
Key Takeaways:
- 数学推理在无人机遥感中至关重要,涉及精确距离和面积计算、轨迹估计和空间分析等任务。
- 当前视觉语言模型在无人机图像领域的多模态数学推理能力尚未得到充分测试。
- AVI-Math基准测试的引入,旨在严格评估无人机图像中的数学推理能力,包括几何、逻辑和代数等领域。
- 领先的视觉语言模型在AVI-Math的推理任务上表现不佳。
- Chain-of-Thought提示和微调技术在解决AVI-Math中的推理挑战方面显示出潜力。
- 研究揭示了当前视觉语言模型在数学推理方面的局限性。
点此查看论文截图






LaV-CoT: Language-Aware Visual CoT with Multi-Aspect Reward Optimization for Real-World Multilingual VQA
Authors:Jing Huang, Zhiya Tan, Shutao Gong, Fanwei Zeng, Jianshu Li
As large vision language models (VLMs) advance, their capabilities in multilingual visual question answering (mVQA) have significantly improved. Chain-of-thought (CoT) reasoning has been proven to enhance interpretability and complex reasoning. However, most existing approaches rely primarily on textual CoT and provide limited support for multilingual multimodal reasoning, constraining their deployment in real-world applications. To address this gap, we introduce LaV-CoT, the first Language-aware Visual CoT framework with Multi-Aspect Reward Optimization. LaV-CoT incorporates an interpretable multi-stage reasoning pipeline consisting of Text Summary with Bounding Box (BBox), Language Identification, Spatial Object-level Captioning, and Step-by-step Logical Reasoning. Following this reasoning pipeline, we design an automated data curation method that generates multilingual CoT annotations through iterative generation, correction, and refinement, enabling scalable and high-quality training data. To improve reasoning and generalization, LaV-CoT adopts a two-stage training paradigm combining Supervised Fine-Tuning (SFT) with Language-aware Group Relative Policy Optimization (GRPO), guided by verifiable multi-aspect rewards including language consistency, structural accuracy, and semantic alignment. Extensive evaluations on public datasets including MMMB, Multilingual MMBench, and MTVQA show that LaV-CoT achieves up to ~9.5% accuracy improvements over open-source baselines of similar size and even surpasses models with 2x larger scales by ~2.6%. Moreover, LaV-CoT outperforms advanced proprietary models such as GPT-4o-0513 and Gemini-2.5-flash. We further conducted an online A/B test to validate our method on real-world data, highlighting its effectiveness for industrial deployment. Our code is available at this link: https://github.com/HJNVR/LaV-CoT
随着大型视觉语言模型(VLM)的进步,它们在多语言视觉问答(mVQA)方面的能力得到了显著提升。思维链(CoT)推理被证明可以提高可解释性和复杂推理能力。然而,现有方法大多仅依赖文本CoT,对多语言多模态推理的支持有限,限制了它们在真实场景中的部署。为弥补这一差距,我们提出了LaV-CoT,这是首个带有多方面奖励优化的语言感知视觉CoT框架。LaV-CoT采用可解释的多阶段推理管道,包括带边界框(BBox)的文本摘要、语言识别、空间对象级描述和逐步逻辑推理。围绕这一推理管道,我们设计了一种自动化数据构建方法,通过迭代生成、修正和细化来产生多语言CoT标注,从而获得可扩展的高质量训练数据。为了提升推理和泛化能力,LaV-CoT采用两阶段训练范式,将监督微调(SFT)与语言感知的组相对策略优化(GRPO)相结合,并以语言一致性、结构准确性和语义对齐等可验证的多方面奖励作为指导。在MMMB、Multilingual MMBench和MTVQA等公共数据集上的广泛评估表明,LaV-CoT相比规模相近的开源基线模型实现了高达约9.5%的准确率提升,甚至比规模大2倍的模型还高出约2.6%。此外,LaV-CoT还超越了GPT-4o-0513和Gemini-2.5-flash等先进的专有模型。我们进一步通过线上A/B测试在真实数据上验证了该方法,凸显了其在工业部署中的有效性。我们的代码位于:https://github.com/HJNVR/LaV-CoT。
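下面用一个假设性的 Python 草图示意"把语言一致性、结构正确性、语义对齐等可验证信号加权为标量奖励"的做法;各子奖励的判定方式(CJK 启发式、<think>/<answer> 标签、精确匹配)都是为演示而设的占位实现,并非论文中的具体奖励定义。

```python
import re

def multi_aspect_reward(response: str, target_lang: str, answer: str,
                        w_lang=0.3, w_struct=0.3, w_sem=0.4) -> float:
    """把语言一致性、结构正确性、语义对齐三个可验证信号加权成标量奖励(示意实现)。"""
    # 1) 语言一致性:回答主要语言是否与目标语言一致(此处用极简启发式判断)
    has_cjk = bool(re.search(r"[\u4e00-\u9fff]", response))
    r_lang = 1.0 if (target_lang == "zh") == has_cjk else 0.0
    # 2) 结构正确性:是否同时包含要求的推理段与最终答案段
    r_struct = 1.0 if ("<think>" in response and "<answer>" in response) else 0.0
    # 3) 语义对齐:最终答案是否与参考答案匹配(真实系统可换成更软的打分器)
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    r_sem = 1.0 if m and m.group(1).strip() == answer.strip() else 0.0
    return w_lang * r_lang + w_struct * r_struct + w_sem * r_sem

resp = "<think>图中共有三只猫。</think><answer>3</answer>"
print(multi_aspect_reward(resp, target_lang="zh", answer="3"))  # 1.0
```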
论文及项目相关链接
PDF 12 Pages, 12 Figures, 2 Tables
Summary:随着大型视觉语言模型(VLM)的发展,它们在多语言视觉问答(mVQA)中的能力显著提高。本文介绍了LaV-CoT,一种具有多方面奖励优化的语言感知视觉思维链框架。它通过可解释的多阶段推理管道,结合文本摘要、语言识别、空间对象级描述和逐步逻辑推理,提高推理和泛化能力。在公共数据集上的评估表明,LaV-CoT相比规模相近的开源基线实现了高达9.5%的准确率提升,甚至比规模大2倍的模型还高出约2.6%。此外,LaV-CoT在线上A/B测试中的表现验证了其在工业部署中的有效性。
Key Takeaways:
- 大型视觉语言模型在多语言视觉问答上的性能显著提高。
- LaV-CoT是首个具有多方面奖励优化的语言感知视觉思维链框架。
- LaV-CoT通过多阶段推理管道提高理解和泛化能力。
- LaV-CoT采用自动化数据整理方法生成多语言思维链注释。
- LaV-CoT通过结合监督微调与语言感知组相对策略优化来提高推理能力。
- LaV-CoT在公共数据集上的表现优于类似规模的开源模型和大型模型。
- LaV-CoT在在线A/B测试中的有效性验证了其在工业部署中的实用性。
点此查看论文截图




Unsupervised Hallucination Detection by Inspecting Reasoning Processes
Authors:Ponhvoan Srey, Xiaobao Wu, Anh Tuan Luu
Unsupervised hallucination detection aims to identify hallucinated content generated by large language models (LLMs) without relying on labeled data. While unsupervised methods have gained popularity by eliminating labor-intensive human annotations, they frequently rely on proxy signals unrelated to factual correctness. This misalignment biases detection probes toward superficial or non-truth-related aspects, limiting generalizability across datasets and scenarios. To overcome these limitations, we propose IRIS, an unsupervised hallucination detection framework, leveraging internal representations intrinsic to factual correctness. IRIS prompts the LLM to carefully verify the truthfulness of a given statement, and obtain its contextualized embedding as informative features for training. Meanwhile, the uncertainty of each response is considered a soft pseudolabel for truthfulness. Experimental results demonstrate that IRIS consistently outperforms existing unsupervised methods. Our approach is fully unsupervised, computationally low cost, and works well even with few training data, making it suitable for real-time detection.
无监督的幻觉检测旨在识别大型语言模型(LLM)产生的幻觉内容,而无需依赖标注数据。虽然无监督方法通过消除劳动密集型人类注释而广受欢迎,但它们经常依赖于与事实正确性无关的代理信号。这种不匹配会使检测探针偏向于表面或非真实相关的方面,从而限制了跨数据集和场景的可泛化能力。为了克服这些局限性,我们提出了IRIS,这是一个利用内在事实正确性表示的无监督幻觉检测框架。IRIS提示LLM仔细验证给定陈述的真实性,并将其上下文嵌入作为训练信息特征。同时,每个响应的不确定性被视为真实性的软伪标签。实验结果表明,IRIS始终在现有无监督方法中表现最佳。我们的方法完全无监督,计算成本低,即使在少量训练数据的情况下也能很好地工作,非常适合实时检测。
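下面是一个示意性的草图,演示"用真实性核查提示下得到的上下文嵌入作为特征、用回答的不确定度作为软伪标签来训练线性探针"的流程;这里用随机生成的向量代替真实的 LLM 嵌入,仅展示训练逻辑,并非 IRIS 的官方实现。

```python
import numpy as np

def train_soft_label_probe(features: np.ndarray, soft_labels: np.ndarray,
                           lr=0.1, epochs=200):
    """用"软伪标签"(模型对陈述真实性的把握程度)训练一个线性探针。
    实际系统中 features 来自 LLM 在"请核实该陈述是否属实"提示下的上下文嵌入,
    soft_labels 来自模型回答的不确定度;这里用随机数据演示训练流程。"""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - soft_labels           # 软标签下交叉熵损失的梯度
        w -= lr * features.T @ grad / n
        b -= lr * grad.mean()
    return w, b

rng = np.random.default_rng(0)
emb_dim, n_true, n_false = 16, 50, 50
true_emb = rng.normal(+0.5, 1.0, (n_true, emb_dim))    # 假设:真实陈述的嵌入
false_emb = rng.normal(-0.5, 1.0, (n_false, emb_dim))  # 假设:幻觉陈述的嵌入
X = np.vstack([true_emb, false_emb])
soft_y = np.concatenate([rng.uniform(0.7, 1.0, n_true),   # 模型较确定为真
                         rng.uniform(0.0, 0.3, n_false)]) # 模型较确定为假
w, b = train_soft_label_probe(X, soft_y)
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
print("真/假两组的平均得分:", scores[:n_true].mean(), scores[n_true:].mean())
```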
论文及项目相关链接
PDF To appear in EMNLP 2025
Summary:
本文介绍了无监督幻觉检测的目标和方法。尽管无监督方法消除了对大量标注数据的依赖,但它们经常依赖于与事实正确性无关的代理信号,导致检测结果偏向表面或非真理相关的方面,限制了其在不同数据集和场景中的泛化能力。针对此问题,提出了一种利用内在事实正确性的表示来检测幻觉的框架IRIS。IRIS通过提示大型语言模型(LLM)仔细验证给定陈述的真实性,并获取其上下文嵌入作为训练信息特征。同时,每个响应的不确定性被视为真实性的软伪标签。实验结果表明,IRIS在现有无监督方法中表现优异。该方法完全无监督,计算成本低,即使在少量训练数据下也能良好工作,适合实时检测。
Key Takeaways:
- 无监督幻觉检测旨在识别大型语言模型生成的幻觉内容,无需依赖标注数据。
- 当前无监督方法存在依赖与事实正确性无关的代理信号的问题。
- IRIS框架利用内在事实正确性的表示来检测幻觉。
- IRIS通过提示LLM验证陈述的真实性来获取上下文嵌入。
- 每个响应的不确定性被视为真实性的软伪标签。
- 实验结果表明IRIS在无监督方法中表现优越。
点此查看论文截图





SmartCoder-R1: Towards Secure and Explainable Smart Contract Generation with Security-Aware Group Relative Policy Optimization
Authors:Lei Yu, Jingyuan Zhang, Xin Wang, Jiajia Ma, Li Yang, Fengjun Zhang
Smart contracts automate the management of high-value assets, where vulnerabilities can lead to catastrophic financial losses. This challenge is amplified in Large Language Models (LLMs) by two interconnected failures: they operate as unauditable “black boxes” lacking a transparent reasoning process, and consequently, generate code riddled with critical security vulnerabilities. To address both issues, we propose SmartCoder-R1 (based on Qwen2.5-Coder-7B), a novel framework for secure and explainable smart contract generation. It begins with Continual Pre-training (CPT) to specialize the model. We then apply Long Chain-of-Thought Supervised Fine-Tuning (L-CoT SFT) on 7,998 expert-validated reasoning-and-code samples to train the model to emulate human security analysis. Finally, to directly mitigate vulnerabilities, we employ Security-Aware Group Relative Policy Optimization (S-GRPO), a reinforcement learning phase that refines the generation policy by optimizing a weighted reward signal for compilation success, security compliance, and format correctness. Evaluated against 17 baselines on a benchmark of 756 real-world functions, SmartCoder-R1 establishes a new state of the art, achieving top performance across five key metrics: a ComPass of 87.70%, a VulRate of 8.60%, a SafeAval of 80.16%, a FuncRate of 53.84%, and a FullRate of 50.53%. This FullRate marks a 45.79% relative improvement over the strongest baseline, DeepSeek-R1. Crucially, its generated reasoning also excels in human evaluations, achieving high-quality ratings for Functionality (82.7%), Security (85.3%), and Clarity (90.7%).
智能合约自动管理高价值资产,其中的漏洞可能导致灾难性的财务损失。在大型语言模型(LLM)中,这一挑战被两个相互关联的问题放大:模型是无法审计的"黑箱",缺乏透明的推理过程,因而生成的代码充斥着关键安全漏洞。为同时解决这两个问题,我们提出了基于Qwen2.5-Coder-7B的SmartCoder-R1,一个面向安全且可解释的智能合约生成的新型框架。它首先通过持续预训练(CPT)使模型专业化;随后在7998个经专家验证的"推理+代码"样本上进行长链思维监督微调(L-CoT SFT),训练模型模拟人类安全分析;最后,为直接缓解漏洞,我们采用安全感知的组相对策略优化(S-GRPO)强化学习阶段,通过优化由编译成功、安全合规和格式正确组成的加权奖励信号来改进生成策略。在包含756个真实世界函数的基准测试上与17个基线对比,SmartCoder-R1确立了新的最先进水平,在五个关键指标上均取得最佳表现:ComPass达87.70%,VulRate为8.60%,SafeAval为80.16%,FuncRate为53.84%,FullRate为50.53%;其中FullRate相比最强基线DeepSeek-R1相对提升45.79%。关键的是,其生成的推理过程在人工评估中同样表现出色,在功能性(82.7%)、安全性(85.3%)和清晰度(90.7%)上均获得高质量评分。
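下面给出 S-GRPO 中"加权奖励信号"思路的一个假设性草图:把编译成功、安全合规、格式正确三个信号线性加权为标量奖励。真实系统中这些信号应由 Solidity 编译器和安全扫描器产生,此处的判定逻辑与权重均为示意。

```python
def s_grpo_style_reward(code: str,
                        compiles_ok: bool,
                        vulnerability_count: int,
                        w_compile=0.4, w_security=0.4, w_format=0.2) -> float:
    """把编译成功、安全合规、格式正确三个信号加权为标量奖励(示意实现)。
    compiles_ok / vulnerability_count 在真实系统中应由编译器与安全扫描器给出。"""
    r_compile = 1.0 if compiles_ok else 0.0
    r_security = 1.0 if vulnerability_count == 0 else max(0.0, 1.0 - 0.5 * vulnerability_count)
    # 格式检查:以 pragma 开头且花括号配对(仅作占位判定)
    r_format = 1.0 if code.strip().startswith("pragma solidity") and code.count("{") == code.count("}") else 0.0
    return w_compile * r_compile + w_security * r_security + w_format * r_format

sample = "pragma solidity ^0.8.0;\ncontract Vault { function deposit() public payable {} }"
print(s_grpo_style_reward(sample, compiles_ok=True, vulnerability_count=0))  # 1.0
```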
论文及项目相关链接
Summary
智能合约在高价值资产的管理中发挥着自动化的作用,但同时也面临着严重的安全挑战。大型语言模型(LLMs)在此问题上面临两大挑战:一是缺乏透明度,二是生成的代码存在严重的安全漏洞。为解决这些问题,我们提出了SmartCoder-R1框架,它通过连续预训练(CPT)来专业化模型,并运用长思考链监督微调(L-CoT SFT)来训练模型以模拟人类安全分析。此外,通过安全感知的群体相对策略优化(S-GRPO)直接解决漏洞问题。评估结果表明,SmartCoder-R1在五个关键指标上均达到最佳性能。它不仅在安全测试中表现优异,同时在人类评估中也获得了高度评价。总的来说,该框架为实现智能合约的可靠与安全应用奠定了坚实基础。
Key Takeaways
- 智能合约在资产管理中的自动化作用,但其安全隐患严重,尤其是在大型语言模型(LLMs)的应用中更为突出。
- LLMs面临两大挑战:缺乏透明度和生成代码的安全漏洞问题。
- SmartCoder-R1框架通过连续预训练、长思考链监督微调和安全感知的群体相对策略优化来解决上述问题。
- SmartCoder-R1在五个关键指标上达到最佳性能,并在安全测试中表现出色。
- 该框架生成的推理过程在人类评估中也获得了高质量评价,体现了其实际应用的潜力。
- SmartCoder-R1框架不仅提升了智能合约的性能,也为其在实际应用中的可靠性与安全保障提供了重要支持。
点此查看论文截图


Latency and Token-Aware Test-Time Compute
Authors:Jenny Y. Huang, Mehul Damani, Yousef El-Kurdi, Ramon Astudillo, Wei Sun
Inference-time scaling has emerged as a powerful way to improve large language model (LLM) performance by generating multiple candidate responses and selecting among them. However, existing work on dynamic allocation for test-time compute typically considers only parallel generation methods such as best-of-N, overlooking incremental decoding methods like beam search, and has largely ignored latency, focusing only on token usage. We formulate inference-time scaling as a problem of dynamic compute allocation and method selection, where the system must decide which strategy to apply and how much compute to allocate on a per-query basis. Our framework explicitly incorporates both token cost and wall-clock latency, the latter being critical for user experience and particularly for agentic workflows where models must issue multiple queries efficiently. Experiments on reasoning benchmarks show that our approach consistently outperforms static strategies, achieving favorable accuracy-cost trade-offs while remaining practical for deployment.
推理时间缩放(inference-time scaling)已成为提升大型语言模型(LLM)性能的有力手段:通过生成多个候选响应并从中选择。然而,现有关于测试时间计算动态分配的工作通常只考虑best-of-N等并行生成方法,忽略了集束搜索等增量解码方法,而且大多只关注令牌用量而忽视延迟。我们将推理时间缩放形式化为动态计算分配与方法选择问题:系统必须针对每个查询决定采用哪种策略以及分配多少计算资源。我们的框架显式地同时考虑令牌成本和墙钟延迟,后者对用户体验至关重要,特别是在模型需要高效发出多个查询的代理(agentic)工作流中。在推理基准测试上的实验表明,我们的方法始终优于静态策略,在保持可部署性的同时取得了良好的精度与成本权衡。
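下面用一个极简的 Python 草图示意"按查询选择推理时策略"的思路:对每个候选策略估计准确率、令牌消耗和墙钟延迟,选取效用最高者。其中的数值、效用函数形式和权重均为假设,并非论文中的具体建模。

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    expected_accuracy: float   # 对该查询难度的预估准确率(假设由难度预测器给出)
    expected_tokens: int       # 预计消耗的令牌数
    expected_latency_s: float  # 预计墙钟延迟(秒);并行采样延迟接近单次,但令牌成倍

def pick_strategy(strategies, lam_token=2e-5, lam_latency=0.02):
    """逐查询选择效用最高的推理时策略:效用 = 预估准确率 - 令牌成本 - 延迟成本。"""
    def utility(s):
        return (s.expected_accuracy
                - lam_token * s.expected_tokens
                - lam_latency * s.expected_latency_s)
    return max(strategies, key=utility)

candidates = [
    Strategy("greedy",        0.62,  500, 2.0),
    Strategy("best-of-8",     0.74, 4000, 2.5),   # 并行生成:令牌多、延迟略增
    Strategy("beam-search-4", 0.71, 2200, 6.0),   # 增量解码:令牌较少、延迟更高
]
print("选中的策略:", pick_strategy(candidates).name)  # best-of-8
```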
论文及项目相关链接
Summary
本文介绍了推理时间缩放作为一种强大的方法,通过生成多个候选响应并选择其中的最佳响应来提高大型语言模型(LLM)的性能。然而,现有的测试时间计算动态分配通常只考虑并行生成方法,如最佳N选择法,而忽视了增量解码方法,如集束搜索,并且主要忽视了延迟问题,只关注令牌使用量。本文将推理时间缩放制定为动态计算分配和方法选择的问题,系统必须在每个查询的基础上决定应用哪种策略以及分配多少计算资源。该框架显式地包含了令牌成本和墙钟延迟,后者对于用户体验至关重要,尤其是在需要模型高效发出多个查询的代理工作流程中。实验表明,该方法在推理基准测试上始终优于静态策略,实现了有利的精度成本权衡,并且适用于部署。
Key Takeaways
- 推理时间缩放能提高大型语言模型(LLM)的性能,通过生成多个候选响应并选择最佳响应。
- 现有工作主要关注并行生成方法,如最佳N选择法,但忽视了增量解码方法,如集束搜索。
- 现有的测试时间计算动态分配研究主要关注令牌使用量,而忽视了延迟问题。
- 本文将推理时间缩放制定为动态计算分配和方法选择的问题。
- 系统需决定应用何种策略及分配多少计算资源于每个查询。
- 框架考虑了令牌成本和墙钟延迟,后者在代理工作流程中尤其重要。
点此查看论文截图



Topic-Guided Reinforcement Learning with LLMs for Enhancing Multi-Document Summarization
Authors:Chuyuan Li, Austin Xu, Shafiq Joty, Giuseppe Carenini
A key challenge in Multi-Document Summarization (MDS) is effectively integrating information from multiple sources while maintaining coherence and topical relevance. While Large Language Models have shown impressive results in single-document summarization, their performance on MDS still leaves room for improvement. In this paper, we propose a topic-guided reinforcement learning approach to improve content selection in MDS. We first show that explicitly prompting models with topic labels enhances the informativeness of the generated summaries. Building on this insight, we propose a novel topic reward within the Group Relative Policy Optimization (GRPO) framework to measure topic alignment between the generated summary and source documents. Experimental results on the Multi-News and Multi-XScience datasets demonstrate that our method consistently outperforms strong baselines, highlighting the effectiveness of leveraging topical cues in MDS.
多文档摘要(MDS)中的一个关键挑战是如何有效地整合多源信息,同时保持连贯性和主题相关性。虽然大型语言模型在单文档摘要中取得了令人印象深刻的结果,但它们在MDS上的表现仍有待提高。在本文中,我们提出了一种基于主题引导的强化学习方法,以改进MDS中的内容选择。我们首先表明,通过主题标签明确提示模型可以增强生成摘要的信息性。在此基础上,我们在组相对策略优化(GRPO)框架内提出了一种新的主题奖励措施,以衡量生成摘要与源文档之间的话题对齐程度。在Multi-News和Multi-XScience数据集上的实验结果表明,我们的方法始终优于强大的基线方法,突出了在MDS中利用主题线索的有效性。
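下面是"话题奖励"思想的一个假设性最小示意:统计源文档话题在生成摘要中的覆盖比例作为奖励。论文中的话题对齐度量方式可能不同,这里的子串匹配只是占位实现。

```python
def topic_reward(summary: str, source_topics: set) -> float:
    """话题奖励的极简示意:统计源文档话题标签在生成摘要中被覆盖的比例。
    真实系统中话题对齐可由 LLM 或话题模型给出,这里用子串匹配作占位。"""
    if not source_topics:
        return 0.0
    covered = sum(1 for t in source_topics if t in summary)
    return covered / len(source_topics)

docs_topics = {"气候变化", "碳排放", "国际协议"}
summary_a = "多国就减少碳排放达成新的国际协议,以应对气候变化。"
summary_b = "本文讨论了体育赛事的最新结果。"
print(topic_reward(summary_a, docs_topics), topic_reward(summary_b, docs_topics))  # 1.0 0.0
```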
论文及项目相关链接
Summary:多文档摘要(MDS)的关键挑战在于如何有效整合多源信息,同时保持连贯性和主题相关性。本文提出一种基于话题引导强化学习的改进内容选择方法。实验结果表明,在Multi-News和Multi-XScience数据集上,该方法持续优于强基线,突显了在MDS中利用主题线索的有效性。
Key Takeaways:
- 多文档摘要(MDS)面临有效整合多源信息的同时保持连贯性和主题相关性的挑战。
- 大型语言模型在单文档摘要中表现出色,但在MDS方面的性能仍有提升空间。
- 显式地通过话题标签提示模型可增强生成摘要的信息性。
- 在组相对策略优化(GRPO)框架内提出了一种新的话题奖励,用于衡量生成摘要与源文档之间的话题对齐程度。
- 实验结果表明,在Multi-News和Multi-XScience数据集上,该方法优于强基线。
- 利用话题线索在MDS中至关重要,可有效提高摘要的质量。
点此查看论文截图


The Thinking Therapist: Training Large Language Models to Deliver Acceptance and Commitment Therapy using Supervised Fine-Tuning and Odds Ratio Policy Optimization
Authors:Talha Tahir
Acceptance and Commitment Therapy (ACT) is a third-wave cognitive behavioral therapy with emerging evidence of efficacy in several psychiatric conditions. This study investigates the impact of post-training methodology and explicit reasoning on the ability of a small open-weight large language model (LLM) to deliver ACT. Using 50 sets of synthetic ACT transcripts generated by Mistral-Large, we trained Llama-3.2-3b-Instruct with two distinct approaches, supervised fine-tuning (SFT) and odds ratio policy optimization (ORPO), each with and without an explicit chain-of-thought (COT) reasoning step. Performance was evaluated by comparing these four post-trained variants against the base Instruct model. These models were benchmarked in simulated therapy sessions, with performance quantitatively assessed on the ACT Fidelity Measure (ACT-FM) and the Therapist Empathy Scale (TES) by an LLM judge that had been fine-tuned on human evaluations. Our findings demonstrate that the ORPO-trained models significantly outperformed both their SFT and Instruct counterparts on ACT fidelity ($\chi^2(5) = 185.15, p < .001$) and therapeutic empathy ($\chi^2(5) = 140.37, p < .001$). The effect of COT was conditional as it provided a significant benefit to SFT models, improving ACT-FM scores by an average of 2.68 points ($p < .001$), while offering no discernible advantage to the superior ORPO or instruct-tuned variants. We posit that the superiority of ORPO stems from its ability to learn the therapeutic 'process' over imitating 'content,' a key aspect of ACT, while COT acts as a necessary scaffold for models trained only via imitation. This study establishes that preference-aligned policy optimization can effectively instill ACT competencies in small LLMs, and that the utility of explicit reasoning is highly dependent on the underlying training paradigm.
接纳承诺疗法(ACT)是第三波认知行为疗法,在多种精神疾病中已有初步疗效证据。本研究探讨了训练后方法与显式推理对小型开源权重大语言模型(LLM)实施ACT能力的影响。我们使用由Mistral-Large生成的50组合成ACT治疗对话记录,以两种不同方式训练Llama-3.2-3b-Instruct:监督微调(SFT)和比值比策略优化(ORPO),每种方式都分别在有无显式思维链(COT)推理步骤的条件下进行。我们将这四个训练后变体与基础Instruct模型进行比较,在模拟治疗会话中,由一个在人工评估数据上微调过的LLM评审,依据ACT忠实度量表(ACT-FM)和治疗师共情量表(TES)进行量化评估。结果显示,经ORPO训练的模型在ACT忠实度(χ²(5)=185.15,p<.001)和治疗共情(χ²(5)=140.37,p<.001)上均显著优于SFT与Instruct对应模型。COT的作用是有条件的:它为SFT模型带来显著收益,使ACT-FM得分平均提高2.68分(p<.001),但对表现更优的ORPO或指令微调变体没有明显优势。我们认为ORPO的优势源于它学习的是治疗"过程"而非模仿"内容"(这是ACT的关键要素),而COT则是仅靠模仿训练的模型所必需的支架。本研究表明,偏好对齐的策略优化能够有效地让小型LLM掌握ACT能力,而显式推理的效用高度依赖于底层训练范式。
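下面用 NumPy 给出比值比(odds ratio)偏好项的一个示意性实现:序列概率取长度归一化对数概率的指数,总损失为被选回答的负对数似然加上 λ 倍的比值比惩罚。该形式取自 ORPO 相关文献中的常见表述,具体公式以原论文为准;示例中的 token 对数概率为假设数据。

```python
import numpy as np

def sequence_prob(token_logprobs):
    """长度归一化的序列概率 P(y|x) = exp(平均 token 对数概率)。"""
    return float(np.exp(np.mean(token_logprobs)))

def orpo_style_loss(chosen_logprobs, rejected_logprobs, lam=0.1):
    """ORPO 风格损失的示意:SFT 负对数似然 + λ * 比值比惩罚项,
    鼓励"被选回答"的几率(odds)高于"被拒回答"。"""
    p_w = sequence_prob(chosen_logprobs)
    p_l = sequence_prob(rejected_logprobs)
    odds_w = p_w / (1.0 - p_w)
    odds_l = p_l / (1.0 - p_l)
    log_or = np.log(odds_w) - np.log(odds_l)
    l_or = -np.log(1.0 / (1.0 + np.exp(-log_or)))   # -log sigmoid(log odds ratio)
    l_sft = -float(np.mean(chosen_logprobs))        # 对被选回答的负对数似然
    return l_sft + lam * l_or

# 假设的 token 级对数概率:被选回答(更符合 ACT 过程)概率更高
chosen = [-0.2, -0.3, -0.25, -0.1]
rejected = [-1.2, -0.9, -1.5, -1.1]
print(round(orpo_style_loss(chosen, rejected), 4))
```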
论文及项目相关链接
Summary
本研究探讨了接纳与承诺疗法(ACT)在大型语言模型(LLM)中的应用效果。通过两种不同的训练方法(监督精细调整(SFT)和赔率比率策略优化(ORPO))以及是否采用显式推理链(COT),对LLM进行了训练。结果显示,ORPO训练模型在ACT的忠实度与同情心方面显著优于其他模型。研究认为,ORPO的优势在于其能够学习治疗过程而非单纯模仿内容,这是ACT的核心要素;而COT对于仅通过模仿训练的模型有辅助作用。本研究确立了偏好对齐策略优化能有效赋予小型LLM执行ACT的能力,并揭示了显式推理的实用性取决于底层训练范式。
Key Takeaways
- ACT作为第三波认知行为疗法,在精神疾病的疗效方面存在新兴证据。
- 通过两种不同的训练方法(SFT和ORPO)以及是否结合COT,研究了ACT在LLM中的应用。
- ORPO训练模型在模拟治疗会话中的ACT忠实度和同情心方面显著优于其他模型。
- COT对仅通过模仿训练的模型有辅助作用,但对经过ORPO训练的模型没有显著优势。
- ORPO的优势在于它更注重学习治疗过程而非内容模仿,这是ACT的核心要素。
- 偏好对齐策略优化能有效赋予小型LLM执行ACT的能力。
点此查看论文截图






OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning
Authors:Yuecheng Liu, Dafeng Chi, Shiguang Wu, Zhanguang Zhang, Yuzheng Zhuang, Bowen Yang, He Zhu, Lingfeng Zhang, Pengwei Xie, David Gamaliel Arcos Bravo, Yingxue Zhang, Jianye Hao, Xingyue Quan
Recent advances in multimodal large language models (MLLMs) have opened new opportunities for embodied intelligence, enabling multimodal understanding, reasoning, and interaction, as well as continuous spatial decision-making. Nevertheless, current MLLM-based embodied systems face two critical limitations. First, Geometric Adaptability Gap: models trained solely on 2D inputs or with hard-coded 3D geometry injection suffer from either insufficient spatial information or restricted 2D generalization, leading to poor adaptability across tasks with diverse spatial demands. Second, Embodiment Constraint Gap: prior work often neglects the physical constraints and capacities of real robots, resulting in task plans that are theoretically valid but practically infeasible. To address these gaps, we introduce OmniEVA – an embodied versatile planner that enables advanced embodied reasoning and task planning through two pivotal innovations: (1) a Task-Adaptive 3D Grounding mechanism, which introduces a gated router to perform explicit selective regulation of 3D fusion based on contextual requirements, enabling context-aware 3D grounding for diverse embodied tasks. (2) an Embodiment-Aware Reasoning framework that jointly incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable. Extensive experimental results demonstrate that OmniEVA not only achieves state-of-the-art general embodied reasoning performance, but also exhibits a strong ability across a wide range of downstream scenarios. Evaluations of a suite of proposed embodied benchmarks, including both primitive and composite tasks, confirm its robust and versatile planning capabilities. Project page: https://omnieva.github.io
近期多模态大型语言模型(MLLMs)的进展为实体智能带来了新的机会,实现了多模态理解、推理和交互,以及连续的空间决策。然而,当前基于MLLM的实体系统面临两个关键局限。首先,几何适应性差距:仅通过2D输入进行训练或在3D几何注入中硬编码的模型,由于缺乏空间信息或2D泛化受限,导致在不同空间需求的任务中的适应性较差。其次,实体约束差距:早期的研究往往忽视了真实机器人的物理约束和能力,导致任务计划在理论上有效,但在实践中却不可行。为了弥补这些差距,我们引入了OmniEVA——一个实体通用规划器,它通过两个关键创新实现了先进的实体推理和任务规划:(1)任务自适应3D接地机制,它引入了一个门控路由器,根据上下文要求进行显式的选择性调节3D融合,为实现多样化的实体任务的上下文感知3D接地。 (2)实体感知推理框架,该框架将任务目标和实体约束纳入推理循环中,从而做出既以目标为导向又可行的规划决策。大量的实验结果表明,OmniEVA不仅达到了最先进的实体推理性能,而且在广泛的下游场景中表现出了强大的能力。对一系列提出的实体基准测试的评价,包括基本任务和复合任务,都证实了其稳健和通用的规划能力。项目页面:https://omnieva.github.io
论文及项目相关链接
Summary
本文介绍了近期多模态大型语言模型(MLLMs)的进步为实体智能带来了新机遇,实现了多模态理解、推理和交互,以及连续的空间决策。然而,当前的MLLM实体系统面临两大局限:一是几何适应性差距,二是实体约束差距。为解决这个问题,提出了OmniEVA——一个实体通用规划器,通过两项创新技术解决这些问题:一是任务适应性3D接地机制,二是实体感知推理框架。OmniEVA不仅实现了先进的实体推理和任务规划,而且表现出跨多种下游场景的强大能力。
Key Takeaways
- 多模态大型语言模型(MLLMs)的进步推动了实体智能的发展,包括多模态理解、推理和交互,以及连续的空间决策能力。
- 当前MLLM实体系统存在两大局限:几何适应性差距和实体约束差距。
- OmniEVA是一个实体通用规划器,通过任务适应性3D接地机制和实体感知推理框架两项创新技术来解决上述局限。
- 任务适应性3D接地机制根据上下文需求进行3D融合的显式选择性调节,实现多样化的实体任务的上下文感知3D接地。
- 实体感知推理框架将任务目标和实体约束纳入推理循环中,使规划决策既具有目标导向性又可行。
- OmniEVA不仅实现了先进的实体推理和任务规划,而且在多种下游场景中表现出强大的能力。
点此查看论文截图



Parallel-R1: Towards Parallel Thinking via Reinforcement Learning
Authors:Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, Dong Yu
Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging, as existing methods predominantly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced imitation rather than exploration and generalization. Different from them, we propose \textbf{Parallel-R1}, the first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks. Our framework employs a progressive curriculum that explicitly addresses the cold-start problem in training parallel thinking with RL. We first use SFT on prompt-generated trajectories from easier tasks to instill the parallel thinking ability, then transition to RL to explore and generalize this skill on harder problems. Experiments on various math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully instills parallel thinking, leading to 8.4% accuracy improvements over the sequential thinking model trained directly on challenging tasks with RL. Further analysis reveals a clear shift in the model’s thinking behavior: at an early stage, it uses parallel thinking as an exploration strategy, while in a later stage, it uses the same capability for multi-perspective verification. Most significantly, we validate parallel thinking as a \textbf{mid-training exploration scaffold}, where this temporary exploratory phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the baseline on AIME25. Our model, data, and code will be open-source at https://github.com/zhengkid/Parallel-R1.
并行思考已成为一种通过同时探索多个推理路径来提高大语言模型(LLM)推理能力的新型方法。然而,通过训练激活这种能力仍然具有挑战性,因为现有方法主要依赖于合成数据上的监督微调(SFT),这鼓励了教师强制模仿,而不是探索和泛化。与它们不同,我们提出了Parallel-R1,这是第一个能够在复杂的现实世界推理任务中启用并行思考行为的强化学习(RL)框架。我们的框架采用了一种渐进的课程学习,明确解决了训练并行思考时RL面临的冷启动问题。我们首先使用SFT对来自简单任务的提示生成轨迹进行训练,以培养并行思考能力,然后过渡到RL以在更复杂的问题上探索和推广这项技能。在各种数学基准测试上的实验,包括MATH、AMC23和AIME,表明Parallel-R1成功培养了并行思考能力,相对于直接在具有挑战性的任务上使用RL训练的顺序思考模型,其准确性提高了8.4%。进一步的分析表明,模型的行为有明显的转变:在早期阶段,它使用并行思考作为探索策略,而在后期阶段,它使用相同的能力进行多视角验证。最重要的是,我们验证了并行思考作为训练中期探索脚手架,这一临时探索阶段在RL之后开启了更高的性能上限,在AIME25上的改进比基线提高了42.9%。我们的模型、数据和代码将在https://github.com/zhengkid/Parallel-R1上开源。
论文及项目相关链接
PDF Project website: https://zhengkid.github.io/Parallel_R1.github.io/
Summary
本文介绍了并行思维在提升大型语言模型的推理能力中的新兴作用。然而,通过训练激活这种能力仍然具有挑战性。主流方法主要依赖于合成数据上的监督微调(SFT),这鼓励教师强制模仿,而非探索和泛化。不同于此,本文提出了名为Parallel-R1的强化学习框架,该框架首次实现了复杂现实世界推理任务的并行思维行为。它采用渐进的课程策略来明确解决训练并行思维中的冷启动问题。首先使用基于简单任务的提示生成轨迹进行SFT,以灌输并行思维能力,然后过渡到RL以在此技能上探索并泛化难题。实验表明,Parallel-R1成功培养了并行思维,在MATH、AMC23和AIME等数学基准测试上实现了比通过RL直接训练顺序思维的模型高出8.4%的准确率。进一步的分析显示,模型的行为发生了明显的转变:在早期阶段,它使用并行思维作为探索策略,而在后期阶段则将其作为多视角验证的能力。最重要的是,验证了并行思维作为训练过程中的临时探索架构的作用,该架构能够在RL后解锁更高的性能上限,在AIME25上的改进率达到了惊人的42.9%。模型和数据将在https://github.com/zhengkid/Parallel-R1上开源。
Key Takeaways
- 并行思维是一种新兴方法,用于增强大型语言模型的推理能力。
- 强化学习框架Parallel-R1支持复杂现实推理任务的并行思维行为。
- Parallel-R1采用渐进的课程策略来解决训练并行思维的冷启动问题。
- 该框架结合了监督微调(SFT)和强化学习(RL),先灌输并行思维能力,再探索并泛化难题。
- 实验表明Parallel-R1能有效培养并行思维,并在多个数学基准测试上实现较高准确率。
- 模型的行为在训练过程中发生变化,从早期作为探索策略的并行思维转变为后期的多视角验证能力。
点此查看论文截图



K2-Think: A Parameter-Efficient Reasoning System
Authors:Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, Suqi Sun, Hector Ren, Alexander Moreno, Daqian Zhang, Tianjun Zhong, Yuxin Xiong, Yuanzhe Hu, Yutao Xie, Xudong Han, Yuqi Wang, Varad Pimpalkhute, Yonghao Zhuang, Aaryamonvikram Singh, Xuezhi Liang, Anze Xie, Jianshu She, Desai Fan, Chengqian Gao, Liqun Ma, Mikhail Yurochkin, John Maggs, Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing
K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model, matching or surpassing much larger models like GPT-OSS 120B and DeepSeek v3.1. Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets. K2-Think excels in mathematical reasoning, achieving state-of-the-art scores on public benchmarks for open-source models, while also performing strongly in other areas such as Code and Science. Our results confirm that a more parameter-efficient model like K2-Think 32B can compete with state-of-the-art systems through an integrated post-training recipe that includes long chain-of-thought training and strategic inference-time enhancements, making open-source reasoning systems more accessible and affordable. K2-Think is freely available at k2think.ai, offering best-in-class inference speeds of over 2,000 tokens per second per request via the Cerebras Wafer-Scale Engine.
K2-Think是一个推理系统,它凭借一个32B参数的模型实现了前沿性能,可与GPT-OSS 120B和DeepSeek v3.1等大型模型相媲美甚至表现更佳。我们的系统建立在Qwen2.5基础模型上,展示了通过结合先进的后训练和测试时计算技术,较小模型也能在最高水平上竞争。该方法基于六大关键技术支柱:长链思维监督微调、可验证奖励强化学习(RLVR)、推理前的代理式规划、测试时缩放、投机解码和推理优化硬件,且全部使用公开可用的开源数据集。K2-Think在数学推理方面表现出色,在公开基准测试上取得了开源模型中的前沿得分,同时在代码和科学等其他领域也表现强劲。我们的结果证实,像K2-Think 32B这样参数效率更高的模型,可以通过集成的后训练方案(包括长链思维训练和有针对性的推理时增强)与最前沿系统相竞争,使开源推理系统更加易用且可负担。K2-Think已在k2think.ai上免费开放,通过Cerebras晶圆级引擎为每个请求提供每秒超过2000个token的一流推理速度。
论文及项目相关链接
PDF To access the K2-Think reasoning system, please visit www.k2think.ai
Summary
K2-Think是一款结合先进后训练和测试时间计算技术的推理系统,能在高水平的基准测试中表现出卓越性能。它基于Qwen2.5基础模型,通过六个关键技术支柱实现先进性能,包括长链思维监督微调、强化学习与可验证奖励(RLVR)、代理计划推理、测试时间缩放、投机解码和推理优化硬件等。K2-Think在数学推理方面表现尤为出色,同时在代码和科学领域也表现出强劲实力。该系统现已在k2think.ai上免费提供,并可通过Cerebras晶圆级引擎实现每秒超过2000个令牌的请求处理速度。
Key Takeaways
- K2-Think是一个高效的推理系统,能够匹配或超越更大规模的模型,如GPT-OSS 120B和DeepSeek v3.1。
- 它基于Qwen2.5基础模型构建,结合先进后训练和测试时间计算技术实现高水平性能。
- K2-Think的六大关键技术支柱包括长链思维监督微调、强化学习与可验证奖励等。
- 该系统在数学推理方面表现卓越,同时在代码和科学领域也显示出强大的实力。
- K2-Think提供了公开可用的开源数据集,并且现已在k2think.ai上免费提供。
- K2-Think通过集成后训练配方和推理时间增强措施,展现了与最先进的系统竞争的能力。
点此查看论文截图




MachineLearningLM: Scaling Many-shot In-context Learning via Continued Pretraining
Authors:Haoyu Dong, Pengkun Zhang, Mingzhe Lu, Yanzhen Shen, Guolin Ke
Large language models (LLMs) possess broad world knowledge and strong general-purpose reasoning ability, yet they struggle to learn from many in-context examples on standard machine learning (ML) tasks, that is, to leverage many-shot demonstrations purely via in-context learning (ICL) without gradient descent. We introduce MachineLearningLM, a portable continued-pretraining framework that equips a general-purpose LLM with robust in-context ML capability while preserving its general knowledge and reasoning for broader chat workflows. Our pretraining procedure synthesizes ML tasks from millions of structural causal models (SCMs), spanning shot counts up to 1,024. We begin with a random-forest teacher, distilling tree-based decision strategies into the LLM to strengthen robustness in numerical modeling. All tasks are serialized with a token-efficient prompt, enabling 3x to 6x more examples per context window and delivering up to 50x amortized throughput via batch inference. Despite a modest setup (Qwen-2.5-7B-Instruct with LoRA rank 8), MachineLearningLM outperforms strong LLM baselines (e.g., GPT-5-mini) by an average of about 15% on out-of-distribution tabular classification across finance, physics, biology, and healthcare domains. It exhibits a striking many-shot scaling law: accuracy increases monotonically as in-context demonstrations grow from 8 to 1,024. Without any task-specific training, it attains random-forest-level accuracy across hundreds of shots. General chat capabilities, including knowledge and reasoning, are preserved: it achieves 75.4% on MMLU.
大型语言模型(LLM)拥有广泛的世界知识和强大的通用推理能力,但在标准机器学习(ML)任务上,它们难以从大量上下文示例中学习,即无法在不进行梯度下降的情况下,仅通过上下文学习(ICL)利用多样本(many-shot)演示。我们提出了MachineLearningLM,一个可移植的持续预训练框架,为通用LLM赋予稳健的上下文机器学习能力,同时保留其通用知识与推理能力以支持更广泛的对话工作流。我们的预训练流程从数百万个结构因果模型(SCM)中合成ML任务,示例数量最多可达1024个。我们先以随机森林作为教师,将基于树的决策策略蒸馏进LLM,以增强数值建模的稳健性。所有任务都采用token高效的提示进行序列化,使每个上下文窗口可容纳3到6倍的示例,并通过批量推理实现最高50倍的摊销吞吐量。尽管配置不大(Qwen-2.5-7B-Instruct,LoRA秩为8),MachineLearningLM在金融、物理、生物和医疗等领域的分布外表格分类任务上,平均比GPT-5-mini等强LLM基线高出约15%。它展现出显著的多样本(many-shot)缩放规律:随着上下文演示从8个增加到1024个,准确率单调上升。在不进行任何特定任务训练的情况下,它在数百个示例的设置下即可达到随机森林级别的准确率。通用对话能力(包括知识和推理)得以保留:在MMLU上达到75.4%。
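下面给出"token 高效的表格样本序列化"思路的一个假设性草图:列名只写一次,之后每行仅写逗号分隔的特征值与标签,从而在同一上下文窗口里容纳更多 many-shot 示例。论文中的实际序列化格式可能不同。

```python
def serialize_table(rows, feature_names, label_name):
    """把表格样本压缩成紧凑的多样本(many-shot)提示:
    列名只写一次,之后每行仅写逗号分隔的数值,节省重复字段名的 token。
    具体提示格式仅为假设,论文中的序列化方式可能不同。"""
    lines = ["features: " + ",".join(feature_names) + f" -> {label_name}"]
    for feats, label in rows:
        lines.append(",".join(str(v) for v in feats) + f" -> {label}")
    return "\n".join(lines)

def build_prompt(train_rows, query_feats, feature_names, label_name="label"):
    demo = serialize_table(train_rows, feature_names, label_name)
    query = ",".join(str(v) for v in query_feats) + " -> ?"
    return demo + "\n" + query

train = [([5.1, 3.5, 1.4], "A"), ([6.7, 3.0, 5.2], "B"), ([5.0, 3.4, 1.5], "A")]
print(build_prompt(train, [6.5, 3.1, 5.0], ["x1", "x2", "x3"]))
```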
论文及项目相关链接
Summary
大型语言模型(LLM)具备广泛的世界知识和强大的通用推理能力,但难以在标准机器学习(ML)任务上从大量上下文示例中学习。我们提出了MachineLearningLM,一个可移植的持续预训练框架,能够在不损失通用知识和推理能力的情况下,为通用LLM提供稳健的上下文机器学习能力。我们的预训练流程从数百万个结构因果模型(SCM)合成ML任务,示例数最多可达1024个。我们以随机森林为教师,将基于树的决策策略蒸馏进LLM,以提高数值建模的稳健性。所有任务都用token高效的提示序列化,使每个上下文窗口可容纳3到6倍的示例,并通过批量推理带来最高50倍的摊销吞吐量提升。尽管配置不大(Qwen-2.5-7B-Instruct,LoRA秩为8),MachineLearningLM在金融、物理、生物和医疗领域的分布外表格分类任务上平均优于GPT-5-mini等强LLM基线约15%。它表现出明显的多样本(many-shot)缩放规律:随着上下文演示从8个增长到1024个,准确率单调上升。无需任何特定任务训练,即可在数百个示例的设置下达到随机森林级别的精度,同时保留了包括知识和推理在内的通用聊天能力:在MMLU上达到75.4%。
Key Takeaways
- 大型语言模型(LLM)在标准机器学习(ML)任务上学习多个上下文示例时存在困难。
- 介绍了MachineLearningLM,一个便携式持续预训练框架,增强了LLM的上下文机器学习能力,同时保留了其通用知识和推理能力。
- 预训练程序通过合成来自数百万结构化因果模型(SCM)的ML任务进行,涵盖样本数高达1024个。
- 使用随机森林教师来强化树决策策略,提高数值建模的稳健性。
- 串行化任务以有效利用上下文窗口,并通过批量推理提高吞吐量。
- MachineLearningLM在多种领域表现出强大的性能,尤其是离分布表格分类任务上优于其他LLM。
点此查看论文截图



Input-Time Scaling
Authors:Rapheal Huang, Weilong Guo
Current Large Language Models (LLMs) are usually post-trained on large-scale carefully curated datasets (data & training scaling) and doing reasoning in test time (inference time scaling). In this work, we present a new scaling paradigm, Input-Time Scaling, to complement previous scaling methods by putting resources on queries (input time). During training and testing, we utilize meta-knowledge from LLMs to refine inputs with different strategies. We also discover a new phenomenon, train-test co-design. It requires us to apply query strategies during training and testing as a whole. Only applying strategies on training or testing would seriously degrade the performance gained. We are also surprised to find that seemingly low data quality datasets can perform better. We can get the best performance even by adding irrelevant information to the queries, with randomly selected 1k examples from a minimally filtered dataset. These findings contradict the widely held inductive bias, “garbage in, garbage out”. Curating datasets with seemingly high-quality data can even potentially limit the performance ceiling. In addition, models trained on more data with similar quality (15k VS 1k) perform worse, the intuition of simply scaling the size should also be carefully inspected. The good news is that our findings are compatible with the Less is More phenomenon. 1K examples are enough to invoke high-level reasoning ability. With experiments on Qwen2.5-32B-Instruct, we are able to reach SOTA performance among 32B models on AIME24(76.7%) and AIME25(76.7%) pass@1. We can further achieve AIME24(76.7%) and AIME25(80%) with a majority vote of three models. Starting from DeepSeek-R1-Distill-Qwen-32B, the result would be 90.0% on AIME24 and 80.0% on AIME25. To facilitate reproducibility and further research, we are working on open-source our datasets, data pipelines, evaluation results, and checkpoints.
当前的大型语言模型(LLM)通常在大规模精心筛选的数据集上进行后训练(数据与训练扩展),并在测试时进行推理(推理时扩展)。在这项工作中,我们提出了一种新的扩展范式"输入时扩展"(Input-Time Scaling),把资源投入到查询本身(输入阶段),以补充已有的扩展方法。在训练和测试中,我们利用LLM的元知识,采用不同策略来改写输入。我们还发现了一个新现象:训练-测试协同设计(train-test co-design),即查询策略必须在训练和测试中作为一个整体应用;只在训练或只在测试阶段应用策略,都会严重削弱所获得的性能。令人惊讶的是,看似低质量的数据集反而可能表现更好:即使向查询中添加无关信息,并仅从极少过滤的数据集中随机选取1k个示例,也能获得最佳性能。这些发现与"垃圾进,垃圾出"这一普遍的归纳偏见相矛盾;用看似高质量的数据精心构建数据集,甚至可能限制性能上限。此外,在质量相近的数据上训练更多数据(15k对1k)的模型表现反而更差,"单纯扩大数据规模"的直觉也需要谨慎检验。好消息是,我们的发现与"少即是多"(Less is More)现象相容:1K个示例就足以唤起高水平的推理能力。在Qwen2.5-32B-Instruct上的实验中,我们在AIME24(76.7%)和AIME25(76.7%)的pass@1上达到了32B模型中的SOTA水平;通过三个模型的多数表决,可进一步达到AIME24(76.7%)和AIME25(80%)。若从DeepSeek-R1-Distill-Qwen-32B出发,结果为AIME24 90.0%、AIME25 80.0%。为便于复现和后续研究,我们正在开源数据集、数据管道、评估结果和检查点。
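下面用一个极简草图示意"输入时扩展"与"训练-测试协同设计"的要点:同一套查询改写策略必须同时作用于训练样本和测试查询,推理时可再用多数表决聚合多个模型的输出。其中的改写方式与模型输出均为假设的占位。

```python
from collections import Counter

def refine_query(question: str) -> str:
    """示意性的"输入时"查询改写:这里仅加一段与解题无关的背景,
    对应文中"向查询添加无关信息仍可能提升表现"的观察;真实系统会用 LLM 的元知识改写。"""
    return "背景:以下问题来自一次随堂练习。\n" + question

def majority_vote(answers):
    """对多份回答做多数表决,返回出现次数最多的答案。"""
    return Counter(answers).most_common(1)[0][0]

# 训练-测试协同设计:训练样本与测试查询必须使用同一套改写策略
sft_sample = {"prompt": refine_query("求 1+2+...+100 的值。"), "answer": "5050"}
test_prompt = refine_query("求 1+2+...+10 的值。")

# 假设的三个模型对 test_prompt 的输出(占位),用多数表决聚合
print(sft_sample["prompt"].splitlines()[0])
print(majority_vote(["55", "55", "54"]))  # 55
```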
论文及项目相关链接
Summary:
本文提出了一种新的模型扩展范式"输入时扩展",专注于在查询(输入阶段)上投入资源,以补充现有的数据/训练扩展和推理时扩展方法。研究发现了一种新现象:训练-测试协同设计,即查询策略需要在训练和测试中整体应用才能提升性能。实验结果显示,看似低质量的数据集表现良好,甚至向查询中添加无关信息也能提高性能;同时,在数据质量相近时单纯扩大数据规模反而可能导致性能下降,而少量数据即可触发高级推理能力。实验中部分模型达到了SOTA性能。作者将开源数据集、数据管道、评估结果和检查点,以促进研究和可重复性。
Key Takeaways:
- 介绍了一种新的模型缩放范式——输入时间缩放,关注查询资源的分配。
- 发现了训练测试协同设计的新现象,强调在训练和测试过程中应用查询策略的重要性。
- 表明看似低质量的数据集可能表现更好,挑战了“垃圾进垃圾出”的归纳偏见。
- 在数据质量相近的情况下,单纯扩大数据规模(15k对1k)反而导致性能下降。
- 少量高质量数据可以触发高级推理能力。
- 实验结果显示某些模型能够达到SOTA性能,包括在某些特定任务上的高通过率。
点此查看论文截图






AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Authors:Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, Yi Wu
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77$\times$ training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
强化学习(RL)已成为训练大型语言模型(LLM)的主导范式,特别是在推理任务中。对于LLM的有效RL需要大规模并行化,并迫切需要高效的训练系统。大多数现有的大型RL系统都是同步的,分批生成和训练交替进行,每个训练批次中的生成都由同一模型完成。这种方法稳定了RL训练,但存在严重的系统级效率低下:生成必须等待批次中最长的输出完成才能进行模型更新,导致GPU利用率不足。我们提出了AReaL,一个完全异步的RL系统,它将生成和训练完全解耦。AReaL中的生成工作器可以持续生成新的输出而无需等待,而训练工作器则会在每次收集到一批数据时更新模型。AReaL还采用了一系列系统级优化,大大提高了GPU利用率。为了稳定RL训练,AReaL平衡了生成和训练工作器的负载,以控制数据的陈旧程度,并采用了增强陈旧性的PPO变体来更好地处理过时的训练样本。在数学和代码推理基准测试的大量实验表明,与具有相同数量GPU的同步系统相比,AReaL的训练速度提高了2.77倍,并且具有相匹配或更好的最终性能。AReaL的代码可在https://github.com/inclusionAI/AReaL/找到。
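下面是"生成与训练解耦、并控制数据陈旧度"思路的一个线程级玩具示意(并非 AReaL 的实际实现):rollout 线程持续往队列里放入带策略版本号的样本,训练线程凑满一个 batch 就更新版本,并丢弃落后版本超过阈值的样本。

```python
import queue
import threading
import time

data_q = queue.Queue(maxsize=64)
policy_version = 0
MAX_STALENESS = 2   # 允许的最大策略版本落后量(示意取值)

def rollout_worker(worker_id: int, n_rollouts: int):
    """生成 worker:不等待训练完成,持续产出 (策略版本号, 样本)。"""
    for i in range(n_rollouts):
        time.sleep(0.01)  # 模拟一次生成的耗时
        data_q.put((policy_version, f"worker{worker_id}-rollout{i}"))

def trainer(n_steps: int, batch_size: int = 4):
    """训练 worker:凑满一个 batch 就更新一次模型,并丢弃过旧的样本。"""
    global policy_version
    for _ in range(n_steps):
        batch = []
        while len(batch) < batch_size:
            try:
                version, sample = data_q.get(timeout=2.0)
            except queue.Empty:
                return  # 数据耗尽,结束示意
            if policy_version - version <= MAX_STALENESS:
                batch.append(sample)
        policy_version += 1  # 模拟一次参数更新
        print(f"update -> policy v{policy_version}, batch={batch}")

workers = [threading.Thread(target=rollout_worker, args=(w, 20)) for w in range(2)]
train_thread = threading.Thread(target=trainer, args=(5,))
for t in workers + [train_thread]:
    t.start()
for t in workers + [train_thread]:
    t.join()
```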
论文及项目相关链接
Summary
本文介绍了强化学习(RL)在大规模语言模型(LLM)训练中的主导地位,特别是在推理任务中的应用。为了解决大规模RL系统中存在的系统级效率低下问题,提出了一种全新的异步RL系统AReaL。AReaL通过将生成与训练完全解耦,实现了连续的生成输出和模型更新,从而大大提高了GPU利用率。同时,AReaL还通过一系列系统级优化措施来平衡工作负载并控制数据陈旧性,采用了一种增强陈旧性的PPO变体以更好地处理过时的训练样本。在数学和代码推理基准测试上,AReaL相较于同步系统实现了最高达2.77倍的训练速度提升,同时保持了匹配或更好的最终性能。
Key Takeaways
- 强化学习已成为大规模语言模型训练的主要方法,特别是在处理推理任务时。
- 现有大规模RL系统在训练LLM时存在系统级效率低下问题。
- 提出了一种全新的异步RL系统AReaL,将生成与训练过程完全分离,实现了高效的GPU利用。
- AReaL通过一系列优化措施平衡生成与训练的工作负载,控制数据陈旧性。
- AReaL采用了一种增强陈旧性的PPO变体,以处理过时的训练样本。
- 在数学和代码推理测试中,AReaL相较于同步系统显著提升了训练速度。
点此查看论文截图



Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning
Authors:Yutong Chen, Jiandong Gao, Ji Wu
R1-style Reinforcement Learning (RL) significantly enhances Large Language Models’ reasoning capabilities, yet the mechanism behind rule-based RL remains unclear. We found that small-scale SFT has substantial influence on RL but shows poor efficiency. To explain our observations, we propose an analytical framework and compare the efficiency of SFT and RL by measuring \textbf{sample effect}. Our hypothetical analysis shows the potential to improve SFT efficiency. Guided by our analysis, we propose \textbf{Re-distillation}, a technique that aims to boost the effectiveness of small-scale distillation by sampling from the RL-trained policy. Re-distillation shows consistent surprising efficiency on three datasets and both Qwen&Llama models: Re-distilled models matched RL performance with far fewer samples and less computation. As a result, on K&K dataset, our re-distilled Qwen-2.5-1.5B model surpasses DeepSeek-V3-0324 with only 1K SFT samples. We demonstrate that re-distillation can be used to efficiently balance multiple goals in RL. Our work explains several interesting phenomena in R1-style RL, shedding light on the mechanisms behind its empirical success. Code is available at: https://github.com/on1262/deep-reasoning.
R1风格的强化学习(RL)显著增强了大型语言模型的推理能力,然而基于规则的RL背后的机制仍不清楚。我们发现小规模SFT对RL有实质性影响,但效率较低。为解释这一观察,我们提出了一个分析框架,通过度量"样本效应"来比较SFT和RL的效率。我们的假设性分析显示了提高SFT效率的潜力。在该分析的指导下,我们提出了"再蒸馏"(Re-distillation)技术,通过从RL训练后的策略中采样来提升小规模蒸馏的有效性。再蒸馏在三个数据集以及Qwen和Llama两类模型上均表现出一致且惊人的效率:再蒸馏模型用少得多的样本和计算量就达到了RL的性能。因此,在K&K数据集上,我们再蒸馏的Qwen-2.5-1.5B模型仅用1K个SFT样本就超过了DeepSeek-V3-0324。我们还证明再蒸馏可用于高效地平衡RL中的多个目标。我们的工作解释了R1风格RL中的若干有趣现象,揭示了其经验成功背后的机制。代码见:https://github.com/on1262/deep-reasoning。
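下面是"再蒸馏"流程的一个假设性最小示意:用 RL 训练后的策略对每个问题采样若干回答,保留可验证为正确的样本,再把它们作为小规模 SFT 数据;其中的生成、校验与训练函数均为占位。

```python
def re_distill(prompts, rl_policy_generate, is_correct, sft_train, per_prompt=4):
    """Re-distillation 的流程示意(非官方实现):
    1) 用 RL 训练后的策略对每个问题采样若干回答;
    2) 只保留可验证为正确的 (问题, 回答) 对;
    3) 把这些样本作为小规模 SFT 数据去微调基础模型。"""
    distilled = []
    for p in prompts:
        for _ in range(per_prompt):
            response = rl_policy_generate(p)
            if is_correct(p, response):
                distilled.append((p, response))
                break   # 每个问题保留一条正确轨迹即可(假设的筛选策略)
    return sft_train(distilled)

# 占位函数的用法示意
demo = re_distill(
    prompts=["2+2=?", "3*3=?"],
    rl_policy_generate=lambda p: "4" if "2+2" in p else "9",
    is_correct=lambda p, r: (r == "4") if "2+2" in p else (r == "9"),
    sft_train=lambda data: f"SFT on {len(data)} distilled samples",
)
print(demo)
```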
论文及项目相关链接
PDF preprint
Summary
R1风格的强化学习(RL)能够显著提升大型语言模型的推理能力,但其背后的机制尚不清楚。研究发现小规模监督微调(SFT)对RL有重要影响,但效率不高。为此,本文提出了一个分析框架,通过度量样本效应来比较SFT和RL的效率。同时,本文提出了再蒸馏技术,通过从RL训练后的策略中采样来提高小规模蒸馏的有效性。再蒸馏在三个数据集和两类模型上的表现均令人惊讶,用更少的样本和计算量达到了与RL相当的性能。因此,再蒸馏有望成为高效平衡RL中多个目标的方法。
Key Takeaways
- R1风格的强化学习大幅提升了大型语言模型的推理能力,但机制不明。
- 小规模监督微调(SFT)对强化学习有重要影响,但其效率不高。
- 通过度量样本效应,提出了一个分析框架来比较监督微调和强化学习的效率。
- 提出了再蒸馏技术,通过从强化学习训练后的策略中采样,提高小规模蒸馏的有效性。
- 再蒸馏技术在多个数据集和模型上的表现优异,用更少的样本和计算量达到了与强化学习相当的性能。
- 再蒸馏有助于高效平衡强化学习中的多个目标。
点此查看论文截图




Prior Prompt Engineering for Reinforcement Fine-Tuning
Authors:Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul
This paper investigates prior prompt engineering (pPE) in the context of reinforcement fine-tuning (RFT), where language models (LMs) are incentivized to exhibit behaviors that maximize performance through reward signals. While existing RFT research has primarily focused on algorithms, reward shaping, and data curation, the design of the prior prompt–the instructions prepended to queries during training to elicit behaviors such as step-by-step reasoning–remains underexplored. We investigate whether different pPE approaches can guide LMs to internalize distinct behaviors after RFT. Inspired by inference-time prompt engineering (iPE), we translate five representative iPE strategies–reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization–into corresponding pPE approaches. We experiment with Qwen2.5-7B using each of the pPE approaches, then evaluate performance on in-domain and out-of-domain benchmarks (e.g., AIME2024, HumanEval+, and GPQA-Diamond). Our results show that all pPE-trained models surpass their iPE-prompted counterparts, with the null-example pPE approach achieving the largest average performance gain and the highest improvement on AIME2024 and GPQA-Diamond, surpassing the commonly used reasoning approach. Furthermore, by adapting a behavior-classification framework, we demonstrate that different pPE strategies instill distinct behavioral styles in the resulting models. These findings position pPE as a powerful yet understudied axis for RFT.
本文研究了强化微调(RFT)背景下的前期提示工程(pPE):在RFT中,语言模型(LM)通过奖励信号被激励表现出能最大化性能的行为。现有RFT研究主要集中在算法、奖励设计和数据整理上,而前期提示的设计,即训练时附加在查询之前、用于诱发逐步推理等行为的指令,仍缺乏深入探索。我们研究了不同的pPE方法能否引导LM在RFT之后内化不同的行为。受推理时提示工程(iPE)的启发,我们将五种代表性的iPE策略(推理、规划、基于代码的推理、知识回忆和空例利用)转化为对应的pPE方法。我们用每种pPE方法训练Qwen2.5-7B,并在域内和域外基准(如AIME2024、HumanEval+和GPQA-Diamond)上评估性能。结果表明,所有经pPE训练的模型都优于对应的iPE提示模型,其中空例pPE方法取得了最大的平均性能增益,并在AIME2024和GPQA-Diamond上提升最多,超过了常用的推理式方法。此外,借助行为分类框架,我们证明不同的pPE策略会让模型形成不同的行为风格。这些发现表明,pPE是RFT中一个强大但尚未被充分研究的维度。
论文及项目相关链接
PDF Accepted at EMNLP 2025, Main; 26 pages, 42 figures
Summary:本文探讨了强化微调(RFT)背景下的前期提示工程(pPE),研究了如何通过奖励信号激励语言模型(LMs)表现出最大化性能的行为。文章主要关注如何将推理时间提示工程(iPE)的策略转化为前期提示工程(pPE)的方法,并通过实验验证了不同pPE策略在指导语言模型内化行为方面的有效性。结果显示,采用pPE训练的模型在性能和风格上均优于采用iPE提示的模型,其中空例pPE方法取得最大平均性能提升。
Key Takeaways:
- 文章探讨了前期提示工程(pPE)在强化微调(RFT)中的作用,研究如何通过奖励信号使语言模型表现出最大化性能的行为。
- 文章将推理时间提示工程(iPE)的策略转化为pPE的方法,包括推理、规划、基于代码推理、知识回忆和空例利用等五种代表性策略。
- 实验结果显示,采用pPE训练的模型在性能和风格上均优于仅使用iPE提示的模型。
- 空例pPE方法取得最大的平均性能提升,并在AIME2024和GPQA-Diamond上表现最佳。
- 不同pPE策略会导致模型表现出不同的行为风格。
- pPE作为强化微调的一个强大而尚未充分研究的轴,具有巨大的潜力。
点此查看论文截图






An Empirical Study of Position Bias in Modern Information Retrieval
Authors:Ziyang Zeng, Dun Zhang, Jiacheng Li, Panxiang Zou, Yudong Zhou, Shengjie Wang, Yuqing Yang
This study investigates the position bias in information retrieval, where models tend to overemphasize content at the beginning of passages while neglecting semantically relevant information that appears later. To analyze the extent and impact of position bias, we introduce a new evaluation framework consisting of two position-aware retrieval benchmarks (SQuAD-PosQ, FineWeb-PosQ) and an intuitive diagnostic metric, the Position Sensitivity Index (PSI), for quantifying position bias from a worst-case perspective. We conduct a comprehensive evaluation across the full retrieval pipeline, including BM25, dense embedding models, ColBERT-style late-interaction models, and full-interaction reranker models. Our experiments show that when relevant information appears later in the passage, dense embedding models and ColBERT-style models suffer significant performance degradation (an average drop of 15.6%). In contrast, BM25 and reranker models demonstrate greater robustness to such positional variation. These findings provide practical insights into model sensitivity to the position of relevant information and offer guidance for building more position-robust retrieval systems. Code and data are publicly available at: https://github.com/NovaSearch-Team/position-bias-in-IR.
本研究探讨了信息检索中的位置偏见问题:模型往往过分强调段落开头的内容,而忽视后面出现的语义相关信息。为分析位置偏见的程度和影响,我们引入了一个新的评估框架,包括两个位置感知检索基准(SQuAD-PosQ、FineWeb-PosQ)和一个直观的诊断指标,即位置敏感性指数(PSI),用于从最坏情况视角量化位置偏见。我们对完整的检索流水线进行了全面评估,包括BM25、稠密嵌入模型、ColBERT风格的后期交互模型以及全交互重排模型。实验表明,当相关信息出现在段落靠后位置时,稠密嵌入模型和ColBERT风格模型的性能显著下降(平均下降15.6%);相比之下,BM25和重排模型对这类位置变化表现出更强的鲁棒性。这些发现为理解模型对相关信息位置的敏感性提供了实践启示,并为构建对位置更鲁棒的检索系统提供了指导。代码和数据公开于:https://github.com/NovaSearch-Team/position-bias-in-IR。
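论文中 PSI 的精确定义以原文为准;作为示意,下面假设一种"最坏情况相对跌幅"的实现:对同一查询集合,取相关信息位于段落前/中/后三种情形下检索得分的最大相对下降。示例中的 NDCG 数值为虚构。

```python
def position_sensitivity_index(scores_by_position: dict) -> float:
    """位置敏感性指数的一个假设性实现:从最坏情况视角,
    取同一查询集合在不同答案位置(前/中/后)下得分的最大相对跌幅。
    论文中 PSI 的具体定义可能不同,此处仅示意"最坏情况"度量思路。"""
    best = max(scores_by_position.values())
    worst = min(scores_by_position.values())
    return 0.0 if best == 0 else (best - worst) / best

# 假设的 NDCG@10:相关信息出现在段落开头/中间/结尾时的检索效果
dense_model = {"front": 0.72, "middle": 0.61, "tail": 0.55}
bm25 = {"front": 0.58, "middle": 0.57, "tail": 0.56}
print("dense PSI:", round(position_sensitivity_index(dense_model), 3))  # 约 0.236
print("bm25 PSI:", round(position_sensitivity_index(bm25), 3))          # 约 0.034
```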
论文及项目相关链接
PDF EMNLP 2025 Findings (camera-ready)
Summary
在信息检索中存在位置偏见问题,即模型往往过分强调段落开头的内容,而忽视后面出现的语义相关信息。为分析和评估位置偏见的影响,本研究引入了新的评估框架和指标,包括两个位置感知检索基准测试(SQuAD-PosQ、FineWeb-PosQ)和一个直观的诊断指标——位置敏感性指数(PSI)。实验表明,当相关信息出现在段落后面时,密集嵌入模型和ColBERT风格模型性能显著下降(平均下降15.6%),而BM25和重排模型对位置变化更具鲁棒性。
Key Takeaways
- 信息检索中存在位置偏见问题,模型易忽略段落后半部分的相关信息。
- 研究引入新的评估框架,包括两个位置感知检索基准测试和位置敏感性指数(PSI)指标。
- 密集嵌入模型和ColBERT风格模型在相关信息靠后时性能显著下降。
- BM25和重排模型对位置变化更具鲁棒性。
- 位置偏见影响评估有助于了解模型对不同位置信息的敏感性。
- 研究结果提供构建更稳健的检索系统的指导。
点此查看论文截图

