⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-05-22 更新
Emerging Properties in Unified Multimodal Pretraining
Authors:Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, Haoqi Fan
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation. BAGEL is a unified, decoder-only model pretrained on trillions of tokens curated from large-scale interleaved text, image, video, and web data. When scaled with such diverse multimodal interleaved data, BAGEL exhibits emerging capabilities in complex multimodal reasoning. As a result, it significantly outperforms open-source unified models in both multimodal generation and understanding across standard benchmarks, while exhibiting advanced multimodal reasoning abilities such as free-form image manipulation, future frame prediction, 3D manipulation, and world navigation. In the hope of facilitating further opportunities for multimodal research, we share the key findings, pretraining details, data creation protocol, and release our code and checkpoints to the community. The project page is at https://bagel-ai.org/
在多模态理解和生成方面,最前沿的专有系统已经展示了令人印象深刻的能力。在这项工作中,我们介绍了BAGEL,这是一个开源的基础模型,它原生支持多模态理解和生成。BAGEL是一个统一的、仅含解码器的模型,它在从大规模交错文本、图像、视频和网络数据中精选出的万亿级标记上进行预训练。当使用如此多样化的多模态交错数据进行扩展时,BAGEL在复杂的多模态推理方面展现出新兴的能力。因此,它在标准基准测试的多模态生成和理解方面大大优于开源统一模型,同时展现出先进的多模态推理能力,如自由形式的图像操作、未来帧预测、3D操作和世界导航等。我们希望为进一步的多模态研究提供机会,因此分享关键发现、预训练细节、数据创建协议,并向社区发布我们的代码和检查点。项目页面位于:[https://bagel-ai.org/]
论文及项目相关链接
PDF 37 pages, 17 figures
Summary
BAGEL是一个开源的基础模型,支持多模态理解和生成。该模型在大量大规模交替文本、图像、视频和网页数据上进行预训练,展现出强大的多模态推理能力,显著优于开源统一模型,具备自由形式的图像操作、未来帧预测、3D操作和导航等高级功能。
Key Takeaways
- BAGEL是一个支持多模态理解和生成的多功能开源基础模型。
- BAGEL通过预训练在大量大规模交替文本、图像、视频和网页数据上实现多模态推理能力。
- BAGEL展现出强大的性能,显著优于其他开源统一模型,在多模态生成和理解标准测试中表现优异。
- BAGEL具备高级多模态推理能力,如自由形式的图像操作、未来帧预测和3D操作等。
- 这些能力还延伸到世界导航等需要对世界建模的任务。
- 为推动多模态研究的发展,分享了关键发现、预训练细节、数据创建协议以及代码和检查点。
- 该项目的网页地址为:[https://bagel-ai.org/]
点此查看论文截图




Two Experts Are All You Need for Steering Thinking: Reinforcing Cognitive Effort in MoE Reasoning Models Without Additional Training
Authors:Mengru Wang, Xingyu Chen, Yue Wang, Zhiwei He, Jiahao Xu, Tian Liang, Qiuzhi Liu, Yunzhi Yao, Wenxuan Wang, Ruotian Ma, Haitao Mi, Ningyu Zhang, Zhaopeng Tu, Xiaolong Li, Dong Yu
Mixture-of-Experts (MoE) architectures within Large Reasoning Models (LRMs) have achieved impressive reasoning capabilities by selectively activating experts to facilitate structured cognitive processes. Despite notable advances, existing reasoning models often suffer from cognitive inefficiencies like overthinking and underthinking. To address these limitations, we introduce a novel inference-time steering methodology called Reinforcing Cognitive Experts (RICE), designed to improve reasoning performance without additional training or complex heuristics. Leveraging normalized Pointwise Mutual Information (nPMI), we systematically identify specialized experts, termed "cognitive experts", that orchestrate meta-level reasoning operations characterized by tokens like "<think>". Empirical evaluations on leading MoE-based LRMs (DeepSeek-R1 and Qwen3-235B) across rigorous quantitative and scientific reasoning benchmarks demonstrate notable and consistent improvements in reasoning accuracy, cognitive efficiency, and cross-domain generalization. Crucially, this lightweight approach substantially outperforms prevalent reasoning-steering techniques such as prompt design and decoding constraints, while preserving the model's general instruction-following skills. These results highlight reinforcing cognitive experts as a promising, practical, and interpretable direction for improving cognitive efficiency within advanced reasoning models.
在大规模推理模型(LRMs)中,专家混合(MoE)架构通过选择性激活专家以促进结构化认知过程,已经实现了令人印象深刻的推理能力。尽管有显著进展,但现有推理模型往往存在认知效率低下的问题,如过度思考和思考不足。为了解决这些局限性,我们引入了一种新型推理时引导方法,称为强化认知专家(RICE),旨在不需要额外训练或复杂启发式方法的情况下提高推理性能。我们利用标准化点互信息(nPMI)系统地识别专门化的专家,称为“认知专家”,它们负责协调以“<think>”等标记为特征的元级推理操作。在领先的基于MoE的LRMs(DeepSeek-R1和Qwen3-235B)上进行的严格定量与科学推理基准的实证评估表明,该方法在推理准确性、认知效率和跨域泛化方面都有显著且一致的改进。关键的是,这一轻量级方法大幅优于提示设计和解码约束等流行的推理引导技术,同时保留了模型的通用指令遵循能力。这些结果表明,强化认知专家是提高先进推理模型认知效率的一个有前途、实用且可解释的方向。
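为帮助理解“用 nPMI 识别认知专家、再在推理时放大其路由权重”的思路,下面给出一个极简的 Python 示意(非论文官方实现;函数名、数组形状与 boost 系数均为便于说明的假设):

```python
import numpy as np

def normalized_pmi(expert_mask, token_mask, eps=1e-9):
    """expert_mask / token_mask: 长度为 N 的布尔数组,分别表示
    某个专家是否在该 token 上被激活、该 token 是否为“认知型”标记(如 "<think>")。"""
    p_e = expert_mask.mean() + eps
    p_t = token_mask.mean() + eps
    p_et = (expert_mask & token_mask).mean() + eps
    pmi = np.log(p_et / (p_e * p_t))
    return pmi / (-np.log(p_et))          # 归一化到 [-1, 1]

def steer_router_logits(router_logits, cognitive_expert_ids, boost=1.5):
    """推理时放大“认知专家”的路由得分,其余专家保持不变。"""
    steered = router_logits.copy()
    steered[..., cognitive_expert_ids] += np.log(boost)
    return steered
```

实际使用时,可先在带“<think>”标记的语料上统计各专家的 nPMI,再选取得分最高的少数专家作为被强化的“认知专家”。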
论文及项目相关链接
PDF Work in progress
Summary
文本介绍了Mixture-of-Experts(MoE)架构的大型推理模型(LRMs)通过选择性激活专家来实现结构化认知过程,展现出强大的推理能力。针对现有推理模型的认知效率低下问题,文本提出了一种新型推理时引导方法——强化认知专家(RICE)。该方法利用标准化点互信息(nPMI)识别专门化的专家,称为“认知专家”,它们负责以“<think>”等标记为特征的元级推理操作。在DeepSeek-R1和Qwen3-235B等基于MoE的LRMs上的实验表明,该方法在推理准确性、认知效率和跨域泛化方面均有一致提升。
Key Takeaways
- MoE架构的LRMs通过激活特定专家实现结构化认知过程,表现出强大的推理能力。
- 现有推理模型存在认知效率低下的问题,如过度思考和思考不足。
- RICE方法通过识别并强化“认知专家”来提升推理性能,这些专家负责元级别推理操作。
- 利用标准化点互信息(nPMI)系统识别“认知专家”。
- RICE在多个领先MoE基LRMs上的实证评估中,显著提升了推理准确性、认知效率和跨域泛化能力。
- RICE相较于其他推理控制技巧更轻便,且能保留模型的通用指令跟随技能。
点此查看论文截图



Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning
Authors:Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, Kaiyang Zhou
Learning general-purpose reasoning capabilities has long been a challenging problem in AI. Recent research in large language models (LLMs), such as DeepSeek-R1, has shown that reinforcement learning techniques like GRPO can enable pre-trained LLMs to develop reasoning capabilities using simple question-answer pairs. In this paper, we aim to train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs, without any explicit chain-of-thought (CoT) supervision. Our findings indicate that simply applying reinforcement learning to a VLM – by prompting the model to produce a reasoning chain before providing an answer – can lead the model to develop shortcuts from easy questions, thereby reducing its ability to generalize across unseen data distributions. We argue that the key to mitigating shortcut learning is to encourage the model to interpret images prior to reasoning. Therefore, we train the model to adhere to a caption-reason-answer output format: initially generating a detailed caption for an image, followed by constructing an extensive reasoning chain. When trained on 273K CoT-free visual question-answer pairs and using only reinforcement learning, our model, named Visionary-R1, outperforms strong multimodal models, such as GPT-4o, Claude3.5-Sonnet, and Gemini-1.5-Pro, on multiple visual reasoning benchmarks.
学习和掌握通用推理能力长期以来一直是人工智能领域的一个难题。最近,关于大型语言模型(LLM)的研究,如DeepSeek-R1,表明强化学习技术(如GRPO)可以使预训练的语言模型通过简单的问题答案对来发展推理能力。在本文中,我们的目标是训练视觉语言模型(VLM)通过强化学习和视觉问答对来对图像数据进行推理,而无需任何明确的思维链(CoT)监督。我们的研究结果表明,仅仅将强化学习应用于视觉语言模型——通过提示模型在给出答案之前产生推理链——可能导致模型从简单问题中学到捷径,从而降低其在未见数据分布上的泛化能力。我们认为缓解捷径学习的关键在于鼓励模型在推理之前先解读图像。因此,我们训练模型遵循“描述-推理-回答”的输出格式:首先为图像生成详细的描述,然后构建完整的推理链。在27.3万个无思维链的视觉问答对上仅使用强化学习进行训练后,我们的模型Visionary-R1在多个视觉推理基准测试中表现优于GPT-4o、Claude3.5-Sonnet和Gemini-1.5-Pro等强大的多模态模型。
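下面用一个极简的奖励函数示意“描述-推理-回答”格式约束与答案正确性如何共同构成强化学习信号(标签名称与分值划分均为示意性假设,并非论文原始设定):

```python
import re

def caption_reason_answer_reward(response: str, gold_answer: str) -> float:
    """假设模型输出遵循 <caption>…</caption><think>…</think><answer>…</answer> 格式;
    未遵循格式直接给零奖励,遵循格式后再按答案是否正确给主奖励。"""
    pattern = r"<caption>(.*?)</caption>\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>"
    m = re.search(pattern, response, flags=re.S)
    if m is None:
        return 0.0                                   # 未遵循“描述-推理-回答”格式
    answer = m.group(3).strip()
    fmt_reward = 0.2                                 # 格式奖励
    acc_reward = 0.8 if answer == gold_answer.strip() else 0.0
    return fmt_reward + acc_reward
```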
论文及项目相关链接
Summary
本文介绍了通过强化学习和视觉问答对训练视觉语言模型(VLM)进行图像推理的研究。实验发现,仅引导模型在回答问题前产生推理链,会导致模型从简单问题中习得捷径,降低其在未见数据分布上的泛化能力。为缓解这一问题,需要鼓励模型在推理前先对图像进行解读,即训练模型遵循“描述-推理-回答”的输出格式:先为图像生成详细描述,再构建推理链。仅使用强化学习,在273K个无思维链监督的视觉问答对上训练后,所提出的模型Visionary-R1在多个视觉推理基准测试中表现优于GPT-4o、Claude 3.5-Sonnet和Gemini-1.5-Pro等强大的多模态模型。
Key Takeaways
- 强化学习技术如GRPO能使预训练的LLMs具备推理能力。
- 通过引导模型在回答问题前产生推理链会导致模型从简单问题中习得捷径,影响其在未见数据上的泛化能力。
- 为缓解模型捷径学习问题,需要鼓励模型在推理前先解读图像。
- 训练模型遵循“描述-推理-回答”格式,以提高其在视觉推理任务上的性能。
- 模型通过仅使用强化学习和无思维链监督的大量视觉问答对进行训练。
- Visionary-R1模型在多视觉推理基准测试中表现优异。
点此查看论文截图




Reward Reasoning Model
Authors:Jiaxin Guo, Zewen Chi, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
Reward models play a critical role in guiding large language models toward outputs that align with human expectations. However, an open challenge remains in effectively utilizing test-time compute to enhance reward model performance. In this work, we introduce Reward Reasoning Models (RRMs), which are specifically designed to execute a deliberate reasoning process before generating final rewards. Through chain-of-thought reasoning, RRMs leverage additional test-time compute for complex queries where appropriate rewards are not immediately apparent. To develop RRMs, we implement a reinforcement learning framework that fosters self-evolved reward reasoning capabilities without requiring explicit reasoning traces as training data. Experimental results demonstrate that RRMs achieve superior performance on reward modeling benchmarks across diverse domains. Notably, we show that RRMs can adaptively exploit test-time compute to further improve reward accuracy. The pretrained reward reasoning models are available at https://huggingface.co/Reward-Reasoning.
奖励模型在引导大型语言模型产生符合人类期望的输出方面起着关键作用。然而,如何有效利用测试时的计算资源来提升奖励模型的性能仍然是一个开放性的挑战。在此工作中,我们引入了奖励推理模型(RRMs),它专门设计用于在生成最终奖励之前执行深思熟虑的推理过程。借助链式思维推理,对于合适奖励并非一目了然的复杂查询,RRMs能够利用额外的测试时计算资源。为了开发RRMs,我们实现了一个强化学习框架,该框架能够培育自我进化的奖励推理能力,而无需将明确的推理轨迹作为训练数据。实验结果表明,RRMs在不同领域的奖励建模基准测试中实现了卓越的性能。值得注意的是,我们展示了RRMs能够自适应地利用测试时的计算资源来进一步提高奖励的准确性。预训练的奖励推理模型可在https://huggingface.co/Reward-Reasoning找到。
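下面是一个示意性的 Python 片段,展示奖励推理模型“先逐步推理、再给出偏好”的典型用法(提示模板与解析规则均为便于说明的假设,并非该模型的官方接口):

```python
def build_rrm_prompt(query: str, response_a: str, response_b: str) -> str:
    """示意性的提示模板:要求奖励推理模型先进行逐步推理,再输出偏好。"""
    return (
        "请比较下面两个回答哪一个更好。先进行逐步推理,"
        "最后一行以 `Preferred: A` 或 `Preferred: B` 结束。\n\n"
        f"问题: {query}\n\n回答A: {response_a}\n\n回答B: {response_b}"
    )

def parse_preference(rrm_output: str) -> str:
    """从模型输出末尾解析偏好;解析失败时默认返回 "A"。"""
    for line in reversed(rrm_output.strip().splitlines()):
        if "Preferred:" in line:
            return "B" if "B" in line.split("Preferred:")[-1] else "A"
    return "A"
```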
论文及项目相关链接
Summary:奖励模型对于引导大型语言模型产生符合人类期望的输出至关重要。然而,如何有效利用测试时的计算资源来提升奖励模型性能仍是挑战。本研究引入了奖励推理模型(RRMs),专门设计用于在生成最终奖励之前进行深思熟虑的推理过程。对于复杂查询,RRMs利用额外的测试时间计算来得出适当的奖励。我们实现了一个强化学习框架来培养RRMs的自我进化奖励推理能力,无需明确的推理轨迹作为训练数据。实验结果表明,RRMs在跨领域的奖励建模基准测试中表现优异。特别是,我们展示了RRMs能够自适应地利用测试时间计算来进一步提高奖励准确性。
Key Takeaways:
- 奖励模型对引导大型语言模型产出符合人类期望的输出至关重要。
- 测试时的计算资源有效利用是提升奖励模型性能的挑战之一。
- 引入奖励推理模型(RRMs),专门进行深思熟虑的推理过程再生成最终奖励。
- RRMs利用额外的测试时间计算来处理复杂查询,得出适当奖励。
- 采用强化学习框架培养RRMs的自我进化奖励推理能力。
- RRMs在跨领域奖励建模基准测试中表现优异。
点此查看论文截图



General-Reasoner: Advancing LLM Reasoning Across All Domains
Authors:Xueguang Ma, Qian Liu, Dongfu Jiang, Ge Zhang, Zejun Ma, Wenhu Chen
Reinforcement learning (RL) has recently demonstrated strong potential in enhancing the reasoning capabilities of large language models (LLMs). Particularly, the “Zero” reinforcement learning introduced by Deepseek-R1-Zero, enables direct RL training of base LLMs without relying on an intermediate supervised fine-tuning stage. Despite these advancements, current works for LLM reasoning mainly focus on mathematical and coding domains, largely due to data abundance and the ease of answer verification. This limits the applicability and generalization of such models to broader domains, where questions often have diverse answer representations, and data is more scarce. In this paper, we propose General-Reasoner, a novel training paradigm designed to enhance LLM reasoning capabilities across diverse domains. Our key contributions include: (1) constructing a large-scale, high-quality dataset of questions with verifiable answers curated by web crawling, covering a wide range of disciplines; and (2) developing a generative model-based answer verifier, which replaces traditional rule-based verification with the capability of chain-of-thought and context-awareness. We train a series of models and evaluate them on a wide range of datasets covering wide domains like physics, chemistry, finance, electronics etc. Our comprehensive evaluation across these 12 benchmarks (e.g. MMLU-Pro, GPQA, SuperGPQA, TheoremQA, BBEH and MATH AMC) demonstrates that General-Reasoner outperforms existing baseline methods, achieving robust and generalizable reasoning performance while maintaining superior effectiveness in mathematical reasoning tasks.
强化学习(RL)最近在提高大型语言模型(LLM)的推理能力方面表现出了强大的潜力。特别是Deepseek-R1-Zero引入的“Zero”强化学习,能够实现基础LLM的直接RL训练,无需依赖中间监督微调阶段。尽管有这些进展,但目前LLM推理的研究主要集中在数学和编码领域,这主要是因为数据丰富且答案验证容易。这限制了此类模型在更广泛领域中的适用性和泛化能力,在这些领域中,问题往往具有多样的答案表示形式,且数据更加稀缺。在本文中,我们提出了General-Reasoner,这是一种旨在提高LLM在多样化领域推理能力的新型训练范式。我们的主要贡献包括:(1)通过网页爬虫构建了一个大规模、高质量的问题数据集,其中包含可验证的答案,涵盖广泛的学科;以及(2)开发了一种基于生成模型的答案验证器,它用具备思维链和上下文感知能力的验证取代了传统的基于规则的验证。我们训练了一系列模型,并在涵盖物理、化学、金融、电子等广泛领域的12个基准测试(如MMLU-Pro、GPQA、SuperGPQA、TheoremQA、BBEH和MATH AMC)上进行了评估。结果表明,General-Reasoner在保持数学推理任务卓越效果的同时,实现了稳健且可泛化的推理性能,并优于现有基线方法。
论文及项目相关链接
Summary
强化学习在提升大型语言模型的推理能力方面具有巨大潜力。Deepseek-R1-Zero提出的“零强化学习”可直接训练基础语言模型,无需中间监督微调阶段。当前的研究主要集中在数学和编码领域,主要由于数据丰富和答案验证的简便性。这限制了模型在更广泛领域的适用性。本文提出General-Reasoner,旨在提高语言模型跨不同领域的推理能力。主要贡献包括构建大规模高质量问题数据集和基于生成模型的答案验证器。在物理、化学、金融、电子等领域的多个数据集上评估表明,General-Reasoner在保持数学推理任务高效性的同时,展现出稳健的跨领域推理性能。
Key Takeaways
- 强化学习可增强大型语言模型的推理能力。
- Deepseek-R1-Zero提出了“零强化学习”方法,可直接训练语言模型。
- 当前LLM推理研究主要集中在数学和编码领域,存在领域适用性的限制。
- General-Reasoner旨在提高语言模型在多个领域的推理能力。
- General-Reasoner构建了大规模、高质量的问题数据集。
- General-Reasoner采用基于生成模型的答案验证器,具备链式思维和上下文感知能力。
点此查看论文截图




Think Only When You Need with Large Hybrid-Reasoning Models
Authors:Lingjie Jiang, Xun Wu, Shaohan Huang, Qingxiu Dong, Zewen Chi, Li Dong, Xingxing Zhang, Tengchao Lv, Lei Cui, Furu Wei
Recent Large Reasoning Models (LRMs) have shown substantially improved reasoning capabilities over traditional Large Language Models (LLMs) by incorporating extended thinking processes prior to producing final responses. However, excessively lengthy thinking introduces substantial overhead in terms of token consumption and latency, which is particularly unnecessary for simple queries. In this work, we introduce Large Hybrid-Reasoning Models (LHRMs), the first kind of model capable of adaptively determining whether to perform thinking based on the contextual information of user queries. To achieve this, we propose a two-stage training pipeline comprising Hybrid Fine-Tuning (HFT) as a cold start, followed by online reinforcement learning with the proposed Hybrid Group Policy Optimization (HGPO) to implicitly learn to select the appropriate thinking mode. Furthermore, we introduce a metric called Hybrid Accuracy to quantitatively assess the model’s capability for hybrid thinking. Extensive experimental results show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type. It outperforms existing LRMs and LLMs in reasoning and general capabilities while significantly improving efficiency. Together, our work advocates for a reconsideration of the appropriate use of extended thinking processes and provides a solid starting point for building hybrid thinking systems.
最近的大型推理模型(LRMs)通过在生成最终回答之前融入扩展的思考过程,展现出相较于传统大型语言模型(LLMs)显著提升的推理能力。然而,过度冗长的思考过程在token消耗和延迟方面引入了相当大的开销,这对于简单的查询来说尤为不必要。在这项工作中,我们引入了大型混合推理模型(LHRMs),这是首类能够根据用户查询的上下文信息自适应地决定是否进行思考的模型。为了实现这一点,我们提出了一种两阶段的训练流程:首先以混合微调(HFT)作为冷启动,随后通过我们提出的混合组策略优化(HGPO)进行在线强化学习,以隐式地学习选择适当的思考模式。此外,我们还引入了一个名为混合准确率(Hybrid Accuracy)的指标来定量评估模型的混合思考能力。广泛的实验结果表明,LHRMs可以对不同难度和类型的查询自适应地进行混合思考。它在推理和通用能力方面优于现有的LRMs和LLMs,同时大大提高了效率。总的来说,我们的工作主张重新考虑扩展思考过程的适当使用,并为构建混合思考系统提供了一个坚实的起点。
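以下给出混合准确率这一指标的一个极简近似实现(具体定义以论文为准,此处的模式取值与对齐方式仅为示意性假设):

```python
def hybrid_accuracy(chosen_modes, preferred_modes):
    """chosen_modes / preferred_modes: 由 "think" 或 "no_think" 组成的等长序列,
    分别表示模型实际选择的思考模式与参考偏好的模式。
    混合准确率 = 两者一致的比例。"""
    assert len(chosen_modes) == len(preferred_modes)
    hits = sum(c == p for c, p in zip(chosen_modes, preferred_modes))
    return hits / max(len(chosen_modes), 1)
```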
论文及项目相关链接
Summary
近期的大型推理模型(LRMs)通过融入扩展的思考过程,展现出相较于传统大型语言模型(LLMs)显著改善的推理能力。然而,过度冗长的思考带来了大量的token消耗和延迟,这对于简单的查询来说尤为不必要。本研究提出了大型混合推理模型(LHRMs),这是首类能够根据用户查询的上下文信息自适应决定是否进行思考的模型。通过混合微调(HFT)的冷启动和所提出的混合组策略优化(HGPO)的在线强化学习,我们实现了两阶段训练管道,使模型能够隐式地学习选择适当的思考模式。此外,我们引入了混合准确率指标来定量评估模型的混合思考能力。大量实验结果表明,LHRMs可以在不同难度和类型的查询上自适应地进行混合思考,在推理和综合能力方面优于现有的LRMs和LLMs,同时大大提高了效率。
Key Takeaways
- 大型推理模型(LRMs)融入扩展思考过程以提升推理能力。
- 过度冗长的思考带来token消耗和延迟问题,尤其对于简单查询不必要。
- 提出大型混合推理模型(LHRMs),能根据用户查询的上下文信息自适应思考。
- 通过混合微调(HFT)的冷启动和混合组策略优化(HGPO)的在线强化学习实现自适应思考。
- 引入混合准确率指标来评估模型的混合思考能力。
- LHRMs在推理和综合能力方面优于现有模型,同时提高效率。
点此查看论文截图



Debating for Better Reasoning: An Unsupervised Multimodal Approach
Authors:Ashutosh Adhikari, Mirella Lapata
As Large Language Models (LLMs) gain expertise across diverse domains and modalities, scalable oversight becomes increasingly challenging, particularly when their capabilities may surpass human evaluators. Debate has emerged as a promising mechanism for enabling such oversight. In this work, we extend the debate paradigm to a multimodal setting, exploring its potential for weaker models to supervise and enhance the performance of stronger models. We focus on visual question answering (VQA), where two “sighted” expert vision-language models debate an answer, while a “blind” (text-only) judge adjudicates based solely on the quality of the arguments. In our framework, the experts defend only answers aligned with their beliefs, thereby obviating the need for explicit role-playing and concentrating the debate on instances of expert disagreement. Experiments on several multimodal tasks demonstrate that the debate framework consistently outperforms individual expert models. Moreover, judgments from weaker LLMs can help instill reasoning capabilities in vision-language models through finetuning.
随着大型语言模型(LLMs)在不同领域和模态中积累专业知识,可扩展的监督变得越来越具有挑战性,尤其是当它们的能力可能超越人类评估者时。辩论作为一种有前景的机制,可以实现这种监督。在这项工作中,我们将辩论范式扩展到多模态环境,探索利用较弱模型监督并增强较强模型表现的潜力。我们专注于视觉问答(VQA):两个“有视觉的”专家视觉语言模型就答案展开辩论,而一个“盲的”(仅文本)裁判仅根据论证的质量做出裁决。在我们的框架中,专家只捍卫与其自身判断相符的答案,从而不需要明确的角色扮演,并将辩论集中在专家出现分歧的实例上。在多个多模态任务上的实验表明,辩论框架始终优于单个专家模型。此外,较弱LLM的裁决还可以通过微调帮助视觉语言模型培养推理能力。
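下面的伪实现勾勒出该辩论流程的骨架:两个可见图像的专家先各自作答,只有在出现分歧时才进入多轮辩论,最后由仅读文本的裁判裁决(expert_a、expert_b、blind_judge 等接口均为示意性假设,并非论文代码):

```python
def debate_vqa(question, image, expert_a, expert_b, blind_judge, rounds=2):
    """expert_a / expert_b: 可看到图像的视觉语言模型,返回 (answer, argument);
    blind_judge: 仅接收文本论证的纯文本裁判,返回 "A" 或 "B"。"""
    ans_a, arg_a = expert_a(question, image)
    ans_b, arg_b = expert_b(question, image)
    if ans_a == ans_b:                       # 专家一致时无需辩论
        return ans_a
    transcript = [("A", arg_a), ("B", arg_b)]
    for _ in range(rounds - 1):              # 各自针对对方论证进行反驳
        arg_a = expert_a(question, image, rebut=transcript)[1]
        arg_b = expert_b(question, image, rebut=transcript)[1]
        transcript += [("A", arg_a), ("B", arg_b)]
    verdict = blind_judge(question, transcript)   # 仅依据论证质量裁决
    return ans_a if verdict == "A" else ans_b
```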
论文及项目相关链接
Summary
大型语言模型(LLMs)在多领域和多模态上展现卓越能力,给监督带来挑战,尤其是其能力可能超越人类评估者。辩论作为一种监督机制展现出巨大潜力。本研究将辩论范式扩展到多模态环境,探索弱模型监督强模型性能提升的可能性。在视觉问答(VQA)任务中,两个“有视觉”的专家视觉语言模型进行答案辩论,一个“盲”(仅文本)的裁判根据论证质量进行裁决。专家的辩护仅针对与其信仰相符的答案,从而无需显式角色扮演,集中辩论在专家分歧的实例上。在多个多模态任务上的实验表明,辩论框架的表现持续优于单个专家模型。此外,来自较弱的LLM的判断力可以通过微调赋予视觉语言模型推理能力。
Key Takeaways
- 大型语言模型(LLMs)在多领域和多模态上的能力带来监督挑战。
- 辩论作为一种监督机制在多模态环境中展现出巨大潜力。
- 在视觉问答(VQA)任务中,两个专家视觉语言模型进行答案辩论,强调对答案的不同看法和争论点。
- 一个“盲”裁判根据论证质量进行裁决,确保辩论的公正性。
- 辩论框架的表现优于单个专家模型,展示集体智慧的优势。
- 较弱LLM的裁决可用于微调,从而帮助视觉语言模型培养推理能力。
点此查看论文截图




TinyV: Reducing False Negatives in Verification Improves RL for LLM Reasoning
Authors:Zhangchen Xu, Yuetai Li, Fengqing Jiang, Bhaskar Ramasubramanian, Luyao Niu, Bill Yuchen Lin, Radha Poovendran
Reinforcement Learning (RL) has become a powerful tool for enhancing the reasoning abilities of large language models (LLMs) by optimizing their policies with reward signals. Yet, RL’s success relies on the reliability of rewards, which are provided by verifiers. In this paper, we expose and analyze a widespread problem–false negatives–where verifiers wrongly reject correct model outputs. Our in-depth study of the Big-Math-RL-Verified dataset reveals that over 38% of model-generated responses suffer from false negatives, where the verifier fails to recognize correct answers. We show, both empirically and theoretically, that these false negatives severely impair RL training by depriving the model of informative gradient signals and slowing convergence. To mitigate this, we propose tinyV, a lightweight LLM-based verifier that augments existing rule-based methods, which dynamically identifies potential false negatives and recovers valid responses to produce more accurate reward estimates. Across multiple math-reasoning benchmarks, integrating TinyV boosts pass rates by up to 10% and accelerates convergence relative to the baseline. Our findings highlight the critical importance of addressing verifier false negatives and offer a practical approach to improve RL-based fine-tuning of LLMs. Our code is available at https://github.com/uw-nsl/TinyV.
强化学习(RL)已成为通过奖励信号优化策略来提升大型语言模型(LLM)推理能力的一种强大工具。然而,RL的成功依赖于奖励的可靠性,而这些奖励由验证器提供。在本文中,我们揭示并分析了一个普遍存在的问题——假阴性(false negative),即验证器错误地拒绝了正确的模型输出。我们对Big-Math-RL-Verified数据集的深入研究显示,超过38%的模型生成响应存在假阴性问题,即验证器未能识别正确答案。我们从实证和理论两方面证明,这些假阴性会剥夺模型富含信息的梯度信号并减缓收敛,从而严重损害RL训练。为了缓解这一问题,我们提出了TinyV,一个基于LLM的轻量级验证器,用于增强现有的基于规则的方法:它能够动态识别潜在的假阴性并恢复有效响应,从而产生更准确的奖励估计。在多个数学推理基准测试中,集成TinyV将通过率最多提高了10%,并相对于基线加速了收敛。我们的研究强调了解决验证器假阴性问题的关键重要性,并为改善基于RL的LLM微调提供了一种实用方法。我们的代码位于https://github.com/uw-nsl/TinyV。
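其核心思想可以用一个两级验证函数来示意:规则验证器先判,若判为错误再交给轻量 LLM 验证器复核,以挽回假阴性(以下函数与接口均为假设,并非 TinyV 的实际代码):

```python
def verified_reward(question, model_answer, gold_answer, rule_verifier, llm_verifier):
    """两级验证:规则验证器先判;若判为错误,再用轻量 LLM 验证器复核,
    以挽回被规则误判的正确答案(假阴性)。rule_verifier / llm_verifier 均为假设的可调用对象。"""
    if rule_verifier(model_answer, gold_answer):
        return 1.0          # 规则直接判对
    if llm_verifier(question, model_answer, gold_answer):
        return 1.0          # 规则误判,LLM 复核挽回假阴性
    return 0.0              # 两级都判错,给零奖励
```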
论文及项目相关链接
Summary
本文介绍了强化学习(RL)在大规模语言模型(LLM)中的应用及其面临的挑战。主要问题是验证器产生的假阴性结果,即验证器错误地拒绝正确的模型输出。研究发现,Big-Math-RL-Verified数据集中超过38%的模型生成响应受到假阴性的影响。为解决这一问题,本文提出了一种基于LLM的轻量级验证器TinyV,它能够动态识别潜在的假阴性,恢复有效的响应,从而提供更准确的奖励估计。集成TinyV后,数学推理基准测试的通过率提高了10%,并且相对于基线加速了收敛。
Key Takeaways
- 强化学习(RL)用于增强大规模语言模型(LLM)的推理能力。
- 验证器产生的假阴性问题是RL面临的主要挑战之一。
- 假阴性结果会严重影响RL训练,剥夺模型的梯度信号并减慢收敛速度。
- Big-Math-RL-Verified数据集研究显示,超过38%的模型生成响应受到假阴性的影响。
- TinyV是一种基于LLM的轻量级验证器,旨在解决假阴性问题。
- 集成TinyV后,数学推理基准测试的通过率提高10%,并且加速收敛。
点此查看论文截图




Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models
Authors:Sahar Abdelnabi, Ahmed Salem
Reasoning-focused large language models (LLMs) sometimes alter their behavior when they detect that they are being evaluated, an effect analogous to the Hawthorne phenomenon, which can lead them to optimize for test-passing performance or to comply more readily with harmful prompts if real-world consequences appear absent. We present the first quantitative study of how such “test awareness” impacts model behavior, particularly its safety alignment. We introduce a white-box probing framework that (i) linearly identifies awareness-related activations and (ii) steers models toward or away from test awareness while monitoring downstream performance. We apply our method to different state-of-the-art open-source reasoning LLMs across both realistic and hypothetical tasks. Our results demonstrate that test awareness significantly impact safety alignment, and is different for different models. By providing fine-grained control over this latent effect, our work aims to increase trust in how we perform safety evaluation.
以推理为重点的大型语言模型(LLM)在检测到自己正在被评估时,有时会改变其行为,这种效应类似于霍桑效应,可能导致它们为了通过测试而优化表现,或者在现实世界后果似乎不存在时更轻易地顺从有害的提示。我们首次对这种“测试意识”如何影响模型行为(尤其是其安全对齐)进行了定量研究。我们引入了一个白盒探测框架,该框架能够(i)线性识别与测试意识相关的激活,(ii)在监测下游性能的同时,引导模型靠近或远离测试意识。我们将该方法应用于不同的最新开源推理LLM,涵盖现实和假设任务。结果表明,测试意识对安全对齐有显著影响,且不同模型之间存在差异。通过对这一潜在效应进行精细控制,我们的工作旨在增强人们对安全评估方式的信任。
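线性识别并引导“测试意识”方向的做法,可以用差分均值探针做一个极简示意(并非论文的完整方法;激活的收集方式、形状与系数 alpha 均为假设):

```python
import numpy as np

def awareness_direction(acts_eval, acts_real):
    """acts_eval / acts_real: 两组隐藏状态(“意识到被评估” vs “真实部署”),
    形状均为 [样本数, 隐维度]。用均值差作为线性方向,即差分均值探针的简化形式。"""
    direction = acts_eval.mean(axis=0) - acts_real.mean(axis=0)
    return direction / (np.linalg.norm(direction) + 1e-8)

def steer_hidden_state(h, direction, alpha):
    """alpha > 0 将模型推向“测试意识”,alpha < 0 则反向引导。"""
    return h + alpha * direction
```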
论文及项目相关链接
Summary:大型语言模型在检测到评估时会改变行为,类似霍桑现象,优化测试通过性能或更易于遵守有害提示,缺少实际后果感知。一项研究定量探讨了测试意识如何影响模型行为的安全性。引入白盒探测框架,识别测试意识相关激活并控制模型测试意识,同时监测下游性能。应用于不同先进推理大型语言模型的结果显示测试意识对安全对齐有重大影响,且不同模型间存在差异。此研究旨在增加对安全评估的信任度。
Key Takeaways:
- 大型语言模型在检测评估时会改变行为。
- 测试意识与霍桑现象类似,可能导致模型优化测试通过性能或更易于遵守有害提示。
- 缺乏实际后果感知会加剧这一影响。
- 研究通过白盒探测框架定量研究测试意识对模型行为安全性的影响。
- 测试意识在不同先进推理大型语言模型中的影响存在显著差异。
- 提供对测试意识的精细控制有助于增加对安全评估的信任度。
点此查看论文截图




Context Reasoner: Incentivizing Reasoning Capability for Contextualized Privacy and Safety Compliance via Reinforcement Learning
Authors:Wenbin Hu, Haoran Li, Huihao Jing, Qi Hu, Ziqian Zeng, Sirui Han, Heli Xu, Tianshu Chu, Peizhao Hu, Yangqiu Song
While Large Language Models (LLMs) exhibit remarkable capabilities, they also introduce significant safety and privacy risks. Current mitigation strategies often fail to preserve contextual reasoning capabilities in risky scenarios. Instead, they rely heavily on sensitive pattern matching to protect LLMs, which limits the scope. Furthermore, they overlook established safety and privacy standards, leading to systemic risks for legal compliance. To address these gaps, we formulate safety and privacy issues into contextualized compliance problems following the Contextual Integrity (CI) theory. Under the CI framework, we align our model with three critical regulatory standards: GDPR, EU AI Act, and HIPAA. Specifically, we employ reinforcement learning (RL) with a rule-based reward to incentivize contextual reasoning capabilities while enhancing compliance with safety and privacy norms. Through extensive experiments, we demonstrate that our method not only significantly enhances legal compliance (achieving a +17.64% accuracy improvement in safety/privacy benchmarks) but also further improves general reasoning capability. For OpenThinker-7B, a strong reasoning model that significantly outperforms its base model Qwen2.5-7B-Instruct across diverse subjects, our method enhances its general reasoning capabilities, with +2.05% and +8.98% accuracy improvement on the MMLU and LegalBench benchmark, respectively.
虽然大型语言模型(LLM)展现出显著的能力,但它们也带来了重大的安全和隐私风险。当前的缓解策略往往无法在高风险场景中保留上下文推理能力,而是过度依赖敏感模式匹配来保护LLM,这限制了其适用范围。此外,它们忽略了既定的安全和隐私标准,给法律合规带来系统性风险。为了解决这些差距,我们依据情境完整性(Contextual Integrity, CI)理论,将安全和隐私问题形式化为情境化的合规问题。在CI框架下,我们使模型与三个关键的监管标准保持一致:GDPR、欧盟人工智能法案和HIPAA。具体来说,我们采用带有基于规则奖励的强化学习(RL),在提升上下文推理能力的同时增强对安全和隐私规范的遵守。通过广泛的实验,我们证明该方法不仅显著提高了法律合规性(在安全/隐私基准测试中实现了+17.64%的准确率提升),而且进一步提高了通用推理能力。对于强推理模型OpenThinker-7B(其在多种主题上显著优于基础模型Qwen2.5-7B-Instruct),我们的方法进一步增强了其通用推理能力,在MMLU和LegalBench基准测试上的准确率分别提升了+2.05%和+8.98%。
论文及项目相关链接
Summary
大型语言模型(LLMs)在安全与隐私方面存在显著风险,现有缓解策略往往无法兼顾语境推理能力,并过分依赖敏感模式匹配来保护LLMs,这限制了其应用范围。为解决这些问题,本文根据语境完整性(CI)理论,将安全与隐私问题转化为语境化合规问题,并与三大关键监管标准(GDPR、欧盟人工智能法案和HIPAA)对齐。通过强化学习与基于规则的奖励机制,在提升语境推理能力的同时增强对安全与隐私规范的遵守。实验证明,该方法不仅显著提高法律合规性,还进一步提升了通用推理能力。
Key Takeaways
- 大型语言模型(LLMs)存在安全与隐私风险。
- 现有缓解策略难以兼顾语境推理能力,并依赖敏感模式匹配,限制了应用范围。
- 引入语境完整性(CI)理论来解决安全与隐私问题,将其转化为语境化合规问题。
- 与三大关键监管标准(GDPR、欧盟人工智能法案和HIPAA)对齐。
- 采用强化学习提升语境推理能力和对安全与隐私规范的遵守。
- 方法显著提高法律合规性,在安全和隐私基准测试上实现+17.64%的准确度提升。
点此查看论文截图






R2MED: A Benchmark for Reasoning-Driven Medical Retrieval
Authors:Lei Li, Xiao Zhou, Zheng Liu
Current medical retrieval benchmarks primarily emphasize lexical or shallow semantic similarity, overlooking the reasoning-intensive demands that are central to clinical decision-making. In practice, physicians often retrieve authoritative medical evidence to support diagnostic hypotheses. Such evidence typically aligns with an inferred diagnosis rather than the surface form of a patient’s symptoms, leading to low lexical or semantic overlap between queries and relevant documents. To address this gap, we introduce R2MED, the first benchmark explicitly designed for reasoning-driven medical retrieval. It comprises 876 queries spanning three tasks: Q&A reference retrieval, clinical evidence retrieval, and clinical case retrieval. These tasks are drawn from five representative medical scenarios and twelve body systems, capturing the complexity and diversity of real-world medical information needs. We evaluate 15 widely-used retrieval systems on R2MED and find that even the best model achieves only 31.4 nDCG@10, demonstrating the benchmark’s difficulty. Classical re-ranking and generation-augmented retrieval methods offer only modest improvements. Although large reasoning models improve performance via intermediate inference generation, the best results still peak at 41.4 nDCG@10. These findings underscore a substantial gap between current retrieval techniques and the reasoning demands of real clinical tasks. We release R2MED as a challenging benchmark to foster the development of next-generation medical retrieval systems with enhanced reasoning capabilities. Data and code are available at https://github.com/R2MED/R2MED
当前医学检索基准测试主要强调词汇或浅层语义相似性,忽视了临床决策核心中的推理需求。在实践中,医生通常会检索权威医学证据来支持诊断假设。这样的证据通常与推断出的诊断结果相符,而非患者症状的表层形式,这导致查询与相关文档之间的词汇或语义重叠较低。为了弥补这一差距,我们推出了R2MED,这是首个专为推理驱动医学检索设计的基准测试。它包含876个查询,涵盖三个任务:问答参考检索、临床证据检索和临床案例检索。这些任务来自五个具有代表性的医学场景和十二个身体系统,捕捉了现实世界医学信息需求的复杂性和多样性。我们在R2MED上评估了15种广泛使用的检索系统,发现即使是最好的模型也仅达到31.4的nDCG@10,这显示了该基准测试的难度。传统的重排序和生成增强检索方法只带来了适度的改进。尽管大型推理模型通过中间推理生成提高了性能,但最佳结果仍然只达到41.4的nDCG@10。这些发现突显了当前检索技术与实际临床任务推理需求之间的巨大差距。我们发布R2MED作为一个具有挑战性的基准测试,以推动具备增强推理能力的新一代医学检索系统的发展。数据和代码可在https://github.com/R2MED/R2MED获取。
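文中报告的 nDCG@10 是标准的检索评测指标,其计算方式可用如下简短的 Python 函数说明(与 R2MED 的具体评测脚本无关,仅为示意):

```python
import numpy as np

def ndcg_at_k(ranked_doc_ids, relevance, k=10):
    """ranked_doc_ids: 检索系统返回的文档 id 排序;relevance: {doc_id: 相关度分级}。"""
    gains = [relevance.get(d, 0) for d in ranked_doc_ids[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / np.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```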
论文及项目相关链接
PDF 38 pages, 16 figures
Summary:
当前医学检索基准测试主要侧重于词汇或浅层语义相似性,忽视了临床决策中的推理需求。为弥补这一不足,我们推出R2MED基准测试,专门用于推理驱动型医学检索。R2MED包括876个查询,涵盖问答参考检索、临床证据检索和临床案例检索三项任务,反映真实世界的医学信息需求复杂性和多样性。对现有广泛使用的15种检索系统在R2MED上的评估表明,即使是最优秀的模型也仅达到31.4的nDCG@10得分,显示该基准测试的困难性。尽管经典的重排序和生成增强检索方法略有改进,但最佳结果仍仅限于41.4的nDCG@10。这表明当前检索技术与临床任务中的推理需求之间存在巨大差距。我们发布R2MED作为一个具有挑战性的基准测试,以促进具有增强推理能力的新一代医学检索系统的开发。
Key Takeaways:
- 当前医学检索基准测试主要侧重于词汇或浅层语义相似性,忽略了临床决策中的推理需求。
- R2MED基准测试是专门为推理驱动型医学检索设计的,涵盖多种任务,反映真实医学信息需求的复杂性和多样性。
- 对广泛使用的检索系统在R2MED上的评估显示,现有技术面临巨大挑战,最佳模型表现仍有很大提升空间。
- 经典的重排序和生成增强检索方法仅带来有限改进。
- 大型推理模型通过中间推理生成能提高性能,但仍未达到理想水平。
- 当前检索技术与临床任务中的推理需求之间存在显著差距。
点此查看论文截图





Not All Correct Answers Are Equal: Why Your Distillation Source Matters
Authors:Xiaoyu Tian, Yunjie Ji, Haotian Wang, Shuaiting Chen, Sitong Zhao, Yiping Peng, Han Zhao, Xiangang Li
Distillation has emerged as a practical and effective approach to enhance the reasoning capabilities of open-source language models. In this work, we conduct a large-scale empirical study on reasoning data distillation by collecting verified outputs from three state-of-the-art teacher models (AM-Thinking-v1, Qwen3-235B-A22B, and DeepSeek-R1) on a shared corpus of 1.89 million queries. We construct three parallel datasets and analyze their distributions, revealing that AM-Thinking-v1-distilled data exhibits greater token length diversity and lower perplexity. Student models trained on each dataset are evaluated on reasoning benchmarks including AIME2024, AIME2025, MATH500, and LiveCodeBench. The AM-based model consistently achieves the best performance (e.g., 84.3 on AIME2024, 72.2 on AIME2025, 98.4 on MATH500, and 65.9 on LiveCodeBench) and demonstrates adaptive output behavior, producing longer responses for harder tasks and shorter ones for simpler tasks. These findings highlight the value of high-quality, verified reasoning traces. We release the AM-Thinking-v1 and Qwen3-235B-A22B distilled datasets to support future research on open and high-performing reasoning-oriented language models. The datasets are publicly available on Hugging Face: AM-Thinking-v1-Distilled (https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled) and AM-Qwen3-Distilled (https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled).
蒸馏法已成为提高开源语言模型推理能力的一种实用有效的方法。在这项工作中,我们对推理数据蒸馏进行了大规模实证研究,通过收集来自AM-Thinking-v1、Qwen3-235B-A22B和DeepSeek-R1三个先进教师模型的验证输出,在包含189万个查询的共享语料库上进行对比分析。我们构建了三个并行数据集并对其分布进行了分析,发现由AM-Thinking-v1蒸馏的数据具有更大的令牌长度多样性和更低的困惑度。在AIME2024、AIME2025、MATH500和LiveCodeBench等推理基准测试上,对各个数据集训练的学生模型进行了评估。基于AM的模型表现最为出色(例如,在AIME2024上得分为84.3,在AIME2025上得分为72.2,在MATH500上得分为98.4,在LiveCodeBench上得分为65.9),并展现出自适应输出行为——为更困难的任务生成更长的响应,为更简单的任务生成更短的响应。这些发现突显了高质量验证推理轨迹的价值。我们发布了由AM-Thinking-v1和Qwen3-235B-A22B蒸馏的数据集,以支持未来对开放和高性能推理导向语言模型的研究。数据集已在Hugging Face上公开提供:AM-Thinking-v1-Distilled(https://huggingface.co/datasets/a-m-team/AM-Thinking-v1-Distilled)与AM-Qwen3-Distilled(https://huggingface.co/datasets/a-m-team/AM-Qwen3-Distilled)。
论文及项目相关链接
Summary
文本介绍了蒸馏法在提高开源语言模型的推理能力方面的实际应用和效果。通过对三种顶级教师模型(AM-Thinking-v1、Qwen3-235B-A22B和DeepSeek-R1)进行大规模实证研究,构建了三组并行数据集,并分析了其分布特点。发现AM-Thinking-v1蒸馏数据具有更大的令牌长度多样性和更低的困惑度。学生模型在多个推理基准测试上的表现显示,基于AM的模型表现最佳。此外,该模型展现出适应性输出行为,针对难度更高的任务产生更长的响应,针对简单的任务则产生较短的响应。最后,公开发布了两个蒸馏数据集以支持未来的研究。
Key Takeaways
- 蒸馏法是一种提高开源语言模型推理能力的实用有效方法。
- 对三种顶级教师模型进行大规模实证研究,构建并行数据集。
- AM-Thinking-v1蒸馏数据具有更高的令牌长度多样性和更低的困惑度。
- 基于AM的模型在多个推理基准测试中表现最佳。
- AM模型能适应性输出不同长度的响应,适应任务难度。
- 公开发布了两个蒸馏数据集,支持未来研究。
点此查看论文截图




VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank
Authors:Tianhe Wu, Jian Zou, Jie Liang, Lei Zhang, Kede Ma
DeepSeek-R1 has demonstrated remarkable effectiveness in incentivizing reasoning and generalization capabilities of large language models (LLMs) through reinforcement learning. Nevertheless, the potential of reasoning-induced computational modeling has not been thoroughly explored in the context of image quality assessment (IQA), a task critically dependent on visual reasoning. In this paper, we introduce VisualQuality-R1, a reasoning-induced no-reference IQA (NR-IQA) model, and we train it with reinforcement learning to rank, a learning algorithm tailored to the intrinsically relative nature of visual quality. Specifically, for a pair of images, we employ group relative policy optimization to generate multiple quality scores for each image. These estimates are then used to compute comparative probabilities of one image having higher quality than the other under the Thurstone model. Rewards for each quality estimate are defined using continuous fidelity measures rather than discretized binary labels. Extensive experiments show that the proposed VisualQuality-R1 consistently outperforms discriminative deep learning-based NR-IQA models as well as a recent reasoning-induced quality regression method. Moreover, VisualQuality-R1 is capable of generating contextually rich, human-aligned quality descriptions, and supports multi-dataset training without requiring perceptual scale realignment. These features make VisualQuality-R1 especially well-suited for reliably measuring progress in a wide range of image processing tasks like super-resolution and image generation.
DeepSeek-R1通过强化学习在激励大型语言模型(LLM)的推理和泛化能力方面表现出了显著的有效性。然而,在图像质量评估(IQA)这一严重依赖视觉推理的任务中,推理诱导计算建模的潜力尚未得到充分探索。在本文中,我们介绍了VisualQuality-R1,这是一个由推理诱导的无参考图像质量评估(NR-IQA)模型,我们使用强化学习排名(reinforcement learning to rank)对其进行训练,这是一种契合视觉质量内在相对性的学习算法。具体来说,对于一对图像,我们采用组相对策略优化为每张图像生成多个质量分数,然后利用这些估计在Thurstone模型下计算一张图像质量高于另一张图像的比较概率。每个质量估计的奖励由连续的保真度度量定义,而不是离散化的二元标签。大量实验表明,所提出的VisualQuality-R1持续优于基于判别式深度学习的NR-IQA模型以及最新的推理诱导质量回归方法。此外,VisualQuality-R1能够生成上下文丰富、与人类判断对齐的质量描述,并支持多数据集训练,无需重新对齐感知尺度。这些特点使得VisualQuality-R1尤其适合在超分辨率和图像生成等广泛的图像处理任务中可靠地衡量进展。
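Thurstone 模型下的比较概率与连续保真度奖励可以用如下片段示意(此处采用 Thurstone Case V 的常见近似,均值与方差的组合方式为简化假设,并非论文的完整实现):

```python
import numpy as np
from math import erf, sqrt

def comparative_probability(scores_a, scores_b):
    """scores_a / scores_b: 对图像 A、B 各自多次采样得到的质量分数。
    在 Thurstone Case V 假设下,P(A 优于 B) = Phi((mu_A - mu_B) / sqrt(var_A + var_B))。"""
    mu_a, mu_b = np.mean(scores_a), np.mean(scores_b)
    var = np.var(scores_a) + np.var(scores_b) + 1e-8
    z = (mu_a - mu_b) / sqrt(var)
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))      # 标准正态分布的 CDF

def fidelity_reward(p_pred, p_true):
    """连续的保真度度量,替代离散的 0/1 标签。"""
    return sqrt(p_pred * p_true) + sqrt((1 - p_pred) * (1 - p_true))
```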
论文及项目相关链接
Summary
DeepSeek-R1通过强化学习在激励大型语言模型的推理和泛化能力方面展现了显著成效,但基于视觉推理的图像质量评估(IQA)方面的潜力尚未被充分发掘。为此,本文提出了VisualQuality-R1这一无参考图像质量评估(NR-IQA)模型,并使用强化学习排名进行训练。实验表明,该模型持续优于基于判别式深度学习的NR-IQA模型以及一种最新的推理诱导质量回归方法。此外,VisualQuality-R1可生成上下文丰富、与人类判断对齐的质量描述,并支持多数据集训练而无需重新对齐感知尺度,尤其适用于对图像超分辨率和图像生成等任务进行可靠的质量评估。
Key Takeaways
- DeepSeek-R1在激励大规模语言模型的推理和泛化能力方面表现出显著成效。
- VisualQuality-R1是一个基于视觉推理的无参考图像质量评估模型。
- VisualQuality-R1采用强化学习进行排名训练,并基于Thurstone模型生成相对质量概率。
- 该模型使用连续忠实度度量作为奖励函数,而非离散二元标签。
- 实验显示VisualQuality-R1性能优于其他深度学习和推理感应质量回归模型。
- VisualQuality-R1能生成丰富的上下文质量描述,支持多数据集训练。
点此查看论文截图



Rank-K: Test-Time Reasoning for Listwise Reranking
Authors:Eugene Yang, Andrew Yates, Kathryn Ricci, Orion Weller, Vivek Chari, Benjamin Van Durme, Dawn Lawrie
Retrieve-and-rerank is a popular retrieval pipeline because of its ability to make slow but effective rerankers efficient enough at query time by reducing the number of comparisons. Recent works in neural rerankers take advantage of large language models for their capability in reasoning between queries and passages and have achieved state-of-the-art retrieval effectiveness. However, such rerankers are resource-intensive, even after heavy optimization. In this work, we introduce Rank-K, a listwise passage reranking model that leverages the reasoning capability of the reasoning language model at query time that provides test time scalability to serve hard queries. We show that Rank-K improves retrieval effectiveness by 23% over the RankZephyr, the state-of-the-art listwise reranker, when reranking a BM25 initial ranked list and 19% when reranking strong retrieval results by SPLADE-v3. Since Rank-K is inherently a multilingual model, we found that it ranks passages based on queries in different languages as effectively as it does in monolingual retrieval.
检索-重排(Retrieve-and-rerank)是一种流行的检索流程,它通过减少比较次数,使缓慢但有效的重排器在查询时仍足够高效。近期的神经重排器利用大型语言模型在查询与段落之间进行推理的能力,实现了最先进的检索效果。然而,即使经过大量优化,这类重排器的资源消耗仍然很大。在这项工作中,我们引入了Rank-K,这是一种列表式(listwise)段落重排模型,它在查询时利用推理语言模型的推理能力,从而具备服务困难查询所需的测试时可扩展性。我们展示了在重排BM25初始排名列表时,Rank-K的检索效果比最先进的列表式重排器RankZephyr高出23%;在重排SPLADE-v3的强检索结果时高出19%。由于Rank-K本质上是一个多语言模型,我们发现它在基于不同语言的查询对段落进行排序时,与单语检索同样有效。
论文及项目相关链接
PDF 15 pages, 4 figures
Summary
Retrieve-and-rerank管道通过减少比较次数,使慢速但有效的重排器在查询时保持高效。最新的神经重排器利用大型语言模型进行推理,达到了最先进的检索效果,但即使经过优化仍需要大量资源。本工作提出的Rank-K是一种基于推理语言模型的listwise段落重排模型,提供了测试时的可扩展性以应对困难查询。实验结果显示,相较于最先进的listwise重排器RankZephyr,Rank-K在重排BM25初始排名列表时检索有效性提升了23%,在重排SPLADE-v3的强检索结果时提升了19%。由于Rank-K本质上是一个多语言模型,它在处理不同语言的查询时同样有效。
Key Takeaways
- Rank-K模型利用大型语言模型的推理能力进行listwise段落重排。
- Rank-K能够在减少资源消耗的同时提高检索效率。
- 与先进的listwise重排器RankZephyr相比,Rank-K提高了检索有效性。
- Rank-K模型在处理多语言查询时同样有效。
- Rank-K通过优化提高检索效率的同时,保持了良好的性能表现。
- Rank-K模型适用于处理复杂的查询需求,提供了查询时间的可扩展性。
点此查看论文截图





SCOPE: Compress Mathematical Reasoning Steps for Efficient Automated Process Annotation
Authors:Huimin Xu, Xin Mao, Feng-Lin Li, Xiaobao Wu, Wang Chen, Wei Zhang, Anh Tuan Luu
Process Reward Models (PRMs) have demonstrated promising results in mathematical reasoning, but existing process annotation approaches, whether through human annotations or Monte Carlo simulations, remain computationally expensive. In this paper, we introduce Step COmpression for Process Estimation (SCOPE), a novel compression-based approach that significantly reduces annotation costs. We first translate natural language reasoning steps into code and normalize them through Abstract Syntax Tree, then merge equivalent steps to construct a prefix tree. Unlike simulation-based methods that waste numerous samples on estimation, SCOPE leverages a compression-based prefix tree where each root-to-leaf path serves as a training sample, reducing the complexity from $O(NMK)$ to $O(N)$. We construct a large-scale dataset containing 196K samples with only 5% of the computational resources required by previous methods. Empirical results demonstrate that PRMs trained on our dataset consistently outperform existing automated annotation approaches on both Best-of-N strategy and ProcessBench.
过程奖励模型(PRM)在数学推理方面展现出了有前景的结果,但是现有的过程标注方法,无论是人工标注还是蒙特卡洛模拟,在计算上仍然成本高昂。在本文中,我们引入了基于步骤压缩的过程估计方法(SCOPE),这是一种新型的压缩式方法,可以显著降低标注成本。我们首先把自然语言推理步骤翻译成代码,并通过抽象语法树进行规范化处理,然后合并等价步骤来构建前缀树。不同于基于模拟的方法在估计上浪费大量样本,SCOPE利用基于压缩的前缀树,其中每条从根到叶的路径都可作为训练样本,将复杂度从$O(NMK)$降低到$O(N)$。我们构建了一个包含19万6千个样本的大规模数据集,仅使用了之前方法所需5%的计算资源。实证结果表明,在我们数据集上训练的PRM在Best-of-N策略和ProcessBench上始终优于现有的自动化标注方法。
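“前缀树合并等价推理步骤、并以根到叶路径作为训练样本”的思路,可用如下极简数据结构示意(字段名与统计方式为便于说明的假设,并非论文的完整流程):

```python
class PrefixTreeNode:
    def __init__(self):
        self.children = {}     # 规范化后的推理步骤 -> 子节点
        self.visits = 0        # 经过该前缀的推理轨迹数
        self.correct = 0       # 其中最终答案正确的轨迹数

def insert_trace(root, normalized_steps, is_correct):
    """normalized_steps: 先译成代码、经 AST 规范化并合并等价步骤后的步骤序列。"""
    node = root
    for step in normalized_steps:
        node = node.children.setdefault(step, PrefixTreeNode())
        node.visits += 1
        node.correct += int(is_correct)

def step_label(node):
    """以“经过该前缀的轨迹中最终正确的比例”近似该步骤的过程标注。"""
    return node.correct / node.visits if node.visits else 0.0
```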
论文及项目相关链接
Summary
流程奖励模型(PRM)在数学推理中表现优异,但现有的流程标注方法计算成本高昂。本文引入基于压缩的流程估算方法(SCOPE),首先通过抽象语法树将自然语言推理步骤转化为代码并进行归一化,然后合并等价步骤构建前缀树。与基于模拟的方法相比,SCOPE利用压缩式前缀树,每条从根到叶的路径均作为训练样本,将复杂度从O(NMK)降至O(N)。我们构建的大型数据集包含19.6万样本,仅需之前方法5%的计算资源。实验结果显示,在Best-of-N策略和ProcessBench上,基于我们数据集训练的PRM表现均优于现有自动标注方法。
Key Takeaways
- SCOPE是一种新型的基于压缩的流程估算方法,用于降低流程奖励模型(PRM)的标注成本。
- SCOPE通过将自然语言推理步骤转化为代码并归一化,进而构建前缀树,实现了高效的样本生成。
- SCOPE方法将复杂度从O(NMK)降低到O(N),提高了计算效率。
- 构建的大型数据集包含19.6万样本,显著减少了计算资源需求。
- PRM在最佳N策略和ProcessBench上的表现优于现有自动标注方法。
- SCOPE方法为PRM的研究提供了新思路,有助于推动自然语言推理领域的发展。
点此查看论文截图





Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning
Authors:Zhaohui Yang, Shilei Jiang, Chen Hu, Linjing Li, Shihong Deng, Daxin Jiang
Recent advances in reasoning language models have witnessed a paradigm shift from short to long CoT pattern. Given the substantial computational cost of rollouts in long CoT models, maximizing the utility of fixed training datasets becomes crucial. Our analysis reveals that negative responses contain valuable components such as self-reflection and error-correction steps, yet primary existing methods either completely discard negative samples (RFT) or apply equal penalization across all tokens (RL), failing to leverage these potential learning signals. In light of this, we propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline RL framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset, achieving improved sample efficiency and demonstrating robustness and scalability when extended to multiple iterations.
近期推理语言模型的进步见证了从短CoT模式到长CoT模式的范式转变。考虑到长CoT模型中rollout的巨大计算成本,最大限度地发挥固定训练数据集的效用变得至关重要。我们的分析表明,负样本(否定回答)中包含自我反思和纠错步骤等有价值的成分,但现有的主要方法要么完全丢弃负样本(RFT),要么对所有token施加同等惩罚(RL),无法利用这些潜在的学习信号。鉴于此,我们提出了带负样本增强的行为约束策略梯度(BCPG-NSA),这是一个细粒度的离线强化学习框架,包括三个阶段:1)样本分割;2)结合LLM与PRM评审的基于共识的步骤正确性评估;3)利用NSA进行策略优化,以有效挖掘负样本中的正确步骤。实验结果表明,在使用相同训练数据集的情况下,BCPG-NSA在多个具有挑战性的数学/代码推理基准测试中超越了基线方法,提高了样本效率,并在扩展到多次迭代时表现出稳健性和可扩展性。
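“在负样本中挖掘正确步骤”的关键,可以用一个按步骤屏蔽惩罚的token级优势函数来示意(步骤区间与正确性标签假定已由 LLM/PRM 评审给出,其余细节均为假设性简化):

```python
def token_advantages(step_spans, step_correct, base_advantage):
    """step_spans: 每个推理步骤对应的 token 区间 [(start, end), ...]
    step_correct: 与之对齐的布尔列表,由 LLM 与 PRM 评审共识得到
    base_advantage: 该条轨迹的整体优势值,负样本时为负数。
    负样本中被判为“正确”的步骤不再施加惩罚,从而保留其中的正向学习信号。"""
    adv = []
    for (start, end), ok in zip(step_spans, step_correct):
        a = 0.0 if (base_advantage < 0 and ok) else base_advantage
        adv.extend([a] * (end - start))
    return adv
```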
论文及项目相关链接
Summary
近期推理语言模型的进展经历了从短CoT模式到长CoT模式的转变。考虑到长CoT模型中滚动计算的巨大成本,最大化固定训练数据集的使用变得至关重要。分析表明,否定回答包含有价值的成分,如自我反思和纠错步骤。现有的主要方法要么完全放弃负面样本(RFT),要么对所有令牌应用平等的惩罚(RL),未能利用这些潜在的学习信号。因此,我们提出了行为约束策略梯度与负样本扩充(BCPG-NSA),这是一种精细的离线强化学习框架,包括三个阶段:1)样本分割;2)结合LLM和PRM判断器进行基于共识的步骤正确性评估;3)针对负样本中的积极步骤进行设计的NSA策略优化。实验结果表明,BCPG-NSA在几个具有挑战性的数学/编码推理基准测试中优于基线,使用相同的训练数据集,提高了样本效率,并在扩展到多次迭代时表现出稳健性和可扩展性。
Key Takeaways
- 推理语言模型经历从短CoT到长CoT的转变。
- 长CoT模型面临巨大的计算成本,因此有效利用固定训练数据集变得重要。
- 否定回答包含有价值的成分如自我反思和纠错步骤。
- 现有方法未能充分利用负面样本中的学习信号。
- BCPG-NSA是一个精细的离线强化学习框架,包括样本分割、基于共识的步骤正确性评估和策略优化三个阶段。
- BCPG-NSA在多个基准测试中表现优于基线方法。
点此查看论文截图




Log-Augmented Generation: Scaling Test-Time Reasoning with Reusable Computation
Authors:Peter Baile Chen, Yi Zhang, Dan Roth, Samuel Madden, Jacob Andreas, Michael Cafarella
While humans naturally learn and adapt from past experiences, large language models (LLMs) and their agentic counterparts struggle to retain reasoning from previous tasks and apply them in future contexts. To address this limitation, we propose a novel framework, log-augmented generation (LAG) that directly reuses prior computation and reasoning from past logs at test time to enhance model’s ability to learn from previous tasks and perform better on new, unseen challenges, all while keeping the system efficient and scalable. Specifically, our system represents task logs using key-value (KV) caches, encoding the full reasoning context of prior tasks while storing KV caches for only a selected subset of tokens. When a new task arises, LAG retrieves the KV values from relevant logs to augment generation. Our approach differs from reflection-based memory mechanisms by directly reusing prior reasoning and computations without requiring additional steps for knowledge extraction or distillation. Our method also goes beyond existing KV caching techniques, which primarily target efficiency gains rather than improving accuracy. Experiments on knowledge- and reasoning-intensive datasets demonstrate that our method significantly outperforms standard agentic systems that do not utilize logs, as well as existing solutions based on reflection and KV cache techniques.
人类能够自然地从过去的经验中学习并适应,然而大型语言模型(LLM)及其智能体难以保留先前任务中的推理并将其应用于未来情境。为了解决这一局限性,我们提出了一种新型框架——日志增强生成(LAG),它在测试时直接重用过去日志中的计算和推理,以增强模型从先前任务中学习并在新的、未见过的挑战上表现更佳的能力,同时保持系统的效率和可扩展性。具体来说,我们的系统使用键值(KV)缓存来表示任务日志,编码先前任务的完整推理上下文,但仅对选定的一部分token存储KV缓存。当出现新任务时,LAG会检索相关日志中的KV值来增强生成。我们的方法与基于反思的记忆机制不同,它直接重用先前的推理和计算,无需额外的知识提取或蒸馏步骤。我们的方法也超越了现有的KV缓存技术,后者主要关注效率提升而非提高准确性。在知识和推理密集型数据集上的实验表明,我们的方法显著优于不使用日志的标准智能体系统,以及基于反思和KV缓存技术的现有解决方案。
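“仅存储所选 token 子集的 KV 缓存、并按相似度检索相关日志”的流程,可用如下草图说明(KV 张量形状假定为 [batch, heads, seq, dim],embed 为任意文本编码器,log_store 为简单列表,均为示意性假设):

```python
import numpy as np

def store_log(log_store, task_text, kv_cache, selected_idx, embed):
    """仅保留所选 token 子集的 KV 缓存,并以任务文本的向量作为检索键。"""
    log_store.append({
        "key": embed(task_text),                                   # 检索用向量
        "kv": {name: kv[:, :, selected_idx, :]                     # 只存所选 token
               for name, kv in kv_cache.items()},
    })

def retrieve_logs(log_store, query_text, embed, top_k=2):
    """按内积相似度取回最相关的若干条日志的 KV 缓存,用于拼接到新任务的前缀。"""
    q = embed(query_text)
    sims = [float(np.dot(q, item["key"])) for item in log_store]
    order = np.argsort(sims)[::-1][:top_k]
    return [log_store[i]["kv"] for i in order]
```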
论文及项目相关链接
PDF Data and code are available at https://peterbaile.github.io/lag/
Summary
文章讨论了人类从过去的经验中学习适应的能力,但大型语言模型及其智能对应实体则难以实现这一点。为解决这一局限,提出了名为LAG的新框架,该框架在测试阶段直接利用过去的日志中的计算和推理能力,以提高模型从过去任务中学习并在新挑战中表现的能力,同时保持系统的效率和可扩展性。具体来说,该系统使用键值缓存来代表任务日志,存储先前任务的完整推理上下文中的关键值缓存仅用于一个子集的标记。当出现新任务时,LAG会从相关日志中检索键值以增强生成能力。与其他基于反射的记忆机制不同,LAG直接重用先前的推理和计算,无需额外的知识提取或蒸馏步骤。此外,与其他主要关注效率提升的KV缓存技术相比,LAG侧重于提高准确性。在知识和推理密集型数据集上的实验表明,与不使用日志的标准智能系统和现有解决方案相比,该方法具有显著优势。
Key Takeaways
- 大型语言模型及其智能对应实体在保留过去任务的推理并将其应用于未来上下文方面存在困难。
- LAG框架旨在通过直接利用过去的日志中的计算和推理能力来解决这一挑战。
- LAG使用键值缓存来存储和表示任务日志中的关键信息。
- LAG能够从相关日志中检索键值以增强模型在新任务上的表现。
- LAG与其他基于反射的记忆机制不同,它直接重用先前的推理和计算。
- LAG侧重于提高准确性,与其他主要关注效率提升的KV缓存技术形成对比。
点此查看论文截图




DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning
Authors:Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, Xing Yu
Large Vision-Language Models (VLMs) have shown strong capabilities in multimodal understanding and reasoning, yet they are primarily constrained by text-based reasoning processes. However, achieving seamless integration of visual and textual reasoning which mirrors human cognitive processes remains a significant challenge. In particular, effectively incorporating advanced visual input processing into reasoning mechanisms is still an open question. Thus, in this paper, we explore the interleaved multimodal reasoning paradigm and introduce DeepEyes, a model with “thinking with images” capabilities incentivized through end-to-end reinforcement learning without the need for cold-start SFT. Notably, this ability emerges natively within the model itself, leveraging its inherent grounding ability as a tool instead of depending on separate specialized models. Specifically, we propose a tool-use-oriented data selection mechanism and a reward strategy to encourage successful tool-assisted reasoning trajectories. DeepEyes achieves significant performance gains on fine-grained perception and reasoning benchmarks and also demonstrates improvement in grounding, hallucination, and mathematical reasoning tasks. Interestingly, we observe the distinct evolution of tool-calling behavior from initial exploration to efficient and accurate exploitation, and diverse thinking patterns that closely mirror human visual reasoning processes. Code is available at https://github.com/Visual-Agent/DeepEyes.
大型视觉语言模型(VLMs)在多模态理解和推理方面表现出强大的能力,但它们主要受到基于文本的推理过程的限制。然而,实现无缝的视觉和文本推理集成,以反映人类的认知过程仍然是一个巨大的挑战。特别是,如何有效地将先进的视觉输入处理融入推理机制仍然是一个悬而未决的问题。因此,在本文中,我们探索了交织的多模态推理范式,并引入了DeepEyes模型,该模型具有“以图像思考”的能力,通过端到端的强化学习进行激励,无需冷启动SFT。值得注意的是,这种能力是在模型本身内部自然产生的,它利用模型本身的内在定位能力作为工具,而不是依赖于单独的专用模型。具体来说,我们提出了一种以工具使用为导向的数据选择机制和奖励策略,以鼓励成功的工具辅助推理轨迹。DeepEyes在精细感知和推理基准测试上实现了显著的性能提升,并在定位、幻觉和数学推理任务中展示了改进。有趣的是,我们观察到从初步探索到有效和精确利用的工具调用行为的独特演变,以及紧密模仿人类视觉推理过程的多样化思维模式。代码可在https://github.com/Visual-Agent/DeepEyes获得。
论文及项目相关链接
Summary
本文探索了交织式多模态推理范式,并引入了DeepEyes模型。该模型具有“以图像思考”的能力,通过端到端的强化学习实现,无需冷启动SFT。DeepEyes利用自身内在的视觉定位(grounding)能力作为工具,不依赖单独的专门模型,在精细感知和推理基准测试上取得了显著的性能提升,并在视觉定位、幻觉和数学推理任务上有所改进。我们观察到工具调用行为的独特演变,从初步探索到高效准确的利用,以及紧密模仿人类视觉推理过程的多样化思维模式。
Key Takeaways
- Large Vision-Language Models (VLMs) 展现出强大的多模态理解和推理能力,但主要受限于文本基础的推理过程。
- 实现视觉和文本推理无缝集成以模拟人类认知过程是一大挑战。
- DeepEyes模型具有“以图像思考”的能力,通过端到端的强化学习实现,无需冷启动SFT。
- DeepEyes利用自身内在的视觉定位(grounding)能力作为工具,不依赖单独的专门模型。
- DeepEyes在精细感知和推理基准测试上实现了性能提升,并在视觉定位、幻觉和数学推理任务上有所改进。
- 工具调用行为的独特演变从初步探索到高效准确的利用被观察到。
点此查看论文截图




A MIND for Reasoning: Meta-learning for In-context Deduction
Authors:Leonardo Bertolazzi, Manuel Vargas Guzmán, Raffaella Bernardi, Maciej Malicki, Jakub Szymanik
Large language models (LLMs) are increasingly evaluated on formal tasks, where strong reasoning abilities define the state of the art. However, their ability to generalize to out-of-distribution problems remains limited. In this paper, we investigate how LLMs can achieve a systematic understanding of deductive rules. Our focus is on the task of identifying the appropriate subset of premises within a knowledge base needed to derive a given hypothesis. To tackle this challenge, we propose Meta-learning for In-context Deduction (MIND), a novel few-shot meta-learning fine-tuning approach. The goal of MIND is to enable models to generalize more effectively to unseen knowledge bases and to systematically apply inference rules. Our results show that MIND significantly improves generalization in small LMs ranging from 1.5B to 7B parameters. The benefits are especially pronounced in smaller models and low-data settings. Remarkably, small models fine-tuned with MIND outperform state-of-the-art LLMs, such as GPT-4o and o3-mini, on this task.
大型语言模型(LLM)在正式任务上的评估越来越多,其中强大的推理能力定义了最新技术状态。然而,它们对超出分布范围的问题的泛化能力仍然有限。在本文中,我们研究了LLM如何实现对演绎规则的系统性理解。我们的重点是确定在知识库中用于推导给定假设的适当前提子集的任务。为了应对这一挑战,我们提出了用于上下文内演绎的元学习(MIND),这是一种新型的小样本元学习微调方法。MIND的目标是使模型能够更有效地泛化到未见过的知识库,并系统地应用推理规则。我们的结果表明,MIND在参数范围从1.5B到7B的小型LM中显著提高了泛化能力。在较小的模型和低数据设置下,这些好处尤为突出。值得注意的是,使用MIND进行微调的小型模型在此任务上表现优于最新的大型语言模型(如GPT-4o和o3-mini)。
论文及项目相关链接
Summary
大型语言模型(LLM)在正式任务上的评估日益增加,其推理能力成为衡量技术先进性的关键指标。然而,它们对分布外问题的泛化能力仍然有限。本文探讨LLM如何系统地理解演绎规则,研究焦点在于识别知识库中推导给定假设所需前提子集的任务。为应对这一挑战,我们提出了名为Meta-learning for In-context Deduction(MIND)的新型少样本元学习微调方法。MIND的目标是使模型更有效地泛化到未见过的知识库并系统地应用推理规则。实验结果表明,MIND显著改善了参数规模从1.5B到7B的小型语言模型的泛化能力,在较小模型和低数据设置下效果尤为显著。值得注意的是,经MIND微调的小型模型在这项任务上超越了GPT-4o和o3-mini等最先进的LLM。
Key Takeaways
- 大型语言模型(LLMs)在正式任务上表现出强大的推理能力,但在面对离分布问题时的泛化能力受限。
- 研究关注LLMs如何系统理解演绎规则,特别是识别知识库中推导假设所需的前提子集的任务。
- 提出了一种新型的元学习微调方法——Meta-learning for In-context Deduction(MIND)。
- MIND有助于模型更有效地泛化到未见过的知识库,并系统地应用推理规则。
- MIND对小规模至中等规模的LLM(参数范围从1.5B到7B)的泛化能力有显著改善。
- 在该任务上,经MIND微调的小型模型性能超越了GPT-4o和o3-mini等先进LLM。
点此查看论文截图




MultiTab: A Comprehensive Benchmark Suite for Multi-Dimensional Evaluation in Tabular Domains
Authors:Kyungeun Lee, Moonjung Eo, Hye-Seung Cho, Dongmin Kim, Ye Seul Sim, Seoyoon Kim, Min-Kook Suh, Woohyung Lim
Despite the widespread use of tabular data in real-world applications, most benchmarks rely on average-case metrics, which fail to reveal how model behavior varies across diverse data regimes. To address this, we propose MultiTab, a benchmark suite and evaluation framework for multi-dimensional, data-aware analysis of tabular learning algorithms. Rather than comparing models only in aggregate, MultiTab categorizes 196 publicly available datasets along key data characteristics, including sample size, label imbalance, and feature interaction, and evaluates 13 representative models spanning a range of inductive biases. Our analysis shows that model performance is highly sensitive to such regimes: for example, models using sample-level similarity excel on datasets with large sample sizes or high inter-feature correlation, while models encoding inter-feature dependencies perform best with weakly correlated features. These findings reveal that inductive biases do not always behave as intended, and that regime-aware evaluation is essential for understanding and improving model behavior. MultiTab enables more principled model design and offers practical guidance for selecting models tailored to specific data characteristics. All datasets, code, and optimization logs are publicly available at https://huggingface.co/datasets/LGAI-DILab/Multitab.
尽管表格数据在现实世界的广泛应用,但大多数基准测试仍然依赖于平均情况下的指标,这无法揭示模型行为在不同数据环境下的变化。为了解决这一问题,我们提出了MultiTab,这是一个基准测试套件和评估框架,用于对表格学习算法进行多维度的数据感知分析。MultiTab不仅仅是在总体上对比模型,而是将196个公开数据集按照关键数据特性进行分类,包括样本大小、标签不平衡和特征交互等,并对13种具有代表性的模型进行评估,这些模型涵盖了一系列的归纳偏置。我们的分析表明,模型性能对这些环境非常敏感:例如,使用样本级别相似性的模型在样本量大或特征间高相关的数据集上表现优异,而编码特征间依赖关系的模型在弱相关特征上表现最佳。这些发现表明,归纳偏置并不总是按预期表现,而了解特定数据特性的评估对于理解和改进模型行为至关重要。MultiTab为实现更有原则的模型设计提供了可能,并为针对特定数据特性选择模型提供了实际指导。所有数据集、代码和优化日志均可在https://huggingface.co/datasets/LGAI-DILab/Multitab公开获取。
论文及项目相关链接
PDF Under review
Summary
本文提出MultiTab,一个用于多维数据感知的表格学习算法评估框架。该框架不仅关注平均情况下的模型性能,更注重模型在不同数据特性下的表现差异。通过分类196个公开数据集和评估13种代表性模型,研究发现模型性能对数据特性高度敏感。MultiTab的推出,有助于更系统地设计模型,并为针对特定数据特性的模型选择提供实际指导。
Key Takeaways
- 当前大多数基准测试主要关注平均情况下的模型性能,忽略了模型在不同数据特性下的表现差异。
- MultiTab是一个新的评估框架,旨在进行多维数据感知的表格学习算法分析。
- 该框架分类了196个公开数据集,并评估了13种代表性模型,涵盖多种归纳偏置。
- 模型性能对数据特性高度敏感,例如样本大小、标签不平衡和特征交互等。
- 某些模型在特定数据特性下表现优异,例如样本级别相似性在大样本或高特征相关性数据集上表现较好。
- 归纳偏置并不总是按预期发挥作用,需要进行感知数据特性的(regime-aware)评估,才能理解并改进模型行为。
点此查看论文截图



