⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-09-28 更新
SciReasoner: Laying the Scientific Reasoning Ground Across Disciplines
Authors:Yizhou Wang, Chen Tang, Han Deng, Jiabei Xiao, Jiaqi Liu, Jianyu Wu, Jun Yao, Pengze Li, Encheng Su, Lintao Wang, Guohang Zhuang, Yuchen Ren, Ben Fei, Ming Hu, Xin Chen, Dongzhan Zhou, Junjun He, Xiangyu Yue, Zhenfei Yin, Jiamin Wu, Qihao Zheng, Yuhao Zhou, Huihui Xu, Chenglong Ma, Yan Lu, Wenlong Zhang, Chunfeng Song, Philip Torr, Shixiang Tang, Xinzhu Ma, Wanli Ouyang, Lei Bai
We present a scientific reasoning foundation model that aligns natural language with heterogeneous scientific representations. The model is pretrained on a 206B-token corpus spanning scientific text, pure sequences, and sequence-text pairs, then aligned via SFT on 40M instructions, annealed cold-start bootstrapping to elicit long-form chain-of-thought, and reinforcement learning with task-specific reward shaping, which instills deliberate scientific reasoning. It supports four capability families, covering up to 103 tasks across workflows: (i) faithful translation between text and scientific formats, (ii) text/knowledge extraction, (iii) property prediction, (iv) property classification, (v) unconditional and conditional sequence generation and design. Compared with specialist systems, our approach broadens instruction coverage, improves cross-domain generalization, and enhances fidelity. We detail data curation and training and show that cross-discipline learning strengthens transfer and downstream reliability. The model, instruct tuning datasets and the evaluation code are open-sourced at https://huggingface.co/SciReason and https://github.com/open-sciencelab/SciReason.
我们提出了一种科学推理基础模型,该模型将自然语言与异质科学表示进行对齐。该模型在包含科学文本、纯序列和序列文本对的206B标记语料库上进行预训练,然后通过40M指令进行SFT对齐,采用退火冷启动引导来激发长形式思维链,并使用针对任务的奖励塑形进行强化学习,从而灌输有意识的科学推理。它支持四个能力家族,涵盖103个任务的工作流程:(i)文本和科学格式之间的忠实翻译,(ii)文本/知识提取,(iii)属性预测,(iv)属性分类,(v)无条件和有条件序列生成和设计。与专用系统相比,我们的方法扩大了指令覆盖范围,提高了跨域泛化能力,并增强了保真度。我们详细描述了数据整理和训练,并表明跨学科学习加强了迁移和下游可靠性。模型、指令调整数据集和评估代码已在https://huggingface.co/SciReason和https://github.com/open-sciencelab/SciReason开源。
论文及项目相关链接
PDF technical report
Summary
该文章介绍了一个科学推理基础模型,该模型将自然语言与异质科学表示对齐。模型在包含科学文本、纯序列和序列文本对的206B标记语料库上进行预训练,然后通过SFT对齐的40M指令进行微调,采用冷启动引导法激发长形式的思维链,并通过任务特定的奖励塑形强化学习,培养了有意的科学推理能力。该模型支持四大能力领域,涵盖103项任务和工作流程,包括文本与科学格式之间的忠实翻译、文本/知识提取、属性预测、属性分类、无条件和有条件的序列生成和设计等。与专项系统相比,该方法扩大了指令覆盖面,提高了跨域泛化能力,并提高了保真度。文章详细描述了数据整理和训练过程,并证明跨学科学习能加强迁移和下游可靠性。模型和评估代码已在Hugging Face和GitHub上开源。
Key Takeaways
- 模型将自然语言与异质科学表示对齐。
- 模型在多种科学文本、纯序列和序列文本对上进行预训练。
- 使用SFT对齐指令进行微调,激发长形式的思维链。
- 强化学习通过任务特定奖励塑形培养科学推理能力。
- 模型支持多种任务和工作流程,包括翻译、提取、预测、分类以及序列生成和设计。
- 与其他系统相比,该模型扩大了指令覆盖面,提高了泛化能力和保真度。
点此查看论文截图



SAGE: A Realistic Benchmark for Semantic Understanding
Authors:Samarth Goel, Reagan J. Lee, Kannan Ramchandran
As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI’s text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI’s text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
随着大型语言模型(LLM)在传统基准测试上表现出强大的性能,我们急需更具挑战性的评估框架来深入探索语义理解的更深层次方面。我们引入了SAGE(语义对齐与泛化评估),这是一个严格的基准测试,旨在评估嵌入模型和相似性度量在五个类别中的表现:人类偏好对齐、转换鲁棒性、信息敏感性、聚类性能和检索鲁棒性。与现有的主要关注孤立能力的基准测试不同,SAGE通过对抗性条件、嘈杂的转换和微妙的人类判断任务在30多个数据集上评估语义理解。我们对9种嵌入模型和经典指标的全面评估显示,存在显著的性能差距,没有任何单一的方法在所有维度上都表现出卓越的性能。例如,虽然最先进的嵌入模型(如OpenAI的text-embedding-3-large)在人类偏好对齐方面表现出色(得分为0.682,而最佳经典指标得分为0.591),但在信息敏感性任务上却被经典指标显著超越:Jaccard相似度达到0.905,而表现最佳的嵌入模型仅为0.794。SAGE进一步揭示了关键权衡:OpenAI的text-embedding-3-small虽然聚类性能最高(0.483),但在鲁棒性方面却表现出极端的脆弱性(得分最低,仅为0.011)。SAGE揭示了当前语义理解能力的关键局限性,并为现实世界的部署提供了更现实的模型鲁棒性评估。
论文及项目相关链接
PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Summary
本文介绍了一个名为SAGE的新评估框架,用于评估嵌入模型和相似性度量在五个类别中的表现,包括人类偏好对齐、转换鲁棒性、信息敏感性、聚类性能和检索鲁棒性。SAGE通过跨超过30个数据集的对抗条件、噪声转换和微妙的人类判断任务来评估语义理解。对现有嵌入模型和经典指标的全面评估显示,各方法之间存在显著性能差距,没有单一方法在所有维度上都表现出卓越性能。SAGE揭示了关键权衡,并暴露了当前语义理解的局限性,为现实世界的模型部署提供了更现实的评估。
Key Takeaways
- SAGE是一个用于评估嵌入模型和相似性度量的新评估框架。
- SAGE包括五个类别:人类偏好对齐、转换鲁棒性、信息敏感性、聚类性能和检索鲁棒性。
- SAGE使用超过30个数据集进行对抗条件、噪声转换和微妙的判断任务来评估语义理解。
- 现有嵌入模型和经典指标在SAGE评估中存在显著性能差距。
- 没有单一方法在所有维度上都表现出卓越性能。
- SAGE揭示了关键权衡,例如某些模型在某些任务上的优秀表现与其他任务上的脆弱性。
点此查看论文截图
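下面用一个极简的Python示意来说明SAGE中对比的两类相似度度量:基于词集合的经典指标(如Jaccard相似度,对数值被篡改等信息敏感性扰动反应明显)与基于向量的嵌入余弦相似度。示例中的句子、分词方式均为说明性假设,SAGE的实际评测协议以论文为准。

```python
import math

def jaccard_similarity(a: str, b: str) -> float:
    """经典指标:基于词集合的Jaccard相似度(SAGE中信息敏感性任务上表现最好的经典指标之一)。"""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine_similarity(u, v) -> float:
    """嵌入式指标:两个向量的余弦相似度(嵌入模型通常以此比较语义)。"""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

# 示例:一处细微改动(数字被篡改)在词集合层面差异明显
s1 = "the model achieves 0.905 accuracy on the benchmark"
s2 = "the model achieves 0.509 accuracy on the benchmark"
print(jaccard_similarity(s1, s2))
# 若已有两段文本的嵌入向量 emb1、emb2(此处为假设),可再比较 cosine_similarity(emb1, emb2)
```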


MMR1: Enhancing Multimodal Reasoning with Variance-Aware Sampling and Open Resources
Authors:Sicong Leng, Jing Wang, Jiaxi Li, Hao Zhang, Zhiqiang Hu, Boqiang Zhang, Yuming Jiang, Hang Zhang, Xin Li, Lidong Bing, Deli Zhao, Wei Lu, Yu Rong, Aixin Sun, Shijian Lu
Large multimodal reasoning models have achieved rapid progress, but their advancement is constrained by two major limitations: the absence of open, large-scale, high-quality long chain-of-thought (CoT) data, and the instability of reinforcement learning (RL) algorithms in post-training. Group Relative Policy Optimization (GRPO), the standard framework for RL fine-tuning, is prone to gradient vanishing when reward variance is low, which weakens optimization signals and impairs convergence. This work makes three contributions: (1) We propose Variance-Aware Sampling (VAS), a data selection strategy guided by Variance Promotion Score (VPS) that combines outcome variance and trajectory diversity to promote reward variance and stabilize policy optimization. (2) We release large-scale, carefully curated resources containing ~1.6M long CoT cold-start data and ~15k RL QA pairs, designed to ensure quality, difficulty, and diversity, along with a fully reproducible end-to-end training codebase. (3) We open-source a family of multimodal reasoning models in multiple scales, establishing standardized baselines for the community. Experiments across mathematical reasoning benchmarks demonstrate the effectiveness of both the curated data and the proposed VAS. Comprehensive ablation studies and analyses provide further insight into the contributions of each component. In addition, we theoretically establish that reward variance lower-bounds the expected policy gradient magnitude, with VAS serving as a practical mechanism to realize this guarantee. Our code, data, and checkpoints are available at https://github.com/LengSicong/MMR1.
大型多模态推理模型已取得快速进展,但其发展仍受到两个主要限制:缺乏开放的大规模高质量长思维链(CoT)数据,以及后训练阶段强化学习(RL)算法的不稳定性。作为RL微调的标准框架,组相对策略优化(GRPO)在奖励方差较低时容易出现梯度消失,这会削弱优化信号并损害收敛。本文有以下三个贡献:(1)我们提出了方差感知采样(VAS),这是一种由方差促进分数(VPS)引导的数据选择策略,结合结果方差和轨迹多样性来提升奖励方差并稳定策略优化。(2)我们发布了大规模精心策划的资源,包含约160万条长思维链冷启动数据和约1.5万条RL问答对,旨在确保质量、难度和多样性,并附带一个可完全复现的端到端训练代码库。(3)我们开源了多个规模的多模态推理模型家族,为社区建立标准化基线。在多个数学推理基准测试上的实验验证了精选数据和所提出VAS的有效性,全面的消融研究和分析进一步揭示了每个组件的贡献。此外,我们从理论上证明了奖励方差为预期策略梯度幅度提供了下界,而VAS则是实现这一保证的实用机制。我们的代码、数据和检查点可在https://github.com/LengSicong/MMR1获取。
论文及项目相关链接
Summary
大型多模态推理模型的进步受到开放数据缺失和强化学习算法不稳定两大限制。本文提出Variance-Aware Sampling(VAS)策略,通过结合结果方差和轨迹多样性来提升奖励方差并稳定策略优化;同时开源大规模高质量的长思维链冷启动数据与RL问答对,并发布多个规模的多模态推理模型作为标准化基线。实验证明了精选数据和VAS方法的有效性。理论上还证明了奖励方差为预期策略梯度幅度提供下界,而VAS是实现这一保证的实用机制。相关资源和代码已开源。
Key Takeaways
- 大型多模态推理模型面临开放数据缺失和强化学习算法不稳定两大挑战。
- VAS策略结合结果方差和轨迹多样性以促进奖励方差并稳定策略优化。
- 开放大规模高质量的长链思维数据资源。
- 建立标准化基线模型并开源一系列多模态推理模型。
- 实验证明方法和数据的有效性。
- 理论上证明了奖励方差为预期策略梯度幅度提供下界。
点此查看论文截图
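下面是按摘要思路写的一个方差感知采样(VAS)的简化示意:对每个提示的一组rollout计算“结果方差+轨迹多样性”的打分,并在固定预算内优先保留高分提示,以避免GRPO在奖励方差过低时梯度消失。其中的打分公式、权重alpha与多样性代理均为本文之外的假设,VPS的确切定义请以论文为准。

```python
import statistics

def variance_promotion_score(rewards, trajectories, alpha=0.5):
    """示意性的 VPS:结合各 rollout 的结果(奖励)方差与轨迹多样性。
    公式与权重 alpha 均为假设,论文中的定义以原文为准。"""
    outcome_var = statistics.pvariance(rewards) if len(rewards) > 1 else 0.0
    diversity = len(set(trajectories)) / max(len(trajectories), 1)  # 去重比例作为多样性的粗略代理
    return alpha * outcome_var + (1 - alpha) * diversity

def select_prompts(prompt_rollouts, k):
    """在固定预算下优先选择 VPS 高的提示,以促进奖励方差、稳定策略优化。"""
    scored = [
        (variance_promotion_score([r for r, _ in rollouts], [t for _, t in rollouts]), prompt)
        for prompt, rollouts in prompt_rollouts.items()
    ]
    scored.sort(reverse=True)
    return [p for _, p in scored[:k]]

# 用法示意:每个提示对应若干 (reward, trajectory) 的 rollout
data = {
    "p1": [(1.0, "a"), (0.0, "b"), (1.0, "c")],   # 奖励有方差,优先保留
    "p2": [(1.0, "a"), (1.0, "a"), (1.0, "a")],   # 全对且同质,VPS 低
}
print(select_prompts(data, k=1))
```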




Instruction-tuned Self-Questioning Framework for Multimodal Reasoning
Authors:You-Won Jang, Yu-Jung Heo, Jaeseok Kim, Minsu Lee, Du-Seong Chang, Byoung-Tak Zhang
The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models~(LLMs). However, it still needs help with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, there are disadvantages such as 1) the fine-grained visual contents of images are not available using LLMs that cannot read visual information, 2) internal mechanisms are inaccessible and difficult to reproduce by using black-box LLMs. To solve these problems, we propose the SQ (Self-Questioning)-InstructBLIP, which improves inference performance by generating image-aware informative sub-questions and sub-answers iteratively. The SQ-InstructBLIP, which consists of a Questioner, Answerer, and Reasoner that share the same architecture. Questioner and Answerer generate sub-questions and sub-answers to help infer the main-question, and Reasoner performs reasoning on the main-question considering the generated sub-question information. Our experiments show that the proposed method SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than the previous works.
得益于大型语言模型(LLMs)的发展,视觉语言理解领域近年来得到了广泛研究。然而,即使面对非常简单的问题,该领域在需要多步推理的任务上仍存在困难。最近的研究采用LLMs,通过迭代生成子问题和子答案来解决这一问题,但这类方法存在一些缺点,例如:1)LLMs无法读取视觉信息,因此无法获取图像的细粒度视觉内容;2)黑箱LLMs的内部机制无法访问且难以复现。为了解决这些问题,我们提出了SQ(自我提问)-InstructBLIP,它通过迭代生成图像感知的、信息丰富的子问题和子答案来提高推理性能。SQ-InstructBLIP由共享相同架构的提问者(Questioner)、回答者(Answerer)和推理者(Reasoner)组成:提问者和回答者生成子问题和子答案以帮助推断主问题,推理者则结合生成的子问题信息对主问题进行推理。我们的实验表明,所提出的SQ-InstructBLIP在解决VQA任务时将生成的子问题作为附加信息,其推理比以往工作更加准确。
论文及项目相关链接
PDF This paper was accepted to the “CLVL: 5th Workshop on Closing the Loop Between Vision and Language (ICCV 2023 CLVL workshop).”
Summary
视觉语言理解领域近年来得到了广泛的研究,但仍面临多步推理问题。最新研究采用大型语言模型(LLMs)解决此问题,通过迭代生成子问题和答案。然而,存在缺点如无法读取图像中的精细视觉内容以及黑箱LLMs的内部机制不可访问和难以复制。为解决这些问题,我们提出SQ-InstructBLIP,通过生成图像感知信息子问题和子答案来提高推理性能。SQ-InstructBLIP包括提问者、回答者和推理者,三者共享相同架构。提问者和回答者生成子问题和子答案以辅助推理主问题,而推理者则考虑生成的子问题信息进行推理。实验表明,使用生成的子问题作为解决视觉问答任务时的附加信息,SQ-InstructBLIP表现出比先前工作更准确的推理能力。
Key Takeaways
- 大型语言模型(LLMs)在视觉语言理解领域面临多步推理挑战。
- 最新研究通过迭代生成子问题和答案来解决这一问题。
- LLMs无法读取图像中的精细视觉信息是一个缺点。
- 黑箱LLMs的内部机制不可访问和难以复制也是一大挑战。
- SQ-InstructBLIP通过生成图像感知信息子问题和子答案来提高推理性能。
- SQ-InstructBLIP包括提问者、回答者和推理者三个组成部分,共享相同架构。
点此查看论文截图
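下面给出SQ-InstructBLIP迭代流程的一个最小示意:Questioner与Answerer交替生成与图像相关的子问答,Reasoner再结合这些子问答回答主问题。三个可调用对象在此仅为占位(真实系统由共享同一架构的InstructBLIP模块实现),轮数与提示格式均为假设。

```python
from typing import Callable, List, Tuple

def self_questioning_vqa(
    image,
    main_question: str,
    questioner: Callable,   # (image, main_q, history) -> 子问题
    answerer: Callable,     # (image, sub_q) -> 子答案
    reasoner: Callable,     # (image, main_q, history) -> 最终答案
    num_rounds: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """按摘要描述的自我提问流程示意:迭代生成子问答,再综合推理主问题。"""
    history: List[Tuple[str, str]] = []
    for _ in range(num_rounds):
        sub_q = questioner(image, main_question, history)
        sub_a = answerer(image, sub_q)
        history.append((sub_q, sub_a))
    final_answer = reasoner(image, main_question, history)
    return final_answer, history

# 简单占位用法(三个模块均以 lambda 代替,仅演示控制流)
ans, hist = self_questioning_vqa(
    image=None,
    main_question="图中的人在做什么?",
    questioner=lambda img, q, h: f"子问题{len(h) + 1}:图中有哪些物体?",
    answerer=lambda img, sq: "一个人和一只狗",
    reasoner=lambda img, q, h: "在遛狗",
)
print(ans, hist)
```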




RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models
Authors:Jiyeon Koo, Taewan Cho, Hyunjoon Kang, Eunseom Pyo, Tae Gyun Oh, Taeryang Kim, Andrew Jaeyong Choi
Recent Vision-Language-Action (VLA) models demonstrate remarkable generalization in robotics but are restricted by their substantial size and computational cost, limiting real-world deployment. However, conventional lightweighting methods often sacrifice critical capabilities, particularly spatial reasoning. This creates a trade-off between efficiency and performance. To address this challenge, our work reuses Register Tokens, which were introduced for artifact removal in Vision Transformers but subsequently discarded. We suppose that these tokens contain essential spatial information and propose RetoVLA, a novel architecture that reuses them directly by injecting them into the Action Expert. RetoVLA maintains a lightweight structure while leveraging this repurposed spatial context to enhance reasoning. We demonstrate RetoVLA’s effectiveness through a series of comprehensive experiments. On our custom-built 7-DOF robot arm, the model achieves a 17.1%p absolute improvement in success rates for complex manipulation tasks. Our results confirm that reusing Register Tokens directly enhances spatial reasoning, demonstrating that what was previously discarded as an artifact is in fact a valuable, unexplored resource for robotic intelligence. A video demonstration is available at: https://youtu.be/2CseBR-snZg
最近的视觉-语言-动作(VLA)模型在机器人技术中展现出卓越的泛化能力,但其庞大的体积和计算成本限制了现实世界中的部署。然而,传统的轻量化方法往往会牺牲关键能力,特别是空间推理能力,这使得效率和性能之间存在权衡。为了应对这一挑战,我们的工作重新利用了寄存器令牌(Register Tokens)——它们最初被引入视觉Transformer用于去除伪影,随后即被丢弃。我们假设这些令牌包含重要的空间信息,并提出RetoVLA,这是一种通过将它们直接注入动作专家(Action Expert)来加以复用的新型架构。RetoVLA在保持轻量化结构的同时,利用这种重新利用的空间上下文来增强推理能力。我们通过一系列全面的实验证明了RetoVLA的有效性:在我们自主搭建的7自由度(7-DOF)机械臂上,该模型在复杂操作任务上的成功率获得了17.1个百分点的绝对提升。我们的结果证实,直接复用寄存器令牌可以增强空间推理能力,表明先前被当作伪影丢弃的东西实际上是机器人智能中一项有价值且尚未被探索的资源。视频演示见:https://youtu.be/2CseBR-snZg
论文及项目相关链接
Summary
近期,视觉语言动作(VLA)模型在机器人领域展现出强大的泛化能力,但其庞大的规模和计算成本限制了实际应用。为解决效率与性能之间的权衡问题,本研究重新利用了在视觉Transformer中用于去除伪特征的寄存器令牌(Register Tokens)。我们假设这些令牌包含重要的空间信息,并提出了一种新的架构RetoVLA,它通过将这些令牌直接注入动作专家(Action Expert)来加以利用。RetoVLA保持轻量级结构,并利用此重新利用的空间上下文增强推理能力。在自定义的7自由度机器人手臂上进行的实验证明,该模型在复杂操作任务上的成功率获得了17.1个百分点的绝对提升。研究结果表明,重新利用寄存器令牌可增强空间推理能力,证明以前被视为伪特征的令牌实际上是机器人智能中未被探索的宝贵资源。
Key Takeaways
- VLA模型在机器人领域具有强大的泛化能力,但存在计算成本高和部署困难的问题。
- 寄存器令牌(Register Tokens)最初用于去除伪特征,在本研究中被重新利用。
- 通过直接注入动作专家(Action Expert),提出新的架构RetoVLA。
- RetoVLA在保持轻量级结构的同时,利用空间上下文增强推理能力。
- 在自定义机器人手臂上的实验证明,RetoVLA在复杂操作任务上的成功率显著提高。
- 重新利用寄存器令牌增强了空间推理能力,证明了其作为机器人智能中未被探索资源的重要性。
点此查看论文截图
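下面是“复用寄存器令牌”这一思路的一个假设性PyTorch示意:从ViT输出中取出寄存器令牌,经线性投影后拼接进动作专家的输入序列。张量维度、令牌排布与融合方式都是为演示而作的假设,并非论文的具体实现。

```python
import torch
import torch.nn as nn

class RegisterTokenInjector(nn.Module):
    """示意:把 ViT 输出中通常被丢弃的寄存器令牌投影后,与动作专家的输入令牌拼接。
    维度、令牌排布([CLS] + patch + register)与融合方式均为假设。"""
    def __init__(self, vit_dim=768, action_dim=512, num_register_tokens=4):
        super().__init__()
        self.num_register_tokens = num_register_tokens
        self.proj = nn.Linear(vit_dim, action_dim)

    def forward(self, vit_tokens: torch.Tensor, action_inputs: torch.Tensor) -> torch.Tensor:
        # vit_tokens: (B, 1 + N_patch + N_reg, vit_dim),此处假设寄存器令牌位于序列末尾
        reg = vit_tokens[:, -self.num_register_tokens:, :]
        reg = self.proj(reg)                           # (B, N_reg, action_dim)
        return torch.cat([action_inputs, reg], dim=1)  # 注入动作专家的输入序列

# 用法示意
inj = RegisterTokenInjector()
fused = inj(torch.randn(2, 1 + 196 + 4, 768), torch.randn(2, 8, 512))
print(fused.shape)  # torch.Size([2, 12, 512])
```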








Tree Search for LLM Agent Reinforcement Learning
Authors:Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
强化学习(RL)的最新进展显著增强了大型语言模型(LLM)的智能体能力。在长程、多轮的智能体任务中,仅由结果奖励驱动的现有方法经常面临监督稀疏的问题。为了解决这一挑战,我们提出了基于树搜索的分组智能体RL方法——树形组相对策略优化(Tree-GRPO),其中每个树节点代表一个完整的智能体交互步骤。通过共享公共前缀,树搜索采样增加了在固定的令牌或工具调用预算内可获得的rollout数量。此外,我们发现树状轨迹天然地允许仅凭结果奖励就构建出逐步的过程监督信号。基于此,Tree-GRPO在树内和树间两个层面估计分组相对优势。通过理论分析,我们证明了树内层面的组相对策略优化目标等价于步骤级别的直接偏好学习目标。在11个数据集和3类问答任务上的实验证明了所提出的基于树的RL方法优于基于链的RL方法。
论文及项目相关链接
Summary
强化学习在大型语言模型中的应用显著提升其在长期和多轮任务中的代理能力。针对仅依赖结果奖励带来的稀疏监督问题,本文提出了基于树搜索的树结构群体相对策略优化(Tree-GRPO)方法。树结构可以有效构建步骤级的监督信号,利用结果奖励进行组内相对优势估计。实验结果显示,树结构的强化学习相较于链式强化学习方法更具优势。
Key Takeaways
- 强化学习在大型语言模型中的应用增强了其在长期和多轮任务中的代理能力。
- Tree-GRPO是一种基于树搜索的群体相对策略优化方法,解决了仅依赖结果奖励带来的稀疏监督问题。
- 树结构可以有效构建步骤级的监督信号。
- Tree-GRPO通过估计组内相对优势来解决稀疏监督问题。
- 树结构的强化学习相较于链式强化学习方法具有优势。
- Tree-GRPO在理论分析与实验验证中都表现出其有效性。
点此查看论文截图
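下面用一个小例子示意“树内分组相对优势”的直觉:同一父节点下的各分支共享前缀,仅凭叶子上的结果奖励,就能通过分支间的相对比较为中间步骤提供过程信号。节点结构与归一化方式为简化假设,Tree-GRPO的精确公式以论文为准。

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """树中每个节点代表一个完整的智能体交互步骤;叶子携带结果奖励。"""
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None   # 仅叶子(完整轨迹终点)有结果奖励

def subtree_rewards(node: Node) -> List[float]:
    if node.reward is not None and not node.children:
        return [node.reward]
    out: List[float] = []
    for c in node.children:
        out.extend(subtree_rewards(c))
    return out

def intra_tree_advantages(root: Node) -> List[float]:
    """树内相对优势示意:同一父节点下,用各分支(子树)平均结果奖励减去兄弟分支的均值,
    即仅凭结果奖励也能得到逐步的过程信号(具体归一化方式以论文为准)。"""
    advs = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) > 1:
            branch_means = [sum(r := subtree_rewards(c)) / len(r) for c in node.children]
            group_mean = sum(branch_means) / len(branch_means)
            advs.extend(m - group_mean for m in branch_means)
        stack.extend(node.children)
    return advs

# 用法示意:根节点下两条分支共享前缀,叶子奖励不同
root = Node(children=[Node(reward=1.0), Node(children=[Node(reward=0.0), Node(reward=1.0)])])
print(intra_tree_advantages(root))
```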



A Fano-Style Accuracy Upper Bound for LLM Single-Pass Reasoning in Multi-Hop QA
Authors:Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, Xiuying Chen
Multi-Hop Question Answering (MHQA) requires integrating dispersed, interdependent evidence through sequential reasoning under noise. This task is challenging for LLMs as they have a finite per-pass output capacity, beyond which the integration of task-relevant evidence proves unreliable. Consequently, the single-pass reasoning paradigm is inherently vulnerable to this capacity overflow. To formalize this bottleneck, our analysis establishes a Fano-style accuracy upper bound, defining a theoretical performance ceiling for single-pass LLMs. This bound reveals that accuracy inevitably collapses once task complexity exceeds model capacity, providing general principles for capacity-aware representation and structuring of MHQA in LLMs. Building on these principles, we introduce a proof-of-concept multi-call framework for MHQA, InfoQA. It ensures high per-step accuracy by combining capacity-aware task decomposition with active pruning of prior reasoning traces, keeping the information load within the single-pass limit. It further achieves robustness by a dependency-explicit workflow that enables precise control over the reasoning path. We construct a stringent and noise-rich benchmark to validate our theory and framework. Experimental results show that model behavior aligns with our predicted capacity curves while InfoQA achieves consistent performance improvements. We hope our work inspires more LLM multi-step reasoning methods: \faGithub \href{https://github.com/KaiyangWan/InfoQA}{InfoQA}.
多跳问答(MHQA)需要在噪声环境下通过顺序推理整合分散且相互依赖的证据。这项任务对大型语言模型(LLM)而言具有挑战性,因为它们每次推理的输出容量是有限的,一旦超出该容量,对任务相关证据的整合就会变得不可靠。因此,单遍推理范式天然容易受到这种容量溢出的影响。为了形式化这一瓶颈,我们的分析建立了一个Fano式的准确率上界,为单遍推理的LLM定义了理论性能天花板。该上界表明,一旦任务复杂度超过模型容量,准确率就不可避免地崩溃,从而为LLM中MHQA的容量感知表示与结构化提供了一般原则。基于这些原则,我们提出了一个用于MHQA的概念验证式多次调用框架InfoQA。它将容量感知的任务分解与对先前推理痕迹的主动剪枝相结合,把信息负载控制在单遍推理的限制之内,从而确保每一步的高准确率;并通过依赖关系显式化的工作流实现对推理路径的精确控制,进一步获得稳健性。我们构建了一个严格且富含噪声的基准来验证我们的理论和框架。实验结果表明,模型行为与我们预测的容量曲线一致,同时InfoQA取得了一致的性能提升。我们希望这项工作能激发更多LLM多步推理方法的研究:InfoQA,https://github.com/KaiyangWan/InfoQA。
论文及项目相关链接
PDF 21 pages, 6 figures
Summary
本文探讨了多跳问答(MHQA)的挑战,指出大型语言模型(LLMs)在整合分散、相互依赖的证据时存在容量限制的问题。文章提出了一个理论性能上限的Fano风格精度边界,揭示了当任务复杂度超过模型容量时,准确性不可避免地会崩溃。为解决这一问题,文章提出了一个多调用框架InfoQA,它通过容量感知任务分解和主动剪除先前推理痕迹的方式,确保每一步的高精度,同时通过依赖明确的工作流程实现推理路径的精确控制。实验结果表明,InfoQA实现了性能改进,并与预测容量曲线一致。
Key Takeaways
- 多跳问答(MHQA)要求整合分散、相互依赖的证据,这一任务对大型语言模型(LLMs)具有挑战性。
- LLMs在整合任务相关证据时存在容量限制,超出容量范围后,单通道推理模式容易出错。
- 文章建立了Fano风格的精度上限,定义了单通道LLMs的理论性能天花板。
- 当任务复杂度超过模型容量时,准确性会不可避免地崩溃。
- InfoQA框架通过容量感知任务分解和主动剪除先前推理痕迹的方式,确保高精度和多步骤推理的稳健性。
- InfoQA实现了性能改进,并通过严格的噪声丰富的基准测试验证了其理论和框架的有效性。
点此查看论文截图
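作为参考,下面给出论文“Fano式上界”所依据的经典Fano不等式,以及由它导出的准确率上界形式(推导只用到二元熵不超过1比特、log(|X|-1)≤log|X|这两步放缩);论文中针对单遍推理容量的具体上界表达式请以原文为准。

```latex
% 经典 Fano 不等式(以 2 为底的对数,H_b 为二元熵):
H(X \mid Y) \;\le\; H_b(P_e) + P_e \log_2\!\bigl(|\mathcal{X}| - 1\bigr)
% 利用 H_b(P_e) \le 1 与 \log_2(|\mathcal{X}|-1) \le \log_2|\mathcal{X}| 可得准确率上界:
\mathrm{Acc} \;=\; 1 - P_e \;\le\; 1 - \frac{H(X \mid Y) - 1}{\log_2 |\mathcal{X}|}
```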




Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
Authors:Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, Di Jin
Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden “tool tax” of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity’s Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy – the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
大型语言模型(LLM)最近在科学推理方面取得了显著进展,但仍存在两个主要瓶颈。首先,显式检索会使推理碎片化,带来由额外令牌和步骤构成的隐藏“工具税”。其次,多智能体流水线往往通过对所有候选解取平均而稀释了优秀解。我们用一个结合隐式检索和结构化协作的统一框架来应对这些挑战。在其基础层,一个基于监视器(Monitor)的检索模块在令牌层面运行,以对推理干扰最小的方式整合外部知识。在此之上,分层解精炼(HSR)迭代地将每个候选解指定为锚点,由其他候选解对其进行修复,而质量感知迭代推理(QAIR)则根据解的质量自适应地调整精炼过程。在Humanity's Last Exam(HLE)Bio/Chem Gold上,我们的框架达到了48.3%的准确率——迄今报道的最高水平,比最强的智能体基线高出13.4个百分点,并领先前沿LLM最多18.1个百分点,同时将令牌用量减少53.5%、智能体步骤减少43.7%。在SuperGPQA和TRQA上的结果证实了其跨领域的稳健性。错误分析表明,在超过85%的案例中推理失败与知识缺口同时出现;多样性分析则揭示了明显的二分现象:检索类任务受益于解的多样性,而推理类任务更倾向于共识。这些发现共同表明,隐式增强和结构化精炼能够克服显式工具使用和统一聚合带来的低效。代码可在https://github.com/tangxiangru/Eigen-1获取。
论文及项目相关链接
Summary
大型语言模型在科学推理方面取得显著进展,但仍面临两个主要瓶颈:显式检索片段化推理和多代理管道平均方案。研究团队提出一个统一框架,结合隐性检索和结构化协作来解决这些问题。该框架包括基于监视器的检索模块和分层解决方案细化,能够在最少干扰的情况下整合外部知识。在Humanity’s Last Exam生物/化学金牌任务上,该框架取得了48.3%的最高准确率,领先于最强的代理基准点和前沿的大型语言模型。同时,它减少了53.5%的令牌使用量和43.7%的代理步骤。代码已公开在GitHub上。
Key Takeaways
- 大型语言模型在科学推理方面表现出强大的能力,但仍面临两个主要挑战:显式检索的瓶颈和多代理管道的平均方案问题。
- 研究提出了一种统一框架,结合隐性检索和结构化协作,以提高语言模型在推理任务上的效率。
- 该框架包括基于监视器的检索模块和分层解决方案细化,能最小化外部知识与推理的干扰。
- 在Humanity’s Last Exam生物/化学金牌任务上,该框架取得了显著成果,达到48.3%的准确率,超越其他模型。
- 与其他模型相比,该框架减少了令牌使用量和代理步骤。
- 错误分析显示,推理失败和知识差距在超过85%的情况下同时发生。
点此查看论文截图





Who’s Laughing Now? An Overview of Computational Humour Generation and Explanation
Authors:Tyler Loakman, William Thorne, Chenghua Lin
The creation and perception of humour is a fundamental human trait, positioning its computational understanding as one of the most challenging tasks in natural language processing (NLP). As an abstract, creative, and frequently context-dependent construct, humour requires extensive reasoning to understand and create, making it a pertinent task for assessing the common-sense knowledge and reasoning abilities of modern large language models (LLMs). In this work, we survey the landscape of computational humour as it pertains to the generative tasks of creation and explanation. We observe that, despite the task of understanding humour bearing all the hallmarks of a foundational NLP task, work on generating and explaining humour beyond puns remains sparse, while state-of-the-art models continue to fall short of human capabilities. We bookend our literature survey by motivating the importance of computational humour processing as a subdiscipline of NLP and presenting an extensive discussion of future directions for research in the area that takes into account the subjective and ethically ambiguous nature of humour.
幽默的创造与感知是人类的基本特质,这使得对幽默的计算理解成为自然语言处理(NLP)中最具挑战性的任务之一。幽默是一个抽象的、富有创造性的、且往往依赖语境的构念,理解和创造幽默都需要大量推理,因此它是评估现代大型语言模型(LLM)常识知识和推理能力的一项贴切任务。在这项工作中,我们综述了计算幽默在生成类任务(创作与解释)方面的研究现状。我们观察到,尽管幽默理解具备基础性NLP任务的全部特征,但在双关语之外的幽默生成与解释研究仍然稀少,而最先进的模型也依然达不到人类水平。在文献综述的首尾,我们分别论证了计算幽默处理作为NLP子学科的重要性,并就该领域未来的研究方向进行了广泛讨论,其中充分考虑了幽默的主观性和伦理上的模糊性。
论文及项目相关链接
PDF Accepted to INLG 2025
Summary
本文探讨了计算幽默在自然语言处理领域的重要性及其生成任务中的挑战。虽然幽默理解是NLP的基础任务之一,但在生成和解释幽默方面,尤其是超越双关语的研究仍然稀缺。当前最先进的模型仍然无法与人类的能力相提并论。本文还强调了计算幽默处理作为NLP子学科的重要性,并讨论了考虑到幽默的主观性和伦理模糊性特征的未来研究方向。
Key Takeaways
- 幽默是人类的基本特质之一,其计算理解在自然语言处理(NLP)中是极具挑战性的任务。
- 生成和解释幽默是计算幽默的两个核心任务,目前相关工作仍较为稀缺。
- 尽管存在许多先进模型,但在生成和解释幽默方面,它们仍无法与人类的能力相匹配。
- 幽默的理解需要广泛的推理能力,这使其成为评估现代大型语言模型(LLM)常识知识和推理能力的恰当任务。
- 计算幽默处理是NLP的一个重要子领域。
- 未来的研究方向需要考虑到幽默的主观性和伦理模糊性。
点此查看论文截图


Fine-Tuning LLMs to Analyze Multiple Dimensions of Code Review: A Maximum Entropy Regulated Long Chain-of-Thought Approach
Authors:Yongda Yu, Guohao Shi, Xianwei Wu, Haochuan He, XueMing Gu, Qianqian Zhao, Kui Liu, Qiushi Wang, Zhao Tian, Haifeng Shen, Guoping Rong
Large Language Models (LLMs) have shown great potential in supporting automated code review due to their impressive capabilities in context understanding and reasoning. However, these capabilities are still limited compared to human-level cognition because they are heavily influenced by the training data. Recent research has demonstrated significantly improved performance through fine-tuning LLMs with code review data. However, compared to human reviewers who often simultaneously analyze multiple dimensions of code review to better identify issues, the full potential of these methods is hampered by the limited or vague information used to fine-tune the models. This paper contributes MelcotCR, a chain-of-thought (COT) fine-tuning approach that trains LLMs with an impressive reasoning ability to analyze multiple dimensions of code review by harnessing long COT techniques to provide rich structured information. To address context loss and reasoning logic loss issues that frequently occur when LLMs process long COT prompts, we propose a solution that combines the Maximum Entropy (ME) modeling principle with pre-defined reasoning pathways in MelcotCR to enable more effective utilization of in-context knowledge within long COT prompts while strengthening the logical tightness of the reasoning process. Empirical evaluations on our curated MelcotCR dataset and the public CodeReviewer dataset reveal that a low-parameter base model, such as 14B Qwen2.5, fine-tuned with MelcotCR can surpass state-of-the-art methods in terms of the accuracy of detecting and describing code issues, with its performance remarkably on par with that of the 671B DeepSeek-R1 model.
大型语言模型(LLM)凭借其在上下文理解和推理方面令人印象深刻的能力,在支持自动化代码审查方面显示出巨大潜力。然而,由于深受训练数据的影响,这些能力与人类水平的认知相比仍然有限。最近的研究表明,利用代码审查数据对LLM进行微调可以显著提高性能。但与通常会同时分析代码审查多个维度以更好发现问题的人类评审者相比,用于微调模型的信息有限或模糊,限制了这些方法发挥全部潜力。本文提出了MelcotCR,一种思维链(COT)微调方法,它利用长COT技术提供丰富的结构化信息,训练出具备强大推理能力、能够分析代码审查多个维度的LLM。为了解决LLM处理长COT提示时经常出现的上下文丢失和推理逻辑丢失问题,我们提出将最大熵(ME)建模原理与MelcotCR中预先定义的推理路径相结合,从而更有效地利用长COT提示中的上下文知识,同时增强推理过程的逻辑严密性。在我们构建的MelcotCR数据集和公开的CodeReviewer数据集上的实证评估表明,使用MelcotCR微调的低参数基础模型(如14B的Qwen2.5)在检测和描述代码问题的准确性上可以超越最先进的方法,其性能与671B的DeepSeek-R1模型相当。
论文及项目相关链接
PDF 22 pages
Summary
大型语言模型(LLMs)在代码审查自动化方面展现出巨大潜力,其上下文理解和推理能力令人印象深刻。然而,受限于训练数据,其性能仍低于人类认知。最新研究通过用代码审查数据微调LLMs,提高了其性能。然而,与人类审查者同时分析代码的多个维度以更好地发现问题相比,这些方法仍未能充分发挥潜力。本文提出MelcotCR,一种链式思维(COT)微调方法,通过利用长COT技术训练具有强大推理能力的LLMs来分析代码审查的多个维度。为解决LLMs处理长COT提示时经常出现的上下文丢失和推理逻辑丢失问题,我们结合最大熵建模原理和MelcotCR中的预定义推理路径,更有效地利用长COT提示中的上下文知识,同时加强推理过程的逻辑严谨性。在MelcotCR数据集和公共CodeReviewer数据集上的实证评估表明,使用MelcotCR微调的低参数基础模型,如14B Qwen2.5,在检测描述代码问题方面的准确度可超越现有最先进的模型,其性能与671B DeepSeek-R1模型相当。
Key Takeaways
- 大型语言模型(LLMs)在代码审查自动化方面具有潜力,但受限于训练数据的性能。
- 通过使用代码审查数据微调LLMs可以提高其性能。
- MelcotCR是一种新的微调方法,能让LLMs分析代码审查的多个维度。
- MelcotCR利用长COT技术来提供丰富的结构化信息。
- 为解决上下文和逻辑丢失问题,结合最大熵建模原理和预定义推理路径。
- MelcotCR在检测代码问题方面的性能超越现有模型。
点此查看论文截图



Learning the Wrong Lessons: Syntactic-Domain Spurious Correlations in Language Models
Authors:Chantal Shaib, Vinith M. Suriyakumar, Levent Sagun, Byron C. Wallace, Marzyeh Ghassemi
For an LLM to correctly respond to an instruction it must understand both the semantics and the domain (i.e., subject area) of a given task-instruction pair. However, syntax can also convey implicit information Recent work shows that syntactic templates–frequent sequences of Part-of-Speech (PoS) tags–are prevalent in training data and often appear in model outputs. In this work we characterize syntactic templates, domain, and semantics in task-instruction pairs. We identify cases of spurious correlations between syntax and domain, where models learn to associate a domain with syntax during training; this can sometimes override prompt semantics. Using a synthetic training dataset, we find that the syntactic-domain correlation can lower performance (mean 0.51 +/- 0.06) on entity knowledge tasks in OLMo-2 models (1B-13B). We introduce an evaluation framework to detect this phenomenon in trained models, and show that it occurs on a subset of the FlanV2 dataset in open (OLMo-2-7B; Llama-4-Maverick), and closed (GPT-4o) models. Finally, we present a case study on the implications for safety finetuning, showing that unintended syntactic-domain correlations can be used to bypass refusals in OLMo-2-7B Instruct and GPT-4o. Our findings highlight two needs: (1) to explicitly test for syntactic-domain correlations, and (2) to ensure syntactic diversity in training data, specifically within domains, to prevent such spurious correlations.
大型语言模型(LLM)要正确响应指令,必须同时理解给定任务-指令对的语义和领域(即主题领域)。然而,句法同样可以传递隐含信息。最近的研究表明,句法模板——即词性(PoS)标签的高频序列——在训练数据中普遍存在,并经常出现在模型输出中。在这项工作中,我们刻画了任务-指令对中的句法模板、领域和语义。我们识别出句法与领域之间的虚假相关:模型在训练过程中学会将某一领域与特定句法联系起来,这有时会压过提示本身的语义。利用一个合成训练数据集,我们发现这种句法-领域相关性会降低OLMo-2模型(1B-13B)在实体知识任务上的性能(均值0.51 +/- 0.06)。我们引入了一个评估框架来检测已训练模型中的这种现象,并表明它出现在开源模型(OLMo-2-7B;Llama-4-Maverick)和闭源模型(GPT-4o)处理FlanV2数据集子集的过程中。最后,我们给出了一个关于安全微调影响的案例研究,表明非预期的句法-领域相关性可被用来绕过OLMo-2-7B Instruct和GPT-4o的拒绝回答。我们的发现强调了两点需求:(1)显式测试句法-领域相关性;(2)确保训练数据(尤其是领域内部)的句法多样性,以防止此类虚假相关。
论文及项目相关链接
PDF NeurIPS 2025 Spotlight
Summary
本文探讨了大型语言模型(LLM)在处理任务指令对时,需要同时理解语义和特定领域的局限性。虽然句法能传递隐含信息,但语法模板的存在使得在某些情况下模型会将特定的领域与句法结构联系起来,而忽视了语义含义。作者识别出在合成训练数据集上的模型中存在的句法-领域相关性可能对实体知识任务的性能产生影响的问题,并提出一种评估框架来检测已训练模型中的这种现象。同时作者也研究了这一问题的安全性影响,展示了利用错误的句法-领域相关性来绕过拒绝命令的情况。文章强调了明确测试语法模板领域关联和确保训练数据中的语法多样性的重要性。
Key Takeaways
- LLM在响应指令时需要理解任务指令对的语义和领域。
- 语法模板在训练数据和模型输出中频繁出现,可能导致模型关注句法结构而忽视语义含义。
- 在合成训练数据集上,语法模板和领域的相关性可能降低模型在实体知识任务上的性能。
- 文章介绍了一个评估框架,用以检测训练模型中存在的语法模板和领域关联的现象。
- 该现象在安全调整过程中产生影响,可能会绕过某些拒绝指令的情况。
- 需要明确测试语法模板与领域之间的关联。
点此查看论文截图
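下面用纯Python给出“句法模板”的一个简化近似:在已完成词性标注的句子上统计高频PoS n-gram。示例中的标注序列与n、top_k取值均为演示用的假设,论文中模板的精确定义与统计口径以原文为准。

```python
from collections import Counter
from typing import List, Tuple

def frequent_pos_templates(tagged_corpus: List[List[str]], n: int = 4, top_k: int = 5) -> List[Tuple[Tuple[str, ...], int]]:
    """统计词性(PoS)标签序列中出现频率最高的 n-gram,作为“句法模板”的简化近似。
    输入为已完成词性标注的句子(每句一个标签列表)。"""
    counts: Counter = Counter()
    for tags in tagged_corpus:
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts.most_common(top_k)

# 用法示意:两条指令若共享同一模板(如 VB DT JJ NN),即便语义和领域不同也会落入同一句法模式
corpus = [
    ["VB", "DT", "JJ", "NN", "IN", "NN"],   # 例如 "Write a short poem about love"
    ["VB", "DT", "JJ", "NN", "IN", "NN"],   # 例如 "Solve a simple equation in algebra"
]
print(frequent_pos_templates(corpus, n=4, top_k=3))
```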





GRPO is Secretly a Process Reward Model
Authors:Michael Sullivan
We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks$-$and reach peak performance more rapidly$-$than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
我们从理论上证明,在关于组内各补全(completion)之间令牌序列重叠的特定假设下,GRPO强化学习算法会诱导出一个非平凡的过程奖励模型(PRM)。随后我们通过实证表明,这些假设在真实条件下是成立的:GRPO确实会诱导出一个非平凡的PRM。借助“GRPO即PRM”这一框架,我们发现了GRPO目标中的一个缺陷:过程步骤的非均匀分布会(在不同条件下)同时阻碍探索和利用。为缓解这一缺陷,我们对算法进行了简单修改(λ-GRPO),并证明使用λ-GRPO训练的LLM在下游推理任务上取得了更高的验证精度和性能,并且更快达到峰值性能。我们的结果对为GRPO引入成本高昂、显式定义的PRM的优势提出了质疑:我们表明,可以转而利用原始GRPO算法中隐藏的内置PRM结构来提升模型性能,而对训练时间和成本的影响微乎其微。
论文及项目相关链接
PDF 14 pages, 6 figures; under review at ICLR 2026
Summary
GRPO RL算法在特定假设下能引发非平凡的过程奖励模型(PRM)。实证研究表明,这些假设在现实世界条件下是成立的。通过分析GRPO作为PRM的框架,发现GRPO目标中存在缺陷:过程步骤的非均匀分布阻碍探索和利用。提出对算法进行简单修改(λ-GRPO),使用λ-GRPO训练的语言模型在下游推理任务上表现出更高的验证精度和性能,并能更快达到峰值性能。研究结果对昂贵、明确定义的PRM在GRPO中的优势提出质疑,显示利用GRPO算法中的内置PRM结构提升模型性能是可能的,且对训练时间和成本的影响微乎其微。
Key Takeaways
- GRPO RL算法在特定假设下能激发非平凡的过程奖励模型(PRM)。
- 实证研究验证这些假设在现实世界中的有效性。
- 分析发现GRPO目标中存在缺陷:过程步骤的非均匀分布影响探索和利用。
- 提出改进算法λ-GRPO,有效提升语言模型的性能。
- λ-GRPO训练的语言模型在下游推理任务上表现更优秀,验证精度更高,达到峰值性能更快。
- 研究结果对明确定义的PRM在GRPO中的必要性提出质疑。
点此查看论文截图
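为便于理解“GRPO隐式包含过程奖励模型”这一论点,下面给出标准GRPO组内相对优势的计算示意(这是公开资料中常见的GRPO形式);λ-GRPO对该目标的具体修改论文摘要未给出细节,此处不作猜测。

```python
from typing import List

def grpo_group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """标准 GRPO 的组内相对优势:对同一提示下的一组补全,
    用 (r_i - 组均值) / 组标准差 作为每条补全(及其全部令牌)的优势。
    论文的观点是:当不同补全共享令牌前缀时,这种组内比较会隐式地给中间步骤分配信用,
    从而等效于一个过程奖励模型。"""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# 用法示意:同一提示下 4 条补全的结果奖励
print(grpo_group_advantages([1.0, 0.0, 0.0, 1.0]))
```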



ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
Authors:Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng
Large Language Models (LLMs) have been used to make decisions in complex scenarios, where they need models to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose Theory of Mind Policy Optimization (ToMPO) algorithm to optimize the perception of other individual strategies and the game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM’s strategic decision-making mainly by: 1) generating rollouts based on reasoning the strategies of other individuals, 2) estimating advantages at both the graph-level and sample-level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model’s strategic decision-making capabilities.
大型语言模型(LLM)已被用于在复杂场景中做出决策,这需要模型深入思考、进行逻辑推理并做出明智的决定。现有的许多研究仅关注社会任务或模拟环境中的多轮对话,忽视了各种类型的决策及其相互依赖性;当前的强化学习方法在训练过程中也很难考虑他人的策略。为了解决这些问题,我们首先定义了一个战略决策问题,其中包括两种类型的决策及其时间依赖关系。此外,我们提出了心智理论策略优化(Theory of Mind Policy Optimization, ToMPO)算法,以优化对其他个体策略和博弈局势趋势的感知。与组相对策略优化(GRPO)算法相比,ToMPO主要通过以下方式增强LLM的战略决策能力:1)基于对其他个体策略的推理生成推演(rollout);2)在图级别和样本级别上估计优势;3)平衡全局奖励和局部奖励。在模型输出合规性和合作结果方面,ToMPO算法比GRPO方法高出35%。此外,与参数规模大100倍的模型相比,它也有18%的提升。这证明了ToMPO算法在提高模型战略决策能力方面的有效性。
论文及项目相关链接
PDF 22 pages, 14 figures
Summary
大型语言模型(LLM)在复杂场景中的决策能力受到广泛关注。现有研究多侧重于社会任务或模拟环境中的多轮对话,忽略了不同类型决策及其相互依赖性。针对这些问题,本文给出了包含两种决策类型及其时间依赖性的战略决策问题定义,并提出了心智理论策略优化(ToMPO)算法,用以优化对其他个体策略和博弈局势趋势的感知。相较于组相对策略优化(GRPO)算法,ToMPO通过基于其他个体策略的推理生成推演、在图级别和样本级别估计优势以及平衡全局与局部奖励等方式,增强了LLM的战略决策能力。实验显示,ToMPO算法在模型输出合规性和合作结果方面比GRPO方法高出35%,并且在与参数规模大100倍的模型对比中仍有18%的提升,证明了其在提升模型战略决策能力方面的有效性。
Key Takeaways
- 大型语言模型(LLM)在复杂场景中的决策能力受到研究关注。
- 现有研究多侧重于社会任务或模拟环境中的多轮对话,缺乏对不同类型决策及其相互依赖性的研究。
- 提出了战略决策制定的问题定义,涉及两种类型的决策及其时间依赖性。
- 介绍了心智理论策略优化(ToMPO)算法,该算法优化了LLM对其他个体策略和博弈局势趋势的感知能力。
- ToMPO通过基于其他个体策略的推理生成滚动预测、在图形和样本层面估算优势等方法,增强了LLM的战略决策能力。
- ToMPO算法在模型输出符合度和合作结果方面表现出较高的性能,相较于其他算法有明显的优势。
点此查看论文截图




RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Authors:Kohsei Matsutani, Shota Takashiro, Gouki Minegishi, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo
Large language models (LLMs) are typically trained by reinforcement learning (RL) with verifiable rewards (RLVR) and supervised fine-tuning (SFT) on reasoning traces to improve their reasoning abilities. However, how these methods shape reasoning capabilities remains largely elusive. Going beyond an accuracy-based investigation of how these two components sculpt the reasoning process, this paper introduces a novel analysis framework that quantifies reasoning paths and captures their qualitative changes under each training process (with models of 1.5B, 7B, and 14B parameters on mathematical domains). Specifically, we investigate the reasoning process at two levels of granularity: the trajectory-level, which examines complete reasoning outputs, and the step-level, which analyzes reasoning graphs whose nodes correspond to individual reasoning steps. Notably, clustering of unique reasoning trajectories shows complementary effects: RL compresses incorrect trajectories, whereas SFT expands correct ones. Step-level analysis reveals that RL steepens (about 2.5 times), while SFT flattens (reduced to about one-third), the decay rates of node visitation frequency, degree, and betweenness centrality distributions in the reasoning graph. This indicates that RL concentrates reasoning functionality into a small subset of steps, while SFT homogenizes it across many steps. Furthermore, by evaluating the reasoning graph topologies from multiple perspectives, we delineate the shared and distinct characteristics of RL and SFT. Our work presents a novel reasoning path perspective that explains why the current best practice of two-stage training, with SFT followed by RL, is successful, and offers practical implications for data construction and more efficient learning approaches.
大型语言模型(LLM)通常通过带可验证奖励的强化学习(RLVR)和在推理轨迹上的监督微调(SFT)进行训练,以提高其推理能力。然而,这些方法如何塑造推理能力在很大程度上仍不清楚。本文超越了仅基于准确率来考察这两种训练成分如何塑造推理过程的做法,引入了一个新型分析框架,用以量化推理路径并捕获每种训练过程下推理路径的定性变化(在数学领域对1.5B、7B和14B参数的模型进行分析)。具体而言,我们从两个粒度层面研究推理过程:轨迹级别,考察完整的推理输出;步骤级别,分析节点对应单个推理步骤的推理图。值得注意的是,对独特推理轨迹的聚类显示出互补效应:强化学习压缩了错误轨迹,而监督微调扩展了正确轨迹。步骤级分析表明,强化学习使推理图中节点访问频率、度数和介数中心性分布的衰减率变得更陡(约2.5倍),而监督微调则使其变得平缓(降至约三分之一)。这表明强化学习将推理功能集中在少数步骤中,而监督微调则将其均匀分布到许多步骤上。此外,通过从多个角度评估推理图的拓扑结构,我们刻画了强化学习和监督微调的共性与差异。我们的工作从推理路径这一新视角解释了为什么当前“先SFT后RL”的两阶段训练最佳实践是成功的,并为数据构建和更高效的学习方法提供了实际启示。
论文及项目相关链接
Summary
大型语言模型通过强化学习(RL)和可验证奖励(RLVR)以及监督微调(SFT)进行训练,以提高其推理能力。本文引入了一个新型分析框架,量化推理路径并捕捉每种训练过程中的定性变化。研究发现,强化学习能压缩错误推理轨迹,而监督微调则扩展正确轨迹。此外,强化学习会加剧节点访问频率、度数和中介中心性分布的衰减率,而监督微调则会减缓这一趋势。本研究从推理路径角度解释了为何当前最佳实践是两阶段训练(先SFT后RL),并为数据构建和更高效的学习方法提供了实践启示。
Key Takeaways
- 大型语言模型通过强化学习和监督微调提高推理能力。
- 新型分析框架用于量化推理路径并捕捉训练过程中的变化。
- 强化学习压缩错误推理轨迹,监督微调扩展正确轨迹。
- 强化学习会加剧节点访问频率等分布的衰减率,而监督微调则减缓这一趋势。
- 强化学习将推理功能集中在少数步骤上,而监督微调则使其均匀分布。
- 从推理路径角度解释了为何两阶段训练(先SFT后RL)是成功的。
点此查看论文截图
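下面用networkx在一个玩具推理图上演示摘要中提到的三类统计量(节点访问频率、度数、介数中心性),并用熵衡量“分布是否被摊平”的思路。推理图如何从真实模型输出中抽取、衰减率如何拟合均以论文为准,示例数据为虚构。

```python
import math
from collections import Counter
import networkx as nx

def entropy(values) -> float:
    """归一化后分布的香农熵:越高表示访问/激活越分散(对应SFT的“摊平”效应)。"""
    total = sum(values)
    probs = [v / total for v in values if v > 0]
    return -sum(p * math.log2(p) for p in probs)

# 玩具示例:节点是推理步骤,边表示相邻步骤的转移
steps_per_trace = [
    ["读题", "设未知数", "列方程", "求解"],
    ["读题", "画图", "列方程", "求解"],
    ["读题", "设未知数", "列方程", "验算", "求解"],
]
G = nx.DiGraph()
visits = Counter()
for trace in steps_per_trace:
    visits.update(trace)
    G.add_edges_from(zip(trace, trace[1:]))

degree = dict(G.degree())
betweenness = nx.betweenness_centrality(G)
print("访问频率熵:", round(entropy(visits.values()), 3))
print("度数分布:", degree)
print("介数中心性:", {k: round(v, 3) for k, v in betweenness.items()})
```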





TrustJudge: Inconsistencies of LLM-as-a-Judge and How to Alleviate Them
Authors:Yidong Wang, Yunze Song, Tingyuan Zhu, Xuanwang Zhang, Zhuohao Yu, Hao Chen, Chiyu Song, Qiufeng Wang, Cunxiang Wang, Zhen Wu, Xinyu Dai, Yue Zhang, Wei Ye, Shikun Zhang
The adoption of Large Language Models (LLMs) as automated evaluators (LLM-as-a-judge) has revealed critical inconsistencies in current evaluation frameworks. We identify two fundamental types of inconsistencies: (1) Score-Comparison Inconsistency, where lower-rated responses outperform higher-scored ones in pairwise comparisons, and (2) Pairwise Transitivity Inconsistency, manifested through circular preference chains (A>B>C>A) and equivalence contradictions (A=B=C\neq A). We argue that these issues come from information loss in discrete rating systems and ambiguous tie judgments during pairwise evaluation. We propose TrustJudge, a probabilistic framework that addresses these limitations through two key innovations: 1) distribution-sensitive scoring that computes continuous expectations from discrete rating probabilities, preserving information entropy for more precise scoring, and 2) likelihood-aware aggregation that resolves transitivity violations using bidirectional preference probabilities or perplexity. We also formalize the theoretical limitations of current LLM-as-a-judge frameworks and demonstrate how TrustJudge’s components overcome them. When evaluated with Llama-3.1-70B-Instruct as judge using our dataset, TrustJudge reduces Score-Comparison inconsistency by 8.43% (from 23.32% to 14.89%) and Pairwise Transitivity inconsistency by 10.82% (from 15.22% to 4.40%), while maintaining higher evaluation accuracy. Our work provides the first systematic analysis of evaluation framework inconsistencies in LLM-as-a-judge paradigms, offering both theoretical insights and practical solutions for reliable automated assessment. The framework demonstrates consistent improvements across various model architectures and scales, enabling more trustworthy LLM evaluation without requiring additional training or human annotations. The codes can be found at https://github.com/TrustJudge/TrustJudge.
采用大型语言模型(LLM)作为自动评估器(LLM-as-a-judge)已经揭示了当前评估框架中的关键不一致之处。我们识别出两种基本类型的不一致性:(1)评分比较不一致性,其中低评级的回应在配对比较中表现优于高评级的回应;(2)配对传递性不一致性,表现为循环偏好链(A>B>C>A)和等价矛盾(A=B=C≠A)。我们认为这些问题源于离散评分系统中的信息损失和配对评估期间的模糊平局判断。我们提出了TrustJudge,这是一个概率框架,通过两个关键创新来解决这些局限性:1)分布敏感评分,根据离散评分概率计算连续期望,保留信息熵以进行更精确评分;以及2)可能性感知聚合,使用双向偏好概率或困惑度解决传递性违规。我们还正式提出了当前LLM-as-a-judge框架的理论局限性,并展示了TrustJudge的组件是如何克服它们的。使用Llama-3.1-70B-Instruct作为评委对我们的数据集进行评估时,TrustJudge将评分比较不一致性降低了8.43%(从23.32%降至14.89%),配对传递性不一致性降低了10.82%(从15.22%降至4.4%),同时保持了更高的评估准确性。我们的工作提供了LLM-as-a-judge范式中评估框架不一致性的系统分析,为可靠的自动评估提供了理论见解和实际解决方案。该框架在各种模型架构和规模上均显示出持续改进的能力,可在无需额外训练或人工注释的情况下,进行更可靠的LLM评估。相关代码可见于https://github.com/TrustJudge/TrustJudge。
论文及项目相关链接
PDF 22 pages, 9 figures, 6 tables
Summary
本文探讨了大型语言模型(LLM)作为自动评估器(LLM-as-a-judge)在现有评估框架中的关键不一致性问题。文章指出了两种根本性的不一致性:(1)评分比较不一致性,其中低评级的回应在成对比较中表现出优于高评分回应的情况;(2)成对传递性不一致性,表现为循环偏好链和等价矛盾。文章提出TrustJudge框架来解决这些问题,通过分布敏感评分和可能性感知聚合两个关键创新点来改进现有框架的理论局限性。在特定数据集上的评估显示,TrustJudge能减少评分比较不一致性和成对传递性不一致性,同时保持较高的评估准确性。
Key Takeaways
- 大型语言模型(LLM)作为自动评估器在现有评估框架中存在关键不一致性问题。
- 主要的两种不一致性包括评分比较不一致性和成对传递性不一致性。
- TrustJudge框架通过分布敏感评分和可能性感知聚合来解决这些不一致性问题。
- TrustJudge能减少评分比较不一致性和成对传递性不一致性达一定比例。
- TrustJudge框架在多种模型和规模上表现一致,可提高LLM评估的可靠性。
- TrustJudge框架不需要额外的训练或人工注释,即可实现可靠的LLM评估。
点此查看论文截图
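下面示意“分布敏感评分”的核心想法:不直接取离散评分的argmax,而是用评分的概率分布计算连续期望,从而保留分布中的信息。示例中的评分范围(1-5)与对数概率数值均为假设,如何从评审模型中获得这些概率请参考论文实现。

```python
import math
from typing import Dict

def expected_score(rating_logprobs: Dict[int, float]) -> float:
    """分布敏感评分示意:由各评分令牌的(对数)概率计算连续期望分数,
    而不是只取概率最高的离散分数。"""
    probs = {s: math.exp(lp) for s, lp in rating_logprobs.items()}
    z = sum(probs.values())
    return sum(s * p / z for s, p in probs.items())

# 两个回答的 argmax 分数同为 4,但期望分数可以区分它们:
a = {1: -6.0, 2: -5.0, 3: -1.2, 4: -0.6, 5: -2.5}
b = {1: -7.0, 2: -6.0, 3: -2.5, 4: -0.4, 5: -1.6}
print(round(expected_score(a), 3), round(expected_score(b), 3))
```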



ScaleDiff: Scaling Difficult Problems for Advanced Mathematical Reasoning
Authors:Qizhi Pei, Zhuoshi Pan, Honglin Lin, Xin Gao, Yu Li, Zinan Tang, Conghui He, Rui Yan, Lijun Wu
Large Reasoning Models (LRMs) have shown impressive capabilities in complex problem-solving, often benefiting from training on difficult mathematical problems that stimulate intricate reasoning. Recent efforts have explored automated synthesis of mathematical problems by prompting proprietary models or large-scale open-source models from seed data or inherent mathematical concepts. However, scaling up these methods remains challenging due to their high computational/API cost, complexity of prompting, and limited difficulty level of the generated problems. To overcome these limitations, we propose ScaleDiff, a simple yet effective pipeline designed to scale the creation of difficult problems. We efficiently identify difficult problems from existing datasets with only a single forward pass using an adaptive thinking model, which can perceive problem difficulty and automatically switch between “Thinking” and “NoThinking” modes. We then train a specialized difficult problem generator (DiffGen-8B) on this filtered difficult data, which can produce new difficult problems in large scale, eliminating the need for complex, per-instance prompting and its associated high API costs. Fine-tuning Qwen2.5-Math-7B-Instruct on the ScaleDiff-Math dataset yields a substantial performance increase of 11.3% compared to the original dataset and achieves a 65.9% average accuracy on AIME’24, AIME’25, HMMT-Feb’25, BRUMO’25, and MATH500, outperforming recent strong LRMs like OpenThinker3. Notably, this performance is achieved using the cost-efficient Qwen3-8B model as a teacher, demonstrating that our pipeline can effectively transfer advanced reasoning capabilities without relying on larger, more expensive teacher models. Furthermore, we observe a clear scaling phenomenon in model performance on difficult benchmarks as the quantity of difficult problems increases. Code: https://github.com/QizhiPei/ScaleDiff.
大型推理模型(LRM)在复杂问题求解方面展现出令人印象深刻的能力,这通常得益于在能够激发复杂推理的困难数学问题上进行训练。最近的工作尝试通过提示专有模型或大规模开源模型,从种子数据或固有数学概念出发自动合成数学问题。然而,由于计算/API成本高、提示构造复杂以及生成问题的难度有限,这些方法难以规模化。为克服这些局限,我们提出了ScaleDiff,一个简单而有效的流水线,用于规模化地生成困难问题。我们使用一个自适应思考模型,仅需一次前向传递即可从现有数据集中高效识别困难问题——该模型能够感知问题难度并在“思考”与“不思考”模式之间自动切换。随后,我们在筛选出的困难数据上训练了一个专门的困难问题生成器(DiffGen-8B),它能够大规模生成新的困难问题,从而无需复杂的逐实例提示及其带来的高API成本。在ScaleDiff-Math数据集上对Qwen2.5-Math-7B-Instruct进行微调,相比原始数据集性能提升了11.3%,在AIME'24、AIME'25、HMMT-Feb'25、BRUMO'25和MATH500上取得了65.9%的平均准确率,超过了OpenThinker3等近期强大的LRM。值得注意的是,这一性能是以具有成本效益的Qwen3-8B作为教师模型实现的,表明我们的流水线能够在不依赖更大、更昂贵教师模型的情况下有效迁移高级推理能力。此外,随着困难问题数量的增加,模型在困难基准上的性能呈现出明显的规模化现象。代码:https://github.com/QizhiPei/ScaleDiff。
论文及项目相关链接
PDF 15 pages
Summary
大型推理模型(LRM)在处理复杂问题时展现出显著的能力。针对当前自动合成数学问题方法计算成本高、生成题目难度有限等局限,研究者提出了一种简单有效的方案ScaleDiff。该方案使用自适应思考模型快速识别现有数据集中的难题,并通过训练专门的难题生成器来大规模生成新难题,从而无需复杂的逐实例提示及其相关的高API成本。在ScaleDiff数据上微调的模型在多个高难度数学基准上取得了显著的性能提升,且仅依赖成本效益较高的教师模型。目前,该方案已在GitHub上开源。
Key Takeaways
- LRMs展现出解决复杂问题的能力,得益于训练在困难数学问题上的成果。
- 当前自动合成数学问题的挑战包括高计算成本、提示复杂性和问题难度有限。
- ScaleDiff被提出以解决上述问题,通过自适应思考模型快速识别难题并利用难题生成器进行大规模难题生成。
- ScaleDiff方案减少了复杂的逐个实例提示及其相关的高API成本。
- 对模型的微调在难题基准测试上取得了显著性能提升,并且能够在不使用大型昂贵教师模型的情况下实现高性能表现。
- 模型在面临不同难度级别的问题时展现出明显的性能提升趋势。
点此查看论文截图




GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions
Authors:Bing Liu, Wenqiang Yv, Xuzheng Yang, Shichang Wang, Junzhuo Liu, Peng Wang, Guoqing Wang, Yang Yang, Heng Tao Shen
AI-driven geometric problem solving is a complex vision-language task that requires accurate diagram interpretation, mathematical reasoning, and robust cross-modal grounding. A foundational yet underexplored capability for this task is the ability to identify and interpret geometric elements based on natural language queries. To address this, we introduce the task of Referring Expression Comprehension (REC) for geometric problems, which evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We present GeoRef, a benchmark dataset constructed from existing geometric problem corpora, featuring diverse, high-quality annotations and queries. Due to the lack of annotated data for this task, we generate a large-scale synthetic training dataset using a structured geometric formal language, enabling broad coverage of geometric concepts and facilitating model adaptation. We explore two fine-tuning approaches: Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). Our results show that GRPO significantly outperforms SFT by better aligning model behavior with task-specific rewards. Furthermore, we propose a verify-and-regenerate mechanism that detects incorrect predictions and re-infers answers using contextual reasoning history, further boosting accuracy. Notably, even state-of-the-art Multimodal Large Language Models (MLLMs) struggle with this task, underscoring the necessity of explicitly evaluating and strengthening geometric grounding as a prerequisite for robust geometric problem solving. Moreover, models trained on GeoRef demonstrate measurable improvements on downstream geometric reasoning tasks, highlighting the broader value of REC as a foundation for multimodal mathematical understanding.
AI驱动的几何问题求解是一项复杂的视觉语言任务,需要准确的图表解读、数学推理和稳健的跨模态对齐(grounding)能力。该任务的一项基础性但尚未被充分探索的能力,是根据自然语言查询识别并解释几何元素。为此,我们引入了面向几何问题的指代表达式理解(REC)任务,评估模型能否根据文本提示在图表中定位点、形状和空间关系。我们提出了GeoRef,这是一个基于现有几何问题语料库构建的基准数据集,具有多样化的高质量标注和查询。由于该任务缺乏标注数据,我们使用结构化的几何形式语言生成了大规模合成训练数据集,以广泛覆盖几何概念并促进模型适配。我们探索了两种微调方法:监督微调(SFT)和组相对策略优化(GRPO)。结果表明,GRPO通过使模型行为与任务特定奖励更好地对齐,显著优于SFT。此外,我们提出了一种“验证-再生成”机制,能够检测错误预测并利用上下文推理历史重新推断答案,进一步提升准确率。值得注意的是,即使是最先进的多模态大型语言模型(MLLM)在该任务上也表现不佳,这凸显了将显式评估和强化几何对齐能力作为稳健几何问题求解先决条件的必要性。此外,在GeoRef上训练的模型在下游几何推理任务上表现出可衡量的提升,突显了REC作为多模态数学理解基础的更广泛价值。
论文及项目相关链接
Summary
AI驱动的几何问题求解是一项复杂的视觉语言任务,需要准确的图表解读、数学推理和稳健的跨模态对齐。本文引入了面向几何问题的指代表达式理解(REC)任务,以评估模型能否根据文本提示在图表中定位点、形状和空间关系,并构建了GeoRef基准数据集,该数据集基于现有几何问题语料库,具有多样化的高质量标注和查询。由于缺乏该任务的标注数据,作者利用结构化的几何形式语言生成了大规模合成训练数据,以覆盖广泛的几何概念并促进模型适配。实验比较了监督微调(SFT)和组相对策略优化(GRPO)两种微调方式,结果表明GRPO通过使模型行为与任务特定奖励更好对齐而显著优于SFT;此外,提出的“验证-再生成”机制能够检测错误预测并利用上下文推理历史重新推断答案,进一步提升准确率。即使是最先进的多模态大型语言模型(MLLM)在该任务上也表现不佳,凸显了显式评估和强化几何对齐能力是稳健几何问题求解的先决条件;在GeoRef上训练的模型在下游几何推理任务上也有可衡量的提升,表明REC是多模态数学理解的重要基础。
Key Takeaways
- 提出了面向几何问题的指代表达式理解(REC)任务,评估模型能否依据文本提示在图表中定位点、形状和空间关系。
- 构建了GeoRef基准数据集,并利用结构化几何形式语言生成大规模合成训练数据。
- GRPO微调显著优于SFT,“验证-再生成”机制进一步提升了准确率。
- 最先进的多模态大型语言模型在该任务上仍表现不佳,凸显了强化几何对齐能力的必要性。
- 在GeoRef上训练的模型在下游几何推理任务上也有可衡量的提升。
点此查看论文截图







Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs
Authors:Honglin Zhang, Qianyue Hao, Fengli Xu, Yong Li
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms why RL fine-tuning is able to enhance the capability of various LLMs with distinct intrinsic characteristics remain underexplored. In this study, we draw inspiration from prior work on edge attribution patching (EAP) to investigate the internal differences of LLMs before and after RL fine-tuning. Our analysis across multiple model families shows two robust effects of online RL post-training: (i) an overall increase in activation intensity, indicating that more internal pathways are engaged and their signals become stronger, and (ii) greater diversity in activation patterns, reflected by higher entropy and less concentrated edge distributions. These changes suggest that RL reshapes information flow to be both more redundant and more flexible, which may explain its advantage in generalization. Notably, models fine-tuned with Direct Preference Optimization (DPO) deviate from these trends, exhibiting substantially weaker or inconsistent internal changes compared to PPO- and GRPO-based training. Together, our findings provide a unified view of how RL fine-tuning systematically alters the internal circuitry of LLMs and highlight the methodological distinctions between online RL and preference-based approaches. Our code is open source at https://anonymous.4open.science/r/llm_rl_probing_analysis-F673.
大型语言模型(LLM)通过大规模预训练获得丰富的先验知识,并可通过监督微调(SFT)或基于强化学习(RL)的后训练进一步增强。越来越多的证据表明,RL微调提高了LLM的能力,超越了仅使用SFT所达到的水平。然而,RL微调能够增强具有不同内在特性的各种LLM能力的底层机制仍然未被充分探索。在本研究中,我们从边缘归属修补(EAP)的先前研究中汲取灵感,调查LLM在RL微调前后的内部差异。我们对多个模型家族的分析显示,在线RL后训练的两种稳健效果:(i)激活强度的总体增加,表明更多的内部路径被激活且其信号变得更强;(ii)激活模式的多样性更高,体现在更高的熵值和更不集中的边缘分布。这些变化表明RL使信息流动更加冗余和灵活,这可能解释了其在泛化方面的优势。值得注意的是,使用直接偏好优化(DPO)进行微调的模型与这些趋势背道而驰,与基于PPO和GRPO的训练相比,其内部变化更弱或不一致。我们的研究综合地展示了RL微调如何系统地改变LLM的内部结构,并突出了在线RL和基于偏好的方法之间的方法论区别。我们的代码在https://anonymous.4open.science/r/llm_rl_probing_analysis-F673开放访问。
论文及项目相关链接
Summary
大型语言模型(LLM)通过大规模预训练获得丰富的先验知识,并可进一步通过监督微调(SFT)、强化学习(RL)等后训练方式进行增强。研究表明,强化学习微调能提高LLM的能力,超越仅使用SFT所能达到的水平。然而,关于为何RL微调能够提升具有不同内在特性的各种LLM的能力的底层机制尚未得到充分探索。本研究受边缘属性补丁(EAP)的启发,探讨了RL微调前后LLM的内部差异。对多个模型家族的分析显示,在线RL后训练的两个稳健效应为:(一)激活强度的总体增加,表明更多的内部路径被激活且信号变得更强烈;(二)激活模式的多样性增加,反映为边缘分布的高熵和较少集中。这些变化表明,RL重塑了更为冗余和灵活的信息流,这可能解释了其在泛化方面的优势。值得注意的是,使用直接偏好优化(DPO)进行微调模型的趋势有所不同,与其他基于PPO和GRPO的训练相比,其内部变化较弱或不一致。总体而言,我们的研究系统地揭示了RL微调如何改变LLM的内部结构,并突出了在线RL和基于偏好的方法之间的方法论差异。
Key Takeaways
- 强化学习微调能够提升大型语言模型的能力,超越监督微调的效果。
- 在线RL后训练导致LLM的激活强度总体增加,表明更多内部路径被激活且信号强烈。
- RL训练使得LLM的激活模式更加多样,表现为高熵和较少集中的边缘分布。
- RL重塑了LLM的信息流,使其更加冗余和灵活,可能提高了模型的泛化能力。
- 直接偏好优化(DPO)在LLM微调中的效果与其他基于PPO和GRPO的训练方法有所不同。
- DPO微调模型的内部变化较弱或不一致,与其他训练方法存在显著差异。
点此查看论文截图



Predicting LLM Reasoning Performance with Small Proxy Model
Authors:Woosung Koh, Juyoung Suk, Sungjun Han, Se-Young Yun, Jay Shin
Given the prohibitive cost of pre-training large language models, it is essential to leverage smaller proxy models to optimize datasets before scaling up. However, this approach becomes challenging for reasoning capabilities, which exhibit emergent behavior that only appear reliably at larger model sizes, often exceeding 7B parameters. To address this, we introduce rBridge, showing that small proxies ($\leq$1B) can effectively predict large-model reasoning by aligning more closely with (1) the pre-training objective and (2) the target task. rBridge achieves this by weighting negative log-likelihood with task alignment, using reasoning traces from frontier models as gold labels. In our experiments, rBridge (i) reduces dataset ranking costs by over 100x relative to the best baseline, (ii) achieves the strongest correlation across six reasoning benchmarks at 1B to 32B scale, and (iii) zero-shot transfers predictive relationships across pre-training datasets at 1B to 7B scale. These findings indicate that rBridge offers a practical path for exploring reasoning-oriented pre-training at lower cost.
鉴于预训练大型语言模型的成本高得令人望而却步,在扩大规模之前利用较小的代理模型来优化数据集至关重要。然而,对于推理能力而言,这种方法变得具有挑战性,因为推理能力表现出只有在更大模型规模(通常超过70亿参数)下才能可靠出现的涌现行为。为了解决这一问题,我们提出了rBridge,表明小型代理模型(≤1B)通过与(1)预训练目标和(2)目标任务更紧密地对齐,可以有效预测大模型的推理能力。rBridge通过以任务对齐度对负对数似然进行加权来实现这一点,并将前沿模型的推理轨迹用作黄金标签。在我们的实验中,rBridge(i)相对于最佳基线将数据集排序成本降低了100倍以上,(ii)在1B到32B规模上于六个推理基准中取得了最强的相关性,(iii)在1B到7B规模上实现了预测关系在不同预训练数据集之间的零样本迁移。这些发现表明,rBridge为以更低成本探索面向推理的预训练提供了一条实用途径。
论文及项目相关链接
PDF Pre-print
Summary
文本介绍了面对大规模语言模型高昂的训练成本问题,利用小型代理模型优化数据集的策略变得尤为关键。但问题在于这种策略在处理需要大量参数才能达到的优秀推理能力上显得捉襟见肘,往往要求模型参数超过7B。为了解决这个问题,我们引入了rBridge技术。该技术通过更紧密地结合预训练目标和目标任务,使小型代理模型(≤1B)能够有效预测大型模型的推理能力。实验结果显示,rBridge显著降低了数据集排名成本,同时在不同规模的模型中实现了强大的推理能力基准测试关联和零样本迁移预测关系。这些发现证明rBridge在低成本条件下探索以推理为重点的预训练是一个可行的途径。
Key Takeaways
- 面对大规模语言模型的高昂训练成本,利用小型代理模型优化数据集成为一种策略。
- 小型代理模型在处理需要大量参数的推理任务时面临挑战。
- rBridge技术引入,旨在解决小型代理模型在预测大型模型推理能力上的不足。
- rBridge技术结合预训练目标和目标任务,使小型代理模型更紧密地模拟大型模型的推理能力。
- rBridge显著降低了数据集排名成本,提高了效率。
- rBridge在不同规模的模型中实现了强大的推理能力基准测试关联。
点此查看论文截图
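按摘要的描述,rBridge用“任务对齐度”对负对数似然加权,并以前沿模型的推理轨迹作为金标签。下面给出一个非常粗略的加权NLL示意,其中逐令牌的对齐权重只是占位假设,真实的权重定义与对齐方式以论文为准。

```python
from typing import List

def weighted_nll(token_logprobs: List[float], alignment_weights: List[float]) -> float:
    """示意:对代理模型在“金标推理轨迹”(来自前沿模型)上的逐令牌负对数似然,
    按任务对齐度加权后求平均。权重如何定义是论文的核心设计,此处仅为占位假设。"""
    assert len(token_logprobs) == len(alignment_weights)
    num = sum(-lp * w for lp, w in zip(token_logprobs, alignment_weights))
    den = sum(alignment_weights) or 1.0
    return num / den

# 用法示意:与目标任务相关的推理令牌赋予更高权重
logprobs = [-0.2, -1.3, -0.7, -2.1]   # 代理模型(≤1B)给出的逐令牌对数似然
weights = [0.2, 1.0, 1.0, 1.0]        # 假设的对齐权重
print(round(weighted_nll(logprobs, weights), 3))
```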





Lossless Compression: A New Benchmark for Time Series Model Evaluation
Authors:Meng Wan, Benxi Tian, Jue Wang, Cui Hui, Ningming Nie, Tiantian Liu, Zongguo Wang, Cao Rongqiang, Peng Shi, Yangang Wang
The evaluation of time series models has traditionally focused on four canonical tasks: forecasting, imputation, anomaly detection, and classification. While these tasks have driven significant progress, they primarily assess task-specific performance and do not rigorously measure whether a model captures the full generative distribution of the data. We introduce lossless compression as a new paradigm for evaluating time series models, grounded in Shannon’s source coding theorem. This perspective establishes a direct equivalence between optimal compression length and the negative log-likelihood, providing a strict and unified information-theoretic criterion for modeling capacity. Then We define a standardized evaluation protocol and metrics. We further propose and open-source a comprehensive evaluation framework TSCom-Bench, which enables the rapid adaptation of time series models as backbones for lossless compression. Experiments across diverse datasets on state-of-the-art models, including TimeXer, iTransformer, and PatchTST, demonstrate that compression reveals distributional weaknesses overlooked by classic benchmarks. These findings position lossless compression as a principled task that complements and extends existing evaluation for time series modeling.
时间序列模型的评估传统上主要集中在四个典型任务上:预测、插补、异常检测和分类。虽然这些任务推动了显著的进步,但它们主要评估特定任务的性能,并不能严格衡量模型是否捕捉到了数据的完整生成分布。我们引入无损压缩作为评估时间序列模型的新范式,其依据是香农的信源编码定理。这一视角建立了最优压缩长度与负对数似然之间的直接等价关系,为建模能力提供了严格且统一的信息论准则。随后我们定义了标准化的评估协议和指标,并进一步提出并开源了一个全面的评估框架TSCom-Bench,它能够将时间序列模型快速适配为无损压缩的骨干。在多样化数据集上对TimeXer、iTransformer和PatchTST等最新模型的实验表明,压缩揭示了经典基准所忽视的分布性弱点。这些发现将无损压缩定位为一项有原则的任务,它补充并扩展了现有的时间序列建模评估。
论文及项目相关链接
PDF 24 pages
Summary
本文提出以无损压缩作为评估时间序列模型的新范式,该范式基于香农的源编码定理,建立最优压缩长度与负对数似然之间的直接等价关系,为模型容量提供了严格且统一的信息理论标准。文章还定义了标准化的评估协议和指标,并开源了一个全面的评估框架TSCom-Bench,使时间序列模型能迅速适应无损压缩。实验证明,压缩揭示了经典基准测试未能发现的分布弱点。
Key Takeaways
- 时间序列模型的评估传统上集中在四个典型任务上:预测、插补、异常检测和分类。
- 这些任务主要评估特定任务的性能,并未严格测量模型是否捕捉数据的全生成分布。
- 无损压缩被引入作为评估时间序列模型的新范式,基于香农的源编码定理。
- 无损压缩建立最优压缩长度与负对数似然之间的直接等价,提供信息理论的标准。
- 文章定义了标准化的评估协议和指标,并开源评估框架TSCom-Bench。
- 实验显示,无损压缩能揭示被经典基准测试忽略的分布弱点。
点此查看论文截图
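根据香农信源编码定理,用模型预测分布做(算术)编码时,序列的最优码长约等于其负对数似然(以比特计),这正是“无损压缩长度=负对数似然”等价关系的来源。下面的小例子假设时间序列已被离散化、且每步真实取值的预测概率已知;TSCom-Bench的具体离散化与编码协议以论文为准。

```python
import math
from typing import List

def compression_length_bits(stepwise_probs: List[float]) -> float:
    """无损压缩视角的示意:序列的最优码长 L(x) ≈ -∑ log2 p(x_t | x_<t)(单位:比特)。
    stepwise_probs 为模型对每一步真实取值给出的预测概率(此处假设已完成离散化)。"""
    return -sum(math.log2(p) for p in stepwise_probs)

# 用法示意:模型对真实取值预测得越准(概率越高),压缩后的比特数越少
good_model = [0.9, 0.8, 0.85, 0.9]
weak_model = [0.5, 0.4, 0.6, 0.5]
print(round(compression_length_bits(good_model), 2), "bits")
print(round(compression_length_bits(weak_model), 2), "bits")
```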


