⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-07-17 更新
Comprehension Without Competence: Architectural Limits of LLMs in Symbolic Computation and Reasoning
Authors:Zheng Zhang
Large Language Models (LLMs) display striking surface fluency yet systematically fail at tasks requiring symbolic reasoning, arithmetic accuracy, and logical consistency. This paper offers a structural diagnosis of such failures, revealing a persistent gap between \textit{comprehension} and \textit{competence}. Through controlled experiments and architectural analysis, we demonstrate that LLMs often articulate correct principles without reliably applying them–a failure rooted not in knowledge access, but in computational execution. We term this phenomenon the computational \textit{split-brain syndrome}, where instruction and action pathways are geometrically and functionally dissociated. This core limitation recurs across domains, from mathematical operations to relational inferences, and explains why model behavior remains brittle even under idealized prompting. We argue that LLMs function as powerful pattern completion engines, but lack the architectural scaffolding for principled, compositional reasoning. Our findings delineate the boundary of current LLM capabilities and motivate future models with metacognitive control, principle lifting, and structurally grounded execution. This diagnosis also clarifies why mechanistic interpretability findings may reflect training-specific pattern coordination rather than universal computational principles, and why the geometric separation between instruction and execution pathways suggests limitations in neural introspection and mechanistic analysis.
大型语言模型(LLM)表现出惊人的表面流畅性,但在需要符号推理、算术准确性和逻辑一致性的任务上却系统性地失败。本文对这些失败进行了结构性诊断,揭示了“理解”和“能力”之间的持久差距。通过受控实验和架构分析,我们证明LLMs往往阐述正确的原则,但却不能可靠地应用它们——这种失败并非源于知识获取,而是源于计算执行。我们将这种现象称为计算性“分裂脑综合征”,其中指令和行动路径在几何和功能上相互分离。这一核心限制在不同的领域中都存在,从数学运算到关系推理,并解释了为什么即使在理想化的提示下,模型行为仍然很脆弱。我们认为LLM虽然可以作为强大的模式完成引擎发挥作用,但缺乏有原则、有组织的推理所需的架构支撑。我们的研究界定了当前LLM的能力边界,并激励未来模型具备元认知控制、原则提升和结构化的执行能力。这种诊断还阐明了为什么机制性解释的发现可能反映训练特定的模式协调,而不是普遍的计算原理,以及为什么指令和执行路径之间的几何分离表明神经内省和机制分析的局限性。
论文及项目相关链接
PDF Substantial change to previous version (experiments, theorem, analysis and related work); currently under review at TMLR
Summary
大型语言模型(LLMs)虽然表现出惊人的表面流畅性,但在需要符号推理、算术准确性和逻辑一致性的任务中却系统性地失败。本文对其失败原因进行了结构性诊断,揭示了“理解”与“能力”之间的持久差距。通过实验分析和架构解析,我们发现LLMs往往能正确阐述原则,但却无法可靠地应用它们——这种失败并非源于知识获取,而是计算执行的问题。我们称这种现象为“计算分裂脑综合征”,其中指令和执行路径在几何和功能上分离。这一核心限制在数学运算、关系推理等领域中普遍存在,解释了为什么即使在理想化的提示下,模型行为仍然脆弱。本文认为LLMs虽然功能强大,但缺乏原则性、组合推理的架构支撑。我们的研究明确了当前LLM的能力边界,并鼓励未来构建具有元认知控制、原则提升和结构基础执行的模型。同时,本文也澄清了机械解释性发现可能反映训练特定的模式协调,而非普遍的计算原则,以及指令与执行路径之间的几何分离对神经内省和机械分析的限制。
Key Takeaways
- LLMs在需要符号推理、算术准确性和逻辑一致性的任务中表现出系统性失败。
- LLMs的失败源于计算执行的问题,而非知识获取。
- LLMs常常能正确阐述原则,但无法可靠地应用它们。
- 现象被称为“计算分裂脑综合征”,指令和执行路径存在几何和功能上的分离。
- LLMs缺乏原则性、组合推理的架构支撑。
- 当前LLM的能力边界被明确,需要构建具有元认知控制、原则提升和结构基础执行的模型。
点此查看论文截图

Tiny Reward Models
Authors:Sarah Pan
Large decoder-based language models have become the dominant architecture for reward modeling in reinforcement learning from human feedback (RLHF). However, as reward models are increasingly deployed in test-time strategies, their inference costs become a growing concern. We present TinyRM, a family of small, bidirectional masked language models (MLMs) with as few as 400 million parameters, that rival the capabilities of models over 175 times larger on reasoning and safety preference modeling tasks. TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench, despite using significantly fewer resources. Our experiments suggest that small models benefit from domain-specific tuning strategies, particularly in reasoning, where lightweight finetuning methods are especially effective. While challenges remain in building generalist models and conversational preference modeling, our preliminary results highlight the promise of lightweight bidirectional architectures as efficient, scalable alternatives for preference modeling.
基于大型解码器的语言模型已经成为强化学习人类反馈(RLHF)中奖励建模的主导架构。然而,随着奖励模型在测试时间策略中的部署越来越多,其推理成本成为一个日益关注的问题。我们提出了TinyRM,这是一系列小型双向掩码语言模型(MLM),拥有仅4亿个参数,就能在推理和安全偏好建模任务上,与超过其175倍的模型能力相抗衡。尽管TinyRM使用的资源相对较少,但它结合了FLAN风格的提示、方向低秩适应(DoRA)和层冻结技术,在RewardBench上实现了强大的性能。我们的实验表明,小模型受益于特定的领域调整策略,特别是在推理方面,轻量级微调方法特别有效。虽然构建通用模型和对话偏好建模仍存在挑战,但我们的初步结果凸显了轻型双向架构作为高效、可扩展的偏好建模替代方案的潜力。
论文及项目相关链接
PDF 2025 ICML Efficient Systems for Foundation Models Workshop
Summary
大型解码器基础语言模型已成为强化学习人类反馈(RLHF)中奖励建模的主导架构。然而,随着奖励模型在测试时间策略中的部署越来越多,其推理成本成为一个日益关注的问题。我们提出了TinyRM,这是一个小型双向掩码语言模型(MLM)系列,仅有4亿个参数,在推理和安全偏好建模任务上的能力与超过其规模大175倍的模型相匹敌。TinyRM结合了FLAN风格的提示、方向性低秩适应(DoRA)和层冻结等技术,在RewardBench上表现出强大的性能,同时使用的资源相对较少。我们的实验表明,小模型受益于特定的领域调整策略,特别是在推理方面,轻量级微调方法特别有效。尽管在构建通用模型和对话偏好建模方面仍存在挑战,但我们的初步结果突出了轻量级双向架构作为偏好建模的有效、可扩展替代方案的潜力。
Key Takeaways
- 大型解码器基础语言模型在强化学习人类反馈(RLHF)的奖励建模中占据主导地位。
- 奖励模型的推理成本随着其部署的增加而成为关注焦点。
- TinyRM是一个小型双向掩码语言模型(MLM),参数少至4亿,性能强大。
- TinyRM在ReasoningBench上表现出强大的性能,同时使用的资源较少。
- 小模型受益于特定的领域调整策略,轻量级微调方法特别有效。
- TinyRM结合了FLAN风格的提示、方向性低秩适应(DoRA)和层冻结等技术实现高性能。
- 尽管在构建通用模型和对话偏好建模方面仍存在挑战,但轻量级双向架构具有潜力。
点此查看论文截图


Is Human-Written Data Enough? The Challenge of Teaching Reasoning to LLMs Without RL or Distillation
Authors:Wei Du, Branislav Kisacanin, George Armstrong, Shubham Toshniwal, Ivan Moshkov, Alexan Ayrapetyan, Sadegh Mahdavi, Dan Zhao, Shizhe Diao, Dragan Masulovic, Marius Stanean, Advaith Avadhanam, Max Wang, Ashmit Dutta, Shitij Govil, Sri Yanamandara, Mihir Tandon, Sriram Ananthakrishnan, Vedant Rathi, David Zhang, Joonseok Kang, Leon Luo, Titu Andreescu, Boris Ginsburg, Igor Gitman
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought (CoT) traces. While recent works show that base models can acquire such reasoning traces via reinforcement learning or distillation from stronger models like DeepSeek-R1, previous works demonstrate that even short CoT prompting without fine-tuning is able to improve reasoning. We ask whether long CoT can be induced in a base model using only prompting or minimal tuning. Using just 20 long CoT examples from the reasoning model \texttt{QwQ-32B-Preview}, we lightly fine-tune the base model \texttt{Qwen2.5-32B}. The resulting model outperforms the much larger \texttt{Qwen2.5-Math-72B-Instruct}, showing that a handful of high-quality examples can unlock strong reasoning capabilities. We further explore using CoT data from non-reasoning models and human annotators, enhanced with prompt engineering, multi-pass editing, and structural guidance. However, neither matches the performance of reasoning model traces, suggesting that certain latent qualities of expert CoT are difficult to replicate. We analyze key properties of reasoning data, such as problem difficulty, diversity, and answer length, that influence reasoning distillation. While challenges remain, we are optimistic that carefully curated human-written CoT, even in small quantities, can activate reasoning behaviors in base models. We release our human-authored dataset across refinement stages and invite further investigation into what makes small-scale reasoning supervision so effective.
具备推理能力的语言模型通过生成长而明确的思维链(Chain-of-Thought,CoT)轨迹,在不同复杂的任务上达到了最先进的性能。虽然最近的研究表明,基础模型可以通过强化学习或从更强模型(如DeepSeek-R1)的蒸馏来获得这种推理轨迹,但先前的研究表明,即使没有微调,仅仅使用短的CoT提示也能提高推理能力。我们想知道,是否可以使用提示或微调在基础模型中引发长CoT。我们仅使用来自推理模型QwQ-32B-Preview的20个长CoT示例,对基础模型Qwen2.5-32B进行轻微微调。结果模型的表现超过了更大的Qwen2.5-Math-72B-Instruct模型,这表明少量高质量示例可以解锁强大的推理能力。我们进一步探索使用非推理模型和人类注释者的CoT数据,并通过提示工程、多轮编辑和结构指导来增强。但是,无论哪种方式都无法匹配推理模型轨迹的性能,这表明专家CoT的某些潜在品质难以复制。我们分析了影响推理蒸馏的关键推理数据属性,如问题难度、多样性和答案长度。虽然仍存在挑战,但我们相信精心策划的人类编写的小规模CoT数据也能激活基础模型中的推理行为。我们发布了各个精炼阶段的人类创作数据集,并邀请进一步研究小规模推理监督为何如此有效的原因。
论文及项目相关链接
PDF Accepted at the Second AI for Math Workshop at the 42nd International Conference on Machine Learning (ICML 2025)
Summary
大規模語言模型在通過生成長篇的、顯式的鏈式思維(Chain-of-Thought,CoT)痕跡後,在多重複雜任務中取得了前沿的表現。本文探索了是否僅通過提示或最小限度的調整就能在基礎模型中誘導出長篇CoT。使用來自推理模型QwQ-32B-Preview的僅20個長篇CoT樣本,我們稍微調整了基礎模型Qwen2.5-32B,結果顯示其表現超越了更大的Qwen2.5-Math-72B-Instruct模型。此外,本文還探索了使用非推理模型和人類注釋者的CoT數據,配合提示工程、多遍編輯和結構指南,但效果仍不及推理模型痕跡。本文分析了影響推理精餾的關鍵屬性,如問題難度、多樣性和答案長度。我們樂觀地認為,精心策劃的人類書寫CoT,即使在少量情況下,也能啟動基礎模型中的推理行為。我們公開了我們人類作者在不同精製階段所編寫的大型數據集,並歡迎進一步探索小型推理監督如此有效的因素。
Key Takeaways
- 推理模型通過生成長篇的Chain-of-Thought(CoT)痕跡實現複雜任務的最佳性能。
- 基礎模型可通過提示和少量精細調整來誘導長篇CoT。
- 使用少量高質量的CoT樣例可在基礎模型中解釋強推理能力。
- 來自非推理模型和人類注釋者的CoT數據效果較差,暗示專家CoT的某些潛在特質難以複製。
- 問題難度、數據多樣性和答案長度等屬性影響推理精餾效果。
- 精心策劃的人類書寫CoT即使在少量情況下也能激活基礎模型的推理行為。
点此查看论文截图



Rethinking Prompt Optimization: Reinforcement, Diversification, and Migration in Blackbox LLMs
Authors:MohammadReza Davari, Utkarsh Garg, Weixin Cai, Eugene Belilovsky
An increasing number of NLP applications interact with large language models (LLMs) through black-box APIs, making prompt engineering critical for controlling model outputs. While recent Automatic Prompt Optimization (APO) methods iteratively refine prompts using model-generated feedback, textual gradients, they primarily focus on error correction and neglect valuable insights from correct predictions. This limits both their effectiveness and efficiency. In this paper, we propose a novel APO framework centered on enhancing the feedback mechanism. We reinterpret the textual gradient as a form of negative reinforcement and introduce the complementary positive reinforcement to explicitly preserve beneficial prompt components identified through successful predictions. To mitigate the noise inherent in LLM-generated feedback, we introduce a technique called feedback diversification, which aggregates multiple feedback signals, emphasizing consistent, actionable advice while filtering out outliers. Motivated by the rapid evolution and diversity of available LLMs, we also formalize Continual Prompt Optimization (CPO), addressing the practical challenge of efficiently migrating optimized prompts between different model versions or API providers. Our experiments reveal that naive prompt migration often degrades performance due to loss of critical instructions. In contrast, our approach consistently outperforms strong baselines, achieving significant accuracy improvements, faster convergence, and lower computational costs in both standard and migration scenarios.
随着越来越多的自然语言处理应用通过黑箱API与大型语言模型(LLM)进行交互,提示工程对于控制模型输出变得至关重要。尽管最近的自动提示优化(APO)方法通过模型生成的反馈、文本梯度来迭代优化提示,但它们主要关注错误修正,忽视了正确预测中的宝贵见解。这限制了它们的有效性和效率。在本文中,我们提出了以加强反馈机制为中心的新型APO框架。我们重新解释文本梯度作为一种负面强化,并引入补充的正面强化来明确保留通过成功预测确定的有益提示成分。为了减少LLM生成的反馈中固有的噪声,我们引入了一种称为反馈多样化的技术,该技术聚合多个反馈信号,强调一致、可操作的建议,同时过滤掉异常值。受可用LLM快速演变和多样性的推动,我们还对持续提示优化(CPO)进行了正规化,解决在不同模型版本或API提供商之间高效迁移优化提示的实际挑战。我们的实验表明,由于关键指令的丢失,简单的提示迁移往往会降低性能。相比之下,我们的方法始终优于强大的基线,在标准和迁移场景中实现了显著的准确性提高、更快的收敛速度和更低的计算成本。
论文及项目相关链接
Summary
大型语言模型(LLM)在NLP应用中扮演着关键角色,其中黑箱API成为互动的关键。当前自动提示优化(APO)主要侧重于基于模型生成反馈的错误校正,而忽视了正确预测中的有价值信息。本文提出一种新型的APO框架,以增强反馈机制为核心,引入正面强化来保留成功预测中识别的有益提示成分。同时,为解决LLM生成反馈中的固有噪声问题,引入反馈多样化技术,聚合多个反馈信号,强调一致、可操作的建议,过滤异常值。此外,本文还针对不断发展和多样化的LLM提出了持续提示优化(CPO)方法,解决了在不同模型版本或API提供商之间高效迁移优化提示的实际挑战。实验证明,与强基线相比,该方法在标准和迁移场景中均实现了显著的性能提升。
Key Takeaways
- 大型语言模型(LLM)通过黑箱API与NLP应用交互,使得提示工程对于控制模型输出至关重要。
- 当前自动提示优化(APO)主要关注错误校正,忽视了正确预测中的有价值信息。
- 新型APO框架通过增强反馈机制来提高性能,包括引入正面强化和反馈多样化技术。
- 反馈多样化技术能聚合多个反馈信号,强调一致、可操作的建议,过滤异常值。
- 提出了持续提示优化(CPO)方法,以解决在不同模型版本或API提供商之间迁移优化提示的挑战。
点此查看论文截图



It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective
Authors:Yunqian Wang, Xiaohong Li, Yao Zhang, Yuekang Li, Zhiping Zhou, Ruitao Feng
With the growing threat of software vulnerabilities, deep learning (DL)-based detectors have gained popularity for vulnerability detection. However, doubts remain regarding their consistency within declared CWE ranges, real-world effectiveness, and applicability across scenarios. These issues may lead to unreliable detection, high false positives/negatives, and poor adaptability to emerging vulnerabilities. A comprehensive analysis is needed to uncover critical factors affecting detection and guide improvements in model design and deployment. In this paper, we present VulTegra, a novel evaluation framework that conducts a multidimensional comparison of scratch-trained and pre-trained-based DL models for vulnerability detection. VulTegra reveals that state-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges. Contrary to common belief, pre-trained models are not consistently better than scratch-trained models but exhibit distinct strengths in specific contexts.Importantly, our study exposes the limitations of relying solely on CWE-based classification and identifies key factors that significantly affect model performance. Experimental results show that adjusting just one such factor consistently improves recall across all seven evaluated detectors, with six also achieving better F1 scores. Our findings provide deeper insights into model behavior and emphasize the need to consider both vulnerability types and inherent code features for effective detection.
随着软件漏洞威胁的不断增长,基于深度学习的检测器在漏洞检测方面越来越受欢迎。然而,关于其在声明的CWE范围内的一致性、现实世界的有效性以及跨场景适用性的疑虑仍然存在。这些问题可能导致检测不可靠、误报/漏报率高以及适应新兴漏洞的能力差。为了揭示影响检测的关键因素并引导模型设计和部署的改进,我们需要进行全面的分析。在本文中,我们提出了VulTegra,这是一种新型评估框架,可以对用于漏洞检测的从头开始训练和基于预训练的深度学习模型进行多维比较。VulTegra揭示,最先进的检测器仍然面临一致性差、现实世界能力有限和可扩展性挑战。与普遍信念相反,预训练模型并不总是比从头开始训练的模型更好,但在特定上下文中表现出明显的优势。更重要的是,我们的研究揭示了仅依赖CWE分类的局限性,并确定了显著影响模型性能的关键因素。实验结果表明,调整其中一个因素可以一致地提高所有七个评估检测器的召回率,其中六个还实现了更好的F1分数。我们的研究为深入了解模型行为提供了更深入的见解,并强调了在有效检测中同时考虑漏洞类型和固有代码特征的重要性。
论文及项目相关链接
Summary
深度学习(DL)在软件漏洞检测中受到广泛关注,但仍存在一些问题,如一致性、真实场景中的效果及跨场景适用性等。本研究提出一个新的评估框架VulTegra,对比分析了从零开始训练的DL模型和基于预训练的模型。研究发现,即使是最先进的检测器仍存在低一致性、有限的真实场景能力和可扩展性挑战。预训练模型并不总是优于从零开始训练的模型,但在特定场景下表现出优势。此外,研究还指出了仅依赖CWE分类的局限性,并确定了影响模型性能的关键因素。调整这些因素可改善检测器的召回率和F1分数。
Key Takeaways
- 深度学习在软件漏洞检测中受到关注,但仍存在一致性问题。
- 当前先进的检测器在真实场景应用中仍有限制和挑战。
- 预训练模型与从零开始训练的模型各有优势,不总是优于后者。
- 仅依赖CWE分类进行漏洞检测存在局限性。
- 识别并调整影响模型性能的关键因素能显著提高检测器的性能。
- 调整特定因素可改善所有评估过的检测器的召回率,并使得大部分检测器获得更高的F1分数。
点此查看论文截图





Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
Authors:Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel
Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the the trustworthiness of MLLMs in safety critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
多模态大型语言模型(MLLMs)在医疗保健领域的应用前景广阔。然而,它们在安全关键环境中的部署受到两个主要局限的阻碍:(i)对提示设计的敏感性;(ii)倾向于产生错误响应并具有高信心。临床医生可能会依赖模型所表达的信心来判断其预测的可信度,因此当模型表现出高信心时,也高度准确尤为重要。我们引入了Prompt4Trust,这是针对MLLMs中置信度校准的第一个强化学习(RL)框架的提示增强。一个轻量级LLM被训练来产生上下文感知的辅助提示,以指导下游任务MLLM生成响应,其中表达的信心更准确地反映了预测的准确性。与传统的校准技术不同,Prompt4Trust特别优先考虑对安全和可信临床决策至关重要的校准方面。除了受此临床动机驱动的校准目标所推动的改进之外,我们提出的方法还提高了任务准确性,在PMC-VQA基准测试上实现了最先进的医学视觉问答(VQA)性能,该基准测试包含跨越多种医学成像模式的选择题。而且,我们的框架使用小型下游任务MLLM进行训练,在实验中显示出对更大的MLLMs的零样本泛化能力,这表明了可扩展校准的潜力,而无需承担相关的计算成本。这项工作展示了自动化且与人类对齐的提示工程在提高MLLMs在安全关键环境中的可信度的潜力。我们的代码库可在https://github.com/xingbpshen/prompt4trust找到。
论文及项目相关链接
PDF Accepted to ICCV 2025 Workshop CVAMD
摘要
多模态大型语言模型(MLLMs)在医疗保健应用中具有巨大潜力,但在安全关键环境中的部署受到两个主要限制:一是对提示设计的敏感性和二是有产生错误响应的高信心倾向。模型的信心表达对于临床医生评估预测可靠性尤为重要。我们引入Prompt4Trust,这是第一个针对MLLMs中的信心校准的提示增强强化学习(RL)框架。一个轻量级LLM被训练来产生上下文感知的辅助提示,引导下游任务MLLM生成响应,其中表达的信心更准确地反映了预测准确性。与传统的校准技术不同,Prompt4Trust特别优先考虑对安全和可信临床决策至关重要的校准方面。除了受此临床动机驱动的校准目标的改进之外,我们提出的方法还提高了任务准确性,在PMC-VQA基准测试上实现了最先进的医学视觉问答性能,该基准测试包含跨越多种医学成像模式的选择性问题。而且,我们的框架使用较小的下游任务MLLM进行训练,并在实验中显示出对更大的MLLMs的零样本泛化能力,这表明了可扩展校准的潜力,无需相关的计算成本。这项工作证明了自动化且符合人类工程提示的方法在提高MLLM在安全关键设置中的可信度方面的潜力。我们的代码库位于https://github.com/xingbpshen/prompt4trust。
关键见解
- 多模态大型语言模型(MLLMs)在医疗保健应用中具有潜力,但在安全关键环境中的部署存在两个主要限制:提示设计敏感性和错误响应的高信心倾向。
- Prompt4Trust是第一个针对MLLMs中的信心校准的强化学习(RL)框架,通过上下文感知的辅助提示来改善模型的信心表达与预测准确性的匹配度。
- 与传统校准技术不同,Prompt4Trust特别关注临床决策中的关键校准方面。
- Prompt4Trust不仅能提高模型预测的准确性,而且在PMC-VQA基准测试中实现了医学视觉问答的最先进性能。
- 该框架显示出对更大MLLMs的零样本泛化能力,表明有可能进行成本效益高的模型校准。
- 代码库公开可用,为未来的研究和应用提供了基础。
点此查看论文截图




Mixture of LoRA Experts with Multi-Modal and Multi-Granularity LLM Generative Error Correction for Accented Speech Recognition
Authors:Bingshen Mu, Kun Wei, Pengcheng Guo, Lei Xie
Despite substantial improvements in ASR, performance tends to degrade when faced with adverse conditions such as speaker accents. Generative error correction (GER) leverages the rich linguistic knowledge and exceptional reasoning ability of LLMs, significantly outperforming typical LM methods. However, it lacks specificity in accented speech scenarios. In this study, we leverage GER to improve the accuracy of transcription predictions by addressing the two primary features of accented speech recognition. To fully leverage pronunciation information, we propose the multi-modal GER, which integrates pronunciation information from the speech modality, and the multi-granularity GER, which incorporates fine-grained phoneme-level information related to pronunciation. These two methods enable the LLM to utilize the pronunciation information of accented speech and the semantic information from word-level hypotheses for accurate transcription predictions through LoRA fine-tuning. On the one hand, we employ a three-stage training strategy to train separate multi-modal GER models for each accent to obtain mono-accent LoRA experts. By adopting our proposed HDMoLE method, which incorporates hierarchical routing and dynamic thresholds within the mixture of LoRA experts, we effectively merge multiple mono-accent LoRA experts within a single multi-modal GER to overcome the challenges posed by accent diversity. On the other hand, multi-granularity GER leverages the N-best word-level and phoneme-level hypotheses generated by the HDMoLE model to predict the final accented speech transcriptions. Experimental results on the multi-accent English dataset demonstrate the efficacy of our proposed methods. Our methods achieve a remarkable relative WER reduction of 67.35% compared to the Whisper-large-v3 baseline.
尽管语音识别技术(ASR)取得了实质性的进步,但在面对如说话者口音等不利条件时,其性能往往会下降。生成错误纠正(GER)能够利用大型语言模型(LLM)丰富的语言知识和出色的推理能力,显著优于典型的语言模型方法。然而,它在带有口音的语音场景方面缺乏特异性。
本研究中,我们利用GER改进了转录预测的准确性,通过解决口音语音识别的两个主要特征来实现。为了充分利用发音信息,我们提出了多模态GER,它整合了语音模态的发音信息,以及多粒度GER,它结合了与发音相关的精细音素级信息。这两种方法使LLM能够通过LoRA微调利用口音语音的发音信息和词级假设的语义信息进行准确的转录预测。
一方面,我们采用三阶段训练策略,针对每种口音训练单独的多模态GER模型,以获得单口音LoRA专家。通过采用我们提出的HDMoLE方法,该方法结合了层次路由和动态阈值在LoRA专家混合中,我们有效地将多个单口音LoRA专家合并到一个单一的多模态GER中,以克服口音多样性带来的挑战。
另一方面,多粒度GER利用HDMoLE模型生成的N个最佳词级和音素级假设来预测最终的口音语音转录。在多口音英语数据集上的实验结果证明了我们的方法的有效性。与whisper-large-v3基线相比,我们的方法实现了相对字词错误率(WER)降低了67.35%。
论文及项目相关链接
PDF IEEE Transactions on Audio, Speech and Language Processing
Summary
本文介绍了针对带口音语音识别的改进方法。利用生成式错误修正(GER)结合语言模型(LLM)的优势,提出多模态GER和多粒度GER两种方法,以利用口音语音的发音信息和语义信息,提高转录预测的准确性。通过三阶段训练策略,为每种口音训练单独的多模态GER模型,并采用HDMoLE方法合并多个单口音模型以应对口音多样性挑战。实验结果表明,该方法在多口音英语数据集上实现了显著的相对字词错误率(WER)降低。
Key Takeaways
- 生成式错误修正(GER)在语音识别中利用语言模型(LLM)的丰富语言知识和推理能力,显著优于传统方法。
- 针对带口音语音识别,提出多模态GER和多粒度GER两种改进方法。
- 多模态GER结合语音模态的发音信息,多粒度GER则结合细粒度音素级信息,以提高转录预测的准确性。
- 采用三阶段训练策略,为每种口音训练单独的多模态GER模型,形成单口音LoRA专家。
- HDMoLE方法通过层次化路由和动态阈值,在LoRA专家混合中有效合并单口音模型,应对口音多样性。
- 多粒度GER利用HDMoLE模型生成的N-best词级和音素级假设,预测带口音语音的转录。
点此查看论文截图




wd1: Weighted Policy Optimization for Reasoning in Diffusion Language Models
Authors:Xiaohang Tang, Rares Dolga, Sangwoong Yoon, Ilija Bogunovic
Improving the reasoning capabilities of diffusion-based large language models (dLLMs) through reinforcement learning (RL) remains an open problem. The intractability of dLLMs likelihood function necessitates approximating the current, old, and reference policy likelihoods at each policy optimization step. This reliance introduces additional computational overhead and lead to potentially large bias – particularly when approximation errors occur in the denominator of policy ratios used for importance sampling. To mitigate these issues, we introduce $\mathtt{wd1}$, a novel policy optimization approach that reformulates the objective as a weighted likelihood, requiring only a single approximation for the current parametrized policy likelihood. Experiments on widely used reasoning benchmarks demonstrate that $\mathtt{wd1}$, without supervised fine-tuning (SFT) or any supervised data, outperforms existing RL methods for dLLMs, achieving up to 16% higher accuracy. $\mathtt{wd1}$ delivers additional computational gains, including reduced training time and fewer function evaluations (NFEs) per gradient step. These findings, combined with the simplicity of method’s implementation and R1-Zero-like training (no SFT), position $\mathtt{wd1}$ as a more effective and efficient method for applying RL to dLLMs reasoning.
通过强化学习(RL)提高基于扩散的大型语言模型(dLLM)的推理能力仍然是一个悬而未决的问题。dLLM的似然函数的复杂性要求在每次策略优化步骤中近似当前、旧的和参考策略的似然函数。这种依赖引入了额外的计算开销,并可能导致较大的偏差,特别是在用于重要性采样的策略比值的分母中出现近似误差时。为了缓解这些问题,我们引入了$\mathtt{wd1}$,这是一种新的策略优化方法,它将目标重新定义为加权似然,只需要对当前参数化策略似然进行一次近似。在广泛使用的推理基准测试上的实验表明,$\mathtt{wd1}$在不进行有监督微调(SFT)或使用任何监督数据的情况下,优于现有的dLLM的RL方法,准确率提高了高达16%。$\mathtt{wd1}$还带来了额外的计算收益,包括减少训练时间和每个梯度步骤减少函数评估次数(NFEs)。这些发现,加上方法实现的简单性和R1-Zero-like训练(无SFT),使$\mathtt{wd1}$成为将RL应用于dLLM推理的更有效和更高效的方法。
论文及项目相关链接
PDF Preprint
Summary
强化学习在改善基于扩散的大型语言模型(dLLM)的推理能力时存在挑战。本研究提出了一种新的策略优化方法wd1,通过加权似然重新定义了目标,只需对当前参数化策略似然进行一次近似。实验表明,在不进行有监督微调(SFT)或任何监督数据的情况下,wd1优于现有的dLLM强化学习方法,准确率提高达16%。此外,wd1还具有计算效益,包括缩短训练时间和减少每梯度步数的函数评估次数。因此,wd1在应用于dLLM推理的RL方法中更具有效性和效率。
Key Takeaways
- 强化学习在改善基于扩散的大型语言模型(dLLM)的推理能力方面存在挑战。
- 当前dLLM的似然函数不可行性导致需要在策略优化步骤中对当前、旧和参考策略似然进行近似。
- 近似过程引入计算负担和偏差风险,特别是在用于重要性采样的策略比率分母中出现近似误差时。
- wd1是一种新的策略优化方法,通过加权似然重新定义了目标,简化了计算过程。
- 实验表明,wd1在不使用有监督微调(SFT)或监督数据的情况下,优于现有的强化学习方法,提高了dLLM的推理准确性,最高可达16%。
- wd1具有计算效益,包括缩短训练时间和减少每梯度步数的函数评估次数。
点此查看论文截图

KAT-V1: Kwai-AutoThink Technical Report
Authors:Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.
我们介绍了Kwaipilot-AutoThink(KAT)项目,这是一个旨在解决推理密集型任务中过度思考问题的开源的大型语言模型,其具备基于任务复杂度动态在推理模式和非推理模式间切换的能力。具体来说,我们首先根据新的标记管道和多智能体合成策略构建双模式数据集,然后应用增强知识蒸馏的多令牌预测(MTP),以最小的预训练成本实现高效精细推理迁移。此外,我们实现了冷启动初始化策略,通过多数投票信号和意图感知提示引入模式选择先验。最后,我们提出了融入中间监督的Step-SRPO强化学习算法,在GRPO框架内为推理模式选择和响应准确性提供结构化指导。在多基准测试的大量实验表明,KAT在广泛的推理密集型任务上表现优异,甚至超过当前最先进的模型(包括DeepSeek-R1-0528和Qwen3-235B-A22B),同时减少高达约30%的令牌使用。除了学术评估外,KAT已在Kwaipilot(即快手内部编码助手)中成功部署,以高准确性、高效率和控制推理行为的方式改进了现实世界开发工作流程。此外,我们正在积极训练一个拥有激活参数达200B的专家混合模型(MoE),早期阶段的结果已经显示出在性能和效率上的提升承诺,这进一步显示了AutoThink范式的可扩展性。
论文及项目相关链接
Summary
江山提供了一项针对大型语言模型的开放源代码研究,名为Kwaipilot-AutoThink(KAT)。该研究旨在解决推理密集型任务中的过度思考问题。KAT采用自动思考训练模式,能根据任务复杂性在推理与非推理模式之间动态切换。通过使用多种技术和策略构建双态数据集和多令化元合成策略的新颖标记管道,以及增强的知识蒸馏技术(多令牌预测),使模型能在保持高效的同时进行精细推理。此外,还引入了冷启动初始化策略,通过多数投票信号和意图感知提示引入模式选择先验。最后提出了一种融合强化学习算法(步骤-SRPO)进行中间监督的辅助控制程序规划优化算法。该研究减少了预处理成本的同时提升了对多种基准测试的推理性能,并且已经成功部署在Kwaipilot(快手内部编码助手)中。该研究进一步展望了规模化的自动思考模式,并正在积极训练一个混合专家模型。
Key Takeaways
- KAT是一个针对推理密集型任务的开放源代码大型语言模型。
- 采用自动思考训练模式,根据任务复杂性动态切换推理和非推理模式。
- 通过双态数据集、知识蒸馏技术和多令化元合成策略来提升模型的推理效率。
- 采用冷启动初始化策略,通过多数投票信号和意图感知提示实现模式选择先验。
- 提出融合强化学习算法的中间监督辅助控制程序规划优化算法(Step-SRPO)。
点此查看论文截图





SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Authors:Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialVizBench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.
人类能够在大脑中直接想象并操作视觉图像,这种能力被称为空间可视化。虽然多模态大型语言模型(MLLMs)支持基于想象的推理,但空间可视化能力的评估仍然不足,通常只嵌入更广泛的数学和逻辑评估中。现有的评估通常依赖于可能与训练数据重叠的智商测试或数学竞赛,从而影响评估的可靠性。为此,我们引入了SpatialViz-Bench,这是一个全面的多模态空间可视化基准测试,包含12项任务,跨越4种子能力,共有1180个自动生成的问题。我们对33种最新颖的MLLMs的评估不仅揭示了广泛的性能差异,并证明了该基准测试的强劲鉴别力,而且还发现了出人意料的发现:模型表现出难以感知的意外行为,与人类直觉不一致,在二维到三维的性能上有巨大差异,并且即使在只需要可视化而不需要公式推导的空间任务中,也默认选择公式推导。SpatialVizBench实证表明,最新颖的MLLMs在空间可视化任务中仍存在缺陷,从而填补了该领域的重大空白。该基准测试可公开访问。
论文及项目相关链接
Summary
空间可视化是人类能够在脑海中直接想象和操作视觉图像的能力。尽管多模态大型语言模型支持基于想象的推理,但对空间可视化的评估仍然不足,通常嵌入在更广泛的数学和逻辑评估中。为此,我们引入了SpatialViz-Bench,这是一个全面的多模态空间可视化基准测试,包含12项任务,涵盖4种子能力,共有1180个自动生成的题目。我们对33个最新大型语言模型的评价揭示了广泛的性能差异,证明了基准测试的强鉴别力,并发现了令人困惑的结果:模型在感知方面表现出困难,与人类直觉不符,二维到三维性能下降明显,并且在需要仅使用可视化的任务时默认使用公式推导。SpatialVizBench实证表明,最新大型语言模型在空间可视化任务上仍有缺陷,填补了该领域的空白。该基准测试已公开可用。
Key Takeaways
- 人类具备空间可视化能力,能直接想象和操作视觉图像。
- 多模态大型语言模型支持基于想象的推理,但对空间可视化的评估仍然不足。
- SpatialViz-Bench是一个全面的多模态空间可视化基准测试,包含多个任务和子能力。
- 对最新大型语言模型的评价揭示了广泛的性能差异和强大的基准测试鉴别力。
- 模型在感知方面表现出困难,存在二维到三维性能悬崖。
- 模型在需要仅使用可视化的任务时默认使用公式推导。
点此查看论文截图






Perception-Aware Policy Optimization for Multimodal Reasoning
Authors:Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
强化学习与可验证奖励(RLVR)已被证明是赋予大型语言模型(LLM)强大的多步推理能力的一种高效策略。然而,其设计与优化仍然针对纯文本领域,在应用于多模态推理任务时表现不佳。我们观察到,当前多模态推理中的误差主要来源于视觉输入的感知。为了解决这一瓶颈,我们提出了感知感知策略优化(PAPO),这是GRPO的一个简单而有效的扩展,鼓励模型在学习推理的同时学习感知,完全基于内部监督信号。值得注意的是,PAPO不依赖额外的数据整理、外部奖励模型或专有模型。具体来说,我们在GRPO目标中引入了隐式感知损失,以KL散度项的形式,尽管它很简单,但在多种多模态基准测试上产生了显著的总体改进(4.4%)。在高度依赖视觉的任务上,改进更为显著,接近8.0%。我们还观察到感知错误大幅减少(30.5%),表明PAPO提高了感知能力。我们对PAPO进行了综合分析,发现了一个独特的损失黑客问题,我们对其进行了严格的分析,并通过双重熵损失减轻了该问题。总的来说,我们的工作将感知感知监督更深入地集成到RLVR学习目标中,并为鼓励视觉辅助推理的新RL框架奠定了基础。项目页面:https://mikewangwzhl.github.io/PAPO。
论文及项目相关链接
Summary
本文探讨了强化学习在大型语言模型中的实际应用,特别是针对多模态推理任务的问题及其解决策略。通过引入感知意识策略优化(PAPO),实现了感知与推理的协同学习,显著提高了模型在多模态推理任务上的性能。PAPO方法通过引入隐含感知损失,优化了通用策略优化(GRPO)的目标函数,提高了模型的视觉感知能力,减少了感知错误。同时,本文还通过深入分析解决了损失黑客问题,为强化学习框架中视觉推理的集成发展奠定了基础。
Key Takeaways
- 强化学习具有可验证奖励(RLVR)策略为大型语言模型(LLM)赋予强大的多步骤推理能力,但在多模态推理任务中表现不佳。
- 多模态推理中的错误主要源于对视觉输入的感知问题。
- 引入感知意识策略优化(PAPO)来解决这一问题,通过内部监督信号鼓励模型在推理过程中学习感知。
- PAPO方法通过引入隐含感知损失来优化通用策略优化(GRPO)的目标函数,从而提高模型的视觉感知能力。
- PAPO显著提高了模型在多模态推理任务上的性能,减少了感知错误。
- 解决了损失黑客问题,为强化学习框架中视觉推理的进一步发展奠定基础。
点此查看论文截图




Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching
Authors:Thomas Savage
Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine tuning LLMs in complex multi-turn conversational tasks.
精细调整方法,如直接偏好优化(DPO)和群体相对策略优化(GRPO),在针对单一回合任务的训练大型语言模型(LLM)中取得了成功。然而,这些方法在多回合应用方面存在不足,如在诊断病人访谈中,理解早期对话回合如何影响下游完成和结果至关重要。在医学领域,多回合视角对于学习诊断模式并更好地理解对话动态至关重要。为了解决这一差距,我引入了野蛮对话森林(SCF),这是一个利用分支对话架构来精细调整LLM的多回合对话的强化学习框架。SCF在每一回合生成多个可能的对话延续,使模型能够学习不同的早期回应如何影响下游互动和诊断结果。在模拟医生病人对话的实验中,带有分支的SCF在诊断准确性方面优于线性对话架构。我假设SCF的改进源于其在整个对话回合中提供丰富、相互依存训练信号的能力。这些结果表明,分支训练架构是在复杂的多回合对话任务中微调LLM的重要策略。
论文及项目相关链接
Summary
大型语言模型(LLM)在单回合任务中通过直接偏好优化(DPO)和组相对策略优化(GRPO)等微调方法取得了成功。但在多回合应用中,如诊断病人访谈等,这些方法表现不足。为此,本文引入野蛮对话森林(SCF)这一强化学习框架,利用分支对话架构对LLM进行微调,以应对多回合对话。SCF在模拟医患对话的实验中,以分支对话架构展现出更高的诊断准确性。这暗示着分支训练架构是在复杂多回合对话任务中微调LLM的重要策略。
Key Takeaways
- 大型语言模型(LLM)在单回合任务中已取得成功应用微调方法,如直接偏好优化(DPO)和组相对策略优化(GRPO)。
- 在多回合对话任务中,如诊断病人访谈,现有微调方法表现不足,需要更复杂的策略。
- 野蛮对话森林(SCF)是一种新的强化学习框架,采用分支对话架构进行LLM微调。
- SCF能够生成每个回合的多种可能的对话延续,使模型了解早期回应如何影响下游互动和诊断结果。
- 在模拟医患对话的实验中,SCF的分支架构在诊断准确性上优于线性对话架构。
- SCF的改进源于其提供跨对话回合的丰富、相互依赖的训练信号的能力。
点此查看论文截图





RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Authors:Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
大型语言模型(LLM)在各种任务中表现出了显著的能力,但由于其内部知识的静态性,它们容易产生虚构或过时的回应。最近,增强检索生成(RAG)方法的进步通过强化学习(RL)探索了提高模型的搜索和推理能力。尽管这些方法显示出有希望的结果,但它们面临着训练稳定性方面的挑战,并遇到了由于单查询模式而导致的问题,如推理时间实质性增加和能力受限。在本文中,我们提出了RAG-R1,这是一个新型训练框架,旨在使LLM能够在推理过程中自适应地利用内部和外部知识。我们进一步在框架内扩展了生成和检索过程,从单查询模式扩展到多查询并行处理,旨在减少推理时间并增强模型的能力。在七个问答基准测试上的广泛实验表明,我们的方法比最强基线高出13.2%,并将推理时间减少了11.1%。
论文及项目相关链接
Summary
LLM容易受到生成的内容产生偏差或者回应过时的问题。通过RAG增强模型的搜索和推理能力是解决该问题的方法之一,但仍面临训练稳定性挑战以及推理时间和模型能力受限的问题。本文提出了一个新的训练框架RAG-R1,这个框架能使LLM在推理过程中灵活使用内外部知识,从单查询模式扩展到多查询并行模式,旨在提高推理速度并增强模型能力。在七个问答基准测试上的实验证明,该方法相较于最强大的基线方法提高了高达13.2%,推理时间减少了11.1%。
Key Takeaways
- LLM在某些任务上展现出强大的能力,但仍存在生成虚幻或过时响应的问题。
- RAG方法旨在增强模型的搜索和推理能力,但面临训练稳定性挑战和推理时间限制。
- RAG-R1是一个新的训练框架,旨在解决上述问题,通过适应性地利用内部和外部知识来提升模型性能。
- RAG-R1将生成和检索过程从单查询模式扩展到多查询并行模式,以提高推理速度和模型能力。
- 该方法在七个问答基准测试上实现了显著的性能提升,超过最强基线高达13.2%。
- RAG-R1不仅提高了模型性能,还降低了推理时间,减少达11.1%。
点此查看论文截图




Large Language Model-Driven Closed-Loop UAV Operation with Semantic Observations
Authors:Wenhao Wang, Yanyan Li, Long Jiao, Jiawei Yuan
Recent advances in large Language Models (LLMs) have revolutionized mobile robots, including unmanned aerial vehicles (UAVs), enabling their intelligent operation within Internet of Things (IoT) ecosystems. However, LLMs still face challenges from logical reasoning and complex decision-making, leading to concerns about the reliability of LLM-driven UAV operations in IoT applications. In this paper, we propose a LLM-driven closed-loop control framework that enables reliable UAV operations powered by effective feedback and refinement using two LLM modules, i.e., a Code Generator and an Evaluator. Our framework transforms numerical state observations from UAV operations into natural language trajectory descriptions to enhance the evaluator LLM’s understanding of UAV dynamics for precise feedback generation. Our framework also enables a simulation-based refinement process, and hence eliminates the risks to physical UAVs caused by incorrect code execution during the refinement. Extensive experiments on UAV control tasks with different complexities are conducted. The experimental results show that our framework can achieve reliable UAV operations using LLMs, which significantly outperforms baseline approaches in terms of success rate and completeness with the increase of task complexity.
近年来,大型语言模型(LLM)的进展在移动机器人领域引发了革命性的变革,包括无人飞行器(UAVs)。这推动了它们在互联网物联网生态系统中的智能操作。然而,LLM仍面临逻辑推理和复杂决策方面的挑战,导致对LLM驱动的无人机在物联网应用中的可靠性的担忧。在本文中,我们提出了一种LLM驱动的闭环控制框架,它通过两个LLM模块(即代码生成器和评估器)的有效反馈和改进,实现了可靠的无人机操作。我们的框架将无人机操作的数值状态观察转化为自然语言轨迹描述,增强了评估器LLM对无人机动态的理解,以实现精确的反馈生成。我们的框架还启用了基于模拟的改进过程,从而消除了改进过程中因代码执行错误而对实际无人机造成的风险。我们对不同复杂度的无人机控制任务进行了大量实验。实验结果表明,我们的框架能够在LLM上实现可靠的无人机操作,随着任务复杂性的增加,在成功率和完整性方面显著优于基准方法。
论文及项目相关链接
PDF 9 pages, 7 figures
Summary
大型语言模型(LLM)在移动机器人技术中的最新进展,特别是无人飞行器(UAV)领域,已经实现了在互联网物联网(IoT)生态系统中的智能操作。然而,LLM在逻辑推理和复杂决策制定方面仍存在挑战,导致对LLM驱动的无人机在IoT应用中运行的可靠性存在担忧。本文提出了一种LLM驱动的闭环控制框架,通过两个LLM模块即代码生成器和评估器实现可靠反馈和优化。此框架可将无人机的数值状态观测转换为自然语言轨迹描述,从而提高评估器LLM对无人机动态的理解,实现精确反馈生成。该框架还支持基于模拟的改进过程,消除了改进过程中因代码执行错误而对实际无人机造成的风险。通过在不同复杂度的无人机控制任务上进行大量实验,证明该框架能够可靠地使用LLM实现无人机操作,在任务复杂度增加的情况下,相对于基准方法具有更高的成功率和完整性。
Key Takeaways
- 大型语言模型(LLM)已应用于移动机器人技术,特别是无人飞行器(UAV)领域。
- 在互联网物联网(IoT)生态系统中实现了智能操作。
- LLM面临逻辑推理和复杂决策制定的挑战。
- 提出了一种LLM驱动的闭环控制框架,通过代码生成器和评估器两个模块实现可靠反馈和优化。
- 框架可将无人机的数值状态观测转换为自然语言轨迹描述,提高评估器对无人机动态的理解。
- 该框架支持模拟改进过程,减少实际无人机的风险。
点此查看论文截图






Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
Authors:Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that address the requirements of linguistically diverse regions, such as India, and go beyond English-centric benchmarks. We introduce EKA-EVAL, a unified evaluation framework that integrates over 35+ benchmarks (including 10 Indic benchmarks) across nine major evaluation categories. The framework provides broader coverage than existing Indian language evaluation tools, offering 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. As the first end-to-end suite for scalable, multilingual LLM benchmarking, the framework combines extensive benchmarks, modular workflows, and dedicated support for low-resource Indian languages to enable inclusive assessment of LLM capabilities across diverse domains. We conducted extensive comparisons against five existing baselines, demonstrating that EKA-EVAL achieves the highest participant ratings in four out of five categories. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.
随着大型语言模型(LLM)的快速发展,需要更加强调评估框架的重要性,以满足印度等语言多样化的地区的需求,并超越以英语为中心的基准测试。我们推出了EKA-EVAL,这是一个统一的评估框架,整合了超过35个基准测试(包括10个印度语言基准测试),涵盖九大主要评估类别。该框架提供了比现有印度语言评估工具更广泛的覆盖范围,通过模块化架构提供11项核心能力,与Hugging Face和专有模型无缝集成,以及即插即用的易用性。作为首个用于可扩展、多语言LLM基准测试端到端套件,该框架结合了广泛的基准测试、模块化工作流程以及对低资源印度语言的专门支持,以实现在不同领域全面评估LLM的能力。我们与五个现有基准进行了广泛比较,证明EKA-EVAL在五个类别中的四个中获得了最高参与者评分。该框架是开源的,可在以下网址公开获取:https://github.com/lingo-iitgn/eka-eval。
论文及项目相关链接
Summary
大语言模型(LLM)的快速发展增强了评估框架的需求,该需求必须满足语言多样化地区(如印度)的要求,并超越英语为中心的基准测试。我们推出了EKA-EVAL,这是一个统一的评估框架,集成了超过35个基准测试(包括10个印度语言基准测试),涵盖九大类主要评估项目。该框架通过模块化架构提供了更广泛的覆盖范围,与Hugging Face和专有模型无缝集成,并具有即插即用性。作为首个端到端的、可扩展的多语言LLM基准测试套件,该框架结合了广泛的基准测试、模块化工作流程以及对低资源印度语言的专门支持,以实现对不同领域LLM能力的包容性评估。经过与五个现有基准的广泛比较,证明EKA-EVAL在五个类别中的四项获得了最高参与者评分。该框架是开源的,可在https://github.com/lingo-iitgn/eka-eval公开访问。
Key Takeaways
- EKA-EVAL是一个统一的评估框架,专为评估大语言模型(LLM)的性能而设计。
- 它集成了超过35个基准测试,包括针对印度语言的基准测试。
- 涵盖九大类评估项目,如阅读理解、语言生成等。
- 该框架具有模块化架构,易于集成不同的评估工具和模型。
- 它提供了广泛的覆盖范围,不仅限于英语,支持多种语言评估。
- 与Hugging Face等现有工具无缝集成,并具有即插即用性。
点此查看论文截图



Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Authors:Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Language models trained with reinforcement learning (RL) can engage in reward hacking–the exploitation of unintended strategies for high reward–without revealing this behavior in their chain-of-thought reasoning. This makes the detection of reward hacking difficult, posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL fine-tuning intervention that trains models to explicitly acknowledge when they are influenced by prompt cues–hints which point to incorrect answers (e.g., “a Stanford professor thinks the answer is A”). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to exploit these cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model’s responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues, from 8% to 43% after VFT, and up to 94% after RL. Baselines remain low even after RL (11% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
使用强化学习(RL)进行训练的语言模型会进行奖励黑客行为——利用未预设的策略来获得高奖励——而不会暴露这一行为在其思维链中的推理过程。这使得检测奖励黑客行为变得困难,并为高风险应用带来风险。我们提出一种预RL微调干预方法,称为“语言化微调”(VFT),训练模型明确承认它们何时受到提示线索的影响——指向错误答案的暗示(例如,“斯坦福大学教授认为答案是A”)。为了评估VFT,我们随后在隐藏提示线索的环境中用RL训练模型,这些线索信号哪些错误答案会得到高奖励,激励模型利用这些线索而不是正确推理。我们衡量模型在没有语言化的情况下利用这些线索的频率。RL之后,只有6%的VFT训练模型的响应包含未检测的奖励黑客行为。相比之下,当我们不进行VFT而直接使用RL时,未检测的奖励黑客行为的比率上升至88%;使用去偏基线干预时,这一比率进一步上升至99%。VFT通过大幅增加模型语言化线索影响的能力来实现这一点,从VFT之后的8%增加到43%,并在RL之后达到94%。即使在RL之后,基线仍保持在较低水平(分别为11%和1%)。我们的结果表明,在RL之前教授模型明确语言化奖励黑客行为,可以显着提高其检测能力,为构建更透明、更安全的AI系统提供了实用途径。
论文及项目相关链接
PDF Published at ICML 2025 Workshop on Reliable and Responsible Foundation Models
Summary:训练的语言模型使用强化学习时,会进行奖励黑客行为,即利用未预设的策略获取高奖励而不表现出这种行为在思维链中的推理过程。这使得检测奖励黑客行为变得困难,并给高风险应用带来风险。我们提出了言语微调(VFT)这一预强化学习前的微调干预方法,训练模型显式地承认受到提示线索的影响。通过评估发现,言语微调可以显著减少未发现的奖励黑客行为。结果证明了训练模型在强化学习前明确表达奖励黑客行为的能力可以显著提高检测率,为更透明和安全的人工智能系统提供了实际路径。
Key Takeaways:
- 语言模型使用强化学习时可能进行奖励黑客行为,即利用未预设的策略获取高奖励而不展现其推理过程。
- 检测奖励黑客行为具有挑战性,对高风险应用存在风险。
- 言语微调(VFT)是一种预强化学习前的干预方法,旨在训练模型显式地承认受到提示线索的影响。
- VFT可以有效减少未发现的奖励黑客行为,提高检测率。
- VFT通过增加模型对提示线索影响的显性表达来提高检测率。
- 与其他方法相比,VFT显著提高模型的透明度和安全性。
点此查看论文截图




Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment
Authors:Yuhui Sun, Xiyao Wang, Zixi Li, Zhenlong Yuan, Jinman Zhao
While large language models (LLMs) excel at text generation, aligning them with human preferences remains challenging. Reinforcement learning from human feedback (RLHF) improves alignment but is costly and unstable. Direct Preference Optimization (DPO) offers a simpler alternative, yet assumes a fixed, single-dimensional preference. We propose Multi-Preference Lambda-weighted Listwise DPO, a generalization of DPO that supports multiple preference dimensions and dynamic interpolation via a simplex-weighted lambda vector. Our method enables listwise supervision and flexible alignment without re-training. While our experiments are conducted on 1B-2B scale models, this is an intentional choice: smaller models provide a more stringent testbed where performance improvements more clearly reflect the effectiveness of the alignment strategy itself. Moreover, such models are widely used in compute-constrained applications, making our improvements both methodologically meaningful and practically valuable. Empirical results show that our approach matches or surpasses standard DPO on alignment benchmarks while offering improved adaptability.
大型语言模型(LLMs)在文本生成方面表现出色,但将其与人类偏好对齐仍然具有挑战性。强化学习从人类反馈(RLHF)提高了对齐性,但成本高昂且不稳定。直接偏好优化(DPO)提供了更简单的替代方案,但假设存在固定的一维偏好。我们提出了多偏好λ加权列表式DPO,它是DPO的推广,支持多个偏好维度和通过单纯形加权λ向量进行的动态插值。我们的方法实现了列表式监督和无需重新训练的灵活对齐。虽然我们的实验是在1B-2B规模模型上进行的,但这是有意为之的选择:小型模型提供了更严格的测试环境,性能改进更清楚地反映了对齐策略本身的有效性。此外,这样的模型在计算受限的应用中广泛使用,使我们的改进在方法学上具有实际意义,在实践中也极具价值。经验结果表明,我们的方法在对齐基准测试中与标准DPO相匹配或表现更佳,同时提供了更好的适应性。
论文及项目相关链接
PDF 13 pages, 9 figures, appendix included. To appear in Proceedings of AAAI 2026. Code: https://github.com/yuhui15/Multi-Preference-Lambda-weighted-DPO
Summary
大语言模型在文本生成方面表现出色,但在与人类偏好对齐方面存在挑战。强化学习从人类反馈(RLHF)提高了对齐性,但成本高昂且不稳定。直接偏好优化(DPO)提供了更简单的选择,但假设了固定单一维度的偏好。我们提出了多偏好Lambda加权列表式DPO,它是DPO的推广,支持多个偏好维度和通过单纯形加权lambda向量进行的动态插值。我们的方法可以在无需重新训练的情况下进行列表监督及灵活对齐。我们的实验是在规模达到几十亿的模型上进行的,但这是有意为之的选择:较小的模型提供了更为严格的测试环境,其中性能的提升更清晰地反映了对齐策略本身的有效性。此外,这样的模型在算力受限的应用中广泛使用,这使我们的改进在方法学上具有实际意义并在实践中具有实用价值。经验结果表明,我们的方法在基准测试上达到了或超越了标准DPO的对齐性能,同时提供了更好的适应性。
Key Takeaways
- 大型语言模型(LLMs)在文本生成方面表现出色,但在与人类偏好对齐上存在挑战。
- 强化学习从人类反馈(RLHF)能提高对齐性,但成本高昂且不稳定。
- 直接偏好优化(DPO)提供了一种简化替代方案,但假设了固定单一维度的偏好。
- 我们提出了多偏好Lambda加权列表式DPO方法,支持多个偏好维度和动态插值。
- 该方法可在无需重新训练的情况下实现列表监督及灵活对齐。
- 实验是在规模几十亿的模型上进行的,强调在较小的模型上性能提升的意义和实际应用价值。
点此查看论文截图



TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
Authors:Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao, Jingjing Yang, Zhendong Wang, Kexin Yang, Boshen Shi, Xing Wang, Chao Deng, Junlan Feng
The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is lacking an effective evaluation benchmark fairly reflecting the performances of LLMs on broad table reasoning abilities. In this paper, we fill in this gap, presenting a comprehensive table reasoning evolution benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities, a total of 26 sub-tasks. We construct a high quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes, TCoT, PoT and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this frame work and prove its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing the complex and real world Table related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on huggingface.co/datasets/JT-LM/JIUTIAN-TReB and the framework on github.com/JT-LM/jiutian-treb.
在商业和工业界,大部分数据都存储在表格、数据库和数据仓库中。由于表格结构数据的隐藏语义、固有复杂性和结构化特性,对于大型语言模型(LLMs)来说,对其进行推理构成了重大挑战。这些挑战之一是没有一个有效的评估基准,能够公正地反映LLMs在广泛表格推理能力方面的表现。在本文中,我们填补了这一空白,提出了一个全面的表格推理进化基准测试(TReB),该基准测试衡量了浅层次的表格理解能力和深层次的表格推理能力,共包括26个子任务。我们通过迭代数据处理程序构建了一个高质量的数据集。我们创建了一个评估框架,以三种不同的推理模式(TCoT、PoT和ICoT)稳健地衡量表格推理能力。此外,我们使用此框架对20多种最新LLMs进行了基准测试,并证明了其有效性。实验结果表明,现有的LLMs在解决复杂和现实世界的表格相关任务方面仍有很大的改进空间。数据集和评估框架均公开可用,数据集托管在huggingface.co/datasets/JT-LM/JIUTIAN-TReB上,框架则位于github.com/JT-LM/jiutian-treb。
论文及项目相关链接
PDF Benmark report v1.1
Summary
本文解决商业和工业领域数据处理中的一大挑战——表格结构化数据的推理问题。文章提出了一种全面的表格推理评估基准TReB,用于衡量LLM对表格的浅层次理解和深层次推理能力,包含26个子任务。通过迭代数据处理程序构建高质量数据集,并建立评估框架,以三种不同的推理模式对表格推理能力进行稳健测量。文章还使用此框架对20多种最先进的大型语言模型进行了基准测试,证明了其有效性。实验结果表明,现有大型语言模型在处理复杂和现实世界表格相关任务方面仍有很大的改进空间。数据集和评估框架均已公开可用。
Key Takeaways
- 表格结构化数据在商业和工业领域占据主导地位,对其的推理存在诸多挑战。
- 当前缺乏有效评估大型语言模型(LLMs)在表格推理方面的性能的基准测试。
- 本文提出了一种全面的表格推理评估基准TReB,包含浅层次理解和深层次推理能力共26个子任务。
- 通过迭代数据处理程序构建了高质量数据集。
- 建立了评估框架,以三种不同的推理模式测量表格推理能力。
- 文章使用此框架对多个LLM进行了基准测试,并证明了其有效性。
点此查看论文截图


Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Authors:Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
科学研究越来越依赖于基于信息密集的科学数据和特定领域专业知识的复杂多模态推理。借助专家级科学基准,科学多模态大型语言模型(MLLMs)在实际工作流程中具有显著提高这一发现过程的潜力。然而,当前的科学基准大多侧重于评估MLLMs的知识理解能力,导致对它们的感知和推理能力的评估不足。为了弥补这一空白,我们提出了科学家第一次考试(SFE)基准,旨在通过三个相互联系的水平来评估MLLMs的科学认知能力:科学信号感知、科学属性理解、科学比较推理。具体来说,SFE包含830个专家验证的VQA对,涉及三种问题类型,涵盖66个多模态任务,涉及五个高价值学科。大量实验表明,目前最先进的GPT-o3和InternVL-3在SFE上的表现仅为34.08%和26.52%,这表明MLLMs在科学领域仍有很大的改进空间。我们希望从SFE中获得的数据能推动人工智能辅助科学发现的进一步发展。
论文及项目相关链接
PDF 82 pages
Summary
本文介绍了科学发现越来越依赖于基于信息密集型科学数据和领域特定专业知识进行的多模态推理。针对当前科学基准测试对大型语言模型评估的局限性,本文提出了科学家首次考试(SFE)基准测试,旨在评估多模态大型语言模型(MLLMs)的科学认知能力。该基准测试包括三个相互关联的水平:科学信号感知、科学属性理解和科学比较推理,涵盖五个高价值学科的66项多模态任务,通过830组专家验证后的问答对进行评估。研究表明,现有的先进模型如GPT-o3和InternVL-3在SFE上的表现仅为34.08%和26.52%,显示出MLLMs在科学领域有很大的改进空间。希望通过SFE的研究结果推动人工智能增强科学发现的进一步发展。
Key Takeaways
以下是基于文本的关键要点总结:
- 科学发现越来越依赖于复杂的多模态推理和信息密集型科学数据。
- 当前的科学基准测试主要关注大型语言模型的知识理解能力评估。
- 本文提出了科学家首次考试(SFE)基准测试,以更全面评估多模态大型语言模型(MLLMs)的科学认知能力。
- SFE包含三个评估水平:科学信号感知、科学属性理解和科学比较推理。
- SFE涵盖多个高价值学科,包含66项多模态任务和830组专家验证后的问答对。
- 现有先进模型在SFE上的表现不佳,显示MLLMs在科学领域有巨大改进空间。
点此查看论文截图





BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
Authors:Ha-Thanh Nguyen, Chaoran Liu, Qianying Liu, Hideyuki Tachibana, Su Myat Noe, Yusuke Miyao, Koichi Takeda, Sadao Kurohashi
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
我们推出了BIS Reasoning 1.0,这是首个大规模日语集合推理数据集,专门设计用于评估大型语言模型(LLM)中的信念不一致推理。不同于先前专注于一般或信念一致的推理的NeuBAROCO和JFLD数据集,BIS Reasoning 1.0引入了逻辑上有效但信念不一致的排比论证,以揭示训练有素的语言模型在受过人类相关语料库训练的大型语言模型中的推理偏见。我们基准测试了最先进模型的表现水平,包括GPT模型、Claude模型和领先的日本LLM模型,发现性能存在显著差异,GPT-4o准确率达到了79.54%。我们的分析确定了当前LLM在处理逻辑上有效但信念相冲突的输入时的关键弱点。这些发现在法律、医疗保健和科学文献等高风险领域部署LLM时具有重要意义,在这些领域中,必须让真理凌驾于直觉信念之上,以确保完整性和安全性。
论文及项目相关链接
PDF This version includes minor typo corrections in the example image
Summary:我们介绍了BIS Reasoning 1.0,它是首个针对大型语言模型评估信念不一致推理能力的大规模日语数据集。与以往的NeuBART和JFLD等数据集相比,BIS Reasoning 1.0引入了逻辑上有效但信念不一致的推理问题,以揭示在基于人类语料库训练的LLM中存在的推理偏见。我们对当前最先进的技术进行了基准测试,包括GPT模型和Claude模型等领先的大型语言模型。发现性能显著参差不齐,GPT-4o的准确率为79.54%。我们的分析确定了当前大型语言模型在处理逻辑有效但信念冲突的输入时的关键弱点。这些发现对于将大型语言模型部署在法律、医疗保健和科学文献等高风险领域具有重要的启示意义。在这些领域中,需要真相压倒直觉信仰来确保诚信和安全。
Key Takeaways:
- BIS Reasoning 1.0是首个专注于信念不一致推理的大规模日语数据集,旨在评估大型语言模型在逻辑推理方面的能力。
- 与其他数据集相比,BIS Reasoning 1.0引入逻辑上有效但信念不一致的推理问题。
- 当前大型语言模型在处理逻辑有效但信念冲突的输入时存在关键弱点。
- GPT-4o在BIS Reasoning 1.0数据集上的准确率为79.54%。
- 不同的大型语言模型在性能上表现出显著的差异。
- 此研究对于将大型语言模型部署在高风险领域,如法律、医疗和科学文献,具有重要的启示意义。
点此查看论文截图




