⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on them for serious academic work; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-22
Robobench: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models as Embodied Brain
Authors:Yulin Luo, Chun-Kai Fan, Menghang Dong, Jiayu Shi, Mengdi Zhao, Bo-Wen Zhang, Cheng Chi, Jiaming Liu, Gaole Dai, Rongyu Zhang, Ruichuan An, Kun Wu, Zhengping Che, Shaoxuan Xie, Guocai Yao, Zhongxia Zhao, Pengwei Wang, Guang Liu, Zhongyuan Wang, Tiejun Huang, Shanghang Zhang
Building robots that can perceive, reason, and act in dynamic, unstructured environments remains a core challenge. Recent embodied systems often adopt a dual-system paradigm, where System 2 handles high-level reasoning while System 1 executes low-level control. In this work, we refer to System 2 as the embodied brain, emphasizing its role as the cognitive core for reasoning and decision-making in manipulation tasks. Given this role, systematic evaluation of the embodied brain is essential. Yet existing benchmarks emphasize execution success or, when targeting high-level reasoning, suffer from incomplete dimensions and limited task realism, offering only a partial picture of cognitive capability. To bridge this gap, we introduce RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains. Motivated by the critical roles across the full manipulation pipeline, RoboBench defines five dimensions - instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis - spanning 14 capabilities, 25 tasks, and 6092 QA pairs. To ensure realism, we curate datasets across diverse embodiments, attribute-rich objects, and multi-view scenes, drawing from large-scale real robotic data. For planning, RoboBench introduces an evaluation framework, MLLM-as-world-simulator. It evaluates embodied feasibility by simulating whether predicted plans can achieve critical object-state changes. Experiments on 14 MLLMs reveal fundamental limitations: difficulties with implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold to quantify high-level cognition and guide the development of next-generation embodied MLLMs. The project page is at https://robo-bench.github.io.
Paper & Project Links
Summary
This paper addresses the core challenge of building robots that perceive, reason, and act in dynamic, unstructured environments. Current embodied systems typically adopt a dual-system paradigm in which System 2 handles high-level reasoning and System 1 executes low-level control. The paper frames System 2 as the cognitive core for manipulation tasks and argues that its systematic evaluation is essential. Existing benchmarks focus on execution success or, when they target high-level reasoning, suffer from incomplete dimensions and limited task realism, capturing only part of a model's cognitive capability. To bridge this gap, the paper introduces RoboBench, a benchmark that systematically evaluates multimodal large language models (MLLMs) as embodied brains across five dimensions: instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis, covering a rich set of capabilities and tasks. To ensure realism, its datasets span diverse embodiments, attribute-rich objects, and multi-view scenes. Experiments show that MLLMs still have fundamental limitations in implicit instruction comprehension, spatiotemporal reasoning, cross-scenario planning, fine-grained affordance understanding, and execution failure diagnosis. RoboBench provides a comprehensive scaffold for quantifying high-level cognition and for guiding the development of next-generation embodied MLLMs.
Key Takeaways
- Robots face the core challenge of perceiving, reasoning, and acting in dynamic, unstructured environments.
- Current embodied systems adopt a dual-system paradigm in which System 2 serves as the cognitive core for high-level reasoning.
- Existing benchmarks offer only a limited assessment of robots' cognitive capabilities.
- RoboBench systematically evaluates multimodal large language models across five dimensions (instruction comprehension, perception reasoning, generalized planning, affordance prediction, and failure analysis).
- Experiments on RoboBench reveal fundamental limitations of current MLLMs across several of these dimensions.
- RoboBench provides a comprehensive scaffold for quantifying high-level cognition; its planning metric is illustrated in the sketch below.
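For planning, RoboBench scores a predicted plan with an MLLM-as-world-simulator check: a plan passes only if its steps would produce the critical object-state changes the task requires. The paper does not publish code here, so the following is a minimal, hypothetical sketch of that idea; the `effects` step format and the goal dictionary are assumptions, not RoboBench's actual interface.

```python
# Minimal sketch of an "MLLM-as-world-simulator" style check (hypothetical API):
# a plan passes if applying its steps to a symbolic world state reaches all
# goal object-states. This illustrates the idea, not RoboBench's actual code.

def simulate_plan(initial_state: dict, plan: list[dict], goal_states: dict) -> bool:
    """Apply each predicted step to a symbolic state and test goal satisfaction."""
    state = dict(initial_state)
    for step in plan:
        # Each step names an action and the object attributes it changes,
        # e.g. {"action": "open", "object": "drawer", "effects": {"drawer": "open"}}
        for obj, value in step.get("effects", {}).items():
            state[obj] = value
    return all(state.get(obj) == value for obj, value in goal_states.items())

# Usage: a two-step plan that must leave the drawer open and the cup inside it.
initial = {"drawer": "closed", "cup": "on_table"}
plan = [
    {"action": "open", "object": "drawer", "effects": {"drawer": "open"}},
    {"action": "place", "object": "cup", "effects": {"cup": "in_drawer"}},
]
print(simulate_plan(initial, plan, {"drawer": "open", "cup": "in_drawer"}))  # True
```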

Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
Authors:Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test-time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks: As inference-time rerankers, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves the downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually-finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
Paper & Project Links
PDF 29 pages, 9 tables, 6 figures
Summary
Finetuning specialized generative evaluators has emerged as a popular paradigm for meeting the growing demand for scalable evaluation during training and at test time. This work focuses on data scaling, curating 2.5M samples that span five distinct evaluation tasks and multiple reasoning-centric domains. With this data and a simple iterative rejection-sampling supervised finetuning (SFT) approach, the authors train Foundational Automatic Reasoning Evaluators (FARE), among which FARE-8B and FARE-20B stand out, with FARE-20B setting a new standard for open-source evaluators. In real-world evaluations, FARE-20B achieves near-oracle reranking performance on MATH, FARE improves downstream model performance when used as a verifier in RL training, and a continually finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
Key Takeaways
- Finetuning specialized generative evaluators has become a popular paradigm for scalable evaluation during training and at test time.
- The work shifts the focus from new training methodology to data scaling, curating 2.5M samples spanning five evaluation tasks and multiple reasoning-centric domains.
- The resulting Foundational Automatic Reasoning Evaluators (FARE), in 8B and 20B (3.6B active) variants, are trained with a simple iterative rejection-sampling SFT recipe (see the sketch below).
- FARE performs strongly on real-world tasks, e.g., near-oracle reranking performance on MATH.
- Used as a verifier in RL training, FARE improves downstream RL-trained model performance by up to 14.1% over string-matching verifiers.
- The continually finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.
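The paper describes its training recipe only as simple iterative rejection-sampling SFT: sample several candidate judgments per input, keep those whose verdict matches the gold label, and finetune on the survivors. Below is a minimal runnable sketch of one such iteration; the `Judgment` format, the stub generator, and the pairwise-verdict setup are all assumptions for illustration, not the paper's interfaces.

```python
import random
from dataclasses import dataclass

@dataclass
class Judgment:
    text: str
    verdict: str  # e.g. "A" or "B" for pairwise evaluation

def generate_judgment(prompt: str) -> Judgment:
    # Stand-in for sampling a full evaluation from the evaluator LLM.
    verdict = random.choice(["A", "B"])
    return Judgment(text=f"...reasoning...\nFinal verdict: {verdict}", verdict=verdict)

def rejection_sampling_round(dataset, k: int = 8):
    """Collect self-generated judgments whose verdict matches the gold label;
    the accepted (input, target) pairs become the next round's SFT data."""
    kept = []
    for ex in dataset:
        candidates = [generate_judgment(ex["input"]) for _ in range(k)]
        kept += [
            {"input": ex["input"], "target": c.text}
            for c in candidates if c.verdict == ex["gold_verdict"]
        ]
    return kept

data = [{"input": "Which response is better, A or B?", "gold_verdict": "A"}]
print(len(rejection_sampling_round(data)))  # number of accepted SFT samples
```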

Contextual Attention Modulation: Towards Efficient Multi-Task Adaptation in Large Language Models
Authors:Dayan Pan, Zhaoyang Fu, Jingyuan Wang, Xiao Han, Yue Zhu, Xiangyu Zhao
Large Language Models (LLMs) possess remarkable generalization capabilities but struggle with multi-task adaptation, particularly in balancing knowledge retention with task-specific specialization. Conventional fine-tuning methods suffer from catastrophic forgetting and substantial resource consumption, while existing parameter-efficient methods perform suboptimally in complex multi-task scenarios. To address this, we propose Contextual Attention Modulation (CAM), a novel mechanism that dynamically modulates the representations of self-attention modules in LLMs. CAM enhances task-specific features while preserving general knowledge, thereby facilitating more effective and efficient adaptation. For effective multi-task adaptation, CAM is integrated into our Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared, full-parameter CAM module with multiple specialized, lightweight CAM modules, enhanced by a dynamic routing strategy for adaptive knowledge fusion. Extensive experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, demonstrate that our approach significantly outperforms existing approaches, achieving an average performance improvement of 3.65%. The implemented code and data are available to ease reproducibility at https://github.com/Applied-Machine-Learning-Lab/HyCAM.
Paper & Project Links
PDF Accepted by CIKM '25
Summary
Large language models (LLMs) generalize well but struggle with multi-task adaptation, especially when balancing knowledge retention against task-specific specialization. Conventional fine-tuning suffers from catastrophic forgetting and heavy resource consumption, while existing parameter-efficient methods underperform in complex multi-task scenarios. To address this, the paper proposes Contextual Attention Modulation (CAM), a mechanism that dynamically modulates the representations of self-attention modules in LLMs, enhancing task-specific features while preserving general knowledge and thereby enabling more effective and efficient adaptation. CAM is integrated into the Hybrid Contextual Attention Modulation (HyCAM) framework, which combines a shared full-parameter CAM module with multiple specialized lightweight CAM modules and a dynamic routing strategy for adaptive knowledge fusion. Experiments on heterogeneous tasks, including question answering, code generation, and logical reasoning, show the approach significantly outperforms existing methods, with an average performance improvement of 3.65%. Code and data are public at https://github.com/Applied-Machine-Learning-Lab/HyCAM.
Key Takeaways
- LLMs face challenges in multi-task adaptation, needing to balance knowledge retention with task-specific specialization.
- Conventional fine-tuning forgets prior knowledge and consumes substantial resources, while existing parameter-efficient methods underperform in multi-task scenarios.
- Contextual Attention Modulation (CAM) dynamically modulates the representations of LLM self-attention modules, enhancing task-specific features while preserving general knowledge.
- The HyCAM framework combines a shared full-parameter CAM module with multiple specialized lightweight CAM modules, using dynamic routing for adaptive knowledge fusion (see the sketch below).
- HyCAM significantly outperforms existing methods on heterogeneous tasks, with an average improvement of 3.65%.
- Code and data are released on GitHub for reproducibility.
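The abstract describes CAM as dynamically modulating self-attention representations and HyCAM as routing between one shared full-parameter module and several lightweight ones. The official code is on GitHub; the snippet below is only a minimal PyTorch sketch of that structure, with the elementwise scale-and-shift modulation, the layer sizes, and the soft routing all assumed for illustration.

```python
import torch
import torch.nn as nn

class CAM(nn.Module):
    """Modulates a self-attention output with an input-conditioned scale/shift.
    The elementwise form is an assumption; the paper only says 'modulates'."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, 2 * d_model))

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        scale, shift = self.net(attn_out).chunk(2, dim=-1)
        return attn_out * (1 + scale) + shift

class HyCAM(nn.Module):
    """One shared (full-capacity) CAM plus several lightweight CAMs combined
    by a learned soft router over tokens."""
    def __init__(self, d_model: int = 768, n_experts: int = 4):
        super().__init__()
        self.shared = CAM(d_model, d_model)                 # full-parameter module
        self.experts = nn.ModuleList([CAM(d_model, d_model // 8)
                                      for _ in range(n_experts)])  # lightweight
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, attn_out: torch.Tensor) -> torch.Tensor:
        weights = self.router(attn_out).softmax(dim=-1)     # dynamic routing
        expert_out = torch.stack([e(attn_out) for e in self.experts], dim=-1)
        routed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)
        return self.shared(attn_out) + routed

x = torch.randn(2, 16, 768)   # (batch, tokens, hidden)
print(HyCAM()(x).shape)       # torch.Size([2, 16, 768])
```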

CrossGuard: Safeguarding MLLMs against Joint-Modal Implicit Malicious Attacks
Authors:Xu Zhang, Hao Li, Zhichao Lu
Multimodal Large Language Models (MLLMs) achieve strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. While existing work focuses on explicit attacks, where malicious content resides in a single modality, recent studies reveal implicit attacks, in which benign text and image inputs jointly express unsafe intent. Such joint-modal threats are difficult to detect and remain underexplored, largely due to the scarcity of high-quality implicit data. We propose ImpForge, an automated red-teaming pipeline that leverages reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, we further develop CrossGuard, an intent-aware safeguard providing robust and comprehensive defense against both explicit and implicit threats. Extensive experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings demonstrate that CrossGuard significantly outperforms existing defenses, including advanced MLLMs and guardrails, achieving stronger security while maintaining high utility. This offers a balanced and practical solution for enhancing MLLM robustness against real-world multimodal threats.
Paper & Project Links
PDF 14 pages, 8 figures, 2 tables
Summary
Multimodal large language models (MLLMs) have strong reasoning and perception capabilities but are increasingly vulnerable to jailbreak attacks. Existing work focuses on explicit attacks, where malicious content resides in a single modality, but recent studies reveal implicit attacks in which benign text and image inputs jointly express unsafe intent. This paper proposes ImpForge, an automated red-teaming pipeline that uses reinforcement learning with tailored reward modules to generate diverse implicit samples across 14 domains. Building on this dataset, the authors develop CrossGuard, an intent-aware safeguard that provides robust and comprehensive defense against both explicit and implicit threats. Experiments across safe and unsafe benchmarks, implicit and explicit attacks, and multiple out-of-domain settings show that CrossGuard significantly outperforms existing defenses, achieving stronger security while maintaining high utility, offering a balanced and practical solution for hardening MLLMs against real-world multimodal threats.
Key Takeaways
- Multimodal large language models (MLLMs) have strong reasoning and perception capabilities but are vulnerable to jailbreak attacks.
- Existing work focuses on explicit attacks, while implicit attacks are an emerging threat.
- Benign text and image inputs can jointly express unsafe intent, and such joint-modal threats are hard to detect.
- The scarcity of high-quality implicit data has limited research in this area.
- ImpForge is an automated red-teaming pipeline that uses reinforcement learning to generate diverse implicit samples.
- CrossGuard is an intent-aware safeguard that defends against both explicit and implicit threats.

Multilingual Text-to-Image Person Retrieval via Bidirectional Relation Reasoning and Aligning
Authors:Min Cao, Xinyu Zhou, Ding Jiang, Bo Du, Mang Ye, Min Zhang
Text-to-image person retrieval (TIPR) aims to identify the target person using textual descriptions, facing challenge in modality heterogeneity. Prior works have attempted to address it by developing cross-modal global or local alignment strategies. However, global methods typically overlook fine-grained cross-modal differences, whereas local methods require prior information to explore explicit part alignments. Additionally, current methods are English-centric, restricting their application in multilingual contexts. To alleviate these issues, we pioneer a multilingual TIPR task by developing a multilingual TIPR benchmark, for which we leverage large language models for initial translations and refine them by integrating domain-specific knowledge. Correspondingly, we propose Bi-IRRA: a Bidirectional Implicit Relation Reasoning and Aligning framework to learn alignment across languages and modalities. Within Bi-IRRA, a bidirectional implicit relation reasoning module enables bidirectional prediction of masked image and text, implicitly enhancing the modeling of local relations across languages and modalities, a multi-dimensional global alignment module is integrated to bridge the modality heterogeneity. The proposed method achieves new state-of-the-art results on all multilingual TIPR datasets. Data and code are presented in https://github.com/Flame-Chasers/Bi-IRRA.
Paper & Project Links
PDF Final version published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). Xplore link: https://ieeexplore.ieee.org/document/11199360
Summary
Text-to-image person retrieval (TIPR) aims to identify a target person from a textual description and faces the challenge of modality heterogeneity. Prior work develops cross-modal global or local alignment strategies, but global methods overlook fine-grained cross-modal differences while local methods require prior information to explore explicit part alignments. Moreover, current methods are English-centric, limiting their use in multilingual contexts. To alleviate these issues, the paper pioneers a multilingual TIPR task, building a multilingual TIPR benchmark with large language models for initial translations refined with domain-specific knowledge, and proposes Bi-IRRA, a Bidirectional Implicit Relation Reasoning and Aligning framework for learning alignment across languages and modalities. Its bidirectional implicit relation reasoning module predicts masked image and text in both directions, implicitly strengthening the modeling of local relations across languages and modalities, while a multi-dimensional global alignment module bridges the modality heterogeneity. The method sets new state-of-the-art results on all multilingual TIPR datasets; data and code are at https://github.com/Flame-Chasers/Bi-IRRA.
Key Takeaways
- TIPR identifies a target person from a textual description and faces modality heterogeneity.
- Existing methods use global or local alignment strategies, each with its own drawbacks.
- Current methods are English-centric and lack multilingual support.
- The paper introduces a multilingual TIPR task and a multilingual TIPR benchmark.
- Large language models provide initial translations, which are refined with domain-specific knowledge.
- Bi-IRRA combines a bidirectional implicit relation reasoning module with a multi-dimensional global alignment module.

LawChain: Modeling Legal Reasoning Chains for Chinese Tort Case Analysis
Authors:Huiyuan Xie, Chenyang Li, Huining Zhu, Chubin Zhang, Yuxiao Ye, Zhenghao Liu, Zhiyuan Liu
Legal reasoning is a fundamental component of legal analysis and decision-making. Existing computational approaches to legal reasoning predominantly rely on generic reasoning frameworks such as syllogism and IRAC, which do not comprehensively examine the nuanced processes that underpin legal reasoning. Moreover, current research has largely focused on criminal cases, with insufficient modeling for civil cases. In this work, we present a novel framework for explicitly modeling legal reasoning in the analysis of Chinese tort-related civil cases. We first operationalize the legal reasoning processes used in tort analysis into the LawChain framework. LawChain is a three-module reasoning framework, with each module consisting of multiple finer-grained sub-steps. Informed by the LawChain framework, we introduce the task of tort legal reasoning and construct an evaluation benchmark, LawChain$_{eval}$, to systematically assess the critical steps within analytical reasoning chains for tort analysis. Leveraging this benchmark, we evaluate state-of-the-art large language models for their legal reasoning ability in civil tort contexts. Our results indicate that current models still fall short in accurately handling crucial elements of tort legal reasoning. Furthermore, we introduce several baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training. We conduct further experiments on additional legal analysis tasks, such as Legal Named-Entity Recognition and Criminal Damages Calculation, to verify the generalizability of these baselines. The proposed baseline approaches achieve significant improvements in tort-related legal reasoning and generalize well to related legal analysis tasks, thus demonstrating the value of explicitly modeling legal reasoning chains to enhance the reasoning capabilities of language models.
Paper & Project Links
Summary
Existing computational approaches to legal reasoning rely mainly on generic frameworks such as syllogism and IRAC, which do not fully capture the nuanced processes underlying legal reasoning, and research has concentrated on criminal cases while modeling of civil cases remains insufficient. This study proposes a framework that explicitly models legal reasoning for Chinese tort-related civil cases. The LawChain framework organizes and systematizes the reasoning steps of tort analysis, and a benchmark built on it evaluates the legal reasoning ability of large models in civil tort contexts. The results show that current models still fall short on crucial elements of tort legal reasoning. Baseline approaches that explicitly incorporate LawChain-style reasoning through prompting or post-training achieve significant improvements and generalize well to related legal analysis tasks, offering an effective recipe for such problems.
Key Takeaways
- Legal reasoning is a fundamental component of legal analysis and decision-making; current computational methods do not fully capture its nuanced processes.
- Research has focused on criminal cases, leaving civil-case modeling insufficient.
- The study explicitly models legal reasoning for Chinese tort-related civil cases, systematized through the LawChain framework.
- The LawChain framework consists of three modules, each with multiple finer-grained sub-steps.
- The LawChain_eval benchmark assesses large models' legal reasoning in civil tort contexts.
- Current models still fall short on crucial elements of tort legal reasoning and need improvement.

Reasoning Distillation and Structural Alignment for Improved Code Generation
Authors:Amir Jalilifard, Anderson de Rezende Rocha, Marcos Medeiros Raimundo
Effective code generation with language models hinges on two critical factors: accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning to produce correct solutions capable of passing diverse test cases while adhering to the syntax of the target programming language. Unlike other language tasks, code generation requires more than accurate token prediction; it demands comprehension of solution-level and structural relationships rather than merely generating the most likely tokens. Very large language models (VLLMs) are capable of generating detailed steps toward the correct solution of complex tasks where reasoning is crucial in solving the problem. Such reasoning capabilities may be absent in smaller language models. Therefore, in this work, we distill the reasoning capabilities of a VLLM into a smaller, more efficient model that is faster and cheaper to deploy. Our approach trains the model to emulate the reasoning and problem-solving abilities of the VLLM by learning to identify correct solution pathways and establishing a structural correspondence between problem definitions and potential solutions through a novel method of structure-aware loss optimization. This enables the model to transcend token-level generation and to deeply grasp the overarching structure of solutions for given problems. Experimental results show that our fine-tuned model, developed through a cheap and simple to implement process, significantly outperforms our baseline model in terms of pass@1, average data flow, and average syntax match metrics across the MBPP, MBPP Plus, and HumanEval benchmarks.
Paper & Project Links
Summary
Effective code generation with language models hinges on accurately understanding the intent of the prompt and generating code that applies algorithmic reasoning, passes diverse test cases, and adheres to the syntax of the target language. To achieve this, the authors distill the reasoning capabilities of a very large language model (VLLM) into a smaller, more efficient model that is faster and cheaper to deploy. Through a novel structure-aware loss optimization method, the model is trained to identify correct solution pathways and to establish a structural correspondence between problem definitions and potential solutions. Experiments show that the fine-tuned model, developed through a cheap and simple process, significantly outperforms the baseline on pass@1, average data flow, and average syntax match across the MBPP, MBPP Plus, and HumanEval benchmarks.
Key Takeaways
- Code generation depends on accurately understanding prompt intent and generating code that embodies algorithmic reasoning.
- Very large language models (VLLMs) can generate detailed steps toward correct solutions of complex tasks.
- Distilling VLLM reasoning into a smaller, more efficient model enables faster, cheaper deployment.
- A structure-aware loss optimization method trains the model to identify correct solution pathways and to align problem definitions with potential solutions (see the sketch below).
- The model goes beyond token-level generation and grasps the overall structure of solutions to given problems.
- Experiments show the fine-tuned model significantly outperforms the baseline across several benchmarks.
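The abstract names a structure-aware loss but gives no formula, so the following is a hedged sketch of one plausible form: standard token-level cross-entropy on the teacher's reasoning trace plus an auxiliary term that pulls pooled problem and solution representations together. Both the alignment term and the weighting are assumptions, not the paper's actual objective.

```python
import torch
import torch.nn.functional as F

def structure_aware_loss(token_logits, target_ids, problem_repr, solution_repr,
                         alpha: float = 0.5):
    """Token-level distillation CE plus a structural-alignment term.
    The cosine-alignment form of the second term is an assumption; the paper
    only states that problem definitions and solutions are structurally aligned."""
    # (batch, seq, vocab) vs (batch, seq): imitate the teacher's reasoning trace.
    ce = F.cross_entropy(token_logits.transpose(1, 2), target_ids)
    # Pooled sentence-level embeddings of the problem and its candidate solution.
    align = 1 - F.cosine_similarity(problem_repr, solution_repr, dim=-1).mean()
    return ce + alpha * align

logits = torch.randn(4, 32, 1000)           # student logits over teacher tokens
targets = torch.randint(0, 1000, (4, 32))   # teacher (VLLM) reasoning tokens
p, s = torch.randn(4, 256), torch.randn(4, 256)
print(structure_aware_loss(logits, targets, p, s))
```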

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction
Authors:Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska
Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.
Paper & Project Links
Summary
The paper presents a unified multi-task learning framework that aligns autoregressive large language models (LLMs) with clinical reasoning for cancer treatment outcome prediction on the MSK-CHORD dataset. The models jointly perform binary survival classification, continuous survival-time regression, and natural-language rationale generation. Experiments show that supervised fine-tuning with Chain-of-Thought (CoT) prompting improves F1 and reduces MAE, while Group Relative Policy Optimization (GRPO) achieves higher interpretability and predictive performance. The work underscores the importance of reasoning-aware alignment in multi-task clinical modeling and sets a new benchmark for interpretable, trustworthy LLMs in precision oncology.
Key Takeaways
- LLMs perform strongly in biomedical NLP but lack the structured reasoning capabilities critical for high-stakes decision support.
- A multi-task learning framework combines LLMs with clinical reasoning to predict cancer treatment outcomes.
- The framework jointly performs binary survival classification, continuous survival-time regression, and natural-language rationale generation (a minimal sketch follows this list).
- Three alignment strategies are evaluated: standard supervised fine-tuning (SFT), SFT with Chain-of-Thought (CoT) prompting, and Group Relative Policy Optimization (GRPO).
- CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance on BLEU, ROUGE, and BERTScore.
- Existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints.
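The abstract specifies three jointly trained objectives. Below is a hedged sketch of how such a multi-task head could sit on a decoder backbone; the pooling choice, head shapes, and loss weights are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiTaskSurvivalHead(nn.Module):
    """Joint survival classification + survival-time regression on pooled
    hidden states; rationale generation stays with the LM head (not shown)."""
    def __init__(self, d_model: int = 4096):
        super().__init__()
        self.classifier = nn.Linear(d_model, 2)    # binary survival
        self.regressor = nn.Linear(d_model, 1)     # continuous survival time

    def forward(self, hidden_states: torch.Tensor):
        pooled = hidden_states[:, -1]              # last-token pooling (assumed)
        return self.classifier(pooled), self.regressor(pooled).squeeze(-1)

def joint_loss(logits, time_pred, label, time, lm_loss, w=(1.0, 1.0, 1.0)):
    cls = nn.functional.cross_entropy(logits, label)
    reg = nn.functional.l1_loss(time_pred, time)   # MAE matches the reported metric
    return w[0] * cls + w[1] * reg + w[2] * lm_loss

h = torch.randn(2, 128, 4096)                      # decoder hidden states
head = MultiTaskSurvivalHead()
logits, t = head(h)
print(joint_loss(logits, t, torch.tensor([0, 1]), torch.rand(2), torch.tensor(2.3)))
```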

SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors
Authors:Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Dirk Hovy, Nigel Collier, Paul Röttger
Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.
Paper & Project Links
PDF Project Website: http://simbench.tiancheng.hu/ Data: https://huggingface.co/datasets/pitehu/SimBench
Summary
Large language model (LLM) simulations of human behavior could revolutionize the social and behavioral sciences, but only if they faithfully reflect real human behavior. To replace today's fragmented, bespoke evaluations, the authors introduce SimBench, the first large-scale standardized benchmark for a robust, reproducible science of LLM simulation, unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool. Even the best current LLMs have limited simulation ability (score: 40.80/100), and performance scales log-linearly with model size, while increased inference-time compute does not improve it. The authors demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups, and simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, SimBench aims to accelerate the development of more faithful LLM simulators.
Key Takeaways
- LLM simulations of human behavior have transformative potential, provided they faithfully reflect real behavior.
- SimBench is the first large-scale standardized benchmark for evaluating LLM simulation.
- LLM simulation ability is currently limited, and performance scales log-linearly with model size.
- Simulation performance is not improved by increased inference-time compute.
- Instruction-tuning helps on low-entropy (consensus) questions but degrades performance on high-entropy (diverse) ones.
- Models struggle to simulate specific demographic groups.

I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
Authors:Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity, attribute range, and introducing perceptual uncertainty. Compared to LLMs, empirical results show that LRMs achieve improved productivity and systematicity on longer reasoning relations and wider attribute ranges, respectively. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
Paper & Project Links
PDF Accepted at the 5th Workshop on Mathematical Reasoning and AI (MATH-AI), NeurIPS 2025
Summary:
I-RAVEN-X is a symbolic benchmark for evaluating generalization and robustness in analogical and mathematical reasoning for large language models (LLMs) and large reasoning models (LRMs). It extends I-RAVEN by increasing operand complexity, widening attribute ranges, and introducing perceptual uncertainty. Compared with LLMs, LRMs achieve improved productivity on longer reasoning relations and improved systematicity on wider attribute ranges, but they remain significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
Key Takeaways:
- I-RAVEN-X is a symbolic benchmark for evaluating the generalization and robustness of LLMs and LRMs.
- It extends I-RAVEN with greater operand complexity, wider attribute ranges, and perceptual uncertainty.
- LRMs show improved productivity on longer reasoning relations and improved systematicity on wider attribute ranges.
- LRMs remain challenged by reasoning under uncertainty and by exploring multiple probabilistic outcomes.
- Empirically, LRMs outperform LLMs in some respects but retain their own limitations.
- I-RAVEN-X provides a useful tool for assessing reasoning under complex, uncertain conditions.

Leveraging Group Relative Policy Optimization to Advance Large Language Models in Traditional Chinese Medicine
Authors:Jiacheng Xie, Shuai Zeng, Yang Yu, Xiaoting Tang, Guanghui An, Dong Xu
Traditional Chinese Medicine (TCM) presents a rich and structurally unique knowledge system that challenges conventional applications of large language models (LLMs). Although previous TCM-specific LLMs have shown progress through supervised fine-tuning, they often face limitations in alignment, data quality, and evaluation consistency. In this study, we introduce Ladder-base, the first TCM-focused LLM trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection based on intra-group comparisons. Ladder-base is built upon the Qwen2.5-7B-Instruct foundation model and trained exclusively on the textual subset of the TCM-Ladder benchmark, using 80 percent of the data for training and the remaining 20 percent split evenly between validation and test sets. Through standardized evaluation, Ladder-base demonstrates superior performance across multiple reasoning metrics when compared to both state-of-the-art general-purpose LLMs such as GPT-4, Gemini 2.5, Claude 3, and Qwen3 and domain-specific TCM models including BenTsao, HuatuoGPT2, and Zhongjing. These findings suggest that GRPO provides an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy and clinically grounded TCM artificial intelligence systems.
Paper & Project Links
Summary:
This study introduces Ladder-base, the first Traditional Chinese Medicine (TCM)-focused large language model trained with Group Relative Policy Optimization (GRPO), a reinforcement learning method that improves reasoning and factual consistency by optimizing response selection through intra-group comparisons. Under standardized evaluation, Ladder-base outperforms both state-of-the-art general-purpose LLMs and domain-specific TCM models across multiple reasoning metrics. The results suggest that GRPO is an effective and efficient strategy for aligning LLMs with expert-level reasoning in traditional medical domains and supports the development of trustworthy, clinically grounded TCM AI systems.
Key Takeaways:
- Ladder-base is the first TCM-focused large language model (LLM) trained with Group Relative Policy Optimization (GRPO).
- GRPO improves reasoning and factual consistency by optimizing response selection through intra-group comparisons (see the sketch below).
- Ladder-base is trained on the textual subset of the TCM-Ladder benchmark, with an 80/10/10 train/validation/test split.
- It outperforms both general-purpose and domain-specific LLMs across multiple reasoning metrics.
- GRPO offers an effective and efficient strategy for expert-level reasoning in medical domains.
- Ladder-base supports the development of trustworthy, clinically grounded TCM AI systems.
- The study highlights the uniqueness and complexity of TCM knowledge, which calls for dedicated LLMs to process and analyze it.
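GRPO's core move is to score each sampled response relative to the other responses in its group, removing the need for a learned value model. Here is a minimal sketch of the group-relative advantage computation; the normalization follows the commonly published GRPO form, which may differ in detail from this paper's variant, and the reward values are illustrative.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6):
    """rewards: (groups, samples_per_group) scalar rewards for responses
    sampled from the same prompt. Each response's advantage is its reward
    standardized against its own group's mean and standard deviation."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled answers each, binary correctness rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
# Correct answers get positive advantages, incorrect ones negative; these
# weights then scale the PPO-style clipped policy-gradient update.
```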

EduAdapt: A Question Answer Benchmark Dataset for Evaluating Grade-Level Adaptability in LLMs
Authors:Numaan Naeem, Abdellah El Mekki, Muhammad Abdul-Mageed
Large language models (LLMs) are transforming education by answering questions, explaining complex concepts, and generating content across a wide range of subjects. Despite strong performance on academic benchmarks, they often fail to tailor responses to students’ grade levels. This is a critical need in K-12 education, where age-appropriate vocabulary and explanation are essential for effective learning. Existing models frequently produce outputs that are too advanced or vague for younger learners, and there are no standardized benchmarks to evaluate their ability to adjust across cognitive and developmental stages. To address this gap, we introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. We evaluate a diverse set of open-source LLMs on EduAdapt and find that while larger models generally perform better, they still struggle with generating suitable responses for early-grade students (Grades 1-5). Our work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI systems through better training and prompting strategies. EduAdapt code and datasets are publicly available at https://github.com/NaumanNaeem/EduAdapt.
Paper & Project Links
PDF 28 pages, 2 figures, 14 tables, 50 listings, EMNLP 2025 Main
Summary
LLMs are transforming education by answering questions, explaining complex concepts, and generating content across many subjects, but they often fail to tailor responses to students' grade levels. In K-12 education, age-appropriate vocabulary and explanation are essential for effective learning; existing models frequently produce outputs too advanced or too vague for younger learners, and no standardized benchmark evaluates their ability to adjust across cognitive and developmental stages. To address this gap, the authors introduce EduAdapt, a benchmark of nearly 48k grade-labeled QA pairs across nine science subjects, spanning Grades 1-12 and grouped into four grade levels. Evaluating a diverse set of open-source LLMs shows that larger models generally perform better but still struggle to generate suitable responses for early-grade students (Grades 1-5). The work presents the first dataset and evaluation framework for assessing grade-level adaptability in LLMs, aiming to foster more developmentally aligned educational AI through better training and prompting strategies. The EduAdapt code and datasets are publicly available at the linked repository.
Key Takeaways
- LLM applications in education are changing how teaching is done, but models fall short on grade-level adaptability.
- In K-12 education, age-appropriate vocabulary and explanations are essential.
- Existing model outputs are often unsuitable for younger learners, and no standardized benchmark measures adaptability across grades.
- EduAdapt provides nearly 48k grade-labeled QA pairs across nine science subjects.
- Larger models generally perform better but still struggle to adapt to early-grade (Grades 1-5) students.
- EduAdapt supplies both a dataset and an evaluation framework for assessing grade-level adaptability in LLMs.

TabR1: Taming GRPO for tabular reasoning LLMs
Authors:Pengxiang Cai, Zihao Gao, Jintai Chen
Tabular prediction has traditionally relied on gradient-boosted decision trees and specialized deep learning models, which excel within tasks but provide limited interpretability and weak transfer across tables. Reasoning large language models (LLMs) promise cross-task adaptability with transparent reasoning traces, yet their potential has not been fully realized for tabular data. This paper presents TabR1, the first reasoning LLM for tabular prediction with multi-step reasoning. At its core is Permutation Relative Policy Optimization (PRPO), a simple yet efficient reinforcement learning method that encodes column-permutation invariance as a structural prior. By constructing multiple label-preserving permutations per sample and estimating advantages both within and across permutations, PRPO transforms sparse rewards into dense learning signals and improves generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, enhancing few-shot and zero-shot performance as well as interpretability. Comprehensive experiments demonstrate that TabR1 achieves performance comparable to strong baselines under full-supervision fine-tuning. In the zero-shot setting, TabR1 approaches the performance of strong baselines under the 32-shot setting. Moreover, TabR1 (8B) substantially outperforms much larger LLMs across various tasks, achieving up to 53.17% improvement over DeepSeek-R1 (685B).
Paper & Project Links
Summary
TabR1 is the first reasoning large language model (LLM) for tabular prediction with multi-step reasoning. Its core is Permutation Relative Policy Optimization (PRPO), a simple and efficient reinforcement learning method that encodes column-permutation invariance as a structural prior: multiple label-preserving permutations are constructed per sample, and advantages are estimated both within and across permutations, turning sparse rewards into dense learning signals and improving generalization. With limited supervision, PRPO activates the reasoning ability of LLMs for tabular prediction, improving few-shot and zero-shot performance as well as interpretability. TabR1 matches strong baselines under full-supervision fine-tuning, approaches 32-shot baseline performance in the zero-shot setting, and TabR1 (8B) substantially outperforms much larger LLMs, achieving up to a 53.17% improvement over DeepSeek-R1 (685B).
Key Takeaways
- TabR1 is the first reasoning LLM designed for tabular prediction with multi-step reasoning.
- Its core, Permutation Relative Policy Optimization (PRPO), encodes column-permutation invariance as a structural prior by constructing multiple label-preserving permutations per sample (see the sketch below).
- PRPO transforms sparse rewards into dense learning signals, effectively improving generalization.
- TabR1 improves few-shot and zero-shot performance without relying on full-supervision fine-tuning.
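PRPO extends group-relative advantage estimation with column permutations: each table row is serialized under several label-preserving column orders, and sampled predictions are scored relative both to their own permutation group and to all permutations of the same row. The following is a hedged sketch of the data side of this idea; the serialization format, reward placement, and equal weighting of the two relative signals are illustrative assumptions, not the paper's exact design.

```python
import random
import statistics

def permuted_views(row: dict, n_perms: int = 3):
    """Serialize one table row under several label-preserving column orders."""
    cols = list(row)
    views = []
    for _ in range(n_perms):
        random.shuffle(cols)
        views.append("; ".join(f"{c} = {row[c]}" for c in cols))
    return views

def prpo_advantages(rewards_per_perm: list[list[float]]):
    """Center each sampled reward against its own permutation group's mean and
    against the mean across all permutations of the same row, turning one
    sparse per-answer reward into two relative signals (weighting assumed)."""
    all_mean = statistics.mean(r for group in rewards_per_perm for r in group)
    return [[(r - statistics.mean(group)) + (r - all_mean) for r in group]
            for group in rewards_per_perm]

row = {"age": 52, "bmi": 31.2, "smoker": "yes"}
print(permuted_views(row))                          # same row, shuffled columns
print(prpo_advantages([[1, 0, 0], [1, 1, 0], [0, 0, 0]]))
```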

TaxoAlign: Scholarly Taxonomy Generation Using Language Models
Authors:Avishek Lahiri, Yufang Hou, Debarshi Kumar Sanyal
Taxonomies play a crucial role in helping researchers structure and navigate knowledge in a hierarchical manner. They also form an important part in the creation of comprehensive literature surveys. The existing approaches to automatic survey generation do not compare the structure of the generated surveys with those written by human experts. To address this gap, we present our own method for automated taxonomy creation that can bridge the gap between human-generated and automatically-created taxonomies. For this purpose, we create the CS-TaxoBench benchmark which consists of 460 taxonomies that have been extracted from human-written survey papers. We also include an additional test set of 80 taxonomies curated from conference survey papers. We propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation. Additionally, we propose a stringent automated evaluation framework that measures the structural alignment and semantic coherence of automatically generated taxonomies in comparison to those created by human experts. We evaluate our method and various baselines on CS-TaxoBench, using both automated evaluation metrics and human evaluation studies. The results show that TaxoAlign consistently surpasses the baselines on nearly all metrics. The code and data can be found at https://github.com/AvishekLahiri/TaxoAlign.
Paper & Project Links
PDF This paper has been accepted at the EMNLP 2025 Main Conference
Summary
Taxonomies help researchers structure and navigate knowledge hierarchically and are an important part of comprehensive literature surveys, yet existing automatic survey generation approaches do not compare the structure of generated surveys with those written by human experts. To address this, the authors propose TaxoAlign, a three-phase topic-based instruction-guided method for scholarly taxonomy generation, and build CS-TaxoBench, a benchmark of 460 taxonomies extracted from human-written survey papers plus an additional test set of 80 taxonomies curated from conference survey papers. Evaluation shows that TaxoAlign surpasses the baselines on nearly all metrics.
Key Takeaways
- Taxonomies are crucial for structuring researchers' knowledge and for literature surveys.
- Existing automatic survey generation methods do not compare generated structures against human-expert taxonomies.
- TaxoAlign is a three-phase, topic-based, instruction-guided method for scholarly taxonomy generation.
- CS-TaxoBench contains 460 taxonomies extracted from human-written survey papers, plus 80 from conference surveys.
- The evaluation framework measures the structural alignment and semantic coherence of generated taxonomies against human-created ones.
- TaxoAlign surpasses the baselines on nearly all evaluation metrics.

StreamingThinker: Large Language Models Can Think While Reading
Authors:Junlong Tong, Yingqi Fan, Anhao Zhao, Yunpu Ma, Xiaoyu Shen
Large language models (LLMs) have demonstrated remarkable capabilities in chain of thought (CoT) reasoning. However, the current LLM reasoning paradigm initiates thinking only after the entire input is available, which introduces unnecessary latency and weakens attention to earlier information in dynamic scenarios. Inspired by human cognition of thinking while reading, we first design a streaming thinking paradigm for LLMs, where reasoning unfolds in the order of input and further adjusts its depth once reading is complete. We instantiate this paradigm with StreamingThinker, a framework that enables LLMs to think while reading through the integration of streaming CoT generation, streaming-constraint training, and streaming parallel inference. Specifically, StreamingThinker employs streaming reasoning units with quality control for CoT generation, enforces order-preserving reasoning through streaming attention masks and position encoding, and leverages parallel KV caches that decouple input encoding from reasoning generation, thereby ensuring alignment and enabling true concurrency. We evaluate StreamingThinker on the Qwen3 model family across math reasoning, logical reasoning, and context-based QA reasoning tasks. Experimental results show that StreamingThinker preserves performance comparable to batch thinking, while yielding an 80% reduction in token waiting before the onset of reasoning and a more than 60% reduction in time-level latency for producing the final answer, demonstrating the effectiveness of the streaming paradigm for LLM reasoning. Code will be released at https://github.com/EIT-NLP/StreamingLLM/tree/main/StreamingThinker.
Paper & Project Links
Summary:
LLMs show remarkable chain-of-thought (CoT) reasoning ability, but the current paradigm starts thinking only after the entire input is available, introducing unnecessary latency and weakening attention to earlier information in dynamic scenarios. Inspired by how humans think while reading, the authors design a streaming thinking paradigm in which reasoning unfolds in input order and adjusts its depth once reading completes. They instantiate it with StreamingThinker, a framework that integrates streaming CoT generation, streaming-constraint training, and streaming parallel inference so the LLM can think while reading. Experiments show that StreamingThinker preserves performance comparable to batch thinking while cutting token waiting before the onset of reasoning by 80% and time-level latency to the final answer by more than 60%.
Key Takeaways:
- LLMs show strong chain-of-thought (CoT) reasoning ability.
- The current paradigm waits for the full input before reasoning, adding latency and neglecting early information in dynamic scenarios.
- The streaming thinking paradigm lets reasoning unfold in the order of the input.
- StreamingThinker combines streaming CoT generation, streaming-constraint training, and streaming parallel inference so the LLM can think while reading.
- Key components include streaming reasoning units with quality control, order-preserving streaming attention masks and position encoding, and parallel KV caches (the masking idea is sketched below).
- Experiments show StreamingThinker keeps performance while significantly reducing reasoning latency.
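The order-preserving constraint can be pictured as a block attention mask: reasoning tokens emitted after reading the first k input chunks may attend only to those chunks and to earlier reasoning tokens. Below is a small sketch of such a mask; the block layout and sizes are assumptions, and the paper's actual masking scheme may differ.

```python
import torch

def streaming_mask(n_chunks: int, chunk_len: int, think_len: int) -> torch.Tensor:
    """Boolean attention mask in which the i-th block of reasoning tokens sees
    only input chunks 0..i (inputs read so far) plus earlier reasoning tokens."""
    n_in = n_chunks * chunk_len                      # input tokens come first
    total = n_in + n_chunks * think_len
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:n_in, :n_in] = torch.ones(n_in, n_in, dtype=torch.bool).tril()
    for i in range(n_chunks):
        start = n_in + i * think_len                 # this chunk's reasoning block
        end = start + think_len
        mask[start:end, : (i + 1) * chunk_len] = True          # inputs read so far
        mask[start:end, n_in:end] = torch.ones(                # causal over reasoning
            think_len, end - n_in, dtype=torch.bool).tril(diagonal=start - n_in)
    return mask

# Two input chunks of 3 tokens each, 2 reasoning tokens emitted per chunk:
print(streaming_mask(n_chunks=2, chunk_len=3, think_len=2).int())
```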

TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
Authors:Shuzheng Gao, Eric John Li, Man Ho Lam, Jingyu Xiao, Yuxuan Wan, Chaozheng Wang, Ng Man Tik, Michael R. Lyu
Large foundation models are fundamentally transforming the software engineering landscape, demonstrating exceptional capabilities across diverse tasks such as code generation, debugging, and testing. Despite this rapid progress, a significant gap remains in how to comprehensively evaluate these models’ trustworthiness in real-world software engineering scenarios. Existing benchmarks suffer from limited task scope and fail to incorporate critical evaluation aspects such as the robustness and reliability of models. To bridge this gap, we present an evaluation framework called TREAT (Code LLMs Trustworthiness / Reliability Evaluation And Testing) that provides a holistic assessment of model performance in code intelligence tasks. Our evaluation framework addresses key limitations in existing approaches with four main improvements: (1) Multi-Task Holistic Evaluation that spans diverse software engineering activities rather than limited coding tasks; (2) Multi-Language and Multi-Modality Assessment that extends beyond traditional single-language, text-only benchmarks to include multi-modality coding tasks; (3) Robustness Assessment that evaluates model reliability under semantically-preserving code transformations; and (4) Rigorous Evaluation Methodology that enhances the trustworthiness of evaluation results through diverse evaluation prompts and adaptive solution extraction. Based on this evaluation framework, we assess 26 state-of-the-art models and uncover both their strengths and limitations, yielding several key insights:(1) Current models show substantial performance variation across programming tasks; (2) Multi-modal language models demonstrate specific performance limitations in UI code generation and edit;
Paper & Project Links
Summary
Large foundation models are fundamentally transforming software engineering, showing exceptional capabilities in tasks such as code generation, debugging, and testing. However, existing approaches leave a significant gap in comprehensively evaluating these models' trustworthiness in real-world software engineering scenarios. The authors present TREAT, an evaluation framework that provides a holistic assessment of model performance on code intelligence tasks. It addresses key limitations of prior work through multi-task holistic evaluation, multi-language and multi-modality assessment, robustness assessment under semantically-preserving code transformations, and a rigorous evaluation methodology. Using this framework, the authors assess 26 state-of-the-art models and uncover both their strengths and limitations.
Key Takeaways:
- Large foundation models show strong capabilities across diverse software engineering tasks.
- Existing evaluation methods fall short of comprehensively assessing these models' trustworthiness in real-world software engineering scenarios.
- TREAT provides a holistic evaluation of model performance on code intelligence tasks.
- It addresses prior limitations through multi-task holistic evaluation, multi-language and multi-modality assessment, robustness assessment, and a rigorous evaluation methodology (robustness checking is sketched below).
- Evaluating 26 state-of-the-art models with TREAT shows substantial performance variation across programming tasks.
- Multi-modal language models show specific limitations in UI code generation and editing.
- The results expose model strengths and weaknesses, pointing to directions for future research.
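Robustness assessment hinges on semantically-preserving code transformations: rewriting a program's surface form without changing its behavior and checking whether the model's output quality holds. Here is a minimal sketch of one such transformation (consistent identifier renaming) using Python's standard library; it illustrates the technique, not TREAT's actual tooling.

```python
import ast

class RenameVariables(ast.NodeTransformer):
    """Consistently rename variable names; behavior is unchanged, so any drop
    in a model's performance on the rewritten code indicates sensitivity to
    surface form rather than to semantics."""
    def __init__(self):
        self.mapping: dict[str, str] = {}

    def visit_Name(self, node: ast.Name):
        if node.id not in self.mapping:
            self.mapping[node.id] = f"var_{len(self.mapping)}"
        node.id = self.mapping[node.id]
        return node

src = "total = 0\nfor price in prices:\n    total = total + price\n"
tree = RenameVariables().visit(ast.parse(src))
print(ast.unparse(tree))
# total -> var_0, price -> var_1, prices -> var_2; the program is equivalent.
```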

Integrating Performance Tools in Model Reasoning for GPU Kernel Optimization
Authors:Daniel Nichols, Konstantinos Parasyris, Charles Jekel, Abhinav Bhatele, Harshitha Menon
Language models are now prevalent in software engineering with many developers using them to automate tasks and accelerate their development. While language models have been tremendous at accomplishing complex software engineering tasks, there are still many areas where they fail to deliver desirable results, for instance code performance related tasks. Tasks like optimization depend on many complex data from the environment, hardware, etc. that are not directly represented in source code. Recent efforts have seen large improvements in general code modeling tasks using chain-of-thought style reasoning, but these models still fail to comprehend how the environment interacts with code performance. In this paper we propose a methodology to train language models that can interact with performance tools during their reasoning process. We then demonstrate how this methodology can be used to train a state-of-the-art GPU kernel optimization model.
Paper & Project Links
Summary
Language models are now widely used in software engineering to automate tasks and accelerate development, but they still fall short on code-performance-related tasks. Optimization depends on complex data from the environment, hardware, and so on that is not directly represented in source code, which limits the models' ability to reason about it. This paper proposes a methodology for training language models that can interact with performance tools during their reasoning process, and demonstrates how it can be used to train a state-of-the-art GPU kernel optimization model.
Key Takeaways
- Language models are widely used in software engineering to automate tasks and accelerate development.
- They still underperform on code-performance-related tasks.
- Environment and hardware data are not directly represented in source code, limiting the models' reasoning.
- Recent chain-of-thought style reasoning has improved general code modeling tasks.
- The paper proposes a methodology for training language models that interact with performance tools during reasoning (see the sketch below).
- The methodology is used to train a state-of-the-art GPU kernel optimization model.
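The core idea is a reasoning loop in which the model can call a profiler mid-trace and condition its next step on the measurements. Below is a hedged sketch of such a loop; the tool-call markup, the `bench_harness.py` profiling stub, the `model.generate` interface, and the stopping rule are all hypothetical assumptions, not the paper's actual protocol.

```python
import subprocess

def profile_kernel(source_path: str) -> str:
    """Stand-in for a real profiler invocation; here we just run a hypothetical
    benchmarking harness script and return its text output."""
    result = subprocess.run(["python", "bench_harness.py", source_path],
                            capture_output=True, text=True)
    return result.stdout  # e.g. "runtime_ms=3.42 occupancy=0.61"

def optimize_with_tools(model, kernel_src: str, max_rounds: int = 4) -> str:
    """Interleave model reasoning with profiler feedback until the model
    stops requesting measurements."""
    transcript = f"Optimize this GPU kernel:\n{kernel_src}\n"
    for _ in range(max_rounds):
        step = model.generate(transcript)           # hypothetical interface
        transcript += step
        if "<profile>" not in step:                 # assumed tool-call marker
            break
        path = step.split("<profile>")[1].split("</profile>")[0]
        transcript += f"\n<result>{profile_kernel(path)}</result>\n"
    return transcript
```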

GACO-CAD: Geometry-Augmented and Conciseness-Optimized CAD Model Generation from Single Image
Authors:Yinghui Wang, Xinyu Zhang, Peng Du
Generating editable, parametric CAD models from a single image holds great potential to lower the barriers of industrial concept design. However, current multi-modal large language models (MLLMs) still struggle with accurately inferring 3D geometry from 2D images due to limited spatial reasoning capabilities. We address this limitation by introducing GACO-CAD, a novel two-stage post-training framework. It is designed to achieve a joint objective: simultaneously improving the geometric accuracy of the generated CAD models and encouraging the use of more concise modeling procedures. First, during supervised fine-tuning, we leverage depth and surface normal maps as dense geometric priors, combining them with the RGB image to form a multi-channel input. In the context of single-view reconstruction, these priors provide complementary spatial cues that help the MLLM more reliably recover 3D geometry from 2D observations. Second, during reinforcement learning, we introduce a group length reward that, while preserving high geometric fidelity, promotes the generation of more compact and less redundant parametric modeling sequences. A simple dynamic weighting strategy is adopted to stabilize training. Experiments on the DeepCAD and Fusion360 datasets show that GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, consistently outperforming existing methods in terms of code validity, geometric accuracy, and modeling conciseness.
Paper & Project Links
Summary
The paper presents GACO-CAD, a novel two-stage post-training framework that improves the geometric accuracy of editable, parametric CAD models generated from a single image while encouraging more concise modeling procedures. During supervised fine-tuning, depth and surface normal maps are used as dense geometric priors and combined with the RGB image into a multi-channel input, giving the MLLM complementary spatial cues for recovering 3D geometry from 2D observations. During reinforcement learning, a group length reward promotes compact, less redundant parametric modeling sequences while preserving high geometric fidelity, with a simple dynamic weighting strategy to stabilize training. On the DeepCAD and Fusion360 datasets, GACO-CAD achieves state-of-the-art performance under the same MLLM backbone, outperforming existing methods in code validity, geometric accuracy, and modeling conciseness.
Key Takeaways
- GACO-CAD improves the geometric accuracy of CAD models generated from a single image and streamlines the modeling procedure.
- Depth and surface normal maps serve as dense geometric priors that improve geometric accuracy.
- Training has two stages: supervised fine-tuning, then reinforcement learning with a group length reward that encourages conciseness (see the sketch below).
- GACO-CAD outperforms existing methods on the DeepCAD and Fusion360 datasets.
- Evaluation covers three criteria: code validity, geometric accuracy, and modeling conciseness.
- The framework is compatible with existing MLLMs, achieving state-of-the-art performance under the same backbone.
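The abstract couples a geometric-fidelity reward with a group length reward under dynamic weighting. The snippet below is a hedged sketch of how such a combined reward could look; the fidelity term, the length normalization within the group, and the ramp-up weighting schedule are assumptions, not the paper's formulation.

```python
import statistics

def group_reward(fidelities: list[float], lengths: list[int], w_len: float):
    """Per-sample reward = geometric fidelity + w_len * length bonus, where the
    bonus favors modeling sequences shorter than their group's average."""
    mean_len = statistics.mean(lengths)
    return [f + w_len * (mean_len - n) / max(mean_len, 1.0)
            for f, n in zip(fidelities, lengths)]

def dynamic_weight(step: int, total_steps: int, w_max: float = 0.3) -> float:
    """Assumed schedule: ramp the length weight up as fidelity stabilizes."""
    return w_max * min(1.0, step / (0.5 * total_steps))

fid = [0.92, 0.88, 0.95, 0.90]          # e.g. Chamfer-distance-based scores
lens = [140, 90, 210, 120]              # tokens in each parametric sequence
print(group_reward(fid, lens, dynamic_weight(step=800, total_steps=1000)))
```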

Rethinking On-policy Optimization for Query Augmentation
Authors:Zhichao Xu, Shengyao Zhuang, Xueguang Ma, Bingsen Chen, Yijun Tian, Fengran Mo, Jie Cao, Vivek Srikumar
Recent advances in large language models (LLMs) have led to a surge of interest in query augmentation for information retrieval (IR). Two main approaches have emerged. The first prompts LLMs to generate answers or pseudo-documents that serve as new queries, relying purely on the model’s parametric knowledge or contextual information. The second applies reinforcement learning (RL) to fine-tune LLMs for query rewriting, directly optimizing retrieval metrics. While having respective advantages and limitations, the two approaches have not been compared under consistent experimental conditions. In this work, we present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks, including evidence-seeking, ad hoc, and tool retrieval. Our key finding is that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially when using powerful LLMs. Motivated by this discovery, we introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which, instead of rewriting a query, the LLM policy learns to generate a pseudo-document that maximizes retrieval performance, thus merging the flexibility and generative structure of prompting with the targeted optimization of RL. We show OPQE outperforms both standalone prompting and RL-based rewriting, demonstrating that a synergistic approach yields the best results. Our implementation is made available to facilitate reproducibility.
Paper & Project Links
Summary
Recent advances in large language models (LLMs) have spurred interest in query augmentation for information retrieval (IR). This work presents the first systematic comparison of prompting-based and RL-based query augmentation under consistent experimental conditions, finding that simple, training-free query augmentation often performs on par with, or even surpasses, more expensive RL-based counterparts, especially with powerful LLMs. Motivated by this, the authors introduce On-policy Pseudo-document Query Expansion (OPQE), a hybrid method in which the LLM policy learns to generate a pseudo-document that maximizes retrieval performance rather than rewriting the query. OPQE outperforms both standalone prompting and RL-based rewriting, showing that a synergistic approach yields the best results. The implementation is released to facilitate reproducibility.
Key Takeaways
- Query augmentation for IR has two main approaches: prompting LLMs to generate answers or pseudo-documents that serve as new queries, and RL fine-tuning for query rewriting that directly optimizes retrieval metrics.
- This is the first systematic comparison of the two approaches under consistent experimental conditions, spanning evidence-seeking, ad hoc, and tool retrieval benchmarks.
- Simple, training-free query augmentation often matches or surpasses more expensive RL-based methods, especially with powerful LLMs, suggesting practitioners should weigh simplicity and cost-effectiveness alongside raw performance.
- OPQE merges the flexibility and generative structure of prompting with the targeted optimization of RL by learning to generate pseudo-documents that maximize retrieval performance (see the sketch below).
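OPQE trains the policy on-policy: sample a pseudo-document for a query, append it to the query, retrieve, and reward the sample by the retrieval metric it achieves. Below is a minimal sketch of one rollout-and-reward step; the `policy`/`retriever` interfaces, the recall@k reward, and the concatenation scheme are assumptions for illustration.

```python
def opqe_rollout(policy, retriever, query: str, relevant_ids: set[str], k: int = 10):
    """Sample a pseudo-document, expand the query with it, and score the
    expansion by a retrieval metric (recall@k here; the actual choice of
    metric and expansion format is an assumption)."""
    pseudo_doc = policy.generate(
        f"Write a passage that answers: {query}")          # hypothetical API
    expanded = f"{query} {pseudo_doc}"                     # simple concatenation
    retrieved = retriever.search(expanded, top_k=k)        # hypothetical API
    hits = sum(1 for doc_id in retrieved if doc_id in relevant_ids)
    reward = hits / max(1, min(k, len(relevant_ids)))      # recall@k as reward
    return pseudo_doc, reward   # (sample, scalar reward) for the policy update
```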

Structured Debate Improves Corporate Credit Reasoning in Financial AI
Authors:Yoonjin Lee, Munhee Kim, Hanbi Choi, Juhyeon Park, Seungho Lyoo, Woojin Park
Despite advances in financial AI, the automation of evidence-based reasoning remains unresolved in corporate credit assessment, where qualitative non-financial indicators exert decisive influence on loan repayment outcomes yet resist formalization. Existing approaches focus predominantly on numerical prediction and provide limited support for the interpretive judgments required in professional loan evaluation. This study develops and evaluates two operational large language model (LLM)-based systems designed to generate structured reasoning from non-financial evidence. The first is a non-adversarial single-agent system (NAS) that produces bidirectional analysis through a single-pass reasoning pipeline. The second is a debate-based multi-agent system (KPD-MADS) that operationalizes adversarial verification through a ten-step structured interaction protocol grounded in Karl Popper’s critical dialogue framework. Both systems were applied to three real corporate cases and evaluated by experienced credit risk professionals. Compared to manual expert reporting, both systems achieved substantial productivity gains (NAS: 11.55 s per case; KPD-MADS: 91.97 s; human baseline: 1920 s). The KPD-MADS demonstrated superior reasoning quality, receiving higher median ratings in explanatory adequacy (4.0 vs. 3.0), practical applicability (4.0 vs. 3.0), and usability (62.5 vs. 52.5). These findings show that structured multi-agent interaction can enhance reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.
Paper & Project Links
PDF 18 pages, 4 figures, 2 algorithms, 2 tables, 4 appendices, will be submitted to AAAI-2026 workshop
Summary
This study targets corporate credit assessment, where qualitative non-financial indicators decisively influence loan repayment outcomes yet resist formalization, leaving evidence-based reasoning hard to automate, and where existing methods focus on numerical prediction with limited support for interpretive judgment. The authors develop and evaluate two operational LLM-based systems that generate structured reasoning from non-financial evidence. The first is a non-adversarial single-agent system (NAS) that produces bidirectional analysis through a single-pass reasoning pipeline. The second is a debate-based multi-agent system (KPD-MADS) that operationalizes adversarial verification through a ten-step structured interaction protocol grounded in Karl Popper's critical dialogue framework. Applied to three real corporate cases and rated by experienced credit risk professionals, both systems achieved substantial productivity gains over manual expert reporting, and KPD-MADS showed superior reasoning quality, with higher median ratings in explanatory adequacy, practical applicability, and usability. Structured multi-agent interaction can thus enhance reasoning rigor and interpretability in financial AI, advancing scalable and defensible automation in corporate credit assessment.
Key Takeaways
- In corporate credit assessment, qualitative non-financial indicators decisively influence loan outcomes yet resist formalization, making evidence-based reasoning hard to automate.
- The study builds two LLM-based systems that generate structured reasoning from non-financial evidence.
- The non-adversarial single-agent system (NAS) produces bidirectional analysis through a single-pass reasoning pipeline, enabling fast analysis.
- The debate-based multi-agent system (KPD-MADS) simulates adversarial verification through a structured interaction protocol that improves reasoning quality (see the sketch below).
- Both systems deliver large productivity gains over manual expert reporting, with KPD-MADS rated higher on explanatory adequacy.
- Structured multi-agent interaction strengthens the reasoning rigor and interpretability of financial AI.
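The abstract specifies a ten-step adversarial protocol grounded in Karl Popper's critical dialogue framework but not its content. The following is a hedged sketch of a Popper-style claim/critique/revision loop between two agents; the roles, round count, prompts, and `.generate` interface are assumptions for illustration, not the paper's actual protocol.

```python
def kpd_style_debate(proponent, critic, evidence: str, rounds: int = 3) -> list[str]:
    """Alternate claim -> critique -> revision over non-financial evidence;
    a simplified stand-in for the paper's ten-step structured protocol."""
    transcript = []
    claim = proponent.generate(f"Assess credit risk from this evidence:\n{evidence}")
    transcript.append(f"CLAIM: {claim}")
    for _ in range(rounds):
        critique = critic.generate(
            f"Find the weakest point in this assessment and attack it:\n{claim}")
        transcript.append(f"CRITIQUE: {critique}")
        claim = proponent.generate(
            f"Revise the assessment to answer this critique:\n{critique}\n"
            f"Original assessment:\n{claim}")
        transcript.append(f"REVISION: {claim}")
    return transcript  # the final revision is the defended assessment
```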
