⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: do not rely on them in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-08
SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning
Authors:Zhenghai Xue, Longtao Zheng, Qian Liu, Yingru Li, Xiaosen Zheng, Zejun Ma, Bo An
Large Language Models (LLMs) can significantly improve their reasoning capabilities by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). However, extending TIR to multi-turn scenarios using Reinforcement Learning (RL) is often hindered by training instability and performance collapse. We identify that such instability is primarily caused by a distributional drift from external tool feedback, leading to the generation of low-probability tokens. This issue compounds over successive turns, causing catastrophic gradient norm explosions that derail the training process. To address this challenge, we introduce SimpleTIR , a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. By removing these problematic trajectories from the policy update, SimpleTIR effectively blocks the harmful, high-magnitude gradients, thus stabilizing the learning dynamics. Extensive experiments show that SimpleTIR achieves state-of-the-art performance on challenging math reasoning benchmarks, notably elevating the AIME24 score from a text-only baseline of 22.1 to 50.5 when starting from the Qwen2.5-7B base model. Furthermore, by avoiding the constraints of supervised fine-tuning, SimpleTIR encourages the model to discover diverse and sophisticated reasoning patterns, such as self-correction and cross-validation.
Paper and Project Links
Summary
Large Language Models (LLMs) can substantially improve their reasoning ability by interacting with external tools, a paradigm known as Tool-Integrated Reasoning (TIR). Extending TIR to multi-turn settings with Reinforcement Learning (RL), however, often suffers from training instability and performance collapse. The paper attributes this instability to distributional drift introduced by external tool feedback, which leads to low-probability tokens; the effect compounds over successive turns and causes catastrophic gradient norm explosions that derail training. To address this, the authors propose SimpleTIR, a plug-and-play algorithm that stabilizes multi-turn TIR training. Its core strategy is to identify and filter out trajectories containing void turns, i.e., turns that yield neither a code block nor a final answer. Removing these trajectories from the policy update blocks the harmful, high-magnitude gradients and stabilizes learning. Experiments show state-of-the-art results on challenging math reasoning benchmarks; starting from the Qwen2.5-7B base model, SimpleTIR raises the AIME24 score from a text-only baseline of 22.1 to 50.5. By avoiding the constraints of supervised fine-tuning, it also encourages diverse, sophisticated reasoning patterns such as self-correction and cross-validation.
Key Takeaways
- LLMs can improve their reasoning by interacting with external tools via Tool-Integrated Reasoning (TIR).
- Multi-turn TIR training with RL suffers from instability and performance collapse.
- The instability stems mainly from distributional drift caused by external tool feedback, which yields low-probability tokens.
- SimpleTIR stabilizes multi-turn TIR training by identifying and filtering trajectories that contain void turns (a minimal sketch of this filter follows below).
- SimpleTIR achieves state-of-the-art results on challenging math reasoning benchmarks.
- On AIME24, SimpleTIR raises the score from the text-only baseline of 22.1 to 50.5.
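To make the filtering idea concrete, below is a minimal sketch, not the authors' code, of how void turns could be detected and offending trajectories dropped before a policy update. The turn schema, the fenced code-block delimiter, and the \boxed{} final-answer marker are all illustrative assumptions.

```python
import re

CODE_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)   # assumed tool-call format (fenced code)
FINAL_ANSWER = re.compile(r"\\boxed\{.+?\}")         # assumed final-answer marker

def is_void_turn(turn_text: str) -> bool:
    """A void turn yields neither a code block nor a final answer."""
    return not (CODE_BLOCK.search(turn_text) or FINAL_ANSWER.search(turn_text))

def filter_trajectories(trajectories):
    """Drop any trajectory containing a void model turn so that it never
    contributes gradients to the policy update.

    Each trajectory is assumed to be a list of dicts with 'role'
    ('assistant' or 'tool') and 'text' keys.
    """
    kept = []
    for traj in trajectories:
        model_turns = [t["text"] for t in traj if t["role"] == "assistant"]
        if model_turns and all(not is_void_turn(t) for t in model_turns):
            kept.append(traj)
    return kept   # only these trajectories enter the policy update
```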

Unifi3D: A Study on 3D Representations for Generation and Reconstruction in a Common Framework
Authors:Nina Wiedemann, Sainan Liu, Quentin Leboutet, Katelyn Gao, Benjamin Ummenhofer, Michael Paulitsch, Kai Yuan
Following rapid advancements in text and image generation, research has increasingly shifted towards 3D generation. Unlike the well-established pixel-based representation in images, 3D representations remain diverse and fragmented, encompassing a wide variety of approaches such as voxel grids, neural radiance fields, signed distance functions, point clouds, or octrees, each offering distinct advantages and limitations. In this work, we present a unified evaluation framework designed to assess the performance of 3D representations in reconstruction and generation. We compare these representations based on multiple criteria: quality, computational efficiency, and generalization performance. Beyond standard model benchmarking, our experiments aim to derive best practices over all steps involved in the 3D generation pipeline, including preprocessing, mesh reconstruction, compression with autoencoders, and generation. Our findings highlight that reconstruction errors significantly impact overall performance, underscoring the need to evaluate generation and reconstruction jointly. We provide insights that can inform the selection of suitable 3D models for various applications, facilitating the development of more robust and application-specific solutions in 3D generation. The code for our framework is available at https://github.com/isl-org/unifi3d.
Paper and Project Links
Summary
Following rapid progress in text and image generation, research has increasingly shifted toward 3D generation. Unlike the well-established pixel representation for images, 3D representations remain diverse and fragmented, spanning voxel grids, neural radiance fields, signed distance functions, point clouds, and octrees, each with distinct strengths and limitations. This work presents a unified evaluation framework for assessing 3D representations in both reconstruction and generation, comparing them on quality, computational efficiency, and generalization. Beyond standard benchmarking, the experiments derive best practices across the full 3D generation pipeline, including preprocessing, mesh reconstruction, autoencoder-based compression, and generation. The results show that reconstruction errors strongly affect overall performance, underscoring the need to evaluate generation and reconstruction jointly. The findings can guide the selection of suitable 3D representations for different applications and support more robust, application-specific 3D generation.
Key Takeaways
- Research has increasingly moved from text and image generation toward 3D generation.
- 3D representations are diverse (voxel grids, neural radiance fields, signed distance functions, point clouds, octrees), each with its own strengths and limitations.
- A unified evaluation framework is proposed to assess 3D representations for reconstruction and generation.
- Comparisons are based on quality, computational efficiency, and generalization performance.
- The experiments aim to derive best practices across the whole 3D generation pipeline.
- Reconstruction errors have a major impact on overall performance, so generation and reconstruction should be evaluated jointly.

DCPO: Dynamic Clipping Policy Optimization
Authors:Shihui Yang, Chengfeng Dou, Peidong Guo, Kai Lu, Qiang Ju, Fei Deng, Rihui Xin
Reinforcement Learning from Verifiable Rewards (RLVR) has emerged as a promising framework for enhancing the reasoning capabilities of large language models. However, existing approaches such as GRPO often suffer from zero gradients. This problem arises primarily due to fixed clipping bounds for token-level probability ratios and the standardization of identical rewards, which can lead to ineffective gradient updates and underutilization of generated responses. In this work, we propose Dynamic Clipping Policy Optimization (DCPO), which introduces a dynamic clipping strategy that adaptively adjusts the clipping bounds based on token-specific prior probabilities to enhance token-level exploration, and a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve the response-level effective utilization of generated responses. DCPO achieved state-of-the-art performance on four benchmarks based on four different models. In particular, DCPO achieved an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32 times sampling on the AIME24 benchmark, surpassing both DAPO (36.7/31.6) and GRPO (36.7/32.1) on the Qwen2.5-Math-7B model. On the AIME25 benchmark based on Qwen2.5-14B, DCPO achieves a performance of (23.3/19.0), surpassing GRPO (13.3/10.5) and DAPO (20.0/15.3). Furthermore, DCPO achieved an average 28% improvement in the nonzero advantage over GRPO in four models, doubled the training efficiency over DAPO, and significantly reduced the token clipping ratio by an order of magnitude compared to both GRPO and DAPO, while achieving superior performance. These results highlight DCPO’s effectiveness in leveraging generated data more efficiently for reinforcement learning in large language models.
Paper and Project Links
Summary
Reinforcement Learning from Verifiable Rewards (RLVR) is a promising framework for improving the reasoning ability of large language models, but existing approaches such as GRPO often suffer from zero gradients. The problem stems mainly from fixed clipping bounds on token-level probability ratios and from standardizing identical rewards, which leads to ineffective gradient updates and under-utilization of generated responses. This work proposes Dynamic Clipping Policy Optimization (DCPO), which introduces (i) a dynamic clipping strategy that adapts the clipping bounds to token-specific prior probabilities to improve token-level exploration, and (ii) a smooth advantage standardization technique that standardizes rewards across cumulative training steps to improve response-level utilization. DCPO reaches state-of-the-art results on four benchmarks across four models. On AIME24 with Qwen2.5-Math-7B it achieves an Avg@1 of 46.7 under greedy decoding and an Avg@32 of 38.8 under 32-sample decoding, surpassing DAPO (36.7/31.6) and GRPO (36.7/32.1). On AIME25 with Qwen2.5-14B it reaches (23.3/19.0), ahead of GRPO (13.3/10.5) and DAPO (20.0/15.3). DCPO also improves the nonzero advantage by 28% on average over GRPO across the four models, doubles training efficiency relative to DAPO, and reduces the token clipping ratio by an order of magnitude compared with both.
Key Takeaways
- RLVR strengthens the reasoning ability of large language models.
- Existing methods such as GRPO face a zero-gradient problem, caused mainly by fixed clipping bounds and reward standardization.
- DCPO addresses this with a dynamic clipping strategy and smooth advantage standardization (see the sketch below).
- DCPO reaches state-of-the-art results on several benchmarks, notably AIME24 and AIME25.
- DCPO clearly improves the nonzero advantage and training efficiency over prior methods.
- DCPO reduces the token clipping ratio by an order of magnitude.
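The abstract does not give the exact formulas, so the snippet below is only a plausible sketch of the dynamic-clipping idea: the clipping range widens for tokens whose prior probability under the old policy is low, and the result feeds a standard clipped surrogate objective. The bound schedule, the use of old-policy probabilities as priors, and all constants are assumptions.

```python
import numpy as np

def dynamic_clip_bounds(prior_prob, base_eps=0.2, scale=0.5):
    """Token-wise clipping bounds that widen for low-prior tokens.

    This schedule is an illustrative assumption, not the paper's formula:
    rarer tokens (low prior probability) get a wider clipping range,
    allowing more exploration at the token level.
    """
    eps = base_eps + scale * (1.0 - prior_prob)
    return 1.0 - eps, 1.0 + eps

def clipped_surrogate(logp_new, logp_old, advantages, prior_probs):
    """PPO/GRPO-style clipped objective with per-token dynamic bounds."""
    ratio = np.exp(logp_new - logp_old)
    lo, hi = dynamic_clip_bounds(prior_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, lo, hi) * advantages
    return np.minimum(unclipped, clipped).mean()   # objective to maximize

# Toy usage with made-up numbers.
logp_old = np.array([-1.2, -0.3, -2.5])
logp_new = np.array([-1.0, -0.4, -1.8])
adv      = np.array([0.5, -0.2, 1.0])
priors   = np.exp(logp_old)            # here: old-policy probability as the prior
print(clipped_surrogate(logp_new, logp_old, adv, priors))
```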

Baichuan-M2: Scaling Medical Capability with Large Verifier System
Authors:Baichuan-M2 Team, :, Chengfeng Dou, Chong Liu, Fan Yang, Fei Li, Jiyuan Jia, Mingyang Chen, Qiang Ju, Shuai Wang, Shunya Dang, Tianpeng Li, Xiangrong Zeng, Yijie Zhou, Chenzheng Zhu, Da Pan, Fei Deng, Guangwei Ai, Guosheng Dong, Hongda Zhang, Jinyang Tai, Jixiang Hong, Kai Lu, Linzhuang Sun, Peidong Guo, Qian Ma, Rihui Xin, Shihui Yang, Shusen Zhang, Yichuan Mo, Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, Xiaochuan Wang
As large language models (LLMs) advance in conversational and reasoning capabilities, their practical application in healthcare has become a critical research focus. However, there is a notable gap between the performance of medical LLMs on static benchmarks such as USMLE and their utility in real-world clinical decision-making. This discrepancy arises because traditional exams fail to capture the dynamic, interactive nature of medical consultations. To address this challenge, we introduce a novel dynamic verification framework that moves beyond static answer verifier, establishing a large-scale, high-fidelity interactive reinforcement learning system. Our framework comprises two key components: a Patient Simulator that creates realistic clinical environments using de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. Building on this foundation, we develop Baichuan-M2, a 32B-parameter medical augmented reasoning model trained through a multi-stage reinforcement learning strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. Evaluated on HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source counterparts, achieving a score above 32 on the challenging HealthBench Hard benchmark-previously exceeded only by GPT-5. Our work demonstrates that robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications, establishing a new Pareto front in the performance-parameter trade-off for medical AI deployment.
Paper and Project Links
PDF Baichuan-M2 Technical Report
Summary
As large language models (LLMs) advance in conversation and reasoning, their practical use in healthcare has become a key research focus, yet there is a notable gap between performance on static benchmarks such as USMLE and utility in real clinical decision-making, because traditional exams do not capture the dynamic, interactive nature of medical consultations. To address this, the authors introduce a dynamic verification framework that goes beyond static answer verifiers, building a large-scale, high-fidelity interactive reinforcement learning system with two key components: a Patient Simulator that creates realistic clinical environments from de-identified medical records, and a Clinical Rubrics Generator that dynamically produces multi-dimensional evaluation metrics. On this foundation they train Baichuan-M2, a 32B-parameter medical augmented-reasoning model, using a multi-stage RL strategy with an improved Group Relative Policy Optimization (GRPO) algorithm. On HealthBench, Baichuan-M2 outperforms all other open-source models and most advanced closed-source ones, scoring above 32 on the challenging HealthBench Hard benchmark, a level previously exceeded only by GPT-5. The work shows that a robust dynamic verifier system is essential for aligning LLM capabilities with practical clinical applications and establishes a new Pareto front in the performance-parameter trade-off for medical AI deployment.
Key Takeaways
- There is a clear gap between medical LLM performance on static benchmarks and real-world clinical utility.
- Traditional exams fail to capture the dynamic, interactive nature of medical consultations.
- The dynamic verification framework has two key components: a Patient Simulator and a Clinical Rubrics Generator (a sketch of such a loop follows below).
- The framework forms a large-scale, high-fidelity interactive reinforcement learning system.
- Baichuan-M2 outperforms other open-source models on HealthBench and scores above 32 on HealthBench Hard, previously exceeded only by GPT-5.
- Robust dynamic verification is essential for aligning LLM capabilities with clinical applications.
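As an illustration of how a dynamic verifier can turn a simulated consultation into a scalar RL reward, here is a minimal loop under assumed interfaces; none of the function signatures come from the technical report, and the real Patient Simulator and Clinical Rubrics Generator are full models rather than simple callables.

```python
from typing import Callable, List

def rollout_consultation(policy: Callable[[List[str]], str],
                         patient_reply: Callable[[List[str]], str],
                         make_rubrics: Callable[[List[str]], List[str]],
                         grade: Callable[[List[str], str], float],
                         max_turns: int = 8) -> float:
    """Minimal sketch of a dynamic-verifier rollout (assumed interfaces).

    `policy` is the medical LLM, `patient_reply` the patient simulator built
    from de-identified records, `make_rubrics` the clinical rubrics generator,
    and `grade` scores the dialogue against one rubric in [0, 1].
    """
    dialogue: List[str] = ["Patient: <initial complaint from a case record>"]
    for _ in range(max_turns):
        doctor_turn = policy(dialogue)              # question, exam order, or plan
        dialogue.append("Doctor: " + doctor_turn)
        dialogue.append("Patient: " + patient_reply(dialogue))

    rubrics = make_rubrics(dialogue)                # multi-dimensional criteria
    scores = [grade(dialogue, r) for r in rubrics]
    return sum(scores) / max(len(scores), 1)        # scalar reward for RL (e.g. GRPO)
```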

Understanding Space Is Rocket Science – Only Top Reasoning Models Can Solve Spatial Understanding Tasks
Authors:Nils Hoehing, Mayug Maniparambil, Ellen Rushe, Noel E. O’Connor, Anthony Ventresque
We propose RocketScience, an open-source contrastive VLM benchmark that tests for spatial relation understanding. It is comprised of entirely new real-world image-text pairs covering mostly relative spatial understanding and the order of objects. The benchmark is designed to be very easy for humans and hard for the current generation of VLMs, and this is empirically verified. Our results show a striking lack of spatial relation understanding in open source and frontier commercial VLMs and a surprisingly high performance of reasoning models. Additionally, we perform a disentanglement analysis to separate the contributions of object localization and spatial reasoning in chain-of-thought-based models and find that the performance on the benchmark is bottlenecked by spatial reasoning and not object localization capabilities. We release the dataset with a CC-BY-4.0 license and make the evaluation code available at: https://github.com/nilshoehing/rocketscience
Paper and Project Links
Summary
The paper proposes RocketScience, an open-source contrastive VLM benchmark that tests spatial relation understanding. It consists of entirely new real-world image-text pairs, covering mostly relative spatial understanding and object ordering. The benchmark is designed to be very easy for humans yet hard for the current generation of VLMs, which is verified empirically. The results show a striking lack of spatial relation understanding in open-source and frontier commercial VLMs, and surprisingly strong performance from reasoning models. A disentanglement analysis separating object localization from spatial reasoning in chain-of-thought models shows that performance is bottlenecked by spatial reasoning rather than localization. The dataset is released under a CC-BY-4.0 license and the evaluation code is available at https://github.com/nilshoehing/rocketscience.
Key Takeaways
- RocketScience is an open-source benchmark for testing spatial relation understanding in VLMs.
- It consists of real-world image-text pairs focused on relative spatial understanding and object order.
- The benchmark is easy for humans but hard for current VLMs.
- Open-source and commercial VLMs show a clear lack of spatial relation understanding.
- Reasoning models perform surprisingly well.
- The bottleneck is spatial reasoning capability, not object localization.

Omnidirectional Spatial Modeling from Correlated Panoramas
Authors:Xinshen Zhang, Tongxi Fu, Xu Zheng
Omnidirectional scene understanding is vital for various downstream applications, such as embodied AI, autonomous driving, and immersive environments, yet remains challenging due to geometric distortion and complex spatial relations in 360{\deg} imagery. Existing omnidirectional methods achieve scene understanding within a single frame while neglecting cross-frame correlated panoramas. To bridge this gap, we introduce \textbf{CFpano}, the \textbf{first} benchmark dataset dedicated to cross-frame correlated panoramas visual question answering in the holistic 360{\deg} scenes. CFpano consists of over 2700 images together with over 8000 question-answer pairs, and the question types include both multiple choice and open-ended VQA. Building upon our CFpano, we further present \methodname, a multi-modal large language model (MLLM) fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust and consistent reasoning with cross-frame correlated panoramas. Benchmark experiments with existing MLLMs are conducted with our CFpano. The experimental results demonstrate that \methodname achieves state-of-the-art performance across both multiple-choice and open-ended VQA tasks, outperforming strong baselines on all major reasoning categories (\textbf{+5.37%} in overall performance). Our analyses validate the effectiveness of GRPO and establish a new benchmark for panoramic scene understanding.
Paper and Project Links
Summary
Omnidirectional scene understanding matters for downstream applications such as embodied AI, autonomous driving, and immersive environments, but it remains challenging due to geometric distortion and complex spatial relations in 360° imagery. Existing omnidirectional methods handle a single frame and neglect cross-frame correlated panoramas. To bridge this gap, the paper introduces CFpano, the first benchmark dataset for visual question answering over cross-frame correlated panoramas in holistic 360° scenes, with over 2,700 images and over 8,000 question-answer pairs covering both multiple-choice and open-ended VQA. Building on CFpano, the authors present a multi-modal large language model fine-tuned with Group Relative Policy Optimization (GRPO) and a set of tailored reward functions for robust, consistent reasoning over cross-frame correlated panoramas. Benchmark experiments against existing MLLMs show state-of-the-art performance on both multiple-choice and open-ended VQA, outperforming strong baselines on all major reasoning categories (+5.37% overall).
Key Takeaways
- Omnidirectional scene understanding is important for downstream applications such as embodied AI, autonomous driving, and immersive environments.
- Existing methods focus on single-frame panoramic understanding and neglect cross-frame correlated panoramas.
- CFpano is a new benchmark dedicated to cross-frame correlated panorama VQA.
- CFpano contains a large set of images and question-answer pairs, covering multiple-choice and open-ended questions.
- The proposed MLLM is fine-tuned with GRPO and tailored reward functions for robust cross-frame reasoning.
- Experiments show state-of-the-art panoramic scene understanding, outperforming strong baselines.

DeepSeek performs better than other Large Language Models in Dental Cases
Authors:Hexian Zhang, Xinyu Yan, Yanqi Yang, Lijian Jin, Ping Yang, Junwen Wang
Large language models (LLMs) hold transformative potential in healthcare, yet their capacity to interpret longitudinal patient narratives remains inadequately explored. Dentistry, with its rich repository of structured clinical data, presents a unique opportunity to rigorously assess LLMs’ reasoning abilities. While several commercial LLMs already exist, DeepSeek, a model that gained significant attention earlier this year, has also joined the competition. This study evaluated four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) on their ability to analyze longitudinal dental case vignettes through open-ended clinical tasks. Using 34 standardized longitudinal periodontal cases (comprising 258 question-answer pairs), we assessed model performance via automated metrics and blinded evaluations by licensed dentists. DeepSeek emerged as the top performer, demonstrating superior faithfulness (median score = 0.528 vs. 0.367-0.457) and higher expert ratings (median = 4.5/5 vs. 4.0/5), without significantly compromising readability. Our study positions DeepSeek as the leading LLM for case analysis, endorses its integration as an adjunct tool in both medical education and research, and highlights its potential as a domain-specific agent.
Paper and Project Links
PDF Abstract word count: 171; Total word count: 3130; Total number of tables: 2; Total number of figures: 3; Number of references: 32
Summary
Large language models (LLMs) hold transformative potential in healthcare, yet their ability to interpret longitudinal patient narratives remains under-explored. Dentistry, with its rich repository of structured clinical data, offers a unique setting to rigorously assess LLM reasoning. This study evaluated four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) on analyzing longitudinal dental case vignettes through open-ended clinical tasks, using 34 standardized longitudinal periodontal cases (258 question-answer pairs) and assessing performance with automated metrics and blinded evaluations by licensed dentists. DeepSeek was the top performer, with superior faithfulness (median 0.528 vs. 0.367-0.457) and higher expert ratings (median 4.5/5 vs. 4.0/5) without significantly compromising readability. The study positions DeepSeek as the leading LLM for case analysis, supports its integration as an adjunct tool in medical education and research, and highlights its potential as a domain-specific agent.
Key Takeaways
- The potential of LLMs in healthcare, especially for interpreting longitudinal patient narratives, is not yet fully explored.
- Dentistry's structured clinical data makes it an ideal testbed for assessing LLM reasoning.
- Four state-of-the-art LLMs (GPT-4o, Gemini 2.0 Flash, Copilot, and DeepSeek V3) were evaluated on dental cases.
- DeepSeek performed best at analyzing longitudinal dental cases.
- DeepSeek outperformed the other models on both automated metrics and blinded dentist ratings.
- DeepSeek is positioned as the leading LLM for case analysis and as an adjunct tool for medical education and research.

AutoDrive-R$^2$: Incentivizing Reasoning and Self-Reflection Capacity for VLA Model in Autonomous Driving
Authors:Zhenlong Yuan, Jing Tang, Jinguo Luo, Rui Chen, Chengxuan Qian, Lei Sun, Xiangxiang Chu, Yujun Cai, Dapeng Zhang, Shuo Li
Vision-Language-Action (VLA) models in autonomous driving systems have recently demonstrated transformative potential by integrating multimodal perception with decision-making capabilities. However, the interpretability and coherence of the decision process and the plausibility of action sequences remain largely underexplored. To address these issues, we propose AutoDrive-R$^2$, a novel VLA framework that enhances both reasoning and self-reflection capabilities of autonomous driving systems through chain-of-thought (CoT) processing and reinforcement learning (RL). Specifically, we first propose an innovative CoT dataset named nuScenesR$^2$-6K for supervised fine-tuning, which effectively builds cognitive bridges between input information and output trajectories through a four-step logical chain with self-reflection for validation. Moreover, to maximize both reasoning and self-reflection during the RL stage, we further employ the Group Relative Policy Optimization (GRPO) algorithm within a physics-grounded reward framework that incorporates spatial alignment, vehicle dynamic, and temporal smoothness criteria to ensure reliable and realistic trajectory planning. Extensive evaluation results across both nuScenes and Waymo datasets demonstrates the state-of-the-art performance and robust generalization capacity of our proposed method.
Paper and Project Links
Summary
Vision-Language-Action (VLA) models in autonomous driving integrate multimodal perception with decision-making, but the interpretability and coherence of the decision process and the plausibility of action sequences remain under-explored. The paper proposes AutoDrive-R$^2$, a VLA framework that strengthens both reasoning and self-reflection through chain-of-thought (CoT) processing and reinforcement learning. The authors first build nuScenesR$^2$-6K, a CoT dataset for supervised fine-tuning that links input information to output trajectories through a four-step logical chain with self-reflection for validation. In the RL stage, they apply Group Relative Policy Optimization (GRPO) within a physics-grounded reward framework that combines spatial alignment, vehicle dynamics, and temporal smoothness to ensure reliable, realistic trajectory planning. Evaluations on nuScenes and Waymo show state-of-the-art performance and robust generalization.
Key Takeaways
- VLA models hold strong potential for autonomous driving by combining multimodal perception with decision-making.
- Interpretability and coherence of the decision process and plausibility of action sequences are open problems.
- AutoDrive-R$^2$ combines chain-of-thought processing with RL to strengthen reasoning and self-reflection.
- The nuScenesR$^2$-6K CoT dataset is introduced for supervised fine-tuning.
- The RL stage uses GRPO with a physics-grounded reward covering spatial alignment, vehicle dynamics, and temporal smoothness (see the sketch below).
- Evaluations on nuScenes and Waymo show excellent performance and robust generalization.
- The work offers a new direction for making the decision process of autonomous driving systems more interpretable.
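A rough sketch of what a physics-grounded trajectory reward could look like is given below, combining the three criteria named in the abstract. The concrete formulas, the acceleration limit, and the weights are assumptions for illustration, not the paper's reward.

```python
import numpy as np

def trajectory_reward(pred, ref, dt=0.5, a_max=3.0, w=(1.0, 0.5, 0.5)):
    """Illustrative physics-grounded reward for a planned 2-D trajectory.

    `pred` and `ref` are (T, 2) arrays of waypoints. The three terms mirror
    the criteria in the abstract (spatial alignment, vehicle dynamics,
    temporal smoothness); the exact formulas and weights are assumptions.
    """
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)

    # 1. Spatial alignment: negative mean L2 distance to the reference path.
    align = -np.linalg.norm(pred - ref, axis=1).mean()

    # 2. Vehicle dynamics: penalize accelerations beyond a feasibility limit.
    vel = np.diff(pred, axis=0) / dt
    acc = np.diff(vel, axis=0) / dt
    dyn = -np.clip(np.linalg.norm(acc, axis=1) - a_max, 0.0, None).mean() if len(acc) else 0.0

    # 3. Temporal smoothness: penalize jerk (change of acceleration).
    jerk = np.diff(acc, axis=0) / dt if len(acc) > 1 else np.zeros((1, 2))
    smooth = -np.linalg.norm(jerk, axis=1).mean()

    return w[0] * align + w[1] * dyn + w[2] * smooth

# Toy usage: a straight reference path and a noisy predicted path.
ref = np.stack([np.linspace(0, 20, 11), np.zeros(11)], axis=1)
pred = ref + np.random.default_rng(0).normal(0, 0.2, ref.shape)
print(round(trajectory_reward(pred, ref), 3))
```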

Group Relative Policy Optimization for Speech Recognition
Authors:Prashanth Gurunath Shivakumar, Yile Gu, Ankur Gandhe, Ivan Bulyko
Speech Recognition has seen a dramatic shift towards adopting Large Language Models (LLMs). This shift is partly driven by good scalability properties demonstrated by LLMs, ability to leverage large amounts of labelled, unlabelled speech and text data, streaming capabilities with auto-regressive framework and multi-tasking with instruction following characteristics of LLMs. However, simple next-token prediction objective, typically employed with LLMs, have certain limitations in performance and challenges with hallucinations. In this paper, we propose application of Group Relative Policy Optimization (GRPO) to enable reinforcement learning from human feedback for automatic speech recognition (ASR). We design simple rule based reward functions to guide the policy updates. We demonstrate significant improvements in word error rate (upto 18.4% relative), reduction in hallucinations, increased robustness on out-of-domain datasets and effectiveness in domain adaptation.
Paper and Project Links
PDF Accepted for ASRU 2025
Summary
Speech recognition has shifted strongly toward adopting Large Language Models (LLMs), driven by their good scalability, their ability to leverage large amounts of labelled and unlabelled speech and text data, streaming with an auto-regressive framework, and multi-tasking via instruction following. However, the simple next-token prediction objective typically used with LLMs has performance limitations and is prone to hallucinations. This paper applies Group Relative Policy Optimization (GRPO) to enable reinforcement learning from human feedback for automatic speech recognition (ASR), with simple rule-based reward functions guiding the policy updates. The approach yields significant improvements in word error rate (up to 18.4% relative), fewer hallucinations, increased robustness on out-of-domain datasets, and effective domain adaptation.
Key Takeaways
- LLMs scale well for speech recognition thanks to large labelled/unlabelled speech and text data, streaming auto-regressive decoding, and instruction-following multi-tasking.
- The plain next-token prediction objective has performance limitations and suffers from hallucinations.
- Group Relative Policy Optimization (GRPO) is applied to ASR to enable reinforcement learning from human feedback.
- Simple rule-based reward functions guide the policy updates, yielding up to 18.4% relative WER reduction (see the sketch below).
- GRPO also reduces hallucinations, improves robustness on out-of-domain data, and helps domain adaptation.
- The approach opens a promising direction for addressing current challenges in speech recognition.
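The sketch below shows the two ingredients the abstract names: a rule-based reward (here simply negative word error rate) and GRPO's group-relative advantage, which standardizes rewards within a group of hypotheses sampled for the same utterance instead of training a value network. Details such as group size and any additional reward rules are assumptions.

```python
import numpy as np

def word_error_rate(ref: str, hyp: str) -> float:
    """Word-level edit distance divided by reference length (standard WER)."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
    return d[len(r), len(h)] / max(len(r), 1)

def group_relative_advantages(rewards):
    """GRPO's core idea: standardize rewards within a group of samples drawn
    for the same input, so no learned critic is needed."""
    rewards = np.asarray(rewards, float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Toy example: one utterance, a group of sampled hypotheses, rule-based reward = -WER.
reference = "turn on the kitchen lights"
hypotheses = [
    "turn on the kitchen lights",
    "turn on kitchen light",
    "turn off the kitchen lights please",
]
rewards = [-word_error_rate(reference, h) for h in hypotheses]
print(np.round(group_relative_advantages(rewards), 3))
```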

Kwai Keye-VL 1.5 Technical Report
Authors:Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Haonan Fang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model’s context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
Paper and Project Links
PDF Github page: https://github.com/Kwai-Keye/Keye
Summary
Video understanding remains challenging because videos are dynamic and information-dense, and existing models struggle with the trade-off between spatial resolution and temporal coverage. Keye-VL-1.5 addresses this with three key innovations: (1) a Slow-Fast video encoding strategy that allocates compute according to inter-frame similarity, processing key frames with significant visual change at higher resolution (Slow pathway) while handling relatively static frames at lower resolution with broader temporal coverage (Fast pathway); (2) a progressive four-stage pre-training methodology that extends the context length from 8K to 128K tokens, enabling longer videos and more complex visual content; and (3) a comprehensive post-training pipeline for reasoning enhancement and human-preference alignment, including a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. On public benchmarks and rigorous internal human assessment, Keye-VL-1.5 shows clear improvements over existing models, especially on video understanding, while staying competitive on general multimodal benchmarks.
Key Takeaways
- Large Language Models have been extended to multimodal tasks, but video understanding remains challenging.
- Existing models trade off spatial resolution against temporal coverage when processing video.
- Keye-VL-1.5 allocates compute dynamically through a Slow-Fast video encoding strategy (a sketch follows below).
- A progressive four-stage pre-training scheme extends the context length from 8K to 128K tokens for longer videos and more complex visual content.
- A post-training pipeline focuses on reasoning enhancement and human-preference alignment.
- Keye-VL-1.5 excels on video understanding tasks while staying competitive on general multimodal benchmarks.
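To illustrate the Slow-Fast idea, here is a toy routine that routes frames to a high-resolution (Slow) or low-resolution (Fast) pathway based on inter-frame feature similarity. The feature source, the cosine-similarity criterion, and the 0.95 threshold are assumptions, not details from the report.

```python
import numpy as np

def assign_pathways(frame_feats, sim_threshold=0.95):
    """Split frames into Slow (high-res) and Fast (low-res) pathways.

    `frame_feats` is an (N, D) array of per-frame features; a frame whose
    cosine similarity to the previous key frame drops below the threshold
    counts as a significant visual change and goes to the Slow pathway.
    """
    feats = np.asarray(frame_feats, float)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)

    slow, fast = [0], []            # treat the first frame as a key frame
    for i in range(1, len(feats)):
        sim = float(feats[i] @ feats[slow[-1]])
        (slow if sim < sim_threshold else fast).append(i)
    return slow, fast

# Toy clip: four nearly static frames, then a scene cut followed by three more.
rng = np.random.default_rng(0)
static = np.tile(rng.normal(size=(1, 16)), (4, 1)) + rng.normal(scale=0.01, size=(4, 16))
cut = rng.normal(size=(1, 16))
frames = np.vstack([static, cut, cut + rng.normal(scale=0.01, size=(3, 16))])
print(assign_pathways(frames))      # key frames go Slow, static frames go Fast
```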

Towards High Data Efficiency in Reinforcement Learning with Verifiable Reward
Authors:Xinyu Tang, Zhenduo Zhang, Yurou Liu, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Recent advances in large reasoning models have leveraged reinforcement learning with verifiable rewards (RLVR) to improve reasoning capabilities. However, scaling these methods typically requires extensive rollout computation and large datasets, leading to high training costs and low data efficiency. To mitigate this issue, we propose DEPO, a Data-Efficient Policy Optimization pipeline that combines optimized strategies for both offline and online data selection. In the offline phase, we curate a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, we introduce a sample-level explorability metric to dynamically filter samples with low exploration potential, thereby reducing substantial rollout computational costs. Furthermore, we incorporate a replay mechanism for under-explored samples to ensure adequate training, which enhances the model’s final convergence performance. Experiments across five reasoning benchmarks show that DEPO consistently outperforms existing methods in both offline and online data selection scenarios. Notably, using only 20% of the training data, our approach achieves a 1.85 times speed-up on AIME24 and a 1.66 times speed-up on AIME25 compared to GRPO trained on the full dataset.
Paper and Project Links
Summary
Recent large reasoning models use reinforcement learning with verifiable rewards (RLVR) to improve reasoning, but scaling these methods requires extensive rollout computation and large datasets, leading to high training cost and low data efficiency. DEPO is a Data-Efficient Policy Optimization pipeline that combines optimized offline and online data selection. In the offline phase, it curates a high-quality subset of training samples based on diversity, influence, and appropriate difficulty. During online RLVR training, a sample-level explorability metric dynamically filters out samples with low exploration potential, cutting rollout cost, and a replay mechanism for under-explored samples ensures adequate training and better final convergence. Across five reasoning benchmarks, DEPO consistently outperforms existing methods in both offline and online data-selection scenarios; using only 20% of the training data, it achieves a 1.85x speed-up on AIME24 and a 1.66x speed-up on AIME25 compared with GRPO trained on the full dataset.
Key Takeaways
- RLVR improves reasoning in large models but suffers from high rollout cost and low data efficiency.
- DEPO is a data-efficient policy optimization pipeline combining offline and online data selection.
- Offline selection curates high-quality samples by diversity, influence, and appropriate difficulty.
- Online training uses a sample-level explorability metric to filter low-potential samples and cut rollout cost (see the sketch below).
- A replay mechanism ensures under-explored samples still receive adequate training, improving final convergence.
- DEPO outperforms existing methods on five reasoning benchmarks while using a fraction of the data.
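The snippet below sketches the online stage's flow: score each candidate sample's exploration potential, skip low-potential ones, and occasionally replay an under-explored sample. The explorability proxy used here (peaked at a 50% pass rate) and the replay schedule are assumptions; the paper defines its own metric.

```python
from collections import deque

def explorability(pass_rate: float) -> float:
    """Assumed proxy: highest when the model solves a prompt only some of the
    time, lowest when it always fails or always succeeds."""
    return 1.0 - abs(2.0 * pass_rate - 1.0)

def select_batch(candidates, replay: deque, threshold=0.25, replay_every=4, step=0):
    """Keep candidates with enough exploration potential; park the rest in a
    replay queue and revisit one of them periodically so under-explored
    samples still receive training."""
    kept = []
    for prompt, pass_rate in candidates:
        if explorability(pass_rate) >= threshold:
            kept.append(prompt)
        else:
            replay.append(prompt)
    if replay and step % replay_every == 0:
        kept.append(replay.popleft())
    return kept

# Toy usage: (prompt id, estimated pass rate) pairs.
replay_q: deque = deque()
batch = [("p1", 0.5), ("p2", 1.0), ("p3", 0.0), ("p4", 0.7)]
print(select_batch(batch, replay_q, step=0))   # p1 and p4 kept, plus one replayed item
print(list(replay_q))                          # remaining low-explorability prompts
```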

Enhancing Large Language Model for Knowledge Graph Completion via Structure-Aware Alignment-Tuning
Authors:Yu Liu, Yanan Cao, Xixun Lin, Yanmin Shang, Shi Wang, Shirui Pan
Knowledge graph completion (KGC) aims to infer new knowledge and make predictions from knowledge graphs. Recently, large language models (LLMs) have exhibited remarkable reasoning capabilities. LLM-enhanced KGC methods primarily focus on designing task-specific instructions, achieving promising advancements. However, there are still two critical challenges. First, existing methods often ignore the inconsistent representation spaces between natural language and graph structures. Second, most approaches design separate instructions for different KGC tasks, leading to duplicate works and time-consuming processes. To address these challenges, we propose SAT, a novel framework that enhances LLMs for KGC via structure-aware alignment-tuning. Specifically, we first introduce hierarchical knowledge alignment to align graph embeddings with the natural language space through multi-task contrastive learning. Then, we propose structural instruction tuning to guide LLMs in performing structure-aware reasoning over KGs, using a unified graph instruction combined with a lightweight knowledge adapter. Experimental results on two KGC tasks across four benchmark datasets demonstrate that SAT significantly outperforms state-of-the-art methods, especially in the link prediction task with improvements ranging from 8.7% to 29.8%.
Paper and Project Links
PDF EMNLP 2025, Main, Long Paper
Summary
Knowledge graph completion (KGC) aims to infer new knowledge and make predictions over knowledge graphs. Large language models (LLMs) show strong reasoning ability, and LLM-enhanced KGC methods that focus on task-specific instructions have made promising progress. Two key challenges remain: existing methods ignore the inconsistent representation spaces of natural language and graph structure, and most design separate instructions for different KGC tasks, causing duplicated effort and time-consuming pipelines. SAT addresses both by enhancing LLMs for KGC via structure-aware alignment-tuning: hierarchical knowledge alignment aligns graph embeddings with the natural-language space through multi-task contrastive learning, and structural instruction tuning guides LLMs to perform structure-aware reasoning over KGs using a unified graph instruction and a lightweight knowledge adapter. On two KGC tasks across four benchmark datasets, SAT significantly outperforms state-of-the-art methods, with link-prediction improvements of 8.7% to 29.8%.
Key Takeaways
- Knowledge graph completion (KGC) infers new knowledge and makes predictions over knowledge graphs.
- LLMs bring strong reasoning ability to KGC.
- LLM-enhanced KGC methods based on task-specific instructions have made notable progress.
- Existing methods overlook the mismatch between natural-language and graph-structure representation spaces.
- Designing separate instructions for each KGC task duplicates work and wastes time.
- SAT enhances LLMs for KGC via structure-aware alignment-tuning, combining hierarchical knowledge alignment with structural instruction tuning (a contrastive-alignment sketch follows below).
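As a generic illustration of aligning graph embeddings with the language space via contrastive learning, here is a symmetric InfoNCE loss over paired graph/text embeddings. It is a standard contrastive objective used as a stand-in; SAT's hierarchical, multi-task formulation is richer than this single loss.

```python
import numpy as np

def info_nce(graph_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning graph embeddings with text embeddings.

    Row i of `graph_emb` and row i of `text_emb` are assumed to describe the
    same entity or triple; all other rows in the batch act as negatives.
    """
    g = graph_emb / np.linalg.norm(graph_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = g @ t.T / temperature                         # (B, B) similarity matrix

    def xent(rows):
        rows = rows - rows.max(axis=1, keepdims=True)
        log_probs = rows - np.log(np.exp(rows).sum(axis=1, keepdims=True))
        return -np.diag(log_probs).mean()                  # positives on the diagonal

    return 0.5 * (xent(logits) + xent(logits.T))

# Toy usage: paired, slightly noisy views of the same entities.
rng = np.random.default_rng(0)
graph = rng.normal(size=(4, 32))
text = graph + rng.normal(scale=0.1, size=(4, 32))
print(round(float(info_nce(graph, text)), 4))
```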

Self-Exploring Language Models for Explainable Link Forecasting on Temporal Graphs via Reinforcement Learning
Authors:Zifeng Ding, Shenyang Huang, Zeyu Cao, Emma Kondrup, Zachary Yang, Xingyue Huang, Yuan Sui, Zhangdie Yuan, Yuqicheng Zhu, Xianglong Hu, Yuan He, Farimah Poursafaei, Michael Bronstein, Andreas Vlachos
Forecasting future links is a central task in temporal graph (TG) reasoning, requiring models to leverage historical interactions to predict upcoming ones. Traditional neural approaches, such as temporal graph neural networks, achieve strong performance but lack explainability and cannot be applied to unseen graphs without retraining. Recent studies have begun to explore using large language models (LLMs) for graph reasoning, but most of them are constrained to static graphs or small synthetic TGs and lack the evaluation of the quality of reasoning traces generated by LLMs. In this work, we present Reasoning-Enhanced Learning for Temporal Graphs (ReaL-TG), a reinforcement learning framework that fine-tunes LLMs to perform explainable link forecasting on real-world TGs. ReaL-TG uses outcome-based reward to encourage models to self-explore reasoning strategies from graph structure and to produce explanations that directly justify their predictions. To enable evaluation on LLM-generated reasoning traces, we propose a new evaluation protocol combining ranking metrics with an LLM-as-a-Judge system that assesses both the quality of reasoning and the impact of hallucinations. Experiments with ReaL-TG-4B, obtained by fine-tuning Qwen3-4B under our framework, show that it outperforms much larger frontier LLMs, including GPT-5 mini, on ranking metrics, while producing high-quality explanations confirmed by both the LLM judge and human evaluation.
Paper and Project Links
Summary
Forecasting future links is a central task in temporal graph (TG) reasoning: models must use historical interactions to predict upcoming ones. Temporal graph neural networks perform well but lack explainability and cannot transfer to unseen graphs without retraining, while most LLM-based graph reasoning work is limited to static graphs or small synthetic TGs and does not evaluate the quality of the generated reasoning traces. ReaL-TG (Reasoning-Enhanced Learning for Temporal Graphs) is a reinforcement learning framework that fine-tunes LLMs for explainable link forecasting on real-world TGs. It uses outcome-based rewards to encourage models to self-explore reasoning strategies from graph structure and to produce explanations that directly justify their predictions. To evaluate LLM-generated reasoning traces, the authors propose a new protocol combining ranking metrics with an LLM-as-a-Judge that assesses both reasoning quality and the impact of hallucinations. ReaL-TG-4B, obtained by fine-tuning Qwen3-4B in this framework, outperforms much larger frontier LLMs, including GPT-5 mini, on ranking metrics while producing high-quality explanations confirmed by both the LLM judge and human evaluation.
Key Takeaways
- Link forecasting on temporal graphs requires using historical interactions to predict future ones.
- Temporal graph neural networks are strong but unexplainable and cannot handle unseen graphs without retraining.
- Prior LLM-based graph reasoning mostly targets static or small synthetic graphs and lacks evaluation of reasoning-trace quality.
- ReaL-TG is an RL framework that fine-tunes LLMs for explainable temporal-graph link forecasting.
- Outcome-based rewards encourage self-explored reasoning strategies and explanations that justify predictions (see the ranking-reward sketch below).
- A new evaluation protocol combines ranking metrics with an LLM-as-a-Judge.
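To show what an outcome-based reward for link forecasting might look like, the sketch below scores a predicted candidate ranking by the reciprocal rank of the true destination node. This particular reward is an assumption for illustration; the paper's reward design may differ.

```python
def outcome_reward(predicted_ranking, true_destination):
    """Illustrative outcome-based reward for link forecasting.

    The model outputs a ranked list of candidate destination nodes for a
    query; the reward is the reciprocal rank of the ground-truth node, so a
    correct answer ranked higher earns a larger reward.
    """
    try:
        rank = predicted_ranking.index(true_destination) + 1
    except ValueError:
        return 0.0            # ground truth missing from the prediction
    return 1.0 / rank

# Toy usage: the model ranked node "v17" second for this query.
print(outcome_reward(["v3", "v17", "v8"], "v17"))   # -> 0.5
```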

RPRO:Ranked Preference Reinforcement Optimization for Enhancing Medical QA and Diagnostic Reasoning
Authors:Chia-Hsuan Hsu, Jun-En Ding, Hsin-Ling Hsu, Feng Liu, Fang-Ming Hung
Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that uniquely combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO differentiates itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces a groupwise ranking optimization based on the Bradley-Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent improvements over strong baselines. Remarkably, our 1.1B parameter model outperforms much larger 7B-13B models, including medical-specialized variants. These findings demonstrate that combining preference optimization with quality-driven refinement offers a scalable and effective approach to building more reliable, clinically grounded medical LLMs.
Paper and Project Links
Summary
Medical question answering requires reasoning that integrates domain knowledge with logical inference, yet existing LLMs often produce reasoning chains that lack factual accuracy and clinical reliability. Ranked Preference Reinforcement Optimization (RPRO) combines reinforcement learning with preference-driven reasoning refinement to improve clinical chain-of-thought (CoT) performance. RPRO uses task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike pairwise preference methods, it introduces groupwise ranking optimization based on the Bradley-Terry model and adds KL-divergence regularization for stable training. Experiments on PubMedQA and MedQA-USMLE show consistent gains over strong baselines, and the 1.1B-parameter model outperforms much larger 7B-13B models, including medical-specialized variants. Combining preference optimization with quality-driven refinement thus offers a scalable, effective path toward more reliable, clinically grounded medical LLMs.
Key Takeaways
- Medical QA needs reasoning that integrates domain knowledge with logical inference.
- Reasoning chains from existing LLMs often lack factual accuracy and clinical reliability.
- RPRO combines reinforcement learning with preference-driven refinement to improve clinical chain-of-thought performance.
- Task-adaptive reasoning templates and a probabilistic evaluation mechanism align outputs with clinical workflows.
- RPRO automatically identifies and corrects low-quality reasoning chains.
- Unlike pairwise methods, RPRO uses groupwise ranking based on the Bradley-Terry model with KL-divergence regularization for stable training (a sketch follows below).
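One standard way to extend the Bradley-Terry model from pairs to a ranked group is the Plackett-Luce likelihood, sketched below together with a simple KL-style penalty toward a reference policy. This is an illustrative reading of "groupwise ranking optimization with KL regularization", not RPRO's exact objective.

```python
import numpy as np

def listwise_bt_nll(scores, ranking):
    """Plackett-Luce negative log-likelihood, a groupwise extension of
    Bradley-Terry: the top-ranked response is drawn from a softmax over all
    remaining responses, then the next, and so on. `ranking` lists response
    indices from most to least preferred."""
    scores = np.asarray(scores, float)
    nll, remaining = 0.0, list(ranking)
    for idx in ranking:
        logits = scores[remaining]
        logsumexp = logits.max() + np.log(np.exp(logits - logits.max()).sum())
        nll -= scores[idx] - logsumexp
        remaining.remove(idx)
    return nll

def kl_penalty(logp_policy, logp_ref):
    """Naive per-token KL estimate between the policy and a frozen reference,
    used purely as a stabilizing regularizer."""
    return float(np.mean(np.asarray(logp_policy) - np.asarray(logp_ref)))

# Toy group of 3 candidate reasoning chains, preference order: 2 > 0 > 1.
scores = [1.3, -0.4, 2.1]
loss = listwise_bt_nll(scores, ranking=[2, 0, 1]) + 0.1 * kl_penalty([-1.0, -0.8], [-1.1, -0.9])
print(round(loss, 4))
```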

CoreThink: A Symbolic Reasoning Layer to reason over Long Horizon Tasks with LLMs
Authors:Jay Vaghasiya, Omkar Ghugarkar, Vishvesh Bhat, Vipul Dholaria, Julian McAuley
We introduce CoreThink, a state-of-the-art Reasoning Layer built upon a novel reasoning method called General Symbolics. This approach diverges from reasoning paradigms such as test-time scaling, Supervised Fine-Tuning (SFT), and Reinforcement Learning with Verifiable Rewards (RLVR). CoreThink General Symbolic Reasoner (GSR) is specifically structured around three key use cases: tool-calling, code generation, and planning, demonstrating exemplary performance across a total of seven benchmarks in their respective areas. Notably, we are achieving SOTA scores of 66.66% on Livecodebench v6, 89% on Instruction-Following Evals, and 24.4% on ARC-AGI-2. We also present an agentic coding IDE, developed using the principles of General Symbolics, which achieves a state-of-the-art accuracy of 62.3% on SWE-Bench Lite. We are able to achieve these improvements without any fine-tuning or training costs. Our Reasoning Layer is designed to provide a pure performance uplift, ensuring that a model’s accuracy on reasoning tasks is never negatively impacted. We argue that incumbent methods will eventually lead to diminishing returns in LLM performance, necessitating the development of new reasoning techniques. This technical report details our approach at a high level and the availability of the CoreThink models for reasoning-intensive use cases.
Paper and Project Links
Summary
CoreThink is a state-of-the-art reasoning layer built on a reasoning method called General Symbolics, which diverges from paradigms such as test-time scaling, supervised fine-tuning (SFT), and reinforcement learning with verifiable rewards (RLVR). The CoreThink General Symbolic Reasoner (GSR) is structured around three use cases: tool-calling, code generation, and planning, and performs strongly across seven benchmarks in these areas, including SOTA scores of 66.66% on Livecodebench v6, 89% on Instruction-Following Evals, and 24.4% on ARC-AGI-2. An agentic coding IDE built on General Symbolics principles reaches a state-of-the-art 62.3% accuracy on SWE-Bench Lite. These gains come without any fine-tuning or training cost: the reasoning layer is designed as a pure performance uplift that never degrades a model's accuracy on reasoning tasks. The authors argue that incumbent methods will eventually hit diminishing returns in LLM performance, motivating new reasoning techniques.
Key Takeaways
- CoreThink is a reasoning layer built on a new General Symbolics reasoning method.
- It performs strongly across multiple benchmarks covering tool-calling, code generation, and planning.
- The reasoning layer is designed as a pure performance uplift that does not degrade a model's accuracy on reasoning tasks.
- An agentic coding IDE built on General Symbolics achieves high accuracy on SWE-Bench Lite.
- Incumbent methods may lead to diminishing returns in LLM performance, so new reasoning techniques are needed.
- CoreThink models target reasoning-intensive use cases.

Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Authors:Lang Xiong, Nishant Bhargava, Wesley Chang, Jianhang Hong, Haihao Liu, Kevin Zhu
Large Language Models (LLMs) often exhibit significant behavioral shifts when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as “evaluation awareness.” This discrepancy poses a critical challenge for AI alignment, as benchmark performance may not accurately reflect a model’s true safety and honesty. In this work, we systematically quantify these behavioral changes by manipulating the perceived context of prompts. We introduce a methodology that uses a linear probe to score prompts on a continuous scale from “test-like” to “deploy-like” and leverage an LLM rewriting strategy to shift these prompts towards a more natural, deployment-style context while preserving the original task. Using this method, we achieved a 30% increase in the average probe score across a strategic role-playing dataset after rewriting. Evaluating a suite of state-of-the-art models on these original and rewritten prompts, we find that rewritten “deploy-like” prompts induce a significant and consistent shift in behavior. Across all models, we observed an average increase in honest responses of 5.26% and a corresponding average decrease in deceptive responses of 12.40%. Furthermore, refusal rates increased by an average of 6.38%, indicating heightened safety compliance. Our findings demonstrate that evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior, revealing that models are more prone to unsafe or deceptive outputs in perceived test environments. This underscores the urgent need for more realistic evaluation frameworks to accurately gauge true model alignment before deployment.
Paper and Project Links
Summary
Large Language Models (LLMs) often shift their behavior when they perceive a change from a real-world deployment context to a controlled evaluation setting, a phenomenon known as "evaluation awareness." This is a critical challenge for AI alignment, because benchmark performance may not reflect a model's true safety and honesty. The authors quantify these behavioral changes by manipulating the perceived context of prompts: a linear probe scores prompts on a continuous scale from "test-like" to "deploy-like," and an LLM rewriting strategy shifts prompts toward a more natural, deployment-style context while preserving the original task. Rewriting raised the average probe score by 30% on a strategic role-playing dataset. Evaluating state-of-the-art models on original versus rewritten prompts shows that "deploy-like" prompts induce a significant, consistent behavioral shift: honest responses increase by 5.26% on average, deceptive responses drop by 12.40%, and refusal rates rise by 6.38%, indicating higher safety compliance. Evaluation awareness is thus a quantifiable, manipulable factor that directly influences LLM behavior; models are more prone to unsafe or deceptive outputs in perceived test environments, underscoring the need for more realistic evaluation frameworks before deployment.
Key Takeaways
- LLM behavior differs between evaluation settings and perceived real-world deployment, a phenomenon called "evaluation awareness."
- This gap challenges AI alignment because benchmark performance may not reflect true safety and honesty.
- Manipulating the perceived context of prompts quantifies these behavioral changes systematically.
- A linear probe scores prompts from test-like to deploy-like (see the sketch below), and rewriting prompts toward deployment style raises the probe score.
- Rewritten prompts induce a significant, consistent shift: more honest responses, fewer deceptive ones, and higher refusal rates.
- Evaluation awareness is a quantifiable and manipulable factor that directly influences LLM behavior.
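A linear probe of the kind described can be as simple as logistic regression over per-prompt model activations; the sketch below uses synthetic vectors in place of real activations so it runs standalone. The feature source, layer choice, and labeling scheme are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data: in practice each prompt would be represented by a
# hidden-state vector from the model, labeled deploy-like (1) or test-like (0).
rng = np.random.default_rng(0)
X_deploy = rng.normal(loc=+0.5, size=(200, 64))    # prompts framed like real usage
X_test   = rng.normal(loc=-0.5, size=(200, 64))    # prompts framed like benchmarks
X = np.vstack([X_deploy, X_test])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)

def deploy_score(activation):
    """Continuous score: near 1 = deploy-like, near 0 = test-like. Rewriting a
    prompt toward a deployment-style context should raise this score."""
    return float(probe.predict_proba(np.asarray(activation).reshape(1, -1))[0, 1])

print(round(deploy_score(rng.normal(loc=+0.5, size=64)), 3))
```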

ERank: Fusing Supervised Fine-Tuning and Reinforcement Learning for Effective and Efficient Text Reranking
Authors:Yuzheng Cai, Yanzhao Zhang, Dingkun Long, Mingxin Li, Pengjun Xie, Weiguo Zheng
Text reranking models are a crucial component in modern systems like Retrieval-Augmented Generation, tasked with selecting the most relevant documents prior to generation. However, current Large Language Models (LLMs) powered rerankers often face a fundamental trade-off. On one hand, Supervised Fine-Tuning based pointwise methods that frame relevance as a binary classification task lack the necessary scoring discrimination, particularly for those built on reasoning LLMs. On the other hand, approaches designed for complex reasoning often employ powerful yet inefficient listwise formulations, rendering them impractical for low latency applications. To resolve this dilemma, we introduce ERank, a highly effective and efficient pointwise reranker built from a reasoning LLM that excels across diverse relevance scenarios. We propose a novel two-stage training pipeline that begins with Supervised Fine-Tuning (SFT). In this stage, we move beyond binary labels and train the model generatively to output fine grained integer scores, which significantly enhances relevance discrimination. The model is then further refined using Reinforcement Learning (RL) with a novel, listwise derived reward. This technique instills global ranking awareness into the efficient pointwise architecture. We evaluate the ERank reranker on the BRIGHT, FollowIR, TREC DL, and BEIR benchmarks, demonstrating superior effectiveness and robustness compared to existing approaches. On the reasoning-intensive BRIGHT benchmark, our ERank-4B achieves an nDCG@10 of 38.7, while a larger 32B variant reaches a state of the art nDCG@10 of 40.2.
Paper and Project Links
Summary
Text rerankers are a key component of systems such as Retrieval-Augmented Generation, selecting the most relevant documents before generation, but current LLM-based rerankers face a trade-off: SFT-based pointwise methods that frame relevance as binary classification lack scoring discrimination, especially when built on reasoning LLMs, while listwise formulations designed for complex reasoning are powerful but too inefficient for low-latency use. ERank is an effective and efficient pointwise reranker built from a reasoning LLM. Its two-stage training pipeline first applies supervised fine-tuning that goes beyond binary labels, training the model generatively to output fine-grained integer scores, which sharpens relevance discrimination; it then applies reinforcement learning with a novel listwise-derived reward, instilling global ranking awareness into the efficient pointwise architecture. On the BRIGHT, FollowIR, TREC DL, and BEIR benchmarks, ERank is more effective and robust than existing approaches; on the reasoning-intensive BRIGHT benchmark, ERank-4B reaches an nDCG@10 of 38.7 and the 32B variant a state-of-the-art 40.2.
Key Takeaways
- ERank is an effective and efficient pointwise reranker for text reranking.
- Current LLM rerankers face a trade-off: pointwise methods lack scoring discrimination, while listwise reasoning methods are too inefficient for low-latency applications.
- ERank resolves this with a two-stage pipeline: SFT that generates fine-grained integer relevance scores, then RL with a listwise-derived reward that adds global ranking awareness (a scoring-and-ranking sketch follows below).
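The sketch below shows the pointwise pattern the summary describes: an integer relevance score per query-document pair (a 0-100 scale is assumed here), a sort, and evaluation with nDCG@10. The scorer is a placeholder standing in for the fine-tuned reasoning LLM.

```python
import math

def ndcg_at_k(ranked_relevances, k=10):
    """Standard nDCG@k given graded relevance labels in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_relevances[:k]))
    ideal = sorted(ranked_relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def rerank(query, docs, score_fn):
    """Pointwise reranking: score each (query, doc) pair independently with a
    fine-grained integer score (rather than a yes/no label), then sort.
    `score_fn` stands in for the fine-tuned reasoning LLM."""
    return sorted(docs, key=lambda d: score_fn(query, d), reverse=True)

# Toy usage with a hypothetical scorer and graded relevance labels.
def toy_score(query, doc):                     # placeholder for LLM-generated scores
    return {"d1": 87, "d2": 15, "d3": 60}[doc]

order = rerank("example query", ["d1", "d2", "d3"], toy_score)
labels = {"d1": 2, "d2": 0, "d3": 1}           # ground-truth graded relevance
print(order, round(ndcg_at_k([labels[d] for d in order]), 3))
```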

Igniting Creative Writing in Small Language Models: LLM-as-a-Judge versus Multi-Agent Refined Rewards
Authors:Xiaolong Wei, Bo Lu, Xingyu Zhang, Zhejun Zhao, Dongdong Shen, Long Xia, Dawei Yin
Large Language Models (LLMs) have demonstrated remarkable creative writing capabilities, yet their substantial computational demands hinder widespread use. Enhancing Small Language Models (SLMs) offers a promising alternative, but current methods like Supervised Fine-Tuning (SFT) struggle with novelty, and Reinforcement Learning from Human Feedback (RLHF) is costly. This paper explores two distinct AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework to ignite the creative writing of a 7B-parameter SLM, specifically for generating Chinese greetings. The first strategy employs a RM trained on high-quality preference data curated by a novel multi-agent rejection sampling framework designed for creative tasks. The second, more novel strategy utilizes a principle-guided LLM-as-a-Judge, whose reward function is optimized via an adversarial training scheme with a reflection mechanism, to directly provide reward signals. Comprehensive experiments reveal that while both approaches significantly enhance creative output over baselines, the principle-guided LLM-as-a-Judge demonstrably yields superior generation quality. Furthermore, it offers notable advantages in training efficiency and reduced dependency on human-annotated data, presenting a more scalable and effective path towards creative SLMs. Our automated evaluation methods also exhibit strong alignment with human judgments. Our code and data are publicly available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
Paper and Project Links
PDF EMNLP 2025 Main
Summary
Large Language Models show remarkable creative writing ability, but their computational cost limits broad use. Enhancing Small Language Models (SLMs) is a promising alternative, yet supervised fine-tuning struggles with novelty and RLHF is costly. Within a Reinforcement Learning from AI Feedback (RLAIF) framework, this paper explores two AI-driven reward strategies to ignite creative writing in a 7B-parameter SLM, specifically for generating Chinese greetings. The first trains a reward model on high-quality preference data curated by a multi-agent rejection-sampling framework designed for creative tasks. The second, more novel strategy uses a principle-guided LLM-as-a-Judge whose reward function is optimized via adversarial training with a reflection mechanism, providing reward signals directly. Both approaches significantly improve creative output over baselines, but the principle-guided LLM-as-a-Judge yields clearly better generation quality, higher training efficiency, and less dependence on human-annotated data, offering a more scalable path toward creative SLMs. The automated evaluation also aligns well with human judgments. Code and data are available at https://github.com/weixiaolong94-hub/Igniting-Creative-Writing-in-Small-Language-Models.
Key Takeaways
- LLMs show strong creative writing ability but are computationally expensive to deploy widely.
- Enhancing Small Language Models (SLMs) is a promising alternative, but existing methods (SFT, RLHF) have limitations.
- Two AI-driven reward strategies are explored to ignite creative writing in a small model, targeting Chinese greeting generation.
- The first strategy trains a reward model on preference data curated by a multi-agent rejection-sampling framework.
- The second uses a principle-guided LLM-as-a-Judge, optimized via adversarial training with a reflection mechanism.
- Both improve creative output; the principle-guided LLM-as-a-Judge is superior in generation quality, training efficiency, and reduced reliance on human annotations.

Challenges and Applications of Large Language Models: A Comparison of GPT and DeepSeek family of models
Authors:Shubham Sharma, Sneha Tuli, Narendra Badam
Large Language Models (LLMs) are transforming AI across industries, but their development and deployment remain complex. This survey reviews 16 key challenges in building and using LLMs and examines how these challenges are addressed by two state-of-the-art models with unique approaches: OpenAI’s closed source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open source Mixture-of-Experts model. Through this comparison, we showcase the trade-offs between closed source models (robust safety, fine-tuned reliability) and open source models (efficiency, adaptability). We also explore LLM applications across different domains (from chatbots and coding tools to healthcare and education), highlighting which model attributes are best suited for each use case. This article aims to guide AI researchers, developers, and decision-makers in understanding current LLM capabilities, limitations, and best practices.
Paper and Project Links
PDF 18 pages, 7 figures
Summary
Large Language Models (LLMs) are transforming AI across industries, but developing and deploying them remains complex. This survey reviews 16 key challenges in building and using LLMs and examines how two state-of-the-art models with different approaches address them: OpenAI's closed-source GPT-4o (May 2024 update) and DeepSeek-V3-0324 (March 2025), a large open-source Mixture-of-Experts model. The comparison highlights the trade-offs between closed-source models (robust safety, fine-tuned reliability) and open-source models (efficiency, adaptability). The survey also explores LLM applications across domains, from chatbots and coding tools to healthcare and education, noting which model attributes best fit each use case, and aims to guide AI researchers, developers, and decision-makers on current LLM capabilities, limitations, and best practices.
Key Takeaways
- LLMs are driving AI progress across many industries, but numerous challenges remain in building and deploying them.
- The survey compares GPT-4o and DeepSeek-V3-0324, highlighting the trade-offs between closed-source and open-source models.
- LLM applications across domains (chatbots, coding tools, healthcare, education) are discussed, emphasizing which model attributes suit each use case.
- The article aims to help AI researchers, developers, and decision-makers understand LLM capabilities, limitations, and best practices.

Improving Aviation Safety Analysis: Automated HFACS Classification Using Reinforcement Learning with Group Relative Policy Optimization
Authors:Arash Ahmadi, Sarah Sharif, Yaser Banad
Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, yet traditional methods using the Human Factors Analysis and Classification System (HFACS) are limited by scalability and consistency. To address this, we introduce an automated HFACS classification framework for aviation safety analysis that utilizes Reinforcement Learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model. Our approach incorporates a multi-component reward system tailored for aviation safety analysis and integrates synthetic data generation to overcome class imbalance in accident datasets. The resulting GRPO-optimized model achieved noticeable performance gains, including a 350% increase in exact match accuracy (from 0.0400 to 0.1800) and an improved partial match accuracy of 0.8800. Significantly, our specialized model outperforms state-of-the-art LLMs (Large Language Models), including GPT-5-mini and Gemini-2.5-fiash, on key metrics. This research also proposes exact match accuracy in multi-label HFACS classification problem as a new benchmarking methodology to evaluate the advanced reasoning capabilities of language models. Ultimately, our work validates that smaller, domain-optimized models can provide a computationally efficient and better solution for critical safety analysis. This approach makes powerful, low-latency deployment on resource-constrained edge devices feasible.
Paper and Project Links
Summary
Analyzing the human factors behind aviation accidents is crucial for preventing future incidents, but traditional use of the Human Factors Analysis and Classification System (HFACS) is limited in scalability and consistency. This work introduces an automated HFACS classification framework that uses reinforcement learning with Group Relative Policy Optimization (GRPO) to fine-tune a Llama-3.1 8B language model, combining a multi-component reward system tailored to aviation safety analysis with synthetic data generation to counter class imbalance in accident datasets. The GRPO-optimized model shows notable gains, including a 350% increase in exact-match accuracy (from 0.0400 to 0.1800) and a partial-match accuracy of 0.8800, and it outperforms state-of-the-art LLMs, including GPT-5-mini and Gemini-2.5-Flash, on key metrics. The paper also proposes exact-match accuracy on multi-label HFACS classification as a new way to benchmark the advanced reasoning ability of language models, and shows that smaller, domain-optimized models can provide a computationally efficient solution for critical safety analysis, making low-latency deployment on resource-constrained edge devices feasible.
Key Takeaways
- Reinforcement learning with GRPO is used for automated HFACS classification in aviation safety analysis.
- A Llama-3.1 8B model is fine-tuned with a multi-component reward system tailored to the task.
- Synthetic data generation addresses class imbalance in accident datasets.
- Exact-match accuracy rises sharply to 0.1800, and the model outperforms larger state-of-the-art LLMs on key metrics.
- Exact-match accuracy on multi-label HFACS classification is proposed as a new benchmark for language-model reasoning (a metric sketch follows below).
- Small, domain-optimized models can run efficiently on resource-constrained edge devices for critical safety analysis.
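For reference, here is how the two reported metrics can be computed for multi-label HFACS outputs. Exact match is the standard strict definition; the partial-match definition shown (at least one shared category) is an assumption, since the summary does not spell it out, and the category names are hypothetical.

```python
def exact_match_accuracy(pred_labels, true_labels):
    """Fraction of cases where the predicted HFACS category set matches the
    ground-truth set exactly; a strict metric for multi-label classification."""
    hits = sum(set(p) == set(t) for p, t in zip(pred_labels, true_labels))
    return hits / max(len(true_labels), 1)

def partial_match_accuracy(pred_labels, true_labels):
    """Looser, assumed definition: a case counts if the prediction shares at
    least one category with the ground truth."""
    hits = sum(bool(set(p) & set(t)) for p, t in zip(pred_labels, true_labels))
    return hits / max(len(true_labels), 1)

# Toy usage with hypothetical HFACS category names.
preds = [{"Skill-Based Errors"}, {"Decision Errors", "Adverse Mental States"}]
truth = [{"Skill-Based Errors"}, {"Decision Errors"}]
print(exact_match_accuracy(preds, truth), partial_match_accuracy(preds, truth))
```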
