
R1_Reasoning


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are intended only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-08-23

Intern-S1: A Scientific Multimodal Foundation Model

Authors:Lei Bai, Zhongrui Cai, Maosong Cao, Weihan Cao, Chiyu Chen, Haojiong Chen, Kai Chen, Pengcheng Chen, Ying Chen, Yongkang Chen, Yu Cheng, Yu Cheng, Pei Chu, Tao Chu, Erfei Cui, Ganqu Cui, Long Cui, Ziyun Cui, Nianchen Deng, Ning Ding, Nanqin Dong, Peijie Dong, Shihan Dou, Sinan Du, Haodong Duan, Caihua Fan, Ben Gao, Changjiang Gao, Jianfei Gao, Songyang Gao, Yang Gao, Zhangwei Gao, Jiaye Ge, Qiming Ge, Lixin Gu, Yuzhe Gu, Aijia Guo, Qipeng Guo, Xu Guo, Conghui He, Junjun He, Yili Hong, Siyuan Hou, Caiyu Hu, Hanglei Hu, Jucheng Hu, Ming Hu, Zhouqi Hua, Haian Huang, Junhao Huang, Xu Huang, Zixian Huang, Zhe Jiang, Lingkai Kong, Linyang Li, Peiji Li, Pengze Li, Shuaibin Li, Tianbin Li, Wei Li, Yuqiang Li, Dahua Lin, Junyao Lin, Tianyi Lin, Zhishan Lin, Hongwei Liu, Jiangning Liu, Jiyao Liu, Junnan Liu, Kai Liu, Kaiwen Liu, Kuikun Liu, Shichun Liu, Shudong Liu, Wei Liu, Xinyao Liu, Yuhong Liu, Zhan Liu, Yinquan Lu, Haijun Lv, Hongxia Lv, Huijie Lv, Qidang Lv, Ying Lv, Chengqi Lyu, Chenglong Ma, Jianpeng Ma, Ren Ma, Runmin Ma, Runyuan Ma, Xinzhu Ma, Yichuan Ma, Zihan Ma, Sixuan Mi, Junzhi Ning, Wenchang Ning, Xinle Pang, Jiahui Peng, Runyu Peng, Yu Qiao, Jiantao Qiu, Xiaoye Qu, Yuan Qu, Yuchen Ren, Fukai Shang, Wenqi Shao, Junhao Shen, Shuaike Shen, Chunfeng Song, Demin Song, Diping Song, Chenlin Su, Weijie Su, Weigao Sun, Yu Sun, Qian Tan, Cheng Tang, Huanze Tang, Kexian Tang, Shixiang Tang, Jian Tong, Aoran Wang, Bin Wang, Dong Wang, Lintao Wang, Rui Wang, Weiyun Wang, Wenhai Wang, Yi Wang, Ziyi Wang, Ling-I Wu, Wen Wu, Yue Wu, Zijian Wu, Linchen Xiao, Shuhao Xing, Chao Xu, Huihui Xu, Jun Xu, Ruiliang Xu, Wanghan Xu, GanLin Yang, Yuming Yang, Haochen Ye, Jin Ye, Shenglong Ye, Jia Yu, Jiashuo Yu, Jing Yu, Fei Yuan, Bo Zhang, Chao Zhang, Chen Zhang, Hongjie Zhang, Jin Zhang, Qiaosheng Zhang, Qiuyinzhe Zhang, Songyang Zhang, Taolin Zhang, Wenlong Zhang, Wenwei Zhang, Yechen Zhang, Ziyang Zhang, Haiteng Zhao, Qian Zhao, Xiangyu Zhao, Xiangyu Zhao, Bowen Zhou, Dongzhan Zhou, Peiheng Zhou, Yuhao Zhou, Yunhua Zhou, Dongsheng Zhu, Lin Zhu, Yicheng Zou

In recent years, a plethora of open-source foundation models have emerged, achieving remarkable progress in some widely followed fields, with performance quite close to that of closed-source models. However, in high-value but more challenging scientific professional fields, either the fields still rely on expert models, or the progress of general foundation models lags significantly behind that in popular areas, far from sufficient for transforming scientific research and leaving a substantial gap between open-source and closed-source models in these scientific domains. To mitigate this gap and explore a step further toward Artificial General Intelligence (AGI), we introduce Intern-S1, a specialized generalist equipped with general understanding and reasoning capabilities and with the expertise to analyze data from multiple scientific modalities. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model with 28 billion activated parameters and 241 billion total parameters, continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains. In the post-training stage, Intern-S1 undergoes offline and then online reinforcement learning (RL) in InternBootCamp, where we propose Mixture-of-Rewards (MoR) to synergize the RL training on more than 1000 tasks simultaneously. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieved top-tier performance in online RL training. On comprehensive evaluation benchmarks, Intern-S1 demonstrates competitive performance on general reasoning tasks among open-source models and significantly outperforms open-source models in scientific domains, surpassing closed-source state-of-the-art models in professional tasks such as molecular synthesis planning, reaction condition prediction, and predicting the thermodynamic stability of crystals. Our models are available at https://huggingface.co/internlm/Intern-S1.

Paper and Project Links

PDF

Summary

Open-source foundation models have made striking progress in widely followed fields, with performance approaching that of closed-source models, yet they still lag significantly in high-value scientific domains. To narrow this gap and take a step toward artificial general intelligence (AGI), the authors introduce Intern-S1, a specialized generalist that combines general understanding and reasoning with the expertise to analyze data from multiple scientific modalities. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieves top-tier performance in online RL training and competitive results on comprehensive evaluation benchmarks.

Key Takeaways

  1. Open-source foundation models approach closed-source performance in many fields, but a significant gap remains in high-value scientific domains.
  2. Intern-S1 is a multimodal Mixture-of-Experts (MoE) model that pairs general understanding and reasoning with expertise in analyzing multiple scientific data modalities.
  3. Intern-S1 is continually pre-trained on 5T tokens, including over 2.5T tokens from scientific domains.
  4. In post-training, Intern-S1 uses a Mixture-of-Rewards (MoR) method to run RL on more than 1,000 tasks simultaneously (see the sketch after this list).
  5. Through integrated innovations in algorithms, data, and training systems, Intern-S1 achieves top-tier performance in online RL training.
  6. Intern-S1 is competitive among open-source models on general reasoning tasks and significantly outperforms them in scientific domains.
  7. On professional tasks such as molecular synthesis planning, reaction condition prediction, and crystal thermodynamic stability prediction, Intern-S1 surpasses closed-source state-of-the-art models.
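
To make the Mixture-of-Rewards idea concrete, here is a minimal sketch of how per-task reward functions might be dispatched and weighted during multi-task RL. The task registry, toy reward functions, and weighting scheme are illustrative assumptions, not the paper's actual implementation.

```python
from typing import Callable, Dict

# Hypothetical per-task reward functions: each maps (prompt, response) -> float.
# A real MoR setup would mix verifiable checkers, model-based judges, etc.
REWARD_FNS: Dict[str, Callable[[str, str], float]] = {
    "math": lambda prompt, resp: float("42" in resp),           # toy exact-match checker
    "molecule_synthesis": lambda prompt, resp: 0.5,             # stand-in for a domain verifier
    "general_chat": lambda prompt, resp: min(len(resp), 200) / 200.0,  # toy heuristic judge
}

def mixture_of_rewards(task: str, prompt: str, response: str,
                       weights: Dict[str, float]) -> float:
    """Route a rollout to its task-specific reward and apply a task weight,
    so heterogeneous tasks can be optimized in one RL run."""
    reward = REWARD_FNS[task](prompt, response)
    return weights.get(task, 1.0) * reward

batch = [("math", "What is 6*7?", "The answer is 42"),
         ("general_chat", "Say hi", "Hello there!")]
weights = {"math": 1.0, "general_chat": 0.3}
print([mixture_of_rewards(t, p, r, weights) for t, p, r in batch])
```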

Cool Papers

Click here to view paper screenshots

Response and Prompt Evaluation to Prevent Parasocial Relationships with Chatbots

Authors:Emma Rath, Stuart Armstrong, Rebecca Gorman

The development of parasocial relationships with AI agents has severe, and in some cases tragic, effects for human well-being. Yet preventing such dynamics is challenging: parasocial cues often emerge gradually in private conversations, and not all forms of emotional engagement are inherently harmful. We address this challenge by introducing a simple response evaluation framework, created by repurposing a state-of-the-art language model, that evaluates ongoing conversations for parasocial cues in real time. To test the feasibility of this approach, we constructed a small synthetic dataset of thirty dialogues spanning parasocial, sycophantic, and neutral conversations. Iterative evaluation with five-stage testing successfully identified all parasocial conversations while avoiding false positives under a tolerant unanimity rule, with detection typically occurring within the first few exchanges. These findings provide preliminary evidence that evaluation agents can provide a viable solution for the prevention of parasocial relationships.

Paper and Project Links

PDF

Summary

Parasocial relationships with AI agents can have severe, sometimes tragic, effects on human well-being. To address this, the authors propose a simple response evaluation framework, built by repurposing a state-of-the-art language model, that checks ongoing conversations for parasocial cues in real time. On a small synthetic dataset of thirty dialogues spanning parasocial, sycophantic, and neutral conversations, iterative five-stage evaluation identified all parasocial conversations while avoiding false positives under a tolerant unanimity rule. This offers preliminary evidence that evaluation agents are a viable way to prevent parasocial relationships.

Key Takeaways

  1. Parasocial relationships with AI agents can seriously harm human well-being.
  2. Parasocial cues emerge gradually in private conversations, and not all emotional engagement is harmful.
  3. A simple response evaluation framework, built on a state-of-the-art language model, checks conversations for parasocial cues in real time.
  4. A synthetic dataset covering parasocial, sycophantic, and neutral dialogues was built to validate the framework.
  5. Under a tolerant unanimity rule, the framework identified all parasocial conversations, typically within the first few exchanges (sketched below).
  6. Evaluation agents show promise for preventing parasocial relationships.
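
A minimal sketch of the real-time detection loop under a unanimity rule, assuming a hypothetical `judge` callable that would in practice prompt an LLM to decide whether the dialogue so far shows parasocial cues; the toy judge here is a keyword stand-in.

```python
from typing import Callable, List

def unanimous_flag(dialogue: List[str],
                   judge: Callable[[List[str]], bool],
                   n_votes: int = 5) -> bool:
    """Flag a conversation as parasocial only if all n_votes independent
    judgments agree, a unanimity rule that suppresses false positives."""
    return all(judge(dialogue) for _ in range(n_votes))

def monitor(turns: List[str], judge: Callable[[List[str]], bool]) -> int:
    """Re-evaluate after every exchange; return the first turn index at
    which the conversation is flagged, or -1 if it never is."""
    for i in range(1, len(turns) + 1):
        if unanimous_flag(turns[:i], judge):
            return i
    return -1

# Toy judge: a real system would prompt a language model here.
toy_judge = lambda d: any("only you understand me" in t.lower() for t in d)
print(monitor(["Hi!", "You're my best friend.", "Only you understand me."], toy_judge))
```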

Cool Papers

Click here to view paper screenshots

End-to-End Agentic RAG System Training for Traceable Diagnostic Reasoning

Authors:Qiaoyu Zheng, Yuze Sun, Chaoyi Wu, Weike Zhao, Pengcheng Qiu, Yongguo Yu, Kun Sun, Yanfeng Wang, Ya Zhang, Weidi Xie

Accurate diagnosis with medical large language models is hindered by knowledge gaps and hallucinations. Retrieval and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. To address these challenges, we introduce Deep-DxSearch, an agentic RAG system trained end-to-end with reinforcement learning (RL) that enables steerable, traceable retrieval-augmented reasoning for medical diagnosis. In Deep-DxSearch, we first construct a large-scale medical retrieval corpus comprising patient records and reliable medical knowledge sources to support retrieval-aware reasoning across diagnostic scenarios. More crucially, we frame the LLM as the core agent and the retrieval corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy, thereby evolving the agentic RAG policy from large-scale data through RL. Experiments demonstrate that our end-to-end agentic RL training framework consistently outperforms prompt-engineering and training-free RAG approaches across multiple data centers. After training, Deep-DxSearch achieves substantial gains in diagnostic accuracy, surpassing strong diagnostic baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare disease diagnosis under in-distribution and out-of-distribution settings. Moreover, ablation studies on reward design and retrieval corpus components confirm their critical roles, underscoring the uniqueness and effectiveness of our approach compared with traditional implementations. Finally, case studies and interpretability analyses highlight improvements in Deep-DxSearch’s diagnostic policy, providing deeper insight into its performance gains and supporting clinicians in delivering more reliable and precise preliminary diagnoses. See https://github.com/MAGIC-AI4Med/Deep-DxSearch.

Paper and Project Links

PDF 35 pages, 5 figures, 3 tables

Summary

Diagnosis with medical large language models is hindered by knowledge gaps and hallucinations; retrieval- and tool-augmented methods help, but their impact is limited by weak use of external knowledge and poor feedback-reasoning traceability. Deep-DxSearch is an agentic RAG system trained end-to-end with reinforcement learning (RL) to enable steerable, traceable retrieval-augmented reasoning for medical diagnosis. It first builds a large-scale medical retrieval corpus of patient records and reliable knowledge sources, then frames the LLM as the core agent and the corpus as its environment, using tailored rewards on format, retrieval, reasoning structure, and diagnostic accuracy to evolve the agentic RAG policy from large-scale data. Experiments show this end-to-end agentic RL framework outperforms prompt-engineering and training-free RAG approaches across multiple data centers, and the trained Deep-DxSearch surpasses strong baselines such as GPT-4o, DeepSeek-R1, and other medical-specific frameworks for both common and rare diseases, in-distribution and out-of-distribution.

Key Takeaways

  1. Applying LLMs to medical diagnosis is challenged by knowledge gaps and hallucinations.
  2. Retrieval and tool augmentation help, but are limited by weak use of external knowledge and poor feedback-reasoning traceability.
  3. Deep-DxSearch is an agentic RAG system trained with reinforcement learning that supports retrieval-augmented reasoning.
  4. It builds a large-scale medical retrieval corpus to support retrieval-aware reasoning across diagnostic scenarios.
  5. The LLM is framed as the core agent and trained with tailored rewards to improve diagnostic accuracy (see the reward sketch below).
  6. Experiments show Deep-DxSearch outperforms alternatives across multiple data centers, on both common and rare disease diagnosis.
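
The tailored reward described above can be pictured as a weighted sum of per-aspect scores over one rollout. The components, weights, and trace fields below are illustrative assumptions; the paper's actual reward shaping is more involved.

```python
def diagnosis_reward(trace: dict, gold_diagnosis: str) -> float:
    """Toy composite reward over format, retrieval, reasoning structure,
    and final diagnostic accuracy for one rollout of the agent."""
    r_format = 1.0 if trace.get("well_formed", False) else 0.0
    retrieved = trace.get("retrieved_docs", [])
    r_retrieval = sum(d.get("relevant", False) for d in retrieved) / max(len(retrieved), 1)
    r_structure = 1.0 if trace.get("cites_evidence", False) else 0.0
    r_accuracy = 1.0 if trace.get("diagnosis") == gold_diagnosis else 0.0
    # Accuracy dominates; the shaping terms keep intermediate steps on track.
    return 0.1 * r_format + 0.2 * r_retrieval + 0.2 * r_structure + 0.5 * r_accuracy

trace = {"well_formed": True,
         "retrieved_docs": [{"relevant": True}, {"relevant": False}],
         "cites_evidence": True,
         "diagnosis": "influenza"}
print(diagnosis_reward(trace, "influenza"))  # 0.1 + 0.1 + 0.2 + 0.5 = 0.9
```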

Cool Papers

Click here to view paper screenshots

Benchmarking Computer Science Survey Generation

Authors:Weihang Su, Anzhe Xie, Qingyao Ai, Jianming Long, Jiaxin Mao, Ziyi Ye, Yiqun Liu

Scientific survey articles play a vital role in summarizing research progress, yet their manual creation is becoming increasingly infeasible due to the rapid growth of academic literature. While large language models (LLMs) offer promising capabilities for automating this process, progress in this area is hindered by the absence of standardized benchmarks and evaluation protocols. To address this gap, we introduce SurGE (Survey Generation Evaluation), a new benchmark for evaluating scientific survey generation in the computer science domain. SurGE consists of (1) a collection of test instances, each including a topic description, an expert-written survey, and its full set of cited references, and (2) a large-scale academic corpus of over one million papers that serves as the retrieval pool. In addition, we propose an automated evaluation framework that measures generated surveys across four dimensions: information coverage, referencing accuracy, structural organization, and content quality. Our evaluation of diverse LLM-based approaches shows that survey generation remains highly challenging, even for advanced self-reflection frameworks. These findings highlight the complexity of the task and the necessity for continued research. We have open-sourced all the code, data, and models at: https://github.com/oneal2000/SurGE

Paper and Project Links

PDF

Summary

Survey articles play a vital role in summarizing research progress, but manual creation is becoming infeasible given the rapid growth of academic literature. LLMs offer the potential to automate survey generation, yet the field lacks standardized benchmarks and evaluation protocols. SurGE (Survey Generation Evaluation) fills this gap for computer science: it pairs a set of test instances, each with a topic description, an expert-written survey, and its full set of cited references, with a retrieval corpus of over one million papers, and provides an automated framework that scores generated surveys on information coverage, referencing accuracy, structural organization, and content quality. Evaluations of diverse LLM-based approaches show that survey generation remains highly challenging, even for advanced self-reflection frameworks. All code, data, and models are open-sourced.

Key Takeaways

  1. Survey articles play an important role in summarizing research progress, but manual creation is becoming infeasible.
  2. Large language models (LLMs) offer the potential to automate survey generation.
  3. Standardized benchmarks and evaluation protocols for assessing LLM-based survey generation have been lacking.
  4. The SurGE benchmark pairs test instances with a corpus of over one million papers to evaluate scientific survey generation.
  5. SurGE provides an automated framework that scores generated surveys on information coverage, referencing accuracy, structure, and content quality.
  6. LLM-based approaches still find survey generation highly challenging, underscoring the task's complexity.

Cool Papers

Click here to view paper screenshots

On the Effectiveness of Graph Reordering for Accelerating Approximate Nearest Neighbor Search on GPU

Authors:Yutaro Oguri, Mai Nishimura, Yusuke Matsui

We present the first systematic investigation of graph reordering effects for graph-based Approximate Nearest Neighbor Search (ANNS) on a GPU. While graph-based ANNS has become the dominant paradigm for modern AI applications, recent approaches focus on algorithmic innovations while neglecting memory layout considerations that significantly affect execution time. Our unified evaluation framework enables comprehensive evaluation of diverse reordering strategies across different graph indices through a graph adapter that converts arbitrary graph topologies into a common representation and a GPU-optimized graph traversal engine. We conduct a comprehensive analysis across diverse datasets and state-of-the-art graph indices, introducing analysis metrics that quantify the relationship between structural properties and memory layout effectiveness. Our GPU-targeted reordering achieves up to 15% QPS improvements while preserving search accuracy, demonstrating that memory layout optimization operates orthogonally to existing algorithmic innovations. We will release all code upon publication to facilitate reproducibility and foster further research.

Paper and Project Links

PDF

Summary

This paper presents the first systematic study of graph reordering for graph-based approximate nearest neighbor search (ANNS) on GPUs. Although graph-based ANNS dominates modern AI applications, recent work has focused on algorithmic innovation while neglecting memory layout, which significantly affects execution time. The authors build a unified evaluation framework, consisting of a graph adapter that converts arbitrary graph topologies into a common representation plus a GPU-optimized traversal engine, to compare diverse reordering strategies across graph indices. Analyzing multiple datasets and state-of-the-art indices with metrics that relate structural properties to layout effectiveness, their GPU-targeted reordering achieves up to 15% QPS improvement without losing search accuracy, showing that memory layout optimization is orthogonal to existing algorithmic innovations.

Key Takeaways

  1. This is the first systematic study of graph reordering effects for GPU-based graph ANNS.
  2. Although graph-based ANNS dominates modern AI applications, recent work emphasizes algorithmic innovation while neglecting memory layout.
  3. A unified evaluation framework comprehensively compares reordering strategies across different graph indices.
  4. A graph adapter converts arbitrary graph topologies into a common representation for a GPU-optimized traversal engine.
  5. New analysis metrics quantify the relationship between structural properties and memory layout effectiveness across datasets and indices.
  6. GPU-targeted reordering improves QPS by up to 15% while preserving search accuracy (a classic reordering baseline is sketched below).
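
Graph reordering for locality is often done by relabeling vertices so that neighbors receive nearby IDs; breadth-first relabeling is one classic baseline. The sketch below illustrates that generic idea, not the specific strategies the paper evaluates.

```python
from collections import deque
from typing import Dict, List

def bfs_reorder(adj: Dict[int, List[int]], start: int = 0) -> Dict[int, int]:
    """Assign new vertex IDs in BFS order so that vertices visited together
    end up close in memory, improving cache behavior during graph traversal."""
    new_id, order = {}, 0
    queue = deque([start])
    while queue:
        v = queue.popleft()
        if v in new_id:
            continue
        new_id[v] = order
        order += 1
        queue.extend(u for u in adj[v] if u not in new_id)
    # Unreached vertices (disconnected components) get arbitrary trailing IDs.
    for v in adj:
        if v not in new_id:
            new_id[v] = order
            order += 1
    return new_id

adj = {0: [2, 3], 1: [3], 2: [0, 3], 3: [0, 1, 2]}
mapping = bfs_reorder(adj)
relabeled = {mapping[v]: sorted(mapping[u] for u in nbrs) for v, nbrs in adj.items()}
print(mapping, relabeled)
```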

Cool Papers

Click here to view paper screenshots

DIO: Refining Mutual Information and Causal Chain to Enhance Machine Abstract Reasoning Ability

Authors:Ruizhuo Song, Beiming Yuan

Despite the outstanding performance of current deep learning models across various domains, their fundamental bottleneck in abstract reasoning remains unresolved. To address this challenge, the academic community has introduced Raven’s Progressive Matrices (RPM) problems as an authoritative benchmark for evaluating the abstract reasoning capabilities of deep learning algorithms, with a focus on core intelligence dimensions such as abstract reasoning, pattern recognition, and complex problem-solving. Therefore, this paper centers on solving RPM problems, aiming to contribute to enhancing the abstract reasoning abilities of machine intelligence. Firstly, this paper adopts a "causal chain modeling" perspective to systematically analyze the complete causal chain in RPM tasks: image $\rightarrow$ abstract attributes $\rightarrow$ progressive attribute patterns $\rightarrow$ pattern consistency $\rightarrow$ correct answer. Based on this analysis, the network architecture of the baseline model DIO is designed. However, experiments reveal that the optimization objective formulated for DIO, namely maximizing the variational lower bound of mutual information between the context and the correct option, fails to enable the model to genuinely acquire the predefined human reasoning logic. This is attributed to two main reasons: the tightness of the lower bound significantly impacts the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture the causal relationship between subjects and objects. To overcome these limitations, this paper progressively proposes three improvement methods:

Paper and Project Links

PDF 15 pages, 9 figures, 8 tables

Summary

Despite strong performance across domains, deep learning models remain bottlenecked at abstract reasoning. Raven's Progressive Matrices (RPM) problems serve as an authoritative benchmark for this ability, and the paper targets them to improve machine abstract reasoning. It adopts a causal-chain modeling perspective on RPM tasks (image → abstract attributes → progressive attribute patterns → pattern consistency → correct answer) and designs the baseline model DIO accordingly. Experiments show, however, that DIO's objective of maximizing a variational lower bound on the mutual information between the context and the correct option fails to instill the intended human reasoning logic: the tightness of the bound limits the effectiveness of mutual information maximization, and mutual information, as a statistical measure, does not capture causal relationships between subjects and objects. The paper therefore progressively proposes three improvement methods.

Key Takeaways

  1. Deep learning models are bottlenecked at abstract reasoning, motivating benchmarks such as RPM problems.
  2. The paper analyzes the complete causal chain of RPM tasks via causal chain modeling and designs the baseline model DIO.
  3. DIO's objective of maximizing a variational lower bound on mutual information fails to instill the intended human reasoning logic, partly because the bound's tightness limits the effectiveness of maximization (see the bound below).
  4. Mutual information, as a statistical measure, cannot capture the causal relationship between subjects and objects.
  5. To overcome these limitations, the paper progressively proposes three improvement methods.
  6. Solving RPM problems matters for improving machine abstract reasoning.
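
For context, a standard variational (InfoNCE-style) lower bound on mutual information, the kind of objective referred to above, is shown below. This is the generic bound, not necessarily the exact form used for DIO; $c$ is the context, $o^{+}$ the correct option among $N$ candidates $o_j$, and $f$ a learned critic:

$$
I(C;O) \;\ge\; \log N + \mathbb{E}\!\left[\log \frac{e^{f(c,\,o^{+})}}{\sum_{j=1}^{N} e^{f(c,\,o_j)}}\right]
$$

Since the expectation term is nonpositive, the right-hand side can never exceed $\log N$: the bound saturates, which is consistent with the observation that its tightness limits what mutual information maximization can achieve.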

Cool Papers

Click here to view paper screenshots

Unveiling Trust in Multimodal Large Language Models: Evaluation, Analysis, and Mitigation

Authors:Yichi Zhang, Yao Huang, Yifan Wang, Yitong Sun, Chang Liu, Zhe Zhao, Zhengwei Fang, Huanran Chen, Xiao Yang, Xingxing Wei, Hang Su, Yinpeng Dong, Jun Zhu

The trustworthiness of Multimodal Large Language Models (MLLMs) remains an intense concern despite the significant progress in their capabilities. Existing evaluation and mitigation approaches often focus on narrow aspects and overlook risks introduced by the multimodality. To tackle these challenges, we propose MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating the trustworthiness issues of MLLMs. We define a three-dimensional framework, encompassing five trustworthiness aspects which include truthfulness, robustness, safety, fairness, and privacy; two novel risk types covering multimodal risks and cross-modal impacts; and various mitigation strategies from the perspectives of data, model architecture, training, and inference algorithms. Based on the taxonomy, MultiTrust-X includes 32 tasks and 28 curated datasets, enabling holistic evaluations over 30 open-source and proprietary MLLMs and in-depth analysis with 8 representative mitigation methods. Our extensive experiments reveal significant vulnerabilities in current models, including a gap between trustworthiness and general capabilities, as well as the amplification of potential risks in base LLMs by both multimodal training and inference. Moreover, our controlled analysis uncovers key limitations in existing mitigation strategies: while some methods yield improvements in specific aspects, few effectively address overall trustworthiness, and many introduce unexpected trade-offs that compromise model utility. These findings also provide practical insights for future improvements, such as the benefits of reasoning to better balance safety and performance. Based on these insights, we introduce a Reasoning-Enhanced Safety Alignment (RESA) approach that equips the model with chain-of-thought reasoning ability to discover the underlying risks, achieving state-of-the-art results.

Paper and Project Links

PDF For Appendix, please refer to arXiv:2406.07057

Summary

The paper proposes MultiTrust-X, a comprehensive benchmark for evaluating, analyzing, and mitigating trustworthiness issues in multimodal large language models (MLLMs). It defines a three-dimensional framework: five trustworthiness aspects (truthfulness, robustness, safety, fairness, privacy), two novel risk types (multimodal risks and cross-modal impacts), and mitigation strategies spanning data, model architecture, training, and inference. MultiTrust-X comprises 32 tasks and 28 curated datasets, enabling holistic evaluation of 30 open-source and proprietary MLLMs and in-depth analysis of 8 representative mitigation methods. The experiments expose significant vulnerabilities in current models and yield practical insights for improvement, and the paper introduces RESA, a Reasoning-Enhanced Safety Alignment approach that uses chain-of-thought reasoning to uncover underlying risks and achieves state-of-the-art results.

Key Takeaways

  1. The trustworthiness of multimodal large language models (MLLMs) remains a pressing concern despite rapid capability gains.
  2. Existing evaluation and mitigation approaches focus on narrow aspects and overlook risks introduced by multimodality.
  3. MultiTrust-X is a comprehensive benchmark for evaluating, analyzing, and mitigating MLLM trustworthiness, with a three-dimensional framework covering truthfulness, robustness, safety, fairness, and privacy.
  4. MultiTrust-X also covers two novel risk types (multimodal risks and cross-modal impacts) plus a range of mitigation strategies.
  5. Experiments reveal significant vulnerabilities, including a gap between trustworthiness and general capability and the amplification of base-LLM risks by multimodal training and inference.
  6. Existing mitigation strategies are limited: some improve specific aspects, few address overall trustworthiness, and many introduce unexpected trade-offs that hurt model utility.

Cool Papers

Click here to view paper screenshots

KG-EDAS: A Meta-Metric Framework for Evaluating Knowledge Graph Completion Models

Authors:Haji Gul, Abul Ghani Naim, Ajaz Ahmad Bhat

Knowledge Graphs (KGs) enable applications in various domains such as semantic search, recommendation systems, and natural language processing. KGs are often incomplete, missing entities and relations, an issue addressed by Knowledge Graph Completion (KGC) methods that predict missing elements. Different evaluation metrics, such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, are commonly used to assess the performance of such KGC models. A major challenge in evaluating KGC models, however, lies in comparing their performance across multiple datasets and metrics. A model may outperform others on one dataset but underperform on another, making it difficult to determine overall superiority. Moreover, even within a single dataset, different metrics such as MRR and Hit@1 can yield conflicting rankings, where one model excels in MRR while another performs better in Hit@1, further complicating model selection for downstream tasks. These inconsistencies hinder holistic comparisons and highlight the need for a unified meta-metric that integrates performance across all metrics and datasets to enable a more reliable and interpretable evaluation framework. To address this need, we propose KG Evaluation based on Distance from Average Solution (EDAS), a robust and interpretable meta-metric that synthesizes model performance across multiple datasets and diverse evaluation criteria into a single normalized score ($M_i \in [0,1]$). Unlike traditional metrics that focus on isolated aspects of performance, EDAS offers a global perspective that supports more informed model selection and promotes fairness in cross-dataset evaluation. Experimental results on benchmark datasets such as FB15k-237 and WN18RR demonstrate that EDAS effectively integrates multi-metric, multi-dataset performance into a unified ranking, offering a consistent, robust, and generalizable framework for evaluating KGC models.

Paper and Project Links

PDF

Summary

Knowledge graphs (KGs) power applications such as semantic search, recommendation systems, and natural language processing, but they are often incomplete; knowledge graph completion (KGC) methods predict the missing elements. KGC models are scored with metrics such as Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hit@k, yet comparing models across multiple datasets and metrics is hard: a model can win on one dataset and lose on another, and even within one dataset MRR and Hit@1 can rank models differently. The paper proposes EDAS, a robust and interpretable meta-metric that synthesizes performance across datasets and criteria into a single normalized score ($M_i \in [0,1]$), supporting more informed model selection and fairer cross-dataset evaluation. Experiments on FB15k-237 and WN18RR show that EDAS integrates multi-metric, multi-dataset performance into a consistent, robust, and generalizable ranking.

Key Takeaways

  1. KGs are widely used but incomplete; knowledge graph completion (KGC) methods predict the missing entities and relations.
  2. KGC performance is usually assessed over several datasets and metrics, which makes model comparison challenging.
  3. Different metrics can produce conflicting rankings for the same models.
  4. EDAS is a meta-metric that synthesizes performance across datasets and metrics into a single normalized score (a computation sketch follows this list).
  5. EDAS offers a global perspective, supporting informed model selection and fair cross-dataset evaluation.
  6. Experiments on FB15k-237 and WN18RR show EDAS effectively integrates multi-metric, multi-dataset performance.
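
EDAS (Evaluation based on Distance from Average Solution) is an established multi-criteria decision method; below is a minimal sketch of its standard computation for benefit-type criteria (higher is better), with models as rows and metric-dataset pairs as columns. The weights and numbers are toy values, and the paper's adaptation may differ in details such as cost criteria or weighting.

```python
import numpy as np

def edas_scores(X: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Standard EDAS for benefit criteria: score each row (model) by its
    weighted positive/negative distances from the column-average solution."""
    av = X.mean(axis=0)                                # average solution per criterion
    pda = np.maximum(0, X - av) / av                   # positive distance from average
    nda = np.maximum(0, av - X) / av                   # negative distance from average
    sp, sn = pda @ w, nda @ w                          # weighted sums per model
    nsp = sp / sp.max() if sp.max() > 0 else sp        # normalize
    nsn = 1 - (sn / sn.max() if sn.max() > 0 else sn)
    return (nsp + nsn) / 2                             # appraisal score in [0, 1]

# Rows: models; columns: e.g. MRR and Hit@1 on two datasets (toy numbers).
X = np.array([[0.35, 0.28, 0.47, 0.41],
              [0.33, 0.31, 0.49, 0.38],
              [0.30, 0.25, 0.44, 0.36]])
w = np.array([0.25, 0.25, 0.25, 0.25])                 # equal criterion weights
print(edas_scores(X, w))                               # higher is better
```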

Cool Papers

点此查看论文截图

REG4Rec: Reasoning-Enhanced Generative Model for Large-Scale Recommendation Systems

Authors:Haibo Xing, Hao Deng, Yucheng Mao, Jinxin Hu, Yi Xu, Hao Zhang, Jiahao Wang, Shizhun Wang, Yu Zhang, Xiaoyi Zeng, Jing Zhang

Sequential recommendation aims to predict a user’s next action in large-scale recommender systems. While traditional methods often suffer from insufficient information interaction, recent generative recommendation models partially address this issue by directly generating item predictions. To better capture user intents, recent studies have introduced a reasoning process into generative recommendation, significantly improving recommendation performance. However, these approaches are constrained by the singularity of item semantic representations, facing challenges such as limited diversity in reasoning pathways and insufficient reliability in the reasoning process. To tackle these issues, we introduce REG4Rec, a reasoning-enhanced generative model that constructs multiple dynamic semantic reasoning paths alongside a self-reflection process, ensuring high-confidence recommendations. Specifically, REG4Rec utilizes an MoE-based parallel quantization codebook (MPQ) to generate multiple unordered semantic tokens for each item, thereby constructing a larger-scale diverse reasoning space. Furthermore, to enhance the reliability of reasoning, we propose a training reasoning enhancement stage, which includes Preference Alignment for Reasoning (PARS) and a Multi-Step Reward Augmentation (MSRA) strategy. PARS uses reward functions tailored for recommendation to enhance reasoning and reflection, while MSRA introduces future multi-step actions to improve overall generalization. During inference, Consistency-Oriented Self-Reflection for Pruning (CORP) is proposed to discard inconsistent reasoning paths, preventing the propagation of erroneous reasoning. Lastly, we develop an efficient offline training strategy for large-scale recommendation. Experiments on real-world datasets and online evaluations show that REG4Rec delivers outstanding performance and substantial practical value.

Paper and Project Links

PDF

Summary

REG4Rec is a reasoning-enhanced generative model for large-scale recommendation that builds multiple dynamic semantic reasoning paths alongside a self-reflection process to deliver high-confidence recommendations. It uses an MoE-based parallel quantization codebook (MPQ) to generate multiple unordered semantic tokens per item, creating a larger, more diverse reasoning space. A training-stage reasoning enhancement combines Preference Alignment for Reasoning (PARS), with recommendation-tailored reward functions, and a Multi-Step Reward Augmentation (MSRA) strategy that introduces future multi-step actions for better generalization. At inference, Consistency-Oriented Self-Reflection for Pruning (CORP) discards inconsistent reasoning paths to prevent the propagation of erroneous reasoning. Experiments on real-world datasets and online evaluations show strong performance and substantial practical value.

Key Takeaways

  1. REG4Rec combines generative recommendation with dynamic semantic reasoning to better capture user intent and improve recommendation performance.
  2. Building multiple dynamic semantic reasoning paths addresses the limitations of a single semantic representation.
  3. An MoE-based parallel quantization codebook generates multiple unordered semantic tokens per item, creating a large, diverse reasoning space.
  4. A reasoning-enhancement training stage, with preference alignment and multi-step reward augmentation, improves reasoning reliability.
  5. Consistency-oriented self-reflection prunes inconsistent reasoning paths at inference to stop error propagation.
  6. An efficient offline training strategy makes REG4Rec suitable for large-scale recommendation scenarios.

Cool Papers

Click here to view paper screenshots

RATopo: Improving Lane Topology Reasoning via Redundancy Assignment

Authors:Han Li, Shaofei Huang, Longfei Xu, Yulu Gao, Beipeng Mu, Si Liu

Lane topology reasoning plays a critical role in autonomous driving by modeling the connections among lanes and the topological relationships between lanes and traffic elements. Most existing methods adopt a first-detect-then-reason paradigm, where topological relationships are supervised based on the one-to-one assignment results obtained during the detection stage. This supervision strategy results in suboptimal topology reasoning performance due to the limited range of valid supervision. In this paper, we propose RATopo, a Redundancy Assignment strategy for lane Topology reasoning that enables quantity-rich and geometry-diverse topology supervision. Specifically, we restructure the Transformer decoder by swapping the cross-attention and self-attention layers. This allows redundant lane predictions to be retained before suppression, enabling effective one-to-many assignment. We also instantiate multiple parallel cross-attention blocks with independent parameters, which further enhances the diversity of detected lanes. Extensive experiments on OpenLane-V2 demonstrate that our RATopo strategy is model-agnostic and can be seamlessly integrated into existing topology reasoning frameworks, consistently improving both lane-lane and lane-traffic topology performance.

Paper and Project Links

PDF Accepted by ACM MM 2025

Summary

Lane topology reasoning is critical for autonomous driving, modeling connections among lanes and relationships between lanes and traffic elements. Existing first-detect-then-reason methods supervise topology using one-to-one assignment results from the detection stage, which limits the range of valid supervision and yields suboptimal reasoning. RATopo is a redundancy-assignment strategy that enables quantity-rich, geometry-diverse topology supervision: it restructures the Transformer decoder by swapping the cross-attention and self-attention layers so redundant lane predictions are retained before suppression, enabling effective one-to-many assignment, and it instantiates multiple parallel cross-attention blocks with independent parameters to diversify detected lanes. Experiments on OpenLane-V2 show RATopo is model-agnostic and consistently improves both lane-lane and lane-traffic topology performance.

Key Takeaways

  1. Lane topology reasoning is key to autonomous driving, modeling lane-lane connections and lane-traffic relationships.
  2. Existing supervision strategies limit the range of valid supervision, hurting topology reasoning performance.
  3. RATopo's redundancy assignment enables quantity-rich and geometry-diverse topology supervision.
  4. The Transformer decoder is restructured by swapping the cross-attention and self-attention layers so redundant lane predictions survive until suppression (see the sketch below).
  5. Multiple parallel cross-attention blocks with independent parameters increase the diversity of detected lanes.
  6. RATopo is model-agnostic and integrates seamlessly into existing topology reasoning frameworks.
  7. It consistently improves both lane-lane and lane-traffic topology performance.
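
A minimal PyTorch sketch of the layer-swap idea: a standard decoder layer applies self-attention (where duplicate queries can suppress one another) before cross-attention, whereas the swapped variant lets every query gather image evidence first. Dimensions and modules here are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SwappedDecoderLayer(nn.Module):
    """Decoder layer with cross-attention applied before self-attention,
    so redundant queries gather evidence before any mutual suppression."""
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # 1) Cross-attention first: each lane query reads the image features.
        q = self.norm1(queries + self.cross_attn(queries, memory, memory)[0])
        # 2) Self-attention afterwards: queries interact (and may suppress) later.
        q = self.norm2(q + self.self_attn(q, q, q)[0])
        return self.norm3(q + self.ffn(q))

layer = SwappedDecoderLayer()
out = layer(torch.randn(2, 100, 256), torch.randn(2, 1500, 256))
print(out.shape)  # torch.Size([2, 100, 256])
```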

Cool Papers

Click here to view paper screenshots

Select to Know: An Internal-External Knowledge Self-Selection Framework for Domain-Specific Question Answering

Authors:Bolei He, Xinran He, Run Shao, Shanfu Shu, Xianwei Xue, Mingquan Cheng, Haifeng Li, Zhenhua Ling

Large Language Models (LLMs) perform well in general QA but often struggle in domain-specific scenarios. Retrieval-Augmented Generation (RAG) introduces external knowledge but suffers from hallucinations and latency due to noisy retrievals. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility. We attribute this challenge to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized. We further argue that knowledge acquisition should be progressive, mirroring human learning: first understanding concepts, then applying them to complex reasoning. To address this, we propose Select2Know (S2K), a cost-effective framework that internalizes domain knowledge through an internal-external knowledge self-selection strategy and selective supervised fine-tuning. We also introduce a structured reasoning data generation pipeline and integrate GRPO to enhance reasoning ability. Experiments on medical, legal, and financial QA benchmarks show that S2K consistently outperforms existing methods and matches domain-pretrained LLMs with significantly lower cost.

Paper and Project Links

PDF EMNLP2025 Findings

Summary

LLMs do well at general QA but struggle in domain-specific scenarios. RAG brings in external knowledge but suffers from hallucinations and latency caused by noisy retrieval, while continued pretraining internalizes domain knowledge at high cost and without cross-domain flexibility. The paper attributes this to the long-tail distribution of domain knowledge, which leaves partial yet useful internal knowledge underutilized, and argues that knowledge acquisition should be progressive, mirroring human learning: first understand concepts, then apply them to complex reasoning. Select2Know (S2K) is a cost-effective framework that internalizes domain knowledge via an internal-external knowledge self-selection strategy and selective supervised fine-tuning, adds a structured reasoning data generation pipeline, and integrates GRPO to strengthen reasoning. On medical, legal, and financial QA benchmarks, S2K consistently outperforms existing methods and matches domain-pretrained LLMs at significantly lower cost.

Key Takeaways

  1. LLMs struggle in domain-specific scenarios and need more effective ways to internalize domain knowledge.
  2. RAG introduces external knowledge but suffers from hallucinations and latency.
  3. Continued pretraining internalizes domain knowledge but is costly and lacks cross-domain flexibility.
  4. The long-tail distribution of domain knowledge leaves partial yet useful internal knowledge underutilized.
  5. Knowledge acquisition should be progressive: first understand concepts, then apply them to complex reasoning.
  6. S2K internalizes domain knowledge via internal-external knowledge self-selection and selective supervised fine-tuning, improving performance at lower cost (an illustrative routing sketch follows).
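
One simple way to picture choosing between internal and external knowledge is confidence-based routing at inference time, as sketched below. Note this is only an illustrative analogy: the paper's self-selection operates during training data construction and selective fine-tuning, and the `ToyLLM` interface here is entirely hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class ToyLLM:
    """Stand-in for an LLM client; real confidence could come from sequence
    log-probs or a verifier, which is an assumption made for this sketch."""
    known: dict

    def answer_with_confidence(self, q: str) -> Tuple[str, float]:
        return self.known.get(q, ("I am not sure", 0.2))

    def answer_with_context(self, q: str, docs: List[str]) -> str:
        return f"Based on {len(docs)} retrieved passages: ..."

def select_to_answer(q: str, llm: ToyLLM, search: Callable[[str], List[str]],
                     threshold: float = 0.75) -> str:
    draft, conf = llm.answer_with_confidence(q)
    if conf >= threshold:
        return draft                                  # internal knowledge suffices
    return llm.answer_with_context(q, search(q))      # external knowledge path

llm = ToyLLM(known={"What is aspirin?": ("An NSAID analgesic.", 0.9)})
search = lambda q: ["doc1", "doc2"]
print(select_to_answer("What is aspirin?", llm, search))
print(select_to_answer("Rare disease X treatment?", llm, search))
```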

Cool Papers

Click here to view paper screenshots

Fin-PRM: A Domain-Specialized Process Reward Model for Financial Reasoning in Large Language Models

Authors:Yuanchen Zhou, Shuo Jiang, Jie Zhu, Junhui Li, Lifan Guo, Feng Chen, Chi Zhang

Process Reward Models (PRMs) have emerged as a promising framework for supervising intermediate reasoning in large language models (LLMs), yet existing PRMs are primarily trained on general or Science, Technology, Engineering, and Mathematics (STEM) domains and fall short in domain-specific contexts such as finance, where reasoning is more structured, symbolic, and sensitive to factual and regulatory correctness. We introduce Fin-PRM, a domain-specialized, trajectory-aware PRM tailored to evaluate intermediate reasoning steps in financial tasks. Fin-PRM integrates step-level and trajectory-level reward supervision, enabling fine-grained evaluation of reasoning traces aligned with financial logic. We apply Fin-PRM in both offline and online reward learning settings, supporting three key applications: (i) selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, (ii) providing dense process-level rewards for reinforcement learning, and (iii) guiding reward-informed Best-of-N inference at test time. Experimental results on financial reasoning benchmarks, including CFLUE and FinQA, demonstrate that Fin-PRM consistently outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality. Downstream models trained with Fin-PRM yield substantial improvements over baselines, with gains of 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% in test-time performance. These findings highlight the value of domain-specialized reward modeling for aligning LLMs with expert-level financial reasoning. Our project resources will be available at https://github.com/aliyun/qwen-dianjin.

Paper and Project Links

PDF

Summary

Fin-PRM is a domain-specialized, trajectory-aware process reward model for evaluating intermediate reasoning steps in financial tasks. It combines step-level and trajectory-level reward supervision for fine-grained evaluation of reasoning traces aligned with financial logic, and supports three applications: selecting high-quality reasoning trajectories for distillation-based supervised fine-tuning, providing dense process-level rewards for reinforcement learning, and guiding reward-informed Best-of-N inference at test time. On financial reasoning benchmarks including CFLUE and FinQA, Fin-PRM outperforms general-purpose PRMs and strong domain baselines in trajectory selection quality; downstream models trained with it gain 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% at test time.

Key Takeaways

  1. Fin-PRM is a domain-specialized process reward model for evaluating intermediate reasoning in financial tasks.
  2. It combines step-level and trajectory-level reward supervision to evaluate reasoning traces in line with financial logic.
  3. Fin-PRM outperforms general-purpose PRMs and strong domain baselines on financial reasoning benchmarks.
  4. It improves trajectory selection for supervised fine-tuning, reinforcement learning, and test-time Best-of-N inference (sketched below).
  5. Downstream models trained with Fin-PRM gain 12.9% in supervised learning, 5.2% in reinforcement learning, and 5.1% at test time.
  6. The results demonstrate the value of domain-specialized reward modeling for fields such as finance.
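
Reward-informed Best-of-N inference, one of the three applications above, is straightforward to sketch: sample N candidate reasoning trajectories and return the one the PRM scores highest. The sampler and scorer below are toy stand-ins for an LLM and Fin-PRM.

```python
import random
from typing import Callable

def best_of_n(prompt: str,
              sample: Callable[[str], str],
              prm_score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Draw n candidate reasoning trajectories and keep the one the
    process reward model (PRM) scores highest."""
    candidates = [sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda traj: prm_score(prompt, traj))

# Toy stand-ins: a real system would call an LLM sampler and Fin-PRM here.
sample = lambda p: f"step 1 ... step {random.randint(1, 5)} ... final answer"
prm_score = lambda p, t: float(len(t))   # toy scorer: prefers longer traces
print(best_of_n("What drove the Q3 margin change?", sample, prm_score, n=4))
```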

Cool Papers

Click here to view paper screenshots

PuzzleClone: An SMT-Powered Framework for Synthesizing Verifiable Data

Authors:Kai Xiong, Yanwei Huang, Rongjunchen Zhang, Kun Chen, Haipang Wu

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using Satisfiability Modulo Theories (SMT). Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct a curated benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. We conduct post-training (SFT and RL) on PuzzleClone datasets. Experimental results show that training on PuzzleClone yields substantial improvements not only on the PuzzleClone test set but also on logic and mathematical benchmarks. Post-training raises the PuzzleClone average from 14.4 to 56.2 and delivers consistent improvements of up to 12.5 absolute percentage points across 7 logic and mathematical benchmarks (AMC2023 from 52.5 to 65.0). Our code and data are available at https://github.com/puzzleclone.

Paper and Project Links

PDF

Summary

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening LLM reasoning, but existing LLM-generated datasets have limited reliability, diversity, and scalability. PuzzleClone is a formal framework that synthesizes verifiable data at scale with Satisfiability Modulo Theories (SMT): it encodes seed puzzles as structured logical specifications, generates scalable variants via systematic variable and constraint randomization, and ensures validity through a reproduction mechanism. The resulting benchmark contains more than 83K diverse, programmatically validated puzzles spanning a wide range of difficulty and formats. Post-training (SFT and RL) on PuzzleClone data improves performance not only on the PuzzleClone test set but also on logic and mathematical benchmarks.

Key Takeaways

  1. High-quality mathematical and logical datasets with verifiable answers are essential for strengthening LLM reasoning.
  2. Existing LLM-generated datasets suffer from limited reliability, diversity, and scalability.
  3. PuzzleClone synthesizes verifiable data with SMT through three key innovations (a generation sketch follows this list).
  4. PuzzleClone provides a curated benchmark of over 83K diverse, programmatically validated puzzles.
  5. Training LLMs on PuzzleClone improves performance on its test set as well as on logic and math benchmarks.
  6. Post-training raises the PuzzleClone average from 14.4 to 56.2, with consistent gains across 7 logic and math benchmarks.
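
A minimal sketch of the SMT-powered generation loop using the Z3 Python bindings (`pip install z3-solver`): randomize the constants of a seed puzzle's constraint template, then keep only variants whose answer exists and is unique. The two-variable linear seed puzzle is an illustrative stand-in for real seed puzzles.

```python
import random
from z3 import And, Ints, Not, Solver, sat

def make_variant(rng: random.Random):
    """Randomize the seed template 'x + y == a, x - y == b' and return the
    variant only if it has exactly one solution (a verifiable answer)."""
    a, b = rng.randint(2, 50), rng.randint(0, 20)
    x, y = Ints("x y")
    s = Solver()
    s.add(And(x + y == a, x - y == b, x >= 0, y >= 0))
    if s.check() != sat:
        return None                              # unsatisfiable: discard
    m = s.model()
    s.add(Not(And(x == m[x], y == m[y])))        # reproduction check:
    if s.check() == sat:                         # a second answer exists,
        return None                              # so the variant is ambiguous
    return {"a": a, "b": b, "answer": (m[x].as_long(), m[y].as_long())}

rng = random.Random(0)
variants = [v for v in (make_variant(rng) for _ in range(20)) if v]
print(len(variants), variants[0] if variants else None)
```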

Cool Papers

Click here to view paper screenshots

ContextualLVLM-Agent: A Holistic Framework for Multi-Turn Visually-Grounded Dialogue and Complex Instruction Following

Authors:Seungmin Han, Haeun Kwon, Ji-jun Park, Taeyang Yoon

Despite significant advancements in Large Language Models (LLMs) and Large Vision-Language Models (LVLMs), current models still face substantial challenges in handling complex, multi-turn, and visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. Existing benchmarks often fall short in capturing the dynamism and intricacies of real-world multi-modal interactions, leading to issues such as context loss and visual hallucinations. To address these limitations, we introduce MMDR-Bench (Multi-Modal Dialogue Reasoning Benchmark), a novel dataset comprising 300 meticulously designed complex multi-turn dialogue scenarios, each averaging 5-7 turns and evaluated across six core dimensions including visual entity tracking and reasoning depth. Furthermore, we propose CoLVLM Agent (Contextual LVLM Agent), a holistic framework that enhances existing LVLMs with advanced reasoning and instruction following capabilities through an iterative “memory-perception-planning-execution” cycle, requiring no extensive re-training of the underlying models. Our extensive experiments on MMDR-Bench demonstrate that CoLVLM Agent consistently achieves superior performance, attaining an average human evaluation score of 4.03, notably surpassing state-of-the-art commercial models like GPT-4o (3.92) and Gemini 1.5 Pro (3.85). The framework exhibits significant advantages in reasoning depth, instruction adherence, and error suppression, and maintains robust performance over extended dialogue turns, validating the effectiveness of its modular design and iterative approach for complex multi-modal interactions.

Paper and Project Links

PDF

Summary

Despite advances in LLMs and LVLMs, current models struggle with complex, multi-turn, visually-grounded tasks that demand deep reasoning, sustained contextual understanding, entity tracking, and multi-step instruction following. MMDR-Bench is a dataset of 300 carefully designed complex multi-turn dialogue scenarios (5-7 turns on average) evaluated across six core dimensions such as visual entity tracking and reasoning depth. CoLVLM Agent is a holistic framework that enhances existing LVLMs with an iterative "memory-perception-planning-execution" cycle, requiring no extensive re-training. On MMDR-Bench, CoLVLM Agent reaches an average human evaluation score of 4.03, surpassing GPT-4o (3.92) and Gemini 1.5 Pro (3.85), with advantages in reasoning depth, instruction adherence, and error suppression, and it stays robust over long dialogues.

Key Takeaways

  1. LLMs and LVLMs still struggle with complex, multi-turn, visually-grounded tasks.
  2. MMDR-Bench targets this challenge with complex multi-turn dialogue scenarios emphasizing visual entity tracking and reasoning depth.
  3. The CoLVLM Agent framework strengthens existing LVLMs through an iterative reasoning cycle, without extensive re-training (see the loop sketch below).
  4. CoLVLM Agent outperforms GPT-4o and Gemini 1.5 Pro on MMDR-Bench.
  5. It shows clear advantages in reasoning depth, instruction adherence, and error suppression.
  6. It maintains robust performance over extended dialogue turns.
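
A minimal sketch of an iterative memory-perception-planning-execution cycle as described above; the module interfaces and toy stubs are hypothetical stand-ins, not the paper's implementation.

```python
from typing import Callable, List

def colvlm_cycle(user_turn: str, image, memory: List[str],
                 perceive: Callable, plan: Callable, execute: Callable,
                 max_iters: int = 3) -> str:
    """One dialogue turn: repeatedly perceive the image in light of memory,
    plan a next step, and execute it, until the plan says we can answer."""
    result = ""
    for _ in range(max_iters):
        percept = perceive(image, user_turn, memory)   # grounded observations
        step = plan(user_turn, percept, memory)        # next action or 'answer'
        result = execute(step, image)
        memory.append(f"{step} -> {result}")           # persist for later turns
        if step.startswith("answer"):
            return result
    return result

# Toy stubs: a real system would call vision and language models here.
perceive = lambda img, q, mem: "a red cup left of a book"
plan = lambda q, p, mem: "answer: the cup is red" if "cup" in p else "look closer"
execute = lambda step, img: step.split(": ", 1)[-1]
print(colvlm_cycle("What color is the cup?", None, [], perceive, plan, execute))
```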

Cool Papers

Click here to view paper screenshots

Mobile-Agent-v3: Foundamental Agents for GUI Automation

Authors:Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, Jitong Liao, Qi Zheng, Fei Huang, Jingren Zhou, Ming Yan

This paper introduces GUI-Owl, a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks across desktop and mobile environments, covering grounding, question answering, planning, decision-making, and procedural knowledge. GUI-Owl-7B achieves 66.4 on AndroidWorld and 29.4 on OSWorld. Building on this, we propose Mobile-Agent-v3, a general-purpose GUI agent framework that further improves performance to 73.3 on AndroidWorld and 37.7 on OSWorld, setting a new state-of-the-art for open-source GUI agent frameworks. GUI-Owl incorporates three key innovations: (1) Large-scale Environment Infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows, enabling our Self-Evolving GUI Trajectory Production framework. This generates high-quality interaction data via automated query generation and correctness validation, leveraging GUI-Owl to refine trajectories iteratively, forming a self-improving loop. It supports diverse data pipelines and reduces manual annotation. (2) Diverse Foundational Agent Capabilities: by integrating UI grounding, planning, action semantics, and reasoning patterns, GUI-Owl supports end-to-end decision-making and can act as a modular component in multi-agent systems. (3) Scalable Environment RL: we develop a scalable reinforcement learning framework with fully asynchronous training for real-world alignment. We also introduce Trajectory-aware Relative Policy Optimization (TRPO) for online RL, achieving 34.9 on OSWorld. GUI-Owl and Mobile-Agent-v3 are open-sourced at https://github.com/X-PLUG/MobileAgent.

Paper and Project Links

PDF

Summary

GUI-Owl is a foundational GUI agent model that achieves state-of-the-art performance among open-source end-to-end models on ten GUI benchmarks spanning desktop and mobile environments, scoring 66.4 on AndroidWorld and 29.4 on OSWorld. Built on it, the general-purpose Mobile-Agent-v3 framework raises performance to 73.3 on AndroidWorld and 37.7 on OSWorld, a new state of the art for open-source GUI agent frameworks. GUI-Owl rests on three key innovations: large-scale environment infrastructure, diverse foundational agent capabilities, and scalable environment RL.

Key Takeaways

  1. GUI-Owl: a foundational open-source end-to-end GUI agent model that performs strongly across multiple GUI benchmarks.
  2. Mobile-Agent-v3: a general-purpose framework that further improves GUI agent performance, setting a new state of the art on AndroidWorld and OSWorld.
  3. Large-scale environment infrastructure: a cloud-based virtual environment spanning Android, Ubuntu, macOS, and Windows powers a self-evolving GUI trajectory production framework, supporting diverse data pipelines and reducing manual annotation.
  4. Diverse foundational agent capabilities: GUI-Owl integrates UI grounding, planning, action semantics, and reasoning patterns, supporting end-to-end decision-making and modular use in multi-agent systems.
  5. Scalable environment RL: a fully asynchronous reinforcement learning framework for real-world alignment, with Trajectory-aware Relative Policy Optimization (TRPO) for online RL.
  6. Both GUI-Owl and Mobile-Agent-v3 are open-sourced for public access and contribution.
  7. Their strong results on mobile and desktop GUI agent tasks point to new directions for human-computer interaction.

Cool Papers

Click here to view paper screenshots

On the need to perform comprehensive evaluations of automated program repair benchmarks: Sorald case study

Authors:Sumudu Liyanage, Sherlock A. Licorish, Markus Wagner, Stephen G. MacDonell

In supporting the development of high-quality software, especially necessary in the era of LLMs, automated program repair (APR) tools aim to improve code quality by automatically addressing violations detected by static analysis profilers. Previous research tends to evaluate APR tools only for their ability to clear violations, neglecting their potential introduction of new (sometimes severe) violations, changes to code functionality and degrading of code structure. There is thus a need for research to develop and assess comprehensive evaluation frameworks for APR tools. This study addresses this research gap, and evaluates Sorald (a state-of-the-art APR tool) as a proof of concept. Sorald’s effectiveness was evaluated in repairing 3,529 SonarQube violations across 30 rules within 2,393 Java code snippets extracted from Stack Overflow. Outcomes show that while Sorald fixes specific rule violations, it introduced 2,120 new faults (32 bugs, 2088 code smells), reduced code functional correctness–as evidenced by a 24% unit test failure rate–and degraded code structure, demonstrating the utility of our framework. Findings emphasize the need for evaluation methodologies that capture the full spectrum of APR tool effects, including side effects, to ensure their safe and effective adoption.

Paper and Project Links

PDF

Summary

In the LLM era, automated program repair (APR) tools aim to improve code quality by automatically fixing violations flagged by static analysis profilers. Prior research mostly evaluates APR tools on their ability to clear violations, ignoring that they may introduce new (sometimes severe) violations, change code behavior, and degrade code structure. This study fills that gap by developing a comprehensive evaluation framework and applying it to Sorald, a state-of-the-art APR tool, on 3,529 SonarQube violations across 30 rules in 2,393 Java snippets from Stack Overflow. While Sorald fixes the targeted rule violations, it introduced 2,120 new faults (32 bugs, 2,088 code smells), reduced functional correctness (a 24% unit test failure rate), and degraded code structure. The findings argue for evaluation methodologies that capture the full spectrum of APR tool effects, including side effects, to ensure safe and effective adoption.

Key Takeaways

  1. Automated program repair (APR) tools aim to improve code quality, which matters all the more in the LLM era.
  2. Prior studies focus on violation clearance and overlook the risks of new defects, behavioral changes, and structural degradation.
  3. Sorald, a representative state-of-the-art APR tool, introduced many new faults and code smells while fixing targeted violations.
  4. Sorald reduced functional correctness, producing a 24% unit test failure rate.
  5. Code structure can degrade during automated repair.
  6. Comprehensive evaluation frameworks are needed to capture APR tools' full effects on code quality, functionality, and structure (a minimal scoring sketch follows).
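
The comprehensive-evaluation idea can be sketched as a before/after diff over profiler findings plus a functional check. The SonarQube-style rule keys below are illustrative, and a real harness would invoke the profiler and a test runner rather than take sets as input.

```python
from typing import Dict, Set

def evaluate_repair(before: Set[str], after: Set[str],
                    tests_passed: int, tests_total: int) -> Dict[str, object]:
    """Score an APR run on more than violation clearance: also count the
    NEW violations it introduced and the functional regressions it caused."""
    fixed = before - after          # violations the tool cleared
    introduced = after - before     # new faults the tool created
    return {
        "fixed": len(fixed),
        "introduced": len(introduced),
        "net_improvement": len(fixed) - len(introduced),
        "unit_test_failure_rate": 1 - tests_passed / tests_total,
    }

before = {"S2164:Foo.java:10", "S1172:Foo.java:22", "S2095:Bar.java:7"}
after = {"S1172:Foo.java:22", "S1481:Foo.java:11", "S1854:Foo.java:12"}
print(evaluate_repair(before, after, tests_passed=19, tests_total=25))
```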

Cool Papers

Click here to view paper screenshots

Don’t Think Twice! Over-Reasoning Impairs Confidence Calibration

Authors:Romain Lacombe, Kerrie Wu, Eddie Dilworth

Large Language Models deployed as question answering tools require robust calibration to avoid overconfidence. We systematically evaluate how reasoning capabilities and budget affect confidence assessment accuracy, using the ClimateX dataset (Lacombe et al., 2023) and expanding it to human and planetary health. Our key finding challenges the “test-time scaling” paradigm: while recent reasoning LLMs achieve 48.7% accuracy in assessing expert confidence, increasing reasoning budgets consistently impairs rather than improves calibration. Extended reasoning leads to systematic overconfidence that worsens with longer thinking budgets, producing diminishing and negative returns beyond modest computational investments. Conversely, search-augmented generation dramatically outperforms pure reasoning, achieving 89.3% accuracy by retrieving relevant evidence. Our results suggest that information access, rather than reasoning depth or inference budget, may be the critical bottleneck for improved confidence calibration of knowledge-intensive tasks.

Paper and Project Links

PDF Published at ICML 2025 Workshop on Reliable and Responsible Foundation Models

Summary

LLMs deployed as question answering tools need robust calibration to avoid overconfidence. Using the ClimateX dataset (Lacombe et al., 2023), expanded to human and planetary health, the authors systematically evaluate how reasoning capability and budget affect confidence assessment accuracy. The key finding challenges the test-time scaling paradigm: increasing the reasoning budget does not improve calibration but induces systematic overconfidence that worsens with longer thinking budgets, yielding diminishing and then negative returns beyond modest computational investment. In contrast, search-augmented generation reaches 89.3% accuracy by retrieving relevant evidence, far above pure reasoning (48.7%). Information access, rather than reasoning depth or inference budget, appears to be the critical bottleneck for better confidence calibration on knowledge-intensive tasks.

Key Takeaways

  1. LLMs deployed as question answering tools need calibration to avoid overconfidence.
  2. Increasing the reasoning budget does not improve confidence assessment; it can induce systematic overconfidence.
  3. Search-augmented generation markedly improves confidence assessment accuracy by retrieving relevant evidence.
  4. Information access is the critical bottleneck for calibrating knowledge-intensive tasks.
  5. Reasoning depth is not the main factor in better calibration.
  6. Modest computational investment works well; excessive reasoning budgets can yield negative returns.

Cool Papers

Click here to view paper screenshots

In-Context Iterative Policy Improvement for Dynamic Manipulation

Authors:Mark Van der Merwe, Devesh Jha

Attention-based architectures trained on internet-scale language data have demonstrated state of the art reasoning ability for various language-based tasks, such as logic problems and textual reasoning. Additionally, these Large Language Models (LLMs) have exhibited the ability to perform few-shot prediction via in-context learning, in which input-output examples provided in the prompt are generalized to new inputs. This ability furthermore extends beyond standard language tasks, enabling few-shot learning for general patterns. In this work, we consider the application of in-context learning with pre-trained language models for dynamic manipulation. Dynamic manipulation introduces several crucial challenges, including increased dimensionality, complex dynamics, and partial observability. To address this, we take an iterative approach, and formulate our in-context learning problem to predict adjustments to a parametric policy based on previous interactions. We show across several tasks in simulation and on a physical robot that utilizing in-context learning outperforms alternative methods in the low data regime. Video summary of this work and experiments can be found https://youtu.be/2inxpdrq74U?si=dAdDYsUEr25nZvRn.


Paper and Project Links

PDF 14 pages. Accepted at CoRL 2025

Summary

Attention-based architectures trained on internet-scale language data show state-of-the-art reasoning on language tasks and can perform few-shot prediction via in-context learning, generalizing input-output examples in the prompt to new inputs, even beyond standard language tasks. This work applies in-context learning with pre-trained language models to dynamic manipulation, which introduces increased dimensionality, complex dynamics, and partial observability. The authors take an iterative approach, formulating the in-context learning problem as predicting adjustments to a parametric policy from previous interactions. Across several tasks in simulation and on a physical robot, in-context learning outperforms alternative methods in the low-data regime. A video summary is linked in the abstract above.

Key Takeaways

  1. Language models pre-trained on internet-scale data show strong reasoning ability.
  2. They can perform few-shot prediction via in-context learning, generalizing prompt examples to new inputs beyond standard language tasks.
  3. In-context learning is promising for dynamic manipulation despite increased dimensionality, complex dynamics, and partial observability.
  4. An iterative formulation predicts adjustments to a parametric policy from previous interactions (sketched below).
  5. In simulation and on a physical robot, in-context learning outperforms alternative methods in the low-data regime.
  6. A video summary demonstrates the method and experiments.
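
The iterative formulation above can be sketched as a loop that folds previous (parameters, reward) pairs into the prompt and asks the model for an adjusted parameter vector. `query_llm` is a hypothetical client, and the parsing is deliberately simplistic; the toy reward and 'LLM' below only illustrate the control flow.

```python
import json
from typing import Callable, List, Tuple

def in_context_policy_improvement(
        evaluate: Callable[[List[float]], float],
        query_llm: Callable[[str], str],
        init_params: List[float],
        n_rounds: int = 5) -> Tuple[List[float], float]:
    """Iteratively ask an LLM to adjust a parametric policy based on the
    in-context history of previous trials and their rewards."""
    history: List[Tuple[List[float], float]] = []
    params = init_params
    for _ in range(n_rounds):
        reward = evaluate(params)              # roll out the current policy
        history.append((params, reward))
        prompt = ("Previous (parameters, reward) trials:\n"
                  + "\n".join(f"{p} -> {r:.3f}" for p, r in history)
                  + "\nPropose improved parameters as a JSON list.")
        params = json.loads(query_llm(prompt)) # few-shot generalization step
    return max(history, key=lambda h: h[1])    # best trial seen so far

# Toy stand-ins: reward peaks at [1.0, 2.0]; the 'LLM' nudges toward it.
evaluate = lambda p: -((p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2)
query_llm = lambda prompt: json.dumps([1.0, 2.0])
print(in_context_policy_improvement(evaluate, query_llm, [0.0, 0.0]))
```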

Cool Papers

Click here to view paper screenshots

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model

Authors: NVIDIA, :, Aarti Basant, Abhijit Khairnar, Abhijit Paithankar, Abhinav Khattar, Adithya Renduchintala, Aditya Malte, Akhiad Bercovich, Akshay Hazare, Alejandra Rico, Aleksander Ficek, Alex Kondratenko, Alex Shaposhnikov, Alexander Bukharin, Ali Taghibakhshi, Amelia Barton, Ameya Sunil Mahabaleshwarkar, Amy Shen, Andrew Tao, Ann Guan, Anna Shors, Anubhav Mandarwal, Arham Mehta, Arun Venkatesan, Ashton Sharabiani, Ashwath Aithal, Ashwin Poojary, Ayush Dattagupta, Balaram Buddharaju, Banghua Zhu, Barnaby Simkin, Bilal Kartal, Bita Darvish Rouhani, Bobby Chen, Boris Ginsburg, Brandon Norick, Brian Yu, Bryan Catanzaro, Charles Wang, Charlie Truong, Chetan Mungekar, Chintan Patel, Chris Alexiuk, Christian Munley, Christopher Parisien, Dan Su, Daniel Afrimi, Daniel Korzekwa, Daniel Rohrer, Daria Gitman, David Mosallanezhad, Deepak Narayanan, Dima Rekesh, Dina Yared, Dmytro Pykhtar, Dong Ahn, Duncan Riach, Eileen Long, Elliott Ning, Eric Chung, Erick Galinkin, Evelina Bakhturina, Gargi Prasad, Gerald Shen, Haifeng Qian, Haim Elisha, Harsh Sharma, Hayley Ross, Helen Ngo, Herman Sahota, Hexin Wang, Hoo Chang Shin, Hua Huang, Iain Cunningham, Igor Gitman, Ivan Moshkov, Jaehun Jung, Jan Kautz, Jane Polak Scowcroft, Jared Casper, Jian Zhang, Jiaqi Zeng, Jimmy Zhang, Jinze Xue, Jocelyn Huang, Joey Conway, John Kamalu, Jonathan Cohen, Joseph Jennings, Julien Veron Vialard, Junkeun Yi, Jupinder Parmar, Kari Briski, Katherine Cheung, Katherine Luna, Keith Wyss, Keshav Santhanam, Kezhi Kong, Krzysztof Pawelec, Kumar Anik, Kunlun Li, Kushan Ahmadian, Lawrence McAfee, Laya Sleiman, Leon Derczynski, Luis Vega, Maer Rodrigues de Melo, Makesh Narsimhan Sreedhar, Marcin Chochowski, Mark Cai, Markus Kliegl, Marta Stepniewska-Dziubinska, Matvei Novikov, Mehrzad Samadi, Meredith Price, Meriem Boubdir, Michael Boone, Michael Evans, Michal Bien, Michal Zawalski, Miguel Martinez, Mike Chrzanowski, Mohammad Shoeybi, Mostofa Patwary, Namit Dhameja, Nave Assaf, Negar Habibi, Nidhi Bhatia, Nikki Pope, Nima Tajbakhsh, Nirmal Kumar Juluru, Oleg Rybakov, Oleksii Hrinchuk, Oleksii Kuchaiev, Oluwatobi Olabiyi, Pablo Ribalta, Padmavathy Subramanian, Parth Chadha, Pavlo Molchanov, Peter Dykas, Peter Jin, Piotr Bialecki, Piotr Januszewski, Pradeep Thalasta, Prashant Gaikwad, Prasoon Varshney, Pritam Gundecha, Przemek Tredak, Rabeeh Karimi Mahabadi, Rajen Patel, Ran El-Yaniv, Ranjit Rajan, Ria Cheruvu, Rima Shahbazyan, Ritika Borkar, Ritu Gala, Roger Waleffe, Ruoxi Zhang, Russell J. Hewett, Ryan Prenger, Sahil Jain, Samuel Kriman, Sanjeev Satheesh, Saori Kaji, Sarah Yurick, Saurav Muralidharan, Sean Narenthiran, Seonmyeong Bak, Sepehr Sameni, Seungju Han, Shanmugam Ramasamy, Shaona Ghosh, Sharath Turuvekere Sreenivas, Shelby Thomas, Shizhe Diao, Shreya Gopal, Shrimai Prabhumoye, Shubham Toshniwal, Shuoyang Ding, Siddharth Singh, Siddhartha Jain, Somshubra Majumdar, Soumye Singhal, Stefania Alborghetti, Syeda Nahida Akter, Terry Kong, Tim Moon, Tomasz Hliwiak, Tomer Asida, Tony Wang, Tugrul Konuk, Twinkle Vashishth, Tyler Poon, Udi Karpas, Vahid Noroozi, Venkat Srinivasan, Vijay Korthikanti, Vikram Fugro, Vineeth Kalluru, Vitaly Kurin, Vitaly Lavrukhin, Wasi Uddin Ahmad, Wei Du, Wonmin Byeon, Ximing Lu, Xin Dong, Yashaswi Karnati, Yejin Choi, Yian Zhang, Ying Lin, Yonggan Fu, Yoshi Suhara, Zhen Dong, Zhiyu Li, Zhongbo Zhu, Zijia Chen

We introduce Nemotron-Nano-9B-v2, a hybrid Mamba-Transformer language model designed to increase throughput for reasoning workloads while achieving state-of-the-art accuracy compared to similarly-sized models. Nemotron-Nano-9B-v2 builds on the Nemotron-H architecture, in which the majority of the self-attention layers in the common Transformer architecture are replaced with Mamba-2 layers, to achieve improved inference speed when generating the long thinking traces needed for reasoning. We create Nemotron-Nano-9B-v2 by first pre-training a 12-billion-parameter model (Nemotron-Nano-12B-v2-Base) on 20 trillion tokens using an FP8 training recipe. After aligning Nemotron-Nano-12B-v2-Base, we employ the Minitron strategy to compress and distill the model with the goal of enabling inference on up to 128k tokens on a single NVIDIA A10G GPU (22GiB of memory, bfloat16 precision). Compared to existing similarly-sized models (e.g., Qwen3-8B), we show that Nemotron-Nano-9B-v2 achieves on-par or better accuracy on reasoning benchmarks while achieving up to 6x higher inference throughput in reasoning settings like 8k input and 16k output tokens. We are releasing Nemotron-Nano-9B-v2, Nemotron-Nano-12B-v2-Base, and Nemotron-Nano-9B-v2-Base checkpoints along with the majority of our pre- and post-training datasets on Hugging Face.

Paper and Project Links

PDF

Summary

Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model designed to raise throughput for reasoning workloads while matching or beating the accuracy of similarly-sized models. Building on the Nemotron-H architecture, most self-attention layers are replaced with Mamba-2 layers, speeding up inference when generating the long thinking traces reasoning requires. The model starts from a 12-billion-parameter base (Nemotron-Nano-12B-v2-Base) pre-trained on 20 trillion tokens with an FP8 recipe, then is aligned and compressed/distilled with the Minitron strategy so that inference over up to 128k tokens fits on a single NVIDIA A10G GPU (22 GiB, bfloat16). Compared with similarly-sized models such as Qwen3-8B, it reaches on-par or better accuracy on reasoning benchmarks with up to 6x higher inference throughput in settings like 8k input and 16k output tokens.

Key Takeaways

  1. Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model designed for higher reasoning throughput and accuracy.
  2. It builds on the Nemotron-H architecture, replacing most self-attention layers with Mamba-2 layers for faster inference.
  3. The base model is pre-trained on 20 trillion tokens using an FP8 training recipe.
  4. Minitron-style compression and distillation enable inference over up to 128k tokens on a single NVIDIA A10G GPU.
  5. It matches or beats similarly-sized models on reasoning benchmarks.
  6. It achieves up to 6x higher inference throughput, particularly with long inputs and outputs.

Cool Papers

Click here to view paper screenshots

ThinkTuning: Instilling Cognitive Reflections without Distillation

Authors:Aswin RRV, Jacob Dineen, Divij Handa, Md Nayem Uddin, Mihir Parmar, Chitta Baral, Ben Zhou

Recent advances in test-time scaling have led to the emergence of thinking LLMs that exhibit self-reflective behaviors and multi-step reasoning. While RL drives this self-improvement paradigm, a recent study (Gandhi et al., 2025) shows that RL alone does not truly instill these new reasoning abilities - it merely draws out behaviors already present in the base models. This raises a question: How can we train the models that don’t exhibit such thinking behavior to develop it in the first place? To this end, we propose ThinkTuning, a GRPO-based interactive training approach where we augment the rollouts of a student model with the guidance from a teacher model. A simple idea from classroom practice inspires our method: a teacher poses a problem, lets the student try an answer, then gives corrective feedback – enough to point the mind in the right direction and then show the solution. Each piece of feedback reshapes the student’s thoughts, leading them to arrive at the correct solution. Similarly, we find that this type of implicit supervision through feedback from a teacher model of the same size improves the reasoning capabilities of the student model. In particular, on average, our method shows a 3.85% improvement over zero-shot baselines across benchmarks, and on MATH-500, AIME and GPQA-Diamond it shows 2.08%, 2.23% and 3.99% improvements over the vanilla-GRPO baseline. Source code is available at https://github.com/3rdAT/ThinkTuning.

Paper and Project Links

PDF EMNLP 2025 (Main Conference)

Summary

Advances in test-time scaling have produced thinking LLMs with self-reflective, multi-step reasoning, but Gandhi et al. (2025) show that RL alone does not instill new reasoning abilities; it only draws out behaviors already present in the base model. ThinkTuning asks how to develop such behaviors in models that lack them: it is a GRPO-based interactive training approach that augments a student model's rollouts with guidance from a same-size teacher model, mimicking classroom practice where a teacher poses a problem, lets the student try, and gives corrective feedback. This implicit supervision through feedback improves the student's reasoning: on average the method beats zero-shot baselines by 3.85% across benchmarks, and improves over the vanilla-GRPO baseline by 2.08% on MATH-500, 2.23% on AIME, and 3.99% on GPQA-Diamond.

Key Takeaways

  1. Test-time scaling has produced thinking LLMs that exhibit self-reflection and multi-step reasoning.
  2. RL alone does not instill new reasoning abilities; it only elicits behaviors already present in the base model.
  3. ThinkTuning is a GRPO-based interactive training approach that uses guidance from a teacher model to develop a student model's reasoning (a rollout sketch follows).
  4. Inspired by classroom practice, the teacher's corrective feedback steers the student toward correct solutions.
  5. Implicit supervision via feedback from a same-size teacher model improves the student's reasoning ability.
  6. On average, ThinkTuning improves over zero-shot baselines by 3.85% across benchmarks.
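
A minimal sketch of the classroom-style interaction that ThinkTuning builds its rollouts from: the student attempts, the teacher injects corrective feedback, and the student retries; feedback-conditioned rollouts like these are what GRPO-style training would then reward. All interfaces below are toy assumptions.

```python
from typing import Callable, Dict

def feedback_rollout(problem: str, answer: str,
                     student: Callable[[str], str],
                     teacher_feedback: Callable[[str, str], str]) -> Dict:
    """Produce one guided rollout: attempt -> feedback -> revised attempt.
    Rollouts like these carry implicit supervision for GRPO-style training."""
    attempt = student(problem)
    if attempt.strip() == answer:
        return {"trace": [attempt], "reward": 1.0}
    hint = teacher_feedback(problem, attempt)          # corrective nudge
    retry = student(f"{problem}\nFeedback: {hint}")
    return {"trace": [attempt, hint, retry],
            "reward": 1.0 if retry.strip() == answer else 0.0}

# Toy stand-ins for the student and same-size teacher models.
student = lambda p: "12" if "Feedback" in p else "10"
teacher = lambda p, a: "Recheck the carry in the addition."
print(feedback_rollout("What is 5 + 7?", "12", student, teacher))
```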

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: All articles on this blog are licensed under CC BY 4.0 unless otherwise stated. Please credit Kedreamix when reposting!