发布日期: 2025-10-03

更新日期: 2025-11-27

文章字数: 22k

阅读时长: 90 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-10-03 更新

Probing the Critical Point (CritPt) of AI Reasoning: a Frontier Physics Research Benchmark

Authors:Minhui Zhu, Minyang Tian, Xiaocheng Yang, Tianci Zhou, Penghao Zhu, Eli Chertkov, Shengyan Liu, Yufeng Du, Lifan Yuan, Ziming Ji, Indranil Das, Junyi Cao, Yufeng Du, Jinchen He, Yifan Su, Jiabin Yu, Yikun Jiang, Yujie Zhang, Chang Liu, Ze-Min Huang, Weizhen Jia, Xinan Chen, Peixue Wu, Yunkai Wang, Juntai Zhou, Yong Zhao, Farshid Jafarpour, Jessie Shelton, Aaron Young, John Bartolotta, Wenchao Xu, Yue Sun, Anjun Chu, Victor Colussi, Chris Akers, Nathan Brooks, Wenbo Fu, Christopher Wilson, Jinchao Zhao, Marvin Qi, Anqi Mu, Yubo Yang, Allen Zang, Yang Lyu, Peizhi Mai, Xuefei Guo, Luyu Gao, Ze Yang, Chi Xue, Dmytro Bandak, Yaïr Hein, Yonatan Kahn, Kevin Zhou, John Drew Wilson, Jarrod T. Reilly, Di Luo, Daniel Inafuku, Hao Tong, Liang Yang, Ruixing Zhang, Xueying Wang, Ofir Press, Nicolas Chia, Eliu Huerta, Hao Peng

While large language models (LLMs) with reasoning capabilities are progressing rapidly on high-school math competitions and coding, can they reason effectively through complex, open-ended challenges found in frontier physics research? And crucially, what kinds of reasoning tasks do physicists want LLMs to assist with? To address these questions, we present the CritPt (Complex Research using Integrated Thinking - Physics Test, pronounced “critical point”), the first benchmark designed to test LLMs on unpublished, research-level reasoning tasks that broadly covers modern physics research areas, including condensed matter, quantum physics, atomic, molecular & optical physics, astrophysics, high energy physics, mathematical physics, statistical physics, nuclear physics, nonlinear dynamics, fluid dynamics and biophysics. CritPt consists of 71 composite research challenges designed to simulate full-scale research projects at the entry level, which are also decomposed to 190 simpler checkpoint tasks for more fine-grained insights. All problems are newly created by 50+ active physics researchers based on their own research. Every problem is hand-curated to admit a guess-resistant and machine-verifiable answer and is evaluated by an automated grading pipeline heavily customized for advanced physics-specific output formats. We find that while current state-of-the-art LLMs show early promise on isolated checkpoints, they remain far from being able to reliably solve full research-scale challenges: the best average accuracy among base models is only 4.0% , achieved by GPT-5 (high), moderately rising to around 10% when equipped with coding tools. Through the realistic yet standardized evaluation offered by CritPt, we highlight a large disconnect between current model capabilities and realistic physics research demands, offering a foundation to guide the development of scientifically grounded AI tools.

随着具有推理能力的大型语言模型（LLMs）在高中数学竞赛和编程方面迅速进步，它们是否能有效地应对前沿物理研究中遇到的复杂、开放性的挑战？关键的是，物理学家希望LLMs辅助完成哪些推理任务？为了回答这些问题，我们推出了CritPt（利用综合思维进行复杂研究——物理测试，俗称“临界点”），这是第一个针对未发布的研究级推理任务设计的基准测试。它广泛覆盖了现代物理研究领域，包括凝聚态物理、量子力学、原子、分子和光学物理、天体物理、高能物理、数学物理、统计物理、核物理、非线性动力学、流体力学和生物物理学。CritPt由71个复合研究挑战组成，旨在模拟入门级全尺度研究项目，同时分解为190个更简单的检查点任务，以获取更精细的见解。所有问题均由50多名活跃的物理研究人员基于他们自己的研究全新创建。每个问题都是手工筛选的，以得出难以猜测且机器可验证的答案，并由针对高级物理特定输出格式进行大量定制的自动评分管道进行评估。我们发现，虽然当前最先进的大型语言模型在单独的检查点上显示出早期的潜力，但它们仍然远远不能可靠地解决全研究规模的挑战：在基础模型中，最佳平均准确率仅为4.0%，由GPT-5（高级版）实现，当配备编码工具时，准确率适度提高到约10%。通过CritPt提供的现实且标准化的评估，我们突出了当前模型能力与现实物理研究需求之间的巨大差距，为开发科学基础的AI工具提供了指导基础。

论文及项目相关链接

PDF 39 pages, 6 figures, 6 tables

Summary

大型语言模型（LLMs）在物理研究领域的推理能力正受到挑战。为测试LLMs在未公开的研究级推理任务上的表现，推出了CritPt评估工具，涵盖现代物理研究领域。LLMs在单独的检查点上有早期承诺，但在解决全面的研究级挑战上仍有很大差距。

Key Takeaways

大型语言模型（LLMs）正在参与物理研究领域的推理任务。
CritPt是首个针对LLMs的评估工具，模拟真实的研究级物理推理任务。
CritPt包含71个复合研究挑战和190个简单的检查点任务。
问题由50多名活跃的物理研究员基于自身研究创建，手工编制以确保答案的客观性和机器可验证性。
当前最先进的LLMs在解决全面的研究级挑战上仍有很大差距，基础模型的平均准确率仅为4.0%，配备编码工具后提高到约10%。
CritPt评估揭示了当前模型能力与真实物理研究需求之间的巨大差距。

Cool Papers

点此查看论文截图

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

Authors:Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucinations and expose reasoning traces. However, many KG-RAG systems compose multiple LLM modules (e.g planning, reasoning, and responding), inflating inference cost and binding behavior to a specific target KG. To address this, we introduce KG-R1, an agentic KG retrieval-augmented generation (KG-RAG) framework through reinforcement learning (RL). KG-R1 utilizes a single agent that interacts with KGs as its environment, learning to retrieve at each step and incorporating the retrieved information into its reasoning and generation. The process is optimized through end-to-end RL. In controlled experiments across Knowledge-Graph Question Answering (KGQA) benchmarks, our method demonstrates both efficiency and transferability: Using Qwen-2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use larger foundation or fine-tuned models. Furthermore, KG-R1 enables plug and play: after training, it maintains strong accuracy on new KGs without modification. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at https://github.com/Jinyeop3110/KG-R1.

知识图谱检索增强生成（KG-RAG）技术将大型语言模型（LLM）与结构化的、可验证的知识图谱（KGs）相结合，以减少幻觉并暴露推理痕迹。然而，许多KG-RAG系统由多个LLM模块（如规划、推理和响应）组成，增加了推理成本，并且与特定的目标知识图谱紧密绑定。为了解决这个问题，我们引入了通过强化学习（RL）的KG-R1智能知识图谱检索增强生成（KG-RAG）框架。KG-R1使用一个智能体，它与知识图谱环境进行交互，学习在每个步骤中进行检索，并将检索到的信息融入其推理和生成过程中。该过程通过端到端的RL进行优化。在知识图谱问答（KGQA）基准测试上的受控实验表明，我们的方法既有效率又具备可迁移性：使用Qwen-2.5-3B，KG-R1在提高答案准确性的同时，生成的标记数量少于使用更大基础模型或微调模型的多模块工作流程方法。此外，KG-R1实现了即插即用：训练后，它在新的知识图谱上无需修改就能保持高度的准确性。这些特性使KG-R1成为有前景的知识图谱检索增强生成框架，有望用于实际部署。我们的代码已公开在https://github.com/Jinyeop3110/KG-R1。

论文及项目相关链接

PDF 10 pages, 5 figures. Submitted to ICLR 2026

Summary

KG-R1是一种基于强化学习（RL）的知识图谱检索增强生成（KG-RAG）框架。它使用一个智能体与环境（即知识图谱）进行交互，学习在每一步进行检索，并将检索到的信息融入其推理和生成过程中。KG-R1解决了传统KG-RAG系统存在的问题，如推理模块过多导致推理成本高昂以及行为绑定于特定目标知识图谱的问题。在知识图谱问答（KGQA）基准测试中的实验表明，KG-R1具有高效性和可转移性，能提高答案准确性并减少生成令牌数量。此外，经过训练后，KG-R1可以在新知识图谱上保持高精度而无需修改，使其成为有前景的KG-RAG框架，适用于实际部署。

Key Takeaways

KG-R1是一个基于强化学习的知识图谱检索增强生成框架。
KG-R1使用一个智能体来与知识图谱环境进行交互。
KG-R1解决了传统KG-RAG系统推理模块过多、推理成本高昂的问题。
KG-R1框架具有优化端到端强化学习的能力。
在知识图谱问答基准测试中，KG-R1展现了高效性和可转移性。
KG-R1能提高答案准确性并减少生成令牌数量。

Cool Papers

点此查看论文截图

Interactive Learning for LLM Reasoning

Authors:Hehai Lin, Shilei Cao, Minzhi Li, Sudong Wang, Haotian Wu, Linyi Yang, Juepeng Zheng, Chengwei Qin

Existing multi-agent learning approaches have developed interactive training environments to explicitly promote collaboration among multiple Large Language Models (LLMs), thereby constructing stronger multi-agent systems (MAS). However, during inference, they require re-executing the MAS to obtain final solutions, which diverges from human cognition that individuals can enhance their reasoning capabilities through interactions with others and resolve questions independently in the future. To investigate whether multi-agent interaction can enhance LLMs’ independent problem-solving ability, we introduce ILR, a novel co-learning framework for MAS that integrates two key components: Dynamic Interaction and Perception Calibration. Specifically, Dynamic Interaction first adaptively selects either cooperative or competitive strategies depending on question difficulty and model ability. LLMs then exchange information through Idea3 (Idea Sharing, Idea Analysis, and Idea Fusion), an innovative interaction paradigm designed to mimic human discussion, before deriving their respective final answers. In Perception Calibration, ILR employs Group Relative Policy Optimization (GRPO) to train LLMs while integrating one LLM’s reward distribution characteristics into another’s reward function, thereby enhancing the cohesion of multi-agent interactions. We validate ILR on three LLMs across two model families of varying scales, evaluating performance on five mathematical benchmarks and one coding benchmark. Experimental results show that ILR consistently outperforms single-agent learning, yielding an improvement of up to 5% over the strongest baseline. We further discover that Idea3 can enhance the robustness of stronger LLMs during multi-agent inference, and dynamic interaction types can boost multi-agent learning compared to pure cooperative or competitive strategies.

现有的多智能体学习方法已经构建了交互式训练环境，以明确促进多个大型语言模型（LLM）之间的协作，从而构建更强大的多智能体系统（MAS）。然而，在推理过程中，它们需要重新执行MAS来获得最终解决方案，这与人类认知不符。人类个体可以通过与他人互动增强自己的推理能力，并在未来独立解决问题。为了调查多智能体互动是否能增强LLM的独立解决问题的能力，我们引入了ILR，这是一种新型的多智能体协同学习框架，它集成了两个关键组成部分：动态交互和感知校准。具体而言，动态交互首先根据问题的难度和模型能力自适应地选择合作或竞争策略。然后，LLM通过模仿人类讨论的创新交互模式Idea3（思想共享、思想分析和思想融合）交换信息，然后得出各自的最终答案。在感知校准中，ILR采用群体相对策略优化（GRPO）来训练LLM，同时将一个LLM的奖励分布特性集成到另一个LLM的奖励函数中，从而提高多智能体交互的凝聚力。我们在两个不同规模的模型家族中的三个LLM上验证了ILR的有效性，通过五个数学基准测试和一个编码基准测试评估性能。实验结果表明，ILR始终优于单智能体学习，与最强基线相比提高了高达5%。我们还发现Idea3可以增强更强LLM在多智能体推理过程中的稳健性，与纯粹的合作或竞争策略相比，动态交互类型可以促进多智能体学习。

论文及项目相关链接

PDF The code will be released later

摘要

现有的多智能体学习方案通过构建交互训练环境来明确促进多个大型语言模型间的协作，进而构建更强大的多智能体系统。然而，在推理过程中，这些系统需要再次执行整个多智能体系统来获得最终解决方案，这与人类认知相悖。为了研究多智能体交互是否能增强大型语言模型的独立解决问题的能力，我们引入了ILR，这是一种新型多智能体系统的共学习框架，包括动态交互和感知校准两个关键组件。ILR能够自适应地选择合作或竞争策略，通过模仿人类讨论的创新交互模式Idea3（思想共享、思想分析和思想融合）来交换信息，并推导出各自的最终答案。在感知校准中，ILR采用群体相对策略优化方法，将一个大型语言模型的奖励分布特性集成到另一个的奖励函数中，增强了多智能体交互的凝聚力。我们在三个不同规模的大型语言模型上验证了ILR的有效性，实验结果表明，ILR在五个数学基准测试和一个编码基准测试中均优于单智能体学习方案，并最多提高了5%的性能。此外，还发现Idea3能增强强大型语言模型在多智能体推理中的稳健性，而动态交互类型相比于单纯的合作或竞争策略更能促进多智能体学习。

关键见解

多智能体学习通过构建交互训练环境促进大型语言模型间的协作。
现有方案在推理时需要重新执行整个多智能体系统，与人类独立解决问题的能力不同步。
ILR框架引入动态交互和感知校准两大组件，模仿人类讨论模式以提高大型语言模型的独立问题解决能力。
ILR自适应选择合作或竞争策略，通过Idea3交换信息并推导出答案。
感知校准中的群体相对策略优化增强了多智能体交互的凝聚力。
ILR在多个大型语言模型上的实验验证显示其性能优于单智能体学习方案。

Cool Papers

点此查看论文截图

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

Authors:Tuan Nguyen, Naseem Khan, Khang Tran, NhatHai Phan, Issa Khalil

The rapid rise of synthetic media has made deepfake detection a critical challenge for online safety and trust. Progress remains constrained by the scarcity of large, high-quality datasets. Although multimodal large language models (LLMs) exhibit strong reasoning capabilities, their performance on deepfake detection is poor, often producing explanations that are misaligned with visual evidence or hallucinatory. To address this limitation, we introduce a reasoning-annotated dataset for deepfake detection and propose Paragraph-level Relative Policy Optimization (PRPO), a reinforcement learning algorithm that aligns LLM reasoning with image content at the paragraph level. Experiments show that PRPO improves detection accuracy by a wide margin and achieves the highest reasoning score of 4.55/5.0. Ablation studies further demonstrate that PRPO significantly outperforms GRPO under test-time conditions. These results underscore the importance of grounding multimodal reasoning in visual evidence to enable more reliable and interpretable deepfake detection.

合成媒体的迅速崛起使得深度伪造检测成为网络安全和信任的关键挑战。由于大型高质量数据集的稀缺，进展仍然受到限制。尽管多模态大型语言模型（LLM）展现出强大的推理能力，但在深度伪造检测方面的表现却很差，经常产生与视觉证据不符或虚幻的解释。为了解决这一局限性，我们引入了深度伪造检测推理注释数据集，并提出了段落级相对策略优化（PRPO），这是一种强化学习算法，能够在段落级别将LLM推理与图像内容对齐。实验表明，PRPO在检测准确率上有显著提高，并达到了4.55/5.0的最高推理分数。消融研究进一步证明，在测试条件下PRPO显著优于GRPO。这些结果强调，以视觉证据为基础的多模态推理对于实现更可靠和可解释的深度伪造检测至关重要。

论文及项目相关链接

PDF

Summary
多媒体内容的泛滥使得深度伪造技术检测成为网络安全的重要挑战。目前大型多模态语言模型在深度伪造检测方面存在性能瓶颈，其解释与视觉证据常存在偏差或幻想。为解决此问题，我们推出深度伪造检测推理标注数据集，并提出段落级相对策略优化方法（PRPO），一种强化学习算法，使语言模型推理与图像内容相符。实验表明，PRPO显著提高检测准确率，达到最高推理分数4.55/5。

Key Takeaways

深度伪造技术检测的重要性随着多媒体内容的增长而提升。
大型多模态语言模型在深度伪造检测方面存在性能问题，其解释常与实际视觉证据不符或产生幻想。
为改善语言模型在深度伪造检测中的表现，推出了深度伪造检测推理标注数据集。
提出段落级相对策略优化（PRPO）方法，一种强化学习算法，旨在使语言模型的推理与图像内容相符。
实验结果显示PRPO能显著提高深度伪造检测的准确性。
PRPO的推理分数高于其他方法，达到最高分数4.55/5。

Cool Papers

点此查看论文截图

Reinforced Strategy Optimization for Conversational Recommender Systems via Network-of-Experts

Authors:Xiaoyan Zhao, Ming Yan, Yang Zhang, Yang Deng, Jian Wang, Fengbin Zhu, Yilun Qiu, Hong Cheng, Tat-Seng Chua

Conversational Recommender Systems (CRSs) aim to provide personalized recommendations through multi-turn natural language interactions with users. Given the strong interaction and reasoning skills of Large Language Models (LLMs), leveraging LLMs for CRSs has recently emerged as a promising direction. However, existing LLM-based methods often lack explicit optimization of interaction strategies, instead relying on unified prompts and the LLM’s internal knowledge to decide how to interact, which can lead to suboptimal outcomes. In this paper, we propose a novel Reinforced Strategy Optimization (RSO) method for CRS, which decomposes the process of generating strategy-driven response decisions into the macro-level strategy planning and micro-level strategy adaptation through a network-of-experts architecture. At the macro level, a Planner expert selects macro-level interaction strategies (e.g., recommend, explain, encourage). At the micro level, an Actor expert generates detailed responses conditioned on the selected macro-level strategy, guided by auxiliary experts that provide complementary information such as user preferences and factual grounding. This hierarchical decomposition disentangles the optimization of different sub-tasks involved in CRS response generation, enabling more tractable learning at each level. To address the scarcity of high-quality multi-turn training data, we formulate strategy learning as a reinforcement learning problem, guided by an LLM-based reward model to achieve automatic strategy exploration. Extensive experiments show that RSO significantly improves interaction performance compared to state-of-the-art baselines, demonstrating the effectiveness of explicit hierarchical strategy optimization for CRS.

对话推荐系统（CRS）旨在通过多轮自然语言与用户交互提供个性化推荐。鉴于大型语言模型（LLM）的强大交互和推理能力，利用LLM进行CRS最近已成为一个有前途的方向。然而，现有的基于LLM的方法通常缺乏明确的优化交互策略，而是依赖于统一的提示和LLM的内部知识来决定如何交互，这可能导致次优结果。在本文中，我们提出了一种用于CRS的新型强化策略优化（RSO）方法，它将生成策略驱动响应决策的过程分解为宏观层面的策略规划和微观层面的策略适应，通过专家网络架构实现。在宏观层面，规划专家选择宏观级别的交互策略（例如推荐、解释、鼓励）。在微观层面，根据选定的宏观策略，行为专家生成详细的响应，并由提供补充信息的辅助专家引导，例如用户偏好和事实依据。这种层次分解解决了涉及CRS响应生成的不同子任务的优化问题，使每个级别的学习更加易于处理。为了解决高质量多轮训练数据的稀缺问题，我们将策略学习制定为强化学习问题，由基于LLM的奖励模型指导，实现自动策略探索。大量实验表明，与最新基线相比，RSO显著提高了交互性能，证明了显式分层策略优化对CRS的有效性。

论文及项目相关链接

PDF

Summary：

本文提出一种针对对话推荐系统（CRS）的强化策略优化（RSO）方法。该方法通过专家网络架构，将生成策略驱动响应决策的过程分解为宏观策略规划和微观策略适应。宏观层面由规划专家选择交互策略，微观层面由行动专家根据选择的宏观策略生成详细响应。为解决高质量多轮训练数据稀缺问题，将策略学习制定为强化学习问题，以LLM为基础的奖励模型实现自动策略探索。实验表明，RSO方法显著提高了交互性能。

Key Takeaways：

对话推荐系统（CRS）旨在通过与用户的多轮自然语言交互提供个性化推荐。
大型语言模型（LLM）在CRS中的利用已成为一个研究热点，但现有方法缺乏明确的优化交互策略。
本文提出一种新型的强化策略优化（RSO）方法，将生成响应决策的过程分解为宏观策略规划和微观策略适应。
宏观层面的规划专家负责选择交互策略，如推荐、解释、鼓励等。
微观层面的行动专家根据所选宏观策略生成详细响应，受辅助专家提供用户偏好和事实依据等信息指导。
该方法将策略学习制定为强化学习问题，解决高质量多轮训练数据稀缺问题。

Cool Papers

点此查看论文截图

Unspoken Hints: Accuracy Without Acknowledgement in LLM Reasoning

Authors:Arash Marioriyad, Shaygan Adim, Nima Alighardashi, Mahdieh Soleymani Banghshah, Mohammad Hossein Rohban

Large language models (LLMs) increasingly rely on chain-of-thought (CoT) prompting to solve mathematical and logical reasoning tasks. Yet, a central question remains: to what extent are these generated rationales \emph{faithful} to the underlying computations, rather than post-hoc narratives shaped by hints that function as answer shortcuts embedded in the prompt? Following prior work on hinted vs.\ unhinted prompting, we present a systematic study of CoT faithfulness under controlled hint manipulations. Our experimental design spans four datasets (AIME, GSM-Hard, MATH-500, UniADILR), two state-of-the-art models (GPT-4o and Gemini-2-Flash), and a structured set of hint conditions varying in correctness (correct and incorrect), presentation style (sycophancy and data leak), and complexity (raw answers, two-operator expressions, four-operator expressions). We evaluate both task accuracy and whether hints are explicitly acknowledged in the reasoning. Our results reveal three key findings. First, correct hints substantially improve accuracy, especially on harder benchmarks and logical reasoning, while incorrect hints sharply reduce accuracy in tasks with lower baseline competence. Second, acknowledgement of hints is highly uneven: equation-based hints are frequently referenced, whereas raw hints are often adopted silently, indicating that more complex hints push models toward verbalizing their reliance in the reasoning process. Third, presentation style matters: sycophancy prompts encourage overt acknowledgement, while leak-style prompts increase accuracy but promote hidden reliance. This may reflect RLHF-related effects, as sycophancy exploits the human-pleasing side and data leak triggers the self-censoring side. Together, these results demonstrate that LLM reasoning is systematically shaped by shortcuts in ways that obscure faithfulness.

大型语言模型（LLMs）越来越依赖链式思维（CoT）提示来解决数学和逻辑推理任务。然而，一个核心问题仍然存在：这些生成的依据在多大程度上是忠于底层计算的，而不是由提示所塑造的事后叙事，这些提示作为嵌入在提示中的答案捷径？在关于提示与非提示方法的前期研究之后，我们对受控提示操作下的CoT忠实度进行了系统研究。我们的实验设计涵盖了四个数据集（AIME、GSM-Hard、MATH-500、UniADILR）、两个最先进的模型（GPT-4o和Gemini-2-Flash），以及一套结构化的提示条件，这些条件在正确性（正确和错误）、呈现风格（奉承和数据泄露）和复杂性（原始答案、两个运算符表达式、四个运算符表达式）上有所不同。我们评估了任务准确性和模型在推理中是否明确承认提示。我们的研究结果揭示了三个关键发现。首先，正确的提示会显著提高准确性，特别是在更难的本基准测试和逻辑推理中，而错误的提示会大幅度降低基线能力较低的任务的准确性。其次，对提示的认可存在严重的不平衡：基于方程的提示经常被引用，而原始提示往往被静默采纳，这表明更复杂的提示促使模型在推理过程中公开依赖。第三，呈现风格很重要：奉承提示鼓励明显的认可，而泄露风格的提示虽然提高了准确性，但促进了隐藏的依赖。这可能反映了RLHF相关的效应，因为奉承利用人类愉悦的一面，而数据泄露触发了自我审查的一面。总的来说，这些结果表明，LLM推理受到捷径的系统性影响，这些捷径掩盖了忠实性。

论文及项目相关链接

PDF 5 Pages, 4 Figures, 4 Tables

Summary

这篇论文研究了大型语言模型（LLMs）在解决数学和逻辑推理任务时，对于思维链（CoT）提示的依赖程度。实验设计包括四个数据集、两个最先进的模型，以及一系列的正确性、呈现风格和复杂程度各异的提示条件。研究发现，正确的提示可以显著提高准确率，尤其是针对难度较大的基准测试和逻辑推理；而错误的提示则会在任务基础能力较低的情况下大幅度降低准确率。此外，模型对提示的承认程度不均衡，方程式的提示经常被提及，而原始答案的提示往往静默采用。呈现风格也至关重要，讨好的提示鼓励明显的承认，而泄露风格的提示虽然能提高准确率，但更容易让人依赖而不自知。这些结果揭示了LLM推理在一定程度上受到捷径的影响，导致忠实度降低。

Key Takeaways

大型语言模型（LLMs）在解决数学和逻辑任务时依赖思维链（CoT）提示。
正确提示显著提高准确率，尤其在难度大的基准测试和逻辑推理中。
错误提示显著降低任务基础能力较低时的准确率。
模型对不同类型的提示有不同的承认程度：方程提示常提及，原始答案的提示静默采用。
提示的呈现风格影响模型的反应：讨好风格的提示鼓励明显的承认，泄露风格的提示促进隐性依赖。
LLM推理受捷径影响，导致忠实度降低。

Cool Papers

点此查看论文截图

R-Log: Incentivizing Log Analysis Capability in LLMs via Reasoning-based Reinforcement Learning

Authors:Yilun Liu, Ziang Chen, Song Xu, Minggui He, Shimin Tao, Weibin Meng, Yuming Xie, Tao Han, Chunguang Zhao, Jingzhou Du, Daimeng Wei, Shenglin Zhang, Yongqian Sun

The growing complexity of log data in modern software systems has prompted the use of Large Language Models (LLMs) for automated log analysis. Current approaches typically rely on direct supervised fine-tuning (SFT) on log-label pairs. However, this exacerbates the domain discrepancy between general-purpose LLMs and specialized log data, causing overfitting. Furthermore, SFT’s imbalanced loss computation often allows lengthy contexts to overwhelm critical, concise details in model answers, leading to hallucinations. To address these limitations, we propose R-Log, a novel reasoning-based paradigm that mirrors the structured, step-by-step analytical process of human engineers. This approach enhances generalizability by learning the underlying rules behind conclusions. We further employ Reinforcement Learning (RL) to optimize the model within a simulated O&M environment, thereby reducing hallucinations by directly rewarding correct outcomes. R-Log is first cold-started on a curated dataset of 2k+ reasoning trajectories, guided by 13 strategies from manual O&M practices, to establish an initial reasoning capability. This ability is then refined via RL using a joint reward function. Empirical evaluations on real-world logs show that R-Log outperforms existing methods across five log analysis tasks, particularly in unseen scenarios (by 228.05%). We also designed R-Log-fast with 5x speedup while keeping 93% of the efficacy.

随着现代软件系统中日志数据的日益复杂性，大型语言模型（LLM）被广泛应用于自动化日志分析。当前的方法通常依赖于日志标签对的直接监督微调（SFT）。然而，这加剧了通用LLM和专用日志数据之间的领域差异，导致过拟合。此外，SFT的不平衡损失计算经常使冗长的上下文掩盖了模型答案中的关键简洁细节，从而导致幻觉。为了解决这些局限性，我们提出了R-Log这一基于推理的新范式，它反映了人类工程师的结构化、逐步分析过程。这种方法通过了解结论背后的基本规则来提高通用性。我们还采用强化学习（RL）在模拟的运维管理环境中优化模型，通过直接奖励正确结果来减少幻觉。R-Log首先在一个精选的包含超过2000个推理轨迹的数据集上进行冷启动，该数据集由手动运维实践的13种策略指导，以建立初步的推理能力。然后，使用联合奖励函数通过RL进行精细化。在真实世界日志上的实证研究结果表明，R-Log在五个日志分析任务上的表现均优于现有方法，特别是在未见场景中的表现提高了228.05%。我们还设计了具有5倍加速的R-Log-fast，同时保持了93%的有效性。

论文及项目相关链接

PDF

Summary

基于现代软件系统中日志数据的复杂性增长，研究者提出使用大型语言模型（LLMs）进行自动日志分析。当前方法通常依赖于直接在日志标签对上进行监督微调（SFT），但这种方法加剧了通用LLMs和特定日志数据之间的领域差异，导致过拟合。为解决这些问题，本文提出了R-Log这一新型基于推理的方法，模拟人工工程师的逐步分析过程，提高模型通用性并减少幻觉。通过强化学习（RL）在模拟的运维环境中优化模型，并首次使用冷启动在超过2k个推理轨迹的数据集上建立初步推理能力。在真实世界日志上的实证评估显示，R-Log在五个日志分析任务上优于现有方法，特别是在未见场景中的性能提升达228.05%。同时，还设计了快速版本的R-Log，保持93%效能的同时实现5倍加速。

Key Takeaways

大型语言模型（LLMs）被应用于现代软件系统的自动日志分析，应对其复杂性。
当前方法通过直接在日志标签对上进行监督微调（SFT），存在领域差异和过拟合问题。
R-Log是一种新型基于推理的方法，模拟人工工程师的分析步骤，提高模型通用性。
R-Log使用强化学习（RL）在模拟的运维环境中优化，减少幻觉。
R-Log首次通过冷启动在包含超过2k个推理轨迹的数据集上建立初步推理能力。
在真实世界日志上的评估显示，R-Log在多个日志分析任务上表现优异，特别是未见场景。

Cool Papers

点此查看论文截图

Mem-α: Learning Memory Construction via Reinforcement Learning

Authors:Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, Xiaojian Wu

Large language model (LLM) agents are constrained by limited context windows, necessitating external memory systems for long-term information understanding. Current memory-augmented agents typically depend on pre-defined instructions and tools for memory updates. However, language models may lack the ability to determine which information to store, how to structure it, and when to update it, especially as memory systems become more complex. This results in suboptimal memory construction and information loss. To this end, we propose Mem-alpha, a reinforcement learning framework that trains agents to effectively manage complex memory systems through interaction and feedback. We also construct a specialized training dataset spanning diverse multi-turn interaction patterns paired with comprehensive evaluation questions designed to teach effective memory management. During training, agents process sequential information chunks, learn to extract and store relevant content, then update the memory system. The reward signal derives from downstream question-answering accuracy over the full interaction history, directly optimizing for memory construction. To illustrate the effectiveness of our training framework, we design a memory architecture comprising core, episodic, and semantic components, equipped with multiple tools for memory operations. Empirical evaluation demonstrates that Mem-alpha achieves significant improvements over existing memory-augmented agent baselines. Despite being trained exclusively on instances with a maximum length of 30k tokens, our agents exhibit remarkable generalization to sequences exceeding 400k tokens, over 13x the training length, highlighting the robustness of Mem-alpha.

大型语言模型（LLM）代理受限于有限的上下文窗口，需要外部记忆系统进行长期信息理解。当前的增强记忆代理通常依赖于预先定义的指令和工具进行记忆更新。然而，语言模型可能缺乏判断存储哪些信息、如何构建结构、何时进行更新的能力，特别是随着记忆系统变得越来越复杂。这会导致记忆构建不佳和信息丢失。为此，我们提出了Mem-alpha，这是一个强化学习框架，通过交互和反馈来训练代理有效地管理复杂的记忆系统。我们还构建了一个特殊训练数据集，涵盖了多种多轮交互模式，并配有全面的评估问题，以教授有效的内存管理。在训练过程中，代理处理序列信息块，学习提取和存储相关内容，然后更新记忆系统。奖励信号来源于下游问答的准确性，这是基于整个交互历史的，直接优化内存构建。为了说明我们训练框架的有效性，我们设计了一种记忆架构，包括核心、情境和语义组件，并配备了多种记忆操作工具。经验评估表明，Mem-alpha在现有的增强记忆代理基准测试上取得了显著改进。尽管仅在最长不超过30k令牌的情况下进行训练，我们的代理在超过40万令牌的序列中表现出了显著的泛化能力，这是训练长度的13倍以上，突显了Mem-alpha的稳健性。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）受限于上下文窗口，需要外部记忆系统进行长期信息理解。当前记忆增强型代理人通常依赖于预先定义的指令和工具进行记忆更新，但语言模型可能在确定存储哪些信息、如何构建和更新方面存在缺陷，尤其是在记忆系统变得更复杂时。这导致了记忆构建不佳和信息丢失。为解决这些问题，我们提出了Mem-alpha强化学习框架，通过交互和反馈训练代理人有效地管理复杂记忆系统。我们还构建了一个特殊训练数据集，涵盖了多样化的多轮交互模式，并设计了全面的评估问题，以教授有效的内存管理。在训练过程中，代理人处理序列信息块，学习提取和存储相关内容，然后更新记忆系统。奖励信号来源于下游问答准确率的全程互动历史记录，直接优化内存构建。我们设计了一种包括核心、情景和语义组件的内存架构，配备了多种内存操作工具。经验评估表明，Mem-alpha在现有记忆增强型代理人基线方面取得了显著改进。即使在仅对最长不超过3万令牌的实例进行训练的情况下，我们的代理人在超过4万令牌的序列上也展现出了显著的泛化能力。

Key Takeaways

大型语言模型受限于上下文窗口，需要外部记忆系统进行长期信息理解。
当前记忆增强型代理人依赖预定义指令和工具进行记忆更新。
语言模型在决定存储哪些信息以及如何构建和更新方面存在挑战。
Mem-alpha强化学习框架训练代理人有效管理复杂记忆系统。
通过交互和反馈，Mem-alpha优化了内存构建。
Mem-alpha通过特殊训练数据集和全面的评估问题来提高内存管理效率。

Cool Papers

点此查看论文截图

MuSLR: Multimodal Symbolic Logical Reasoning

Authors:Jundong Xu, Hao Fei, Yuhui Zhang, Liangming Pan, Qijun Huang, Qian Liu, Preslav Nakov, Min-Yen Kan, William Yang Wang, Mong-Li Lee, Wynne Hsu

Multimodal symbolic logical reasoning, which aims to deduce new facts from multimodal input via formal logic, is critical in high-stakes applications such as autonomous driving and medical diagnosis, as its rigorous, deterministic reasoning helps prevent serious consequences. To evaluate such capabilities of current state-of-the-art vision language models (VLMs), we introduce the first benchmark MuSLR for multimodal symbolic logical reasoning grounded in formal logical rules. MuSLR comprises 1,093 instances across 7 domains, including 35 atomic symbolic logic and 976 logical combinations, with reasoning depths ranging from 2 to 9. We evaluate 7 state-of-the-art VLMs on MuSLR and find that they all struggle with multimodal symbolic reasoning, with the best model, GPT-4.1, achieving only 46.8%. Thus, we propose LogiCAM, a modular framework that applies formal logical rules to multimodal inputs, boosting GPT-4.1’s Chain-of-Thought performance by 14.13%, and delivering even larger gains on complex logics such as first-order logic. We also conduct a comprehensive error analysis, showing that around 70% of failures stem from logical misalignment between modalities, offering key insights to guide future improvements. All data and code are publicly available at https://llm-symbol.github.io/MuSLR.

面向自动驾驶和医疗诊断等高风险应用的多模态符号逻辑推理（Multimodal Symbolic Logical Reasoning）至关重要，旨在通过形式逻辑从多模态输入中推导出新事实。其严谨且确定的推理有助于预防严重后果。为了评估当前最先进的视觉语言模型（VLMs）在这些能力方面的表现，我们推出了基于形式逻辑规则的第一个多模态符号逻辑推理基准测试MuSLR。MuSLR包含7个领域的1093个实例，包括35个原子符号逻辑和976个逻辑组合，推理深度从2到9不等。我们在MuSLR上评估了7种最先进的VLMs，发现它们在多模态符号推理方面都遇到了困难，表现最好的GPT-4.1也仅达到46.8%。因此，我们提出了LogiCAM框架，该框架应用形式逻辑规则来处理多模态输入，提升了GPT-4.1的Chain-of-Thought性能达14.13%，并在一阶逻辑等复杂逻辑上取得了更大的进步。我们还进行了全面的错误分析，发现约70%的错误源于模态之间的逻辑不一致性，这为未来的改进提供了关键指导。所有数据和代码均可在https://llm-symbol.github.io/MuSLR公开访问。

论文及项目相关链接

PDF Accepted by NeurIPS 2025

Summary

本文介绍了多模态符号逻辑推理的重要性，特别是在自动驾驶和医疗诊断等高风险应用中的作用。为评估当前最先进的视觉语言模型（VLMs）的多模态符号逻辑推理能力，推出了首个基准测试MuSLR。评价结果显示，现有模型在多模态符号逻辑推理方面存在困难，于是提出了LogiCAM框架，该框架应用形式逻辑规则对多模态输入进行处理，提高了GPT-4.1的性能。错误分析表明，约70%的错误源于不同模态之间的逻辑不匹配，这为未来的改进提供了关键指导。

Key Takeaways

多模态符号逻辑推理在高风险应用如自动驾驶和医疗诊断中至关重要。
推出了首个评估视觉语言模型的多模态符号逻辑推理能力的基准测试MuSLR。
当前最先进的视觉语言模型在多模态符号逻辑推理方面存在困难。
LogiCAM框架通过应用形式逻辑规则提高多模态输入处理性能。
LogiCAM框架提高了GPT-4.1的推理性能，特别是在复杂逻辑如一阶逻辑方面。
错误分析显示，约70%的错误源于不同模态之间的逻辑不匹配。

Cool Papers

点此查看论文截图

RL-Guided Data Selection for Language Model Finetuning

Authors:Animesh Jha, Harshit Gupta, Ananjan Nandi

Data selection for finetuning Large Language Models (LLMs) can be framed as a budget-constrained optimization problem: maximizing a model’s downstream performance under a strict training data budget. Solving this problem is generally intractable, and existing approximate approaches are pretraining-oriented and transfer poorly to the fine-tuning setting. We reformulate this problem as a tractable Markov Decision Process (MDP) and train agents using various Reinforcement Learning (RL) methods to learn optimal data selection policies, guided by an efficient, proxy-model-based reward signal. Across four datasets, training on a $5%$ subset selected by our approach matches or outperforms fine-tuning on the full dataset by up to $10.8$ accuracy points, while cutting wall-clock training time by up to $2 \times$, highlighting the promise of RL-guided data selection.

针对微调大型语言模型（LLM）的数据选择可构建为一个预算约束优化问题：在严格的训练数据预算下，最大化模型的下游性能。解决这个问题通常是棘手的，现有的近似方法以预训练为导向，在微调设置中的迁移效果较差。我们将这个问题重新表述为一个可行的马尔可夫决策过程（MDP），并使用各种强化学习（RL）方法来训练代理，以学习最佳数据选择策略，由一个高效的基于代理模型的奖励信号来指导。在四个数据集上，使用我们的方法所选的5%子集进行训练，与在完整数据集上进行微调的效果相匹敌或表现得更好，准确度提高高达10.8个百分点，同时将训练时间减少了一半，这凸显了由强化学习引导的数据选择的潜力。

论文及项目相关链接

PDF To appear in NeurIPS 2025 Constrained Optimization for ML Workshop

Summary

在严格的训练数据预算下，如何最大化大型语言模型（LLM）的下游性能成为了一个重要的预算约束优化问题。为了解决这个问题，文章提出了一种基于强化学习（RL）的方法，将这个问题重新表述为一个可行的马尔可夫决策过程（MDP）。使用各种RL方法训练代理，通过高效的基于代理模型的奖励信号来指导，学习最佳数据选择策略。在四个数据集上，使用此方法选择的5%子集进行训练，与全数据集进行微调的效果相匹配或更好，准确度提高了高达10.8个点，同时节省了高达两倍的实际训练时间。

Key Takeaways

数据选择对于微调大型语言模型（LLM）至关重要，可视为预算约束优化问题。
现有方法偏向于预训练，在微调设置中表现不佳。
文章将问题重新表述为马尔可夫决策过程（MDP），使其变得可行。
使用强化学习（RL）方法训练代理，以学习最佳数据选择策略。
基于代理模型的奖励信号来指导数据选择过程。
在四个数据集上的实验结果显示，使用此方法选择的子集进行训练，效果与全数据集微调相匹配或更好。

Cool Papers

点此查看论文截图

More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

Authors:Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, Jing Zhang

Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Through Reinforcement Learning (RL), typically Group Relative Policy Optimization (GRPO), these models are able to solve complex tasks such as mathematics and code generation. Building on these advances, recent research has sought to extend reasoning to Vision-Language Models (VLMs), yielding promising results across diverse visual tasks. Despite this progress, our study uncovers the dual nature of multimodal reasoning: while it substantially enhances logical inference and facilitates performance on challenging problems, it may gradually impair perceptual grounding, leading to recognition failures on otherwise basic visual questions. Through further analysis, we attribute this phenomenon to visual forgetting, wherein prolonged reasoning causes the model to increasingly disregard visual input. To address this, we propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories. Our result model, VAPO-Thinker-7B, significantly strengthens the model’s reliance on visual information and achieves new state-of-the-art results on a wide range of established benchmarks. Project page: https://xytian1008.github.io/VAPO/

推理已经成为大型语言模型（LLM）中的关键能力。通过强化学习（RL），通常是群体相对策略优化（GRPO），这些模型能够解决数学和代码生成等复杂任务。在这些进展的基础上，最近的研究试图将推理扩展到视觉语言模型（VLM），在不同的视觉任务中取得了令人鼓舞的结果。尽管取得了进展，但我们的研究发现多模态推理具有双重性质：虽然它大大提高了逻辑推理能力，有助于解决具有挑战性的问题，但它可能会逐渐损害感知定位能力，导致对基本视觉问题的识别失败。通过进一步分析，我们将这种现象归因于视觉遗忘，长时间的推理会导致模型越来越忽视视觉输入。为了解决这一问题，我们提出了视觉锚定策略优化（VAPO），这是一种简单有效的方法，能够明确引导推理过程走向视觉轨迹。我们的结果模型VAPO-Thinker-7B大大加强了模型对视觉信息的依赖，并在广泛的基准测试中实现了新的最先进结果。项目页面：https://xytian1008.github.io/VAPO/。

论文及项目相关链接

PDF

Summary
大型语言模型（LLM）通过强化学习（RL）和集团相对策略优化（GRPO）等方法，展现出强大的推理能力，可以完成数学和代码生成等复杂任务。近期研究尝试将推理能力扩展到视觉语言模型（VLM），并在多种视觉任务中取得优异表现。然而，本研究发现多模态推理存在双重性质：虽然能提高逻辑推断能力并解决难题，但可能逐渐损害感知定位能力，导致对基本视觉问题的识别失败。视觉遗忘是导致这一现象的关键因素，长期推理会使模型越来越忽视视觉输入。为解决这一问题，本研究提出视觉锚定策略优化（VAPO），通过明确引导推理过程走向视觉轨迹，有效强化模型对视觉信息的依赖，并在广泛建立的基准测试上取得最新最先进的成果。

Key Takeaways

大型语言模型通过强化学习和集团相对策略优化展现出强大的推理能力。
视觉语言模型在多模态推理方面取得优异表现。
多模态推理存在双重性质：提高逻辑推断能力的同时可能损害感知定位能力。
视觉遗忘是长期推理中模型忽视视觉输入的关键因素。
视觉锚定策略优化（VAPO）能有效强化模型对视觉信息的依赖。

Cool Papers

点此查看论文截图

Chain-in-Tree: Back to Sequential Reasoning in LLM Tree Search

Authors:Xinzhe Li

Test-time scaling enables large language models (LLMs) to improve performance on long-horizon reasoning tasks by allocating additional compute at inference. Tree-search-based approaches achieve state-of-the-art results in this setting, but they are notoriously inefficient, often an order of magnitude slower than simpler iterative methods. We introduce Chain-in-Tree (CiT), a plug-in framework that adaptively decides when to branch during search rather than branching at every step. CiT relies on lightweight Branching Necessity (BN) evaluation methods: BN-DP (Direct Prompting), where an auxiliary LLM directly judges whether a step requires branching, and BN-SC (Self-Consistency), which clusters multiple candidate actions to estimate agreement. We integrate CiT into three representative LLM-in-the-loop tree search frameworks: Tree of Thoughts (ToT-BS), ReST-MCTS, and RAP, and evaluate across GSM8K and Math500. Our results show that: (1) BN-DP consistently reduces token generation, model invocations, and runtime by 75-85 percent across all settings, with negligible accuracy loss and sometimes accuracy gains; (2) BN-SC typically yields substantial savings (up to 80 percent) but shows instability in 1-4 out of 14 settings, caused by a small subset of examples that produce very long reasoning steps; (3) the quality of auxiliary LLMs is critical, not only the BN evaluator in BN-DP, but also the models used in BN-SC for clustering and equivalence checking. When these roles are filled by smaller LLMs, performance degrades. Importantly, BN-SC does not require LLMs in domains with deterministic action spaces, where clustering can be done programmatically. We also provide a theoretical guarantee that BN-DP never increases LLM invocations relative to the baseline and release a unified implementation of CiT across ToT-BS, ReST-MCTS, and RAP to facilitate reproducibility and extension.

测试时缩放功能使得大型语言模型（LLM）能够在推理任务上提高性能，通过在推理时分配额外的计算资源。基于树搜索的方法在这个设置下达到了最好的结果，但它们以效率闻名不佳，通常比更简单的迭代方法慢一个数量级。我们引入了Chain-in-Tree（CiT），这是一种插件框架，它自适应地决定何时进行分支搜索，而不是每一步都进行分支。CiT依赖于轻量级的Branching Necessity（BN）评估方法：BN-DP（直接提示），其中辅助LLM直接判断一个步骤是否需要分支；以及BN-SC（自我一致性），它通过对多个候选动作进行聚类来估计一致性。我们将CiT集成到三个具有代表性的LLM循环树搜索框架中：思维树（ToT-BS）、ReST-MCTS和RAP，并在GSM8K和Math500上进行了评估。我们的结果表明：（1）BN-DP在所有设置中一致地减少了令牌生成、模型调用和运行时，减少了75-85%，并且几乎没有损失准确性，有时甚至能提高准确性；（2）BN-SC通常可以产生巨大的节省（高达80%），但在14个设置中的1-4个表现出不稳定，这是由于一小部分例子产生了非常长的推理步骤；（3）辅助LLM的质量至关重要，不仅是BN-DP中的BN评估器，还包括用于聚类和等价检查的BN-SC中的模型。当这些角色由较小的LLM扮演时，性能会下降。重要的是，BN-SC在具有确定性动作空间的领域中不需要LLM，这些领域中可以通过编程进行聚类。我们还提供了理论保证，BN-DP相对于基准线不会增加LLM的调用次数，并发布了在ToT-BS、ReST-MCTS和RAP上的CiT的统一实现，以促进可重复性和扩展性。

论文及项目相关链接

PDF Under Review

Summary

本文介绍了针对大型语言模型（LLM）在推理任务中效率低下的问题，提出了一种名为Chain-in-Tree（CiT）的插件框架。该框架能够自适应地决定何时进行分支搜索，而不是在每一步都进行分支。文章介绍了两种分支必要性（BN）评估方法：BN-DP（直接提示）和BN-SC（自我一致性）。通过集成CiT到三种代表性的LLM循环树搜索框架中，实验结果表明BN-DP能大幅降低标记生成、模型调用和运行时成本，同时保持或提高准确性；BN-SC在大部分情况下能节省大量计算资源，但在部分情况下表现出不稳定性。辅助LLM的质量对性能至关重要。

Key Takeaways

Test-time scaling通过分配额外的计算资源在推理时提高了大型语言模型（LLM）在长周期推理任务上的性能。
Tree-search-based方法虽然在此设置中获得最佳结果，但效率极低，通常比简单的迭代方法慢一个数量级。
Chain-in-Tree (CiT)是一种插件框架，能够自适应决定何时进行分支搜索，而不是在每一步都进行分支，提高了效率。
CiT使用两种Branching Necessity (BN)评估方法：BN-DP（直接提示）和BN-SC（自我一致性）。
BN-DP能大幅降低标记生成、模型调用和运行时成本，同时保持或提高准确性。
BN-SC在大部分情况下能节省大量计算资源，但在特定情况下可能表现出不稳定。
辅助LLM的质量对CiT框架的性能至关重要。

Cool Papers

点此查看论文截图

Dolphin v1.0 Technical Report

Authors:Taohan Weng, Chi zhang, Chaoran Yan, Siya Liu, Xiaoyang Liu, Yalun Wu, Boyang Wang, Boyan Wang, Jiren Ren, Kaiwen Yan, Jinze Yu, Kaibing Hu, Henan Liu, Haoyun Zheng, Zhenyu Liu, Duo Zhang, Xiaoqing Guo, Anjie Le, Hongcheng Guo

Ultrasound is crucial in modern medicine but faces challenges like operator dependence, image noise, and real-time scanning, hindering AI integration. While large multimodal models excel in other medical imaging areas, they struggle with ultrasound’s complexities. To address this, we introduce Dolphin v1.0 (V1) and its reasoning-augmented version, Dolphin R1-the first large-scale multimodal ultrasound foundation models unifying diverse clinical tasks in a single vision-language framework.To tackle ultrasound variability and noise, we curated a 2-million-scale multimodal dataset, combining textbook knowledge, public data, synthetic samples, and general corpora. This ensures robust perception, generalization, and clinical adaptability.The Dolphin series employs a three-stage training strategy: domain-specialized pretraining, instruction-driven alignment, and reinforcement-based refinement. Dolphin v1.0 delivers reliable performance in classification, detection, regression, and report generation. Dolphin R1 enhances diagnostic inference, reasoning transparency, and interpretability through reinforcement learning with ultrasound-specific rewards.Evaluated on U2-Bench across eight ultrasound tasks, Dolphin R1 achieves a U2-score of 0.5835-over twice the second-best model (0.2968) setting a new state of the art. Dolphin v1.0 also performs competitively, validating the unified framework. Comparisons show reasoning-enhanced training significantly improves diagnostic accuracy, consistency, and interpretability, highlighting its importance for high-stakes medical AI.

超声在现代医学中至关重要，但面临着操作者依赖、图像噪声和实时扫描等挑战，阻碍了人工智能的融合。虽然在其他医学影像领域，大型多模式模型表现出色，但在应对超声的复杂性方面，它们却表现挣扎。为了解决这一问题，我们推出了Dolphin v1.0（V1）及其增强推理版本Dolphin R1——这是第一个统一了多种临床任务的大型超声多模式基础模型，在一个视觉语言框架内。为了应对超声的变性和噪声问题，我们筛选了一个规模达两百万的多模式数据集，结合了教科书知识、公开数据、合成样本和一般语料库。这确保了稳健的感知、推广和临床适应性。Dolphin系列采用了一种三阶段培训策略：专业领域的预训练、指令驱动的调整和基于强化学习的细化。Dolphin v1.0在分类、检测、回归和报告生成方面表现出可靠的性能。Dolphin R1通过强化学习使用超声特定奖励增强诊断推理、透明度和解释性。在U2-Bench上对八个超声任务进行评估，Dolphin R1的U2分数为0.5835，是第二名模型（得分0.2968）的两倍多，创造了新的技术记录。Dolphin v1.0也表现出竞争力，验证了统一框架的有效性。对比显示，增强推理训练显著提高了诊断的准确性、一致性和解释性，突显其在高风险医疗人工智能中的重要性。

论文及项目相关链接

PDF

Summary：

超声在现代医学中至关重要，但面临操作依赖、图像噪声和实时扫描等挑战，阻碍人工智能的集成。大型多模态模型在其他医学影像领域表现出色，但在应对超声复杂性方面却表现不佳。为解决这个问题，我们推出了Dolphin v1.0及其增强推理版Dolphin R1。它们是首个大规模超声多模态基础模型，在一个统一的视觉语言框架内融合了多样的临床任务。通过整合包含教科书知识、公开数据、合成样本和一般语料库的200万规模多模态数据集，解决了超声变性和噪声问题，确保了稳健的感知、推广和临床适应性。Dolphin系列采用三阶段培训策略：领域专业化预训练、指令驱动对齐和强化学习精炼。Dolphin v1.0在分类、检测、回归和报告生成方面表现出可靠性能。Dolphin R1通过强化学习使用超声特定奖励提高诊断推理的透明度、可解释性和准确性。在U2-Bench上的评估显示，Dolphin R1的U2分数为0.5835，是第二名模型（0.2968）的两倍多，创下了新的世界纪录。

Key Takeaways：

超声在现代医学中扮演重要角色，但面临操作依赖、图像噪声和实时扫描等挑战。
大型多模态模型在超声领域存在复杂性挑战，需要特定的解决方案。
Dolphin v1.0及其增强版Dolphin R1是首个统一临床任务的大规模超声多模态基础模型。
通过整合多模态数据集解决了超声变性和噪声问题。
Dolphin系列采用三阶段培训策略，确保模型的稳健性和性能。
Dolphin R1通过强化学习提高了诊断推理的透明度、可解释性和准确性。

Cool Papers

点此查看论文截图

DeepSearch: Overcome the Bottleneck of Reinforcement Learning with Verifiable Rewards via Monte Carlo Tree Search

Authors:Fang Wu, Weihao Xuan, Heli Qi, Ximing Lu, Aaron Tu, Li Erran Li, Yejin Choi

Although RLVR has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge following thousands of optimization steps, demonstrating notable decreases in performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practices, where models rely on limited rollouts that often miss critical reasoning paths and fail to provide systematic coverage of the solution space. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which leads to diminishing performance improvements over prolonged training steps. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) selection with entropy-based guidance that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state-of-the-art for 1.5B reasoning models - using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.

尽管强化学习验证器（RLVR）已成为在大型语言模型（LLM）中开发高级推理能力的重要组成部分，但当代研究已经记录了随着优化步骤的增加而出现的训练高原现象，这显示出尽管计算投入增加，但性能提升却显著下降。这一局限性源于当前RLVR实践中固有的稀疏探索模式，模型依赖于有限的展开，常常错过关键的推理路径，并且未能系统地覆盖解决方案空间。我们提出了DeepSearch框架，它将蒙特卡洛树搜索直接集成到RLVR训练中。与现有仅在推理时依赖树搜索的方法不同，DeepSearch将结构化搜索嵌入到训练循环中，从而能够在推理步骤之间进行系统的探索和精细的信用分配。通过训练时的探索，DeepSearch解决了根本性的探索不足问题，从而克服了长时间的训练步骤导致的性能改进减少。我们的贡献包括：（1）全局前沿选择策略，优先搜索树中有望的节点；（2）基于熵的引导进行选择，以确定监督的信心路径；（3）使用解决方案缓存来提高效率的自适应回放缓冲区训练。在数学推理基准测试上的实验表明，DeepSearch达到了62.95％的平均准确率，为1.5B推理模型建立了新的技术领先地位，与扩展训练方法相比，GPU小时数减少了5.7倍。这些结果强调了战略探索比粗暴扩展更重要，并展示了算法创新在推进RLVR方法论方面的前景。DeepSearch通过系统搜索而不是长期计算来建立扩展推理能力的新方向。

论文及项目相关链接

PDF

Summary
在LLM领域，尽管RLVR已成为培养高级推理能力的关键组件，但当代研究记录显示，在数千步优化后会出现性能提升停滞的情况。这主要是因为当前RLVR实践中的探索模式稀疏，模型依赖有限的试运行，常常错过关键的推理路径且未能系统地覆盖解空间。本研究提出DeepSearch框架，它将蒙特卡洛树搜索直接集成到RLVR训练中。不同于仅在推理阶段依赖树搜索的现有方法，DeepSearch将结构化搜索嵌入到训练循环中，实现了推理步骤的系统性探索和精细化的信用分配。通过训练时的探索，DeepSearch解决了根本性的探索不足问题，从而解决了长时间训练性能提升不明显的问题。我们的贡献包括：（1）全局前沿选择策略，优先搜索树中有前途的节点；（2）基于熵的引导策略，确定有信心的路径进行监督；（3）具有解决方案缓存功能的自适应回放缓冲区训练以提高效率。在数学推理基准测试上，DeepSearch达到了平均准确率62.95%，并为规模为1.5B的推理模型创造了新的里程碑，其使用的GPU小时数比延长训练的方法减少了5.7倍。这些结果强调了策略性探索优于简单规模扩张的重要性，并证明了算法创新在推进RLVR方法上的潜力。DeepSearch为通过系统化搜索提高推理能力开辟了新的方向。

Key Takeaways

当前RLVR实践面临性能提升停滞的问题，主要由于探索模式稀疏，导致错过关键推理路径和有限的解空间覆盖。
DeepSearch框架集成了蒙特卡洛树搜索到RLVR训练中，实现系统性探索和推理步骤的精细化信用分配。
DeepSearch通过训练时的探索解决根本性的探索不足问题，提高长时间训练的性能。
全局前沿选择策略、基于熵的引导策略和自适应回放缓冲区训练是DeepSearch的主要贡献。
在数学推理基准测试上，DeepSearch表现出色，创造了新的准确率记录，并提高了效率。
实验结果强调策略性探索优于简单规模扩张，突出了算法创新在RLVR方法中的重要性。

Cool Papers

点此查看论文截图

RADAR: Reasoning-Ability and Difficulty-Aware Routing for Reasoning LLMs

Authors:Nigel Fernandez, Branislav Kveton, Ryan A. Rossi, Andrew S. Lan, Zichao Wang

Reasoning language models have demonstrated remarkable performance on many challenging tasks in math, science, and coding. Choosing the right reasoning model for practical deployment involves a performance and cost tradeoff at two key levels: model size and reasoning budget, where larger models and higher reasoning budget lead to better performance but with increased cost and latency. In this work, we tackle this tradeoff from the angle of model configuration routing for different queries, and present RADAR (Reasoning-Ability and Difficulty-Aware Routing), a lightweight, interpretable, and scalable routing framework. Inspired by psychometrics, RADAR learns an item response model from model responses with different budgets to different queries, with interpretable parameters including query difficulties and model-budget abilities. RADAR then routes queries with higher difficulty to model-budget pairs with higher ability, and vice versa. We conduct extensive experiments on 8 widely used challenging reasoning benchmarks, demonstrating the superior performance of RADAR compared to state-of-the-art model routing methods. RADAR also exhibits query generalization capabilities, showing strong performance on out-of-distribution queries in all benchmarks. RADAR is also scalable and can efficiently integrate additional models by dynamically selecting a small set of evaluation queries to estimate their abilities.

推理语言模型在数学、科学和编程等领域的许多具有挑战性的任务中表现出卓越的性能。为实际部署选择合适的推理模型需要在两个关键层面上进行性能和成本的权衡：模型大小和推理预算。较大的模型和更高的推理预算会带来更好的性能，但也会增加成本和延迟。在这项工作中，我们从不同查询的模型配置路由的角度来解决这一权衡问题，并提出了RADAR（基于推理能力和难度感知的路由）。它是一个轻便、可解释和可扩展的路由框架。RADAR受到心理测量的启发，从具有不同预算的模型对不同查询的响应中学习项目响应模型，其中可解释的参数包括查询难度和模型预算能力。然后，RADAR将难度较高的查询路由到具有较高能力的模型预算对，反之亦然。我们在8个广泛使用的挑战性推理基准测试上进行了大量实验，证明了RADAR相比于最新前沿的模型路由方法具有卓越的性能。RADAR还具有查询泛化能力，在所有基准测试中对于离群查询也表现出强大的性能。此外，RADAR具有可扩展性，可以通过动态选择一小部分评估查询来估计其能力，从而有效地集成额外的模型。

论文及项目相关链接

PDF

Summary

该文介绍了推理语言模型在实际部署时面临性能与成本的权衡问题，特别是在模型大小和推理预算方面。文章提出了一种基于模型配置路由的解决策略，即RADAR（含推理能力与难度感知的路由）。RADAR借鉴心理测量学，从模型对不同查询的响应中学习项目反应模型，涉及可解释的查询难度和模型预算能力参数。通过广泛实验，RADAR在8个挑战性的推理基准测试中表现出卓越性能，相较于最新模型路由方法具有优势。此外，RADAR还具有查询泛化能力，在所有基准测试中对离群查询表现出强劲性能，并可动态选择少量评估查询来整合额外模型，展现其可扩展性。

Key Takeaways

推理语言模型在实际部署中面临性能与成本的权衡问题，特别是在模型大小和推理预算上需要选择合理的平衡点。
文章提出了RADAR框架，这是一种基于模型配置路由的解决策略，旨在提高推理性能并降低部署成本。
RADAR框架借鉴心理测量学原理，通过模型对不同查询的响应来学习项目反应模型。
RADAR具有可解释性，包括查询难度和模型预算能力等参数。
在多个挑战性的推理基准测试中，RADAR表现出卓越性能，相较于其他最新模型路由方法具有优势。
RADAR具有查询泛化能力，适应于各种环境下的查询需求。

Cool Papers

点此查看论文截图

Mechanisms of Matter: Language Inferential Benchmark on Physicochemical Hypothesis in Materials Synthesis

Authors:Yingming Pu, Tao Lin, Hongyu Chen

The capacity of Large Language Models (LLMs) to generate valid scientific hypotheses for materials synthesis remains largely unquantified, hindered by the absence of benchmarks probing physicochemical logics reasoning. To address this, we introduce MatterMech, a benchmark for evaluating LLM-generated hypotheses across eight nanomaterial synthesis domains. Our analysis reveals a critical disconnect: LLMs are proficient in abstract logic yet fail to ground their reasoning in fundamental physicochemical principles. We demonstrate that our proposed principle-aware prompting methodology substantially outperforms standard Chain-of-Thought, enhancing both hypothesis accuracy and computational efficiency. This work provides a methodological framework to advance LLMs toward reliable scientific hypothesis generation in materials science. The MatterMech benchmark and associated code is publicly available at \href{https://github.com/amair-lab/MatterMech}{GitHub}.

大型语言模型（LLM）在生成有效的科学假设以进行材料合成方面的能力尚未得到量化，由于缺乏探究物理化学逻辑推理的基准测试是阻碍其发展的原因之一。为了解决这一问题，我们引入了MatterMech，这是一个用于评估LLM在八个纳米材料合成领域生成假设的基准测试。我们的分析揭示了一个关键差距：LLM虽然擅长抽象逻辑，但未能将其推理建立在基本的物理化学原理之上。我们证明，所提出的基于原理的提示方法显著优于标准思维链，提高了假设准确性和计算效率。这项工作为推进LLM在材料科学中生成可靠科学假设的方法论框架提供了基础。MatterMech基准测试和相关代码已在GitHub上公开发布（\href{https://github.com/amair-lab/MatterMech}{链接}）。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在生成材料合成科学假设方面的能力尚未得到充分量化，这受到缺乏探索物理化学逻辑推理基准的限制。为解决这一问题，我们推出了MatterMech基准测试，用于评估LLM在八个纳米材料合成领域的假设生成能力。分析表明，LLM虽擅长抽象逻辑，但在将推理与基本物理化学原理相结合方面存在明显不足。我们证明，提出的原理感知提示方法显著优于标准的思维链方法，提高了假设的准确性和计算效率。这项工作为推进LLM在材料科学中生成可靠科学假设的方法论框架提供了基础。MatterMech基准测试和相关代码已公开，可在GitHub上获取。

Key Takeaways

大型语言模型（LLMs）在生成材料合成科学假设方面的能力尚未充分量化。
缺乏基准测试来探索LLMs在物理化学逻辑推理方面的能力。
推出了MatterMech基准测试，用于评估LLM在纳米材料合成领域的假设生成能力。
LLM虽然擅长抽象逻辑，但在将推理与基本物理化学原理结合方面存在不足。
原理感知提示方法显著提高了假设生成的准确性和计算效率。
该工作为LLM在材料科学中生成可靠科学假设提供了方法论框架。

Cool Papers

点此查看论文截图

MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

Authors:Huihao Jing, Wenbin Hu, Hongyu Luo, Jianhui Yang, Wei Fan, Haoran Li, Yangqiu Song

Multi-agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real-world legal situations. Furthermore, we manually design various role-based MAS and conduct extensive experiments using different state-of-the-art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.

多智能体系统（MAS）借助大型语言模型（LLM）的卓越能力，在应对复杂任务方面显示出巨大潜力。在此背景下，将MAS与法律任务集成是一个关键步骤。虽然之前的研究已经为LLM代理开发了法律基准测试，但没有一个是特别考虑MAS的独特优势，如任务分解、代理专业化和灵活的训练。事实上，缺乏评估方法限制了MAS在法律领域的潜力。为了弥补这一空白，我们提出了MASLegalBench，这是一个专为MAS定制的法律基准测试，并采用了演绎推理的方法进行设计。我们的基准测试以GDPR为应用场景，包含丰富的背景知识，涵盖有效的推理过程，充分反映了现实法律情境的复杂性。此外，我们还手动设计了各种基于角色的MAS，并使用不同的最新LLM进行了广泛的实验。我们的结果突显了现有模型和MAS架构的优点、局限性以及潜在的改进方向。

论文及项目相关链接

PDF

Summary
多智能体系统结合大型语言模型在解决复杂任务上具有巨大潜力。针对当前法律领域中缺乏专门针对多智能体系统的评估方法的问题，提出了MASLegalBench法律基准测试，该测试以GDPR为应用场景，涵盖丰富的背景知识和复杂的推理过程，有效反映真实法律情境的复杂性。并通过实验评估了不同先进语言模型在多智能体系统中的表现。

Key Takeaways

多智能体系统结合大型语言模型在解决复杂任务上的潜力巨大。
当前法律领域缺乏专门针对多智能体系统的评估方法。
MASLegalBench是一个针对多智能体系统的法律基准测试，以GDPR为应用场景。
MASLegalBench测试涵盖丰富的背景知识和复杂的推理过程，反映真实法律情境的复杂性。
通过实验评估了不同先进语言模型在多智能体系统中的表现。
结果揭示了现有模型和多智能体系统架构的优势、局限性和潜在的改进方向。

Cool Papers

点此查看论文截图

From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning

Authors:Yunyao Zhang, Xinglang Zhang, Junxi Sheng, Wenbing Li, Junqing Yu, Wei Yang, Zikai Song

Logical reasoning is a fundamental capability of large language models (LLMs). However, existing studies largely overlook the interplay between logical complexity and semantic complexity, resulting in methods that struggle to address challenging scenarios involving abstract propositions, ambiguous contexts, and conflicting stances, which are central to human reasoning. For this gap, we propose LogicAgent, a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity. LogicAgent explicitly performs multi-perspective deduction in first-order logic (FOL), while mitigating vacuous reasoning through existential import checks that incorporate a three-valued decision scheme (True, False, Uncertain) to handle boundary cases more faithfully. Furthermore, to overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty (FKGL = 11.94) and exhibits substantially greater lexical and structural diversity than prior benchmarks. RepublicQA is grounded in philosophical concepts, featuring abstract propositions and systematically organized contrary and contradictory relations, making it the most semantically rich resource for evaluating logical reasoning. Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines, and generalizes effectively to mainstream logical reasoning benchmarks including ProntoQA, ProofWriter, FOLIO, and ProverQA, achieving an additional 7.05% average gain. These results highlight the strong effectiveness of our semiotic-grounded multi-perspective reasoning in boosting LLMs’ logical performance.

逻辑推理是大型语言模型（LLM）的基本能力。然而，现有研究大多忽视了逻辑复杂性与语义复杂性之间的相互作用，导致现有方法难以应对涉及抽象命题、模糊上下文和冲突立场的挑战场景，而这些是人类推理的核心。为了弥补这一差距，我们提出了LogicAgent，这是一个以符号平方为指导的框架，旨在共同解决逻辑复杂性和语义复杂性。LogicAgent明确地进行一阶逻辑（FOL）的多视角推理，同时通过存在性导入检查来缓解空洞推理，采用三值决策方案（真、假、不确定）以更真实地处理边界情况。此外，为了克服现有数据集语义简单、逻辑复杂性低的问题，我们引入了RepublicQA，这是一个难度达到大学水平（FKGL=11.94）的基准测试，与先前的基准测试相比，它在词汇和结构多样性方面要大得多。RepublicQA以哲学概念为基础，包含抽象命题，以及有组织性的相反和矛盾关系，成为评估逻辑推理方面最语义丰富的资源。实验表明，LogicAgent在RepublicQA上实现了最先进的性能，比强基线平均提高了6.25%，并在主流的逻辑推理基准测试（包括ProntoQA、ProofWriter、FOLIO和ProverQA）上实现了有效的泛化，平均提高了额外的7.05%。这些结果突显了我们的符号化多视角推理在提升LLM的逻辑性能方面的强大有效性。

论文及项目相关链接

PDF

Summary：

大型语言模型（LLM）的逻辑推理能力是一项基本功能。然而，现有研究忽视了逻辑复杂性和语义复杂性之间的相互作用，导致它们在处理涉及抽象命题、模糊上下文和冲突立场等人类推理核心的挑战场景时表现不佳。为此，我们提出了LogicAgent框架，它采用半符号平方引导法，旨在联合解决逻辑复杂性和语义复杂性。LogicAgent明确执行一阶逻辑（FOL）的多角度推理，并通过存在性导入检查来缓解空洞推理，该检查采用三值决策方案（真、假、不确定）以更真实地处理边界情况。此外，为了克服现有数据集语义简单、逻辑复杂度低的问题，我们引入了RepublicQA基准测试，其难度达到大学水平（FKGL=11.94），并且在词汇和结构多样性方面大大超过了以前的基准测试。RepublicQA以哲学概念为基础，包含抽象命题和系统组织的相反和矛盾关系，成为评估逻辑推理的最语义丰富的资源。实验表明，LogicAgent在RepublicQA上实现了最先进的性能，平均优于强基线6.25%，并在主流逻辑推理基准测试（包括ProntoQA、ProofWriter、FOLIO和ProverQA）上实现了额外的7.05%的平均增幅。这些结果强烈证明了我们的半符号引导的多角度推理的强大有效性。

Key Takeaways：

大型语言模型（LLM）的逻辑推理能力重要，但现有研究忽视了逻辑复杂性和语义复杂性的相互作用。
LogicAgent框架通过半符号平方引导法联合解决逻辑复杂性和语义复杂性。
LogicAgent执行多角度的一阶逻辑（FOL）推理，并采用三值决策方案处理边界情况。
RepublicQA基准测试难度高，包含抽象命题和矛盾关系，是评估逻辑推理的最语义丰富的资源。
LogicAgent在RepublicQA上表现最佳，平均优于其他模型6.25%。
LogicAgent在多个主流逻辑推理基准测试上都表现出优异的性能，平均增益7.05%。

Cool Papers

点此查看论文截图

LLaDA-MoE: A Sparse MoE Diffusion Language Model

Authors:Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen

We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation reveals that LLaDA-MoE achieves state-of-the-art performance among diffusion language models with larger parameters, surpassing previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent and alignment tasks, despite using fewer active parameters. Our results show that integrating a sparse MoE architecture into the training objective of masked diffusion language models still brings out MoE’s strengths under efficient inference with few active parameters, and opens ample room for further exploration of diffusion language models. LLaDA-MoE models are available at Huggingface.

我们介绍了LLaDA-MoE，这是一个采用专家混合（MoE）架构的大型语言扩散模型，在大约20T令牌上进行从头训练。LLaDA-MoE以7B参数的容量维持，同时在推理时仅激活1.4B参数，实现了具有竞争力的性能并显著减少了计算开销。我们的经验评估表明，LLaDA-MoE在多个基准测试中实现了最先进的扩散语言模型性能，超越了之前的扩散语言模型LLaDA、LLaDA 1.5和Dream。指令调整模型LLaDA-MoE-7B-A1B-Instruct在知识理解、代码生成、数学推理、代理和对齐任务中展现出与Qwen2.5-3B-Instruct相当的能力，尽管它使用的活动参数更少。我们的结果表明，将稀疏MoE架构整合到掩蔽扩散语言模型的训练目标中，仍能在有效推理和较少活动参数的情况下发挥出MoE的优势，并为进一步探索扩散语言模型提供了广阔的空间。LLaDA-MoE模型已在Huggingface上提供。

论文及项目相关链接

PDF

Summary

LLaDA-MoE是一个大型语言扩散模型，采用混合专家（MoE）架构，在约20T标记上从头开始训练。它在保持7B参数容量的同时，推理时仅激活1.4B参数，实现了显著的计算资源减少。实证评估表明，LLaDA-MoE在多个基准测试中超越了之前的扩散语言模型LLaDA、LLaDA 1.5和Dream，实现了扩散语言模型中的最新性能。LLaDA-MoE模型在知识理解、代码生成、数学推理、代理和对齐任务中，表现出与Qwen2.5-3B-Instruct相当的能力。整合稀疏MoE架构到掩码扩散语言模型的训练目标中，仍能在高效的推理和少量激活参数下发挥出MoE的优势，并为扩散语言模型的进一步探索提供了广阔的空间。

Key Takeaways

LLaDA-MoE是一个大型语言扩散模型，结合了混合专家（MoE）架构。
LLaDA-MoE在约20T标记数据上训练，具有7B参数容量，但推理时仅激活1.4B参数，降低了计算开销。
实证评估显示LLaDA-MoE在多个基准测试中超越了其他扩散语言模型。
LLaDA-MoE在知识理解、代码生成、数学推理等任务中表现出强大的能力。
整合MoE架构到掩码扩散语言模型的训练目标中，有助于实现高效推理和优秀性能。
LLaDA-MoE模型的性能表现打开了进一步探索扩散语言模型的广阔空间。

Cool Papers

点此查看论文截图

Latent Collective Preference Optimization: A General Framework for Robust LLM Alignment

Authors:Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu

Standard human preference-based alignment methods, such as Reinforcement Learning from Human Feedback (RLHF), are a cornerstone technology for aligning Large Language Models (LLMs) with human values. However, these methods are all underpinned by a critical, yet flawed assumption: human preferences are homogeneous (representing a single, unified preference) and the collected data is noiseless (free from error). In reality, neither is true since human preference is pluralistic and annotators can make mistakes. This creates a discrepancy between the recorded data and the ground-truth preferences, which can misguide the model and degrade its performance. To address this challenge, we introduce Latent Collective Preference Optimization (LCPO). LCPO leverages an Expectation-Maximization (EM) algorithm to learn the latent collective consensus from noisy data. It operates by inferring the correctness of each preference label and using this probability as an adaptive weight to re-calibrate each data point’s contribution to the training loss, thereby mitigating noise. We generalize this approach by establishing a theoretical link between arbitrary preference losses and their corresponding probabilistic models, elevating LCPO from a specific algorithm to a general framework for robust preference alignment. Theoretically, we prove that under the condition of a perfectly calibrated model, LCPO is guaranteed to converge to the true noise level of the dataset. Our experiments demonstrate LCPO’s effectiveness as a general framework, consistently enhancing four state-of-the-art alignment algorithms (DPO, IPO, SimPO, and CPO). When applied to Mistral and Llama 3 models, the LCPO-enhanced methods achieve substantial win rate gains on AlpacaEval 2 and Arena-Hard, with improvements of up to 7.0% on both benchmarks.

标准的人类偏好基准对齐方法，如基于人类反馈的强化学习（RLHF），是使大型语言模型（LLM）与人类价值观对齐的核心技术。然而，这些方法都是基于一个至关重要但存在缺陷的假设：人类偏好是均质的（代表一个统一、单一的偏好），收集的数据是无噪声的（没有错误）。实际上，两者都不真实，因为人类偏好是多元化的，注释者也会犯错误。这就造成了记录的数据与真实偏好之间的不一致，可能会误导模型并降低其性能。为了应对这一挑战，我们引入了潜在集体偏好优化（LCPO）。LCPO利用期望最大化（EM）算法从噪声数据中学习潜在的集体共识。它通过推断每个偏好标签的正确性，并使用这个概率作为自适应权重来重新校准每个数据点对训练损失的贡献，从而减轻噪声影响。我们通过建立任意偏好损失与其相应概率模型之间的理论联系，将这种方法一般化，使LCPO从一个特定算法提升为一个稳健的偏好对齐的一般框架。理论上，我们证明在模型完全校准的条件下，LCPO一定能收敛到数据集的真实噪声水平。我们的实验证明了LCPO作为一般框架的有效性，它始终如一地增强了四种最先进的对齐算法（DPO、IPO、SimPO和CPO）。当应用于Mistral和Llama 3模型时，采用LCPO增强方法在AlpacaEval 2和Arena-Hard上实现了显著的胜率提升，在两个基准测试上的改进均达到7.0%。

论文及项目相关链接

PDF

Summary
基于人类偏好反馈的强化学习（RLHF）等标准人类偏好基准对齐方法是对齐大型语言模型（LLM）与人类价值观的核心技术。然而，这些方法建立在有缺陷的假设上，即人类偏好是均质的，且收集的数据是无噪声的。实际上，人类偏好是多元化的，注释者也可能犯错。这导致记录的数据与真实偏好之间存在差异，可能会误导模型并降低其性能。为解决此挑战，我们引入了潜在集体偏好优化（LCPO）。LCPO利用期望最大化（EM）算法从噪声数据中学习潜在的集体共识。它通过推断每个偏好标签的正确性，并使用此概率作为自适应权重来重新校准每个数据点对训练损失的贡献，从而减轻噪声影响。我们通过建立任意偏好损失与其相应概率模型之间的理论联系，将LCPO从特定算法提升为用于稳健偏好对齐的一般框架。实验证明，LCPO作为一般框架的有效性，它持续提升四种最先进的对齐算法（DPO、IPO、SimPO和CPO）的性能。应用于Mistral和Llama 3模型时，在AlpacaEval 2和Arena-Hard上实现了显著的胜率增长，两个基准测试的提升幅度最高达7.0%。

Key Takeaways