⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-07-06
Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories
Authors:Jhouben Cuesta-Ramirez, Samuel Beaussant, Mehdi Mounsif
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs), calling into question whether benchmark gains reflect real reasoning improvements. We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps that often lead to incorrect conclusions. Experiments on three state-of-the-art models using the AIME2024 math benchmark reveal critical limitations in these models' ability to integrate corrective information, posing new challenges for achieving robust and interpretable reasoning.
Paper and project links
PDF Accepted to KONVENS 2025
Summary:
Large language models trained with reinforcement learning achieve impressive results on reasoning benchmarks. However, growing evidence shows that these models often produce long but ineffective chains of thought, so benchmark gains do not necessarily reflect genuine improvements in reasoning. The paper presents new evidence of overthinking: models disregard correct solutions even when these are explicitly provided, and keep generating unnecessary reasoning steps that frequently lead to wrong conclusions. Experiments on three state-of-the-art models on the AIME2024 math benchmark reveal critical limitations in their ability to integrate corrective information, posing new challenges for robust and interpretable reasoning.
Key Takeaways:
- Large reasoning models perform strongly on reasoning benchmarks.
- These models sometimes generate long but ineffective chains of thought.
- They exhibit overthinking and disregard correct solutions even when provided.
- State-of-the-art models struggle to integrate corrective information.
- The extra reasoning steps they generate often lead to incorrect conclusions.
- These issues pose new challenges for robust and interpretable reasoning.
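To make the overthinking finding concrete, here is a minimal, hypothetical probe in the spirit of the paper's setup: the prompt already contains a verified answer, and we check whether the model keeps it. `query_model` is a placeholder for the model under test, and the answer-extraction step is deliberately crude; this is an illustration, not the authors' evaluation code.

```python
# Minimal sketch of an "overthinking" probe: give the model a problem together with
# a known-correct solution and check whether its final answer still diverges.

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in the model under test here")

def probe_overthinking(problem: str, correct_answer: str) -> dict:
    prompt = (
        f"{problem}\n\n"
        f"A verified correct answer is {correct_answer}. "
        "Confirm it and reply with the final answer only."
    )
    reply = query_model(prompt)
    final = reply.strip().split()[-1]          # crude final-answer extraction
    return {
        "kept_correct_answer": final == correct_answer,
        "extra_tokens": len(reply.split()),    # proxy for unnecessary reasoning length
    }
```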
Click here to view the paper screenshots




SAFER: Probing Safety in Reward Models with Sparse Autoencoder
Authors:Sihang Li, Wei Shi, Ziyuan Xie, Tao Liang, Guojun Ma, Xiang Wang
Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values, yet the reward models at its core remain largely opaque. In this work, we present sparse Autoencoder For Enhanced Reward model (SAFER), a novel framework for interpreting and improving reward models through mechanistic analysis. Leveraging Sparse Autoencoders (SAEs), we uncover human-interpretable features in reward model activations, enabling insight into safety-relevant decision-making. We apply SAFER to safety-oriented preference datasets and quantify the salience of individual features by activation differences between chosen and rejected responses. Using these feature-level signals, we design targeted data poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. Our approach contributes to interpreting, auditing and refining reward models in high-stakes LLM alignment tasks. Our codes are available at https://github.com/xzy-101/SAFER-code. This paper discusses topics related to large language model safety and may include discussions or examples that highlight potential risks or unsafe outcomes.
Paper and project links
Summary
Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models with human values, but the reward models at its core remain largely opaque. This work proposes SAFER, a framework for interpreting and improving reward models. Using sparse autoencoders, it uncovers human-interpretable features in reward-model activations, giving insight into safety-relevant decision-making. Applied to safety-oriented preference datasets, SAFER quantifies the salience of individual features via the activation difference between chosen and rejected responses, and uses these feature-level signals to design targeted data-poisoning and denoising strategies. Experiments show that SAFER can precisely degrade or enhance safety alignment with minimal data modification, without sacrificing general chat performance. The approach contributes to interpreting, auditing, and refining reward models in high-stakes LLM alignment tasks; code is available at the linked repository. The paper discusses LLM-safety topics and may include examples highlighting potential risks or unsafe outcomes.
Key Takeaways
- Reinforcement learning from human feedback (RLHF) is a key paradigm for aligning large language models (LLMs) with human values.
- Reward models sit at the core of RLHF but remain largely opaque.
- The SAFER framework interprets and improves reward models, using sparse autoencoders to uncover human-interpretable features.
- Inspecting reward-model activations gives insight into safety-relevant decision-making.
- On safety-oriented preference datasets, feature salience is quantified by activation differences between chosen and rejected responses.
- Feature-level signals drive targeted data-poisoning and denoising strategies that can precisely degrade or enhance safety alignment.
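A rough sketch of the feature-salience idea described above, assuming a trained SAE encoder is available as a callable; the scoring rule (mean activation gap between chosen and rejected responses) follows the abstract, but everything else is illustrative.

```python
import numpy as np

# Score each SAE feature by how differently it activates on chosen vs. rejected responses.
# `encode` stands in for a trained sparse-autoencoder encoder applied to reward-model
# activations (a hypothetical callable here).

def feature_salience(chosen_acts: np.ndarray, rejected_acts: np.ndarray, encode) -> np.ndarray:
    """chosen_acts / rejected_acts: (n_pairs, d_model) reward-model activations."""
    f_chosen = encode(chosen_acts)        # (n_pairs, n_features) sparse codes
    f_rejected = encode(rejected_acts)
    return (f_chosen - f_rejected).mean(axis=0)   # per-feature activation gap

# Features with the largest positive or negative salience are the candidates for the
# targeted data-poisoning or denoising interventions described in the abstract.
```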
Click here to view the paper screenshots







MiCo: Multi-image Contrast for Reinforcement Visual Reasoning
Authors:Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao
This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. A straightforward solution is to adapt rule-based reinforcement learning for Vision-Language Models (VLMs). However, such methods typically rely on manually curated question-answer pairs, which can be particularly challenging when dealing with fine grained visual details and complex logic across images. Inspired by self-supervised visual representation learning, we observe that images contain inherent constraints that can serve as supervision. Based on this insight, we construct image triplets comprising two augmented views of the same image and a third, similar but distinct image. During training, the model is prompted to generate a reasoning process to compare these images (i.e., determine same or different). Then we optimize the model with rule-based reinforcement learning. Due to the high visual similarity and the presence of augmentations, the model must attend to subtle visual changes and perform logical reasoning to succeed. Experiments show that, although trained solely on visual comparison tasks, the learned reasoning ability generalizes effectively to a wide range of questions. Without relying on any human-annotated question-answer pairs, our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
Paper and project links
Summary
This work explores enabling Chain-of-Thought (CoT) reasoning to link visual cues across multiple images. Inspired by self-supervised visual representation learning, it uses the constraints inherent in images as supervision: each training triplet consists of two augmented views of the same image plus a third, similar but distinct image. The model is prompted to generate a reasoning process that compares the images (same or different) and is then optimized with rule-based reinforcement learning. Because the images are highly similar and augmented, the model must attend to subtle visual changes and reason logically to succeed. Experiments show that, although trained only on visual comparison, the learned reasoning generalizes to a wide range of questions; without any human-annotated question-answer pairs, the method achieves significant improvements on multi-image reasoning benchmarks and performs strongly on general vision tasks.
Key Takeaways
- The work explores Chain-of-Thought (CoT) reasoning that links visual cues across multiple images.
- Inspired by self-supervised visual representation learning, it uses constraints inherent in images as supervision.
- Image triplets are constructed to train the model to compare images and reason about them.
- Rule-based reinforcement learning is used to optimize the model.
- High visual similarity and augmentations force the model to attend to subtle visual changes and reason logically.
- Experiments show significant improvements on multi-image reasoning benchmarks.
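A toy sketch of the triplet construction and the rule-based reward, under the assumption that simple flips and rotations stand in for the paper's actual augmentation pipeline:

```python
import random
from PIL import Image, ImageOps

def augment(img: Image.Image) -> Image.Image:
    # Stand-in augmentation: random mirror plus a random right-angle rotation.
    img = ImageOps.mirror(img) if random.random() < 0.5 else img
    return img.rotate(random.choice([0, 90, 180, 270]))

def build_triplet(anchor: Image.Image, distractor: Image.Image):
    # Two augmented views of the same image + one similar-but-distinct image.
    return augment(anchor), augment(anchor), distractor

def rule_based_reward(predicted_same: bool, is_same_pair: bool) -> float:
    # Binary verifiable reward: the model's "same"/"different" verdict vs. ground truth.
    return 1.0 if predicted_same == is_same_pair else 0.0
```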
Click here to view the paper screenshots




EFRame: Deeper Reasoning via Exploration-Filter-Replay Reinforcement Learning Framework
Authors:Chen Wang, Lai Wei, Yanzhi Zhang, Chenyang Shao, Zedong Dan, Weiran Huang, Yue Wang, Yuzhi Zhang
Recent advances in reinforcement learning (RL) have significantly enhanced the reasoning capabilities of large language models (LLMs). Group Relative Policy Optimization (GRPO), an efficient variant of PPO that lowers RL’s computational cost, still faces limited exploration, low sample efficiency and instability, constraining its performance on complex reasoning tasks. To address these limitations, we introduce EFRame, an Exploration-Filtering-Replay framework that systematically augments GRPO along three critical dimensions. EFRame performs additional rollouts to explore high-quality trajectories, applies online filtering to eliminate low-quality samples that introduce noise and variance, and leverages experience replay to repeatedly exploit rare but informative samples. EFRame establishes a complete and stable learning cycle, guiding the model through a structured transition from exploration to convergence. Our experiments across a variety of reasoning benchmarks demonstrate that EFRame not only improves the robustness and efficiency of training, but also enables access to deeper reasoning capabilities that remain unattainable under vanilla GRPO. Furthermore, EFRame enables a more fine-grained categorization of training samples, allowing for a deeper analysis of how different types of samples contribute to the learning process in RL. Our code is available at https://github.com/597358816/EFRame.
Paper and project links
Summary
Recent advances in reinforcement learning (RL) have substantially improved the reasoning capabilities of large language models (LLMs). GRPO, an efficient PPO variant that lowers RL's computational cost, still suffers from limited exploration, low sample efficiency, and instability on complex reasoning tasks. EFRame addresses these limitations by augmenting GRPO along three dimensions: additional rollouts for exploration, online filtering of low-quality samples, and experience replay of rare but informative ones. Experiments show that EFRame improves both the robustness and the efficiency of training and unlocks deeper reasoning capabilities than vanilla GRPO.
Key Takeaways
- Reinforcement learning boosts the reasoning ability of large language models.
- GRPO suffers from limited exploration, low sample efficiency, and instability.
- EFRame augments GRPO with exploration, filtering, and replay.
- EFRame improves training robustness and efficiency.
- EFRame unlocks deeper reasoning capabilities.
- EFRame enables a finer-grained categorization of training samples, allowing analysis of how different sample types contribute to learning.
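The exploration-filtering-replay loop might look roughly like the sketch below; the thresholds, the replay criterion, and the `rollout`/`grpo_update` callables are assumptions for illustration, not the paper's implementation.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=4096)   # stores rare but informative samples for reuse

def efr_step(policy, prompts, rollout, grpo_update,
             n_rollouts=16, keep_quantile=0.5, replay_ratio=0.25):
    samples = []
    for p in prompts:
        group = [rollout(policy, p) for _ in range(n_rollouts)]      # extra exploration
        group.sort(key=lambda s: s["reward"], reverse=True)
        kept = group[: int(len(group) * keep_quantile)]              # online filtering
        samples.extend(kept)
        replay_buffer.extend(s for s in kept if s["reward"] > 0.99)  # keep rare successes
    n_replay = int(len(samples) * replay_ratio)
    if replay_buffer and n_replay:
        samples += random.sample(list(replay_buffer), min(n_replay, len(replay_buffer)))
    return grpo_update(policy, samples)                              # GRPO-style update
```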
Click here to view the paper screenshots




LeanConjecturer: Automatic Generation of Mathematical Conjectures for Theorem Proving
Authors:Naoto Onda, Kazumi Kasaura, Yuta Oriike, Masaya Taniguchi, Akiyoshi Sannai, Sho Sonoda
We introduce LeanConjecturer, a pipeline for automatically generating university-level mathematical conjectures in Lean 4 using Large Language Models (LLMs). Our hybrid approach combines rule-based context extraction with LLM-based theorem statement generation, addressing the data scarcity challenge in formal theorem proving. Through iterative generation and evaluation, LeanConjecturer produced 12,289 conjectures from 40 Mathlib seed files, with 3,776 identified as syntactically valid and non-trivial, that is, cannot be proven by the aesop tactic. We demonstrate the utility of these generated conjectures for reinforcement learning through Group Relative Policy Optimization (GRPO), showing that targeted training on domain-specific conjectures can enhance theorem proving capabilities. Our approach generates 103.25 novel conjectures per seed file on average, providing a scalable solution for creating training data for theorem proving systems. Our system successfully verified several non-trivial theorems in topology, including properties of semi-open, alpha-open, and pre-open sets, demonstrating its potential for mathematical discovery beyond simple variations of existing results.
Paper and project links
PDF 15 pages, 4 figures, 5 tables
Summary
LeanConjecturer is a pipeline that automatically generates university-level mathematical conjectures in Lean 4 using large language models. It combines rule-based context extraction with LLM-based theorem-statement generation to address the data-scarcity challenge in formal theorem proving. Through iterative generation and evaluation, it produced 12,289 conjectures from 40 Mathlib seed files, 3,776 of which are syntactically valid and non-trivial (i.e., not provable by the aesop tactic). The generated conjectures are useful for reinforcement learning with Group Relative Policy Optimization (GRPO): targeted training on domain-specific conjectures improves theorem-proving capability. The pipeline yields 103.25 novel conjectures per seed file on average, a scalable source of training data for theorem-proving systems, and it verified several non-trivial theorems in topology, demonstrating potential for mathematical discovery.
Key Takeaways
- LeanConjecturer automatically generates university-level mathematical conjectures in Lean 4 using Large Language Models (LLMs).
- It combines rule-based context extraction with LLM-based theorem-statement generation to address data scarcity.
- From 40 Mathlib seed files it generated 12,289 conjectures, a portion of which are valuable for reinforcement learning.
- The system verified several non-trivial theorems in topology, including properties of semi-open, alpha-open, and pre-open sets.
- Via Group Relative Policy Optimization (GRPO), targeted training on domain-specific conjectures enhances theorem-proving capability.
- Generating over a hundred new conjectures per seed file on average provides a scalable source of training data for theorem-proving systems.
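A minimal sketch of the non-triviality filter described above, assuming a Lean 4 project built with Lake and checking each statement twice: once with `sorry` (syntactic validity) and once with `aesop` (triviality). The project path and invocation details are assumptions for illustration.

```python
import subprocess
import tempfile
from pathlib import Path

def compiles(src: str, project_dir: str) -> bool:
    # Write a temporary .lean file inside the project and try to elaborate it.
    with tempfile.NamedTemporaryFile("w", suffix=".lean", dir=project_dir, delete=False) as f:
        f.write(src)
        path = Path(f.name)
    ok = subprocess.run(["lake", "env", "lean", str(path)],
                        cwd=project_dir, capture_output=True).returncode == 0
    path.unlink()
    return ok

def survives_filter(statement: str, project_dir: str = "my_lean_project") -> bool:
    header = "import Mathlib\n\n"
    # Syntactically valid: the statement elaborates when the proof is left as `sorry`.
    syntactic = compiles(header + f"theorem conj : {statement} := by\n  sorry\n", project_dir)
    # Trivial: `aesop` alone closes the goal, so the conjecture is rejected.
    trivial = compiles(header + f"theorem conj : {statement} := by\n  aesop\n", project_dir)
    return syntactic and not trivial
```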
Click here to view the paper screenshots







SEEA-R1: Tree-Structured Reinforcement Fine-Tuning for Self-Evolving Embodied Agents
Authors:Wanxin Tian, Shijie Zhang, Kevin Zhang, Xiaowei Chi, Yulin Luo, Junyu Lu, Chunkai Fan, Qiang Zhou, Yiming Zhao, Ning Liu, Siyu Lin, Zhiyuan Qin, Xiaozhu Ju, Shanghang Zhang, Jian Tang
Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for the embodied domain with long-horizon, real-world tasks. Despite current advancements in reinforcement fine-tuning (RFT) showing strong performance in enhancing reasoning in LLMs, its potential to enable self-evolving embodied intelligence with multi-modal interactions remains largely unexplored. Specifically, reinforcement fine-tuning faces two fundamental obstacles in embodied settings: (i) the lack of accessible intermediate rewards in multi-step reasoning tasks limits effective learning signals, and (ii) reliance on hand-crafted reward functions restricts generalization to novel tasks and environments. To address these challenges, we present Self-Evolving Embodied Agents-R1, SEEA-R1, the first RFT framework designed for enabling the self-evolving capabilities of embodied agents. Specifically, to convert sparse delayed rewards into denser intermediate signals that improve multi-step reasoning, we propose Tree-based group relative policy optimization (Tree-GRPO), which integrates Monte Carlo Tree Search into GRPO. To generalize reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution, we further introduce Multi-modal Generative Reward Model (MGRM). To holistically evaluate the effectiveness of SEEA-R1, we evaluate on the ALFWorld benchmark, surpassing state-of-the-art methods with scores of 85.07% (textual) and 36.19% (multi-modal), outperforming prior models including GPT-4o. SEEA-R1 also achieves scores of 80.3% without environmental reward, surpassing all open-source baselines and highlighting its scalability as a self-evolving embodied agent. Additional experiments and qualitative analysis further support the potential of SEEA-R1 for future research in scalable embodied intelligence.
Paper and project links
Summary
Self-evolution, the ability of agents to autonomously improve their reasoning and behavior, is essential for embodied settings with long-horizon, real-world tasks. Although reinforcement fine-tuning (RFT) has proven effective for improving reasoning in LLMs, its potential for self-evolving embodied intelligence with multi-modal interaction is largely unexplored. The paper proposes SEEA-R1, the first RFT framework designed for self-evolving embodied agents. Tree-based group relative policy optimization (Tree-GRPO) integrates Monte Carlo Tree Search into GRPO to convert sparse delayed rewards into denser intermediate signals for multi-step reasoning, and a Multi-modal Generative Reward Model (MGRM) generalizes reward estimation across tasks and scenes to support autonomous adaptation. On the ALFWorld benchmark, SEEA-R1 surpasses state-of-the-art methods with scores of 85.07% (textual) and 36.19% (multi-modal), outperforming prior models including GPT-4o, and reaches 80.3% without environmental reward, exceeding all open-source baselines and highlighting its scalability as a self-evolving embodied agent.
Key Takeaways
- Self-evolution is essential for embodied agents facing long-horizon, real-world tasks.
- Reinforcement fine-tuning (RFT) improves LLM reasoning, but its potential for self-evolving embodied intelligence remains underexplored.
- SEEA-R1 addresses this with Tree-GRPO and a Multi-modal Generative Reward Model (MGRM).
- Tree-GRPO converts sparse delayed rewards into denser intermediate signals, improving multi-step reasoning.
- MGRM generalizes reward estimation across tasks and scenes, supporting autonomous adaptation and reward-driven self-evolution.
- On the ALFWorld benchmark, SEEA-R1 outperforms state-of-the-art methods.
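The core Tree-GRPO idea of densifying sparse terminal rewards can be sketched as a Monte Carlo backup over tree nodes, as below; the data structures and the advantage definition are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

# Back a sparse terminal reward up through the tree of multi-step rollouts so that
# intermediate (state, action) nodes acquire Monte Carlo value estimates.

visit_count = defaultdict(int)
value_sum = defaultdict(float)

def backup(trajectory, terminal_reward):
    """trajectory: list of (state, action) pairs from root to leaf."""
    for state, action in trajectory:
        visit_count[(state, action)] += 1
        value_sum[(state, action)] += terminal_reward

def intermediate_advantage(state, action, baseline):
    # Average backed-up value minus a group baseline; this denser signal would feed a
    # GRPO-style update instead of the original sparse end-of-episode reward.
    q = value_sum[(state, action)] / max(visit_count[(state, action)], 1)
    return q - baseline
```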
Click here to view the paper screenshots



Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
Authors:Yifan Shen, Yuanzhe Liu, Jingyuan Zhu, Xu Cao, Xiaofeng Zhang, Yixiao He, Wenming Ye, James Matthew Rehg, Ismini Lourentzou
Current Vision-Language Models (VLMs) struggle with fine-grained spatial reasoning, particularly when multi-step logic and precise spatial alignment are required. In this work, we introduce SpatialReasoner-R1, a vision-language reasoning model designed to address these limitations. To construct high-quality supervision for spatial reasoning, we design a Multi-Model Monte Carlo Tree Search (M3CTS) method that generates diverse, logically consistent Long Chain-of-Thought (LongCoT) reasoning trajectories. In addition, we propose fine-grained Direct Preference Optimization (fDPO), which introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that evaluates candidate responses based on visual consistency, spatial grounding, and logical coherence. Experimental results demonstrate that fDPO achieves an average improvement of 4.1% over standard DPO across spatial quality tasks, and a 9.0% gain in spatial quantity tasks. SpatialReasoner-R1, trained with fDPO, sets a new SoTA on SPATIALRGPT-Bench, outperforming the strongest baseline by 9.8% in average accuracy, while maintaining competitive performance on general vision-language tasks.
Paper and project links
PDF 29 pages
Summary
Current vision-language models struggle with fine-grained spatial reasoning, especially when multi-step logic and precise spatial alignment are required. SpatialReasoner-R1 addresses these limitations. A Multi-Model Monte Carlo Tree Search (M3CTS) method generates diverse, logically consistent Long Chain-of-Thought (LongCoT) trajectories as high-quality supervision. Fine-grained Direct Preference Optimization (fDPO) introduces segment-specific preference granularity for descriptive grounding and logical reasoning, guided by a spatial reward mechanism that scores candidate responses on visual consistency, spatial grounding, and logical coherence. fDPO yields an average 4.1% improvement over standard DPO on spatial-quality tasks and 9.0% on spatial-quantity tasks; SpatialReasoner-R1 trained with fDPO sets a new state of the art on SPATIALRGPT-Bench, beating the strongest baseline by 9.8% in average accuracy while remaining competitive on general vision-language tasks.
Key Takeaways
- Current vision-language models struggle with fine-grained spatial reasoning, especially multi-step logic and precise spatial alignment.
- SpatialReasoner-R1 addresses these limitations, using Multi-Model Monte Carlo Tree Search (M3CTS) to construct high-quality spatial-reasoning supervision.
- Fine-grained Direct Preference Optimization (fDPO) introduces segment-specific preference granularity for descriptive grounding and logical reasoning.
- A spatial reward mechanism scores candidate responses on visual consistency, spatial grounding, and logical coherence.
- fDPO improves on standard DPO by an average of 4.1% on spatial-quality tasks and 9.0% on spatial-quantity tasks.
- SpatialReasoner-R1 sets a new state of the art on SPATIALRGPT-Bench, exceeding the strongest baseline by 9.8% in average accuracy.
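A hedged sketch of the two ingredients named above, a composite spatial reward and a segment-weighted DPO-style loss; the weights, the segment partition, and the exact functional form are illustrative assumptions rather than the paper's formulation.

```python
import torch
import torch.nn.functional as F

# Composite spatial reward: weighted sum of the three criteria named in the abstract.
def spatial_reward(visual_consistency: float, spatial_grounding: float,
                   logical_coherence: float, w=(0.4, 0.3, 0.3)) -> float:
    return w[0] * visual_consistency + w[1] * spatial_grounding + w[2] * logical_coherence

# Segment-weighted DPO-style loss: per-segment log-prob ratios (policy vs. reference)
# for the chosen and rejected responses, combined with segment-specific weights
# (e.g. one weight for descriptive grounding, another for logical reasoning).
def fdpo_loss(logratio_chosen: torch.Tensor, logratio_rejected: torch.Tensor,
              segment_weights: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    delta = segment_weights * (logratio_chosen - logratio_rejected)
    return -F.logsigmoid(beta * delta.sum())
```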
Click here to view the paper screenshots



APO: Enhancing Reasoning Ability of MLLMs via Asymmetric Policy Optimization
Authors:Minjie Hong, Zirun Guo, Yan Xia, Zehan Wang, Ziang Zhang, Tao Jin, Zhou Zhao
Multimodal Large Language Models (MLLMs) are powerful at integrating diverse data, but they often struggle with complex reasoning. While Reinforcement learning (RL) can boost reasoning in LLMs, applying it to MLLMs is tricky. Common issues include a drop in performance on general tasks and the generation of overly detailed or “overthinking” reasoning. Our work investigates how the KL penalty and overthinking affect RL training in MLLMs. We propose Asymmetric Policy Optimization (APO) to address these issues, which divides the sampled responses into positive and negative groups. For positive samples, Difficulty-Adaptive Divergence Shaping (DADS) is introduced to dynamically adjust the KL divergence weight based on their difficulty. This method prevents policy entropy from dropping sharply, improves training stability, utilizes samples better, and preserves the model’s existing knowledge. For negative samples, Suboptimal Trajectory Complexity Regularization (STCR) is proposed to penalize overly long responses. This helps mitigate overthinking and encourages more concise reasoning while preserving the model’s explorative capacity. We apply our method to Qwen2.5-VL-3B, creating View-R1-3B. View-R1-3B significantly enhances reasoning capabilities, showing an average 7% gain over the base model and outperforming larger MLLMs (7-11B) on various reasoning benchmarks. Importantly, unlike other reasoning-tuned MLLMs that often degrade on general tasks, View-R1-3B maintains consistent improvement, demonstrating superior generalization. These results highlight the effectiveness and broad applicability of our DADS and STCR techniques for advancing complex multimodal reasoning in MLLMs. The code will be made available at https://github.com/Indolent-Kawhi/View-R1.
Paper and project links
Summary
Multimodal large language models (MLLMs) struggle with complex reasoning, and applying reinforcement learning to them raises problems such as degraded performance on general tasks and overly long, "overthinking" outputs. The paper studies how the KL penalty and overthinking affect RL training in MLLMs and proposes Asymmetric Policy Optimization (APO), which splits sampled responses into positive and negative groups. Difficulty-Adaptive Divergence Shaping (DADS) dynamically adjusts the KL-divergence weight of positive samples based on their difficulty, while Suboptimal Trajectory Complexity Regularization (STCR) penalizes overly long negative responses. Applying the method to Qwen2.5-VL-3B yields View-R1-3B, which gains an average of 7% over the base model, outperforms larger 7-11B MLLMs on various reasoning benchmarks, and, unlike many reasoning-tuned MLLMs, maintains consistent improvement on general tasks.
Key Takeaways
- MLLMs fall short on complex reasoning, and applying reinforcement learning (RL) to them is challenging.
- Asymmetric Policy Optimization (APO) is proposed to address these issues in RL training of MLLMs.
- Difficulty-Adaptive Divergence Shaping (DADS) handles positive samples by dynamically adjusting the KL-divergence weight.
- Suboptimal Trajectory Complexity Regularization (STCR) penalizes overly long negative responses, mitigating "overthinking".
- Applied to Qwen2.5-VL-3B, the method improves reasoning by an average of 7% and outperforms larger MLLMs on various benchmarks.
- The resulting model keeps improving on general tasks while gaining reasoning ability.
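The two regularizers could be sketched as below; the linear difficulty scaling and quadratic length penalty are illustrative assumptions rather than the paper's exact formulas.

```python
# Sketch of the two penalty terms described above (illustrative functional forms only).

def dads_kl_weight(pass_rate: float, kl_min: float = 0.01, kl_max: float = 0.1) -> float:
    """Positive samples: harder prompts (low pass rate) get a smaller KL weight so the
    policy can move further from the reference; easy prompts keep a larger one."""
    difficulty = 1.0 - pass_rate
    return kl_max - (kl_max - kl_min) * difficulty

def stcr_penalty(response_len: int, target_len: int, coef: float = 1e-4) -> float:
    """Negative samples: penalize responses that run far past a target length,
    discouraging overthinking while leaving short exploratory answers untouched."""
    excess = max(0, response_len - target_len)
    return coef * excess ** 2
```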
Click here to view the paper screenshots





Refine-POI: Reinforcement Fine-Tuned Large Language Models for Next Point-of-Interest Recommendation
Authors:Peibo Li, Shuang Ao, Hao Xue, Yang Song, Maarten de Rijke, Johan Barthélemy, Tomasz Bednarz, Flora D. Salim
Large language models (LLMs) have been adopted for next point-of-interest (POI) recommendation tasks. Typical LLM-based recommenders fall into two categories: prompt-based and supervised fine-tuning (SFT)-based models. Prompt-based models generally offer greater output flexibility but deliver lower accuracy, whereas SFT-based models achieve higher performance yet face a fundamental mismatch: next POI recommendation data does not naturally suit supervised fine-tuning. In SFT, the model is trained to reproduce the exact ground truth, but each training example provides only a single target POI, so there is no ground truth for producing a top-k list. To address this, we propose Refine-POI, a reinforcement fine-tuning framework for next POI recommendation. We introduce recommendation-driven rewards that enable LLMs to learn to generate top-k recommendation lists using only one ground-truth POI per example. Experiments on real-world datasets demonstrate that Refine-POI achieves state-of-the-art top-k recommendation performance.
Paper and project links
Summary
Large language models have been adopted for next point-of-interest (POI) recommendation. Prompt-based recommenders are flexible but less accurate, while supervised fine-tuning (SFT) models perform better yet face a fundamental mismatch: each training example contains only one ground-truth POI, so there is no target for producing a top-k list. Refine-POI, a reinforcement fine-tuning framework, introduces recommendation-driven rewards that let LLMs learn to generate top-k recommendation lists from a single ground-truth POI per example. Experiments on real-world datasets show state-of-the-art top-k recommendation performance.
Key Takeaways
- Large language models (LLMs) have been applied to next-POI recommendation.
- Prompt-based and supervised fine-tuning (SFT) approaches each have strengths and weaknesses.
- Prompt-based models offer flexible output but lower accuracy.
- SFT models perform well but suffer from a data mismatch: one ground-truth POI per example, with no top-k target.
- Refine-POI is a reinforcement fine-tuning framework for next-POI recommendation.
- Its recommendation-driven rewards let LLMs learn to generate top-k lists from a single ground-truth POI per example.
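One simple instance of a recommendation-driven reward that needs only a single ground-truth POI is a reciprocal-rank reward over the generated top-k list, sketched below; the paper's exact reward design may differ.

```python
def topk_reward(predicted_pois: list[str], ground_truth: str, k: int = 10) -> float:
    """Reward the model's top-k list by where the single ground-truth POI appears."""
    ranked = predicted_pois[:k]
    if ground_truth not in ranked:
        return 0.0
    return 1.0 / (ranked.index(ground_truth) + 1)   # rank 1 -> 1.0, rank 2 -> 0.5, ...

# Example: topk_reward(["cafe_12", "museum_3", "park_7"], "museum_3") == 0.5
```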
Click here to view the paper screenshots



Is DeepSeek a New Voice Among LLMs in Public Opinion Simulation?
Authors:Weihong Qi, Fan Huang, Jisun An, Haewoon Kwak
This study evaluates the ability of DeepSeek, an open-source large language model (LLM), to simulate public opinions in comparison to LLMs developed by major tech companies. By comparing DeepSeek-R1 and DeepSeek-V3 with Qwen2.5, GPT-4o, and Llama-3.3 and utilizing survey data from the American National Election Studies (ANES) and the Zuobiao dataset of China, we assess these models’ capacity to predict public opinions on social issues in both China and the United States, highlighting their comparative capabilities between countries. Our findings indicate that DeepSeek-V3 performs best in simulating U.S. opinions on the abortion issue compared to other topics such as climate change, gun control, immigration, and services for same-sex couples, primarily because it more accurately simulates responses when provided with Democratic or liberal personas. For Chinese samples, DeepSeek-V3 performs best in simulating opinions on foreign aid and individualism but shows limitations in modeling views on capitalism, particularly failing to capture the stances of low-income and non-college-educated individuals. It does not exhibit significant differences from other models in simulating opinions on traditionalism and the free market. Further analysis reveals that all LLMs exhibit the tendency to overgeneralize a single perspective within demographic groups, often defaulting to consistent responses within groups. These findings highlight the need to mitigate cultural and demographic biases in LLM-driven public opinion modeling, calling for approaches such as more inclusive training methodologies.
Paper and project links
Summary:
The study evaluates how well DeepSeek, an open-source LLM, simulates public opinion compared with models from major tech companies. DeepSeek-R1 and DeepSeek-V3 are compared with Qwen2.5, GPT-4o, and Llama-3.3 using survey data from the American National Election Studies (ANES) and China's Zuobiao dataset, assessing how well the models predict public opinion on social issues in the United States and China. DeepSeek-V3 simulates U.S. opinions best on abortion, compared with topics such as climate change, gun control, immigration, and services for same-sex couples, largely because it responds more accurately when given Democratic or liberal personas. On Chinese samples it performs best on foreign aid and individualism but struggles with views on capitalism, particularly the stances of low-income and non-college-educated respondents, and does not differ significantly from other models on traditionalism and the free market. All LLMs tend to overgeneralize a single perspective within demographic groups, underscoring the need to mitigate cultural and demographic biases, for example through more inclusive training methodologies.
Key Takeaways:
- DeepSeek is compared with other large language models on simulating public opinion.
- DeepSeek-V3 performs best at simulating U.S. opinions on the abortion issue.
- For Chinese samples, DeepSeek-V3 performs best on foreign aid and individualism.
- DeepSeek-V3 shows limitations in simulating opinions on capitalism.
- Large language models tend to overgeneralize a single perspective within demographic groups.
- DeepSeek-V3 does not differ significantly from other models in simulating opinions on traditionalism and the free market.
Click here to view the paper screenshots






ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models
Authors:Hongbo Liu, Jingwen He, Yi Jin, Dian Zheng, Yuhao Dong, Fan Zhang, Ziqi Huang, Yinan He, Yangguang Li, Weichao Chen, Yu Qiao, Wanli Ouyang, Shengjie Zhao, Ziwei Liu
Cinematography, the fundamental visual language of film, is essential for conveying narrative, emotion, and aesthetic quality. While recent Vision-Language Models (VLMs) demonstrate strong general visual understanding, their proficiency in comprehending the nuanced cinematic grammar embedded within individual shots remains largely unexplored and lacks robust evaluation. This critical gap limits both fine-grained visual comprehension and the precision of AI-assisted video generation. To address this, we introduce ShotBench, a comprehensive benchmark specifically designed for cinematic language understanding. It features over 3.5k expert-annotated QA pairs from images and video clips, meticulously curated from over 200 acclaimed (predominantly Oscar-nominated) films and spanning eight key cinematography dimensions. Our evaluation of 24 leading VLMs on ShotBench reveals their substantial limitations: even the top-performing model achieves less than 60% average accuracy, particularly struggling with fine-grained visual cues and complex spatial reasoning. To catalyze advancement in this domain, we construct ShotQA, a large-scale multimodal dataset comprising approximately 70k cinematic QA pairs. Leveraging ShotQA, we develop ShotVL through supervised fine-tuning and Group Relative Policy Optimization. ShotVL significantly outperforms all existing open-source and proprietary models on ShotBench, establishing new state-of-the-art performance. We open-source our models, data, and code to foster rapid progress in this crucial area of AI-driven cinematic understanding and generation.
Paper and project links
Summary
Cinematography is the fundamental visual language of film, yet current vision-language models are not well evaluated on the nuanced cinematic grammar within individual shots. ShotBench is a benchmark for cinematic language understanding with over 3.5k expert-annotated QA pairs drawn from images and clips of more than 200 acclaimed (mostly Oscar-nominated) films, spanning eight key cinematography dimensions. Evaluating 24 leading VLMs reveals substantial limitations: even the best model averages below 60% accuracy, struggling especially with fine-grained visual cues and complex spatial reasoning. The authors also build ShotQA, a large multimodal dataset of about 70k cinematic QA pairs, and train ShotVL via supervised fine-tuning and Group Relative Policy Optimization; ShotVL surpasses all existing open-source and proprietary models on ShotBench. Models, data, and code are open-sourced to accelerate AI-driven cinematic understanding and generation.
Key Takeaways
- Cinematography, the visual language of film, is essential for conveying narrative, emotion, and aesthetic quality.
- Current vision-language models are limited in understanding nuanced cinematic grammar, especially fine-grained cues within individual shots and complex spatial reasoning.
- ShotBench evaluates cinematic-language understanding with over 3.5k expert-annotated QA pairs curated from images and video clips.
- An evaluation of 24 leading vision-language models reveals their substantial limitations.
- ShotQA is a large-scale multimodal dataset of roughly 70k cinematic QA pairs.
- ShotVL, trained with supervised fine-tuning and Group Relative Policy Optimization, outperforms all existing open-source and proprietary models on ShotBench.
Click here to view the paper screenshots







Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards
Authors:Jihao Gu, Qihang Ai, Yingyao Wang, Pi Bu, Jingxuan Xing, Zekun Zhu, Wei Jiang, Ziming Wang, Yingxiu Zhao, Ming-Liang Zhang, Jun Song, Yuning Jiang, Bo Zheng
Vision-language model-based mobile agents have gained the ability to not only understand complex instructions and mobile screenshots, but also optimize their action outputs via thinking and reasoning, benefiting from reinforcement learning, such as Group Relative Policy Optimization (GRPO). However, existing research centers on offline reinforcement learning training or online optimization using action-level rewards, which limits the agent’s dynamic interaction with the environment. This often results in agents settling into local optima, thereby weakening their ability for exploration and error action correction. To address these challenges, we introduce an approach called Mobile-R1, which employs interactive multi-turn reinforcement learning with task-level rewards for mobile agents. Our training framework consists of three stages: initial format finetuning, single-step online training via action-level reward, followed by online training via task-level reward based on multi-turn trajectories. This strategy is designed to enhance the exploration and error correction capabilities of Mobile-R1, leading to significant performance improvements. Moreover, we have collected a dataset covering 28 Chinese applications with 24,521 high-quality manual annotations and established a new benchmark with 500 trajectories. We will open source all resources, including the dataset, benchmark, model weight, and codes: https://mobile-r1.github.io/Mobile-R1/.
Paper and project links
PDF 14 pages, 12 figures
Summary
Vision-language-model-based mobile agents can understand complex instructions and mobile screenshots and optimize their actions via reasoning and reinforcement learning such as GRPO. Existing work, however, relies on offline RL training or online optimization with action-level rewards, which limits dynamic interaction with the environment and often traps agents in local optima, weakening exploration and error correction. Mobile-R1 instead uses interactive multi-turn reinforcement learning with task-level rewards. Training proceeds in three stages: initial format fine-tuning, single-step online training with action-level rewards, and online training with task-level rewards over multi-turn trajectories. The authors also collected a dataset covering 28 Chinese applications with 24,521 high-quality manual annotations and established a new benchmark of 500 trajectories; all resources will be open-sourced.
Key Takeaways
- Mobile agents based on vision-language models can understand complex instructions and mobile screenshots.
- Reinforcement learning is used to optimize the agents' action outputs.
- Existing work focuses on offline RL training or action-level online optimization, limiting interaction with the environment.
- Mobile-R1 combines interactive multi-turn reinforcement learning with task-level rewards to strengthen exploration and error correction.
- Training consists of three stages: initial format fine-tuning, single-step online training with action-level rewards, and online training with task-level rewards.
- A dataset covering 28 Chinese applications was collected and a new benchmark of 500 trajectories established.
Click here to view the paper screenshots






From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Authors:Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.
Paper and project links
Summary
Large language models endowed with reasoning and agentic capabilities are ushering in a new paradigm, Agentic Deep Research. These systems go beyond conventional information-search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. A test-time scaling law formalizes the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, Agentic Deep Research not only significantly outperforms existing approaches but is poised to become the dominant paradigm for future information seeking. Related resources are collected for the community at the linked repository.
Key Takeaways
- Large language models (LLMs) with reasoning and agentic capabilities are driving a new information-search paradigm: Agentic Deep Research.
- Agentic Deep Research surpasses traditional information-search techniques by combining autonomous reasoning, iterative retrieval, and information synthesis.
- A test-time scaling law formalizes the impact of computational depth on reasoning and search.
- Agentic Deep Research markedly outperforms existing approaches and is poised to become the dominant paradigm for future information seeking.
- Related resources, including industry products, research papers, benchmark datasets, and open-source implementations, have been collected for the community.
- The planning, exploration, and learning capabilities of agentic deep-research systems go beyond traditional search engines.
- The rise of open-source implementations has helped the spread and adoption of agentic deep research.
Click here to view the paper screenshots



Reasoning about Uncertainty: Do Reasoning Models Know When They Don’t Know?
Authors:Zhiting Mei, Christina Zhang, Tenny Yin, Justin Lidard, Ola Shorinwa, Anirudha Majumdar
Reasoning language models have set state-of-the-art (SOTA) records on many challenging benchmarks, enabled by multi-step reasoning induced using reinforcement learning. However, like previous language models, reasoning models are prone to generating confident, plausible responses that are incorrect (hallucinations). Knowing when and how much to trust these models is critical to the safe deployment of reasoning models in real-world applications. To this end, we explore uncertainty quantification of reasoning models in this work. Specifically, we ask three fundamental questions: First, are reasoning models well-calibrated? Second, does deeper reasoning improve model calibration? Finally, inspired by humans’ innate ability to double-check their thought processes to verify the validity of their answers and their confidence, we ask: can reasoning models improve their calibration by explicitly reasoning about their chain-of-thought traces? We introduce introspective uncertainty quantification (UQ) to explore this direction. In extensive evaluations on SOTA reasoning models across a broad range of benchmarks, we find that reasoning models: (i) are typically overconfident, with self-verbalized confidence estimates often greater than 85% particularly for incorrect responses, (ii) become even more overconfident with deeper reasoning, and (iii) can become better calibrated through introspection (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). Lastly, we conclude with important research directions to design necessary UQ benchmarks and improve the calibration of reasoning models.
Paper and project links
Summary
The paper studies uncertainty quantification for reasoning models whose multi-step reasoning is induced by reinforcement learning. It asks three questions: are reasoning models well calibrated; does deeper reasoning improve calibration; and can models improve calibration by explicitly reasoning about their own chain-of-thought traces (introspective uncertainty quantification)? Across a broad range of benchmarks, reasoning models are typically overconfident, with self-verbalized confidence often above 85% even for incorrect answers; deeper reasoning makes them more overconfident; and introspection improves calibration for some models (e.g., o3-Mini and DeepSeek R1) but not uniformly (e.g., Claude 3.7 Sonnet becomes more poorly calibrated). The paper closes with research directions for designing UQ benchmarks and improving the calibration of reasoning models.
Key Takeaways
- Reasoning models excel at multi-step reasoning but still produce confident yet incorrect answers (hallucinations).
- Calibration is critical for deploying reasoning models safely in real-world applications.
- Reasoning models are generally overconfident, especially on incorrect responses.
- Deeper reasoning makes the models even more overconfident.
- Introspection improves calibration for some models, but not uniformly.
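The calibration question can be made concrete with a standard expected-calibration-error computation over (self-verbalized confidence, correctness) pairs, sketched below; the equal-width binning is the usual choice, not necessarily the paper's.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin self-verbalized confidences and compare average confidence with accuracy."""
    confidences = np.asarray(confidences, dtype=float)   # e.g. 0.85 for "85% confident"
    correct = np.asarray(correct, dtype=float)           # 1.0 if the answer was right
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap                      # weight by bin frequency
    return ece
```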
Click here to view the paper screenshots



Privacy-Preserving LLM Interaction with Socratic Chain-of-Thought Reasoning and Homomorphically Encrypted Vector Databases
Authors:Yubeen Bae, Minchan Kim, Jaejin Lee, Sangbum Kim, Jaehyung Kim, Yejin Choi, Niloofar Mireshghallah
Large language models (LLMs) are increasingly used as personal agents, accessing sensitive user data such as calendars, emails, and medical records. Users currently face a trade-off: They can send private records, many of which are stored in remote databases, to powerful but untrusted LLM providers, increasing their exposure risk. Alternatively, they can run less powerful models locally on trusted devices. We bridge this gap. Our Socratic Chain-of-Thought Reasoning first sends a generic, non-private user query to a powerful, untrusted LLM, which generates a Chain-of-Thought (CoT) prompt and detailed sub-queries without accessing user data. Next, we embed these sub-queries and perform encrypted sub-second semantic search using our Homomorphically Encrypted Vector Database across one million entries of a single user’s private data. This represents a realistic scale of personal documents, emails, and records accumulated over years of digital activity. Finally, we feed the CoT prompt and the decrypted records to a local language model and generate the final response. On the LoCoMo long-context QA benchmark, our hybrid framework, combining GPT-4o with a local Llama-3.2-1B model, outperforms using GPT-4o alone by up to 7.1 percentage points. This demonstrates a first step toward systems where tasks are decomposed and split between untrusted strong LLMs and weak local ones, preserving user privacy.
Paper and project links
PDF 29 pages
Summary
Large language models increasingly act as personal agents with access to sensitive data such as calendars, emails, and medical records, forcing users to choose between sending private records to powerful but untrusted LLM providers or running weaker models locally. The proposed Socratic Chain-of-Thought Reasoning first sends a generic, non-private query to a powerful untrusted LLM, which produces a Chain-of-Thought prompt and detailed sub-queries without seeing user data. The sub-queries are embedded and matched via sub-second encrypted semantic search over a homomorphically encrypted vector database holding one million entries of a single user's private data. Finally, the CoT prompt and the decrypted records are fed to a local language model to generate the answer. On the LoCoMo long-context QA benchmark, the hybrid framework combining GPT-4o with a local Llama-3.2-1B model outperforms GPT-4o alone by up to 7.1 percentage points, a first step toward splitting tasks between untrusted strong LLMs and weak local ones while preserving user privacy.
Key Takeaways
- Large language models used as personal agents access sensitive data, creating a trade-off between capability and privacy.
- Socratic Chain-of-Thought Reasoning combines a powerful untrusted LLM with a trusted local model to bridge this gap.
- The untrusted model sees only a generic, non-private query and returns a CoT prompt and sub-queries; encrypted semantic search then retrieves the user's private records.
- The approach outperforms GPT-4o alone by up to 7.1 percentage points on the LoCoMo long-context QA benchmark.
- The hybrid framework combines the strengths of remote and local models, improving performance while protecting privacy.
- It demonstrates how tasks can be decomposed and split between models without sacrificing user privacy.
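The three-stage flow can be sketched as below; every helper is a placeholder stub standing in for the remote LLM, the homomorphically encrypted vector search, and the on-device model.

```python
# Skeleton of the three-stage flow described above (all helpers are hypothetical stubs).

def untrusted_llm(generic_query):            # remote model: returns (CoT prompt, sub-queries)
    raise NotImplementedError

def encrypted_search(sub_query, top_k=5):    # encrypted semantic search over private records
    raise NotImplementedError

def local_llm(cot_prompt, query, records):   # small on-device model produces the final answer
    raise NotImplementedError

def answer_privately(user_query: str) -> str:
    # 1) Only a generic, non-private version of the request reaches the remote model.
    cot_prompt, sub_queries = untrusted_llm("generic form of the task, no private details")
    # 2) Each sub-query is matched against the user's encrypted records, decrypted locally.
    records = [r for q in sub_queries for r in encrypted_search(q)]
    # 3) The local model combines the CoT prompt, the original query, and the records.
    return local_llm(cot_prompt, user_query, records)
```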
Click here to view the paper screenshots

BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
Authors:Ha-Thanh Nguyen, Chaoran Liu, Qianying Liu, Hideyuki Tachibana, Su Myat Noe, Yusuke Miyao, Koichi Takeda, Sadao Kurohashi
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
Paper and project links
PDF This version includes typo corrections, added logit lens analysis for open models, and an updated author list
Summary
BIS Reasoning 1.0 is the first large-scale Japanese dataset of syllogistic reasoning problems designed to evaluate belief-inconsistent reasoning in large language models. Unlike earlier datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, it contains logically valid yet belief-inconsistent syllogisms that expose reasoning biases in LLMs trained on human-aligned corpora. Benchmarking state-of-the-art models, including GPT models, Claude models, and leading Japanese LLMs, reveals substantial variance in performance, with GPT-4o reaching 79.54% accuracy. The analysis identifies critical weaknesses of current LLMs on logically valid but belief-conflicting inputs, with important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief.
Key Takeaways
- BIS Reasoning 1.0 is the first Japanese dataset dedicated to evaluating belief-inconsistent reasoning in large language models.
- The dataset contains logically valid but belief-inconsistent syllogisms.
- Existing large language models show reasoning biases on such problems.
- GPT-4o achieves 79.54% accuracy on BIS Reasoning 1.0.
- Large language models struggle when logic conflicts with belief.
- These findings matter for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature.
Click here to view the paper screenshots





AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models
Authors:Kai Li, Can Shen, Yile Liu, Jirui Han, Kelong Zheng, Xuechao Zou, Zhe Wang, Xingjian Du, Shun Zhang, Hanjun Luo, Yingbin Jin, Xinxin Xing, Ziyang Ma, Yue Liu, Xiaojun Jia, Yifan Zhang, Junfeng Fang, Kun Wang, Yibo Yan, Haoyang Li, Yiming Li, Xiaobin Zhuang, Yang Liu, Haibo Hu, Zhizheng Wu, Xiaolin Hu, Eng-Siong Chng, XiaoFeng Wang, Wenyuan Xu, Wei Dong, Xinfeng Li
The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust-the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.
Paper and project links
PDF Technical Report
Summary
AudioTrust is a multifaceted trustworthiness evaluation framework and benchmark designed specifically for audio large language models (ALLMs). It assesses six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication, across 18 experimental setups, using a carefully constructed dataset of over 4,420 audio/text samples drawn from real-world scenarios (daily conversations, emergency calls, voice-assistant interactions). Nine audio-specific evaluation metrics and a large-scale automated pipeline provide objective, scalable scoring of model outputs. The results reveal the trustworthiness boundaries and limitations of current open-source and closed-source ALLMs in high-risk audio scenarios, offering insights for the secure and trustworthy deployment of future audio models.
Key Takeaways
- AudioTrust is the first multifaceted trustworthiness evaluation framework for audio large language models (ALLMs).
- It covers six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication.
- AudioTrust uses 18 distinct experimental setups to assess these dimensions comprehensively.
- Its dataset contains over 4,420 audio/text samples from real-world scenarios such as daily conversations, emergency calls, and voice-assistant interactions.
- Nine audio-specific metrics and a large-scale automated pipeline provide objective, scalable scoring of model outputs.
- The results reveal the trustworthiness boundaries and limitations of current mainstream audio language models in high-risk scenarios.
Click here to view the paper screenshots






Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning
Authors:Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation and is diverse with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out of domain generalization, specifically Qwen2.5-VL-7B improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.
Paper and project links
PDF 63 pages, 23 figures, submitted to NeurIPS 2025
Summary
Vision-language Chain-of-Thought (CoT) data is scarce compared with text-only data, limiting the reasoning capability of vision-language models (VLMs), and high-quality vision-language reasoning data is expensive to annotate. Code2Logic exploits game code, which naturally contains logical structure and state-transition processes, as a source of multimodal reasoning data: large language models adapt the game code so that reasoning processes and results can be obtained automatically through code execution. The resulting GameQA dataset, used to train and evaluate VLMs, is cost-effective and scalable, offers controllable difficulty gradation, and is diverse, with 30 games and 158 tasks. Despite training only on game data, VLMs generalize out of domain; Qwen2.5-VL-7B improves by 2.33% across seven diverse vision-language benchmarks.
Key Takeaways
- The scarcity of vision-language Chain-of-Thought (CoT) data limits the reasoning ability of vision-language models (VLMs).
- Code2Logic uses the logical structures and state transitions in game code to synthesize multimodal reasoning data.
- Large language models (LLMs) adapt the game code so that reasoning processes and results are obtained automatically via code execution.
- The GameQA dataset is cost-effective, scalable, controllable in difficulty, and diverse in tasks.
- GameQA is used to train and evaluate VLMs.
- VLMs trained solely on game data show out-of-domain generalization.
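The data-synthesis idea is easy to illustrate on a toy game: the game's own transition code produces the verified answer, so no human annotation is needed. The 2048 row-sliding rule below is an illustrative stand-in for the 30 games actually used.

```python
import random

def step_2048_row(row):
    """Slide a 2048 row to the left and merge equal neighboring tiles once."""
    tiles = [t for t in row if t]
    merged, i = [], 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))

def make_qa_sample():
    row = [random.choice([0, 2, 4]) for _ in range(4)]
    answer = step_2048_row(row)   # ground truth comes from code execution, not annotation
    question = f"In 2048, what does the row {row} become after sliding left?"
    return {"question": question, "answer": answer}
```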
Click here to view the paper screenshots






CSC-SQL: Corrective Self-Consistency in Text-to-SQL via Reinforcement Learning
Authors:Lei Sheng, Shuai-Shuai Xu
Large language models (LLMs) have demonstrated strong capabilities in translating natural language questions about relational databases into SQL queries. In particular, test-time scaling techniques such as Self-Consistency and Self-Correction can enhance SQL generation accuracy by increasing computational effort during inference. However, these methods have notable limitations: Self-Consistency may select suboptimal outputs despite majority votes, while Self-Correction typically addresses only syntactic errors. To leverage the strengths of both approaches, we propose CSC-SQL, a novel method that integrates Self-Consistency and Self-Correction. CSC-SQL selects the two most frequently occurring outputs from parallel sampling and feeds them into a merge revision model for correction. Additionally, we employ the Group Relative Policy Optimization (GRPO) algorithm to fine-tune both the SQL generation and revision models via reinforcement learning, significantly enhancing output quality. Experimental results confirm the effectiveness and generalizability of CSC-SQL. On the BIRD private test set, our 7B model achieves 71.72% execution accuracy, while the 32B model achieves 73.67%. The code has been open sourced at https://github.com/CycloneBoy/csc_sql.
Paper and project links
PDF 25 pages, 5 figures
Summary
Large language models can translate natural-language questions about relational databases into SQL queries, and test-time scaling techniques such as Self-Consistency and Self-Correction improve accuracy at the cost of extra inference compute. However, Self-Consistency may still select suboptimal outputs despite majority voting, while Self-Correction mostly fixes syntactic errors. CSC-SQL combines the two: it selects the two most frequent outputs from parallel sampling and feeds them to a merge-revision model for correction, and it fine-tunes both the generation and revision models with the Group Relative Policy Optimization (GRPO) algorithm via reinforcement learning, significantly improving output quality. Experiments confirm the method's effectiveness and generalizability.
Key Takeaways
- Large language models are strong at translating natural language into SQL queries.
- Test-time scaling techniques such as Self-Consistency and Self-Correction improve SQL generation accuracy.
- Each has limitations: Self-Consistency may select suboptimal outputs, and Self-Correction mainly addresses syntactic errors.
- CSC-SQL combines Self-Consistency and Self-Correction by selecting the two most frequent outputs and revising them.
- Both the SQL generation and revision models are fine-tuned with Group Relative Policy Optimization (GRPO), significantly improving output quality.
- On the BIRD private test set, the 7B model reaches 71.72% and the 32B model 73.67% execution accuracy.
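The selection step can be sketched as below, with `generate_sql` and `merge_revise` as placeholders for the fine-tuned generation and revision models; sampling temperature, SQL normalization, and other details are omitted.

```python
from collections import Counter

def csc_select(question: str, schema: str, generate_sql, merge_revise,
               n_samples: int = 16) -> str:
    """Sample SQL candidates in parallel, keep the two most frequent ones, and let a
    merge-revision model reconcile them (corrective self-consistency)."""
    candidates = [generate_sql(question, schema) for _ in range(n_samples)]
    top_two = [sql for sql, _ in Counter(candidates).most_common(2)]
    if len(top_two) == 1:                     # unanimous vote: nothing to reconcile
        return top_two[0]
    return merge_revise(question, schema, top_two[0], top_two[1])
```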
Click here to view the paper screenshots




Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought
Authors:Hanlin Zhu, Shibo Hao, Zhiting Hu, Jiantao Jiao, Stuart Russell, Yuandong Tian
Large Language Models (LLMs) have demonstrated remarkable performance in many applications, including challenging reasoning problems via chain-of-thoughts (CoTs) techniques that generate "thinking tokens" before answering the questions. While existing theoretical works demonstrate that CoTs with discrete tokens boost the capability of LLMs, recent work on continuous CoTs lacks a theoretical understanding of why it outperforms discrete counterparts in various reasoning tasks such as directed graph reachability, a fundamental graph reasoning problem that includes many practical domain applications as special cases. In this paper, we prove that a two-layer transformer with D steps of continuous CoTs can solve the directed graph reachability problem, where D is the diameter of the graph, while the best known result of constant-depth transformers with discrete CoTs requires O(n^2) decoding steps where n is the number of vertices (D < n). In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously (i.e., parallel breadth-first search (BFS)), while discrete CoTs must choose a single path sampled from the superposition state, which leads to sequential search that requires many more steps and may be trapped into local solutions. We also performed extensive experiments to verify that our theoretical construction aligns well with the empirical solution obtained via training dynamics. Notably, encoding of multiple search frontiers as a superposition state automatically emerges in training continuous CoTs, without explicit supervision to guide the model to explore multiple paths simultaneously.
Paper and project links
PDF 26 pages, 7 figures
Summary
Large language models achieve strong performance on reasoning problems through chain-of-thought (CoT) techniques. Discrete-token CoTs have an established theoretical basis, but why continuous CoTs outperform them has lacked a theoretical explanation. This paper proves that a two-layer transformer with D steps of continuous CoTs can solve directed graph reachability, where D is the graph diameter, whereas the best known constant-depth transformer with discrete CoTs needs O(n^2) decoding steps for n vertices. Each continuous thought vector is a superposition state encoding multiple search frontiers simultaneously (parallel BFS), while discrete CoTs must sample a single path from that superposition, leading to sequential search with many more steps and possible local traps. Experiments confirm that the theoretical construction matches the solution found by training dynamics, and the superposition encoding of multiple frontiers emerges automatically during training without explicit supervision.
Key Takeaways
- Large language models (LLMs) perform strongly on reasoning problems via chain-of-thought (CoT) techniques.
- Discrete CoTs have theoretical backing, but the advantage of continuous CoTs had not been explained theoretically.
- A two-layer transformer with continuous CoTs solves directed graph reachability in D steps, more efficiently than discrete CoTs.
- Continuous thought vectors are superposition states that encode multiple search frontiers simultaneously.
- Discrete CoTs must choose a single path from the superposition, requiring many more steps and risking local solutions.
- Experiments show the theoretical construction matches the empirical solution obtained via training dynamics.
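The superposition argument has a direct numerical analogue: keep the entire BFS frontier as one vector and expand it with the adjacency matrix, so reachability is decided in at most D expansions. The NumPy sketch below illustrates the mechanism (it is not the paper's transformer construction).

```python
import numpy as np

def reachable(adj: np.ndarray, source: int, target: int) -> bool:
    """adj[i, j] = 1 means an edge i -> j; frontier is a superposed set of visited vertices."""
    n = adj.shape[0]
    frontier = np.zeros(n)
    frontier[source] = 1.0
    for _ in range(n):                                   # at most D < n expansions needed
        new_frontier = np.clip(frontier + adj.T @ frontier, 0, 1)   # one parallel BFS hop
        if new_frontier[target] > 0:
            return True
        if np.array_equal(new_frontier, frontier):       # frontier stopped growing
            return False
        frontier = new_frontier
    return frontier[target] > 0
```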
Click here to view the paper screenshots



