
R1_Reasoning


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings; they are only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-09-30

WebGen-Agent: Enhancing Interactive Website Generation with Multi-Level Feedback and Step-Level Reinforcement Learning

Authors:Zimu Lu, Houxing Ren, Yunqiao Yang, Ke Wang, Zhuofan Zong, Junting Pan, Mingjie Zhan, Hongsheng Li

Agent systems powered by large language models (LLMs) have demonstrated impressive performance on repository-level code-generation tasks. However, for tasks such as website codebase generation, which depend heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification. This approach fails to capture the actual quality of the generated code. In this paper, we propose WebGen-Agent, a novel website-generation agent that leverages comprehensive and multi-level visual feedback to iteratively generate and refine the website codebase. Detailed and expressive text descriptions and suggestions regarding the screenshots and GUI-agent testing of the websites are generated by a visual language model (VLM), together with scores that quantify their quality. The screenshot and GUI-agent scores are further integrated with a backtracking and select-best mechanism, enhancing the performance of the agent. Utilizing the accurate visual scores inherent in the WebGen-Agent workflow, we further introduce Step-GRPO with Screenshot and GUI-agent Feedback to improve the ability of LLMs to act as the reasoning engine of WebGen-Agent. By using the screenshot and GUI-agent scores at each step as the reward in Step-GRPO, we provide a dense and reliable process supervision signal, which effectively improves the model’s website-generation ability. On the WebGen-Bench dataset, WebGen-Agent increases the accuracy of Claude-3.5-Sonnet from 26.4% to 51.9% and its appearance score from 3.0 to 3.9, outperforming the previous state-of-the-art agent system. Additionally, our Step-GRPO training approach increases the accuracy of Qwen2.5-Coder-7B-Instruct from 38.9% to 45.4% and raises the appearance score from 3.4 to 3.7.
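
As a rough illustration of the step-level reward idea, the sketch below computes group-relative advantages from per-step screenshot and GUI-agent scores, in the spirit of GRPO. The equal weighting of the two scores and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def step_grpo_advantages(screenshot_scores, gui_scores, w_screenshot=0.5):
    """Group-relative advantages from per-step visual scores.

    Both inputs have shape (G, T): a group of G rollouts scored by a VLM
    at each of T steps. The 50/50 weighting is an assumption.
    """
    rewards = w_screenshot * np.asarray(screenshot_scores, dtype=float) \
        + (1.0 - w_screenshot) * np.asarray(gui_scores, dtype=float)
    # GRPO-style normalization: compare each rollout against its group,
    # here independently at every step to get a dense process signal.
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True) + 1e-8
    return (rewards - mean) / std

# Toy group of 3 rollouts over 4 steps, scores on a 1-5 scale.
adv = step_grpo_advantages(
    screenshot_scores=[[3, 4, 4, 5], [2, 2, 3, 3], [4, 4, 5, 5]],
    gui_scores=[[3, 3, 4, 4], [2, 3, 3, 3], [4, 5, 5, 5]],
)
print(adv.round(2))  # positive entries mark steps that beat the group average
```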

Paper & Project Links

PDF

Summary

Agent systems powered by large language models (LLMs) perform strongly on repository-level code generation. However, for website code generation, which depends heavily on visual effects and user-interaction feedback, current code agents rely only on simple code execution for feedback and verification, failing to capture the actual quality of the generated code. This paper proposes WebGen-Agent, a novel website-generation agent that uses comprehensive, multi-level visual feedback to iteratively generate and refine the website codebase. A visual language model (VLM) produces detailed, expressive text descriptions and suggestions about website screenshots and GUI-agent testing, together with scores quantifying their quality. Integrating the screenshot and GUI-agent scores with a backtracking and select-best mechanism further strengthens the agent. The paper also introduces Step-GRPO with Screenshot and GUI-agent Feedback, which exploits the accurate visual scores inherent in the WebGen-Agent workflow: using the scores at each step as rewards provides a dense, reliable process-supervision signal that effectively improves the model's website-generation ability. On the WebGen-Bench dataset, WebGen-Agent raises the accuracy and appearance score of Claude-3.5-Sonnet, surpassing the previous state-of-the-art agent system, and the Step-GRPO training approach likewise improves the accuracy and appearance score of Qwen2.5-Coder-7B-Instruct.

Key Takeaways

  1. Agent systems excel at repository-level code generation but fall short on website code generation, which depends heavily on visual effects and user-interaction feedback.
  2. WebGen-Agent iteratively generates and refines the website codebase using comprehensive, multi-level visual feedback, including detailed text descriptions and suggestions produced by a visual language model.
  3. WebGen-Agent integrates screenshot and GUI-agent scores and introduces a backtracking and select-best mechanism, improving performance.
  4. Step-GRPO exploits the accurate visual scores in the WebGen-Agent workflow, using the screenshot and GUI-agent scores at each step as rewards to provide a dense, reliable process-supervision signal.
  5. On the WebGen-Bench dataset, WebGen-Agent markedly improves agent accuracy, with especially large gains for Claude-3.5-Sonnet.
  6. The Step-GRPO training approach also improves the model's accuracy and appearance score.

Cool Papers

Click here to view paper screenshots

Authors:Wenkai Guo, Guanxing Lu, Haoyuan Deng, Zhenyu Wu, Yansong Tang, Ziwei Wang

Vision-Language-Action models (VLAs) achieve strong performance in general robotic manipulation tasks by scaling imitation learning. However, existing VLAs are limited to short-sighted next-action prediction, which struggles with long-horizon trajectory tasks due to incremental deviations. To address this problem, we propose a plug-in framework named VLA-Reasoner that effectively empowers off-the-shelf VLAs with the capability of foreseeing future states via test-time scaling. Specifically, VLA-Reasoner samples and rolls out possible action trajectories where involved actions are rationales to generate future states via a world model, which enables VLA-Reasoner to foresee and reason about potential outcomes and search for the optimal actions. We further leverage Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces, where stepwise VLA predictions seed the root. Meanwhile, we introduce a confidence sampling mechanism based on Kernel Density Estimation (KDE), to enable efficient exploration in MCTS without redundant VLA queries. We evaluate intermediate states in MCTS via an offline reward shaping strategy, to score predicted futures and correct deviations with long-term feedback. We conducted extensive experiments in both simulators and the real world, demonstrating that our proposed VLA-Reasoner achieves significant improvements over the state-of-the-art VLAs. Our method highlights a potential pathway toward scalable test-time computation of robotic manipulation.
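
To make the KDE-based confidence sampling concrete, here is a minimal sketch (assuming scipy is available): candidate actions already proposed by the VLA are fitted with a Gaussian KDE, and additional exploration candidates are drawn from the KDE instead of re-querying the VLA. The action dimensionality and acceptance rule are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Pretend the VLA proposed 32 candidate 7-DoF actions at the current state.
vla_actions = rng.normal(loc=0.2, scale=0.1, size=(32, 7))

# Fit a KDE over the proposals; gaussian_kde expects (dims, samples).
kde = gaussian_kde(vla_actions.T)

# Draw extra exploration candidates from the KDE -- no further VLA queries.
extra = kde.resample(16).T                    # shape (16, 7)

# Keep only candidates the KDE itself considers plausible (confidence filter).
density = kde(extra.T)
keep = extra[density >= np.quantile(density, 0.25)]
print(f"{len(keep)} of 16 KDE-sampled actions pass the confidence threshold")
```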

Paper & Project Links

PDF 9 pages

Summary
Vision-Language-Action models (VLAs) achieve strong performance on general robotic manipulation tasks by scaling imitation learning, but existing VLAs are limited to short-sighted next-action prediction and struggle with long-horizon trajectory tasks. This paper proposes VLA-Reasoner, a plug-in framework that endows off-the-shelf VLAs with the ability to foresee future states via test-time scaling. The framework samples and rolls out possible action trajectories, using a world model to generate future states from the involved actions. Monte Carlo Tree Search (MCTS) improves search efficiency in large action spaces, with stepwise VLA predictions seeding the root, and a confidence sampling mechanism based on Kernel Density Estimation (KDE) enables efficient exploration in MCTS without redundant VLA queries. An offline reward shaping strategy evaluates intermediate MCTS states, scoring predicted futures and correcting deviations with long-term feedback. Extensive experiments in simulators and the real world show that VLA-Reasoner achieves significant improvements over state-of-the-art VLAs, pointing to a viable path toward scalable test-time computation for robotic manipulation.

Key Takeaways

  1. VLA models perform strongly on robotic manipulation tasks but are limited to short-sighted next-action prediction.
  2. The VLA-Reasoner plug-in framework gives VLAs the ability to foresee future states, generating them with a world model and rolling out possible action trajectories to overcome short-sightedness.
  3. VLA-Reasoner uses Monte Carlo Tree Search (MCTS) to improve search efficiency in large action spaces.
  4. A confidence sampling mechanism based on Kernel Density Estimation (KDE) improves exploration efficiency without redundant VLA queries.
  5. An offline reward shaping strategy evaluates intermediate states, scoring predicted futures and correcting deviations with long-term feedback.
  6. In both simulated and real-world experiments, VLA-Reasoner achieves significant improvements over state-of-the-art VLAs.

Cool Papers

Click here to view paper screenshots

Variational Reasoning for Language Models

Authors:Xiangxin Zhou, Zichen Liu, Haonan Wang, Chao Du, Min Lin, Chongxuan Li, Liang Wang, Tianyu Pang

We introduce a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), we extend it to a multi-trace objective for tighter bounds and propose a forward-KL formulation that stabilizes the training of the variational posterior. We further show that rejection sampling finetuning and binary-reward RL, including GRPO, can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy naturally arises from the derivation and reveals a previously unnoticed bias toward easier questions. We empirically validate our method on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, our work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models. Our code is available at https://github.com/sail-sg/variational-reasoning.
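
To fix notation, here is a minimal rendering of the bounds the abstract references, with x the question, y the answer, and z a latent thinking trace: the single-trace ELBO and its K-trace, IWAE-style tightening. The paper's exact conditioning and parameterization may differ.

```latex
% Single-trace ELBO with variational posterior q_phi over thinking traces z
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(y \mid x, z)\bigr]
  - \mathrm{KL}\bigl(q_\phi(z \mid x, y)\,\|\,p_\theta(z \mid x)\bigr)

% Multi-trace objective: averaging K importance weights tightens the bound
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{z_1,\dots,z_K \sim q_\phi(\cdot \mid x, y)}
  \left[\log \frac{1}{K} \sum_{k=1}^{K}
    \frac{p_\theta(z_k \mid x)\, p_\theta(y \mid x, z_k)}{q_\phi(z_k \mid x, y)}\right]
```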

Paper & Project Links

PDF

Summary
The paper introduces a variational reasoning framework for language models that treats thinking traces as latent variables and optimizes them through variational inference. Starting from the evidence lower bound (ELBO), it extends to a multi-trace objective for tighter bounds and proposes a forward-KL formulation that stabilizes training of the variational posterior. It further shows that rejection sampling finetuning and binary-reward RL can be interpreted as local forward-KL objectives, where an implicit weighting by model accuracy arises naturally from the derivation and reveals a previously unnoticed bias toward easier questions. The method is validated empirically on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks. Overall, the work provides a principled probabilistic perspective that unifies variational inference with RL-style methods and yields stable objectives for improving the reasoning ability of language models.

Key Takeaways

  1. Introduces a variational reasoning framework that treats thinking traces as latent variables to be optimized.
  2. Extends the evidence lower bound (ELBO) to a multi-trace objective for tighter bounds.
  3. Proposes a forward-KL formulation that stabilizes training of the variational posterior.
  4. Shows that rejection sampling finetuning and binary-reward RL methods can be interpreted as local forward-KL objectives.
  5. An implicit weighting by model accuracy arises naturally from the derivation, revealing a bias toward easier questions.
  6. Empirically validated on the Qwen 2.5 and Qwen 3 model families across a wide range of reasoning tasks.

Cool Papers

Click here to view paper screenshots

UML-CoT: Structured Reasoning and Planning with Unified Modeling Language for Robotic Room Cleaning

Authors:Hongyu Chen, Guangrun Wang

Chain-of-Thought (CoT) prompting improves reasoning in large language models (LLMs), but its reliance on unstructured text limits interpretability and executability in embodied tasks. Prior work has explored structured CoTs using scene or logic graphs, yet these remain fundamentally limited: they model only low-order relations, lack constructs like inheritance or behavioral abstraction, and provide no standardized semantics for sequential or conditional planning. We propose UML-CoT, a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic CoTs and executable action plans. UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. Our three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. We evaluate UML-CoT on MRoom-30k, a new benchmark of cluttered room-cleaning scenarios. UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.

Paper & Project Links

PDF

Summary

UML-CoT is a structured reasoning and planning framework that leverages Unified Modeling Language (UML) to generate symbolic chains of thought (CoTs) and executable action plans, improving the reasoning of large language models (LLMs). UML class diagrams capture compositional object semantics, while activity diagrams model procedural control flow. A three-stage training pipeline combines supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning from answer-only data. On MRoom-30k, a new benchmark of cluttered room-cleaning scenarios, UML-CoT outperforms unstructured CoTs in interpretability, planning coherence, and execution success, highlighting UML as a more expressive and actionable structured reasoning formalism.

Key Takeaways

  1. UML-CoT combines Unified Modeling Language (UML) with chain-of-thought (CoT) reasoning to improve the reasoning ability of large language models (LLMs).
  2. Through UML class diagrams and activity diagrams, UML-CoT captures compositional object semantics and models procedural control flow.
  3. The framework uses a three-stage training pipeline combining supervised fine-tuning with Group Relative Policy Optimization (GRPO), including reward learning.
  4. On MRoom-30k, a new benchmark of cluttered room-cleaning scenarios, UML-CoT achieves higher interpretability, planning coherence, and execution success.
  5. Compared with traditional unstructured CoTs, UML-CoT provides more standardized, executable semantics and action plans.
  6. As a structured reasoning formalism, UML is more expressive and actionable.

Cool Papers

Click here to view paper screenshots

Dynamic Experts Search: Enhancing Reasoning in Mixture-of-Experts LLMs at Test Time

Authors:Yixuan Han, Fan Ma, Ruijie Quan, Yi Yang

Test-Time Scaling (TTS) enhances the reasoning ability of large language models (LLMs) by allocating additional computation during inference. However, existing approaches primarily rely on output-level sampling while overlooking the role of model architecture. In mainstream Mixture-of-Experts (MoE) LLMs, we observe that varying the number of activated experts yields complementary solution sets with stable accuracy, revealing a new and underexplored source of diversity. Motivated by this observation, we propose Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES integrates two key components: (1) Dynamic MoE, which enables direct control of expert counts during inference to generate diverse reasoning trajectories without additional cost; and (2) Expert Configuration Inheritance, which preserves consistent expert counts within a reasoning path while varying them across runs, thereby balancing stability and diversity throughout the search. Extensive experiments across MoE architectures, verifiers and reasoning benchmarks (i.e., math, code and knowledge) demonstrate that DES reliably outperforms TTS baselines, enhancing accuracy and stability without additional cost. These results highlight DES as a practical and scalable form of architecture-aware TTS, illustrating how structural flexibility in modern LLMs can advance reasoning.
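
A minimal sketch of the underlying knob: top-k expert routing in which k is an inference-time argument rather than a fixed architectural constant. This toy numpy router is illustrative only; real MoE layers route per token inside a Transformer block.

```python
import numpy as np

rng = np.random.default_rng(0)

def moe_forward(x, gate_w, expert_ws, k):
    """Toy MoE layer: route input x to the top-k experts by gate score."""
    logits = x @ gate_w                       # (num_experts,)
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                  # renormalize over active experts
    outs = np.stack([x @ expert_ws[i] for i in topk])
    return weights @ outs                     # weighted mix of expert outputs

d, num_experts = 8, 16
x = rng.normal(size=d)
gate_w = rng.normal(size=(d, num_experts))
expert_ws = rng.normal(size=(num_experts, d, d))

# Varying k yields complementary outputs from the same frozen weights;
# DES treats this count as a searchable dimension across reasoning runs.
for k in (1, 2, 4, 8):
    print(k, np.round(moe_forward(x, gate_w, expert_ws, k)[:3], 3))
```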

Paper & Project Links

PDF

Summary

Test-Time Scaling (TTS) improves the reasoning ability of large language models (LLMs) by allocating additional computation during inference, but existing methods rely mainly on output-level sampling and overlook the role of model architecture. This paper proposes Dynamic Experts Search (DES), a TTS strategy that elevates expert activation into a controllable dimension of the search space. DES has two key components: Dynamic MoE, which directly controls the number of activated experts during inference to generate diverse reasoning trajectories, and Expert Configuration Inheritance, which keeps expert counts consistent within a reasoning path while varying them across runs, balancing stability and diversity. Experiments show that DES reliably outperforms TTS baselines, improving accuracy and stability at no additional cost, highlighting DES as a practical, scalable form of architecture-aware TTS and showing how the structural flexibility of modern LLMs can advance reasoning.

Key Takeaways

  1. Test-Time Scaling (TTS) enhances LLM reasoning by allocating additional computation during inference.
  2. Existing methods focus on output-level sampling and overlook the role of model architecture.
  3. Dynamic Experts Search (DES) elevates expert activation into a controllable dimension of the search space.
  4. DES consists of two key components: Dynamic MoE and Expert Configuration Inheritance.
  5. Dynamic MoE directly controls the number of activated experts during inference, generating diverse reasoning trajectories at no extra cost.
  6. Expert Configuration Inheritance keeps expert counts consistent within a reasoning path, balancing stability and diversity.

Cool Papers

Click here to view paper screenshots

StepORLM: A Self-Evolving Framework With Generative Process Supervision For Operations Research Language Models

Authors:Chenyu Zhou, Tianyi Xu, Jianghao Lin, Dongdong Ge

Large Language Models (LLMs) have shown promising capabilities for solving Operations Research (OR) problems. While reinforcement learning serves as a powerful paradigm for LLM training on OR problems, existing works generally face two key limitations. First, outcome reward suffers from the credit assignment problem, where correct final answers can reinforce flawed reasoning. Second, conventional discriminative process supervision is myopic, failing to evaluate the interdependent steps of OR modeling holistically. To this end, we introduce StepORLM, a novel self-evolving framework with generative process supervision. At its core, StepORLM features a co-evolutionary loop where a policy model and a generative process reward model (GenPRM) iteratively improve on each other. This loop is driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver, and nuanced, holistic process evaluation from the GenPRM. The combined signal is used to align the policy via Weighted Direct Preference Optimization (W-DPO) and simultaneously refine the GenPRM. Our resulting 8B-parameter StepORLM establishes a new state-of-the-art across six benchmarks, significantly outperforming vastly larger generalist models, agentic methods, and specialized baselines. Moreover, the co-evolved GenPRM is able to act as a powerful and universally applicable process verifier, substantially boosting the inference scaling performance of both our own model and other existing LLMs.
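
The alignment step can be pictured as a weighted variant of the standard DPO loss. The sketch below shows one plausible form in which each preference pair carries a weight, e.g. blending solver verification and GenPRM scores; the weighting rule and data layout are assumptions, not the paper's exact objective.

```python
import math

def w_dpo_loss(pairs, beta=0.1):
    """Weighted DPO over preference pairs.

    Each pair: (logp_w, logp_l, ref_logp_w, ref_logp_l, weight), where
    logp_* are sequence log-probs under the policy, ref_logp_* under the
    frozen reference, and weight blends solver/GenPRM feedback (assumed).
    """
    total, wsum = 0.0, 0.0
    for logp_w, logp_l, ref_w, ref_l, weight in pairs:
        margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
        total += weight * -math.log(1.0 / (1.0 + math.exp(-margin)))
        wsum += weight
    return total / wsum

# Toy example: the second pair gets a higher weight (e.g. solver-verified).
pairs = [
    (-12.0, -13.5, -12.5, -13.0, 0.5),
    (-10.0, -14.0, -11.0, -13.0, 1.0),
]
print(round(w_dpo_loss(pairs), 4))
```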

Paper & Project Links

PDF

Summary

Large language models (LLMs) show promise for solving Operations Research (OR) problems, but reinforcement learning for LLM training on OR problems faces two key limitations: outcome rewards suffer from the credit assignment problem, and conventional discriminative process supervision is myopic. To address this, the paper proposes StepORLM, a self-evolving framework with generative process supervision. Its core is a co-evolutionary loop in which a policy model and a generative process reward model (GenPRM) iteratively improve each other, driven by a dual-feedback mechanism: definitive, outcome-based verification from an external solver and nuanced, holistic process evaluation from the GenPRM. The combined signal aligns the policy via Weighted Direct Preference Optimization (W-DPO) while simultaneously refining the GenPRM. The resulting 8B-parameter StepORLM sets a new state of the art across six benchmarks, significantly outperforming much larger generalist models, agentic methods, and specialized baselines, and the co-evolved GenPRM acts as a powerful, universally applicable process verifier that substantially boosts the inference-scaling performance of this and other LLMs.

Key Takeaways

  1. Large language models (LLMs) show strong potential for solving Operations Research (OR) problems.
  2. Reinforcement learning is a powerful paradigm for training LLMs on OR problems, but it faces two key limitations: the credit assignment problem of outcome rewards and the myopia of conventional discriminative process supervision.
  3. StepORLM addresses these issues through generative process supervision inside a self-evolving framework.
  4. Its core is a co-evolutionary loop in which a policy model and a generative process reward model iteratively improve each other.
  5. A dual-feedback mechanism combines outcome verification from an external solver with process evaluation from the GenPRM.
  6. The resulting 8B-parameter StepORLM achieves state-of-the-art results across six benchmarks, outperforming much larger generalist models, agentic methods, and specialized baselines.

Cool Papers

Click here to view paper screenshots

REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model

Authors:Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qinghua Zhao, Linqi Song, Lijie Wen

Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model’s layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
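
A minimal numpy sketch of the deviation signal at a single layer: the mean distance from an erroneous sample's hidden state to its k nearest neighbors among correct-sample hidden states, compared with the same statistic computed within the correct set as a baseline. Shapes, k, and the toy data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_deviation(query, correct, k=5):
    """Mean distance from `query` to its k nearest points in `correct`."""
    d = np.linalg.norm(correct - query, axis=1)
    return np.sort(d)[:k].mean()

# Hidden states at one layer: 200 correct samples, hidden size 64.
correct = rng.normal(size=(200, 64))
erroneous = rng.normal(loc=0.8, size=(20, 64))  # shifted, i.e. off-manifold

# Baseline: leave-one-out deviation within the correct set itself.
baseline = np.mean([
    knn_deviation(correct[i], np.delete(correct, i, axis=0)) for i in range(50)
])
errs = np.mean([knn_deviation(e, correct) for e in erroneous])
print(f"correct baseline {baseline:.2f} vs erroneous {errs:.2f}")
# Layers where the erroneous statistic first clearly exceeds the baseline
# would be flagged as divergence points.
```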

Paper & Project Links

PDF

Summary

The paper offers a measurable geometric perspective on how large language models (LLMs) perform complex reasoning and why they fail. It introduces the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations of all correctly reasoned generations, which can be viewed as the embodiment of the effective thinking paths the model has learned for a task. Building on this, the REMA framework explains the origins of failures by quantitatively comparing the spatial relationships of internal representations for erroneous and correct reasoning samples. REMA first quantifies the geometric deviation of each erroneous representation via its k-nearest-neighbor distance to the approximated manifold formed by correct representations, yielding a unified failure signal, and then localizes the divergence points where deviations first become significant by tracking this metric across layers against a baseline of internal fluctuations from correct representations. Experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations, validating REMA's effectiveness in analyzing the origins of reasoning failures. The work connects abstract reasoning failures to measurable geometric deviations in representations, opening new avenues for understanding and diagnosing the internal computation of black-box models.

Key Takeaways

  1. Introduces the Reasoning Manifold, a low-dimensional geometric structure formed by the LLM's internal representations of correct generations, embodying the effective thinking paths the model has learned for a task.
  2. Builds REMA, a framework that quantitatively explains the origins of failures by comparing the spatial relationships of internal representations for erroneous and correct reasoning samples.
  3. Computes the distance from each erroneous representation to the manifold of correct representations, providing a unified failure signal.
  4. Localizes the divergence points where deviations first become significant by tracking the deviation metric across layers and comparing it against a baseline.
  5. Experiments demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations.
  6. REMA's effectiveness in analyzing the origins of reasoning failures is validated.

Cool Papers

Click here to view paper screenshots

Estimating the Empowerment of Language Model Agents

Authors:Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

As language model (LM) agents become more capable and gain broader access to real-world tools, there is a growing need for scalable evaluation frameworks of agentic capability. However, conventional benchmark-centric evaluations are costly to design and require human designers to come up with valid tasks that translate into insights about general model capabilities. In this work, we propose information-theoretic evaluation based on empowerment, the mutual information between an agent’s actions and future states, as an open-ended method for evaluating LM agents. We introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We validate EELMA on both language games and scaled-up realistic web-browsing scenarios. We find that empowerment strongly correlates with average task performance, characterize the impact of environmental complexity and agentic factors such as chain-of-thought, model scale, and memory length on estimated empowerment, and that high empowerment states and actions are often pivotal moments for general capabilities. Together, these results demonstrate empowerment as an appealing general-purpose metric for evaluating and monitoring LM agents in complex, open-ended settings.
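
As a toy illustration of the quantity being estimated, the sketch below computes a plug-in estimate of the mutual information I(A; S') between discrete actions and next states from sampled transitions. EELMA approximates this for open-ended multi-turn text interactions, a much harder setting; everything here is illustrative.

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """Plug-in estimate of I(A; S') from (action, next_state) samples."""
    n = len(pairs)
    p_as = Counter(pairs)
    p_a = Counter(a for a, _ in pairs)
    p_s = Counter(s for _, s in pairs)
    return sum(
        (c / n) * log2((c / n) / ((p_a[a] / n) * (p_s[s] / n)))
        for (a, s), c in p_as.items()
    )

# An "empowered" agent: distinct actions reliably reach distinct states.
high = [("left", "room_a"), ("right", "room_b"), ("up", "room_c")] * 30
# A low-empowerment agent: actions barely influence where it ends up.
low = [(a, "room_a") for a in ("left", "right", "up")] * 30
print(f"high: {mutual_information(high):.2f} bits, "
      f"low: {mutual_information(low):.2f} bits")
```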

Paper & Project Links

PDF 10 pages, 8 figures. Submitted to ICLR 2026

Summary

The paper proposes an information-theoretic evaluation of language model (LM) agents based on empowerment, the mutual information between an agent's actions and future states. It introduces EELMA, an algorithm for approximating effective empowerment from multi-turn text interactions, validated on both language games and scaled-up realistic web-browsing scenarios. Empowerment correlates strongly with average task performance; the paper characterizes how environmental complexity and agentic factors such as chain-of-thought, model scale, and memory length affect estimated empowerment, and finds that high-empowerment states and actions are often pivotal moments for general capabilities.

Key Takeaways

  1. Proposes an information-theoretic evaluation framework based on empowerment for assessing the capabilities of language model agents.
  2. Introduces EELMA, an algorithm that approximates an agent's effective empowerment from multi-turn text interactions.
  3. Validates EELMA on language games and realistic web-browsing scenarios.
  4. Finds a strong correlation between empowerment and average task performance.
  5. Characterizes how environmental complexity and agentic factors affect estimated empowerment.
  6. High-empowerment states and actions are often pivotal for an agent's general capabilities.

Cool Papers

Click here to view paper screenshots

Authors:Antreas Ioannou, Andreas Shiamishis, Nora Hollenstein, Nezihe Merve Gürel

In an era dominated by Large Language Models (LLMs), understanding their capabilities and limitations, especially in high-stakes fields like law, is crucial. While LLMs such as Meta’s LLaMA, OpenAI’s ChatGPT, Google’s Gemini, DeepSeek, and other emerging models are increasingly integrated into legal workflows, their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks, and assesses their adversarial robustness in legal tasks through character and word-level perturbations. We use an LLM-as-a-Judge approach for human-aligned evaluation. We moreover present an open-source, modular evaluation pipeline designed to support multilingual, task-diverse benchmarking of any combination of LLMs and datasets, with a particular focus on legal tasks, including classification, summarization, open questions, and general reasoning. Our findings confirm that legal tasks pose significant challenges for LLMs with accuracies often below 50% on legal reasoning benchmarks such as LEXam, compared to over 70% on general-purpose tasks like XNLI. In addition, while English generally yields more stable results, it does not always lead to higher accuracy. Prompt sensitivity and adversarial vulnerability are also shown to persist across languages. Finally, a correlation is found between the performance of a language and its syntactic similarity to English. We also observe that LLaMA is weaker than Gemini, with the latter showing an average advantage of about 24 percentage points across the same task. Despite improvements in newer LLMs, challenges remain in deploying them reliably for critical, multilingual legal applications.
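
The robustness probe boils down to small text perturbations. Below is a minimal sketch of character-level corruptions of the kind such evaluations typically apply (adjacent swaps and deletions); the specific operations and rates are assumptions, not the paper's exact protocol.

```python
import random

def perturb_chars(text, rate=0.05, seed=0):
    """Randomly swap adjacent characters or delete characters at `rate`."""
    rng = random.Random(seed)
    chars = list(text)
    out, i = [], 0
    while i < len(chars):
        r = rng.random()
        if r < rate / 2 and i + 1 < len(chars):      # swap with next char
            out.extend([chars[i + 1], chars[i]])
            i += 2
        elif r < rate:                               # delete this char
            i += 1
        else:
            out.append(chars[i])
            i += 1
    return "".join(out)

clause = "The lessee shall return the premises in good condition."
print(perturb_chars(clause, rate=0.15))
# A robust model's answer should not flip under such minor corruptions.
```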

Paper & Project Links

PDF 39 pages, 36 figures. Code and evaluation pipeline available at https://github.com/RobustML-Lab/Legal-Multilingual-Evaluation-of-LLMs

Summary

Large language models (LLMs) are increasingly integrated into legal workflows, yet their performance in multilingual, jurisdictionally diverse, and adversarial contexts remains insufficiently explored. This work evaluates LLaMA and Gemini on multilingual legal and non-legal benchmarks and assesses their adversarial robustness in legal tasks through character- and word-level perturbations, using an LLM-as-a-Judge approach for human-aligned evaluation. The findings confirm that legal tasks pose significant challenges for LLMs, with accuracies often below 50%, and that prompt sensitivity and adversarial vulnerability persist across languages.

Key Takeaways

  1. LLMs are increasingly important in the legal domain, but their performance on legal tasks needs deeper study.
  2. LLM performance in multilingual settings, especially adversarial ones, still faces many challenges.
  3. LLaMA and Gemini differ on multilingual legal benchmarks, with Gemini clearly stronger.
  4. The complexity of legal tasks keeps LLM accuracy often below 50%, especially on legal reasoning.
  5. In some cases, English does not yield the highest accuracy.
  6. Prompt sensitivity and adversarial vulnerability of LLMs persist across languages.

Cool Papers

Click here to view paper screenshots

MDAR: A Multi-scene Dynamic Audio Reasoning Benchmark

Authors:Hui Li, Changhao Jiang, Hongyu Wang, Ming Zhang, Jiajun Sun, Zhixiong Yang, Yifei Cao, Shihan Dou, Xiaoran Fan, Baoyu Fan, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents to interact effectively in real-world scenarios. Existing benchmarks mainly focus on static or single-scene settings and do not fully capture scenarios where multiple speakers, unfolding events, and heterogeneous audio sources interact. To address these challenges, we introduce MDAR, a benchmark for evaluating models on complex, multi-scene, and dynamically evolving audio reasoning tasks. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips, covering five categories of complex reasoning and spanning three question types. We benchmark 26 state-of-the-art audio language models on MDAR and observe that they exhibit limitations in complex reasoning tasks. On single-choice questions, Qwen2.5-Omni (open-source) achieves 76.67% accuracy, whereas GPT-4o Audio (closed-source) reaches 68.47%; however, GPT-4o Audio substantially outperforms Qwen2.5-Omni on the more challenging multiple-choice and open-ended tasks. Across all three question types, no model achieves 80% performance. These findings underscore the unique challenges posed by MDAR and its value as a benchmark for advancing audio reasoning research.Code and benchmark can be found at https://github.com/luckyerr/MDAR.

Paper & Project Links

PDF 25 pages, 7 figures

Summary
The ability to reason from audio, including speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents in real-world scenarios, yet existing benchmarks focus mainly on static or single-scene settings and fail to capture complex, multi-scene, dynamically evolving audio reasoning. The paper therefore introduces MDAR, a benchmark covering five categories of complex reasoning and three question types, and shows through evaluations of state-of-the-art audio language models that they exhibit clear limitations on complex reasoning tasks.

Key Takeaways

  1. Audio reasoning in real-world scenarios, spanning speech, paralinguistic cues, environmental sounds, and music, is essential for AI agents.
  2. Existing benchmarks fail to capture complex, multi-scene, dynamically evolving audio reasoning tasks.
  3. The MDAR benchmark evaluates models on complex, multi-scene, dynamically evolving audio reasoning tasks.
  4. MDAR comprises 3,000 carefully curated question-answer pairs linked to diverse audio clips.
  5. On MDAR, existing audio language models show limitations on complex reasoning tasks.
  6. Qwen2.5-Omni performs better on single-choice questions, while GPT-4o Audio is stronger on the more challenging multiple-choice and open-ended tasks.

Cool Papers

Click here to view paper screenshots

Authors:Zhengyu Chen, Zhaoyi Meng, Wenxiang Zhao, Wansen Wang, Haoyang Zhao, Jiahao Zhan, Jie Cui, Hong Zhong

Automatically reproducing Android app crashes from textual bug reports is challenging, particularly when the reports are incomplete and the modern UI exhibits high combinatorial complexity. Existing approaches based on reinforcement learning or large language models (LLMs) exhibit limitations in such scenarios. They struggle to infer unobserved steps and reconstruct the underlying user action sequences to navigate the vast UI interaction space, primarily due to limited goal-directed reasoning and planning. We present TreeMind, a novel technique that integrates LLMs with a customized Monte Carlo Tree Search (MCTS) algorithm to achieve strategic UI exploration in bug reproduction. To the best of our knowledge, this is the first work to combine external decision-making with LLM semantic reasoning for reliable bug reproduction. We formulate the reproduction task as a target-driven search problem, leveraging MCTS as the core planning mechanism to iteratively refine action sequences. To enhance MCTS with semantic reasoning, we introduce two LLM-guided agents with distinct roles: Expander generates top-k promising actions based on the current UI state and exploration history, while Simulator estimates the likelihood that each action leads toward successful reproduction. By incorporating multi-modal UI inputs and advanced prompting techniques, TreeMind conducts feedback-aware navigation that identifies missing but essential user actions and incrementally reconstructs the reproduction paths. We evaluate TreeMind on a dataset of 93 real-world Android bug reports from three widely-used benchmarks. Experimental results show that it significantly outperforms four state-of-the-art baselines in reproduction success rate. A real-world case study indicates that integrating LLM reasoning with MCTS-based planning is a compelling direction for automated bug reproduction.
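
A compressed sketch of the planning loop, with the two LLM-guided roles stubbed out as plain functions: Expander proposes top-k actions, Simulator scores leaves, and UCT balances exploration and exploitation during selection. All names, the toy state encoding, and the reward stub are placeholders; the real system operates on live Android UI states.

```python
import math, random

def expander(state, k=3):
    """Stub for the LLM Expander: propose k promising UI actions."""
    return [f"{state}>a{i}" for i in range(k)]

def simulator(state):
    """Stub for the LLM Simulator: estimate P(reproduction succeeds)."""
    return random.Random(state).random()

class Node:
    def __init__(self, state):
        self.state, self.children = state, []
        self.visits, self.value = 0, 0.0

    def uct(self, parent_visits, c=1.4):
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(parent_visits) / self.visits)

def search(root_state, iters=50):
    root = Node(root_state)
    for _ in range(iters):
        node, path = root, [root]
        while node.children:                       # selection
            node = max(node.children, key=lambda n: n.uct(node.visits + 1))
            path.append(node)
        node.children = [Node(s) for s in expander(node.state)]  # expansion
        reward = simulator(node.state)             # evaluation via Simulator
        for n in path:                             # backpropagation
            n.visits += 1
            n.value += reward
    return max(root.children, key=lambda n: n.visits).state

print(search("launch_app"))
```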

Paper & Project Links

PDF

Summary
Automatically reproducing Android app crashes from textual bug reports is challenging, especially when reports are incomplete and the modern UI is highly combinatorial. Existing approaches based on reinforcement learning or large language models (LLMs) struggle to infer unobserved steps and reconstruct the underlying user action sequences, mainly due to limited goal-directed reasoning and planning. This study proposes TreeMind, which integrates LLMs with a customized Monte Carlo Tree Search (MCTS) algorithm for strategic UI exploration in bug reproduction, the first work to combine external decision-making with LLM semantic reasoning for reliable bug reproduction. The reproduction task is formulated as a target-driven search problem, with MCTS as the core planning mechanism for iteratively refining action sequences. Two LLM-guided agents enhance MCTS with semantic reasoning: Expander generates the top-k promising actions given the current UI state and exploration history, while Simulator estimates the likelihood that each action leads toward successful reproduction. Combining multi-modal UI inputs and advanced prompting, TreeMind performs feedback-aware navigation that identifies missing but essential user actions and incrementally reconstructs reproduction paths. On a dataset of 93 real-world Android bug reports from three widely used benchmarks, TreeMind significantly outperforms four state-of-the-art baselines in reproduction success rate, and a real-world case study suggests that integrating LLM reasoning with MCTS-based planning is a promising direction for automated bug reproduction.

Key Takeaways

  1. Automatically reproducing Android app crashes from textual bug reports is challenging, especially with incomplete reports and highly combinatorial modern UIs.
  2. Existing methods (reinforcement learning and large language models) struggle to infer unobserved steps and reconstruct user action sequences.
  3. TreeMind combines LLMs with an MCTS algorithm for strategic UI exploration in bug reproduction.
  4. TreeMind introduces two LLM-guided agents, Expander and Simulator, to add semantic reasoning to MCTS.
  5. With multi-modal UI inputs and advanced prompting, TreeMind performs feedback-aware navigation and reconstructs bug reproduction paths.
  6. Experiments on a dataset of real-world Android bug reports show TreeMind outperforms other methods in reproduction success rate.

Cool Papers

Click here to view paper screenshots

MoveFM-R: Advancing Mobility Foundation Models via Language-driven Semantic Reasoning

Authors:Fanjin Meng, Yuan Yuan, Jingtao Ding, Jie Feng, Chonghua Han, Yong Li

Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns, yet they face a ceiling due to limitations in data scale and semantic understanding. While Large Language Models (LLMs) offer powerful semantic reasoning, they lack the innate understanding of spatio-temporal statistics required for generating physically plausible mobility trajectories. To address these gaps, we propose MoveFM-R, a novel framework that unlocks the full potential of mobility foundation models by leveraging language-driven semantic reasoning capabilities. It tackles two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between the latent vectors of MFMs and the semantic world of LLMs. MoveFM-R is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum to align the LLM’s reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Extensive experiments demonstrate that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines. It also shows robust generalization in zero-shot settings and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a new paradigm that enables a more comprehensive, interpretable, and powerful modeling of human mobility. The implementation of MoveFM-R is available online at https://anonymous.4open.science/r/MoveFM-R-CDE7/.
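
To make the vocabulary-mismatch problem concrete, here is a toy sketch that turns continuous coordinates into discrete tokens by recursive grid subdivision (a geohash-like scheme). It only illustrates the gap that the paper's semantically enhanced location encoding addresses; it is not the paper's encoder.

```python
def grid_token(lat, lon, depth=6):
    """Encode a coordinate as a discrete token via recursive quadrant splits."""
    lat_lo, lat_hi, lon_lo, lon_hi = -90.0, 90.0, -180.0, 180.0
    cells = []
    for _ in range(depth):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        q = (lat >= lat_mid) * 2 + (lon >= lon_mid)   # quadrant id 0..3
        cells.append(str(q))
        lat_lo, lat_hi = (lat_mid, lat_hi) if lat >= lat_mid else (lat_lo, lat_mid)
        lon_lo, lon_hi = (lon_mid, lon_hi) if lon >= lon_mid else (lon_lo, lon_mid)
    return "<loc_" + "".join(cells) + ">"

# Nearby points share long token prefixes; distant points diverge early.
print(grid_token(39.9042, 116.4074))   # Beijing
print(grid_token(39.9100, 116.4200))   # ~2 km away
print(grid_token(48.8566, 2.3522))     # Paris
```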

Paper & Project Links

PDF

Summary

Mobility Foundation Models (MFMs) have advanced the modeling of human movement patterns but face a ceiling due to limited data scale and semantic understanding. MoveFM-R is a framework that unlocks the potential of mobility foundation models through language-driven semantic reasoning, tackling two key challenges: the vocabulary mismatch between continuous geographic coordinates and discrete language tokens, and the representation gap between MFM latent vectors and the semantic world of LLMs. It is built on three core innovations: a semantically enhanced location encoding to bridge the geography-language gap, a progressive curriculum that aligns the LLM's reasoning with mobility patterns, and an interactive self-reflection mechanism for conditional trajectory generation. Experiments show that MoveFM-R significantly outperforms existing MFM-based and LLM-based baselines, generalizes robustly in zero-shot settings, and excels at generating realistic trajectories from natural language instructions. By synthesizing the statistical power of MFMs with the deep semantic understanding of LLMs, MoveFM-R pioneers a more comprehensive, interpretable, and powerful paradigm for modeling human mobility.

Key Takeaways

  1. MFMs have advanced the modeling of human movement but are limited by data scale and semantic understanding.
  2. The MoveFM-R framework unlocks the potential of MFMs by leveraging language-driven semantic reasoning.
  3. MoveFM-R addresses two key challenges: the vocabulary mismatch between geographic coordinates and language tokens, and the representation gap between MFMs and LLMs.
  4. MoveFM-R is built on three core innovations: semantically enhanced location encoding, progressive curriculum alignment, and an interactive self-reflection mechanism.
  5. MoveFM-R significantly outperforms existing baselines, generalizes robustly, and generates realistic trajectories from natural language instructions.
  6. MoveFM-R synthesizes the statistical power of MFMs with the semantic understanding of LLMs, pioneering a new paradigm for modeling human mobility.
  7. The implementation of MoveFM-R is available online.

Cool Papers

Click here to view paper screenshots

CircuitSense: A Hierarchical Circuit System Benchmark Bridging Visual Comprehension and Symbolic Reasoning in Engineering Design Process

Authors:Arman Akbari, Jian Gao, Yifei Zou, Mei Yang, Jinru Duan, Dmitrii Torbunov, Yanzhi Wang, Yihui Ren, Xuan Zhang

Engineering design operates through hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning at each level. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams remains unexplored. We present CircuitSense, a comprehensive benchmark evaluating circuit understanding across this hierarchy through 8,006+ problems spanning component-level schematics to system-level block diagrams. Our benchmark uniquely examines the complete engineering workflow: Perception, Analysis, and Design, with a particular emphasis on the critical but underexplored capability of deriving symbolic equations from visual inputs. We introduce a hierarchical synthetic generation pipeline consisting of a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Comprehensive evaluation of six state-of-the-art MLLMs, including both closed-source and open-source models, reveals fundamental limitations in visual-to-mathematical reasoning. Closed-source models achieve over 85% accuracy on perception tasks involving component recognition and topology identification, yet their performance on symbolic derivation and analytical reasoning falls below 19%, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning capabilities consistently achieve higher design task accuracy, confirming the fundamental role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.
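
For a flavor of the symbolic-derivation target, even a two-resistor schematic corresponds to an equation like the one this sympy sketch derives; the benchmark asks models to produce such equations directly from diagrams. The circuit here is an arbitrary illustrative example, not one from the dataset.

```python
import sympy as sp

V_in, R1, R2, I = sp.symbols("V_in R1 R2 I", positive=True)

# Series loop: V_in = I*R1 + I*R2 (KVL); the output is taken across R2.
I_sol = sp.solve(sp.Eq(V_in, I * R1 + I * R2), I)[0]
V_out = sp.simplify(I_sol * R2)
print(V_out)                  # R2*V_in/(R1 + R2)

# A quick system-level check: divider gain in dB when R1 = R2.
gain = sp.simplify(V_out / V_in).subs({R1: R2})
print(20 * sp.log(gain, 10))  # 20*log(1/2)/log(10), i.e. about -6.02 dB
```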

Paper & Project Links

PDF

Summary
Engineering design operates through hierarchical abstraction from system specifications to component implementations, with each level requiring visual understanding coupled with mathematical reasoning. While Multi-modal Large Language Models (MLLMs) excel at natural image tasks, their ability to extract mathematical models from technical diagrams had been unexplored. This work presents CircuitSense, a comprehensive benchmark of over 8,006 problems spanning component-level schematics to system-level block diagrams, covering the complete engineering workflow of Perception, Analysis, and Design, with particular emphasis on the underexplored capability of deriving symbolic equations from visual inputs. It includes a hierarchical synthetic generation pipeline with a grid-based schematic generator and a block diagram generator with auto-derived symbolic equation labels. Evaluation of six state-of-the-art MLLMs reveals fundamental limits in visual-to-mathematical reasoning: closed-source models exceed 85% accuracy on perception tasks such as component recognition and topology identification, yet fall below 19% on symbolic derivation and analytical reasoning, exposing a critical gap between visual parsing and symbolic reasoning. Models with stronger symbolic reasoning consistently achieve higher design-task accuracy, confirming the foundational role of mathematical understanding in circuit synthesis and establishing symbolic reasoning as the key metric for engineering competence.

Key Takeaways

  1. Engineering design involves hierarchical abstraction from system specifications to component implementations, requiring visual understanding coupled with mathematical reasoning.
  2. Current multi-modal large language models (MLLMs) have limited ability to combine visual and mathematical reasoning for circuit understanding.
  3. The CircuitSense benchmark evaluates circuit understanding across this hierarchy, with problems spanning schematics to system block diagrams.
  4. The benchmark examines the complete engineering workflow, emphasizing the derivation of symbolic equations from visual inputs.
  5. Closed-source models perform well on perception tasks but show severe limitations in symbolic derivation and analytical reasoning.
  6. A model's symbolic reasoning capability is crucial to success on engineering design tasks.

Cool Papers

Click here to view paper screenshots

RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer

Authors:Wangbo Zhao, Yizeng Han, Zhiwei Tang, Jiasheng Tang, Pengfei Zhou, Kai Wang, Bohan Zhuang, Zhangyang Wang, Fan Wang, Yang You

Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling. Existing training-free accelerators - step reduction, feature caching, and sparse attention - enhance inference speed but typically rely on a uniform heuristic or a manually designed adaptive strategy for all images, leaving quality on the table. Alternatively, dynamic neural networks offer per-image adaptive acceleration, but their high fine-tuning costs limit broader applicability. To address these limitations, we introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers, a framework that delivers image-wise acceleration with zero updates to the base generator. Specifically, three lightweight policy heads - Step-Skip, Cache-Reuse, and Sparse-Attention - observe the current denoising state and independently decide their corresponding speed-up at each timestep. All policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator remains frozen. Meanwhile, an adversarially learned discriminator augments the reward signal, discouraging reward hacking by boosting returns only when generated samples stay close to the original model’s distribution. Across state-of-the-art DiT backbones, including Stable Diffusion 3 and FLUX, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.
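
A schematic of the control loop, with the three heads reduced to tiny stubs over a summary of the denoising state. The feature vector, thresholds, and binary decision space are invented for illustration and say nothing about the paper's actual head architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_heads(state_feat, w):
    """Three independent heads mapping a denoising-state summary to speed-ups."""
    logits = state_feat @ w                    # (3,), one logit per head
    probs = 1.0 / (1.0 + np.exp(-logits))
    return {
        "skip_step":   probs[0] > 0.5,         # Step-Skip: drop this timestep
        "reuse_cache": probs[1] > 0.5,         # Cache-Reuse: reuse features
        "sparse_attn": probs[2] > 0.5,         # Sparse-Attention: prune attention
    }

w = rng.normal(size=(8, 3))                    # trained online (e.g. via GRPO)
for t in range(4):                             # a few denoising timesteps
    state_feat = rng.normal(size=8)            # stand-in for the denoising state
    print(t, policy_heads(state_feat, w))
```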

Paper & Project Links

PDF

Summary

The paper presents RAPID3, an acceleration framework for Diffusion Transformers (DiTs) that delivers image-wise acceleration with zero updates to the base generator. RAPID3 comprises three lightweight policy heads, Step-Skip, Cache-Reuse, and Sparse-Attention, which observe the current denoising state and independently decide their corresponding speed-up at each timestep. The policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the generator stays frozen, and an adversarially learned discriminator augments the reward signal to keep generated samples close to the original model's distribution. Across state-of-the-art DiT backbones, RAPID3 achieves nearly 3x faster sampling with competitive generation quality.

Key Takeaways

  1. RAPID3 is an acceleration framework for Diffusion Transformers that speeds up image generation without sacrificing quality.
  2. RAPID3 comprises three policy heads, Step-Skip, Cache-Reuse, and Sparse-Attention, which decide speed-ups based on the denoising state.
  3. Policy parameters are trained online via Group Relative Policy Optimization (GRPO) while the base generator stays frozen.
  4. An adversarially learned discriminator augments the reward signal to keep generated samples close to the original model's distribution.
  5. RAPID3 achieves nearly 3x faster sampling across state-of-the-art DiT backbones.
  6. RAPID3 overcomes the limits of training-free accelerators through finer-grained, per-image acceleration policies.

Cool Papers

Click here to view paper screenshots

Rule-Based Reinforcement Learning for Document Image Classification with Vision Language Models

Authors:Michael Jungo, Andreas Fischer

Rule-based reinforcement learning has been gaining popularity ever since DeepSeek-R1 has demonstrated its success through simple verifiable rewards. In the domain of document analysis, reinforcement learning is not as prevalent, even though many downstream tasks may benefit from the emerging properties of reinforcement learning, particularly the enhanced reasoning capabilities. We study the effects of rule-based reinforcement learning with the task of Document Image Classification which is one of the most commonly studied downstream tasks in document analysis. We find that reinforcement learning tends to have better generalisation capabilities to out-of-distribution data, which we examine in three different scenarios, namely out-of-distribution images, unseen classes and different modalities. Our code is available at https://github.com/jungomi/vision-finetune.
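
The phrase "simple verifiable rewards" maps onto a reward function like the sketch below: the model's free-text output must follow a parseable format and name the correct class. The tag format, label set, and partial-credit scheme are assumptions for illustration, not the paper's exact reward.

```python
import re

LABELS = {"invoice", "letter", "memo", "resume"}  # hypothetical label set

def rule_based_reward(response: str, gold: str) -> float:
    """Verifiable reward: 1.0 for a well-formed, correct answer,
    0.1 for well-formed but wrong, 0.0 for unparseable output."""
    m = re.search(r"<answer>\s*(\w+)\s*</answer>", response, re.IGNORECASE)
    if not m or m.group(1).lower() not in LABELS:
        return 0.0                       # format rule violated
    return 1.0 if m.group(1).lower() == gold else 0.1

print(rule_based_reward("I think this is an <answer>invoice</answer>.", "invoice"))  # 1.0
print(rule_based_reward("<answer>memo</answer>", "invoice"))                          # 0.1
print(rule_based_reward("probably an invoice", "invoice"))                            # 0.0
```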

Paper & Project Links

PDF Code available at https://github.com/jungomi/vision-finetune

Summary

Rule-based reinforcement learning has been gaining popularity since DeepSeek-R1 demonstrated its success with simple verifiable rewards. In document analysis, reinforcement learning is not yet prevalent, even though many downstream tasks could benefit from its emergent properties, particularly enhanced reasoning. The authors study rule-based reinforcement learning on Document Image Classification, one of the most commonly studied downstream tasks in document analysis, and find that reinforcement learning generalizes better to out-of-distribution data, examined in three scenarios: out-of-distribution images, unseen classes, and different modalities. The code is available at https://github.com/jungomi/vision-finetune.

Key Takeaways

  1. Rule-based reinforcement learning has succeeded through simple verifiable rewards.
  2. Reinforcement learning is not yet widespread in document analysis but shows potential.
  3. Rule-based reinforcement learning is studied empirically on the Document Image Classification task.
  4. Reinforcement learning shows better generalization to out-of-distribution data across three scenarios.
  5. These scenarios cover out-of-distribution images, unseen classes, and different modalities.
  6. The research code is publicly available on GitHub.

Cool Papers

Click here to view paper screenshots

Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks

Authors:Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Lijun Wang, Yuanyuan Peng, Huan Gao, Mingkun Xu, Shangyang Li

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

Paper & Project Links

PDF 23 pages, 12 figures

Summary
Vision-language models have achieved remarkable performance on standard medical benchmarks, but their true clinical reasoning ability remains unclear. Neural-MedBench is a compact yet reasoning-intensive benchmark designed to probe the limits of multimodal clinical reasoning in neurology. It integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and covers three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, a hybrid scoring pipeline combines LLM-based graders, clinician validation, and semantic similarity metrics. Systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, shows a sharp performance drop relative to conventional datasets, with error analysis indicating that reasoning failures, not perceptual errors, dominate model shortcomings. The findings motivate a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented compact benchmarks such as Neural-MedBench for reasoning fidelity. Neural-MedBench is released at https://neuromedbench.github.io/ as an open, extensible diagnostic testbed that guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

Key Takeaways

  1. VLMs excel on standard medical benchmarks, but their clinical reasoning ability remains unclear.
  2. Neural-MedBench is introduced to evaluate multimodal clinical reasoning.
  3. Neural-MedBench integrates multiple data sources, including MRI scans, electronic health records, and clinical notes.
  4. The benchmark covers three core tasks: differential diagnosis, lesion recognition, and rationale generation.
  5. A hybrid scoring pipeline ensures reliable evaluation.
  6. VLM performance drops sharply on Neural-MedBench compared with conventional datasets.
  7. Error analysis shows that reasoning failures, not perceptual errors, are the models' main challenge.

Cool Papers

Click here to view paper screenshots

Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing

Authors:Syed Mahbubul Huq, Daniel Brito, Daniel Sikar, Rajesh Mojumder

This paper presents an evaluation framework for assessing Large Language Models’ (LLMs) capabilities in combinatorial optimization, specifically addressing the 2D bin-packing problem. We introduce a systematic methodology that combines LLMs with evolutionary algorithms to generate and refine heuristic solutions iteratively. Through comprehensive experiments comparing LLM generated heuristics against traditional approaches (Finite First-Fit and Hybrid First-Fit), we demonstrate that LLMs can produce more efficient solutions while requiring fewer computational resources. Our evaluation reveals that GPT-4o achieves optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins while improving space utilization from 0.76-0.78 to 0.83. This work contributes to understanding LLM evaluation in specialized domains and establishes benchmarks for assessing LLM performance in combinatorial optimization tasks.
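
For context, the traditional baseline family is easy to state in code. Below is a minimal shelf-based first-fit heuristic that packs rectangles into fixed-size 2D bins, one simple reading of a finite first-fit approach; the paper's exact baselines may differ in details such as sorting and shelf policy.

```python
def shelf_first_fit(rects, bin_w=10, bin_h=10):
    """Pack (w, h) rectangles into the first bin/shelf with room.

    Each bin is a list of shelves; each shelf tracks its height and used width.
    """
    bins = []
    for w, h in sorted(rects, key=lambda r: -r[1]):      # tallest first
        placed = False
        for shelves in bins:
            for shelf in shelves:                        # fits an open shelf?
                if shelf["h"] >= h and shelf["used"] + w <= bin_w:
                    shelf["used"] += w
                    placed = True
                    break
            if placed:
                break
            top = sum(s["h"] for s in shelves)           # room for a new shelf?
            if top + h <= bin_h:
                shelves.append({"h": h, "used": w})
                placed = True
                break
        if not placed:                                   # open a new bin
            bins.append([{"h": h, "used": w}])
    return bins

rects = [(4, 5), (3, 3), (6, 4), (2, 2), (5, 5), (7, 3), (3, 6)]
print(f"{len(shelf_first_fit(rects))} bins used")
```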

Paper & Project Links

PDF 1 table, 6 figures. 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Accepted for the Workshop: Evaluating the Evolving LLM Lifecycle Benchmarks, Emergent Abilities, and Scaling

Summary

The paper presents an evaluation framework for assessing the capabilities of Large Language Models (LLMs) in combinatorial optimization, specifically the 2D bin-packing problem, introducing a systematic methodology that combines LLMs with evolutionary algorithms to generate and refine heuristic solutions iteratively. Experiments comparing LLM-generated heuristics against traditional approaches (Finite First-Fit and Hybrid First-Fit) show that LLMs can produce more efficient solutions while requiring fewer computational resources. GPT-4o achieves optimal solutions within two iterations, reducing average bin usage from 16 to 15 bins and improving space utilization from 0.76-0.78 to 0.83. The work advances understanding of LLM evaluation in specialized domains and establishes benchmarks for assessing LLM performance on combinatorial optimization tasks.

Key Takeaways

  1. The paper evaluates the performance of Large Language Models (LLMs) on combinatorial optimization, specifically the 2D bin-packing problem.
  2. It proposes a systematic methodology combining LLMs with evolutionary algorithms to generate and refine heuristic solutions.
  3. Experiments show that LLM-generated heuristics are more efficient than traditional approaches while consuming fewer computational resources.
  4. GPT-4o finds optimal solutions within two iterations, demonstrating its efficiency.
  5. The results improve space utilization and reduce average bin usage.
  6. The study deepens understanding of LLM evaluation in specialized domains.

Cool Papers

Click here to view paper screenshots

ASSESS: A Semantic and Structural Evaluation Framework for Statement Similarity

Authors:Xiaoyang Liu, Tao Zhu, Zineng Dong, Yuntian Liu, Qingfeng Guo, Zhaoxuan Liu, Yu Chen, Tao Luo

Statement autoformalization, the automated translation of statements from natural language into formal languages, has seen significant advancements, yet the development of automated evaluation metrics remains limited. Existing metrics for formal statement similarity often fail to balance semantic and structural information. String-based approaches capture syntactic structure but ignore semantic meaning, whereas proof-based methods validate semantic equivalence but disregard structural nuances and, critically, provide no graded similarity score in the event of proof failure. To address these issues, we introduce ASSESS (A Semantic and Structural Evaluation Framework for Statement Similarity), which comprehensively integrates semantic and structural information to provide a continuous similarity score. Our framework first transforms formal statements into Operator Trees to capture their syntactic structure and then computes a similarity score using our novel TransTED (Transformation Tree Edit Distance) Similarity metric, which enhances traditional Tree Edit Distance by incorporating semantic awareness through transformations. For rigorous validation, we present EPLA (Evaluating Provability and Likeness for Autoformalization), a new benchmark of 524 expert-annotated formal statement pairs derived from miniF2F and ProofNet, with labels for both semantic provability and structural likeness. Experiments on EPLA demonstrate that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient. The benchmark, and implementation code will be made public soon.
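
A small sketch of the structural half of the metric, assuming the zss package (a Zhang-Shasha tree edit distance implementation) is installed: formal statements become operator trees, and plain tree edit distance scores their difference. TransTED goes further by folding semantic-aware transformations into the comparison; this sketch shows only the baseline ingredient.

```python
# pip install zss  -- Zhang-Shasha tree edit distance
from zss import Node, simple_distance

def op_tree(expr):
    """Build an operator tree from a tiny nested-tuple expression."""
    if isinstance(expr, tuple):
        op, *args = expr
        node = Node(op)
        for a in args:
            node.addkid(op_tree(a))
        return node
    return Node(str(expr))

# (a + b) * c   vs.   (b + a) * c   vs.   a - b
t1 = op_tree(("*", ("+", "a", "b"), "c"))
t2 = op_tree(("*", ("+", "b", "a"), "c"))
t3 = op_tree(("-", "a", "b"))

print(simple_distance(t1, t2))  # small: only the leaves are swapped
print(simple_distance(t1, t3))  # larger: the operator structure differs
# TransTED additionally applies semantic transformations (e.g. commutativity)
# around the distance computation, so t1 vs t2 could score as equivalent.
```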

Paper & Project Links

PDF

Summary

The paper addresses the limited development of automated evaluation metrics for statement autoformalization: existing metrics for formal statement similarity fail to balance semantic and structural information. It proposes ASSESS, an evaluation framework that integrates both to produce a continuous similarity score, together with the novel TransTED (Transformation Tree Edit Distance) Similarity metric, which transforms formal statements into Operator Trees to capture their syntactic structure and augments traditional Tree Edit Distance with semantic awareness through transformations. Experiments on the EPLA benchmark show that TransTED Similarity outperforms existing methods, achieving state-of-the-art accuracy and the highest Kappa coefficient.

Key Takeaways

  1. Automated evaluation of statement autoformalization remains an open problem: existing metrics fail to balance semantic and structural information.
  2. The ASSESS framework integrates semantic and structural information to provide a continuous similarity score, capturing syntactic structure via Operator Trees.
  3. The proposed TransTED Similarity metric enhances traditional Tree Edit Distance with semantic awareness through transformations.
  4. The EPLA benchmark consists of 524 expert-annotated formal statement pairs, labeled for both semantic provability and structural likeness.

Cool Papers

Click here to view paper screenshots

Thinking in Many Modes: How Composite Reasoning Elevates Large Language Model Performance with Limited Data

Authors:Zishan Ahmad, Saisubramaniam Gopalakrishnan

Large Language Models (LLMs), despite their remarkable capabilities, rely on singular, predominant reasoning paradigms, hindering their performance on intricate problems that demand diverse cognitive strategies. To address this, we introduce Composite Reasoning (CR), a novel reasoning approach empowering LLMs to dynamically explore and combine multiple reasoning styles like deductive, inductive, and abductive for more nuanced problem-solving. Evaluated on scientific and medical question-answering benchmarks, our approach outperforms existing baselines like Chain-of-Thought (CoT) and also surpasses the accuracy of DeepSeek-R1 style reasoning (SR) capabilities, while demonstrating superior sample efficiency and adequate token usage. Notably, CR adaptively emphasizes domain-appropriate reasoning styles. It prioritizes abductive and deductive reasoning for medical question answering, but shifts to causal, deductive, and inductive methods for scientific reasoning. Our findings highlight that by cultivating internal reasoning style diversity, LLMs acquire more robust, adaptive, and efficient problem-solving abilities.
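
One way to picture the idea is a prompt scaffold that asks the model to reason in several named styles and then reconcile them. The template below is only a guess at the general shape, not the paper's prompt.

```python
STYLES = {
    "deductive": "Derive consequences strictly from the given facts.",
    "inductive": "Generalize a pattern from the specific observations.",
    "abductive": "Propose the most plausible explanation for the evidence.",
}

def composite_prompt(question, styles=("abductive", "deductive")):
    """Assemble a composite-reasoning prompt mixing several styles."""
    lines = [f"Question: {question}",
             "Reason in the following modes, then combine them:"]
    for i, s in enumerate(styles, 1):
        lines.append(f"{i}. [{s}] {STYLES[s]}")
    lines.append("Finally, reconcile the modes and state the answer.")
    return "\n".join(lines)

print(composite_prompt("A patient presents with fever and a stiff neck. Diagnosis?"))
```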

Paper & Project Links

PDF 7 pages, 3 figures

Summary
Large language models (LLMs) rely on singular, predominant reasoning paradigms, which hinders their performance on intricate problems that demand diverse cognitive strategies. The paper introduces Composite Reasoning (CR), a novel approach that lets LLMs dynamically explore and combine multiple reasoning styles, such as deductive, inductive, and abductive, for more nuanced problem-solving. Evaluated on scientific and medical question-answering benchmarks, CR outperforms baselines such as Chain-of-Thought (CoT) and surpasses the accuracy of DeepSeek-R1-style reasoning (SR), while demonstrating superior sample efficiency and adequate token usage. Notably, CR adaptively emphasizes domain-appropriate reasoning styles: it prioritizes abductive and deductive reasoning for medical question answering but shifts to causal, deductive, and inductive methods for scientific reasoning. The findings show that by cultivating internal diversity of reasoning styles, LLMs acquire more robust, adaptive, and efficient problem-solving abilities.

Key Takeaways

  1. LLMs are constrained by singular, predefined reasoning paradigms and struggle with problems demanding diverse cognitive strategies.
  2. Composite Reasoning (CR) combines multiple reasoning styles, such as deductive, inductive, and abductive reasoning.
  3. On scientific and medical question-answering benchmarks, CR outperforms existing methods such as Chain-of-Thought (CoT).
  4. CR improves sample efficiency while keeping token usage adequate.
  5. CR adaptively emphasizes domain-appropriate reasoning styles, e.g. abductive and deductive reasoning for medical QA.
  6. Cultivating internal diversity of reasoning styles gives LLMs more robust, adaptive, and efficient problem-solving abilities.

Cool Papers

Click here to view paper screenshots

Towards Faithful Reasoning in Remote Sensing: A Perceptually-Grounded GeoSpatial Chain-of-Thought for Vision-Language Models

Authors:Jiaqi Liu, Lang Sun, Ronghao Fu, Bo Yang

Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks, a limitation stemming from their end-to-end training paradigm that bypasses crucial reasoning steps and leads to unverifiable outputs. To address this limitation, we introduce the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process. We instill this analytical process through a two-stage alignment strategy, leveraging Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. This strategy first employs supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then leverages Group Reward Policy Optimization (GRPO) to refine the model’s reasoning policy towards factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace. This capability yields dominant performance, significantly outperforming state-of-the-art models across a comprehensive range of tasks. The public release of our Geo-CoT380k dataset and RSThinker model upon publication serves as a concrete pathway from opaque perception towards structured, verifiable reasoning for Earth Observation.

Paper & Project Links

PDF

Summary

Vision-Language Models (VLMs) in remote sensing often fail at complex analytical tasks because their end-to-end training paradigm bypasses crucial reasoning steps and yields unverifiable outputs. The paper introduces the Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT), a framework that models remote sensing analysis as a verifiable, multi-step process, instilled via a two-stage alignment strategy on Geo-CoT380k, the first large-scale dataset of structured Geo-CoT rationales. The strategy first uses supervised fine-tuning (SFT) to instill the foundational cognitive architecture, then applies Group Reward Policy Optimization (GRPO) to refine the model's reasoning policy toward factual correctness. The resulting model, RSThinker, outputs both a final answer and its justifying, verifiable analytical trace, significantly outperforming state-of-the-art models across a comprehensive range of tasks.

Key Takeaways

  1. Vision-Language Models (VLMs) struggle with complex analytical tasks in remote sensing, motivating a new framework.
  2. The Perceptually-Grounded Geospatial Chain-of-Thought (Geo-CoT) models remote sensing analysis as a verifiable, multi-step process.
  3. A two-stage alignment strategy and the Geo-CoT380k dataset instill and generalize this framework.
  4. The first stage uses supervised fine-tuning (SFT) to build the foundational cognitive architecture.
  5. The second stage uses Group Reward Policy Optimization (GRPO) to refine the reasoning policy toward factual correctness.
  6. The resulting RSThinker model outputs both a final answer and a verifiable analytical trace.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!