
R1_Reasoning


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use them for serious academic work; they are only a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-05-15

AC-Reason: Towards Theory-Guided Actual Causality Reasoning with Large Language Models

Authors:Yanxi Zhang, Xin Cong, Zhong Zhang, Xiao Liu, Dongyan Zhao, Yesai Wu

Actual causality (AC), a fundamental aspect of causal reasoning (CR), is responsible for attribution and responsibility assignment in real-world scenarios. However, existing LLM-based methods lack grounding in formal AC theory, resulting in limited interpretability. Therefore, we propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events within an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. While AC-Reason does not explicitly construct a causal graph, it operates over variables in the underlying causal structure to support principled reasoning. To enable comprehensive evaluation, we introduce AC-Bench, a new benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ). AC-Bench comprises ~1K carefully annotated samples, each with detailed reasoning steps, and focuses solely on actual causation. The case study shows that synthesized samples in AC-Bench present greater challenges for LLMs. Extensive experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance over baselines. On BBH-CJ, all tested LLMs surpass the average human rater accuracy of 69.60%, with GPT-4 + AC-Reason achieving 75.04%. On AC-Bench, GPT-4 + AC-Reason again achieves the highest accuracy of 71.82%. AC-Bench further enables fine-grained analysis of reasoning faithfulness, revealing that only Qwen-2.5-72B-Instruct, Claude-3.5-Sonnet, and GPT-4o exhibit faithful reasoning, whereas GPT-4 tends to exploit shortcuts. Finally, our ablation study proves that integrating AC theory into LLMs is highly effective, with the proposed algorithm contributing the most significant performance gains.
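
The theory-guided decision step can be pictured as a small rule over the inferred factor values. Below is a minimal, hypothetical sketch of how an AC query might be answered once sufficiency, necessity, and normality have been inferred for a candidate event; the scoring rule is an illustrative simplification, not the paper's actual algorithm:

```python
from dataclasses import dataclass

@dataclass
class CausalFactors:
    sufficient: bool   # the event guarantees the outcome
    necessary: bool    # the outcome disappears without the event
    normal: bool       # the event is typical/expected in context

def is_actual_cause(f: CausalFactors) -> bool:
    # Toy decision rule, purely illustrative: necessity establishes causal
    # relevance, and abnormal sufficient events are preferred as causes
    # (a common intuition in the actual-causality literature).
    return f.necessary or (f.sufficient and not f.normal)

# Example AC query over one candidate event: abnormal and sufficient.
print(is_actual_cause(CausalFactors(sufficient=True, necessary=False, normal=False)))  # True
```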


Paper and project links

PDF

Summary

This paper addresses actual causality (AC), a core part of causal reasoning (CR) responsible for attribution and responsibility assignment in real-world scenarios. Existing LLM-based methods lack grounding in formal AC theory and thus offer limited interpretability. The authors propose AC-Reason, a semi-formal reasoning framework that identifies causally relevant events in an AC scenario, infers the values of their formal causal factors (e.g., sufficiency, necessity, and normality), and answers AC queries via a theory-guided algorithm with explanations. They also introduce AC-Bench, a benchmark built upon and substantially extending Big-Bench Hard Causal Judgment (BBH-CJ), containing ~1K carefully annotated samples, each with detailed reasoning steps, focused solely on actual causation. Experiments on BBH-CJ and AC-Bench show that AC-Reason consistently improves LLM performance, with GPT-4 + AC-Reason reaching the highest accuracy of 75.04%. Finally, an ablation study confirms that integrating AC theory into LLMs is effective, with the proposed algorithm contributing the largest performance gains.

Key Takeaways

  1. Actual causality (AC) is central to causal reasoning (CR) and crucial for real-world attribution and responsibility assignment.
  2. Current LLM-based methods lack formal theoretical grounding for causal reasoning, limiting interpretability.
  3. AC-Reason supports a semi-formal reasoning process by identifying causally relevant events in an AC scenario and inferring their causal factor values.
  4. AC-Bench extends BBH-CJ with carefully annotated samples focused on evaluating actual-causality reasoning.
  5. AC-Reason improves LLM performance on both BBH-CJ and AC-Bench, with GPT-4 + AC-Reason performing best.
  6. Fine-grained analysis shows that some LLMs (e.g., GPT-4) have reasoning-faithfulness issues and tend to take shortcuts.


TRAIL: Trace Reasoning and Agentic Issue Localization

Authors:Darshan Deshpande, Varun Gangal, Hersh Mehta, Jitin Krishnan, Anand Kannappan, Rebecca Qian

The increasing adoption of agentic workflows across diverse domains brings a critical need to scalably and systematically evaluate the complex traces these systems generate. Current evaluation methods depend on manual, domain-specific human analysis of lengthy workflow traces - an approach that does not scale with the growing complexity and volume of agentic outputs. Error analysis in these settings is further complicated by the interplay of external tool outputs and language model reasoning, making it more challenging than traditional software debugging. In this work, we (1) articulate the need for robust and dynamic evaluation methods for agentic workflow traces, (2) introduce a formal taxonomy of error types encountered in agentic systems, and (3) present a set of 148 large human-annotated traces (TRAIL) constructed using this taxonomy and grounded in established agentic benchmarks. To ensure ecological validity, we curate traces from both single and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Our evaluations reveal that modern long context LLMs perform poorly at trace debugging, with the best Gemini-2.5-pro model scoring a mere 11% on TRAIL. Our dataset and code are made publicly available to support and accelerate future research in scalable evaluation for agentic workflows.
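
To make the annotation format concrete, here is a small sketch of how a taxonomy-labelled trace annotation and a judge-accuracy metric could be represented; the category names below are placeholders, not TRAIL's actual taxonomy:

```python
from dataclasses import dataclass
from enum import Enum

class ErrorType(Enum):
    # Placeholder categories; TRAIL's real taxonomy is defined in the paper.
    TOOL_OUTPUT_MISUSE = "tool_output_misuse"
    REASONING_HALLUCINATION = "reasoning_hallucination"
    ORCHESTRATION_FAILURE = "orchestration_failure"

@dataclass(frozen=True)
class TraceAnnotation:
    span_id: str        # which step(s) of the workflow trace
    error: ErrorType    # taxonomy label
    evidence: str       # annotator's justification

def judge_score(predicted, gold) -> float:
    """Fraction of gold error spans a model recovered (an exact-match proxy
    for a TRAIL-style score; the 11% figure above is of this flavor)."""
    keys = lambda anns: {(a.span_id, a.error) for a in anns}
    g = keys(gold)
    return len(keys(predicted) & g) / max(len(g), 1)
```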


Paper and project links

PDF Dataset link: https://huggingface.co/datasets/PatronusAI/TRAIL

Summary

The growing adoption of agentic workflows across domains creates an urgent need to evaluate the complex traces these systems generate in a scalable, systematic way. Current methods rely on manual, domain-specific analysis of lengthy workflow traces, which does not scale with the complexity and volume of agentic outputs. The authors articulate the need for robust, dynamic evaluation of agentic workflow traces, introduce a formal taxonomy of error types encountered in agentic systems, and release TRAIL, a set of 148 large human-annotated traces built with this taxonomy and grounded in established agentic benchmarks. Traces are curated from both single- and multi-agent systems, focusing on real-world applications such as software engineering and open-world information retrieval. Evaluations show that modern long-context LLMs perform poorly at trace debugging: the best model, Gemini-2.5-pro, scores only 11% on TRAIL. The dataset and code are publicly released to support and accelerate future research on scalable evaluation for agentic workflows.


D-Hammer: Efficient Equational Reasoning for Labelled Dirac Notation

Authors:Yingte Xu, Li Zhou, Gilles Barthe

Labelled Dirac notation is a formalism commonly used by physicists to represent many-body quantum systems and by computer scientists to assert properties of quantum programs. It is supported by a rich equational theory for proving equality between expressions in the language. These proofs are typically carried on pen-and-paper, and can be exceedingly long and error-prone. We introduce D-Hammer, the first tool to support automated equational proof for labelled Dirac notation. The salient features of D-Hammer include: an expressive, higher-order, dependently-typed language for labelled Dirac notation; an efficient normalization algorithm; and an optimized C++ implementation. We evaluate the implementation on representative examples from both plain and labelled Dirac notation. In the case of plain Dirac notation, we show that our implementation significantly outperforms DiracDec.
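
The core of such a tool is normalization by term rewriting: both sides of a conjectured equation are rewritten to a normal form and compared syntactically. A toy sketch of that idea, far simpler than D-Hammer's higher-order, dependently typed language, might look like:

```python
def normalize(term):
    """Terms are nested tuples, e.g. ("adj", ("adj", ("ket", "psi")))."""
    if not isinstance(term, tuple):
        return term
    head, *args = term
    args = [normalize(a) for a in args]
    # Rewrite rule: double adjoint cancels, (A dagger) dagger = A.
    if head == "adj" and isinstance(args[0], tuple) and args[0][0] == "adj":
        return args[0][1]
    # Rewrite rule: flatten nested scalar products (associativity).
    if head == "smul":
        flat = []
        for a in args:
            if isinstance(a, tuple) and a[0] == "smul":
                flat.extend(a[1:])
            else:
                flat.append(a)
        return ("smul", *flat)
    return (head, *args)

def equal(lhs, rhs):
    # Equational proof by normalization: compare normal forms.
    return normalize(lhs) == normalize(rhs)

assert equal(("adj", ("adj", ("ket", "psi"))), ("ket", "psi"))
```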


Paper and project links

PDF This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. 47 pages (with appendix)

Summary

Labelled Dirac notation is a formalism used by physicists to represent many-body quantum systems and by computer scientists to assert properties of quantum programs. Proofs of equality between expressions are typically carried out by hand and can be exceedingly long and error-prone. The paper introduces D-Hammer, the first tool to support automated equational proof for labelled Dirac notation. D-Hammer features an expressive, higher-order, dependently typed language, an efficient normalization algorithm, and an optimized C++ implementation. On representative examples of plain Dirac notation, the implementation significantly outperforms DiracDec.

Key Takeaways

  1. Labelled Dirac notation is widely used in physics and computer science, especially for quantum systems and quantum program properties.
  2. Equational proofs over labelled Dirac notation are currently done by hand, making them long and error-prone.
  3. D-Hammer is the first tool to support automated equational proofs for labelled Dirac notation.
  4. D-Hammer provides an expressive, higher-order, dependently typed language.
  5. D-Hammer features an efficient normalization algorithm and an optimized C++ implementation.
  6. On representative plain Dirac notation examples, D-Hammer significantly outperforms DiracDec.


OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning

Authors:Zhaochen Su, Linjie Li, Mingyang Song, Yunzhuo Hao, Zhengyuan Yang, Jun Zhang, Guanjie Chen, Jiawei Gu, Juntao Li, Xiaoye Qu, Yu Cheng

While humans can flexibly leverage interactive visual cognition for complex problem-solving, enabling Large Vision-Language Models (LVLMs) to learn similarly adaptive behaviors with visual tools remains challenging. A significant hurdle is the current lack of standardized infrastructure, which hinders integrating diverse tools, generating rich interaction data, and training robust agents effectively. To address these gaps, we introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. It features standardized vision tool interfaces, scalable trajectory generation for policy initialization, and a flexible training environment. Furthermore, considering supervised fine-tuning (SFT) on static demonstrations offers limited policy generalization for dynamic tool invocation, we propose a novel reinforcement learning (RL) framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies by directly optimizing for task success using feedback from tool interactions. We empirically validate V-ToolRL on challenging chart reasoning tasks. Our RL-trained agent, built upon a Qwen2-VL-2B, significantly outperforms its SFT-initialized counterpart (+28.83 points) and surpasses established supervised tool-learning baselines like Taco and CogCom by an average of +12.7 points. Notably, it also surpasses prominent closed-source models like GPT-4.1 by +8.68 accuracy points. We hope OpenThinkIMG can serve as a foundational framework for advancing dynamic, tool-augmented visual reasoning, helping the community develop AI agents that can genuinely “think with images”.
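
The key training signal in V-ToolRL is outcome-based: the agent is rewarded for final task success after a sequence of tool calls, rather than for imitating static demonstrations. A schematic rollout loop, with stand-in tool names and a dummy environment rather than the framework's actual API, could look like:

```python
import random

TOOLS = ["crop", "zoom_in", "read_axis", "answer"]   # placeholder tool names

class DummyChartEnv:
    """Stand-in environment; a real setup would render charts and check answers."""
    def reset(self):
        self.solved = False
        return "chart image"
    def step(self, action):
        if action == "answer":
            self.solved = random.random() < 0.5      # stand-in answer check
            return "done", True
        return f"obs after {action}", False          # tool output folded into obs

def rollout(policy, env, max_steps=6):
    obs, done, traj = env.reset(), False, []
    while not done and len(traj) < max_steps:
        action = policy(obs)                         # pick a tool or final answer
        obs, done = env.step(action)
        traj.append(action)
    return traj, (1.0 if env.solved else 0.0)        # outcome reward for RL

traj, reward = rollout(lambda obs: random.choice(TOOLS), DummyChartEnv())
```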


Paper and project links

PDF Work in progress

Summary
This work presents OpenThinkIMG, an open-source framework for tool-augmented visual cognition, together with a novel reinforcement-learning training method that lets models autonomously discover optimal tool-usage strategies and thereby improve performance on tasks such as chart reasoning. Experiments show the approach significantly outperforms supervised fine-tuning (SFT) and other leading tool-learning models on challenging chart-reasoning tasks, marking further progress in AI image understanding.

Key Takeaways

  1. Enabling Large Vision-Language Models (LVLMs) to learn adaptive behaviors with visual tools remains challenging.
  2. The lack of standardized infrastructure hinders tool integration, rich interaction data generation, and effective training.
  3. OpenThinkIMG is an open-source, comprehensive end-to-end framework for tool-augmented LVLMs, with standardized vision tool interfaces, scalable trajectory generation, and a flexible training environment.
  4. Since supervised fine-tuning (SFT) on static demonstrations generalizes poorly to dynamic tool invocation, the authors propose the reinforcement learning (RL) framework V-ToolRL.
  5. V-ToolRL lets LVLMs autonomously discover optimal tool-usage strategies by directly optimizing for task success.
  6. On challenging chart-reasoning tasks, the RL-trained agent significantly outperforms its SFT-initialized counterpart and supervised tool-learning baselines such as Taco and CogCom.


From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

Authors:Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, Jianye Hao

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, while building on top of general Vision-Language Models (VLMs), still fall short of achieving robust zero-shot performance due to the scarcity and heterogeneity prevalent in embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validated FSD’s capabilities in both “seeing” and “doing,” achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed more challenging benchmark VABench. We also verified zero-shot capabilities in robot manipulation, demonstrating significant performance improvements over baseline methods in both SimplerEnv and real robot settings. Experimental results show that FSD achieves 54.1% success rate in SimplerEnv and 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.


Paper and project links

PDF Early version

Summary

FSD (From Seeing to Doing) is a vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation and addressing the challenge of generalizing to unseen scenarios and novel tasks. FSD achieves outstanding performance across eight benchmarks for general spatial reasoning and embodied reference abilities, as well as on the more challenging VABench. It reaches a 54.1% success rate in SimplerEnv and 72% across eight real-world tasks, significantly outperforming the strongest baseline.

Key Takeaways

  1. Generalizing robotic manipulation to unseen scenarios and novel tasks is a critical challenge.
  2. Current Vision-Language-Action (VLA) models still fall short here, especially in zero-shot performance.
  3. FSD generates intermediate representations via spatial relationship reasoning to provide fine-grained guidance for manipulation.
  4. FSD combines a hierarchical training data pipeline with a self-consistency mechanism that aligns spatial coordinates with visual signals.
  5. FSD performs strongly across multiple benchmarks, including general spatial reasoning and embodied reference tests.
  6. In the SimplerEnv environment, FSD achieves a 54.1% success rate.


Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?

Authors:Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Ahmed Masry, Mizanur Rahman, Amran Bhuiyan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang

Charts are ubiquitous as they help people understand and reason with data. Recently, various downstream tasks, such as chart question answering, chart2text, and fact-checking, have emerged. Large Vision-Language Models (LVLMs) show promise in tackling these tasks, but their evaluation is costly and time-consuming, limiting real-world deployment. While using LVLMs as judges to assess the chart comprehension capabilities of other LVLMs could streamline evaluation processes, challenges like proprietary datasets, restricted access to powerful models, and evaluation costs hinder their adoption in industrial settings. To this end, we present a comprehensive evaluation of 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks. We design both pairwise and pointwise evaluation tasks covering criteria like factual correctness, informativeness, and relevancy. Additionally, we analyze LVLM judges based on format adherence, positional consistency, length bias, and instruction-following. We focus on cost-effective LVLMs (<10B parameters) suitable for both research and commercial use, following a standardized evaluation protocol and rubric to measure the LVLM judge’s accuracy. Experimental results reveal notable variability: while some open LVLM judges achieve GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), others struggle (below ~10% agreement). Our findings highlight that state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks, though biases such as positional preference and length bias persist.
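
Two of the protocol details above, pairwise judging with a position-swap consistency check and agreement with GPT-4 verdicts, can be sketched as follows; `judge` is a placeholder for any LVLM call, not an API from the paper:

```python
def pairwise_verdict(judge, chart, question, answer_a, answer_b):
    """Query the judge twice with answer order swapped; disagreement between
    the two verdicts signals positional bias."""
    v1 = judge(chart, question, first=answer_a, second=answer_b)  # "first" or "second"
    v2 = judge(chart, question, first=answer_b, second=answer_a)
    flipped = {"first": "second", "second": "first"}[v2]
    consistent = (v1 == flipped)          # same winner regardless of order
    return v1, consistent

def agreement_with_reference(verdicts, reference):
    """Fraction of items where an open judge matches GPT-4's verdict; the
    ~80% vs. ~10% numbers above are agreement rates of this kind."""
    return sum(v == r for v, r in zip(verdicts, reference)) / len(verdicts)
```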


Paper and project links

PDF Accepted at ACL 2025 Industry Track

Summary
Many chart-related downstream tasks have emerged recently, such as chart question answering, chart2text, and fact-checking. Large Vision-Language Models (LVLMs) show promise on these tasks, but evaluating them is costly and time-consuming, limiting real-world deployment. This work comprehensively evaluates 13 open-source LVLMs as judges for diverse chart comprehension and reasoning tasks, designing both pairwise and pointwise evaluation tasks covering criteria such as factual correctness, informativeness, and relevancy, and focusing on cost-effective models (<10B parameters) suitable for both research and commercial use. Results reveal notable variability: some open LVLM judges reach GPT-4-level evaluation performance (about 80% agreement with GPT-4 judgments), while others fall below ~10% agreement. Despite persistent biases such as positional preference and length bias, state-of-the-art open-source LVLMs can serve as cost-effective automatic evaluators for chart-related tasks.

Key Takeaways

  1. Chart comprehension tasks are drawing increasing attention, including chart question answering, chart2text, and fact-checking.
  2. Large Vision-Language Models (LVLMs) show promise on chart tasks, but their evaluation is difficult and costly.
  3. The paper proposes using open-source LVLMs as judges for comprehensive evaluation of chart comprehension tasks.
  4. Evaluation covers both pairwise and pointwise tasks, judged on criteria such as factual correctness, informativeness, and relevancy.
  5. The study focuses on cost-effective LVLMs (<10B parameters) suitable for research and commercial use.
  6. Experiments show some open LVLM judges approach GPT-4-level performance, but variability across models is substantial.


VCRBench: Exploring Long-form Causal Reasoning Capabilities of Large Video Language Models

Authors:Pritam Sarkar, Ali Etemad

Despite recent advances in video understanding, the capabilities of Large Video Language Models (LVLMs) to perform video-based causal reasoning remains underexplored, largely due to the absence of relevant and dedicated benchmarks for evaluating causal reasoning in visually grounded and goal-driven settings. To fill this gap, we introduce a novel benchmark named Video-based long-form Causal Reasoning (VCRBench). We create VCRBench using procedural videos of simple everyday activities, where the steps are deliberately shuffled with each clip capturing a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal. Moreover, the benchmark is carefully designed to prevent LVLMs from exploiting linguistic shortcuts, as seen in multiple-choice or binary QA formats, while also avoiding the challenges associated with evaluating open-ended QA. Our evaluation of state-of-the-art LVLMs on VCRBench suggests that these models struggle with video-based long-form causal reasoning, primarily due to their difficulty in modeling long-range causal dependencies directly from visual observations. As a simple step toward enabling such capabilities, we propose Recognition-Reasoning Decomposition (RRD), a modular approach that breaks video-based causal reasoning into two sub-tasks of video recognition and causal reasoning. Our experiments on VCRBench show that RRD significantly boosts accuracy on VCRBench, with gains of up to 25.2%. Finally, our thorough analysis reveals interesting insights, for instance, that LVLMs primarily rely on language knowledge for complex video-based long-form causal reasoning tasks.
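
The RRD idea above is a two-stage pipeline: perception first, then language-only causal ordering. A minimal sketch, with the recognizer and reasoner as placeholder model calls rather than the paper's exact prompts, could be:

```python
def rrd(clips, recognizer, reasoner):
    """Recognition-Reasoning Decomposition sketch.
    recognizer: LVLM call turning one shuffled clip into a text event.
    reasoner:   LLM call returning a permutation of event indices, e.g. [2, 0, 1]."""
    # Stage 1: video recognition, one caption per shuffled clip.
    events = [recognizer(clip) for clip in clips]
    # Stage 2: causal reasoning over text only.
    order = reasoner(
        "Order these steps so that each one enables the next: "
        + "; ".join(f"({i}) {e}" for i, e in enumerate(events))
    )
    return [events[i] for i in order]   # the reconstructed causal sequence
```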


Paper and project links

PDF

Summary

This paper introduces Video-based long-form Causal Reasoning (VCRBench), a benchmark for evaluating the causal reasoning of Large Video Language Models (LVLMs) in visually grounded, goal-driven settings. It uses procedural videos of simple everyday activities with deliberately shuffled steps, each clip capturing a key causal event, to test whether LVLMs can identify, reason about, and correctly sequence the events needed to accomplish a specific goal, while preventing linguistic shortcuts and avoiding the challenges of evaluating open-ended QA. Evaluation of state-of-the-art LVLMs shows they struggle with video-based long-form causal reasoning, mainly because they have difficulty modeling long-range causal dependencies directly from visual observations. As a simple step toward such capabilities, the authors propose Recognition-Reasoning Decomposition (RRD), a modular approach that splits video-based causal reasoning into two sub-tasks, video recognition and causal reasoning; on VCRBench, RRD boosts accuracy by up to 25.2%. Further analysis reveals interesting insights, for instance that LVLMs rely primarily on language knowledge for complex video-based long-form causal reasoning.

Key Takeaways

  1. Large Video Language Models (LVLMs) have advanced in video understanding, but their video-based causal reasoning remains underexplored.
  2. Dedicated benchmarks for evaluating causal reasoning in visually grounded, goal-driven settings have been lacking.
  3. The new benchmark VCRBench tests LVLMs' causal reasoning using procedural videos whose clips capture key causal events of everyday activities.
  4. VCRBench is designed to prevent linguistic shortcuts and to avoid the challenges of evaluating open-ended QA.
  5. LVLMs struggle with video-based long-form causal reasoning, particularly with modeling long-range causal dependencies directly from visual observations.
  6. Recognition-Reasoning Decomposition (RRD) is a modular approach that splits video-based causal reasoning into two sub-tasks, video recognition and causal reasoning, significantly improving accuracy.


Scalable UAV Multi-Hop Networking via Multi-Agent Reinforcement Learning with Large Language Models

Authors:Yanggang Xu, Weijie Hong, Jirong Zha, Geng Chen, Jianfeng Zheng, Chen-Chun Hsia, Xinlei Chen

In disaster scenarios, establishing robust emergency communication networks is critical, and unmanned aerial vehicles (UAVs) offer a promising solution to rapidly restore connectivity. However, organizing UAVs to form multi-hop networks in large-scale dynamic environments presents significant challenges, including limitations in algorithmic scalability and the vast exploration space required for coordinated decision-making. To address these issues, we propose MRLMN, a novel framework that integrates multi-agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents toward achieving optimal networking performance. The framework incorporates a grouping strategy with reward decomposition to enhance algorithmic scalability and balance decision-making across UAVs. In addition, behavioral constraints are applied to selected key UAVs to improve the robustness of the network. Furthermore, the framework integrates LLM agents, leveraging knowledge distillation to transfer their high-level decision-making capabilities to MARL agents. This enhances both the efficiency of exploration and the overall training process. In the distillation module, a Hungarian algorithm-based matching scheme is applied to align the decision outputs of the LLM and MARL agents and define the distillation loss. Extensive simulation results validate the effectiveness of our approach, demonstrating significant improvements in network performance, including enhanced coverage and communication quality.
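
The Hungarian-matching step in the distillation module can be made concrete with `scipy.optimize.linear_sum_assignment`. Below is a minimal sketch, where the cross-entropy cost and the toy shapes are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def distill_loss(marl_probs, llm_targets):
    """marl_probs:  (n_uav, n_actions) policy distributions from MARL agents.
    llm_targets: (n_uav, n_actions) one-hot decisions suggested by the LLM agent.
    Aligns each MARL agent with the most compatible LLM decision before
    averaging the matched cross-entropy as the distillation loss."""
    eps = 1e-9
    # cost[i, j] = cross-entropy of MARL agent i against LLM decision j
    cost = -(llm_targets[None, :, :] * np.log(marl_probs[:, None, :] + eps)).sum(-1)
    rows, cols = linear_sum_assignment(cost)      # Hungarian algorithm
    return cost[rows, cols].mean()                # matched distillation loss

marl = np.full((3, 4), 0.25)                      # three uncertain agents
llm = np.eye(4)[:3]                               # three one-hot LLM decisions
print(distill_loss(marl, llm))
```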


Paper and project links

PDF

Summary

UAVs hold great promise for establishing emergency communication networks in disaster scenarios, but organizing them into multi-hop networks in large-scale dynamic environments is challenging. To address this, the authors propose MRLMN, a framework that combines multi-agent reinforcement learning (MARL) and large language models (LLMs) to jointly optimize UAV agents for optimal networking performance. The framework uses a grouping strategy with reward decomposition to improve algorithmic scalability and balance decision-making across UAVs, applies behavioral constraints to key UAVs for robustness, and integrates LLM agents via knowledge distillation to transfer their high-level decision-making to MARL agents, improving exploration efficiency and training. Extensive simulations confirm significant gains in network performance, including coverage and communication quality.

Key Takeaways

  1. In disaster scenarios, establishing UAV-based emergency communication networks is critical.
  2. Organizing UAVs into multi-hop networks faces challenges in algorithmic scalability and coordinated decision-making.
  3. The MRLMN framework combines MARL and LLMs to optimize UAVs for optimal networking performance.
  4. A grouping strategy with reward decomposition improves algorithmic scalability and balances UAV decision-making.
  5. Behavioral constraints on key UAVs improve network robustness.
  6. Integrating LLM agents via knowledge distillation improves exploration efficiency and the training process.


Accelerating Chain-of-Thought Reasoning: When Goal-Gradient Importance Meets Dynamic Skipping

Authors:Ren Zhuang, Ben Wang, Shuifa Sun

Large Language Models leverage Chain-of-Thought (CoT) prompting for complex tasks, but their reasoning traces are often excessively verbose and inefficient, leading to significant computational costs and latency. Current CoT compression techniques typically rely on generic importance metrics and static compression rates, which may inadvertently remove functionally critical tokens or fail to adapt to varying reasoning complexity. To overcome these limitations, we propose Adaptive GoGI-Skip, a novel framework learning dynamic CoT compression via supervised fine-tuning. This approach introduces two synergistic innovations: (1) Goal-Gradient Importance (GoGI), a novel metric accurately identifying functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final answer loss, and (2) Adaptive Dynamic Skipping (ADS), a mechanism dynamically regulating the compression rate based on runtime model uncertainty while ensuring local coherence through an adaptive N-token constraint. To our knowledge, this is the first work unifying a goal-oriented, gradient-based importance metric with dynamic, uncertainty-aware skipping for CoT compression. Trained on compressed MATH data, Adaptive GoGI-Skip demonstrates strong cross-domain generalization across diverse reasoning benchmarks including AIME, GPQA, and GSM8K. It achieves substantial efficiency gains - reducing CoT token counts by over 45% on average and delivering 1.6-2.0 times inference speedups - while maintaining high reasoning accuracy. Notably, it significantly outperforms existing baselines by preserving accuracy even at high effective compression rates, advancing the state of the art in the CoT reasoning efficiency-accuracy trade-off.
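
The GoGI metric above is gradient-based: a CoT token matters to the extent that the answer loss is sensitive to its intermediate representation. A minimal PyTorch sketch of one way to compute such scores, assuming a Hugging Face-style causal LM that accepts `inputs_embeds` (the paper's exact layer choice and normalization may differ):

```python
import torch

def gogi_scores(model, input_ids, answer_loss_fn):
    """Returns one importance score per token: the gradient norm of the
    final-answer loss w.r.t. that token's input embedding.
    answer_loss_fn is a placeholder computing the loss restricted to the
    answer tokens only."""
    embeds = model.get_input_embeddings()(input_ids)
    embeds.retain_grad()                      # keep grads on a non-leaf tensor
    logits = model(inputs_embeds=embeds).logits
    loss = answer_loss_fn(logits)             # loss over answer tokens only
    loss.backward()
    return embeds.grad.norm(dim=-1)           # shape: (batch, seq_len)

# Tokens with the lowest scores are candidates for skipping; ADS would then
# pick the cutoff from runtime uncertainty instead of a fixed compression rate.
```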


Paper and project links

PDF

Summary
Large language models use chain-of-thought (CoT) prompting for complex tasks, but their reasoning traces are often verbose and inefficient, incurring significant compute cost and latency. To address this, the authors propose Adaptive GoGI-Skip, a framework that learns dynamic CoT compression via supervised fine-tuning. It introduces two innovations: Goal-Gradient Importance (GoGI), which accurately identifies functionally relevant tokens by measuring the gradient influence of their intermediate representations on the final-answer loss, and Adaptive Dynamic Skipping (ADS), which dynamically regulates the compression rate based on runtime model uncertainty. Trained on compressed MATH data, the framework generalizes across diverse reasoning benchmarks including AIME, GPQA, and GSM8K, reducing CoT token counts by over 45% on average and delivering 1.6-2.0x inference speedups while maintaining high reasoning accuracy.

Key Takeaways

  1. Chain-of-thought (CoT) prompting is important for complex tasks in large language models, but it is often verbose and inefficient.
  2. Current CoT compression techniques may remove critical tokens or fail to adapt to varying reasoning complexity.
  3. Adaptive GoGI-Skip learns dynamic CoT compression via supervised fine-tuning.
  4. The framework combines two innovations: Goal-Gradient Importance (GoGI) and Adaptive Dynamic Skipping (ADS).
  5. GoGI accurately identifies functionally relevant tokens, while ADS dynamically adjusts the compression rate to model uncertainty.
  6. The framework shows strong generalization across diverse reasoning benchmarks with substantial efficiency gains.


Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning

Authors:Xinyue Wang, Biwei Huang

Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning – where known components are reconfigured to handle new situations – we introduce World Modeling with Compositional Causal Components (WM3C). This novel framework enhances RL generalization by learning and leveraging compositional causal components. Unlike previous approaches focusing on invariant representation learning or meta-learning, WM3C identifies and utilizes causal dynamics among composable elements, facilitating robust adaptation to new tasks. Our approach integrates language as a compositional modality to decompose the latent space into meaningful components and provides theoretical guarantees for their unique identification under mild assumptions. Our practical implementation uses a masked autoencoder with mutual information constraints and adaptive sparsity regularization to capture high-level semantic information and effectively disentangle transition dynamics. Experiments on numerical simulations and real-world robotic manipulation tasks demonstrate that WM3C significantly outperforms existing methods in identifying latent processes, improving policy learning, and generalizing to unseen tasks.


Paper and project links

PDF Published as a conference paper at ICLR 2025

Summary
Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning, the authors introduce World Modeling with Compositional Causal Components (WM3C), a framework that enhances RL generalization by learning and leveraging compositional causal components. Unlike approaches focused on invariant representation learning or meta-learning, WM3C identifies and utilizes causal dynamics among composable elements, facilitating robust adaptation to new tasks. The approach integrates language as a compositional modality to decompose the latent space into meaningful components, with theoretical guarantees for their unique identification under mild assumptions. The practical implementation uses a masked autoencoder with mutual information constraints and adaptive sparsity regularization to capture high-level semantic information and disentangle transition dynamics. Experiments on numerical simulations and real-world robotic manipulation tasks show that WM3C significantly outperforms existing methods in identifying latent processes, improving policy learning, and generalizing to unseen tasks.

Key Takeaways

  1. Generalization in reinforcement learning is challenging, especially when agents face new environments.
  2. The WM3C framework improves RL generalization by learning and leveraging compositional causal components.
  3. Unlike prior approaches, WM3C identifies and adapts to causal dynamics among composable elements.
  4. WM3C uses language as a compositional modality to decompose the latent space, with theoretical identification guarantees.
  5. A masked autoencoder with mutual information constraints and adaptive sparsity regularization captures high-level semantic information.
  6. WM3C significantly outperforms existing methods in identifying latent processes, policy learning, and generalization.


An Identifiable Cost-Aware Causal Decision-Making Framework Using Counterfactual Reasoning

Authors:Ruichu Cai, Xi Chen, Jie Qiao, Zijian Li, Yuequn Liu, Wei Chen, Keli Zhang, Jiale Zheng

Decision making under abnormal conditions is a critical process that involves evaluating the current state and determining the optimal action to restore the system to a normal state at an acceptable cost. However, in such scenarios, existing decision-making frameworks highly rely on reinforcement learning or root cause analysis, resulting in them frequently neglecting the cost of the actions or failing to incorporate causal mechanisms adequately. By relaxing the existing causal decision framework to solve the necessary cause, we propose a minimum-cost causal decision (MiCCD) framework via counterfactual reasoning to address the above challenges. Emphasis is placed on making counterfactual reasoning processes identifiable in the presence of a large amount of mixed anomaly data, as well as finding the optimal intervention state in a continuous decision space. Specifically, it formulates a surrogate model based on causal graphs, using abnormal pattern clustering labels as supervisory signals. This enables the approximation of the structural causal model among the variables and lays a foundation for identifiable counterfactual reasoning. With the causal structure approximated, we then establish an optimization model based on counterfactual estimation. The Sequential Least Squares Programming (SLSQP) algorithm is further employed to optimize intervention strategies while taking costs into account. Experimental evaluations on both synthetic and real-world datasets reveal that MiCCD outperforms conventional methods across multiple metrics, including F1-score, cost efficiency, and ranking quality (nDCG@k values), thus validating its efficacy and broad applicability.
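
The cost-aware intervention step maps naturally onto a constrained optimization: minimize intervention cost subject to the counterfactual prediction returning to the normal region. A toy sketch with `scipy.optimize.minimize` and SLSQP, where the linear surrogate stands in for the learned causal model:

```python
import numpy as np
from scipy.optimize import minimize

x0 = np.zeros(3)                      # current (no-op) intervention
cost_w = np.array([1.0, 2.0, 0.5])    # hypothetical per-variable intervention costs

def abnormality(x):
    # Placeholder for "predict the post-intervention state via the surrogate
    # causal model"; <= 0 means the system is back in the normal region.
    return 1.0 - x.sum()

res = minimize(
    fun=lambda x: cost_w @ np.abs(x),                     # total action cost
    x0=x0,
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": lambda x: -abnormality(x)}],  # normality
    bounds=[(0.0, 1.0)] * 3,
)
print(res.x)   # cheapest intervention restoring normality under the toy model
```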


Paper and project links

PDF

Summary
Decision making under abnormal conditions involves evaluating the current state and determining the optimal action that restores the system to a normal state at an acceptable cost. Since existing frameworks often ignore action costs or fail to incorporate causal mechanisms, the authors propose a minimum-cost causal decision (MiCCD) framework based on counterfactual reasoning. The framework builds a surrogate model over causal graphs, using abnormal-pattern clustering labels as supervisory signals, to approximate the structural causal model among variables and lay the foundation for identifiable counterfactual reasoning. An optimization model based on counterfactual estimation is then solved with the Sequential Least Squares Programming (SLSQP) algorithm to optimize intervention strategies while accounting for costs. On both synthetic and real-world datasets, MiCCD outperforms conventional methods across multiple metrics, including F1-score, cost efficiency, and ranking quality (nDCG@k values), validating its efficacy and broad applicability.

Key Takeaways

  1. Decision making under abnormal conditions is critical: evaluate the current state and determine the optimal action to restore the system.
  2. Existing decision frameworks often neglect action costs or fail to adequately incorporate causal mechanisms.
  3. The minimum-cost causal decision (MiCCD) framework uses counterfactual reasoning to handle mixed anomaly data.
  4. MiCCD builds a surrogate model over causal graphs, supervised by abnormal-pattern clustering labels.
  5. This approximates the structural causal model among variables, enabling identifiable counterfactual reasoning.
  6. Cost-aware counterfactual estimation is optimized with the Sequential Least Squares Programming (SLSQP) algorithm.


AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale

Authors:Yunjie Ji, Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Han Zhao, Xiangang Li

We present AM-Thinking-v1, a 32B dense language model that advances the frontier of reasoning, embodying the collaborative spirit of open-source innovation. Outperforming DeepSeek-R1 and rivaling leading Mixture-of-Experts (MoE) models like Qwen3-235B-A22B and Seed1.5-Thinking, AM-Thinking-v1 achieves impressive scores of 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, showcasing state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built entirely from the open-source Qwen2.5-32B base model and publicly available queries, AM-Thinking-v1 leverages a meticulously crafted post-training pipeline - combining supervised fine-tuning and reinforcement learning - to deliver exceptional reasoning capabilities. This work demonstrates that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning. By striking a balance between top-tier performance and real-world usability, we hope AM-Thinking-v1 inspires further collaborative efforts to harness mid-scale models, pushing reasoning boundaries while keeping accessibility at the core of innovation. We have open-sourced our model on Hugging Face: https://huggingface.co/a-m-team/AM-Thinking-v1.
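
Since the checkpoint is public, trying it follows the standard transformers loading pattern. A minimal usage sketch, assuming the repository ships a chat template; the prompt and generation settings are illustrative, not the authors' recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "a-m-team/AM-Thinking-v1"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Illustrative math prompt; sampling settings left at defaults.
messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even."}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```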


Paper and project links

PDF

Summary

AM-Thinking-v1 is a 32B dense language model with advanced reasoning capabilities that embodies the collaborative spirit of open-source innovation. It scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, demonstrating state-of-the-art mathematical and coding capabilities among open-source models of similar scale. Built from the open-source Qwen2.5-32B base model and publicly available queries, it uses a meticulously crafted post-training pipeline combining supervised fine-tuning and reinforcement learning to deliver exceptional reasoning. The work shows that the open-source community can achieve high performance at the 32B scale, a practical sweet spot for deployment and fine-tuning, and the authors hope it inspires further collaborative efforts on mid-scale models that push reasoning boundaries while keeping accessibility at the core of innovation. The model is open-sourced on Hugging Face.

Key Takeaways

  1. AM-Thinking-v1 is a language model with advanced reasoning capabilities.
  2. The model achieves outstanding performance across multiple benchmarks.
  3. AM-Thinking-v1 is built from the open-source Qwen2.5-32B base model and publicly available queries.
  4. The model uses a post-training pipeline combining supervised fine-tuning and reinforcement learning.
  5. The model demonstrates that the open-source community can achieve high performance at the 32B scale.
  6. AM-Thinking-v1 aims to inspire further collaborative efforts to harness mid-scale models for reasoning.


Training Strategies for Efficient Embodied Reasoning

Authors:William Chen, Suneel Belkhale, Suvir Mirchandani, Oier Mees, Danny Driess, Karl Pertsch, Sergey Levine

Robot chain-of-thought reasoning (CoT) – wherein a model predicts helpful intermediate representations before choosing actions – provides an effective method for improving the generalization and performance of robot policies, especially vision-language-action models (VLAs). While such approaches have been shown to improve performance and generalization, they suffer from core limitations, like needing specialized robot reasoning data and slow inference speeds. To design new robot reasoning approaches that address these issues, a more complete characterization of why reasoning helps policy performance is critical. We hypothesize several mechanisms by which robot reasoning improves policies – (1) better representation learning, (2) improved learning curricularization, and (3) increased expressivity – then devise simple variants of robot CoT reasoning to isolate and test each one. We find that learning to generate reasonings does lead to better VLA representations, while attending to the reasonings aids in actually leveraging these features for improved action prediction. Our results provide us with a better understanding of why CoT reasoning helps VLAs, which we use to introduce two simple and lightweight alternative recipes for robot reasoning. Our proposed approaches achieve significant performance gains over non-reasoning policies, state-of-the-art results on the LIBERO-90 benchmark, and a 3x inference speedup compared to standard robot reasoning.


Paper and project links

PDF

Summary

Robot chain-of-thought (CoT) reasoning, wherein a model predicts helpful intermediate representations before choosing actions, improves the generalization and performance of robot policies, especially vision-language-action models (VLAs). However, it suffers from core limitations: it requires specialized robot reasoning data and has slow inference. To design better approaches, the authors hypothesize several mechanisms by which reasoning helps policies, including better representation learning, improved learning curricularization, and increased expressivity, and devise simple CoT variants to isolate and test each one. They find that learning to generate reasonings leads to better VLA representations, while attending to the reasonings helps actually leverage those features for improved action prediction. Based on this understanding, they introduce two simple, lightweight recipes for robot reasoning that achieve significant gains over non-reasoning policies, state-of-the-art results on the LIBERO-90 benchmark, and a 3x inference speedup compared to standard robot reasoning.

Key Takeaways

  1. Robot chain-of-thought (CoT) reasoning effectively improves the generalization and performance of robot policies, especially vision-language-action models (VLAs).
  2. Core limitations of CoT reasoning include the need for specialized robot reasoning data and slow inference speeds.
  3. Several mechanisms are hypothesized to explain why CoT helps policy performance: better representation learning, improved learning curricularization, and increased expressivity.
  4. Learning to generate reasonings yields better VLA representations, and attending to those reasonings helps improve action prediction.
  5. The proposed simple, lightweight robot reasoning recipes bring significant performance gains and state-of-the-art results on the LIBERO-90 benchmark.
  6. Compared with standard robot reasoning, the proposed approaches run 3x faster at inference.
  7. The work provides a better understanding of why CoT reasoning helps VLAs.


DSADF: Thinking Fast and Slow for Decision Making

Authors:Alex Zhihao Dou, Dongfei Cui, Jun Yan, Weida Wang, Benteng Chen, Haoming Wang, Zeke Xie, Shufei Zhang

Although Reinforcement Learning (RL) agents are effective in well-defined environments, they often struggle to generalize their learned policies to dynamic settings due to their reliance on trial-and-error interactions. Recent work has explored applying Large Language Models (LLMs) or Vision Language Models (VLMs) to boost the generalization of RL agents through policy optimization guidance or prior knowledge. However, these approaches often lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decision-making in unfamiliar environments and efficiency bottlenecks. Making full use of the inferential capabilities of foundation models and the rapid response capabilities of RL agents and enhancing the interaction between the two to form a dual system is still a lingering scientific question. To address this problem, we draw inspiration from Kahneman’s theory of fast thinking (System 1) and slow thinking (System 2), demonstrating that balancing intuition and deep reasoning can achieve nimble decision-making in a complex world. In this study, we propose a Dual-System Adaptive Decision Framework (DSADF), integrating two complementary modules: System 1, comprising an RL agent and a memory space for fast and intuitive decision making, and System 2, driven by a VLM for deep and analytical reasoning. DSADF facilitates efficient and adaptive decision-making by combining the strengths of both systems. The empirical study in the video game environment: Crafter and Housekeep demonstrates the effectiveness of our proposed method, showing significant improvements in decision abilities for both unseen and known tasks.
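
The fast/slow coordination above amounts to a confidence-based dispatcher between the two systems. A minimal sketch, where `vlm_reasoner` is a placeholder for a VLM call and the entropy threshold is an illustrative choice, not the paper's exact rule:

```python
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def act(obs, rl_policy, vlm_reasoner, memory, threshold=1.0):
    """Dual-system dispatch: fast RL decision when confident, otherwise
    defer to the slower analytical reasoner and cache its plan."""
    probs = rl_policy(obs)                               # action distribution
    if entropy(probs) < threshold:                       # System 1: fast, intuitive
        return max(range(len(probs)), key=probs.__getitem__)
    action = vlm_reasoner(obs, memory)                   # System 2: slow, analytical
    memory.append((obs, action))                         # reuse reasoning later
    return action

# e.g. act(obs, rl_policy=lambda o: [0.9, 0.05, 0.05], vlm_reasoner=..., memory=[])
```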


Paper and project links

PDF

Summary: Reinforcement learning (RL) agents perform well in well-defined environments but often fail to generalize their learned policies to dynamic settings. Recent work applies Large Language Models (LLMs) or Vision Language Models (VLMs) to improve the generalization of RL agents, but these approaches lack seamless coordination between the RL agent and the foundation model, leading to unreasonable decisions in unfamiliar environments and efficiency bottlenecks. Drawing on Kahneman's theory of fast thinking (System 1) and slow thinking (System 2), this work proposes a Dual-System Adaptive Decision Framework (DSADF) that balances intuition and deep reasoning for nimble decision-making: System 1 pairs an RL agent with a memory space for fast, intuitive decisions, while System 2 is driven by a VLM for deep, analytical reasoning. Empirical studies in the Crafter and Housekeep video game environments demonstrate the method's effectiveness, with significant improvements in decision abilities on both unseen and known tasks.

Key Takeaways:

  1. Reinforcement learning struggles with policy generalization in dynamic environments.
  2. Large language models and vision language models can be used to improve the generalization of RL policies.
  3. Current methods face challenges in coordinating the RL agent with the foundation model.
  4. The DSADF framework combines intuition and deep reasoning for fast, adaptive decision making.
  5. DSADF comprises two complementary modules: System 1 for fast, intuitive decisions and System 2 for deep, analytical reasoning.
  6. The design draws on Kahneman's System 1/System 2 theory to balance intuition and deep reasoning.


Visually Interpretable Subtask Reasoning for Visual Question Answering

Authors:Yu Cheng, Arushi Goel, Hakan Bilen

Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.


Paper and project links

PDF

Summary
Answering complex visual questions requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. To address the high computational cost and reduced accuracy of existing decomposition-based methods, this work introduces VISTAR, a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within multimodal large language models. Experiments show that VISTAR improves reasoning accuracy while maintaining interpretability.

Key Takeaways

  1. Complex visual question answering requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding.
  2. Recent methods improve the interpretability of multimodal large language models (MLLMs) by decomposing tasks into sub-task programs.
  3. Although these methods improve interpretability, they are computationally expensive and less accurate due to poor adaptation to target data.
  4. VISTAR is a subtask-driven training framework that enhances interpretability and reasoning by generating textual and visual explanations within MLLMs.
  5. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences).
  6. Experiments show that VISTAR consistently improves reasoning accuracy while maintaining interpretability.


FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning

Authors:Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy

Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.


Paper and project links

PDF

Summary: Safety alignment in large language models (LLMs) often leads to over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. FalseReject is a comprehensive resource of 16k seemingly toxic queries with structured responses across 44 safety-related categories. The authors propose a graph-informed adversarial multi-agent interaction framework to generate diverse, complex prompts, structuring responses with explicit reasoning so models can accurately distinguish safe from unsafe contexts. FalseReject includes training datasets tailored to both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Benchmarking 29 state-of-the-art LLMs reveals persistent over-refusal challenges; supervised fine-tuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.

Key Takeaways

  • Safety alignment approaches in large language models can over-refuse benign queries.
  • The FalseReject resource provides seemingly toxic queries with structured responses to address this problem.
  • A graph-informed adversarial multi-agent interaction framework improves the model's ability to distinguish safe from unsafe contexts.
  • FalseReject includes training sets tailored to both standard instruction-tuned models and reasoning-oriented models.
  • FalseReject includes a human-annotated benchmark test set for evaluating model performance.
  • Benchmarking shows that even state-of-the-art LLMs face over-refusal challenges.


MilChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Remote Sensing

Authors:Aybora Koksal, A. Aydin Alatan

Remarkable capabilities in understanding and generating text-image content have been demonstrated by recent advancements in multimodal large language models (MLLMs). However, their effectiveness in specialized domains-particularly those requiring resource-efficient and domain-specific adaptations-has remained limited. In this work, a lightweight multimodal language model termed MilChat is introduced, specifically adapted to analyze remote sensing imagery in secluded areas, including challenging missile launch sites. A new dataset, MilData, was compiled by verifying hundreds of aerial images through expert review, and subtle military installations were highlighted via detailed captions. Supervised fine-tuning on a 2B-parameter open-source MLLM with chain-of-thought (CoT) reasoning annotations was performed, enabling more accurate and interpretable explanations. Additionally, Group Relative Policy Optimization (GRPO) was leveraged to enhance the model’s ability to detect critical domain-specific cues-such as defensive layouts and key military structures-while minimizing false positives on civilian scenes. Through empirical evaluations, it has been shown that MilChat significantly outperforms both larger, general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics. Over 80% recall and 98% precision were achieved on the newly proposed MilData benchmark, underscoring the potency of targeted fine-tuning and reinforcement learning in specialized real-world applications.


Paper and project links

PDF Submitted to JSTARS on April 2, 2025. Code and dataset will be available upon acceptance

Summary: Recent multimodal large language models (MLLMs) show remarkable capabilities in understanding and generating text-image content, but their effectiveness in specialized domains requiring resource-efficient, domain-specific adaptation remains limited. This work introduces MilChat, a lightweight multimodal language model adapted to analyze remote sensing imagery of secluded areas, including challenging missile launch sites. A new dataset, MilData, was compiled by verifying hundreds of aerial images through expert review, with detailed captions highlighting subtle military installations. A 2B-parameter open-source MLLM was supervised fine-tuned with chain-of-thought (CoT) reasoning annotations for more accurate, interpretable explanations, and Group Relative Policy Optimization (GRPO) was leveraged to improve detection of domain-specific cues such as defensive layouts and key military structures while minimizing false positives on civilian scenes. Empirically, MilChat significantly outperforms larger general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics, achieving over 80% recall and 98% precision on the MilData benchmark and underscoring the potency of targeted fine-tuning and reinforcement learning in specialized real-world applications.

Key Takeaways

  1. Multimodal large language models (MLLMs) demonstrate strong text and image understanding capabilities.
  2. Specialized domains such as remote sensing image analysis require resource-efficient, domain-specific model adaptation.
  3. MilChat is a lightweight multimodal language model adapted to remote sensing imagery, including challenging scenes such as missile launch sites.
  4. The new MilData dataset was created via expert review, with detailed captions highlighting subtle military installations.
  5. Supervised fine-tuning with chain-of-thought (CoT) reasoning annotations improves the model's accuracy and interpretability.
  6. Group Relative Policy Optimization (GRPO) improves the model's ability to detect domain-specific cues such as defensive layouts and military structures.


Practical Reasoning Interruption Attacks on Reasoning Large Language Models

Authors:Yu Cui, Cong Zuo

Reasoning large language models (RLLMs) have demonstrated outstanding performance across a variety of tasks, yet they also expose numerous security vulnerabilities. Most of these vulnerabilities have centered on the generation of unsafe content. However, recent work has identified a distinct “thinking-stopped” vulnerability in DeepSeek-R1: under adversarial prompts, the model’s reasoning process ceases at the system level and produces an empty final answer. Building upon this vulnerability, researchers developed a novel prompt injection attack, termed reasoning interruption attack, and also offered an initial analysis of its root cause. Through extensive experiments, we verify the previous analyses, correct key errors based on three experimental findings, and present a more rigorous explanation of the fundamental causes driving the vulnerability. Moreover, existing attacks typically require over 2,000 tokens, impose significant overhead, reduce practicality, and are easily detected. To overcome these limitations, we propose the first practical reasoning interruption attack. It succeeds with just 109 tokens by exploiting our newly uncovered “reasoning token overflow” (RTO) effect to overwrite the model’s final answer, forcing it to return an invalid response. Experimental results demonstrate that our proposed attack is highly effective. Furthermore, we discover that the method for triggering RTO differs between the official DeepSeek-R1 release and common unofficial deployments. As a broadened application of RTO, we also construct a novel jailbreak attack that enables the transfer of unsafe content within the reasoning tokens into final answer, thereby exposing it to the user. Our work carries significant implications for enhancing the security of RLLMs.


Paper and project links

PDF

Summary

Reasoning large language models (RLLMs) perform outstandingly across a variety of tasks, yet they also expose numerous security vulnerabilities. Recent work identified a distinct "thinking-stopped" vulnerability in DeepSeek-R1: under adversarial prompts, the model's reasoning process ceases at the system level and produces an empty final answer. Building on this, researchers developed a novel prompt injection attack termed the reasoning interruption attack and offered an initial analysis of its root cause. Through extensive experiments, the authors verify the previous analyses, correct key errors based on three experimental findings, and present a more rigorous explanation of the fundamental causes of the vulnerability. Because existing attacks typically require over 2,000 tokens, impose significant overhead, and are easily detected, they propose the first practical reasoning interruption attack, which succeeds with just 109 tokens by exploiting a newly uncovered "reasoning token overflow" (RTO) effect to overwrite the model's final answer and force an invalid response. They further find that the method for triggering RTO differs between the official DeepSeek-R1 release and common unofficial deployments, and, as a broadened application of RTO, they construct a novel jailbreak attack that transfers unsafe content within the reasoning tokens into the final answer, exposing it to the user. The work carries significant implications for enhancing the security of RLLMs.

Key Takeaways

  1. DeepSeek-R1 has a distinct "thinking-stopped" vulnerability: under adversarial prompts, the model's reasoning process halts.
  2. Researchers exploited this vulnerability to develop a novel prompt injection attack called the reasoning interruption attack.
  3. Extensive experiments produced a more rigorous explanation of the vulnerability and corrected errors in earlier analyses.
  4. The proposed attack is far more efficient, succeeding with just 109 tokens.
  5. The method for triggering reasoning token overflow (RTO) differs between the official DeepSeek-R1 release and common unofficial deployments.
  6. Building on the RTO effect, a novel jailbreak attack can expose unsafe content from the reasoning tokens to the user.


A New DAPO Algorithm for Stock Trading

Authors:Ruijian Zha, Bojun Liu

Recent advances in reinforcement learning, such as Dynamic Sampling Policy Optimization (DAPO), show strong performance when paired with large language models (LLMs). Motivated by this success, we ask whether similar gains can be realized in financial trading. We design a trading agent that combines an improved Group Relative Policy Optimization (GRPO) algorithm, augmented with ideas from DAPO, with LLM-based risk and sentiment signals extracted from financial news. On the NASDAQ-100 index (FNSPID dataset), our agent attains a cumulative return of 230.49 percent and an information ratio of 0.37, outperforming the CPPO-DeepSeek baseline. It also cuts training time from about 8 hours to 2.5 hours over 100 epochs while markedly reducing RAM usage. The proposed RL-LLM framework offers a scalable path toward data-efficient trading agents. Code: https://github.com/Ruijian-Zha/FinRL-DAPO-SR/



Paper and project links

PDF Accepted to IEEE IDS 2025 Special Track: Financial Reinforcement Learning and Foundation Models (FinRLFM). 3 pages, 2 figures, 3 tables

Summary

Recent advances pairing reinforcement learning with large language models (LLMs), such as Dynamic Sampling Policy Optimization (DAPO), show strong performance. Motivated by this, the study asks whether similar gains can be realized in financial trading, designing a trading agent that combines an improved Group Relative Policy Optimization (GRPO) algorithm, augmented with ideas from DAPO, with LLM-based risk and sentiment signals extracted from financial news. On the NASDAQ-100 index (FNSPID dataset), the agent attains a cumulative return of 230.49% and an information ratio of 0.37, outperforming the CPPO-DeepSeek baseline, while cutting training time from about 8 hours to 2.5 hours over 100 epochs and markedly reducing RAM usage. The proposed RL-LLM framework offers a scalable path toward data-efficient trading agents.

Key Takeaways

  1. Combining reinforcement learning with large language models has recently shown strong performance.
  2. The study applies an improved Group Relative Policy Optimization algorithm to financial trading.
  3. The GRPO algorithm, augmented with DAPO ideas and LLM-extracted risk and sentiment signals from financial news, achieves a high return on the NASDAQ-100.
  4. The agent outperforms the CPPO-DeepSeek baseline, attaining a cumulative return of 230.49% and an information ratio of 0.37.
  5. The framework cuts training time from about 8 hours to 2.5 hours over 100 epochs.
  6. The framework also reduces RAM usage, offering a scalable path toward data-efficient trading agents.


APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning

Authors:Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh

Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct or not almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousands) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with an LLM’s reasoning abilities to achieve better proof-generation results at a low sampling budget. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-K budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 75.0% among 7B-parameter models while keeping the sampling budget below one thousand. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving.
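
The repair loop described above can be summarized as a compiler-in-the-loop sketch. All callables below are placeholders for the prover LLM, the Lean compiler interface, and hammer-style automated tactics; the toy additionally assumes failing goals appear as substrings of the proof script:

```python
def apollo_repair(theorem, llm_prove, lean_check, auto_solver, max_attempts=4):
    """Schematic APOLLO-style proof repair, not the tool's actual interface."""
    proof = llm_prove(theorem)                         # initial candidate proof
    for _ in range(max_attempts):                      # user-controlled budget
        ok, failing_goals = lean_check(theorem, proof) # near-instant verification
        if ok:
            return proof
        for goal in failing_goals:                     # isolate failing sub-lemmas
            fix = auto_solver(goal) or llm_prove(goal) # cheap solver first, then
            proof = proof.replace(goal, fix)           # a low-top-K LLM call
    return None                                        # no verified proof in budget
```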


Paper and project links

PDF

Summary

Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning in which machines prove mathematical theorems using formal languages like Lean. A formal verification system can check a proof almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains formidable; the usual approach prompts the LLM many times (up to several thousands) until a generated proof passes verification. The paper presents APOLLO, a modular, model-agnostic pipeline that combines the Lean compiler with an LLM's reasoning abilities to achieve better proof generation at a low sampling budget. APOLLO automates a process in which the LLM generates proofs, a set of agents analyzes them, fixes syntax errors, identifies mistakes using Lean, isolates failing sub-lemmas, invokes automated solvers, and calls an LLM on each remaining goal with a low top-K budget; repaired sub-proofs are recombined and reverified, iterating up to a user-controlled number of attempts. On the miniF2F benchmark, APOLLO sets a new state-of-the-art accuracy of 75.0% among 7B-parameter models while keeping the sampling budget below one thousand. It also raises Goedel-Prover-SFT's state-of-the-art accuracy to 65.6% while cutting sample complexity from 25,600 to a few hundred, and lifts general-purpose models (o3-mini, o4-mini) from 3-7% to over 40% accuracy. Targeted, compiler-guided repair of LLM outputs thus yields dramatic gains in both efficiency and correctness.

Key Takeaways

  1. Formal reasoning and automated theorem proving are a challenging subfield of machine learning, using formal languages such as Lean to prove mathematical theorems.
  2. The usual approach in the literature prompts the LLM many times (up to several thousands) until a generated proof passes the verification system.
  3. APOLLO is a pipeline combining the Lean compiler with an LLM's reasoning abilities to improve proof-generation efficiency and accuracy.
  4. APOLLO automates the proof-generation process, including proof analysis, error repair, and sub-goal solving.
  5. APOLLO achieves new state-of-the-art accuracy on multiple benchmarks, demonstrating substantial gains in both efficiency and correctness.
  6. APOLLO's approach suggests a general paradigm for scalable automated theorem proving.



Author: Kedreamix
Copyright notice: Unless otherwise noted, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!