Published: 2025-10-25

Updated: 2025-11-27

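⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are suitable only for initial screening before reading the papers.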
2025-10-25 Update

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

Authors:Xi He, Sirui Lu, Bei Zeng

We present a multi-agent, human-in-the-loop workflow that co-designs quantum codes with prescribed transversal diagonal gates. It builds on the Subset-Sum Linear Programming (SSLP) framework (arXiv:2504.20847), which partitions basis strings by modular residues and enforces $Z$-marginal Knill-Laflamme (KL) equalities via small LPs. The workflow is powered by GPT-5 and implemented within TeXRA (https://texra.ai), a multi-agent research assistant platform that supports an iterative tool-use loop agent and a derivation-then-edit workflow reasoning agent. We work in a LaTeX-Python environment where agents reason, edit documents, execute code, and synchronize their work to Git/Overleaf. Within this workspace, three roles collaborate: a Synthesis Agent formulates the problem; a Search Agent sweeps/screens candidates and exactifies numerics into rationals; and an Audit Agent independently checks all KL equalities and the induced logical action. As a first step we focus on distance $d=2$ with nondegenerate residues. For code dimension $K\in\{2,3,4\}$ and $n\le6$ qubits, systematic sweeps yield certificate-backed tables cataloging attainable cyclic logical groups, all realized by new codes; e.g., for $K=3$ we obtain order $16$ at $n=6$. From verified instances, the Synthesis Agent abstracts recurring structures into closed-form families and proves they satisfy the KL equalities for all parameters. It further demonstrates that SSLP accommodates residue degeneracy by exhibiting a new $((6,4,2))$ code implementing the transversal controlled-phase $\mathrm{diag}(1,1,1,i)$. Overall, the workflow recasts diagonal-transversal feasibility as an analytical pipeline executed at scale, combining systematic enumeration with exact analytical reconstruction. It yields reproducible code constructions, supports targeted extensions to larger $K$ and higher distances, and leads toward data-driven classification.

Paper and project links

PDF 29 pages, 2 figures

Summary

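This paper presents a multi-agent, human-in-the-loop workflow that uses the Subset-Sum Linear Programming (SSLP) framework to co-design quantum codes with prescribed transversal diagonal gates. Powered by GPT-5 and implemented on the TeXRA platform, the workflow runs in a LaTeX-Python environment where agents collaborate on problem synthesis, candidate search and screening, exactification of numerics into rationals, and verification of the logical action. For distance $d=2$ with nondegenerate residues, systematic sweeps yield certificate-backed tables cataloging the attainable cyclic logical groups. The workflow also shows that SSLP accommodates residue degeneracy by exhibiting a new $((6,4,2))$ code that implements a transversal controlled-phase gate. Overall, it recasts diagonal-transversal feasibility as an analytical pipeline executed at scale, combining systematic enumeration with exact analytical reconstruction to yield reproducible code constructions and to support targeted extensions to larger $K$ and higher distances.

Key Takeaways

A multi-agent, human-in-the-loop workflow is proposed for designing quantum codes.
GPT-5 and the multi-agent research assistant platform TeXRA support agents in several roles working together.
The workflow builds on the Subset-Sum Linear Programming (SSLP) framework to co-design quantum codes with prescribed transversal diagonal gates.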
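
As a rough illustration of the kind of check the Audit Agent performs, the sketch below verifies the Knill-Laflamme (KL) conditions for distance 2 on the well-known [[4,2,2]] code (a $((4,4,2))$ code) using NumPy. This code is chosen only because it is a familiar example; it is not one of the paper's new codes. The helper names (`basis_state`, `single_qubit_op`, `codewords_422`, `satisfies_kl_d2`) are hypothetical, the codewords are assumed orthonormal, and the paper's pipeline works with exact rationals and small LPs rather than this floating-point test.

```python
import itertools
import numpy as np

N_QUBITS = 4
PAULIS = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def basis_state(bits):
    """Return the computational basis vector |bits> as a dense array."""
    vec = np.zeros(2 ** len(bits), dtype=complex)
    vec[int(bits, 2)] = 1.0
    return vec

def single_qubit_op(pauli, position, n=N_QUBITS):
    """Tensor a single-qubit Pauli at `position` with identities elsewhere."""
    op = np.array([[1.0 + 0j]])
    for q in range(n):
        op = np.kron(op, PAULIS[pauli] if q == position else np.eye(2))
    return op

def codewords_422():
    """Codewords of the [[4,2,2]] code: each even-weight string plus its complement."""
    pairs = [("0000", "1111"), ("0011", "1100"), ("0101", "1010"), ("0110", "1001")]
    return [(basis_state(a) + basis_state(b)) / np.sqrt(2) for a, b in pairs]

def satisfies_kl_d2(codewords, tol=1e-12):
    """Check <c_i|E|c_j> = c_E * delta_ij for every single-qubit Pauli E."""
    for pauli, pos in itertools.product(PAULIS, range(N_QUBITS)):
        err = single_qubit_op(pauli, pos)
        gram = np.array([[np.vdot(ci, err @ cj) for cj in codewords] for ci in codewords])
        diag = np.diag(gram)
        off_diag = gram - np.diag(diag)
        # Off-diagonal entries must vanish and all diagonal entries must agree.
        if np.abs(off_diag).max() > tol or np.abs(diag - diag[0]).max() > tol:
            return False
    return True

ok_422 = satisfies_kl_d2(codewords_422())
ok_trivial = satisfies_kl_d2([basis_state("0000"), basis_state("1111")])
```

The second call uses two bare basis states as a negative example: a single $Z$ error gives them different expectation values, so the diagonal KL equality fails.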

Beyond Retrieval-Ranking: A Multi-Agent Cognitive Decision Framework for E-Commerce Search

Authors:Zhouwei Zhai, Mengxiang Chen, Haoyun Xia, Jin Li, Renquan Zhou, Min Yang

The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching fundamentally misaligns with multi-stage cognitive decision processes of platform users. This misalignment introduces critical limitations: semantic gaps in complex queries, high decision costs due to cross-platform information foraging, and the absence of professional shopping guidance. To address these issues, we propose a Multi-Agent Cognitive Decision Framework (MACDF), which shifts the paradigm from passive retrieval to proactive decision support. Extensive offline evaluations demonstrate MACDF’s significant improvements in recommendation accuracy and user satisfaction, particularly for complex queries involving negation, multi-constraint, or reasoning demands. Online A/B testing on JD search platform confirms its practical efficacy. This work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.

Paper and project links

PDF

Summary

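The retrieval-ranking paradigm has long dominated e-commerce search, but its reliance on query-item matching is fundamentally misaligned with the multi-stage cognitive decision processes of platform users. To address the resulting problems, the authors propose the Multi-Agent Cognitive Decision Framework (MACDF), which shifts from passive retrieval to proactive decision support. Offline evaluations and online A/B testing show that MACDF clearly improves recommendation accuracy and user satisfaction, especially on complex queries involving negation, multiple constraints, or reasoning. The work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.

Key Takeaways

The retrieval-ranking paradigm has long dominated e-commerce search but is fundamentally misaligned with users' cognitive decision processes.
This misalignment causes semantic gaps, high decision costs, and a lack of professional shopping guidance.
To address these issues, the Multi-Agent Cognitive Decision Framework (MACDF) shifts from passive retrieval to proactive decision support.
Offline evaluations show significant improvements in recommendation accuracy and user satisfaction, and online A/B testing on the JD search platform confirms practical efficacy.
The gains are most pronounced on complex queries involving negation, multiple constraints, or reasoning.
The work highlights the transformative potential of multi-agent cognitive systems in redefining e-commerce search.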

GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?

Authors:Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang

Vision-Language Models (VLMs) are increasingly deployed as autonomous agents to navigate mobile graphical user interfaces (GUIs). Operating in dynamic on-device ecosystems, which include notifications, pop-ups, and inter-app interactions, exposes them to a unique and underexplored threat vector: environmental injection. Unlike prompt-based attacks that manipulate textual instructions, environmental injection corrupts an agent’s visual perception by inserting adversarial UI elements (for example, deceptive overlays or spoofed notifications) directly into the GUI. This bypasses textual safeguards and can derail execution, causing privacy leakage, financial loss, or irreversible device compromise. To systematically evaluate this threat, we introduce GhostEI-Bench, the first benchmark for assessing mobile agents under environmental injection attacks within dynamic, executable environments. Moving beyond static image-based assessments, GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios. We further propose a judge-LLM protocol that conducts fine-grained failure analysis by reviewing the agent’s action trajectory alongside the corresponding screenshot sequence, pinpointing failure in perception, recognition, or reasoning. Comprehensive experiments on state-of-the-art agents reveal pronounced vulnerability to deceptive environmental cues: current models systematically fail to perceive and reason about manipulated UIs. GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat, paving the way toward more robust and secure embodied agents.

Paper and project links

PDF

Summary

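This paper examines an emerging threat to Vision-Language Models (VLMs) that operate mobile graphical user interfaces (GUIs): environmental injection. The attack inserts adversarial UI elements (such as deceptive overlays or spoofed notifications) directly into the GUI, bypassing textual safeguards and potentially causing privacy leakage, financial loss, or irreversible device compromise. To evaluate the threat systematically, the paper introduces GhostEI-Bench, a benchmark for assessing mobile agents under environmental injection attacks in dynamic, executable environments. It also proposes a judge-LLM protocol that performs fine-grained failure analysis by reviewing the agent's action trajectory alongside the corresponding screenshot sequence. Experiments show that current models are markedly vulnerable to deceptive environmental cues, and GhostEI-Bench provides a framework for quantifying and mitigating this emerging threat.

Key Takeaways

VLM-based agents operating mobile GUIs face a distinctive threat: environmental injection.
Environmental injection corrupts the agent's visual perception by inserting adversarial UI elements directly into the GUI.
GhostEI-Bench is the first benchmark for assessing mobile agents under environmental injection in dynamic, executable environments.
GhostEI-Bench injects adversarial events into realistic application workflows inside fully operational Android emulators and evaluates performance across critical risk scenarios.
A judge-LLM protocol performs fine-grained failure analysis.
Experiments show that current models are markedly vulnerable to deceptive environmental cues.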

From Generation to Attribution: Music AI Agent Architectures for the Post-Streaming Era

Authors:Wonil Kim, Hyeongseok Wi, Seungsoon Park, Taejun Kim, Sangeun Keum, Keunhyoung Kim, Taewan Kim, Jongmin Jung, Taehyoung Kim, Gaetan Guerrero, Mael Le Goff, Julie Po, Dongjoo Moon, Juhan Nam, Jongpil Lee

Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. Unlike past media shifts, from live performance to recordings, downloads, and streaming, AI transforms the entire lifecycle of music, collapsing boundaries between creation, distribution, and monetization. However, existing streaming systems, with opaque and concentrated royalty flows, are ill-equipped to handle the scale and complexity of AI-driven production. We propose a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. Designed for iterative, session-based interaction, the system organizes music into granular components (Blocks) stored in BlockDB; each use triggers an Attribution Layer event for transparent provenance and real-time settlement. This framework reframes AI from a generative tool into infrastructure for a Fair AI Media Platform. By enabling fine-grained attribution, equitable compensation, and participatory engagement, it points toward a post-streaming paradigm where music functions not as a static catalog but as a collaborative and adaptive ecosystem.

Paper and project links

PDF Accepted to the NeurIPS 2025 AI4Music Workshop

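Summary

Generative AI is reshaping music creation, but its rapid growth exposes structural gaps in attribution, rights management, and economic models. The paper proposes a content-based Music AI Agent architecture that embeds attribution directly into the creative workflow through block-level retrieval and agentic orchestration. By organizing music into granular Blocks and supporting fine-grained attribution and real-time settlement, the architecture serves as infrastructure for a Fair AI Media Platform. It points toward a shift from the streaming model to a collaborative and adaptive music ecosystem.

Key Takeaways

Generative AI is reshaping music creation.
The rapid growth of AI music raises challenges in attribution, rights management, and economic models.
Existing streaming systems are ill-equipped to handle AI-driven production.
The proposed Music AI Agent architecture embeds attribution directly into the creative workflow.
The architecture achieves fine-grained attribution through block-level retrieval and agentic orchestration.
The architecture provides infrastructure for a Fair AI Media Platform.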

Automated Cloud Infrastructure-as-Code Reconciliation with AI Agents

Authors:Zhenning Yang, Hui Guan, Victor Nicolet, Brandon Paulsen, Joey Dodds, Daniel Kroening, Ang Chen

Cloud infrastructure is managed through a mix of interfaces: traditionally, cloud consoles, command-line interfaces (CLIs), and SDKs are the tools of choice. Recently, Infrastructure-as-Code (IaC) frameworks (e.g., Terraform) have quickly gained popularity. Unlike conventional tools, IaC frameworks encode the infrastructure in a “source-of-truth” configuration. They are capable of automatically carrying out modifications to the cloud (deploying, updating, or destroying resources) to bring the actual infrastructure into alignment with the IaC configuration. However, when IaC is used alongside consoles, CLIs, or SDKs, it loses visibility into external changes, causing infrastructure drift, where the configuration becomes outdated, and later IaC operations may undo valid updates or trigger errors. We present NSync, an automated system for IaC reconciliation that propagates out-of-band changes back into the IaC program. Our key insight is that infrastructure changes eventually all occur via cloud API invocations, the lowest layer for cloud management operations. NSync gleans insights from API traces to detect drift (i.e., non-IaC changes) and reconcile it (i.e., update the IaC configuration to capture the changes). It employs an agentic architecture that leverages LLMs to infer high-level intents from noisy API sequences, synthesize targeted IaC updates using specialized tools, and continually improve through a self-evolving knowledge base of past reconciliations. We further introduce a novel evaluation pipeline for injecting realistic drifts into cloud infrastructure and assessing reconciliation performance. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync outperforms the baseline both in terms of accuracy (from 0.71 to 0.97 pass@3) and token efficiency ($1.47\times$ improvement).

Paper and project links

PDF

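Summary

IaC frameworks automate changes to cloud infrastructure, but when they are used alongside cloud consoles, CLIs, or SDKs they lose visibility into external changes, which causes infrastructure drift. To address this, the paper presents NSync, which analyzes cloud API traces to detect non-IaC changes and reconcile them, using LLMs to infer high-level intents from noisy API sequences and synthesize targeted IaC updates. Experiments across five real-world Terraform projects and 372 drift scenarios show that NSync improves accuracy from 0.71 to 0.97 pass@3 and token efficiency by $1.47\times$ over the baseline.

Key Takeaways

IaC frameworks encode infrastructure in a "source-of-truth" configuration and can automatically deploy, update, or destroy cloud resources.
When IaC is used alongside conventional tools (cloud consoles, CLIs, and SDKs), infrastructure drift arises.
NSync addresses drift by analyzing API traces to detect non-IaC changes.
NSync uses LLMs to infer high-level intents from noisy API sequences and then synthesizes targeted IaC updates.
NSync maintains a self-evolving knowledge base and continually improves from past reconciliations.
Across five real-world Terraform projects, NSync outperforms the baseline in both accuracy and token efficiency.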
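
The drift-detection idea (treat every mutating API call that did not come from the IaC tool as a candidate drift) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not NSync's implementation: the `ApiCall` record, the `caller` field, the read-only prefix list, and the function name `detect_drift` are all hypothetical, and real API traces are far noisier than this toy list.

```python
from collections import defaultdict
from dataclasses import dataclass

# Assumed naming convention: calls with these prefixes only read state.
READ_ONLY_PREFIXES = ("Describe", "Get", "List")

@dataclass(frozen=True)
class ApiCall:
    caller: str       # who issued the call, e.g. "terraform" or "console-user"
    action: str       # API operation name
    resource_id: str  # identifier of the affected cloud resource

def detect_drift(trace, iac_caller="terraform"):
    """Group mutating calls not issued by the IaC tool, keyed by resource."""
    drift = defaultdict(list)
    for call in trace:
        if call.caller == iac_caller:
            continue  # made through IaC, so the configuration already reflects it
        if call.action.startswith(READ_ONLY_PREFIXES):
            continue  # reads do not change infrastructure
        drift[call.resource_id].append(call.action)
    return dict(drift)

trace = [
    ApiCall("terraform", "RunInstances", "i-001"),
    ApiCall("console-user", "DescribeInstances", "i-001"),
    ApiCall("console-user", "ModifyInstanceAttribute", "i-001"),
    ApiCall("cli-user", "DeleteBucket", "bucket-7"),
]
drifted = detect_drift(trace)
```

Here `drifted` maps each affected resource to the out-of-band operations applied to it. According to the abstract, the reconciliation step that follows (inferring intent and synthesizing IaC updates) is handled by LLM agents and is outside the scope of this sketch.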

DeepWideSearch: Benchmarking Depth and Width in Agentic Information Seeking

Authors:Tian Lan, Bin Zhu, Qianghuai Jia, Junyang Ren, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang

Current search agents fundamentally lack the ability to simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, a critical deficiency for real-world applications like comprehensive market analysis and business development. To bridge this gap, we introduce DeepWideSearch, the first benchmark explicitly designed to evaluate agents' ability to integrate depth and width in information seeking. In DeepWideSearch, agents must process a large volume of data, each item requiring deep reasoning over multi-hop retrieval paths. Specifically, we propose two methods to convert established datasets, resulting in a curated collection of 220 questions spanning 15 diverse domains. Extensive experiments demonstrate that even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch, highlighting the substantial challenge of integrating depth and width search in information-seeking tasks. Furthermore, our error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow, exposing key limitations in current agent architectures. We publicly release DeepWideSearch to catalyze future research on more capable and robust information-seeking agents.

Paper and project links

PDF

Summary

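DeepWideSearch targets a gap in current search agents: they cannot simultaneously perform deep reasoning over multi-hop retrieval and wide-scale information collection, which matters for real-world applications such as comprehensive market analysis and business development. In the benchmark, agents must process a large volume of data, each item requiring deep reasoning over multi-hop retrieval paths; two methods for converting established datasets yield 220 questions spanning 15 diverse domains. Experiments show that even state-of-the-art agents achieve only a 2.39% average success rate, underscoring how hard it is to integrate depth and width in search. DeepWideSearch is publicly released to encourage research on more capable and robust information-seeking agents.

Key Takeaways

Current search agents fundamentally lack the ability to combine deep reasoning with wide-scale information collection.
DeepWideSearch is the first benchmark explicitly designed to evaluate whether agents can integrate depth and width in information seeking.
The benchmark requires processing a large volume of data, with deep reasoning over multi-hop retrieval paths.
Two methods for converting established datasets produce a curated set of 220 questions spanning 15 domains.
Even state-of-the-art agents achieve only a 2.39% average success rate on DeepWideSearch.
Error analysis reveals four failure modes: lack of reflection, overreliance on internal knowledge, insufficient retrieval, and context overflow.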

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

Authors:Simin Li, Zihao Mao, Hanxiao Li, Zonglei Jing, Zhuohang bian, Jun Guo, Li Wang, Zhuoran Han, Ruixiao Xu, Xin Yu, Chengdong Ma, Yuqing Ma, Bo An, Yaodong Yang, Weifeng Lv, Xianglong Liu

In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of robustness, which ensures stability under uncertainties, and resilience, the ability to recover from disruptions, a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones. Code and results are available at https://github.com/BUAA-TrustworthyMARL/adv_marl_benchmark .

Paper and project links

PDF 44 pages, 16 figures, NeurIPS 2025

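Summary

In cooperative multi-agent reinforcement learning (MARL), hyperparameters are commonly tuned in ideal simulated environments to maximize cooperative performance, yet policies tuned for cooperation often fail to remain robust and resilient under real-world uncertainties. The paper reports a large-scale empirical study of over 82,620 experiments covering 4 real-world environments, 13 uncertainty types, and 15 hyperparameters to evaluate cooperation, robustness, and resilience. Under mild uncertainty, optimizing cooperation improves robustness and resilience, but the link weakens as perturbations intensify, and both properties vary by algorithm and uncertainty type. Robustness and resilience do not generalize across uncertainty modalities or agent scopes, and hyperparameter tuning is critical for trustworthy MARL. Optimizing hyperparameters alone substantially improves cooperation, robustness, and resilience across all MARL backbones, and the effect carries over to robust MARL methods built on these backbones.

Key Takeaways

Tuning hyperparameters in ideal simulated environments maximizes cooperation, but real-world uncertainties call for explicit attention to robustness and resilience.
Robustness and resilience vary by algorithm and uncertainty type.
Under mild uncertainty, optimizing cooperation improves robustness and resilience, but the link weakens as perturbations intensify.
Robustness does not generalize across uncertainty modalities or agent scopes: a policy robust to action noise on all agents may fail under observation noise on a single agent.
Hyperparameter tuning is critical for trustworthy MARL: common practices such as parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help.
Optimizing hyperparameters alone substantially improves cooperation, robustness, and resilience across all MARL backbones, and the effect also holds for robust MARL methods.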

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System

Authors:Jiale Guo, Suizhi Huang, Mei Li, Dong Huang, Xingsheng Chen, Regina Zhang, Zhijiang Guo, Han Yu, Siu-Ming Yiu, Pietro Lio, Kwok-Yan Lam

The integration of Large Language Models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems. However, systematic progress is hindered by a lack of comprehensive understanding of how benchmarks and solutions interconnect. This survey addresses this gap by providing the first holistic analysis of LLM-powered software engineering, offering insights into evaluation methodologies and solution paradigms. We review over 150 recent papers and propose a taxonomy along two key dimensions: (1) Solutions, categorized into prompt-based, fine-tuning-based, and agent-based paradigms, and (2) Benchmarks, including tasks such as code generation, translation, and repair. Our analysis highlights the evolution from simple prompt engineering to sophisticated agentic systems incorporating capabilities like planning, reasoning, memory mechanisms, and tool augmentation. To contextualize this progress, we present a unified pipeline illustrating the workflow from task specification to deliverables, detailing how different solution paradigms address various complexity levels. Unlike prior surveys that focus narrowly on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies, enabling researchers to identify optimal approaches for diverse evaluation criteria. We also identify critical research gaps and propose future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration. This survey serves as a foundational guide for advancing LLM-driven software engineering. We maintain a GitHub repository that continuously updates the reviewed and related papers at https://github.com/lisaGuojl/LLM-Agent-SE-Survey.

Paper and project links

PDF 22 pages

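Summary

Integrating large language models (LLMs) into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems capable of solving complex problems, but systematic progress is hindered by an incomplete understanding of how benchmarks and solutions interconnect. This survey fills the gap with the first holistic analysis of LLM-powered software engineering, covering evaluation methodologies and solution paradigms. It reviews over 150 recent papers and proposes a taxonomy along two dimensions, solutions and benchmarks. The analysis traces the evolution from simple prompt engineering to sophisticated agentic systems that incorporate planning, reasoning, memory mechanisms, and tool augmentation, and a unified pipeline illustrates how different solution paradigms handle tasks of varying complexity. Unlike prior surveys that focus on specific aspects, this work connects 50+ benchmarks to their corresponding solution strategies and identifies research gaps and future directions, including multi-agent collaboration, self-evolving systems, and formal verification integration.

Key Takeaways

Integrating LLMs into software engineering has driven a transition from traditional rule-based systems to autonomous agentic systems.
An incomplete understanding of how benchmarks and solutions interconnect hinders systematic progress.
The survey provides the first holistic analysis of LLM-powered software engineering, covering evaluation methodologies and solution paradigms.
Based on a review of over 150 recent papers, it proposes a taxonomy along two dimensions: solutions and benchmarks.
The analysis traces the evolution from simple prompt engineering to sophisticated agentic systems with planning, reasoning, memory mechanisms, and tool augmentation.
A unified pipeline shows how different solution paradigms handle tasks of varying complexity.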

RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration

Authors:Xiuyuan Chen, Jian Zhao, Yuchen Yuan, Tianle Zhang, Huilin Zhou, Zheng Zhu, Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li

Existing safety evaluation methods for large language models (LLMs) suffer from inherent limitations, including evaluator bias and detection failures arising from model homogeneity, which collectively undermine the robustness of risk evaluation processes. This paper seeks to re-examine the risk evaluation paradigm by introducing a theoretical framework that reconstructs the underlying risk concept space. Specifically, we decompose the latent risk concept space into three mutually exclusive subspaces: the explicit risk subspace (encompassing direct violations of safety guidelines), the implicit risk subspace (capturing potential malicious content that requires contextual reasoning for identification), and the non-risk subspace. Furthermore, we propose RADAR, a multi-agent collaborative evaluation framework that leverages multi-round debate mechanisms through four specialized complementary roles and employs dynamic update mechanisms to achieve self-evolution of risk concept distributions. This approach enables comprehensive coverage of both explicit and implicit risks while mitigating evaluator bias. To validate the effectiveness of our framework, we construct an evaluation dataset comprising 800 challenging cases. Extensive experiments on our challenging test set and public benchmarks demonstrate that RADAR significantly outperforms baseline evaluation methods across multiple dimensions, including accuracy, stability, and self-evaluation risk sensitivity. Notably, RADAR achieves a 28.87% improvement in risk identification accuracy compared to the strongest baseline evaluation method.

Paper and project links

PDF

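Summary

Existing safety evaluation methods for large language models suffer from limitations such as evaluator bias and detection failures. The paper proposes a theoretical framework that decomposes the latent risk concept space into three mutually exclusive subspaces and introduces RADAR, a multi-agent collaborative evaluation framework that uses multi-round debate among four specialized complementary roles and dynamic update mechanisms to let the risk concept distributions self-evolve. The framework covers both explicit and implicit risks while mitigating evaluator bias. Experiments show that RADAR improves risk identification accuracy by 28.87% over the strongest baseline evaluation method.

Key Takeaways

Existing safety evaluation methods for large language models suffer from evaluator bias and detection failures.
The paper decomposes the latent risk concept space into three subspaces: explicit risk, implicit risk, and non-risk.
RADAR, a multi-agent collaborative evaluation framework, uses multi-round debate among four specialized complementary roles and mitigates evaluator bias.
RADAR uses dynamic update mechanisms to let the risk concept distributions self-evolve.
RADAR improves risk identification accuracy by 28.87% over the strongest baseline.
The framework covers both explicit and implicit risks.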

Adaptive Learning in Spatial Agent-Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents

Authors:Yara Mohajerani

Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. We present a novel geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. Our framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment, introducing adaptive learning behaviours that allow firms to evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. We demonstrate the framework using riverine flood projections under RCP8.5 until 2100, showing that evolutionary adaptation enables firms to converge with baseline (no hazard) production levels after decades of disruption due to climate stress. Our results reveal systemic risks where even agents that are not directly exposed to floods face impacts through supply chain disruptions, with the end-of-century average price of goods 5.6% higher under RCP8.5 compared to the baseline in our illustrative economic network. This open-source framework provides financial institutions and companies with tools to quantify both direct and cascading climate risks while evaluating cost-effective adaptation strategies.

Paper and project links

PDF Accepted to Tackling Climate Change with Machine Learning workshop at NeurIPS 2025. 5 pages, 1 figure. Source code and documentation available at https://github.com/yaramohajerani/spatial-climate-ABM

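Summary

Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. The paper presents a geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. The framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment and adds adaptive learning behaviours that let firms evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. Using riverine flood projections under RCP8.5 until 2100, the authors show that evolutionary adaptation lets firms converge to baseline (no hazard) production levels after decades of disruption from climate stress. The results reveal systemic risk: even agents not directly exposed to floods are affected through supply chain disruptions, and the end-of-century average price of goods is 5.6% higher under RCP8.5 than in the baseline of the illustrative economic network. The open-source framework gives financial institutions and companies tools to quantify direct and cascading climate risks and to evaluate cost-effective adaptation strategies.

Key Takeaways

Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems.
The geospatial agent-based model integrates climate hazard data with evolutionary learning.
The framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment and adds adaptive learning behaviours that help firms adapt to risk.
In the riverine flood case study, evolutionary adaptation lets firms return to baseline production levels after decades of disruption.
Systemic risk cascades through supply chains, so agents not directly exposed to floods are also affected.
In the illustrative economic network, the end-of-century average price of goods is 5.6% higher under RCP8.5 than in the baseline.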
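
To make "fitness-based selection and mutation" concrete, here is a minimal Python sketch of one evolutionary step applied to a single strategy parameter. It is only an illustration under stated assumptions: the strategy is reduced to one hypothetical number (the share of budget spent on adaptation), `toy_fitness` is an invented payoff, and truncation selection with Gaussian mutation is one common scheme that the abstract does not confirm. It uses neither Mesa nor CLIMADA.

```python
import random

def evolve(strategies, fitness, rng, survivor_fraction=0.5, mutation_scale=0.05):
    """One generation: keep the fittest strategies, refill with mutated copies."""
    ranked = sorted(strategies, key=fitness, reverse=True)
    n_survivors = max(1, int(len(ranked) * survivor_fraction))
    survivors = ranked[:n_survivors]
    children = []
    while len(survivors) + len(children) < len(strategies):
        parent = rng.choice(survivors)
        child = parent + rng.gauss(0.0, mutation_scale)
        children.append(min(1.0, max(0.0, child)))  # keep the share within [0, 1]
    return survivors + children

def toy_fitness(adaptation_share):
    """Hypothetical payoff that peaks when 30% of the budget goes to adaptation."""
    return -(adaptation_share - 0.3) ** 2

rng = random.Random(0)
population = [rng.random() for _ in range(20)]
initial_best = max(population, key=toy_fitness)
for _ in range(50):
    population = evolve(population, toy_fitness, rng)
final_best = max(population, key=toy_fitness)
```

Because the fittest strategy always survives into the next generation, the best fitness in the population never decreases from one generation to the next.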

Multi-Agent Reinforcement Learning for Task Offloading in Wireless Edge Networks

Authors:Andrea Fox, Francesco De Pellegrini, Eitan Altman

In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resort to centralized critics or frequent communication, which fail under limited observability and communication constraints. We propose a decentralized framework in which each agent solves a constrained Markov decision process (CMDP), coordinating implicitly through a shared constraint vector. In the specific case of offloading, for example, the constraints prevent overloading of shared server resources. Coordination constraints are updated infrequently and act as a lightweight coordination mechanism. They enable agents to align with global resource usage objectives but require little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. We establish theoretical guarantees under mild assumptions and validate our approach experimentally, showing improved performance over centralized and independent baselines, especially in large-scale settings.

Paper and project links

PDF Oral presentation at AI4NextG @ NeurIPS’25 Workshop

Summary

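In edge computing systems, autonomous agents must make fast local decisions while competing for shared resources. Existing MARL methods often resort to centralized critics or frequent communication, which fail under limited observability and communication constraints. The paper proposes a decentralized framework in which each agent solves a constrained Markov decision process (CMDP) and coordinates implicitly through a shared constraint vector. In the offloading case, for example, the constraints prevent overloading of shared server resources. The coordination constraints are updated infrequently and act as a lightweight coordination mechanism that aligns agents with global resource usage objectives while requiring little direct communication. Using safe reinforcement learning, agents learn policies that meet both local and global goals. The authors establish theoretical guarantees under mild assumptions and show experimentally that the approach outperforms centralized and independent baselines, especially in large-scale settings.

Key Takeaways

Autonomous agents in edge computing systems must make fast local decisions while competing for shared resources.
Existing methods rely on centralized critics or frequent communication, which fail under limited observability and communication constraints.
In the proposed decentralized framework, each agent solves a constrained Markov decision process (CMDP) and coordinates implicitly.
A shared constraint vector, updated infrequently, lets agents coordinate with little direct communication.
In the offloading scenario, for example, the constraints prevent overloading of shared server resources.
Using safe reinforcement learning, agents learn policies that meet both local and global goals.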
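
One way to see how a single agent can respect its entry of a shared constraint vector is Lagrangian dual ascent on a one-step toy problem, sketched below in Python. This is an assumption made for illustration: the abstract says only that agents use safe reinforcement learning, and does not state that a Lagrangian method is used. The reward `gain * sqrt(p)`, the budget value, and the function names are all hypothetical.

```python
def best_response(gain, multiplier):
    """Offload share p in [0, 1] that maximizes gain*sqrt(p) - multiplier*p."""
    if multiplier <= 0.0:
        return 1.0  # offloading is free, so offload everything
    return min(1.0, (gain / (2.0 * multiplier)) ** 2)

def dual_ascent(gain, budget, step=0.1, iterations=500):
    """Raise the multiplier while the agent exceeds its budget, lower it otherwise."""
    multiplier = 0.0
    share = best_response(gain, multiplier)
    for _ in range(iterations):
        multiplier = max(0.0, multiplier + step * (share - budget))
        share = best_response(gain, multiplier)
    return share, multiplier

# The budget plays the role of this agent's entry in the shared constraint vector.
share, multiplier = dual_ascent(gain=1.0, budget=0.25)
```

In this toy setting the offload share settles at the budget, which mirrors the idea that agents only need the (infrequently updated) constraint values, not each other's decisions, to stay within shared server capacity.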

Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models?

Authors:Hyeong Kyu Choi, Xiaojin Zhu, Sharon Li

Multi-Agent Debate (MAD) has emerged as a promising paradigm for improving the performance of large language models through collaborative reasoning. Despite recent advances, the key factors driving MAD’s effectiveness remain unclear. In this work, we disentangle MAD into two key components, Majority Voting and inter-agent Debate, and assess their respective contributions. Through extensive experiments across seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains typically attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents’ belief trajectories, implying that debate alone does not improve expected correctness. Guided by these insights, we demonstrate that targeted interventions, by biasing the belief update toward correction, can meaningfully enhance debate effectiveness. Overall, our findings suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at https://github.com/deeplearning-wisc/debate-or-vote.

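Multi-Agent Debate (MAD) has shown considerable promise as a paradigm for improving large language model performance through collaborative reasoning. Despite recent progress, the key factors behind MAD's effectiveness remain unclear. In this work we decompose MAD into two key components, Majority Voting and inter-agent Debate, and assess the contribution of each. Through extensive experiments on seven NLP benchmarks, we find that Majority Voting alone accounts for most of the performance gains usually attributed to MAD. To explain this, we propose a theoretical framework that models debate as a stochastic process. We prove that it induces a martingale over agents' belief trajectories, which means debate by itself does not improve expected correctness. Guided by these insights, we show that targeted interventions that bias the belief update toward correction can meaningfully improve debate effectiveness. Overall, our results suggest that while MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings. Code is released at https://github.com/deeplearning-wisc/debate-or-vote.

Paper and project links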

PDF NeurIPS 2025 Spotlight

Summary
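Multi-Agent Debate (MAD) is a promising paradigm for improving large language model performance through collaborative reasoning, yet the key factors behind its effectiveness remain unclear. This study decomposes MAD into two key components, Majority Voting and inter-agent Debate, and assesses the contribution of each. Extensive experiments on seven NLP benchmarks show that Majority Voting alone accounts for most of the performance gains usually attributed to MAD. To explain this, the authors propose a theoretical framework that treats debate as a stochastic process and prove that it induces a martingale over agents' belief trajectories, which means debate by itself does not improve expected correctness. Guided by these insights, biasing the belief update toward correction meaningfully improves debate effectiveness. Overall, although MAD has potential, simple ensembling methods remain strong and more reliable alternatives in many practical settings.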

Key Takeaways

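Multi-Agent Debate (MAD) is a method for improving large language model performance.
MAD consists of two key components: Majority Voting and inter-agent Debate.
Majority Voting is the main source of the performance gains.
Debate is modeled as a stochastic process and does not by itself improve expected correctness.
The theoretical framework shows that debate induces a martingale over agents' belief trajectories.
Biasing the belief update toward correction meaningfully improves debate effectiveness.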
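
For reference, the Majority Voting component can be written in a few lines. This is a minimal sketch, not code from the paper's repository: the function name and the tie-breaking rule (the earliest answer among the tied ones wins) are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(answers: Sequence[Hashable]) -> Hashable:
    """Return the most common answer; ties go to the tied answer that appears first."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    best = max(counts.values())
    # Scan in the original order so that tie-breaking is deterministic.
    for answer in answers:
        if counts[answer] == best:
            return answer
    raise AssertionError("unreachable: some answer always has the top count")
```

Calling `majority_vote(["A", "B", "A"])` returns `"A"`. The martingale result in the abstract says that, in expectation, an agent's belief after a debate round equals its belief before the round, which is why this kind of aggregation rather than the debate itself is credited with most of the gain.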


Prover Agent: An Agent-Based Framework for Formal Mathematical Proofs

Authors:Kaito Baba, Chaoran Liu, Shuhei Kurita, Akiyoshi Sannai

We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with a formal proof assistant, Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals in the formal proof but can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, establishing a new state-of-the-art among methods using small language models (SLMs) with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies that illustrate how these generated lemmas contribute to solving challenging problems. Our code is publicly available at: https://github.com/kAIto47802/Prover-Agent.

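We present Prover Agent, a novel AI agent for automated theorem proving that integrates large language models (LLMs) with the formal proof assistant Lean. Prover Agent coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean, while also generating auxiliary lemmas. These auxiliary lemmas are not limited to subgoals of the formal proof; they can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. It achieves an 88.1% success rate on the MiniF2F benchmark, a new state of the art among methods using small language models (SLMs), with a much lower sample budget than previous approaches. We also present theoretical analyses and case studies showing how the generated lemmas contribute to solving challenging problems. Our code is publicly available at: https://github.com/kAIto47802/Prover-Agent.

Paper and project links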

PDF 36 pages, 3 figures

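Summary

Prover Agent is a new AI agent for theorem proving that integrates large language models (LLMs) with the formal proof assistant Lean. It coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean, while also generating auxiliary lemmas. These lemmas are not limited to subgoals of the formal proof; they can also include special cases or potentially useful facts derived from the assumptions, which help in discovering a viable proof strategy. On the MiniF2F benchmark it achieves an 88.1% success rate, a new state of the art among methods using small language models (SLMs), with a much lower sample budget than previous approaches. The authors also provide theoretical analyses and case studies showing how the generated lemmas contribute to solving challenging problems. The code is available at: https://github.com/kAIto47802/Prover-Agent.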

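Key Takeaways

Prover Agent is a new theorem-proving AI agent that combines large language models (LLMs) with the formal proof assistant Lean.
It coordinates an informal reasoning LLM, a formal prover model, and feedback from Lean.
It generates auxiliary lemmas that go beyond subgoals of the formal proof and can include special cases or potentially useful facts derived from the assumptions, which help in finding a viable proof strategy.
It achieves an 88.1% success rate on the MiniF2F benchmark, state of the art among methods using small language models.
Its sample budget is much lower than that of previous approaches.
Theoretical analyses and case studies show how the generated lemmas help solve challenging problems.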
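
The coordination between a prover model and a checker can be pictured as a propose-and-verify loop. The sketch below is a generic illustration under stated assumptions, not Prover Agent's implementation: `propose` and `verify` are hypothetical callables standing in for a prover model and a Lean check, and lemma generation is left out entirely.

```python
from typing import Callable, Optional, Tuple


def prove_with_feedback(
    statement: str,
    propose: Callable[[str, Optional[str]], str],   # hypothetical prover model
    verify: Callable[[str], Tuple[bool, str]],      # hypothetical proof checker
    max_attempts: int = 4,
) -> Optional[str]:
    """Retry proof generation, feeding the checker's last error message back in."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        candidate = propose(statement, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate
        feedback = message  # the next proposal can condition on this error
    return None  # sample budget exhausted without a verified proof
```

The `max_attempts` parameter plays the role of a sample budget: the loop stops either at the first verified proof or when the budget runs out.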


Lessons Learned: A Multi-Agent Framework for Code LLMs to Learn and Improve

Authors:Yuanzhe Liu, Ryan Deng, Tim Kaler, Xuhao Chen, Charles E. Leiserson, Yao Ma, Jie Chen

Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their varied performance occurs at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no one dominates others. This observation prompts the question of how one leverages multiple LLM agents to solve a coding problem without knowing their complementary strengths a priori. We argue that a team of agents can learn from each other’s successes and failures so as to improve their own performance. Thus, a lesson is the knowledge produced by an agent and passed on to other agents in the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation–banking–selection mechanism, and demonstrate that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.

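Recent studies show that LLMs possess different skills and specialize in different tasks. In fact, we observe that their performance varies at several levels of granularity. For example, in the code optimization task, code LLMs excel at different optimization categories and no single model dominates the others. This observation raises the question of how to use multiple LLM agents to solve a coding problem without knowing their complementary strengths in advance. We argue that a team of agents can learn from each other's successes and failures to improve their own performance. A lesson is thus the knowledge produced by one agent and passed on to other agents during the collective solution process. We propose a lesson-based collaboration framework, design the lesson solicitation–banking–selection mechanism, and show that a team of small LLMs with lessons learned can outperform a much larger LLM and other multi-LLM collaboration methods.

Paper and project links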

PDF NeurIPS 2025. Code is available at https://github.com/MITIBM-FastCoder/LessonL

Summary

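Large language models (LLMs) show different skills and specializations across tasks, and these differences appear at several levels of granularity. In code optimization, different LLMs excel at different optimization categories and no model dominates across the board. To use multiple LLMs on a coding problem, the authors propose a lesson-based collaboration framework in which models learn from each other's successes and failures to improve their own performance. A lesson is the knowledge an agent produces during the collective solution process and passes on to other agents. The study shows that a team of small LLMs with lessons learned can outperform a much larger LLM as well as other multi-LLM collaboration methods.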

Key Takeaways

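LLMs show different skills and specializations across tasks.
In code optimization, different LLMs excel at different optimization categories, and no model leads across the board.
The open question is how to use multiple LLM agents on a coding problem when their complementary strengths are not known in advance.
A lesson-based collaboration framework lets LLMs learn from each other's successes and failures.
A lesson is the knowledge an agent produces during the collective solution process and passes on to other agents.
A team of small LLMs with lessons learned can outperform a much larger LLM.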
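
The banking and selection steps of a lesson mechanism can be pictured with a small data structure. This is a hypothetical sketch, not the paper's design: the `Lesson` fields, the `speedup` score, and the rule of selecting the highest-scoring lessons are all assumptions made for illustration (a real system may well want to keep informative failures too).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Lesson:
    author: str      # agent that produced the lesson
    text: str        # what was tried and what happened
    speedup: float   # measured effect of the attempt (assumed scoring signal)


@dataclass
class LessonBank:
    lessons: List[Lesson] = field(default_factory=list)

    def deposit(self, lesson: Lesson) -> None:
        """Banking: store a lesson solicited from an agent."""
        self.lessons.append(lesson)

    def select(self, k: int) -> List[Lesson]:
        """Selection: return the k lessons with the highest speedup (assumed criterion)."""
        return sorted(self.lessons, key=lambda l: l.speedup, reverse=True)[:k]
```

In a round of collaboration, each agent would deposit a lesson after its attempt, and the selected lessons would be added to every agent's prompt for the next round.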


Embodied Agents Meet Personalization: Investigating Challenges and Solutions Through the Lens of Memory Utilization

Authors:Taeyoon Kwon, Dongwook Choi, Hyojun Kim, Sunghwan Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo

LLM-powered embodied agents have shown success on conventional object-rearrangement tasks, but providing personalized assistance that leverages user-specific knowledge from past interactions presents new challenges. We investigate these challenges through the lens of agents’ memory utilization along two critical dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct MEMENTO, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments reveal that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis, we identify two critical bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory architectural approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that separately manages personalized knowledge, achieving substantial improvements on both single and joint-memory tasks. Project website: https://connoriginal.github.io/MEMENTO

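LLM-powered embodied agents have succeeded on conventional object-rearrangement tasks, but providing personalized assistance that draws on user-specific knowledge from past interactions raises new challenges. We study these challenges through the lens of agents' memory utilization along two key dimensions: object semantics (identifying objects based on personal meaning) and user patterns (recalling sequences from behavioral routines). To assess these capabilities, we construct MEMENTO, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Our experiments show that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. Through in-depth analysis we identify two key bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, we explore memory-architecture approaches to address these challenges. Given our observation that episodic memory provides both personalized knowledge and in-context learning benefits, we design a hierarchical knowledge graph-based user-profile memory module that manages personalized knowledge separately, achieving substantial improvements on both single-memory and joint-memory tasks. Project website: https://connoriginal.github.io/MEMENTO

Paper and project links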

PDF Work in progress

Summary

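LLM-powered agents succeed on conventional object-rearrangement tasks but face challenges when providing personalized assistance that draws on user-specific knowledge from past interactions. The study examines these challenges along two key dimensions of agent memory utilization: object semantics and user patterns. To assess these capabilities, the authors construct MEMENTO, an end-to-end two-stage evaluation framework comprising single-memory and joint-memory tasks. Experiments show that current agents can recall simple object semantics but struggle to apply sequential user patterns to planning. In-depth analysis identifies two key bottlenecks: information overload and coordination failures when handling multiple memories. Based on these findings, the authors explore memory-architecture approaches to address the challenges. Observing that episodic memory provides both personalized knowledge and in-context learning benefits, they design a hierarchical knowledge graph-based user-profile memory module that manages personalized knowledge separately and achieves substantial improvements on both single-memory and joint-memory tasks.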

Key Takeaways

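LLM agents perform well on object-rearrangement tasks but face challenges when providing personalized assistance.
The two key dimensions of agent memory utilization are object semantics and user patterns.
The MEMENTO framework evaluates agents' memory capabilities through single-memory and joint-memory tasks.
Current agents can recall simple object semantics but struggle to apply user patterns to planning.
Information overload and coordination failures are the two key bottlenecks when handling multiple memories.
Episodic memory provides both personalized knowledge and in-context learning benefits.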


Reinforcing Multi-Turn Reasoning in LLM Agents via Turn-Level Reward Design

Authors:Quan Wei, Siliang Zeng, Chenliang Li, William Brown, Oana Frunza, Wei Deng, Anderson Schneider, Yuriy Nevmyvaka, Yang Katie Zhao, Alfredo Garcia, Mingyi Hong

This paper investigates Reinforcement Learning (RL) approaches to enhance the reasoning capabilities of Large Language Model (LLM) agents in long-horizon, multi-turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely applied to train multi-turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, limiting their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, we extend GRPO and PPO to their respective multi-turn variants, enabling fine-grained credit assignment. We conduct case studies on multi-turn reasoning-augmented search agents, where we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks demonstrate that incorporating well-designed turn-level rewards enables RL algorithms to significantly outperform baseline methods with trajectory-level rewards. Both training and validation reward curves illustrate that our method achieves greater stability, faster convergence, and higher accuracy. Numerical results across diverse question-answering datasets further show that our approach consistently delivers the highest answer correctness and 100% format correctness.

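This paper investigates reinforcement learning (RL) approaches for improving the reasoning capabilities of large language model (LLM) agents in long-horizon, multi-turn scenarios. Although RL algorithms such as Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) have been widely used to train multi-turn LLM agents, they typically rely only on sparse outcome rewards and lack dense intermediate signals across multiple decision steps, which limits their performance on complex reasoning tasks. To bridge this gap, we present the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, we extend GRPO and PPO to their respective multi-turn variants, enabling fine-grained credit assignment. In case studies on multi-turn reasoning-augmented search agents, we carefully design two types of turn-level rewards: verifiable and LLM-as-judge. Our experiments on multi-turn search tasks show that well-designed turn-level rewards enable RL algorithms to significantly outperform baseline methods with trajectory-level rewards. Both training and validation reward curves show that our method achieves greater stability, faster convergence, and higher accuracy. Numerical results across diverse question-answering datasets further show that our approach consistently delivers the highest answer correctness and 100% format correctness.

Paper and project links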

PDF work in progress

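Summary
Reinforcement learning (RL) is used to improve the reasoning capabilities of large language model (LLM) agents in multi-turn scenarios. This is the first systematic study of turn-level reward design for multi-turn RL algorithms and agent applications. By integrating turn-level rewards, the authors extend Group Relative Policy Optimization (GRPO) and Proximal Policy Optimization (PPO) to multi-turn variants that enable fine-grained credit assignment. In case studies on multi-turn reasoning-augmented search agents, they design two types of turn-level rewards: verifiable and LLM-as-judge. Experiments show that with well-designed turn-level rewards, RL algorithms significantly outperform baselines with trajectory-level rewards, achieving greater stability, faster convergence, and higher accuracy.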

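Key Takeaways

Reinforcement learning is used to improve the reasoning capabilities of large language models in multi-turn scenarios.
The study is the first systematic exploration of turn-level reward design for multi-turn RL algorithms and agent applications.
Integrating turn-level rewards extends GRPO and PPO to their multi-turn variants.
In case studies on multi-turn reasoning-augmented search agents, two types of turn-level rewards are designed: verifiable and LLM-as-judge.
With well-designed turn-level rewards, RL algorithms significantly outperform baselines with trajectory-level rewards.
Experimental results show the advantages of turn-level rewards in stability, convergence speed, and accuracy.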
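
The difference between trajectory-level and turn-level credit can be made concrete with return-to-go values. The sketch below is a generic illustration, not the paper's multi-turn GRPO or PPO: the function names, the discount factor, and the choice to add the outcome reward to the final turn are assumptions.

```python
from typing import List


def trajectory_level_returns(num_turns: int, outcome_reward: float) -> List[float]:
    """Sparse signal: every turn is credited with the same final outcome."""
    return [outcome_reward] * num_turns


def turn_level_returns(turn_rewards: List[float], outcome_reward: float,
                       gamma: float = 1.0) -> List[float]:
    """Dense signal: discounted return-to-go per turn, outcome added at the last turn."""
    if not turn_rewards:
        return []
    rewards = list(turn_rewards)
    rewards[-1] += outcome_reward
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each turn accumulates its own reward plus what follows it.
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For three turns with rewards `[0.5, 0.0, 0.5]` and an outcome reward of `1.0`, the trajectory-level signal is `[1.0, 1.0, 1.0]`, while the turn-level returns are `[2.0, 1.5, 1.5]`, so the first turn's useful action is distinguishable from the second turn's neutral one.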


MindForge: Empowering Embodied Agents with Theory of Mind for Lifelong Cultural Learning

Authors:Mircea Lică, Ojas Shirekar, Baptiste Colle, Chirag Raman

Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft. However, when powered by open-weight LLMs they still falter on elementary tasks after domain-specific fine-tuning. We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. Following the cultural learning framework, we test MindForge in both instructive and collaborative settings within Minecraft. In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts in basic tasks, yielding $3\times$ more tech-tree milestones and collecting $2.3\times$ more unique items than the Voyager baseline. Furthermore, in fully collaborative settings, we find that the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem. MindForge agents demonstrate sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experiences.

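Embodied agents powered by large language models (LLMs), such as Voyager, promise open-ended competence in worlds such as Minecraft. However, when powered by open-weight LLMs they still struggle with elementary tasks even after domain-specific fine-tuning. We propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. We introduce three key innovations: (1) a structured theory of mind representation linking percepts, beliefs, desires, and actions; (2) natural inter-agent communication; and (3) a multi-component memory system. Following the cultural learning framework, we test MindForge in both instructive and collaborative settings in Minecraft. In an instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform their Voyager counterparts on basic tasks, reaching $3\times$ more tech-tree milestones and collecting $2.3\times$ more unique items than the Voyager baseline. Furthermore, in fully collaborative settings, the performance of two underachieving agents improves with more communication rounds, echoing the Condorcet Jury Theorem. MindForge agents show sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experience.

Paper and project links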

PDF Accepted to NeurIPS 2025 main track as poster

Summary

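In worlds such as Minecraft, embodied agents powered by large language models (such as Voyager) promise open-ended competence, yet they still fail at elementary tasks after domain-specific fine-tuning when powered by open-weight LLMs. The authors propose MindForge, a generative-agent framework for cultural lifelong learning through explicit perspective taking. It introduces three key innovations: a structured theory of mind representation linking percepts, beliefs, desires, and actions; natural inter-agent communication; and a multi-component memory system. Following the cultural learning framework, MindForge is tested in instructive and collaborative settings in Minecraft. In the instructive setting with GPT-4, MindForge agents powered by open-weight LLMs significantly outperform Voyager on basic tasks, reaching three times as many tech-tree milestones and collecting 2.3 times as many unique items. In fully collaborative settings, two underachieving agents improve as the number of communication rounds increases, echoing the Condorcet Jury Theorem. MindForge agents show sophisticated behaviors, including expert-novice knowledge transfer, collaborative problem solving, and adaptation to out-of-distribution tasks through accumulated cultural experience.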

Key Takeaways

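MindForge is a generative-agent framework for cultural lifelong learning, built on a structured theory of mind representation linking percepts, beliefs, desires, and actions.
Its other key innovations are natural inter-agent communication and a multi-component memory system.
In the instructive setting with GPT-4, MindForge agents powered by open-weight LLMs outperform Voyager on basic tasks, reaching $3\times$ more tech-tree milestones and collecting $2.3\times$ more unique items.
In collaborative settings, two underachieving agents improve as the number of communication rounds increases, echoing the Condorcet Jury Theorem.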
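
The classical Condorcet Jury Theorem concerns independent voters who are each correct with probability p: for p above one half, the chance that a majority is correct grows with the number of voters. The sketch below computes that probability for the classical setting only. The link to communication rounds between agents is the paper's analogy and is not modeled here, and the function name is an assumption.

```python
from math import comb


def majority_correct_probability(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Each voter is correct with probability p; n must be odd so ties cannot occur.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    need = n // 2 + 1  # smallest number of correct votes that forms a majority
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))
```

With p = 0.6, three voters reach a majority accuracy of about 0.648 and five voters about 0.683. With p below one half the effect reverses, which is why the theorem only helps when individual voters are better than chance.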
