Published: 2025-09-30

Updated: 2025-11-27

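⚠️ All summaries below were generated by a large language model. They may contain errors, so treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic contexts. They are intended only for first-pass screening before reading the papers.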

2025-09-30 update

Effective Policy Learning for Multi-Agent Online Coordination Beyond Submodular Objectives

Authors:Qixin Zhang, Yan Sun, Can Jin, Xikun Zhang, Yao Shu, Puning Zhao, Li Shen, Dacheng Tao

In this paper, we present two effective policy learning algorithms for the multi-agent online coordination (MA-OC) problem. The first one, MA-SPL, not only achieves the optimal $(1-\frac{c}{e})$-approximation guarantee for the MA-OC problem with submodular objectives but also handles the unexplored $\alpha$-weakly DR-submodular and $(\gamma,\beta)$-weakly submodular scenarios, where $c$ is the curvature of the investigated submodular functions, $\alpha$ denotes the diminishing-return (DR) ratio, and the tuple $(\gamma,\beta)$ represents the submodularity ratios. Subsequently, to reduce the reliance on the unknown parameters $\alpha,\gamma,\beta$ inherent in the MA-SPL algorithm, we further introduce a second online algorithm named MA-MPL. The MA-MPL algorithm is entirely parameter-free and simultaneously maintains the same approximation ratio as the first MA-SPL algorithm. The core of our MA-SPL and MA-MPL algorithms is a novel continuous-relaxation technique termed the policy-based continuous extension. Compared with the well-established multi-linear extension, a notable advantage of this new policy-based continuous extension is its ability to provide a lossless rounding scheme for any set function, thereby enabling us to tackle the challenging weakly submodular objectives. Finally, extensive simulations are conducted to validate the effectiveness of our proposed algorithms.

Paper and project links

PDF Accepted to NeurIPS 2025

Summary

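This paper presents two effective policy learning algorithms for the multi-agent online coordination (MA-OC) problem. The first, MA-SPL, achieves the optimal $(1-\frac{c}{e})$-approximation guarantee for MA-OC with submodular objectives and also handles the previously unexplored $\alpha$-weakly DR-submodular and $(\gamma,\beta)$-weakly submodular settings. To reduce the reliance on the unknown parameters $\alpha,\gamma,\beta$ inherent in MA-SPL, the authors introduce a second online algorithm, MA-MPL, which is entirely parameter-free and maintains the same approximation ratio as MA-SPL. At the core of both algorithms is a new continuous-relaxation technique called the policy-based continuous extension. Compared with the well-established multi-linear extension, its notable advantage is a lossless rounding scheme for any set function, which makes the challenging weakly submodular objectives tractable.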

Key Takeaways

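The paper introduces two effective policy learning algorithms, MA-SPL and MA-MPL, for the multi-agent online coordination problem.
MA-SPL achieves the optimal $(1-\frac{c}{e})$-approximation guarantee for MA-OC with submodular objectives and also handles weakly submodular settings.
MA-MPL is parameter-free and keeps the same approximation ratio as MA-SPL.
Both algorithms build on a new policy-based continuous extension that provides a lossless rounding scheme for any set function.
This technique makes the challenging weakly submodular objectives tractable.
Extensive simulations validate the effectiveness of the proposed algorithms.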
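
For context on the multi-linear extension and the curvature $c$ that the abstract refers to, the sketch below evaluates the standard multi-linear extension $F(x)=\sum_{S} f(S)\prod_{i\in S}x_i\prod_{i\notin S}(1-x_i)$ and the total curvature of a toy coverage function, then computes the $(1-\frac{c}{e})$ ratio. This is not the paper's policy-based continuous extension, MA-SPL, or MA-MPL; the instance `SETS` and every function name are assumptions made for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy instance: each of three candidate actions covers some items.
SETS = [{1, 2}, {2, 3}, {4}]

def f(chosen):
    """Coverage objective (monotone submodular): number of distinct items covered."""
    covered = set()
    for i in chosen:
        covered |= SETS[i]
    return len(covered)

def multilinear_extension(f, n, x):
    """Exact F(x): expected f when element i is included independently with probability x[i].

    Enumerates all 2^n subsets, so it is only suitable for tiny ground sets.
    """
    total = 0.0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            chosen = set(subset)
            prob = 1.0
            for i in range(n):
                prob *= x[i] if i in chosen else 1.0 - x[i]
            total += prob * f(chosen)
    return total

def curvature(f, n):
    """Total curvature c = 1 - min_j [f(V) - f(V minus {j})] / f({j})."""
    ground = set(range(n))
    return 1.0 - min((f(ground) - f(ground - {j})) / f({j}) for j in range(n))

def approximation_ratio(c):
    """The (1 - c/e) ratio quoted in the abstract for curvature c."""
    return 1.0 - c / math.e

if __name__ == "__main__":
    print(multilinear_extension(f, len(SETS), [0.5, 0.5, 0.5]))
    c = curvature(f, len(SETS))
    print(c, approximation_ratio(c))
```

On this toy instance the fractional point (0.5, 0.5, 0.5) has value 2.25 and the curvature is 0.5. Turning a fractional point back into a set without loss is the rounding step that the abstract says the policy-based extension handles for any set function.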

JointDiff: Bridging Continuous and Discrete in Multi-Agent Trajectory Generation

Authors:Guillem Capellera, Luis Ferraz, Antonio Rubio, Alexandre Alahi, Antonio Agudo

Generative models often treat continuous data and discrete events as separate processes, creating a gap in modeling complex systems where they interact synchronously. To bridge this gap, we introduce JointDiff, a novel diffusion framework designed to unify these two processes by simultaneously generating continuous spatio-temporal data and synchronous discrete events. We demonstrate its efficacy in the sports domain by simultaneously modeling multi-agent trajectories and key possession events. This joint modeling is validated with non-controllable generation and two novel controllable generation scenarios: weak-possessor-guidance, which offers flexible semantic control over game dynamics through a simple list of intended ball possessors, and text-guidance, which enables fine-grained, language-driven generation. To enable the conditioning with these guidance signals, we introduce CrossGuid, an effective conditioning operation for multi-agent domains. We also share a new unified sports benchmark enhanced with textual descriptions for soccer and football datasets. JointDiff achieves state-of-the-art performance, demonstrating that joint modeling is crucial for building realistic and controllable generative models for interactive systems.

Summary

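JointDiff is a framework for jointly modeling continuous data and discrete events. To close the gap left by generative models that treat the two as separate processes, the authors propose JointDiff, which generates both synchronously. They validate it in the sports domain by jointly modeling multi-agent trajectories and key possession events, introduce the CrossGuid operation to condition generation on guidance signals, and share a unified sports benchmark enhanced with textual descriptions. JointDiff achieves state-of-the-art performance, showing that joint modeling is crucial for building realistic and controllable generative models of interactive systems.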

Key Takeaways

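JointDiff addresses the gap that arises when generative models handle continuous data and discrete events separately.
JointDiff jointly generates continuous spatio-temporal data and synchronous discrete events.
The joint modeling is validated in the sports domain on multi-agent trajectories and key possession events.
The CrossGuid operation conditions generation on guidance signals, enabling fine-grained control.
The authors share a new unified sports benchmark enhanced with textual descriptions.
JointDiff achieves state-of-the-art performance.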

Estimating the Empowerment of Language Model Agents

Authors:Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

As language model (LM) agents become more capable and gain broader access to real-world tools, there is a growing need for scalable evaluation frameworks of agentic capability. However, conventional benchmark-centric evaluations are costly to design and require human designers to come up with valid tasks that translate into insights about general model capabilities. In this work, we propose information-theoretic evaluation based on empowerment, the mutual information between an agent’s actions and future states, as an open-ended method for evaluating LM agents. We introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We validate EELMA on both language games and scaled-up realistic web-browsing scenarios. We find that empowerment strongly correlates with average task performance, characterize the impact of environmental complexity and agentic factors such as chain-of-thought, model scale, and memory length on estimated empowerment, and show that high-empowerment states and actions are often pivotal moments for general capabilities. Together, these results demonstrate empowerment as an appealing general-purpose metric for evaluating and monitoring LM agents in complex, open-ended settings.

Paper and project links

PDF 10 pages, 8 figures. Submitted to ICLR 2026

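Summary

Demand is growing for scalable frameworks that evaluate the capabilities of language model (LM) agents. This paper proposes an information-theoretic evaluation based on empowerment, the mutual information between an agent's actions and future states. It introduces EELMA, an algorithm that approximates effective empowerment from multi-turn text interactions. Validation shows that empowerment correlates strongly with average task performance, and the paper characterizes how environmental complexity and agentic factors such as chain-of-thought, model scale, and memory length affect estimated empowerment. High-empowerment states and actions are often pivotal moments for general capabilities.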

Key Takeaways

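Demand is growing for scalable evaluation frameworks for LM agent capabilities.
The paper proposes an information-theoretic evaluation based on empowerment, the mutual information between an agent's actions and future states.
EELMA approximates effective empowerment from multi-turn text interactions.
Empowerment correlates strongly with average task performance.
Environmental complexity and agentic factors (chain-of-thought, model scale, memory length) affect estimated empowerment.
High-empowerment states and actions are often pivotal moments for general capabilities.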
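
The quantity behind this metric, mutual information between actions and resulting states, can be illustrated with a plug-in estimate over discrete observations. The sketch below is not EELMA, which works on multi-turn text; it assumes actions and next states have already been reduced to discrete labels and that the action distribution is fixed by the logged data. The function name and the toy labels are hypothetical.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(A; S') in bits from observed (action, next_state) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    actions = Counter(a for a, _ in pairs)
    states = Counter(s for _, s in pairs)
    mi = 0.0
    for (a, s), count in joint.items():
        # p(a, s) / (p(a) * p(s)) simplifies to count * n / (count_a * count_s).
        ratio = count * n / (actions[a] * states[s])
        mi += (count / n) * math.log2(ratio)
    return mi

if __name__ == "__main__":
    # The action fully determines the next state: the agent has control.
    controllable = [("left", "L"), ("right", "R")] * 50
    # The next state is independent of the action: the agent has no control.
    uncontrollable = [("left", "L"), ("left", "R"), ("right", "L"), ("right", "R")] * 25
    print(mutual_information(controllable))
    print(mutual_information(uncontrollable))
```

With two equally likely actions, the controllable log yields 1 bit and the uncontrollable log yields 0 bits, which matches the intuition that empowerment measures how much an agent's choices shape what happens next.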

InfiAgent: Self-Evolving Pyramid Agent Framework for Infinite Scenarios

Authors:Chenglin Yu, Yang Yu, Songmiao Wang, Yucheng Wang, Yifan Yang, Jinjia Li, Ming Li, Hongxia Yang

Large Language Model (LLM) agents have demonstrated remarkable capabilities in organizing and executing complex tasks, and many such agents are now widely used in various application scenarios. However, developing these agents requires carefully designed workflows, carefully crafted prompts, and iterative tuning, which requires LLM techniques and domain-specific expertise. These hand-crafted limitations hinder the scalability and cost-effectiveness of LLM agents across a wide range of industries. To address these challenges, we propose InfiAgent, a Pyramid-like DAG-based Multi-Agent Framework that can be applied to infinite scenarios, which introduces several key innovations: a generalized “agent-as-a-tool” mechanism that automatically decomposes complex agents into hierarchical multi-agent systems; a dual-audit mechanism that ensures the quality and stability of task completion; an agent routing function that enables efficient task-agent matching; and an agent self-evolution mechanism that autonomously restructures the agent DAG based on new tasks, poor performance, or optimization opportunities. Furthermore, InfiAgent’s atomic task design supports agent parallelism, significantly improving execution efficiency. This framework evolves into a versatile pyramid-like multi-agent system capable of solving a wide range of problems. Evaluations on multiple benchmarks demonstrate that InfiAgent achieves 9.9% higher performance compared to ADAS (similar auto-generated agent framework), while a case study of the AI research assistant InfiHelper shows that it generates scientific papers that have received recognition from human reviewers at top-tier IEEE conferences.

Paper and project links

PDF 9 pages of main content and 32 pages of others, 2 figures, under review as a conference paper at ICLR 2026

Summary

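Large language model (LLM) agents show remarkable ability to organize and execute complex tasks and are widely used across application scenarios. Developing them, however, requires carefully designed workflows, crafted prompts, and iterative tuning, which in turn demand LLM techniques and domain-specific expertise. To overcome the limits of hand-crafted design, the authors propose InfiAgent, a pyramid-like DAG-based multi-agent framework with several key innovations: a generalized "agent-as-a-tool" mechanism that automatically decomposes complex agents into hierarchical multi-agent systems; a dual-audit mechanism that ensures the quality and stability of task completion; an agent routing function for efficient task-agent matching; and an agent self-evolution mechanism that autonomously restructures the agent DAG in response to new tasks, poor performance, or optimization opportunities. InfiAgent's atomic task design also supports agent parallelism, significantly improving execution efficiency. Evaluations show that InfiAgent achieves 9.9% higher performance than the ADAS framework, and a case study of the AI research assistant InfiHelper shows that papers it generated received recognition from human reviewers at top-tier IEEE conferences.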

Key Takeaways

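LLM agents show remarkable ability to organize and execute complex tasks and are widely used across industries.
Developing LLM agents requires carefully designed workflows, prompts, and iterative tuning, along with LLM techniques and domain expertise.
InfiAgent addresses the limits of hand-crafted design with several mechanisms, including "agent-as-a-tool" and dual audit.
InfiAgent improves agent effectiveness by automatically decomposing complex agents, ensuring the quality and stability of task completion, and matching tasks to agents efficiently.
InfiAgent supports agent parallelism, significantly improving execution efficiency.
InfiAgent achieves 9.9% higher performance than ADAS.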

RoboView-Bias: Benchmarking Visual Bias in Embodied Agents for Robotic Manipulation

Authors:Enguang Liu, Siyuan Liang, Liming Lu, Xiyu Zeng, Xiaochun Cao, Aishan Liu, Shuchao Pang

The safety and reliability of embodied agents rely on accurate and unbiased visual perception. However, existing benchmarks mainly emphasize generalization and robustness under perturbations, while systematic quantification of visual bias remains scarce. This gap limits a deeper understanding of how perception influences decision-making stability. To address this issue, we propose RoboView-Bias, the first benchmark specifically designed to systematically quantify visual bias in robotic manipulation, following a principle of factor isolation. Leveraging a structured variant-generation framework and a perceptual-fairness validation protocol, we create 2,127 task instances that enable robust measurement of biases induced by individual visual factors and their interactions. Using this benchmark, we systematically evaluate three representative embodied agents across two prevailing paradigms and report three key findings: (i) all agents exhibit significant visual biases, with camera viewpoint being the most critical factor; (ii) agents achieve their highest success rates on highly saturated colors, indicating inherited visual preferences from underlying VLMs; and (iii) visual biases show strong, asymmetric coupling, with viewpoint strongly amplifying color-related bias. Finally, we demonstrate that a mitigation strategy based on a semantic grounding layer substantially reduces visual bias by approximately 54.5% on MOKA. Our results highlight that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.

Summary

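This paper examines visual bias in the perception of embodied agents. Existing benchmarks focus mainly on generalization and robustness and lack a systematic way to quantify visual bias, which limits our understanding of how perception affects decision-making stability. To address this, the paper proposes RoboView-Bias, the first benchmark designed specifically to systematically quantify visual bias in robotic manipulation. Using a structured variant-generation framework and a perceptual-fairness validation protocol, the authors create 2,127 task instances that robustly measure the biases induced by individual visual factors and their interactions. They systematically evaluate three representative embodied agents and report three findings: all agents exhibit significant visual biases, with camera viewpoint the most critical factor; agents achieve their highest success rates on highly saturated colors, indicating visual preferences inherited from the underlying VLMs; and visual biases show strong, asymmetric coupling, with viewpoint amplifying color-related bias. Finally, a mitigation strategy based on a semantic grounding layer reduces visual bias by approximately 54.5% on MOKA. The results underline that systematic analysis of visual bias is a prerequisite for developing safe and reliable general-purpose embodied agents.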

Key Takeaways

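The safety and reliability of embodied agents depend on accurate visual perception, yet existing benchmarks lack a systematic way to quantify visual bias.
RoboView-Bias is the first benchmark designed specifically to systematically quantify visual bias in robotic manipulation.
Embodied agents broadly exhibit visual bias, with camera viewpoint the key factor.
Agents succeed most often on highly saturated colors, suggesting that their color preferences stem from the underlying VLMs.
Visual biases show asymmetric coupling: viewpoint amplifies color-related bias.
A mitigation strategy based on a semantic grounding layer effectively reduces visual bias.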

Secure and Efficient Access Control for Computer-Use Agents via Context Space

Authors:Haochen Gong, Chenxiao Li, Rui Chang, Wenbo Shen

Large language model (LLM)-based computer-use agents represent a convergence of AI and OS capabilities, enabling natural language to control system- and application-level functions. However, due to LLMs’ inherent uncertainty issues, granting agents control over computers poses significant security risks. When agent actions deviate from user intentions, they can cause irreversible consequences. Existing mitigation approaches, such as user confirmation and LLM-based dynamic action validation, still suffer from limitations in usability, security, and performance. To address these challenges, we propose CSAgent, a system-level, static policy-based access control framework for computer-use agents. To bridge the gap between static policy and dynamic context and user intent, CSAgent introduces intent- and context-aware policies, and provides an automated toolchain to assist developers in constructing and refining them. CSAgent enforces these policies through an optimized OS service, ensuring that agent actions can only be executed under specific user intents and contexts. CSAgent supports protecting agents that control computers through diverse interfaces, including API, CLI, and GUI. We implement and evaluate CSAgent, which successfully defends against more than 99.36% of attacks while introducing only 6.83% performance overhead.

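Summary

Large language model (LLM)-based computer-use agents merge AI and OS capabilities, letting natural language control system- and application-level functions. Because of the inherent uncertainty of LLMs, however, granting agents control over computers poses significant security risks: when agent actions deviate from user intent, the consequences can be irreversible. Existing mitigations remain limited in usability, security, and performance. In response, the authors propose CSAgent, a system-level, static policy-based access control framework for computer-use agents. CSAgent introduces intent- and context-aware policies and provides an automated toolchain that helps developers construct and refine them. It enforces these policies through an optimized OS service, ensuring that agent actions execute only under specific user intents and contexts. CSAgent supports protecting agents that control computers through API, CLI, and GUI interfaces. In the authors' evaluation, it defends against more than 99.36% of attacks while introducing only 6.83% performance overhead.

Key Takeaways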

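LLM-based computer-use agents combine AI and OS capabilities, enabling natural-language control.
The inherent uncertainty of LLMs makes granting agents control over computers a security risk.
Agent actions that deviate from user intent can cause irreversible consequences.
Existing mitigations are limited in usability, security, and performance.
CSAgent is a system-level, static policy-based access control framework that protects computer-use agents.
CSAgent introduces intent- and context-aware policies and provides an automated toolchain to help developers build them.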
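
The idea of a static policy that is still sensitive to intent and context can be sketched as a default-deny lookup. The abstract does not describe CSAgent's policy format or its OS service, so everything below is an assumption for illustration: the `Policy` fields, the `is_allowed` function, and the example intent and context labels are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Policy:
    """Hypothetical static policy: the conditions under which one action may run."""
    action: str
    allowed_intents: set
    required_context: dict = field(default_factory=dict)

def is_allowed(policies, action, intent, context):
    """Default-deny check: allow an action only if a policy matches intent and context."""
    for policy in policies:
        if policy.action != action or intent not in policy.allowed_intents:
            continue
        # Every context key the policy requires must be present with the same value.
        if all(context.get(k) == v for k, v in policy.required_context.items()):
            return True
    return False

if __name__ == "__main__":
    policies = [Policy("send_email", {"reply_to_message"}, {"app": "mail"})]
    print(is_allowed(policies, "send_email", "reply_to_message", {"app": "mail"}))
    print(is_allowed(policies, "send_email", "summarize_inbox", {"app": "mail"}))
```

The check involves no model call at decision time, which is one plausible reason a static policy can be cheaper than LLM-based dynamic validation. The reported 6.83% overhead is the paper's measurement of its own system, not of this sketch.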

FeatBench: Evaluating Coding Agents on Feature Implementation for Vibe Coding

Authors:Haorui Chen, Chengze Li, Jia Li

The rapid advancement of Large Language Models (LLMs) has given rise to a novel software development paradigm known as “vibe coding,” where users interact with coding agents through high-level natural language. However, existing evaluation benchmarks for code generation inadequately assess an agent’s vibe coding capabilities. Existing benchmarks are misaligned, as they either require code-level specifications or focus narrowly on issue-solving, neglecting the critical scenario of feature implementation within the vibe coding paradigm. To address this gap, we propose FeatBench, a novel benchmark for vibe coding that focuses on feature implementation. Our benchmark is distinguished by several key features: 1. Pure Natural Language Prompts. Task inputs consist solely of abstract natural language descriptions, devoid of any code or structural hints. 2. A Rigorous & Evolving Data Collection Process. FeatBench is built on a multi-level filtering pipeline to ensure quality and a fully automated pipeline to evolve the benchmark, mitigating data contamination. 3. Comprehensive Test Cases. Each task includes Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests to verify correctness and prevent regressions. 4. Diverse Application Domains. The benchmark includes repositories from diverse domains to ensure it reflects real-world scenarios. We evaluate two state-of-the-art agent frameworks with four leading LLMs on FeatBench. Our evaluation reveals that feature implementation within the vibe coding paradigm is a significant challenge, with the highest success rate of only 29.94%. Our analysis also reveals a tendency for “aggressive implementation,” a strategy that paradoxically leads to both critical failures and superior software design. We release FeatBench, our automated collection pipeline, and all experimental results to facilitate further community research.

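Summary

The rapid advancement of large language models (LLMs) has produced a new software development paradigm, "vibe coding," in which users interact with coding agents through high-level natural language. Existing code generation benchmarks assess this capability poorly: they either require code-level specifications or focus narrowly on issue-solving, neglecting feature implementation. To fill the gap, the authors propose FeatBench, a vibe coding benchmark focused on feature implementation. It has four distinguishing features: (1) pure natural language prompts, with task inputs made up solely of abstract descriptions and no code or structural hints; (2) a rigorous and evolving data collection process, built on a multi-level filtering pipeline for quality and a fully automated pipeline that evolves the benchmark to mitigate data contamination; (3) comprehensive test cases, with Fail-to-Pass (F2P) and Pass-to-Pass (P2P) tests for every task to verify correctness and prevent regressions; and (4) diverse application domains, with repositories chosen to reflect real-world scenarios. The authors evaluate two state-of-the-art agent frameworks with four leading LLMs. Feature implementation proves a significant challenge, with a highest success rate of only 29.94%. The analysis also reveals a tendency toward "aggressive implementation," a strategy that paradoxically leads to both critical failures and superior software design. FeatBench, the automated collection pipeline, and all experimental results are released to support further community research.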

Key Takeaways

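The rapid advancement of large language models has given rise to the "vibe coding" software development paradigm.
Existing benchmarks assess vibe coding capability poorly, so a new evaluation approach is needed.
FeatBench is a vibe coding benchmark focused on feature implementation, with pure natural language prompts, a rigorous data collection process, comprehensive test cases, and diverse application domains.
Evaluation on FeatBench shows that feature implementation is a significant challenge in vibe coding, with a highest success rate of only 29.94%.
Agents tend toward "aggressive implementation," a strategy that can cause critical failures but can also yield superior software design.
FeatBench, the automated collection pipeline, and all experimental results are released to support further community research.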
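
The Fail-to-Pass (F2P) and Pass-to-Pass (P2P) split can be sketched as a comparison of test outcomes before and after the reference change. The sketch assumes the convention common to repository-level benchmarks: F2P tests fail before the change and pass after it, while P2P tests pass both times. FeatBench's actual harness is not described in the abstract, so the function names and result dictionaries here are hypothetical.

```python
def split_tests(before, after_reference):
    """Label tests by their status before and after the reference implementation.

    Both arguments map test names to True (pass) or False (fail).
    """
    passing_after = [t for t, ok in after_reference.items() if ok]
    f2p = sorted(t for t in passing_after if not before.get(t, False))
    p2p = sorted(t for t in passing_after if before.get(t, False))
    return f2p, p2p

def is_resolved(agent_results, f2p, p2p):
    """A task counts as resolved only if every F2P and every P2P test passes."""
    return all(agent_results.get(t, False) for t in f2p + p2p)

if __name__ == "__main__":
    before = {"test_existing_login": True, "test_new_export": False}
    after_reference = {"test_existing_login": True, "test_new_export": True}
    f2p, p2p = split_tests(before, after_reference)
    print(f2p, p2p)
    # The feature works but an existing test broke: a regression, so not resolved.
    print(is_resolved({"test_existing_login": False, "test_new_export": True}, f2p, p2p))
```

Requiring both groups to pass is what lets the benchmark catch the "aggressive implementation" failure mode, where a patch adds the feature but breaks existing behavior.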

Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization

Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch

Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.

Summary

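The paper proposes SummQ, a novel adversarial multi-agent framework that addresses information loss, factual inconsistency, and coherence problems in long document summarization. It does so through collaboration between specialized agents working in two complementary domains, summarization and quizzing. Experiments show that SummQ significantly outperforms existing state-of-the-art methods on widely used long document summarization benchmarks.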

Key Takeaways

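Current large language models struggle with information loss, factual inconsistency, and coherence issues when summarizing long documents.
SummQ is a novel adversarial multi-agent framework designed to address these problems.
SummQ includes summary generators and reviewers that collaborate to create and evaluate comprehensive summaries.
It also includes quiz generators and reviewers that create comprehension questions, which serve as continuous quality checks on the summarization process.
An examinee agent validates whether the generated summary contains the information needed to answer the quiz questions, which enables iterative refinement through multifaceted feedback.
Experiments show that SummQ significantly outperforms existing state-of-the-art methods on three long document summarization benchmarks.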

Decentralized Aerial Manipulation of a Cable-Suspended Load using Multi-Agent Reinforcement Learning

Authors:Jack Zeng, Andreu Matoses Gimenez, Eugene Vinitsky, Javier Alonso-Mora, Sihao Sun

This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://autonomousrobots.nl/paper_websites/aerial-manipulation-marl

Summary

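This paper presents a decentralized method based on multi-agent reinforcement learning (MARL) that lets a team of Micro-Aerial Vehicles (MAVs) perform real-world 6-DoF manipulation of a cable-suspended load. The method trains an outer-loop control policy for each MAV that needs no global states, inter-MAV communication, or neighboring MAV information. Agents communicate implicitly through load pose observations alone, which gives high scalability and flexibility and significantly reduces computing costs at inference time. A new action space design for the MAVs, combined with a robust low-level controller, enables reliable sim-to-real transfer despite the significant uncertainties caused by cable tension during dynamic 3D motion. Real-world experiments, including full-pose control under load model uncertainties, show setpoint tracking performance comparable to the state-of-the-art centralized method.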

Key Takeaways

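The paper presents the first decentralized method that lets a team of MAVs perform real-world 6-DoF manipulation of a cable-suspended load.
Multi-agent reinforcement learning (MARL) trains an outer-loop control policy for each MAV.
The policy needs no global states, inter-MAV communication, or neighboring MAV information, which gives high scalability and flexibility.
Agents communicate implicitly through load pose observations, and the design reduces computing costs at inference time.
A new MAV action space design using linear acceleration and body rates makes sim-to-real transfer more reliable.
A robust low-level controller copes with the uncertainties caused by cable tension during dynamic 3D motion.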

PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning

Authors:Keer Lu, Chong Chen, Bin Cui, Huang Leng, Wentao Zhang

Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Despite their potential, existing work faces challenges when deploying LLMs in agent-based environments. The widely adopted agent paradigm ReAct centers on integrating single-step reasoning with immediate action execution, which limits its effectiveness in complex tasks requiring long-term strategic planning. Furthermore, the coordination between the planner and executor during problem-solving is also a critical factor to consider in agent design. Additionally, current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task completion trajectories, thereby restricting their generalization ability when confronted with novel problem contexts. To address these challenges, we introduce an adaptive global plan-based agent paradigm AdaPlan, aiming to synergize high-level explicit guidance with execution to support effective long-horizon decision-making. Based on the proposed paradigm, we further put forward PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. We first develop the model’s ability to follow explicit guidance from global plans when addressing agent tasks. Subsequently, based on this foundation, we focus on optimizing the quality of generated plans. Finally, we conduct joint optimization of the model’s planning and execution coordination. Experiments indicate that PilotRL could achieve state-of-the-art performance, with LLaMA3.1-8B-Instruct + PilotRL surpassing closed-source GPT-4o by 3.60%, while showing a more substantial gain of 55.78% compared to GPT-4o-mini at a comparable parameter scale.

Summary

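Large language models have made notable progress on agent-oriented tasks but still face challenges when deployed in agent-based environments. The widely adopted ReAct paradigm couples single-step reasoning with immediate action execution, which limits its effectiveness on complex tasks requiring long-term strategic planning. To address this, the authors introduce AdaPlan, an adaptive global plan-based agent paradigm that combines high-level explicit guidance with execution to support effective long-horizon decision-making. Building on it, they propose PilotRL, a global planning-guided training framework for LLM agents driven by progressive reinforcement learning. Experiments indicate that PilotRL achieves state-of-the-art performance: LLaMA3.1-8B-Instruct + PilotRL surpasses closed-source GPT-4o by 3.60% and shows a larger gain of 55.78% over GPT-4o-mini at a comparable parameter scale.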

Key Takeaways

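Large language models have made notable progress on agent-oriented tasks but still face deployment challenges.
The ReAct paradigm couples single-step reasoning with immediate action execution, which suits it poorly to complex tasks requiring long-term strategic planning.
The AdaPlan paradigm combines high-level explicit guidance with execution to support effective long-horizon decision-making.
Building on AdaPlan, the PilotRL training framework uses progressive reinforcement learning with global planning to guide LLM agents.
PilotRL proceeds in stages: it trains the model to follow explicit guidance from global plans, then optimizes the quality of generated plans, and finally jointly optimizes the coordination of planning and execution.
Experiments show that PilotRL achieves state-of-the-art performance, surpassing both GPT-4o and GPT-4o-mini.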

AgentOrchestra: Orchestrating Hierarchical Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol

Authors:Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An

Recent advances in LLM-based agent systems have demonstrated remarkable capabilities in solving complex tasks. Nevertheless, current protocols (e.g., A2A and MCP) suffer from insufficient capabilities in context management, limited adaptability to diverse environments, and the absence of dynamic agent architectures. To address these limitations, we propose the Tool-Environment-Agent (TEA) Protocol, which establishes a principled basis for integrating environments, agents, and tools into a unified system. The TEA protocol treats environments and agents as first-class resources, enabling comprehensive context management and adaptive environment integration. Based on this protocol, we introduce AgentOrchestra, a hierarchical multi-agent framework with a central planning agent that decomposes complex objectives and coordinates specialized agents. Each sub-agent is dedicated to specific functions, providing capabilities for data analysis, file operations, web navigation, and interactive reasoning. Notably, AgentOrchestra introduces a tool manager agent that supports intelligent evolution through dynamic tool creation, retrieval, and reuse mechanisms. Experiments on three widely used benchmarks show that AgentOrchestra consistently outperforms existing baselines, achieving state-of-the-art performance of 83.39% on GAIA and ranking among the top general-purpose LLM-based agents. These results highlight the effectiveness of the TEA Protocol and hierarchical organization in building general-purpose multi-agent systems.

Summary

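Recent advances in LLM-based agent systems have shown remarkable capabilities in solving complex tasks. To address the shortcomings of existing protocols such as A2A and MCP in context management, environment adaptability, and dynamic agent architectures, the authors propose the TEA Protocol. It treats environments and agents as first-class resources, enabling comprehensive context management and adaptive environment integration. Based on the TEA Protocol, they introduce AgentOrchestra, a hierarchical multi-agent framework whose central planning agent decomposes complex objectives and coordinates specialized agents. AgentOrchestra also introduces a tool manager agent that supports intelligent evolution through dynamic tool creation, retrieval, and reuse. Experiments on three widely used benchmarks show that AgentOrchestra outperforms existing baselines, reaching a state-of-the-art 83.39% on GAIA and ranking among the top general-purpose LLM-based agents.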

Key Takeaways

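LLM-based agent systems show strong capabilities in solving complex tasks.
Existing protocols such as A2A and MCP fall short in context management and environment adaptability.
The TEA Protocol integrates environments, agents, and tools, treating environments and agents as first-class resources.
AgentOrchestra is a multi-agent framework based on the TEA Protocol, with a central planning agent and a tool manager agent.
AgentOrchestra decomposes complex objectives and coordinates specialized agents.
AgentOrchestra consistently outperforms existing baselines on three benchmarks, reaching 83.39% on GAIA.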

Darwin Godel Machine: Open-Ended Evolution of Self-Improving Agents

Authors:Jenny Zhang, Shengran Hu, Cong Lu, Robert Lange, Jeff Clune

Today’s AI systems have human-designed, fixed architectures and cannot autonomously and continuously improve themselves. The advance of AI could itself be automated. If done safely, that would accelerate AI development and allow us to reap its benefits much sooner. Meta-learning can automate the discovery of novel algorithms, but is limited by first-order improvements and the human design of a suitable search space. The Gödel machine proposed a theoretical alternative: a self-improving AI that repeatedly modifies itself in a provably beneficial manner. Unfortunately, proving that most changes are net beneficial is impossible in practice. We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%. Furthermore, the DGM significantly outperforms baselines without self-improvement or open-ended exploration. All experiments were done with safety precautions (e.g., sandboxing, human oversight). The DGM is a significant step toward self-improving AI, capable of gathering its own stepping stones along paths that unfold into endless innovation.

Summary
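Today's AI systems rely on fixed, human-designed architectures and cannot improve themselves autonomously and continuously. Automating AI's own improvement, if done safely, would accelerate development and bring its benefits sooner. The Darwin Gödel Machine (DGM), inspired by Darwinian evolution, iteratively modifies its own code and validates each change empirically on coding benchmarks. Its open-ended exploration grows a tree of diverse, high-quality agents and lets many paths through the search space be explored in parallel. Its performance rises from 20.0% to 50.0% on SWE-bench and from 14.2% to 30.7% on Polyglot. The authors present this as a significant step toward self-improving AI.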

Key Takeaways

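Current AI systems rely on fixed, human-designed architectures and cannot improve themselves autonomously and continuously.
Automating AI's own improvement could accelerate AI development and bring its benefits sooner.
The Darwin Gödel Machine (DGM), inspired by Darwinian evolution, iteratively modifies its own code.
The DGM validates the effect of each change on coding benchmarks.
The DGM's open-ended exploration grows a tree of agents and allows many search paths to be explored in parallel.
The DGM's coding ability improves substantially: from 20.0% to 50.0% on SWE-bench and from 14.2% to 30.7% on Polyglot.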
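
The archive loop described in the abstract (sample an agent, create a modified version, evaluate it, add it to the archive) can be sketched in a few lines. This is only a toy illustration, not the authors' implementation: `Agent`, `open_ended_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, the parent is sampled uniformly (the abstract does not state the sampling rule), a random string edit stands in for the foundation model, and a character count stands in for a coding benchmark.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Agent:
    code: str               # stand-in for the agent's own codebase
    score: float            # empirical score from the (toy) benchmark
    parent: Optional[int]   # archive index of the parent, None for the root


def open_ended_loop(
    initial_code: str,
    propose: Callable[[str, random.Random], str],
    evaluate: Callable[[str], float],
    steps: int,
    seed: int = 0,
) -> List[Agent]:
    """Grow an archive by repeatedly sampling a parent and adding its child."""
    rng = random.Random(seed)
    archive = [Agent(initial_code, evaluate(initial_code), None)]
    for _ in range(steps):
        # Assumption: uniform sampling over the whole archive.
        parent_idx = rng.randrange(len(archive))
        child_code = propose(archive[parent_idx].code, rng)
        # Every child is kept, so weaker agents can still act as stepping stones.
        archive.append(Agent(child_code, evaluate(child_code), parent_idx))
    return archive


def toy_propose(code: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a foundation-model edit: append one symbol."""
    return code + rng.choice("+-")


def toy_evaluate(code: str) -> float:
    """Hypothetical stand-in for a coding benchmark."""
    return float(code.count("+") - code.count("-"))


archive = open_ended_loop("", toy_propose, toy_evaluate, steps=50)
best = max(archive, key=lambda a: a.score)
print(f"archive size: {len(archive)}, best score: {best.score}")
```

Because each agent records its parent's index, the archive forms a tree, which matches the "growing tree of diverse agents" described above.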

Grounding Multimodal LLMs to Embodied Agents that Ask for Help with Reinforcement Learning

Authors:Ram Ramrakhya, Matthew Chang, Xavier Puig, Ruta Desai, Zsolt Kira, Roozbeh Mottaghi

Embodied agents operating in household environments must interpret ambiguous and under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions to infer the user intent accurately, leading to more effective task execution. To study this problem, we introduce the Ask-to-Act task, where an embodied agent is tasked with a single or multi-object rearrangement task using an under-specified instruction in a home environment. The agent must strategically ask minimal, yet relevant, clarification questions to resolve ambiguity while navigating under partial observability. To address this challenge, we propose a novel approach that fine-tunes multi-modal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning (RL) with LLM-generated rewards. Our method eliminates the need for large-scale human demonstrations or manually engineered rewards for training such agents. We benchmark against strong zero-shot baselines including GPT-4o as well as supervised fine-tuned MLLMs on our task. Our results show that our RL-finetuned MLLM outperforms all baselines by a significant margin (10.4-16.5%), generalizing well to novel scenes and tasks. To the best of our knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.

Paper and project links

PDF

Summary

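This paper addresses the challenge that embodied agents in household environments face when interpreting ambiguous, under-specified human instructions. A capable household robot should recognize ambiguity and ask relevant clarification questions so that it can infer user intent accurately and execute tasks more effectively. To study this, the authors introduce the Ask-to-Act task and propose fine-tuning multimodal large language models (MLLMs) as vision-language-action (VLA) policies using online reinforcement learning with LLM-generated rewards. The method removes the need for large-scale human demonstrations or manually engineered rewards. The RL-finetuned MLLM outperforms all baselines, including zero-shot GPT-4o and supervised fine-tuned MLLMs, by 10.4-16.5%. To the authors' knowledge, this is the first demonstration of adapting MLLMs as VLA agents that can act and ask for help using LLM-generated rewards with online RL.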

Key Takeaways

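Embodied agents in household environments must interpret ambiguous and under-specified human instructions.
By asking clarification questions, a robot can infer user intent more accurately and execute tasks more effectively.
The Ask-to-Act task is introduced to study this problem.
The proposed method fine-tunes multimodal large language models with online reinforcement learning to handle the task.
The method needs no large-scale human demonstrations or manually engineered rewards.
The RL-finetuned MLLM outperforms all baselines by 10.4-16.5%.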

How Strategic Agents Respond: Comparing Analytical Models with LLM-Generated Responses in Strategic Classification

Authors:Tian Xie, Pavan Rauch, Xueru Zhang

When ML algorithms are deployed to automate human-related decisions, human agents may learn the underlying decision policies and adapt their behavior. Strategic Classification (SC) has emerged as a framework for studying this interaction between agents and decision-makers to design more trustworthy ML systems. Prior theoretical models in SC assume that agents are perfectly or approximately rational and respond to decision policies by optimizing their utility. However, the growing prevalence of LLMs raises the possibility that real-world agents may instead rely on these tools for strategic advice. This shift prompts two questions: (i) Can LLMs generate effective and socially responsible strategies in SC settings? (ii) Can existing SC theoretical models accurately capture agent behavior when agents follow LLM-generated advice? To investigate these questions, we examine five critical SC scenarios: hiring, loan applications, school admissions, personal income, and public assistance programs. We simulate agents with diverse profiles who interact with three commercial LLMs (GPT-4o, GPT-4.1, and GPT-5), following their suggestions on effort allocations on features. We compare the resulting agent behaviors with the best responses in existing SC models. Our findings show that: (i) Even without access to the decision policy, LLMs can generate effective strategies that improve both agents’ scores and qualification; (ii) At the population level, LLM-guided effort allocation strategies yield similar or even higher score improvements, qualification rates, and fairness metrics as those predicted by the SC theoretical model, suggesting that the theoretical model may still serve as a reasonable proxy for LLM-influenced behavior; and (iii) At the individual level, LLMs tend to produce more diverse and balanced effort allocations than theoretical models.

Paper and project links

PDF Add GPT 5 experiments

Summary
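When ML algorithms automate human-related decisions, human agents may learn the underlying decision policies and adapt their behavior. Strategic Classification (SC) is a framework for studying this interaction between agents and decision-makers in order to design more trustworthy ML systems. Existing SC theoretical models assume agents are perfectly or approximately rational and respond to decision policies by optimizing their utility. With the spread of large language models (LLMs), however, real-world agents may instead rely on these tools for strategic advice. This raises two questions: (i) can LLMs generate effective and socially responsible strategies in SC settings, and (ii) can existing SC theoretical models accurately capture agent behavior when agents follow LLM-generated advice? The study simulates agents with diverse profiles interacting with three commercial LLMs (GPT-4o, GPT-4.1, and GPT-5). It finds that LLMs can generate effective strategies without access to the decision policy, improving agents' scores and qualification. At the population level, LLM-guided effort allocations yield score improvements, qualification rates, and fairness metrics similar to or higher than those predicted by the SC theoretical model. At the individual level, LLMs produce more diverse and balanced effort allocations.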

Key Takeaways

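Even without access to the decision policy, LLMs can generate effective strategies that improve agents' scores and qualification.
At the population level, LLM-guided strategies yield score improvements, qualification rates, and fairness metrics similar to or higher than those predicted by the SC theoretical model, suggesting the theoretical model may still be a reasonable proxy for LLM-influenced behavior.
LLMs tend to produce more diverse and balanced effort allocations than theoretical models.
The SC framework matters for studying the interaction between agents and decision-makers and for designing trustworthy ML systems.
Real-world agents may rely on LLMs for strategic advice, which affects how agent behavior is predicted and understood.
Further study of LLM behavior in SC settings is needed, particularly across different scenarios and agent types.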
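
To make the "best response" that the LLM advice is compared against more concrete, here is a minimal sketch of one textbook-style agent model. It is an assumption for illustration, not the model used in the paper: the decision policy is a linear scorer with weights `weights`, effort on feature i has quadratic cost `0.5 * costs[i] * e_i**2`, and effort is non-negative. Under these assumptions the utility is separable and concave, so the optimum is `e_i = max(0, weights[i] / costs[i])`. The function names are hypothetical.

```python
from typing import List


def utility(weights: List[float], costs: List[float], effort: List[float]) -> float:
    """Score gain from effort minus quadratic effort cost (assumed model)."""
    gain = sum(w * e for w, e in zip(weights, effort))
    cost = 0.5 * sum(c * e * e for c, e in zip(costs, effort))
    return gain - cost


def best_response(weights: List[float], costs: List[float]) -> List[float]:
    """Closed-form maximizer of `utility` over non-negative effort."""
    if len(weights) != len(costs):
        raise ValueError("weights and costs must have the same length")
    if any(c <= 0 for c in costs):
        raise ValueError("costs must be strictly positive")
    # Setting d/de_i (w_i * e_i - 0.5 * c_i * e_i^2) = 0 gives e_i = w_i / c_i;
    # features with negative weight get zero effort.
    return [max(0.0, w / c) for w, c in zip(weights, costs)]


weights = [2.0, 1.0, -1.0]   # hypothetical scorer weights
costs = [1.0, 2.0, 1.0]      # hypothetical per-feature effort costs
effort = best_response(weights, costs)
print(effort, utility(weights, costs, effort))
```

In this toy model the rational agent puts most effort on features with a high weight-to-cost ratio, which gives a baseline against which a more balanced allocation can be compared.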

Capacity-Aware Planning and Scheduling in Budget-Constrained Multi-Agent MDPs: A Meta-RL Approach

Authors:Manav Vora, Ilan Shomorony, Melkior Ornik

We study capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class that captures many maintenance and scheduling tasks in which each agent can irreversibly fail and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that scales exponentially with the number of agents. We propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into r disjoint sets (r = capacity) that maximise diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, leveraging transfer across groups to converge rapidly. To validate our approach, we apply it to the problem of scheduling repairs for a large team of industrial robots, constrained by a limited number of repair technicians and a total repair budget. Our results demonstrate that the proposed method outperforms baseline approaches in terms of maximizing the average uptime of the robot team, particularly for large team sizes. Lastly, we confirm the scalability of our approach through a computational complexity analysis across varying numbers of robots and repair technicians.

Paper and project links

PDF

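Summary
The paper studies capacity- and budget-constrained multi-agent MDPs (CB-MA-MDPs), a class covering many maintenance and scheduling tasks in which each agent can fail irreversibly and a planner must decide (i) when to apply a restorative action and (ii) which subset of agents to treat in parallel. The global budget limits the total number of restorations, while the capacity constraint bounds the number of simultaneous actions, turning naïve dynamic programming into a combinatorial search that grows exponentially with the number of agents. The authors propose a two-stage solution that remains tractable for large systems. First, a Linear Sum Assignment Problem (LSAP)-based grouping partitions the agents into r disjoint sets (r = capacity) that maximise diversity in expected time-to-failure, allocating budget to each set proportionally. Second, a meta-trained PPO policy solves each sub-MDP, using transfer across groups to converge rapidly. The method is applied to scheduling repairs for a large team of industrial robots under a limited number of repair technicians and a total repair budget. It outperforms baseline approaches in maximizing the team's average uptime, particularly for large teams. A computational complexity analysis across varying numbers of robots and technicians confirms its scalability.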
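
The first stage (split agents into r groups that each span a range of expected time-to-failure, then divide the budget) can be illustrated with a much simpler stand-in. The sketch below does not implement the paper's LSAP-based grouping: it sorts agents by expected time-to-failure and deals them round-robin into r groups, and it assumes the budget is split in proportion to group size using largest-remainder rounding. Function names and the example data are hypothetical.

```python
from typing import Dict, List


def group_round_robin(ttf: Dict[str, float], r: int) -> List[List[str]]:
    """Deal agents, sorted by expected time-to-failure, into r groups.

    A simple heuristic stand-in for LSAP-based grouping: each group receives
    agents from across the time-to-failure range.
    """
    if r <= 0:
        raise ValueError("r must be positive")
    ordered = sorted(ttf, key=lambda agent: ttf[agent])
    groups: List[List[str]] = [[] for _ in range(r)]
    for i, agent in enumerate(ordered):
        groups[i % r].append(agent)
    return groups


def split_budget(groups: List[List[str]], budget: int) -> List[int]:
    """Split an integer budget proportionally to group size (an assumption)."""
    total = sum(len(g) for g in groups)
    if total == 0:
        return [0] * len(groups)
    exact = [budget * len(g) / total for g in groups]
    shares = [int(x) for x in exact]
    leftover = budget - sum(shares)
    # Largest-remainder rounding: hand leftover units to the biggest fractions.
    order = sorted(range(len(groups)), key=lambda i: exact[i] - shares[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


ttf = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0, "e": 5.0}  # hypothetical robots
groups = group_round_robin(ttf, r=2)
print(groups, split_budget(groups, budget=7))
```

Each group could then be treated as its own sub-MDP with its own budget share, which is the decomposition the second stage relies on.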

Key Takeaways