⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:请勿用于严肃的学术场景,仅适合作为论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-10-23 更新
Search Self-play: Pushing the Frontier of Agent Capability without Supervision
Authors:Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Haotian Xu, Jiaqi Guo, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires massive human efforts and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer’s trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents’ performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups. The code is at https://github.com/Alibaba-Quark/SSP.
基于可验证奖励的强化学习(RLVR)已成为训练大型语言模型(LLM)代理的主流技术。然而,RLVR严重依赖于精心设计的任务查询和相应的真实答案来提供准确的奖励,这需要大量的人工努力,并阻碍了强化学习的扩展过程,特别是在代理场景中。尽管最近有一些工作探索了任务合成方法,但生成的代理任务难度很难控制,难以提供有效的强化学习训练优势。为了实现具有更高可扩展性的代理RLVR,我们探索了深度搜索代理的自我对弈训练,其中学习中的LLM利用多轮搜索引擎调用,同时扮演任务提出者和问题解决者的角色。任务提出者的目标是生成具有明确真实答案、且难度不断增加的深度搜索查询。问题解决者则尝试处理生成的搜索查询,并输出正确的答案预测。为了确保每个生成的搜索查询都有准确的真实答案,我们收集提出者轨迹中的所有搜索结果作为外部知识,然后进行检索增强生成(RAG)测试,以验证在提供所有必要搜索文档的情况下,所提出的查询是否可以被正确回答。在这个搜索自我对弈(SSP)游戏中,提出者和解决者通过竞争与合作共同进化代理能力。大量实验结果表明,在从头训练和持续强化学习两种设置下,SSP都能在无任何监督的情况下,在各种基准测试上一致且显著地提升搜索代理的性能。代码地址:https://github.com/Alibaba-Quark/SSP。
论文及项目相关链接
关键见解
- 强化学习通过可验证奖励(RLVR)是训练大型语言模型(LLM)的主流技术,但需要大量人力来提供任务查询和真实答案。
- 搜索自我对弈(SSP)方法被探索以实现更高可扩展性的代理RLVR。
- 在SSP中,LLM同时充当任务提出者和问题解决者,利用多轮搜索引擎调用。
- 任务提出者生成具有明确真实答案的深入搜索查询,并控制任务难度。
- 问题解决者处理生成的搜索查询并输出答案预测。
- 通过检索增强生成(RAG)测试确保查询的准确性。
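
To make the proposer/solver interplay concrete, below is a minimal, purely illustrative Python sketch of one SSP iteration. The `search`, `llm_propose`, and `llm_answer` functions are toy placeholders (not the authors' implementation), and the zero-sum reward split is only one possible choice; the part taken from the abstract is the RAG validity check, which accepts a proposed query only if it can be answered correctly when all documents from the proposer's search trajectory are supplied.

```python
import random

def search(query):
    """Toy stand-in for a search-engine call; returns a list of documents."""
    return [f"document about {query}"]

def llm_propose(seed_topic):
    """Toy proposer: emits a question, its intended answer, and its search trail."""
    docs = search(seed_topic)
    question = f"What is the main topic of the proposer's search about {seed_topic}?"
    return {"question": question, "answer": seed_topic, "trajectory_docs": docs}

def llm_answer(question, docs=None):
    """Toy solver; passing `docs` simulates RAG over the proposer's collected evidence."""
    if docs:
        return docs[0].split("about ", 1)[1]          # "reads" the answer from the evidence
    return random.choice(["unknown", question.split("about ", 1)[1].rstrip("?")])

def ssp_iteration(seed_topic):
    task = llm_propose(seed_topic)
    # RAG validity check: keep the task only if it is answerable from the proposer's evidence
    if llm_answer(task["question"], docs=task["trajectory_docs"]) != task["answer"]:
        return None                                    # ill-posed query, no training signal
    solver_correct = llm_answer(task["question"]) == task["answer"]
    solver_reward = 1.0 if solver_correct else 0.0
    proposer_reward = 1.0 - solver_reward              # one simple competitive reward choice
    return {"question": task["question"], "solver_reward": solver_reward,
            "proposer_reward": proposer_reward}

print(ssp_iteration("quantum error correction"))
```
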
WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
Authors:Guanzhong He, Zhen Yang, Jinxin Liu, Bin Xu, Lei Hou, Juanzi Li
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach substantially extends tool-use chains and improves answer accuracy. Using a single 14B model, we achieve state-of-the-art results on HotpotQA and SimpleQA, with accuracies of 72.3% and 90.0%, respectively, and demonstrate strong generalization to out-of-distribution datasets. The code is available at https://github.com/99hgz/WebSeer
搜索代理在交互式环境中实现了智能信息检索和决策的重大进步。尽管强化学习已被用于训练能够在更动态交互检索中使用的代理模型,但现有方法受限于工具使用深度较浅以及在多次迭代交互中误差的累积。在本文中,我们介绍了WebSeer,这是一个通过强化学习训练的更智能的搜索代理,并增强了一种自我反思机制。具体来说,我们构建了一个带有反思模式注释的大型数据集,设计了一个两阶段训练框架,该框架将冷启动和强化学习统一在自我反思范式中,用于现实世界基于网络的环境,使模型能够生成更长和更具反思性的工具使用轨迹。我们的方法极大地扩展了工具使用链并提高了答案准确性。使用单个14B模型,我们在HotpotQA和SimpleQA上取得了最新结果,准确率分别为72.3%和90.0%,并表现出对超出分布数据集的强大泛化能力。代码可在 https://github.com/99hgz/WebSeer 找到。
论文及项目相关链接
Summary
在信息检索和决策制定领域,智能搜索代理取得了显著进展。尽管强化学习已被用于训练能够执行更动态交互式检索的代理模型,但现有方法受限于工具使用深度较浅以及在多次迭代交互中误差累积的问题。本文提出WebSeer,一种通过强化学习训练的更智能的搜索代理,并配备了自我反思机制。我们构建了一个带有反思模式注解的大型数据集,设计了一个两阶段训练框架,将冷启动和强化学习统一在自我反思范式中,用于现实网络环境中的搜索任务。该方法大大延长了工具使用链并提高了答案的准确性。在HotpotQA和SimpleQA上,使用单个规模为14B的模型取得了最先进的成果,准确率分别为72.3%和90.0%,并在超出分布范围的数据集上表现出强大的泛化能力。代码可在GitHub上获取。
Key Takeaways
- 智能搜索代理在信息检索和决策制定方面取得显著进展。
- 强化学习已用于训练动态交互式检索的代理模型,但存在工具使用深度不足和误差累积的问题。
- WebSeer是一个更智能的搜索代理,通过强化学习训练,并配备了自我反思机制来解决现有问题。
- 构建了一个大型数据集,带有反思模式注解,用于训练和自我反思范式的应用。
- 采用两阶段训练框架,统一冷启动和强化学习,适用于现实网络环境中的搜索任务。
- 该方法显著延长了工具使用链,提高了答案准确性。
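
A minimal sketch, under stated assumptions, of the self-reflective tool-use loop described in the abstract: the policy can emit a `reflect` step that critiques its own trajectory before issuing the next tool call, instead of letting an early mistake compound over later iterations. The `run_tool` call and the scripted policy are placeholders, not WebSeer's trained model or training code.

```python
def run_tool(query):
    """Placeholder web-search tool."""
    return f"(search results for: {query})"

SCRIPTED_ACTIONS = iter([
    {"type": "search", "query": "author of paper X"},
    {"type": "reflect", "thought": "Results look off-topic; narrow the query to arXiv."},
    {"type": "search", "query": "paper X author site:arxiv.org"},
    {"type": "answer", "text": "Jane Doe"},          # hypothetical final answer
])

def policy(history):
    """Placeholder for the trained policy model; here it just replays a scripted trajectory."""
    return next(SCRIPTED_ACTIONS)

def rollout(question, max_steps=8):
    history = [("question", question)]
    for _ in range(max_steps):
        action = policy(history)
        if action["type"] == "search":
            history.append(("observation", run_tool(action["query"])))
        elif action["type"] == "reflect":
            # the reflection pattern: critique the trajectory so far before acting again
            history.append(("reflection", action["thought"]))
        else:                                        # "answer" ends the episode
            history.append(("answer", action["text"]))
            break
    return history

for step in rollout("Who is the author of paper X?"):
    print(step)
```
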
CUARewardBench: A Benchmark for Evaluating Reward Models on Computer-using Agent
Authors:Haojia Lin, Xiaoyu Tan, Yulei Qin, Zihan Xu, Yuchen Shi, Zongyi Li, Gang Li, Shaofei Cai, Siqi Cai, Chaoyou Fu, Ke Li, Xing Sun
Computer-using agents (CUAs) enable task completion through natural interaction with operating systems and software interfaces. While script-based verifiers are widely adopted for evaluation, they suffer from limited scalability and inability to provide step-wise assessment. Reward models offer promising alternatives, but their effectiveness on CUA evaluation remains largely underexplored. To address this gap, we present CUARewardBench, comprising four key contributions: (1) First-ever Comprehensive CUA Reward Benchmark: We introduce the first benchmark for evaluating both outcome reward models (ORM) and process reward models (PRM) on CUA tasks, enabling systematic assessment across trajectory-level and step-level evaluation. (2) Diverse, Practical and Reliable Dataset: CUARewardBench encompasses trajectories from 10 software categories and 7 agent architectures with varying performance levels (25.9%-50.8% success rates). All trajectories are expertly annotated through carefully designed protocols, with rigorous quality control to ensure reliability and practical applicability. (3) Comprehensive Analysis and Insights: Through extensive experiments across 7 vision-language models and 3 prompt templates, we reveal critical limitations of current CUA RMs, including insufficient visual reasoning capabilities, knowledge deficiencies, and the superiority of general VLMs over specialized CUA models for reward evaluation. (4) Unanimous Prompt Ensemble (UPE): Based on the insights from our comprehensive analysis, we propose UPE, a novel ensemble method that significantly enhances reward model reliability through strict unanimous voting and strategic prompt-template configurations. UPE achieves 89.8% precision and 93.3% NPV for ORM, and 81.7% precision and 85.1% NPV for PRM, substantially outperforming single VLMs and traditional ensemble approaches.
计算机使用代理(CUAs)通过与操作系统和软件接口的自然交互来完成任务。虽然基于脚本的验证器被广泛用于评估,但它们存在可扩展性有限和无法提供逐步评估的问题。奖励模型提供了有前景的替代方案,但它们在CUA评估中的有效性在很大程度上尚未得到探索。为了弥补这一空白,我们提出了CUARewardBench,它包括四个主要贡献:(1)首个全面的CUA奖励基准:我们引入了首个在CUA任务上同时评估结果奖励模型(ORM)和过程奖励模型(PRM)的基准,支持轨迹级和步骤级的系统化评估。(2)多样、实用且可靠的数据集:CUARewardBench涵盖来自10个软件类别和7种代理架构的轨迹,性能水平各异(成功率为25.9%-50.8%);所有轨迹均通过精心设计的协议由专家标注,并经过严格的质量控制,以确保可靠性和实际适用性。(3)全面的分析与见解:通过对7个视觉语言模型和3种提示模板的大量实验,我们揭示了当前CUA奖励模型的关键局限,包括视觉推理能力不足、知识缺陷,以及在奖励评估上通用VLM优于专用CUA模型。(4)统一提示集成(UPE):基于上述分析的见解,我们提出了UPE,这是一种通过严格的一致投票和策略性的提示模板配置来显著提升奖励模型可靠性的新型集成方法。UPE在ORM上达到89.8%的精确率和93.3%的NPV,在PRM上达到81.7%的精确率和85.1%的NPV,大幅优于单个VLM和传统集成方法。
论文及项目相关链接
PDF 24 pages, 6 figures
Summary
该文本介绍了Computer-using agents (CUAs)在完成任务时的互动操作评价与奖励模型。由于传统的脚本验证器存在可扩展性和评估能力的局限性,因此需要寻求新的解决方案。为解决此问题,文章提出了CUARewardBench方案,该方案包括:首个全面的CUA奖励基准测试,涵盖结果奖励模型和过程奖励模型的评价;包含多种软件类别和代理架构的可靠数据集;对多个视觉语言模型和提示模板的综合分析以及对当前CUA奖励模型的局限性的见解;基于分析结果的统一提示集成方法UPE,显著提高了奖励模型的可靠性。该方案有望推动CUA任务评价体系的进步。
Key Takeaways
- Computer-using agents (CUAs)通过自然交互完成操作系统和软件界面的任务,但现有的评价工具存在局限性。
- CUARewardBench包含首个全面的CUA奖励基准测试,旨在评价结果奖励模型和过程奖励模型。
- 数据集包含多种软件类别和代理架构的专家注释轨迹,具有可靠性和实用性。
- 综合分析揭示了当前CUA奖励模型的局限性,包括视觉推理能力、知识不足等。
- 统一提示集成方法(UPE)显著提高了奖励模型的可靠性,通过严格的投票和策略性的提示模板配置实现。
- UPE在结果奖励模型和过程奖励模型的评估中均表现出优越性能。
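
The strict-unanimity rule behind UPE is easy to show in code. The prompt templates and the `judge` stub below are invented placeholders (the benchmark's judges are VLMs scoring GUI trajectories); the sketch only illustrates the voting logic: a trajectory is predicted successful only when every judge and prompt-template configuration agrees, which is what trades recall for the high precision and NPV quoted above.

```python
from itertools import product

PROMPT_TEMPLATES = ["template_a", "template_b", "template_c"]   # stand-ins for real prompts
JUDGES = ["vlm_judge_1", "vlm_judge_2"]                         # stand-ins for VLM reward models

def judge(model, template, trajectory):
    """Placeholder for one VLM reward-model call; True means 'task completed'."""
    return trajectory["looks_successful"]                       # toy criterion

def upe_verdict(trajectory):
    votes = (judge(m, t, trajectory) for m, t in product(JUDGES, PROMPT_TEMPLATES))
    return all(votes)   # strict unanimous voting: any dissenting vote -> predicted failure

print(upe_verdict({"looks_successful": True}))   # -> True
print(upe_verdict({"looks_successful": False}))  # -> False
```
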
CLASP: Cost-Optimized LLM-based Agentic System for Phishing Detection
Authors:Fouad Trad, Ali Chehab
Phishing websites remain a significant cybersecurity threat, necessitating accurate and cost-effective detection mechanisms. In this paper, we present CLASP, a novel system that effectively identifies phishing websites by leveraging multiple intelligent agents, built using large language models (LLMs), to analyze different aspects of a web resource. The system processes URLs or QR codes, employing specialized LLM-based agents that evaluate the URL structure, webpage screenshot, and HTML content to predict potential phishing threats. To optimize performance while minimizing operational costs, we experimented with multiple combination strategies for agent-based analysis, ultimately designing a strategic combination that ensures the per-website evaluation expense remains minimal without compromising detection accuracy. We tested various LLMs, including Gemini 1.5 Flash and GPT-4o mini, to build these agents and found that Gemini 1.5 Flash achieved the best performance with an F1 score of 83.01% on a newly curated dataset. Also, the system maintained an average processing time of 2.78 seconds per website and an API cost of around $3.18 per 1,000 websites. Moreover, CLASP surpasses leading previous solutions, achieving over 40% higher recall and a 20% improvement in F1 score for phishing detection on the collected dataset. To support further research, we have made our dataset publicly available, supporting the development of more advanced phishing detection systems.
网络钓鱼网站仍然是网络安全领域的一大威胁,因此需要准确且经济实惠的检测机制。在本文中,我们提出了CLASP系统,这是一种新型的反钓鱼网站检测系统。该系统通过利用多个智能代理,借助大型语言模型(LLM)分析网络资源的不同方面,实现对钓鱼网站的准确识别。该系统处理URL或二维码,采用基于LLM的专业代理,评估URL结构、网页截图和HTML内容,以预测潜在的钓鱼网站威胁。为了优化性能同时降低运营成本,我们尝试了各种基于代理的分析组合策略,最终设计了一种战略组合,确保在不损害检测精度的前提下,每个网站的评估成本保持最低。我们在构建这些代理时测试了多种LLM,包括Gemini 1.5 Flash和GPT-4o mini等,发现Gemini 1.5 Flash在新整理的数据集上表现最佳,F1分数达到83.01%。此外,系统平均每个网站的处理时间为2.78秒,API成本约为每处理一千个网站需要支付约3.18美元。而且,CLASP超过了领先的先前解决方案,在收集的数据集上实现了召回率提高了40%以上以及F1分数提高了20%,达到了反钓鱼网站检测领域的高性能表现。为了支持进一步研究,我们已经公开提供数据集,以支持更先进的反钓鱼网站检测系统的开发。
论文及项目相关链接
PDF Accepted in the 5th International Conference on Electrical, Computer, and Energy Technologies (ICECET2025)
Summary
本文介绍了一种新型网络安全系统CLASP,它通过利用多个智能代理来识别钓鱼网站。这些智能代理基于大型语言模型构建,用于分析网络资源的不同方面,如URL结构、网页截图和HTML内容等。CLASP优化了性能并降低了运营成本,同时提高了钓鱼网站检测的准确性。
Key Takeaways
- CLASP系统通过利用大型语言模型构建的智能代理来检测钓鱼网站。
- 系统能够分析URL结构、网页截图和HTML内容等多个方面来预测潜在的钓鱼威胁。
- CLASP优化了基于代理的分析组合策略,确保在不牺牲检测精度的前提下降低每网站评估成本。
- 在新数据集上,使用Gemini 1.5 Flash构建的代理获得了最佳性能,F1分数为83.01%。
- 系统平均处理时间为每网站2.78秒,API成本约为每1,000网站$3.18。
- CLASP超越了现有的解决方案,在收集的数据集上实现了更高的召回率和F1分数。
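
The abstract mentions a cost-minimizing combination of the three agents without spelling out the exact strategy, so the sketch below is only one plausible cascade under that constraint: the cheap URL agent runs first, and the more expensive screenshot and HTML agents are consulted only when its verdict is not confident. The agent bodies are toy heuristics, not CLASP's actual Gemini/GPT prompts.

```python
def url_agent(url):
    """Placeholder LLM agent over the raw URL; returns (is_phishing, confidence)."""
    suspicious = any(token in url for token in ("verify", "login-", "@"))
    return suspicious, 0.9 if suspicious else 0.6

def screenshot_agent(url):
    """Placeholder vision agent over a rendered screenshot (more expensive)."""
    return False, 0.8

def html_agent(url):
    """Placeholder agent over the fetched HTML content (most expensive)."""
    return False, 0.8

def classify(url, threshold=0.75):
    verdict = False
    # illustrative cost-aware cascade: stop as soon as some agent is confident enough
    for agent in (url_agent, screenshot_agent, html_agent):
        verdict, confidence = agent(url)
        if confidence >= threshold:
            break
    return verdict

print(classify("http://example.com/verify-account"))  # flagged by the cheap URL agent alone
print(classify("http://example.com/blog/post-1"))     # escalates to the heavier agents
```
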
QuantEvolve: Automating Quantitative Strategy Discovery through Multi-Agent Evolutionary Framework
Authors:Junhyeog Yun, Hyoun Jun Lee, Insu Jeon
Automating quantitative trading strategy development in dynamic markets is challenging, especially with increasing demand for personalized investment solutions. Existing methods often fail to explore the vast strategy space while preserving the diversity essential for robust performance across changing market conditions. We present QuantEvolve, an evolutionary framework that combines quality-diversity optimization with hypothesis-driven strategy generation. QuantEvolve employs a feature map aligned with investor preferences, such as strategy type, risk profile, turnover, and return characteristics, to maintain a diverse set of effective strategies. It also integrates a hypothesis-driven multi-agent system to systematically explore the strategy space through iterative generation and evaluation. This approach produces diverse, sophisticated strategies that adapt to both market regime shifts and individual investment needs. Empirical results show that QuantEvolve outperforms conventional baselines, validating its effectiveness. We release a dataset of evolved strategies to support future research.
在动态市场中自动化定量交易策略的发展具有挑战性,特别是随着对个性化投资解决方案的需求不断增加。现有方法往往无法在探索庞大的策略空间的同时保持在不同市场条件下的稳健性能所需的多样性。我们推出QuantEvolve,这是一个结合了质量多样性优化和假设驱动策略生成的进化框架。QuantEvolve采用与投资者偏好相符的特征图,如策略类型、风险状况、周转率和回报特征,以维持一组多样化的有效策略。它还整合了一个假设驱动的多元代理系统,通过迭代生成和评估来系统地探索策略空间。这种方法产生了多样化且复杂的策略,能够适应市场机制的转变和个性化的投资需求。实证结果表明,QuantEvolve优于传统基线,验证了其有效性。我们发布了一组进化策略的数据集,以支持未来的研究。
论文及项目相关链接
PDF 25 pages, 13 figures. Accepted for oral presentation at the 2nd Workshop on LLMs and Generative AI for Finance (AI4F), part of ACM ICAIF 2025, Singapore. Non-archival workshop
Summary:
提出的QuantEvolve是一个结合质量多样性优化和假设驱动策略生成的进化框架,用于在动态市场中自动化定量交易策略开发。它通过特征映射和投资者偏好对齐,如策略类型、风险概况、换手率和回报特征等,维持一组有效的多样化策略。同时,它整合了一个假设驱动的多智能体系统,通过迭代生成和评估来系统地探索策略空间。此方法产生的策略多样化且复杂,能够适应市场模式转变和个人投资需求。实证结果显示QuantEvolve优于传统基线,验证了其有效性。
Key Takeaways:
- QuantEvolve是一个用于动态市场定量交易策略开发的进化框架。
- 它结合了质量多样性优化和假设驱动策略生成。
- QuantEvolve通过特征映射与投资者偏好对齐,维持有效且多样化的策略。
- 该框架整合了多智能体系统,以系统地探索策略空间。
- QuantEvolve产生的策略能够适应市场变化和个人投资需求。
- 实证结果显示QuantEvolve优于传统方法。
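
A compact sketch of the quality-diversity archive implied by the feature map in the abstract, in the spirit of MAP-Elites: candidate strategies are binned by investor-facing features (strategy type, risk, turnover) and only the best-scoring strategy per cell is kept, so optimization cannot collapse onto a single style. The feature bins and the fitness field are illustrative assumptions, not QuantEvolve's actual configuration.

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    name: str
    kind: str        # e.g. "momentum", "mean_reversion"
    risk: float      # assumed annualised volatility in [0, 1]
    turnover: float  # assumed fraction of the book traded per period
    fitness: float   # e.g. backtested Sharpe ratio (assumption)

def cell(s: Strategy):
    """Map a strategy to a discrete feature-map cell; the bin widths here are arbitrary."""
    return (s.kind, round(s.risk, 1), round(s.turnover, 1))

archive = {}

def add_to_archive(s: Strategy):
    key = cell(s)
    if key not in archive or s.fitness > archive[key].fitness:
        archive[key] = s    # keep only the elite of each cell, preserving diversity overall

for s in [Strategy("m1", "momentum", 0.32, 0.50, 1.1),
          Strategy("m2", "momentum", 0.34, 0.50, 1.4),
          Strategy("r1", "mean_reversion", 0.12, 0.20, 0.9)]:
    add_to_archive(s)

print({key: s.name for key, s in archive.items()})   # m2 displaces m1 in the shared cell
```
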
SOCIA-Nabla: Textual Gradient Meets Multi-Agent Orchestration for Automated Simulator Generation
Authors:Yuncheng Hua, Sion Weatherhead, Mehdi Jafari, Hao Xue, Flora D. Salim
In this paper, we present SOCIA-Nabla, an end-to-end, agentic framework that treats simulator construction as instance optimization over code within a textual computation graph. Specialized LLM-driven agents are embedded as graph nodes, and a workflow manager executes a loss-driven loop: code synthesis -> execution -> evaluation -> code repair. The optimizer performs Textual-Gradient Descent (TGD), while human-in-the-loop interaction is reserved for task-spec confirmation, minimizing expert effort and keeping the code itself as the trainable object. Across three CPS tasks, i.e., User Modeling, Mask Adoption, and Personal Mobility, SOCIA-Nabla attains state-of-the-art overall accuracy. By unifying multi-agent orchestration with a loss-aligned optimization view, SOCIA-Nabla converts brittle prompt pipelines into reproducible, constraint-aware simulator code generation that scales across domains and simulation granularities. This work is under review, and we will release the code soon.
本文介绍了SOCIA-Nabla,这是一个端到端的智能框架,它将模拟器构建视为文本计算图内代码实例的优化问题。嵌入专门的LLM驱动的智能体作为图节点,工作流程管理器执行以损失为驱动循环:代码合成->执行->评估->代码修复。优化器执行文本梯度下降(TGD),而人机交互仅限于特定任务确认,尽量减少专家工作量,并将代码本身作为可训练对象。在三项CPS任务(即用户建模、口罩佩戴和个人移动)中,SOCIA-Nabla均取得了最先进的总体准确率。通过统一多智能体的编排与损失对齐的优化视图,SOCIA-Nabla将脆弱的提示管道转化为可复制的、具有约束意识的模拟器代码生成,可跨领域和仿真粒度进行扩展。该工作正在接受审查,我们将很快发布代码。
论文及项目相关链接
PDF 11 pages, 1 figure, 2 tables. The paper is under review
Summary
SOCIA-Nabla是一个端到端的智能框架,它将模拟器构建视为文本计算图中的代码实例优化问题。该框架嵌入特殊LLM驱动的代理作为图形节点,并通过工作流管理器执行损失驱动的循环:代码合成、执行、评估、代码修复。优化器执行文本梯度下降法,而人为介入仅限于任务特定确认,从而最小化专家工作量并专注于训练代码本身。在三项CPS任务中,SOCIA-Nabla取得了最先进的总体准确性。通过统一多智能体编排与损失对齐优化视图,SOCIA-Nabla将脆弱的提示管道转化为可重复的、约束感知的模拟器代码生成,可跨领域和模拟粒度进行扩展。
Key Takeaways
- SOCIA-Nabla是一个端到端的智能框架,将模拟器构建视为文本计算图中的代码优化问题。
- 该框架嵌入特殊LLM驱动的代理作为图形节点,实现多智能体协同工作。
- 工作流管理器执行损失驱动的循环,包括代码合成、执行、评估和修复。
- 优化器执行文本梯度下降法,减少人为介入,最小化专家工作量。
- SOCIA-Nabla在三项CPS任务中取得了最先进的总体准确性。
- 该框架将脆弱的提示管道转化为可重复的、约束感知的模拟器代码生成。
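
A minimal sketch of the loss-driven loop named in the abstract (code synthesis -> execution -> evaluation -> code repair), with the evaluator's textual critique playing the role of the gradient in Textual-Gradient Descent. The `synthesize`, `evaluate`, and `repair` functions are hard-coded placeholders for the LLM-driven agents in the computation graph; only the loop structure is meant to be informative.

```python
def synthesize(spec):
    """Placeholder code-synthesis agent (returns a deliberately wrong first draft)."""
    return "def simulate(n):\n    return n * 2"

def execute(code, n=3):
    namespace = {}
    exec(code, namespace)                    # run the candidate simulator code
    return namespace["simulate"](n)

def evaluate(output, expected):
    """Placeholder evaluator: returns (loss, textual critique)."""
    if output == expected:
        return 0.0, "matches the spec"
    return 1.0, f"simulate(3) returned {output}, but the spec expects {expected}; fix the formula"

def repair(code, critique):
    """Placeholder repair agent: applies the textual gradient to produce revised code."""
    return "def simulate(n):\n    return n * n"

def tgd_loop(spec, expected, max_iters=5):
    code = synthesize(spec)
    for _ in range(max_iters):
        loss, critique = evaluate(execute(code), expected)
        if loss == 0.0:
            return code                      # converged: the code itself is the trained object
        code = repair(code, critique)        # one Textual-Gradient Descent step
    return code

print(tgd_loop("simulate(n) should return n squared", expected=9))
```
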
Crucible: Quantifying the Potential of Control Algorithms through LLM Agents
Authors:Lianchen Jia, Chaoyang Li, Qian Houde, Tianchi Huang, Jiangchuan Liu, Lifeng Sun
Control algorithms in production environments typically require domain experts to tune their parameters and logic for specific scenarios. However, existing research predominantly focuses on algorithmic performance under ideal or default configurations, overlooking the critical aspect of Tuning Potential. To bridge this gap, we introduce Crucible, an agent that employs an LLM-driven, multi-level expert simulation to tune algorithms and defines a formalized metric to quantitatively evaluate their Tuning Potential. We demonstrate Crucible’s effectiveness across a wide spectrum of case studies, from classic control tasks to complex computer systems, and validate its findings in a real-world deployment. Our experimental results reveal that Crucible systematically quantifies the tunable space across different algorithms. Furthermore, Crucible provides a new dimension for algorithm analysis and design, which ultimately leads to performance improvements. Our code is available at https://github.com/thu-media/Crucible.
在生产环境中,控制算法通常需要领域专家针对特定场景调整其参数和逻辑。然而,现有研究主要集中在理想或默认配置下的算法性能上,忽视了调整潜力这一关键方面。为了弥补这一差距,我们引入了Crucible,它是一个采用LLM驱动的多级专家模拟的代理,能够调整算法并定义量化评估其调整潜力的正式指标。我们在从经典控制任务到复杂计算机系统的广泛案例研究中展示了Crucible的有效性,并在实际部署中验证了其发现。我们的实验结果表明,Crucible能够系统地量化不同算法的可调空间。此外,Crucible为算法分析和设计提供了新的维度,最终提高了性能。我们的代码可在https://github.com/thu-media/Crucible找到。
论文及项目相关链接
PDF NeurIPS 2025
Summary
针对生产环境中控制算法通常需要领域专家针对特定场景调整参数和逻辑的问题,研究提出Crucible,采用LLM驱动的多级专家模拟来优化算法,并定义了一种量化评估其调整潜力的正式指标。实验结果显示,Crucible能够系统地量化不同算法的可调空间,并提供新的算法分析和设计维度,最终实现性能提升。
Key Takeaways
- 控制算法在生产环境中需要领域专家进行参数和逻辑调整以适应特定场景。
- 当前研究主要关注算法在理想或默认配置下的性能,忽视了调整潜力的重要性。
- Crucible通过采用LLM驱动的多级专家模拟来优化算法。
- Crucible定义了一种量化评估算法调整潜力的正式指标。
- Crucible在经典控制任务到复杂计算机系统的广泛案例研究中展示了有效性。
- Crucible的发现经过实际部署验证。
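
One way to read the Tuning Potential idea is as the performance headroom between an algorithm's default configuration and the best configuration an expert (simulated by LLM agents in Crucible) can find. The toy sketch below substitutes random search for the multi-level expert simulation and tunes a single gain of a proportional controller; both choices are illustrative assumptions, not the paper's formal metric.

```python
import random

def run_controller(gain, target=1.0, steps=50):
    """Toy plant plus proportional controller; higher (less negative) score is better."""
    state, total_error = 0.0, 0.0
    for _ in range(steps):
        error = target - state
        state += gain * error
        total_error += abs(error)
    return -total_error

def tuning_potential(default_gain=0.05, trials=200, seed=0):
    rng = random.Random(seed)
    default_score = run_controller(default_gain)
    best_score = max(run_controller(rng.uniform(0.01, 1.0)) for _ in range(trials))
    # stand-in metric: how much performance tuning can recover over the default setting
    return best_score - default_score

print(f"toy tuning potential: {tuning_potential():.2f}")
```
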
AndroidControl-Curated: Revealing the True Potential of GUI Agents through Benchmark Purification
Authors:Ho Fai Leung, Xiaoyan Xi, Fei Zuo
On-device virtual assistants like Siri and Google Assistant are increasingly pivotal, yet their capabilities are hamstrung by a reliance on rigid, developer-dependent APIs. GUI agents offer a powerful, API-independent alternative, but their adoption is hindered by the perception of poor performance, as even the best models (e.g., Qwen3-VL-235B) are capped at around 60% on benchmarks like AndroidControl, far from viability for real-world use. Our research reveals that the issue lies not only with the models but with the benchmarks themselves. We identified notable shortcomings in AndroidControl, including ambiguities and factual errors, which systematically underrate agent capabilities. To address this critical oversight, we enhanced AndroidControl into AndroidControl-Curated, a refined version of the benchmark improved through a rigorous purification pipeline. On this enhanced benchmark, state-of-the-art models achieve success rates nearing 75% on complex tasks (a 15% improvement), reflecting that on-device GUI agents are actually closer to practical deployment than previously thought. We introduce our new SOTA model, Magma-R1-3B, post-trained on just 2.4k curated samples using 60 hours of an H20 GPU (approximately $60). Despite being 200 times smaller in parameters, this model delivers performance comparable to Qwen3-VL-235B. We release both the AndroidControl-Curated benchmark and the Magma-R1 model to the research community, encouraging adoption of this enhanced benchmark to better reflect model capabilities and accelerate the development of robust, on-device virtual assistants.
在设备上的虚拟助手,如Siri和Google Assistant,越来越重要,然而他们的能力受到了依赖于僵化、开发者依赖的API的束缚。GUI代理提供了强大、独立于API的替代方案,但它们的采用受到了性能不佳的感知的阻碍,因为即使是最好的模型(例如Qwen3-VL-235B)在AndroidControl等基准测试上的得分也仅限于大约60%,距离实际使用可行性相去甚远。我们的研究发现,问题不仅出在模型上,还出在基准测试本身。我们发现了AndroidControl的显著缺点,包括模糊性和事实错误,它系统地低估了代理的能力。为了解决这一关键疏忽,我们将AndroidControl增强为AndroidControl-Curated,这是通过严格的纯化管道改进的基准测试的精炼版本。在这个增强的基准测试上,最先进的模型在复杂任务上的成功率接近75%(提高了15%),这表明设备上的GUI代理实际上比以前认为的更接近实际部署。我们推出了新的SOTA模型Magma-R1-3B,该模型仅在2.4k精选样本上进行后训练,使用60小时的H20 GPU(约60美元)。尽管参数小了200倍,但该模型的性能与Qwen3-VL-235B相当。我们将AndroidControl-Curated基准测试和Magma-R1模型发布给研究社区,鼓励采用这种增强的基准测试以更好地反映模型能力,并加速开发稳健的、在设备上的虚拟助手。
论文及项目相关链接
摘要
本文主要研究智能设备的虚拟助手性能评估问题。传统的智能虚拟助手如Siri和Google Assistant依赖开发者设定的API,限制了其性能发挥。而GUI代理提供了一种强大的API独立替代方案,但其应用受到性能不佳这一印象的阻碍。本文揭示,问题不仅在于模型本身,也在于评估基准本身存在缺陷。本文指出AndroidControl基准存在明显不足,包括模糊性和事实错误,导致其系统性低估了代理的能力。为解决这一关键缺陷,本文通过严格的净化流程改进了AndroidControl基准,推出AndroidControl-Curated版本。在此增强基准上,最先进的模型在复杂任务上的成功率接近75%(提升了15%),表明设备上的GUI代理实际上比以往认为的更接近实际应用部署。本文还介绍了新推出的SOTA模型Magma-R1-3B,该模型仅在2.4k精选样本上进行后训练,使用60小时的H20 GPU(约60美元),参数规模小200倍,但性能与Qwen3-VL-235B相当。本文向研究社区发布了AndroidControl-Curated基准和Magma-R1模型,鼓励采用此增强基准以更好地反映模型能力并加速稳健的设备端虚拟助手的开发。
关键见解
- 智能设备的虚拟助手如Siri和Google Assistant受限于依赖开发者设定的API。
- GUI代理提供了一种强大的API独立替代方案,但因其性能不佳而未能广泛应用。
- 现有评估基准如AndroidControl存在缺陷,包括模糊性和事实错误,导致对虚拟助手的性能评估不准确。
- 本文推出了改进后的AndroidControl-Curated基准,更准确地评估虚拟助手性能。
- 在新基准上,先进模型的表现有所提升,成功率接近75%,表明GUI代理更接近实际应用部署。
- 介绍了新SOTA模型Magma-R1-3B,其参数规模小、成本低,但性能与现有先进模型相当。
- 本文鼓励研究社区采用新的AndroidControl-Curated基准和Magma-R1模型,以推动虚拟助手的开发和应用。

LAFA: Agentic LLM-Driven Federated Analytics over Decentralized Data Sources
Authors:Haichao Ji, Zibo Wang, Yifei Zhu, Meng han, Dan Wang, Zhu Han
Large Language Models (LLMs) have shown great promise in automating data analytics tasks by interpreting natural language queries and generating multi-operation execution plans. However, existing LLM-agent-based analytics frameworks operate under the assumption of centralized data access, offering little to no privacy protection. In contrast, federated analytics (FA) enables privacy-preserving computation across distributed data sources, but lacks support for natural language input and requires structured, machine-readable queries. In this work, we present LAFA, the first system that integrates LLM-agent-based data analytics with FA. LAFA introduces a hierarchical multi-agent architecture that accepts natural language queries and transforms them into optimized, executable FA workflows. A coarse-grained planner first decomposes complex queries into sub-queries, while a fine-grained planner maps each subquery into a Directed Acyclic Graph of FA operations using prior structural knowledge. To improve execution efficiency, an optimizer agent rewrites and merges multiple DAGs, eliminating redundant operations and minimizing computational and communicational overhead. Our experiments demonstrate that LAFA consistently outperforms baseline prompting strategies by achieving higher execution plan success rates and reducing resource-intensive FA operations by a substantial margin. This work establishes a practical foundation for privacy-preserving, LLM-driven analytics that supports natural language input in the FA setting.
大型语言模型(LLM)在通过解释自然语言查询和生成多操作执行计划来自动化数据分析任务方面显示出巨大潜力。然而,现有的基于LLM的代理分析框架都是在集中访问数据的假设下运行,提供很少或根本没有隐私保护。相比之下,联邦分析(FA)能够在分布式数据源上进行隐私保护计算,但缺乏自然语言输入的支持,并且需要结构化、机器可读的查询。在这项工作中,我们介绍了LAFA系统,它是第一个将基于LLM的数据分析与FA相结合的系统。LAFA引入了一种分层的多代理架构,该架构可以接受自然语言查询并将其转换为优化的可执行的FA工作流程。粗粒度规划器首先分解复杂的查询到子查询,而细粒度规划器使用先前结构知识将每个子查询映射到FA操作的有向无环图。为了提高执行效率,优化器代理会重写和合并多个DAG,消除冗余操作并尽量减少计算和通信开销。我们的实验表明,LAFA通过实现更高的执行计划成功率并大幅减少资源密集型的FA操作,始终优于基本的提示策略。这项工作为支持自然语言输入的隐私保护LLM驱动分析奠定了实用基础。
论文及项目相关链接
Summary
大型语言模型(LLM)在通过自然语言查询生成多操作执行计划以自动化数据分析任务方面展现出巨大潜力。然而,现有的LLM代理驱动的分析框架假定集中式数据访问,几乎不提供隐私保护。相比之下,联邦分析(FA)能够在分布式数据源上实现隐私保护计算,但缺乏自然语言输入支持,需要结构化、可读的查询。本研究提出了LAFA系统,它首次将LLM代理驱动的数据分析与FA集成在一起。LAFA采用分层多代理架构,接受自然语言查询并将其转换为优化的可执行FA工作流程。粗粒度规划器首先将复杂查询分解为子查询,而细粒度规划器则利用先前的结构知识将每个子查询映射到FA操作的定向无环图(DAG)。为了提高执行效率,优化器代理会重写和合并多个DAG,消除冗余操作,最大限度地减少计算和通信开销。实验表明,LAFA通过实现更高的执行计划成功率并大幅减少资源密集型FA操作,始终优于基线提示策略。本研究为支持自然语言输入的隐私保护LLM驱动分析奠定了实践基础。
Key Takeaways
- LLM在自动化数据解析任务中表现出巨大潜力,可通过解释自然语言查询生成多操作执行计划。
- 现有LLM代理驱动的分析框架主要基于集中式数据访问,缺乏隐私保护机制。
- 联邦分析(FA)能在分布式数据源上实现隐私保护计算,但缺乏自然语言输入支持。
- LAFA系统首次集成了LLM代理驱动的数据分析与FA,采用分层多代理架构。
- LAFA能将自然语言查询转换为优化的可执行FA工作流程,包括粗粒度和细粒度规划器。
- LAFA通过优化执行计划,提高执行效率,减少冗余操作和计算通信开销。
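
The optimizer agent's rewrite-and-merge step can be illustrated with a toy plan representation: each sub-query compiles to a DAG of federated-analytics operations, and when several DAGs are merged, identical operations over the same inputs are executed only once. The operation names below are invented for the example; LAFA's planner works over a richer FA operator set and uses an LLM to perform the rewriting.

```python
# An operation is (op_name, inputs); a plan is a list of operations in execution order.
mean_plan = [("secure_count", ("hospital_records",)),
             ("secure_sum",   ("hospital_records",)),
             ("divide",       ("secure_sum", "secure_count"))]   # federated mean of a column

hist_plan = [("secure_count", ("hospital_records",)),            # duplicates work in mean_plan
             ("secure_histogram", ("hospital_records",))]

def merge_plans(*plans):
    merged, seen = [], set()
    for plan in plans:
        for op in plan:
            if op not in seen:        # drop redundant federated operations across sub-plans
                seen.add(op)
                merged.append(op)
    return merged

for op in merge_plans(mean_plan, hist_plan):
    print(op)                         # "secure_count" appears only once in the merged plan
```
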
Wonder Wins Ways: Curiosity-Driven Exploration through Multi-Agent Contextual Calibration
Authors:Yiyuan Pan, Zhe Liu, Hesheng Wang
Autonomous exploration in complex multi-agent reinforcement learning (MARL) with sparse rewards critically depends on providing agents with effective intrinsic motivation. While artificial curiosity offers a powerful self-supervised signal, it often confuses environmental stochasticity with meaningful novelty. Moreover, existing curiosity mechanisms exhibit a uniform novelty bias, treating all unexpected observations equally. However, peer behavior novelty, which encode latent task dynamics, are often overlooked, resulting in suboptimal exploration in decentralized, communication-free MARL settings. To this end, inspired by how human children adaptively calibrate their own exploratory behaviors via observing peers, we propose a novel approach to enhance multi-agent exploration. We introduce CERMIC, a principled framework that empowers agents to robustly filter noisy surprise signals and guide exploration by dynamically calibrating their intrinsic curiosity with inferred multi-agent context. Additionally, CERMIC generates theoretically-grounded intrinsic rewards, encouraging agents to explore state transitions with high information gain. We evaluate CERMIC on benchmark suites including VMAS, Meltingpot, and SMACv2. Empirical results demonstrate that exploration with CERMIC significantly outperforms SoTA algorithms in sparse-reward environments.
在复杂的多智能体强化学习(MARL)中,稀疏奖励下的自主探索严重依赖于为智能体提供有效的内在动机。虽然人工好奇心提供了一个强大的自监督信号,但它经常把环境随机性与有意义的新颖性混为一谈。此外,现有的好奇心机制表现出统一的新奇偏好,即平等对待所有意外观察结果。然而,编码了潜在任务动态的同伴行为新颖性常常被忽视,导致在去中心化、无通信的MARL环境中探索效果不佳。受人类儿童通过观察同伴来自适应地校准自身探索行为的启发,我们提出了一种增强多智能体探索的新方法。我们引入了CERMIC,这是一个有原则的框架,使智能体能够稳健地过滤带噪声的惊奇信号,并通过将其内在好奇心与推断出的多智能体上下文进行动态校准来指导探索。此外,CERMIC生成具有理论依据的内在奖励,鼓励智能体探索信息增益高的状态转移。我们在VMAS、Meltingpot和SMACv2等基准测试套件上评估了CERMIC。实验结果表明,在稀疏奖励环境中,使用CERMIC进行探索显著优于SOTA算法。
论文及项目相关链接
Summary
本文探讨了在复杂多智能体强化学习(MARL)中自主探索的问题,提出一种新型框架CERMIC,通过动态校准内在好奇心和多智能体上下文,增强多智能体探索能力。该框架可生成理论上的内在奖励,鼓励智能体探索信息增益高的状态转换。在基准测试套件VMAS、Meltingpot和SMACv2上的实证结果表明,CERMIC在稀疏奖励环境中的探索性能显著优于现有算法。
Key Takeaways
- 自主探索在多智能体强化学习(MARL)中至关重要,尤其是在稀疏奖励环境下。
- 现有好奇心机制易混淆环境随机性与有意义的新颖性,且对所有意外观察一视同仁,导致探索效率不高。
- 框架CERMIC通过动态校准内在好奇心和多智能体上下文,有效过滤噪声惊喜信号,提高多智能体探索能力。
- CERMIC鼓励智能体探索信息增益高的状态转换,生成理论上的内在奖励。
- 在VMAS、Meltingpot和SMACv2等基准测试套件上,CERMIC的探索性能显著优于现有算法。
- CERMIC框架灵感来源于儿童通过观察同伴来适应性地校准自身探索行为的方式。
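
The calibration intuition can be caricatured in a few lines: an agent's raw surprise (its forward-model prediction error) is damped when its peers are just as surprised, since shared surprise is more likely to be environment noise, and kept when the surprise is private to this agent. The formula below is purely illustrative and is not CERMIC's theoretically grounded intrinsic reward.

```python
def calibrated_curiosity(own_error, peer_errors, eps=1e-6):
    """Toy intrinsic reward: own surprise weighted by how unusual it is relative to peers."""
    peer_mean = sum(peer_errors) / max(len(peer_errors), 1)
    # shared surprise -> low weight (likely environment noise); private surprise -> high weight
    weight = own_error / (own_error + peer_mean + eps)
    return weight * own_error

print(calibrated_curiosity(0.8, [0.7, 0.9, 0.8]))  # everyone surprised: likely noise, damped
print(calibrated_curiosity(0.8, [0.1, 0.0, 0.2]))  # only this agent surprised: kept as novelty
```
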
Tree of Agents: Improving Long-Context Capabilities of Large Language Models through Multi-Perspective Reasoning
Authors:Song Yu, Xiaofei Xu, Ke Deng, Li Li, Lin Tian
Large language models (LLMs) face persistent challenges when handling long-context tasks, most notably the lost in the middle issue, where information located in the middle of a long input tends to be underutilized. Some existing methods that reduce input have the risk of discarding key information, while others that extend context windows often lead to attention dispersion. To address these limitations, we propose Tree of Agents (TOA), a multi-agent reasoning framework that segments the input into chunks processed by independent agents. Each agent generates its local cognition, then agents dynamically exchange information for collaborative reasoning along tree-structured paths. TOA enables agents to probe different reasoning orders for multi-perspective understanding, effectively mitigating position bias and reducing hallucinations. To improve processing efficiency, we incorporate prefix-hash caching and adaptive pruning strategies, achieving significant performance improvements with comparable API overhead. Experiments show that TOA, powered by compact LLaMA3.1-8B, significantly outperforms multiple baselines and demonstrates comparable performance to the latest and much larger commercial models, such as Gemini1.5-pro, on various long-context tasks. Code is available at https://github.com/Aireduce952/Tree-of-Agents.
大规模语言模型(LLM)在处理长上下文任务时面临持久挑战,最显著的是中间丢失问题,长输入中位于中间的信息往往被利用不足。一些减少输入现有方法存在丢弃关键信息的风险,而其他扩展上下文窗口的方法则常常导致注意力分散。为了解决这些局限性,我们提出了多智能体推理框架Agent树(TOA),它将输入分割成由独立智能体处理的块。每个智能体生成其局部认知,然后智能体沿树状结构路径动态交换信息进行协同推理。TOA使智能体能够探究不同的推理顺序以实现多角度理解,有效地减轻位置偏见并减少幻觉。为了提高处理效率,我们结合了前缀哈希缓存和自适应修剪策略,在API开销相当的情况下实现了显著的性能提升。实验表明,使用紧凑的LLaMA 3.1-8B赋能的TOA显著优于多个基线,并在各种长上下文任务上展现出与最新的更大商业模型(如Gemini 1.5 pro)相当的性能。代码可通过https://github.com/Aireduce952/Tree-of-Agents获取。
论文及项目相关链接
PDF 19 pages, 5 figures
Summary
大型语言模型在处理长语境任务时面临挑战,如中间信息丢失问题。为解决此问题,我们提出多代理推理框架——代理树(TOA),它将输入分割成多个块,由独立代理进行处理。每个代理生成局部认知,然后动态交换信息以进行协作推理。TOA通过优化处理效率和性能提升策略,实现了显著的性能提升。实验证明,TOA在多种长语境任务上优于多个基线模型,并与最新的大型商业模型表现相当。
Key Takeaways
- 大型语言模型在处理长语境任务时面临挑战,尤其是中间信息丢失问题。
- 代理树(TOA)是一种多代理推理框架,旨在解决大型语言模型在处理长语境任务时的挑战。
- TOA将输入分割成块,由独立代理处理并生成局部认知,之后动态交换信息进行协作推理。
- TOA通过优化处理效率和性能提升策略,实现了显著的性能提升。
- TOA在多种长语境任务上的表现优于多个基线模型。
- TOA的性能与最新的大型商业模型相当。
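
Two pieces of the pipeline are simple enough to sketch: splitting the long input into chunks handled by per-chunk agents, and prefix-hash caching so that repeated (shared-prefix, prompt) pairs never trigger a second model call. The `cached_agent_call` body is a placeholder for the underlying LLM, and the tree-structured, multi-order information exchange is collapsed here into a single aggregation step.

```python
import hashlib
from functools import lru_cache

def chunk(text, size=180):
    return [text[i:i + size] for i in range(0, len(text), size)]

def prefix_hash(prefix):
    return hashlib.sha256(prefix.encode()).hexdigest()

@lru_cache(maxsize=None)
def cached_agent_call(prefix_digest, prompt):
    # placeholder LLM call keyed by (hash of the shared prefix, local prompt);
    # identical calls are served from the cache instead of re-querying the model
    return f"local cognition for: {prompt[:30]!r}"

def tree_of_agents(question, document):
    digest = prefix_hash(question)                 # the question is the shared prefix
    local_views = [cached_agent_call(digest, piece) for piece in chunk(document)]
    # the full method exchanges views along tree-structured paths and in several orders;
    # here the local views are simply joined into one final aggregation prompt
    return cached_agent_call(digest, " | ".join(local_views))

print(tree_of_agents("Where does the key event happen?", "long document ... " * 60))
print(cached_agent_call.cache_info())              # identical chunks are served from the cache
```
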
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning
Authors:Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin
Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.
在动态环境中协调多个实体代理仍然是人工智能的核心挑战,这需要感知驱动的推理和可扩展的合作策略。虽然最近的研究已经利用大型语言模型(LLM)进行多代理规划,但很少有人开始探索视觉语言模型(VLM)用于视觉推理。然而,这些基于VLM的方法在支持多种实体类型方面仍然存在局限性。在这项工作中,我们引入了VIKI-Bench,这是针对实体多代理合作定制的首个分层基准,包含三个结构化级别:代理激活、任务规划和轨迹感知。VIKI-Bench包括各种机器人实体、多视角视觉观察和结构化的监督信号,以评估基于视觉输入的推理。为了展示VIKI-Bench的实用性,我们提出了VIKI-R,这是一个两阶段框架,通过使用Chain-of-Thought注释演示对预训练的视觉语言模型(VLM)进行微调,然后在多级奖励信号下进行强化学习。我们的广泛实验表明,VIKI-R在所有任务级别上均显著优于基准方法。此外,我们还证明了强化学习能够使不同代理之间出现组合合作模式。总之,VIKI-Bench和VIKI-R为推进实体AI系统中的多代理视觉驱动合作提供了统一的测试平台和方法。
论文及项目相关链接
PDF Project page: https://faceong.github.io/VIKI-R/
Summary
本文介绍了在人工智能领域中,协调多个实体代理在动态环境中的工作是一项核心挑战,需要感知驱动的推理和可扩展的合作策略。尽管最近的工作已经利用大型语言模型进行多代理规划,但基于视觉语言模型的方法仍然有限支持多种实体类型。为此,本文引入了VIKI-Bench,这是一个为实体多代理合作量身定制的分层基准,包括代理激活、任务规划和轨迹感知三个结构化级别。此外,还提出了VIKI-R框架,通过使用思维链注释演示对预训练视觉语言模型进行微调,然后在多级奖励信号下进行强化学习。实验表明,VIKI-R在各级任务上的表现均优于基线方法,强化学习能够促进异构代理之间组合合作模式的出现。总体而言,VIKI-Bench和VIKI-R为推进实体AI系统的多代理视觉驱动合作提供了一个统一的测试平台和方法。
Key Takeaways
- 协调多个实体代理在动态环境中是人工智能的核心挑战,需要感知驱动的推理和合作策略。
- 最近的工作开始探索大型语言模型在人工智能多代理规划中的应用。
- 基于视觉语言模型的方法在支持多种实体类型方面存在局限性。
- VIKI-Bench是首个为实体多代理合作量身定制的分层基准,包括三个结构化级别:代理激活、任务规划和轨迹感知。
- VIKI-Bench包含多种机器人实体、多视角视觉观察和结构化监督信号,以评估基于视觉输入的推理。
- VIKI-R框架通过微调预训练的视觉语言模型,使用思维链注释演示和强化学习来提高多代理合作的效果。

Can Agents Fix Agent Issues?
Authors:Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou
LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .
基于LLM(大型预训练语言模型)的代理系统正在崭露头角,作为一种新的软件范式,在医学、机器人和编程等多元化领域得到了广泛应用。然而,维护这些系统需要大量的努力,因为它们不可避免地会出现漏洞并持续进化以适应不断变化的外部需求。因此,自动解决代理问题(例如故障报告或功能请求)是一项至关重要的任务且充满挑战。虽然最近的软件工程(SE)代理(如SWE-agent)在解决传统软件系统的问题方面显示出前景,但在解决现实世界中的代理系统问题方面,它们的有效性尚不清楚。这些代理系统与传统的软件有很大的不同。为了填补这一空白,我们首先手动分析了201个真实的代理问题,并确定了常见的代理问题类别。然后,我们花费了500个人工时构建了AGENTISSUE-BENCH,这是一个可复制的基准测试,包含50个代理问题解决方案任务(每个任务都有一个可执行的环境和触发失败的测试)。我们进一步评估了最先进的SE代理在AGENTISSUE-BENCH上的表现,并揭示了它们有限的解决能力(即解决率在3.33%-12.67%之间)。这些结果强调了与传统软件相比,维护代理系统所面临的独特挑战,并突出了需要进一步的研究和开发更先进的SE代理来解决代理问题的必要性。数据和代码可通过链接获取。
论文及项目相关链接
PDF Accepted by the 39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
Summary
基于LLM的代理系统作为一种新的软件范式正在兴起,广泛应用于医学、机器人和编程等领域。但维护这些系统需要耗费大量精力,因此自动解决代理问题(即错误报告或功能请求)至关重要。本研究手动分析了201个真实世界的代理问题,并构建了AGENTISSUE-BENCH基准,以评估现有SE代理解决代理问题的能力。评估结果显示现有SE代理的解决率较低(仅为3.33%-12.67%),突显出需要进一步研究开发更先进的SE代理,以应对代理系统相较传统软件的独特维护挑战。
Key Takeaways:
- LLM-based agent systems已成为新兴的软件范式,广泛应用于不同领域。
- 维护这些系统需要自动解决代理问题,这是一项重要而具有挑战性的任务。
- 通过手动分析真实世界的代理问题,研究者构建了AGENTISSUE-BENCH用以评估SE代理的能力。
- 在AGENTISSUE-BENCH上评估的现有SE代理的解决率较低(仅为3.33% - 12.67%)。
- 这突显出现有SE代理在解决代理系统问题时面临的挑战。
- 与传统软件相比,代理系统的维护存在独特挑战。
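
Each benchmark task ships an executable environment plus a failure-triggering test, so resolution can be checked mechanically: the test must fail before a candidate patch and pass after it. The sketch below shows that harness shape with `subprocess`; the paths, test command, and patch file name are illustrative, not the benchmark's actual layout.

```python
import os
import subprocess

def test_passes(repo_dir, test_cmd):
    """Run the task's failure-triggering test; True means the test suite passes."""
    result = subprocess.run(test_cmd, cwd=repo_dir, shell=True, capture_output=True)
    return result.returncode == 0

def apply_patch(repo_dir, patch_file):
    cmd = f"git apply {patch_file}"
    return subprocess.run(cmd, cwd=repo_dir, shell=True, capture_output=True).returncode == 0

def issue_resolved(repo_dir, test_cmd, patch_file):
    if test_passes(repo_dir, test_cmd):      # the issue must reproduce before patching
        return False
    if not apply_patch(repo_dir, patch_file):
        return False
    return test_passes(repo_dir, test_cmd)   # resolved only if the failing test now passes

# illustrative task list: (executable environment, failure-triggering test, agent's patch)
tasks = [("./task_001", "pytest tests/test_issue.py", "agent_patch_001.diff")]
runnable = [t for t in tasks if os.path.isdir(t[0])]
if runnable:
    rate = sum(issue_resolved(*t) for t in runnable) / len(runnable)
    print(f"resolution rate: {rate:.2%}")
```
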