发布日期: 2025-07-05

更新日期: 2025-07-17

文章字数: 11.3k

阅读时长: 46 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-07-05 更新

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Authors:Yuxuan Zhu, Tengjun Jin, Yada Pruksachatkun, Andy Zhang, Shu Liu, Sasha Cui, Sayash Kapoor, Shayne Longpre, Kevin Meng, Rebecca Weiss, Fazl Barez, Rahul Gupta, Jwala Dhamala, Jacob Merizian, Mario Giulianelli, Harry Coppock, Cozmin Ududec, Jasjeet Sekhon, Jacob Steinhardt, Antony Kellerman, Sarah Schwettmann, Matei Zaharia, Ion Stoica, Percy Liang, Daniel Kang

Benchmarks are essential for quantitatively tracking progress in AI. As AI agents become increasingly capable, researchers and practitioners have introduced agentic benchmarks to evaluate agents on complex, real-world tasks. These benchmarks typically measure agent capabilities by evaluating task outcomes via specific reward designs. However, we show that many agentic benchmarks have issues task setup or reward design. For example, SWE-bench Verified uses insufficient test cases, while TAU-bench counts empty responses as successful. Such issues can lead to under- or overestimation agents’ performance by up to 100% in relative terms. To make agentic evaluation rigorous, we introduce the Agentic Benchmark Checklist (ABC), a set of guidelines that we synthesized from our benchmark-building experience, a survey of best practices, and previously reported issues. When applied to CVE-Bench, a benchmark with a particularly complex evaluation design, ABC reduces the performance overestimation by 33%.

基准测试对于定量跟踪人工智能的进步至关重要。随着人工智能代理的能力越来越强，研究人员和实践者已经引入了代理基准测试来评估代理在复杂现实世界任务上的表现。这些基准测试通常通过特定的奖励设计来评估任务结果，从而衡量代理的能力。然而，我们表明许多代理基准测试在任务设置或奖励设计方面存在问题。例如，SWE-bench Verified使用的测试用例不足，而TAU-bench将空白的回答视为成功。这些问题会导致相对情况下高达100%的代理性能被低估或高估。为了使代理评估更加严格，我们引入了代理基准测试清单（ABC），这是一套我们从基准测试构建经验、最佳实践调查以及先前报告的问题中综合得出的准则。当应用于具有特别复杂评估设计的CVE-Bench时，ABC将性能高估的情况减少了33%。

论文及项目相关链接

PDF 39 pages, 15 tables, 6 figures

Summary：
随着人工智能代理越来越强大，代理基准测试已成为评估其在复杂现实世界任务上表现的关键工具。然而，当前许多代理基准测试在任务设置或奖励设计方面存在问题。为此，作者提出了Agentic Benchmark Checklist（ABC），以减少代理性能评估中的误差。

Key Takeaways：

基准测试是跟踪人工智能进步的关键。
代理基准测试用于评估人工智能代理在复杂现实世界任务上的表现。
当前许多代理基准测试在任务设置或奖励设计方面存在问题。
这些问题可能导致对代理性能的过度或低估。
Agentic Benchmark Checklist（ABC）是一组准则，旨在解决代理性能评估中的问题。
ABC通过减少性能评估中的误差，提高了代理性能评估的严谨性。

Cool Papers

点此查看论文截图

Multi-agent Auditory Scene Analysis

Authors:Caleb Rascon, Luis Gato-Diaz, Eduardo García-Alarcón

Auditory scene analysis (ASA) aims to retrieve information from the acoustic environment, by carrying out three main tasks: sound source location, separation, and classification. These tasks are traditionally executed with a linear data flow, where the sound sources are first located; then, using their location, each source is separated into its own audio stream; from each of which, information is extracted that is relevant to the application scenario (audio event detection, speaker identification, emotion classification, etc.). However, running these tasks linearly increases the overall response time, while making the last tasks (separation and classification) highly sensitive to errors of the first task (location). A considerable amount of effort and computational complexity has been employed in the state-of-the-art to develop techniques that are the least error-prone possible. However, doing so gives rise to an ASA system that is non-viable in many applications that require a small computational footprint and a low response time, such as bioacoustics, hearing-aid design, search and rescue, human-robot interaction, etc. To this effect, in this work, a multi-agent approach is proposed to carry out ASA where the tasks are run in parallel, with feedback loops between them to compensate for local errors, such as: using the quality of the separation output to correct the location error; and using the classification result to reduce the localization’s sensitivity towards interferences. The result is a multi-agent auditory scene analysis (MASA) system that is robust against local errors, without a considerable increase in complexity, and with a low response time. The complete proposed MASA system is provided as a framework that uses open-source tools for sound acquisition and reproduction (JACK) and inter-agent communication (ROS2), allowing users to add their own agents.

听觉场景分析（ASA）旨在通过执行三个主要任务：声源定位、分离和分类，从声学环境中检索信息。这些任务传统上是按线性数据流执行的，首先定位声源；然后，利用它们的位置，将每个声源分离成各自的音频流；从每个音频流中提取与应用场景相关的信息（音频事件检测、说话人识别、情感分类等）。然而，按线性方式运行这些任务会增加总体响应时间，同时使最后的任务（分离和分类）对第一个任务（定位）的错误高度敏感。最新的技术已经投入了大量的努力和计算复杂度，以开发尽可能不易出错的技巧。然而，这样做会导致一个在许多应用中不可行的ASA系统，这些应用需要较小的计算占用空间和较短的响应时间，例如生物声学、助听器设计、搜救、人机交互等。因此，在这项工作中，提出了一种多智能体方法进行ASA，其中任务并行运行，它们之间有反馈循环来弥补局部错误，例如：利用分离输出的质量来纠正定位错误；并利用分类结果来减少定位对干扰的敏感性。结果是一个多智能体听觉场景分析（MASA）系统，它对局部错误具有鲁棒性，不会大幅增加复杂度，并且具有较短的响应时间。所提供的完整MASA系统是一个框架，使用开源工具进行声音采集和回放（JACK）以及智能体之间的通信（ROS2），允许用户添加自己的智能体。

论文及项目相关链接

PDF Submitted to Applied Intelligence

Summary
多智能体听觉场景分析（MASA）采用并行执行任务的方式，通过反馈环路补偿局部错误，如利用分离质量校正定位误差和利用分类结果减少干扰对定位的影响。MASA系统对局部错误具有鲁棒性，且无需大幅增加复杂性并具备低响应时间。

Key Takeaways

听觉场景分析（ASA）主要从声学环境中提取信息，执行三个主要任务：声源定位、分离和分类。
传统线性执行这些任务会增加总体响应时间，并使后续任务对前面任务的错误更加敏感。
在此研究中，提出了一种多智能体听觉场景分析（MASA）方法，其中任务并行运行，并通过反馈环路补偿局部错误。
MASA系统利用分离质量校正定位误差，并使用分类结果减少干扰对定位的影响，对局部错误具有鲁棒性。
MASA系统无需大幅增加复杂性，并具备低响应时间。
提供的MASA系统框架使用开源工具进行声音采集和复制（JACK）以及智能体间通信（ROS2），允许用户添加自己的智能体。
MASA框架有助于在生物声学、助听器设计、搜救、人机交互等应用中实现快速且稳健的听觉场景分析。

Cool Papers

点此查看论文截图

Control at Stake: Evaluating the Security Landscape of LLM-Driven Email Agents

Authors:Jiangrong Wu, Yuhong Nan, Jianliang Wu, Zitong Yao, Zibin Zheng

The increasing capabilities of LLMs have led to the rapid proliferation of LLM agent apps, where developers enhance LLMs with access to external resources to support complex task execution. Among these, LLM email agent apps represent one of the widely used categories, as email remains a critical communication medium for users. LLM email agents are capable of managing and responding to email using LLM-driven reasoning and autonomously executing user instructions via external email APIs (e.g., send email). However, despite their growing deployment and utility, the security mechanism of LLM email agent apps remains underexplored. Currently, there is no comprehensive study into the potential security risk within these agent apps and their broader implications. In this paper, we conduct the first in-depth and systematic security study of LLM email agents. We propose the Email Agent Hijacking (EAH) attack, which overrides the original prompts of the email agent via external email resources, allowing attackers to gain control of the email agent remotely and further perform specific attack scenarios without user awareness. To facilitate the large-scale evaluation, we propose EAHawk, a pipeline to evaluate the EAH attack of LLM email agent apps. By EAHawk, we performed an empirical study spanning 14 representative LLM agent frameworks, 63 agent apps, 12 LLMs, and 20 email services, which led to the generation of 1,404 real-world email agent instances for evaluation. Experimental results indicate that all 1,404 instances were successfully hijacked; on average, only 2.03 attack attempts are required to control an email agent instance. Even worse, for some LLMs, the average number of attempts needed to achieve full agent control drops to as few as 1.23.

随着大型语言模型（LLM）能力的不断提升，LLM代理应用迅速普及。开发者通过访问外部资源来增强LLM的功能，以支持复杂的任务执行。其中，LLM电子邮件代理应用是广泛使用的类别之一，因为电子邮件仍然是用户重要的通信媒介。LLM电子邮件代理能够利用LLM驱动的推理来管理和回复电子邮件，并通过外部电子邮件API自主执行用户指令（例如发送电子邮件）。然而，尽管它们的部署和实用性不断增长，LLM电子邮件代理应用的安全机制仍然鲜有研究。目前，关于这些代理应用中的潜在安全风险及其更广泛影响还没有全面的研究。在本文中，我们对LLM电子邮件代理进行了首次深入和系统的安全研究。我们提出了电子邮件代理劫持（EAH）攻击，该攻击通过外部电子邮件资源覆盖电子邮件代理的原始提示，使攻击者能够远程控制电子邮件代理，并在用户不知情的情况下执行特定的攻击场景。为了促进大规模评估，我们提出了EAHawk，一个评估LLM电子邮件代理应用的EAH攻击的管道。通过EAHawk，我们对14个代表性的LLM代理框架、63个代理应用、12个LLM和20个电子邮件服务进行了实证研究，该研究生成了1404个真实的电子邮件代理实例进行评估。实验结果表明，所有1404个实例均成功被劫持；平均只需要2.03次攻击尝试即可控制一个电子邮件代理实例。更糟糕的是，对于某些LLM，实现完全控制代理所需的平均尝试次数甚至减少到仅1.23次。

论文及项目相关链接

PDF

Summary
大型语言模型（LLM）能力的增强推动了LLM代理应用的快速发展，尤其是其中的电子邮件代理应用，能通过访问外部资源支持复杂任务执行。然而，其安全性机制尚未得到足够的研究，存在潜在的安全风险。本文首次全面系统地研究了LLM电子邮件代理的安全性，并提出了电子邮件代理劫持（EAH）攻击方式，攻击者可通过外部电子邮件资源覆盖原始提示来远程控制电子邮件代理，并秘密执行特定攻击场景。为评估这种攻击方式，本文还提出了EAHawk管道，并实证研究了涵盖多个LLM代理框架、邮件服务和大型语言模型的电子邮件代理实例，结果显示所有实例均被成功劫持。

Key Takeaways

LLM代理应用发展迅速，尤其是电子邮件代理应用。
LLM电子邮件代理面临潜在的安全风险，尚未得到足够研究。
提出了Email Agent Hijacking（EAH）攻击方式，攻击者可远程控制电子邮件代理。
EAH攻击通过覆盖原始提示实现，使攻击者能执行特定场景。
为评估EAH攻击，提出了EAHawk管道。
实证研究表明所有研究的电子邮件代理实例均存在被劫持的风险。

Cool Papers

点此查看论文截图

Multi-Agent Reinforcement Learning for Dynamic Pricing in Supply Chains: Benchmarking Strategic Agent Behaviours under Realistically Simulated Market Conditions

Authors:Thomas Hazenberg, Yao Ma, Seyed Sahand Mohammadi Ziabari, Marijn van Rijswijk

This study investigates how Multi-Agent Reinforcement Learning (MARL) can improve dynamic pricing strategies in supply chains, particularly in contexts where traditional ERP systems rely on static, rule-based approaches that overlook strategic interactions among market actors. While recent research has applied reinforcement learning to pricing, most implementations remain single-agent and fail to model the interdependent nature of real-world supply chains. This study addresses that gap by evaluating the performance of three MARL algorithms: MADDPG, MADQN, and QMIX against static rule-based baselines, within a simulated environment informed by real e-commerce transaction data and a LightGBM demand prediction model. Results show that rule-based agents achieve near-perfect fairness (Jain’s Index: 0.9896) and the highest price stability (volatility: 0.024), but they fully lack competitive dynamics. Among MARL agents, MADQN exhibits the most aggressive pricing behaviour, with the highest volatility and the lowest fairness (0.5844). MADDPG provides a more balanced approach, supporting market competition (share volatility: 9.5 pp) while maintaining relatively high fairness (0.8819) and stable pricing. These findings suggest that MARL introduces emergent strategic behaviour not captured by static pricing rules and may inform future developments in dynamic pricing.

本研究探讨了多智能体强化学习（MARL）如何改进供应链中的动态定价策略，特别是在传统ERP系统依赖于静态、基于规则的方法，从而忽略了市场参与者之间的战略互动的背景下。尽管最近的研究已将强化学习应用于定价，但大多数实现仍然是单智能体的，并且未能模拟现实世界供应链之间的相互依存性质。本研究通过评估三种MARL算法（MADDPG、MADQN和QMIX）在模拟环境中的表现来解决这一差距，该环境受到真实电子商务交易数据和LightGBM需求预测模型的启发，并与静态规则基准进行比较。结果表明，基于规则的智能体实现了近乎完美的公平性（Jain指数：0.9896）和最高的价格稳定性（波动率：0.024），但它们完全缺乏竞争动力。在MARL智能体中，MADQN表现出最激进的定价行为，波动率最高，公平性最低（0.5844）。MADDPG采用了一种更平衡的方法，支持市场竞争（份额波动率：9.5个百分点），同时保持相对较高的公平性（0.8819）和稳定的价格。这些发现表明，MARL引入了静态定价规则无法捕获的突发战略行为，并可能为未来的动态定价发展提供依据。

论文及项目相关链接

PDF

Summary

本研究探讨了多智能体强化学习（MARL）如何改进供应链中的动态定价策略，特别是在传统ERP系统依赖静态、基于规则的方法而忽视市场参与者之间的战略互动的情况下。该研究通过评估三种MARL算法（MADDPG、MADQN和QMIX）在模拟环境中的性能，与基于静态规则的基准进行比较，发现MARL能够引入新兴的战略行为，为动态定价提供了更全面的视角。

Key Takeaways

本研究调查了Multi-Agent Reinforcement Learning（MARL）在供应链动态定价策略中的应用。
传统ERP系统主要依赖静态、基于规则的方法，无法应对市场参与者的战略互动问题。
与静态规则基准相比，MARL算法在模拟环境中表现出不同的定价行为。
MADQN展现出最具有攻击性的定价行为，而MADDPG则提供了一个更平衡的方法。
基于规则的系统在公平性和价格稳定性方面表现较好，但缺乏竞争动态性。
MARL引入的新兴战略行为未被静态定价规则捕捉，这有望为动态定价提供新的视角。

Cool Papers

点此查看论文截图

VRAgent-R1: Boosting Video Recommendation with MLLM-based Agents via Reinforcement Learning

Authors:Siran Chen, Boyu Chen, Chenyun Yu, Yuxiao Luo, Ouyang Yi, Lei Cheng, Chengxiang Zhuo, Zang Li, Yali Wang

Owing to powerful natural language processing and generative capabilities, large language model (LLM) agents have emerged as a promising solution for enhancing recommendation systems via user simulation. However, in the realm of video recommendation, existing studies predominantly resort to prompt-based simulation using frozen LLMs and encounter the intricate challenge of multimodal content understanding. This frequently results in suboptimal item modeling and user preference learning, thereby ultimately constraining recommendation performance. To address these challenges, we introduce VRAgent-R1, a novel agent-based paradigm that incorporates human-like intelligence in user simulation. Specifically, VRAgent-R1 comprises two distinct agents: the Item Perception (IP) Agent and the User Simulation (US) Agent, designed for interactive user-item modeling. Firstly, the IP Agent emulates human-like progressive thinking based on MLLMs, effectively capturing hidden recommendation semantics in videos. With a more comprehensive multimodal content understanding provided by the IP Agent, the video recommendation system is equipped to provide higher-quality candidate items. Subsequently, the US Agent refines the recommended video sets based on in-depth chain-of-thought (CoT) reasoning and achieves better alignment with real user preferences through reinforcement learning. Experimental results on a large-scale video recommendation benchmark have demonstrated the effectiveness of our proposed VRAgent-R1 method, e.g., the IP Agent achieves a 6.0% improvement in NDCG@10 on the MicroLens-100k dataset, while the US Agent shows approximately 45.0% higher accuracy in user decision simulation compared to state-of-the-art baselines.

由于强大的自然语言处理和生成能力，大型语言模型（LLM）代理的出现，已成为通过用户模拟增强推荐系统的一种有前途的解决方案。然而，在视频推荐领域，现有研究主要采取基于提示的模拟使用冻结的LLM，并面临多模态内容理解的复杂挑战。这经常导致物品建模和用户偏好学习不佳，从而最终限制推荐性能。为了解决这些挑战，我们引入了VRAgent-R1，这是一种结合人类智能进行用户模拟的新型基于代理的范式。具体来说，VRAgent-R1包括两个独特的代理：物品感知（IP）代理和用户模拟（US）代理，用于交互式用户物品建模。首先，IP代理基于MLLMs模拟人类渐进式思维，有效捕捉视频中的隐藏推荐语义。通过IP代理提供的更全面多模态内容理解，视频推荐系统能够提供更高质量的候选项目。其次，US代理基于深入的链式思维（CoT）推理细化推荐的视频集，并通过强化学习实现与真实用户偏好的更好对齐。在大规模视频推荐基准测试上的实验结果表明了我们提出的VRAgent-R1方法的有效性，例如IP代理在MicroLens-100k数据集上NDCG@10指标提升了6.0%，而US代理在用户决策模拟方面的准确率比最先进的基线提高了约45.0%。

论文及项目相关链接

PDF

Summary

基于强大的自然语言处理和生成能力，大型语言模型（LLM）代理通过用户模拟在推荐系统增强方面展现出巨大潜力。然而，在视频推荐领域，现有研究主要依赖于基于提示的模拟使用冻结的LLM，并面临多模态内容理解的复杂挑战。这常常导致次优的项目建模和用户偏好学习，从而最终限制推荐性能。为解决这些挑战，我们推出VRAgent-R1，一种融合人类智能的新型代理模拟方法。具体而言，VRAgent-R包括两个独立代理：用于交互式用户项目建模的项目感知（IP）代理和用户模拟（US）代理。首先，IP代理模拟基于MLLM的人类渐进式思考，有效捕捉视频中的隐藏推荐语义。通过IP代理提供的更全面的多模态内容理解，视频推荐系统能够提供更优质的项目。然后，US代理通过深度链思维（CoT）推理对推荐的视频集进行精炼，并通过强化学习实现与现实用户偏好的更好匹配。大型视频推荐基准测试结果显示VRAgent-R1方法的有效性，例如IP代理在MicroLens-100k数据集上NDCG@10指标提升了6.0%，而US代理在用户决策模拟方面的准确度比最新基线高出约45.0%。

Key Takeaways

大型语言模型（LLM）代理通过用户模拟增强了推荐系统性能。
视频推荐领域面临多模态内容理解的复杂挑战。
VRAgent-R1包括项目感知（IP）代理和用户模拟（US）代理两个独立代理，用于交互式用户项目建模。
IP代理能捕捉视频中的隐藏推荐语义，提高视频推荐质量。
US代理通过深度链思维（CoT）推理精炼推荐内容，并匹配现实用户偏好。
VRAgent-R1方法在大型视频推荐基准测试中表现出有效性。
IP代理和US代理分别实现了显著的性能提升。

Cool Papers

点此查看论文截图

WebSailor: Navigating Super-human Reasoning for Web Agent

Authors:Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou

Transcending human cognitive limitations represents a critical frontier in LLM training. Proprietary agentic systems like DeepResearch have demonstrated superhuman capabilities on extremely complex information-seeking benchmarks such as BrowseComp, a feat previously unattainable. We posit that their success hinges on a sophisticated reasoning pattern absent in open-source models: the ability to systematically reduce extreme uncertainty when navigating vast information landscapes. Based on this insight, we introduce WebSailor, a complete post-training methodology designed to instill this crucial capability. Our approach involves generating novel, high-uncertainty tasks through structured sampling and information obfuscation, RFT cold start, and an efficient agentic RL training algorithm, Duplicating Sampling Policy Optimization (DUPO). With this integrated pipeline, WebSailor significantly outperforms all opensource agents in complex information-seeking tasks, matching proprietary agents’ performance and closing the capability gap.

在LLM训练中，超越人类认知局限是一个关键的前沿领域。DeepResearch等专有代理系统已经在极其复杂的信息检索基准测试（如BrowseComp）上展示了超人能力，这是一个以前无法达到的成就。我们认为，它们的成功依赖于开源模型中不存在的复杂推理模式：在浏览广阔的信息景观时系统地减少极端不确定性的能力。基于这一见解，我们引入了WebSailor，这是一种完整的后训练方法论，旨在培养这种关键能力。我们的方法包括通过结构化采样和信息模糊、RFT冷启动以及高效的代理强化学习训练算法Duplicating Sampling Policy Optimization（DUPO）来生成具有新型和高不确定性的任务。通过这个综合管道，WebSailor在复杂的信息检索任务中大大超越了所有开源代理，达到了专有代理的性能水平，并缩小了能力差距。

论文及项目相关链接

PDF

Summary：超越人类认知限制是大型语言模型训练的关键前沿领域。DeepResearch等专有agentic系统已在极为复杂的信息搜索基准测试（如BrowseComp）中展现出超人类能力，这是一个以前无法达成的里程碑。我们认为，其成功关键在于拥有一种开源模型中缺失的复杂推理模式，即在导航大规模信息景观时系统地减少极端不确定性的能力。基于此见解，我们推出了WebSailor，这是一种完整的后训练方法论，旨在培养这种关键能力。通过结构化采样和信息模糊化生成新型高不确定性任务、借助RFT冷启动和一个高效的agentic RL训练算法DUPO，WebSailor显著超越了所有开源代理在复杂信息搜索任务中的表现，达到了专有代理的性能水平并缩小了能力差距。

Key Takeaways：

专有agentic系统如DeepResearch在复杂信息搜索任务中展现出超人类能力。
成功关键在于处理极端不确定性，这是开源模型中缺失的复杂推理模式。
WebSailor是一种后训练方法论，旨在培养在导航大规模信息景观时减少不确定性的能力。
WebSailor通过生成高不确定性任务、利用RFT冷启动和DUPO训练算法来显著提高性能。
WebSailor显著超越了开源代理在复杂信息搜索任务中的表现。
WebSailor达到了专有代理的性能水平，缩小了能力差距。

Cool Papers

点此查看论文截图

AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench

Authors:Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach

AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents’ performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.

人工智能研究代理通过自动化机器学习模型的设计、实施和培训，显示出加速科学进步的巨大潜力。我们关注在提高代理在MLE-bench上的性能的方法，MLE-bench是一个具有挑战性的基准测试，代理在这里参加Kaggle竞赛，解决现实世界中的机器学习问题。我们将人工智能研究代理形式化为搜索策略，这些策略在候选解决方案空间中导航，并使用操作符进行迭代修改。通过设计和系统地改变不同的操作符集和搜索策略（贪心、MCTS、进化），我们表明它们的相互作用对于实现高性能至关重要。我们搜索策略和操作符集的最佳配对在MLE-bench lite上达到了最新结果，将获得Kaggle奖牌的成功率从39.6%提高到47.7%。我们的调查强调了联合考虑搜索策略、操作员设计和评估方法在推进自动化机器学习中的重要性。

Summary

AI研究代理在自动化机器学习模型的设计、实现和训练方面展现出巨大潜力，加速科学进步。本文关注在MLE-bench基准测试上提高代理性能的方法，通过搜索策略和运算符集的设计和优化，实现状态优异的结果，提高在Kaggle竞赛中获得奖牌的成功率。本文强调了联合考虑搜索策略、运算符设计和评估方法对推进自动化机器学习的重要性。

Key Takeaways

AI研究代理具备自动化设计、实现和训练机器学习模型的潜力，显著加速科学进步。
MLE-bench作为挑战基准测试，代理在其中参与Kaggle竞赛以解决现实世界机器学习问题。
通过搜索策略和运算符集的设计和系统性变化，发现其互动对实现高性能至关重要。
最佳搜索策略和运算符组合在MLE-bench lite上实现前沿结果，提高获得Kaggle奖牌的成功率。
强调在推进自动化机器学习过程中，需联合考虑搜索策略、运算符设计和评估方法。
搜索策略的选择对代理性能具有重要影响。

Cool Papers

点此查看论文截图

OMS: On-the-fly, Multi-Objective, Self-Reflective Ad Keyword Generation via LLM Agent

Authors:Bowen Chen, Zhao Wang, Shingo Takamatsu

Keyword decision in Sponsored Search Advertising is critical to the success of ad campaigns. While LLM-based methods offer automated keyword generation, they face three major limitations: reliance on large-scale query-keyword pair data, lack of online multi-objective performance monitoring and optimization, and weak quality control in keyword selection. These issues hinder the agentic use of LLMs in fully automating keyword decisions by monitoring and reasoning over key performance indicators such as impressions, clicks, conversions, and CTA effectiveness. To overcome these challenges, we propose OMS, a keyword generation framework that is On-the-fly (requires no training data, monitors online performance, and adapts accordingly), Multi-objective (employs agentic reasoning to optimize keywords based on multiple performance metrics), and Self-reflective (agentically evaluates keyword quality). Experiments on benchmarks and real-world ad campaigns show that OMS outperforms existing methods; ablation and human evaluations confirm the effectiveness of each component and the quality of generated keywords.

关键词决策在赞助搜索广告中对于广告活动的成功至关重要。虽然基于大型语言模型（LLM）的方法可以实现自动化关键词生成，但它们面临三大局限性：依赖大规模查询关键词对数据、缺乏在线多目标性能监控和优化，以及关键词选择的质量控制较弱。这些问题阻碍了通过监测和推理关键绩效指标（如展示次数、点击次数、转化次数和呼叫行动的有效性）来全自动地进行关键词决策的大型语言模型的使用。为了克服这些挑战，我们提出了OMS，一个关键词生成框架，它具有即时性（无需训练数据，监视在线性能并相应地进行调整），多目标（采用智能推理来优化基于多个性能指标的关键词），以及自我反思性（智能评估关键词质量）。在基准测试和真实世界广告活动上的实验表明，OMS优于现有方法；废除实验和人类评估证实了每个组件的有效性和生成的关键词的质量。

论文及项目相关链接

PDF

Summary

关键词决策在赞助搜索广告中对于广告活动成功至关重要。虽然基于大型语言模型（LLM）的方法提供了自动化关键词生成，但它们存在三大局限性：依赖大规模查询关键词对数据、缺乏在线多目标性能监控和优化，以及关键词选择的质量控制较弱。这些问题阻碍了语言模型在通过监测和推理关键性能指标（如展示次数、点击数、转化数和呼叫行动有效性）来完全自动化关键词决策方面的应用。为了克服这些挑战，我们提出了OMS关键词生成框架，它具备即时性（无需训练数据、监测在线性能并相应地进行调整）、多目标（通过智能体推理优化关键词，基于多个性能指标）和自省性（智能体评估关键词质量）。在基准测试和真实世界广告活动上的实验表明，OMS优于现有方法；废除研究及人类评估均证实了各组成部分的有效性和生成的关键词质量。

Key Takeaways

关键词决策在赞助搜索广告中的重要性。
基于大型语言模型的自动化关键词生成方法存在三大局限性：依赖大规模查询关键词对数据、缺乏在线多目标性能监控和优化，以及关键词选择的质量控制较弱。
OMS框架具备即时性、多目标和自省性特点，能有效克服上述挑战。
OMS框架通过监测和推理关键性能指标（如展示次数、点击数、转化数和呼叫行动有效性）来优化关键词决策。
实验结果表明，OMS框架在基准测试和真实世界广告活动上的表现优于现有方法。
废除研究证实了OMS框架各组成部分的有效性。

Cool Papers

点此查看论文截图

Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge

Authors:Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su

Agentic search such as Deep Research systems-where agents autonomously browse the web, synthesize information, and return comprehensive citation-backed answers-represents a major shift in how users interact with web-scale information. While promising greater efficiency and cognitive offloading, the growing complexity and open-endedness of agentic search have outpaced existing evaluation benchmarks and methodologies, which largely assume short search horizons and static answers. In this paper, we introduce Mind2Web 2, a benchmark of 130 realistic, high-quality, and long-horizon tasks that require real-time web browsing and extensive information synthesis, constructed with over 1000 hours of human labor. To address the challenge of evaluating time-varying and complex answers, we propose a novel Agent-as-a-Judge framework. Our method constructs task-specific judge agents based on a tree-structured rubric design to automatically assess both answer correctness and source attribution. We conduct a comprehensive evaluation of ten frontier agentic search systems and human performance, along with a detailed error analysis to draw insights for future development. The best-performing system, OpenAI Deep Research, can already achieve 50-70% of human performance while spending half the time, highlighting its great potential. Altogether, Mind2Web 2 provides a rigorous foundation for developing and benchmarking the next generation of agentic search systems.

智能搜索，如Deep Research系统——其中的智能体自主浏览网络、合成信息并返回有引用的全面答案——代表了用户与网页规模信息交互方式的一次重大转变。尽管智能搜索带来了更高的效率和认知减负，但其日益增长的复杂性和开放性已经超越了现有的评估基准和方法论，这些方法大多假设搜索视野较短且答案静态。在本文中，我们介绍了Mind2Web 2，这是一个包含130个现实、高质量、长期视野任务的基准测试，这些任务需要进行实时网络浏览和广泛的信息综合，通过超过1000小时的人力构建而成。为了解决评估随时间变化和复杂答案的挑战，我们提出了新颖的“智能体作为法官”框架。我们的方法基于树状评分设计构建特定任务的法官智能体，以自动评估答案的正确性和来源归属。我们对十个前沿的智能体搜索系统和人类性能进行了全面评估，并进行了详细的错误分析，以获取未来发展的见解。表现最佳的OpenAI Deep Research系统已经达到了人类性能的50%~70%，并且用时只有一半，这突显了其巨大潜力。总的来说，Mind2Web 2为下一代智能搜索系统的开发和评估提供了严格的基础。

Summary

本文介绍了Deep Research等自主搜索系统如何通过网络自主浏览、信息合成和提供全面的引证答案，代表了用户与网络规模信息交互的重大转变。然而，随着自主搜索的复杂性和开放性不断增长，现有的评估基准和方法已无法满足需求。为此，本文引入了Mind2Web 2基准测试，包含130个需要实时网络浏览和大量信息合成的长期任务。同时提出了Agent-as-a-Judge评估框架来自动评估答案的正确性和来源归属。评估结果显示，前沿的自主搜索系统已展现出巨大潜力。

Key Takeaways

自主搜索系统如Deep Research正改变用户与网络信息的交互方式，通过自主浏览、信息合成和引证答案提供全面回应。
现有评估基准和方法无法满足自主搜索日益增长的复杂性和开放性需求。
Mind2Web 2基准测试包含130个真实、高质量、长期的任务，需实时网络浏览和大量信息合成。
提出了Agent-as-a-Judge评估框架，可自动评估答案的正确性和来源归属。
评估结果显示前沿自主搜索系统已接近人类表现的一半，展现出巨大潜力。
Mind2Web 2基准为下一代自主搜索系统的发展和评估提供了严格基础。

Cool Papers

点此查看论文截图

From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents

Authors:Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu

Information retrieval is a cornerstone of modern knowledge acquisition, enabling billions of queries each day across diverse domains. However, traditional keyword-based search engines are increasingly inadequate for handling complex, multi-step information needs. Our position is that Large Language Models (LLMs), endowed with reasoning and agentic capabilities, are ushering in a new paradigm termed Agentic Deep Research. These systems transcend conventional information search techniques by tightly integrating autonomous reasoning, iterative retrieval, and information synthesis into a dynamic feedback loop. We trace the evolution from static web search to interactive, agent-based systems that plan, explore, and learn. We also introduce a test-time scaling law to formalize the impact of computational depth on reasoning and search. Supported by benchmark results and the rise of open-source implementations, we demonstrate that Agentic Deep Research not only significantly outperforms existing approaches, but is also poised to become the dominant paradigm for future information seeking. All the related resources, including industry products, research papers, benchmark datasets, and open-source implementations, are collected for the community in https://github.com/DavidZWZ/Awesome-Deep-Research.

信息检索是现代知识获取的核心基石，每天能在各种领域处理数十亿的查询。然而，传统的基于关键词的搜索引擎在处理复杂、多步骤的信息需求时越来越力不从心。我们的观点是，大型语言模型（LLM）具备推理和代理能力，正在推动一种新的名为代理深度研究（Agentic Deep Research）的范式。这些系统通过紧密集成自主推理、迭代检索和信息合成到一个动态反馈环中，从而超越了传统信息搜索技术。我们追踪了从静态网页搜索到基于交互、代理系统的演变，这些系统可以计划、探索和学习的历程。我们还介绍了一个测试时标度定律，以正式计算深度对推理和搜索的影响。在基准测试结果的支持下，以及开源实现的兴起，我们证明了代理深度研究不仅显著优于现有方法，而且还将成为未来信息搜索的主导范式。所有相关资源，包括工业产品、研究论文、基准数据集和开源实现，都收集在https://github.com/DavidZWZ/Awesome-Deep-Research，以供社区使用。

论文及项目相关链接

PDF

Summary
在信息检索领域，传统基于关键词的搜索引擎在处理复杂、多步骤的信息需求时越来越显得不足。大型语言模型（LLMs）的出现，带来了名为Agentic Deep Research的新范式，通过紧密集成自主推理、迭代检索和信息合成，超越了传统信息搜索技术，进入了一个动态反馈循环。Agentic Deep Research不仅显著优于现有方法，而且将成为未来信息搜索的主导范式。

Key Takeaways

传统搜索引擎在处理复杂、多步骤信息需求时存在不足。
大型语言模型（LLMs）具备推理和代理能力，为信息检索带来了新的突破。
Agentic Deep Research超越了传统信息搜索技术，集成了自主推理、迭代检索和信息合成。
Agentic Deep Research具有动态反馈循环的特点，能进行计划、探索和学习的互动。
Agentic Deep Research显著优于现有方法，将成为未来信息搜索的主导范式。
测试时间缩放律被用来正式化计算深度对推理和搜索的影响。

Cool Papers

点此查看论文截图

MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration

Authors:Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang

In healthcare intelligence, the ability to fuse heterogeneous, multi-intent information from diverse clinical sources is fundamental to building reliable decision-making systems. Large Language Model (LLM)-driven information interaction systems currently showing potential promise in the healthcare domain. Nevertheless, they often suffer from information redundancy and coupling when dealing with complex medical intents, leading to severe hallucinations and performance bottlenecks. To this end, we propose MedAide, an LLM-based medical multi-agent collaboration framework designed to enable intent-aware information fusion and coordinated reasoning across specialized healthcare domains. Specifically, we introduce a regularization-guided module that combines syntactic constraints with retrieval augmented generation to decompose complex queries into structured representations, facilitating fine-grained clinical information fusion and intent resolution. Additionally, a dynamic intent prototype matching module is proposed to utilize dynamic prototype representation with a semantic similarity matching mechanism to achieve adaptive recognition and updating of the agent’s intent in multi-round healthcare dialogues. Ultimately, we design a rotation agent collaboration mechanism that introduces dynamic role rotation and decision-level information fusion across specialized medical agents. Extensive experiments are conducted on four medical benchmarks with composite intents. Experimental results from automated metrics and expert doctor evaluations show that MedAide outperforms current LLMs and improves their medical proficiency and strategic reasoning.

在医疗情报领域，从各种临床来源融合异质、多意图信息的能力对于构建可靠的决策系统至关重要。大型语言模型（LLM）驱动的信息交互系统在医疗领域目前显示出巨大的潜力。然而，它们在处理复杂的医疗意图时，经常面临信息冗余和耦合的问题，导致严重的幻觉和性能瓶颈。为此，我们提出了MedAide，这是一个基于LLM的医疗多智能体协作框架，旨在实现跨专业医疗领域的意图感知信息融合和协同推理。具体来说，我们引入了一个正则化引导模块，该模块结合了句法约束和检索增强生成，将复杂查询分解为结构化表示，促进精细粒度的临床信息融合和意图解析。此外，还提出了一个动态意图原型匹配模块，利用动态原型表示和语义相似性匹配机制，实现多轮医疗对话中智能体意图的自适应识别和更新。最后，我们设计了一种轮换智能体协作机制，引入专业医疗智能体之间的动态角色轮换和决策级信息融合。在四个具有复合意图的医疗基准测试上进行了大量实验。自动度量指标和专家医生评估的实验结果表明，MedAide优于当前的大型语言模型，提高了其医疗能力和战略推理能力。

论文及项目相关链接

PDF LLM-based Multi-Agent Collaboration for Medical Applications

Summary

基于医疗健康领域决策系统的需求，融合多种意图和不同来源的临床数据至关重要。大型语言模型在信息交互系统中展现潜力，但仍面临信息冗余、耦合及复杂医疗意图处理难题。本文提出MedAide框架，利用多智能体协作实现意图感知的信息融合和跨专业领域的协同推理。通过结合句法约束和检索增强生成的正则化引导模块，分解复杂查询为结构化表示，促进精细临床信息融合和意图解析。同时，利用动态原型表示和语义相似性匹配机制的动态意图原型匹配模块，实现多轮医疗对话中智能体的自适应识别和更新。通过设计旋转智能体协作机制，引入专业医疗智能体的动态角色轮换和决策级信息融合。在四个医疗基准测试上的实验表明，MedAide优于当前的大型语言模型，提高了医疗能力和策略推理。

Key Takeaways