发布日期: 2025-04-17

更新日期: 2025-05-14

文章字数: 11k

阅读时长: 44 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-04-17 更新

The Obvious Invisible Threat: LLM-Powered GUI Agents’ Vulnerability to Fine-Print Injections

Authors:Chaoran Chen, Zhiping Zhang, Bingcan Guo, Shang Ma, Ibrahim Khalilov, Simret A Gebreegziabher, Yanfang Ye, Ziang Xiao, Yaxing Yao, Tianshi Li, Toby Jia-Jun Li

A Large Language Model (LLM) powered GUI agent is a specialized autonomous system that performs tasks on the user’s behalf according to high-level instructions. It does so by perceiving and interpreting the graphical user interfaces (GUIs) of relevant apps, often visually, inferring necessary sequences of actions, and then interacting with GUIs by executing the actions such as clicking, typing, and tapping. To complete real-world tasks, such as filling forms or booking services, GUI agents often need to process and act on sensitive user data. However, this autonomy introduces new privacy and security risks. Adversaries can inject malicious content into the GUIs that alters agent behaviors or induces unintended disclosures of private information. These attacks often exploit the discrepancy between visual saliency for agents and human users, or the agent’s limited ability to detect violations of contextual integrity in task automation. In this paper, we characterized six types of such attacks, and conducted an experimental study to test these attacks with six state-of-the-art GUI agents, 234 adversarial webpages, and 39 human participants. Our findings suggest that GUI agents are highly vulnerable, particularly to contextually embedded threats. Moreover, human users are also susceptible to many of these attacks, indicating that simple human oversight may not reliably prevent failures. This misalignment highlights the need for privacy-aware agent design. We propose practical defense strategies to inform the development of safer and more reliable GUI agents.

大型语言模型驱动的GUI代理是一种特殊化的自主系统，能够根据高级指令代表用户执行任务。它通过感知和解释相关应用程序的图形用户界面（GUI），通常是通过视觉方式，推断出必要的行动序列，然后执行点击、键入和触摸等GUI交互动作。为了完成现实世界中的任务，如填写表格或预订服务，GUI代理通常需要处理和操作敏感的用户数据。然而，这种自主性带来了新的隐私和安全风险。攻击者可以在GUI中注入恶意内容，改变代理行为或诱导意外泄露私人信息。这些攻击往往利用代理和人类用户之间视觉显著性的差异，或代理在任务自动化中检测上下文完整性违规的有限能力。在本文中，我们描述了六种此类攻击，并通过实验研究了这些攻击与六种最先进的GUI代理、234个对抗性网页和39名人类参与者的测试。我们的研究结果表明，GUI代理非常容易受到攻击，特别是针对上下文嵌入的威胁。此外，人类用户也易受这些攻击的侵害，表明简单的人为监督可能无法可靠地防止失败。这种错位突显了需要设计具有隐私意识的代理。我们提出实用的防御策略，以推动开发更安全、更可靠的GUI代理。

论文及项目相关链接

PDF

Summary

本文介绍了一种基于大型语言模型（LLM）的GUI代理，它能根据高级指令自主完成任务。它通过感知和解读相关应用的图形用户界面（GUI），推断出必要的操作序列，然后执行点击、输入等操作与GUI进行交互。然而，这种自主性带来了新的隐私和安全风险。对手可以通过注入恶意内容改变代理行为或导致意外泄露私人信息。本文描述了六种攻击类型，并通过实验测试了六种先进的GUI代理、234个对抗性网页和39名人类参与者。研究结果表明，GUI代理很容易受到上下文嵌入威胁的攻击，人类用户也易受这些攻击的影响，简单的人类监督可能无法可靠地防止失败。这强调了需要设计具有隐私意识的代理。

Key Takeaways

GUI代理是一种可以自主完成任务的系统，通过感知和解读相关应用的图形用户界面（GUI）来执行操作。
GUI代理在处理敏感用户数据和完成现实世界任务时存在隐私和安全风险。
对手可以通过注入恶意内容改变代理行为或导致意外泄露私人信息。
本文描述了六种针对GUI代理的攻击类型，并进行了实验验证。
GUI代理容易受到上下文嵌入威胁的攻击，人类用户也易受这些攻击的影响。
简单的人类监督可能无法可靠地防止GUI代理的失败。

Cool Papers

点此查看论文截图

Dopamine Audiobook: A Training-free MLLM Agent for Emotional and Human-like Audiobook Generation

Authors:Yan Rong, Shan Yang, Guangzhi Lei, Li Liu

Audiobook generation, which creates vivid and emotion-rich audio works, faces challenges in conveying complex emotions, achieving human-like qualities, and aligning evaluations with human preferences. Existing text-to-speech (TTS) methods are often limited to specific scenarios, struggle with emotional transitions, and lack automatic human-aligned evaluation benchmarks, instead relying on either misaligned automated metrics or costly human assessments. To address these issues, we propose Dopamine Audiobook, a new unified training-free system leveraging a multimodal large language model (MLLM) as an AI agent for emotional and human-like audiobook generation and evaluation. Specifically, we first design a flow-based emotion-enhanced framework that decomposes complex emotional speech synthesis into controllable sub-tasks. Then, we propose an adaptive model selection module that dynamically selects the most suitable TTS methods from a set of existing state-of-the-art (SOTA) TTS methods for diverse scenarios. We further enhance emotional expressiveness through paralinguistic augmentation and prosody retrieval at word and utterance levels. For evaluation, we propose a novel GPT-based evaluation framework incorporating self-critique, perspective-taking, and psychological MagicEmo prompts to ensure human-aligned and self-aligned assessments. Experiments show that our method generates long speech with superior emotional expression to SOTA TTS models in various metrics. Importantly, our evaluation framework demonstrates better alignment with human preferences and transferability across audio tasks. Project website with audio samples can be found at https://dopamine-audiobook.github.io.

有声书生成技术能创造出生动且情感丰富的音频作品，但在传达复杂情感、实现人类特质以及与人类偏好对齐的评估方面面临挑战。现有的文本到语音（TTS）方法通常局限于特定场景，在情感转换方面表现挣扎，并且缺乏自动与人类对齐的评估基准，而是依赖于错位自动化指标或成本高昂的人工评估。为了解决这个问题，我们提出了多巴胺有声书（Dopamine Audiobook），这是一个新的统一且无需训练的系统，利用多模态大型语言模型（MLLM）作为AI代理人，进行情感化和人性化的有声书生成和评估。具体来说，我们首先设计了一个基于流的情感增强框架，将复杂的情感语音合成分解成可控的子任务。然后，我们提出了一个自适应模型选择模块，该模块能够从一组最先进的TTS方法中动态选择最适合特定场景的TTS方法。我们进一步通过词语和句子的语言辅助增强和语调检索来增强情感表现力。在评估方面，我们提出了一个基于GPT的评估框架，该框架结合了自我批评、换位思考和心理MagicEmo提示，以确保与人类和自我评价的对齐。实验表明，我们的方法在多种指标上生成了具有卓越情感表达能力的长语音，超过了最先进的TTS模型。重要的是，我们的评估框架在展示与人类偏好的对齐性和跨音频任务的迁移性方面表现出色。项目网站及音频样本可访问：https://dopamine-audiobook.github.io。

论文及项目相关链接

PDF

Summary

本文介绍了针对有声书生成面临的挑战，提出了一种新的无需训练的统一系统——多巴胺有声书。该系统利用多模态大型语言模型作为AI代理，实现情感化和人性化的有声书生成与评价。通过设计基于流的情感增强框架和自适应模型选择模块，实现了复杂情感语音合成的可控子任务分解和灵活选择最适合的文本转语音方法。同时，通过旁语增强和语调检索增强情感表达。评价方面，提出了基于GPT的评价框架，确保与人类偏好和自我对齐的评估。实验表明，该方法在多种指标上优于现有文本转语音模型，评价框架更符合人类偏好并具有跨音频任务的迁移性。

Key Takeaways

多巴胺有声书系统是一个无需训练的统一系统，用于有声书生成和评价。
利用多模态大型语言模型实现情感化和人性化的有声书生成。
通过设计基于流的情感增强框架，实现复杂情感语音合成的可控子任务分解。
自适应模型选择模块可灵活选择最适合的文本转语音方法。
通过旁语增强和语调检索增强情感表达。
提出的基于GPT的评价框架能确保与人类偏好和自我对齐的评估。
实验表明，该方法在多种指标上优于现有模型，评价框架具有跨音频任务的迁移性。

Cool Papers

点此查看论文截图

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Authors:Xiaobo Xia, Run Luo

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose \name, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, \name achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

现有构建图形用户界面（GUI）代理的工作大多依赖于在大型视觉语言模型（LVLMs）上进行的监督微调训练范式。然而，这种方法不仅需求大量的训练数据，而且在理解GUI截图和泛化到未见过的界面上也面临困难。这一问题极大地限制了其在现实场景中的应用，尤其是高级任务。

论文及项目相关链接

PDF

摘要
基于统一动作空间规则建模的强化学习框架用于提升图形用户界面任务的性能表现。此框架以精细调教的少量高质量数据为基础，通过策略优化算法如集团相对策略优化（GRPO）更新模型，大幅提升了大型视觉语言模型在真实场景下的表现。相较于传统方法，该框架仅使用极少的数据量（仅千分之三）即可实现卓越性能，并在三个不同平台上的八项基准测试中超越现有技术。这充分证明了强化学习在提升大型视觉语言模型执行真实场景下的图形用户界面任务能力方面的巨大潜力。

关键见解

强化学习框架被应用于提升图形用户界面任务的表现能力，特别是对大型视觉语言模型的增强能力效果显著。该框架依赖于策略优化算法来提高性能表现。
强化学习框架主要使用了少数经过精细选择的高质量数据集来优化模型表现，覆盖了多平台包括Windows、Linux、MacOS、Android以及Web等不同系统平台。这些有针对性的高质量数据帮助提升了模型对不同平台和不同环境的适应能力。

Cool Papers

点此查看论文截图

Breaking the Data Barrier – Building GUI Agents Through Task Generalization

Authors:Junlei Zhang, Zichen Ding, Chang Ma, Zijie Chen, Qiushi Sun, Zhenzhong Lan, Junxian He

Graphical User Interface (GUI) agents offer cross-platform solutions for automating complex digital tasks, with significant potential to transform productivity workflows. However, their performance is often constrained by the scarcity of high-quality trajectory data. To address this limitation, we propose training Vision Language Models (VLMs) on data-rich, reasoning-intensive tasks during a dedicated mid-training stage, and then examine how incorporating these tasks facilitates generalization to GUI planning scenarios. Specifically, we explore a range of tasks with readily available instruction-tuning data, including GUI perception, multimodal reasoning, and textual reasoning. Through extensive experiments across 11 mid-training tasks, we demonstrate that: (1) Task generalization proves highly effective, yielding substantial improvements across most settings. For instance, multimodal mathematical reasoning enhances performance on AndroidWorld by an absolute 6.3%. Remarkably, text-only mathematical data significantly boosts GUI web agent performance, achieving a 5.6% improvement on WebArena and 5.4% improvement on AndroidWorld, underscoring notable cross-modal generalization from text-based to visual domains; (2) Contrary to prior assumptions, GUI perception data - previously considered closely aligned with GUI agent tasks and widely utilized for training - has a comparatively limited impact on final performance; (3) Building on these insights, we identify the most effective mid-training tasks and curate optimized mixture datasets, resulting in absolute performance gains of 8.0% on WebArena and 12.2% on AndroidWorld. Our work provides valuable insights into cross-domain knowledge transfer for GUI agents and offers a practical approach to addressing data scarcity challenges in this emerging field. The code, data and models will be available at https://github.com/hkust-nlp/GUIMid.

图形用户界面（GUI）代理提供了跨平台的自动化复杂数字任务解决方案，具有改变生产力工作流程的巨大潜力。然而，它们的表现往往受到高质量轨迹数据稀缺的限制。为了解决这个问题，我们建议在专门的中间训练阶段，对数据丰富、推理密集的任务进行视觉语言模型（VLM）的训练，然后研究如何将这些任务纳入以促进对GUI规划场景的泛化。具体来说，我们探索了一系列具有可获取指令调整数据的任务，包括GUI感知、多模态推理和文本推理。通过对11个中间训练任务的广泛实验，我们证明：（1）任务泛化证明非常有效，在大多数设置中都取得了实质性的改进。例如，多模态数学推理在AndroidWorld上的性能提高了绝对6.3%。值得注意的是，仅文本的数学数据显著提高了GUI网页代理的性能，在WebArena上提高了5.6%，在AndroidWorld上提高了5.4%，突显了从文本到视觉领域的跨模态泛化的重要性；（2）与先前的假设相反，之前被认为与GUI代理任务紧密相关并广泛用于训练的GUI感知数据对最终性能的影响相对有限；（3）基于这些见解，我们确定了最有效的中间训练任务并优化了混合数据集，从而在WebArena上实现了8.0%的绝对性能提升，在AndroidWorld上实现了12.2%的提升。我们的工作为GUI代理的跨域知识迁移提供了有价值的见解，并为解决这一新兴领域的数据稀缺挑战提供了实用方法。代码、数据和模型将在https://github.com/hkust-nlp/GUIMid上提供。

论文及项目相关链接

PDF 24 pages, 11 figures

Summary

GUI代理的跨平台解决方案为自动化复杂数字任务提供了巨大的潜力，但其性能通常受到高质量轨迹数据的限制。为解决此问题，我们提出在专门的中间训练阶段对视觉语言模型（VLM）进行丰富数据、注重推理的任务训练，并研究这些任务如何促进对GUI规划场景的泛化。实验表明，任务泛化高度有效，多数设置下均有显著提高。文本仅数学数据显著提升GUI网页代理性能，实现WebArena 5.6%和AndroidWorld 5.4%的进步。研究为GUI代理的跨域知识迁移提供了宝贵见解，并提供解决这一新兴领域数据稀缺问题的实用方法。详情访问GUIMid。

Key Takeaways

GUI代理能自动化复杂的数字任务并实现跨平台应用，对生产力工作流程产生巨大影响。
GUI代理性能受到高质量轨迹数据的限制。
在中间训练阶段，通过丰富数据和注重推理的任务训练可以提升视觉语言模型（VLM）。
任务泛化方法在许多情境下都是有效的。特别是文本驱动的多模态数学推理能提高GUI代理在AndroidWorld上的性能高达绝对6.3%。同时发现文本数据的交叉模态泛化效果显著。
GUI感知数据对最终性能的影响相对较小，这颠覆了先前的假设。
最有效的中间训练任务和优化的混合数据集能显著提高性能，如在WebArena上提高绝对性能高达8%，在AndroidWorld上提高绝对性能高达12.2%。这提供了一个解决数据稀缺问题的实用方法。

Cool Papers

点此查看论文截图

SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents

Authors:Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buccholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, Anoop Deoras, Giovanni Zappella, Laurent Callot

Coding agents powered by large language models have shown impressive capabilities in software engineering tasks, but evaluating their performance across diverse programming languages and real-world scenarios remains challenging. We introduce SWE-PolyBench, a new multi-language benchmark for repository-level, execution-based evaluation of coding agents. SWE-PolyBench contains 2110 instances from 21 repositories and includes tasks in Java (165), JavaScript (1017), TypeScript (729) and Python (199), covering bug fixes, feature additions, and code refactoring. We provide a task and repository-stratified subsample (SWE-PolyBench500) and release an evaluation harness allowing for fully automated evaluation. To enable a more comprehensive comparison of coding agents, this work also presents a novel set of metrics rooted in syntax tree analysis. We evaluate leading open source coding agents on SWE-PolyBench, revealing their strengths and limitations across languages, task types, and complexity classes. Our experiments show that current agents exhibit uneven performances across languages and struggle with complex problems while showing higher performance on simpler tasks. SWE-PolyBench aims to drive progress in developing more versatile and robust AI coding assistants for real-world software engineering. Our datasets and code are available at: https://github.com/amazon-science/SWE-PolyBench

由大型语言模型驱动的编码代理在软件工程任务中表现出了令人印象深刻的能力，但在多种编程语言和现实场景中对它们的性能进行评估仍然具有挑战性。我们介绍了SWE-PolyBench，这是一个新的多语言基准测试，用于对编码代理进行仓库级、基于执行的评价。SWE-PolyBench包含来自21个仓库的2110个实例，涵盖了Java（165个）、JavaScript（1017个）、TypeScript（729个）和Python（199个）的任务，包括错误修复、功能添加和代码重构。我们提供了任务分层和仓库分层的子集（SWE-PolyBench500），并发布了一个评估工具，可以实现全自动评估。为了更全面地比较编码代理，这项工作还提出了基于语法树分析的新指标集。我们在SWE-PolyBench上评估了领先的开源编码代理，揭示了它们在跨语言、任务类型和复杂度类别方面的优势和局限性。我们的实验表明，当前代理在不同语言之间的性能不均，对复杂问题感到困难，而在简单任务上表现较好。SWE-PolyBench旨在推动开发更通用、更稳健的AI编码助手，用于现实世界的软件工程。我们的数据集和代码可在：https://github.com/amazon-science/SWE-PolyBench找到。

论文及项目相关链接

PDF 20 pages, 6 figures, edited website links

Summary

大型语言模型驱动的编码代理在软件工程任务中展现了令人印象深刻的能力，但评估其在多种编程语言和真实场景中的性能仍然具有挑战性。为此，我们推出了SWE-PolyBench，一个用于评估编码代理的多语言基准测试。该基准测试包含来自多个仓库的实例，涵盖了不同编程语言的任务，包括Java、JavaScript、TypeScript和Python中的Bug修复、功能添加和代码重构。我们提供了任务分层和仓库分层的子集（SWE-PolyBench500），并发布了允许全自动评估的评估工具。为了更全面地比较编码代理，本文还提出了基于语法树分析的全新指标集。我们对SWE-PolyBench上的领先开源编码代理进行了评估，揭示了其在不同语言、任务类型和复杂性类别方面的优势和局限性。实验表明，当前代理在不同语言中的表现不均，对复杂问题存在挑战，但在简单任务上的表现较好。SWE-PolyBench旨在推动开发更通用和稳健的AI编码助手，用于真实世界的软件工程。

Key Takeaways

大型语言模型驱动的编码代理在软件工程任务中展现能力，但跨语言与真实场景的评估存在挑战。
推出SWE-PolyBench基准测试，包含多语言的任务实例，涵盖Bug修复、功能添加和代码重构。
提供SWE-PolyBench的子集及评估工具，允许全自动评估。
引入基于语法树分析的全新评估指标集，以更全面地比较编码代理。
当前编码代理在不同语言和任务类型上的表现不均，对复杂问题有挑战，但简单任务上表现较好。

Cool Papers

点此查看论文截图

ELT-Bench: An End-to-End Benchmark for Evaluating AI Agents on ELT Pipelines

Authors:Tengjun Jin, Yuxuan Zhu, Daniel Kang

Practitioners are increasingly turning to Extract-Load-Transform (ELT) pipelines with the widespread adoption of cloud data warehouses. However, designing these pipelines often involves significant manual work to ensure correctness. Recent advances in AI-based methods, which have shown strong capabilities in data tasks, such as text-to-SQL, present an opportunity to alleviate manual efforts in developing ELT pipelines. Unfortunately, current benchmarks in data engineering only evaluate isolated tasks, such as using data tools and writing data transformation queries, leaving a significant gap in evaluating AI agents for generating end-to-end ELT pipelines. To fill this gap, we introduce ELT-Bench, an end-to-end benchmark designed to assess the capabilities of AI agents to build ELT pipelines. ELT-Bench consists of 100 pipelines, including 835 source tables and 203 data models across various domains. By simulating realistic scenarios involving the integration of diverse data sources and the use of popular data tools, ELT-Bench evaluates AI agents’ abilities in handling complex data engineering workflows. AI agents must interact with databases and data tools, write code and SQL queries, and orchestrate every pipeline stage. We evaluate two representative code agent frameworks, Spider-Agent and SWE-Agent, using six popular Large Language Models (LLMs) on ELT-Bench. The highest-performing agent, Spider-Agent Claude-3.7-Sonnet with extended thinking, correctly generates only 3.9% of data models, with an average cost of $4.30 and 89.3 steps per pipeline. Our experimental results demonstrate the challenges of ELT-Bench and highlight the need for a more advanced AI agent to reduce manual effort in ELT workflows. Our code and data are available at https://github.com/uiuc-kang-lab/ELT-Bench.

随着云数据仓库的广泛应用，从业者越来越多地转向使用提取-加载-转换（ELT）管道。然而，设计这些管道通常涉及大量的手动工作以确保其正确性。最近，人工智能方法在数据任务（如文本到SQL）方面取得了显著进展，为减少ELT管道开发中的手动工作提供了机会。然而，目前数据工程领域的基准测试仅评估孤立的任务，例如使用数据工具和编写数据转换查询，在评估用于生成端到端ELT管道的AI代理方面存在巨大差距。为了填补这一空白，我们引入了ELT-Bench，这是一个旨在评估AI代理构建ELT管道能力的端到端基准测试。ELT-Bench包含100个管道，涉及835个源表和203个数据模型，涵盖各个领域。通过模拟涉及整合各种数据源和使用流行数据工具的现实场景，ELT-Bench评估AI代理在处理复杂数据工程工作流程方面的能力。AI代理必须与数据库和数据工具进行交互，编写代码和SQL查询，并协调每个管道阶段。我们使用ELT-Bench对两个具有代表性的代码代理框架Spider-Agent和SWE-Agent以及六种流行的大型语言模型（LLM）进行了评估。表现最佳的代理Spider-Agent Claude-3.7-Sonnet经过扩展思考后，仅正确生成了3.9%的数据模型，平均每个管道的成本为4.3美元和需要89.3步操作。我们的实验结果展示了ELT-Bench的挑战性，并强调了需要更先进的AI代理来减少ELT工作流程中的手动工作。我们的代码和数据可在https://github.com/uiuc-kang-lab/ELT-Bench获得。

论文及项目相关链接

PDF 14 pages, 18 figures

摘要

随着云数据仓库的广泛应用，实践者越来越多地采用Extract-Load-Transform（ELT）管道。然而，设计这些管道通常涉及大量的手动工作来保证正确性。最近，AI方法的发展在数据任务（如文本到SQL）方面展示了强大的能力，有机会减少在开发ELT管道中的手动工作。然而，当前的数据工程基准仅评估孤立的任务，如使用数据工具和编写数据转换查询，在评估用于生成端到端ELT管道的AI代理方面存在巨大差距。为了填补这一空白，我们引入了ELT-Bench，这是一个端到端的基准测试，旨在评估AI代理构建ELT管道的能力。ELT-Bench包含100个管道、835个源表和203个数据模型，涵盖各种领域。通过模拟涉及不同数据源集成和流行数据工具使用的现实场景，ELT-Bench评估AI代理在处理复杂数据工程工作流程方面的能力。我们的实验结果表明ELT-Bench的挑战性，并强调需要更先进的AI代理来减少ELT工作流中的手动工作。我们的代码和数据可在https://github.com/uiuc-kang-lab/ELT-Bench找到。

关键见解

实践者越来越多地采用Extract-Load-Transform（ELT）管道进行数据处理。
设计ELT管道涉及大量手动工作，需要AI方法辅助以确保正确性。
当前数据工程基准测试主要关注孤立任务评估，缺乏评估AI代理在生成端到端ELT管道方面的能力。
引入ELT-Bench基准测试，包含100个管道、多种源表和数据模型，旨在全面评估AI代理处理复杂数据工程工作流程的能力。
AI代理需在仿真环境中与数据库和数据工具交互、编写代码和SQL查询、并协调每个管道阶段。
评估结果表明当前最高性能的AI代理在ELT-Bench上表现有限，仅正确生成少量数据模型，强调需要更先进的AI代理来减少手动工作。

Cool Papers

点此查看论文截图

Steering No-Regret Agents in MFGs under Model Uncertainty

Authors:Leo Widmer, Jiawei Huang, Niao He

Incentive design is a popular framework for guiding agents’ learning dynamics towards desired outcomes by providing additional payments beyond intrinsic rewards. However, most existing works focus on a finite, small set of agents or assume complete knowledge of the game, limiting their applicability to real-world scenarios involving large populations and model uncertainty. To address this gap, we study the design of steering rewards in Mean-Field Games (MFGs) with density-independent transitions, where both the transition dynamics and intrinsic reward functions are unknown. This setting presents non-trivial challenges, as the mediator must incentivize the agents to explore for its model learning under uncertainty, while simultaneously steer them to converge to desired behaviors without incurring excessive incentive payments. Assuming agents exhibit no(-adaptive) regret behaviors, we contribute novel optimistic exploration algorithms. Theoretically, we establish sub-linear regret guarantees for the cumulative gaps between the agents’ behaviors and the desired ones. In terms of the steering cost, we demonstrate that our total incentive payments incur only sub-linear excess, competing with a baseline steering strategy that stabilizes the target policy as an equilibrium. Our work presents an effective framework for steering agents behaviors in large-population systems under uncertainty.

激励设计是一个流行的框架，它通过提供内在奖励之外的额外支付，来引导代理人的学习动态朝着期望的结果发展。然而，大多数现有工作都集中在有限的小代理群体上，或者假设对游戏有完全的了解，这限制了它们在涉及大量人口和模型不确定性的现实场景中的应用。为了解决这一差距，我们在具有密度独立转换的Mean-Field Games（MFGs）中研究转向奖励的设计，其中转换动态和内在奖励函数都是未知的。这一设置带来了非平凡的挑战，因为中介人必须在不确定的情况下激励代理人进行模型学习探索，同时引导他们收敛到期望的行为，而不产生过多的激励支付。假设代理人表现出无（-自适应）遗憾的行为，我们提出了新颖乐观的探索算法。理论上，我们为代理人行为和期望行为之间的累积差距建立了次线性遗憾保证。在转向成本方面，我们证明我们的总激励支付只产生次线性超额，与基线转向策略竞争，该策略将目标政策作为平衡状态加以稳定。我们的工作为在不确定性下引导大规模系统中代理人行为提供了一个有效的框架。

论文及项目相关链接

PDF AISTATS 2025; 34 Pages

Summary

本文研究了在Mean-Field Games框架下设计激励奖励的问题，特别关注密度独立转换的场景。文章提出了一种针对未知转换动态和内在奖励函数的激励设计方法，旨在解决现有工作在处理大规模人群和模型不确定性方面的局限性。通过设计乐观的探索算法，在不需要过多的激励支付的同时，激励代理在不确定下探索并引导它们达到期望的行为。

Key Takeaways

激励设计是一个流行的框架，用于通过提供额外的支付来引导代理的学习动态达到期望的结果。
现有工作大多关注有限的代理群体或假设对游戏有完全了解，这限制了它们在涉及大规模人群和模型不确定性的现实场景中的应用。
文章研究了在Mean-Field Games框架下设计激励奖励的问题，特别关注密度独立转换的场景。
在转换动态和内在奖励函数未知的情况下，必须激励代理在不确定的情况下探索，同时引导他们达到期望的行为，这构成了非平凡的挑战。
文章为无后悔行为的代理设计了乐观的探索算法。
理论上，文章建立了累积差距的次线性后悔保证，确保了代理行为与期望行为的接近程度。

Cool Papers

点此查看论文截图

AppAgentX: Evolving GUI Agents as Proficient Smartphone Users

Authors:Wenjia Jiang, Yangyang Zhuang, Chenxi Song, Xu Yang, Joey Tianyi Zhou, Chi Zhang

Recent advancements in Large Language Models (LLMs) have led to the development of intelligent LLM-based agents capable of interacting with graphical user interfaces (GUIs). These agents demonstrate strong reasoning and adaptability, enabling them to perform complex tasks that traditionally required predefined rules. However, the reliance on step-by-step reasoning in LLM-based agents often results in inefficiencies, particularly for routine tasks. In contrast, traditional rule-based systems excel in efficiency but lack the intelligence and flexibility to adapt to novel scenarios. To address this challenge, we propose a novel evolutionary framework for GUI agents that enhances operational efficiency while retaining intelligence and flexibility. Our approach incorporates a memory mechanism that records the agent’s task execution history. By analyzing this history, the agent identifies repetitive action sequences and evolves high-level actions that act as shortcuts, replacing these low-level operations and improving efficiency. This allows the agent to focus on tasks requiring more complex reasoning, while simplifying routine actions. Experimental results on multiple benchmark tasks demonstrate that our approach significantly outperforms existing methods in both efficiency and accuracy. The code will be open-sourced to support further research.

近期大型语言模型（LLM）的进步促进了基于智能LLM的代理的发展，这些代理能够与图形用户界面（GUI）进行交互。这些代理表现出强大的推理和适应性，使它们能够执行传统上需要预定规则才能完成的复杂任务。然而，基于LLM的代理在逐步推理上的依赖往往会导致效率低下，尤其是对于常规任务。相比之下，传统的基于规则的系统在效率上表现很好，但在适应新场景时缺乏智能和灵活性。为了解决这一挑战，我们提出了一种新型的GUI代理进化框架，该框架提高了操作效率，同时保留了智能和灵活性。我们的方法引入了一种记忆机制，用于记录代理的任务执行历史。通过分析这些历史记录，代理能够识别重复的动作序列并进化出高级动作，这些动作可以作为快捷方式，替代这些低级操作，从而提高效率。这使得代理能够专注于需要更复杂推理的任务，同时简化常规动作。在多个基准任务上的实验结果表明，我们的方法在效率和准确性方面都显著优于现有方法。代码将开源以支持进一步研究。

论文及项目相关链接

PDF

Summary

强大的大型语言模型（LLM）驱动的智能代理展现出了与图形用户界面（GUI）交互的能力，它们具备出色的推理和适应性，可以执行复杂的任务。然而，对于常规任务，基于LLM的代理逐步推理的方式常常导致效率不高。为应对这一挑战，我们提出了一种新型的GUI代理进化框架，该框架在保持智能和灵活性的同时，提高了操作效率。通过引入记忆机制记录代理的任务执行历史，分析这些历史数据来识别重复的操作序列并进化出高级行动作为快捷方式，代替低级操作，从而提高效率。

Key Takeaways

大型语言模型（LLM）驱动的代理具备与图形用户界面（GUI）交互的能力，展现出强大的推理和适应性。
基于LLM的代理在执行常规任务时，逐步推理的方式可能导致效率不高。
新型的GUI代理进化框架旨在提高操作效率，同时保持智能和灵活性。
记忆机制是该框架的核心，它记录代理的任务执行历史，分析这些数据以识别重复操作并进化出高级行动作为快捷方式。
进化出的高级行动能代替低级操作，从而提高效率，使代理能专注于需要复杂推理的任务。
在多个基准任务上的实验结果表明，该框架在效率和准确性方面都显著优于现有方法。

Cool Papers

点此查看论文截图

AFlow: Automating Agentic Workflow Generation

Authors:Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu

Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow’s efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code is available at https://github.com/FoundationAgents/AFlow.

大型语言模型（LLM）在解决跨不同领域的复杂任务方面展现出了显著潜力，通常是通过采用遵循详细指令和操作序列的代理工作流程来实现。然而，构建这些工作流程需要大量人工努力，从而限制了其可扩展性和通用性。最近的研究试图实现这些工作流程的自动生成和优化，但现有方法仍然依赖于初始的手动设置，并且无法实现完全自动化和有效的工作流程生成。为了应对这一挑战，我们将工作流程优化重新构建为一个在代码表示的工作流程上的搜索问题，其中LLM调用节点通过边缘连接。我们介绍了AFlow，这是一个自动化框架，它利用蒙特卡洛树搜索有效地探索了这个空间，通过代码修改、树形经验和执行反馈来迭代优化工作流程。在六个基准数据集上的实证评估证明了AFlow的有效性，与最新基线相比，实现了平均改进5.7%。此外，AFlow使小型模型能够以美元计算低于GPT-4o推理成本的4.55%的成本在特定任务上表现优于GPT-4o。代码可在https://github.com/FoundationAgents/AFlow找到。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在解决跨域复杂任务方面展现出显著潜力，通常通过遵循详细指令和操作序列的代理工作流程来完成。然而，构建这些工作流程需要大量人力，限制了其可扩展性和通用性。研究界正努力自动化生成和优化这些工作流程，但现有方法仍依赖初始手动设置，无法实现完全自动化和有效的工作流生成。为解决这一挑战，本文将工作流优化重新构建为在代码表示的工作流上的搜索问题，引入AFlow框架，利用蒙特卡洛树搜索有效地探索这个空间，通过代码修改、树结构经验和执行反馈来迭代优化工作流。在六个基准数据集上的实证评估表明，AFlow的平均改进率达5.7%，并且在特定任务上，即使使用较小的模型也能以4.55%的GPT-4推理成本实现超越。

Key Takeaways

大型语言模型（LLMs）在解决复杂任务方面表现出卓越潜力，但构建工作流程需要大量人力。
现有工作流程自动化方法仍依赖初始手动设置，无法实现全面自动化。
本文将工作流优化重新构建为代码表示的搜索问题。
引入AFlow框架，利用蒙特卡洛树搜索有效探索工作流空间。
AFlow通过代码修改、树结构经验和执行反馈迭代优化工作流。
在六个基准数据集上的实证评估显示，AFlow平均改进率达5.7%。
AFlow使较小模型在特定任务上能超越高性能模型，同时降低推理成本至GPT-4的4.55%。

Cool Papers

点此查看论文截图

Towards Hierarchical Multi-Agent Workflows for Zero-Shot Prompt Optimization

Authors:Yuchi Liu, Jaskirat Singh, Gaowen Liu, Ali Payani, Liang Zheng

Large language models (LLMs) have shown great progress in responding to user questions, allowing for a multitude of diverse applications. Yet, the quality of LLM outputs heavily depends on the prompt design, where a good prompt might enable the LLM to answer a very challenging question correctly. Therefore, recent works have developed many strategies for improving the prompt, including both manual crafting and in-domain optimization. However, their efficacy in unrestricted scenarios remains questionable, as the former depends on human design for specific questions and the latter usually generalizes poorly to unseen scenarios. To address these problems, we give LLMs the freedom to design the best prompts according to themselves. Specifically, we include a hierarchy of LLMs, first constructing a prompt with precise instructions and accurate wording in a hierarchical manner, and then using this prompt to generate the final answer to the user query. We term this pipeline Hierarchical Multi-Agent Workflow, or HMAW. In contrast with prior works, HMAW imposes no human restriction and requires no training, and is completely task-agnostic while capable of adjusting to the nuances of the underlying task. Through both quantitative and qualitative experiments across multiple benchmarks, we verify that despite its simplicity, the proposed approach can create detailed and suitable prompts, further boosting the performance of current LLMs.

大型语言模型（LLM）在回答用户问题方面取得了巨大进步，为多种应用提供了可能。然而，LLM的输出质量在很大程度上取决于提示设计，一个好的提示可能会使LLM能够正确地回答一个具有挑战性的问题。因此，近期的研究工作已经开发了许多改进提示的策略，包括手动制作和领域内的优化。然而，它们在非限制场景中的有效性仍然值得怀疑，因为前者依赖于针对特定问题的人类设计，而后者通常对未见场景的泛化能力较差。为了解决这些问题，我们赋予LLM自由设计最佳提示的权利。具体来说，我们采用分级的大型语言模型，首先以分层的方式精确指导并准确措辞构建提示，然后使用该提示生成对用户查询的最终答案。我们称这种流程为分层多智能体工作流程（HMAW）。与以前的工作相比，HMAW没有人为限制，不需要训练，完全是任务无关的，能够适应底层任务的细微差别。通过多个基准的定量和定性实验，我们验证了尽管方法简单，但所提出的方法可以创建详细且合适的提示，进一步提高了当前大型语言模型的性能。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在用户问题回应方面取得显著进展，广泛应用于多种应用。但LLM输出质量高度依赖于提示设计，好的提示可能使LLM正确回答具有挑战性的问题。尽管有手动制作和域内优化等改进提示的策略，但在无限制场景中其有效性仍有待验证。为解决这些问题，我们让LLMs自行设计最佳提示。我们采用分级LLM架构，以精确指令和准确措辞构建提示，并使用该提示生成对用户查询的最终答案。我们称之为分层多智能体工作流程（HMAW），与先前工作相比，HMAW无需人为限制，无需训练，且完全不受任务限制，能够适应底层任务的细微差别。通过跨多个基准的定量和定性实验验证，尽管方法简单，但所提出的策略可以创建详细且合适的提示，进一步提升当前LLM的性能。

Key Takeaways