发布日期: 2025-06-14

更新日期: 2025-07-06

文章字数: 12.5k

阅读时长: 51 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-06-14 更新

AutoMind: Adaptive Knowledgeable Agent for Automated Data Science

Authors:Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang

Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.

大型语言模型（LLM）代理在解决现实世界的数据科学问题方面显示出巨大潜力。LLM驱动的数据科学代理有望自动化整个机器学习流程，但它们在现实世界中的有效性仍然有限。现有框架依赖于僵化、预先定义的工作流程和不可灵活调整的编码策略；因此，它们仅在相对简单、经典的问题上表现出色，而无法获取人类实践者在复杂、创新任务中所带来的经验知识。在这项工作中，我们介绍了AutoMind，这是一个自适应、知识型LLM代理框架，通过三个关键进展克服了这些不足：（1）一个精选的专家知识库，将代理与领域专家知识相结合；（2）一种智能知识树搜索算法，战略性探索可能的解决方案；（3）一种自适应编码策略，根据任务复杂性动态调整代码生成。在两个自动化数据科学基准测试上的评估表明，AutoMind相较于最新技术基线具有卓越性能。额外的分析证实了其有效性、效率和解决方案的优质定性，凸显了AutoMind作为实现完全自动化数据科学的稳健和高效的一步。

论文及项目相关链接

PDF Ongoing work. Code is at https://github.com/innovatingAI/AutoMind

Summary：大型语言模型驱动的自动化数据科学代理人（如AutoMind）展示了巨大的潜力以解决现实中的数据科学问题，并通过其独特的专长库和战略性的探索策略进行自我适应编程。这种自动化机器学习方法能在处理复杂的创新型任务上表现出优越性。此工作的研究在人工智能领域中是一大进步。

Key Takeaways：

大型语言模型（LLM）代理人在解决数据科学问题上展现出巨大潜力。
LLM驱动的代理人承诺自动化整个机器学习流程，但其在现实世界的有效性受到限制。
当前框架依赖于预设的工作流程和固定的编码策略，仅在处理简单问题上有优势，难以处理复杂任务。
AutoMind框架通过引入专业知识库、战略性搜索算法和自我适应编程策略克服了这些缺点。

Cool Papers

点此查看论文截图

CIIR@LiveRAG 2025: Optimizing Multi-Agent Retrieval Augmented Generation through Self-Training

Authors:Alireza Salemi, Mukta Maddipatla, Hamed Zamani

This paper presents mRAG, a multi-agent retrieval-augmented generation (RAG) framework composed of specialized agents for subtasks such as planning, searching, reasoning, and coordination. Our system uses a self-training paradigm with reward-guided trajectory sampling to optimize inter-agent collaboration and enhance response generation. Evaluated on DataMorgana-derived datasets during the SIGIR 2025 LiveRAG competition, mRAG outperforms conventional RAG baselines. We further analyze competition outcomes and showcase the framework’s strengths with case studies, demonstrating its efficacy for complex, real-world RAG tasks.

本文介绍了mRAG，这是一种由专门用于规划、搜索、推理和协调等子任务的多智能体组成的新型多智能体增强生成（RAG）框架。我们的系统采用自训练范式，通过奖励引导轨迹采样来优化智能体之间的协作并增强响应生成。在SIGIR 2025 LiveRAG竞赛中使用DataMorgana衍生的数据集进行评估时，mRAG表现出优于传统RAG基线的能力。我们进一步分析了竞赛结果，并通过案例研究展示了该框架的优势，证明了它在复杂的现实世界中执行RAG任务时的有效性。

论文及项目相关链接

PDF

Summary
本文介绍了mRAG，这是一种由多个代理组成的多代理检索增强生成（RAG）框架，用于执行规划、搜索、推理和协调等子任务。该系统采用自我训练模式，通过奖励引导轨迹采样优化代理间的协作并增强响应生成。在SIGIR 2025 LiveRAG竞赛中，使用DataMorgana派生的数据集评估的mRAG性能优于传统RAG基线。我们还分析了竞赛结果并展示了框架在案例研究中的优势，证明了其在复杂、真实的RAG任务中的有效性。

Key Takeaways

mRAG是一个多代理检索增强生成（RAG）框架，包含规划、搜索、推理和协调等子任务的专门代理。
该系统采用自我训练模式，优化代理间的协作。
通过奖励引导轨迹采样增强响应生成。
在SIGIR 2025 LiveRAG竞赛中，mRAG性能优于传统RAG基线。
mRAG框架在复杂、真实的RAG任务中表现出有效性。
通过对竞赛结果的分析，展示了mRAG框架的优势。

Cool Papers

点此查看论文截图

VideoDeepResearch: Long Video Understanding With Agentic Tool Using

Authors:Huaying Yuan, Zheng Liu, Junjie Zhou, Ji-Rong Wen, Zhicheng Dou

Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs) due to the task’s inherent complexity and context window constraint. It is widely assumed that addressing LVU tasks requires foundation MLLMs with extended context windows, strong visual perception capabilities, and proficient domain expertise. In this work, we challenge this common belief by introducing VideoDeepResearch, a novel agentic framework for long video understanding. Our approach relies solely on a text-only large reasoning model (LRM) combined with a modular multi-modal toolkit, including multimodal retrievers and visual perceivers, all of which are readily available in practice. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool using. We conduct extensive experiments on popular LVU benchmarks, including MLVU, Video-MME, and LVBench. Our results demonstrate that VideoDeepResearch achieves substantial improvements over existing MLLM baselines, surpassing the previous state-of-the-art by 9.6%, 6.6%, and 3.9% on MLVU (test), LVBench, and LongVideoBench, respectively. These findings highlight the promise of agentic systems in overcoming key challenges in LVU problems.

长视频理解（LVU）对当前的多模态大型语言模型（MLLMs）提出了重大挑战，这是由于该任务的固有复杂性和上下文窗口约束。普遍认为，解决LVU任务需要具有扩展上下文窗口、强大视觉感知能力和专业领域知识的基础MLLMs。在这项工作中，我们通过引入VideoDeepResearch，一个用于长视频理解的新型代理框架，来挑战这一普遍信念。我们的方法仅依赖于文本大型推理模型（LRM）和模块化多模态工具包，包括多模态检索器和视觉感知器，所有这些在实践中都可轻松获得。对于每个LVU任务，系统通过推理制定问题解决策略，同时有选择地访问和利用工具中的关键视频内容。我们在流行的LVU基准测试上进行了大量实验，包括MLVU、Video-MME和LVBench。结果表明，VideoDeepResearch在现有的MLLM基准测试上取得了显著的改进，在MLVU（测试）、LVBench和LongVideoBench上的性能分别提高了9.6%、6.6%和3.9%。这些发现突出了代理系统在克服LVU问题的关键挑战方面的潜力。

论文及项目相关链接

PDF

Summary
长视频理解（LVU）是当前多模态大型语言模型（MLLMs）面临的一个重大挑战。本文介绍了一种新颖的agentic框架VideoDeepResearch，它通过文本大型推理模型和多模态工具组合来解决LVU任务。实验证明，VideoDeepResearch在流行的LVU基准测试上实现了显著改进，这显示出agentic系统在克服LVU问题中的关键挑战方面的潜力。

Key Takeaways

长视频理解（LVU）是当大型语言模型面临的一个挑战，需要处理复杂性和上下文窗口限制。
VideoDeepResearch是一个新颖的agentic框架，用于解决LVU任务。
VideoDeepResearch采用文本大型推理模型（LRM）和多模态工具组合的方法。
该系统通过推理制定问题解决策略，并选择性访问和利用视频内容。
VideoDeepResearch在多个流行的LVU基准测试上超越了现有MLLM基线。
VideoDeepResearch在MLVU、Video-MME和LVBench上的性能分别提高了9.6%、6.6%和3.9%。

Cool Papers

点此查看论文截图

OPT-BENCH: Evaluating LLM Agent on Large-Scale Search Spaces Optimization Problems

Authors:Xiaozhe Li, Jixuan Chen, Xinyu Fang, Shengyuan Ding, Haodong Duan, Qingwen Liu, Kai Chen

Large Language Models (LLMs) have shown remarkable capabilities in solving diverse tasks. However, their proficiency in iteratively optimizing complex solutions through learning from previous feedback remains insufficiently explored. To bridge this gap, we present OPT-BENCH, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, offering a diverse and challenging environment for assessing LLM agents on iterative reasoning and solution refinement. To enable rigorous evaluation, we introduce OPT-Agent, an end-to-end optimization framework that emulates human reasoning when tackling complex problems by generating, validating, and iteratively improving solutions through leveraging historical feedback. Through extensive experiments on 9 state-of-the-art LLMs from 6 model families, we analyze the effects of optimization iterations, temperature settings, and model architectures on solution quality and convergence. Our results demonstrate that incorporating historical context significantly enhances optimization performance across both ML and NP tasks. All datasets, code, and evaluation tools are open-sourced to promote further research in advancing LLM-driven optimization and iterative reasoning. Project page: \href{https://github.com/OliverLeeXZ/OPT-BENCH}{https://github.com/OliverLeeXZ/OPT-BENCH}.

大型语言模型（LLM）在解决各种任务时表现出了显著的能力。然而，它们通过从前馈中学习来迭代优化复杂解决方案的能力尚未得到充分探索。为了弥补这一差距，我们推出了OPT-BENCH，这是一个旨在评估LLM代理在大规模搜索空间优化问题上的综合基准测试。OPT-BENCH包括来自Kaggle的20个现实世界机器学习任务和10个经典NP问题，为评估LLM代理在迭代推理和解决方案优化方面的能力提供了一个多样且富有挑战性的环境。为了进行严格的评估，我们引入了OPT-Agent，这是一个端到端的优化框架，通过利用历史反馈来生成、验证并迭代改进解决方案，从而模拟人类在解决复杂问题时的推理过程。通过对9种最新的大型语言模型进行广泛实验，这些模型来自6个不同的模型家族，我们分析了优化迭代、温度设置和模型架构对解决方案质量和收敛性的影响。我们的结果表明，结合历史背景可以显著提高机器学习任务和NP任务的优化性能。为了促进对LLM驱动的优化和迭代推理的进一步研究，所有数据集、代码和评估工具均已开源。项目页面：https://github.com/OliverLeeXZ/OPT-BENCH访问链接。

论文及项目相关链接

PDF

Summary

本文介绍了大型语言模型（LLMs）在解决大型搜索空间优化问题上的能力。为了评估LLM代理的优化性能，提出了一种名为OPT-BENCH的综合基准测试，其中包括来自Kaggle的20个现实世界机器学习任务和10个经典NP问题。同时，为了模拟人类解决复杂问题的推理过程，引入了OPT-Agent这一端到端的优化框架。实验结果表明，结合历史背景信息能显著提高优化性能，无论是在机器学习还是NP任务上都是如此。所有数据集、代码和评估工具均已开源，以促进LLM驱动的优化和迭代推理的研究。

Key Takeaways

大型语言模型（LLMs）在解决大型搜索空间优化问题上表现出显著的能力。
OPT-BENCH是一种用于评估LLM代理优化性能的综合基准测试，包括来自Kaggle的机器学习任务和经典NP问题。
OPT-Agent是一个端到端的优化框架，通过模拟人类推理来解决复杂问题，能够生成、验证并改进解决方案。
结合历史背景信息能显著提高优化性能。
开源数据集、代码和评估工具促进了LLM驱动的优化和迭代推理的研究。
实验结果表明，不同的优化迭代次数、温度设置和模型架构对解决方案质量和收敛性有影响。

Cool Papers

点此查看论文截图

Reasoning RAG via System 1 or System 2: A Survey on Reasoning Agentic Retrieval-Augmented Generation for Industry Challenges

Authors:Jintao Liang, Gang Su, Huifeng Lin, You Wu, Rui Zhao, Ziyue Li

Retrieval-Augmented Generation (RAG) has emerged as a powerful framework to overcome the knowledge limitations of Large Language Models (LLMs) by integrating external retrieval with language generation. While early RAG systems based on static pipelines have shown effectiveness in well-structured tasks, they struggle in real-world scenarios requiring complex reasoning, dynamic retrieval, and multi-modal integration. To address these challenges, the field has shifted toward Reasoning Agentic RAG, a paradigm that embeds decision-making and adaptive tool use directly into the retrieval process. In this paper, we present a comprehensive review of Reasoning Agentic RAG methods, categorizing them into two primary systems: predefined reasoning, which follows fixed modular pipelines to boost reasoning, and agentic reasoning, where the model autonomously orchestrates tool interaction during inference. We analyze representative techniques under both paradigms, covering architectural design, reasoning strategies, and tool coordination. Finally, we discuss key research challenges and propose future directions to advance the flexibility, robustness, and applicability of reasoning agentic RAG systems. Our collection of the relevant research has been organized into a https://github.com/ByebyeMonica/Reasoning-Agentic-RAG.

检索增强生成（RAG）作为一个强大的框架已经出现，通过整合外部检索和语言生成来克服大型语言模型（LLM）的知识局限。虽然基于静态流水线的早期RAG系统在结构良好的任务中显示出有效性，但在需要复杂推理、动态检索和多模态集成的现实场景中还面临挑战。为了解决这些挑战，该领域已经转向推理代理RAG范式，该范式将决策制定和自适应工具使用直接嵌入到检索过程中。在本文中，我们对推理代理RAG方法进行了全面的回顾，将它们分为两个主要系统：预设推理，它遵循固定的模块化流水线来提升推理能力；自主推理，在该范式下模型在推理过程中自主地协调工具交互。我们分析了这两种范式下的代表性技术，涵盖了架构设计、推理策略和工具协调。最后，我们讨论了关键的研究挑战，并提出了提高推理代理RAG系统的灵活性、鲁棒性和适用性的未来研究方向。我们已把相关的研究整理在https://github.com/ByebyeMonica/Reasoning-Agentic-RAG。

论文及项目相关链接

PDF

Summary

文本介绍了融合外部检索和语言表达功能的强大框架——检索增强生成（RAG）。传统的静态管道式RAG系统在结构化任务中表现良好，但在需要复杂推理、动态检索和多模态集成的现实场景中面临挑战。为解决这些问题，人们开始研究Reasoning Agentic RAG（RARAG）模型，该模型直接将决策和自适应工具嵌入检索过程中。文本回顾了RARAG的方法，并将其分为两类系统：预设推理和自主推理。前者遵循固定的模块化管道来提升推理能力，后者让模型在推理过程中自主协调工具交互。本文还对两者进行了综合分析，并讨论了未来提升RARAG系统灵活性、稳健性和适用性的关键研究方向。研究成果被组织在了特定链接中。简而言之，文本着重讨论新型技术模型解决知识局限性问题的方法和前景。同时展望了未来发展趋势和挑战。

Key Takeaways

RAG框架通过整合外部检索与语言生成，克服大型语言模型的知识局限。
传统静态管道式RAG系统在现实场景中存在挑战，如复杂推理、动态检索和多模态集成等。
RARAG模型的出现解决了这些问题，通过将决策和自适应工具直接嵌入检索过程中。

Cool Papers

点此查看论文截图

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Authors:Yuquan Xie, Zaijing Li, Rui Shao, Gongwei Chen, Kaiwen Zhou, Yinchuan Li, Dongmei Jiang, Liqiang Nie

Recent efforts to leverage the Multi-modal Large Language Model (MLLM) as GUI agents have yielded promising outcomes. However, these agents still struggle with long-horizon tasks in online environments, primarily due to insufficient knowledge and the inherent gap between offline and online domains. In this paper, inspired by how humans generalize knowledge in open-ended environments, we propose a Hierarchical Multimodal Skills (HMS) module to tackle the issue of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm, which efficiently leverages skills acquired in offline environments to reduce the action search space during online tree exploration. Building on HMS, we propose Mirage-1, a multimodal, cross-platform, plug-and-play GUI agent. To validate the performance of Mirage-1 in real-world long-horizon scenarios, we constructed a new benchmark, AndroidLH. Experimental results show that Mirage-1 outperforms previous agents by 32%, 19%, 15%, and 79% on AndroidWorld, MobileMiniWob++, Mind2Web-Live, and AndroidLH, respectively. Project page: https://cybertronagent.github.io/Mirage-1.github.io/

最近尝试将多模态大型语言模型（MLLM）作为GUI代理产生了可喜的成果。然而，这些代理在在线环境下仍然面临长期任务处理的挑战，这主要是由于知识不足以及离线与在线领域之间的固有差距所导致的。本文受人类如何在开放式环境中推广知识的启发，我们提出了分层多模态技能（HMS）模块来解决知识不足的问题。它逐步将轨迹抽象为执行技能、核心技能和最终元技能，为长期任务规划提供分层知识结构。为了弥领域差距，我们提出了技能增强蒙特卡洛树搜索（SA-MCTS）算法，该算法可以有效地利用在离线环境中获得的知识，从而减少在线树探索期间的行动搜索空间。基于HMS，我们提出了Mirage-1，一个跨平台的多模态即插即用GUI代理。为了验证Mirage-1在现实世界的长期场景中的性能，我们构建了一个新的基准测试平台AndroidLH。实验结果表明，Mirage-1在AndroidWorld、MobileMiniWob++、Mind2Web-Live和AndroidLH上的表现分别优于之前的代理32%、19%、15%和79%。项目页面：https://cybertronagent.github.io/Mirage-1

论文及项目相关链接

PDF 20 pages, 5 figures, 5 tables

Summary

近期利用多模态大型语言模型（MLLM）作为GUI代理取得了有前景的成果。然而，这些代理在在线环境中处理长期任务时仍面临困难，主要由于知识不足和离线与在线领域间的固有差距。本文受人类如何在开放环境中实现知识泛化的启发，提出了分层多模态技能（HMS）模块来解决知识不足的问题。HMS将轨迹渐进地抽象为执行技能、核心技能和元技能，为长期任务规划提供了分层知识结构。为了缩小领域差距，本文提出了技能增强蒙特卡洛树搜索（SA-MCTS）算法，该算法有效地利用在离线环境中获得的技能来减少在线树探索过程中的动作搜索空间。在此基础上，本文提出了Mirage-1，一个跨平台的多模态、随插随用的GUI代理。为了验证Mirage-1在现实世界的长期场景中的性能，我们建立了一个新的基准测试AndroidLH。实验结果表明，Mirage-1在AndroidWorld、MobileMiniWob++、Mind2Web-Live和AndroidLH上的性能分别比以前的代理提高了32%、19%、15%和79%。

Key Takeaways

多模态大型语言模型（MLLM）作为GUI代理在在线环境中处理长期任务方面取得了进展，但仍面临知识不足和领域差距的挑战。
HMS模块通过将轨迹抽象为不同层次的知识结构来解决知识不足的问题。
SA-MCTS算法利用离线环境中获得的技能来缩小在线树探索中的动作搜索空间。
Mirage-1是一个跨平台的多模态GUI代理，通过实验验证了在长期场景中的性能优势。
AndroidLH是一个新的基准测试，用于评估代理在现实世界长期任务中的性能。
Mirage-1在各种基准测试上的性能优于以前的代理。

Cool Papers

点此查看论文截图

Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

Authors:Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie

Recently, agents based on multimodal large language models (MLLMs) have achieved remarkable progress across various domains. However, building a generalist agent with capabilities such as perception, planning, action, grounding, and reflection in open-world environments like Minecraft remains challenges: insufficient domain-specific data, interference among heterogeneous tasks, and visual diversity in open-world settings. In this paper, we address these challenges through three key contributions. 1) We propose a knowledge-enhanced data generation pipeline to provide scalable and high-quality training data for agent development. 2) To mitigate interference among heterogeneous tasks, we introduce a Mixture-of-Experts (MoE) architecture with task-level routing. 3) We develop a Multimodal Reasoning-Augmented Reinforcement Learning approach to enhance the agent’s reasoning ability for visual diversity in Minecraft. Built upon these innovations, we present Optimus-3, a general-purpose agent for Minecraft. Extensive experimental results demonstrate that Optimus-3 surpasses both generalist multimodal large language models and existing state-of-the-art agents across a wide range of tasks in the Minecraft environment. Project page: https://cybertronagent.github.io/Optimus-3.github.io/

最近，基于多模态大型语言模型（MLLMs）的代理在各种领域取得了显著的进步。然而，在像Minecraft这样的开放世界环境中，构建一个具备感知、规划、行动、接地和反思能力的通用代理仍然面临挑战：领域特定数据不足、异构任务之间的干扰以及开放世界设置中的视觉多样性。在本文中，我们通过三个主要贡献来解决这些挑战。1）我们提出了一个知识增强数据生成管道，为代理开发提供可扩展和高质量的训练数据。2）为了减轻异构任务之间的干扰，我们引入了一个带有任务级路由的混合专家（MoE）架构。3）我们开发了一种多模态推理增强型强化学习方法，以提高代理在Minecraft中处理视觉多样性的推理能力。基于这些创新，我们推出了Optimus-3，一个适用于Minecraft的通用代理。广泛的实验结果表明，Optimus-3在Minecraft环境中的一系列任务中超越了通用多模态大型语言模型和现有的最先进的代理。项目页面：[https://cybertronagent.github.io/Optimus-3.github.io/]

论文及项目相关链接

PDF 24 pages, 10 figures

Summary
多模态大型语言模型为基础构建的代理人在不同领域取得了显著进步，但在开放世界环境如Minecraft中构建具备感知、规划、行动、定位和反思能力的全能代理人面临挑战。本研究通过三个关键贡献解决这些问题：一是提出知识增强数据生成管道以提供可扩展的高质量训练数据；二是引入混合专家架构及任务级别路由机制，减少异构任务间的干扰；三是开发多模态推理增强型强化学习方法，增强代理人在Minecraft中的视觉多样性推理能力。基于这些创新，研究团队提出了适用于Minecraft的通用代理Optimus-3。实验结果证实Optimus-3在多个Minecraft任务上的表现优于全能多模态大型语言模型和现有前沿代理人。详情链接：https://cybertronagent.github.io/Optimus-3。

Key Takeaways

多模态大型语言模型为基础构建的代理人在不同领域取得显著进步。
构建全能代理人在开放世界环境如Minecraft中面临挑战，包括领域特定数据不足、异构任务干扰和视觉多样性。
提出知识增强数据生成管道以提供高质量训练数据。
引入混合专家架构及任务级别路由机制减少异构任务间的干扰。
开发多模态推理增强型强化学习以应对视觉多样性问题。
提出的通用代理Optimus-3在Minecraft环境中的多个任务上表现优越。

Cool Papers

点此查看论文截图

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

Authors:Kuo Yang, Xingjie Yang, Linhui Yu, Qing Xu, Yan Fang, Xu Wang, Zhengyang Zhou, Yang Wang

Large Language Model (LLM)-driven Multi-agent systems (Mas) have recently emerged as a powerful paradigm for tackling complex real-world tasks. However, existing Mas construction methods typically rely on manually crafted interaction mechanisms or heuristic rules, introducing human biases and constraining the autonomous ability. Even with recent advances in adaptive Mas construction, existing systems largely remain within the paradigm of semi-autonomous patterns. In this work, we propose MasHost, a Reinforcement Learning (RL)-based framework for autonomous and query-adaptive Mas design. By formulating Mas construction as a graph search problem, our proposed MasHost jointly samples agent roles and their interactions through a unified probabilistic sampling mechanism. Beyond the accuracy and efficiency objectives pursued in prior works, we introduce component rationality as an additional and novel design principle in Mas. To achieve this multi-objective optimization, we propose Hierarchical Relative Policy Optimization (HRPO), a novel RL strategy that collaboratively integrates group-relative advantages and action-wise rewards. To our knowledge, our proposed MasHost is the first RL-driven framework for autonomous Mas graph construction. Extensive experiments on six benchmarks demonstrate that MasHost consistently outperforms most competitive baselines, validating its effectiveness, efficiency, and structure rationality.

近年来，以大型语言模型（LLM）驱动的多智能体系统（Mas）作为一种强大的范式，已崭露头角，用于解决复杂的真实世界任务。然而，现有的Mas构建方法通常依赖于手动设计的交互机制或启发式规则，这引入了人类偏见并限制了自主性能力。尽管自适应Mas构建方面最近有所进展，但现有系统大多仍停留在半自主模式的范式内。在这项工作中，我们提出了MasHost，这是一个基于强化学习（RL）的自主和查询自适应Mas设计框架。我们将Mas构建制定为图搜索问题，通过统一的概率采样机制联合采样智能体角色及其交互。除了先前研究中追求的准确性和效率目标外，我们还将组件合理性作为Mas中的附加和新颖设计原则。为了实现这一多目标优化，我们提出了分层相对策略优化（HRPO），这是一种新型的RL策略，可以协同整合群体相对优势和行动层面的奖励。据我们所知，我们提出的MasHost是第一个用于自主Mas图构建的RL驱动框架。在六个基准测试上的广泛实验表明，MasHost始终超越大多数竞争基线，验证了其有效性、效率和结构合理性。

论文及项目相关链接

PDF

Summary

大型语言模型驱动的多智能体系统（LLM-driven Mas）已成为解决复杂现实世界任务的有力工具。然而，现有系统大多依赖人工设计的交互机制和启发式规则，存在人为偏见和自主能力受限的问题。本研究提出MasHost，一个基于强化学习（RL）的自主、查询适应性多智能体系统设计框架。通过把多智能体系统构建形式化为图搜索问题，MasHost通过统一的概率采样机制联合采样智能体角色及其交互。本研究引入了组件合理性作为多智能体系统设计的新原则，并提出分层相对策略优化（HRPO）这一新型RL策略，实现多目标优化。实验证明，MasHost在多个基准测试上表现优越，验证了其有效性、效率和结构合理性。

Key Takeaways

大型语言模型驱动的多智能体系统已成为解决复杂任务的强大工具。
现有系统大多依赖人工设计的交互机制和启发式规则，限制了其自主性和适应性。
MasHost框架使用强化学习进行自主多智能体系统设计，通过图搜索问题形式化智能体间的交互。
MasHost引入组件合理性作为设计原则，注重智能体各组成部分的合理性。
MasHost使用分层相对策略优化（HRPO）实现多目标优化，包括准确性、效率和组件合理性。
MasHost在多个基准测试中表现优越，验证了其有效性、效率和结构合理性。

Cool Papers

点此查看论文截图

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Authors:Ziqi. Liu, Ziyang. Zhou, Mingxuan. Hu

Large language model (LLM) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.

大规模语言模型（LLM）已经成为讽刺检测领域的主流方法。然而，现有的LLM方法在面对讽刺检测时面临着挑战，包括：1.单一视角的局限性，2.理解不够全面，以及3.缺乏可解释性。本文介绍了用于讽刺的协作代理框架（CAF-I），这是一个旨在克服这些问题的LLM驱动的多代理系统。CAF-I采用了针对上下文、语义和修辞的专用代理，进行多维分析并参与交互式协同优化。然后，决策代理巩固这些观点，细化评估代理提供条件反馈以实现优化。在基准数据集上的实验证明了CAF-I最先进的零样本性能。在大多数指标上达到最新技术水平，CAF-I的平均宏观F1值为76.31，比最强基线有4.98的绝对改进。这一成功是通过其模拟人类的多视角分析实现的，提高了检测精度和可解释性。

论文及项目相关链接

PDF

Summary：
大型语言模型（LLM）在讽刺检测领域已成为主流方法，但在讽刺检测方面仍面临挑战。本文介绍了基于多智能体的协作框架（CAF-I），该框架通过采用专门用于上下文、语义和修辞的智能体进行多维分析并交互协作优化，克服了现有LLM方法的局限性。实验结果表明，CAF-I在基准数据集上的零样本性能达到最新水平，平均Macro-F1值为76.31，较之前的最佳基线模型提高了4.98。这成功得益于其对人类多角度分析的模拟，提高了检测准确性和可解释性。

Key Takeaways：

大型语言模型（LLM）已在讽刺检测中广泛应用，但存在挑战。
CAF-I框架采用多智能体系统，包括上下文、语义和修辞智能体，进行多维分析。
CAF-I通过模拟人类的多角度思考，提高了讽刺检测的准确性和可解释性。
实验结果表明，CAF-I在基准数据集上的性能达到最新水平。
CAF-I的平均Macro-F1值达到76.31，较之前的最佳模型有显著提高。
CAF-I通过交互协作优化，实现了智能体之间的有效协作。

Cool Papers

点此查看论文截图

The Optimization Paradox in Clinical AI Multi-Agent Systems

Authors:Suhana Bedi, Iddah Mlauzi, Daniel Shin, Sanmi Koyejo, Nigam H. Shah

Multi-agent artificial intelligence systems are increasingly deployed in clinical settings, yet the relationship between component-level optimization and system-wide performance remains poorly understood. We evaluated this relationship using 2,400 real patient cases from the MIMIC-CDM dataset across four abdominal pathologies (appendicitis, pancreatitis, cholecystitis, diverticulitis), decomposing clinical diagnosis into information gathering, interpretation, and differential diagnosis. We evaluated single agent systems (one model performing all tasks) against multi-agent systems (specialized models for each task) using comprehensive metrics spanning diagnostic outcomes, process adherence, and cost efficiency. Our results reveal a paradox: while multi-agent systems generally outperformed single agents, the component-optimized or Best of Breed system with superior components and excellent process metrics (85.5% information accuracy) significantly underperformed in diagnostic accuracy (67.7% vs. 77.4% for a top multi-agent system). This finding underscores that successful integration of AI in healthcare requires not just component level optimization but also attention to information flow and compatibility between agents. Our findings highlight the need for end to end system validation rather than relying on component metrics alone.

多智能体人工智能系统在临床环境中的部署越来越多，但组件级优化与系统级性能之间的关系仍不清楚。我们利用MIMIC-CDM数据集中的2400例真实患者病例，针对四种腹部疾病（阑尾炎、胰腺炎、胆囊炎、憩室炎）进行研究，将临床诊断分解为信息收集、解读和鉴别诊断。我们采用包括诊断结果、过程遵循和成本效益的综合指标，评估了单智能体系统（一个模型执行所有任务）和多智能体系统（每个任务的专业模型）之间的表现。我们的研究结果揭示了一个悖论：虽然多智能体系统通常表现优于单一智能体系统，但经过组件优化的系统或最佳品种系统具有出色的组件和过程指标（信息准确性为85.5%），在诊断准确性方面却表现不佳（67.7%对比顶级多智能体系统的77.4%）。这一发现强调，在医疗保健中成功集成人工智能不仅需要优化组件级别，还需要关注信息流程和智能体之间的兼容性。我们的研究结果表明，需要进行端到端的系统验证，而不是仅仅依赖组件指标。

论文及项目相关链接

PDF

Summary

本文探讨了多智能体系统在医疗领域的应用，通过对比单智能体系统和多智能体系统在临床诊断中的表现，发现多智能体系统通常优于单智能体系统，但过分依赖单一组件的优化可能导致系统整体性能下降。因此，成功整合人工智能在医疗保健中不仅需要优化组件级别，还需要关注信息流动和智能体之间的兼容性。这要求采用端到端的系统验证方法，而不是仅仅依赖单一组件指标。本文的见解强调了系统整合的重要性，而非仅仅关注单一组件的性能优化。

Key Takeaways

多智能体系统在医疗领域的应用越来越广泛，但在临床环境中组件级优化与系统整体性能之间的关系仍不明确。
研究通过对比单智能体系统和多智能体系统在诊断任务中的表现，揭示了多智能体系统通常具有更好的性能。
研究发现过分依赖单一组件的优化可能导致系统整体性能下降，特别是在诊断准确性方面。
成功整合人工智能在医疗保健中需要关注信息流动和智能体之间的兼容性，而不仅仅是组件级别的优化。
研究强调了采用端到端的系统验证方法的重要性，而不是仅仅依赖单一组件指标来评估系统的性能。

Cool Papers

点此查看论文截图

Nonconvex Game and Multi Agent Reinforcement Learning for Zonal Ancillary Markets

Authors:Francesco Morri, Hélène Le Cadre, Pierre Gruet, Luce Brotcorne

We characterize zonal ancillary market coupling relying on noncooperative game theory. To that purpose, we formulate the ancillary market as a multi-leader single follower bilevel problem, that we subsequently cast as a generalized Nash game with side constraints and nonconvex feasibility sets. We determine conditions for equilibrium existence and show that the game has a generalized potential game structure. To compute market equilibrium, we rely on two exact approaches: an integrated optimization approach and Gauss-Seidel best-response, that we compare against multi-agent deep reinforcement learning. On real data from Germany and Austria, simulations indicate that multi-agent deep reinforcement learning achieves the smallest convergence rate but requires pretraining, while best-response is the slowest. On the economics side, multi-agent deep reinforcement learning results in smaller market costs compared to the exact methods, but at the cost of higher variability in the profit allocation among stakeholders. Further, stronger coupling between zones tends to reduce costs for larger zones.

我们采用非合作博弈理论来刻画区域辅助市场耦合。为此，我们将辅助市场构建为一个多领导单一跟随者的双层问题，随后将其转化为带有侧约束和非凸可行集的广义纳什博弈。我们确定了均衡存在的条件，并证明该博弈具有广义势能博弈结构。为了计算市场均衡，我们依赖两种精确方法：一种整体优化方法和Gauss-Seidel最佳反应法，我们将它们与多智能体深度强化学习进行对比。在来自德国和奥地利的真实数据进行的模拟表明，多智能体深度强化学习具有最小的收敛率，但需要预先训练，而最佳反应法最慢。在经济方面，与精确方法相比，多智能体深度强化学习导致较小的市场成本，但成本分配中的利润波动性较高。此外，区域之间的更强耦合往往降低了较大区域的成本。

论文及项目相关链接

PDF

Summary

本文利用非合作博弈理论探究了区域辅助市场耦合的特性。文章将辅助市场构建为多领导者单跟随者的双层问题，进一步转化为带有侧约束和非凸可行集的广义纳什博弈。文章确定了均衡存在的条件，并展示了该博弈具有广义势能博弈结构。在计算市场均衡时，文章采用了两种精确方法：一体化优化方法和高斯-赛德尔最佳反应法，并与多智能体深度强化学习进行了比较。在来自德国和奥地利的真实数据模拟中，多智能体深度强化学习实现了最小的收敛率，但需要预训练，而最佳反应法是最慢的。在经济层面，多智能体深度强化学习相较于精确方法能降低市场成本，但利润在各利益相关者间的分配波动性较高。此外，区域间更强的耦合趋势有助于降低大型区域的市场成本。

Key Takeaways

本文利用非合作博弈理论探究了区域辅助市场耦合特性。
文章构建了辅助市场的多领导者单跟随者双层问题模型，并进一步转化为广义纳什博弈。
文章确定了均衡存在的条件，并展示了该博弈具有广义势能博弈结构。
采用了两种精确计算方法：一体化优化方法和高斯-赛德尔最佳反应法。
多智能体深度强化学习在模拟中实现了最小的收敛率但需要预训练。
多智能体深度强化学习在市场成本方面表现优于精确方法，但利润分配波动性较高。

Cool Papers

点此查看论文截图

Towards Action Hijacking of Large Language Model-based Agent

Authors:Yuyang Zhang, Kangjie Chen, Jiaxin Gao, Ronghao Cui, Run Wang, Lina Wang, Tianwei Zhang

Recently, applications powered by Large Language Models (LLMs) have made significant strides in tackling complex tasks. By harnessing the advanced reasoning capabilities and extensive knowledge embedded in LLMs, these applications can generate detailed action plans that are subsequently executed by external tools. Furthermore, the integration of retrieval-augmented generation (RAG) enhances performance by incorporating up-to-date, domain-specific knowledge into the planning and execution processes. This approach has seen widespread adoption across various sectors, including healthcare, finance, and software development. Meanwhile, there are also growing concerns regarding the security of LLM-based applications. Researchers have disclosed various attacks, represented by jailbreak and prompt injection, to hijack the output actions of these applications. Existing attacks mainly focus on crafting semantically harmful prompts, and their validity could diminish when security filters are employed. In this paper, we introduce AI$\mathbf{^2}$, a novel attack to manipulate the action plans of LLM-based applications. Different from existing solutions, the innovation of AI$\mathbf{^2}$ lies in leveraging the knowledge from the application’s database to facilitate the construction of malicious but semantically-harmless prompts. To this end, it first collects action-aware knowledge from the victim application. Based on such knowledge, the attacker can generate misleading input, which can mislead the LLM to generate harmful action plans, while bypassing possible detection mechanisms easily. Our evaluations on three real-world applications demonstrate the effectiveness of AI$\mathbf{^2}$: it achieves an average attack success rate of 84.30% with the best of 99.70%. Besides, it gets an average bypass rate of 92.7% against common safety filters and 59.45% against dedicated defense.

最近，以大型语言模型（LLM）为动力来源的应用程序在处理复杂任务方面取得了重大进展。通过利用LLM中嵌入的先进推理能力和广泛知识，这些应用程序可以生成详细的行动计划，随后由外部工具执行。此外，检索增强生成（RAG）的集成通过将最新、特定领域的知识融入规划和执行过程中，提高了性能。这种方法已被广泛应用于各行各业，包括医疗、金融和软件开发。与此同时，关于基于LLM的应用程序的安全性也越发令人担忧。研究人员已经披露了各种攻击方法，以越狱和提示注入为代表，以劫持这些应用程序的输出动作。现有攻击主要集中在制作语义有害的提示上，当使用安全过滤器时，其有效性可能会降低。在本文中，我们介绍了AI$^2$，一种操纵基于LLM的应用程序行动计划的新型攻击方法。与现有解决方案不同，AI$^2$的创新之处在于利用应用程序数据库中的知识来促进恶意但语义无害的提示的构建。为此，它首先从受害者应用程序中收集动作感知知识。基于这些知识，攻击者可以生成误导性的输入，误导LLM生成有害的行动计划，同时轻易绕过可能的检测机制。我们在三个真实世界应用程序上的评估证明了AI$^2$的有效性：其平均攻击成功率达到84.30%，最高达99.70%。此外，它对常见安全过滤器的平均绕过率为92.7%，对专用防御的绕过率为59.45%。

论文及项目相关链接

PDF

摘要
基于大型语言模型（LLM）的应用程序利用先进的推理能力和嵌入的知识来解决复杂任务，通过生成详细行动计划并由外部工具执行来完成工作。结合检索增强生成（RAG）可提高性能和引入领域特定知识。然而，安全性成为LLM应用的关注点，研究人员已披露攻击方法，如越狱和提示注入，来劫持应用输出行动。现有攻击主要集中于制造语义有害提示，但其有效性在采用安全过滤器时可能会降低。本文介绍了一种新型攻击AI$\mathbf{^2}$，可操纵基于LLM的应用的行动计划。AI$\mathbf{^2}$的创新之处在于，它利用应用数据库中的知识来构建恶意但语义无害的提示。首先，它从受害应用中收集行为感知知识，基于此知识，攻击者可生成误导性输入，误导LLM生成有害行动计划，同时轻易绕过可能的检测机制。对三个真实应用的评估显示，AI$\mathbf{^2}$的平均攻击成功率达84.30%，最高达99.70%。同时，它对通用安全过滤器的平均绕过率为92.7%，对专用防御的绕过率为59.45%。

关键见解

大型语言模型（LLM）的应用通过生成详细行动计划并执行来解决复杂任务。
检索增强生成（RAG）的集成提高了性能和领域知识的引入。
LLM应用的安全性成为关注焦点，存在被攻击的风险。
现有攻击主要集中于制造语义有害提示，但面对安全过滤器时效果可能减弱。
AI$\mathbf{^2}$是一种新型攻击，能利用应用数据库中的知识构建恶意但语义无害的提示。
AI$\mathbf{^2}$通过收集行为感知知识来生成误导性输入，从而误导LLM并绕过检测机制。

Cool Papers

点此查看论文截图

The Fellowship of the LLMs: Multi-Agent Workflows for Synthetic Preference Optimization Dataset Generation

Authors:Samee Arif, Sualeha Farid, Abdul Hameed Azeemi, Awais Athar, Agha Ali Raza

This paper presents a novel methodology for generating synthetic Preference Optimization (PO) datasets using multi-model workflows. We evaluate the effectiveness and potential of these workflows in automating and enhancing the dataset generation process. PO dataset generation requires two modules: (1) $\textit{response evaluation}$, and (2) $\textit{response generation}$. In the $\textit{response evaluation}$ module, the responses from Large Language Models (LLMs) are evaluated and ranked - a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a 2 step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across all datasets. For the $\textit{response generation}$ module, we use the identified LLM evaluator configuration and compare different configurations of the LLM Feedback Loop. We use the win rate to determine the best multi-model configuration for generation. Experimenting with various configurations, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves a notable 71.8% and 73.8% win rate over single-model Llama and Gemma, respectively. After identifying the best configurations for both modules, we generate our PO datasets using the above pipeline.

本文提出了一种利用多模型工作流程生成合成偏好优化（PO）数据集的新方法。我们评估了这些工作流程在自动化和改进数据集生成过程中的有效性和潜力。PO数据集生成需要两个模块：（1）响应评估模块和（2）响应生成模块。在响应评估模块中，来自大型语言模型（LLM）的响应被评估和排名——这项任务通常是由人类注释者完成的，而我们使用LLM自动化完成。我们对响应评估模块进行了两步评估。在第一步中，我们使用三种不同的提示策略来评估LLM作为评估者的表现。在第二步中，我们应用获胜的提示策略来比较LLM作为法官、LLM作为陪审团和LLM辩论的性能。我们的评估表明，GPT-4o作为法官在所有数据集上表现更一致。在响应生成模块中，我们使用已确定的LLM评估器配置，并比较LLM反馈循环的不同配置。我们使用胜率来确定生成的最佳多模型配置。通过试验各种配置，我们发现以Llama作为生成器、Gemma作为评审员的LLM反馈循环，相对于单一的Llama和Gemma模型，分别取得了71.8%和73.8%的高胜率。在确定两个模块的最佳配置后，我们使用上述管道生成我们的PO数据集。

论文及项目相关链接

PDF

Summary

本文介绍了一种利用多模型工作流生成合成型偏好优化（PO）数据集的新方法。文章评估了这些工作流在自动化和改进数据集生成过程中的效果和潜力。文中介绍了PO数据集生成所需的两个模块：响应评估和响应生成。响应评估模块中，利用大型语言模型（LLM）自动评估排名响应，省去了人工标注环节。经过两步评估，发现GPT-4o-as-a-Judge在所有数据集上表现更稳定。在响应生成模块中，文章利用确定的LLM评估器配置，对比了不同的LLM反馈循环配置，以胜率确定最佳多模型配置。实验表明，以Llama为生成器、Gemma为评审者的LLM反馈循环获得了显著的71.8%和73.8%的胜率，超越了单一模型的Llama和Gemma。最终，使用上述流程生成PO数据集。

Key Takeaways