嘘~ 正在从服务器偷取页面 . . .

Agent


⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-01-25 更新

Scalable Safe Multi-Agent Reinforcement Learning for Multi-Agent System

Authors:Haikuo Du, Fandi Gou, Yunze Cai

Safety and scalability are two critical challenges faced by practical Multi-Agent Systems (MAS). However, existing Multi-Agent Reinforcement Learning (MARL) algorithms that rely solely on reward shaping are ineffective in ensuring safety, and their scalability is rather limited due to the fixed-size network output. To address these issues, we propose a novel framework, Scalable Safe MARL (SS-MARL), to enhance the safety and scalability of MARL methods. Leveraging the inherent graph structure of MAS, we design a multi-layer message passing network to aggregate local observations and communications of varying sizes. Furthermore, we develop a constrained joint policy optimization method in the setting of local observation to improve safety. Simulation experiments demonstrate that SS-MARL achieves a better trade-off between optimality and safety compared to baselines, and its scalability significantly outperforms the latest methods in scenarios with a large number of agents. The feasibility of our method is also verified by hardware implementation with Mecanum-wheeled vehicles.

对于实际的多智能体系统(MAS)而言,安全性和可扩展性是两大关键挑战。然而,现有的依赖于奖励塑造的多智能体强化学习(MARL)算法在保证安全性方面效果不佳,而且由于固定大小的网络输出,其可扩展性相对有限。为了解决这些问题,我们提出了一种新的框架,即可扩展的安全MARL(SS-MARL),以增强MARL方法的安全性和可扩展性。我们利用MAS的固有图形结构,设计了一个多层消息传递网络,以聚合不同大小的局部观察和通信。此外,我们在局部观察的背景下开发了一种约束联合策略优化方法,以提高安全性。仿真实验表明,与基线相比,SS-MARL在最优性和安全性之间取得了更好的平衡,其可扩展性在大量智能体的场景中显著优于最新方法。通过Mecanum轮式车辆进行硬件实现也验证了我们的方法的可行性。

论文及项目相关链接

PDF

Summary

现有基于奖励形状的多智能体强化学习算法在安全性和可扩展性方面存在挑战。为解决这些问题,我们提出了一个名为SS-MARL的新型框架,通过利用多智能体系统的固有图形结构,设计多层消息传递网络来聚合不同大小的地方性观察和通信,并在局部观察中开发约束联合策略优化方法以提高安全性。模拟实验显示,SS-MARL在最优性和安全性之间取得更好的平衡,且在多智能体场景下其扩展性显著优于最新方法。此外,通过Mecanum轮式车辆的硬件实现验证了该方法的可行性。

Key Takeaways

  1. 多智能体系统面临安全性和可扩展性的挑战。
  2. 现有基于奖励形状的多智能体强化学习算法无法确保安全性和有限的可扩展性。
  3. SS-MARL框架旨在提高多智能体强化学习方法的安全性和可扩展性。
  4. SS-MARL利用多智能体系统的固有图形结构设计多层消息传递网络。
  5. 通过局部观察开发约束联合策略优化方法以提高安全性。
  6. 模拟实验表明,SS-MARL在最优性和安全性之间取得平衡,并在多智能体场景下表现出卓越的扩展性。

Cool Papers

点此查看论文截图

WFCRL: A Multi-Agent Reinforcement Learning Benchmark for Wind Farm Control

Authors:Claire Bizon Monroc, Ana Bušić, Donatien Dubuc, Jiamin Zhu

The wind farm control problem is challenging, since conventional model-based control strategies require tractable models of complex aerodynamical interactions between the turbines and suffer from the curse of dimension when the number of turbines increases. Recently, model-free and multi-agent reinforcement learning approaches have been used to address this challenge. In this article, we introduce WFCRL (Wind Farm Control with Reinforcement Learning), the first open suite of multi-agent reinforcement learning environments for the wind farm control problem. WFCRL frames a cooperative Multi-Agent Reinforcement Learning (MARL) problem: each turbine is an agent and can learn to adjust its yaw, pitch or torque to maximize the common objective (e.g. the total power production of the farm). WFCRL also offers turbine load observations that will allow to optimize the farm performance while limiting turbine structural damages. Interfaces with two state-of-the-art farm simulators are implemented in WFCRL: a static simulator (FLORIS) and a dynamic simulator (FAST.Farm). For each simulator, $10$ wind layouts are provided, including $5$ real wind farms. Two state-of-the-art online MARL algorithms are implemented to illustrate the scaling challenges. As learning online on FAST.Farm is highly time-consuming, WFCRL offers the possibility of designing transfer learning strategies from FLORIS to FAST.Farm.

风力发电场控制问题具有挑战性,因为传统的基于模型的控制策略需要处理风力发电机之间复杂空气动力学互动的模型,并且随着风力发电机数量的增加,受到维度诅咒的困扰。最近,无模型多智能体强化学习方法已被用来应对这一挑战。在本文中,我们介绍了WFCRL(基于强化学习的风力发电场控制),这是针对风力发电场控制问题的第一个开放的多智能体强化学习环境套件。WFCRL构建了一个合作式多智能体强化学习(MARL)问题:每个风力发电机都是一个智能体,可以学习调整其偏航角、俯仰角或扭矩,以最大化共同目标(例如,农场总发电量)。WFCRL还提供风力发电机负载观测,允许优化农场性能的同时,尽量减少风力发电机结构损坏。WFCRL实现了与两种最新农场模拟器的接口:静态模拟器(FLORIS)和动态模拟器(FAST.Farm)。对于每个模拟器,提供了10种风力布局,包括5个真实的风力农场。我们实现了两种最新的在线MARL算法,以说明规模挑战。由于在FAST.Farm上进行在线学习非常耗时,WFCRL提供了从FLORIS到FAST.Farm的设计迁移学习策略的可行性。

论文及项目相关链接

PDF

Summary:基于强化学习模型的风电场控制面临挑战,特别是模型复杂的动力学交互和维度诅咒问题。最近,无模型和多智能体强化学习的方法被用来解决这一问题。本文介绍了WFCRL(基于强化学习的风电场控制),这是首个用于风电场控制问题的多智能体强化学习环境套件。WFCRL提供了一个合作式多智能体强化学习问题框架,每个风力发电机都是一个智能体,可以学习调整其偏航角、俯仰角或扭矩以最大化共同目标(如风电场的总功率输出)。WFCRL还提供风力发电机负载观测,允许优化风电场性能并减少风力发电机结构损坏。它实现了与两个最先进的农场模拟器(FLORIS静态模拟器和FAST.Farm动态模拟器)的接口,并提供两种在线多智能体强化学习算法来解决规模挑战问题。由于FAST.Farm上的在线学习非常耗时,WFCRL提供了从FLORIS到FAST.Farm的迁移学习策略的可行性。

Key Takeaways:

  1. 风电场控制问题具有挑战性,传统模型控制策略需要处理复杂的动力学交互并面临维度诅咒问题。
  2. 无模型和多智能体强化学习被用于解决风电场控制问题。
  3. WFCRL是首个用于风电场控制的多智能体强化学习环境套件。
  4. 每个风力发电机在WFCRL中被视为一个智能体,通过调整偏航角、俯仰角或扭矩来最大化总功率输出。
  5. WFCRL提供风力发电机负载观测以优化性能并减少结构损坏。
  6. WFCRL与FLORIS静态模拟器和FAST.Farm动态模拟器接口实现,为不同的模拟环境提供策略转移能力。

Cool Papers

点此查看论文截图

Knowledge-Informed Multi-Agent Trajectory Prediction at Signalized Intersections for Infrastructure-to-Everything

Authors:Huilin Yin, Yangwenhui Xu, Jiaxiang Li, Hao Zhang, Gerhard Rigoll

Multi-agent trajectory prediction at signalized intersections is crucial for developing efficient intelligent transportation systems and safe autonomous driving systems. Due to the complexity of intersection scenarios and the limitations of single-vehicle perception, the performance of vehicle-centric prediction methods has reached a plateau. Furthermore, most works underutilize critical intersection information, including traffic signals, and behavior patterns induced by road structures. Therefore, we propose a multi-agent trajectory prediction framework at signalized intersections dedicated to Infrastructure-to-Everything (I2XTraj). Our framework leverages dynamic graph attention to integrate knowledge from traffic signals and driving behaviors. A continuous signal-informed mechanism is proposed to adaptively process real-time traffic signals from infrastructure devices. Additionally, leveraging the prior knowledge of the intersection topology, we propose a driving strategy awareness mechanism to model the joint distribution of goal intentions and maneuvers. To the best of our knowledge, I2XTraj represents the first multi-agent trajectory prediction framework explicitly designed for infrastructure deployment, supplying subscribable prediction services to all vehicles at intersections. I2XTraj demonstrates state-of-the-art performance on both the Vehicle-to-Infrastructure dataset V2X-Seq and the aerial-view dataset SinD for signalized intersections. Quantitative evaluations show that our approach outperforms existing methods by more than 30% in both multi-agent and single-agent scenarios.

在信号交叉口进行多智能体轨迹预测对于开发高效智能交通系统和安全自动驾驶系统至关重要。由于交叉口场景的复杂性和单一车辆感知的局限性,以车辆为中心预测方法的性能已经达到了瓶颈。此外,大多数工作未能充分利用关键交叉路口信息,包括交通信号和由道路结构引起的行为模式。因此,我们提出了一个专门用于基础设施到一切(I2XTraj)信号交叉口的多智能体轨迹预测框架。我们的框架利用动态图注意力来整合来自交通信号和驾驶行为的知识。提出了一种连续信号感知机制,自适应处理来自基础设施设备的实时交通信号。此外,利用交叉路口拓扑的先验知识,我们提出了一个驾驶策略意识机制,以模拟目标意图和动作的联合分布。据我们所知,I2XTraj是第一个专门为基础设施部署设计的多智能体轨迹预测框架,为交叉口的所有车辆提供可订阅的预测服务。I2XTraj在车辆到基础设施数据集V2X-Seq和信号交叉口空中视图数据集SinD上均表现出卓越的性能。定量评估表明,我们的方法在多智能体和单智能体场景下均优于现有方法30%以上。

论文及项目相关链接

PDF

Summary

本文提出一种针对智能交通系统和自动驾驶系统的多智能体轨迹预测框架I2XTraj。针对有信号交叉口,利用动态图注意力技术整合交通信号与驾驶行为知识。此外,I2XTraj设计适应实时交通信号的连续信号感应机制与模拟交叉路口目标的意图与行为的联合分布的驾驶策略感知机制。I2XTraj为基础设施部署设计,提供订阅预测服务给交叉路口的所有车辆,且在V2X-Seq和SinD数据集上表现卓越,相较于现有方法在多智能体和单智能体场景中的性能提升超过30%。

Key Takeaways

  1. 多智能体轨迹预测在智能交通系统和自动驾驶系统中至关重要。
  2. 当前车辆为中心的预测方法在有信号交叉路口表现受限,因场景复杂性和单车辆感知局限性。
  3. 大多数研究未充分利用关键交叉路口信息,如交通信号和道路结构引发的行为模式。
  4. 提出的多智能体轨迹预测框架I2XTraj整合交通信号与驾驶行为知识,采用动态图注意力技术。
  5. I2XTraj包含适应实时交通信号的连续信号感应机制。
  6. 利用交叉路口拓扑的先验知识,提出驾驶策略感知机制以模拟目标意图与行为的联合分布。

Cool Papers

点此查看论文截图

VulnBot: Autonomous Penetration Testing for A Multi-Agent Collaborative Framework

Authors:He Kong, Die Hu, Jingguo Ge, Liangxiong Li, Tong Li, Bingzhen Wu

Penetration testing is a vital practice for identifying and mitigating vulnerabilities in cybersecurity systems, but its manual execution is labor-intensive and time-consuming. Existing large language model (LLM)-assisted or automated penetration testing approaches often suffer from inefficiencies, such as a lack of contextual understanding and excessive, unstructured data generation. This paper presents VulnBot, an automated penetration testing framework that leverages LLMs to simulate the collaborative workflow of human penetration testing teams through a multi-agent system. To address the inefficiencies and reliance on manual intervention in traditional penetration testing methods, VulnBot decomposes complex tasks into three specialized phases: reconnaissance, scanning, and exploitation. These phases are guided by a penetration task graph (PTG) to ensure logical task execution. Key design features include role specialization, penetration path planning, inter-agent communication, and generative penetration behavior. Experimental results demonstrate that VulnBot outperforms baseline models such as GPT-4 and Llama3 in automated penetration testing tasks, particularly showcasing its potential in fully autonomous testing on real-world machines.

渗透测试是识别和解决网络安全系统漏洞的关键实践,但其手动执行方式劳动强度大且耗时。现有的大型语言模型(LLM)辅助或自动化渗透测试方法常常存在效率低下的问题,如缺乏上下文理解和过多的非结构化数据生成。本文介绍了VulnBot,这是一个利用大型语言模型自动化渗透测试框架,它通过多智能体系统模拟人工渗透测试团队的协作工作流程。为了解决传统渗透测试方法中的效率低下和依赖人工干预的问题,VulnBot将复杂任务分解为三个专业阶段:侦察、扫描和利用。这些阶段由渗透任务图(PTG)引导,以确保逻辑任务执行。关键设计特点包括角色专业化、渗透路径规划、智能体间通信和生成渗透行为。实验结果表明,VulnBot在自动化渗透测试任务上的表现优于GPT-4和Llama3等基线模型,特别是在真实机器上的全自动测试方面显示出其潜力。

论文及项目相关链接

PDF

Summary
本文介绍了VulnBot,一个利用大型语言模型(LLM)自动化执行网络安全渗透测试的框架。通过多智能体系统模拟人工渗透测试团队的协作流程,解决了传统渗透测试方法效率低下和依赖人工干预的问题。VulnBot将复杂的渗透测试任务分解为侦察、扫描和利用三个阶段,并通过渗透任务图(PTG)指导任务的逻辑执行。关键设计特点包括角色专业化、渗透路径规划、智能体间通信和生成式渗透行为。实验结果表明,VulnBot在自动化渗透测试任务中优于GPT-4和Llama等基线模型,尤其是在真实机器上的全自主测试中表现出潜力。

Key Takeaways

  1. VulnBot是一个利用大型语言模型(LLM)的自动化渗透测试框架,能够模拟人工渗透测试团队的协作流程。
  2. VulnBot通过多智能体系统解决了传统渗透测试方法效率低下和依赖人工干预的问题。
  3. VulnBot将复杂的渗透测试任务分解为侦察、扫描和利用三个阶段,确保逻辑任务执行。
  4. 渗透任务图(PTG)是VulnBot的关键设计之一,指导任务执行。
  5. VulnBot具有角色专业化、渗透路径规划、智能体间通信和生成式渗透行为等关键设计特点。
  6. 实验结果表明,VulnBot在自动化渗透测试任务中优于基线模型,如GPT-4和Llama等。

Cool Papers

点此查看论文截图

AgentRec: Agent Recommendation Using Sentence Embeddings Aligned to Human Feedback

Authors:Joshua Park, Yongfeng Zhang

Multi-agent systems must decide which agent is the most appropriate for a given task. We propose a novel architecture for recommending which LLM agent out of many should perform a task given a natural language prompt by extending the Sentence-BERT (SBERT) encoder model. On test data, we are able to achieve a top-1 accuracy of 92.2% with each classification taking less than 300 milliseconds. In contrast to traditional classification methods, our architecture is computationally cheap, adaptive to new classes, interpretable, and controllable with arbitrary metrics through reinforcement learning. By encoding natural language prompts into sentence embeddings, our model captures the semantic content relevant to recommending an agent. The distance between sentence embeddings that belong to the same agent is then minimized through fine-tuning and aligned to human values through reinforcement learning from human feedback. This allows the classification of natural language prompts based on their nearest neighbors by measuring the cosine similarity between embeddings. This work is made possible through the generation of a synthetic dataset for agent recommendation, which we have open-sourced to the public along with the code for AgentRec recommendation system at https://github.com/joshprk/agentrec.

多智能体系统必须决定哪个智能体最适合执行给定任务。我们提出了一种新型架构,通过扩展Sentence-BERT(SBERT)编码器模型,推荐在众多LLM智能体中哪个应该根据自然语言提示执行任务。在测试数据上,我们达到了92.2%的top-1准确率,每个分类的时间少于300毫秒。与传统的分类方法相比,我们的架构计算成本低,适应新类别,具有可解释性,并且可以通过强化学习使用任意指标进行控制。我们的模型通过将自然语言提示编码为句子嵌入来捕获与推荐智能体相关的语义内容。属于同一智能体的句子嵌入之间的距离通过微调来最小化,并通过强化学习与人类反馈对齐以符合人类价值观。这允许根据他们的最近邻居来基于嵌入之间的余弦相似性对自然语言提示进行分类。这项工作是通过生成用于智能体推荐的人工数据集而实现的,我们已经将其与AgentRec推荐系统的代码一起公开发布在https://github.com/joshprk/agentrec。

论文及项目相关链接

PDF 10 pages, 8 figures, preprint

Summary

本文提出一种新型架构,利用Sentence-BERT(SBERT)编码器模型,根据自然语言提示推荐最适合执行任务的LLM代理。该架构在计算成本、对新类别的适应性、解释性和可控性方面均优于传统分类方法。通过编码自然语言提示为句子嵌入,该模型能够捕捉与推荐代理相关的语义内容。通过微调最小化同一代理的句子嵌入之间的距离,并通过强化学习与人类反馈对齐,使分类基于最接近的邻居进行,通过测量嵌入之间的余弦相似性来实现。

Key Takeaways

  1. 提出一种基于Sentence-BERT编码器的新型多代理推荐系统架构。
  2. 该系统能够根据自然语言提示推荐最适合的LLM代理执行任务。
  3. 系统实现高准确率(92.2%),且分类时间短(<300毫秒)。
  4. 架构相较于传统方法更节省计算成本、适应新类别、具备解释性和强化学习的可控性。
  5. 通过编码自然语言提示为句子嵌入,模型能捕捉与推荐代理相关的语义内容。
  6. 通过微调最小化同一代理的句子嵌入距离,并强化学习与人类反馈对齐。
  7. 分类基于测量嵌入之间的余弦相似性,实现基于最近邻的推荐。

Cool Papers

点此查看论文截图

SRMT: Shared Memory for Multi-agent Lifelong Pathfinding

Authors:Alsu Sagirova, Yuri Kuratov, Mikhail Burtsev

Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the principal challenges in MARL is the need for explicit prediction of the agents’ behavior to achieve cooperation. To resolve this issue, we propose the Shared Recurrent Memory Transformer (SRMT) which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to exchange information implicitly and coordinate their actions. We evaluate SRMT on the Partially Observable Multi-Agent Pathfinding problem in a toy Bottleneck navigation task that requires agents to pass through a narrow corridor and on a POGEMA benchmark set of tasks. In the Bottleneck task, SRMT consistently outperforms a variety of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and MovingAI, SRMT is competitive with recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared recurrent memory into the transformer-based architectures can enhance coordination in decentralized multi-agent systems. The source code for training and evaluation is available on GitHub: https://github.com/Aloriosa/srmt.

多智能体强化学习(MARL)在解决各种环境中的合作和竞争多智能体问题上取得了显著进展。多智能体强化学习的主要挑战之一是需要预测智能体的行为以实现合作。为解决此问题,我们提出了共享循环记忆转换器(SRMT),它通过池化和全局广播个体工作记忆,将记忆转换器扩展到多智能体环境,使智能体能够隐式地交换信息并协调其行动。我们在玩具瓶颈导航任务中的部分可观察多智能体路径查找问题上评估了SRMT,该任务要求智能体通过狭窄走廊,并在POGEMA基准任务集上进行了评估。在瓶颈任务中,SRMT始终优于各种强化学习基线,特别是在稀疏奖励下,并且可以有效地推广到比训练过程中所见更长的走廊。在包括迷宫、随机和MovingAI的POGEMA地图上,SRMT与最新的MARL、混合和基于规划算法具有竞争力。这些结果表明,在基于转换器的架构中融入共享循环记忆可以增强分布式多智能体系统中的协调性。有关训练和评估的源代码可在GitHub上找到:https://github.com/Aloriosa/srmt。

论文及项目相关链接

PDF 16 pages, 11 figures

Summary

基于多代理强化学习(MARL)在多环境中解决合作与竞争的多代理问题的显著进展,文章提出了共享递归记忆转换器(SRMT),通过将个体的工作记忆进行池化和全局广播,将记忆转换器扩展到多代理设置,使代理能够隐式地交换信息和协调他们的行动。在狭窄走廊的玩具任务以及POGEMA基准任务集上评估SRMT的表现,结果表明SRMT在稀疏奖励下表现优于多种强化学习基线,并能有效地推广到训练时未见的长走廊场景。在迷宫、随机和移动AI等POGEMA地图上,SRMT与最新的MARL、混合和基于规划算法具有竞争力。这表明在基于转换器的架构中融入共享递归记忆可以增强分散式多代理系统的协调性。

Key Takeaways

  1. 多代理强化学习(MARL)在解决多种环境下的合作和竞争多代理问题上取得了显著进展。
  2. 共享递归记忆转换器(SRMT)通过池化和全局广播个体工作记忆,解决了多代理系统中代理行为的预测问题。
  3. SRMT在玩具任务中的狭窄走廊以及POGEMA基准任务集上进行了评估。
  4. 在稀疏奖励下,SRMT表现优于多种强化学习基线。
  5. SRMT能够推广到训练时未见的长走廊场景,显示出良好的泛化能力。
  6. 在多种POGEMA地图上,SRMT与当前的MARL、混合和规划算法相比具有竞争力。

Cool Papers

点此查看论文截图

Map Prediction and Generative Entropy for Multi-Agent Exploration

Authors:Alexander Spinos, Bradley Woosley, Justin Rokisky, Christopher Korpela, John G. Rogers III, Brian A. Bittner

Traditionally, autonomous reconnaissance applications have acted on explicit sets of historical observations. Aided by recent breakthroughs in generative technologies, this work enables robot teams to act beyond what is currently known about the environment by inferring a distribution of reasonable interpretations of the scene. We developed a map predictor that inpaints the unknown space in a multi-agent 2D occupancy map during an exploration mission. From a comparison of several inpainting methods, we found that a fine-tuned latent diffusion inpainting model could provide rich and coherent interpretations of simulated urban environments with relatively little computation time. By iteratively inferring interpretations of the scene throughout an exploration run, we are able to identify areas that exhibit high uncertainty in the prediction, which we formalize with the concept of generative entropy. We prioritize tasks in regions of high generative entropy, hypothesizing that this will expedite convergence on an accurate predicted map of the scene. In our study we juxtapose this new paradigm of task ranking with the state of the art, which ranks regions to explore by those which maximize expected information recovery. We compare both of these methods in a simulated urban environment with three vehicles. Our results demonstrate that by using our new task ranking method, we can predict a correct scene significantly faster than with a traditional information-guided method.

传统上,自主侦察应用程序依赖于历史观测的明确集合。借助生成技术的最新突破,这项工作使机器人团队能够通过推断场景合理解释的分布来超越目前对环境的了解。我们开发了一种地图预测器,在探索任务期间在多智能体二维占用地图上填充未知空间。从几种填充方法的比较中,我们发现经过微调后的潜在扩散填充模型可以在相对较少的计算时间内提供丰富且连贯的模拟城市环境解释。通过在整个探索过程中迭代推断场景的解释,我们能够确定预测中表现出高不确定性的区域,我们用生成熵的概念来形式化表示。我们优先处理高生成熵区域的任务,假设这将加快对准确预测地图的收敛。在研究中,我们将这种新的任务排序范式与传统的范式进行对比,后者根据可以最大化预期信息恢复的地区来排列需要探索的区域。我们在一个模拟城市环境中用三辆车对这两种方法进行了比较。结果表明,通过使用我们的新任务排序方法,我们可以比传统信息引导方法更快地预测正确的场景。

论文及项目相关链接

PDF

Summary

该文章探讨了自主侦察应用如何借助生成技术的突破,通过对场景进行合理解释的分布推断,来超越对环境的现有认知。文章开发了一种地图预测器,可在多智能体二维占用地图的探索任务中填充未知空间。通过比较几种填充方法,发现经过微调潜扩散填充模型能够以较少的计算时间提供丰富且连贯的模拟城市环境解释。通过在整个探索过程中迭代推断场景的解读,我们能够确定预测中表现出高度不确定性的区域,并用生成熵的概念来形式化表达。我们优先处理高生成熵的区域,假设这将加速对场景准确预测的收敛。研究对比了传统信息指导下的区域探索排名和新任务排名范式。在模拟城市环境中用三辆车进行的比较结果显示,使用新任务排名方法能更快预测正确的场景。

Key Takeaways

  1. 生成技术的突破使机器人团队能够超越现有认知,通过对场景进行合理解释的分布推断来行动。
  2. 开发了一种地图预测器,用于在探索任务中填充多智能体二维占用地图的未知空间。
  3. 通过比较多种填充方法,发现潜扩散填充模型在模拟城市环境中表现良好。
  4. 通过迭代推断场景的解读,能够识别预测中的不确定性区域并用生成熵来形式化表达。
  5. 优先处理高生成熵的区域能加速对场景准确预测的收敛。
  6. 对比了传统信息指导下的区域探索排名与新任务排名范式。

Cool Papers

点此查看论文截图

MuMA-ToM: Multi-modal Multi-Agent Theory of Mind

Authors:Haojun Shi, Suyu Ye, Xinyu Fang, Chuanyang Jin, Leyla Isik, Yen-Ling Kuo, Tianmin Shu

Understanding people’s social interactions in complex real-world scenarios often relies on intricate mental reasoning. To truly understand how and why people interact with one another, we must infer the underlying mental states that give rise to the social interactions, i.e., Theory of Mind reasoning in multi-agent interactions. Additionally, social interactions are often multi-modal – we can watch people’s actions, hear their conversations, and/or read about their past behaviors. For AI systems to successfully and safely interact with people in real-world environments, they also need to understand people’s mental states as well as their inferences about each other’s mental states based on multi-modal information about their interactions. For this, we introduce MuMA-ToM, a Multi-modal Multi-Agent Theory of Mind benchmark. MuMA-ToM is the first multi-modal Theory of Mind benchmark that evaluates mental reasoning in embodied multi-agent interactions. In MuMA-ToM, we provide video and text descriptions of people’s multi-modal behavior in realistic household environments. Based on the context, we then ask questions about people’s goals, beliefs, and beliefs about others’ goals. We validated MuMA-ToM in a human experiment and provided a human baseline. We also proposed a novel multi-modal, multi-agent ToM model, LIMP (Language model-based Inverse Multi-agent Planning). Our experimental results show that LIMP significantly outperforms state-of-the-art methods, including large multi-modal models (e.g., GPT-4o, Gemini-1.5 Pro) and a recent multi-modal ToM model, BIP-ALM.

理解人们在复杂的真实世界场景中的社会交往通常依赖于精细的心理推理。要真正了解人们如何以及为何相互交往,我们必须推断出导致这些社会交往的基本心理状态,即多智能体互动中的心智理论推理。此外,社会交往通常是多模式的——我们可以观察人们的行动,听取他们的对话,或者了解他们过去的行为。要让AI系统在真实世界环境中成功且安全地与人类互动,它们还需要理解人们的心理状态以及基于他们互动的多模式信息对彼此心理状态的推断。为此,我们引入了MuMA-ToM,这是一个多模式多智能体心智理论基准测试。MuMA-ToM是第一个评估实体多智能体互动中心理推理的多模式心智理论基准测试。在MuMA-ToM中,我们提供了现实家庭环境中人们的多模式行为的视频和文本描述。基于上下文,我们会询问有关人们的目标、信念以及对他人的目标的信念的问题。我们在人类实验中对MuMA-ToM进行了验证,并提供了人类基准线。我们还提出了一个新的多模式多智能体心智理论模型LIMP(基于语言模型的逆向多智能体规划)。我们的实验结果表明,LIMP显著优于最新方法,包括大型多模式模型(如GPT-4o、Gemini-1.5 Pro)和最新的多模式心智理论模型BIP-ALM。

论文及项目相关链接

PDF AAAI-25 (Oral). Project website: https://scai.cs.jhu.edu/projects/MuMA-ToM/ Code: https://github.com/SCAI-JHU/MuMA-ToM

Summary

理解人们在复杂现实场景中的社交互动通常依赖于精细的心理推理。要深入了解人们如何以及为何相互互动,我们必须推断出导致社交互动的内在心理状态,即多智能体互动中的心智理论推理。此外,社交互动通常是多模态的——我们可以观察人们的行动、聆听他们的对话,以及阅读他们过去的行为。对于AI系统在现实环境中成功且安全地与人类互动,它们也需要理解人们的心理状态以及基于多模态信息对彼此心理状态的推断。为此,我们引入了MuMA-ToM,一个多模态多智能体心智理论基准测试。MuMA-ToM是第一个多模态心智理论基准测试,它评估了具身多智能体互动中的心理推理。在MuMA-ToM中,我们提供了人们多模态行为在真实家庭环境中的视频和文字描述。基于上下文,然后询问关于人们的目标、信念以及关于他人目标的信念问题。我们在人类实验中对MuMA-ToM进行了验证,并提供了人类基准线。我们还提出了一种新型的多模态多智能体心智理论模型LIMP(基于语言模型的逆向多智能体规划)。我们的实验结果表明,LIMP显著优于最新方法,包括大型多模态模型(如GPT-4o、Gemini-1.5 Pro)和最近的多模态心智理论模型BIP-ALM。

Key Takeaways

  1. 理解现实世界中人们的社交互动需要精细的心理推理能力。
  2. 多模态信息(如视频、文本和对话)对于理解人们的社交互动至关重要。
  3. MuMA-ToM是多模态多智能体交互中的首个心智理论基准测试。
  4. MuMA-ToM提供了真实家庭环境中人们的多模态行为描述。
  5. 基于上下文的问题关于人们的目标、信念和关于他人目标的信念是评估的关键。
  6. 在人类实验中验证了MuMA-ToM的有效性,并提供了人类表现基准。

Cool Papers

点此查看论文截图

PhishAgent: A Robust Multimodal Agent for Phishing Webpage Detection

Authors:Tri Cao, Chengyu Huang, Yuexin Li, Huilin Wang, Amy He, Nay Oo, Bryan Hooi

Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, each with varying levels of accuracy, but they also face notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to extract the relevant top k items from offline knowledge bases, using available information from a webpage, including logos and HTML. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.

网络钓鱼攻击是对网络安全的一大威胁,利用用户漏洞窃取敏感信息。已经开发了许多方法来对抗网络钓鱼攻击,虽然各有不同程度的准确性,但它们也面临着明显的局限性。在这项研究中,我们引入了PhishAgent,这是一个多模式代理,结合了多种工具,融合了在线和离线知识库与多模式大型语言模型(MLLMs)。这种结合带来了更广泛的品牌覆盖,增强了品牌识别和回忆。此外,我们提出了一个多模式信息检索框架,旨在从离线知识库中提取与网页上的可用信息相关的前k个条目,包括标志和HTML。基于三个真实世界数据集的经验结果表明,所提出的框架显著提高了检测精度,并减少了误报和漏报,同时保持了模型效率。此外,PhishAgent对多种对抗性攻击表现出了强大的防御能力。

论文及项目相关链接

PDF Accepted at AAAI 2025 (Oral)

Summary

本研究介绍了PhishAgent,一种结合多种工具的多模态代理,通过结合在线和离线知识库与多模态大型语言模型(MLLMs),增强品牌识别和回忆。此外,提出了多模态信息检索框架,从离线知识库中提取与网页信息相关的前k项内容,包括标志和HTML。实验结果表明,该框架显著提高检测精度,减少误报和漏报,同时保持模型效率,并对各种对抗性攻击表现出强大韧性。

Key Takeaways

  1. Phishing攻击是对网络安全的主要威胁,利用各种方法应对但存在局限性。
  2. PhishAgent是一种多模态代理,结合多种工具、在线和离线知识库以及多模态大型语言模型以增强品牌识别和记忆。
  3. 引入多模态信息检索框架,从离线知识库中提取与网页信息相关的关键内容。
  4. 通过三个真实数据集进行的实验表明,该框架显著提高检测精度并减少误报和漏报。
  5. PhishAgent具有强大的对抗各种类型对抗性攻击的韧性。
  6. PhishAgent通过结合多种技术和方法,提供更全面的品牌覆盖。

Cool Papers

点此查看论文截图

Visual Agents as Fast and Slow Thinkers

Authors:Guangyan Sun, Mingyu Jin, Zhenting Wang, Cheng-Long Wang, Siqi Ma, Qifan Wang, Tong Geng, Ying Nian Wu, Yongfeng Zhang, Dongfang Liu

Achieving human-level intelligence requires refining cognitive distinctions between System 1 and System 2 thinking. While contemporary AI, driven by large language models, demonstrates human-like traits, it falls short of genuine cognition. Transitioning from structured benchmarks to real-world scenarios presents challenges for visual agents, often leading to inaccurate and overly confident responses. To address the challenge, we introduce FaST, which incorporates the Fast and Slow Thinking mechanism into visual agents. FaST employs a switch adapter to dynamically select between System 1/2 modes, tailoring the problem-solving approach to different task complexity. It tackles uncertain and unseen objects by adjusting model confidence and integrating new contextual data. With this novel design, we advocate a flexible system, hierarchical reasoning capabilities, and a transparent decision-making pipeline, all of which contribute to its ability to emulate human-like cognitive processes in visual intelligence. Empirical results demonstrate that FaST outperforms various well-known baselines, achieving 80.8% accuracy over VQA^{v2} for visual question answering and 48.7% GIoU score over ReasonSeg for reasoning segmentation, demonstrate FaST’s superior performance. Extensive testing validates the efficacy and robustness of FaST’s core components, showcasing its potential to advance the development of cognitive visual agents in AI systems. The code is available at ttps://github.com/GuangyanS/Sys2-LLaVA.

实现人类水平的智能需要提炼系统一和系统二思维之间的认知差异。虽然当代人工智能表现出人类特质的能力,但这仍不足以实现真正的认知。从结构化基准测试过渡到现实世界场景对视觉智能体而言是一大挑战,这常常导致响应不准确且过于自信。为了应对这一挑战,我们引入了FaST技术,它将快速和慢速思考机制融入视觉智能体。FaST采用开关适配器来动态选择系统一/二模式,根据任务复杂度定制问题解决策略。它通过调整模型信心并整合新的上下文数据来应对不确定和未见过的物体。凭借这一新颖设计,我们倡导一个灵活的系统、分层推理能力和透明的决策制定流程,所有这些都为其在视觉智能中模拟人类认知过程的能力做出贡献。实证结果表明,FaST在各种知名基准测试中表现优异,在视觉问答的VQA^v2测试中准确率达到了80.8%,在推理分割的ReasonSeg测试中GIoU得分达到了48.7%,证明了FaST的卓越性能。广泛的测试验证了FaST核心组件的有效性和稳健性,展示了其在推动人工智能系统中认知视觉智能体发展的潜力。代码可在[https://github.com/GuangyanS/Sys2-LLaVA获取。](https://github.com/GuangyanS/Sys2-LLaVA%E8%8E%B7%E5 取向。)

论文及项目相关链接

PDF International Conference on Learning Representations (ICLR 2025)

Summary

实现人类水平的智能需要精进系统1和系统2思维的认知区分。当代的大型语言模型驱动的AI表现出人类特征,但缺乏真正的认知能力。从结构化基准测试过渡到现实世界场景对视觉智能体构成挑战,导致不准确和过度自信的回应。为解决这一问题,我们引入FaST,它结合了快速和慢速思考机制,并采用开关适配器动态选择系统1/2模式,针对任务复杂度调整问题解决方式。它处理不确定和未见物体通过调整模型信心并整合新的情境数据。FaST倡导灵活系统、层次化推理能力和透明决策流程,能够模拟人类认知过程。经验结果表明,FaST在各种标准上超越现有技术,展示其在视觉问答和推理分割任务上的优越性。

Key Takeaways

  1. 实现人类水平智能需区分系统1和系统2思维的认知差异。
  2. 当代AI虽表现出人类特质,但缺乏真正的认知能力。
  3. 视觉智能体在过渡到现实世界场景时面临挑战,易产生不准确和过度自信的回应。
  4. FaST通过结合快速和慢速思考机制解决此问题,采用开关适配器动态选择系统1/2模式。
  5. FaST能处理不确定和未见物体,通过调整模型信心并整合新情境数据。
  6. FaST设计倡导灵活性、层次化推理和透明决策流程,模拟人类认知过程。

Cool Papers

点此查看论文截图

ShortcutsBench: A Large-Scale Real-world Benchmark for API-based Agents

Authors:Haiyang Shen, Yue Li, Desong Meng, Dongqi Cai, Sheng Qi, Li Zhang, Mengwei Xu, Yun Ma

Recent advancements in integrating large language models (LLMs) with application programming interfaces (APIs) have gained significant interest in both academia and industry. Recent work demonstrates that these API-based agents exhibit relatively strong autonomy and planning capabilities. However, their ability to handle multi-dimensional difficulty levels, diverse task types, and real-world demands remains unknown. In this paper, we introduce \textsc{ShortcutsBench}, a large-scale benchmark for the comprehensive evaluation of API-based agents in solving real-world complex tasks. \textsc{ShortcutsBench} includes a wealth of real APIs from Apple Inc., refined user queries, human-annotated high-quality action sequences, detailed parameter filling values, and parameters requesting necessary input from the system or user. We revealed how existing benchmarks/datasets struggle to accommodate the advanced reasoning capabilities of existing more intelligent LLMs. Moreover, our extensive evaluation of agents built with $5$ leading open-source (size $\geq$ 57B) and $5$ closed-source LLMs (e.g. Gemini-1.5-Pro and GPT-4o-mini) with varying intelligence level reveals significant limitations of existing API-based agents in the whole process of handling complex queries related to API selection, parameter filling, and requesting necessary input from the system and the user. These findings highlight the great challenges that API-based agents face in effectively fulfilling real and complex user queries. All datasets, code, experimental logs, and results are available at \url{https://github.com/EachSheep/ShortcutsBench}.

近期,将大型语言模型(LLMs)与应用编程接口(APIs)相结合的研究进展在学术界和工业界都引起了极大的兴趣。最新研究表明,这些基于API的代理展现出相对较强的自主性和规划能力。然而,它们在处理多维度难度、多样化任务类型和现实需求方面的能力仍然未知。在本文中,我们介绍了\textbf{ShortcutsBench},这是一个大规模基准测试,用于全面评估基于API的代理解决现实世界复杂任务的能力。\textbf{ShortcutsBench}包含了来自苹果公司的丰富真实API、精炼的用户查询、人类注释的高质量动作序列、详细的参数填充值和请求系统或用户必要输入的参数。我们揭示了现有基准测试集/数据集如何难以适应现有更智能LLMs的先进推理能力。此外,我们使用五个领先的开源(规模≥57B)和五个闭源LLMs(例如Gemini-1.5-Pro和GPT-4o-mini)进行了广泛的代理评估,这些代理的智能水平各不相同。评估结果显示,现有基于API的代理在处理与API选择、参数填充以及请求系统和用户必要输入相关的复杂查询的整个过程中存在显著局限性。这些发现凸显了基于API的代理在有效完成真实且复杂的用户查询方面所面临的巨大挑战。所有数据集、代码、实验记录和结果均可在\url{https://github.com/EachSheep/ShortcutsBench}找到。

论文及项目相关链接

PDF ICLR’25: https://openreview.net/forum?id=kKILfPkhSz

Summary

本文介绍了整合大型语言模型(LLMs)与应用编程接口(APIs)的最新进展,并指出API基类代理在解决现实世界复杂任务方面的挑战。为解决这一问题,文章提出了一个大规模基准测试平台——ShortcutsBench,该平台包含丰富的真实API、精炼的用户查询、人类注释的高质量动作序列等。文章还揭示了现有基准测试在容纳更智能LLMs的先进推理能力方面的不足,并通过评估不同智能水平的LLMs在处理复杂查询方面的表现,强调了API基类代理面临的挑战。

Key Takeaways

  1. API基类代理在自主性和规划能力方面表现强劲,但在处理多维难度、多样任务类型和真实世界需求方面存在未知挑战。
  2. ShortcutsBench是一个用于全面评估API基类代理解决现实世界复杂任务的大型基准测试平台。
  3. ShortcutsBench包含真实API、精炼用户查询、人类注释的高质量动作序列等丰富内容。
  4. 现有基准测试/数据集难以适应更智能LLMs的先进推理能力。
  5. 不同智能水平的LLMs在处理API选择、参数填充和请求系统或用户必要输入等复杂查询方面存在显著局限性。
  6. API基类代理在有效完成真实和复杂用户查询方面面临巨大挑战。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !
 上一篇
Few-Shot Few-Shot
Few-Shot 方向最新论文已更新,请持续关注 Update in 2025-01-25 Text-to-SQL based on Large Language Models and Database Keyword Search
2025-01-25
下一篇 
LLM LLM
LLM 方向最新论文已更新,请持续关注 Update in 2025-01-25 CRPO Confidence-Reward Driven Preference Optimization for Machine Translation
2025-01-25
  目录