⚠️ All of the summaries below are generated by large language models and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use them in serious academic settings; they are only meant for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-03-15
SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence
Authors:Chang Han Low, Ziyue Wang, Tianyi Zhang, Zhitao Zeng, Zhu Zhuo, Evangelos B. Mazomenos, Yueming Jin
Integration of Vision-Language Models (VLMs) in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies within surgical scenes, undermining clinical reliability. While recent VLMs demonstrate strong general reasoning and thinking capabilities, they still lack the domain expertise and task-awareness required for precise surgical scene interpretation. Although Chain-of-Thought (CoT) can structure reasoning more effectively, current approaches rely on self-generated CoT steps, which often exacerbate inherent domain gaps and hallucinations. To overcome this, we present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. By employing specialized CoT prompts across five tasks: instrument recognition, action recognition, action prediction, patient data extraction, and outcome assessment, SurgRAW mitigates hallucinations through structured, domain-aware reasoning. Retrieval-Augmented Generation (RAG) is also integrated to access external medical knowledge, bridging domain gaps and improving response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively while understanding task interdependencies, with a panel discussion mechanism promoting logical consistency. To evaluate our method, we introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. With comprehensive experiments, we demonstrate the effectiveness of the proposed SurgRAW, with a 29.32% accuracy improvement over baseline VLMs on 12 robotic procedures, achieving state-of-the-art performance and advancing explainable, trustworthy, and autonomous surgical assistance.
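To make the routing-plus-RAG design concrete, here is a minimal Python sketch of a SurgRAW-style coordinator. All names (COT_PROMPTS, KNOWLEDGE_BASE, vlm_agent) are invented placeholders, the VLM call is stubbed, and retrieval is a toy dictionary lookup; it only illustrates how a task-specific CoT prompt and retrieved knowledge could be assembled before delegating to a VLM agent, not the authors' implementation.

```python
# Minimal sketch of a SurgRAW-style hierarchical routing layer (hypothetical names,
# not the authors' code): a coordinator picks a task-specific CoT prompt, attaches
# retrieved domain snippets, and delegates to a stubbed VLM agent.
from dataclasses import dataclass

COT_PROMPTS = {
    "instrument_recognition":  "Step 1: describe visible tools. Step 2: match to known instruments. Step 3: answer.",
    "action_recognition":      "Step 1: describe tool-tissue contact. Step 2: classify the action. Step 3: answer.",
    "action_prediction":       "Step 1: summarise current action. Step 2: infer surgical phase. Step 3: predict the next action.",
    "patient_data_extraction": "Step 1: locate on-screen text. Step 2: extract the requested fields. Step 3: answer.",
    "outcome_assessment":      "Step 1: gather findings from other tasks. Step 2: weigh risks. Step 3: assess the outcome.",
}

KNOWLEDGE_BASE = {  # toy stand-in for the RAG store of external medical knowledge
    "instrument_recognition": ["Monopolar curved scissors are used for dissection and cautery."],
    "outcome_assessment":     ["Prolonged clamping time correlates with ischemic complications."],
}

def retrieve(task: str, k: int = 1) -> list[str]:
    return KNOWLEDGE_BASE.get(task, [])[:k]

def vlm_agent(prompt: str, frame_id: str) -> str:
    # Placeholder for a real VLM call on the surgical frame; returns a canned answer here.
    return f"[answer for {frame_id} given a {len(prompt)}-character prompt]"

@dataclass
class Query:
    frame_id: str
    question: str
    task: str  # one of COT_PROMPTS' keys

def coordinator(query: Query) -> str:
    evidence = "\n".join(retrieve(query.task))
    prompt = f"{COT_PROMPTS[query.task]}\nRetrieved knowledge:\n{evidence}\nQuestion: {query.question}"
    return vlm_agent(prompt, query.frame_id)

if __name__ == "__main__":
    q = Query("frame_0042", "Which instrument is grasping the tissue?", "instrument_recognition")
    print(coordinator(q))
```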
Paper and project links
Summary
The integration of Vision-Language Models (VLMs) into surgical intelligence is limited by hallucinations, domain knowledge gaps, and task interdependencies, which undermine clinical reliability. Although recent VLMs show strong general reasoning and thinking abilities, they still lack the domain expertise and task awareness needed for precise surgical scene interpretation. This paper proposes SurgRAW, a Chain-of-Thought (CoT) driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery. Using specialized CoT prompts for five tasks, SurgRAW mitigates hallucinations through structured, domain-aware reasoning, and Retrieval-Augmented Generation (RAG) is integrated to bridge domain knowledge gaps and improve response reliability. Most importantly, a hierarchical agentic system ensures that CoT-embedded VLM agents collaborate effectively and understand task interdependencies, with a panel discussion mechanism promoting logical consistency. For evaluation, the authors introduce SurgCoTBench, the first reasoning-based dataset with structured frame-level annotations. Experiments show that SurgRAW improves accuracy by 29.32% over baseline VLMs and achieves state-of-the-art performance on 12 robotic procedures, advancing explainable, trustworthy, and autonomous surgical assistance.
Key Takeaways
- Integrating VLMs into surgical intelligence is challenged by hallucinations, domain knowledge gaps, and task interdependencies.
- Recent VLMs have strong general reasoning abilities but lack surgical domain expertise and task awareness.
- The SurgRAW framework uses a CoT-driven multi-agent architecture to address these issues and provide interpretable insights for surgical tasks.
- SurgRAW mitigates hallucinations through structured, domain-aware reasoning.
- RAG is integrated to bridge medical domain knowledge gaps and improve response reliability.
- A hierarchical agentic system ensures that agents understand task interdependencies and collaborate effectively.
Click here to view paper screenshots




LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
Authors:Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang
Existing Multimodal Large Language Models (MLLMs) encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream Agent-based methods use external tools (e.g., search engine, memory banks, OCR, retrieval models) to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. In order to better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: 1. Selection: We pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2. Perception: We design an effective retrieval scheme for long videos, improving the coverage of critical temporal segments while maintaining computational efficiency. 3. Action: Agents answer long video-related questions and exchange reasons. 4. Reflection: We evaluate the performance of each agent in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers by multi-round dynamical collaboration of MLLM agents. LVAgent is the first agent system method that outperforms all closed-source models (including GPT-4o) and open-source models (including InternVL-2.5 and Qwen2-VL) in the long video understanding tasks. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, on the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% compared with SOTA.
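The select/perceive/act/reflect loop can be pictured with a toy Python sketch. The agents, answers, and scoring below are invented stand-ins (random choices and majority-vote reflection), not the released LVAgent code; they only show how answers and reasons could circulate across rounds while weak agents are pruned.

```python
# Toy sketch of the select / perceive / act / reflect loop (invented agents and scoring,
# not the released LVAgent code).
import random

def act(agent: str, question: str, peer_reasons: list[str]) -> tuple[str, str]:
    # A real MLLM would read retrieved video segments and peers' reasons; here we fake it.
    answer = random.choice(["A", "B", "C", "D"])
    return answer, f"{agent} chose {answer} after reading {len(peer_reasons)} peer reasons"

def reflect(answers: dict[str, str]) -> dict[str, float]:
    # Score each agent by agreement with the majority answer (a crude stand-in for evaluation).
    majority = max(set(answers.values()), key=list(answers.values()).count)
    return {a: 1.0 if ans == majority else 0.0 for a, ans in answers.items()}

def lvagent_loop(question: str, library=("agent_A", "agent_B", "agent_C"), rounds=3) -> str:
    team = list(library)              # Selection: start from a pre-selected team
    reasons: list[str] = []
    answers: dict[str, str] = {}
    for _ in range(rounds):
        answers, new_reasons = {}, []
        for agent in team:            # Action: answer and exchange reasons
            ans, why = act(agent, question, reasons)
            answers[agent] = ans
            new_reasons.append(why)
        reasons = new_reasons
        scores = reflect(answers)     # Reflection: keep only well-performing agents
        team = [a for a in team if scores[a] > 0] or team
    return max(set(answers.values()), key=list(answers.values()).count)

print(lvagent_loop("What happens after the chef plates the dish?"))
```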
Paper and project links
Summary
For long video understanding, a single Multimodal Large Language Model (MLLM) struggles to model temporal context. To address this, the authors propose LVAgent, a framework that enables multi-round dynamic collaboration among MLLM agents to improve long video understanding. LVAgent realizes this collaboration through four key steps: selection, perception, action, and reflection, and achieves strong results on long video understanding tasks, surpassing many closed-source and open-source models.
Key Takeaways
- In long video understanding tasks, a single MLLM struggles to model temporal context.
- LVAgent is the first framework enabling multi-round dynamic collaboration of MLLM agents, aimed at better solving long video tasks.
- LVAgent realizes multi-round dynamic collaboration through four steps: selection, perception, action, and reflection.
- LVAgent achieves an accuracy of 80% on mainstream long video understanding tasks.
- On the LongVideoBench dataset, LVAgent improves accuracy by up to 14.3% over the previous state of the art.
- LVAgent's retrieval scheme for long videos improves coverage of critical temporal segments while maintaining computational efficiency.
Click here to view paper screenshots






StepMathAgent: A Step-Wise Agent for Evaluating Mathematical Processes through Tree-of-Error
Authors:Shu-Xun Yang, Cunxiang Wang, Yidong Wang, Xiaotao Gu, Minlie Huang, Jie Tang
Evaluating mathematical capabilities is critical for assessing the overall performance of large language models (LLMs). However, existing evaluation methods often focus solely on final answers, resulting in highly inaccurate and uninterpretable evaluation outcomes, as well as their failure to assess proof or open-ended problems. To address these issues, we propose a novel mathematical process evaluation agent based on Tree-of-Error, called StepMathAgent. This agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation and error tree generation, along with four external extension modules: difficulty calibration, simplicity evaluation, completeness validation and format assessment. Furthermore, we introduce StepMathBench, a benchmark comprising 1,000 step-divided process evaluation instances, derived from 200 high-quality math problems grouped by problem type, subject category and difficulty level. Experiments on StepMathBench show that our proposed StepMathAgent outperforms all state-of-the-art methods, demonstrating human-aligned evaluation preferences and broad applicability to various scenarios. Our data and code are available at https://github.com/SHU-XUN/StepMathAgent.
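The four core operations (segmentation, step scoring, aggregation, error tree generation) can be illustrated on a toy solution string. The scorer below is a hard-coded stub in place of an LLM judge, and names such as ErrorNode and score_step are our own; this is a sketch of the idea, not the released StepMathAgent code.

```python
# Illustrative sketch of segment -> score -> aggregate -> error tree on a toy solution.
from dataclasses import dataclass, field

@dataclass
class ErrorNode:
    step: str
    verdict: str               # "ok" or a short error description
    children: list = field(default_factory=list)

def segment(solution: str) -> list[str]:
    # Logical step segmentation: here, one step per non-empty line.
    return [s.strip() for s in solution.split("\n") if s.strip()]

def score_step(step: str, prev_steps: list[str]) -> tuple[float, str]:
    # A real scorer would ask an LLM whether the step follows from prev_steps; stub: flag "x = 5".
    if "x = 5" in step:
        return 0.0, "arithmetic error: 2x = 8 gives x = 4"
    return 1.0, "ok"

def evaluate(solution: str) -> tuple[float, ErrorNode]:
    steps = segment(solution)
    root = ErrorNode("solution", "ok")
    scores = []
    for i, step in enumerate(steps):
        s, verdict = score_step(step, steps[:i])
        scores.append(s)
        root.children.append(ErrorNode(step, verdict))
    return sum(scores) / max(len(scores), 1), root   # aggregation: mean step score

total, tree = evaluate("2x + 3 = 11\n2x = 8\nx = 5")
print(total, [c.verdict for c in tree.children])
```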
Paper and project links
Summary
This paper emphasizes the importance of evaluating the mathematical capabilities of large language models (LLMs) and proposes StepMathAgent, a novel mathematical process evaluation agent based on Tree-of-Error. The agent incorporates four internal core operations: logical step segmentation, step scoring, score aggregation, and error tree generation, plus four external extension modules: difficulty calibration, simplicity evaluation, completeness validation, and format assessment. The paper also introduces StepMathBench, a benchmark of 1,000 step-divided process evaluation instances derived from high-quality math problems grouped by problem type, subject category, and difficulty level. Experiments show that StepMathAgent outperforms all state-of-the-art methods, exhibits human-aligned evaluation preferences, and applies broadly across scenarios.
Key Takeaways
- Evaluating mathematical capabilities is key to assessing the overall performance of large language models.
- Existing evaluation methods focus mainly on final answers, making evaluations inaccurate and uninterpretable and preventing assessment of proof or open-ended problems.
- StepMathAgent is a novel Tree-of-Error-based mathematical process evaluation agent with internal core operations and external extension modules.
- StepMathAgent performs logical step segmentation, step scoring, score aggregation, and error tree generation.
- The StepMathBench benchmark contains 1,000 step-divided process evaluation instances derived from high-quality math problems grouped by problem type, subject, and difficulty.
- Experiments show StepMathAgent outperforms other state-of-the-art methods at evaluating mathematical processes and aligns with human evaluation preferences.
Click here to view paper screenshots






Revisiting Multi-Agent Asynchronous Online Optimization with Delays: the Strongly Convex Case
Authors:Lingchan Bao, Tong Wei, Yuanyu Wan
We revisit multi-agent asynchronous online optimization with delays, where only one of the agents becomes active for making the decision at each round, and the corresponding feedback is received by all the agents after unknown delays. Although previous studies have established an $O(\sqrt{dT})$ regret bound for this problem, they assume that the maximum delay $d$ is knowable or the arrival order of feedback satisfies a special property, which may not hold in practice. In this paper, we surprisingly find that when the loss functions are strongly convex, these assumptions can be eliminated, and the existing regret bound can be significantly improved to $O(d\log T)$ meanwhile. Specifically, to exploit the strong convexity of functions, we first propose a delayed variant of the classical follow-the-leader algorithm, namely FTDL, which is very simple but requires the full information of functions as feedback. Moreover, to handle the more general case with only the gradient feedback, we develop an approximate variant of FTDL by combining it with surrogate loss functions. Experimental results show that the approximate FTDL outperforms the existing algorithm in the strongly convex case.
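To see what follow-the-leader under unknown delays means operationally, here is a toy 1-D simulation with strongly convex losses f_t(x) = (x - θ_t)²: each round the active agent plays the minimizer (the mean) of all losses whose feedback has arrived so far. The delays, loss parameters, and regret bookkeeping are synthetic; this only illustrates the update rule, not the paper's algorithm or experiments.

```python
# Numeric sketch of a delayed follow-the-leader update on 1-D strongly convex losses.
import random

T = 20
thetas = [random.uniform(-1, 1) for _ in range(T)]      # hidden loss parameters
delays = [random.randint(0, 3) for _ in range(T)]       # unknown feedback delays
arrived: list[float] = []                                # full-information feedback received so far
pending: dict[int, list[float]] = {}                     # round -> thetas arriving at that round
regret, best_fixed = 0.0, sum(thetas) / T                # comparator: best fixed point in hindsight

for t in range(T):
    arrived += pending.pop(t, [])                        # collect feedback that arrives now
    x_t = sum(arrived) / len(arrived) if arrived else 0.0  # play the leader of received losses
    regret += (x_t - thetas[t]) ** 2 - (best_fixed - thetas[t]) ** 2
    pending.setdefault(t + 1 + delays[t], []).append(thetas[t])  # feedback arrives after the delay

print(f"cumulative regret over {T} rounds: {regret:.3f}")
```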
Paper and project links
Summary
This paper studies multi-agent asynchronous online optimization with delays, where only one agent becomes active to make the decision in each round and all agents receive the corresponding feedback after unknown delays. Previous work established an O(√(dT)) regret bound for this problem, but assumed that the maximum delay d is known or that the feedback arrival order satisfies a special property, which may not hold in practice. This paper shows, surprisingly, that when the loss functions are strongly convex these assumptions can be removed, and at the same time the existing regret bound can be significantly improved to O(d log T). Specifically, to exploit the strong convexity of the functions, the authors first propose FTDL, a delayed variant of the classical follow-the-leader algorithm, which is very simple but requires full-information feedback. To handle the more general case with only gradient feedback, they develop an approximate variant of FTDL by combining it with surrogate loss functions. Experiments show that the approximate FTDL outperforms the existing algorithm in the strongly convex case.
Key Takeaways
- The paper studies multi-agent asynchronous online optimization with delayed feedback, where only one agent is active to make the decision in each round.
- Previous work established an O(√(dT)) regret bound but assumed the maximum delay d is known or the feedback arrival order has a special property, which may not hold in practice.
- When the loss functions are strongly convex, these assumptions can be removed and the regret bound improves to O(d log T).
- FTDL, a delayed variant of the classical follow-the-leader algorithm, is simple but requires full-information feedback.
- To handle the case with only gradient feedback, an approximate variant of FTDL is developed by combining it with surrogate loss functions.
- Experiments show the approximate FTDL outperforms the existing algorithm in the strongly convex case.
Click here to view paper screenshots



PCLA: A Framework for Testing Autonomous Agents in the CARLA Simulator
Authors:Masoud Jamshidiyan Tehrani, Jinhan Kim, Paolo Tonella
Recent research on testing autonomous driving agents has grown significantly, especially in simulation environments. The CARLA simulator is often the preferred choice, and the autonomous agents from the CARLA Leaderboard challenge are regarded as the best-performing agents within this environment. However, researchers who test these agents, rather than training their own ones from scratch, often face challenges in utilizing them within customized test environments and scenarios. To address these challenges, we introduce PCLA (Pretrained CARLA Leaderboard Agents), an open-source Python testing framework that includes nine high-performing pre-trained autonomous agents from the Leaderboard challenges. PCLA is the first infrastructure specifically designed for testing various autonomous agents in arbitrary CARLA environments/scenarios. PCLA provides a simple way to deploy Leaderboard agents onto a vehicle without relying on the Leaderboard codebase, it allows researchers to easily switch between agents without requiring modifications to CARLA versions or programming environments, and it is fully compatible with the latest version of CARLA while remaining independent of the Leaderboard’s specific CARLA version. PCLA is publicly accessible at https://github.com/MasoudJTehrani/PCLA.
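The agent-switching idea can be illustrated with a small, hypothetical harness. This is not PCLA's actual API (see the repository above for that); it is only a sketch of how a name-to-agent registry lets test scenarios stay unchanged while the agent under test is swapped.

```python
# Hypothetical illustration of agent switching (NOT PCLA's real interface): a factory
# maps an agent name to a callable that turns sensor data into a control command.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Control:
    throttle: float
    steer: float
    brake: float

AgentFn = Callable[[dict], Control]

def conservative_agent(sensors: dict) -> Control:
    return Control(throttle=0.3, steer=0.0, brake=0.0)

def aggressive_agent(sensors: dict) -> Control:
    return Control(throttle=0.8, steer=0.0, brake=0.0)

AGENT_REGISTRY: dict[str, AgentFn] = {
    "conservative": conservative_agent,
    "aggressive": aggressive_agent,
}

def run_step(agent_name: str, sensors: dict) -> Control:
    # Swapping agents is just a dictionary lookup, so the test scenario code stays unchanged.
    return AGENT_REGISTRY[agent_name](sensors)

print(run_step("conservative", {"speed": 12.0}))
```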
Paper and project links
PDF This work will be published at the FSE 2025 demonstration track
Summary
Research on testing autonomous driving agents in the CARLA simulator keeps growing, but researchers face challenges when using existing agents in customized test environments and scenarios. To address this, the authors introduce PCLA (Pretrained CARLA Leaderboard Agents), a testing framework that includes nine high-performing pre-trained autonomous agents and supports testing in arbitrary CARLA environments. PCLA offers a simple way to deploy Leaderboard agents onto a vehicle without relying on the Leaderboard codebase, lets researchers switch agents easily, and is fully compatible with the latest version of CARLA while remaining independent of the Leaderboard's specific CARLA version.
Key Takeaways
- Research on testing autonomous driving agents in the CARLA simulator has grown significantly.
- Researchers face challenges when using existing agents in customized test environments and scenarios.
- PCLA is a testing framework designed for testing various autonomous agents.
- PCLA includes nine high-performing pre-trained autonomous agents from the Leaderboard challenges.
- PCLA provides a simple way to deploy agents onto a vehicle without relying on the Leaderboard codebase.
- PCLA allows researchers to switch agents easily without changing CARLA versions or programming environments.
- PCLA is compatible with the latest version of CARLA and independent of the Leaderboard's specific CARLA version.
Click here to view paper screenshots


Learning to Detect Objects from Multi-Agent LiDAR Scans without Manual Labels
Authors:Qiming Xia, Wenkai Lin, Haoen Xiang, Xun Huang, Siheng Chen, Zhen Dong, Cheng Wang, Chenglu Wen
Unsupervised 3D object detection serves as an important solution for offline 3D object annotation. However, due to the data sparsity and limited views, the clustering-based label fitting in unsupervised object detection often generates low-quality pseudo-labels. Multi-agent collaborative datasets, which involve the sharing of complementary observations among agents, hold the potential to break through this bottleneck. In this paper, we introduce a novel unsupervised method that learns to Detect Objects from Multi-Agent LiDAR scans, termed DOtA, without using external labels. DOtA first uses the internally shared ego-pose and ego-shape of collaborative agents to initialize the detector, leveraging the generalization performance of neural networks to infer preliminary labels. Subsequently, DOtA uses the complementary observations between agents to perform multi-scale encoding on preliminary labels, then decodes high-quality and low-quality labels. These labels are further used as prompts to guide a correct feature learning process, thereby enhancing the performance of the unsupervised object detection task. Extensive experiments on the V2V4Real and OPV2V datasets show that our DOtA outperforms state-of-the-art unsupervised 3D object detection methods. Additionally, we also validate the effectiveness of the DOtA labels under various collaborative perception frameworks. The code is available at https://github.com/xmuqimingxia/DOtA.
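A toy sketch of the underlying intuition: preliminary pseudo-labels proposed from one agent's scan are sorted into high- and low-quality sets depending on whether a collaborating agent's complementary view corroborates them. The 2-D boxes, data, and distance threshold below are invented, and the real pipeline operates on LiDAR point clouds with multi-scale encoding rather than this simple agreement check.

```python
# Toy corroboration check between two collaborating agents' preliminary pseudo-labels
# (invented data and thresholds; not the DOtA pipeline).
import math

def corroborated(box, other_boxes, dist_thresh=1.5):
    # A box is corroborated if another agent proposed a box with a nearby center.
    return any(math.dist(box[:2], ob[:2]) < dist_thresh for ob in other_boxes)

# (x, y, length, width) preliminary pseudo-labels inferred by each agent
agent_a = [(10.0, 2.0, 4.5, 1.8), (25.0, -3.0, 4.2, 1.7)]
agent_b = [(10.3, 2.2, 4.4, 1.8), (40.0, 5.0, 4.6, 1.9)]

high_quality = [b for b in agent_a if corroborated(b, agent_b)]
low_quality  = [b for b in agent_a if not corroborated(b, agent_b)]
print("high:", high_quality)
print("low: ", low_quality)
```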
Paper and project links
PDF 11 pages, 5 figures
Summary
Unsupervised 3D object detection on multi-agent collaborative datasets breaks through the pseudo-label quality bottleneck by exploiting complementary observations shared among agents. DOtA first infers preliminary labels using the ego-pose and ego-shape shared among collaborating agents, then performs multi-scale encoding on these preliminary labels using the agents' complementary observations to decode high-quality and low-quality labels. These labels further guide the feature learning process, improving unsupervised 3D object detection. Experiments on the V2V4Real and OPV2V datasets show that DOtA outperforms state-of-the-art methods. The code has been released.
Key Takeaways
- Multi-agent collaborative datasets enhance unsupervised 3D object detection through complementary observations shared among agents.
- DOtA initializes the detector using the ego-pose and ego-shape shared internally among agents.
- Complementary observations between agents are used for multi-scale encoding to generate labels.
- The resulting high-quality labels guide the feature learning process and improve detection performance.
- Experiments on multiple datasets show DOtA outperforms existing methods.
- The DOtA code is publicly available for others to use.
Click here to view paper screenshots





KG4Diagnosis: A Hierarchical Multi-Agent LLM Framework with Knowledge Graph Enhancement for Medical Diagnosis
Authors:Kaiwen Zuo, Yirui Jiang, Fan Mo, Pietro Lio
Integrating Large Language Models (LLMs) in healthcare diagnosis demands systematic frameworks that can handle complex medical scenarios while maintaining specialized expertise. We present KG4Diagnosis, a novel hierarchical multi-agent framework that combines LLMs with automated knowledge graph construction, encompassing 362 common diseases across medical specialties. Our framework mirrors real-world medical systems through a two-tier architecture: a general practitioner (GP) agent for initial assessment and triage, coordinating with specialized agents for in-depth diagnosis in specific domains. The core innovation lies in our end-to-end knowledge graph generation methodology, incorporating: (1) semantic-driven entity and relation extraction optimized for medical terminology, (2) multi-dimensional decision relationship reconstruction from unstructured medical texts, and (3) human-guided reasoning for knowledge expansion. KG4Diagnosis serves as an extensible foundation for specialized medical diagnosis systems, with capabilities to incorporate new diseases and medical knowledge. The framework’s modular design enables seamless integration of domain-specific enhancements, making it valuable for developing targeted medical diagnosis systems. We provide architectural guidelines and protocols to facilitate adoption across medical contexts.
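A minimal sketch of the two-tier triage idea, with a plain dictionary standing in for the knowledge graph and keyword matching standing in for the GP agent's LLM-based assessment; all names, specialties, and entries are invented for illustration and do not come from the authors' framework.

```python
# Minimal sketch of GP triage followed by a specialist lookup over a toy "knowledge graph".
KNOWLEDGE_GRAPH = {
    "cardiology": {"chest pain": ["angina", "myocardial infarction"]},
    "neurology":  {"headache":   ["migraine", "tension headache"]},
}

def gp_agent(symptoms: list[str]) -> str:
    # Initial assessment and triage: route to the specialty whose graph covers the most symptoms.
    def coverage(spec: str) -> int:
        return sum(any(s in key for key in KNOWLEDGE_GRAPH[spec]) for s in symptoms)
    return max(KNOWLEDGE_GRAPH, key=coverage)

def specialist_agent(specialty: str, symptoms: list[str]) -> list[str]:
    # In-depth diagnosis: collect candidate conditions linked to the reported symptoms.
    graph = KNOWLEDGE_GRAPH[specialty]
    return sorted({d for s in symptoms for key, ds in graph.items() if s in key for d in ds})

symptoms = ["chest pain"]
spec = gp_agent(symptoms)
print(spec, "->", specialist_agent(spec, symptoms))
```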
Paper and project links
PDF 10 pages, 5 figures, published to AAAI-25 Bridge Program
Summary: KG4Diagnosis is a novel hierarchical multi-agent framework for medical diagnosis that combines large language models (LLMs) with automated knowledge graph construction. It handles complex medical scenarios while maintaining specialized expertise, covering 362 common diseases across medical specialties. The framework mirrors real-world medical systems through a two-tier architecture: a general practitioner (GP) agent performs initial assessment and triage and coordinates with specialist agents for in-depth diagnosis in specific domains. Its core innovation is an end-to-end knowledge graph generation methodology comprising semantic-driven entity and relation extraction optimized for medical terminology, multi-dimensional decision relationship reconstruction from unstructured medical texts, and human-guided reasoning for knowledge expansion. KG4Diagnosis provides an extensible foundation for specialized medical diagnosis systems that can incorporate new diseases and medical knowledge, and its modular design enables seamless integration of domain-specific enhancements, making it valuable for developing targeted diagnosis systems. Architectural guidelines and protocols are provided to facilitate adoption across medical contexts.
Key Takeaways:
- KG4Diagnosis is a novel diagnosis framework that integrates LLMs and automated knowledge graphs, covering a wide range of common diseases.
- The framework uses a two-tier architecture in which a GP agent performs initial assessment and coordinates with specialist agents for in-depth diagnosis.
- Its core innovation lies in the knowledge graph generation methodology, including semantic-driven entity and relation extraction, multi-dimensional decision relationship reconstruction, and human-guided knowledge expansion.
- The framework is extensible, can incorporate new diseases and medical knowledge, and its modular design allows seamless integration of domain-specific enhancements.
- KG4Diagnosis provides a foundation for specialized medical diagnosis systems, together with architectural guidelines and protocols for adoption across medical contexts.
- The framework is designed for complex medical scenarios while maintaining specialized expertise.
Click here to view paper screenshots





DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
Authors:Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
The process of creating training data to teach models is currently driven by humans, who manually analyze model weaknesses and plan how to create data that improves a student model. Approaches using LLMs as annotators reduce human effort, but still require humans to interpret feedback from evaluations and control the LLM to produce data the student needs. Automating this labor-intensive process by creating autonomous data generation agents - or teachers - is desirable, but requires environments that can simulate the feedback-driven, iterative, closed loop of data creation. To enable rapid, scalable testing for such agents and their modules, we introduce DataEnvGym, a testbed of teacher environments for data generation agents. DataEnvGym frames data generation as a sequential decision-making task, involving an agent consisting of a data generation policy (which generates a plan for creating training data) and a data generation engine (which transforms the plan into data), inside an environment that provides student feedback. The agent’s goal is to improve student performance. Students are iteratively trained and evaluated on generated data, and their feedback (in the form of errors or weak skills) is reported to the agent after each iteration. DataEnvGym includes multiple teacher environment instantiations across 3 levels of structure in the state representation and action space. More structured environments are based on inferred skills and offer more interpretability and curriculum control. We support 4 domains (math, code, VQA, and tool-use) and test multiple students and teachers. Example agents in our teaching environments can iteratively improve students across tasks and settings. Moreover, we show that environments teach different skill levels and test variants of key modules, pointing to future work in improving data generation agents, engines, and feedback mechanisms.
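The closed loop (policy plans data, engine produces it, the student trains, feedback returns) can be sketched in a few lines. The student, skills, and improvement numbers below are fabricated stand-ins, not the actual DataEnvGym environments; the sketch only shows the shape of the iteration.

```python
# Schematic of the feedback-driven data generation loop (stubbed student and generator).
def train_and_evaluate(student_skill: dict, data: list[str]) -> dict:
    # Pretend training: each generated example slightly improves the targeted skill.
    for ex in data:
        skill = ex.split(":")[0]
        student_skill[skill] = min(1.0, student_skill[skill] + 0.05)
    return dict(student_skill)

def data_generation_policy(feedback: dict, budget: int = 5) -> list[str]:
    # Plan: spend the data budget on the weakest skill reported by the environment.
    weakest = min(feedback, key=feedback.get)
    return [f"{weakest}: synthetic example {i}" for i in range(budget)]

student = {"algebra": 0.4, "geometry": 0.7, "counting": 0.6}
for iteration in range(4):
    feedback = dict(student)                     # environment reports per-skill accuracy
    plan = data_generation_policy(feedback)      # agent: policy produces a data plan
    student = train_and_evaluate(student, plan)  # engine output trains the student
    print(iteration, student)
```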
Paper and project links
PDF ICLR 2025 Spotlight; Project Page: https://DataEnvGym.github.io
Summary
Automating training data generation to reduce manual effort is an active research direction. This paper introduces DataEnvGym, a testbed of teacher environments that frames data generation as a sequential decision-making task. A data generation agent consists of a data generation policy, which produces a plan for creating training data, and a data generation engine, which turns the plan into data. The environment provides student feedback, and the agent's goal is to improve student performance: students are iteratively trained and evaluated on the generated data, and their errors or weak skills are reported back to the agent to refine subsequent data generation. DataEnvGym includes multiple teacher environment instantiations at different levels of structure; more structured environments are based on inferred skills and offer more interpretability and curriculum control. It supports four domains (math, code, VQA, and tool use) and tests multiple students and teachers. Example teacher agents can iteratively improve students across tasks and settings, pointing to future work on better data generation agents, engines, and feedback mechanisms.
Key Takeaways
- Automating training data generation is needed to replace labor-intensive manual annotation.
- DataEnvGym is a testbed for data generation agents that simulates the feedback-driven, closed-loop process of data creation.
- The testbed frames data generation as a sequential decision-making problem involving a data generation policy and a data generation engine.
- Environments vary in structure and interpretability, with curriculum control based on inferred skills.
- Data generation tasks in four domains are supported: math, code, visual question answering, and tool use.
- Teacher agents can improve student performance across tasks and settings, demonstrating their effectiveness.
Click here to view paper screenshots





LaMMA-P: Generalizable Multi-Agent Long-Horizon Task Allocation and Planning with LM-Driven PDDL Planner
Authors:Xiaopan Zhang, Hao Qin, Fuquan Wang, Yue Dong, Jiachen Li
Language models (LMs) possess a strong capability to comprehend natural language, making them effective in translating human instructions into detailed plans for simple robot tasks. Nevertheless, it remains a significant challenge to handle long-horizon tasks, especially in subtask identification and allocation for cooperative heterogeneous robot teams. To address this issue, we propose a Language Model-Driven Multi-Agent PDDL Planner (LaMMA-P), a novel multi-agent task planning framework that achieves state-of-the-art performance on long-horizon tasks. LaMMA-P integrates the strengths of the LMs’ reasoning capability and the traditional heuristic search planner to achieve a high success rate and efficiency while demonstrating strong generalization across tasks. Additionally, we create MAT-THOR, a comprehensive benchmark that features household tasks with two different levels of complexity based on the AI2-THOR environment. The experimental results demonstrate that LaMMA-P achieves a 105% higher success rate and 36% higher efficiency than existing LM-based multiagent planners. The experimental videos, code, datasets, and detailed prompts used in each module can be found on the project website: https://lamma-p.github.io.
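A toy sketch of the decompose-then-allocate idea: a stubbed "LM" emits capability-tagged subtasks, and a greedy allocator assigns each to a capable robot. The real system additionally compiles subtasks into PDDL and runs a heuristic search planner; everything below (task names, robots, capabilities) is invented for illustration.

```python
# Toy decompose-then-allocate sketch for heterogeneous robot teams (invented scenario).
def lm_decompose(instruction: str) -> list[dict]:
    # Stand-in for the language model's subtask identification.
    return [
        {"name": "pick_up_mug", "needs": "manipulation"},
        {"name": "wipe_table",  "needs": "manipulation"},
        {"name": "scan_room",   "needs": "navigation"},
    ]

ROBOTS = {
    "arm_bot":   {"manipulation"},
    "rover_bot": {"navigation"},
}

def allocate(subtasks: list[dict]) -> dict[str, list[str]]:
    plan = {r: [] for r in ROBOTS}
    load = {r: 0 for r in ROBOTS}
    for task in subtasks:
        capable = [r for r, caps in ROBOTS.items() if task["needs"] in caps]
        chosen = min(capable, key=load.get)   # balance load among capable robots
        plan[chosen].append(task["name"])
        load[chosen] += 1
    return plan

print(allocate(lm_decompose("Tidy up the kitchen and check the room")))
```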
Paper and project links
PDF IEEE Conference on Robotics and Automation (ICRA 2025); Project website: https://lamma-p.github.io/
Summary
This paper applies language models to robot task planning and proposes LaMMA-P, a novel multi-agent task planning framework that combines the reasoning capability of language models with a traditional heuristic search planner to achieve a high success rate and efficiency on long-horizon tasks. To evaluate the framework, the authors create MAT-THOR, a benchmark of household tasks at two complexity levels based on the AI2-THOR environment. Experiments show that LaMMA-P achieves a 105% higher success rate and 36% higher efficiency than existing LM-based multi-agent planners.
Key Takeaways
- Language models have strong natural language understanding and can translate human instructions into plans for robot tasks.
- The main challenge in long-horizon tasks lies in subtask identification and allocation.
- LaMMA-P is a novel multi-agent task planning framework that combines the strengths of language models and a heuristic search planner.
- LaMMA-P achieves a high success rate and efficiency on long-horizon tasks.
- MAT-THOR is a household-task benchmark based on the AI2-THOR environment with two complexity levels.
- Experiments show clear performance gains for LaMMA-P over other methods.
Click here to view paper screenshots





EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
Authors:Zeyi Liao, Lingbo Mo, Chejian Xu, Mintong Kang, Jiawei Zhang, Chaowei Xiao, Yuan Tian, Bo Li, Huan Sun
Generalist web agents have demonstrated remarkable potential in autonomously completing a wide range of tasks on real websites, significantly boosting human productivity. However, web tasks, such as booking flights, usually involve users’ PII, which may be exposed to potential privacy risks if web agents accidentally interact with compromised websites, a scenario that remains largely unexplored in the literature. In this work, we narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a realistic threat model for attacks on the website, where we consider two adversarial targets: stealing users’ specific PII or the entire user request. Then, we propose a novel attack method, termed Environmental Injection Attack (EIA). EIA injects malicious content designed to adapt well to environments where the agents operate and our work instantiates EIA specifically for privacy scenarios in web environments. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web, and conduct experiments using one of the most capable generalist web agent frameworks to date. The results demonstrate that EIA achieves up to 70% ASR in stealing specific PII and 16% ASR for full user request. Additionally, by assessing the stealthiness and experimenting with a defensive system prompt, we indicate that EIA is hard to detect and mitigate. Notably, attacks that are not well adapted for a webpage can be detected via human inspection, leading to our discussion about the trade-off between security and autonomy. However, extra attackers’ efforts can make EIA seamlessly adapted, rendering such supervision ineffective. Thus, we further discuss the defenses at the pre- and post-deployment stages of the websites without relying on human supervision and call for more advanced defense strategies.
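Rather than illustrating the attack itself, here is a small defence-oriented sketch in the spirit of the paper's detection discussion: a pre-action check that refuses to type PII into page elements that look hidden or pushed off-screen. The heuristics and element format are our own simplifications, not a method proposed in the paper, and a real agent would need far more robust checks.

```python
# Defence-oriented sketch (our illustration, not the paper's method): a pre-action check
# over a simplified element dictionary before an agent types PII into a form field.
PII_KINDS = {"passport", "ssn", "credit_card", "phone", "email"}

def looks_injected(element: dict) -> bool:
    style = element.get("style", {})
    invisible = style.get("opacity", 1) == 0 or style.get("display") == "none"
    off_screen = style.get("left", 0) < -1000 or style.get("top", 0) < -1000
    return invisible or off_screen

def safe_to_fill(element: dict, value_kind: str) -> bool:
    # Only block when the agent is about to reveal PII into a suspicious element.
    return not (value_kind in PII_KINDS and looks_injected(element))

field = {"tag": "input", "aria_label": "passenger passport number",
         "style": {"opacity": 0, "left": -5000}}
print(safe_to_fill(field, "passport"))   # -> False: do not type the passport number here
```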
Paper and project links
PDF Accepted by ICLR 2025
Summary
Generalist web agents can autonomously complete a wide range of tasks on websites, showing broad potential and greatly improving productivity. However, web tasks such as booking flights often involve users' personally identifiable information (PII), which may be stolen if the agent accidentally interacts with a compromised website. This work presents the first study of the privacy risks of generalist web agents in adversarial environments. The authors define a realistic threat model for attacks on websites and propose a new attack, the Environmental Injection Attack (EIA), which injects malicious content adapted to the environment where the agent operates and is instantiated for privacy scenarios on the web. Experiments on action steps collected from realistic websites show that EIA achieves up to a 70% attack success rate for stealing specific PII and 16% for stealing the full user request. The authors further show that EIA is hard to detect and mitigate, even with a defensive system prompt, although attacks that are not well adapted to a webpage can be caught by human inspection, which raises a trade-off between security and autonomy. They therefore discuss defenses at the pre- and post-deployment stages of websites and call for more advanced defense strategies.
Key Takeaways
- Web agents can autonomously complete many tasks and significantly improve productivity, but they introduce privacy leakage risks.
- This is the first study of the privacy risks of generalist web agents in adversarial environments.
- A new attack, the Environmental Injection Attack (EIA), adapts malicious content to the agent's operating environment to steal user information.
- EIA achieves a high attack success rate and is hard to detect and block with existing defenses.
- There is a trade-off between security and autonomy that needs to be addressed.
- Defenses at the pre- and post-deployment stages of websites are needed, along with more advanced defense strategies against such attacks.
- EIA's effectiveness depends on how well it adapts to the webpage; poorly adapted attacks can be detected by human inspection.
Click here to view paper screenshots



COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
Authors:Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Behzad Dariush, Kwonjoon Lee, Yilun Du, Chuang Gan
In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics conditioned on an arbitrary number of agents’ actions given only partial egocentric visual observations of the world. To address this issue of partial observability, we first train generative models to estimate the overall world state given partial egocentric observations. To enable accurate simulation of multiple sets of actions on this world state, we then propose to learn a compositional world model for multi-agent cooperation by factorizing the naturally composable joint actions of multiple agents and compositionally generating the video conditioned on the world state. By leveraging this compositional world model, in combination with Vision Language Models to infer the actions of other agents, we can use a tree search procedure to integrate these modules and facilitate online cooperative planning. We evaluate our methods on three challenging benchmarks with 2-4 agents. The results show our compositional world model is effective and the framework enables the embodied agents to cooperate efficiently with different agents across various tasks and an arbitrary number of agents, showing the promising future of our proposed methods. More videos can be found at https://embodied-agi.cs.umass.edu/combo/.
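A highly simplified sketch of composing per-agent action proposals and scoring joint actions with a stubbed world model: one-step lookahead replaces the paper's tree search, and the actions, state, and value function are invented placeholders rather than the learned video-based model.

```python
# Simplified sketch: factorize per-agent actions, compose joint actions, and score them
# with a stubbed world model via one-step lookahead (placeholders, not COMBO's models).
from itertools import product

ACTIONS = {"agent_1": ["pick_plank", "wait"], "agent_2": ["hold_plank", "fetch_nail"]}

def world_model(state: dict, joint_action: tuple) -> dict:
    # Placeholder for the compositional world model: returns a predicted next state.
    progress = state["progress"] + (0.5 if joint_action == ("pick_plank", "hold_plank") else 0.1)
    return {"progress": progress}

def value(state: dict) -> float:
    return state["progress"]

def plan(state: dict) -> tuple:
    # Compose one action per agent into a joint action and pick the best predicted outcome.
    candidates = list(product(*(ACTIONS[a] for a in sorted(ACTIONS))))
    return max(candidates, key=lambda ja: value(world_model(state, ja)))

print(plan({"progress": 0.0}))   # -> ('pick_plank', 'hold_plank')
```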
Paper and project links
PDF Published at ICLR 2025. 24 pages. The first three authors contributed equally
Summary:
This paper studies embodied multi-agent cooperation, where decentralized agents with only egocentric views of the world must cooperate. To handle partial observability, generative models are trained to estimate the overall world state from partial egocentric observations. On top of this, a compositional world model for multi-agent cooperation is proposed: the naturally composable joint actions of multiple agents are factorized, and video is generated compositionally conditioned on the world state. Combined with Vision Language Models that infer other agents' actions, a tree search procedure integrates these modules to enable online cooperative planning. Evaluations on three challenging benchmarks with 2-4 agents show that the compositional world model is effective and that the framework enables embodied agents to cooperate efficiently with different agents across various tasks. More videos are available on the project website.
Key Takeaways:
- The paper studies embodied multi-agent cooperation, addressing cooperative planning for decentralized agents with only egocentric views.
- Generative models are trained to estimate the overall world state, addressing partial observability.
- A compositional world model is proposed for learning multi-agent cooperation.
- The naturally composable joint actions of multiple agents are factorized to enable conditional video generation.
- Vision Language Models are used to infer other agents' actions.
- A tree search procedure integrates the modules to enable online cooperative planning.
Click here to view paper screenshots




Maximum Entropy Heterogeneous-Agent Reinforcement Learning
Authors:Jiarong Liu, Yifan Zhong, Siyi Hu, Haobo Fu, Qiang Fu, Xiaojun Chang, Yaodong Yang
Multi-agent reinforcement learning (MARL) has been shown effective for cooperative games in recent years. However, existing state-of-the-art methods face challenges related to sample complexity, training instability, and the risk of converging to a suboptimal Nash Equilibrium. In this paper, we propose a unified framework for learning stochastic policies to resolve these issues. We embed cooperative MARL problems into probabilistic graphical models, from which we derive the maximum entropy (MaxEnt) objective for MARL. Based on the MaxEnt framework, we propose Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm. Theoretically, we prove the monotonic improvement and convergence to quantal response equilibrium (QRE) properties of HASAC. Furthermore, we generalize a unified template for MaxEnt algorithmic design named Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. We evaluate HASAC on six benchmarks: Bi-DexHands, Multi-Agent MuJoCo, StarCraft Multi-Agent Challenge, Google Research Football, Multi-Agent Particle Environment, and Light Aircraft Game. Results show that HASAC consistently outperforms strong baselines, exhibiting better sample efficiency, robustness, and sufficient exploration. See our page at https://sites.google.com/view/meharl.
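The maximum-entropy objective can be illustrated numerically: with a temperature-scaled softmax policy over toy Q-values, a larger temperature keeps more entropy instead of greedily exploiting the best action. This shows only the single-agent MaxEnt intuition behind soft actor-critic style methods, not the HASAC algorithm itself; the Q-values and temperatures are made up.

```python
# Toy numeric illustration of the maximum-entropy trade-off (not the HASAC algorithm).
import math

def maxent_policy(q_values: dict, alpha: float) -> dict:
    # Policy proportional to exp(Q/alpha), computed with a max-shift for numerical stability.
    m = max(q_values.values())
    exp = {a: math.exp((q - m) / alpha) for a, q in q_values.items()}
    z = sum(exp.values())
    return {a: e / z for a, e in exp.items()}

def entropy(pi: dict) -> float:
    return -sum(p * math.log(p) for p in pi.values() if p > 0)

q = {"left": 1.0, "right": 1.2, "stay": 0.5}
for alpha in (0.1, 1.0):
    pi = maxent_policy(q, alpha)
    soft_value = sum(pi[a] * q[a] for a in q) + alpha * entropy(pi)  # expected return + entropy bonus
    rounded = {a: round(p, 3) for a, p in pi.items()}
    print(f"alpha={alpha}: policy={rounded}, soft value={soft_value:.3f}")
```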
Paper and project links
PDF ICLR 2024 Spotlight
Summary
Multi-agent reinforcement learning (MARL) has proven effective for cooperative games in recent years, but existing methods suffer from sample complexity, training instability, and the risk of converging to a suboptimal Nash equilibrium. This paper proposes a unified framework for learning stochastic policies to resolve these issues: cooperative MARL problems are embedded into probabilistic graphical models, from which the maximum entropy (MaxEnt) objective for MARL is derived. Based on the MaxEnt framework, the authors propose the Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm and prove its monotonic improvement and convergence to quantal response equilibrium (QRE). They further generalize a unified template for MaxEnt algorithmic design, Maximum Entropy Heterogeneous-Agent Mirror Learning (MEHAML), which provides any induced method with the same guarantees as HASAC. Evaluated on six benchmarks, HASAC consistently outperforms strong baselines, showing better sample efficiency, robustness, and sufficient exploration. More information is available at https://sites.google.com/view/meharl.
Key Takeaways
- MARL is effective for cooperative games but still faces sample complexity, training instability, and convergence to suboptimal solutions.
- A unified framework for learning stochastic policies embeds cooperative MARL problems into probabilistic graphical models and derives a MaxEnt objective.
- The Heterogeneous-Agent Soft Actor-Critic (HASAC) algorithm is proposed and proven to have monotonic improvement and convergence to quantal response equilibrium.
- The generalized MaxEnt design template MEHAML gives any induced method the same guarantees as HASAC.
- HASAC consistently outperforms strong baseline methods on six benchmarks.
- HASAC exhibits better sample efficiency, robustness, and sufficient exploration.
Click here to view paper screenshots


