⚠️ All of the summaries below are generated by a large language model and may contain errors. They are for reference only; use with caution.
🔴 Note: never use them for serious academic work; they are intended only for initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-21
Navigating Quantum Missteps in Agent-Based Modeling: A Schelling Model Case Study
Authors:C. Nico Barati, Arie Croitoru, Ross Gore, Michael Jarret, William Kennedy, Andrew Maciejunes, Maxim A. Malikov, Samuel S. Mendelson
Quantum computing promises transformative advances, but remains constrained by recurring misconceptions and methodological pitfalls. This paper demonstrates a fundamental incompatibility between traditional agent-based modeling (ABM) implementations and quantum optimization frameworks like Quadratic Unconstrained Binary Optimization (QUBO). Using Schelling’s segregation model as a case study, we show that the standard practice of directly translating ABM state observations into QUBO formulations not only fails to deliver quantum advantage, but actively undermines computational efficiency. The fundamental issue is architectural. Traditional ABM implementations entail observing the state of the system at each iteration, systematically destroying the quantum superposition required for computational advantage. Through analysis of Schelling’s segregation dynamics on lollipop networks, we demonstrate how abandoning the QUBO reduction paradigm and instead reconceptualizing the research question, from “simulate agent dynamics iteratively until convergence” to “compute minimum of agent moves required for global satisfaction”, enables a faster classical solution. This structural reconceptualization yields an algorithm that exploits network symmetries obscured in traditional ABM simulations and QUBO formulations. It establishes a new lower bound which quantum approaches must outperform to achieve advantage. Our work emphasizes that progress in quantum agent-based modeling does not require forcing classical ABM implementations into quantum frameworks. Instead, it should focus on clarifying when quantum advantage is structurally possible, developing best-in-class classical baselines through problem analysis, and fundamentally reformulating research questions rather than preserving classical iterative state change observation paradigms.
Paper and project links
Summary
This paper examines a fundamental incompatibility between traditional agent-based modeling (ABM) and quantum optimization frameworks such as Quadratic Unconstrained Binary Optimization (QUBO). Using Schelling's segregation model as a case study, it finds that directly translating ABM state observations into QUBO formulations not only fails to deliver quantum advantage but actively undermines computational efficiency. The root issue is architectural: traditional ABM implementations observe the system state at every iteration, destroying the quantum superposition that computational advantage requires. Reconceptualizing the research question, from simulating agent dynamics iteratively until convergence to computing the minimum number of agent moves required for global satisfaction, yields a faster classical solution that exploits network symmetries hidden in traditional ABM simulations and QUBO formulations. The implication is that progress in quantum agent-based modeling lies in clarifying when quantum advantage is structurally possible, developing best-in-class classical baselines through problem analysis, and fundamentally reformulating research questions, rather than forcing classical ABM implementations into quantum frameworks.
Key Takeaways
- Traditional agent-based modeling (ABM) is fundamentally incompatible with quantum optimization frameworks such as QUBO.
- Directly translating ABM state observations into QUBO formulations fails to deliver quantum advantage and undermines computational efficiency.
- The core problem is that observing the system state at every iteration in traditional ABM destroys the quantum superposition required for advantage.
- Reconceptualizing the research question enables a faster classical solution.
- The reconceptualization shifts from simulating agent dynamics iteratively until convergence to computing the minimum number of agent moves required for global satisfaction.
- Progress lies in rethinking how quantum agent-based modeling is framed, not in forcing classical ABM implementations into quantum frameworks.
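The satisfaction condition at the heart of Schelling's model is easy to state concretely. The sketch below is our own illustration, not the paper's reformulated algorithm; the lollipop-graph parameters, agent types, and tolerance threshold are all assumed. It builds a lollipop graph (a clique with a path attached) and checks whether every agent has at least a threshold fraction of same-type neighbors:

```python
# Illustrative sketch (not the paper's algorithm): a Schelling-style
# satisfaction check on a lollipop graph, with an assumed tolerance tau.
from itertools import combinations

def lollipop_edges(m, n):
    """Edges of a lollipop graph: a clique K_m plus a path of n extra nodes."""
    edges = list(combinations(range(m), 2))          # clique part
    path = [(m - 1 + i, m + i) for i in range(n)]    # path attached to node m-1
    return edges + path

def neighbors(edges, num_nodes):
    adj = {v: set() for v in range(num_nodes)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def all_satisfied(types, adj, tau=0.5):
    """True iff every occupied node has >= tau fraction of same-type neighbors."""
    for v, t in types.items():
        if t is None:                 # empty site
            continue
        occ = [u for u in adj[v] if types.get(u) is not None]
        if not occ:
            continue
        same = sum(1 for u in occ if types[u] == t)
        if same / len(occ) < tau:
            return False
    return True

edges = lollipop_edges(4, 3)          # K_4 plus a 3-node tail
adj = neighbors(edges, 7)
types = {0: "A", 1: "A", 2: "B", 3: "B", 4: "B", 5: "B", 6: None}
print(all_satisfied(types, adj))      # node 0 has only 1/3 same-type -> False
```

The paper's point is precisely to avoid iterating such local checks to convergence, and instead to count the minimum number of moves needed to make this predicate true globally.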
Click here to view paper screenshots
AVATAAR: Agentic Video Answering via Temporal Adaptive Alignment and Reasoning
Authors:Urjitkumar Patel, Fang-Chun Yeh, Chinmay Gondhalekar
With the increasing prevalence of video content, effectively understanding and answering questions about long form videos has become essential for numerous applications. Although large vision language models (LVLMs) have enhanced performance, they often face challenges with nuanced queries that demand both a comprehensive understanding and detailed analysis. To overcome these obstacles, we introduce AVATAAR, a modular and interpretable framework that combines global and local video context, along with a Pre Retrieval Thinking Agent and a Rethink Module. AVATAAR creates a persistent global summary and establishes a feedback loop between the Rethink Module and the Pre Retrieval Thinking Agent, allowing the system to refine its retrieval strategies based on partial answers and replicate human-like iterative reasoning. On the CinePile benchmark, AVATAAR demonstrates significant improvements over a baseline, achieving relative gains of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension. Our experiments confirm that each module contributes positively to the overall performance, with the feedback loop being crucial for adaptability. These findings highlight AVATAAR’s effectiveness in enhancing video understanding capabilities. Ultimately, AVATAAR presents a scalable solution for long-form Video Question Answering (QA), merging accuracy, interpretability, and extensibility.
Paper and project links
PDF Accepted in the 5th IEEE Big Data Workshop on Multimodal AI (MMAI 2025), Dec 8-11, Macau, China, 2025 (Preprint Copy)
Summary
AVATAAR combines global and local video context with a Pre-Retrieval Thinking Agent and a Rethink Module to improve question answering over long-form videos. On the CinePile benchmark it improves significantly over the baseline; each module contributes positively to overall performance, and the feedback loop is crucial for adaptability.
Key Takeaways
- The AVATAAR framework effectively improves video question-answering performance.
- The framework combines global and local video context.
- It introduces a Pre-Retrieval Thinking Agent and a Rethink Module.
- On the CinePile benchmark, AVATAAR achieves relative gains over the baseline of +5.6% in temporal reasoning, +5% in technical queries, +8% in theme-based questions, and +8.2% in narrative comprehension.
- The feedback loop is crucial for adaptability.
- Each module contributes positively to overall performance.
Click here to view paper screenshots
Computer-Use Agents as Judges for Generative User Interface
Authors:Kevin Qinghong Lin, Siyuan Hu, Linjie Li, Zhengyuan Yang, Lijuan Wang, Philip Torr, Mike Zheng Shou
Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through Graphical User Interfaces (GUI). Yet, most GUI remain designed primarily for humans–prioritizing aesthetics and usability–forcing agents to adopt human-oriented behaviors that are unnecessary for efficient task execution. At the same time, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. This raises a fundamental question: Can CUA as judges to assist Coder for automatic GUI design? To investigate, we introduce AUI-Gym, a benchmark for Automatic GUI development spanning 52 applications across diverse domains. Using language models, we synthesize 1560 tasks that simulate real-world scenarios. To ensure task reliability, we further develop a verifier that programmatically checks whether each task is executable within its environment. Building on this, we propose a Coder-CUA in Collaboration framework: the Coder acts as Designer, generating and revising websites, while the CUA serves as Judge, evaluating functionality and refining designs. Success is measured not by visual appearance, but by task solvability and CUA navigation success rate. To turn CUA feedback into usable guidance, we design a CUA Dashboard that compresses multi-step navigation histories into concise visual summaries, offering interpretable guidance for iterative redesign. By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments. Our code and dataset are available at https://github.com/showlab/AUI.
Paper and project links
PDF Project: https://showlab.github.io/AUI Github: https://github.com/showlab/AUI
Summary
Computer-Use Agents (CUA) can autonomously operate digital environments, but most GUIs are designed primarily for humans, imposing unnecessary obstacles on agents executing tasks. Meanwhile, rapid advances in coding-oriented language models (Coder) have transformed automatic GUI design. The paper introduces AUI-Gym, a benchmark for automatic GUI development spanning 52 applications with 1560 language-model-synthesized tasks simulating real-world scenarios, and proposes a Coder-CUA collaboration framework: the Coder acts as designer, generating and revising websites, while the CUA serves as judge, evaluating functionality and refining designs. Success is measured by task solvability and CUA navigation success rate rather than visual appearance. A CUA Dashboard compresses multi-step navigation histories into concise visual summaries to guide iterative redesign. By positioning agents as both designers and judges, the framework shifts interface design toward agent-native efficiency and reliability, moving agents from passive use toward active participation in digital environments.
Key Takeaways
- Computer-Use Agents (CUA) can autonomously operate digital environments, but current GUIs are designed primarily for humans, limiting agent efficiency.
- Advances in automatic GUI design are changing the role CUAs can play in it.
- The AUI-Gym benchmark evaluates automatic GUI development across diverse domains with synthesized tasks that simulate real-world scenarios.
- In the Coder-CUA collaboration framework, the Coder designs and the CUA judges.
- Evaluation shifts from visual appearance to task solvability and CUA navigation success rate.
- The CUA Dashboard provides concise visual summaries to guide iterative redesign.
Click here to view paper screenshots
Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining
Authors:Qian’ang Mao, Yuxuan Zhang, Jiaman Chen, Wenjun Zhou, Jiaqi Yan
As Decentralized Finance (DeFi) develops, understanding user intent behind DeFi transactions is crucial yet challenging due to complex smart contract interactions, multifaceted on-/off-chain factors, and opaque hex logs. Existing methods lack deep semantic insight. To address this, we propose the Transaction Intent Mining (TIM) framework. TIM leverages a DeFi intent taxonomy built on grounded theory and a multi-agent Large Language Model (LLM) system to robustly infer user intents. A Meta-Level Planner dynamically coordinates domain experts to decompose multiple perspective-specific intent analyses into solvable subtasks. Question Solvers handle the tasks with multi-modal on/off-chain data. While a Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability. Experiments show that TIM significantly outperforms machine learning models, single LLMs, and single Agent baselines. We also analyze core challenges in intent inference. This work helps provide a more reliable understanding of user motivations in DeFi, offering context-aware explanations for complex blockchain activity.
Paper and project links
PDF Written in 2025 Q1
Summary
Understanding the user intent behind DeFi transactions is crucial yet challenging. The Transaction Intent Mining (TIM) framework addresses this by combining a DeFi intent taxonomy built on grounded theory with a multi-agent large language model (LLM) system: a Meta-Level Planner dynamically coordinates domain experts to decompose intent analysis into solvable subtasks, enabling robust inference of user intents.
Key Takeaways
- Understanding user intent behind DeFi transactions is important yet challenging.
- TIM infers user intent using a DeFi intent taxonomy built on grounded theory together with a multi-agent LLM system.
- The Meta-Level Planner dynamically coordinates domain experts to decompose the analysis into subtasks.
- Question Solvers handle the subtasks with multi-modal on-/off-chain data.
- A Cognitive Evaluator mitigates LLM hallucinations and ensures verifiability.
- In experiments, TIM significantly outperforms machine-learning models, single LLMs, and single-agent baselines.
Click here to view paper screenshots
DEPO: Dual-Efficiency Preference Optimization for LLM Agents
Authors:Sirui Chen, Mengshi Zhao, Lei Xu, Yuying Zhao, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu
Recent advances in large language models (LLMs) have greatly improved their reasoning and decision-making abilities when deployed as agents. Richer reasoning, however, often comes at the cost of longer chain of thought (CoT), hampering interaction efficiency in real-world scenarios. Nevertheless, there still lacks systematic definition of LLM agent efficiency, hindering targeted improvements. To this end, we introduce dual-efficiency, comprising (i) step-level efficiency, which minimizes tokens per step, and (ii) trajectory-level efficiency, which minimizes the number of steps to complete a task. Building on this definition, we propose DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. Experiments on WebShop and BabyAI show that DEPO cuts token usage by up to 60.9% and steps by up to 26.9%, while achieving up to a 29.3% improvement in performance. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data. Our project page is at https://opencausalab.github.io/DEPO.
Paper and project links
PDF Accepted to AAAI 2026
Summary
Large language models (LLMs) deployed as agents have greatly improved reasoning and decision-making, but richer reasoning often comes at the cost of longer chains of thought (CoT), hurting interaction efficiency in real-world scenarios. This paper introduces dual-efficiency, comprising step-level efficiency (fewer tokens per step) and trajectory-level efficiency (fewer steps per task), and proposes DEPO, a dual-efficiency preference optimization method that jointly rewards succinct responses and fewer action steps. On WebShop and BabyAI, DEPO cuts token usage by up to 60.9% and steps by up to 26.9% while improving performance by up to 29.3%. DEPO also generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data.
Key Takeaways
- LLM agents' reasoning and decision-making have improved, but richer reasoning often means longer chains of thought, hurting real-world interaction efficiency.
- Dual-efficiency, comprising step-level and trajectory-level efficiency, is introduced to measure LLM agent efficiency.
- DEPO jointly optimizes both kinds of efficiency via preference optimization.
- Experiments on WebShop and BabyAI show DEPO substantially reduces token usage and step counts while improving performance.
- DEPO generalizes to three out-of-domain math benchmarks and retains its efficiency gains when trained on only 25% of the data.
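The two efficiency notions compose naturally into a single preference score. The toy scoring function below is our own illustration of the idea; the linear form and the weights `alpha` and `beta` are assumptions, not the paper's objective:

```python
# Hedged sketch of the dual-efficiency idea (a toy scoring rule, not DEPO's
# exact objective): prefer trajectories that succeed with fewer tokens per
# step (step-level) and fewer steps overall (trajectory-level).
def dual_efficiency_score(success, steps, alpha=0.01, beta=0.1):
    """steps: list of token counts, one entry per agent step.
    alpha penalizes average tokens per step (step-level efficiency);
    beta penalizes the number of steps (trajectory-level efficiency)."""
    if not steps:
        return 0.0
    avg_tokens = sum(steps) / len(steps)
    return float(success) - alpha * avg_tokens - beta * len(steps)

# A shorter, terser successful trajectory scores higher than a verbose one:
concise = dual_efficiency_score(True, [40, 55, 30])
verbose = dual_efficiency_score(True, [200, 180, 150, 170])
assert concise > verbose
```

In a preference-optimization setup, pairs ranked by such a score (succinct-and-short preferred) would supply the training signal.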
Click here to view paper screenshots
Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration
Authors:Yifu Guo, Zishan Xu, Zhiyuan Yao, Yuquan Lu, Jiaye Lin, Sen Hu, Zhenheng Tang, Yingchao Li, Huacan Wang, Ronghao Chen
Existing multimodal reasoning models and frameworks suffer from fundamental architectural limitations: most lack the human-like ability to autonomously explore diverse reasoning pathways-whether in direct inference, tool-driven visual exploration, programmatic visual manipulation, or intrinsic visual imagination. Consequently, they struggle to adapt to dynamically changing capability requirements in real-world tasks. Meanwhile, humans exhibit a complementary set of thinking abilities when addressing such tasks, whereas existing methods typically cover only a subset of these dimensions. Inspired by this, we propose Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration, a new paradigm for multimodal agentic reasoning. We define six core capabilities essential for multimodal reasoning and organize a comprehensive evaluation benchmark, Octopus-Bench, accordingly. Octopus is capable of autonomously exploring during reasoning and dynamically selecting the most appropriate capability based on the current state. Experimental results show that Octopus achieves the best performance on the vast majority of tasks in Octopus-Bench, highlighting the crucial role of capability coordination in agentic multimodal reasoning.
Paper and project links
Summary
Existing multimodal reasoning models and frameworks suffer from architectural limitations: they lack the human-like ability to autonomously explore diverse reasoning pathways and struggle to adapt to dynamically changing capability requirements. Octopus defines six core capabilities essential for multimodal reasoning, organizes the evaluation benchmark Octopus-Bench accordingly, explores autonomously during reasoning, and dynamically selects the most appropriate capability based on the current state. Experiments show that Octopus achieves the best performance on the vast majority of Octopus-Bench tasks.
Key Takeaways
- Current multimodal reasoning models have architectural limitations and cannot autonomously explore diverse reasoning pathways the way humans do.
- Humans exhibit complementary thinking abilities on such tasks, whereas existing methods typically cover only a subset of these dimensions.
- Octopus proposes a new agentic multimodal reasoning paradigm that orchestrates six core capabilities.
- Octopus explores autonomously during reasoning and dynamically selects the most appropriate capability based on the current state.
- On the Octopus-Bench benchmark, Octopus achieves the best performance on the vast majority of tasks.
- The results highlight the crucial role of capability coordination in agentic multimodal reasoning.
Click here to view paper screenshots
Adversarial Attack on Black-Box Multi-Agent by Adaptive Perturbation
Authors:Jianming Chen, Yawen Wang, Junjie Wang, Xiaofei Xie, Yuanzhe Hu, Qing Wang, Fanjiang Xu
Evaluating security and reliability for multi-agent systems (MAS) is urgent as they become increasingly prevalent in various applications. As an evaluation technique, existing adversarial attack frameworks face certain limitations, e.g., impracticality due to the requirement of white-box information or high control authority, and a lack of stealthiness or effectiveness as they often target all agents or specific fixed agents. To address these issues, we propose AdapAM, a novel framework for adversarial attacks on black-box MAS. AdapAM incorporates two key components: (1) Adaptive Selection Policy simultaneously selects the victim and determines the anticipated malicious action (the action would lead to the worst impact on MAS), balancing effectiveness and stealthiness. (2) Proxy-based Perturbation to Induce Malicious Action utilizes generative adversarial imitation learning to approximate the target MAS, allowing AdapAM to generate perturbed observations using white-box information and thus induce victims to execute malicious action in black-box settings. We evaluate AdapAM across eight multi-agent environments and compare it with four state-of-the-art and commonly-used baselines. Results demonstrate that AdapAM achieves the best attack performance in different perturbation rates. Besides, AdapAM-generated perturbations are the least noisy and hardest to detect, emphasizing the stealthiness.
Paper and project links
Summary
Evaluating the security and reliability of multi-agent systems (MAS) is urgent. Existing adversarial attack frameworks are limited: they are impractical when they require white-box information or high control authority, and they lack stealthiness or effectiveness when they target all agents or specific fixed agents. AdapAM is a novel adversarial attack framework for black-box MAS built on two components. The Adaptive Selection Policy simultaneously selects the victim and determines the anticipated malicious action (the one with the worst impact on the MAS), balancing effectiveness and stealthiness. Proxy-based perturbation uses generative adversarial imitation learning to approximate the target MAS, so AdapAM can generate perturbed observations using white-box information and induce victims to execute malicious actions in black-box settings. Evaluated across eight multi-agent environments against four baselines, AdapAM achieves the best attack performance at different perturbation rates, and its perturbations are the least noisy and hardest to detect.
Key Takeaways
- Evaluating the security and reliability of multi-agent systems (MAS) is increasingly important.
- Existing adversarial attack frameworks have limitations, such as requiring white-box information or high control authority.
- AdapAM is a novel adversarial attack framework for black-box MAS.
- AdapAM has two core components: an Adaptive Selection Policy and proxy-based perturbation to induce malicious actions.
- The Adaptive Selection Policy balances attack effectiveness and stealthiness.
- The proxy-based perturbation uses generative adversarial imitation learning to approximate the target MAS.
Click here to view paper screenshots
Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy
Authors:Tomoki Nakao, Kazumi Kasaura, Tadashi Kozuno
We address the fundamental challenge of resolving symmetry-induced deadlocks in distributed multi-agent navigation by proposing a new hierarchical navigation method. When multiple agents interact, it is inherently difficult for them to autonomously break the symmetry of deciding how to pass each other. To tackle this problem, we introduce an approach that quantifies cooperative symmetry-breaking strategies using a topological invariant called the winding number, and learns the strategies themselves through reinforcement learning. Our method features a hierarchical policy consisting of a learning-based Planner, which plans topological cooperative strategies, and a model-based Controller, which executes them. Through reinforcement learning, the Planner learns to produce two types of parameters for the Controller: one is the topological cooperative strategy represented by winding numbers, and the other is a set of dynamic weights that determine which agent interaction to prioritize in dense scenarios where multiple agents cross simultaneously. The Controller then generates collision-free and efficient motions based on the strategy and weights provided by the Planner. This hierarchical structure combines the flexible decision-making ability of learning-based methods with the reliability of model-based approaches. Simulation and real-world robot experiments demonstrate that our method outperforms existing baselines, particularly in dense environments, by efficiently avoiding collisions and deadlocks while achieving superior navigation performance. The code for the experiments is available at https://github.com/omron-sinicx/WNumMPC.
Paper and project links
PDF 11 pages, 5 figures
Summary
This paper proposes a new hierarchical method for resolving symmetry-induced deadlocks in distributed multi-agent navigation. It quantifies cooperative symmetry-breaking strategies with a topological invariant, the winding number, and learns the strategies through reinforcement learning. The hierarchy pairs a learning-based Planner, which plans topological cooperative strategies, with a model-based Controller, which executes them. Through reinforcement learning, the Planner produces two kinds of parameters for the Controller: a topological cooperative strategy expressed as winding numbers, and a set of dynamic weights that decide which agent interactions to prioritize in dense scenarios where multiple agents cross simultaneously. The Controller then generates collision-free, efficient motions from the strategy and weights. The structure combines the flexible decision-making of learning-based methods with the reliability of model-based approaches, and simulation and real-robot experiments show it outperforms existing baselines, especially in dense environments.
Key Takeaways
- The winding number, a topological invariant, is introduced to resolve symmetry problems in multi-agent navigation.
- Cooperative symmetry-breaking strategies are learned through reinforcement learning.
- The hierarchical navigation method comprises a learning-based Planner and a model-based Controller.
- The Planner supplies the Controller with a topological cooperative strategy and a set of dynamic weights.
- The Controller generates collision-free, efficient motions from that strategy and those weights.
- The method combines the flexibility of learning-based methods with the reliability of model-based approaches.
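For two planar trajectories, a winding number can be computed by accumulating the signed angle swept by the agents' relative position vector. The sketch below is a toy discretization of that invariant, our own illustration only; the construction the paper uses inside its MPC may differ:

```python
# Toy winding-number computation for two paired 2-D trajectories:
# accumulate the signed, unwrapped change of angle of (b - a).
import math

def winding_number(traj_a, traj_b):
    """Signed number of turns of the relative vector (b - a)."""
    total = 0.0
    prev = None
    for (ax, ay), (bx, by) in zip(traj_a, traj_b):
        ang = math.atan2(by - ay, bx - ax)
        if prev is not None:
            d = ang - prev
            while d > math.pi:       # unwrap jump across the branch cut
                d -= 2 * math.pi
            while d <= -math.pi:
                d += 2 * math.pi
            total += d
        prev = ang
    return total / (2 * math.pi)

# Agent b circles once counterclockwise around a stationary agent a:
n = 64
a = [(0.0, 0.0)] * (n + 1)
b = [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n))
     for k in range(n + 1)]
print(round(winding_number(a, b)))   # one full counterclockwise turn -> 1
```

A positive value means the agents pass each other counterclockwise, a negative value clockwise, which is exactly the kind of discrete symmetry-breaking choice the Planner must commit to.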
Click here to view paper screenshots
OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
Authors:Xinli Tao, Xin Dong, Xuezhong Zhou
Clinical named entity recognition (NER) is crucial for extracting information from electronic health records (EHRs), but supervised models like CRF and BioClinicalBERT require costly annotated data. While zero-shot NER with large language models (LLMs) reduces this dependency, it struggles with example selection granularity and integrating prompts with self-improvement. To address this, we propose OEMA, a zero-shot clinical NER framework using multi-agent collaboration. OEMA’s three components are: a self-annotator generating examples, a discriminator filtering them via SNOMED CT, and a predictor using entity descriptions for accurate inference. On MTSamples and VAERS datasets, OEMA achieves state-of-the-art exact-match performance. Under related-match, it matches supervised BioClinicalBERT and surpasses CRF. OEMA addresses key zero-shot NER challenges through ontology-guided reasoning and multi-agent collaboration, achieving near-supervised performance and showing promise for clinical NLP applications.
Paper and project links
PDF 12 pages, 4 figures, 4 tables
Summary
OEMA is a zero-shot clinical named entity recognition (NER) framework built on large language models and multi-agent collaboration: a self-annotator generates examples, a discriminator filters them via SNOMED CT, and a predictor uses entity descriptions for accurate inference. On the MTSamples and VAERS datasets it achieves state-of-the-art exact-match performance, approaching supervised models and showing promise for clinical NLP applications.
Key Takeaways
- Clinical named entity recognition (NER) is key to extracting information from electronic health records (EHRs).
- Supervised models such as CRF and BioClinicalBERT require costly annotated data.
- Zero-shot NER with large language models (LLMs) reduces the dependency on annotated data.
- OEMA is a zero-shot clinical NER framework in which multiple agents collaborate to self-annotate, filter, and predict.
- OEMA achieves state-of-the-art exact-match performance on the MTSamples and VAERS datasets.
- OEMA addresses key zero-shot NER challenges such as example-selection granularity and integrating prompts with self-improvement.
Click here to view paper screenshots
MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents
Authors:Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen, Wenrao Pang, Ruiyao Chen, Xinwei Peng, Renjie Lu, Sijie Ren, Guanxu Zhu, Xiaoqin Wu, Zhiqiang Liu, Rongzhao Zhang, Luyi Jiang, Bing Han, Yunqiu Wang, Jie Xu
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
Paper and project links
Summary
Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. MedBench v4 is a nationwide, cloud-based benchmarking platform with over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties and dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. Across 15 frontier models, base LLMs average 54.1/100 overall (best: Claude Sonnet 4.5 at 62.5/100) but score low on safety and ethics (18.4/100); multimodal models average 47.5/100 (best: GPT-5 at 54.9/100), with solid perception but weaker cross-modal reasoning; agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents reaching 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persistent gaps in multimodal reasoning and safety for base models while showing that governance-aware agentic orchestration can markedly enhance clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.
Key Takeaways
- MedBench v4 is a nationwide, cloud-based benchmarking platform covering many medical specialties, with dedicated tracks for LLMs, multimodal models, and agents.
- Base LLMs perform reasonably overall but fall short on safety and ethics.
- Multimodal models show solid perception but weaker cross-modal reasoning.
- Agents substantially improve end-to-end performance, with Claude Sonnet 4.5-based agents standing out.
- MedBench v4 reveals persistent gaps in multimodal reasoning and safety for base models.
- Governance-aware agentic orchestration can markedly enhance clinical readiness without sacrificing capability.
Click here to view paper screenshots
Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports
Authors:Chenchen Kuai, Zihao Li, Braden Rosen, Stephanie Paal, Navid Jafari, Jean-Louis Briaud, Yunlong Zhang, Youssef M. A. Hashash, Yang Zhou
Post-disaster reconnaissance reports contain critical evidence for understanding multi-hazard interactions, yet their unstructured narratives make systematic knowledge transfer difficult. Large language models (LLMs) offer new potential for analyzing these reports, but often generate unreliable or hallucinated outputs when domain grounding is absent. This study introduces the Mixture-of-Retrieval Agentic RAG (MoRA-RAG), a knowledge-grounded LLM framework that transforms reconnaissance reports into a structured foundation for multi-hazard reasoning. The framework integrates a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases while using agentic chunking to preserve contextual coherence during retrieval. It also includes a verification loop that assesses evidence sufficiency, refines queries, and initiates targeted searches when information remains incomplete. We construct HazardRecQA by deriving question-answer pairs from GEER reconnaissance reports, which document 90 global events across seven major hazard types. MoRA-RAG achieves up to 94.5 percent accuracy, outperforming zero-shot LLMs by 30 percent and state-of-the-art RAG systems by 10 percent, while reducing hallucinations across diverse LLM architectures. MoRA-RAG also enables open-weight LLMs to achieve performance comparable to proprietary models. It establishes a new paradigm for transforming post-disaster documentation into actionable, trustworthy intelligence for hazard resilience.
Paper and project links
PDF 17 pages, 5 figures
Summary
MoRA-RAG is a knowledge-grounded large language model framework that turns reconnaissance reports into a structured foundation for multi-hazard reasoning. It combines a Mixture-of-Retrieval mechanism that dynamically routes queries across hazard-specific databases with agentic chunking that preserves contextual coherence during retrieval, plus a verification loop that assesses evidence sufficiency, refines queries, and launches targeted searches when information remains incomplete. On post-disaster reconnaissance reports the framework achieves high accuracy while reducing hallucinations, establishing a new paradigm for turning post-disaster documentation into actionable, trustworthy intelligence.
Key Takeaways:
- Large language models show potential for analyzing disaster reconnaissance reports.
- MoRA-RAG is a knowledge-grounded LLM framework for multi-hazard reconnaissance reports.
- Its Mixture-of-Retrieval mechanism and agentic chunking improve retrieval quality and contextual coherence.
- A verification loop assesses evidence sufficiency and initiates targeted searches when information is incomplete.
- MoRA-RAG reaches up to 94.5% accuracy on reconnaissance-report question answering.
- MoRA-RAG outperforms zero-shot LLMs and state-of-the-art RAG systems while reducing hallucinations.
Click here to view paper screenshots
S-DAG: A Subject-Based Directed Acyclic Graph for Multi-Agent Heterogeneous Reasoning
Authors:Jiangwen Dong, Zehui Lin, Wanyu Lin, Mingjin Zhang
Large Language Models (LLMs) have achieved impressive performance in complex reasoning problems. Their effectiveness highly depends on the specific nature of the task, especially the required domain knowledge. Existing approaches, such as mixture-of-experts, typically operate at the task level; they are too coarse to effectively solve the heterogeneous problems involving multiple subjects. This work proposes a novel framework that performs fine-grained analysis at subject level equipped with a designated multi-agent collaboration strategy for addressing heterogeneous problem reasoning. Specifically, given an input query, we first employ a Graph Neural Network to identify the relevant subjects and infer their interdependencies to generate an \textit{Subject-based Directed Acyclic Graph} (S-DAG), where nodes represent subjects and edges encode information flow. Then we profile the LLM models by assigning each model a subject-specific expertise score, and select the top-performing one for matching corresponding subject of the S-DAG. Such subject-model matching enables graph-structured multi-agent collaboration where information flows from the starting model to the ending model over S-DAG. We curate and release multi-subject subsets of standard benchmarks (MMLU-Pro, GPQA, MedMCQA) to better reflect complex, real-world reasoning tasks. Extensive experiments show that our approach significantly outperforms existing task-level model selection and multi-agent collaboration baselines in accuracy and efficiency. These results highlight the effectiveness of subject-aware reasoning and structured collaboration in addressing complex and multi-subject problems.
Paper and project links
PDF Accepted by AAAI 2026
Summary
Large language models (LLMs) perform strongly on complex reasoning problems, but their effectiveness depends heavily on the specific task and the required domain knowledge. Existing approaches such as mixture-of-experts operate at the task level and are too coarse for heterogeneous problems spanning multiple subjects. The proposed framework performs fine-grained, subject-level analysis with a dedicated multi-agent collaboration strategy: given an input query, a Graph Neural Network identifies the relevant subjects and infers their interdependencies to produce a Subject-based Directed Acyclic Graph (S-DAG), where nodes are subjects and edges encode information flow. Each LLM is profiled with subject-specific expertise scores, and the top-performing model is matched to each subject, enabling graph-structured multi-agent collaboration in which information flows from the starting model to the ending model along the S-DAG. Multi-subject subsets of standard benchmarks (MMLU-Pro, GPQA, MedMCQA) are curated and released to better reflect complex, real-world reasoning tasks, and experiments show the approach significantly outperforms task-level model selection and multi-agent collaboration baselines in both accuracy and efficiency.
Key Takeaways
- LLMs perform strongly on complex reasoning, but their effectiveness depends on the task and the required domain knowledge.
- Existing approaches such as mixture-of-experts fall short on heterogeneous problems spanning multiple subjects.
- The proposed framework performs fine-grained, subject-level analysis with a dedicated multi-agent collaboration strategy.
- A Graph Neural Network identifies the relevant subjects and generates a Subject-based Directed Acyclic Graph (S-DAG).
- Profiling LLMs with subject-specific expertise scores enables more precise model selection.
- Subject-model matching enables graph-structured multi-agent collaboration along the S-DAG.
- Curated multi-subject subsets of standard benchmarks show the approach significantly outperforms existing baselines.
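Once subjects and their dependencies are identified, routing over the S-DAG reduces to a topological traversal with per-subject model selection. The sketch below is a minimal illustration of that routing step only; the subject names, edges, and expertise scores are invented, and the paper's GNN-based graph construction is not shown:

```python
# Minimal S-DAG routing sketch: process subjects in topological order,
# assigning each subject to the model with the highest expertise score.
from graphlib import TopologicalSorter

# Predecessor map: "math" depends on "physics" (information flows
# physics -> math), and "medicine" depends on "math".
sdag = {"math": {"physics"}, "medicine": {"math"}, "physics": set()}

# Hypothetical per-model, per-subject expertise scores.
expertise = {
    "physics": {"model_a": 0.9, "model_b": 0.6},
    "math": {"model_a": 0.7, "model_b": 0.8},
    "medicine": {"model_a": 0.5, "model_b": 0.9},
}

order = list(TopologicalSorter(sdag).static_order())
assignment = {s: max(expertise[s], key=expertise[s].get) for s in order}
print(order)       # ['physics', 'math', 'medicine']
print(assignment)  # best-scoring model per subject along the flow
```

In the full system, each subject's model would consume the upstream subjects' intermediate outputs as context before producing its own.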
Click here to view paper screenshots
Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools
Authors:Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao
Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer’s success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model’s high-level reasoning and effective low-level execution.
Paper and Project Links
Summary
This paper addresses the practical challenge of automatically building Android applications. It introduces AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects, each paired with a verified solution from a subsequent commit. It also proposes GradleFixer, an LLM agent equipped with domain-specific tools for inspecting and manipulating the Gradle build environment, which achieves a high resolve rate through a strategy termed Tool Bridging. The paper highlights the potential of LLMs for fixing build failures and the role of Tool Bridging in turning the model's high-level knowledge into effective low-level actions.
Key Takeaways
- Android is the largest mobile platform, yet automatically building applications remains a challenge.
- Large Language Models (LLMs) show promise for code repair, but their use for fixing Android build errors remains underexplored.
- AndroidBuildBench is introduced, a benchmark of 1,019 build failures.
- GradleFixer, an LLM agent with domain-specific tools for the Gradle build environment, achieves a high resolve rate (81.4% pass@1) via Tool Bridging.
- GradleFixer's success suggests LLMs possess the high-level knowledge to solve these failures but struggle to translate it into effective low-level actions via a general-purpose shell.
MedLA: A Logic-Driven Multi-Agent Framework for Complex Medical Reasoning with Large Language Models
Authors:Siqi Ma, Jiajie Huang, Fan Zhang, Jinlin Wu, Yue Shen, Guohui Fan, Zhu Zhang, Zelin Zang
Answering complex medical questions requires not only domain expertise and patient-specific information, but also structured and multi-perspective reasoning. Existing multi-agent approaches often rely on fixed roles or shallow interaction prompts, limiting their ability to detect and resolve fine-grained logical inconsistencies. To address this, we propose \textsc{MedLA}, a logic-driven multi-agent framework built on large language models. Each agent organizes its reasoning process into an explicit logical tree based on syllogistic triads (major premise, minor premise, and conclusion), enabling transparent inference and premise-level alignment. Agents engage in a multi-round, graph-guided discussion to compare and iteratively refine their logic trees, achieving consensus through error correction and contradiction resolution. We demonstrate that \textsc{MedLA} consistently outperforms both static role-based systems and single-agent baselines on challenging benchmarks such as MedDDx and standard medical QA tasks. Furthermore, \textsc{MedLA} scales effectively across both open-source and commercial LLM backbones, achieving state-of-the-art performance and offering a generalizable paradigm for trustworthy medical reasoning.
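The syllogistic triads underlying MedLA's logic trees can be pictured as a tiny data structure plus a premise-level comparison between two agents' conclusions. The triad contents and the conflict rule below are illustrative stand-ins, not the paper's actual alignment procedure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triad:
    major: str       # general rule (major premise)
    minor: str       # case-specific fact (minor premise)
    conclusion: str  # inferred conclusion

def premise_conflicts(a: Triad, b: Triad) -> bool:
    """Premise-level alignment check (illustrative): two agents share
    the same premises yet reach different conclusions, a fine-grained
    inconsistency that a discussion round would need to resolve."""
    return (a.major == b.major and a.minor == b.minor
            and a.conclusion != b.conclusion)

agent1 = Triad("ACE inhibitors can cause a dry cough",
               "the patient takes lisinopril",
               "drug-induced cough")
agent2 = Triad("ACE inhibitors can cause a dry cough",
               "the patient takes lisinopril",
               "infectious cough")
```

Making premises explicit is what lets disagreement be localized to a single node of the logic tree instead of an opaque final answer.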
Paper and Project Links
PDF accepted by AAAI-26 (ORAL)
Summary
The paper proposes MedLA, a logic-driven multi-agent framework built on large language models for complex medical questions. Each agent organizes its reasoning into an explicit logical tree based on syllogistic triads (major premise, minor premise, and conclusion), enabling transparent inference and premise-level alignment. Agents engage in multi-round, graph-guided discussion to compare and iteratively refine their logic trees, reaching consensus through error correction and contradiction resolution. MedLA performs strongly on benchmarks such as MedDDx and standard medical QA tasks, and scales effectively across open-source and commercial LLM backbones.
Key Takeaways
- MedLA is a logic-driven multi-agent framework for complex medical reasoning.
- Built on large language models, it integrates domain expertise with patient-specific information.
- Agents reason through explicit logical trees, improving the transparency and accuracy of inference.
- Multi-round, graph-guided discussion among agents detects and resolves logical inconsistencies.
- MedLA outperforms static role-based systems and single-agent baselines on challenging benchmarks.
- It generalizes to medical QA systems and other related tasks.
Core Safety Values for Provably Corrigible Agents
Authors:Aran Nayebi
We introduce the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. Our framework consists of five structurally separate utility heads – deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward – combined lexicographically by strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error $\varepsilon$ and the planner is $\varepsilon$-sub-optimal, the probability of violating any safety property is bounded while still ensuring net human benefit. In contrast to Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, our separation makes obedience and impact-limits provably dominate even when incentives conflict. For settings where adversaries can modify the agent, we prove that deciding whether an arbitrary post-hack agent will ever violate corrigibility is undecidable by reduction to the halting problem, then carve out a finite-horizon “decidable island” where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.
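The lexicographic combination via strict weight gaps can be illustrated with a toy construction: if every utility head's value lies in [0, 1], a base weight larger than the sum of all lower-priority weights makes each higher-priority head dominate everything beneath it. This is a sketch of the general idea under those bounding assumptions, not the paper's exact weighting.

```python
# Five utility heads, highest priority first, combined lexicographically.
HEADS = ["deference", "switch_access", "truthfulness", "low_impact", "task_reward"]
W = 10.0  # strict gap: W**k exceeds the sum of all lower weights (< W**k / 9 * ... )

def combined_utility(values: dict[str, float]) -> float:
    """Weighted sum with geometrically spaced weights: the deference head
    gets W**4, the bounded task reward gets W**0. With head values in
    [0, 1], a unit gain in any head outweighs all lower heads combined."""
    return sum(values[h] * W ** (len(HEADS) - 1 - i) for i, h in enumerate(HEADS))

# An obedient policy sacrifices all task reward; a defiant one trades
# deference for full task reward. Obedience must still win.
obedient = {"deference": 1.0, "switch_access": 1.0, "truthfulness": 1.0,
            "low_impact": 1.0, "task_reward": 0.0}
defiant = {"deference": 0.0, "switch_access": 1.0, "truthfulness": 1.0,
           "low_impact": 1.0, "task_reward": 1.0}
```

This is the structural point the abstract contrasts with RLHF/RLAIF: because the heads are kept separate and gapped, no amount of task reward can buy a violation of a higher-priority safety head.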
Paper and Project Links
PDF 14 pages. To appear in AAAI 2026 Machine Ethics Workshop (W37) Proceedings
Summary
The paper presents the first complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments. The framework combines five structurally separate utility heads (deference, switch-access preservation, truthfulness, low-impact behavior via a belief-based extension of Attainable Utility Preservation, and bounded task reward) lexicographically through strict weight gaps. Theorem 1 proves exact single-round corrigibility in the partially observable off-switch game; Theorem 3 extends the guarantee to multi-step, self-spawning agents, showing that even if each head is learned to mean-squared error ε and the planner is ε-sub-optimal, the probability of violating any safety property remains bounded while still ensuring net human benefit. Unlike Constitutional AI or RLHF/RLAIF, which merge all norms into one learned scalar, this separation makes obedience and impact limits provably dominant even when incentives conflict. For adversaries that can modify the agent, deciding whether an arbitrary post-hack agent will ever violate corrigibility is shown to be undecidable by reduction to the halting problem; the paper then carves out a finite-horizon "decidable island" where safety can be certified in randomized polynomial time and verified with privacy-preserving, constant-round zero-knowledge proofs.
Key Takeaways
- First complete formal solution to corrigibility in the off-switch game, with provable guarantees in multi-step, partially observed environments.
- The framework combines five structurally separate utility heads, including deference and switch-access preservation.
- Theorems 1 and 3 establish single-round and multi-step corrigibility; even with learning and planning errors of ε, the probability of violating any safety property stays bounded.
- Unlike approaches that merge all norms into one learned scalar, the separation keeps obedience and impact limits dominant under conflicting incentives.
- When adversaries can modify the agent, deciding post-hack corrigibility is undecidable, by reduction to the halting problem.
- A finite-horizon "decidable island" is carved out where safety certification and verification remain feasible.
Assemble Your Crew: Automatic Multi-agent Communication Topology Design via Autoregressive Graph Generation
Authors:Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, Shirui Pan
Multi-agent systems (MAS) based on large language models (LLMs) have emerged as a powerful solution for dealing with complex problems across diverse domains. The effectiveness of MAS is critically dependent on its collaboration topology, which has become a focal point for automated design research. However, existing approaches are fundamentally constrained by their reliance on a template graph modification paradigm with a predefined set of agents and hard-coded interaction structures, significantly limiting their adaptability to task-specific requirements. To address these limitations, we reframe MAS design as a conditional autoregressive graph generation task, where both the system composition and structure are designed jointly. We propose ARG-Designer, a novel autoregressive model that operationalizes this paradigm by constructing the collaboration graph from scratch. Conditioned on a natural language task query, ARG-Designer sequentially and dynamically determines the required number of agents, selects their appropriate roles from an extensible pool, and establishes the optimal communication links between them. This generative approach creates a customized topology in a flexible and extensible manner, precisely tailored to the unique demands of different tasks. Extensive experiments across six diverse benchmarks demonstrate that ARG-Designer not only achieves state-of-the-art performance but also enjoys significantly greater token efficiency and enhanced extensibility. The source code of ARG-Designer is available at https://github.com/Shiy-Li/ARG-Designer.
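The autoregressive generation loop (agent count, then roles, then links, each step conditioned on what was generated so far) can be caricatured with random draws standing in for the learned conditional distributions. `ROLE_POOL`, the sampling rules, and the edge probability below are illustrative assumptions, not ARG-Designer's model.

```python
import random

ROLE_POOL = ["planner", "coder", "tester", "critic", "summarizer"]  # extensible

def generate_topology(task: str, seed: int = 0) -> dict:
    """Illustrative autoregressive sketch: (1) sample the number of
    agents, (2) sample a role per slot, (3) sample directed links.
    A trained model would condition each step on the task query and
    the partial graph; here a seeded RNG stands in for that."""
    rng = random.Random(seed)
    n = rng.randint(2, 4)                               # step 1: agent count
    roles = [rng.choice(ROLE_POOL) for _ in range(n)]   # step 2: roles
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < 0.5]                     # step 3: links
    return {"roles": roles, "edges": edges}

topo = generate_topology("write and test a parser")
```

Restricting edges to `i < j` keeps the sampled communication graph acyclic by construction, one simple way a generator can guarantee a valid topology at every step.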
Paper and Project Links
PDF Accepted as an oral presentation by AAAI 2026
Summary
Multi-agent systems (MAS) built on large language models have become a powerful approach for complex problems across domains, and their effectiveness depends critically on the collaboration topology. Existing automated design methods rely on a template graph modification paradigm with predefined agents and hard-coded interaction structures, limiting task-specific adaptability. This paper reframes MAS design as a conditional autoregressive graph generation task and proposes ARG-Designer, which constructs the collaboration graph from scratch: conditioned on a natural language task query, it sequentially determines the number of agents, selects their roles from an extensible pool, and establishes the optimal communication links. This generative approach produces customized topologies in a flexible, extensible manner, tailored to each task. Across six diverse benchmarks, ARG-Designer achieves state-of-the-art performance with significantly greater token efficiency and extensibility. Code: https://github.com/Shiy-Li/ARG-Designer.
Key Takeaways
- Multi-agent systems based on large language models are powerful tools for complex problems.
- The collaboration topology is key to MAS effectiveness, and existing automated design methods are limited.
- ARG-Designer reframes MAS design as a conditional autoregressive graph generation task.
- Conditioned on a natural language task query, ARG-Designer dynamically constructs the collaboration graph, determining the number of agents, their roles, and their communication links.
- The approach is flexible and extensible, adapting to the unique demands of different tasks.
- ARG-Designer achieves the best results across six benchmarks, with high token efficiency and strong extensibility.
Agent-SAMA: State-Aware Mobile Assistant
Authors:Linqiang Guo, Wei Liu, Yi Wen Heng, Tse-Hsun Chen, Yang Wang
Mobile Graphical User Interface (GUI) agents aim to autonomously complete tasks within or across apps based on user instructions. While recent Multimodal Large Language Models (MLLMs) enable these agents to interpret UI screens and perform actions, existing agents remain fundamentally reactive. They reason over the current UI screen but lack a structured representation of the app navigation flow, limiting GUI agents’ ability to understand execution context, detect unexpected execution results, and recover from errors. We introduce Agent-SAMA, a state-aware multi-agent framework that models app execution as a Finite State Machine (FSM), treating UI screens as states and user actions as transitions. Agent-SAMA implements four specialized agents that collaboratively construct and use FSMs in real time to guide task planning, execution verification, and recovery. We evaluate Agent-SAMA on two types of benchmarks: cross-app (Mobile-Eval-E, SPA-Bench) and mostly single-app (AndroidWorld). On Mobile-Eval-E, Agent-SAMA achieves an 84.0% success rate and a 71.9% recovery rate. On SPA-Bench, it reaches an 80.0% success rate with a 66.7% recovery rate. Compared to prior methods, Agent-SAMA improves task success by up to 12% and recovery success by 13.8%. On AndroidWorld, Agent-SAMA achieves a 63.7% success rate, outperforming the baselines. Our results demonstrate that structured state modeling enhances robustness and can serve as a lightweight, model-agnostic memory layer for future GUI agents.
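Modeling app execution as an FSM with screens as states and actions as transitions can be sketched in a few lines: an expected-transition table, and a check that flags any observed screen deviating from the expectation. The transition table and the recovery signal here are toy stand-ins for Agent-SAMA's collaboratively constructed FSMs.

```python
# Expected transitions: (current screen, action) -> next screen.
TRANSITIONS = {
    ("inbox", "tap_compose"): "compose",
    ("compose", "tap_send"): "sent_confirmation",
}

def step(state: str, action: str, observed: str):
    """Execution verification against the FSM: if the observed screen
    differs from the expected next state (or the transition is unknown),
    signal recovery instead of blindly continuing."""
    expected = TRANSITIONS.get((state, action))
    if expected is None or observed != expected:
        return "recover", expected  # deviation detected
    return "ok", observed

status, _ = step("inbox", "tap_compose", "compose")
bad_status, expected = step("compose", "tap_send", "login_prompt")
```

This is what makes the agent proactive rather than reactive: the FSM supplies an expectation to verify against, so an unexpected login prompt is caught at the step where it occurs.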
Paper and Project Links
PDF Accepted to AAAI-26 (Main Technical Track)
Summary
Mobile GUI agents aim to complete tasks within or across apps based on user instructions. While multimodal large language models (MLLMs) let these agents interpret UI screens and perform actions, existing agents remain fundamentally reactive: they reason over the current screen but lack a structured representation of the app navigation flow, limiting their ability to understand execution context, detect unexpected execution results, and recover from errors. Agent-SAMA is a state-aware multi-agent framework that models app execution as a Finite State Machine (FSM), treating UI screens as states and user actions as transitions. Four specialized agents collaboratively construct and use FSMs in real time to guide task planning, execution verification, and recovery. Evaluations show high success and recovery rates on cross-app benchmarks and strong results on single-app tasks, demonstrating that structured state modeling improves robustness and can serve as a lightweight, model-agnostic memory layer for future GUI agents.
Key Takeaways
- Mobile GUI agents can follow user instructions, but they lack a structured representation of app navigation flow in complex scenarios.
- Agent-SAMA models app execution as an FSM, treating UI screens as states and user actions as transitions.
- Four specialized agents construct and use FSMs in real time, improving task planning and execution verification.
- Agent-SAMA achieves high success and recovery rates on both cross-app and single-app tasks.
- Compared with prior methods, Agent-SAMA improves both task success and recovery rates.
- Structured state modeling enhances the robustness of GUI agents.
TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials
Authors:Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, Qing Li
Building Graphical User Interface (GUI) agents is a promising research direction, which simulates human interaction with computers or mobile phones to perform diverse GUI tasks. However, a major challenge in developing generalized GUI agents is the lack of sufficient trajectory data across various operating systems and applications, mainly due to the high cost of manual annotations. In this paper, we propose the TongUI framework that builds generalized GUI agents by learning from rich multimodal web tutorials. Concretely, we crawl and process online GUI tutorials (such as videos and articles) into GUI agent trajectory data, through which we produce the GUI-Net dataset containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net, which show remarkable performance improvements on commonly used grounding and navigation benchmarks, outperforming baseline agents about 10% on multiple benchmarks, showing the effectiveness of the GUI-Net dataset and underscoring the significance of our TongUI framework. We will fully open-source the code, the GUI-Net dataset, and the trained models soon.
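One way to picture the tutorial-to-trajectory conversion is a record pairing a tutorial frame with its instruction and a grounded action. The field names and coordinate convention below are hypothetical illustrations, not GUI-Net's actual schema.

```python
def tutorial_step_to_sample(screenshot_path: str, instruction: str,
                            action_type: str, target: dict) -> dict:
    """Convert one parsed tutorial step (e.g. a video frame plus the
    narrated instruction) into a training sample of the kind a GUI
    agent could be fine-tuned on."""
    return {
        "image": screenshot_path,
        "instruction": instruction,
        "action": {"type": action_type, "target": target},
    }

sample = tutorial_step_to_sample(
    "frames/0042.png",
    "Open the Settings app",
    "tap",
    {"x": 0.62, "y": 0.18},  # normalized screen coordinates (assumed)
)
```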
Paper and Project Links
PDF AAAI 2026
Summary
This paper proposes TongUI, a framework for building generalized GUI agents by learning from rich multimodal web tutorials, addressing the shortage of trajectory data. By crawling and processing online GUI tutorials (such as videos and articles) into agent trajectory data, the authors produce the GUI-Net dataset, with 143K trajectories across five operating systems and more than 200 applications. Fine-tuning Qwen2.5-VL-3B/7B on GUI-Net yields the TongUI agent, which shows notable gains on common grounding and navigation benchmarks, outperforming baseline agents by about 10% on multiple benchmarks. This demonstrates the effectiveness of GUI-Net and the significance of the TongUI framework. The code, dataset, and trained models will be open-sourced.
Key Takeaways
- TongUI builds generalized GUI agents by learning from multimodal web tutorials, addressing the scarcity of trajectory data.
- By crawling and processing online GUI tutorials, TongUI creates the GUI-Net dataset, containing 143K trajectories across five operating systems and more than 200 applications.
- The TongUI agent, built by fine-tuning Qwen2.5-VL-3B/7B on GUI-Net, shows strong performance.
- The TongUI agent outperforms baseline agents by about 10% on grounding and navigation benchmarks.
- The GUI-Net dataset is key to the TongUI agent's performance gains.
- TongUI's significance lies in exploiting rich web resources to solve the trajectory-data problem in GUI agent development.
Adversarial Agents: Black-Box Evasion Attacks with Reinforcement Learning
Authors:Kyle Domico, Jean-Charles Noirot Ferrand, Ryan Sheatsley, Eric Pauley, Josiah Hanna, Patrick McDaniel
Attacks on machine learning models have been extensively studied through stateless optimization. In this paper, we demonstrate how a reinforcement learning (RL) agent can learn a new class of attack algorithms that generate adversarial samples. Unlike traditional adversarial machine learning (AML) methods that craft adversarial samples independently, our RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. We formulate adversarial sample generation as a Markov Decision Process and evaluate RL’s ability to (a) learn effective and efficient attack strategies and (b) compete with state-of-the-art AML. On two image classification benchmarks, our agent increases attack success rate by up to 13.2% and decreases the average number of victim model queries per attack by up to 16.9% from the start to the end of training. In a head-to-head comparison with state-of-the-art image attacks, our approach enables an adversary to generate adversarial samples with 17% more success on unseen inputs post-training. From a security perspective, this work demonstrates a powerful new attack vector that uses RL to train agents that attack ML models efficiently and at scale.
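The MDP framing can be illustrated with a one-dimensional toy: the action enlarges a (scalar) perturbation, each probe costs one victim-model query, and the reward trades evasion success against queries spent. The stand-in victim model, the threshold, and the reward shape are assumptions for illustration, not the paper's benchmarks or agent.

```python
def victim_confidence(x: float) -> float:
    """Stand-in black-box victim: confidence in the true class decays
    as the (scalar) perturbation magnitude x grows."""
    return max(0.0, 1.0 - 0.3 * x)

def run_episode(policy_step: float, max_queries: int = 10) -> float:
    """One episode of the toy MDP: state is the perturbation so far,
    the action adds policy_step to it, and the terminal reward is
    success minus a small per-query cost (0.0 on failure)."""
    x, queries = 0.0, 0
    while queries < max_queries:
        x += policy_step                  # action: enlarge perturbation
        queries += 1                      # each probe is one victim query
        if victim_confidence(x) < 0.5:    # evasion succeeded
            return 1.0 - 0.01 * queries   # reward penalizes query count
    return 0.0

reward = run_episode(policy_step=0.5)  # succeeds after 4 queries -> 0.96
```

An RL agent trained on episodes like this would learn a step-size policy that maximizes expected reward, which is exactly the success-versus-query-efficiency trade-off the abstract reports improving over training.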
Paper and Project Links
Summary
Attacks on machine learning models have mostly been studied through stateless optimization. This paper shows that a reinforcement learning (RL) agent can learn a new class of attack algorithms for generating adversarial samples. Unlike traditional adversarial machine learning methods that craft samples independently, the RL-based approach retains and exploits past attack experience to improve the effectiveness and efficiency of future attacks. Formulating adversarial sample generation as a Markov Decision Process, the agent raises attack success rate by up to 13.2% and cuts the average number of victim-model queries per attack by up to 16.9% over the course of training on two image classification benchmarks. In a head-to-head comparison with state-of-the-art image attacks, the approach generates adversarial samples with 17% more success on unseen inputs after training. From a security perspective, this demonstrates a powerful new attack vector that uses RL to attack ML models efficiently and at scale.
Key Takeaways
- Reinforcement learning can learn new attack algorithms for generating adversarial samples.
- Unlike traditional methods that craft adversarial samples independently, the attack retains past experience to improve future attacks.
- The RL agent performs strongly on image classification benchmarks, with markedly higher attack success rates and fewer victim-model queries.
- RL-generated adversarial samples achieve 17% more success on unseen inputs, posing new security challenges and research directions for systems that rely on machine learning.
Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs
Authors:Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
Knowledge graphs are widely used in industrial applications, making error detection crucial for ensuring the reliability of downstream applications. Existing error detection methods often fail to effectively utilize fine-grained subgraph information and rely solely on fixed graph structures, while also lacking transparency in their decision-making processes, which results in suboptimal detection performance. In this paper, we propose a novel Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that utilizes multiple large language models (LLMs) in a collaborative setting. By concatenating fine-grained, bidirectional subgraph embeddings with LLM-based query embeddings during training, our framework integrates these representations to produce four specialized agents. These agents utilize subgraph information from different dimensions to engage in multi-round discussions, thereby improving error detection accuracy and ensuring a transparent decision-making process. Extensive experiments on FB15K and WN18RR demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the accuracy and robustness of KG evaluation. For specific industrial scenarios, our framework can facilitate the training of specialized agents using domain-specific knowledge graphs for error detection, which highlights the potential industrial application value of our framework. Our code and datasets are available at https://github.com/kse-ElEvEn/MAKGED.
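The embedding fusion and multi-agent decision can be sketched minimally: concatenate a subgraph embedding with a query embedding, then aggregate four agents' verdicts. Both the toy vectors and the tie-toward-flagging vote below are illustrative stand-ins, not MAKGED's trained representations or discussion protocol.

```python
def fuse(subgraph_emb: list[float], query_emb: list[float]) -> list[float]:
    """Concatenate a fine-grained subgraph embedding with an LLM-based
    query embedding, as in the training-time fusion described above."""
    return subgraph_emb + query_emb

def majority_vote(verdicts: list[bool]) -> bool:
    """Aggregate the four specialized agents' post-discussion verdicts.
    True means 'this triple is erroneous'; ties resolve toward flagging
    (a conservative choice assumed here for illustration)."""
    return sum(verdicts) * 2 >= len(verdicts)

fused = fuse([0.1, 0.2], [0.3, 0.4])          # 4-dim fused representation
decision = majority_vote([True, True, False, True])
```

Keeping each agent's verdict explicit before aggregation is what makes the final decision auditable, which is the transparency property the abstract emphasizes.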
Paper and Project Links
PDF This paper has been ACCEPTED as a FULL PAPER at DASFAA 2025 (Oral)
Summary
Error detection is crucial for knowledge graphs in industrial applications. Existing methods underuse fine-grained subgraph information, rely on fixed graph structures, and lack transparency in their decision-making. This paper proposes MAKGED, a multi-agent framework for knowledge graph error detection that uses multiple large language models collaboratively: fine-grained, bidirectional subgraph embeddings are concatenated with LLM-based query embeddings during training to produce four specialized agents, which engage in multi-round discussion to improve detection accuracy while keeping the decision process transparent. Experiments on FB15K and WN18RR show that MAKGED outperforms state-of-the-art methods, and the framework can train specialized agents on domain-specific knowledge graphs, highlighting its industrial application value.
Key Takeaways
- Error detection is critical for knowledge graphs in industrial applications.
- Existing methods underuse fine-grained subgraph information, leading to suboptimal detection performance.
- MAKGED combines large language models with a multi-round discussion mechanism to improve error detection accuracy.
- MAKGED keeps the decision-making process transparent.
- MAKGED outperforms existing methods on FB15K and WN18RR.
- MAKGED has industrial application value: specialized agents can be trained on domain-specific knowledge graphs for error detection.