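⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use this for serious academic purposes; it is suitable only for initial screening before reading the papers.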
Updated 2025-11-22
D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies
Authors:Sen Chen, Tong Zhao, Yi Bin, Fei Ma, Wenqi Shao, Zheng Wang
Developing intelligent agents capable of operating a wide range of Graphical User Interfaces (GUIs) with human-level proficiency is a key milestone on the path toward Artificial General Intelligence. However, most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, failing to reflect the complexity and unpredictability of real-world environments, particularly the presence of anomalies. To bridge this research gap, we propose D-GARA, a dynamic benchmarking framework, to evaluate Android GUI agent robustness in real-world anomalies. D-GARA introduces a diverse set of real-world anomalies that GUI agents commonly face in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Based on the D-GARA framework, we construct and annotate a benchmark featuring commonly used Android applications with embedded anomalies to support broader community research. Comprehensive experiments and results demonstrate substantial performance degradation in state-of-the-art GUI agents when exposed to anomaly-rich environments, highlighting the need for robustness-aware learning. D-GARA is modular and extensible, supporting the seamless integration of new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
Paper and project links
PDF Accepted to AAAI 2026
Summary
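Intelligent agents that can operate a wide range of graphical user interfaces (GUIs) with human-level proficiency are a key milestone toward artificial general intelligence. Most existing datasets and benchmarks for training and evaluating GUI agents are static and idealized, so they fail to reflect the complexity and unpredictability of real-world environments, in particular the presence of anomalies. To close this gap, the paper proposes D-GARA, a dynamic benchmarking framework for evaluating the robustness of Android GUI agents under real-world anomalies. D-GARA introduces a range of anomalies that GUI agents commonly encounter in practice, including interruptions such as permission dialogs, battery warnings, and update prompts. Using the framework, the authors construct and annotate a benchmark of commonly used Android applications with embedded anomalies to support broader community research. Experiments show that state-of-the-art GUI agents degrade substantially in anomaly-rich environments, which highlights the need for robustness-aware learning. D-GARA is modular and extensible, and supports integrating new tasks, anomaly types, and interaction scenarios to meet specific evaluation goals.
Key Takeaways
- Agents that operate a wide range of GUIs are a key milestone toward artificial general intelligence.
- Existing datasets and benchmarks for training and evaluating GUI agents are static and idealized and do not reflect real-world complexity.
- D-GARA evaluates the robustness of Android GUI agents under real-world anomalies.
- D-GARA introduces common real-world anomalies such as permission dialogs, battery warnings, and update prompts.
- A benchmark of commonly used Android applications was built on the D-GARA framework.
- Experiments show that existing GUI agents degrade substantially in anomaly-rich environments.
- D-GARA is modular and extensible and can support different evaluation goals and scenarios.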
Trustworthy AI in the Agentic Lakehouse: from Concurrency to Governance
Authors:Jacopo Tagliabue, Federico Bianchi, Ciro Greco
Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. In this paper, we argue that the path to trustworthy agentic workflows begins with solving the infrastructure problem first: traditional lakehouses are not suited for agent access patterns, but if we design one around transactions, governance follows. In particular, we draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. We then propose an agent-first design, Bauplan, that reimplements data and compute isolation in the lakehouse. We conclude by sharing a reference implementation of a self-healing pipeline in Bauplan, which seamlessly couples agent reasoning with all the desired guarantees for correctness and trust.
Paper and project links
PDF AAAI26, pre-print of paper accepted at the Trustworthy Agentic AI Workshop
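Summary
Even as AI capabilities improve, most enterprises do not consider agents trustworthy enough to work on production data. The paper argues that trustworthy agentic workflows start with the infrastructure problem: traditional lakehouses are not suited to agent access patterns, but if a lakehouse is designed around transactions, governance follows. The authors draw an operational analogy to MVCC in databases and show why a direct transplant fails in a decoupled, multi-language setting. They then propose Bauplan, an agent-first design that reimplements data and compute isolation in the lakehouse. Finally, they share a reference implementation of a self-healing pipeline in Bauplan that couples agent reasoning with the desired guarantees for correctness and trust.
Key Takeaways
- Most enterprises do not yet trust agents to work on production data.
- Trustworthy agentic workflows require solving the infrastructure problem first.
- Traditional lakehouses are not suited to agent access patterns.
- Designing the lakehouse around transactions lets governance follow.
- An operational analogy to MVCC explains why a direct transplant fails in a decoupled, multi-language setting.
- Bauplan, an agent-first design, reimplements data and compute isolation in the lakehouse.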
ChemLabs on ChemO: A Multi-Agent System for Multimodal Reasoning on IChO 2025
Authors:Xu Qiang, Shengyuan Bai, Leqing Chen, Zijing Liu, Yu Li
Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, has remained an open challenge. We introduce ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two key innovations for automated assessment: Assessment-Equivalent Reformulation (AER), which converts problems requiring visual outputs (e.g., drawing molecules) into computationally tractable formats, and Structured Visual Enhancement (SVE), a diagnostic mechanism to disentangle a model’s visual perception capabilities from its core chemical reasoning. To tackle this benchmark, we propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models demonstrate that combining SVE with our multi-agent system yields dramatic performance gains. Our top configuration achieves a score of 93.6 out of 100, surpassing an estimated human gold medal threshold and establishing a new state-of-the-art in automated chemical problem-solving. ChemO Dataset: https://huggingface.co/datasets/IDEA-AI4SCI/ChemO
Paper and project links
PDF 13 pages, 1 figure
Summary
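Olympiad-level benchmarks in mathematics and physics are crucial testbeds for advanced AI reasoning, but chemistry, with its unique multimodal symbolic language, remains an open challenge. The paper introduces ChemO, a new benchmark built from the International Chemistry Olympiad (IChO) 2025. ChemO features two innovations for automated assessment: Assessment-Equivalent Reformulation (AER) and Structured Visual Enhancement (SVE). AER converts problems that require visual outputs, such as drawing molecules, into computationally tractable formats; SVE is a diagnostic mechanism that separates a model's visual perception from its core chemical reasoning. To tackle the benchmark, the authors propose ChemLabs, a hierarchical multi-agent framework that mimics human expert collaboration through specialized agents for problem decomposition, perception, reasoning, and auditing. Experiments on state-of-the-art multimodal models show that combining SVE with the multi-agent system yields large performance gains. The best configuration scores 93.6 out of 100, which exceeds an estimated human gold medal threshold and sets a new state of the art in automated chemical problem-solving.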
Key Takeaways
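- ChemO is a new benchmark built from the International Chemistry Olympiad (IChO) 2025, designed to assess automated chemical problem-solving.
- ChemO introduces two innovations, Assessment-Equivalent Reformulation (AER) and Structured Visual Enhancement (SVE), used for problem conversion and model diagnosis respectively.
- AER converts chemistry problems that require visual outputs into computationally tractable formats.
- SVE helps separate a model's visual perception from its core chemical reasoning.
- ChemLabs is a hierarchical multi-agent framework that mimics human expert collaboration, with agents for problem decomposition, perception, reasoning, and auditing.
- Combining SVE with ChemLabs yields large performance gains on state-of-the-art multimodal models.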
Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning
Authors:Peng Xia, Kaide Zeng, Jiaqi Liu, Can Qin, Fang Wu, Yiyang Zhou, Caiming Xiong, Huaxiu Yao
Large Language Model (LLM) Agents, often trained with Reinforcement Learning (RL), are constrained by a dependency on human-curated data, limiting scalability and tethering AI to human knowledge. Existing self-evolution frameworks offer an alternative but are typically restricted by the model’s inherent capabilities and single-round interactions, hindering the development of complex curricula involving tool use or dynamic reasoning. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and seamless tool integration. Agent0 establishes a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. We integrate external tools to enhance the executor’s problem-solving capacity; this improvement, in turn, pressures the curriculum agent to construct more complex, tool-aware tasks. Through this iterative process, Agent0 establishes a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks. Code is available at https://github.com/aiming-lab/Agent0.
Paper and project links
Summary
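Large language model (LLM) agents are often trained with reinforcement learning (RL), but their dependence on human-curated data limits scalability and tethers AI to human knowledge. Existing self-evolution frameworks offer an alternative, yet they are typically restricted by the model's inherent capabilities and single-round interactions, which hinders complex curricula involving tool use or dynamic reasoning. Agent0 is a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and tool integration. It sets up a symbiotic competition between two agents initialized from the same base LLM: a curriculum agent that proposes increasingly challenging frontier tasks, and an executor agent that learns to solve them. External tools enhance the executor's problem-solving capacity, and that improvement in turn pressures the curriculum agent to construct more complex, tool-aware tasks. This iterative process forms a self-reinforcing cycle that continuously produces high-quality curricula. Empirically, Agent0 improves the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks.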
Key Takeaways
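- Agent0 is a fully autonomous framework that evolves high-performing agents without external data through multi-step co-evolution and tool integration.
- It sets up a symbiotic competition between two agents: a curriculum agent that proposes challenging tasks and an executor agent that learns to solve them.
- External tools are integrated to enhance the executor agent's problem-solving capacity.
- The iterative process forms a self-reinforcing cycle that continuously produces high-quality curricula.
- Agent0 substantially boosts the reasoning capabilities of the underlying LLM.
- On Qwen3-8B-Base, the gains are 18% on mathematical reasoning and 24% on general reasoning benchmarks.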
Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming
Authors:Strahinja Janjusevic, Anna Baron Garcia, Sohrob Kazerounian
Generative AI is reshaping offensive cybersecurity by enabling autonomous red team agents that can plan, execute, and adapt during penetration tests. However, existing approaches face trade-offs between generality and specialization, and practical deployments reveal challenges such as hallucinations, context limitations, and ethical concerns. In this work, we introduce a novel command & control (C2) architecture leveraging the Model Context Protocol (MCP) to coordinate distributed, adaptive reconnaissance agents covertly across networks. Notably, we find that our architecture not only improves goal-directed behavior of the system as a whole, but also eliminates key host and network artifacts that can be used to detect and prevent command & control behavior altogether. We begin with a comprehensive review of state-of-the-art generative red teaming methods, from fine-tuned specialist models to modular or agentic frameworks, analyzing their automation capabilities against task-specific accuracy. We then detail how our MCP-based C2 can overcome current limitations by enabling asynchronous, parallel operations and real-time intelligence sharing without periodic beaconing. We furthermore explore advanced adversarial capabilities of this architecture, its detection-evasion techniques, and address dual-use ethical implications, proposing defensive measures and controlled evaluation in lab settings. Experimental comparisons with traditional C2 show drastic reductions in manual effort and detection footprint. We conclude with future directions for integrating autonomous exploitation, defensive LLM agents, predictive evasive maneuvers, and multi-agent swarms. The proposed MCP-enabled C2 framework demonstrates a significant step toward realistic, AI-driven red team operations that can simulate advanced persistent threats while informing the development of next-generation defensive systems.
Paper and project links
PDF 23 pages, 9 figures, 3 tables. Submitted as a full paper for review
Summary
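Generative AI is reshaping offensive cybersecurity by enabling autonomous red team agents that can plan, execute, and adapt during penetration tests. Existing approaches face trade-offs between generality and specialization, and practical deployments reveal challenges such as hallucinations, context limitations, and ethical concerns. The paper introduces a command and control (C2) architecture that uses the Model Context Protocol (MCP) to coordinate distributed, adaptive reconnaissance agents covertly across networks. The architecture improves the goal-directed behavior of the system as a whole and eliminates key host and network artifacts that could otherwise be used to detect and prevent C2 behavior. The authors review existing generative red teaming methods, from fine-tuned specialist models to modular or agentic frameworks, and analyze their automation capabilities against task-specific accuracy. They then detail how the MCP-based C2 enables asynchronous, parallel operations and real-time intelligence sharing without periodic beaconing. The paper also explores the architecture's adversarial capabilities and detection-evasion techniques, addresses dual-use ethical implications, and proposes defensive measures and controlled evaluation in lab settings. Experimental comparisons with traditional C2 show drastic reductions in manual effort and detection footprint. Future directions include autonomous exploitation, defensive LLM agents, predictive evasive maneuvers, and multi-agent swarms. The framework is a step toward realistic, AI-driven red team operations that can simulate advanced persistent threats while informing next-generation defensive systems.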
Key Takeaways
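- Generative AI is reshaping offensive cybersecurity by enabling autonomous red team agents that plan, execute, and adapt during penetration tests.
- Existing approaches trade off generality against specialization and, in practice, suffer from hallucinations and context limitations. The proposed C2 architecture uses MCP to coordinate agents and eliminates key host and network artifacts that could be used for detection; the paper also addresses the dual-use ethical implications.
- The paper reviews existing generative red teaming methods and analyzes their automation capabilities against task-specific accuracy, which clarifies the strengths and limitations of each approach.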
Sensorium Arc: AI Agent System for Oceanic Data Exploration and Interactive Eco-Art
Authors:Noah Bissell, Ethan Paley, Joshua Harrison, Juliano Calil, Myungin Lee
Sensorium Arc (AI reflects on climate) is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. Built on a modular multi-agent system and retrieval-augmented large language model (LLM) framework, Sensorium enables natural spoken conversations with AI agents that embody the ocean’s perspective, generating responses that blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, Sensorium Arc reimagines ocean data not as an abstract dataset but as a living narrative. The project demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new paradigm for human-machine-ecosystem.
Paper and project links
PDF (to appear) NeurIPS 2025 Creative AI Track
Summary:
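Sensorium Arc is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker and guides users through immersive explorations of complex marine data. It combines a modular multi-agent system with a retrieval-augmented large language model (LLM) framework so that users can hold natural spoken conversations with AI agents that embody the ocean's perspective and blend scientific insight with ecological poetics. Through keyword detection and semantic parsing, the system dynamically triggers data visualizations and audiovisual playback based on time, location, and thematic cues drawn from the dialogue. Developed in collaboration with the Center for the Study of the Force Majeure and inspired by the eco-aesthetic philosophy of Newton Harrison, the project reimagines ocean data as a living narrative rather than an abstract dataset. It demonstrates the potential of conversational AI agents to mediate affective, intuitive access to high-dimensional environmental data and proposes a new human-machine-ecosystem paradigm.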
Key Takeaways:
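- Sensorium Arc is a real-time multimodal interactive AI agent system that personifies the ocean as a poetic speaker.
- It combines a modular multi-agent system with a retrieval-augmented LLM framework to support natural spoken conversation.
- The AI agents respond from the ocean's perspective, blending scientific insight with ecological poetics.
- Keyword detection and semantic parsing dynamically trigger data visualizations and audiovisual playback.
- The project was developed in collaboration with the Center for the Study of the Force Majeure and is inspired by Newton Harrison's eco-aesthetic philosophy.
- Sensorium Arc reimagines ocean data as a living narrative rather than an abstract dataset.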
Machine Learning Epidemic Predictions Using Agent-based Wireless Sensor Network Models
Authors:Chukwunonso Henry Nwokoye, Blessing Oluchi, Sharna Waldron, Peace Ezzeh
The lack of epidemiological data in wireless sensor networks (WSNs) is a fundamental difficulty in constructing robust models to forecast and mitigate threats such as viruses and worms. Many studies have examined different epidemic models for WSNs, focusing on how malware infections spread given the network’s specific properties, including energy limits and node mobility. In this study, an agent-based implementation of the susceptible-exposed-infected-recovered-vaccinated (SEIRV) mathematical model was employed for machine learning (ML) predictions. Using tools such as NetLogo’s BehaviorSpace and Python, two epidemic synthetic datasets were generated and prepared for the application of several ML algorithms. Posed as a regression problem, the infected and recovered nodes were predicted, and the performance of these algorithms is compared using the error metrics of the train and test sets. The predictions performed well, with low error metrics and high R^2 values (0.997, 1.000, 0.999, 1.000), indicating an effective fit to the training set. The validation values were lower (0.992, 0.998, 0.971, and 0.999), as is typical when evaluating model performance on unseen data. Based on the recorded performances, support vector, linear, Lasso, Ridge, and ElasticNet regression were among the worst-performing algorithms, while Random Forest, XGBoost, Decision Trees, and k-nearest neighbors achieved the best results.
Paper and project links
PDF 8 pages
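Summary
The lack of epidemiological data in wireless sensor networks (WSNs) is a fundamental difficulty in building robust models that forecast and mitigate threats such as viruses and worms. The study uses an agent-based implementation of the susceptible-exposed-infected-recovered-vaccinated (SEIRV) model for machine learning predictions. Two synthetic epidemic datasets were generated with tools such as NetLogo's BehaviorSpace and Python, and several ML algorithms were applied to predict infected and recovered nodes as a regression problem. The predictions have low error metrics and high R² values on the training set (0.997, 1.000, 0.999, 1.000), with lower validation values (0.992, 0.998, 0.971, 0.999), as is typical on unseen data. Support vector, linear, Lasso, Ridge, and ElasticNet regression were among the worst performers, while Random Forest, XGBoost, Decision Trees, and k-nearest neighbors achieved the best results.
Key Takeaways
- The lack of epidemiological data is a fundamental difficulty in building robust models to forecast and mitigate threats in WSNs.
- The study uses an agent-based implementation of the SEIRV model for machine learning predictions.
- Two synthetic datasets were generated with NetLogo's BehaviorSpace and Python.
- Infected and recovered nodes are predicted as a regression problem.
- The algorithms fit the training set well, with low error metrics and high R² values.
- Validation values on unseen data are slightly lower (0.971 to 0.999).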
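The R² values quoted in the abstract are coefficients of determination. As a minimal sketch of how that metric is computed for a regression task like this one, the snippet below implements R² = 1 - SS_res / SS_tot in plain Python. The function name `r_squared` and the sample node counts are illustrative assumptions and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variance around the mean of the observations.
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical infected-node counts (illustrative values, not from the paper).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 31.0, 39.0]
score = r_squared(observed, predicted)
print(f"R^2 = {score:.3f}")
```

A value close to 1 means the predictions explain nearly all of the variance in the observed counts; the abstract reports this score for both the training and validation sets.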
AquaSentinel: Next-Generation AI System Integrating Sensor Networks for Urban Underground Water Pipeline Anomaly Detection via Collaborative MoE-LLM Agent Architecture
Authors:Qiming Guo, Bishal Khatri, Wenbo Sun, Jinwen Tang, Hua Zhang, Wenlu Wang
Underground pipeline leaks and infiltrations pose significant threats to water security and environmental safety. Traditional manual inspection methods provide limited coverage and delayed response, often missing critical anomalies. This paper proposes AquaSentinel, a novel physics-informed AI system for real-time anomaly detection in urban underground water pipeline networks. We introduce four key innovations: (1) strategic sparse sensor deployment at high-centrality nodes combined with physics-based state augmentation to achieve network-wide observability from minimal infrastructure; (2) the RTCA (Real-Time Cumulative Anomaly) detection algorithm, which employs dual-threshold monitoring with adaptive statistics to distinguish transient fluctuations from genuine anomalies; (3) a Mixture of Experts (MoE) ensemble of spatiotemporal graph neural networks that provides robust predictions by dynamically weighting model contributions; (4) causal flow-based leak localization that traces anomalies upstream to identify source nodes and affected pipe segments. Our system strategically deploys sensors at critical network junctions and leverages physics-based modeling to propagate measurements to unmonitored nodes, creating virtual sensors that enhance data availability across the entire network. Experimental evaluation using 110 leak scenarios demonstrates that AquaSentinel achieves 100% detection accuracy. This work advances pipeline monitoring by demonstrating that physics-informed sparse sensing can match the performance of dense deployments at a fraction of the cost, providing a practical solution for aging urban infrastructure.
Paper and project links
PDF 7 pages, 1 figure, 2 tables, Accepted to the 40th AAAI Conference on Artificial Intelligence (AAAI 2026), IAAI Deployed Applications Track
Summary:
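Underground pipeline leaks and infiltrations pose significant threats to water security and environmental safety. Traditional manual inspection provides limited coverage and delayed response, and often misses critical anomalies. The paper proposes AquaSentinel, a physics-informed AI system for real-time anomaly detection in urban underground water pipeline networks. It rests on four innovations: strategic sparse sensor deployment at high-centrality nodes combined with physics-based state augmentation, which gives network-wide observability from minimal infrastructure; the RTCA (Real-Time Cumulative Anomaly) detection algorithm, which uses dual-threshold monitoring with adaptive statistics to distinguish transient fluctuations from genuine anomalies; a Mixture of Experts (MoE) ensemble of spatiotemporal graph neural networks that dynamically weights model contributions for robust predictions; and causal flow-based leak localization that traces anomalies upstream to identify source nodes and affected pipe segments. In an experimental evaluation over 110 leak scenarios, AquaSentinel achieves 100% detection accuracy.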
Key Takeaways:
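- Underground pipeline leaks and infiltrations pose significant threats to water security and environmental safety.
- Traditional manual inspection has limited coverage and delayed response.
- AquaSentinel is a physics-informed AI system for real-time anomaly detection in urban underground water pipeline networks.
- Strategic sparse sensor deployment and physics-based state augmentation give network-wide observability.
- The RTCA detection algorithm distinguishes transient fluctuations from genuine anomalies.
- Predictions come from a Mixture of Experts ensemble of spatiotemporal graph neural networks.
- In the experimental evaluation, AquaSentinel achieves 100% detection accuracy across 110 leak scenarios.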
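The abstract describes RTCA only at a high level (dual-threshold monitoring with adaptive statistics). To make the general idea concrete, here is a minimal sketch of a generic two-threshold cumulative detector: one threshold on the per-sample z-score and a second on the accumulated excess, so an isolated spike is ignored while a sustained deviation is flagged. This is not the paper's RTCA algorithm; the function name `sustained_anomalies`, the parameters `z_limit`, `cum_limit`, and `warmup`, and the sample readings are all assumptions made for illustration.

```python
def sustained_anomalies(readings, z_limit=3.0, cum_limit=2.5, warmup=5):
    """Return indices where a deviation has persisted long enough to flag.

    Generic CUSUM-style sketch, NOT the paper's RTCA algorithm.
    """
    flags = []
    n, mean, m2 = 0, 0.0, 0.0  # running count, mean, squared deviations
    cum = 0.0                  # accumulated excess over the z-score limit
    for i, x in enumerate(readings):
        if n >= warmup:
            std = (m2 / (n - 1)) ** 0.5
            z = abs(x - mean) / std if std > 0 else 0.0
            if z > z_limit:
                # Threshold 1: a large deviation adds at most 1.0 per sample.
                cum += min(z - z_limit, 1.0)
                # Threshold 2: only sustained excess counts as an anomaly.
                if cum > cum_limit:
                    flags.append(i)
                continue  # keep suspicious samples out of the baseline
            cum = 0.0  # a normal sample resets the accumulator
        # Welford update of the adaptive baseline statistics.
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return flags


# Hypothetical residuals: one transient spike, then a sustained shift.
readings = [0.0, 0.1, -0.1, 0.05, -0.05, 0.0, 5.0, 0.0, 5.0, 5.0, 5.0, 5.0]
print(sustained_anomalies(readings))
```

Capping each sample's contribution means a single extreme reading cannot trip the cumulative threshold on its own, which is one simple way to separate transients from persistent leaks.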
OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition
Authors:Xinli Tao, Xin Dong, Xuezhong Zhou
With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.
Paper and project links
PDF 12 pages, 4 figures, 4 tables
Summary
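With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. To address the high annotation costs of traditional supervised models, the paper proposes OEMA, a zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of a self-annotator that autonomously generates candidate examples, a discriminator that uses SNOMED CT to filter token-level examples by clinical relevance, and a predictor that incorporates entity-type descriptions to improve inference accuracy. On the MTSamples and VAERS benchmark datasets, OEMA achieves state-of-the-art performance under exact-match evaluation and near-supervised performance under related-match criteria.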
Key Takeaways
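- Clinical NER is a crucial technique for extracting medical information from EHRs.
- Traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs.
- Zero-shot NER with LLMs reduces the dependency on labeled data, but aligning example selection with task granularity and integrating prompt design with self-improvement frameworks remain challenging.
- OEMA is a zero-shot clinical NER framework based on multi-agent collaboration, with three core components: a self-annotator, a discriminator, and a predictor.
- OEMA achieves state-of-the-art performance under exact-match evaluation and, under related-match criteria, performs comparably to the supervised BioClinicalBERT model.
- Future work will focus on continual learning and open-domain adaptation to expand OEMA's applicability in clinical NLP.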
From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks
Authors:Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen, Tian Qin, Yuyu Zhao
The proliferation of UAVs has enabled a wide range of mission-critical applications and is becoming a cornerstone of low-altitude networks, supporting smart cities, emergency response, and more. However, the open wireless environment, dynamic topology, and resource constraints of UAVs expose low-altitude networks to severe DoS threats. Traditional defense approaches, which rely on fixed configurations or centralized decision-making, cannot effectively respond to the rapidly changing conditions in UAV swarm environments. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive DoS mitigation in low-altitude networks. Specifically, we design lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process, capturing the uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based algorithm, UAVs collaboratively optimize their policies via reward-weighted aggregation. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, under various DoS attack strategies. These results highlight the potential of intelligent, distributed defense mechanisms to protect low-altitude networks, paving the way for reliable and scalable low-altitude economy.
Paper and project links
PDF 15 pages; Accepted by IEEE TCCN
Summary
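The proliferation of UAVs has made them a cornerstone of low-altitude networks, but the open wireless environment, dynamic topology, and resource constraints of UAVs expose these networks to severe denial-of-service (DoS) threats. The paper proposes a federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive DoS mitigation in low-altitude networks. Lightweight, coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, disrupt attacker efforts and enhance network resilience. In simulations under various DoS attack strategies, the approach significantly outperforms state-of-the-art baselines, with up to a 34.6% improvement in attack mitigation rate, up to a 94.6% reduction in average recovery time, and decreases in energy consumption and defense cost of as much as 29.3% and 98.3%, respectively.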
Key Takeaways
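- The proliferation of UAVs enables mission-critical applications in low-altitude networks, such as smart cities and emergency response.
- Low-altitude networks face severe DoS threats and need effective defenses.
- Traditional defenses that rely on fixed configurations or centralized decision-making cannot respond effectively to rapidly changing UAV swarm environments.
- The paper proposes an FMADRL-driven MTD framework.
- The MTD mechanisms include leader switching, route mutation, and frequency hopping, which disrupt attackers and enhance network resilience.
- The defense problem is formulated as a multi-agent partially observable Markov decision process that captures the uncertainty of UAV swarms under attack.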
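The abstract states that UAVs optimize their policies collaboratively via reward-weighted aggregation but does not give the weighting rule. One plausible reading, sketched below under that assumption, is a federated average of policy parameters in which each agent's weight is a softmax of its recent reward. The function name `reward_weighted_average`, the `temperature` parameter, and the flat-list parameter format are hypothetical.

```python
import math


def reward_weighted_average(param_sets, rewards, temperature=1.0):
    """Average per-agent parameter vectors, weighting each by softmax(reward).

    Assumed aggregation rule for illustration; the paper's rule may differ.
    """
    top = max(rewards)
    # Subtracting the max keeps exp() numerically stable.
    weights = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(param_sets[0])
    return [
        sum(w * params[j] for w, params in zip(weights, param_sets))
        for j in range(dim)
    ]


# Hypothetical two-UAV example: the second agent earned a higher reward,
# so the aggregate is pulled toward its parameters.
aggregated = reward_weighted_average([[0.0, 0.0], [1.0, 2.0]], [0.0, 5.0])
print(aggregated)
```

With equal rewards the rule reduces to a plain federated average; a lower `temperature` makes the aggregate follow the best-performing agent more closely.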
AgentSwift: Efficient LLM Agent Design via Value-guided Hierarchical Search
Authors:Yu Li, Lehui Li, Zhihao Wu, Qingmin Liao, Jianye Hao, Kun Shao, Fengli Xu, Yong Li
Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current automated agent design approaches are often constrained by limited search spaces that primarily optimize workflows but fail to integrate crucial human-designed components like memory, planning, and tool use. Furthermore, these methods are hampered by high evaluation costs, as evaluating even a single new agent on a benchmark can require tens of dollars. The difficulty of this exploration is further exacerbated by inefficient search strategies that struggle to navigate the large design space effectively, making the discovery of novel agents a slow and resource-intensive process. To address these challenges, we propose AgentSwift, a novel framework for automated agent design. We formalize a hierarchical search space that jointly models agentic workflow and composable functional components. This structure moves beyond optimizing workflows alone by co-optimizing functional components, which enables the discovery of more complex and effective agent architectures. To make exploration within this expansive space feasible, we mitigate high evaluation costs by training a value model on a high-quality dataset, generated via a novel strategy combining combinatorial coverage and balanced Bayesian sampling for low-cost evaluation. Guiding the entire process is a hierarchical MCTS strategy, which is informed by uncertainty to efficiently navigate the search space. Evaluated across a comprehensive set of seven benchmarks spanning embodied, math, web, tool, and game domains, AgentSwift discovers agents that achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. Our framework serves as a launchpad for researchers to rapidly discover powerful agent architectures.
Paper and project links
PDF AAAI-2026
Summary
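Large language model (LLM) agents have demonstrated strong capabilities across diverse domains, yet automated agent design remains a significant challenge. Current approaches are constrained by limited search spaces that primarily optimize workflows and fail to integrate crucial human-designed components such as memory, planning, and tool use. High evaluation costs and inefficient search strategies make discovering new agents slow and resource-intensive. AgentSwift addresses these challenges with a hierarchical search space that jointly models the agentic workflow and composable functional components, which enables the discovery of more complex and effective agent architectures. A value model trained on a high-quality dataset reduces evaluation costs, and an uncertainty-informed hierarchical MCTS strategy navigates the search space efficiently. Across seven benchmarks spanning embodied, math, web, tool, and game domains, the agents AgentSwift discovers achieve an average performance gain of 8.34% over both existing automated agent search methods and manually designed agents. The framework serves as a launchpad for researchers to rapidly discover powerful agent architectures.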
Key Takeaways
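- Current automated agent design is constrained by limited search spaces and fails to integrate human-designed components such as memory, planning, and tool use.
- High evaluation costs make discovering new agents slow and resource-intensive.
- AgentSwift formalizes a hierarchical search space that jointly models the agentic workflow and composable functional components.
- A trained value model reduces evaluation costs, and a hierarchical MCTS strategy navigates the search space efficiently.
- Across seven benchmarks, the discovered agents achieve an average performance gain of 8.34%.
- AgentSwift gives researchers a launchpad for discovering powerful agent architectures.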
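Uncertainty-guided tree search usually balances a node's estimated value against how little it has been explored. As a rough sketch of that idea, the snippet below applies the standard UCB1 rule to pick which candidate design to expand next. It is not AgentSwift's hierarchical MCTS; the dictionary fields `value_sum` and `visits`, the `exploration` constant, and the function name `select_child` are assumptions for illustration.

```python
import math


def select_child(children, exploration=1.4):
    """Return the index of the child with the highest UCB1 score.

    Standard UCB1, used here as a stand-in for uncertainty-guided selection.
    """
    total_visits = sum(child["visits"] for child in children)
    best_index, best_score = 0, -math.inf
    for index, child in enumerate(children):
        if child["visits"] == 0:
            return index  # always try an unvisited candidate first
        mean_value = child["value_sum"] / child["visits"]
        # Rarely visited candidates receive a larger exploration bonus.
        bonus = exploration * math.sqrt(math.log(total_visits) / child["visits"])
        score = mean_value + bonus
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Hypothetical candidates: a well-explored strong design and a barely
# explored weaker one. The exploration bonus favors the second.
candidates = [
    {"value_sum": 9.0, "visits": 10},
    {"value_sum": 0.5, "visits": 1},
]
print(select_child(candidates))
```

Setting `exploration` to 0 turns the rule into pure exploitation, which is a quick way to see how much the bonus term drives the choice.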
Policy Search, Retrieval, and Composition via Task Similarity in Collaborative Agentic Systems
Authors:Saptarshi Nath, Christos Peridis, Eseoghene Benjamin, Xinran Liu, Soheil Kolouri, Peter Kinnell, Zexin Li, Cong Liu, Shirin Dora, Andrea Soltoggio
Agentic AI aims to create systems that set their own goals, adapt proactively to change, and refine behavior through continuous experience. Recent advances suggest that, when facing multiple and unforeseen tasks, agents could benefit from sharing machine-learned knowledge and reusing policies that have already been fully or partially learned by other agents. However, how to query, select, and retrieve policies from a pool of agents, and how to integrate such policies remains a largely unexplored area. This study explores how an agent decides what knowledge to select, from whom, and when and how to integrate it in its own policy in order to accelerate its own learning. The proposed algorithm, Modular Sharing and Composition in Collective Learning (MOSAIC), improves learning in agentic collectives by combining (1) knowledge selection using performance signals and cosine similarity on Wasserstein task embeddings, (2) modular and transferable neural representations via masks, and (3) policy integration, composition and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The results also demonstrate that selective, goal-driven reuse leads to less susceptibility to task interference. We also observe the emergence of self-organization, where agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.
Paper and project links
PDF 24 pages, 20 figures, 8 tables
Summary
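Agentic AI aims to create systems that set their own goals, adapt to change, and improve their behavior through continuous experience. This study examines how an agent facing multiple, unforeseen tasks can select and integrate policies from other agents to accelerate its own learning. The proposed MOSAIC algorithm selects knowledge using performance signals and cosine similarity on Wasserstein task embeddings, obtains modular and transferable neural representations via masks, and applies policy integration, composition, and fine-tuning. MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance, and in some cases solves tasks that isolated agents cannot. The algorithm also reduces susceptibility to task interference, and the authors observe self-organization in which agents solving simpler tasks accelerate the learning of harder ones through shared knowledge.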
Key Takeaways
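- Agentic AI aims to create systems that set their own goals, adapt to change, and refine their behavior.
- When facing multiple, unforeseen tasks, an agent can select and integrate policies from other agents to accelerate its learning.
- MOSAIC selects knowledge using performance signals and cosine similarity on Wasserstein task embeddings, and relies on modular, transferable neural representations (masks) for policy integration.
- MOSAIC outperforms isolated learners and global sharing approaches in both learning speed and overall performance.
- Selective, goal-driven reuse makes agents less susceptible to task interference.
- Through shared knowledge, agents solving simpler tasks accelerate the learning of harder ones.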
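The selection step described above (performance signals plus cosine similarity on task embeddings) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are taken as given vectors rather than computed as Wasserstein embeddings, the similarity threshold and the "peer must outperform me" rule are assumed selection criteria, and the weighted-average mask composition is a hypothetical stand-in for the paper's composition step. All function and parameter names are invented for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_peers(
    own_embedding: Sequence[float],
    own_reward: float,
    peers: List[Dict],
    sim_threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Keep peers whose task is similar AND whose reward beats ours.

    Each peer is a dict with keys 'name', 'embedding', and 'reward'.
    Returns (name, similarity) pairs, most similar first.
    """
    selected = []
    for peer in peers:
        sim = cosine_similarity(own_embedding, peer["embedding"])
        if sim >= sim_threshold and peer["reward"] > own_reward:
            selected.append((peer["name"], sim))
    selected.sort(key=lambda item: item[1], reverse=True)
    return selected


def compose_masks(
    masks: List[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Weighted element-wise average of masks (assumed composition rule).

    masks[0] is the agent's own mask; the rest come from selected peers.
    """
    if len(masks) != len(weights):
        raise ValueError("need exactly one weight per mask")
    total = sum(weights)
    return [
        sum(w * m[i] for w, m in zip(weights, masks)) / total
        for i in range(len(masks[0]))
    ]
```

Filtering on both similarity and reward is what makes the reuse selective: a peer on an unrelated task, or one that performs worse than the querying agent, contributes nothing, which is one plausible reading of why selective sharing reduces task interference relative to global sharing.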
Securing Smart Contract Languages with a Unified Agentic Framework for Vulnerability Repair in Solidity and Move
Authors:Rabimba Karanjai, Lei Xu, Weidong Shi
The rapid growth of the blockchain ecosystem and the increasing value locked in smart contracts necessitate robust security measures. While languages like Solidity and Move aim to improve smart contract security, vulnerabilities persist. This paper presents Smartify, a novel multi-agent framework leveraging Large Language Models (LLMs) to automatically detect and repair vulnerabilities in Solidity and Move smart contracts. Unlike traditional methods that rely solely on vast pre-training datasets, Smartify employs a team of specialized agents working on different specially fine-tuned LLMs to analyze code based on underlying programming concepts and language-specific security principles. We evaluated Smartify on a dataset for Solidity and a curated dataset for Move, demonstrating its effectiveness in fixing a wide range of vulnerabilities. Our results show that Smartify (Gemma2+codegemma) achieves state-of-the-art performance, surpassing existing LLMs and enhancing general-purpose models’ capabilities, such as Llama 3.1. Notably, Smartify can incorporate language-specific knowledge, such as the nuances of Move, without requiring massive language-specific pre-training datasets. This work offers a detailed analysis of various LLMs’ performance on smart contract repair, highlighting the strengths of our multi-agent approach and providing a blueprint for developing more secure and reliable decentralized applications in the growing blockchain landscape. We also provide a detailed recipe for extending this to other similar use cases.
Paper and project links
Summary
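This paper introduces Smartify, a novel multi-agent framework that uses Large Language Models (LLMs) to automatically detect and repair vulnerabilities in Solidity and Move smart contracts. Smartify relies on a team of specialized agents running on different specially fine-tuned LLMs, which analyze code based on underlying programming concepts and language-specific security principles. The evaluation shows that Smartify fixes a wide range of vulnerabilities, surpassing existing LLMs and enhancing the capabilities of general-purpose models. It can incorporate language-specific knowledge, such as the nuances of Move, without requiring massive language-specific pre-training datasets.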
Key Takeaways
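- Smartify is a multi-agent framework built on Large Language Models (LLMs) that automatically detects and repairs vulnerabilities in Solidity and Move smart contracts.
- Smartify uses a team of collaborating specialized agents that analyze code against programming concepts and language-specific security principles.
- Unlike traditional methods that rely solely on vast pre-training datasets, Smartify is effective at fixing a wide range of vulnerabilities.
- Smartify (Gemma2+codegemma) surpasses existing LLMs and enhances the capabilities of general-purpose models such as Llama 3.1.
- Smartify can incorporate language-specific knowledge, such as the nuances of Move, without requiring massive language-specific pre-training datasets.
- The paper provides a detailed analysis of LLM performance on smart contract repair and highlights the strengths of the multi-agent approach.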