Published: 2025-09-12

Updated: 2025-10-07

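⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not rely on them in serious academic settings; they are suitable only for initial screening before reading the papers.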

Updated 2025-09-12

Automatic Failure Attribution and Critical Step Prediction Method for Multi-Agent Systems Based on Causal Inference

Authors:Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Jiawei Shen, Jingjiang Liu, Yidan Liang

Multi-agent systems (MAS) are critical for automating complex tasks, yet their practical deployment is severely hampered by the challenge of failure attribution. Current diagnostic tools, which rely on statistical correlations, are fundamentally inadequate; on challenging benchmarks like Who&When, state-of-the-art methods achieve less than 15% accuracy in locating the root-cause step of a failure. To address this critical gap, we introduce the first failure attribution framework for MAS grounded in multi-granularity causal inference. Our approach makes two key technical contributions: (1) a performance causal inversion principle, which correctly models performance dependencies by reversing the data flow in execution logs, combined with Shapley values to accurately assign agent-level blame; (2) a novel causal discovery algorithm, CDC-MAS, that robustly identifies critical failure steps by tackling the non-stationary nature of MAS interaction data. The framework’s attribution results directly fuel an automated optimization loop, generating targeted suggestions whose efficacy is validated via counterfactual simulations. Evaluations on the Who&When and TRAIL benchmarks demonstrate a significant leap in performance. Our method achieves up to 36.2% step-level accuracy. Crucially, the generated optimizations boost overall task success rates by an average of 22.4%. This work provides a principled and effective solution for debugging complex agent interactions, paving the way for more reliable and interpretable multi-agent systems.

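Paper and project links

PDF

Summary
Multi-agent systems (MAS) are essential for automating complex tasks, but their practical deployment is severely hampered by the problem of failure attribution. Current diagnostic tools rely mainly on statistical correlations and are fundamentally inadequate. To address this, the paper proposes the first MAS failure attribution framework grounded in multi-granularity causal inference. It makes two key technical contributions: first, a performance causal inversion principle, which models performance dependencies by reversing the data flow in execution logs and combines this with Shapley values to assign agent-level blame; second, a new causal discovery algorithm, CDC-MAS, which robustly identifies critical failure steps by handling the non-stationarity of agent interaction data. The attribution results feed an automated optimization loop that generates suggestions validated through counterfactual simulation. On the Who&When and TRAIL benchmarks, the method reaches up to 36.2% step-level accuracy, and the generated optimizations raise overall task success rates by 22.4% on average.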

Key Takeaways

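The use of multi-agent systems (MAS) for automating complex tasks is limited by the failure attribution problem.
Current diagnostic tools rely mainly on statistical correlations and fall short on challenging benchmarks.
The paper introduces the first MAS failure attribution framework based on multi-granularity causal inference.
Its key technical contributions are the performance causal inversion principle and the new causal discovery algorithm CDC-MAS.
The framework models performance dependencies and robustly identifies critical failure steps.
Attribution results drive an automated optimization loop whose suggestions are validated through counterfactual simulation.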
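
The abstract mentions Shapley values for agent-level blame without giving details. As a minimal sketch of that general idea only, the snippet below computes exact Shapley values for three agents. The agent names and the `failure_score` function are hypothetical placeholders, not taken from the paper, and the paper's causal inversion step is not modelled here.

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, value):
    """Exact Shapley values; `value` maps a frozenset of agents to a number."""
    n = len(agents)
    phi = {a: 0.0 for a in agents}
    for a in agents:
        others = [x for x in agents if x != a]
        for k in range(n):
            # Weight of a coalition of size k that does not contain `a`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                phi[a] += weight * (value(s | {a}) - value(s))
    return phi

def failure_score(coalition):
    """Hypothetical failure contribution of a coalition (illustration only)."""
    score = 0.0
    if "planner" in coalition:
        score += 0.6
    if "coder" in coalition:
        score += 0.1
    if "planner" in coalition and "coder" in coalition:
        score += 0.2  # interaction effect shared between the two agents
    return score

agents = ["planner", "coder", "reviewer"]
blame = shapley_values(agents, failure_score)
for name in agents:
    print(f"{name}: {blame[name]:.3f}")
```

With these made-up scores the blame values sum to the score of the full team, which is the efficiency property that makes Shapley values attractive for splitting responsibility. Exact enumeration grows exponentially with the number of agents, so larger systems would need sampling.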

Agents of Discovery

Authors:Sascha Diefenbacher, Anna Hallin, Gregor Kasieczka, Michael Krämer, Anne Lauscher, Tim Lukas

The substantial data volumes encountered in modern particle physics and other domains of fundamental physics research allow (and require) the use of increasingly complex data analysis tools and workflows. While the use of machine learning (ML) tools for data analysis has recently proliferated, these tools are typically special-purpose algorithms that rely, for example, on encoded physics knowledge to reach optimal performance. In this work, we investigate a new and orthogonal direction: Using recent progress in large language models (LLMs) to create a team of agents – instances of LLMs with specific subtasks – that jointly solve data analysis-based research problems in a way similar to how a human researcher might: by creating code to operate standard tools and libraries (including ML systems) and by building on results of previous iterations. If successful, such agent-based systems could be deployed to automate routine analysis components to counteract the increasing complexity of modern tool chains. To investigate the capabilities of current-generation commercial LLMs, we consider the task of anomaly detection via the publicly available and highly-studied LHC Olympics dataset. Several current models by OpenAI (GPT-4o, o4-mini, GPT-4.1, and GPT-5) are investigated and their stability tested. Overall, we observe the capacity of the agent-based system to solve this data analysis problem. The best agent-created solutions mirror the performance of human state-of-the-art results.

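Paper and project links

PDF

Summary
The paper addresses the large data volumes encountered in modern particle physics and other areas of fundamental physics, which call for increasingly complex analysis tools and workflows. It studies the use of large language models (LLMs) to build a team of agents that jointly solve data-analysis problems much as a human researcher would, by writing code that operates standard tools and libraries. The capabilities of current commercial LLMs are evaluated on an anomaly detection task using the public, widely studied LHC Olympics dataset. In testing, the best agent-created solutions match the performance of the best human results.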

Key Takeaways

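Modern physics research faces large-scale data processing challenges that require complex analysis tools and workflows.
Machine learning (ML) tools are widely used in data analysis but are typically special-purpose algorithms.
Large language models (LLMs) open a new direction: teams of agents that jointly solve data-analysis problems.
The agent team works like a human researcher, writing code that operates standard tools and libraries.
Current commercial LLMs are evaluated on an anomaly detection task with the LHC Olympics dataset.
The best agent-created solutions perform close to the best human results.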

Game-Theoretic Resilience Framework for Cyber-Physical Microgrids using Multi-Agent Reinforcement Learning

Authors:S Krishna Niketh, Sagar Babu Mitikiri, V Vignesh, Vedantham Lakshmi Srinivas, Mayukha Pal

The increasing reliance on cyber-physical infrastructure in modern power systems has amplified the risk of targeted cyber attacks, necessitating robust and adaptive resilience strategies. This paper presents a mathematically rigorous game-theoretic framework to evaluate and enhance microgrid resilience using a combination of quantitative resilience metrics: Load Served Ratio (LSR), Critical Load Resilience (CLR), Topological Survivability Score (TSS), and DER Resilience Score (DRS). These are integrated into a unified payoff matrix using the Analytic Hierarchy Process (AHP) to assess attack-defense interactions. The framework is formalized as a finite-horizon Markov Decision Process (MDP) with formal convergence guarantees and computational complexity bounds. Three case studies are developed: (1) static attacks analyzed via Nash equilibrium, (2) severe attacks incorporating high-impact strategies, and (3) adaptive attacks using Stackelberg games, regret matching, softmax heuristics, and Multi-Agent Q-Learning. Rigorous theoretical analysis provides convergence proofs with explicit rates, PAC learning sample complexity bounds, and computational complexity analysis. The framework is tested on an enhanced IEEE 33-bus distribution system with DERs and control switches, demonstrating the effectiveness of adaptive and strategic defenses in improving cyber-physical resilience with statistically significant improvements of 18.7% ± 2.1% over static approaches.

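Paper and project links

PDF

Summary
In modern power systems, growing reliance on cyber-physical infrastructure has amplified the risk of targeted cyber attacks, which calls for robust and adaptive resilience strategies. The paper uses a game-theoretic framework to evaluate and enhance microgrid resilience, combining the quantitative metrics Load Served Ratio (LSR), Critical Load Resilience (CLR), Topological Survivability Score (TSS), and DER Resilience Score (DRS). The Analytic Hierarchy Process (AHP) integrates these metrics into a unified payoff matrix for assessing attack-defense interactions. The framework is formalized as a finite-horizon Markov Decision Process (MDP) with formal convergence guarantees and computational complexity bounds. Three case studies cover (1) static attacks analyzed via Nash equilibrium, (2) severe attacks incorporating high-impact strategies, and (3) adaptive attacks using Stackelberg games, regret matching, softmax heuristics, and multi-agent Q-learning. The theoretical analysis provides convergence proofs with explicit rates, PAC learning sample complexity bounds, and computational complexity analysis. Tested on an enhanced IEEE 33-bus distribution system with DERs and control switches, adaptive and strategic defenses improve cyber-physical resilience by a statistically significant 18.7% ± 2.1% over static approaches.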

Key Takeaways

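Reliance on cyber-physical infrastructure in modern power systems raises the risk of targeted cyber attacks and calls for robust, adaptive resilience strategies.
A game-theoretic framework evaluates and enhances microgrid resilience by combining several quantitative resilience metrics.
The Analytic Hierarchy Process integrates the metrics into a payoff matrix for assessing attack-defense interactions.
The framework is formalized as a finite-horizon Markov Decision Process with convergence guarantees and computational complexity bounds.
Three case studies (static, severe, and adaptive attacks) show how the framework is applied.
The theoretical analysis provides convergence proofs and computational complexity analysis.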
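
Regret matching, one of the techniques named in the adaptive-attack case study, can be sketched on a small zero-sum game. The 2x2 payoff matrix below is a made-up example (the paper builds its payoffs from AHP-weighted resilience metrics, which are not reproduced here), and the function names are illustrative assumptions. Both players update on expected payoffs, so the run is deterministic.

```python
def regret_matching_strategy(regrets):
    """Play each action in proportion to its positive cumulative regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    n = len(regrets)
    return [p / total for p in positive] if total > 0 else [1.0 / n] * n

def solve_zero_sum(payoff, iterations=20000):
    """payoff[i][j] is the defender's payoff; the attacker receives -payoff[i][j].

    Returns the time-averaged strategies of defender and attacker.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    row_regret, col_regret = [0.0] * n_rows, [0.0] * n_cols
    row_sum, col_sum = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(iterations):
        row = regret_matching_strategy(row_regret)
        col = regret_matching_strategy(col_regret)
        # Expected payoff of each pure action against the opponent's mix.
        row_util = [sum(payoff[i][j] * col[j] for j in range(n_cols))
                    for i in range(n_rows)]
        col_util = [-sum(payoff[i][j] * row[i] for i in range(n_rows))
                    for j in range(n_cols)]
        row_value = sum(row[i] * row_util[i] for i in range(n_rows))
        col_value = sum(col[j] * col_util[j] for j in range(n_cols))
        for i in range(n_rows):
            row_regret[i] += row_util[i] - row_value
            row_sum[i] += row[i]
        for j in range(n_cols):
            col_regret[j] += col_util[j] - col_value
            col_sum[j] += col[j]
    return ([s / iterations for s in row_sum],
            [s / iterations for s in col_sum])

# Hypothetical defender payoffs: rows are defenses, columns are attacks.
defender, attacker = solve_zero_sum([[2.0, -1.0], [-1.0, 1.0]])
print("defender mix:", defender)
print("attacker mix:", attacker)
```

For this matrix the mixed Nash equilibrium can be worked out by hand as (0.4, 0.6) for both players, and the averaged strategies should drift toward it as the iteration count grows. The current (non-averaged) strategies may keep oscillating; only the averages carry the convergence guarantee.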

T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation

Authors:Xiaobei Zhao, Xingqi Lyu, Xiang Li

Agricultural robotic agents have been becoming powerful helpers in a wide range of agricultural tasks, nevertheless, still heavily rely on manual operation or untransportable railway for movement. The AgriVLN method and the A2A benchmark pioneeringly extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents navigate to the target position following the natural language instructions. AgriVLN effectively understands the simple instructions, however, often misunderstands the complicated instructions. To bridge this gap, we propose the method of Translator for Agricultural Robotic Agents on Vision-and-Language Navigation (T-araVLN), in which the Instruction Translator module translates the original instruction to be both refined and precise. Being evaluated on the A2A benchmark, our T-araVLN effectively improves Success Rate from 0.47 to 0.63 and reduces Navigation Error from 2.91m to 2.28m, demonstrating the state-of-the-art performance in the agricultural domain. Code: https://github.com/AlexTraveling/T-araVLN.

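Paper and project links

PDF

Summary

Agricultural robotic agents have become capable helpers across many agricultural tasks, but they still rely on manual operation or untransportable rail tracks to move. The AgriVLN method and the A2A benchmark were the first to extend Vision-and-Language Navigation (VLN) to agriculture, letting agents navigate to a target position by following natural-language instructions. AgriVLN handles simple instructions well but often misunderstands complicated ones. To close this gap, the authors propose T-araVLN (Translator for Agricultural Robotic Agents on Vision-and-Language Navigation), whose Instruction Translator module rewrites the original instruction into a more refined and precise form. On the A2A benchmark, T-araVLN raises Success Rate from 0.47 to 0.63 and reduces Navigation Error from 2.91 m to 2.28 m, which is state-of-the-art performance in the agricultural domain.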

Key Takeaways

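Agricultural robotic agents are capable helpers in many tasks but still depend on manual operation or untransportable rail tracks for movement.
AgriVLN and the A2A benchmark were the first to bring Vision-and-Language Navigation to agriculture, so robots can navigate from natural-language instructions.
AgriVLN often misunderstands complicated instructions, which motivates more precise instruction translation.
T-araVLN adds an Instruction Translator module that improves the agent's understanding of instructions.
On the A2A benchmark, T-araVLN raises Success Rate from 0.47 to 0.63 and lowers Navigation Error from 2.91 m to 2.28 m.
T-araVLN reaches state-of-the-art performance in the agricultural domain.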

Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems

Authors:Manish Shukla

Agentic artificial intelligence (AI) – multi-agent systems that combine large language models with external tools and autonomous planning – are rapidly transitioning from research laboratories into high-stakes domains. Our earlier “Basic” paper introduced a five-axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This “Advanced” sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023–2025 found that 83% report capability metrics while only 30% consider human-centred or economic axes [2]. Second, we formalise an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds and performs joint anomaly detection via the Mahalanobis distance. Third, we conduct simulations and real-world experiments. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication. The code supporting this work is available at https://github.com/Manishms18/Adaptive-Multi-Dimensional-Monitoring.

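Paper and project links

PDF

Summary
This paper examines agentic AI, meaning multi-agent systems that are moving from research laboratories into high-stakes domains. Where the earlier "Basic" paper offered only a framework and preliminary metrics, this sequel supplies an algorithm and empirical evidence. It revisits existing evaluation practice and introduces the Adaptive Multi-Dimensional Monitoring (AMDM) algorithm. In simulations and real-world experiments, AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces the false-positive rate from 4.5% to 0.9% compared with static thresholds. Code and data accompany the paper.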

Key Takeaways

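Agentic AI systems are moving from research laboratories into high-stakes domains.
Technical metrics still dominate evaluation: in a review of 84 papers from 2023–2025, 83% report capability metrics while only 30% consider human-centred or economic axes.
The Adaptive Multi-Dimensional Monitoring (AMDM) algorithm normalises heterogeneous metrics and performs joint anomaly detection.
Compared with static thresholds, AMDM lowers both anomaly-detection latency and the false-positive rate.
Simulations and real-world case studies support the effectiveness of AMDM.
Code, data, and a reproducibility checklist are provided to support replication.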
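
The two detection ingredients named in the abstract, per-axis exponentially weighted moving-average thresholds and a joint Mahalanobis-distance check, can be sketched as follows. This is not the AMDM implementation from the paper's repository: the class and function names, the warm-up rule, and the default `alpha` and `k` values are assumptions, and the joint check is written for two metrics only so that it needs nothing beyond the standard library.

```python
import math

class EwmaThreshold:
    """Per-axis EWMA of mean and variance; flags points far from the mean."""

    def __init__(self, alpha=0.2, k=3.0, warmup=5):
        self.alpha, self.k, self.warmup = alpha, k, warmup
        self.mean, self.var, self.n = None, 0.0, 0

    def update(self, x):
        self.n += 1
        if self.mean is None:
            self.mean = x
            return False
        deviation = x - self.mean
        # Test before updating so an outlier cannot mask itself.
        is_anomaly = (self.n > self.warmup and self.var > 0
                      and abs(deviation) > self.k * math.sqrt(self.var))
        self.mean += self.alpha * deviation
        self.var = (1 - self.alpha) * (self.var + self.alpha * deviation ** 2)
        return is_anomaly

def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance for two metrics, using the closed-form 2x2 inverse."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    return math.sqrt((d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det)

# Per-axis check on a made-up metric stream that ends with a jump.
detector = EwmaThreshold()
flags = [detector.update(x) for x in [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 1.0, 5.0]]

# Joint check: two strongly correlated metrics (made-up covariance).
cov = ((1.0, 0.9), (0.9, 1.0))
consistent = mahalanobis_2d((1.0, 1.0), (0.0, 0.0), cov)
conflicting = mahalanobis_2d((1.0, -1.0), (0.0, 0.0), cov)
print(flags, round(consistent, 2), round(conflicting, 2))
```

The joint check is what catches the second point: each coordinate is only one standard deviation from its mean, so per-axis thresholds would pass it, but the two metrics moving in opposite directions contradicts their correlation and yields a large distance.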

Beyond Ten Turns: Unlocking Long-Horizon Agentic Search with Large-Scale Asynchronous RL

Authors:Jiaxuan Gao, Wei Fu, Minyang Xie, Shusheng Xu, Chuyi He, Zhiyu Mei, Banghua Zhu, Yi Wu

Recent advancements in LLM-based agents have demonstrated remarkable capabilities in handling complex, knowledge-intensive tasks by integrating external tools. Among diverse choices of tools, search tools play a pivotal role in accessing vast external knowledge. However, open-source agents still fall short of achieving expert-level Search Intelligence, the ability to resolve ambiguous queries, generate precise searches, analyze results, and conduct thorough exploration. Existing approaches fall short in scalability, efficiency, and data quality. For example, small turn limits in existing online RL methods, e.g. <=10, restrict complex strategy learning. This paper introduces ASearcher, an open-source project for large-scale RL training of search agents. Our key contributions include: (1) Scalable fully asynchronous RL training that enables long-horizon search while maintaining high training efficiency. (2) A prompt-based LLM agent that autonomously synthesizes high-quality and challenging QAs, creating a large-scale QA dataset. Through RL training, our prompt-based QwQ-32B agent achieves substantial improvements, with 46.7% and 20.8% Avg@4 gains on xBench and GAIA, respectively. Notably, our agent exhibits extreme long-horizon search, with tool calls exceeding 40 turns and output tokens exceeding 150k during training time. With a simple agent design and no external LLMs, ASearcher-Web-QwQ achieves Avg@4 scores of 42.1 on xBench and 52.8 on GAIA, surpassing existing open-source 32B agents. We open-source our models, training data, and codes in https://github.com/inclusionAI/ASearcher.

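Paper and project links

PDF

Summary

Large-scale reinforcement learning for search agents has produced notable results. ASearcher, a new open-source project, trains search agents for long-horizon search while maintaining high training efficiency. Its prompt-based LLM agent also autonomously synthesizes high-quality, challenging question-answer pairs to build a large-scale QA dataset. After RL training, the QwQ-32B agent gains 46.7% and 20.8% Avg@4 on xBench and GAIA, respectively.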

Key Takeaways

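LLM-based agents have become much better at complex, knowledge-intensive tasks by integrating external tools.
Search tools play a pivotal role in accessing vast external knowledge.
Open-source agents have not yet reached expert-level Search Intelligence, such as resolving ambiguous queries and generating precise searches.
ASearcher is a new open-source project for large-scale RL training of search agents.
ASearcher uses scalable, fully asynchronous RL training that supports long-horizon search while keeping training efficient.
ASearcher's prompt-based LLM agent autonomously synthesizes high-quality, challenging question-answer pairs.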
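
Why asynchrony matters for long-horizon search can be shown with a toy `asyncio` simulation: when rollouts and training are decoupled through a queue, a 50-turn episode does not hold up batches made of short episodes. Everything here (`rollout`, `trainer`, `main`, the sleep used as a stand-in for a tool call) is a hypothetical illustration and not ASearcher code.

```python
import asyncio

async def rollout(worker_id, n_turns, queue):
    """Simulate one search episode of n_turns tool calls, then enqueue it."""
    for _ in range(n_turns):
        await asyncio.sleep(0.001)  # stand-in for one tool call or generation step
    await queue.put((worker_id, n_turns))

async def trainer(queue, batch_size, n_batches):
    """Build training batches from whichever trajectories finish first."""
    batches = []
    for _ in range(n_batches):
        batch = [await queue.get() for _ in range(batch_size)]
        batches.append(batch)  # a real trainer would run a policy update here
    return batches

async def main(turns, batch_size=2):
    queue = asyncio.Queue()
    workers = [asyncio.create_task(rollout(i, t, queue))
               for i, t in enumerate(turns)]
    batches = await trainer(queue, batch_size, len(turns) // batch_size)
    await asyncio.gather(*workers)
    return batches

if __name__ == "__main__":
    # Two long episodes (40 and 50 turns) and two short ones (2 and 3 turns).
    for batch in asyncio.run(main([40, 2, 3, 50])):
        print(batch)
```

The first batch is formed from the two short episodes while the long ones are still running. A synchronous scheme that waits for every episode in a fixed batch would instead idle until the longest one finished, which is one reason small turn limits are attractive in synchronous setups.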

From Static to Adaptive Defense: Federated Multi-Agent Deep Reinforcement Learning-Driven Moving Target Defense Against DoS Attacks in UAV Swarm Networks

Authors:Yuyang Zhou, Guang Cheng, Kang Du, Zihan Chen, Tian Qin, Yuyu Zhao

The proliferation of UAVs has enabled a wide range of mission-critical applications and is becoming a cornerstone of low-altitude networks, supporting smart cities, emergency response, and more. However, the open wireless environment, dynamic topology, and resource constraints of UAVs expose low-altitude networks to severe DoS threats. Traditional defense approaches, which rely on fixed configurations or centralized decision-making, cannot effectively respond to the rapidly changing conditions in UAV swarm environments. To address these challenges, we propose a novel federated multi-agent deep reinforcement learning (FMADRL)-driven moving target defense (MTD) framework for proactive DoS mitigation in low-altitude networks. Specifically, we design lightweight and coordinated MTD mechanisms, including leader switching, route mutation, and frequency hopping, to disrupt attacker efforts and enhance network resilience. The defense problem is formulated as a multi-agent partially observable Markov decision process, capturing the uncertain nature of UAV swarms under attack. Each UAV is equipped with a policy agent that autonomously selects MTD actions based on partial observations and local experiences. By employing a policy gradient-based algorithm, UAVs collaboratively optimize their policies via reward-weighted aggregation. Extensive simulations demonstrate that our approach significantly outperforms state-of-the-art baselines, achieving up to a 34.6% improvement in attack mitigation rate, a reduction in average recovery time of up to 94.6%, and decreases in energy consumption and defense cost by as much as 29.3% and 98.3%, respectively, under various DoS attack strategies. These results highlight the potential of intelligent, distributed defense mechanisms to protect low-altitude networks, paving the way for reliable and scalable low-altitude economy.

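Paper and project links

PDF 16 pages; Major Revision for IEEE TCCN

Summary

The paper examines the role of UAVs in low-altitude networks and the security challenges they face. To counter denial-of-service (DoS) attacks, it proposes a moving target defense (MTD) framework driven by federated multi-agent deep reinforcement learning (FMADRL). The framework provides lightweight, coordinated MTD mechanisms. In simulations it improves attack mitigation rate by up to 34.6%, cuts average recovery time by up to 94.6%, and lowers energy consumption and defense cost by up to 29.3% and 98.3%, respectively, which points to the potential of intelligent, distributed defenses for protecting low-altitude networks.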

Key Takeaways

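UAVs in low-altitude networks support mission-critical applications such as smart cities and emergency response, but they are exposed to DoS attacks.
Traditional defenses cannot respond effectively to the rapidly changing conditions of UAV swarm environments.
The paper proposes a moving target defense framework driven by federated multi-agent deep reinforcement learning for proactive DoS mitigation in low-altitude networks.
Its MTD mechanisms include leader switching, route mutation, and frequency hopping, which disrupt attackers and improve network resilience.
The defense problem is modeled as a multi-agent partially observable Markov decision process that captures the uncertainty of a UAV swarm under attack.
Using a policy gradient-based algorithm, UAVs collaboratively optimize their policies through reward-weighted aggregation.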
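
Reward-weighted aggregation can be read as a federated average in which better-performing UAVs contribute more to the shared policy. The abstract does not state the weighting rule, so the softmax weighting below, the `temperature` parameter, and the function names are assumptions made for illustration; policies are represented as flat lists of floats.

```python
import math

def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of rewards."""
    peak = max(values)
    exps = [math.exp((v - peak) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def reward_weighted_aggregate(local_params, rewards, temperature=1.0):
    """Combine per-UAV parameter vectors, weighting each by its softmaxed reward."""
    weights = softmax(rewards, temperature)
    dim = len(local_params[0])
    return [sum(w * params[i] for w, params in zip(weights, local_params))
            for i in range(dim)]

# Two hypothetical UAV policies with two parameters each.
local_params = [[0.0, 2.0], [2.0, 4.0]]
equal = reward_weighted_aggregate(local_params, rewards=[1.0, 1.0])
skewed = reward_weighted_aggregate(local_params, rewards=[0.0, 10.0])
print(equal, skewed)
```

With equal rewards this reduces to a plain average of the parameters, and with very unequal rewards the aggregate sits almost entirely on the better policy. A higher `temperature` flattens the weights, which may be preferable when rewards are noisy.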

MPO: Boosting LLM Agents with Meta Plan Optimization

Authors:Weimin Xiong, Yifan Song, Qingxiu Dong, Bingchan Zhao, Feifan Song, Xun Wang, Sujian Li

Recent advancements in large language models (LLMs) have enabled LLM-based agents to successfully tackle interactive planning tasks. However, despite their successes, existing approaches often suffer from planning hallucinations and require retraining for each new agent. To address these challenges, we propose the Meta Plan Optimization (MPO) framework, which enhances agent planning capabilities by directly incorporating explicit guidance. Unlike previous methods that rely on complex knowledge, which either require significant human effort or lack quality assurance, MPO leverages high-level general guidance through meta plans to assist agent planning and enables continuous optimization of the meta plans based on feedback from the agent’s task execution. Our experiments conducted on two representative tasks demonstrate that MPO significantly outperforms existing baselines. Moreover, our analysis indicates that MPO provides a plug-and-play solution that enhances both task completion efficiency and generalization capabilities in previously unseen scenarios.

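Paper and project links

PDF EMNLP 2025 Findings

Summary

LLM-based agents have succeeded at interactive planning tasks, but existing approaches suffer from planning hallucinations and need retraining for each new agent. The Meta Plan Optimization (MPO) framework improves agent planning by directly incorporating explicit guidance. MPO supplies high-level general guidance through meta plans and continuously optimizes those meta plans using feedback from the agent's task execution. Experiments show that MPO significantly outperforms existing baselines and works as a plug-and-play solution that improves task completion efficiency and generalization to previously unseen scenarios.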

Key Takeaways

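Large language models (LLMs) have enabled progress on interactive planning tasks.
Existing approaches suffer from planning hallucinations and require retraining for each new agent.
The MPO framework improves agent planning by directly incorporating explicit guidance.
MPO supplies high-level general guidance to the agent through meta plans.
MPO continuously optimizes the meta plans using feedback from the agent's task execution.
MPO significantly outperforms existing baselines and improves task completion efficiency.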

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

Authors:Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig

We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and affect change in their surrounding environments. But how performant are AI agents at accelerating or even autonomously performing work-related tasks? The answer to this question has important implications both for industry looking to adopt AI into their workflows and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents’ performance on performing real-world professional tasks, in this paper we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that the most competitive agent can complete 30% of tasks autonomously. This paints a nuanced picture on task automation with LM agents–in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems. We release code, data, environment, and experiments on https://the-agent-company.com.

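Paper and project links

PDF Preprint

Summary

The paper evaluates AI agents in a simulated real-world workplace. In a self-contained environment that mimics a small software company, baseline agents powered by closed API-based and open-weights language models are tested, and the most competitive agent completes 30% of tasks autonomously. The result paints a nuanced picture of task automation with LM agents: a good portion of simpler tasks can be solved autonomously, while harder long-horizon tasks remain out of reach. The code, data, environment, and experiments are released at https://the-agent-company.com.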

Key Takeaways

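AI agents are increasingly relevant to everyday work, much of which can be done entirely with a computer and the Internet.
Progress in large language models (LLMs) has driven rapid development of AI agents that interact with and change their environments.
In a simulated software company environment, the most competitive agent completes 30% of tasks autonomously.
Agents can solve a good portion of simpler tasks on their own, but difficult long-horizon tasks remain beyond current systems.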

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-12/Agent/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting.
