Published: 2025-11-07

Updated: 2025-11-27

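⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for preliminary screening before reading the papers.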

2025-11-07 Update

Manifold-constrained Hamilton-Jacobi Reachability Learning for Decentralized Multi-Agent Motion Planning

Authors:Qingyi Chen, Ruiqi Ni, Jun Kim, Ahmed H. Qureshi

Safe multi-agent motion planning (MAMP) under task-induced constraints is a critical challenge in robotics. Many real-world scenarios require robots to navigate dynamic environments while adhering to manifold constraints imposed by tasks. For example, service robots must carry cups upright while avoiding collisions with humans or other robots. Despite recent advances in decentralized MAMP for high-dimensional systems, incorporating manifold constraints remains difficult. To address this, we propose a manifold-constrained Hamilton-Jacobi reachability (HJR) learning framework for decentralized MAMP. Our method solves HJR problems under manifold constraints to capture task-aware safety conditions, which are then integrated into a decentralized trajectory optimization planner. This enables robots to generate motion plans that are both safe and task-feasible without requiring assumptions about other agents’ policies. Our approach generalizes across diverse manifold-constrained tasks and scales effectively to high-dimensional multi-agent manipulation problems. Experiments show that our method outperforms existing constrained motion planners and operates at speeds suitable for real-world applications. Video demonstrations are available at https://youtu.be/RYcEHMnPTH8 .

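Paper and project links

PDF

Summary

Safe multi-agent motion planning (MAMP) under task-induced constraints is a key challenge in robotics. In real-world settings such as service robotics, robots must navigate dynamic environments while satisfying manifold constraints imposed by the task. To address this, the authors propose a manifold-constrained Hamilton-Jacobi reachability (HJR) learning framework for decentralized MAMP. The method solves HJR problems under manifold constraints to capture task-aware safety conditions, which are then integrated into a decentralized trajectory optimization planner. Robots can thus generate motion plans that are both safe and task-feasible without assumptions about other agents' policies. The approach generalizes across diverse manifold-constrained tasks and scales to high-dimensional multi-agent manipulation problems. Experiments show that it outperforms existing constrained motion planners and runs fast enough for real-world use. Video demonstrations: https://youtu.be/RYcEHMnPTH8

Key Takeaways

Safe multi-agent motion planning (MAMP) is a core challenge in robotics, especially under task-induced constraints.
Robots must navigate dynamic environments while satisfying manifold constraints imposed by their tasks.
The proposed Hamilton-Jacobi reachability (HJR) learning framework addresses decentralized MAMP under manifold constraints.
The method captures task-aware safety conditions and integrates them into a decentralized trajectory optimization planner, producing motion plans that are both safe and task-feasible.
The approach generalizes across diverse manifold-constrained tasks and scales to high-dimensional multi-agent manipulation problems.
Experiments show that it outperforms existing constrained motion planners and runs at speeds suitable for real-world applications.
Video demonstrations of the method and results are available.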

ROSBag MCP Server: Analyzing Robot Data with LLMs for Agentic Embodied AI Applications

Authors:Lei Fu, Sahar Salimpour, Leonardo Militano, Harry Edelman, Jorge Peña Queralta, Giovanni Toffetti

Agentic AI systems and Physical or Embodied AI systems have been two key research verticals at the forefront of Artificial Intelligence and Robotics, with Model Context Protocol (MCP) increasingly becoming a key component and enabler of agentic applications. However, the literature at the intersection of these verticals, i.e., Agentic Embodied AI, remains scarce. This paper introduces an MCP server for analyzing ROS and ROS 2 bags, allowing for analyzing, visualizing and processing robot data with natural language through LLMs and VLMs. We describe specific tooling built with robotics domain knowledge, with our initial release focused on mobile robotics and supporting natively the analysis of trajectories, laser scan data, transforms, or time series data. This is in addition to providing an interface to standard ROS 2 CLI tools (“ros2 bag list” or “ros2 bag info”), as well as the ability to filter bags with a subset of topics or trimmed in time. Coupled with the MCP server, we provide a lightweight UI that allows the benchmarking of the tooling with different LLMs, both proprietary (Anthropic, OpenAI) and open-source (through Groq). Our experimental results include the analysis of tool calling capabilities of eight different state-of-the-art LLM/VLM models, both proprietary and open-source, large and small. Our experiments indicate that there is a large divide in tool calling capabilities, with Kimi K2 and Claude Sonnet 4 demonstrating clearly superior performance. We also conclude that there are multiple factors affecting the success rates, from the tool description schema to the number of arguments, as well as the number of tools available to the models. The code is available with a permissive license at https://github.com/binabik-ai/mcp-rosbags.

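Paper and project links

PDF

Summary
This paper presents a Model Context Protocol (MCP) server for analyzing ROS and ROS 2 bags, which lets users analyze, visualize, and process robot data in natural language through LLMs and VLMs. It describes tooling built with robotics domain knowledge, with the initial release focused on mobile robotics. A lightweight UI coupled with the MCP server allows the tooling to be benchmarked with different LLMs. Experiments show a large divide in tool-calling capabilities across LLM/VLM models.

Key Takeaways

Model Context Protocol (MCP) is increasingly a key component and enabler of agentic applications, while work at the intersection of agentic and embodied AI remains scarce.
The paper introduces an MCP server for analyzing ROS and ROS 2 bags, enabling analysis, visualization, and processing of robot data through natural language.
The tooling is built with robotics domain knowledge; the initial release focuses on mobile robotics and natively supports trajectories, laser scan data, transforms, and time series data.
A lightweight UI coupled with the MCP server allows benchmarking the tooling with different LLMs.
Experiments on eight LLM/VLM models show a large divide in tool-calling capabilities, with Kimi K2 and Claude Sonnet 4 performing clearly best.
Success rates depend on multiple factors, including the tool description schema, the number of arguments, and the number of tools available to the model.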

Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design-A2A, AP2, ERC-8004, and Beyond

Authors:Botao ‘Amber’ Hu, Helena Rong

As the “agentic web” takes shape-billions of AI agents (often LLM-powered) autonomously transacting and collaborating-trust shifts from human oversight to protocol design. In 2025, several inter-agent protocols crystallized this shift, including Google’s Agent-to-Agent (A2A), Agent Payments Protocol (AP2), and Ethereum’s ERC-8004 “Trustless Agents,” yet their underlying trust assumptions remain under-examined. This paper presents a comparative study of trust models in inter-agent protocol design: Brief (self- or third-party verifiable claims), Claim (self-proclaimed capabilities and identity, e.g. AgentCard), Proof (cryptographic verification, including zero-knowledge proofs and trusted execution environment attestations), Stake (bonded collateral with slashing and insurance), Reputation (crowd feedback and graph-based trust signals), and Constraint (sandboxing and capability bounding). For each, we analyze assumptions, attack surfaces, and design trade-offs, with particular emphasis on LLM-specific fragilities-prompt injection, sycophancy/nudge-susceptibility, hallucination, deception, and misalignment-that render purely reputational or claim-only approaches brittle. Our findings indicate no single mechanism suffices. We argue for trustless-by-default architectures anchored in Proof and Stake to gate high-impact actions, augmented by Brief for identity and discovery and Reputation overlays for flexibility and social signals. We comparatively evaluate A2A, AP2, ERC-8004 and related historical variations in academic research under metrics spanning security, privacy, latency/cost, and social robustness (Sybil/collusion/whitewashing resistance). We conclude with hybrid trust model recommendations that mitigate reputation gaming and misinformed LLM behavior, and we distill actionable design guidelines for safer, interoperable, and scalable agent economies.

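Paper and project links

PDF Submitted to AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)

Summary
As the "agentic web" takes shape, with billions of AI agents (often LLM-powered) transacting and collaborating autonomously, trust shifts from human oversight to protocol design. This paper compares six trust models in inter-agent protocol design (Brief, Claim, Proof, Stake, Reputation, and Constraint) and analyzes their assumptions, attack surfaces, and design trade-offs, with emphasis on LLM-specific fragilities. It finds that no single mechanism suffices and argues for trustless-by-default architectures anchored in Proof and Stake to gate high-impact actions, augmented by Brief for identity and discovery and by Reputation overlays for flexibility and social signals. It also comparatively evaluates Google's Agent-to-Agent (A2A), Agent Payments Protocol (AP2), and Ethereum's ERC-8004.

Key Takeaways

As the agentic web takes shape, trust shifts from human oversight to protocol design.
Trust models are grouped into Brief, Claim, Proof, Stake, Reputation, and Constraint, and each is analyzed against LLM-specific fragilities.
No single trust mechanism suffices; mechanisms need to be combined.
Trustless-by-default architectures should be anchored in Proof and Stake to gate high-impact actions.
Brief supports identity and discovery, while Reputation overlays add flexibility and social signals.
Google's A2A, AP2, and Ethereum's ERC-8004 are comparatively evaluated.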

Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

Authors:Qianhui Zhao, Li Zhang, Fang Liu, Junhang Cheng, Chengru Wu, Junchen Ai, Qiaoyuanhe Meng, Lichen Zhang, Xiaoli Lian, Shubin Song, Yuanping Guo

In recent years, Large Language Models (LLMs) have achieved remarkable progress in automated code generation. In real-world software engineering, the growing demand for rapid iteration and continuous delivery underscores the importance of project-level code generation, where LLMs are expected to generate complete software projects directly from complex user requirements. Although existing studies have made initial explorations, they still face key limitations, including unrealistic datasets and unreliable evaluation metrics that fail to reflect real-world complexity, the semantic gap between human-written requirements and machine-interpretable structures, and difficulties in managing hierarchical dependencies and maintaining quality throughout the generation process. To address these limitations, we first introduce CodeProjectEval, a project-level code generation dataset built from 18 real-world repositories with 12.7 files and 2,388.6 lines of code per task on average, supplemented with documentation and executable test cases for automatic evaluation. We further propose ProjectGen, a multi-agent framework that decomposes projects into architecture design, skeleton generation, and code filling stages with iterative refinement and memory-based context management. Within this framework, we introduce the Semantic Software Architecture Tree (SSAT), a structured and semantically rich representation that effectively bridges user requirements and source code implementation. Experiments show that ProjectGen achieves state-of-the-art performance, passing 52/124 test cases on the small-scale project-level code generation dataset DevBench, a 57% improvement over the baseline approaches, and 310 test cases on CodeProjectEval, representing an improvement of roughly tenfold compared to the baselines.

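Paper and project links

PDF

Summary

This paper addresses project-level code generation with large language models. To tackle limitations of existing work (unrealistic datasets, unreliable evaluation metrics, the semantic gap between human-written requirements and machine-interpretable structures, and difficulty managing hierarchical dependencies and maintaining quality throughout generation), it introduces the CodeProjectEval dataset and the ProjectGen multi-agent framework. ProjectGen decomposes projects into architecture design, skeleton generation, and code filling stages and uses the Semantic Software Architecture Tree (SSAT), achieving state-of-the-art performance.

Key Takeaways

LLMs have made remarkable progress in automated code generation, and project-level generation is growing in importance.
Existing studies are limited by unrealistic datasets, unreliable evaluation metrics, the semantic gap between requirements and code structures, and difficulty managing hierarchical dependencies.
CodeProjectEval is built from 18 real-world repositories and includes documentation and executable test cases for automatic evaluation.
ProjectGen is a multi-agent framework that decomposes projects into stages and introduces the Semantic Software Architecture Tree (SSAT).
ProjectGen passes 52 of 124 test cases on the small-scale DevBench dataset, a 57% improvement over the baselines.
On CodeProjectEval, ProjectGen passes 310 test cases, roughly a tenfold improvement over the baselines.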
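
The DevBench figures in the abstract (52 of 124 test cases, a 57% improvement) can be related with a little arithmetic. The sketch below is only illustrative: it assumes the 57% is a relative increase in the number of passed test cases, which the abstract does not state explicitly, and the helper names are hypothetical.

```python
def pass_rate(passed, total):
    """Fraction of test cases passed."""
    return passed / total


def implied_baseline(new_value, relative_improvement):
    """Baseline value implied by a relative improvement.

    Assumes new_value = baseline * (1 + relative_improvement);
    this reading of "57% improvement" is an assumption.
    """
    return new_value / (1.0 + relative_improvement)


if __name__ == "__main__":
    # Figures quoted in the abstract: 52 of 124 DevBench test cases.
    print(f"DevBench pass rate: {pass_rate(52, 124):.1%}")
    print(f"Implied baseline passes: {implied_baseline(52, 0.57):.1f}")
```

Under that assumption, the best baseline would pass about 33 of the 124 DevBench test cases; the paper itself should be consulted for the exact baseline numbers.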

A Modular, Data-Free Pipeline for Multi-Label Intention Recognition in Transportation Agentic AI Applications

Authors:Xiaocai Zhang, Hur Lim, Ke Wang, Zhe Xiao, Jing Wang, Kelvin Lee, Xiuju Fu, Zheng Qin

In this study, a modular, data-free pipeline for multi-label intention recognition is proposed for agentic AI applications in transportation. Unlike traditional intent recognition systems that depend on large, annotated corpora and often struggle with fine-grained, multi-label discrimination, our approach eliminates the need for costly data collection while enhancing the accuracy of multi-label intention understanding. Specifically, the overall pipeline, named DMTC, consists of three steps: 1) using prompt engineering to guide large language models (LLMs) to generate diverse synthetic queries in different transport scenarios; 2) encoding each textual query with a Sentence-T5 model to obtain compact semantic embeddings; 3) training a lightweight classifier using a novel online focal-contrastive (OFC) loss that emphasizes hard samples and maximizes inter-class separability. The applicability of the proposed pipeline is demonstrated in an agentic AI application in the maritime transportation context. Extensive experiments show that DMTC achieves a Hamming loss of 5.35% and an AUC of 95.92%, outperforming state-of-the-art multi-label classifiers and recent end-to-end SOTA LLM-based baselines. Further analysis reveals that Sentence-T5 embeddings improve subset accuracy by at least 3.29% over alternative encoders, and integrating the OFC loss yields an additional 0.98% gain compared to standard contrastive objectives. In conclusion, our system seamlessly routes user queries to task-specific modules (e.g., ETA information, traffic risk evaluation, and other typical scenarios in the transportation domain), laying the groundwork for fully autonomous, intention-aware agents without costly manual labelling.

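Paper and project links

PDF Present in the Transportation Research Board (TRB) Annual Meeting 2026

Summary
This study proposes a modular, data-free pipeline, named DMTC, for multi-label intention recognition in agentic AI applications for transportation. Unlike traditional intent recognition systems that depend on large annotated corpora, the approach avoids costly data collection while improving multi-label intention understanding. The pipeline has three steps: using prompt engineering to guide LLMs to generate diverse synthetic queries across transport scenarios; encoding each query with a Sentence-T5 model to obtain compact semantic embeddings; and training a lightweight classifier with a novel online focal-contrastive (OFC) loss that emphasizes hard samples and maximizes inter-class separability. Its applicability is demonstrated in an agentic AI application for maritime transportation. Experiments show a Hamming loss of 5.35% and an AUC of 95.92%, outperforming state-of-the-art multi-label classifiers and recent end-to-end LLM-based baselines. Further analysis shows that Sentence-T5 embeddings improve subset accuracy by at least 3.29% over alternative encoders, and that the OFC loss yields an additional 0.98% gain over standard contrastive objectives. The system routes user queries to task-specific modules (for example, ETA information and traffic risk evaluation), laying the groundwork for fully autonomous, intention-aware agents without costly manual labelling.

Key Takeaways

The study proposes a modular, data-free pipeline for multi-label intention recognition in agentic AI applications for transportation.
Unlike traditional methods that depend on large annotated corpora, it avoids costly data collection and improves the accuracy of multi-label intention understanding.
The pipeline has three steps: generating synthetic queries, encoding them into semantic embeddings, and training a classifier with the online focal-contrastive (OFC) loss.
Experiments report a Hamming loss of 5.35% and an AUC of 95.92%, outperforming the compared baselines.
Sentence-T5 embeddings improve subset accuracy by at least 3.29% over alternative encoders.
The OFC loss yields an additional 0.98% gain over standard contrastive objectives.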
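
The abstract reports results as Hamming loss and subset accuracy, two standard multi-label metrics. As a minimal sketch of what these measure (using their usual textbook definitions, not the authors' evaluation code; the toy labels are invented for illustration):

```python
def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label set is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


if __name__ == "__main__":
    # Toy data (hypothetical): 2 queries, 3 possible intents each.
    y_true = [[1, 0, 1], [0, 1, 0]]
    y_pred = [[1, 0, 0], [0, 1, 0]]
    print(hamming_loss(y_true, y_pred))
    print(subset_accuracy(y_true, y_pred))
```

Hamming loss counts every label slot separately, so it is more forgiving than subset accuracy, which only credits a query when all of its intents are predicted correctly.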

HAFixAgent: History-Aware Automated Program Repair Agent

Authors:Yu Shi, Hao Li, Bram Adams, Ahmed E. Hassan

Automated program repair (APR) has recently shifted toward large language models and agent-based systems, yet most systems rely on local snapshot context, overlooking repository history. Prior work shows that repository history helps repair single-line bugs, since the last commit touching the buggy line is often the bug-introducing one. In this paper, we investigate whether repository history can also improve agentic APR systems at scale, especially for complex multi-hunk bugs. We present HAFixAgent, a History-Aware Bug-Fixing Agent that injects blame-derived repository heuristics into its repair loop. A preliminary study of all 854 real-world bugs from Defects4J motivates our design, showing that bug-relevant history is both widely available and highly concentrated. Empirical comparison of HAFixAgent with two state-of-the-art baselines shows: (1) Effectiveness: HAFixAgent significantly improves over the agent-based baseline (by 212.3%) and the multi-hunk baseline (by 29.9%). (2) Efficiency: history does not significantly increase agent steps and keeps token costs comparable, with notably lower median costs for complex multi-file-multi-hunk bugs. (3) Practicality: combining different historical heuristics repairs more bugs, offering a clear cost-benefit trade-off. HAFixAgent offers a practical recipe for history-aware agentic APR: ground the agent in version control history, prioritize diff-based historical context, and integrate complementary heuristics when needed.

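Paper and project links

PDF 31 pages, 6 figures

Summary: This paper presents HAFixAgent, a history-aware agent for automated program repair that injects blame-derived repository heuristics into its repair loop. A preliminary study of all 854 real-world bugs in Defects4J shows that bug-relevant history is both widely available and highly concentrated. Compared with two state-of-the-art baselines, HAFixAgent repairs more bugs, especially complex multi-hunk ones, at comparable cost, and combining different historical heuristics repairs more bugs still.

Key Takeaways:

Automated program repair (APR) is shifting toward large language models and agent-based systems.
Most systems rely on local snapshot context and overlook repository history.
Repository history helps repair single-line bugs, since the last commit touching the buggy line is often the bug-introducing one.
HAFixAgent is a history-aware bug-fixing agent that injects blame-derived repository heuristics into its repair loop.
Bug-relevant history is both widely available and highly concentrated.
HAFixAgent improves over the agent-based baseline by 212.3% and over the multi-hunk baseline by 29.9%.
History does not significantly increase agent steps and keeps token costs comparable, with notably lower median costs for complex multi-file-multi-hunk bugs.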
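
The blame heuristic described in the abstract starts from the last commit that touched a buggy line. The sketch below shows one way to look that commit up with `git blame --porcelain`. It is not HAFixAgent's implementation; the function names and the sample path are hypothetical, and only the header-line layout of the porcelain format is relied on.

```python
import re

# First line of `git blame --porcelain` output:
# "<commit hash> <original line> <final line> [<lines in group>]"
_HEADER = re.compile(r"^([0-9a-f]{40,64}) (\d+) (\d+)(?: (\d+))?$")


def blame_command(path, line):
    """Build the argv for blaming a single line of a file."""
    return ["git", "blame", "--porcelain", "-L", f"{line},{line}", "--", path]


def parse_blame_commit(porcelain_output):
    """Return the commit hash from the first header line of porcelain output."""
    lines = porcelain_output.splitlines()
    if not lines:
        raise ValueError("empty blame output")
    match = _HEADER.match(lines[0])
    if match is None:
        raise ValueError("unexpected blame header: " + lines[0])
    return match.group(1)


if __name__ == "__main__":
    # Hypothetical file and line; pass the argv to subprocess.run inside a repo.
    print(blame_command("src/Foo.java", 12))
```

The returned hash could then be passed to `git show` to retrieve the diff that last changed the line, which is the kind of diff-based historical context the abstract recommends prioritizing.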

Agentic Meta-Orchestrator for Multi-task Copilots

Authors:Xiaofeng Zhu, Yunshen Zhou

Microsoft Copilot suites serve as the universal entry point for various agents skilled in handling important tasks, ranging from assisting a customer with product purchases to detecting vulnerabilities in corporate programming code. Each agent can be powered by language models, software engineering operations, such as database retrieval, and internal & external knowledge. The repertoire of a copilot can expand dynamically with new agents. This requires a robust orchestrator that can distribute tasks from user prompts to the right agents. In this work, we propose an Agentic Meta-orchestrator (AMO) for handling multiple tasks and scalable agents in copilot services, which can provide both natural language and action responses. We will also demonstrate the planning that leverages meta-learning, i.e., a trained decision tree model for deciding the best inference strategy among various agents/models. We showcase the effectiveness of our AMO through two production use cases: Microsoft 365 (M365) E-Commerce Copilot and code compliance copilot. M365 E-Commerce Copilot advertises Microsoft products to external customers to promote sales success. The M365 E-Commerce Copilot provides up-to-date product information and connects to multiple agents, such as relational databases and human customer support. The code compliance copilot scans the internal DevOps code to detect known and new compliance issues in pull requests (PR).

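Paper and project links

PDF

Summary: Microsoft Copilot suites serve as a universal entry point for agents that handle tasks ranging from assisting customers with product purchases to detecting vulnerabilities in corporate code. Each agent can be powered by language models, software engineering operations such as database retrieval, and internal and external knowledge, and a copilot's repertoire can expand dynamically as new agents are added. The authors propose an Agentic Meta-orchestrator (AMO) that handles multiple tasks and scalable agents and provides both natural language and action responses. Its planning leverages meta-learning, namely a trained decision tree model that decides the best inference strategy among agents and models. AMO is demonstrated in two production use cases: the Microsoft 365 (M365) E-Commerce Copilot, which advertises Microsoft products to external customers, provides up-to-date product information, and connects to agents such as relational databases and human customer support; and the code compliance copilot, which scans internal DevOps code to detect known and new compliance issues in pull requests.

Key Takeaways:

Microsoft Copilot suites are a universal entry point for agents handling tasks ranging from assisting with product purchases to detecting code vulnerabilities.
Each agent can be powered by language models, software engineering operations, and internal and external knowledge.
The Agentic Meta-orchestrator (AMO) handles multiple tasks and scalable agents in copilot services and provides both natural language and action responses.
AMO plans via meta-learning, using a trained decision tree model to choose the best inference strategy.
The M365 E-Commerce Copilot advertises Microsoft products, provides up-to-date product information, and connects to multiple agents such as relational databases and human customer support.
The code compliance copilot scans internal DevOps code to detect compliance issues in pull requests.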

Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

Authors:Ziye Xia, Sergei S. Ospichev

In recent years, the rapid increase in academic publications across various fields has posed severe challenges for academic paper analysis: scientists struggle to timely and comprehensively track the latest research findings and methodologies. Key concept extraction has proven to be an effective analytical paradigm, and its automation has been achieved with the widespread application of language models in industrial and scientific domains. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, failing to deeply explore the relational networks between concepts. This paper is based on the OpenAlex opensource knowledge graph. By analyzing nearly 8,000 open-source paper data from Novosibirsk State University, we discovered a strong correlation between the distribution patterns of paper key concept paths and both innovation points and rare paths. We propose a prompt engineering-based key concept path analysis method. This method leverages small language models to achieve precise key concept extraction and innovation point identification, and constructs an agent based on a knowledge graph constraint mechanism to enhance analysis accuracy. Through fine-tuning of the Qwen and DeepSeek models, we achieved significant improvements in accuracy, with the models publicly available on the Hugging Face platform.

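Paper and project links

PDF 11 pages, 10 figures

Summary
The rapid growth of academic publications poses severe challenges for paper analysis. Based on the OpenAlex open-source knowledge graph, the authors analyze nearly 8,000 open-source papers from Novosibirsk State University and find a strong correlation between the distribution patterns of key concept paths and both innovation points and rare paths. They propose a prompt engineering-based key concept path analysis method that uses small language models for precise key concept extraction and innovation point identification, and they build an agent with a knowledge graph constraint mechanism to improve analysis accuracy. Fine-tuning the Qwen and DeepSeek models yields significant accuracy improvements.

Key Takeaways

The growth in academic publications makes it hard for researchers to track the latest findings and methods in a timely, comprehensive way.
Key concept extraction is an effective analytical paradigm, and it has been automated through the widespread use of language models in industry and science.
Existing paper databases are mostly limited to similarity matching and basic classification of key concepts and do not explore the relational networks between concepts.
Using the OpenAlex knowledge graph, the authors find a strong correlation between key concept path distributions and both innovation points and rare paths.
They propose a prompt engineering-based key concept path analysis method that uses small language models for key concept extraction and innovation point identification.
An agent built on a knowledge graph constraint mechanism is used to improve analysis accuracy.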

Decentralized Aerial Manipulation of a Cable-Suspended Load using Multi-Agent Reinforcement Learning

Authors:Jack Zeng, Andreu Matoses Gimenez, Eugene Vinitsky, Javier Alonso-Mora, Sihao Sun

This paper presents the first decentralized method to enable real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). Our method leverages multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. Unlike state-of-the-art controllers that utilize a centralized scheme, our policy does not require global states, inter-MAV communications, nor neighboring MAV information. Instead, agents communicate implicitly through load pose observations alone, which enables high scalability and flexibility. It also significantly reduces computing costs during inference time, enabling onboard deployment of the policy. In addition, we introduce a new action space design for the MAVs using linear acceleration and body rates. This choice, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. We validate our method in various real-world experiments, including full-pose control under load model uncertainties, showing setpoint tracking performance comparable to the state-of-the-art centralized method. We also demonstrate cooperation amongst agents with heterogeneous control policies, and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://autonomousrobots.nl/paper_websites/aerial-manipulation-marl

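Paper and project links

PDF

Summary

This paper presents the first decentralized method for real-world 6-DoF manipulation of a cable-suspended load using a team of Micro-Aerial Vehicles (MAVs). It uses multi-agent reinforcement learning (MARL) to train an outer-loop control policy for each MAV. The policy requires no global states, inter-MAV communication, or neighboring MAV information; agents communicate implicitly through load pose observations alone, which gives high scalability and flexibility and significantly reduces inference-time computing cost. A new action space based on linear acceleration and body rates, combined with a robust low-level controller, enables reliable sim-to-real transfer despite significant uncertainties caused by cable tension during dynamic 3D motion. Real-world experiments, including full-pose control under load model uncertainties, show setpoint tracking comparable to the state-of-the-art centralized method. The authors also demonstrate cooperation among agents with heterogeneous control policies and robustness to the complete in-flight loss of one MAV. Videos of experiments: https://autonomousrobots.nl/paper_websites/aerial-manipulation-marl

Key Takeaways

The paper presents a decentralized method for real-world manipulation of a cable-suspended load.
Multi-agent reinforcement learning trains an outer-loop control policy for each MAV.
The policy needs no global states, inter-MAV communication, or neighboring MAV information; agents communicate implicitly through load pose observations.
This yields high scalability and flexibility and reduces inference-time computing cost.
A new MAV action space using linear acceleration and body rates is combined with a robust low-level controller.
Full-pose control under load model uncertainties achieves setpoint tracking comparable to the state-of-the-art centralized method.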

Multi-Agent Reinforcement Learning for Autonomous Multi-Satellite Earth Observation: A Realistic Case Study

Authors:Mohamad A. Hady, Siyi Hu, Mahardhika Pratama, Jimmy Cao, Ryszard Kowalczyk

The exponential growth of Low Earth Orbit (LEO) satellites has revolutionised Earth Observation (EO) missions, addressing challenges in climate monitoring, disaster management, and more. However, autonomous coordination in multi-satellite systems remains a fundamental challenge. Traditional optimisation approaches struggle to handle the real-time decision-making demands of dynamic EO missions, necessitating the use of Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL). In this paper, we investigate RL-based autonomous EO mission planning by modelling single-satellite operations and extending to multi-satellite constellations using MARL frameworks. We address key challenges, including energy and data storage limitations, uncertainties in satellite observations, and the complexities of decentralised coordination under partial observability. By leveraging a near-realistic satellite simulation environment, we evaluate the training stability and performance of state-of-the-art MARL algorithms, including PPO, IPPO, MAPPO, and HAPPO. Our results demonstrate that MARL can effectively balance imaging and resource management while addressing non-stationarity and reward interdependency in multi-satellite coordination. The insights gained from this study provide a foundation for autonomous satellite operations, offering practical guidelines for improving policy learning in decentralised EO missions.

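Paper and project links

PDF

Summary

The exponential growth of Low Earth Orbit (LEO) satellites has revolutionised Earth Observation (EO) missions, but autonomous coordination in multi-satellite systems remains a fundamental challenge. Traditional optimisation approaches struggle with the real-time decision-making demands of dynamic EO missions, motivating Reinforcement Learning (RL) and Multi-Agent Reinforcement Learning (MARL). The paper studies RL-based autonomous EO mission planning by modelling single-satellite operations and extending to multi-satellite constellations with MARL frameworks. It addresses energy and data storage limitations, uncertainties in satellite observations, and decentralised coordination under partial observability. Using a near-realistic satellite simulation environment, it evaluates the training stability and performance of PPO, IPPO, MAPPO, and HAPPO. The results show that MARL can balance imaging and resource management while addressing non-stationarity and reward interdependency in multi-satellite coordination.

Key Takeaways

The rapid growth of LEO satellites has transformed Earth Observation missions such as climate monitoring and disaster management.
Autonomous coordination in multi-satellite systems is a fundamental challenge.
Traditional optimisation approaches struggle with the real-time decision-making demands of dynamic EO missions.
RL and MARL show promise for EO mission planning.
Key challenges include energy and data storage limitations and uncertainties in satellite observations.
MARL can effectively balance imaging and resource management.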

Large Language Models Miss the Multi-Agent Mark

Authors:Emanuele La Malfa, Gabriele La Malfa, Samuele Marro, Jie M. Zhang, Elizabeth Black, Michael Luck, Philip Torr, Michael Wooldridge

Recent interest in Multi-Agent Systems of Large Language Models (MAS LLMs) has led to an increase in frameworks leveraging multiple LLMs to tackle complex tasks. However, much of this literature appropriates the terminology of MAS without engaging with its foundational principles. In this position paper, we highlight critical discrepancies between MAS theory and current MAS LLMs implementations, focusing on four key areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. Our position is that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. The field may slow down and lose traction by revisiting problems the MAS literature has already addressed. Therefore, we systematically analyse this issue and outline associated research opportunities; we advocate for better integrating established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

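Paper and project links

PDF NeurIPS 2025 (position track)

Summary

Multi-Agent Systems of Large Language Models (MAS LLMs) have attracted growing interest, and many frameworks use multiple LLMs to tackle complex tasks. Much of this literature, however, appropriates MAS terminology without engaging with its foundational principles. This position paper highlights critical discrepancies between MAS theory and current MAS LLM implementations in four areas: the social aspect of agency, environment design, coordination and communication protocols, and measuring emergent behaviours. The authors argue that many MAS LLMs lack multi-agent characteristics such as autonomy, social interaction, and structured environments, and often rely on oversimplified, LLM-centric architectures. They systematically analyse the issue, outline research opportunities, and advocate better integration of established MAS concepts and more precise terminology to avoid mischaracterisation and missed opportunities.

Key Takeaways

Much of the MAS LLM literature appropriates MAS terminology without engaging with its foundational principles.
Many current MAS LLMs lack multi-agent characteristics such as autonomy and social interaction.
Environment design is one area where implementations diverge from MAS theory; many systems lack structured environments.
Coordination and communication protocols are another area of discrepancy between MAS theory and current implementations.
Methods for measuring emergent behaviours need further development.
Current MAS LLM architectures tend to be oversimplified and LLM-centric.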

Distilling LLM Agent into Small Models with Retrieval and Code Tools

Authors:Minki Kang, Jongwon Jeong, Seanie Lee, Jaewoong Cho, Sung Ju Hwang

Large language models (LLMs) excel at complex reasoning tasks but remain computationally expensive, limiting their practical deployment. To address this, recent works have focused on distilling reasoning capabilities into smaller language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. However, this approach struggles in scenarios requiring rare factual knowledge or precise computation, where sLMs often hallucinate due to limited capability. In this work, we propose Agent Distillation, a framework for transferring not only reasoning capability but full task-solving behavior from LLM-based agents into sLMs with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to enhance the quality of teacher-generated trajectories; and (2) we propose a self-consistent action generation for improving test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, 3B parameters can achieve performance competitive with next-tier larger 1.5B, 3B, 7B models fine-tuned using CoT distillation, demonstrating the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.

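Large language models (LLMs) perform well on complex reasoning tasks, but their high computational cost limits practical deployment. To address this, recent work has focused on distilling reasoning capability into small language models (sLMs) using chain-of-thought (CoT) traces from teacher LLMs. This approach struggles in scenarios that require rare factual knowledge or precise computation, where sLMs often hallucinate because of their limited capability. In this work, we propose Agent Distillation, a framework that transfers not only reasoning capability but the full task-solving behavior of LLM-based agents into sLMs equipped with retrieval and code tools. We improve agent distillation along two complementary axes: (1) we introduce a prompting method called first-thought prefix to raise the quality of teacher-generated trajectories; and (2) we propose self-consistent action generation to improve the test-time robustness of small agents. We evaluate our method on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. Our results show that sLMs as small as 0.5B, 1.5B, and 3B parameters can match the performance of next-tier larger 1.5B, 3B, and 7B models fine-tuned with CoT distillation, which demonstrates the potential of agent distillation for building practical, tool-using small agents. Our code is available at https://github.com/Nardien/agent-distillation.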

Paper and project links

PDF NeurIPS 2025 Spotlight

Summary

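Large language models excel at complex reasoning tasks, but their high computational cost limits practical deployment. To address this, recent work distills reasoning capability into small language models using chain-of-thought (CoT) traces from teacher LLMs. In scenarios that require rare factual knowledge or precise computation, however, this approach often produces hallucinations because of the small models' limited capability. This work proposes the Agent Distillation framework, which transfers not only reasoning capability but also the full task-solving behavior of LLM-based agents, combined with retrieval and code tools. Agent distillation is improved along two complementary directions: (1) a prompting method called first-thought prefix that raises the quality of teacher-generated trajectories; and (2) self-consistent action generation that improves the test-time robustness of small agents. The method is evaluated on eight reasoning tasks across factual and mathematical domains, covering both in-domain and out-of-domain generalization. The results show that small models with only 0.5B, 1.5B, and 3B parameters reach performance close to that of the next-tier larger 1.5B, 3B, and 7B models fine-tuned with CoT distillation, indicating the potential of agent distillation for building practical, tool-using small models.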

Key Takeaways

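Large language models excel at complex reasoning tasks but are computationally expensive.
Distilling reasoning capability into small language models is one way to reduce that cost.
Distillation from teacher-LLM chain-of-thought traces has limits, especially in scenarios that require rare factual knowledge or precise computation.
The Agent Distillation framework transfers not only reasoning capability but also the task-solving behavior of LLM-based agents.
Agent Distillation improves agent quality through the first-thought prefix prompting method and self-consistent action generation.
Evaluation on eight reasoning tasks across factual and mathematical domains shows that small models trained with agent distillation are competitive.
Agent Distillation shows potential for building practical, tool-using small models.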
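
The abstract does not spell out how self-consistent action generation works. One plausible reading is to sample several candidate actions, drop the ones that fail to run, and keep an action whose outcome agrees with the majority. The sketch below illustrates that reading only; `select_consistent_action`, `run_action`, and `toy_executor` are hypothetical names, not taken from the paper or its repository.

```python
from collections import Counter
from typing import Callable, Optional, Sequence


def select_consistent_action(
    candidates: Sequence[str],
    run_action: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a candidate whose observed outcome is the most common one.

    `run_action` is a hypothetical executor: it returns the outcome of an
    action as a string, or None when the action fails (e.g. a code error).
    """
    outcomes = []  # (action, outcome) pairs for actions that ran successfully
    for action in candidates:
        outcome = run_action(action)
        if outcome is not None:
            outcomes.append((action, outcome))
    if not outcomes:
        return None  # every sampled action failed
    # Majority vote over outcomes; ties go to the outcome seen first.
    majority_outcome, _ = Counter(o for _, o in outcomes).most_common(1)[0]
    # Return the first sampled action that produced the majority outcome.
    return next(a for a, o in outcomes if o == majority_outcome)


def toy_executor(action: str) -> Optional[str]:
    """Toy stand-in for a code tool: evaluates simple arithmetic strings."""
    try:
        return str(eval(action, {"__builtins__": {}}, {}))
    except Exception:
        return None


if __name__ == "__main__":
    sampled = ["2 * 3", "2 + 3", "1 +", "6 - 1"]
    print(select_consistent_action(sampled, toy_executor))
```

In the toy run, one candidate fails to parse and two of the remaining three agree on the same result, so one of the agreeing candidates is returned. A real agent would replace `toy_executor` with its sandboxed code or retrieval tool.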

SecRepoBench: Benchmarking Code Agents for Secure Code Completion in Real-World Repositories

Authors:Chihao Shen, Connor Dilgren, Purva Chiniya, Luke Griffith, Yu Ding, Yizheng Chen

This paper introduces SecRepoBench, a benchmark to evaluate code agents on secure code completion in real-world repositories. SecRepoBench has 318 code completion tasks in 27 C/C++ repositories, covering 15 CWEs. We evaluate 28 standalone LLMs and 13 code agents across 3 state-of-the-art agent frameworks using our benchmark. We find that state-of-the-art LLMs struggle with generating correct and secure code completions. However, code agents significantly outperform standalone LLMs. We show that SecRepoBench is more difficult than the prior state-of-the-art benchmark. Finally, our comprehensive analysis provides insights into potential directions for enhancing the ability of code agents to write correct and secure code in real-world repositories.

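This paper introduces SecRepoBench, a benchmark for evaluating code agents on secure code completion in real-world repositories. SecRepoBench contains 318 code completion tasks across 27 C/C++ repositories, covering 15 CWEs. Using this benchmark, we evaluate 28 standalone LLMs and 13 code agents across 3 state-of-the-art agent frameworks. We find that state-of-the-art LLMs struggle to generate correct and secure code completions, whereas code agents significantly outperform standalone LLMs. We show that SecRepoBench is more difficult than the prior state-of-the-art benchmark. Finally, our comprehensive analysis offers insight into potential directions for improving the ability of code agents to write correct and secure code in real-world repositories.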

Paper and project links

PDF

Summary

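This paper introduces SecRepoBench, a benchmark for evaluating code agents on secure code completion in real code repositories. SecRepoBench contains 318 code completion tasks across 27 C/C++ repositories, covering 15 CWEs (Common Weakness Enumeration categories). The study evaluates 28 standalone LLMs and 13 code agents built on three state-of-the-art agent frameworks. The results show that state-of-the-art LLMs struggle to generate completions that are both correct and secure, while code agents significantly outperform standalone LLMs. SecRepoBench is also shown to be more difficult than the prior state-of-the-art benchmark. Finally, the paper's comprehensive analysis offers insight into how to strengthen code agents' ability to write correct and secure code in real repositories.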

Key Takeaways

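SecRepoBench is a benchmark for evaluating code agents on secure code completion in real repositories.
It contains 318 code completion tasks across 27 C/C++ repositories, covering 15 CWEs.
Standalone LLMs struggle to generate correct and secure code completions.
Code agents significantly outperform standalone LLMs.
SecRepoBench is more challenging than the prior state-of-the-art benchmark.
The comprehensive analysis offers insight into how to improve code agents' ability to write correct and secure code.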
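
The abstract stresses completions that are both correct and secure but does not define the scoring. As a rough illustration only, the sketch below assumes each task result carries two booleans and counts a completion as a success only when both hold. The `CompletionResult` fields and the `correct_and_secure_rate` function are hypothetical and are not the benchmark's actual metric or API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CompletionResult:
    task_id: str
    passes_tests: bool  # assumption: the completion passes functional tests
    is_secure: bool     # assumption: no vulnerability is triggered


def correct_and_secure_rate(results: Iterable[CompletionResult]) -> float:
    """Fraction of tasks whose completion is both correct and secure."""
    results = list(results)
    if not results:
        return 0.0  # avoid dividing by zero on an empty run
    good = sum(1 for r in results if r.passes_tests and r.is_secure)
    return good / len(results)


if __name__ == "__main__":
    run = [
        CompletionResult("task-1", passes_tests=True, is_secure=True),
        CompletionResult("task-2", passes_tests=True, is_secure=False),
        CompletionResult("task-3", passes_tests=False, is_secure=True),
        CompletionResult("task-4", passes_tests=False, is_secure=False),
    ]
    print(correct_and_secure_rate(run))
```

Requiring both conditions at once is the point of the sketch: a completion that passes its tests but reintroduces a weakness, or one that is safe but functionally wrong, does not count.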

TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems

Authors:Xiao Zhang, Qi Wang, Mingyi Li, Yuan Yuan, Mengbai Xiao, Fuzhen Zhuang, Dongxiao Yu

Implementing large language models (LLMs)-driven root cause analysis (RCA) in cloud-native systems has become a key topic of modern software operations and maintenance. However, existing LLM-based approaches face three key challenges: multi-modality input constraint, context window limitation, and dynamic dependence graph. To address these issues, we propose a tool-assisted LLM agent with multi-modality observation data for fine-grained RCA, namely TAMO, including multimodality alignment tool, root cause localization tool, and fault types classification tool. In detail, TAMO unifies multi-modal observation data into time-aligned representations for cross-modal feature consistency. Based on the unified representations, TAMO then invokes its specialized root cause localization tool and fault types classification tool for further identifying root cause and fault type underlying system context. This approach overcomes the limitations of LLMs in processing real-time raw observational data and dynamic service dependencies, guiding the model to generate repair strategies that align with system context through structured prompt design. Experiments on two benchmark datasets demonstrate that TAMO outperforms state-of-the-art (SOTA) approaches with comparable performance.

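Implementing large language model (LLM)-driven root cause analysis (RCA) in cloud-native systems has become a key topic in modern software operations and maintenance. However, existing LLM-based approaches face three key challenges: multi-modality input constraints, context window limitations, and dynamic dependency graphs. To address these issues, we propose TAMO, a tool-assisted LLM agent that uses multi-modality observation data for fine-grained RCA. TAMO comprises a multi-modality alignment tool, a root cause localization tool, and a fault type classification tool. Specifically, TAMO unifies multi-modal observation data into time-aligned representations for cross-modal feature consistency. Based on these unified representations, TAMO then invokes its specialized root cause localization and fault type classification tools to identify the root cause and fault type within the system context. This approach overcomes the limitations of LLMs in processing real-time raw observation data and dynamic service dependencies, and uses structured prompt design to guide the model toward repair strategies that are consistent with the system context. Experiments on two benchmark datasets show that TAMO outperforms state-of-the-art approaches.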

Paper and project links

PDF

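Summary

LLM-driven root cause analysis (RCA) in cloud-native systems is a key topic in current software operations and maintenance. Existing LLM-based methods face three challenges: multi-modality input constraints, context window limitations, and dynamic dependency graphs. To address them, the paper proposes TAMO, a tool-assisted LLM agent that uses multi-modality observation data for fine-grained RCA. TAMO comprises a multi-modality alignment tool, a root cause localization tool, and a fault type classification tool. It unifies multi-modal observation data into time-aligned representations for cross-modal feature consistency, then uses the specialized tools to identify the root cause and fault type. This approach overcomes the limitations of LLMs in handling real-time raw observation data and dynamic service dependencies, and uses structured prompt design to guide the model toward repair strategies consistent with the system context. Experiments on two benchmark datasets show that TAMO outperforms existing methods.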

Key Takeaways

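Using large language models (LLMs) for root cause analysis (RCA) in cloud-native systems is an important direction in current software operations and maintenance.
Existing LLM-based methods face three challenges: multi-modality input constraints, context window limitations, and dynamic dependency graphs.
TAMO is a tool-assisted LLM agent that performs fine-grained RCA over multi-modality observation data.
TAMO comprises a multi-modality alignment tool, a root cause localization tool, and a fault type classification tool.
TAMO unifies multi-modal observation data into time-aligned cross-modal representations to identify the root cause and fault type.
TAMO overcomes the limitations of LLMs in handling real-time raw observation data and dynamic service dependencies.
Through structured prompt design, TAMO guides the model toward repair strategies consistent with the system context.
Experiments on two benchmark datasets show that TAMO outperforms existing methods.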
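
The digest does not describe how TAMO's alignment tool is built. As a minimal sketch of what "time-aligned representations" could mean, the code below assumes a fixed-width time window and groups events from different modalities (metrics, logs, traces) that fall into the same window. The `Event` layout, the `align_by_window` function, and the sample payloads are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (timestamp in seconds, modality name, payload) -- an assumed event layout
Event = Tuple[float, str, str]


def align_by_window(
    events: Iterable[Event], window_s: float
) -> Dict[int, Dict[str, List[str]]]:
    """Group events by fixed-width time window, then by modality."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    aligned = defaultdict(lambda: defaultdict(list))
    # Sorting the tuples orders events by timestamp first.
    for timestamp, modality, payload in sorted(events):
        window_index = int(timestamp // window_s)
        aligned[window_index][modality].append(payload)
    # Convert nested defaultdicts to plain dicts for a clean return value.
    return {w: dict(m) for w, m in aligned.items()}


if __name__ == "__main__":
    observations = [
        (12.0, "log", "timeout"),
        (3.5, "metric", "cpu=0.9"),
        (14.2, "trace", "svc-a->svc-b"),
        (4.0, "log", "retry"),
    ]
    for window, by_modality in sorted(align_by_window(observations, 10.0).items()):
        print(window, by_modality)
```

Once events share a window index, a downstream tool can look at one window and see what every modality reported at roughly the same time, which is the cross-modal consistency the summary refers to.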

Depth Matters: Multimodal RGB-D Perception for Robust Autonomous Agents

Authors:Mihaela-Larisa Clement, Mónika Farsang, Felix Resch, Mihai-Teodor Stanusoiu, Radu Grosu

Autonomous agents that rely purely on perception to make real-time control decisions require efficient and robust architectures. In this work, we demonstrate that augmenting RGB input with depth information significantly enhances our agents’ ability to predict steering commands compared to using RGB alone. We benchmark lightweight recurrent controllers that leverage the fused RGB-D features for sequential decision-making. To train our models, we collect high-quality data using a small-scale autonomous car controlled by an expert driver via a physical steering wheel, capturing varying levels of steering difficulty. Our models were successfully deployed on real hardware and inherently avoided dynamic and static obstacles, under out-of-distribution conditions. Specifically, our findings reveal that the early fusion of depth data results in a highly robust controller, which remains effective even with frame drops and increased noise levels, without compromising the network’s focus on the task.

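Autonomous agents that rely purely on perception to make real-time control decisions need efficient and robust architectures. In this work, we show that augmenting RGB input with depth information significantly improves our agents' ability to predict steering commands compared with using RGB alone. We benchmark lightweight recurrent controllers that use the fused RGB-D features for sequential decision-making. To train our models, we collect high-quality data with a small-scale autonomous car controlled by an expert driver through a physical steering wheel, capturing varying levels of steering difficulty. Our models were successfully deployed on real hardware and inherently avoided dynamic and static obstacles under out-of-distribution conditions. Specifically, our findings show that early fusion of depth data yields a highly robust controller that remains effective even with frame drops and increased noise levels, without compromising the network's focus on the task.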

Paper and project links

PDF Submitted to ICRA 2025

Summary

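Perception-based autonomous agents need efficient and robust architectures to make real-time control decisions. This study shows that fusing RGB with depth information significantly improves an agent's ability to predict steering commands compared with RGB alone. Lightweight recurrent controllers are benchmarked on the fused RGB-D features for sequential decision-making. The models are trained on high-quality data collected with a small-scale autonomous car driven by an expert through a physical steering wheel, covering varying levels of steering difficulty. The models were successfully deployed on real hardware and avoided dynamic and static obstacles under out-of-distribution conditions. In particular, early fusion of depth data yields a highly robust controller that remains effective even with frame drops and increased noise, without compromising the network's focus on the task.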
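
Early fusion generally means combining modalities at the input, before any feature extraction. A common way to do this for RGB-D, assumed here and not confirmed by the abstract, is to append the depth value to each pixel as a fourth channel. The sketch below uses plain nested lists to stay dependency-free; `early_fuse_rgbd` and the type aliases are hypothetical names.

```python
from typing import List

Image = List[List[List[float]]]  # height x width x channels
DepthMap = List[List[float]]     # height x width


def early_fuse_rgbd(rgb: Image, depth: DepthMap) -> Image:
    """Append each depth value to its RGB pixel, giving 4 channels per pixel."""
    same_height = len(rgb) == len(depth)
    same_width = all(len(r) == len(d) for r, d in zip(rgb, depth))
    if not (same_height and same_width):
        raise ValueError("RGB and depth must share the same height and width")
    return [
        [list(pixel) + [d] for pixel, d in zip(rgb_row, depth_row)]
        for rgb_row, depth_row in zip(rgb, depth)
    ]


if __name__ == "__main__":
    rgb = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]]  # a 1 x 2 image
    depth = [[1.5, 2.5]]                        # matching 1 x 2 depth map
    print(early_fuse_rgbd(rgb, depth))
```

The fused array would then be fed to a single network, in contrast to late fusion, where RGB and depth are processed by separate branches and merged afterwards.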

Key Takeaways