Published: 2025-11-20

Updated: 2025-11-27

Word count: 16.9k

Reading time: 68 min

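⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are only suitable for preliminary screening before reading a paper.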

2025-11-20 Update

AutoTool: Efficient Tool Selection for Large Language Model Agents

Authors: Jingyi Jia, Qinbin Li

Large Language Model (LLM) agents have emerged as powerful tools for automating complex tasks by leveraging the reasoning and decision-making abilities of LLMs. However, a major bottleneck in current agent frameworks lies in the high inference cost of tool selection, especially in approaches like ReAct that repeatedly invoke the LLM to determine which tool to use at each step. In this work, we propose AutoTool, a novel graph-based framework that bypasses repeated LLM inference by exploiting a key empirical observation: tool usage inertia - the tendency of tool invocations to follow predictable sequential patterns. AutoTool constructs a directed graph from historical agent trajectories, where nodes represent tools and edges capture transition probabilities, effectively modeling the inertia in tool selection. It further integrates parameter-level information to refine tool input generation. By traversing this structured representation, AutoTool efficiently selects tools and their parameters with minimal reliance on LLM inference. Extensive experiments across diverse agent tasks demonstrate that AutoTool reduces inference costs by up to 30% while maintaining competitive task completion rates, offering a practical and scalable enhancement for inference-heavy frameworks. Our work highlights the promise of integrating statistical structure into LLM agent design for greater efficiency without sacrificing performance.

Paper and project links

PDF Accepted by AAAI 2026, 18 pages, 11 figures, Code: https://github.com/jiajingyyyyyy/AutoTool

Summary

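LLM agents automate complex tasks by drawing on the reasoning and decision-making abilities of LLMs. A major bottleneck in current agent frameworks is the high inference cost of tool selection, especially in approaches such as ReAct that invoke the LLM at every step to decide which tool to use. This paper proposes AutoTool, a graph-based framework that avoids repeated LLM inference by exploiting tool usage inertia, the tendency of tool invocations to follow predictable sequential patterns. AutoTool builds a directed graph from historical agent trajectories, in which nodes represent tools and edges capture transition probabilities, and it integrates parameter-level information to refine tool input generation. By traversing this graph, AutoTool selects tools and their parameters with minimal reliance on LLM inference. Experiments across diverse agent tasks show that it reduces inference costs by up to 30% while maintaining competitive task completion rates.

Key Takeaways

LLM agents are powerful tools for automating complex tasks.
The main bottleneck in current agent frameworks is the high inference cost of tool selection.
AutoTool is a graph-based framework that exploits tool usage inertia to cut this cost.
AutoTool models that inertia with a directed graph built from historical agent trajectories.
AutoTool integrates parameter-level information to refine tool input generation.
AutoTool selects tools and parameters efficiently, reducing reliance on LLM inference.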
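
The transition graph described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the 0.6 confidence threshold, the `llm_fallback` callback, and the toy trajectories are all assumptions, and parameter-level information is left out.

```python
from collections import Counter, defaultdict


def build_transition_graph(trajectories):
    """Count tool-to-tool transitions and normalize them into probabilities."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for current_tool, next_tool in zip(trajectory, trajectory[1:]):
            counts[current_tool][next_tool] += 1
    graph = {}
    for tool, followers in counts.items():
        total = sum(followers.values())
        graph[tool] = {nxt: n / total for nxt, n in followers.items()}
    return graph


def select_next_tool(graph, current_tool, threshold=0.6, llm_fallback=None):
    """Follow the most likely edge if it is confident enough, else defer to the LLM."""
    followers = graph.get(current_tool, {})
    if followers:
        best_tool, prob = max(followers.items(), key=lambda item: item[1])
        if prob >= threshold:
            return best_tool, "graph"
    # Hypothetical fallback: in a real agent this would be an LLM call.
    chosen = llm_fallback(current_tool) if llm_fallback else None
    return chosen, "llm"


# Toy trajectories, invented for illustration (not data from the paper).
history = [
    ["search", "open_page", "extract", "answer"],
    ["search", "open_page", "extract", "answer"],
    ["search", "open_page", "search", "open_page", "answer"],
]
graph = build_transition_graph(history)
print(select_next_tool(graph, "search"))
print(select_next_tool(graph, "open_page", llm_fallback=lambda tool: "answer"))
```

In this toy history every `search` is followed by `open_page`, so the graph answers without an LLM call. The successors of `open_page` are split, so the sketch falls back to the placeholder LLM callback.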

Cool Papers

Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

Authors: Kahaan Gandhi, Boris Bolliet, Inigo Zubeldia

We show that multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. By treating plots as verifiable checkpoints, a VLM-as-a-judge evaluates figures against dynamically generated domain-specific rubrics, enabling agents to correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry demonstrate recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared to 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while also providing auditable reasoning traces that improve interpretability. Code available here: https://github.com/CMBAgents/cmbagent

Paper and project links

PDF

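Summary: Multi-agent systems guided by vision-language models (VLMs) improve end-to-end autonomous scientific discovery. Plots are treated as verifiable checkpoints: a VLM-as-a-judge evaluates each figure against dynamically generated domain-specific rubrics, which lets agents correct their own errors and steer exploratory data analysis in real time. Case studies in cosmology and astrochemistry show recovery from faulty reasoning paths and adaptation to new datasets without human intervention. On a 10-task benchmark for data-driven discovery, VLM-augmented systems achieve pass@1 scores of 0.7-0.8, compared with 0.2-0.3 for code-only and 0.4-0.5 for code-and-text baselines, while providing auditable reasoning traces that improve interpretability. Code is available at https://github.com/CMBAgents/cmbagent.

Key Takeaways

Multi-agent systems guided by VLMs improve autonomous scientific discovery.
A VLM-as-a-judge treats plots as verifiable checkpoints and evaluates them against domain-specific rubrics.
Agents can correct their own errors and steer exploratory data analysis in real time.
Case studies in cosmology and astrochemistry show recovery from faulty reasoning paths and adaptation to new datasets.
On the data-driven discovery benchmark, VLM-augmented systems score higher (pass@1 of 0.7-0.8) than the code-only and code-and-text baselines.
The systems provide auditable reasoning traces that improve interpretability.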

Cool Papers

Agentic AI Systems in Electrical Power Systems Engineering: Current State-of-the-Art and Challenges

Authors: Soham Ghosh, Gaurav Mittal

Agentic AI systems have recently emerged as a critical and transformative approach in artificial intelligence, offering capabilities that extend far beyond traditional AI agents and contemporary generative AI models. This rapid evolution necessitates a clear conceptual and taxonomical understanding to differentiate this new paradigm. Our paper addresses this gap by providing a comprehensive review that establishes a precise definition and taxonomy for “agentic AI,” with the aim of distinguishing it from previous AI paradigms. The concepts are gradually introduced, starting with a highlight of its diverse applications across the broader field of engineering. The paper then presents four detailed, state-of-the-art use case applications specifically within electrical engineering. These case studies demonstrate practical impact, ranging from an advanced agentic framework for streamlining complex power system studies and benchmarking to a novel system developed for survival analysis of dynamic pricing strategies in battery swapping stations. Finally, to ensure robust deployment, the paper provides detailed failure mode investigations. From these findings, we derive actionable recommendations for the design and implementation of safe, reliable, and accountable agentic AI systems, offering a critical resource for researchers and practitioners.

Paper and project links

PDF

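Summary
Agentic AI systems have emerged as a transformative approach in artificial intelligence, with capabilities that extend well beyond traditional AI agents and contemporary generative AI models. This review addresses the lack of a shared understanding of the field by establishing a precise definition and taxonomy that distinguish agentic AI from earlier paradigms. It introduces the concepts gradually, starting from applications across the broader field of engineering, and then presents four state-of-the-art use cases in electrical engineering. These range from an agentic framework for streamlining complex power system studies and benchmarking to a system for survival analysis of dynamic pricing strategies in battery swapping stations. The paper also reports detailed failure mode investigations and derives recommendations for designing and implementing safe, reliable, and accountable agentic AI systems.

Key Takeaways

Agentic AI systems mark a significant shift in AI, extending beyond traditional AI agents and contemporary generative AI models.
The paper establishes a precise definition and taxonomy to distinguish this new paradigm.
It highlights applications of agentic AI across engineering, including electrical engineering.
It presents four state-of-the-art use cases in electrical engineering that demonstrate practical impact.
One use case is an agentic framework for streamlining complex power system studies and benchmarking.
Another applies agentic AI to survival analysis of dynamic pricing strategies in battery swapping stations.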

Cool Papers

MedBench v4: A Robust and Scalable Benchmark for Evaluating Chinese Medical Language Models, Multimodal Models, and Intelligent Agents

Authors: Jinru Ding, Lu Lu, Chao Ding, Mouxiao Bian, Jiayuan Chen, Renjie Lu, Wenrao Pang, Xiaoqin Wu, Zhiqiang Liu, Luyi Jiang, Bing Han, Yunqiu Wang, Jie Xu

Recent advances in medical large language models (LLMs), multimodal models, and agents demand evaluation frameworks that reflect real clinical workflows and safety constraints. We present MedBench v4, a nationwide, cloud-based benchmarking infrastructure comprising over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. We evaluate 15 frontier models. Base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception yet weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents achieving up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly enhance benchmarked clinical readiness without sacrificing capability. By aligning tasks with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.

Paper and project links

PDF

Summary

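Advances in medical LLMs, multimodal models, and agents call for evaluation frameworks that reflect real clinical workflows and safety constraints. MedBench v4 is a nationwide, cloud-based benchmarking infrastructure with over 700,000 expert-curated tasks spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents. Items undergo multi-stage refinement and multi-round review by clinicians from more than 500 institutions, and open-ended responses are scored by an LLM-as-a-judge calibrated to human ratings. Across 15 frontier models, base LLMs reach a mean overall score of 54.1/100 (best: Claude Sonnet 4.5, 62.5/100), but safety and ethics scores remain low (18.4/100). Multimodal models perform worse overall (mean 47.5/100; best: GPT-5, 54.9/100), with solid perception but weaker cross-modal reasoning. Agents built on the same backbones substantially improve end-to-end performance (mean 79.8/100), with Claude Sonnet 4.5-based agents reaching up to 85.3/100 overall and 88.9/100 on safety tasks. MedBench v4 thus reveals persisting gaps in multimodal reasoning and safety for base models, while showing that governance-aware agentic orchestration can markedly improve benchmarked clinical readiness without sacrificing capability. Because its tasks are aligned with Chinese clinical guidelines and regulatory priorities, the platform offers a practical reference for hospitals, developers, and policymakers auditing medical AI.

Key Takeaways

MedBench v4 is a nationwide, cloud-based benchmark spanning 24 primary and 91 secondary specialties, with dedicated tracks for LLMs, multimodal models, and agents.
Base LLMs reach a mean overall score of 54.1/100 but score low on safety and ethics (18.4/100).
Multimodal models show solid perception but weaker cross-modal reasoning.
Agents built on the same backbones substantially improve end-to-end performance.
MedBench v4 reveals persisting gaps in multimodal reasoning and safety for base models.
Governance-aware agentic orchestration can markedly improve benchmarked clinical readiness.
The platform is aligned with Chinese clinical guidelines and regulatory priorities, offering a practical reference for auditing medical AI.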

Cool Papers

DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

Authors: Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen

In today’s data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.

Paper and project links

PDF

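Summary

In the data-driven era, fully automated data analytics and insight discovery are critical. With the rapid advancement of LLMs, LLM-driven agents have emerged as a promising paradigm for automating both. However, existing data insight agents remain limited in several respects: they underuse domain knowledge, their analysis lacks depth, and code generation during insight generation is error-prone. To address these issues, the authors propose DataSage, a multi-agent framework with three features: external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen the analysis, and multi-path reasoning to improve the accuracy of the generated code and insights. Experiments on InsightBench show that DataSage consistently outperforms existing data insight agents across all difficulty levels.

Key Takeaways

Fully automated data analytics and insight discovery are important in today's data-driven era.
LLM-driven agents are a promising paradigm for automating data analysis and insight discovery.
Existing data insight agents are limited by insufficient use of domain knowledge, shallow analytical depth, and error-prone code generation.
DataSage is a multi-agent framework that addresses these limitations with three features.
External knowledge retrieval enriches the analytical context.
A multi-role debating mechanism simulates diverse analytical perspectives and deepens the analysis.
Multi-path reasoning improves the accuracy of the generated code and insights.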
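
The abstract does not say how DataSage combines its reasoning paths. One common way to aggregate several independent paths is majority voting over their results, which is what the sketch below shows. The function name `aggregate_paths` and the example values are hypothetical, and a failed path (for instance, generated code that raised an error) is represented as `None`.

```python
from collections import Counter


def aggregate_paths(path_results):
    """Return the most frequent non-failed result and its share of all paths."""
    valid = [result for result in path_results if result is not None]
    if not valid:
        return None, 0.0
    winner, votes = Counter(valid).most_common(1)[0]
    return winner, votes / len(path_results)


# Four hypothetical reasoning paths answering "how did weekly revenue move?"
results = ["up", "up", None, "down"]
answer, agreement = aggregate_paths(results)
print(answer, agreement)
```

The agreement ratio can double as a rough confidence signal: a low value suggests the paths disagree and the insight deserves another look.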

Cool Papers

Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

Authors: N Dinesh Reddy, Sudeep Pillai

We introduce Orion, a visual agent framework that can take in any modality and generate any modality. Using an agentic framework with multiple tool-calling capabilities, Orion is designed for visual AI tasks and achieves state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, Optical Character Recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.

Paper and project links

PDF

Summary

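Orion is a visual agent framework that can take in any modality and generate any modality. Built on an agentic framework with multiple tool-calling capabilities, it is designed for visual AI tasks and reports state-of-the-art results. Unlike traditional vision-language models that produce descriptive outputs, Orion orchestrates a suite of specialized computer vision tools, including object detection, keypoint localization, panoptic segmentation, optical character recognition, and geometric analysis, to execute complex multi-step visual workflows. The system achieves competitive performance on MMMU, MMBench, DocVQA, and MMLongBench while extending monolithic vision-language models to production-grade visual intelligence. By combining neural perception with symbolic execution, Orion enables autonomous visual reasoning, marking a transition from passive visual understanding to active, tool-driven visual intelligence.

Key Takeaways

Orion is a visual agent framework that accepts any input modality and generates any output modality.
It is designed for visual AI tasks and orchestrates a suite of specialized computer vision tools.
It reports competitive performance on MMMU, MMBench, DocVQA, and MMLongBench.
Unlike traditional vision-language models, Orion executes complex multi-step visual workflows.
Orion combines neural perception with symbolic execution to enable autonomous visual reasoning.
It extends monolithic vision-language models to production-grade visual intelligence.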

Cool Papers

Beyond Accuracy: A Multi-Dimensional Framework for Evaluating Enterprise Agentic AI Systems

Authors: Sushant Mehta

Current agentic AI benchmarks predominantly evaluate task completion accuracy, while overlooking critical enterprise requirements such as cost-efficiency, reliability, and operational stability. Through systematic analysis of 12 main benchmarks and empirical evaluation of state-of-the-art agents, we identify three fundamental limitations: (1) absence of cost-controlled evaluation leading to 50x cost variations for similar precision, (2) inadequate reliability assessment where agent performance drops from 60% (single run) to 25% (8-run consistency), and (3) missing multidimensional metrics for security, latency, and policy compliance. We propose CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), a holistic evaluation framework specifically designed for enterprise deployment. Evaluation of six leading agents on 300 enterprise tasks demonstrates that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. Expert evaluation (N=15) confirms that CLEAR better predicts production success (correlation ρ = 0.83) compared to accuracy-only evaluation (ρ = 0.41).

Paper and project links

PDF

Summary
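Current agentic AI benchmarks focus on task completion accuracy and overlook enterprise requirements such as cost-efficiency, reliability, and operational stability. A systematic analysis of 12 main benchmarks and an empirical evaluation of state-of-the-art agents identify three limitations: the absence of cost-controlled evaluation, which leads to 50x cost variations for similar precision; inadequate reliability assessment, with agent performance dropping from 60% on a single run to 25% under 8-run consistency; and missing multidimensional metrics for security, latency, and policy compliance. The authors propose CLEAR (Cost, Latency, Efficacy, Assurance, Reliability), an evaluation framework designed for enterprise deployment. An evaluation of six leading agents on 300 enterprise tasks shows that optimizing for accuracy alone yields agents 4.4-10.8x more expensive than cost-aware alternatives with comparable performance. An expert evaluation (N=15) confirms that CLEAR predicts production success better (ρ = 0.83) than accuracy-only evaluation (ρ = 0.41).

Key Takeaways

Current agentic AI benchmarks focus on task completion accuracy and overlook enterprise requirements such as cost-efficiency, reliability, and operational stability.
The analysis identifies three limitations: no cost-controlled evaluation, inadequate reliability assessment, and missing multidimensional metrics.
The proposed CLEAR framework covers five dimensions: Cost, Latency, Efficacy, Assurance, and Reliability.
Agents optimized for accuracy alone are 4.4-10.8x more expensive than cost-aware alternatives with comparable performance.
Expert evaluation confirms that CLEAR predicts production success better than accuracy-only evaluation.
CLEAR is designed specifically for enterprise deployment.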
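
The gap between single-run accuracy and multi-run consistency can be shown with a small calculation. This sketch assumes "k-run consistency" means a task counts only if every one of its k runs succeeds; the abstract does not define it. The function names, the three-run toy results, and the cost figure are invented for illustration and are not the CLEAR implementation.

```python
def single_run_accuracy(runs_per_task):
    """Fraction of tasks solved on their first run."""
    return sum(runs[0] for runs in runs_per_task) / len(runs_per_task)


def k_run_consistency(runs_per_task):
    """Fraction of tasks solved on every one of their runs (assumed definition)."""
    return sum(all(runs) for runs in runs_per_task) / len(runs_per_task)


def cost_per_success(total_cost, runs_per_task):
    """Total spend divided by the number of first-run successes."""
    successes = sum(runs[0] for runs in runs_per_task)
    return total_cost / successes if successes else float("inf")


# Toy results: 4 tasks, 3 runs each (True means the run succeeded).
results = [
    [True, True, True],
    [True, False, True],
    [True, True, False],
    [False, False, False],
]
print(single_run_accuracy(results))
print(k_run_consistency(results))
print(cost_per_success(6.0, results))
```

Here three of four tasks pass on the first run, but only one passes on all three runs, which is the kind of drop the paper reports between single-run and 8-run evaluation.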

Cool Papers

APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design

Authors: Xinpeng Chen, Xiaofeng Han, Kaihao Zhang, Guochao Ren, Yujie Wang, Wenhao Cao, Yang Zhou, Jianfeng Lu, Zhenbo Song

Layout design is a crucial step in developing mobile app pages. However, crafting satisfactory designs is time-intensive for designers: they need to consider which controls and content to present on the page, and then repeatedly adjust their size, position, and style for better aesthetics and structure. Although many design tools can now help perform these repetitive tasks, extensive training is needed to use them effectively. Moreover, collaborative design across app pages demands extra time to align standards and ensure consistent styling. In this work, we propose APD-agents, a large language model (LLM) driven multi-agent framework for automated page design in mobile applications. Our framework contains OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent. Upon receiving the user’s description of the page, the OrchestratorAgent can dynamically direct other agents to accomplish the user’s design task. To be specific, the SemanticParserAgent is responsible for converting users’ descriptions of page content into structured data. The PrimaryLayoutAgent can generate an initial coarse-grained layout of this page. The TemplateRetrievalAgent can fetch semantically relevant few-shot examples and enhance the quality of layout generation. Besides, a RecursiveComponentAgent can be used to decide how to recursively generate all the fine-grained sub-elements it contains for each element in the layout. Our work fully leverages the automatic collaboration capabilities of large-model-driven multi-agent systems. Experimental results on the RICO dataset show that our APD-agents achieve state-of-the-art performance.

Paper and project links

PDF

Summary

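Layout design is a crucial but time-intensive step in developing mobile app pages: designers must decide which controls and content to present, then repeatedly adjust size, position, and style. Design software helps with these repetitive tasks but requires extensive training, and collaborative design across pages takes extra time to align standards and keep styling consistent. The paper proposes APD-agents, an LLM-driven multi-agent framework for automated page design in mobile applications. It comprises OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent, which together carry out the user's design task from a description of the page. Experiments on the RICO dataset show that APD-agents achieve state-of-the-art performance.

Key Takeaways

Layout design is a crucial step in mobile app page development, but it is time-intensive: designers must choose controls and content and repeatedly adjust them.
Design software can help with these tasks, but using it effectively requires extensive training.
Collaborative design across app pages demands extra time to align standards and ensure consistent styling.
APD-agents is an LLM-driven multi-agent framework for automated page design in mobile applications.
It comprises several agents, including OrchestratorAgent and SemanticParserAgent, that together complete the page design task.
APD-agents achieve state-of-the-art performance on the RICO dataset.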

Cool Papers

O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

Authors: Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, Wangchunshu Zhou

Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.67% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem, the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem, the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.

Paper and project links

PDF

Summary

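LLM-powered agents show significant potential in generating human-like responses but still struggle to sustain long-term interactions in complex environments, mainly because of limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping before retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. The report proposes the initial design of O-Mem, a memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from users' proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. It achieves 51.67% on the public LoCoMo benchmark, a nearly 3% improvement over LangMem, the previous state of the art, and 62.99% on PERSONAMEM, a 3.5% improvement over A-Mem, the previous state of the art. O-Mem also improves token efficiency and interaction response time compared with previous memory frameworks.

Key Takeaways

LLM-powered agents show strong potential for human-like responses but still struggle with long-term interactions in complex environments.
Existing memory systems depend on semantic grouping before retrieval, which can overlook critical user information and introduce noise.
O-Mem is a memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records.
O-Mem supports hierarchical retrieval of persona attributes and topic-related context for more adaptive and coherent personalized responses.
O-Mem outperforms the previous state of the art on the public LoCoMo benchmark and on PERSONAMEM.
O-Mem improves token efficiency and interaction response time.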

Cool Papers

Authors: Jiarui Ji, Runlin Lei, Xuchen Pan, Zhewei Wei, Hao Sun, Yankai Lin, Xu Chen, Yongzheng Yang, Yaliang Li, Bolin Ding, Ji-Rong Wen

The emergence of Large Language Models (LLMs) demonstrates their potential to encapsulate the logic and patterns inherent in human behavior simulation by leveraging extensive web data pre-training. However, the boundaries of LLM capabilities in social simulation remain unclear. To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. Building on this realistic simulation, we establish two LLM-based research paradigms in social science: LLM-SE (LLM-based Survey Experiment) and LLM-LE (LLM-based Laboratory Experiment). These paradigms facilitate rigorous analyses of citation network phenomena, allowing us to validate and challenge existing theories. Additionally, we extend the research scope of traditional science of science studies through idealized social experiments, with the simulation experiment results providing valuable insights for real-world academic environments. Our work demonstrates the potential of LLMs for advancing science of science research in social science.

Paper and project links

PDF accepted by HSSCOMMS’25

Summary

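Through pre-training on extensive web data, LLMs show potential to capture the logic and patterns inherent in human behavior. However, the boundaries of their capabilities in social simulation remain unclear. To explore the social attributes of LLMs, the authors introduce CiteAgent, a framework that generates citation networks from human-behavior simulation with LLM-based agents. CiteAgent captures predominant phenomena of real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter. Building on this simulation, the authors establish two LLM-based research paradigms for social science: LLM-SE (LLM-based Survey Experiment) and LLM-LE (LLM-based Laboratory Experiment). These paradigms support rigorous analyses of citation network phenomena and make it possible to validate and challenge existing theories. Idealized social experiments further extend the scope of traditional science of science studies, and the simulation results offer insights for real-world academic environments.

Key Takeaways

Pre-trained on extensive web data, LLMs can capture the logic and patterns of human behavior.
CiteAgent generates citation networks from human-behavior simulation and captures the predominant phenomena of real-world citation networks.
Two LLM-based research paradigms, LLM-SE and LLM-LE, support rigorous analysis of citation network phenomena.
LLM-based simulation experiments make it possible to validate and challenge existing theories.
Idealized social experiments extend the scope of traditional science of science studies.
The simulation results offer insights for real-world academic environments.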

Cool Papers

KnowCoder-A1: Incentivizing Agentic Reasoning Capability with Outcome Supervision for KBQA

Authors: Zhuo Chen, Fei Wang, Zixuan Li, Zhao Zhang, Weiwei Ding, Chuanguang Yang, Yongjun Xu, Xiaolong Jin, Jiafeng Guo

Knowledge Base Question Answering (KBQA) aims to answer natural-language questions over a structured Knowledge Base (KB). Recent work improves KBQA by adopting an agentic reasoning paradigm, in which Large Language Models (LLMs) iteratively decompose a question, generate its corresponding logical queries, and interact with the KB to derive the answer. However, these methods typically fine-tune LLMs on reasoning trajectories synthesized via process supervision, which offers weak incentives for exploration and thus fails to strengthen the agentic reasoning ability. In this paper, we propose KnowCoder-A1, an LLM that can autonomously perform agentic reasoning on KBs to obtain answers. To incentivize autonomous exploration, KnowCoder-A1 trains the LLM under outcome-only supervision via a multi-stage curriculum reinforcement learning with an easy-to-hard curriculum. To establish foundational agentic capabilities, KnowCoder-A1 first fine-tunes the LLM on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Then, to alleviate the reward sparsity inherent in outcome-only supervision, it applies multi-stage curriculum RL with reward schedules that progress from easy to hard. Trained with outcome-only supervision, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches across three mainstream datasets. Notably, on the zero-shot subset of GrailQA, KnowCoder-A1 achieves up to an 11.1% relative improvement while using only one-twelfth of the training data, demonstrating strong agentic reasoning capabilities.

Paper and project links

PDF

Summary

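Recent work improves Knowledge Base Question Answering (KBQA) with an agentic reasoning paradigm in which LLMs iteratively decompose a question, generate the corresponding logical queries, and interact with the knowledge base to derive the answer. The paper proposes KnowCoder-A1, an LLM that autonomously performs agentic reasoning over knowledge bases. To incentivize autonomous exploration, KnowCoder-A1 is trained under outcome-only supervision with multi-stage curriculum reinforcement learning, after an initial fine-tuning stage on a small set of high-quality trajectories obtained through outcome-based rejection sampling. Trained this way, KnowCoder-A1 exhibits powerful reasoning behaviors and consistently outperforms prior approaches on three mainstream datasets. On the zero-shot subset of GrailQA, it achieves up to an 11.1% relative improvement while using only one-twelfth of the training data.

Key Takeaways

Agentic reasoning improves KBQA: the LLM decomposes a question, generates logical queries, and interacts with the knowledge base to derive the answer.
KnowCoder-A1 autonomously performs agentic reasoning over knowledge bases to obtain answers.
KnowCoder-A1 is trained under outcome-only supervision with multi-stage curriculum reinforcement learning to incentivize autonomous exploration.
It is first fine-tuned on high-quality trajectories obtained through outcome-based rejection sampling to establish foundational agentic capabilities.
Multi-stage curriculum RL with easy-to-hard reward schedules alleviates the reward sparsity of outcome-only supervision.
KnowCoder-A1 exhibits powerful reasoning behaviors and outperforms prior approaches on three mainstream datasets.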
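
Outcome-based rejection sampling, as described above, keeps a sampled trajectory only if its final answer is good enough, regardless of the intermediate steps. The sketch below is an illustration under stated assumptions: it scores answers with set-level F1 against the gold answers and uses a `min_f1` threshold, neither of which is confirmed by the abstract, and the trajectory records are invented.

```python
def f1_score(predicted, gold):
    """Set-level F1 between predicted and gold answer entities."""
    predicted, gold = set(predicted), set(gold)
    overlap = len(predicted & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def rejection_sample(candidates, gold_answers, min_f1=1.0):
    """Keep only trajectories whose final answer reaches the F1 threshold."""
    return [c for c in candidates if f1_score(c["answer"], gold_answers) >= min_f1]


# Hypothetical sampled trajectories for one question; only the outcome is checked.
candidates = [
    {"id": "t1", "answer": ["Paris"]},
    {"id": "t2", "answer": ["Paris", "Lyon"]},
    {"id": "t3", "answer": ["Berlin"]},
]
kept = rejection_sample(candidates, gold_answers=["Paris"])
print([c["id"] for c in kept])
```

Lowering `min_f1` admits partially correct trajectories, which is one possible knob for an easy-to-hard schedule, though the paper's actual reward schedule may differ.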

Cool Papers

GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification

Authors: Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci

Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.

Paper and project links

PDF Accepted in MICCAI Workshop 2025

Summary

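The paper combines Multiple Instance Learning (MIL) with vision-language models (VLMs) for whole slide image (WSI) classification. Existing methods rely on LLM-generated clinical descriptions or fixed-length prompts to represent complex pathology concepts, and the limited token capacity of VLMs constrains how much class information can be encoded. The proposed vision-language MIL framework makes two contributions: a grounded multi-agent description generation system that draws on curated pathology textbooks and agent specialization to produce accurate and diverse clinical descriptions, and a text encoding strategy that uses a list of descriptions rather than a single prompt to capture fine-grained, complementary clinical signals that align better with visual features. On renal and lung cancer datasets, the approach improves over single-prompt class baselines and achieves results comparable to state-of-the-art models.

Key Takeaways

MIL is the leading approach for WSI classification.
VLMs have been introduced into MIL pipelines to incorporate medical knowledge.
When descriptions come from LLMs or fixed-length prompts, the limited token capacity of VLMs constrains the expressiveness of the encoded class information.
The proposed framework includes a grounded multi-agent description generation system that uses curated pathology textbooks and agent specialization to produce accurate and diverse clinical descriptions.
Text is encoded from a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals.
The approach aligns better with visual features and outperforms single-prompt class baselines.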
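
Encoding a class from a list of descriptions rather than a single prompt can be sketched as follows. The abstract does not say how the description embeddings are combined; this sketch assumes simple mean pooling followed by cosine similarity against a slide-level feature. The two-dimensional vectors stand in for real text and image encoder outputs and are invented for illustration.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(slide_feature, class_descriptions):
    """Score each class by similarity to the pooled description embedding."""
    scores = {
        name: cosine(slide_feature, mean_vector(embeddings))
        for name, embeddings in class_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings: two descriptions per class (assumed pooling, not the paper's).
class_descriptions = {
    "tumor": [[1.0, 0.0], [0.8, 0.2]],
    "normal": [[0.0, 1.0], [0.2, 0.8]],
}
label, scores = classify([0.9, 0.1], class_descriptions)
print(label)
```

Each description contributes its own embedding, so no single prompt has to fit every morphological and spatial detail within the text encoder's token limit.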

Cool Papers

Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm

Authors: Baixiang Huang, Zhen Tan, Haoran Wang, Zijie Liu, Dawei Li, Ali Payani, Huan Liu, Tianlong Chen, Kai Shu

Agents based on Large Language Models (LLMs) have demonstrated strong capabilities across a wide range of tasks. However, deploying LLM-based agents in high-stakes domains comes with significant safety and ethical risks. Unethical behavior by these agents can directly result in serious real-world consequences, including physical harm and financial loss. To efficiently steer the ethical behavior of agents, we frame agent behavior steering as a model editing task, which we term Behavior Editing. Model editing is an emerging area of research that enables precise and efficient modifications to LLMs while preserving their overall capabilities. To systematically study and evaluate this approach, we introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. This benchmark supports both the evaluation and editing of agent behaviors across a variety of scenarios, with each tier introducing more complex and ambiguous scenarios. We first demonstrate that Behavior Editing can dynamically steer agents toward the target behavior within specific scenarios. Moreover, Behavior Editing enables not only scenario-specific local adjustments but also more extensive shifts in an agent’s global moral alignment. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior. Through extensive evaluations of agents built on frontier LLMs, BehaviorBench validates the effectiveness of behavior editing across a wide range of models and scenarios. Our findings offer key insights into a new paradigm for steering agent behavior, highlighting both the promise and perils of Behavior Editing.

Paper and project links

PDF AAAI 2026 Oral. 14 pages (including appendix), 11 figures. Code, data, results, and additional resources are available at: https://model-editing.github.io

Summary

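Agents based on large language models (LLMs) show strong capabilities across many tasks, but deploying them in high-stakes domains carries significant safety and ethical risks. Unethical behavior can lead to serious consequences, including physical harm and financial loss. To steer agents' ethical behavior efficiently, the authors frame behavior steering as a model editing task, which they call Behavior Editing. To study and evaluate this approach systematically, they introduce BehaviorBench, a multi-tier benchmark grounded in psychological moral theories. The benchmark supports both evaluating and editing agent behavior across varied scenarios, with each tier introducing more complex and ambiguous ones. The authors show that Behavior Editing can dynamically steer agents toward a target behavior in specific scenarios, and that it enables not only scenario-specific local adjustments but also broader shifts in an agent's global moral alignment. Extensive evaluations of agents built on frontier LLMs validate the effectiveness of Behavior Editing across a wide range of models and scenarios. The findings offer key insights into a new paradigm for steering agent behavior and highlight both the promise and the perils of Behavior Editing.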

Key Takeaways

LLM-based agents demonstrate strong capabilities across various tasks but come with safety and ethical risks in high-stakes domains.
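Unethical behavior can lead to serious real-world consequences.
Behavior Editing can serve as an effective way to steer agents' ethical behavior.
BehaviorBench, a multi-tier benchmark, supports systematic evaluation and editing of agent behavior.
Behavior Editing can dynamically steer agents toward a target behavior in specific scenarios.
Behavior Editing enables not only local adjustments but also broad shifts in an agent's global moral alignment.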

GraphCodeAgent: Dual Graph-Guided LLM Agent for Retrieval-Augmented Repo-Level Code Generation

Authors:Jia Li, Xianjie Shi, Kechi Zhang, Ge Li, Zhi Jin, Lei Li, Huangzhao Zhang, Jia Li, Fang Liu, Yuwei Zhang, Zhengwei Tao, Yihong Dong, Yuqi Zhu, Chongyang Tao

Writing code requires significant time and effort in software development. To automate this process, researchers have made substantial progress for code generation. Recently, large language models (LLMs) have demonstrated remarkable proficiency in function-level code generation, yet their performance significantly degrades in the real-world software development process, where coding tasks are deeply embedded within specific repository contexts. Existing studies attempt to use retrieval-augmented code generation (RACG) approaches to mitigate this demand. However, there is a gap between natural language (NL) requirements and programming implementations. This results in the failure to retrieve the relevant code of these fine-grained subtasks. To address this challenge, we propose GraphCodeAgent, a dual graph-guided LLM agent for retrieval-augmented repo-level code generation, bridging the gap between NL requirements and programming implementations. Our approach constructs two interconnected graphs: a Requirement Graph (RG) to model requirement relations of code snippets within the repository, as well as the relations between the target requirement and the requirements of these code snippets, and a Structural-Semantic Code Graph (SSCG) to capture the repository’s intricate code dependencies. Guided by this, an LLM-powered agent performs multi-hop reasoning to systematically retrieve all context code snippets, including implicit and explicit code snippets, even if they are not explicitly expressed in requirements. We evaluated GraphCodeAgent on three advanced LLMs with the two widely-used repo-level code generation benchmarks DevEval and CoderEval. Extensive experiment results show that GraphCodeAgent significantly outperforms state-of-the-art baselines.

Paper and project links

PDF

Summary

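This paper addresses automated code generation in software development. Large language models (LLMs) are highly proficient at function-level code generation, but their performance drops significantly in real-world software development. To address this, the authors propose GraphCodeAgent, a dual graph-guided LLM agent for retrieval-augmented repo-level code generation. GraphCodeAgent builds two interconnected graphs: a Requirement Graph (RG), which models the requirement relations among code snippets in the repository and between the target requirement and those snippets, and a Structural-Semantic Code Graph (SSCG), which captures the repository's intricate code dependencies. On the two widely used repo-level code generation benchmarks DevEval and CoderEval, GraphCodeAgent running on three advanced LLMs significantly outperforms state-of-the-art baselines.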

Key Takeaways

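Large language models (LLMs) are highly proficient at function-level code generation, but their performance degrades in real-world software development.
GraphCodeAgent is a dual graph-guided LLM agent designed to bridge the gap between natural language (NL) requirements and programming implementations.
GraphCodeAgent builds two graphs: a Requirement Graph (RG) that models requirement relations among code snippets in the repository, and a Structural-Semantic Code Graph (SSCG) that captures the repository's intricate code dependencies.
GraphCodeAgent systematically retrieves all context code snippets, both implicit and explicit, even when they are not explicitly expressed in the requirements.
GraphCodeAgent significantly outperforms existing methods on the DevEval and CoderEval benchmarks.
The study highlights the challenges of repo-level code generation with LLMs and proposes an effective solution.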
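
The retrieval idea behind the two graphs can be illustrated with a toy traversal. The sketch below is not the paper's implementation: the dictionary-based graphs, the function name `multi_hop_retrieve`, and the `max_hops` parameter are all assumptions made for illustration, and a plain breadth-first search stands in for the LLM-driven multi-hop reasoning described in the abstract.

```python
from collections import deque


def multi_hop_retrieve(target, requirement_graph, code_graph, max_hops=2):
    """Collect code snippets reachable from a target requirement.

    requirement_graph: requirement text -> snippet ids with related requirements
    code_graph: snippet id -> snippet ids it depends on
    Both are toy stand-ins (assumptions), not the paper's RG and SSCG.
    """
    # Seeds are the snippets whose requirements relate to the target.
    seeds = list(dict.fromkeys(requirement_graph.get(target, [])))
    found = list(seeds)
    seen = set(seeds)
    queue = deque((snippet, 0) for snippet in seeds)

    # Follow code dependencies outward, at most max_hops edges from a seed.
    while queue:
        snippet, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for dep in code_graph.get(snippet, []):
            if dep not in seen:
                seen.add(dep)
                found.append(dep)
                queue.append((dep, hops + 1))
    return found
```

In this sketch the seeds play the role of explicitly related snippets, while the dependencies reached through the code graph play the role of implicit context that the requirement never mentions.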

Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs

Authors:Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu

Knowledge graphs are widely used in industrial applications, making error detection crucial for ensuring the reliability of downstream applications. Existing error detection methods often fail to effectively utilize fine-grained subgraph information and rely solely on fixed graph structures, while also lacking transparency in their decision-making processes, which results in suboptimal detection performance. In this paper, we propose a novel Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that utilizes multiple large language models (LLMs) in a collaborative setting. By concatenating fine-grained, bidirectional subgraph embeddings with LLM-based query embeddings during training, our framework integrates these representations to produce four specialized agents. These agents utilize subgraph information from different dimensions to engage in multi-round discussions, thereby improving error detection accuracy and ensuring a transparent decision-making process. Extensive experiments on FB15K and WN18RR demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the accuracy and robustness of KG evaluation. For specific industrial scenarios, our framework can facilitate the training of specialized agents using domain-specific knowledge graphs for error detection, which highlights the potential industrial application value of our framework. Our code and datasets are available at https://github.com/kse-ElEvEn/MAKGED.

Paper and project links

PDF This paper has been ACCEPTED as a FULL PAPER at DASFAA 2025 (Oral)

Summary
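Error detection in knowledge graphs is crucial for industrial applications. Existing methods fail to use fine-grained subgraph information effectively, rely on fixed graph structures, and lack transparency in their decision-making, which leads to suboptimal detection performance. This paper proposes MAKGED, a multi-agent framework for knowledge graph error detection that uses multiple large language models (LLMs) in a collaborative setting. By concatenating fine-grained bidirectional subgraph embeddings with LLM-based query embeddings during training, the framework integrates these representations to produce four specialized agents. The agents use subgraph information from different dimensions in multi-round discussions, improving error detection accuracy and keeping the decision-making process transparent. Experiments on FB15K and WN18RR show that MAKGED outperforms state-of-the-art methods and improves the accuracy and robustness of KG evaluation. The framework can also train specialized agents on domain-specific knowledge graphs for industrial scenarios, which underlines its industrial application value.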

Key Takeaways

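Error detection in knowledge graphs matters in industrial applications because it affects the reliability of downstream applications.
Existing error detection methods fail to make effective use of fine-grained subgraph information.
MAKGED uses multiple agents and large language models for collaborative knowledge graph error detection.
The framework combines subgraph embeddings with query embeddings to produce four specialized agents, improving error detection accuracy.
The agents use subgraph information from different dimensions in multi-round discussions, keeping the decision-making process transparent.
MAKGED outperforms other state-of-the-art methods on the FB15K and WN18RR datasets.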
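
The two mechanisms named above, embedding concatenation and multi-round discussion, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MAKGED's code: the function names, the `"correct"`/`"error"` labels, the early stop on unanimity, and the majority-vote fallback are all hypothetical choices, and the agents are plain callables rather than trained LLMs.

```python
from collections import Counter


def concat_embeddings(subgraph_vec, query_vec):
    """Join a subgraph embedding and a query embedding into one input vector."""
    return list(subgraph_vec) + list(query_vec)


def discuss(agents, triple, max_rounds=3):
    """Run the agents for several rounds and stop early when they all agree.

    Each agent is a callable (triple, previous_votes) -> "correct" or "error".
    Returns (final_label, rounds_used, votes). Hypothetical protocol.
    """
    if not agents or max_rounds < 1:
        raise ValueError("need at least one agent and one round")

    votes = []
    for round_no in range(1, max_rounds + 1):
        # Every agent sees the same votes from the previous round.
        votes = [agent(triple, votes) for agent in agents]
        if len(set(votes)) == 1:
            break

    # Majority vote; ties resolve to the label that was seen first.
    label = Counter(votes).most_common(1)[0][0]
    return label, round_no, votes
```

Returning the per-agent votes alongside the final label is one simple way to keep the decision traceable, in the spirit of the transparency goal the abstract describes.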

UniDebugger: Hierarchical Multi-Agent Framework for Unified Software Debugging

Authors:Cheryl Lee, Chunqiu Steven Xia, Longji Yang, Jen-tse Huang, Zhouruixin Zhu, Lingming Zhang, Michael R. Lyu

Software debugging is a time-consuming endeavor involving a series of steps, such as fault localization and patch generation, each requiring thorough analysis and a deep understanding of the underlying logic. While large language models (LLMs) demonstrate promising potential in coding tasks, their performance in debugging remains limited. Current LLM-based methods often focus on isolated steps and struggle with complex bugs. In this paper, we propose the first end-to-end framework, FixAgent, for unified debugging through multi-agent synergy. It mimics the entire cognitive processes of developers, with each agent specialized as a particular component of this process rather than mirroring the actions of an independent expert as in previous multi-agent systems. Agents are coordinated through a three-level design, following a cognitive model of debugging, allowing adaptive handling of bugs with varying complexities. Experiments on extensive benchmarks demonstrate that FixAgent significantly outperforms state-of-the-art repair methods, fixing 1.25$\times$ to 2.56$\times$ bugs on the repo-level benchmark, Defects4J. This performance is achieved without requiring ground-truth root-cause code statements, unlike the baselines. Our source code is available on https://github.com/AcceptePapier/UniDebugger.

Paper and project links

PDF Accepted by EMNLP’25, Main Poster

Summary

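Software debugging is time-consuming work that involves several steps, such as fault localization and patch generation, each requiring thorough analysis and a deep understanding of the underlying logic. Although large language models show promise in coding tasks, their debugging performance remains limited. Current LLM-based debugging methods tend to focus on isolated steps and struggle with complex bugs. This paper proposes FixAgent, the first end-to-end framework for unified debugging through multi-agent synergy. It mimics the entire cognitive process of developers, with each agent acting as a specific component of that process rather than mirroring the actions of an independent expert. The agents are coordinated through a three-level design that follows a cognitive model of debugging, which allows adaptive handling of bugs of varying complexity. Experiments on extensive benchmarks show that FixAgent significantly outperforms state-of-the-art repair methods, fixing 1.25 to 2.56 times as many bugs on the repo-level benchmark Defects4J. Notably, unlike the baselines, it does not require ground-truth root-cause code statements.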

Key Takeaways

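Software debugging is a complex process with multiple steps, including fault localization and patch generation.
Large language models remain limited at software debugging, especially on complex bugs.
FixAgent is the first end-to-end debugging framework; its agents work together to mimic the entire cognitive process of human debugging.
Each FixAgent agent specializes in a specific component of the debugging process.
FixAgent coordinates its agents through a three-level design to handle bugs of varying complexity.
FixAgent outperforms other state-of-the-art repair methods on benchmarks and fixes more bugs.
FixAgent does not require ground-truth root-cause code statements, which is an important advantage.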
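
One plausible reading of "adaptive handling through a three-level design" is escalation: try the cheapest level first and move up only when no candidate patch passes validation. The sketch below illustrates that reading only. The summary does not say how the levels are triggered, so the escalation rule, the function name `debug_with_escalation`, and the `levels` and `passes_tests` callables are all assumptions.

```python
def debug_with_escalation(bug, levels, passes_tests):
    """Try each level in order and return the first patch that passes tests.

    levels: list of callables bug -> candidate patch or None (hypothetical)
    passes_tests: callable patch -> bool (hypothetical validation step)
    Returns (level_number, patch), or (None, None) if every level fails.
    """
    for number, level in enumerate(levels, start=1):
        patch = level(bug)
        # Escalate only when this level produced nothing usable.
        if patch is not None and passes_tests(patch):
            return number, patch
    return None, None
```

Under this reading, simple bugs never pay for the more expensive upper levels, while complex bugs fall through to them automatically.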

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-11-20/Agent/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
