⚠️ All of the summaries below were generated by a large language model. They may contain errors and are for reference only; use them with caution.
🔴 Please note: do not use these summaries in serious academic settings; they are meant only for an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-27
Latent Collaboration in Multi-Agent Systems
Authors:Jiaru Zou, Xiyuan Yang, Ruizhong Qiu, Gaotang Li, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings. A shared latent working memory then preserves and transfers each agent’s internal representations, ensuring lossless information exchange. We provide theoretical analyses establishing that LatentMAS attains higher expressiveness and lossless information preservation with substantially lower complexity than vanilla text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS consistently outperforms strong single-model and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4x-4.3x faster end-to-end inference. These results demonstrate that our new latent collaboration framework enhances system-level reasoning quality while offering substantial efficiency gains without any additional training. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.
Paper and project links
PDF Project: https://github.com/Gen-Verse/LatentMAS
Summary
Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinated system-level intelligence. Unlike existing LLM agents, which rely on text-based mediation for reasoning and communication, this work lets models collaborate directly in a continuous latent space. The authors introduce LatentMAS, an end-to-end training-free framework for pure latent collaboration among LLM agents: each agent first generates latent thoughts auto-regressively through its last-layer hidden embeddings, and a shared latent working memory preserves each agent's internal representations, ensuring lossless information exchange. Theoretical analysis and empirical evaluation show that LatentMAS surpasses vanilla text-based MAS in expressiveness, lossless information preservation, and complexity, and that across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation it consistently outperforms strong single-model and text-based MAS baselines, with up to 14.6% higher accuracy, 70.8%-83.7% fewer output tokens, and 4x-4.3x faster end-to-end inference.
Key Takeaways
- Multi-agent systems (MAS) broaden the scope of large language models (LLMs) from single-model reasoning to system-level collaborative intelligence.
- Existing LLM agents rely mainly on text-based mediation for reasoning and communication, whereas the LatentMAS framework enables direct latent collaboration between models.
- LatentMAS has two core components: auto-regressive latent thought generation and a shared latent working memory.
- Auto-regressive latent thought generation lets each agent produce its internal representations, while the shared latent working memory ensures lossless information exchange.
- LatentMAS offers higher expressiveness and lossless information preservation, with advantages demonstrated in both theoretical analysis and empirical evaluation.
- Across multiple benchmarks, LatentMAS outperforms single-model and text-based MAS baselines.
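As an illustration of the core idea of exchanging hidden states instead of text, here is a toy sketch in plain Python. The linear "agents", the vector dimensionality, and the memory layout are all invented for the example; the actual LatentMAS operates on an LLM's last-layer hidden embeddings and a shared latent working memory.

```python
# Toy sketch of latent collaboration (illustrative only, not the LatentMAS
# implementation). Each "agent" is a fixed linear map; instead of decoding
# its state to text, it appends raw hidden vectors to a shared latent
# working memory that the next agent consumes directly.

def step(weights, vec):
    """One 'layer': multiply a hidden vector by a weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def latent_thoughts(weights, seed, n_thoughts):
    """Auto-regressively generate latent thoughts: each output hidden
    state is fed back as the next input embedding."""
    thoughts, h = [], seed
    for _ in range(n_thoughts):
        h = step(weights, h)
        thoughts.append(h)
    return thoughts

def collaborate(agents, seed, n_thoughts=3):
    """Agents write latent thoughts into a shared working memory; each
    agent starts from the last vector left by its predecessor."""
    memory = [seed]
    for weights in agents:
        memory += latent_thoughts(weights, memory[-1], n_thoughts)
    return memory

agent_a = [[0.5, 0.0], [0.0, 0.5]]   # hypothetical 2-d "models"
agent_b = [[0.0, 1.0], [1.0, 0.0]]
memory = collaborate([agent_a, agent_b], seed=[1.0, 2.0])
print(len(memory), memory[-1])
```

No tokens are ever produced inside the loop, which is where the paper's token and latency savings come from.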
DRAFT-RL: Multi-Agent Chain-of-Draft Reasoning for Reinforcement Learning-Enhanced LLMs
Authors:Yuanhao Li, Mingshan Liu, Hongbo Wang, Yiding Zhang, Yifei Ma, Wei Tan
Large Language Models (LLMs) have shown impressive capabilities in multi-step reasoning and problem-solving. Recent works introduce multi-agent reflection frameworks where multiple LLM agents critique and refine each other’s outputs using reinforcement learning (RL). However, these approaches often rely on single-shot responses and lack structural diversity in reasoning exploration. In this paper, we propose DRAFT-RL, a novel framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Instead of generating single responses, each agent produces multiple drafts per query, which are then evaluated by peer agents and a learned reward model to identify the most promising trajectory. These selected drafts are used to refine future reasoning strategies through actor-critic learning. DRAFT-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, resulting in more robust and interpretable LLM agent behavior. We evaluate our method on complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA, demonstrating that DRAFT-RL outperforms existing reflective and RL-based agents by significant margins in both accuracy and convergence speed.
Paper and project links
Summary
Large language models (LLMs) have shown impressive capabilities in multi-step reasoning and problem solving. Recent work introduces multi-agent reflection frameworks in which several LLM agents critique and refine each other's outputs via reinforcement learning (RL), but these approaches typically rely on single-shot responses and lack structural diversity in reasoning exploration. This paper proposes DRAFT-RL, a framework that integrates Chain-of-Draft (CoD) reasoning into multi-agent RL training. Rather than producing a single response, each agent generates multiple drafts per query; peer agents and a learned reward model evaluate the drafts to identify the most promising trajectory, and the selected drafts refine future reasoning strategies through actor-critic learning. DRAFT-RL enables explicit multi-path exploration, peer-guided reflection, and reward-aligned selection, yielding more robust and interpretable agent behavior. On complex reasoning tasks including code synthesis, symbolic math, and knowledge-intensive QA, it outperforms existing reflective and RL-based agents in both accuracy and convergence speed.
Key Takeaways
- LLMs already possess strong multi-step reasoning and problem-solving capabilities.
- Existing multi-agent reflection frameworks rely on single-shot responses and lack structural diversity in reasoning exploration.
- The DRAFT-RL framework integrates Chain-of-Draft (CoD) reasoning with reinforcement learning, promoting multi-path exploration and peer reflection.
- Each agent generates multiple draft responses; peer evaluation and a reward model select the best trajectory.
- The selected drafts are used to refine future reasoning strategies via actor-critic learning.
- DRAFT-RL improves the robustness and interpretability of LLM agents.
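A minimal sketch of the draft-then-select step described above. This is illustrative only: the draft generator, the reward model, and the peer scores are stubs invented for the example, not the DRAFT-RL training loop.

```python
# Illustrative sketch of draft selection (not the DRAFT-RL implementation).
# Each agent proposes several drafts; a (stubbed) reward model and peer
# scores jointly pick the most promising trajectory.

def generate_drafts(agent, query, k=3):
    """Hypothetical stand-in for an LLM producing k drafts per query."""
    return [f"{agent}:{query}:draft{i}" for i in range(k)]

def select_best(drafts, reward_model, peer_scores):
    """Combine learned reward and peer critique into one score."""
    scored = [(reward_model(d) + peer_scores.get(d, 0.0), d) for d in drafts]
    return max(scored)[1]

# Toy reward: prefer higher draft indices (stands in for a learned model).
reward = lambda d: int(d[-1])
drafts = generate_drafts("agentA", "q1")
best = select_best(drafts, reward, {"agentA:q1:draft0": 5.0})
print(best)
```

In the paper, the selected trajectory would then feed an actor-critic update; here the selection step alone is shown.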
VICoT-Agent: A Vision-Interleaved Chain-of-Thought Framework for Interpretable Multimodal Reasoning and Scalable Remote Sensing Analysis
Authors:Chujie Wang, Zhiyuan Luo, Ruiqi Liu, Can Ran, Shenghua Fan, Xi Chen, Chu He
The current remote sensing image analysis task is increasingly evolving from traditional object recognition to complex intelligence reasoning, which places higher requirements on the model’s reasoning ability and the flexibility of tool invocation. To this end, we propose a new multimodal agent framework, Vision-Interleaved Chain-of-Thought Framework (VICoT), which implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT enables LLMs to efficiently perform multi-round, interleaved vision-language reasoning tasks with strong generalization and flexibility. We also propose the Reasoning Stack distillation method to migrate complex Agent behaviors to small, lightweight models, which ensures the reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks demonstrate that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.
Paper and project links
Summary:
The new multimodal agent framework Vision-Interleaved Chain-of-Thought Framework (VICoT) implements explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought, supporting the shift in remote sensing image analysis from traditional object recognition to complex intelligent reasoning. Through a stack-based reasoning structure and a modular MCP-compatible tool suite, VICoT lets LLMs efficiently perform multi-round, interleaved vision-language reasoning with strong generalization and flexibility. A Reasoning Stack distillation method further migrates complex agent behaviors onto small, lightweight models, preserving reasoning capability while significantly reducing complexity. Experiments on multiple remote sensing benchmarks show that VICoT significantly outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.
Key Takeaways:
- The VICoT framework supports the shift of remote sensing image analysis from object recognition to intelligent reasoning.
- VICoT achieves explicit multi-round reasoning by dynamically incorporating visual tools into the chain of thought.
- A stack-based reasoning structure and a modular tool suite improve the generalization ability and flexibility of LLMs.
- The Reasoning Stack distillation method migrates complex agent behaviors onto small, lightweight models.
- VICoT outperforms existing SOTA frameworks in reasoning transparency, execution efficiency, and generation quality.
- The framework helps meet the higher demands that current remote sensing analysis places on model reasoning and flexible tool invocation.
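The stack-based, tool-interleaved reasoning loop can be sketched roughly as follows. The tool registry, tool names, and step strings are hypothetical, not VICoT's actual MCP tool suite.

```python
# Minimal sketch of stack-based, tool-interleaved reasoning (illustrative;
# the tool registry and step names are invented for this example).

def run_stack(stack, tools):
    """Pop reasoning steps; invoke a visual tool when a step names one."""
    trace = []
    while stack:
        step = stack.pop()
        if step in tools:
            trace.append((step, tools[step]()))   # tool call and its result
        else:
            trace.append(("think", step))         # plain reasoning step
    return trace

tools = {"detect_objects": lambda: ["ship", "dock"]}
plan = ["summarize", "detect_objects", "inspect region"]  # popped LIFO
trace = run_stack(plan, tools)
print(trace)
```

The stack makes the interleaving explicit: tool outputs land in the trace exactly where the reasoning requested them.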
Adaptive LLM Agents: Toward Personalized Empathetic Care
Authors:Priyanka Singh, Sebastian Von Mammen
Current mental-health conversational systems are usually based on fixed, generic dialogue patterns. This paper proposes an adaptive framework based on large language models that aims to personalize therapeutic interaction according to a user’s psychological state, quantified with the Acceptance of Illness Scale (AIS). The framework defines three specialized agents, L, M, and H, each linked to a different level of illness acceptance, and adjusts conversational behavior over time using continuous feedback signals. The AIS-stratified architecture is treated as a diegetic prototype placed in a plausible near-future setting and examined through the method of design fiction. By embedding the architecture in narrative scenarios, the study explores how such agents might influence access to care and therapeutic relationship. The goal is to show how clinically informed personalization, technical feasibility, and speculative scenario analysis can together inform the responsible design of LLM-based companions for mental-health support.
Paper and project links
PDF Accepted at workshop Future Wellbeing: Using Design Fiction to Explore Human-Agent Interaction and Mental Health at The 13th International Conference on Human-Agent Interaction (HAI 2025)
Summary
An adaptive framework based on large language models is proposed to personalize therapeutic interaction according to the user's psychological state, quantified with the Acceptance of Illness Scale (AIS). The framework defines three specialized agents, L, M, and H, each associated with a different level of illness acceptance, and adjusts conversational behavior over time using continuous feedback signals. Using the method of design fiction, the study embeds this architecture in narrative scenarios to explore how such agents might influence access to care and the therapeutic relationship. The goal is to show how clinically informed personalization, technical feasibility, and speculative scenario analysis can jointly guide the responsible design of LLM-based companions for mental-health support.
Key Takeaways
- Current mental-health conversational systems are usually based on fixed, generic dialogue patterns.
- An adaptive framework based on large language models is proposed to personalize therapeutic interaction according to the user's psychological state.
- The Acceptance of Illness Scale (AIS) is introduced to quantify the user's psychological state.
- The framework includes three specialized agents, each linked to a different level of illness acceptance.
- Conversational behavior is adjusted through continuous feedback.
- The method of design fiction is used to explore, through narrative scenarios, how the architecture might influence access to care and the therapeutic relationship.
“Are We Done Yet?”: A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents
Authors:Marta Sumyk, Oleksandr Kosovan
Computer Use Agents (CUAs) are designed to autonomously operate digital interfaces, yet they often fail to reliably determine whether a given task has been completed. We present an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. Our dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. Our framework achieves up to 73 percent accuracy in task success detection and yields an average relative improvement of 27 percent in overall task success when evaluator feedback is applied. These results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
Paper and project links
PDF This work has been accepted to appear at the AAAI 2026 Workshop on Trust and Control in Agentic AI (TrustAgent)
Summary:
Computer Use Agents (CUAs) are designed to operate digital interfaces autonomously, yet they often cannot reliably determine whether a given task has been completed. This work proposes an autonomous evaluation and feedback framework that uses vision-language models to assess task completion directly from screenshots and task descriptions. The dataset covers 42 built-in macOS applications and 1,260 human-labeled tasks across a wide range of scenarios. The framework reaches up to 73% accuracy in task-success detection, and applying evaluator feedback improves overall task success by an average of 27% (relative). The results show that vision-based evaluation can serve as an effective feedback mechanism that improves the reliability and self-correction of autonomous computer-use agents.
Key Takeaways:
- Computer Use Agents (CUAs) struggle to determine whether a task has been completed.
- An autonomous evaluation and feedback framework uses vision-language models to assess task completion directly from screenshots and task descriptions.
- The dataset covers a variety of macOS applications and tasks across many scenarios.
- The framework reaches up to 73% accuracy in task-success detection.
- Applying evaluator feedback yields an average relative improvement of 27% in overall task success.
- Vision-based evaluation can serve as an effective feedback mechanism.
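The evaluator-feedback cycle described above can be sketched as a simple retry loop. Both `attempt` and `judge` are stubs invented for the example: one stands in for a single agent rollout, the other for the vision-language judge inspecting a screenshot.

```python
# Sketch of a judge-in-the-loop retry cycle (illustrative stubs only,
# not the paper's framework or its model calls).

def run_with_judge(attempt, judge, task, max_tries=3):
    feedback = None
    for i in range(1, max_tries + 1):
        screenshot = attempt(task, feedback)   # one agent rollout
        done, feedback = judge(screenshot, task)
        if done:
            return i          # number of attempts used
    return None               # gave up

# Toy stubs: the agent succeeds only once it has received feedback.
attempt = lambda task, fb: "ok" if fb else "wrong-window"
judge = lambda shot, task: (shot == "ok", "focus the right window")
tries = run_with_judge(attempt, judge, "rename file")
print(tries)
```

The judge's verdict both terminates the loop and supplies the corrective hint for the next attempt, which is the mechanism behind the reported 27% relative gain.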
M$^3$Prune: Hierarchical Communication Graph Pruning for Efficient Multi-Modal Multi-Agent Retrieval-Augmented Generation
Authors:Weizi Shao, Taolin Zhang, Zijie Zhou, Chen Chen, Chengyu Wang, Xiaofeng He
Recent advancements in multi-modal retrieval-augmented generation (mRAG), which enhance multi-modal large language models (MLLMs) with external knowledge, have demonstrated that the collective intelligence of multiple agents can significantly outperform a single model through effective communication. Despite impressive performance, existing multi-agent systems inherently incur substantial token overhead and increased computational costs, posing challenges for large-scale deployment. To address these issues, we propose a novel Multi-Modal Multi-agent hierarchical communication graph PRUNING framework, termed M$^3$Prune. Our framework eliminates redundant edges across different modalities, achieving an optimal balance between task performance and token overhead. Specifically, M$^3$Prune first applies intra-modal graph sparsification to textual and visual modalities, identifying the edges most critical for solving the task. Subsequently, we construct a dynamic communication topology using these key edges for inter-modal graph sparsification. Finally, we progressively prune redundant edges to obtain a more efficient and hierarchical topology. Extensive experiments on both general and domain-specific mRAG benchmarks demonstrate that our method consistently outperforms both single-agent and robust multi-agent mRAG systems while significantly reducing token consumption.
Paper and project links
Summary
Recent advances in multi-modal retrieval-augmented generation (mRAG) augment multi-modal large language models (MLLMs) with external knowledge and show that the collective intelligence of multiple agents, through effective communication, can significantly outperform a single model. To address the substantial token overhead and computational cost inherent in existing multi-agent systems, the authors propose M^3Prune, a multi-modal multi-agent hierarchical communication graph pruning framework. The framework removes redundant edges across modalities, striking an optimal balance between task performance and token overhead. M^3Prune first applies intra-modal graph sparsification to the textual and visual modalities to identify the edges most critical to the task, then uses those key edges to build a dynamic communication topology for inter-modal sparsification, and finally prunes redundant edges progressively to obtain a more efficient, hierarchical topology. On both general and domain-specific mRAG benchmarks, the method outperforms single-agent and strong multi-agent mRAG systems while greatly reducing token consumption.
Key Takeaways
- Multi-modal retrieval-augmented generation (mRAG) improves performance by combining external knowledge with multi-modal large language models (MLLMs).
- Multi-agent systems can outperform a single model through effective communication, but they incur substantial token overhead and computational cost.
- The M^3Prune framework targets redundancy in multi-modal multi-agent systems, balancing task performance against computational cost.
- M^3Prune optimizes inter-agent communication through intra-modal and inter-modal graph sparsification.
- M^3Prune progressively prunes redundant edges, yielding a more efficient, hierarchical communication topology.
- Across multiple benchmarks, M^3Prune outperforms single-agent and strong multi-agent mRAG systems.
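To give a generic flavor of score-based communication-edge pruning, here is a simple top-k heuristic. This is not M^3Prune's actual intra-/inter-modal sparsification criterion; the edge names and scores are made up.

```python
# Illustrative score-based edge pruning over a tiny communication graph
# (a plain top-k heuristic, invented for this example).

def prune_edges(edges, scores, keep_ratio=0.5):
    """Keep the highest-scoring fraction of communication edges."""
    ranked = sorted(edges, key=lambda e: scores[e], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return set(ranked[:k])

edges = [("text1", "vis1"), ("text1", "text2"),
         ("vis1", "vis2"), ("text2", "vis2")]
scores = {edges[0]: 0.9, edges[1]: 0.2, edges[2]: 0.7, edges[3]: 0.1}
kept = prune_edges(edges, scores, keep_ratio=0.5)
print(sorted(kept))
```

Every pruned edge is one fewer message exchange per round, which is where the token savings in such systems come from.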
Exponential Consensus through Z-Control in High-Order Multi-Agent Systems
Authors:Angela Monti, Fasma Diele
In this work, we introduce a Z-control strategy for multi-agent systems of arbitrary order, aimed at driving the agents toward consensus in the highest-order observable state. The proposed framework supports both direct and indirect control schemes, making it applicable in scenarios where high-order derivatives such as acceleration cannot be directly manipulated. Theoretical analysis ensures exponential convergence while preserving the average dynamics, and a hierarchy of control laws is derived accordingly. Numerical experiments up to third-order models, including opinion dynamics and Cucker-Smale flocking systems, demonstrate the robustness and flexibility of Z-control under varying interaction regimes and control intensities.
Paper and project links
Summary
This work introduces a Z-control strategy for multi-agent systems of arbitrary order, aimed at driving the agents toward consensus in the highest-order observable state. The proposed framework supports both direct and indirect control schemes, making it applicable when high-order derivatives such as acceleration cannot be manipulated directly. Theoretical analysis guarantees exponential convergence while preserving the average dynamics, and a hierarchy of control laws is derived accordingly. Numerical experiments on models up to third order, including opinion dynamics and Cucker-Smale flocking systems, demonstrate the robustness and flexibility of Z-control under varying interaction regimes and control intensities.
Key Takeaways
- A Z-control strategy is introduced for multi-agent systems, driving agents of arbitrary order toward consensus in the highest-order observable state.
- The proposed framework supports both direct and indirect control schemes, broadening its range of applications.
- Theoretical analysis guarantees exponential convergence of the system while preserving the average dynamics.
- A hierarchy of control laws is derived, providing theoretical support for controlling complex systems.
- Numerical experiments verify the robustness and flexibility of Z-control under different interaction regimes and control intensities.
- Applications of the Z-control strategy to opinion dynamics and Cucker-Smale flocking systems are demonstrated.
- The work offers a new perspective and method for the control of multi-agent systems.
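For intuition only, the kind of exponential, average-preserving convergence discussed above can be seen in a plain first-order consensus simulation on a complete graph. This is not the paper's Z-control law, which handles arbitrary-order dynamics; it is the standard baseline behavior that such laws generalize.

```python
# Toy first-order consensus: Euler steps of x_i' = gain * sum_j (x_j - x_i)
# on a complete graph. The update preserves the mean while the spread
# contracts geometrically (i.e., exponentially fast).

def consensus_step(x, gain=0.1):
    """One Euler step; sum_j (x_j - x_i) equals n * (mean - x_i) here."""
    mean = sum(x) / len(x)
    return [xi + gain * len(x) * (mean - xi) for xi in x]

x = [0.0, 1.0, 4.0, 7.0]
mean0 = sum(x) / len(x)          # 3.0, should be preserved
for _ in range(100):
    x = consensus_step(x)
print(round(sum(x) / len(x), 6), max(x) - min(x) < 1e-3)
```

Each step shrinks every deviation from the mean by the same constant factor, which is exactly what "exponential convergence while preserving the average dynamics" means in this simplest setting.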
VIL2C: Value-of-Information Aware Low-Latency Communication for Multi-Agent Reinforcement Learning
Authors:Qian Zhang, Zhuo Sun, Yao Zhang, Zhiwen Yu, Bin Guo, Jun Zhang
Inter-agent communication serves as an effective mechanism for enhancing performance in collaborative multi-agent reinforcement learning(MARL) systems. However, the inherent communication latency in practical systems induces both action decision delays and outdated information sharing, impeding MARL performance gains, particularly in time-critical applications like autonomous driving. In this work, we propose a Value-of-Information aware Low-latency Communication(VIL2C) scheme that proactively adjusts the latency distribution to mitigate its effects in MARL systems. Specifically, we define a Value of Information (VOI) metric to quantify the importance of delayed message transmission based on each delayed message’s importance. Moreover, we propose a progressive message reception mechanism to adaptively adjust the reception duration based on received messages. We derive the optimized VoI aware resource allocation and theoretically prove the performance advantage of the proposed VIL2C scheme. Extensive experiments demonstrate that VIL2C outperforms existing approaches under various communication conditions. These gains are attributed to the low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting periods via adaptive reception duration.
Paper and project links
Summary
Inter-agent communication is an effective mechanism for improving performance in collaborative multi-agent reinforcement learning (MARL) systems. However, the communication latency inherent in practical systems delays action decisions and makes shared information stale, which hampers MARL performance gains, especially in time-critical applications such as autonomous driving. This paper proposes a Value-of-Information aware Low-latency Communication (VIL2C) scheme that proactively adjusts the latency distribution to mitigate these effects. Specifically, a Value of Information (VoI) metric quantifies the importance of each delayed message, and a progressive message reception mechanism adaptively adjusts the reception duration based on the messages already received. Extensive experiments show that VIL2C outperforms existing approaches under various communication conditions, owing to the low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting via adaptive reception duration.
Key Takeaways
- Inter-agent communication is key to improving performance in collaborative multi-agent reinforcement learning (MARL) systems.
- Communication latency can delay action decisions and make shared information stale, degrading performance.
- The VIL2C scheme proactively adjusts the latency distribution to mitigate the effect of latency on MARL systems.
- VIL2C uses a Value of Information (VoI) metric to quantify the importance of delayed messages.
- VIL2C adopts a progressive message reception mechanism that adaptively adjusts the reception duration.
- VIL2C outperforms existing methods under various communication conditions.
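One hypothetical way to picture the two mechanisms, VoI-aware allocation and progressive reception, is the greedy toy below. It is not the paper's optimized resource allocation; the message names, VoI values, and thresholds are invented.

```python
# Illustrative VoI-aware scheduling sketch: high-VoI messages get the
# low-latency slots first, and reception stops early once enough
# information value has arrived (no fixed waiting period).

def schedule(messages, slots):
    """messages: list of (name, voi); grant `slots` low-latency slots."""
    ranked = sorted(messages, key=lambda m: m[1], reverse=True)
    return [name for name, _ in ranked[:slots]]

def receive(messages, voi_target):
    """Progressive reception: stop waiting once accumulated VoI suffices."""
    got, total = [], 0.0
    for name, voi in messages:
        got.append(name)
        total += voi
        if total >= voi_target:
            break
    return got

msgs = [("pos_update", 0.9), ("debug_log", 0.1), ("obstacle", 0.8)]
fast = schedule(msgs, slots=2)
got = receive([(m, v) for m, v in msgs if m in fast], voi_target=1.5)
print(fast, got)
```

The low-VoI `debug_log` neither gets a fast slot nor forces the receiver to wait, mirroring the paper's two claimed sources of gain.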
More with Less: An Empirical Study of Turn-Control Strategies for Efficient Coding Agents
Authors:Pengfei Gao, Chao Peng
LLM-powered coding agents, which operate in iterative loops (turns) to solve software engineering tasks, are becoming increasingly powerful. However, their practical deployment is hindered by significant and unpredictable costs. This challenge arises from a combination of factors: quadratically growing token counts with each turn, the high price of models, the large number of turns required for real-world tasks, and the tendency of agents to take inefficient or unnecessary actions. While existing research focuses on optimizing individual turns, the strategic control of the total number of turns remains an underexplored area for managing agent performance and cost. To address this gap, we conduct a comprehensive empirical study on SWE-bench using three state-of-the-art models and evaluate the impact of three distinct turn-control strategies: an unrestricted baseline, a fixed-turn limit with reminders, and a novel dynamic-turn strategy that grants extensions on-demand. Our findings first reveal a fundamental trade-off in the unrestricted setting, where no single model excels across performance, cost, and turn efficiency. We then show that a fixed-turn limit, specifically at the 75th percentile of the baseline, serves as a “sweet spot”, substantially reducing costs (by 24%-68%) with minimal impact on solve rates. Most significantly, the dynamic-turn strategy consistently outperforms fixed-limit approaches, achieving comparable or better solve rates while further reducing costs by an additional 12%-24% by intelligently allocating resources only to tasks that need them. This work provides the first systematic analysis of turn-control strategies, offering simple yet effective guidelines for developers to balance cost and efficacy. We demonstrate that dynamic resource allocation is a superior, easy-to-implement approach for deploying powerful yet economically viable coding agents.
Paper and project links
Summary
LLM-powered coding agents are becoming increasingly capable at software engineering tasks, but their practical deployment faces significant and unpredictable costs. The study identifies strategic control of the agent's total number of turns as a key, under-explored lever for managing those costs. Using three state-of-the-art models on SWE-bench, it evaluates three distinct turn-control strategies: an unrestricted baseline, a fixed-turn limit with reminders, and a dynamic-turn strategy that grants extensions on demand. The dynamic strategy, which intelligently allocates resources only to tasks that need them, proves the superior and more economical deployment approach, giving developers simple yet effective guidance for balancing cost and efficacy.
Key Takeaways
- LLM-powered coding agents show strong capability on software engineering tasks, but deployment costs are high and unpredictable.
- Controlling the agent's total number of turns is a key strategy for managing cost.
- In the unrestricted setting, no single model excels across the board; there is a trade-off among performance, cost, and turn efficiency.
- A fixed-turn limit at the 75th percentile of the baseline offers strong cost control (24%-68% savings) while largely preserving solve rates.
- The dynamic-turn strategy performs best, achieving additional cost savings through intelligent resource allocation.
- The study gives developers simple yet effective guidelines for balancing cost and efficacy.
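A dynamic turn budget of the kind described can be sketched as below. The budget numbers and the agent's signals (`solved_at`, `wants_more`) are invented for illustration, not the paper's tuned settings.

```python
# Sketch of a dynamic turn budget: start with a base allowance and grant
# a bounded extension on demand, so only tasks that need extra turns
# consume extra cost.

def run_agent(task, base_turns=4, extension=2, max_extensions=1):
    budget, used, extensions = base_turns, 0, 0
    while used < budget:
        used += 1
        if task["solved_at"] == used:       # agent finished this turn
            return ("solved", used)
        # Grant an extension on demand, up to a cap.
        if used == budget and extensions < max_extensions and task["wants_more"]:
            extensions += 1
            budget += extension
    return ("gave_up", used)

r1 = run_agent({"solved_at": 5, "wants_more": True})   # needs one extension
r2 = run_agent({"solved_at": 9, "wants_more": True})   # exceeds the cap
print(r1, r2)
```

The first task pays for six budgeted turns only because it asked; the second is cut off at the cap instead of burning turns indefinitely, which is the cost-control intuition behind the dynamic strategy.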
LiRA: A Multi-Agent Framework for Reliable and Readable Literature Review Generation
Authors:Gregory Hok Tjoan Go, Khang Ly, Anders Søgaard, Amin Tabatabaei, Maarten de Rijke, Xinyi Chen
The rapid growth of scientific publications has made it increasingly difficult to keep literature reviews comprehensive and up-to-date. Though prior work has focused on automating retrieval and screening, the writing phase of systematic reviews remains largely under-explored, especially with regard to readability and factual accuracy. To address this, we present LiRA (Literature Review Agents), a multi-agent collaborative workflow which emulates the human literature review process. LiRA utilizes specialized agents for content outlining, subsection writing, editing, and reviewing, producing cohesive and comprehensive review articles. Evaluated on SciReviewGen and a proprietary ScienceDirect dataset, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality, while maintaining competitive similarity to human-written reviews. We further evaluate LiRA in real-world scenarios using document retrieval and assess its robustness to reviewer model variation. Our findings highlight the potential of agentic LLM workflows, even without domain-specific tuning, to improve the reliability and usability of automated scientific writing.
Paper and project links
Summary
The rapid growth of scientific publications makes literature reviews ever harder to keep comprehensive and up to date, especially in the writing phase. This paper introduces LiRA, a multi-agent collaborative workflow that emulates the human literature review process with specialized agents for content outlining, subsection writing, editing, and reviewing. In evaluations, LiRA outperforms current baselines such as AutoSurvey and MASS-Survey in writing and citation quality while maintaining competitive similarity to human-written reviews. The paper also examines LiRA's applicability in real-world document-retrieval scenarios and its robustness to reviewer model variation, demonstrating the potential of agentic workflows for automated scientific writing.
Key Takeaways
- The rapid growth of scientific publications makes it harder to keep literature reviews comprehensive and up to date.
- Current research focuses mainly on automated retrieval and screening, while the writing phase of literature reviews remains under-explored.
- LiRA is a multi-agent collaborative workflow that emulates the human literature review process, with stages for content outlining, subsection writing, editing, and reviewing.
- LiRA outperforms existing automated survey methods in writing and citation quality.
- LiRA achieves strong performance while remaining similar to human-written reviews.
- LiRA performs well in real-world document-retrieval scenarios and is robust to reviewer model variation.
OceanGym: A Benchmark Environment for Underwater Embodied Agents
Authors:Yida Xue, Mingjun Mao, Xiangyuan Ru, Yuqi Zhu, Baochang Ren, Shuofei Qiao, Mengru Wang, Shumin Deng, Xinyu An, Ningyu Zhang, Ying Chen, Huajun Chen
We introduce OceanGym, the first comprehensive benchmark for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. Unlike terrestrial or aerial domains, underwater settings present extreme perceptual and decision-making challenges, including low visibility and dynamic ocean currents, making effective agent deployment exceptionally difficult. OceanGym encompasses eight realistic task domains and a unified agent framework driven by Multi-modal Large Language Models (MLLMs), which integrates perception, memory, and sequential decision-making. Agents are required to comprehend optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under these harsh conditions. Extensive experiments reveal substantial gaps between state-of-the-art MLLM-driven agents and human experts, highlighting the persistent difficulty of perception, planning, and adaptability in ocean underwater environments. By providing a high-fidelity, rigorously designed platform, OceanGym establishes a testbed for developing robust embodied AI and transferring these capabilities to real-world autonomous ocean underwater vehicles, marking a decisive step toward intelligent agents capable of operating in one of Earth’s last unexplored frontiers. The code and data are available at https://github.com/OceanGPT/OceanGym.
Paper and project links
PDF Work in progress
Summary
OceanGym is the first comprehensive benchmark platform for ocean underwater embodied agents, designed to advance AI in one of the most demanding real-world environments. The platform covers eight realistic task domains and provides a unified agent framework driven by multi-modal large language models (MLLMs) that integrates perception, memory, and sequential decision-making, addressing the perceptual and decision-making challenges of underwater settings. It allows agents to be tested under harsh conditions and serves as a testbed for developing robust embodied AI and transferring those capabilities to real-world autonomous underwater vehicles.
Key Takeaways
- OceanGym is the first comprehensive benchmark designed for ocean underwater embodied agents.
- The platform aims to advance AI in one of the most demanding real-world environments.
- OceanGym covers eight realistic task domains, with challenges including low visibility and dynamic ocean currents.
- A unified agent framework driven by multi-modal large language models performs perception, memory, and sequential decision-making.
- Agents must interpret optical and sonar data, autonomously explore complex environments, and accomplish long-horizon objectives under harsh conditions.
- Experiments show a substantial gap between current agents and human experts, especially in perception, planning, and adaptability.
Improved LLM Agents for Financial Document Question Answering
Authors:Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe
Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have shown the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and shows, through experiments, that this critic agent’s performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with the calculator agent which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.
Paper and project links
PDF 13 pages, 5 figures. Unlike the previous version, LLM names are now unmasked
Summary
This paper studies the difficulty LLMs face with numerical question answering over financial documents. It shows that the traditional critic agent's performance deteriorates when oracle labels are not available, and therefore presents an improved critic agent, together with a calculator agent, that outperforms the previous state-of-the-art program-of-thought approach while being safer. The paper also investigates how the agents interact with each other and how that interaction affects their performance.
Key Takeaways
- LLMs still struggle with numerical question answering over financial documents that mix tabular and textual data.
- The traditional critic agent's performance deteriorates when oracle labels are unavailable.
- An improved critic agent and a calculator agent are proposed and outperform previous methods.
- The new approach improves on the program-of-thought baseline in both performance and safety.
- The interaction between the agents affects their performance.
- Experiments demonstrate the effectiveness of the new approach.
Multi-Modal Data Exploration via Language Agents
Authors:Farhad Nooralahzadeh, Yi Zhang, Jonathan Furst, Kurt Stockinger
International enterprises, organizations, and hospitals collect large amounts of multi-modal data stored in databases, text documents, images, and videos. While there has been recent progress in the separate fields of multi-modal data exploration as well as in database systems that automatically translate natural language questions to database query languages, the research challenge of querying both structured databases and unstructured modalities (e.g., texts, images) in natural language remains largely unexplored. In this paper, we propose M$^2$EX -a system that enables multi-modal data exploration via language agents. Our approach is based on the following research contributions: (1) Our system is inspired by a real-world use case that enables users to explore multi-modal information systems. (2) M$^2$EX leverages an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis and to orchestrate modality-specific experts in an efficient query plan. (3) Experimental results on multi-modal datasets, encompassing relational data, text, and images, demonstrate that our system outperforms state-of-the-art multi-modal exploration systems, excelling in both accuracy and various performance metrics, including query latency, API costs, and planning efficiency, thanks to the more effective utilization of the reasoning capabilities of LLMs.
Paper and project links
PDF Accepted to the IJCNLP AACL 2025 Findings
Summary
Multi-modal data exploration remains challenging. International enterprises, organizations, and hospitals hold large amounts of multi-modal data stored in databases, text documents, images, and videos, yet querying structured databases and unstructured modalities (such as text and images) together in natural language is still largely unexplored. This paper proposes M^2EX, a system that enables multi-modal data exploration via language agents. Inspired by a real-world use case, M^2EX uses an LLM-based agentic AI framework to decompose a natural language question into subtasks such as text-to-SQL generation and image analysis, and orchestrates modality-specific experts into an efficient query plan. On multi-modal datasets spanning relational data, text, and images, the system outperforms state-of-the-art multi-modal exploration systems in accuracy as well as in query latency, API costs, and planning efficiency, thanks to more effective use of the reasoning capabilities of LLMs.
Key Takeaways
- International enterprises and institutions hold large amounts of multi-modal data across databases, documents, images, and videos.
- Querying structured databases and unstructured modalities together in natural language remains a largely unexplored research challenge.
- The M^2EX system enables multi-modal data exploration via language agents.
- M^2EX is inspired by a real-world use case and uses an LLM-based agentic AI framework to handle natural language questions.
- Natural language questions are decomposed into subtasks such as text-to-SQL generation and image analysis.
- Modality-specific experts are orchestrated into an efficient query plan.
Scalable and Accurate Graph Reasoning with LLM-based Multi-Agents
Authors:Yuwei Hu, Runlin Lei, Xinyi Huang, Zhewei Wei, Yongchao Liu
Recent research has explored the use of Large Language Models (LLMs) for tackling complex graph reasoning tasks. However, due to the intricacies of graph structures and the inherent limitations of LLMs in handling long text, current approaches often fail to deliver satisfactory accuracy, even on small-scale graphs and simple tasks. To address these challenges, we introduce GraphAgent-Reasoner, a fine-tuning-free framework that utilizes a multi-agent collaboration strategy for explicit and precise graph reasoning. Inspired by distributed graph computation theory, our framework decomposes graph problems into smaller, node-centric tasks that are distributed among multiple agents. The agents collaborate to solve the overall problem, significantly reducing the amount of information and complexity handled by a single LLM, thus enhancing the accuracy of graph reasoning. By simply increasing the number of agents, GraphAgent-Reasoner can efficiently scale to accommodate larger graphs with over 1,000 nodes. Evaluated on the GraphInstruct dataset, our framework demonstrates near-perfect accuracy on polynomial-time graph reasoning tasks, significantly outperforming the best available models, both closed-source and fine-tuned open-source variants. Our framework also demonstrates the capability to handle real-world graph reasoning applications such as webpage importance analysis.
Paper and project links
PDF Accepted by AAAI 2026 Workshop WMAC
Summary
LLMs struggle with complex graph reasoning tasks because of intricate graph structures and their inherent limitations in handling long text. GraphAgent-Reasoner, a fine-tuning-free framework, addresses this with a multi-agent collaboration strategy for explicit and precise graph reasoning: a graph problem is decomposed into smaller node-centric tasks distributed among multiple agents, which collaborate to solve the overall problem. This reduces the information and complexity any single LLM must handle, improves the accuracy of graph reasoning, and lets the framework scale efficiently to graphs with over 1,000 nodes simply by adding agents. On the GraphInstruct dataset, the framework achieves near-perfect accuracy on polynomial-time graph reasoning tasks, significantly outperforming the best available models.
Key Takeaways
- LLMs suffer accuracy problems on complex graph reasoning tasks, even on small graphs and simple tasks.
- GraphAgent-Reasoner adopts a multi-agent collaboration strategy for graph reasoning, without any fine-tuning.
- The framework decomposes a graph problem into smaller node-centric tasks.
- Multi-agent collaboration raises the accuracy of graph reasoning by reducing the load on any single LLM.
- The GraphAgent-Reasoner framework scales to large graphs with over 1,000 nodes.
- In evaluations on the GraphInstruct dataset, GraphAgent-Reasoner shows near-perfect accuracy.
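The node-centric decomposition is in the spirit of distributed graph computation. A plain synchronous BFS over node-local neighborhoods illustrates the principle (each node "agent" only ever sees its own edges), without the LLM layer the framework adds on top.

```python
# Node-centric message passing for single-source shortest paths on an
# unweighted graph (classic synchronous BFS; the per-node view mirrors
# how a node-centric task limits the context any one agent must handle).

def bfs_distances(adj, source):
    dist = {v: None for v in adj}
    dist[source] = 0
    frontier = [source]
    while frontier:
        nxt = []
        for v in frontier:            # each node-agent relaxes its own edges
            for u in adj[v]:
                if dist[u] is None:
                    dist[u] = dist[v] + 1
                    nxt.append(u)
        frontier = nxt
    return dist

adj = {1: [2, 3], 2: [1, 4], 3: [1, 4], 4: [2, 3]}
dist = bfs_distances(adj, 1)
print(dist)
```

Because each step touches only one node's adjacency list, adding more "agents" (nodes processed in parallel) scales the computation without growing any single worker's context, which is the scaling argument the paper makes for its LLM agents.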