Published: 2025-10-22

Updated: 2025-11-27

Word count: 11.3k

Reading time: 45 min


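⚠️ All of the summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic contexts; they are intended only for screening papers before reading them.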

Updated 2025-10-22

Enterprise Deep Research: Steerable Multi-Agent Deep Research for Enterprise Analytics

Authors:Akshara Prabhakar, Roshan Ram, Zixiang Chen, Silvio Savarese, Frank Wang, Caiming Xiong, Huan Wang, Weiran Yao

As information grows exponentially, enterprises face increasing pressure to transform unstructured data into coherent, actionable insights. While autonomous agents show promise, they often struggle with domain-specific nuances, intent alignment, and enterprise integration. We present Enterprise Deep Research (EDR), a multi-agent system that integrates (1) a Master Planning Agent for adaptive query decomposition, (2) four specialized search agents (General, Academic, GitHub, LinkedIn), (3) an extensible MCP-based tool ecosystem supporting NL2SQL, file analysis, and enterprise workflows, (4) a Visualization Agent for data-driven insights, and (5) a reflection mechanism that detects knowledge gaps and updates research direction with optional human-in-the-loop steering guidance. These components enable automated report generation, real-time streaming, and seamless enterprise deployment, as validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. We release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications. Code at https://github.com/SalesforceAIResearch/enterprise-deep-research and Dataset at https://huggingface.co/datasets/Salesforce/EDR-200


Paper and project links

PDF Technical report; 13 pages plus references and appendices

Summary
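Enterprises face explosive information growth and need to turn large volumes of unstructured data into useful insights, but existing autonomous agents struggle with domain-specific nuances, intent alignment, and enterprise integration. The paper presents Enterprise Deep Research (EDR), a multi-agent system with five components: a Master Planning Agent for adaptive query decomposition, four specialized search agents, an MCP-based tool ecosystem, a Visualization Agent, and a reflection mechanism that detects knowledge gaps and updates the research direction. The system supports automated report generation, real-time streaming, and enterprise deployment, and was validated on internal datasets. On open-ended benchmarks including DeepResearch Bench and DeepConsult, EDR outperforms state-of-the-art agentic systems without any human steering. The authors release the EDR framework and benchmark trajectories to advance research on multi-agent reasoning applications.

Key Takeaways

Enterprises face growing pressure to turn unstructured data into actionable insights.
Autonomous agents struggle with domain-specific nuances, intent alignment, and enterprise integration.
EDR is a multi-agent system built from five components.
EDR combines a Master Planning Agent for adaptive query decomposition with four specialized search agents (General, Academic, GitHub, LinkedIn).
The MCP-based tool ecosystem supports NL2SQL, file analysis, and enterprise workflows.
The Visualization Agent presents data-driven insights.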
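
The plan, search, and reflect loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the EDR implementation: `plan`, `reflect`, `research`, `SEARCH_AGENTS`, and the `steer` callback are hypothetical names, and the four search agents are stubs that return canned strings instead of querying real services.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-ins for the four specialized search agents.
SEARCH_AGENTS: Dict[str, Callable[[str], str]] = {
    "general": lambda q: f"[general] notes on {q}",
    "academic": lambda q: f"[academic] papers on {q}",
    "github": lambda q: f"[github] repos for {q}",
    "linkedin": lambda q: f"[linkedin] profiles for {q}",
}


def plan(query: str) -> List[dict]:
    """Toy planner: split the query on ' and ' into sub-tasks."""
    return [{"topic": part.strip(), "agent": "general"}
            for part in query.split(" and ")]


def reflect(findings: Dict[str, str], required: List[str]) -> List[str]:
    """Toy reflection: a knowledge gap is any required topic with no finding."""
    return [topic for topic in required if topic not in findings]


def research(query: str,
             required: List[str],
             steer: Optional[Callable[[List[str]], List[str]]] = None,
             max_rounds: int = 3) -> Dict[str, str]:
    """Run plan -> search -> reflect until no gaps remain or rounds run out."""
    findings: Dict[str, str] = {}
    tasks = plan(query)
    for _ in range(max_rounds):
        for task in tasks:
            findings[task["topic"]] = SEARCH_AGENTS[task["agent"]](task["topic"])
        gaps = reflect(findings, required)
        if steer is not None:
            gaps = steer(gaps)  # optional human-in-the-loop steering of the gap list
        if not gaps:
            break
        # Update the research direction: route each gap to another agent.
        tasks = [{"topic": gap, "agent": "academic"} for gap in gaps]
    return findings
```

Calling `research("pricing and churn", ["pricing", "churn", "retention"])` covers the two planned topics in the first round and picks up `retention` in the second, after the reflection step reports it as a gap. Passing a `steer` function lets a person prune or reorder the gaps before the next round.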


UltraCUA: A Foundation Model for Computer Use Agents with Hybrid Action

Authors:Yuhao Yang, Zhen Yang, Zi-Yi Dou, Anh Nguyen, Keen You, Omar Attia, Andrew Szot, Michael Feng, Ram Ramrakhya, Alexander Toshev, Chao Huang, Yinfei Yang, Zhe Gan

Multimodal agents for computer use rely exclusively on primitive actions (click, type, scroll) that require accurate visual grounding and lengthy execution chains, leading to cascading failures and performance bottlenecks. While other agents leverage rich programmatic interfaces (APIs, MCP servers, tools), computer-use agents (CUAs) remain isolated from these capabilities. We present UltraCUA, a foundation model that bridges this gap through hybrid action – seamlessly integrating GUI primitives with high-level programmatic tool calls. To achieve this, our approach comprises four key components: (1) an automated pipeline that scales programmatic tools from software documentation, open-source repositories, and code generation; (2) a synthetic data engine producing over 17,000 verifiable tasks spanning real-world computer-use scenarios; (3) a large-scale high-quality hybrid action trajectory collection with both low-level GUI actions and high-level programmatic tool calls; and (4) a two-stage training pipeline combining supervised fine-tuning with online reinforcement learning, enabling strategic alternation between low-level and high-level actions. Experiments with our 7B and 32B models demonstrate substantial improvements over state-of-the-art agents. On OSWorld, UltraCUA models achieve an average 22% relative improvement over base models, while being 11% faster in terms of steps. Out-of-domain evaluation on WindowsAgentArena shows our model reaches 21.7% success rate, outperforming baselines trained on Windows data. The hybrid action mechanism proves critical, reducing error propagation while maintaining execution efficiency.


Paper and project links

PDF

Summary

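The paper introduces UltraCUA, a foundation model that uses hybrid action to bridge the gap between computer-use agents (CUAs) and programmatic tools by combining GUI primitives with high-level programmatic tool calls. The approach consists of an automated tool-scaling pipeline, a synthetic data engine, a large-scale collection of high-quality hybrid-action trajectories, and a two-stage training pipeline. On OSWorld, the UltraCUA models achieve an average 22% relative improvement over their base models while being 11% faster in terms of steps. The hybrid action mechanism is critical for reducing error propagation while maintaining execution efficiency.

Key Takeaways

UltraCUA is a foundation model whose hybrid action combines GUI primitives with high-level programmatic tool calls.
It is built with an automated tool-scaling pipeline, a synthetic data engine, a large-scale hybrid-action trajectory collection, and a two-stage training pipeline (supervised fine-tuning followed by online reinforcement learning).
On OSWorld, UltraCUA achieves an average 22% relative improvement over base models.
Hybrid action reduces error propagation while maintaining execution efficiency.
The synthetic data engine produces over 17,000 verifiable tasks spanning real-world computer-use scenarios.
In out-of-domain evaluation on WindowsAgentArena, the model reaches a 21.7% success rate, outperforming baselines trained on Windows data.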
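
To make the idea of a hybrid action space concrete, here is a small sketch in which an agent can emit either a GUI primitive or a programmatic tool call, and the environment dispatches on the action type. Everything in it is an assumption made for illustration: `GuiAction`, `ToolCall`, `ToyEnv`, and the `set_cell` tool are hypothetical and are not taken from UltraCUA.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Union


@dataclass
class GuiAction:
    kind: str            # "click", "type", or "scroll"
    payload: str = ""    # target cell, text to type, or scroll direction


@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)


Action = Union[GuiAction, ToolCall]


class ToyEnv:
    """Hypothetical spreadsheet-like environment, used only for illustration."""

    def __init__(self) -> None:
        self.cells: Dict[str, str] = {}
        self.focused: Optional[str] = None
        self.tools: Dict[str, Callable[..., None]] = {"set_cell": self._set_cell}

    def _set_cell(self, cell: str, value: str) -> None:
        self.cells[cell] = value

    def step(self, action: Action) -> None:
        if isinstance(action, ToolCall):
            # High-level path: one call replaces a chain of GUI steps.
            self.tools[action.name](**action.args)
        elif action.kind == "click":
            self.focused = action.payload
        elif action.kind == "type":
            if self.focused is None:
                raise RuntimeError("typing with no focused cell")
            self.cells[self.focused] = action.payload
        elif action.kind == "scroll":
            pass  # scrolling changes the view, not the data, in this toy model
        else:
            raise ValueError(f"unknown GUI action: {action.kind}")


def run(env: ToyEnv, actions: List[Action]) -> int:
    """Execute the actions in order and return how many steps were used."""
    for action in actions:
        env.step(action)
    return len(actions)
```

In this toy setting, the GUI-only plan `[GuiAction("scroll", "down"), GuiAction("click", "B7"), GuiAction("type", "42")]` and the hybrid plan `[ToolCall("set_cell", {"cell": "B7", "value": "42"})]` reach the same state, but the second uses one step instead of three and has no intermediate step that depends on visual grounding.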


A Principle of Targeted Intervention for Multi-Agent Reinforcement Learning

Authors:Anjie Liu, Jianhong Wang, Samuel Kaski, Jun Wang, Mengyue Yang

Steering cooperative multi-agent reinforcement learning (MARL) towards desired outcomes is challenging, particularly when the global guidance from a human on the whole multi-agent system is impractical in a large-scale MARL. On the other hand, designing mechanisms to coordinate agents mostly relies on empirical studies, lacking an easy-to-use research tool. In this work, we employ multi-agent influence diagrams (MAIDs) as a graphical framework to address the above issues. First, we introduce interaction paradigms that leverage MAIDs to analyze and visualize existing approaches in MARL. Then, we design a new interaction paradigm based on MAIDs, referred to as targeted intervention, which is applied to only a single targeted agent, so the problem of global guidance can be mitigated. In our implementation, we introduce a causal inference technique, referred to as Pre-Strategy Intervention (PSI), to realize the targeted intervention paradigm. Since MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome that integrates the primary task goal and an additional desired outcome can be achieved by maximizing the corresponding causal effect through the PSI. Moreover, the bundled relevance graph analysis of MAIDs provides a tool to identify whether an MARL learning paradigm is workable under the design of an interaction paradigm. In experiments, we demonstrate the effectiveness of our proposed targeted intervention, and verify the result of relevance graph analysis.


Paper and project links

PDF Accepted to NeurIPS 2025

Summary

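The paper uses multi-agent influence diagrams (MAIDs) as a graphical framework for coordination problems in large-scale multi-agent reinforcement learning (MARL). It introduces MAID-based interaction paradigms to analyze and visualize existing MARL approaches, and designs a new paradigm, targeted intervention, which is applied to only a single targeted agent and so mitigates the problem of global guidance. Targeted intervention is realized with a causal inference technique called Pre-Strategy Intervention (PSI). The bundled relevance graph analysis of MAIDs identifies whether an MARL learning paradigm is workable under a given interaction paradigm. Experiments demonstrate the effectiveness of targeted intervention and verify the result of the relevance graph analysis.

Key Takeaways

Steering cooperative MARL towards desired outcomes is challenging, particularly when global human guidance over the whole system is impractical at large scale.
MAIDs serve as a graphical framework for these issues and as a tool for analyzing and visualizing existing MARL approaches.
The paper introduces a new MAID-based interaction paradigm, targeted intervention, which is applied to only a single targeted agent and mitigates the global guidance problem.
Targeted intervention is realized with Pre-Strategy Intervention (PSI), a causal inference technique.
Because MAIDs can be regarded as a special class of causal diagrams, a composite desired outcome can be achieved by maximizing the corresponding causal effect through PSI.
The bundled relevance graph analysis of MAIDs can identify whether an MARL learning paradigm is workable under the designed interaction paradigm.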


Semantic Joint Source Channel Coding for Distributed Subsurface Imaging in Multi-Agent Systems

Authors:Maximilian H. V. Tillmann, Ban-Sok Shin, Dmitriy Shutin, Armin Dekorsy

Multi-agent systems (MAS) are a promising solution for autonomous exploration tasks in hazardous or remote environments, such as planetary surveys. In such settings, communication among agents is essential to ensure collaborative task execution, yet conventional approaches treat exploration and communication as decoupled subsystems. This work presents a novel framework that tightly integrates semantic communication into the MAS exploration process, adapting communication strategies to the exploration methodology to improve overall task performance. Specifically, we investigate the application of semantic joint source-channel coding (JSCC) with over-the-air computation (AirComp) for distributed function computation for the application of cooperative subsurface imaging using the adapt-then-combine full waveform inversion (ATC-FWI) algorithm. Our results demonstrate that semantic JSCC significantly outperforms classical point-to-point and standard JSCC methods, especially in high-connectivity networks. Furthermore, incorporating side information at the receiving agent enhances communication efficiency and imaging accuracy, a feature previously unexplored in MAS-based exploration. We validate our approach through a use case inspired by subsurface anomaly detection, showing measurable improvements in imaging performance per agent. This work underscores the potential of semantic communication in distributed multi-agent exploration, offering a communication-aware exploration paradigm that achieves task-relevant performance gains.


Paper and project links

PDF

Summary

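Multi-agent systems (MAS) are a promising solution for autonomous exploration in hazardous or remote environments, such as planetary surveys, where communication among agents is essential for collaborative task execution. The paper proposes a framework that tightly integrates semantic communication into the MAS exploration process, adapting the communication strategy to the exploration methodology to improve overall task performance. Specifically, it studies semantic joint source-channel coding (JSCC) combined with over-the-air computation (AirComp) for distributed function computation in cooperative subsurface imaging with the adapt-then-combine full waveform inversion (ATC-FWI) algorithm. Semantic JSCC significantly outperforms classical point-to-point and standard JSCC methods, especially in high-connectivity networks. Incorporating side information at the receiving agent further improves communication efficiency and imaging accuracy, a feature previously unexplored in MAS-based exploration. The approach is validated on a use case inspired by subsurface anomaly detection, with measurable improvements in imaging performance per agent. The work underscores the potential of semantic communication in distributed multi-agent exploration and offers a communication-aware exploration paradigm that achieves task-relevant performance gains.

Key Takeaways

Multi-agent systems (MAS) are well suited to autonomous exploration tasks in hazardous or remote environments.
Communication among agents is essential to collaborative task execution.
The paper proposes a framework that integrates semantic communication into the MAS exploration process.
Semantic JSCC significantly outperforms classical point-to-point and standard JSCC methods, especially in high-connectivity networks.
Using side information at the receiving agent improves communication efficiency and imaging accuracy.
The approach is validated on a use case inspired by subsurface anomaly detection.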
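
The adapt-then-combine pattern with an over-the-air sum can be illustrated on a much simpler problem than full waveform inversion. The sketch below is an assumption-laden toy, not the paper's method: each agent estimates a scalar from its own measurement with a local gradient step (adapt), then averages with its neighbors (combine), where the neighbor contributions arrive as a single superimposed sum plus noise. There is no semantic JSCC encoder here, and `over_the_air_sum`, `atc_step`, and `run_atc` are hypothetical names.

```python
import random
from typing import Dict, List


def over_the_air_sum(signals: List[float], noise_std: float,
                     rng: random.Random) -> float:
    """Idealized AirComp: simultaneous transmissions add up in the channel,
    so the receiver observes their sum plus additive Gaussian noise."""
    return sum(signals) + rng.gauss(0.0, noise_std)


def atc_step(w: List[float], data: List[float],
             neighbors: Dict[int, List[int]], mu: float,
             noise_std: float, rng: random.Random) -> List[float]:
    # Adapt: gradient step on the local cost 0.5 * (w_k - d_k) ** 2.
    psi = [w_k - mu * (w_k - d_k) for w_k, d_k in zip(w, data)]
    # Combine: average the agent's own intermediate estimate with the
    # superimposed sum of its neighbors' intermediate estimates.
    new_w = []
    for k in range(len(w)):
        received = over_the_air_sum([psi[l] for l in neighbors[k]], noise_std, rng)
        new_w.append((psi[k] + received) / (len(neighbors[k]) + 1))
    return new_w


def run_atc(data: List[float], neighbors: Dict[int, List[int]],
            mu: float = 0.5, steps: int = 50,
            noise_std: float = 0.0, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    w = [0.0] * len(data)
    for _ in range(steps):
        w = atc_step(w, data, neighbors, mu, noise_std, rng)
    return w
```

On a fully connected network with a noiseless channel, every agent converges to the mean of all measurements. The point of the sketch is that the combine step needs only the sum of the neighbors' values, which is what the channel delivers in one shot, rather than each value separately.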


ShapeCraft: LLM Agents for Structured, Textured and Interactive 3D Modeling

Authors:Shuyuan Zhang, Chenhan Jiang, Zuoou Li, Jiankang Deng

3D generation from natural language offers significant potential to reduce expert manual modeling efforts and enhance accessibility to 3D assets. However, existing methods often yield unstructured meshes and exhibit poor interactivity, making them impractical for artistic workflows. To address these limitations, we represent 3D assets as shape programs and introduce ShapeCraft, a novel multi-agent framework for text-to-3D generation. At its core, we propose a Graph-based Procedural Shape (GPS) representation that decomposes complex natural language into a structured graph of sub-tasks, thereby facilitating accurate LLM comprehension and interpretation of spatial relationships and semantic shape details. Specifically, LLM agents hierarchically parse user input to initialize GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments demonstrate ShapeCraft’s superior performance in generating geometrically accurate and semantically rich 3D assets compared to existing LLM-based agents. We further show the versatility of ShapeCraft through examples of animated and user-customized editing, highlighting its potential for broader interactive applications.


Paper and project links

PDF NeurIPS 2025 Poster

Summary
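Generating 3D assets from natural language could reduce expert manual modeling effort and make 3D assets more accessible. Existing methods, however, often produce unstructured meshes with poor interactivity, which makes them impractical for artistic workflows. To address this, the paper represents 3D assets as shape programs and introduces ShapeCraft, a multi-agent framework for text-to-3D generation. At its core is the Graph-based Procedural Shape (GPS) representation, which decomposes complex natural language into a structured graph of sub-tasks and helps the LLM understand spatial relationships and semantic shape details. LLM agents hierarchically parse the user input to initialize the GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets. Qualitative and quantitative experiments show that ShapeCraft generates more geometrically accurate and semantically rich 3D assets than existing LLM-based agents. Examples of animation and user-customized editing demonstrate its versatility and its potential for broader interactive applications.

Key Takeaways

Text-to-3D generation can reduce expert modeling effort and improve access to 3D assets.
Existing methods tend to produce unstructured meshes with poor interactivity.
ShapeCraft uses a Graph-based Procedural Shape (GPS) representation that decomposes natural language into a structured graph of sub-tasks, helping the LLM interpret spatial relationships and semantic shape details.
LLM agents hierarchically parse the user input to initialize the GPS, then iteratively refine procedural modeling and painting to produce structured, textured, and interactive 3D assets.
Experiments show ShapeCraft generates more geometrically accurate and semantically rich 3D assets than existing LLM-based agents.
ShapeCraft is versatile and supports animation and user-customized editing.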
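
One way to picture a graph of modeling sub-tasks that compiles to a shape program is the sketch below. It is only a guess at the general shape of such a representation: the `PartNode` fields, the primitive names, and the `attach` call in the emitted program are hypothetical and do not reflect the actual GPS schema.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Dict, List, Tuple


@dataclass
class PartNode:
    name: str
    primitive: str                    # e.g. "box" or "cylinder" (illustrative)
    size: Tuple[float, float, float]
    attach_to: List[str]              # parts that must be modeled first


def build_order(nodes: Dict[str, PartNode]) -> List[str]:
    """Order parts so that every part comes after the parts it attaches to."""
    sorter = TopologicalSorter({n.name: n.attach_to for n in nodes.values()})
    return list(sorter.static_order())


def emit_program(nodes: Dict[str, PartNode]) -> List[str]:
    """Turn the part graph into lines of a toy shape program."""
    lines = []
    for name in build_order(nodes):
        node = nodes[name]
        lines.append(f"{name} = {node.primitive}(size={node.size})")
        for parent in node.attach_to:
            lines.append(f"attach({name}, {parent})")
    return lines
```

For a chair with a `seat`, a `leg` attached to the seat, and a `back` attached to the seat, the emitted program creates the seat first and then attaches the other parts. Because each part stays a named node, an edit such as resizing the leg touches one node rather than a whole mesh.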


Graph Attention-Guided Search for Dense Multi-Agent Pathfinding

Authors:Rishabh Jain, Keisuke Okumura, Michael Amir, Amanda Prorok

Finding near-optimal solutions for dense multi-agent pathfinding (MAPF) problems in real-time remains challenging even for state-of-the-art planners. To this end, we develop a hybrid framework that integrates a learned heuristic derived from MAGAT, a neural MAPF policy with a graph attention scheme, into a leading search-based algorithm, LaCAM. While prior work has explored learning-guided search in MAPF, such methods have historically underperformed. In contrast, our approach, termed LaGAT, outperforms both purely search-based and purely learning-based methods in dense scenarios. This is achieved through an enhanced MAGAT architecture, a pre-train-then-fine-tune strategy on maps of interest, and a deadlock detection scheme to account for imperfect neural guidance. Our results demonstrate that, when carefully designed, hybrid search offers a powerful solution for tightly coupled, challenging multi-agent coordination problems.


Paper and project links

PDF

Summary

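The paper proposes a hybrid framework that integrates a learned heuristic derived from MAGAT, a neural MAPF policy with a graph attention scheme, into the leading search-based algorithm LaCAM to solve dense multi-agent pathfinding (MAPF) problems. The method, LaGAT, outperforms both purely search-based and purely learning-based methods in dense scenarios. It relies on an enhanced MAGAT architecture, a pre-train-then-fine-tune strategy on maps of interest, and a deadlock detection scheme that accounts for imperfect neural guidance.

Key Takeaways

Finding near-optimal solutions to dense MAPF problems in real time remains challenging even for state-of-the-art planners.
LaGAT is a hybrid framework that integrates a heuristic learned from the MAGAT neural policy into the search-based algorithm LaCAM.
LaGAT relies on an enhanced MAGAT architecture, a pre-train-then-fine-tune strategy, and deadlock detection to account for imperfect neural guidance.
In dense scenarios, LaGAT outperforms both purely search-based and purely learning-based methods.
The results show that carefully designed hybrid search is a powerful solution for tightly coupled multi-agent coordination problems.
Earlier learning-guided search methods for MAPF have historically underperformed.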


ATL*AS: An Automata-Theoretic Approach and Tool for the Verification of Strategic Abilities in Multi-Agent Systems

Authors:Sofia Garcia de Blas Garcia-Alcalde, Francesco Belardinelli

We present two novel symbolic algorithms for model checking the Alternating-time Temporal Logic ATL*, over both the infinite-trace and the finite-trace semantics. In particular, for infinite traces we design a novel symbolic reduction to parity games. We implement both methods in the ATL*AS model checker and evaluate it using synthetic benchmarks as well as a cybersecurity scenario. Our results demonstrate that the symbolic approach significantly outperforms the explicit-state representation and we find that our parity-game-based algorithm offers a more scalable and efficient solution for infinite-trace verification, outperforming previously available tools. Our results also confirm that finite-trace model checking yields substantial performance benefits over infinite-trace verification. As such, we provide a comprehensive toolset for verifying multiagent systems against specifications in ATL*.


Paper and project links

PDF

Summary

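The paper presents two new symbolic algorithms for model checking the Alternating-time Temporal Logic ATL*, under both infinite-trace and finite-trace semantics. For infinite traces, it introduces a new symbolic reduction to parity games. Both methods are implemented in the ATL*AS model checker and evaluated on synthetic benchmarks and a cybersecurity scenario. The symbolic approach significantly outperforms the explicit-state representation, and the parity-game-based algorithm is more scalable and efficient for infinite-trace verification than previously available tools. Finite-trace model checking also yields substantial performance benefits over infinite-trace verification. Together, these results provide a comprehensive toolset for verifying multi-agent systems against ATL* specifications.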
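
Since the infinite-trace algorithm reduces to parity games, it may help to see the attractor computation, a standard building block of parity-game solvers such as Zielonka's recursive algorithm. The sketch below is an explicit-state toy over a dictionary-based graph. It is not the paper's symbolic algorithm, and the function name and game encoding are assumptions made for illustration.

```python
from typing import Dict, List, Set


def attractor(edges: Dict[str, List[str]], owner: Dict[str, int],
              target: Set[str], player: int) -> Set[str]:
    """Return the vertices from which `player` can force the play into `target`.

    `edges` maps each vertex to its successors and `owner` maps each vertex
    to the player (0 or 1) who chooses the move there.
    """
    attr = set(target)
    changed = True
    while changed:
        changed = False
        for vertex, successors in edges.items():
            if vertex in attr:
                continue
            if owner[vertex] == player:
                # The player needs one move that leads into the attractor.
                forced = any(s in attr for s in successors)
            else:
                # The opponent must have no move that escapes the attractor.
                forced = bool(successors) and all(s in attr for s in successors)
            if forced:
                attr.add(vertex)
                changed = True
    return attr
```

The loop is a plain fixed-point iteration: it keeps adding vertices until a full pass adds nothing. A symbolic implementation would run the same iteration over sets of states encoded compactly rather than over individual vertices.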


SketchMind: A Multi-Agent Cognitive Framework for Assessing Student-Drawn Scientific Sketches

Authors:Ehsan Latif, Zirak Khan, Xiaoming Zhai

Scientific sketches (e.g., models) offer a powerful lens into students’ conceptual understanding, yet AI-powered automated assessment of such free-form, visually diverse artifacts remains a critical challenge. Existing solutions often treat sketch evaluation as either an image classification task or monolithic vision-language models, which lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, we present SketchMind, a cognitively grounded, multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents responsible for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. We evaluate SketchMind on a curated dataset of 3,575 student-generated sketches across six science assessment items with different highest order of Bloom’s level that require students to draw models to explain phenomena. Compared to baseline GPT-4o performance without SRG (average accuracy: 55.6%), SRG integration achieves 77.1% average accuracy (+21.4% average absolute gain). We also demonstrate that multi-agent orchestration with SRG enhances SketchMind performance, for example, GPT-4.1 gains an average 8.9% increase in sketch prediction accuracy, outperforming single-agent pipelines across all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1, which achieved an average of 4.1 out of 5, significantly higher than those of baseline models (e.g., 2.3 for GPT-4o). Experts noted the system’s potential to meaningfully support conceptual growth through guided revision. Our code and (pending approval) dataset will be released to support reproducibility and future research in AI-driven education.


Paper and project links

PDF Submitted to NeurIPS2025

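Summary

Scientific sketches (such as models) are an important window into students' conceptual understanding, yet AI-based automated assessment of these free-form, visually diverse artifacts remains a critical challenge. Existing solutions usually treat sketch evaluation as an image classification task or hand it to a monolithic vision-language model, and so lack interpretability, pedagogical alignment, and adaptability across cognitive levels. To address these limitations, the paper presents SketchMind, a cognitively grounded multi-agent framework for evaluating and improving student-drawn scientific sketches. SketchMind comprises modular agents for rubric parsing, sketch perception, cognitive alignment, and iterative feedback with sketch modification, enabling personalized and transparent evaluation. It is evaluated on a dataset of 3,575 student-generated sketches across six science assessment items with different highest Bloom's levels. Compared with the GPT-4o baseline without SRG (average accuracy 55.6%), SRG integration reaches 77.1% average accuracy (+21.4% average absolute gain). Multi-agent orchestration with SRG further improves performance: for example, GPT-4.1 gains an average of 8.9% in sketch prediction accuracy and outperforms single-agent pipelines on all items. Human evaluators rated the feedback and co-created sketches generated by SketchMind with GPT-4.1 at an average of 4.1 out of 5, significantly higher than baseline models (for example, 2.3 for GPT-4o). Experts noted the system's potential to support conceptual growth through guided revision. The code and, pending approval, the dataset will be released to support reproducibility and future research in AI-driven education.

Key Takeaways

Scientific sketches are an important window into students' conceptual understanding, but assessing them automatically is a challenge.
Existing solutions lack interpretability, pedagogical alignment, and adaptability across cognitive levels.
SketchMind is a multi-agent framework with modules for rubric parsing, sketch perception, cognitive alignment, and iterative feedback, designed to evaluate and improve student sketches.
Its modular agents enable personalized and transparent evaluation.
On the dataset of student-generated sketches, SRG integration raises average accuracy from the GPT-4o baseline's 55.6% to 77.1%.
Multi-agent orchestration with SRG improves performance further; GPT-4.1 gains an average of 8.9% in sketch prediction accuracy.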


ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Authors:Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.


Paper and project links

PDF ScreenCoder-v2

Summary
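Automating the transformation of user interface (UI) designs into front-end code could accelerate software development and democratize design workflows. Multimodal large language models (MLLMs), however, often fail on complex UIs because they struggle to unify visual perception, layout planning, and code synthesis within a single model, which leads to frequent perception and planning errors. To address this, the paper proposes ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. Assigning these distinct responsibilities to specialized agents yields higher robustness and fidelity than end-to-end approaches. ScreenCoder also serves as a scalable data engine that generates high-quality image-code pairs. The authors use this data to fine-tune an open-source MLLM through a dual-stage pipeline of supervised fine-tuning and reinforcement learning, and report substantial gains in its UI generation capabilities.

Key Takeaways

Automating UI-design-to-code conversion can accelerate software development and democratize design workflows.
MLLMs struggle on complex UIs because they must unify visual perception, layout planning, and code synthesis in one model.
ScreenCoder decomposes the task into three stages (grounding, planning, and generation), which improves robustness and fidelity.
ScreenCoder is a multi-agent framework in which each specialized agent is responsible for one stage.
ScreenCoder can generate high-quality image-code pairs for training and improving models.
Fine-tuning an open-source MLLM on this data with supervised fine-tuning and reinforcement learning improves its UI generation capabilities.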
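
The three-stage decomposition can be sketched as three small functions chained together. This is a toy under stated assumptions rather than ScreenCoder's code: the grounding stage is stubbed with hand-written region boxes instead of a vision model, the planner only groups regions into rows, and `Region`, `ground`, `plan`, and `generate` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Region:
    label: str   # e.g. "header", "sidebar", "main" (illustrative labels)
    x: int
    y: int
    w: int
    h: int


def ground(detections: List[dict]) -> List[Region]:
    """Grounding stub: a real system would detect these regions in a screenshot."""
    return [Region(**d) for d in detections]


def plan(regions: List[Region]) -> List[List[Region]]:
    """Planning: group regions into rows by top edge, ordered left to right."""
    rows: Dict[int, List[Region]] = {}
    for region in sorted(regions, key=lambda r: (r.y, r.x)):
        rows.setdefault(region.y, []).append(region)
    return [rows[y] for y in sorted(rows)]


def generate(layout: List[List[Region]]) -> str:
    """Generation: emit one row container per layout row."""
    parts = []
    for row in layout:
        cells = "".join(
            f'<div class="{r.label}" style="width:{r.w}px;height:{r.h}px"></div>'
            for r in row
        )
        parts.append(f'<div class="row">{cells}</div>')
    return "\n".join(parts)
```

Keeping the stages separate means each intermediate result (the region list, the row layout) can be inspected and corrected on its own, which is the interpretability argument the abstract makes for the modular design.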


General agents contain world models

Authors:Jonathan Richens, David Abel, Alexis Bellot, Tom Everitt

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent’s policy, and that increasing the agent’s performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.


Paper and project links

PDF Accepted ICML 2025. Typos corrected

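Summary
Are world models necessary for flexible, goal-directed behaviour, or is model-free learning sufficient? The paper answers this question formally, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. It shows that this model can be extracted from the agent's policy, and that improving the agent's performance or the complexity of the goals it can achieve requires learning increasingly accurate world models.

Key Takeaways:

World models are necessary for generalizing to multi-step goal-directed tasks.
Any agent capable of such generalization must have learned a predictive model of its environment.
The world model can be extracted from the agent's policy.
Improving the agent's performance or the complexity of its achievable goals requires increasingly accurate world models.
The result has consequences for developing safe and general agents.
It also bounds agent capabilities in complex environments and provides new algorithms for eliciting world models from agents.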


CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation

Authors:Xinchen Wang, Pengfei Gao, Chao Peng, Ruida Hu, Cuiyun Gao

Large language models (LLMs) have demonstrated strong capabilities in code generation, underscoring the critical need for rigorous and comprehensive evaluation. Existing evaluation approaches fall into three categories, including human-centered, metric-based, and LLM-based. Considering that human-centered approaches are labour-intensive and metric-based ones overly rely on reference answers, LLM-based approaches are gaining increasing attention due to their stronger contextual understanding capabilities. However, they generally evaluate the generated code based on static prompts, and tend to fail for complex code scenarios which typically involve multiple requirements and require more contextual information. In addition, these approaches lack fine-grained evaluation for complex code, resulting in limited explainability. To mitigate the limitations, we propose CodeVisionary, the first agent-based evaluation framework for complex code generation. CodeVisionary consists of two stages: (1) Requirement-guided multi-dimensional context distillation stage and (2) Fine-grained scoring and summarization stage. A comprehensive evaluation report is also generated for enhanced explainability. For validation, we construct a new benchmark consisting of 363 samples spanning 37 coding scenarios and 23 programming languages. Extensive experiments demonstrate that CodeVisionary achieves the best performance among three baselines for evaluating complex code generation, outperforming the best baseline with average improvements of 0.217, 0.163, and 0.141 in Pearson, Spearman, and Kendall-Tau coefficients, respectively. The resources of CodeVisionary are available at https://github.com/Eshe0922/CodeVisionary.


Paper and project links

PDF

Summary
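Large language models show strong code generation capabilities, but existing evaluation approaches have limitations. The paper proposes CodeVisionary, an agent-based evaluation framework for complex code generation with two stages, requirement-guided multi-dimensional context distillation and fine-grained scoring and summarization, which also produces a comprehensive evaluation report for better explainability. Experiments show that CodeVisionary achieves the best performance among three baselines for evaluating complex code generation.

Key Takeaways

LLMs have strong code generation capabilities, which makes rigorous and comprehensive evaluation a critical need.
Existing evaluation approaches are human-centered, metric-based, or LLM-based; LLM-based approaches are gaining attention because of their stronger contextual understanding.
LLM-based approaches generally evaluate generated code with static prompts and tend to fail on complex code scenarios that involve multiple requirements and need more contextual information.
Existing approaches lack fine-grained evaluation for complex code, which limits explainability.
CodeVisionary is an agent-based evaluation framework for complex code generation with a context distillation stage and a fine-grained scoring and summarization stage.
CodeVisionary generates a comprehensive evaluation report for explainability, and outperforms the best baseline by 0.217, 0.163, and 0.141 on average in Pearson, Spearman, and Kendall-Tau coefficients.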
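
The Pearson, Spearman, and Kendall-Tau coefficients mentioned above measure how closely an automatic evaluator's scores track human scores. The sketch below computes all three from scratch with the standard textbook definitions. The score lists in the usage note are made up for illustration, and the Kendall variant shown is tau-a, which assumes no tied scores; the source does not say which variant the paper uses.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs, no tie correction."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (len(x) * (len(x) - 1) / 2)
```

With hypothetical human scores `[1, 2, 3, 4, 5]` and evaluator scores `[1, 3, 2, 5, 4]`, Pearson and Spearman both come out at 0.8 and Kendall tau-a at 0.6: two of the ten pairs are ordered differently by the evaluator than by the human raters.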


First Field-Trial Demonstration of L4 Autonomous Optical Network for Distributed AI Training Communication: An LLM-Powered Multi-AI-Agent Solution

Authors:Yihao Zhang, Qizhi Qiu, Xiaomin Liu, Dianxuan Fu, Xingyu Liu, Leyan Fei, Yuming Cheng, Lilin Yi, Weisheng Hu, Qunbi Zhuge

We demonstrate the first cross-domain cross-layer level-4 autonomous optical network via a multi-AI-agent system. Field trials show ~98% task completion rate across the distributed AI training lifecycle, 3.2x higher than single agents using state-of-the-art LLMs.


Paper and project links

PDF Accepted by 51st European Conference on Optical Communication (ECOC 2025), paper W.02.01.177

Summary:
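The paper demonstrates the first cross-domain, cross-layer level-4 autonomous optical network, built on a multi-AI-agent system. In field trials, the system reaches a task completion rate of about 98% across the distributed AI training lifecycle, 3.2x higher than single agents using state-of-the-art LLMs.

Key Takeaways:

It is the first demonstration of a cross-domain, cross-layer level-4 autonomous optical network.
The network is operated by an LLM-powered multi-AI-agent system and was tested in field trials.
The task completion rate is about 98%.
That rate is 3.2x higher than for single agents using state-of-the-art LLMs.
The trials cover the distributed AI training lifecycle.
The results show the advantage of a multi-AI-agent system over single agents for operating communication networks.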


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-22/Agent/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Agent
