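⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use this for serious academic purposes. It is only suitable for initial screening before reading the papers.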
Updated 2025-09-19
TGPO: Tree-Guided Preference Optimization for Robust Web Agent Reinforcement Learning
Authors:Ziyuan Chen, Zhenghui Zhao, Zhangye Han, Miancan Liu, Xianhang Ye, Yiqing Li, Hongbo Min, Jinkui Ren, Xiantao Zhang, Guitao Cao
With the rapid advancement of large language models and vision-language models, employing large models as Web Agents has become essential for automated web interaction. However, training Web Agents with reinforcement learning faces critical challenges including credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these issues, we propose Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework that proposes a tree-structured trajectory representation merging semantically identical states across trajectories to eliminate label conflicts. Our framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification. Additionally, a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and our self-constructed C-WebShop datasets demonstrate that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
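Paper and project links
Summary
The rapid progress of large language models and vision-language models has made large models central to automated web interaction as Web Agents. Training Web Agents with reinforcement learning, however, faces key challenges: credit assignment misallocation, prohibitively high annotation costs, and reward sparsity. To address these, the paper proposes Tree-Guided Preference Optimization (TGPO), an offline reinforcement learning framework whose tree-structured trajectory representation merges semantically identical states across trajectories to eliminate label conflicts. The framework incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification, and a dynamic weighting mechanism prioritizes high-impact decision points during training. Experiments on Online-Mind2Web and the self-constructed C-WebShop dataset show that TGPO significantly outperforms existing methods, achieving higher success rates with fewer redundant steps.
Key Takeaways
- The rapid progress of large language models and vision-language models has made Web Agents increasingly important for automated web interaction.
- Training Web Agents with reinforcement learning faces credit assignment misallocation, high annotation costs, and reward sparsity.
- TGPO uses a tree-structured trajectory representation that merges semantically identical states to eliminate label conflicts.
- TGPO incorporates a Process Reward Model that automatically generates fine-grained rewards through subgoal progress, redundancy detection, and action verification.
- TGPO uses a dynamic weighting mechanism to prioritize high-impact decision points during training.
- Experiments on the Online-Mind2Web and C-WebShop datasets show higher success rates with fewer redundant steps.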
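The state-merging idea can be illustrated with a small sketch. This is not the authors' implementation: it assumes that semantically identical states have already been mapped to a shared string key, and it replaces the Process Reward Model with a plain success flag per trajectory. The names `build_state_tree` and `preference_pairs` are hypothetical.

```python
from collections import defaultdict


def build_state_tree(trajectories):
    """Merge steps that share a state key into one node.

    trajectories: list of (steps, success), where steps is a list of
    (state_key, action) pairs. The state key is an assumed stand-in for
    "semantically identical" state matching.
    Returns {state_key: {action: [success flags of trajectories taking it]}}.
    """
    tree = defaultdict(lambda: defaultdict(list))
    for steps, success in trajectories:
        for state_key, action in steps:
            tree[state_key][action].append(success)
    return tree


def preference_pairs(tree):
    """At each merged state, prefer the action with the higher success rate."""
    pairs = []
    for state_key, actions in tree.items():
        rates = {a: sum(flags) / len(flags) for a, flags in actions.items()}
        for good, good_rate in rates.items():
            for bad, bad_rate in rates.items():
                if good_rate > bad_rate:
                    pairs.append((state_key, good, bad))
    return sorted(pairs)


trajectories = [
    ([("home", "search shoes"), ("results", "click item 1")], True),
    ([("home", "search shoes"), ("results", "click ad banner")], False),
    ([("home", "open help page")], False),
]
tree = build_state_tree(trajectories)
pairs = preference_pairs(tree)
print(pairs)
```

In this toy data, "search shoes" appears in both a successful and a failed trajectory. Labeling each trajectory separately would give that step conflicting labels, whereas the merged node records both outcomes in one place and only compares it against sibling actions taken from the same state.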

MIMIC-D: Multi-modal Imitation for MultI-agent Coordination with Decentralized Diffusion Policies
Authors:Dayi Dong, Maulik Bhatt, Seoyeon Choi, Negar Mehr
As robots become more integrated in society, their ability to coordinate with other robots and humans on multi-modal tasks (those with multiple valid solutions) is crucial. We propose to learn such behaviors from expert demonstrations via imitation learning (IL). However, when expert demonstrations are multi-modal, standard IL approaches can struggle to capture the diverse strategies, hindering effective coordination. Diffusion models are known to be effective at handling complex multi-modal trajectory distributions in single-agent systems. Diffusion models have also excelled in multi-agent scenarios where multi-modality is more common and crucial to learning coordinated behaviors. Typically, diffusion-based approaches require a centralized planner or explicit communication among agents, but this assumption can fail in real-world scenarios where robots must operate independently or with agents like humans that they cannot directly communicate with. Therefore, we propose MIMIC-D, a Centralized Training, Decentralized Execution (CTDE) paradigm for multi-modal multi-agent imitation learning using diffusion policies. Agents are trained jointly with full information, but execute policies using only local information to achieve implicit coordination. We demonstrate in both simulation and hardware experiments that our method recovers multi-modal coordination behavior among agents in a variety of tasks and environments, while improving upon state-of-the-art baselines.
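Paper and project links
PDF 9 pages, 4 figures, 5 tables
Summary
As robots become more integrated into society, their ability to coordinate with other robots and humans on multi-modal tasks (tasks with multiple valid solutions) becomes crucial. The paper proposes learning such behaviors from expert demonstrations via imitation learning (IL), but when the demonstrations are multi-modal, standard IL approaches can struggle to capture the diverse strategies, which hinders effective coordination. It introduces MIMIC-D, a Centralized Training, Decentralized Execution (CTDE) paradigm for multi-modal multi-agent imitation learning using diffusion policies. Agents are trained jointly with full information but execute their policies using only local information, achieving implicit coordination. Simulation and hardware experiments show that the method recovers multi-modal coordination behavior among agents across a variety of tasks and environments and improves upon state-of-the-art baselines.
Key Takeaways
- As robots become more integrated into society, coordinating with other robots and humans on multi-modal tasks becomes important.
- The paper learns coordination behaviors from expert demonstrations via imitation learning.
- Standard imitation learning approaches can struggle with multi-modal expert demonstrations.
- Diffusion models are effective at handling complex multi-modal trajectory distributions.
- Diffusion models have also excelled in multi-agent scenarios, where multi-modality is crucial to learning coordinated behaviors.
- MIMIC-D is a Centralized Training, Decentralized Execution paradigm for multi-modal multi-agent imitation learning.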

CrowdAgent: Multi-Agent Managed Multi-Source Annotation System
Authors:Maosheng Qin, Renyu Zhu, Mingxuan Xia, Chenkai Chen, Zhen Zhu, Minmin Lin, Junbo Zhao, Lu Xu, Changjie Fan, Runze Wu, Haobo Wang
High-quality annotated data is a cornerstone of modern Natural Language Processing (NLP). While recent methods begin to leverage diverse annotation sources (including Large Language Models (LLMs), Small Language Models (SLMs), and human experts), they often focus narrowly on the labeling step itself. A critical gap remains in the holistic process control required to manage these sources dynamically, addressing complex scheduling and quality-cost trade-offs in a unified manner. Inspired by real-world crowdsourcing companies, we introduce CrowdAgent, a multi-agent system that provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management. It implements a novel methodology that rationally assigns tasks, enabling LLMs, SLMs, and human experts to advance synergistically in a collaborative annotation workflow. We demonstrate the effectiveness of CrowdAgent through extensive experiments on six diverse multimodal classification tasks. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.
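Paper and project links
Summary
High-quality annotated data is a cornerstone of modern NLP. Recent methods have begun to leverage diverse annotation sources, such as Large Language Models (LLMs), Small Language Models (SLMs), and human experts, but they often focus narrowly on the labeling step itself. To provide the holistic process control needed to manage these sources dynamically, the paper draws on real-world crowdsourcing companies and introduces CrowdAgent, a multi-agent system. It provides end-to-end process control by integrating task assignment, data annotation, and quality/cost management, and it uses a new task-assignment methodology that lets LLMs, SLMs, and human experts work together in a collaborative annotation workflow. Experiments on six diverse multimodal classification tasks demonstrate its effectiveness. The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.
Key Takeaways
- High-quality annotated data is key to modern NLP, and recent methods draw on diverse annotation sources such as LLMs, SLMs, and human experts.
- Existing methods focus mainly on the labeling step and lack the holistic process control needed to manage annotation sources dynamically.
- CrowdAgent is a multi-agent system that provides end-to-end process control covering task assignment, data annotation, and quality/cost management.
- CrowdAgent uses a new task-assignment methodology that lets LLMs, SLMs, and human experts work together.
- Experiments on six multimodal classification tasks demonstrate CrowdAgent's effectiveness.
- The source code and video demo are available at https://github.com/QMMMS/CrowdAgent.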

InfraMind: A Novel Exploration-based GUI Agentic Framework for Mission-critical Industrial Management
Authors:Liangtao Lin, Zhaomeng Zhu, Tianwei Zhang, Yonggang Wen
Mission-critical industrial infrastructure, such as data centers, increasingly depends on complex management software. Its operations, however, pose significant challenges due to the escalating system complexity, multi-vendor integration, and a shortage of expert operators. While Robotic Process Automation (RPA) offers partial automation through handcrafted scripts, it suffers from limited flexibility and high maintenance costs. Recent advances in Large Language Model (LLM)-based graphical user interface (GUI) agents have enabled more flexible automation, yet these general-purpose agents face five critical challenges when applied to industrial management, including unfamiliar element understanding, precision and efficiency, state localization, deployment constraints, and safety requirements. To address these issues, we propose InfraMind, a novel exploration-based GUI agentic framework specifically tailored for industrial management systems. InfraMind integrates five innovative modules to systematically resolve different challenges in industrial management: (1) systematic search-based exploration with virtual machine snapshots for autonomous understanding of complex GUIs; (2) memory-driven planning to ensure high-precision and efficient task execution; (3) advanced state identification for robust localization in hierarchical interfaces; (4) structured knowledge distillation for efficient deployment with lightweight models; and (5) comprehensive, multi-layered safety mechanisms to safeguard sensitive operations. Extensive experiments on both open-source and commercial DCIM platforms demonstrate that our approach consistently outperforms existing frameworks in terms of task success rate and operational efficiency, providing a rigorous and scalable solution for industrial management automation.
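Paper and project links
Summary
Managing mission-critical industrial infrastructure is increasingly challenging because of escalating system complexity, multi-vendor integration, and a shortage of expert operators. Traditional RPA has limited flexibility and high maintenance costs, and newer LLM-based GUI agents, although more flexible, still face several challenges in industrial management. The paper therefore proposes InfraMind, a framework that integrates five modules to address these challenges, including autonomous understanding of complex GUIs, high-precision and efficient task execution, and robust localization in hierarchical interfaces. Experiments show that InfraMind outperforms existing frameworks in task success rate and operational efficiency, providing a rigorous and scalable solution for industrial management automation.
Key Takeaways
- Industrial management faces escalating system complexity, multi-vendor integration, and a shortage of expert operators.
- Traditional RPA offers only partial automation, with limited flexibility and high maintenance costs.
- LLM-based GUI agents enable more flexible automation but still face several challenges in industrial settings.
- InfraMind addresses these challenges by integrating five innovative modules.
- Its modules cover autonomous understanding of complex GUIs, high-precision and efficient task execution, and robust localization in hierarchical interfaces, among others.
- Extensive experiments on open-source and commercial DCIM platforms show that InfraMind outperforms existing frameworks.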

AgentCTG: Harnessing Multi-Agent Collaboration for Fine-Grained Precise Control in Text Generation
Authors:Xinxu Zhou, Jiaqi Bai, Zhenqi Sun, Fanxiang Zeng, Yue Liu
Although significant progress has been made in many tasks within the field of Natural Language Processing (NLP), Controlled Text Generation (CTG) continues to face numerous challenges, particularly in achieving fine-grained conditional control over generation. Additionally, real-world scenarios and online applications require attention to cost, scalability, domain knowledge learning, and more precise control, presenting further challenges for CTG. This paper introduces a novel and scalable framework, AgentCTG, which aims to enhance precise and complex control over text generation by simulating the control and regulation mechanisms in multi-agent workflows. We explore various collaboration methods among different agents and introduce an auto-prompt module to further enhance the generation effectiveness. AgentCTG achieves state-of-the-art results on multiple public datasets. To validate its effectiveness in practical applications, we propose a new challenging Character-Driven Rewriting task, which aims to convert the original text into new text that conforms to specific character profiles while preserving the domain knowledge. When applied to online navigation with role-playing, our approach significantly enhances the driving experience through improved content delivery. By optimizing the generation of contextually relevant text, we enable a more immersive interaction within online communities, fostering greater personalization and user engagement.
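Paper and project links
Summary
The paper discusses the challenges facing Controlled Text Generation (CTG) in natural language processing, including fine-grained conditional control and, in real applications, cost, scalability, domain knowledge learning, and more precise control. It proposes AgentCTG, a novel and scalable framework that aims to improve precise and complex control over text generation by simulating the control and regulation mechanisms in multi-agent workflows. The framework introduces an auto-prompt module to improve generation quality and achieves state-of-the-art results on multiple public datasets. To validate its practical effectiveness, the paper also proposes a challenging new Character-Driven Rewriting task, which converts the original text into new text that conforms to specific character profiles while preserving the domain knowledge. Applied to online navigation with role-playing, the approach improves content delivery by optimizing the generation of contextually relevant text, enabling more immersive interaction, personalization, and user engagement in online communities.
Key Takeaways
- Controlled Text Generation (CTG) still faces many challenges in NLP, particularly in achieving fine-grained conditional control.
- Real-world text generation must also account for cost, scalability, and domain knowledge learning.
- AgentCTG is a novel, scalable framework that improves precise control over text generation by simulating the control and regulation mechanisms in multi-agent workflows.
- AgentCTG introduces an auto-prompt module to improve generation quality and achieves state-of-the-art results on multiple public datasets.
- To validate AgentCTG in practice, the paper proposes a new Character-Driven Rewriting task that converts text to match specific character profiles while preserving domain knowledge.
- Applied to online navigation with role-playing, AgentCTG optimizes the generation of contextually relevant text and improves content delivery.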

See, Think, Act: Teaching Multimodal Agents to Effectively Interact with GUI by Identifying Toggles
Authors:Zongru Wu, Rui Mao, Zhiyuan Tian, Pengzhou Cheng, Tianjie Ju, Zheng Wu, Lingzhong Dong, Haiyue Sheng, Zhuosheng Zhang, Gongshen Liu
The advent of multimodal agents facilitates effective interaction within graphical user interface (GUI), especially in ubiquitous GUI control. However, their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, we construct a state control benchmark with binary toggle instructions from public datasets. Evaluations of existing agents demonstrate their unreliability, particularly when the current toggle state already matches the desired state. To address the challenge, we propose State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents demonstrate that StaR can improve toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also enhances general task performance. Finally, evaluations on a dynamic environment highlight the potential of StaR for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
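Paper and project links
Summary
Multimodal agents enable effective interaction within graphical user interfaces (GUIs), especially for ubiquitous GUI control, but their inability to reliably execute toggle control instructions remains a key bottleneck. To investigate this, the paper constructs a state control benchmark with binary toggle instructions from public datasets. Evaluations show that existing agents are unreliable, particularly when the current toggle state already matches the desired state. To address this, the paper proposes State-aware Reasoning (StaR), a training method that teaches agents to perceive the current toggle state, analyze the desired state from the instruction, and act accordingly. Experiments on three multimodal agents show that StaR improves toggle instruction execution accuracy by over 30%. Further evaluations on three public benchmarks show that StaR also improves general task performance, and evaluations in a dynamic environment highlight its potential for real-world applications. Code, benchmark, and StaR-enhanced agents are available at https://github.com/ZrW00/StaR.
Key Takeaways
- Multimodal agents interact effectively with GUIs but are unreliable when executing toggle control instructions.
- Existing agents are especially unreliable when the current toggle state already matches the desired state.
- The paper proposes State-aware Reasoning (StaR), a training method that makes agents more reliable on toggle control instructions.
- StaR improves toggle instruction execution accuracy by over 30%.
- StaR improves general task performance as well as toggle control.
- Evaluations in a dynamic environment highlight StaR's potential for real-world applications.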
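The perceive, analyze, act sequence that StaR teaches can be written out as explicit control flow. The sketch below is only an illustration of that decision logic, not the training method itself: the keyword-based `desired_state` parser and the `"click"` / `"done"` action names are assumptions made for this example.

```python
def desired_state(instruction):
    """Extract the target toggle state from a binary toggle instruction.

    Toy keyword rule, assumed for illustration only.
    """
    text = instruction.lower()
    if "turn off" in text or "disable" in text:
        return False
    if "turn on" in text or "enable" in text:
        return True
    raise ValueError(f"not a toggle instruction: {instruction!r}")


def decide_action(current_on, instruction):
    """Click only when the perceived state differs from the desired state."""
    target = desired_state(instruction)
    if current_on == target:
        # The state already matches; clicking would flip it the wrong way.
        return "done"
    return "click"


print(decide_action(True, "Turn on Wi-Fi"))
print(decide_action(False, "Turn on Wi-Fi"))
```

The first call covers the case the abstract singles out as the hardest for existing agents: the toggle is already in the desired state, so the correct behavior is to finish without clicking.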

DECAMP: Towards Scene-Consistent Multi-Agent Motion Prediction with Disentangled Context-Aware Pre-Training
Authors:Jianxin Shi, Zengqi Peng, Xiaolong Chen, Tianyu Wo, Jun Ma
Trajectory prediction is a critical component of autonomous driving, essential for ensuring both safety and efficiency on the road. However, traditional approaches often struggle with the scarcity of labeled data and exhibit suboptimal performance in multi-agent prediction scenarios. To address these challenges, we introduce a disentangled context-aware pre-training framework for multi-agent motion prediction, named DECAMP. Unlike existing methods that entangle representation learning with pretext tasks, our framework decouples behavior pattern learning from latent feature reconstruction, prioritizing interpretable dynamics and thereby enhancing scene representation for downstream prediction. Additionally, our framework incorporates context-aware representation learning alongside collaborative spatial-motion pretext tasks, which enables joint optimization of structural and intentional reasoning while capturing the underlying dynamic intentions. Our experiments on the Argoverse 2 benchmark showcase the superior performance of our method, and the results attained underscore its effectiveness in multi-agent motion forecasting. To the best of our knowledge, this is the first context autoencoder framework for multi-agent motion forecasting in autonomous driving. The code and models will be made publicly available.
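Paper and project links
Summary
Trajectory prediction is a critical component of autonomous driving, but traditional approaches struggle with the scarcity of labeled data and perform suboptimally in multi-agent prediction scenarios. To address this, the paper introduces DECAMP, a disentangled context-aware pre-training framework. The framework decouples behavior pattern learning from latent feature reconstruction and prioritizes interpretable dynamics, which improves the scene representation used for downstream prediction. It also combines context-aware representation learning with collaborative spatial-motion pretext tasks, enabling joint optimization of structural and intentional reasoning while capturing the underlying dynamic intentions. Experiments on the Argoverse 2 benchmark show superior performance, and according to the authors this is the first context autoencoder framework for multi-agent motion forecasting in autonomous driving.
Key Takeaways
- Trajectory prediction is a critical component of autonomous driving, essential for road safety and efficiency.
- Traditional approaches face scarce labeled data and suboptimal performance in multi-agent prediction scenarios.
- DECAMP addresses these problems with disentangled context-aware pre-training for multi-agent motion prediction.
- DECAMP decouples behavior pattern learning from latent feature reconstruction and prioritizes interpretable dynamics.
- Combining context-aware representation learning with collaborative spatial-motion pretext tasks enables joint optimization of structural and intentional reasoning.
- Experiments show that DECAMP performs strongly on the Argoverse 2 benchmark.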

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
Authors:Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang
With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent’s rationality, generalizability, and scalability. The code, datasets, and models are available at https://github.com/Wuzheng02/VeriOS.
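Paper and project links
Summary
The rapid progress of multimodal large language models has made OS agents increasingly capable of automating tasks through on-device graphical user interfaces. Because real-world environments often present untrustworthy conditions, the paper proposes a query-driven human-agent-GUI interaction framework. Built on this framework, VeriOS-Agent is a trustworthy OS agent trained with a two-stage learning paradigm: it executes actions autonomously under normal conditions and proactively queries humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate in untrustworthy scenarios by 20.64% over the state-of-the-art without compromising normal performance. The code, datasets, and models are available at https://github.com/Wuzheng02/VeriOS.
Key Takeaways
- Progress in multimodal large language models lets OS agents automate tasks through graphical user interfaces.
- Most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions.
- The paper proposes a query-driven human-agent-GUI interaction framework that lets OS agents decide when to query humans for more reliable task completion.
- VeriOS-Agent is a trustworthy OS agent built on this framework and trained with a two-stage learning paradigm; it acts autonomously under normal conditions and proactively queries humans in untrustworthy scenarios.
- VeriOS-Agent improves the average step-wise success rate in untrustworthy scenarios by 20.64% without compromising normal performance.
- Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability.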

MAFA: A multi-agent framework for annotation
Authors:Mahmood Hegazy, Aaron Rodrigues, Azzam Naeem
Modern consumer banking applications require accurate and efficient retrieval of information in response to user queries. Mapping user utterances to the most relevant Frequently Asked Questions (FAQs) is a crucial component of these systems. Traditional approaches often rely on a single model or technique, which may not capture the nuances of diverse user inquiries. In this paper, we introduce a multi-agent framework for FAQ annotation that combines multiple specialized agents with different approaches and a judge agent that reranks candidates to produce optimal results. Our agents utilize a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries. Our framework features a few-shot example strategy, where each agent receives different few-shots, enhancing ensemble diversity and coverage of the query space. We evaluate our framework on a real-world major bank dataset as well as public benchmark datasets (LCQMC and FiQA), demonstrating significant improvements over single-agent approaches across multiple metrics, including a 14% increase in Top-1 accuracy, an 18% increase in Top-5 accuracy, and a 12% improvement in Mean Reciprocal Rank on our dataset, and similar gains on public benchmarks when compared with traditional and single-agent annotation techniques. Our framework is particularly effective at handling ambiguous queries, making it well-suited for deployment in production banking applications while showing strong generalization capabilities across different domains and languages.
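Paper and project links
Summary
The paper introduces a multi-agent framework for mapping user utterances to the most relevant Frequently Asked Questions (FAQs). The framework combines multiple specialized agents that use different approaches and structured reasoning with a judge agent that reranks the candidates to produce the final result. Evaluation shows significant improvements over traditional and single-agent approaches across multiple metrics. The framework is particularly effective at handling ambiguous queries, which makes it well suited to production banking applications, and it generalizes well across domains and languages.
Key Takeaways
- The multi-agent framework combines multiple specialized agents with a judge agent to improve the matching of user queries to FAQs.
- Agents use a structured reasoning approach inspired by Attentive Reasoning Queries (ARQs), which guides them through systematic reasoning steps using targeted, task-specific JSON queries.
- Each agent receives different few-shot examples, which increases ensemble diversity and coverage of the query space.
- Evaluation on a real-world bank dataset and public benchmark datasets (LCQMC and FiQA) shows that the framework outperforms single-agent approaches across multiple metrics.
- The framework handles ambiguous queries well, making it suitable for production banking applications.
- The framework generalizes well across domains and languages.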
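The abstract reports results as Top-1 accuracy, Top-5 accuracy, and Mean Reciprocal Rank (MRR). As a reminder of what these metrics measure, here is a minimal sketch of their standard definitions. It assumes exactly one gold FAQ per query; the FAQ identifiers and rankings are made up for illustration and are not from the paper.

```python
def top_k_accuracy(rankings, gold, k):
    """Fraction of queries whose gold FAQ appears in the first k candidates."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(rankings, gold):
    """Average of 1/rank of the gold FAQ; a query scores 0 if it is missing."""
    total = 0.0
    for ranked, g in zip(rankings, gold):
        if g in ranked:
            total += 1.0 / (ranked.index(g) + 1)
    return total / len(gold)


# Hypothetical reranked candidate lists for three queries.
rankings = [
    ["faq_7", "faq_2", "faq_9"],
    ["faq_4", "faq_1", "faq_3"],
    ["faq_5", "faq_8", "faq_6"],
]
gold = ["faq_7", "faq_1", "faq_0"]

top1 = top_k_accuracy(rankings, gold, k=1)
top3 = top_k_accuracy(rankings, gold, k=3)
mrr = mean_reciprocal_rank(rankings, gold)
print(top1, top3, mrr)
```

MRR rewards placing the correct FAQ near the top even when it is not first, which is why a reranking judge can raise MRR without changing Top-5 accuracy.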

Humanoid Agent via Embodied Chain-of-Action Reasoning with Multimodal Foundation Models for Zero-Shot Loco-Manipulation
Authors:Congcong Wen, Geeta Chandra Raju Bethala, Yu Hao, Niraj Pudasaini, Hao Huang, Shuaihang Yuan, Baoru Huang, Anh Nguyen, Anthony Tzes, Yi Fang
Humanoid loco-manipulation, which integrates whole-body locomotion with dexterous manipulation, remains a fundamental challenge in robotics. Beyond whole-body coordination and balance, a central difficulty lies in understanding human instructions and translating them into coherent sequences of embodied actions. Recent advances in foundation models provide transferable multimodal representations and reasoning capabilities, yet existing efforts remain largely restricted to either locomotion or manipulation in isolation, with limited applicability to humanoid settings. In this paper, we propose Humanoid-COA, the first humanoid agent framework that integrates foundation model reasoning with an Embodied Chain-of-Action (CoA) mechanism for zero-shot loco-manipulation. Within the perception–reasoning–action paradigm, our key contribution lies in the reasoning stage, where the proposed CoA mechanism decomposes high-level human instructions into structured sequences of locomotion and manipulation primitives through affordance analysis, spatial inference, and whole-body action reasoning. Extensive experiments on two humanoid robots, Unitree H1-2 and G1, in both an open test area and an apartment environment, demonstrate that our framework substantially outperforms prior baselines across manipulation, locomotion, and loco-manipulation tasks, achieving robust generalization to long-horizon and unstructured scenarios. Project page: https://humanoid-coa.github.io/
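Paper and project links
PDF website link: https://humanoid-coa.github.io/
Summary
Humanoid loco-manipulation, which integrates whole-body locomotion with dexterous manipulation, remains a fundamental challenge in robotics. Existing work is largely restricted to locomotion or manipulation in isolation and has limited applicability to humanoid settings. The paper proposes Humanoid-COA, the first humanoid agent framework that integrates foundation model reasoning with an Embodied Chain-of-Action (CoA) mechanism for zero-shot loco-manipulation. In the reasoning stage of the perception-reasoning-action paradigm, the CoA mechanism decomposes high-level human instructions into structured sequences of locomotion and manipulation primitives through affordance analysis, spatial inference, and whole-body action reasoning. Experiments on the Unitree H1-2 and G1 humanoid robots show that the framework substantially outperforms prior baselines across manipulation, locomotion, and loco-manipulation tasks and generalizes robustly to long-horizon and unstructured scenarios.
Key Takeaways
- Humanoid loco-manipulation remains a fundamental challenge in robotics.
- Existing work focuses mainly on locomotion or manipulation in isolation, with limited applicability to humanoid settings.
- Humanoid-COA is the first humanoid agent framework that integrates foundation model reasoning.
- The framework uses an Embodied Chain-of-Action mechanism to achieve zero-shot loco-manipulation.
- In the reasoning stage of the perception-reasoning-action paradigm, Humanoid-COA decomposes high-level instructions into structured sequences of locomotion and manipulation primitives.
- Experiments on the Unitree H1-2 and G1 humanoid robots validate the effectiveness of Humanoid-COA.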

ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation
Authors:Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, Rongrong Ji
ComfyUI is a popular workflow-based interface that allows users to customize image generation tasks through an intuitive node-based system. However, the complexity of managing node connections and diverse modules can be challenging for users. In this paper, we introduce ComfyGPT, a self-optimizing multi-agent system designed to generate ComfyUI workflows based on task descriptions automatically. The key innovations of ComfyGPT include: (1) consisting of four specialized agents to build a multi-agent workflow generation system: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent; (2) focusing on generating precise node connections instead of entire workflows, improving generation accuracy; and (3) enhancing workflow generation through reinforcement learning. Moreover, we introduce FlowDataset, a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench, a comprehensive benchmark for evaluating workflow generation systems. Additionally, we propose four novel evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Experimental results demonstrate that ComfyGPT significantly outperforms existing LLM-based methods in workflow generation, making it a significant step forward in this field. Code is available at https://github.com/comfygpt/comfygpt.
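Paper and project links
Summary
ComfyGPT is a self-optimizing multi-agent system that automatically generates ComfyUI workflows from task descriptions. It consists of four specialized agents, focuses on generating precise node connections to improve accuracy, and enhances workflow generation through reinforcement learning. The paper also introduces the large-scale FlowDataset, the FlowBench benchmark, and four new evaluation metrics. Experimental results show that ComfyGPT significantly outperforms existing LLM-based methods in workflow generation.
Key Takeaways
- ComfyGPT is a multi-agent system for automatically generating ComfyUI workflows.
- It consists of four specialized agents: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent.
- ComfyGPT enhances workflow generation through reinforcement learning.
- The paper introduces FlowDataset, a large-scale dataset, and FlowBench, a benchmark for evaluating workflow generation systems.
- It proposes four new evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND).
- Experimental results show that ComfyGPT outperforms existing LLM-based methods in workflow generation.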
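To make the idea of generating node connections rather than whole workflows more concrete, the sketch below assembles a workflow from a list of predicted links. It is an illustration under stated assumptions, not ComfyGPT's code: the link tuple format, the node class names, and the output dictionary layout are all hypothetical and simplified, and should not be read as ComfyUI's actual schema.

```python
def assemble_workflow(node_types, links):
    """Build a workflow dict from predicted node connections.

    node_types: {node_id: class name} (hypothetical names).
    links: list of (src_id, src_output, dst_id, dst_input) tuples.
    """
    workflow = {
        node_id: {"class_type": class_type, "inputs": {}}
        for node_id, class_type in node_types.items()
    }
    for src_id, src_output, dst_id, dst_input in links:
        if src_id not in workflow or dst_id not in workflow:
            raise KeyError(f"link references unknown node: {src_id} -> {dst_id}")
        if dst_input in workflow[dst_id]["inputs"]:
            raise ValueError(f"input {dst_input!r} of node {dst_id} is already connected")
        # Each input slot stores which node and which output feeds it.
        workflow[dst_id]["inputs"][dst_input] = [src_id, src_output]
    return workflow


node_types = {"1": "LoadCheckpoint", "2": "EncodePrompt", "3": "Sampler"}
links = [
    ("1", 1, "2", "clip"),
    ("1", 0, "3", "model"),
    ("2", 0, "3", "positive"),
]
workflow = assemble_workflow(node_types, links)
print(workflow["3"])
```

Because every link is a small, independently checkable unit, invalid output (an unknown node, or an input slot wired twice) can be rejected link by link instead of discarding a whole generated workflow.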

DAVIS: Planning Agent with Knowledge Graph-Powered Inner Monologue
Authors:Minh Pham Dinh, Munira Syed, Michael G Yankoski, Trenton W. Ford
Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human’s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark compared to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS’s World Model demonstrates competitive performance on the famous HotpotQA and MusiqueQA datasets for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline.
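Paper and project links
PDF Accepted to EMNLP 2025 Findings
Summary
Designing a generalist scientific agent that can perform laboratory tasks has become a key goal in AI research. Scientific tasks are inherently complex and require agents to have strong reasoning ability, a structured and temporal understanding of their environment, and a strong emphasis on safety. The paper presents DAVIS, an agent that incorporates structured and temporal memory to enable model-based planning. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS also uses an agentic, multi-turn retrieval system that allows more reasoning over past experiences. On the ScienceWorld benchmark, DAVIS substantially improves on previous approaches in 8 of 9 elementary science subjects. Its World Model is also competitive on the HotpotQA and MusiqueQA datasets for multi-hop question answering. According to the authors, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline.
Key Takeaways
- Designing a generalist scientific agent is a key goal in AI research.
- Scientific tasks are inherently complex and require strong reasoning and a structured, temporal understanding of the environment.
- DAVIS incorporates structured and temporal memory, which enables model-based planning.
- DAVIS uses an agentic, multi-turn retrieval system that allows more reasoning over past experiences.
- DAVIS substantially improves on previous approaches on the ScienceWorld benchmark in 8 of 9 elementary science subjects.
- DAVIS's World Model is competitive on the HotpotQA and MusiqueQA datasets for multi-hop question answering.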

Semantic Alignment-Enhanced Code Translation via an LLM-Based Multi-Agent System
Authors:Zhiqiang Yuan, Weitong Chen, Hanlin Wang, Kai Yu, Xin Peng, Yiling Lou
Code translation converts code from one programming language to another while maintaining its original functionality, which is crucial for software migration, system refactoring, and cross-platform development. Traditional rule-based methods rely on manually-written rules, which can be time-consuming and often result in less readable code. To overcome this, learning-based methods have been developed, leveraging parallel data to train models for automated code translation. More recently, the advance of Large Language Models (LLMs) further boosts learning-based code translation. Although promising, LLM-translated programs still suffer from diverse quality issues (e.g., syntax errors and semantic errors). In particular, it can be challenging for LLMs to self-debug these errors when simply provided with the corresponding error messages. In this work, we propose a novel LLM-based multi-agent system TRANSAGENT, which enhances LLM-based code translation by fixing the syntax errors and semantic errors with the synergy between four LLM-based agents, including Initial Code Translator, Syntax Error Fixer, Code Aligner, and Semantic Error Fixer. The main insight of TRANSAGENT is to first localize the error code block in the target program based on the execution alignment between the target and source program, which can narrow down the fixing space and thus lower the fixing difficulty. To evaluate TRANSAGENT, we first construct a new benchmark from recent programming tasks to mitigate the potential data leakage issue. On our benchmark, TRANSAGENT outperforms the latest LLM-based code translation technique UniTrans in both translation effectiveness and efficiency; additionally, our evaluation on different LLMs shows the generalization of TRANSAGENT and our ablation study shows the contribution of each agent.
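Paper and project links
Summary
The paper presents TRANSAGENT, an LLM-based multi-agent system for code translation. It improves LLM-based translation by fixing syntax and semantic errors through the synergy of four LLM-based agents: Initial Code Translator, Syntax Error Fixer, Code Aligner, and Semantic Error Fixer. Its main insight is to localize the erroneous code block in the target program using the execution alignment between the target and source programs, which narrows the fixing space and lowers the fixing difficulty. On a newly constructed benchmark, TRANSAGENT outperforms the latest LLM-based code translation technique, UniTrans, in both translation effectiveness and efficiency.
Key Takeaways
- Code translation converts code from one programming language to another and is crucial for software migration, system refactoring, and cross-platform development.
- Traditional rule-based methods rely on manually written rules, which is time-consuming and often yields less readable code.
- Learning-based methods train models on parallel data for automated code translation, and advances in Large Language Models (LLMs) have pushed this further.
- LLMs struggle to self-debug translation errors, especially when given only the error messages.
- TRANSAGENT is an LLM-based multi-agent system in which four agents (Initial Code Translator, Syntax Error Fixer, Code Aligner, and Semantic Error Fixer) cooperate to fix quality issues in LLM-based code translation.
- TRANSAGENT's main strategy is to localize the erroneous code block through the execution alignment between the target and source programs, which lowers the fixing difficulty.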
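The localization step can be sketched as follows. This is a simplified illustration, not TRANSAGENT's implementation: it assumes the source and target programs have already been split into aligned blocks (the job the abstract assigns to the Code Aligner), and it models each block as a Python function over a shared state dictionary. All function names are hypothetical.

```python
def first_divergent_block(source_blocks, target_blocks, state):
    """Run aligned blocks side by side and return the index of the first
    block after which the two program states differ, or None if they agree."""
    src_state, tgt_state = dict(state), dict(state)
    for index, (src, tgt) in enumerate(zip(source_blocks, target_blocks)):
        src_state = src(src_state)
        tgt_state = tgt(tgt_state)
        if src_state != tgt_state:
            return index
    return None


def total(s):
    return {**s, "total": sum(s["xs"])}


def avg_floor(s):
    # Source semantics: integer (floor) division.
    return {**s, "avg": s["total"] // len(s["xs"])}


def avg_true(s):
    # Mistranslated block: true division instead of floor division.
    return {**s, "avg": s["total"] / len(s["xs"])}


def label(s):
    return {**s, "label": "high" if s["avg"] > 2 else "low"}


source = [total, avg_floor, label]
target = [total, avg_true, label]

suspect = first_divergent_block(source, target, {"xs": [2, 3]})
no_divergence = first_divergent_block(source, target, {"xs": [2, 2]})
print(suspect, no_divergence)
```

With input `[2, 3]` the states diverge after the second block, so only that block needs to be handed to a fixer. With input `[2, 2]` the bug stays hidden, which shows why the quality of the test inputs matters for this kind of localization.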
