发布日期: 2025-08-08

更新日期: 2025-08-20

文章字数: 18.5k

阅读时长: 75 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-08-08 更新

From MAS to MARS: Coordination Failures and Reasoning Trade-offs in Hierarchical Multi-Agent Robotic Systems within a Healthcare Scenario

Authors:Yuanchen Bai, Zijian Ding, Shaoyue Wen, Xiang Chang, Angelique Taylor

Multi-agent robotic systems (MARS) build upon multi-agent systems by integrating physical and task-related constraints, increasing the complexity of action execution and agent coordination. However, despite the availability of advanced multi-agent frameworks, their real-world deployment on robots remains limited, hindering the advancement of MARS research in practice. To bridge this gap, we conducted two studies to investigate performance trade-offs of hierarchical multi-agent frameworks in a simulated real-world multi-robot healthcare scenario. In Study 1, using CrewAI, we iteratively refine the system’s knowledge base, to systematically identify and categorize coordination failures (e.g., tool access violations, lack of timely handling of failure reports) not resolvable by providing contextual knowledge alone. In Study 2, using AutoGen, we evaluate a redesigned bidirectional communication structure and further measure the trade-offs between reasoning and non-reasoning models operating within the same robotic team setting. Drawing from our empirical findings, we emphasize the tension between autonomy and stability and the importance of edge-case testing to improve system reliability and safety for future real-world deployment. Supplementary materials, including codes, task agent setup, trace outputs, and annotated examples of coordination failures and reasoning behaviors, are available at: https://byc-sophie.github.io/mas-to-mars/.

多智能体机器人系统（MARS）通过整合物理和任务相关约束，建立在多智能体系统的基础上，增加了行动执行和智能体协调的复杂性。然而，尽管有先进的智能体框架可用，它们在机器人上的实际应用部署仍然有限，阻碍了MARS研究的实际应用发展。为了弥补这一差距，我们进行了两项研究，以探讨模拟现实世界的多机器人医疗保健场景中分层多智能体框架的性能权衡。在第一项研究中，我们使用CrewAI逐步完善系统的知识库，系统地识别和分类仅通过提供上下文知识无法解决的协调故障（例如工具访问违规、未能及时处理故障报告）。在第二项研究中，我们使用AutoGen评估了重新设计的双向通信结构，并进一步测量了同一机器人团队环境中推理和非推理模型之间的权衡。根据我们的实证发现，我们强调了自主性与稳定性之间的紧张关系以及边缘案例测试对改善系统可靠性和未来实际应用部署安全的重要性。补充材料包括代码、任务智能体设置、跟踪输出以及协调失败和推理行为的注释示例，可在以下网址找到：https://byc-sophie.github.io/mas-to-mars/。

论文及项目相关链接

PDF

Summary
多智能体机器人系统（MARS）在构建过程中结合了物理和任务相关的约束条件，增加了动作执行和智能体协调的复杂性。为弥现实世界应用上的差距，研究人员进行两项研究调查分层多智能体框架在模拟多机器人护理场景的权衡效益。第一通过CrewAI研究工具研究不可由单一知识解决的不解之处（协调故障）。第二采用AutoGen衡量了同一个机器人团队中的推理与非推理模型的权衡关系。研究者从研究中强调了自主性与稳定性之间的张力，并强调边缘案例测试对改善系统可靠性和安全性的重要性，为将来的现实应用提供重要的参考。更多资料可通过链接获取。

Key Takeaways

MARS结合了物理和任务相关约束，增加了动作执行和智能体协调的复杂性。
研究旨在通过实证研究分析分层多智能体框架在多机器人护理场景中的权衡效益。
通过CrewAI研究工具研究不可通过单一知识解决的协调问题。
通过AutoGen评估了机器人团队中的推理与非推理模型的权衡关系。
研究结果强调了自主性与稳定性之间的张力，以及边缘案例测试对系统可靠性和安全性的重要性。

Cool Papers

点此查看论文截图

TurboTrain: Towards Efficient and Balanced Multi-Task Learning for Multi-Agent Perception and Prediction

Authors:Zewei Zhou, Seth Z. Zhao, Tianhui Cai, Zhiyu Huang, Bolei Zhou, Jiaqi Ma

End-to-end training of multi-agent systems offers significant advantages in improving multi-task performance. However, training such models remains challenging and requires extensive manual design and monitoring. In this work, we introduce TurboTrain, a novel and efficient training framework for multi-agent perception and prediction. TurboTrain comprises two key components: a multi-agent spatiotemporal pretraining scheme based on masked reconstruction learning and a balanced multi-task learning strategy based on gradient conflict suppression. By streamlining the training process, our framework eliminates the need for manually designing and tuning complex multi-stage training pipelines, substantially reducing training time and improving performance. We evaluate TurboTrain on a real-world cooperative driving dataset, V2XPnP-Seq, and demonstrate that it further improves the performance of state-of-the-art multi-agent perception and prediction models. Our results highlight that pretraining effectively captures spatiotemporal multi-agent features and significantly benefits downstream tasks. Moreover, the proposed balanced multi-task learning strategy enhances detection and prediction.

端到端训练多智能体系统在提高多任务性能方面具有显著优势。然而，训练此类模型仍然具有挑战性，需要大量的手动设计和监控。在这项工作中，我们介绍了TurboTrain，这是一种用于多智能体感知和预测的新型高效训练框架。TurboTrain包含两个关键组件：基于掩膜重建学习的多智能体时空预训练方案和基于梯度冲突抑制的平衡多任务学习策略。通过优化训练过程，我们的框架无需手动设计和调整复杂的分阶段训练管道，从而大大减少了训练时间并提高了性能。我们在现实世界的合作驾驶数据集V2XPnP-Seq上对TurboTrain进行了评估，并证明它进一步提高了最先进的多智能体感知和预测模型的性能。我们的结果强调，预训练有效地捕获了时空多智能体特征，并为下游任务带来了显著的好处。此外，提出的平衡多任务学习策略提高了检测和预测能力。

论文及项目相关链接

PDF ICCV 2025

Summary

本工作引入了一种名为TurboTrain的新型高效多智能体感知和预测训练框架，包含两个关键组件：基于掩码重建学习的多智能体时空预训练方案和基于梯度冲突抑制的平衡多任务学习策略。通过优化训练过程，该框架消除了手动设计和调整复杂多阶段训练管道的需求，大幅缩短了训练时间并提高了性能。在真实世界合作驾驶数据集V2XPnP-Seq上的评估结果表明，TurboTrain能够进一步提升最先进的智能体感知和预测模型性能。其中预训练能够有效捕获时空多智能体特征，显著促进下游任务性能的提升；而提出的平衡多任务学习策略则增强了检测和预测能力。

Key Takeaways

TurboTrain框架包含两个关键组件：多智能体时空预训练方案和平衡多任务学习策略。
多智能体时空预训练基于掩码重建学习，有助于捕获智能体间的复杂交互和时空关系。
平衡多任务学习策略通过梯度冲突抑制机制提升模型在多任务场景下的性能。
TurboTrain简化了多智能体系统的训练过程，降低了复杂性和训练时间。
在真实世界合作驾驶数据集上的评估证明了TurboTrain对提升智能体感知和预测模型性能的有效性。
预训练能够有效捕获时空多智能体特征，对下游任务性能有显著提升。

Cool Papers

点此查看论文截图

CONVERGE: A Multi-Agent Vision-Radio Architecture for xApps

Authors:Filipe B. Teixeira, Carolina Simões, Paulo Fidalgo, Wagner Pedrosa, André Coelho, Manuel Ricardo, Luis M. Pessoa

Telecommunications and computer vision have evolved independently. With the emergence of high-frequency wireless links operating mostly in line-of-sight, visual data can help predict the channel dynamics by detecting obstacles and help overcoming them through beamforming or handover techniques. This paper proposes a novel architecture for delivering real-time radio and video sensing information to O-RAN xApps through a multi-agent approach, and introduces a new video function capable of generating blockage information for xApps, enabling Integrated Sensing and Communications. Experimental results show that the delay of sensing information remains under 1,ms and that an xApp can successfully use radio and video sensing information to control the 5G/6G RAN in real-time.

电信和计算机视觉已独立发展。随着高频无线链路的出现，主要用于视线传输，视觉数据可以通过检测障碍物预测信道动态变化并有助于克服通过波束成形或切换技术。本文提出了一种通过多智能体方法将实时无线电和视频传感信息传输到O-RAN xApps的新型架构，并引入了一种可为xApps生成阻挡信息的新视频功能，以实现综合感知和通信。实验结果表明，传感信息延迟低于1毫秒，并且xApp可以成功利用无线电和视频传感信息实时控制5G/6G RAN。

论文及项目相关链接

PDF 7 pages, 5 figures

Summary
在电信与计算机视觉独立发展的背景下，本研究提出了一种新型架构，通过多代理方法将实时无线电和视频感知信息传递给O-RAN xApps。该研究利用视觉数据预测信道动态，通过生成阻挡信息为xApps提供支持，实现集成感知和通信功能。实验结果显示，感知信息的延迟低于1毫秒，xApp能够成功利用无线电和视频感知信息实时控制5G/6G RAN。

Key Takeaways

电信与计算机视觉的各自发展：介绍了电信与计算机视觉领域的独立发展，为整合两者打下基础。
高频无线链路与视觉数据的结合：视觉数据可帮助预测信道动态，通过检测障碍物来帮助克服通信障碍。
新型架构的提出：研究提出了一种新的架构，通过多代理方法传递实时无线电和视频感知信息给O-RAN xApps。
视频功能的引入：引入了一种能够生成阻挡信息的视频功能，为xApps提供支持。
集成感知和通信的实现：通过结合无线电和视频感知信息，实现了集成感知和通信功能。
实验结果：实验结果显示感知信息的延迟极低，并且xApp能够成功利用这些信息实时控制RAN。

Cool Papers

点此查看论文截图

Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation

Authors:Jinxing Zhou, Yanghao Zhou, Mingfei Han, Tong Wang, Xiaojun Chang, Hisham Cholakkal, Rao Muhammad Anwer

Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. From a novel perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process, mimicking the human reasoning procedure by first identifying the referred object through multimodal analysis, followed by coarse-grained grounding and precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains for Ref-Thinker fine-tuning. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R\textsuperscript{2}-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both standard Ref-AVSBench and proposed R\textsuperscript{2}-AVSBench. Code will be available at https://github.com/jasongief/TGS-Agent.

参照视听分割（Ref-AVS）旨在根据给定的参考表达式对可听视频中的目标对象进行分割。早期的工作通常依赖于通过多模态融合学习潜在嵌入，以提示可调SAM/SAM2解码器进行分割，这需要强烈的像素级监督并且缺乏可解释性。从明确的参考理解这一新颖角度出发，我们提出了TGS-Agent，它将任务分解为Think-Ground-Segment过程，通过首先通过多模态分析识别所指对象，然后进行粗粒度定位，最后精确分割，模仿人类的推理过程。为此，我们首先提出了Ref-Thinker，这是一个能够推理文本、视觉和听觉线索的多模态语言模型。我们构建了一个指令微调数据集，其中包含用于Ref-Thinker精细调整的具有明确对象感知的think-answer链。通过Ref-Thinker推断的对象描述被用作Grounding-DINO和SAM婴儿床的明确提示，它们执行定位与分割无需依赖像素级监督。此外，我们介绍了R² - AVSBench基准测试数据集用于更好的评估模型泛化能力具有语言多样性和推理密集型参考集的新基准测试数据集R² AVSBench上进行评估我们的方法在标准Ref AVSBench以及提出的R² AVSBench上均达到了最先进的成果相关代码将在https://github.com/jasongief/TGS-Agent上提供。

Summary

本文介绍了针对音频视频分割任务的新型方法TGS-Agent。该方法采用明确参考理解的角度，将任务分解为Think-Ground-Segment过程，通过多模态分析识别目标对象，然后进行粗略定位与精确分割。为此，提出了Ref-Thinker多模态语言模型，能够推理文本、视觉和听觉线索。构建了用于Ref-Thinker精细调整的指令调整数据集，其中包含明确的目标感知思考答案链。通过Ref-Thinker推断的对象描述作为对Grounding-DINO和SAM2的明确提示，进行定位和分割，无需像素级监督。此外，引入了新的基准测试R\textsuperscript{2}-AVSBench，用于评估模型对语言多样性和推理密集参考的泛化能力。TGS-Agent在标准Ref-AVSBench和提出的R\textsuperscript{2}-AVSBench上均取得最佳结果。

Key Takeaways

TGS-Agent采用明确参考理解方法，将音频视频分割任务分解为Think-Ground-Segment过程。
提出Ref-Thinker多模态语言模型，能推理文本、视觉和听觉线索，用于识别目标对象。
构建了指令调整数据集用于Ref-Thinker的精细调整，包含目标感知思考答案链。
Ref-Thinker推断的对象描述作为对Grounding-DINO和SAM2的明确提示，实现定位和分割。
无需像素级监督，提高了模型的泛化能力。
引入新的基准测试R\textsuperscript{2}-AVSBench，以评估模型在语言多样性和推理密集参考方面的性能。
TGS-Agent在标准测试上取得最佳结果。

Cool Papers

点此查看论文截图

Beyond Pixels: Exploring DOM Downsampling for LLM-Based Web Agents

Authors:Thassilo M. Schiepanski, Nicholas Piël

Frontier LLMs only recently enabled serviceable, autonomous web agents. At that, a model poses as an instantaneous domain model backend. Ought to suggest interaction, it is consulted with a web-based task and respective application state. The key problem lies in application state serialisation $\unicode{x2013}$ referred to as snapshot. State-of-the-art web agents are premised on grounded GUI snapshots, i.e., screenshots enhanced with visual cues. Not least to resemble human perception, but for images representing relatively cheap means of model input. LLM vision still lag behind code interpretation capabilities. DOM snapshots, which structurally resemble HTML, impose a desired alternative. Vast model input token size, however, disables reliable implementation with web agents to date. We propose D2Snap, a first-of-its-kind DOM downsampling algorithm. Based on a GPT-4o backend, we evaluate D2Snap on tasks sampled from the Online-Mind2Web dataset. The success rate of D2Snap-downsampled DOM snapshots (67%) matches a grounded GUI snapshot baseline (65%) $\unicode{x2013}$ within the same input token order of magnitude (1e3). Our best evaluated configurations $\unicode{x2013}$ one token order above, but within the model’s context window $\unicode{x2013}$ outperform this baseline by 8%. Our evaluation, moreover, yields that DOM-inherent hierarchy embodies a strong UI feature for LLMs.

前沿的大型语言模型（LLMs）最近才开始实现可服务的、自主的Web代理。在此情况下，一个模型充当即时领域模型后端。当需要建议交互时，它会与基于Web的任务和相应的应用程序状态进行协商。关键问题在于应用程序状态的序列化——被称为快照。最先进的Web代理基于实际的GUI快照，即增强视觉线索的截图。这不仅是为了模仿人类感知，而且图像代表了一种相对廉价的模型输入手段。然而，LLM的视野仍然落后于代码解释能力。DOM快照从结构上类似于HTML，成为了一种理想的选择。然而，巨大的模型输入令牌规模仍然使可靠的Web代理实现难以实现。我们提出了D2Snap，这是一种首创的DOM降采样算法。基于GPT-4o后端，我们对Online-Mind2Web数据集的任务样本对D2Snap进行了评估。D2Snap降采样DOM快照的成功率（67%）与基于实际GUI的快照基线（65%）相匹配——处于相同的输入令牌数量级（1e3）内。我们评估的最佳配置——比模型上下文窗口高一个令牌数量级——但超出了这个基线8%。此外，我们的评估还表明，DOM固有的层次结构对于大型语言模型来说是一个强大的用户界面特征。

论文及项目相关链接

PDF

Summary

基于前沿的大型语言模型（LLM），出现了能够提供服务、自主运行的网页代理。模型作为即时领域模型后端，通过咨询web任务及相关应用程序状态进行交互。主要问题在于应用程序状态序列化，即所谓的快照技术。当前先进的网页代理依赖于基于图像的快照技术，即增强视觉线索的截图。为了模仿人类感知，图像成为模型输入的相对廉价手段。然而，LLM在视觉方面仍然落后于代码解释能力。DOM快照的结构类似于HTML，成为了一种理想的替代方案。然而，巨大的模型输入令牌规模阻碍了网页代理的可靠实现。为此，我们提出了D2Snap，这是一种首创的DOM降采样算法。基于GPT-4o后端，我们对Online-Mind2Web数据集的任务样本进行了评估。D2Snap降采样DOM快照的成功率（67%）与基于图像的快照基线（65%）相当，且输入令牌数量级相同（1e3）。在模型上下文窗口内，最佳配置超越了这一基线8%。此外，我们的评估显示DOM固有的层次结构对LLMs来说是一个强大的用户界面特征。

Key Takeaways

前沿LLM使得自主web代理成为可能。
当前web代理面临的主要挑战在于应用程序状态序列化，即快照技术。
基于图像的快照技术模仿了人类感知，但LLM在视觉方面的能力仍然有限。
DOM快照成为了一种理想的替代方案，但实现上存在可靠性问题。
提出的D2Snap算法是首创的DOM降采样方法，性能优越。
D2Snap的成功率与基于图像的快照基线相当，并且在某些配置下超越了这一基线。
DOM固有的层次结构对LLMs来说是一个强大的用户界面特征。

Cool Papers

点此查看论文截图

Multi-Agent Taskforce Collaboration: Self-Correction of Compounding Errors in Long-Form Literature Review Generation

Authors:Zhi Zhang, Yan Liu, Zhejing Hu, Gong Chen, Sheng-hua Zhong, Jiannong Cao

Literature reviews play an important role in scientific research. Recent advances in large language models (LLMs) have boosted the development of automated systems for the entire literature review workflow, from retrieval to manuscript drafting. However, a key challenge is that mistakes made in early stages can propagate and amplify in subsequent steps, leading to compounding errors that undermine the faithfulness of the final review. To tackle this issue, we propose the Multi-Agent Taskforce Collaboration (MATC) framework, which consists of a manager agent and four executor agents for literature searching, outline generation, fact localization, and manuscript drafting. We propose three novel collaboration paradigms, forming exploration, exploitation, and experience taskforces, to effectively organize agents and mitigate compounding errors both between and within executor agents. Experimental results show that MATC achieves state-of-the-art performance on existing benchmarks. We further propose a new benchmark dataset featuring more diverse topics for faithful literature review generation.

文献综述在科学研究领域扮演着重要角色。近期，大型语言模型（LLM）的进步推动了整个文献综述工作流的自动化系统的发展，从检索到手稿起草。然而，一个关键挑战在于，早期阶段的错误可能会在后续步骤中传播和放大，从而导致累积错误，破坏最终评审的忠实性。为了解决这个问题，我们提出了多代理任务力协作（MATC）框架，该框架由一个管理代理和四个执行代理组成，分别负责文献搜索、大纲生成、事实定位和手稿起草。我们提出了三种新的协作范式，即探索、利用和经验任务力，以有效地组织代理并减轻执行代理之间以及内部的累积错误。实验结果表明，MATC在现有基准测试上达到了最新技术水平。我们进一步提出了一个包含更多主题的新基准数据集，用于生成忠实的文献综述。

论文及项目相关链接

PDF

Summary

本文介绍了文献综述在科学研究中扮演的重要角色，以及近期大型语言模型（LLMs）的进步推动了自动化文献综述工作流的开发。然而，早期阶段的错误会累积并放大后续步骤中的问题，导致最终审查的忠实性受到损害。为解决这一问题，本文提出了多代理任务协作（MATC）框架，包括一个管理代理和四个执行代理，用于文献检索、大纲生成、事实定位和手稿起草。通过形成探索、开发和经验任务小组，有效组织代理并减轻执行代理内部和外部的累积错误。实验结果表明，MATC在现有基准测试上取得了最新技术水平的性能表现。此外，本文还提出了一个新的包含更多主题多样性的基准数据集，用于忠实文献综述生成。

Key Takeaways

文献综述在科学研究中具有重要地位。
大型语言模型的进步推动了自动化文献综述工作流的开发。
早期阶段的错误会累积并影响后续步骤，导致忠实性问题。
提出多代理任务协作（MATC）框架，包括管理代理和四个执行代理。
形成探索、开发和经验任务小组以有效组织代理并减少错误。
MATC在现有基准测试中表现优异。

Cool Papers

点此查看论文截图

ShoppingBench: A Real-World Intent-Grounded Shopping Benchmark for LLM-based Agents

Authors:Jiangyuan Wang, Kejun Xiao, Qi Sun, Huaipeng Zhao, Tao Luo, Jiandong Zhang, Xiaoyi Zeng

Existing benchmarks in e-commerce primarily focus on basic user intents, such as finding or purchasing products. However, real-world users often pursue more complex goals, such as applying vouchers, managing budgets, and finding multi-products seller. To bridge this gap, we propose ShoppingBench, a novel end-to-end shopping benchmark designed to encompass increasingly challenging levels of grounded intent. Specifically, we propose a scalable framework to simulate user instructions based on various intents derived from sampled real-world products. To facilitate consistent and reliable evaluations, we provide a large-scale shopping sandbox that serves as an interactive simulated environment, incorporating over 2.5 million real-world products. Experimental results demonstrate that even state-of-the-art language agents (such as GPT-4.1) achieve absolute success rates under 50% on our benchmark tasks, highlighting the significant challenges posed by our ShoppingBench. In addition, we propose a trajectory distillation strategy and leverage supervised fine-tuning, along with reinforcement learning on synthetic trajectories, to distill the capabilities of a large language agent into a smaller one. As a result, our trained agent achieves competitive performance compared to GPT-4.1.

现有的电子商务基准测试主要关注用户的基本意图，如查找或购买产品。然而，现实世界的用户通常追求更复杂的目标，如使用优惠券、管理预算和寻找多产品卖家。为了弥补这一差距，我们提出了ShoppingBench这一全新的端到端购物基准测试，旨在涵盖日益复杂的基于意图的级别。具体来说，我们提出了一个可扩展的框架，基于从现实世界采样产品中得出的各种意图来模拟用户指令。为了促进一致和可靠的评估，我们提供了一个大规模的购物沙盘，作为一个交互式模拟环境，其中包含了超过250万种真实世界的产品。实验结果表明，即使在我们的基准测试任务上，最先进的语言智能体（如GPT-4.1）的绝对成功率也低于50%，这凸显了我们的ShoppingBench所带来的重大挑战。此外，我们提出了一种轨迹蒸馏策略，并利用监督微调以及对合成轨迹的强化学习，将大型语言智能体的能力提炼到小型智能体上。结果，我们训练的智能体表现出与GPT-4.1相竞争的性能。

论文及项目相关链接

PDF submit to AAAI2026

Summary

在电子商务领域，现有基准测试主要关注用户的基本意图，如查找或购买产品。然而，真实用户通常追求更复杂的目标，如使用优惠券、管理预算和寻找多产品卖家。为了弥补这一差距，我们提出了ShoppingBench，这是一个新的端到端购物基准测试，旨在涵盖日益复杂的基于意图的基准测试。我们提出了一个可扩展的框架，基于采样自真实世界的产品来模拟用户指令。我们还提供了一个大型购物沙盘，作为模拟环境，包含超过250万真实产品，以确保一致和可靠的评估。实验结果表明，即使在我们的基准测试任务上，最先进的语言代理（如GPT-4.1）的绝对成功率也低于50%，突显出ShoppingBench的重大挑战。此外，我们提出了一种轨迹蒸馏策略，利用监督微调与合成轨迹的强化学习来提炼大型语言代理的能力。经过训练的代理表现与GPT-4.1相比具有竞争力。

Key Takeaways

现有电子商务基准测试主要关注基本用户意图，但真实用户追求更复杂的目标。
ShoppingBench是一个新的端到端购物基准测试，旨在涵盖更复杂的基于意图的基准测试。
ShoppingBench使用模拟用户指令和真实世界产品的采样来模拟现实购物场景。
实验结果表明，即使是最先进的语言代理在面对ShoppingBench的基准测试任务时也会面临重大挑战。
提出了一种轨迹蒸馏策略，通过监督微调与强化学习提高代理性能。
经过训练的代理表现与GPT-4.1相比具有竞争力。

Cool Papers

点此查看论文截图

Empowering Time Series Forecasting with LLM-Agents

Authors:Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng

Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.

基于大型语言模型（LLM）的代理已作为自动化机器学习（AutoML）系统的有效规划器而出现。虽然大多数现有的AutoML方法侧重于自动化特征工程和模型架构搜索，但最近的时间序列预测研究表明，轻量级模型通常可以达到最先进的性能。这一观察结果促使我们探索改进数据质量，而不是模型架构，作为AutoML在时间序列数据上可能硕果累累的方向。我们提出了DCATS，一个面向时间序列的数据中心代理。DCATS利用伴随时间序列的元数据来清理数据，同时优化预测性能。我们在大规模交通流量预测数据集上使用四个时间序列预测模型评估了DCATS。结果表明，DCATS在所有测试模型和所有时间范围内实现了平均6%的误差降低，突出了数据中心方法在AutoML时间序列预测中的潜力。

论文及项目相关链接

PDF

Summary
大型语言模型驱动的代理在自动化机器学习（AutoML）系统中发挥了有效的规划作用。针对时间序列预测，研究发现在时间序列数据上采用以数据为中心的方法可能更有成效，于是我们提出了DCATS数据为中心的时间序列代理。DCATS利用时间序列的元数据对数据进行清理并优化预测性能。在大型交通流量预测数据集上，对四种时间序列预测模型进行的评估表明，DCATS在所有测试模型和时间范围内平均减少了6%的误差，突显了数据为中心的方法在AutoML时间序列预测中的潜力。

Key Takeaways

大型语言模型（LLM）在自动化机器学习（AutoML）系统中扮演有效规划角色。
现有AutoML方法主要关注特征工程和模型架构搜索的自动化。
针对时间序列预测，研究发现轻量化模型常能达到最佳性能。
数据质量改进是一个有前景的方向，提出DCATS数据为中心的时间序列代理。
DCATS利用元数据清理数据并优化预测性能。
在大型交通流量预测数据集上的评估显示，DCATS降低了平均6%的误差。

Cool Papers

点此查看论文截图

AgREE: Agentic Reasoning for Knowledge Graph Completion on Emerging Entities

Authors:Ruochen Zhao, Simone Conia, Eric Peng, Min Li, Saloni Potdar

Open-domain Knowledge Graph Completion (KGC) faces significant challenges in an ever-changing world, especially when considering the continual emergence of new entities in daily news. Existing approaches for KGC mainly rely on pretrained language models’ parametric knowledge, pre-constructed queries, or single-step retrieval, typically requiring substantial supervision and training data. Even so, they often fail to capture comprehensive and up-to-date information about unpopular and/or emerging entities. To this end, we introduce Agentic Reasoning for Emerging Entities (AgREE), a novel agent-based framework that combines iterative retrieval actions and multi-step reasoning to dynamically construct rich knowledge graph triplets. Experiments show that, despite requiring zero training efforts, AgREE significantly outperforms existing methods in constructing knowledge graph triplets, especially for emerging entities that were not seen during language models’ training processes, outperforming previous methods by up to 13.7%. Moreover, we propose a new evaluation methodology that addresses a fundamental weakness of existing setups and a new benchmark for KGC on emerging entities. Our work demonstrates the effectiveness of combining agent-based reasoning with strategic information retrieval for maintaining up-to-date knowledge graphs in dynamic information environments.

开放领域知识图谱补全（KGC）在日新月异的世界中面临着重大挑战，特别是考虑到日常新闻中不断出现的新实体。现有的KGC方法主要依赖于预训练语言模型的参数知识、预先构建的查询或单步检索，通常需要大量的监督和训练数据。尽管如此，它们往往无法捕获关于不受欢迎和/或新兴实体的全面和最新的信息。为此，我们引入了用于新兴实体的Agentic推理（AgREE），这是一种新型基于代理的框架，结合了迭代检索动作和多步推理，以动态构建丰富的知识图谱三元组。实验表明，尽管不需要任何训练努力，AgREE在构建知识图谱三元组方面显著优于现有方法，尤其是对于在语言模型训练过程中未见的新兴实体，其性能优于以前的方法高达13.7%。此外，我们提出了一种新的评估方法，解决了现有设置的基本弱点，以及针对新兴实体的KGC的新基准测试。我们的工作证明了在动态信息环境中结合基于代理的推理和战略信息检索以维持最新知识图谱的有效性。

论文及项目相关链接

PDF

Summary
知识图谱补全（KGC）在日益变化的世界中面临巨大挑战，尤其对于持续涌现的新兴实体。现有方法主要依赖预训练语言模型的参数知识、预构建查询或单步检索，需要庞大的监督数据，但仍难以获取全面、最新的关于非热门或新兴实体的信息。为解决这一问题，我们提出了基于Agent的新兴实体推理（AgREE）框架，结合迭代检索动作和多步推理来动态构建丰富的知识图谱三元组。实验表明，AgREE在构建知识图谱三元组方面显著优于现有方法，尤其是对未在语言模型训练过程中出现的新兴实体，提升幅度高达13.7%。我们还提出了一种新的评估方法和为KGC新兴实体设立的新的基准测试。我们的工作证明了在动态信息环境中结合基于Agent的推理与策略性信息检索来维护最新知识图谱的有效性。

Key Takeaways

知识图谱补全（KGC）在处理新兴实体时面临挑战。
现有KGC方法主要依赖预训练语言模型，但难以获取关于新兴实体的全面、最新信息。
提出的AgREE框架结合迭代检索动作和多步推理，能动态构建知识图谱三元组。
AgREE在构建知识图谱三元组方面显著优于现有方法，尤其针对新兴实体。
AgREE提升幅度高达13.7%，显示出其有效性。
研究者还提出了一种新的KGC评估方法和基准测试。

Cool Papers

点此查看论文截图

XARP Tools: An Extended Reality Platform for Humans and AI Agents

Authors:Arthur Caetano, Misha Sra

This technical report presents XARP Tools, an extended reality (XR) framework designed for human and AI developers alike. XARP comprises a server-side Python library and platform-specific XR clients. The library offers high-level APIs and communicates with clients via a JSON-based protocol over WebSockets. XR clients encapsulate device and runtime specifics, providing responsive, low-latency user interaction. XARP can be utilized in three ways: (i) as a library that abstracts XR development for humans; (ii) as a set of callable tools that allow AI agents to drive on-the-fly interactions with users; and (iii) as a Model Context Protocol server that plugs XR devices into AI ecosystems. XARP code and working examples are released openly at https://github.com/HAL-UCSB/xarp.

本技术报告介绍了XARP工具，这是一个为人工智能开发人员和人类开发人员设计的扩展现实（XR）框架。XARP包括服务器端Python库和针对平台的XR客户端。该库提供高级API，并通过WebSocket上的基于JSON的协议与客户端进行通信。XR客户端封装了设备和运行时环境的详细信息，提供响应迅速、低延迟的用户交互。XARP可以以三种方式使用：（i）作为一个人类XR开发的抽象库；（ii）作为一组可调用工具，允许AI代理实时与用户进行交互；（iii）作为将XR设备接入AI生态系统的模型上下文协议服务器。XARP代码和工作示例已在https://github.com/HAL-UCSB/xarp上公开发布。

论文及项目相关链接

PDF

Summary
本技术报告介绍了XARP工具，这是一个为人工智能开发人员和人类开发人员设计的扩展现实（XR）框架。XARP包含一个服务器端Python库和平台特定的XR客户端。该库提供高级API，并通过WebSocket使用基于JSON的协议与客户端通信。XR客户端封装了设备和运行时详细信息，提供响应迅速、低延迟的用户交互。XARP可以以三种方式利用：（i）作为为人类抽象XR开发的库；（ii）作为允许AI代理驱动即时用户交互的可调用工具集；（iii）作为将XR设备插入AI生态系统的模型上下文协议服务器。XARP的代码和工作示例已在https://github.com/HAL-UCSB/xarp上公开发布。

Key Takeaways

XARP工具是一个扩展现实（XR）框架，适用于人工智能开发人员和人类开发人员。
XARP包含一个服务器端Python库和平台特定的XR客户端，提供高级API和基于JSON的协议通信。
XR客户端封装设备和运行时详细信息，实现响应迅速、低延迟的用户交互。
XARP可以作为人类抽象XR开发的库使用。
XARP允许AI代理驱动即时用户交互，作为可调用工具集使用。
XARP可以作为模型上下文协议服务器，将XR设备插入AI生态系统。

Cool Papers

点此查看论文截图

HyCodePolicy: Hybrid Language Controllers for Multimodal Monitoring and Decision in Embodied Agents

Authors:Yibin Liu, Zhixuan Liang, Zanxin Chen, Tianxing Chen, Mengkang Hu, Wanxi Dong, Congsheng Xu, Zhaoming Han, Yusen Qin, Yao Mu

Recent advances in multimodal large language models (MLLMs) have enabled richer perceptual grounding for code policy generation in embodied agents. However, most existing systems lack effective mechanisms to adaptively monitor policy execution and repair codes during task completion. In this work, we introduce HyCodePolicy, a hybrid language-based control framework that systematically integrates code synthesis, geometric grounding, perceptual monitoring, and iterative repair into a closed-loop programming cycle for embodied agents. Technically, given a natural language instruction, our system first decomposes it into subgoals and generates an initial executable program grounded in object-centric geometric primitives. The program is then executed in simulation, while a vision-language model (VLM) observes selected checkpoints to detect and localize execution failures and infer failure reasons. By fusing structured execution traces capturing program-level events with VLM-based perceptual feedback, HyCodePolicy infers failure causes and repairs programs. This hybrid dual feedback mechanism enables self-correcting program synthesis with minimal human supervision. Our results demonstrate that HyCodePolicy significantly improves the robustness and sample efficiency of robot manipulation policies, offering a scalable strategy for integrating multimodal reasoning into autonomous decision-making pipelines.

最近的多模态大型语言模型（MLLM）的进步为实体代理中的代码策略生成提供了更丰富的感知基础。然而，大多数现有系统缺乏有效的机制来在任务完成过程中自适应地监视策略执行和修复代码。在这项工作中，我们引入了HyCodePolicy，这是一个基于混合语言的控制框架，它系统地集成了代码合成、几何基础、感知监控和迭代修复，形成一个闭环编程周期，用于实体代理。从技术上讲，给定自然语言指令后，我们的系统首先将其分解为子目标并生成基于对象中心几何原始数据的初始可执行程序。然后该程序在仿真中执行，同时视觉语言模型（VLM）会观察选定检查点以检测和定位执行失败并推断失败原因。通过将捕获程序级事件的结构化执行轨迹与基于VLM的感知反馈相结合，HyCodePolicy能够推断出失败原因并修复程序。这种混合的双重反馈机制能够实现自我修正的程序合成，几乎无需人工监督。我们的结果表明，HyCodePolicy显著提高了机器人操作策略的稳健性和样本效率，为将多模态推理集成到自主决策管道中提供了可扩展的策略。

论文及项目相关链接

PDF Accepted to ICCV 2025 Workshop on Multi-Modal Reasoning for Agentic Intelligence

Summary

HyCodePolicy，一种基于语言的混合控制框架，实现了对体素智能体编码策略生成过程中的模拟执行进行自适应监测与程序修复的功能集成。框架包括代码合成、几何定位、感知监测及迭代修复环节，有效实现闭环编程周期。

Key Takeaways

多模态大型语言模型（MLLMs）的进步为体素智能体编码策略生成提供了更丰富感知基础。
当前系统缺乏在任务完成过程中自适应监测政策执行和修复代码的有效机制。
HyCodePolicy系统通过分解自然语言指令生成初始可执行程序，并模拟执行。
借助视觉语言模型（VLM）观测特定检查点以检测并定位执行失败，推断失败原因。
结构化执行轨迹与基于VLM的感知反馈的融合，使HyCodePolicy能够推断失败原因并修复程序。
混合双重反馈机制实现了自我修正的程序合成，并减少了对人工监督的需求。

Cool Papers

点此查看论文截图

The Dark Side of LLMs: Agent-based Attacks for Complete Computer Takeover

Authors:Matteo Lupinacci, Francesco Aurelio Pironti, Francesco Blefari, Francesco Romeo, Luigi Arena, Angelo Furfaro

The rapid adoption of Large Language Model (LLM) agents and multi-agent systems enables remarkable capabilities in natural language processing and generation. However, these systems introduce unprecedented security vulnerabilities that extend beyond traditional content generation attacks to system-level compromise. This paper presents a comprehensive evaluation of the security of LLMs used as reasoning engines within autonomous agents, highlighting how they can be exploited as attack vectors capable of achieving complete computer takeover. We focus on how different attack surfaces and trust boundaries - Direct Prompt Injection, RAG Backdoor, and Inter Agent Trust - can be leveraged to orchestrate such takeovers. We demonstrate that adversaries can effectively coerce popular LLMs (including GPT-4, Claude-4 and Gemini-2.5) into autonomously installing and executing malware on victim machines. Our evaluation of 18 state-of-the-art LLMs reveals an alarming scenario: 94.4% of models succumb to Direct Prompt Injection and 83.3% are vulnerable to the more stealth and evasive RAG Backdoor Attack. Notably, we tested trust boundaries within multi-agent systems, where LLM agents interact and influence each other, and we revealed a critical security flaw: LLMs which successfully resist direct injection or RAG backdoor will execute identical payloads when requested by peer agents. Our findings show that 100.0% of tested LLMs can be compromised through Inter-Agent Trust Exploitation attacks and that every model exhibits context-dependent security behaviors that create exploitable blind spots. Our results also highlight the need to increase awareness and research on the security risks of LLMs, showing a paradigm shift in cybersecurity threats, where AI tools themselves become sophisticated attack vectors.

大型语言模型（LLM）代理的快速采用和多代理系统的出现，为自然语言处理和生成带来了显著的能力。然而，这些系统也引入了前所未有的安全漏洞，这些漏洞超出了传统的内容生成攻击，涉及到系统级别的妥协。本文对作为自主代理中的推理引擎使用的大型语言模型的安全性能进行了全面评估，并强调了它们如何利用这些漏洞实现完全的计算机接管，作为攻击向量。我们重点关注如何利用不同的攻击面和信任边界，如直接提示注入、RAG后门和代理间信任，来协调此类接管行动。我们证明，对手可以有效地迫使流行的大型语言模型（包括GPT-4、Claude-4和Gemini-2. 通过对当前最先进的18个大型语言模型的评估，我们发现了一个令人担忧的场景：有94.4%的模型直接受到提示注入的影响，有83.3%的模型容易受到更隐蔽和更具逃避性的RAG后门攻击的影响。值得注意的是，我们在多代理系统中测试了信任边界，其中大型语言模型代理相互交互和影响，我们发现了一个关键的安全漏洞：即使某些大型语言模型成功抵抗了直接注入或RAG后门攻击，但当收到对等代理的请求时，它们仍会执行相同的负载。我们的研究结果表明，通过代理间信任剥削攻击，所有测试的大型语言模型都有可能受到威胁，每个模型都表现出与上下文相关的安全行为，这些行为产生了可利用的盲点。我们的研究结果还强调了对大型语言模型安全风险的认识和研究需求增加的重要性，并显示了网络安全威胁的一个范式转变，人工智能工具本身已成为复杂的攻击向量。

论文及项目相关链接

PDF

Summary

本文全面评估了大型语言模型（LLM）作为自主推理引擎的安全性，揭示了其存在的重大安全漏洞。文章指出，LLM代理和多代理系统虽然能带来卓越的自然语言处理与生成能力，但也带来了前所未有的安全风险。这些风险不仅限于传统的内容生成攻击，更可能引发系统级别的破坏。文章展示了如何通过不同的攻击面和信任边界来利用LLM作为攻击媒介，实现对计算机的全面接管。测试结果显示，大多数LLM模型存在安全漏洞，容易受到攻击。因此，必须提高对LLM安全风险的意识，并加强相关研究。

Key Takeaways

大型语言模型（LLM）和多代理系统引入前所未有的安全风险。
LLM可作为攻击媒介，实现对计算机的全面接管。
文章全面评估了LLM的安全性，揭示了其存在的重大安全漏洞。
通过不同的攻击面和信任边界来利用LLM进行攻击。
测试结果显示，大多数LLM模型存在安全漏洞，容易受到Direct Prompt Injection、RAG Backdoor和Inter-Agent Trust Exploitation等攻击。
LLM模型间的交互和相互影响会引发新的安全风险。

Cool Papers

点此查看论文截图

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks

Authors:Xiaoya Lu, Zeren Chen, Xuhao Hu, Yijin Zhou, Weichen Zhang, Dongrui Liu, Lu Sheng, Jing Shao

Flawed planning from VLM-driven embodied agents poses significant safety hazards, hindering their deployment in real-world household tasks. However, existing static, non-interactive evaluation paradigms fail to adequately assess risks within these interactive environments, since they cannot simulate dynamic risks that emerge from an agent’s actions and rely on unreliable post-hoc evaluations that ignore unsafe intermediate steps. To bridge this critical gap, we propose evaluating an agent’s interactive safety: its ability to perceive emergent risks and execute mitigation steps in the correct procedural order. We thus present IS-Bench, the first multi-modal benchmark designed for interactive safety, featuring 161 challenging scenarios with 388 unique safety risks instantiated in a high-fidelity simulator. Crucially, it facilitates a novel process-oriented evaluation that verifies whether risk mitigation actions are performed before/after specific risk-prone steps. Extensive experiments on leading VLMs, including the GPT-4o and Gemini-2.5 series, reveal that current agents lack interactive safety awareness, and that while safety-aware Chain-of-Thought can improve performance, it often compromises task completion. By highlighting these critical limitations, IS-Bench provides a foundation for developing safer and more reliable embodied AI systems. Code and data are released under this https URL.

由VLM驱动的实体代理的规划缺陷会构成重大的安全隐患，阻碍了它们在现实家庭任务中的部署。然而，现有的静态、非交互式的评估模式无法充分评估这些交互式环境中的风险，因为它们无法模拟代理行动中出现的动态风险，并依赖于忽略不安全中间步骤的不可靠的事后评估。为了弥补这一关键差距，我们提出了对代理的交互安全性进行评估：其在出现风险时的感知能力和按正确程序顺序执行缓解步骤的能力。因此，我们推出了IS-Bench，这是专门为交互安全性设计的多模式基准测试，其中包含161个具有挑战性的场景，在高度逼真的模拟器中实例化388个独特的安全风险。关键的是，它促进了面向过程的新型评估，验证风险缓解行动是否在特定的风险倾向步骤之前或之后执行。针对领先的VLM的广泛实验，包括GPT-4o和Gemini-2.5系列，揭示出当前代理缺乏交互安全意识，虽然安全意识的“思维链”可以提高性能，但往往会损害任务完成。通过突出这些关键限制，IS-Bench为开发更安全、更可靠的实体AI系统提供了基础。代码和数据在这个URL发布。

论文及项目相关链接

PDF

Summary

文本指出由VLM驱动的实体代理存在规划缺陷，这在实际家庭任务部署中带来重大安全隐患。现有的静态、非互动评估模式无法在这种互动环境中充分评估风险，因为它们无法模拟由代理行为产生的动态风险，并依赖于忽视不安全中间步骤的不可靠的后期评估。为解决这一关键差距，本文提出了对代理的互动安全性进行评估的方法，即评估其在出现风险时感知和按正确程序顺序执行缓解步骤的能力。为此，引入了IS-Bench，它是首个为多模态互动安全性设计的基准测试平台，包含161个充满挑战的场景和388个独特的安全风险，并借助高保真模拟器进行实例化。IS-Bench提供了一个过程导向的评估方法，可以验证是否在特定风险较高的步骤之前或之后执行了风险缓解措施。通过对领先的VLM进行实验，包括GPT-4o和Gemini-2.5系列，发现当前代理缺乏互动安全意识，尽管安全意识的思维链可以提升性能，但往往会牺牲任务完成度。IS-Bench为开发更安全、更可靠的实体AI系统提供了基础。

Key Takeaways

VLM驱动的实体代理存在规划缺陷，导致实际部署中的安全隐患。
现有评估模式无法充分评估互动环境中的风险。
互动安全性的评估是关键，涉及感知新兴风险和按正确顺序执行缓解步骤的能力。
IS-Bench是首个为多模态互动安全性设计的基准测试平台。
IS-Bench提供了过程导向的评估方法，验证了风险缓解措施是否针对特定风险步骤执行。
当前VLM存在缺乏互动安全意识的局限性。

Cool Papers

点此查看论文截图

UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Authors:Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.

图形用户界面（GUI）的自主代理正在彻底改变人机交互的方式，然而它们对文本指令的依赖限制了可访问性和便利性，特别是在免提场景中。为了解决这个问题，我们提出用语音代替文本作为GUI代理的指令输入模式，并引入了UITron-Speech。它是一个端到端的首个GUI代理，能够直接处理语音指令和设备截图来预测用户操作。为了解决数据稀缺的问题，我们使用随机说话人文本到语音模型合成高质量语音指令数据集。此外，我们设计了一种混合模式训练策略，以缓解预训练基础模型中的固有模式不平衡问题。此外，我们对GUI接地预测误差的分布进行了统计分析，并提出了无需训练的两步接地细化方法，以减轻轻微的定位偏差。在多个基准测试上的广泛实验表明，UITron-Speech实现了稳健的性能和卓越的适应性，凸显了语音驱动的GUI代理在更可访问和智能的人机交互中的可行性和潜力。我们的代码和数据集可在https://github.com/UITron-hub/UITron-Speech上找到。

论文及项目相关链接

PDF

Summary
语音驱动的GUI代理正在改变人机交互的方式。该研究提出使用语音指令替代文本指令作为GUI代理的输入方式，并引入UITron-Speech，它能直接处理语音指令和设备截图以预测用户操作。该研究解决了数据稀缺的问题，并设计了一种混合模态训练策略来缓解预训练基础模型中的模态不平衡问题。此外，该研究还提出了一种训练免费的两步定位修正方法，以减轻轻微定位偏差的问题。

Key Takeaways

自主代理在GUI中正在革命化人机交互。
语音指令可以作为GUI代理的输入方式，提高无障碍性和便利性。
UITron-Speech是首个能够直接处理语音指令和设备截图的端到端GUI代理。
研究解决了数据稀缺问题，通过合成高质量语音指令数据集来解决。
设计了混合模态训练策略，以缓解预训练模型中的模态不平衡问题。
提出了训练免费的定位修正方法，以减轻轻微定位偏差的问题。
UITron-Speech在多个基准测试中表现出稳健的性能和优越的适应性。

Cool Papers

点此查看论文截图

Accelerating Focal Search in Multi-Agent Path Finding with Tighter Lower Bounds

Authors:Yimin Tang, Zhenghong Yu, Jiaoyang Li, Sven Koenig

Multi-Agent Path Finding (MAPF) involves finding collision-free paths for multiple agents while minimizing a cost function–an NP-hard problem. Bounded suboptimal methods like Enhanced Conflict-Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double-ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best-first search guided by this LB to find a collision-free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS can reduce nearly 30% high-level CT nodes and 50% low-level focal search nodes. When agent density is moderate to high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.

多智能体路径寻找（MAPF）涉及为多个智能体寻找无碰撞路径，同时最小化成本函数——这是一个NP难题。像基于冲突增强搜索（ECBS）和显式估计CBS（EECBS）这样的有界次优方法通过使用焦点搜索机制来平衡解决方案质量与计算效率。尽管有效，但传统焦点搜索面临一个局限性：在早期搜索阶段，确定哪些节点进入焦点列表的下界（LB）值通常增长缓慢，导致搜索空间受限，从而延迟找到有效解决方案。针对这一问题，本文提出了一种新的有界次优算法——双ECBS（DECBS），首先确定最大LB值，然后采用最佳优先搜索，以该LB为指导找到无碰撞路径。实验结果表明，DECBS在大多数情况下优于ECBS，并且与现有优化技术兼容。DECBS可以减少近30%的高级CT节点和50%的低位焦点搜索节点。当智能体密度适中到较高时，与ECBS相比，DECBS在具有相同次优性边界和优化的情况下，平均运行时间缩短了23.5%。

论文及项目相关链接

PDF 7 pages

Summary

该论文针对多智能体路径规划（MAPF）问题，提出了一种名为double-ECBS（DECBS）的新型有界次优算法。该算法旨在解决传统焦点搜索面临的限制，通过先确定最大LB值，然后使用最佳优先搜索来找到无碰撞路径。实验结果表明，DECBS在大多数情况下优于ECBS，并能与现有优化技术兼容。DECBS能显著减少高级CT节点和低级别焦点搜索节点的数量，提高算法效率。对于中等至高密度智能体，DECBS在具有相同次优性边界和优化的情况下，平均运行时间比ECBS提高了约23.5%。

Key Takeaways

多智能体路径规划（MAPF）是一个NP难题，旨在找到多个智能体的无碰撞路径并最小化成本函数。
Bounded Suboptimal方法如ECBS和EECBS通过焦点搜索机制在解决方案质量和计算效率之间取得平衡。
传统焦点搜索面临的问题是早期搜索阶段LB值增长缓慢，导致搜索空间受限，延迟找到有效解决方案。
DECBS算法旨在解决此问题，通过先确定最大LB值，然后使用最佳优先搜索来找到无碰撞路径。
实验表明，DECBS在大多数测试案例中优于ECBS，并能显著减少高级CT节点和低级别焦点搜索节点的数量。
在中等至高密度智能体的情况下，DECBS在具有相同次优性边界和优化的场景中提供了显著的性能改进。

Cool Papers

点此查看论文截图

RAILGUN: A Unified Convolutional Policy for Multi-Agent Path Finding Across Different Environments and Tasks

Authors:Yimin Tang, Xiao Xiong, Jingyi Xi, Jiaoyang Li, Erdem Bıyık, Sven Koenig

Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP-hard so learning-based approaches for MAPF have gained attention, particularly those leveraging deep neural networks. Nonetheless, despite the community’s continued efforts, all learning-based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning-based policy for MAPF problem called RAILGUN. RAILGUN is not an agent-based policy but a map-based policy. By leveraging a CNN-based architecture, RAILGUN can generalize across different maps and handle any number of agents. We collect trajectories from rule-based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero-shot generalization capabilities on various tasks, maps and agent numbers that were not seen in the training dataset.

多智能体路径寻找（MAPF）专注于为多个机器人寻找无碰撞路径，对于从无人机群到仓库自动化的应用至关重要。解决MAPF问题是NP难的，因此基于学习的MAPF方法引起了关注，特别是那些利用深度神经网络的方法。尽管如此，尽管社区一直在努力，所有基于学习的MAPF规划者仍然依赖于分散式规划，这是由于智能体和地图规模的变化。我们为MAPF问题开发了第一个集中式学习策略，名为RAILGUN。RAILGUN不是基于智能体的策略，而是基于地图的策略。通过利用基于CNN的架构，RAILGUN可以在不同的地图上推广，并处理任何数量的智能体。我们从基于规则的方法收集轨迹，以监督的方式训练我们的模型。在实验中，RAILGUN优于大多数基线方法，并在各种任务、地图和智能体数量上展示了出色的零样本推广能力，这些在训练数据集中并未见过。

论文及项目相关链接

PDF 7 pages

Summary

多智能体路径查找（MAPF）对于从无人机群到仓库自动化等应用至关重要，它为多个机器人找到无碰撞路径。MAPF问题的解决是NP难的，因此基于学习的MAPF方法受到关注，尤其是利用深度神经网络的方法。然而，尽管社区持续努力，所有基于学习的MAPF规划器仍依赖于分散式规划，这是由于智能体和地图大小的差异造成的。我们开发了首个针对MAPF问题的集中学习政策，称为RAILGUN。RAILGUN不是基于智能体的策略，而是基于地图的策略。它通过利用CNN架构，可以跨不同地图进行推广，并处理任何数量的智能体。我们通过从基于规则的方法收集轨迹来以监督方式训练我们的模型。在实验中，RAILGUN在大多数基准测试方法中都表现出色，并在各种任务、地图和智能体数量上展示了出色的零样本泛化能力，这些在训练数据集中并未出现。

Key Takeaways

多智能体路径查找（MAPF）在多个机器人导航中起到关键作用，广泛应用于不同领域。
MAPF问题解决的复杂性使其成为NP难题，因此基于学习的方法受到关注。
基于学习的MAPF规划器通常采用分散式规划，因为智能体和地图大小的变化导致集中规划困难。
提出了一种新的集中学习策略RAILGUN，它是基于地图而非智能体的策略。
RAILGUN利用CNN架构实现不同地图之间的泛化，并能处理任意数量的智能体。
RAILGUN通过收集规则方法的轨迹进行训练，采用监督学习方式进行模型训练。

Cool Papers

点此查看论文截图

Assessing Agentic Large Language Models in Multilingual National Bias

Authors:Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi

Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLM’s applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education. \footnote{Code available at: https://github.com/yiyunya/assess_agentic_national_bias

大型语言模型在多语种自然语言处理方面引起了广泛关注，而与跨语言偏见相关的风险研究仅限于即时语境偏好。基于推理的推荐中的跨语言差异在很大程度上尚未被探索，甚至缺乏描述性分析。本研究首次填补了这一空白。我们测试了大型语言模型在三个关键场景（大学申请、旅行和搬迁）中提供个性化建议的适用性和能力。通过分析大型语言模型在多语言决策任务中的回答，我们研究了其中的多语言偏见。我们量化了模型生成分数中的偏见，并评估了人口统计因素和推理策略（如“思维链”提示）对偏见模式的影响。我们的研究发现，本地语言偏见在不同任务中普遍存在，GPT-4和Sonnet对英文国家的偏见减少，相较于GPT-3.5但未能实现稳健的多语言对齐，这对多语言人工智能代理和应用（如教育）具有更广泛的影响。\footnote{代码可在：\url{https://github.com/yiyunya/assess_agentic_national_bias}

论文及项目相关链接

PDF Accepted to ACL 2025 Findings. 14 pages

Summary

大型语言模型在多语种自然语言处理中受到广泛关注，但关于跨语言偏见风险的研究仅限于直接语境偏好。本研究首次关注跨语言差异在基于推理的建议中的影响，探索了语言模型在提供个性化建议方面的适用性。通过对最新大型语言模型在决策任务中的反应进行分析，本研究量化了模型生成的分数中的偏见，并评估了人口统计因素和推理策略对偏见模式的影响。研究发现，本地语言偏见在不同任务中普遍存在，GPT-4和Sonnet相较于GPT-3.5在英语国家的偏见有所减少，但仍未实现稳健的多语言对齐。

Key Takeaways

大型语言模型在多语种自然语言处理中受到广泛关注。
跨语言偏见在基于推理的建议中影响显著，这一领域的研究仍待深入。
语言模型在提供个性化建议方面的适用性得到了探索。
本研究分析了最新大型语言模型在决策任务中的反应。
研究发现模型生成的分数中存在量化偏见。
人口统计因素和推理策略对语言模型的偏见模式有影响。

Cool Papers

点此查看论文截图

Evaluating the Robustness of Multimodal Agents Against Active Environmental Injection Attacks

Authors:Yurun Chen, Xavier Hu, Keting Yin, Juncheng Li, Shengyu Zhang

As researchers continue to optimize AI agents for more effective task execution within operating systems, they often overlook a critical security concern: the ability of these agents to detect “impostors” within their environment. Through an analysis of the agents’ operational context, we identify a significant threat-attackers can disguise malicious attacks as environmental elements, injecting active disturbances into the agents’ execution processes to manipulate their decision-making. We define this novel threat as the Active Environment Injection Attack (AEIA). Focusing on the interaction mechanisms of the Android OS, we conduct a risk assessment of AEIA and identify two critical security vulnerabilities: (1) Adversarial content injection in multimodal interaction interfaces, where attackers embed adversarial instructions within environmental elements to mislead agent decision-making; and (2) Reasoning gap vulnerabilities in the agent’s task execution process, which increase susceptibility to AEIA attacks during reasoning. To evaluate the impact of these vulnerabilities, we propose AEIA-MN, an attack scheme that exploits interaction vulnerabilities in mobile operating systems to assess the robustness of MLLM-based agents. Experimental results show that even advanced MLLMs are highly vulnerable to this attack, achieving a maximum attack success rate of 93% on the AndroidWorld benchmark by combining two vulnerabilities.

随着研究人员不断优化操作系统内的人工智能代理（AI agents），以更有效地执行任务，他们往往忽略了一个关键的安全问题：这些代理检测其环境中“伪装者”（impostors）的能力。通过分析代理的操作上下文，我们发现了一种重大威胁：攻击者可以将恶意攻击伪装成环境因素，向代理的执行过程中注入主动干扰，以操纵其决策。我们将这种新型威胁定义为“主动环境注入攻击”（AEIA）。我们关注Android操作系统的交互机制，对AEIA进行了风险评估，并发现了两个关键的安全漏洞：（1）多模式交互界面中的对抗内容注入，攻击者将对抗性指令嵌入环境元素中，以误导代理的决策制定；（2）代理任务执行过程中的推理间隔漏洞，增加了在推理过程中遭受AEIA攻击的风险。为了评估这些漏洞的影响，我们提出了AEIA-MN攻击方案，该方案利用移动操作系统中的交互漏洞来评估基于MLLM的代理的稳健性。实验结果表明，即使是先进的MLLM也极易受到这种攻击的影响，通过在AndroidWorld基准测试中结合两个漏洞，最高攻击成功率达到了93%。

论文及项目相关链接

PDF Accepted at ACM MM 2025 Main Conference

Summary：
研究者在优化人工智能代理以在操作系统内更有效地执行任务时，往往忽略了关键的安全问题，即这些代理检测环境中“伪装者”的能力。通过分析代理的操作上下文，我们发现了一种重大威胁——攻击者可以将恶意攻击伪装成环境元素，将主动干扰注入代理的执行过程中，从而操纵其决策。我们称这种新型威胁为Active Environment Injection Attack（AEIA）。通过对Android操作系统的交互机制进行风险评估，我们发现了两个关键的安全漏洞：一是多模式交互界面中的对抗内容注入，攻击者可以在环境元素中嵌入对抗指令来误导代理的决策；二是代理任务执行过程中的推理间隔漏洞，会增加在推理过程中受到AEIA攻击的风险。为了评估这些漏洞的影响，我们提出了AEIA-MN攻击方案，该方案利用移动操作系统中的交互漏洞来评估基于MLLM的代理的稳健性。实验结果表明，即使是高级MLLM也很容易受到这种攻击，在AndroidWorld基准测试中结合两个漏洞的最大攻击成功率达到93%。

Key Takeaways：

研究者优化AI代理时易忽略检测环境中“伪装者”的关键安全问题。
出现了一种新型威胁Active Environment Injection Attack（AEIA），攻击者能将恶意攻击伪装成环境元素。
在多模式交互界面中存在着对抗内容注入的安全漏洞。
代理任务执行过程中的推理间隔漏洞会增加受到AEIA攻击的风险。
AEIA攻击能通过利用移动操作系统中的交互漏洞进行。
基于MLLM的代理存在较大的AEIA攻击风险，攻击成功率最高可达93%。
现有AI代理需要进一步加强在操作系统内的安全防护能力。

Cool Papers

点此查看论文截图

A Value Based Parallel Update MCTS Method for Multi-Agent Cooperative Decision Making of Connected and Automated Vehicles

Authors:Ye Han, Lijun Zhang, Dejian Meng, Zhuang Zhang, Xingyu Hu, Songyu Weng

To solve the problem of lateral and logitudinal joint decision-making of multi-vehicle cooperative driving for connected and automated vehicles (CAVs), this paper proposes a Monte Carlo tree search (MCTS) method with parallel update for multi-agent Markov game with limited horizon and time discounted setting. By analyzing the parallel actions in the multi-vehicle joint action space in the partial-steady-state traffic flow, the parallel update method can quickly exclude potential dangerous actions, thereby increasing the search depth without sacrificing the search breadth. The proposed method is tested in a large number of randomly generated traffic flow. The experiment results show that the algorithm has good robustness and better performance than the SOTA reinforcement learning algorithms and heuristic methods. The vehicle driving strategy using the proposed algorithm shows rationality beyond human drivers, and has advantages in traffic efficiency and safety in the coordinating zone.

针对连接自动化车辆（CAVs）的多车协同驾驶的横向和纵向联合决策问题，本文提出了一种基于马尔可夫博弈的多智能体蒙特卡洛树搜索（MCTS）方法，该方法采用并行更新策略，适用于有限视野和时间折扣设置。通过分析部分稳态交通流中多车联合动作空间的并行动作，并行更新方法可以迅速排除潜在的危险动作，从而在不影响搜索广度的情况下增加搜索深度。该方法在大量随机生成的交通流中进行了测试。实验结果表明，该算法具有良好的鲁棒性，并优于当前最佳的强化学习算法和启发式方法。使用所提出算法的车辆驾驶策略表现出了超越人类驾驶员的合理性，并在协调区域内具有提高交通效率和安全性的优势。

论文及项目相关链接

PDF arXiv admin note: text overlap with arXiv:2408.04295 by other authors

Summary

该论文针对连接和自动化车辆（CAVs）的多车协同驾驶的横向和纵向联合决策问题，提出了一种基于蒙特卡洛树搜索（MCTS）的并行更新方法，用于处理多智能体马尔可夫博弈中有限视界和时间折扣设置。通过分析部分稳态交通流中的多车联合行动空间中的并行行动，该方法能够迅速排除潜在的危险行动，从而在增加搜索深度的同时不牺牲搜索广度。实验结果表明，该算法具有良好的鲁棒性，优于当前最佳强化学习算法和启发式方法。使用此算法的车辆驾驶策略在协调区域内的交通效率和安全性方面表现出超越人类驾驶员的合理性。

Key Takeaways