⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-07-09 更新
Modeling Latent Partner Strategies for Adaptive Zero-Shot Human-Agent Collaboration
Authors:Benjamin Li, Shuyang Shi, Lucia Romero, Huao Li, Yaqi Xie, Woojun Kim, Stefanos Nikolaidis, Michael Lewis, Katia Sycara, Simon Stepputtis
In collaborative tasks, being able to adapt to your teammates is a necessary requirement for success. When teammates are heterogeneous, such as in human-agent teams, agents need to be able to observe, recognize, and adapt to their human partners in real time. This becomes particularly challenging in tasks with time pressure and complex strategic spaces where the dynamics can change rapidly. In this work, we introduce TALENTS, a strategy-conditioned cooperator framework that learns to represent, categorize, and adapt to a range of partner strategies, enabling ad-hoc teamwork. Our approach utilizes a variational autoencoder to learn a latent strategy space from trajectory data. This latent space represents the underlying strategies that agents employ. Subsequently, the system identifies different types of strategy by clustering the data. Finally, a cooperator agent is trained to generate partners for each type of strategy, conditioned on these clusters. In order to adapt to previously unseen partners, we leverage a fixed-share regret minimization algorithm that infers and adjusts the estimated partner strategy dynamically. We assess our approach in a customized version of the Overcooked environment, posing a challenging cooperative cooking task that demands strong coordination across a wide range of possible strategies. Using an online user study, we show that our agent outperforms current baselines when working with unfamiliar human partners.
在协作任务中,能够适应队友是成功的必要条件。当队友具有差异性,如在人类与智能体组成的团队中,智能体需要能够实时观察、识别并适应其人类伙伴。这在有时间压力和复杂战略空间的任务中尤其具有挑战性,因为动态可能会迅速变化。在这项工作中,我们引入了TALENTS(策略条件协作框架),它能够学习表示、分类并适应一系列伙伴策略,从而实现即时团队协作。我们的方法利用变分自动编码器从轨迹数据中学习潜在策略空间。该潜在空间代表了智能体所采用的底层策略。然后,系统通过聚类数据来识别不同类型的策略。最后,训练一个协作智能体来为每种类型的策略生成伙伴,并根据这些集群进行条件化。为了适应之前未见过的伙伴,我们采用了一种固定份额后悔最小化算法,该算法能够动态推断并调整估计的伙伴策略。我们在定制的《煮糊了(Overcooked)》环境中评估了我们的方法,该环境提出了一个具有挑战性的合作烹饪任务,需要广泛的不同策略之间强大的协作能力。通过在线用户研究,我们证明了我们的智能体在与不熟悉的人类伙伴合作时超过了当前基线水平。
论文及项目相关链接
PDF Best Paper Award at the RSS 2025 Generative Models x HRI (GenAI-HRI) Workshop
Summary
本文强调在协作任务中,适应队友的能力是成功的必要条件。在人机协作团队中,面对多样化的队友,智能体需要实时观察、识别和适应人类伙伴。在有时间压力和复杂战略空间的任务中,这一要求更具挑战性。本研究提出了TALENTS策略性协作框架,该框架能够学习表示和适应各种伙伴战略,实现即兴团队合作。通过使用变分自编码器学习轨迹数据的潜在策略空间,该空间代表智能体采用的战略。然后系统通过聚类数据识别不同战略类型。最后,训练协作智能体为每种策略生成合作伙伴。为应对未见过的伙伴,研究使用固定份额后悔最小化算法动态推断和调整估计的伙伴战略。在定制的《Overcooked》环境中评估该方法,该环境要求在各种可能战略中展示强有力的协调和合作能力。通过在线用户研究,本文证明研究提出的智能体在与不熟悉的人类伙伴协作时表现优于当前基线。
Key Takeaways
- 在协作任务中,适应队友的能力是成功的重要因素。
- 面对多样化的队友,智能体需要实时观察、识别和适应人类伙伴的能力。
- TALENTS策略性协作框架能够学习表示和适应各种伙伴战略,实现即兴团队合作。
- 使用变分自编码器学习轨迹数据的潜在策略空间以代表智能体的战略。
- 系统通过聚类数据识别不同的战略类型。
- 利用固定份额后悔最小化算法来动态推断和调整伙伴战略以适应未知伙伴。
点此查看论文截图






SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam?
Authors:Jingyi Chai, Shuo Tang, Rui Ye, Yuwen Du, Xinyu Zhu, Mengcheng Zhou, Yanfeng Wang, Weinan E, Siheng Chen
The rapid advancements of AI agents have ignited the long-held ambition of leveraging them to accelerate scientific discovery. Achieving this goal requires a deep understanding of the frontiers of human knowledge. As such, Humanity’s Last Exam (HLE) provides an exceptionally challenging touchstone for evaluating scientific AI agents. In this work, we aim to construct the foundational architecture for general-purpose agents and validate the capabilities through leading performance on HLE. To achieve this, we introduce X-Master, a tool-augmented reasoning agent designed to emulate human researchers by interacting flexibly with external tools during its reasoning process. This agent, guided by the conceptualization of code as an interaction language, can flexibly leverage built-in Python libraries and our customized tools to augment the reasoning. We further scale its capabilities through X-Masters, a scattered-and-stacked agentic workflow that systematically enhances breadth and depth of reasoning. Our open-source solution, X-Masters, sets a new state-of-the-art record on HLE with a score of 32.1%, surpassing OpenAI’s and Google’s Deep Research (26.6% and 26.9%) and becoming the first to exceed the 30% threshold. This work allows us to gain a deeper understanding of complex task-solving and accumulates valuable experience that can inform future advancements, guiding subsequent model training.
人工智能代理的快速进步激发了人们长期以来利用它们来加速科学发现的愿望。实现这一目标需要对人类知识的前沿有深刻的理解。因此,Humanity’s Last Exam(HLE)为评估科学人工智能代理提供了极具挑战性的试金石。在这项工作中,我们的目标是构建通用代理的基础架构,并通过在HLE上的领先性能来验证其能力。为了实现这一点,我们引入了X-Master,这是一个工具增强型推理代理,旨在通过推理过程中与外部工具的灵活交互来模拟人类研究者。该代理以代码作为交互语言的理念为指导,可以灵活地利用内置的Python库和我们的定制工具来增强推理。我们进一步通过X-Masters(一种分散堆叠的代理工作流程)来扩展其能力,系统地提高推理的广度和深度。我们的开源解决方案X-Masters在HLE上创下了新的最高纪录,得分为32.1%,超越了OpenAI和Google Deep Research(分别为26.6%和26.9%),并成为首个超过30%阈值的研究。这项工作使我们能够更深入地了解复杂任务解决,并积累了宝贵的经验,可以为未来的进步提供指导,为后续的模型训练提供指导。
论文及项目相关链接
PDF 12 pages, 7 figures
Summary
人工智能的快速发展激发了人们利用人工智能加速科学发现的长期愿望。为实现这一目标,需要深入理解人类知识的前沿。Humanity’s Last Exam(HLE)为评估科学人工智能提供了极具挑战性的基准测试。本研究旨在构建通用智能体的基础架构,并通过在HLE上的卓越表现验证其能力。我们引入了X-Master这一工具辅助推理智能体,以模拟人类研究者通过外部工具进行灵活交互的推理过程。在代码被视为一种交互语言的理念指导下,该智能体能够灵活利用内置Python库和自定义工具来增强推理能力。此外,我们还通过X-Masters(一种分散叠加的智能体工作流程)来进一步扩展其能力,系统提升推理的广度和深度。我们的开源解决方案X-Masters在HLE上创造了新的最高纪录,得分为32.1%,超越了OpenAI和谷歌的深度研究(分别为26.6%和26.9%),成为首个突破30%阈值的智能体。本研究使我们更深入地了解复杂任务解决,并积累了宝贵的经验,可以为未来的进步提供指导,帮助后续模型训练。
Key Takeaways
- AI的发展激发了其在科学发现领域应用的潜力。
- Humanity’s Last Exam (HLE)是评估科学人工智能能力的挑战性基准测试。
- X-Master工具辅助推理智能体的设计旨在模拟人类研究者与工具的交互过程。
- X-Master将Python库和自定义工具相结合以增强其推理能力。
- 通过X-Masters系统,智能体的推理能力和效率得到了进一步提升。
- X-Masters在HLE上的得分创下了新的纪录,达到了32.1%。
点此查看论文截图



CREW-WILDFIRE: Benchmarking Agentic Multi-Agent Collaborations at Scale
Authors:Jonathan Hyun, Nicholas R Waytowich, Boyuan Chen
Despite rapid progress in large language model (LLM)-based multi-agent systems, current benchmarks fall short in evaluating their scalability, robustness, and coordination capabilities in complex, dynamic, real-world tasks. Existing environments typically focus on small-scale, fully observable, or low-complexity domains, limiting their utility for developing and assessing next-generation multi-agent Agentic AI frameworks. We introduce CREW-Wildfire, an open-source benchmark designed to close this gap. Built atop the human-AI teaming CREW simulation platform, CREW-Wildfire offers procedurally generated wildfire response scenarios featuring large maps, heterogeneous agents, partial observability, stochastic dynamics, and long-horizon planning objectives. The environment supports both low-level control and high-level natural language interactions through modular Perception and Execution modules. We implement and evaluate several state-of-the-art LLM-based multi-agent Agentic AI frameworks, uncovering significant performance gaps that highlight the unsolved challenges in large-scale coordination, communication, spatial reasoning, and long-horizon planning under uncertainty. By providing more realistic complexity, scalable architecture, and behavioral evaluation metrics, CREW-Wildfire establishes a critical foundation for advancing research in scalable multi-agent Agentic intelligence. All code, environments, data, and baselines will be released to support future research in this emerging domain.
尽管基于大型语言模型(LLM)的多智能体系统取得了快速进展,但当前的标准评估指标在评估其处理复杂、动态、真实任务的可扩展性、鲁棒性和协调能力方面还存在不足。现有的环境通常专注于小规模、完全可观察或低复杂度的领域,这限制了它们在开发和评估下一代多智能体Agentic人工智能框架方面的实用性。我们推出了CREW-Wildfire,这是一个旨在弥补这一差距的开源基准测试。CREW-Wildfire建立在人类-人工智能团队协作的CREW模拟平台之上,提供程序化生成的森林火灾应对场景,包括大型地图、不同智能体、部分可观察性、随机动态和长期规划目标。环境通过模块化的感知和执行模块支持低级控制和高级自然语言交互。我们实现并评估了多种先进的基于LLM的多智能体Agentic人工智能框架,揭示了显著的性能差距,这些差距突显出大规模协调、通信、空间推理和不确定性下的长期规划所面临的挑战。通过提供更现实的复杂性、可扩展的架构和行为评估指标,CREW-Wildfire为推进可扩展多智能体Agentic智能的研究奠定了重要基础。所有代码、环境、数据和基准测试都将发布,以支持这一新兴领域未来的研究。
论文及项目相关链接
PDF Our project website is at: http://generalroboticslab.com/CREW-Wildfire
Summary:
介绍了一个名为CREW-Wildfire的开放源代码基准测试环境,用于评估基于大型语言模型的多智能体系统在复杂、动态、真实任务中的可扩展性、稳健性和协调能力。该环境模拟了大型地图、异构智能体、部分可观察性、随机动态和长期规划目标等场景,支持低级别控制和高级自然语言交互。文章还实现了多个最新的多智能体框架,并在CREW-Wildfire环境中评估了它们的表现,强调了在大规模协调、沟通、空间推理和长期不确定性规划等方面的挑战。此环境为未来研究提供了一个关键基础。
Key Takeaways:
- 当前基准测试在评估基于大型语言模型的多智能体系统时存在局限性,特别是在复杂、动态、真实任务的扩展性、稳健性和协调能力方面。
- CREW-Wildfire是一个新的开放源代码基准测试环境,旨在解决现有环境的局限性,并模拟真实世界的复杂场景。
- CREW-Wildfire环境支持大型地图、异构智能体、部分可观察性、随机动态和长期规划目标等特性。
- 该环境通过模块化的感知和执行模块支持低级别控制和高级自然语言交互。
- 在CREW-Wildfire环境中评估了多个最新的多智能体框架,发现它们在大型协调、沟通、空间推理和长期不确定性规划等方面存在显著性能差距。
- CREW-Wildfire环境的构建为未来研究提供了一个关键基础,包括更现实的复杂性、可扩展的架构和行为评估指标。
点此查看论文截图




Beyond Features: How Dataset Design Influences Multi-Agent Trajectory Prediction Performance
Authors:Tobias Demmler, Jakob Häringer, Andreas Tamke, Thao Dang, Alexander Hegai, Lars Mikelsons
Accurate trajectory prediction is critical for safe autonomous navigation, yet the impact of dataset design on model performance remains understudied. This work systematically examines how feature selection, cross-dataset transfer, and geographic diversity influence trajectory prediction accuracy in multi-agent settings. We evaluate a state-of-the-art model using our novel L4 Motion Forecasting dataset based on our own data recordings in Germany and the US. This includes enhanced map and agent features. We compare our dataset to the US-centric Argoverse 2 benchmark. First, we find that incorporating supplementary map and agent features unique to our dataset, yields no measurable improvement over baseline features, demonstrating that modern architectures do not need extensive feature sets for optimal performance. The limited features of public datasets are sufficient to capture convoluted interactions without added complexity. Second, we perform cross-dataset experiments to evaluate how effective domain knowledge can be transferred between datasets. Third, we group our dataset by country and check the knowledge transfer between different driving cultures.
精确的轨迹预测对于安全的自主导航至关重要,然而数据集设计对模型性能的影响尚未得到充分研究。这项工作系统地研究了特征选择、跨数据集转换和地理多样性在多智能体环境中对轨迹预测精度的影响。我们使用基于在德国和美国自己录制数据的新型L4运动预测数据集,对最先进的模型进行了评估。这包括增强的地图和智能体特征。我们将自己的数据集与美国为主的Argoverse 2基准数据集进行了比较。首先,我们发现,纳入我们数据集中特有的附加地图和智能体特征,并没有在基线特征上取得明显的改进,这表明现代架构不需要大量的特征集即可实现最佳性能。公共数据集的有限特征足以捕捉复杂的交互,而不会增加复杂性。其次,我们进行了跨数据集实验,以评估不同数据集之间迁移领域知识的有效性。第三,我们按国家分组数据集,并检查不同驾驶文化之间的知识迁移。
论文及项目相关链接
Summary
本文研究了数据集设计对多智能体轨迹预测模型性能的影响,包括特征选择、跨数据集迁移和地理多样性等因素。通过对比新型L4运动预测数据集与美国Argoverse 2基准数据集,发现现代架构在不需要大量特征的情况下也能实现最佳性能,公开数据集有限的特征足以捕捉复杂的交互作用。此外,还进行了跨数据集实验来评估不同驾驶文化间的知识迁移效果。
Key Takeaways
- 数据集设计对轨迹预测模型性能有显著影响。
- 特征选择是其中的关键因素,但现代模型不需要大量特征也能实现最佳性能。
- 跨数据集迁移知识在轨迹预测中是一个有效的策略。
- 公开数据集的有限特征足以捕捉复杂的交互作用。
- 地理多样性对轨迹预测模型性能有影响。
- 不同驾驶文化间的知识迁移是一个值得研究的课题。
点此查看论文截图






MARBLE: A Multi-Agent Rule-Based LLM Reasoning Engine for Accident Severity Prediction
Authors:Kaleem Ullah Qasim, Jiashu Zhang
Accident severity prediction plays a critical role in transportation safety systems but is a persistently difficult task due to incomplete data, strong feature dependencies, and severe class imbalance in which rare but high-severity cases are underrepresented and hard to detect. Existing methods often rely on monolithic models or black box prompting, which struggle to scale in noisy, real-world settings and offer limited interpretability. To address these challenges, we propose MARBLE a multiagent rule based LLM engine that decomposes the severity prediction task across a team of specialized reasoning agents, including an interchangeable ML-backed agent. Each agent focuses on a semantic subset of features (e.g., spatial, environmental, temporal), enabling scoped reasoning and modular prompting without the risk of prompt saturation. Predictions are coordinated through either rule-based or LLM-guided consensus mechanisms that account for class rarity and confidence dynamics. The system retains structured traces of agent-level reasoning and coordination outcomes, supporting in-depth interpretability and post-hoc performance diagnostics. Across both UK and US datasets, MARBLE consistently outperforms traditional machine learning classifiers and state-of-the-art (SOTA) prompt-based reasoning methods including Chain-of-Thought (CoT), Least-to-Most (L2M), and Tree-of-Thought (ToT) achieving nearly 90% accuracy where others plateau below 48%. This performance redefines the practical ceiling for accident severity classification under real world noise and extreme class imbalance. Our results position MARBLE as a generalizable and interpretable framework for reasoning under uncertainty in safety-critical applications.
事故严重程度预测在交通运输安全系统中扮演着至关重要的角色,但一直是一项困难的任务,因为存在数据不完整、特征依赖性强以及严重的类别不平衡等问题,其中罕见但高严重性的案例被低估且难以检测。现有方法往往依赖于单一模型或黑箱提示,难以在嘈杂的现实环境中扩展,且提供有限的解释性。为了应对这些挑战,我们提出了MARBLE,这是一个基于多代理规则的LLM引擎,它将严重程度预测任务分解到一个专业推理代理团队中,包括一个可互换的ML支持代理。每个代理专注于一组语义特征(例如,空间、环境、时间),从而能够进行局部推理和模块化提示,而不会导致提示饱和的风险。预测是通过基于规则或LLM引导的共识机制进行的,这些机制会考虑类别的稀有性和信心动态变化。该系统保留了代理级推理和协调结果的结构化跟踪,支持深入的解释性和事后性能诊断。无论是在英国和美国的数据集上,MARBLE都始终优于传统的机器学习分类器以及最新的基于提示的推理方法,包括思维链(CoT)、从最少到最多(L2M)和思维树(ToT),达到近90%的准确率,而其他方法则低于48%。这一性能重新定义了现实世界噪声和极端类别不平衡条件下事故严重程度分类的实际上限。我们的结果将MARBLE定位为安全关键应用下不确定性推理的可推广和可解释框架。
论文及项目相关链接
PDF 13 pages, 5 figures
Summary
本文提出了一个多主体规则式LLM引擎,称为MARBLE,用于解决交通意外严重性预测中的挑战。它通过分解任务到多个专业推理主体来应对数据不完整、特征依赖性强以及类别不平衡等问题。MARBLE在各数据集上的表现均优于传统机器学习分类器和最新的提示式推理方法,实现了近90%的准确率。
Key Takeaways
- MARBLE是一个多主体规则式LLM引擎,专门用于解决交通意外严重性预测的挑战。
- 该方法通过分解任务到多个专业推理主体,处理数据不完整、特征依赖性强和类别不平衡等问题。
- MARBLE采用模块化提示和规则共识机制,避免提示饱和,并考虑类别稀有性和置信度动态。
- MARBLE保留了主体级推理和协调结果的结构化痕迹,支持深入的可解释性和事后性能诊断。
- 在英国和美国的数据集上,MARBLE的准确率接近90%,优于传统机器学习和最新提示式推理方法。
点此查看论文截图



FurniMAS: Language-Guided Furniture Decoration using Multi-Agent System
Authors:Toan Nguyen, Tri Le, Quang Nguyen, Anh Nguyen
Furniture decoration is an important task in various industrial applications. However, achieving a high-quality decorative result is often time-consuming and requires specialized artistic expertise. To tackle these challenges, we explore how multi-agent systems can assist in automating the decoration process. We propose FurniMAS, a multi-agent system for automatic furniture decoration. Specifically, given a human prompt and a household furniture item such as a working desk or a TV stand, our system suggests relevant assets with appropriate styles and materials, and arranges them on the item, ensuring the decorative result meets functionality, aesthetic, and ambiance preferences. FurniMAS assembles a hybrid team of LLM-based and non-LLM agents, each fulfilling distinct roles in a typical decoration project. These agents collaborate through communication, logical reasoning, and validation to transform the requirements into the final outcome. Extensive experiments demonstrate that our FurniMAS significantly outperforms other baselines in generating high-quality 3D decor.
家具装饰在各种工业应用中是一项重要的任务。然而,实现高质量的装饰效果往往很耗时,并需要专业的艺术知识。为了应对这些挑战,我们探讨了多智能体系统如何协助自动化装饰过程。我们提出了FurniMAS,一个用于自动家具装饰的多智能体系统。具体来说,给定一个人工提示和一件家具物品(如办公桌或电视架),我们的系统会建议具有合适风格和材料的相关资产,并将它们排列在物品上,确保装饰结果满足功能、美学和氛围偏好。FurniMAS汇集了基于LLM和非LLM的智能体混合团队,在典型的装饰项目中,每个智能体都扮演着独特的角色。这些智能体通过沟通、逻辑推理和验证来协作,将要求转化为最终成果。大量实验表明,我们的FurniMAS在生成高质量3D装饰方面显著优于其他基线。
论文及项目相关链接
Summary
本文探讨了家具装饰在工业应用中的重要性及其所面临的挑战,如高质量装饰结果需要耗费大量时间和专业技能。为解决这些问题,研究提出了多智能体系统——FurniMAS,该系统能自动进行家具装饰。借助人为提示和家庭家具项目(如办公桌或电视柜等),该系统能够根据风格和材料筛选相关资产,并将其排列在物品上,确保装饰结果满足功能、美学和环境偏好。此外,FurniMAS融合了基于LLM和非LLM的智能体,它们在装饰项目中扮演不同角色,通过沟通、逻辑推理和验证协作,将需求转化为最终成果。实验证明,相较于其他基准测试,FurniMAS在高质的三维装饰效果上表现更优。
Key Takeaways
- 家具装饰在工业应用中非常重要,但实现高质量装饰结果通常需要大量时间和专业技能。
- FurniMAS是一种用于自动家具装饰的多智能体系统。
- FurniMAS通过人为提示和家庭家具项目,如办公桌或电视柜等来实现自动装饰功能。它能够根据风格和材料建议相关的资产并排列它们以达到良好的装饰效果。它不仅考虑了美学和功能要求,还包括环境偏好。
- FurniMAS集成了基于LLM和非LLM的智能体,它们在装饰项目中扮演不同角色并通过沟通、逻辑推理和验证协作完成任务。
点此查看论文截图






Robustifying 3D Perception through Least-Squares Multi-Agent Graphs Object Tracking
Authors:Maria Damanaki, Ioulia Kapsali, Nikos Piperigkos, Alexandros Gkillas, Aris S. Lalos
The critical perception capabilities of EdgeAI systems, such as autonomous vehicles, are required to be resilient against adversarial threats, by enabling accurate identification and localization of multiple objects in the scene over time, mitigating their impact. Single-agent tracking offers resilience to adversarial attacks but lacks situational awareness, underscoring the need for multi-agent cooperation to enhance context understanding and robustness. This paper proposes a novel mitigation framework on 3D LiDAR scene against adversarial noise by tracking objects based on least-squares graph on multi-agent adversarial bounding boxes. Specifically, we employ the least-squares graph tool to reduce the induced positional error of each detection’s centroid utilizing overlapped bounding boxes on a fully connected graph via differential coordinates and anchor points. Hence, the multi-vehicle detections are fused and refined mitigating the adversarial impact, and associated with existing tracks in two stages performing tracking to further suppress the adversarial threat. An extensive evaluation study on the real-world V2V4Real dataset demonstrates that the proposed method significantly outperforms both state-of-the-art single and multi-agent tracking frameworks by up to 23.3% under challenging adversarial conditions, operating as a resilient approach without relying on additional defense mechanisms.
边缘人工智能系统(如自动驾驶汽车)的关键感知能力需要对抗恶意威胁具有韧性。这要求系统能够随着时间的推移,准确识别和定位场景中的多个物体,并减轻其影响。单智能体跟踪可以抵御恶意攻击,但缺乏态势感知,这强调了多智能体合作的需要,以提高上下文理解和稳健性。本文提出了一种基于最小二乘图的多智能体对抗边界框跟踪对象的新型缓解框架,以对抗激光雷达场景中的对抗性噪声。具体来说,我们采用最小二乘图工具,通过差异坐标和锚点,利用重叠的边界框减少每个检测到的对象的中心位置误差。因此,多车辆检测被融合和细化,减轻了对抗影响,并与现有轨迹相关联,在两个阶段执行跟踪,以进一步抑制对抗威胁。在现实世界中的V2V4Real数据集上的广泛评估研究表明,所提出的方法在具有挑战性的对抗条件下,比最先进的单智能体和多智能体跟踪框架高出23.3%,作为一种不需要依赖额外防御机制的稳健方法。
论文及项目相关链接
PDF 6 pages, 3 figures, 4 tables
Summary
本论文提出了一种基于多智能体对抗边界框的3D激光雷达场景对抗噪声的缓解框架。该框架采用最小二乘图工具,利用全连接图上的重叠边界框,通过差分坐标和锚点减少每个检测对象中心的定位误差。通过多车检测融合和细化,缓解对抗影响,并在两个阶段执行跟踪关联现有轨迹,进一步抑制对抗威胁。在现实世界V2V4Real数据集上的评估研究表明,该方法在具有挑战性的对抗条件下显著优于最先进的单智能体和多智能体跟踪框架,作为一种不依赖额外防御机制的稳健方法。
Key Takeaways
- EdgeAI系统需要对抗恶意攻击时的关键感知能力,能够准确识别和定位场景中的多个物体。
- 单智能体跟踪虽对对抗性攻击有韧性,但缺乏情境意识,因此需多智能体合作增强上下文理解和稳健性。
- 论文提出基于最小二乘图和重叠边界框的缓解框架,减少检测对象中心的定位误差。
- 该框架融合和细化多车检测,以缓解对抗影响,并与现有轨迹进行关联,进一步抑制对抗威胁。
- 所提出的方法在现实世界数据集上的表现显著优于其他跟踪框架。
- 该方法具有在挑战性对抗条件下运作的稳健性,不依赖额外的防御机制。
点此查看论文截图







Who’s the Mole? Modeling and Detecting Intention-Hiding Malicious Agents in LLM-Based Multi-Agent Systems
Authors:Yizhe Xie, Congcong Zhu, Xinyue Zhang, Minghao Wang, Chi Liu, Minglu Zhu, Tianqing Zhu
Multi-agent systems powered by Large Language Models (LLM-MAS) demonstrate remarkable capabilities in collaborative problem-solving. While LLM-MAS exhibit strong collaborative abilities, the security risks in their communication and coordination remain underexplored. We bridge this gap by systematically investigating intention-hiding threats in LLM-MAS, and design four representative attack paradigms that subtly disrupt task completion while maintaining high concealment. These attacks are evaluated in centralized, decentralized, and layered communication structures. Experiments conducted on six benchmark datasets, including MMLU, MMLU-Pro, HumanEval, GSM8K, arithmetic, and biographies, demonstrate that they exhibit strong disruptive capabilities. To identify these threats, we propose a psychology-based detection framework AgentXposed, which combines the HEXACO personality model with the Reid Technique, using progressive questionnaire inquiries and behavior-based monitoring. Experiments conducted on six types of attacks show that our detection framework effectively identifies all types of malicious behaviors. The detection rate for our intention-hiding attacks is slightly lower than that of the two baselines, Incorrect Fact Injection and Dark Traits Injection, demonstrating the effectiveness of intention concealment. Our findings reveal the structural and behavioral risks posed by intention-hiding attacks and offer valuable insights into securing LLM-based multi-agent systems through psychological perspectives, which contributes to a deeper understanding of multi-agent safety. The code and data are available at https://anonymous.4open.science/r/AgentXposed-F814.
基于大型语言模型的多智能体系统(LLM-MAS)在协同解决问题方面展现出卓越的能力。尽管LLM-MAS表现出强大的协作能力,但其在通信和协调中的安全风险仍然被忽视。我们通过系统地研究LLM-MAS中的意图隐藏威胁来填补这一空白,并设计了四种具有代表性的攻击模式,这些攻击模式能够在完成任务的过程中微妙地制造干扰,同时保持高度隐蔽性。这些攻击在集中式、分散式和分层通信结构中都经过了评估。在MMLU、MMLU-Pro、HumanEval、GSM8K、算术和传记等六个基准数据集上进行的实验表明,它们具有很强的破坏性能力。为了识别这些威胁,我们提出了基于心理学的检测框架AgentXposed,该框架结合了HEXACO人格模型和雷德技术,采用渐进式问卷调查和行为监测。对六种类型的攻击进行的实验表明,我们的检测框架有效地识别出所有类型的恶意行为。对于意图隐藏攻击的检测率略低于另外两个基准(即错误事实注入和黑暗特质注入),这证明了意图隐藏的隐蔽性。我们的研究揭示了意图隐藏攻击的结构和行为风险,并通过心理学视角为基于LLM的多智能体系统的安全提供了宝贵见解,有助于更深入地了解多智能体的安全性。相关代码和数据可在https://anonymous.4open.science/r/AgentXposed-F814处获取。
论文及项目相关链接
Summary
多代理系统通过大型语言模型(LLM-MAS)展现出卓越的协同解决问题的能力。然而,LLM-MAS在通信和协调中的安全风险尚未得到充分探索。本研究系统地探讨了LLM-MAS中的意图隐藏威胁,设计了四种典型的攻击模式,这些攻击模式可以在不引人注目的前提下破坏任务完成。实验在六种数据集上进行,证明这些攻击具有很强的破坏性。为了识别这些威胁,本研究提出了基于心理学的检测框架AgentXposed,结合HEXACO人格模型和Reid技术,通过渐进式问卷调查和行为监测来识别恶意行为。实验表明,该检测框架能有效识别各种恶意攻击,但对意图隐藏的攻击的检测率略低于两个基线方法,显示出意图隐藏的效力。本研究揭示了意图隐藏攻击的结构和行为风险,并为通过心理学视角保障LLM-MAS提供了宝贵见解。
Key Takeaways
- LLM-MAS在协同解决问题中表现出卓越的能力,但其通信和协调中的安全风险尚未被充分探索。
- 存在意图隐藏的威胁,可微妙地破坏任务完成,且不易被察觉。
- 在六种数据集上的实验证明了这些意图隐藏的攻击具有很强的破坏性。
- 提出基于心理学的检测框架AgentXposed,通过问卷调查和行为监测识别恶意行为。
- AgentXposed能有效识别各种攻击,但对意图隐藏的攻击的检测率略低于基线方法。
- 研究揭示了意图隐藏攻击的结构和行为风险。
点此查看论文截图






LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction
Authors:Yixin Yan, Yang Li, Yuanfan Wang, Xiaozhou Zhou, Beihao Xia, Manjiang Hu, Hongmao Qin
It has been challenging to model the complex temporal-spatial dependencies between agents for trajectory prediction. As each state of an agent is closely related to the states of adjacent time steps, capturing the local temporal dependency is beneficial for prediction, while most studies often overlook it. Besides, learning the high-order motion state attributes is expected to enhance spatial interaction modeling, but it is rarely seen in previous works. To address this, we propose a lightweight framework, LTMSformer, to extract temporal-spatial interaction features for multi-modal trajectory prediction. Specifically, we introduce a Local Trend-Aware Attention mechanism to capture the local temporal dependency by leveraging a convolutional attention mechanism with hierarchical local time boxes. Next, to model the spatial interaction dependency, we build a Motion State Encoder to incorporate high-order motion state attributes, such as acceleration, jerk, heading, etc. To further refine the trajectory prediction, we propose a Lightweight Proposal Refinement Module that leverages Multi-Layer Perceptrons for trajectory embedding and generates the refined trajectories with fewer model parameters. Experiment results on the Argoverse 1 dataset demonstrate that our method outperforms the baseline HiVT-64, reducing the minADE by approximately 4.35%, the minFDE by 8.74%, and the MR by 20%. We also achieve higher accuracy than HiVT-128 with a 68% reduction in model size.
对代理之间的复杂时空依赖性进行轨迹预测建模一直是一个挑战。由于代理的每个状态都与相邻时间步的状态紧密相关,捕捉局部时间依赖性对预测是有益的,但大多数研究往往忽略了这一点。此外,学习高阶运动状态属性有望增强空间交互建模,但之前在的工作中很少见到。为了解决这一问题,我们提出了一种轻量级的框架LTMSformer,用于提取时空交互特征进行多模态轨迹预测。具体而言,我们引入了一种局部趋势感知注意力机制,通过利用具有分层局部时间框的卷积注意力机制来捕捉局部时间依赖性。接下来,为了对空间交互依赖性进行建模,我们构建了一个运动状态编码器,以融入高阶运动状态属性,如加速度、急动度、方向等。为了进一步优化轨迹预测,我们提出了一个轻量级提案优化模块,该模块利用多层感知器进行轨迹嵌入,并生成参数较少的优化轨迹。在Argoverse 1数据集上的实验结果表明,我们的方法优于基线HiVT-64,minADE降低了约4.35%,minFDE降低了8.74%,MR降低了20%。我们还实现了比HiVT-128更高的精度,并实现了模型体积的68%缩减。
论文及项目相关链接
Summary
提出LTMSformer框架用于多模态轨迹预测的时空交互特征提取。利用局部趋势感知注意力机制捕捉局部时间依赖性,并构建运动状态编码器来模拟空间交互依赖性。此外,提出轻量级建议细化模块以进一步优化轨迹预测。在Argoverse 1数据集上的实验结果表明,该方法优于基线HiVT-64,并实现了较高的准确性,模型大小减少了68%。
Key Takeaways
- 建模轨迹预测中的复杂时空依赖性是一项挑战。
- 捕捉局部时间依赖性对预测有益,但大多数研究忽略了这一点。
- 高阶运动状态属性学习可以增强空间交互建模,但在以前的研究中很少看到。
- 提出LTMSformer框架用于多模态轨迹预测。
- 引入局部趋势感知注意力机制来捕捉局部时间依赖性。
- 构建运动状态编码器来模拟空间交互依赖性,并纳入高阶运动状态属性。
点此查看论文截图





Agentic Distributed Computing
Authors:Ajay D. Kshemkalyani, Manish Kumar, Anisur Rahaman Molla, Gokarna Sharma
The most celebrated and extensively studied model of distributed computing is the {\em message-passing model,} in which each vertex/node of the (distributed network) graph corresponds to a static computational device that communicates with other devices through passing messages. In this paper, we consider the {\em agentic model} of distributed computing which extends the message-passing model in a new direction. In the agentic model, computational devices are modeled as relocatable or mobile computational devices (called agents in this paper), i.e., each vertex/node of the graph serves as a container for the devices, and hence communicating with another device requires relocating to the same node. We study two fundamental graph level tasks, leader election, and minimum spanning tree, in the agentic model, which will enhance our understanding of distributed computation across paradigms. The objective is to minimize both time and memory complexities. Following the literature, we consider the synchronous setting in which each agent performs its operations synchronously with others, and hence the time complexity can be measured in rounds. In this paper, we present two deterministic algorithms for leader election: one for the case of $k<n$ and another for the case of $k=n$, minimizing both time and memory complexities, where $k$ and $n$, respectively, are the number of agents and number of nodes of the graph. Using these leader election results, we develop deterministic algorithms for agents to construct a minimum spanning tree of the graph, minimizing both time and memory complexities. To the best of our knowledge, this is the first study of distributed graph level tasks in the agentic model with $k\leq n$. Previous studies only considered the case of $k=n$.
分布式计算中最为著名且被广泛研究的模型是消息传递模型,其中(分布式网络)图的每个顶点/节点对应于一个静态计算设备,这些设备通过传递消息进行通信。在本文中,我们考虑分布式计算的智能体模型,该模型在消息传递模型的基础上向新的方向扩展。在智能体模型中,计算设备被建模为可移动的计算设备(本文称为代理),即图的每个顶点/节点作为设备的容器,因此与其他设备通信需要迁移至同一节点。我们研究智能体模型中的两个基本的图级别任务,即领导选举和最小生成树,这将增强我们对不同范式下分布式计算的理解。目标是最大限度地减少时间和内存复杂性。根据文献,我们考虑同步设置,其中每个智能体与其他智能体同步执行操作,因此时间复杂度可以按轮数来衡量。在本文中,我们为领导选举提出了两个确定性算法:一个用于k < n的情况,另一个用于k = n的情况,这两种情况都能最大限度地减少时间和内存复杂性,其中k和n分别是智能体和图的节点数。利用这些领导选举的结果,我们为智能体开发出了构建图的最小生成树的确定性算法,同时最小化时间和内存复杂度。据我们所知,这是首次在智能体模型中研究k≤n的分布式图级别任务。之前的研究只考虑了k = n的情况。
论文及项目相关链接
PDF 42 pages, 3 figures,3 tables, 8 pseudocodes; some overlaps with arXiv:2403.13716v2
Summary
该文本介绍了一种新型的分布式计算模型——agentic模型。该模型扩展了消息传递模型,将计算设备建模为可移动的装置(称为agent)。文章研究了agentic模型中的两个基本图形级任务,即领导选举和最小生成树,旨在最小化时间和内存复杂性。文章还介绍了两种用于领导选举的确定性算法,并基于领导选举结果开发了构建图形最小生成树的确定性算法。这是关于agentic模型中k≤n情况下分布式图形级别任务的首项研究,之前的研究仅考虑k=n的情况。
Key Takeaways
- Agentic模型是分布式计算的一种新型模型,将计算设备建模为可移动的装置(agents)。
- 该模型扩展了消息传递模型,注重研究图形级别的任务。
- 文章研究了两个基本图形级别任务:领导选举和最小生成树。
- 研究的目的是最小化时间和内存复杂性。
- 介绍了两种用于领导选举的确定性算法,分别针对k<n和k=n的情况。
- 基于领导选举结果,开发了构建图形最小生成树的确定性算法。
点此查看论文截图


Multimedia Verification Through Multi-Agent Deep Research Multimodal Large Language Models
Authors:Huy Hoan Le, Van Sy Thinh Nguyen, Thi Le Chi Dang, Vo Thanh Khang Nguyen, Truong Thanh Hung Nguyen, Hung Cao
This paper presents our submission to the ACMMM25 - Grand Challenge on Multimedia Verification. We developed a multi-agent verification system that combines Multimodal Large Language Models (MLLMs) with specialized verification tools to detect multimedia misinformation. Our system operates through six stages: raw data processing, planning, information extraction, deep research, evidence collection, and report generation. The core Deep Researcher Agent employs four tools: reverse image search, metadata analysis, fact-checking databases, and verified news processing that extracts spatial, temporal, attribution, and motivational context. We demonstrate our approach on a challenge dataset sample involving complex multimedia content. Our system successfully verified content authenticity, extracted precise geolocation and timing information, and traced source attribution across multiple platforms, effectively addressing real-world multimedia verification scenarios.
本文是我们为ACMMM25多媒体验证挑战赛提交的论文。我们开发了一种多智能体验证系统,该系统结合了多模态大型语言模型(MLLMs)和专业的验证工具来检测多媒体虚假信息。我们的系统通过六个阶段进行操作:原始数据处理、规划、信息提取、深入研究、证据收集和报告生成。核心深度研究员智能体采用四种工具:反向图像搜索、元数据分析、事实核查数据库和经过验证的新闻处理,用于提取空间、时间、归属和动机上下文信息。我们在涉及复杂多媒体内容的挑战数据集样本上展示了我们的方法。我们的系统成功验证了内容的真实性,提取了精确的位置和时间信息,并在多个平台上追踪了来源归属,有效地解决了现实世界的多媒体验证场景问题。
论文及项目相关链接
PDF 33rd ACM International Conference on Multimedia (MM’25) Grand Challenge on Multimedia Verification
Summary
多媒体验证是一项重要的挑战,本文介绍了我们在ACMMM25多媒体验证大赛中的提交作品。我们开发了一种多代理验证系统,该系统结合了多模态大型语言模型和专用验证工具来检测多媒体虚假信息。通过六个阶段操作:原始数据处理、规划、信息提取、深度研究、证据收集和报告生成。核心深度研究员代理采用四种工具:反向图像搜索、元数据分析、事实核查数据库和经过验证的新闻处理,提取空间、时间、归属和动机上下文。我们在包含复杂多媒体内容的挑战数据集样本上展示了我们的方法,成功验证了内容真实性,提取了精确的位置和时间信息,并跨多个平台追踪了来源归属,有效地解决了现实世界的多媒体验证场景。
Key Takeaways
- 提交作品专注于ACMMM25多媒体验证大赛,旨在解决多媒体信息真实性问题。
- 开发了一种多代理验证系统,结合了多模态大型语言模型和专用验证工具。
- 系统包括六个操作阶段:从原始数据处理到报告生成。
- 核心深度研究员代理使用四种工具进行深度研究,包括反向图像搜索、元数据分析、事实核查数据库和验证新闻处理。
- 该系统能够验证内容真实性,提取精确的位置和时间信息。
- 系统能够跨多个平台追踪来源归属。
点此查看论文截图




SRefiner: Soft-Braid Attention for Multi-Agent Trajectory Refinement
Authors:Liwen Xiao, Zhiyu Pan, Zhicheng Wang, Zhiguo Cao, Wei Li
Accurate prediction of multi-agent future trajectories is crucial for autonomous driving systems to make safe and efficient decisions. Trajectory refinement has emerged as a key strategy to enhance prediction accuracy. However, existing refinement methods often overlook the topological relationships between trajectories, which are vital for improving prediction precision. Inspired by braid theory, we propose a novel trajectory refinement approach, Soft-Braid Refiner (SRefiner), guided by the soft-braid topological structure of trajectories using Soft-Braid Attention. Soft-Braid Attention captures spatio-temporal topological relationships between trajectories by considering both spatial proximity and vehicle motion states at ``soft intersection points”. Additionally, we extend this approach to model interactions between trajectories and lanes, further improving the prediction accuracy. SRefiner is a multi-iteration, multi-agent framework that iteratively refines trajectories, incorporating topological information to enhance interactions within traffic scenarios. SRefiner achieves significant performance improvements over four baseline methods across two datasets, establishing a new state-of-the-art in trajectory refinement. Code is here https://github.com/Liwen-Xiao/SRefiner.
多智能体未来轨迹的精确预测对于自动驾驶系统做出安全和高效的决策至关重要。轨迹优化已成为提高预测准确性的关键策略。然而,现有的优化方法往往忽视了轨迹之间的拓扑关系,这对于提高预测精度至关重要。受辫状理论启发,我们提出了一种新型的轨迹优化方法——Soft-Braid Refiner(SRefiner)。该方法以轨迹的软辫拓扑结构为指导,使用Soft-Braid Attention。Soft-Braid Attention通过考虑“软交点”处的空间接近度和车辆运动状态,捕捉轨迹之间的时空拓扑关系。此外,我们将此方法扩展为对轨迹和车道之间的交互进行建模,进一步提高预测精度。SRefiner是一个多迭代、多智能体的框架,通过迭代优化轨迹,融入拓扑信息以增强交通场景中的交互。SRefiner在两个数据集上的四个基准方法上实现了显著的性能改进,在轨迹优化方面建立了最新技术。代码详见:链接地址。
论文及项目相关链接
Summary
基于辫状理论启发,提出一种新型轨迹优化方法——Soft-Braid Refiner(SRefiner),该方法通过Soft-Braid Attention捕捉轨迹的时空拓扑结构关系,考虑轨迹间的空间接近程度和车辆运动状态,特别是在“软交点”处的信息。此外,SRefiner还扩展了轨迹与车道间交互的建模,进一步提高了预测精度。SRefiner在多数据集上的性能优于四种基线方法,为轨迹优化树立了新的标准。
Key Takeaways
- 多智能体未来轨迹的准确预测对自动驾驶系统做出安全和高效决策至关重要。
- 轨迹优化是提高预测准确性的关键策略之一。
- 现有轨迹优化方法常忽略轨迹间的拓扑关系,这对提高预测精度至关重要。
- Soft-Braid Refiner(SRefiner)基于辫状理论启发,通过Soft-Braid Attention捕捉轨迹的时空拓扑结构关系。
- SRefiner考虑了轨迹间的空间接近程度和车辆运动状态,特别是在“软交点”处的信息。
- SRefiner扩展了轨迹与车道间交互的建模,进一步提高了预测精度。
点此查看论文截图




Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties
Authors:Guohong Liu, Jialei Ye, Jiacheng Liu, Yuanchun Li, Wei Liu, Pengzhi Gao, Jian Luan, Yunxin Liu
Mobile GUI agents are designed to autonomously execute diverse device-control tasks by interpreting and interacting with mobile screens. Despite notable advancements, their resilience in real-world scenarios where screen content may be partially manipulated by untrustworthy third parties remains largely unexplored. Owing to their black-box and autonomous nature, these agents are vulnerable to manipulations that could compromise user devices. In this work, we present the first systematic investigation into the vulnerabilities of mobile GUI agents. We introduce a scalable attack simulation framework AgentHazard, which enables flexible and targeted modifications of screen content within existing applications. Leveraging this framework, we develop a comprehensive benchmark suite comprising both a dynamic task execution environment and a static dataset of vision-language-action tuples, totaling over 3,000 attack scenarios. The dynamic environment encompasses 58 reproducible tasks in an emulator with various types of hazardous UI content, while the static dataset is constructed from 210 screenshots collected from 14 popular commercial apps. Importantly, our content modifications are designed to be feasible for unprivileged third parties. We evaluate 7 widely-used mobile GUI agents and 5 common backbone models using our benchmark. Our findings reveal that all examined agents are significantly influenced by misleading third-party content (with an average misleading rate of 28.8% in human-crafted attack scenarios) and that their vulnerabilities are closely linked to the employed perception modalities and backbone LLMs. Furthermore, we assess training-based mitigation strategies, highlighting both the challenges and opportunities for enhancing the robustness of mobile GUI agents. Our code and data will be released at https://agenthazard.github.io.
移动GUI代理旨在通过解释和与移动屏幕进行交互来自主执行多样化的设备控制任务。尽管有了显著的进步,但在现实世界场景中,当屏幕内容可能被不可信的第三方部分操纵时,它们的恢复能力在很大程度上仍未被探索。由于这些代理具有黑箱和自主性质,它们容易受到可能导致用户设备出现问题的操纵。在这项工作中,我们对移动GUI代理的脆弱性进行了首次系统研究。我们介绍了一个可扩展的攻击模拟框架AgentHazard,它能够在现有应用程序中灵活地实现有针对性的屏幕内容修改。利用这一框架,我们开发了一套全面的基准测试套件,其中包括一个动态任务执行环境和一个包含超过3000个攻击场景的静态视觉语言动作元组数据集。动态环境包括模拟器中的58个可复制任务,其中包含各种类型的危险UI内容,而静态数据集则是从14个流行的商业应用中收集的210张截图。重要的是,我们的内容修改是专为没有特权的第三方设计的。我们使用基准测试套件评估了7种常用的移动GUI代理和5种常见的骨干模型。我们的研究结果表明,所有经过测试的代理都受到第三方误导内容的影响(人为设计的攻击场景中平均误导率为28.8%),并且它们的脆弱性与所采用的感知方式和骨干模型紧密相关。此外,我们还评估了基于训练的缓解策略,并强调了增强移动GUI代理稳健性的挑战和机遇。我们的代码和数据将在https://agenthazard.github.io上发布。
论文及项目相关链接
Summary
本文研究了移动GUI代理的脆弱性,并介绍了攻击模拟框架AgentHazard。该框架能够在现有应用程序中灵活地修改屏幕内容。通过该框架,作者构建了一个包含动态任务执行环境和静态视觉语言动作元组数据集的综合基准测试套件,包含超过3000个攻击场景。评估了7种常用的移动GUI代理和5种常见的后台模型,发现所有被评估的代理都受到第三方误导内容的影响,其脆弱性与所采用的感知模态和后台LLM密切相关。
Key Takeaways
- 移动GUI代理在设计自主执行多样设备控制任务时,对屏幕内容的解读和交互存在脆弱性。
- 现有移动GUI代理在面对第三方对屏幕内容的部分操控时,其实际世界场景中的韧性有待提升。
- AgentHazard攻击模拟框架被用于系统地探索移动GUI代理的脆弱性,并允许对屏幕内容进行灵活和有针对性的修改。
- 综合基准测试套件包含动态任务执行环境和静态视觉语言动作数据集,涵盖超过3000个攻击场景。
- 所有评估的移动GUI代理都受到第三方误导内容的影响,平均误导率高达28.8%。
- 移动GUI代理的脆弱性与所采用的感知模态和后台语言模型紧密相关。
点此查看论文截图





How to Train Your LLM Web Agent: A Statistical Diagnosis
Authors:Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
基于LLM的Web代理近期取得了显著进展,但大部分进展出现在封闭源代码系统中,与开源替代方案的差距进一步扩大。进展受到两个主要挑战的限制:第一,对单步任务的狭窄关注,忽视了多步Web交互的复杂性;第二,LLM基于Web代理后训练所需的高计算成本。为解决这一问题,我们对LLM Web代理后训练的计算分配进行了首次统计研究。我们的方法采用两阶段管道,训练一只Llama 3.1 8B学生代理通过监督微调(SFT)模仿一只Llama 3.3 70B教师代理,随后进行基于策略的策略强化学习。我们发现这个过程对超参数选择非常敏感,使全面扫描变得不切实际。为了节省他人昂贵的试错成本,我们对1370个配置进行采样,并使用bootstrap方法来估计有效超参数。我们的结果表明,结合SFT和基于策略RL的方法在工作场和MiniWob++上的表现始终优于单一方法。此外,此策略仅需55%的计算量即可达到纯SFT在MiniWob++上的峰值性能,有效地推动了计算性能帕累托前沿的进步,是唯一能够缩小与封闭源代码模型差距的策略。
论文及项目相关链接
Summary
LLM网络代理在闭源系统中取得显著进展,但与开源替代方案之间存在差距。主要挑战在于过于关注单步任务,忽视了多步网络交互的复杂性,以及训练LLM网络代理后计算成本高昂。本研究首次对LLM网络代理进行统计研究,关注计算分配问题。研究采用两阶段管道,通过监督微调(SFT)训练Llama 3.1 8B学生模型以模仿Llama 3.3 70B教师模型,随后进行策略强化学习。研究发现该过程对超参数选择高度敏感,因此进行大规模搜索不切实际。为了节省他人昂贵的试错成本,本研究对1370种配置进行采样,并使用引导抽样估计有效超参数。结果显示,结合SFT和策略强化学习的方法始终优于单一方法,在工作竞技场和MiniWob++上的表现尤为突出。此外,该方法仅需55%的计算量即可达到纯SFT在MiniWob++上的峰值性能,有效推动计算性能帕累托前沿的进步,并且是唯一能缩小与闭源模型之间差距的策略。
Key Takeaways
- LLM网络代理在闭源系统中进展显著,但开源替代方案与之存在差距。
- 主要挑战包括:关注单步任务而忽视多步交互的复杂性,以及LLM网络代理训练的高计算成本。
- 研究采用两阶段管道:监督微调(SFT)和策略强化学习。
- 该过程对超参数选择高度敏感,大规模搜索不切实际。
- 结合SFT和策略强化学习的方法在任务表现上优于单一方法。
- 该方法降低了计算成本,有效推动计算性能帕累托前沿的进步。
点此查看论文截图



AutoMind: Adaptive Knowledgeable Agent for Automated Data Science
Authors:Yixin Ou, Yujie Luo, Jingsheng Zheng, Lanning Wei, Shuofei Qiao, Jintian Zhang, Da Zheng, Huajun Chen, Ningyu Zhang
Large Language Model (LLM) agents have shown great potential in addressing real-world data science problems. LLM-driven data science agents promise to automate the entire machine learning pipeline, yet their real-world effectiveness remains limited. Existing frameworks depend on rigid, pre-defined workflows and inflexible coding strategies; consequently, they excel only on relatively simple, classical problems and fail to capture the empirical expertise that human practitioners bring to complex, innovative tasks. In this work, we introduce AutoMind, an adaptive, knowledgeable LLM-agent framework that overcomes these deficiencies through three key advances: (1) a curated expert knowledge base that grounds the agent in domain expert knowledge, (2) an agentic knowledgeable tree search algorithm that strategically explores possible solutions, and (3) a self-adaptive coding strategy that dynamically tailors code generation to task complexity. Evaluations on two automated data science benchmarks demonstrate that AutoMind delivers superior performance versus state-of-the-art baselines. Additional analyses confirm favorable effectiveness, efficiency, and qualitative solution quality, highlighting AutoMind as an efficient and robust step toward fully automated data science.
大型语言模型(LLM)代理在解决现实世界的数据科学问题方面显示出巨大潜力。LLM驱动的数据科学代理有望自动化整个机器学习流程,但它们在现实世界中的有效性仍然有限。现有框架依赖于僵化、预先定义的工作流程和不可灵活调整的编码策略;因此,它们仅在相对简单、经典的问题上表现出色,而无法获取人类实践者在复杂、创新任务中所带来的经验知识。在这项工作中,我们介绍了AutoMind,这是一个自适应、知识型LLM代理框架,通过三个关键进展克服了这些不足:(1)一个精选的专家知识库,将代理与领域专家知识相结合;(2)一种智能知识树搜索算法,战略性地探索可能的解决方案;(3)一种自适应编码策略,根据任务复杂性动态调整代码生成。在两个自动化数据科学基准测试上的评估表明,AutoMind相较于最新基线技术表现出卓越性能。附加分析证实了其有效性、高效性和解决方案的优质性,凸显了AutoMind作为实现完全自动化数据科学的稳健而高效的步骤。
论文及项目相关链接
PDF Ongoing work. Code is at https://github.com/innovatingAI/AutoMind
Summary:大型语言模型(LLM)在解决现实数据科学问题方面显示出巨大潜力,LLM驱动的数据科学代理承诺自动化整个机器学习管道,但其在现实世界中的有效性仍然受到限制。现有框架依赖于预设的工作流程和僵化的编码策略,仅擅长处理相对简单的问题,难以捕捉人类实践者在复杂创新任务中的经验知识。本研究引入AutoMind,一种自适应的知识型LLM代理框架,通过三个关键进展克服这些缺陷:1)精选的专家知识库使代理融入领域专家知识,2)智能知识树搜索算法战略性探索可能的解决方案,以及3)自适应编码策略根据任务复杂性动态调整代码生成。评估表明,AutoMind在自动化数据科学基准测试中表现出优于最新技术的性能。附加分析证实了其有效性、效率和解决方案质量,突显AutoMind是朝着全自动数据科学的稳健而高效的一步。
Key Takeaways:
- LLM在解决数据科学问题方面具有巨大潜力,但现实应用中的有效性受限。
- 现有数据科学框架依赖于预设的工作流程和僵化的编码策略,难以处理复杂创新任务。
- AutoMind是一个自适应的知识型LLM代理框架,通过三个关键创新改善现有问题。
- AutoMind融入领域专家知识,通过智能知识库增强代理的能力。
- 智能知识树搜索算法使AutoMind能战略性探索可能的解决方案。
- AutoMind具有自适应编码策略,能根据任务复杂性动态调整代码生成。
点此查看论文截图




On the Role of Feedback in Test-Time Scaling of Agentic AI Workflows
Authors:Souradip Chakraborty, Mohammadreza Pourreza, Ruoxi Sun, Yiwen Song, Nino Scherrer, Furong Huang, Amrit Singh Bedi, Ahmad Beirami, Jindong Gu, Hamid Palangi, Tomas Pfister
Agentic AI workflows (systems that autonomously plan and act) are becoming widespread, yet their task success rate on complex tasks remains low. A promising solution is inference-time alignment, which uses extra compute at test time to improve performance. Inference-time alignment relies on three components: sampling, evaluation, and feedback. While most prior work studies sampling and automatic evaluation, feedback remains underexplored. To study the role of feedback, we introduce Iterative Agent Decoding (IAD), a procedure that repeatedly inserts feedback extracted from different forms of critiques (reward models or AI-generated textual feedback) between decoding steps. Through IAD, we analyze feedback along four dimensions: (1) its role in the accuracy-compute trade-offs with limited inference budget, (2) quantifying the gains over diversity-only baselines such as best-of-N sampling, (3) effectiveness of composing feedback from reward models versus textual critique, and (4) robustness to noisy or low-quality feedback. Across Sketch2Code, Text2SQL, Intercode, and WebShop, we show that IAD with proper integration of high fidelity feedback leads to consistent gains up to 10 percent absolute performance improvement over various baselines such as best-of-N. Our findings underscore feedback as a crucial knob for inference-time alignment of agentic AI workflows with limited inference budget.
基于AI的工作流程(自主计划和行动的系统)正变得越来越普遍,然而它们在复杂任务上的任务成功率仍然较低。一种有前景的解决方案是推理时间对齐,它利用额外的计算测试时间来提高性能。推理时间对齐依赖于三个组件:采样、评估和反馈。虽然大多数早期的研究集中在采样和自动评估上,但反馈仍然被忽视。为了研究反馈的作用,我们引入了迭代代理解码(IAD),这是一种在解码步骤之间反复插入从不同形式的评论(奖励模型或AI生成的文本反馈)中提取的反馈的程序。通过IAD,我们沿着四个维度分析反馈:(1)其在有限的推理预算下对准确性计算折衷的作用,(2)相对于只重视多样性的基线(如最佳N采样)的量化收益,(3)从奖励模型与文本评论组合反馈的有效性,(4)对嘈杂或低质量反馈的稳健性。在Sketch2Code、Text2SQL、Intercode和WebShop上,我们展示了适当整合高保真反馈的IAD在各种基线(如最佳N采样)上实现了高达10%的绝对性能改进。我们的研究结果表明,对于具有有限推理预算的基于AI的工作流程而言,反馈是推理时间对齐的关键参数。
论文及项目相关链接
Summary
本文探讨了自主计划行动的Agentic AI工作流在复杂任务上的成功率较低的问题。为解决这一问题,研究者提出了推理时间对齐方法,该方法在测试时利用额外的计算来提高性能。本文重点研究反馈的作用,并介绍了迭代式解码(IAD),一种在解码步骤之间重复插入从不同形式的批评中提取的反馈的程序。通过对IAD的分析,研究发现其在有限推理预算下的准确性计算权衡、相对于多样性基准的增益量化、从奖励模型中组合反馈的有效性以及对于噪声或低质量反馈的稳健性等方面都有显著表现。在多个任务上的实验结果表明,通过适当整合高保真反馈,IAD在各种基准测试上实现了高达10%的绝对性能提升。
Key Takeaways
- Agentic AI工作流在复杂任务上的成功率仍然有待提高。
- 推理时间对齐方法通过使用额外的计算在测试时提高AI的性能。
- 反馈在推理时间对齐中扮演重要角色,而这一点在之前的研究中却被忽视。
- 迭代式解码(IAD)程序能够在解码步骤之间插入不同形式的反馈。
- IAD在有限推理预算下的准确性计算权衡方面表现出色。
- IAD相对于多样性基准的增益量化显著。
点此查看论文截图





Autonomous Microscopy Experiments through Large Language Model Agents
Authors:Indrajeet Mandal, Jitendra Soni, Mohd Zaki, Morten M. Smedskjaer, Katrin Wondraczek, Lothar Wondraczek, Nitya Nand Gosvami, N. M. Anoop Krishnan
Large language models (LLMs) are revolutionizing self driving laboratories (SDLs) for materials research, promising unprecedented acceleration of scientific discovery. However, current SDL implementations rely on rigid protocols that fail to capture the adaptability and intuition of expert scientists in dynamic experimental settings. We introduce Artificially Intelligent Lab Assistant (AILA), a framework automating atomic force microscopy through LLM driven agents. Further, we develop AFMBench a comprehensive evaluation suite challenging AI agents across the complete scientific workflow from experimental design to results analysis. We find that state of the art models struggle with basic tasks and coordination scenarios. Notably, Claude 3.5 sonnet performs unexpectedly poorly despite excelling in materials domain question answering (QA) benchmarks, revealing that domain specific QA proficiency does not necessarily translate to effective agentic capabilities. Additionally, we observe that LLMs can deviate from instructions, raising safety alignment concerns for SDL applications. Our ablations reveal that multi agent frameworks outperform single-agent architectures. We also observe significant prompt fragility, where slight modifications in prompt structure cause substantial performance variations in capable models like GPT 4o. Finally, we evaluate AILA’s effectiveness in increasingly advanced experiments AFM calibration, feature detection, mechanical property measurement, graphene layer counting, and indenter detection. Our findings underscore the necessity for rigorous benchmarking protocols and prompt engineering strategies before deploying AI laboratory assistants in scientific research environments.
大型语言模型(LLM)正在为材料研究的自动驾驶实验室(SDL)带来革命性的变革,并有望为科学发现带来前所未有的加速。然而,当前的SDL实现依赖于严格的协议,这些协议未能捕捉到动态实验环境中专家科学家的适应性和直觉。我们引入了人工智能实验室助手(AILA)这一框架,通过LLM驱动的代理实现原子力显微镜的自动化。此外,我们开发了AFMBench评估套件,全面挑战从实验设计到结果分析整个科学工作流程的AI代理。我们发现,最先进的技术模型在基本任务和协调场景方面遇到了困难。值得注意的是,尽管Claude 3.5 sonnet在材料领域问答(QA)基准测试中表现出色,但在实际执行中表现不佳,这表明特定领域的问答能力并不一定能转化为有效的代理能力。此外,我们观察到LLM会偏离指令,这给SDL应用程序的安全对齐带来了担忧。我们的分析表明,多代理框架优于单代理架构。我们还观察到显著的提示脆弱性,即提示结构的微小变化会导致如GPT 4o等能力模型的性能出现巨大变化。最后,我们在越来越先进的实验中评估了AILA的有效性,包括原子力显微镜校准、特征检测、机械性能测量、石墨烯层计数和压痕检测。我们的研究结果强调了在部署人工智能实验室助手进行科学研究的环境中,进行严格的基准测试协议和提示工程策略的必要性。
论文及项目相关链接
Summary
大型语言模型(LLM)正在为材料研究的自动驾驶实验室(SDL)带来革命性变革,加速了科学发现的速度。然而,现有的SDL实现依赖于僵化的协议,无法捕获动态实验环境中专家科学家的适应性和直觉。为此,我们引入了人工智能实验室助手(AILA)框架,通过LLM驱动的智能体自动化原子力显微镜。同时,我们开发了AFMBench评估套件,全面挑战从实验设计到结果分析的整个科学工作流程中的AI智能体。研究发现在实际场景中,前沿模型在基本任务和协调场景方面面临挑战。此外,我们还观察到LLM可能偏离指令的问题,引发了SDL应用的安全对齐担忧。我们的研究还表明多智能体框架优于单智能体架构。最后,我们评估了AILA在AFM校准、特征检测、机械性能测试、石墨烯层计数和印痕检测等高级实验中的有效性。我们的研究结果表明,在将AI实验室助手部署于科学研究环境之前,需要严格的基准测试协议和提示工程策略。
Key Takeaways
- 大型语言模型(LLM)正在加速材料研究的自动驾驶实验室(SDL)的科学发现过程。
- 现有SDL实现缺乏适应性和直觉,无法模拟专家科学家在动态实验环境中的行为。
- 引入人工智能实验室助手(AILA)框架和AFMBench评估套件,以自动化原子力显微镜并全面评估AI智能体在科学工作流程中的表现。
- 即使是前沿模型,在基本任务和协调场景方面也存在挑战。
- LLM可能偏离指令,引发SDL应用的安全问题。
- 多智能体框架在性能上优于单智能体架构。
- 在将AI实验室助手部署于科学研究环境前,需严格的基准测试协议和提示工程策略。
点此查看论文截图


