发布日期: 2025-05-02

更新日期: 2025-05-25

文章字数: 12.8k

阅读时长: 52 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-05-02 更新

SWE-smith: Scaling Data for Software Engineering Agents

Authors:John Yang, Kilian Leret, Carlos E. Jimenez, Alexander Wettig, Kabir Khandpur, Yanzhe Zhang, Binyuan Hui, Ofir Press, Ludwig Schmidt, Diyi Yang

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.

尽管最近在软件工程语言模型（LMs）方面取得了进展，但收集训练数据仍然是一个痛点。现有数据集很小，最多只有来自11个或更少GitHub仓库的数千个训练实例。整理此类数据集的流程通常很复杂，需要数百小时的人工劳动；伴随的执行环境也需要占用数兆字节的存储空间，严重限制了其可扩展性和可用性。为了解决这一痛点，我们引入了SWE-smith，这是一个大规模生成软件工程训练数据的新型管道。给定任何Python代码库，SWE-smith构建相应的执行环境，然后自动合成数百到数千个任务实例，这些实例会破坏代码库中的现有测试。使用SWE-smith，我们创建了一个包含5万个实例的数据集，这些实例来源于128个GitHub仓库，规模比以前的所有工作都大一个数量级。我们训练了SWE-agent-LM-32B模型，在SWE-bench经过验证的基准测试中达到了40.2%的Pass@1解决率，在开源模型中处于最先进水平。为了降低语言模型系统在自动软件工程方面的入门门槛，我们将SWE-smith（收集流程、任务实例、轨迹、模型）开源。所有资源均可在https://swesmith.com找到。

论文及项目相关链接

PDF

Summary

本文介绍了针对软件工程技术训练数据收集难题的新型解决方案SWE-smith。SWE-smith能够自动生成大量的任务实例，打破了以往测试集的限制，从GitHub仓库中构建了相应的执行环境。通过SWE-smith生成的包含5万个实例的数据集显著提高了软件工程技术训练数据规模。并且开源模型SWE-agent-LM-32B在SWE-bench Verified基准测试中实现了较高的性能表现。文章的所有资源均已在https://swesmith.com上公开。

Key Takeaways

SWE-smith是一种新型软件工程技术训练数据生成管道，能够自动生成大量任务实例。
该方法通过构建执行环境，打破了以往测试集的限制。
使用SWE-smith生成的数据集包含5万个实例，来自128个GitHub仓库，规模显著。
开源模型SWE-agent-LM-32B在基准测试中性能领先。
SWE-smith的开源化降低了进入门槛，便于研究LM系统在自动化软件工程中的应用。
所有相关资源均可在https://swesmith.com上找到。

Cool Papers

点此查看论文截图

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Authors:Marc Glocker, Peter Hönig, Matthias Hirschmanner, Markus Vincze

We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat-hsr.

我们呈现了一个具有LLM驱动的主体编排架构的实体机器人系统，用于自主管理家居物品。该系统结合了增强记忆的任务规划，使机器人能够执行高级用户命令，同时跟踪过去的动作。它采用了三个专业代理：路由代理、任务规划代理和知识库代理，每个代理都由特定任务的LLM提供支持。通过利用上下文学习，我们的系统避免了对明确模型训练的需求。RAG使系统能够从过去的交互中检索上下文，从而提高长期物体跟踪能力。Grounded SAM和LLaMa3.2-Vision的结合提供了强大的物体检测功能，促进任务规划的语义场景理解。在三个家庭场景中的评估表明，任务规划精度高，且由于RAG的参与，记忆回忆能力有所提高。具体来说，Qwen2.5在专业代理方面表现最佳，而LLaMA3.1在路由任务方面表现出色。源代码可在https://github.com/marc1198/chat-hsr上找到。

论文及项目相关链接

PDF Accepted at Austrian Robotics Workshop 2025

Summary

一个具有LLM驱动代理编排架构的实体机器人系统，用于自主管理家居物品。该系统结合了记忆增强的任务规划，使机器人能够执行高级用户命令并跟踪过去的行动。通过利用上下文学习，该系统避免了显式模型训练的需要。RAG使系统能够从过去的互动中检索上下文，提高长期物品跟踪能力。结合Grounded SAM和LLaMa3.2-Vision，提供稳健的对象检测，促进场景语义理解以进行任务规划。在三个家庭场景中的评估显示，任务规划精度高，且由于RAG的存在，记忆回忆有所改善。

Key Takeaways

该系统是一个实体机器人平台，旨在通过LLM驱动代理编排架构进行自主家居物品管理。
集成记忆增强的任务规划功能，使机器人能执行高级用户命令并跟踪过去的行为。
系统采用上下文学习，无需明确的模型训练。
RAG能够从过去的互动中检索上下文，提高长期物品跟踪能力。
Grounded SAM和LLaMa3.2-Vision的结合为系统提供了稳健的对象检测功能。
系统在三个家庭场景中的评估结果表明，任务规划精度高，记忆召回也有所改善。

Cool Papers

点此查看论文截图

Iterative Trajectory Exploration for Multimodal Agents

Authors:Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

Multimodal agents, which integrate a controller (e.g., a large language model) with external tools, have demonstrated remarkable capabilities in tackling complex tasks. However, existing agents need to collect a large number of expert data for fine-tuning to adapt to new environments. In this paper, we propose an online self-exploration method for multimodal agents, namely SPORT, via step-wise preference optimization to refine the trajectories of agents, which automatically generates tasks and learns from solving the generated tasks, without any expert annotation. SPORT operates through four iterative components: task synthesis, step sampling, step verification, and preference tuning. First, we synthesize multi-modal tasks using language models. Then, we introduce a novel search scheme, where step sampling and step verification are executed alternately to solve each generated task. We employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller’s policy through preference tuning, producing a SPORT Agent. By interacting with real environments, the SPORT Agent evolves into a more refined and capable system. Evaluation in the GTA and GAIA benchmarks show that the SPORT Agent achieves 6.41% and 3.64% improvements, underscoring the generalization and effectiveness introduced by our method. The project page is https://SPORT-Agents.github.io.

多模态智能体通过整合控制器（如大型语言模型）与外部工具，在应对复杂任务方面展现出卓越的能力。然而，现有智能体需要收集大量专家数据来进行微调以适应新环境。在本文中，我们提出了一种在线自我探索的多模态智能体方法，名为SPORT，通过逐步偏好优化来优化智能体的轨迹，该方法可以自动生成任务并从解决生成的任务中学习，无需任何专家标注。SPORT通过四个迭代组件进行操作：任务合成、步骤采样、步骤验证和偏好调整。首先，我们使用语言模型合成多模态任务。然后，我们引入了一种新颖的搜索方案，该方案交替执行步骤采样和步骤验证来解决每个生成的任务。我们采用验证器为人工智能提供反馈来构建逐步偏好数据。随后，这些数据被用来通过偏好调整更新控制器的策略，从而产生一个SPORT智能体。通过与真实环境进行交互，SPORT智能体进化成一个更加精细和强大的系统。在GTA和GAIA基准测试中的评估显示，SPORT智能体的改进率达到6.41%和3.64%，凸显了我们方法的通用性和有效性。项目页面为https://SPORT-Agents.github.io。

论文及项目相关链接

PDF 16 pages, 8 figures

Summary
多模态智能体通过与外部工具集成大型语言模型控制器，展现出处理复杂任务的能力。然而，现有智能体需要收集大量专家数据进行微调以适应新环境。本研究提出一种在线自我探索方法SPORT，通过步骤偏好优化为智能体调整轨迹，使其自动生成任务并从解决的任务中学习，无需专家标注。SPORT通过任务合成、步骤采样、步骤验证和偏好调整四个迭代环节操作。利用语言模型合成多模态任务，引入新型搜索方案，交替执行步骤采样和验证来解决每个生成的任务。采用验证器提供AI反馈来构建步骤偏好数据，随后用于通过偏好调整更新控制器的策略，产生SPORT智能体。在与实际环境互动中，SPORT智能体逐渐进化为更精细、更强大的系统。在GTA和GAIA基准测试中，SPORT智能体的表现证明了该方法带来的推广和有效性。

Key Takeaways

多模态智能体通过集成大型语言模型控制器和外部工具，具备处理复杂任务的能力。
现有智能体需要大量专家数据进行微调以适应新环境。
SPORT是一种在线自我探索方法，为智能体调整轨迹，无需专家标注即可自动生成任务并学习。
SPORT通过任务合成、步骤采样、步骤验证和偏好调整四个迭代环节操作。
采用语言模型合成多模态任务，并引入新型搜索方案解决生成的任务。
通过AI反馈构建步骤偏好数据，用于更新控制器的策略。

Cool Papers

点此查看论文截图

Unsupervised Feature Transformation via In-context Generation, Generator-critic LLM Agents, and Duet-play Teaming

Authors:Nanxu Gong, Xinyuan Wang, Wangyang Ying, Haoyue Bai, Sixun Dong, Haifeng Chen, Yanjie Fu

Feature transformation involves generating a new set of features from the original dataset to enhance the data’s utility. In certain domains like material performance screening, dimensionality is large and collecting labels is expensive and lengthy. It highly necessitates transforming feature spaces efficiently and without supervision to enhance data readiness and AI utility. However, existing methods fall short in efficient navigation of a vast space of feature combinations, and are mostly designed for supervised settings. To fill this gap, our unique perspective is to leverage a generator-critic duet-play teaming framework using LLM agents and in-context learning to derive pseudo-supervision from unsupervised data. The framework consists of three interconnected steps: (1) Critic agent diagnoses data to generate actionable advice, (2) Generator agent produces tokenized feature transformations guided by the critic’s advice, and (3) Iterative refinement ensures continuous improvement through feedback between agents. The generator-critic framework can be generalized to human-agent collaborative generation, by replacing the critic agent with human experts. Extensive experiments demonstrate that the proposed framework outperforms even supervised baselines in feature transformation efficiency, robustness, and practical applicability across diverse datasets.

特征转换涉及从原始数据集中生成新的特征集，以提高数据的效用。在某些领域，如材料性能筛选中，维度很大，收集标签既昂贵又耗时。这强烈需要有效地进行特征空间转换，并且无需监督来提高数据准备和人工智能的效用。然而，现有方法在有效地遍历大量的特征组合空间方面存在不足，并且大多是为有监督环境设计的。为了填补这一空白，我们的独特视角是利用生成器-评论家二重奏协作框架，使用大型语言模型（LLM）代理和上下文学习，从无监督数据中获取伪监督。该框架包含三个相互关联的步骤：（1）评论家代理诊断数据以生成可操作的建议，（2）生成器代理根据评论家的建议生成标记化的特征转换，（3）迭代优化确保通过代理之间的反馈进行持续改进。生成器-评论家框架可以通过用人类专家替换评论家代理来推广到人类-代理协作生成。大量实验表明，所提出的框架在特征转换效率、稳健性和实际应用的适用性方面甚至超过了有监督的基线模型，适用于多种数据集。

论文及项目相关链接

PDF Accepted to IJCAI 2025

Summary

特征转换通过从原始数据集中生成新的特征集来提高数据效用。在材料性能筛选等领域，由于维度大且收集标签的成本高且耗时，高效且无监督的特征空间转换变得至关重要。现有方法难以在大量的特征组合空间中进行有效导航，并且大多设计用于监督环境。为解决此空白，我们利用生成器-评论家二重奏的框架，结合大型语言模型（LLM）代理和上下文学习，从无监督数据中衍生伪监督。该框架包含三个相互关联的步骤：评论家代理诊断数据以生成可操作的建议，生成器代理根据评论家的建议产生标记化的特征转换，以及迭代优化确保通过代理之间的反馈持续改进。生成器-评论家框架可推广至人机协作生成，以人类专家替代评论家代理。大量实验表明，该框架在特征转换效率、稳健性和实际应用的多样性方面，甚至超越了监督基线。

Key Takeaways

特征转换能提高数据的效用，特别是在维度大、标签收集成本高的领域。
现有特征转换方法存在对大量特征组合空间导航困难的问题，且主要适用于监督环境。
利用生成器-评论家二重奏框架，结合大型语言模型代理和上下文学习，可从无监督数据中衍生伪监督。
框架包含三个主要步骤：数据诊断、特征生成和迭代优化。
生成器-评论家框架可推广至人机协作生成，以人类专家替代评论家代理。
该框架在特征转换效率、稳健性和实际应用的多样性方面表现优异，超越监督基线。

Cool Papers

点此查看论文截图

Robust Multi-agent Communication Based on Decentralization-Oriented Adversarial Training

Authors:Xuyan Ma, Yawen Wang, Junjie Wang, Xiaofei Xie, Boyu Wu, Shoubin Li, Fanjiang Xu, Qing Wang

In typical multi-agent reinforcement learning (MARL) problems, communication is important for agents to share information and make the right decisions. However, due to the complexity of training multi-agent communication, existing methods often fall into the dilemma of local optimization, which leads to the concentration of communication in a limited number of channels and presents an unbalanced structure. Such unbalanced communication policy are vulnerable to abnormal conditions, where the damage of critical communication channels can trigger the crash of the entire system. Inspired by decentralization theory in sociology, we propose DMAC, which enhances the robustness of multi-agent communication policies by retraining them into decentralized patterns. Specifically, we train an adversary DMAC_Adv which can dynamically identify and mask the critical communication channels, and then apply the adversarial samples generated by DMAC_Adv to the adversarial learning of the communication policy to force the policy in exploring other potential communication schemes and transition to a decentralized structure. As a training method to improve robustness, DMAC can be fused with any learnable communication policy algorithm. The experimental results in two communication policies and four multi-agent tasks demonstrate that DMAC achieves higher improvement on robustness and performance of communication policy compared with two state-of-the-art and commonly-used baselines. Also, the results demonstrate that DMAC can achieve decentralized communication structure with acceptable communication cost.

在多智能体强化学习（MARL）的典型问题中，通信对于智能体共享信息和做出正确决策至关重要。然而，由于训练多智能体通信的复杂性，现有方法往往陷入局部优化的困境，导致通信集中在有限的通道上，呈现出不平衡的结构。这种不平衡的通信策略在异常条件下很脆弱，关键通信通道的损坏可能会引发整个系统的崩溃。受社会学中分散化理论的启发，我们提出了DMAC（分散多智能体通信）算法，它通过重新训练智能体通信策略来提高其鲁棒性，使它们呈现分散的模式。具体来说，我们训练了一个对手DMAC_Adv，它能够动态地识别和屏蔽关键的通信通道，然后将由DMAC_Adv生成的对抗样本应用于通信策略的对抗性学习，迫使策略探索其他潜在的通信方案，并转向分散的结构。作为一种提高鲁棒性的训练方法，DMAC可以与任何可学习的通信策略算法相结合。在两个通信策略和四个多智能体任务的实验结果证明了DMAC在提高通信策略的鲁棒性和性能方面的优势，相较于两个先进且常用的基线方法，DMAC取得了更高的改进。此外，结果还表明DMAC能够实现具有可接受的通信成本的分散式通信结构。

论文及项目相关链接

PDF

Summary

多智能体强化学习中的通信至关重要，但现有方法易陷入局部优化困境，导致通信不平衡。本文受社会学中分权理论启发，提出DMAC方法，通过重新训练智能体通信策略以增强其鲁棒性，并采用对抗性训练方式促使智能体探索其他潜在通信方案并实现去中心化结构。实验表明，DMAC能提高通信策略的鲁棒性和性能。

Key Takeaways

多智能体强化学习中，通信对智能体决策至关重要。
现有方法面临局部优化问题，导致通信不平衡和结构脆弱。
DMAC方法受社会学中分权理论启发，旨在增强智能体通信策略的鲁棒性。
DMAC通过重新训练通信策略，促使智能体实现去中心化结构。
DMAC采用对抗性训练方式，使智能体能动态识别并屏蔽关键通信通道。
实验表明，DMAC相较于两种先进基线方法，在提高通信策略的鲁棒性和性能方面有更高成效。

Cool Papers

点此查看论文截图

Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA

Authors:Xuanzhao Dong, Wenhui Zhu, Hao Wang, Xiwen Chen, Peijie Qiu, Rui Yin, Yi Su, Yalin Wang

Medical question answering (QA) is a reasoning-intensive task that remains challenging for large language models (LLMs) due to hallucinations and outdated domain knowledge. Retrieval-Augmented Generation (RAG) provides a promising post-training solution by leveraging external knowledge. However, existing medical RAG systems suffer from two key limitations: (1) a lack of modeling for human-like reasoning behaviors during information retrieval, and (2) reliance on suboptimal medical corpora, which often results in the retrieval of irrelevant or noisy snippets. To overcome these challenges, we propose Discuss-RAG, a plug-and-play module designed to enhance the medical QA RAG system through collaborative agent-based reasoning. Our method introduces a summarizer agent that orchestrates a team of medical experts to emulate multi-turn brainstorming, thereby improving the relevance of retrieved content. Additionally, a decision-making agent evaluates the retrieved snippets before their final integration. Experimental results on four benchmark medical QA datasets show that Discuss-RAG consistently outperforms MedRAG, especially significantly improving answer accuracy by up to 16.67% on BioASQ and 12.20% on PubMedQA. The code is available at: https://github.com/LLM-VLM-GSL/Discuss-RAG.

医疗问答（QA）是一项需要大量推理的任务，对于大型语言模型（LLM）来说仍然具有挑战性，因为存在虚构和过时的领域知识。检索增强生成（RAG）通过利用外部知识提供了一种有前途的后期训练解决方案。然而，现有的医疗RAG系统存在两个主要局限性：（1）在信息检索过程中缺乏对人类式推理行为的建模；（2）依赖于次优的医疗语料库，这往往导致检索到不相关或嘈杂的片段。为了克服这些挑战，我们提出了Discuss-RAG，这是一个即插即用的模块，旨在通过基于协作代理的推理来增强医疗问答RAG系统。我们的方法引入了一个总结代理，该代理协调医疗专家团队进行多次头脑风暴，从而提高检索内容的关联性。此外，一个决策代理在最终整合之前评估检索到的片段。在四个基准医疗问答数据集上的实验结果表明，Discuss-RAG持续优于MedRAG，特别是在BioASQ上答案准确率提高了高达16.67%，在PubMedQA上提高了12.20%。代码可在：https://github.com/LLM-VLM-GSL/Discuss-RAG找到。

论文及项目相关链接

PDF

摘要

医学问答是一个需要大量推理的任务，对于大型语言模型来说仍然具有挑战性，因为存在虚构和过时的领域知识问题。检索增强生成（RAG）通过利用外部知识提供了一种有前景的后期训练解决方案。然而，现有的医学RAG系统存在两个主要局限性：（1）在信息检索过程中缺乏人类式推理行为的建模；（2）依赖于次优医学语料库，这通常会导致检索到不相关或嘈杂的片段。为了克服这些挑战，我们提出了Discuss-RAG，这是一个即插即用的模块，旨在通过基于协作代理的推理增强医学问答RAG系统。我们的方法引入了一个总结代理，该代理协调医学专家团队进行多轮讨论，从而提高检索内容的关联性。此外，一个决策代理对检索到的片段进行评估，然后再进行最终集成。在四个基准医学问答数据集上的实验结果表明，Discuss-RAG的性能持续优于MedRAG，尤其是在BioASQ上答案准确率提高高达16.67%，在PubMedQA上提高12.20%。相关代码可在：https://github.com/LLM-VLM-GSL/Discuss-RAG获取。

要点

医学问答是一个对大型语言模型具有挑战性的任务，因为存在虚构和领域知识过时的问题。
检索增强生成（RAG）是一种利用外部知识的后期训练解决方案。
现有医学RAG系统存在两个主要局限性：缺乏人类式推理行为的建模和依赖于次优医学语料库。
Discuss-RAG通过引入总结代理和决策代理，旨在解决这些挑战。
总结代理协调医学专家团队进行多轮讨论，提高检索内容的关联性。
决策代理对检索到的片段进行评估，再进行最终集成。

Cool Papers

点此查看论文截图

VideoMultiAgents: A Multi-Agent Framework for Video Question Answering

Authors:Noriyuki Kugo, Xiang Li, Zixin Li, Ashish Gupta, Arpandeep Khatua, Nidhish Jain, Chaitanya Patel, Yuta Kyuragi, Yasunori Ishii, Masamoto Tanabiki, Kazuki Kozuka, Ehsan Adeli

Video Question Answering (VQA) inherently relies on multimodal reasoning, integrating visual, temporal, and linguistic cues to achieve a deeper understanding of video content. However, many existing methods rely on feeding frame-level captions into a single model, making it difficult to adequately capture temporal and interactive contexts. To address this limitation, we introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. It enhances video understanding leveraging complementary multimodal reasoning from independently operating agents. Our approach is also supplemented with a question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions directly relevant to a given query, thus improving the answer accuracy. Experimental results demonstrate that our method achieves state-of-the-art performance on Intent-QA (79.0%, +6.2% over previous SOTA), EgoSchema subset (75.4%, +3.4%), and NExT-QA (79.6%, +0.4%). The source code is available at https://github.com/PanasonicConnect/VideoMultiAgents.

视频问答（VQA）本质上依赖于多模态推理，它结合了视觉、时间和语言线索，以实现更深层次的视频内容理解。然而，许多现有方法依赖于将帧级字幕输入单一模型，这使得难以充分捕获时间和交互上下文。为了解决这一局限性，我们引入了VideoMultiAgents框架，该框架集成了用于视觉、场景图分析和文本处理的专门代理。它通过独立运行的代理的互补多模态推理来增强视频理解。我们的方法还辅以问题引导的字幕生成，生成的字幕突出了与给定查询直接相关的对象、动作和时间过渡，从而提高了答案的准确性。实验结果表明，我们的方法在Intent-QA（意图问答，79.0%，较之前的最优方法高出6.2%）、EgoSchema子集（75.4%，高出3.4%）和NExT-QA（下一个问答，79.6%，高出0.4%）上达到了最新技术水平。源代码可在https://github.com/PanasonicConnect/VideoMultiAgents获取。

论文及项目相关链接

PDF

Summary
视频问答（VQA）需要依靠对视频内容的多模态推理进行回答，包括对视觉、时间和语言线索的整合。然而，现有方法大多依赖于将帧级字幕输入单一模型，难以充分捕捉时间和交互上下文。为解决此问题，我们推出VideoMultiAgents框架，集成专门用于视觉、场景图分析和文本处理的代理。它通过独立操作的代理的互补多模态推理增强视频理解。此外，辅以问题导向的标题生成，生成与查询直接相关的物体、动作和时间过渡的标题，从而提高答案的准确性。实验结果证明，我们的方法在Intent-QA（提升6.2%，达到79.0%）、EgoSchema子集（提升3.4%，达到75.4%）和NExT-QA（提升0.4%，达到79.6%）上达到了最新的水平。代码可在PanasonicConnect公司的VideoMultiAgents项目中找到。

Key Takeaways

VQA依赖于多模态推理，需要整合视觉、时间和语言线索。
现有方法主要依赖单一模型处理帧级字幕，难以捕捉时间和交互上下文。
VideoMultiAgents框架集成了专门用于视觉、场景图分析和文本处理的代理，通过独立操作的代理的互补进行视频理解。
问题导向的标题生成技术可以提高答案的准确性。
VideoMultiAgents在多个数据集上取得了最新成果，包括Intent-QA、EgoSchema子集和NExT-QA。

Cool Papers

点此查看论文截图

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

Authors:Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, Ben Kao, Guohao Li, Junxian He, Yu Qiao, Zhiyong Wu

Graphical User Interface (GUI) agents powered by Vision-Language Models (VLMs) have demonstrated human-like computer control capability. Despite their utility in advancing digital automation, a critical bottleneck persists: collecting high-quality trajectory data for training. Common practices for collecting such data rely on human supervision or synthetic data generation through executing pre-defined tasks, which are either resource-intensive or unable to guarantee data quality. Moreover, these methods suffer from limited data diversity and significant gaps between synthetic data and real-world environments. To address these challenges, we propose OS-Genesis, a novel GUI data synthesis pipeline that reverses the conventional trajectory collection process. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions, then retrospectively derive high-quality tasks to enable trajectory-level exploration. A trajectory reward model is then employed to ensure the quality of the generated trajectories. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks. In-depth analysis further validates OS-Genesis’s efficiency and its superior data quality and diversity compared to existing synthesis methods. Our codes, data, and checkpoints are available at https://qiushisun.github.io/OS-Genesis-Home/.

由视觉语言模型（VLMs）驱动的图形用户界面（GUI）代理已经展现出人类般的计算机控制能力。尽管它们在推进数字自动化方面很有用，但仍然存在一个关键瓶颈：收集高质量轨迹数据进行训练。目前常见的收集此类数据的方法依赖于人工监督或通过执行预定义任务来生成合成数据，这些方法要么资源密集，要么无法保证数据质量。此外，这些方法还遭受数据多样性有限以及合成数据与真实世界环境之间显著差异的困扰。为了应对这些挑战，我们提出了OS-Genesis，一种新型GUI数据合成管道，它颠覆了传统的轨迹收集过程。不同于依赖预定义任务的方法，OS-Genesis使代理首先感知环境并执行分步交互，然后回顾性地推导高质量任务以实现轨迹层面的探索。接着，采用轨迹奖励模型来保证生成轨迹的质量。我们证明，使用OS-Genesis训练的GUI代理在高度挑战性的在线基准测试中显著提高了性能。深入分析进一步验证了OS-Genesis的效率及其相较于现有合成方法在数据质量和多样性方面的优越性。我们的代码、数据和检查点位于：https://qiushisun.github.io/OS-Genesis-Home/。

论文及项目相关链接

PDF Work in progress

Summary

基于Vision-Language Models（VLMs）的图形用户界面（GUI）代理展现了人类般的计算机控制功能。然而，收集高质量轨迹数据以进行训练仍是瓶颈。常规的数据收集方法依赖于人工监督或执行预定义任务生成合成数据，这些方法要么资源消耗大，要么无法保证数据质量。为解决这些问题，本文提出了OS-Genesis，一种新型的GUI数据合成流程。OS-Genesis改变了传统轨迹收集过程，让代理首先感知环境并执行逐步交互，然后回顾性地生成高质量任务以实现轨迹层面的探索。使用轨迹奖励模型确保生成轨迹的质量。实验证明，使用OS-Genesis训练的GUI代理在具有挑战性的在线基准测试中表现优异。

Key Takeaways

GUI代理利用Vision-Language Models（VLMs）实现了人类般的计算机控制功能。
收集高质量轨迹数据是训练GUI代理的瓶颈。
传统数据收集方法依赖人工监督或预定义任务生成合成数据，存在资源消耗大、数据质量难以保证的问题。
OS-Genesis是一种新型的GUI数据合成流程，改变了轨迹收集方式，让代理通过感知环境和逐步交互来生成高质量任务。
OS-Genesis使用轨迹奖励模型确保生成轨迹的质量。
使用OS-Genesis训练的GUI代理在在线基准测试中表现优异。

Cool Papers

点此查看论文截图

You Name It, I Run It: An LLM Agent to Execute Tests of Arbitrary Projects

Authors:Islem Bouzenia, Michael Pradel

The ability to execute the test suite of a project is essential in many scenarios, e.g., to assess code quality and code coverage, to validate code changes made by developers or automated tools, and to ensure compatibility with dependencies. Despite its importance, executing the test suite of a project can be challenging in practice because different projects use different programming languages, software ecosystems, build systems, testing frameworks, and other tools. These challenges make it difficult to create a reliable, universal test execution method that works across different projects. This paper presents ExecutionAgent, an automated technique that prepares scripts for building an arbitrary project from source code and running its test cases. Inspired by the way a human developer would address this task, our approach is a large language model (LLM)-based agent that autonomously executes commands and interacts with the host system. The agent uses meta-prompting to gather guidelines on the latest technologies related to the given project, and it iteratively refines its process based on feedback from the previous steps. Our evaluation applies ExecutionAgent to 50 open-source projects that use 14 different programming languages and many different build and testing tools. The approach successfully executes the test suites of 33/50 projects, while matching the test results of ground truth test suite executions with a deviation of only 7.5%. These results improve over the best previously available technique by 6.6x. The costs imposed by the approach are reasonable, with an execution time of 74 minutes and LLM costs of USD 0.16, on average per project. We envision ExecutionAgent to serve as a valuable tool for developers, automated programming tools, and researchers that need to execute tests across a wide variety of projects.

执行项目测试集的能力在许多场景中都是至关重要的，例如评估代码质量和代码覆盖率、验证开发人员或自动化工具所做的代码更改，以及确保与依赖项的兼容性。尽管其重要性不言而喻，但在实践中执行项目的测试集可能会面临挑战，因为不同的项目会使用不同的编程语言、软件生态系统、构建系统、测试框架和其他工具。这些挑战使得创建一个可靠、通用的跨项目测试执行方法变得困难。

本文介绍了ExecutionAgent，这是一种自动化技术，用于为从源代码构建任意项目并运行其测试用例生成脚本。我们的方法灵感来自于人类开发者如何完成此任务，它是一个基于大型语言模型（LLM）的代理，可以自主执行命令并与主机系统交互。该代理使用元提示来收集有关给定项目的最新技术指南，并基于上一步的反馈来迭代优化其流程。

我们的评估将ExecutionAgent应用于50个使用14种不同编程语言以及许多不同构建和测试工具的开源项目。该方法成功执行了33/50个项目的测试集，与基准测试集执行结果的匹配度达到了仅7.5%的偏差。这些结果比之前可用的最佳技术有了6.6倍的提升。该方法的成本是合理的，每个项目的平均执行时间为74分钟，LLM的成本为0.16美元。

我们期望ExecutionAgent能成为开发者、自动化编程工具和研究人员执行广泛项目测试的有价值工具。

论文及项目相关链接

PDF PUBLISHED AT ISSTA 2025

Summary
项目测试套件执行能力在许多场景中至关重要，如评估代码质量和覆盖率、验证开发者和自动化工具所做的代码更改以及确保与依赖项的兼容性。然而，执行项目测试套件在实践中可能具有挑战性，因为不同项目使用的编程语言、软件生态系统、构建系统、测试框架和其他工具都有所不同。本文提出ExecutionAgent，一种自动化技术，用于为任意项目从源代码构建脚本并运行其测试用例。该方法是基于大型语言模型（LLM）的代理，可自主执行命令并与主机系统交互。通过对50个使用不同编程语言和构建及测试工具的开源项目进行评估，该方法成功执行了其中33个项目的测试套件，与真实测试套件执行结果的偏差仅为7.5%。相较于之前最佳的技术方法，这一结果提高了6.6倍。该方法的执行时间和LLM成本合理。预期ExecutionAgent能为需要跨多种项目执行测试的开发者、自动化编程工具和研究者提供有价值的工具。

Key Takeaways

执行测试套件对于评估代码质量、验证代码更改和确保依赖兼容性至关重要。
不同项目在执行测试套件时面临多种挑战，如使用的编程语言、软件生态系统等差异。
ExecutionAgent是一种自动化技术，可以从源代码为任意项目构建脚本并运行其测试用例。
ExecutionAgent基于大型语言模型（LLM），自主执行命令并与主机系统交互。
该方法成功执行了33个项目的测试套件，与真实结果的偏差仅为7.5%。
ExecutionAgent相较于之前的技术方法有显著提高，执行时间合理，且LLM成本不高。

Cool Papers

点此查看论文截图

Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Authors:Janghoon Ock, Tirtha Vinchurkar, Yayati Jadhav, Amir Barati Farimani

Adsorption energy is a key reactivity descriptor in catalysis, enabling efficient screening for optimal catalysts. However, determining adsorption energy typically requires evaluating numerous adsorbate-catalyst configurations. Current algorithmic approaches rely on exhaustive enumeration of adsorption sites and configurations, which makes the process computationally intensive and does not inherently guarantee the identification of the global minimum energy. In this work, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify system-specific stable adsorption configurations corresponding to the global minimum adsorption energy. Adsorb-Agent leverages its built-in knowledge and emergent reasoning capabilities to strategically explore adsorption configurations likely to hold adsorption energy. By reducing the reliance on exhaustive sampling, it significantly decreases the number of initial configurations required while improving the accuracy of adsorption energy predictions. We evaluate Adsorb-Agent’s performance across twenty representative systems encompassing a range of complexities. The Adsorb-Agent successfully identifies comparable adsorption energies for 83.7% of the systems and achieves lower energies, closer to the actual global minimum, for 35% of the systems, while requiring significantly fewer initial configurations than conventional methods. Its capability is particularly evident in complex systems, where it identifies lower adsorption energies for 46.7% of systems involving intermetallic surfaces and 66.7% of systems with large adsorbate molecules. These results demonstrate the potential of Adsorb-Agent to accelerate catalyst discovery by reducing computational costs and improving the reliability of adsorption energy predictions.

吸附能是催化中的关键反应性描述符，能够实现高效筛选最佳催化剂。然而，确定吸附能通常需要评估大量的吸附物-催化剂构型。当前的算法方法依赖于吸附位点和构型的详尽列举，这使得过程计算密集，并且不保证能固有地识别全局最低能量。在这项工作中，我们引入了Adsorb-Agent，这是一个大型语言模型（LLM）代理，旨在有效地识别与全局最低吸附能相对应的特定系统稳定吸附构型。Adsorb-Agent利用其内置知识和新兴推理能力来有策略地探索可能含有吸附能的吸附构型。通过减少对详尽采样的依赖，它在减少所需初始构型数量的同时提高了吸附能预测的准确性。我们评估了Adsorb-Agent在涵盖各种复杂性的二十个代表性系统上的性能。Adsorb-Agent成功地为83.7%的系统确定了相当的吸附能，并为35%的系统实现了更接近实际全局最低值的能量，同时所需的初始构型远少于传统方法。其能力在复杂系统中尤其明显，为涉及金属间表面的系统中有46.7%和具有大吸附分子的系统中66.7%确定了较低的吸附能。这些结果证明了Adsorb-Agent在加速催化剂发现方面的潜力，可以降低计算成本并提高吸附能预测的可信度。

论文及项目相关链接

PDF

Summary

吸附能是催化中的重要反应描述符，用于有效筛选最佳催化剂。然而，确定吸附能通常需要评估许多吸附质-催化剂配置。当前算法方法依赖于吸附位点和配置的详尽列举，这使得过程计算密集，并不保证找到全局最低能量。在本研究中，我们引入了吸附剂代理（Adsorb-Agent），这是一个大型语言模型（LLM）代理，旨在有效识别与全局最低吸附能对应的系统特定稳定吸附配置。它通过利用内置知识和新兴推理能力来策略性地探索可能包含吸附能的吸附配置，减少了对详尽采样的依赖，在减少所需初始配置数量的同时提高了吸附能预测的准确性。我们对Adsorb-Agent的性能在二十个代表性系统上进行评估，涵盖了一系列复杂性。Adsorb-Agent成功为83.7%的系统确定了可比的吸附能，为35%的系统达到更接近实际全局最小值的能量，同时比传统方法需要的初始配置要少得多。其在复杂系统中的能力特别明显，在涉及金属间表面的系统中为46.7%的系统以及具有大吸附质分子的系统中为66.7%的系统确定了较低的吸附能。这些结果证明了Adsorb-Agent在加速催化剂发现方面的潜力，可降低计算成本并提高吸附能预测的可靠性。

Key Takeaways

吸附能是催化中重要的反应描述符，对于筛选最佳催化剂至关重要。
当前算法在确定吸附能时依赖详尽的吸附位点和配置列举，计算密集且不一定找到全局最低能量。
Adsorb-Agent是一个大型语言模型代理，可高效识别系统特定的稳定吸附配置和全局最低吸附能。
Adsorb-Agent利用内置知识和新兴推理能力来策略性探索吸附配置。
该代理减少了对详尽采样的依赖，提高了预测吸附能的准确性和效率。
在多个代表性系统上评估，Adsorb-Agent在复杂系统中表现特别出色。

Cool Papers

点此查看论文截图

Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems

Authors:Ruochen Jiao, Shaoyuan Xie, Justin Yue, Takami Sato, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu

Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning process introduces considerable safety and security vulnerabilities, especially in safety-critical cyber-physical systems. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-based Decision-making systems (BALD) in embodied AI, systematically exploring the attack surfaces and trigger mechanisms. Specifically, we propose three distinct attack mechanisms: word injection, scenario manipulation, and knowledge injection, targeting various components in the LLM-based decision-making pipeline. We perform extensive experiments on representative LLMs (GPT-3.5, LLaMA2, PaLM2) in autonomous driving and home robot tasks, demonstrating the effectiveness and stealthiness of our backdoor triggers across various attack channels, with cases like vehicles accelerating toward obstacles and robots placing knives on beds. Our word and knowledge injection attacks achieve nearly 100% success rate across multiple models and datasets while requiring only limited access to the system. Our scenario manipulation attack yields success rates exceeding 65%, reaching up to 90%, and does not require any runtime system intrusion. We also assess the robustness of these attacks against defenses, revealing their resilience. Our findings highlight critical security vulnerabilities in embodied LLM systems and emphasize the urgent need for safeguarding these systems to mitigate potential risks.

大型语言模型（LLM）在实体人工智能的现实决策任务中显示出巨大的潜力，尤其是在微调以利用其固有的常识和推理能力的同时，针对特定应用进行定制时。然而，这种微调过程引入了相当大的安全和漏洞隐患，特别是在安全关键的网物系统。在这项工作中，我们提出了针对基于LLM的决策系统（BALD）的第一个全面的框架，系统地探索攻击面和触发机制。具体来说，我们提出了三种独特的攻击机制：单词注入、场景操控和知识注入，针对基于LLM的决策管道的不同组件。我们对具有代表性的LLM（GPT-3.5、LLaMA2、PaLM2）在自动驾驶和家庭机器人任务上进行了广泛的实验，证明了我们的后门触发在通过各种攻击渠道时的有效性和隐蔽性，包括车辆加速冲向障碍和机器人在床上放置刀具等案例。我们的单词和知识注入攻击在多个模型和数据集上成功率接近100%，而且只需要对系统有限的访问权限。我们的场景操控攻击成功率超过65%，最高可达90%，而且不需要任何运行时系统入侵。我们还对攻击进行了防御评估，揭示了它们的韧性。我们的研究发现了实体LLM系统中关键的安全漏洞，并强调了保护这些系统以减轻潜在风险的紧迫需求。

论文及项目相关链接

PDF Accepted paper at ICLR 2025, 31 pages, including main paper, references, and appendix

Summary

大型语言模型在实体人工智能的决策任务中展现出巨大潜力，但精细调整过程会引发安全和保障漏洞，特别是在安全关键的网路物理系统中。本研究首次提出针对基于LLM的决策系统的后门攻击（BALD）的全面框架，针对LLM决策管道的不同组件，提出三种独特的攻击机制：单词注入、场景操控和知识注入。我们在自动驾驶和家庭机器人任务中对代表性LLM（GPT-3.5、LLaMA2、PaLM2）进行了广泛实验，证明了我们的后门触发机制在各种攻击通道中的有效性和隐蔽性。

Key Takeaways

大型语言模型（LLM）在实体人工智能的决策任务中具有显著潜力，但精细调整过程引发安全和保障漏洞。
首次提出针对LLM决策系统的后门攻击（BALM）的全面框架。
提出三种攻击机制：单词注入、场景操控和知识注入，可针对LLM决策管道的不同部分。
实验证明，在自动驾驶和家庭机器人任务中，所提攻击方法具有高效性和隐蔽性。
单词注入和知识注入攻击的成功率高达近100%，场景操控攻击的成功率超过65%，甚至达到90%。
这些攻击对系统的防御具有抗性，显示出其稳健性。

Cool Papers

点此查看论文截图

Variational Offline Multi-agent Skill Discovery

Authors:Jiayu Chen, Tian Lan, Vaneet Aggarwal

Skills are effective temporal abstractions established for sequential decision making, which enable efficient hierarchical learning for long-horizon tasks and facilitate multi-task learning through their transferability. Despite extensive research, research gaps remain in multi-agent scenarios, particularly for automatically extracting subgroup coordination patterns in a multi-agent task. In this case, we propose two novel auto-encoder schemes: VO-MASD-3D and VO-MASD-Hier, to simultaneously capture subgroup- and temporal-level abstractions and form multi-agent skills, which firstly solves the aforementioned challenge. An essential algorithm component of these schemes is a dynamic grouping function that can automatically detect latent subgroups based on agent interactions in a task. Further, our method can be applied to offline multi-task data, and the discovered subgroup skills can be transferred across relevant tasks without retraining. Empirical evaluations on StarCraft tasks indicate that our approach significantly outperforms existing hierarchical multi-agent reinforcement learning (MARL) methods. Moreover, skills discovered using our method can effectively reduce the learning difficulty in MARL scenarios with delayed and sparse reward signals. The codebase is available at https://github.com/LucasCJYSDL/VOMASD.

技能是为顺序决策而建立的有效时间抽象，它们为长期任务提供了高效的层次化学习，并通过其可转移性促进了多任务学习。尽管已有大量研究，但在多智能体场景中仍存在研究空白，特别是在多智能体任务中自动提取子组协调模式方面。针对这种情况，我们提出了两种新型自动编码器方案：VO-MASD-3D和VO-MASD-Hier，它们可以同时捕获子组和时间层面的抽象，形成多智能体技能，首先解决了上述挑战。这些方案的一个基本算法组件是一个动态分组函数，该函数可以基于任务中的智能体交互自动检测潜在子组。此外，我们的方法可以应用于离线多任务数据，并且发现的子组技能可以在相关任务之间转移而无需重新训练。在星际争霸任务上的经验评估表明，我们的方法显著优于现有的分层多智能体强化学习（MARL）方法。此外，使用我们的方法发现的技能可以有效地降低具有延迟和稀疏奖励信号的多智能体场景中的学习难度。代码库位于https://github.com/LucasCJYSDL/VOMASD。

论文及项目相关链接

PDF This work will appear in the proceedings of IJCAI 2025

Summary
技能是有效的时序抽象，用于序列决策制定，能够实现长周期任务的高效层次学习，并通过其可转移性促进多任务学习。然而，在多智能体场景中，自动提取智能体协调模式的研究还存在缺口。针对此问题，我们提出了两种新型自动编码器方案：VO-MASD-3D和VO-MASD-Hier，它们可以同时捕获子群体和时序层面的抽象信息，形成多智能体技能。关键算法组件是动态分组功能，可以基于任务中智能体的交互自动检测潜在子群体。此外，我们的方法可应用于离线多任务数据，发现的子群体技能可以在相关任务之间转移而无需重新训练。在星际争霸任务上的实证评估表明，我们的方法显著优于现有的分层多智能体强化学习（MARL）方法。此外，使用我们的方法发现的技能可以有效地降低MARL场景中延迟和稀疏奖励信号的学习难度。

Key Takeaways