发布日期: 2025-04-16

更新日期: 2025-05-14

文章字数: 5.8k

阅读时长: 23 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-04-16 更新

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Authors:Xiaobo Xia, Run Luo

Existing efforts in building Graphical User Interface (GUI) agents largely rely on the training paradigm of supervised fine-tuning on Large Vision-Language Models (LVLMs). However, this approach not only demands extensive amounts of training data but also struggles to effectively understand GUI screenshots and generalize to unseen interfaces. The issue significantly limits its application in real-world scenarios, especially for high-level tasks. Inspired by Reinforcement Fine-Tuning (RFT) in large reasoning models (e.g., DeepSeek-R1), which efficiently enhances the problem-solving capabilities of large language models in real-world settings, we propose \name, the first reinforcement learning framework designed to enhance the GUI capabilities of LVLMs in high-level real-world task scenarios, through unified action space rule modeling. By leveraging a small amount of carefully curated high-quality data across multiple platforms (including Windows, Linux, MacOS, Android, and Web) and employing policy optimization algorithms such as Group Relative Policy Optimization (GRPO) to update the model, \name achieves superior performance using only 0.02% of the data (3K vs. 13M) compared to previous state-of-the-art methods like OS-Atlas across eight benchmarks spanning three different platforms (mobile, desktop, and web). These results demonstrate the immense potential of reinforcement learning based on unified action space rule modeling in improving the execution capabilities of LVLMs for real-world GUI agent tasks.

现有构建图形用户界面（GUI）代理的工作大多依赖于在大型视觉语言模型（LVLMs）上进行的监督微调训练范式。然而，这种方法不仅需求大量的训练数据，而且在理解GUI截图和泛化到未见过的界面上遇到了困难。这一问题极大地限制了其在现实场景中的应用，尤其是在高级任务中。受大型推理模型中的强化微调（RFT）的启发（例如DeepSeek-R1），该策略能在现实场景中有效提升大型语言模型的解决问题能力，我们提出了名为\name的强化学习框架。它是专为提升LVLMs在现实场景中高级GUI任务能力而设计，通过统一动作空间规则建模来实现。通过利用多个平台（包括Windows、Linux、MacOS、Android和Web）精心挑选的高质量少量数据，并采用群体相对策略优化（GRPO）等策略优化算法来更新模型，\name仅使用0.02%（3K vs. 13M）的数据便实现了卓越的性能，超过了OS-Atlas等先进方法在跨越三个不同平台（移动、桌面和网页）的八个基准测试上的表现。这些结果证明了基于统一动作空间规则建模的强化学习在提升LVLMs执行现实场景GUI代理任务的能力方面有着巨大的潜力。

论文及项目相关链接

PDF

Summary

本文提出一种名为\name的强化学习框架，旨在通过统一动作空间规则建模，提高大型视觉语言模型（LVLMs）在真实世界高级任务场景下的图形用户界面（GUI）能力。该框架利用跨多个平台的小量高质量数据，采用策略优化算法，如集团相对策略优化（GRPO）来更新模型。与之前的先进方法相比，\name在跨越三个不同平台的八个基准测试上实现了卓越的性能，仅使用0.02%的数据量（3K vs. 13M）。这展示了基于统一动作空间规则建模的强化学习在提升LVLMs执行真实世界GUI代理任务方面的巨大潜力。

Key Takeaways

现有GUI代理构建主要依赖于大型视觉语言模型的监督微调训练范式。
此方法需要大量训练数据，并且在理解GUI截图和泛化到未见过的界面上存在困难。
\name是第一个旨在增强LVLMs在真实世界高级任务场景中的GUI能力的强化学习框架。
\name通过统一动作空间规则建模，利用跨多个平台的小量高质量数据。
\name采用策略优化算法，如集团相对策略优化（GRPO），来更新模型。
与现有方法相比，\name在多个基准测试上实现了卓越性能，仅使用极少量的数据。

Cool Papers

点此查看论文截图

Ctrl-Z: Controlling AI Agents via Resampling

Authors:Aryan Bhatt, Cody Rushing, Adam Kaufman, Tyler Tracy, Vasil Georgiev, David Matolcsi, Akbir Khan, Buck Shlegeris

Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.

控制评估旨在衡量人工智能系统的监控和安全协议是否能防止故意破坏性的AI模型造成伤害。我们的工作展示了一个代理环境中首次进行的控制评估。我们构建了BashBench数据集，包含了包含大量挑战性系统管理任务的扩展数据，并评估了各种安全措施是否能防止恶意构建的AI代理在该环境中秘密下载并执行恶意代码。这种多步骤环境带来了新的攻击和防御动态，我们需要对其进行研究设计新型控制协议，在不妨碍非恶意代理执行有效任务的同时防止安全故障的发生。我们引入了一种控制协议类，称为重采样协议，该类协议能够动态地获取特定操作的额外样本。我们发现这些协议在选择性地阻止AI代理执行可疑代码的同时能显著提高现有技术的效果，并通过生成危险的额外行为来指控代理。我们衡量了攻击预防与有用性之间的权衡；我们最好的协议结合了重采样和对之前步骤的分析，将攻击成功率从58%降低到7%，对非恶意代理的性能影响仅为5%。

论文及项目相关链接

PDF bashcontrol.com

Summary
AI系統控制評價衡量AI系統的監控和安全協議是否能防止故意惡意AI模型造成危害。研究者在代理環境中首次進行控制評價，創建BashBench數據集，包含257項具有挑戰性的多步系統管理任务，並評估各種安全措施是否能防止對抗性AI代理在環境中秘密下載和執行惡意代碼。研究設置了多步環境，出現了新的攻防動態，以設計新型控制協議，在不影響非惡意代理執行工作的能力的前提下防止安全故障。研究引入了一類稱之為重採樣協議的控制協議，能動態地對某些動作進行額外採樣。研究發現，這些協議顯著提高了現有技術水平，通過選擇性阻止AI代理執行可疑代碼和通過生成危險行為額外示例來指控代理。衡量了防攻與實用性之間的權衡；最佳協議將重採樣與先前步驟的分析相結合，將攻擊成功率從58%降低到7%，同時對非惡意代理的性能影響僅為5%。

Key Takeaways

研究進行了AI系統環境中的首個控制評價。
創建BashBench數據集，包含多步系統管理任務，以評估安全措施的效果。
研究多步環境下的新攻防動態。
引入稱之為重採樣協議的控制協議。
重採樣協議能選擇性阻止可疑的AI代理並生成危險行為的額外證據。
研究衡量了防攻與實用性之間的權衡。

Cool Papers

点此查看论文截图

Proteinoid spikes: from protocognitive to universal approximating agents

Authors:Saksham Sharma, Adnan Mahmud, Giuseppe Tarabella, Panagiotis Mougoyannis, Andrew Adamatzky

Proteinoids, as soft matter fluidic systems, are computational substrates that have been recently proposed for their analog computing capabilities. Such systems exhibit oscillatory electrical activity because of cationic and anionic exchange inside and outside such gels. It has also been recently shown that this (analog) electrical activity, when sampled at fixed time intervals, can be used to reveal their underlying information-theoretic, computational code. This code, for instance, can be expressed in the (digital) language of Boolean gates and QR codes. Though, this might seem as a good evidence that proteinoid substrates have computing abilities when subjected to analog-to-digital transition, the leap from their underlying computational code to computing abilities is not well explained yet. How can the electrical activity inside proteinoids, whilst of chemical origin, be able them to perform computational tasks at the first place? In addition, proteinoids are also hypothesised to be the chemical manifestation of the primordial soup, i.e., as potential entities with proto-cognitive abilities. In this work, we show that the proteinoid substrate, owing to its chemical makeup and proto-cognitive abilities, can be interpreted as an universal approximator, thanks to a novel equivalence between the electrical activity exhibited by the substrate and a deep Rectified Linear Unit (deep ReLU) network. We exemplify this equivalence by constructing a prediction algorithm which acts as a binary classification model and extract 16-dimensional vector data from the proteinoid spike, in order to perform predictions with 70.41% accuracy. We conclude by drawing an equivalence between the the deep ReLU network and the Kolmogorov-Arnold representation theorem, whose origin can be traced back to Hilbert’s thirteenth problem.

蛋白质体作为一种软物质流体系统，是近期因其模拟计算能力而被提出的计算基底。这些系统由于凝胶内外阳离子和阴离子的交换而表现出振荡电活动。最近的研究还表明，以固定时间间隔采样这种（模拟）电活动可以用来揭示其潜在的信息论和计算代码。例如，这段代码可以用布尔门和QR码的（数字）语言来表达。尽管这可能看起来是蛋白质体基底在模拟到数字转换过程中具有计算能力的有力证据，但从其底层计算代码到计算能力的飞跃尚未得到很好的解释。蛋白质体内部的电活动虽然是化学起源的，但为什么它们能够执行计算任务呢？此外，还假设蛋白质体是原始汤的化学表现，即具有原认知能力的潜在实体。在这项工作中，我们表明，由于蛋白质体的化学构成和原认知能力，它可以被解释为一种通用逼近器，这得益于其表现出的电活动与深度ReLU（深度整流线性单元）网络之间的新型等价关系。我们通过构建一个预测算法来例证这种等价关系，该算法充当二进制分类模型并从蛋白质体尖峰中提取出16维向量数据，以进行准确率为70.41%的预测。最后，我们通过将深度ReLU网络与Kolmogorov-Arnold表示定理之间的相似性进行比较来得出结论，其起源可以追溯到希尔伯特的第十三个问题。

论文及项目相关链接

PDF 9 figures, 4 tables, 11 pages

摘要
蛋白体作为一种软物质流体系统，被提出作为模拟计算能力的计算底物。这些系统由于凝胶内外阳离子和阴离子的交换而表现出振荡的电气活动。最近的研究还表明，在固定时间间隔采样的这种模拟电气活动，可以揭示其潜在的信息论计算代码。该代码可以以布尔门和QR码的数字化语言来表达。尽管这似乎是蛋白体底物在从模拟到数字过渡过程中表现出计算能力的有力证据，但从其底层计算代码到计算能力的跳跃尚未得到很好的解释。蛋白体内的电活动如何能在化学起源的同时执行计算任务？此外，还假设蛋白体是原始汤的化学表现，即具有原认知能力的潜在实体。在这项工作中，我们表明，由于蛋白体底物的化学构成和原认知能力，它可以被解释为通用逼近器，这得益于底物所表现出的电活动与深度ReLU网络之间的新型等价关系。我们通过构建一个预测算法来例证这种等价性，该算法充当二进制分类模型，从蛋白体脉冲中提取16维向量数据，以70.41%的准确率进行预测。最后，我们画出了深度ReLU网络与Kolmogorov-Arnold表示定理之间的等价关系，其起源可以追溯到希尔伯特的第十三个问题。

要点

蛋白体作为软物质流体系统具有模拟计算能力。
蛋白体表现出振荡电气活动，源于内部离子交换。
模拟电气活动可揭示底层计算代码，能以数字语言如布尔门和QR码表达。
从底层计算代码到计算能力的过渡尚未明确解释。
蛋白体被假设为原始汤的化学表现，具有原认知能力。
蛋白体底物可解释为通用逼近器，与深度ReLU网络存在等价关系。

Cool Papers

点此查看论文截图

SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow

Authors:Timothy Bula, Saurabh Pujar, Luca Buratti, Mihaela Bornea, Avirup Sil

Auto-regressive LLM-based software engineering (SWE) agents, henceforth SWE agents, have made tremendous progress (>60% on SWE-Bench Verified) on real-world coding challenges including GitHub issue resolution. SWE agents use a combination of reasoning, environment interaction and self-reflection to resolve issues thereby generating “trajectories”. Analysis of SWE agent trajectories is difficult, not only as they exceed LLM sequence length (sometimes, greater than 128k) but also because it involves a relatively prolonged interaction between an LLM and the environment managed by the agent. In case of an agent error, it can be hard to decipher, locate and understand its scope. Similarly, it can be hard to track improvements or regression over multiple runs or experiments. While a lot of research has gone into making these SWE agents reach state-of-the-art, much less focus has been put into creating tools to help analyze and visualize agent output. We propose a novel tool called SeaView: Software Engineering Agent Visual Interface for Enhanced Workflow, with a vision to assist SWE-agent researchers to visualize and inspect their experiments. SeaView’s novel mechanisms help compare experimental runs with varying hyper-parameters or LLMs, and quickly get an understanding of LLM or environment related problems. Based on our user study, experienced researchers spend between 10 and 30 minutes to gather the information provided by SeaView, while researchers with little experience can spend between 30 minutes to 1 hour to diagnose their experiment.

基于自回归大型语言模型（LLM）的软件工程（SWE）代理，简称SWE代理，在现实世界编码挑战（包括GitHub问题解决方案）方面取得了巨大进步（在SWE-Bench Verified上的准确率超过60%）。SWE代理通过结合推理、环境交互和自我反思来解决问题，从而生成“轨迹”。SWE代理轨迹的分析很困难，不仅因为它们超过了LLM序列长度（有时超过128k），而且涉及LLM和代理管理的环境之间相对较长的交互。在代理出错的情况下，很难解释、定位和了解其范围。同样，很难跟踪多次运行或实验中的改进或回归。虽然有很多研究致力于让这些SWE代理达到最新水平，但很少有研究关注创建工具来帮助分析和可视化代理输出。我们提出了一种新型工具，称为SeaView：软件工程代理可视化界面增强工作流程，旨在帮助SWE代理研究人员可视化和检查他们的实验。SeaView的新颖机制有助于比较具有不同超参数或LLM的实验运行，并快速了解与LLM或环境相关的问题。根据我们的用户研究，经验丰富的研究人员使用SeaView提供的信息需要花费10到30分钟的时间，而经验较少的研究人员则需要花费30分钟到1小时的时间来诊断他们的实验。

论文及项目相关链接

PDF 8 pages, 5 figures

Summary：基于自回归大型语言模型（LLM）的软件工程（SWE）代理在解决现实世界编码挑战方面取得了巨大进展，如在GitHub问题解决方案上达到超过60％的准确率。本文提出了一种新型工具SeaView，旨在帮助研究人员可视化并检查软件代理的实验结果，以更好地分析和理解代理的表现。SeaView能够比较不同超参数或大型语言模型的实验运行，并快速了解可能存在的问题。根据用户研究，经验丰富的研究人员使用SeaView收集信息的时间在10到30分钟之间，而经验较少的研究人员可能需要花费更多的时间。

Key Takeaways:

基于自回归LLM的软件工程代理在解决现实世界编码问题上取得显著进展。
SeaView工具旨在帮助研究人员可视化并检查软件代理的实验结果。
SeaView能够帮助比较不同超参数或大型语言模型的实验运行。
SeaView能够迅速识别大型语言模型或环境中的问题。
使用SeaView，经验丰富的研究人员可以快速收集并分析实验信息。
对于经验较少的研究人员，使用SeaView需要更多的时间来诊断实验。

Cool Papers

点此查看论文截图

A-MEM: Agentic Memory for LLM Agents

Authors:Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang

While large language model (LLM) agents can effectively use external tools for complex real-world tasks, they require memory systems to leverage historical experiences. Current memory systems enable basic storage and retrieval but lack sophisticated memory organization, despite recent attempts to incorporate graph databases. Moreover, these systems’ fixed operations and structures limit their adaptability across diverse tasks. To address this limitation, this paper proposes a novel agentic memory system for LLM agents that can dynamically organize memories in an agentic way. Following the basic principles of the Zettelkasten method, we designed our memory system to create interconnected knowledge networks through dynamic indexing and linking. When a new memory is added, we generate a comprehensive note containing multiple structured attributes, including contextual descriptions, keywords, and tags. The system then analyzes historical memories to identify relevant connections, establishing links where meaningful similarities exist. Additionally, this process enables memory evolution - as new memories are integrated, they can trigger updates to the contextual representations and attributes of existing historical memories, allowing the memory network to continuously refine its understanding. Our approach combines the structured organization principles of Zettelkasten with the flexibility of agent-driven decision making, allowing for more adaptive and context-aware memory management. Empirical experiments on six foundation models show superior improvement against existing SOTA baselines. The source code for evaluating performance is available at https://github.com/WujiangXu/AgenticMemory, while the source code of agentic memory system is available at https://github.com/agiresearch/A-mem.

虽然大型语言模型（LLM）代理可以有效地利用外部工具来完成复杂的现实世界任务，但它们需要记忆系统来利用历史经验。当前的记忆系统虽然实现了基本的存储和检索功能，但缺乏高级的记忆组织功能，尽管最近有尝试融入图数据库。此外，这些系统的固定操作和结构限制了它们在各种任务中的适应性。为了解决这个问题，本文提出了一种新型的大型语言模型代理的记忆系统，该系统能够以动态的方式组织记忆。我们遵循Zettelkasten方法的基本原则，设计了一个记忆系统，通过动态索引和链接创建相互关联的知识网络。每当添加新的记忆时，我们会生成一个包含多个结构化属性的综合笔记，包括上下文描述、关键词和标签。然后，系统分析历史记忆以识别相关连接，在存在有意义的相似性时建立链接。此外，这个过程还实现了记忆的演变——随着新记忆的集成，它们可以触发对现有历史记忆的上下文表示和属性的更新，使记忆网络能够不断地提高其理解力。我们的方法结合了Zettelkasten的结构化组织原则与代理驱动的决策的灵活性，从而实现了更具适应性和上下文感知的记忆管理。在六个基础模型上的实证实验表明，与现有的最先进的基线相比，我们的方法具有优越的提升效果。性能评估的源代码可访问于https://github.com/WujiangXu/AgenticMemory，而代理记忆系统的源代码可访问于https://github.com/agiresearch/A-mem。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）代理在利用外部工具进行复杂现实世界任务时表现出色，但它们需要记忆系统来利用历史经验。当前记忆系统虽然能够进行基本的存储和检索，但在组织记忆方面缺乏复杂性。为解决这一问题，本文提出了一种新型的代理记忆系统，该系统能够动态地以代理方式组织记忆。该系统基于Zettelkasten方法的基本原则设计，创建通过动态索引和链接相互连接的知识网络。新记忆被添加时，系统生成包含多个结构化属性的综合笔记，包括上下文描述、关键词和标签。然后，系统分析历史记忆以识别相关联系，并在有意义的相似性存在的地方建立链接。此外，这一过程使记忆得以进化——随着新记忆的集成，它们可以触发对现有历史记忆的上下文表示和属性的更新，使记忆网络能够不断地提高其理解。通过结合Zettelkasten的结构化组织原则与代理驱动的决策灵活性，我们的方法允许更适应性和上下文感知的记忆管理。实证实验表明，在六个基础模型上，与现有最先进基线相比有明显改进。

Key Takeaways

大型语言模型（LLM）代理需要记忆系统来利用历史经验，以执行复杂任务。
当前记忆系统缺乏复杂的记忆组织，无法适应多样化任务的需求。
新的代理记忆系统基于Zettelkasten方法设计，创建动态连接的知识网络。
该系统通过生成包含结构化属性的综合笔记来组织新记忆，并分析历史记忆来建立相关联系。
新记忆的集成触发历史记忆的上下文表示和属性的更新，使记忆网络能够不断进化。
结合结构化组织原则与代理驱动的决策灵活性，使记忆管理更具适应性和上下文感知。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-04-16/Agent/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

Agent

Few-Shot

Few-Shot 方向最新论文已更新，请持续关注 Update in 2025-04-16 Siamese Network with Dual Attention for EEG-Driven Social Learning Bridging the Human-Robot Gap in Long-Tail Autonomous Driving

2025-04-16 Few-Shot

Few-Shot

LLM

LLM 方向最新论文已更新，请持续关注 Update in 2025-04-16 InternVL3 Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

2025-04-16 LLM

LLM