Published: 2025-11-13

Updated: 2025-11-27

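⚠️ All summaries below were generated by a large language model and may contain errors; treat them as a reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are suitable only for initial screening before reading a paper.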

Updated 2025-11-13

MobileRL: Online Agentic Reinforcement Learning for Mobile GUI Agents

Authors:Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, Yuxiao Dong

Building general-purpose graphical user interface (GUI) agents has become increasingly promising with the progress in vision language models. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging due to the heavy-tailed distribution of task difficulty and the inefficiency of large-scale environment sampling. We present an online agentic reinforcement learning framework MobileRL to enhance GUI agents in mobile environments. Its core component is the Difficulty-ADAptive GRPO (ADAGRPO) algorithm. In ADAGRPO, we design difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to different task difficulties. We introduce the shortest-path reward adjustment strategy to reshape rewards concerning the task length in multi-turn agentic tasks. Those strategies jointly stabilize RL training, improve sample efficiency, and generate strong performance across diverse mobile apps and tasks. We apply MOBILERL to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base). The resultant MOBILERL-9B model achieves state-of-the-art results in terms of success rates on both AndroidWorld (80.2%) and AndroidLab (53.6%). The MOBILERL framework is open-sourced at: https://github.com/THUDM/MobileRL.

Paper and project links

PDF

Summary

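With the progress of vision language models, building general-purpose graphical user interface (GUI) agents looks increasingly promising. However, developing effective mobile GUI agents with reinforcement learning (RL) remains challenging, mainly because task difficulty follows a heavy-tailed distribution and large-scale environment sampling is inefficient. This paper presents MobileRL, an online agentic RL framework for improving GUI agents in mobile environments. Its core component is the Difficulty-ADAptive GRPO (ADAGRPO) algorithm, which uses difficulty-adaptive positive replay and failure curriculum filtering to adapt the model to tasks of different difficulty. A shortest-path reward adjustment strategy additionally reshapes rewards according to task length in multi-turn agentic tasks. Together, these strategies stabilize RL training, improve sample efficiency, and yield strong performance across diverse mobile apps and tasks.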

Key Takeaways

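Progress in vision language models makes general-purpose GUI agents increasingly feasible.
Training mobile GUI agents with RL is hard because task difficulty is heavy-tailed and large-scale environment sampling is inefficient.
MobileRL is an online agentic RL framework that improves GUI agents in mobile environments.
Its core algorithm, ADAGRPO, combines difficulty-adaptive positive replay, failure curriculum filtering, and a shortest-path reward adjustment strategy.
Together, these strategies stabilize RL training, improve sample efficiency, and yield strong performance across diverse mobile apps and tasks.
MobileRL is applied to two open models (Qwen2.5-VL-7B-Instruct and GLM-4.1V-9B-Base); the resulting MobileRL-9B model reaches state-of-the-art success rates on AndroidWorld (80.2%) and AndroidLab (53.6%).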
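
The interaction between a length-aware reward and GRPO-style group normalization can be illustrated with a small sketch. The abstract does not give the actual ADAGRPO formulas, so everything below is an assumption made for illustration: the function names are hypothetical, the reward rule (a successful rollout earns `shortest_len / own_len`, a failed one earns 0) is only one plausible reading of "shortest-path reward adjustment", and the advantage is the standard group-relative form (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def length_adjusted_rewards(successes, lengths):
    """Hypothetical shaping rule (NOT taken from the paper).

    A successful rollout earns shortest_len / own_len, where shortest_len
    is the step count of the shortest successful rollout in the group.
    Failed rollouts earn 0. Step counts are assumed to be positive.
    """
    success_lengths = [n for ok, n in zip(successes, lengths) if ok]
    if not success_lengths:
        # No rollout succeeded, so there is no reference path length.
        return [0.0] * len(successes)
    shortest = min(success_lengths)
    return [shortest / n if ok else 0.0 for ok, n in zip(successes, lengths)]


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style normalization within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task: three succeed with different step counts.
successes = [True, True, False, True]
lengths = [5, 10, 8, 20]

rewards = length_adjusted_rewards(successes, lengths)
advantages = group_relative_advantages(rewards)
```

Under this assumed rule, the shortest successful rollout receives the largest advantage and the failed rollout the smallest, so the policy is nudged toward success and toward shorter action sequences. When every rollout in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is one reason why sampling that accounts for task difficulty matters for GRPO-style training.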

$Agent^2$: An Agent-Generates-Agent Framework for Reinforcement Learning Automation

Authors:Yuan Wei, Xiaohan Shan, Ran Miao, Jianmin Li

Reinforcement learning (RL) agent development traditionally requires substantial expertise and iterative effort, often leading to high failure rates and limited accessibility. This paper introduces Agent$^2$, an LLM-driven agent-generates-agent framework for fully automated RL agent design. Agent$^2$ autonomously translates natural language task descriptions and environment code into executable RL solutions without human intervention. The framework adopts a dual-agent architecture: a Generator Agent that analyzes tasks and designs agents, and a Target Agent that is automatically generated and executed. To better support automation, RL development is decomposed into two stages, MDP modeling and algorithmic optimization, facilitating targeted and effective agent generation. Built on the Model Context Protocol, Agent$^2$ provides a unified framework for standardized agent creation across diverse environments and algorithms, incorporating adaptive training management and intelligent feedback analysis for continuous refinement. Extensive experiments on benchmarks including MuJoCo, MetaDrive, MPE, and SMAC show that Agent$^2$ outperforms manually designed baselines across all tasks, achieving up to 55% performance improvement with consistent average gains. By enabling a closed-loop, end-to-end automation pipeline, this work advances a new paradigm in which agents can design and optimize other agents, underscoring the potential of agent-generates-agent systems for automated AI development.

Paper and project links

PDF 19 pages, 5 figures, 4 tables

Summary

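This paper proposes Agent$^2$, an LLM-driven agent-generates-agent framework for fully automated RL agent design. Agent$^2$ autonomously translates natural language task descriptions and environment code into executable RL solutions without human intervention. The framework uses a dual-agent architecture: a Generator Agent analyzes the task and designs the agent, and a Target Agent is automatically generated and executed. RL development is decomposed into two stages, MDP modeling and algorithmic optimization, so that agent generation is targeted and effective. Extensive experiments on benchmarks including MuJoCo, MetaDrive, MPE, and SMAC show that Agent$^2$ outperforms manually designed baselines across all tasks, with up to 55% performance improvement and consistent average gains. By enabling a closed-loop, end-to-end automation pipeline, the work advances a paradigm in which agents design and optimize other agents, and it highlights the potential of such systems for automated AI development.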

Key Takeaways

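Agent$^2$ is a framework for automated RL agent design that turns natural language task descriptions and environment code into executable RL solutions without human intervention.
It uses a dual-agent architecture: a Generator Agent that analyzes tasks and designs agents, and a Target Agent that is automatically generated and executed.
RL development is decomposed into two stages, MDP modeling and algorithmic optimization, which makes agent generation more targeted and effective.
Built on the Model Context Protocol, Agent$^2$ provides a unified framework for standardized agent creation across diverse environments and algorithms.
The framework includes adaptive training management and intelligent feedback analysis for continuous refinement.
Experiments on MuJoCo, MetaDrive, MPE, and SMAC show that Agent$^2$ outperforms manually designed baselines across all tasks, with up to 55% improvement and consistent average gains.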

Generalized Principal-Agent Problem with a Learning Agent

Authors:Tao Lin, Yiling Chen

In classic principal-agent problems such as Stackelberg games, contract design, and Bayesian persuasion, the agent best responds to the principal’s committed strategy. We study repeated generalized principal-agent problems under the assumption that the principal does not have commitment power and the agent uses algorithms to learn to respond to the principal. We reduce this problem to a one-shot problem where the agent approximately best responds, and prove that: (1) If the agent uses contextual no-regret learning algorithms with regret $\mathrm{Reg}(T)$, then the principal can guarantee utility at least $U^* - \Theta\big(\sqrt{\tfrac{\mathrm{Reg}(T)}{T}}\big)$, where $U^*$ is the principal’s optimal utility in the classic model with a best-responding agent. (2) If the agent uses contextual no-swap-regret learning algorithms with swap-regret $\mathrm{SReg}(T)$, then the principal cannot obtain utility more than $U^* + O\big(\tfrac{\mathrm{SReg}(T)}{T}\big)$. (3) In addition, if the agent uses mean-based learning algorithms (which can be no-regret but not no-swap-regret), then the principal can sometimes do significantly better than $U^*$. These results not only refine previous works on Stackelberg games and contract design, but also lead to new results for Bayesian persuasion with a learning agent and all generalized principal-agent problems where the agent does not have private information.

Paper and project links

PDF A short version of this work appeared on ICLR 2025 (spotlight). This full version has been accepted by Quantitative Economics

Summary

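This paper studies repeated generalized principal-agent problems in which the principal has no commitment power and the agent uses a learning algorithm to respond. It reduces the repeated problem to a one-shot problem with an approximately best-responding agent and bounds the principal's utility for different classes of agent algorithms. If the agent uses a contextual no-regret algorithm, the principal can guarantee utility at least $U^* - \Theta\big(\sqrt{\mathrm{Reg}(T)/T}\big)$. If the agent uses a contextual no-swap-regret algorithm, the principal cannot obtain more than $U^* + O\big(\mathrm{SReg}(T)/T\big)$. If the agent uses a mean-based algorithm, the principal can sometimes do significantly better than $U^*$. These results refine earlier work on Stackelberg games and contract design and yield new results for Bayesian persuasion with a learning agent and for all generalized principal-agent problems in which the agent has no private information.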
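
To make the two rates in the abstract concrete, here is a worked instance under an assumed regret rate. The choice $\mathrm{Reg}(T) = \mathrm{SReg}(T) = c\sqrt{T}$ for a constant $c > 0$ is an illustration only, not a figure from the paper.

$$
\sqrt{\frac{\mathrm{Reg}(T)}{T}} = \sqrt{\frac{c\sqrt{T}}{T}} = \sqrt{c}\,T^{-1/4},
\qquad
\frac{\mathrm{SReg}(T)}{T} = \frac{c\sqrt{T}}{T} = c\,T^{-1/2}.
$$

Under this assumption, the guaranteed shortfall below $U^*$ shrinks at rate $T^{-1/4}$, while the possible gain above $U^*$ against a no-swap-regret agent shrinks at the faster rate $T^{-1/2}$. For example, with $c = 1$ and $T = 10^4$ the two terms are $0.1$ and $0.01$, and both vanish as $T \to \infty$.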

Key Takeaways