发布日期: 2025-06-21

更新日期: 2025-07-06

文章字数: 1.4k

阅读时长: 5 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-06-21 更新

Learning to Incentivize in Repeated Principal-Agent Problems with Adversarial Agent Arrivals

Authors:Junyan Liu, Arnab Maiti, Artin Tajdini, Kevin Jamieson, Lillian J. Ratliff

We initiate the study of a repeated principal-agent problem over a finite horizon $T$, where a principal sequentially interacts with $K\geq 2$ types of agents arriving in an adversarial order. At each round, the principal strategically chooses one of the $N$ arms to incentivize for an arriving agent of unknown type. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of $O(\min{\sqrt{KT\log N},K\sqrt{T}})$ and provide a matching lower bound up to a $\log K$ factor. In the second setting, an agent’s response varies smoothly with the incentive and is governed by a Lipschitz constant $L\geq 1$. Under this setting, we show that there is an algorithm with a regret bound of $\tilde{O}((LN)^{1/3}T^{2/3})$ and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.

我们研究了一个在有限时间范围$T$内的重复主代理问题，其中主方与按对抗性顺序到达的$K\geq 2$种类型的代理方进行连续互动。每一轮中，主方策略性地选择其中一个$N$个手臂来激励到达的未知类型的代理方。代理方然后基于其自身的效用和提供的激励来选择手臂，并且主方获得相应的奖励。目标是使后悔程度最小化，相较于事后的最佳激励措施。在没有任何关于代理方行为的知识的情况下，我们证明了这个问题变得难以解决，导致线性后悔。我们分析了两种关键情况下可以实现次线性后悔。在第一种情况下，主方知道每种代理类型对于任何给定的激励会贪婪地选择哪个手臂。在这种情况下，我们提出了一种算法，该算法实现的后悔界限为$O(\min{\sqrt{KT\log N},K\sqrt{T}})$，并提供了一个匹配的低于$\log K$因子的下界。在第二种情况下，代理方的反应随着激励的变化而平稳变化，并由Lipschitz常数$L\geq 1$控制。在这种情况下，我们表明有一种算法可以将后悔界限降低到$\tilde{O}((LN)^{1/3}T^{2/3})$，并且建立了匹配的下界，直到对数因子为止。最后，我们通过允许主方在每一轮中同时激励多个手臂来扩展这两种设置下的算法结果。

论文及项目相关链接

PDF To appear at ICML 2025

Summary：
本研究探讨了一个有限时间范围内重复的主-代理人问题，其中主要参与者需要与多种类型的代理人进行交互。在每一轮中，主要参与者必须战略性选择一种手段来激励到达的未知类型的代理人。基于自身的效用和提供的激励，代理人会选择一条路线，主要参与者会获得相应的奖励。目标是尽量减少与最佳后续激励的遗憾。在无代理人行为先验知识的情况下，问题变得难以解决，导致线性遗憾。我们分析了两种情况，其中可以实现次线性遗憾。首先，当主要参与者知道每种类型的代理人在给定激励下会选择哪种手段时，我们提出了一种算法，其遗憾界限为O(min{√KTlogN,K√T})，并提供了一个匹配的上下限，差距在logK范围内。其次，当代理人的响应随着激励的变化而平滑变化，并由Lipschitz常数L≥1控制时，我们展示了一种算法，其遗憾界限为O~(LN)^1/3T^2/3)，并建立了匹配的下限，差距在对数因素范围内。最后，我们扩展了这两种设置的算法结果，允许主要参与者在每一轮中同时激励多条路线。

Key Takeaways：