⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:切勿用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-09-28 更新
Interactive Recommendation Agent with Active User Commands
Authors:Jiakai Tang, Yujie Luo, Xunke Xi, Fei Sun, Xueyang Feng, Sunhao Dai, Chao Yi, Dian Chen, Zhujin Gao, Yang Li, Xu Chen, Wen Chen, Jian Wu, Yuning Jiang, Bo Zheng
Traditional recommender systems rely on passive feedback mechanisms that limit users to simple choices such as like and dislike. However, these coarse-grained signals fail to capture users’ nuanced behavior motivations and intentions. In turn, current systems cannot also distinguish which specific item attributes drive user satisfaction or dissatisfaction, resulting in inaccurate preference modeling. These fundamental limitations create a persistent gap between user intentions and system interpretations, ultimately undermining user satisfaction and harming system effectiveness. To address these limitations, we introduce the Interactive Recommendation Feed (IRF), a pioneering paradigm that enables natural language commands within mainstream recommendation feeds. Unlike traditional systems that confine users to passive implicit behavioral influence, IRF empowers active explicit control over recommendation policies through real-time linguistic commands. To support this paradigm, we develop RecBot, a dual-agent architecture where a Parser Agent transforms linguistic expressions into structured preferences and a Planner Agent dynamically orchestrates adaptive tool chains for on-the-fly policy adjustment. To enable practical deployment, we employ simulation-augmented knowledge distillation to achieve efficient performance while maintaining strong reasoning capabilities. Through extensive offline and long-term online experiments, RecBot shows significant improvements in both user satisfaction and business outcomes.
传统推荐系统依赖于被动反馈机制,用户只能做出喜欢或不喜欢等简单选择。然而,这些粗粒度的信号无法捕捉用户细微的行为动机和意图。因此,当前系统也无法区分究竟是哪些物品属性导致了用户的满意或不满,造成偏好建模不准确。这些基本局限造成了用户意图与系统解释之间的持久差距,最终损害用户满意度和系统有效性。为了解决这些局限,我们引入了交互式推荐信息流(Interactive Recommendation Feed, IRF),这是一种开创性的范式,能够在主流推荐信息流中使用自然语言命令。不同于将用户局限于被动隐式行为影响的传统系统,IRF通过实时语言命令赋予用户对推荐策略的主动显式控制能力。为了支持这一范式,我们开发了RecBot,这是一种双代理架构,其中解析器代理将语言表达转化为结构化偏好,而规划器代理则动态编排自适应工具链,用于即时策略调整。为了实现实际部署,我们采用仿真增强知识蒸馏,在保持强大推理能力的同时实现高效性能。通过大量离线实验和长期在线实验,RecBot在用户满意度和业务成果方面都显示出显著改善。
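下面给出一个极简的概念性示意(并非RecBot的官方实现;函数名、关键词规则和示例数据均为本文为说明而假设,真实系统中的解析与规划由LLM完成),用于说明"解析器代理把自然语言指令转成结构化偏好、规划器代理据此编排过滤/加权工具来即时调整推荐结果"的双代理流程:

```python
KNOWN_CATEGORIES = ["连衣裙", "外套"]
KNOWN_KEYWORDS = ["跑鞋", "登山"]

def parser_agent(command: str) -> dict:
    """把用户指令解析为结构化偏好(真实系统中由 LLM 做语义解析,这里用关键词规则代替)。"""
    pref = {"exclude_categories": [], "boost_keywords": []}
    for clause in command.replace("。", ",").split(","):
        if any(w in clause for w in ("不要", "别再", "不想看")):
            pref["exclude_categories"] += [c for c in KNOWN_CATEGORIES if c in clause]
        elif any(w in clause for w in ("想看", "多推荐")):
            pref["boost_keywords"] += [k for k in KNOWN_KEYWORDS if k in clause]
    return pref

def planner_agent(candidates: list, pref: dict) -> list:
    """按结构化偏好编排"过滤 -> 加权 -> 重排"的工具链,即时调整推荐结果。"""
    items = [dict(it) for it in candidates if it["category"] not in pref["exclude_categories"]]
    for it in items:
        if any(k in it["title"] for k in pref["boost_keywords"]):
            it["score"] += 1.0          # 对命中显式偏好的条目加权
    return sorted(items, key=lambda it: it["score"], reverse=True)

if __name__ == "__main__":
    feed = [
        {"title": "夏季连衣裙", "category": "连衣裙", "score": 0.9},
        {"title": "轻量跑鞋", "category": "鞋类", "score": 0.6},
        {"title": "休闲T恤", "category": "上衣", "score": 0.7},
    ]
    pref = parser_agent("不要再推连衣裙了,想看跑鞋")
    print(pref)
    print(planner_agent(feed, pref))    # 连衣裙被过滤,跑鞋被加权后排到最前
```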
论文及项目相关链接
PDF Under Review
Summary
传统推荐系统依赖被动反馈机制,使用户只能选择简单的喜欢或不喜欢,这种粗粒度的信号无法捕捉用户复杂的行为动机和意图。为解决这一问题,我们提出了交互式推荐信息流(IRF)这一开创性范式,在主流推荐信息流中引入自然语言命令。IRF通过实时语言命令使用户能够主动控制推荐策略,而传统系统只能让用户通过隐性行为被动地施加影响。为实现这一范式,我们开发了RecBot系统,该系统包含解析器代理(将语言表达转化为结构化偏好)和规划器代理(动态编排工具链以进行即时策略调整)。通过仿真增强知识蒸馏技术,RecBot实现了高效性能并保持强大的推理能力。实验证明,RecBot在用户满意度和业务成果方面取得了显著改进。
Key Takeaways
- 传统推荐系统依赖被动反馈机制,无法捕捉用户复杂的行为动机和意图。
- 交互式推荐信息流(IRF)范式引入自然语言命令,使用户能够主动控制推荐策略。
- RecBot系统包括解析器代理和规划器代理,分别负责将语言转化为结构化偏好和动态调整推荐策略。
- 模拟增强知识蒸馏技术使RecBot实现高效性能并保持强大推理能力。
- 用户满意度和业务成果得到了显著改进。
- RecBot系统能够通过实时语言命令进行即时策略调整,提高系统的适应性和灵活性。
点此查看论文截图




Nova: Real-Time Agentic Vision-Language Model Serving with Adaptive Cross-Stage Parallelization
Authors:Yuhang Xu, Shengzhong Liu, Dong Zhang, Bingheng Yan, Fan Wu, Guihai Chen
This paper presents Nova, a real-time scheduling framework for serving agentic vision-language models (VLMs) on a single GPU with balanced per-request latency and overall request process throughput. Our design begins by enabling effective pipelining across vision encode, LLM prefill, and LLM decode stages of VLMs, by exploiting their heterogeneous resource demands during execution and incorporating elastic GPU spatial partitioning among stages to maximally utilize the compute and memory resources. Building on this, we introduce a real-time scheduling algorithm that adaptively calibrates resource allocation among stages based on a Pareto-optimal analysis of the latency-throughput trade-off, allowing the system to sustain responsiveness and resource efficiency under dynamic request loads. To further alleviate GPU memory pressure, we design a lightweight weight offloading strategy for vision encoders that preserves inference efficiency with minimized memory overhead. Extensive evaluations on both synthetic and real-world agent workloads demonstrate that Nova consistently outperforms the state-of-the-art baselines, improving the maximum latency by up to 23.3%, while keeping competitive throughput.
本文介绍了Nova,这是一个在单一GPU上为agentic视觉语言模型(VLM)提供服务的实时调度框架,能够在单请求延迟与整体请求处理吞吐量之间取得平衡。我们的设计首先在VLM的视觉编码、LLM预填充和LLM解码三个阶段之间实现有效的流水线化:利用这些阶段在执行过程中的异构资源需求,并在各阶段之间引入弹性GPU空间分区,以最大化利用计算和显存资源。在此基础上,我们引入了一种实时调度算法,该算法基于延迟-吞吐量权衡的帕累托最优分析,自适应地校准各阶段的资源分配,使系统能够在动态请求负载下保持响应性和资源效率。为了进一步缓解GPU显存压力,我们为视觉编码器设计了一种轻量级的权重卸载策略,以最小的显存开销保持推理效率。在合成和真实世界代理工作负载上的大量评估表明,Nova始终优于最先进的基线,最大延迟最多改善23.3%,同时保持有竞争力的吞吐量。
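下面是一个概念性示意(并非Nova的官方实现;候选分配方案、耗时数字与字段名均为本文假设),用于说明"先在候选的阶段资源划分中筛出延迟-吞吐帕累托最优方案,再按当前负载偏好选择其一"的调度思路:

```python
# 每个方案代表一种"视觉编码/预填充/解码"三个阶段的 GPU 空间划分(数值为假设)
candidates = [
    {"split": (0.5, 0.3, 0.2), "latency": 120.0, "throughput": 8.0},
    {"split": (0.3, 0.4, 0.3), "latency": 150.0, "throughput": 12.0},
    {"split": (0.2, 0.3, 0.5), "latency": 200.0, "throughput": 11.0},  # 被第二个方案支配
]

def pareto_front(cands):
    """保留不被任何其他方案同时在延迟(更低)和吞吐(更高)上支配的方案。"""
    front = []
    for c in cands:
        dominated = any(o["latency"] <= c["latency"] and o["throughput"] >= c["throughput"]
                        and o != c for o in cands)
        if not dominated:
            front.append(c)
    return front

def pick(cands, load):
    """负载高时更看重吞吐,负载低时更看重单请求延迟(两者先归一化再加权)。"""
    front = pareto_front(cands)
    max_thr = max(c["throughput"] for c in front)
    max_lat = max(c["latency"] for c in front)
    w = min(max(load, 0.0), 1.0)
    score = lambda c: w * c["throughput"] / max_thr - (1 - w) * c["latency"] / max_lat
    return max(front, key=score)

print(pick(candidates, load=0.8))   # 高负载:选吞吐更高的划分
print(pick(candidates, load=0.2))   # 低负载:选延迟更低的划分
```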
论文及项目相关链接
Summary
该论文介绍了Nova,一个在单一GPU上为agentic视觉语言模型(VLM)提供服务的实时调度框架。Nova通过跨阶段流水线设计,有效应对VLM的异构资源需求,并引入实时调度算法,根据延迟和吞吐量的帕累托最优分析自适应调整资源分配。此外,Nova还设计了一种轻量级权重卸载策略以减轻GPU显存压力。实验表明,Nova在性能上超越现有技术,最大延迟最多改善23.3%,同时保持有竞争力的吞吐量。
Key Takeaways
- Nova是一个用于单一GPU上代理视觉语言模型(VLMs)的实时调度框架。
- Nova通过管道设计有效应对VLMs的异构资源需求。
- Nova引入了一种基于延迟和吞吐量帕累托最优分析的实时调度算法。
- Nova通过自适应资源分配在动态请求负载下保持响应性和资源效率。
- Nova设计了一种轻量级卸载策略以减轻GPU内存压力。
- 实验表明,Nova在性能上超越现有技术。
点此查看论文截图

VC-Agent: An Interactive Agent for Customized Video Dataset Collection
Authors:Yidan Zhang, Mutian Xu, Yiming Hao, Kun Zhou, Jiahao Chang, Xiaoqiang Liu, Pengfei Wan, Hongbo Fu, Xiaoguang Han
Facing scaling laws, video data from the internet becomes increasingly important. However, collecting extensive videos that meet specific needs is extremely labor-intensive and time-consuming. In this work, we study the way to expedite this collection process and propose VC-Agent, the first interactive agent that is able to understand users’ queries and feedback, and accordingly retrieve/scale up relevant video clips with minimal user input. Specifically, considering the user interface, our agent defines various user-friendly ways for the user to specify requirements based on textual descriptions and confirmations. As for agent functions, we leverage existing multi-modal large language models to connect the user’s requirements with the video content. More importantly, we propose two novel filtering policies that can be updated when user interaction is continually performed. Finally, we provide a new benchmark for personalized video dataset collection, and carefully conduct the user study to verify our agent’s usage in various real scenarios. Extensive experiments demonstrate the effectiveness and efficiency of our agent for customized video dataset collection. Project page: https://allenyidan.github.io/vcagent_page/.
面对规模定律(scaling laws),来自互联网的视频数据变得越来越重要。然而,收集满足特定需求的大量视频极为耗费人力和时间。在这项工作中,我们研究了如何加快这一收集过程,并提出了VC-Agent,这是首个能够理解用户查询和反馈、并据此以极少的用户输入检索或扩充相关视频片段的交互式代理。具体来说,在用户界面方面,我们的代理为用户提供了多种友好的方式,基于文本描述和确认来指定需求。在代理功能方面,我们利用现有的多模态大型语言模型将用户需求与视频内容关联起来。更重要的是,我们提出了两种新型过滤策略,它们可以随着用户交互的持续进行而更新。最后,我们为个性化视频数据集收集提供了一个新基准,并通过细致的用户研究验证了我们的代理在各种实际场景中的可用性。大量实验证明了我们的代理在定制视频数据集收集方面的有效性和效率。项目页面:https://allenyidan.github.io/vcagent_page/。
论文及项目相关链接
PDF Project page: https://allenyidan.github.io/vcagent_page/
Summary
面对互联网上视频数据的规模增长,收集满足特定需求的视频变得越来越重要,但这一过程极其耗费时间和劳动力。本研究旨在加速这一过程,提出VC-Agent,这是一个能够理解用户查询和反馈的首个交互式代理。通过最小化用户输入,该代理能够检索和扩展相关视频片段。我们的代理定义各种用户友好的方式让用户基于文本描述和确认来指定要求。此外,我们利用现有的多模态大型语言模型将用户要求与视频内容连接起来。更重要的是,我们提出两种新型过滤策略,可随着用户互动的连续进行而更新。最后,我们为个性化视频数据集收集提供了一个新基准测试,并通过用户研究仔细验证了该代理在各种实际场景中的应用。
Key Takeaways
- 面对互联网上视频数据的规模增长,视频收集变得至关重要,但这一过程具有挑战性和耗时性。
- VC-Agent是一个交互式代理,旨在加速满足特定需求的视频收集过程。
- VC-Agent能够理解用户查询和反馈,并最小化用户输入以检索和扩展相关视频片段。
- 该代理采用多种用户友好的方式允许用户指定要求,如文本描述和确认。
- 利用多模态大型语言模型连接用户要求与视频内容。
- VC-Agent提出两种可随用户互动更新而更新的新型过滤策略。
点此查看论文截图


Tree Search for LLM Agent Reinforcement Learning
Authors:Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward. Based on this, Tree-GRPO estimates the grouped relative advantages both on intra-tree and inter-tree levels. Through theoretical analysis, we demonstrate that the objective of intra-tree level group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over the chain-based RL method.
最近强化学习(RL)的进展大大提高了大型语言模型(LLM)的智能体能力。在长程和多轮智能体任务中,仅由结果奖励驱动的现有方法常常面临监督信号稀疏的问题。为了解决这一挑战,我们提出了基于树搜索的分组相对策略优化(Tree-GRPO),这是一种基于树搜索的分组智能体RL方法,其中每个树节点代表一个完整的智能体交互步骤。通过共享公共前缀,树搜索采样增加了在固定的token或工具调用预算内可实现的轨迹采样(rollout)数量。此外,我们发现树状轨迹天然地允许仅使用结果奖励就构建逐步的过程监督信号。基于此,Tree-GRPO在树内和树间两个层面估计分组相对优势。通过理论分析,我们证明了树内层面分组相对策略优化的目标与步骤级直接偏好学习的目标是等价的。在11个数据集和3类问答任务上的实验证明了基于树的RL方法优于基于链的RL方法。
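下面是一个极简的概念性示意(并非Tree-GRPO的官方实现;树结构和奖励数值均为本文假设),用于说明"仅用结果奖励,在共享前缀的树上按兄弟分组计算组内相对优势,从而为分叉处提供逐步过程信号"的思路:

```python
import statistics

# 一棵示意采样树:同一父节点下的子分支共享之前的全部前缀
tree = {
    "root": ["a", "b"],          # 第一步的两个分支
    "a":    ["a1", "a2"],        # 在前缀 a 下继续采样的两个分支
    "b":    ["b1", "b2"],
}
outcome_reward = {"a1": 1.0, "a2": 0.0, "b1": 0.0, "b2": 0.0}   # 仅有最终结果奖励

def subtree_return(node):
    """叶子取结果奖励,内部节点取其子树叶子奖励的均值。"""
    children = tree.get(node)
    if not children:
        return outcome_reward[node]
    return sum(subtree_return(c) for c in children) / len(children)

def intra_tree_advantages():
    """对每个兄弟分组:优势 = (该分支回报 - 组均值) / 组标准差。"""
    adv = {}
    for parent, children in tree.items():
        returns = [subtree_return(c) for c in children]
        mu = statistics.mean(returns)
        sigma = statistics.pstdev(returns) or 1.0      # 避免除零
        for c, r in zip(children, returns):
            adv[c] = (r - mu) / sigma
    return adv

print(intra_tree_advantages())
# 即使只有最终结果奖励,分支 a(其子树更可能成功)也在第一步就获得正优势
```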
论文及项目相关链接
Summary
强化学习在提升大型语言模型的代理能力方面取得了显著进展。在长期和多轮代理任务中,仅由结果奖励驱动的现有方法常常面临监督稀疏的问题。为解决此挑战,我们提出了基于树搜索的分组代理强化学习方法——Tree-GRPO。树中的每个节点代表完整的代理交互步骤,通过共享共同的前缀,树搜索采样提高了在固定预算的标记或工具调用次数中可实现的rollouts的数量。此外,我们发现树结构轨迹自然允许构建分步过程监督信号,即使只使用结果奖励。基于此,Tree-GRPO估计了树内和树间的分组相对优势。通过理论分析,我们证明了树内级别的分组相对策略优化目标与步骤级别的直接偏好学习目标是等价的。跨11个数据集和3种问答任务的实验证明了基于树的强化学习相较于基于链的强化学习方法的优越性。
Key Takeaways
- 强化学习增强了大型语言模型的代理能力。
- 在长期和多轮代理任务中,现有方法面临监督稀疏的问题。
- Tree-GRPO是一种基于树搜索的分组代理强化学习方法。
- 树搜索通过共享前缀提高了rollouts的数量。
- 树结构轨迹允许构建分步过程监督信号,即使只使用结果奖励。
- Tree-GRPO估计了树内和树间的分组相对优势。
点此查看论文截图



SGMem: Sentence Graph Memory for Long-Term Conversational Agents
Authors:Yaxiong Wu, Yongyue Zhang, Sheng Liang, Yong Liu
Long-term conversational agents require effective memory management to handle dialogue histories that exceed the context window of large language models (LLMs). Existing methods based on fact extraction or summarization reduce redundancy but struggle to organize and retrieve relevant information across different granularities of dialogue and generated memory. We introduce SGMem (Sentence Graph Memory), which represents dialogue as sentence-level graphs within chunked units, capturing associations across turn-, round-, and session-level contexts. By combining retrieved raw dialogue with generated memory such as summaries, facts and insights, SGMem supplies LLMs with coherent and relevant context for response generation. Experiments on LongMemEval and LoCoMo show that SGMem consistently improves accuracy and outperforms strong baselines in long-term conversational question answering.
长期对话代理需要有效的记忆管理来处理超出大型语言模型(LLM)上下文窗口的对话历史。现有基于事实提取或摘要的方法虽然能够减少冗余,但难以在对话与生成记忆的不同粒度之间组织和检索相关信息。我们引入SGMem(句子图记忆),它将对话表示为分块单元内的句子级图,捕捉轮次(turn)、回合(round)和会话(session)级上下文之间的关联。通过将检索到的原始对话与生成的记忆(如摘要、事实和见解)相结合,SGMem为LLM提供连贯且相关的上下文,用于生成响应。在LongMemEval和LoCoMo上的实验表明,SGMem在长期对话问答中持续提升准确率,并优于强基线。
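下面是一个概念性示意(并非SGMem的官方实现;示意对话、建边规则与检索方式均为本文假设),用于说明"把对话按句子建图、在回合与会话等粒度上连边、检索时沿图扩展邻居"的基本思路:

```python
from collections import defaultdict
from itertools import combinations

# (session, round, turn, 句子文本) —— 示意数据
sentences = [
    ("s1", 0, 0, "用户说自己住在杭州"),
    ("s1", 0, 1, "助手推荐了西湖附近的餐厅"),
    ("s1", 1, 0, "用户提到下周要出差北京"),
    ("s1", 1, 1, "助手帮忙查了北京的天气"),
    ("s2", 0, 0, "用户询问如何退货"),
    ("s2", 0, 1, "助手给出了退货流程"),
]

# 建图:同一回合内的句子互相连边,同一会话内相邻回合的句子也连边
graph = defaultdict(set)
for (i, a), (j, b) in combinations(enumerate(sentences), 2):
    same_round = a[0] == b[0] and a[1] == b[1]
    adjacent_round = a[0] == b[0] and abs(a[1] - b[1]) == 1
    if same_round or adjacent_round:
        graph[i].add(j)
        graph[j].add(i)

def retrieve(query_kw: str, hops: int = 1) -> list:
    """先按关键词命中句子,再沿图扩展 hops 跳邻居,拼出连贯的相关上下文。"""
    context = {i for i, s in enumerate(sentences) if query_kw in s[3]}
    for _ in range(hops):
        context |= {nb for i in context for nb in graph[i]}
    return [sentences[i][3] for i in sorted(context)]

print(retrieve("北京"))   # 返回 s1 会话中与"北京"相关的上下文,不会混入 s2 的退货对话
```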
论文及项目相关链接
PDF 19 pages, 6 figures, 1 table
Summary:
长期对话代理需要有效的记忆管理来应对超出大型语言模型(LLM)上下文窗口的对话历史。现有方法基于事实提取或摘要来减少冗余,但难以在对话与生成记忆的不同粒度之间组织和检索相关信息。我们提出SGMem(句子图记忆),它将对话表示为分块单元内的句子级图,捕捉跨轮次、回合和会话级别的上下文关联。通过将检索到的原始对话与生成的记忆(如摘要、事实和见解)相结合,SGMem为LLM提供连贯且相关的上下文,用于生成响应。在LongMemEval和LoCoMo上的实验表明,SGMem在长期对话问答中提高了准确性并超越了强基线。
Key Takeaways:
- 长期对话机器人需要有效管理内存来应对超出大型语言模型上下文窗口的对话历史。
- 现有方法主要基于事实提取或摘要技术来处理对话历史,但存在组织相关信息的挑战。
- SGMem通过句子级图块表示对话,捕捉不同粒度(轮次、回合和会话级别)的上下文关联。
- SGMem结合了检索到的原始对话和生成的内存(如摘要、事实和见解)。
- SGMem为LLM提供了连贯且相关的上下文,增强了响应生成的能力。
- 在LongMemEval和LoCoMo实验上,SGMem在长期的对话问答中表现出更高的准确性。
点此查看论文截图



Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning
Authors:Xiangru Tang, Wanghan Xu, Yujie Wang, Zijie Guo, Daniel Shao, Jiapeng Chen, Cixuan Zhang, Ziyi Wang, Lixin Zhang, Guancheng Wan, Wenlong Zhang, Lei Bai, Zhenfei Yin, Philip Torr, Hanrui Wang, Di Jin
Large language models (LLMs) have recently shown strong progress on scientific reasoning, yet two major bottlenecks remain. First, explicit retrieval fragments reasoning, imposing a hidden “tool tax” of extra tokens and steps. Second, multi-agent pipelines often dilute strong solutions by averaging across all candidates. We address these challenges with a unified framework that combines implicit retrieval and structured collaboration. At its foundation, a Monitor-based retrieval module operates at the token level, integrating external knowledge with minimal disruption to reasoning. On top of this substrate, Hierarchical Solution Refinement (HSR) iteratively designates each candidate as an anchor to be repaired by its peers, while Quality-Aware Iterative Reasoning (QAIR) adapts refinement to solution quality. On Humanity’s Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy – the highest reported to date, surpassing the strongest agent baseline by 13.4 points and leading frontier LLMs by up to 18.1 points, while simultaneously reducing token usage by 53.5% and agent steps by 43.7%. Results on SuperGPQA and TRQA confirm robustness across domains. Error analysis shows that reasoning failures and knowledge gaps co-occur in over 85% of cases, while diversity analysis reveals a clear dichotomy: retrieval tasks benefit from solution variety, whereas reasoning tasks favor consensus. Together, these findings demonstrate how implicit augmentation and structured refinement overcome the inefficiencies of explicit tool use and uniform aggregation. Code is available at: https://github.com/tangxiangru/Eigen-1.
大型语言模型(LLM)最近在科学推理方面取得了显著进展,但仍然存在两个主要瓶颈。首先,显式检索会使推理碎片化,带来额外token和步骤的隐藏"工具税"。其次,多智能体管道往往通过对所有候选解取平均而削弱了强解。我们用一个统一的框架来解决这些挑战,该框架结合了隐式检索和结构化协作。在其基础层,一个基于监视器(Monitor)的检索模块在token层面运行,以对推理干扰最小的方式整合外部知识。在此之上,分层解决方案精炼(HSR)迭代地将每个候选解指定为锚点,由其余候选解(同伴)对其进行修补,而质量感知迭代推理(QAIR)则根据解的质量调整精炼过程。在Humanity's Last Exam(HLE)的Bio/Chem Gold子集上,我们的框架达到了48.3%的准确率——这是迄今为止报告的最高值,比最强的智能体基线高出13.4个百分点,领先前沿LLM最多达18.1个百分点,同时将token使用量减少53.5%、智能体步骤减少43.7%。在SuperGPQA和TRQA上的结果证实了其跨领域的稳健性。错误分析表明,推理失败和知识空白在超过85%的情况下同时出现;多样性分析则揭示出明确的二分现象:检索类任务受益于解的多样性,而推理类任务更偏向共识。这些发现共同表明,隐式增强和结构化精炼能够克服显式工具使用和统一聚合的低效问题。代码可在 https://github.com/tangxiangru/Eigen-1 获取。
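下面是一个只展示控制流程的概念性示意(并非Eigen-1的官方实现;这里用字符串合并代替LLM的"同伴修补"、用要点计数代替评审打分,阈值等参数均为假设),用于说明HSR"轮流设锚、同伴修补"与QAIR"按质量决定是否继续迭代"的配合方式:

```python
def repair(anchor: str, peers: list) -> str:
    """示意版"同伴修补":把同伴解中锚点缺少的要点补进锚点(真实系统由 LLM 完成)。"""
    points = anchor.split(";")
    for peer in peers:
        for point in peer.split(";"):
            if point not in points:
                points.append(point)
    return ";".join(points)

def quality(solution: str) -> float:
    """示意版质量打分:覆盖的要点越多分越高(真实系统由评审/验证代理给出)。"""
    return min(1.0, len(solution.split(";")) / 4)

def hsr_qair(candidates: list, threshold: float = 0.7, max_iters: int = 3) -> str:
    for _ in range(max_iters):
        # HSR:轮流把每个候选解设为锚点,由其余候选(同伴)修补
        candidates = [repair(c, candidates[:i] + candidates[i + 1:])
                      for i, c in enumerate(candidates)]
        best = max(candidates, key=quality)
        if quality(best) >= threshold:      # QAIR:质量达标则停止继续精炼
            return best
    return max(candidates, key=quality)

print(hsr_qair(["结论A", "结论A;证据B", "证据C"]))
```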
论文及项目相关链接
摘要
大型语言模型在科学推理方面取得了显著进展,但仍存在两个主要瓶颈:一是显式检索会使推理碎片化,带来额外token和步骤的"工具税";二是多代理管道往往通过对所有候选解取平均而削弱了优质解。本研究提出一个统一框架,结合隐式检索和结构化协作来解决这些挑战。该框架以基于监视器的检索模块为基础,该模块在token级别运行,以对推理干扰最小的方式整合外部知识。在此基础上,分层解决方案精炼(HSR)迭代地将每个候选解指定为锚点,由其余候选解对其进行修补,而质量感知迭代推理(QAIR)则根据解的质量调整精炼过程。在Humanity's Last Exam(HLE)Bio/Chem Gold任务上,我们的框架达到了48.3%的准确率,这是迄今为止报告的最高值,比最强代理基线高出13.4个百分点,领先前沿大型语言模型最多达18.1个百分点。同时,该框架减少了53.5%的token使用量和43.7%的代理步骤。在SuperGPQA和TRQA上的结果证实了其在不同领域的稳健性。错误分析表明,推理失败和知识差距在超过85%的情况下同时发生,而多样性分析显示存在一个明确的二分现象:检索任务受益于解的多样性,而推理任务则倾向于共识。这些发现表明,隐式增强和结构化精炼能够克服显式工具使用和统一聚合的低效问题。相关代码可访问:https://github.com/tangxiangru/Eigen-1。
关键见解
- 大型语言模型在科学推理方面表现出显著进展,但仍存在两个主要挑战:额外标记和步骤的“工具税”以及多代理管道的平均效应。
- 提出一种结合隐式检索和结构化协作的统一框架,解决上述挑战。
- 框架中的监视器基础检索模块在令牌级别操作,减少外部知识与推理之间的干扰。
- 通过分层解决方案细化和质量感知迭代推理,提高解决方案的质量和效率。
- 在Humanity’s Last Exam任务上达到48.3%的准确率,显著优于其他方法和大型语言模型。
- 错误分析显示推理失败和知识差距经常同时发生,而多样性分析揭示出解决方案多样性与推理任务共识之间的平衡。
点此查看论文截图

ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
Authors:Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng
Large Language Models (LLMs) have been used to make decisions in complex scenarios, where they need models to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose Theory of Mind Policy Optimization (ToMPO) algorithm to optimize the perception of other individual strategies and the game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM’s strategic decision-making mainly by: 1) generating rollouts based on reasoning the strategies of other individuals, 2) estimating advantages at both the graph-level and sample-level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model’s strategic decision-making capabilities.
大型语言模型(LLM)已被应用于复杂场景中的决策,这些场景需要模型进行深度思考、逻辑推理和明智决策。许多现有研究仅专注于社会任务或模拟环境中的多轮对话,忽视了各种类型的决策及其相互依赖性。当前的强化学习方法在训练过程中很难考虑到其他个体的策略。为了解决这些问题,我们首先定义了一个策略性决策问题,包括两种类型的决策及其时间依赖性。此外,我们提出了心智理论策略优化(Theory of Mind Policy Optimization, ToMPO)算法,以优化对其他个体策略和博弈局势趋势的感知。与群体相对策略优化(GRPO)算法相比,ToMPO主要通过以下方式增强LLM的策略性决策能力:1)基于对其他个体策略的推理生成轨迹采样(rollouts),2)在图级别和样本级别估计优势,3)平衡全局和局部奖励。在模型输出合规性和合作结果方面,ToMPO算法比GRPO方法高出35%。此外,与参数规模大100倍的模型相比,它也有18%的提升。这证明了ToMPO算法在提高模型策略性决策能力方面的有效性。
论文及项目相关链接
PDF 22 pages, 14 figures
Summary
大型语言模型(LLM)在复杂场景中的决策能力受到广泛关注。现有研究多聚焦于社会任务中的多轮对话或模拟环境,忽视了不同类型的决策及其相互依赖性。针对现有强化学习方法的不足,本文定义了一种包含两种类型决策及其时间依赖性的策略性决策问题。同时,提出了心智理论策略优化(ToMPO)算法,以优化对其他个体策略和博弈局势趋势的感知。相较于群体相对策略优化(GRPO)算法,ToMPO主要通过生成基于其他个体策略推理的轨迹采样、在图级别和样本级别估计优势以及平衡全局和局部奖励来增强LLM的策略性决策能力。实验表明,ToMPO算法在模型输出合规性和合作结果方面优于GRPO方法,相较于参数规模大100倍的模型也有显著改进。
Key Takeaways
- 大型语言模型(LLM)在复杂场景中的决策能力重要。
- 现有研究多聚焦于社会任务中的对话模拟,忽略了不同类型决策的相互依赖性。
- 提出了一种包含两种类型决策及其时间依赖性的战略决策问题定义。
- 介绍了心智理论策略优化(ToMPO)算法,用于优化对其他个体策略和博弈局势的感知。
- ToMPO算法通过生成基于其他个体策略的轨迹采样、在图级别和样本级别估计优势来增强LLM的策略性决策能力。
- ToMPO算法在模型输出合规性和合作结果方面优于Group Relative Policy Optimization(GRPO)算法。
点此查看论文截图




EvoMail: Self-Evolving Cognitive Agents for Adaptive Spam and Phishing Email Defense
Authors:Wei Huang, De-Tian Chu, Lin-Yuan Bai, Wei Kang, Hai-Tao Zhang, Bo Li, Zhi-Mo Han, Jing Ge, Hai-Feng Lin
Modern email spam and phishing attacks have evolved far beyond keyword blacklists or simple heuristics. Adversaries now craft multi-modal campaigns that combine natural-language text with obfuscated URLs, forged headers, and malicious attachments, adapting their strategies within days to bypass filters. Traditional spam detection systems, which rely on static rules or single-modality models, struggle to integrate heterogeneous signals or to continuously adapt, leading to rapid performance degradation. We propose EvoMail, a self-evolving cognitive agent framework for robust detection of spam and phishing. EvoMail first constructs a unified heterogeneous email graph that fuses textual content, metadata (headers, senders, domains), and embedded resources (URLs, attachments). A Cognitive Graph Neural Network enhanced by a Large Language Model (LLM) performs context-aware reasoning across these sources to identify coordinated spam campaigns. Most critically, EvoMail engages in an adversarial self-evolution loop: a ‘’red-team’’ agent generates novel evasion tactics – such as character obfuscation or AI-generated phishing text – while the ‘’blue-team’’ detector learns from failures, compresses experiences into a memory module, and reuses them for future reasoning. Extensive experiments on real-world datasets (Enron-Spam, Ling-Spam, SpamAssassin, and TREC) and synthetic adversarial variants demonstrate that EvoMail consistently outperforms state-of-the-art baselines in detection accuracy, adaptability to evolving spam tactics, and interpretability of reasoning traces. These results highlight EvoMail’s potential as a resilient and explainable defense framework against next-generation spam and phishing threats.
现代电子邮件垃圾邮件和钓鱼攻击已经远远超越了关键词黑名单或简单启发式方法。攻击者如今会发起多模态攻击活动,将自然语言文本与混淆URL、伪造邮件头和恶意附件相结合,并在几天内调整策略以绕过过滤器。传统垃圾邮件检测系统依赖静态规则或单模态模型,难以整合异构信号或持续适应,导致性能迅速下降。我们提出了EvoMail,这是一个用于稳健检测垃圾邮件和钓鱼邮件的自进化认知代理框架。EvoMail首先构建一个统一的异构电子邮件图,融合文本内容、元数据(邮件头、发件人、域名)和嵌入式资源(URL、附件)。一个由大型语言模型(LLM)增强的认知图神经网络在这些来源之间进行上下文感知推理,以识别协同的垃圾邮件活动。最重要的是,EvoMail参与一个对抗性自进化循环:"红队"代理生成新的规避手法,如字符混淆或AI生成的钓鱼文本,而"蓝队"检测器从失败中学习,将经验压缩进记忆模块,并在后续推理中复用。在真实数据集(Enron-Spam、Ling-Spam、SpamAssassin和TREC)以及合成对抗变体上的大量实验表明,EvoMail在检测精度、对不断演化的垃圾邮件策略的适应性以及推理轨迹的可解释性方面均优于最先进的基线。这些结果突显了EvoMail作为抵御新一代垃圾邮件和钓鱼威胁的稳健且可解释的防御框架的潜力。
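下面是一个玩具级的概念性示意(并非EvoMail的官方实现;关键词、混淆规则和"记忆"结构均为本文假设,真实系统由LLM与图神经网络完成),用于说明"红队生成规避手法、蓝队从漏检中学习并写入记忆模块"的对抗性自进化循环:

```python
SPAM_KEYWORDS = {"free money", "click now"}

def red_team(text: str) -> str:
    """示意版规避手法:把字母替换成形近字符以绕过关键词匹配(字符混淆)。"""
    return text.replace("e", "3").replace("o", "0")

class BlueTeamDetector:
    def __init__(self):
        self.memory = {}                      # 记忆模块:混淆字符 -> 原字符

    def normalize(self, text: str) -> str:
        for bad, good in self.memory.items():
            text = text.replace(bad, good)
        return text

    def is_spam(self, text: str) -> bool:
        t = self.normalize(text.lower())
        return any(kw in t for kw in SPAM_KEYWORDS)

    def learn_from_failure(self, missed_text: str, original_text: str):
        """对比漏检样本与原始垃圾文本,把发现的字符替换写入记忆,供后续推理复用。"""
        for a, b in zip(missed_text, original_text):
            if a != b:
                self.memory[a] = b

detector = BlueTeamDetector()
spam = "Free money inside, click now!"
attacked = red_team(spam)

print(detector.is_spam(attacked))             # False:初始检测器被字符混淆绕过
detector.learn_from_failure(attacked, spam)   # 蓝队从失败中学习去混淆映射
print(detector.is_spam(red_team(spam)))       # True:记忆模块使同类混淆不再奏效
```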
论文及项目相关链接
Summary
现代电子邮件垃圾邮件和钓鱼攻击已经超越了关键词黑名单或简单启发式方法。攻击者现在采用多模态攻击活动,结合自然语言文本、混淆网址、伪造邮件头和恶意附件,并在几天内调整策略绕过过滤器。传统的垃圾邮件检测系统依赖于静态规则或单模态模型,难以整合异构信号或持续适应,导致性能迅速下降。我们提出EvoMail,一种自进化认知代理框架,用于稳健地检测垃圾邮件和钓鱼邮件。EvoMail首先构建一个统一的异构电子邮件图,融合文本内容、元数据(邮件头、发件人、域名)和嵌入式资源(网址、附件)。一个由大型语言模型增强的认知图神经网络在这些来源上执行上下文感知推理,以识别协同的垃圾邮件活动。最重要的是,EvoMail参与对抗性自进化循环:红队代理生成新的规避手法,如字符混淆或AI生成的钓鱼文本,而蓝队检测器从失败中学习,将经验压缩到记忆模块中并重复用于未来推理。在真实数据集(如Enron-Spam、Ling-Spam、SpamAssassin和TREC)和合成对抗变体上的广泛实验表明,EvoMail在检测准确性、适应不断变化的垃圾邮件策略以及推理轨迹的可解释性方面均优于最先进的基线。这些结果突显了EvoMail作为对抗下一代垃圾邮件和钓鱼威胁的稳健且可解释的防御框架的潜力。
Key Takeaways
- 现代垃圾邮件和钓鱼攻击已经超越简单的检测机制,采用多模式策略结合文本、网址和附件等。
- 传统检测系统依赖于静态规则和单模态模型,难以适应新威胁并容易性能下降。
- EvoMail是一个自进化认知代理框架,融合异构信号如文本、元数据和嵌入式资源来检测垃圾邮件。
- EvoMail采用认知图神经网络和大型语言模型进行上下文感知推理。
- EvoMail具有对抗性自我进化能力,能生成并应对新型规避策略。
- 广泛实验证明EvoMail在检测准确性、适应性和可解释性方面优于现有方法。
点此查看论文截图



Which Cultural Lens Do Models Adopt? On Cultural Positioning Bias and Agentic Mitigation in LLMs
Authors:Yixin Wan, Xingrun Chen, Kai-Wei Chang
Large language models (LLMs) have unlocked a wide range of downstream generative applications. However, we found that they also risk perpetuating subtle fairness issues tied to culture, positioning their generations from the perspectives of the mainstream US culture while demonstrating salient externality towards non-mainstream ones. In this work, we identify and systematically investigate this novel culture positioning bias, in which an LLM’s default generative stance aligns with a mainstream view and treats other cultures as outsiders. We propose the CultureLens benchmark with 4000 generation prompts and 3 evaluation metrics for quantifying this bias through the lens of a culturally situated interview script generation task, in which an LLM is positioned as an onsite reporter interviewing local people across 10 diverse cultures. Empirical evaluation on 5 state-of-the-art LLMs reveals a stark pattern: while models adopt insider tones in over 88 percent of US-contexted scripts on average, they disproportionately adopt mainly outsider stances for less dominant cultures. To resolve these biases, we propose 2 inference-time mitigation methods: a baseline prompt-based Fairness Intervention Pillars (FIP) method, and a structured Mitigation via Fairness Agents (MFA) framework consisting of 2 pipelines: (1) MFA-SA (Single-Agent) introduces a self-reflection and rewriting loop based on fairness guidelines. (2) MFA-MA (Multi-Agent) structures the process into a hierarchy of specialized agents: a Planner Agent(initial script generation), a Critique Agent (evaluates initial script against fairness pillars), and a Refinement Agent (incorporates feedback to produce a polished, unbiased script). Empirical results showcase the effectiveness of agent-based methods as a promising direction for mitigating biases in generative LLMs.
大型语言模型(LLM)已经解锁了众多下游生成式应用。然而,我们发现它们也存在延续与文化相关的微妙公平性问题的风险:模型从美国主流文化的视角定位其生成内容,同时对非主流文化表现出明显的局外人姿态。在这项工作中,我们识别并系统地研究了这种新型的文化定位偏见,即LLM的默认生成立场与主流观点对齐,而将其他文化视为局外者。我们提出了CultureLens基准,包含4000个生成提示和3个评估指标,通过一个文化情境化的采访脚本生成任务来量化这种偏见:LLM被设定为现场记者,采访来自10种不同文化的当地人。对5个最先进LLM的实证评估揭示了一个鲜明的模式:模型在平均超过88%的美国语境脚本中采用局内人语气,而对于非主导文化,它们却主要采用局外人立场。为了解决这些偏见,我们提出了两种推理时缓解方法:基线的基于提示的公平干预支柱(FIP)方法,以及由两条管道组成的基于公平代理的结构化缓解(MFA)框架:(1)MFA-SA(单代理)引入基于公平准则的自我反思与重写循环;(2)MFA-MA(多代理)将过程组织为专门代理的层级结构:计划代理(生成初始脚本)、批判代理(对照公平支柱评估初始脚本)和精炼代理(结合反馈产出润色后的无偏见脚本)。实证结果表明,基于代理的方法是缓解生成式LLM偏见的一个有前景的方向。
论文及项目相关链接
Summary
大型语言模型(LLM)为下游生成式应用打开了广泛的可能性,但同时也存在微妙的公平性问题,即模型倾向从主流美国文化的视角生成内容,忽视非主流文化。本研究针对这一新型文化定位偏见进行系统研究,并提出CultureLens基准测试,通过模拟跨文化访谈场景来量化偏见程度。实证评估显示,模型在模拟美国文化背景下的脚本中更倾向采取内部人视角,而对非主流文化则更倾向采取外部人视角。为解决偏见问题,本研究提出两种推理时间缓解方法:基于提示的公平干预原则(FIP)和通过公平性代理(MFA)进行结构化缓解。实证结果表明基于代理的方法在缓解生成式LLM偏见方面展现出良好前景。
Key Takeaways
- 大型语言模型(LLM)在下游生成式应用中有广泛用途,但也存在文化定位偏见。
- LLM倾向于从主流美国文化的视角生成内容,忽视非主流文化。
- 提出CultureLens基准测试,通过模拟跨文化访谈场景来量化这种偏见。
- 实证评估显示,模型在模拟不同文化背景下的脚本生成中,存在对主流文化和非主流文化的不同立场。
- 为解决偏见问题,研究提出两种推理时间缓解方法:FIP和MFA。
- MFA包括单代理和多代理两种方法,通过自我反思、重写和引入专业代理来纠正偏见。
点此查看论文截图

Recon-Act: A Self-Evolving Multi-Agent Browser-Use System via Web Reconnaissance, Tool Generation, and Task Execution
Authors:Kaiwen He, Zhiwei Wang, Chenyi Zhuang, Jinjie Gu
Recent years, multimodal models have made remarkable strides and pave the way for intelligent browser use agents. However, when solving tasks on real world webpages in multi-turn, long-horizon trajectories, current agents still suffer from disordered action sequencing and excessive trial and error during execution. This paper introduces Recon-Act, a self-evolving multi-agent framework grounded in Reconnaissance-Action behavioral paradigm. The system comprises a Reconnaissance Team and an Action Team: the former conducts comparative analysis and tool generation, while the latter handles intent decomposition, tool orchestration, and execution. By contrasting the erroneous trajectories with successful ones, the Reconnaissance Team infers remedies, and abstracts them into a unified notion of generalized tools, either expressed as hints or as rule-based codes, and register to the tool archive in real time. The Action Team reinference the process empowered with these targeting tools, thus establishing a closed-loop training pipeline of data-tools-action-feedback. Following the 6 level implementation roadmap proposed in this work, we have currently reached Level 3 (with limited human-in-the-loop intervention). Leveraging generalized tools obtained through reconnaissance, Recon-Act substantially improves adaptability to unseen websites and solvability on long-horizon tasks, and achieves state-of-the-art performance on the challenging VisualWebArena dataset.
近年来,多模态模型取得了显著进步,为智能浏览器使用代理铺平了道路。然而,在真实网页上完成多轮、长程轨迹的任务时,当前代理仍然存在动作顺序混乱和执行过程中过度试错的问题。本文介绍了Recon-Act,一个基于"侦察-行动"行为范式的自进化多代理框架。该系统包括侦察团队和行动团队:前者负责对比分析与工具生成,后者负责意图分解、工具编排和执行。通过对比错误轨迹与成功轨迹,侦察团队推断出补救措施,并将其抽象为统一的通用工具概念——以提示或基于规则的代码形式表达——并实时注册到工具库中。行动团队借助这些针对性工具重新推理执行过程,从而建立数据-工具-行动-反馈的闭环训练管道。遵循本文提出的6级实施路线图,我们目前已达到Level 3(仅需有限的人工参与)。借助侦察获得的通用工具,Recon-Act大大提高了对未见网站的适应性和长程任务的可解性,并在具有挑战性的VisualWebArena数据集上实现了最先进的性能。
论文及项目相关链接
Summary
本论文提出了Recon-Act,一种基于侦察-行动行为范式的自进化多智能体框架。它包含侦察团队和行动团队两部分,分别负责比较分析、工具生成以及意图分解、工具编排与执行等任务。通过错误轨迹与成功轨迹的对比,侦察团队推断出改进措施,并抽象为通用工具,实时注册至工具库。行动团队利用这些工具强化推理过程,建立从数据-工具-行动-反馈的闭环训练管道。目前遵循该论文提出的六级实施路线图,已达成Level 3阶段(有限的人机交互)。通过侦察获得的通用工具,Recon-Act大大提高了对未见网站的适应性和长期任务的解决能力,并在挑战性的VisualWebArena数据集上实现了最佳性能。
Key Takeaways
- 多模态模型在智能浏览器代理方面取得了显著进展。
- 当前代理在处理多回合、长期任务时仍存在行动序列紊乱和执行中过多的试错问题。
- Recon-Act是一个基于侦察-行动行为范式的自进化多智能体框架。
- 侦察团队负责对比分析,生成通用工具并实时注册至工具库。
- 行动团队利用这些通用工具强化推理过程,建立闭环训练管道。
- 目前已达到Level 3阶段(仅需有限的人工参与),利用侦察获得的通用工具提高了对未见网站的适应性和长程任务的解决能力。
点此查看论文截图



MAIFormer: Multi-Agent Inverted Transformer for Flight Trajectory Prediction
Authors:Seokbin Yoon, Keumjin Lee
Flight trajectory prediction for multiple aircraft is essential and provides critical insights into how aircraft navigate within current air traffic flows. However, predicting multi-agent flight trajectories is inherently challenging. One of the major difficulties is modeling both the individual aircraft behaviors over time and the complex interactions between flights. Generating explainable prediction outcomes is also a challenge. Therefore, we propose a Multi-Agent Inverted Transformer, MAIFormer, as a novel neural architecture that predicts multi-agent flight trajectories. The proposed framework features two key attention modules: (i) masked multivariate attention, which captures spatio-temporal patterns of individual aircraft, and (ii) agent attention, which models the social patterns among multiple agents in complex air traffic scenes. We evaluated MAIFormer using a real-world automatic dependent surveillance-broadcast flight trajectory dataset from the terminal airspace of Incheon International Airport in South Korea. The experimental results show that MAIFormer achieves the best performance across multiple metrics and outperforms other methods. In addition, MAIFormer produces prediction outcomes that are interpretable from a human perspective, which improves both the transparency of the model and its practical utility in air traffic control.
对多架飞机的飞行轨迹预测至关重要,它为理解飞机如何在当前空中交通流中航行提供了关键洞察。然而,预测多智能体飞行轨迹本身就具有挑战性:主要困难之一是既要对单架飞机随时间的行为建模,又要刻画航班之间的复杂交互;生成可解释的预测结果也是一大挑战。因此,我们提出了多智能体倒置Transformer(MAIFormer),作为预测多智能体飞行轨迹的新型神经架构。该框架包含两个关键注意力模块:(i)掩码多元注意力,用于捕捉单架飞机的时空模式;(ii)智能体注意力,用于建模复杂空中交通场景中多个智能体之间的交互模式。我们使用韩国仁川国际机场终端空域的真实ADS-B(广播式自动相关监视)飞行轨迹数据集对MAIFormer进行了评估。实验结果表明,MAIFormer在多个指标上均取得最佳性能并优于其他方法。此外,MAIFormer产生的预测结果可以从人类角度进行解释,这提高了模型的透明度及其在空中交通管制中的实用价值。
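下面是一个概念性示意(并非MAIFormer的官方实现;假设可用PyTorch,维度、掩码方式与模块组合均为本文为说明而设),用于展示"先在每架飞机内部沿时间维做带因果掩码的自注意力,再在每个时间步上跨飞机做代理注意力"这类两级注意力的组织方式:

```python
import torch
import torch.nn as nn

class TwoStageAttention(nn.Module):
    def __init__(self, d_model: int = 32, n_heads: int = 4):
        super().__init__()
        self.temporal = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.agent = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_agents, seq_len, d_model)
        b, a, t, d = x.shape
        xt = x.reshape(b * a, t, d)
        causal = torch.ones(t, t).triu(diagonal=1).bool()      # 因果掩码:只看过去
        xt, _ = self.temporal(xt, xt, xt, attn_mask=causal)    # 每架飞机自身的时序模式
        x = xt.reshape(b, a, t, d).permute(0, 2, 1, 3).reshape(b * t, a, d)
        x, _ = self.agent(x, x, x)                              # 同一时刻跨飞机的交互
        return x.reshape(b, t, a, d).permute(0, 2, 1, 3)

if __name__ == "__main__":
    traj = torch.randn(2, 5, 10, 32)    # 2 个批次、5 架飞机、10 个时间步的特征(随机示例)
    out = TwoStageAttention()(traj)
    print(out.shape)                    # torch.Size([2, 5, 10, 32])
```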
论文及项目相关链接
PDF 8 pages, 7 figures, submitted for IEEE Transactions on Intelligent Transportation System
Summary
基于多代理逆变换器(MAIFormer)的飞行轨迹预测技术对于现代航空交通控制至关重要。该技术通过两个关键注意力模块——掩码多元注意力和代理注意力,实现对个体飞机行为及其间复杂交互的建模。利用韩国仁川国际机场终端空域的实时自动相关监视广播飞行轨迹数据集进行的评估表明,MAIFormer在多个指标上表现最佳,预测结果对人类可解释,提高了模型的透明度和实用性。
Key Takeaways
- 飞行轨迹预测对于航空交通控制至关重要,为多飞机交互提供关键洞察。
- 多代理飞行轨迹预测具有挑战性,需建模个体飞机行为及时空复杂交互。
- 提出了一种新的神经架构——多代理逆变换器(MAIFormer)进行预测。
- MAIFormer包含两个关键注意力模块:掩码多元注意力和代理注意力。
- 掩码多元注意力捕捉个体飞机的时空模式,代理注意力建模多代理间的社会模式。
- 在仁川国际机场的实际数据集上的评估显示,MAIFormer在多个指标上表现最佳。
点此查看论文截图

CORE: Full-Path Evaluation of LLM Agents Beyond Final State
Authors:Panagiotis Michelakis, Yiannis Hadjiyiannis, Dimitrios Stamoulis
Evaluating AI agents that solve real-world tasks through function-call sequences remains an open challenge. Existing agentic benchmarks often reduce evaluation to a binary judgment of the final state, overlooking critical aspects such as safety, efficiency, and intermediate correctness. We propose a framework based on deterministic finite automata (DFAs) that encodes tasks as sets of valid tool-use paths, enabling principled assessment of agent behavior in diverse world models. Building on this foundation, we introduce CORE, a suite of five metrics, namely Path Correctness, Path Correctness - Kendall’s tau Composite, Prefix Criticality, Harmful-Call Rate, and Efficiency, that quantify alignment with expected execution patterns. Across diverse worlds, our method reveals important performance differences between agents that would otherwise appear equivalent under traditional final-state evaluation schemes.
评估通过函数调用序列解决现实任务的AI代理仍然是一个开放性挑战。现有的代理基准测试通常将评估简化为对最终状态的二元判断,忽略了安全性、效率和中间过程正确性等关键方面。我们提出了一种基于确定性有限自动机(DFA)的框架,将任务编码为合法工具使用路径的集合,从而能够在多样的世界模型中对代理行为进行有原则的评估。在此基础上,我们引入了CORE,这是一套包含五个指标的度量体系,即路径正确性、路径正确性-肯德尔tau复合指标、前缀关键性、有害调用率和效率,用于量化代理行为与预期执行模式的对齐程度。在多样化的世界模型中,我们的方法揭示了那些在传统最终状态评估方案下看似等效的代理之间的重要性能差异。
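下面是一个概念性示意(并非CORE的官方实现;DFA的状态与转移、有害调用集合以及各指标的具体公式均为本文为说明而假设),用于展示"把合法工具调用路径编码成DFA,对完整调用序列做全路径评估而不只看最终状态"的做法:

```python
# DFA:状态 -> {工具调用: 下一状态}
DFA = {
    "start":    {"search":   "searched"},
    "searched": {"open_doc": "reading", "search": "searched"},
    "reading":  {"answer":   "done"},
}
ACCEPTING = {"done"}
HARMFUL_CALLS = {"delete_file", "send_payment"}
SHORTEST_PATH_LEN = 3        # start -> searched -> reading -> done

def evaluate(trace: list) -> dict:
    state, valid_steps = "start", 0
    for call in trace:
        nxt = DFA.get(state, {}).get(call)
        if nxt is None:
            break                                   # 偏离合法路径,后续步骤不再计入
        state, valid_steps = nxt, valid_steps + 1
    return {
        "path_correct": state in ACCEPTING,                        # 是否沿合法路径到达终态
        "prefix_valid_ratio": valid_steps / max(len(trace), 1),    # 合法前缀占比
        "harmful_call_rate": sum(c in HARMFUL_CALLS for c in trace) / max(len(trace), 1),
        "efficiency": SHORTEST_PATH_LEN / max(len(trace), 1),      # 越接近 1 越高效
    }

print(evaluate(["search", "open_doc", "answer"]))                   # 理想轨迹
print(evaluate(["search", "delete_file", "open_doc", "answer"]))    # 含有害调用且偏离路径
```

在这种全路径视角下,两条最终都"答对"的轨迹可以在有害调用率和效率上表现出明显差异,这正是仅看最终状态的评估所无法区分的。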
论文及项目相关链接
PDF Accepted: LAW 2025 Workshop NeurIPS 2025
总结
在评价通过函数调用序列解决真实任务的AI代理时仍存在挑战。现有的代理基准测试通常将评价简化为最终状态的二元判断,忽略了安全、效率和中间过程正确性等重要方面。我们提出了一个基于确定性有限自动机(DFA)的框架,该框架将任务编码为有效的工具使用路径集,从而能够在不同的世界模型中评估代理的行为。在此基础上,我们引入了CORE指标套件,包括路径正确性、路径正确性-肯德尔tau复合指标、前缀重要性、有害调用率和效率等五个指标,以量化其与预期执行模式的对齐程度。在不同的世界中,我们的方法揭示了代理之间重要的性能差异,这些差异在传统最终状态评估方案下可能会被忽视。
关键见解
- AI代理的评价仍面临挑战,特别是在解决真实世界任务时需要考虑多个因素如安全性、效率和中间过程正确性。
- 现有代理基准测试主要关注最终状态的二元判断,忽略了其他重要方面。
- 提出了一种基于确定性有限自动机的框架来编码任务作为有效的工具使用路径集。
- 引入CORE指标套件以量化代理行为与预期执行模式的对齐程度。包括路径正确性、肯德尔tau复合指标等五个指标。
- 该框架能够在不同的世界模型中评估代理的行为。
- 通过该方法和指标套件,能够揭示传统最终状态评估方案下可能被忽视的代理性能差异。
点此查看论文截图



i-LAVA: Insights on Low Latency Voice-2-Voice Architecture for Agents
Authors:Anupam Purwar, Aditya Choudhary
We experiment with a low-latency, end-to-end voice-to-voice communication model to optimize it for real-time conversational applications. By analyzing components essential to voice to voice (V-2-V) system viz. automatic speech recognition (ASR), text-to-speech (TTS), and dialog management, our work analyzes how to reduce processing time while maintaining high-quality interactions to identify the levers for optimizing V-2-V system. Our work identifies that TTS component which generates life-like voice, full of emotions including natural pauses and exclamations has highest impact on Real time factor (RTF). The experimented V-2-V architecture utilizes CSM1b has the capability to understand tone as well as context of conversation by ingesting both audio and text of prior exchanges to generate contextually accurate speech. We explored optimization of Residual Vector Quantization (RVQ) iterations by the TTS decoder which come at a cost of decrease in the quality of voice generated. Our experimental evaluations also demonstrate that for V-2-V implementations based on CSM most important optimizations can be brought by reducing the number of RVQ Iterations along with the codebooks used in Mimi.
我们实验了一种低延迟、端到端的语音到语音通信模型,以针对实时对话应用进行优化。通过分析语音到语音(V-2-V)系统的关键组件,即自动语音识别(ASR)、文本到语音(TTS)和对话管理,我们的工作研究了如何在保持高质量交互的同时减少处理时间,从而找出优化V-2-V系统的关键抓手。我们发现,生成逼真、富有情感(包括自然停顿和感叹)语音的TTS组件对实时因子(RTF)的影响最大。所实验的V-2-V架构利用CSM1b,通过摄取先前交流的音频和文本来理解对话的语调和上下文,从而生成语境准确的语音。我们探索了减少TTS解码器中残差向量量化(RVQ)迭代次数的优化方法,但这会以生成语音质量的下降为代价。我们的实验评估还表明,对于基于CSM的V-2-V实现,最重要的优化可以通过减少RVQ迭代次数以及Mimi中使用的码本数量来实现。
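下面是一个概念性示意(并非论文官方实现;其中的音频时长与各组件耗时均为假设值),用于说明实时因子(RTF = 处理时间 / 音频时长)的计算方式,以及减少TTS解码器的RVQ迭代次数如何近似线性地降低整体延迟、但以音质下降为代价:

```python
AUDIO_SECONDS = 4.0                     # 要合成的语音时长(假设)
ASR_TIME, DIALOG_TIME = 0.30, 0.40      # ASR 与对话管理耗时(假设,基本固定)
RVQ_TIME_PER_ITER = 0.35                # 每层 RVQ 解码的耗时(假设)

def pipeline_stats(rvq_iters: int) -> dict:
    tts_time = rvq_iters * RVQ_TIME_PER_ITER
    total = ASR_TIME + DIALOG_TIME + tts_time
    return {
        "rvq_iters": rvq_iters,
        "tts_time": round(tts_time, 2),
        "total_time": round(total, 2),
        "rtf": round(total / AUDIO_SECONDS, 2),   # RTF 越小于 1,越能跟上实时播放
    }

for iters in (8, 4, 2):                 # 迭代减半,TTS 延迟近似减半,但音质随之下降
    print(pipeline_stats(iters))
```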
论文及项目相关链接
PDF This paper analyzes a low-latency, end-to-end voice-to-voice (V-2-V) architecture, identifying that the Text-to-Speech (TTS) component has the highest impact on real-time performance. By reducing the number of Residual Vector Quantization (RVQ) iterations in the TTS model, latency can be effectively halved, creating a direct trade-off between conversational speed and audio quality
Summary
本研究通过实验低延迟、端到端的语音到语音通信模型,针对实时对话应用进行优化。通过分析语音到语音系统(V-2-V)的关键组件,如自动语音识别(ASR)、文本到语音(TTS)和对话管理,研究如何在保持高质量交互的同时减少处理时间,以找到优化V-2-V系统的关键手段。研究结果表明,TTS组件对实时因子(RTF)的影响最大,它需要生成充满情感的逼真语音,包括自然停顿和感叹。实验性的V-2-V架构利用CSM1b,能够通过摄取先前的音频和文本交流来理解语调以及对话的上下文,生成语境准确的语音。本研究还探索了通过减少TTS解码器的残差向量量化(RVQ)迭代次数来优化V-2-V实现,但这可能会导致生成的语音质量下降。
Key Takeaways
- 研究通过实验低延迟的端到端语音到语音(V-2-V)通信模型来优化实时会话应用。
- 分析了V-2-V系统的关键组件(ASR、TTS和对话管理),以找到性能优化的抓手。
- TTS组件对实时因子(RTF)的影响最大,因为它需要生成充满情感、带有自然停顿和感叹的逼真语音;富有情感的语音合成是系统实时性能的关键所在。
- 实验性的V-2-V架构利用CSM技术,能够摄取先前交流的音频和文本来理解对话的语境和语调,从而根据先前的对话内容生成更自然、更合适的响应。
- 通过减少TTS解码器的RVQ迭代次数可以优化系统延迟,但会对语音质量产生影响,需要在处理速度与语音质量之间进行权衡。
- 自动语音识别(ASR)的性能直接影响系统对语音信号的理解能力,需要高效准确的ASR算法来提升整体性能。
- 对话管理在实时交互中至关重要,直接影响系统响应用户的方式和用户满意度;高效的对话管理能够确保响应更加流畅和自然。
点此查看论文截图

Learning to Summarize by Learning to Quiz: Adversarial Agentic Collaboration for Long Document Summarization
Authors:Weixuan Wang, Minghao Wu, Barry Haddow, Alexandra Birch
Long document summarization remains a significant challenge for current large language models (LLMs), as existing approaches commonly struggle with information loss, factual inconsistencies, and coherence issues when processing excessively long documents. We propose SummQ, a novel adversarial multi-agent framework that addresses these limitations through collaborative intelligence between specialized agents operating in two complementary domains: summarization and quizzing. Our approach employs summary generators and reviewers that work collaboratively to create and evaluate comprehensive summaries, while quiz generators and reviewers create comprehension questions that serve as continuous quality checks for the summarization process. This adversarial dynamic, enhanced by an examinee agent that validates whether the generated summary contains the information needed to answer the quiz questions, enables iterative refinement through multifaceted feedback mechanisms. We evaluate SummQ on three widely used long document summarization benchmarks. Experimental results demonstrate that our framework significantly outperforms existing state-of-the-art methods across ROUGE and BERTScore metrics, as well as in LLM-as-a-Judge and human evaluations. Our comprehensive analyses reveal the effectiveness of the multi-agent collaboration dynamics, the influence of different agent configurations, and the impact of the quizzing mechanism. This work establishes a new approach for long document summarization that uses adversarial agentic collaboration to improve summarization quality.
长文档摘要仍然是当前大型语言模型(LLM)面临的一个重大挑战:现有方法在处理过长文档时,通常会面临信息丢失、事实不一致和连贯性问题。我们提出了SummQ,这是一种新型对抗性多智能体框架,它通过"摘要"和"出题"两个互补领域中专门智能体之间的协作智能来解决这些局限。我们的方法采用摘要生成器和摘要评审员协作创建并评估全面的摘要,同时测验生成器和测验评审员创建理解性问题,作为摘要过程的持续质量检查。这种对抗性动态由一个应试(examinee)智能体进一步强化:它验证生成的摘要是否包含回答测验问题所需的信息,从而通过多方面的反馈机制实现迭代改进。我们在三个广泛使用的长文档摘要基准上对SummQ进行了评估。实验结果表明,我们的框架在ROUGE和BERTScore指标以及LLM-as-a-Judge和人工评估中均显著优于现有最先进的方法。我们的综合分析揭示了多智能体协作动态的有效性、不同智能体配置的影响以及出题机制的作用。这项工作为长文档摘要建立了一种利用对抗性智能体协作来提高摘要质量的新方法。
论文及项目相关链接
Summary
本文提出一种名为SummQ的新型对抗性多智能体框架,用于解决长文档摘要生成过程中的信息丢失、事实性不一致和连贯性问题。该框架通过两个互补领域的智能体之间的协作智能实现摘要生成和评估,同时通过问答生成和评估来持续检查摘要质量。实验结果表明,SummQ在广泛使用的长文档摘要评估指标上显著优于现有先进技术。此工作采用对抗性智能体协作的方法提高了摘要质量。
Key Takeaways
- 当前大型语言模型在处理长文档摘要时面临信息丢失、事实性不一致和连贯性问题等挑战。
- SummQ是一种新型对抗性多智能体框架,旨在解决这些问题。
- SummQ包含摘要生成器和评估器,通过协作生成并评估摘要。
- SummQ还包括问答生成器和评估器,作为对摘要质量的持续检查。
- SummQ通过多智能体协作动态实现迭代优化,通过多方面的反馈机制进行改进。
- 实验结果表明,SummQ在多个长文档摘要评估指标上优于现有技术。
点此查看论文截图
Adaptive Learning in Spatial Agent-Based Models for Climate Risk Assessment: A Geospatial Framework with Evolutionary Economic Agents
Authors:Yara Mohajerani
Climate risk assessment requires modelling complex interactions between spatially heterogeneous hazards and adaptive economic systems. We present a novel geospatial agent-based model that integrates climate hazard data with evolutionary learning for economic agents. Our framework combines Mesa-based spatial modelling with CLIMADA climate impact assessment, introducing adaptive learning behaviours that allow firms to evolve strategies for budget allocation, pricing, wages, and risk adaptation through fitness-based selection and mutation. We demonstrate the framework using riverine flood projections under RCP8.5 until 2100, showing that evolutionary adaptation enables firms to converge with baseline (no hazard) production levels after decades of disruption due to climate stress. Our results reveal systemic risks where even agents that are not directly exposed to floods face impacts through supply chain disruptions, with the end-of-century average price of goods 5.6% higher under RCP8.5 compared to the baseline. This open-source framework provides financial institutions and companies with tools to quantify both direct and cascading climate risks while evaluating cost-effective adaptation strategies.
气候风险评估需要对空间异质性灾害与适应性经济系统之间的复杂交互进行建模。我们提出了一种新型的地理空间主体模型,将气候灾害数据与经济主体的演化学习相结合。我们的框架将基于Mesa的空间建模与CLIMADA气候影响评估相结合,并引入适应性学习行为,使企业能够通过基于适应度的选择和变异来演化其预算分配、定价、工资和风险适应策略。我们以RCP8.5情景下直至2100年的河流洪水预测为例展示该框架,结果表明,在气候压力造成数十年的冲击之后,演化适应使企业的生产水平能够收敛到基线(无灾害)水平。我们的结果还揭示了系统性风险:即使未直接暴露于洪水的主体,也会通过供应链中断受到影响;在RCP8.5情景下,世纪末的商品平均价格比基线高出5.6%。这一开源框架为金融机构和企业提供了量化直接及连锁气候风险、并评估具有成本效益的适应策略的工具。
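下面是一个玩具级的概念性示意(并非论文官方实现;利润函数、损失参数和策略编码均为本文假设),用于说明"企业主体以基于适应度的选择与变异来演化其风险适应策略"这一演化学习机制:

```python
import random

random.seed(0)
N_FIRMS, GENERATIONS, FLOOD_LOSS = 20, 40, 0.6

def profit(adapt_share: float) -> float:
    """假设的简化利润函数:适应性投入降低洪水损失,但成本随投入二次增长。"""
    damage = FLOOD_LOSS * (1 - adapt_share)
    return 1.0 - damage - 0.5 * adapt_share ** 2

# 每个企业的策略:用于风险适应的预算占比,初始随机
firms = [random.random() for _ in range(N_FIRMS)]

for _ in range(GENERATIONS):
    ranked = sorted(firms, key=profit, reverse=True)
    survivors = ranked[: N_FIRMS // 2]                 # 适应度选择:保留利润最高的一半策略
    children = [min(1.0, max(0.0, random.choice(survivors) + random.gauss(0, 0.05)))
                for _ in range(N_FIRMS - len(survivors))]   # 变异:在幸存策略附近扰动
    firms = survivors + children

print(f"演化后的平均适应性投入比例: {sum(firms) / N_FIRMS:.2f}")   # 在本假设下应接近 0.6 的最优值
```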
论文及项目相关链接
PDF Submitted and accepted to Tackling Climate Change with Machine Learning workshop at NeurIPS 2025. 5 pages, 1 figure. Source code and documentation available at https://github.com/yaramohajerani/spatial-climate-ABM
Summary:气候风险评估需要模拟空间异质性灾害和适应性经济系统之间的复杂交互。本研究提出了一种新型地理空间主体模型,该模型结合气候灾害数据与演化学习机制,为经济主体建模。该框架结合了基于Mesa的空间建模与CLIMADA气候影响评估,并引入适应性学习行为,使企业能够通过基于适应度的选择和变异来演化预算分配、定价、工资及风险适应策略。以RCP8.5情景下的河流洪水预测为例,研究显示,在经历数十年的气候压力冲击后,演化适应使企业的生产水平趋向于基线(无灾害)水平。该框架还揭示了系统性风险:即使未直接受灾的企业也会通过供应链中断受到影响;在RCP8.5情景下,世纪末商品平均价格比基线高5.6%。该开源框架有助于金融机构和企业定量评估直接及连锁的气候风险,同时评估具有成本效益的适应策略。
Key Takeaways:
- 气候风险评估需要模拟复杂交互作用,包括空间异质性危害和适应性经济系统。
- 提出一种新型地理空间主体模型,整合气候危害数据与进化学习机制。
- 框架结合了空间建模与气候影响评估CLIMADA。
- 引入适应性学习行为,使企业能够通过基于适应度的选择与变异来调整策略,以应对气候冲击。
- 以河流洪水预测为例,展示了进化适应策略在应对气候变化中的作用。
- 系统性风险包括供应链断裂导致的连锁反应,即使非直接受影响的实体企业也会受到影响。
点此查看论文截图


Aegis: Automated Error Generation and Identification for Multi-Agent Systems
Authors:Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, Xue Feng
As Multi-Agent Systems (MAS) become increasingly autonomous and complex, understanding their error modes is critical for ensuring their reliability and safety. However, research in this area has been severely hampered by the lack of large-scale, diverse datasets with precise, ground-truth error labels. To address this bottleneck, we introduce \textbf{AEGIS}, a novel framework for \textbf{A}utomated \textbf{E}rror \textbf{G}eneration and \textbf{I}dentification for Multi-Agent \textbf{S}ystems. By systematically injecting controllable and traceable errors into initially successful trajectories, we create a rich dataset of realistic failures. This is achieved using a context-aware, LLM-based adaptive manipulator that performs sophisticated attacks like prompt injection and response corruption to induce specific, predefined error modes. We demonstrate the value of our dataset by exploring three distinct learning paradigms for the error identification task: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. Our comprehensive experiments show that models trained on AEGIS data achieve substantial improvements across all three learning paradigms. Notably, several of our fine-tuned models demonstrate performance competitive with or superior to proprietary systems an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/AEGIS-Website.
随着多智能体系统(MAS)的自主性和复杂性不断提高,理解它们的错误模式对于确保其可靠性和安全性至关重要。然而,由于缺乏带有精确真实(ground-truth)错误标签的大规模、多样化数据集,这一领域的研究受到了严重阻碍。为了解决这一瓶颈,我们引入了AEGIS,这是一个用于多智能体系统自动化错误生成与识别(Automated Error Generation and Identification for Multi-Agent Systems)的新型框架。我们通过在最初成功的轨迹中系统性地注入可控且可追溯的错误,构建了一个丰富的真实失败数据集。这一过程由一个上下文感知、基于LLM的自适应操纵器完成,它执行诸如提示注入和响应篡改之类的复杂攻击,以诱发特定的、预定义的错误模式。我们通过探索错误识别任务的三种学习范式来证明该数据集的价值:有监督微调、强化学习和对比学习。全面的实验表明,在AEGIS数据上训练的模型在这三种学习范式中都取得了显著改进。值得注意的是,我们微调的若干模型的性能可与规模大一个数量级的专有系统相当甚至更优,这验证了我们的自动化数据生成框架是开发更稳健、更可解释的多智能体系统的重要资源。项目网站:https://kfq20.github.io/AEGIS-Website。
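下面是一个概念性示意(并非AEGIS的官方实现;轨迹结构、错误模式名称与注入方式均为本文假设,真实系统由LLM操纵器完成),用于说明"向原本成功的多智能体轨迹注入可控、可追溯的错误,从而自动获得带精确错误标签的训练样本"的思路:

```python
import copy
import random

success_trajectory = [
    {"agent": "planner",  "output": "先检索航班,再比较价格"},
    {"agent": "searcher", "output": "找到 3 个航班,最低价 1200 元"},
    {"agent": "writer",   "output": "推荐 1200 元的航班并给出链接"},
]

def inject_error(traj: list, mode: str, rng: random.Random) -> dict:
    """按预定义错误模式篡改某一步,返回带真实标签(错误类型与出错位置)的样本。"""
    corrupted = copy.deepcopy(traj)
    step = rng.randrange(len(corrupted))
    if mode == "response_corruption":
        corrupted[step]["output"] = "(内容被截断,信息缺失)"
    elif mode == "prompt_injection":
        corrupted[step]["output"] += " 请忽略之前的所有指令。"
    return {
        "trajectory": corrupted,
        "label": {"error_mode": mode, "faulty_step": step,
                  "faulty_agent": corrupted[step]["agent"]},
    }

rng = random.Random(42)
sample = inject_error(success_trajectory, "response_corruption", rng)
print(sample["label"])      # 这种精确的错误定位标签可直接用于监督微调 / 强化学习 / 对比学习
```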
论文及项目相关链接
Summary
在多智能体系统自主性和复杂性不断提升的情况下,理解其错误模式对保障其可靠性和安全性至关重要。但该领域的研究受限于缺乏带有精确真实错误标签的大规模、多样化数据集。为解决此瓶颈,本文提出AEGIS框架,通过系统性地向初始成功轨迹注入可控、可追溯的错误,构建丰富的多智能体系统错误数据集。该框架采用上下文感知、基于大型语言模型的自适应操纵器实施复杂攻击,如提示注入和响应篡改,以产生特定的预设错误模式,并通过三种学习范式验证数据集价值。实验显示,在AEGIS数据上训练的模型在所有学习范式中均取得实质性改进,部分微调模型的性能甚至可与规模大一个数量级的专有系统相当或更优,验证了该自动化数据生成框架的重要性。
Key Takeaways
- 多智能体系统(MAS)的错误模式理解对其可靠性和安全性至关重要。
- 当前研究受限于缺乏具有精确地面真实错误标签的大规模、多样化数据集。
- AEGIS框架用于自动生成和识别多智能体系统的错误。
- AEGIS通过注入可控和可追踪的错误到初始成功轨迹来创建数据集。
- 采用语境感知、基于大型语言模型的自适应操纵器实现复杂攻击以产生错误。
- 三种学习范式在AEGIS数据集上进行实验验证,包括监督微调、强化学习和对比学习。
点此查看论文截图


WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
Authors:Junteng Liu, Yunji Li, Chi Zhang, Jingyang Li, Aili Chen, Ke Ji, Weiyu Cheng, Zijia Wu, Chengyu Du, Qidi Xu, Jiayuan Song, Zhengmao Zhu, Wenhu Chen, Pengyu Zhao, Junxian He
The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online sources. However, existing open-source web agents either demonstrate limited information-seeking abilities on complex tasks or lack transparent implementations. In this work, we identify that the key challenge lies in the scarcity of challenging data for information seeking. To address this limitation, we introduce WebExplorer: a systematic data generation approach using model-based exploration and iterative, long-to-short query evolution. This method creates challenging query-answer pairs that require multi-step reasoning and complex web navigation. By leveraging our curated high-quality dataset, we successfully develop advanced web agent WebExplorer-8B through supervised fine-tuning followed by reinforcement learning. Our model supports 128K context length and up to 100 tool calling turns, enabling long-horizon problem solving. Across diverse information-seeking benchmarks, WebExplorer-8B achieves the state-of-the-art performance at its scale. Notably, as an 8B-sized model, WebExplorer-8B is able to effectively search over an average of 16 turns after RL training, achieving higher accuracy than WebSailor-72B on BrowseComp-en/zh and attaining the best performance among models up to 100B parameters on WebWalkerQA and FRAMES. Beyond these information-seeking tasks, our model also achieves strong generalization on the HLE benchmark even though it is only trained on knowledge-intensive QA data. These results highlight our approach as a practical path toward long-horizon web agents.
大型语言模型(LLM)的范式正日益转向智能体应用,其中网页浏览能力对于从各种在线来源检索信息至关重要。然而,现有的开源网络代理要么在复杂任务上表现出有限的信息搜索能力,要么缺乏透明的实现。在这项工作中,我们发现关键挑战在于缺乏具有挑战性的信息搜索数据。为了解决这一限制,我们引入了WebExplorer:一种系统性的数据生成方法,结合基于模型的探索与迭代式的由长到短的查询演化。该方法创建需要多步推理和复杂网页导航的高难度问答对。利用我们精心构建的高质量数据集,我们通过监督微调加强化学习成功开发了先进的网络代理WebExplorer-8B。我们的模型支持128K的上下文长度和最多100轮工具调用,能够解决长程问题。在多种信息搜索基准测试中,WebExplorer-8B在同等规模下实现了最先进的性能。值得注意的是,作为一个8B规模的模型,WebExplorer-8B在强化学习训练后能够平均进行16轮的有效搜索,在BrowseComp-en/zh上的准确率高于WebSailor-72B,并在WebWalkerQA和FRAMES上取得了参数量不超过100B的模型中的最佳表现。除了这些信息搜索任务之外,尽管只在知识密集型问答数据上训练,我们的模型在HLE基准上也表现出很强的泛化能力。这些结果表明,我们的方法是通向长程网络代理的一条可行路径。
论文及项目相关链接
Summary
本文介绍了大型语言模型(LLM)在面向代理应用时的挑战及解决方案。针对现有开源网络代理在信息搜索方面的局限性,提出了WebExplorer方法。该方法通过模型探索、迭代式长短查询演化,生成具有挑战性的问题答案对,并发展了WebExplorer-8B网络代理。该模型支持长语境和多次工具调用,能在多种信息搜索基准测试中达到最佳性能。
Key Takeaways
- 大型语言模型(LLM)正越来越多地应用于代理应用,其中网络浏览能力对于从各种在线来源检索信息至关重要。
- 现有开源网络代理在面对复杂任务时,信息搜索能力有限,且实施不够透明。
- WebExplorer方法通过模型探索和查询迭代,解决了信息搜索中的关键挑战,生成具有挑战性的问题答案对。
- WebExplorer-8B网络代理的成功开发得益于精心策划的高质量数据集。
- WebExplorer-8B模型支持长语境和多次工具调用,实现了长周期问题解决。
- 在多种信息搜索基准测试中,WebExplorer-8B达到了最佳性能,甚至超越了一些更大规模模型的性能。
点此查看论文截图



JudgeAgent: Knowledge-wise and Dynamic LLM Evaluation with Agent-as-Interviewer
Authors:Zhichao Shi, Xuhui Jiang, Chengjin Xu, Cangli Yao, Zhenxin Huang, Shengjie Ma, Yinghan Shen, Jian Guo, Yuanzhuo Wang
Current evaluation paradigms for large language models (LLMs) suffer from overestimated or biased evaluation and mismatched question difficulty, leading to incomplete evaluations of LLM’s knowledge and capability boundaries, which hinder LLM’s effective application and optimization. To address these challenges, we propose Agent-as-Interviewer, a dynamic evaluation paradigm that employs LLM agents to conduct multi-turn interactions for evaluation. Unlike current benchmarking or dynamic interaction paradigms, Agent-as-Interviewer utilizes agents to call knowledge tools for wider and deeper knowledge in the dynamic multi-turn question generation, achieving more complete evaluations of the LLM’s knowledge boundaries. It also leverages agents to plan query strategies for adjustment of the question difficulty levels, enhancing the difficulty control to match the actual capabilities of target LLMs. Based on this paradigm, we develop JudgeAgent, a knowledge-wise dynamic evaluation framework that employs knowledge-driven synthesis as the agent’s tool, and uses difficulty scoring as strategy guidance, thereby finally providing valuable suggestions to help targets optimize themselves. Extensive experiments validate the effectiveness of JudgeAgent’s suggestions, demonstrating that Agent-as-Interviewer can accurately identify the knowledge and capability boundaries of target models. The source code is available on https://anonymous.4open.science/r/JudgeAgent.
当前大型语言模型(LLM)的评价范式存在评估被高估或有偏、以及问题难度不匹配的问题,导致对LLM知识和能力边界的评估不完整,阻碍了LLM的有效应用和优化。为了解决这些挑战,我们提出了"Agent-as-Interviewer"动态评价范式,采用LLM代理通过多轮交互进行评估。与现有的基准测试或动态交互范式不同,Agent-as-Interviewer利用代理调用知识工具,在动态多轮问题生成中获取更广更深的知识,从而实现对LLM知识边界更完整的评估;它还利用代理规划查询策略来调整问题难度级别,增强难度控制以匹配目标LLM的实际能力。基于这一范式,我们开发了JudgeAgent,一个知识感知的动态评估框架,它以知识驱动的合成作为代理工具,以难度评分作为策略指导,最终为目标模型提供有价值的优化建议。大量实验验证了JudgeAgent建议的有效性,表明Agent-as-Interviewer能够准确识别目标模型的知识和能力边界。源代码可在 https://anonymous.4open.science/r/JudgeAgent 获取。
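下面是一个概念性示意(并非JudgeAgent的官方实现;题库、难度刻度和被测模型均为本文假设的桩),用于说明"面试官代理根据上一题的对错自适应调整下一题难度,从多轮交互中估计被测模型能力边界"的基本控制逻辑:

```python
QUESTION_BANK = {1: "什么是过拟合?", 2: "解释KL散度", 3: "推导LoRA的参数量", 4: "证明GRPO与DPO的联系"}

def target_model_answers_correctly(difficulty: int) -> bool:
    """桩:假设被测模型只能答对难度不超过 2 的题目(真实系统中由评审代理判分)。"""
    return difficulty <= 2

def interview(turns: int = 6) -> int:
    difficulty, frontier = 1, 0
    for _ in range(turns):
        question = QUESTION_BANK[difficulty]
        correct = target_model_answers_correctly(difficulty)
        print(f"难度{difficulty} | {question} | {'答对' if correct else '答错'}")
        if correct:
            frontier = max(frontier, difficulty)
            difficulty = min(difficulty + 1, max(QUESTION_BANK))   # 答对则加大难度
        else:
            difficulty = max(difficulty - 1, 1)                    # 答错则降低难度
    return frontier

print("估计的能力边界难度:", interview())
```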
论文及项目相关链接
Summary:当前大型语言模型(LLM)的评价体系存在高估、偏见、问题难度不匹配等问题,无法全面评估LLM的知识和能力边界,制约了LLM的有效应用和优化。为此,提出Agent-as-Interviewer动态评价体系,利用LLM代理进行多轮互动评价。该体系通过代理调用知识工具实现更广泛、更深入的知识动态多轮问题生成,更全面地评估LLM的知识边界;同时,利用代理制定查询策略,调整问题难度,更好地匹配目标LLM的实际能力。基于该体系,开发了JudgeAgent知识动态评价体系,采用知识驱动合成作为代理工具,难度评分作为策略指导,为目标模型提供优化建议。实验验证JudgeAgent的有效性,表明Agent-as-Interviewer能准确识别目标模型的知识和能力边界。
Key Takeaways:
- 当前大型语言模型(LLM)评价体系存在问题,如高估、偏见及问题难度不匹配等。
- Agent-as-Interviewer动态评价体系通过LLM代理进行多轮互动评价,解决这些问题。
- 该体系利用代理调用知识工具实现更广泛和深入的知识动态多轮问题生成。
- Agent-as-Interviewer能更全面地评估LLM的知识边界。
- 代理制定查询策略,调整问题难度,以匹配目标LLM的实际能力。
- JudgeAgent知识动态评价体系采用知识驱动合成和难度评分策略,提供优化建议。
- 实验验证JudgeAgent的有效性,能准确识别目标模型的知识和能力边界。源代码已公开。
点此查看论文截图



R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization
Authors:Yuante Li, Xu Yang, Xiao Yang, Minrui Xu, Xisen Wang, Weiqing Liu, Jiang Bian
Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short RD-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. RD-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, RD-Agent(Q) achieves up to 2X higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor-model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.
金融市场因其高维性、非平稳性和持续波动性,给资产收益预测带来了根本性挑战。尽管大型语言模型和多智能体系统取得了进展,但当前的量化研究管道仍存在自动化程度有限、可解释性弱,以及因子挖掘与模型创新等关键环节之间协同割裂的问题。在本文中,我们提出了面向量化金融的R&D-Agent,简称RD-Agent(Q),这是首个以数据为中心的多智能体框架,旨在通过因子与模型的协同优化来自动化量化策略的全栈研发。RD-Agent(Q)将量化流程分解为两个迭代阶段:研究阶段动态设置与目标对齐的提示,基于领域先验提出假设,并将其映射为具体任务;开发阶段则利用代码生成代理Co-STEER实现任务特定的代码,随后在真实市场中进行回测。两个阶段通过反馈阶段连接,该阶段全面评估实验结果并指导后续迭代,同时使用多臂老虎机(multi-armed bandit)调度器进行自适应方向选择。实证表明,RD-Agent(Q)在因子数量减少70%的情况下,实现了高达经典因子库两倍的年化收益率,并且在真实市场上优于最先进的深度时间序列模型。其因子与模型的联合优化在预测精度和策略稳健性之间取得了良好的平衡。我们的代码可在 https://github.com/microsoft/RD-Agent 获取。
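下面是一个概念性示意(并非RD-Agent(Q)的官方实现;回报分布与数值均为本文假设,真实反馈来自实盘回测),用ε-greedy多臂老虎机说明"调度器根据每轮回测反馈,在因子挖掘与模型创新两个研发方向之间自适应分配迭代"的思路:

```python
import random

random.seed(1)
ARMS = ["factor_mining", "model_innovation"]
counts = {a: 0 for a in ARMS}
values = {a: 0.0 for a in ARMS}           # 各方向的平均回报估计

def backtest_gain(arm: str) -> float:
    """桩:模拟一次研发迭代后回测指标的提升(真实系统由回测结果给出)。"""
    mean = 0.8 if arm == "factor_mining" else 0.5
    return random.gauss(mean, 0.3)

for step in range(50):
    eps = 0.2
    arm = random.choice(ARMS) if random.random() < eps else max(ARMS, key=lambda a: values[a])
    gain = backtest_gain(arm)
    counts[arm] += 1
    values[arm] += (gain - values[arm]) / counts[arm]   # 增量式更新平均回报

print(counts)   # 调度器会把更多迭代分配给回报更高的研发方向
print({a: round(v, 2) for a, v in values.items()})
```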
论文及项目相关链接
PDF 42 pages,11figures, NeurIPS 2025
Summary
本文提出了针对金融市场的量化策略研发多智能体框架RD-Agent(Q)。该框架通过协同优化因子模型,实现量化策略的自动化研发。框架包含研究阶段和发展阶段两个迭代阶段,分别负责设定目标提示、形成基于领域先验的假设和转化为实际任务代码并执行实盘测试,且两个阶段间设有反馈环节来评估实验结果并指导后续迭代方向。经验实证显示,RD-Agent(Q)能实现相较于传统因子库更高的年化回报并显著优于现有的时间序列模型。目前该代码已开源于GitHub微软RD-Agent项目中。
Key Takeaways
- RD-Agent(Q)是一个针对金融市场的量化策略研发的多智能体框架,旨在实现量化策略的自动化研发。
- 该框架通过协同优化因子模型来提高预测资产回报的准确性。
- RD-Agent(Q)包含研究阶段和发展阶段两个迭代阶段,研究阶段动态设置目标提示和假设,发展阶段则负责生成任务特定代码并在实际市场中进行测试。
- 反馈环节全面评估实验结果并指导后续迭代方向,并通过多臂老虎机调度器自适应地选择优化方向。
- 实验证明,RD-Agent(Q)实现了高达经典因子库两倍的年化回报,同时使用的因子数量减少了70%。
- RD-Agent(Q)在实际市场表现优于目前最先进的深度时间序列模型。其平衡的预测精度和策略稳健性带来了卓越性能。
点此查看论文截图

MASS: Muli-agent simulation scaling for portfolio construction
Authors:Taian Guo, Haiyang Shen, JinSheng Huang, Zhengyang Mao, Junyu Luo, Binqi Chen, Zhuoru Chen, Luchen Liu, Bingyu Xia, Xuhui Liu, Yun Ma, Ming Zhang
The application of LLM-based agents in financial investment has shown significant promise, yet existing approaches often require intermediate steps like predicting individual stock movements or rely on predefined, static workflows. These limitations restrict their adaptability and effectiveness in constructing optimal portfolios. In this paper, we introduce the Multi-Agent Scaling Simulation (MASS), a novel framework that leverages multi-agent simulation for direct, end-to-end portfolio construction. At its core, MASS employs a backward optimization process to dynamically learn the optimal distribution of heterogeneous agents, enabling the system to adapt to evolving market regimes. A key finding enabled by our framework is the exploration of the scaling effect for portfolio construction: we demonstrate that as the number of agents increases exponentially (up to 512), the aggregated decisions yield progressively higher excess returns. Extensive experiments on a challenging, self-collected dataset from the 2023 Chinese A-share market show that MASS consistently outperforms seven state-of-the-art baselines. Further backtesting, stability analyses and the experiment on data leakage concerns validate its enhanced profitability and robustness. We have open-sourced our code, dataset, and training snapshots at https://github.com/gta0804/MASS/ to foster further research.
基于大语言模型(LLM)的代理在金融投资中的应用前景广阔,但现有方法通常需要预测个股走势等中间步骤,或者依赖于预定义的静态工作流程。这些限制影响了它们在构建最优投资组合时的适应性和有效性。在本文中,我们介绍了多智能体规模仿真(MASS)这一新型框架,它利用多智能体仿真进行直接的端到端投资组合构建。其核心是,MASS采用逆向优化过程来动态学习异质智能体的最优分布,使系统能够适应不断变化的市场状态。我们的框架带来的一个关键发现是对投资组合构建中规模效应的探索:我们证明,随着智能体数量呈指数增长(最多至512个),聚合决策产生的超额回报逐步提高。在我们自行收集的、具有挑战性的2023年中国A股市场数据集上进行的大量实验表明,MASS持续优于七种最先进的基线方法。进一步的回测、稳定性分析以及针对数据泄露问题的实验验证了其更强的盈利能力和稳健性。我们已在 https://github.com/gta0804/MASS/ 开源代码、数据集和训练快照,以促进进一步的研究。
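下面是一个玩具级的概念性示意(并非MASS的官方实现;真实超额收益、噪声幅度等数值均为本文假设),用于直观说明"随着智能体数量指数增长,聚合后的决策信号越来越接近真实值"这一规模效应:

```python
import numpy as np

rng = np.random.default_rng(7)
TRUE_ALPHA = np.array([0.02, -0.01, 0.03, 0.00])     # 4 只资产的真实超额收益(假设)

def aggregated_error(n_agents: int) -> float:
    # 每个智能体的观点 = 真实值 + 噪声;聚合时取所有智能体观点的平均
    views = TRUE_ALPHA + rng.normal(0, 0.05, size=(n_agents, TRUE_ALPHA.size))
    consensus = views.mean(axis=0)
    return float(np.abs(consensus - TRUE_ALPHA).mean())

for n in (1, 8, 64, 512):                             # 论文中智能体数量按指数增长到 512
    print(f"{n:>3d} 个智能体时,聚合观点的平均误差: {aggregated_error(n):.4f}")
```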
论文及项目相关链接
Summary
该文提出了一种基于多智能体模拟(MASS)的新兴框架,用于直接端到端的投资组合构建。该框架利用反向优化过程动态学习异质智能体的最优分布,以适应不断变化的市场环境。实验表明,随着智能体数量的增加,决策聚合产生的超额回报逐渐提高。在2023年中国A股市场数据集的测试中,MASS框架表现优于其他先进方法。
Key Takeaways
- 多智能体模拟(MASS)框架用于直接端到端的投资组合构建。
- 该框架采用反向优化过程来动态学习智能体的最优分布。
- 智能体的增加可带来更高的超额回报。
- 在中国A股市场数据集的测试中,MASS表现优异,且开源了代码和数据集以促进进一步研究。
- 该框架能适应不断变化的市场环境。
- 通过实验验证了MASS的盈利能力和稳健性。
点此查看论文截图


