Published: 2025-11-05

Last updated: 2025-11-27


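⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as a reference only and use them with caution.
🔴 Note: do not rely on them in serious academic settings; they are suitable only for screening papers before reading them.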

Updated 2025-11-05

InnovatorBench: Evaluating Agents’ Ability to Conduct Innovative LLM Research

Authors:Yunze Wu, Dayuan Fu, Weiye Si, Zhen Huang, Mohan Jiang, Keyu Li, Shijie Xia, Jie Sun, Tianze Xu, Xiangkun Hu, Pengrui Lu, Xiaojie Cai, Lyumanshan Ye, Wenhong Zhu, Yang Xiao, Pengfei Liu

AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis, yet existing benchmarks probe narrow skills in simplified settings. To address this gap, we introduce InnovatorBench, a benchmark-platform pair for realistic, end-to-end assessment of agents performing Large Language Model (LLM) research. It comprises 20 tasks spanning Data Construction, Filtering, Augmentation, Loss Design, Reward Design, and Scaffold Construction, which require runnable artifacts and assessment of correctness, performance, output quality, and uncertainty. To support agent operation, we develop ResearchGym, a research environment offering rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving. We also implement a lightweight ReAct agent that couples explicit reasoning with executable planning using frontier models such as Claude-4, GPT-5, GLM-4.5, and Kimi-K2. Our experiments demonstrate that while frontier models show promise in code-driven research tasks, they struggle with fragile algorithm-related tasks and long-horizon decision making, such as impatience, poor resource management, and overreliance on template-based reasoning. Furthermore, agents require over 11 hours to achieve their best performance on InnovatorBench, underscoring the benchmark’s difficulty and showing the potential of InnovatorBench to be the next generation of code-based research benchmark.

Paper and project links

PDF

Summary
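AI agents could accelerate scientific discovery by automating hypothesis formation, experiment design, coding, execution, and analysis. To evaluate agents conducting large language model (LLM) research, the authors introduce InnovatorBench, a benchmark-platform pair whose tasks span data construction, filtering, augmentation, loss design, reward design, and scaffold construction, together with ResearchGym, an environment that supports agent operation. Experiments show that frontier models are promising on code-driven research tasks but struggle with fragile algorithm-related tasks and long-horizon decision making.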

Key Takeaways

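AI agents could accelerate several stages of scientific discovery, including hypothesis formation, experiment design, coding, execution, and analysis.
InnovatorBench is a benchmark-platform pair for evaluating AI agents that perform large language model (LLM) research.
InnovatorBench comprises 20 tasks across six areas and assesses correctness, performance, output quality, and uncertainty.
ResearchGym is a research environment that supports agent operation with rich action spaces, distributed and long-horizon execution, asynchronous monitoring, and snapshot saving.
Frontier models show promise on code-driven research tasks but struggle with fragile algorithm-related tasks and long-horizon decision making.
Agents need more than 11 hours to reach their best performance on InnovatorBench, which underscores the benchmark's difficulty.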
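
The abstract mentions a lightweight ReAct agent that couples explicit reasoning with executable actions. The loop below is only a minimal sketch of that general pattern, not the authors' implementation: the tool registry, the `scripted_policy` stand-in for a model call, and the trace format are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple


def run_shell(arg: str) -> str:
    """Hypothetical tool: pretends to run a command instead of executing it."""
    return f"(pretend output of `{arg}`)"


# Hypothetical tool registry; a real environment would expose many more actions.
TOOLS: Dict[str, Callable[[str], str]] = {"shell": run_shell}


def scripted_policy(history: List[str]) -> Tuple[str, str, str]:
    """Stand-in for an LLM call: returns (thought, action, argument)."""
    if not any(line.startswith("Observation:") for line in history):
        return ("I should inspect the workspace first.", "shell", "ls")
    return ("I have seen enough to answer.", "finish", "done")


def react_loop(task: str, policy=scripted_policy, max_steps: int = 5) -> List[str]:
    """Alternate thought -> action -> observation until the policy finishes."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, arg = policy(history)
        history.append(f"Thought: {thought}")
        history.append(f"Action: {action}[{arg}]")
        if action == "finish":
            break
        tool = TOOLS.get(action)
        observation = tool(arg) if tool else f"unknown tool {action!r}"
        history.append(f"Observation: {observation}")
    return history


trace = react_loop("train a small model")
```

The `max_steps` cap is the simplest guard against a policy that never emits `finish`; long-horizon settings like the one described above would also need asynchronous monitoring, which this sketch leaves out.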

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

Authors:Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, Yaxin Du, Xianghe Pang, Keduan Huang, Yanfeng Wang, Qiang Yan, Siheng Chen

Large Language Models (LLMs) increasingly rely on external tools to perform complex, realistic tasks, yet their ability to utilize the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. Existing MCP research covers few servers, depends on costly manual curation, and lacks training support, hindering progress toward real-world deployment. To overcome these limitations, we introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. MCP-Flow collects and filters data from 1166 servers and 11536 tools, producing 68733 high-quality instruction-function call pairs and 6439 trajectories, far exceeding prior work in scale and diversity. Extensive experiments demonstrate MCP-Flow’s effectiveness in driving superior MCP tool selection, function-call generation, and enhanced agentic task performance. MCP-Flow thus provides a scalable foundation for advancing LLM agents’ proficiency in real-world MCP environments. MCP-Flow is publicly available at https://github.com/wwh0411/MCP-Flow.

Paper and project links

PDF Preprint, Under Review

Summary

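LLMs increasingly rely on external tools for complex, realistic tasks, yet their ability to use the rapidly expanding Model Contextual Protocol (MCP) ecosystem remains limited. To overcome this, the authors introduce MCP-Flow, an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training. It collects and filters data from 1166 servers and 11536 tools, producing high-quality instruction-function call pairs and trajectories that exceed prior work in scale and diversity. Experiments show that MCP-Flow is effective for MCP tool selection, function-call generation, and agentic task performance, providing a scalable foundation for LLM agents in real-world MCP environments.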

Key Takeaways

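LLMs remain limited in their ability to use the Model Contextual Protocol (MCP) ecosystem.
MCP-Flow is an automated web-agent-driven pipeline for large-scale server discovery, data synthesis, and model training.
MCP-Flow collects and filters data at scale, producing 68733 high-quality instruction-function call pairs and 6439 trajectories.
MCP-Flow exceeds prior work in scale and diversity.
MCP-Flow improves MCP tool selection and function-call generation.
MCP-Flow enhances agentic task performance.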
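
To make "instruction-function call pair" concrete, here is a minimal sketch of what one such record and a simple filter might look like. The field names, the `get_forecast` tool, and the validity rule are assumptions for illustration; the actual MCP-Flow data schema and filtering criteria are not described in the abstract.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict, Tuple


@dataclass
class ToolSpec:
    """Hypothetical description of one tool exposed by a server."""
    server: str
    name: str
    required_params: Tuple[str, ...]


@dataclass
class InstructionCallPair:
    """One example: a user instruction and the function call that answers it."""
    instruction: str
    tool_name: str
    arguments: Dict[str, Any]


def is_valid(pair: InstructionCallPair, tool: ToolSpec) -> bool:
    """Keep a pair only if it names the tool and supplies every required parameter."""
    return pair.tool_name == tool.name and all(
        param in pair.arguments for param in tool.required_params
    )


# Invented example data.
weather = ToolSpec(server="example-weather", name="get_forecast", required_params=("city",))
good = InstructionCallPair("What is the forecast for Oslo?", "get_forecast", {"city": "Oslo"})
bad = InstructionCallPair("What is the forecast for Oslo?", "get_forecast", {})

kept = [pair for pair in (good, bad) if is_valid(pair, weather)]
serialized = json.dumps([asdict(pair) for pair in kept])
```

A schema check like `is_valid` is only the cheapest kind of filter; it says nothing about whether the instruction and the call actually match in meaning.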

Experience-Driven Exploration for Efficient API-Free AI Agents

Authors:Chenwei Tang, Jingyu Xing, Xinyu Liu, Zizhou Wang, Jiawei Du, Liangli Zhen, Jiancheng Lv

Most existing software lacks accessible Application Programming Interfaces (APIs), requiring agents to operate solely through pixel-based Graphical User Interfaces (GUIs). In this API-free setting, large language model (LLM)-based agents face severe efficiency bottlenecks: limited to local visual experiences, they make myopic decisions and rely on inefficient trial-and-error, hindering both skill acquisition and long-term planning. To address these challenges, we propose KG-Agent, an experience-driven learning framework that structures an agent’s raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). KG-Agent overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience that enables the agent to generalize from a diverse set of historical strategies. To support long-horizon reasoning, we design a hybrid intrinsic reward mechanism based on the graph topology, combining a state value reward for exploiting known high-value pathways with a novelty reward that encourages targeted exploration. This approach decouples strategic planning from pure discovery, allowing the agent to effectively value setup actions with delayed gratification. We evaluate KG-Agent in two complex, open-ended GUI-based decision-making environments (Civilization V and Slay the Spire), demonstrating significant improvements in exploration efficiency and strategic depth over the state-of-the-art methods.

Paper and project links

PDF

Summary

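The paper proposes KG-Agent, an experience-driven learning framework that addresses the efficiency bottlenecks agents face when software lacks accessible APIs and must be operated through pixel-based GUIs. KG-Agent structures the agent's raw pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG). It overcomes inefficient exploration by linking functionally similar but visually distinct GUI states, forming a rich neighborhood of experience from which the agent can generalize across historical strategies. A hybrid intrinsic reward mechanism based on the graph topology supports long-horizon reasoning. In two complex, open-ended GUI-based environments (Civilization V and Slay the Spire), KG-Agent shows significant improvements in exploration efficiency and strategic depth over state-of-the-art methods.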

Key Takeaways

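Most existing software lacks accessible APIs, so LLM-based agents that operate through pixel-based GUIs face severe efficiency bottlenecks.
KG-Agent is an experience-driven learning framework that structures an agent's pixel-level interactions into a persistent State-Action Knowledge Graph (SA-KG).
By linking functionally similar but visually distinct GUI states, KG-Agent forms a rich neighborhood of experience that improves the agent's ability to generalize.
KG-Agent uses a hybrid intrinsic reward mechanism based on the graph topology to support long-horizon reasoning.
The mechanism combines a state value reward for exploiting known high-value pathways with a novelty reward for targeted exploration, decoupling strategic planning from pure discovery.
In complex, open-ended GUI-based decision-making environments, KG-Agent shows significant improvements in exploration efficiency and strategic depth.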
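
The hybrid reward described above can be sketched on a toy state-action graph. Everything concrete here is an assumption for illustration: the value backup, the count-based novelty term `1 / sqrt(1 + visits)`, and the weight `beta` are not taken from the paper, which only states that a state value reward is combined with a novelty reward over the graph topology.

```python
import math
from collections import defaultdict


class StateActionGraph:
    """Toy state-action graph; the reward formulas are illustrative assumptions."""

    def __init__(self):
        self.edges = defaultdict(dict)   # state -> {action: next_state}
        self.visits = defaultdict(int)   # state -> number of times it was reached
        self.value = defaultdict(float)  # state -> estimated long-term value

    def add_transition(self, state, action, next_state):
        self.edges[state][action] = next_state
        self.visits[next_state] += 1

    def backup(self, terminal_rewards, gamma=0.9, sweeps=20):
        """Propagate rewards backwards along known edges (simple value iteration)."""
        for _ in range(sweeps):
            for state, actions in self.edges.items():
                self.value[state] = max(
                    terminal_rewards.get(nxt, 0.0) + gamma * self.value[nxt]
                    for nxt in actions.values()
                )

    def intrinsic_reward(self, next_state, beta=0.5):
        """State value term plus a count-based novelty bonus."""
        value_term = self.value[next_state]
        novelty_term = 1.0 / math.sqrt(1 + self.visits[next_state])
        return value_term + beta * novelty_term


graph = StateActionGraph()
graph.add_transition("menu", "open_shop", "shop")   # a setup action with no payoff yet
graph.add_transition("shop", "buy", "upgraded")     # the delayed payoff
graph.backup(terminal_rewards={"upgraded": 1.0})
```

After the backup, `graph.value["menu"]` is positive even though opening the shop pays nothing immediately, which is the sense in which a graph-based value term lets an agent credit setup actions with delayed gratification.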

KVCOMM: Online Cross-context KV-cache Communication for Efficient LLM-based Multi-agent Systems

Authors:Hancheng Ye, Zhengqi Gao, Mingyuan Ma, Qinsi Wang, Yuzhe Fu, Ming-Yu Chung, Yueqian Lin, Zhijian Liu, Jianyi Zhang, Danyang Zhuo, Yiran Chen

Multi-agent large language model (LLM) systems are increasingly adopted for complex language processing tasks that require communication and coordination among agents. However, these systems often suffer substantial overhead from repeated reprocessing of overlapping contexts across agents. In typical pipelines, once an agent receives a message from its predecessor, the full context, including prior turns, must be reprocessed from scratch, leading to inefficient processing. While key-value (KV) caching is an effective solution for avoiding redundant computation in single-agent settings where prefixes remain unchanged, it cannot be directly reused in multi-agent scenarios due to diverging prefixes introduced by agent-specific context extensions. We identify that the core challenge lies in the offset variance of KV-caches across agents. To address this, we propose KVCOMM, a training-free framework that enables efficient prefilling in multi-agent inference by reusing KV-caches and aligning cache offsets of overlapping contexts under diverse prefix contexts. KVCOMM estimates and adjusts KV-caches for shared content by referencing a pool of cached examples, termed anchors, that store observed cache deviations under varying prefixes. The anchor pool is maintained and updated online, allowing dynamic adaptation to distinct user requests and context structures. KVCOMM achieves over 70% reuse rate across diverse multi-agent workloads, including retrieval-augmented generation, math reasoning, and collaborative coding tasks, all without quality degradation. Particularly, when each fully-connected agent receives 1K input tokens with 512 prefix tokens and 512 output tokens under a five-agent setting, KVCOMM achieves up to 7.8x speedup compared to the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.

Paper and project links

PDF Accepted for publication in NeurIPS 2025. Code is available at https://github.com/FastMAS/KVCOMM

Summary

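The paper addresses a key inefficiency of multi-agent LLM systems: overlapping contexts are repeatedly reprocessed across agents. It proposes KVCOMM, a training-free framework that makes prefilling in multi-agent inference efficient by reusing KV-caches and aligning the cache offsets of overlapping contexts under diverse prefixes. The framework keeps a pool of anchors that store observed cache deviations and updates it online to adapt to different user requests and context structures. KVCOMM achieves a reuse rate above 70% across diverse multi-agent workloads without quality degradation, and in a five-agent setting it reaches up to 7.8x speedup over the standard prefill pipeline.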

Key Takeaways

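Multi-agent LLM systems repeatedly reprocess overlapping contexts across agents, which makes processing inefficient.
KVCOMM removes this redundant computation by reusing KV-caches and aligning cache offsets, making multi-agent prefilling more efficient.
KVCOMM maintains a pool of anchors that store observed cache deviations and updates it online to adapt to different user requests and context structures.
KVCOMM achieves over 70% cache reuse across diverse multi-agent workloads without quality degradation.
In a five-agent setting, KVCOMM achieves up to 7.8x speedup over the standard prefill pipeline, reducing TTFT from ~430 ms to ~55 ms.
KVCOMM is training-free, and its online anchor updates let it adapt dynamically to new requests and context structures.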
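
As a quick consistency check on the reported numbers, the speedup follows directly from the two time-to-first-token (TTFT) figures. Both figures are approximate in the abstract (~430 ms and ~55 ms), so the result below is approximate as well; the helper name is an assumption.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


BASELINE_TTFT_MS = 430.0  # approximate, from the abstract
KVCOMM_TTFT_MS = 55.0     # approximate, from the abstract

ttft_speedup = speedup(BASELINE_TTFT_MS, KVCOMM_TTFT_MS)
# Share of the baseline prefill latency that is avoided.
latency_saved = 1.0 - KVCOMM_TTFT_MS / BASELINE_TTFT_MS
```

The ratio rounds to 7.8, matching the stated "up to 7.8x", and corresponds to avoiding roughly 87% of the baseline time to first token.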

A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Authors:Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

Paper and project links

PDF This paper is currently under review

Summary
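Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology into three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the field's main objectives: traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation. A compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. The survey reviews methods and systems across domains, shows what each topology enables and where it breaks down, and lists datasets, benchmarks, and resources that support study and deployment. It highlights evaluation practices that keep evidence visible and temporally aligned, and distills guidance on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. Future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk. Together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.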

Key Takeaways

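Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer.
The survey defines the problem and organizes the literature by reasoning topology: direct reasoning, linear chain reasoning, and branch-structured reasoning.
The topology is crossed with the field's main objectives, such as traditional time series analysis and explanation and understanding.
A compact tag set covers decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes.
The review shows what each topology enables and where it breaks down in faithfulness or robustness, and it lists supporting datasets, benchmarks, and resources.
The survey highlights evaluation practices that keep evidence visible and temporally aligned, and gives guidance on matching topology to uncertainty.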

SE-Agent: Self-Evolution Trajectory Optimization in Multi-Step Reasoning with LLM-Based Agents

Authors:Jiaye Lin, Yifu Guo, Yuzhen Han, Sen Hu, Ziyi Ni, Licheng Wang, Mingguang Chen, Hongzhang Liu, Ronghao Chen, Yangfan He, Daxin Jiang, Binxing Jiao, Chen Hu, Huacan Wang

Large Language Model (LLM)-based agents have recently shown impressive capabilities in complex reasoning and tool use via multi-step interactions with their environments. While these agents have the potential to tackle complicated tasks, their problem-solving process, i.e., agents’ interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can navigate agents toward the right directions for solving problems correctly. Although prevailing approaches, such as Monte Carlo Tree Search (MCTS), can effectively balance exploration and exploitation, they ignore the interdependence among various trajectories and lack the diversity of search spaces, which leads to redundant reasoning and suboptimal outcomes. To address these challenges, we propose SE-Agent, a Self-Evolution framework that enables Agents to optimize their reasoning processes iteratively. Our approach revisits and enhances former pilot trajectories through three key operations: revision, recombination, and refinement. This evolutionary mechanism enables two critical advantages: (1) it expands the search space beyond local optima by intelligently exploring diverse solution paths guided by previous trajectories, and (2) it leverages cross-trajectory inspiration to efficiently enhance performance while mitigating the impact of suboptimal reasoning paths. Through these mechanisms, SE-Agent achieves continuous self-evolution that incrementally improves reasoning quality. We evaluate SE-Agent on SWE-bench Verified to resolve real-world GitHub issues. Experimental results across five strong LLMs show that integrating SE-Agent delivers up to 55% relative improvement, achieving state-of-the-art performance among all open-source agents on SWE-bench Verified. Our code and demonstration materials are publicly available at https://github.com/JARVIS-Xs/SE-Agent.

Paper and project links

PDF

Summary

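LLM-based agents have shown impressive capabilities in complex reasoning and tool use through multi-step interaction with their environments, but their problem-solving process, i.e., the interaction trajectory leading to task completion, remains underexploited. These trajectories contain rich feedback that can steer agents toward correct solutions. Prevailing approaches such as Monte Carlo Tree Search (MCTS) balance exploration and exploitation but ignore the interdependence among trajectories and lack diversity in the search space, which leads to redundant reasoning and suboptimal outcomes. SE-Agent is a self-evolution framework that lets agents optimize their reasoning iteratively by revising, recombining, and refining earlier trajectories. This gives two advantages: it expands the search space beyond local optima by exploring diverse solution paths guided by previous trajectories, and it uses cross-trajectory inspiration to improve performance while mitigating the impact of suboptimal reasoning paths. On SWE-bench Verified, which involves resolving real-world GitHub issues, integrating SE-Agent delivers up to 55% relative improvement across five strong LLMs and achieves state-of-the-art performance among open-source agents. Code and demonstration materials are available at https://github.com/JARVIS-Xs/SE-Agent.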

Key Takeaways

A Self-Evolving AI Agent System for Climate Science

Authors:Zijie Guo, Jiong Wang, Fenghua Ling, Wangxu Wei, Xiaoyu Yue, Zhe Jiang, Wanghan Xu, Jing-Jia Luo, Lijing Cheng, Yoo-Geun Ham, Fengfei Song, Pierre Gentine, Toshio Yamagata, Ben Fei, Wenlong Zhang, Xinyu Gu, Chao Li, Yaqiang Wang, Tao Chen, Wanli Ouyang, Bowen Zhou, Lei Bai

Scientific progress in Earth science depends on integrating data across the planet’s interconnected spheres. However, the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity. This creates a major bottleneck for discovery, especially in climate science. To address this challenge, we introduce EarthLink, the first self-evolving AI agent system designed as an interactive “copilot” for Earth scientists. Through natural language interaction, EarthLink automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process that directly addresses this limitation. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and achieves proficiency comparable to a junior researcher in expert evaluations on core large-scale climate tasks, including model-observation comparison and climate change understanding. When tasked with an open scientific problem, specifically the discovery of precursors of the Atlantic Niño, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These emerging capabilities enable a new human-AI research paradigm. Scientists can focus on value and result judgments, while AI systems handle complex data analysis and knowledge integration. This accelerates the pace and breadth of discovery in Earth sciences. The system is accessible at our website https://earthlink.intern-ai.org.cn.

Paper and project links

PDF

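Summary

Progress in Earth science depends on integrating data across the planet's interconnected spheres, but the accelerating volume and fragmentation of multi-sphere knowledge and data have surpassed human analytical capacity, creating a major bottleneck for discovery, especially in climate science. EarthLink is presented as the first self-evolving AI agent system designed as an interactive “copilot” for Earth scientists. Through natural language interaction, it automates the entire research workflow by integrating planning, code execution, data analysis, and physical reasoning into a unified process. Beyond efficiency, it exhibits human-like cross-disciplinary analytical ability and, in expert evaluations on core large-scale climate tasks such as model-observation comparison and climate change understanding, reaches proficiency comparable to a junior researcher. Given an open scientific problem, the discovery of precursors of the Atlantic Niño, EarthLink autonomously developed a research strategy, identified sources of predictability, verified its hypotheses with available data, and proposed a physically consistent mechanism. These capabilities enable a new human-AI research paradigm in which scientists focus on value and result judgments while AI systems handle complex data analysis and knowledge integration. The system is accessible at https://earthlink.intern-ai.org.cn.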

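Key Takeaways

Progress in Earth science requires integrating data across the planet's interconnected spheres.
The accelerating volume and fragmentation of knowledge and data have surpassed human analytical capacity.
EarthLink is presented as the first self-evolving AI agent system designed to address this challenge.
Through natural language interaction, EarthLink automates the research workflow by integrating planning, code execution, data analysis, and physical reasoning.
EarthLink exhibits human-like cross-disciplinary analytical ability and performs at the level of a junior researcher on core climate science tasks.
On an open scientific problem, EarthLink autonomously developed a research strategy, verified its hypotheses, and proposed a physically consistent mechanism.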

How to Train Your LLM Web Agent: A Statistical Diagnosis

Authors:Dheeraj Vattikonda, Santhoshi Ravichandran, Emiliano Penaloza, Hadi Nekoei, Megh Thakkar, Thibault Le Sellier de Chezelles, Nicolas Gontier, Miguel Muñoz-Mármol, Sahar Omidi Shayegan, Stefania Raimondo, Xue Liu, Alexandre Drouin, Laurent Charlin, Alexandre Piché, Alexandre Lacoste, Massimo Caccia

LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.

Paper and project links

PDF

Summary

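LLM-based web agents have made significant progress, mostly in closed-source systems, which widens the gap with open-source alternatives. The paper presents a statistically grounded study of compute allocation for LLM web-agent post-training. It uses a two-stage pipeline: a Llama 3.1 8B student is trained to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. Because the process is highly sensitive to hyperparameter choices, the authors sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters instead of running exhaustive sweeps. Combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. The combined strategy needs only 55% of the compute to match the peak performance of pure SFT on MiniWob++, pushing the compute-performance Pareto frontier, and it is the only strategy that can close the gap with closed-source models.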

Key Takeaways

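LLM-based web agents have progressed mainly in closed-source systems, widening the gap with open-source alternatives.
Two challenges hold back progress: a narrow focus on single-step tasks that overlooks multi-step web interactions, and the high compute cost of post-training.
The study uses a two-stage pipeline of supervised fine-tuning (SFT) followed by on-policy reinforcement learning.
The pipeline is highly sensitive to hyperparameter choices, so the authors sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters rather than rely on trial and error.
Combining SFT with on-policy RL consistently outperforms either approach alone on WorkArena and MiniWob++.
The combined strategy matches the peak performance of pure SFT on MiniWob++ with only 55% of the compute, pushing the compute-performance Pareto frontier.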
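
The abstract does not spell out the bootstrapping procedure, so the following is only one plausible reading, sketched under stated assumptions: treat the sampled runs as a pool, resample that pool with replacement many times, and count how often each hyperparameter value belongs to the best run in a resample. The function name, the synthetic scores, and the "win rate" statistic are all invented for illustration.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Run = Tuple[Dict[str, float], float]  # (hyperparameters, final score)


def bootstrap_best_value(runs: List[Run], param: str,
                         n_boot: int = 1000, seed: int = 0) -> Dict[float, float]:
    """Estimate how often each value of `param` belongs to the best run
    when the pool of sampled runs is resampled with replacement."""
    rng = random.Random(seed)
    wins: Counter = Counter()
    for _ in range(n_boot):
        resample = [rng.choice(runs) for _ in runs]
        best_config, _ = max(resample, key=lambda run: run[1])
        wins[best_config[param]] += 1
    return {value: count / n_boot for value, count in wins.items()}


# Synthetic runs, invented for illustration: lr=1e-5 is better on average.
data_rng = random.Random(1)
runs = [({"lr": lr}, base + data_rng.gauss(0, 0.05))
        for lr, base in [(1e-6, 0.40), (1e-5, 0.60), (1e-4, 0.30)]
        for _ in range(20)]

win_rate = bootstrap_best_value(runs, "lr")
```

Resampling existing runs costs almost nothing compared with training new ones, which is why this kind of analysis can stand in for an exhaustive sweep when each run is expensive.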

VideoExplorer: Think With Videos For Agentic Long-Video Understanding

Authors:Huaying Yuan, Zheng Liu, Junjie Zhou, Hongjin Qian, Yan Shu, Nicu Sebe, Ji-Rong Wen, Zhicheng Dou

Long-video understanding (LVU) is a challenging problem in computer vision. Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration. In this paper, we propose VideoExplorer, a framework grounded in the principle of “thinking with video”, which naturally intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until reaching the final answer, enabling faithful, efficient, and interpretable reasoning. To address the lack of LVU training resources, we construct a long-video reasoning dataset using difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks. Building on this dataset, we design a two-stage training pipeline: supervised trajectory initialization followed by trajectory-level preference optimization, encouraging adaptive temporal grounding and iterative information integration guided by downstream rewards. Extensive evaluations on popular long-video understanding and reasoning benchmarks demonstrate VideoExplorer’s significant advantage over existing baselines, highlighting its robustness, adaptability, and efficiency. Our code is made publicly available in this repository (https://github.com/yhy-2000/VideoDeepResearch).

Paper and project links

PDF

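Summary
The paper proposes VideoExplorer, a long-video understanding framework that intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process. It iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding until it reaches the final answer. To address the lack of LVU training resources, the authors construct a long-video reasoning dataset and design a two-stage training pipeline of supervised trajectory initialization followed by trajectory-level preference optimization. VideoExplorer shows a significant advantage over existing baselines on popular long-video understanding and reasoning benchmarks.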

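Key Takeaways

Long-video understanding (LVU) is a challenging problem in computer vision.
Existing methods either downsample frames for single-pass reasoning, sacrificing fine-grained details, or depend on textual reasoning over task-agnostic representations, hindering task-specific perception and exploration.
VideoExplorer is grounded in the principle of “thinking with video” and intertwines planning, temporal grounding, and scalable perception into a coherent reasoning process.
By iteratively formulating sub-questions, locating relevant moments, and performing task-oriented, temporally scalable video understanding, VideoExplorer enables faithful, efficient, and interpretable reasoning.
To address the lack of LVU training resources, the authors build a long-video reasoning dataset with difficulty-adaptive sampling to ensure high-quality trajectories on complex tasks.
A two-stage training pipeline, supervised trajectory initialization followed by trajectory-level preference optimization, encourages adaptive temporal grounding and iterative information integration guided by downstream rewards.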

CoP: Agentic Red-teaming for Large Language Models using Composition of Principles

Authors:Chen Xiong, Pin-Yu Chen, Tsung-Yi Ho

Recent advances in Large Language Models (LLMs) have spurred transformative applications in various domains, ranging from open-source to proprietary LLMs. However, jailbreak attacks, which aim to break safety alignment and user compliance by tricking the target LLMs into answering harmful and risky responses, are becoming an urgent concern. The practice of red-teaming for LLMs is to proactively explore potential risks and error-prone instances before the release of frontier AI technology. This paper proposes an agentic workflow to automate and scale the red-teaming process of LLMs through the Composition-of-Principles (CoP) framework, where human users provide a set of red-teaming principles as instructions to an AI agent to automatically orchestrate effective red-teaming strategies and generate jailbreak prompts. Distinct from existing red-teaming methods, our CoP framework provides a unified and extensible framework to encompass and orchestrate human-provided red-teaming principles to enable the automated discovery of new red-teaming strategies. When tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.

Paper and project links

PDF

Summary

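Recent advances in LLMs have spurred transformative applications across domains, but safety concerns have grown with them, especially jailbreak attacks, which try to break safety alignment by tricking the target LLM into producing harmful and risky responses. Red-teaming proactively explores potential risks and error-prone instances before frontier AI technology is released. The paper proposes an agentic workflow built on the Composition-of-Principles (CoP) framework, in which human users provide a set of red-teaming principles as instructions to an AI agent that automatically orchestrates red-teaming strategies and generates jailbreak prompts. Unlike existing red-teaming methods, CoP offers a unified and extensible framework that encompasses and orchestrates human-provided principles, enabling the automated discovery of new red-teaming strategies. Tested against leading LLMs, CoP reveals unprecedented safety risks by finding novel jailbreak prompts and improving the best-known single-turn attack success rate by up to 19.0 times.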

Key Takeaways

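LLMs have enabled transformative applications in many domains but face safety challenges such as jailbreak attacks.
Red-teaming proactively explores risks before frontier AI technology is released.
The paper proposes an agentic workflow based on the Composition-of-Principles (CoP) framework to automate and scale LLM red-teaming.
In CoP, an AI agent uses human-provided red-teaming principles to orchestrate red-teaming strategies and generate jailbreak prompts.
CoP provides a unified, extensible framework that enables the automated discovery of new red-teaming strategies.
Against leading LLMs, CoP finds novel jailbreak prompts and improves the best-known single-turn attack success rate by up to 19.0 times.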

UI-Evol: Automatic Knowledge Evolving for Computer Use Agents

Authors:Ziyun Zhang, Xinyi Liu, Xiaoyi Zhang, Jun Wang, Gang Chen, Yan Lu

External knowledge has played a crucial role in the recent development of computer use agents. We identify a critical knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution. Our analysis shows even 90% correct knowledge yields only 41% execution success rate. To bridge this gap, we propose UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. UI-Evol consists of two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. We conduct comprehensive experiments on the OSWorld benchmark with the state-of-the-art Agent S2. Our results demonstrate that UI-Evol not only significantly boosts task performance but also addresses a previously overlooked issue of high behavioral standard deviation in computer use agents, leading to superior performance on computer use tasks and substantially improved agent reliability.

Paper and project links

PDF Accepted to ICML 2025 Workshop on Computer Use Agents

Summary

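The paper stresses the role of external knowledge in the development of computer use agents and identifies a critical knowledge-execution gap: even knowledge that is 90% correct yields only a 41% execution success rate. To bridge this gap, it proposes UI-Evol, a plug-and-play module for autonomous GUI knowledge evolution. The module has two stages: a Retrace Stage that extracts faithful objective action sequences from actual agent-environment interactions, and a Critique Stage that refines existing knowledge by comparing these sequences against external references. Experiments on the OSWorld benchmark with Agent S2 show that UI-Evol not only significantly boosts task performance but also addresses the previously overlooked problem of high behavioral standard deviation in computer use agents, substantially improving agent reliability.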

Key Takeaways

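External knowledge has played a crucial role in the development of computer use agents.
There is a knowledge-execution gap: retrieved knowledge often fails to translate into effective real-world task execution.
Even knowledge that is 90% correct yields only a 41% execution success rate.
To bridge this gap, the authors propose UI-Evol, a module with a Retrace Stage and a Critique Stage.
The Retrace Stage extracts faithful objective action sequences from actual agent-environment interactions.
The Critique Stage refines existing knowledge by comparing these sequences against external references.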

Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding

Authors:Xiaoyi Zhang, Zhaoyang Jia, Zongyu Guo, Jiahao Li, Bin Li, Houqiang Li, Yan Lu

Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity and the difficulty of question answering under such extended contexts. While Large Language Models (LLMs) have demonstrated considerable advancements in video analysis capabilities and long context handling, they continue to exhibit limitations when processing information-dense hour-long videos. To overcome such limitations, we propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Unlike previous video agents that rely on predefined workflows applied uniformly across different queries, our approach emphasizes the autonomous and adaptive nature of agents. By providing a set of search-centric tools on multi-granular video database, our DVD agent leverages the advanced reasoning capability of LLM to plan on its current observation state, strategically selects tools to orchestrate adaptive workflow for different queries in light of the gathered information. We perform comprehensive evaluation on multiple long video understanding benchmarks that demonstrates our advantage. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%, which substantially surpasses all prior works, and further improves to 76.0% with transcripts. The code has been released at https://github.com/microsoft/DeepVideoDiscovery.

Paper and project links

PDF Accepted to NeurIPS 2025

Summary

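This paper addresses the challenges of long-form video understanding and proposes the Deep Video Discovery (DVD) agent as a solution. The agent is given a set of search-centric tools over a multi-granular video database; it plans from its current observation state and selects tools adaptively for each query, rather than following a fixed workflow. Evaluations on multiple long-video understanding benchmarks show an advantage for the DVD agent. On the challenging LVBench dataset it reaches 74.2% accuracy, surpassing all prior work, and improves further to 76.0% when transcripts are used.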

Key Takeaways

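Long-form video understanding is hard because of extensive temporal-spatial complexity and the difficulty of question answering over such long contexts.
Large language models still show limitations when processing information-dense, hour-long videos.
The proposed Deep Video Discovery (DVD) agent performs agentic search over segmented video clips, using search-centric tools on a multi-granular video database.
The DVD agent emphasizes autonomy and adaptivity instead of predefined workflows applied uniformly to every query.
The DVD agent plans from its current observation state and selects tools strategically to fit each query.
On multiple long-video understanding benchmarks the DVD agent shows an advantage, reaching 74.2% accuracy on LVBench and 76.0% with transcripts.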
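
The adaptive tool-use loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the tool names (`global_browse`, `clip_search`), the `VIDEO_DB` layout, and the rule-based `plan` function (a stand-in for the LLM planner) are all hypothetical and serve only to show the pattern of choosing the next tool from the observations gathered so far.

```python
# Hypothetical toy "video database" with two granularities:
# a global summary and per-clip captions.
VIDEO_DB = {
    "summary": "A visitor walks through a house and opens several doors.",
    "clips": [
        {"start": 0, "caption": "a man enters the hallway"},
        {"start": 60, "caption": "a woman opens the red door"},
        {"start": 120, "caption": "a dog sleeps on the sofa"},
    ],
}


def global_browse(db, query):
    """Coarse-grained tool: return the video-level summary."""
    return db["summary"]


def clip_search(db, query):
    """Fine-grained tool: return clips whose caption shares >= 2 words with the query."""
    words = set(query.lower().split())
    return [c for c in db["clips"] if len(words & set(c["caption"].split())) >= 2]


TOOLS = {"global_browse": global_browse, "clip_search": clip_search}


def plan(query, observations):
    """Stand-in for the LLM planner (assumption): pick the next action
    based on what has been observed so far."""
    if not observations:
        return "global_browse"
    if len(observations) == 1:
        return "clip_search"
    return "answer"


def run_agent(db, query, max_steps=5):
    """Observe -> plan -> act loop; stops when the planner decides to answer."""
    observations = []
    for _ in range(max_steps):
        action = plan(query, observations)
        if action == "answer":
            break
        observations.append((action, TOOLS[action](db, query)))
    return observations


trace = run_agent(VIDEO_DB, "who opens the red door")
```

In a real agent the `plan` step would be an LLM call that reads the query and the accumulated observations; the loop structure is the part this sketch is meant to show.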

HEXGEN-FLOW: Optimizing LLM Inference Request Scheduling for Agentic Text-to-SQL

Authors:You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, Binhang Yuan

Recent advancements in leveraging the agentic paradigm of large language models (LLMs) have substantially improved Text-to-SQL capabilities, empowering users without specialized database knowledge to intuitively query databases. However, deploying agentic LLM-based Text-to-SQL systems in production presents significant challenges, stemming from their inherently multi-stage computational dependencies, strict latency requirements, and the complexity of deployment across heterogeneous GPUs widely existing in enterprise clusters. Meanwhile, existing LLM serving frameworks are primarily designed for independent inference tasks, resulting in suboptimal performance and frequent service-level objective (SLO) violations in Text-to-SQL workloads. In this paper, we introduce HEXGEN-FLOW, a novel framework designed explicitly to schedule and execute agentic multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters serving multi-tenant Text-to-SQL requests. HEXGEN-FLOW introduces a hierarchical scheduling approach that combines global workload-balanced task dispatching with an adaptive local priority queue, guided by a systematic analysis of agentic Text-to-SQL workflows. Additionally, we propose a lightweight simulation-based method for tuning critical scheduling hyperparameters, further enhancing robustness and adaptability. Our evaluation on realistic Text-to-SQL benchmarks demonstrates that HEXGEN-FLOW significantly outperforms state-of-the-art LLM serving frameworks. Across all traces, HEXGEN-FLOW reduces P95 tail latency by $1.42{\sim}1.56\times$ and increases throughput by $1.49{\sim}1.81\times$, demonstrating robust improvements under diverse workloads. Our code is available at https://github.com/Relaxed-System-Lab/Hexgen-Flow.

Paper and project links

PDF

Summary

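Recent work on the agentic paradigm of large language models (LLMs) has substantially improved Text-to-SQL, letting users query databases intuitively without specialized database knowledge. Deploying agentic LLM-based Text-to-SQL systems in production remains difficult, however, because of multi-stage computational dependencies, strict latency requirements, and the complexity of deploying across the heterogeneous GPUs common in enterprise clusters. To address this, the paper proposes HEXGEN-FLOW, a framework for scheduling and executing multi-stage, LLM-based Text-to-SQL workflows on heterogeneous GPU clusters. HEXGEN-FLOW uses a hierarchical scheduling approach that combines global workload-balanced task dispatching with an adaptive local priority queue, a design guided by a systematic analysis of agentic Text-to-SQL workflows. It also proposes a lightweight simulation-based method for tuning critical scheduling hyperparameters, which improves robustness and adaptability. On realistic Text-to-SQL benchmarks, HEXGEN-FLOW significantly outperforms state-of-the-art LLM serving frameworks, reducing P95 tail latency by 1.42 to 1.56× and increasing throughput by 1.49 to 1.81×.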

Key Takeaways

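The agentic paradigm of LLMs has advanced Text-to-SQL capabilities, letting users query databases intuitively.
Deploying LLM-based Text-to-SQL systems in production is challenging because of multi-stage computational dependencies, strict latency requirements, and the complexity of heterogeneous GPU deployment.
The HEXGEN-FLOW framework is designed to schedule and execute multi-stage LLM-based Text-to-SQL workflows on heterogeneous GPU clusters.
HEXGEN-FLOW uses hierarchical scheduling that combines global workload-balanced dispatching with an adaptive local priority queue.
The scheduling design is guided by a systematic analysis of agentic Text-to-SQL workflows, and a lightweight simulation-based method tunes the scheduling hyperparameters.
On Text-to-SQL benchmarks, HEXGEN-FLOW significantly outperforms existing LLM serving frameworks.
HEXGEN-FLOW reduces P95 tail latency by 1.42 to 1.56× and increases throughput by 1.49 to 1.81×.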
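
The two-level scheduling idea summarized above (a global dispatcher that balances load, plus a priority queue on each worker) can be sketched as follows. This is a minimal illustration under stated assumptions, not HEXGEN-FLOW's actual algorithm: the `Worker` and `Request` classes, the per-GPU `speed` factor, the `cost` estimates, and the use of a single numeric priority (for example, remaining deadline slack) are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    priority: float                      # smaller = more urgent (assumed: deadline slack)
    name: str = field(compare=False)
    cost: float = field(compare=False)   # assumed estimate of work units


class Worker:
    """One GPU instance with its own local priority queue."""

    def __init__(self, name, speed):
        self.name = name
        self.speed = speed    # assumed relative throughput of this GPU
        self.queue = []       # local min-heap ordered by Request.priority
        self.pending = 0.0    # total queued work units

    def est_finish(self, cost):
        """Estimated time to drain the queue if a request of this cost is added."""
        return (self.pending + cost) / self.speed

    def push(self, req):
        heapq.heappush(self.queue, req)
        self.pending += req.cost

    def pop(self):
        """Local scheduling: serve the most urgent queued request first."""
        req = heapq.heappop(self.queue)
        self.pending -= req.cost
        return req


def dispatch(workers, req):
    """Global scheduling: send the request to the worker that would finish it soonest."""
    best = min(workers, key=lambda w: w.est_finish(req.cost))
    best.push(req)
    return best


workers = [Worker("fast-gpu", 2.0), Worker("slow-gpu", 1.0)]
requests = [
    Request(5.0, "schema-linking", 4.0),
    Request(1.0, "sql-generation", 3.0),
    Request(3.0, "self-correction", 2.0),
]
placement = {r.name: dispatch(workers, r).name for r in requests}
first_on_fast = workers[0].pop().name
```

Here the faster worker receives two of the three requests because its estimated finish time accounts for its higher speed, and its local queue then serves the request with the smaller priority value first, regardless of arrival order.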

MARFT: Multi-Agent Reinforcement Fine-Tuning

Authors:Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

LLM-based Multi-Agent Systems have demonstrated remarkable capabilities in addressing complex, agentic tasks, from generating high-quality presentation slides to even conducting sophisticated scientific research. Meanwhile, RL has been widely recognized for its effectiveness in enhancing agent intelligence, but limited research has investigated the fine-tuning of LaMAS using foundational RL techniques. Moreover, the direct application of MARL methods to LaMAS introduces significant challenges, stemming from the unique characteristics and mechanisms inherent to LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes a novel paradigm termed Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce a brand-new MG called Flex-MG, which aligns with the LaMAS optimization in real-world applications and a universal algorithmic framework tailored specifically for LaMAS, outlining the conceptual foundations, key distinctions, and practical implementation strategies. We review the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain. In the context of LaMAS, we elucidate critical differences between MARL and MARFT. These differences motivate a transition toward a LaMAS-oriented formulation of RFT. Central to this work is a robust and scalable MARFT framework. We detail the core algorithm and provide a complete, open-source implementation to facilitate adoption and further research. The latter sections of the paper explore real-world application perspectives and opening challenges in MARFT. By bridging theoretical underpinnings with practical methodologies, this work serves as a roadmap for researchers seeking to advance MARFT toward resilient and adaptive solutions in agentic systems. Our implementation of the proposed framework is publicly available at: https://github.com/jwliao-ai/MARFT.

Paper and project links

PDF 42 pages

Summary

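LLM-based Multi-Agent Systems (LaMAS) have shown remarkable capabilities on complex agentic tasks, from generating high-quality presentation slides to conducting sophisticated scientific research. Reinforcement learning (RL) is widely recognized as an effective way to enhance agent intelligence, but little research has investigated fine-tuning LaMAS with foundational RL techniques. Because directly applying multi-agent reinforcement learning (MARL) methods to LaMAS raises significant challenges, the paper presents a comprehensive study of LLM-based MARL and proposes a new paradigm called Multi-Agent Reinforcement Fine-Tuning (MARFT). It introduces a new MG formulation called Flex-MG, which aligns with LaMAS optimization in real-world applications, together with a universal algorithmic framework tailored to LaMAS. The paper reviews the evolution from RL to reinforcement fine-tuning (RFT), which sets the stage for a parallel analysis in the multi-agent domain, and clarifies the key differences between MARL and MARFT in the LaMAS context; these differences motivate a LaMAS-oriented formulation of RFT. At its core is a robust and scalable MARFT framework: the paper details the core algorithm and provides a complete open-source implementation to facilitate adoption and further research. It closes with real-world application perspectives and open challenges in MARFT, offering a roadmap toward resilient and adaptive solutions in agentic systems.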

Key Takeaways

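LLM-based Multi-Agent Systems (LaMAS) can handle complex tasks such as generating presentation slides and conducting scientific research.
Reinforcement learning (RL) effectively enhances agent intelligence, but research on fine-tuning LaMAS with foundational RL techniques remains limited.
Directly applying multi-agent reinforcement learning (MARL) methods to LaMAS is challenging because of the unique characteristics and mechanisms of LaMAS.
The paper proposes a new paradigm, Multi-Agent Reinforcement Fine-Tuning (MARFT), to address these challenges.
It introduces Flex-MG, a new MG formulation aligned with LaMAS optimization, and a universal algorithmic framework tailored to LaMAS.
The paper reviews the evolution from RL to RFT, setting the stage for a parallel analysis in the multi-agent domain.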

3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark

Authors:Ivan Sviridov, Amina Miftakhova, Artemiy Tereshchenko, Galina Zubkova, Pavel Blinov, Andrey Savchenko

Though Large Vision-Language Models (LVLMs) are being actively explored in medicine, their ability to conduct complex real-world telemedicine consultations combining accurate diagnosis with professional dialogue remains underexplored. This paper presents 3MDBench (Medical Multimodal Multi-agent Dialogue Benchmark), an open-source framework for simulating and evaluating LVLM-driven telemedical consultations. 3MDBench simulates patient variability through temperament-based Patient Agent and evaluates diagnostic accuracy and dialogue quality via Assessor Agent. It includes 2996 cases across 34 diagnoses from real-world telemedicine interactions, combining textual and image-based data. The experimental study compares diagnostic strategies for widely used open and closed-source LVLMs. We demonstrate that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, highlighting the importance of context-aware, information-seeking questioning. Moreover, injecting predictions from a diagnostic convolutional neural network into the LVLM’s context boosts F1 by up to 20%. Source code is available at https://github.com/univanxx/3mdbench.

Paper and project links

PDF EMNLP 25 (main)

Summary

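This paper presents 3MDBench, an open-source framework for simulating and evaluating Large Vision-Language Models (LVLMs) in telemedicine consultations. The benchmark combines textual and image-based data, simulates patient variability through a temperament-based Patient Agent, and evaluates diagnostic accuracy and dialogue quality through an Assessor Agent. It covers 2996 cases across 34 diagnoses drawn from real-world telemedicine interactions. Experiments show that multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings, and that injecting predictions from a diagnostic convolutional neural network into the LVLM's context boosts F1 by up to 20%. The framework provides a new tool for simulating and evaluating complex real-world medical consultations.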

Key Takeaways

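3MDBench is a framework for simulating and evaluating the performance of Large Vision-Language Models in telemedicine consultations.
It combines textual and image-based data and simulates patient variability through a temperament-based Patient Agent.
3MDBench evaluates diagnostic accuracy and dialogue quality through an Assessor Agent.
Multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings.
Injecting predictions from a diagnostic convolutional neural network into the LVLM's context improves F1 by up to 20%.
The framework offers a new simulation and evaluation tool for telemedicine consultations.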
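
For readers unfamiliar with the metric quoted above, the following sketch shows how an F1 score over diagnosis labels can be computed. The digest does not state which averaging the benchmark uses, so macro averaging is an assumption here, and the diagnosis labels and predictions are invented purely for illustration.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 for one diagnosis label: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Hypothetical ground-truth diagnoses and model predictions.
y_true = ["acne", "acne", "eczema", "eczema"]
y_pred = ["acne", "eczema", "eczema", "eczema"]
score = macro_f1(y_true, y_pred)
```

In this toy case the per-label F1 values are 2/3 for "acne" (precision 1, recall 1/2) and 0.8 for "eczema" (precision 2/3, recall 1), so the macro average is about 0.733.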

LLM Strategic Reasoning: Agentic Study through Behavioral Game Theory

Authors:Jingru Jia, Zehua Yuan, Junhao Pan, Paul E. McNamara, Deming Chen

Strategic decision-making involves interactive reasoning where agents adapt their choices in response to others, yet existing evaluations of large language models (LLMs) often emphasize Nash Equilibrium (NE) approximation, overlooking the mechanisms driving their strategic choices. To bridge this gap, we introduce an evaluation framework grounded in behavioral game theory, disentangling reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, we find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games yet also demonstrate that the model scale alone does not determine performance. In terms of prompting enhancement, Chain-of-Thought (CoT) prompting is not universally effective, as it increases strategic reasoning only for models at certain levels while providing limited gains elsewhere. Additionally, we investigate the impact of encoded demographic features on the models, observing that certain assignments impact the decision-making pattern. For instance, GPT-4o shows stronger strategic reasoning with female traits than males, while Gemma assigns higher reasoning levels to heterosexual identities compared to other sexual orientations, indicating inherent biases. These findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.

Paper and project links

PDF Accepted by NeurIPS 2025

Summary

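This paper examines interactive reasoning in strategic decision-making and how large language models (LLMs) are evaluated on it. It argues that existing evaluations focus too heavily on Nash Equilibrium (NE) approximation and overlook the mechanisms that drive strategic choices. In response, it proposes an evaluation framework grounded in behavioral game theory that disentangles reasoning capability from contextual effects. Testing 22 state-of-the-art LLMs, the authors find that GPT-o3-mini, GPT-o1, and DeepSeek-R1 dominate most games, yet model scale alone does not determine performance. Chain-of-Thought (CoT) prompting is not universally effective: it increases strategic reasoning only for models at certain levels and gives limited gains elsewhere. The paper also finds that encoded demographic features affect decision-making patterns, revealing inherent biases in some models. Overall, the findings underscore the need for ethical standards and contextual alignment to balance improved reasoning with fairness.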

Key Takeaways