⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on them for serious academic work; they are only meant as a first-pass screen before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-06
PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts
Authors:Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, Martin Monperrus
Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce POCO, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. POCO autonomously generates PoC exploits in an agentic manner by interacting with a set of code-execution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate POCO on a dataset of 23 real-world vulnerability reports. POCO consistently outperforms the prompting and workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides readily actionable knowledge for the smart contract security community.
Paper and project links
PDF Under review
Summary
Smart contracts operate in an adversarial environment where vulnerabilities can cause major financial losses, so they must undergo security audits. Proof-of-concept (PoC) exploits play an important role in auditing, but creating them by hand is time-consuming, error-prone, and often constrained by tight audit schedules. The paper introduces POCO, a framework that automatically generates executable PoC exploits from auditors' natural-language vulnerability descriptions. POCO works agentically, interacting with a set of code-execution tools in a Reason-Act-Observe loop (a toy sketch of such a loop follows the takeaways below), and produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. Evaluated on a dataset of 23 real-world vulnerability reports, POCO consistently outperforms prompting and workflow baselines, generating well-formed and logically correct PoCs, and shows that agentic frameworks can significantly reduce the effort needed to produce high-quality PoCs in smart contract audits.
Key Takeaways
- Smart contracts operate in an adversarial environment, making security audits essential.
- Proof-of-concept (PoC) exploits play a key role in smart contract audits.
- Creating PoCs by hand is time-consuming and error-prone, motivating automated solutions.
- The POCO framework generates executable PoC exploits agentically.
- POCO interacts with code-execution tools in a Reason-Act-Observe loop to produce fully executable exploits.
- The PoCs POCO generates are compatible with the Foundry testing framework and integrate easily into audit reports and other security tools.
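The abstract names a Reason-Act-Observe loop over code-execution tools but gives no implementation details. Below is a minimal, hypothetical sketch of such a loop in Python: the `llm` callable, the prompt wording, the success check, and the test-file path are all illustrative assumptions; only `forge test` itself is Foundry's real CLI.

```python
import subprocess

def run_foundry_test(test_file: str) -> str:
    """Execute one Foundry test file and return its combined output."""
    result = subprocess.run(
        ["forge", "test", "--match-path", test_file],
        capture_output=True, text=True,
    )
    return result.stdout + result.stderr

def generate_poc(vulnerability_report: str, llm, max_steps: int = 10) -> str | None:
    """Reason-Act-Observe: draft a Foundry PoC, observe the tool output,
    and revise until the exploit test compiles and passes."""
    observation = "No PoC written yet."
    for _ in range(max_steps):
        # Reason + Act: ask the model for the next candidate exploit test.
        poc_code = llm(
            f"Vulnerability report:\n{vulnerability_report}\n"
            f"Last observation:\n{observation}\n"
            "Write a Foundry test (Solidity) demonstrating the exploit."
        )
        with open("test/Exploit.t.sol", "w") as f:  # illustrative path
            f.write(poc_code)
        # Observe: feed the execution result back into the next iteration.
        observation = run_foundry_test("test/Exploit.t.sol")
        if "FAIL" not in observation and "Compiler error" not in observation:
            return poc_code  # naive success check, for illustration only
    return None
```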
Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning
Authors:Bowen Jin, TJ Collins, Donghan Yu, Mert Cemri, Shenao Zhang, Mengyu Li, Jay Tang, Tian Qin, Zhiyang Xu, Jiarui Lu, Guoli Yin, Jiawei Han, Zirui Wang
Large language models (LLMs) exhibit complementary strengths across domains and come with varying inference costs, motivating the design of multi-agent LLM systems where specialized models collaborate efficiently. Existing approaches predominantly rely on decentralized frameworks, which invoke multiple LLMs for every input and thus lead to substantial and uncontrolled inference costs. In this work, we introduce a centralized multi-LLM framework, where a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable manner. We formulate this coordination problem as reinforcement learning with dual objectives: maximizing task performance while minimizing the overall inference cost. In addition, we expect the multi-agent system to adapt its behavior to different budget conditions during inference. To this end, we propose CoRL, a reinforcement learning framework that optimizes the performance-cost trade-off in a controllable multi-budget setting. Experiments on four diverse benchmarks demonstrate that CoRL enables a single system to surpass the best expert LLM under high-budget settings, while maintaining strong performance in more economical low-budget modes, highlighting the effectiveness of centralized coordination for scalable and cost-efficient multi-agent LLM systems.
Paper and project links
PDF 14 pages
Summary
This paper introduces a centralized multi-LLM framework in which a controller LLM selectively coordinates a pool of expert models in a cost-efficient and cost-controllable way. The coordination problem is formulated as reinforcement learning with dual objectives: maximizing task performance while minimizing overall inference cost (one plausible reward shaping is sketched below the takeaways). Experiments on four benchmarks show that the resulting CoRL framework optimizes the performance-cost trade-off across budget settings: a single system surpasses the best expert LLM under high budgets while maintaining strong performance in more economical low-budget modes.
Key Takeaways
- LLMs have complementary strengths across domains but differ in inference cost, motivating multi-agent LLM systems for efficient collaboration.
- Existing approaches rely mostly on decentralized frameworks that invoke multiple LLMs for every input, making inference costs high and uncontrolled.
- A centralized multi-LLM framework is proposed in which a controller LLM selectively coordinates a pool of expert models to control cost.
- Coordination is formulated as reinforcement learning with dual objectives: maximize task performance while minimizing inference cost.
- The CoRL framework optimizes the performance-cost trade-off in a controllable multi-budget setting.
- Experiments show CoRL surpasses the best expert LLM under high budgets and retains strong performance under low budgets.
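As a concrete reading of the dual objective, here is a small, hypothetical sketch of a budget-conditioned reward plus an affordability filter for the controller. Expert names, costs, and weights are invented; the paper's actual reward formulation is not reproduced here.

```python
EXPERT_COSTS = {"small-expert": 1.0, "medium-expert": 3.0, "large-expert": 10.0}  # hypothetical

def affordable_experts(remaining_budget: float) -> list[str]:
    """Experts the controller LLM may still invoke within this episode's budget."""
    return [name for name, cost in EXPERT_COSTS.items() if cost <= remaining_budget]

def dual_objective_reward(task_score: float, total_cost: float, budget: float,
                          cost_weight: float = 0.01, overrun_penalty: float = 1.0) -> float:
    """Reward task performance, charge for inference cost, and penalize
    exceeding the episode budget (one plausible shaping, not CoRL's exact one)."""
    overrun = max(0.0, total_cost - budget)
    return task_score - cost_weight * total_cost - overrun_penalty * overrun
```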
Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs
Authors:Georgios Tzannetos, Parameswaran Kamalaruban, Adish Singla
Training agents to operate under strict constraints during deployment, such as limited resource budgets or stringent safety requirements, presents significant challenges, especially when these constraints render the task complex. In this work, we propose a curriculum learning strategy that gradually tightens constraints during training, enabling the agent to incrementally master the deployment requirements. Inspired by self-paced learning techniques in unconstrained reinforcement learning (RL), our approach facilitates a smoother transition to challenging environments by initially training on simplified versions of the constraints and progressively introducing the full deployment conditions. We provide a theoretical analysis using an RL agent in a binary-tree Markov Decision Process (MDP) to demonstrate that our curriculum strategy can accelerate training relative to a baseline approach that imposes the trajectory constraints from the outset. Moreover, we empirically validate the effectiveness and generality of our method across both RL and large language model (LLM) agents in diverse settings, including a binary-tree MDP, a multi-task navigation domain, and a math reasoning task with two benchmarks. These results highlight the potential of curriculum design in enhancing the efficiency and performance of agents operating under complex trajectory constraints during deployment. Moreover, when applied to LLMs, our strategy enables compression of output chain-of-thought tokens, achieving a substantial inference speedup on consumer hardware, demonstrating its effectiveness for resource-constrained deployment.
Paper and project links
PDF NeurIPS’25 paper
Summary
This paper proposes a curriculum learning strategy that gradually tightens constraints during training, letting the agent incrementally master its deployment requirements. Inspired by self-paced learning in unconstrained reinforcement learning, the method first trains on simplified versions of the constraints and progressively introduces the full deployment conditions, smoothing the transition to challenging environments (a toy constraint schedule is sketched after the takeaways). Theoretical analysis and experiments show the approach accelerates training for both RL and LLM agents across diverse settings and improves the efficiency of agents operating under complex trajectory constraints at deployment.
Key Takeaways
- Proposes a curriculum learning strategy that gradually tightens constraints during training to meet strict deployment-time constraints.
- Draws on self-paced learning from unconstrained RL: train first under simplified constraints, then progressively introduce the full deployment conditions.
- Theoretical analysis shows the curriculum accelerates training of an RL agent in a binary-tree Markov Decision Process relative to imposing trajectory constraints from the outset.
- Empirically effective in a multi-task navigation domain and on a math reasoning task with two benchmarks.
- Applied to LLMs, the strategy compresses output chain-of-thought tokens, yielding a substantial inference speedup on consumer hardware.
- The strategy improves agents' efficiency and performance in resource-constrained deployments.
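For the LLM case, "gradually tightening constraints" could be as simple as a pacing function that shrinks the chain-of-thought token budget over training. The linear shape and budget values below are assumptions for illustration, not the paper's actual schedule:

```python
def token_budget(step: int, total_steps: int,
                 initial_budget: int = 4096, final_budget: int = 512) -> int:
    """Linearly anneal the allowed chain-of-thought tokens from a loose
    training budget down to the strict deployment budget."""
    frac = min(1.0, step / total_steps)
    return round(initial_budget + frac * (final_budget - initial_budget))

# Early training is nearly unconstrained; the deployment constraint holds at the end.
assert token_budget(0, 10_000) == 4096
assert token_budget(10_000, 10_000) == 512
```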
Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems
Authors:Elias Lumer, Faheem Nizar, Anmol Gulati, Pradeep Honaganahalli Basavaraju, Vamse Kumar Subbiah
Recent advances in LLM Multi-Agent Systems enable scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. However, existing retrieval methods typically match queries against coarse agent-level descriptions before routing, which obscures fine-grained tool functionality and often results in suboptimal agent selection. We introduce Tool-to-Agent Retrieval, a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships. By explicitly representing tool capabilities and traversing metadata to the agent level, Tool-to-Agent Retrieval enables granular tool-level or agent-level retrieval, ensuring that agents and their underlying tools or MCP servers are equally represented without the context dilution that arises from chunking many tools together. Evaluating Tool-to-Agent Retrieval across eight embedding models, our approach achieves consistent improvements of 19.4% in Recall@5 and 17.7% in nDCG@5 over previous state-of-the-art agent retrievers on the LiveMCPBench benchmark.
Paper and project links
Summary
Recent advances in LLM multi-agent systems allow scalable orchestration of sub-agents, each coordinating hundreds or thousands of tools or Model Context Protocol (MCP) servers. Existing retrieval methods, however, match queries against coarse agent-level descriptions before routing, obscuring fine-grained tool functionality and often producing suboptimal agent selection. Tool-to-Agent Retrieval addresses this with a unified framework that embeds both tools and their parent agents in a shared vector space and connects them through metadata relationships (a toy version of this lookup is sketched below the takeaways). By explicitly representing tool capabilities and traversing metadata up to the agent level, it supports granular tool-level or agent-level retrieval and avoids the context dilution that arises from chunking many tools together. On the LiveMCPBench benchmark, the method improves Recall@5 by 19.4% and nDCG@5 by 17.7% over prior state-of-the-art agent retrievers.
Key Takeaways
- LLM multi-agent systems can orchestrate many sub-agents, each coordinating large numbers of tools and MCP servers.
- Existing retrieval methods match queries against coarse agent-level descriptions and miss fine-grained tool functionality.
- Tool-to-Agent Retrieval embeds tools and their parent agents in a shared vector space connected through metadata.
- Explicitly representing tool capabilities and traversing metadata to the agent level enables granular tool-level and agent-level retrieval.
- Agents and their underlying tools or MCP servers get equal representation, avoiding the context dilution caused by chunking many tools together.
- On LiveMCPBench, the method delivers consistent gains in both Recall@5 and nDCG@5 across eight embedding models.
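A toy version of the lookup, assuming a pre-built index of tool embeddings plus tool-to-agent metadata (all names and dimensions invented): tools are scored against the query first, and agents are ranked by traversing metadata from their best-matching tools rather than by a coarse agent description.

```python
import numpy as np

# Hypothetical index: one embedding per tool, plus tool -> parent-agent metadata.
tool_names = ["create_issue", "merge_pr", "query_db"]
tool_vecs = np.random.rand(3, 384)                     # stand-ins for real embeddings
tool_to_agent = {"create_issue": "github_agent",
                 "merge_pr": "github_agent",
                 "query_db": "database_agent"}

def retrieve_agents(query_vec: np.ndarray, k: int = 5) -> list[str]:
    """Rank tools by cosine similarity, then map the top-k tools to their
    parent agents, deduplicating while preserving rank order."""
    sims = tool_vecs @ query_vec / (
        np.linalg.norm(tool_vecs, axis=1) * np.linalg.norm(query_vec))
    ranked_tools = [tool_names[i] for i in np.argsort(-sims)]
    agents: list[str] = []
    for tool in ranked_tools[:k]:
        agent = tool_to_agent[tool]
        if agent not in agents:
            agents.append(agent)
    return agents

print(retrieve_agents(np.random.rand(384)))  # e.g. ['github_agent', 'database_agent']
```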
From Pixels to Cooperation: Multi-Agent Reinforcement Learning based on Multimodal World Models
Authors:Sureyya Akin, Kavita Srivastava, Prateek B. Kapoor, Pradeep G. Sethi, Sunita Q. Patel, Rahu Srivastava
Learning cooperative multi-agent policies directly from high-dimensional, multimodal sensory inputs like pixels and audio is notoriously sample-inefficient. Model-free Multi-Agent Reinforcement Learning (MARL) algorithms struggle with the joint challenge of representation learning, partial observability, and credit assignment. To address this, we propose a novel framework based on a shared, generative Multimodal World Model (MWM). Our MWM is trained to learn a compressed latent representation of the environment’s dynamics by fusing distributed, multimodal observations from all agents using a scalable attention-based mechanism. Subsequently, we leverage this learned MWM as a fast, “imagined” simulator to train cooperative MARL policies (e.g., MAPPO) entirely within its latent space, decoupling representation learning from policy learning. We introduce a new set of challenging multimodal, multi-agent benchmarks built on a 3D physics simulator. Our experiments demonstrate that our MWM-MARL framework achieves orders-of-magnitude greater sample efficiency compared to state-of-the-art model-free MARL baselines. We further show that our proposed multimodal fusion is essential for task success in environments with sensory asymmetry and that our architecture provides superior robustness to sensor-dropout, a critical feature for real-world deployment.
Paper and project links
Summary
This paper proposes a multi-agent reinforcement learning framework built on a shared, generative Multimodal World Model (MWM) for learning cooperative policies directly from high-dimensional multimodal sensory inputs such as pixels and audio. The MWM fuses distributed multimodal observations from all agents through a scalable attention-based mechanism (a toy fusion layer is sketched after the takeaways) to learn a compressed latent representation of the environment's dynamics. The learned MWM then acts as a fast "imagined" simulator in which cooperative MARL policies such as MAPPO are trained entirely in latent space, decoupling representation learning from policy learning. Experiments show orders-of-magnitude better sample efficiency than state-of-the-art model-free MARL baselines; the multimodal fusion proves essential in environments with sensory asymmetry, and the architecture is notably robust to sensor dropout, a key property for real-world deployment.
Key Takeaways
- Proposes a multi-agent reinforcement learning framework based on a shared, generative Multimodal World Model (MWM).
- The MWM fuses multimodal observations through an attention mechanism to learn a compressed latent representation of the environment.
- The MWM serves as an "imagined" simulator in which cooperative MARL policies are trained entirely in latent space.
- The MWM-MARL framework achieves far better sample efficiency than existing model-free MARL baselines.
- Multimodal fusion is essential for task success in environments with sensory asymmetry.
- The architecture is robust to sensor dropout, a critical feature for real-world deployment.
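The abstract mentions a scalable attention-based fusion of the agents' multimodal observations. Below is a toy PyTorch sketch of one such fusion layer, a learned query attending over per-agent modality embeddings; the dimensions, head count, and single-query design are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultimodalFusion(nn.Module):
    """Fuse per-agent modality embeddings into one latent via attention."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned fusion query

    def forward(self, modality_tokens: torch.Tensor) -> torch.Tensor:
        # modality_tokens: (batch, n_agents * n_modalities, dim)
        q = self.query.expand(modality_tokens.size(0), -1, -1)
        fused, _ = self.attn(q, modality_tokens, modality_tokens)
        return fused.squeeze(1)  # (batch, dim) latent for the world model

# Example: 2 agents x 2 modalities (pixels, audio), each already embedded to 128-d.
tokens = torch.randn(8, 4, 128)
latent = MultimodalFusion()(tokens)  # -> shape (8, 128)
```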
Deterministic Legal Agents: A Canonical Primitive API for Auditable Reasoning over Temporal Knowledge Graphs
Authors:Hudson de Martim
For autonomous legal agents to operate safely in high-stakes domains, they require a foundation of absolute determinism and auditability: guarantees that standard Retrieval-Augmented Generation (RAG) frameworks cannot provide. When interacting with temporal knowledge graphs that model the complex evolution of legal norms, agents must navigate versioning, causality, and hierarchical structures with precision, a task for which black-box vector search is ill-suited. This paper introduces a new architectural pattern to solve this: a formal Primitive API designed as a secure execution layer for reasoning over such graphs. Instead of a monolithic query engine, our framework provides a library of canonical primitives: atomic, composable, and auditable. This design empowers planner-guided agents to decompose complex legal questions into transparent execution plans, enabling critical tasks with full verifiability, including: (i) precise point-in-time version retrieval, (ii) robust causal lineage tracing, and (iii) context-aware hybrid search. Ultimately, this architecture transforms opaque retrieval into auditable reasoning, turning the agent’s internal process from a black box into a verifiable log of deterministic primitives and providing a blueprint for building the next generation of trustworthy legal AI.
Paper and project links
PDF Major revision reframing the paper from an API spec to a novel architectural pattern for deterministic agents. The core contribution is now positioned as a blueprint for auditable reasoning, essential for building trustworthy legal AI systems
Summary
Autonomous legal agents operating in high-stakes domains need determinism and auditability as foundational guarantees, which standard Retrieval-Augmented Generation (RAG) frameworks cannot provide, particularly when reasoning over temporal knowledge graphs that model the evolution of legal norms. The paper introduces a new architectural pattern: a formal Primitive API that serves as a secure execution layer for reasoning over such graphs. Instead of a monolithic query engine, the framework offers a library of canonical primitives that are atomic, composable, and auditable (a toy example appears after the takeaways). Planner-guided agents can decompose complex legal questions into transparent execution plans, enabling fully verifiable tasks such as precise point-in-time version retrieval, robust causal lineage tracing, and context-aware hybrid search. The architecture turns opaque retrieval into auditable reasoning, converting the agent's internal process from a black box into a verifiable log of deterministic primitives and providing a blueprint for the next generation of trustworthy legal AI.
Key Takeaways
- Autonomous legal agents in high-stakes domains require absolute determinism and auditability as foundational guarantees.
- Standard RAG frameworks cannot meet these requirements when agents reason over temporal knowledge graphs of evolving legal norms.
- The new architectural pattern provides a formal Primitive API as a secure execution layer for reasoning over such graphs.
- A library of atomic, composable, auditable primitives lets planner-guided agents decompose complex legal questions into transparent execution plans.
- The architecture supports precise point-in-time version retrieval, robust causal lineage tracing, and context-aware hybrid search.
- Opaque retrieval becomes auditable reasoning, making the agent's operations transparent and verifiable.
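To illustrate what an atomic, auditable primitive might look like, here is a toy sketch with an invented data model: a point-in-time version-retrieval primitive over a tiny temporal graph, where every call is appended to an audit log. The paper's actual API and graph schema are not reproduced here.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TemporalNormGraph:
    """Toy temporal knowledge graph: norm id -> [(valid_from, text)] versions,
    plus an append-only audit log of every primitive invocation."""
    versions: dict[str, list[tuple[date, str]]]
    audit_log: list[str] = field(default_factory=list)

    def version_at(self, norm_id: str, when: date) -> str:
        """Canonical primitive: deterministic point-in-time version retrieval."""
        self.audit_log.append(f"version_at({norm_id!r}, {when.isoformat()})")
        in_force = [(d, text) for d, text in self.versions[norm_id] if d <= when]
        return max(in_force)[1]  # latest version already in force on `when`

graph = TemporalNormGraph({
    "art-5": [(date(2010, 1, 1), "original text"),
              (date(2020, 6, 1), "amended text")],
})
print(graph.version_at("art-5", date(2015, 3, 1)))  # -> "original text"
print(graph.audit_log)  # verifiable trace of deterministic primitive calls
```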
Towards Stable and Personalised Profiles for Lexical Alignment in Spoken Human-Agent Dialogue
Authors:Keara Schaaij, Roel Boumans, Tibor Bosse, Iris Hendrickx
Lexical alignment, where speakers start to use similar words across conversation, is known to contribute to successful communication. However, its implementation in conversational agents remains underexplored, particularly considering the recent advancements in large language models (LLMs). As a first step towards enabling lexical alignment in human-agent dialogue, this study draws on strategies for personalising conversational agents and investigates the construction of stable, personalised lexical profiles as a basis for lexical alignment. Specifically, we varied the amounts of transcribed spoken data used for construction as well as the number of items included in the profiles per part-of-speech (POS) category and evaluated profile performance across time using recall, coverage, and cosine similarity metrics. It was shown that smaller and more compact profiles, created after 10 min of transcribed speech containing 5 items for adjectives, 5 items for conjunctions, and 10 items for adverbs, nouns, pronouns, and verbs each, offered the best balance in both performance and data efficiency. In conclusion, this study offers practical insights into constructing stable, personalised lexical profiles, taking into account minimal data requirements, serving as a foundational step toward lexical alignment strategies in conversational agents.
Paper and project links
PDF This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in TSD 2025. Lecture Notes in Computer Science, vol 16029
Summary
The study examines lexical alignment, in which speakers come to use similar words over a conversation, and takes a first step toward implementing it in conversational agents. Drawing on strategies for personalising conversational agents, it constructs stable, personalised lexical profiles as a basis for alignment, varying the amount of transcribed spoken data used and the number of items per part-of-speech category, and evaluating profiles over time with recall, coverage, and cosine similarity. Smaller, more compact profiles built from 10 minutes of transcribed speech, containing 5 adjectives, 5 conjunctions, and 10 items each for adverbs, nouns, pronouns, and verbs, offered the best balance of performance and data efficiency (a toy construction is sketched after the takeaways), providing practical guidance for lexical alignment under minimal data requirements.
Key Takeaways
- Lexical alignment contributes to successful communication, yet its implementation in conversational agents remains underexplored.
- The study applies personalisation strategies to build stable, personalised lexical profiles as a foundation for lexical alignment.
- Experiments show that smaller, more compact profiles built from limited transcribed speech perform best.
- The study emphasizes minimal data requirements when constructing profiles.
- Performance is tuned by varying the amount of transcribed spoken data and the number of items per part-of-speech category.
- Recall, coverage, and cosine similarity metrics are used to evaluate profile performance over time.
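A small sketch of profile construction under the best-performing configuration reported above (5 adjectives, 5 conjunctions, 10 each for adverbs, nouns, pronouns, and verbs). The Universal Dependencies tag names and the frequency-based item selection are assumptions for illustration; POS tagging is assumed to happen upstream.

```python
from collections import Counter

# Profile sizes from the best-performing configuration in the abstract.
ITEMS_PER_POS = {"ADJ": 5, "CCONJ": 5, "ADV": 10, "NOUN": 10, "PRON": 10, "VERB": 10}

def build_lexical_profile(tagged_tokens: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Build a personalised lexical profile from (word, POS) pairs drawn from
    ~10 minutes of transcribed speech, keeping the most frequent items per POS."""
    counts: dict[str, Counter] = {pos: Counter() for pos in ITEMS_PER_POS}
    for word, pos in tagged_tokens:
        if pos in counts:
            counts[pos][word.lower()] += 1
    return {pos: [w for w, _ in counts[pos].most_common(k)]
            for pos, k in ITEMS_PER_POS.items()}

tokens = [("really", "ADV"), ("nice", "ADJ"), ("nice", "ADJ"), ("I", "PRON")]
print(build_lexical_profile(tokens)["ADJ"])  # -> ['nice']
```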
AI Research Agents for Machine Learning: Search, Exploration, and Generalization in MLE-bench
Authors:Edan Toledo, Karen Hambardzumyan, Martin Josifoski, Rishi Hazra, Nicolas Baldwin, Alexis Audran-Reiss, Michael Kuchnik, Despoina Magka, Minqi Jiang, Alisia Maria Lupidi, Andrei Lupu, Roberta Raileanu, Kelvin Niu, Tatiana Shavrina, Jean-Christophe Gagnon-Audet, Michael Shvartsman, Shagun Sodhani, Alexander H. Miller, Abhishek Charnalia, Derek Dunfield, Carole-Jean Wu, Pontus Stenetorp, Nicola Cancedda, Jakob Nicolaus Foerster, Yoram Bachrach
AI research agents are demonstrating great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. We focus on methods for improving agents’ performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. We formalize AI research agents as search policies that navigate a space of candidate solutions, iteratively modifying them using operators. By designing and systematically varying different operator sets and search policies (Greedy, MCTS, Evolutionary), we show that their interplay is critical for achieving high performance. Our best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite, increasing the success rate of achieving a Kaggle medal from 39.6% to 47.7%. Our investigation underscores the importance of jointly considering the search strategy, operator design, and evaluation methodology in advancing automated machine learning.
Paper and project links
PDF Code: https://github.com/facebookresearch/aira-dojo
Summary
AI research agents show great potential to accelerate scientific progress by automating the design, implementation, and training of machine learning models. The paper focuses on improving agent performance on MLE-bench, a challenging benchmark where agents compete in Kaggle competitions to solve real-world machine learning problems. Agents are formalized as search policies that navigate a space of candidate solutions, iteratively modifying them with operators (the simplest such policy is sketched after the takeaways). By designing and systematically varying operator sets and search policies (greedy, MCTS, evolutionary), the authors show that their interplay is critical for high performance: the best pairing achieves a state-of-the-art result on MLE-bench lite, raising the Kaggle-medal success rate from 39.6% to 47.7%. The work underscores the importance of jointly considering search strategy, operator design, and evaluation methodology in advancing automated machine learning.
Key Takeaways
- AI research agents can automate machine learning development and accelerate scientific progress.
- MLE-bench evaluates agents on real-world machine learning problems drawn from Kaggle competitions.
- The choice of search policy and operator set strongly affects agent performance, and their interplay is critical.
- Search strategy, operator design, and evaluation methodology must be considered jointly to advance automated machine learning.
- The best pairing of search strategy and operator set achieves a state-of-the-art result on MLE-bench lite.
- The improved agents raise the success rate of earning a Kaggle medal from 39.6% to 47.7%.
- The paper offers actionable guidance for designing and deploying AI research agents.
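Of the three search policies studied, greedy is the simplest to sketch. Below is a toy version in which candidate solutions are modified by operator functions and a child is kept only when its score improves; the operators and evaluation function are invented stand-ins for the paper's ML-engineering operators.

```python
import random

def greedy_search(initial_solution, operators, evaluate, budget: int = 50):
    """Greedy search policy: apply a random operator to the incumbent and
    keep the child only if it evaluates better."""
    best, best_score = initial_solution, evaluate(initial_solution)
    for _ in range(budget):
        candidate = random.choice(operators)(best)
        score = evaluate(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score

# Toy instance: "solutions" are numbers, operators nudge them, optimum is x = 3.
ops = [lambda x: x + 0.5, lambda x: x - 0.5, lambda x: x * 1.1]
print(greedy_search(0.0, ops, lambda x: -abs(x - 3)))
```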
SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Authors:Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use continuous supply of fresh tasks collected using SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare results of various LLMs on this benchmark to results on SWE-bench Verified and show that performance of some language models might be inflated due to contamination issues.
Paper and project links
PDF Dataset: https://huggingface.co/datasets/nebius/SWE-rebench, SWE-rebench leaderboard https://swe-rebench.com NeurIPS 2025
Summary
LLM-based agents show promise on software engineering (SWE) tasks but face two obstacles: a scarcity of training data reflecting real-world interactive SWE scenarios, and static benchmarks that quickly become outdated due to contamination. The authors introduce an automated, scalable pipeline that continuously extracts real interactive SWE tasks from diverse GitHub repositories and use it to build SWE-rebench, a public dataset of over 21,000 interactive Python-based SWE tasks suited to large-scale reinforcement learning of SWE agents. A continuous supply of fresh tasks collected with the same methodology yields a contamination-free benchmark (the date-filtering intuition is sketched after the takeaways); comparing LLMs on it against SWE-bench Verified suggests that some models' reported performance is inflated by contamination.
Key Takeaways
- LLM-based agents show strong potential on software engineering tasks.
- High-quality training data reflecting real-world interactive scenarios is a key bottleneck.
- An automated, scalable pipeline extracts real interactive SWE tasks from GitHub repositories to build the SWE-rebench dataset.
- SWE-rebench contains over 21,000 interactive Python tasks suitable for large-scale reinforcement learning of SWE agents.
- A continuous supply of fresh tasks supports a contamination-free benchmark for evaluating agents.
- Existing static benchmarks may inflate some language models' apparent performance due to contamination.
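One way to see the decontamination idea: only evaluate a model on tasks created after its training cutoff, so the fix cannot have been memorized. The sketch below is a simplification with invented cutoff dates; the actual SWE-rebench pipeline and benchmark methodology are considerably more involved.

```python
from datetime import date

# Hypothetical training cutoffs; real cutoffs vary by model and provider.
MODEL_CUTOFFS = {"model-a": date(2024, 10, 1), "model-b": date(2025, 3, 1)}

def decontaminated_tasks(tasks: list[dict], model: str) -> list[dict]:
    """Keep only tasks whose source issue/PR was created after the model's
    training cutoff, so they cannot appear in its training data."""
    cutoff = MODEL_CUTOFFS[model]
    return [t for t in tasks if t["created_at"] > cutoff]

tasks = [{"repo": "x/y", "created_at": date(2025, 5, 2)},
         {"repo": "a/b", "created_at": date(2024, 1, 9)}]
print(decontaminated_tasks(tasks, "model-a"))  # only the 2025 task survives
```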
Program Synthesis Dialog Agents for Interactive Decision-Making
Authors:Matthew Toles, Nikhil Balwani, Rattandeep Singh, Valentina Giulia Sartori Rodriguez, Zhou Yu
Many real-world eligibility problems, ranging from medical diagnosis to tax planning, can be mapped to decision problems expressed in natural language, wherein a model must make a binary choice based on user features. Large-scale domains such as legal codes or frequently updated funding opportunities render human annotation (e.g., web forms or decision trees) impractical, highlighting the need for agents that can automatically assist in decision-making. Since relevant information is often only known to the user, it is crucial that these agents ask the right questions. As agents determine when to terminate a conversation, they face a trade-off between accuracy and the number of questions asked, a key metric for both user experience and cost. To evaluate this task, we propose BeNYfits, a new benchmark for determining user eligibility for multiple overlapping social benefits opportunities through interactive decision-making. Our experiments show that current language models struggle with frequent hallucinations, with GPT-4o scoring only 35.7 F1 using a ReAct-style chain-of-thought. To address this, we introduce ProADA, a novel approach that leverages program synthesis to assist in decision-making by mapping dialog planning to a code generation problem and using gaps in structured data to determine the best next action. Our agent, ProADA, improves the F1 score to 55.6 while maintaining nearly the same number of dialog turns.
Paper and project links
Summary
Many real-world eligibility problems, from medical diagnosis to tax planning, can be framed as natural-language decision problems in which a model makes a binary choice based on user features. In large or frequently updated domains such as legal codes and funding opportunities, human annotation (web forms, decision trees) is impractical, so agents that assist in decision-making automatically are needed. Because relevant information is often known only to the user, these agents must ask the right questions, and in deciding when to end a conversation they trade off accuracy against the number of questions asked, a key metric for both user experience and cost. The paper proposes BeNYfits, a benchmark for determining user eligibility for multiple overlapping social-benefit opportunities through interactive decision-making. Current language models hallucinate frequently on this task: GPT-4o scores only 35.7 F1 with a ReAct-style chain of thought. The proposed ProADA maps dialog planning to a code generation problem and uses gaps in structured data to choose the best next action (a toy version is sketched after the takeaways), raising the F1 score to 55.6 while keeping nearly the same number of dialog turns.
Key Takeaways
- Eligibility decisions in large, frequently updated domains call for agents that can assist in decision-making automatically.
- Such agents must ask the right questions and decide when to end the conversation.
- Agents face a trade-off between accuracy and the number of questions asked.
- The BeNYfits benchmark evaluates user eligibility for multiple overlapping social-benefit opportunities via interactive decision-making.
- Current language models hallucinate frequently on this task; GPT-4o reaches only 35.7 F1.
- ProADA maps dialog planning to code generation and raises F1 to 55.6 with nearly the same number of dialog turns.
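To illustrate mapping dialog planning onto code: a synthesized eligibility function returns None while required fields are missing, and those gaps determine the next question to ask. The eligibility rule, field names, and questions below are invented; in ProADA such programs are generated by the model rather than hard-coded.

```python
def eligibility(features: dict) -> bool | None:
    """Synthesized eligibility check (hypothetical rule). Returns None while
    required inputs are still missing from the structured data."""
    if features.get("income") is None or features.get("household_size") is None:
        return None
    return features["income"] < 30_000 * features["household_size"]

def next_question(features: dict) -> str | None:
    """Dialog planning via data gaps: ask about the first missing field."""
    questions = [("income", "What is your annual income?"),
                 ("household_size", "How many people live in your household?")]
    for field, question in questions:
        if features.get(field) is None:
            return question
    return None  # all required inputs known; terminate the dialog

user = {"income": 25_000, "household_size": None}
print(next_question(user))   # -> asks about household size
user["household_size"] = 3
print(eligibility(user))     # -> True
```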