
Agent


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use these for serious academic purposes; they are only for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-26

VDC-Agent: When Video Detailed Captioners Evolve Themselves via Agentic Self-Reflection

Authors:Qiang Wang, Xinyuan Gao, SongLin Dong, Jizhou Han, Jiangyang Li, Yuhang He, Yihong Gong

We present VDC-Agent, a self-evolving framework for Video Detailed Captioning that requires neither human annotations nor larger teacher models. The agent forms a closed loop of caption generation, principle-guided scoring (score and textual suggestions), and prompt refinement. When caption quality regresses, a self-reflection path leverages the previous chain-of-thought to amend the update. Running this process on unlabeled videos produces trajectories of (caption, score) pairs. We convert the trajectories into preference tuples and filter out samples with JSON parsing errors, resulting in VDC-Agent-19K, which contains 18,886 automatically constructed pairs. We then fine-tune the base MLLM on this dataset using an easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, our VDC-Agent-7B attains state-of-the-art performance on the VDC benchmark with 49.08% average accuracy and 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.
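
A rough illustration of the trajectory-to-preference step: the sketch below turns one (caption, score) trajectory into DPO-style preference tuples and orders them from easy to hard. The `min_gap` threshold and the use of the score gap as a difficulty proxy are our assumptions for illustration, not details taken from the paper.

```python
from itertools import combinations

def trajectory_to_preferences(trajectory, min_gap=0.5):
    """Turn one (caption, score) trajectory from the agent loop into
    preference tuples; near-ties below `min_gap` are dropped as noise."""
    pairs = []
    for (cap_a, s_a), (cap_b, s_b) in combinations(trajectory, 2):
        gap = abs(s_a - s_b)
        if gap < min_gap:
            continue
        chosen, rejected = (cap_a, cap_b) if s_a > s_b else (cap_b, cap_a)
        pairs.append({"chosen": chosen, "rejected": rejected, "gap": gap})
    return pairs

def easy_to_hard(pairs):
    """Curriculum ordering: large score gaps (easy wins) come first."""
    return sorted(pairs, key=lambda p: -p["gap"])

# Toy trajectory: captions improve over refinement rounds.
traj = [("a man walks", 1.2),
        ("a man walks a dog in a park", 2.1),
        ("a man in a red coat walks a beagle along a snowy path", 3.0)]
for p in easy_to_hard(trajectory_to_preferences(traj)):
    print(f"gap={p['gap']:.1f}  chosen={p['chosen'][:30]!r}")
```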


Paper and Project Links

PDF

Summary
VDC-Agent is a self-evolving framework for detailed video captioning that requires neither human annotations nor larger teacher models. It forms a closed loop of caption generation, principle-guided scoring, and prompt refinement. When caption quality regresses, a self-reflection path uses the previous chain-of-thought to amend the update. Running this process on unlabeled videos yields trajectories of (caption, score) pairs, which are converted into preference tuples; samples with JSON parsing errors are filtered out, producing the VDC-Agent-19K dataset. The base MLLM is then fine-tuned on this dataset with easy-to-hard curriculum direct preference optimization. Built on Qwen2.5-VL-7B-Instruct, VDC-Agent-7B reaches state-of-the-art results on the VDC benchmark with 49.08% average accuracy and a 2.50 score, surpassing specialized video captioners and improving over the base model by +5.13% accuracy and +0.27 score at similar inference cost.

Key Takeaways

  1. VDC-Agent is a self-evolving framework for detailed video captioning that requires neither human annotations nor larger teacher models.
  2. VDC-Agent forms a closed loop of caption generation, principle-guided scoring, and prompt refinement.
  3. When caption quality regresses, VDC-Agent amends the update through a self-reflection path.
  4. Running the loop on unlabeled videos produces trajectories of (caption, score) pairs.
  5. Trajectories are converted into preference tuples, and samples with JSON parsing errors are filtered out, yielding the VDC-Agent-19K dataset.
  6. VDC-Agent-7B performs strongly on the VDC benchmark, reaching state-of-the-art average accuracy and score.

Cool Papers

Click here to view paper screenshots

Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

Authors:Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai

Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
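
The OR aspect of the task can be made concrete with a toy single-agent scheduler: each subtask has an active phase that occupies the agent and a passive phase that runs on its own (the microwave cooking). The sketch below is a minimal model of our own devising, not GRANT's scheduling-token mechanism; it shows how front-loading long passive phases shortens total completion time.

```python
def total_completion_time(order):
    """One agent; each task = (name, active, passive). The agent is busy
    for `active` time units, then the task finishes on its own after
    `passive` more (e.g., the microwave running unattended)."""
    t, finishes = 0.0, []
    for _name, active, passive in order:
        t += active                   # agent works on tasks sequentially
        finishes.append(t + passive)  # passive part completes in background
    return max(finishes)

tasks = [("microwave dinner", 1, 5), ("clean sink", 3, 0), ("dishwasher", 1, 8)]
print(total_completion_time(tasks))                               # 13.0
# OR heuristic: start long passive phases first so they overlap active work.
print(total_completion_time(sorted(tasks, key=lambda x: -x[2])))  # 9.0
```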


Paper and Project Links

PDF Accepted to AAAI 2026 (Oral). The code is available at \url{https://github.com/H-EmbodVis/GRANT}

Summary

This work highlights the importance of task scheduling for embodied AI and proposes Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D). Unlike prior task-scheduling settings, ORS3D requires embodied agents to minimize total completion time by exploiting parallelizable subtasks, such as cleaning the sink while the microwave runs. To support the task, the authors build the large-scale ORS3D-60K dataset and GRANT, an embodied multimodal large language model with a scheduling-token mechanism. Experiments confirm GRANT's effectiveness in language understanding, 3D grounding, and scheduling efficiency.

Key Takeaways

  1. Task scheduling is critical for embodied AI, enabling agents to follow natural-language instructions and act efficiently in 3D physical worlds.
  2. Existing datasets overlook operations research (OR) knowledge and 3D spatial grounding.
  3. A new task, OR knowledge-based 3D Grounded Task Scheduling (ORS3D), is proposed, combining language understanding, 3D grounding, and efficiency optimization.
  4. ORS3D requires minimizing total completion time and emphasizes exploiting parallelizable subtasks.
  5. To advance ORS3D research, the ORS3D-60K dataset was built, comprising 60K composite tasks across 4K real-world scenes.
  6. GRANT, an embodied multimodal large language model with a simple yet effective scheduling-token mechanism, is developed.

Cool Papers

Click here to view paper screenshots

LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

Authors:Tianyang Duan, Zongyuan Zhang, Zheng Lin, Songxiao Guo, Xiuxian Guan, Guangyu Wu, Zihan Fang, Haotian Meng, Xia Du, Ji-Zhe Zhou, Heming Cui, Jun Luo, Yue Gao

Multi-agent reinforcement learning (MARL) has been increasingly adopted in many real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, it suffers from severe non-stationarity due to the synchronous updates of agent policies. This non-stationarity results in unstable training and poor policy convergence, especially as the number of agents increases. In this paper, we propose RELED, a scalable MARL framework that integrates large language model (LLM)-driven expert demonstrations with autonomous agent exploration. RELED incorporates a Stationarity-Aware Expert Demonstration module, which leverages theoretical non-stationarity bounds to enhance the quality of LLM-generated expert trajectories, thus providing high reward and training-stable samples for each agent. Moreover, a Hybrid Expert-Agent Policy Optimization module adaptively balances each agent’s learning from both expert-generated and agent-generated trajectories, accelerating policy convergence and improving generalization. Extensive experiments with real city networks based on OpenStreetMap demonstrate that RELED achieves superior performance compared to state-of-the-art MARL methods.
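
A minimal sketch of the hybrid expert-agent idea: training batches mix trajectories from an LLM-expert buffer and an agent buffer, and the mixing fraction shifts toward whichever source currently yields higher return. The return-driven update rule and its constants are illustrative assumptions; the abstract does not specify the actual balancing scheme.

```python
import random

def hybrid_batch(expert_buf, agent_buf, expert_frac, batch_size=8):
    """Sample a training batch mixing expert and agent trajectories."""
    n_exp = min(round(batch_size * expert_frac), len(expert_buf))
    batch = random.sample(expert_buf, n_exp)
    batch += random.sample(agent_buf, min(batch_size - n_exp, len(agent_buf)))
    return batch

def update_expert_frac(frac, expert_return, agent_return, step=0.05):
    """Adaptively lean on whichever trajectory source pays off more."""
    frac += step if expert_return > agent_return else -step
    return min(0.9, max(0.1, frac))

expert_buf = [f"expert_traj_{i}" for i in range(20)]
agent_buf = [f"agent_traj_{i}" for i in range(20)]
frac = update_expert_frac(0.5, expert_return=12.0, agent_return=9.5)
print(frac, hybrid_batch(expert_buf, agent_buf, frac)[:3])
```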


Paper and Project Links

PDF 15 pages, 9 figures

Summary

Multi-agent reinforcement learning (MARL) is increasingly adopted in real-world applications. While MARL enables decentralized deployment on resource-constrained edge devices, the synchronous updates of agent policies cause severe non-stationarity, leading to unstable training and poor policy convergence, especially as the number of agents grows. This paper proposes RELED, a scalable MARL framework that combines large language model (LLM)-driven expert demonstrations with autonomous agent exploration. Its Stationarity-Aware Expert Demonstration module uses theoretical non-stationarity bounds to improve the quality of LLM-generated expert trajectories, providing each agent with high-reward, training-stable samples, while a Hybrid Expert-Agent Policy Optimization module adaptively balances learning from expert-generated and agent-generated trajectories, accelerating convergence and improving generalization. Extensive experiments on real city networks built from OpenStreetMap show that RELED outperforms state-of-the-art MARL methods.

Key Takeaways

  1. MARL suffers from non-stationarity, which harms training and policy convergence.
  2. RELED combines LLM-driven expert demonstrations with autonomous agent exploration.
  3. RELED improves expert-trajectory quality through a Stationarity-Aware Expert Demonstration module.
  4. RELED balances learning and generalization via a Hybrid Expert-Agent Policy Optimization module.
  5. RELED outperforms other state-of-the-art MARL methods in experiments on real city networks.
  6. The RELED framework is scalable and can accommodate varying numbers and types of agents.

Cool Papers

Click here to view paper screenshots

Black-Box Lifting and Robustness Theorems for Multi-Agent Contracts

Authors:Paul Dütting, Tomer Ezra, Michal Feldman, Thomas Kesselheim

Multi-agent contract design has largely evaluated contracts through the lens of pure Nash equilibria (PNE). This focus, however, is not without loss: In general, the principal can strictly gain by recommending a complex, possibly correlated, distribution over actions, while preserving incentive compatibility. In this work, we extend the analysis of multi-agent contracts beyond pure Nash equilibria to encompass more general equilibrium notions, including mixed Nash equilibria as well as (coarse-)correlated equilibria (CCE). The latter, in particular, captures the limiting outcome of agents engaged in learning dynamics. Our main result shows that for submodular and, more generally, XOS rewards, such complex recommendations yield at most a constant-factor gain: there exists a contract and a PNE whose utility is within a constant factor of the best CCE achievable by any contract. This provides a black-box lifting: results established against the best PNE automatically apply with respect to the best CCE, with only a constant factor loss. For submodular rewards, we further show how to transform a contract and a PNE of that contract into a new contract such that any of its CCEs gives a constant approximation to the PNE. This yields black-box robustness: up to constant factors, guarantees established for a specific contract and PNE automatically extend to the modified contract and any of its CCEs. We thus expand prior guarantees for multi-agent contracts and lower the barrier to new ones. As an important corollary, we obtain poly-time algorithms for submodular rewards that achieve constant approximations in any CCE, against the best CCE under the best contract. Such worst-case guarantees are provably unattainable for XOS rewards. Finally, we bound the gap between different equilibrium notions for subadditive, supermodular, and general rewards.
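
Schematically, the black-box lifting result can be written as follows, where u_P is the principal's utility and c = O(1) for XOS rewards; this is our paraphrase of the abstract, not the paper's formal theorem statement.

```latex
\exists\ \text{contract } \alpha \text{ and PNE } s \text{ of } \alpha:
\quad
u_P(\alpha, s) \;\ge\; \frac{1}{c}\,
\sup_{\alpha'} \; \sup_{\sigma \in \mathrm{CCE}(\alpha')} u_P(\alpha', \sigma).
```

Read this way, any approximation guarantee proved against the best PNE transfers to the best CCE at the cost of the constant c.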


Paper and Project Links

PDF

Summary
This paper extends the analysis of multi-agent contracts beyond pure Nash equilibria (PNE) to more general equilibrium notions, including mixed Nash equilibria and (coarse-)correlated equilibria (CCE). For submodular and, more generally, XOS rewards, complex recommendations yield at most a constant-factor gain, which provides a black-box lifting: results established against the best PNE automatically apply to the best CCE with only a constant-factor loss. For submodular rewards, a contract and one of its PNEs can further be transformed into a new contract whose every CCE gives a constant approximation to that PNE, yielding black-box robustness. This expands prior guarantees for multi-agent contracts and lowers the barrier to new ones. As a corollary, the authors obtain poly-time algorithms for submodular rewards that achieve constant approximations in any CCE; such worst-case guarantees are provably unattainable for XOS rewards. Finally, the paper bounds the gap between different equilibrium notions for subadditive, supermodular, and general rewards.

Key Takeaways

  1. Multi-agent contract design has mostly been analyzed through pure Nash equilibria, a focus that is not without loss.
  2. Recommending a complex, possibly correlated, distribution over actions can strictly benefit the principal while preserving incentive compatibility.
  3. For submodular and XOS rewards, such complex recommendations gain at most a constant factor: some contract and PNE come within a constant factor of the best CCE under any contract.
  4. A black-box lifting lets results established against the best PNE carry over to the best CCE.
  5. For submodular rewards, a contract and its PNE can be transformed into a new contract whose every CCE preserves the guarantee up to a constant, giving greater flexibility in multi-agent contract design.
  6. For submodular rewards, poly-time algorithms achieve constant approximations in any CCE.

Cool Papers

Click here to view paper screenshots

AutoEnv: Automated Environments for Measuring Cross-Environment Agent Learning

Authors:Jiayi Zhang, Yiran Peng, Fanqi Kong, Yang Cheng, Yifan Wu, Zhaoyang Yu, Jinyu Xiang, Jianhao Ruan, Jinlin Wang, Maojia Song, HongZhang Liu, Xiangru Tang, Bang Liu, Chenglin Wu, Yuyu Luo

Humans naturally adapt to diverse environments by learning underlying rules across worlds with different dynamics, observations, and reward structures. In contrast, existing agents typically demonstrate improvements via self-evolving within a single domain, implicitly assuming a fixed environment distribution. Cross-environment learning has remained largely unmeasured: there is no standard collection of controllable, heterogeneous environments, nor a unified way to represent how agents learn. We address these gaps in two steps. First, we propose AutoEnv, an automated framework that treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost (4.12 USD on average) generation of heterogeneous worlds. Using AutoEnv, we construct AutoEnv-36, a dataset of 36 environments with 358 validated levels, on which seven language models achieve 12-49% normalized reward, demonstrating the challenge of AutoEnv-36. Second, we formalize agent learning as a component-centric process driven by three stages of Selection, Optimization, and Evaluation applied to an improvable agent component. Using this formulation, we design eight learning methods and evaluate them on AutoEnv-36. Empirically, the gain of any single learning method quickly decreases as the number of environments increases, revealing that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods substantially improves performance but exhibits diminishing returns as the method space expands. These results highlight both the necessity and the current limitations of agent learning for scalable cross-environment generalization, and position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning. The code is available at https://github.com/FoundationAgents/AutoEnv.
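
The "environments as factorizable distributions" view suggests a spec whose transition, observation, and reward components can be swapped independently to produce heterogeneous worlds. The sketch below is a minimal reading of that idea; `EnvSpec` and its fields are hypothetical names, not AutoEnv's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EnvSpec:
    """An environment factored into independently swappable parts."""
    transition: Callable[[Any, Any], Any]     # (state, action) -> state'
    observe: Callable[[Any], Any]             # state -> observation
    reward: Callable[[Any, Any, Any], float]  # (s, a, s') -> reward

def step(spec: EnvSpec, state, action):
    nxt = spec.transition(state, action)
    return spec.observe(nxt), spec.reward(state, action, nxt), nxt

# Two heterogeneous worlds sharing a dynamics rule but differing in
# observation coarseness and reward structure.
walk = lambda s, a: s + a
world_a = EnvSpec(walk, lambda s: s, lambda s, a, n: float(n == 3))
world_b = EnvSpec(walk, lambda s: s // 2, lambda s, a, n: -abs(3 - n))
print(step(world_a, 2, 1))   # (3, 1.0, 3)
print(step(world_b, 2, 1))   # (1, 0, 3)
```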


Paper and Project Links

PDF

Summary

Humans adapt to diverse environments by learning the underlying rules across worlds, whereas existing agents typically self-evolve within a single domain, implicitly assuming a fixed environment distribution. The paper introduces the AutoEnv framework and the AutoEnv-36 dataset to address cross-environment learning. AutoEnv treats environments as factorizable distributions over transitions, observations, and rewards, enabling low-cost generation of heterogeneous worlds. Experiments on AutoEnv-36 show that the gain of any single learning method decreases quickly as the number of environments grows, indicating that fixed learning methods do not scale across heterogeneous environments. Environment-adaptive selection of learning methods helps substantially but exhibits diminishing returns as the method space expands. The authors position AutoEnv and AutoEnv-36 as a testbed for studying cross-environment agent learning.

Key Takeaways

  1. Humans adapt naturally to diverse environments by learning the underlying rules of worlds with different dynamics, observations, and reward structures, whereas existing agents typically self-evolve within a single domain.

Cool Papers

Click here to view paper screenshots

Exponential Consensus through Z-Control in High-Order Multi-Agent Systems

Authors:Angela Monti, Fasma Diele

In this work, we introduce a Z-control strategy for multi-agent systems of arbitrary order, aimed at driving the agents toward consensus in the highest-order observable state. The proposed framework supports both direct and indirect control schemes, making it applicable in scenarios where high-order derivatives such as acceleration cannot be directly manipulated. Theoretical analysis ensures exponential convergence while preserving the average dynamics, and a hierarchy of control laws is derived accordingly. Numerical experiments up to third-order models, including opinion dynamics and Cucker-Smale flocking systems, demonstrate the robustness and flexibility of Z-control under varying interaction regimes and control intensities.


Paper and Project Links

PDF

Summary

This work introduces a Z-control strategy for multi-agent systems of arbitrary order, aimed at driving agents toward consensus in the highest-order observable state. The framework supports both direct and indirect control schemes, making it applicable when high-order derivatives such as acceleration cannot be manipulated directly. Theoretical analysis guarantees exponential convergence while preserving the average dynamics, and a hierarchy of control laws is derived accordingly. Numerical experiments on models up to third order, including opinion dynamics and Cucker-Smale flocking systems, demonstrate the robustness and flexibility of Z-control under varying interaction regimes and control intensities.

Key Takeaways

  • A Z-control strategy for multi-agent systems is proposed to drive agents toward consensus in the highest-order observable state.
  • Both direct and indirect control schemes are supported, covering cases where high-order derivatives cannot be manipulated directly.
  • Theoretical analysis guarantees exponential convergence while preserving the average dynamics.
  • A hierarchy of control laws is derived.
  • Numerical experiments demonstrate the robustness and flexibility of Z-control under varying interaction regimes and control intensities.

Cool Papers

Click here to view paper screenshots

LLM-Based Agentic Negotiation for 6G: Addressing Uncertainty Neglect and Tail-Event Risk

Authors:Hatim Chergui, Farhad Rezazadeh, Mehdi Bennis, Merouane Debbah

A critical barrier to the trustworthiness of sixth-generation (6G) agentic autonomous networks is the uncertainty neglect bias; a cognitive tendency for large language model (LLM)-powered agents to make high-stakes decisions based on simple averages while ignoring the tail risk of extreme events. This paper proposes an unbiased, risk-aware framework for agentic negotiation, designed to ensure robust resource allocation in 6G network slicing. Specifically, agents leverage Digital Twins (DTs) to predict full latency distributions, which are then evaluated using a formal framework from extreme value theory, namely, Conditional Value-at-Risk (CVaR). This approach fundamentally shifts the agent’s objective from reasoning over the mean to reasoning over the tail, thereby building a statistically-grounded buffer against worst-case outcomes. Furthermore, our framework ensures full uncertainty awareness by requiring agents to quantify epistemic uncertainty – confidence in their own DTs predictions – and propagate this meta-verification to make robust decisions, preventing them from acting on unreliable data. We validate this framework in a 6G inter-slice negotiation use-case between an eMBB and a URLLC agent. The results demonstrate the profound failure of the biased, mean-based baseline, which consistently fails its SLAs with a 25% rate. Our unbiased, CVaR-aware agent successfully mitigates this bias, eliminating SLA violations and reducing the URLLC and eMBB p99.999 latencies by around 11%. We show this reliability comes at the rational and quantifiable cost of slightly reduced energy savings to 17%, exposing the false economy of the biased approach. This work provides a concrete methodology for building the trustworthy autonomous systems required for 6G.
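
The shift "from the mean to the tail" reduces to scoring a predicted latency distribution by its Conditional Value-at-Risk rather than its average. A minimal empirical sketch, with a toy lognormal sample standing in for the Digital Twin's predicted distribution:

```python
import numpy as np

def cvar(samples, alpha=0.999):
    """CVaR_alpha: the mean of the worst (1 - alpha) tail of the samples."""
    x = np.asarray(samples)
    var = np.quantile(x, alpha)   # Value-at-Risk at level alpha
    return x[x >= var].mean()     # expected latency given a tail event

rng = np.random.default_rng(0)
latency = rng.lognormal(mean=0.0, sigma=0.6, size=100_000)  # heavy tail
print(f"mean     = {latency.mean():.2f} ms")   # what a biased agent sees
print(f"CVaR99.9 = {cvar(latency):.2f} ms")    # what an SLA actually risks
```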


Paper and Project Links

PDF Link to open-source non-commercial code available

Summary

Cool Papers

Click here to view paper screenshots

VIL2C: Value-of-Information Aware Low-Latency Communication for Multi-Agent Reinforcement Learning

Authors:Qian Zhang, Zhuo Sun, Yao Zhang, Zhiwen Yu, Bin Guo, Jun Zhang

Inter-agent communication serves as an effective mechanism for enhancing performance in collaborative multi-agent reinforcement learning (MARL) systems. However, the inherent communication latency in practical systems induces both action decision delays and outdated information sharing, impeding MARL performance gains, particularly in time-critical applications like autonomous driving. In this work, we propose a Value-of-Information aware Low-latency Communication (VIL2C) scheme that proactively adjusts the latency distribution to mitigate its effects in MARL systems. Specifically, we define a Value of Information (VoI) metric to quantify the importance of each delayed message’s transmission. Moreover, we propose a progressive message reception mechanism to adaptively adjust the reception duration based on received messages. We derive the optimized VoI-aware resource allocation and theoretically prove the performance advantage of the proposed VIL2C scheme. Extensive experiments demonstrate that VIL2C outperforms existing approaches under various communication conditions. These gains are attributed to the low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting periods via adaptive reception duration.
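
A minimal sketch of the two mechanisms under a deliberately simplified model: scarce low-latency slots go to the highest-VoI messages, and the receiver stops waiting once enough VoI has arrived. The halved-latency slots and the VoI budget are illustrative assumptions, not the paper's formulation.

```python
def allocate_slots(messages, fast_slots=1):
    """messages: list of (sender, voi, latency). Low-latency slots are
    assumed to halve latency and go to the highest-VoI messages."""
    ranked = sorted(messages, key=lambda m: -m[1])
    return [(s, v, lat / 2 if i < fast_slots else lat)
            for i, (s, v, lat) in enumerate(ranked)]

def reception_window(plan, voi_target):
    """Progressive reception: close the window once enough VoI arrived."""
    got = 0.0
    for _s, voi, lat in sorted(plan, key=lambda m: m[2]):  # arrival order
        got += voi
        if got >= voi_target:
            return lat            # no need to wait for stragglers
    return max(m[2] for m in plan)

msgs = [("car_A", 0.9, 40.0), ("car_B", 0.2, 10.0), ("car_C", 0.3, 25.0)]
plan = allocate_slots(msgs, fast_slots=1)
print(plan)                          # car_A now arrives at 20.0 ms
print(reception_window(plan, 1.1))   # window closes at 20.0, not 25.0
```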


Paper and Project Links

PDF

Summary

Inter-agent communication is an effective mechanism for improving performance in collaborative multi-agent reinforcement learning (MARL) systems. However, the communication latency inherent in practical systems causes delayed action decisions and outdated information sharing, hindering MARL performance gains, particularly in time-critical applications such as autonomous driving. This paper proposes a Value-of-Information aware Low-latency Communication (VIL2C) scheme that proactively adjusts the latency distribution to mitigate these effects. A Value of Information (VoI) metric quantifies the importance of each delayed message, and a progressive message reception mechanism adaptively adjusts the reception duration based on the messages received. Experiments show that VIL2C outperforms existing approaches under various communication conditions, thanks to low-latency transmission of high-VoI messages via resource allocation and the elimination of unnecessary waiting via adaptive reception duration.

Key Takeaways

  1. In collaborative MARL systems, inter-agent communication is key to improving performance.
  2. Communication latency delays action decisions and makes shared information outdated, degrading MARL performance.
  3. The proposed VIL2C scheme proactively adjusts the latency distribution to optimize system performance.
  4. A Value of Information (VoI) metric quantifies the importance of delayed messages.
  5. A progressive message reception mechanism adaptively adjusts the reception duration.
  6. VIL2C achieves low-latency transmission of high-VoI messages through resource allocation.

Cool Papers

Click here to view paper screenshots

A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis

Authors:Wenxuan Mu, Jinzhong Ning, Di Zhao, Yijia Zhang

In-context learning (ICL) with large language models (LLMs) has emerged as a promising paradigm for named entity recognition (NER) in low-resource scenarios. However, existing ICL-based NER methods suffer from three key limitations: (1) reliance on dynamic retrieval of annotated examples, which is problematic when annotated data is scarce; (2) limited generalization to unseen domains due to the LLM’s insufficient internal domain knowledge; and (3) failure to incorporate external knowledge or resolve entity ambiguities. To address these challenges, we propose KDR-Agent, a novel multi-agent framework for multi-domain low-resource in-context NER that integrates Knowledge retrieval, Disambiguation, and Reflective analysis. KDR-Agent leverages natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora. A central planner coordinates specialized agents to (i) retrieve factual knowledge from Wikipedia for domain-specific mentions, (ii) resolve ambiguous entities via contextualized reasoning, and (iii) reflect on and correct model predictions through structured self-assessment. Experiments across ten datasets from five domains demonstrate that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines across multiple LLM backbones. The code and data can be found at https://github.com/MWXGOD/KDR-Agent.
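
A skeleton of the planner-coordinated flow, with plain callables standing in for the LLM-backed agents; all names, the confidence gate, and the toy stubs below are hypothetical, not the paper's interfaces.

```python
def kdr_pipeline(sentence, tag, retrieve, disambiguate, reflect):
    """Planner loop: tag entities, enrich low-confidence mentions with
    retrieved knowledge, then run a reflective self-check."""
    spans = tag(sentence)                          # initial NER guess
    for span in spans:
        if span["confident"]:
            continue
        facts = retrieve(span["text"])             # e.g. a Wikipedia lookup
        span["label"] = disambiguate(span, facts)  # contextual reasoning
    return reflect(sentence, spans)                # structured self-assessment

# Toy stubs standing in for the LLM-backed agents.
tag = lambda s: [{"text": "Java", "label": "LANG", "confident": False}]
retrieve = lambda t: ["Java is an island of Indonesia", "Java is a language"]
disamb = lambda sp, facts: "LOC" if "island" in facts[0] else "LANG"
reflect = lambda s, spans: spans
print(kdr_pipeline("Java has many volcanoes", tag, retrieve, disamb, reflect))
```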


Paper and Project Links

PDF This paper has been accepted by AAAI 2026 (Main Technical Track)

Summary

In-context learning (ICL) with large language models (LLMs) shows great promise for named entity recognition (NER), especially in low-resource scenarios, but existing ICL-based NER methods have three key limitations. To address them, the authors propose KDR-Agent, a novel multi-agent framework integrating knowledge retrieval, disambiguation, and reflective analysis. Experiments show that KDR-Agent significantly outperforms existing zero-shot and few-shot ICL baselines in multi-domain low-resource in-context NER.

Key Takeaways

  • ICL with LLMs shows great promise for NER in low-resource scenarios.
  • Existing ICL-based NER methods have three key limitations: reliance on dynamically retrieved annotated examples, limited generalization to unseen domains, and failure to incorporate external knowledge or resolve entity ambiguities.
  • KDR-Agent is a multi-agent framework integrating knowledge retrieval, disambiguation, and reflective analysis.
  • KDR-Agent uses natural-language type definitions and a static set of entity-level contrastive demonstrations to reduce dependency on large annotated corpora.
  • A central planner coordinates specialized agents to retrieve factual knowledge, resolve ambiguous entities, and reflect on and correct model predictions.

Cool Papers

Click here to view paper screenshots

Multi-Agent Monocular Dense SLAM With 3D Reconstruction Priors

Authors:Haihang Wu, Yuchen Zhou

Monocular Simultaneous Localization and Mapping (SLAM) aims to estimate a robot’s pose while simultaneously reconstructing an unknown 3D scene using a single camera. While existing monocular SLAM systems generate detailed 3D geometry through dense scene representations, they are computationally expensive due to the need for iterative optimization. To address this challenge, MASt3R-SLAM utilizes learned 3D reconstruction priors, enabling more efficient and accurate estimation of both 3D structures and camera poses. However, MASt3R-SLAM is limited to single-agent operation. In this paper, we extend MASt3R-SLAM to introduce the first multi-agent monocular dense SLAM system. Each agent performs local SLAM using a 3D reconstruction prior, and their individual maps are fused into a globally consistent map through a loop-closure-based map fusion mechanism. Our approach improves computational efficiency compared to state-of-the-art methods, while maintaining similar mapping accuracy when evaluated on real-world datasets.


Paper and Project Links

PDF

Summary

Monocular Simultaneous Localization and Mapping (SLAM) estimates a robot's pose while reconstructing an unknown 3D scene with a single camera. Existing monocular SLAM systems produce detailed 3D geometry through dense scene representations but are computationally expensive because of iterative optimization. MASt3R-SLAM addresses this with learned 3D reconstruction priors, enabling more efficient and accurate estimation of 3D structure and camera poses, but it is limited to single-agent operation. This paper extends MASt3R-SLAM into the first multi-agent monocular dense SLAM system: each agent performs local SLAM with a 3D reconstruction prior, and the individual maps are fused into a globally consistent map via a loop-closure-based map fusion mechanism. The approach improves computational efficiency over state-of-the-art methods while maintaining comparable mapping accuracy on real-world datasets.

Key Takeaways

  1. Monocular SLAM uses a single camera to estimate robot pose and reconstruct an unknown 3D scene simultaneously.
  2. Existing monocular SLAM systems are computationally expensive because they rely on iterative optimization.
  3. MASt3R-SLAM uses learned 3D reconstruction priors to improve estimation efficiency and accuracy.
  4. MASt3R-SLAM is limited to single-agent operation.
  5. This paper extends MASt3R-SLAM into a multi-agent monocular dense SLAM system.
  6. Each agent performs local SLAM using a 3D reconstruction prior.

Cool Papers

Click here to view paper screenshots

MAGMA-Edu: Multi-Agent Generative Multimodal Framework for Text-Diagram Educational Question Generation

Authors:Zhenyu Wu, Jian Li, Hua Huang

Educational illustrations play a central role in communicating abstract concepts, yet current multimodal large language models (MLLMs) remain limited in producing pedagogically coherent and semantically consistent educational visuals. We introduce MAGMA-Edu, a self-reflective multi-agent framework that unifies textual reasoning and diagrammatic synthesis for structured educational problem generation. Unlike existing methods that treat text and image generation independently, MAGMA-Edu employs a two-stage co-evolutionary pipeline: (1) a generation-verification-reflection loop that iteratively refines question statements and solutions for mathematical accuracy, and (2) a code-based intermediate representation that enforces geometric fidelity and semantic alignment during image rendering. Both stages are guided by internal self-reflection modules that evaluate and revise outputs until domain-specific pedagogical constraints are met. Extensive experiments on multimodal educational benchmarks demonstrate the superiority of MAGMA-Edu over state-of-the-art MLLMs. Compared to GPT-4o, MAGMA-Edu improves the average textual metric from 57.01 to 92.31 (+35.3 pp) and boosts image-text consistency (ITC) from 13.20 to 85.24 (+72 pp). Across all model backbones, MAGMA-Edu achieves the highest scores (Avg-Text 96.20, ITC 99.12), establishing a new state of the art for multimodal educational content generation and demonstrating the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.
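
Stage one's generation-verification-reflection loop is, structurally, a verified-retry pattern. A minimal sketch with stubbed callables; the round budget and stopping rule are our assumptions.

```python
def gen_verify_reflect(draft, verify, reflect, max_rounds=3):
    """Iterate until the verifier accepts or the round budget runs out.
    draft() -> candidate; verify(c) -> (ok, critique);
    reflect(c, critique) -> revised candidate."""
    cand = draft()
    for _ in range(max_rounds):
        ok, critique = verify(cand)
        if ok:
            return cand
        cand = reflect(cand, critique)   # revise using the critique
    return cand

# Toy stubs: the drafted question must state a correct answer.
draft = lambda: "What is 2 + 2? Answer: 5"
verify = lambda c: (c.endswith("4"), "answer should be 4")
reflect = lambda c, critique: c.rsplit(" ", 1)[0] + " 4"
print(gen_verify_reflect(draft, verify, reflect))  # ... Answer: 4
```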


Paper and Project Links

PDF

Summary

The text introduces MAGMA-Edu, a framework for educational content generation that unifies textual reasoning and diagram synthesis to produce structured educational problems. MAGMA-Edu uses a self-reflective, two-stage co-evolutionary pipeline to improve the accuracy and consistency of generated text and images while meeting domain-specific pedagogical constraints. Compared with existing MLLMs, MAGMA-Edu shows clear superiority, with large gains in the average textual metric and image-text consistency. Overall, MAGMA-Edu sets a new state of the art for multimodal educational content generation and demonstrates the effectiveness of self-reflective multi-agent collaboration in pedagogically aligned vision-language reasoning.

Key Takeaways

Cool Papers

Click here to view paper screenshots

DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

Authors:Xiaochuan Liu, Yuanfeng Song, Xiaoming Yin, Xing Chen

In today’s data-driven era, fully automated end-to-end data analytics, particularly insight discovery, is critical for discovering actionable insights that assist organizations in making effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have emerged as a promising paradigm for automating data analysis and insight discovery. However, existing data insight agents remain limited in several key aspects, often failing to deliver satisfactory results due to: (1) insufficient utilization of domain knowledge, (2) shallow analytical depth, and (3) error-prone code generation during insight generation. To address these issues, we propose DataSage, a novel multi-agent framework that incorporates three innovative features including external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analytical depth, and multi-path reasoning to improve the accuracy of the generated code and insights. Extensive experiments on InsightBench demonstrate that DataSage consistently outperforms existing data insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.
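
Multi-path reasoning can be approximated by executing several independently generated snippets and keeping the modal answer, so error-prone generations simply drop out. A toy sketch; the majority vote is our assumption, and `eval` stands in for sandboxed code execution.

```python
from collections import Counter

def multi_path_insight(candidates, run):
    """Execute several independently generated analysis snippets and keep
    the modal answer; failed paths are discarded."""
    results = []
    for code in candidates:
        try:
            results.append(run(code))
        except Exception:
            continue               # error-prone generations drop out
    if not results:
        raise RuntimeError("all reasoning paths failed")
    return Counter(results).most_common(1)[0][0]

paths = ["40+2", "6*7", "41+0", "1/0"]               # toy "generated code"
print(multi_path_insight(paths, lambda c: eval(c)))  # -> 42
```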


Paper and Project Links

PDF

Summary

In the data-driven era, fully automated data analytics and insight discovery are critical for organizations to make effective decisions. With the rapid advancement of large language models (LLMs), LLM-driven agents have become a promising paradigm for automating data analysis and insight discovery. However, existing data-insight agents fall short in several respects: insufficient use of domain knowledge, shallow analytical depth, and error-prone code generation. DataSage, a multi-agent framework, addresses these issues with three innovations: external knowledge retrieval to enrich the analytical context, a multi-role debating mechanism to simulate diverse analytical perspectives and deepen analysis, and multi-path reasoning to improve the accuracy of generated code and insights. Extensive experiments on InsightBench show that DataSage consistently outperforms existing data-insight agents across all difficulty levels, offering an effective solution for automated data insight discovery.

Key Takeaways

  1. Fully automated data-analytics insight discovery is critical for organizational decision-making.
  2. LLM-driven agents have become a popular paradigm for automating data analysis.
  3. Existing data-insight agents underuse domain knowledge, lack analytical depth, and generate error-prone code.
  4. DataSage addresses these issues with external knowledge retrieval, a multi-role debating mechanism, and multi-path reasoning.
  5. DataSage performs strongly on InsightBench, outperforming other data-insight agents.
  6. The DataSage framework effectively improves the accuracy and depth of generated insights.

Cool Papers

Click here to view paper screenshots

Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?

Authors:Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 77.4% without test-time scaling, outperforming all existing software agents, including the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.


Paper and Project Links

PDF

Summary

Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, many LLM agents have been proposed to solve real-world software problems; they are typically equipped with coding tools and autonomously decide their next actions to complete end-to-end software tasks. While promising, they usually require dedicated design and may still be suboptimal, since exhausting the entire agent scaffold design space is extremely challenging and costly. Recognizing that software agents are themselves software that can be further refined, researchers have recently proposed self-improving software agents such as the Darwin-Gödel Machine (DGM), but these require costly offline training on specific benchmarks and may not generalize across LLMs or benchmarks. This paper proposes Live-SWE-agent, the first live software agent that autonomously and continuously evolves itself on the fly at runtime while solving real-world software problems. Starting from the most basic scaffold with access only to bash tools (e.g., mini-SWE-agent), it evolves its own scaffold implementation as it works. On the widely studied SWE-bench Verified benchmark, Live-SWE-agent achieves an impressive 77.4% solve rate without test-time scaling, outperforming all existing software agents, including the best proprietary solution. It also outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, reaching the best-known solve rate of 45.8%.

Key Takeaways

  1. Large Language Models (LLMs) are reshaping software engineering.
  2. Software agents can autonomously make decisions and solve end-to-end software tasks.
  3. Existing software agents usually require dedicated design, and exploring the scaffold design space is challenging and costly.
  4. There is a recent trend toward self-improving software agents, such as the Darwin-Gödel Machine (DGM).
  5. Self-improving agents require costly offline training on specific benchmarks, and their generalization is limited.
  6. Live-SWE-agent is the first software agent that evolves itself on the fly while solving real-world software problems.

Cool Papers

Click here to view paper screenshots

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization

Authors:Kelun Lei, Hailong Yang, Huaitao Zhang, Xin You, Kaige Zhang, Zhongzhi Luan, Yi Liu, Depei Qian

Designing high-performance kernels requires expert-level tuning and a deep understanding of hardware characteristics. Recent advances in large language models (LLMs) have enabled automated kernel generation, yet most existing systems rely solely on correctness or execution time feedback, lacking the ability to reason about low-level performance bottlenecks. In this paper, we introduce PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop. PRAGMA enables LLMs to identify performance bottlenecks, preserve historical best versions, and iteratively refine code quality. We evaluate PRAGMA on KernelBench, covering GPU and CPU backends. Results show that PRAGMA consistently outperforms baseline AIKG without profiling enabled and achieves 2.81× and 2.30× averaged speedups against Torch on CPU and GPU platforms, respectively.
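
The profile-in-the-loop idea reduced to a skeleton: each round, the generator sees the previous profiling report, and the best kernel version so far is preserved. Everything below is an illustrative stub (an integer stands in for a kernel, and a distance function for the profiler), not PRAGMA's actual interfaces.

```python
import random

def optimize_kernel(generate, profile, rounds=20):
    """Profile-guided search: feed the last report back to the generator
    and keep the historical best version."""
    best, best_ms, report = None, float("inf"), ""
    for _ in range(rounds):
        kernel = generate(feedback=report, best=best)
        ms, report = profile(kernel)       # runtime + bottleneck summary
        if ms < best_ms:
            best, best_ms = kernel, ms     # preserve historical best
    return best, best_ms

random.seed(0)
gen = lambda feedback, best: (best or 0) + random.choice([-1, 1])
prof = lambda k: (abs(k - 5) + 1.0, f"stall score {abs(k - 5)}")
print(optimize_kernel(gen, prof))   # converges toward kernel 5 at 1.0 ms
```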


Paper and Project Links

PDF

Summary

The paper proposes PRAGMA, a profile-guided AI kernel generation framework that integrates execution feedback and fine-grained hardware profiling into the reasoning loop to identify performance bottlenecks and iteratively refine kernel designs. Evaluated on KernelBench, PRAGMA consistently outperforms the baseline AI kernel generation method without profiling and achieves average speedups of 2.81× and 2.30× over Torch on CPU and GPU platforms, respectively.

Key Takeaways

  1. PRAGMA combines execution feedback with fine-grained hardware profiling data to optimize kernel design.
  2. The framework identifies performance bottlenecks and preserves the historical best kernel versions.
  3. PRAGMA improves kernel performance by iteratively refining code quality.
  4. In the KernelBench evaluation, PRAGMA shows clear advantages over the baseline AI kernel generation method.
  5. The PRAGMA framework supports both GPU and CPU backends.
  6. On CPU and GPU platforms, PRAGMA achieves average speedups of 2.81× and 2.30×, respectively.

Cool Papers

Click here to view paper screenshots

LLM Agents for Automated Dependency Upgrades

Authors:Vali Tawosi, Salwa Alamir, Xiaomo Liu, Manuela Veloso

As a codebase expands over time, its library dependencies can become outdated and require updates to maintain innovation and security. However, updating a library can introduce breaking changes in the code, necessitating significant developer time for maintenance. To address this, we introduce a framework of LLM agents to be used in combination with migration documentation to automatically recommend and apply code updates and ensure compatibility with new versions. Our solution can automatically localize updated library usages in live Java codebases and implement recommended fixes in a user-friendly manner. The system architecture consists of multiple key components: a Summary Agent, Control Agent, and Code Agent. To validate our approach, we apply the framework on an industrial use case by which we create three synthetic code repositories with major Upgrade changes and benchmark our approach against state-of-the-art methods. Results show that our approach not only performs upgrades using fewer tokens across all cases but also achieves a precision of 71.4%, highlighting its efficiency and effectiveness compared to state-of-the-art methods.


Paper and Project Links

PDF Accepted to AISM Workshop at ASE 2025

Summary
As a codebase grows, its library dependencies can become outdated and require updates to maintain innovation and security. To address the breaking changes that library updates can introduce, the authors present a framework of LLM agents that, combined with migration documentation, automatically recommends and applies code updates and ensures compatibility with new versions. The solution automatically localizes updated library usages in live Java codebases and implements the recommended fixes in a user-friendly manner. The system architecture consists of several key components: a Summary Agent, a Control Agent, and a Code Agent. Applied to an industrial use case with three synthetic code repositories containing major upgrade changes and benchmarked against state-of-the-art methods, the approach performs upgrades using fewer tokens in all cases and achieves a precision of 71.4%, demonstrating its efficiency and effectiveness.

Key Takeaways

  1. As codebases expand, library dependencies can become outdated and need updates to maintain innovation and security.
  2. Updating libraries can introduce breaking changes, demanding substantial developer time for maintenance.
  3. A framework of LLM agents, combined with migration documentation, automatically recommends and applies code updates and ensures compatibility with new versions.
  4. The system architecture includes key components: a Summary Agent, a Control Agent, and a Code Agent.
  5. The solution automatically localizes updated library usages in live Java codebases.
  6. Recommended fixes are applied in a user-friendly manner, improving efficiency.

Cool Papers

Click here to view paper screenshots

ALMAS: an Autonomous LLM-based Multi-Agent Software Engineering Framework

Authors:Vali Tawosi, Keshav Ramani, Salwa Alamir, Xiaomo Liu

Multi-agent Large Language Model (LLM) systems have been leading the way in applied LLM research across a number of fields. One notable area is software development, where researchers have advanced the automation of code implementation, code testing, code maintenance, inter alia, using LLM agents. However, software development is a multifaceted environment that extends beyond just code. As such, a successful LLM system must factor in multiple stages of the software development life-cycle (SDLC). In this paper, we propose a vision for ALMAS, an Autonomous LLM-based Multi-Agent Software Engineering framework, which follows the above SDLC philosophy such that it may work within an agile software development team to perform several tasks end-to-end. ALMAS aligns its agents with agile roles, and can be used in a modular fashion to seamlessly integrate with human developers and their development environment. We showcase the progress towards ALMAS through our published works and a use case demonstrating the framework, where ALMAS is able to seamlessly generate an application and add a new feature.


Paper and Project Links

PDF Accepted to MAS-GAIN Workshop at ASE 2025

Summary

Multi-agent large language model (LLM) systems have been leading applied LLM research across many fields. In software development, researchers have used LLM agents to automate code implementation, testing, and maintenance, among other tasks. However, software development is a multifaceted environment that extends beyond code, so a successful LLM system must account for multiple stages of the software development life cycle (SDLC). This paper proposes the vision of ALMAS, an autonomous LLM-based multi-agent software engineering framework that follows the SDLC philosophy and can work within an agile software development team to perform several tasks end to end. ALMAS aligns its agents with agile roles and can be used in a modular fashion to integrate seamlessly with human developers and their development environment. Progress toward ALMAS is showcased through published works and a use case in which ALMAS seamlessly generates an application and adds a new feature.

Key Takeaways

  1. Multi-agent large language model (LLM) systems excel in many fields, especially software development.
  2. Automation is an important direction in software development, where LLM systems have broad application prospects.
  3. The software development life cycle (SDLC) comprises multiple stages, and a successful LLM system must account for them.
  4. ALMAS is an autonomous LLM-based multi-agent software framework that follows the SDLC philosophy and supports agile development teams in performing multiple tasks.
  5. ALMAS aligns agents with agile roles, and its modular design lets it integrate seamlessly with human developers and their development environment.
  6. Progress toward ALMAS has been demonstrated through published works and a use-case demonstration.

Cool Papers

Click here to view paper screenshots

Communicating Plans, Not Percepts: Scalable Multi-Agent Coordination with Embodied World Models

Authors:Brennen A. Hill, Mant Koh En Wei, Thangavel Jishnuanandh

Robust coordination is critical for effective decision-making in multi-agent systems, especially under partial observability. A central question in Multi-Agent Reinforcement Learning (MARL) is whether to engineer communication protocols or learn them end-to-end. We investigate this dichotomy using embodied world models. We propose and compare two communication strategies for a cooperative task-allocation problem. The first, Learned Direct Communication (LDC), learns a protocol end-to-end. The second, Intention Communication, uses an engineered inductive bias: a compact, learned world model, the Imagined Trajectory Generation Module (ITGM), which uses the agent’s own policy to simulate future states. A Message Generation Network (MGN) then compresses this plan into a message. We evaluate these approaches on goal-directed interaction in a grid world, a canonical abstraction for embodied AI problems, while scaling environmental complexity. Our experiments reveal that while emergent communication is viable in simple settings, the engineered, world model-based approach shows superior performance, sample efficiency, and scalability as complexity increases. These findings advocate for integrating structured, predictive models into MARL agents to enable active, goal-driven coordination.
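
A minimal sketch of communicating plans rather than percepts: roll out your own policy inside a learned world model and send only a compressed summary of the imagined trajectory. The endpoint-as-message compression is our simplification of the MGN, and the toy grid dynamics are illustrative.

```python
def imagine_trajectory(state, policy, world_model, horizon=5):
    """ITGM-style rollout: simulate your own policy inside a learned model."""
    traj = []
    for _ in range(horizon):
        action = policy(state)
        state = world_model(state, action)
        traj.append(state)
    return traj

def message_from_plan(traj):
    """MGN stand-in: compress the imagined plan to its endpoint (the
    intended goal cell), rather than broadcasting raw percepts."""
    return traj[-1]

# Toy grid world: state is (x, y); the policy walks toward a fixed goal.
goal = (3, 2)
policy = lambda s: (1 if s[0] < goal[0] else 0, 1 if s[1] < goal[1] else 0)
model = lambda s, a: (s[0] + a[0], s[1] + a[1])
print(message_from_plan(imagine_trajectory((0, 0), policy, model)))  # (3, 2)
```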


Paper and Project Links

PDF Published in the Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Scaling Environments for Agents (SEA). Additionally accepted for presentation in the NeurIPS 2025 Workshop: Embodied World Models for Decision Making (EWM) and the NeurIPS 2025 Workshop: Optimization for Machine Learning (OPT)

Summary

In Multi-Agent Reinforcement Learning (MARL), robust coordination is critical for decision-making, especially under partial observability. This work examines whether communication protocols should be engineered or learned end to end, using embodied world models, and compares two communication strategies: Learned Direct Communication (LDC), which learns a protocol end to end, and Intention Communication, which uses an engineered inductive bias, the Imagined Trajectory Generation Module (ITGM), a compact learned world model that uses the agent's own policy to simulate future states, with a Message Generation Network (MGN) compressing the resulting plan into a message. Both approaches are evaluated on goal-directed interaction in a grid world, a canonical abstraction for embodied AI, under increasing environmental complexity. The experiments show that emergent communication is viable in simple settings, but the engineered, world-model-based approach delivers superior performance, sample efficiency, and scalability as complexity grows. These findings advocate integrating structured predictive models into MARL agents to enable active, goal-driven coordination.

Key Takeaways

  1. In multi-agent systems, coordination is critical for decision-making, especially under partial observability.
  2. The work examines a core question in Multi-Agent Reinforcement Learning (MARL): whether communication protocols should be engineered or learned end to end.
  3. Two communication strategies are compared: Learned Direct Communication (LDC) and Intention Communication, the latter using an engineered inductive bias built on an embodied world model.
  4. Experiments on goal-directed interaction in a grid world show that the engineered, world-model-based approach performs better.
  5. Intention Communication uses the Imagined Trajectory Generation Module (ITGM) to simulate future states and a Message Generation Network (MGN) to compress the plan into a message.
  6. As environmental complexity increases, integrating structured predictive models becomes essential for active, goal-driven coordination.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise noted, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!