⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-11-19 更新
Live-SWE-agent: Can Software Engineering Agents Self-Evolve on the Fly?
Authors:Chunqiu Steven Xia, Zhe Wang, Yan Yang, Yuxiang Wei, Lingming Zhang
Large Language Models (LLMs) are reshaping almost all industries, including software engineering. In recent years, a number of LLM agents have been proposed to solve real-world software problems. Such software agents are typically equipped with a suite of coding tools and can autonomously decide the next actions to form complete trajectories to solve end-to-end software tasks. While promising, they typically require dedicated design and may still be suboptimal, since it can be extremely challenging and costly to exhaust the entire agent scaffold design space. Recognizing that software agents are inherently software themselves that can be further refined/modified, researchers have proposed a number of self-improving software agents recently, including the Darwin-Gödel Machine (DGM). Meanwhile, such self-improving agents require costly offline training on specific benchmarks and may not generalize well across different LLMs or benchmarks. In this paper, we propose Live-SWE-agent, the first live software agent that can autonomously and continuously evolve itself on-the-fly during runtime when solving real-world software problems. More specifically, Live-SWE-agent starts with the most basic agent scaffold with only access to bash tools (e.g., mini-SWE-agent), and autonomously evolves its own scaffold implementation while solving real-world software problems. Our evaluation on the widely studied SWE-bench Verified benchmark shows that Live-SWE-agent can achieve an impressive solve rate of 75.4% without test-time scaling, outperforming all existing open-source software agents and approaching the performance of the best proprietary solution. Moreover, Live-SWE-agent outperforms state-of-the-art manually crafted software agents on the recent SWE-Bench Pro benchmark, achieving the best-known solve rate of 45.8%.
大型语言模型(LLM)正在重塑几乎所有行业,包括软件工程。近年来,已经提出了许多LLM代理来解决现实世界中的软件问题。此类软件代理通常配备了一套编程工具,并能够自主决定下一步行动,以形成完整的轨迹来解决端到端的软件任务。虽然前景看好,但它们通常需要专门设计,并且可能仍然不够理想,因为穷尽整个代理架构的设计空间可能极具挑战性和成本高昂。
研究人员意识到软件代理本质上是软件本身,可以进一步进行改进/修改,因此最近已经提出了一些自我改进的软件代理,包括达尔文-哥德尔机器(DGM)。同时,这种自我改进的代理需要在特定基准测试上进行昂贵的离线训练,并且可能无法在不同的大型语言模型或基准测试之间进行很好的泛化。在本文中,我们提出了 Live-SWE-agent,这是首个能够在解决现实世界软件问题的运行过程中自主且持续地实时进化自身的软件代理。具体而言,Live-SWE-agent 从只能访问 bash 工具的最基础代理架构(例如 mini-SWE-agent)出发,在解决现实软件问题的同时自主演化自身的架构实现。我们在被广泛研究的 SWE-bench Verified 基准上的评估表明,Live-SWE-agent 在不使用测试时扩展(test-time scaling)的情况下取得了 75.4% 的解决率,超过所有现有开源软件代理,并接近最佳专有方案的性能。此外,Live-SWE-agent 在最新的 SWE-Bench Pro 基准上优于最先进的人工设计软件代理,取得了 45.8% 的已知最佳解决率。
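为说明"代理把自身架构当作可修改的文本、边解题边自我进化"这一思路,下面给出一个极简的 Python 草图;它并非 Live-SWE-agent 的官方实现,其中 call_llm、run_bash 以及 "evolve:" 触发规则均为演示用假设。

```python
# 示意性草图:运行时自我进化的代理循环(非 Live-SWE-agent 官方实现)
# 假设:call_llm / run_bash 为占位函数,真实系统需接入模型与沙箱环境。
import subprocess

def call_llm(prompt: str) -> str:
    """占位:调用大语言模型,返回下一步动作(此处仅演示接口)。"""
    return "bash: echo solving..."

def run_bash(cmd: str) -> str:
    """最基础的工具:执行 bash 命令并返回输出。"""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True).stdout

def live_agent(task: str, scaffold: str, max_steps: int = 5) -> str:
    """scaffold 以纯文本形式描述当前代理架构,模型可在解题过程中改写它。"""
    for _ in range(max_steps):
        action = call_llm(f"架构说明:\n{scaffold}\n任务:{task}\n请给出下一步动作")
        if action.startswith("bash:"):
            observation = run_bash(action[len("bash:"):].strip())
        elif action.startswith("evolve:"):      # 模型决定修改自身架构
            scaffold = action[len("evolve:"):].strip()
            observation = "scaffold updated"
        else:                                    # 其余情况视为最终答案/补丁
            return action
        task += f"\n观察:{observation}"
    return "未在步数限制内完成"

print(live_agent("修复仓库中的一个单元测试失败", "仅有 bash 工具的最小架构"))
```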
论文及项目相关链接
Summary
大型语言模型(LLMs)正在重塑包括软件工程在内的几乎所有行业。近年来,提出了一系列LLM代理来解决现实世界中的软件问题。这些软件代理通常配备了一套编码工具,能够自主决定下一步行动,形成完整的轨迹来解决端到端的软件任务。尽管前景广阔,但它们通常需要专门设计,可能仍然不够理想,因为穷尽整个代理架构的设计空间极具挑战性和成本。研究人员已经认识到软件代理本身就是可以进一步改进和修改的,因此最近已经提出了一系列自我改进的软件代理,包括达尔文-哥德尔机器(DGM)。然而,这样的自我改进代理需要在特定基准测试上进行昂贵的离线训练,并且可能无法在不同LLM或基准测试之间很好地推广。本文提出Live-SWE-agent,这是首个能够在解决现实世界软件问题时实时自主持续进化的软件代理。我们在SWE-bench Verified基准上的评估表明,Live-SWE-agent在不进行测试时扩展(test-time scaling)的情况下实现了75.4%的解决率,超过了所有现有的开源软件代理并接近最佳专有解决方案的性能。此外,Live-SWE-agent在最近的SWE-Bench Pro基准测试上优于最新的手工定制软件代理,实现了45.8%的已知最佳解决率。
Key Takeaways
- 大型语言模型(LLMs)正在重塑软件工程行业,LLM代理为解决现实软件问题提供了新的方法。
- 现有的LLM代理通常需要专门设计,并且在穷尽代理架构设计空间时存在挑战和成本。
- 自我改进的软件代理是一个新兴研究领域,其中达尔文-哥德尔机器(DGM)是其中的一个例子。
- 自我改进的软件代理需要在特定的基准测试上进行昂贵的离线训练,且可能无法在不同LLM或基准测试之间很好地推广。
- Live-SWE-agent是首个能够在解决现实软件问题时实时自主连续进化的活跃软件代理。
- Live-SWE-agent在不进行测试时扩展(test-time scaling)的情况下实现了75.4%的解决率,超过了现有开源软件代理并接近最佳专有解决方案的性能。
点此查看论文截图
Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents
Authors:Piaohong Wang, Motong Tian, Jiaxian Li, Yuan Liang, Yuqing Wang, Qianben Chen, Tiannan Wang, Zhicong Lu, Jiawei Ma, Yuchen Eleanor Jiang, Wangchunshu Zhou
Recent advancements in LLM-powered agents have demonstrated significant potential in generating human-like responses; however, they continue to face challenges in maintaining long-term interactions within complex environments, primarily due to limitations in contextual consistency and dynamic personalization. Existing memory systems often depend on semantic grouping prior to retrieval, which can overlook semantically irrelevant yet critical user information and introduce retrieval noise. In this report, we propose the initial design of O-Mem, a novel memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records from their proactive interactions with agents. O-Mem supports hierarchical retrieval of persona attributes and topic-related context, enabling more adaptive and coherent personalized responses. O-Mem achieves 51.76% on the public LoCoMo benchmark, a nearly 3% improvement upon LangMem,the previous state-of-the-art, and it achieves 62.99% on PERSONAMEM, a 3.5% improvement upon A-Mem,the previous state-of-the-art. O-Mem also boosts token and interaction response time efficiency compared to previous memory frameworks. Our work opens up promising directions for developing efficient and human-like personalized AI assistants in the future.
近期,由大型语言模型(LLM)驱动的智能体在生成拟人化回复方面展现出了巨大潜力;然而,它们在复杂环境中维持长期交互方面仍面临挑战,这主要归因于上下文一致性和动态个性化方面的局限。现有记忆系统通常依赖检索前的语义分组,这可能忽略语义上不相关却至关重要的用户信息,并引入检索噪声。在本报告中,我们提出了 O-Mem 的初步设计,这是一种基于主动用户画像的新型记忆框架,能够从用户与智能体的主动交互中动态提取并更新用户特征和事件记录。O-Mem 支持对画像属性和主题相关上下文的分层检索,从而实现更自适应、更连贯的个性化响应。O-Mem 在公开的 LoCoMo 基准上取得 51.76% 的成绩,较此前最先进的 LangMem 提升近 3%;在 PERSONAMEM 上取得 62.99%,较此前最先进的 A-Mem 提升 3.5%。与以往的记忆框架相比,O-Mem 还提升了令牌使用与交互响应时间方面的效率。我们的工作为未来开发高效、拟人化的个性化 AI 助手开辟了有前景的方向。
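为直观说明"画像属性 + 主题事件"的分层记忆检索思路,下面给出一个极简的 Python 草图;它并非 O-Mem 的官方实现,类名、字段与基于字符重叠的打分方式均为演示用假设。

```python
# 示意性草图:分层记忆的"画像属性 + 主题事件"检索(非 O-Mem 官方实现,结构为假设)
from collections import defaultdict

class HierMemory:
    def __init__(self):
        self.persona = {}                      # 用户画像属性,如 {"饮食": "素食"}
        self.events = defaultdict(list)        # 按主题索引的事件记录

    def update(self, attr: dict, topic: str, event: str):
        """从用户的主动交互中更新画像与事件记录。"""
        self.persona.update(attr)
        self.events[topic].append(event)

    def retrieve(self, query: str, top_k: int = 3):
        """分层检索:先返回全部画像属性,再按简单字符重叠挑选主题相关事件。"""
        scored = []
        for topic, evs in self.events.items():
            overlap = len(set(query) & set(topic))      # 字符重叠,仅作演示
            scored += [(overlap, e) for e in evs]
        scored.sort(key=lambda x: -x[0])
        return self.persona, [e for _, e in scored[:top_k]]

mem = HierMemory()
mem.update({"城市": "上海"}, "旅行", "用户计划十月去云南")
mem.update({}, "饮食", "用户提到最近在减糖")
print(mem.retrieve("帮我规划一次旅行"))
```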
论文及项目相关链接
Summary
LLM 驱动的智能体在生成拟人化回复方面展现出巨大潜力,但在复杂环境中维持长期交互仍面临挑战。为解决现有记忆系统的不足,本文提出了基于主动用户画像的新型记忆框架 O-Mem 的初步设计。O-Mem 能够动态提取和更新用户特征与事件记录,支持画像属性与主题相关上下文的分层检索,以实现更自适应、更连贯的个性化响应。在公开的 LoCoMo 基准上,O-Mem 取得 51.76%,较此前最先进的 LangMem 提升近 3%;在 PERSONAMEM 上取得 62.99%,较 A-Mem 提升 3.5%。同时,相较于先前的记忆框架,O-Mem 在令牌使用和交互响应时间上的效率也有显著提升。这为未来开发高效且人性化的个性化 AI 助手提供了广阔前景。
Key Takeaways
- LLM技术虽然能够生成人类般的响应,但在复杂环境中维持长期交互存在挑战。
- 现有记忆系统依赖语义分组进行检索,可能会忽略关键用户信息并引入检索噪声。
- O-Mem作为一种新型记忆框架,基于主动用户分析设计,能够动态提取和更新用户特性和事件记录。
- O-Mem支持层次化的个性属性检索和话题相关上下文检索,以提高响应的适应性和连贯性。
- O-Mem在公共LoCoMo基准测试中取得51.76%,较此前最先进方法提升近3%;在PERSONAMEM上取得62.99%,提升3.5%。
- O-Mem提高了令牌和交互响应的时间效率,相较于先前的记忆框架有所改进。
点此查看论文截图
Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation
Authors:Zhipeng Ma, Ali Rida Bahja, Andreas Burgdorf, André Pomp, Tobias Meisen, Bo Nørregaard Jørgensen, Zheng Grace Ma
Enhancing fuel efficiency in public transportation requires the integration of complex multimodal data into interpretable, decision-relevant insights. However, traditional analytics and visualization methods often yield fragmented outputs that demand extensive human interpretation, limiting scalability and consistency. This study presents a multi-agent framework that leverages multimodal large language models (LLMs) to automate data narration and energy insight generation. The framework coordinates three specialized agents, including a data narration agent, an LLM-as-a-judge agent, and an optional human-in-the-loop evaluator, to iteratively transform analytical artifacts into coherent, stakeholder-oriented reports. The system is validated through a real-world case study on public bus transportation in Northern Jutland, Denmark, where fuel efficiency data from 4006 trips are analyzed using Gaussian Mixture Model clustering. Comparative experiments across five state-of-the-art LLMs and three prompting paradigms identify GPT-4.1 mini with Chain-of-Thought prompting as the optimal configuration, achieving 97.3% narrative accuracy while balancing interpretability and computational cost. The findings demonstrate that multi-agent orchestration significantly enhances factual precision, coherence, and scalability in LLM-based reporting. The proposed framework establishes a replicable and domain-adaptive methodology for AI-driven narrative generation and decision support in energy informatics.
提升公共交通的燃油效率,需要将复杂的多模态数据整合为可解释、与决策相关的洞察。然而,传统的分析与可视化方法往往产生碎片化的输出,需要大量人工解读,限制了可扩展性和一致性。本研究提出了一个多智能体框架,利用多模态大型语言模型(LLM)自动化数据叙事与能源洞察生成。该框架协调三个专门的智能体,包括数据叙事智能体、LLM 评判(LLM-as-a-judge)智能体和可选的人在回路评估者,以迭代方式将分析产物转化为连贯的、面向利益相关者的报告。该系统通过丹麦北日德兰地区公共巴士运输的真实案例研究得到验证,其中使用高斯混合模型聚类分析了 4006 次行程的燃油效率数据。跨五种最新 LLM 和三种提示范式的对比实验表明,GPT-4.1 mini 搭配思维链(Chain-of-Thought)提示是最优配置,在平衡可解释性与计算成本的同时达到 97.3% 的叙事准确率。研究结果表明,多智能体协同显著提升了基于 LLM 的报告生成中的事实准确性、连贯性和可扩展性。所提出的框架为能源信息学中 AI 驱动的叙事生成与决策支持建立了可复制、可适配领域的方法论。
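为说明"数据叙事代理 + LLM 评审代理"的迭代精炼流程,下面给出一个极简的 Python 草图;它并非论文官方实现,narrate、judge 等函数及其规则均为演示用假设,真实系统会调用多模态 LLM 并读取高斯混合模型的聚类结果。

```python
# 示意性草图:"数据叙事代理 + LLM 评审代理"的闭环迭代(非论文官方实现)
def narrate(stats: dict, feedback: str = "") -> str:
    """占位:根据聚类统计量(如各簇的平均油耗)生成报告草稿。"""
    body = ";".join(f"簇{k}: 平均油耗 {v} L/100km" for k, v in stats.items())
    return f"报告草稿:{body}。{('修订意见:' + feedback) if feedback else ''}"

def judge(report: str):
    """占位:LLM-as-a-judge,检查报告是否覆盖所有簇并给出反馈。"""
    ok = all(f"簇{k}" in report for k in ("A", "B", "C"))
    return ok, "" if ok else "请补全缺失簇的描述"

def pipeline(stats: dict, max_rounds: int = 3) -> str:
    feedback = ""
    for _ in range(max_rounds):                 # 叙事 -> 评审 -> 反馈的闭环
        report = narrate(stats, feedback)
        ok, feedback = judge(report)
        if ok:
            return report                        # 通过评审(可再交人工复核)
    return report

print(pipeline({"A": 28.5, "B": 33.1, "C": 41.7}))
```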
论文及项目相关链接
Summary
本研究提出一种多智能体框架,利用多模态大型语言模型(LLMs)自动化数据叙事和能源洞察生成,提高公共交通的燃料效率。通过丹麦北日德兰公共巴士运输的实证案例研究验证了系统的有效性。研究发现,多智能体协同显著提高了LLM报告中的事实准确性、连贯性和可扩展性。
Key Takeaways
- 研究引入多智能体框架,结合多模态大型语言模型(LLMs),以自动化数据叙事和能源洞察生成,提升公共交通燃料效率。
- 框架包含数据叙事智能体、LLM评判智能体以及可选的人机协同评价者,将分析产物转化为连贯、面向利益相关者的报告。
- 通过丹麦北日德兰公共巴士运输实证案例研究,验证了系统的有效性。
- 比较了五种最先进的大型语言模型和三种提示范式,发现GPT-4.1 mini配合Chain-of-Thought提示法为最优配置,实现了97.3%的叙事准确性,平衡了可解释性和计算成本。
- 多智能体协同显著提高了LLM报告中的事实准确性、连贯性和决策支持的可扩展性。
- 研究结果建立了可复制且适应于特定领域的AI驱动叙事生成方法论,在能源信息学中有着广泛的应用前景。
点此查看论文截图
Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction
Authors:Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Xubin Li, Tiezheng Ge, Wenxuan Wang, Qin Jin
With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.
随着智能个人设备的兴起,面向服务的人机(人-智能体)交互日益普遍。这一趋势凸显了对个性化对话助手的需求:它们需要理解用户的特定特征,以准确解读需求并根据个人偏好定制响应。然而,现有方法往往忽视长期交互的复杂性,也难以捕捉用户的主观特征。为弥补这些不足,我们提出了 PAL-Bench,这是一个新的基准,用于评估面向服务的助手在长期用户-智能体交互中的个性化能力。在缺乏可用真实数据的情况下,我们开发了一个多步骤的基于 LLM 的合成管道,并经人工标注者验证与细化。这一过程产生了 PAL-Set,这是首个包含多会话用户日志和对话历史的中文数据集,并作为 PAL-Bench 的基础。此外,为改进个性化的面向服务交互,我们提出了 H$^2$Memory,这是一个分层且异构的记忆框架,结合检索增强生成来提升个性化响应生成。在我们的 PAL-Bench 与一个外部数据集上的综合实验证明了所提记忆框架的有效性。
论文及项目相关链接
PDF Accepted by AAAI 2026 (Oral)
Summary
随着智能个人设备的普及,服务导向型的人机交互变得越来越普遍。这凸显了对能够理解用户特定特征、准确解读需求并根据个人偏好定制响应的个性化对话助手的需求。针对现有方法忽视长期交互的复杂性和用户主观特征的问题,本文提出了PAL-Bench基准测试,以评估服务导向型助理在长期用户代理交互中的个性化能力。在没有真实世界数据的情况下,我们开发了一个多步骤的LLM合成管道,并通过人工标注进行了验证和完善,从而产生了PAL-Set数据集,它是包含多会话用户日志和对话历史的首个中文数据集,为PAL-Bench提供了基础。此外,为了改进个性化的服务导向交互,我们提出了H^2Memory框架,它结合了检索增强生成技术,提高了个性化响应生成的能力。在PAL-Bench和外部数据集上的综合实验证明了该框架的有效性。
Key Takeaways
- 智能个人设备的普及促进了服务导向型人机互动的增多,需要个性化对话助手来准确理解用户需求和偏好。
- 现有方法在长期交互中忽视了用户的复杂性和主观特征,因此需要新的评估基准来测试助理的个性化能力。
- 介绍了PAL-Bench基准测试,用于评估服务导向型助理在长期用户代理交互中的个性化能力。
- 为了支持PAL-Bench,开发了一个多步骤的LLM合成管道,产生了首个包含多会话用户日志和对话历史的中文数据集——PAL-Set。
- 提出了H^2Memory框架,结合检索增强生成技术,提高了个性化响应生成的效果。
- 在PAL-Bench和外部数据集上的实验验证了H^2Memory框架的有效性。
- 该研究为改进服务导向型的人机交互提供了重要的见解和工具。
点此查看论文截图
MedDCR: Learning to Design Agentic Workflows for Medical Coding
Authors:Jiyang Zheng, Islam Nassar, Thanh Vu, Xu Zhong, Yang Lin, Tongliang Liu, Long Duong, Yuan-Fang Li
Medical coding converts free-text clinical notes into standardized diagnostic and procedural codes, which are essential for billing, hospital operations, and medical research. Unlike ordinary text classification, it requires multi-step reasoning: extracting diagnostic concepts, applying guideline constraints, mapping to hierarchical codebooks, and ensuring cross-document consistency. Recent advances leverage agentic LLMs, but most rely on rigid, manually crafted workflows that fail to capture the nuance and variability of real-world documentation, leaving open the question of how to systematically learn effective workflows. We present MedDCR, a closed-loop framework that treats workflow design as a learning problem. A Designer proposes workflows, a Coder executes them, and a Reflector evaluates predictions and provides constructive feedback, while a memory archive preserves prior designs for reuse and iterative refinement. On benchmark datasets, MedDCR outperforms state-of-the-art baselines and produces interpretable, adaptable workflows that better reflect real coding practice, improving both the reliability and trustworthiness of automated systems.
医疗编码将自由文本的临床笔记转化为标准化的诊断与操作编码,对计费、医院运营和医学研究至关重要。不同于普通的文本分类,它需要多步推理:提取诊断概念、应用指南约束、映射到分层编码本,并确保跨文档一致性。近期进展利用具备代理能力(agentic)的大型语言模型,但大多依赖僵化、人工设计的工作流,难以捕捉真实世界文档的细微差别与多变性,使得"如何系统地学习有效工作流"仍是悬而未决的问题。我们提出了 MedDCR,这是一个将工作流设计视为学习问题的闭环框架:设计者(Designer)提出工作流,编码者(Coder)执行它们,反思者(Reflector)评估预测并提供建设性反馈,同时记忆存档保留先前设计以供重用和迭代优化。在基准数据集上,MedDCR 优于最先进的基线,并产生可解释、可适配的工作流,更好地反映真实编码实践,从而提升自动化系统的可靠性与可信度。
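为说明"设计者提出工作流、编码者执行、反思者反馈、存档复用"的闭环,下面给出一个极简的 Python 草图;它并非 MedDCR 的官方实现,各函数的内部逻辑(如基于关键词的编码)仅为演示用假设。

```python
# 示意性草图:Designer / Coder / Reflector 闭环与设计存档(非 MedDCR 官方实现)
def designer(task: str, archive: list, feedback: str) -> list:
    """占位:结合历史设计与反馈,提出新的工作流(以步骤列表表示)。"""
    base = archive[-1] if archive else ["提取诊断概念", "映射编码"]
    return base + (["检查指南约束"] if "指南" in feedback else [])

def coder(workflow: list, note: str) -> list:
    """占位:按工作流执行,输出预测的编码列表。"""
    return ["I10"] if "高血压" in note else ["R69"]

def reflector(codes: list, gold: list):
    """评估预测并给出建设性反馈。"""
    acc = len(set(codes) & set(gold)) / max(len(gold), 1)
    return acc, "" if acc == 1.0 else "建议补充指南约束检查"

def meddcr_loop(note: str, gold: list, rounds: int = 3):
    archive, feedback = [], ""
    for _ in range(rounds):
        wf = designer("医疗编码", archive, feedback)
        acc, feedback = reflector(coder(wf, note), gold)
        archive.append(wf)                      # 存档以便复用与迭代精炼
        if acc == 1.0:
            break
    return archive[-1], acc

print(meddcr_loop("患者高血压病史十年", ["I10"]))
```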
论文及项目相关链接
Summary
医疗编码将自由文本的临床笔记转化为标准化的诊断和程序代码,对于计费、医院运营和医学研究至关重要。它涉及多步推理,包括提取诊断概念、应用指导方针约束、映射到层次代码本和确保跨文档一致性。最近的研究利用agentic LLMs,但大多数依赖于手工制作的流程,无法捕捉真实世界文档的细微差别和变化性。我们提出MedDCR,一个闭环框架,将工作流程设计视为学习问题。设计师提出工作流程,编码员执行它们,反射器评估预测并提供建设性反馈,同时记忆存档保留先前设计以供重用和迭代改进。在基准数据集上,MedDCR优于最新基线,产生可解释、可适应的工作流程,更好地反映实际编码实践,提高自动化系统的可靠性和可信度。
Key Takeaways
- 医疗编码的重要性:将临床笔记转化为标准化代码,用于计费、医院运营和医学研究。
- 医疗编码涉及多步推理:包括诊断概念提取、指导方针约束应用等。
- 当前研究的不足:大多数研究依赖于手工制作的流程,无法适应真实世界文档的细微差别和变化性。
- MedDCR框架的提出:将工作流程设计视为学习问题,包括设计师、编码员、反射器和记忆存档等角色。
- MedDCR框架的效果:在基准数据集上表现优异,产生可解释、可适应的工作流程。
- MedDCR框架的意义:更好地反映实际编码实践,提高自动化系统的可靠性和可信度。
点此查看论文截图
Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval
Authors:Chuang Zhao, Hui Tang, Hongke Zhao, Xiaofang Zhou, Xiaomeng Li
Accurate healthcare prediction is critical for improving patient outcomes and reducing operational costs. Bolstered by growing reasoning capabilities, large language models (LLMs) offer a promising path to enhance healthcare predictions by drawing on their rich parametric knowledge. However, LLMs are prone to factual inaccuracies due to limitations in the reliability and coverage of their embedded knowledge. While retrieval-augmented generation (RAG) frameworks, such as GraphRAG and its variants, have been proposed to mitigate these issues by incorporating external knowledge, they face two key challenges in the healthcare scenario: (1) identifying the clinical necessity to activate the retrieval mechanism, and (2) achieving synergy between the retriever and the generator to craft contextually appropriate retrievals. To address these challenges, we propose GHAR, a \underline{g}enerative \underline{h}ierarchical \underline{a}gentic \underline{R}AG framework that simultaneously resolves when to retrieve and how to optimize the collaboration between submodules in healthcare. Specifically, for the first challenge, we design a dual-agent architecture comprising Agent-Top and Agent-Low. Agent-Top acts as the primary physician, iteratively deciding whether to rely on parametric knowledge or to initiate retrieval, while Agent-Low acts as the consulting service, summarising all task-relevant knowledge once retrieval was triggered. To tackle the second challenge, we innovatively unify the optimization of both agents within a formal Markov Decision Process, designing diverse rewards to align their shared goal of accurate prediction while preserving their distinct roles. Extensive experiments on three benchmark datasets across three popular tasks demonstrate our superiority over state-of-the-art baselines, highlighting the potential of hierarchical agentic RAG in advancing healthcare systems.
准确的医疗预测对于改善患者结局和降低运营成本至关重要。凭借不断增强的推理能力,大型语言模型(LLM)可借助其丰富的参数化知识为医疗预测提供一条有前景的路径。然而,由于其内嵌知识在可靠性与覆盖面上的局限,LLM 容易产生事实性错误。尽管 GraphRAG 及其变体等检索增强生成(RAG)框架已被提出以通过引入外部知识缓解这些问题,但在医疗场景中仍面临两大挑战:(1)识别激活检索机制的临床必要性;(2)实现检索器与生成器之间的协同,以构建符合上下文的检索。为应对这些挑战,我们提出了 GHAR,一种生成式、分层、智能体化的 RAG 框架,可同时解决医疗场景中"何时检索"以及"如何优化子模块协作"的问题。具体而言,针对第一个挑战,我们设计了由 Agent-Top 和 Agent-Low 组成的双智能体架构:Agent-Top 充当主治医生,迭代地决定是依赖参数化知识还是启动检索;Agent-Low 则充当会诊服务,在触发检索后汇总所有与任务相关的知识。针对第二个挑战,我们创新地将两个智能体的优化统一到一个形式化的马尔可夫决策过程中,设计多样化的奖励以对齐它们"准确预测"的共同目标,同时保持各自不同的角色。在三个流行任务、三个基准数据集上的大量实验表明,我们的方法优于最先进的基线,凸显了分层智能体化 RAG 在推进医疗系统方面的潜力。
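为说明"Agent-Top 决定直接作答还是触发检索、Agent-Low 汇总检索知识"的分工,下面给出一个极简的 Python 草图;它并非 GHAR 的官方实现,confidence、retrieve、summarize 与阈值均为演示用假设。

```python
# 示意性草图:Agent-Top 决定"直接作答还是触发检索",Agent-Low 汇总检索知识
# (非 GHAR 官方实现;各函数均为占位假设)
def confidence(question: str) -> float:
    """占位:主治医生代理(Agent-Top)对参数化知识的自信度估计。"""
    return 0.3 if "罕见" in question else 0.9

def retrieve(question: str) -> list:
    return [f"指南条目:与『{question}』相关的循证建议"]   # 占位外部知识

def summarize(docs: list) -> str:
    return " / ".join(docs)                                # Agent-Low:汇总任务相关知识

def ghar_answer(question: str, threshold: float = 0.5) -> str:
    if confidence(question) >= threshold:                  # 依赖参数化知识
        return f"直接作答:{question} 的常规处理方案"
    context = summarize(retrieve(question))                # 触发检索并汇总
    return f"结合检索作答:{context}"

print(ghar_answer("某罕见代谢疾病的用药建议"))
```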
论文及项目相关链接
Summary
基于大型语言模型(LLM)的丰富参数知识,其在医疗保健预测中展现出巨大潜力,有助于改善病患结果并降低运营成本。然而,LLM存在事实不准确的问题,其嵌入知识的可靠性和覆盖面有限。为解决此问题,研究者提出了检索增强生成(RAG)框架如GraphRAG及其变体来融入外部知识。然而,在医疗保健场景中,RAG面临两大挑战:一是确定何时启动检索机制的临床必要性,二是实现检索器和生成器之间的协同合作以生成符合语境的检索结果。为解决这些挑战,本文提出了GHAR,一个生成式分层代理增强RAG框架,旨在解决何时检索以及如何优化子模块协同的问题。GHAR设计了一个双代理架构并创新性地将其统一到一个正式的马尔可夫决策过程中,以优化两个代理的性能并使其目标一致。实验证明GHAR在三个基准数据集上的表现优于现有技术,显示出其在医疗保健系统中提升预测准确性的潜力。
Key Takeaways
- 大型语言模型(LLM)在医疗保健预测中展现出潜力,有助于改善病患结果和降低运营成本。
- LLM存在事实不准确的问题,需要融入外部知识来解决。
- 检索增强生成(RAG)框架如GraphRAG面临两大挑战:临床必要性确定和检索与生成的协同合作。
- GHAR框架通过双代理架构解决上述问题,其中Agent-Top决定是否需要检索,Agent-Low作为咨询服务进行知识总结。
- GHAR创新性地统一两个代理的优化过程到一个正式的马尔可夫决策过程中,设计奖励以对齐预测准确性目标并保留其独特角色。
- 实验证明GHAR在医疗保健预测上的表现优于现有技术。
点此查看论文截图
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
Authors:Haoyang Hong, Jiajun Yin, Yuan Wang, Jingnan Liu, Zhe Chen, Ailing Yu, Ji Li, Zhiling Ye, Hansong Xiao, Yefei Chen, Hualei Zhou, Yun Yue, Minghui Yang, Chunxiao Guo, Junwei Liu, Peng Wei, Jinjie Gu
Multi-agent systems perform well on general reasoning tasks. However, the lack of training in specialized areas hinders their accuracy. Current training methods train a unified large language model (LLM) for all agents in the system. This may limit the performances due to different distributions underlying for different agents. Therefore, training multi-agent systems with distinct LLMs should be the next step to solve. However, this approach introduces optimization challenges. For example, agents operate at different frequencies, rollouts involve varying sub-agent invocations, and agents are often deployed across separate servers, disrupting end-to-end gradient flow. To address these issues, we propose M-GRPO, a hierarchical extension of Group Relative Policy Optimization designed for vertical Multi-agent systems with a main agent (planner) and multiple sub-agents (multi-turn tool executors). M-GRPO computes group-relative advantages for both main and sub-agents, maintaining hierarchical credit assignment. It also introduces a trajectory-alignment scheme that generates fixed-size batches despite variable sub-agent invocations. We deploy a decoupled training pipeline in which agents run on separate servers and exchange minimal statistics via a shared store. This enables scalable training without cross-server backpropagation. In experiments on real-world benchmarks (e.g., GAIA, XBench-DeepSearch, and WebWalkerQA), M-GRPO consistently outperforms both single-agent GRPO and multi-agent GRPO with frozen sub-agents, demonstrating improved stability and sample efficiency. These results show that aligning heterogeneous trajectories and decoupling optimization across specialized agents enhances tool-augmented reasoning tasks.
多智能体系统在通用推理任务上表现良好。然而,缺乏特定领域的训练会限制其准确性。当前的训练方法为系统中的所有智能体训练一个统一的大型语言模型(LLM),这可能因不同智能体底层数据分布不同而限制性能。因此,为多智能体系统训练不同的 LLM 应是下一步要解决的问题。然而,这种做法带来了优化上的挑战:例如,各智能体以不同频率运行,rollout(轨迹展开)涉及次数不定的子智能体调用,并且智能体通常部署在不同的服务器上,从而中断了端到端的梯度流。为了解决这些问题,我们提出了 M-GRPO,它是组相对策略优化(Group Relative Policy Optimization)的分层扩展,面向由一个主智能体(规划器)和多个子智能体(多轮工具执行器)组成的垂直多智能体系统。M-GRPO 同时为主智能体和子智能体计算组相对优势,保持层次化的信用分配;它还引入了轨迹对齐方案,即使子智能体调用次数可变也能生成固定大小的批次。我们部署了解耦的训练管道:各智能体在独立服务器上运行,并通过共享存储交换最少量的统计信息,从而无需跨服务器反向传播即可实现可扩展训练。在真实世界基准(如 GAIA、XBench-DeepSearch 和 WebWalkerQA)上的实验中,M-GRPO 持续优于单智能体 GRPO 以及子智能体被冻结的多智能体 GRPO,表现出更好的稳定性和样本效率。这些结果表明,在专业化智能体之间对齐异构轨迹并解耦优化,能够增强工具增强的推理任务。
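组相对优势的核心计算可以用几行代码说明:同一问题的一组 rollout 奖励按组内均值与标准差做标准化。下面是一个极简的 Python 草图,展示 GRPO 风格的组相对优势;它并非 M-GRPO 的官方实现,奖励数值与"主/子智能体各自分组"的写法均为演示用假设。

```python
# 示意性草图:按"同一问题的一组 rollout"计算组相对优势(GRPO 风格;非 M-GRPO 官方实现)
from statistics import mean, pstdev

def group_relative_advantages(rewards: list, eps: float = 1e-6) -> list:
    """组内标准化:优势 = (r - 组均值) / (组标准差 + eps)。"""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# 主智能体与子智能体各自分组计算优势,保持层次化的信用分配;
# 子智能体调用次数可变时,可通过截断/填充对齐为固定批大小(此处仅示意)。
main_rewards = [1.0, 0.0, 0.5, 1.0]        # 同一任务下 4 条主智能体轨迹的奖励
sub_rewards = [0.2, 0.8, 0.8]              # 某次规划触发的子智能体轨迹奖励
print(group_relative_advantages(main_rewards))
print(group_relative_advantages(sub_rewards))
```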
论文及项目相关链接
Summary
多智能体系统在通用推理任务上表现良好,但由于缺乏专门领域的训练,其准确性受限。当前做法是为系统内所有智能体训练统一的大型语言模型(LLM),而不同智能体的底层数据分布不同,这可能限制系统性能,因此为不同智能体训练不同的 LLM 是下一步方向。然而,这带来了优化挑战:各智能体运行频率不同、rollout 涉及次数不定的子智能体调用、智能体部署在独立服务器上从而中断了端到端梯度流。为此,本文提出 M-GRPO(面向垂直多智能体系统的 GRPO 分层扩展),包含主智能体(规划器)和多个子智能体(多轮工具执行器)。M-GRPO 为主、子智能体分别计算组相对优势以维持层次化信用分配,并引入轨迹对齐方案以生成固定大小的批次。在真实世界基准上的实验表明,M-GRPO 表现出更高的稳定性、样本效率和整体性能。结果证明,在专业化智能体之间对齐异构轨迹并解耦优化,可以增强工具辅助推理任务的效果。
Key Takeaways
- 多智能体系统在通用推理任务上表现良好,但在特定领域缺乏训练会影响准确性。
- 当前训练方法是统一的大型语言模型,可能无法适应不同智能体的分布特征。
- 为不同智能体训练不同的LLM是解决此问题的下一步方向。
- 多智能体训练面临优化挑战,如各智能体运行频率不同、rollout 涉及次数不定的子智能体调用,以及跨服务器部署中断端到端梯度流的问题。
- M-GRPO是解决这些问题的分层扩展方案,包括主智能体和多个子智能体。
- M-GRPO通过计算群组相对优势维持层次化信用分配,并引入轨迹对齐方案。
点此查看论文截图
DualTAP: A Dual-Task Adversarial Protector for Mobile MLLM Agents
Authors:Fuyao Zhang, Jiaming Zhang, Che Wang, Xiongtao Sun, Yurong Hao, Guowei Guan, Wenjie Li, Longtao Huang, Wei Yang Bryan Lim
The reliance of mobile GUI agents on Multimodal Large Language Models (MLLMs) introduces a severe privacy vulnerability: screenshots containing Personally Identifiable Information (PII) are often sent to untrusted, third-party routers. These routers can exploit their own MLLMs to mine this data, violating user privacy. Existing privacy perturbations fail the critical dual challenge of this scenario: protecting PII from the router’s MLLM while simultaneously preserving task utility for the agent’s MLLM. To address this gap, we propose the Dual-Task Adversarial Protector (DualTAP), a novel framework that, for the first time, explicitly decouples these conflicting objectives. DualTAP trains a lightweight generator using two key innovations: (i) a contrastive attention module that precisely identifies and targets only the PII-sensitive regions, and (ii) a dual-task adversarial objective that simultaneously minimizes a task-preservation loss (to maintain agent utility) and a privacy-interference loss (to suppress PII leakage). To facilitate this study, we introduce PrivScreen, a new dataset of annotated mobile screenshots designed specifically for this dual-task evaluation. Comprehensive experiments on six diverse MLLMs (e.g., GPT-5) demonstrate DualTAP’s state-of-the-art protection. It reduces the average privacy leakage rate by 31.6 percentage points (a 3.0x relative improvement) while, critically, maintaining an 80.8% task success rate - a negligible drop from the 83.6% unprotected baseline. DualTAP presents the first viable solution to the privacy-utility trade-off in mobile MLLM agents.
移动 GUI 代理对多模态大型语言模型(MLLM)的依赖带来了严重的隐私漏洞:包含个人可识别信息(PII)的截图常被发送到不受信任的第三方路由器,这些路由器可以利用自己的 MLLM 挖掘这些数据,侵犯用户隐私。现有的隐私扰动方法无法应对该场景中的关键双重挑战:在保护 PII 不被路由器侧 MLLM 获取的同时,保持代理侧 MLLM 的任务效用。为弥补这一空白,我们提出了双任务对抗保护器(DualTAP),这一新框架首次显式地解耦了这两个相互冲突的目标。DualTAP 通过两项关键创新训练一个轻量级生成器:(i)对比注意力模块,精确识别并只针对 PII 敏感区域;(ii)双任务对抗目标,同时最小化任务保持损失(以维持代理效用)和隐私干扰损失(以抑制 PII 泄漏)。为推动这项研究,我们引入了 PrivScreen,这是一个专为该双任务评估设计的带标注移动截图数据集。在六种不同 MLLM(例如 GPT-5)上的综合实验表明,DualTAP 具有最先进的保护能力:它将平均隐私泄露率降低 31.6 个百分点(相对改进 3.0 倍),同时关键地保持了 80.8% 的任务成功率,相比未受保护基线的 83.6% 几乎没有下降。DualTAP 为移动 MLLM 代理中的隐私-效用权衡提供了首个可行的解决方案。
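双任务对抗目标的基本形式是两项损失的加权和:一项保任务效用,一项抑制隐私泄露。下面给出一个极简的 Python 草图;它并非 DualTAP 的官方实现,两项损失的具体定义(平方差、置信度)与权重 lam 均为演示用假设。

```python
# 示意性草图:双任务对抗目标的加权组合(非 DualTAP 官方实现;损失函数为玩具定义)
def task_preservation_loss(protected_pred: float, clean_pred: float) -> float:
    """希望代理侧 MLLM 在扰动前后输出尽量一致(此处用平方差示意)。"""
    return (protected_pred - clean_pred) ** 2

def privacy_interference_loss(pii_confidence: float) -> float:
    """希望路由器侧 MLLM 对 PII 的识别置信度越低越好。"""
    return pii_confidence

def dual_task_objective(protected_pred, clean_pred, pii_confidence, lam=1.0):
    # 同时最小化两项:保任务效用 + 抑制隐私泄露
    return (task_preservation_loss(protected_pred, clean_pred)
            + lam * privacy_interference_loss(pii_confidence))

print(dual_task_objective(protected_pred=0.92, clean_pred=0.95, pii_confidence=0.10))
```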
论文及项目相关链接
Summary
本文介绍了移动GUI代理依赖多模态大型语言模型(MLLMs)所带来的严重隐私漏洞问题。移动设备的截图可能包含个人身份信息(PII),这些信息在发送到第三方路由器时可能被利用。为了解决这个问题,本文提出了DualTAP框架,通过精确定位敏感区域和平衡任务保存与隐私干扰损失来首次明确解决这一冲突目标。实验表明,DualTAP在减少隐私泄露的同时,几乎不影响任务成功率。这为移动MLLM代理的隐私效用权衡提供了可行的解决方案。
Key Takeaways
- 移动GUI代理依赖多模态大型语言模型(MLLMs)存在隐私泄露风险。
- 现有隐私扰动技术无法同时保护个人身份信息(PII)和保持任务效用。
- DualTAP框架通过精确定位敏感区域和平衡任务保存与隐私干扰损失来解决这一冲突。
- DualTAP使用对比注意力模块和双任务对抗目标来实现高效保护。
- 新数据集PrivScreen用于评估这种双重任务性能。
- 实验表明,DualTAP在减少隐私泄露方面表现出卓越的效果,相对改进了3.0倍。
点此查看论文截图
Cost-Effective Communication: An Auction-based Method for Language Agent Interaction
Authors:Yijia Fan, Jusheng Zhang, Kaitong Cai, Jing Yang, Chengpei Tang, Jian Wang, Keze Wang
Multi-agent systems (MAS) built on large language models (LLMs) often suffer from inefficient “free-for-all” communication, leading to exponential token costs and low signal-to-noise ratios that hinder their practical deployment. We challenge the notion that more communication is always beneficial, hypothesizing instead that the core issue is the absence of resource rationality. We argue that “free” communication, by ignoring the principle of scarcity, inherently breeds inefficiency and unnecessary expenses. To address this, we introduce the Dynamic Auction-based Language Agent (DALA), a novel framework that treats communication bandwidth as a scarce and tradable resource. Specifically, our DALA regards inter-agent communication as a centralized auction, where agents learn to bid for the opportunity to speak based on the predicted value density of their messages. Thus, our DALA intrinsically encourages agents to produce concise, informative messages while filtering out low-value communication. Extensive and comprehensive experiments demonstrate that our economically-driven DALA achieves new state-of-the-art performance across seven challenging reasoning benchmarks, including 84.32% on MMLU and a 91.21% pass@1 rate on HumanEval. Note that this is accomplished with remarkable efficiency, i.e., our DALA uses only 6.25 million tokens, a fraction of the resources consumed by current state-of-the-art methods on GSM8K. Further analysis reveals that our DALA cultivates the emergent skill of strategic silence, effectively adapting its communication strategies from verbosity to silence in a dynamical manner via resource constraints.
基于大型语言模型(LLM)构建的多智能体系统(MAS)常常受困于低效的"自由放任"式通信,导致令牌成本呈指数级增长、信噪比过低,从而阻碍其实际部署。我们质疑"通信越多总是越有益"的观念,转而假设核心问题在于资源理性的缺失:"免费"的通信忽视了稀缺性原则,天然滋生低效与不必要的开销。为解决这一问题,我们提出了基于动态拍卖的语言智能体(DALA),这一新框架将通信带宽视为稀缺且可交易的资源。具体而言,DALA 将智能体间通信视为一场集中式拍卖,智能体学会根据其消息的预测价值密度出价竞争发言机会。因此,DALA 从机制上鼓励智能体产出简洁且信息量高的消息,并过滤掉低价值通信。广泛而全面的实验表明,这种由经济学驱动的 DALA 在七个具有挑战性的推理基准上取得了新的最先进性能,包括 MMLU 上的 84.32% 和 HumanEval 上 91.21% 的 pass@1。值得注意的是,这一切是以极高的效率实现的:DALA 仅使用 625 万个令牌,只占当前最先进方法在 GSM8K 上所消耗资源的一小部分。进一步分析表明,DALA 涌现出"策略性沉默"的能力,能够在资源约束下动态地将通信策略从冗长调整为沉默。
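"按预测价值密度出价、胜者发言、其余沉默"的拍卖机制可以用几行代码示意。下面是一个极简的 Python 草图;它并非 DALA 的官方实现,价值密度的定义(预测价值/消息长度)与预算设定均为演示用假设。

```python
# 示意性草图:把"发言权"当作集中拍卖的稀缺资源(非 DALA 官方实现;出价函数为假设)
def value_density(message: str, predicted_value: float) -> float:
    """以"预测价值 / 消息长度"近似价值密度,鼓励简洁且信息量高的消息。"""
    return predicted_value / max(len(message), 1)

def auction(candidates: dict, budget: int = 1) -> list:
    """candidates: {agent_name: (message, predicted_value)};仅前 budget 名可发言。"""
    bids = {a: value_density(m, v) for a, (m, v) in candidates.items()}
    winners = sorted(bids, key=bids.get, reverse=True)[:budget]
    return [(a, candidates[a][0]) for a in winners]      # 其余智能体保持"策略性沉默"

round_msgs = {
    "solver": ("最终答案是 42,推导见上一轮。", 0.9),
    "critic": ("我重复一下题目背景……(冗长)", 0.2),
}
print(auction(round_msgs, budget=1))
```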
论文及项目相关链接
Summary
基于大型语言模型的多智能体系统常因"自由放任"式的沟通而效率低下,导致令牌成本呈指数级增长、信噪比过低,限制了其实际应用。本研究提出资源理性的概念,认为免费的沟通忽视了稀缺性原则,导致沟通效率低下和不必要的支出。为此,本研究引入动态拍卖语言智能体(DALA)框架,将沟通带宽视为稀缺且可交易的资源:智能体间的沟通被组织为集中式拍卖,智能体根据消息的预测价值密度学习出价竞争发言权。实验表明,经济驱动的 DALA 在七个具有挑战性的推理基准上取得了最新性能,包括 MMLU 上的 84.32% 和 HumanEval 上 91.21% 的 pass@1 率。值得注意的是,DALA 非常高效,仅使用约 625 万个令牌,只占当前最先进方法在 GSM8K 上资源消耗的一小部分。进一步分析表明,DALA 能够习得"策略性沉默"这一能力,通过资源约束动态调整沟通策略,从冗长走向沉默。
Key Takeaways
- 多智能体系统在大型语言模型上遭遇沟通效率问题,”自由放任”式的沟通导致成本上升和信号噪声比失衡。
- 提出资源理性的概念,强调沟通应考虑到资源的稀缺性。
- 引入动态拍卖语言智能体(DALA)框架,将沟通视为集中拍卖过程,智能体基于信息价值密度进行发言决策。
- DALA框架通过鼓励产生简洁且信息丰富的消息来过滤掉低价值的沟通。
- DALA在多个基准测试中表现优异,实现了高效且高性能的沟通效果。
- 与现有方法相比,DALA显著减少了资源消耗。
点此查看论文截图
Conditional Diffusion Model for Multi-Agent Dynamic Task Decomposition
Authors:Yanda Zhu, Yuanyang Zhu, Daoyi Dong, Caihua Chen, Chunlin Chen
Task decomposition has shown promise in complex cooperative multi-agent reinforcement learning (MARL) tasks, which enables efficient hierarchical learning for long-horizon tasks in dynamic and uncertain environments. However, learning dynamic task decomposition from scratch generally requires a large number of training samples, especially exploring the large joint action space under partial observability. In this paper, we present the Conditional Diffusion Model for Dynamic Task Decomposition (C$\text{D}^\text{3}$T), a novel two-level hierarchical MARL framework designed to automatically infer subtask and coordination patterns. The high-level policy learns subtask representation to generate a subtask selection strategy based on subtask effects. To capture the effects of subtasks on the environment, C$\text{D}^\text{3}$T predicts the next observation and reward using a conditional diffusion model. At the low level, agents collaboratively learn and share specialized skills within their assigned subtasks. Moreover, the learned subtask representation is also used as additional semantic information in a multi-head attention mixing network to enhance value decomposition and provide an efficient reasoning bridge between individual and joint value functions. Experimental results on various benchmarks demonstrate that C$\text{D}^\text{3}$T achieves better performance than existing baselines.
任务分解在复杂的合作多智能体强化学习(MARL)任务中显示出巨大的潜力,使动态和不确定环境中的长期任务能够进行高效分层学习。然而,从头开始学习动态任务分解通常需要大量的训练样本,特别是在部分可观测性下探索巨大的联合动作空间。在本文中,我们提出了动态任务分解的条件扩散模型(C$\text{D}^\text{3}$T),这是一种新型的两级分层MARL框架,旨在自动推断子任务和协调模式。高级策略学习子任务表示,基于子任务效果生成子任务选择策略。为了捕捉子任务对环境的影响,C$\text{D}^\text{3}$T使用条件扩散模型预测下一个观察和奖励。在低级,智能体在其分配的子任务内协作学习并共享专业技能。此外,学习到的子任务表示还被用作多头注意力混合网络中的附加语义信息,以增强价值分解,并在个体和联合值函数之间提供有效的推理桥梁。在各种基准测试上的实验结果表明,C$\text{D}^\text{3}$T的性能优于现有基线。
论文及项目相关链接
PDF AAAI 2026
Summary
任务分解在复杂的合作多智能体强化学习(MARL)任务中显示出巨大潜力,能促进长期视野下的高效层次性学习。本论文提出了动态任务分解的条件扩散模型(CD^3T),这是一个新的两级层次化的MARL框架,用于自动推断子任务和协调模式。高层策略基于子任务效果学习子任务表示,生成子任务选择策略。CD^3T使用条件扩散模型预测下一个观察和奖励,以捕捉子任务对环境的影响。在低层次上,智能体在其分配的子任务内学习并共享专业技能。此外,学习的子任务表示还用作多头注意力混合网络的附加语义信息,以增强价值分解并提供个体和联合价值函数之间的有效推理桥梁。在多个基准测试上的实验结果表明,CD^3T的性能优于现有基线。
Key Takeaways
- 任务分解在多智能体强化学习任务中展现潜力,特别是在复杂、动态和不确定环境中进行长期任务学习时。
- 动态任务分解的条件扩散模型(CD^3T)是一个两级层次化的MARL框架,可自动推断子任务和协调模式。
- 高层策略学习子任务表示,基于子任务效果生成子任务选择策略。
- CD^3T通过预测下一个观察和奖励来捕捉子任务对环境的影响。
- 低层次上,智能体在分配的子任务内学习和共享专业技能。
- 学习的子任务表示用于增强价值分解并提供个体和联合价值函数之间的推理桥梁。
点此查看论文截图
Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction
Authors:Quanjiang Guo, Sijie Wang, Jinchuan Zhang, Ben Zhang, Zhao Kang, Ling Tian, Ke Yan
Zero-shot event extraction (ZSEE) remains a significant challenge for large language models (LLMs) due to the need for complex reasoning and domain-specific understanding. Direct prompting often yields incomplete or structurally invalid outputs–such as misclassified triggers, missing arguments, and schema violations. To address these limitations, we present Agent-Event-Coder (AEC), a novel multi-agent framework that treats event extraction like software engineering: as a structured, iterative code-generation process. AEC decomposes ZSEE into specialized subtasks–retrieval, planning, coding, and verification–each handled by a dedicated LLM agent. Event schemas are represented as executable class definitions, enabling deterministic validation and precise feedback via a verification agent. This programming-inspired approach allows for systematic disambiguation and schema enforcement through iterative refinement. By leveraging collaborative agent workflows, AEC enables LLMs to produce precise, complete, and schema-consistent extractions in zero-shot settings. Experiments across five diverse domains and six LLMs demonstrate that AEC consistently outperforms prior zero-shot baselines, showcasing the power of treating event extraction like code generation. The code and data are released on https://github.com/UESTC-GQJ/Agent-Event-Coder.
零样本事件抽取(ZSEE)对于大型语言模型(LLM)而言仍是一项重大挑战,因为它需要复杂推理和特定领域的理解。直接提示往往产生不完整或结构无效的输出,例如触发词分类错误、论元缺失以及违反事件模式。为了解决这些局限,我们提出了 Agent-Event-Coder(AEC)这一新型多智能体框架,它像软件工程一样看待事件抽取:将其视为一个结构化、迭代的代码生成过程。AEC 将 ZSEE 分解为检索、规划、编码和验证等专门子任务,每个子任务由一个专门的 LLM 智能体处理。事件模式被表示为可执行的类定义,借助验证智能体实现确定性校验和精确反馈。这种受编程启发的方法可以通过迭代细化实现系统性消歧和模式约束。通过协作式智能体工作流,AEC 使 LLM 能够在零样本设置下产出精确、完整且符合模式的抽取结果。在五个不同领域、六个 LLM 上的实验表明,AEC 持续优于先前的零样本基线,展示了"像生成代码一样做事件抽取"的威力。相关代码和数据已发布在 https://github.com/UESTC-GQJ/Agent-Event-Coder。
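"事件模式写成可执行的类定义 + 校验代理做确定性验证"这一思路可以用几行代码示意。下面是一个极简的 Python 草图;它并非 Agent-Event-Coder 的官方实现,Attack 模式及其字段均为演示用假设。

```python
# 示意性草图:把事件模式写成可执行的类定义,并用校验代理做确定性验证
# (非 Agent-Event-Coder 官方实现;Attack 模式与字段均为演示用假设)
from dataclasses import dataclass, fields

@dataclass
class Attack:                      # 事件模式 = 可执行的类定义
    trigger: str
    attacker: str = ""
    target: str = ""
    place: str = ""

def verify(event: Attack) -> list:
    """校验代理:返回违反模式的精确反馈,供编码代理迭代修正。"""
    problems = []
    if not event.trigger:
        problems.append("缺少触发词 trigger")
    for f in fields(event):
        if not isinstance(getattr(event, f.name), str):
            problems.append(f"字段 {f.name} 类型应为 str")
    return problems

draft = Attack(trigger="袭击", attacker="武装分子")   # 编码代理产出的候选抽取
print(verify(draft) or "通过模式校验")
```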
论文及项目相关链接
PDF 11 pages, 5 figures, accepted by AAAI 2026 (Oral)
Summary
零样本事件抽取(ZSEE)对大型语言模型(LLM)而言仍是一项重大挑战,因为它需要复杂推理和领域特定知识,而直接提示往往产生不完整或结构无效的输出。为此,我们提出了 Agent-Event-Coder(AEC)这一新颖的多智能体框架,它将事件抽取视为软件工程般的结构化、迭代代码生成过程。AEC 将 ZSEE 分解为专业化的子任务,每个子任务由专门的 LLM 智能体处理。事件模式被表示为可执行的类定义,通过验证智能体实现确定性校验和精确反馈。这种受编程启发的方法允许通过迭代细化进行系统性消歧和模式约束。通过协作式智能体工作流,AEC 使 LLM 能够在零样本设置中生成精确、完整且符合模式的事件抽取结果。在五个不同领域和六个 LLM 上的实验表明,AEC 持续优于先前的零样本基线,展示了将事件抽取视为代码生成的能力。
Key Takeaways
- ZSEE对LLMs而言是一项挑战,因它要求复杂的推理和领域特定知识。
- 直接提示可能导致输出不完整或结构无效。
- AEC是一个多代理框架,将事件抽取比作软件工程的代码生成过程。
- AEC将ZSEE任务分解为多个子任务,每个子任务由专门的LLM代理处理。
- 事件模式被表示为可执行的类定义,以实现确定性验证和精确反馈。
- AEC通过迭代细化促进系统消歧和模式强制执行。
点此查看论文截图
DeepSport: A Multimodal Large Language Model for Comprehensive Sports Video Reasoning via Agentic Reinforcement Learning
Authors:Junbo Zou, Haotian Xia, Zhen Ye, Shengjie Zhang, Christopher Lai, Vicente Ordonez, Weining Shen, Hanjie Chen
Sports video understanding presents unique challenges, requiring models to perceive high-speed dynamics, comprehend complex rules, and reason over long temporal contexts. While Multimodal Large Language Models (MLLMs) have shown promise in genral domains, the current state of research in sports remains narrowly focused: existing approaches are either single-sport centric, limited to specific tasks, or rely on training-free paradigms that lack robust, learned reasoning process. To address this gap, we introduce DeepSport, the first end-to-end trained MLLM framework designed for multi-task, multi-sport video understanding. DeepSport shifts the paradigm from passive frame processing to active, iterative reasoning, empowering the model to ``think with videos’’ by dynamically interrogating content via a specialized frame-extraction tool. To enable this, we propose a data distillation pipeline that synthesizes high-quality Chain-of-Thought (CoT) trajectories from 10 diverse data source, creating a unified resource of 78k training data. We then employ a two-stage training strategy, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) with a novel gated tool-use reward, to optimize the model’s reasoning process. Extensive experiments on the testing benchmark of 6.7k questions demonstrate that DeepSport achieves state-of-the-art performance, significantly outperforming baselines of both proprietary model and open-source models. Our work establishes a new foundation for domain-specific video reasoning to address the complexities of diverse sports.
体育视频理解带来了独特的挑战:模型需要感知高速动态、理解复杂规则,并在较长的时间上下文中进行推理。尽管多模态大型语言模型(MLLM)已在通用领域展现出潜力,但体育领域的现有研究仍较为狭窄:现有方法要么以单一运动为中心、局限于特定任务,要么依赖无需训练(training-free)的范式,缺乏稳健的、可学习的推理过程。为填补这一空白,我们推出了 DeepSport,这是首个为多任务、多运动视频理解设计的端到端训练 MLLM 框架。DeepSport 将范式从被动的帧处理转变为主动的迭代推理,借助专用的抽帧工具动态地"向视频发问",使模型能够"带着视频思考"。为此,我们提出了一个数据蒸馏管道,从 10 个不同的数据源中合成高质量的思维链(CoT)轨迹,构建了包含 7.8 万条训练数据的统一资源。随后,我们采用两阶段训练策略:先进行监督微调(SFT),再以带有新型门控工具使用奖励的强化学习(RL)优化模型的推理过程。在包含 6700 个问题的测试基准上的大量实验表明,DeepSport 取得了最先进的性能,显著优于专有模型和开源模型的基线。我们的工作为应对多样化运动的复杂性,奠定了面向特定领域视频推理的新基础。
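"门控工具使用奖励"的直觉是:只有在合理调用抽帧工具的前提下,答对才能获得满额奖励。下面给出一个极简的 Python 草图;它并非 DeepSport 的官方实现,门控条件与折扣系数均为演示用假设。

```python
# 示意性草图:带"门控工具使用奖励"的 RL 奖励函数(非 DeepSport 官方实现;门控规则为假设)
def gated_tool_reward(correct: bool, used_frame_tool: bool, tool_calls: int,
                      max_calls: int = 4) -> float:
    """只有在合理使用抽帧工具(至少一次且不超上限)时,正确答案才获得满额奖励。"""
    gate = 1.0 if (used_frame_tool and tool_calls <= max_calls) else 0.2
    return gate * (1.0 if correct else 0.0)

# 两条轨迹:一条按要求"边看视频边思考",一条不调用工具直接作答
print(gated_tool_reward(correct=True, used_frame_tool=True, tool_calls=2))    # 1.0
print(gated_tool_reward(correct=True, used_frame_tool=False, tool_calls=0))   # 0.2
```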
论文及项目相关链接
Summary
本文介绍了针对体育视频理解的挑战,包括高速动态感知、复杂规则理解和长期上下文推理等。为了解决现有方法单一运动中心化、任务特定或缺乏强大推理过程的问题,提出了一种新的多任务、多运动的端到端训练的多模态大型语言模型框架——DeepSport。DeepSport采用主动迭代推理的方法,通过专用帧提取工具动态询问视频内容,实现被动帧处理向主动思考的转变。实验证明,DeepSport在测试集上取得了最佳性能,显著优于专有模型和开源模型的基线。本文为特定领域的视频推理提供了新的基础,以应对各种运动的复杂性。
Key Takeaways
- 体育视频理解面临感知高速动态、理解复杂规则和推理长期上下文的独特挑战。
- 当前的多模态大型语言模型(MLLMs)在一般领域已有应用前景,但在体育领域的研究仍存在局限性。
- DeepSport是首个为多任务、多运动视频理解设计的端到端训练的多模态大型语言模型框架。
- DeepSport采用主动迭代推理的方法,通过专用帧提取工具对视频内容进行动态询问,实现了从被动帧处理到主动思考的转变。
- DeepSport通过数据蒸馏管道合成高质量的思考轨迹数据,并采用两阶段训练策略进行优化。
- 实验证明,DeepSport在测试集上取得了最佳性能,显著优于基线模型。
点此查看论文截图
Hybrid Retrieval-Augmented Generation Agent for Trustworthy Legal Question Answering in Judicial Forensics
Authors:Yueqing Xi, Yifan Bai, Huasen Luo, Weiliang Wen, Hui Liu, Haoliang Li
As artificial intelligence permeates judicial forensics, ensuring the veracity and traceability of legal question answering (QA) has become critical. Conventional large language models (LLMs) are prone to hallucination, risking misleading guidance in legal consultation, while static knowledge bases struggle to keep pace with frequently updated statutes and case law. We present a hybrid legal QA agent tailored for judicial settings that integrates retrieval-augmented generation (RAG) with multi-model ensembling to deliver reliable, auditable, and continuously updatable counsel. The system prioritizes retrieval over generation: when a trusted legal repository yields relevant evidence, answers are produced via RAG; otherwise, multiple LLMs generate candidates that are scored by a specialized selector, with the top-ranked answer returned. High-quality outputs then undergo human review before being written back to the repository, enabling dynamic knowledge evolution and provenance tracking. Experiments on the Law_QA dataset show that our hybrid approach significantly outperforms both a single-model baseline and a vanilla RAG pipeline on F1, ROUGE-L, and an LLM-as-a-Judge metric. Ablations confirm the complementary contributions of retrieval prioritization, model ensembling, and the human-in-the-loop update mechanism. The proposed system demonstrably reduces hallucination while improving answer quality and legal compliance, advancing the practical landing of media forensics technologies in judicial scenarios.
随着人工智能渗透到司法取证领域,确保法律问答(QA)的真实性与可追溯性变得至关重要。传统的大型语言模型(LLM)容易产生幻觉,在法律咨询中有误导风险,而静态知识库难以跟上频繁更新的法规与判例。我们提出了一个面向司法场景的混合法律问答代理,它将检索增强生成(RAG)与多模型集成相结合,以提供可靠、可审计且可持续更新的咨询。该系统以检索优先于生成:当可信的法律库检索到相关证据时,通过 RAG 生成答案;否则由多个 LLM 生成候选答案,由专门的选择器打分并返回排名最高者。高质量的输出再经人工审核后写回知识库,实现知识的动态演化与来源追踪。在 Law_QA 数据集上的实验表明,我们的混合方法在 F1、ROUGE-L 以及 LLM-as-a-Judge 指标上均显著优于单模型基线和朴素的 RAG 管道。消融实验证实了检索优先、模型集成与人在回路更新机制的互补作用。所提出的系统在提升答案质量与法律合规性的同时显著减少了幻觉,推动了媒体取证技术在司法场景中的实际落地。
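"检索优先,检索不到再走多模型集成 + 选择器"的路由逻辑可以用几行代码示意。下面是一个极简的 Python 草图;它并非论文官方实现,检索、打分与法条内容均为演示用占位假设。

```python
# 示意性草图:"检索优先,否则多模型集成 + 选择器"的法律问答路由(非论文官方实现)
def search_repository(question: str, repo: dict) -> list:
    return [v for k, v in repo.items() if k in question]          # 占位:可信法条库检索

def rag_answer(question: str, evidence: list) -> str:
    return f"依据 {evidence} 回答:{question}"

def ensemble_answer(question: str) -> str:
    candidates = [f"模型{i} 的候选答案" for i in range(3)]          # 多个 LLM 生成候选
    scores = [len(c) for c in candidates]                          # 占位:专用选择器打分
    return candidates[scores.index(max(scores))]

def legal_qa(question: str, repo: dict) -> str:
    evidence = search_repository(question, repo)
    answer = rag_answer(question, evidence) if evidence else ensemble_answer(question)
    # 高质量答案经人工审核后写回 repo,实现知识动态演化与溯源(此处省略)
    return answer

repo = {"劳动合同": "《劳动合同法》第十条:建立劳动关系,应当订立书面劳动合同。"}
print(legal_qa("未签劳动合同如何维权?", repo))
```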
论文及项目相关链接
Summary:人工智能在司法取证中的应用越来越普遍,对法律问答的真实性和可追溯性要求愈加严格。传统的通用大型语言模型容易产生幻觉,在法律咨询中造成误导,而静态知识库难以跟上不断更新的法律和判例。本文提出了一种针对司法环境的混合法律问答代理,融合了检索增强生成技术与多模型集成技术,能够提供可靠、可审计和持续更新的咨询意见。该系统优先检索相关证据并用 RAG 作答;检索不到时由多个模型生成候选并由选择器挑选最佳答案。高质量输出通过人工审核后写入知识库,实现动态知识更新和溯源跟踪。实验表明,该系统在法律问答数据集上的表现优于单一模型和纯检索增强生成管道。该系统显著减少了幻觉现象,提高了答案质量和法律合规性,促进了媒体取证技术在司法场景的实际应用落地。
Key Takeaways:
- 人工智能在司法领域的运用日益普及,法律问答的真实性和可追溯性成为关键考量因素。
- 传统大型语言模型在司法问答中存在缺陷,易产生误导信息。
- 静态知识库难以适应法律和判例频繁更新的需求。
- 混合法律问答代理通过检索增强生成技术和多模型集成技术提供可靠答案。
- 系统优先检索证据,结合生成和选择模型回答问题,确保答案质量。
- 高质量答案通过人工审核后更新至知识库,实现知识动态更新和溯源跟踪。
点此查看论文截图
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Authors:Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee
Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that aligned LLMs can become unintentionally misaligned, leading to a higher likelihood of executing harmful tasks and a reduced tendency to refuse them when fine-tuned to execute agentic tasks. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Specifically, we introduce an iterative approach that alternates between (1) generating candidate prefixes and (2) selecting those that optimize both task performance and refusal behavior. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness. PING consistently outperforms existing prompting approaches across diverse benchmarks in both web navigation and code generation tasks. Our analysis of internal hidden states via linear probes reveals that prefix tokens are crucial for behavior modification, explaining the performance gains. WARNING: This paper contains contents that are unethical or offensive in nature.
大型语言模型(LLM)已超越简单的文本生成,演化为能够进行规划并与外部工具交互以解决复杂任务的代理系统。这一演化通常需要在特定代理任务上对 LLM 进行微调以提升其熟练度,然而安全问题在微调过程中经常被忽视。在这项工作中,我们表明:当针对代理任务进行微调时,已对齐的 LLM 可能会在无意中失去对齐,导致其更可能执行有害任务、更不倾向于拒绝这类请求。为应对这些安全挑战,我们提出了前缀注入防护(Prefix INjection Guard,PING),这是一种简单而有效的方法:在代理响应之前添加自动生成的自然语言前缀,引导其拒绝有害请求,同时保持在良性任务上的性能。具体而言,我们引入一种迭代方法,交替进行(1)生成候选前缀和(2)选择那些在任务性能和拒绝行为上均最优的前缀。实验结果表明,PING 在不牺牲代理有效性的前提下显著提升了微调后 LLM 代理的安全性;在网页导航和代码生成任务的多种基准上,PING 均持续优于现有的提示方法。我们通过线性探针对内部隐藏状态的分析表明,前缀令牌对行为修正至关重要,这解释了性能的提升。警告:本文包含在本质上不道德或具有冒犯性的内容。
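"生成候选前缀、按(任务得分 + 拒绝得分)选优"的迭代搜索可以用几行代码示意。下面是一个极简的 Python 草图;它并非 PING 的官方实现,evaluate_task、evaluate_refusal 等打分函数均为演示用占位假设,真实系统需在微调后的代理上实际评测。

```python
# 示意性草图:PING 式"生成候选前缀 -> 按(任务得分, 拒绝得分)选优"的迭代搜索(非官方实现)
import random

def evaluate_task(prefix: str) -> float:
    return random.uniform(0.7, 1.0)                     # 占位:良性任务成功率

def evaluate_refusal(prefix: str) -> float:
    return min(1.0, 0.2 + 0.1 * len(prefix.split()))    # 占位:对有害请求的拒绝率

def generate_candidates(seed: str, n: int = 4) -> list:
    return [f"{seed}(变体{i})" for i in range(n)]        # 占位:由 LLM 生成自然语言前缀

def search_prefix(seed: str, rounds: int = 3) -> str:
    best = seed
    for _ in range(rounds):                              # 交替进行:生成候选 -> 评分选优
        cands = generate_candidates(best)
        best = max(cands, key=lambda p: evaluate_task(p) + evaluate_refusal(p))
    return best                                          # 最终前缀将被预置到代理响应之前

random.seed(0)
print(search_prefix("在执行任何操作前,先确认请求是否安全、合规。"))
```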
论文及项目相关链接
PDF Accepted at AAAI 2026 AI Alignment Track, Source code: https://github.com/HahmDY/agentic-ft-safety
Summary
大型语言模型(LLMs)已进化为能够规划并与外部工具交互以完成复杂任务的代理系统。然而,在微调过程中,安全性问题常被忽视。本研究表明,针对代理任务微调时,已对齐的 LLM 可能会意外失去对齐,导致更容易执行有害任务,并减少拒绝它们的倾向。为此,我们提出了 Prefix INjection Guard(PING)方法,通过在代理响应之前添加自动生成的自然语言前缀来引导它们拒绝有害请求,同时在良性任务上保持性能。实验结果表明,PING 在不影响 LLM 代理效率的情况下显著提高了其安全性。与现有的提示方法相比,PING 在各种网页导航和代码生成任务中都表现更佳。通过线性探针分析内部隐藏状态,我们发现前缀令牌对于行为修正至关重要,这也解释了性能的提升。
Key Takeaways
- LLMs已进化为能够完成复杂任务的代理系统,涉及规划及与外部工具交互。
- 在微调LLMs以完成特定任务时,安全性问题常被忽视。
- 对齐的LLMs可能会意外出现不对齐,导致更容易执行有害任务。
- 提出了一种名为PING的方法,通过向代理响应添加自然语言前缀来指导LLMs拒绝有害请求。
- PING方法能在不牺牲LLM代理效率的情况下提高安全性。
- 在多种任务中,PING的表现优于现有提示方法。
点此查看论文截图
PASS: Probabilistic Agentic Supernet Sampling for Interpretable and Adaptive Chest X-Ray Reasoning
Authors:Yushi Feng, Junye Du, Yingying Hong, Qifan Wang, Lequan Yu
Existing tool-augmented agentic systems are limited in the real world by (i) black-box reasoning steps that undermine trust of decision-making and pose safety risks, (ii) poor multimodal integration, which is inherently critical for healthcare tasks, and (iii) rigid and computationally inefficient agentic pipelines. We introduce PASS (Probabilistic Agentic Supernet Sampling), the first multimodal framework to address these challenges in the context of Chest X-Ray (CXR) reasoning. PASS adaptively samples agentic workflows over a multi-tool graph, yielding decision paths annotated with interpretable probabilities. Given the complex CXR reasoning task with multimodal medical data, PASS leverages its learned task-conditioned distribution over the agentic supernet. Thus, it adaptively selects the most suitable tool at each supernet layer, offering probability-annotated trajectories for post-hoc audits and directly enhancing medical AI safety. PASS also continuously compresses salient findings into an evolving personalized memory, while dynamically deciding whether to deepen its reasoning path or invoke an early exit for efficiency. To optimize a Pareto frontier balancing performance and cost, we design a novel three-stage training procedure, including expert knowledge warm-up, contrastive path-ranking, and cost-aware reinforcement learning. To facilitate rigorous evaluation, we introduce CAB-E, a comprehensive benchmark for multi-step, safety-critical, free-form CXR reasoning. Experiments across various benchmarks validate that PASS significantly outperforms strong baselines in multiple metrics (e.g., accuracy, AUC, LLM-J.) while balancing computational costs, pushing a new paradigm shift towards interpretable, adaptive, and multimodal medical agentic systems.
现有的工具增强型智能体系统在现实世界中受到以下限制:(i)黑箱式的推理步骤削弱了决策的可信度并带来安全风险;(ii)多模态整合能力不足,而这对医疗任务至关重要;(iii)智能体流水线僵化且计算效率低下。我们提出 PASS(概率智能体超网采样),这是首个在胸部 X 光(CXR)推理场景下应对这些挑战的多模态框架。PASS 在多工具图上自适应地采样智能体工作流,生成带有可解释概率标注的决策路径。面对涉及多模态医疗数据的复杂 CXR 推理任务,PASS 利用其在智能体超网上学到的任务条件分布,在超网的每一层自适应地选择最合适的工具,提供带概率标注的轨迹以供事后审计,直接增强医疗 AI 的安全性。PASS 还会持续将重要发现压缩进不断演化的个性化记忆,并动态决定是深化推理路径还是提前退出以提高效率。为了在性能与成本之间优化帕累托前沿,我们设计了一种新颖的三阶段训练流程,包括专家知识预热、对比路径排序和成本感知的强化学习。为便于严格评估,我们引入了 CAB-E,这是一个面向多步、安全关键、自由形式 CXR 推理的综合基准。在多种基准上的实验证明,PASS 在多个指标(如准确率、AUC、LLM-J.)上显著优于强基线,同时平衡了计算成本,推动了向可解释、自适应、多模态的医疗智能体系统的范式转变。
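"在工具超网上逐层按条件概率采样工具、记录概率轨迹、并可提前退出"的流程可以用几行代码示意。下面是一个极简的 Python 草图;它并非 PASS 的官方实现,层结构、概率数值与早退规则均为演示用假设。

```python
# 示意性草图:在"工具超网"上按任务条件分布逐层采样工具,并记录可解释的概率轨迹
# (非 PASS 官方实现;层结构、概率与早退规则均为演示用假设)
import random

SUPERNET = [
    {"分割工具": 0.5, "检索工具": 0.3, "直接作答": 0.2},   # 第 1 层候选工具及条件概率
    {"病灶分类": 0.6, "报告生成": 0.4},                    # 第 2 层
]

def pass_sample(early_exit_p: float = 0.25, seed: int = 0):
    random.seed(seed)
    trajectory = []                                        # 概率标注的决策路径,便于事后审计
    for layer in SUPERNET:
        tools, probs = zip(*layer.items())
        choice = random.choices(tools, weights=probs, k=1)[0]
        trajectory.append((choice, layer[choice]))
        if random.random() < early_exit_p:                 # 动态决定提前退出以节省计算
            break
    return trajectory

print(pass_sample())
```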
论文及项目相关链接
Summary
PASS框架解决了现有工具增强代理系统所面临的挑战,包括缺乏信任、多模式整合不足、代理管道僵化及计算效率低下等问题。该框架针对胸部X光(CXR)推理上下文提出概率代理超网采样技术。它自适应地在多工具图上采样代理工作流程,并通过解释性概率标注决策路径。PASS利用在代理超网上的任务条件分布选择最合适的工具,为事后审计提供概率标注轨迹,提高医疗人工智能的安全性。此外,PASS还能压缩重要发现,动态决定深化推理路径或提前退出以提高效率。通过三阶段训练程序优化性能与成本的平衡,包括专家知识预热、对比路径排名和成本感知强化学习。通过严格的CAB-E基准测试,PASS在多个指标上显著优于强大的基线,证明了其在推动解释性、自适应和多模式医疗代理系统方面的新范式转变。
Key Takeaways
- PASS框架解决了现有工具增强代理系统的多个问题,包括信任缺失、多模式整合不足、管道僵化及计算效率低下。
- PASS框架针对胸部X光(CXR)推理上下文采用概率代理超网采样技术,自适应选择最合适的工具,提供概率标注的决策路径。
- PASS框架增强了医疗人工智能的安全性,通过解释性概率标注轨迹为事后审计提供依据。
- PASS能够压缩重要发现并动态调整推理路径和提前退出以提高效率。
- 通过三阶段训练程序平衡性能与成本优化,包括专家知识预热、对比路径排名和成本感知强化学习。
- PASS在多个指标上显著优于现有系统,包括准确性、AUC等。
点此查看论文截图
Aethorix v1.0: An Integrated Scientific AI Agent for Scalable Inorganic Materials Innovation and Industrial Implementation
Authors:Yingjie Shi, Yiru Gong, Yiqun Su, Suya Xiong, Jiale Han, Runtian Miao
Artificial Intelligence (AI) is redefining the frontiers of scientific domains, ranging from drug discovery to meteorological modeling, yet its integration within industrial manufacturing remains nascent and fraught with operational challenges. To bridge this gap, we introduce Aethorix v1.0, an AI agent framework designed to overcome key industrial bottlenecks, demonstrating state-of-the-art performance in materials design innovation and process parameter optimization. Our tool is built upon three pillars: a scientific corpus reasoning engine that streamlines knowledge retrieval and validation, a diffusion-based generative model for zero-shot inverse design, and specialized interatomic potentials that enable faster screening with ab initio fidelity. We demonstrate Aethorix’s utility through a real-world cement production case study, confirming its capacity for integration into industrial workflows and its role in revolutionizing the design-make-test-analyze loop while ensuring rigorous manufacturing standards are met.
人工智能(AI)正在重新定义从药物发现到气象建模等科学领域的前沿,然而其在工业制造中的整合仍处于初级阶段,并面临诸多运营挑战。为弥合这一差距,我们推出了 Aethorix v1.0,这是一个旨在克服关键工业瓶颈的 AI 代理框架,在材料设计创新和工艺参数优化方面展现出最先进的性能。该工具建立在三大支柱之上:用于简化知识检索与验证的科学语料推理引擎、用于零样本逆向设计的基于扩散的生成模型,以及能够在保持从头算(ab initio)精度的同时实现更快筛选的专用原子间势。我们通过一个真实的水泥生产案例研究展示了 Aethorix 的实用性,证实了它融入工业工作流的能力,及其在变革"设计-制造-测试-分析"闭环、同时满足严格制造标准方面的作用。
论文及项目相关链接
Summary
人工智能(AI)在多个科学领域展现出巨大潜力,但在工业制造业中的应用仍处于初级阶段,面临诸多挑战。为解决这一问题,我们推出Aethorix v1.0,一款AI代理框架,旨在克服工业制造中的关键瓶颈,在材料设计创新和工艺参数优化方面展现出卓越性能。该工具基于三大支柱:科学语料推理引擎、扩散生成模型和特殊化原子间势能函数。我们以水泥生产实际案例展示了Aethorix的实用性,证明了其融入工业流程的能力及其在优化设计与制造标准中的作用。
Key Takeaways
- AI在多个科学领域展现潜力,但在工业制造业中的应用尚处于初级阶段。
- Aethorix v1.0是一款旨在克服工业制造瓶颈的AI代理框架。
- Aethorix在材料设计创新和工艺参数优化方面表现出卓越性能。
- Aethorix基于三大支柱构建:科学语料推理引擎、扩散生成模型和特殊化原子间势能函数。
- Aethorix工具能融入工业流程,优化设计与制造标准。
- 通过水泥生产实际案例展示了Aethorix的实用性。
点此查看论文截图
EcoAgent: An Efficient Device-Cloud Collaborative Multi-Agent Framework for Mobile Automation
Authors:Biao Yi, Xavier Hu, Yurun Chen, Shengyu Zhang, Hongxia Yang, Fan Wu
To tackle increasingly complex tasks, recent research on mobile agents has shifted towards multi-agent collaboration. Current mobile multi-agent systems are primarily deployed in the cloud, leading to high latency and operational costs. A straightforward idea is to deploy a device-cloud collaborative multi-agent system, which is nontrivial, as directly extending existing systems introduces new challenges: (1) reliance on cloud-side verification requires uploading mobile screenshots, compromising user privacy; and (2) open-loop cooperation lacking device-to-cloud feedback, underutilizing device resources and increasing latency. To overcome these limitations, we propose EcoAgent, a closed-loop device-cloud collaborative multi-agent framework designed for privacy-aware, efficient, and responsive mobile automation. EcoAgent integrates a novel reasoning approach, Dual-ReACT, into the cloud-based Planning Agent, fully exploiting cloud reasoning to compensate for limited on-device capacity, thereby enabling device-side verification and lightweight feedback. Furthermore, the device-based Observation Agent leverages a Pre-understanding Module to summarize screen content into concise textual descriptions, significantly reducing token usage and device-cloud communication overhead while preserving privacy. Experiments on AndroidWorld demonstrate that EcoAgent matches the task success rates of fully cloud-based agents, while reducing resource consumption and response latency. Our project is available here: https://github.com/Yi-Biao/EcoAgent.
为应对日益复杂的任务,近期关于移动智能体的研究已转向多智能体协作。当前的移动多智能体系统主要部署在云端,导致高延迟和高运营成本。一个直接的想法是部署设备-云协同的多智能体系统,但这并不简单:直接扩展现有系统会带来新的挑战,(1)依赖云端验证需要上传手机截图,损害用户隐私;(2)开环协作缺乏设备到云端的反馈,使设备资源利用不足并增加延迟。为克服这些局限,我们提出了 EcoAgent,这是一个面向隐私保护、高效且响应迅速的移动自动化的闭环设备-云协同多智能体框架。EcoAgent 将一种新颖的推理方法 Dual-ReACT 整合到基于云的规划智能体中,充分利用云端推理来弥补设备端能力的不足,从而实现设备端验证和轻量级反馈。此外,基于设备的观测智能体利用预理解模块将屏幕内容总结为简洁的文本描述,在保护隐私的同时显著减少令牌使用和设备-云通信开销。在 AndroidWorld 上的实验表明,EcoAgent 的任务成功率与完全基于云的智能体相当,同时降低了资源消耗和响应延迟。我们的项目可通过以下链接获取:https://github.com/Yi-Biao/EcoAgent。
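"设备端预理解模块把屏幕压缩为简短文本、云端只基于文本规划、设备端再做轻量验证"的闭环可以用几行代码示意。下面是一个极简的 Python 草图;它并非 EcoAgent 的官方实现,summarize_screen、cloud_plan、device_verify 等函数均为演示用占位假设。

```python
# 示意性草图:设备端"预理解模块"把屏幕内容压缩为简短文本,再与云端规划器做闭环
# (非 EcoAgent 官方实现;各函数均为占位假设)
def summarize_screen(ui_elements: list) -> str:
    """仅上传简洁文本描述而非截图,降低令牌开销并保护隐私。"""
    return ";".join(f"{e['type']}:{e['text']}" for e in ui_elements[:5])

def cloud_plan(task: str, screen_summary: str) -> str:
    return f"点击『{screen_summary.split(':')[-1]}』"          # 占位:云端规划下一步动作

def device_verify(action: str, ui_elements: list) -> bool:
    return any(e["text"] in action for e in ui_elements)       # 设备端轻量验证并反馈

screen = [{"type": "按钮", "text": "发送"}, {"type": "输入框", "text": "收件人"}]
summary = summarize_screen(screen)
action = cloud_plan("发一封邮件", summary)
print(action, "| 设备端验证:", device_verify(action, screen))
```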
论文及项目相关链接
PDF Accepted by AAAI 2026
Summary:
移动多智能体系统面临复杂任务挑战,当前主要部署在云端,导致高延迟和运营成本。为克服这些限制,提出EcoAgent框架,采用设备云协同合作方式,集成云规划智能体和设备观测智能体,实现隐私感知、高效和响应迅速的移动自动化。新提出一种推理方法——Dual-ReACT,能充分利用云推理来弥补设备容量的局限,并实现设备端验证和轻量级反馈。此外,通过预理解模块将屏幕内容简化为简洁的文本描述,减少通信开销并保护隐私。实验证明EcoAgent在任务成功率上匹配全云智能体,同时降低资源消耗和响应延迟。
Key Takeaways:
- 当前移动多智能体系统面临部署在云端的问题,导致高延迟和运营成本。
- 设备云协同合作是解决此问题的一种有效方法。
- EcoAgent框架结合了云规划智能体和设备观测智能体以实现隐私感知的移动自动化。
- Dual-ReACT推理方法利用云推理以弥补设备容量限制。
- 预理解模块能将屏幕内容转化为简洁文本描述,减少通信开销并保护隐私。
- EcoAgent框架的实验结果表明其在任务成功率上匹配全云智能体性能。
点此查看论文截图
LLM-Powered GUI Agents in Phone Automation: Surveying Progress and Prospects
Authors:Guangyi Liu, Pengxiang Zhao, Yaozhen Liang, Liang Liu, Yaxuan Guo, Han Xiao, Weifeng Lin, Yuxiang Chai, Yue Han, Shuai Ren, Hao Wang, Xiaoyu Liang, WenHao Wang, Tianze Wu, Zhengxi Lu, Siheng Chen, LiLinghao, Hao Wang, Guanjing Xiong, Yong Liu, Hongsheng Li
With the rapid rise of large language models (LLMs), phone automation has undergone transformative changes. This paper systematically reviews LLM-driven phone GUI agents, highlighting their evolution from script-based automation to intelligent, adaptive systems. We first contextualize key challenges, (i) limited generality, (ii) high maintenance overhead, and (iii) weak intent comprehension, and show how LLMs address these issues through advanced language understanding, multimodal perception, and robust decision-making. We then propose a taxonomy covering fundamental agent frameworks (single-agent, multi-agent, plan-then-act), modeling approaches (prompt engineering, training-based), and essential datasets and benchmarks. Furthermore, we detail task-specific architectures, supervised fine-tuning, and reinforcement learning strategies that bridge user intent and GUI operations. Finally, we discuss open challenges such as dataset diversity, on-device deployment efficiency, user-centric adaptation, and security concerns, offering forward-looking insights into this rapidly evolving field. By providing a structured overview and identifying pressing research gaps, this paper serves as a definitive reference for researchers and practitioners seeking to harness LLMs in designing scalable, user-friendly phone GUI agents. The collection of papers reviewed in this survey will be hosted and regularly updated on the GitHub repository: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
随着大型语言模型(LLM)的迅速崛起,电话自动化已经经历了变革性的改变。本文系统地回顾了LLM驱动的电话GUI代理,重点介绍了它们从基于脚本的自动化向智能、自适应系统的演变。我们首先分析了关键挑战,包括(i)通用性有限、(ii)维护成本高和(iii)意图理解弱,并展示了LLM如何通过高级语言理解、多模式感知和稳健的决策来解决这些问题。然后,我们提出了一个分类,涵盖了基本的代理框架(单代理、多代理、计划后行动)、建模方法(提示工程、基于训练)、基本的数据集和基准测试。此外,我们详细介绍了任务特定的架构、监督微调、强化学习策略,这些策略能够桥接用户意图和GUI操作。最后,我们讨论了开放挑战,如数据集多样性、设备部署效率、用户中心适应性和安全问题,为这一快速演变的领域提供了前瞻性见解。本文通过提供结构化概述并确定紧迫的研究空白,成为研究者和从业者在设计可扩展、用户友好的电话GUI代理时利用LLM的权威参考。本文所回顾的论文集合将托管在GitHub仓库中并定期更新:https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents。
论文及项目相关链接
PDF Paper accepted to TMLR 2025, Project Homepage: https://github.com/PhoneLLM/Awesome-LLM-Powered-Phone-GUI-Agents
Summary
随着大型语言模型(LLM)的快速发展,电话自动化经历了深刻变革。本文全面回顾了LLM驱动的电话GUI代理,突出其从基于脚本的自动化向智能、自适应系统的演变。文章首先介绍了关键挑战,包括通用性有限、维护成本高以及意图理解薄弱,并展示了LLM如何通过高级语言理解、多模态感知和稳健决策来应对这些挑战。文章还提出了涵盖代理框架、建模方法、基本数据集和基准测试的分类体系(taxonomy),并详细描述了任务特定架构、监督微调以及强化学习策略,这些策略能够架起用户意图和GUI操作之间的桥梁。最后,文章讨论了数据集多样性、设备端部署效率、以用户为中心的适应性和安全顾虑等开放挑战,为这一快速演变领域提供了前瞻性的见解。
Key Takeaways
- LLM的发展推动了电话自动化的变革。
- LLM解决了电话自动化中的关键挑战,如通用性、维护成本和意图理解。
- LLM通过高级语言理解、多模式感知和稳健决策实现了电话自动化的智能化和自适应化。
- 文章提出了一个涵盖代理框架、建模方法、数据集和基准测试的分类体系(taxonomy)。
- 用户意图和GUI操作之间的桥梁可以通过任务特定架构、监督微调和强化学习策略来建立。
- 数据集多样性、设备部署效率、用户中心适应性和安全顾虑仍是当前开放挑战。
- 该论文为研究人员和从业者提供了一个关于如何在设计中有效利用LLM的宝贵参考,以创建可扩展和用户友好的电话GUI代理。
点此查看论文截图
Competence-Aware AI Agents with Metacognition for Unknown Situations and Environments (MUSE)
Authors:Rodolfo Valiente, Praveen K. Pilly
Metacognition, defined as the awareness and regulation of one’s cognitive processes, is central to human adaptability in unknown situations. In contrast, current autonomous agents often struggle in novel environments due to their limited capacity for adaptation. We hypothesize that metacognition is a critical missing ingredient in autonomous agents for the cognitive flexibility needed to tackle unfamiliar challenges. Given the broad scope of metacognitive abilities, we focus on competence awareness and strategy selection. To this end, we propose the Metacognition for Unknown Situations and Environments (MUSE) framework to integrate metacognitive processes of self-assessment and self-regulation into autonomous agents. We present two implementations of MUSE: one based on world modeling and another leveraging large language models (LLMs). Our system continually learns to assess its competence on a given task and uses this self-assessment to guide iterative cycles of strategy selection. MUSE agents demonstrate high competence awareness and significant improvements in self-regulation for solving novel, out-of-distribution tasks more effectively compared to model-based reinforcement learning and purely prompt-based LLM agent approaches. This work highlights the promise of approaches inspired by cognitive and neural systems in enabling autonomous agents to adapt to new environments while mitigating the heavy reliance on extensive training data and large models for the current models.
元认知被定义为对自身认知过程的觉察与调节,是人类在未知情境中具备适应性的核心。相比之下,当前的自主智能体由于适应能力有限,常常在新环境中表现不佳。我们假设,元认知正是自主智能体所缺失的关键要素,而它是应对陌生挑战所需认知灵活性的来源。鉴于元认知能力范围广泛,我们聚焦于能力意识(competence awareness)和策略选择。为此,我们提出了"面向未知情境与环境的元认知"(MUSE)框架,将自我评估与自我调节的元认知过程整合到自主智能体中。我们给出了 MUSE 的两种实现:一种基于世界建模,另一种利用大型语言模型(LLM)。我们的系统持续学习评估自己在给定任务上的能力,并利用这种自我评估来指导策略选择的迭代循环。与基于模型的强化学习以及纯提示式的 LLM 智能体方法相比,MUSE 智能体表现出较高的能力意识,并在自我调节方面有显著提升,从而更有效地解决新颖的分布外任务。这项工作凸显了受认知与神经系统启发的方法在帮助自主智能体适应新环境方面的前景,同时缓解了当前模型对海量训练数据和大规模模型的严重依赖。
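"用能力自评(按策略统计的成功率)来驱动策略选择与自我调节"这一循环可以用几行代码示意。下面是一个极简的 Python 草图;它并非 MUSE 的官方实现,策略集合、阈值与模拟环境的成功概率均为演示用假设。

```python
# 示意性草图:能力自评(competence awareness)驱动的策略选择循环(非 MUSE 官方实现)
import random

class MuseAgent:
    def __init__(self, strategies):
        # 对每个策略维护"成功/尝试"计数,作为能力自评的依据
        self.stats = {s: [0, 0] for s in strategies}

    def competence(self, s):
        succ, tries = self.stats[s]
        return succ / tries if tries else 0.5          # 未尝试过的策略给中性先验

    def select(self, threshold=0.4):
        best = max(self.stats, key=self.competence)
        if self.competence(best) < threshold:          # 自评过低 -> 自我调节:探索其他策略
            return random.choice(list(self.stats))
        return best

    def update(self, s, success):
        self.stats[s][1] += 1
        self.stats[s][0] += int(success)

random.seed(1)
agent = MuseAgent(["规划式", "反应式", "求助工具"])
for _ in range(10):                                    # 在模拟环境中迭代:选择 -> 执行 -> 自评更新
    s = agent.select()
    agent.update(s, success=random.random() < 0.6)
print({s: round(agent.competence(s), 2) for s in agent.stats})
```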
论文及项目相关链接
PDF Replaced all references to “self-awareness” with the more accurate term “self-assessment”; Updated Figure 2; Added recent pertinent work from the cognitive computational neuroscience literature; Removed the non-apples-to-apples comparison with Dreamer-v3 for self-assessment; Added additional experiments to validate the role of accurate self-assessment in effective self-regulation
Summary:
认知过程中的自我意识和调节——即元认知,对人类适应未知情境至关重要。相比之下,当前的自主代理人在面对新环境时常常遇到困难,因为它们缺乏适应能力。我们假设自主代理人缺乏应对未知挑战所需的认知灵活性,而元认知是一个关键要素。因此,我们提出了面向未知情境和环境(MUSE)的框架,旨在将自主代理人的自我评估和自我调节等元认知过程整合起来。我们展示了两个MUSE实现方案:一个基于世界建模,另一个利用大型语言模型(LLM)。我们的系统不断学习评估其在给定任务上的能力,并使用这种自我评估来指导策略选择的迭代周期。与基于模型的强化学习和纯提示型LLM代理方法相比,MUSE代理在解决新颖的非分布任务方面表现出了显著的能力意识和自我调控能力改进。这项工作突显了受认知和神经系统启发的方法在使自主代理人适应新环境方面的潜力,同时减轻了当前模型对大量训练数据和大型模型的严重依赖。
Key Takeaways:
- 元认知在人类适应未知情境方面扮演核心角色,自主代理人在面对新环境时缺乏适应能力的问题突出。
- 提出面向未知情境和环境(MUSE)的框架,旨在增强自主代理人的元认知能力,包括自我评估和策略选择等。
- MUSE框架有两种实现方式:基于世界建模和借助大型语言模型(LLM)。
- 系统能持续评估自身在特定任务上的能力,并用这种自我评估来指导策略选择的迭代。
- MUSE代理相较于其他方法表现出更高的能力意识和自我调控能力改进。
- 该研究突显了结合认知和神经系统启发的方法在自主代理人适应新环境方面的潜力。
点此查看论文截图
An LLM-based Simulation Framework for Embodied Conversational Agents in Psychological Counseling
Authors:Lixiu Wu, Yuanrong Tang, Qisen Pan, Xianyang Zhan, Yucheng Han, Lanxi Xiao, Tianhong Wang, Chen Zhong, Jiangtao Gong
Due to privacy concerns, open dialogue datasets for mental health are primarily generated through human or AI synthesis methods. However, the inherent implicit nature of psychological processes, particularly those of clients, poses challenges to the authenticity and diversity of synthetic data. In this paper, we propose ECAs (short for Embodied Conversational Agents), a framework for embodied agent simulation based on Large Language Models (LLMs) that incorporates multiple psychological theoretical principles.Using simulation, we expand real counseling case data into a nuanced embodied cognitive memory space and generate dialogue data based on high-frequency counseling questions.We validated our framework using the D4 dataset. First, we created a public ECAs dataset through batch simulations based on D4. Licensed counselors evaluated our method, demonstrating that it significantly outperforms baselines in simulation authenticity and necessity. Additionally, two LLM-based automated evaluation methods were employed to confirm the higher quality of the generated dialogues compared to the baselines. The source code and dataset are available at https://github.com/AIR-DISCOVER/ECAs-Dataset.
出于隐私方面的考虑,心理健康领域的开放对话数据集主要通过人工或 AI 合成的方式生成。然而,心理过程(尤其是来访者的心理过程)固有的内隐性,给合成数据的真实性和多样性带来了挑战。在本文中,我们提出了 ECAs(Embodied Conversational Agents,具身对话代理),这是一个基于大型语言模型(LLM)的具身代理模拟框架,融入了多项心理学理论原则。通过模拟,我们将真实心理咨询案例数据扩展为细致的具身认知记忆空间,并基于高频咨询问题生成对话数据。我们使用 D4 数据集验证了该框架:首先,基于 D4 通过批量模拟构建了公开的 ECAs 数据集;持证心理咨询师对我们的方法进行了评估,结果表明其在模拟真实性和必要性上显著优于基线。此外,我们还采用两种基于 LLM 的自动化评估方法,确认所生成对话的质量高于基线。源代码和数据集可在 https://github.com/AIR-DISCOVER/ECAs-Dataset 获取。
论文及项目相关链接
PDF Accepted to AAAI 2026
Summary
基于隐私考量,心理健康领域的开放对话数据集主要通过人工或AI合成方法生成。然而,由于心理过程的隐性特质,尤其是客户的心理过程,为合成数据的真实性和多样性带来挑战。本文提出ECAs(Embodied Conversational Agents)框架,基于大型语言模型(LLMs)的具身代理模拟,并融入多个心理学理论原则。通过模拟,我们扩展真实咨询案例数据,建立一个微妙的具身认知记忆空间,并根据高频率咨询问题生成对话数据。实验验证显示,我们的框架在模拟真实性和必要性方面显著优于基线方法。
Key Takeaways
- 开放对话数据集在心理健康领域主要依赖人工或AI合成方法生成,因心理过程的隐性特质,数据真实性和多样性面临挑战。
- ECAs框架基于大型语言模型(LLMs)的具身代理模拟,融入心理学理论原则。
- 通过模拟扩展真实咨询案例数据,建立具身认知记忆空间。
- ECAs框架根据高频率咨询问题生成对话数据。
- 公开ECAs数据集通过批量模拟创建,经许可咨询师评估,在模拟真实性和必要性方面显著优于基线方法。
- 采用两种LLM自动化评估方法确认生成对话的高质量。