⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-09-24 更新
The STAR-XAI Protocol: An Interactive Framework for Inducing Second-Order Agency in AI Agents
Authors:Antoni Guasch, Maria Isabel Valdez
Current Large Reasoning Models (LRMs) exhibit significant limitations in reliability and transparency, often showing a collapse in reasoning capabilities when faced with high-complexity, long-horizon tasks. This “illusion of thinking” is frequently an artifact of non-agentic, black-box evaluation paradigms that fail to cultivate robust problem-solving processes. In response, we introduce The STAR-XAI Protocol (Socratic, Transparent, Agentic, Reasoning - for eXplainable Artificial Intelligence), a novel methodology for training and operating verifiably reliable AI agents. Our method reframes the human-AI interaction as a structured, Socratic dialogue, governed by an explicit and evolving rulebook, the Consciousness Transfer Package (CTP). Through an interactive Gameplay Cycle that enforces ante-hoc strategic justification and a state-locking Checksum that prevents error accumulation, the protocol transforms a powerful but opaque LRM into a disciplined “Clear Box” agent. We demonstrate the efficacy of this method through an exhaustive 25-move case study in the complex strategic game “Caps i Caps”. The agent not only solved the high-complexity puzzle but also demonstrated Second-Order Agency, identifying flaws in its own supervisor-approved plans and adapting its core integrity protocols mid-task. The STAR-XAI Protocol offers a practical pathway to creating AI agents that are not just high-performing, but also transparent, auditable, and trustworthy by design.
当前的大型推理模型(LRM)在可靠性和透明度方面存在显著局限性,面对高复杂度、长期任务时,往往会出现推理能力崩溃的情况。这种“思维错觉”往往是非代理、黑箱评估范式的产物,这些范式未能培养稳健的问题解决过程。为了应对这一问题,我们推出了STAR-XAI协议(针对可解释人工智能的索然无味、透明、智能推理),这是一种可验证的可靠AI代理训练和运行的新型方法论。我们的方法将人类与AI的交互重新定义为结构化的、索然无味的对话,由明确的、不断发展的规则手册——意识传输包(CTP)来管理。通过强制实施预先的战略合理化交互游戏周期,以及防止错误积累的州锁定校验和,该协议将强大但模糊的大型模型转变为守纪律的“清晰盒子”代理。我们在复杂的战略游戏《棋王之战》中进行了为期25步的案例研究,展示了该方法的有效性。该代理不仅解决了高复杂度的谜题,还展示了二阶代理能力,识别出其监督者批准的计划中的缺陷,并在任务中期调整了其核心完整性协议。STAR-XAI协议为创建不仅高性能而且透明、可审计和可信赖的AI代理提供了切实可行的途径。
论文及项目相关链接
PDF Paper 1 of 4 in The STAR-XAI Protocol series. Paper 2 [arXiv:ID_to_be_added], Paper 3 [arXiv:ID_to_be_added], Paper 4 [arXiv:ID_to_be_added]
Summary
本文指出当前大型推理模型(LRMs)在应对高复杂度、长期任务时存在可靠性和透明度方面的显著局限性,表现为推理能力的崩溃。为此,作者提出了一种新的训练和操作可靠人工智能代理的方法——STAR-XAI协议。该协议通过结构化、苏格拉底式的对话,以及明确的不断进化的规则手册——意识转移包(CTP),将人类与AI的互动进行重构。此外,通过执行前的战略合理化、状态锁定校验和等机制,该协议将强大的但模糊的大型模型转变为有纪律的“清晰盒子”代理。作者通过一个复杂的战略游戏“Capscaps”的25步案例研究,证明了该方法的有效性。该代理不仅解决了高复杂度谜题,还表现出二阶代理能力,识别自身监督者批准计划的缺陷,并在任务中途调整其核心完整性协议。STAR-XAI协议为创建透明、可审计、可信赖的AI代理提供了实用途径。
Key Takeaways
- 当前大型推理模型(LRMs)在高复杂度、长期任务中存在可靠性和透明度问题,表现为推理能力的崩溃。
- STAR-XAI协议是一种新的训练和操作AI代理的方法,旨在解决LRMs的局限性。
- 该协议通过苏格拉底式的对话和明确的规则手册(意识转移包)来重构人类与AI的互动。
- 协议中的执行前战略合理化、状态锁定校验和等机制使强大的但模糊的大型模型转变为有纪律的“清晰盒子”代理。
- 通过复杂的战略游戏“Capscaps”的25步案例研究,证明了STAR-XAI协议的有效性。
- AI代理不仅解决了高复杂度谜题,还展现出二阶代理能力,能识别并调整自身计划的缺陷。
点此查看论文截图



Orcust: Stepwise-Feedback Reinforcement Learning for GUI Agent
Authors:Junyu Lu, Songxin Zhang, Zejian Xie, Zhuoyang Song, Jiaxing Zhang
Recent advances in GUI agents have achieved remarkable grounding and action-prediction performance, yet existing models struggle with unreliable reward signals and limited online trajectory generation. In this paper, we introduce Orcust, a framework that integrates Principle-Constrained Reward Modeling (PCRM) and Online VM-Grounded Trajectory Construction (OVTC) to enhance reasoning reliability and data efficiency in interactive GUI tasks. We leverages environment-verifiable and LLM-derived principle to enforce interpretable reward signals that constrain long chain-of-thought reasoning and rule-based feedback. OVTC spins up instrumented virtual machines to autonomously collect structured GUI interaction trajectories with explicit procedural and structural objectives, enabling the training of a stepwise reward model that robustly captures human preferences and adheres to task-specific constraints. Extensive experiments on standard GUI benchmarks covering perceptual grounding, foundational operations, and end-to-end task execution reveal that Orcust achieves state-of-the-art performance, improving by 22.2% on ScreenSpot and 23.9% on ScreenSpot-Pro over the base model (i.e. Qwen2.5-VL-7B). The results demonstrate Orcust’s effectiveness in enhancing the reasoning, adaptability and scalability of GUI agents across various environments and task complexities.
近期图形用户界面(GUI)代理的进展已经实现了显著的地标定位和动作预测性能。然而,现有模型在不可靠的奖励信号和有限的在线轨迹生成方面仍存在挑战。在本文中,我们介绍了Orcust框架,它结合了原则约束奖励建模(PCRM)和在线VM基地标轨迹构建(OVTC),以提高交互式GUI任务中的推理可靠性和数据效率。我们利用环境可验证和LLM派生的原则来实施可解释的奖励信号,约束长链思维推理和基于规则的反馈。OVTC启动仪器化虚拟机,以自主收集具有明确程序和结构目标的结构化GUI交互轨迹,从而训练分步奖励模型,稳健地捕捉人类偏好并遵守特定任务的约束。在涵盖感知定位、基础操作和端到端任务执行的标准GUI基准测试上的大量实验表明,Orcust达到了最先进的性能,在ScreenSpot和ScreenSpot-Pro上的性能较基础模型分别提高了22.2%和23.9%(即Qwen2.5-VL-7B)。结果证明了Orcust在增强GUI代理的推理、适应性和可伸缩性方面的有效性,适用于各种环境和任务复杂性。
论文及项目相关链接
Summary
本文介绍了Orcust框架,它通过结合原则约束奖励建模(PCRM)和在线VM接地轨迹构建(OVTC)来提高交互GUI任务中的推理可靠性和数据效率。该框架利用环境可验证和LLM衍生的原则来强制执行可解释的奖励信号,约束长链思维推理和基于规则的反馈。通过启动仪器化虚拟机自主收集具有明确程序和结构目标的GUI交互轨迹,训练分步奖励模型,稳健地捕捉人类偏好并遵守特定任务约束。在标准GUI基准测试上的广泛实验表明,Orcust在屏幕点(ScreenSpot)和屏幕专业版(ScreenSpot-Pro)上的性能较基础模型分别提高了22.2%和23.9%。结果表明,Orcust在提高GUI代理的推理、适应性和可扩展性方面非常有效。
Key Takeaways
- Orcust框架结合了原则约束奖励建模(PCRM)和在线VM接地轨迹构建(OVTC)以增强交互GUI任务的推理可靠性和数据效率。
- 环境可验证和LLM衍生的原则用于执行可解释的奖励信号,以约束长链思维推理和基于规则的反溃。
- OVTC通过启动仪器化虚拟机来收集GUI交互轨迹,这些轨迹具有明确的程序和结构目标。
- 通过训练分步奖励模型,Orcust能够稳健地捕捉人类偏好并遵守特定任务约束。
- Orcust在标准GUI基准测试上表现出卓越性能,较基础模型有显著提高。
- Orcust提高了GUI代理的推理、适应性和可扩展性。
点此查看论文截图



Multi-Agent Amodal Completion: Direct Synthesis with Fine-Grained Semantic Guidance
Authors:Hongxing Fan, Lipeng Wang, Haohua Chen, Zehuan Huang, Jiangtao Wu, Lu Sheng
Amodal completion, generating invisible parts of occluded objects, is vital for applications like image editing and AR. Prior methods face challenges with data needs, generalization, or error accumulation in progressive pipelines. We propose a Collaborative Multi-Agent Reasoning Framework based on upfront collaborative reasoning to overcome these issues. Our framework uses multiple agents to collaboratively analyze occlusion relationships and determine necessary boundary expansion, yielding a precise mask for inpainting. Concurrently, an agent generates fine-grained textual descriptions, enabling Fine-Grained Semantic Guidance. This ensures accurate object synthesis and prevents the regeneration of occluders or other unwanted elements, especially within large inpainting areas. Furthermore, our method directly produces layered RGBA outputs guided by visible masks and attention maps from a Diffusion Transformer, eliminating extra segmentation. Extensive evaluations demonstrate our framework achieves state-of-the-art visual quality.
非模态补全对于图像编辑和增强现实等应用至关重要,它能够生成被遮挡物体的不可见部分。之前的方法在数据需求、通用性或渐进管道中的误差累积方面面临挑战。我们提出了一种基于前端协同推理的协同多智能体推理框架,以克服这些问题。我们的框架使用多个智能体协同分析遮挡关系并确定必要的边界扩展,从而产生用于图像修复的精确掩模。同时,一个智能体会生成精细的文本描述,实现精细语义指导。这确保了准确的物体合成,并防止了遮挡物或其他不需要的元素(尤其是在较大的修复区域内)的再生。此外,我们的方法直接生成由可见掩模和注意力图引导的分层RGBA输出,这些输出由一个扩散转换器指导,无需额外的分割。广泛评估表明,我们的框架达到了最先进的视觉质量。
论文及项目相关链接
Summary
基于协同多智能体推理框架,通过多个智能体的协作分析遮挡关系并确定必要的边界扩展,实现精确遮罩补全,同时生成精细的文本描述,为图像编辑和AR等应用提供重要支持。该框架解决了现有方法面临的挑战,如数据需求、泛化能力和渐进式管道中的误差累积问题。评估结果表明,该框架达到业界领先水平。
Key Takeaways
- 协同多智能体推理框架可解决图像编辑和AR中的遮挡问题。
- 通过多个智能体协作分析遮挡关系并确定边界扩展,实现精确遮罩补全。
- 生成精细的文本描述,提供精细语义指导,确保准确的对象合成并防止遮挡物和其他不想要的元素再生。
- 直接生成分层RGBA输出,由可见遮罩和扩散变换器的注意力图引导,无需额外分割。
- 该框架解决了现有方法的数据需求、泛化能力和渐进式管道中的误差累积问题。
点此查看论文截图




GLo-MAPPO: A Multi-Agent Proximal Policy Optimization for Energy Efficiency in UAV-Assisted LoRa Networks
Authors:Abdullahi Isa Ahmed, Jamal Bentahar, El Mehdi Amhoud
Long Range (LoRa) based low-power wide area networks (LPWANs) are crucial for enabling next-generation IoT (NG-IoT) applications in 5G/6G ecosystems due to their long-range, low-power, and low-cost characteristics. However, achieving high energy efficiency in such networks remains a critical challenge, particularly in large-scale or dynamically changing environments. Traditional terrestrial LoRa deployments often suffer from coverage gaps and non-line-of-sight (NLoS) propagation losses, while satellite-based IoT solutions consume excessive energy and introduce high latency, limiting their suitability for energy-constrained and delay-sensitive applications. To address these limitations, we propose a novel architecture using multiple unmanned aerial vehicles (UAVs) as flying LoRa gateways to dynamically collect data from ground-based LoRa end devices. Our approach aims to maximize the system’s weighted global energy efficiency by jointly optimizing spreading factors, transmission powers, UAV trajectories, and end-device associations. Additionally, we formulate this complex optimization problem as a partially observable Markov decision process (POMDP) and propose green LoRa multi-agent proximal policy optimization (GLo-MAPPO), a multi-agent reinforcement learning (MARL) framework based on centralized training with decentralized execution (CTDE). Simulation results show that GLo-MAPPO significantly outperforms benchmark algorithms, achieving energy efficiency improvements of 71.25%, 18.56%, 67.00%, 59.73%, and 49.95% for networks with 10, 20, 30, 40, and 50 LoRa end devices, respectively.
基于长程(LoRa)的低功耗广域网(LPWANs)因其长距离、低功耗、低成本的特点,在5G/6G生态系统中为下一代物联网(NG-IoT)应用至关重要。然而,在这种网络中实现高能效仍然是一个关键问题,特别是在大规模或动态变化的环境中。传统的地面LoRa部署往往存在覆盖缺口和非直视(NLoS)传播损失,而基于卫星的物联网解决方案消耗大量能源并引入高延迟,这限制了它们在能源受限和延迟敏感应用中的适用性。为了解决这些局限性,我们提出了一种使用多架无人机作为飞行LoRa网关的新型架构,以动态地从地面LoRa终端设备收集数据。我们的方法旨在通过联合优化扩频因子、传输功率、无人机轨迹和终端设备关联来最大化系统的加权全局能效。此外,我们将这个复杂的优化问题制定为部分可观察马尔可夫决策过程(POMDP),并提出绿色LoRa多智能体近端策略优化(GLo-MAPPO),这是一个基于集中训练与分散执行(CTDE)的多智能体强化学习(MARL)框架。仿真结果表明,GLo-MAPPO显著优于基准算法,在拥有10、20、30、40和50个LoRa终端设备的网络中,能效分别提高了71.25%、18.56%、67.00%、59.73%和49.95%。
论文及项目相关链接
PDF 15 pages, 19 figures, journal
Summary
基于LoRa的长距离低功率广域网络在5G/6G生态系统中对实现下一代物联网应用至关重要。然而,在大规模或动态变化的环境中实现高能效仍是关键挑战。为解决传统地面LoRa部署中存在的覆盖空白和非直视传播损耗问题,以及卫星物联网解决方案能耗高、延迟大的局限性,本文提出一种新型架构,利用多架无人机作为飞行LoRa网关来动态收集地面LoRa终端的数据。该架构旨在通过联合优化扩频因子、传输功率、无人机轨迹和终端关联来最大化系统的全局能效。仿真结果表明,本文提出的基于多智能体近端策略优化的绿色LoRa(GLo-MAPPO)算法显著优于基准算法,在拥有不同数量LoRa终端的网络中分别实现了能效提升。
Key Takeaways
- LoRa技术因其长距离、低功耗和低成本的特性在5G/6G物联网应用中至关重要。
- 在大规模或动态环境中实现LoRa网络的高能效是一个关键挑战。
- 传统地面LoRa部署存在覆盖空白和非直视传播损耗问题。
- 卫星物联网解决方案面临能耗高和延迟大的局限性。
- 利用多架无人机作为飞行LoRa网关,可动态收集地面LoRa终端的数据。
- 通过联合优化各项参数,最大化系统全局能效。
点此查看论文截图





MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents
Authors:Yuzhen Lei, Hongbin Xie, Jiaxing Zhao, Shuangxue Liu, Xuan Song
Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models’ abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose \textbf{MSCoRe}, a novel benchmark comprising 126696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models’ robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.
大型语言模型(LLM)在单一领域内的问答任务中表现出色。然而,它们在复杂多阶段场景中的推理和协作能力仍然未被充分探索。现有基准测试通常侧重于孤立的任务或狭窄的领域,忽略了模型在无需明确外部指导的情况下进行多阶段协作和优化的能力。为了弥补这一差距,我们提出了MSCoRe,这是一个新的基准测试,包含126696个特定领域的问答实例,涵盖汽车、制药、电子和能源领域中的场景。该数据集是通过结构化三阶段管道创建的:动态采样、迭代问答生成,以及多层次质量评估,以确保数据质量。任务进一步根据阶段覆盖范围和复杂性分为三个难度级别。通过MSCoRe,我们对各种最新的前沿LLM代理进行了全面评估。商业模型在所有任务和场景中表现最佳,但在简单任务和复杂任务之间,ROUGE分数仍存在明显差距。我们还测试了模型的稳健性,发现噪声数据会对它们的性能产生负面影响。MSCoRe为社区提供了一个宝贵的新资源,用于评估和提高LLM代理的多阶段推理能力。代码和数据可在https://github.com/D3E0-source/MSCoRE找到。
论文及项目相关链接
PDF 10 pages, 5 figures
Summary
LLMs在单一领域的问题解答任务中表现出色,但在复杂多阶段场景中的推理和协作能力尚未得到充分探索。为弥补这一空白,提出了MSCoRe这一新基准测试,包含涵盖汽车、制药、电子和能源领域的126696个特定领域问答实例。该数据集通过动态采样、迭代问答生成和多层次质量评估的结构化三阶段管道创建,确保了数据质量。评估显示,商业模型在所有任务和场景中表现最佳,但在简单和复杂任务之间仍存在显著差距。此外,模型的稳健性测试表明,噪声数据对其性能有负面影响。MSCoRe为社区提供了一个评估和改进LLM代理多阶段推理能力的新资源。
Key Takeaways
- LLMs在单一领域的问题解答任务中表现出色,但在多阶段场景中的推理和协作能力有待探索。
- 提出了一种新的基准测试MSCoRe,涵盖多个领域,包含126696个特定领域问答实例。
- 数据集通过结构化管道创建,确保数据质量。
- 商业模型在所有任务和场景中表现最佳,但简单和复杂任务间存在性能差距。
- LLMs的稳健性受到噪声数据的负面影响。
点此查看论文截图





Privacy in Action: Towards Realistic Privacy Mitigation and Evaluation for LLM-Powered Agents
Authors:Shouju Wang, Fenglin Yu, Xirui Liu, Xiaoting Qin, Jue Zhang, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan
The increasing autonomy of LLM agents in handling sensitive communications, accelerated by Model Context Protocol (MCP) and Agent-to-Agent (A2A) frameworks, creates urgent privacy challenges. While recent work reveals significant gaps between LLMs’ privacy Q&A performance and their agent behavior, existing benchmarks remain limited to static, simplified scenarios. We present PrivacyChecker, a model-agnostic, contextual integrity based mitigation approach that effectively reduces privacy leakage from 36.08% to 7.30% on DeepSeek-R1 and from 33.06% to 8.32% on GPT-4o, all while preserving task helpfulness. We also introduce PrivacyLens-Live, transforming static benchmarks into dynamic MCP and A2A environments that reveal substantially higher privacy risks in practical. Our modular mitigation approach integrates seamlessly into agent protocols through three deployment strategies, providing practical privacy protection for the emerging agentic ecosystem. Our data and code will be made available at https://aka.ms/privacy_in_action.
随着大型语言模型(LLM)代理在处理敏感通信中的自主性不断增强,以及由Model Context Protocol(MCP)和Agent-to-Agent(A2A)框架加速推动,带来了紧迫的隐私挑战。虽然最近有研究表明LLMs的隐私问答性能与代理行为之间存在显著差距,但现有的基准测试仅限于静态、简化的场景。我们提出了PrivacyChecker,这是一种基于模型不可知论的情境完整性缓解方法,可有效减少DeepSeek-R1上的隐私泄露,从36.08%减少到7.3 当你处在新系统中时的行动规划和应急应对策略将成为每个隐私管理团队所重视的重要组成部分。)且在GPT-上隐私泄露率也从原来的百分之三十三点零六降至百分之八点三二,同时保持任务的有用性。我们还推出了PrivacyLens-Live,它将静态基准测试转化为动态MCP和A2A环境,在实际使用中显示出更高的隐私风险。我们的模块化缓解方法可通过三种部署策略无缝集成到代理协议中,为新兴的代理生态系统提供实际的隐私保护。我们的数据和代码将在https://aka.ms/privacy_in_action上提供。
论文及项目相关链接
PDF To appear at EMNLP 2025 (Findings)
Summary
大模型代理(LLM)在处理敏感通信时的自主性日益增强,这带来了紧迫的隐私挑战。现有评估存在局限,本文提出PrivacyChecker方法有效减少隐私泄露,并在实际部署中融入代理协议,为保护新兴代理生态系统中的隐私提供实用方法。具体数据和代码将在相关网站公布。
Key Takeaways
- LLM代理在处理敏感通信时自主性增强,引发隐私挑战。
- Model Context Protocol (MCP) 和 Agent-to-Agent (A2A) 框架加剧了这一问题。
- 现有评估存在局限性,主要集中于静态、简化场景。
- PrivacyChecker方法基于上下文完整性,有效减少隐私泄露。
- PrivacyChecker在DeepSeek-R1和GPT-4o上的隐私泄露分别降低至7.30%和8.32%。
- PrivacyLens-Live将静态基准测试转化为动态MCP和A2A环境,揭示更高的隐私风险。
点此查看论文截图





PRINCIPLES: Synthetic Strategy Memory for Proactive Dialogue Agents
Authors:Namyoung Kim, Kai Tzu-iunn Ong, Yeonjun Hwang, Minseok Kang, Iiseo Jihn, Gayoung Kim, Minju Kim, Jinyoung Yeo
Dialogue agents based on large language models (LLMs) have shown promising performance in proactive dialogue, which requires effective strategy planning. However, existing approaches to strategy planning for proactive dialogue face several limitations: limited strategy coverage, preference bias in planning, and reliance on costly additional training. To address these, we propose PRINCIPLES: a synthetic strategy memory for proactive dialogue agents. PRINCIPLES is derived through offline self-play simulations and serves as reusable knowledge that guides strategy planning during inference, eliminating the need for additional training and data annotation. We evaluate PRINCIPLES in both emotional support and persuasion domains, demonstrating consistent improvements over strong baselines. Furthermore, PRINCIPLES maintains its robustness across extended and more diverse evaluation settings. See our project page at https://huggingface.co/spaces/kimnamssya/Principles.
基于大型语言模型(LLM)的对话代理人在主动对话中表现出了有前景的性能,这要求有效的策略规划。然而,现有的主动对话策略规划方法面临一些局限性:策略覆盖面有限、规划中的偏好偏见和依赖昂贵的额外训练。为了解决这些问题,我们提出了PRINCIPLES:主动对话代理人的合成策略记忆。PRINCIPLES是通过离线自我游戏模拟得出的,作为可重复使用的知识,在推理过程中指导策略规划,无需额外的训练和数据标注。我们在情感支持和劝诱领域评估了PRINCIPLES,与强大的基准线相比,表现出了持续的一致改进。此外,PRINCIPLES在扩展和更多样化的评估环境中保持了其稳健性。请访问我们的项目页面https://huggingface.co/spaces/kimnamssya/Principles了解详情。
论文及项目相关链接
PDF Accepted to EMNLP 2025 Findings
Summary
基于大型语言模型(LLM)的对话代理在主动对话中表现出良好性能,这需要进行有效的战略规划。然而,现有主动对话策略规划方法存在策略覆盖面有限、规划偏好偏差以及依赖昂贵额外训练等问题。为解决这些问题,我们提出了PRINCIPLES:一种用于主动对话代理的合成策略记忆。PRINCIPLES通过离线自我游戏模拟得出,作为可重复使用的知识,在推理过程中指导策略规划,无需额外的训练和数据标注。我们在情感支持和劝说领域评估了PRINCIPLES,与强大的基线相比,显示出持续改进。此外,PRINCIPLES在扩展和更多样化的评估环境中保持了其稳健性。更多信息请访问我们的项目页面:项目链接。
Key Takeaways
- 大型语言模型(LLM)在主动对话中表现出良好性能,需策略规划。
- 现有主动对话策略规划方法存在局限性,如策略覆盖面有限、规划偏好偏差和额外训练成本高昂。
- PRINCIPLES是一种合成策略记忆,通过离线自我游戏模拟生成,用于指导主动对话代理的策略规划。
- PRINCIPLES在情感支持和劝说领域评估表现优异,相较于基线方法有所改进。
- PRINCIPLES在扩展和多样化评估环境中保持稳健性。
- PRINCIPLES可重复使用,降低了对额外训练和数据标注的需求。
点此查看论文截图




Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation
Authors:Ahmed T. Elboardy, Ghada Khoriba, Essam A. Rashed
Automating radiology report generation poses a dual challenge: building clinically reliable systems and designing rigorous evaluation protocols. We introduce a multi-agent reinforcement learning framework that serves as both a benchmark and evaluation environment for multimodal clinical reasoning in the radiology ecosystem. The proposed framework integrates large language models (LLMs) and large vision models (LVMs) within a modular architecture composed of ten specialized agents responsible for image analysis, feature extraction, report generation, review, and evaluation. This design enables fine-grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). We demonstrate an implementation using chatGPT-4o on public radiology datasets, where LLMs act as evaluators alongside medical radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment, the proposed benchmark establishes a path toward trustworthy deviance-based radiology report generation.
自动化生成放射学报告面临双重挑战:建立可靠的临床系统和设计严格的评价协议。我们引入了一种多智能体强化学习框架,该框架既可作为放射学生态系统中多模式临床推理的基准测试环境,也可用于评估。所提出的框架在模块化架构中集成了大型语言模型(LLM)和大型视觉模型(LVM),该架构由十个专门智能体组成,负责图像分析、特征提取、报告生成、审查和评估。这种设计能够在智能体层面(例如检测和分割准确性)和共识层面(例如报告质量和临床相关性)进行精细评估。我们使用ChatGPT-4o在公共放射学数据集上进行了实现展示,其中LLM作为评估器与医学放射科医师反馈相结合。通过使评价协议与LLM开发周期(包括预训练、微调、对齐和部署)保持一致,所提出的标准建立了一条基于可信偏差的放射学报告生成路径。
论文及项目相关链接
PDF NeurIPS2025 Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Summary
本文介绍了一个多智能体强化学习框架,用于解决放射学报告生成中的双重挑战:构建临床可靠的系统和设计严格评估协议。该框架集成了大型语言模型和大型视觉模型,在一个模块化架构中包括十个专门负责图像分析、特征提取、报告生成、审查和评估的智能体。这一设计既可以在智能体层面进行精细评估(如检测和分割精度),也可以在共识层面进行报告质量和临床相关性的评估。通过利用ChatGPT-4o在公共放射学数据集上的实现,展示了大型语言模型作为评估者的作用,并与医学放射科医师的反馈相结合。所提出的基准测试通过与大型语言模型开发周期(包括预训练、微调、对齐和部署)对齐评估协议,为基于可信偏差的放射学报告生成提供了发展路径。
Key Takeaways
- 多智能体强化学习框架用于放射学报告生成的自动化挑战。
- 整合大型语言模型和大型视觉模型以处理多模态临床数据。
- 模块化架构中包含十个专门智能体,负责图像分析、特征提取、报告生成、审查和评估。
- 该框架实现了精细粒度的评估,既包括智能体层面的评估(如检测和分割的准确性),也包括共识层面的评估(如报告质量和临床相关性)。
- 利用ChatGPT-4o在公共放射学数据集上的实例演示了大型语言模型作为评估者的作用。
- 大型语言模型的发展周期与评估协议紧密对齐,包括预训练、微调、对齐和部署阶段。
点此查看论文截图



UIPro: Unleashing Superior Interaction Capability For GUI Agents
Authors:Hongxin Li, Jingran Su, Jingfan Chen, Zheng Ju, Yuntao Chen, Qing Li, Zhaoxiang Zhang
Building autonomous agents that perceive and operate graphical user interfaces (GUIs) like humans has long been a vision in the field of artificial intelligence. Central to these agents is the capability for GUI interaction, which involves GUI understanding and planning capabilities. Existing methods have tried developing GUI agents based on the multi-modal comprehension ability of vision-language models (VLMs). However, the limited scenario, insufficient size, and heterogeneous action spaces hinder the progress of building generalist GUI agents. To resolve these issues, this paper proposes \textbf{UIPro}, a novel generalist GUI agent trained with extensive multi-platform and multi-task GUI interaction data, coupled with a unified action space. We first curate a comprehensive dataset encompassing 20.6 million GUI understanding tasks to pre-train UIPro, granting it a strong GUI grounding capability, which is key to downstream GUI agent tasks. Subsequently, we establish a unified action space to harmonize heterogeneous GUI agent task datasets and produce a merged dataset to foster the action prediction ability of UIPro via continued fine-tuning. Experimental results demonstrate UIPro’s superior performance across multiple GUI task benchmarks on various platforms, highlighting the effectiveness of our approach.
在人工智能领域,构建能够像人类一样感知和操作图形用户界面(GUI)的自主代理一直是人们的一个愿景。这些代理的核心能力是GUI交互,这涉及到GUI理解和规划能力。现有方法已经尝试基于视觉语言模型(VLMs)的多模态理解能力来开发GUI代理。然而,有限的场景、不足的数据规模和异构的动作空间阻碍了通用GUI代理的构建进展。为了解决这些问题,本文提出了UIPro,这是一种新型通用GUI代理,通过大量的多平台、多任务GUI交互数据进行训练,并配备了统一的动作空间。我们首先创建了一个综合数据集,包含2060万项GUI理解任务来预训练UIPro,赋予它强大的GUI定位能力,这是下游GUI代理任务的关键。随后,我们建立了一个统一的动作空间,以协调异构的GUI代理任务数据集,并生成一个合并数据集,通过持续微调来提升UIPro的动作预测能力。实验结果表明,UIPro在多个平台上的GUI任务基准测试中表现出卓越性能,凸显了我们方法的有效性。
论文及项目相关链接
PDF Accepted to ICCV 2025
Summary
本文提出了一种名为UIPro的新型通用GUI代理,该代理能够感知和操作图形用户界面(GUI)。为了解决现有方法的局限和不足,使用具有跨平台和多任务交互数据的UIPro数据集进行训练,同时建立了统一的动作空间,以提升GUI代理的行动预测能力。实验结果显示,UIPro在不同平台和多个GUI任务基准测试上的表现均优于其他方法。
Key Takeaways
- 通用GUI代理的构想与重要性:介绍通用GUI代理的构建理念及其对于人工智能领域的重要性。强调了GUI交互能力在自主代理中的核心地位。
- 多模态理解与视觉语言模型(VLMs):探讨了现有基于VLMs的GUI代理方法,指出其局限性和挑战。包括应用场景的限制、数据量不足以及动作空间的异质性。
- UIPro的提出与数据集构建:介绍了UIPro代理的开发背景及关键特点。强调使用包含大量GUI理解任务的UIPro数据集进行预训练的重要性。
- 统一动作空间的建立:解释了如何建立统一动作空间来整合不同GUI代理任务数据集,并如何通过持续微调来提升UIPro的行动预测能力。
- 实验结果分析:展示了UIPro在不同平台和多个GUI任务基准测试上的表现,验证了其优越性和有效性。强调其性能超越现有方法的特点。
- 面向未来的挑战与展望:提出了未来在构建更强大的GUI代理方面可能面临的挑战和研究方向,包括数据集的扩展性、模型的鲁棒性和实际应用场景的挑战等。
点此查看论文截图





AEGIS: Automated Error Generation and Identification for Multi-Agent Systems
Authors:Fanqi Kong, Ruijie Zhang, Huaxiao Yin, Guibin Zhang, Xiaofei Zhang, Ziang Chen, Zhaowei Zhang, Xiaoyuan Zhang, Song-Chun Zhu, Xue Feng
As Multi-Agent Systems (MAS) become increasingly autonomous and complex, understanding their error modes is critical for ensuring their reliability and safety. However, research in this area has been severely hampered by the lack of large-scale, diverse datasets with precise, ground-truth error labels. To address this bottleneck, we introduce \textbf{AEGIS}, a novel framework for \textbf{A}utomated \textbf{E}rror \textbf{G}eneration and \textbf{I}dentification for Multi-Agent \textbf{S}ystems. By systematically injecting controllable and traceable errors into initially successful trajectories, we create a rich dataset of realistic failures. This is achieved using a context-aware, LLM-based adaptive manipulator that performs sophisticated attacks like prompt injection and response corruption to induce specific, predefined error modes. We demonstrate the value of our dataset by exploring three distinct learning paradigms for the error identification task: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. Our comprehensive experiments show that models trained on AEGIS data achieve substantial improvements across all three learning paradigms. Notably, several of our fine-tuned models demonstrate performance competitive with or superior to proprietary systems an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/AEGIS-Website.
随着多智能体系统(MAS)的自主性和复杂性不断提高,理解它们的错误模式对于确保它们的可靠性和安全性至关重要。然而,由于缺乏大规模、多样化的精确地面真实错误标签数据集,这一领域的研究受到了严重阻碍。为了解决这一瓶颈问题,我们引入了多智能体系统的自动错误生成和识别新型框架AEGIS。我们通过系统性地在最初成功的轨迹中注入可控且可追踪的错误,创建了一个丰富的现实失败数据集。这是通过基于上下文感知的大型语言模型自适应操纵器实现的,它执行诸如提示注入和响应腐败等高级攻击,以产生特定的预定义错误模式。我们通过探索错误识别任务的三种不同学习范式来展示我们数据集的价值:监督微调、强化学习和对比学习。我们的综合实验表明,在AEGIS数据上训练的模型在这三种学习范式中都取得了实质性的改进。值得注意的是,我们的一些微调模型的性能表现与规模大一个数量级的专有系统相当或更优,这验证了我们的自动化数据生成框架对于开发更稳健和可解释的多智能体系统的重要性。我们的项目网站位于https://kfq20.github.io/AEGIS-Website。
论文及项目相关链接
Summary
针对多智能体系统(MAS)的错误模式研究受限于缺乏大规模多样化和精确标注的数据集。本研究引入AEGIS框架,通过注入可控可追踪的错误生成丰富的失败数据集,实现多智能体系统的自动错误识别和生成。实验证明,该数据集对三种学习范式均有显著改进效果,且部分微调模型性能表现优异。
Key Takeaways
- 多智能体系统(MAS)的自主性和复杂性增长导致理解其错误模式对确保可靠性和安全性至关重要。
- 当前研究受限于缺乏大型多样化且标注精确的错误数据集。
- AEGIS框架通过注入可控可追踪的错误生成丰富的失败数据集。
- AEGIS框架使用基于LLM的上下文感知自适应操纵器进行复杂攻击,如提示注入和响应腐败,以产生特定的预定义错误模式。
- AEGIS数据集在三种不同的学习范式中进行了实验验证,均表现出显著改进效果。
- 部分在AEGIS数据上训练的模型性能表现优异,与专有系统的性能相当甚至更优。
点此查看论文截图


Agentic AI for Software: thoughts from Software Engineering community
Authors:Abhik Roychoudhury
AI agents have recently shown significant promise in software engineering. Much public attention has been transfixed on the topic of code generation from Large Language Models (LLMs) via a prompt. However, software engineering is much more than programming, and AI agents go far beyond instructions given by a prompt. At the code level, common software tasks include code generation, testing, and program repair. Design level software tasks may include architecture exploration, requirements understanding, and requirements enforcement at the code level. Each of these software tasks involves micro-decisions which can be taken autonomously by an AI agent, aided by program analysis tools. This creates the vision of an AI software engineer, where the AI agent can be seen as a member of a development team. Conceptually, the key to successfully developing trustworthy agentic AI-based software workflows will be to resolve the core difficulty in software engineering - the deciphering and clarification of developer intent. Specification inference, or deciphering the intent, thus lies at the heart of many software tasks, including software maintenance and program repair. A successful deployment of agentic technology into software engineering would involve making conceptual progress in such intent inference via agents. Trusting the AI agent becomes a key aspect, as software engineering becomes more automated. Higher automation also leads to higher volume of code being automatically generated, and then integrated into code-bases. Thus to deal with this explosion, an emerging direction is AI-based verification and validation (V & V) of AI generated code. We posit that agentic software workflows in future will include such AIbased V&V.
人工智能代理在软件工程领域最近显示出巨大的潜力。公众的关注主要集中在通过提示从大型语言模型(LLM)生成代码的话题上。然而,软件工程不仅仅是编程,人工智能代理也远远超过给定的提示。在代码层面,常见的软件任务包括代码生成、测试和程序修复。设计层面的软件任务可能包括架构探索、需求理解和代码级别的需求强制。这些软件任务中的每一个都涉及到可以由人工智能代理自主做出的微观决策,辅以程序分析工具。这创造了人工智能软件工程师的愿景,其中人工智能代理可以被视为开发团队的一员。概念上,成功开发可信赖的代理型人工智能软件工作流程的关键将在于解决软件工程的核心难题——解读和明确开发者的意图。因此,规格推断或意图解读是许多软件任务的核心,包括软件维护和程序修复。成功地将代理技术部署到软件工程将需要在意图推断方面取得概念性进展。随着软件工程自动化程度的提高,信任人工智能代理成为一个至关重要的方面。更高的自动化也导致自动生成更多代码,并将其集成到代码库中。因此,为了应对这一爆炸式增长,一个新兴的方向是人工智能生成的代码的基于人工智能的验证和验证(V&V)。我们认为未来的代理型软件工作流程将包括这种基于人工智能的V&V。
论文及项目相关链接
PDF 4 pages
Summary
AI代理在软件工程领域展现出巨大潜力,不仅限于指令驱动的代码生成,还涵盖代码层面的任务如测试与程序修复,以及设计层面的任务如架构探索与需求理解。AI代理可自主做出微观决策,辅以程序分析工具,实现了AI软件工程师的愿景。发展可信赖的代理型AI软件工作流程的关键在于解析和澄清开发者意图,这也是软件工程的核心难题。未来,代理型软件工作流程将包括AI生成的代码的验证与验证。
Key Takeaways
- AI代理在软件工程中的潜力巨大,涵盖代码生成、测试、程序修复和设计等多个方面。
- AI代理可以自主完成微观决策,成为开发团队的一员。
- 解析和澄清开发者意图是软件工程的核心难题,也是发展可信赖的代理型AI软件工作流程的关键。
- AI代理在软件维护、程序修复等方面的应用是未来的重要发展方向。
- AI生成的代码验证与验证(V&V)是应对自动化生成代码量激增的重要方法。
- 未来代理型软件工作流程将包括AI生成的代码的V&V。
点此查看论文截图





ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning
Authors:Yu Sun, Xingyu Qian, Weiwen Xu, Hao Zhang, Chenghao Xiao, Long Li, Deli Zhao, Wenbing Huang, Tingyang Xu, Qifeng Bai, Yu Rong
Reasoning-based large language models have excelled in mathematics and programming, yet their potential in knowledge-intensive medical question answering remains underexplored and insufficiently validated in clinical contexts. To bridge this gap, we introduce ReasonMed, the largest medical reasoning dataset to date, comprising 370k high-quality examples distilled from 1.75 million initial reasoning paths generated by complementary LLMs and curated through a cost-efficient easy-medium-difficult (EMD) pipeline. ReasonMed is built through a multi-agent generation, verification, and refinement process, in which an Error Refiner improves reasoning paths by correcting error-prone steps identified by a verifier. Using ReasonMed, we investigate effective strategies for training medical reasoning models and find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results. Models trained on ReasonMed set a new benchmark: ReasonMed-7B surpasses the prior best sub-10B models by 4.17% and even exceeds LLaMA3.1-70B on PubMedQA by 4.60%. When scaled to ReasonMed-14B, it remains highly competitive, underscoring consistent scaling potential. The codes and datasets are available at https://github.com/YuSun-Work/ReasonMed.
基于推理的大型语言模型在数学和编程方面表现出色,但在知识密集型的医疗问题回答方面的潜力在临床试验环境中尚未得到充分探索验证。为了弥补这一差距,我们推出了迄今为止最大的医疗推理数据集ReasonMed。该数据集包含从由互补的大型语言模型生成的首批175万条推理路径中提炼出的37万条高质量示例,并通过成本效益高、难度分级(简易至中级至困难)的管道进行整理。ReasonMed的建立过程包括多智能体生成、验证和细化,其中Error Refiner通过识别验证器指出的易错步骤来改善推理路径。利用ReasonMed,我们研究了训练医疗推理模型的有效策略,发现将详细的推理过程与简洁的答案摘要相结合,可以得到最稳健的微调结果。在ReasonMed数据集上训练的模型创造了新的基准:ReasonMed-7B在之前最佳的低于十亿参数模型中提高了4.10%,甚至在PubMedQA上超越了LLaMA3.1-70B达4.6%。当模型扩大到ReasonMed-14B时,其表现依然具有很强的竞争力,突显出持续扩展的潜力。相关代码和数据集可通过https://github.com/YuSun-Work/ReasonMed获取。
论文及项目相关链接
PDF 28 pages, 6 figures, 7 tables
Summary
大型语言模型在数学和编程领域表现出色,但在知识密集型医疗问题回答方面的潜力尚未得到充分探索和验证。为解决这一差距,我们推出了ReasonMed数据集,它是迄今为止最大的医疗推理数据集,包含从互补的大型语言模型生成的初始推理路径中提炼出的高质量示例。使用ReasonMed数据集,我们研究了训练医疗推理模型的有效策略,发现将详细的推理过程与简洁的答案摘要相结合,可以获得最稳健的微调结果。基于ReasonMed数据集训练的模型达到了新的基准水平,超过了先前的最佳模型,并在PubMedQA上表现出良好的性能。数据集和代码已公开提供。
Key Takeaways
- 大型语言模型在医疗领域的应用潜力尚未得到充分探索。
- ReasonMed数据集是迄今为止最大的医疗推理数据集。
- ReasonMed数据集通过多代理生成、验证和细化过程构建而成。
- 使用ReasonMed数据集训练模型时,结合详细的推理过程和简洁的答案摘要效果最佳。
- 基于ReasonMed数据集的模型性能超过了先前的最佳模型,并在PubMedQA上表现出良好的性能。
- 数据集和代码已公开提供,便于其他研究者使用。
点此查看论文截图






PAKTON: A Multi-Agent Framework for Question Answering in Long Legal Agreements
Authors:Petros Raptopoulos, Giorgos Filandrianos, Maria Lymperaiou, Giorgos Stamou
Contract review is a complex and time-intensive task that typically demands specialized legal expertise, rendering it largely inaccessible to non-experts. Moreover, legal interpretation is rarely straightforward-ambiguity is pervasive, and judgments often hinge on subjective assessments. Compounding these challenges, contracts are usually confidential, restricting their use with proprietary models and necessitating reliance on open-source alternatives. To address these challenges, we introduce PAKTON: a fully open-source, end-to-end, multi-agent framework with plug-and-play capabilities. PAKTON is designed to handle the complexities of contract analysis through collaborative agent workflows and a novel retrieval-augmented generation (RAG) component, enabling automated legal document review that is more accessible, adaptable, and privacy-preserving. Experiments demonstrate that PAKTON outperforms both general-purpose and pretrained models in predictive accuracy, retrieval performance, explainability, completeness, and grounded justifications as evaluated through a human study and validated with automated metrics.
合同审查是一项复杂且耗时的任务,通常需要专门的法律专业知识,使得非专家很难参与。此外,法律解读并不直观——模糊性普遍存在,判断往往取决于主观评估。使这些挑战更加复杂的是,合同通常是机密的,限制了其在专有模型中的使用,并需要使用开源替代方案。为了应对这些挑战,我们推出了PAKTON:一个完全开源、端到端、具有即插即用功能的多代理框架。PAKTON通过协作代理工作流程和新型检索增强生成(RAG)组件,旨在应对合同分析的复杂性,实现更便捷、更灵活且保护隐私的自动化法律文件审查。实验表明,在通过人类研究和自动化指标验证的预测准确性、检索性能、可解释性、完整性和合理依据方面,PAKTON的表现都优于通用模型和预训练模型。
论文及项目相关链接
PDF Accepted at EMNLP 2025
Summary
合同审查是一项复杂且耗时的任务,通常需要专业的法律知识,使得非专家难以接触。此外,法律解释并不简单——模糊性普遍存在,且判断通常取决于主观评估。合同通常具有保密性,限制了其在使用专有模型方面的应用,并需要依赖开源替代方案。为解决这些挑战,我们推出了PAKTON:一个完全开源、端到端、多代理的框架,具有即插即用的功能。PAKTON通过协作代理工作流程和新型检索增强生成(RAG)组件,设计用于处理合同分析的复杂性,实现更便捷、灵活且保护隐私的自动化法律文件审查。实验表明,PAKTON在预测准确性、检索性能、可解释性、完整性和有据可依方面优于通用和预训练模型,并通过人类研究和自动化指标进行了评估和验证。
Key Takeaways
- 合同审查需要专业的法律知识,对于非专家来说具有挑战性。
- 法律解释中存在模糊性和主观判断。
- 合同保密性限制了专有模型的应用,需要依赖开源解决方案。
- PAKTON是一个完全开源、端到端、多代理的框架,具备处理合同审查复杂性的能力。
- PAKTON通过协作代理和检索增强生成(RAG)组件实现自动化法律文件审查。
- 实验证明PAKTON在多个方面优于其他模型,包括预测准确性、检索性能、可解释性等。
点此查看论文截图


