发布日期: 2025-06-12

更新日期: 2025-07-06

文章字数: 18.7k

阅读时长: 75 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-06-12 更新

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Authors:Li Kang, Xiufeng Song, Heng Zhou, Yiran Qin, Jie Yang, Xiaohong Liu, Philip Torr, Lei Bai, Zhenfei Yin

Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, a few have begun to explore vision-language models (VLMs) for visual reasoning. However, these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that fine-tunes a pretrained vision-language model (VLM) using Chain-of-Thought annotated demonstrations, followed by reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baselines method across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visual-driven cooperation in embodied AI systems.

在动态环境中协调多个实体代理仍然是人工智能的核心挑战，这需要感知驱动的推理和可扩展的合作策略。虽然最近的研究已经利用大型语言模型（LLM）进行多代理规划，但很少有人开始探索视觉语言模型（VLM）用于视觉推理。然而，这些基于VLM的方法在支持多种实体类型方面仍存在局限性。在这项工作中，我们引入了专门为实体多代理合作定制的分层基准——VIKI-Bench，它包含三个结构化级别：代理激活、任务规划和轨迹感知。VIKI-Bench包括多种机器人实体、多视角视觉观察和结构化的监督信号，以评估基于视觉输入的推理。为了证明VIKI-Bench的实用性，我们提出了VIKI-R，这是一个两阶段框架，它通过思维链注释演示对预训练的视觉语言模型（VLM）进行微调，然后在多层次奖励信号下进行强化学习。我们的广泛实验表明，VIKI-R在所有任务层面都显著优于基准方法。此外，我们还表明，强化学习能够促进异构代理之间组合合作模式的出现。总之，VIKI-Bench和VIKI-R为推进实体AI系统的多代理、视觉驱动合作提供了统一的测试平台和方法。

Summary

本文介绍了在人工智能领域中，协调多个实体代理在动态环境中的挑战，需要感知驱动的推理和可扩展的合作策略。虽然已有工作利用大型语言模型进行多代理规划，但基于视觉语言模型的视觉推理方法仍有限支持多种实体类型。本文引入VIKI-Bench，首个针对实体多代理合作的分层基准测试，包括代理激活、任务规划和轨迹感知三个结构化级别。VIKI-Bench包含多种机器人实体、多视角视觉观察和结构化的监督信号，以评估基于视觉输入的推理。为展示VIKI-Bench的实用性，本文提出VIKI-R，一个两阶段框架，通过精细调整预训练的视觉语言模型，利用思维链注解演示，再通过多级别奖励信号进行强化学习。实验表明，VIKI-R在各级任务上显著优于基准方法，强化学习促使异质代理间出现组合合作模式。总之，VIKI-Bench和VIKI-R为推进多代理、视觉驱动的合作实体人工智能系统提供了统一的测试床和方法。

Key Takeaways

协调多个实体代理在动态环境中是人工智能的核心挑战，需结合感知驱动的推理和合作策略。
虽然大型语言模型已用于多代理规划，但基于视觉语言模型的视觉推理方法仍需探索更多领域。
VIKI-Bench是首个针对实体多代理合作的分层基准测试，包含代理激活、任务规划和轨迹感知等结构化级别。
VIKI-Bench注重多样机器人实体、多视角视觉观察和结构化的监督信号的评估。
VIKI-R框架结合了预训练视觉语言模型的调整、思维链注解演示和强化学习，提高了多代理任务的表现。
强化学习有助于异质代理间出现组合合作模式。

Cool Papers

点此查看论文截图

Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

Authors:Yuan Guo, Tingjia Miao, Zheng Wu, Pengzhou Cheng, Ming Zhou, Zhuosheng Zhang

Autonomous agents powered by multimodal large language models have been developed to facilitate task execution on mobile devices. However, prior work has predominantly focused on atomic tasks – such as shot-chain execution tasks and single-screen grounding tasks – while overlooking the generalization to compositional tasks, which are indispensable for real-world applications. This work introduces UI-NEXUS, a comprehensive benchmark designed to evaluate mobile agents on three categories of compositional operations: Simple Concatenation, Context Transition, and Deep Dive. UI-NEXUS supports interactive evaluation in 20 fully controllable local utility app environments, as well as 30 online Chinese and English service apps. It comprises 100 interactive task templates with an average optimal step count of 14.05. Experimental results across a range of mobile agents with agentic workflow or agent-as-a-model show that UI-NEXUS presents significant challenges. Specifically, existing agents generally struggle to balance performance and efficiency, exhibiting representative failure modes such as under-execution, over-execution, and attention drift, causing visible atomic-to-compositional generalization gap. Inspired by these findings, we propose AGENT-NEXUS, a lightweight and efficient scheduling system to tackle compositional mobile tasks. AGENT-NEXUS extrapolates the abilities of existing mobile agents by dynamically decomposing long-horizon tasks to a series of self-contained atomic subtasks. AGENT-NEXUS achieves 24% to 40% task success rate improvement for existing mobile agents on compositional operation tasks within the UI-NEXUS benchmark without significantly sacrificing inference overhead. The demo video, dataset, and code are available on the project page at https://ui-nexus.github.io.

由多模态大型语言模型驱动的自主体已被开发出来，以在移动设备上促进任务执行。然而，先前的工作主要集中在原子任务上，如射击链执行任务和单屏幕接地任务，而忽略了对组合任务的推广，这对于实际应用来说是不可或缺的。本文介绍了UI-NEXUS，这是一个全面的基准测试，旨在评估移动体在三类组合操作上的表现：简单连接、上下文转换和深度潜水。UI-NEXUS支持在20个完全可控的本地实用应用程序环境中的交互式评估，以及30个在线中文和英文服务应用程序。它包括100个交互式任务模板，平均最佳步骤计数为14.05。跨越一系列具有自主工作流程或代理模型的移动代理的实验结果表明，UI-NEXUS存在重大挑战。具体来说，现有代理通常在性能和效率之间挣扎，表现出典型的失败模式，如执行不足、过度执行和注意力漂移，导致明显的从原子到组合的一般化差距。根据这些发现，我们提出了AGENT-NEXUS，这是一个轻便高效的调度系统，用于解决组合移动任务。AGENT-NEXUS通过动态地将长周期任务分解为一系列独立的原子子任务来推断现有移动代理的能力。在UI-NEXUS基准测试中，AGENT-NEXUS在不显著增加推理开销的情况下，使现有移动代理在组合操作任务上的任务成功率提高了24%至40%。演示视频、数据集和代码可在https://ui-nexus.github.io项目页面上找到。

论文及项目相关链接

PDF

Summary

Cool Papers

点此查看论文截图

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Authors:Wendong Bu, Yang Wu, Qifan Yu, Minghe Gao, Bingchen Miao, Zhenkui Zhang, Kaihang Pan, Yunfei Li, Mengze Li, Wei Ji, Juncheng Li, Siliang Tang, Yueting Zhuang

As multimodal large language models (MLLMs) advance, MLLM-based virtual agents have demonstrated remarkable performance. However, existing benchmarks face significant limitations, including uncontrollable task complexity, extensive manual annotation with limited scenarios, and a lack of multidimensional evaluation. In response to these challenges, we introduce OmniBench, a self-generating, cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity through subtask composition. To evaluate the diverse capabilities of virtual agents on the graph, we further present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our synthesized dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate. Training on our graph-structured data shows that it can more efficiently guide agents compared to manually annotated data. We conduct multidimensional evaluations for various open-source and closed-source models, revealing their performance across various capabilities and paving the way for future advancements. Our project is available at https://omni-bench.github.io/.

随着多模态大型语言模型（MLLMs）的不断发展，基于MLLM的虚拟代理已经表现出了显著的性能。然而，现有的基准测试面临着重大挑战，包括任务复杂性不可控、场景有限且需要大量手动标注，以及缺乏多维评估。为了应对这些挑战，我们引入了OmniBench，这是一个自我生成、跨平台、基于图形的基准测试，通过子任务组合，具有合成可控复杂度任务的自动化管道。为了评估虚拟代理在图上的各种能力，我们还推出了OmniEval，这是一个多维评估框架，包括子任务级评估、基于图形的指标和跨越10种能力的综合测试。我们合成的数据集包含20个场景下的3.6万图形结构任务，达到了91%的人类接受率。在我们的图形结构数据上进行训练表明，与手动注释的数据相比，它可以更有效地指导代理。我们对各种开源和闭源模型进行了多维评估，揭示了它们在各种能力方面的表现，为未来的进步铺平了道路。我们的项目可在https://omni-bench.github.io/访问。

论文及项目相关链接

PDF Accepted by ICML 2025 (Oral)

Summary

随着多模态大型语言模型（MLLMs）的发展，基于MLLM的虚拟代理展现出卓越的性能。然而，现有基准测试面临诸多局限性，包括任务复杂性不可控、场景有限且需大量手动标注，以及缺乏多维度评估。为应对这些挑战，我们推出OmniBench，一个自我生成、跨平台、基于图谱的基准测试，通过子任务组合来合成可控复杂度的任务。为评估虚拟代理在图谱上的多元能力，我们还推出OmniEval，一个包含子任务级别评估、基于图谱的指标和全面测试10种能力的多维度评估框架。我们的合成数据集包含20个场景下的3.6万张图谱任务，达到91%的人类接受率。使用我们的图谱数据进行训练表明，与手动注释数据相比，它可以更有效地指导代理。我们对各种开源和闭源模型进行了多维度评估，揭示了它们在各种能力上的表现，为未来发展铺平了道路。

Key Takeaways

多模态大型语言模型（MLLMs）的虚拟代理表现出卓越性能。
现有基准测试存在任务复杂性不可控、场景有限和手动标注量大等局限性。
OmniBench是一个自我生成、跨平台、基于图谱的基准测试，可合成可控复杂度的任务。
OmniEval是一个多维度评估框架，用于评估虚拟代理在图谱上的多元能力。
合成数据集包含3.6万张图谱任务，人类接受率高达91%。
与手动注释数据相比，使用图谱数据进行训练可以更有效地指导虚拟代理。
多维度评估了各种模型在多种能力上的表现。

Cool Papers

点此查看论文截图

Improved LLM Agents for Financial Document Question Answering

Authors:Nelvin Tan, Zian Seng, Liang Zhang, Yu-Ching Shih, Dong Yang, Amol Salunkhe

Large language models (LLMs) have shown impressive capabilities on numerous natural language processing tasks. However, LLMs still struggle with numerical question answering for financial documents that include tabular and textual data. Recent works have showed the effectiveness of critic agents (i.e., self-correction) for this task given oracle labels. Building upon this framework, this paper examines the effectiveness of the traditional critic agent when oracle labels are not available, and show, through experiments, that this critic agent’s performance deteriorates in this scenario. With this in mind, we present an improved critic agent, along with the calculator agent which outperforms the previous state-of-the-art approach (program-of-thought) and is safer. Furthermore, we investigate how our agents interact with each other, and how this interaction affects their performance.

大型语言模型（LLM）在自然语言处理任务中表现出了令人印象深刻的能力。然而，LLM在处理包含表格和文本数据的金融文档的数值问答方面仍然面临挑战。近期的研究已经显示了批评代理（即自我校正）在该任务上的有效性，前提是必须有正确的标签。本文在此基础上，研究了当没有正确标签时传统批评代理的有效性，并通过实验表明在这种情况下批评代理的性能会下降。鉴于此，我们提出了一种改进的批评代理和计算器代理，计算器代理的性能超过了之前的最新方法（思维编程），并且更加安全。此外，我们还研究了这两个代理之间的交互方式以及这种交互如何影响它们的性能。

论文及项目相关链接

PDF 12 pages, 5 figures

Summary

大型语言模型在自然语言处理任务中表现出强大的能力，但在金融文档的数值问答任务中面临挑战，特别是缺乏标准标签的情况。本研究探讨了当没有标准标签时传统批评代理的表现下降的问题，并提出了一种改进的批评代理和计算器代理。该新方法超越了现有的程序思维方法的性能并具有较高的安全性。此外，本研究还探讨了这些代理之间的相互作用及其对性能的影响。

Key Takeaways

大型语言模型在自然语言处理任务中表现出强大的能力，但在金融文档的数值问答方面存在挑战。
传统批评代理在缺乏标准标签时的性能会下降。
研究提出了一种改进的批评代理和计算器代理，该方法在安全性和性能上超越了现有的程序思维方法。
研究探讨了代理之间的相互作用及其对性能的影响。
改进后的批评代理和计算器代理能有效处理金融文档中的表格和文本数据。
该研究为金融文档的数值问答任务提供了一种新的解决方案。

Cool Papers

点此查看论文截图

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

Authors:Kuo Yang, Xingjie Yang, Linhui Yu, Qing Xu, Yan Fang, Xu Wang, Zhengyang Zhou, Yang Wang

Large Language Model (LLM)-driven Multi-agent systems (Mas) have recently emerged as a powerful paradigm for tackling complex real-world tasks. However, existing Mas construction methods typically rely on manually crafted interaction mechanisms or heuristic rules, introducing human biases and constraining the autonomous ability. Even with recent advances in adaptive Mas construction, existing systems largely remain within the paradigm of semi-autonomous patterns. In this work, we propose MasHost, a Reinforcement Learning (RL)-based framework for autonomous and query-adaptive Mas design. By formulating Mas construction as a graph search problem, our proposed MasHost jointly samples agent roles and their interactions through a unified probabilistic sampling mechanism. Beyond the accuracy and efficiency objectives pursued in prior works, we introduce component rationality as an additional and novel design principle in Mas. To achieve this multi-objective optimization, we propose Hierarchical Relative Policy Optimization (HRPO), a novel RL strategy that collaboratively integrates group-relative advantages and action-wise rewards. To our knowledge, our proposed MasHost is the first RL-driven framework for autonomous Mas graph construction. Extensive experiments on six benchmarks demonstrate that MasHost consistently outperforms most competitive baselines, validating its effectiveness, efficiency, and structure rationality.

基于大型语言模型（LLM）的多智能体系统（Mas）最近出现为一种强大的范式，用于处理复杂的现实世界任务。然而，现有的Mas构建方法通常依赖于手动制作的交互机制或启发式规则，这引入了人类偏见并限制了自主性能力。尽管最近在自适应Mas构建方面取得了进展，但现有系统大多仍保持在半自主模式的范式内。在这项工作中，我们提出了MasHost，这是一个基于强化学习（RL）的自主和查询自适应Mas设计框架。通过将Mas构建公式化为图搜索问题，我们提出的MasHost通过统一的概率采样机制联合采样智能体的角色和它们的交互。除了先前工作中追求准确性和效率目标之外，我们引入了组件合理性作为Mas中的附加和新颖的设计原则。为了实现这种多目标优化，我们提出了分层相对策略优化（HRPO），这是一种新型的RL策略，能够协同整合群体相对优势和行动方面的奖励。据我们所知，我们提出的MasHost是第一个用于自主Mas图构建RL驱动的框架。在六个基准测试上的广泛实验表明，MasHost始终优于大多数竞争基线，验证了其有效性、效率和结构合理性。

论文及项目相关链接

PDF

Summary
大语言模型驱动的跨主体系统面临人工交互机制和启发式规则的局限性，这带来了人为偏见并限制了自主性。研究团队提出了MasHost框架，使用强化学习技术自主设计和构建多主体系统，并将这个过程公式化为图形搜索问题。新的优化方法将准确性、效率和组件合理性相结合，采用分层相对策略优化方法。实验证明，MasHost在多个基准测试中表现优异，验证了其有效性、效率和结构合理性。

Key Takeaways

大语言模型驱动的多主体系统（Mas）是处理复杂现实世界任务的有力工具。
传统构建方法依赖人为设计的交互机制和启发式规则，存在人为偏见和自主性的限制。
MasHost框架首次使用强化学习技术自主设计和构建多主体系统，解决了上述问题。
MasHost通过图形搜索公式化过程实现自适应的多主体系统设计，使用统一的概率采样机制对主体角色和互动进行采样。
提出组件合理性作为新的设计原则，以实现多目标优化。
采用分层相对策略优化方法（HRPO），整合群体相对优势和行动奖励。
实验证明MasHost在多个基准测试中表现优异，验证其有效性、效率和结构合理性。

Cool Papers

点此查看论文截图

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

Authors:Ziqi. Liu, Ziyang. Zhou, Mingxuan. Hu

Large language model (LLM) have become mainstream methods in the field of sarcasm detection. However, existing LLM methods face challenges in irony detection, including: 1. single-perspective limitations, 2. insufficient comprehensive understanding, and 3. lack of interpretability. This paper introduces the Collaborative Agent Framework for Irony (CAF-I), an LLM-driven multi-agent system designed to overcome these issues. CAF-I employs specialized agents for Context, Semantics, and Rhetoric, which perform multidimensional analysis and engage in interactive collaborative optimization. A Decision Agent then consolidates these perspectives, with a Refinement Evaluator Agent providing conditional feedback for optimization. Experiments on benchmark datasets establish CAF-I’s state-of-the-art zero-shot performance. Achieving SOTA on the vast majority of metrics, CAF-I reaches an average Macro-F1 of 76.31, a 4.98 absolute improvement over the strongest prior baseline. This success is attained by its effective simulation of human-like multi-perspective analysis, enhancing detection accuracy and interpretability.

大型语言模型（LLM）已成为自然语言处理领域中嘲讽检测的主流方法。然而，现有的LLM方法在检测讽刺时面临挑战，包括：1.单一视角的局限性，2.综合理解不足，以及3.缺乏可解释性。本文介绍了用于讽刺的协作代理框架（CAF-I），这是一个以LLM驱动的多代理系统，旨在克服这些问题。CAF-I采用专门针对上下文、语义和修辞的代理，进行多维分析并参与交互式协同优化。然后，决策代理整合这些观点，细化评估代理提供条件反馈以实现优化。在基准数据集上的实验证明了CAF-I最先进的零样本性能。在大多数指标上达到最新水平，CAF-I的平均宏观F1分数为76.31，较之前最强的基线模型有4.98的绝对改进。这一成功是通过其模拟人类多视角分析而实现的，提高了检测精度和可解释性。

论文及项目相关链接

PDF ICML 2025 Workshop on Collaborative and Federated Agentic Workflows

Summary：
大型语言模型（LLM）在讽刺检测领域已成为主流方法，但在讽刺检测方面仍面临挑战。本文介绍了一种基于LLM的多代理系统——协作代理框架（CAF-I），通过专业化的上下文、语义和修辞代理进行多维分析，并通过决策代理进行视角整合，优化评估代理提供条件反馈以实现优化。在基准数据集上的实验证明了CAF-I的零样本性能达到最新水平，在大多数指标上均表现优异，平均Macro-F1达到76.31，相较于最强基线有4.98的绝对提升。这得益于其模拟人类的多角度分析，提高了检测准确性和可解释性。

Key Takeaways：

大型语言模型（LLM）在讽刺检测中是主流方法，但仍面临挑战。
现有LLM方法在讽刺检测中的挑战包括单视角限制、缺乏全面理解和缺乏可解释性。
CAF-I是一种基于LLM的多代理系统，通过专业化的上下文、语义和修辞代理进行多维分析。
CAF-I采用决策代理来整合视角，并使用优化评估代理进行条件反馈以实现优化。
实验证明CAF-I的零样本性能达到最新水平，平均Macro-F1达到76.31。
CAF-I相较于最强基线有4.98的绝对提升。

Cool Papers

点此查看论文截图

TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Authors:Weiya Li, Junjie Chen, Bei Li, Boyang Liu, Zichen Wen, Nuanqiao Shan, Xiaoqian Liu, Anping Liu, Huajie Liu, Youyan Wang, Wujiuge Yin, Hu Song, Bing Huang, Zhiyuan Xia, Jialiang Chen, Linfeng Zhang

Machine translation has long been a central task in natural language processing. With the rapid advancement of large language models (LLMs), there has been remarkable progress in translation quality. However, fully realizing the translation potential of LLMs remains an open challenge. Recent studies have explored multi-agent systems to decompose complex translation tasks into collaborative subtasks, showing initial promise in enhancing translation quality through agent cooperation and specialization. Nevertheless, existing multi-agent translation frameworks largely neglect foundational insights from cognitive translation studies. These insights emphasize how human translators employ different cognitive strategies, such as balancing literal and free translation, refining expressions based on context, and iteratively evaluating outputs. To address this limitation, we propose a cognitively informed multi-agent framework called TACTIC, which stands for T ranslation A gents with Cognitive- T heoretic Interactive Collaboration. The framework comprises six functionally distinct agents that mirror key cognitive processes observed in human translation behavior. These include agents for drafting, refinement, evaluation, scoring, context reasoning, and external knowledge gathering. By simulating an interactive and theory-grounded translation workflow, TACTIC effectively leverages the full capacity of LLMs for high-quality translation. Experimental results on diverse language pairs from the FLORES-200 and WMT24 benchmarks show that our method consistently achieves state-of-the-art performance. Using DeepSeek-V3 as the base model, TACTIC surpasses GPT-4.1 by an average of +0.6 XCOMET and +1.18 COMETKIWI-23. Compared to DeepSeek-R1, it further improves by +0.84 XCOMET and +2.99 COMETKIWI-23. Code is available at https://github.com/weiyali126/TACTIC.

机器翻译长期以来都是自然语言处理中的一项核心任务。随着大型语言模型（LLMs）的快速发展，翻译质量取得了显著的进步。然而，完全实现LLMs的翻译潜力仍然是一个开放性的挑战。近期的研究探索了多智能体系统将复杂的翻译任务分解成协同的子任务，初步显示出通过智能体协作和专业化增强翻译质量的希望。然而，现有的多智能体翻译框架很大程度上忽视了来自认知翻译研究的见解。这些见解强调人类译者如何采用不同的认知策略，如平衡直译和意译、根据上下文优化表达、以及迭代评估输出。为了解决这个问题，我们提出了一个受认知启发的多智能体框架，名为TACTIC，代表具有认知理论交互协作的翻译智能体。该框架包括六个功能各异的智能体，反映人类翻译行为中观察到的关键认知过程。这些智能体包括起草、改进、评估、打分、上下文推理和外部知识收集的智能体。通过模拟交互和基于理论的翻译工作流程，TACTIC有效地利用LLMs的全部能力实现高质量翻译。在FLORES-200和WMT24基准测试上的多样化语言对的实验结果表明，我们的方法一直达到最新技术水平。以DeepSeek-V3为基础模型，TACTIC平均超越GPT-4.1，XCOMET提升+0.6，COMETKIWI-23提升+1.18。相较于DeepSeek-R1，进一步在XCOMET上提升+0.84，COMETKIWI-23提升+2.99。代码可用在https://github.com/weiyali126/TACTIC。

论文及项目相关链接

PDF 20 pages, 4 figures, Under review. Code: https://github.com/weiyali126/TACTIC

Summary
机器翻译是自然语言处理中的核心任务之一。随着大型语言模型（LLMs）的快速发展，翻译质量取得了显著进步，但完全实现LLMs的翻译潜力仍是一个开放性的挑战。本文提出一个认知启发下的多智能体翻译框架TACTIC，该框架包括六个功能各异的智能体，模拟人类翻译过程中的关键认知行为。实验结果表明，TACTIC框架在多种语言对上的翻译性能达到最新水平，优于GPT-4.1和DeepSeek-R1。

Key Takeaways

机器翻译是自然语言处理的重要任务，大型语言模型（LLMs）的进步推动了翻译质量的提升。
完全实现LLMs的翻译潜力仍存在挑战，需要探索新的方法和技术。
多智能体系统在翻译任务中的应用展现出提升翻译质量的初步希望。
现有多智能体翻译框架忽视了来自认知翻译研究的见解。
TACTIC框架是一个认知启发下的多智能体翻译框架，包括六个功能各异的智能体，模拟人类翻译过程中的关键认知行为。
TACTIC框架在多种语言对上的翻译性能达到最新水平，优于GPT-4.1和DeepSeek-R1。

Cool Papers

点此查看论文截图

Reinforce LLM Reasoning through Multi-Agent Reflection

Authors:Yurun Yuan, Tengyang Xie

Leveraging more test-time computation has proven to be an effective way to boost the reasoning capabilities of large language models (LLMs). Among various methods, the verify-and-improve paradigm stands out for enabling dynamic solution exploration and feedback incorporation. However, existing approaches often suffer from restricted feedback spaces and lack of coordinated training of different parties, leading to suboptimal performance. To address this, we model this multi-turn refinement process as a Markov Decision Process and introduce DPSDP (Direct Policy Search by Dynamic Programming), a reinforcement learning algorithm that trains an actor-critic LLM system to iteratively refine answers via direct preference learning on self-generated data. Theoretically, DPSDP can match the performance of any policy within the training distribution. Empirically, we instantiate DPSDP with various base models and show improvements on both in- and out-of-distribution benchmarks. For example, on benchmark MATH 500, majority voting over five refinement steps increases first-turn accuracy from 58.2% to 63.2% with Ministral-based models. An ablation study further confirms the benefits of multi-agent collaboration and out-of-distribution generalization.

利用更多的测试时间计算已被证明是提高大型语言模型（LLM）推理能力的一种有效方法。在各种方法中，验证和改进范式脱颖而出，因为它能够实现动态解决方案探索和反馈融合。然而，现有方法常常受到反馈空间限制和各方缺乏协调训练的困扰，导致性能不佳。为解决这一问题，我们将这种多轮优化过程建模为马尔可夫决策过程，并引入DPSDP（通过动态规划进行直接策略搜索）一种强化学习算法，该算法训练一个actor-critic LLM系统，通过直接在自我生成的数据上进行偏好学习来迭代优化答案。理论上，DPSDP可以在训练分布内匹配任何策略的性能。实际上，我们用各种基础模型实例化了DPSDP，并展示了在内部和外部基准测试中的改进。例如，在MATH 500基准测试中，经过五步优化的多数投票结果将初答准确率从58.2%提高到了63.2%，使用Ministral作为基础模型。一项消融研究进一步证实了多智能体协作和跨分布泛化的好处。

论文及项目相关链接

PDF International Conference on Machine Learning (ICML), 2025

Summary

本摘要以简洁的方式描述了如何利用更多的测试时间计算来提升大型语言模型的推理能力。其中，验证和改进范式特别突出，它可实现动态解决方案探索和反馈融合。针对现有方法存在的反馈空间受限和不同参与方缺乏协同训练导致性能不佳的问题，本研究采用马尔可夫决策过程模拟多轮细化过程，并引入DPSDP（一种基于动态规划的直接政策搜索强化学习算法），训练演员评论家LLM系统通过直接偏好学习在自我生成的数据上迭代优化答案。理论上，DPSDP可在训练分布内匹配任何策略的性能。实证研究证明，在各种基准测试中，DPSDP均有显著改善，如MATH 500基准测试中，经过五步优化后的多数投票结果首次准确率从58.2%提升至63.2%。同时，通过消融研究进一步验证了多智能体协作和跨分布泛化的优势。这是一项跨语言和实际应用前景广泛的研究成果。随着相关研究和应用的进一步发展，这一领域将会有更多的突破和进展。简言之，本文提出一种利用强化学习算法提升大型语言模型推理能力的新方法，具有较强的理论与实践价值。总体来说对实际部署具有很高的参考意义和研究价值。总之在实际应用中还需验证与完善效果具体的情况可能需要后续进一步研究讨论证实确认推广应用的可行性和具体实施方案并且不断探索解决存在问题的能力为未来的发展提供更多的创新想法以及实际应用场景等。目前该研究仍具有广阔的应用前景和潜在价值值得进一步深入研究和探索。文中提出了一种新的强化学习算法来提升大型语言模型的推理能力在未来该技术能够落地应用中将推动自然语言处理领域的进一步发展对于推进相关领域技术的发展与应用具有重要意义促进多智能体协作的进一步发展具有重要的理论和实践价值并可能产生深远的社会影响和经济价值推动产业的升级和转型提升国家竞争力推动经济发展和社会进步具有广泛的应用前景和发展潜力。文中提出的算法具有广阔的应用前景和潜在价值未来有望广泛应用于自然语言处理领域的相关场景如智能客服问答系统对话生成机器人等领域在实际应用中具有非常广阔的推广前景。未来可以进一步探索将该方法应用于更多领域场景以及优化算法性能等方面。该算法为自然语言处理领域带来创新突破有望成为未来研究的重要方向之一对于该算法在实际应用中的性能和表现有广泛期待并引起行业的关注与研究价值表明它在人工智能自然语言处理等领域的重要贡献同时意味着人类在应用人工智能技术的道路上又迈出了重要的一步。该算法将极大地推动人工智能技术的发展和应用为人们的生活带来便利并产生深远的影响和经济效益为社会进步贡献力量为人类社会的科技进步注入新的活力。Key Takeaways：

利用更多的测试时间计算可有效提升大型语言模型的推理能力。
验证和改进范式可实现动态解决方案探索和反馈融合。
DPSDP算法通过直接偏好学习在自我生成的数据上训练LLM系统以迭代优化答案。
DPSDP算法能在理论上限匹配任何策略的性能，并在多个基准测试中实现显著改进。
消融研究证明了多智能体协作和跨分布泛化的优势。
DPSDP算法具有广泛的应用前景，可应用于自然语言处理领域的多个场景，如智能客服、问答系统和对话生成机器人等。

Cool Papers

点此查看论文截图

HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation

Authors:Hongjun Wu, Heng Zhang, Pengsong Zhang, Jin Wang, Cong Wang

Recent advances in multimodal vision-language-action (VLA) models have revolutionized traditional robot learning, enabling systems to interpret vision, language, and action in unified frameworks for complex task planning. However, mastering complex manipulation tasks remains an open challenge, constrained by limitations in persistent contextual memory, multi-agent coordination under uncertainty, and dynamic long-horizon planning across variable sequences. To address this challenge, we propose \textbf{HiBerNAC}, a \textbf{Hi}erarchical \textbf{B}rain-\textbf{e}mulated \textbf{r}obotic \textbf{N}eural \textbf{A}gent \textbf{C}ollective, inspired by breakthroughs in neuroscience, particularly in neural circuit mechanisms and hierarchical decision-making. Our framework combines: (1) multimodal VLA planning and reasoning with (2) neuro-inspired reflection and multi-agent mechanisms, specifically designed for complex robotic manipulation tasks. By leveraging neuro-inspired functional modules with decentralized multi-agent collaboration, our approach enables robust and enhanced real-time execution of complex manipulation tasks. In addition, the agentic system exhibits scalable collective intelligence via dynamic agent specialization, adapting its coordination strategy to variable task horizons and complexity. Through extensive experiments on complex manipulation tasks compared with state-of-the-art VLA models, we demonstrate that \textbf{HiBerNAC} reduces average long-horizon task completion time by 23%, and achieves non-zero success rates (12\textendash 31%) on multi-path tasks where prior state-of-the-art VLA models consistently fail. These results provide indicative evidence for bridging biological cognition and robotic learning mechanisms.

近期多模态视觉语言动作（VLA）模型的进步已经彻底改变了传统机器人学习的方式，使系统在统一的框架下解释视觉、语言和行为，从而进行复杂的任务规划。然而，掌握复杂的操作任务仍然是一个开放性的挑战，受到持久上下文记忆限制、不确定性下的多智能体协调和可变序列中的动态长期规划限制的影响。为了应对这一挑战，我们提出了\textbf{HiBerNAC}，这是一个受神经科学突破启发的分层脑模拟机器人神经网络集体（\textbf{Hi}erarchical \textbf{B}rain-\textbf{e}mulated \textbf{r}obotic \textbf{N}eural \textbf{A}gent \textbf{C}ollective）。我们的框架结合了（1）多模态VLA规划和推理与（2）神经启发反思和多智能体机制，专为复杂的机器人操作任务设计。通过利用神经启发的功能模块和分散的多智能体协作，我们的方法能够稳健地增强复杂操作任务的实时执行。此外，智能系统通过动态智能体专业化展现出可扩展的集体智能，使其协调策略适应可变的任务视野和复杂性。通过与最新的VLA模型在复杂的操作任务上进行广泛的实验比较，我们证明\textbf{HiBerNAC}平均长期任务完成时间减少了23%，并且在多路径任务上实现了非零成功率（12%\textendash 31%），而先前的最新VLA模型则一直失败。这些结果为连接生物认知和机器人学习机制提供了指示性证据。

论文及项目相关链接

PDF 31 pages,5 figures

摘要

最新多模态视语言动作（VLA）模型的进展已对传统机器人学习带来了革命性变革，使系统能够在统一框架中解释视觉、语言和动作，用于复杂任务规划。然而，掌握复杂操作任务仍是一项开放性的挑战，受限于持久性上下文记忆、不确定性下的多智能体协调和可变序列的动态长期规划等方面的局限。为解决此挑战，我们提出了HiBerNAC，一个受神经科学突破启发的分层脑模拟机器人神经网络集体（Hierarchical Brain-emulated robotic Neural Agent Collective）。该框架结合了多模态VLA规划和推理与神经启发反思和多智能体机制，专为复杂机器人操作任务设计。通过利用神经启发功能模块与分布式多智能体协作，我们的方法可实现稳健且增强的实时执行复杂操作任务。此外，该智能体系统通过动态智能体专业化展现可扩展的集体智能，使协调策略适应可变的任务范围和复杂性。与最新的VLA模型相比，我们在复杂的操作任务上进行了广泛的实验，证明HiBerNAC能够减少平均长期任务完成时间23%，并在多路径任务上实现非零成功率（12%~31%），而先前的最新VLA模型则一直未能成功。这些结果提供了将生物认知与机器人学习机制相结合的指示性证据。

关键见解

多模态视语言动作（VLA）模型的最新进展已经推动了机器人学习领域的变革，使机器人能够在统一框架内解释视觉、语言和动作。
复杂操作任务对机器人学习构成挑战，主要包括持久上下文记忆、多智能体协调和动态长期规划的问题。
HiBerNAC框架受神经科学启发，结合多模态VLA规划和推理与神经启发功能模块。
通过神经启发功能模块和分布式多智能体协作，HiBerNAC能够稳健地执行复杂操作任务，并增强实时性能。
HiBerNAC展现出通过动态智能体专业化实现的集体智能的可扩展性，适应可变的任务范围和复杂性。
实验证明，与最新VLA模型相比，HiBerNAC在复杂操作任务上的表现有所超越，减少了任务完成时间，并在多路径任务上实现了非零成功率。

Cool Papers

点此查看论文截图

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

Authors:Junhong Shen, Hao Bai, Lunjun Zhang, Yifei Zhou, Amrith Setlur, Shengbang Tong, Diego Caples, Nan Jiang, Tong Zhang, Ameet Talwalkar, Aviral Kumar

The current paradigm of test-time scaling relies on generating long reasoning traces (“thinking” more) before producing a response. In agent problems that require interaction, this can be done by generating thinking traces before acting in the world. However, this process does not allow agents to acquire new information from the environment or adapt their behavior over time. In this work, we propose to scale test-time interaction, an untapped dimension of test-time scaling that increases the agent’s interaction horizon to enable running rich behaviors such as exploration, backtracking, and dynamic re-planning within a single rollout. To demonstrate the promise of this scaling dimension, we study the domain of web agents. We first show that even prompting-based interaction scaling without any training can improve task success on web benchmarks non-trivially. Building on this, we introduce TTI (Test-Time Interaction), a curriculum-based online reinforcement learning (RL) approach that trains agents by adaptively adjusting their rollout lengths. Using a Gemma 3 12B model, TTI produces state-of-the-art open-source, open-data web agents on WebVoyager and WebArena benchmarks. We further show that TTI enables agents to balance exploration and exploitation adaptively. Our results establish interaction scaling as a powerful, complementary axis to scaling per-step compute, offering new avenues for training adaptive agents.

当前测试时间缩放的范式依赖于在生成响应之前产生较长的推理轨迹（即“思考”更多）。在需要交互的代理问题中，这可以通过在世界中采取行动之前生成思维轨迹来完成。然而，这个过程不允许代理从环境中获取新信息，也不能随着时间的推移改变它们的行为。在这项工作中，我们提出了测试时间交互的扩展，这是测试时间缩放的一个未被开发的维度，它增加了代理的交互范围，使代理能够在单个运行中执行丰富的行为，如探索、回溯和动态重新规划。为了证明这一扩展维度的潜力，我们研究了网页代理领域。我们首先表明，即使在没有任何训练的情况下，基于提示的交互扩展也可以在一定程度上提高网页基准测试的任务成功率。在此基础上，我们引入了测试时间交互（TTI），这是一种基于课程的在线强化学习（RL）方法，通过自适应调整滚动长度来训练代理。使用Gemma 3 12B模型，TTI在WebVoyager和WebArena基准测试中产生了最先进的开源开放数据网页代理。我们还进一步展示了TTI使代理能够自适应地平衡探索和利用。我们的结果确立了交互缩放作为一个强大的、与每步计算缩放相辅相成的维度，为训练自适应代理提供了新的途径。

论文及项目相关链接

PDF Fixed typo in Figure 6 and Conclusion

Summary

本文提出测试时交互扩展的概念，旨在提高智能体在环境中的交互能力，使其能够在单次运行中执行丰富的行为，如探索、回溯和动态规划。该研究通过在无需额外训练的情况下采用提示为基础的交互扩展方式，实现了在非基准测试上的任务成功率的提升。此外，该研究还引入了一种基于在线强化学习的测试时交互（TTI）方法，通过自适应调整智能体的运行时长进行训练。TTI使用Gemma 3 12B模型在WebVoyager和WebArena基准测试中产生了先进的开源智能体。结果表明，测试时交互扩展是一种强大的、与每步计算扩展相辅相成的补充轴，为训练自适应智能体提供了新的途径。

Key Takeaways

当前测试时间扩展的范式依赖于生成长的推理轨迹再做出回应，但在需要交互的智能体问题中，这种方法无法让智能体从环境中获取新信息或随时间改变行为。
测试时交互扩展是提高智能体在环境中的交互能力的一种新方法，能使其执行探索、回溯和动态规划等丰富行为。
采用提示为基础的交互扩展方式，能在不额外训练的情况下提升任务成功率。
引入了一种基于在线强化学习的测试时交互（TTI）方法，通过自适应调整智能体的运行时长进行训练。
TTI使用Gemma 3 12B模型在WebVoyager和WebArena基准测试中表现出色。
测试时交互扩展是一种强大的补充轴，与每步计算扩展相辅相成。

Cool Papers

点此查看论文截图

SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

Authors:Peiran Li, Xinkai Zou, Zhuohang Wu, Ruifeng Li, Shuo Xing, Hanwen Zheng, Zhikai Hu, Yuping Wang, Haoxi Li, Qin Yuan, Yingmo Zhang, Zhengzhong Tu

Recent advances in large language models (LLMs) and vision-language models (VLMs) have enabled powerful autonomous agents capable of complex reasoning and multi-modal tool use. Despite their growing capabilities, today’s agent frameworks remain fragile, lacking principled mechanisms for secure information flow, reliability, and multi-agent coordination. In this work, we introduce SAFEFLOW, a new protocol-level framework for building trustworthy LLM/VLM-based agents. SAFEFLOW enforces fine-grained information flow control (IFC), precisely tracking provenance, integrity, and confidentiality of all the data exchanged between agents, tools, users, and environments. By constraining LLM reasoning to respect these security labels, SAFEFLOW prevents untrusted or adversarial inputs from contaminating high-integrity decisions. To ensure robustness in concurrent multi-agent settings, SAFEFLOW introduces transactional execution, conflict resolution, and secure scheduling over shared state, preserving global consistency across agents. We further introduce mechanisms, including write-ahead logging, rollback, and secure caches, that further enhance resilience against runtime errors and policy violations. To validate the performances, we built SAFEFLOWBENCH, a comprehensive benchmark suite designed to evaluate agent reliability under adversarial, noisy, and concurrent operational conditions. Extensive experiments demonstrate that agents built with SAFEFLOW maintain impressive task performance and security guarantees even in hostile environments, substantially outperforming state-of-the-art. Together, SAFEFLOW and SAFEFLOWBENCH lay the groundwork for principled, robust, and secure agent ecosystems, advancing the frontier of reliable autonomy.

最近的大型语言模型（LLM）和视觉语言模型（VLM）的进步使得能够执行复杂推理和多模式工具使用的强大自主代理成为可能。尽管它们的能力日益增强，但当前的代理框架仍然脆弱，缺乏安全信息流、可靠性和多代理协调的原则性机制。在这项工作中，我们引入了SAFEFLOW，这是一个用于构建可信的LLM/VLM基于代理的新协议级框架。SAFEFLOW强制实施精细的信息流控制（IFC），精确跟踪代理、工具、用户和环境之间交换的所有数据的来源、完整性和机密性。通过限制LLM推理以尊重这些安全标签，SAFEFLOW防止不受信任或对敌输入污染高完整性决策。为了确保在并发多代理环境中的稳健性，SAFEFLOW引入了事务执行、冲突解决和共享状态的安全调度，以保留全局一致性。我们还引入了包括预写日志、回滚和安全缓存等机制，进一步增强对运行时错误和政策违规的抵御能力。为了验证性能，我们构建了SAFEFLOWBENCH，这是一套综合基准测试，旨在评估代理在敌对、嘈杂和并发操作条件下的可靠性。大量实验表明，使用SAFEFLOW构建的代理即使在恶劣环境中也能保持令人印象深刻的任务性能和安全保证，显著优于最新技术。SAFEFLOW和SAFEFLOWBENCH一起奠定了原则性、稳健性和安全代理生态系统的基石，推动了可靠自主性的前沿发展。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）和视觉语言模型（VLM）的最新进展已经催生了能够进行复杂推理和多模态工具使用的强大自主代理。然而，当前的代理框架仍然脆弱，缺乏安全信息流、可靠性和多代理协调的机制。为此，我们引入了SAFEFLOW，一个为构建可信LLM/VLM代理的新协议级框架。SAFEFLOW强制执行精细的信息流控制（IFC），精确跟踪代理、工具、用户和环境之间交换的所有数据的来源、完整性和机密性。通过约束LLM推理以尊重这些安全标签，SAFEFLOW防止不受信任或对抗性输入污染高完整性的决策。为确保并发多代理设置中的稳健性，SAFEFLOW引入了事务执行、冲突解决和共享状态的安全调度，以保留全局一致性。我们进一步引入了包括写前日志、回滚和安全缓存等机制，以增强对运行时错误和政策违规的抵御能力。为了验证性能，我们构建了SAFEFLOWBENCH，一个综合基准测试套件，旨在评估代理在敌对、嘈杂和并发操作条件下的可靠性。实验表明，使用SAFEFLOW构建的代理即使在恶劣环境中也能保持令人印象深刻的任务性能和安全性保证，显著优于现有技术。SAFEFLOW和SAFEFLOWBENCH共同为原理化、稳健和安全的代理生态系统奠定了基础，推动了可靠自主性前沿的发展。

Key Takeaways

LLMs和VLMs的最新进展使得复杂推理和多模态工具使用的强大自主代理成为可能。
当前代理框架缺乏安全信息流、可靠性和多代理协调的机制。
SAFEFLOW是一个新的协议级框架，为构建可信LLM/VLM代理提供方案。
SAFEFLOW通过精细的信息流控制（IFC）确保数据安全性和完整性。
SAFEFLOW防止不受信任或对抗性输入污染决策，并通过事务执行、冲突解决和安全调度确保并发环境中的稳健性。
SAFEFLOW引入了多种机制以增强对运行时错误和政策违规的抵御能力。

Cool Papers

点此查看论文截图

Authors:Qingbin Zeng, Ruotong Zhao, Jinzhu Mao, Haoyang Li, Fengli Xu, Yong Li

Modeling urban crime is an important yet challenging task that requires understanding the subtle visual, social, and cultural cues embedded in urban environments. Previous work has mainly focused on rule-based agent-based modeling (ABM) and deep learning methods. ABMs offer interpretability of internal mechanisms but exhibit limited predictive accuracy. In contrast, deep learning methods are often effective in prediction but are less interpretable and require extensive training data. Moreover, both lines of work lack the cognitive flexibility to adapt to changing environments. Leveraging the capabilities of large language models (LLMs), we propose CrimeMind, a novel LLM-driven ABM framework for simulating urban crime within a multi-modal urban context. A key innovation of our design is the integration of the Routine Activity Theory (RAT) into the agentic workflow of CrimeMind, enabling it to process rich multi-modal urban features and reason about criminal behavior. However, RAT requires LLM agents to infer subtle cues in evaluating environmental safety as part of assessing guardianship, which can be challenging for LLMs. To address this, we collect a small-scale human-annotated dataset and align CrimeMind’s perception with human judgment via a training-free textual gradient method. Experiments across four major U.S. cities demonstrate that CrimeMind outperforms both traditional ABMs and deep learning baselines in crime hotspot prediction and spatial distribution accuracy, achieving up to a 24% improvement over the strongest baseline. Furthermore, we conduct counterfactual simulations of external incidents and policy interventions and it successfully captures the expected changes in crime patterns, demonstrating its ability to reflect counterfactual scenarios. Overall, CrimeMind enables fine-grained modeling of individual behaviors and facilitates evaluation of real-world interventions.

建模城市犯罪是一项重要且具有挑战性的任务，需要理解城市环境中微妙的视觉、社会和文化线索。以往的研究主要集中在基于规则的主体建模（ABM）和深度学习方法上。ABM提供了内部机制的解释性，但预测精度有限。相比之下，深度学习方法在预测方面通常很有效，但解释性较差，且需要大量训练数据。此外，这两种方法都缺乏适应环境变化的认知灵活性。我们利用大型语言模型（LLM）的能力，提出了CrimeMind，这是一种新型的LLM驱动ABM框架，可在多模式城市背景下模拟城市犯罪。设计中的一个关键创新是将日常活动理论（RAT）集成到CrimeMind的主体工作流程中，使其能够处理丰富的多模式城市特征并对犯罪行为进行推理。然而，RAT需要LLM主体在评估环境安全性时推断微妙的线索，作为评估监护权的一部分，这对于LLM来说可能具有挑战性。为了解决这一问题，我们收集了一个小规模的人工注释数据集，并通过一种无需训练的文本梯度方法与CrimeMind的感知与人类判断对齐。在四个美国主要城市的实验表明，在犯罪热点预测和空间分布准确性方面，CrimeMind优于传统的ABM和深度学习基线，比最强基线提高了高达24%。此外，我们进行了外部事件和政策干预的反事实模拟，成功捕捉了犯罪模式的预期变化，证明了其反映反事实场景的能力。总体而言，CrimeMind能够实现个体行为的精细建模，并便于评估现实世界的干预措施。

论文及项目相关链接

PDF Typos corrected

摘要

利用大型语言模型（LLM）驱动的活动理论（RAT）和基于代理的建模（ABM）框架，提出了CrimeMind模型，该模型能够在多模态城市环境中模拟城市犯罪。通过整合RAT，CrimeMind能够处理丰富的多模态城市特征，并对犯罪行为进行推理。为解决LLM在评估环境安全性方面的挑战，采用无训练文本梯度方法与人类判断对齐。实验表明，CrimeMind在犯罪热点预测和空间分布准确性方面优于传统ABM和深度学习基线，改进率最高达24%。此外，还能进行外部事件和政策干预的模拟，成功捕捉犯罪模式的预期变化。总体而言，CrimeMind可实现精细的个体行为建模，并评估现实干预措施。

关键见解

城市犯罪建模是一项重要而具有挑战性的任务，需要理解城市环境中的微妙视觉、社会和文化线索。
现有方法主要集中于基于规则的代理建模（ABM）和深度学习，但各有局限：ABM预测精度有限，深度学习模型解释性较差且需要大量训练数据。
提出了CrimeMind模型，结合大型语言模型（LLM）和多模态城市环境模拟城市犯罪。
CrimeMind整合了活动理论（RAT），使模型能够处理丰富的多模态城市特征并推理犯罪行为。
为解决LLM在评估环境安全性方面的挑战，采用无训练文本梯度方法对齐人类判断。
实验表明，CrimeMind在犯罪热点预测和空间分布准确性方面优于传统方法。

Cool Papers

点此查看论文截图

TextAtari: 100K Frames Game Playing with Language Agents

Authors:Wenhao Li, Wenwu Li, Chuyun Shen, Junjie Sheng, Zixiao Huang, Di Wu, Yun Hua, Wei Yin, Xiangfeng Wang, Hongyuan Zha, Bo Jin

We present TextAtari, a benchmark for evaluating language agents on very long-horizon decision-making tasks spanning up to 100,000 steps. By translating the visual state representations of classic Atari games into rich textual descriptions, TextAtari creates a challenging test bed that bridges sequential decision-making with natural language processing. The benchmark includes nearly 100 distinct tasks with varying complexity, action spaces, and planning horizons, all rendered as text through an unsupervised representation learning framework (AtariARI). We evaluate three open-source large language models (Qwen2.5-7B, Gemma-7B, and Llama3.1-8B) across three agent frameworks (zero-shot, few-shot chain-of-thought, and reflection reasoning) to assess how different forms of prior knowledge affect performance on these long-horizon challenges. Four scenarios-Basic, Obscured, Manual Augmentation, and Reference-based-investigate the impact of semantic understanding, instruction comprehension, and expert demonstrations on agent decision-making. Our results reveal significant performance gaps between language agents and human players in extensive planning tasks, highlighting challenges in sequential reasoning, state tracking, and strategic planning across tens of thousands of steps. TextAtari provides standardized evaluation protocols, baseline implementations, and a framework for advancing research at the intersection of language models and planning. Our code is available at https://github.com/Lww007/Text-Atari-Agents.

我们推出了TextAtari，这是一个用于评估语言智能体在长达10万步的远期决策任务上的表现的基准测试。通过将经典的Atari游戏的视觉状态表示转化为丰富的文本描述，TextAtari创建了一个具有挑战性的测试平台，该平台将序列决策与自然语言处理相结合。该基准测试包含近100个不同任务，具有不同的复杂度、行动空间和计划期限，所有内容均通过无监督表示学习框架（AtariARI）以文本形式呈现。我们评估了三个开源大型语言模型（Qwen2.5-7B、Gemma-7B和Llama3.1-8B）在三种智能体框架（零样本、少量链式思维和反思推理）下的表现，以了解不同形式的先验知识如何影响这些长期决策挑战的性能。四种场景——基础、遮蔽、手动增强和参考基础——探讨了语义理解、指令理解和专家示范对智能体决策的影响。我们的研究结果显示，在语言智能体和人类玩家在复杂规划任务之间存在显著的性能差距，这凸显了数万步的序列推理、状态跟踪和战略规划中的挑战。TextAtari提供了标准化的评估协议、基准实施方案和一个框架，以促进语言模型和规划交叉领域的研究进展。我们的代码可在https://github.com/Lww007/Text-Atari-Agents上找到。

论文及项目相关链接

PDF 51 pages, 39 figures

Summary

文本介绍了一个名为TextAtari的基准测试平台，该平台旨在评估语言模型在长达数十万步的长周期决策任务中的表现。通过将经典的Atari游戏的视觉状态表示转化为丰富的文本描述，TextAtari为连接序列决策与自然语言处理搭建了一个挑战性的测试平台。该平台包含近100个不同任务，涵盖不同的复杂度、动作空间和规划周期，所有任务均以文本形式呈现。文章评估了三种开源大型语言模型在不同代理框架下的表现，并探讨了不同先验知识对这些长周期挑战的影响。TextAtari提供了标准化的评估协议、基准实现和一个推动语言模型和规划交叉研究的框架。

Key Takeaways

TextAtari是一个评估语言模型的基准测试平台，专注于长周期决策任务，任务跨度长达10万步。
平台通过翻译Atari游戏的视觉状态表示成文本描述，搭建了一个连接序列决策与NLP的桥梁。
TextAtari包含近100个不同任务，涵盖不同复杂度、动作空间和规划周期。
评估了三种大型语言模型在多种代理框架下的表现。
探讨了不同形式的先验知识对长周期挑战任务性能的影响。
通过四种场景研究，探讨了语义理解、指令理解和专家示范对代理决策制定的影响。

Cool Papers

点此查看论文截图

DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

Authors:Chiyu Zhang, Marc-Alexandre Cote, Michael Albada, Anush Sankaran, Jack W. Stokes, Tong Wang, Amir Abdi, William Blum, Muhammad Abdul-Mageed

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench’s modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at https://github.com/microsoft/DefenderBench.

大型语言模型（LLM）代理在人类语言理解和推理方面表现出了令人印象深刻的能力，然而它们在网络安全方面的潜力仍未被充分探索。我们推出了DefenderBench，这是一个实用、开源的工具包，用于评估语言代理在进攻、防御和网络安全知识任务方面的性能。DefenderBench包括网络入侵、恶意内容检测、代码漏洞分析和网络安全知识评估等环境。它特意为研究者设计，经济实惠、易于访问，同时提供公平严格的评估。我们使用标准化的代理框架，对若干最新技术和流行的大型语言模型进行了基准测试，包括开放和封闭权重模型。我们的结果表明，Claude-3.7-sonnet表现最佳，DefenderBench得分为81.65，其次是Claude-3.7-sonnet-think，得分为78.40，而最好的开放权重模型Llama 3.3 70B紧随其后，DefenderBench得分为71.81。DefenderBench的模块化设计允许无缝集成自定义的大型语言模型和任务，促进了可重复性和公平比较。DefenderBench的匿名版本可在https://github.com/microsoft/DefenderBench上获得。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在理解和推理人类语言方面表现出强大的能力，但在网络安全领域的应用潜力尚未得到充分探索。本文介绍了DefenderBench，这是一个用于评估语言模型在攻击、防御和网络安全知识任务上的实用开源工具包。它对研究者具有经济实惠、易于访问的特点，并能提供公平和严格的评估。我们对几款最先进和流行的LLM进行了基准测试，结果表明Claude-3.7-sonnet表现最佳，DefenderBench得分为81.65。

Key Takeaways

大型语言模型（LLM）在网络安全领域的应用潜力尚未充分探索。
DefenderBench是一个用于评估语言模型在网络安全方面的实用开源工具包。
DefenderBench包括网络入侵、恶意内容检测、代码漏洞分析和网络安全知识评估等环境。
DefenderBench具有经济实惠、易于访问的特点，为研究者提供公平和严格的评估。
在基准测试中，Claude-3.7-sonnet表现最佳，DefenderBench得分为81.65。
DefenderBench的模块化设计允许无缝集成自定义的LLM和任务。

Cool Papers

点此查看论文截图

AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation

Authors:Ming Wang, Peidong Wang, Lin Wu, Xiaocui Yang, Daling Wang, Shi Feng, Yuxin Chen, Bixuan Wang, Yifei Zhang

Constrained by the cost and ethical concerns of involving real seekers in AI-driven mental health, researchers develop LLM-based conversational agents (CAs) with tailored configurations, such as profiles, symptoms, and scenarios, to simulate seekers. While these efforts advance AI in mental health, achieving more realistic seeker simulation remains hindered by two key challenges: dynamic evolution and multi-session memory. Seekers’ mental states often fluctuate during counseling, which typically spans multiple sessions. To address this, we propose AnnaAgent, an emotional and cognitive dynamic agent system equipped with tertiary memory. AnnaAgent incorporates an emotion modulator and a complaint elicitor trained on real counseling dialogues, enabling dynamic control of the simulator’s configurations. Additionally, its tertiary memory mechanism effectively integrates short-term and long-term memory across sessions. Evaluation results, both automated and manual, demonstrate that AnnaAgent achieves more realistic seeker simulation in psychological counseling compared to existing baselines. The ethically reviewed and screened code can be found on https://github.com/sci-m-wang/AnnaAgent.

受限于人工智能心理健康领域涉及真实求助者的成本和伦理问题，研究人员开发了基于大型语言模型（LLM）的对话代理（CA），并定制了配置，如个人简介、症状和场景，以模拟求助者。尽管这些努力推动了人工智能在心理健康领域的发展，但由于两个关键挑战——动态演变和多会话记忆，实现更真实的求助者模拟仍然受到限制。在咨询过程中，求助者的心理状态往往会波动，这一过程通常跨越多个会话。针对这一问题，我们提出了AnnaAgent，一个配备三级记忆的情感和认知动态代理系统。AnnaAgent融入了一个情感调节器和投诉诱发器，经过真实咨询对话的训练，能够实现模拟器配置的动态控制。此外，其三级记忆机制有效地整合了跨会话的短期和长期记忆。自动化和手动评估结果均表明，相较于现有基线，AnnaAgent在心理咨询中实现了更真实的求助者模拟。经过伦理审查和筛选的代码可在https://github.com/sci-m-wang/AnnaAgent找到。

论文及项目相关链接

PDF

Summary

基于AI驱动的心理健康领域中真实寻求者参与的成本和伦理问题，研究者开发了基于大型语言模型（LLM）的对话代理（CA），并定制配置以模拟寻求者。尽管有所进展，但实现更真实的寻求者模拟仍面临两大挑战：动态演变和多会话记忆。为此，研究者提出了AnnaAgent系统，一个配备三级记忆的情感和认知动态代理系统。AnnaAgent结合了基于真实咨询对话训练的情感调节器和投诉激发器，实现了模拟器配置的动态控制。其三级记忆机制有效地整合了跨会话的短期和长期记忆。评估结果表明，相较于现有基线，AnnaAgent在心理咨询中的寻求者模拟更为真实。相关伦理审查和筛选的代码可在此处找到：https://github.com/sci-m-wang/AnnaAgent。

Key Takeaways

研究者使用大型语言模型（LLM）开发对话代理（CA）以模拟心理健康领域的寻求者。
实现更真实的寻求者模拟面临两大挑战：动态演变和多会话记忆。
AnnaAgent系统通过配备三级记忆机制解决了这两大挑战。
AnnaAgent结合了情感调节器和投诉激发器，实现模拟器配置的动态控制。
AnnaAgent的模拟效果经过自动化和手动评估，相较于现有基线更为真实。
AnnaAgent系统经过伦理审查和筛选，相关代码可公开访问。

Cool Papers

点此查看论文截图

Understanding Bias Reinforcement in LLM Agents Debate

Authors:Jihwan Oh, Minchan Jeong, Jongwoo Ko, Se-Young Yun

Large Language Models $($LLMs$)$ solve complex problems using training-free methods like prompt engineering and in-context learning, yet ensuring reasoning correctness remains challenging. While self-correction methods such as self-consistency and self-refinement aim to improve reliability, they often reinforce biases due to the lack of effective feedback mechanisms. Multi-Agent Debate $($MAD$)$ has emerged as an alternative, but we identify two key limitations: bias reinforcement, where debate amplifies model biases instead of correcting them, and lack of perspective diversity, as all agents share the same model and reasoning patterns, limiting true debate effectiveness. To systematically evaluate these issues, we introduce $\textit{MetaNIM Arena}$, a benchmark designed to assess LLMs in adversarial strategic decision-making, where dynamic interactions influence optimal decisions. To overcome MAD’s limitations, we propose $\textbf{DReaMAD}$ $($$\textbf{D}$iverse $\textbf{Rea}$soning via $\textbf{M}$ulti-$\textbf{A}$gent $\textbf{D}$ebate with Refined Prompt$)$, a novel framework that $(1)$ refines LLM’s strategic prior knowledge to improve reasoning quality and $(2)$ promotes diverse viewpoints within a single model by systematically modifying prompts, reducing bias. Empirical results show that $\textbf{DReaMAD}$ significantly improves decision accuracy, reasoning diversity, and bias mitigation across multiple strategic tasks, establishing it as a more effective approach for LLM-based decision-making.

大型语言模型（LLMs）通过无需训练的方法，如提示工程和上下文学习来解决复杂问题，但确保推理的正确性仍然具有挑战性。虽然自我校正方法（如自我一致性和自我改进）旨在提高可靠性，但由于缺乏有效的反馈机制，它们往往会强化偏见。多智能体辩论（MAD）作为一种替代方法已经出现，但我们发现了两个关键局限：偏见强化，即辩论放大模型偏见而不是纠正它们；以及缺乏观点多样性，因为所有智能体都使用相同的模型和推理模式，限制了真正辩论的有效性。为了系统地评估这些问题，我们引入了MetaNIM Arena，这是一个基准测试，旨在评估LLMs在对抗性战略决策中的能力，动态交互会影响最优决策。为了克服MAD的局限性，我们提出了DReaMAD（通过修改提示促进多智能体辩论中的多样化推理），一个新型框架，它（1）精炼LLM的战略先验知识以提高推理质量，（2）通过系统地修改提示促进单一模型内的不同观点，从而减少偏见。实证结果表明，DReaMAD在多个战略任务中显著提高了决策准确性、推理多样性和偏见缓解，确立其为LLM决策制定中更有效的方法。

论文及项目相关链接

PDF ICML 2025

Summary

大型语言模型（LLMs）通过无需训练的方法如提示工程和自然语境学习来解决复杂问题，但确保推理的正确性仍然具有挑战性。尽管自我校正方法如自我一致性和自我改进旨在提高可靠性，但它们往往会因缺乏有效的反馈机制而强化偏见。多智能体辩论（MAD）作为一种替代方法应运而生，但存在两个关键局限：偏见强化和视角缺乏多样性。为系统地评估这些问题，我们推出了MetaNIM Arena基准测试，用于评估LLMs在对抗性战略决策制定中的能力，其中动态交互影响最优决策。为了克服MAD的局限性，我们提出了DReaMAD框架（通过多元智能体辩论和精炼提示实现多样化的推理），该框架（1）完善LLM的战略先验知识以提高推理质量，（2）通过系统地修改提示，在单个模型内促进不同观点的形成，从而减少偏见。实证结果表明，DReaMAD在多个战略任务中显著提高了决策准确性、推理多样性和偏见缓解，确立其作为LLM决策制定的更有效方法。

Key Takeaways

大型语言模型（LLMs）在解决复杂问题时面临推理正确性的挑战。
自我校正方法虽然提高了可靠性，但可能强化模型偏见。
多智能体辩论（MAD）作为一种替代方法存在偏见强化和视角缺乏多样性两个问题。
MetaNIM Arena基准测试用于评估LLMs在战略决策中的能力。
DReaMAD框架通过完善战略先验知识和促进不同观点的形成来提高推理质量和减少偏见。
DReaMAD框架在多个战略任务中表现出显著的决策准确性、推理多样性和偏见缓解效果。

Cool Papers

点此查看论文截图

FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL

Authors:Woosung Koh, Wonbeen Oh, Siyeol Kim, Suhin Shin, Hyeongjin Kim, Jaein Jang, Junghyun Lee, Se-Young Yun

Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically removed or added during the inference trajectory – a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition under zero-shot out-of-domain (OOD) generalization, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally applicable augmentation technique for MARL backbone methods. FlickerFusion stochastically drops out parts of the observation space, emulating being in-domain when inferenced OOD. The results show that FlickerFusion not only achieves superior inference rewards but also uniquely reduces uncertainty vis-`a-vis the backbone, compared to existing methods. Benchmarks, implementations, and model weights are organized and open-sourced at flickerfusion305.github.io, accompanied by ample demo video renderings.

多智能体强化学习在各种现实世界应用中解决复杂合作任务方面已显示出巨大潜力。然而，现有的多智能体强化学习（Multi-Agent Reinforcement Learning，简称MARL）方法通常依赖于一个限制性假设，即在训练和推理过程中实体的数量（例如智能体、障碍物）保持不变。这忽略了在推理轨迹过程中实体被动态移除或添加的场景——这在搜索和救援任务以及动态战斗情况等现实环境中是常见的情况。本文解决无预设环境下轨迹内动态实体组成挑战，即在无法预测的情况下会发生这样的动态变化。我们的实证研究发现在这些场景中现有MARL方法性能显著下降且不确定性增加。作为回应，我们提出了FlickerFusion，这是一种新型的离域泛化方法，作为多智能体强化学习主要方法的一种普遍适用的增强技术。FlickerFusion会随机丢失观测空间的部分信息，模拟在域内的情况进行推理。结果显示，与现有方法相比，FlickerFusion不仅实现了更高的推理奖励，而且独特地降低了相对于主要方法的不确定性。基准测试、实施方法和模型权重都在公开网站上整理并开源，同时配有丰富的演示视频渲染。

论文及项目相关链接

PDF ICLR 2025

Summary

多智能体强化学习在解决各种现实世界应用中的复杂合作任务方面显示出巨大潜力。然而，现有方法常假设实体数量在训练和推理过程中保持不变，忽略了现实世界中实体动态变化的情况。为解决此问题，本文提出了一种名为FlickerFusion的新方法，通过随机丢弃观测空间的部分信息，模拟在不同分布下推理所面对的困难，以模拟in-domain的环境，提升了强化学习模型泛化能力和不确定性的鲁棒性。其结果表明，相比现有方法，FlickerFusion能够显著地提升推理奖励并减少不确定性。其模型资源已公开于指定网站供大众访问和参考。

Key Takeaways

多智能体强化学习在处理复杂合作任务时具有显著潜力。
现有多智能体强化学习方法在实体数量变化时存在局限性。
FlickerFusion作为一种新的泛化方法，解决了实体动态变化的问题。
FlickerFusion通过随机丢弃观测空间信息模拟动态变化的实体场景，实现了模型性能的提升。
FlickerFusion相较于现有方法能够在提高推理奖励的同时降低不确定性。
FlickerFusion方法具有普遍适用性，可应用于多种多智能体强化学习背景方法。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-06-12/Agent/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

Agent

Few-Shot

Few-Shot 方向最新论文已更新，请持续关注 Update in 2025-06-12 AraReasoner Evaluating Reasoning-Based LLMs for Arabic NLP

2025-06-12 Few-Shot

Few-Shot

LLM

LLM 方向最新论文已更新，请持续关注 Update in 2025-06-12 VIKI-R Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

2025-06-12 LLM

LLM

Agent

2025-06-12 更新

VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning

Atomic-to-Compositional Generalization for Mobile Agents with A New Benchmark and Scheduling System

What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities

Improved LLM Agents for Financial Document Question Answering

MasHost Builds It All: Autonomous Multi-Agent System Directed by Reinforcement Learning

CAF-I: A Collaborative Multi-Agent Framework for Enhanced Irony Detection with Large Language Models

TACTIC: Translation Agents with Cognitive-Theoretic Interactive Collaboration

Reinforce LLM Reasoning through Multi-Agent Reflection

HiBerNAC: Hierarchical Brain-emulated Robotic Neural Agent Collective for Disentangling Complex Manipulation

Thinking vs. Doing: Agents that Reason by Scaling Test-Time Interaction

SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems

CrimeMind: Simulating Urban Crime with Multi-Modal LLM Agents

TextAtari: 100K Frames Game Playing with Language Agents

DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

AnnaAgent: Dynamic Evolution Agent System with Multi-Session Memory for Realistic Seeker Simulation

Understanding Bias Reinforcement in LLM Agents Debate

FlickerFusion: Intra-trajectory Domain Generalizing Multi-Agent RL