
Agent


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-09-17 Update

Agentic Temporal Graph of Reasoning with Multimodal Language Models: A Potential AI Aid to Healthcare

Authors:Susanta Mitra

Healthcare and medicine are multimodal disciplines that draw on multimodal data to reason about and diagnose multiple diseases. Although some multimodal reasoning models have emerged for reasoning over complex tasks in scientific domains, their applications in the healthcare domain remain limited and fall short of correct reasoning for diagnosis. To address the challenges of multimodal medical reasoning for correct diagnosis and to assist healthcare professionals, the current work proposes a novel temporal graph-based reasoning process modelled as a directed graph. It accommodates dynamic changes in reasons through backtracking, refining the reasoning content, and creating new or deleting existing reasons to reach the best recommendation or answer. In addition, considering multimodal data at different time points enables tracking and analysis of patient health and disease progression. Moreover, the proposed multi-agent temporal reasoning framework provides task distribution and a cross-validation mechanism to further enhance the accuracy of reasoning outputs. A few basic experiments and analysis results justify the novelty and practical utility of the proposed preliminary approach.


Paper and Project Links

PDF

Summary

This paper addresses multimodal reasoning for healthcare. Although multimodal reasoning models have appeared in scientific domains, their use in medical diagnosis remains limited and falls short of the accuracy that diagnostic reasoning requires. To tackle this challenge, the paper proposes a temporal graph-based reasoning process: a directed graph models the dynamically evolving reasoning, supporting backtracking, refinement of reasoning content, and creation or deletion of reasons as needed to reach the best answer or recommendation. Considering multimodal data at different time points also helps track and analyse patient health and disease progression. In addition, the proposed multi-agent temporal reasoning framework provides task distribution and cross-validation mechanisms that further improve the accuracy of the reasoning outputs. Preliminary experiments and analyses demonstrate the novelty and practical utility of the approach.

Key Takeaways

  1. Multimodal data plays an important role in medical diagnosis.
  2. Current multimodal reasoning models face limitations and challenges when applied to medical diagnosis.
  3. A temporal graph-based reasoning process can effectively model the dynamically changing reasoning involved in medical diagnosis.
  4. The temporal graph reasoning process supports backtracking, refinement of reasoning content, and flexible adjustment of reasons.
  5. Combining multimodal data from different time points helps track and analyse patient health and disease progression.
  6. The multi-agent temporal reasoning framework improves reasoning accuracy through task distribution and cross-validation.
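
To make the temporal graph idea concrete, here is a minimal sketch of a directed reasoning graph that supports backtracking, refinement, and deletion of reasons. The node schema and method names are our own illustration under stated assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReasonNode:
    """One reasoning step anchored to a time point (hypothetical schema)."""
    node_id: int
    timestamp: str                 # e.g. date of the multimodal observation
    content: str                   # current reasoning text
    parent: Optional[int] = None   # predecessor reason, if any

class TemporalReasoningGraph:
    """Directed graph of reasons supporting refine, delete, and backtracking."""
    def __init__(self) -> None:
        self.nodes: dict[int, ReasonNode] = {}
        self.edges: list[tuple[int, int]] = []   # (from_id, to_id), time-ordered
        self._next_id = 0

    def add_reason(self, timestamp: str, content: str, parent: Optional[int] = None) -> int:
        nid = self._next_id
        self._next_id += 1
        self.nodes[nid] = ReasonNode(nid, timestamp, content, parent)
        if parent is not None:
            self.edges.append((parent, nid))
        return nid

    def refine(self, node_id: int, new_content: str) -> None:
        """Rewrite an existing reason instead of discarding the chain."""
        self.nodes[node_id].content = new_content

    def delete(self, node_id: int) -> None:
        """Drop a ruled-out reason and its incident edges."""
        self.nodes.pop(node_id, None)
        self.edges = [e for e in self.edges if node_id not in e]

    def backtrack(self, node_id: int) -> Optional[int]:
        """Return the parent reason so reasoning can resume from an earlier state."""
        return self.nodes[node_id].parent

if __name__ == "__main__":
    g = TemporalReasoningGraph()
    r0 = g.add_reason("2025-01-10", "Chest X-ray shows mild opacity")
    r1 = g.add_reason("2025-02-01", "Opacity enlarged; suspect infection", parent=r0)
    g.refine(r1, "Opacity enlarged; infection vs. early malignancy")
    print(g.backtrack(r1), len(g.nodes))   # 0 2
```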


Neuro-Symbolic Agents with Modal Logic for Autonomous Diagnostics

Authors:Antonin Sulc, Thorsten Hellert

The development of intelligent agents, particularly those powered by language models (LMs), has highlighted their critical role in environments that require intelligent and autonomous decision-making. Environments are not passive testing grounds: they supply the data agents learn from and present very challenging conditions that demand adaptive, complex, and autonomous decision-making. While the paradigm of scaling models and datasets has led to remarkable emergent capabilities, we argue that scaling the structure, fidelity, and logical consistency of agent reasoning within these environments is a crucial, yet underexplored, dimension of AI research. This paper introduces a neuro-symbolic multi-agent architecture where the belief states of individual agents are formally represented as Kripke models. This foundational choice enables them to reason about the concepts of \emph{possibility} and \emph{necessity} using the formal language of modal logic. In this work, we use immutable, domain-specific knowledge to infer information, encoded as logical constraints essential for proper diagnosis. In the proposed model, we show how these constraints actively guide the hypothesis generation of LMs, effectively preventing them from reaching physically or logically untenable conclusions. In a high-fidelity simulated particle accelerator environment, our system successfully diagnoses complex, cascading failures by combining the powerful semantic intuition of LMs with the rigorous, verifiable validation of modal logic and a factual world model, showcasing a viable path toward more robust, reliable, and verifiable autonomous agents.


Paper and Project Links

PDF 10 pages, 1 figure, Scaling Environments for Agents (SEA) Workshop at NeurIPS

Summary

The development of intelligent agents, especially those driven by language models, plays a key role in environments that require intelligent, autonomous decisions. Environments are not passive testing grounds; they provide the data agents must learn from and demand the capacity to adapt and decide autonomously under complex conditions. While scaling models and datasets has produced striking emergent capabilities, the authors argue that scaling the structure, fidelity, and logical consistency of agent reasoning within these environments is a crucial and underexplored dimension of AI research. The paper introduces a neuro-symbolic multi-agent architecture in which each agent's belief state is formally represented as a Kripke model. This foundational choice lets agents reason about the concepts of "possibility" and "necessity" in the formal language of modal logic. Immutable, domain-specific knowledge, encoded as logical constraints essential for correct diagnosis, is used to infer information; these constraints actively guide the hypothesis generation of language models and prevent physically or logically untenable conclusions. In a high-fidelity simulated particle accelerator environment, the system successfully diagnoses complex cascading failures by combining the semantic intuition of language models with the rigorous, verifiable validation of modal logic and a factual world model, demonstrating a viable path toward more robust, reliable, and verifiable autonomous agents.

Key Takeaways

  1. Intelligent agents play a key role in environments that require intelligent, autonomous decision-making.
  2. Scaling the structure, fidelity, and logical consistency of agent reasoning is a crucial dimension of AI research.
  3. The proposed neuro-symbolic multi-agent architecture represents agents' belief states as Kripke models.
  4. The architecture uses modal logic to reason about the concepts of "possibility" and "necessity".
  5. Immutable, domain-specific knowledge is used for inference and is essential for correct diagnosis.
  6. The proposed model uses constraints to actively guide the hypothesis generation of language models.
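
The modal-logic part is standard and easy to illustrate: a Kripke model is a set of worlds, an accessibility relation, and a valuation; "possibly p" holds when p is true in some accessible world, and "necessarily p" when it is true in all of them. The toy worlds, relation, and propositions below are illustrative assumptions, not the paper's accelerator model.

```python
# Toy Kripke model: worlds are candidate fault states; the accessibility relation
# lists which states the agent considers possible from each world (illustrative).
access = {"w0": {"w0", "w1"}, "w1": {"w1"}, "w2": {"w2"}}
valuation = {                              # which propositions hold in which worlds
    "magnet_power_fault": {"w1"},
    "sensor_reading_valid": {"w0", "w1", "w2"},
}

def possibly(world: str, prop: str) -> bool:
    """<>p : p holds in at least one accessible world."""
    return any(w in valuation[prop] for w in access[world])

def necessarily(world: str, prop: str) -> bool:
    """[]p : p holds in every accessible world."""
    return all(w in valuation[prop] for w in access[world])

def admissible(world: str, hypothesis: str) -> bool:
    """Prune LM-generated hypotheses that are impossible under the belief state."""
    return possibly(world, hypothesis)

print(possibly("w0", "magnet_power_fault"))      # True
print(necessarily("w0", "magnet_power_fault"))   # False
print(admissible("w2", "magnet_power_fault"))    # False -> hypothesis rejected
```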


FineQuest: Adaptive Knowledge-Assisted Sports Video Understanding via Agent-of-Thoughts Reasoning

Authors:Haodong Chen, Haojian Huang, XinXiang Yin, Dian Shao

Video Question Answering (VideoQA) based on Large Language Models (LLMs) has shown potential in general video understanding but faces significant challenges when applied to the inherently complex domain of sports videos. In this work, we propose FineQuest, the first training-free framework that leverages dual-mode reasoning inspired by cognitive science: i) Reactive Reasoning for straightforward sports queries and ii) Deliberative Reasoning for more complex ones. To bridge the knowledge gap between general-purpose models and domain-specific sports understanding, FineQuest incorporates SSGraph, a multimodal sports knowledge scene graph spanning nine sports, which encodes both visual instances and domain-specific terminology to enhance reasoning accuracy. Furthermore, we introduce two new sports VideoQA benchmarks, Gym-QA and Diving-QA, derived from the FineGym and FineDiving datasets, enabling diverse and comprehensive evaluation. FineQuest achieves state-of-the-art performance on these benchmarks as well as the existing SPORTU dataset, while maintaining strong general VideoQA capabilities.


Paper and Project Links

PDF ACM MM 2025

Summary
The paper discusses the challenges that LLM-based Video Question Answering faces in sports video understanding and proposes FineQuest, a training-free framework that combines dual-mode reasoning with SSGraph, a multimodal sports knowledge scene graph spanning nine sports, to improve reasoning accuracy. It also introduces two new sports VideoQA benchmarks, Gym-QA and Diving-QA, and reports FineQuest's performance on them.

Key Takeaways

  1. VideoQA faces significant challenges when applied to sports video understanding.
  2. FineQuest is the first training-free framework of its kind and uses dual-mode reasoning: Reactive Reasoning and Deliberative Reasoning.
  3. SSGraph is a multimodal sports knowledge scene graph spanning nine sports, designed to close the knowledge gap between general-purpose models and domain-specific sports understanding.
  4. SSGraph encodes both visual instances and domain-specific terminology to improve reasoning accuracy.
  5. Two new sports VideoQA benchmarks are introduced: Gym-QA and Diving-QA.
  6. FineQuest achieves state-of-the-art performance on these benchmarks while maintaining strong general VideoQA capabilities.
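
The dual-mode idea is essentially a router: cheap reactive answering for simple queries, and a deliberative path that consults the SSGraph knowledge scene graph for complex ones. The routing heuristic and the tiny graph below are placeholders of our own; the paper does not publish this interface.

```python
# Hypothetical sketch of FineQuest-style dual-mode routing; SSGRAPH and
# complexity_score are stand-ins, not the actual components.
SSGRAPH = {   # tiny stand-in for the multimodal sports knowledge scene graph
    "tsukahara": "vault family: round-off entry onto the table",
    "407c": "diving code: inward 3.5 somersaults, tuck",
}

def complexity_score(question: str) -> float:
    """Crude proxy: domain terminology and multi-step wording raise complexity."""
    q = question.lower()
    score = 0.5 * sum(term in q for term in SSGRAPH)
    score += 0.5 * any(w in q for w in ("why", "compare", "sequence"))
    return score

def reactive_answer(question: str) -> str:
    return f"[reactive] direct answer to: {question}"

def deliberative_answer(question: str) -> str:
    facts = [v for k, v in SSGRAPH.items() if k in question.lower()]
    return f"[deliberative] using {facts}: step-by-step answer to: {question}"

def answer(question: str, threshold: float = 0.5) -> str:
    mode = deliberative_answer if complexity_score(question) >= threshold else reactive_answer
    return mode(question)

print(answer("Who is on the beam?"))
print(answer("Why is the tsukahara scored lower in this sequence?"))
```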


AMLNet: A Knowledge-Based Multi-Agent Framework to Generate and Detect Realistic Money Laundering Transactions

Authors:Sabin Huda, Ernest Foo, Zahra Jadidi, MA Hakim Newton, Abdul Sattar

Anti-money laundering (AML) research is constrained by the lack of publicly shareable, regulation-aligned transaction datasets. We present AMLNet, a knowledge-based multi-agent framework with two coordinated units: a regulation-aware transaction generator and an ensemble detection pipeline. The generator produces 1,090,173 synthetic transactions (approximately 0.16% laundering-positive) spanning core laundering phases (placement, layering, integration) and advanced typologies (e.g., structuring, adaptive threshold behavior). Regulatory alignment reaches 75% based on AUSTRAC rule coverage (Section 4.2), while a composite technical fidelity score of 0.75 summarizes temporal, structural, and behavioral realism components (Section 4.4). The detection ensemble achieves F1 0.90 (precision 0.84, recall 0.97) on the internal test partitions of AMLNet and adapts to the external SynthAML dataset, indicating architectural generalizability across different synthetic generation paradigms. We provide multi-dimensional evaluation (regulatory, temporal, network, behavioral) and release the dataset (Version 1.0, https://doi.org/10.5281/zenodo.16736515), to advance reproducible and regulation-conscious AML experimentation.


Paper and Project Links

PDF

Summary

AMLNet is a knowledge-based multi-agent framework with two coordinated units: a regulation-aware transaction generator and an ensemble detection pipeline. The generator produces synthetic transactions (roughly 0.16% laundering-positive) covering the core laundering phases and advanced typologies. The framework reaches 75% regulatory alignment with AUSTRAC rules, the detector performs well, and the architecture adapts to other synthetic generation paradigms. The dataset is released to advance reproducible, regulation-conscious AML experimentation.

Key Takeaways

  1. AMLNet is a multi-agent framework that addresses the lack of publicly shareable, regulation-aligned transaction datasets for anti-money laundering (AML) research.
  2. The framework has two coordinated units: a regulation-aware transaction generator and an ensemble detection pipeline.
  3. The transaction generator produces synthetic transactions spanning the core laundering phases and advanced typologies.
  4. Regulatory alignment with AUSTRAC rules reaches 75%, and the evaluation covers temporal, structural, and behavioral realism.
  5. The detection pipeline achieves an F1 score of 0.90 on AMLNet's internal test partitions and adapts to the external SynthAML dataset, indicating architectural generalizability across synthetic generation paradigms.
  6. The dataset (Version 1.0) is released to advance reproducible, regulation-conscious AML experimentation.
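
The reported detection numbers are internally consistent: with precision 0.84 and recall 0.97, the harmonic mean is about 0.90. A quick check, using illustrative confusion-matrix counts chosen only to reproduce those rates (they are not from the paper):

```python
# Hypothetical counts that yield precision ~0.84 and recall 0.97.
tp, fp, fn = 970, 185, 30            # true positives, false positives, false negatives

precision = tp / (tp + fp)           # 970 / 1155 ≈ 0.84
recall = tp / (tp + fn)              # 970 / 1000 = 0.97
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))   # 0.84 0.97 0.9
```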


A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Authors:Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.


Paper and Project Links

PDF This paper is currently under review

Summary

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. The survey defines the problem and organizes the literature by reasoning topology into three families: direct one-step reasoning, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. This topology is crossed with the field's main objectives, such as traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and covers decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing the strengths and weaknesses of each topology, along with supporting datasets, benchmarks, and resources (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). The survey highlights evaluation practices that keep evidence visible and temporally aligned, and distils guidance on matching topology to uncertainty, grounding in observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. Future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

Key Takeaways

  1. Time series reasoning treats time as a core axis and incorporates intermediate evidence.
  2. Reasoning topologies fall into three families: direct one-step reasoning, linear chain reasoning, and branch-structured reasoning.
  3. The field's main objectives include time series analysis, explanation and understanding, causal inference and decision making, and generation.
  4. A compact tag set covers the different facets of time series reasoning.
  5. The review of methods and systems shows the strengths and weaknesses of each topology.
  6. Evaluation practices should keep evidence visible and temporally aligned.
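
As a reading aid, the survey's axes can be turned into a small tagging schema for one's own literature notes; the field names below are our shorthand for the topology, objective, and tag axes named in the abstract, not something defined by the survey itself.

```python
from dataclasses import dataclass, field

TOPOLOGIES = ("direct", "linear_chain", "branch_structured")
OBJECTIVES = ("analysis", "explanation", "causal_decision", "generation")

@dataclass
class SurveyEntry:
    """One row of a personal reading log following the survey's axes (our shorthand)."""
    method: str
    topology: str                                  # one of TOPOLOGIES
    objective: str                                 # one of OBJECTIVES
    tags: set[str] = field(default_factory=set)    # e.g. {"tool_use", "verification"}

entry = SurveyEntry(
    method="hypothetical forecaster with self-check",
    topology="linear_chain",
    objective="analysis",
    tags={"decomposition", "verification", "multimodality"},
)
assert entry.topology in TOPOLOGIES and entry.objective in OBJECTIVES
print(entry)
```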


VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection

Authors:Ziliang Wang, Ge Li, Jia Li, Hao Zhu, Zhi Jin

The application of language models to project-level vulnerability detection remains challenging, owing to the dual requirement of accurately localizing security-sensitive code and correctly correlating and reasoning over complex program context. We present VulAgent, a multi-agent vulnerability detection framework based on hypothesis validation. Our design is inspired by how human auditors review code: when noticing a sensitive operation, they form a hypothesis about a possible vulnerability, consider potential trigger paths, and then verify the hypothesis against the surrounding context. VulAgent implements a semantics-sensitive, multi-view detection pipeline: specialized agents, each aligned to a specific analysis perspective (e.g., memory, authorization), collaboratively surface and precisely localize sensitive code sites with higher coverage. Building on this, VulAgent adopts a hypothesis-validation paradigm: for each vulnerability report, it builds hypothesis conditions and a trigger path, steering the LLM to target the relevant program context and defensive checks during verification, which reduces false positives. On average across the two datasets, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable–fixed code pairs by up to 450% (246% on average), and reduces the false positive rate by about 36% compared with state-of-the-art LLM-based baselines.


Paper and Project Links

PDF

Summary

Applying language models to project-level vulnerability detection remains challenging: it requires accurately localizing security-sensitive code and correctly correlating and reasoning over complex program context. VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. Its design mirrors how human auditors review code: on noticing a sensitive operation, they form a hypothesis about a possible vulnerability, consider potential trigger paths, and verify the hypothesis against the surrounding context. VulAgent implements a semantics-sensitive, multi-view detection pipeline in which specialized agents, each aligned to a specific analysis perspective (such as memory or authorization), collaborate to surface and precisely localize sensitive code sites with higher coverage. On top of this, VulAgent adopts a hypothesis-validation paradigm: for each vulnerability report it builds hypothesis conditions and a trigger path, steering the LLM toward the relevant program context and defensive checks during verification and thereby reducing false positives. Compared with state-of-the-art LLM-based baselines, VulAgent improves overall accuracy by 6.6% on average, raises the correct identification rate of vulnerable-fixed code pairs by 246% on average (up to 450%), and reduces the false positive rate by about 36%.

Key Takeaways

  1. Applying language models to project-level vulnerability detection is challenging: it must both localize security-sensitive code and correlate complex program context.
  2. VulAgent is a hypothesis-validation-based multi-agent vulnerability detection framework that mirrors how human auditors review code.
  3. VulAgent performs semantics-sensitive, multi-view detection, with collaborating agents that precisely localize sensitive code and improve coverage.
  4. VulAgent's hypothesis-validation paradigm builds hypothesis conditions and trigger paths that steer the LLM toward targeted verification, reducing false positives.
  5. Compared with state-of-the-art LLM-based baselines, VulAgent significantly improves overall accuracy, the identification rate of vulnerable-fixed code pairs, and the false positive rate.
  6. VulAgent's design helps improve both the accuracy and the efficiency of vulnerability detection.
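
The pipeline can be summarised as: perspective-specific agents localize sensitive sites, each candidate becomes a hypothesis with trigger conditions, and the hypothesis is verified against the surrounding context before being reported. The sketch below is our own schematic of that loop; `llm` is a placeholder callable, not VulAgent's API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Hypothesis:
    site: str              # file:line of the sensitive operation
    view: str              # analysis perspective, e.g. "memory", "authorization"
    trigger_path: str      # how the suspected flaw could be reached

def detect(code: str, llm: Callable[[str], str]) -> list[Hypothesis]:
    """Schematic two-phase pipeline: localize per view, then verify each hypothesis."""
    views = ("memory", "authorization")            # perspective-specific agents
    hypotheses: list[Hypothesis] = []
    for view in views:
        sites = llm(f"[{view} agent] list sensitive sites in:\n{code}").splitlines()
        for site in filter(None, sites):
            hypotheses.append(Hypothesis(site, view, trigger_path="unknown"))

    confirmed: list[Hypothesis] = []
    for h in hypotheses:
        verdict = llm(
            f"Verify: could a {h.view} issue at {h.site} be triggered, given the "
            f"surrounding context and defensive checks? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            confirmed.append(h)                    # otherwise treated as a false positive
    return confirmed

# Usage with a stub LLM, just to show the data flow.
stub = lambda prompt: "main.c:42" if "list sensitive" in prompt else "no"
print(detect("int *p = malloc(4); free(p); *p = 1;", stub))
```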


MAPGD: Multi-Agent Prompt Gradient Descent for Collaborative Prompt Optimization

Authors:Yichen Han, Bojun Liu, Zhengpeng zhou, Guanyu Liu, Zeng Zhang, Yang Yang, Wenli Wang, Isaac N Shi, Yunyan, Lewei He, Tianyu Shi

Prompt engineering is crucial for leveraging large language models (LLMs), but existing methods often rely on a single optimization trajectory, limiting adaptability and efficiency while suffering from narrow perspectives, gradient conflicts, and high computational cost. We propose MAPGD (Multi-Agent Prompt Gradient Descent), a framework integrating multi-agent collaboration with gradient-based optimization. MAPGD features specialized agents for task clarity, example selection, format design, and stylistic refinement; semantic gradient coordination to resolve conflicts; bandit-based candidate selection for efficient exploration-exploitation; and theoretical convergence guarantees. Experiments on classification, generation, and reasoning tasks show MAPGD outperforms single-agent and random baselines in accuracy and efficiency. Ablations confirm the benefits of gradient fusion, agent specialization, and conflict resolution, providing a unified, gradient-inspired multi-agent approach to robust and interpretable prompt optimization.


Paper and Project Links

PDF

Summary
Prompt engineering is crucial for leveraging large language models, but existing methods rely on a single optimization trajectory and suffer from limited adaptability, low efficiency, narrow perspectives, gradient conflicts, and high computational cost. MAPGD (Multi-Agent Prompt Gradient Descent) integrates multi-agent collaboration with gradient-based optimization. It features specialized agents for task clarity, example selection, format design, and stylistic refinement; semantic gradient coordination to resolve conflicts; bandit-based candidate selection for efficient exploration and exploitation; and theoretical convergence guarantees. Experiments on classification, generation, and reasoning tasks show that MAPGD outperforms single-agent and random baselines in both accuracy and efficiency.

Key Takeaways

  1. Prompt engineering is essential for leveraging large language models, but existing methods face several challenges.
  2. The MAPGD framework integrates multi-agent collaboration with gradient-based optimization.
  3. MAPGD includes specialized agents for task clarity, example selection, format design, and stylistic refinement.
  4. MAPGD resolves gradient conflicts through semantic gradient coordination.
  5. Bandit-based candidate selection enables efficient exploration and exploitation.
  6. MAPGD comes with theoretical convergence guarantees.
  7. Experiments show MAPGD's superior performance on classification, generation, and reasoning tasks.
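
One concrete ingredient named in the abstract is bandit-based candidate selection over the prompts proposed by the specialized agents. A generic UCB1 selector is shown below as an assumption about how such exploration-exploitation could be wired; the paper's exact bandit variant and reward definition may differ.

```python
import math
import random

def ucb1_select(counts: list[int], rewards: list[float], t: int) -> int:
    """Pick the candidate prompt with the highest UCB1 score; try unseen arms first."""
    for i, n in enumerate(counts):
        if n == 0:
            return i
    return max(
        range(len(counts)),
        key=lambda i: rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i]),
    )

def evaluate(prompt: str) -> float:
    """Placeholder for the downstream task accuracy of a candidate prompt."""
    return random.random()

# Hypothetical candidates from the task-clarity / example / format / style agents.
candidates = ["prompt A", "prompt B", "prompt C"]
counts = [0] * len(candidates)
rewards = [0.0] * len(candidates)

for t in range(1, 51):                    # 50 optimization steps
    arm = ucb1_select(counts, rewards, t)
    counts[arm] += 1
    rewards[arm] += evaluate(candidates[arm])

best = max(range(len(candidates)), key=lambda i: rewards[i] / counts[i])
print("selected:", candidates[best])
```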


Shell or Nothing: Real-World Benchmarks and Memory-Activated Agents for Automated Penetration Testing

Authors:Wuyuao Mai, Geng Hong, Qi Liu, Jinsong Chen, Jiarun Dai, Xudong Pan, Yuan Zhang, Min Yang

Penetration testing is critical for identifying and mitigating security vulnerabilities, yet traditional approaches remain expensive, time-consuming, and dependent on expert human labor. Recent work has explored AI-driven pentesting agents, but their evaluation relies on oversimplified capture-the-flag (CTF) settings that embed prior knowledge and reduce complexity, leading to performance estimates far from real-world practice. We close this gap by introducing the first real-world, agent-oriented pentesting benchmark, TermiBench, which shifts the goal from ‘flag finding’ to achieving full system control. The benchmark spans 510 hosts across 25 services and 30 CVEs, with realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Using this benchmark, we find that existing systems can hardly obtain system shells under realistic conditions. To address these challenges, we propose TermiAgent, a multi-agent penetration testing framework. TermiAgent mitigates long-context forgetting with a Located Memory Activation mechanism and builds a reliable exploit arsenal via structured code understanding rather than naive retrieval. In evaluations, our work outperforms state-of-the-art agents, exhibiting stronger penetration testing capability, reducing execution time and financial cost, and demonstrating practicality even on laptop-scale deployments. Our work delivers both the first open-source benchmark for real-world autonomous pentesting and a novel agent framework that establishes a milestone for AI-driven penetration testing.


Paper and Project Links

PDF

Summary
Penetration testing is critical for identifying and mitigating security vulnerabilities, but traditional approaches remain expensive, time-consuming, and reliant on expert human labor. To close the gap between oversimplified capture-the-flag settings and real-world practice, the authors introduce TermiBench, the first real-world, agent-oriented pentesting benchmark, which shifts the goal from "flag finding" to achieving full system control. The benchmark spans 510 hosts, 25 services, and 30 CVEs, with realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution. Using this benchmark, the authors find that existing systems can hardly obtain system shells under realistic conditions. To address these challenges they propose TermiAgent, a multi-agent penetration testing framework that mitigates long-context forgetting with a Located Memory Activation mechanism and builds a reliable exploit arsenal through structured code understanding rather than naive retrieval. In evaluations, TermiAgent outperforms state-of-the-art agents, showing stronger penetration testing capability, lower execution time and financial cost, and practicality even on laptop-scale deployments. The work delivers both the first open-source benchmark for real-world autonomous pentesting and a novel agent framework, marking a milestone for AI-driven penetration testing.

Key Takeaways

  1. Penetration testing is critical for security, but traditional approaches are expensive and time-consuming.
  2. TermiBench is the first real-world pentesting benchmark, targeting full system control.
  3. TermiBench provides realistic environments that require autonomous reconnaissance, discrimination between benign and exploitable services, and robust exploit execution.
  4. Existing systems can hardly obtain system shells under realistic conditions.
  5. TermiAgent is a multi-agent penetration testing framework that mitigates long-context forgetting with a Located Memory Activation mechanism.
  6. TermiAgent builds a reliable exploit arsenal through structured code understanding.
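
The abstract credits part of TermiAgent's robustness to a "Located Memory Activation" mechanism against long-context forgetting. The paper's mechanism is not reproduced here, so the snippet below is only a guess at the general shape: notes are stored under a location key (host and phase) and only the entries matching the agent's current location are re-injected into the prompt.

```python
from collections import defaultdict

class LocatedMemory:
    """Speculative sketch: store notes per (host, phase) and re-activate only the
    entries relevant to where the agent currently operates."""
    def __init__(self) -> None:
        self.store: dict[tuple[str, str], list[str]] = defaultdict(list)

    def write(self, host: str, phase: str, note: str) -> None:
        self.store[(host, phase)].append(note)

    def activate(self, host: str, phase: str, limit: int = 5) -> list[str]:
        return self.store[(host, phase)][-limit:]   # most recent relevant notes only

mem = LocatedMemory()
mem.write("10.0.0.5", "recon", "port 8080 runs an outdated Tomcat")
mem.write("10.0.0.7", "recon", "SSH only, key-based auth")
print(mem.activate("10.0.0.5", "recon"))   # only the 10.0.0.5 notes are re-injected
```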


GitTaskBench: A Benchmark for Code Agents Solving Real-World Tasks Through Code Repository Leveraging

Authors:Ziyi Ni, Huacan Wang, Shuo Zhang, Shuo Lu, Ziyang He, Wang You, Zhenheng Tang, Yuntao Du, Bill Sun, Hongzhang Liu, Sen Hu, Ronghao Chen, Bo Li, Xin Li, Chen Hu, Binxing Jiao, Daxin Jiang, Pin Lyu

Beyond scratch coding, exploiting large-scale code repositories (e.g., GitHub) for practical tasks is vital in real-world software development, yet current benchmarks rarely evaluate code agents in such authentic, workflow-driven scenarios. To bridge this gap, we introduce GitTaskBench, a benchmark designed to systematically assess this capability via 54 realistic tasks across 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness specifying practical success criteria. Beyond measuring execution and task success, we also propose the alpha-value metric to quantify the economic benefit of agent performance, which integrates task success rates, token cost, and average developer salaries. Experiments across three state-of-the-art agent frameworks with multiple advanced LLMs show that leveraging code repositories for complex task solving remains challenging: even the best-performing system, OpenHands+Claude 3.7, solves only 48.15% of tasks (recent progress has pushed the frontier further, with RepoMaster+Claude 3.5 achieving a new record of 62.96%). Error analysis attributes over half of failures to seemingly mundane yet critical steps like environment setup and dependency resolution, highlighting the need for more robust workflow management and increased timeout preparedness. By releasing GitTaskBench, we aim to drive progress and attention toward repository-aware code reasoning, execution, and deployment – moving agents closer to solving complex, end-to-end real-world tasks. The benchmark and code are open-sourced at https://github.com/QuantaAlpha/GitTaskBench.


Paper and Project Links

PDF Highly practical, Well-motivated, Actionable

Summary

GitTaskBench is a benchmark designed to systematically assess code agents in realistic, workflow-driven scenarios through 54 real-world tasks spanning 7 modalities and 7 domains. Each task pairs a relevant repository with an automated, human-curated evaluation harness that specifies practical success criteria. Beyond execution and task success, the authors propose the alpha-value metric to quantify the economic benefit of agent performance, integrating task success rates, token cost, and average developer salaries. Experiments show that leveraging code repositories to solve complex tasks remains challenging: the best-performing system solves only 48.15% of tasks. GitTaskBench is open-sourced to drive progress and attention toward repository-aware code reasoning, execution, and deployment.

Key Takeaways

  1. GitTaskBench is a benchmark for assessing code agents in realistic software-development scenarios.
  2. It contains 54 real-world tasks spanning 7 modalities and 7 domains.
  3. Each task pairs a relevant repository with an automated, human-curated evaluation harness.
  4. The alpha-value metric quantifies the economic benefit of agent performance.
  5. Experiments show that solving complex tasks with code repositories is challenging; the best system solves only about half of the tasks.
  6. Error analysis attributes most failures to seemingly mundane yet critical steps such as environment setup and dependency resolution.
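
The abstract defines the alpha-value metric only as an integration of task success rate, token cost, and average developer salary. One plausible reading, shown purely as an assumption, is "developer labour saved on solved tasks minus money spent on tokens":

```python
def alpha_value(success_rate: float, n_tasks: int, hours_per_task: float,
                hourly_salary: float, token_cost_total: float) -> float:
    """One *possible* reading of the alpha-value idea (the paper's exact formula
    is not reproduced here): labour saved on solved tasks minus token spend."""
    labour_saved = success_rate * n_tasks * hours_per_task * hourly_salary
    return labour_saved - token_cost_total

# Illustrative numbers only (the success rate is from the abstract, the rest assumed).
print(alpha_value(success_rate=0.4815, n_tasks=54, hours_per_task=2,
                  hourly_salary=60.0, token_cost_total=300.0))
```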


HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

Authors:Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong

The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model’s self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at https://github.com/nlp2ct-shijie/HiMATE.


Paper and Project Links

PDF

Summary: Advances in Large Language Models (LLMs) enable flexible and interpretable automatic evaluation of machine translation. Annotating translation errors with LLMs based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments, but current LLM-based methods still struggle to accurately identify error spans and assess their severity. This paper proposes HiMATE, a hierarchical multi-agent machine translation evaluation framework grounded in the MQM error typology, and mitigates systemic hallucinations by exploiting the model's self-reflection capability and agent discussion involving asymmetric information. Experiments show that HiMATE outperforms competitive baselines across datasets, with a particularly strong advantage in error span detection and severity assessment, improving the average F1 score by 89% over the best-performing baseline.

Key Takeaways

  1. Large Language Models (LLMs) show potential for flexible and interpretable automatic evaluation of machine translation.
  2. LLM-based translation error annotation grounded in Multidimensional Quality Metrics (MQM) yields more human-aligned judgments.
  3. Current LLM-based evaluation still struggles to accurately identify error spans and their severity.
  4. HiMATE is a hierarchical multi-agent machine translation evaluation framework built on the MQM error typology.
  5. HiMATE mitigates systemic hallucinations through model self-reflection and agent discussion involving asymmetric information.
  6. Experiments show that HiMATE outperforms baseline methods across datasets.
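
Structurally, HiMATE assigns one narrow evaluator per node of the MQM error typology before severities are aggregated and debated. A sketch of that dispatch is below; the typology excerpt and prompts are illustrative, and `llm` is a placeholder callable rather than HiMATE's API.

```python
from typing import Callable

# Illustrative excerpt of the MQM error typology.
MQM = {
    "accuracy": ["mistranslation", "omission", "addition"],
    "fluency": ["grammar", "spelling", "punctuation"],
}

def evaluate(src: str, hyp: str, llm: Callable[[str], str]) -> dict[str, str]:
    """One narrow agent per MQM subtype; a parent agent would then aggregate."""
    findings: dict[str, str] = {}
    for category, subtypes in MQM.items():
        for subtype in subtypes:
            findings[f"{category}/{subtype}"] = llm(
                f"As a {category}/{subtype} evaluator, list error spans with "
                f"severity (minor/major).\nSRC: {src}\nHYP: {hyp}"
            )
    return findings

stub = lambda prompt: "no errors found"
print(evaluate("Guten Morgen", "Good evening", stub)["accuracy/mistranslation"])
```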


Kolb-Based Experiential Learning for Generalist Agents with Human-Level Kaggle Data Science Performance

Authors:Antoine Grosnit, Alexandre Maraval, Refinath S N, Zichao Zhao, James Doran, Giuseppe Paolo, Albert Thomas, Jonas Gonzalez, Abhineet Kumar, Khyati Khandelwal, Abdelhakim Benechehab, Hamza Cherkaoui, Youssef Attia El-Hili, Kun Shao, Jianye Hao, Jun Yao, Balázs Kégl, Haitham Bou-Ammar, Jun Wang

Human expertise emerges through iterative cycles of interaction, reflection, and internal model updating, which are central to cognitive theories such as Kolb’s experiential learning and Vygotsky’s zone of proximal development. In contrast, current AI systems, particularly LLM agents, rely on static pre-training or rigid workflows, lacking mechanisms for continual adaptation. Recent studies have identified early cognitive traits in LLM agents (reflection, revision, and self-correction), suggesting foundational elements of human-like experiential learning. Thus the key question: can we design LLM agents capable of structured, cognitively grounded learning similar to human processes? In response, we propose a computational framework of Kolb’s learning cycle with Vygotsky’s ZPD for autonomous agents. Our architecture separates extrinsic (environment interaction) and intrinsic (internal reflection/abstraction) functions, enabling cognitively grounded scaffolded learning, where the agent initially learns within structured environments, followed by open-ended generalisation. This approach empowers agents to master complex tasks in domains that traditional fine-tuning or simple reflective methods could not tackle effectively. Its potential is powerfully demonstrated via direct comparison with humans in real-world Kaggle data science competitions. Learning fully automated data science code generation across 81 tasks, our system, Agent K, demonstrated the ability to perform the entire workflow autonomously, achieving an Elo-MMR score of 1694, beyond the median score of the Kaggle Masters (the top 2% among 200,000 users) in our study. With performance at the level of 9 gold, 8 silver, and 12 bronze medals - including 4 gold and 4 silver on prize-awarding competitions - Agent K is the first AI system to successfully integrate Kolb- and Vygotsky-inspired human cognitive learning, marking a major step toward generalist AI.


Paper and Project Links

PDF

Summary

Building on Kolb's experiential learning and Vygotsky's zone of proximal development, the paper proposes a computational framework for autonomous agents that enables structured, cognitively grounded learning. By separating extrinsic (environment interaction) and intrinsic (internal reflection and abstraction) functions, the agent first learns within structured environments and then generalizes in an open-ended way. This approach enables agents to master complex tasks, and its potential is demonstrated through direct comparison with humans in real-world Kaggle data science competitions. Agent K, the first AI system to integrate Kolb- and Vygotsky-inspired human cognitive learning, marks a major step forward.

Key Takeaways

  1. Human expertise emerges through iterative cycles of interaction, reflection, and internal model updating, which are central to Kolb's experiential learning and Vygotsky's zone of proximal development.
  2. Current AI systems, especially LLM agents, rely on static pre-training or rigid workflows and lack mechanisms for continual adaptation.
  3. Recent studies have identified early cognitive traits in LLM agents (reflection, revision, and self-correction), suggesting foundational elements of human-like experiential learning.
  4. The paper proposes a computational framework for autonomous agents based on Kolb's learning cycle and Vygotsky's zone of proximal development, enabling structured, cognitively grounded learning.
  5. By separating extrinsic and intrinsic functions, the agent learns in structured environments and then generalizes in an open-ended way, mastering complex tasks.
  6. Agent K successfully integrates Kolb's and Vygotsky's cognitive learning theories and demonstrates strong performance.
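
Kolb's cycle has four phases (concrete experience, reflective observation, abstract conceptualization, active experimentation), and the paper splits them into extrinsic and intrinsic functions. The loop below is a schematic of that separation under our own reading; all callables are placeholders, not Agent K's implementation.

```python
from typing import Callable

def kolb_loop(act: Callable[[dict], dict], reflect: Callable[[dict], str],
              abstract: Callable[[str, dict], dict], episodes: int = 3) -> dict:
    """Schematic experiential-learning loop: extrinsic acting vs. intrinsic
    reflection and abstraction, with the internal model carried across episodes."""
    model: dict = {"notes": []}
    for _ in range(episodes):
        experience = act(model)          # extrinsic: interact with the environment
        lesson = reflect(experience)     # intrinsic: reflective observation
        model = abstract(lesson, model)  # intrinsic: update the internal model
    return model                         # the next act() is active experimentation

# Stub functions just to show the data flow.
act = lambda m: {"score": 0.1 * (len(m["notes"]) + 1)}
reflect = lambda e: f"score was {e['score']:.1f}; try a richer feature set"
abstract = lambda lesson, m: {"notes": m["notes"] + [lesson]}
print(kolb_loop(act, reflect, abstract)["notes"][-1])
```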



Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!