发布日期: 2025-02-27

更新日期: 2025-05-14

文章字数: 14.5k

阅读时长: 58 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-02-27 更新

AgentRM: Enhancing Agent Generalization with Reward Modeling

Authors:Yu Xia, Jingru Fan, Weize Chen, Siyu Yan, Xin Cong, Zhong Zhang, Yaxi Lu, Yankai Lin, Zhiyuan Liu, Maosong Sun

Existing LLM-based agents have achieved strong performance on held-in tasks, but their generalizability to unseen tasks remains poor. Hence, some recent work focus on fine-tuning the policy model with more diverse tasks to improve the generalizability. In this work, we find that finetuning a reward model to guide the policy model is more robust than directly finetuning the policy model. Based on this finding, we propose AgentRM, a generalizable reward model, to guide the policy model for effective test-time search. We comprehensively investigate three approaches to construct the reward model, including explicit reward modeling, implicit reward modeling and LLM-as-a-judge. We then use AgentRM to guide the answer generation with Best-of-N sampling and step-level beam search. On four types of nine agent tasks, AgentRM enhances the base policy model by $8.8$ points on average, surpassing the top general agent by $4.0$. Moreover, it demonstrates weak-to-strong generalization, yielding greater improvement of $12.6$ on LLaMA-3-70B policy model. As for the specializability, AgentRM can also boost a finetuned policy model and outperform the top specialized agent by $11.4$ on three held-in tasks. Further analysis verifies its effectiveness in test-time scaling. Codes will be released to facilitate the research in this area.

现有基于大型语言模型的代理在固定任务上表现良好，但在未见任务上的泛化能力仍然较差。因此，一些最近的研究专注于通过更多多样化的任务微调策略模型，以提高其泛化能力。在这项工作中，我们发现相较于直接微调策略模型，通过奖励模型进行微调以指导策略模型更加稳健。基于这一发现，我们提出了AgentRM（一种通用奖励模型），用于引导策略模型进行有效的测试时间搜索。我们全面研究了三种构建奖励模型的方法，包括显式奖励建模、隐式奖励建模和LLM作为评判者。然后，我们使用AgentRM指导答案生成，采用最佳N采样和步骤级光束搜索。在四种类型的九个代理任务中，AgentRM平均提高了基础策略模型的性能8.8个点，超过了顶级通用代理4.0个点。此外，它展示了从弱到强的泛化能力，对LLaMA-3-70B策略模型的改进提高了12.6个点。至于专项能力方面，AgentRM也能提升微调后的策略模型性能，并在三项固定任务上超过顶级专业代理11.4个点。进一步的分析验证了其在测试时间扩展中的有效性。相关代码将发布，以促进该领域的研究。

论文及项目相关链接

PDF

Summary

本文探讨了在LLM基础上构建的代理在未见任务上的泛化能力问题。研究发现，通过微调奖励模型来指导政策模型比直接微调政策模型更为稳健。基于此，提出了AgentRM，一种可泛化的奖励模型，用于指导政策模型进行有效的测试时间搜索。通过三种构建奖励模型的方法（显式奖励建模、隐式奖励建模和LLM作为判官），AgentRM平均提升了基础政策模型8.8个百分点，并在LLaMA-3-70B政策模型上实现了高达12.6个百分点的改进。同时，AgentRM也能提升微调后的政策模型在特定任务上的表现。该方法的代码将公开发布，以促进相关领域的研究。

Key Takeaways

LLM基础代理在未见任务上的泛化能力有待提高。
通过微调奖励模型指导政策模型能提升代理的泛化能力。
AgentRM是一种可泛化的奖励模型，通过三种方法构建：显式奖励建模、隐式奖励建模和LLM作为判官。
AgentRM在多种任务上提升了基础政策模型的表现，平均提升8.8个百分点，并在LLaMA-3-70B政策模型上实现更大的改进。
AgentRM能提升特定任务上的表现，并超越顶尖专业代理的表现。
AgentRM的有效性在测试时间缩放方面得到了进一步验证。

Cool Papers

点此查看论文截图

WebGames: Challenging General-Purpose Web-Browsing AI Agents

Authors:George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab

We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only 43.1% success rate compared to human performance of 95.7%, highlighting fundamental limitations in current AI systems’ ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in development of more capable web-browsing agents.

我们介绍了WebGames，这是一套全面的基准测试套件，旨在通过50多个交互式挑战来评估通用网页浏览人工智能代理。这些挑战是专为人类设计得简单易懂的，同时系统地测试了当前人工智能系统在基本浏览器交互、高级输入处理、认知任务、工作流程自动化和互动娱乐等方面的局限性。我们的框架通过密闭的测试环境消除了外部依赖，确保可重复评估和可验证的基准解决方案。我们评估了领先的视觉语言模型，包括GPT-4o、Claude计算机使用、Gemini-1.5 Pro和Qwen2-VL与人类性能的对比。结果表明存在显著的能力差距，最好的人工智能系统只有43.1%的成功率，而人类性能为95.7%，这突显了当前人工智能系统在处理人类认为直观的常见网页交互模式方面的根本性局限。该基准测试在webgames.convergence.ai上公开可用，提供了一个轻量级的客户端实现，促进了快速的评估周期。通过其模块化架构和标准化的挑战规格，WebGames为衡量更能力强的网页浏览代理的开发进度提供了坚实的基础。

论文及项目相关链接

PDF

Summary

WebGames是一个全面评估通用网页浏览AI系统的基准测试套件，包含超过五十项互动挑战，旨在通过设计特定挑战评估人工智能系统的基本浏览器交互、高级输入处理、认知任务、工作流程自动化及互动娱乐的能力。框架消除依赖封闭测试环境以确保可重复验证评估。评测显示顶尖的视觉语言模型仍有明显短板，最好的AI系统成功率为人类成绩的不到一半，展示其在处理常见网页交互模式上的局限。WebGames基准测试套件已公开发布并提供轻量级客户端实现，便于快速评估循环。其模块化架构和标准挑战规格为衡量网页浏览代理的发展进步提供了稳健基础。

Key Takeaways

WebGames是一个全面评估网页浏览AI的基准测试套件，包含超过五十项互动挑战。
该框架旨在测试AI系统的多项能力，包括基本浏览器交互、高级输入处理等。
测试环境封闭消除外部依赖，确保评估结果的可重复性。
评测结果揭示了顶尖视觉语言模型存在的显著能力差距。
最佳AI系统的成功率仅为人类成绩的约一半，表明在处理常见网页交互模式上存在局限。
WebGames基准测试套件已公开发布并提供客户端实现，便于快速评估AI性能。

Cool Papers

点此查看论文截图

RefuteBench 2.0 – Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

Authors:Jianhao Yan, Yun Luo, Yue Zhang

In the multi-turn interaction schema, large language models (LLMs) can leverage user feedback to enhance the quality and relevance of their responses. However, evaluating an LLM’s ability to incorporate user refutation feedback is crucial yet challenging. In this study, we introduce RefuteBench 2.0, which significantly extends the original RefuteBench by incorporating LLM agents as refuters and evaluators, which allows for flexible and comprehensive assessment. We design both transient and persistent refutation instructions with different validity periods. Meta-evaluation shows that the LLM-based refuter could generate more human-like refutations and the evaluators could assign scores with high correlation with humans. Experimental results of various LLMs show that current models could effectively satisfy the refutation but fail to memorize the refutation information. Interestingly, we also observe that the performance of the initial task decreases as the refutations increase. Analysis of the attention scores further shows a potential weakness of current LLMs: they struggle to retain and correctly use previous information during long context dialogues. https://github.com/ElliottYan/RefuteBench-2.0

在多轮交互模式中，大型语言模型（LLM）可以利用用户反馈来提高其响应的质量和相关性。然而，评估LLM融入用户反驳性反馈的能力是至关重要且充满挑战的。在这项研究中，我们推出了RefuteBench 2.0，它通过引入LLM代理作为反驳者和评估者，对原始RefuteBench进行了重大扩展，从而实现了灵活全面的评估。我们设计了具有不同有效期限的瞬时和持久反驳指令。元评估表明，基于LLM的反驳者能够产生更多人性化的反驳，评估者的评分与人类评分高度相关。各种LLM的实验结果表明，当前模型虽然能满足反驳需求，但无法记住反驳信息。有趣的是，我们还观察到随着反驳的增加，初始任务性能有所下降。对注意力分数的进一步分析揭示了当前LLM的潜在弱点：在长期的对话情境中，它们很难保留并正确使用以前的信息。详情请访问 https://github.com/ElliottYan/RefuteBench-2.0 了解更多。

论文及项目相关链接

PDF Work on progess

Summary

基于多轮交互模式，大型语言模型（LLM）能够利用用户反馈来提升响应的质量和相关性。本研究介绍了RefuteBench 2.0，它在原有基础上纳入了LLM作为反驳者和评估者，实现了灵活和全面的评估。通过设计瞬时和持久性反驳指令，并进行了元评估，显示LLM生成的反驳更具人性化，评估者的评分与人类高度相关。实验结果显示，当前模型能有效满足反驳要求，但无法记住反驳信息。随着反驳的增加，初始任务性能有所下降。注意力得分分析揭示了当前LLM的潜在弱点：在长篇对话中，它们难以保留并正确使用以前的信息。

Key Takeaways

LLM在多轮交互模式中可以利用用户反馈提升响应质量。
RefuteBench 2.0通过纳入LLM作为反驳者和评估者进行更全面的评估。
LLM生成的反驳更具人性化，评估者的评分与人类高度相关。
当前模型虽能满足反驳要求，但无法有效记忆反驳信息。
随着反驳的增加，初始任务性能会下降。
LLM在长篇对话中难以保留并正确使用以前的信息。
注意力得分分析揭示了LLM的潜在弱点。

Cool Papers

点此查看论文截图

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

Authors:Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao

Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model’s reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.

从视觉丰富的文档中理解信息对于传统的检索增强生成（RAG）方法来说仍然是一个巨大的挑战。现有的基准测试主要关注基于图像的问答（QA），忽视了在密集的视觉文档中进行高效检索、理解和推理的根本性挑战。为了弥补这一差距，我们引入了ViDoSeek，这是一个新型数据集，旨在评估RAG在需要复杂推理的视觉丰富文档上的表现。基于此，我们确定了当前RAG方法的关键局限性：（i）纯粹的视觉检索方法难以有效地整合文本和视觉特征；（ii）以前的方法通常分配了不足的推理令牌，限制了其有效性。为了应对这些挑战，我们提出了ViDoRAG，这是一个为跨视觉文档的复杂推理量身定制的新型多代理RAG框架。ViDoRAG采用基于高斯混合模型（GMM）的混合策略，以有效处理多模式检索。为了进一步激发模型的推理能力，我们引入了一种迭代代理工作流程，融合了探索、总结和反思，为RAG领域中的测试时间缩放提供了研究框架。在ViDoSeek上的广泛实验验证了我们的方法的有效性和通用性。值得注意的是，ViDoRAG在竞争性的ViDoSeek基准测试上的表现优于现有方法，准确率提高了10%以上。

论文及项目相关链接

PDF

Summary

本文介绍了针对视觉丰富文档进行复杂推理的RAG方法的重要挑战。为解决现有RAG方法在视觉丰富文档上的局限性，提出了ViDoRAG，一个为复杂视觉文档推理定制的多代理RAG框架。该框架采用基于高斯混合模型（GMM）的混合策略，有效处理多模式检索，并引入迭代代理工作流程，以激发模型的推理能力。在ViDoSeek上的实验验证了该方法的有效性和泛化能力，且ViDoRAG在竞争性的ViDoSeek基准测试上的表现优于现有方法超过10%。

Key Takeaways

视觉丰富文档的RAG方法面临重大挑战，需要高效检索、理解和推理。
现有基准测试主要关注图像问答，忽视了视觉丰富文档中的复杂推理挑战。
引入ViDoSeek数据集，用于评估视觉丰富文档上的RAG性能。
现有RAG方法的局限性：纯视觉检索方法难以整合文本和视觉特征，以往方法分配推理令牌不足。
提出ViDoRAG框架，采用高斯混合模型（GMM）的混合策略处理多模式检索。
ViDoRAG引入迭代代理工作流程，包括探索、总结和反思，以激发模型的推理能力。

Cool Papers

点此查看论文截图

LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena

Authors:Tianmi Ma, Jiawei Du, Wenxin Huang, Wenjie Wang, Liang Xie, Xian Zhong, Joey Tianyi Zhou

Recent advancements in large language models (LLMs) have significantly improved performance in natural language processing tasks. However, their ability to generalize to dynamic, unseen tasks, particularly in numerical reasoning, remains a challenge. Existing benchmarks mainly evaluate LLMs on problems with predefined optimal solutions, which may not align with real-world scenarios where clear answers are absent. To bridge this gap, we design the Agent Trading Arena, a virtual numerical game simulating complex economic systems through zero-sum games, where agents invest in stock portfolios. Our experiments reveal that LLMs, including GPT-4o, struggle with algebraic reasoning when dealing with plain-text stock data, often focusing on local details rather than global trends. In contrast, LLMs perform significantly better with geometric reasoning when presented with visual data, such as scatter plots or K-line charts, suggesting that visual representations enhance numerical reasoning. This capability is further improved by incorporating the reflection module, which aids in the analysis and interpretation of complex data. We validate our findings on NASDAQ Stock dataset, where LLMs demonstrate stronger reasoning with visual data compared to text. Our code and data are publicly available at https://github.com/wekjsdvnm/Agent-Trading-Arena.git.

最近的大型语言模型（LLM）的进步在自然语言处理任务中显著提高了性能。然而，它们在应对动态、未见任务的推广能力，特别是在数值推理方面，仍然是一个挑战。现有的基准测试主要对LLMs评估预定义最优解决方案的问题，这可能不符合现实世界场景，其中没有明确的答案。为了弥补这一差距，我们设计了Agent Trading Arena，这是一个通过零和游戏模拟复杂经济系统的虚拟数值游戏，其中代理投资于股票组合。我们的实验表明，包括GPT-4o在内的LLMs在处理纯文本股票数据时，在代数推理方面遇到困难，往往关注局部细节而非全局趋势。相比之下，当以视觉数据（如散点图或K线图）呈现时，LLMs在几何推理方面的表现要好得多，这表明视觉表示增强了数值推理。通过引入反射模块，这种能力得到了进一步提高，有助于分析和解释复杂数据。我们在NASDAQ股票数据集上验证了我们的发现，其中LLMs在视觉数据下的推理能力相较于文本更强。我们的代码和数据公开可访问于https://github.com/wekjsdvnm/Agent-Trading-Arena.git。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在自然语言处理任务上的表现有了显著提升，但在动态未见任务的泛化能力上仍存在挑战。现有基准测试主要评估LLM在具有预设最优解决方案的问题上的表现，这可能不符合现实世界场景。为解决此差距，我们设计了一个名为Agent Trading Arena的虚拟数值游戏，模拟复杂的经济系统通过零和游戏进行股票组合投资。实验表明，LLM在处理纯文本股票数据时面临代数推理困难，常常关注局部细节而忽视全局趋势。相比之下，当呈现可视化数据时，如散点图或K线图，LLM的几何推理能力表现更好。引入反射模块后，这一能力得到了进一步提升，有助于分析复杂数据。我们在NASDAQ股票数据集上验证了这些发现。

Key Takeaways

大型语言模型（LLM）在自然语言处理任务上表现优异，但在泛化到动态未见任务方面存在挑战。
现有基准测试与真实世界场景存在差距，需要新的评估方法。
Agent Trading Arena是一个模拟复杂经济系统的虚拟数值游戏，用于评估LLM在股票组合投资方面的能力。
LLM在处理纯文本股票数据时面临代数推理困难。
LLM在可视化数据（如散点图、K线图）上的几何推理能力更强。
引入反射模块可进一步提升LLM对复杂数据的分析和理解能力。
在NASDAQ股票数据集上，LLM在可视化数据上的推理能力优于文本数据。

Cool Papers

点此查看论文截图

Assessing Large Language Models in Agentic Multilingual National Bias

Authors:Qianying Liu, Katrina Qiyao Wang, Fei Cheng, Sadao Kurohashi

Large Language Models have garnered significant attention for their capabilities in multilingual natural language processing, while studies on risks associated with cross biases are limited to immediate context preferences. Cross-language disparities in reasoning-based recommendations remain largely unexplored, with a lack of even descriptive analysis. This study is the first to address this gap. We test LLM’s applicability and capability in providing personalized advice across three key scenarios: university applications, travel, and relocation. We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages. We quantify bias in model-generated scores and assess the impact of demographic factors and reasoning strategies (e.g., Chain-of-Thought prompting) on bias patterns. Our findings reveal that local language bias is prevalent across different tasks, with GPT-4 and Sonnet reducing bias for English-speaking countries compared to GPT-3.5 but failing to achieve robust multilingual alignment, highlighting broader implications for multilingual AI agents and applications such as education.

大型语言模型在多语种自然语言处理方面引起了广泛关注，但关于与跨语种偏见相关的风险研究仅限于即时语境偏好。基于推理的推荐中的跨语言差异仍被大量探索，甚至缺乏描述性分析。本研究首次填补了这一空白。我们测试了大型语言模型在三种关键场景（大学申请、旅行和搬迁）中提供个性化建议的适用性和能力。通过分析大型语言模型对多语种决策任务的回应，我们研究了其存在的多语种偏见。我们量化模型生成分数中的偏见，并评估人口统计因素和推理策略（如“思维链”提示）对偏见模式的影响。我们的研究发现在不同的任务中都普遍存在本地语言偏见，GPT-4和Sonnet与GPT-3.5相比减少了英语国家的偏见，但未能实现稳健的多语种对齐，这凸显了对多语种人工智能代理和教育应用等更广泛的启示。

论文及项目相关链接

PDF 13 pages

Summary
大型语言模型在多语种自然语言处理中受到广泛关注，但关于跨语言偏见的研究仍限于即时语境偏好。本研究首次探讨了语言模型在提供个性化建议方面的适用性，并研究了其在大学申请、旅行和搬迁等三个关键场景中的表现。通过分析决策任务中的多语种偏见，我们量化了模型生成的分数中的偏见，并评估了人口统计因素和推理策略（如Chain-of-Thought提示）对偏见模式的影响。研究发现，本地语言偏见在不同任务中普遍存在，GPT-4和Sonnet相较于GPT-3.5在英语国家的偏见有所减少，但仍未实现稳健的多语种对齐。这强调了多语种人工智能在教育等领域的更广泛影响的重要性。此外也进一步说明即使是现今前沿的语言模型仍然存在多方面的局限和偏误需要加以深入探索研究才能优化AI语言模型的全面应用与适用性以适应多种语言的挑战同时也有望扩大在多语种场景下智能体的应用场景和服务范围推动相关领域的发展进步。研究也揭示了对不同语言的认知和表达存在差异因此在未来的研究中应进一步关注如何克服这些差异以提高模型的泛化能力和跨文化适应性从而更好地服务于全球用户。随着技术的不断进步和语言模型研究的深入我们期待未来能够开发出更加智能、高效且适应多语种环境的语言模型从而更好地满足社会的需求服务于全球的交流和合作推动世界进步。总体来说这是一个对前沿领域有益的补充和发展推动了人工智能在多语种环境下的应用和发展。总的来说本研究的发现对于未来多语种人工智能的发展和应用具有重要的启示意义。同时该研究也为我们提供了宝贵的见解和思路为未来的研究提供了重要的参考方向。我们相信随着技术的不断进步这一领域的研究将不断发展和壮大并为多语种交流带来更多的便利和创新贡献力量为社会进步发展作出贡献奠定了重要的理论基础和启示。未来对于大型语言模型的研究也将朝着更加全面和深入的方向发展推动人工智能技术的不断进步和创新发展以更好地服务于社会需求和人类进步发展。

Key Takeaways

大型语言模型在多语种自然语言处理中受到关注，但跨语言偏见的研究仍存在局限，且尚未在理据型推荐中得到足够探索。此研究意在填补此领域的空白并深入了解现状并进行原因分析、研究和策略构建以帮助当前问题的解决以实现全面的应用研究并最终有效支持泛化和决策策略的构建和优化以实现多语种场景下的智能决策支持和服务提升用户体验和满意度。

Cool Papers

点此查看论文截图

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

Authors:Hongzhan Lin, Yang Deng, Yuxuan Gu, Wenxuan Zhang, Jing Ma, See-Kiong Ng, Tat-Seng Chua

Large Language Models (LLMs) have significantly advanced the fact-checking studies. However, existing automated fact-checking evaluation methods rely on static datasets and classification metrics, which fail to automatically evaluate the justification production and uncover the nuanced limitations of LLMs in fact-checking. In this work, we introduce FACT-AUDIT, an agent-driven framework that adaptively and dynamically assesses LLMs’ fact-checking capabilities. Leveraging importance sampling principles and multi-agent collaboration, FACT-AUDIT generates adaptive and scalable datasets, performs iterative model-centric evaluations, and updates assessments based on model-specific responses. By incorporating justification production alongside verdict prediction, this framework provides a comprehensive and evolving audit of LLMs’ factual reasoning capabilities, to investigate their trustworthiness. Extensive experiments demonstrate that FACT-AUDIT effectively differentiates among state-of-the-art LLMs, providing valuable insights into model strengths and limitations in model-centric fact-checking analysis.

大型语言模型（LLM）在事实核查研究方面取得了显著进展。然而，现有的自动化事实核查评估方法依赖于静态数据集和分类指标，这些方法无法自动评估论证的产生，并揭示LLM在事实核查中的微妙局限性。在这项工作中，我们引入了FACT-AUDIT，这是一个以代理驱动的框架，能够自适应和动态地评估LLM的事实核查能力。借助重要性采样原理和多代理协作，FACT-AUDIT生成自适应和可扩展的数据集，执行迭代模型为中心的评估，并根据模型特定的响应更新评估结果。通过结合论证的产生和裁决预测，该框架提供了对LLM事实推理能力的全面和不断发展的审计，以调查其可信度。大量实验表明，FACT-AUDIT能够有效地区分最先进的LLM，在模型为中心的事实核查分析中提供了关于模型优势和局限性的宝贵见解。

论文及项目相关链接

PDF

Summary
大型语言模型（LLM）在事实核查研究中取得了显著进展。然而，现有的自动化事实核查评估方法依赖于静态数据集和分类指标，无法自动评估论证生成并揭示LLM在事实核查中的微妙局限性。为此，我们引入了FACT-AUDIT这一以代理驱动框架，自适应地动态评估LLM的事实核查能力。该框架利用重要性采样原理和跨代理协作生成自适应和可扩展数据集，执行迭代模型中心评估，并根据模型特定响应更新评估结果。通过结合论证生成与裁决预测，该框架对LLM的事实推理能力进行了全面和不断发展的审计，以研究其可靠性。实验表明，FACT-AUDIT有效地区分了最先进的大型语言模型，为模型中心的事实核查分析提供了有价值的见解。

Key Takeaways

大型语言模型（LLMs）在事实核查领域有重要进展。
现有自动化评估方法存在局限性，无法全面评估LLMs在事实核查中的性能。
FACT-AUDIT框架基于代理驱动，能自适应、动态地评估LLMs的事实核查能力。
该框架利用重要性采样和跨代理协作生成数据集，执行模型中心的评价。
FACT-AUDIT结合论证生成与裁决预测，全面审计LLM的事实推理能力。
实验证明，FACT-AUDIT能有效区分不同的大型语言模型。

Cool Papers

点此查看论文截图

Authors:Meng Feng, Viraj Parimi, Brian Williams

Safe navigation is essential for autonomous systems operating in hazardous environments. Traditional planning methods excel at long-horizon tasks but rely on a predefined graph with fixed distance metrics. In contrast, safe Reinforcement Learning (RL) can learn complex behaviors without relying on manual heuristics but fails to solve long-horizon tasks, particularly in goal-conditioned and multi-agent scenarios. In this paper, we introduce a novel method that integrates the strengths of both planning and safe RL. Our method leverages goal-conditioned RL and safe RL to learn a goal-conditioned policy for navigation while concurrently estimating cumulative distance and safety levels using learned value functions via an automated self-training algorithm. By constructing a graph with states from the replay buffer, our method prunes unsafe edges and generates a waypoint-based plan that the agent follows until reaching its goal, effectively balancing faster and safer routes over extended distances. Utilizing this unified high-level graph and a shared low-level goal-conditioned safe RL policy, we extend this approach to address the multi-agent safe navigation problem. In particular, we leverage Conflict-Based Search (CBS) to create waypoint-based plans for multiple agents allowing for their safe navigation over extended horizons. This integration enhances the scalability of goal-conditioned safe RL in multi-agent scenarios, enabling efficient coordination among agents. Extensive benchmarking against state-of-the-art baselines demonstrates the effectiveness of our method in achieving distance goals safely for multiple agents in complex and hazardous environments. Our code will be released to support future research.

安全导航对于在危险环境中运行自主系统至关重要。传统规划方法在长远任务上表现出色，但依赖于具有固定距离度量的预定义图。相比之下，安全强化学习（RL）能够学习复杂行为，无需依赖手动启发式方法，但在解决长远任务时却会失败，特别是在目标条件和多智能体场景中。在本文中，我们引入了一种结合规划和安全强化学习优点的新方法。我们的方法利用目标条件强化学习和安全强化学习来学习导航的目标条件策略，同时使用通过自动化自训练算法学到的值函数来估计累积距离和安全水平。通过利用回放缓冲区中的状态构建图，我们的方法可以剔除不安全边缘并生成基于路点的计划，智能体遵循该计划直至达到目标，有效地在较长距离上实现更快的安全路线平衡。利用这种统一的高级图和共享的底层目标条件安全强化学习策略，我们将这一方法扩展到解决多智能体安全导航问题。特别是，我们利用基于冲突的搜索（CBS）为多个智能体创建基于路点的计划，允许它们在扩展范围内安全导航。这种融合提高了目标条件安全强化学习在多智能体场景中的可扩展性，实现了智能体之间的有效协调。与最新技术的基准测试相比，我们的方法在复杂和危险环境中为多个智能体实现安全距离目标的有效性得到了验证。我们的代码将发布以支持未来的研究。

论文及项目相关链接

PDF Due to the limitation “The abstract field cannot be longer than 1,920 characters”, the abstract here is shorter than that in the PDF file

Summary

本文提出了一种结合规划和安全强化学习（RL）的方法，用于自主系统在危险环境中的安全导航。该方法通过构建包含回放缓冲区状态的高层次图形，能够学习目标条件下的策略，同时估计累积距离和安全级别。通过修剪不安全边缘并生成基于路径点的计划，该方法使代理能够在达到目标的过程中实现更快、更安全的路线。此外，该方法还扩展到解决多代理安全导航问题，利用基于冲突的搜索（CBS）为多个代理创建基于路径点的计划，以实现长期安全导航。该方法在复杂和危险环境中对多个代理实现安全距离目标方面表现出有效性。

Key Takeaways

传统规划方法擅长处理长期任务，但依赖于预定义的图形和固定距离度量。
安全强化学习（RL）可以学习复杂行为，但无法解决长期任务，特别是在目标调节和多代理场景中。
本文提出的方法结合了规划和安全RL的优点，通过构建包含回放缓冲区状态的高层次图形来实现目标条件下的安全导航。
该方法通过修剪不安全边缘并生成基于路径点的计划，实现更快、更安全的路线。
该方法扩展到多代理安全导航问题，利用基于冲突的搜索（CBS）为多个代理创建基于路径点的计划。
通过集成CBS，该方法提高了目标调节安全RL在多代理场景中的可扩展性，实现了高效的代理间协调。

Cool Papers

点此查看论文截图

Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration

Authors:Junyang Wang, Haiyang Xu, Xi Zhang, Ming Yan, Ji Zhang, Fei Huang, Jitao Sang

The rapid increase in mobile device usage necessitates improved automation for seamless task management. However, many AI-driven frameworks struggle due to insufficient operational knowledge. Manually written knowledge helps but is labor-intensive and inefficient. To address these challenges, we introduce Mobile-Agent-V, a framework that leverages video guidance to provide rich and cost-effective operational knowledge for mobile automation. Mobile-Agent-V enhances task execution capabilities by leveraging video inputs without requiring specialized sampling or preprocessing. Mobile-Agent-V integrates a sliding window strategy and incorporates a video agent and deep-reflection agent to ensure that actions align with user instructions. Through this innovative approach, users can record task processes with guidance, enabling the system to autonomously learn and execute tasks efficiently. Experimental results show that Mobile-Agent-V achieves a 30% performance improvement compared to existing frameworks. The code will be open-sourced at https://github.com/X-PLUG/MobileAgent.

随着移动设备使用的快速增加，需要改进自动化以实现无缝任务管理。然而，许多人工智能驱动的框架由于操作知识不足而面临困境。手动编写知识会有所帮助，但这种方法劳动强度大且效率低下。为了解决这些挑战，我们推出了Mobile-Agent-V框架，它利用视频指导，为移动自动化提供丰富且成本效益高的操作知识。Mobile-Agent-V通过利用视频输入，无需特殊采样或预处理，增强了任务执行能力。Mobile-Agent-V采用滑动窗口策略，并融入视频代理和深度反思代理，确保操作与用户指令一致。通过这一创新方法，用户可以按指导记录任务过程，使系统能够自主高效学习和执行任务。实验结果表明，与现有框架相比，Mobile-Agent-V实现了30%的性能提升。代码将在https://github.com/X-PLUG/MobileAgent上开源。

论文及项目相关链接

PDF 16 pages, 7 figures, 7tables

Summary
移动设备使用量的迅速增长需要改进自动化以实现无缝任务管理。然而，许多人工智能驱动的框架因缺乏操作知识而面临挑战。手动编写知识虽有帮助，但劳动密集且效率低下。为解决这些挑战，我们推出了Mobile-Agent-V框架，该框架利用视频指导，提供丰富且经济实惠的操作知识，用于移动自动化。Mobile-Agent-V通过利用视频输入增强任务执行能力，无需特殊采样或预处理。它采用滑动窗口策略，并结合视频代理和深度反思代理，确保操作与用户指令一致。用户可通过指导录制任务流程，使系统能够自主高效学习和执行任务。实验结果表明，与现有框架相比，Mobile-Agent-V实现了30%的性能提升。

Key Takeaways

移动设备使用量的增长推动了自动化任务管理的需求。
许多AI框架因操作知识不足而面临挑战。
Mobile-Agent-V框架利用视频指导提供操作知识，促进移动自动化。
Mobile-Agent-V通过视频输入增强任务执行能力，效率更高。
Mobile-Agent-V采用滑动窗口策略结合视频代理和深度反思代理，确保操作与用户指令一致。
用户可通过指导录制任务流程，使系统能够自主高效学习并执行任务。
实验结果显示，Mobile-Agent-V相较于现有框架有30%的性能提升。

Cool Papers

点此查看论文截图

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

Authors:Sheng Yin, Xianghe Pang, Yuanzhuo Ding, Menglan Chen, Yutong Bi, Yichen Xiong, Wenhao Huang, Zhen Xiang, Jing Shao, Siheng Chen

With the integration of large language models (LLMs), embodied agents have strong capabilities to process the scene information and plan complicated instructions in natural language, paving the way for the potential deployment of embodied robots. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damages in the real world. To study this issue, we present SafeAgentBench-a new benchmark for safety-aware task planning of embodied LLM agents. SafeAgentBench includes: (1) a new dataset with 750 tasks, covering 10 potential hazards and 3 task types; (2) SafeAgentEnv, a universal embodied environment with a low-level controller, supporting multi-agent execution with 17 high-level actions for 8 state-of-the-art baselines; and (3) reliable evaluation methods from both execution and semantic perspectives. Experimental results show that, although agents based on different design frameworks exhibit substantial differences in task success rates, their overall safety awareness remains weak. The most safety-conscious baseline achieves only a 10% rejection rate for detailed hazardous tasks. Moreover, simply replacing the LLM driving the agent does not lead to notable improvements in safety awareness. More details and code are available at https://github.com/shengyin1224/SafeAgentBench.

随着大型语言模型（LLM）的集成，实体代理具备了强大的场景信息处理能力和自然语言复杂指令规划能力，为实体机器人的潜在部署铺平了道路。然而，一个可预见的问题是他们能够完美执行一些危险的任务，有可能对现实世界造成破坏。为了研究这个问题，我们提出了SafeAgentBench——一个新的针对实体LLM代理的安全意识任务规划的基准测试。SafeAgentBench包括：（1）一个新的数据集，包含750个任务，涵盖10种潜在危险和3种任务类型；（2）SafeAgentEnv是一个通用的实体环境，配备低级控制器，支持多个代理使用带有包括细致的仿真和场景设计等特点的代理执行17种高级动作；（3）从执行和语义两个角度提供可靠的评估方法。实验结果表明，虽然基于不同设计框架的代理在任务成功率上展现出较大差异，但整体的安全意识仍然薄弱。最具有安全意识的基线仅对详细的危险任务达到拒绝率为百分之十。此外，简单地更换驱动代理的LLM并不会导致安全意识的显著提高。更多细节和代码可在 https://github.com/shengyin1224/SafeAgentBench 上查看。

论文及项目相关链接

PDF 23 pages, 17 tables, 8 figures

Summary

大型语言模型驱动的实体代理具备处理场景信息和规划复杂指令的能力，为实体机器人的部署提供了可能。但存在潜在风险，实体代理人可能执行危险任务并造成实际损害。为解决这一问题，推出SafeAgentBench——实体LLM代理安全感知任务规划的新基准。SafeAgentBench包含：新的包含750个任务的数据集，涵盖10种潜在危险和三种任务类型；通用实体环境SafeAgentEnv，支持多代理执行并配备用于八种最新基线模型的十七种高级动作；以及从执行和语义两个角度的可靠评估方法。实验结果显示，不同框架的代理人在任务成功率上有显著差异，但整体安全意识仍然薄弱。最具安全意识的基线模型详细危险任务的拒绝率仅为百分之十。替换LLM模型驱动对安全意识并无显著改善。更多详情和代码可访问SafeAgentBench官网：链接地址。

Key Takeaways

大型语言模型驱动的实体代理人具备强大的场景信息处理能力和复杂指令规划能力。
实体机器人在执行任务时存在潜在风险，可能造成实际损害。
SafeAgentBench是一个新的基准，用于评估实体LLM代理的安全意识任务规划能力。
SafeAgentBench包含新的数据集、通用实体环境和评估方法。
不同框架的代理人在任务成功率上存在差异，但整体安全意识仍然薄弱。
最具安全意识的基线模型对详细危险任务的拒绝率较低。

Cool Papers

点此查看论文截图

AFlow: Automating Agentic Workflow Generation

Authors:Jiayi Zhang, Jinyu Xiang, Zhaoyang Yu, Fengwei Teng, Xionghui Chen, Jiaqi Chen, Mingchen Zhuge, Xin Cheng, Sirui Hong, Jinlin Wang, Bingnan Zheng, Bang Liu, Yuyu Luo, Chenglin Wu

Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce AFlow, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow’s efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be available at https://github.com/geekan/MetaGPT.

大规模语言模型（LLM）在解决跨不同领域的复杂任务中表现出了显著潜力，通常是通过采用遵循详细指令和操作序列的代理工作流程来实现。然而，构建这些工作流程需要大量的人工努力，限制了其可扩展性和通用性。最近的研究试图实现这些工作流程的自动生成和优化，但现有方法仍然依赖于初始的手动设置，并无法实现完全自动化和有效的工作流程生成。为了应对这一挑战，我们将工作流程优化重新构建为一个在代码表示的工作流程上的搜索问题，其中LLM调用节点通过边缘连接。我们引入了AFlow，这是一个自动化框架，它利用蒙特卡洛树搜索有效地探索了这个空间，通过代码修改、树形经验和执行反馈来迭代优化工作流程。在六个基准数据集上的实证评估证明了AFlow的有效性，与最新基线相比，平均提高了5.7%。此外，AFlow使较小的模型能够以4.55%的美元推理成本在特定任务上超越GPT-4o。代码将在https://github.com/geekan/MetaGPT上提供。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在解决跨域复杂任务方面具有显著潜力，通常通过遵循详细指令和操作序列的agentic工作流程来实现。然而，构建这些工作流程需要大量人力，限制了其可扩展性和通用性。为解决这个问题，研究者们将工作流程优化重新构建为代码表示的工作流程搜索问题，并提出AFlow自动化框架，利用蒙特卡洛树搜索高效探索这个空间，通过代码修改、树结构经验和执行反馈来迭代优化工作流程。实证评估显示，AFlow在六个基准数据集上的效果优于最新基线方法，平均提高了5.7%。此外，AFlow使小型模型能够在特定任务上优于GPT-4，且推理成本仅为GPT-4的4.55%。

Key Takeaways

大型语言模型（LLM）在解决跨域复杂任务方面具有潜力。
现有工作流程构建需要大量人力，限制了可扩展性和通用性。
研究者将工作流程优化重新构建为代码表示的工作流程搜索问题。
AFlow框架通过蒙特卡洛树搜索高效探索工作流程空间。
AFlow通过代码修改、树结构经验和执行反馈迭代优化工作流程。
实证评估显示AFlow优于最新基线方法，平均提高了5.7%。
AFlow使小型模型在特定任务上表现优异，且推理成本较低。

Cool Papers

点此查看论文截图

Learning Multi-agent Multi-machine Tending by Mobile Robots

Authors:Abdalwhab Abdalwhab, Giovanni Beltrame, Samira Ebrahimi Kahou, David St-Onge

Robotics can help address the growing worker shortage challenge of the manufacturing industry. As such, machine tending is a task collaborative robots can tackle that can also highly boost productivity. Nevertheless, existing robotics systems deployed in that sector rely on a fixed single-arm setup, whereas mobile robots can provide more flexibility and scalability. In this work, we introduce a multi-agent multi-machine tending learning framework by mobile robots based on Multi-agent Reinforcement Learning (MARL) techniques with the design of a suitable observation and reward. Moreover, an attention-based encoding mechanism is developed and integrated into Multi-agent Proximal Policy Optimization (MAPPO) algorithm to boost its performance for machine tending scenarios. Our model (AB-MAPPO) outperformed MAPPO in this new challenging scenario in terms of task success, safety, and resources utilization. Furthermore, we provided an extensive ablation study to support our various design decisions.

机器人技术有助于解决制造业日益严重的劳动力短缺挑战。因此，机器照料是一个协作机器人可以完成的任务，可以大幅提高生产效率。然而，目前在该领域部署的机器人系统依赖于固定的单臂设置，而移动机器人可以提供更大的灵活性和可扩展性。在这项工作中，我们引入了一种基于移动机器人的多智能体多机器照料学习框架，采用多智能体强化学习（MARL）技术，并设计了一个合适的观察和奖励机制。此外，开发了一种基于注意力的编码机制，并将其整合到多智能体近端策略优化（MAPPO）算法中，以提高其在机器照料场景中的性能。我们的模型（AB-MAPPO）在任务成功、安全性和资源利用方面，在这个具有挑战性的新场景中优于MAPPO。此外，我们还通过广泛的消融研究来支持我们的各种设计决策。

论文及项目相关链接

PDF 7 pages, 4 figures, Accepted at an AAAI workshop (The Multi-Agent AI in the Real World Workshop)

Summary：
工业机器人可以帮助解决制造业日益严重的劳动力短缺挑战。机器照料是一项协作机器人能够完成的任务，可以大大提高生产力。然而，当前制造业使用的机器人系统大多依赖于固定的单臂设置，而移动机器人可以提供更大的灵活性和可扩展性。本文介绍了一种基于多智能体强化学习（MARL）技术的移动机器人多机器照料学习框架，并设计了合适的观察和奖励机制。此外，开发了一种基于注意力的编码机制，并将其集成到多智能体近端策略优化（MAPPO）算法中，以提高机器照料场景的性能。AB-MAPPO模型在此新挑战场景中在任务成功、安全性和资源利用率方面优于MAPPO。我们还进行了一项广泛的研究以支持我们的各种设计决策。

Key Takeaways：

工业机器人可助力解决制造业劳动力短缺问题。
机器照料是协作机器人能提升生产效的任务之一。
当前机器人系统存在固定单臂设置的局限性，移动机器人更具灵活性和可扩展性。
引入基于多智能体强化学习（MARL）的移动机器人多机器照料学习框架。
提出了一种注意力编码机制并集成到MAPPO算法中，提高了机器照料性能。
AB-MAPPO模型在任务成功、安全性和资源利用率方面优于MAPPO。

Cool Papers

点此查看论文截图

Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework

Authors:Bohao Yang, Dong Liu, Chenghao Xiao, Kun Zhao, Chen Tang, Chao Li, Lin Yuan, Guang Yang, Lanxiao Huang, Chenghua Lin

Large Language Models (LLMs) demonstrate remarkable ability to comprehend instructions and generate human-like text, enabling sophisticated agent simulation beyond basic behavior replication. However, the potential for creating freely customisable characters remains underexplored. We introduce the Customisable Conversation Agent Framework, which employs LLMs to simulate real-world characters through personalised characteristic feature injection, enabling diverse character creation according to user preferences. We propose the SimsConv dataset, comprising 68 customised characters and 13,971 multi-turn role-playing dialogues across 1,360 real-world scenes. Characters are initially customised using pre-defined elements (career, aspiration, traits, skills), then expanded through personal and social profiles. Building on this, we present SimsChat, a freely customisable role-playing agent incorporating various realistic settings and topic-specified character interactions. Experimental results on both SimsConv and WikiRoleEval datasets demonstrate SimsChat’s superior performance in maintaining character consistency, knowledge accuracy, and appropriate question rejection compared to existing models. Our framework provides valuable insights for developing more accurate and customisable human simulacra. Our data and code are publicly available at https://github.com/Bernard-Yang/SimsChat.

大型语言模型（LLMs）显示出理解和执行指令以及生成人类文本方面的显著能力，能够实现超越基本行为复制的复杂代理模拟。然而，创建可自由定制角色的潜力仍未得到充分探索。我们引入了可定制对话代理框架，该框架采用LLMs通过个性化特征注入模拟现实世界角色，根据用户偏好实现多样化角色创建。我们提出了SimsConv数据集，包含68个自定义角色和13971个跨1360个现实场景的多轮角色扮演对话。角色最初使用预定义元素（职业、抱负、特征、技能）进行定制，然后通过个人和社会资料进一步扩展。在此基础上，我们推出了SimsChat，这是一个可自由定制的角色扮演代理，包含各种现实场景和特定话题的角色互动。在SimsConv和WikiRoleEval数据集上的实验结果证明了SimsChat在保持角色一致性、知识准确性和适当问题拒绝方面的优越性能，相比于现有模型。我们的框架为开发更准确、可定制的人类模拟物提供了宝贵的见解。我们的数据和代码可在https://github.com/Bernard-Yang/SimsChat公开访问。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在理解和执行指令以及生成人类文本方面展现出显著的能力，不仅能模仿基本行为，还能模拟复杂的人物行为。然而，创建可自由定制的角色潜力尚未被充分研究。我们引入了可定制对话代理框架，该框架利用LLMs通过个性化特征注入来模拟现实角色，可根据用户偏好创建多样化的角色。我们提出了SimsConv数据集，包含68个自定义角色和13971个跨1360个现实场景的多轮角色扮演对话。角色通过预设元素（职业、抱负、特质、技能）进行初步定制，然后通过个人和社会资料进一步扩展。在此基础上，我们推出了SimsChat，这是一个可自由定制的角色扮演代理，包含各种现实场景和特定话题的角色互动。在SimsConv和WikiRoleEval数据集上的实验结果证明了SimsChat在保持角色一致性、知识准确性和适当的问题拒绝方面的优越性。我们的框架为开发更准确、可定制的人类模拟体提供了有价值的见解。

Key Takeaways

大型语言模型（LLMs）能模拟复杂的人物行为，而创建可自由定制的角色潜力尚未被充分研究。
引入了可定制对话代理框架，利用LLMs模拟现实角色，可创建多样化的角色。
提出了SimsConv数据集，包含自定义角色和跨现实场景的多轮角色扮演对话。
角色通过预设元素进行初步定制，包括职业、抱负、特质、技能等。
SimsChat是一个可自由定制的角色扮演代理，包含各种现实场景和特定话题的角色互动。
SimsChat在保持角色一致性、知识准确性和问题拒绝方面表现出优越性。

Cool Papers

点此查看论文截图

GOAL: A Generalist Combinatorial Optimization Agent Learner

Authors:Darko Drakulic, Sofia Michel, Jean-Marc Andreoli

Machine Learning-based heuristics have recently shown impressive performance in solving a variety of hard combinatorial optimization problems (COPs). However, they generally rely on a separate neural model, specialized and trained for each single problem. Any variation of a problem requires adjustment of its model and re-training from scratch. In this paper, we propose GOAL (for Generalist combinatorial Optimization Agent Learner), a generalist model capable of efficiently solving multiple COPs and which can be fine-tuned to solve new COPs. GOAL consists of a single backbone plus light-weight problem-specific adapters for input and output processing. The backbone is based on a new form of mixed-attention blocks which allows to handle problems defined on graphs with arbitrary combinations of node, edge and instance-level features. Additionally, problems which involve heterogeneous types of nodes or edges are handled through a novel multi-type transformer architecture, where the attention blocks are duplicated to attend the meaningful combinations of types while relying on the same shared parameters. We train GOAL on a set of routing, scheduling and classic graph problems and show that it is only slightly inferior to the specialized baselines while being the first multi-task model that solves a wide range of COPs. Finally we showcase the strong transfer learning capacity of GOAL by fine-tuning it on several new problems. Our code is available at https://github.com/naver/goal-co/.

基于机器学习的启发式算法最近在解决各种复杂的组合优化问题（COPs）方面表现出了令人印象深刻的性能。然而，它们通常依赖于针对每个单独问题专门设计和训练的神经网络模型。任何问题的变化都需要调整其模型并从头开始重新训练。在本文中，我们提出了GOAL（通用组合优化代理学习者），这是一个通用模型，能够高效地解决多个COPs，并且可以微调以解决新的COPs。GOAL由单个主干加上用于输入和输出处理的轻量级问题特定适配器组成。主干基于一种新的混合注意力块形式，能够处理定义在图上并具有节点、边和实例级特征任意组合的问题。此外，涉及不同类型节点或边的问题是通过一种新型的多类型变压器架构处理的，其中注意力块被复制以关注类型的有意义的组合，同时依赖于相同的共享参数。我们在一组路由、调度和经典图问题上训练GOAL，并展示它略逊于专用基准测试，但它是第一个解决广泛COPs的多任务模型。最后，我们通过在一系列新问题上进行微调，展示了GOAL强大的迁移学习能力。我们的代码可在[https://github.com/naver/goal-co/]上找到。

论文及项目相关链接

PDF Accepted to ICLR 2025

Summary

机器学习启发式算法在解决组合优化问题上表现出强大的性能，但它们通常针对每个特定问题都需要特定的神经网络模型并进行训练。本文提出一种通用模型GOAL（通用组合优化学习者），能够高效解决多种组合优化问题，并可通过微调适应新问题的求解。GOAL采用单一主干网络加上针对输入输出处理的轻量化问题特定适配器，主干网络基于新型混合注意力块，能够处理具有任意组合节点、边缘和实例级特征的图形定义问题。对于涉及不同类型的节点或边缘的问题，采用新型多类型变换器架构处理。实验证明GOAL在多任务环境下的表现与专门基准模型相当，且具有良好的迁移学习能力。代码已开源在[https://github.com/naver/goal-co/]上。

Key Takeaways

机器学习启发式算法在处理组合优化问题上展现了出色的性能。
现有模型针对每个问题需要单独的神经网络模型进行训练。
GOAL是一种通用模型，可高效解决多种组合优化问题。
GOAL具有微调能力，能适应解决新问题。
GOAL的主干网络采用新型混合注意力块技术，能处理复杂图形问题。
对于涉及不同类型的问题，GOAL采用多类型变换器架构进行处理。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-02-27/Agent/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

Agent

Few-Shot

Few-Shot 方向最新论文已更新，请持续关注 Update in 2025-02-27 Multi-Perspective Data Augmentation for Few-shot Object Detection

2025-02-27 Few-Shot

Few-Shot

LLM

LLM 方向最新论文已更新，请持续关注 Update in 2025-02-27 LLM-Based Design Pattern Detection

2025-02-27 LLM

LLM

Agent

2025-02-27 更新

AgentRM: Enhancing Agent Generalization with Reward Modeling

WebGames: Challenging General-Purpose Web-Browsing AI Agents

RefuteBench 2.0 – Agentic Benchmark for Dynamic Evaluation of LLM Responses to Refutation Instruction

ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents

LLM Knows Geometry Better than Algebra: Numerical Understanding of LLM-Based Agents in A Trading Arena

Assessing Large Language Models in Agentic Multilingual National Bias

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models

Safe Multi-Agent Navigation guided by Goal-Conditioned Safe Reinforcement Learning

Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration

SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

AFlow: Automating Agentic Workflow Generation

Learning Multi-agent Multi-machine Tending by Mobile Robots

Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework

GOAL: A Generalist Combinatorial Optimization Agent Learner