发布日期: 2025-05-03

更新日期: 2025-05-25

文章字数: 6.1k

阅读时长: 24 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-05-03 更新

Visual Test-time Scaling for GUI Agent Grounding

Authors:Tiange Luo, Lajanugen Logeswaran, Justin Johnson, Honglak Lee

We introduce RegionFocus, a visual test-time scaling approach for Vision Language Model Agents. Understanding webpages is challenging due to the visual complexity of GUI images and the large number of interface elements, making accurate action selection difficult. Our approach dynamically zooms in on relevant regions, reducing background clutter and improving grounding accuracy. To support this process, we propose an image-as-map mechanism that visualizes key landmarks at each step, providing a transparent action record and enables the agent to effectively choose among action candidates. Even with a simple region selection strategy, we observe significant performance gains of 28+% on Screenspot-pro and 24+% on WebVoyager benchmarks on top of two state-of-the-art open vision language model agents, UI-TARS and Qwen2.5-VL, highlighting the effectiveness of visual test-time scaling in interactive settings. We achieve a new state-of-the-art grounding performance of 61.6% on the ScreenSpot-Pro benchmark by applying RegionFocus to a Qwen2.5-VL-72B model. Our code will be released publicly at https://github.com/tiangeluo/RegionFocus.

我们介绍了RegionFocus，这是一种用于视觉语言模型代理的视觉测试时间缩放方法。理解网页由于GUI图像的视觉复杂性和大量的界面元素而具有挑战性，这使得准确的动作选择变得困难。我们的方法动态放大相关区域，减少背景杂乱，提高定位精度。为了支持这一过程，我们提出了一种图像地图机制，每一步都可视化关键地标，提供透明的动作记录，使代理能够有效地在动作候选者中进行选择。即使使用简单的区域选择策略，我们在Screenspot-pro和WebVoyager基准测试上也观察到显著的性能提升，分别超过28%和24%，这凸显了视觉测试时间缩放方法在交互式环境中的有效性。通过将RegionFocus应用于Qwen2.5-VL-72B模型，我们在ScreenSpot-Pro基准测试上取得了61.6%的最新最佳定位性能。我们的代码将在https://github.com/tiangeluo/RegionFocus上公开发布。

论文及项目相关链接

PDF

Summary

RegionFocus是一种视觉测试时间缩放方法，用于视觉语言模型代理。该研究通过动态放大相关区域，减少背景干扰，提高定位精度，以解决网页理解的视觉复杂性问题和大量界面元素带来的准确动作选择困难。该研究还提出了一种图像地图机制，可视化关键地标，为代理提供透明的动作记录，并支持其在动作候选者中进行有效选择。在Screenspot-pro和WebVoyager基准测试上，RegionFocus在两种最先进的开放视觉语言模型代理UI-TARS和Qwen2.5-VL上实现了显著的性能提升。通过应用RegionFocus到Qwen2.5-VL-72B模型，该研究在ScreenSpot-Pro基准测试上实现了61.6%的最新定位性能。

Key Takeaways

RegionFocus是一种视觉测试时间缩放方法，用于提高视觉语言模型在处理复杂网页时的性能。
通过动态放大相关区域，RegionFocus可以减少背景干扰，提高定位精度。
研究提出了一种图像地图机制，可视化关键地标，为代理提供透明的动作记录。
该方法支持代理在动作候选者中进行有效选择，从而提高动作选择的准确性。
在Screenspot-pro和WebVoyager基准测试上，RegionFocus显著提升了性能，分别达到了28%+和24%+的提升。
RegionFocus在Qwen2.5-VL-72B模型上的应用在ScreenSpot-Pro基准测试上达到了61.6%的新定位性能。

Cool Papers

点此查看论文截图

Large Language Models as AI Agents for Digital Atoms and Molecules: Catalyzing a New Era in Computational Biophysics

Authors:Yijie Xia, Xiaohan Lin, Zicheng Ma, Jinyuan Hu, Yanheng Li, Zhaoxin Xie, Hao Li, Li Yang, Zhiqiang Zhao, Lijiang Yang, Zhenyu Chen, Yi Qin Gao

In computational biophysics, where molecular data is expanding rapidly and system complexity is increasing exponentially, large language models (LLMs) and agent-based systems are fundamentally reshaping the field. This perspective article examines the recent advances at the intersection of LLMs, intelligent agents, and scientific computation, with a focus on biophysical computation. Building on these advancements, we introduce ADAM (Agent for Digital Atoms and Molecules), an innovative multi-agent LLM-based framework. ADAM employs cutting-edge AI architectures to reshape scientific workflows through a modular design. It adopts a hybrid neural-symbolic architecture that combines LLM-driven semantic tools with deterministic symbolic computations. Moreover, its ADAM Tool Protocol (ATP) enables asynchronous, database-centric tool orchestration, fostering community-driven extensibility. Despite the significant progress made, ongoing challenges call for further efforts in establishing benchmarking standards, optimizing foundational models and agents, and building an open collaborative ecosystem. ADAM is accessible at https://sidereus-ai.com.

在计算生物物理学领域，分子数据迅速扩展，系统复杂性呈指数级增长，大型语言模型（LLMs）和基于智能主体的系统正在从根本上重塑该领域。本文着眼于LLMs、智能主体和科学计算之间的交叉点，探讨最新进展。在此基础上，我们介绍了ADAM（数字原子和分子代理），这是一个创新的多智能主体的大型语言模型框架。ADAM采用前沿的人工智能架构，通过模块化设计重塑科学工作流程。它采用混合神经符号架构，将大型语言模型驱动的语义工具与确定性符号计算相结合。此外，其ADAM工具协议（ATP）支持异步、以数据库为中心的工具编排，促进了以社区为基础的扩展性。尽管取得了重大进展，但仍面临建立基准测试标准、优化基础模型和智能主体以及建立开放协作生态系统等挑战。ADAM可通过https://sidereus-ai.com访问。

论文及项目相关链接

PDF 24 pages, 3 figures, 2 tables

Summary

大语言模型（LLMs）和基于智能体的系统在计算生物物理学领域正在带来根本性的变革。本文介绍了LLMs、智能体和科学计算交叉领域的最新进展，重点关注生物物理计算。在此基础上，提出了ADAM（用于数字原子和分子的智能体）这一创新的多智能体LLM框架，采用先进的AI架构重塑科学工作流程。ADAM工具协议（ATP）支持异步、以数据库为中心的工具编排，促进社区驱动的扩展性。但仍需进一步努力建立基准测试标准、优化基础模型和智能体并建立开放协作生态系统。可通过https://sidereus-ai.com访问ADAM。

Key Takeaways

大语言模型（LLMs）和基于智能体的系统正在计算生物物理学领域带来变革。
文章介绍了LLMs与智能体在科学计算领域的最新进展，特别是生物物理计算。
ADAM是一个创新的多智能体LLM框架，用于重塑科学工作流程。
ADAM采用先进的AI架构和混合神经符号架构，结合LLM驱动的语义工具和确定性符号计算。
ADAM工具协议（ATP）支持异步、以数据库为中心的工具编排，促进社区扩展性。
计算生物物理学领域仍存在挑战，需要建立基准测试标准、优化基础模型和智能体。

Cool Papers

点此查看论文截图

Empowering Agentic Video Analytics Systems with Video Language Models

Authors:Yuxuan Yan, Shiqi Jiang, Ting Cao, Yifan Yang, Qianqian Yang, Yuanchao Shu, Yuqing Yang, Lili Qiu

AI-driven video analytics has become increasingly pivotal across diverse domains. However, existing systems are often constrained to specific, predefined tasks, limiting their adaptability in open-ended analytical scenarios. The recent emergence of Video-Language Models (VLMs) as transformative technologies offers significant potential for enabling open-ended video understanding, reasoning, and analytics. Nevertheless, their limited context windows present challenges when processing ultra-long video content, which is prevalent in real-world applications. To address this, we introduce AVA, a VLM-powered system designed for open-ended, advanced video analytics. AVA incorporates two key innovations: (1) the near real-time construction of Event Knowledge Graphs (EKGs) for efficient indexing of long or continuous video streams, and (2) an agentic retrieval-generation mechanism that leverages EKGs to handle complex and diverse queries. Comprehensive evaluations on public benchmarks, LVBench and VideoMME-Long, demonstrate that AVA achieves state-of-the-art performance, attaining 62.3% and 64.1% accuracy, respectively, significantly surpassing existing VLM and video Retrieval-Augmented Generation (RAG) systems. Furthermore, to evaluate video analytics in ultra-long and open-world video scenarios, we introduce a new benchmark, AVA-100. This benchmark comprises 8 videos, each exceeding 10 hours in duration, along with 120 manually annotated, diverse, and complex question-answer pairs. On AVA-100, AVA achieves top-tier performance with an accuracy of 75.8%.

AI驱动的视频分析在不同领域变得越来越重要。然而，现有系统通常局限于特定的预定义任务，限制了它们在开放式分析场景中的适应性。最近出现的视频语言模型（VLM）作为变革性技术，为开放式视频理解、推理和分析提供了巨大潜力。然而，它们在处理现实世界应用中普遍存在的超长视频内容时，有限的上下文窗口带来了挑战。为了解决这一问题，我们引入了AVA，这是一个由VLM驱动的系统，旨在进行开放式高级视频分析。AVA有两个关键创新点：（1）事件知识图（EKG）的近实时构建，用于对长或连续视频流进行高效索引；（2）利用EKG处理复杂和多样化查询的代理检索生成机制。在公共基准测试LVBench和VideoMME-Long上的综合评估表明，AVA达到了最先进的性能，准确率分别为62.3%和64.1%，显著超越了现有的VLM和视频检索增强生成（RAG）系统。此外，为了评估超长和开放世界视频场景中的视频分析，我们引入了新的基准测试AVA-100。该基准测试包含8个视频，每个持续时间超过10小时，以及120个手动注释的多样化、复杂的问答对。在AVA-100上，AVA达到了顶级性能，准确率为75.8%。

论文及项目相关链接

PDF 15 pages

Summary
AI驱动的视讯分析在各个领域愈发重要。然而，现有系统往往局限于特定预设任务，难以适应开放式分析场景。新兴的视频语言模型（VLM）技术为开放式视讯理解、推理和分析提供了巨大潜力。为处理现实应用中常见的超长视讯内容，我们推出AVA系统，以VLM为动力，适用于开放式高级视讯分析。AVA包含两大创新：一是构建事件知识图谱（EKG）实现近实时索引长或连续视讯流，二是利用EKG处理复杂多样查询的检索生成机制。在公开基准测试LVBench和视频MME-Long上的全面评估显示，AVA达到业界顶尖性能，准确率分别为62.3%和64.1%，显著超越现有VLM和视频检索增强生成系统。为评估超长和开放式视讯场景中的视讯分析，我们推出新基准测试AVA-100。该测试包含8个时长超过10小时的视讯和120组手动标注的复杂问题答案。AVA在AVA-100上表现优异，准确率为75.8%。

Key Takeaways

AI驱动的视讯分析在多个领域中的重要性日益凸显。
现有系统主要局限于特定任务，缺乏在开放式场景中的适应性。
视频语言模型（VLM）技术为开放式视讯理解、推理和分析提供了潜力。
处理超长视讯内容的挑战需要创新的解决方案。
AVA系统通过引入事件知识图谱（EKG）和检索生成机制应对这些挑战。
AVA在公开基准测试上表现优异，达到业界顶尖水平。

Cool Papers

点此查看论文截图

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Authors:Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, Qingyun Wu

Failure attribution in LLM multi-agent systems-identifying the agent and step responsible for task failures-provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems. To support this initiative, we introduce the Who&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps. Using the Who&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5% accuracy in identifying failure-responsible agents but only 14.2% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task’s complexity and the need for further research in this area. Code and dataset are available at https://github.com/mingyin1/Agents_Failure_Attribution

在大规模语言模型（LLM）多智能体系统中的故障归属问题，即确定导致任务失败的智能体和步骤，为系统调试提供了关键线索。然而，这一领域仍然缺乏足够的探索，并且需要大量的人工操作。在本文中，我们提出并定义了一个新的研究领域：自动化的大规模语言模型多智能体系统失败归属。为了支持这一倡议，我们引入了Who&When数据集，该数据集包含来自127个LLM多智能体系统的详细失败日志，其中精细的注释将失败与特定的智能体和关键的错误步骤联系起来。利用Who&When数据集，我们开发并评估了三种自动化失败归属方法，总结了它们各自的优缺点。最好的方法在识别负责失败的智能体方面达到了53.5%的准确率，但在确定失败步骤方面只有14.2%，部分方法的性能甚至低于随机水平。即使是最新技术推理模型，如OpenAI o1和DeepSeek R1，也无法实现实际可用性。这些结果突显了这项任务的复杂性，以及对该领域进一步研究的必要性。相关代码和数据集可通过https://github.com/mingyin1/Agents_Failure_Attribution访问。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）多智能体系统中的故障归属问题对于系统调试至关重要，但相关研究仍显不足且需要大量人工操作。本文提出并探讨了一个新的研究领域：自动化故障归属分析在LLM多智能体系统中的应用。为了支持该研究，我们引入了Who&When数据集，该数据集包含来自127个LLM多智能体系统的详细失败日志，并对失败与特定智能体和关键错误步骤之间的关系进行了精细标注。基于该数据集，我们开发并评估了三种自动化故障归属方法，并总结了它们的优缺点。最佳方法的准确率为53.5%，能够识别出导致失败的智能体，但在确定失败步骤方面仅为14.2%，某些方法的性能甚至低于随机水平。即使是最新的一流推理模型，如OpenAI o1和DeepSeek R1，也无法实现实际可用性。这凸显了任务的复杂性以及对进一步研究的必要性。数据集和代码已公开发布。

Key Takeaways

LLM多智能体系统的故障归属对于系统调试至关重要。
自动化故障归属分析是一个新兴且重要的研究领域。
Who&When数据集包含详细的失败日志，对失败与智能体和错误步骤的关系进行了精细标注，为自动化故障归属分析提供了支持。
自动化故障归属方法存在挑战，即使是最佳方法也无法准确识别所有失败步骤。
当前一流推理模型在自动化故障归属分析方面的性能有待提高。
故障归属分析的复杂性凸显了进一步研究的必要性。

Cool Papers

点此查看论文截图

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks

Authors:Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri

Web navigation AI agents use language-and-vision foundation models to enhance productivity but these models are known to be susceptible to indirect prompt injections that get them to follow instructions different from the legitimate user’s. Existing explorations of this threat applied to web agents often focus on a single isolated adversarial goal, test with injected instructions that are either too easy or not truly malicious, and often give the adversary unreasonable access. In order to better focus adversarial research, we construct a new benchmark called WASP (Web Agent Security against Prompt injection attacks) that introduces realistic web agent hijacking objectives and an isolated environment to test them in that does not affect real users or the live web. As part of WASP, we also develop baseline attacks against popular web agentic systems (VisualWebArena, Claude Computer Use, etc.) instantiated with various state-of-the-art models. Our evaluation shows that even AI agents backed by models with advanced reasoning capabilities and by models with instruction hierarchy mitigations are susceptible to low-effort human-written prompt injections. However, the realistic objectives in WASP also allow us to observe that agents are currently not capable enough to complete the goals of attackers end-to-end. Agents begin executing the adversarial instruction between 16 and 86% of the time but only achieve the goal between 0 and 17% of the time. Based on these findings, we argue that adversarial researchers should demonstrate stronger attacks that more consistently maintain control over the agent given realistic constraints on the adversary’s power.

网络导航AI代理利用语言和视觉基础模型来提高生产力，但这些模型已知容易受间接提示注入的影响，导致它们遵循与合法用户不同的指令。现有针对网络代理的威胁探索通常侧重于单一的孤立对抗目标，使用注入的指令过于简单或并不真正恶意，并且经常给对手不合理的访问权限。为了更好地集中对抗性研究，我们构建了一个新的基准测试WASP（针对提示注入攻击的Web代理安全性），它引入了现实的网络代理劫持目标和一个孤立的测试环境，不影响真实用户或实时网络。作为WASP的一部分，我们还针对流行的网络代理系统（如VisualWebArena、Claude Computer Use等）使用各种最新模型开发了基线攻击。我们的评估表明，即使是背后支持具有先进推理能力和带有指令层次缓解措施的模型的AI代理，也容易受到低强度的人写提示注入的影响。然而，WASP中的现实目标也让我们观察到，目前代理还不足以完全满足攻击者的目标。代理开始执行对抗指令的时间在16%至86%之间，但仅在0%至17%的情况下实现了攻击者的目标。基于这些发现，我们认为对抗性研究人员应该展示更强的攻击，在给定对手实际能力约束的情况下，更持续地控制代理。

Summary

本文介绍了Web导航AI代理使用语言与视觉基础模型以提高生产力，但它们易受间接提示注入攻击的影响。为了更专注于对抗性研究，作者构建了一个新的基准测试WASP，引入了现实主义的Web代理劫持目标和一个隔离的环境进行测试，不影响真实用户或实时网络。作者评估了一些流行的Web代理系统的基础攻击，发现即使背后有先进的推理能力和指令层次缓解的模型也易受低努力的人类编写的提示注入攻击的影响。然而，WASP中的现实目标也表明，代理目前尚不足以完全完成攻击者的目标。代理执行对抗指令的时间在16%~86%之间，但仅能在0%~17%的情况下达到攻击目标。因此，作者主张对抗性研究人员应展示更强的攻击能力，在给定实际约束的情况下更持续地控制代理。

Key Takeaways

Web导航AI代理使用语言与视觉基础模型提升生产力，但易受间接提示注入攻击的影响。
为了更准确地研究这种威胁，作者建立了一个新的基准测试WASP。
WASP引入了现实主义的Web代理劫持目标和一个隔离的环境进行测试。
对流行的Web代理系统的基础攻击评估显示，即使是先进的模型也易受低努力的人类编写的提示注入攻击的影响。
代理在执行对抗指令时表现出一定的脆弱性，执行时间在16%~86%之间。
然而，代理完成攻击者目标的能力有限，成功率在0%~17%之间。

Cool Papers

点此查看论文截图

Towards Optimal Circuit Generation: Multi-Agent Collaboration Meets Collective Intelligence

Authors:Haiyan Qin, Jiahao Feng, Xiaotong Feng, Wei W. Xing, Wang Kang

Large language models (LLMs) have transformed code generation, yet their application in hardware design produces gate counts 38%–1075% higher than human designs. We present CircuitMind, a multi-agent framework that achieves human-competitive efficiency through three key innovations: syntax locking (constraining generation to basic logic gates), retrieval-augmented generation (enabling knowledge-driven design), and dual-reward optimization (balancing correctness with efficiency). To evaluate our approach, we introduce TC-Bench, the first gate-level benchmark harnessing collective intelligence from the TuringComplete ecosystem – a competitive circuit design platform with hundreds of thousands of players. Experiments show CircuitMind enables 55.6% of model implementations to match or exceed top-tier human experts in composite efficiency metrics. Most remarkably, our framework elevates the 14B Phi-4 model to outperform both GPT-4o mini and Gemini 2.0 Flash, achieving efficiency comparable to the top 25% of human experts without requiring specialized training. These innovations establish a new paradigm for hardware optimization where collaborative AI systems leverage collective human expertise to achieve optimal circuit designs. Our model, data, and code are open-source at https://github.com/BUAA-CLab/CircuitMind.

大规模语言模型（LLM）已经改变了代码生成，然而它们在硬件设计中的应用产生的门数比人类设计高出38%至1075%。我们提出了CircuitMind，这是一个多代理框架，通过三个关键创新实现了与人类竞争的效率：语法锁定（将生成限制在基本逻辑门内）、检索增强生成（实现知识驱动设计）和双重奖励优化（平衡正确性与效率）。为了评估我们的方法，我们引入了TC-Bench，这是第一个利用TuringComplete生态系统集体智慧的门级基准测试——一个拥有数十万玩家的竞争性电路设计平台。实验表明，CircuitMind使55.6%的模型实现在复合效率指标上匹配或超过顶级人类专家。最值得注意的是，我们的框架提升了14B Phi-4模型的表现，使其超越了GPT-4o mini和Gemini 2.0 Flash，在不需要专门培训的情况下，实现了与人类专家前25%相当的效率。这些创新为硬件优化建立了一个新范式，其中协作AI系统利用人类的集体专业知识来实现最优电路设计。我们的模型、数据和代码均在https://github.com/BUAA-CLab/CircuitMind开源。

论文及项目相关链接

PDF 9 pages, 6 figures

Summary
电路生成领域的大型语言模型（LLMs）在硬件设计中的应用会产生较高的门计数，本研究提出CircuitMind框架，通过语法锁定、检索增强生成和双奖励优化等技术实现人类竞争的效率，引入TC-Bench基准测试平台进行评估，实验表明CircuitMind框架能够实现与顶尖人类专家相当的复合效率指标。

Key Takeaways