LLM

Published: 2025-09-13

Updated: 2025-10-07

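⚠️ All summaries below were produced by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them for serious academic work. They are only suitable for a first-pass screening before reading the papers.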

2025-09-13 Update

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Authors:Akshit Sinha, Arvindh Arun, Shashwat Goel, Steffen Staab, Jonas Geiping

Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often stems from the length of task an agent can complete. We start this work by observing the simple but counterintuitive fact that marginal gains in single-step accuracy can compound into exponential improvements in the length of a task a model can successfully complete. Then, we argue that failures of LLMs when simple tasks are made longer arise from mistakes in execution, rather than an inability to reason. We propose isolating execution capability, by explicitly providing the knowledge and plan needed to solve a long-horizon task. We find that larger models can correctly execute significantly more turns even when small models have 100% single-turn accuracy. We observe that the per-step accuracy of models degrades as the number of steps increases. This is not just due to long-context limitations – curiously, we observe a self-conditioning effect – models become more likely to make mistakes when the context contains their errors from prior turns. Self-conditioning does not reduce by just scaling the model size. In contrast, recent thinking models do not self-condition, and can also execute much longer tasks in a single turn. We conclude by benchmarking frontier thinking models on the length of task they can execute in a single turn. Overall, by focusing on the ability to execute, we hope to reconcile debates on how LLMs can solve complex reasoning problems yet fail at simple tasks when made longer, and highlight the massive benefits of scaling model size and sequential test-time compute for long-horizon tasks.

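Paper and project links

PDF

Summary

Does continued scaling of large language models (LLMs) yield diminishing returns? Real-world value often depends on the length of task a model can complete. The paper observes that marginal gains in single-step accuracy can compound into exponential improvements in the length of task completed, and argues that failures on longer versions of simple tasks stem from execution mistakes rather than an inability to reason. By explicitly providing the required knowledge and plan to isolate execution capability, the authors find that larger models correctly execute significantly more turns, even when small models reach 100% single-turn accuracy. Per-step accuracy degrades as the number of steps grows, and long-context limitations are not the only cause: the authors observe a self-conditioning effect, in which models become more likely to err when the context contains their own errors from prior turns. Scaling model size alone does not reduce self-conditioning. Recent thinking models, by contrast, do not self-condition and can execute much longer tasks in a single turn; the paper closes by benchmarking frontier thinking models on the task length they can execute in a single turn. By focusing on execution, the work aims to reconcile how LLMs can solve complex reasoning problems yet fail at simple tasks once they are made longer, and it highlights the benefits of scaling model size and sequential test-time compute for long-horizon tasks.

Key Takeaways

The real-world value of an LLM often stems from the length of task it can complete.
Marginal gains in single-step accuracy can compound into exponential improvements in achievable task length.
LLM failures on long tasks arise mainly from execution mistakes rather than an inability to reason.
With the knowledge and plan provided to isolate execution capability, larger models correctly execute significantly more turns.
Per-step accuracy degrades as the number of steps increases, partly because of a self-conditioning effect.
Scaling model size alone does not reduce self-conditioning.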
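
To see why small single-step gains compound, here is a worked sketch under a simplifying assumption that is not the paper's measured behaviour: every step succeeds independently with the same accuracy p, so a task of H steps succeeds with probability p^H, and the longest task completed at a target success rate s is H = ln(s) / ln(p). The function name `horizon_length` and the 50% default target are illustrative choices. The abstract itself reports that per-step accuracy degrades as steps accumulate, so real horizons would fall below this idealized curve.

```python
import math


def horizon_length(step_accuracy: float, success_rate: float = 0.5) -> float:
    """Longest task length H with step_accuracy ** H >= success_rate.

    Assumes independent steps with constant accuracy (an idealization).
    """
    if not 0.0 < step_accuracy < 1.0:
        raise ValueError("step_accuracy must be strictly between 0 and 1")
    if not 0.0 < success_rate < 1.0:
        raise ValueError("success_rate must be strictly between 0 and 1")
    return math.log(success_rate) / math.log(step_accuracy)


# Compare three hypothetical models that differ only in single-step accuracy.
for p in (0.90, 0.99, 0.999):
    print(f"step accuracy {p:.3f} -> horizon of about {horizon_length(p):.0f} steps")
```

Under these assumptions, moving step accuracy from 0.99 to 0.999 multiplies the horizon by roughly ten, which is the compounding effect the abstract describes.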

Measuring Epistemic Humility in Multimodal Large Language Models

Authors:Bingkui Tong, Jiaer Xia, Sifeng Shang, Kaiyang Zhou

Hallucinations in multimodal large language models (MLLMs) – where the model generates content inconsistent with the input image – pose significant risks in real-world applications, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks primarily test recognition accuracy, i.e., evaluating whether models can select the correct answer among distractors. This overlooks an equally critical capability for trustworthy AI: recognizing when none of the provided options are correct, a behavior reflecting epistemic humility. We present HumbleBench, a new hallucination benchmark designed to evaluate MLLMs’ ability to reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. Built from a panoptic scene graph dataset, we leverage fine-grained scene graph annotations to extract ground-truth entities and relations, and prompt GPT-4-Turbo to generate multiple-choice questions, followed by a rigorous manual filtering process. Each question includes a “None of the above” option, requiring models not only to recognize correct visual information but also to identify when no provided answer is valid. We evaluate a variety of state-of-the-art MLLMs – including both general-purpose and specialized reasoning models – on HumbleBench and share valuable findings and insights with the community. By incorporating explicit false-option rejection, HumbleBench fills a key gap in current evaluation suites, providing a more realistic measure of MLLM reliability in safety-critical settings. Our code and dataset are released publicly and can be accessed at https://github.com/maifoundations/HumbleBench.

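Paper and project links

PDF

Summary

Hallucinations in multimodal large language models (MLLMs), where the model generates content inconsistent with the input image, pose significant real-world risks, from misinformation in visual question answering to unsafe errors in decision-making. Existing benchmarks mainly test recognition accuracy, that is, whether a model can pick the correct answer among distractors. This overlooks another capability that is critical for trustworthy AI: recognizing when none of the provided options is correct, a behavior that reflects epistemic humility. The authors present HumbleBench, a hallucination benchmark that evaluates how well MLLMs reject plausible but incorrect answers across three hallucination types: object, relation, and attribute. HumbleBench is built from a panoptic scene graph dataset: fine-grained scene graph annotations supply ground-truth entities and relations, GPT-4-Turbo is prompted to generate multiple-choice questions, and a rigorous manual filtering pass follows. Every question includes a "None of the above" option, so a model must both recognize correct visual information and identify when no provided answer is valid. A variety of state-of-the-art MLLMs, both general-purpose and specialized reasoning models, are evaluated on HumbleBench. By incorporating explicit false-option rejection, HumbleBench fills a gap in current evaluation suites and gives a more realistic measure of MLLM reliability in safety-critical settings. Code and dataset are publicly available at https://github.com/maifoundations/HumbleBench.

Key Takeaways

Hallucinations in MLLMs pose significant risks in real-world applications.
Existing benchmarks focus on recognition accuracy and overlook whether a model can recognize that none of the provided options is correct.
HumbleBench is a new hallucination benchmark that evaluates MLLMs' ability to reject plausible but incorrect answers across object, relation, and attribute hallucinations.
HumbleBench is built from a panoptic scene graph dataset using fine-grained scene graph annotations.
GPT-4-Turbo is prompted to generate the multiple-choice questions, which are then manually filtered; each question includes a "None of the above" option.
A variety of state-of-the-art MLLMs, both general-purpose and specialized reasoning models, are evaluated on HumbleBench.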
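
The question format described above can be sketched as follows. This is a minimal illustration, not the HumbleBench code or data format: the `Question` class, `make_question`, `score`, and the two sample items are all hypothetical. It shows why a separate accuracy on "None of the above" items is informative, since a model can score well overall while never rejecting a plausible distractor.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

NOTA = "None of the above"


@dataclass
class Question:
    prompt: str
    options: List[str]  # candidate answers with NOTA appended last
    answer: str         # ground-truth option text


def make_question(prompt: str, candidates: Sequence[str],
                  correct: Optional[str] = None) -> Question:
    """Build an item; when `correct` is None, NOTA is the right answer."""
    if correct is not None and correct not in candidates:
        raise ValueError("correct answer must be one of the candidates")
    options = list(candidates) + [NOTA]
    return Question(prompt, options, correct if correct is not None else NOTA)


def score(questions: Sequence[Question],
          predictions: Sequence[str]) -> Tuple[float, float]:
    """Return (overall accuracy, accuracy restricted to NOTA items)."""
    hits = [q.answer == p for q, p in zip(questions, predictions)]
    nota_hits = [h for q, h in zip(questions, hits) if q.answer == NOTA]
    overall = sum(hits) / len(hits)
    nota_acc = sum(nota_hits) / len(nota_hits) if nota_hits else float("nan")
    return overall, nota_acc


questions = [
    make_question("What is on the table?", ["a cup", "a laptop", "a vase"], correct="a cup"),
    make_question("What colour is the car?", ["red", "blue", "green"]),  # true colour not listed
]
predictions = ["a cup", "red"]  # the second pick is plausible but wrong
print(score(questions, predictions))
```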

Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems

Authors:Minghang Zhu, Zhengliang Shi, Zhiwei Xu, Shiguang Wu, Lingjie Wang, Pengjie Ren, Zhaochun Ren, Zhumin Chen

The advancement of large language models (LLMs) has enabled the construction of multi-agent systems to solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods typically fine-tune these agents independently, leading to capability gaps among them with poor coordination. To address this, we propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agent collaboration through iterative alignment. MOAT alternates between two key stages: (1) Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent; and (2) Grounding Agent Improving, which fine-tunes the grounding agent using diverse subgoal-action pairs generated by the agent itself to enhance its generalization capability. Theoretical analysis proves that MOAT ensures a non-decreasing and progressively convergent training process. Experiments across six benchmarks demonstrate that MOAT outperforms state-of-the-art baselines, achieving average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.

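Paper and project links

PDF EMNLP 2025 Findings

Summary

Advances in LLMs have made it possible to build multi-agent systems that solve complex tasks by dividing responsibilities among specialized agents, such as a planning agent for subgoal generation and a grounding agent for executing tool-use actions. Most existing methods fine-tune these agents independently, which leaves capability gaps between them and poor coordination. The authors propose MOAT, a Multi-Agent Joint Alignment Tuning framework that improves agent collaboration through iterative alignment. MOAT alternates between two stages: Planning Agent Alignment, which optimizes the planning agent to generate subgoal sequences that better guide the grounding agent, and Grounding Agent Improving, which fine-tunes the grounding agent on diverse subgoal-action pairs generated by the agent itself to improve its generalization. Theoretical analysis proves that the training process is non-decreasing and progressively convergent. Across six benchmarks, MOAT outperforms state-of-the-art baselines, with average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks.

Key Takeaways

Advances in LLMs enable multi-agent systems that solve complex tasks with specialized agents.
Existing methods fine-tune the agents independently, which causes capability gaps and poor coordination.
MOAT is a multi-agent joint alignment tuning framework that improves agent collaboration through iterative alignment.
MOAT alternates between two stages: Planning Agent Alignment and Grounding Agent Improving.
Planning Agent Alignment optimizes the planning agent to generate subgoal sequences that better guide the grounding agent.
Grounding Agent Improving fine-tunes the grounding agent on diverse subgoal-action pairs generated by the agent itself to improve generalization.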

LAVA: Language Model Assisted Verbal Autopsy for Cause-of-Death Determination

Authors:Yiqun T. Chen, Tyler H. McCormick, Li Liu, Abhirup Datta

Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines Large Language Models (LLMs) with traditional algorithmic approaches and embedding-based classification for improved cause-of-death prediction. Using the Population Health Metrics Research Consortium (PHMRC) dataset across three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), we evaluate multiple approaches: GPT-5 predictions, LCVA baseline, text embeddings, and meta-learner ensembles. Our results demonstrate that GPT-5 achieves the highest individual performance with average test site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. Our findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.

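Paper and project links

PDF

Summary

Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings where medical certification is unavailable. This study presents LA-VA, a proof-of-concept pipeline that combines LLMs with traditional algorithmic approaches and embedding-based classification to improve cause-of-death prediction. On the Population Health Metrics Research Consortium (PHMRC) dataset, split into three age categories (Adult: 7,580; Child: 1,960; Neonate: 2,438), the authors evaluate GPT-5 predictions, an LCVA baseline, text embeddings, and meta-learner ensembles. GPT-5 achieves the highest individual performance, with average test-site accuracies of 48.6% (Adult), 50.5% (Child), and 53.5% (Neonate), outperforming traditional statistical machine learning baselines by 5-10%. The findings suggest that simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy, with important implications for global health surveillance in low-resource settings.

Key Takeaways

Verbal autopsy (VA) is a critical tool for estimating causes of death in resource-limited settings.
LA-VA is a proof-of-concept pipeline that combines LLMs, traditional algorithmic approaches, and embedding-based classification to improve cause-of-death prediction.
The evaluation uses the PHMRC dataset across three age categories: adult, child, and neonate.
GPT-5 achieves the highest individual performance, outperforming traditional statistical machine learning baselines by 5-10%.
Simple off-the-shelf LLM-assisted approaches could substantially improve verbal autopsy accuracy.
The results have important implications for global health surveillance, especially in low-resource settings.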

VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models: Methods and Results

Authors:Hanwei Zhu, Haoning Wu, Zicheng Zhang, Lingyu Zhu, Yixuan Li, Peilin Chen, Shiqi Wang, Chris Wei Zhou, Linhan Cao, Wei Sun, Xiangyang Zhu, Weixia Zhang, Yucheng Zhu, Jing Liu, Dandan Zhu, Guangtao Zhai, Xiongkuo Min, Zhichao Zhang, Xinyue Li, Shubo Xu, Anh Dao, Yifan Li, Hongyuan Yu, Jiaojiao Yi, Yiding Tian, Yupeng Wu, Feiran Sun, Lijuan Liao, Song Jiang

This paper presents a summary of the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to perform open-ended and detailed reasoning about visual quality differences across multiple images. To this end, the competition introduces a novel benchmark comprising thousands of coarse-to-fine grained visual quality comparison tasks, spanning single images, pairs, and multi-image groups. Each task requires models to provide accurate quality judgments. The competition emphasizes holistic evaluation protocols, including 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, with five models demonstrating the emerging capabilities of instruction-tuned LMMs on quality assessment. This challenge marks a significant step toward open-domain visual quality reasoning and comparison and serves as a catalyst for future research on interpretable and human-aligned quality evaluation systems.

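Paper and project links

PDF ICCV VQualA Workshop 2025

Summary

This paper summarizes the VQualA 2025 Challenge on Visual Quality Comparison for Large Multimodal Models (LMMs), hosted as part of the ICCV 2025 Workshop on Visual Quality Assessment. The challenge aims to evaluate and enhance the ability of state-of-the-art LMMs to reason in an open-ended, detailed way about visual quality differences across multiple images. It introduces a new benchmark of thousands of coarse-to-fine-grained visual quality comparison tasks spanning single images, pairs, and multi-image groups, each requiring an accurate quality judgment. Evaluation is holistic and includes 2AFC-based binary preference and multi-choice questions (MCQs). Around 100 participants submitted entries, and five models demonstrated the emerging capabilities of instruction-tuned LMMs on quality assessment. The challenge marks a significant step toward open-domain visual quality reasoning and comparison and is intended as a catalyst for future research on interpretable, human-aligned quality evaluation systems.

Key Takeaways

The VQualA 2025 Challenge evaluates how well LMMs reason about visual quality differences across multiple images.
The challenge introduces a new benchmark of coarse-to-fine-grained visual quality comparison tasks.
Evaluation is holistic and includes 2AFC-based binary preference and multi-choice questions.
Around 100 participants submitted entries.
Five models demonstrated the emerging capabilities of instruction-tuned LMMs on quality assessment.
The challenge is a significant step toward open-domain visual quality reasoning and comparison.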

MR-UIE: Multi-Perspective Reasoning with Reinforcement Learning for Universal Information Extraction

Authors:Zhongqiu Li, Shiquan Wang, Ruiyu Fang, Mengjiao Bao, Zhenhe Wu, Shuangyong Song, Yongxiang Li, Zhongjiang He

Large language models (LLMs) demonstrate robust capabilities across diverse research domains. However, their performance in universal information extraction (UIE) remains insufficient, especially when tackling structured output scenarios that involve complex schema descriptions and require multi-step reasoning. While existing approaches enhance the performance of LLMs through in-context learning and instruction tuning, significant limitations nonetheless persist. To enhance the model’s generalization ability, we propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. Our work transitions LLMs from passive extractors to active reasoners, enabling them to understand not only what to extract but also how to reason. Experiments conducted on multiple IE benchmarks demonstrate that MR-UIE consistently elevates extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Furthermore, incorporating multi-perspective reasoning into RL notably enhances generalization in complex IE tasks, underscoring the critical role of reasoning in challenging scenarios.

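Paper and project links

PDF

Summary

LLMs show robust capabilities across diverse research domains, but their performance on universal information extraction (UIE) remains insufficient, especially for structured output scenarios that involve complex schema descriptions and require multi-step reasoning. Existing approaches improve LLMs through in-context learning and instruction tuning, yet significant limitations persist. To improve generalization, the authors propose integrating reinforcement learning (RL) with multi-perspective reasoning for information extraction (IE) tasks. This moves LLMs from passive extractors to active reasoners that understand not only what to extract but also how to reason. Experiments on multiple IE benchmarks show that MR-UIE consistently improves extraction accuracy across domains and surpasses state-of-the-art methods on several datasets. Incorporating multi-perspective reasoning into RL notably improves generalization on complex IE tasks, underscoring the role of reasoning in challenging scenarios.

Key Takeaways

LLM performance on UIE remains insufficient, especially for structured outputs with complex schema descriptions and multi-step reasoning.
Combining reinforcement learning (RL) with multi-perspective reasoning is proposed to improve LLM generalization on IE tasks.
MR-UIE moves LLMs from passive extractors to active reasoners that understand both what to extract and how to reason.
On multiple IE benchmarks, MR-UIE consistently improves extraction accuracy across domains and surpasses state-of-the-art methods on several datasets.
Incorporating multi-perspective reasoning into RL notably improves generalization on complex IE tasks.
Reasoning plays a critical role in challenging extraction scenarios.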

Recurrence Meets Transformers for Universal Multimodal Retrieval

Authors:Davide Caffagni, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods predominantly rely on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. In this paper, we propose ReT-2, a unified retrieval model that supports multimodal queries, composed of both images and text, and searches across multimodal document collections where text and images coexist. ReT-2 leverages multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. We evaluate ReT-2 on the challenging M2KR and M-BEIR benchmarks across different retrieval configurations. Results demonstrate that ReT-2 consistently achieves state-of-the-art performance across diverse settings, while offering faster inference and reduced memory usage compared to prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on Encyclopedic-VQA and InfoSeek datasets. Our source code and trained models are publicly available at: https://github.com/aimagelab/ReT-2

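Paper and project links

PDF

Summary

With the rapid advancement of multimodal retrieval and its application in LLMs and multimodal LLMs, increasingly complex retrieval tasks have emerged. Existing methods rely predominantly on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents. This paper proposes ReT-2, a unified retrieval model that supports multimodal queries composed of both images and text and searches across multimodal document collections where text and images coexist. ReT-2 uses multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms to dynamically integrate information across layers and modalities, capturing fine-grained visual and textual details. On the challenging M2KR and M-BEIR benchmarks, across different retrieval configurations, ReT-2 consistently achieves state-of-the-art performance while offering faster inference and lower memory usage than prior approaches. When integrated into retrieval-augmented generation pipelines, ReT-2 also improves downstream performance on the Encyclopedic-VQA and InfoSeek datasets. Source code and trained models are publicly available at https://github.com/aimagelab/ReT-2.

Key Takeaways

Multimodal retrieval is advancing rapidly and faces increasingly complex retrieval tasks.
Existing methods rely mainly on task-specific fine-tuning of vision-language models and are limited to single-modality queries or documents.
ReT-2 is a unified retrieval model that supports multimodal queries composed of both images and text.
ReT-2 uses multi-layer representations and a recurrent Transformer architecture with LSTM-inspired gating mechanisms.
ReT-2 achieves state-of-the-art performance on the M2KR and M-BEIR benchmarks.
ReT-2 offers faster inference and lower memory usage than prior approaches.
Integrated into retrieval-augmented generation pipelines, ReT-2 improves downstream performance on Encyclopedic-VQA and InfoSeek.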

AI Self-preferencing in Algorithmic Hiring: Empirical Evidence and Insights

Authors:Jiannan Xu, Gujie Li, Jane Yi Jiang

As generative artificial intelligence (AI) tools become widely adopted, large language models (LLMs) are increasingly involved on both sides of decision-making processes, ranging from hiring to content moderation. This dual adoption raises a critical question: do LLMs systematically favor content that resembles their own outputs? Prior research in computer science has identified self-preference bias – the tendency of LLMs to favor their own generated content – but its real-world implications have not been empirically evaluated. We focus on the hiring context, where job applicants often rely on LLMs to refine resumes, while employers deploy them to screen those same resumes. Using a large-scale controlled resume correspondence experiment, we find that LLMs consistently prefer resumes generated by themselves over those written by humans or produced by alternative models, even when content quality is controlled. The bias against human-written resumes is particularly substantial, with self-preference bias ranging from 68% to 88% across major commercial and open-source models. To assess labor market impact, we simulate realistic hiring pipelines across 24 occupations. These simulations show that candidates using the same LLM as the evaluator are 23% to 60% more likely to be shortlisted than equally qualified applicants submitting human-written resumes, with the largest disadvantages observed in business-related fields such as sales and accounting. We further demonstrate that this bias can be reduced by more than 50% through simple interventions targeting LLMs’ self-recognition capabilities. These findings highlight an emerging but previously overlooked risk in AI-assisted decision making and call for expanded frameworks of AI fairness that address not only demographic-based disparities, but also biases in AI-AI interactions.

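Paper and project links

PDF This paper has been accepted as a non-archival submission at EAAMO 2025 and AIES 2025

Summary

As generative AI tools become widely adopted, LLMs are increasingly involved on both sides of decision-making processes, from hiring to content moderation. This dual adoption raises a critical question: do LLMs systematically favor content that resembles their own outputs? Prior computer science research has identified self-preference bias, the tendency of LLMs to favor their own generated content, but its real-world implications have not been empirically evaluated. The study focuses on hiring, where job applicants often rely on LLMs to refine resumes while employers deploy them to screen those same resumes. In a large-scale controlled resume correspondence experiment, LLMs consistently prefer resumes generated by themselves over those written by humans or produced by alternative models, even when content quality is controlled. The bias against human-written resumes is particularly substantial, with self-preference bias ranging from 68% to 88% across major commercial and open-source models. To assess labor market impact, the authors simulate realistic hiring pipelines across 24 occupations. Candidates using the same LLM as the evaluator are 23% to 60% more likely to be shortlisted than equally qualified applicants submitting human-written resumes, with the largest disadvantages in business-related fields such as sales and accounting. Simple interventions targeting LLMs' self-recognition capabilities reduce the bias by more than 50%. The findings highlight an emerging but previously overlooked risk in AI-assisted decision making and call for expanded AI fairness frameworks that address not only demographic-based disparities but also biases in AI-AI interactions.

Key Takeaways

LLMs are increasingly used on both sides of decision-making processes such as hiring.
LLMs show self-preference bias: they favor content they generated themselves.
Even with content quality controlled, LLMs prefer their own resumes over those written by humans or produced by other models.
Candidates using the same LLM as the evaluator are 23% to 60% more likely to be shortlisted than equally qualified applicants with human-written resumes.
Self-preference bias ranges from 68% to 88% across major commercial and open-source models, and the largest disadvantages appear in business-related fields such as sales and accounting.
Simple interventions targeting LLMs' self-recognition capabilities reduce the bias by more than 50%.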
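
The abstract does not spell out how the 68% to 88% figures are computed, so the following is only one plausible reading, stated as an assumption: for each evaluator model, take the pairwise comparisons in which one of the two resumes was produced by that same model, and report the fraction in which the evaluator picked its own. The function name, record layout, and sample data are hypothetical. With content quality controlled, a rate near 50% would indicate no preference.

```python
from collections import defaultdict


def self_preference_rate(comparisons):
    """Fraction of self-involving comparisons in which each evaluator chose itself.

    comparisons: iterable of (evaluator, (source_a, source_b), chosen_source).
    """
    wins = defaultdict(int)
    totals = defaultdict(int)
    for evaluator, sources, chosen in comparisons:
        if chosen not in sources:
            raise ValueError("chosen source must be one of the compared sources")
        if evaluator not in sources:
            continue  # the evaluator's own resume is not part of this pair
        totals[evaluator] += 1
        wins[evaluator] += int(chosen == evaluator)
    return {model: wins[model] / totals[model] for model in totals}


# Hypothetical toy data: two evaluators, resumes from models and a human writer.
comparisons = [
    ("model_a", ("model_a", "human"), "model_a"),
    ("model_a", ("model_a", "human"), "model_a"),
    ("model_a", ("model_a", "model_b"), "model_b"),
    ("model_a", ("human", "model_b"), "human"),  # skipped: no self-authored resume
    ("model_b", ("model_b", "human"), "human"),
    ("model_b", ("model_b", "human"), "model_b"),
]
rates = self_preference_rate(comparisons)
print(rates)
```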

Improving Alignment in LVLMs with Debiased Self-Judgment

Authors:Sihan Yang, Chenhang Cui, Zihao Zhao, Yiyang Zhou, Weilong Yan, Ying Wei, Huaxiu Yao

The rapid advancements in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened up new opportunities for integrating visual and linguistic modalities. However, effectively aligning these modalities remains challenging, often leading to hallucinations–where generated outputs are not grounded in the visual input–and raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limit scalability and increase costs. To address these challenges, we propose a novel approach that generates the debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources. This enables the model to autonomously improve alignment. Our method enhances both decoding strategies and preference tuning processes, resulting in reduced hallucinations, enhanced safety, and improved overall capability. Empirical results show that our approach significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.

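Paper and project links

PDF EMNLP 2025 Findings

Summary

Rapid advances in Large Language Models (LLMs) and Large Visual-Language Models (LVLMs) have opened new opportunities for integrating visual and linguistic modalities. Aligning these modalities effectively remains challenging, however, and often leads to hallucinations, where generated outputs are not grounded in the visual input, raising safety concerns across various domains. Existing alignment methods, such as instruction tuning and preference tuning, often rely on external datasets, human annotations, or complex post-processing, which limits scalability and increases cost. The authors propose an approach that generates a debiased self-judgment score, a self-evaluation metric created internally by the model without relying on external resources, so that the model can improve alignment autonomously. The method enhances both decoding strategies and preference tuning, resulting in fewer hallucinations, better safety, and improved overall capability. Empirical results show that it significantly outperforms traditional methods, offering a more effective solution for aligning LVLMs.

Key Takeaways

Advances in LLMs and LVLMs open new opportunities for integrating visual and linguistic modalities, but aligning the two remains challenging.
Existing alignment methods depend on external resources, which limits scalability and increases cost.
The debiased self-judgment score is a self-evaluation metric created internally by the model, letting it improve alignment autonomously.
The method enhances both decoding strategies and preference tuning, reducing hallucinations.
The method also improves safety and overall model capability.
Empirical results show that it significantly outperforms traditional methods for aligning LVLMs.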

Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?

Authors:Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz

The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and reasons for it are underexplored. We evaluated several open-source and proprietary LLMs – including GPT-series, Anthropic, Deepseek and Llama-3 variants – on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas (Amazon Web Services) AWS Llama 3.1 8B lagged at 0.50 accuracy, and a Python-deployed Llama 3.1 8B scored 0.55. The latter two are within the range of mere guessing for the two-answer forced-choice design. None of the evaluated models could have passed the examination fully, as accuracy never exceeded the average threshold of 0.90 required for professional-level standards – also not models that are regularly promoted for their assumed beyond-PhD- and bar-admitted-lawyer-level performance. GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often lost formatting coherence. Human patent experts evaluated the textual justifications and uncovered various critical shortcomings of each model. They valued clarity and legal rationale over the raw correctness of the answers, which revealed misalignment between automatic metrics and expert judgment. Model outputs were sensitive to modest temperature changes and prompt wording, which underscores the remaining necessity of expert oversight. Future work should target logical consistency, robust multimodality, and adaptive prompting to approach human-level patent proficiency. In summary, despite the outstanding performance of recent large models, the general public might overestimate their performance. The field has a long way to go to develop a virtual patent attorney. This paper wants to point out several specific limitations that need solutions.

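Paper and project links

PDF 41 pages, 21 figures

Summary

The paper evaluates several open-source and proprietary LLMs on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. Performance varied widely: OpenAI o1 led with 0.82 accuracy and 0.81 F1, while two Llama 3.1 8B deployments scored 0.50 and 0.55 accuracy, within the range of guessing for the two-answer forced-choice design. No model could have passed the examination fully, since accuracy never exceeded the 0.90 average threshold required for professional-level standards. Human patent experts who reviewed the textual justifications found critical shortcomings in every model; they valued clarity and legal rationale over raw correctness, which revealed a misalignment between automatic metrics and expert judgment. Future work should target logical consistency, robust multimodality, and adaptive prompting. Overall, the public may overestimate model performance, and the field has a long way to go before a virtual patent attorney is feasible.

Key Takeaways

The legal field already uses LLMs in actual applications, but their quantitative performance and the reasons for it are underexplored.
On parts of the EQE, performance varied widely across models, and none could have passed the examination fully.
Even the best model, OpenAI o1 at 0.82 accuracy, stayed below the 0.90 average threshold required for professional-level standards.
Human patent experts found critical shortcomings in each model's justifications and valued clarity and legal rationale over raw correctness, revealing a misalignment between automatic metrics and expert judgment.
GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often lost formatting coherence.
Model outputs were sensitive to modest temperature changes and prompt wording, which underscores the need for expert oversight.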
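
For readers who want to relate the reported accuracy and F1 figures to the guessing baseline, here is a small sketch of both metrics on a two-answer forced-choice task. The labels below are made up for illustration, and treating `True` as the positive class is an assumption; the paper's own data and scoring script are not reproduced here. On a balanced two-answer design, random guessing gives an expected accuracy of 0.50.

```python
def accuracy_and_f1(gold, pred):
    """Accuracy and F1 for binary labels, with True as the positive class."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Hypothetical answers to eight true/false statements.
gold = [True, True, False, False, True, False, True, False]
pred = [True, True, False, True, True, False, False, False]
accuracy, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}, chance accuracy=0.50")
```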

Scaling LLM Planning: NL2FLOW for Parametric Problem Generation and Rigorous Evaluation

Authors:Jungkoo Kang

Robust workflow composition is critical for effective agent performance, yet progress in Large Language Model (LLM) planning and reasoning is hindered by a scarcity of scalable evaluation data. This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation, translating them into both natural language and formal PDDL. I evaluate several open-source, instruct-tuned LLMs on a dataset of 2296 low-difficulty problems generated by NL2Flow. Results demonstrate that the best-performing model achieved 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation is contingent on both model and prompt design. Importantly, translating natural language problems into a structured JSON representation prior to symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. These findings underscore the importance of understanding error sources within LLM reasoning as systems scale to more complex tasks. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.

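Paper and project links

PDF 31 pages, 7 figures

Summary

This work introduces NL2Flow, a fully automated pipeline for generating and evaluating workflow planning problems. NL2Flow generates problems parametrically in a structured intermediate representation and translates them into both natural language and formal PDDL. Several open-source, instruct-tuned LLMs are evaluated on a dataset of 2296 low-difficulty problems generated by NL2Flow. The best-performing model reached 86% success in generating valid plans and 69% in generating optimal plans (for solvable problems). Regression analysis shows that the influence of problem characteristics on plan generation depends on both the model and the prompt design. Translating natural language problems into a structured JSON representation before symbolic planning significantly improved success rates, suggesting a benefit from neuro-symbolic integration. As LLM reasoning scales to increasingly complex problems, understanding the shifting bottlenecks and sources of error within these systems will be crucial.

Key Takeaways

NL2Flow is a fully automated pipeline for generating and evaluating workflow planning problems.
Several open-source, instruct-tuned LLMs were evaluated on 2296 low-difficulty problems generated by NL2Flow.
The best-performing model reached 86% success on valid plans and 69% on optimal plans (for solvable problems).
Regression analysis shows that the influence of problem characteristics on plan generation depends on both the model and the prompt design.
Translating natural language problems into a structured JSON representation before symbolic planning significantly improved success rates.
As LLM reasoning scales to more complex problems, understanding the shifting bottlenecks and sources of error will be crucial.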
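
The idea of handing a structured JSON problem to a symbolic planner can be illustrated with a small converter that renders a JSON object as a PDDL problem file. This is a sketch under assumptions: the JSON keys (`name`, `domain`, `objects`, `init`, `goal`), the predicate names, and the `to_pddl` helper are hypothetical and are not NL2Flow's actual schema or code.

```python
import json


def facts(items):
    """Render a list of [predicate, arg, ...] lists as PDDL atoms."""
    return " ".join("(" + " ".join(fact) + ")" for fact in items)


def to_pddl(problem):
    """Render a problem dictionary (hypothetical schema) as a PDDL problem."""
    lines = [
        "(define (problem " + problem["name"] + ")",
        "  (:domain " + problem["domain"] + ")",
        "  (:objects " + " ".join(problem["objects"]) + ")",
        "  (:init " + facts(problem["init"]) + ")",
        "  (:goal (and " + facts(problem["goal"]) + ")))",
    ]
    return "\n".join(lines)


# Hypothetical structured problem, as an LLM might emit it from a text request.
raw = """
{"name": "book-trip", "domain": "workflow",
 "objects": ["flight", "hotel"],
 "init": [["available", "flight"], ["available", "hotel"]],
 "goal": [["booked", "flight"], ["booked", "hotel"]]}
"""
print(to_pddl(json.loads(raw)))
```

Keeping the LLM's job limited to producing the JSON and leaving plan search to a symbolic planner is one way to read the neuro-symbolic benefit the abstract reports.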

Development and Comparative Evaluation of Three Artificial Intelligence Models (NLP, LLM, JEPA) for Predicting Triage in Emergency Departments: A 7-Month Retrospective Proof-of-Concept

Authors:Edouard Lansiaux, Ramy Azzouz, Emmanuel Chazard, Amélie Vromant, Eric Wiel

Emergency departments struggle with persistent triage errors, especially undertriage and overtriage, which are aggravated by growing patient volumes and staff shortages. This study evaluated three AI models [TRIAGEMASTER (NLP), URGENTIAPARSE (LLM), and EMERGINET (JEPA)] against the FRENCH triage scale and nurse practice, using seven months of adult triage data from Roger Salengro Hospital in Lille, France. Among the models, the LLM-based URGENTIAPARSE consistently outperformed both AI alternatives and nurse triage, achieving the highest accuracy (F1-score 0.900, AUC-ROC 0.879) and superior performance in predicting hospitalization needs (GEMSA). Its robustness across structured data and raw transcripts highlighted the advantage of LLM architectures in abstracting patient information. Overall, the findings suggest that integrating LLM-based AI into emergency department workflows could significantly enhance patient safety and operational efficiency, though successful adoption will depend on addressing limitations and ensuring ethical transparency.

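Paper and project links

PDF 13 pages, 7 figures, 3 tables

Summary

Emergency departments struggle with persistent triage errors, especially undertriage and overtriage, which are aggravated by growing patient volumes and staff shortages. This study evaluates three AI models, TRIAGEMASTER (NLP), URGENTIAPARSE (LLM), and EMERGINET (JEPA), against the FRENCH triage scale and nurse practice, using seven months of adult triage data from Roger Salengro Hospital in Lille, France. The LLM-based URGENTIAPARSE consistently outperformed both AI alternatives and nurse triage, achieving the highest accuracy (F1-score 0.900, AUC-ROC 0.879) and superior performance in predicting hospitalization needs (GEMSA). Its robustness across structured data and raw transcripts highlights the advantage of LLM architectures in abstracting patient information. The findings suggest that integrating LLM-based AI into emergency department workflows could significantly enhance patient safety and operational efficiency, although successful adoption will depend on addressing limitations and ensuring ethical transparency.

Key Takeaways

Emergency departments face persistent triage errors, especially undertriage and overtriage, aggravated by growing patient volumes and staff shortages.
Three AI models (TRIAGEMASTER, URGENTIAPARSE, and EMERGINET) were evaluated against the FRENCH triage scale and nurse practice.
The LLM-based URGENTIAPARSE achieved the highest accuracy (F1-score 0.900, AUC-ROC 0.879), outperforming both AI alternatives and nurse triage.
URGENTIAPARSE also showed superior performance in predicting hospitalization needs (GEMSA).
Its robustness across structured data and raw transcripts highlights the advantage of LLM architectures in abstracting patient information.
Integrating LLM-based AI into emergency department workflows could enhance patient safety and operational efficiency, provided limitations are addressed and ethical transparency is ensured.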

LLMs for sensory-motor control: Combining in-context and iterative learning

Authors:Jônata Tyska Carvalho, Stefano Nolfi

We propose a method that enables large language models (LLMs) to control embodied agents by directly mapping continuous observation vectors to continuous action vectors. At the outset, the LLMs generate a control strategy based on a textual description of the agent, its environment, and the intended goal. This strategy is then iteratively refined through a learning process in which the LLMs are repeatedly prompted to improve the current strategy, using performance feedback and sensory-motor data collected during its evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from the MuJoCo library. The approach proves effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b. In most cases, it successfully identifies optimal or near-optimal solutions by integrating symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered as the agent interacts with its environment.

Paper and project links

PDF Article updated with results from gpt-oss:120b. 24 pages (13 pages are from appendix), 6 figures, code for experiments replication and supplementary material provided at https://github.com/jtyska/llm-robotics-article/

Summary

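Large language models (LLMs) can control embodied agents by directly mapping continuous observation vectors to continuous action vectors. The LLM first generates a control strategy from a textual description of the agent, its environment, and the goal, then iteratively refines it using performance feedback and sensory-motor data collected during evaluation. The method is validated on classic control tasks from the Gymnasium library and the inverted pendulum task from MuJoCo, and is effective with relatively compact models such as Gpt-oss:120b and Qwen2.5:72b. By combining symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data, it finds optimal or near-optimal solutions in most cases.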

Key Takeaways

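LLMs can control embodied agents by mapping continuous observation vectors directly to continuous action vectors.
The LLM generates an initial control strategy from a textual description of the agent, its environment, and the goal.
The strategy is refined iteratively using performance feedback and sensory-motor data collected during evaluation.
The method is validated on classic control tasks from Gymnasium and the inverted pendulum task from MuJoCo.
Relatively compact models such as Gpt-oss:120b and Qwen2.5:72b are sufficient for the approach to work.
The method combines symbolic knowledge derived through reasoning with sub-symbolic sensory-motor data gathered during interaction.
In most cases it finds optimal or near-optimal solutions.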
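
The generate, evaluate, refine loop described above can be sketched as follows. This is only a structural illustration under strong assumptions: the environment is a made-up 1-D task rather than a Gymnasium or MuJoCo task, the strategy is a single gain parameter, and `propose_revision` is a hypothetical stub standing in for the LLM prompt. Keeping the best-scoring candidate is a choice made here, not something the abstract confirms.

```python
import random


def evaluate_strategy(gain, steps=50, start=0.8):
    """Run one episode of a toy 1-D task: drive position x toward 0.

    Returns the total reward and a log of (observation, action) pairs.
    """
    x = start
    total_reward = 0.0
    log = []
    for _ in range(steps):
        # Continuous observation -> continuous action, clipped to [-1, 1].
        action = max(-1.0, min(1.0, -gain * x))
        log.append((x, action))
        x += 0.1 * action
        total_reward -= abs(x)
    return total_reward, log


def propose_revision(gain, reward, log, rng):
    """Hypothetical stand-in for the LLM call.

    A real system would prompt a model with the current strategy, its
    reward, and the logged data; this stub only perturbs one parameter.
    """
    return gain + rng.uniform(-0.5, 0.5)


def refine(initial_gain=0.1, rounds=20, seed=0):
    """Generate, evaluate, revise; keep the best-scoring strategy."""
    rng = random.Random(seed)
    best_gain = initial_gain
    best_reward, best_log = evaluate_strategy(best_gain)
    for _ in range(rounds):
        candidate = propose_revision(best_gain, best_reward, best_log, rng)
        reward, log = evaluate_strategy(candidate)
        if reward > best_reward:
            best_gain, best_reward, best_log = candidate, reward, log
    return best_gain, best_reward


if __name__ == "__main__":
    print(refine())
```

The point of the sketch is the data flow: each revision sees both a scalar score and the raw observation/action log, which mirrors the pairing of performance feedback with sensory-motor data in the abstract.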

Toward Generation of Test Cases from Task Descriptions via History-aware Planning

Authors:Duy Cao, Phu Nguyen, Vy Le, Tien N. Nguyen, Vu Nguyen

In automated web testing, generating test scripts from natural language task descriptions is crucial for enhancing the test generation process. This activity involves creating the correct sequences of actions to form test scripts for future testing activities. Current state-of-the-art approaches are limited in generating these action sequences, as they either demand substantial manual effort for human demonstrations or fail to consider the history of previous web content and actions to decide the next action. In this paper, we introduce HxAgent, an iterative large language model agent planning approach that determines the next action based on: 1) observations of the current contents and feasible actions, 2) short-term memory of previous web states and actions, and 3) long-term experience with (in)correct action sequences. The agent generates a sequence of actions to perform a given task, which is effectively an automated test case to verify the task. We conducted an extensive empirical evaluation of HxAgent using two datasets. On the MiniWoB++ dataset, our approach achieves 97% exact-match accuracy that is comparable to the best baselines while eliminating the need for human demonstrations required by those methods. For complex tasks requiring navigation through multiple actions and screens, HxAgent achieves an average 82% exact-match. On the second dataset, comprising 350 task instances across seven popular websites, including YouTube, LinkedIn, Facebook, and Google, HxAgent achieves high performance, with 87% of the action sequences exactly matching the ground truth and a prefix-match of 93%, outperforming the baseline by 59%.

Paper and project links

PDF Change the method and experimentation

Summary

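Generating web test scripts from natural language task descriptions is key to improving automated test generation. Current state-of-the-art approaches either require substantial manual effort for human demonstrations or ignore the history of previous web content and actions when choosing the next action. The paper introduces HxAgent, an iterative LLM agent planning approach that decides the next action from observations of the current content and feasible actions, short-term memory of previous web states and actions, and long-term experience with correct and incorrect action sequences. The resulting action sequence serves as an automated test case for the task. On MiniWoB++, HxAgent reaches 97% exact-match accuracy, comparable to the best baselines, without human demonstrations, and averages 82% exact-match on complex tasks that span multiple actions and screens. On a second dataset of 350 task instances across seven popular websites, 87% of action sequences exactly match the ground truth and prefix-match reaches 93%, outperforming the baseline by 59%.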

Key Takeaways

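Generating web test scripts automatically is key to efficient automated web testing, but current state-of-the-art methods either require human demonstrations or ignore history.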
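
The abstract reports exact-match and prefix-match scores for generated action sequences. As a rough illustration, the sketch below scores one predicted sequence against a ground-truth sequence. The abstract does not define prefix-match, so the definition used here (length of the shared leading run divided by the ground-truth length) is an assumption, and the action strings are invented.

```python
def exact_match(predicted, truth):
    """True when the two action sequences are identical."""
    return list(predicted) == list(truth)


def common_prefix_length(predicted, truth):
    """Number of leading actions that agree before the first mismatch."""
    length = 0
    for p, t in zip(predicted, truth):
        if p != t:
            break
        length += 1
    return length


def prefix_match_ratio(predicted, truth):
    """Assumed definition: shared prefix length over ground-truth length."""
    if not truth:
        return 1.0 if not predicted else 0.0
    return common_prefix_length(predicted, truth) / len(truth)


# Invented action strings for illustration only.
truth = ["click(search_box)", "type('llm')", "click(submit)", "click(result_1)"]
predicted = ["click(search_box)", "type('llm')", "click(result_1)"]
print(exact_match(predicted, truth))
print(prefix_match_ratio(predicted, truth))
```

A prefix-style score gives partial credit to a sequence that starts correctly and then diverges, which is why it can sit above the exact-match figure.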

Transforming Wearable Data into Personal Health Insights using Large Language Model Agents

Authors:Mike A. Merrill, Akshay Paruchuri, Naghmeh Rezaei, Geza Kovacs, Javier Perez, Yun Liu, Erik Schenck, Nova Hammerquist, Jake Sunshine, Shyam Tailor, Kumar Ayush, Hao-Wei Su, Qian He, Cory Y. McLean, Mark Malhotra, Shwetak Patel, Jiening Zhan, Tim Althoff, Daniel McDuff, Xin Liu

Deriving personalized insights from popular wearable trackers requires complex numerical reasoning that challenges standard LLMs, necessitating tool-based approaches like code generation. Large language model (LLM) agents present a promising yet largely untapped solution for this analysis at scale. We introduce the Personal Health Insights Agent (PHIA), a system leveraging multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. To test its capabilities, we create and share two benchmark datasets with over 4000 health insights questions. A 650-hour human expert evaluation shows that PHIA significantly outperforms a strong code generation baseline, achieving 84% accuracy on objective, numerical questions and, for open-ended ones, earning 83% favorable ratings while being twice as likely to achieve the highest quality rating. This work can advance behavioral health by empowering individuals to understand their data, enabling a new era of accessible, personalized, and data-driven wellness for the wider population.

Paper and project links

PDF 53 pages, 7 main figures, 2 main tables, accepted to Nature Communications

Summary

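Deriving personalized insights from wearable trackers requires complex numerical reasoning that challenges standard LLMs and calls for tool-based approaches such as code generation. The paper introduces the Personal Health Insights Agent (PHIA), which uses multistep reasoning with code generation and information retrieval to analyze and interpret behavioral health data. In a 650-hour human expert evaluation, PHIA significantly outperformed a strong code generation baseline, reaching 84% accuracy on objective numerical questions and 83% favorable ratings on open-ended ones, and was twice as likely to receive the highest quality rating. The authors expect this to support accessible, personalized, data-driven wellness for the wider population.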

Key Takeaways

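Interpreting wearable health data requires complex numerical reasoning that challenges standard LLMs and calls for tool-based approaches such as code generation.
The Personal Health Insights Agent (PHIA) combines multistep reasoning, code generation, and information retrieval to analyze and interpret behavioral health data.
PHIA reaches 84% accuracy on objective numerical questions and 83% favorable ratings on open-ended questions.
Compared with a strong code generation baseline, PHIA is twice as likely to receive the highest quality rating.
PHIA could advance behavioral health by helping individuals understand their own data.
The authors create and share two benchmark datasets with over 4000 health insights questions.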
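
To show why code generation helps with numerical questions like these, here is a sketch of the kind of small program an agent might write for a question such as "Do I sleep longer on days when I walk at least 8000 steps?". It is not PHIA's actual output: the record layout, field names, threshold, and sample values are all invented for illustration.

```python
from statistics import mean

# Invented sample records; a real agent would load the user's own data.
days = [
    {"steps": 10200, "sleep_hours": 7.5},
    {"steps": 4300, "sleep_hours": 6.0},
    {"steps": 8800, "sleep_hours": 8.0},
    {"steps": 6100, "sleep_hours": 6.5},
]


def mean_sleep_by_activity(records, step_threshold=8000):
    """Return (mean sleep on active days, mean sleep on other days).

    A side is None when no day falls into that group.
    """
    active = [r["sleep_hours"] for r in records if r["steps"] >= step_threshold]
    other = [r["sleep_hours"] for r in records if r["steps"] < step_threshold]
    return (mean(active) if active else None, mean(other) if other else None)


print(mean_sleep_by_activity(days))
```

Delegating the arithmetic to executed code means the language model only has to choose the right filter and aggregate, not compute the averages itself.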

Osprey: Pixel Understanding with Visual Instruction Tuning

Authors:Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu

Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However, current MLLMs primarily focus on image-level or box-level understanding, falling short in achieving fine-grained vision-language alignment at pixel level. Besides, the lack of mask-based instruction data limits their advancements. In this paper, we propose Osprey, a mask-text instruction tuning approach, to extend MLLMs by incorporating fine-grained mask regions into language instruction, aiming at achieving pixel-wise visual understanding. To achieve this goal, we first meticulously curate a mask-based region-text dataset with 724K samples, and then design a vision-language model by injecting pixel-level representation into LLM. Specifically, Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey’s superiority in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning. In particular, Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code, dataset and demo can be found at https://github.com/CircleRadon/Osprey.

Paper and project links

PDF CVPR2024, Code and Demo link: https://github.com/CircleRadon/Osprey

Summary

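Multimodal large language models (MLLMs) have gained general-purpose vision-language capabilities through visual instruction tuning, but fall short on fine-grained vision-language alignment at the pixel level. The paper proposes Osprey, a mask-text instruction tuning approach that incorporates fine-grained mask regions into language instructions to achieve pixel-wise visual understanding. The authors curate a mask-based region-text dataset with 724K samples and design a vision-language model that injects pixel-level representations into the LLM. Experiments show Osprey's superiority on various region understanding tasks, and it can be integrated seamlessly with the Segment Anything Model (SAM) to obtain multi-granularity semantics.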

Key Takeaways

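MLLMs have gained vision-language capabilities through visual instruction tuning but still fall short on fine-grained, pixel-level visual understanding.
The lack of mask-based instruction data limits further progress.
Osprey incorporates fine-grained mask regions into language instructions to achieve pixel-wise visual understanding.
Osprey uses a convolutional CLIP backbone as the vision encoder and a mask-aware visual extractor to extract precise visual mask features from high-resolution input.
Experiments show that Osprey is superior on various region understanding tasks.
Osprey can be integrated seamlessly with the Segment Anything Model (SAM) to obtain multi-granularity semantics.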
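
The abstract does not describe the internals of the mask-aware visual extractor. As a generic illustration of turning a pixel mask into a region feature, the sketch below implements masked average pooling over a small feature map in plain Python. This is a common technique swapped in for illustration, not Osprey's confirmed method, and the function name and shapes are assumptions.

```python
def masked_average_pool(features, mask):
    """Average the feature vectors at positions where the mask is set.

    features: H x W x C nested lists of numbers.
    mask: H x W nested lists of 0/1 (or False/True).
    Returns a list of C floats.
    """
    channels = len(features[0][0])
    totals = [0.0] * channels
    count = 0
    for feature_row, mask_row in zip(features, mask):
        for vector, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for c, value in enumerate(vector):
                    totals[c] += value
    if count == 0:
        raise ValueError("mask selects no positions")
    return [total / count for total in totals]


# A 2 x 2 feature map with 2 channels and a mask covering the diagonal.
features = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
mask = [[1, 0], [0, 1]]
print(masked_average_pool(features, mask))
```

Unlike a bounding box, the mask restricts the pooled feature to exactly the pixels of the region, which is the distinction between box-level and pixel-level understanding that the abstract draws.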

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-13/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
