LLM

发布日期: 2025-05-02

更新日期: 2025-05-25

文章字数: 20.3k

阅读时长: 82 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-05-02 更新

COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning

Authors:Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky

Multimodal Large Language Models (MLLMs) excel at simple vision-language tasks but struggle when faced with complex tasks that require multiple capabilities, such as simultaneously recognizing objects, counting them, and understanding their spatial relationships. This might be partially the result of the fact that Visual Instruction Tuning (VIT), a critical training step for MLLMs, has traditionally focused on scaling data volume, but not the compositional complexity of training examples. We propose COMPACT (COMPositional Atomic-to-complex visual Capability Tuning), which generates a training dataset explicitly controlling for the compositional complexity of the training examples. The data from COMPACT allows MLLMs to train on combinations of atomic capabilities to learn complex capabilities more efficiently. Across all benchmarks, COMPACT achieves comparable performance to the LLaVA-665k VIT while using less than 10% of its data budget, and even outperforms it on several, especially those involving complex multi-capability tasks. For example, COMPACT achieves substantial 83.3% improvement on MMStar and 94.0% improvement on MM-Vet compared to the full-scale VIT on particularly complex questions that require four or more atomic capabilities. COMPACT offers a scalable, data-efficient, visual compositional tuning recipe to improve on complex visual-language tasks.

多模态大型语言模型（MLLMs）在简单的视觉语言任务上表现出色，但在面对需要多种能力的复杂任务时，如同时识别物体、计数和理解其空间关系，则表现困难。这可能是由于视觉指令调整（VIT）这一MLLM的关键训练步骤传统上侧重于扩大数据量，而忽视了训练实例的组成复杂性。我们提出了COMPACT（Compositional Atomic-to-complex视觉能力调整），它通过明确控制训练实例的组成复杂性来生成训练数据集。COMPACT的数据允许MLLM在原子能力的组合上进行训练，从而更有效地学习复杂能力。在所有基准测试中，COMPACT在使用不到其数据预算10%的情况下实现了与LLaVA-665k VIT相当的性能，甚至在多个基准测试上表现更好，尤其是涉及复杂多能力任务时。例如，COMPACT在MMStar和MM-Vet上的表现相对于全规模VIT有显著改善，分别提高了83.3%和94.0%，尤其是在处理需要四种或更多原子能力的特别复杂问题上。COMPACT提供了一种可扩展、数据高效、视觉组成的调整配方，以改进复杂的视觉语言任务。

论文及项目相关链接

PDF 17 pages, 13 figures

Summary

MLLM在处理简单视觉语言任务时表现出色，但在面对需要多种能力的复杂任务时表现欠佳，如同时识别物体、计数和理解空间关系等。这可能是由于传统视觉指令调整（VIT）主要关注数据规模，而忽视训练实例的组成复杂性所致。为此，我们提出了COMPACT方法，它通过控制训练实例的组成复杂性来生成训练数据集。COMPACT允许MLLM在原子能力组合上进行训练，从而更有效地学习复杂能力。在所有基准测试中，COMPACT在使用不到LLaVA-665kVIT十分之一的数据预算的情况下实现了与之相当的性能，并且在涉及复杂多能力任务的基准测试中表现更佳。例如，COMPACT在MMStar和MM-Vet上的表现相对于全面规模的VIT有显著改善。COMPACT为复杂的视觉语言任务提供了一个可扩展、数据高效、视觉组成的调整方案。

Key Takeaways

MLLMs在处理复杂视觉语言任务时遇到困难，这些任务需要多种能力如物体识别、计数和空间理解。
传统VIT主要关注数据规模，而忽视训练实例的组成复杂性。
COMPACT方法通过控制训练实例的组成复杂性来提高MLLM的性能。
COMPACT允许MLLM在原子能力组合上训练，从而更高效地学习复杂能力。
COMPACT在使用较少数据的情况下实现了与LLaVA-665kVIT相当的性能，并在某些基准测试中表现更佳。
COMPACT在涉及复杂多能力任务的MMStar和MM-Vet上的表现显著改善。

Cool Papers

点此查看论文截图

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

Authors:Z. Z. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z. F. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, Chong Ruan

We introduce DeepSeek-Prover-V2, an open-source large language model designed for formal theorem proving in Lean 4, with initialization data collected through a recursive theorem proving pipeline powered by DeepSeek-V3. The cold-start training procedure begins by prompting DeepSeek-V3 to decompose complex problems into a series of subgoals. The proofs of resolved subgoals are synthesized into a chain-of-thought process, combined with DeepSeek-V3’s step-by-step reasoning, to create an initial cold start for reinforcement learning. This process enables us to integrate both informal and formal mathematical reasoning into a unified model. The resulting model, DeepSeek-Prover-V2-671B, achieves state-of-the-art performance in neural theorem proving, reaching 88.9% pass ratio on the MiniF2F-test and solving 49 out of 658 problems from PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, to enrich our evaluation, including 15 selected problems from the recent AIME competitions (years 24-25). Further evaluation on these 15 AIME problems shows that the model successfully solves 6 of them. In comparison, DeepSeek-V3 solves 8 of these problems using majority voting, highlighting that the gap between formal and informal mathematical reasoning in large language models is substantially narrowing.

我们介绍了DeepSeek-Prover-V2，这是一款为Lean 4中的形式化定理证明而设计的大型开源语言模型。其初始化数据是通过由DeepSeek-V3驱动的递归定理证明管道收集的。冷启动训练过程始于提示DeepSeek-V3将复杂问题分解为一系列子目标。已解决的子目标的证明被综合为思维链过程，与DeepSeek-V3的逐步推理相结合，为强化学习创建初始冷启动。这一过程使我们能够将非正式和正式的数学推理整合到一个统一模型中。结果模型DeepSeek-Prover-V2-671B在神经定理证明方面达到了最新性能，在MiniF2F测试中的通过率为88.9%，并解决PutnamBench中的49个中的658个问题。除了标准基准测试外，我们还引入了ProverBench，这是一组包含325个形式化问题的集合，以丰富我们的评估，其中包括最近AIME竞赛（第24-25年）精选的15个问题。对这15个AIME问题的进一步评估表明，该模型成功解决了其中6个。相比之下，DeepSeek-V3使用多数投票解决了其中8个问题，这表明大型语言模型中形式与非形式数学推理之间的差距正在显著缩小。

论文及项目相关链接

PDF

Summary

DeepSeek-Prover-V2是一款用于Lean 4形式化定理证明的开源大型语言模型。它通过DeepSeek-V3驱动的递归定理证明管道进行初始化数据采集，结合深度思考与逐步推理，实现冷启动训练程序。该模型在神经定理证明方面达到最新性能，在MiniF2F测试中的通过率达到了88.9%，并解决PutnamBench中的49个问题。此外，还引入了ProverBench，包含325个形式化问题，用于丰富评估，包括最近AIME竞赛（第24-25年）中选择的15个问题。进一步评估显示，该模型成功解决了其中6个问题，而DeepSeek-V3通过多数投票解决了其中8个问题，表明大型语言模型中形式与非形式数学推理之间的差距正在显著缩小。

Key Takeaways

DeepSeek-Prover-V2是一个用于定理证明的大型语言模型。
模型采用由DeepSeek-V3驱动的递归定理证明管道进行初始化数据采集。
通过结合深度思考和逐步推理，实现了冷启动训练程序。
在神经定理证明方面达到最新性能，在MiniF2F测试中的通过率达到了88.9%。
模型成功解决PutnamBench中的49个问题。
引入ProverBench作为新的评估工具，其中包括来自AIME竞赛的形式化问题。

Cool Papers

点此查看论文截图

MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness

Authors:Junsheng Huang, Zhitao He, Sandeep Polisetty, Qingyun Wang, May Fung

With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.

随着大型语言模型（LLM）的广泛应用，生成不存在的事实的问题，即被称为“幻觉”的问题，已经引起了越来越多的关注。之前关于提高LLM信心估计的研究主要集中在单一问题设置上。然而，在更具挑战性的多问题设置下，LLM对其内部参数化知识边界的认识，即需要同时准确回答多个问题的能力，仍然被研究得很少。为了填补这一空白，我们引入了一种新的方法——多答案和信心逐步调整（MAC-Tuning），在指令数据的微调过程中，它将答案预测和信心估计的学习分开。大量实验表明，我们的方法在平均精度上比基线高出高达25%。

论文及项目相关链接

PDF

Summary：随着大型语言模型（LLM）的广泛应用，生成不存在的事实即幻觉问题备受关注。先前关于提高LLM信心评估的研究主要集中在单一问题设置上，但在更具挑战性的多问题设置下，LLM对其内部参数化知识边界的认识仍有待探索。为了弥补这一空白，提出了一种新的方法——多答案与信心逐步调整（MAC-Tuning），在指令数据微调时，将答案预测和信心评估的学习分开。实验表明，该方法在平均精度上比基线高出25%。

Key Takeaways：

大型语言模型（LLM）的幻觉问题受到广泛关注。
之前的LLM信心评估研究主要集中在单一问题设置上。
在多问题设置下，LLM对其内部参数化知识边界的认识有待探索。
提出了一种新的方法MAC-Tuning，将答案预测和信心评估的学习在指令数据微调时分开。
MAC-Tuning方法通过逐步调整信心和提供多个答案来改进LLM的性能。
实验表明，MAC-Tuning在平均精度上比基线有显著提高。

Cool Papers

点此查看论文截图

LLM-based Interactive Imitation Learning for Robotic Manipulation

Authors:Jonas Werner, Kun Chu, Cornelius Weber, Stefan Wermter

Recent advancements in machine learning provide methods to train autonomous agents capable of handling the increasing complexity of sequential decision-making in robotics. Imitation Learning (IL) is a prominent approach, where agents learn to control robots based on human demonstrations. However, IL commonly suffers from violating the independent and identically distributed (i.i.d) assumption in robotic tasks. Interactive Imitation Learning (IIL) achieves improved performance by allowing agents to learn from interactive feedback from human teachers. Despite these improvements, both approaches come with significant costs due to the necessity of human involvement. Leveraging the emergent capabilities of Large Language Models (LLMs) in reasoning and generating human-like responses, we introduce LLM-iTeach – a novel IIL framework that utilizes an LLM as an interactive teacher to enhance agent performance while alleviating the dependence on human resources. Firstly, LLM-iTeach uses a hierarchical prompting strategy that guides the LLM in generating a policy in Python code. Then, with a designed similarity-based feedback mechanism, LLM-iTeach provides corrective and evaluative feedback interactively during the agent’s training. We evaluate LLM-iTeach against baseline methods such as Behavior Cloning (BC), an IL method, and CEILing, a state-of-the-art IIL method using a human teacher, on various robotic manipulation tasks. Our results demonstrate that LLM-iTeach surpasses BC in the success rate and achieves or even outscores that of CEILing, highlighting the potential of LLMs as cost-effective, human-like teachers in interactive learning environments. We further demonstrate the method’s potential for generalization by evaluating it on additional tasks. The code and prompts are provided at: https://github.com/Tubicor/LLM-iTeach.

近期机器学习的发展为训练能够处理机器人序列决策中日益增长的复杂性的自主代理提供了方法。模仿学习（IL）是一种突出的方法，其中代理根据人类演示来学习控制机器人。然而，IL通常会违反机器人任务中的独立同分布（i.i.d）假设。通过允许代理从人类教师的交互式反馈中学习，交互式模仿学习（IIL）实现了性能的提升。尽管有这些改进，但由于人类参与的必要，这两种方法都带来了相当大的成本。利用大型语言模型（LLM）在推理和生成人类反应方面的能力，我们引入了LLM-iTeach——一种新型IIL框架，它利用LLM作为交互式教师来提高代理性能，同时减轻对人力资源的依赖。首先，LLM-iTeach使用分层提示策略来指导LLM生成Python代码中的策略。然后，通过设计的基于相似度的反馈机制，LLM-iTeach在代理训练过程中提供纠正和评价反馈。我们在各种机器人操作任务上评估了LLM-iTeach与基线方法（如行为克隆（BC）一种IL方法以及CEILing一种使用人类教师的最新IIL方法）的表现。我们的结果表明，LLM-iTeach在成功率上超越了BC，并达到了或甚至超过了CEILing的水平，这突显了LLM作为交互式学习环境中成本效益高、人性化的教师的潜力。通过对其在额外任务上的评估，我们进一步展示了该方法的潜在泛化能力。相关代码和提示请参见：https://github.com/Tubicor/LLM-iTeach。

论文及项目相关链接

PDF To be published in IJCNN 2025 proceedings

Summary

近期机器学习进展为处理机器人复杂序列决策提供了自主代理训练方法。模仿学习（IL）是一种基于人类演示让代理控制机器人的方法，但违反了机器人任务的独立同分布假设。交互式模仿学习（IIL）通过允许代理从人类教师的互动反馈中学习，提高了性能。然而，两种方法都需要人类参与，成本较高。利用大型语言模型（LLM）的推理和生成人类响应的能力，我们引入了LLM-iTeach——一个利用LLM作为互动教师的新型IIL框架，以提高代理性能并减少对人力资源的依赖。LLM-iTeach首先使用分层提示策略来指导LLM生成Python代码策略。然后，通过一个设计的基于相似度的反馈机制，LLM-iTeach在代理训练过程中提供纠正和评价反馈。评估显示，LLM-iTeach在成功率上超越了行为克隆（BC）这一IL方法，并达到了或超越了当前主流的IIL方法CEILing的水平，突显了LLM作为成本效益高、人类化的互动教师的潜力。

Key Takeaways

近期机器学习进展为机器人复杂决策提供了解决方案。
模仿学习（IL）和交互式模仿学习（IIL）是机器人学习中的两种主要方法，但都需要人类参与。
LLM-iTeach利用大型语言模型（LLM）作为互动教师，提高代理性能并减少人类资源依赖。
LLM-iTeach使用分层提示策略和基于相似度的反馈机制。
LLM-iTeach在多个机器人操作任务上的表现超越了行为克隆（BC）方法，并达到了或超越了当前主流的IIL方法。
LLM-iTeach具有潜力在更多任务上得到应用和评估。
相关代码和提示已公开在GitHub上。

Cool Papers

点此查看论文截图

LLM-Empowered Embodied Agent for Memory-Augmented Task Planning in Household Robotics

Authors:Marc Glocker, Peter Hönig, Matthias Hirschmanner, Markus Vincze

We present an embodied robotic system with an LLM-driven agent-orchestration architecture for autonomous household object management. The system integrates memory-augmented task planning, enabling robots to execute high-level user commands while tracking past actions. It employs three specialized agents: a routing agent, a task planning agent, and a knowledge base agent, each powered by task-specific LLMs. By leveraging in-context learning, our system avoids the need for explicit model training. RAG enables the system to retrieve context from past interactions, enhancing long-term object tracking. A combination of Grounded SAM and LLaMa3.2-Vision provides robust object detection, facilitating semantic scene understanding for task planning. Evaluation across three household scenarios demonstrates high task planning accuracy and an improvement in memory recall due to RAG. Specifically, Qwen2.5 yields best performance for specialized agents, while LLaMA3.1 excels in routing tasks. The source code is available at: https://github.com/marc1198/chat-hsr.

我们呈现了一个具有LLM驱动的代理编排架构的实体机器人系统，用于自主管理家居物品。该系统结合了增强记忆的任务规划，使机器人能够执行高级用户命令，同时跟踪过去的行动。它采用了三个专业代理：路由代理、任务规划代理和知识库代理，每个代理都由特定任务的LLM驱动。通过利用上下文学习，我们的系统避免了对显式模型训练的需求。RAG使系统能够从过去的交互中检索上下文，从而提高长期对象跟踪能力。结合了Grounded SAM和LLaMa3.2-Vision提供强大的对象检测功能，促进任务规划的场景语义理解。在三种家庭场景下的评估表明，任务规划精度高，由于RAG的存在，记忆回忆能力得到提高。具体来说，Qwen2.5在专用代理方面表现最佳，而LLaMA3.1在路由任务方面表现出色。源代码可在：https://github.com/marc1198/chat-hsr访问。

论文及项目相关链接

PDF Accepted at Austrian Robotics Workshop 2025

Summary

一个采用LLM驱动的代理编排架构的机器人系统，用于自主管理家务物品。该系统结合了记忆增强任务规划，通过专业代理（路由代理、任务规划代理和知识库代理）实现用户高级指令的执行并追踪过去动作。利用上下文学习，系统无需明确的模型训练。RAG帮助系统从过去互动中检索上下文，增强长期物品追踪能力。结合Grounded SAM和LLaMa3.2-Vision，提供稳健的物品检测，促进任务规划的场景语义理解。在三个家庭场景中的评估显示，任务规划准确性高，因RAG记忆回忆有改善。Qwen2.5在特殊代理中表现最佳，LLaMA3.1在路由任务中表现出色。

Key Takeaways

该机器人系统采用LLM驱动的代理编排架构，用于家务物品管理。
系统结合了记忆增强任务规划，可以执行高级用户指令并追踪过去动作。
利用上下文学习，减少了对明确模型训练的需求。
RAG从过去的互动中检索上下文，增强了长期物品追踪能力。
结合Grounded SAM和LLaMa3.2-Vision技术，提高了物品检测的稳健性和场景语义理解。
在多种家庭场景的评估中，系统显示出高的任务规划准确性和记忆改善。

Cool Papers

点此查看论文截图

Visual Text Processing: A Comprehensive Review and Unified Evaluation

Authors:Yan Shu, Weichao Zeng, Fangmin Zhao, Zeyu Chen, Zhenhang Li, Xiaomeng Yang, Yu Zhou, Paolo Rota, Xiang Bai, Lianwen Jin, Xu-Cheng Yin, Nicu Sebe

Visual text is a crucial component in both document and scene images, conveying rich semantic information and attracting significant attention in the computer vision community. Beyond traditional tasks such as text detection and recognition, visual text processing has witnessed rapid advancements driven by the emergence of foundation models, including text image reconstruction and text image manipulation. Despite significant progress, challenges remain due to the unique properties that differentiate text from general objects. Effectively capturing and leveraging these distinct textual characteristics is essential for developing robust visual text processing models. In this survey, we present a comprehensive, multi-perspective analysis of recent advancements in visual text processing, focusing on two key questions: (1) What textual features are most suitable for different visual text processing tasks? (2) How can these distinctive text features be effectively incorporated into processing frameworks? Furthermore, we introduce VTPBench, a new benchmark that encompasses a broad range of visual text processing datasets. Leveraging the advanced visual quality assessment capabilities of multimodal large language models (MLLMs), we propose VTPScore, a novel evaluation metric designed to ensure fair and reliable evaluation. Our empirical study with more than 20 specific models reveals substantial room for improvement in the current techniques. Our aim is to establish this work as a fundamental resource that fosters future exploration and innovation in the dynamic field of visual text processing. The relevant repository is available at https://github.com/shuyansy/Visual-Text-Processing-survey.

视觉文本是文档和场景图像中的关键组成部分，传递丰富的语义信息，并吸引计算机视觉社区的广泛关注。除了文本检测和识别等传统任务外，视觉文本处理在基础模型的推动下取得了快速发展，包括文本图像重建和文本图像操作。尽管取得了重大进展，但由于文本与一般物体的独特属性差异，仍然存在挑战。有效捕获和利用这些独特的文本特征是开发稳健的视觉文本处理模型的关键。在本文中，我们对视觉文本处理的最新进展进行了全面、多角度的分析，重点回答两个关键问题：（1）什么文本特征最适合不同的视觉文本处理任务？（2）如何有效地将这些独特的文本特征纳入处理框架？此外，我们介绍了VTPBench，这是一个涵盖广泛视觉文本处理数据集的新基准测试。我们利用多模态大型语言模型（MLLMs）的高级视觉质量评估能力，提出了VTPScore，一种新型评估指标，旨在确保公平可靠的评价。我们对超过20种特定模型的实证研究揭示了当前技术仍有大量改进空间。我们的目标是将这项工作确立为视觉文本处理这一动态领域的基本资源，促进未来的探索和创新。相关仓库可在https://github.com/shuyansy/Visual-Text-Processing-survey找到。

论文及项目相关链接

PDF

Summary
文本中强调了视觉文本在文档和场景图像中的关键作用，它传递了丰富的语义信息并引起了计算机视觉界的关注。随着基础模型的出现，视觉文本处理任务迅速推进，包括文本图像重建和文本图像操作等。尽管有所进展，但由于文本与一般物体的独特属性不同，仍存在挑战。有效捕捉和利用这些独特的文本特征对于开发稳健的视觉文本处理模型至关重要。文章全面分析了视觉文本处理的最新进展，并介绍了VTPBench新基准和VTPScore评估指标。该研究揭示了当前技术的巨大改进空间，旨在成为促进视觉文本处理领域未来探索和创新的基本资源。

Key Takeaways

视觉文本在文档和场景图像中扮演关键角色，传递丰富语义信息，备受计算机视觉领域关注。
视觉文本处理任务发展迅速，包括文本图像重建和文本图像操作等。
文本与一般物体具有独特属性，有效捕捉和利用这些独特文本特征对开发稳健模型至关重要。
文章全面分析了视觉文本处理的最新进展，回答了两个关键问题：哪些文本特征最适合不同的视觉文本处理任务？如何有效地将这些独特的文本特征融入处理框架？
介绍了新的VTPBench基准，涵盖了广泛的视觉文本处理数据集。
利用多模态大型语言模型的先进视觉质量评估能力，提出了VTPScore评估指标，确保公平可靠的评估。

Cool Papers

点此查看论文截图

Traceback of Poisoning Attacks to Retrieval-Augmented Generation

Authors:Baolei Zhang, Haoran Xin, Minghong Fang, Zhuqing Liu, Biao Yi, Tong Li, Zheli Liu

Large language models (LLMs) integrated with retrieval-augmented generation (RAG) systems improve accuracy by leveraging external knowledge sources. However, recent research has revealed RAG’s susceptibility to poisoning attacks, where the attacker injects poisoned texts into the knowledge database, leading to attacker-desired responses. Existing defenses, which predominantly focus on inference-time mitigation, have proven insufficient against sophisticated attacks. In this paper, we introduce RAGForensics, the first traceback system for RAG, designed to identify poisoned texts within the knowledge database that are responsible for the attacks. RAGForensics operates iteratively, first retrieving a subset of texts from the database and then utilizing a specially crafted prompt to guide an LLM in detecting potential poisoning texts. Empirical evaluations across multiple datasets demonstrate the effectiveness of RAGForensics against state-of-the-art poisoning attacks. This work pioneers the traceback of poisoned texts in RAG systems, providing a practical and promising defense mechanism to enhance their security.

将大型语言模型（LLM）与检索增强生成（RAG）系统相结合，可以利用外部知识源提高准确性。然而，最近的研究表明RAG容易受到中毒攻击的影响，攻击者将有毒文本注入知识数据库，导致产生攻击者所期望的回应。现有的防御手段主要集中在推理时间缓解上，已被证明不足以应对高级攻击。在本文中，我们介绍了RAGForensics，这是针对RAG的第一个回溯系统，旨在识别知识数据库中的有毒文本，这些文本是攻击的原因。RAGForensics通过迭代方式运行，首先从数据库中检索一部分文本，然后使用专门设计的提示来引导LLM检测潜在的有毒文本。在多个数据集上的经验评估表明，RAGForensics对最先进的中毒攻击具有有效性。这项工作在RAG系统中开创了有毒文本的追溯先河，提供了一种增强系统安全的实用且前景广阔的防御机制。

论文及项目相关链接

PDF Accepted by The Web Conference 2025

Summary

LLMs与RAG系统结合可提高准确性，但易受到毒药攻击影响，攻击者将有毒文本注入知识库以产生所需响应。现有主要侧重于推理时间缓解的防御措施对高级攻击证明无效。本文介绍RAGForensics，首个为RAG设计的回溯系统，用于识别知识库中的有毒文本。RAGForensics通过迭代操作，首先从数据库中检索部分文本，然后使用专门设计的提示引导LLM检测可能的毒药文本。在多个数据集上的实证评估证明，RAGForensics对最先进的毒药攻击有效。这项工作在RAG系统中开创了有毒文本回溯的先河，提供了一种增强安全性的实用且前景广阔的防御机制。

Key Takeaways

LLMs结合RAG系统利用外部知识源提高准确性，但存在被毒药攻击的风险。
毒药攻击通过注入有毒文本到知识库，导致产生攻击者希望的响应。
现有防御措施主要侧重于推理时间的缓解，对高级攻击不够有效。
RAGForensics是首个为RAG设计的回溯系统，可识别知识库中的有毒文本。
RAGForensics通过迭代方式检索文本并使用专门设计的提示引导LLM检测毒药文本。
实证评估证明RAGForensics对先进的毒药攻击有效。

Cool Papers

点此查看论文截图

AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization

Authors:Haotian Luo, Haiying He, Yibo Wang, Jinluan Yang, Rui Liu, Naiqiang Tan, Xiaochun Cao, Dacheng Tao, Li Shen

Recently, long-thought reasoning models achieve strong performance on complex reasoning tasks, but often incur substantial inference overhead, making efficiency a critical concern. Our empirical analysis reveals that the benefit of using Long-CoT varies across problems: while some problems require elaborate reasoning, others show no improvement, or even degraded accuracy. This motivates adaptive reasoning strategies that tailor reasoning depth to the input. However, prior work primarily reduces redundancy within long reasoning paths, limiting exploration of more efficient strategies beyond the Long-CoT paradigm. To address this, we propose a novel two-stage framework for adaptive and efficient reasoning. First, we construct a hybrid reasoning model by merging long and short CoT models to enable diverse reasoning styles. Second, we apply bi-level preference training to guide the model to select suitable reasoning styles (group-level), and prefer concise and correct reasoning within each style group (instance-level). Experiments demonstrate that our method significantly reduces inference costs compared to other baseline approaches, while maintaining performance. Notably, on five mathematical datasets, the average length of reasoning is reduced by more than 50%, highlighting the potential of adaptive strategies to optimize reasoning efficiency in large language models. Our code is coming soon at https://github.com/StarDewXXX/AdaR1

最近，长期被认为是推理模型的复杂推理任务上表现优异，但往往产生大量的推理开销，使得效率成为一个关键问题。我们的实证分析显示，使用Long-CoT的好处因问题而异：虽然一些问题需要精细推理，但其他问题则没有显示出改进，甚至精度下降。这激发了适应性的推理策略，根据输入定制推理深度。然而，先前的工作主要侧重于减少长推理路径中的冗余，限制了Long-CoT模式之外更有效率策略的探索。为了解决这一问题，我们提出了一种用于自适应和高效推理的新型两阶段框架。首先，我们通过合并长短CoT模型构建混合推理模型，以实现多样化的推理风格。其次，我们应用双层偏好训练来引导模型选择适当的推理风格（群体层面），并在每种风格内偏好简洁正确的推理（实例层面）。实验表明，与其他基准方法相比，我们的方法在保持性能的同时，显著降低了推理成本。值得注意的是，在五个数学数据集上，推理的平均长度减少了超过50%，这突显了自适应策略在优化大型语言模型中的推理效率的潜力。我们的代码很快将在https://github.com/StarDewXXX/AdaR1上发布。

论文及项目相关链接

PDF

Summary
复杂推理任务中，长推理模型表现出强大的性能，但推理效率成为关键问题。研究发现使用Long-CoT的好处因问题而异，因此提出自适应推理策略，根据输入调整推理深度。为进一步提高效率，提出一种新型的两阶段框架，融合长短CoT模型，实现多样化推理风格。通过双层次偏好训练，引导模型选择合适推理风格，并在各风格内偏好简洁正确推理。实验表明，该方法在降低推理成本的同时维持性能，在五个数学数据集上平均推理长度减少超过50%。

Key Takeaways

长推理模型在复杂任务上表现优异，但效率成为关键问题。
使用Long-CoT的好处因问题而异，需要自适应推理策略。
融合长短CoT模型以提供多样化推理风格。
双层次偏好训练引导模型选择适合的推理风格。
相比其他方法，所提方法在降低推理成本的同时维持性能。
在五个数学数据集上平均推理长度减少超过50%。

Cool Papers

点此查看论文截图

Sadeed: Advancing Arabic Diacritization Through Small Language Model

Authors:Zeina Aldallal, Sara Chrouf, Khalil Hennara, Mohamed Motaism Hamed, Muhammad Hreden, Safwan AlModhayan

Arabic text diacritization remains a persistent challenge in natural language processing due to the language’s morphological richness. In this paper, we introduce Sadeed, a novel approach based on a fine-tuned decoder-only language model adapted from Kuwain 1.5B Hennara et al. [2025], a compact model originally trained on diverse Arabic corpora. Sadeed is fine-tuned on carefully curated, high-quality diacritized datasets, constructed through a rigorous data-cleaning and normalization pipeline. Despite utilizing modest computational resources, Sadeed achieves competitive results compared to proprietary large language models and outperforms traditional models trained on similar domains. Additionally, we highlight key limitations in current benchmarking practices for Arabic diacritization. To address these issues, we introduce SadeedDiac-25, a new benchmark designed to enable fairer and more comprehensive evaluation across diverse text genres and complexity levels. Together, Sadeed and SadeedDiac-25 provide a robust foundation for advancing Arabic NLP applications, including machine translation, text-to-speech, and language learning tools.

阿拉伯文本变音符号化仍然是自然语言处理中的一个持续挑战，因为该语言形态丰富。在这篇论文中，我们介绍了Sadeed，这是一种新型方法，基于Kuwain 1.5B Hennara等人于2025年开发的紧凑模型改编的微调解码器语言模型。该模型最初是在各种阿拉伯语料库上进行训练的。Sadeed在精心挑选的高质量带有变音符号的数据集上进行微调，这些数据集是通过严格的数据清理和标准化管道构建的。尽管使用了适中的计算资源，但Sadeed在与专有大型语言模型的竞争中取得了具有竞争力的结果，并且在类似领域上的表现优于传统模型。此外，我们强调了当前阿拉伯变音符号化基准测试实践中的关键局限性。为了解决这些问题，我们推出了SadeedDiac-25，这是一个新基准测试，旨在实现在不同文本类型和复杂程度上的更全面和公平的评估。因此，Sadeed和SadeedDiac-25为推进阿拉伯自然语言处理应用提供了坚实的基础，包括机器翻译、文本到语音和语言学习工具。

论文及项目相关链接

PDF

Summary

本文介绍了Sadeed，一种基于Kuwain 1.5B Hennara等人在阿拉伯语语料库上训练的紧凑模型的新型阿拉伯语文本发音符号化方法。Sadeed经过精心挑选的高质量发音符号数据集进行微调，通过严格的数据清理和标准化流程构建。相较于专有大型语言模型，Sadeed在资源使用适度的情况下取得了具有竞争力的结果，并且在类似领域上超越了传统模型。此外，本文还指出了当前阿拉伯语发音符号化评估方法的关键局限性，并引入了SadeedDiac-25新基准测试，以实现在不同文本类型和复杂程度上的更公平和全面的评估。Sadeed和SadeedDiac-25为推进阿拉伯语NLP应用，如机器翻译、文本转语音和语言学习工具等提供了坚实的基础。

Key Takeaways

本文提出了一种新型的阿拉伯语文本发音符号化方法Sadeed，基于紧凑模型进行微调实现。
Sadeed在高质量发音符号数据集上进行训练，通过严格的数据清理和标准化流程构建。
Sadeed在资源使用适度的情况下取得了与大型语言模型相当的竞争力结果。
Sadeed在类似领域上超越了传统模型。
当前阿拉伯语发音符号化的评估方法存在关键局限性。
引入了新的基准测试SadeedDiac-25，以推动更公平和全面的评估，适用于不同文本类型和复杂程度。

Cool Papers

点此查看论文截图

Advancing Arabic Reverse Dictionary Systems: A Transformer-Based Approach with Dataset Construction Guidelines

Authors:Serry Sibaee, Samar Ahmed, Abdullah Al Harbi, Omer Nacar, Adel Ammar, Yasser Habashi, Wadii Boulila

This study addresses the critical gap in Arabic natural language processing by developing an effective Arabic Reverse Dictionary (RD) system that enables users to find words based on their descriptions or meanings. We present a novel transformer-based approach with a semi-encoder neural network architecture featuring geometrically decreasing layers that achieves state-of-the-art results for Arabic RD tasks. Our methodology incorporates a comprehensive dataset construction process and establishes formal quality standards for Arabic lexicographic definitions. Experiments with various pre-trained models demonstrate that Arabic-specific models significantly outperform general multilingual embeddings, with ARBERTv2 achieving the best ranking score (0.0644). Additionally, we provide a formal abstraction of the reverse dictionary task that enhances theoretical understanding and develop a modular, extensible Python library (RDTL) with configurable training pipelines. Our analysis of dataset quality reveals important insights for improving Arabic definition construction, leading to eight specific standards for building high-quality reverse dictionary resources. This work contributes significantly to Arabic computational linguistics and provides valuable tools for language learning, academic writing, and professional communication in Arabic.

本研究通过开发有效的阿拉伯语反向词典（RD）系统，解决了阿拉伯语自然语言处理中的关键空白。该系统使用户可以根据描述或意义查找单词。我们提出了一种基于转换器的方法，采用具有几何递减层的半编码器神经网络架构，在阿拉伯语RD任务中取得了最新结果。我们的方法结合了全面的数据集构建过程，并为阿拉伯语词典定义建立了正式的质量标准。使用各种预训练模型的实验表明，阿拉伯语特定模型的表现远超通用多语言嵌入，其中ARBERTv2取得最佳排名分数（0.0644）。此外，我们对反向词典任务进行了正式抽象，增强了理论理解，并开发了一个模块化的Python库（RDTL），具有可配置的训练管道。我们对数据集质量的分析揭示了改进阿拉伯语定义构建的见解，并提出了建立高质量反向词典资源的八条特定标准。本工作对阿拉伯语计算语言学做出了重大贡献，并为阿拉伯语的语言学习、学术写作和专业交流提供了有价值的工具。

论文及项目相关链接

PDF

Summary
阿拉伯语反向词典系统研究

该研究解决了阿拉伯语自然语言处理中的关键空白，开发了一种有效的阿拉伯语反向词典（RD）系统，使用户可以通过描述或意义查找单词。研究提出了一种基于转换器的新型方法，采用具有几何递减层的半编码器神经网络架构，在阿拉伯语RD任务上取得了最新成果。该方法纳入了综合数据集构建过程，为阿拉伯语词典定义建立了正式的质量标准。实验表明，阿拉伯语特定模型明显优于通用多语言嵌入，ARBERTv2获得最佳排名得分（0.0644）。此外，研究提供了对反向词典任务的正式抽象，增强了理论理解，并开发了一个模块化的Python库（RDTL），具有可配置的训练管道。对数据集质量的分析揭示了改进阿拉伯语定义构建的重要见解，并为构建高质量反向词典资源提供了八项特定标准。这项研究对阿拉伯语计算语言学做出了重大贡献，并为阿拉伯语的语言学习、学术写作和专业交流提供了有价值的工具。

Key Takeaways

研究开发了有效的阿拉伯语反向词典系统，允许用户通过描述或意义查找单词。
采用新型基于转换器的半编码器神经网络架构，实现阿拉伯语RD任务的最新成果。
阿拉伯语特定模型在实验中表现优于通用多语言嵌入，ARBERTv2获得最佳排名得分。
研究提供了反向词典任务的正式抽象，增强理论理解。
开发了一个模块化的Python库（RDTL），具有可配置的训练管道。
分析揭示了改进阿拉伯语定义构建的重要见解。
研究为构建高质量反向词典资源提出了八项特定标准，对阿拉伯语计算语言学做出重大贡献。

Cool Papers

点此查看论文截图

Uncertainty Quantification for Language Models: A Suite of Black-Box, White-Box, LLM Judge, and Ensemble Scorers

Authors:Dylan Bouchard, Mohit Singh Chauhan

Hallucinations are a persistent problem with Large Language Models (LLMs). As these models become increasingly used in high-stakes domains, such as healthcare and finance, the need for effective hallucination detection is crucial. To this end, we propose a versatile framework for zero-resource hallucination detection that practitioners can apply to real-world use cases. To achieve this, we adapt a variety of existing uncertainty quantification (UQ) techniques, including black-box UQ, white-box UQ, and LLM-as-a-Judge, transforming them as necessary into standardized response-level confidence scores ranging from 0 to 1. To enhance flexibility, we introduce a tunable ensemble approach that incorporates any combination of the individual confidence scores. This approach enables practitioners to optimize the ensemble for a specific use case for improved performance. To streamline implementation, the full suite of scorers is offered in this paper’s companion Python toolkit, UQLM. To evaluate the performance of the various scorers, we conduct an extensive set of experiments using several LLM question-answering benchmarks. We find that our tunable ensemble typically surpasses its individual components and outperforms existing hallucination detection methods. Our results demonstrate the benefits of customized hallucination detection strategies for improving the accuracy and reliability of LLMs.

幻觉（hallucinations）是大型语言模型（LLM）持续存在的问题。随着这些模型在医疗保健和金融等高风险领域的使用越来越普遍，对有效的幻觉检测的需求变得至关重要。为此，我们提出了一种用于零资源幻觉检测的通用框架，实践者可以将其应用于现实世界的用例。为实现这一目标，我们采用了多种现有的不确定性量化（UQ）技术，包括黑盒UQ、白盒UQ和LLM-as-a-Judge，按需将它们转化为标准化的响应级置信度分数，范围从0到1。为了提高灵活性，我们引入了一种可调集成方法，该方法可结合任何单个置信度分数的组合。这种方法使实践者能够为特定的用例优化集成，以提高性能。为了简化实施过程，整套评分器都包含在本文的配套Python工具包UQLM中。为了评估各种评分器的性能，我们使用多个LLM问答基准进行了一系列广泛的实验。我们发现，我们的可调集成通常超过其各个组件的性能，并且优于现有的幻觉检测方法。我们的结果证明了为改善LLM的准确性和可靠性而量身定制的幻觉检测策略的好处。

论文及项目相关链接

PDF UQLM repository: https://github.com/cvs-health/uqlm

Summary

该文提出了一种零资源环境下的幻觉检测框架，用于提高大型语言模型（LLM）在高风险领域（如医疗和金融）中的准确性和可靠性。通过适应多种不确定性量化（UQ）技术，并引入可调集成方法，实现对LLM幻觉的有效检测。实验结果表明，该集成方法通常优于其单个组件，并优于现有的幻觉检测方法。

Key Takeaways

大型语言模型（LLM）在高风险领域中的幻觉问题亟待解决。
提出了一种零资源幻觉检测框架，以适应多种不确定性量化（UQ）技术。
框架包括黑盒UQ、白盒UQ和LLM-as-a-Judge等技术，转化为标准化的响应级别信心分数。
引入了一种可调集成方法，能够结合任何单个置信度分数组合，针对特定用例进行优化。
完整的评分者套件包含在论文的Python工具包UQLM中，便于实施。
通过多组实验验证了各种评分者的性能，表明可调集成方法通常优于其单个组件。

Cool Papers

点此查看论文截图

VisualPuzzles: Decoupling Multimodal Reasoning Evaluation from Domain Knowledge

Authors:Yueqi Song, Tianyue Ou, Yibo Kong, Zecheng Li, Graham Neubig, Xiang Yue

Current multimodal benchmarks often conflate reasoning with domain-specific knowledge, making it difficult to isolate and evaluate general reasoning abilities in non-expert settings. To address this, we introduce VisualPuzzles, a benchmark that targets visual reasoning while deliberately minimizing reliance on specialized knowledge. VisualPuzzles consists of diverse questions spanning five categories: algorithmic, analogical, deductive, inductive, and spatial reasoning. One major source of our questions is manually translated logical reasoning questions from the Chinese Civil Service Examination. Experiments show that VisualPuzzles requires significantly less intensive domain-specific knowledge and more complex reasoning compared to benchmarks like MMMU, enabling us to better evaluate genuine multimodal reasoning. Evaluations show that state-of-the-art multimodal large language models consistently lag behind human performance on VisualPuzzles, and that strong performance on knowledge-intensive benchmarks does not necessarily translate to success on reasoning-focused, knowledge-light tasks. Additionally, reasoning enhancements such as scaling up inference compute (with “thinking” modes) yield inconsistent gains across models and task types, and we observe no clear correlation between model size and performance. We also found that models exhibit different reasoning and answering patterns on VisualPuzzles compared to benchmarks with heavier emphasis on knowledge. VisualPuzzles offers a clearer lens through which to evaluate reasoning capabilities beyond factual recall and domain knowledge.

当前的多模态基准测试通常将推理与特定领域的知识混淆，在非专业环境中很难隔离和评估一般的推理能力。为了解决这个问题，我们推出了VisualPuzzles，这是一个以视觉推理为目标，同时故意减少对专业知识依赖的基准测试。VisualPuzzles包含五个类别的问题：算法、类比、演绎、归纳和空间推理。我们的问题主要来源于中国公务员考试中手动翻译的逻辑推理问题。实验表明，VisualPuzzles相比MMMU等基准测试，需要更少的特定领域知识和更复杂的推理能力，使我们能够更好地评估真正的多模态推理。评估显示，最先进的多媒体语言模型在VisualPuzzles上的表现始终落后于人类的表现，而且在知识密集型基准测试上的出色表现并不一定能在注重推理、轻知识的任务上取得成功。此外，增加推理能力的增强方法（如扩大推理计算规模（采用“思考”模式））在不同模型和任务类型之间产生了不一致的收益，并且我们观察到模型大小与性能之间没有明显的相关性。我们还发现，与重点强调知识的基准测试相比，模型在VisualPuzzles上表现出不同的推理和回答模式。VisualPuzzles提供了一个更清晰的视角来评估超越事实记忆和领域知识的推理能力。

论文及项目相关链接

PDF 56 pages, 43 figures

摘要

本论文介绍了VisualPuzzles这一视觉推理基准测试的设计思路和主要成果。该研究指出当前多模态基准测试往往混淆了推理和领域专业知识，使得在非专业环境下难以评估通用的推理能力。为了解决这个问题，研究团队引入了VisualPuzzles基准测试，旨在专注于视觉推理，并尽量减少对专业知识的依赖。该基准测试包含涵盖算法推理、类比推理、演绎推理、归纳推理和空间推理五个类别的问题，并且许多问题的灵感来源于中国政府服务考试的逻辑推理题。研究结果显示，与现有基准测试相比，VisualPuzzles测试依赖于更少的专业知识而更注重推理能力。在最新推出的多模态大型语言模型测试中，基于VisualPuzzles的评估结果显示与人类的性能表现仍有一定差距，同时不同模型在处理推理密集型任务时的表现也存在较大差异。另外，增强模型推理能力的措施在不同模型和任务上的表现不一致，而模型规模与性能之间并无明确关联。总的来说，VisualPuzzles提供了一个更清晰的视角来评估超越事实记忆和领域知识的推理能力。

关键见解

当前多模态基准测试混淆了推理和领域专业知识，使得评估通用推理能力变得困难。
VisualPuzzles基准测试旨在专注于视觉推理，减少对专业知识的依赖。
VisualPuzzles包含多种逻辑推理类别的问题，如算法、类比、演绎、归纳和空间推理。
VisualPuzzles测试的问题部分来源于中国政府服务考试的逻辑推理题。
与其他基准测试相比，VisualPuzzles更侧重于评估推理能力而非专业知识。
最新多模态大型语言模型在VisualPuzzles上的表现落后于人类，且不同模型间的表现存在差异。
增强模型推理能力的措施效果不一，且模型规模与性能之间并无明确关联。

Cool Papers

点此查看论文截图

LabTOP: A Unified Model for Lab Test Outcome Prediction on Electronic Health Records

Authors:Sujeong Im, Jungwoo Oh, Edward Choi

Lab tests are fundamental for diagnosing diseases and monitoring patient conditions. However, frequent testing can be burdensome for patients, and test results may not always be immediately available. To address these challenges, we propose LabTOP, a unified model that predicts lab test outcomes by leveraging a language modeling approach on EHR data. Unlike conventional methods that estimate only a subset of lab tests or classify discrete value ranges, LabTOP performs continuous numerical predictions for a diverse range of lab items. We evaluate LabTOP on three publicly available EHR datasets and demonstrate that it outperforms existing methods, including traditional machine learning models and state-of-the-art large language models. We also conduct extensive ablation studies to confirm the effectiveness of our design choices. We believe that LabTOP will serve as an accurate and generalizable framework for lab test outcome prediction, with potential applications in clinical decision support and early detection of critical conditions.

实验室测试对于疾病的诊断和病人状况的监测至关重要。然而，频繁的测试对病人来说可能是一种负担，而且测试结果并非总能立即获得。为了应对这些挑战，我们提出了LabTOP，这是一个通过电子健康记录（EHR）数据采用语言建模方法进行实验室测试结果预测的统合模型。不同于仅估计部分实验室测试或对离散值范围进行分类的传统方法，LabTOP可以对多种实验室项目进行连续的数值预测。我们在三个公开的EHR数据集上评估了LabTOP，并证明其表现优于现有方法，包括传统的机器学习模型以及最先进的大型语言模型。我们还进行了广泛的消融研究以验证我们的设计选择。我们相信LabTOP将作为实验室测试结果预测的准确且可推广的框架，在临床决策支持和关键疾病的早期检测中具有潜在应用价值。

论文及项目相关链接

PDF 11 pages for main text, 13 pages for appendix

Summary：
实验室测试是诊断疾病和监测患者状况的基础，但频繁测试对患者造成负担，且结果并非总能即时得知。为解决这些问题，我们提出LabTOP模型，通过电子健康记录数据的语言建模方法预测实验室测试结果。不同于仅估计部分测试或分类离散值范围的传统方法，LabTOP可进行各种实验室项目的连续数值预测。我们在三个公开可用的电子健康记录数据集上评估LabTOP，证明其优于传统机器学习和目前顶尖的大型语言模型。我们相信LabTOP将成为准确且可推广的实验室测试结果预测框架，可应用于临床决策支持和关键疾病的早期检测。

Key Takeaways：

LabTOP模型利用语言建模在电子健康记录数据上进行预测实验室测试结果。
LabTOP能够针对各种实验室项目做出连续数值预测。
与传统方法和现有大型语言模型相比，LabTOP表现出优越性能。
LabTOP的设计选择经过广泛的消融研究验证其有效性。
LabTOP模型具有准确度和泛化能力，可应用于临床决策支持。
LabTOP有助于减轻患者的频繁测试负担和及时得知结果的问题。

Cool Papers

点此查看论文截图

T2VEval: Benchmark Dataset and Objective Evaluation Method for T2V-generated Videos

Authors:Zelu Qi, Ping Shi, Shuqi Wang, Chaoyang Zhang, Fei Zhao, Zefeng Ying, Da Pan, Xi Yang, Zheqi He, Teng Dai

Recent advances in text-to-video (T2V) technology, as demonstrated by models such as Runway Gen-3, Pika, Sora, and Kling, have significantly broadened the applicability and popularity of the technology. This progress has created a growing demand for accurate quality assessment metrics to evaluate the perceptual quality of T2V-generated videos and optimize video generation models. However, assessing the quality of text-to-video outputs remain challenging due to the presence of highly complex distortions, such as unnatural actions and phenomena that defy human cognition. To address these challenges, we constructed T2VEval-Bench, a multi-dimensional benchmark dataset for text-to-video quality evaluation, which contains 148 textual prompts and 1,783 videos generated by 13 T2V models. To ensure a comprehensive evaluation, we scored each video on four dimensions in the subjective experiment, which are overall impression, text-video consistency, realness, and technical quality. Based on T2VEval-Bench, we developed T2VEval, a multi-branch fusion scheme for T2V quality evaluation. T2VEval assesses videos across three branches: text-video consistency, realness, and technical quality. Using an attention-based fusion module, T2VEval effectively integrates features from each branch and predicts scores with the aid of a large language model. Additionally, we implemented a divide-and-conquer training strategy, enabling each branch to learn targeted knowledge while maintaining synergy with the others. Experimental results demonstrate that T2VEval achieves state-of-the-art performance across multiple metrics.

近期文本转视频（T2V）技术的进展，由诸如跑道Gen-3、皮卡、索拉和克林模型所展示，已经显著地拓宽了这项技术的应用范围和受欢迎程度。这一进展催生了对准确的质量评估指标的日益增长的需求，以评估T2V生成视频的感知质量并优化视频生成模型。然而，评估文本转视频输出的质量仍然具有挑战性，因为存在高度复杂的失真，例如不自然的行为和违背人类认知的现象。为了应对这些挑战，我们构建了T2VEval-Bench，这是一个用于文本转视频质量评估的多维基准数据集，包含148个文本提示和由13个T2V模型生成的1,783个视频。为了确保全面的评估，我们在主观实验中对每个视频在四个维度上进行了评分，分别是整体印象、文本与视频的一致性、真实性和技术质量。基于T2VEval-Bench，我们开发了T2VEval，这是一个用于T2V质量评估的多分支融合方案。T2VEval通过三个分支来评估视频：文本与视频的一致性、真实性和技术质量。通过使用基于注意力的融合模块，T2VEval有效地集成了每个分支的特征，并借助大型语言模型进行预测评分。此外，我们实施了分而治之的训练策略，使每个分支能够学习有针对性的知识，同时与其他分支保持协同。实验结果表明，T2VEval在多个指标上达到了最先进的性能。

论文及项目相关链接

PDF

Summary

文本转视频（T2V）技术的最新进展，如Runway Gen-3、Pika、Sora和Kling等模型，大大拓宽了T2V技术的应用范围和受欢迎程度。为评估T2V生成视频的感知质量并优化视频生成模型，提出了T2VEval-Bench，一个多维度的T2V质量评估基准数据集，包含148个文本提示和1783个由13个T2V模型生成的视频。为全面评估视频质量，我们在主观实验中对视频进行了四个维度的评分：整体印象、文本与视频一致性、真实性和技术质量。基于T2VEval-Bench，开发了T2VEval，一个用于T2V质量评估的多分支融合方案。T2VEval通过注意力基础的融合模块，有效整合各分支的特征，借助大型语言模型进行评分预测。此外，实施了分而治之的训练策略，使各分支能够学习针对性知识，同时保持彼此间的协同。实验结果表明，T2VEval在多个指标上达到最新技术水平。

Key Takeaways

T2V技术的最新进展展示了广泛的应用和受欢迎程度，产生了对评估其生成视频质量的需求。
T2VEval-Bench是一个多维度数据集，包含文本提示和由不同T2V模型生成的视频，用于全面评估视频质量。
视频质量评估包括四个维度：整体印象、文本与视频一致性、真实性和技术质量。
T2VEval是一个多分支融合方案，用于T2V质量评估，通过注意力机制整合不同分支的特征。
T2VEval借助大型语言模型进行评分预测，提高评估准确性。
实施分而治之的训练策略，使各分支能够针对性地学习知识，并保持彼此间的协同。

Cool Papers

点此查看论文截图

Adsorb-Agent: Autonomous Identification of Stable Adsorption Configurations via Large Language Model Agent

Authors:Janghoon Ock, Tirtha Vinchurkar, Yayati Jadhav, Amir Barati Farimani

Adsorption energy is a key reactivity descriptor in catalysis, enabling efficient screening for optimal catalysts. However, determining adsorption energy typically requires evaluating numerous adsorbate-catalyst configurations. Current algorithmic approaches rely on exhaustive enumeration of adsorption sites and configurations, which makes the process computationally intensive and does not inherently guarantee the identification of the global minimum energy. In this work, we introduce Adsorb-Agent, a Large Language Model (LLM) agent designed to efficiently identify system-specific stable adsorption configurations corresponding to the global minimum adsorption energy. Adsorb-Agent leverages its built-in knowledge and emergent reasoning capabilities to strategically explore adsorption configurations likely to hold adsorption energy. By reducing the reliance on exhaustive sampling, it significantly decreases the number of initial configurations required while improving the accuracy of adsorption energy predictions. We evaluate Adsorb-Agent’s performance across twenty representative systems encompassing a range of complexities. The Adsorb-Agent successfully identifies comparable adsorption energies for 83.7% of the systems and achieves lower energies, closer to the actual global minimum, for 35% of the systems, while requiring significantly fewer initial configurations than conventional methods. Its capability is particularly evident in complex systems, where it identifies lower adsorption energies for 46.7% of systems involving intermetallic surfaces and 66.7% of systems with large adsorbate molecules. These results demonstrate the potential of Adsorb-Agent to accelerate catalyst discovery by reducing computational costs and improving the reliability of adsorption energy predictions.

吸附能是催化中的关键反应性描述符，能够实现高效筛选最佳催化剂。然而，确定吸附能通常需要评估众多的吸附物-催化剂组合。当前的算法方法依赖于对吸附位点和配置的详尽枚举，这使得过程计算量大，并且不保证能找到全局最低能量。在这项工作中，我们引入了Adsorb-Agent，这是一种大型语言模型（LLM）代理，旨在有效地识别与全局最低吸附能量相对应的特定系统稳定的吸附配置。Adsorb-Agent利用其内置知识和新兴推理能力来有策略地探索可能包含吸附能的吸附配置。通过减少对详尽采样的依赖，它在减少所需初始配置数量的同时提高了吸附能预测的准确性。我们评估了Adsorb-Agent在涵盖各种复杂性的二十个代表性系统上的性能。Adsorb-Agent成功地为83.7%的系统确定了可比的吸附能，并为35%的系统实现了更接近实际全局最小值更低的能量，同时所需的初始配置数量远远少于传统方法。其能力在复杂系统中尤为突出，在涉及金属间表面的系统中为46.7%的系统找到了较低的吸附能，在具有大吸附分子的系统中为66.7%的系统找到了较低的吸附能。这些结果证明了Adsorb-Agent在减少计算成本和提高吸附能预测可靠性方面的潜力，从而有望加速催化剂的发现。

论文及项目相关链接

PDF

Summary

吸附能是催化中的关键反应性描述符，可用于有效筛选最佳催化剂。然而，确定吸附能通常需要评估众多的吸附物-催化剂构型。当前算法方法依赖于吸附位点和构型的详尽列举，这使得过程计算密集，并且不保证找到全局最低能量。在此工作中，我们引入了吸附剂Agent，这是一个大型语言模型（LLM）代理，被设计用于有效识别与全局最低吸附能对应的系统特定稳定吸附构型。吸附剂Agent利用其内置知识和新兴推理能力来策略性地探索可能含有吸附能的吸附构型。通过减少对详尽采样的依赖，它在减少所需初始构型数量的同时提高了吸附能预测的准确性。我们对吸附剂Agent在二十个代表性系统上的性能进行了评估，这些系统的复杂性各不相同。吸附剂Agent成功地为83.7%的系统确定了可比的吸附能，并为35%的系统达到了更接近实际全局最小值更低的能量，同时所需的初始构型数量远少于传统方法。其在复杂系统中的能力尤为突出，为涉及金属间表面的系统降低了46.7%的吸附能，为大吸附物分子的系统降低了66.7%的吸附能。这些结果展示了吸附剂Agent在通过减少计算成本和提高吸附能预测可靠性来加速催化剂发现方面的潜力。

Key Takeaways

吸附能是催化中的关键反应性描述符，用于筛选最佳催化剂。
当前算法通过详尽列举吸附位点和构型来确定吸附能，但此方法计算密集并不保证找到全局最低能量。
引入了一种新型大型语言模型（LLM）代理——吸附剂Agent，可高效识别系统特定的稳定吸附构型。
吸附剂Agent利用内置知识和新兴推理能力，策略性地探索可能含有吸附能的吸附构型。
相比传统方法，吸附剂Agent减少了初始构型的需求，提高了吸附能预测的准确性。
在复杂系统中，吸附剂Agent表现突出，为涉及金属间表面和大吸附物分子的系统降低了吸附能。

Cool Papers

点此查看论文截图

Extracting and Transferring Abilities For Building Multi-lingual Ability-enhanced Large Language Models

Authors:Zhipeng Chen, Kun Zhou, Liang Song, Wayne Xin Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

Multi-lingual ability transfer has become increasingly important for the broad application of large language models (LLMs). Existing work highly relies on training with the multi-lingual ability-related data, which may be not available for low-resource languages. To solve it, we propose a Multi-lingual Ability Extraction and Transfer approach, named as MAET. Our key idea is to decompose and extract language-agnostic ability-related weights from LLMs, and transfer them across different languages by simple addition and subtraction operations without training. Specially, our MAET consists of the extraction and transfer stages. In the extraction stage, we firstly locate key neurons that are highly related to specific abilities, and then employ them to extract the transferable ability-specific weights. In the transfer stage, we further select the ability-related parameter tensors, and design the merging strategy based on the linguistic and ability specific weights, to build the multi-lingual ability-enhanced LLM. To demonstrate the effectiveness of our proposed approach, we conduct extensive experiments on mathematical and scientific tasks in both high-resource lingual and low-resource lingual scenarios. Experiment results have shown that MAET can effectively and efficiently extract and transfer the advanced abilities, and outperform training-based baseline methods. Our code and data are available at https://github.com/RUCAIBox/MAET.

多语言能力的迁移对于大型语言模型（LLM）的广泛应用越来越重要。现有工作高度依赖于多语言能力相关数据来进行训练，这对于低资源语言可能无法获得。为了解决这个问题，我们提出了一种多语言能力提取与迁移方法，命名为MAET。我们的核心思想是从LLM中分解并提取与语言无关的能力相关权重，通过简单的加减操作在不同语言之间进行迁移，无需训练。特别是，我们的MAET由提取和迁移两个阶段组成。在提取阶段，我们首先要找到与特定能力高度相关的关键神经元，然后利用它们来提取可迁移的能力特定权重。在迁移阶段，我们进一步选择与能力相关的参数张量，并设计基于语言和能力特定权重的合并策略，以构建多语言能力增强的LLM。为了验证我们提出的方法的有效性，我们在高资源语言和低资源语言场景下对数学和科学任务进行了大量实验。实验结果表明，MAET可以有效且高效地提取和迁移高级能力，并优于基于训练的方法。我们的代码和数据可在https://github.com/RUCAIBox/MAET找到。

论文及项目相关链接

PDF 17 Pages. Working in progress

Summary

本文提出了一种名为MAET的多语言能力提取与迁移方法，用于提高大型语言模型（LLM）在不同语言环境下的应用能力。该方法通过分解和提取语言无关的能力相关权重，实现跨不同语言的迁移，无需额外训练。通过提取阶段找到与特定能力高度相关的关键神经元，并据此提取可迁移的能力特定权重。在迁移阶段，选择能力相关的参数张量，并设计基于语言和能力的权重合并策略，以构建多语言能力增强型LLM。实验证明，MAET能有效提取和迁移高级能力，并在高资源语言和低资源语言场景中超越基于训练的基础方法。

Key Takeaways

MAET方法旨在解决多语言环境下LLM应用能力的问题，特别是在低资源语言环境中。
MAET通过分解和提取LLM中的语言无关能力相关权重，实现跨不同语言的迁移。
方法包括提取和迁移两个阶段，提取阶段找到与特定能力相关的关键神经元，迁移阶段则选择能力相关参数张量并设计合并策略。
MAET方法在多种语言和科学任务上进行了广泛实验验证，证明了其有效性和优越性。
MAET方法实现了高效的能力提取和迁移，超越了基于训练的基础方法。

Cool Papers

点此查看论文截图

A Formal Framework for Understanding Length Generalization in Transformers

Authors:Xinting Huang, Andy Yang, Satwik Bhattamishra, Yash Sarrof, Andreas Krebs, Hattie Zhou, Preetum Nakkiran, Michael Hahn

A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.

对于转换器模型来说，一个主要的挑战是将其推广到训练时未见过的更长的序列。尽管之前的工作从实证角度证明了转换器在某些任务上可以在长度推广方面取得成功或失败，但对于这一现象的理论理解仍然有限。在这项工作中，我们引入了一个严格的理论框架来分析带有可学习绝对位置编码的因果转换器的长度推广问题。特别是，我们刻画了在理想化推理方案下，使用基于范数的正则化器，从足够长的输入中识别出的那些具有绝对位置编码的函数特性。这使得我们能够证明对于一系列丰富的问题进行长度推广的可能性。我们通过实验验证了该理论在预测长度推广成功和失败方面的能力，涵盖了一系列算法和形式语言任务。我们的理论不仅解释了大量实证观察结果，而且为预测转换器中的长度推广能力开辟了道路。

论文及项目相关链接

PDF 85 pages, 9 figures, 11 tables. Accepted for publication at ICLR 2025

Summary
本文引入了一个严谨的理论框架，用于分析具有可学习绝对位置编码的因果Transformer中的长度泛化问题。通过对足够长输入下通过理想化推理方案利用绝对位置编码识别出的函数特性进行描述，本文证明了在丰富的任务集中实现长度泛化的可能性。同时，实验验证了该理论在算法和自然语言任务中预测长度泛化成功与失败的准确性。本文不仅解释了广泛的实验观察结果，而且为预测Transformer的长度泛化能力开辟了道路。

Key Takeaways

本文引入了理论框架来分析因果Transformer中的长度泛化问题。
通过描述足够长输入下利用绝对位置编码识别函数的特性，证明了在丰富任务集中实现长度泛化的可能性。
提供了对长度泛化现象的严谨理论理解。
实验验证了理论在预测长度泛化成功与失败方面的准确性。
该理论不仅解释了广泛的实验观察结果，而且为预测Transformer的性能提供了依据。
该研究强调了绝对位置编码在长度泛化中的重要性。

Cool Papers

点此查看论文截图

Revise, Reason, and Recognize: LLM-Based Emotion Recognition via Emotion-Specific Prompts and ASR Error Correction

Authors:Yuanchao Li, Yuan Gong, Chao-Han Huck Yang, Peter Bell, Catherine Lai

Annotating and recognizing speech emotion using prompt engineering has recently emerged with the advancement of Large Language Models (LLMs), yet its efficacy and reliability remain questionable. In this paper, we conduct a systematic study on this topic, beginning with the proposal of novel prompts that incorporate emotion-specific knowledge from acoustics, linguistics, and psychology. Subsequently, we examine the effectiveness of LLM-based prompting on Automatic Speech Recognition (ASR) transcription, contrasting it with ground-truth transcription. Furthermore, we propose a Revise-Reason-Recognize prompting pipeline for robust LLM-based emotion recognition from spoken language with ASR errors. Additionally, experiments on context-aware learning, in-context learning, and instruction tuning are performed to examine the usefulness of LLM training schemes in this direction. Finally, we investigate the sensitivity of LLMs to minor prompt variations. Experimental results demonstrate the efficacy of the emotion-specific prompts, ASR error correction, and LLM training schemes for LLM-based emotion recognition. Our study aims to refine the use of LLMs in emotion recognition and related domains.

随着大型语言模型（LLMs）的进步，使用提示工程进行语音情绪标注和识别最近开始浮现，但其效果和可靠性仍然值得怀疑。在本文中，我们对这一主题进行了系统研究，首先提出了融入声学、语言学和心理学的特定情绪知识的新提示。随后，我们研究了基于LLM的提示在自动语音识别（ASR）转录中的有效性，并将其与真实转录进行了对比。此外，我们还为存在ASR错误的鲁棒性LLM情绪识别提出了修订-推理-识别的提示管道。另外，我们还在上下文感知学习、上下文内学习和指令调整方面进行了实验，以检验LLM训练方案在此方向上的实用性。最后，我们研究了LLMs对微小提示变化的敏感性。实验结果证明了特定情绪提示、ASR错误修正和LLM训练方案在基于LLM的情绪识别中的有效性。我们的研究旨在完善LLM在情绪识别和相关领域的应用。

论文及项目相关链接

PDF Accepted to ICASSP 2025

Summary

随着大型语言模型（LLM）的发展，通过提示工程进行语音情绪注解和识别已崭露头角，但其有效性和可靠性仍存疑。本文对此进行系统研究，首先提出融合声学、语言学及心理学中情绪特定知识的新提示。接着，我们研究了基于LLM的提示对自动语音识别（ASR）转录的有效性，并与真实转录进行了对比。此外，我们还提出了一个Revise-Reason-Recognize提示管道，用于从带有ASR错误的口语中稳健地识别情绪。实验研究了上下文感知学习、即时学习和指令微调等LLM训练方案的有用性。最后，我们研究了LLM对微小提示变化的敏感性。实验结果表明，情绪特定提示、ASR错误校正和LLM训练方案在基于LLM的情绪识别中有效。本研究旨在完善LLM在情绪识别及相关领域的应用。

Key Takeaways

LLM可用于语音情绪识别，但其有效性和可靠性需要深入研究。
融合声学、语言学和心理学中的情绪特定知识能提高LLM在语音情绪识别方面的性能。
基于LLM的提示对ASR转录的有效性进行了研究，并与真实转录进行了对比。
提出了一个Revise-Reason-Recognize提示管道，用于从带有ASR错误的口语中稳健地识别情绪。
上下文感知学习、即时学习和指令微调等LLM训练方案在语音情绪识别中表现出潜力。
LLM对微小提示变化具有敏感性，这为提高其性能提供了方向。
实验结果表明，情绪特定提示、ASR错误校正和LLM训练方案在基于LLM的情绪识别中是有效的。

Cool Papers

点此查看论文截图

Can We Trust Embodied Agents? Exploring Backdoor Attacks against Embodied LLM-based Decision-Making Systems

Authors:Ruochen Jiao, Shaoyuan Xie, Justin Yue, Takami Sato, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu

Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning process introduces considerable safety and security vulnerabilities, especially in safety-critical cyber-physical systems. In this work, we propose the first comprehensive framework for Backdoor Attacks against LLM-based Decision-making systems (BALD) in embodied AI, systematically exploring the attack surfaces and trigger mechanisms. Specifically, we propose three distinct attack mechanisms: word injection, scenario manipulation, and knowledge injection, targeting various components in the LLM-based decision-making pipeline. We perform extensive experiments on representative LLMs (GPT-3.5, LLaMA2, PaLM2) in autonomous driving and home robot tasks, demonstrating the effectiveness and stealthiness of our backdoor triggers across various attack channels, with cases like vehicles accelerating toward obstacles and robots placing knives on beds. Our word and knowledge injection attacks achieve nearly 100% success rate across multiple models and datasets while requiring only limited access to the system. Our scenario manipulation attack yields success rates exceeding 65%, reaching up to 90%, and does not require any runtime system intrusion. We also assess the robustness of these attacks against defenses, revealing their resilience. Our findings highlight critical security vulnerabilities in embodied LLM systems and emphasize the urgent need for safeguarding these systems to mitigate potential risks.

大型语言模型（LLM）在实体人工智能的决策任务中展现出巨大潜力，特别是在微调以利用其固有的常识和推理能力的同时，针对特定应用进行定制时。然而，这种微调过程引入了相当大的安全和保密漏洞，特别是在安全关键的网物系统。在这项工作中，我们提出了针对实体人工智能中基于LLM的决策系统的首个全面的后门攻击框架（BALD），系统地探索攻击面和触发机制。具体来说，我们提出了三种独特的攻击机制：单词注入、场景操控和知识注入，它们分别针对基于LLM的决策管道的不同组件。我们在自动驾驶和家庭机器人任务中对代表性LLM（GPT-3.5、LLaMA2、PaLM2）进行了广泛实验，证明了我们的后门触发在多种攻击通道中的有效性和隐蔽性，案例包括车辆加速撞向障碍物和机器人把刀放在床上。我们的词汇和知识注入攻击在多模型和数据集上达到了近100%的成功率，同时只需要对系统有限的访问权限。我们的场景操控攻击的成功率超过65%，甚至达到90%，而且不需要任何运行时系统入侵。我们还评估了这些攻击对防御的稳健性，证明了它们的韧性。我们的研究发现了实体LLM系统中关键的安全漏洞，并强调了保护这些系统以减轻潜在风险的迫切需求。

论文及项目相关链接

PDF Accepted paper at ICLR 2025, 31 pages, including main paper, references, and appendix

Summary

大型语言模型（LLM）在实体人工智能的决策任务中展现出巨大潜力，尤其是在精细化调整以利用其内在常识和推理能力的同时，针对特定应用进行定制时。然而，这种微调过程引入了巨大的安全和保障漏洞，特别是在安全关键性的网络物理系统中。本研究首次提出了针对实体人工智能中基于LLM的决策系统的后门攻击全面框架（BALD），系统地探索攻击面和触发机制。我们提出三种独特的攻击机制：单词注入、场景操纵和知识注入，针对LLM决策制定流程中的不同组件。在自动驾驶和家庭机器人任务中对代表性LLM（GPT-3.5、LLaMA2、PaLM2）进行的广泛实验表明，我们的后门触发在多种攻击通道中有效且隐蔽，出现了车辆加速撞向障碍物和机器人在床上放置刀具等案例。我们的研究揭示了实体LLM系统中的关键安全漏洞，并强调了保障这些系统以减轻潜在风险的紧迫需求。

Key Takeaways

LLM在实体人工智能的决策任务中具有显著潜力，但在微调过程中存在安全和保障漏洞。
首次提出针对LLM决策系统的后门攻击全面框架BALD。
提出三种攻击机制：单词注入、场景操纵和知识注入，针对LLM决策流程的不同部分。
实验证明攻击机制在多种LLM和实际应用场景中有效且隐蔽。
攻击可导致严重后果，如车辆冲向障碍物和机器人危险行为。
现有防御措施对攻击机制的抵御能力有限。
研究强调了保障LLM系统的紧迫需求以减轻潜在风险。

Cool Papers

点此查看论文截图

Emergence of a High-Dimensional Abstraction Phase in Language Transformers

Authors:Emily Cheng, Diego Doimo, Corentin Kervadec, Iuri Macocco, Jade Yu, Alessandro Laio, Marco Baroni

A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.

语言模型（LM）是将语言上下文映射到输出标记的过程。然而，关于这种映射还有很多未知，包括其几何属性如何与其功能相关联。我们采用高级几何方法对其进行分析，观察到五个基于转换器的预训练语言模型和三个输入数据集之间存在一个以高内在维度为特征的明显阶段。在这个阶段中，表示（1）对应于输入的第一个完整语言抽象；（2）是第一个切实转移到下游任务的表示；（3）在不同语言模型中相互预测。此外，我们发现阶段提前出现强烈预示着更好的语言建模性能。总之，我们的结果表明，核心语言处理在许多常见的LM架构中都存在一个中央高维度阶段。

论文及项目相关链接

PDF Published as conference paper at ICLR 2025

Summary：语言模型（LM）将语言上下文映射到输出标记上，但其映射的几何特性与功能之间的关系仍知之甚少。本文采用高级几何方法分析预训练的基于变压器的语言模型，发现存在一个以高内在维度为特征的明显阶段。在这一阶段，语言模型的表示对应于输入的首个完整语言抽象，可实际应用于下游任务，并在不同语言模型之间彼此预测。早期进入此阶段的语言模型预测性能更佳。简而言之，本文结果表明，核心语言处理在许多常见语言模型架构中都存在以高维度为中心的阶段。

Key Takeaways：

语言模型将语言上下文映射到输出标记上，但其映射的几何特性与功能之间的关系仍不清楚。
预训练的基于变压器的语言模型存在以高内在维度为特征的明显阶段。
在这一阶段，语言模型的表示对应于输入的首个完整语言抽象。
这一阶段的表示可实际应用于下游任务，并在不同语言模型之间彼此预测。
早期进入高维度阶段的语言模型具有更好的语言建模性能。
核心语言处理在许多常见语言模型架构中都存在以高维度为中心的阶段。

Cool Papers

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-05-02/LLM/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

LLM

Agent

Agent 方向最新论文已更新，请持续关注 Update in 2025-05-02 SWE-smith Scaling Data for Software Engineering Agents

2025-05-02 Agent

Agent

R1_Reasoning

R1_Reasoning 方向最新论文已更新，请持续关注 Update in 2025-05-02 DeepSeek-Prover-V2 Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition

2025-05-02 R1_Reasoning

R1_Reasoning