LLM

Published: 2025-02-28

Updated: 2025-05-14

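⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings; they are intended only for preliminary screening before reading the papers.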

Updated 2025-02-28

Norm Growth and Stability Challenges in Localized Sequential Knowledge Editing

Authors:Akshat Gupta, Christine Fang, Atahan Ozdemir, Maochuan Lu, Ahmed Alaa, Thomas Hartvigsen, Gopala Anumanchipalli

This study investigates the impact of localized updates to large language models (LLMs), specifically in the context of knowledge editing - a task aimed at incorporating or modifying specific facts without altering broader model capabilities. We first show that across different post-training interventions like continuous pre-training, full fine-tuning and LoRA-based fine-tuning, the Frobenius norm of the updated matrices always increases. This increasing norm is especially detrimental for localized knowledge editing, where only a subset of matrices are updated in a model. We reveal a consistent phenomenon across various editing techniques, including fine-tuning, hypernetwork-based approaches, and locate-and-edit methods: the norm of the updated matrix invariably increases with successive updates. Such growth disrupts model balance, particularly when isolated matrices are updated while the rest of the model remains static, leading to potential instability and degradation of downstream performance. Upon deeper investigations of the intermediate activation vectors, we find that the norm of internal activations decreases and is accompanied by shifts in the subspaces occupied by these activations, which shows that these activation vectors now occupy completely different regions in the representation space compared to the unedited model. With our paper, we highlight the technical challenges with continuous and localized sequential knowledge editing and their implications for maintaining model stability and utility.

Paper and project links

PDF Accepted for Oral Presentation at KnowFM @ AAAI 2025. arXiv admin note: text overlap with arXiv:2502.01636

Summary

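This study examines the effect of localized updates to large language models (LLMs), particularly in knowledge editing. Across post-training interventions such as continued pre-training, full fine-tuning, and LoRA-based fine-tuning, the Frobenius norm of the updated matrices increases. The growth is especially harmful for localized knowledge editing, where only a subset of the model's matrices is updated. Across editing techniques, including fine-tuning, hypernetwork-based approaches, and locate-and-edit methods, the norm of the updated matrix keeps increasing with successive updates. This disrupts the balance of the model, especially when isolated matrices are updated while the rest of the model stays static, and can lead to instability and degraded downstream performance. A closer look at intermediate activation vectors shows that the norm of internal activations decreases and the subspaces they occupy shift, so the activations end up in completely different regions of the representation space than in the unedited model. The paper highlights the technical challenges of continual, localized sequential knowledge editing and their implications for model stability and utility.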

Key Takeaways

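The study focuses on localized updates to large language models (LLMs) in the context of knowledge editing.
Post-training interventions of different kinds increase the Frobenius norm of the updated matrices.
In localized knowledge editing only a subset of the model's matrices is updated, which can disrupt the balance of the model.
Across editing techniques, the norm of the updated matrix grows with successive updates, which can harm model stability and downstream performance.
After sequential knowledge editing, the norm of intermediate activation vectors decreases and the subspaces they occupy shift.
The activation vectors move to markedly different regions of the representation space, which may affect model performance.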
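
The norm-growth observation can be illustrated with a toy experiment. The sketch below is not the paper's setup: it applies random rank-one updates (an assumed stand-in for sequential edits) to one small matrix and records the Frobenius norm after each update. The matrix size, seed, and update scale are all illustrative assumptions, and random updates only grow the norm in expectation.

```python
import math
import random


def frobenius_norm(matrix):
    """Return the Frobenius norm: square root of the sum of squared entries."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def apply_rank_one_update(matrix, u, v):
    """Return matrix + u v^T, a toy stand-in for one localized edit."""
    return [
        [matrix[i][j] + u[i] * v[j] for j in range(len(v))]
        for i in range(len(u))
    ]


def track_norm_growth(dim=8, num_edits=30, seed=0):
    """Apply num_edits random rank-one updates and record the norm after each."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(dim)]
    norms = [frobenius_norm(weights)]
    for _ in range(num_edits):
        u = [rng.gauss(0.0, 0.3) for _ in range(dim)]
        v = [rng.gauss(0.0, 0.3) for _ in range(dim)]
        weights = apply_rank_one_update(weights, u, v)
        norms.append(frobenius_norm(weights))
    return norms


norms = track_norm_growth()
print(f"initial norm: {norms[0]:.3f}, final norm: {norms[-1]:.3f}")
```

Because only this one matrix changes while everything else would stay fixed, its growing norm is the kind of imbalance the summary describes.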

DataMan: Data Manager for Pre-training Large Language Models

Authors:Ru Peng, Kexin Yang, Yawen Zeng, Junyang Lin, Dayiheng Liu, Junbo Zhao

The performance emergence of large language models (LLMs) driven by data scaling laws makes the selection of pre-training data increasingly important. However, existing methods rely on limited heuristics and human intuition, lacking comprehensive and clear guidelines. To address this, we are inspired by "reverse thinking": prompting LLMs to self-identify which criteria benefit its performance. As its pre-training capabilities are related to perplexity (PPL), we derive 14 quality criteria from the causes of text perplexity anomalies and introduce 15 common application domains to support domain mixing. In this paper, we train a Data Manager (DataMan) to learn quality ratings and domain recognition from pointwise rating, and use it to annotate a 447B token pre-training corpus with 14 quality ratings and domain type. Our experiments validate our approach, using DataMan to select 30B tokens to train a 1.3B-parameter language model, demonstrating significant improvements in in-context learning (ICL), perplexity, and instruction-following ability over the state-of-the-art baseline. The best-performing model, based on the Overall Score l=5 surpasses a model trained with 50% more data using uniform sampling. We continue pre-training with high-rated, domain-specific data annotated by DataMan to enhance domain-specific ICL performance and thus verify DataMan's domain mixing ability. Our findings emphasize the importance of quality ranking, the complementary nature of quality criteria, and their low correlation with perplexity, analyzing misalignment between PPL and ICL performance. We also thoroughly analyzed our pre-training dataset, examining its composition, the distribution of quality ratings, and the original document sources.

Paper and project links

PDF ICLR2025 paper

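Summary

Data scaling laws drive the performance of large language models (LLMs), which makes the selection of pre-training data increasingly important. Existing methods rely on limited heuristics and human intuition and lack comprehensive, clear guidelines. Inspired by "reverse thinking", the paper prompts LLMs to identify for themselves which criteria benefit their performance. It derives 14 quality criteria from the causes of text perplexity anomalies and introduces 15 common application domains to support domain mixing. A Data Manager (DataMan) is trained to learn quality ratings and domain recognition from pointwise rating and is then used to annotate a 447B-token pre-training corpus. In experiments, DataMan selects 30B tokens to train a 1.3B-parameter language model, which improves significantly over the state-of-the-art baseline in in-context learning (ICL), perplexity, and instruction following. The best-performing model, based on Overall Score l=5, surpasses a model trained on 50% more data with uniform sampling. Continued pre-training on high-rated, domain-specific data annotated by DataMan improves domain-specific ICL performance and verifies DataMan's domain mixing ability. The paper emphasizes the importance of quality ranking, the complementary nature of the quality criteria, and their low correlation with perplexity; it also analyzes the misalignment between perplexity and ICL performance and examines the composition of the pre-training dataset, the distribution of quality ratings, and the original document sources.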
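
Selecting data by quality rating can be sketched in a few lines. This is a hypothetical illustration, not DataMan's pipeline: the `Document` fields, the 1 to 5 rating scale, the domain names, and the greedy token-budget rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Document:
    text: str
    domain: str          # hypothetical domain tag
    overall_score: int   # hypothetical quality rating, assumed to range 1-5
    num_tokens: int


def select_by_quality(docs, min_score, token_budget):
    """Keep the highest-rated documents first until the token budget is reached."""
    eligible = [d for d in docs if d.overall_score >= min_score]
    eligible.sort(key=lambda d: d.overall_score, reverse=True)
    selected, used = [], 0
    for doc in eligible:
        if used + doc.num_tokens > token_budget:
            continue  # skip documents that would exceed the budget
        selected.append(doc)
        used += doc.num_tokens
    return selected


corpus = [
    Document("proof of a lemma", "math", 5, 400),
    Document("spam listing", "retail", 1, 300),
    Document("clinical guideline", "medicine", 5, 500),
    Document("forum chatter", "entertainment", 3, 200),
]
chosen = select_by_quality(corpus, min_score=5, token_budget=1000)
print([d.domain for d in chosen])
```

Uniform sampling would ignore `overall_score` entirely, which is the baseline the summary compares against.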

Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems

Authors:Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Bin Xu, Lei Hou, Juanzi Li

Reward models (RMs) are crucial for the training and inference-time scaling up of large language models (LLMs). However, existing reward models primarily focus on human preferences, neglecting verifiable correctness signals which have shown strong potential in training LLMs. In this paper, we propose agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. We empirically implement a reward agent, named RewardAgent, that combines human preference rewards with two verifiable signals: factuality and instruction following, to provide more reliable rewards. We conduct comprehensive experiments on existing reward model benchmarks and inference time best-of-n searches on real-world downstream tasks. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness. We further construct training preference pairs using RewardAgent and train an LLM with the DPO objective, achieving superior performance on various NLP benchmarks compared to conventional reward models. Our codes are publicly released to facilitate further research (https://github.com/THU-KEG/Agentic-Reward-Modeling).

Paper and project links

PDF 16 pages, 5 figures

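Summary

Reward models (RMs) are crucial for training LLMs and for scaling them at inference time. Existing reward models focus mainly on human preferences and neglect verifiable correctness signals, which have shown strong potential in training LLMs. The paper proposes agentic reward modeling, a reward system that combines reward models with verifiable correctness signals from different aspects to provide reliable rewards. Its empirical implementation, RewardAgent, combines human preference rewards with two verifiable signals, factuality and instruction following. In experiments on existing reward model benchmarks and inference-time best-of-n search on real-world downstream tasks, RewardAgent significantly outperforms vanilla reward models. Preference pairs constructed with RewardAgent are further used to train an LLM with the DPO objective, which achieves better performance on various NLP benchmarks than training with conventional reward models. The code is publicly released (https://github.com/THU-KEG/Agentic-Reward-Modeling).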

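Key Takeaways

Reward models are crucial for training LLMs and for scaling them at inference time.
Existing reward models focus mainly on human preferences and neglect verifiable correctness signals.
The paper proposes agentic reward modeling, which combines reward models with verifiable correctness signals to provide reliable rewards.
RewardAgent combines human preference rewards with two verifiable signals: factuality and instruction following.
RewardAgent significantly outperforms vanilla reward models on reward model benchmarks and in inference-time best-of-n search.
An LLM trained with the DPO objective on preference pairs built by RewardAgent performs better on NLP benchmarks than one trained with conventional reward models.
The code is publicly released to support further research.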
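
The idea of combining a preference reward with verifiable signals, then using the result for best-of-n selection, can be sketched as follows. The additive combination and the weights are assumptions made for illustration; the summary does not state how RewardAgent aggregates its signals, and the scores here are hand-written placeholders rather than model outputs.

```python
def combined_reward(preference, factuality, instruction_following,
                    w_fact=1.0, w_inst=1.0):
    """Add verifiable-signal scores to a preference reward (weights are assumed)."""
    return preference + w_fact * factuality + w_inst * instruction_following


def best_of_n(candidates):
    """Return the candidate text with the highest combined reward."""
    best = max(
        candidates,
        key=lambda c: combined_reward(
            c["preference"], c["factuality"], c["instruction_following"]
        ),
    )
    return best["text"]


candidates = [
    {"text": "fluent but wrong", "preference": 0.9,
     "factuality": 0.0, "instruction_following": 1.0},
    {"text": "plain and correct", "preference": 0.6,
     "factuality": 1.0, "instruction_following": 1.0},
]
print(best_of_n(candidates))  # prints "plain and correct"
```

Ranking the same candidates by `preference` alone would pick the fluent but incorrect answer, which is the failure mode that verifiable signals are meant to catch.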

Shh, don’t say that! Domain Certification in LLMs

Authors:Cornelius Emde, Alasdair Paren, Preetham Arvind, Maxime Kayser, Tom Rainforth, Thomas Lukasiewicz, Bernard Ghanem, Philip H. S. Torr, Adel Bibi

Large language models (LLMs) are often deployed to perform constrained tasks, with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess, and mitigate this risk, we introduce domain certification; a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach, which we call VALID that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates, which bound the probability of out-of-domain samples tightly with minimum penalty to refusal behavior.

Paper and project links

PDF 10 pages, includes appendix Published in International Conference on Learning Representations (ICLR) 2025

Summary

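Large language models (LLMs) are often deployed for constrained tasks in narrow domains. A customer support bot, for example, can be built on an LLM to benefit from its broad language understanding, but such LLMs are susceptible to adversarial inputs and may generate outputs outside the intended domain. To formalize, assess, and mitigate this risk, the paper introduces domain certification, a guarantee that accurately characterizes the out-of-domain behavior of language models. It proposes VALID, a simple yet effective approach that provides adversarial bounds as a certificate. Evaluations on a diverse set of datasets show that the method yields meaningful certificates that tightly bound the probability of out-of-domain samples with minimal penalty to refusal behavior.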

Key Takeaways

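LLMs are often deployed for constrained tasks in narrow domains, where they remain susceptible to adversarial inputs.
Customer support bots rely on the broad language understanding and capabilities of LLMs.
LLMs may generate outputs outside the intended domain, which creates risk.
Domain certification is introduced to formalize, assess, and mitigate this risk by characterizing the out-of-domain behavior of language models.
VALID is a simple approach that provides adversarial bounds as a certificate.
Evaluations on diverse datasets show that VALID yields meaningful certificates that tightly bound the probability of out-of-domain samples.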

FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

Authors:Anikait Singh, Sheryl Hsu, Kyle Hsu, Eric Mitchell, Stefano Ermon, Tatsunori Hashimoto, Archit Sharma, Chelsea Finn

Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.

Summary

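Personalizing LLMs is critical for a broad range of user-facing applications such as virtual assistants and content curation. The paper proposes Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem: the LLM adapts to a user from a few labeled preferences and constructs a personalized reward function for that user. Because real preference data is scarce and hard to collect at scale, the authors design synthetic preference datasets for personalization and generate over 1M synthetic personalized preferences with publicly available LLMs. FSPO performs well on personalized open-ended generation, reaching an average 87% Alpaca Eval win rate for synthetic users and a 72% win rate with real human users in open-ended question answering.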

Key Takeaways

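Personalizing LLMs is critical for user-facing applications such as virtual assistants and content curation.
Few-Shot Preference Optimization (FSPO) adapts to a user from a few labeled preferences and builds a personalized reward function.
Synthetic preference datasets are designed to address the scarcity of real preference data.
Transferring from synthetic data to real users requires data with both high diversity and a coherent, self-consistent structure.
FSPO performs well on personalized open-ended generation across movie reviews, pedagogical adaptation based on educational background, and general question answering.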
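
Adapting to a user from a few labeled preferences amounts to placing those preferences in the model's context. The sketch below only formats such a context; the prompt wording and field names are hypothetical, and the training procedure that FSPO applies on top of prompts like this is not shown.

```python
def build_few_shot_preference_prompt(user_examples, query, response_a, response_b):
    """Format a user's labeled preferences followed by a new comparison to judge."""
    parts = []
    for i, ex in enumerate(user_examples, start=1):
        parts.append(
            f"Example {i}\nQuestion: {ex['question']}\n"
            f"Preferred: {ex['chosen']}\nRejected: {ex['rejected']}"
        )
    parts.append(
        f"New question: {query}\nResponse A: {response_a}\n"
        f"Response B: {response_b}\nWhich response would this user prefer?"
    )
    return "\n\n".join(parts)


user_examples = [
    {"question": "Explain gravity.",
     "chosen": "A short analogy with a ball on a stretched sheet.",
     "rejected": "A derivation from the field equations."},
    {"question": "Explain photosynthesis.",
     "chosen": "Plants turn sunlight into food.",
     "rejected": "A full account of the Calvin cycle."},
]
prompt = build_few_shot_preference_prompt(
    user_examples,
    "Explain inflation.",
    "Prices rise over time, so money buys less.",
    "A formal treatment with monetary aggregates.",
)
print(prompt)
```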

Efficient Federated Search for Retrieval-Augmented Generation

Authors:Rachid Guerraoui, Anne-Marie Kermarrec, Diana Petrescu, Rafael Pires, Mathis Randl, Martijn de Vos

Large language models (LLMs) have demonstrated remarkable capabilities across various domains but remain susceptible to hallucinations and inconsistencies, limiting their reliability. Retrieval-augmented generation (RAG) mitigates these issues by grounding model responses in external knowledge sources. Existing RAG workflows often leverage a single vector database, which is impractical in the common setting where information is distributed across multiple repositories. We introduce RAGRoute, a novel mechanism for federated RAG search. RAGRoute dynamically selects relevant data sources at query time using a lightweight neural network classifier. By not querying every data source, this approach significantly reduces query overhead, improves retrieval efficiency, and minimizes the retrieval of irrelevant information. We evaluate RAGRoute using the MIRAGE and MMLU benchmarks and demonstrate its effectiveness in retrieving relevant documents while reducing the number of queries. RAGRoute reduces the total number of queries up to 77.5% and communication volume up to 76.2%.

Paper and project links

PDF To appear in the proceedings of EuroMLSys’25

Summary

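LLMs have reliability problems such as hallucinations and inconsistent outputs. Retrieval-augmented generation (RAG) mitigates them by grounding model responses in external knowledge sources, but existing RAG workflows often rely on a single vector database, which is impractical when information is distributed across multiple repositories. The paper introduces RAGRoute, a mechanism for federated RAG search that uses a lightweight neural network classifier to select relevant data sources at query time. By not querying every source, it reduces query overhead, improves retrieval efficiency, and limits the retrieval of irrelevant information. On the MIRAGE and MMLU benchmarks, RAGRoute reduces the total number of queries by up to 77.5% and communication volume by up to 76.2%.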

Key Takeaways

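LLMs show strong capabilities across domains but remain prone to hallucinations and inconsistencies.
RAG improves the reliability of LLM responses by grounding them in external knowledge sources.
Existing RAG workflows mostly rely on a single vector database, which limits them when data is spread across multiple sources.
RAGRoute is a federated RAG search mechanism that selects relevant data sources at query time, improving efficiency and reducing the retrieval of irrelevant information.
RAGRoute uses a lightweight neural network classifier to choose data sources, which significantly reduces query overhead.
In evaluation, RAGRoute reduces the number of queries by up to 77.5% and communication volume by up to 76.2%.
The approach matters for the performance and reliability of LLMs in practical deployments.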
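
Query-time source selection can be sketched as a router that scores each data source and queries only those above a threshold. RAGRoute uses a lightweight neural network classifier for this decision; the keyword-overlap scorer below is a deliberately simple stand-in, and the source names, keywords, and threshold are invented for the example.

```python
def relevance_score(query, source_keywords):
    """Stand-in scorer: fraction of query words found in the source's keywords."""
    words = query.lower().split()
    hits = sum(1 for w in words if w in source_keywords)
    return hits / len(words) if words else 0.0


def route(query, sources, threshold=0.25):
    """Return the names of sources whose score reaches the threshold."""
    return [
        name for name, keywords in sources.items()
        if relevance_score(query, keywords) >= threshold
    ]


sources = {
    "medical": {"drug", "dose", "symptom", "treatment"},
    "legal": {"contract", "liability", "court"},
    "finance": {"stock", "bond", "interest"},
    "sports": {"match", "league", "score"},
}
selected = route("recommended dose for this drug", sources)
saved = 1 - len(selected) / len(sources)
print(selected, f"queries avoided: {saved:.0%}")  # ['medical'] queries avoided: 75%
```

Only the selected sources would then receive the retrieval query, which is where the reduction in query count and communication volume comes from.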

LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts

Authors:Thanh-Phong Le, Trung Le Chi Phan, Nghia Hieu Nguyen, Kiet Van Nguyen

Purpose: Document Visual Question Answering (document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing amount of documents and the high demand for digitization. Nonetheless, most of document VQA datasets are developed in high-resource languages such as English. Methods: In this paper, we present ReceiptVQA (Receipt Visual Question Answering), the initial large-scale document VQA dataset in Vietnamese dedicated to receipts, a document kind with high commercial potentials. The dataset encompasses 9,000+ receipt images and 60,000+ manually annotated question-answer pairs. In addition to our study, we introduce LiGT (Layout-infused Generative Transformer), a layout-aware encoder-decoder architecture designed to leverage embedding layers of language models to operate layout embeddings, minimizing the use of additional neural modules. Results: Experiments on ReceiptVQA show that our architecture yielded promising performance, achieving competitive results compared with outstanding baselines. Furthermore, throughout analyzing experimental results, we found evident patterns that employing encoder-only model architectures has considerable disadvantages in comparison to architectures that can generate answers. We also observed that it is necessary to combine multiple modalities to tackle our dataset, despite the critical role of semantic understanding from language models. Conclusion: We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.

Paper and project links

PDF Accepted at IJDAR

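Summary

The paper presents ReceiptVQA, a large-scale document visual question answering dataset for Vietnamese receipts, with more than 9,000 receipt images and more than 60,000 manually annotated question-answer pairs. It also introduces LiGT, a layout-aware encoder-decoder architecture that uses the embedding layers of language models to handle layout embeddings and so minimizes the use of additional neural modules. Experiments on ReceiptVQA show that the architecture achieves competitive results against strong baselines. The analysis further shows that encoder-only architectures are at a considerable disadvantage compared with architectures that can generate answers. The authors find that multiple modalities must be combined to tackle the dataset, even though semantic understanding from language models plays a critical role. They hope the work will advance Vietnamese document VQA and contribute to a diverse multimodal research community.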

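Key Takeaways

Document VQA requires multimodal systems to handle textual, layout, and visual modalities together, and it has gained attention because of the demand for digitization.
Most document VQA datasets are built for high-resource languages such as English; ReceiptVQA fills the gap for Vietnamese receipts.
LiGT is a layout-aware generative transformer that uses the embedding layers of language models for layout embeddings, reducing the need for additional neural modules.
LiGT achieves competitive results on ReceiptVQA.
Architectures that can generate answers have a clear advantage over encoder-only architectures.
Solving document VQA requires combining multiple modalities while still relying on the semantic understanding of language models.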

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

Authors:Qingpei Guo, Kaiyou Song, Zipeng Feng, Ziping Ma, Qinglong Zhang, Sirui Gao, Xuzheng Yu, Yunxiao Sun, Tai-Wei Chang, Jingdong Chen, Ming Yang, Jun Zhou

We present M2-omni, a cutting-edge, open-source omni-MLLM that achieves competitive performance to GPT-4o. M2-omni employs a unified multimodal sequence modeling framework, which empowers Large Language Models (LLMs) to acquire comprehensive cross-modal understanding and generation capabilities. Specifically, M2-omni can process arbitrary combinations of audio, video, image, and text modalities as input, generating multimodal sequences interleaving with audio, image, or text outputs, thereby enabling an advanced and interactive real-time experience. The training of such an omni-MLLM is challenged by significant disparities in data quantity and convergence rates across modalities. To address these challenges, we propose a step balance strategy during pre-training to handle the quantity disparities in modality-specific data. Additionally, a dynamically adaptive balance strategy is introduced during the instruction tuning stage to synchronize the modality-wise training progress, ensuring optimal convergence. Notably, we prioritize preserving strong performance on pure text tasks to maintain the robustness of M2-omni's language understanding capability throughout the training process. To the best of our knowledge, M2-omni is currently a very competitive open-source model to GPT-4o, characterized by its comprehensive modality and task support, as well as its exceptional performance. We expect M2-omni will advance the development of omni-MLLMs, thus facilitating future research in this domain.

Paper and project links

PDF

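Summary

M2-omni is an open-source omni-MLLM with cross-modal understanding and generation capabilities. It accepts arbitrary combinations of audio, video, image, and text as input and generates multimodal sequences that interleave audio, image, or text outputs. It uses a unified multimodal sequence modeling framework and handles disparities in data quantity and convergence rate across modalities with a step balance strategy during pre-training and a dynamically adaptive balance strategy during instruction tuning. M2-omni prioritizes strong performance on pure text tasks to keep its language understanding robust. The authors describe it as a very competitive open-source counterpart to GPT-4o, with comprehensive modality and task support.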

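Key Takeaways

M2-omni is an open-source omni-modal large language model.
It has cross-modal understanding and generation capabilities, accepting multiple input modalities and generating multimodal output sequences.
It uses a unified multimodal sequence modeling framework.
A step balance strategy during pre-training handles differences in data quantity across modalities.
A dynamically adaptive balance strategy during instruction tuning synchronizes training progress across modalities and supports convergence.
Strong performance on pure text tasks is preserved to keep language understanding robust.
M2-omni is competitive with GPT-4o in modality coverage, task support, and performance.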

Exploring Graph Tasks with Pure LLMs: A Comprehensive Benchmark and Investigation

Authors:Yuxiang Wang, Xinnan Dai, Wenqi Fan, Yao Ma

Graph-structured data has become increasingly prevalent across various domains, raising the demand for effective models to handle graph tasks like node classification and link prediction. Traditional graph learning models like Graph Neural Networks (GNNs) have made significant strides, but their capabilities in handling graph data remain limited in certain contexts. In recent years, large language models (LLMs) have emerged as promising candidates for graph tasks, yet most studies focus primarily on performance benchmarks and fail to address their broader potential, including their ability to handle limited data, their transferability across tasks, and their robustness. In this work, we provide a comprehensive exploration of LLMs applied to graph tasks. We evaluate the performance of pure LLMs, including those without parameter optimization and those fine-tuned with instructions, across various scenarios. Our analysis goes beyond accuracy, assessing LLM ability to perform in few-shot/zero-shot settings, transfer across domains, understand graph structures, and demonstrate robustness in challenging scenarios. We conduct extensive experiments with 16 graph learning models alongside 6 LLMs (e.g., Llama3B, GPT-4o, Qwen-plus), comparing their performance on datasets like Cora, PubMed, ArXiv, and Products. Our findings show that LLMs, particularly those with instruction tuning, outperform traditional models in few-shot settings, exhibit strong domain transferability, and demonstrate excellent generalization and robustness. This work offers valuable insights into the capabilities of LLMs for graph learning, highlighting their advantages and potential for real-world applications, and paving the way for future research in this area. Codes and datasets are released in https://github.com/myflashbarry/LLM-benchmarking.

Paper and project links

PDF

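Summary

Graph-structured data is increasingly prevalent across domains, raising demand for effective models for graph tasks such as node classification and link prediction. Traditional graph learning models such as graph neural networks (GNNs) have made significant strides, but their ability to handle graph data remains limited in some contexts. LLMs have emerged as promising candidates for graph tasks, yet most studies focus on performance benchmarks and overlook their broader potential, including handling limited data, transferring across tasks, and robustness. This work evaluates pure LLMs, both without parameter optimization and fine-tuned with instructions, across various scenarios. The analysis goes beyond accuracy to cover few-shot and zero-shot settings, cross-domain transfer, understanding of graph structure, and robustness in challenging scenarios. Extensive experiments compare 16 graph learning models with 6 LLMs (e.g., Llama3B, GPT-4o, Qwen-plus) on datasets such as Cora, PubMed, ArXiv, and Products. LLMs, particularly instruction-tuned ones, outperform traditional models in few-shot settings, show strong domain transferability, and demonstrate excellent generalization and robustness. Code and datasets are released at https://github.com/myflashbarry/LLM-benchmarking.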

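Key Takeaways

LLMs are increasingly strong on graph tasks, particularly node classification and link prediction.
Compared with traditional graph learning models, LLMs cope better with limited data.
LLMs transfer well across domains and carry knowledge between datasets.
LLMs can understand graph structure, which helps their performance on graph tasks.
LLMs show excellent generalization and robustness in challenging scenarios.
Instruction-tuned LLMs perform especially well in few-shot settings.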
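
Using a pure LLM for node classification requires turning a node and its neighborhood into text. The sketch below shows one hypothetical way to do that for a citation graph; the prompt wording, the 1-hop neighborhood, and the example labels are assumptions and are not taken from the benchmark.

```python
def build_node_prompt(node_id, texts, edges, labels, candidate_labels):
    """Describe a target node and its 1-hop neighbors as a text prompt."""
    neighbors = sorted(
        {v for u, v in edges if u == node_id}
        | {u for u, v in edges if v == node_id}
    )
    lines = [f"Target paper: {texts[node_id]}", "Linked papers:"]
    for n in neighbors:
        known = labels.get(n, "unknown")  # neighbor labels act as few-shot hints
        lines.append(f"- {texts[n]} (label: {known})")
    lines.append("Choose one label: " + ", ".join(candidate_labels))
    return "\n".join(lines)


texts = {
    0: "Graph attention for citation data",
    1: "Message passing networks",
    2: "Bayesian inference tutorial",
}
edges = [(0, 1), (2, 0)]
labels = {1: "GNN", 2: "Probabilistic Methods"}
prompt = build_node_prompt(0, texts, edges, labels, ["GNN", "Probabilistic Methods"])
print(prompt)
```

Dropping the neighbor labels from the prompt gives a zero-shot variant of the same query.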

Comprehensive Analysis of Transparency and Accessibility of ChatGPT, DeepSeek, And other SoTA Large Language Models

Authors:Ranjan Sapkota, Shaina Raza, Manoj Karkee

Despite increasing discussions on open-source Artificial Intelligence (AI), existing research lacks a discussion on the transparency and accessibility of state-of-the-art (SoTA) Large Language Models (LLMs). The Open Source Initiative (OSI) has recently released its first formal definition of open-source software. This definition, when combined with standard dictionary definitions and the sparse published literature, provides an initial framework to support broader accessibility to AI models such as LLMs, but more work is essential to capture the unique dynamics of openness in AI. In addition, concerns about open-washing, where models claim openness but lack full transparency, have been raised, which limits the reproducibility, bias mitigation, and domain adaptation of these models. In this context, our study critically analyzes SoTA LLMs from the last five years, including ChatGPT, DeepSeek, LLaMA, and others, to assess their adherence to transparency standards and the implications of partial openness. Specifically, we examine transparency and accessibility from two perspectives: open-source vs. open-weight models. Our findings reveal that while some models are labeled as open-source, this does not necessarily mean they are fully open-sourced. Even in the best cases, open-source models often do not report model training data, and code as well as key metrics, such as weight accessibility, and carbon emissions. To the best of our knowledge, this is the first study that systematically examines the transparency and accessibility of over 100 different SoTA LLMs through the dual lens of open-source and open-weight models. The findings open avenues for further research and call for responsible and sustainable AI practices to ensure greater transparency, accountability, and ethical deployment of these models. (DeepSeek transparency, ChatGPT accessibility, open source, DeepSeek open source)

Paper and project links

PDF

Summary

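Discussion of open-source AI is growing, but the transparency and accessibility of state-of-the-art large language models (LLMs) remain under-examined. The Open Source Initiative (OSI) recently released its first formal definition of open-source software, which offers an initial framework for the openness and accessibility of AI models. Concerns about open-washing, where models claim openness but lack full transparency, point to limits on reproducibility, bias mitigation, and domain adaptation. The study assesses recent state-of-the-art LLMs from the perspectives of open-source and open-weight models and finds that models labeled open-source are not necessarily fully open-sourced. It systematically examines more than 100 state-of-the-art LLMs, opens avenues for further research, and calls for responsible and sustainable AI practices to ensure transparency, accountability, and ethical deployment.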

Key Takeaways

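The transparency and accessibility of state-of-the-art large language models (LLMs) are still insufficiently discussed.
The Open Source Initiative (OSI) has released a formal definition of open-source software, which provides an initial framework for the openness and accessibility of AI models.
Open-washing occurs when models claim openness but lack full transparency.
The study assesses recent LLMs and finds that some models labeled open-source are not fully open-sourced.
Even open-source models often do not disclose training data, code, or key metrics such as weight accessibility and carbon emissions.
The study systematically examines the transparency and accessibility of more than 100 state-of-the-art LLMs.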

InductionBench: LLMs Fail in the Simplest Complexity Class

Authors:Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang

Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.

Paper and project links

PDF 24 pages, 7 figures

Summary

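LLMs have improved markedly at reasoning, and models such as o1 and o3 have fully or partially solved many existing benchmarks. Most of those benchmarks, however, emphasize deductive reasoning on tasks where the rules are clearly defined, such as mathematical axioms or programming syntax, so that an LLM can plan and apply the rules to reach a solution. Inductive reasoning, which infers the underlying rules from observed data, is far less explored, even though it lies at the heart of scientific discovery by letting researchers extract general principles from empirical observations. To assess whether LLMs have this capacity, the paper introduces InductionBench, a benchmark for evaluating inductive reasoning. Experiments show that even the most advanced models struggle with the simplest complexity classes in the subregular hierarchy of functions, which highlights a notable deficiency in the inductive reasoning of current LLMs.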

Key Takeaways

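LLMs have improved markedly at reasoning and perform well on many existing benchmarks.
Most existing benchmarks emphasize deductive reasoning, where the rules are clearly defined and the LLM applies them to solve the problem.
Inductive reasoning, which infers underlying rules from observed data, is central to scientific discovery but has been studied less for LLMs.
InductionBench is a new benchmark for evaluating the inductive reasoning ability of LLMs.
Even the most advanced LLMs struggle with the simplest complexity classes in the subregular hierarchy of functions.
Current LLMs show a notable deficiency in inductive reasoning.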
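
Inductive reasoning here means recovering a rule from input-output examples. The sketch below shows the idea in a toy setting chosen for illustration, a per-character rewrite rule over strings; it does not reproduce InductionBench's task format or its function classes.

```python
def induce_char_map(examples):
    """Infer a per-character rewrite rule from (input, output) string pairs.

    Returns None if the examples are inconsistent with any such rule.
    """
    mapping = {}
    for source, target in examples:
        if len(source) != len(target):
            return None
        for s, t in zip(source, target):
            # setdefault records the first observed image of s; a later
            # disagreement means no single per-character rule fits.
            if mapping.setdefault(s, t) != t:
                return None
    return mapping


def apply_char_map(mapping, text):
    """Apply the induced rule; unseen characters are left unchanged."""
    return "".join(mapping.get(ch, ch) for ch in text)


examples = [("abba", "baab"), ("bab", "aba")]
rule = induce_char_map(examples)
print(rule, apply_char_map(rule, "aab"))  # {'a': 'b', 'b': 'a'} bba
```

A deductive task would hand the solver the rule and ask for the output; the inductive version gives only the examples and asks for the rule itself.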

Learning to Generate Unit Tests for Automated Debugging

Authors:Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal

Unit tests (UTs) play an instrumental role in assessing code correctness as well as providing feedback to large language models (LLMs), motivating automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors when given a faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors along with their correct expected outputs based on task descriptions. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), we propose UTDebug that (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting, and helps LLMs debug effectively. We show that UTGen outperforms other LLM-based baselines by 7.59% based on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, we find that feedback from UTGen’s unit tests improves pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and our own harder debugging split of MBPP+ by over 3.17% and 12.35% (respectively) over other LLM-based UT generation baselines. Lastly, we demonstrate that UTGen is a better judge for code correctness, outperforming a state-of-the-art trained 8B reward model by 4.43% on HumanEval+ with best-of-10 sampling using Qwen2.5 7B.

Paper and project links

PDF First two authors contributed equally. Dataset and Code: https://github.com/archiki/UTGenDebug

Summary

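This paper examines the role that unit tests (UTs) play for large language models (LLMs) and identifies a trade-off between generating unit test inputs that reveal errors and correctly predicting the expected outputs without access to the gold solution. To address it, the paper proposes UTGen, which generates error-revealing unit test inputs together with their correct expected outputs from task descriptions. Because model-generated tests can give noisy signals, it further proposes UTDebug, which uses test-time compute to improve UT output prediction and validates and backtracks edits against multiple generated unit tests to avoid overfitting and help LLMs debug more effectively. Experiments show that UTGen outperforms other LLM-based baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. Combined with UTDebug, feedback from UTGen's unit tests improves the pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and on a harder debugging split of MBPP+ by over 3.17% and 12.35%, respectively, over other LLM-based UT generation baselines. Finally, UTGen is a better judge of code correctness: with best-of-10 sampling using Qwen2.5 7B, it outperforms a state-of-the-art trained 8B reward model by 4.43% on HumanEval+.

Key Takeaways

Unit tests matter for LLMs in two ways: they assess code correctness and they provide feedback.
UTGen generates error-revealing unit test inputs and their correct expected outputs from task descriptions.
UTDebug scales UTGen with test-time compute to improve UT output prediction.
UTDebug validates and backtracks edits against multiple generated unit tests, which avoids overfitting and helps debugging.
UTGen outperforms other LLM-based baselines on a metric covering both error-revealing UT inputs and correct UT outputs.
Used with UTDebug, UTGen's feedback improves the pass@1 accuracy of Qwen2.5 32B on HumanEvalFix and on a harder MBPP+ debugging split.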
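
The validate-and-backtrack idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: candidate edits are given here as a fixed list of Python callables (in the paper they come from an LLM), and the names `pass_rate` and `debug_with_backtracking` are hypothetical.

```python
def pass_rate(program, unit_tests):
    """Fraction of (input, expected_output) tests the program passes."""
    passed = 0
    for test_input, expected in unit_tests:
        try:
            if program(test_input) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(unit_tests)


def debug_with_backtracking(program, candidate_edits, unit_tests):
    """Keep an edit only if it raises the pass rate; otherwise backtrack."""
    best, best_rate = program, pass_rate(program, unit_tests)
    for edited in candidate_edits:
        rate = pass_rate(edited, unit_tests)
        if rate > best_rate:
            best, best_rate = edited, rate  # accept the edit
        # else: discard the edit and keep the previous program
    return best, best_rate


# Hypothetical task: return the absolute value of an integer.
generated_tests = [(3, 3), (-4, 4), (0, 0)]
buggy = lambda x: x                       # wrong for negative inputs
edits = [lambda x: -x, lambda x: x if x >= 0 else -x]
fixed, rate = debug_with_backtracking(buggy, edits, generated_tests)
```

Scoring each edit against several tests rather than one is what guards against overfitting: the first edit fixes the negative case but breaks the positive one, so it is rejected.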

TAPO: Task-Referenced Adaptation for Prompt Optimization

Authors:Wenxin Luo, Weirui Wang, Xiaopeng Li, Weibo Zhou, Pengyue Jia, Xiangyu Zhao

Prompt engineering can significantly improve the performance of large language models (LLMs), with automated prompt optimization (APO) gaining significant attention due to the time-consuming and laborious nature of manual prompt design. However, much of the existing work in APO overlooks task-specific characteristics, resulting in prompts that lack domain specificity and are not well-suited for task-specific optimization. In this paper, we introduce TAPO, a multitask-aware prompt optimization framework composed of three key modules. First, a task-aware metric selection module is proposed to enhance task-specific prompt generation capabilities. Second, we present a multi-metrics evaluation module to jointly evaluate prompts from multiple perspectives. Third, an evolution-based optimization framework is introduced for automatic prompt refinement, which improves adaptability across various tasks. Extensive experiments on six datasets demonstrate the effectiveness of our approach, and our code is publicly available.

Paper and project links

PDF Accepted to ICASSP 2025

Summary
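Prompt engineering can significantly improve the performance of large language models (LLMs), and automated prompt optimization (APO) has drawn attention because manual prompt design is time-consuming and laborious. Existing APO work, however, often overlooks task-specific characteristics, producing prompts that lack domain specificity and are poorly suited to task-specific optimization. The paper proposes TAPO, a multitask-aware prompt optimization framework consisting of a task-aware metric selection module, a multi-metrics evaluation module, and an evolution-based optimization framework that improves adaptability across tasks. Experiments on six datasets show that the approach is effective.

Key Takeaways

Automated prompt optimization (APO) can improve the performance of large language models (LLMs).
Existing APO methods overlook task-specific characteristics, so their prompts lack domain specificity.
TAPO includes a task-aware metric selection module that strengthens task-specific prompt generation.
A multi-metrics evaluation module jointly evaluates prompts from multiple perspectives.
An evolution-based optimization framework refines prompts automatically and improves adaptability across tasks.
Extensive experiments on six datasets demonstrate the effectiveness of TAPO.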
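
The combination of multi-metric evaluation and evolutionary refinement can be sketched as below. Everything here is an assumption made for illustration: the two metrics are simple string heuristics standing in for real task metrics, the weights stand in for the output of a task-aware metric selection step, and mutation appends fixed phrases instead of asking an LLM to rewrite the prompt.

```python
# Hypothetical stand-in metrics; a real system would score model outputs.
def brevity(prompt):
    return 1.0 / (1.0 + len(prompt.split()))

def keyword_coverage(prompt, keywords=("step", "answer")):
    return sum(k in prompt.lower() for k in keywords) / len(keywords)

def combined_score(prompt, weights):
    """Weighted sum over several metrics (multi-metric evaluation)."""
    metrics = {"brevity": brevity, "coverage": keyword_coverage}
    return sum(w * metrics[name](prompt) for name, w in weights.items())

def mutate(prompt, phrases):
    """Produce child prompts by appending one unused phrase each."""
    return [f"{prompt} {p}" for p in phrases if p not in prompt]

def evolve(seed, phrases, weights, generations=3, keep=2):
    population = [seed]
    for _ in range(generations):
        children = [c for p in population for c in mutate(p, phrases)]
        # dict.fromkeys removes duplicates while keeping a stable order.
        pool = list(dict.fromkeys(population + children))
        pool.sort(key=lambda p: combined_score(p, weights), reverse=True)
        population = pool[:keep]  # selection: keep the best prompts
    return population[0]

weights = {"brevity": 0.2, "coverage": 0.8}  # assumed task-specific weights
phrases = ["Think step by step.", "Give the final answer.", "Be polite."]
seed_prompt = "Solve the problem."
best = evolve(seed_prompt, phrases, weights)
```

With these weights the search keeps the two phrases that raise coverage and drops "Be polite.", since that phrase only lengthens the prompt. Changing the weights changes which prompt wins, which is the point of selecting metrics per task.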

Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy

Authors:Tatsuki Koga, Ruihan Wu, Kamalika Chaudhuri

With the recent remarkable advancement of large language models (LLMs), there has been a growing interest in utilizing them in the domains with highly sensitive data that lies outside their training data. For this purpose, retrieval-augmented generation (RAG) is particularly effective – it assists LLMs by directly providing relevant information from the external knowledge sources. However, without extra privacy safeguards, RAG outputs risk leaking sensitive information from the external data source. In this work, we explore RAG under differential privacy (DP), a formal guarantee of data privacy. The main challenge with differentially private RAG is how to generate long accurate answers within a moderate privacy budget. We address this by proposing an algorithm that smartly spends privacy budget only for the tokens that require the sensitive information and uses the non-private LLM for other tokens. Our extensive empirical evaluations reveal that our algorithm outperforms the non-RAG baseline under a reasonable privacy budget of $\epsilon\approx 10$ across different models and datasets.

Paper and project links

PDF

Summary
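With the rapid progress of large language models (LLMs), interest is growing in applying them to domains with highly sensitive data that lies outside their training data. This paper studies retrieval-augmented generation (RAG) under differential privacy (DP), a formal guarantee of data privacy, so that LLMs can draw on sensitive external data without leaking it. It proposes an algorithm that spends privacy budget only on the tokens that require sensitive information and uses the non-private LLM for the other tokens. Under a reasonable privacy budget of $\epsilon\approx 10$, the algorithm outperforms the non-RAG baseline across different models and datasets.

Key Takeaways

Interest in applying large language models (LLMs) to domains with highly sensitive data is growing.
Retrieval-augmented generation (RAG) assists LLMs by directly providing relevant information from external knowledge sources.
Differential privacy (DP) gives RAG a formal guarantee of data privacy.
The proposed algorithm spends privacy budget only on the tokens that require sensitive information.
The algorithm performs well under a reasonable privacy budget of $\epsilon\approx 10$.
It outperforms the non-RAG baseline across different models and datasets.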
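
The budget-spending idea can be illustrated with simple accounting. The sketch below is not the paper's algorithm and provides no privacy guarantee by itself: it assumes the "needs sensitive information" flag is already known for each token (a real DP algorithm must make that decision privately as well), charges a fixed hypothetical cost per private token, and adds costs up under basic sequential composition.

```python
def generate_with_budget(token_plan, total_epsilon, epsilon_per_token):
    """Walk a token plan and charge budget only for private tokens.

    token_plan: list of (token, needs_private) pairs. The flag is assumed
    to be given; deciding it privately is outside this sketch.
    Returns per-token source labels and the total epsilon spent, using
    basic sequential composition (the spent epsilons add up).
    """
    spent = 0.0
    sources = []
    for token, needs_private in token_plan:
        if needs_private and spent + epsilon_per_token <= total_epsilon:
            spent += epsilon_per_token  # a DP mechanism would run here
            sources.append((token, "private-rag"))
        else:
            sources.append((token, "public-llm"))  # costs no budget
    return sources, spent


# Hypothetical answer: only two tokens depend on the sensitive documents.
plan = [("The", False), ("dose", False), ("is", False),
        ("5mg", True), ("taken", False), ("daily", True)]
sources, spent = generate_with_budget(plan, total_epsilon=10.0,
                                      epsilon_per_token=4.0)
```

Because four of the six tokens are free, the whole answer fits within the budget. If every token were charged, the same budget would run out after two tokens, which is why long answers are hard under a moderate budget.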

The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

Authors:Fredrik Carlsson, Fangyu Liu, Daniel Ward, Murathan Kurfali, Joakim Nivre

This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples – a process we refer to as hyperfitting – the long-sequence generative capabilities are greatly enhanced. Greedy decoding with these Hyperfitted models even outperform Top-P sampling over long-sequences, both in terms of diversity and human preferences. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomena to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.

Paper and project links

PDF Under review at ICLR

Summary

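This paper presents the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In open-ended text generation, LLMs tend to produce repetitive and dull sequences, especially under greedy decoding, and the problem persists even in state-of-the-art LLMs with billions of parameters trained by next-token prediction on large datasets. The authors find that further fine-tuning these models to near-zero training loss on a small set of samples, a process they call hyperfitting, greatly enhances long-sequence generation. Greedy decoding with hyperfitted models even outperforms Top-P sampling over long sequences, in both diversity and human preference. The phenomenon extends to LLMs of various sizes, to different domains, and even to autoregressive image generation, and it is distinctly different from Grokking and double descent. Surprisingly, hyperfitted models rarely fall into repeating the sequences they were trained on, and explicitly blocking those sequences still yields high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token.

Key Takeaways

In open-ended text generation, LLMs tend to produce repetitive and dull sequences, especially under greedy decoding.
Hyperfitting, that is, fine-tuning a model on a very small dataset until the training loss is near zero, improves long-sequence generation.
Greedy decoding with hyperfitted models outperforms Top-P sampling over long sequences in both diversity and human preference.
The phenomenon holds for LLMs of various sizes and domains, and for autoregressive image generation.
Hyperfitting is distinctly different from Grokking and double descent.
Hyperfitted models rarely repeat their training sequences, and blocking those sequences still yields high-quality output.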
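
The terms entropy, greedy decoding, and Top-P sampling can be made concrete with a small sketch. The two next-token distributions below are made-up numbers for illustration, not measurements from the paper, and the helper names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs.values() if p > 0)

def greedy(probs):
    """Greedy decoding picks the single most probable token."""
    return max(probs, key=probs.get)

def top_p_candidates(probs, p=0.8):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    kept, total = [], 0.0
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    for token, prob in ranked:
        kept.append(token)
        total += prob
        if total >= p:
            break
    return kept

# Made-up distributions for illustration only.
flat = {"the": 0.4, "a": 0.3, "this": 0.2, "that": 0.1}
sharp = {"the": 0.97, "a": 0.01, "this": 0.01, "that": 0.01}

flat_nucleus = top_p_candidates(flat)
sharp_nucleus = top_p_candidates(sharp)
```

For the sharp, low-entropy distribution the Top-P candidate set shrinks to a single token, so sampling from it behaves like greedy decoding. That is one way to read the observation that hyperfitted models place nearly all probability on one token.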

Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code

Authors:Timur Galimzyanov, Sergey Titov, Yaroslav Golubev, Egor Bogomolov

This paper introduces the human-curated PandasPlotBench dataset, designed to evaluate language models’ effectiveness as assistants in visual data exploration. Our benchmark focuses on generating code for visualizing tabular data - such as a Pandas DataFrame - based on natural language instructions, complementing current evaluation tools and expanding their scope. The dataset includes 175 unique tasks. Our experiments assess several leading Large Language Models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. We show that the shortening of tasks has a minimal effect on plotting capabilities, allowing for the user interface that accommodates concise user input without sacrificing functionality or accuracy. Another of our findings reveals that while LLMs perform well with popular libraries like Matplotlib and Seaborn, challenges persist with Plotly, highlighting areas for improvement. We hope that the modular design of our benchmark will broaden the current studies on generating visualizations. Our dataset and benchmark code are available online: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench; https://github.com/JetBrains-Research/PandasPlotBench.

Paper and project links

PDF 5 pages

Summary
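This paper introduces PandasPlotBench, a human-curated dataset for evaluating how effective language models are as assistants in visual data exploration. The benchmark focuses on generating code that visualizes tabular data, such as a Pandas DataFrame, from natural language instructions, complementing and extending current evaluation tools. The dataset contains 175 unique tasks. The experiments assess several leading large language models (LLMs) across three visualization libraries: Matplotlib, Seaborn, and Plotly. Shortening the task descriptions has a minimal effect on plotting capability, so a user interface can accept concise input without sacrificing functionality or accuracy. LLMs perform well with popular libraries such as Matplotlib and Seaborn, but challenges persist with Plotly, which points to areas for improvement. The modular design of the benchmark is intended to broaden current studies on generating visualizations. The dataset and benchmark code are available online: https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench; https://github.com/JetBrains-Research/PandasPlotBench.

Key Takeaways

PandasPlotBench evaluates language models as assistants in visual data exploration, specifically at generating visualization code for tabular data from natural language instructions.
The study assesses LLMs across three visualization libraries: Matplotlib, Seaborn, and Plotly.
Shortening the tasks has a minimal effect on plotting capability, so concise user input does not cost functionality or accuracy.
LLMs perform well with popular libraries such as Matplotlib and Seaborn but still face challenges with Plotly.
The modular design of PandasPlotBench is intended to broaden research on generating visualizations.
The dataset and code are available at https://huggingface.co/datasets/JetBrains-Research/PandasPlotBench and https://github.com/JetBrains-Research/PandasPlotBench.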

Interpreting Language Reward Models via Contrastive Explanations

Authors:Junqi Jiang, Tom Bewley, Saumitra Mishra, Freddy Lecue, Manuela Veloso

Reward models (RMs) are a crucial component in the alignment of large language models’ (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM’s local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.

Paper and project links

PDF Accepted at ICLR 2025 conference

Summary

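Reward models (RMs) are a crucial component in aligning the outputs of large language models (LLMs) with human values. They approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. Because RMs are typically modified LLMs with scalar output heads, they are large black boxes whose predictions are not explainable. To improve trust in LLM alignment, the paper proposes using contrastive explanations to explain any binary response comparison made by an RM. Specifically, it generates a diverse set of new comparisons similar to the original one to characterise the RM's local behaviour. The perturbed responses that form these comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which the analysis of RM behaviour is grounded. Quantitative experiments validate the method's effectiveness at finding high-quality contrastive explanations. The paper then shows the method's qualitative usefulness for investigating the global sensitivity of RMs to each evaluation attribute, and demonstrates how representative examples can be extracted automatically to explain and compare the behaviour of different RMs. The authors see the method as a flexible framework for RM explanation that provides a basis for more interpretable and trustworthy LLM alignment.

Key Takeaways

Reward models (RMs) play a key role in aligning large language models (LLMs) with human values.
RMs approximate human preferences over LLM responses by predicting and comparing reward scores.
RMs are large black boxes whose predictions are not explainable.
The paper proposes contrastive explanations to explain any binary response comparison made by an RM.
A diverse set of new comparisons similar to the original one characterises the RM's local behaviour.
Quantitative experiments validate the effectiveness of the contrastive explanation method.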
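
A contrastive explanation of a single comparison can be sketched as: perturb one named attribute of the preferred response, re-score, and record which perturbations flip the preference. The sketch below is a toy under stated assumptions. `toy_reward` is a hand-written stand-in for a real RM, and the perturbations are fixed string edits, whereas the paper generates perturbed responses rather than hard-coding them.

```python
def toy_reward(response):
    """Hypothetical stand-in for an RM's scalar output head."""
    score = 0.0
    score += 1.0 if "because" in response else 0.0      # gives a reason
    score += 0.5 if response.endswith(".") else 0.0     # well-formed ending
    score -= 0.1 * max(0, len(response.split()) - 12)   # verbosity penalty
    return score

def prefers_first(a, b, reward=toy_reward):
    return reward(a) > reward(b)

# Each perturbation edits one manually named high-level attribute.
perturbations = {
    "reasoning": lambda r: r.replace(" because it is prime", ""),
    "punctuation": lambda r: r.rstrip("."),
}

def contrastive_attributes(chosen, rejected):
    """Attributes whose perturbation of `chosen` flips the comparison."""
    original = prefers_first(chosen, rejected)
    return [name for name, edit in perturbations.items()
            if prefers_first(edit(chosen), rejected) != original]

chosen = "7 has two divisors because it is prime."
rejected = "7 has two divisors."
flipping = contrastive_attributes(chosen, rejected)
```

Here only the "reasoning" perturbation flips the preference, so the explanation for this comparison is that the toy RM preferred the first response for stating a reason, not for its punctuation.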

Stronger Models are NOT Stronger Teachers for Instruction Tuning

Authors:Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Radha Poovendran

Instruction tuning has been widely adopted to ensure large language models (LLMs) follow user instructions effectively. The resulting instruction-following capabilities of LLMs heavily rely on the instruction datasets used for tuning. Recently, synthetic instruction datasets have emerged as an economically viable solution to provide LLMs diverse and high-quality instructions. However, existing approaches typically assume that larger or stronger models are stronger teachers for instruction tuning, and hence simply adopt these models as response generators to the synthetic instructions. In this paper, we challenge this commonly-adopted assumption. Our extensive experiments across five base models and twenty response generators reveal that larger and stronger models are not necessarily stronger teachers of smaller models. We refer to this phenomenon as the Larger Models’ Paradox. We observe that existing metrics cannot precisely predict the effectiveness of response generators since they ignore the compatibility between teachers and base models being fine-tuned. We thus develop a novel metric, named as Compatibility-Adjusted Reward (CAR) to measure the effectiveness of response generators. Our experiments across five base models demonstrate that CAR outperforms almost all baselines.

Paper and project links

PDF This paper is accepted at NAACL 2025

Summary

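Instruction tuning is widely used to make large language models (LLMs) follow user instructions effectively, and the resulting instruction-following ability depends heavily on the instruction datasets used for tuning. Synthetic instruction datasets have recently emerged as an economically viable way to give LLMs diverse, high-quality instructions. Existing approaches, however, typically assume that larger or stronger models are stronger teachers for instruction tuning and simply adopt them as response generators for the synthetic instructions. This paper challenges that assumption. Extensive experiments across five base models and twenty response generators show that larger and stronger models are not necessarily stronger teachers of smaller models, a phenomenon the authors call the Larger Models' Paradox. Existing metrics cannot precisely predict the effectiveness of response generators because they ignore the compatibility between teachers and the base models being fine-tuned. The authors therefore develop a new metric, Compatibility-Adjusted Reward (CAR), to measure the effectiveness of response generators. Experiments across five base models show that CAR outperforms almost all baselines.

Key Takeaways

Instruction tuning is widely used to make large language models (LLMs) follow user instructions.
The instruction dataset plays a key role in an LLM's instruction-following ability.
Synthetic instruction datasets are an economically viable way to provide diverse, high-quality instructions.
Existing methods assume that larger models are stronger teachers, but the experiments show this does not necessarily hold.
The finding that larger and stronger models are not necessarily stronger teachers of smaller models is called the Larger Models' Paradox.
Existing metrics cannot precisely predict the effectiveness of response generators because they ignore the compatibility between teachers and base models.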

Lightning IR: Straightforward Fine-tuning and Inference of Transformer-based Language Models for Information Retrieval

Authors:Ferdinand Schlatt, Maik Fröbe, Matthias Hagen

A wide range of transformer-based language models have been proposed for information retrieval tasks. However, including transformer-based models in retrieval pipelines is often complex and requires substantial engineering effort. In this paper, we introduce Lightning IR, an easy-to-use PyTorch Lightning-based framework for applying transformer-based language models in retrieval scenarios. Lightning IR provides a modular and extensible architecture that supports all stages of a retrieval pipeline: from fine-tuning and indexing to searching and re-ranking. Designed to be scalable and reproducible, Lightning IR is available as open-source: https://github.com/webis-de/lightning-ir.

Paper and project links

PDF Accepted as a demo at WSDM’25

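Summary

This paper introduces Lightning IR, an easy-to-use PyTorch Lightning-based framework for applying transformer-based language models in retrieval scenarios. It provides a modular and extensible architecture that supports all stages of a retrieval pipeline, from fine-tuning and indexing to searching and re-ranking.

Key Takeaways

Lightning IR is a PyTorch Lightning-based framework for applying transformer-based language models to retrieval tasks.
Its modular design supports every stage of a retrieval pipeline, from fine-tuning and indexing to searching and re-ranking.
Lightning IR is designed to be scalable and reproducible.
The framework aims to reduce the complexity and engineering effort of putting transformer-based models into retrieval pipelines.
The framework is open source and available at https://github.com/webis-de/lightning-ir.
Lightning IR targets retrieval scenarios that use transformer-based language models.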
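
The pipeline stages named above (indexing, searching, re-ranking) can be pictured with a toy example in plain Python. This is not the Lightning IR API and uses no transformer model: the corpus, the function names, and the term-overlap scorer are all hypothetical, and the fine-tuning stage is omitted.

```python
from collections import Counter, defaultdict

# Hypothetical three-document corpus.
docs = {
    "d1": "transformer models for passage retrieval",
    "d2": "cooking pasta with tomato sauce",
    "d3": "fine-tuning transformer rerankers for retrieval",
}

def build_index(corpus):
    """Indexing: map each term to the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for term in text.split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Searching: first-stage candidates sharing at least one query term."""
    hits = set()
    for term in query.split():
        hits |= index.get(term, set())
    return sorted(hits)

def rerank(corpus, query, candidates):
    """Re-ranking: order candidates by query-term overlap (toy scorer)."""
    q = Counter(query.split())
    def overlap(doc_id):
        return sum((Counter(corpus[doc_id].split()) & q).values())
    return sorted(candidates, key=overlap, reverse=True)

query = "transformer rerankers retrieval"
index = build_index(docs)
candidates = search(index, query)
ranking = rerank(docs, query, candidates)
```

In a real pipeline the overlap scorer would be replaced by a fine-tuned model and the dictionary by a proper index. The value of a framework like the one described is that each stage can be swapped without rewriting the others.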

Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Authors:Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou

Large Language Models (LLMs) have shown immense potential in enhancing various aspects of our daily lives, from conversational AI to search and AI assistants. However, their growing capabilities come at the cost of extremely large model sizes, making deployment on edge devices challenging due to memory and computational constraints. This paper introduces a novel approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, our approach accounts for the non-linear nature of the Softmax attention mechanism. We provide theoretical guarantees for the convergence of our Gradient Descent-based optimization method to a near-optimal pruning mask solution. Our empirical results demonstrate the effectiveness of our non-linear pruning approach in maintaining model performance while significantly reducing computational costs, which is beyond the current state-of-the-art methods, i.e., SparseGPT and Wanda, by a large margin. This work establishes a new theoretical foundation for pruning algorithm design in LLMs, potentially paving the way for more efficient LLM inference on resource-constrained devices.

Paper and project links

PDF ICLR 2025

Summary

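Large language models (LLMs) have shown immense potential across daily life, from conversational AI to search and AI assistants. Their growing capabilities, however, come at the cost of extremely large model sizes, which makes deployment on edge devices challenging under memory and computational constraints. This paper introduces a new approach to LLM weight pruning that directly optimizes for approximating the attention matrix, a core component of transformer architectures. Unlike existing methods that focus on linear approximations, the approach accounts for the non-linear nature of the Softmax attention mechanism. The paper gives theoretical guarantees that its gradient-descent-based optimization converges to a near-optimal pruning mask. Empirical results show that the non-linear pruning approach maintains model performance while significantly reducing computational cost, and that it surpasses current state-of-the-art methods such as SparseGPT and Wanda by a large margin. The work establishes a new theoretical foundation for pruning algorithm design in LLMs and may pave the way for more efficient LLM inference on resource-constrained devices.

Key Takeaways

LLMs show great potential in many areas, but their size makes deployment on edge devices challenging.
Existing LLM pruning methods focus on linear approximations, whereas this method accounts for the non-linear nature of the Softmax attention mechanism.
The paper proposes a new LLM weight pruning method that directly optimizes for approximating the attention matrix.
Theoretical guarantees show that the gradient-descent-based optimization converges to a near-optimal pruning mask.
Empirical results show that the method maintains model performance while significantly reducing computational cost.
The method outperforms existing state-of-the-art methods such as SparseGPT and Wanda by a large margin.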
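
The objective of approximating the attention matrix, as opposed to the weights, can be illustrated numerically. The sketch below rests on assumptions the source does not state: it takes the attention matrix to have the common form Softmax(X W X^T) with W a combined query-key weight, and it uses plain magnitude pruning as a stand-in mask. It does not implement the paper's gradient-descent optimization. It only shows how the error of a masked weight is measured after the Softmax.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def softmax_rows(m):
    out = []
    for row in m:
        mx = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - mx) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def attention(x, w):
    """Row-wise Softmax(X W X^T); this form is an assumption here."""
    return softmax_rows(matmul(matmul(x, w), transpose(x)))

def magnitude_mask(w, keep):
    """Binary mask keeping the `keep` largest-magnitude entries of W."""
    flat = sorted((abs(v) for row in w for v in row), reverse=True)
    threshold = flat[keep - 1]
    return [[1.0 if abs(v) >= threshold else 0.0 for v in row] for row in w]

def frob_error(a, b):
    return math.sqrt(sum((p - q) ** 2
                         for ra, rb in zip(a, b) for p, q in zip(ra, rb)))

# Made-up inputs and weights for illustration.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = [[2.0, 0.1], [-0.2, 1.5]]
mask = magnitude_mask(w, keep=2)
w_pruned = [[v * m for v, m in zip(rw, rm)] for rw, rm in zip(w, mask)]
weight_error = frob_error(w, w_pruned)
attn_error = frob_error(attention(x, w), attention(x, w_pruned))
```

`weight_error` is the quantity a linear view looks at, while `attn_error` is measured after the non-linear Softmax. A mask search that targets the attention matrix would minimize the second quantity directly.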

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-02-28/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting.
