⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-03-11 更新
A Survey of Large Language Model Empowered Agents for Recommendation and Search: Towards Next-Generation Information Retrieval
Authors:Yu Zhang, Shutong Qiao, Jiaqi Zhang, Tzu-Heng Lin, Chen Gao, Yong Li
Information technology has profoundly altered the way humans interact with information. The vast amount of content created, shared, and disseminated online has made it increasingly difficult to access relevant information. Over the past two decades, search and recommendation systems (collectively referred to as information retrieval systems) have evolved significantly to address these challenges. Recent advances in large language models (LLMs) have demonstrated capabilities that surpass human performance in various language-related tasks and exhibit general understanding, reasoning, and decision-making abilities. This paper explores the transformative potential of large language model agents in enhancing search and recommendation systems. We discuss the motivations and roles of LLM agents, and establish a classification framework to elaborate on the existing research. We highlight the immense potential of LLM agents in addressing current challenges in search and recommendation, providing insights into future research directions. This paper is the first to systematically review and classify the research on LLM agents in these domains, offering a novel perspective on leveraging this advanced AI technology for information retrieval. To help understand the existing works, we list the existing papers on agent-based simulation with large language models at this link: https://github.com/tsinghua-fib-lab/LLM-Agent-for-Recommendation-and-Search.
信息技术已经深刻改变了人类与信息的交互方式。网上创建、共享和传播的大量内容使得获取相关信息的难度越来越大。在过去的二十年中,搜索和推荐系统(统称为信息检索系统)已经发生了显著的变化,以应对这些挑战。大型语言模型(LLM)的最新进展在各种语言相关任务中表现出了超越人类的性能,并展现出了一般理解、推理和决策能力。本文探讨了大型语言模型代理在增强搜索和推荐系统方面的变革潜力。我们讨论了LLM代理的动机和角色,并建立了一个分类框架来详细阐述现有研究。我们强调了LLM代理在解决搜索和推荐方面的当前挑战方面的巨大潜力,为未来的研究方向提供了见解。本文首次系统回顾和分类了这些领域中的LLM代理研究,为如何利用这种先进的AI技术进行信息检索提供了一个新颖的视角。为了帮助理解现有工作,我们在以下链接列出了关于基于代理模拟的大型语言模型的现有论文:https://github.com/tsinghua-fib-lab/LLM-Agent-for-Recommendation-and-Search。
论文及项目相关链接
Summary
信息技术深刻改变了人类与信息的交互方式。随着在线创建、共享和传播的内容量激增,访问相关信息变得越来越困难。搜索和推荐系统(统称为信息检索系统)在过去的二十年中已经显著发展,以应对这些挑战。大型语言模型(LLM)的最新进展在各种语言相关任务中展现了超越人类性能的能力,并展现出一般性的理解、推理和决策能力。本文探讨了大型语言模型代理在信息检索中增强搜索和推荐系统的变革潜力。我们讨论了LLM代理的动机和角色,并建立了一个分类框架来阐述现有研究。我们强调了LLM代理在解决搜索和推荐中的当前挑战方面的巨大潜力,并为未来的研究方向提供了见解。本文系统地回顾和分类了在这些领域中关于LLM代理的研究,为如何利用这一先进的AI技术进行信息检索提供了新的视角。
Key Takeaways
- 信息技术改变了人类与信息的交互方式,信息过载问题日益突出。
- 搜索和推荐系统已显著发展,以应对信息过载挑战。
- 大型语言模型(LLM)在多种语言任务中表现超越人类,具备理解、推理和决策能力。
- LLM在信息检索中具有巨大潜力,可增强搜索和推荐系统的性能。
- LLM代理的动机和角色在研究中受到关注,并建立了分类框架来阐述现有研究。
- LLM代理能够解决搜索和推荐中的当前挑战,提供未来研究方向的见解。
点此查看论文截图


R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning
Authors:Huatong Song, Jinhao Jiang, Yingqian Min, Jie Chen, Zhipeng Chen, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Existing Large Reasoning Models (LRMs) have shown the potential of reinforcement learning (RL) to enhance the complex reasoning capabilities of Large Language Models~(LLMs). While they achieve remarkable performance on challenging tasks such as mathematics and coding, they often rely on their internal knowledge to solve problems, which can be inadequate for time-sensitive or knowledge-intensive questions, leading to inaccuracies and hallucinations. To address this, we propose \textbf{R1-Searcher}, a novel two-stage outcome-based RL approach designed to enhance the search capabilities of LLMs. This method allows LLMs to autonomously invoke external search systems to access additional knowledge during the reasoning process. Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start. % effectively generalizing to out-of-domain datasets and supporting both Base and Instruct models. Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
现有的大型推理模型(LRM)已经显示出强化学习(RL)在增强大型语言模型(LLM)的复杂推理能力方面的潜力。虽然它们在数学和编码等具有挑战性的任务上取得了显著的成绩,但它们通常依赖于内部知识来解决问题,这对于时间敏感或知识密集的问题可能不够用,导致不准确和幻觉。为了解决这一问题,我们提出了\textbf{R1-Searcher},这是一种新型的两阶段成果导向的RL方法,旨在增强LLM的搜索能力。这种方法允许LLM在推理过程中自主调用外部搜索系统来访问额外的知识。我们的框架完全依赖于RL,无需处理奖励或蒸馏冷启动等复杂流程。%能够有效地泛化到域外数据集并支持基础模型和指令模型。我们的实验表明,我们的方法显著优于以前的强大RAG方法,即使与闭源的GPT-4o-mini相比也是如此。
论文及项目相关链接
Summary
强化学习(RL)在提升大型语言模型(LLM)的复杂推理能力方面具有潜力。现有大型推理模型(LRMs)在数学和编程等挑战性任务上表现出卓越性能,但往往依赖内部知识解决问题,对于时间敏感或知识密集型问题可能显得不足,导致不准确和幻觉。为解决这一问题,提出新型两阶段结果导向的RL方法——R1-Searcher,旨在增强LLM的搜索能力,使其能够自主调用外部搜索系统获取额外知识以辅助推理过程。该方法仅依赖RL,无需过程奖励或冷启动时的蒸馏。
Key Takeaways
- 强化学习用于增强大型语言模型的推理能力。
- LRMs在数学和编程任务上表现优异,但在时间敏感和知识密集型问题上存在不足。
- R1-Searcher是一种新型的两阶段结果导向的RL方法,旨在增强LLM的搜索能力。
- LLMs可自主调用外部搜索系统获取额外知识以辅助推理。
- R1-Searcher方法显著提高之前的RAG方法性能,甚至超越闭源的GPT-4o-mini。
- 该方法仅依赖强化学习,无需过程奖励或冷启动时的知识蒸馏。
- 方法具有推广至跨领域数据集的能力,并适用于基础模型和指令模型。
点此查看论文截图



LLM-based Iterative Approach to Metamodeling in Automotive
Authors:Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Alois Knoll
In this paper, we introduce an automated approach to domain-specific metamodel construction relying on Large Language Model (LLM). The main focus is adoption in automotive domain. As outcome, a prototype was implemented as web service using Python programming language, while OpenAI’s GPT-4o was used as the underlying LLM. Based on the initial experiments, this approach successfully constructs Ecore metamodel based on set of automotive requirements and visualizes it making use of PlantUML notation, so human experts can provide feedback in order to refine the result. Finally, locally deployable solution is also considered, including the limitations and additional steps required.
在这篇论文中,我们介绍了一种基于大型语言模型(LLM)的领域特定元模型构建自动化方法。主要关注在汽车行业的应用。我们采用Python编程语言实现了一个原型作为web服务,同时使用了OpenAI的GPT-4o作为基础的大型语言模型。基于初步实验,该方法成功根据汽车需求构建了Ecore元模型,并利用PlantUML符号进行可视化,以便人类专家提供反馈以完善结果。最后,还考虑了可本地部署的解决方案,包括其局限性和所需的额外步骤。
论文及项目相关链接
Summary
基于大型语言模型(LLM)的自动化领域特定元模型构建方法被介绍。重点研究其在汽车领域的应用。通过使用Python编程语言实现原型作为Web服务,并利用OpenAI的GPT-4o作为底层LLM。初步实验表明,该方法能够根据汽车需求构建Ecore元模型,并利用PlantUML符号进行可视化,使人类专家可以提供反馈以完善结果。最后还考虑了本地可部署的解决方案,包括其局限性和所需额外步骤。
Key Takeaways
- 介绍了基于LLM的自动化领域特定元模型构建方法。
- 研究的重点是在汽车领域的应用。
- 使用Python实现原型作为Web服务,并依赖OpenAI的GPT-4o作为LLM。
- 初步实验证明,该方法能够根据汽车需求构建Ecore元模型。
- 利用PlantUML符号进行可视化,以便人类专家提供反馈以改进结果。
- 考虑了本地可部署的解决方案。
点此查看论文截图






Are Your LLM-based Text-to-SQL Models Secure? Exploring SQL Injection via Backdoor Attacks
Authors:Meiyu Lin, Haichuan Zhang, Jiale Lao, Renyuan Li, Yuanchun Zhou, Carl Yang, Yang Cao, Mingjie Tang
Large language models (LLMs) have shown state-of-the-art results in translating natural language questions into SQL queries (Text-to-SQL), a long-standing challenge within the database community. However, security concerns remain largely unexplored, particularly the threat of backdoor attacks, which can introduce malicious behaviors into models through fine-tuning with poisoned datasets. In this work, we systematically investigate the vulnerabilities of LLM-based Text-to-SQL models and present ToxicSQL, a novel backdoor attack framework. Our approach leverages stealthy {semantic and character-level triggers} to make backdoors difficult to detect and remove, ensuring that malicious behaviors remain covert while maintaining high model accuracy on benign inputs. Furthermore, we propose leveraging SQL injection payloads as backdoor targets, enabling the generation of malicious yet executable SQL queries, which pose severe security and privacy risks in language model-based SQL development. We demonstrate that injecting only 0.44% of poisoned data can result in an attack success rate of 79.41%, posing a significant risk to database security. Additionally, we propose detection and mitigation strategies to enhance model reliability. Our findings highlight the urgent need for security-aware Text-to-SQL development, emphasizing the importance of robust defenses against backdoor threats.
基于大型语言模型(LLM)的自然语言翻译成为SQL查询(文本到SQL)取得了最先进的成果,这是数据库领域长期以来的一个挑战。然而,安全问题仍未得到充分研究,尤其是后门攻击带来的威胁。后门攻击可以通过中毒数据集对模型进行微调,从而引入恶意行为。在这项工作中,我们系统地研究了基于LLM的文本到SQL模型的漏洞,并推出了ToxicSQL,一种新型的后门攻击框架。我们的方法利用隐蔽的语义和字符级触发器使后门难以检测和移除,确保在良性输入上保持模型准确性的同时,恶意行为保持隐蔽。此外,我们提出利用SQL注入负载作为后门目标,能够生成恶意且可执行的SQL查询,这在基于语言模型的SQL开发中带来了严重的安全和隐私风险。我们证明,仅注入0.44%的中毒数据就可以达到79.41%的攻击成功率,给数据库安全带来显著风险。此外,我们还提出了检测和缓解策略来提高模型可靠性。我们的研究结果强调了安全意识的文本到SQL开发的紧迫需求,并强调了对抗后门威胁的稳健防御的重要性。
论文及项目相关链接
Summary
LLM基于的Text-to-SQL模型存在安全风险,本文提出了针对该模型的ToxicSQL后门攻击框架,利用语义和字符级触发器使后门难以检测和移除,通过注入SQL注入负载生成恶意且可执行的SQL查询,对数据库安全构成严重威胁。同时,本文也提出了检测和缓解策略来提升模型可靠性。
Key Takeaways
- LLM的Text-to-SQL模型面临后门攻击的安全风险。
- 提出了一种新的后门攻击框架——ToxicSQL。
- 利用语义和字符级触发器使后门难以检测和移除。
- 通过注入SQL注入负载生成恶意且可执行的SQL查询。
- 少量(0.44%)的带毒数据即可导致高达79.41%的攻击成功率。
- 提出了检测和缓解策略以提升模型可靠性。
点此查看论文截图






MatrixFlow: System-Accelerator co-design for high-performance transformer applications
Authors:Qunyou Liu, Marina Zapater, David Atienza
Transformers are central to advances in artificial intelligence (AI), excelling in fields ranging from computer vision to natural language processing. Despite their success, their large parameter count and computational demands challenge efficient acceleration. To address these limitations, this paper proposes MatrixFlow, a novel co-designed system-accelerator architecture based on a loosely coupled systolic array including a new software mapping approach for efficient transformer code execution. MatrixFlow is co-optimized via a novel dataflow-based matrix multiplication technique that reduces memory overhead. These innovations significantly improve data throughput, which is critical for handling the extensive computations required by transformers. We validate our approach through full system simulation using gem5 across various BERT and ViT Transformer models featuring different data types, demonstrating significant application-wide speed-ups. Our method achieves up to a 22x improvement compared to a many-core CPU system, and outperforms the closest state-of-the-art loosely-coupled and tightly-coupled accelerators by over 5x and 8x, respectively.
Transformer是人工智能(AI)进步的核心,其在从计算机视觉到自然语言处理等多个领域表现出卓越的性能。尽管它们取得了成功,但由于参数众多和计算需求巨大,它们面临着有效加速的挑战。为了解决这些局限性,本文提出了MatrixFlow,这是一种新型协同设计的系统加速器架构,基于松散耦合的脉动阵列,包括一种用于高效执行Transformer代码的新软件映射方法。MatrixFlow通过基于数据流的新型矩阵乘法技术进行优化,减少了内存开销。这些创新大大提高了数据吞吐量,对于处理Transformer所需的大量计算而言至关重要。我们通过使用gem5进行全面系统仿真,验证了各种BERT和ViT Transformer模型的方法,这些模型具有不同的数据类型,显示出显著的全应用加速效果。我们的方法与多核CPU系统相比,实现了最高达22倍的改进,并且比最新的松散耦合和紧密耦合加速器分别高出超过5倍和8倍。
论文及项目相关链接
Summary
本文介绍了针对人工智能领域中的Transformer模型计算效率低下的问题,提出了一种名为MatrixFlow的新型协同设计系统加速器架构。该架构基于松散耦合的收缩阵列,包括新的软件映射方法和基于数据流矩阵乘法的优化技术。MatrixFlow显著提高了数据吞吐量,这对于处理Transformer所需的大量计算至关重要。通过gem5进行的全系统仿真验证,在多种BERT和ViT Transformer模型上,该方法实现了显著的应用加速,相比于多核CPU系统提升了最高达22倍的速度,并且相较于目前先进的松散耦合和紧密耦合加速器分别有超过5倍和8倍的性能优势。
Key Takeaways
- Transformers是人工智能领域的重要突破,但计算效率和内存需求是一大挑战。
- MatrixFlow是一种新型系统加速器架构,专为优化Transformer的计算效率而设计。
- MatrixFlow基于松散耦合的收缩阵列结构,并采用了创新的软件映射方法和数据流矩阵乘法技术。
- MatrixFlow通过提高数据吞吐量来优化Transformer的计算性能。
- 通过gem5的全系统仿真验证,MatrixFlow在多种Transformer模型上实现了显著的速度提升。
- 与多核CPU系统相比,MatrixFlow最高实现了22倍的速度提升。
点此查看论文截图








“Only ChatGPT gets me”: An Empirical Analysis of GPT versus other Large Language Models for Emotion Detection in Text
Authors:Florian Lecourt, Madalina Croitoru, Konstantin Todorov
This work investigates the capabilities of large language models (LLMs) in detecting and understanding human emotions through text. Drawing upon emotion models from psychology, we adopt an interdisciplinary perspective that integrates computational and affective sciences insights. The main goal is to assess how accurately they can identify emotions expressed in textual interactions and compare different models on this specific task. This research contributes to broader efforts to enhance human-computer interaction, making artificial intelligence technologies more responsive and sensitive to users’ emotional nuances. By employing a methodology that involves comparisons with a state-of-the-art model on the GoEmotions dataset, we aim to gauge LLMs’ effectiveness as a system for emotional analysis, paving the way for potential applications in various fields that require a nuanced understanding of human language.
本文旨在研究大型语言模型(LLM)通过文本检测和了解人类情绪的能力。我们借鉴心理学的情感模型,采用跨学科视角,融合计算科学和情感科学的见解。主要目标是评估这些模型在识别文本互动中所表达的情绪方面的准确性,并在这一特定任务上比较不同的模型。这项研究为提升人机交互的广泛努力做出了贡献,使人工智能技术对用户的情绪细微差别更具响应性和敏感性。通过采用与GoEmotions数据集上的最新模型进行对比的方法,我们旨在评估LLM作为情感分析系统的有效性,为需要在人类语言方面具备细微理解能力的各个领域的应用铺平道路。
论文及项目相关链接
Summary
本工作研究大型语言模型(LLM)在检测和理解人类情绪方面的能力。该研究结合心理学中的情感模型,采用跨学科方法,整合计算和情感科学的见解。主要目标是评估LLM准确识别文本互动中表达情感的能力,并在这一特定任务上比较不同的模型。该研究为增强人机交互的广泛努力做出贡献,使人工智能技术对用户的情绪细微差别更加敏感和响应。通过采用与GoEmotions数据集上的最新模型进行对比的方法,本研究旨在评估LLM在情感分析方面的有效性,为可能需要深刻理解人类语言的各个领域的应用铺平道路。
Key Takeaways
- 本研究探讨了大型语言模型(LLM)在检测和理解人类情绪方面的能力。
- 研究结合了心理学中的情感模型,并采用了跨学科的方法。
- 主要目标是评估LLM在识别文本中的情感表达方面的准确性,并比较不同模型在此任务上的表现。
- 研究对增强人机交互的贡献,使AI技术对用户情绪更敏感和响应。
- 通过与GoEmotions数据集中的最新模型对比,评估了LLM在情感分析方面的有效性。
- 该研究为可能需要深入理解人类语言的各个领域(如自然语言处理、社交媒体分析、聊天机器人等)提供了潜在的应用价值。
- 整体而言,该研究为大型语言模型在情感分析领域的进一步发展提供了有价值的见解和参考。
点此查看论文截图








Comparative Analysis Based on DeepSeek, ChatGPT, and Google Gemini: Features, Techniques, Performance, Future Prospects
Authors:Anichur Rahman, Shahariar Hossain Mahir, Md Tanjum An Tashrif, Airin Afroj Aishi, Md Ahsan Karim, Dipanjali Kundu, Tanoy Debnath, Md. Abul Ala Moududi, MD. Zunead Abedin Eidmum
Nowadays, DeepSeek, ChatGPT, and Google Gemini are the most trending and exciting Large Language Model (LLM) technologies for reasoning, multimodal capabilities, and general linguistic performance worldwide. DeepSeek employs a Mixture-of-Experts (MoE) approach, activating only the parameters most relevant to the task at hand, which makes it especially effective for domain-specific work. On the other hand, ChatGPT relies on a dense transformer model enhanced through reinforcement learning from human feedback (RLHF), and then Google Gemini actually uses a multimodal transformer architecture that integrates text, code, and images into a single framework. However, by using those technologies, people can be able to mine their desired text, code, images, etc, in a cost-effective and domain-specific inference. People may choose those techniques based on the best performance. In this regard, we offer a comparative study based on the DeepSeek, ChatGPT, and Gemini techniques in this research. Initially, we focus on their methods and materials, appropriately including the data selection criteria. Then, we present state-of-the-art features of DeepSeek, ChatGPT, and Gemini based on their applications. Most importantly, we show the technological comparison among them and also cover the dataset analysis for various applications. Finally, we address extensive research areas and future potential guidance regarding LLM-based AI research for the community.
如今,DeepSeek、ChatGPT和Google Gemini是世界上最流行、最令人兴奋的用于推理、多模态能力和一般语言性能的大型语言模型(LLM)技术。DeepSeek采用混合专家(MoE)方法,只激活与手头任务最相关的参数,使其在特定领域的工作尤其有效。另一方面,ChatGPT依赖于通过强化学习从人类反馈(RLHF)增强的密集变压器模型,然后Google Gemini实际上使用了一种多模态变压器架构,该架构将文本、代码和图像集成到一个单一框架中。然而,通过利用这些技术,人们能够以成本效益和特定领域的推理来挖掘他们所需的文本、代码、图像等。人们可能会根据最佳性能选择这些技术。在这方面,我们在这项研究中基于DeepSeek、ChatGPT和Gemini技术提供了一项比较研究。最初,我们关注它们的方法和材料,适当地包括数据选择标准。然后,我们根据它们的应用展示DeepSeek、ChatGPT和Gemini的最新功能。最重要的是,我们展示了它们之间的技术比较,并涵盖了各种应用的数据集分析。最后,我们针对社区提出了关于基于LLM的人工智能研究的广泛研究领域和未来潜在指导。
论文及项目相关链接
Summary
深度探索(DeepSeek)、ChatGPT和谷歌双子座(Google Gemini)是当前最流行的大型语言模型(LLM)技术,它们分别采用混合专家(MoE)方法、强化学习从人类反馈(RLHF)和多模态转换器架构,用于推理、多模态能力和一般语言性能。这些技术使人们能够以成本效益和特定领域的方式进行文本、代码、图像等的挖掘。本文对这些技术进行了比较研究,重点介绍了它们的方法、材料和数据选择标准,以及它们在应用程序方面的最新特性。同时,对三者进行了技术比较,并对各种应用的数据集进行了分析。最后,探讨了广泛的研究领域和未来关于LLM基于AI研究的潜在指导方向。
Key Takeaways
- DeepSeek、ChatGPT和Google Gemini是当前最流行的LLM技术。
- DeepSeek采用混合专家(MoE)方法,适用于特定领域的任务。
- ChatGPT使用强化学习从人类反馈(RLHF)进行增强。
- Google Gemini使用多模态转换器架构,整合文本、代码和图像。
- 这些技术使文本、代码、图像等的挖掘变得成本效益高且特定领域化。
- 文章对这些技术进行了比较,包括方法、材料、数据选择标准和应用程序。
点此查看论文截图




Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size
Authors:Alireza Behtash, Marijan Fofonjka, Ethan Baird, Tyler Mauer, Hossein Moghimifam, David Stout, Joel Dennison
We present a novel approach to selective model quantization that transcends the limitations of architecture-specific and size-dependent compression methods for Large Language Models (LLMs) using Entropy-Weighted Quantization (EWQ). By analyzing the entropy distribution across transformer blocks, EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models while reducing memory usage by up to 18%. We demonstrate the effectiveness of EWQ across multiple architectures – from 1.6B to 70B parameters – and showcase consistent improvements in the quality-compression trade-off regardless of model scale or architectural design. A surprising finding of EWQ is its ability to reduce perplexity compared to unquantized models, suggesting the presence of beneficial regularization through selective precision reduction. This improvement holds across different model families, indicating a fundamental relationship between layer-level entropy and optimal precision requirements. Additionally, we introduce FastEWQ, a rapid method for entropy distribution analysis that eliminates the need for loading model weights. This technique leverages universal characteristics of entropy distribution that persist across various architectures and scales, enabling near-instantaneous quantization decisions while maintaining 80% classification accuracy with full entropy analysis. Our results demonstrate that effective quantization strategies can be developed independently of specific architectural choices or model sizes, opening new possibilities for efficient LLM deployment.
我们提出了一种新的选择性模型量化方法,通过使用熵加权量化(EWQ)超越了针对大型语言模型(LLM)的特定架构和大小依赖的压缩方法的局限性。通过分析变压器模块之间的熵分布,EWQ能够确定哪些模块可以安全量化,而不会导致性能显著下降,这独立于模型架构或大小。我们的方法在保持大规模多任务语言理解(MMLU)准确度得分与未量化模型相差不到0.5%的同时,减少了高达18%的内存使用,优于均匀量化方法。我们证明了EWQ在多个架构中的有效性,参数从1.6B到70B不等,并且在模型规模或架构设计上实现了持续的改进,在质量压缩权衡方面表现优异。EWQ的一个意外发现是它能够减少困惑度与未量化模型相比,这表明通过选择性精度降低存在有益的正则化。这种改进在不同模型家族中普遍存在,表明层级熵和最佳精度要求之间存在根本关系。此外,我们引入了FastEWQ,这是一种快速分析熵分布的方法,无需加载模型权重。该技术利用熵分布的通用特性,这些特性在各种架构和规模上持续存在,能够在保持80%分类准确度的同时,进行近乎即时的量化决策。我们的结果表明,有效的量化策略可以独立于特定的架构选择或模型大小来开发,为高效部署LLM开辟了新的可能性。
论文及项目相关链接
PDF 29 pages, 7 figures, 14 tables; Fixed some types, added some clarifications and improvements
Summary
提出一种利用熵加权量化(EWQ)的LLM选择性模型量化新方法,该方法突破了架构特定和大小依赖的压缩方法的局限。通过分析变压器块的熵分布,EWQ可确定哪些块可在不影响性能的情况下安全量化,独立于模型架构和大小。该方法优于均匀量化方法,保持大规模多任务语言理解准确性,在不超过未量化模型0.5%误差的同时减少内存使用高达18%。此方法适用于多种架构,在不同规模或设计的模型中实现一致的改进。此外,EWQ通过选择性精度降低展现出有益的正规化效果,并降低困惑度。我们还介绍了FastEWQ技术,可以快速分析熵分布,无需加载模型权重。这表明,独立于特定架构选择和模型大小的有效量化策略是可行的,为高效部署LLM打开了新的可能性。
Key Takeaways
- 提出了一种名为EWQ的新型LLM模型量化方法,能针对不同的模型和架构进行选择性量化。
- EWQ基于熵分布的分析确定可量化的变压器块,避免因量化导致的性能下降。
- EWQ相较于均匀量化方法,能维持较高的准确性,同时显著降低内存使用。
- EWQ在多个不同规模和设计的模型中表现一致,包括大型语言模型。
- EWQ在选择性降低精度的情况下展现出正则化效果,相较于未量化的模型降低了困惑度。
- 介绍了FastEWQ技术,能快速分析熵分布而无需加载模型权重,提高了量化决策的效率。
点此查看论文截图



LiGT: Layout-infused Generative Transformer for Visual Question Answering on Vietnamese Receipts
Authors:Thanh-Phong Le, Trung Le Chi Phan, Nghia Hieu Nguyen, Kiet Van Nguyen
Document Visual Question Answering (Document VQA) challenges multimodal systems to holistically handle textual, layout, and visual modalities to provide appropriate answers. Document VQA has gained popularity in recent years due to the increasing amount of documents and the high demand for digitization. Nonetheless, most of document VQA datasets are developed in high-resource languages such as English. In this paper, we present ReceiptVQA (\textbf{Receipt} \textbf{V}isual \textbf{Q}uestion \textbf{A}nswering), the initial large-scale document VQA dataset in Vietnamese dedicated to receipts, a document kind with high commercial potentials. The dataset encompasses \textbf{9,000+} receipt images and \textbf{60,000+} manually annotated question-answer pairs. In addition to our study, we introduce LiGT (\textbf{L}ayout-\textbf{i}nfused \textbf{G}enerative \textbf{T}ransformer), a layout-aware encoder-decoder architecture designed to leverage embedding layers of language models to operate layout embeddings, minimizing the use of additional neural modules. Experiments on ReceiptVQA show that our architecture yielded promising performance, achieving competitive results compared with outstanding baselines. Furthermore, throughout analyzing experimental results, we found evident patterns that employing encoder-only model architectures has considerable disadvantages in comparison to architectures that can generate answers. We also observed that it is necessary to combine multiple modalities to tackle our dataset, despite the critical role of semantic understanding from language models. We hope that our work will encourage and facilitate future development in Vietnamese document VQA, contributing to a diverse multimodal research community in the Vietnamese language.
文档视觉问答(Document VQA)挑战多模态系统,以全面处理文本、布局和视觉模式,以提供适当的答案。近年来,由于文档数量的不断增加和数字化需求的不断增长,文档VQA的受欢迎程度日益提高。然而,大部分的文档VQA数据集都是用资源丰富的语言(如英语)开发的。在本文中,我们介绍了ReceiptVQA(收据视觉问答),这是越南语中针对收据的首个大规模文档VQA数据集,收据是一种具有极高商业潜力的文档类型。该数据集包含超过9000张收据图像和超过6万个手动标注的问题答案对。除了我们的研究,我们还引入了LiGT(布局融入生成Transformer),这是一个布局感知的编码器-解码器架构,旨在利用语言模型的嵌入层来操作布局嵌入,尽量减少使用额外的神经网络模块。在ReceiptVQA上的实验表明,我们的架构取得了有希望的性能,与优秀的基线相比取得了具有竞争力的结果。此外,通过分析实验结果,我们发现与能够生成答案的架构相比,使用仅编码器模型架构具有明显的劣势。我们还观察到,尽管语言模型的语义理解起着关键作用,但结合多种模式来解决我们的数据集是必要的。我们希望我们的工作能鼓励和促进越南语文档VQA的未来发展,为越南语的多模态研究社区做出贡献。
论文及项目相关链接
PDF Accepted at IJDAR
摘要
本文介绍了Document Visual Question Answering(Document VQA)的挑战,并针对文本、布局和视觉模态的综合处理要求提供答案。文中提出了一种新的越南语收据文档VQA数据集ReceiptVQA,包含超过9,000张收据图像和超过6万组手动标注的问题答案对。此外,还引入了LiGT模型,这是一种布局感知的编码器-解码器架构,旨在利用语言模型的嵌入层进行布局嵌入,减少使用额外的神经网络模块。实验表明,该架构在ReceiptVQA上表现出良好的性能,并与基线模型相比具有竞争力。分析实验结果显示,与只能编码的模型架构相比,能够生成答案的架构具有明显的优势。同时观察到,尽管语言模型的语义理解起着关键作用,但结合多种模态对于解决数据集的问题仍是必要的。希望这项工作能推动越南语文档VQA的发展,并为越南语的多模态研究社区做出贡献。
关键见解
- Document VQA面临处理文本、布局和视觉模态的挑战。
- 提出了一种新的越南语收据文档VQA数据集ReceiptVQA。
- 引入了LiGT模型,一种布局感知的编码器-解码器架构。
- LiGT模型在ReceiptVQA数据集上表现出良好的性能。
- 实验分析显示,生成答案的模型架构比仅编码的模型架构具有优势。
- 结合多种模态对于解决数据集的问题是必要的。
点此查看论文截图


Exoplanet Transit Candidate Identification in TESS Full-Frame Images via a Transformer-Based Algorithm
Authors:Helem Salinas, Rafael Brahm, Greg Olmschenk, Richard K. Barry, Karim Pichara, Stela Ishitani Silva, Vladimir Araujo
The Transiting Exoplanet Survey Satellite (TESS) is surveying a large fraction of the sky, generating a vast database of photometric time series data that requires thorough analysis to identify exoplanetary transit signals. Automated learning approaches have been successfully applied to identify transit signals. However, most existing methods focus on the classification and validation of candidates, while few efforts have explored new techniques for the search of candidates. To search for new exoplanet transit candidates, we propose an approach to identify exoplanet transit signals without the need for phase folding or assuming periodicity in the transit signals, such as those observed in multi-transit light curves. To achieve this, we implement a new neural network inspired by Transformers to directly process Full Frame Image (FFI) light curves to detect exoplanet transits. Transformers, originally developed for natural language processing, have recently demonstrated significant success in capturing long-range dependencies compared to previous approaches focused on sequential data. This ability allows us to employ multi-head self-attention to identify exoplanet transit signals directly from the complete light curves, combined with background and centroid time series, without requiring prior transit parameters. The network is trained to learn characteristics of the transit signal, like the dip shape, which helps distinguish planetary transits from other variability sources. Our model successfully identified 214 new planetary system candidates, including 122 multi-transit light curves, 88 single-transit and 4 multi-planet systems from TESS sectors 1-26 with a radius > 0.27 $R_{\mathrm{Jupiter}}$, demonstrating its ability to detect transits regardless of their periodicity.
凌日系外行星勘测卫星(TESS)正在对天空进行大规模勘测,生成大量光度时间序列数据,需要进行彻底分析以识别凌日系外行星的信号。自动化学习方法已成功应用于识别凌日信号。然而,大多数现有方法都集中在候选对象的分类和验证上,而对寻找新候选对象的新技术探索却很少。为了寻找新的凌日系外行星候选对象,我们提出了一种无需相位折叠或假设凌日信号周期性的方法,例如多凌日光度曲线所观察到的那样。为了实现这一点,我们受到Transformer启发的神经网络,直接处理全帧图像(FFI)光度曲线来检测凌日系外行星。Transformer最初是为自然语言处理而开发的,最近的研究表明,与以前专注于序列数据的方法相比,它在捕获长期依赖关系方面取得了重大成功。这种能力使我们能够利用多头自注意力从完整的光度曲线中直接识别凌日系外行星的信号,结合背景和质心时间序列,无需事先的凌日参数。网络经过训练,学习凌日信号的特征,如暗形,这有助于区分行星凌日和其他变源。我们的模型成功识别了214个新的行星系统候选对象,包括122个多凌日光度曲线、88个单凌日和来自TESS 1-26扇区的4个多行星系统,其半径大于0.27个木星半径,证明了其检测凌日的能力,无论其周期性如何。
论文及项目相关链接
Summary
该文本介绍了利用Transformer神经网络直接处理全帧图像(FFI)光变曲线来检测行星过境的新方法。该方法无需相位折叠或假设过境信号的周期性,可直接从完整的光变曲线中识别出行星过境信号。此方法已成功识别出TESS探测器观测到的214个新的行星系统候选体。
Key Takeaways
- TESS正在对天空进行大规模观测,生成了大量的光变曲线数据。
- 现有方法主要集中在候选者的分类和验证上,对新候选者的搜索方法探索较少。
- 提出了一种新的基于Transformer神经网络的方法,用于直接处理全帧图像(FFI)光变曲线来检测行星过境信号。
- 该方法无需假设过境信号的周期性,可直接从完整的光变曲线中识别信号。
- Transformer网络具有捕捉长期依赖关系的能力,有助于识别行星过境信号。
- 该方法通过学习和识别过境信号的特征,如暗斑形状,来区分行星过境和其他变化源。
点此查看论文截图



Bias Unveiled: Investigating Social Bias in LLM-Generated Code
Authors:Lin Ling, Fazle Rabbi, Song Wang, Jinqiu Yang
Large language models (LLMs) have significantly advanced the field of automated code generation. However, a notable research gap exists in evaluating social biases that may be present in the code produced by LLMs. To solve this issue, we propose a novel fairness framework, i.e., Solar, to assess and mitigate the social biases of LLM-generated code. Specifically, Solar can automatically generate test cases for quantitatively uncovering social biases of the auto-generated code by LLMs. To quantify the severity of social biases in generated code, we develop a dataset that covers a diverse set of social problems. We applied Solar and the crafted dataset to four state-of-the-art LLMs for code generation. Our evaluation reveals severe bias in the LLM-generated code from all the subject LLMs. Furthermore, we explore several prompting strategies for mitigating bias, including Chain-of-Thought (CoT) prompting, combining positive role-playing with CoT prompting and dialogue with Solar. Our experiments show that dialogue with Solar can effectively reduce social bias in LLM-generated code by up to 90%. Last, we make the code and data publicly available is highly extensible to evaluate new social problems.
大型语言模型(LLM)在自动化代码生成领域取得了显著进展。然而,评估LLM生成的代码中可能存在的社会偏见仍存在显著的研究空白。为了解决这一问题,我们提出了一个公平性的新框架,即Solar,来评估和缓解LLM生成代码的社会偏见。具体来说,Solar能够自动生成测试用例,定量发现LLM生成的代码中的社会偏见。为了量化生成代码中社会偏见的严重程度,我们开发了一个涵盖多种社会问题的数据集。我们将Solar和定制数据集应用于四个最先进的代码生成LLM。我们的评估发现所有测试LLM生成的代码都存在严重偏见。此外,我们探索了几种减轻偏见的方法,包括链式思维(CoT)提示、将积极角色扮演与CoT提示和与Solar对话相结合。我们的实验表明,与Solar的对话可以有效地减少LLM生成代码中的社会偏见高达90%。最后,我们公开了代码和数据集,这对于评估新的社会问题具有很强的可扩展性。
论文及项目相关链接
PDF accepted for publication in the Association for the Advancement of Artificial Intelligence (AAAI), 2025
Summary
大型语言模型(LLM)在自动化代码生成领域取得了显著进展,但评估LLM生成代码中可能存在的社会偏见的研究空白仍然显著。为解决这一问题,本文提出了一个名为Solar的新型公平性框架,用于评估和缓解LLM生成代码中的社会偏见。Solar能够自动生成测试用例,定量发现LLM生成代码中的社会偏见。为量化生成代码中社会偏见的严重程度,本文开发了一个涵盖多种社会问题的数据集。将Solar和数据集应用于四种先进的LLM代码生成模型,发现所有模型生成的代码都存在严重偏见。此外,本文探讨了多种缓解偏见的方法,包括Chain-of-Thought(CoT)提示法、结合正面角色扮演与CoT提示法以及与Solar对话等。实验表明,与Solar的对话可以有效减少LLM生成代码中的社会偏见,高达90%。最后,公开的代码和数据集可高度扩展,用于评估新的社会问题。
Key Takeaways
- 大型语言模型(LLM)在自动化代码生成中表现出显著进展。
- LLM生成的代码中可能存在社会偏见的问题尚未得到充分研究。
- Solar框架被提出用于评估和缓解LLM生成代码中的社会偏见。
- Solar能自动生成测试用例以发现LLM生成代码中的社会偏见。
- 开发了一个涵盖多种社会问题的数据集,用于量化生成代码中社会偏见的严重程度。
- 对四种先进的LLM代码生成模型进行评估,发现存在严重偏见。
点此查看论文截图






Improving Data Efficiency via Curating LLM-Driven Rating Systems
Authors:Jinlong Pang, Jiaheng Wei, Ankit Parag Shah, Zhaowei Zhu, Yaxuan Wang, Chen Qian, Yang Liu, Yujia Bao, Wei Wei
Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2, a Diversity-aware Score curation method for Data Selection. By systematically modeling error patterns through a score transition matrix, DS2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that “more can be less.”
指令调整对于适应大型语言模型(LLM)下游任务至关重要,最近的研究表明,少量人工整理的数据可以超越大规模数据集,挑战传统数据规模定律。虽然基于LLM的数据质量评分系统为人工标注提供了一种成本效益高的替代方案,但它们即使在GPT-4等强大模型中也会存在不准确和偏见的问题。在这项工作中,我们引入了DS2,一种用于数据选择的多样性感知分数整理方法。通过系统地通过评分转换矩阵对错误模式进行建模,DS2校正了基于LLM的评分并促进了所选数据样本的多样性。我们的方法表明,经过整理的子集(仅占原始数据集的3.3%)在各种机器对齐基准测试中优于全面数据集(30万样本),并且与相同样本量(1千个样本)的人类对齐数据集如LIMA相匹配或更胜一筹。这些发现挑战了传统数据规模假设,强调冗余、低质量的样本可能会降低性能并再次证明“更多可能是更少”。
论文及项目相关链接
Summary
本文强调了指令微调在适应大型语言模型(LLM)到下游任务中的重要性,并指出小量的人工编制数据可能优于大规模数据集,挑战了传统数据规模定律。针对LLM基础数据质量评估系统存在的不准确和偏见问题,本文提出了DS2,一种用于数据选择的多样性感知分数编纂方法。通过系统建模误差模式,DS2校正了LLM基础上的分数并促进了所选数据样本的多样性。实验结果显示,精选的子集(仅占原始数据的3.3%)在各种机器对齐基准测试中表现优于全面数据集(30万样本),并在相同样本量下与人工对齐的数据集(如LIMA)相匹配或表现更好。这些发现挑战了传统数据规模假设,强调冗余、低质量的样本会损害性能,再次证明了“更多不一定更好”。
Key Takeaways
- 指令微调在LLM适应下游任务中起关键作用。
- 小规模人工编制数据在性能上可能优于大规模数据集,对传统数据规模定律提出挑战。
- LLM基础数据质量评估系统存在不准确性和偏见问题。
- DS2方法通过系统建模误差模式来校正LLM分数并促进数据多样性。
- DS2方法显示精选子集在各种机器对齐基准测试中表现优异。
- 与相同规模的人工对齐数据集相比,精选子集的表现相匹配或更佳。
点此查看论文截图





AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models
Authors:Junfeng Fang, Houcheng Jiang, Kun Wang, Yunshan Ma, Shi Jie, Xiang Wang, Xiangnan He, Tat-seng Chua
Large language models (LLMs) often exhibit hallucinations due to incorrect or outdated knowledge. Hence, model editing methods have emerged to enable targeted knowledge updates. To achieve this, a prevailing paradigm is the locating-then-editing approach, which first locates influential parameters and then edits them by introducing a perturbation. While effective, current studies have demonstrated that this perturbation inevitably disrupt the originally preserved knowledge within LLMs, especially in sequential editing scenarios. To address this, we introduce AlphaEdit, a novel solution that projects perturbation onto the null space of the preserved knowledge before applying it to the parameters. We theoretically prove that this projection ensures the output of post-edited LLMs remains unchanged when queried about the preserved knowledge, thereby mitigating the issue of disruption. Extensive experiments on various LLMs, including LLaMA3, GPT2-XL, and GPT-J, show that AlphaEdit boosts the performance of most locating-then-editing methods by an average of 36.4% with a single line of additional code for projection solely. Our code is available at: https://github.com/jianghoucheng/AlphaEdit.
大型语言模型(LLM)由于知识不正确或过时,经常会出现幻觉。因此,出现了模型编辑方法,以实现针对性的知识更新。为此,一种普遍的方法是“定位然后编辑”的方法,它首先定位有影响力的参数,然后通过引入扰动来进行编辑。尽管这种方法有效,但当前的研究已经证明,这种扰动不可避免地会破坏LLM中原本保存的知识,特别是在连续编辑场景中。为解决这一问题,我们引入了AlphaEdit,这是一种全新的解决方案,它将扰动投影到保留知识的零空间上,然后再将其应用到参数上。我们从理论上证明了这种投影确保了当被问及保留的知识时,编辑后的LLM的输出保持不变,从而缓解了知识被干扰的问题。在包括LLaMA3、GPT2-XL和GPT-J等各种LLM上的大量实验表明,AlphaEdit通过将投影的单行代码添加到现有的定位然后编辑方法中,平均提高了其性能约36.4%。我们的代码可在以下网址找到:https://github.com/jianghoucheng/AlphaEdit。
论文及项目相关链接
Summary
大语言模型(LLM)易出现幻觉,原因在于知识不正确或过时。因此,出现了模型编辑方法对知识库进行有针对性的更新。当前主流方法是定位-编辑模式,即先定位关键参数后进行编辑。然而,这种方式会破坏原有知识,特别是在连续编辑场景中。为解决这一问题,我们推出AlphaEdit,一种将扰动投影到保留知识的零空间再应用于参数的新方法。理论证明,该方法能确保对保留知识的查询得到相同输出,解决了扰动导致的破坏问题。在多种LLM上的实验表明,AlphaEdit能平均提升定位-编辑方法的性能达36.4%,仅需添加一行投影代码。代码地址:https://github.com/jianghoucheng/AlphaEdit。
Key Takeaways
- LLMs易因知识不正确或过时产生幻觉。
- 现有的模型编辑方法采用定位-编辑模式,但会破坏原有知识。
- AlphaEdit通过投影技术将扰动应用于保留知识的零空间,确保对保留知识的查询输出不变。
- AlphaEdit能显著提升定位-编辑方法的性能,平均提升达36.4%。
- AlphaEdit方法只需在现有代码基础上添加一行投影代码。
点此查看论文截图




SynSUM – Synthetic Benchmark with Structured and Unstructured Medical Records
Authors:Paloma Rabaey, Henri Arno, Stefan Heytens, Thomas Demeester
We present the SynSUM benchmark, a synthetic dataset linking unstructured clinical notes to structured background variables. The dataset consists of 10,000 artificial patient records containing tabular variables (like symptoms, diagnoses and underlying conditions) and related notes describing the fictional patient encounter in the domain of respiratory diseases. The tabular portion of the data is generated through a Bayesian network, where both the causal structure between the variables and the conditional probabilities are proposed by an expert based on domain knowledge. We then prompt a large language model (GPT-4o) to generate a clinical note related to this patient encounter, describing the patient symptoms and additional context. We conduct both an expert evaluation study to assess the quality of the generated notes, as well as running some simple predictor models on both the tabular and text portions of the dataset, forming a baseline for further research. The SynSUM dataset is primarily designed to facilitate research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text - the symptoms, in the case of SynSUM. Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation.
我们推出了SynSUM基准测试集,这是一个合成数据集,它将非结构化的临床笔记与结构化的背景变量联系起来。该数据集包含10,000条人工患者记录,包括表格变量(如症状、诊断和潜在疾病)以及描述呼吸系统疾病领域中虚构患者遭遇的相关笔记。数据的表格部分是通过贝叶斯网络生成的,变量之间的因果结构和条件概率均由专家基于领域知识提出。然后,我们提示大型语言模型(GPT-4o)生成与该患者遭遇相关的临床笔记,描述患者的症状和额外的上下文。我们既进行了专家评估研究以评估生成的笔记的质量,也在数据集的文字和表格部分上运行了一些简单的预测模型,为进一步的研究提供参考基准。SynSUM数据集主要用于促进在有表格背景变量的情况下进行临床信息提取的研究,这些背景变量可以通过领域知识链接到文本中感兴趣的概念——以SynSUM中的症状为例。次要用途包括在表格数据和文本上的自动化临床推理研究、存在表格和/或文本混淆因素时的因果效应估计以及多模式合成数据生成。
论文及项目相关链接
PDF The dataset can be downloaded from https://github.com/prabaey/synsum. Presented at the GenAI4Health workshop at AAAI 2025
Summary
本数据集SynSUM为合成数据集,链接了结构化背景变量与非结构化临床笔记。数据集包含一万条虚构的患者记录,涉及呼吸疾病领域的表格变量(如症状、诊断及基础疾病)及相关描述患者遭遇的临床笔记。表格数据通过贝叶斯网络生成,专家基于领域知识提出变量间的因果结构及条件概率。之后利用大型语言模型(GPT-4o)生成与病人遭遇相关的临床笔记,描述病人症状及额外情境。本研究通过专家评估生成的笔记质量,同时在数据集文本和表格部分运行简单的预测模型,为后续研究提供基线。SynSUM数据集主要用于促进临床信息提取研究,尤其在存在表格背景变量的情况下,可通过领域知识链接文本中的概念(如SynSUM中的症状)。次要用途包括自动化临床推理研究,处理表格和文本数据时的因果效应估算和多模态合成数据生成。
Key Takeaways
- SynSUM是一个合成数据集,将非结构化临床笔记与结构化背景变量链接起来。
- 数据集包含一万条虚构的患者记录,涉及呼吸疾病领域的表格数据和临床笔记。
- 表格数据通过贝叶斯网络生成,反映变量间的因果结构和条件概率。
- 利用大型语言模型生成临床笔记,描述患者症状及情境。
- 数据集用于研究临床信息提取、自动化临床推理、因果效应估算和多模态合成数据生成等。
- 数据集经过专家评估生成的笔记质量,同时提供了预测模型的基线数据。
点此查看论文截图




Transformer Block Coupling and its Correlation with Generalization in LLMs
Authors:Murdock Aubry, Haoming Meng, Anton Sugolov, Vardan Papyan
Large Language Models (LLMs) have made significant strides in natural language processing, and a precise understanding of the internal mechanisms driving their success is essential. In this work, we analyze the trajectories of token embeddings as they pass through transformer blocks, linearizing the system along these trajectories through their Jacobian matrices. By examining the relationships between these block Jacobians, we uncover the phenomenon of \textbf{transformer block coupling} in a multitude of LLMs, characterized by the coupling of their top singular vectors across tokens and depth. Our findings reveal that coupling \textit{positively correlates} with model performance, and that this relationship is stronger than with other hyperparameters such as parameter count, model depth, and embedding dimension. We further investigate how these properties emerge during training, observing a progressive development of coupling, increased linearity, and layer-wise exponential growth in token trajectories. Additionally, experiments with Vision Transformers (ViTs) corroborate the emergence of coupling and its relationship with generalization, reinforcing our findings in LLMs. Collectively, these insights offer a novel perspective on token interactions in transformers, opening new directions for studying their mechanisms as well as improving training and generalization.
大型语言模型(LLM)在自然语言处理方面取得了显著进展,对驱动其成功的内部机制进行精确理解至关重要。在这项工作中,我们分析了令牌嵌入在通过transformer块时的轨迹,通过其雅可比矩阵沿这些轨迹线性化系统。通过检查这些块雅可比之间的关系,我们在多种LLM中发现了“transformer块耦合”现象,其特征在于令牌和深度之间的顶部奇异向量的耦合。我们的研究发现,耦合与模型性能呈正相关,而且这种关系比参数数量、模型深度和嵌入维度等其他超参数更强。我们进一步调查了这些属性在训练过程中的出现情况,观察到耦合、线性增强和逐层指数增长的轨迹的发展过程。此外,对视觉转换器(ViTs)的实验证实了耦合的出现及其与泛化的关系,证实了我们在LLM中的发现。总的来说,这些见解为变压器中的令牌交互提供了新的视角,为学习其机制以及改进训练和泛化能力开辟了新方向。
论文及项目相关链接
PDF Published as a conference paper at the International Conference on Learning Representations (ICLR 2025)
Summary
大型语言模型(LLM)通过变换器块进行标记嵌入轨迹分析,揭示了变换器块耦合现象。该现象表现为标记和深度之间的顶部奇异向量耦合。研究发现,耦合与模型性能正相关,与其他超参数相比,如参数计数、模型深度和嵌入维度等更为紧密。训练过程中,观察到耦合、线性增长和逐层指数增长的标记轨迹的逐步发展。此外,视觉变换器(ViT)实验也证实了耦合的出现与其泛化性的关联,为理解标记在转换器中的交互作用提供了新视角,也为改进训练和泛化指明了方向。
Key Takeaways
- LLMs通过变换器块进行标记嵌入轨迹分析,揭示了变换器块耦合现象。
- 变换器块耦合表现为标记和深度之间的顶部奇异向量耦合。
- 耦合与模型性能呈正相关。
- 相比其他超参数,模型耦合与模型性能的关系更为紧密。
- 训练过程中观察到耦合、线性增长和逐层指数增长的标记轨迹的逐步发展。
- ViT实验证实了耦合的出现与泛化性之间的关联。
点此查看论文截图




Efficient Automated Circuit Discovery in Transformers using Contextual Decomposition
Authors:Aliyah R. Hsu, Georgia Zhou, Yeshwanth Cherapanamjeri, Yaxuan Huang, Anobel Y. Odisho, Peter R. Carroll, Bin Yu
Automated mechanistic interpretation research has attracted great interest due to its potential to scale explanations of neural network internals to large models. Existing automated circuit discovery work relies on activation patching or its approximations to identify subgraphs in models for specific tasks (circuits). They often suffer from slow runtime, approximation errors, and specific requirements of metrics, such as non-zero gradients. In this work, we introduce contextual decomposition for transformers (CD-T) to build interpretable circuits in large language models. CD-T can produce circuits of arbitrary level of abstraction, and is the first able to produce circuits as fine-grained as attention heads at specific sequence positions efficiently. CD-T consists of a set of mathematical equations to isolate contribution of model features. Through recursively computing contribution of all nodes in a computational graph of a model using CD-T followed by pruning, we are able to reduce circuit discovery runtime from hours to seconds compared to state-of-the-art baselines. On three standard circuit evaluation datasets (indirect object identification, greater-than comparisons, and docstring completion), we demonstrate that CD-T outperforms ACDC and EAP by better recovering the manual circuits with an average of 97% ROC AUC under low runtimes. In addition, we provide evidence that faithfulness of CD-T circuits is not due to random chance by showing our circuits are 80% more faithful than random circuits of up to 60% of the original model size. Finally, we show CD-T circuits are able to perfectly replicate original models’ behavior (faithfulness $ = 1$) using fewer nodes than the baselines for all tasks. Our results underscore the great promise of CD-T for efficient automated mechanistic interpretability, paving the way for new insights into the workings of large language models.
自动化机制解释研究因其将神经网络内部解释扩展到大型模型的潜力而备受关注。现有的自动化电路发现工作依赖于激活补丁或其近似值来识别模型中的特定任务子图(电路)。它们常常受到运行速度慢、近似误差和特定指标要求(如非零梯度)的困扰。在这项工作中,我们引入了用于构建大型语言模型中可解释电路的变压器上下文分解(CD-T)。CD-T可以产生任意抽象层次的电路,并且是第一个能够高效地在特定序列位置产生精细到注意力头的电路的方法。CD-T由一组数学方程构成,用于隔离模型特征的贡献。通过递归计算模型计算图中所有节点的贡献,并使用CD-T进行修剪,与最新基线相比,我们将电路发现运行时间从数小时减少到了数秒。在三个标准电路评估数据集(间接对象识别、大于比较和docstring完成)上,我们证明了CD-T优于ACDC和EAP,能够更好地恢复手动电路,在低运行时间下的平均ROC AUC达到97%。此外,我们通过显示我们的电路比随机电路更忠实(忠实度高出80%),而随机电路最多达到原始模型大小的60%,证明了CD-T电路的忠实性并非偶然。最后,我们展示了CD-T电路能够完美复制原始模型的行为(忠实度= 1),在所有任务中使用比基线更少的节点。我们的研究结果强调了CD-T在高效自动化机制解释方面的巨大潜力,为深入了解大型语言模型的工作原理开辟了道路。
论文及项目相关链接
摘要
本文介绍了一种针对大型语言模型的解释性电路构建方法——上下文分解法(CD-T)。现有自动化电路发现工作存在运行速度慢、近似误差和特定指标要求等问题,而CD-T能够产生任意层次的抽象电路,并能高效地产生精细到特定序列位置注意力头的电路。通过递归计算模型中所有节点的贡献并进行修剪,CD-T将电路发现的时间从小时减少到秒。在三个标准电路评估数据集上,CD-T表现出优异的性能,优于ACDC和EAP,平均ROC AUC达到97%。此外,CD-T电路的可信性证据显示,其比随机电路更忠实于原始模型,且使用较少的节点就能完全复制原始模型的行为。结果突出了CD-T在高效自动化机械解释方面的巨大潜力,为深入了解大型语言模型的工作机制铺平了道路。
关键见解
- 现有自动化电路发现方法存在运行速度慢、近似误差和特定指标要求等问题。
- 引入上下文分解法(CD-T)来构建大型语言模型的解释性电路。
- CD-T能高效地产生精细到注意力头的电路。
- 通过递归计算模型中所有节点的贡献并进行修剪,显著减少电路发现时间。
- 在三个标准电路评估数据集上,CD-T表现出卓越的性能,优于其他方法。
- CD-T电路比随机电路更忠实于原始模型。
点此查看论文截图

SoK: Membership Inference Attacks on LLMs are Rushing Nowhere (and How to Fix It)
Authors:Matthieu Meeus, Igor Shilov, Shubham Jain, Manuel Faysse, Marek Rei, Yves-Alexandre de Montjoye
Whether LLMs memorize their training data and what this means, from measuring privacy leakage to detecting copyright violations, has become a rapidly growing area of research. In the last few months, more than 10 new methods have been proposed to perform Membership Inference Attacks (MIAs) against LLMs. Contrary to traditional MIAs which rely on fixed-but randomized-records or models, these methods are mostly trained and tested on datasets collected post-hoc. Sets of members and non-members, used to evaluate the MIA, are constructed using informed guesses after the release of a model. This lack of randomization raises concerns of a distribution shift between members and non-members. In this work, we first extensively review the literature on MIAs against LLMs and show that, while most work focuses on sequence-level MIAs evaluated in post-hoc setups, a range of target models, motivations and units of interest are considered. We then quantify distribution shifts present in 6 datasets used in the literature using a model-less bag of word classifier and show that all datasets constructed post-hoc suffer from strong distribution shifts. These shifts invalidate the claims of LLMs memorizing strongly in real-world scenarios and, potentially, also the methodological contributions of the recent papers based on these datasets. Yet, all hope might not be lost. We introduce important considerations to properly evaluate MIAs against LLMs and discuss, in turn, potential ways forwards: randomized test splits, injections of randomized (unique) sequences, randomized fine-tuning, and several post-hoc control methods. While each option comes with its advantages and limitations, we believe they collectively provide solid grounds to guide MIA development and study LLM memorization. We conclude with an overview of recommended approaches to benchmark sequence-level and document-level MIAs against LLMs.
关于LLM是否记忆其训练数据以及这意味着什么,从衡量隐私泄露到检测版权侵犯,已成为一个迅速发展的研究领域。在过去的几个月里,已经提出了超过10种针对LLM执行成员推理攻击(MIA)的新方法。与传统依靠固定但随机记录或模型的MIA不同,这些方法大多在收集的后验数据集上进行训练和测试。用于评估MIA的成员和非成员集是在模型发布后通过有根据的猜测构建的。这种缺乏随机性引发了成员与非成员之间分布转移的担忧。
论文及项目相关链接
PDF IEEE Conference on Secure and Trustworthy Machine Learning (SaTML 2025)
Summary
LLMs面临成员推理攻击(MIAs)的研究迅速增长,新的攻击方法不断被提出。这些新方法主要在事后收集的数据集上进行训练和测试,存在成员与非成员之间的分布偏移问题。本文对LLMs的MIAs文献进行全面回顾,发现存在一系列目标模型、动机和感兴趣单位。通过模型无关的词汇分类器量化六个数据集的分布偏移,显示所有事后构建的数据集都存在强烈的分布偏移。这些偏移对LLMs在现实场景中的记忆能力提出了质疑,也可能使基于这些数据集的方法论贡献无效。本文讨论了正确评估LLMs的MIAs的重要考虑因素,并提出了前进的潜在方式,包括随机测试分割、注入随机(唯一)序列、随机微调等。
Key Takeaways
- LLMs面临成员推理攻击(MIAs)的研究迅速增长,已提出超过10种新的攻击方法。
- 这些新方法主要在事后数据集上进行训练和测试,存在分布偏移问题。
- 大多数研究关注序列级别的MIAs和事后评估设置,但目标模型、动机和感兴趣单位存在多种选择。
- 通过模型无关的词汇分类器量化六个数据集的分布偏移,显示所有数据集都存在强烈的分布偏移。
- 分布偏移对LLMs在现实场景中的记忆能力提出质疑,也可能使基于这些数据集的方法论无效。
- 本文讨论了正确评估LLMs的MIAs的重要因素,并提出了改进方向,包括随机测试分割、注入随机序列等。
点此查看论文截图



Efficient Evolutionary Search Over Chemical Space with Large Language Models
Authors:Haorui Wang, Marta Skreta, Cher-Tian Ser, Wenhao Gao, Lingkai Kong, Felix Strieth-Kalthoff, Chenru Duan, Yuchen Zhuang, Yue Yu, Yanqiao Zhu, Yuanqi Du, Alán Aspuru-Guzik, Kirill Neklyudov, Chao Zhang
Molecular discovery, when formulated as an optimization problem, presents significant computational challenges because optimization objectives can be non-differentiable. Evolutionary Algorithms (EAs), often used to optimize black-box objectives in molecular discovery, traverse chemical space by performing random mutations and crossovers, leading to a large number of expensive objective evaluations. In this work, we ameliorate this shortcoming by incorporating chemistry-aware Large Language Models (LLMs) into EAs. Namely, we redesign crossover and mutation operations in EAs using LLMs trained on large corpora of chemical information. We perform extensive empirical studies on both commercial and open-source models on multiple tasks involving property optimization, molecular rediscovery, and structure-based drug design, demonstrating that the joint usage of LLMs with EAs yields superior performance over all baseline models across single- and multi-objective settings. We demonstrate that our algorithm improves both the quality of the final solution and convergence speed, thereby reducing the number of required objective evaluations. Our code is available at http://github.com/zoom-wang112358/MOLLEO
分子发现的优化问题带来了巨大的计算挑战,因为优化目标可能是不可微分的。进化算法(EA)常用于优化分子发现中的黑盒目标,它通过执行随机突变和交叉操作来遍历化学空间,从而导致大量的目标评估成本高昂。在这项工作中,我们通过将具有化学感知能力的大型语言模型(LLM)纳入进化算法来缓解这一缺点。具体来说,我们使用在大量化学信息语料库上训练的大型语言模型重新设计进化算法中的交叉和突变操作。我们在涉及属性优化、分子再发现和基于结构的药物设计等多个任务上,对商业和开源模型进行了广泛的实证研究,证明了大型语言模型与进化算法的联合使用在单目标和多目标设置中都优于所有基线模型。我们证明我们的算法提高了最终解决方案的质量和收敛速度,从而减少了所需的目标评估次数。我们的代码可在http://github.com/zoom-wang112358/MOLLEO找到。
论文及项目相关链接
PDF Published in ICLR 2025
Summary
本文主要介绍了将大型语言模型(LLMs)融入进化算法(EAs)来解决分子发现中的优化问题。通过利用LLMs处理化学信息的能力,重新设计了EA中的交叉和突变操作。在多个任务上进行了实证研究,证明联合使用LLMs和EAs的方法在单目标和多目标设置下均优于所有基线模型,提高了最终解决方案的质量和收敛速度,减少了所需的目标评估次数。
Key Takeaways
- 大型语言模型(LLMs)被融入进化算法(EAs),以解决分子发现中的优化问题。
- LLMs用于重新设计EA中的交叉和突变操作。
- 实证研究证明了LLMs和EAs联合使用在多个任务上的优越性。
- 该方法提高了最终解决方案的质量和收敛速度。
- 该方法减少了所需的目标评估次数。
- 该算法在单目标和多目标设置下均表现优异。
点此查看论文截图


VoCo-LLaMA: Towards Vision Compression with Large Language Models
Authors:Xubing Ye, Yukang Gan, Xiaoke Huang, Yixiao Ge, Yansong Tang
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs’ understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distill how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576$\times$, resulting in up to 94.8$%$ fewer FLOPs and 69.6$%$ acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs’ contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed via https://yxxxb.github.io/VoCo-LLaMA-page/.
视觉语言模型(VLMs)在各种多模态任务中取得了显著的成功,但它们常常受到处理高分辨率图像输入和视频时有限上下文窗口和高计算成本的限制。视觉压缩可以通过减少视觉令牌数量来缓解这个问题。以前的方法使用外部模块压缩视觉令牌,并强制LLMs理解压缩后的令牌,导致视觉信息丢失。然而,LLMs对视觉令牌的理解模式在压缩学习过程中没有得到充分利用。我们提出了VoCo-LLaMA,这是使用LLMs压缩视觉令牌的第一种方法。通过在视觉指令调整阶段引入视觉压缩令牌,并利用注意力蒸馏,我们的方法将LLMs如何理解视觉令牌转化为它们对VoCo令牌的处理。VoCo-LLaMA促进了有效的视觉压缩,并提高了推理阶段的计算效率。具体来说,我们的方法以576倍的压缩比实现了最小的性能损失,导致FLOPs减少高达94.8%,推理时间加速69.6%。此外,通过连续使用时间序列压缩令牌序列的视频帧,VoCo-LLaMA展示了理解时间关联的能力,在流行的视频问答基准测试中超越了之前的方法。我们的方法为解锁VLMs上下文窗口的潜力提供了一种有前途的方式,使更多可扩展的多模态应用程序成为可能。项目页面以及相关代码可以通过https://yxxxb.github.io/VoCo-LLaMA-page/访问。
论文及项目相关链接
PDF 11 pages, 4 figures
Summary
本文介绍了在视觉语言模型(VLMs)中利用大型语言模型(LLMs)进行视觉压缩的新方法——VoCo-LLaMA。该方法引入视觉压缩令牌,在视觉指令微调阶段利用注意力蒸馏,将LLMs对视觉令牌的理解转化为对VoCo令牌的处理。VoCo-LLaMA实现了有效的视觉压缩,提高了推理阶段的计算效率,实现了高压缩比和低性能损失。此外,通过连续训练使用视频帧的时间序列压缩令牌序列,VoCo-LLaMA展现出理解时间关联的能力,并在流行的视频问答基准测试中表现出超越先前方法的效果。
Key Takeaways
- VLMs在处理高分辨率图像和视频时面临上下文窗口有限和计算成本高的问题。
- 视觉压缩可以缓解这些问题,但先前的压缩方法导致视觉信息损失。
- VoCo-LLaMA是首个利用LLMs进行视觉压缩的方法。
- VoCo-LLaMA通过引入视觉压缩令牌和注意力蒸馏,将LLMs对视觉令牌的理解转化为处理VoCo令牌。
- VoCo-LLaMA实现了高压缩比和低性能损失,推理阶段计算效率显著提高。
- 通过连续训练使用视频帧的时间序列压缩令牌序列,VoCo-LLaMA展现出理解时间关联的能力。
- VoCo-LLaMA在视频问答基准测试中表现优于先前方法。
点此查看论文截图






Mixture-of-Subspaces in Low-Rank Adaptation
Authors:Taiqiang Wu, Jiahao Wang, Zhe Zhao, Ngai Wong
In this paper, we introduce a subspace-inspired Low-Rank Adaptation (LoRA) method, which is computationally efficient, easy to implement, and readily applicable to large language, multimodal, and diffusion models. Initially, we equivalently decompose the weights of LoRA into two subspaces, and find that simply mixing them can enhance performance. To study such a phenomenon, we revisit it through a fine-grained subspace lens, showing that such modification is equivalent to employing a fixed mixer to fuse the subspaces. To be more flexible, we jointly learn the mixer with the original LoRA weights, and term the method Mixture-of-Subspaces LoRA (MoSLoRA). MoSLoRA consistently outperforms LoRA on tasks in different modalities, including commonsense reasoning, visual instruction tuning, and subject-driven text-to-image generation, demonstrating its effectiveness and robustness. Codes are available at https://github.com/wutaiqiang/MoSLoRA.
本文介绍了一种受子空间启发的低秩适配(LoRA)方法,该方法计算效率高,易于实现,可轻松应用于大型语言、多模态和扩散模型。首先,我们将LoRA的权重等价地分解为两个子空间,并发现简单地混合它们可以增强性能。为了研究这一现象,我们通过精细的子空间透镜重新考察,表明这种修改相当于使用固定的混合器来融合子空间。为了更灵活,我们与原始LoRA权重一起学习混合器,并将该方法称为“子空间混合LoRA(MoSLoRA)”。MoSLoRA在不同模态的任务上均优于LoRA,包括常识推理、视觉指令调整和主题驱动的文字到图像生成,证明了其有效性和稳健性。代码可从https://github.com/wutaiqiang/MoSLoRA获取。
论文及项目相关链接
PDF EMNLP 2024 Main, Oral
Summary
本文介绍了一种受子空间启发的低秩适应(LoRA)方法,该方法计算效率高、易于实现,可广泛应用于大型语言、多模态和扩散模型。通过对LoRA权重进行等效分解,并引入混合技术,提出了一种名为MoSLoRA的混合子空间LoRA方法。MoSLoRA在不同模态的任务上表现出优越性能,包括常识推理、视觉指令调整和主题驱动的文字图像生成,凸显其有效性和稳健性。
Key Takeaways
- 引入了一种新的低秩适应(LoRA)方法,适用于大型语言、多模态和扩散模型。
- 通过等效分解LoRA权重,发现了混合技术能增强性能。
- MoSLoRA方法通过引入子空间混合概念,对LoRA进行了改进。
- MoSLoRA在常识推理、视觉指令调整和文本图像生成等任务上表现优越。
- MoSLoRA方法具有计算效率高和易于实现的特点。
- 研究表明,MoSLoRA方法能提高模型的灵活性和性能。
点此查看论文截图







