⚠️ 以下所有内容总结均由大语言模型生成,如有错误仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ 也可在 HuggingFace 免费体验
2025-01-04 更新
A Survey on Large Language Model Acceleration based on KV Cache Management
Authors:Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen
Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications. The curated paper list for KV cache management is in: https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.
大型语言模型(LLM)因其对上下文的理解和逻辑推理能力,已在自然语言处理、计算机视觉和多模态任务等多个领域带来了革命性的变革。然而,LLM的计算和内存需求,特别是在推理过程中,使其在扩展到现实世界、长上下文和实时应用时面临重大挑战。键值(KV)缓存管理通过减少冗余计算和提高内存利用率,已成为加速LLM推理的关键优化技术。本文全面概述了用于LLM加速的KV缓存管理策略,将其分类为令牌级、模型级和系统级优化。令牌级策略包括KV缓存选择、预算分配、合并、量化和低秩分解,而模型级优化则侧重于架构创新和注意力机制以提高KV的重复使用。系统级方法则针对内存管理、调度和硬件感知设计,以提高跨不同计算环境的效率。此外,本文还概述了用于评估这些策略的文本和多模态数据集和基准测试。通过提供详细的分类和比较分析,本文旨在为研究人员和实践者提供有用的见解,以支持开发高效和可扩展的KV缓存管理技术,为LLM在现实世界应用中的实际部署做出贡献。关于KV缓存管理的精选论文列表请参见:https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management。
论文及项目相关链接
Summary
大型语言模型(LLM)在多领域如自然语言处理、计算机视觉和多模态任务中带来革命性变化,但其计算和内存需求,特别是在推理阶段,对现实世界、长上下文和实时应用构成挑战。键值(KV)缓存管理作为优化技术,通过减少冗余计算和提高内存利用率来加速LLM推理。本文综述了KV缓存管理策略以加速LLM的各个方面,包括令牌级别、模型级别和系统级别的优化。此调查旨在提供有关KV缓存管理的深刻见解,支持研究人员和实践者开发高效且可扩展的KV缓存管理技术,推动LLM在现实世界应用中的实际部署。
Key Takeaways
- LLM已在多个领域带来变革,但计算和内存需求挑战其在实际应用中的扩展。
- KV缓存管理是用于加速LLM的关键优化技术,可减少冗余计算并改善内存使用。
- KV缓存管理策略分为令牌级别、模型级别和系统级别的优化。
- 令牌级别的策略包括KV缓存选择、预算分配、合并、量化和低秩分解。
- 模型级别的优化侧重于架构创新和注意力机制以增强KV的重复使用。
- 系统级别的策略处理内存管理、调度和硬件感知设计,以提高各种计算环境中的效率。
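为直观说明KV缓存为何能减少自回归解码中的冗余计算,下面给出一个极简的Python示意(假设性的玩具实现,与综述中任何具体方法无关):单头注意力在解码时缓存历史K/V,并演示一种最朴素的"令牌级选择"策略,即超出预算时仅保留注意力得分最高的缓存条目。

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class TinyKVCache:
    """玩具级单头注意力KV缓存:演示缓存复用与最简单的令牌级淘汰。"""
    def __init__(self, d, budget=64):
        self.d, self.budget = d, budget
        self.K = np.zeros((0, d))
        self.V = np.zeros((0, d))

    def step(self, q, k, v):
        # 追加当前步的K/V,避免对历史token重复计算
        self.K = np.vstack([self.K, k])
        self.V = np.vstack([self.V, v])
        scores = softmax(self.K @ q / np.sqrt(self.d))
        out = scores @ self.V
        # 令牌级选择(假设性策略):超出预算时仅保留注意力得分最高的条目
        if len(scores) > self.budget:
            keep = np.argsort(scores)[-self.budget:]
            keep.sort()
            self.K, self.V = self.K[keep], self.V[keep]
        return out

cache = TinyKVCache(d=16, budget=8)
rng = np.random.default_rng(0)
for _ in range(20):                      # 模拟20步解码
    q, k, v = rng.normal(size=(3, 16))
    y = cache.step(q, k, v)
print(cache.K.shape)                     # (8, 16):缓存被限制在预算之内
```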
RecLM: Recommendation Instruction Tuning
Authors:Yangqin Jiang, Yuhao Yang, Lianghao Xia, Da Luo, Kangyi Lin, Chao Huang
Modern recommender systems aim to deeply understand users' complex preferences through their past interactions. While deep collaborative filtering approaches using Graph Neural Networks (GNNs) excel at capturing user-item relationships, their effectiveness is limited when handling sparse data or zero-shot scenarios, primarily due to constraints in ID-based embedding functions. To address these challenges, we propose a model-agnostic recommendation instruction-tuning paradigm that seamlessly integrates large language models with collaborative filtering. Our proposed Recommendation Language Model (RecLM) enhances the capture of user preference diversity through a carefully designed reinforcement learning reward function that facilitates self-augmentation of language models. Comprehensive evaluations demonstrate significant advantages of our approach across various settings, and its plug-and-play compatibility with state-of-the-art recommender systems results in notable performance enhancements. The implementation of our RecLM framework is publicly available at: https://github.com/HKUDS/RecLM.
现代推荐系统旨在通过用户过去的交互来深入了解用户的复杂偏好。虽然使用图神经网络(GNNs)的深度协同过滤方法在捕捉用户-项目关系方面表现出色,但在处理稀疏数据或零样本场景时,其有效性主要受到基于ID的嵌入函数约束的限制。为了解决这些挑战,我们提出了一种模型无关的推荐指令调整范式,该范式无缝集成了大型语言模型与协同过滤。我们提出的推荐语言模型(RecLM)通过精心设计的强化学习奖励函数增强了对用户偏好多样性的捕获,该奖励函数促进了语言模型的自我增强。综合评估表明,我们的方法在各种设置中具有显著优势,并且其与最新推荐系统的即插即用兼容性带来了显著的性能提升。我们的RecLM框架的实现可在以下网址找到:https://github.com/HKUDS/RecLM。
论文及项目相关链接
Summary
融合深度协同过滤与大型语言模型的推荐指令调优范式。针对现代推荐系统面临的挑战,如稀疏数据或零样本场景下的用户偏好理解难题,提出一种模型无关的推荐指令调优方法。通过精心设计的强化学习奖励函数,促进语言模型的自我增强,提升用户偏好多样性的捕捉能力。该方法具有显著的优越性,并可与最先进的推荐系统无缝集成,实现显著的性能提升。更多细节可通过访问https://github.com/HKUDS/RecLM获取。
Key Takeaways
- 现代推荐系统致力于通过用户的过去交互来深入理解其复杂偏好。
- 深度协同过滤方法使用图神经网络(GNNs)在捕捉用户-物品关系方面表现出色。
- 当处理稀疏数据或零样本场景时,现有方法的有效性受到限制,主要原因是基于ID的嵌入函数的约束。
- 提出一种模型无关的推荐指令调优方法,将大型语言模型与协同过滤无缝集成。
- 通过精心设计的强化学习奖励函数,提升用户偏好多样性的捕捉能力。
- 该方法在各种设置下具有显著优势,且与最先进的推荐系统兼容,实现性能提升。
MEDEC: A Benchmark for Medical Error Detection and Correction in Clinical Notes
Authors:Asma Ben Abacha, Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Meliha Yetisgen, Fei Xia, Thomas Lin
Several studies showed that Large Language Models (LLMs) can answer medical questions correctly, even outperforming the average human score in some medical exams. However, to our knowledge, no study has been conducted to assess the ability of language models to validate existing or generated medical text for correctness and consistency. In this paper, we introduce MEDEC (https://github.com/abachaa/MEDEC), the first publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism). MEDEC consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems that were not previously seen by any LLM. The dataset has been used for the MEDIQA-CORR shared task to evaluate seventeen participating systems [Ben Abacha et al., 2024]. In this paper, we describe the data creation methods and we evaluate recent LLMs (e.g., o1-preview, GPT-4, Claude 3.5 Sonnet, and Gemini 2.0 Flash) for the tasks of detecting and correcting medical errors requiring both medical knowledge and reasoning capabilities. We also conducted a comparative study where two medical doctors performed the same task on the MEDEC test set. The results showed that MEDEC is a sufficiently challenging benchmark to assess the ability of models to validate existing or generated notes and to correct medical errors. We also found that although recent LLMs have a good performance in error detection and correction, they are still outperformed by medical doctors in these tasks. We discuss the potential factors behind this gap, the insights from our experiments, the limitations of current evaluation metrics, and share potential pointers for future research.
多项研究表明,大型语言模型(LLM)能够正确回答医学问题,甚至在某些医学考试中的表现超过人类平均水平。然而,据我们所知,尚未有研究评估语言模型验证现有或生成医学文本的正确性和一致性的能力。在本文中,我们介绍了MEDEC(https://github.com/abachaa/MEDEC),这是首个公开可用的临床笔记医疗错误检测与校正基准测试,涵盖了五种类型的错误(诊断、管理、治疗、药物治疗和病原体)。MEDEC包含3848篇临床文本,其中包括来自美国三个医院系统的488篇此前未被任何LLM见过的临床笔记。该数据集已被用于MEDIQA-CORR共享任务来评估十七个参与系统[Ben Abacha等人,2024]。在本文中,我们描述了数据创建方法,并评估了最新的LLM(例如,o1-preview、GPT-4、Claude 3.5 Sonnet和Gemini 2.0 Flash)在检测和校正需要医学知识和推理能力的医疗错误任务上的表现。我们还进行了一项对比研究,由两名医生在MEDEC测试集上执行相同的任务。结果表明,MEDEC是一个足够具有挑战性的基准测试,可以评估模型验证现有或生成的笔记以及校正医疗错误的能力。我们还发现,尽管最新的LLM在错误检测和校正方面表现良好,但在这些任务上仍然不及医生。我们讨论了造成这一差距的潜在因素、实验中的见解、当前评估指标的局限性,并分享了未来研究的潜在方向。
论文及项目相关链接
PDF This version has been updated with further clarification regarding the model size estimates that were mined from public articles only and provided to aid in contextualizing model performance. The authors cannot vouch for the accuracy of those estimates
Summary
大型语言模型(LLM)能正确回答医学问题并在某些医学考试中表现超越人类平均水平。然而,尚无研究评估语言模型验证现有或生成医学文本的正确性和一致性的能力。本文介绍了MEDEC(医疗错误检测和校正基准测试),包含诊断、管理、治疗、药物治疗和病原体等五种错误的临床文本数据集。评估了多个LLM在检测和校正医学错误方面的表现,并与两位医生进行对比。结果显示,MEDEC是评估模型验证和校正能力的挑战基准测试,但LLM在这些任务上仍被医生超越。本文探讨了潜在因素、实验见解、当前评估指标的局限性以及未来研究的潜在方向。
Key Takeaways
- LLM能够正确回答医学问题并在某些医学考试中表现出色。
- 目前尚无研究评估LLM在验证医学文本正确性和一致性方面的能力。
- MEDEC是首个公开可用的医疗错误检测和校正基准测试,包含多种类型的医疗错误。
- LLM在检测和校正医学错误方面的性能良好,但仍被医生超越。
- 医生和LLM之间的性能差距可能源于语言模型在医疗知识和推理能力方面的不足。
- 当前评估指标存在局限性,需要进一步完善和改进。
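为说明此类"错误检测与纠正"任务通常如何交给LLM执行,下面给出一个假设性的提示构造示意(call_llm 为占位函数,提示格式并非MEDEC/MEDIQA-CORR的官方评测格式,仅供参考):

```python
# 假设性的提示构造:让任意聊天型LLM对临床笔记做错误检测与纠正。
ERROR_TYPES = ["Diagnosis", "Management", "Treatment", "Pharmacotherapy", "Causal Organism"]

def build_prompt(note: str) -> str:
    return (
        "You are a clinical reviewer. Read the note and decide whether it contains "
        f"a medical error of one of these types: {', '.join(ERROR_TYPES)}.\n"
        "If there is an error, reply as:\n"
        "ERROR | <sentence id> | <error type> | <corrected sentence>\n"
        "Otherwise reply: CORRECT\n\n"
        f"Clinical note:\n{note}"
    )

def call_llm(prompt: str) -> str:
    raise NotImplementedError("接入任意LLM API,此处仅为占位")

note = "1. The patient has a urinary tract infection. 2. The causal organism is influenza virus."
print(build_prompt(note))
```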
Speech Retrieval-Augmented Generation without Automatic Speech Recognition
Authors:Do June Min, Karel Mundnich, Andy Lapastora, Erfan Soltanmohammadi, Srikanth Ronanki, Kyu Han
One common approach for question answering over speech data is to first transcribe speech using automatic speech recognition (ASR) and then employ text-based retrieval-augmented generation (RAG) on the transcriptions. While this cascaded pipeline has proven effective in many practical settings, ASR errors can propagate to the retrieval and generation steps. To overcome this limitation, we introduce SpeechRAG, a novel framework designed for open-question answering over spoken data. Our proposed approach fine-tunes a pre-trained speech encoder into a speech adapter fed into a frozen large language model (LLM)–based retrieval model. By aligning the embedding spaces of text and speech, our speech retriever directly retrieves audio passages from text-based queries, leveraging the retrieval capacity of the frozen text retriever. Our retrieval experiments on spoken question answering datasets show that direct speech retrieval does not degrade over the text-based baseline, and outperforms the cascaded systems using ASR. For generation, we use a speech language model (SLM) as a generator, conditioned on audio passages rather than transcripts. Without fine-tuning of the SLM, this approach outperforms cascaded text-based models when there is high WER in the transcripts.
针对语音数据问答的一种常见方法是首先使用自动语音识别(ASR)进行语音转录,然后在转录文本上应用基于文本的检索增强生成(RAG)。虽然这种级联管道已在许多实际场景中证明了其有效性,但ASR错误可能会传播到检索和生成步骤。为了克服这一局限性,我们引入了SpeechRAG,这是一个专为基于语音数据的开放式问答设计的新型框架。我们提出的方法将预训练的语音编码器微调为语音适配器,并馈入一个冻结的、基于大型语言模型(LLM)的检索模型。通过对齐文本和语音的嵌入空间,我们的语音检索器能够根据文本查询直接检索音频片段,从而利用冻结文本检索器的检索能力。我们在语音问答数据集上的检索实验表明,直接语音检索并不逊于基于文本的基线,并且优于使用ASR的级联系统。在生成方面,我们使用语音语言模型(SLM)作为生成器,以音频片段而非转录文本为条件。在不微调SLM的情况下,当转录文本的词错误率(WER)较高时,这种方法优于级联的基于文本的模型。
论文及项目相关链接
PDF ICASSP 2025
Summary
本摘要简要介绍了针对语音数据的问答系统的常见方法,并指出了其存在的局限性。为了克服这些局限性,提出了一种新的框架SpeechRAG,它通过微调预训练的语音编码器并将其馈入冻结的大型语言模型(LLM)检索模型来实现直接语音检索。该框架通过对文本和语音嵌入空间的对齐,能够直接根据基于文本的查询检索语音段落,从而利用冻结文本检索器的检索能力。实验表明,直接语音检索不逊色于基于文本的基线,并且在语音识别错误率较高的情况下,其表现优于级联系统。在生成方面,使用语音语言模型(SLM)作为生成器,以音频段落为条件,无需对SLM进行微调,即表现出优于级联文本模型的效果。
Key Takeaways
- 现有问答系统通常通过级联管道处理语音数据,即先使用自动语音识别(ASR)进行语音转录,然后在转录上应用基于文本的检索增强生成(RAG)。
- ASR错误可能会传播到检索和生成步骤,影响系统性能。
- SpeechRAG框架被引入以解决这一问题,它通过微调语音编码器并将其输入到冻结的大型语言模型检索模型中,实现直接语音检索。
- SpeechRAG通过对文本和语音嵌入空间的对齐,使系统能够直接从基于文本的查询中检索语音段落。
- 实验表明,直接语音检索的性能不逊于基于文本的基线,且在语音识别错误率较高的情况下,其表现优于级联系统。
- 在生成阶段,使用语音语言模型(SLM)作为生成器,以音频段落为条件,无需对SLM进行微调,即可获得优于级联文本模型的效果。
- SpeechRAG框架为基于语音数据的开放式问答提供了一个有效的新方法,特别是在处理包含错误或噪音的语音数据时。
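其核心思路是把语音适配器的输出对齐到冻结文本检索器的嵌入空间,使文本查询可以直接在音频嵌入上做最近邻检索。下面是一个极简的数值示意(所有维度、数据与对齐方式均为假设,仅演示"对齐后用文本查询检索音频"的流程,并非论文实现):

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs, d_speech, d_text = 20, 48, 32
speech_feats = rng.normal(size=(n_pairs, d_speech))   # 冻结语音编码器的音频段落特征(假设)
text_emb = rng.normal(size=(n_pairs, d_text))          # 冻结文本检索器给出的配对文本嵌入(假设)
W = np.zeros((d_speech, d_text))                       # 语音适配器中可训练的线性投影(示意)

# 对齐训练(示意):用梯度下降把音频嵌入拉向配对的文本嵌入
for _ in range(500):
    A = speech_feats @ W
    grad = speech_feats.T @ (A - text_emb) / n_pairs
    W -= 0.1 * grad

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

audio_index = l2norm(speech_feats @ W)     # 所有音频段落的对齐后嵌入
query = l2norm(text_emb[3])                # 文本查询直接复用冻结检索器的嵌入
scores = audio_index @ query
print("top-1 audio passage:", int(np.argmax(scores)))   # 应检索到与查询配对的第3条音频
```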
World knowledge-enhanced Reasoning Using Instruction-guided Interactor in Autonomous Driving
Authors:Mingliang Zhai, Cheng Li, Zengyuan Guo, Ningrui Yang, Xiameng Qin, Sanyuan Zhao, Junyu Han, Ji Tao, Yuwei Wu, Yunde Jia
The Multi-modal Large Language Models (MLLMs) with extensive world knowledge have revitalized autonomous driving, particularly in reasoning tasks within perceivable regions. However, when faced with perception-limited areas (dynamic or static occlusion regions), MLLMs struggle to effectively integrate perception ability with world knowledge for reasoning. These perception-limited regions can conceal crucial safety information, especially for vulnerable road users. In this paper, we propose a framework, which aims to improve autonomous driving performance under perceptionlimited conditions by enhancing the integration of perception capabilities and world knowledge. Specifically, we propose a plug-and-play instruction-guided interaction module that bridges modality gaps and significantly reduces the input sequence length, allowing it to adapt effectively to multi-view video inputs. Furthermore, to better integrate world knowledge with driving-related tasks, we have collected and refined a large-scale multi-modal dataset that includes 2 million natural language QA pairs, 1.7 million grounding task data. To evaluate the model’s utilization of world knowledge, we introduce an object-level risk assessment dataset comprising 200K QA pairs, where the questions necessitate multi-step reasoning leveraging world knowledge for resolution. Extensive experiments validate the effectiveness of our proposed method.
多模态大型语言模型(MLLMs)拥有丰富的世界知识,为自动驾驶注入了新的活力,特别是在可感知区域的推理任务中。然而,当面对感知受限区域(动态或静态遮挡区域)时,MLLMs在将感知能力与世界知识进行有效整合以进行推理方面遇到了困难。这些感知受限的区域可能隐藏了关键的安全信息,特别是对于脆弱的道路使用者。在本文中,我们提出了一个框架,旨在通过增强感知能力和世界知识的整合来提高感知受限条件下的自动驾驶性能。具体来说,我们提出了一种即插即用的指令引导交互模块,该模块可以弥合不同模态之间的差距并显著缩短输入序列长度,从而能够有效地适应多视图视频输入。此外,为了更好地将世界知识与驾驶相关任务进行整合,我们收集并精炼了一个大规模的多模态数据集,其中包括200万个自然语言问答对和170万条定位(grounding)任务数据。为了评估模型对世界知识的利用情况,我们引入了一个对象级别的风险评估数据集,包含20万个问答对,这些问题需要利用世界知识进行多步骤推理来解决。大量实验验证了我们的方法的有效性。
论文及项目相关链接
PDF AAAI 2025. 14 pages. Supplementary Material
Summary
大型多模态语言模型(MLLMs)在自动驾驶领域展现出强大的潜力,特别是在可感知区域的推理任务中。然而,面对感知受限区域(动态或静态遮挡区域),MLLMs在整合感知能力和世界知识方面存在挑战。本文提出一种框架,旨在通过增强感知能力和世界知识的整合,提高自动驾驶在感知受限条件下的性能。该框架包括一个即插即用的指令引导交互模块,能够缩小模态差距并显著减少输入序列长度,适应多视角视频输入。同时,为了整合世界知识与驾驶任务,收集和精炼了一个大规模的多模态数据集,包含200万自然语言问答对和170万定位任务数据。通过引入对象级别的风险评估数据集来评估模型对世界知识的利用效果,实验证明该方法的有效性。
Key Takeaways
- MLLMs在自动驾驶的推理任务中具有强大潜力,特别是在可感知区域。
- 感知受限区域对MLLMs整合感知能力和世界知识构成挑战。
- 框架包括一个指令引导交互模块,可适应多视角视频输入并缩小模态差距。
- 提出了一个大规模的多模态数据集,用于整合世界知识和驾驶任务。
- 引入了对象级别的风险评估数据集以评估模型对世界知识的利用。
- 该框架通过增强感知能力和世界知识的整合,提高了自动驾驶在感知受限条件下的性能。
ChatTS: Aligning Time Series with LLMs via Synthetic Data for Enhanced Understanding and Reasoning
Authors:Zhe Xie, Zeyan Li, Xiao He, Longlong Xu, Xidao Wen, Tieying Zhang, Jianjun Chen, Rui Shi, Dan Pei
Understanding time series is crucial for its application in real-world scenarios. Recently, large language models (LLMs) have been increasingly applied to time series tasks, leveraging their strong language capabilities to enhance various applications. However, research on multimodal LLMs (MLLMs) for time series understanding and reasoning remains limited, primarily due to the scarcity of high-quality datasets that align time series with textual information. This paper introduces ChatTS, a novel MLLM designed for time series analysis. ChatTS treats time series as a modality, similar to how vision MLLMs process images, enabling it to perform both understanding and reasoning with time series. To address the scarcity of training data, we propose an attribute-based method for generating synthetic time series with detailed attribute descriptions. We further introduce Time Series Evol-Instruct, a novel approach that generates diverse time series Q&As, enhancing the model’s reasoning capabilities. To the best of our knowledge, ChatTS is the first TS-MLLM that takes multivariate time series as input for understanding and reasoning, which is fine-tuned exclusively on synthetic datasets. We evaluate its performance using benchmark datasets with real-world data, including six alignment tasks and four reasoning tasks. Our results show that ChatTS significantly outperforms existing vision-based MLLMs (e.g., GPT-4o) and text/agent-based LLMs, achieving a 46.0% improvement in alignment tasks and a 25.8% improvement in reasoning tasks.
理解时间序列对于其在现实世界场景中的应用至关重要。最近,大型语言模型(LLM)越来越多地被应用于时间序列任务,利用其强大的语言能力来增强各种应用程序。然而,用于时间序列理解和推理的多模态大型语言模型(MLLM)的研究仍然有限,主要是由于缺乏高质量的数据集,这些数据集能够将时间序列与文本信息对齐。本文介绍了一种新型的时间序列分析MLLM——ChatTS。ChatTS将时间序列视为一种模态,类似于视觉MLLM处理图像的方式,能够执行时间序列的理解和推理。为了解决训练数据的稀缺问题,我们提出了一种基于属性的方法来生成具有详细属性描述的合成时间序列。我们进一步引入了时间序列演化指令(Time Series Evol-Instruct),这是一种生成多样化时间序列问答的新方法,增强了模型的推理能力。据我们所知,ChatTS是首个接受多元时间序列进行理解和推理的TS-MLLM,它仅在合成数据集上进行微调。我们使用包含真实世界数据的基准数据集来评估其性能,包括六个对齐任务和四个推理任务。我们的结果表明,ChatTS显著优于现有的基于视觉的MLLM(例如GPT-4o)和基于文本/代理的LLM,在对齐任务上提高了46.0%,在推理任务上提高了25.8%。
论文及项目相关链接
Summary
理解时间序列对实际应用至关重要。最近,大型语言模型(LLM)越来越多地被应用于时间序列任务。然而,针对时间序列理解和推理的多模态LLM(MLLM)研究仍然有限,主要原因是缺乏高质量的时间序列与文本信息对齐的数据集。本文介绍了一种新型的时间序列分析MLLM——ChatTS。ChatTS将时间序列视为一种模态,类似于视觉MLLM处理图像的方式,能够执行时间序列的理解和推理。为解决训练数据不足的问题,我们提出了一种基于属性的方法生成带有详细属性描述的时间序列合成数据。此外,我们还引入了时间序列演化指令方法,生成多样化的时间序列问答数据,增强模型的推理能力。据我们所知,ChatTS是首个接受多元时间序列输入进行理解和推理的TS-MLLM,并仅在合成数据集上进行微调。我们使用包含六个对齐任务和四个推理任务的基准数据集评估其性能。结果表明,ChatTS在对齐任务上较现有的基于视觉的MLLMs(如GPT-4o)和基于文本/代理的LLMs有显著改进,其中对齐任务提高了46.0%,推理任务提高了25.8%。
Key Takeaways
- LLMs最近越来越多地被应用于时间序列任务,利用它们强大的语言能力增强各种应用。
- 针对时间序列理解和推理的多模态LLM(MLLM)研究仍然有限,主要原因是高质量数据集的缺乏。
- ChatTS是一种新型的时间序列分析MLLM,能够将时间序列视为一种模态进行处理。
- ChatTS通过生成合成数据和采用新颖的时间序列演化指令方法解决训练数据不足的问题。
- ChatTS是首个接受多元时间序列进行理解和推理的TS-MLLM,且在合成数据集上微调。
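论文提出的"基于属性的合成数据"思路,是先指定时间序列的属性(趋势、周期、噪声、异常点等),再据此同时生成序列本身与其文字描述,从而得到对齐的"序列-文本"训练对。下面是一个玩具级的Python示意(属性集合与描述模板均为假设,并非论文的数据生成流程):

```python
import numpy as np

def synth_series(trend, season_period, noise, spike_at=None, n=200, seed=0):
    """根据属性生成玩具时间序列,并返回可用于对齐训练的属性描述(示意)。"""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    x = trend * t + 5 * np.sin(2 * np.pi * t / season_period)   # 趋势 + 周期成分
    x += rng.normal(scale=noise, size=n)                         # 噪声
    if spike_at is not None:
        x[spike_at] += 30                                         # 人为注入一次尖峰
    desc = (f"该序列整体呈{'上升' if trend > 0 else '下降' if trend < 0 else '平稳'}趋势,"
            f"周期约为{season_period}个点,噪声标准差约{noise}"
            + (f",并在第{spike_at}个点出现一次向上尖峰。" if spike_at is not None else "。"))
    return x, desc

series, caption = synth_series(trend=0.05, season_period=24, noise=1.0, spike_at=120)
print(caption)
print(series[:5].round(2))
```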
MaLei at the PLABA Track of TAC-2024: RoBERTa for Task 1 – LLaMA3.1 and GPT-4o for Task 2
Authors:Zhidong Ling, Zihao Li, Pablo Romero, Lifeng Han, Goran Nenadic
This report is the system description of the MaLei team (Manchester and Leiden) for the shared task Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024 (we had an earlier name BeeManc following last year). This report contains two sections corresponding to the two sub-tasks in PLABA 2024. In task one, we applied fine-tuned RoBERTa-Base models to identify and classify the difficult terms, jargon and acronyms in the biomedical abstracts and reported the F1 score. Due to time constraints, we didn't finish the replacement task. In task two, we leveraged Llama3.1-70B-Instruct and GPT-4o with one-shot prompts to complete the abstract adaptation and reported the scores in BLEU, SARI, BERTScore, LENS, and SALSA. From the official evaluation of PLABA-2024 on Tasks 1A and 1B, our much smaller fine-tuned RoBERTa-Base model ranked 3rd and 2nd respectively on the two sub-tasks, and 1st on averaged F1 scores across the two tasks among the 9 evaluated systems. Our LLaMA-3.1-70B-Instruct model achieved the highest Completeness score for Task 2. We share our fine-tuned models and related resources at https://github.com/HECTA-UoM/PLABA2024
本报告是MaLei团队(曼彻斯特与莱顿)针对共享任务——生物医学摘要的平实语言改编(PLABA 2024)的系统描述(我们去年使用的名字是BeeManc)。本报告包含两部分,分别对应PLABA 2024的两个子任务。在任务一中,我们采用了微调过的RoBERTa-Base模型来识别和分类生物医学摘要中的难词、行话和缩写词,并报告了F1分数。由于时间限制,我们未完成替换任务。在任务二中,我们利用Llama3.1-70B-Instruct和GPT-4o结合单样本(one-shot)提示来完成摘要的改编,并报告了BLEU、SARI、BERTScore、LENS和SALSA等各项分数。根据PLABA-2024的官方评估,在1A和1B子任务中,我们较小的微调RoBERTa-Base模型分别排名第三和第二;在两项任务的平均F1分数上,我们的模型在9个参评系统中排名第一。我们的LLaMA-3.1-70B-Instruct模型在任务二中获得了最高的完整性(Completeness)分数。我们会在https://github.com/HECTA-UoM/PLABA2024分享我们的微调模型和相关资源。
论文及项目相关链接
PDF ongoing work - system report for PLABA2024 with TAC
Summary
本文介绍了MaLei团队在Plain Language Adaptation of Biomedical Abstracts (PLABA) 2024共享任务中的系统描述。在任务一中,团队使用微调的RoBERTa-Base模型识别和分类生物医学摘要中的难词、术语和缩写,并报告了F1分数。在任务二中,他们利用Llama3.1-70B-Instruct和GPT-4o结合单样本提示完成摘要的改编,并报告了BLEU、SARI、BERTScore、LENS和SALSA的分数。根据PLABA-2024的官方评估,MaLei团队的模型在子任务1A和1B中分别排名第三和第二,两项任务的平均F1分数在9个参评系统中排名第一;任务二中LLaMA-3.1-70B-Instruct模型的完整性得分最高。相关资源和模型可以在https://github.com/HECTA-UoM/PLABA2024找到。
Key Takeaways
- MaLei团队参与了PLABA 2024共享任务,并进行了系统描述。
- 任务一中使用微调的RoBERTa-Base模型进行难词、术语和缩写的识别与分类,取得较高F1分数。
- 任务二中利用Llama3.1-70B-Instruct和GPT-4o结合单样本提示进行摘要改编,报告了多种评估指标的分数。
- MaLei团队在子任务1A和1B中分别排名第三和第二,两项任务的平均F1分数在参评系统中排名第一。
- LLaMA-3.1-70B-instructed模型在任务二中的完整性得分最高。
- MaLei团队分享了他们的模型和资源,可在指定链接处获取。
FALCON: Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization system
Authors:Zeyuan Li, Yangfan He, Lewei He, Jianhui Wang, Tianyu Shi, Bin Lei, Yuchen Li, Qiuwu Chen
Recently, large language models (LLMs) have achieved significant progress in automated code generation. Despite their strong instruction-following capabilities, these models frequently struggled to align with user intent in coding scenarios. In particular, they were hampered by datasets that lacked diversity and failed to address specialized tasks or edge cases. Furthermore, challenges in supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) led to failures in generating precise, human-intent-aligned code. To tackle these challenges and improve the code generation performance for automated programming systems, we propose Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization (i.e., FALCON). FALCON is structured into two hierarchical levels. From the global level, long-term memory improves code quality by retaining and applying learned knowledge. At the local level, short-term memory allows for the incorporation of immediate feedback from compilers and AI systems. Additionally, we introduce meta-reinforcement learning with feedback rewards to solve the global-local bi-level optimization problem and enhance the model’s adaptability across diverse code generation tasks. Extensive experiments demonstrate that our technique achieves state-of-the-art performance, leading other reinforcement learning methods by more than 4.5 percentage points on the MBPP benchmark and 6.1 percentage points on the Humaneval benchmark. The open-sourced code is publicly available at https://github.com/titurte/FALCON.
近期,大型语言模型(LLM)在自动化代码生成方面取得了显著进展。尽管这些模型具有很强的指令执行能力,但在编码场景中,它们经常难以与用户意图对齐。特别是,由于缺乏多样性的数据集以及无法解决专业任务或边缘情况,它们受到了阻碍。此外,监督微调(SFT)和人类反馈强化学习(RLHF)的挑战导致生成精确、符合人类意图的代码失败。为了解决这些挑战并改善自动化编程系统的代码生成性能,我们提出了基于反馈的自适应长短时记忆强化编码优化(即FALCON)。FALCON分为两个层次。从全局层面来看,长期记忆通过保留和应用学到的知识来提高代码质量。在局部层面,短期记忆允许编译器和AI系统的即时反馈得以融入。此外,我们引入了带有反馈奖励的元强化学习来解决全局-局部两级优化问题,并增强模型在不同代码生成任务中的适应性。大量实验表明,我们的技术达到了最新技术水平,在MBPP基准测试上比其他强化学习方法高出4.5个百分点,在Humaneval基准测试上高出6.1个百分点。开源代码可在https://github.com/titurte/FALCON公开获取。
论文及项目相关链接
PDF 20 pages, 7 figures
Summary
大型语言模型在自动化代码生成方面取得了显著进展,但在与用户意图对齐方面仍存在挑战。本文提出反馈驱动的自适应长短时记忆增强编码优化(FALCON)方法:在全局层面,长期记忆通过保留并应用已学知识来改善代码质量;在局部层面,短期记忆结合编译器与AI系统的即时反馈提高模型性能。同时引入带反馈奖励的元强化学习来解决全局-局部双层优化问题,提升模型在多样化代码生成任务中的适应性。实验证明,FALCON在MBPP和HumanEval基准测试中表现卓越。
Key Takeaways
- 大型语言模型在自动化代码生成中取得进展,但在用户意图对齐方面存在挑战。
- 数据集缺乏多样性和未能解决特定任务或边缘案例是模型面临的主要问题。
- 提出反馈驱动的自适应长短时记忆增强编码优化(FALCON)方法。
- 全局层面的长期记忆与局部层面的短期记忆结合即时反馈,共同提高模型性能。
- 引入元强化学习与反馈奖励解决全局-局部优化问题。
- FALCON在MBPP和Humaneval基准测试中表现卓越,领先其他强化学习方法。
- 开放源代码已公开发布。
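结合长期/短期记忆与编译器反馈的迭代式代码生成循环,大致可以用下面的骨架来理解(假设性示意,并非FALCON的真实实现;其中 generate_code 用一个固定的占位实现代替真实的LLM调用,以便完整运行):

```python
import subprocess, sys, tempfile

long_term_memory = []   # 跨任务保留的经验(全局/长期记忆)

def generate_code(task, long_mem, short_mem):
    # 占位:真实系统应调用代码生成LLM;这里用随反馈"改进"的固定示例代替
    if not short_mem:
        return "def add(a, b):\n    return a - b\n"   # 第一版故意带bug
    return "def add(a, b):\n    return a + b\n"       # 依据错误反馈修正

def run_tests(code, tests):
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code + "\n" + tests)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return proc.returncode == 0, proc.stderr

def solve(task, tests, max_rounds=3):
    short_term_memory = []          # 本任务内的即时反馈(局部/短期记忆)
    for _ in range(max_rounds):
        code = generate_code(task, long_term_memory, short_term_memory)
        ok, err = run_tests(code, tests)
        if ok:
            long_term_memory.append(f"任务『{task}』的成功经验")
            return code
        short_term_memory.append(err)   # 把运行/断言错误反馈给下一轮生成
    return None

print(solve("实现两数相加", "assert add(2, 3) == 5"))
```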
Satori: Towards Proactive AR Assistant with Belief-Desire-Intention User Modeling
Authors:Chenyi Li, Guande Wu, Gromit Yeuk-Yin Chan, Dishita G Turakhia, Sonia Castelo Quispe, Dong Li, Leslie Welch, Claudio Silva, Jing Qian
Augmented Reality assistance is increasingly popular for supporting users with tasks like assembly and cooking. However, current practice typically provides reactive responses initialized from user requests, lacking consideration of rich contextual and user-specific information. To address this limitation, we propose a novel AR assistance system, Satori, that models both user states and environmental contexts to deliver proactive guidance. Our system combines the Belief-Desire-Intention (BDI) model with a state-of-the-art multi-modal large language model (LLM) to infer contextually appropriate guidance. The design is informed by two formative studies involving twelve experts. A sixteen-participant within-subject study finds that Satori achieves performance comparable to a designer-created Wizard-of-Oz (WoZ) system without relying on manual configurations or heuristics, thereby enhancing generalizability, reusability and opening up new possibilities for AR assistance.
增强现实辅助技术越来越受欢迎,支持用户完成组装和烹饪等任务。然而,当前实践通常提供基于用户请求的反应性响应,缺乏对丰富的上下文和用户特定信息的考虑。为了解决这一局限性,我们提出了一种新型的AR辅助系统Satori,该系统对用户状态和环境上下文进行建模,以提供主动指导。我们的系统将信念-欲望-意图(BDI)模型与最新的多模态大型语言模型(LLM)相结合,以推断符合上下文的指导。系统设计基于涉及十二名专家的两项形成性研究。一项十六人的被试内研究表明,Satori在不依赖手动配置或启发式规则的情况下,实现了与由设计师创建的Wizard-of-Oz(WoZ)系统相当的性能,从而增强了通用性和可重用性,并为AR辅助开辟了新的可能性。
论文及项目相关链接
Summary
基于增强现实(AR)的辅助系统在组装和烹饪等任务支持方面的普及度逐渐提高。然而,当前实践主要提供从用户请求初始化的反应式响应,缺乏对丰富的上下文和用户特定信息的考虑。为解决此局限性,我们提出了一种新型的AR辅助系统Satori,该系统能够建模用户状态和环境上下文以提供主动指导。该系统结合了信念-欲望-意图(BDI)模型与最新多模态大型语言模型(LLM)来推断符合上下文的指导。设计由两项包括十二名专家的形成性研究支撑。一项被试内研究发现,Satori在无需依赖手动配置或启发式规则的情况下,实现了与由设计师创建的Wizard-of-Oz(WoZ)系统相当的性能,从而增强了通用性、可重用性,并为AR辅助开启了新的可能性。
Key Takeaways
- 当前AR辅助系统主要为反应式响应,需改进以考虑丰富的上下文和用户特定信息。
- 新型的AR辅助系统Satori能模拟用户状态和环境上下文,提供主动指导。
- Satori系统结合了BDI模型与多模态LLM进行上下文指导推断。
- Satori的设计基于包括十二名专家的两项形成性研究。
- 被试内研究表明,Satori的性能与由设计师创建的WoZ系统相当。
- Satori增强了AR辅助的通用性和可重用性。
TSDS: Data Selection for Task-Specific Model Finetuning
Authors:Zifan Liu, Amin Karbasi, Theodoros Rekatsinas
Finetuning foundation models for specific tasks is an emerging paradigm in modern machine learning. The efficacy of task-specific finetuning largely depends on the selection of appropriate training data. We present TSDS (Task-Specific Data Selection), a framework to select data for task-specific model finetuning, guided by a small but representative set of examples from the target task. To do so, we formulate data selection for task-specific finetuning as an optimization problem with a distribution alignment loss based on optimal transport to capture the discrepancy between the selected data and the target distribution. In addition, we add a regularizer to encourage the diversity of the selected data and incorporate kernel density estimation into the regularizer to reduce the negative effects of near-duplicates among the candidate data. We connect our optimization problem to nearest neighbor search and design efficient algorithms to compute the optimal solution based on approximate nearest neighbor search techniques. We evaluate our method on data selection for both continued pretraining and instruction tuning of language models. We show that instruction tuning using data selected by our method with a 1% selection ratio often outperforms using the full dataset and beats the baseline selection methods by 1.5 points in F1 score on average.
在现代机器学习中,针对特定任务微调基础模型是一种新兴范式。任务特定微调的有效性在很大程度上取决于适当训练数据的选择。我们提出了TSDS(Task-Specific Data Selection,任务特定数据选择),这是一种为任务特定模型微调选择数据的框架,该框架由目标任务的小而具有代表性的样本集引导。为此,我们将任务特定微调的数据选择制定为一个优化问题,通过基于最优传输的分布对齐损失来捕获所选数据与目标分布之间的差异。此外,我们添加一个正则化项来鼓励所选数据的多样性,并将内核密度估计纳入正则化项中,以减少候选数据中近似重复项的负面影响。我们将优化问题与最近邻搜索联系起来,并设计基于近似最近邻搜索技术的有效算法来计算最优解。我们在对语言模型的持续预训练和指令调整的数据选择上评估了我们的方法。我们表明,使用我们的方法选定的数据进行指令调整,即使使用1%的选择比率,也往往优于使用全数据集,并且在F1分数上平均比基线选择方法高出1.5个点。
论文及项目相关链接
PDF 31 pages, 1 figure
Summary
任务特定微调是现代机器学习中的一个新兴范式,其效果在很大程度上取决于适当训练数据的选择。本文提出TSDS(任务特定数据选择)框架,通过目标任务中小而具代表性的样本集引导任务特定模型微调的数据选择。为解决此问题,将数据选择表述为基于最优传输的分布对齐损失优化问题,以捕捉所选数据与目标分布之间的差异。此外,还添加了一个正则化项以鼓励所选数据的多样性,并结合核密度估计减少候选数据中近似重复项的负面影响。本文通过将优化问题与最近邻搜索联系起来,并基于近似最近邻搜索技术设计了有效的算法来计算最优解。通过对语言模型的持续预训练和指令调优的数据选择进行评估,结果表明,仅以1%的选择比例用所选数据进行指令调优,往往优于使用全数据集的效果,并且在F1分数上平均比基线选择方法高出1.5分。
Key Takeaways
- 任务特定微调是现代机器学习的新趋势,训练数据选择至关重要。
- TSDS框架通过小样本集引导任务特定模型微调的数据选择。
- 数据选择被表述为分布对齐损失的优化问题。
- 正则化项鼓励数据多样性,结合核密度估计减少近似重复的影响。
- 优化问题与最近邻搜索相结合,设计了高效算法寻找最优解。
- 在语言模型的持续预训练和指令调优中评估方法,效果明显。
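为帮助理解"分布对齐 + 多样性正则"这一数据选择思路,下面给出一个大幅简化的贪心示意(并非TSDS论文的原始算法——原文基于最优传输与近似最近邻求解;这里仅用"到目标样本的最近距离"近似对齐项,用高斯核密度惩罚近重复样本):

```python
import numpy as np

def select_data(candidates, targets, k, bandwidth=0.5):
    """简化的贪心数据选择示意:兼顾"贴近目标分布"与"避免近重复"。"""
    # 对齐项:每个候选到最近目标样本的平方距离(越小越贴近目标分布)
    d2 = ((candidates[:, None, :] - targets[None, :, :]) ** 2).sum(-1)
    align = d2.min(axis=1)
    selected = []
    for _ in range(k):
        if selected:
            sel = candidates[selected]
            # 多样性项:高斯核密度估计,惩罚与已选样本过近的候选(近重复)
            dd = ((candidates[:, None, :] - sel[None, :, :]) ** 2).sum(-1)
            density = np.exp(-dd / (2 * bandwidth ** 2)).mean(axis=1)
        else:
            density = np.zeros(len(candidates))
        score = align + 2.0 * density
        score[selected] = np.inf          # 已选样本不再参与
        selected.append(int(np.argmin(score)))
    return selected

rng = np.random.default_rng(0)
cands = rng.normal(size=(500, 8))
targs = rng.normal(loc=1.0, size=(20, 8))   # 目标任务的小而具代表性的样本集
print(select_data(cands, targs, k=10))
```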
Reconstructive Visual Instruction Tuning
Authors:Haochen Wang, Anlin Zheng, Yucheng Zhao, Tiancai Wang, Zheng Ge, Xiangyu Zhang, Zhaoxiang Zhang
This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs.
本文介绍了重建视觉指令调整(ROSS),这是一类利用以视觉为中心的监督信号的大型多模态模型(LMMs)。与传统仅监督文本输出的视觉指令调整方法不同,ROSS通过重建输入图像来促使LMMs监督视觉输出。通过这种方式,它充分利用了输入图像本身所固有的丰富性和细节,而这些细节在纯文本监督中往往会丢失。然而,由于视觉信号存在大量的空间冗余,从自然图像中产生有意义的反馈具有挑战性。为了解决这一问题,ROSS采用去噪目标来重建输入图像的潜在表示,避免直接回归精确的原始RGB值。这种内在激活设计鼓励LMMs保留图像细节,从而增强其细粒度理解能力并减少幻觉。经验上,ROSS在不同的视觉编码器和语言模型上均带来了显著改进。与聚合多个视觉专家的外在辅助最新方法相比,仅使用单一SigLIP视觉编码器的ROSS即可取得具有竞争力的性能,证明了我们针对视觉输出量身定制的以视觉为中心监督的有效性。
论文及项目相关链接
Summary
本文介绍了重建视觉指令调整(ROSS)技术,这是一种利用视觉中心监督信号的大型多模态模型(LMMs)家族。与传统的仅监督文本输出的视觉指令调整方法不同,ROSS通过重建输入图像来促使LMMs监督视觉输出。它充分利用了输入图像本身的内在丰富性和细节,避免了纯文本监督中常常丢失的信息。为应对自然图像产生的有意义反馈所面临的视觉信号空间冗余问题,ROSS采用去噪目标来重建输入图像的潜在表示,避免直接回归精确的原始RGB值。这种内在激活设计鼓励LMMs保持图像细节,从而提高其精细粒度理解能力并减少幻觉。实证结果显示,ROSS在不同视觉编码器和语言模型上均带来显著改进,与聚合多个视觉专家的外在辅助最新技术相比,使用单一SigLIP视觉编码器的ROSS表现出强大的性能,证明了针对视觉输出的视觉中心监督的有效性。
Key Takeaways
- ROSS是一种利用视觉中心监督信号的大型多模态模型(LMMs)。
- ROSS通过重建输入图像来促使LMMs监督视觉输出,充分利用输入图像的内在丰富性和细节。
- ROSS采用去噪目标来应对自然图像产生的有意义反馈所面临的视觉信号空间冗余问题。
- 通过内在激活设计,ROSS鼓励LMMs保持图像细节,提高精细粒度理解能力并减少幻觉。
- ROSS在多种视觉编码器和语言模型上实现了显著的性能改进。
- 与聚合多个视觉专家的外在辅助技术相比,使用单一SigLIP视觉编码器的ROSS表现出强大的性能。
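ROSS的关键训练信号是"在去噪目标下重建输入图像的潜在表示,而非回归原始RGB"。下面用PyTorch给出一个最小示意(模块结构、维度与加噪方式均为假设,仅演示这一监督信号的形态,并非论文实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenoiseHead(nn.Module):
    """示意性的去噪重建头:由LMM视觉token的隐状态预测干净的图像潜表示。"""
    def __init__(self, d_model=512, d_latent=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model + d_latent, 512), nn.GELU(),
                                 nn.Linear(512, d_latent))

    def forward(self, hidden, noisy_latent):
        return self.net(torch.cat([hidden, noisy_latent], dim=-1))

def ross_style_loss(hidden, clean_latent, t=0.3):
    noise = torch.randn_like(clean_latent)
    noisy = (1 - t) * clean_latent + t * noise        # 简化的加噪方式(示意)
    pred = head(hidden, noisy)
    return F.mse_loss(pred, clean_latent)              # 以干净潜表示为重建目标,而非原始RGB

head = DenoiseHead()
hidden = torch.randn(4, 196, 512)    # LMM中视觉token对应的隐状态(假设形状)
latent = torch.randn(4, 196, 256)    # 图像tokenizer给出的潜表示(假设)
loss = ross_style_loss(hidden, latent)
loss.backward()
print(float(loss))
```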
Text Clustering as Classification with LLMs
Authors:Chen Huang, Guoxiu He
Text clustering remains valuable in real-world applications where manual labeling is cost-prohibitive. It facilitates efficient organization and analysis of information by grouping similar texts based on their representations. However, implementing this approach necessitates fine-tuned embedders for downstream data and sophisticated similarity metrics. To address this issue, this study presents a novel framework for text clustering that effectively leverages the in-context learning capacity of Large Language Models (LLMs). Instead of fine-tuning embedders, we propose to transform the text clustering into a classification task via LLM. First, we prompt LLM to generate potential labels for a given dataset. Second, after integrating similar labels generated by the LLM, we prompt the LLM to assign the most appropriate label to each sample in the dataset. Our framework has been experimentally proven to achieve comparable or superior performance to state-of-the-art clustering methods that employ embeddings, without requiring complex fine-tuning or clustering algorithms. We make our code available to the public for utilization at https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM.
文本聚类在现实世界的许多应用中仍然具有价值,在这些应用中,手动标注的成本高昂。它通过基于文本表示将相似的文本分组,促进了信息的有效组织和分析。然而,实现这种方法需要针对下游数据的精细嵌入器以及复杂的相似性度量标准。为了解决这一问题,本研究提出了一种新的文本聚类框架,该框架有效利用大型语言模型(LLM)的上下文学习能力。我们不需要精细调整嵌入器,而是提议通过将文本聚类转化为分类任务来利用LLM的能力。首先,我们提示LLM为给定数据集生成潜在标签。其次,在整合LLM生成的相似标签后,我们再次提示LLM为数据集中的每个样本分配最适当的标签。实验证明,我们的框架与采用嵌入的先进聚类方法相比,可以达到相当的或更好的性能,无需复杂的微调或聚类算法。我们已将代码公开供公众使用:https://github.com/ECNU-Text-Computing/Text-Clustering-via-LLM。
论文及项目相关链接
PDF 12 pages, 3 figures
Summary
本文提出了一种基于大型语言模型(LLM)的文本聚类新方法。该方法将文本聚类转化为分类任务,利用LLM的上下文学习能力生成潜在标签并分配标签给数据集中的样本,无需复杂的微调或聚类算法,即可实现与基于嵌入的先进聚类方法相当的或更好的性能。
Key Takeaways
- 文本聚类在手动标签成本高昂的现实世界应用中具有重要价值,有助于高效组织和分析信息。
- 实施文本聚类需要针对下游数据的精细嵌入器和复杂相似性度量。
- 本研究提出了一种基于大型语言模型(LLM)的文本聚类新框架,利用LLM的上下文学习能力。
- 该框架通过将文本聚类转化为分类任务来工作,生成潜在标签并分配这些标签给数据集中的样本。
- 该方法无需复杂的微调或聚类算法,即可实现与现有先进技术相当的或更好的性能。
- 公开了代码以供公众使用。
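该框架的两步流程(先生成候选标签并合并相似标签,再为每条样本指派标签)可以用如下的提示构造来示意(call_llm 为占位函数,提示措辞为假设,并非论文的原始模板):

```python
# 示意性的两步提示构造:第一步让LLM提出候选标签,第二步让其为每条文本指派标签。
def call_llm(prompt: str) -> str:
    raise NotImplementedError("接入任意LLM;此处仅演示提示构造")

def label_generation_prompt(texts):
    sample = "\n".join(f"- {t}" for t in texts[:20])
    return "以下是一批文本,请归纳出它们可能属于的类别标签列表(每行一个标签):\n" + sample

def label_assignment_prompt(text, merged_labels):
    return (f"候选类别:{', '.join(merged_labels)}\n"
            f"请从候选类别中为下面这条文本选择最合适的一个,只输出标签本身:\n{text}")

texts = ["央行宣布下调存款准备金率", "球队在加时赛绝杀对手", "新款手机搭载更大的电池"]
print(label_generation_prompt(texts))
print(label_assignment_prompt(texts[0], ["财经", "体育", "科技"]))
```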
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer
Authors:Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak
In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves the state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
本文介绍了SoloAudio,这是一种基于扩散的新型目标声音提取(TSE)生成模型。我们的方法在音频上训练潜在扩散模型,并用在潜在特征上运行的跳跃连接Transformer替换之前的U-Net主干。SoloAudio通过利用CLAP模型作为目标声音的特征提取器,同时支持面向音频和面向语言的TSE。此外,SoloAudio还利用最先进的文本到音频模型生成的合成音频进行训练,表现出对域外数据和未见声音事件的强大泛化能力。我们在FSD Kaggle 2018混合数据集和AudioSet的真实数据上评估了这种方法,SoloAudio在域内和域外数据上都达到了最新水平的结果,并展现出了令人印象深刻的零样本和少样本能力。已发布源代码和演示。
论文及项目相关链接
PDF Submitted to ICASSP 2025
Summary
本文介绍了SoloAudio,这是一种基于扩散的新型目标声音提取(TSE)生成模型。该方法在音频上训练潜在扩散模型,用在潜在特征上运行的跳跃连接Transformer替换之前的U-Net主干。SoloAudio利用CLAP模型作为目标声音的特征提取器,同时支持面向音频和面向语言的TSE。此外,SoloAudio还利用最先进的文本到音频模型生成的合成音频进行训练,表现出对域外数据和未见声音事件的强大泛化能力。我们在FSD Kaggle 2018混合数据集和AudioSet的真实数据上评估了这种方法,SoloAudio在域内和域外数据上都取得了最新技术成果,并表现出令人印象深刻的零样本和少样本能力。公开了源代码和演示。
Key Takeaways
- 介绍了SoloAudio作为一种新型目标声音提取(TSE)生成模型。
- 使用潜在扩散模型进行训练,替代U-Net主干,采用跳跃连接的Transformer。
- 支持音频和语言两种面向的TSE,利用CLAP模型提取目标声音特征。
- 利用先进文本到音频模型生成的合成音频进行训练。
- 对域内和域外数据以及零样本和少样本能力表现出强大的泛化性能。
- 在FSD Kaggle 2018混合数据集和AudioSet真实数据上取得最新技术成果。
The Mamba in the Llama: Distilling and Accelerating Hybrid Models
Authors:Junxiong Wang, Daniele Paliotta, Avner May, Alexander M. Rush, Tri Dao
Linear RNN architectures, like Mamba, can be competitive with Transformer models in language modeling while having advantageous deployment characteristics. Given the focus on training large-scale Transformer models, we consider the challenge of converting these pretrained models for deployment. We demonstrate that it is feasible to distill large Transformers into linear RNNs by reusing the linear projection weights from attention layers with academic GPU resources. The resulting hybrid model, which incorporates a quarter of the attention layers, achieves performance comparable to the original Transformer in chat benchmarks and outperforms open-source hybrid Mamba models trained from scratch with trillions of tokens in both chat benchmarks and general benchmarks. Moreover, we introduce a hardware-aware speculative decoding algorithm that accelerates the inference speed of Mamba and hybrid models. Overall we show how, with limited computation resources, we can remove many of the original attention layers and generate from the resulting model more efficiently. Our top-performing model, distilled from Llama3-8B-Instruct, achieves a 29.61 length-controlled win rate on AlpacaEval 2 against GPT-4 and 7.35 on MT-Bench, surpassing the best 8B scale instruction-tuned linear RNN model. We also find that the distilled model has natural length extrapolation, showing almost perfect accuracy in the needle-in-a-haystack test at 20x the distillation length. Code and pre-trained checkpoints are open-sourced at https://github.com/jxiw/MambaInLlama and https://github.com/itsdaniele/speculative_mamba.
线性RNN架构(如Mamba)在语言建模方面可以与Transformer模型竞争,同时具有更有利的部署特性。鉴于当前的训练重心在大规模Transformer模型上,我们考虑将这些预训练模型转换后用于部署的挑战。我们证明了利用学术级GPU资源,通过重用注意力层的线性投影权重,将大型Transformer蒸馏到线性RNN中是可行的。所得混合模型仅保留四分之一的注意力层,在聊天基准测试中性能与原始Transformer相当,并且在聊天基准测试和通用基准测试中,均优于从头开始、以数万亿标记训练的开源混合Mamba模型。此外,我们引入了一种硬件感知的投机解码算法,提高了Mamba和混合模型的推理速度。总体而言,我们展示了如何在有限的计算资源下,去除许多原始注意力层,并从所得模型中更高效地生成。我们性能最佳的模型从Llama3-8B-Instruct蒸馏而来,在AlpacaEval 2上相对于GPT-4达到了29.61的长度控制胜率,在MT-Bench上达到了7.35,超过了最佳的8B规模指令微调线性RNN模型。我们还发现,蒸馏后的模型具有自然的长度外推能力,在20倍于蒸馏长度的"大海捞针"测试中几乎达到了完美的准确率。代码和预训练检查点已公开在:https://github.com/jxiw/MambaInLlama 和 https://github.com/itsdaniele/speculative_mamba。
论文及项目相关链接
PDF NeurIPS 2024. v2 updates: 1. Improved distillation approach and new results for Llama 3.1/3.2 distilled models. 2. Fixed math typos. 3. Added needle in the haystack long-context experiments. 4. Mentioned Mamba-Zephyr as subquadratic and added Mamba-Zephyr-8B lm_eval result
Summary
线性RNN架构如Mamba在语言建模方面可与Transformer模型竞争,并具有优势部署特性。我们关注将大型Transformer模型进行部署的挑战,并展示了使用学术GPU资源将大型Transformer蒸馏到线性RNN中的可行性。通过复用注意力层的线性投影权重,得到的混合模型实现了与原始Transformer相当的聊天基准测试性能,并在一般基准测试中优于从头开始训练的开源混合Mamba模型。此外,我们引入了一种硬件感知的投机解码算法,该算法可以加速Mamba和混合模型的推理速度。总的来说,我们展示了如何在有限的计算资源下移除许多原始注意力层并从所得模型中更有效地生成数据。我们的高性能模型是从Llama3-8B-Instruct蒸馏而来,在AlpacaEval 2对GPT-4比赛中获得29.61的长度控制胜率,并在MT-Bench上获得7.35的成绩,超越了最佳的8B指令微调线性RNN模型。我们还发现蒸馏模型具有自然长度外推能力,在蒸馏长度20倍的情况下,几乎达到了完美的准确度。代码和预训练检查点已开源在 https://github.com/jxiw/MambaInLlama 和 https://github.com/itsdaniele/speculative_mamba。
Key Takeaways
- 线性RNN架构如Mamba在语言建模方面可与Transformer模型竞争,并具有优势部署特性。
- 通过复用注意力层的线性投影权重,可将大型Transformer蒸馏成线性RNN。
- 混合模型实现了与原始Transformer相当的聊天基准测试性能。
- 引入了一种硬件感知的投机解码算法以加速模型推理速度。
- 在有限的计算资源下移除许多原始注意力层,并从简化模型中更有效地生成数据。
- 高性能模型在多个基准测试中表现出超越最佳线性RNN模型的能力。
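其中"复用注意力层的线性投影权重"这一步,可以用下面的玩具示意来理解:把一个注意力层的Q/K/V投影权重拷贝到一个线性RNN风格层的对应投影中作为初始化(层结构为大幅简化的假设,并非Mamba的真实结构,也非论文的完整蒸馏流程):

```python
import torch
import torch.nn as nn

class ToyLinearRNNLayer(nn.Module):
    """玩具线性RNN层:逐token累积状态 S = decay*S + k v^T,输出 q S。
    仅演示"复用注意力层的Q/K/V线性投影权重来初始化"。"""
    def __init__(self, d):
        super().__init__()
        self.q_proj = nn.Linear(d, d, bias=False)
        self.k_proj = nn.Linear(d, d, bias=False)
        self.v_proj = nn.Linear(d, d, bias=False)
        self.decay = nn.Parameter(torch.tensor(0.9))

    @torch.no_grad()
    def init_from_attention(self, attn: nn.MultiheadAttention):
        # MultiheadAttention 的 in_proj_weight 按 [Wq; Wk; Wv] 纵向拼接
        d = attn.embed_dim
        wq, wk, wv = attn.in_proj_weight.split(d, dim=0)
        self.q_proj.weight.copy_(wq)
        self.k_proj.weight.copy_(wk)
        self.v_proj.weight.copy_(wv)

    def forward(self, x):                       # x: (T, d)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        S = torch.zeros(x.size(-1), x.size(-1))
        ys = []
        for t in range(x.size(0)):              # 循环式推理:状态大小与序列长度无关
            S = self.decay * S + torch.outer(k[t], v[t])
            ys.append(q[t] @ S)
        return torch.stack(ys)

d = 32
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, bias=False)
layer = ToyLinearRNNLayer(d)
layer.init_from_attention(attn)
print(layer(torch.randn(10, d)).shape)   # torch.Size([10, 32])
```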
Security Attacks on LLM-based Code Completion Tools
Authors:Wen Cheng, Ke Sun, Xinyu Zhang, Wei Wang
The rapid development of large language models (LLMs) has significantly advanced code completion capabilities, giving rise to a new generation of LLM-based Code Completion Tools (LCCTs). Unlike general-purpose LLMs, these tools possess unique workflows, integrating multiple information sources as input and prioritizing code suggestions over natural language interaction, which introduces distinct security challenges. Additionally, LCCTs often rely on proprietary code datasets for training, raising concerns about the potential exposure of sensitive data. This paper exploits these distinct characteristics of LCCTs to develop targeted attack methodologies on two critical security risks: jailbreaking and training data extraction attacks. Our experimental results expose significant vulnerabilities within LCCTs, including a 99.4% success rate in jailbreaking attacks on GitHub Copilot and a 46.3% success rate on Amazon Q. Furthermore, We successfully extracted sensitive user data from GitHub Copilot, including 54 real email addresses and 314 physical addresses associated with GitHub usernames. Our study also demonstrates that these code-based attack methods are effective against general-purpose LLMs, such as the GPT series, highlighting a broader security misalignment in the handling of code by modern LLMs. These findings underscore critical security challenges associated with LCCTs and suggest essential directions for strengthening their security frameworks. The example code and attack samples from our research are provided at https://github.com/Sensente/Security-Attacks-on-LCCTs.
大型语言模型(LLM)的快速发展极大地提升了代码补全功能,并催生了一代基于LLM的代码补全工具(LCCTs)。不同于通用的大型语言模型,这些工具拥有独特的工作流程,能够整合多个信息源作为输入,并优先处理代码建议而非自然语言交互,这带来了不同的安全挑战。此外,LCCTs通常依赖于专有代码数据集进行训练,引发了关于潜在敏感数据泄露的担忧。本文利用LCCT的这些独特特点,针对两种关键安全风险开发有针对性的攻击方法:越狱攻击和训练数据提取攻击。我们的实验结果暴露了LCCT中的重大漏洞,包括对GitHub Copilot的越狱攻击成功率为99.4%,对亚马逊Q的成功率为46.3%。此外,我们还成功地从GitHub Copilot中提取了敏感用户数据,包括54个真实电子邮件地址和与GitHub用户名相关的314个物理地址。我们的研究还表明,这些基于代码的攻击方法对通用的大型语言模型(如GPT系列)同样有效,这凸显了现代大型语言模型在处理代码方面的整体安全不匹配问题。这些发现强调了LCCT面临的关键安全挑战,并为加强其安全框架提供了重要方向。我们研究中的示例代码和攻击样本可在https://github.com/Sensente/Security-Attacks-on-LCCTs找到。
论文及项目相关链接
PDF Paper accepted at AAAI 2025
Summary
大型语言模型(LLM)的快速发展极大地提升了代码补全功能,并催生了一种新型基于LLM的代码补全工具(LCCTs)。然而,LCCTs在集成多信息源和优先提供代码建议的过程中面临独特的挑战,同时它们依赖专有代码数据集进行训练,这可能引发数据泄露风险。本文深入探讨了LCCTs的这些特性,并针对两大关键风险设计有针对性的攻击方法:越狱攻击和训练数据提取攻击。实验结果显示,LCCTs存在重大漏洞,如在GitHub Copilot上的越狱攻击成功率高达99.4%,并在亚马逊Q上达到46.3%的成功率。此外,我们从GitHub Copilot成功提取了用户的敏感数据,包括54个真实电子邮件地址和314个与GitHub用户名关联的物理地址。本文的研究方法同样适用于GPT系列等通用LLM,表明现代LLM在处理代码时的安全漏洞更为普遍。这些发现突显了LCCTs的安全挑战,并为加强其安全框架提供了关键方向。相关示例代码和攻击样本可在我们的研究网站找到。
Key Takeaways
- LLM的快速发展推动了代码补全工具(LCCTs)的崛起,LCCTs集成多信息源并优先提供代码建议,带来独特的挑战。
- LCCTs依赖于专有代码数据集进行训练,存在泄露敏感数据的风险。
- 针对LCCTs的越狱攻击成功率高,存在重大安全漏洞。
- 从GitHub Copilot成功提取了用户的敏感数据,包括电子邮件地址和物理地址。
- 攻击方法同样适用于通用LLM,如GPT系列,表明现代LLM在处理代码时的安全漏洞广泛存在。
- 研究提供了针对LCCTs安全挑战的关键方向。
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Authors:Yulei Qin, Yuncheng Yang, Pengcheng Guo, Gang Li, Hang Shao, Yuchen Shi, Zihan Xu, Yun Gu, Ke Li, Xing Sun
Instruction tuning plays a critical role in aligning large language models (LLMs) with human preference. Despite the vast amount of open instruction datasets, naively training a LLM on all existing instructions may not be optimal and practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, under the context of instruction tuning, there still exists a gap in knowledge on what kind of data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review on existing literature of data assessment and selection especially for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones where a unified, fine-grained taxonomy is structured. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, comparison between the latest methods is conducted on their officially reported results to provide in-depth discussions on their limitations. Finally, we summarize the open challenges and propose the promosing avenues for future studies. All related contents are available at https://github.com/yuleiqin/fantastic-data-engineering.
指令调整在将大型语言模型(LLM)与人类偏好对齐方面起着关键作用。尽管存在大量的开放指令数据集,但盲目地训练LLM以适应所有现有指令可能并不理想且不切实际。为了找到最有利的数据点,自然语言处理(NLP)和深度学习领域已经提出了数据评估与选择方法。然而,在指令调整的上下文中,关于可以使用什么样的数据评估指标以及如何将它们整合到选择机制中,仍然存在知识空白。为了填补这一空白,我们对现有文献进行了全面回顾,特别是针对LLM指令调整的数据评估与选择。我们将所有适用的方法系统地分为基于质量、基于多样性和基于重要性的三类,并构建了统一、精细的分类体系。针对每个类别,我们详细阐述了代表性方法来描述相关研究领域的概况。此外,还对最新方法进行了官方报告结果的比较,对其局限性进行了深入探讨。最后,我们总结了开放挑战,并为未来研究提出了有前景的方向。所有相关内容均可访问:https://github.com/yuleiqin/fantastic-data-engineering。
论文及项目相关链接
PDF Accepted to TMLR with Survey Certificate, review, survey, 37 pages, 5 figures, 4 tables
Summary
大数据时代的语言模型优化过程中,指令调优扮演关键角色。盲目采用海量指令数据并非最佳选择,需对数据评估和选择方法进行深入研究。当前研究中质量评估、多样性和重要性评估是三大主流评估方法。针对语言模型的指令调优数据评估方法尚待完善,本文梳理了现有的数据评估文献,详细阐述了各类方法的优缺点及适用场景,并比较了最新方法的性能差异。同时指出存在的挑战和未来的研究趋势。
Key Takeaways
- 指令调优在LLM与人类偏好对齐中起关键作用。
- 单纯依赖大量指令数据集并不理想,需要数据评估与选择方法。
- 数据评估主要包括质量评估、多样性和重要性评估三大类别。
- 本文综述了现有文献中的方法并分类为三大类别。
- 代表性方法描述了相关领域的研究现状。
- 比较了最新方法的性能差异并深入讨论了其局限性。
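综述中的"质量/多样性/重要性"三类评估信号,组合后即可对候选指令数据打分并筛选。下面是一个纯属示意的玩具实现(三类信号均用假设的简化代理指标代替,并非综述中的任何具体方法):

```python
import numpy as np

def assess_and_select(embeddings, lengths, target_emb, k=100, w=(1.0, 1.0, 1.0)):
    """玩具示意:把质量、多样性、重要性三类评估信号组合成一个选择分数。
    各项均为假设的简化代理指标,仅用于说明分类框架。"""
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # 质量(代理):指令长度适中者得分高
    quality = np.exp(-((lengths - lengths.mean()) / (lengths.std() + 1e-8)) ** 2)
    # 多样性(代理):与候选集平均方向差异越大越好
    mean_dir = emb.mean(axis=0)
    diversity = 1.0 - emb @ (mean_dir / np.linalg.norm(mean_dir))
    # 重要性(代理):与目标任务嵌入的余弦相似度
    importance = emb @ (target_emb / np.linalg.norm(target_emb))
    score = w[0] * quality + w[1] * diversity + w[2] * importance
    return np.argsort(-score)[:k]

rng = np.random.default_rng(0)
idx = assess_and_select(rng.normal(size=(1000, 64)),
                        rng.integers(10, 500, size=1000).astype(float),
                        rng.normal(size=64), k=5)
print(idx)
```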
Synergistic Multi-Agent Framework with Trajectory Learning for Knowledge-Intensive Tasks
Authors:Shengbin Yue, Siyuan Wang, Wei Chen, Xuanjing Huang, Zhongyu Wei
Recent advancements in Large Language Models (LLMs) have led to significant breakthroughs in various natural language processing tasks. However, generating factually consistent responses in knowledge-intensive scenarios remains a challenge due to issues such as hallucination, difficulty in acquiring long-tailed knowledge, and limited memory expansion. This paper introduces SMART, a novel multi-agent framework that leverages external knowledge to enhance the interpretability and factual consistency of LLM-generated responses. SMART comprises four specialized agents, each performing a specific sub-trajectory action to navigate complex knowledge-intensive tasks. We propose a multi-agent co-training paradigm, Long-Short Trajectory Learning, which ensures synergistic collaboration among agents while maintaining fine-grained execution by each agent. Extensive experiments on five knowledge-intensive tasks demonstrate SMART’s superior performance compared to widely adopted knowledge internalization and knowledge enhancement methods. Our framework can extend beyond knowledge-intensive tasks to more complex scenarios. Our code is available at https://github.com/yueshengbin/SMART.
近年来,大型语言模型(LLM)的进展导致各种自然语言处理任务取得了重大突破。然而,在知识密集型场景中生成事实一致的回应仍然是一个挑战,因为存在诸如幻觉、获取长尾知识的困难和有限的内存扩展等问题。本文介绍了SMART,一个利用外部知识提高LLM生成响应的可解释性和事实一致性的新型多智能体框架。SMART包含四个专业智能体,每个智能体执行特定的子轨迹动作,以完成复杂的知识密集型任务。我们提出了一种多智能体协同训练范式——长短轨迹学习,它确保智能体之间的协同合作,同时保持每个智能体的精细执行。在五个知识密集型任务上的大量实验表明,与广泛采用的知识内部化和知识增强方法相比,SMART的性能更优越。我们的框架可以扩展到知识密集型任务以外的更复杂的场景。我们的代码可在https://github.com/yueshengbin/SMART中找到。
论文及项目相关链接
PDF Accepted by AAAI2025
Summary
近期大型语言模型(LLM)的进展为各种自然语言处理任务带来了重大突破,但在知识密集型场景中生成事实一致的回复仍存在挑战。本文提出了一种新型的多智能体框架SMART,利用外部知识提高LLM生成回复的可解释性和事实一致性。SMART包含四个专业智能体,并通过"长短轨迹学习"这一多智能体协同训练范式,确保智能体之间的协同合作,同时保持每个智能体的精细执行能力。在五个知识密集型任务上的实验表明,SMART优于广泛采用的知识内化和知识增强方法。此框架可应用于更复杂场景。代码公开于https://github.com/yueshengbin/SMART。
Key Takeaways
- LLM在自然语言处理任务中取得了显著进展。
- 知识密集型场景中生成事实一致的回复存在挑战。
- SMART是一个新型的多智能体框架,利用外部知识提高LLM的回复质量。
- SMART包含四个专业智能体,每个智能体执行特定子轨迹动作以完成复杂任务。
- 提出了一种多智能体协同训练范式——长短轨迹学习,确保智能体间的协同合作和精细执行。
- 在五个知识密集型任务上的实验表明SMART优于其他知识内化和增强方法。
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
Authors:Yi Yuan, Dongya Jia, Xiaobin Zhuang, Yuanzhe Chen, Zhengxi Liu, Zhuo Chen, Yuping Wang, Yuxuan Wang, Xubo Liu, Xiyuan Kang, Mark D. Plumbley, Wenwu Wang
Generative models have shown significant achievements in audio generation tasks. However, existing models struggle with complex and detailed prompts, leading to potential performance degradation. We hypothesize that this problem stems from the simplicity and scarcity of the training data. This work aims to create a large-scale audio dataset with rich captions for improving audio generation models. We first develop an automated pipeline to generate detailed captions by transforming predicted visual captions, audio captions, and tagging labels into comprehensive descriptions using a Large Language Model (LLM). The resulting dataset, Sound-VECaps, comprises 1.66M high-quality audio-caption pairs with enriched details including audio event orders, occurred places and environment information. We then demonstrate that training the text-to-audio generation models with Sound-VECaps significantly improves the performance on complex prompts. Furthermore, we conduct ablation studies of the models on several downstream audio-language tasks, showing the potential of Sound-VECaps in advancing audio-text representation learning. Our dataset and models are available online from here https://yyua8222.github.io/Sound-VECaps-demo/.
生成模型在音频生成任务中取得了显著成就。然而,现有模型在复杂且详细的提示方面存在困难,可能导致性能下降。我们假设这个问题源于训练数据的简单性和稀缺性。这项工作旨在创建一个大规模音频数据集,其中包含丰富的字幕,以改进音频生成模型。我们首先开发了一个自动化管道,通过利用大型语言模型(LLM)将预测的视觉字幕、音频字幕和标签转换为综合描述,来生成详细字幕。结果数据集Sound-VECaps包含166万个高质量音频字幕对,其中包含丰富的细节,如音频事件顺序、发生地点和环境信息。然后,我们证明使用Sound-VECaps训练文本到音频生成模型可以显著提高复杂提示的性能。此外,我们在几个下游音频语言任务上进行了模型的剔除研究,显示了Sound-VECaps在推进音频文本表示学习方面的潜力。我们的数据集和模型可从https://yyua8222.github.io/Sound-VECaps-demo/在线获取。
论文及项目相关链接
PDF 5 pages with 1 appendix, accepted by ICASSP 2025
Summary
本文介绍了生成模型在音频生成任务中的成就和挑战。针对现有模型在处理复杂和详细提示时性能下降的问题,提出创建一个大规模音频数据集Sound-VECaps,旨在改善音频生成模型的性能。通过大型语言模型(LLM)将预测的视觉字幕、音频字幕和标签转换为详细的描述,生成大规模音频数据集。使用Sound-VECaps训练文本到音频生成模型,显著提高了复杂提示的性能。此外,还进行了几项下游音频语言任务的模型消融研究,显示了Sound-VECaps在推动音频文本表示学习方面的潜力。数据集和模型可从链接获取。
Key Takeaways
- 生成模型在音频生成任务中取得了显著成就,但处理复杂和详细提示时存在性能下降的问题。
- 问题根源被认为是训练数据的简单性和稀缺性。
- 创建一个名为Sound-VECaps的大规模音频数据集,以改善音频生成模型的性能。
- 利用大型语言模型(LLM)生成详细描述来创建数据集,包含音频事件顺序、发生地点和环境信息等丰富细节。
- 使用Sound-VECaps训练的文本到音频生成模型在复杂提示上的性能显著提高。
- 进行了下游音频语言任务的模型消融研究,显示了Sound-VECaps在推动音频文本表示学习方面的潜力。
Beyond Numeric Awards: In-Context Dueling Bandits with LLM Agents
Authors:Fanzeng Xia, Hao Liu, Yisong Yue, Tongxin Li
In-context reinforcement learning (ICRL) is a frontier paradigm for solving reinforcement learning problems in the foundation model era. While ICRL capabilities have been demonstrated in transformers through task-specific training, the potential of Large Language Models (LLMs) out-of-the-box remains largely unexplored. Recent findings highlight that LLMs often face challenges when dealing with numerical contexts, and limited attention has been paid to evaluating their performance through preference feedback generated by the environment. This paper is the first to investigate LLMs as in-context decision-makers under the problem of Dueling Bandits (DB), a stateless preference-based reinforcement learning setting that extends the classic Multi-Armed Bandit (MAB) model by querying for preference feedback. We compare GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Llama 3.1, and o1-Preview against nine well-established DB algorithms. Our results reveal that our top-performing LLM, GPT-4 Turbo, has the zero-shot relative decision-making ability to achieve surprisingly low weak regret across all the DB environment instances by quickly including the best arm in duels. However, an optimality gap exists between LLMs and classic DB algorithms in terms of strong regret. LLMs struggle to converge and consistently exploit even when explicitly prompted to do so, and are sensitive to prompt variations. To bridge this gap, we propose an agentic flow framework: LLM with Enhanced Algorithmic Dueling (LEAD), which integrates off-the-shelf DB algorithms with LLM agents through fine-grained adaptive interplay. We show that LEAD has theoretical guarantees inherited from classic DB algorithms on both weak and strong regret. We validate its efficacy and robustness even with noisy and adversarial prompts. The design of our framework sheds light on how to enhance the trustworthiness of LLMs used for in-context decision-making.
上下文内强化学习(ICRL)是基础模型时代解决强化学习问题的一种前沿范式。虽然通过特定任务的训练,ICRL在Transformer中的能力已经得到了证明,但大语言模型(LLM)开箱即用的潜力尚未得到充分探索。最近的研究表明,LLM在处理数值上下文时经常面临挑战,而且很少有人关注通过环境产生的偏好反馈来评估它们的性能。本文首次研究了LLM作为上下文内决策者在决斗赌博机(Dueling Bandits, DB)问题下的表现;DB是一种无状态的、基于偏好的强化学习设定,它通过查询偏好反馈来扩展经典的多臂赌博机(MAB)模型。我们将GPT-3.5 Turbo、GPT-4、GPT-4 Turbo、Llama 3.1和o1-Preview与九个成熟的DB算法进行了比较。结果表明,表现最佳的LLM——GPT-4 Turbo具有零样本的相对决策能力,能够通过快速把最佳臂纳入决斗,在所有DB环境实例中取得出人意料的低弱遗憾。然而,在强遗憾方面,LLM与经典DB算法之间仍存在最优性差距:即使被明确提示,LLM也难以收敛并持续利用最优臂,并且对提示的变动非常敏感。为了缩小这一差距,我们提出了一个智能体流程框架:带增强算法决斗的LLM(LEAD),它通过细粒度的自适应交互将现成的DB算法与LLM智能体集成在一起。我们证明LEAD在弱遗憾和强遗憾方面都继承了经典DB算法的理论保证。即使在带噪声和对抗性提示的情况下,我们也验证了其有效性和稳健性。我们的框架设计为如何增强用于上下文内决策的LLM的可信度提供了启示。
论文及项目相关链接
Summary
本文首次探索了在无状态偏好强化学习环境——Dueling Bandits下的大型语言模型(LLM)作为上下文决策者的潜力。研究对比了GPT-3.5 Turbo、GPT-4、GPT-4 Turbo、Llama 3.1和o1-Preview等多种LLM与九个成熟的DB算法的表现。结果表明,GPT-4 Turbo在相对决策能力上表现出色,能够在各种DB环境实例中实现低弱遗憾,但在强遗憾方面存在最优性差距。为此,研究提出了一个整合经典DB算法与LLM代理的代理流框架:LEAD。LEAD具有理论保证,并在弱遗憾和强遗憾方面均表现稳健,即使在噪声和对抗性提示下也验证了其有效性和鲁棒性。研究为提升LLM在上下文决策中的可信度提供了启示。
Key Takeaways
- 大型语言模型(LLM)在解决无状态偏好强化学习环境下的Dueling Bandits问题上展现出潜力。
- GPT-4 Turbo在相对决策能力上表现最佳,但存在与经典DB算法的最优性差距。
- LLM在面对数值上下文时面临挑战,对提示的敏感性高,难以收敛并持续利用。
- 提出LEAD框架,整合经典DB算法与LLM代理,具有理论保证并在弱遗憾和强遗憾方面表现稳健。
- LEAD框架设计提升了LLM在上下文决策中的可信度。
- LLM在面对噪声和对抗性提示时仍能保持有效性和鲁棒性。
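决斗赌博机与经典多臂赌博机的区别在于:每一轮选择两个臂进行对比,只获得"谁更好"的偏好反馈。下面是一个极简的环境与对战策略示意(策略为假设性的"当前冠军迎战随机挑战者"启发式,并非论文中的LEAD;在LEAD式框架中,这一步可由LLM智能体结合经典DB算法给出):

```python
import numpy as np

class DuelingBanditEnv:
    """偏好反馈环境:每轮对比两个臂,按Bradley-Terry式概率返回获胜者。"""
    def __init__(self, utilities, seed=0):
        self.u = np.asarray(utilities, dtype=float)
        self.rng = np.random.default_rng(seed)

    def duel(self, i, j):
        p_i_wins = 1.0 / (1.0 + np.exp(self.u[j] - self.u[i]))
        return i if self.rng.random() < p_i_wins else j

def choose_duel(wins, plays, rng):
    # 假设性的简单策略:经验胜率最高的"冠军"迎战一个随机挑战者
    rates = (wins + 1) / (plays + 2)              # 拉普拉斯平滑的经验胜率
    champion = int(np.argmax(rates))
    challenger = int(rng.integers(len(wins)))
    if challenger == champion:
        challenger = (challenger + 1) % len(wins)
    return champion, challenger

env = DuelingBanditEnv([0.2, 0.5, 1.5, 0.1])
policy_rng = np.random.default_rng(1)
wins, plays = np.zeros(4), np.zeros(4)
for _ in range(2000):
    i, j = choose_duel(wins, plays, policy_rng)
    winner = env.duel(i, j)
    plays[i] += 1
    plays[j] += 1
    wins[winner] += 1
print("estimated best arm:", int(np.argmax((wins + 1) / (plays + 2))))  # 期望为效用最高的臂2
```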