LLM

发布日期: 2024-12-12

更新日期: 2024-12-13

文章字数: 32.8k

阅读时长: 134 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2024-12-12 更新

Generative Semantic Communication: Architectures, Technologies, and Applications

Authors:Jinke Ren, Yaping Sun, Hongyang Du, Weiwen Yuan, Chongjie Wang, Xianda Wang, Yingbin Zhou, Ziwei Zhu, Fangxin Wang, Shuguang Cui

This paper delves into the applications of generative artificial intelligence (GAI) in semantic communication (SemCom) and presents a thorough study. Three popular SemCom systems enabled by classical GAI models are first introduced, including variational autoencoders, generative adversarial networks, and diffusion models. For each system, the fundamental concept of the GAI model, the corresponding SemCom architecture, and the associated literature review of recent efforts are elucidated. Then, a novel generative SemCom system is proposed by incorporating the cutting-edge GAI technology-large language models (LLMs). This system features two LLM-based AI agents at both the transmitter and receiver, serving as “brains” to enable powerful information understanding and content regeneration capabilities, respectively. This innovative design allows the receiver to directly generate the desired content, instead of recovering the bit stream, based on the coded semantic information conveyed by the transmitter. Therefore, it shifts the communication mindset from “information recovery” to “information regeneration” and thus ushers in a new era of generative SemCom. A case study on point-to-point video retrieval is presented to demonstrate the superiority of the proposed generative SemCom system, showcasing a 99.98% reduction in communication overhead and a 53% improvement in retrieval accuracy compared to the traditional communication system. Furthermore, four typical application scenarios for generative SemCom are delineated, followed by a discussion of three open issues warranting future investigation. In a nutshell, this paper provides a holistic set of guidelines for applying GAI in SemCom, paving the way for the efficient implementation of generative SemCom in future wireless networks.

本文深入探讨了生成式人工智能（GAI）在语义通信（SemCom）中的应用，并进行了全面研究。首先介绍了三种由经典GAI模型启用的流行的SemCom系统，包括变分自编码器、生成对抗网络和扩散模型。对于每个系统，都阐述了GAI模型的基本概念、相应的SemCom架构以及最近的文献综述。然后，通过结合最前沿的GAI技术——大型语言模型（LLM），提出了一种新型的生成式SemCom系统。该系统在发射端和接收端都采用了基于LLM的AI代理，分别作为“大脑”，以实现强大的信息理解和内容再生能力。这种创新设计允许接收端根据发射端传递的编码语义信息直接生成所需内容，而不是恢复比特流。因此，它将通信思维从“信息恢复”转变为“信息再生”，从而开启了生成式SemCom的新时代。通过点对点视频检索的案例分析，展示了所提出的生成式SemCom系统的优越性，与传统通信系统相比，通信开销减少了99.98%，检索精度提高了5a 3%。此外，还介绍了生成式SemCom的四个典型应用场景，并讨论了三个需要未来研究的开放问题。总之，本文为GAI在SemCom中的应用提供了一套全面的指导方针，为未来无线网络中生成式SemCom的有效实施奠定了基础。

论文及项目相关链接

PDF 18 pages, 8 figures

Summary
本文研究了生成式人工智能（GAI）在语义通信（SemCom）中的应用，并介绍了三种由经典GAI模型启用的SemCom系统。文章提出了结合最新GAI技术的大型语言模型（LLM）的生成式SemCom系统，实现了从“信息恢复”到“信息再生”的转变。该系统在点对点视频检索案例研究中表现出卓越性能，相较于传统通信系统，减少了99.98%的通信开销，提高了53%的检索精度。本文为GAI在SemCom中的应用提供了一套全面的指南，为未来的无线网络中实现高效的生成式SemCom铺平了道路。

Key Takeaways

文章介绍了生成式人工智能（GAI）在语义通信（SemCom）中的应用及其研究背景。
阐述了三种基于经典GAI模型的SemCom系统，包括变分自编码器、生成对抗网络和扩散模型。
提出了一种结合大型语言模型（LLM）的创新型生成式SemCom系统，实现了强大的信息理解和内容再生能力。
该系统在点对点视频检索案例研究中表现出卓越性能，显著降低了通信开销并提高了检索精度。
文章讨论了生成式SemCom在四个典型应用场景中的潜力。
强调了未来在开放问题上需要进一步研究和探索的方向。

点此查看论文截图

Fast Prompt Alignment for Text-to-Image Generation

Authors:Khalil Mrini, Hanlin Lu, Linjie Yang, Weilin Huang, Heng Wang

Text-to-image generation has advanced rapidly, yet aligning complex textual prompts with generated visuals remains challenging, especially with intricate object relationships and fine-grained details. This paper introduces Fast Prompt Alignment (FPA), a prompt optimization framework that leverages a one-pass approach, enhancing text-to-image alignment efficiency without the iterative overhead typical of current methods like OPT2I. FPA uses large language models (LLMs) for single-iteration prompt paraphrasing, followed by fine-tuning or in-context learning with optimized prompts to enable real-time inference, reducing computational demands while preserving alignment fidelity. Extensive evaluations on the COCO Captions and PartiPrompts datasets demonstrate that FPA achieves competitive text-image alignment scores at a fraction of the processing time, as validated through both automated metrics (TIFA, VQA) and human evaluation. A human study with expert annotators further reveals a strong correlation between human alignment judgments and automated scores, underscoring the robustness of FPA’s improvements. The proposed method showcases a scalable, efficient alternative to iterative prompt optimization, enabling broader applicability in real-time, high-demand settings. The codebase is provided to facilitate further research: https://github.com/tiktok/fast_prompt_alignment

文本到图像的生成技术已经迅速发展，然而，将复杂的文本提示与生成的图像对齐仍然具有挑战性，特别是在复杂对象关系和精细细节方面。本文介绍了快速提示对齐（FPA）技术，这是一种提示优化框架，它采用一次通过的方法，提高了文本到图像的对齐效率，而无需使用当前方法（如OPT2I）所特有的迭代开销。FPA利用大型语言模型（LLM）进行单次迭代提示同义替换，随后通过优化提示进行微调或上下文内学习，以实现实时推理，降低了计算需求同时保持了对齐保真度。在COCO Captions和PartiPrompts数据集上的广泛评估表明，FPA在处理时间的一小部分内实现了具有竞争力的文本图像对齐得分，这已通过自动化指标（TIFA、VQA）和人类评估得到了验证。由专家注释者进行的人类研究进一步揭示了人类对齐判断与自动化得分之间的强烈相关性，强调了FPA改进的稳健性。所提出的方法展示了迭代提示优化的可扩展、高效替代方案，可在实时、高需求环境中得到更广泛的应用。代码库已提供，以方便进一步的研究：https://github.com/tiktok/fast_prompt_alignment

论文及项目相关链接

PDF TikTok Technical Report

Summary

本文提出了Fast Prompt Alignment（FPA）框架，利用大型语言模型进行单次迭代提示优化，实现文本到图像对齐的实时推理，大大提高了对齐效率并降低了计算需求。在COCO Captions和PartiPrompts数据集上的评估表明，FPA在文本图像对齐得分上表现优异，同时处理时间大幅减少。此外，专家标注人员的研究进一步验证了FPA改进的稳健性。该框架为实时高需求场景下的提示优化提供了可扩展、高效的替代方案。

Key Takeaways

Fast Prompt Alignment (FPA) 框架旨在解决文本到图像生成中对齐复杂文本提示与生成图像的问题。
FPA 利用大型语言模型进行单次迭代提示优化，提升文本到图像对齐效率。
FPA 实现了实时推理，减少计算需求的同时保持了对齐精度。
在 COCO Captions 和 PartiPrompts 数据集上的评估证明了 FPA 的高效性能。
通过专家标注人员的研究验证了 FPA 改进的稳健性。
FPA 提供了一种可扩展、高效的替代方案，适用于实时高需求场景下的提示优化。

点此查看论文截图

Multimodal Latent Language Modeling with Next-Token Diffusion

Authors:Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei

Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop $\sigma$-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.

多模态生成模型需要一种统一的方法来处理离散数据（如文本和代码）和连续数据（如图像、音频、视频）。在这项工作中，我们提出了潜在语言建模（LatentLM），它使用因果Transformer无缝集成了连续和离散数据。具体来说，我们采用变分自编码器（VAE）将连续数据表示为潜在向量，并引入下一个令牌扩散来进行这些向量的自回归生成。此外，为了解决自回归建模中的方差崩溃问题，我们开发了σ-VAE。大量实验表明，LatentLM在各种模态中都非常有效。在图像生成方面，LatentLM在性能和可扩展性方面都超越了Diffusion Transformers。当集成到多模态大型语言模型中时，LatentLM提供了一个通用接口，统一了多模态生成和理解。实验结果表明，在扩大训练令牌规模方面，LatentLM与Transfusion和向量量化模型相比取得了有利的表现。在文本到语音的合成中，LatentLM在说话人相似性和稳健性方面超越了最先进的VALL-E 2模型，同时需要减少10倍的解码步骤。结果证明LatentLM是一种高度有效和可扩展的方法，可以推动大型多模态模型的发展。

论文及项目相关链接

PDF

Summary

基于离散数据（如文本和代码）和连续数据（如图像、音频和视频）的多模态生成模型需要一种统一的方法来处理。本研究提出了潜在语言建模（LatentLM），它利用因果Transformer无缝集成连续和离散数据。通过变分自编码器（VAE）表示连续数据为潜在向量，并引入下一个符号扩散进行这些向量的自回归生成。此外，为了解决自回归建模中的方差崩溃问题，开发了σ-VAE。实验表明，LatentLM在各种模态中的有效性。在图像生成方面，LatentLM在性能和可扩展性方面都超越了Diffusion Transformers。当集成到多模态大型语言模型中时，LatentLM提供了一个统一多模态生成和理解的通用接口。实验结果证明，LatentLM在扩大训练符号方面的表现优于Transfusion和向量量化模型。在文本到语音的合成中，LatentLM在说话人相似性和稳健性方面优于当前主流的VALL-E 2模型，并且只需要10倍的解码步骤。结果证明了LatentLM在推进大型多模态模型方面的高效性和可扩展性。

Key Takeaways

LatentLM提出了一个统一的方法来处理离散和连续数据的多模态生成模型。
通过变分自编码器（VAE）表示连续数据为潜在向量，并引入下一个符号扩散进行自回归生成。
σ-VAE的引入解决了自回归建模中的方差崩溃问题。
LatentLM在图像生成方面表现出优异的性能和可扩展性。
LatentLM集成到多模态大型语言模型时，提供了统一的接口。
在扩大训练符号方面，LatentLM的表现优于其他模型。

点此查看论文截图

Synthetic Vision: Training Vision-Language Models to Understand Physics

Authors:Vahid Balazadeh, Mohammadmehdi Ataei, Hyunmin Cheong, Amir Hosein Khasahmadi, Rahul G. Krishnan

Physical reasoning, which involves the interpretation, understanding, and prediction of object behavior in dynamic environments, remains a significant challenge for current Vision-Language Models (VLMs). In this work, we propose two methods to enhance VLMs’ physical reasoning capabilities using simulated data. First, we fine-tune a pre-trained VLM using question-answer (QA) pairs generated from simulations relevant to physical reasoning tasks. Second, we introduce Physics Context Builders (PCBs), specialized VLMs fine-tuned to create scene descriptions enriched with physical properties and processes. During physical reasoning tasks, these PCBs can be leveraged as context to assist a Large Language Model (LLM) to improve its performance. We evaluate both of our approaches using multiple benchmarks, including a new stability detection QA dataset called Falling Tower, which includes both simulated and real-world scenes, and CLEVRER. We demonstrate that a small QA fine-tuned VLM can significantly outperform larger state-of-the-art foundational models. We also show that integrating PCBs boosts the performance of foundational LLMs on physical reasoning tasks. Using the real-world scenes from the Falling Tower dataset, we also validate the robustness of both approaches in Sim2Real transfer. Our results highlight the utility that simulated data can have in the creation of learning systems capable of advanced physical reasoning.

涉及动态环境中物体行为的解释、理解和预测的物理推理仍然是当前视觉语言模型（VLMs）的一个重大挑战。在这项工作中，我们提出了两种使用模拟数据增强VLM物理推理能力的方法。首先，我们使用与物理推理任务相关的模拟生成的问答（QA）对来微调预训练的VLM。其次，我们引入了物理上下文构建器（PCBs），这是经过精细调整的专门用于创建场景描述的VLM，其中包含丰富的物理属性和过程。在物理推理任务中，可以利用这些PCB作为上下文，帮助大型语言模型（LLM）提高其性能。我们使用多个基准测试来评估我们的两种方法，包括一个新的名为“坠落之塔”的稳定性检测问答数据集，其中包括模拟和真实场景，以及CLEVRER。我们证明，经过小型问答对微调过的VLM可以显著优于较大的最新基础模型。我们还表明，集成PCB可以提高基础LLM在物理推理任务上的性能。我们还使用来自“坠落之塔”数据集的真实场景验证了两种方法的Sim2Real迁移鲁棒性。我们的结果强调了模拟数据在创建能够进行高级物理推理的学习系统方面的实用性。

论文及项目相关链接

PDF

Summary

本文提出了两种使用模拟数据增强视觉语言模型（VLMs）物理推理能力的方法。首先，通过对预训练的VLM进行微调，利用与物理推理任务相关的模拟生成的问答对来增强其物理推理能力。其次，引入了物理语境构建器（PCBs），它们是经过微调专门用于创建场景描述的VLMs，富含物理属性和过程。在物理推理任务中，可以利用这些PCBs作为上下文来辅助大型语言模型（LLM）提高性能。评估方法包括新的稳定性检测问答数据集Falling Tower和CLEVRER等多个基准测试。结果显示，小型问答微调VLM显著优于大型先进基础模型，集成PCBs则提高了基础LLM在物理推理任务上的性能。在Falling Tower数据集的实时场景上验证了两种方法在模拟到真实迁移的稳健性。总结起来，本文强调模拟数据在创建具备高级物理推理能力的学习系统中的应用价值。

Key Takeaways

物理推理是当前视觉语言模型（VLMs）面临的重要挑战，涉及动态环境中物体行为的解释、理解和预测。
提出了两种增强VLMs物理推理能力的方法：微调预训练VLM使用模拟生成的与物理推理任务相关的问答对；引入物理语境构建器（PCBs）作为上下文辅助大型语言模型（LLM）。
通过多个基准测试，包括新的稳定性检测问答数据集Falling Tower和CLEVRER，对两种方法进行了评估。
小型问答微调VLM表现出显著性能，优于大型先进基础模型；集成PCBs提高了基础LLM在物理推理任务上的性能。
两种方法在模拟到真实迁移的稳健性上得到了验证，特别是在实时场景上。
本文强调了模拟数据在创建具备高级物理推理能力的学习系统中的应用价值。

点此查看论文截图

Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models

Authors:Jiahui Li, Yongchang Hao, Haoyu Xu, Xing Wang, Yu Hong

Despite the advancements in training Large Language Models (LLMs) with alignment techniques to enhance the safety of generated content, these models remain susceptible to jailbreak, an adversarial attack method that exposes security vulnerabilities in LLMs. Notably, the Greedy Coordinate Gradient (GCG) method has demonstrated the ability to automatically generate adversarial suffixes that jailbreak state-of-the-art LLMs. However, the optimization process involved in GCG is highly time-consuming, rendering the jailbreaking pipeline inefficient. In this paper, we investigate the process of GCG and identify an issue of Indirect Effect, the key bottleneck of the GCG optimization. To this end, we propose the Model Attack Gradient Index GCG (MAGIC), that addresses the Indirect Effect by exploiting the gradient information of the suffix tokens, thereby accelerating the procedure by having less computation and fewer iterations. Our experiments on AdvBench show that MAGIC achieves up to a 1.5x speedup, while maintaining Attack Success Rates (ASR) on par or even higher than other baselines. Our MAGIC achieved an ASR of 74% on the Llama-2 and an ASR of 54% when conducting transfer attacks on GPT-3.5. Code is available at https://github.com/jiah-li/magic.

尽管利用对齐技术训练大型语言模型（LLM）以增强生成内容的安全性的技术已经取得了进展，但这些模型仍然容易受到越狱攻击的影响，这是一种暴露LLM安全漏洞的对抗性攻击方法。值得注意的是，贪婪坐标梯度（GCG）方法已经证明能够自动生成对抗性后缀，突破最先进的大型语言模型。然而，GCG涉及的优化过程非常耗时，使得越狱管道效率低下。在本文中，我们研究了GCG的过程，并发现了间接效应的问题，这是GCG优化的关键瓶颈。为此，我们提出了模型攻击梯度指数GCG（MAGIC），通过利用后缀令牌的梯度信息来解决间接效应问题，从而通过减少计算和迭代次数来加速程序。我们在AdvBench上的实验表明，MAGIC实现了高达1.5倍的速度提升，同时保持攻击成功率（ASR）与其他基线持平甚至更高。我们的MAGIC在Llama-2上实现了74%的ASR，在对GPT-3.5进行迁移攻击时实现了54%的ASR。代码可在https://github.com/jiah-li/magic上找到。

论文及项目相关链接

PDF 13 pages,2 figures, accepted by The 31st International Conference on Computational Linguistics

摘要

随着大型语言模型（LLM）的对齐技术不断进步，其生成内容的安全性得到了提升。然而，这些模型仍然面临一种名为“越狱”的对抗性攻击方法，暴露了LLM的安全漏洞。本文研究了贪婪坐标梯度（GCG）方法，该方法能够自动生成对抗性后缀以破解最新LLM。然而，GCG涉及的优化过程非常耗时，使得越狱管道效率低下。本文调查了GCG的过程，并确定了间接效应这一关键瓶颈。为此，我们提出了模型攻击梯度指数GCG（MAGIC），通过利用后缀标记的梯度信息来解决间接效应问题，从而通过减少计算和迭代次数来加速程序。在AdvBench上的实验表明，MAGIC实现了最高达1.5倍的速度提升，同时保持攻击成功率（ASR）与其他基线持平甚至更高。我们的MAGIC在对Llama-2的攻击中实现了74%的ASR，在对GPT-3.5进行迁移攻击时实现了54%的ASR。

关键见解

大型语言模型（LLM）仍面临对抗性攻击，即“越狱”，这暴露了其安全漏洞。
贪婪坐标梯度（GCG）方法可以自动生成的对抗性后缀来破解先进的LLM，但优化过程高度耗时。
GCG方法的瓶颈在于间接效应。
提出了模型攻击梯度指数GCG（MAGIC）来解决间接效应问题，通过利用梯度信息加速越狱过程。
MAGIC在AdvBench实验上实现了速度提升，同时保持或提高攻击成功率（ASR）。
MAGIC对Llama-2的攻击成功率为74%，对GPT-3.5的迁移攻击成功率为54%。
MAGIC的代码已公开提供。

点此查看论文截图

Preference Discerning with LLM-Enhanced Generative Retrieval

Authors:Fabian Paischer, Liu Yang, Linfeng Liu, Shuai Shao, Kaveh Hassani, Jiacheng Li, Ricky Chen, Zhang Gabriel Li, Xialo Gao, Wei Shao, Xue Feng, Nima Noorshams, Sem Park, Bo Long, Hamid Eghbalzadeh

Sequential recommendation systems aim to provide personalized recommendations for users based on their interaction history. To achieve this, they often incorporate auxiliary information, such as textual descriptions of items and auxiliary tasks, like predicting user preferences and intent. Despite numerous efforts to enhance these models, they still suffer from limited personalization. To address this issue, we propose a new paradigm, which we term preference discerning. In preference dscerning, we explicitly condition a generative sequential recommendation system on user preferences within its context. To this end, we generate user preferences using Large Language Models (LLMs) based on user reviews and item-specific data. To evaluate preference discerning capabilities of sequential recommendation systems, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. We assess current state-of-the-art methods using our benchmark and show that they struggle to accurately discern user preferences. Therefore, we propose a new method named Mender ($\textbf{M}$ultimodal Prefer$\textbf{en}$ce $\textbf{d}$iscern$\textbf{er}$), which improves upon existing methods and achieves state-of-the-art performance on our benchmark. Our results show that Mender can be effectively guided by human preferences even though they have not been observed during training, paving the way toward more personalized sequential recommendation systems. We will open-source the code and benchmarks upon publication.

顺序推荐系统旨在根据用户的交互历史为其提供个性化推荐。为了实现这一目标，它们通常会融入辅助信息，如物品的文本描述和辅助任务，如预测用户偏好和意图。尽管已经有很多努力来提升这些模型，但它们仍然面临着个性化有限的挑战。为了解决这个问题，我们提出了一种新的方法，我们称之为偏好识别。在偏好识别中，我们显式地将生成式顺序推荐系统建立在用户偏好之上。为此，我们基于用户评论和特定物品数据，使用大型语言模型（LLM）生成用户偏好。为了评估顺序推荐系统的偏好识别能力，我们引入了一个新的基准测试，该测试可以在各种情境下进行全面评估，包括偏好控制和情感跟随。我们使用此基准测试评估了当前最先进的方法，并发现它们在准确识别用户偏好方面存在困难。因此，我们提出了一种新的方法，名为Mender（多模式偏好辨别器），它在现有方法的基础上进行了改进，并在我们的基准测试上达到了最先进的性能。我们的结果表明，即使在没有在训练中观察到人类偏好的情况下，Mender也可以有效地被人类偏好所引导，为更个性化的顺序推荐系统铺平了道路。在发布时，我们将公开源代码和基准测试。

论文及项目相关链接

PDF 11 pages + references and appendix

Summary

基于用户历史交互的序列推荐系统旨在为用户提供个性化推荐。为达到此目的，它们经常结合辅助信息，如物品的文字描述和预测用户偏好和意图的辅助任务。尽管有很多提高这些模型的努力，它们仍然面临个性化有限的挑战。为解决此问题，我们提出了一个新的范式——偏好辨别。在偏好辨别中，我们明确地在生成序列推荐系统中以用户偏好为条件。为此，我们基于用户评论和特定物品数据使用大型语言模型（LLM）生成用户偏好。为了评估序列推荐系统的偏好辨别能力，我们引入了一个新的基准测试，该测试全面评价了不同场景，包括偏好引导和情感跟随。我们评估了当前先进的方法，并发现它们在准确辨别用户偏好方面存在困难。因此，我们提出了一种新的方法——Mender（多模态偏好鉴别器），它在现有方法的基础上进行了改进，并在我们的基准测试中达到了最新性能。结果显示，Mender即使在没有在训练中观察到的人类偏好也能进行有效指导，为更个性化的序列推荐系统铺平了道路。我们将公开代码和基准测试。

Key Takeaways

序列推荐系统旨在基于用户历史交互提供个性化推荐。
现有序列推荐系统存在个性化有限的挑战。
提出新的范式——偏好辨别来解决这个问题。
利用大型语言模型（LLM）生成用户偏好。
引入新的基准测试来评估序列推荐系统的偏好辨别能力。
当前先进方法在准确辨别用户偏好方面存在困难。

点此查看论文截图

Der Effizienz- und Intelligenzbegriff in der Lexikographie und kuenstlichen Intelligenz: kann ChatGPT die lexikographische Textsorte nachbilden?

Authors:Ivan Arias-Arias, Maria Jose Dominguez Vazquez, Carlos Valcarcel Riveiro

By means of pilot experiments for the language pair German and Galician, this paper examines the concept of efficiency and intelligence in lexicography and artificial intelligence, AI. The aim of the experiments is to gain empirically and statistically based insights into the lexicographical text type,dictionary article, in the responses of ChatGPT 3.5, as well as into the lexicographical data on which this chatbot was trained. Both quantitative and qualitative methods are used for this purpose. The analysis is based on the evaluation of the outputs of several sessions with the same prompt in ChatGPT 3.5. On the one hand, the algorithmic performance of intelligent systems is evaluated in comparison with data from lexicographical works. On the other hand, the ChatGPT data supplied is analysed using specific text passages of the aforementioned lexicographical text type. The results of this study not only help to evaluate the efficiency of this chatbot regarding the creation of dictionary articles, but also to delve deeper into the concept of intelligence, the thought processes and the actions to be carried out in both disciplines.

本文通过针对德语和加利西亚语语言对的试点实验，探讨了词典学、人工智能（AI）中的效率和智能概念。这些实验的目的是实证地、统计地了解ChatGPT 3.5对词典文本类型（词典词条）的响应，以及该聊天机器人所训练的词典数据。为此目的，使用了定量和定性方法。分析是基于对ChatGPT 3.5中相同提示下多次会话输出的评估进行的。一方面，智能系统的算法性能与词典作品的数据进行评估比较。另一方面，使用上述词典文本类型的特定文本段落分析ChatGPT提供的数据。本研究的结果不仅有助于评估该聊天机器人在创建词典文章方面的效率，而且有助于更深入地了解智能的概念、思维过程和这两个学科中需要执行的操作。

论文及项目相关链接

PDF 25 pages, in German language

Summary

本论文通过德语和加利西亚语的试点实验，探讨了词典编纂与人工智能中的效率与智能概念。实验旨在实证和统计地分析ChatGPT 3.5对于词典条目的回应，并研究该聊天机器人所训练的词典数据。研究采用了定量和定性方法，通过对同一提示下ChatGPT 3.5的多轮输出进行评估，对比智能系统的算法性能与词典作品的数据。此外，还使用特定的词典文本类型对ChatGPT的数据进行了深入分析。本研究结果不仅有助于评估该聊天机器人在创建词典条目方面的效率，还深入探讨了智能、思维过程和行动在词典编纂与人工智能两个领域中的应用。

Key Takeaways

本研究通过德语和加利西亚语的试点实验，探索了词典编纂和人工智能中的效率与智能。
实验旨在实证分析ChatGPT 3.5对词典条目的回应以及其所训练的词典数据。
研究采用了定量和定性方法，重点评估了ChatGPT 3.5的多轮输出。
研究对比了智能系统的算法性能与词典作品的数据。
特定的词典文本类型被用于深入分析ChatGPT的数据。
研究结果有助于评估聊天机器人在创建词典条目方面的效率。

点此查看论文截图

Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based Automated Requirement Traceability and Compliance Checks

Authors:Arsalan Masoudifard, Mohammad Mowlavi Sorond, Moein Madadi, Mohammad Sabokrou, Elahe Habibi

Ensuring that Software Requirements Specifications (SRS) align with higher-level organizational or national requirements is vital, particularly in regulated environments such as finance and aerospace. In these domains, maintaining consistency, adhering to regulatory frameworks, minimizing errors, and meeting critical expectations are essential for the reliable functioning of systems. The widespread adoption of large language models (LLMs) highlights their immense potential, yet there remains considerable scope for improvement in retrieving relevant information and enhancing reasoning capabilities. This study demonstrates that integrating a robust Graph-RAG framework with advanced prompt engineering techniques, such as Chain of Thought and Tree of Thought, can significantly enhance performance. Compared to baseline RAG methods and simple prompting strategies, this approach delivers more accurate and context-aware results. While this method demonstrates significant improvements in performance, it comes with challenges. It is both costly and more complex to implement across diverse contexts, requiring careful adaptation to specific scenarios. Additionally, its effectiveness heavily relies on having complete and accurate input data, which may not always be readily available, posing further limitations to its scalability and practicality.

确保软件需求规格（SRS）与更高层次的组织或国家要求相一致是至关重要的，特别是在金融和航空等受监管的环境中。在这些领域，保持一致性、遵守监管框架、减少错误和满足关键期望是系统可靠运行的关键。大型语言模型（LLM）的广泛应用突显了其巨大的潜力，但在检索相关信息和提高推理能力方面仍有很大的改进空间。本研究表明，将稳健的图-RAG框架与先进的提示工程技术（如思维链和思维树）相结合，可以显著提高性能。与基线RAG方法和简单提示策略相比，这种方法提供了更准确和上下文感知的结果。虽然这种方法在性能上取得了显著的改进，但它也带来了挑战。在多种背景下实施既昂贵又更复杂，需要谨慎适应特定场景。此外，其有效性严重依赖于完整和准确的数据输入，这可能并不总是轻易可用，进一步限制了其可扩展性和实用性。

论文及项目相关链接

PDF

Summary

本文将介绍如何确保软件需求规格（SRS）与组织或国家的高层次需求一致的重要性，特别是在金融和航空等受监管的环境中的重要性。同时，文章强调了大型语言模型（LLM）在提高信息检索和推理能力方面的巨大潜力及仍有改进的空间。该研究展示了一个稳健的Graph-RAG框架与先进的提示工程技术相结合，如链式思维和树状思维，可以显著提高性能。虽然这种方法在性能上取得了显著改进，但其成本高、实现复杂且涉及多样情境适应性调整。此外，该方法的有效性取决于是否拥有完整和准确的数据输入，这进一步限制了其可扩展性和实用性。

Key Takeaways

以下是关键见解的要点摘要：

软件需求规格与高层级组织或国家要求的对齐在金融和航空等领域极其重要。
大型语言模型（LLM）在信息检索和推理能力方面展现出巨大潜力。但仍需改进以满足具体需求。
集成Graph-RAG框架与先进的提示工程技术可以提高性能，如链式思维和树状思维方法。
与基线相比，该方法提供了更准确和上下文感知的结果。然而，其成本高且实现复杂。
该方法需要适应不同的情境，并具有应对多种情况的复杂性挑战。它的成功应用需谨慎实施特定场景的策略和调整。
方法的有效性取决于拥有完整和准确的数据输入，这限制了其在实际应用中的可扩展性。在实际应用中需注意数据的完整性和准确性问题。

点此查看论文截图

Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning

Authors:Hang Zhao, Qile P. Chen, Yijing Barry Zhang, Gang Yang

Both encoder-only models (e.g., BERT, RoBERTa) and large language models (LLMs, e.g., Llama3) have been widely used for text classification tasks. However, there is a lack of systematic studies comparing the performance of encoder-based models and LLMs in text classification, particularly when fine-tuning is involved. This study employed a diverse range of models and methods, varying in size and architecture, and including both fine-tuned and pre-trained approaches. We first assessed the performances of these LLMs on the 20 Newsgroups (20NG) and MASSIVE datasets, comparing them to encoder-only RoBERTa models. Additionally, we explored the multi-task capabilities of both model types by combining multiple classification tasks, including intent detection and slot-filling, into a single model using data from both datasets. Our results indicate that fully fine-tuned Llama3-70B models outperform RoBERTa-large and other decoder LLMs across various classification tasks and datasets. Moreover, the consolidated multi-task fine-tuned LLMs matched the performance of dual-model setups in both tasks across both datasets. Overall, our study provides a comprehensive benchmark of encoder-only and LLM models on text classification tasks and demonstrates a method to combine two or more fully fine-tuned decoder LLMs for reduced latency and equivalent performance.

编码器模型（例如BERT、RoBERTa）和大语言模型（LLM，例如Llama3）已被广泛应用于文本分类任务。然而，关于基于编码器的模型和LLM在文本分类中的性能对比的系统性研究仍然缺乏，尤其是在涉及微调时。本研究采用了不同规模和架构的多种模型和方法，包括微调和预先训练的方法。我们首先评估了这些LLM在20新闻组（20NG）和MASSIVE数据集上的性能，并将其与仅使用编码器的RoBERTa模型进行了比较。此外，我们还探索了两种模型类型的多任务功能，通过结合多个分类任务（包括意图检测和槽填充），使用这两个数据集的数据构建单一模型。我们的结果表明，完全微调后的Llama3-70B模型在各种分类任务和数据集上的性能优于RoBERTa大型模型和其他解码器LLM。此外，经过整合的多任务微调LLM在跨数据集的两个任务中的性能与双模型设置相匹配。总体而言，我们的研究提供了关于仅编码器和LLM模型在文本分类任务上的全面基准测试，并展示了一种将两个或多个完全调校的解码器LLM结合起来，以减少延迟并维持等效性能的方法。

论文及项目相关链接

PDF 9 pages, 3 tables

Summary

本文对比研究了编码器模型（如BERT、RoBERTa）与大型语言模型（LLMs，如Llama3）在文本分类任务上的性能。研究采用多种规模和架构的模型和方法，包括微调和预先训练的方法，在20Newsgroups（20NG）和MASSIVE数据集上评估了LLMs与RoBERTa的性能。此外，研究还探索了两种模型类型的多任务能力，通过将多个分类任务（如意图检测和槽填充）整合到单一模型中。结果显示，完全微调后的Llama3-70B模型在各项分类任务和数据集上表现优于RoBERTa-large和其他解码器LLMs。同时，多任务微调后的LLMs在两项任务上的表现与双模型设置相匹配。总体而言，本文提供了编码器模型和LLM在文本分类任务上的全面基准测试，并展示了如何结合两个或多个完全微调后的解码器LLMs以实现降低延迟和等效性能的方法。

Key Takeaways

对比了编码器模型（如BERT、RoBERTa）和大型语言模型（LLMs，如Llama3）在文本分类任务上的性能。
在多个数据集（包括20Newsgroups和MASSIVE）上进行了广泛的实验评估。
LLMs在各项分类任务和数据集上的表现通常优于RoBERTa。
完全微调后的Llama3-70B模型表现出最佳性能。
多任务能力方面，多任务微调后的LLMs表现良好，可与双模型设置相匹配。
研究展示了结合多个完全微调后的解码器LLMs的方法，以实现降低延迟和等效性能。

点此查看论文截图

TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs

Authors:Hao Kang, Srikant Bharadwaj, James Hensman, Tushar Krishna, Victor Ruhle, Saravan Rajmohan

Large language model (LLM) inference demands significant amount of computation and memory, especially in the key attention mechanism. While techniques, such as quantization and acceleration algorithms, like FlashAttention, have improved efficiency of the overall inference, they address different aspects of the problem: quantization focuses on weight-activation operations, while FlashAttention improves execution but requires high-precision formats. Recent Key-value (KV) cache quantization reduces memory bandwidth but still needs floating-point dequantization for attention operation. We present TurboAttention, a comprehensive approach to enable quantized execution of attention that simultaneously addresses both memory and computational efficiency. Our solution introduces two key innovations: FlashQ, a headwise attention quantization technique that enables both compression of KV cache and quantized execution of activation-activation multiplication, and Sparsity-based Softmax Approximation (SAS), which eliminates the need for dequantization to FP32 during exponentiation operation in attention. Experimental results demonstrate that TurboAttention achieves 1.2-1.8x speedup in attention, reduces the KV cache size by over 4.4x, and enables up to 2.37x maximum throughput over the FP16 baseline while outperforming state-of-the-art quantization and compression techniques across various datasets and models.

大型语言模型（LLM）的推理需要大量的计算和内存资源，特别是在关键注意力机制方面。虽然量化技术和加速算法（如FlashAttention）已经提高了整体推理的效率，但它们解决的问题不同：量化关注权重激活操作，而FlashAttention改进了执行但需要高精度格式。最近的键值（KV）缓存量化减少了内存带宽，但仍然需要浮点反量化进行注意力操作。我们提出了TurboAttention，这是一种全面的方法，能够实现对注意力的量化执行，同时解决内存和计算效率问题。我们的解决方案引入了两项关键创新：FlashQ，一种逐头注意力量化技术，能够实现KV缓存的压缩和激活激活乘法的量化执行；以及基于稀疏性的Softmax近似（SAS），它消除了在注意力指数运算期间反量化到FP32的需要。实验结果表明，TurboAttention实现了注意力速度的1.2-1.8倍提升，减少了KV缓存大小超过4.4倍，并且在各种数据集和模型上相对于FP16基线实现了最高达2.37倍的最大吞吐量，超越了最先进的量化和压缩技术。

论文及项目相关链接

PDF

Summary

LLM的推理需要大量的计算和内存资源，特别是在关键注意力机制方面。现有的技术如量化和FlashAttention等虽提高了推理效率，但仍存在不足。本文提出TurboAttention，一种综合的注意力量化执行方法，同时解决内存和计算效率问题。TurboAttention包括两项创新技术：FlashQ用于实现KV缓存压缩和量化执行激活乘法，以及基于稀疏性的Softmax近似（SAS），消除注意力指数运算中的去量化需求。实验结果显示TurboAttention在注意力方面实现了1.2-1.8倍的速度提升，KV缓存大小减少超过4.4倍，同时在各种数据集和模型上超越了现有量化技术。

Key Takeaways

LLM推理依赖于大量的计算和内存资源，特别是在注意力机制上。
量化和加速算法如FlashAttention已提高推理效率，但仍存在不足。
TurboAttention是一种综合方法，旨在实现注意力的量化执行，同时提高内存和计算效率。
TurboAttention包括两项关键技术：FlashQ用于KV缓存压缩和激活乘法的量化执行，以及SAS消除注意力运算中的去量化步骤。
实验结果表明TurboAttention在注意力处理上实现了显著的速度提升，并大幅减小了KV缓存大小。

点此查看论文截图

Can We Generate Visual Programs Without Prompting LLMs?

Authors:Michal Shlapentokh-Rothman, Yu-Xiong Wang, Derek Hoiem

Visual programming prompts LLMs (large language mod-els) to generate executable code for visual tasks like visual question answering (VQA). Prompt-based methods are difficult to improve while also being unreliable and costly in both time and money. Our goal is to develop an efficient visual programming system without 1) using prompt-based LLMs at inference time and 2) a large set of program and answer annotations. We develop a synthetic data augmentation approach and alternative program generation method based on decoupling programs into higher-level skills called templates and the corresponding arguments. Our results show that with data augmentation, prompt-free smaller LLMs ($\approx$ 1B parameters) are competitive with state-of-the art models with the added benefit of much faster inference

视觉编程提示大型语言模型（LLMs）为视觉任务生成可执行代码，如视觉问答（VQA）。基于提示的方法很难改进，同时不可靠，时间和金钱成本都很高。我们的目标是在不使用基于提示的LLM进行推断的情况下，并且不使用大量的程序和答案注释来开发高效的视觉编程系统。我们开发了一种合成数据增强方法，以及基于将程序解耦为高级技能（称为模板）和相应参数的替代程序生成方法。我们的结果表明，通过数据增强，无需提示的小型LLM（约1B参数）与最先进的模型相比具有竞争力，并且具有更快的推理优势。

论文及项目相关链接

PDF

Summary

基于视觉编程提示，大型语言模型（LLM）可生成用于视觉任务（如视觉问答）的可执行代码。但提示型方法既难以改进，又存在可靠性不足的问题，且消耗大量时间和成本。我们的目标是开发一种高效的视觉编程系统，不使用提示型LLM进行推理，且无需大量程序和答案注释。我们提出了一种基于合成数据增强的方法，并通过将程序解耦为高级技能（称为模板）和相应参数来生成替代程序。研究结果表明，通过数据增强，无需提示的小型LLM与当前先进模型相当，并具备更快的推理优势。

Key Takeaways

视觉编程提示使LLM能够生成用于视觉任务的代码。
提示型方法存在改进困难、可靠性不足和成本高昂的问题。
目标开发一种高效的视觉编程系统，不使用提示型LLM进行推理。
提出基于合成数据增强的方法改进模型性能。
通过将程序解耦为模板和参数来生成替代程序。
数据增强使小型LLM表现与先进模型相当。

点此查看论文截图

Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning

Authors:Rongzhe Wei, Mufei Li, Mohsen Ghassemi, Eleonora Kreačić, Yifan Li, Xiang Yue, Bo Li, Vamsi K. Potluru, Pan Li, Eli Chien

Large Language Models are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with the associated privacy risks assessed only empirically. The standard evaluation pipelines typically randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks to compare the unlearned models against models retrained without the to-be-unlearned data. However, since every data point is subject to the right to be forgotten, unlearning should be considered in the worst-case scenario from the privacy perspective. Prior work shows that data outliers may exhibit higher memorization effects. Intuitively, they are harder to be unlearn and thus the privacy risk of unlearning them is underestimated in the current evaluation. In this paper, we leverage minority data to identify such a critical flaw in previously widely adopted evaluations. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature. Using personally identifiable information as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across six unlearning approaches, three MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable assessments of LLM unlearning efficacy.

大型语言模型（LLM）的训练数据集广泛且常包含敏感的人类生成信息，引发了关于隐私泄露的重大担忧。虽然经过认证的遗忘处理方法提供了强大的隐私保证，但它们依赖于不适用于LLM的模型假设。因此，已经提出了各种遗忘启发式方法，并对相关的隐私风险进行了实证评估。标准的评估流程通常从训练集中随机选择数据进行删除，应用遗忘技术，并使用成员推理攻击来比较未学习的模型与重新训练的模型（无需遗忘数据）。然而，由于每个数据点都有被遗忘的权利，从隐私的角度来看，遗忘应在最坏的情况下进行考虑。先前的研究表明，数据异常值可能表现出更高的记忆效果。直觉上，它们更难被遗忘，因此当前评估低估了遗忘它们的隐私风险。在本文中，我们利用少数群体数据来识别先前广泛采用的评估中的这一关键缺陷。我们通过精心设计的实验证实了这一主张，包括与少数群体相关的遗忘信标（canary），这是受隐私审计文献启发的。以个人身份信息作为代表性的少数群体标识符，我们证明在大多数情况下，在六种遗忘方法、三种成员推理攻击、三个基准数据集和两个不同规模的大型语言模型中，少数群体的隐私泄露至少增加了20%。鉴于每个个体都有被遗忘的权利，我们主张对大型语言模型的遗忘方法进行更严格的评估。我们的少数群体意识评估框架是确保大型语言模型遗忘效率评估更加公平的第一步。

论文及项目相关链接

PDF

摘要
大型语言模型（LLM）的训练数据包含大量敏感的人类生成信息，引发了关于隐私泄露的担忧。虽然经过认证的遗忘方法提供强有力的隐私保障，但它们依赖于不适用于LLM的模型假设。因此，人们提出了各种遗忘启发式方法，并对相关的隐私风险进行了实证评估。标准评估流程通常随机选择数据进行训练集删除，应用遗忘技术，并使用成员推理攻击来比较未学习的模型与不使用待遗忘数据的重新训练的模型。然而，从隐私角度来看，由于每个数据点都有被遗忘的权利，因此应考虑最坏的情况进行遗忘。先前的研究表明，数据异常值可能表现出更高的记忆效应，即它们更难被遗忘，因此当前评估中的隐私风险被低估了。本文利用少数数据来识别以前广泛采用的评估中的这一关键缺陷。通过精心设计的实验，包括与少数群体相关的遗忘金丝雀（受隐私审计文献启发），我们证明在大多数情况下，少数群体至少会经历20%以上的隐私泄露。因此，我们主张对LLM的遗忘方法进行更严格的评估。我们的面向少数的评估框架是确保更公平地评估LLM遗忘效率的第一步。

关键见解

大型语言模型（LLM）的训练涉及敏感信息的隐私问题。
现有遗忘技术主要基于模型假设，可能不适用于LLM。
标准评估流程主要关注随机数据删除和成员推理攻击的比较，但未充分考虑最坏情况的遗忘。
数据异常值（即少数群体）在遗忘过程中可能具有更高的隐私泄露风险。
本文通过实验证明，少数群体在遗忘过程中至少面临20%以上的隐私泄露风险。
现有评估方法未能充分评估LLM遗忘方法的效果和公平性。

点此查看论文截图

MaestroMotif: Skill Design from Artificial Intelligence Feedback

Authors:Martin Klissarov, Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, Marlos C. Machado, Pierluca D’Oro

Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM’s feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM’s code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

描述自然语言中的技能具有将人类关于决策制定的知识注入人工智能系统的潜力。我们提出了MaestroMotif方法，这是一种人工智能辅助技能设计方法，能够产生高性能和可适应的代理。MaestroMotif利用大型语言模型（LLM）的功能来有效地创建和重用技能。它首先使用LLM的反馈来自动设计对应于每个技能的奖励，从他们的自然语言描述开始。然后，它利用LLM的代码生成能力与强化学习相结合，对技能进行训练，并将它们组合起来实现用语言指定的复杂行为。我们在NetHack学习环境（NLE）的一系列复杂任务中评估了MaestroMotif，结果表明它在性能和可用性方面都超越了现有方法。

论文及项目相关链接

PDF

Summary

本文介绍了MaestroMotif方法，这是一种AI辅助技能设计方式，能生成高性能且能适应不同情境的AI代理。该方法利用自然语言描述技能的能力，通过大型语言模型（LLM）自动设计相应技能的奖励，并结合强化学习进行技能培训和复杂行为的实现。在NetHack学习环境的复杂任务测试中，MaestroMotif表现出卓越的性能和实用性。

Key Takeaways

MaestroMotif是一种AI辅助技能设计方法，利用大型语言模型（LLM）实现技能的有效创建和再利用。
MaestroMotif通过自然语言描述技能，自动设计相应技能的奖励。
该方法结合强化学习进行技能培训，并能实现复杂的行为。
MaestroMotif在NetHack学习环境的复杂任务测试中表现优异。
MaestroMotif注重技能的适应性和性能的优化。
通过利用大型语言模型的能力，MaestroMotif可以注入人类决策知识到AI系统中。

点此查看论文截图

EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance

Authors:Yingxin Li, Ye Li, Yuan Meng, Xinzhu Ma, Zihan Geng, Shutao Xia, Zhi Wang

As large language models (LLMs) continue to advance, the demand for higher quality and faster processing of long contexts across various applications is growing. KV cache is widely adopted as it stores previously generated key and value tokens, effectively reducing redundant computations during inference. However, as memory overhead becomes a significant concern, efficient compression of KV cache has gained increasing attention. Most existing methods perform compression from two perspectives: identifying important tokens and designing compression strategies. However, these approaches often produce biased distributions of important tokens due to the influence of accumulated attention scores or positional encoding. Furthermore, they overlook the sparsity and redundancy across different heads, which leads to difficulties in preserving the most effective information at the head level. To this end, we propose EMS to overcome these limitations, while achieving better KV cache compression under extreme compression ratios. Specifically, we introduce a Global-Local score that combines accumulated attention scores from both global and local KV tokens to better identify the token importance. For the compression strategy, we design an adaptive and unified Evict-then-Merge framework that accounts for the sparsity and redundancy of KV tokens across different heads. Additionally, we implement the head-wise parallel compression through a zero-class mechanism to enhance efficiency. Extensive experiments demonstrate our SOTA performance even under extreme compression ratios. EMS consistently achieves the lowest perplexity, improves scores by over 1.28 points across four LLMs on LongBench under a 256 cache budget, and preserves 95% retrieval accuracy with a cache budget less than 2% of the context length in the Needle-in-a-Haystack task.

随着大型语言模型（LLM）的持续进步，对跨各种应用的高质量和快速处理长文本的需求不断增长。KV缓存因其能够存储先前生成的键和值令牌而得到广泛应用，从而有效地减少了推理过程中的冗余计算。然而，内存开销成为一个重要的问题，KV缓存的有效压缩也引起了越来越多的关注。现有的大多数方法从两个角度进行压缩：识别重要令牌和设计压缩策略。然而，这些方法通常会产生由于累积的注意力分数或位置编码的影响而产生的重要令牌的偏向分布。此外，它们忽视了不同头之间的稀疏性和冗余性，这使得在头部层面保留最有效信息变得困难。为此，我们提出EMS来克服这些局限性，同时实现在极端压缩比率下更好的KV缓存压缩。具体来说，我们引入了一个全局-局部分数，该分数结合了全局和局部KV令牌上的累积注意力分数，以更好地识别令牌的重要性。对于压缩策略，我们设计了一个自适应的、统一的逐出合并框架（Evict-then-Merge），该框架考虑了不同头之间KV令牌的稀疏性和冗余性。此外，我们通过零类机制实现了头部并行压缩，以提高效率。大量实验表明，即使在极端压缩比率下，我们的性能也达到了领先水平。EMS在极端压缩比率下始终实现了最低的困惑度，在LongBench上的四个LLM得分提高了超过1.28点，在Haystack任务中的Needle-in-a-Haystack场景下，在缓存预算少于上下文长度的2%的情况下保留了95%的检索准确性。

论文及项目相关链接

PDF

Summary

本文探讨了大语言模型（LLM）中KV缓存的压缩问题。随着LLM的不断发展，对高质量、快速处理长上下文的需求不断增长，而KV缓存由于能够存储先前生成的键和值令牌，有效减少推理过程中的冗余计算而受到广泛采用。然而，内存开销成为一个重要的问题，因此KV缓存的压缩受到越来越多的关注。针对现有方法存在的局限性，如识别重要令牌时的偏见分布、忽略不同头部之间的稀疏性和冗余性等，本文提出了一种新的KV缓存压缩方法EMS。该方法通过结合全局和局部令牌的重要性得分来识别重要令牌，并采用自适应统一的驱逐合并框架进行压缩。实验表明，即使在极端压缩比率下，EMS也能实现更好的KV缓存压缩性能。

Key Takeaways

LLM的发展推动了KV缓存压缩的需求。
KV缓存存储键和值令牌，减少冗余计算。
现有KV缓存压缩方法存在识别重要令牌时的偏见分布问题。
EMS通过结合全局和局部令牌的重要性得分来识别重要令牌。
EMS采用自适应统一的驱逐合并框架进行压缩，考虑不同头部之间的稀疏性和冗余性。
EMS实现了高效的头并行压缩。

点此查看论文截图

Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation

Authors:Pengyue Jia, Derong Xu, Xiaopeng Li, Zhaocheng Du, Xiangyang Li, Xiangyu Zhao, Yichao Wang, Yuhao Wang, Huifeng Guo, Ruiming Tang

The reranker and generator are two critical components in the Retrieval-Augmented Generation (i.e., RAG) pipeline, responsible for ranking relevant documents and generating responses. However, due to differences in pre-training data and objectives, there is an inevitable gap between the documents ranked as relevant by the reranker and those required by the generator to support answering the query. To address this gap, we propose RADIO, a novel and practical preference alignment framework with RAtionale DIstillatiOn. Specifically, We first propose a rationale extraction method that leverages the reasoning capabilities of Large Language Models (LLMs) to extract the rationales necessary for answering the query. Subsequently, a rationale-based alignment process is designed to rerank the documents based on the extracted rationales, and fine-tune the reranker to align the preferences. We conduct extensive experiments on two tasks across three datasets to demonstrate the effectiveness of our approach compared to baseline methods. Our code is released online to ease reproduction.

重排器和生成器是增强检索生成（即RAG）管道中的两个关键组件，负责排列相关文档和生成答复。然而，由于预训练数据和目标的不同，重排器排列的相关文档和生成器支持回答问题所需的文档之间存在不可避免的差距。为了解决这一差距，我们提出了RADIO，这是一个具有理性蒸馏（RAtionale DIstillatiOn）的新型实用偏好对齐框架。具体来说，我们首先提出了一种利用大型语言模型（LLM）的推理能力来提取用于回答查询所需的依据的理性提取方法。随后，设计了一种基于理性的对齐过程，根据提取的依据重新排列文档，并通过微调重排器来对齐偏好。我们在三个数据集的两个任务上进行了大量实验，以证明我们的方法相较于基准方法的有效性。我们的代码已在线发布，以方便复制。

论文及项目相关链接

PDF under review

Summary

本文介绍了在检索增强生成（RAG）管道中，排名器和生成器两个核心组件的作用及其面临的挑战。针对两者在预训练数据和目标上的差异导致的差距问题，提出了一种名为RADIO的新型偏好对齐框架，利用大型语言模型（LLM）的推理能力提取回答查询所需的理由，基于这些理由重新排名文档，并微调排名器以对齐偏好。通过两个任务在三个数据集上的实验验证了该方法的有效性。

Key Takeaways

RANKER和生成器是RAG管道中的关键组件，分别负责排名相关文档和生成响应。
由于预训练数据和目标的不同，RANKER和生成器之间存在差距。
提出了一个名为RADIO的新型偏好对齐框架来解决这一差距问题。
使用大型语言模型（LLM）的推理能力提取回答查询的理由。
基于提取的理由重新排名文档，并微调RANKER以对齐偏好。
在两个任务、三个数据集上的实验验证了RADIO方法的有效性。

点此查看论文截图

POINTS1.5: Building a Vision-Language Model towards Real World Applications

Authors:Yuan Liu, Le Tian, Xiao Zhou, Xinyu Gao, Kavio Yu, Yang Yu, Jie Zhou

Vision-language models have made significant strides recently, demonstrating superior performance across a range of tasks, e.g. optical character recognition and complex diagram analysis. Building on this trend, we introduce a new vision-language model, POINTS1.5, designed to excel in various real-world applications. POINTS1.5 is an enhancement of POINTS1.0 and incorporates several key innovations: i) We replace the original CLIP vision encoder, which had a fixed image resolution, with a NaViT-style vision encoder that supports native dynamic high resolution. This allows POINTS1.5 to process images of any resolution without needing to split them into tiles. ii) We add bilingual support to POINTS1.5, significantly enhancing its capability in Chinese. Due to the scarcity of open-source Chinese datasets for vision-language models, we collect numerous images from the Internet and annotate them using a combination of manual and automatic methods. iii) We propose a set of rigorous filtering methods for visual instruction tuning datasets. We comprehensively evaluate all these filtering methods, and choose the most effective ones to obtain the final visual instruction tuning set. Thanks to these innovations, POINTS1.5 significantly outperforms POINTS1.0 and demonstrates strong performance across a range of real-world applications. Notably, POINTS1.5-7B is trained on fewer than 4 billion tokens and ranks first on the OpenCompass leaderboard among models with fewer than 10 billion parameters

视觉语言模型近期取得了重大进展，在各种任务中表现出卓越的性能，例如光学字符识别和复杂图表分析。基于这一趋势，我们推出了一款新的视觉语言模型——POINTS1.5，旨在在各种现实应用中获得卓越表现。POINTS1.5是POINTS1.0的增强版，并融入了若干关键创新：一）我们替换了原始的CLIP视觉编码器，该编码器具有固定图像分辨率，采用NaViT风格的视觉编码器，支持原生动态高分辨率。这使得POINTS1.5能够处理任何分辨率的图像，而无需将它们分割成瓦片。二）我们为POINTS1.5增加了双语支持，这极大地提高了其在中文方面的能力。由于视觉语言模型的中文开源数据集稀缺，我们从互联网收集了大量图像，并结合手动和自动方法进行标注。三）我们为视觉指令调整数据集提出了一系列严格的过滤方法。我们全面评估了所有这些过滤方法，并选择了最有效的过滤方法来获得最终的视觉指令调整集。得益于这些创新，POINTS1.5在性能上显著超越了POINTS1.0，并在一系列现实应用中表现出强大的性能。值得注意的是，POINTS1.5-7B在少于4亿个令牌上进行训练，在参数少于10亿的模型中排名第一，荣登OpenCompass排行榜首位。

论文及项目相关链接

PDF

Summary：新型视觉语言模型POINTS1.5基于以往模型的优势进行创新，通过支持动态高分辨率图像处理和增强中文支持等功能，提升了在各种实际应用中的性能。通过严格筛选视觉指令微调数据集，并采用先进的过滤方法，其在多个指标上显著超越了前代模型POINTS1.0，表现强劲。同时，POINTS1.5在训练数据量方面也进行了优化，能够在少于4亿令牌的训练下达到优秀性能，且在OpenCompass排行榜上同类模型参数数量级内表现名列前茅。

Key Takeaways：

POINTS1.5是新一代的视觉语言模型，基于先前模型POINTs系列进行改进。
该模型通过引入NaViT风格的动态高分辨率视觉编码器，提升了图像处理能力。
POINTS1.5支持双语，特别是增强了中文处理能力。为解决中文数据集稀缺问题，该模型收集并标注了大量互联网图像。
模型通过严格的过滤方法筛选视觉指令微调数据集，并进行综合评估选择最有效的过滤方法获得最终数据集。
POINTS1.5在性能上显著超越前代模型POINTS1.0，特别是在训练数据量优化方面表现出色。

点此查看论文截图

The Roles of English in Evaluating Multilingual Language Models

Authors:Wessel Poelman, Miryam de Lhoneux

Multilingual natural language processing is getting increased attention, with numerous models, benchmarks, and methods being released for many languages. English is often used in multilingual evaluation to prompt language models (LMs), mainly to overcome the lack of instruction tuning data in other languages. In this position paper, we lay out two roles of English in multilingual LM evaluations: as an interface and as a natural language. We argue that these roles have different goals: task performance versus language understanding. This discrepancy is highlighted with examples from datasets and evaluation setups. Numerous works explicitly use English as an interface to boost task performance. We recommend to move away from this imprecise method and instead focus on furthering language understanding.

多语言自然语言处理得到了越来越多的关注，针对多种语言发布了众多模型、基准测试方法和手段。在跨语言评估中，英语常被用来提示语言模型（LMs），这主要是为了克服其他语言指令调整数据的缺乏。在这篇立场论文中，我们阐述了英语在跨语言LM评估中的两个作用：作为接口和作为自然语言。我们认为这两个作用的目标不同：任务执行与语言理解。通过数据集和评估设置的示例，这个差异得到了强调。许多作品明确地用英语作为接口来提升任务执行效率。我们建议摒弃这种不精确的方法，转而专注于提高语言理解能力。

论文及项目相关链接

PDF NoDaLiDa 2025

Summary

在多种语言自然语言处理领域中，英语在多语言语言模型评估中发挥着重要作用，主要体现在作为接口和自然语言两个方面。尽管其在任务性能上的表现备受关注，但其语言理解能力的差异也尤为重要。文中通过具体数据集和评估实例展示了这一差异。本文呼吁转向关注语言理解，而不是简单地以英语作为提高任务性能的接口。

Key Takeaways

英语在多语言自然语言处理中扮演重要角色，用于促进语言模型（LMs）的多语言评估。
英语在评估中的两种角色是：作为接口和作为自然语言。
使用英语作为接口主要用于提高任务性能，但其存在局限性，未能充分体现语言理解能力。
重视语言理解能力的提升而非仅仅以英语为接口的方法进行评估的重要性。
当前许多工作利用英语作为界面推动任务执行的方式有待改进。
英语作为评估语言的一部分具有复杂的双重角色和多种目的考量。需要更加注重以不同视角进行评估的策略研究和实践创新。

点此查看论文截图

SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs

Authors:Sultan Alrashed

We present SmolTulu-1.7b-Instruct, referenced in this report as SmolTulu-DPO-1130, an instruction-tuned language model that adapts AllenAI’s Tulu 3 post-training pipeline to enhance Huggingface’s SmolLM2-1.7B base model. Through comprehensive empirical analysis using a 135M parameter model, we demonstrate that the relationship between learning rate and batch size significantly impacts model performance in a task-dependent manner. Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios. These insights informed the development of SmolTulu, which achieves state-of-the-art performance among sub-2B parameter models on instruction following, scoring 67.7% on IFEval ($\Delta$11%), and mathematical reasoning with 51.6% on GSM8K ($\Delta$3.4%), with an alternate version achieving scoring 57.1% on ARC ($\Delta5.4%$). We release our model, training recipes, and ablation studies to facilitate further research in efficient model alignment, demonstrating that careful adaptation of optimization dynamics can help bridge the capability gap between small and large language models.

我们推出SmolTulu-1.7b-Instruct，本报告将其称为SmolTulu-DPO-1130，这是一款经过指令调整的语言模型，它适应了AllenAI的Tulu 3后训练管道，以强化Huggingface的SmolLM2-1.7B基础模型。我们通过使用1.35亿参数模型的全面实证分析证明，学习率与批次大小之间的关系会以一种任务依赖的方式显著影响模型性能。我们的研究发现了一个明确的划分：推理任务（如ARC和GSM8K）受益于较高的学习率与批次大小的比例，而模式识别任务（如HellaSwag和IFeval）在较低的比例下表现出最佳性能。这些见解为SmolTulu的开发提供了指导，SmolTulu在指令遵循方面达到了小于2B参数模型中的最新技术水平，在IFeval上得分67.7%（提高了11%），在GSM8K数学推理上得分51.6%（提高了3.4%），另一版本在ARC上得分57.1%（提高了5.4%）。我们发布我们的模型、训练配方和消融研究，以促进对高效模型对齐的进一步研究，表明优化动力学的仔细调整有助于缩小小型和大型语言模型之间的能力差距。

论文及项目相关链接

PDF 10 pages, 4 figures, and 13 tables. For the SmolTulu-1.7b-instruct model, see: https://huggingface.co/SultanR/SmolTulu-1.7b-Instruct

Summary

SmolTulu-DPO-1130模型是基于Huggingface的SmolLM2-1.7B基础模型，通过采用AllenAI的Tulu 3后训练管道进行适应，开发了一种指令优化语言模型。研究发现学习率与批次大小之间的关系对模型性能有显著影响，不同任务有不同的最优比率。SmolTulu模型在指令遵循和数学推理任务上取得了卓越性能，并在相关数据集上取得了最新成果。

Key Takeaways

SmolTulu-DPO-1130是基于Huggingface的SmolLM2-1.7B基础模型开发的一种指令优化语言模型。
模型采用了AllenAI的Tulu 3后训练管道进行适应。
学习率与批次大小的关系对模型性能有重要影响，不同任务需要不同的最优比率。
在指令遵循和数学推理任务上，SmolTulu模型取得了卓越性能。

点此查看论文截图

Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training

Authors:Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Ao Luo, Li Yao, Cunjian Chen, Julian McAuley, Hanqian Wu

Current multimodal language models (MLMs) evaluation and training approaches overlook the influence of instruction format, presenting an elephant-in-the-room problem. Previous research deals with this problem by manually crafting instructions, failing to yield significant insights due to limitations in diversity and scalability. In this work, we propose a programmatic instruction template generator capable of producing over 39B unique template combinations by filling randomly sampled positional synonyms into weighted sampled meta templates, enabling us to comprehensively examine the MLM’s performance across diverse instruction templates. Our experiments across eight common MLMs on five benchmark datasets reveal that MLMs have high template sensitivities with at most 29% performance gaps between different templates. We further augment the instruction tuning dataset of LLaVA-1.5 with our template generator and perform instruction tuning on LLaVA-1.5-7B and LLaVA-1.5-13B. Models tuned on our augmented dataset achieve the best overall performance when compared with the same scale MLMs tuned on at most 75 times the scale of our augmented dataset, highlighting the importance of instruction templates in MLM training. The code is available at https://github.com/shijian2001/TemplateMatters .

当前的多模态语言模型（MLM）评估和训练方法忽视了指令格式的影响，这就像一个房间里的大象问题。之前的研究通过手动制定指令来解决这个问题，但由于多样性和可扩展性的限制，未能产生重大见解。在这项工作中，我们提出了一种程序化指令模板生成器，通过向加权采样元模板中填充随机采样位置同义词，能够产生超过3.9万亿种独特的模板组合，从而能够全面考察MLM在不同指令模板下的性能。我们在五个基准数据集上对八个常见MLM进行的实验表明，MLM对模板的敏感性很高，不同模板之间性能差异最大达到29%。我们进一步使用我们的模板生成器增强了LLaVA-1.5的指令调整数据集，并在LLaVA-1.5-7B和LLaVA-1.5-13B上进行了指令调整。与我们增强数据集上调校的相同规模MLM相比，模型在整体性能上达到最佳，这表明在MLM训练中指令模板的重要性。代码可通过https://github.com/shijian2001/TemplateMatters获取。

Summary

本文指出当前的多模态语言模型（MLMs）在评估和训练过程中忽视了指令格式的影响，为此提出了一种程序化指令模板生成器。该生成器通过随机填充加权采样元模板中的位置同义词，能够产生超过39亿种独特的模板组合，从而全面考察MLM在不同指令模板下的性能。实验结果显示，MLMs对模板的敏感性很高，不同模板间性能差异最大可达29%。通过增强LLaVA-1.5的指令微调数据集并使用该模板生成器，模型在LLaVA-1.5-7B和LLaVA-1.5-13B上的性能得到了提升。与在扩大规模的数据集上训练的相同规模的MLMs相比，使用增强数据集训练的模型取得了最佳性能，突显了指令模板在MLM训练中的重要性。

Key Takeaways

当前的多模态语言模型（MLMs）评估和训练忽略了指令格式的影响，这成为了一个需要关注的问题。
提出了一种程序化指令模板生成器，能够产生大量的独特模板组合，以全面考察MLM的性能。
MLMs对模板的敏感性高，不同模板间性能差异较大。
通过增强指令微调数据集并使用提出的模板生成器，模型性能得到了提升。
使用增强数据集训练的模型性能优于在扩大规模的数据集上训练的相同规模的MLMs。
指令模板在MLM训练中具有重要意义。

点此查看论文截图

M2SE: A Multistage Multitask Instruction Tuning Strategy for Unified Sentiment and Emotion Analysis

Authors:Ao Li, Longwei Xu, Chen Ling, Jinghui Zhang, Pengwei Wang

Sentiment analysis and emotion recognition are crucial for applications such as human-computer interaction and depression detection. Traditional unimodal methods often fail to capture the complexity of emotional expressions due to conflicting signals from different modalities. Current Multimodal Large Language Models (MLLMs) also face challenges in detecting subtle facial expressions and addressing a wide range of emotion-related tasks. To tackle these issues, we propose M2SE, a Multistage Multitask Sentiment and Emotion Instruction Tuning Strategy for general-purpose MLLMs. It employs a combined approach to train models on tasks such as multimodal sentiment analysis, emotion recognition, facial expression recognition, emotion reason inference, and emotion cause-pair extraction. We also introduce the Emotion Multitask dataset (EMT), a custom dataset that supports these five tasks. Our model, Emotion Universe (EmoVerse), is built on a basic MLLM framework without modifications, yet it achieves substantial improvements across these tasks when trained with the M2SE strategy. Extensive experiments demonstrate that EmoVerse outperforms existing methods, achieving state-of-the-art results in sentiment and emotion tasks. These results highlight the effectiveness of M2SE in enhancing multimodal emotion perception. The dataset and code are available at https://github.com/xiaoyaoxinyi/M2SE.

情感分析和情绪识别在人机交互和抑郁症检测等应用中至关重要。传统的单模态方法由于来自不同模态的信号相互冲突，往往无法捕捉情绪表达的复杂性。当前的多模态大型语言模型（MLLMs）在检测微妙的面部表情和应对各种情绪相关任务时也面临挑战。为了解决这些问题，我们提出了M2SE，这是一种用于通用MLLMs的多阶段多任务情感和情绪指令调整策略。它采用组合方法，在诸如多模态情感分析、情绪识别、面部表情识别、情感推理推断和情绪因果对提取等任务上训练模型。我们还介绍了情感多任务数据集（EMT），这是一个支持这五个任务的专业数据集。我们的模型“情感宇宙”（EmoVerse）建立在基本的MLLM框架上，无需修改，但在使用M2SE策略进行训练时，这些任务上的表现得到了显著提高。大量实验表明，EmoVerse优于现有方法，在情感和情绪任务中达到最新水平。这些结果凸显了M2SE在提高多模态情感感知方面的有效性。数据集和代码可在https://github.com/xiaoyaoxinyi/M2SE中找到。

论文及项目相关链接

PDF

Summary

本文介绍了情感分析和情绪识别在多模态交互和抑郁症检测等领域的重要性。传统单模态方法无法捕捉情绪表达的复杂性，而现有的多模态大型语言模型（MLLMs）在检测细微面部表情和应对各种情绪相关任务时面临挑战。为此，本文提出了M2SE，一种用于通用MLLMs的多阶段多任务情感与情绪指令调整策略。该策略结合了多种任务训练模型，如多模态情感分析、情绪识别、面部表情识别、情感推理推断和情感因果关系提取等。同时，本文还介绍了支持这五个任务的情感多任务数据集EMT。实验结果表明，基于M2SE策略训练的模型Emotion Universe（EmoVerse）在多个任务上实现了显著的提升，并在情感与情绪任务上取得了最新成果。这证明了M2SE策略在提高多模态情感感知方面的有效性。数据集和代码可在GitHub上获取。

Key Takeaways

情感分析和情绪识别对于人机交互和抑郁症检测等应用至关重要。
传统单模态方法难以捕捉情绪表达的复杂性，需要多模态方法来解决这一问题。
M2SE是一种用于通用MLLMs的多阶段多任务情感与情绪指令调整策略，结合了多种任务训练模型。
M2SE策略包括多任务如多模态情感分析、情绪识别等。
引入的情感多任务数据集EMT支持五个任务，有助于模型训练。
基于M2SE策略训练的模型Emotion Universe（EmoVerse）在多个任务上实现了显著的提升。

点此查看论文截图

SAT: Spatial Aptitude Training for Multimodal Language Models

Authors:Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, Kate Saenko

Spatial perception is a fundamental component of intelligence. While many studies highlight that large multimodal language models (MLMs) struggle to reason about space, they only test for static spatial reasoning, such as categorizing the relative positions of objects. Meanwhile, real-world deployment requires dynamic capabilities like perspective-taking and egocentric action recognition. As a roadmap to improving spatial intelligence, we introduce SAT, Spatial Aptitude Training, which goes beyond static relative object position questions to the more dynamic tasks. SAT contains 218K question-answer pairs for 22K synthetic scenes across a training and testing set. Generated using a photo-realistic physics engine, our dataset can be arbitrarily scaled and easily extended to new actions, scenes, and 3D assets. We find that even MLMs that perform relatively well on static questions struggle to accurately answer dynamic spatial questions. Further, we show that SAT instruction-tuning data improves not only dynamic spatial reasoning on SAT, but also zero-shot performance on existing real-image spatial benchmarks: $23%$ on CVBench, $8%$ on the harder BLINK benchmark, and $18%$ on VSR. When instruction-tuned on SAT, our 13B model matches larger proprietary MLMs like GPT4-V and Gemini-3-1.0 in spatial reasoning. Our data/code is available at http://arijitray1993.github.io/SAT/ .

空间感知是智能的基本组成部分。尽管许多研究强调大型多模态语言模型（MLM）在推理空间方面存在困难，但它们只测试静态空间推理，例如对物体的相对位置进行分类。然而，现实世界的应用需要动态能力，如观点采择和第一人称动作识别。作为提高空间智能的路线图，我们引入了SAT（空间适应能力训练），它超越了静态相对物体位置问题，涵盖了更动态的任务。SAT包含21.8万个问答对，涉及训练和测试集的2.2万个合成场景。使用逼真的物理引擎生成，我们的数据集可以任意扩展，并轻松扩展到新的动作、场景和3D资产。我们发现，即使在静态问题上表现相对较好的MLM也很难准确回答动态空间问题。此外，我们还表明，SAT指令调整数据不仅提高了SAT上的动态空间推理能力，还提高了现有真实图像空间基准的零样本性能：CVBench上提高23%，难度更高的BLINK基准上提高8%，以及VSR上提高18%。当在SAT上进行指令调整时，我们的13B模型在空间推理方面与较大的专有MLM（如GPT4-V和Gemini-3-1.0）相匹配。我们的数据和代码可在http://arijitray1993.github.io/SAT/中找到。

论文及项目相关链接

PDF Project webpage: http://arijitray1993.github.io/SAT/

摘要

空间感知是智能的基本组成部分。许多研究强调大型多模态语言模型（MLMs）在静态空间推理方面存在困难，但仅限于分类物体的相对位置等任务。然而，现实世界的应用需要更动态的能力，如视角和基于自我中心的行动识别等。为了改善空间智能，我们引入了SAT（空间能力训练），它超越了静态相对物体位置问题，涵盖了更动态的任务。SAT包含用于训练的合成场景中的问答对以及测试集的问答对。我们的数据集是使用逼真的物理引擎生成的，可以任意扩展并轻松扩展到新的动作、场景和三维资产。我们发现即使在静态问题上表现良好的MLM也很难准确回答动态空间问题。此外，我们还表明，SAT指令调整数据不仅提高了SAT上的动态空间推理能力，还提高了现有真实图像空间基准测试上的零样本性能：CVBench提高23%，难度更大的BLINK基准测试提高8%，以及VSR提高18%。当在SAT上经过指令调整后，我们的模型能在空间推理方面匹配较大的专有MLM（如GPT-4 V和双子座3-1.0）。我们的数据和代码可在http://arijitray1993.github.io/SAT/获取。

关键见解

一、空间感知是智能的重要组成部分，对于现实世界的应用至关重要。
二、当前研究主要关注静态空间推理任务，例如物体相对位置的分类。然而，实际应用需要模型具备动态能力，如视角采取和基于自我中心的行动识别。这启示我们不仅需要研究静态空间感知，还要关注动态空间感知的研究。
三、引入新的数据集SAT（Spatial Aptitude Training），包含合成场景中的问答对，旨在提高模型的空间智能能力。数据集使用逼真的物理引擎生成，可任意扩展并轻松扩展到新的动作、场景和三维资产。这表明数据集的多样性和可扩展性对于训练模型至关重要。

点此查看论文截图

Scaling Sequential Recommendation Models with Transformers

Authors:Pablo Zivic, Hernan Vazquez, Jorge Sanchez

Modeling user preferences has been mainly addressed by looking at users’ interaction history with the different elements available in the system. Tailoring content to individual preferences based on historical data is the main goal of sequential recommendation. The nature of the problem, as well as the good performance observed across various domains, has motivated the use of the transformer architecture, which has proven effective in leveraging increasingly larger amounts of training data when accompanied by an increase in the number of model parameters. This scaling behavior has brought a great deal of attention, as it provides valuable guidance in the design and training of even larger models. Taking inspiration from the scaling laws observed in training large language models, we explore similar principles for sequential recommendation. We use the full Amazon Product Data dataset, which has only been partially explored in other studies, and reveal scaling behaviors similar to those found in language models. Compute-optimal training is possible but requires a careful analysis of the compute-performance trade-offs specific to the application. We also show that performance scaling translates to downstream tasks by fine-tuning larger pre-trained models on smaller task-specific domains. Our approach and findings provide a strategic roadmap for model training and deployment in real high-dimensional preference spaces, facilitating better training and inference efficiency. We hope this paper bridges the gap between the potential of transformers and the intrinsic complexities of high-dimensional sequential recommendation in real-world recommender systems. Code and models can be found at https://github.com/mercadolibre/srt

通过查看用户与系统中可用元素之间的交互历史来主要解决用户偏好建模的问题。根据历史数据为个体量身定制内容是顺序推荐的主要目标。问题的性质以及在各个领域中观察到的良好性能，激发了使用变压器架构的动机。当模型参数数量增加时，该架构在利用越来越多的训练数据方面证明是有效的。这种扩展行为引起了极大的关注，因为它为设计更大的模型以及训练提供了有价值的指导。从训练大型语言模型中观察到的扩展定律中汲取灵感，我们探索了顺序推荐的类似原则。我们使用完整的亚马逊产品数据集，该数据集在其他研究中仅得到部分探索，并揭示了与语言模型中发现的类似的扩展行为。最优计算训练是可能的，但需要仔细分析特定于应用程序的计算性能权衡。我们还表明，通过在小任务特定领域上微调更大的预训练模型，性能扩展可以转化为下游任务。我们的方法和发现提供了在真实的高维偏好空间中训练模型和部署的战略路线图，有助于提高训练和推理效率。我们希望本文能够架起潜力无限的变压器与现实中复杂的高维顺序推荐系统之间的桥梁。代码和模型可在https://github.com/mercadolibre/srt找到。

论文及项目相关链接

PDF

Summary

本文主要探讨如何通过用户的互动历史进行用户偏好建模，以及如何利用Transformer架构在序列推荐中个性化内容。文章通过利用亚马逊产品数据集，探索了类似语言模型的缩放规律，并展示了性能缩放如何应用于下游任务。此外，文章还强调了计算最优训练的重要性，并指出需要针对特定应用的计算性能权衡进行仔细分析。本文的研究结果和发现为在高维偏好空间中训练和优化模型提供了战略路线图。

Key Takeaways

文章关注通过用户与系统元素的互动历史来建模用户偏好，这是序列推荐的主要目标。
Transformer架构已被证明在利用大量训练数据和提高模型参数数量方面非常有效，已成为该领域的主要研究焦点。
文章利用亚马逊产品数据集，发现了类似语言模型的缩放行为。
计算最优训练对序列推荐至关重要，但需要仔细分析特定应用的计算性能权衡。
更大的预训练模型可以通过微调适应小任务特定领域，展示性能缩放。
文章的研究结果提供了一个在高维偏好空间中训练和优化模型的战略路线图。

点此查看论文截图

Causal World Representation in the GPT Model

Authors:Raanan Y. Rohekar, Yaniv Gurwicz, Sungduk Yu, Vasudev Lal

Are generative pre-trained transformer (GPT) models only trained to predict the next token, or do they implicitly learn a world model from which a sequence is generated one token at a time? We examine this question by deriving a causal interpretation of the attention mechanism in GPT, and suggesting a causal world model that arises from this interpretation. Furthermore, we propose that GPT-models, at inference time, can be utilized for zero-shot causal structure learning for in-distribution sequences. Empirical evaluation is conducted in a controlled synthetic environment using the setup and rules of the Othello board game. A GPT, pre-trained on real-world games played with the intention of winning, is tested on synthetic data that only adheres to the game rules. We find that the GPT model tends to generate next moves that adhere to the game rules for sequences for which the attention mechanism encodes a causal structure with high confidence. In general, in cases for which the GPT model generates moves that do not adhere to the game rules, it also fails to capture any causal structure.

生成预训练Transformer（GPT）模型是否仅经过训练以预测下一个令牌，还是它们是否从世界模型中隐式学习生成序列的方式，即一个令牌接一个令牌地生成序列？我们通过推导GPT中注意力机制的因果解释，并建议由此产生的因果世界模型来探讨这个问题。此外，我们提出GPT模型在推理时间可以用于零射击因果结构学习，适用于内部分布序列。在Othello棋盘游戏的受控合成环境中进行了实证评估。使用在现实世界游戏中预训练的GPT模型（旨在获胜），测试仅遵循游戏规则合成数据。我们发现GPT模型倾向于生成遵循游戏规则的下一步动作序列，这些序列的注意力机制编码了高置信度的因果结构。总体而言，在GPT模型生成不遵守游戏规则的移动情况下，它也未能捕捉到任何因果结构。

论文及项目相关链接

PDF NeurIPS 2024 Workshop on Causality and Large Models (CaLM)

Summary

GPT模型是否仅通过预测下一个令牌进行训练，还是它们会隐式地从一个世界模型中学习并生成序列中的每一个令牌？本文通过推导GPT注意力机制的因果解释来探讨这个问题，并提出了由此产生的因果世界模型。此外，我们认为GPT模型在推理时可用于零射击因果结构学习，适用于内部分布序列。在Othello游戏的受控合成环境中进行了实证评估。一个经过真实世界游戏训练的GPT模型被测试在仅遵循游戏规则合成数据上。我们发现，当注意力机制编码高置信度的因果结构时，GPT模型倾向于生成遵循游戏规则的下一步动作序列。一般来说，当GPT模型生成的行动不遵循游戏规则时，它也无法捕捉到任何因果关系结构。

Key Takeaways

GPT模型不仅学习预测下一个令牌，而且从世界模型中隐式学习生成序列。
通过推导GPT注意力机制的因果解释，提出了一个因果世界模型。
GPT模型在推理阶段可用于零射击因果结构学习，适用于内部分布序列。
在Othello游戏的受控合成环境中进行了实证评估。
GPT模型在遵循游戏规则的合成数据上表现出生成遵循规则的下一步动作的趋势。
当GPT模型无法生成遵循游戏规则的行动时，它也无法捕捉到任何因果关系结构。

点此查看论文截图

HARP: Hesitation-Aware Reframing in Transformer Inference Pass

Authors:Romain Storaï, Seung-won Hwang

This paper aims to improve the performance of large language models by addressing the variable computational demands in inference steps, where some tokens require more computational resources than others. We present HARP, a simple modification to “off-the-shelf” Transformer forward pass. Drawing from hesitation and the framing effect in decision-making, HARP selectively applies additional computation when the model encounters uncertainty during token generation. Our method mimics human cognitive processes by pausing at difficult decision points and reframing inputs for a different perspective. Unlike other approaches, HARP is model-agnostic, training-free, and easy to implement. We thoroughly evaluate our method across various downstream tasks and model sizes, demonstrating performance improvements up to +5.16%. Notably, HARP achieves these gains while maintaining inference times twice faster than beam search. Simple and yet with significant gains, HARP offers a practical solution for enhancing the performance of Transformer-based language models with minimal computational impact.

本文旨在通过解决推理步骤中可变的计算需求来提高大型语言模型的性能，其中一些令牌需要比其他令牌更多的计算资源。我们提出了HARP，这是对“现货”Transformer前向传递的简单修改。它借鉴了决策制定中的犹豫和框架效应，当模型在令牌生成过程中遇到不确定性时，HARP会选择性地应用额外的计算。我们的方法通过在困难的决策点暂停并重新构建输入以获取不同视角来模仿人类的认知过程。与其他方法不同，HARP具有模型无关性、无需训练且易于实现。我们在各种下游任务和模型规模上全面评估了我们的方法，展示了高达+5.16%的性能提升。值得注意的是，HARP在保持推理时间是束搜索的两倍速度的同时实现了这些收益。简单而又能带来显著收益，HARP为在最小计算影响的情况下提高基于Transformer的语言模型性能提供了实用解决方案。

论文及项目相关链接

PDF

Summary
在简化背景下，本文提出一种名为HARP的方法，旨在通过针对推理步骤中的可变计算需求改进大型语言模型的性能。在遇到不确定的令牌生成时，HARP有选择性地应用额外的计算，模仿人类在决策过程中的犹豫和重构思维。此方法无需特殊训练，易于实现，并且在各种下游任务和模型大小上的评估中均表现出性能改进，最高可达+5.16%。同时，HARP保持推理时间是光束搜索的两倍快。

Key Takeaways

HARP旨在通过解决推理步骤中的可变计算需求，提高大型语言模型的性能。
HARP利用犹豫和决策制定中的重构效应，有选择地在模型遇到不确定性时应用额外的计算。
HARP模仿人类的认知过程，在决策点暂停并重新构建输入以获取不同视角。
HARP是一种模型无关、无需额外训练且易于实现的方法。
在各种下游任务和模型大小上评估，HARP显示出性能改进，最高可达+5.16%。
HARP在加快推理速度的同时维持了模型性能的提升。

点此查看论文截图

A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis

Authors:Nimisha Ghosh, Daniele Santoni, Indrajit Saha, Giovanni Felici

In recent times, Transformer-based language models are making quite an impact in the field of natural language processing. As relevant parallels can be drawn between biological sequences and natural languages, the models used in NLP can be easily extended and adapted for various applications in bioinformatics. In this regard, this paper introduces the major developments of Transformer-based models in the recent past in the context of nucleotide sequences. We have reviewed and analysed a large number of application-based papers on this subject, giving evidence of the main characterizing features and to different approaches that may be adopted to customize such powerful computational machines. We have also provided a structured description of the functioning of Transformers, that may enable even first time users to grab the essence of such complex architectures. We believe this review will help the scientific community in understanding the various applications of Transformer-based language models to nucleotide sequences. This work will motivate the readers to build on these methodologies to tackle also various other problems in the field of bioinformatics.

近年来，基于Transformer的语言模型在自然语言处理领域产生了巨大的影响。由于生物序列和自然语言之间存在相似性，自然语言处理中使用的模型可以很容易地扩展和适应生物信息学的各种应用。在这方面，本文介绍了近期在核苷酸序列背景下基于Transformer模型的主要发展。我们回顾并分析了大量关于这一主题的应用论文，提供了主要特征证据和可能采取的不同方法来定制这些强大的计算机。我们还提供了Transformer工作机制的结构化描述，甚至可以让首次接触的用户也能理解这些复杂的架构。我们相信这次回顾将有助于科学界了解基于Transformer的语言模型在核苷酸序列方面的各种应用。这项工作将激励读者基于这些方法来解决生物信息学领域的各种问题。

论文及项目相关链接

PDF

Summary
该论文探讨了基于Transformer的模型在自然语言处理领域的应用，并指出这些模型可以轻易扩展到生物信息学领域。论文详细介绍了Transformer的主要特点和应用，包括其在核苷酸序列分析方面的应用，有助于科学界了解这些模型在生物信息学中的多种应用，并激励读者在此基础上解决其他生物信息学问题。

Key Takeaways

基于Transformer的模型在自然语言处理领域有重大影响。
Transformer模型可轻易扩展到生物信息学领域。
论文详细阐述了Transformer的主要特点和应用。
Transformer模型在核苷酸序列分析方面有着重要应用。
该论文有助于科学界了解Transformer模型在生物信息学中的多种应用。
论文提供的结构化描述使得初学者也能理解Transformer的复杂架构。

点此查看论文截图

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Authors:Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, silvio savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu

With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. These are often prone to hallucinations, licensing issues and the generation process is often hard to scale and interpret. In this work, we present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy. By implementing a suite of 24 single-image, 14 multi-image instruction generators, and a scene graph generation pipeline, we build a scalable, cost-effective system: ProVision which produces diverse question-answer pairs concerning objects, attributes, relations, depth, etc., for any given image. Applied to Visual Genome and DataComp datasets, we generate over 10 million instruction data points, ProVision-10M, and leverage them in both pretraining and instruction tuning stages of MLMs. When adopted in the instruction tuning stage, our single-image instruction data yields up to a 7% improvement on the 2D split and 8% on the 3D split of CVBench, along with a 3% increase in performance on QBench2, RealWorldQA, and MMMU. Our multi-image instruction data leads to an 8% improvement on Mantis-Eval. Incorporation of our data in both pre-training and fine-tuning stages of xGen-MM-4B leads to an averaged improvement of 1.6% across 11 benchmarks.

随着多模态应用的兴起，指令数据对于训练能够理解复杂图像查询的多模态语言模型变得至关重要。现有实践依赖于强大但成本高昂的大型语言模型（LLM）或多模态语言模型（MLM）来生成指令数据。这些方法往往容易出现幻觉、许可问题，且生成过程往往难以扩展和解释。在这项工作中，我们提出了一种程序化方法，它利用场景图作为图像的象征表示和人类编写的程序来系统地合成以视觉为中心的指令数据。我们的方法确保了数据生成过程的可解释性和可控性，在保持事实准确性的同时，实现了高效扩展。通过实施24个单图像、14个多图像指令生成器以及场景图生成管道，我们建立了一个可扩展、成本效益高的系统：ProVision。该系统可针对给定图像生成涉及对象、属性、关系、深度等方面的多样化问答对。在Visual Genome和DataComp数据集上应用时，我们生成了超过1000万的指令数据点——ProVision-10M，并将其用于MLM的预训练和指令调整阶段。在指令调整阶段采用我们的单图像指令数据，在CVBench的2D分割和3D分割上分别提高了7%和8%的性能，同时在QBench2、RealWorldQA和MMMU上的性能提高了3%。我们的多图像指令数据在Mantis-Eval上提高了8%。在xGen-MM-4B的预训练和微调阶段都融入了我们的数据，在11个基准测试上的平均改进了1.6%。

摘要
随着多模态应用的兴起，指令数据对于训练能够理解复杂图像查询的多模态语言模型至关重要。本研究提出了一种程序化方法，利用场景图作为图像的象征性表示和人为编写的程序来系统地合成以视觉为中心指令数据。该方法确保了数据生成的可解释性和可控性，同时保持了事实准确性并实现了高效规模化。通过实施一系列单图像和多图像指令生成器以及场景图生成管道，我们建立了可扩展、成本效益高的系统——ProVision，该系统可为给定图像生成关于对象、属性、关系、深度等的多样化问答对。在Visual Genome和DataComp数据集上应用时，我们生成了超过1亿个指令数据点——ProVision-10M，并将其用于MLM的预训练和指令调整阶段。在指令调整阶段采用我们的单图像指令数据，在CVBench的2D分割和3D分割上分别提高了7%和8%的性能，同时在QBench2、RealWorldQA和MMMU上的性能也提高了3%。我们的多图像指令数据在Mantis-Eval上提高了8%。在xGen-MM-4B的预训练和微调阶段都融入了我们的数据，在11个基准测试中平均提高了1.6%的性能。

关键见解

随着多模态应用的普及，指令数据对于训练能够理解复杂图像查询的多模态语言模型变得至关重要。
现有方法依赖强大的大型语言模型（LLMs）或多模态语言模型（MLMs）来生成指令数据，但存在虚构、许可问题和难以扩展的生成过程。
研究提出了一种新的程序化方法，使用场景图和人为编写的程序来合成以视觉为中心的指令数据，确保了数据生成的可解释性和可控性。
该方法实现了高效规模化并保持了事实准确性。通过实施多个单图像和多图像指令生成器及场景图生成管道，建立了可扩展的系统——ProVision。
ProVision能够为任何给定图像生成多样化的问答对，涵盖了对象、属性、关系、深度等。
在多个数据集上应用ProVision生成的指令数据，取得了显著的性能提升。在多个基准测试中的平均性能提高了1.6%。
ProVision系统的成本效益高，可为训练多模态语言模型提供有价值的资源。

点此查看论文截图

Understanding Factual Recall in Transformers via Associative Memories

Authors:Eshaan Nichani, Jason D. Lee, Alberto Bietti

Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.

大型语言模型展现出了令人印象深刻的执行事实性回忆的能力。先前的研究发现，在事实性回忆任务上训练的转换器储存信息的速率与其参数数量成正比。在我们的研究中，我们展示了浅层转换器可以通过结合关联记忆来获得接近最优的存储容量。我们首先证明，无论是线性关联记忆还是MLP关联记忆，其存储容量都随着参数数量的增加而线性扩展。接下来，我们引入了一个合成的事实性回忆任务，并证明了一个带有单层自注意力机制后跟MLP的转换器在自注意力参数或MLP参数总数线性增加时（对数因素除外）可以达到任务上的百分之百准确率。特别是，转换器可以在使用值矩阵或MLP作为关联记忆来存储事实数据集之间进行权衡。我们通过这些表达力分析的结果，补充了在我们的事实性回忆任务上训练的简化线性注意力模型的梯度流轨迹分析，该分析表明模型表现出序列学习行为。

论文及项目相关链接

PDF

Summary

大型语言模型展现出强大的事实记忆能力。本研究证明浅层变压器可通过结合关联记忆达到近乎最优的存储能力。我们证明了线性与MLP关联记忆的存储能力与参数数量呈线性关系。通过合成的事实记忆任务，我们发现单层自注意力后接MLP的变压器，在自注意力参数或MLP参数数量与事实数量呈线性关系（对数因素内）时，可达到100%准确率。特别是，变压器可在使用值矩阵或MLP作为关联记忆来存储事实数据集之间进行权衡。我们通过这些表达力结果，补充了简化线性注意力模型在事实记忆任务上梯度流轨迹的分析，显示模型展现出序贯学习行为。

Key Takeaways

大型语言模型具备强大的事实记忆能力。
浅层变压器结合关联记忆可达到近乎最优的存储能力。
线性与MLP关联记忆的存储能力与参数数量呈线性关系。
变压器在特定条件下可达到100%的事实记忆任务准确率。
变压器可在使用值矩阵与MLP作为关联记忆之间进行权衡。
简化线性注意力模型在事实记忆任务上展现出序贯学习行为。

点此查看论文截图

LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations

Authors:Mingjie Xu, Mengyang Wu, Yuzhi Zhao, Jason Chun Lok Li, Weifeng Ou

Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs’ inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that LLaVA-SpaceSGG outperforms other open-vocabulary SGG methods, boosting recall by 8.6% and mean recall by 28.4% compared to the baseline. Our codebase, dataset, and trained models are publicly accessible on GitHub at the following URL: https://github.com/Endlinc/LLaVA-SpaceSGG.

场景图生成（SGG）将视觉场景转换为结构化的图表示，为复杂的视觉任务提供更深的场景理解。然而，现有的SGG模型经常忽略重要的空间关系，并且在开放词汇表的背景下泛化能力较弱。为了解决这些局限性，我们提出了LLaVA-SpaceSGG，这是一个为开放词汇表SGG设计的多模式大型语言模型（MLLM），具有增强的空间关系建模。为了训练它，我们收集了SGG指令调整数据集，命名为SpaceSGG。该数据集是通过结合公开数据集并使用我们的数据构建管道中的开源模型合成数据而构建的。它结合了目标位置、目标关系和深度信息，形成三种数据格式：空间SGG描述、问答和对话。为了提高MLLMs的内在能力向SGG任务的转移，我们引入了两阶段训练模式。实验表明，LLaVA-SpaceSGG在开放词汇表SGG方法上的表现优于其他方法，与基线相比，召回率提高了8.6%，平均召回率提高了28.4%。我们的代码库、数据集和训练模型可在以下GitHub链接上公开访问：https://github.com/Endlinc/LLaVA-SpaceSGG。

论文及项目相关链接

PDF Accepted by the WACV 2025, including supplementary material

Summary
空间场景图生成（SGG）是深度场景理解的重要技术，可将复杂视觉任务转换为结构化图形表示。然而，现有SGG模型在开放词汇环境中存在空间关系建模不足和泛化能力受限的问题。为此，我们提出了LLaVA-SpaceSGG模型，一种适用于开放词汇SGG的多模态大型语言模型（MLLM）。我们收集并构建了名为SpaceSGG的SGG指令微调数据集，用于训练模型。实验表明，LLaVA-SpaceSGG相较于其他开放词汇SGG方法表现出更高的性能，召回率提高了8.6%，平均召回率提高了28.4%。我们的代码库、数据集和训练模型已在GitHub上公开访问。

Key Takeaways

SGG技术能够将复杂视觉任务转换为结构化图形表示，提高场景理解的深度。
当前SGG模型存在忽视关键空间关系和泛化能力不足的问题。
LLaVA-SpaceSGG是一种基于多模态大型语言模型（MLLM）的开放词汇SGG模型。
为训练LLaVA-SpaceSGG模型，我们构建了名为SpaceSGG的SGG指令微调数据集。
SpaceSGG数据集结合了目标位置、目标关系和深度信息，包含三种数据格式。
采用两阶段训练范式增强MLLM向SGG任务的知识迁移能力。

点此查看论文截图

Authors:Qifan Yu, Zhebei Shen, Zhongqi Yue, Yang Wu, Wenqiao Zhang, Yunfei Li, Juncheng Li, Siliang Tang, Yueting Zhuang

Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to handle real-world tasks. However, the rapid expansion of visual instruction datasets introduces data redundancy, leading to excessive computational costs. We propose a collaborative framework, DataTailor, which leverages three key principles–informativeness, uniqueness, and representativeness–for effective data selection. We argue that a valuable sample should be informative of the task, non-redundant, and represent the sample distribution (i.e., not an outlier). We further propose practical ways to score against each principle, which automatically adapts to a given dataset without tedious hyperparameter tuning. Comprehensive experiments on various benchmarks demonstrate that DataTailor achieves 100.8% of the performance of full-data fine-tuning with only 15% of the data, significantly reducing computational costs while maintaining superior results. This exemplifies the “Less is More” philosophy in MLLM development.

指令调整微调预先训练好的多模态大型语言模型（MLLMs）来处理真实任务。然而，视觉指令数据集的快速扩张导致了数据冗余，从而产生了过高的计算成本。我们提出了一种协作框架DataTailor，它利用三个关键原则——信息性、唯一性和代表性——进行有效的数据选择。我们认为，有价值的样本应该具有任务的信息性、非冗余性，并代表样本分布（即非异常值）。我们进一步提出了针对每项原则的实际评分方法，该方法可自动适应给定的数据集，无需繁琐的超参数调整。在各种基准测试上的综合实验表明，DataTailor在仅使用15%数据的情况下，实现了全数据微调性能的100.8%，在降低计算成本的同时保持了卓越的结果。这体现了MLLM发展中的“少即是多”理念。

论文及项目相关链接

PDF 14 pages, 7 figures

Summary
指令微调对预训练的多模态大型语言模型（MLLMs）进行微调以处理现实世界任务。随着视觉指令数据集的快速扩展，数据冗余问题愈发严重，导致计算成本过高。为此，我们提出了一个协作框架DataTailor，该框架利用信息性、唯一性和代表性三个关键原则进行高效的数据选择。我们通过实用的评分方式对这三个原则进行衡量，自动适应给定的数据集，无需繁琐的超参数调整。在多个基准测试上的综合实验表明，DataTailor在仅使用15%数据的情况下，实现了全数据微调性能的100.8%，显著降低了计算成本，同时保持了卓越的结果，这体现了MLLM发展中的“少即是多”理念。

Key Takeaways

指令微调是优化预训练多模态大型语言模型（MLLMs）处理现实任务的方法。
数据冗余问题随着视觉指令数据集的扩大而加剧，导致计算成本上升。
DataTailor框架通过信息性、唯一性和代表性三个原则进行高效数据选择。
有效的样本应具有任务信息性、非冗余性，并能代表样本分布。
DataTailor通过实用的评分方式自动适应数据集，避免繁琐的超参数调整。
在多个基准测试上，DataTailor在仅使用少量数据的情况下取得了令人瞩目的性能。

点此查看论文截图

S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Authors:Xinyu Yang, Jixuan Leng, Geyang Guo, Jiawei Zhao, Ryumei Nakada, Linjun Zhang, Huaxiu Yao, Beidi Chen

Current PEFT methods for LLMs can achieve either high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Utilizing this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by “selecting sparsely and computing densely”. It selects a few heads and channels in the MHA and FFN modules for each Transformer block, respectively. Next, it co-permutes weight matrices on both sides of the coupled structures in LLMs to connect the selected components in each layer into a dense submatrix. Finally, S$^{2}$FT performs in-place gradient updates on all submatrices. Through theoretical analysis and empirical results, our method prevents overfitting and forgetting, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements compared to LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, S$^{2}$FT saves training memory up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT, while delivering an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S$^{2}$FT can be decoupled into adapters, enabling effective fusion, fast switch, and efficient parallelism for serving multiple fine-tuned models.

当前针对LLM的PEFT方法只能同时实现高质量、高效训练或可扩展的服务，但不能三者兼顾。为了解决这个问题，我们研究了稀疏微调技术，并观察到其显著提高了泛化能力。利用这一关键见解，我们针对LLM提出了一系列结构化稀疏微调（S$^{2}$FT）方法，这些方法在微调性能、训练效率和推理可扩展性方面均达到了业界最佳水平。S$^{2}$FT通过“稀疏选择、密集计算”来实现这一目标。它分别选择MHA和FFN模块中每个Transformer块的几个头和通道。接下来，它对LLM中耦合结构两侧的权重矩阵进行共排列，将每一层中选择的组件连接成密集的子矩阵。最后，S$^{2}$FT对所有子矩阵进行就地梯度更新。通过理论分析和实验结果，我们的方法可以防止过拟合和遗忘，在常识推理和算术推理方面都达到了最先进的性能，与LoRA相比平均提高了4.6%和1.3%，在指令调优后推广到不同领域时，比全量FT高出11.5%。通过使用我们的部分反向传播算法，S$^{2}$FT在训练内存方面节省了高达3倍，与全量FT相比，延迟提高了1.5-2.7倍，同时在两个指标上都平均比LoRA提高了10%。我们进一步证明，S$^{2}$FT中的权重更新可以被解耦为适配器，为实现多个微调模型的有效融合、快速切换和高效并行提供服务。

论文及项目相关链接

PDF

摘要

当前针对LLM的PEFT方法往往只能同时满足高质量、高效训练和可伸缩服务中的两项，无法实现三者兼顾。为解决这一局限，本研究探讨了稀疏微调技术并观察到其显著提高泛化能力。基于此关键发现，我们提出了针对LLM的结构化稀疏微调（Structured Sparse Fine-Tuning，S$^{2}$FT）方法系列，同时实现了最先进的微调性能、训练效率和推理可伸缩性。S$^{2}$FT通过“稀疏选择、密集计算”实现这一目标。它选择Transformer块中MHA和FFN模块的少数头部和通道，对LLM中耦合结构的两侧权重矩阵进行共置换，将每一层中选定的组件连接成密集子矩阵。最后，S$^{2}$FT对所有子矩阵执行原地梯度更新。通过理论分析和实证结果，本研究方法可以防止过拟合和遗忘，在常识和算术推理方面表现出卓越性能，与LoRA相比平均提高了4.6%和1.3%，在指令调优后推广到不同领域时比全量微调高出11.5%。通过使用部分反向传播算法，S$^{2}$FT在训练内存方面节省了高达3倍，与全量微调相比延迟提高了1.5-2.7倍，同时在两个指标上平均比LoRA提高了10%。此外，本研究进一步证明S$^{2}$FT中的权重更新可以解耦为适配器，为融合多个微调模型提供了有效、快速切换和高效并行处理的能力。

要点归纳

当前PEFT方法在LLM上难以同时实现高质量、高效训练和可伸缩服务三者。
研究通过稀疏微调技术观察到显著泛化能力改进。
提出结构化稀疏微调（S$^{2}$FT）方法，结合稀疏选择与密集计算策略。
S$^{2}$FT选择Transformer块中的关键组件并进行权重矩阵置换，形成密集子矩阵。
S$^{2}$FT通过原地梯度更新优化性能，减少训练内存使用并提高推理速度。
S$^{2}$FT在常识和算术推理方面表现出卓越性能，相比LoRA有平均10%的提升。

点此查看论文截图

The BrowserGym Ecosystem for Web Agent Research

Authors:Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As a supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI and Anthropic’s latests models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

BrowserGym生态系统满足了日益增长的对于高效评估和基准测试web代理的需求，特别是那些利用自动化和大型语言模型（LLM）进行web交互任务的代理。许多现有的基准测试受到碎片化以及评估方法不一致的影响，使得实现可靠的对比和可重复的结果具有挑战性。BrowserGym旨在通过提供一个统一的、类似于gym的环境来解决这个问题，该环境具有明确定义的观察空间和动作空间，促进各种基准测试的标准化评估。结合AgentLab这一互补框架，它有助于代理的创建、测试和分析，BrowserGym提供了集成新基准测试的灵活性，同时确保一致的评估和综合的实验管理。这种标准化方法旨在减少开发web代理的时间和复杂性，支持更可靠的对比，促进对代理行为的深入分析，并可能产生更适应、更强大的代理，最终加速LLM驱动的自动化创新。作为支持证据，我们进行了首个大规模、多基准的web代理实验，比较了当前BrowserGym中所有可用基准测试上6个最新前沿LLM的性能。我们的研究结果发现，除与视觉相关的任务外，GPT-4o表现优越，OpenAI和Anthropic的最新模型之间存在很大的差异，Claude-3.5-Sonnet几乎在所有基准测试中均领先。尽管取得了这些进展，我们的结果强调，由于现实世界的web环境的固有复杂性和当前模型的局限性，构建稳健高效的web代理仍然是一个巨大的挑战。

论文及项目相关链接

PDF

Summary

BrowserGym生态系统解决了对高效评估和基准测试网络代理的日益增长的需求，特别是那些利用自动化和大语言模型（LLM）进行网络交互任务的项目。BrowserGym旨在通过提供一个统一的、类似于gym的环境，以及定义良好的观测和行动空间，解决现有基准测试中的碎片化和不一致评估方法的问题，从而推动标准化评估在多样化基准测试中的应用。结合AgentLab框架，一个辅助创建、测试和分析代理的框架，BrowserGym提供了灵活集成新基准测试的能力，同时确保一致的评估和综合的实验管理。这种标准化方法旨在减少开发网络代理的时间和复杂性，支持更可靠的对比和对代理行为的深入分析，可以产生更灵活、能力更强的代理，最终加速LLM驱动的自动化创新。我们以大型、跨多个基准测试的网络代理实验作为支持证据，比较了目前BrowserGym中所有基准测试上六个最先进LLM的性能。结果表明OpenAI和Antropic的最新模型之间存在显著差异，Claude-3.5-Sonnet在几乎所有基准测试中领先，但在视觉相关任务上GPT-4o表现更优。尽管取得了进展，但我们的结果强调，由于真实世界网络环境的固有复杂性和当前模型的局限性，构建健壮且高效的Web代理仍然是一个重大挑战。

Key Takeaways

BrowserGym生态系统解决了现有网络代理评估和基准测试的碎片化问题。
它提供了一个统一的评估环境，促进了标准化评估方法的实施。
AgentLab框架与BrowserGym结合，有助于创建、测试和分析代理。
通过实验验证了LLM在多个基准测试上的性能差异。
Claude-3.5-Sonnet在大多数基准测试中表现优异，但在视觉任务上GPT-4o表现更好。
构建健壮且高效的Web代理仍然面临重大挑战，主要因为真实网络环境的复杂性和模型的局限性。

点此查看论文截图

VoiceBench: Benchmarking LLM-Based Voice Assistants

Authors:Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li

Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress of LLM-based voice assistants development. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speeches, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.

基于大型语言模型（LLM）的成功，最近的进展，如GPT-4o，已经能够通过基于LLM的语音助手实现实时语音交互，与传统的基于文本的交互相比，为用户提供了显著改善的体验。然而，缺乏用于评估这些语音交互能力的基准测试阻碍了基于LLM的语音助手的发展。当前的评估主要集中在自动语音识别（ASR）或清洁语音的一般知识评估上，忽视了涉及多种说话人特征、环境和内容因素的更复杂、真实世界的场景。为了解决这一问题，我们引入了VoiceBench，这是第一个旨在提供基于LLM的语音助手的多元化评估的基准测试。VoiceBench还包括真实和合成的语音指令，这些指令融入了上述三个关键的真实世界变化因素。广泛的实验揭示了当前基于LLM的语音助手模型的局限性，并为该领域的未来研究和开发提供了宝贵的见解。

论文及项目相关链接

PDF Work in progress. Data is available at https://github.com/MatthewCYM/VoiceBench

Summary

基于大型语言模型（LLM）的成功，GPT-4o等最新进展使得通过LLM语音助手进行实时语音交互成为可能，相比传统的文本交互方式提供了显著的用户体验改进。然而，缺乏评估这些语音交互能力的基准测试阻碍了LLM语音助手的发展。为解决这个问题，我们推出了VoiceBench基准测试，这是首个全面评估LLM语音助手的多方面评价工具。VoiceBench包含真实和合成语音指令，涵盖了现实世界中的关键变化因素。大量实验揭示了当前LLM语音助手的局限性，为今后的研究和发展提供了宝贵见解。

Key Takeaways

LLMs的进步推动了实时语音交互的发展，通过LLM语音助手实现。
GPT-4o等最新技术显著提升了用户体验。
目前缺乏评估LLM语音助手交互能力的基准测试。
VoiceBench是首个全面评估LLM语音助手的基准测试。
VoiceBench包含真实和合成的语音指令，模拟现实世界的复杂场景。
现有LLM语音助手存在局限性。

点此查看论文截图

ChatGPT and biometrics: an assessment of face recognition, gender detection, and age estimation capabilities

Authors:Ahmad Hassanpour, Yasamin Kowsari, Hatef Otroshi Shahreza, Bian Yang, Sebastien Marcel

This paper explores the application of large language models (LLMs), like ChatGPT, for biometric tasks. We specifically examine the capabilities of ChatGPT in performing biometric-related tasks, with an emphasis on face recognition, gender detection, and age estimation. Since biometrics are considered as sensitive information, ChatGPT avoids answering direct prompts, and thus we crafted a prompting strategy to bypass its safeguard and evaluate the capabilities for biometrics tasks. Our study reveals that ChatGPT recognizes facial identities and differentiates between two facial images with considerable accuracy. Additionally, experimental results demonstrate remarkable performance in gender detection and reasonable accuracy for the age estimation tasks. Our findings shed light on the promising potentials in the application of LLMs and foundation models for biometrics.

本文探讨了大型语言模型（LLM）如ChatGPT在生物识别任务中的应用。我们特别研究了ChatGPT在执行与生物识别相关的任务时的能力，重点是人脸识别、性别检测和年龄估计。由于生物识别被视为敏感信息，ChatGPT避免回答直接提示，因此我们制定了一种提示策略来绕过其安全措施并评估其生物识别任务的能力。我们的研究表明，ChatGPT能够识别面部身份并准确区分两张面部图像。此外，实验结果表明，在性别检测方面表现出显著性能，年龄估计任务的准确性也合理。我们的研究为LLM和基础模型在生物识别领域的应用提供了广阔的前景。

论文及项目相关链接

PDF Published as a conference paper at IEEE International Conference on Image Processing (ICIP) 2024

Summary

本文探讨了大型语言模型（LLMs）如ChatGPT在生物识别任务中的应用。文章重点研究了ChatGPT在人脸识别、性别识别和年龄估计等生物识别相关任务的能力。由于生物识别被视为敏感信息，ChatGPT避免回答直接提示，因此研究团队设计了一种提示策略来绕过其安全保护机制以评估其在生物识别任务中的能力。研究表明，ChatGPT能够准确识别面部身份并区分不同的面部图像。此外，实验结果显示其在性别识别方面表现出卓越的性能，而在年龄估计任务中达到合理的准确性。本文的发现揭示了LLMs和基础模型在生物识别领域应用中的巨大潜力。

Key Takeaways

该研究探讨了大型语言模型（LLMs）在生物识别任务中的应用，特别是人脸识别、性别识别和年龄估计等方面。
ChatGPT能够准确识别面部身份并区分不同的面部图像。
实验结果显示ChatGPT在性别识别方面表现出卓越的性能。
在年龄估计任务中，ChatGPT达到合理的准确性。
由于生物识别信息的敏感性，研究团队需要设计巧妙的提示策略来评估ChatGPT在生物识别任务中的能力。
文章指出LLMs和基础模型在生物识别领域具有巨大的应用潜力。

点此查看论文截图

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2024-12-12/LLM/

本博客所有文章除特別声明外，均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !

LLM

Vision Transformer

Vision Transformer 方向最新论文已更新，请持续关注 Update in 2024-12-12 EOV-Seg Efficient Open-Vocabulary Panoptic Segmentation

2024-12-12 Vision Transformer

Vision Transformer

TTS

TTS 方向最新论文已更新，请持续关注 Update in 2024-12-11 Towards Controllable Speech Synthesis in the Era of Large Language Models A Survey

2024-12-11 TTS

TTS

LLM

2024-12-12 更新

Generative Semantic Communication: Architectures, Technologies, and Applications

Fast Prompt Alignment for Text-to-Image Generation

Multimodal Latent Language Modeling with Next-Token Diffusion

Synthetic Vision: Training Vision-Language Models to Understand Physics

Exploiting the Index Gradients for Optimization-Based Jailbreaking on Large Language Models

Preference Discerning with LLM-Enhanced Generative Retrieval

Der Effizienz- und Intelligenzbegriff in der Lexikographie und kuenstlichen Intelligenz: kann ChatGPT die lexikographische Textsorte nachbilden?

Leveraging Graph-RAG and Prompt Engineering to Enhance LLM-Based Automated Requirement Traceability and Compliance Checks

Advancing Single- and Multi-task Text Classification through Large Language Model Fine-tuning

TURBOATTENTION: Efficient Attention Approximation For High Throughputs LLMs

Can We Generate Visual Programs Without Prompting LLMs?

Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning

MaestroMotif: Skill Design from Artificial Intelligence Feedback

EMS: Adaptive Evict-then-Merge Strategy for Head-wise KV Cache Compression Based on Global-Local Importance

Bridging Relevance and Reasoning: Rationale Distillation in Retrieval-Augmented Generation

POINTS1.5: Building a Vision-Language Model towards Real World Applications

The Roles of English in Evaluating Multilingual Language Models

SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs

Template Matters: Understanding the Role of Instruction Templates in Multimodal Language Model Evaluation and Training

M2SE: A Multistage Multitask Instruction Tuning Strategy for Unified Sentiment and Emotion Analysis

SAT: Spatial Aptitude Training for Multimodal Language Models

Scaling Sequential Recommendation Models with Transformers

Causal World Representation in the GPT Model

HARP: Hesitation-Aware Reframing in Transformer Inference Pass

A Review on the Applications of Transformer-based language models for Nucleotide Sequence Analysis

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Understanding Factual Recall in Transformers via Associative Memories

LLaVA-SpaceSGG: Visual Instruct Tuning for Open-vocabulary Scene Graph Generation with Enhanced Spatial Relations

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

S$^{2}$FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

The BrowserGym Ecosystem for Web Agent Research

VoiceBench: Benchmarking LLM-Based Voice Assistants

ChatGPT and biometrics: an assessment of face recognition, gender detection, and age estimation capabilities