LLM

发布日期: 2025-01-16

更新日期: 2025-01-16

文章字数: 20.5k

阅读时长: 84 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-01-16 更新

PokerBench: Training Large Language Models to become Professional Poker Players

Authors:Richard Zhuang, Akshat Gupta, Richard Yang, Aniket Rahane, Zhengyu Li, Gopala Anumanchipalli

We introduce PokerBench - a benchmark for evaluating the poker-playing abilities of large language models (LLMs). As LLMs excel in traditional NLP tasks, their application to complex, strategic games like poker poses a new challenge. Poker, an incomplete information game, demands a multitude of skills such as mathematics, reasoning, planning, strategy, and a deep understanding of game theory and human psychology. This makes Poker the ideal next frontier for large language models. PokerBench consists of a comprehensive compilation of 11,000 most important scenarios, split between pre-flop and post-flop play, developed in collaboration with trained poker players. We evaluate prominent models including GPT-4, ChatGPT 3.5, and various Llama and Gemma series models, finding that all state-of-the-art LLMs underperform in playing optimal poker. However, after fine-tuning, these models show marked improvements. We validate PokerBench by having models with different scores compete with each other, demonstrating that higher scores on PokerBench lead to higher win rates in actual poker games. Through gameplay between our fine-tuned model and GPT-4, we also identify limitations of simple supervised fine-tuning for learning optimal playing strategy, suggesting the need for more advanced methodologies for effectively training language models to excel in games. PokerBench thus presents a unique benchmark for a quick and reliable evaluation of the poker-playing ability of LLMs as well as a comprehensive benchmark to study the progress of LLMs in complex game-playing scenarios. The dataset and code will be made available at: \url{https://github.com/pokerllm/pokerbench}.

我们介绍了PokerBench——一个用于评估大型语言模型（LLM）扑克牌技能的基准测试。尽管大型语言模型在传统自然语言处理任务中表现出色，但它们在复杂战略游戏如扑克中的应用构成了一项新挑战。扑克是一种信息不完全的游戏，需要多种技能，如数学、推理、规划、战略，以及对博弈论和人类心理学的深刻理解。这使得扑克成为大型语言模型的理想下一个前沿领域。PokerBench由最重要11000个场景的全面汇编组成，这些场景分为开拍和闭拍两种，是与训练过的扑克玩家共同开发的。我们评估了包括GPT-4、ChatGPT 3.5以及各种Llama和Gemma系列模型在内的突出模型，发现所有最新的大型语言模型在扑克游戏中的表现都不佳。然而，经过微调后，这些模型表现出显著改善。我们通过让不同得分的模型相互竞争来验证PokerBench，结果表明在PokerBench上得分较高的模型在实际的扑克游戏中的胜率也较高。通过我们微调后的模型与GPT-4之间的游戏，我们还发现了简单监督微调在学习最优策略方面的局限性，这提示我们需要更先进的方法有效地训练语言模型，使其在游戏中表现出色。因此，PokerBench提供了一个独特、快速可靠的基准测试来评估大型语言模型的扑克牌能力，以及一个全面的基准来研究大型语言模型在复杂游戏场景中的进展。数据集和代码将在以下网址提供：https://github.com/pokerllm/pokerbench。

论文及项目相关链接

PDF AAAI 2025

Summary

本文介绍了PokerBench——一个用于评估大型语言模型（LLM）扑克牌技能的基准测试。文章指出大型语言模型在扑克牌这类复杂战略游戏中的表现面临新的挑战。PokerBench包含了与训练有素的扑克玩家合作开发的最为关键的1.1万个场景。研究发现，即使在fine-tuning之后，先进的大型语言模型在扑克游戏中的表现仍然不理想。通过不同得分模型的竞争验证，以及精细调整模型与GPT-4的对战，揭示了简单监督fine-tuning在学习最佳策略方面的局限性。因此，PokerBench成为评估大型语言模型扑克牌能力的独特基准测试，并为复杂游戏场景中大型语言模型的发展进步提供了综合评估手段。数据集和代码已发布在指定链接上。

Key Takeaways

介绍了PokerBench作为评估大型语言模型扑克牌技能的基准测试。
大型语言模型在扑克牌这类复杂战略游戏中的表现面临新的挑战。
PokerBench包含与训练有素的扑克玩家合作的1.1万个关键场景。
当前先进的大型语言模型在扑克游戏中的表现不理想，但fine-tuning后有所改善。
通过不同得分模型的竞争验证，表明高分的PokerBench模型在真实扑克游戏中的胜率更高。
与GPT-4的对战揭示了简单监督fine-tuning在学习最佳策略方面的局限性。

Cool Papers

点此查看论文截图

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Authors:Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma

We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID-300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.

我们介绍了Omni-RGPT，这是一种多模态大型语言模型，旨在促进图像和视频的区域级理解。为了实现时空维度上的一致区域表示，我们引入了Token Mark，这是一组突出显示视觉特征空间内目标区域的令牌。这些令牌直接使用区域提示（例如，盒子或蒙版）嵌入到空间区域中，并同时纳入文本提示以指定目标，从而在视觉令牌和文本令牌之间建立直接联系。为了进一步支持不需要轨迹的视频理解，我们通过一个辅助任务来引导Token Mark，利用令牌的连续性，实现在视频中的稳定区域解释。此外，我们还引入了大规模的区域级视频指令数据集（RegVID-300k）。Omni-RGPT在图像和基于视频的常识推理基准测试上达到了最新水平，同时在描述和指代表达理解任务中表现出强大的性能。

Summary

Omni-RGPT是一款多模态大型语言模型，用于实现图像和视频的区域级理解。通过引入Token Mark，实现在时空维度上的区域一致性表示。Token Mark通过区域提示（如盒子或掩膜）直接在空间区域内嵌入标记，并通过文本提示指定目标，建立视觉和文本标记之间的直接联系。为支持无需轨迹的视频理解，通过利用标记的一致性，引入辅助任务引导Token Mark，实现视频中的区域稳定解读。此外，还引入了大规模的区域级视频指令数据集RegVID-300k。Omni-RGPT在图像和视频常识推理基准测试上取得了最新结果，同时在描述和指代表达理解任务中表现出强大的性能。

Key Takeaways

Omni-RGPT是一个多模态大型语言模型，旨在实现图像和视频的区域级理解。
Token Mark是引入的一组标记，用于在视觉特征空间中突出目标区域。
通过区域提示和文本提示，Token Mark建立了视觉和文本标记之间的直接联系。
引入辅助任务，利用标记的一致性，支持稳定的区域解读，无需轨迹即可理解视频。
引入了大规模区域级视频指令数据集RegVID-300k。
Omni-RGPT在图像和视频常识推理基准测试中取得了先进结果。

Cool Papers

点此查看论文截图

Enhancing Automated Interpretability with Output-Centric Feature Descriptions

Authors:Yoav Gur-Arieh, Roy Mayan, Chen Agassy, Atticus Geiger, Mor Geva

Automated interpretability pipelines generate natural language descriptions for the concepts represented by features in large language models (LLMs), such as plants or the first word in a sentence. These descriptions are derived using inputs that activate the feature, which may be a dimension or a direction in the model’s representation space. However, identifying activating inputs is costly, and the mechanistic role of a feature in model behavior is determined both by how inputs cause a feature to activate and by how feature activation affects outputs. Using steering evaluations, we reveal that current pipelines provide descriptions that fail to capture the causal effect of the feature on outputs. To fix this, we propose efficient, output-centric methods for automatically generating feature descriptions. These methods use the tokens weighted higher after feature stimulation or the highest weight tokens after applying the vocabulary “unembedding” head directly to the feature. Our output-centric descriptions better capture the causal effect of a feature on model outputs than input-centric descriptions, but combining the two leads to the best performance on both input and output evaluations. Lastly, we show that output-centric descriptions can be used to find inputs that activate features previously thought to be “dead”.

自动化解释管道为大语言模型（LLM）中的特征（如植物或句子的第一个词）生成自然语言描述。这些描述是通过激活特征的输入得出的，这可能是模型表示空间中的一个维度或方向。然而，确定激活输入成本很高，而特征在模型行为中的机械作用取决于输入如何导致特征被激活以及特征激活如何影响输出。通过转向评估，我们发现当前的管道提供的描述未能捕捉到特征对输出的因果效应。为了解决这个问题，我们提出了高效、以输出为中心的方法，用于自动生成特征描述。这些方法使用特征刺激后权重更高的令牌，或直接应用词汇“unembedding”头到特征以获得最高权重的令牌。我们的以输出为中心的描述比以输入为中心的描述更能捕捉特征对模型输出的因果效应，但两者相结合在输入和输出评估方面的表现都最佳。最后，我们证明了以输出为中心的描述可以用于找到以前被认为是“死”的特征的激活输入。

论文及项目相关链接

PDF

Summary
自动解释性管道为大语言模型（LLM）中的概念生成自然语言描述，这些描述揭示了模型的运作方式。然而，现有的解释性管道难以捕捉特征对输出的因果效应。为解决这一问题，研究者提出了基于输出的特征描述方法，这种方法能够更好地揭示特征对模型输出的影响。结合输入和输出的描述能取得最佳效果，并可用来寻找激活那些被认为不活跃的模型特征。总体来说，本文强调了自动化解释性管道对理解大语言模型的重要性。

Key Takeaways

自动解释性管道为大语言模型中的概念生成自然语言描述。
描述是基于输入激活的特征生成的，但这种方法成本较高。
描述需要捕捉特征对输出的因果效应，现有方法难以做到。
输出为中心的特征描述方法能更好地揭示特征对模型输出的影响。
结合输入和输出的描述可以提高解释的准确性。

Cool Papers

点此查看论文截图

HALoGEN: Fantastic LLM Hallucinations and Where to Find Them

Authors:Abhilasha Ravichander, Shrusti Ghela, David Wadden, Yejin Choi

Despite their impressive ability to generate high-quality and fluent text, generative large language models (LLMs) also produce hallucinations: statements that are misaligned with established world knowledge or provided input context. However, measuring hallucination can be challenging, as having humans verify model generations on-the-fly is both expensive and time-consuming. In this work, we release HALoGEN, a comprehensive hallucination benchmark consisting of: (1) 10,923 prompts for generative models spanning nine domains including programming, scientific attribution, and summarization, and (2) automatic high-precision verifiers for each use case that decompose LLM generations into atomic units, and verify each unit against a high-quality knowledge source. We use this framework to evaluate ~150,000 generations from 14 language models, finding that even the best-performing models are riddled with hallucinations (sometimes up to 86% of generated atomic facts depending on the domain). We further define a novel error classification for LLM hallucinations based on whether they likely stem from incorrect recollection of training data (Type A errors), or incorrect knowledge in training data (Type B errors), or are fabrication (Type C errors). We hope our framework provides a foundation to enable the principled study of why generative models hallucinate, and advances the development of trustworthy large language models.

尽管生成式大型语言模型（LLM）能够生成高质量和流畅的文本，但它们也会产生幻觉：即与既定世界知识或提供的输入上下文不匹配的陈述。然而，衡量幻觉是一项挑战，因为让人工实时验证模型的生成既昂贵又耗时。在这项工作中，我们发布了HALoGEN，这是一个全面的幻觉评估基准，包括：（1）涵盖编程、科学归属和摘要等九个领域的10923个生成模型提示；（2）针对每个用例的自动高精度验证器，它将LLM生成物分解成原子单位，并针对高质量的知识来源对每个单位进行验证。我们使用此框架评估了大约15万个来自14个语言模型的生成物，发现即使是表现最佳的模型也充斥着幻觉（在某些领域，生成的原子事实中有时高达86%是幻觉）。我们进一步定义了LLM幻觉的新型错误分类，基于它们是否可能源于训练数据的不正确回忆（A类错误），或训练数据中的错误知识（B类错误），或是虚构（C类错误）。我们希望我们的框架能够为研究生成模型为何会产生幻觉提供基础，并推动开发可信赖的大型语言模型。

论文及项目相关链接

PDF Preprint

摘要

该文探讨了生成式大型语言模型（LLM）的幻觉问题。虽然LLM能够生成高质量和流畅的文本，但它们也会产生与既定世界知识或提供输入上下文不匹配的幻觉。文章提出了一种名为HALoGEN的综合幻觉评估工具，包括用于生成模型的10923个提示和针对每个用例自动进行精确验证的验证器。通过对大约来自使用此框架进行评估的语言模型的生存评估结果表明，即使在最好的模型中，也存在大量的幻觉现象（在某些领域高达生成的原子事实的百分之八十六）。此外，文章还定义了一种新型错误分类，根据是否可能源于训练数据的错误回忆（类型A错误）、训练数据中的错误知识（类型B错误）或伪造（类型C错误），来对LLM幻觉进行分类。作者希望通过其框架深入研究大型语言模型出现幻觉的原因，并推动开发可信赖的大型语言模型。总的来说，该论文对于了解大型语言模型的优点和局限性至关重要。尽管评估困难，但仍提出了解决方案并分享了相关数据和分析。尽管该领域仍然需要深入研究和分析才能解决问题。但仍然推动了基于高级别可信性理解的发展策略的方向研究发展具有前瞻性价值的研究，帮助理解这些现象发生的原因及其长期影响。同时，该研究也为未来的研究提供了重要的基础。

关键发现列表：

以下是这篇论文的几个主要发现和重点观点：

Cool Papers

点此查看论文截图

LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding

Authors:Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu

Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, a MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-align, we present a progressive training pipeline that aligns the visual and textual feature through sequential coarse-to-fine stages.Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG) , Event Localization and Captioning (ELC) and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaving multimodal understanding. Our code, data and benchmark will be released at Our code, data and benchmark will be released at https://github.com/appletea233/LLaVA-ST .

最近的多模态大型语言模型（MLLMs）的进步已经取得了令人鼓舞的结果，但现有方法很难同时有效地处理时间和空间的定位。这一挑战源于两个关键问题：首先，融入时空定位会引入大量的坐标组合，导致语言和视觉坐标表示的对齐变得复杂；其次，在视频特征压缩过程中编码精细的时间信息和空间信息具有固有的难度。为了解决这些问题，我们提出了LLaVA-ST，一个用于精细时空多模态理解的MLLM。在LLaVA-ST中，我们提出了语言对齐位置嵌入，它将文本坐标特殊令牌嵌入到视觉空间中，简化了精细时空对应关系的对齐。此外，我们设计了时空打包器，它将时间分辨率和空间分辨率的特征压缩解耦为两个独立的点-区域注意力处理流。我们还推出了ST-Align数据集，包含430万训练样本，用于精细时空多模态理解。通过ST-align，我们提出了一种渐进的训练管道，通过从粗到细的序列阶段对齐视觉和文本特征。此外，我们引入了ST-Align基准测试，以评估时空交错精细理解任务，包括时空视频定位（STVG）、事件定位和描述（ELC）以及空间视频定位（SVG）。LLaVA-ST在需要精细时间、空间或时空交错多模态理解的11个基准测试中取得了卓越的性能。我们的代码、数据和基准测试将在https://github.com/appletea233/LLaVA-ST上发布。

论文及项目相关链接

PDF

Summary

MLLM在处理时空定位方面存在挑战，提出了一种名为LLaVA-ST的细粒度时空多模态理解模型。该模型通过语言对齐位置嵌入和空间时间打包器解决这些问题。还推出了ST-Align数据集和基准测试，以评估细粒度时空多模态理解任务。LLaVA-ST在多个基准测试中表现出卓越性能。

Key Takeaways

MLLM在处理时空定位时面临挑战，主要是由于语言和视觉坐标表示的对齐复杂性以及视频特征压缩中精细时空信息的编码难度。
LLaVA-ST模型通过语言对齐位置嵌入和空间时间打包器解决这些问题。
LLaVA-ST将文本坐标特殊令牌嵌入视觉空间，简化了精细时空对应关系的对齐。
空间时间打包器将时空分辨率的特征压缩解耦为两个独立的点区域注意力处理流。
推出ST-Align数据集，包含430万训练样本，用于细粒度时空多模态理解。
引入ST-Align基准测试，评估时空交织的细粒度理解任务，包括时空视频定位、事件定位和描述以及空间视频定位。

Cool Papers

点此查看论文截图

Comparative Analysis of Efficient Adapter-Based Fine-Tuning of State-of-the-Art Transformer Models

Authors:Saad Mashkoor Siddiqui, Mohammad Ali Sheikh, Muhammad Aleem, Kajol R Singh

In this work, we investigate the efficacy of various adapter architectures on supervised binary classification tasks from the SuperGLUE benchmark as well as a supervised multi-class news category classification task from Kaggle. Specifically, we compare classification performance and time complexity of three transformer models, namely DistilBERT, ELECTRA, and BART, using conventional fine-tuning as well as nine state-of-the-art (SoTA) adapter architectures. Our analysis reveals performance differences across adapter architectures, highlighting their ability to achieve comparable or better performance relative to fine-tuning at a fraction of the training time. Similar results are observed on the new classification task, further supporting our findings and demonstrating adapters as efficient and flexible alternatives to fine-tuning. This study provides valuable insights and guidelines for selecting and implementing adapters in diverse natural language processing (NLP) applications.

在这项工作中，我们研究了不同的适配器架构在SuperGLUE基准测试中的监督二分类任务以及Kaggle中的监督多类新闻类别分类任务上的有效性。具体来说，我们比较了三种转换器模型（即DistilBERT、ELECTRA和BART）在常规微调以及九种最新适配器架构下的分类性能和时间复杂度。我们的分析揭示了不同适配器架构之间的性能差异，突出了它们在极短的训练时间内达到与微调相当或更好的性能的能力。在新的分类任务上也观察到了类似的结果，这进一步支持了我们的发现，并证明了适配器是微调的有效和灵活替代方案。这项研究为在多种自然语言处理（NLP）应用中选择和实施适配器提供了宝贵的见解和指导。

论文及项目相关链接

PDF

Summary

本文研究了不同适配器架构在SuperGLUE基准测试中的监督二元分类任务以及来自Kaggle的监督多类新闻分类任务上的效果。文章对比了DistilBERT、ELECTRA和BART三种转换器模型的分类性能和时间成本，包括传统微调以及九种先进的适配器架构。研究发现，适配器架构间存在性能差异，它们能够在很短的时间内达到与微调相当或更好的性能。在新分类任务上也观察到了类似的结果，进一步验证了适配器在NLP应用中的高效性和灵活性。本文提供了宝贵的见解和指导方针，为在各种自然语言处理任务中选择和实施适配器提供了依据。

Key Takeaways

论文对多种适配器架构在基准测试中的表现进行了深入研究。
对比了DistilBERT、ELECTRA和BART三种转换器模型在监督分类任务上的性能和时间成本。
研究发现适配器架构在分类任务上的性能与微调相当或更好，且训练时间更短。
新分类任务的结果进一步支持了适配器架构的有效性和灵活性。
论文提供了关于选择和实施适配器的宝贵见解和指导方针。
研究成果有助于理解和优化自然语言处理中的模型训练效率。

Cool Papers

点此查看论文截图

Text-Diffusion Red-Teaming of Large Language Models: Unveiling Harmful Behaviors with Proximity Constraints

Authors:Jonathan Nöther, Adish Singla, Goran Radanović

Recent work has proposed automated red-teaming methods for testing the vulnerabilities of a given target large language model (LLM). These methods use red-teaming LLMs to uncover inputs that induce harmful behavior in a target LLM. In this paper, we study red-teaming strategies that enable a targeted security assessment. We propose an optimization framework for red-teaming with proximity constraints, where the discovered prompts must be similar to reference prompts from a given dataset. This dataset serves as a template for the discovered prompts, anchoring the search for test-cases to specific topics, writing styles, or types of harmful behavior. We show that established auto-regressive model architectures do not perform well in this setting. We therefore introduce a black-box red-teaming method inspired by text-diffusion models: Diffusion for Auditing and Red-Teaming (DART). DART modifies the reference prompt by perturbing it in the embedding space, directly controlling the amount of change introduced. We systematically evaluate our method by comparing its effectiveness with established methods based on model fine-tuning and zero- and few-shot prompting. Our results show that DART is significantly more effective at discovering harmful inputs in close proximity to the reference prompt.

最近的工作提出了针对给定的大型语言模型（LLM）进行自动化红队测试的方法。这些方法使用红队LLM来发现诱导目标LLM出现有害行为的输入。在本文中，我们研究了能够进行有针对性的安全评估的红队策略。我们提出了一个带有接近度约束的红队优化框架，其中发现的提示必须与给定数据集中的参考提示相似。该数据集作为发现的提示的模板，将测试用例的搜索锚定到特定主题、写作风格或有害行为类型上。我们发现现有的自动回归模型架构在此设置中表现不佳。因此，我们引入了一种受文本扩散模型启发的黑盒红队方法：用于审计和红队的扩散（DART）。DART通过扰动参考提示的嵌入空间进行修改，直接控制引入的变化量。我们通过将其与基于模型微调以及零次和少次提示的方法的有效性进行比较，系统地评估了我们的方法。结果表明，DART在发现与参考提示接近的有害输入方面显著更有效。

论文及项目相关链接

PDF This is an extended version of a paper published at AAAI 25

Summary

本文研究了针对大型语言模型（LLM）的安全评估方法，特别是红队攻击策略。文章提出了一种带有近似约束的红队优化框架，该框架要求发现的提示必须与给定数据集中的参考提示相似。此外，文章介绍了基于文本扩散模型的黑色盒子红队方法——审计与红队扩散（DART）。DART通过嵌入空间中扰动参考提示来修改它，并直接控制引入的变化量。通过系统地评估，结果显示DART在发现与参考提示相近的有害输入方面，比基于模型微调、零样本和少样本提示的方法更为有效。

Key Takeaways

提出了带有近似约束的红队优化框架，针对LLM进行安全评估。
发现现有自动回归模型架构在此设置中的表现不佳。
介绍了基于文本扩散模型的黑色盒子红队方法——DART。
DART通过嵌入空间中的扰动来修改参考提示。
DART能更有效地发现与参考提示相近的有害输入。
DART在发现测试案例时，能够锚定到特定主题、写作风格或有害行为类型。

Cool Papers

点此查看论文截图

Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps

Authors:Kannan Parthasarathy, Karthik Vaidhyanathan, Rudra Dhar, Venkat Krishnamachari, Basil Muhammed, Adyansh Kakran, Sreemaee Akshathala, Shrikara Arun, Sumant Dubey, Mohan Veerubhotla, Amey Karan

Cloud Operations (CloudOps) is a rapidly growing field focused on the automated management and optimization of cloud infrastructure which is essential for organizations navigating increasingly complex cloud environments. MontyCloud Inc. is one of the major companies in the CloudOps domain that leverages autonomous bots to manage cloud compliance, security, and continuous operations. To make the platform more accessible and effective to the customers, we leveraged the use of GenAI. Developing a GenAI-based solution for autonomous CloudOps for the existing MontyCloud system presented us with various challenges such as i) diverse data sources; ii) orchestration of multiple processes; and iii) handling complex workflows to automate routine tasks. To this end, we developed MOYA, a multi-agent framework that leverages GenAI and balances autonomy with the necessary human control. This framework integrates various internal and external systems and is optimized for factors like task orchestration, security, and error mitigation while producing accurate, reliable, and relevant insights by utilizing Retrieval Augmented Generation (RAG). Evaluations of our multi-agent system with the help of practitioners as well as using automated checks demonstrate enhanced accuracy, responsiveness, and effectiveness over non-agentic approaches across complex workflows.

云操作（CloudOps）是一个快速发展的领域，专注于云基础设施的自动化管理和优化，对于在日益复杂的云环境中导航的组织而言，这是必不可少的。MontyCloud Inc.是云操作领域的主要公司之一，利用自主机器人管理云合规性、安全性和持续运营。为了使平台对客户更加可用和有效，我们利用了GenAI。为现有的MontyCloud系统开发基于GenAI的自主云操作系统解决方案，给我们带来了各种挑战，例如（i）多样化的数据源；（ii）多个流程的编排；（iii）处理复杂的工作流以自动化常规任务。为此，我们开发了MOYA，这是一个多代理框架，利用GenAI，在自主性和必要的人为控制之间取得平衡。该框架整合了内部和外部系统，针对任务编排、安全和错误缓解等因素进行了优化，同时利用检索增强生成（RAG）提供准确、可靠和相关的见解。通过实践者的帮助以及自动检查对我们的多代理系统进行评估，表明在复杂的工作流中，与无代理方法相比，我们的系统具有更高的准确性、响应性和有效性。

论文及项目相关链接

PDF The paper has been accepted as full paper to CAIN 2025 (https://conf.researchr.org/home/cain-2025), co-located with ICSE 2025 (https://conf.researchr.org/home/icse-2025). The paper was submitted to CAIN for review on 9 November 2024

Summary
蒙蒂云公司正在开发一种基于GenAI的自主云操作（CloudOps）解决方案。他们面临的挑战包括多样的数据来源、多个流程的协同以及复杂工作流的自动化。为了应对这些挑战，他们开发了MOYA多智能体框架，该框架优化了任务协同、安全性和错误缓解等因素，并利用检索增强生成（RAG）技术提供准确、可靠和相关的见解。评估表明，与没有智能体的方法相比，多智能体系统在复杂工作流程中的准确性、响应性和有效性都得到了提升。

Key Takeaways

Cloud Operations (CloudOps) 是一个快速发展的领域，专注于云基础设施的自动化管理和优化。
蒙蒂云公司是云操作领域的主要公司之一，利用自主机器人管理云合规性、安全和持续运营。
蒙蒂云公司为了使其平台更易于客户使用并提高其效果，使用了GenAI技术。
开发基于GenAI的自主云操作解决方案面临了多样的数据来源、多个流程的协同以及复杂工作流的自动化等挑战。
MOYA多智能体框架是为了应对这些挑战而开发的，它整合了内外部系统，并优化了任务协同、安全性和错误缓解等因素。
MOYA框架利用检索增强生成（RAG）技术提供准确、可靠和相关的见解。

Cool Papers

点此查看论文截图

Investigating Energy Efficiency and Performance Trade-offs in LLM Inference Across Tasks and DVFS Settings

Authors:Paul Joe Maliakel, Shashikant Ilager, Ivona Brandic

Large language models (LLMs) have shown significant improvements in many natural language processing (NLP) tasks, accelerating their rapid adoption across many industries. These models are resource-intensive, requiring extensive computational resources both during training and inference, leading to increased energy consumption and negative environmental impact. As their adoption accelerates, the sustainability of LLMs has become a critical issue, necessitating strategies to optimize their runtime efficiency without compromising performance. Hence, it is imperative to identify the parameters that significantly influence the performance and energy efficiency of LLMs. To that end, in this work, we investigate the effect of important parameters on the performance and energy efficiency of LLMs during inference and examine their trade-offs. First, we analyze how different types of models with varying numbers of parameters and architectures perform on tasks like text generation, question answering, and summarization by benchmarking LLMs such as Falcon-7B, Mistral-7B-v0.1, T5-3B, GPT-2, GPT-J-6B, and GPT-Neo-2.7B. Second, we study input and output sequence characteristics such as sequence length concerning energy consumption, performance, and throughput. Finally, we explore the impact of hardware-based power-saving techniques, i.e., Dynamic Voltage Frequency Scaling (DVFS), on the models’ latency and energy efficiency. Our extensive benchmarking and statistical analysis reveal many interesting findings, uncovering how specific optimizations can reduce energy consumption while maintaining throughput and accuracy. This study provides actionable insights for researchers and practitioners to design energy-efficient LLM inference systems.

大型语言模型（LLM）在许多自然语言处理（NLP）任务中取得了显著改进，加速了其在多个行业的快速采用。这些模型资源密集，在训练和推理过程中都需要大量的计算资源，导致能源消耗增加并对环境产生负面影响。随着其采用的加速，LLM的可持续性已成为一个关键问题，需要制定策略以优化其运行效率，同时不影响性能。因此，确定对LLM的性能和能效产生重大影响的参数至关重要。为此，我们在这项工作中调查了重要参数对LLM在推理过程中的性能和能效的影响，并研究了它们之间的权衡。首先，我们通过基准测试分析具有不同参数和架构的各种模型在文本生成、问答和摘要等任务上的表现，包括Falcon-7B、Mistral-7B-v0.1、T5-3B、GPT-2、GPT-J-6B和GPT-Neo-2.7B等LLM。其次，我们研究了输入和输出序列特征，如序列长度与能源消费、性能和吞吐量的关系。最后，我们探索了基于硬件的节能技术，即动态电压频率缩放（DVFS）对模型的延迟和能效的影响。我们进行了广泛的基准测试和统计分析，发现了许多有趣的结果，揭示了如何通过特定优化减少能源消耗，同时保持吞吐量和准确性。这项研究为研究人员和从业者提供了设计能源高效LLM推理系统的可行见解。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在自然语言处理（NLP）任务中取得了显著进展，但其资源密集型的特性导致了能源消耗和环境负担加重。为了提升LLM的可持续性，对其运行效率的优化至关重要。本研究通过基准测试，探讨了模型参数对LLM性能和能源效率的影响，并分析了不同模型在不同任务上的表现。同时，还研究了输入输出序列特性与能源消耗、性能和吞吐量的关系，并探索了基于硬件的节能技术如动态电压频率缩放（DVFS）对模型延迟和能源效率的影响。本研究为设计能源高效的LLM推理系统提供了实践指导。

Key Takeaways

LLM在自然语言处理任务中取得显著进展，但资源密集型特性引发能源和环境问题。
对LLM性能和能源效率的优化至关重要，特别是在加速其推理过程时。
不同类型和规模的LLM在不同任务上的表现存在差异，如文本生成、问答和摘要等。
输入输出序列长度与能源消耗、性能和吞吐量密切相关。
硬件节能技术如DVFS对LLM的延迟和能源效率有重要影响。
本研究通过基准测试和统计分析，揭示了优化LLM以减少能源消耗的方法。

Cool Papers

点此查看论文截图

ASTRID – An Automated and Scalable TRIaD for the Evaluation of RAG-based Clinical Question Answering Systems

Authors:Mohita Chowdhury, Yajie Vera He, Aisling Higham, Ernest Lim

Large Language Models (LLMs) have shown impressive potential in clinical question answering (QA), with Retrieval Augmented Generation (RAG) emerging as a leading approach for ensuring the factual accuracy of model responses. However, current automated RAG metrics perform poorly in clinical and conversational use cases. Using clinical human evaluations of responses is expensive, unscalable, and not conducive to the continuous iterative development of RAG systems. To address these challenges, we introduce ASTRID - an Automated and Scalable TRIaD for evaluating clinical QA systems leveraging RAG - consisting of three metrics: Context Relevance (CR), Refusal Accuracy (RA), and Conversational Faithfulness (CF). Our novel evaluation metric, CF, is designed to better capture the faithfulness of a model’s response to the knowledge base without penalising conversational elements. To validate our triad, we curate a dataset of over 200 real-world patient questions posed to an LLM-based QA agent during surgical follow-up for cataract surgery - the highest volume operation in the world - augmented with clinician-selected questions for emergency, clinical, and non-clinical out-of-domain scenarios. We demonstrate that CF can predict human ratings of faithfulness better than existing definitions for conversational use cases. Furthermore, we show that evaluation using our triad consisting of CF, RA, and CR exhibits alignment with clinician assessment for inappropriate, harmful, or unhelpful responses. Finally, using nine different LLMs, we demonstrate that the three metrics can closely agree with human evaluations, highlighting the potential of these metrics for use in LLM-driven automated evaluation pipelines. We also publish the prompts and datasets for these experiments, providing valuable resources for further research and development.

大型语言模型（LLM）在临床问题回答（QA）方面显示出令人印象深刻的潜力，其中检索增强生成（RAG）作为确保模型应答事实准确性的领先方法而出现。然而，当前的自动化RAG指标在临床和对话用例中的表现不佳。使用临床人员对应答的评价成本高昂，不可扩展，且不利于RAG系统的持续迭代开发。为了解决这些挑战，我们引入了ASTRID——一种利用RAG评估临床QA系统的自动化和可扩展的TRIaD，由三个指标组成：上下文相关性（CR）、拒绝准确率（RA）和对话忠实性（CF）。我们新颖的评价指标CF旨在更好地捕捉模型响应知识库的真实性，同时不惩罚对话元素。为了验证我们的三人组，我们整理了一个数据集，其中包括200多个现实世界的患者问题，这些问题是在基于LLM的QA代理进行白内障手术随访时提出的，还有临床医生选择的关于紧急、临床和非临床域外场景的题目。我们证明，CF能够比现有定义更好地预测人类对话用例中的忠实性评价。此外，我们还展示了使用由CF、RA和CR组成的三人组评估的回应与临床医生对不当、有害或无助的回应的评估相一致。最后，我们使用九个不同的LLM进行实验，证明这三个指标可以紧密地与人类评估达成一致，突出了这些指标在LLM驱动的自动化评估管道中的潜力。我们还公布了这些实验的提示和数据集，为进一步的研究和开发提供了有价值的资源。

论文及项目相关链接

PDF 29 pages

Summary

LLMs在临床问答系统中有巨大潜力，检索增强生成（RAG）方法能确保模型回答的事实准确性。但现有自动化RAG指标在临床和对话场景中的表现不佳。为此，提出一种新型评估方法ASTRID，包括三个指标：上下文相关性（CR）、拒绝准确性（RA）和对话忠实度（CF）。CF能更准确地捕捉模型回应的忠实度，同时避免对话元素的惩罚。实验验证表明，CF能更好预测人类对话场景的忠实度评价；同时包含CF、RA和CR的评估方法能紧密贴合临床医生对不当、有害或无帮助回应的评估。对九种不同的大型语言模型（LLM）的实验进一步验证了这三个指标与人类评价的契合度，展示了这些指标在LLM自动化评估流程中的潜力。

Key Takeaways

LLMs在临床问答系统中展现潜力，RAG方法确保回答事实准确性。
当前自动化RAG指标在临床和对话场景中表现不佳。
引入新型评估方法ASTRID，包括CR、RA和CF三个指标。
CF能更准确地预测人类对话场景的忠实度评价。
包含CF、RA和CR的评估方法与临床医生评估紧密贴合，能识别不当、有害或无帮助的回应。
三个指标与人类评价的契合度高，有潜力用于LLM自动化评估流程。

Cool Papers

点此查看论文截图

CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation

Authors:Jinjun Peng, Leyi Cui, Kele Huang, Junfeng Yang, Baishakhi Ray

Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing, enhancing productivity across various tasks. While identifying incorrect code is often straightforward, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge, which poses considerable security risks of using LLM-generated code and underscores the need for robust evaluation benchmarks that assess both functional correctness and security. Current benchmarks like CyberSecEval and SecurityEval attempt to solve it but are hindered by unclear and impractical specifications, failing to assess both functionality and security accurately. To tackle these deficiencies, we introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs. This framework not only assesses code functionality but also its security simultaneously with high-quality task specifications and outcome-driven test oracles which provides high accuracy. Coupled with CWEval-bench, a multilingual, security-critical coding benchmark, CWEval provides a rigorous empirical security evaluation on LLM-generated code, overcoming previous benchmarks’ shortcomings. Through our evaluations, CWEval reveals a notable portion of functional but insecure code produced by LLMs, and shows a serious inaccuracy of previous evaluations, ultimately contributing significantly to the field of secure code generation. We open-source our artifact at: https://github.com/Co1lin/CWEval .

大型语言模型（LLM）在代码生成或辅助编写方面为开发者提供了巨大帮助，提高了在各种任务中的生产效率。虽然识别错误的代码通常很直接，但在功能正确的代码中检测漏洞更具挑战性，尤其是对于安全知识有限的开发者来说。这带来了使用LLM生成代码的安全风险，并强调需要对功能正确性和安全性进行评估的稳健评估基准的需求。当前的基准测试，如CyberSecEval和安全评估（SecurityEval），试图解决这个问题，但由于规格不明确且不切实际而受到阻碍，无法准确评估功能和安全性。为了解决这些不足，我们引入了CWEval，这是一种新型的结果驱动评估框架，旨在提高LLM生成代码的安全评估能力。该框架不仅评估代码的功能性，而且还同时评估其安全性，具有高标准的任务规格和结果驱动的测试工具，提供高精度。与多语言安全关键编码基准CWEval-bench相结合，CWEval对LLM生成的代码进行了严格的实证安全评估，克服了以前基准测试的缺点。通过我们的评估，CWEval揭示了LLM产生的功能强大但存在安全隐患的代码显著部分，显示出之前评估的严重不准确性，最终为安全代码生成领域做出了重大贡献。我们的开源工件可以在：https://github.com/Co1lin/CWEval找到。

论文及项目相关链接

PDF to be published in LLM4Code 2025

Summary

大型语言模型（LLM）在代码生成方面为开发者提供了极大的帮助，提高了生产力。然而，检测功能正确代码中的漏洞更具挑战性，特别是对于安全知识有限的开发者，这增加了使用LLM生成代码的安全风险。当前的安全评估基准如CyberSecEval和SecurityEval存在模糊和不切实际的规格问题，无法准确评估功能和安全性。为解决这些问题，我们推出了CWEval，这是一个新型的结果驱动评估框架，旨在提高LLM生成代码的安全性评估水平。它不仅评估代码的功能性，还同时评估其安全性，具有高质量的任务规格和结果驱动的测试验证，提供高精确度。搭配CWEval-bench——一个多语言、安全关键的编码基准，CWEval对LLM生成的代码进行了严格的实证安全评估，克服了先前基准的缺点。我们的评估显示，CWEval发现了大量由LLM生成的功能性但存在安全隐患的代码，并指出了先前评估的严重不准确性。

Key Takeaways

LLM在代码生成方面对开发者有重大帮助，但存在检测功能正确代码中的漏洞的挑战。
当前的安全评估基准如CyberSecEval和SecurityEval存在缺陷，无法准确评估LLM生成代码的功能和安全性。
CWEval是一个新型的结果驱动评估框架，旨在提高LLM生成代码的安全性评估水平。
CWEval同时评估代码的功能性和安全性，具有高质量的任务规格和结果驱动的测试验证。
CWEval搭配多语言、安全关键的编码基准CWEval-bench，对LLM生成的代码进行严格的实证安全评估。
CWEval发现了大量由LLM生成的功能性但存在安全隐患的代码。

Cool Papers

点此查看论文截图

PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving

Authors:Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli

Large language models (LLMs) are widely used across various applications, but their substantial computational requirements pose significant challenges, particularly in terms of HBM bandwidth bottlenecks and inter-device communication overhead. In this paper, we present PRESERVE, a novel prefetching framework designed to optimize LLM inference by overlapping memory reads for model weights and KV-cache with collective communication operations. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate the memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of the LLM inference systems.

大型语言模型（LLM）在多种应用中得到广泛运用，但其巨大的计算需求带来了重大挑战，特别是在高带宽内存（HBM）瓶颈和设备间通信开销方面。在本文中，我们提出了PRESERVE，这是一个新的预取框架，旨在通过重叠模型权重和KV缓存的内存读取与集体通信操作来优化LLM推理。我们在商业人工智能加速器上进行了大量实验，证明在最新开源LLM上最多可实现1.6倍端到端加速。此外，我们还进行了设计空间探索，确定了所提方法的最佳硬件配置，通过选择最佳的L2缓存大小，在性能成本方面实现了进一步的1.25倍提升。我们的结果表明，PRESERVE具有缓解内存瓶颈和通信开销的潜力，为解决提高LLM推理系统性能和可扩展性的问题提供了解决方案。

论文及项目相关链接

PDF

Summary
大型语言模型（LLM）在多领域应用广泛，但其计算需求挑战重重，特别是面临高速缓冲存储器带宽瓶颈和设备间通信开销问题。本研究提出一种名为PRESERVE的新型预取框架，旨在通过重叠模型权重和KV缓存的内存读取与集体通信操作来优化LLM推理。在商用人工智能加速器上进行的广泛实验表明，该框架对先进开源LLM的端到端速度提高了1.6倍。另外，通过设计空间探索，确定了该方法的最优硬件配置，在选用最佳L2缓存大小时，性能成本比进一步提高1.25倍。研究结果表明，PRESERVE具有缓解内存瓶颈和通信开销的潜力，为解决提高LLM推理系统性能和可扩展性问题提供了有效方案。

Key Takeaways

PRESERVE是一种针对大型语言模型（LLM）的预取框架，旨在优化LLM推理性能。
该框架通过重叠内存读取和集体通信操作来减少计算延迟。
在商用人工智能加速器上进行的实验表明，PRESERVE可以提高端到端速度达1.6倍。
PRESERVE的设计空间探索确定了最优硬件配置，进一步提高了性能成本比。
通过选用最佳的L2缓存大小，可以获得更好的性能提升。
PRESERVE具有缓解内存瓶颈和通信开销的潜力。

Cool Papers

点此查看论文截图

I Can Find You in Seconds! Leveraging Large Language Models for Code Authorship Attribution

Authors:Soohyeon Choi, Yong Kiam Tan, Mark Huasong Meng, Mohamed Ragab, Soumik Mondal, David Mohaisen, Khin Mi Mi Aung

Source code authorship attribution is important in software forensics, plagiarism detection, and protecting software patch integrity. Existing techniques often rely on supervised machine learning, which struggles with generalization across different programming languages and coding styles due to the need for large labeled datasets. Inspired by recent advances in natural language authorship analysis using large language models (LLMs), which have shown exceptional performance without task-specific tuning, this paper explores the use of LLMs for source code authorship attribution. We present a comprehensive study demonstrating that state-of-the-art LLMs can successfully attribute source code authorship across different languages. LLMs can determine whether two code snippets are written by the same author with zero-shot prompting, achieving a Matthews Correlation Coefficient (MCC) of 0.78, and can attribute code authorship from a small set of reference code snippets via few-shot learning, achieving MCC of 0.77. Additionally, LLMs show some adversarial robustness against misattribution attacks. Despite these capabilities, we found that naive prompting of LLMs does not scale well with a large number of authors due to input token limitations. To address this, we propose a tournament-style approach for large-scale attribution. Evaluating this approach on datasets of C++ (500 authors, 26,355 samples) and Java (686 authors, 55,267 samples) code from GitHub, we achieve classification accuracy of up to 65% for C++ and 68.7% for Java using only one reference per author. These results open new possibilities for applying LLMs to code authorship attribution in cybersecurity and software engineering.

源代码作者归属在软件取证、抄袭检测以及软件补丁完整性保护中具有重要意义。现有技术通常依赖于有监督机器学习，由于需要大量标注数据集，其在不同编程语言和编码风格之间的泛化能力受到限制。受自然语言作者分析方面近期的大型语言模型（LLM）进步的启发，这些进步在没有特定任务的调整下表现出了令人惊叹的性能，本文探索了使用LLM进行源代码作者归属。我们进行了全面的研究，表明最先进的LLM可以成功地在不同语言之间进行源代码作者归属。LLM可以通过零样本提示确定两个代码片段是否由同一作者编写，实现马修斯相关系数（MCC）为0.78，并且可以从小规模的参考代码片段集中通过少量学习进行代码作者归属，实现MCC为0.77。此外，LLM对一些误判攻击还显示出一定的对抗性稳健性。尽管具备了这些功能，但我们发现对LLM的简单提示并不适用于大量作者的情况，因为存在输入标记的限制。为了解决这个问题，我们提出了一种用于大规模归属的锦标赛式方法。在GitHub上的C++（500位作者，26355个样本）和Java（686位作者，55267个样本）数据集上评估此方法，对于每位作者仅使用一个参考样本的情况下，我们实现了高达65%的C++分类准确率和68.7%的Java分类准确率。这些结果开启了将LLM应用于网络安全和软件工程中的代码作者归属的新可能性。

论文及项目相关链接

PDF 12 pages, 5 figures,

摘要
利用大型语言模型（LLMs）的自然语言处理技术在软件代码的作者身份鉴别上进行了一项创新研究。该研究探讨了零射提示和少射学习技术在不同编程语言中的代码作者身份鉴别应用，并发现大型语言模型在代码作者身份鉴别上具有出色的性能。尽管存在输入令牌限制，但通过锦标赛风格的策略可以在大规模属性鉴别中实现分类准确率高达65%（针对C++）和68.7%（针对Java）。这为在网络安全和软件工程中应用大型语言模型进行代码作者身份鉴别提供了新的可能性。

关键见解

源码作者归属在软件取证、抄袭检测和软件补丁保护中占据重要地位。传统技术往往依赖于监督机器学习，这在跨不同编程语言和编码风格时面临挑战。
大型语言模型（LLMs）的自然语言处理技术能成功应用在代码作者身份鉴别上，且表现出卓越的性能，无需特定任务的调整和优化。
通过零射提示技术，大型语言模型能够判断两个代码片段是否由同一作者编写，且具有较高的马修斯相关系数（MCC）达到0.78。此外，使用少量的参考代码片段进行少量学习也能达到较高的马修斯相关系数。

Cool Papers

点此查看论文截图

Change Captioning in Remote Sensing: Evolution to SAT-Cap – A Single-Stage Transformer Approach

Authors:Yuduo Wang, Weikang Yu, Pedram Ghamisi

Change captioning has become essential for accurately describing changes in multi-temporal remote sensing data, providing an intuitive way to monitor Earth’s dynamics through natural language. However, existing change captioning methods face two key challenges: high computational demands due to multistage fusion strategy, and insufficient detail in object descriptions due to limited semantic extraction from individual images. To solve these challenges, we propose SAT-Cap based on the transformers model with a single-stage feature fusion for remote sensing change captioning. In particular, SAT-Cap integrates a Spatial-Channel Attention Encoder, a Difference-Guided Fusion module, and a Caption Decoder. Compared to typical models that require multi-stage fusion in transformer encoder and fusion module, SAT-Cap uses only a simple cosine similarity-based fusion module for information integration, reducing the complexity of the model architecture. By jointly modeling spatial and channel information in Spatial-Channel Attention Encoder, our approach significantly enhances the model’s ability to extract semantic information from objects in multi-temporal remote sensing images. Extensive experiments validate the effectiveness of SAT-Cap, achieving CIDEr scores of 140.23% on the LEVIR-CC dataset and 97.74% on the DUBAI-CC dataset, surpassing current state-of-the-art methods. The code and pre-trained models will be available online.

变化标注已成为准确描述多时相遥感数据变化的关键技术，通过自然语言提供了一种直观监测地球动态的方法。然而，现有的变化标注方法面临两个主要挑战：一是由于多阶段融合策略导致的高计算需求，二是由于从单个图像中提取有限语义而导致的物体描述细节不足。为了解决这些挑战，我们提出了基于转换器模型的SAT-Cap遥感变化标注方法，采用单阶段特征融合。特别是，SAT-Cap集成了空间通道注意编码器、差异导向融合模块和字幕解码器。与典型需要转换器编码器和融合模块多阶段融合的模型相比，SAT-Cap仅使用一个简单的基于余弦相似性的融合模块进行信息整合，降低了模型架构的复杂性。通过空间通道注意编码器联合建模空间和通道信息，我们的方法显著提高了模型从多时相遥感图像中的物体提取语义信息的能力。大量实验验证了SAT-Cap的有效性，在LEVIR-CC数据集上达到了CIDEr分数140.23%，在DUBAI-CC数据集上达到了97.74%，超越了当前最先进的方法。代码和预训练模型将在线提供。

论文及项目相关链接

PDF

Summary

基于Transformer模型的单阶段特征融合遥感变化标注方法（SAT-Cap）能有效解决现有变化标注面临的挑战，包括计算需求高和对象描述细节不足的问题。SAT-Cap通过整合空间通道注意力编码器、差异引导融合模块和字幕解码器，采用基于余弦相似度的简单融合模块进行信息整合，降低模型结构复杂性。在LEVIR-CC和DUBAI-CC数据集上取得优异表现，CIDEr得分分别高达140.23%和97.74%，超越现有最先进方法。

Key Takeaways

遥感变化标注的重要性在于能直观地通过自然语言监测地球动态变化。
目前变化标注面临两大挑战：计算需求高和对象描述细节不足。
SAT-Cap基于Transformer模型，采用单阶段特征融合进行遥感变化标注。
SAT-Cap通过整合空间通道注意力编码器和差异引导融合模块提升模型从多时相遥感图像中提取语义信息的能力。
SAT-Cap使用简单的基于余弦相似度的融合模块进行信息整合，降低模型复杂性。
SAT-Cap在LEVIR-CC和DUBAI-CC数据集上表现优异，CIDEr得分高于现有最先进方法。

Cool Papers

点此查看论文截图

Transforming Indoor Localization: Advanced Transformer Architecture for NLOS Dominated Wireless Environments with Distributed Sensors

Authors:Saad Masrur, Jung-Fu, Cheng, Atieh R. Khamesi, Ismail Guvenc

Indoor localization in challenging non-line-of-sight (NLOS) environments often leads to mediocre accuracy with traditional approaches. Deep learning (DL) has been applied to tackle these challenges; however, many DL approaches overlook computational complexity, especially for floating-point operations (FLOPs), making them unsuitable for resource-limited devices. Transformer-based models have achieved remarkable success in natural language processing (NLP) and computer vision (CV) tasks, motivating their use in wireless applications. However, their use in indoor localization remains nascent, and directly applying Transformers for indoor localization can be both computationally intensive and exhibit limitations in accuracy. To address these challenges, in this work, we introduce a novel tokenization approach, referred to as Sensor Snapshot Tokenization (SST), which preserves variable-specific representations of power delay profile (PDP) and enhances attention mechanisms by effectively capturing multi-variate correlation. Complementing this, we propose a lightweight Swish-Gated Linear Unit-based Transformer (L-SwiGLU Transformer) model, designed to reduce computational complexity without compromising localization accuracy. Together, these contributions mitigate the computational burden and dependency on large datasets, making Transformer models more efficient and suitable for resource-constrained scenarios. The proposed tokenization method enables the Vanilla Transformer to achieve a 90th percentile positioning error of 0.388 m in a highly NLOS indoor factory, surpassing conventional tokenization methods. The L-SwiGLU ViT further reduces the error to 0.355 m, achieving an 8.51% improvement. Additionally, the proposed model outperforms a 14.1 times larger model with a 46.13% improvement, underscoring its computational efficiency.

在室内定位领域中，挑战非直视（NLOS）环境下采用传统方法的准确性往往表现平庸。深度学习（DL）已应用于应对这些挑战；然而，许多深度学习方法忽视了计算复杂性，尤其是浮点运算（FLOPs），使得它们不适合资源受限的设备。基于Transformer的模型在自然语言处理（NLP）和计算机视觉（CV）任务中取得了显著的成功，激发了它们在无线应用中的使用。然而，它们在室内定位中的应用仍处于初级阶段，直接应用Transformer进行室内定位可能会计算量大且精度有限。为了应对这些挑战，在这项工作中，我们引入了一种新颖的令牌化方法，称为Sensor Snapshot Tokenization（SST），它保留了功率延迟配置文件（PDP）的变量特定表示，并通过有效地捕获多变量相关性增强了注意力机制。作为补充，我们提出了基于轻量级Swish门控线性单元（L-SwiGLU）的Transformer模型，旨在降低计算复杂度而不妥协定位精度。这些贡献共同减轻了计算负担和对大数据集的依赖，使Transformer模型在资源受限的场景中更加高效适用。所提出的令牌化方法使得普通Transformer能够在高度非直视的室内工厂中实现90百分位定位误差为0.388米，超越了传统的令牌化方法。L-SwiGLU ViT进一步将误差降低到0.355米，实现了8.51%的改进。此外，该模型还优于一个规模更大的模型（规模更大14.1倍），具有46.13%的改进，突显了其计算效率。

论文及项目相关链接

PDF The paper has been submitted to IEEE Transactions on Machine Learning in Communications and Networking

Summary

本文介绍了在具有挑战性的非视距（NLOS）室内定位环境中，深度学习技术面临的挑战和解决方法。研究者引入了一种新颖的标记化方法——传感器快照标记化（SST），该方法能够保留电力延迟特征的数据结构信息。此外，研究还提出了基于Swish门控线性单元设计的轻量化Transformer模型，以降低计算复杂性并保留定位准确性。通过这些创新技术，可在计算资源有限的条件下实现高效准确的室内定位。相比于传统方法，新的技术模型显著提高了定位精度和计算效率。

Key Takeaways

室内定位在挑战性的非视距环境中准确度降低的问题存在。尽管深度学习在解决这些问题上发挥了作用，但在处理浮点数操作上的计算复杂性对资源有限的设备提出了挑战。为了解决该问题，一种新的Sensor Snapshot Tokenization（SST）技术被引入用于处理数据特点中的某些复杂部分，并保留重要的数据结构信息。

Cool Papers

点此查看论文截图

Transforming Role Classification in Scientific Teams Using LLMs and Advanced Predictive Analytics

Authors:Wonduk Seo, Yi Bu

Scientific team dynamics are critical in determining the nature and impact of research outputs. However, existing methods for classifying author roles based on self-reports and clustering lack comprehensive contextual analysis of contributions. Thus, we present a transformative approach to classifying author roles in scientific teams using advanced large language models (LLMs), which offers a more refined analysis compared to traditional clustering methods. Specifically, we seek to complement and enhance these traditional methods by utilizing open source and proprietary LLMs, such as GPT-4, Llama3 70B, Llama2 70B, and Mistral 7x8B, for role classification. Utilizing few-shot prompting, we categorize author roles and demonstrate that GPT-4 outperforms other models across multiple categories, surpassing traditional approaches such as XGBoost and BERT. Our methodology also includes building a predictive deep learning model using 10 features. By training this model on a dataset derived from the OpenAlex database, which provides detailed metadata on academic publications – such as author-publication history, author affiliation, research topics, and citation counts – we achieve an F1 score of 0.76, demonstrating robust classification of author roles.

科研团队的动态在决定研究成果的性质和影响方面至关重要。然而，现有的基于自我报告和聚类的作者角色分类方法缺乏对贡献的全面情境分析。因此，我们提出了一种利用先进的大型语言模型（LLM）对科研团队中的作者角色进行分类的变革性方法。与传统的聚类方法相比，这种方法提供了更为精细的分析。具体来说，我们希望通过利用开源和专有的大型语言模型（如GPT-4、Llama 3 70B、Llama 2 70B和Mistral 7x8B）来补充和改进这些方法，以进行角色分类。通过少样本提示，我们对作者角色进行分类，并证明GPT-4在多类别中表现优于其他模型，超越了传统的XGBoost和BERT等方法。我们的方法还包括建立一个使用10个特征的预测深度学习模型。通过在OpenAlex数据库衍生的数据集上训练该模型，该数据库提供了关于学术出版物的详细元数据（如作者出版历史、作者关联、研究主题和引用计数），我们获得了0.76的F1分数，证明了作者角色稳健的分类。

论文及项目相关链接

PDF 14 pages, 4 figures, 3 tables

Summary

科研团队中的动态对研究结果有着重要影响。现有的作者角色分类方法主要基于自我报告和聚类，缺乏对贡献的全方位上下文分析。本研究提出了一种利用大型语言模型（LLM）对科研团队作者角色进行分类的革新方法，相比传统聚类方法，该方法提供了更为精细的分析。本研究使用开源和专有LLM，如GPT-4、Llama3 70B、Llama2 70B和Mistral 7x8B等工具进行角色分类。通过少样本提示进行角色分类，并发现GPT-4在多类别中表现最佳，超越了传统的XGBoost和BERT等方法。此外，本研究还使用OpenAlex数据库的详细元信息建立了深度学习预测模型，该模型取得了F1分数为0.76的良好成绩。

Key Takeaways

作者角色分类在科研团队中至关重要，影响研究质量和结果。
传统基于自我报告和聚类的作者角色分类方法缺乏全面的上下文分析。
利用大型语言模型（LLM）进行作者角色分类是一种创新且精细的方法。
GPT-4在多种分类中表现优于其他模型，成为最佳的语言模型工具。
结合少样本提示进行角色分类是有效的。
使用OpenAlex数据库的详细元信息建立的深度学习预测模型取得了较高的F1分数。

Cool Papers

点此查看论文截图

Breaking Memory Limits: Gradient Wavelet Transform Enhances LLMs Training

Authors:Ziqing Wen, Ping Luo, Jiahuan Wang, Xiaoge Deng, Jinping Zou, Kun Yuan, Tao Sun, Dongsheng Li

Large language models (LLMs) have shown impressive performance across a range of natural language processing tasks. However, their vast number of parameters introduces significant memory challenges during training, particularly when using memory-intensive optimizers like Adam. Existing memory-efficient algorithms often rely on techniques such as singular value decomposition projection or weight freezing. While these approaches help alleviate memory constraints, they generally produce suboptimal results compared to full-rank updates. In this paper, we investigate the memory-efficient method beyond low-rank training, proposing a novel solution called Gradient Wavelet Transform (GWT), which applies wavelet transforms to gradients in order to significantly reduce the memory requirements for maintaining optimizer states. We demonstrate that GWT can be seamlessly integrated with memory-intensive optimizers, enabling efficient training without sacrificing performance. Through extensive experiments on both pre-training and fine-tuning tasks, we show that GWT achieves state-of-the-art performance compared with advanced memory-efficient optimizers and full-rank approaches in terms of both memory usage and training performance.

大型语言模型（LLM）在多种自然语言处理任务中表现出了令人印象深刻的性能。然而，其庞大的参数数量在训练过程中带来了重大的内存挑战，特别是当使用如Adam这样的内存密集型优化器时。现有的内存高效算法通常依赖于奇异值分解投影或权重冻结等技术。虽然这些方法有助于缓解内存约束，但它们通常产生的结果不如全秩更新。在本文中，我们研究了低秩训练之外的内存高效方法，提出了一种称为梯度小波变换（GWT）的新解决方案，该方案将小波变换应用于梯度，以显著降低维护优化器状态所需的内存要求。我们证明GWT可以无缝集成到内存密集型优化器中，实现高效训练而不牺牲性能。通过对预训练和微调任务的广泛实验，我们表明，在内存使用量和训练性能方面，GWT与先进的内存高效优化器和全秩方法相比均达到了最先进的性能。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在自然语言处理任务中表现出卓越性能，但其庞大的参数数量在训练时带来内存挑战，特别是使用如Adam等内存密集型优化器时。现有内存高效算法常依赖低阶技术或权重冻结等技术，虽然有助于缓解内存压力，但通常产生次优结果。本文提出一种名为梯度小波变换（GWT）的新型内存高效方法，通过应用小波变换对梯度进行处理，显著降低维护优化器状态所需的内存。实验证明，GWT可无缝集成于内存密集型优化器，实现高效训练且不影响性能。与先进内存优化器和全阶方法相比，GWT在内存使用和训练性能上均达到最新水平。

Key Takeaways

大型语言模型（LLMs）在自然语言处理任务中表现优秀，但训练时面临内存挑战。
现有内存高效算法可能产生次优结果。
梯度小波变换（GWT）是一种新型内存高效方法，通过应用小波变换降低内存需求。
GWT可无缝集成于内存密集型优化器，实现高效训练。
GWT在内存使用和训练性能上达到最新水平，优于其他先进内存优化器和全阶方法。
GWT有助于解决LLMs训练中的内存挑战，尤其在使用内存密集型优化器时。

Cool Papers

点此查看论文截图

How GPT learns layer by layer

Authors:Jason Du, Kelly Hong, Alishba Imran, Erfan Jahanparast, Mehdi Khfifi, Kaichun Qiao

Large Language Models (LLMs) excel at tasks like language processing, strategy games, and reasoning but struggle to build generalizable internal representations essential for adaptive decision-making in agents. For agents to effectively navigate complex environments, they must construct reliable world models. While LLMs perform well on specific benchmarks, they often fail to generalize, leading to brittle representations that limit their real-world effectiveness. Understanding how LLMs build internal world models is key to developing agents capable of consistent, adaptive behavior across tasks. We analyze OthelloGPT, a GPT-based model trained on Othello gameplay, as a controlled testbed for studying representation learning. Despite being trained solely on next-token prediction with random valid moves, OthelloGPT shows meaningful layer-wise progression in understanding board state and gameplay. Early layers capture static attributes like board edges, while deeper layers reflect dynamic tile changes. To interpret these representations, we compare Sparse Autoencoders (SAEs) with linear probes, finding that SAEs offer more robust, disentangled insights into compositional features, whereas linear probes mainly detect features useful for classification. We use SAEs to decode features related to tile color and tile stability, a previously unexamined feature that reflects complex gameplay concepts like board control and long-term planning. We study the progression of linear probe accuracy and tile color using both SAE’s and linear probes to compare their effectiveness at capturing what the model is learning. Although we begin with a smaller language model, OthelloGPT, this study establishes a framework for understanding the internal representations learned by GPT models, transformers, and LLMs more broadly. Our code is publicly available: https://github.com/ALT-JS/OthelloSAE.

大型语言模型（LLM）擅长语言处理、策略游戏和推理等任务，但在构建对于智能体自适应决策至关重要的可概括内部表示方面面临挑战。为了让智能体在复杂环境中有效导航，它们必须构建可靠的世界模型。虽然LLM在特定基准测试上表现良好，但它们通常难以概括，导致脆弱的表示形式，限制了它们在现实世界中的有效性。了解LLM如何构建内部世界模型，对于开发能够在任务之间保持一致性、自适应行为的智能体至关重要。我们分析了OthelloGPT，这是一个基于GPT的模型，在Othello游戏上进行了训练，作为研究表示学习的受控测试平台。尽管仅通过下一个令牌预测和随机有效动作进行训练，OthelloGPT在理解棋盘状态和游戏方面显示出有意义的逐层进展。较早的层次捕捉静态属性，如棋盘边缘，而较深的层次反映动态棋子变化。为了解释这些表示形式，我们比较了稀疏自动编码器（SAE）和线性探头，发现SAE提供了对组合特征更稳健、更分离的见解，而线性探头主要检测用于分类的特征。我们使用SAE解码与棋子颜色和棋子稳定性相关的特征，这是一个以前未检查的特征，反映了复杂的游戏概念，如棋盘控制和长期规划。我们研究了线性探头准确度和瓷砖颜色的进展，使用SAE和线性探头来比较它们在捕捉模型学习方面的有效性。虽然我们从较小的语言模型OthelloGPT开始，但这项研究为理解GPT模型、转换器和LLM更广泛学习的内部表示建立了框架。我们的代码公开可用：https://github.com/ALT-JS/OthelloSAE。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在处理语言、策略游戏和推理等任务上表现出色，但在构建对于智能体自适应决策至关重要的通用内部表示方面存在挑战。为了有效地适应复杂环境，智能体必须构建可靠的世界模型。尽管LLM在某些基准测试上表现良好，但它们往往无法泛化，导致表示形式脆弱，限制了它们在现实世界中的有效性。理解LLM如何构建内部世界模型是开发能够在任务之间表现出一致性和自适应行为的智能体的关键。本研究以OthelloGPT为例进行分析，它是一个基于GPT的模型，经过Othello游戏训练，作为研究表示学习的受控测试平台。尽管它仅经过下一个令牌预测和随机有效动作的训练，但OthelloGPT在理解棋盘状态和游戏规则方面显示出有意义的逐层进展。早期的层次捕捉静态属性，如棋盘边缘，而较深的层次则反映动态瓷砖变化。本研究通过稀疏自动编码器（SAE）与线性探针的比较，发现SAE为组合特征提供了更稳健、更松散的见解，而线性探针则主要检测用于分类的特征。我们使用SAE解码与瓷砖颜色和瓷砖稳定性相关的特征，这是一个之前未被检查过的特征，反映了诸如棋盘控制和长期规划之类的复杂游戏规则概念。本研究以较小的语言模型OthelloGPT为出发点，建立了理解GPT模型、变压器和LLM更广泛领域内学习的内部表示机制。相关代码已公开发布：链接。

Key Takeaways

大型语言模型（LLM）在特定任务上表现出色，但在构建通用内部表示和泛化能力方面存在挑战。
为了有效适应复杂环境，智能体需要构建可靠的世界模型。
OthelloGPT作为一个研究案例，展示了在理解棋盘状态和游戏规则方面的逐层进展。
早期层次捕捉静态属性，如棋盘边缘；较深的层次反映动态变化。
稀疏自动编码器（SAE）提供了对模型学习特征的更稳健和松散的见解，而线性探针主要用于分类特征检测。
通过SAE解码了与瓷砖颜色和稳定性相关的特征，反映了复杂游戏规则概念。

Cool Papers

点此查看论文截图

Punctuation’s Semantic Role between Brain and Transformers Models

Authors:Zenon Lamprou, Frank Polick, Yashar Moshfeghi

Contemporary neural networks intended for natural language processing (NLP) are not designed with specific linguistic rules. It suggests that they may acquire a general understanding of language. This attribute has led to extensive research in deciphering their internal representations. A pioneering method involves an experimental setup using human brain data to explore if a translation between brain and neural network representations can be established. Since this technique emerged, more sophisticated NLP models have been developed. In our study, we apply this method to evaluate four new NLP models aiming to identify the one most compatible with brain activity. Additionally, to explore how the brain comprehends text semantically, we alter the text by removing punctuation in four different ways to understand its impact on semantic processing by the human brain. Our findings indicate that the RoBERTa model aligns best with brain activity, outperforming BERT in accuracy according to our metrics. Furthermore, for BERT, higher accuracy was noted when punctuation was excluded, and increased context length did not significantly diminish accuracy compared to the original results with punctuation.

针对自然语言处理（NLP）的现代神经网络并没有根据特定的语言规则进行设计，这表明它们可能获得了对语言的普遍理解。这一属性导致了对其内部表征解密的大规模研究。一种首创的方法涉及使用人类脑数据来探索是否可以在大脑和神经网络表征之间建立翻译的实验设置。自这项技术出现以来，更先进的NLP模型已经被开发出来。在我们的研究中，我们应用这种方法来评估四种新的NLP模型，旨在找出与脑活动最相容的一种。此外，为了探索大脑如何理解文本语义，我们通过四种不同的方式去除标点符号，以了解其对人类大脑语义处理的影响。我们的研究结果表明，RoBERTa模型与脑活动最为吻合，根据我们的指标，其在准确性上超过了BERT。此外，对于BERT而言，在排除标点符号后，其准确性有所提高，而且与原始带有标点符号的结果相比，增加上下文长度并没有显著降低其准确性。

论文及项目相关链接

PDF

Summary

本文探讨了当代神经网络在处理自然语言时的特性，包括其不依赖特定语言学规则的普遍语言理解能力。为研究神经网络的内部表征，一种使用人类脑数据的研究方法应运而生。本文应用此方法评估了四种新的NLP模型，并发现RoBERTa模型与脑活动最为契合。此外，研究还表明，去除文本中的标点符号会影响语义处理，但在BERT模型中，高准确性可在无标点情况下维持，且上下文长度的增加对准确性影响不显著。

Key Takeaways

当代神经网络在处理自然语言时不依赖特定语言学规则，展现出对语言的普遍理解能力。
使用人类脑数据的研究方法正在揭示神经网络的内部表征。
在评估四种新的NLP模型后，发现RoBERTa模型与脑活动最为契合。
去除文本中的标点符号会影响语义处理。
在BERT模型中，去除标点后仍可维持高准确性。
上下文长度的增加对NLP模型的准确性影响不显著。

Cool Papers

点此查看论文截图

KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model

Authors:Xinshuo Hu, Zifei Shan, Xinping Zhao, Zetian Sun, Zhenyu Liu, Dongfang Li, Shaolin Ye, Xinyuan Wei, Qian Chen, Baotian Hu, Min Zhang

As retrieval-augmented generation prevails in large language models, embedding models are becoming increasingly crucial. Despite the growing number of general embedding models, prior work often overlooks the critical role of training data quality. In this work, we introduce KaLM-Embedding, a general multilingual embedding model that leverages a large quantity of cleaner, more diverse, and domain-specific training data. Our model has been trained with key techniques proven to enhance performance: (1) persona-based synthetic data to create diversified examples distilled from LLMs, (2) ranking consistency filtering to remove less informative samples, and (3) semi-homogeneous task batch sampling to improve training efficacy. Departing from traditional BERT-like architectures, we adopt Qwen2-0.5B as the pre-trained model, facilitating the adaptation of auto-regressive language models for general embedding tasks. Extensive evaluations of the MTEB benchmark across multiple languages show that our model outperforms others of comparable size, setting a new standard for multilingual embedding models with <1B parameters.

随着大型语言模型中检索增强生成技术的普及，嵌入模型变得越来越重要。尽管通用嵌入模型的数量不断增加，但之前的研究往往忽视了训练数据质量的重要作用。在这项工作中，我们介绍了KaLM-Embedding，这是一个通用的多语言嵌入模型，它利用大量更清洁、更多样化且特定于域的训练数据。我们的模型采用了一些经过验证的关键技术来提高性能：（1）基于角色的合成数据，从大型语言模型中提炼出多样化的示例；（2）排名一致性过滤以消除信息量较少的样本；（3）半均匀任务批量采样以提高训练效率。我们摒弃了传统的BERT类似架构，采用Qwen2-0.5B作为预训练模型，便于自适应通用嵌入任务的自回归语言模型。在多种语言的MTEB基准测试上的广泛评估表明，我们的模型在同类模型中表现最佳，为参数小于1B的多语言嵌入模型设定了新的标准。

论文及项目相关链接

PDF Technical Report. 23 pages, 6 figures, 10 tables

Summary

基于大型语言模型下的检索增强生成趋势，嵌入模型的重要性日益凸显。针对现有嵌入模型忽视训练数据质量的问题，本文提出了KaLM-Embedding这一通用多语言嵌入模型。该模型利用数量庞大的更清洁、更多样化和特定领域的训练数据，并采用经过验证能提高性能的关键技术，如基于人格的合成数据创建、排名一致性过滤以及半同构任务批量采样等。与传统的BERT类似架构不同，我们使用Qwen2-0.5B作为预训练模型，便于自适应通用嵌入任务的自回归语言模型。在MTEB基准测试上的多项语言评估显示，该模型在同类模型中表现优异，为参数小于1B的多语言嵌入模型树立了新标准。

Key Takeaways