LLM

Published: 2025-11-08

Updated: 2025-11-27

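⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading the papers.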
2025-11-08 Update

Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Authors:Mohammad Atif Quamar, Mohammad Areeb

Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30–35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.
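The stopping rule described in the abstract can be illustrated with a minimal sketch. It assumes the per-step entropies and top-logit margins have already been collected during decoding. The window size, the two thresholds, and all function names are hypothetical choices for illustration, not values or code from the paper.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) for one decoding step."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def top_margin(logits):
    """Gap between the largest and second-largest logit."""
    first, second = sorted(logits, reverse=True)[:2]
    return first - second

def slope(values):
    """Least-squares slope of values against their index (needs >= 2 values)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def should_stop(entropies, margins, window=4, slope_eps=0.01, margin_eps=0.01):
    """Stop once both signals have plateaued over the last `window` steps.

    window, slope_eps and margin_eps are assumed values, not from the paper.
    """
    if len(entropies) < window or len(margins) < window:
        return False
    entropy_flat = abs(slope(entropies[-window:])) < slope_eps
    # "Improvement" is read here as the margin gain across the window.
    margin_flat = (margins[-1] - margins[-window]) < margin_eps
    return entropy_flat and margin_flat
```

In a decoding loop, `token_entropy` and `top_margin` would be appended to running lists at every step, and generation would end the first time `should_stop` returns `True`.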

Paper and project links

PDF Presented at the 1st Workshop on Efficient Reasoning (NeurIPS 2025)

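Summary:

Chain-of-Thought (CoT) prompting is a key technique for complex reasoning in large language models, but generating full, fixed-length rationales wastes compute and inflates token usage and latency. This paper introduces LEASH (Logit-Entropy Adaptive Stopping Heuristic), a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals, the slope of token-level entropy and the improvement in the top-logit margin, and stops generating once both plateau, which indicates that the model has reached a stable reasoning state. Across four instruction-tuned models on GSM8K and AQuA-RAT, LEASH reduces average token generation by 30–35% and latency by 27%, at the cost of a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and needs no additional training or supervision, making it a simple and efficient alternative to CoT decoding.

Key Takeaways:

CoT prompting is key to complex reasoning in large language models.
Generating full, fixed-length rationales is computationally expensive.
LEASH is a training-free decoding algorithm that adaptively halts rationale generation.
LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin.
LEASH reduces average token generation by 30–35% and latency by 27% across four instruction-tuned models.
LEASH incurs a 10 p.p. accuracy drop relative to CoT in exchange for a simple, efficient alternative.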

When retrieval outperforms generation: Dense evidence retrieval for scalable fake news detection

Authors:Alamgir Munir Qazi, John P. McCrae, Jamal Abdul Nasir

The proliferation of misinformation necessitates robust yet computationally efficient fact verification systems. While current state-of-the-art approaches leverage Large Language Models (LLMs) for generating explanatory rationales, these methods face significant computational barriers and hallucination risks in real-world deployments. We present DeReC (Dense Retrieval Classification), a lightweight framework that demonstrates how general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification tasks. By combining dense retrieval with specialized classification, our system achieves better accuracy while being significantly more efficient. DeReC outperforms explanation-generating LLMs in efficiency, reducing runtime by 95% on RAWFC (23 minutes 36 seconds compared to 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds compared to 1692 minutes 23 seconds), showcasing its effectiveness across varying dataset sizes. On the RAWFC dataset, DeReC achieves an F1 score of 65.58%, surpassing the state-of-the-art method L-Defense (61.20%). Our results demonstrate that carefully engineered retrieval-based systems can match or exceed LLM performance in specialized tasks while being significantly more practical for real-world deployment.
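The runtime reductions quoted in the abstract can be checked directly from the reported wall-clock times. The short calculation below uses only the four durations stated above; the helper names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a 'minutes, seconds' duration to seconds."""
    return minutes * 60 + seconds

def runtime_reduction(baseline_s, new_s):
    """Fractional runtime reduction relative to the baseline."""
    return 1.0 - new_s / baseline_s

# Durations as reported in the abstract (baseline first, DeReC second).
rawfc = runtime_reduction(to_seconds(454, 12), to_seconds(23, 36))
liar_raw = runtime_reduction(to_seconds(1692, 23), to_seconds(134, 14))

print(f"RAWFC: {rawfc:.1%}, LIAR-RAW: {liar_raw:.1%}")
```

Both values round to the 95% and 92% figures given in the abstract.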

Paper and project links

PDF

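Summary:

The spread of misinformation calls for fact verification systems that are both robust and computationally efficient. Current state-of-the-art approaches use large language models (LLMs) to generate explanatory rationales, but they face significant computational barriers and hallucination risks in real-world deployment. The paper presents DeReC (Dense Retrieval Classification), a lightweight framework showing that general-purpose text embeddings can effectively replace autoregressive LLM-based approaches in fact verification. By combining dense retrieval with specialized classification, the system achieves better accuracy with much higher efficiency. DeReC reduces runtime by 95% on RAWFC (23 minutes 36 seconds versus 454 minutes 12 seconds) and by 92% on LIAR-RAW (134 minutes 14 seconds versus 1692 minutes 23 seconds). On RAWFC it reaches an F1 score of 65.58%, above the 61.20% of the state-of-the-art L-Defense. The results indicate that carefully engineered retrieval-based systems can match or exceed LLM performance on specialized tasks while being more practical to deploy.

Key Takeaways:

The spread of misinformation highlights the need for efficient fact verification systems.
Current LLM-based fact verification faces computational barriers and hallucination risks.
DeReC combines dense retrieval with specialized classification to improve accuracy and cut compute cost.
On RAWFC, DeReC reaches an F1 score of 65.58%, ahead of L-Defense at 61.20%.
DeReC cuts runtime by 95% on RAWFC and by 92% on LIAR-RAW.
The results suggest that carefully engineered retrieval systems can match or exceed LLMs on specialized tasks.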

PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Authors:Yicheng Xiao, Yu Chen, Haoxuan Ma, Jiale Hong, Caorui Li, Lingxiang Wu, Haiyun Guo, Jinqiao Wang

While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works adopt the strategy of explicitly increasing the granularity of visual information processing, e.g., incorporating visual prompts to guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve the model’s fine-grained vision-language alignment. However, the inherent token length limitation of CLIP’s text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically leverage the advantages of enhancing both visual and textual content processing granularity, we propose PixCLIP, a novel framework designed to concurrently accommodate visual prompt inputs and process lengthy textual descriptions. Specifically, we first establish an automated annotation pipeline capable of generating pixel-level localized, long-form textual descriptions for images. Utilizing this pipeline, we construct LongGRIT, a high-quality dataset comprising nearly 1.5 million samples. Secondly, we replace CLIP’s original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework, facilitating fine-grained alignment between image regions and corresponding textual descriptions at arbitrary granularity. Experiments demonstrate that PixCLIP showcases breakthroughs in pixel-level interaction and handling long-form texts, achieving state-of-the-art performance.

Paper and project links

PDF

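Summary

CLIP performs well on a wide range of vision-language understanding tasks, but its fine-grained image-text alignment still has room to improve. Most existing work raises the granularity of visual processing, for example by using visual prompts to steer the model toward specific local regions of an image. Research on multimodal large language models (MLLMs), meanwhile, shows that training with long, detailed text descriptions improves fine-grained vision-language alignment. The token length limit of CLIP's text encoder, however, prevents it from processing the finer-grained information in long text sequences. The paper proposes PixCLIP, a framework that accepts visual prompt inputs and processes long text descriptions at the same time. The authors build an automated annotation pipeline that generates pixel-level localized, long-form descriptions of images and use it to construct LongGRIT, a high-quality dataset of nearly 1.5 million samples. They also replace CLIP's original text encoder with an LLM and propose a three-branch pixel-text alignment learning framework that aligns image regions with their text descriptions at arbitrary granularity. Experiments show that PixCLIP makes breakthroughs in pixel-level interaction and long-text handling and achieves state-of-the-art performance.

Key Takeaways

CLIP performs well on vision-language understanding tasks, but improving its fine-grained image-text alignment remains a research focus.
Existing work improves performance by raising the granularity of visual information processing.
Training with long, detailed text descriptions improves fine-grained vision-language alignment.
The token length limit of CLIP's text encoder makes it hard to process fine-grained information in long text sequences.
PixCLIP combines visual prompt inputs with long-text processing.
An automated annotation pipeline generates pixel-level localized, long-form descriptions and is used to build the LongGRIT dataset.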

Question the Questions: Auditing Representation in Online Deliberative Processes

Authors:Soham De, Lodewijk Gelauff, Ashish Goel, Smitha Milli, Ariel Procaccia, Alice Siu

A central feature of many deliberative processes, such as citizens’ assemblies and deliberative polls, is the opportunity for participants to engage directly with experts. While participants are typically invited to propose questions for expert panels, only a limited number can be selected due to time constraints. This raises the challenge of how to choose a small set of questions that best represent the interests of all participants. We introduce an auditing framework for measuring the level of representation provided by a slate of questions, based on the social choice concept known as justified representation (JR). We present the first algorithms for auditing JR in the general utility setting, with our most efficient algorithm achieving a runtime of $O(mn\log n)$, where $n$ is the number of participants and $m$ is the number of proposed questions. We apply our auditing methods to historical deliberations, comparing the representativeness of (a) the actual questions posed to the expert panel (chosen by a moderator), (b) participants’ questions chosen via integer linear programming, and (c) summary questions generated by large language models (LLMs). Our results highlight both the promise and current limitations of LLMs in supporting deliberative processes. By integrating our methods into an online deliberation platform that has been used for hundreds of deliberations across more than 50 countries, we make it easy for practitioners to audit and improve representation in future deliberations.
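To make the JR audit concrete, here is a minimal sketch for the simpler approval-based setting, where each participant either approves a question or does not. This is a deliberate simplification: the paper works in the general utility setting, and its $O(mn\log n)$ algorithm is not reproduced here. The function and argument names are hypothetical. In the approval setting, a slate of $k$ questions violates JR if at least $n/k$ participants all approve some common question while none of them approves anything on the slate.

```python
def jr_violations(approvals, slate, candidates):
    """Return the questions that witness a JR violation for `slate`.

    approvals:  list of sets; approvals[i] holds the questions participant i approves.
    slate:      set of selected questions (its size is k).
    candidates: iterable of all proposed questions.
    """
    n, k = len(approvals), len(slate)
    # Participants who approve nothing on the slate are unrepresented.
    unrepresented = [a for a in approvals if not (a & slate)]
    violations = []
    for question in candidates:
        supporters = sum(1 for a in unrepresented if question in a)
        # A group of at least n/k unrepresented participants agreeing on one
        # question means the slate fails JR (compared without division).
        if supporters * k >= n:
            violations.append(question)
    return violations
```

An empty result means the slate satisfies JR under these approval ballots. This check runs in time proportional to the number of participants times the number of questions.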

Paper and project links

PDF

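Summary
A central feature of deliberative processes such as citizens' assemblies and deliberative polls is that participants can engage directly with experts. Because of time constraints, only a few of the questions participants propose can be put to the expert panel, so the challenge is to choose a small set of questions that best represents the interests of all participants. The paper introduces an auditing framework, based on the social choice concept of justified representation (JR), that measures how well a slate of questions represents participants. It presents the first algorithms for auditing JR in the general utility setting; the most efficient runs in $O(mn\log n)$ time, where $n$ is the number of participants and $m$ is the number of proposed questions. The authors apply the audit to historical deliberations and compare the representativeness of (a) the questions actually posed to the expert panel, chosen by a moderator, (b) participants' questions chosen via integer linear programming, and (c) summary questions generated by large language models (LLMs). The results highlight both the promise and the current limitations of LLMs in supporting deliberative processes. The methods are integrated into an online deliberation platform that has been used for hundreds of deliberations in more than 50 countries, so practitioners can easily audit and improve representation in future deliberations.

Key Takeaways

Direct engagement between participants and experts is a central feature of deliberative processes such as citizens' assemblies and deliberative polls.
Time constraints make it a challenge to choose which questions are put to the experts.
The paper proposes an auditing framework based on justified representation (JR) to measure how representative a slate of questions is.
It gives the first algorithms for auditing JR in the general utility setting; the most efficient runs in $O(mn\log n)$ time.
An audit of historical deliberations compares moderator-chosen questions, questions chosen via integer linear programming, and LLM-generated summary questions.
The results show both the promise and the current limitations of LLMs in supporting deliberative processes.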

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

Authors:Jingqi Tong, Yurong Mou, Hangcheng Li, Mingzhe Li, Yongzhuo Yang, Ming Zhang, Qiguang Chen, Tianyi Liang, Xiaomeng Hu, Yining Zheng, Xinchi Chen, Jun Zhao, Xuanjing Huang, Xipeng Qiu

The “Thinking with Text” and “Thinking with Images” paradigms significantly improve the reasoning ability of large language models (LLMs) and Vision Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are separated as distinct modalities, which hinders unified multimodal understanding and generation. To overcome these limitations, we introduce “Thinking with Video”, a new paradigm that leverages video generation models, such as Sora-2, to bridge visual and textual reasoning in a unified temporal framework. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench). VideoThinkBench encompasses two task categories: (1) vision-centric tasks (e.g., Eyeballing Puzzles), and (2) text-centric tasks (e.g., subsets of GSM8K, MMMU). Our evaluation establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is generally comparable to state-of-the-art (SOTA) VLMs, and even surpasses VLMs on several tasks, such as Eyeballing Games. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH, and 75.53% accuracy on MMMU. Furthermore, we systematically analyse the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2’s performance. In summary, our findings demonstrate that video generation models are potential unified multimodal understanding and generation models, positioning “Thinking with Video” as a unified multimodal reasoning paradigm.
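Self-consistency, mentioned above as one way to improve performance, is commonly implemented as a majority vote over several independently sampled answers. The sketch below shows only that voting step. It assumes a final answer string has already been extracted from each generated sample; how the paper extracts answers from videos is not described in the abstract, and the function name is hypothetical.

```python
from collections import Counter

def self_consistent_answer(samples):
    """Return the most frequent answer among sampled generations.

    Answers are compared after trimming whitespace and lowercasing;
    empty samples are ignored. Returns None if nothing usable remains.
    """
    counts = Counter(s.strip().lower() for s in samples if s.strip())
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For example, five samples that yield `"42"`, `"41"`, `"42"`, `"42"`, and `"40"` would produce `"42"` as the aggregated answer.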

Paper and project links

PDF 36 pages, 14 figures

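Summary

The “Thinking with Text” and “Thinking with Images” paradigms significantly improve the reasoning ability of large language models (LLMs) and vision language models (VLMs), but they have inherent limitations: images capture only single moments and cannot represent dynamic processes or continuous change, and treating text and vision as separate modalities hinders unified multimodal understanding and generation. To address this, the paper introduces “Thinking with Video”, a paradigm that uses video generation models such as Sora-2 to bridge visual and textual reasoning in a unified temporal framework. To support the exploration, the authors develop the Video Thinking Benchmark (VideoThinkBench), which covers vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., subsets of GSM8K and MMMU). The evaluation shows Sora-2 to be a capable reasoner. On vision-centric tasks it is generally comparable to state-of-the-art VLMs and surpasses them on several tasks, such as Eyeballing Games. On text-centric tasks it reaches 92% accuracy on MATH and 75.53% on MMMU. The authors also analyse the source of these abilities and find that self-consistency and in-context learning improve Sora-2's performance. Overall, the findings suggest that video generation models are potential unified multimodal understanding and generation models, positioning “Thinking with Video” as a unified multimodal reasoning paradigm.

Key Takeaways

“Thinking with Video” uses video generation models such as Sora-2 to bridge visual and textual reasoning in a unified temporal framework, addressing limitations of the text-based and image-based reasoning paradigms.
VideoThinkBench covers vision-centric and text-centric task categories for evaluating video generation models.
Sora-2 is generally comparable to state-of-the-art VLMs on vision-centric tasks and surpasses them on several.
On text-centric tasks, Sora-2 reaches 92% accuracy on MATH and 75.53% on MMMU.
Self-consistency and in-context learning can improve Sora-2's performance.
Video generation models are potential unified multimodal understanding and generation models.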

Large Language Models for Cyber Security

Authors:Raunak Somani, Aswani Kumar Cherukuri

This paper studies the integration of Large Language Models (LLMs) into cybersecurity tools and protocols. The main issue discussed is that traditional rule-based and signature-based security systems are not enough to deal with modern AI-powered cyber threats. The cybersecurity industry is changing as threats become more dangerous and adaptive by leveraging the features provided by AI tools. Integrating LLMs into these tools and protocols makes the systems scalable, context-aware, and intelligent, which helps them mitigate these evolving cyber threats. The paper studies the architecture and functioning of LLMs and their integration into Encrypted Prompts to prevent prompt injection attacks. It also studies the integration of LLMs into cybersecurity tools using a four-layered architecture. Finally, the paper explains various ways of integrating LLMs into a traditional Intrusion Detection System (IDS) and enhancing its original abilities along various dimensions. The key findings are: (i) Encrypted Prompts with LLMs are an effective way to mitigate prompt injection attacks; (ii) LLM-enhanced cybersecurity tools are more accurate, scalable, and adaptable to new threats than traditional models; (iii) the decoupled model approach is the best way to integrate LLMs into an IDS because it is the most accurate.

Paper and project links

PDF

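Summary

This paper examines the integration of large language models (LLMs) into cybersecurity tools and protocols. It argues that traditional rule-based and signature-based security systems cannot cope with modern AI-powered cyber threats, and that the cybersecurity industry is changing as threats use AI tools to become more dangerous and adaptive. Integrating LLMs makes systems scalable, context-aware, and intelligent, which helps them mitigate evolving threats. The paper studies the architecture and functioning of LLMs and their use with Encrypted Prompts to prevent prompt injection attacks. It also examines a four-layered architecture for integrating LLMs into cybersecurity tools and explains ways to integrate them into a traditional Intrusion Detection System (IDS) to extend its original capabilities. The key findings are that Encrypted Prompts with LLMs are an effective way to mitigate prompt injection attacks; that LLM-enhanced cybersecurity tools are more accurate, scalable, and adaptable to new threats than traditional models; and that the decoupled model approach is the best way to integrate LLMs into an IDS because it is the most accurate.

Key Takeaways

Integrating LLMs into cybersecurity tools and protocols makes systems scalable, context-aware, and intelligent in the face of modern threats.
Encrypted Prompts with LLMs are an effective way to mitigate prompt injection attacks.
Traditional rule-based and signature-based security systems cannot adequately handle modern AI-powered cyber threats.
LLM-enhanced cybersecurity tools are more accurate and scalable than traditional models and adapt better to new threats.
A four-layered architecture is used to integrate LLMs into cybersecurity tools.
The decoupled model approach is the best way to integrate LLMs into an IDS because it is the most accurate.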

Decoding Emergent Big Five Traits in Large Language Models: Temperature-Dependent Expression and Architectural Clustering

Authors:Christos-Nikolaos Zacharopoulos, Revekka Kyriakoglou

As Large Language Models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. This paper systematically evaluates six LLMs, applying the Big Five Inventory-2 (BFI-2) framework, to assess trait expressions under varying sampling temperatures. We find significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Further, hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. Taken together, these results offer new insights into the emergence of personality-like patterns in LLMs and provide a new perspective on model tuning, selection, and the ethical governance of AI systems. We share the data and code for this analysis here: https://osf.io/bsvzc/?view_only=6672219bede24b4e875097426dc3fac1

Paper and project links

PDF Accepted at IJCNLP-AACL 2025

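Summary

As large language models (LLMs) become integral to human-centered applications, understanding their personality-like behaviors is increasingly important for responsible development and deployment. The paper systematically evaluates six LLMs with the Big Five Inventory-2 (BFI-2) framework to assess trait expression under varying sampling temperatures. It finds significant differences across four of the five personality dimensions, with Neuroticism and Extraversion susceptible to temperature adjustments. Hierarchical clustering reveals distinct model clusters, suggesting that architectural features may predispose certain models toward stable trait profiles. The results offer new insight into the emergence of personality-like patterns in LLMs and a new perspective on model tuning, selection, and the ethical governance of AI systems.

Key Takeaways

LLMs are becoming integral to human-centered applications, so understanding their personality-like behaviors matters.
Evaluating six LLMs with the BFI-2 framework shows significant differences across four of the five personality dimensions.
Neuroticism and Extraversion are susceptible to sampling temperature adjustments.
Hierarchical clustering reveals distinct model clusters, which may be related to architectural features.
The findings bear on model tuning, model selection, and the ethical governance of AI systems.
The authors share the data and code for the analysis.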

Promoting Sustainable Web Agents: Benchmarking and Estimating Energy Consumption through Empirical and Theoretical Analysis

Authors:Lars Krupp, Daniel Geißler, Vishal Banwari, Paul Lukowicz, Jakob Karolus

Web agents, like OpenAI’s Operator and Google’s Project Mariner, are powerful agentic systems pushing the boundaries of Large Language Models (LLMs). They can autonomously interact with the internet at the user’s behest, such as navigating websites, filling search masks, and comparing price lists. Though web agent research is thriving, induced sustainability issues remain largely unexplored. To highlight the urgency of this issue, we provide an initial exploration of the energy and $CO_2$ cost associated with web agents from both a theoretical perspective (via estimation) and an empirical perspective (by benchmarking). Our results show how different philosophies in web agent creation can severely impact the associated expended energy, and that more energy consumed does not necessarily equate to better results. We highlight a lack of transparency regarding disclosing model parameters and processes used for some web agents as a limiting factor when estimating energy consumption. Our work contributes towards a change in thinking of how we evaluate web agents, advocating for dedicated metrics measuring energy consumption in benchmarks.

Paper and project links

PDF Accepted by AAAI 2026 AISI

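Summary
Web agents such as OpenAI's Operator and Google's Project Mariner are pushing the boundaries of large language models (LLMs), and research on them is thriving, yet the sustainability issues they induce remain largely unexplored. The paper gives an initial exploration of the energy and $CO_2$ cost of web agents from both a theoretical perspective (estimation) and an empirical one (benchmarking). The results show that different philosophies in web agent creation can severely affect the energy expended, and that consuming more energy does not necessarily yield better results. A lack of transparency about the model parameters and processes used by some web agents limits how well energy consumption can be estimated. The authors call for a change in how web agents are evaluated and advocate dedicated energy consumption metrics in benchmarks.

Key Takeaways

Web agents such as OpenAI's Operator and Google's Project Mariner are pushing the boundaries of LLMs.
The sustainability impact of web agents remains largely unexplored.
Web agents carry energy and $CO_2$ costs.
Different philosophies in web agent creation can severely affect energy consumption.
Higher energy consumption does not necessarily mean better results.
A lack of transparency about model parameters and processes is a limiting factor when estimating energy consumption.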

TwIST: Rigging the Lottery in Transformers with Independent Subnetwork Training

Authors:Michael Menezes, Barbara Su, Xinze Feng, Yehya Farhat, Hamza Shili, Anastasios Kyrillidis

We introduce TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks (“golden tickets”) without requiring post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST significantly outperforms baseline methods; for example, reaching 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that offer practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that do not support efficient sparse computation. TwIST provides an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.
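The train, aggregate, and resample loop described above can be sketched on a toy parameter vector. Everything here is an assumption made for illustration: the abstract does not say how subnetworks are sampled or how overlapping parameters are merged, so this sketch samples random unit subsets and averages each unit over the workers that trained it. The names `sample_masks`, `aggregate`, and `train_round` are hypothetical, and `local_step` stands in for real local training.

```python
import random

def sample_masks(num_units, num_workers, keep, rng):
    """Give each worker a random subset of `keep` unit indices (assumed scheme)."""
    return [set(rng.sample(range(num_units), keep)) for _ in range(num_workers)]

def aggregate(global_params, worker_params, masks):
    """Average each unit over the workers that trained it; keep the rest as-is.

    Averaging overlapping units is an assumption, not the paper's stated rule.
    """
    merged = list(global_params)
    for unit in range(len(global_params)):
        trained = [wp[unit] for wp, mask in zip(worker_params, masks) if unit in mask]
        if trained:
            merged[unit] = sum(trained) / len(trained)
    return merged

def train_round(global_params, num_workers, keep, local_step, rng):
    """One round: resample subnetworks, train each locally, then aggregate."""
    masks = sample_masks(len(global_params), num_workers, keep, rng)
    worker_params = [
        {unit: local_step(global_params[unit]) for unit in mask} for mask in masks
    ]
    return aggregate(global_params, worker_params, masks)
```

Calling `train_round` repeatedly with a fresh draw from `rng` each time mirrors the periodic resampling the abstract describes. In a real system each worker would run in parallel and `local_step` would be several optimizer steps on that worker's subnetwork.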

Paper and project links

PDF

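Summary

The paper introduces TwIST, a distributed training framework for efficient large language model (LLM) sparsification. TwIST trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks during training. This process identifies high-quality subnetworks (“golden tickets”) without post-training procedures such as calibration or Hessian-based recovery. As a result, TwIST enables zero-cost pruning at deployment time while achieving perplexity competitive with state-of-the-art post-training sparsification methods. The benefits are most pronounced under aggressive sparsity (e.g., 50%+), where TwIST reaches 23.14 PPL compared to 31.64 for the closest prior approach. Unlike unstructured pruning, TwIST produces structured, dense matrices that give practical inference speedups and memory reductions on commodity hardware (e.g., CPUs) that does not support efficient sparse computation. TwIST thus offers an efficient training-time path to deployable sparse LLMs without additional fine-tuning or recovery overhead.

Key Takeaways

TwIST is a distributed training framework for LLM sparsification that identifies high-quality subnetworks during training.
It trains multiple subnetworks in parallel, periodically aggregates their parameters, and resamples new subnetworks.
TwIST enables zero-cost pruning at deployment time.
Under aggressive sparsity (50%+), TwIST reaches 23.14 PPL compared to 31.64 for the closest prior approach.
Unlike unstructured pruning, TwIST produces structured, dense matrices.
TwIST gives inference speedups and memory reductions on commodity hardware such as CPUs.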

Secure Code Generation at Scale with Reflexion

Authors:Arup Datta, Ahmed Aljohani, Hyunsook Do

Large language models (LLMs) are now widely used to draft and refactor code, but code that works is not necessarily secure. We evaluate secure code generation using the Instruct Prime, which eliminated compliance-required prompts and cue contamination, and evaluate five instruction-tuned code LLMs using a zero-shot baseline and a three-round reflexion prompting approach. Security is measured using the Insecure Code Detector (ICD), and results are reported by measuring Repair, Regression, and NetGain metrics, considering the programming language and CWE family. Our findings show that insecurity remains common at the first round: roughly 25-33% of programs are insecure at a zero-shot baseline (t0). Weak cryptography/config-dependent bugs are the hardest to avoid while templated ones like XSS, code injection, and hard-coded secrets are handled more reliably. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, improving average accuracy from 70.74% at t0 to 79.43% at t3, with the largest gains in the first round followed by diminishing returns. The trends with Repair, Regression, and NetGain metrics show that applying one to two rounds produces most of the benefits. A replication package is available at https://doi.org/10.5281/zenodo.17065846.
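The abstract names Repair, Regression, and NetGain but does not define them. The sketch below uses one plausible reading, stated here as an assumption: Repair is the share of programs that go from insecure to secure in a round, Regression is the share that go from secure to insecure, and NetGain is their difference. The paper's exact definitions may differ, and the function name is hypothetical.

```python
def round_metrics(before, after):
    """Compare per-program security flags (True = secure) across one round.

    The definitions below are assumed for illustration, not taken from the paper.
    """
    if len(before) != len(after) or not before:
        raise ValueError("flag lists must be non-empty and equal in length")
    n = len(before)
    repaired = sum(1 for b, a in zip(before, after) if not b and a)
    regressed = sum(1 for b, a in zip(before, after) if b and not a)
    return {
        "repair": repaired / n,
        "regression": regressed / n,
        "net_gain": (repaired - regressed) / n,
    }
```

Under this reading, the reported change in average accuracy from 70.74% at t0 to 79.43% at t3 corresponds to a cumulative net gain of 8.69 percentage points over three rounds.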

Paper and project links

PDF Accepted for publication at the 2nd IEEE International Conference on AI-powered Software (AIware 2025)

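Summary

Large language models (LLMs) are widely used to draft and refactor code, but code that works is not necessarily secure. The study evaluates secure code generation using the Instruct Prime, which eliminated compliance-required prompts and cue contamination, and assesses five instruction-tuned code LLMs with a zero-shot baseline and a three-round reflexion prompting approach. Security is measured with the Insecure Code Detector (ICD), and results are reported through Repair, Regression, and NetGain metrics, broken down by programming language and CWE family. Insecurity remains common at the first round: roughly 25-33% of programs are insecure at the zero-shot baseline (t0). Weak cryptography and config-dependent bugs are the hardest to avoid, while templated ones such as XSS, code injection, and hard-coded secrets are handled more reliably. Python yields the highest secure rates; C and C# are the lowest, with Java, JS, PHP, and C++ in the middle. Reflexion prompting improves security for all models, raising average accuracy from 70.74% at t0 to 79.43% at t3, with the largest gains in the first round followed by diminishing returns. The Repair, Regression, and NetGain trends show that one to two rounds produce most of the benefit. A replication package is available at https://doi.org/10.5281/zenodo.17065846.

Key Takeaways

LLMs are widely used for code generation, but the code they produce may be insecure.
The evaluation uses the Instruct Prime, which eliminated compliance-required prompts and cue contamination.
Security is measured with the Insecure Code Detector (ICD), broken down by programming language and CWE family.
Roughly 25-33% of programs are insecure at the zero-shot baseline.
Some bug types, such as weak cryptography and config-dependent bugs, are the hardest to avoid.
Python yields the highest secure rates and C and C# the lowest, with Java, JS, PHP, and C++ in the middle.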

How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis

Authors:Ahmed Mostafa, Raisul Arefin Nahid, Samuel Mulder

Tokenization is fundamental in assembly code analysis, impacting intrinsic characteristics like vocabulary size, semantic coverage, and extrinsic performance in downstream tasks. Despite its significance, tokenization in the context of assembly code remains an underexplored area. This study aims to address this gap by evaluating the intrinsic properties of Natural Language Processing (NLP) tokenization models and parameter choices, such as vocabulary size. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. Additionally, we assess their impact on downstream tasks like function signature prediction – a critical problem in binary code analysis. To this end, we conduct a thorough study on various tokenization models, systematically analyzing their efficiency in encoding assembly instructions and capturing semantic nuances. Through intrinsic evaluations, we compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code. Using state-of-the-art pre-trained models such as the decoder-only Large Language Model (LLM) Llama 3.2, the encoder-only transformer BERT, and the encoder-decoder model BART, we evaluate the effectiveness of these tokenizers across multiple performance metrics. Preliminary findings indicate that tokenizer choice significantly influences downstream performance, with intrinsic metrics providing partial but incomplete predictability of extrinsic evaluation outcomes. These results reveal complex trade-offs between intrinsic tokenizer properties and their utility in practical assembly code tasks. Ultimately, this study provides valuable insights into optimizing tokenization models for low-level code analysis, contributing to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.
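As a rough illustration of what a pre-tokenization rule for assembly code and an intrinsic metric might look like, the sketch below splits instructions with a regular expression and reports the average number of pieces per instruction. The rule and both function names are hypothetical and are not taken from the paper, which evaluates trained NLP tokenizers rather than a fixed regex.

```python
import re

# Hypothetical pre-tokenization rule: hex immediates, decimal immediates,
# identifiers (mnemonics, registers, labels), then single punctuation marks.
_PIECE = re.compile(r"0x[0-9a-fA-F]+|\d+|[A-Za-z_.][\w.]*|[^\s\w]")

def pre_tokenize(instruction):
    """Split one assembly instruction into coarse pieces."""
    return _PIECE.findall(instruction)

def pieces_per_instruction(instructions):
    """Average piece count per instruction, a simple efficiency measure."""
    total = sum(len(pre_tokenize(ins)) for ins in instructions)
    return total / len(instructions)
```

With this rule, `pre_tokenize("mov eax, 0x10")` yields `["mov", "eax", ",", "0x10"]`. A subword tokenizer trained on top of such pieces could then be compared with others on measures like tokens per instruction, alongside the downstream results the abstract describes.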

Paper and project links

PDF Publication Notice. This paper was published in the BAR 2025 Workshop (with NDSS 2025) and is for research and educational use. Copyright © 2025 Internet Society. All rights reserved. Personal/classroom reproduction is permitted with this notice and full paper citation. All other uses, including commercial, require prior written permission from the Internet Society

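Summary

The paper studies tokenization in assembly code analysis, examining how the intrinsic properties of Natural Language Processing (NLP) tokenization models and their parameter choices affect the analysis. It evaluates a range of tokenization models for how efficiently they encode assembly instructions and capture semantic nuances, and it assesses their effect on downstream tasks such as function signature prediction. Preliminary findings indicate that tokenizer choice significantly influences downstream performance. The study offers insights into optimizing tokenization models for low-level code analysis and contributes to the robustness and scalability of Natural Language Model (NLM)-based binary analysis workflows.

Key Takeaways

Tokenization is fundamental in assembly code analysis, affecting intrinsic characteristics such as vocabulary size and semantic coverage as well as downstream task performance.
The study focuses on the intrinsic properties of NLP tokenization models and parameter choices such as vocabulary size.
It evaluates how efficiently various tokenization models encode assembly instructions and capture semantic nuances.
It uses pre-trained models including the decoder-only LLM Llama 3.2, the encoder-only BERT, and the encoder-decoder BART to evaluate the tokenizers.
Preliminary findings indicate that tokenizer choice significantly influences downstream performance.
Intrinsic metrics only partially predict extrinsic evaluation outcomes, revealing complex trade-offs.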

SurgViVQA: Temporally-Grounded Video Question Answering for Surgical Scene Understanding

Authors:Mauro Orazio Drago, Luca Carlini, Pelinsu Celebi Balyemez, Dennis Pierantozzi, Chiara Lena, Cesare Hassan, Danail Stoyanov, Elena De Momi, Sophia Bano, Mobarak I. Hoque

Video Question Answering (VideoQA) in the surgical domain aims to enhance intraoperative understanding by enabling AI models to reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, ignoring the dynamics critical for accurate procedural interpretation. We propose SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video–Text Encoder to fuse video and question features, capturing temporal cues such as motion and tool–tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate its performance, we curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions and diagnostic attributes, as well as out-of-template questions with rephrased or semantically altered formulations to assess model robustness. Experimental validation on REAL-Colon-VQA and the public EndoVis18-VQA dataset shows that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and the REAL-Colon-VQA dataset provide a framework for temporally-aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively. Code and dataset available at https://github.com/madratak/SurgViVQA.

Paper and project links

PDF

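Summary
Video Question Answering (VideoQA) in the surgical domain aims to improve intraoperative understanding by letting AI models reason over temporally coherent events rather than isolated frames. Current approaches are limited to static image features, and available datasets often lack temporal annotations, so they ignore the dynamics that accurate procedural interpretation depends on. The paper proposes SurgViVQA, a surgical VideoQA model that extends visual reasoning from static images to dynamic surgical scenes. It uses a Masked Video-Text Encoder to fuse video and question features and capture temporal cues such as motion and tool-tissue interactions, which a fine-tuned large language model (LLM) then decodes into coherent answers. To evaluate it, the authors curated REAL-Colon-VQA, a colonoscopic video dataset that includes motion-related questions, diagnostic attributes, and out-of-template questions with rephrased or semantically altered formulations to assess robustness. Experiments on REAL-Colon-VQA and the public EndoVis18-VQA dataset show that SurgViVQA outperforms existing image-based VQA benchmark models, particularly in keyword accuracy, improving over PitVQA by +11% on REAL-Colon-VQA and +9% on EndoVis18-VQA. A perturbation study on the questions further confirms improved generalizability and robustness to variations in question phrasing. SurgViVQA and REAL-Colon-VQA provide a framework for temporally aware understanding in surgical VideoQA, enabling AI models to interpret dynamic procedural contexts more effectively.

Key Takeaways

Surgical VideoQA aims to improve intraoperative understanding by letting AI models reason over temporally coherent events.
Current approaches are limited to static image features and miss the dynamics of surgical scenes.
SurgViVQA is a surgical VideoQA model that extends visual reasoning to dynamic surgical scenes.
It uses a Masked Video-Text Encoder to fuse video and question features and capture temporal cues such as motion and tool-tissue interactions.
A fine-tuned large language model (LLM) decodes the fused features into coherent answers.
The new REAL-Colon-VQA dataset includes motion-related questions, diagnostic attributes, and rephrased questions for evaluating the model.
Experiments show that SurgViVQA outperforms existing benchmark models in keyword accuracy and is more generalizable and robust.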

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

Authors:Jonathan Li, Nasim Farahini, Evgenii Iuliugin, Magnus Vesterlund, Christian Haggstrom, Guangtao Wang, Shubhangi Upasani, Ayush Sachdeva, Rui Li, Faline Fu, Chen Wu, Ayesha Siddiqua, John Long, Tuowen Zhao, Matheen Musaddiq, Hakan Zeffer, Yun Du, Mingran Wang, Qinghua Li, Bo Li, Urmish Thakker, Raghu Prabhakar

The proliferation of 100B+ parameter Large Language Models (LLMs) with 100k+ context length support have resulted in increasing demands for on-chip memory to support large KV caches. Techniques such as StreamingLLM and SnapKV demonstrate how to control KV cache size while maintaining model accuracy. Yet, these techniques are not commonly used within industrial deployments using frameworks like vLLM or SGLang. The reason is twofold: on one hand, the static graphs and continuous batching methodology employed by these frameworks make it difficult to admit modifications to the standard multi-head attention algorithm, while on the other hand, the accuracy implications of such techniques on modern instruction-following and reasoning models are not well understood, obfuscating the need for implementing these techniques. In this paper, we explore these accuracy implications on Llama-3.1-8B-Instruct and DeepSeek-R1, and develop SnapStream, a KV cache compression method that can be deployed at scale. We demonstrate the efficacy of SnapStream in a 16-way tensor-parallel deployment of DeepSeek-671B on SambaNova SN40L accelerators running at 128k context length and up to 1832 tokens per second in a real production setting. SnapStream enables $4\times$ improved on-chip memory usage and introduces minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the best of our knowledge, this is the first implementation of sparse KV attention techniques deployed in a production inference system with static graphs and continuous batching.

Paper and project links

PDF

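Summary

Large language models (LLMs) with 100B+ parameters and 100k+ context lengths place growing demands on on-chip memory for large KV caches. Techniques such as StreamingLLM and SnapKV control KV cache size while maintaining model accuracy, yet they are rarely used in industrial deployments built on frameworks such as vLLM or SGLang, for two reasons: the static graphs and continuous batching these frameworks use make it hard to modify the standard multi-head attention algorithm, and the accuracy impact of such techniques on modern instruction-following and reasoning models is not well understood. This paper studies that accuracy impact on Llama-3.1-8B-Instruct and DeepSeek-R1 and develops SnapStream, a KV cache compression method that can be deployed at scale. In a 16-way tensor-parallel production deployment of DeepSeek-671B on SambaNova SN40L accelerators, running at 128k context length and up to 1832 tokens per second, SnapStream yields a $4\times$ improvement in on-chip memory usage with minimal accuracy degradation on LongBench-v2, AIME24 and LiveCodeBench. To the authors' knowledge, it is the first sparse KV attention technique deployed in a production inference system with static graphs and continuous batching.

Key Takeaways

LLMs with 100B+ parameters and 100k+ context lengths increase the demand for on-chip memory to hold large KV caches.
Techniques such as StreamingLLM and SnapKV control KV cache size while maintaining model accuracy.
Frameworks such as vLLM and SGLang rarely use them, because static graphs and continuous batching make the attention algorithm hard to modify and because the accuracy impact on modern models is unclear.
The paper studies this accuracy impact on Llama-3.1-8B-Instruct and DeepSeek-R1.
SnapStream is a KV cache compression method that can be deployed at scale in production.
In a DeepSeek-671B deployment on SambaNova SN40L accelerators, SnapStream improves on-chip memory usage by $4\times$ with minimal accuracy degradation.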
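
To make "controlling KV cache size" concrete, here is a minimal sketch of a sink-plus-recent-window retention policy, in the spirit of what StreamingLLM is generally described as doing. It is not SnapStream's algorithm, which the abstract does not spell out. The function names and the `num_sink` and `window` parameters are assumptions chosen for illustration.

```python
def streaming_keep_indices(seq_len, num_sink=4, window=8):
    """Return the cache positions a sink-plus-recent-window policy keeps.

    num_sink: how many of the earliest positions to retain (hypothetical default).
    window:   how many of the most recent positions to retain (hypothetical default).
    """
    if seq_len <= num_sink + window:
        # The cache still fits in the budget, so nothing is evicted.
        return list(range(seq_len))
    sinks = list(range(num_sink))
    recent = list(range(seq_len - window, seq_len))
    return sinks + recent


def evict(cache, num_sink=4, window=8):
    """Apply the policy to a per-token cache (any list, one entry per position)."""
    keep = streaming_keep_indices(len(cache), num_sink, window)
    return [cache[i] for i in keep]


# Usage: a 20-token cache trimmed to 2 sink positions plus the last 3.
print(streaming_keep_indices(20, num_sink=2, window=3))
```

The cache size stays bounded by `num_sink + window` regardless of sequence length, which is the property that matters when on-chip memory is the constraint.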

OceanAI: A Conversational Platform for Accurate, Transparent, Near-Real-Time Oceanographic Insights

Authors:Bowen Chen, Jayesh Gajbhar, Gregory Dusek, Rob Redmon, Patrick Hogan, Paul Liu, DelWayne Bohnenstiehl, Dongkuan Xu, Ruoying He

Artificial intelligence is transforming the sciences, yet general conversational AI systems often generate unverified “hallucinations” undermining scientific rigor. We present OceanAI, a conversational platform that integrates the natural-language fluency of open-source large language models (LLMs) with real-time, parameterized access to authoritative oceanographic data streams hosted by the National Oceanic and Atmospheric Administration (NOAA). Each query such as “What was Boston Harbor’s highest water level in 2024?” triggers real-time API calls that identify, parse, and synthesize relevant datasets into reproducible natural-language responses and data visualizations. In a blind comparison with three widely used AI chat-interface products, only OceanAI produced NOAA-sourced values with original data references; others either declined to answer or provided unsupported results. Designed for extensibility, OceanAI connects to multiple NOAA data products and variables, supporting applications in marine hazard forecasting, ecosystem assessment, and water-quality monitoring. By grounding outputs and verifiable observations, OceanAI advances transparency, reproducibility, and trust, offering a scalable framework for AI-enabled decision support within the oceans. A public demonstration is available at https://oceanai.ai4ocean.xyz.

Paper and project links

PDF A related presentation will be given at the AGU (American Geophysical Union) and AMS (American Meteorological Society) Annual Meetings

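Summary
Artificial intelligence is transforming the sciences, but general-purpose conversational AI systems often produce unverified “hallucinations” that undermine scientific rigor. OceanAI is a conversational platform that combines the natural-language fluency of open-source large language models (LLMs) with real-time, parameterized access to authoritative oceanographic data streams hosted by the National Oceanic and Atmospheric Administration (NOAA). A query such as “What was Boston Harbor’s highest water level in 2024?” triggers real-time API calls that identify, parse, and synthesize the relevant datasets into reproducible natural-language responses and data visualizations. In a blind comparison with three widely used AI chat products, only OceanAI produced NOAA-sourced values with original data references; the others either declined to answer or gave unsupported results. Designed for extensibility, OceanAI connects to multiple NOAA data products and variables and supports applications in marine hazard forecasting, ecosystem assessment, and water-quality monitoring. By grounding outputs in verifiable observations, it advances transparency, reproducibility, and trust, offering a scalable framework for AI-enabled decision support in the ocean domain.

Key Takeaways

OceanAI is a conversational platform that combines natural-language interaction with real-time oceanographic data streams.
It pairs the fluency of open-source LLMs with parameterized, real-time access to authoritative NOAA oceanographic data.
For a specific query, such as the highest water level at a given location, it returns reproducible natural-language responses and data visualizations.
In a blind comparison with three widely used AI chat products, only OceanAI returned NOAA-sourced values with original data references.
The platform is extensible and connects to multiple NOAA data products and variables.
Target applications include marine hazard forecasting, ecosystem assessment, and water-quality monitoring.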

Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs

Authors:Wanyun Cui, Mingwei Xu

Recent advances in Large Language Models (LLMs) have highlighted the critical importance of extending context length, yet the quadratic complexity of attention mechanisms poses significant challenges for efficient long-context modeling. KV cache compression has emerged as a key approach to address this challenge. Through extensive empirical analysis, we reveal a fundamental yet previously overlooked asymmetry in KV caches: while adjacent keys receive similar attention weights (local homogeneity), adjacent values demonstrate distinct heterogeneous distributions. This key-value asymmetry reveals a critical limitation in existing compression methods that treat keys and values uniformly. To address the limitation, we propose a training-free compression framework (AsymKV) that combines homogeneity-based key merging with a mathematically proven lossless value compression. Extensive experiments demonstrate that AsymKV consistently outperforms existing long-context methods across various tasks and base models. For example, on LLaMA3.1-8B, AsymKV achieves an average score of 43.95 on LongBench, surpassing SOTA methods like H$_2$O (38.89) by a large margin. Our code can be found in this link: https://github.com/the-scale-lab/Asymkv.

Paper and project links

PDF 14 pages, 7 figures; accepted by NeurIPS 2025

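Summary

Extending context length is critical for LLMs, but the quadratic complexity of attention makes efficient long-context modeling hard, and KV cache compression has emerged as a key way to address this. The paper's empirical analysis reveals a previously overlooked asymmetry in KV caches: adjacent keys receive similar attention weights (local homogeneity), whereas adjacent values have distinct, heterogeneous distributions. This exposes a limitation of existing compression methods that treat keys and values uniformly. The authors propose AsymKV, a training-free compression framework that combines homogeneity-based key merging with a mathematically proven lossless value compression. Experiments show that AsymKV consistently outperforms existing long-context methods across tasks and base models; for example, on LLaMA3.1-8B it reaches an average LongBench score of 43.95, well above SOTA methods such as H$_2$O (38.89). Code: https://github.com/the-scale-lab/Asymkv.

Key Takeaways

The quadratic complexity of attention is a major obstacle to extending LLM context length.
KV cache compression is a key approach to this challenge.
KV caches are asymmetric: adjacent keys receive similar attention weights (local homogeneity), while adjacent values follow heterogeneous distributions.
Existing compression methods treat keys and values uniformly, which this asymmetry shows to be a limitation.
AsymKV is a training-free framework that combines homogeneity-based key merging with lossless value compression.
AsymKV consistently outperforms existing long-context methods across tasks and base models.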
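
The "local homogeneity" claim is about attention weights on neighbouring keys, so one simple way to probe it is to compute the weights one query assigns to a run of keys and measure how much adjacent weights differ. The sketch below does only that. It is a hypothetical diagnostic, not AsymKV's merging or compression procedure, and the function names are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def mean_adjacent_gap(weights):
    """Average absolute difference between neighbouring attention weights.

    Values near zero mean adjacent positions receive similar weight.
    Requires at least two weights.
    """
    gaps = [abs(a - b) for a, b in zip(weights, weights[1:])]
    return sum(gaps) / len(gaps)
```

A small gap over real cached keys would be consistent with the homogeneity the paper reports; running the same kind of check on value vectors would need a different statistic, since values are not scored against the query.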

How do Transformers Learn Implicit Reasoning?

Authors:Jiaran Ye, Zijun Yao, Zhidian Huang, Liangming Pan, Jinxin Liu, Yushi Bai, Amy Xin, Weichuan Liu, Xiaoyin Che, Lei Hou, Juanzi Li

Recent work suggests that large language models (LLMs) can perform multi-hop reasoning implicitly – producing correct answers without explicitly verbalizing intermediate steps – but the underlying mechanisms remain poorly understood. In this paper, we study how such implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. Our analysis reveals a three-stage developmental trajectory: early memorization, followed by in-distribution generalization, and eventually cross-distribution generalization. We find that training with atomic triples is not necessary but accelerates learning, and that second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, we introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which reveals that successful reasoning correlates with the cosine-base clustering in hidden space. This clustering phenomenon in turn provides a coherent explanation for the behavioral dynamics observed across training, linking representational structure to reasoning capability. These findings provide new insights into the interpretability of implicit multi-hop reasoning in LLMs, helping to clarify how complex reasoning processes unfold internally and offering pathways to enhance the transparency of such models.

Paper and project links

PDF Accepted as Spotlight at NeurIPS 2025

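Summary

Large language models (LLMs) can perform multi-hop reasoning implicitly, producing correct answers without verbalizing intermediate steps, but the underlying mechanisms remain poorly understood. This paper studies how implicit reasoning emerges by training transformers from scratch in a controlled symbolic environment. The analysis reveals a three-stage trajectory: early memorization, then in-distribution generalization, and finally cross-distribution generalization. Training with atomic triples is not necessary but accelerates learning, and second-hop generalization relies on query-level exposure to specific compositional structures. To interpret these behaviors, the authors introduce two diagnostic tools: cross-query semantic patching, which identifies semantically reusable intermediate representations, and a cosine-based representational lens, which shows that successful reasoning correlates with cosine-based clustering in hidden space. This clustering gives a coherent explanation for the behavioral dynamics observed during training and links representational structure to reasoning capability.

Key Takeaways

LLMs can perform implicit multi-hop reasoning, producing correct answers without verbalizing intermediate steps.
Transformers trained from scratch in a controlled symbolic environment pass through three stages: early memorization, in-distribution generalization, and cross-distribution generalization.
Training with atomic triples is not necessary, but it accelerates learning.
Second-hop generalization relies on query-level exposure to specific compositional structures.
Two diagnostic tools, cross-query semantic patching and a cosine-based representational lens, help interpret this implicit reasoning.
Successful reasoning correlates with cosine-based clustering in hidden space, which gives a coherent explanation for the behavioral dynamics observed during training.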
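
A rough way to picture "cosine-based clustering in hidden space" is to compare how similar hidden vectors are within a group versus across groups. The sketch below computes that gap for labelled vectors. It is an illustrative stand-in, not the paper's representational lens, and the grouping by label (for example by intermediate entity) is an assumption.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def clustering_gap(vectors, labels):
    """Mean within-label cosine similarity minus mean cross-label similarity.

    Needs at least one same-label pair and one different-label pair.
    A larger gap means the vectors cluster more tightly by label.
    """
    within, between = [], []
    for i, j in combinations(range(len(vectors)), 2):
        sim = cosine(vectors[i], vectors[j])
        if labels[i] == labels[j]:
            within.append(sim)
        else:
            between.append(sim)
    return sum(within) / len(within) - sum(between) / len(between)
```

Tracking such a gap across training checkpoints is one plausible way to relate representational structure to the staged behavior described above.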

Exact Expressive Power of Transformers with Padding

Authors:William Merrill, Ashish Sabharwal

Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer’s expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{FO}$-uniform $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers’ expressive power: with polylogarithmic looping, polynomially padded transformers recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought for test-time compute.

Paper and project links

PDF NeurIPS 2025

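Summary
Chain of thought increases the computational power of transformer-based large language models (LLMs) at inference time, but at the cost of sequential decoding. This paper asks whether a transformer's expressive power can be expanded more efficiently without adding parameters, and treats padding tokens as a form of parallelizable test-time compute. It shows that averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{TC}^0$ of extremely parallelizable problems; the $\mathsf{TC}^0$ upper bound was known, but a matching lower bound had been elusive. The analysis also pins down the power of padded transformers combined with looping, which dynamically increases depth at inference time. The core technical contribution is to show how padding brings complete problems and reductions, cornerstones of classical complexity theory, to the formal study of transformers. With this tool, the authors prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly $\mathsf{FO}$-uniform $\mathsf{TC}^d$, the class of moderately parallelizable problems. With polylogarithmic looping, polynomially padded transformers therefore recognize precisely $\mathsf{FO}$-uniform $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). The results motivate further study of padding and looping as parallelizable alternatives to chain of thought for test-time compute.

Key Takeaways

Padding tokens are studied as a parallelizable form of test-time compute that expands a transformer's expressive power without adding parameters, in contrast to the sequential decoding of chain of thought.
Averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely $\mathsf{FO}$-uniform $\mathsf{TC}^0$.
The analysis also characterizes the power of padded transformers combined with looping, which dynamically increases depth.
Padding brings complete problems and reductions from classical complexity theory to the formal study of transformers.
Padded transformers with $O(\log^d n)$ looping recognize exactly $\mathsf{FO}$-uniform $\mathsf{TC}^d$.
Together, padding and looping systematically expand expressive power, up to $\mathsf{FO}$-uniform $\mathsf{NC}$ with polylogarithmic looping, and offer parallelizable alternatives to chain of thought.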
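
The three characterizations stated in the abstract can be lined up in display form. The block below only restates them. The shorthand $\mathcal{P}_0$ (polynomially padded transformers without looping) and $\mathcal{P}_d$ (the same with $O(\log^d n)$ looping) is notation introduced here for illustration and is not taken from the paper.

```latex
% P_0 : languages recognized by averaging-hard-attention, masked-pre-norm
%       transformers with polynomial padding and no looping (assumed shorthand).
% P_d : the same model with O(log^d n) looping on inputs of length n.
\begin{align*}
\mathcal{P}_0 &= \mathsf{FO}\text{-uniform } \mathsf{TC}^0 \\
\mathcal{P}_d &= \mathsf{FO}\text{-uniform } \mathsf{TC}^d \qquad (d \ge 1) \\
\bigcup_{d \ge 1} \mathcal{P}_d &= \mathsf{FO}\text{-uniform } \mathsf{NC}
\end{align*}
```

The last line is the polylogarithmic-looping statement from the abstract. It is consistent with the standard fact that $\mathsf{NC} = \bigcup_d \mathsf{TC}^d$.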

A Little Depth Goes a Long Way: The Expressive Power of Log-Depth Transformers

Authors:William Merrill, Ashish Sabharwal

Recent theoretical results show transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. However, prior work treats the depth as a constant, leaving it unclear to what degree bounded depth may suffice for solving problems over short inputs, or how increasing the transformer’s depth affects its expressive power. We address these questions by analyzing transformers whose depth can grow minimally with context length $n$. We show even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking abilities and was known to be expressible only by an unconventional, non-uniform model of transformers, and graph connectivity, which underlies multi-step reasoning. Notably, both of these problems cannot be expressed by fixed-depth transformers under standard complexity conjectures, demonstrating the expressivity benefit of growing depth. Moreover, our theory quantitatively predicts how depth must grow with input length to express these problems, showing that depth scaling is more efficient than scaling width or chain-of-thought steps. Empirically, our detailed experiments designed to bridge the expressivity vs. learnability gap reveal that our theoretical depth requirements for regular language recognition closely match the practical depth requirements for successfully training transformers. Thus, our results clarify how depth affects a transformer’s reasoning capabilities, and provide practical guidance for effective depth selection for sequential reasoning.

Paper and project links

PDF NeurIPS 2025

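Summary
Recent theoretical results show that transformers cannot express sequential reasoning problems over long inputs, intuitively because their computational depth is bounded. Prior work treats depth as a constant, which leaves open how far bounded depth suffices for short inputs and how increasing depth affects expressive power. This paper analyzes transformers whose depth grows minimally with context length $n$. Even highly uniform transformers with depth $\Theta(\log n)$ can express two important problems: recognizing regular languages, which captures state tracking and was previously known to be expressible only by an unconventional, non-uniform transformer model, and graph connectivity, which underlies multi-step reasoning. Under standard complexity conjectures, fixed-depth transformers can express neither problem, which demonstrates the benefit of growing depth. The theory also predicts quantitatively how depth must grow with input length, and shows that scaling depth is more efficient than scaling width or chain-of-thought steps. Experiments designed to bridge the gap between expressivity and learnability show that the theoretical depth requirements for regular language recognition closely match the depth needed to train transformers successfully in practice. The results clarify how depth affects a transformer's reasoning capabilities and give practical guidance for choosing depth for sequential reasoning.

Key Takeaways

Transformers cannot express sequential reasoning problems over long inputs because their computational depth is bounded.
Prior work treats depth as a constant, so it was unclear how far bounded depth suffices for short inputs or how added depth changes expressive power.
Highly uniform transformers with depth $\Theta(\log n)$ can recognize regular languages and solve graph connectivity.
Under standard complexity conjectures, fixed-depth transformers cannot express either problem, so growing depth adds expressive power.
The theory predicts how depth must grow with input length and shows depth scaling to be more efficient than scaling width or chain-of-thought steps.
Experiments show that the theoretical depth requirements for regular language recognition closely match the depth needed in practice.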
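
One classical way to see why depth on the order of $\log n$ can be enough for graph connectivity is path doubling: each round composes the reachability relation with itself, so the path length covered doubles. The sketch below illustrates that intuition only. It is not the paper's transformer construction, and the function name is an assumption.

```python
import math


def reachability_by_doubling(adj):
    """Compute reachability in a directed graph with ceil(log2 n) doubling rounds.

    adj: n x n list of booleans, adj[i][j] is True when there is an edge i -> j.
    Returns (reach, rounds), where reach[i][j] is True when j is reachable from i.
    """
    n = len(adj)
    # Round 0: paths of length at most 1 (each node reaches itself).
    reach = [[adj[i][j] or i == j for j in range(n)] for i in range(n)]
    rounds = math.ceil(math.log2(n)) if n > 1 else 0
    for _ in range(rounds):
        # One round plays the role of one "layer": after it, reach covers
        # paths twice as long as before.
        reach = [
            [any(reach[i][k] and reach[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
    return reach, rounds
```

After `rounds` iterations the relation covers paths of length up to $2^{\text{rounds}} \ge n$, which is longer than any simple path in an $n$-node graph, so the number of sequential steps grows only logarithmically with input size.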

IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?

Authors:Akhilesh Aravapalli, Mounika Marreddy, Radhika Mamidi, Manish Gupta, Subba Reddy Oota

Transformer-based models have revolutionized the field of natural language processing. To understand why they perform so well and to assess their reliability, several studies have focused on questions such as: Which linguistic properties are encoded by these models, and to what extent? How robust are these models in encoding linguistic properties when faced with perturbations in the input text? However, these studies have mainly focused on BERT and the English language. In this paper, we investigate similar questions regarding encoding capability and robustness for 8 linguistic properties across 13 different perturbations in 6 Indic languages, using 9 multilingual Transformer models (7 universal and 2 Indic-specific). To conduct this study, we introduce a novel multilingual benchmark dataset, IndicSentEval, containing approximately $\sim$47K sentences. Surprisingly, our probing analysis of surface, syntactic, and semantic properties reveals that while almost all multilingual models demonstrate consistent encoding performance for English, they show mixed results for Indic languages. As expected, Indic-specific multilingual models capture linguistic properties in Indic languages better than universal models. Intriguingly, universal models broadly exhibit better robustness compared to Indic-specific models, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns. Overall, this study provides valuable insights into probing and perturbation-specific strengths and weaknesses of popular multilingual Transformer-based models for different Indic languages. We make our code and dataset publicly available [https://github.com/aforakhilesh/IndicBertology].

Paper and project links

PDF 25 pages, 11 figures, Accepted at IJCNLP-AACL 2025 Findings

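Summary

This paper studies how well multilingual Transformer models encode linguistic properties of Indic languages, and how robust that encoding is. It evaluates 9 multilingual Transformer models (7 universal and 2 Indic-specific) on 8 linguistic properties under 13 perturbations in 6 Indic languages, using a new multilingual benchmark dataset, IndicSentEval, with about 47K sentences. Probing of surface, syntactic, and semantic properties shows that almost all multilingual models encode English consistently but give mixed results for Indic languages. Indic-specific models capture linguistic properties of Indic languages better than universal models, whereas universal models are broadly more robust, particularly under perturbations such as dropping both nouns and verbs, dropping only verbs, or keeping only nouns.

Key Takeaways

Multilingual Transformer models differ in how well they encode linguistic properties of Indic languages.
The study covers 8 linguistic properties, 13 perturbations, and 6 Indic languages across 9 models (7 universal, 2 Indic-specific).
It introduces IndicSentEval, a new multilingual benchmark dataset of about 47K sentences.
Almost all multilingual models encode English consistently but show mixed results for Indic languages.
Indic-specific models capture linguistic properties of Indic languages better than universal models.
Universal models are broadly more robust, particularly when both nouns and verbs are dropped, only verbs are dropped, or only nouns are kept.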
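
The perturbations named in the abstract (dropping nouns and verbs, dropping only verbs, keeping only nouns) are easy to picture on a part-of-speech-tagged sentence. The sketch below is a minimal illustration, not the benchmark's own code. The tag names `NOUN` and `VERB`, the helper names, and the toy English sentence are all assumptions.

```python
def drop_tags(tagged, tags):
    """Remove every token whose part-of-speech tag is in `tags`."""
    return [token for token, tag in tagged if tag not in tags]


def keep_tags(tagged, tags):
    """Keep only the tokens whose part-of-speech tag is in `tags`."""
    return [token for token, tag in tagged if tag in tags]


# Hypothetical tagged sentence, used only to show the three perturbations.
sentence = [("children", "NOUN"), ("read", "VERB"), ("old", "ADJ"), ("books", "NOUN")]

no_nouns_or_verbs = drop_tags(sentence, {"NOUN", "VERB"})
no_verbs = drop_tags(sentence, {"VERB"})
only_nouns = keep_tags(sentence, {"NOUN"})
```

A robustness probe would then compare a model's representations of the original and perturbed token sequences.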

LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users

Authors:Elinor Poole-Dayan, Deb Roy, Jad Kabbara

While state-of-the-art large language models (LLMs) have shown impressive performance on many tasks, there has been extensive research on undesirable model behavior such as hallucinations and bias. In this work, we investigate how the quality of LLM responses changes in terms of information accuracy, truthfulness, and refusals depending on three user traits: English proficiency, education level, and country of origin. We present extensive experimentation on three state-of-the-art LLMs and two different datasets targeting truthfulness and factuality. Our findings suggest that undesirable behaviors in state-of-the-art LLMs occur disproportionately more for users with lower English proficiency, of lower education status, and originating from outside the US, rendering these models unreliable sources of information towards their most vulnerable users.

Paper and project links

PDF Paper accepted at AAAI 2026

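Summary

This paper examines how the quality of LLM responses, measured by information accuracy, truthfulness, and refusals, varies with three user traits: English proficiency, education level, and country of origin. Experiments on three state-of-the-art LLMs and two datasets targeting truthfulness and factuality suggest that undesirable behaviors occur disproportionately more often for users with lower English proficiency, lower education status, and origins outside the US, which makes these models unreliable sources of information for their most vulnerable users.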

Key Takeaways