LLM

Published: 2025-10-22

Updated: 2025-11-27


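⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading the papers.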

2025-10-22 Update

PANER: A Paraphrase-Augmented Framework for Low-Resource Named Entity Recognition

Authors:Nanda Kumar Rengarajan, Jun Yan, Chun Wang

Named Entity Recognition (NER) is a critical task that requires substantial annotated data, making it challenging in low-resource scenarios where label acquisition is expensive. While zero-shot and instruction-tuned approaches have made progress, they often fail to generalize to domain-specific entities and do not effectively utilize limited available data. We present a lightweight few-shot NER framework that addresses these challenges through two key innovations: (1) a new instruction tuning template with a simplified output format that combines principles from prior IT approaches to leverage the large context window of recent state-of-the-art LLMs; (2) introducing a strategic data augmentation technique that preserves entity information while paraphrasing the surrounding context, thereby expanding our training data without compromising semantic relationships. Experiments on benchmark datasets show that our method achieves performance comparable to state-of-the-art models on few-shot and zero-shot tasks, with our few-shot approach attaining an average F1 score of 80.1 on the CrossNER datasets. Models trained with our paraphrasing approach show consistent improvements in F1 scores of up to 17 points over baseline versions, offering a promising solution for groups with limited NER training data and compute power.


Summary

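The paper proposes a lightweight few-shot NER framework for low-resource settings, built on two ideas. First, a new instruction-tuning template with a simplified output format combines principles from prior instruction-tuning approaches and takes advantage of the large context windows of recent LLMs. Second, a data augmentation technique paraphrases the context around each entity while keeping the entity information intact, which expands the training data without compromising semantic relationships. On benchmark datasets the method performs comparably to state-of-the-art models on few-shot and zero-shot tasks, and the few-shot variant reaches an average F1 of 80.1 on the CrossNER datasets. Models trained with the paraphrasing approach improve F1 by up to 17 points over their baseline versions, which makes the approach attractive for teams with limited NER training data and compute.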

Key Takeaways

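NER needs substantial annotated data, which makes it hard in low-resource scenarios where labels are expensive.
The proposed few-shot NER framework rests on two innovations: a new instruction-tuning template and a paraphrase-based data augmentation technique.
The template uses a simplified output format and combines principles from prior instruction-tuning approaches to exploit the large context windows of recent LLMs.
The augmentation paraphrases the surrounding context while preserving entity information, so the training set grows without compromising semantic relationships.
The method performs comparably to state-of-the-art models on few-shot and zero-shot tasks.
The few-shot approach reaches an average F1 of 80.1 on the CrossNER datasets.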
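
The entity-preserving paraphrase idea can be illustrated with a minimal sketch. It assumes a mask, paraphrase, restore loop, which is one plausible way to keep entities intact and is not confirmed as the authors' procedure. `toy_paraphrase` is a hypothetical stand-in for an LLM paraphraser, and the sentence and labels are made up.

```python
def mask_entities(text, entities):
    """Replace each entity surface form with a numbered placeholder."""
    mapping = {}
    for i, (surface, label) in enumerate(entities):
        token = f"[ENT{i}]"
        text = text.replace(surface, token, 1)
        mapping[token] = (surface, label)
    return text, mapping

def toy_paraphrase(masked_text):
    """Hypothetical stand-in for an LLM paraphraser: one fixed rewrite."""
    return masked_text.replace("was born in", "comes from")

def restore_entities(text, mapping):
    """Put the original entity strings back and collect their labels."""
    labels = []
    for token, (surface, label) in mapping.items():
        if token in text:  # skip entities the paraphraser lost
            text = text.replace(token, surface)
            labels.append((surface, label))
    return text, labels

sentence = "Marie Curie was born in Warsaw."
entities = [("Marie Curie", "person"), ("Warsaw", "location")]

masked, mapping = mask_entities(sentence, entities)
augmented, labels = restore_entities(toy_paraphrase(masked), mapping)
print(augmented)
print(labels)
```

Because the paraphraser only ever sees placeholders, the entity strings and their labels carry over to the new sentence unchanged.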


Qomhra: A Bilingual Irish-English Large Language Model

Authors:Joseph McInerney

This paper introduces Qomhrá, a bilingual Irish-English large language model (LLM), developed under low-resource constraints, presenting a complete pipeline spanning bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora and English text are mixed and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged for their Irish text generation by a native speaker, a learner and other LLMs. Google’s Gemini-2.5-Pro is ranked the highest and is subsequently used to synthesise instruction tuning and human preference datasets. Two datasets are contributed leveraging Gemini-2.5-Pro: a 30K Irish-English parallel instruction tuning dataset and a 1K human preference dataset, generating accepted and rejected responses that show near perfect alignment with a native Irish speaker. Qomhrá is comprehensively evaluated across benchmarks testing translation, gender understanding, topic identification and world knowledge with gains of up to 29% in Irish and 44% in English. Qomhrá also undergoes instruction tuning and demonstrates clear progress in instruction following, crucial for chatbot functionality.


Summary

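The paper introduces Qomhrá, a bilingual Irish-English LLM developed under low-resource constraints, with a complete pipeline covering bilingual continued pre-training, instruction tuning, and alignment from human preferences. Newly accessible Irish corpora are mixed with English text and curated to improve Irish performance while preserving English ability. Six closed-weight LLMs are judged on Irish text generation by a native speaker, a learner, and other LLMs; Gemini-2.5-Pro ranks highest and is then used to synthesise two datasets: a 30K Irish-English parallel instruction-tuning dataset and a 1K human preference dataset. Across benchmarks for translation, gender understanding, topic identification, and world knowledge, Qomhrá gains up to 29% in Irish and 44% in English, and after instruction tuning it shows clear progress in instruction following.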

Key Takeaways

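Qomhrá is a bilingual LLM covering Irish and English.
It was developed under low-resource constraints with a complete pipeline: bilingual continued pre-training, instruction tuning, and alignment from human preferences.
Newly accessible Irish corpora are mixed with English text to improve Irish performance while preserving English ability.
Gemini-2.5-Pro ranked highest among six closed-weight LLMs for Irish text generation and was used to synthesise the instruction-tuning and human preference datasets.
Two datasets are contributed: a 30K Irish-English parallel instruction-tuning dataset and a 1K human preference dataset.
Qomhrá gains up to 29% in Irish and 44% in English on benchmarks covering translation, gender understanding, topic identification, and world knowledge.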


SimBench: Benchmarking the Ability of Large Language Models to Simulate Human Behaviors

Authors:Tiancheng Hu, Joachim Baumann, Lorenzo Lupo, Dirk Hovy, Nigel Collier, Paul Röttger

Large language model (LLM) simulations of human behavior have the potential to revolutionize the social and behavioral sciences, if and only if they faithfully reflect real human behaviors. Current evaluations are fragmented, based on bespoke tasks and metrics, creating a patchwork of incomparable results. To address this, we introduce SimBench, the first large-scale, standardized benchmark for a robust, reproducible science of LLM simulation. By unifying 20 diverse datasets covering tasks from moral decision-making to economic choice across a large global participant pool, SimBench provides the necessary foundation to ask fundamental questions about when, how, and why LLM simulations succeed or fail. We show that, while even the best LLMs today have limited simulation ability (score: 40.80/100), performance scales log-linearly with model size. Simulation performance is not improved by increased inference-time compute. We demonstrate an alignment-simulation trade-off: instruction-tuning improves performance on low-entropy (consensus) questions but degrades it on high-entropy (diverse) ones. Models particularly struggle when simulating specific demographic groups. Finally, we demonstrate that simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939). By making progress measurable, we aim to accelerate the development of more faithful LLM simulators.


Summary

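LLM simulations of human behavior could transform the social and behavioral sciences, but only if they faithfully reflect real human behavior. Current evaluations rely on bespoke tasks and metrics, so their results cannot be compared. SimBench is introduced as the first large-scale, standardized benchmark for LLM simulation; it unifies 20 diverse datasets, spanning tasks from moral decision-making to economic choice, across a large global participant pool. Even the best current LLMs have limited simulation ability (score: 40.80/100), although performance scales log-linearly with model size and does not improve with more inference-time compute. The study also identifies an alignment-simulation trade-off: instruction tuning helps on low-entropy (consensus) questions but hurts on high-entropy (diverse) ones. Models struggle most when simulating specific demographic groups, and simulation ability correlates most strongly with deep, knowledge-intensive reasoning (MMLU-Pro, r=0.939).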

Key Takeaways

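LLM simulations of human behavior could transform the social and behavioral sciences, provided they faithfully reflect real behavior.
Current evaluations use bespoke tasks and metrics, which makes results incomparable.
SimBench, the first large-scale standardized benchmark for LLM simulation, unifies 20 diverse datasets.
Simulation performance scales log-linearly with model size; the best current LLMs score 40.80/100.
More inference-time compute does not improve simulation performance.
There is an alignment-simulation trade-off: instruction tuning improves performance on low-entropy (consensus) questions and degrades it on high-entropy (diverse) ones.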
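
To make the low-entropy versus high-entropy distinction concrete, here is a small sketch. It computes the normalized Shannon entropy of a human answer distribution and the total variation distance to a simulated one. The distributions are invented, and total variation is an illustrative choice; the abstract does not say which metric SimBench uses.

```python
import math

def normalized_entropy(dist):
    """Shannon entropy of a response distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in dist if p > 0)
    return h / math.log(len(dist))

def total_variation(p, q):
    """Half the L1 distance between two distributions over the same options."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical human answer shares for two three-option questions.
consensus = [0.90, 0.05, 0.05]    # low entropy: most people agree
diverse = [0.40, 0.30, 0.30]      # high entropy: opinions are split
model_guess = [0.70, 0.20, 0.10]  # hypothetical simulated distribution

print(normalized_entropy(consensus), normalized_entropy(diverse))
print(total_variation(model_guess, consensus), total_variation(model_guess, diverse))
```

A simulator that always concentrates probability on one answer can look good on the consensus question while missing the spread of the diverse one.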


SOLE: Hardware-Software Co-design of Softmax and LayerNorm for Efficient Transformer Inference

Authors:Wenxun Wang, Shuchang Zhou, Wenyu Sun, Peiqin Sun, Yongpan Liu

Transformers have shown remarkable performance in both natural language processing (NLP) and computer vision (CV) tasks. However, their real-time inference speed and efficiency are limited due to the inefficiency in Softmax and Layer Normalization (LayerNorm). Previous works based on function approximation suffer from inefficient implementation as they place emphasis on computation while disregarding memory overhead concerns. Moreover, such methods rely on retraining to compensate for approximation error which can be costly and inconvenient. In this paper, we present SOLE, a hardware-software co-design for Softmax and LayerNorm which is composed of E2Softmax and AILayerNorm. E2Softmax utilizes log2 quantization of exponent function and log-based division to approximate Softmax while AILayerNorm adopts low-precision statistic calculation. Compared with state-of-the-art designs, we achieve both low-precision calculation and low bit-width storage on Softmax and LayerNorm. Experiments show that SOLE maintains inference accuracy without retraining while offering orders of magnitude speedup and energy savings over GPU, achieving 3.04x, 3.86x energy-efficiency improvements and 2.82x, 3.32x area-efficiency improvements over prior state-of-the-art custom hardware for Softmax and LayerNorm, respectively.


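Summary

To address the limits that Softmax and LayerNorm place on real-time Transformer inference, the paper proposes SOLE, a hardware-software co-design made of two parts, E2Softmax and AILayerNorm. E2Softmax approximates Softmax using log2 quantization of the exponent function and log-based division, while AILayerNorm uses low-precision statistic calculation. Compared with state-of-the-art designs, SOLE achieves both low-precision calculation and low bit-width storage for Softmax and LayerNorm. Experiments show that it maintains inference accuracy without retraining and delivers orders-of-magnitude speedup and energy savings over a GPU. Against prior state-of-the-art custom hardware it improves energy efficiency by 3.04x and 3.86x and area efficiency by 2.82x and 3.32x for Softmax and LayerNorm, respectively.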

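Key Takeaways

Transformers perform well on NLP and CV tasks, but inefficient Softmax and LayerNorm limit their real-time inference speed and efficiency.
Earlier function-approximation methods focus on computation and disregard memory overhead, and they rely on costly retraining to compensate for approximation error.
SOLE is a hardware-software co-design for Softmax and LayerNorm.
E2Softmax approximates Softmax with log2 quantization of the exponent function and log-based division.
AILayerNorm uses low-precision statistic calculation for LayerNorm.
SOLE maintains inference accuracy without retraining.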
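
The idea of quantizing the exponential to a power of two can be sketched in a few lines. This is a floating-point toy that only shows the rounding step and its effect on the output. It does not model E2Softmax's log-based division, bit widths, or hardware datapath, and the function name `pow2_softmax` is made up for illustration.

```python
import math

def exact_softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def pow2_softmax(xs):
    """Softmax with exponentials replaced by integer powers of two."""
    m = max(xs)
    log2e = math.log2(math.e)
    # exp(x - m) = 2 ** ((x - m) * log2(e)); rounding the exponent to an
    # integer turns each term into a pure bit shift in fixed-point hardware.
    shifts = [round((x - m) * log2e) for x in xs]
    terms = [2.0 ** k for k in shifts]
    s = sum(terms)
    return [t / s for t in terms]  # exact division here, for simplicity

scores = [2.0, 1.0, 0.1]
approx = pow2_softmax(scores)
exact = exact_softmax(scores)
print(approx)
print(exact)
```

For these scores the ranking of the outputs is preserved and the per-entry error stays below 0.1, which gives a feel for why such approximations can work without retraining in some settings.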


Video Reasoning without Training

Authors:Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, using entropy of the model’s output as a signal, we discover that the high-quality models go through a series of micro-explorations and micro-exploitations which keep the reasoning process grounded (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We further observe that once this “thinking” process is over, more accurate models demonstrate a better convergence by reducing the entropy significantly via a final exploitation phase (i.e., a more certain convergence towards a solution trajectory). We then use these novel, theoretically-grounded insights to tune the model’s behavior directly at inference, without using any RL or supervised fine-tuning. Specifically, during inference, our proposed approach called V-Reason (Video-Reason) adapts the value cache of the LMM via a few optimization steps on a small, trainable controller using an entropy-based objective, i.e., no supervision from any dataset or RL is necessary. This tuning improves the model’s micro-exploration and exploitation behavior during inference. Our experiments show that our proposed method achieves significant improvements over the base instruction-tuned models across several video reasoning datasets, narrowing the gap with RL-trained models to within 0.6% average accuracy without any training, while offering massive efficiency benefits: output tokens are reduced by 58.6% compared to the RL model.


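Summary

Video reasoning with Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, which adds substantial compute during both training and inference. Using the entropy of the model's output as a signal, the paper finds that high-quality models go through a series of micro-explorations and micro-exploitations that keep the reasoning process grounded and avoid excessive randomness. Once the thinking phase ends, more accurate models converge better, cutting entropy sharply in a final exploitation phase. Building on these observations, V-Reason adapts the LMM's value cache at inference time through a few optimization steps on a small trainable controller with an entropy-based objective, with no RL or supervised fine-tuning. Across several video reasoning datasets it improves significantly on the base instruction-tuned models, narrows the gap with RL-trained models to within 0.6% average accuracy, and reduces output tokens by 58.6% compared to the RL model.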

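Key Takeaways

Video reasoning with LMMs relies on costly RL and verbose chain-of-thought, with substantial overhead in both training and inference.
High-quality models alternate micro-explorations and micro-exploitations, which keeps reasoning grounded and avoids excessive randomness.
After the thinking phase, more accurate models converge better by sharply reducing entropy in a final exploitation phase.
With output entropy as the signal, V-Reason tunes model behavior directly at inference, without RL or supervised fine-tuning.
The method improves significantly on base instruction-tuned models across several video reasoning datasets.
The average accuracy gap to RL-trained models narrows to within 0.6%.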
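
The entropy signal itself is simple to compute. The sketch below derives a per-step entropy trace from next-token logits. The logits are invented to mimic an exploring step followed by increasingly confident ones, and nothing here reproduces V-Reason's controller or value-cache update.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical logits over a 4-token vocabulary at three decoding steps.
steps = [
    [2.0, 1.9, 1.8, 1.7],   # near-uniform: the model is still exploring
    [3.0, 1.0, 0.5, 0.2],   # one option starts to dominate
    [8.0, 0.1, 0.0, -0.2],  # confident: entropy close to zero
]
trace = [entropy(softmax(s)) for s in steps]
print(trace)
```

A falling trace like this one corresponds to the final exploitation phase described above, where the model commits to a solution trajectory.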


Parameter-Efficient Fine-Tuning for Low-Resource Languages: A Comparative Study of LLMs for Bengali Hate Speech Detection

Authors:Akif Islam, Mohd Ruhul Ameen

Bengali social media platforms have witnessed a sharp increase in hate speech, disproportionately affecting women and adolescents. While datasets such as BD-SHS provide a basis for structured evaluation, most prior approaches rely on either computationally costly full-model fine-tuning or proprietary APIs. This paper presents the first application of Parameter-Efficient Fine-Tuning (PEFT) for Bengali hate speech detection using LoRA and QLoRA. Three instruction-tuned large language models - Gemma-3-4B, Llama-3.2-3B, and Mistral-7B - were fine-tuned on the BD-SHS dataset of 50,281 annotated comments. Each model was adapted by training fewer than 1% of its parameters, enabling experiments on a single consumer-grade GPU. The results show that Llama-3.2-3B achieved the highest F1-score of 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. These findings establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.


Comments: Accepted to IEEE COMPAS 2025. 6 pages, 3 figures, 6 tables

Summary

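The paper addresses hate speech on Bengali social media platforms, which disproportionately affects women and adolescents. It presents the first application of Parameter-Efficient Fine-Tuning (PEFT) to Bengali hate speech detection, using LoRA and QLoRA. Three instruction-tuned LLMs (Gemma-3-4B, Llama-3.2-3B, and Mistral-7B) were fine-tuned on the 50,281 annotated comments of the BD-SHS dataset, each by training fewer than 1% of its parameters on a single consumer-grade GPU. Llama-3.2-3B achieved the highest F1-score at 92.23%, followed by Mistral-7B at 88.94% and Gemma-3-4B at 80.25%. The results establish PEFT as a practical and replicable strategy for Bengali and related low-resource languages.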

Key Takeaways

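Hate speech on Bengali social media platforms has risen sharply and disproportionately affects women and adolescents.
Datasets such as BD-SHS support structured evaluation, but most prior approaches rely on costly full-model fine-tuning or proprietary APIs.
This is the first application of PEFT to Bengali hate speech detection.
LoRA and QLoRA are used for fine-tuning.
Three LLMs (Gemma-3-4B, Llama-3.2-3B, and Mistral-7B) were fine-tuned on the BD-SHS dataset.
Llama-3.2-3B performed best, with an F1-score of 92.23%.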
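
Why LoRA trains so few parameters follows from its low-rank update, sketched below in plain Python. The frozen weight `W` is adjusted by `(alpha / r) * B @ A`, and only `A` and `B` are trained. The matrix sizes, rank, and `alpha` are illustrative assumptions and are not the configuration used in the paper.

```python
def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * d for b, d in zip(base, delta)]

def trainable_fraction(d_out, d_in, r):
    """Adapter parameters relative to the full weight matrix."""
    return r * (d_in + d_out) / (d_out * d_in)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight
A = [[1.0, 1.0]]              # r x d_in, with r = 1
B_zero = [[0.0], [0.0]]       # d_out x r, zero at initialisation
B_trained = [[1.0], [0.0]]    # hypothetical values after some training
x = [3.0, 4.0]

print(lora_forward(W, A, B_zero, x, alpha=2.0, r=1))
print(lora_forward(W, A, B_trained, x, alpha=2.0, r=1))
print(trainable_fraction(4096, 4096, 8))
```

With `B` at zero the adapted layer reproduces the base model exactly, and for a 4096 by 4096 matrix at rank 8 the adapter holds about 0.39% of that matrix's parameters.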


UniGTE: Unified Graph-Text Encoding for Zero-Shot Generalization across Graph Tasks and Domains

Authors:Duo Wang, Yuan Zuo, Guangyue Lu, Junjie Wu

Generalizing to unseen graph tasks without task-specific supervision is challenging: conventional graph neural networks are typically tied to a fixed label space, while large language models (LLMs) struggle to capture graph structure. We introduce UniGTE, an instruction-tuned encoder-decoder framework that unifies structural and semantic reasoning. The encoder augments a pretrained autoregressive LLM with learnable alignment tokens and a structure-aware graph-text attention mechanism, enabling it to attend jointly to a tokenized graph and a natural-language task prompt while remaining permutation-invariant to node order. This yields compact, task-aware graph representations. Conditioned solely on these representations, a frozen LLM decoder predicts and reconstructs: it outputs the task answer and simultaneously paraphrases the input graph in natural language. The reconstruction objective regularizes the encoder to preserve structural cues. UniGTE is instruction-tuned on five datasets spanning node-level, edge-level, and graph-level tasks across diverse domains, yet requires no fine-tuning at inference. It achieves new state-of-the-art zero-shot results on node classification, link prediction, graph classification, and graph regression under cross-task and cross-domain settings, demonstrating that tight integration of graph structure with LLM semantics enables robust, transferable graph reasoning.


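Summary

Generalizing to unseen graph tasks without task-specific supervision is hard: conventional graph neural networks are tied to a fixed label space, while LLMs struggle to capture graph structure. UniGTE is an instruction-tuned encoder-decoder framework that unifies structural and semantic reasoning. Its encoder augments a pretrained autoregressive LLM with learnable alignment tokens and a structure-aware graph-text attention mechanism, so it attends jointly to a tokenized graph and a natural-language task prompt while remaining permutation-invariant to node order, producing compact, task-aware graph representations. Conditioned only on these representations, a frozen LLM decoder outputs the task answer and simultaneously paraphrases the input graph in natural language; this reconstruction objective regularizes the encoder to preserve structural cues. UniGTE is instruction-tuned on five datasets spanning node-level, edge-level, and graph-level tasks across diverse domains and needs no fine-tuning at inference. It sets new state-of-the-art zero-shot results on node classification, link prediction, graph classification, and graph regression under cross-task and cross-domain settings.

Key Takeaways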

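UniGTE unifies structural and semantic reasoning by augmenting a pretrained LLM encoder so that it can process graph data.
The framework produces compact, task-aware graph representations; conditioned only on them, a frozen decoder outputs the task answer and a natural-language paraphrase of the graph.
Learnable alignment tokens and a structure-aware graph-text attention mechanism let the encoder attend jointly to the graph and the task prompt while remaining permutation-invariant to node order.
The reconstruction objective regularizes the encoder to preserve structural cues.
UniGTE is instruction-tuned on five datasets across tasks and domains and needs no fine-tuning at inference.
It achieves new state-of-the-art zero-shot results on node classification, link prediction, graph classification, and graph regression in cross-task and cross-domain settings.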


AtomBench: A Benchmark for Generative Atomic Structure Models using GPT, Diffusion, and Flow Architectures

Authors:Charles Rhys Campbell, Aldo H. Romero, Kamal Choudhary

Generative models have become significant assets in the exploration and identification of new materials, enabling the rapid proposal of candidate crystal structures that satisfy target properties. Despite the increasing adoption of diverse architectures, a rigorous comparative evaluation of their performance on materials datasets is lacking. In this work, we present a systematic benchmark of three representative generative models- AtomGPT (a transformer-based model), Crystal Diffusion Variational Autoencoder (CDVAE), and FlowMM (a Riemannian flow matching model). These models were trained to reconstruct crystal structures from subsets of two publicly available superconductivity datasets- JARVIS Supercon 3D and DS A/B from the Alexandria database. Performance was assessed using the Kullback-Leibler (KL) divergence between predicted and reference distributions of lattice parameters, as well as the mean absolute error (MAE) of individual lattice constants. For the computed KLD and MAE scores, CDVAE performs most favorably, followed by AtomGPT, and then FlowMM. All benchmarking code and model configurations will be made publicly available at https://github.com/atomgptlab/atombench_inverse.


Summary

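The paper presents a systematic benchmark of three representative generative models on materials datasets: AtomGPT (a transformer-based model), Crystal Diffusion Variational Autoencoder (CDVAE), and FlowMM (a Riemannian flow matching model). The models were trained to reconstruct crystal structures from subsets of two publicly available superconductivity datasets, JARVIS Supercon 3D and DS A/B from the Alexandria database. Performance is measured by the Kullback-Leibler (KL) divergence between predicted and reference distributions of lattice parameters and by the mean absolute error (MAE) of individual lattice constants. CDVAE performs best, followed by AtomGPT and then FlowMM.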

Key Takeaways

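Three representative generative models are benchmarked on crystal structure reconstruction: AtomGPT, CDVAE, and FlowMM.
The models are trained on subsets of two publicly available superconductivity datasets.
Performance is measured by the KL divergence between predicted and reference lattice parameter distributions and by the MAE of individual lattice constants.
CDVAE performs best on the computed KLD and MAE scores.
AtomGPT outperforms FlowMM.
All benchmarking code and model configurations will be made publicly available.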
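
The two metrics can be sketched as follows. The lattice constants are invented, and the histogram range, bin count, smoothing constant, and the KL direction (reference relative to prediction) are all assumptions. The benchmark's own binning and conventions are in the authors' repository and are not reproduced here.

```python
import math

def histogram(values, lo, hi, bins, eps=1e-9):
    """Normalized histogram with a tiny floor so no bin has zero mass.

    Assumes every value lies in [lo, hi].
    """
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    total = len(values)
    return [(c + eps) / (total + eps * bins) for c in counts]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def mae(pred, ref):
    return sum(abs(a - b) for a, b in zip(pred, ref)) / len(ref)

# Hypothetical lattice constants in angstroms, paired by structure.
reference = [3.9, 4.0, 4.1, 5.2]
predicted = [3.8, 3.9, 4.3, 5.0]

p = histogram(reference, lo=3.0, hi=6.0, bins=3)
q = histogram(predicted, lo=3.0, hi=6.0, bins=3)
print(kl_divergence(p, q), mae(predicted, reference))
```

The KL term compares the overall shape of the two distributions, while the MAE compares each predicted constant with its own reference value, so a model can do well on one and poorly on the other.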


A Novel GPT-Based Framework for Anomaly Detection in System Logs

Authors:Zeng Zhang, Wenjie Yin, Xiaoqi Li

Identification of anomalous events within system logs constitutes a pivotal element within the framework of cybersecurity defense strategies. However, this process faces numerous challenges, including the management of substantial data volumes, the distribution of anomalies, and the precision of conventional methods. To address this issue, the present paper puts forward a proposal for an intelligent detection method for system logs based on Generative Pre-trained Transformers (GPT). The efficacy of this approach is attributable to a combination of structured input design and a Focal Loss optimization strategy, which collectively result in a substantial enhancement of the performance of log anomaly detection. The initial approach involves the conversion of raw logs into event ID sequences through the use of the Drain parser. Subsequently, the Focal Loss function is employed to address the issue of class imbalance. The experimental results demonstrate that the optimized GPT-2 model significantly outperforms the unoptimized model in a range of key metrics, including precision, recall, and F1 score. In specific tasks, comparable or superior performance has been demonstrated to that of the GPT-3.5 API.


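Summary

Identifying anomalous events in system logs is a key part of cybersecurity defense, but it is complicated by large data volumes, the distribution of anomalies, and the limited precision of conventional methods. The paper proposes an intelligent log anomaly detection method based on Generative Pre-trained Transformers (GPT) that combines structured input design with a Focal Loss optimization strategy. Raw logs are first converted into event ID sequences with the Drain parser, and Focal Loss is then used to address class imbalance. The optimized GPT-2 model significantly outperforms the unoptimized model on precision, recall, and F1 score, and on specific tasks it matches or exceeds the GPT-3.5 API.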

Key Takeaways

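Identifying anomalous events in system logs is a key part of cybersecurity defense.
The task is complicated by large data volumes, the distribution of anomalies, and the limited precision of conventional methods.
The paper proposes a GPT-based intelligent method for detecting anomalies in system logs.
Structured input design combined with a Focal Loss optimization strategy substantially improves detection performance.
The Drain parser converts raw logs into event ID sequences, and Focal Loss addresses class imbalance.
The optimized GPT-2 model significantly outperforms the unoptimized model on precision, recall, and F1 score.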
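
Focal Loss handles class imbalance by shrinking the loss of examples the model already classifies well. The sketch below is the standard binary form. The `gamma` and `alpha` values are common illustrative defaults and are not taken from the paper, and the probabilities are made up.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted anomaly probability p and label y."""
    p_t = p if y == 1 else 1.0 - p                # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # class weighting
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_anomaly = focal_loss(0.95, 1)  # confidently correct: heavily down-weighted
hard_anomaly = focal_loss(0.30, 1)  # misclassified anomaly: keeps most of its loss
plain_ce = focal_loss(0.90, 1, gamma=0.0, alpha=1.0)  # reduces to cross-entropy

print(easy_anomaly, hard_anomaly, plain_ce)
```

With `gamma` set to zero and `alpha` to one the expression reduces to ordinary cross-entropy, which makes the effect of the modulating factor easy to isolate.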


Wavy Transformer

Authors:Satoshi Noguchi, Yoshinobu Kawahara

Transformers have achieved remarkable success across natural language processing (NLP) and computer vision (CV). However, deep transformer models often suffer from an over-smoothing issue, in which token representations converge to similar values as they pass through successive transformer blocks. In this paper, we establish an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph. From this perspective, over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics. Motivated by this physical interpretation, we propose Wavy Transformer, which consists of a novel attention layer based on second-order wavy dynamics. We also introduce a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule, thereby extending the transformer architecture. We further validate our proposed techniques on various transformer models for NLP and CV tasks. The results consistently demonstrate that Wavy Transformer improves performance with minimal additional parameters and no extra hyperparameter tuning.


Comments: Accepted by NeurIPS 2025

Summary

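The paper studies over-smoothing in deep Transformers, where token representations converge to similar values across successive blocks. It establishes an equivalence between the hidden-state dynamics induced by stacked attention layers and graph neural diffusion on a complete graph, so over-smoothing can be read as a consequence of the dissipative nature of diffusion. Motivated by this interpretation, Wavy Transformer introduces a novel attention layer based on second-order wavy dynamics, together with a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule. Validated on various Transformer models for NLP and CV tasks, it consistently improves performance with minimal additional parameters and no extra hyperparameter tuning.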

Key Takeaways

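Transformers have been highly successful in NLP and CV, but deep models often suffer from over-smoothing.
Over-smoothing can be interpreted as a consequence of the dissipative nature of the underlying diffusion dynamics.
The paper establishes an equivalence between the hidden-state dynamics of stacked attention layers and graph neural diffusion on a complete graph.
Motivated by this physical interpretation, Wavy Transformer uses a novel attention layer based on second-order wavy dynamics.
It also introduces a feed-forward network and a normalization layer designed to preserve the physical state-velocity relationship under the chain rule.
On NLP and CV tasks it improves performance with minimal additional parameters and no extra hyperparameter tuning.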
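
The contrast between dissipative diffusion and second-order dynamics can be seen in a toy model. The sketch uses scalar tokens and uniform attention on a complete graph, so each update pulls a token toward the mean. It is a hand-built illustration of the physical intuition, with an assumed step size `tau`, and it is not the Wavy Transformer layer.

```python
def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    """Largest distance of any token from the token mean."""
    m = mean(xs)
    return max(abs(x - m) for x in xs)

def diffusion_run(xs, tau, steps):
    """First-order update: x <- x + tau * (mean - x)."""
    out = [spread(xs)]
    for _ in range(steps):
        m = mean(xs)  # uniform attention over all tokens
        xs = [x + tau * (m - x) for x in xs]
        out.append(spread(xs))
    return out

def wave_run(xs, tau, steps):
    """Second-order update: the attention term drives a velocity instead."""
    vs = [0.0] * len(xs)
    out = [spread(xs)]
    for _ in range(steps):
        m = mean(xs)
        vs = [v + tau * (m - x) for v, x in zip(vs, xs)]
        xs = [x + tau * v for x, v in zip(xs, vs)]
        out.append(spread(xs))
    return out

tokens = [1.0, 0.0, -1.0]
diffused = diffusion_run(tokens, tau=0.5, steps=10)
waved = wave_run(tokens, tau=0.5, steps=10)
print(diffused[-1], max(waved[5:]))
```

Under diffusion the spread halves at every step and the tokens collapse onto their mean, while the second-order version oscillates and keeps the tokens distinguishable.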


Watch the Weights: Unsupervised monitoring and control of fine-tuned LLMs

Authors:Ziqian Zhong, Aditi Raghunathan

The releases of powerful open-weight large language models (LLMs) are often not accompanied by access to their full training data. Existing interpretability methods, particularly those based on activations, often require or assume distributionally similar data. This is a significant limitation when detecting and defending against novel potential threats like backdoors, which are by definition out-of-distribution. In this work, we introduce a new method for understanding, monitoring and controlling fine-tuned LLMs that interprets weights, rather than activations, thereby sidestepping the need for data that is distributionally similar to the unknown training data. We demonstrate that the top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors. By monitoring the cosine similarity of activations along these directions, we can detect salient behaviors introduced during fine-tuning with high precision. For backdoored models that bypass safety mechanisms when a secret trigger is present, our method stops up to 100% of attacks with a false positive rate below 1.2%. For models that have undergone unlearning, we detect inference on erased topics with accuracy up to 95.42% and can even steer the model to recover “unlearned” information. Besides monitoring, our method also shows potential for pre-deployment model auditing: by analyzing commercial instruction-tuned models (OLMo, Llama, Qwen), we are able to uncover model-specific fine-tuning focus including marketing strategies and Midjourney prompt generation. Our implementation can be found at https://github.com/fjzzq2002/WeightWatch.


Summary

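Powerful open-weight LLMs are often released without access to their full training data. Existing interpretability methods, especially activation-based ones, often require or assume distributionally similar data, which is a significant limitation when detecting and defending against novel threats such as backdoors that are out-of-distribution by definition. The paper introduces a method for understanding, monitoring, and controlling fine-tuned LLMs that interprets weights rather than activations, sidestepping the need for such data. The top singular vectors of the weight difference between a fine-tuned model and its base model correspond to newly acquired behaviors, and monitoring the cosine similarity of activations along these directions detects salient fine-tuning behaviors with high precision. For backdoored models the method stops up to 100% of attacks with a false positive rate below 1.2%; for models that have undergone unlearning it detects inference on erased topics with accuracy up to 95.42% and can even steer the model to recover “unlearned” information. It also shows potential for pre-deployment auditing: analyzing commercial instruction-tuned models (OLMo, Llama, Qwen) uncovers model-specific fine-tuning focus, including marketing strategies and Midjourney prompt generation.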

Key Takeaways

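Existing LLM interpretability methods, especially activation-based ones, need distributionally similar data, which limits them against novel threats such as backdoors.
The proposed method interprets weights instead of activations and so sidesteps the need for distributionally similar data.
Monitoring the cosine similarity of activations along the top singular vectors of the weight difference detects behaviors introduced during fine-tuning with high precision.
For backdoored models, the method stops up to 100% of attacks with a false positive rate below 1.2%.
For unlearned models, it detects inference on erased topics with accuracy up to 95.42% and can steer the model to recover “unlearned” information.
The method shows potential for pre-deployment model auditing, uncovering model-specific fine-tuning focus.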
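
The weight-difference idea can be sketched on a tiny example: take the difference between fine-tuned and base weights, find its dominant singular direction, and score activations by cosine similarity with it. The 3 by 3 matrices and activations are invented, power iteration stands in for a full SVD, and using the left (output-side) singular vector is an assumption; the authors' repository is the reference for the real procedure.

```python
import math

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def norm(x):
    return math.sqrt(sum(v * v for v in x))

def top_left_singular_vector(D, iters=50):
    """Power iteration on D @ D^T; returns the dominant output-space direction."""
    Dt = [list(col) for col in zip(*D)]
    u = [1.0] * len(D)
    for _ in range(iters):
        u = matvec(D, matvec(Dt, u))
        n = norm(u)
        u = [v / n for v in u]
    return u

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

# Hypothetical base and fine-tuned weights that differ by a rank-1 update.
base = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tuned = [[1.0, 2.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
delta = [[t - b for t, b in zip(tr, br)] for tr, br in zip(tuned, base)]
direction = top_left_singular_vector(delta)

suspicious = [3.0, 0.1, 0.0]  # activation aligned with the new direction
benign = [0.0, 1.0, 1.0]      # activation orthogonal to it
print(cosine(suspicious, direction), cosine(benign, direction))
```

A monitor built this way needs only the two sets of weights; a high cosine score would then flag inputs that exercise behavior added during fine-tuning.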


From Sequence to Structure: Uncovering Substructure Reasoning in Transformers

Authors:Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang

Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.


Comments: Camera-ready version for NeurIPS 2025

Summary

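Recent studies suggest that LLMs can solve graph reasoning tasks and can answer related questions even when graph structures are embedded in textual descriptions. The paper asks how a decoder-only Transformer architecture can understand the underlying graph structure. Starting from the substructure extraction task, it interprets the internal mechanisms of Transformers and analyzes the impact of input queries. Through empirical results and theoretical analysis it presents Induced Substructure Filtration (ISF), a perspective that captures substructure identification in multi-layer Transformers, and validates the ISF process in LLMs, revealing consistent internal dynamics across layers. It then explores how Transformers handle diverse graph types, introducing the concept of thinking in substructures to extract complex composite patterns efficiently and showing that decoder-only Transformers can extract substructures from attributed graphs such as molecular graphs.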

Key Takeaways

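LLMs can solve graph reasoning tasks and answer related questions even when the graph structure is embedded in a textual description.
The paper uses the substructure extraction task to interpret how a decoder-only Transformer understands underlying graph structure.
Induced Substructure Filtration (ISF) is a perspective that captures substructure identification in multi-layer Transformers.
The ISF process in LLMs shows consistent internal dynamics across layers.
Decoder-only Transformers can extract substructures from diverse graph types, including attributed graphs such as molecular graphs.
Thinking in substructures makes it possible to extract complex composite patterns efficiently.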


Forecasting Clinical Risk from Textual Time Series: Structuring Narratives for Temporal AI in Healthcare

Authors:Shahriar Noroozizadeh, Sayantan Kumar, Jeremy C. Weiss

Clinical case reports encode temporal patient trajectories that are often underexploited by traditional machine learning methods relying on structured data. In this work, we introduce the forecasting problem from textual time series, where timestamped clinical findings – extracted via an LLM-assisted annotation pipeline – serve as the primary input for prediction. We systematically evaluate a diverse suite of models, including fine-tuned decoder-based large language models and encoder-based transformers, on tasks of event occurrence prediction, temporal ordering, and survival analysis. Our experiments reveal that encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, while fine-tuned masking approaches enhance ranking performance. In contrast, instruction-tuned decoder models demonstrate a relative advantage in survival analysis, especially in early prognosis settings. Our sensitivity analyses further demonstrate the importance of time ordering, which requires clinical time series construction, as compared to text ordering, the format of the text inputs that LLMs are classically trained on. This highlights the additional benefit that can be ascertained from time-ordered corpora, with implications for temporal tasks in the era of widespread LLM use.


Comments: AAAI AI for Social Impact 2026. Shahriar Noroozizadeh and Sayantan Kumar contributed equally.

Summary

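Clinical case reports encode temporal patient trajectories that traditional machine learning methods relying on structured data often underexploit. The paper introduces the problem of forecasting from textual time series, in which timestamped clinical findings, extracted via an LLM-assisted annotation pipeline, are the primary input for prediction. It systematically evaluates a diverse suite of models, including fine-tuned decoder-based LLMs and encoder-based transformers, on event occurrence prediction, temporal ordering, and survival analysis. Encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting, and fine-tuned masking approaches enhance ranking performance. Instruction-tuned decoder models, in contrast, hold a relative advantage in survival analysis, especially in early prognosis settings. Sensitivity analyses show the importance of time ordering, which requires constructing clinical time series, over text ordering, the input format LLMs are classically trained on.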

Key Takeaways

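Clinical case reports encode temporal patient trajectories that traditional machine learning methods often underexploit.
The paper introduces forecasting from textual time series, using timestamped clinical findings extracted by an LLM-assisted annotation pipeline as the primary input.
Encoder-based models consistently achieve higher F1 scores and superior temporal concordance for short- and long-horizon event forecasting.
Fine-tuned masking approaches enhance ranking performance.
Instruction-tuned decoder models have a relative advantage in survival analysis, especially in early prognosis settings.
Time ordering, which requires constructing clinical time series, matters more than text ordering, the input format LLMs are classically trained on.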
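
The difference between text ordering and time ordering is easy to show on a made-up case. The sketch assumes each finding already carries a time offset in hours relative to admission, which in the paper comes from the LLM-assisted annotation pipeline. The findings, offsets, and the `render` format are hypothetical.

```python
# Hypothetical findings in the order they appear in a case report,
# each with a time offset in hours relative to admission.
findings_in_text_order = [
    ("admitted with chest pain", 0),
    ("history of hypertension", -8760),
    ("troponin elevated", 3),
    ("discharged", 96),
]

def to_time_series(findings):
    """Sort findings by timestamp; ties keep their text order."""
    return sorted(findings, key=lambda item: item[1])

def render(findings):
    """Serialize findings as one 'offset: finding' line each."""
    return "\n".join(f"{hours:+d}h: {text}" for text, hours in findings)

time_ordered = to_time_series(findings_in_text_order)
print(render(time_ordered))
```

In the report the history is mentioned second, yet it happened a year earlier; the time-ordered series puts it first, which is the view a forecasting model needs.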


Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-10-22/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
