LLM

发布日期: 2025-07-15

更新日期: 2025-08-19

文章字数: 17.5k

阅读时长: 71 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-07-15 更新

Lumos-1: On Autoregressive Video Generation from a Unified Model Perspective

Authors:Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, Hao Luo, Jiasheng Tang, Fan Wang, Yi Yang

Autoregressive large language models (LLMs) have unified a vast range of language tasks, inspiring preliminary efforts in autoregressive video generation. Existing autoregressive video generators either diverge from standard LLM architectures, depend on bulky external text encoders, or incur prohibitive latency due to next-token decoding. In this paper, we introduce Lumos-1, an autoregressive video generator that retains the LLM architecture with minimal architectural modifications. To inject spatiotemporal correlations in LLMs, we identify the efficacy of incorporating 3D RoPE and diagnose its imbalanced frequency spectrum ranges. Therefore, we propose MM-RoPE, a RoPE scheme that preserves the original textual RoPE while providing comprehensive frequency spectra and scaled 3D positions for modeling multimodal spatiotemporal data. Moreover, Lumos-1 resorts to a token dependency strategy that obeys intra-frame bidirectionality and inter-frame temporal causality. Based on this dependency strategy, we identify the issue of frame-wise loss imbalance caused by spatial information redundancy and solve it by proposing Autoregressive Discrete Diffusion Forcing (AR-DF). AR-DF introduces temporal tube masking during training with a compatible inference-time masking policy to avoid quality degradation. By using memory-efficient training techniques, we pre-train Lumos-1 on only 48 GPUs, achieving performance comparable to EMU3 on GenEval, COSMOS-Video2World on VBench-I2V, and OpenSoraPlan on VBench-T2V. Code and models are available at https://github.com/alibaba-damo-academy/Lumos.

自回归大型语言模型（LLM）已经统一了多种语言任务，为自回归视频生成提供了初步的思路。现有的自回归视频生成器要么与标准LLM架构不同，依赖于庞大的外部文本编码器，要么由于下一个令牌解码而产生过高的延迟。在本文中，我们介绍了Lumos-1，这是一款保留LLM架构并进行了最小修改的自回归视频生成器。为了将时空相关性注入LLM中，我们确定了融入3D RoPE的有效性，并诊断了其不平衡的频率谱范围。因此，我们提出了MM-RoPE，这是一种RoPE方案，既保留了原始的文本RoPE，又提供了全面的频率谱和缩放的三维位置，用于模拟多模态时空数据。此外，Lumos-1采用了一种遵循帧内双向性和帧间时间因果关系的令牌依赖策略。基于这种依赖策略，我们发现了由于空间信息冗余导致的帧级损失不平衡问题，并提出了自回归离散扩散强制（AR-DF）来解决这个问题。AR-DF在训练过程中引入了时间管掩码，并使用兼容的推理时间掩码策略，以避免质量下降。通过使用高效的内存训练技术，我们只用了48个GPU对Lumos-1进行了预训练，在GenEval上的性能与EMU3相当，在VBench-I2V上的COSMOS-Video2World和VBench-T2V上的OpenSoraPlan也表现出色。代码和模型可在https://github.com/alibaba-damo-academy/Lumos上找到。

Summary

该文介绍了Lumos-1这种基于LLM架构的自动视频生成器。为注入时空相关性，论文识别了3D RoPE的有效性并诊断了其不平衡的频率谱范围问题，于是提出了MM-RoPE方案。同时，Lumos-1采取遵循帧内双向性和帧间时序因果关系的令牌依赖策略。针对由空间信息冗余引起的帧损失不平衡问题，论文提出了基于自动回归离散扩散强迫（AR-DF）的解决方案。通过采用高效的训练方法，Lumos-1在有限的GPU资源上进行了预训练，在GenEval的EMU3上实现了比较性能，并在VBench上的COSMOS-Video2World和OpenSoraPlan上进行了验证。代码和模型可在GitHub上找到。

Key Takeaways

Lumos-1是首个基于LLM架构的自动视频生成器，具有统一多种语言任务的能力。
引入MM-RoPE方案以改进原有的RoPE设计，考虑了更全面的频率谱和三维位置建模问题。
采用特定的令牌依赖策略来保持帧内双向性和帧间时序因果性。
提出AR-DF来解决因空间信息冗余导致的帧损失不平衡问题。
Lumos-1通过内存高效训练技术在有限的GPU资源上实现了良好的性能表现。
Lumos-1在多个基准测试中实现了与现有技术相当的性能表现，包括GenEval的EMU3、VBench上的COSMOS-Video2World和OpenSoraPlan等。

Cool Papers

点此查看论文截图

One Token to Fool LLM-as-a-Judge

Authors:Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu

Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.

生成奖励模型（也称为LLM作为评判者），使用大型语言模型（LLM）来评估答案质量，在可验证奖励的强化学习（RLVR）中越来越被采用。它们通常被优先选择用于涉及自由形式输出的复杂推理任务，而不是僵化的基于规则的度量标准。在这种模式下，通常会提示LLM将候选答案与真实答案进行比较，并分配一个二进制奖励来表示正确性。尽管这种比较任务看似简单，但我们发现生成奖励模型对表面操纵表现出令人惊讶的脆弱性：非单词符号（例如：“：”或“。”）或推理开场词如“思考过程：”和“让我们一步一步解决这个问题。”可能会导致错误的正面奖励。我们证明这种弱点在LLM、数据集和提示格式中是普遍存在的，对依赖生成奖励模型的核心算法范式构成严重威胁，如拒绝抽样、偏好优化和RLVR。为了缓解这个问题，我们引入了一种简单有效的数据增强策略，并训练了一个新的生成奖励模型，该模型具有更强的稳健性。我们的研究结果表明，更可靠的基于LLM的评估方法迫切需要。我们在https://huggingface.co/sarosavo/Master-RM和https://huggingface.co/datasets/sarosavo/Master-RM发布了我们稳健的通用奖励模型和合成训练数据。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）作为奖励模型在强化学习可验证奖励（RLVR）中越来越受欢迎，用于评估答案质量。然而，这种模型在处理复杂推理任务时存在漏洞，容易被表面操纵所影响，导致错误奖励。研究团队提出了一种新的增强奖励模型，并发布了相关资源。这凸显了更可靠的LLM评估方法的迫切需求。

Key Takeaways

生成式奖励模型（LLMs-as-judges）采用大型语言模型评估答案质量，在强化学习可验证奖励（RLVR）中受到青睐。
与刚性规则基础的度量相比，这种模型更适合处理涉及自由形式输出的复杂推理任务。
生成式奖励模型存在漏洞，容易被非单词符号和推理开场白等表面操纵影响，导致错误奖励。
问题广泛存在于LLM、数据集和提示格式中，对依赖生成式奖励模型的核心算法范式构成严重威胁。
研究团队通过简单的数据增强策略训练了新的生成式奖励模型，提高了稳健性。
现有的LLM评估方法存在不足，需要更可靠的方法。

Cool Papers

点此查看论文截图

BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity

Authors:Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun

To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67$\times$ speedup on real end-side devices than dense models. All codes and checkpoints are available publicly (https://github.com/thunlp/BlockFFN).

为缓解大型语言模型（LLM）的计算负担，以混合专家（MoE）为代表的激活稀疏性架构吸引了越来越多的关注。然而，标准MoE的非可微分和僵硬路由会损害模型性能。尽管每个令牌只激活少数参数，但这些稀疏激活的架构表现出较低的块级稀疏性，表明多个连续令牌的联合激活了很大比例的参数。这种稀疏模式不利于低资源条件下的加速（例如，边缘设备），并且与主流加速技术（例如，推测解码）不兼容。为解决这些挑战，我们引入了一种新型MoE架构BlockFFN及其高效的训练和部署技术。具体来说，我们使用集成ReLU激活和RMSNorm的路由器实现可微分和灵活路由。接下来，为促进令牌级稀疏性（TLS）和块级稀疏性（CLS），设计了CLS感知训练目标，使BlockFFN更易于加速。最后，我们首次结合激活稀疏性和推测解码实现了高效的加速内核。实验结果表明，BlockFFN优于其他MoE基线，实现超过80%的TLS和70%的8令牌CLS。我们的内核在实际边缘设备上的速度比密集模型最高快3.67倍。所有代码和检查点均已公开（https://github.com/thunlp/BlockFFN）。

论文及项目相关链接

PDF 21 pages, 7 figures, 15 tables

Summary

本文为了解决大型语言模型（LLM）的计算负担问题，引入了混合专家（MoE）架构中的激活稀疏性。针对传统MoE的非可微和不可灵活路由的问题，提出了新型的MoE架构BlockFFN，并结合高效的训练和部署技术。BlockFFN使用集成ReLU激活和RMSNorm的路由器实现可微和灵活路由。通过设计面向块级稀疏性（CLS）的训练目标，促进了令牌级稀疏性（TLS）和CLS的提高，使得BlockFFN更适用于加速。此外，结合激活稀疏性和推测解码，首次实现了高效的加速内核。实验结果表明，BlockFFN优于其他MoE基线，达到超过80%的TLS和70%的8令牌CLS。在真实端侧设备上，相比密集模型，我们的内核实现了高达3.67倍的加速。

Key Takeaways

大型语言模型（LLM）的计算负担成为研究焦点，激活稀疏性的混合专家（MoE）架构被视为解决方案。
传统MoE存在非可微和不可灵活路由的问题，因此提出新型MoE架构BlockFFN。
BlockFFN使用集成ReLU激活和RMSNorm的路由器，实现可微和灵活路由。
设计面向块级稀疏性（CLS）的训练目标，以提升令牌级稀疏性（TLS），使模型更适用于加速。
结合激活稀疏性和推测解码，首次实现高效的加速内核。
BlockFFN性能优越，超过其他MoE基线，实现高TLS和CLS。
在真实端侧设备上，BlockFFN相比密集模型实现了显著的加速效果。

Cool Papers

点此查看论文截图

LLMCup: Ranking-Enhanced Comment Updating with LLMs

Authors:Hua Ge, Juan Zhai, Minxue Pan, Fusen He, Ziyue Tan

While comments are essential for enhancing code readability and maintainability in modern software projects, developers are often motivated to update code but not comments, leading to outdated or inconsistent documentation that hinders future understanding and maintenance. Recent approaches such as CUP and HebCup have attempted automatic comment updating using neural sequence-to-sequence models and heuristic rules, respectively. However, these methods can miss or misinterpret crucial information during comment updating, resulting in inaccurate comments, and they often struggle with complex update scenarios. Given these challenges, a promising direction lies in leveraging large language models (LLMs), which have shown impressive performance in software engineering tasks such as comment generation, code synthesis, and program repair. This suggests their strong potential to capture the logic behind code modifications - an ability that is crucial for the task of comment updating. Nevertheless, selecting an appropriate prompt strategy for an LLM on each update case remains challenging. To address this, we propose a novel comment updating framework, LLMCup, which first uses multiple prompt strategies to provide diverse candidate updated comments via an LLM, and then employs a ranking model, CupRank, to select the best candidate as final updated comment. Experimental results demonstrate the effectiveness of LLMCup, with improvements over state-of-the-art baselines (CUP and HebCup) by 49.0%-116.9% in Accuracy, 10.8%-20% in BLEU-4, 4.6% in METEOR, 0.9%-1.9% in F1, and 2.1%-3.4% in SentenceBert similarity. Furthermore, a user study shows that comments updated by LLMCup sometimes surpass human-written updates, highlighting the importance of incorporating human evaluation in comment quality assessment.

在现代软件项目中，注释对于提高代码的可读性和可维护性至关重要。尽管开发者有动力更新代码，但往往忽视更新注释，这导致文档过时或不一致，阻碍了未来的理解和维护。近期的方法，如CUP和HebCup，分别尝试使用神经序列到序列模型和启发式规则进行自动注释更新。然而，这些方法在更新注释时可能会遗漏或误解关键信息，导致注释不准确，并且在处理复杂的更新场景时经常遇到困难。考虑到这些挑战，一个前景光明的方向是利用大型语言模型（LLM），其在软件工程任务如注释生成、代码合成和程序修复中表现出令人印象深刻的性能。这表明LLM有能力捕捉代码修改背后的逻辑——这对于注释更新任务至关重要。然而，针对每个更新案例选择适当的提示策略对于LLM仍然是一个挑战。为解决这一问题，我们提出了一种新颖的注释更新框架LLMCup。它首先使用多种提示策略，通过LLM提供多样的候选更新注释，然后采用排名模型CupRank选择最佳候选作为最终的更新注释。实验结果表明LLMCup的有效性，其在准确率上较最新基线方法（CUP和HebCup）提高了49.0%-116.9%，BLEU-4提高了10.8%-20%，METEOR提高了4.6%，F1提高了0.9%-1.9%，SentenceBert相似度提高了2.1%-3.4%。此外，一项用户研究还显示，LLMCup更新的注释有时超越了人工更新的注释，这凸显了在评估注释质量时融入人工评估的重要性。

论文及项目相关链接

PDF 13 pages, 10 figures

摘要
代码注释在现代软件项目中对可读性和可维护性至关重要。但开发者更新代码而不更新注释的情况时有发生，导致文档过时或不一致，影响未来的理解和维护。当前方法如CUP和HebCup尝试通过神经序列到序列模型和启发式规则自动更新注释，但仍有不足。大型语言模型（LLM）具有潜力捕获代码修改背后的逻辑，为解决这一问题提供了新的方向。提出一种新的注释更新框架LLMCup，采用多种提示策略通过LLM提供多样的候选更新注释，并使用排名模型CupRank选择最佳候选作为最终更新注释。实验结果证明LLMCup的有效性，准确率较现有方法提高49.0%-116.9%，BLEU-4提高10.8%-20%，METEOR提高4.6%，F1提高0.9%-1.9%，SentenceBert相似性提高2.1%-3.4%。用户研究表明，LLMCup更新的注释有时超越人工更新，表明在注释质量评估中融入人工评估的重要性。

关键见解

开发者经常忽视更新注释，导致文档过时或不一致，影响未来的理解和维护。
当前自动注释更新方法如CUP和HebCup存在不足，可能遗漏或误解重要信息。
大型语言模型（LLM）在软件工程任务中表现出强大的性能，如注释生成、代码合成和程序修复，具有捕获代码修改背后逻辑的强大潜力。
提出一种新的注释更新框架LLMCup，采用多种提示策略并通过排名模型选择最佳候选注释。
实验结果表明LLMCup在多个评估指标上较现有方法有所改进。
用户研究显显示LLMCup更新的注释有时超越人工更新。
在注释质量评估中融入人工评估的重要性。

Cool Papers

点此查看论文截图

KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment

Authors:Jiyao Zhang, Chengli Zhong, Hui Xu, Qige Li, Yi Zhou

Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.

现代大型语言模型（LLM）在将非正式数学形式化为可验证的机器定理方面取得了令人瞩目的进展。然而，由于多语言平行语料库的数量和质量有限，这些方法仍然面临瓶颈。针对这些问题，本文提出了一种新型的神经符号框架KELPS（基于知识方程的逻辑处理系统）。KELPS是一个迭代框架，用于将非正式数据翻译成多种正式语言（Lean、Coq和Isabelle）。首先，我们将自然语言翻译成我们设计的一种新型知识方程（KE），该方程以断言逻辑为理论基础。接下来，我们通过严格定义的规则将它们转换为目标语言，这些规则既保留了句法结构又保留了语义含义。这一过程产生了超过6万个问题的平行语料库。我们的框架在MiniF2F上达到了88.9%的句法准确率（pass@1），优于多个数据集中的最佳模型，如Deepseek-V3（81%）和Herald（81.3%）。所有数据集和代码都可在补充材料中找到。

论文及项目相关链接

PDF Accepted by the ICML 2025 AI4MATH Workshop. 22 pages, 16 figures, 2 tables

Summary
现代大型语言模型在将非正式数学形式化为可验证的机器定理方面取得了进展。然而，由于多语言平行语料库的数量和质量有限，这些方法仍面临瓶颈。本文提出了一种新型的神经符号框架KELPS（基于知识方程的逻辑处理系统），以解决这些问题。KELPS是一个迭代框架，可将非正式数据翻译成多种正式语言（Lean、Coq和Isabelle）。首先，我们将自然语言翻译成我们设计的知识方程（KEs），在断言逻辑中有理论根据。然后，通过严格定义的规则将它们转换为目标语言，同时保持语法结构和语义意义。此过程产生了超过6万个问题的平行语料库。我们的框架在MiniF2F上达到了88.9%的语法准确性（pass@1），优于Deepseek-V3（81%）和Herald（81.3%）等模型。

Key Takeaways

现代大型语言模型在将非正式数学形式化为机器定理方面取得进展。
多语言平行语料库的数量和质量限制是大型语言模型应用的主要瓶颈。
KELPS框架被提出来解决这些问题，它是一个能够将非正式数据翻译成多种正式语言的迭代框架。
KELPS通过翻译自然语言到知识方程（KEs）来工作，这是基于断言逻辑的新语言。
知识方程进一步被转换成目标语言，同时保持语法结构和语义意义。
KELPS框架产生了超过6万个问题的平行语料库。

Cool Papers

点此查看论文截图

Introspection of Thought Helps AI Agents

Authors:Haoran Sun, Shaoning Zeng

AI Agents rely on Large Language Models (LLMs) and Multimodal-LLMs (MLLMs) to perform interpretation and inference in text and image tasks without post-training, where LLMs and MLLMs play the most critical role and determine the initial ability and limitations of AI Agents. Usually, AI Agents utilize sophisticated prompt engineering and external reasoning framework to obtain a promising interaction with LLMs, e.g., Chain-of-Thought, Iteration of Thought and Image-of-Thought. However, they are still constrained by the inherent limitations of LLM in understanding natural language, and the iterative reasoning process will generate a large amount of inference cost. To this end, we propose a novel AI Agent Reasoning Framework with Introspection of Thought (INoT) by designing a new LLM-Read code in prompt. It enables LLM to execute programmatic dialogue reasoning processes following the code in prompt. Therefore, self-denial and reflection occur within LLM instead of outside LLM, which can reduce token cost effectively. Through our experiments on six benchmarks for three different tasks, the effectiveness of INoT is verified, with an average improvement of 7.95% in performance, exceeding the baselines. Furthermore, the token cost of INoT is lower on average than the best performing method at baseline by 58.3%. In addition, we demonstrate the versatility of INoT in image interpretation and inference through verification experiments.

人工智能代理（AI Agents）依赖大型语言模型（LLMs）和多模态语言模型（MLLMs）进行无需后训练的文本和图像任务中的解释和推断。在这里，LLMs和MLLMs起到最关键的的作用并决定了AI代理的初始能力和局限性。通常，AI代理会利用复杂的提示工程和外部推理框架来获得与LLMs的有前途的交互，例如“思维链”、“思维迭代”和“思维图像”。然而，它们仍然受到LLM在理解自然语言方面固有局限性的制约，而且迭代推理过程会产生大量的推理成本。为此，我们提出了一种新型的具有思想内省（INoT）的AI代理推理框架，通过设计新的LLM-Read提示代码来实现。它使LLM能够根据提示中的代码执行程序化对话推理过程。因此，自我否定和反思发生在LLM内部而不是外部，这可以有效地降低令牌成本。我们在三个不同任务的六个基准测试上进行了实验，验证了INoT的有效性，性能平均提高了7.95%，超过了基线水平。此外，INoT的平均令牌成本比基线中表现最好的方法降低了58.3%。此外，我们通过验证实验展示了INoT在图像解释和推断方面的通用性。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）和多模态LLM（MLLM）是AI代理进行文本和图像任务解释和推理的关键。尽管AI代理采用先进的提示工程和外部推理框架与LLM交互，但它们仍受限于LLM理解自然语言的固有局限性，并且迭代推理过程会产生大量推理成本。为此，我们提出了具有思想内省（INoT）的新型AI代理推理框架，通过设计新的LLM阅读提示代码，使LLM能够按照提示中的代码执行程序化对话推理过程。这减少了令牌成本，并在三个不同任务的六个基准测试上验证了其有效性，平均性能提升7.95%，并且比基线方法的令牌成本低58.3%。此外，我们通过验证实验展示了INoT在图像解释和推理中的通用性。

Key Takeaways

AI代理依赖大型语言模型（LLM）和多模态LLM（MLLM）进行文本和图像的解读和推理。
AI代理通常采用先进的提示工程和外部推理框架与LLM交互。
LLM在理解自然语言方面存在固有局限性，迭代推理过程会产生大量推理成本。
提出了具有思想内省（INoT）的AI代理推理框架，通过设计新的LLM阅读提示代码来减少令牌成本。
INoT框架在三个不同任务的六个基准测试上验证了其有效性，平均性能提升7.95%。
INoT框架的令牌成本比基线方法的平均成本低58.3%。

Cool Papers

点此查看论文截图

Leanabell-Prover-V2: Verifier-integrated Reasoning for Formal Theorem Proving via Reinforcement Learning

Authors:Xingguang Ji, Yahui Liu, Qi Wang, Jingyuan Zhang, Yang Yue, Rui Shi, Chenxi Sun, Fuzheng Zhang, Guorui Zhou, Kun Gai

We introduce our Leanabell-Prover-V2, a 7B large language models (LLMs) that can produce formal theorem proofs in Lean 4, with verifier-integrated Long Chain-of-Thoughts (CoT). Following our previous work Leanabell-Prover-V1, we continual to choose to posttrain existing strong prover models for further performance improvement. In our V2 version, we mainly upgrade the Reinforcement Learning (RL) with feedback provided by the Lean 4 verifier. Crucially, verifier feedback, such as indicating success or detailing specific errors, allows the LLM to become ``self-aware’’ of the correctness of its own reasoning process and learn to reflexively correct errors. Leanabell-Prover-V2 directly optimizes LLM reasoning trajectories with multi-turn verifier interactions, together with feedback token masking for stable RL training and a simple reward strategy. Experiments show that Leanabell-Prover-V2 improves performance by 3.2% (pass@128) with Kimina-Prover-Preview-Distill-7B and 2.0% (pass@128) with DeepSeek-Prover-V2-7B on the MiniF2F test set. The source codes, curated data and models are available at: https://github.com/Leanabell-LM/Leanabell-Prover-V2.

我们推出Leanabell-Prover-V2，这是一款7B大型语言模型（LLM），能够在Lean 4中产生正式定理证明，并集成验证器的长链思维（CoT）。继我们之前的Leanabell-Prover-V1工作之后，我们选择在现有强大的证明模型上进行后续训练，以进一步提高性能。在V2版本中，我们主要使用Lean 4验证器提供的反馈来升级强化学习（RL）。关键的是，验证器的反馈，如指示成功或详细列出具体错误，可以让LLM对其自身的推理过程的正确性产生“自我意识”，并学会反射性地纠正错误。Leanabell-Prover-V2通过多轮验证器交互直接优化LLM的推理轨迹，同时采用反馈令牌遮蔽来稳定RL训练并简化奖励策略。实验表明，在MiniF2F测试集上，Leanabell-Prover-V2使用Kiminia-Prover-Preview-Distill-7B提高了3.2%（pass@128）的性能，使用DeepSeek-Prover-V2-7B提高了2.0%（pass@128）的性能。相关源代码、精选数据和模型可在以下网址找到：https://github.com/Leanabell-LM/Leanabell-Prover-V2。

论文及项目相关链接

PDF 23 pages, 13 figures

摘要

Leanabell-Prover-V2是一款7B大型语言模型（LLM），能在Lean 4中生成正式定理证明，并整合验证器长链思维（CoT）。相较于先前的Leanabell-Prover-V1，我们持续对强效验证器模型进行后训练，进一步提升性能。在V2版本中，我们主要升级强化学习（RL）技术，结合Lean 4验证器提供的反馈。验证器反馈，如指示成功或详细列出特定错误，让LLM对其自身推理过程的正确性有“自我意识”，并学会反思纠正错误。Leanabell-Prover-V2直接优化LLM推理轨迹，通过多回合验证器互动、反馈令牌遮蔽稳定RL训练及简单奖励策略实现。实验显示，在MiniF2F测试集上，Leanabell-Prover-V2使用Kimina-Prover-Preview-Distill-7B提高性能3.2%（pass@128），使用DeepSeek-Prover-V2-7B提高性能2.0%（pass@128）。相关源代码、精选数据和模型可在链接获取。

关键见解

Leanabell-Prover-V2是第一款能够产生Lean 4形式定理证明的7B大型语言模型（LLM）。
该模型通过整合验证器的长链思维（CoT）增强了其推理能力。
与先前的版本相比，Leanabell-Prover-V2在后训练强效验证器模型方面取得了进一步的性能提升。
该模型主要升级了强化学习（RL）技术，利用Lean 4验证器提供的反馈进行训练。
验证器反馈使LLM能够对其自身推理过程的正确性进行反思和纠正。
Leanabell-Prover-V2通过多回合验证器互动直接优化LLM推理轨迹。

Cool Papers

点此查看论文截图

NL in the Middle: Code Translation with LLMs and Intermediate Representations

Authors:Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, Alexander Wong

Studies show that large language models (LLMs) produce buggy code translations. One avenue to improve translation accuracy is through intermediate representations, which could provide structured insights to guide the model’s understanding. We explore whether code translation using LLMs can benefit from intermediate representations via natural language (NL) and abstract syntax trees (ASTs). Since prompt engineering greatly affects LLM performance, we consider several ways to integrate these representations, from one-shot to chain-of-thought (CoT) prompting. Using Open Gpt4 8X7B and specialized StarCoder and CodeGen models on popular code translation benchmarks (CodeNet and AVATAR), we find that CoT with an intermediate NL summary performs best, with an increase of 13.8% and 6.7%, respectively, in successful translations for the best-performing model (Open Gpt4 8X7B) compared to the zero-shot prompt.

研究表明，大型语言模型（LLM）产生的代码翻译存在错误。提高翻译准确性的一个途径是通过中间表示，这可能为模型的理解提供结构化洞察。我们探索了使用自然语言（NL）和抽象语法树（ASTs）的中间表示是否能使LLM的代码翻译受益。由于提示工程极大地影响了LLM的性能，我们考虑了几种整合这些表示的方法，从一次完成到思维链（CoT）提示。我们在流行的代码翻译基准测试（CodeNet和AVATAR）上使用了Open Gpt4 8X7B以及专门的StarCoder和CodeGen模型，发现使用带有中间NL摘要的CoT表现最佳，在性能最好的模型（Open Gpt4 8X7B）中，与零次提示相比，成功翻译次数分别增加了13.8%和6.7%。

论文及项目相关链接

PDF

Summary

大型语言模型（LLMs）在代码翻译上表现欠佳，可通过中间表示层改善翻译精度。本文研究如何通过自然语言（NL）和抽象语法树（ASTs）使用中间表示层进行代码翻译。由于提示工程对LLM性能有很大影响，我们考虑了几种整合这些表示层的方法，包括一次性提示和链式思维（CoT）提示。在流行的代码翻译基准测试（CodeNet和AVATAR）上，使用Open Gpt4 8X7B和专门的StarCoder和CodeGen模型进行测试，发现使用带有中间NL摘要的CoT效果最佳，与零次提示相比，最佳性能模型（Open Gpt4 8X7B）的成功翻译率分别提高了13.8%和6.7%。

Key Takeaways

大型语言模型（LLMs）在代码翻译方面存在缺陷。
中间表示层能提高LLM在代码翻译上的准确性。
自然语言（NL）和抽象语法树（ASTs）可以作为中间表示层用于代码翻译。
提示工程对LLM性能具有重要影响。
研究者尝试了多种提示方法，包括一次性提示和链式思维（CoT）提示。
在测试基准上，使用带有中间NL摘要的CoT表现最佳。

Cool Papers

点此查看论文截图

A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1

Authors:Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski

Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLM, but there is still a lack of research and results on the operation of these models in publicly available argument classification databases. This paper presents a study of a selection of LLM’s, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. In case of models incorporated with reasoning capabilities, the Deepseek-R1 shows its superiority. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLM and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.

论证挖掘（AM）是一个跨学科的研究领域，它融合了逻辑、哲学、语言学、修辞学、法律、心理学和计算机科学等领域的见解。它涉及自动识别和提取论证性成分，如前提和主张，以及检测它们之间的关系，如支持、攻击或中立。最近，随着大型语言模型（LLM）的出现，该领域取得了显著进展，与传统的分析方法和其他深度学习模型相比，LLM提高了分析提取论证语义的效率。虽然有许多基准测试用于测试和验证LLM的质量，但在公开可用的论证分类数据库中关于这些模型操作的研究和结果仍然缺乏。本文使用Args.me和UKP等多样化数据集对部分LLM进行了研究。测试的模型包括GPT、Llama和DeepSeek的版本，以及采用Chain-of-Thoughts算法的推理增强变体。结果表明，在论证分类基准测试中，ChatGPT-4o表现优于其他模型。在结合推理能力的模型中，Deepseek-R1表现出其优越性。然而，尽管GPT-4o和Deepseek-R1具有优势，但它们仍然会出错。本文讨论了所有模型最常见的错误。据我们所知，所呈现的工作是使用LLM和提示算法对所述数据集进行的首次更广泛的分析。该工作还显示了已知提示算法在论证分析中的一些弱点，同时指出了改进方向。该工作的附加值是对可用论证数据集进行深入分析和展示其不足。

论文及项目相关链接

PDF

摘要

论矿学是一个跨学科的研究领域，融合了逻辑、哲学、语言学、修辞学、法律、心理学和计算机科学等多领域的见解。它自动识别和提取论证成分，如前提和主张，并检测它们之间的关系，如支持、攻击或中立。随着大型语言模型（LLM）的出现，该领域取得了重大进展。本文研究了几个LLM在论证分类数据库中的表现，使用Args.me和UKP等不同数据集。研究结果表明，ChatGPT-4o在论证分类基准测试中表现最好。具有推理能力的Deepseek-R1模型也表现出优越性。然而，尽管它们具有优势，GPT-4o和Deepseek-R1仍然存在错误。本文还讨论了现有提示算法在论证分析中的弱点，并指出了改进方向。

关键见解

论矿学是一个涉及多个学科的研究领域，包括逻辑、哲学等，主要研究如何自动识别和提取论证成分。
大型语言模型（LLM）在论矿学领域的应用已经取得了显著进展，尤其是在分析论证语义方面。
在论证分类数据库中使用LLM模型的研究仍然有限，缺乏广泛的研究结果和公开数据。
使用Args.me和UKP等数据集的研究表明，ChatGPT-4o在论证分类基准测试中表现最佳。
结合推理能力的模型，如Deepseek-R1，也表现出优越性。
GPT-4o和Deepseek-R1等模型虽然表现优越，但仍存在错误，这些错误常见于论证分析和推理过程。

Cool Papers

点此查看论文截图

Visual Semantic Description Generation with MLLMs for Image-Text Matching

Authors:Junyu Chen, Yihua Gao, Mingyong Li

Image-text matching (ITM) aims to address the fundamental challenge of aligning visual and textual modalities, which inherently differ in their representations, continuous, high-dimensional image features vs. discrete, structured text. We propose a novel framework that bridges the modality gap by leveraging multimodal large language models (MLLMs) as visual semantic parsers. By generating rich Visual Semantic Descriptions (VSD), MLLMs provide semantic anchor that facilitate cross-modal alignment. Our approach combines: (1) Instance-level alignment by fusing visual features with VSD to enhance the linguistic expressiveness of image representations, and (2) Prototype-level alignment through VSD clustering to ensure category-level consistency. These modules can be seamlessly integrated into existing ITM models. Extensive experiments on Flickr30K and MSCOCO demonstrate substantial performance improvements. The approach also exhibits remarkable zero-shot generalization to cross-domain tasks, including news and remote sensing ITM. The code and model checkpoints are available at https://github.com/Image-Text-Matching/VSD.

图像文本匹配（ITM）旨在解决视觉和文本模态对齐的根本挑战，这两种模态在其表示方式上存在固有的差异，分别是连续、高维的图像特征和离散、结构的文本。我们提出了一种新的框架，利用多模态大型语言模型（MLLM）作为视觉语义解析器来弥合模态差异。通过生成丰富的视觉语义描述（VSD），MLLM提供了语义锚点，促进了跨模态对齐。我们的方法结合了：（1）实例级对齐：通过融合视觉特征与VSD，增强图像表示的语言表现力；（2）原型级对齐：通过VSD聚类，确保类别级一致性。这些模块可以无缝集成到现有的ITM模型中。在Flickr30K和MSCOCO上的大量实验证明了显著的性能改进。该方法在跨域任务（包括新闻和遥感ITM）中表现出令人印象深刻的零样本泛化能力。代码和模型检查点位于https://github.com/Image-Text-Matching/VSD。

论文及项目相关链接

PDF Accepted by ICME2025 oral

Summary
文本介绍了一种图像与文本匹配（ITM）的新框架，通过利用多模态大型语言模型（MLLMs）作为视觉语义解析器来解决视觉和文本模态对齐的根本挑战。该框架通过生成丰富的视觉语义描述（VSD），提供语义锚点，促进跨模态对齐。该框架结合了实例级对齐和原型级对齐两个模块，通过融合视觉特征与VSD增强图像表示的语言表现力，并通过VSD聚类确保类别级别的一致性。实验表明，该框架在Flickr30K和MSCOCO等数据集上实现了显著的性能提升，并在新闻和遥感ITM等跨域任务中表现出强大的零样本泛化能力。

Key Takeaways

图像-文本匹配（ITM）旨在解决视觉和文本模态对齐的挑战。
多模态大型语言模型（MLLMs）作为视觉语义解析器，生成视觉语义描述（VSD）。
VSD提供语义锚点，促进跨模态对齐。
实例级对齐模块融合视觉特征与VSD，增强图像表示的语言表现力。
原型级对齐模块通过VSD聚类确保类别级别的一致性。
该框架在Flickr30K和MSCOCO数据集上实现了性能提升。

Cool Papers

点此查看论文截图

AbbIE: Autoregressive Block-Based Iterative Encoder for Efficient Sequence Modeling

Authors:Preslav Aleksandrov, Meghdad Kurmanji, Fernando Garcia Redondo, David O’Shea, William Shen, Alex Iacob, Lorenzo Sani, Xinchi Qiu, Nicola Cancedda, Nicholas D. Lane

We introduce the Autoregressive Block-Based Iterative Encoder (AbbIE), a novel recursive generalization of the encoder-only Transformer architecture, which achieves better perplexity than a standard Transformer and allows for the dynamic scaling of compute resources at test time. This simple, recursive approach is a complement to scaling large language model (LLM) performance through parameter and token counts. AbbIE performs its iterations in latent space, but unlike latent reasoning models, does not require a specialized dataset or training protocol. We show that AbbIE upward generalizes (ability to generalize to arbitrary iteration lengths) at test time by only using 2 iterations during train time, far outperforming alternative iterative methods. AbbIE’s ability to scale its computational expenditure based on the complexity of the task gives it an up to \textbf{12%} improvement in zero-shot in-context learning tasks versus other iterative and standard methods and up to 5% improvement in language perplexity. The results from this study open a new avenue to Transformer performance scaling. We perform all of our evaluations on model sizes up to 350M parameters.

我们介绍了基于块的自回归迭代编码器（AbbIE），它是仅编码器Transformer架构的一种新型递归泛化，实现了比标准Transformer更低的困惑度，并在测试时允许动态调整计算资源。这种简单、递归的方法是通过参数和令牌计数来扩展大型语言模型（LLM）性能的补充。AbbIE在潜在空间中进行迭代，但与潜在推理模型不同的是，它不需要专门的数据集或训练协议。我们表明，AbbIE在测试时通过向上泛化（泛化到任意迭代长度的能力），在训练过程中只使用两次迭代，远远超过了其他迭代方法。AbbIE能够根据任务的复杂性调整其计算支出，使其在零镜头上下文学习任务中比其他迭代和标准方法最多提高了12%，在语言困惑度方面最多提高了5%。本研究的结果开辟了Transformer性能扩展的新途径。我们的评估都是在参数规模达3.5亿以下的模型上进行的。

论文及项目相关链接

PDF 14 pages and 6 figures. Submitted to NeurIPS 2025

Summary

新一代递归式语言模型——Autoregressive Block-Based Iterative Encoder（AbbIE）介绍。该模型在标准Transformer的基础上实现递归泛化，降低了测试时的计算资源消耗，实现了更低的困惑度。AbbIE通过迭代潜在空间执行任务，无需特殊数据集或训练协议。在训练时仅使用两次迭代，就能在测试时向上泛化到任意迭代长度。此外，AbbIE在零样本上下文学习任务上相比其他迭代和标准方法最多提升了12%，语言困惑度最多提升了5%。这一研究为Transformer性能提升开辟了新途径。模型评价范围涵盖参数规模高达3.5亿的模型。

Key Takeaways

引入了一种名为Autoregressive Block-Based Iterative Encoder（AbbIE）的新递归式语言模型。
AbbIE模型基于标准Transformer架构实现递归泛化，具有更低的困惑度。
AbbIE通过潜在空间的迭代执行任务，无需特殊数据集或训练协议。
AbbIE在训练时仅使用两次迭代，实现了测试时的向上泛化能力。
与其他迭代和标准方法相比，AbbIE在零样本上下文学习任务上性能显著提升，最高提升达到12%。同时降低了语言困惑度，最高降低达到5%。

Cool Papers

点此查看论文截图

White-Basilisk: A Hybrid Model for Code Vulnerability Detection

Authors:Ioannis Lamprou, Alexander Shevtsov, Ioannis Arapakis, Sotiris Ioannidis

The proliferation of software vulnerabilities presents a significant challenge to cybersecurity, necessitating more effective detection methodologies. We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance while challenging prevailing assumptions in AI model scaling. Utilizing an innovative architecture that integrates Mamba layers, linear self-attention, and a Mixture of Experts framework, White-Basilisk achieves state-of-the-art results in vulnerability detection tasks with a parameter count of only 200M. The model’s capacity to process sequences of unprecedented length enables comprehensive analysis of extensive codebases in a single pass, surpassing the context limitations of current Large Language Models (LLMs). White-Basilisk exhibits robust performance on imbalanced, real-world datasets, while maintaining computational efficiency that facilitates deployment across diverse organizational scales. This research not only establishes new benchmarks in code security but also provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks, potentially redefining optimization strategies in AI development for domain-specific applications.

软件漏洞的激增对网络安全构成了重大挑战，需要更有效的检测方法。我们引入了White-Basilisk，这是一种新型的漏洞检测方案，它在挑战人工智能模型规模化方面的主流假设的同时，表现出了卓越的性能。White-Basilisk利用了一种创新架构，该架构集成了Mamba层、线性自注意力机制和专家混合框架，仅在参数计数2亿的情况下，便在漏洞检测任务中实现了最新结果。该模型处理前所未有的序列长度的能力，能够在单次传递中对大量代码库进行全面分析，突破了当前大型语言模型（LLM）的上下文限制。White-Basilisk在不平衡的、真实世界的数据集上表现出稳健的性能，同时保持了计算效率，可以在各种组织规模上进行部署。这项研究不仅为代码安全树立了新基准，而且提供了实证证据，证明在特定任务中，设计紧凑、高效的模型可以超越大型模型的表现，这有可能重新定义人工智能开发的优化策略，为特定领域的应用提供指导。

论文及项目相关链接

PDF

Summary
白巴儿之蛇是一种新型的软件漏洞检测模型，具有出色的性能，挑战了人工智能模型扩展的普遍假设。它采用创新的架构，融合了曼巴层、线性自注意力机制和专家混合框架，以仅2亿个参数实现了最先进的漏洞检测结果。该模型能够处理前所未有的序列长度，可以在单个过程中全面分析大量的代码库，超越了当前大型语言模型的上下文限制。白巴儿之蛇在不平衡的、真实世界的数据集上表现出强大的性能，同时保持了计算效率，可以在各种组织规模上进行部署。这项研究不仅树立了代码安全性的新基准，也为专门任务的AI开发提供了优化策略的证据。这为重新审视优化策略提供了一个视角。即使在这个错综复杂的技术世界里，“精简、高效同样可以达到惊人效果”的主题依然是深入人心的座右铭。借助高效计算和资源优化的思路和实践手段让人类世界得到了无限的活力和希望。对后续的改进与技术的深入具有积极和广泛的影响价值及潜在的理论与实践应用价值。“取其精髓而表现最在的力量是无法超越的”，相信在这个意义上将会启发并促进领域技术的发展与应用价值的提高具有十分重要的意义及优势，这种小体积模型的推出极有可能促使当前庞大而繁杂的大模型的未来精简。这为当下代码安全性和漏洞检测等相关领域的进一步拓展与延伸奠定了坚实基础与思路启发作用。“简化后的东西最灵活”只有站在需求视角把握问题解决核心精髓才能使研究成果展现出最为真实的巨大价值和强大影响力才能够吸引行业从业者以及相关研究者继续在该领域中进行深度的探索和无限的发挥在结合更多领域的复杂代码来进一步完善的同时实现对大型语言模型的超越与颠覆为人工智能的发展注入新的活力。通过该模型的研究为人工智能的发展提供了全新的视角和思路，也为未来的研究提供了重要的参考和借鉴价值。同时，该模型的成功也为其他领域的研究提供了重要的启示和借鉴作用。通过白巴儿之蛇模型的研究与应用，我们有望构建一个更加安全、高效、智能的计算机系统。同时，这也将推动人工智能领域的不断发展与创新具有非常广泛的应用前景和推广价值通过更深入的研究和创新的应用将能够更好地保护网络安全并提高软件开发的质量和效率从而促进信息技术的持续发展和进步为人类社会带来更多的便利和福祉。这一突破性的技术将极大地推动人工智能的发展进程并为未来的人工智能技术开辟新的道路推动计算机科学的不断进步和持续繁荣为我们的未来发展创造更加广阔的前景和技术保障为我们的社会发展提供源源不断的动力和技术支撑以更有效地推动技术的进步和应用价值的同时保持系统的安全与稳定性至关重要其在实际应用中的表现值得期待和关注具有广阔的应用前景和重要的社会价值值得我们继续深入研究和探索。该模型具有强大的应用潜力能够为未来的软件开发和网络安全领域带来革命性的变化为推动整个计算机科学的发展做出贡献展现出更加重要的影响力以及对行业的重要推动意义在当前和未来均具备很高的价值和深远的影响力令人期待与探索的同时推进行业的不断进步与发展及价值创造。。使用简洁明了的句子，凸显模型的卓越性能和前景展望。Key Takeaways:

白巴儿之蛇模型利用创新架构实现了高效的软件漏洞检测。
模型具有处理长序列的能力，可全面分析大量代码库。
与现有大型语言模型相比，白巴儿之蛇模型具有更优秀的上下文处理能力。
模型在真实世界的不平衡数据集上表现出强大的稳健性。
白巴儿之蛇模型具有高效的计算性能，适用于各种组织规模。
研究证明了紧凑设计的模型在特定任务上可超越大型模型。

Cool Papers

点此查看论文截图

Circumventing Safety Alignment in Large Language Models Through Embedding Space Toxicity Attenuation

Authors:Zhibo Zhang, Yuxi Li, Kailong Wang, Shuai Yuan, Ling Shi, Haoyu Wang

Large Language Models (LLMs) have achieved remarkable success across domains such as healthcare, education, and cybersecurity. However, this openness also introduces significant security risks, particularly through embedding space poisoning, which is a subtle attack vector where adversaries manipulate the internal semantic representations of input data to bypass safety alignment mechanisms. While previous research has investigated universal perturbation methods, the dynamics of LLM safety alignment at the embedding level remain insufficiently understood. Consequently, more targeted and accurate adversarial perturbation techniques, which pose significant threats, have not been adequately studied. In this work, we propose ETTA (Embedding Transformation Toxicity Attenuation), a novel framework that identifies and attenuates toxicity-sensitive dimensions in embedding space via linear transformations. ETTA bypasses model refusal behaviors while preserving linguistic coherence, without requiring model fine-tuning or access to training data. Evaluated on five representative open-source LLMs using the AdvBench benchmark, ETTA achieves a high average attack success rate of 88.61%, outperforming the best baseline by 11.34%, and generalizes to safety-enhanced models (e.g., 77.39% ASR on instruction-tuned defenses). These results highlight a critical vulnerability in current alignment strategies and underscore the need for embedding-aware defenses.

大型语言模型（LLM）在医疗保健、教育和网络安全等领域取得了显著的成功。然而，这种开放性也引入了重大的安全风险，特别是通过嵌入空间中毒，这是一种微妙的攻击方式，攻击者会操纵输入数据的内部语义表示来绕过安全对齐机制。尽管之前的研究已经研究了通用扰动方法，但对LLM在嵌入层面的安全对齐动态的理解仍然不足。因此，更具针对性和准确的对抗性扰动技术没有得到足够的研究，这些技术对安全构成了重大威胁。在本研究中，我们提出了ETTA（嵌入转换毒性衰减），这是一种新型框架，通过线性转换来识别并减弱嵌入空间中的毒性敏感维度。ETTA能够绕过模型拒绝行为，同时保持语言连贯性，无需对模型进行微调或访问训练数据。在五个具有代表性的开源LLM上使用AdvBench基准测试进行评估，ETTA的平均攻击成功率高达88.61%，比最佳基线高出11.34%，并可以推广到增强安全性的模型（例如，指令调整防御的语音识别率为77.39%）。这些结果突显了当前对齐策略的关键漏洞，并强调了嵌入感知防御的需求。

论文及项目相关链接

PDF

Summary

大型语言模型（LLM）在多个领域取得了显著成功，但也带来了安全风险，特别是嵌入空间中毒问题。针对这一问题，本文提出ETTA框架，通过线性变换识别并减弱嵌入空间中的毒性敏感维度。ETTA绕过模型拒绝行为，同时保持语言连贯性，无需对模型进行微调或访问训练数据。在AdvBench基准测试上，ETTA对五种开源LLM的平均攻击成功率高达88.61%，比最佳基线高出11.34%，并适用于增强安全性的模型。

Key Takeaways

大型语言模型（LLM）在多个领域表现优异，但存在安全风险，尤其是嵌入空间中毒问题。
ETTA框架被提出用于识别并减弱嵌入空间中的毒性敏感维度。
ETTA能够绕过模型的拒绝行为，同时保持语言的连贯性。
ETTA无需对模型进行微调或访问训练数据。
ETTA在AdvBench基准测试上的攻击成功率高达88.61%，显著优于其他方法。
ETTA适用于增强安全性的模型，并在这些模型上取得了良好的性能。

Cool Papers

点此查看论文截图

Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models

Authors:Varin Sikka, Vishal Sikka

With widespread adoption of transformer-based language models in AI, there is significant interest in the limits of LLMs capabilities, specifically so-called hallucinations, occurrences in which LLMs provide spurious, factually incorrect or nonsensical information when prompted on certain subjects. Furthermore, there is growing interest in agentic uses of LLMs - that is, using LLMs to create agents that act autonomously or semi-autonomously to carry out various tasks, including tasks with applications in the real world. This makes it important to understand the types of tasks LLMs can and cannot perform. We explore this topic from the perspective of the computational complexity of LLM inference. We show that LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity, and further that LLMs are incapable of verifying the accuracy of tasks beyond a certain complexity. We present examples of both, then discuss some consequences of this work.

随着基于变压器架构的语言模型在人工智能中的广泛应用，对于大型语言模型（LLM）的能力极限产生了浓厚的兴趣，特别是所谓的“幻觉”现象。当针对某些主题提示时，LLM会产生虚假、事实错误或无意义的信息。此外，人们还越来越关注LLM的代理使用，即使用LLM创建能够自主或半自主执行各种任务的代理，包括在现实世界中应用的任务。这使我们有必要了解LLM能够执行和不能执行的任务类型。我们从计算复杂性的角度探讨这一主题，对LLM推理进行探讨。我们证明了LLM无法执行超出一定复杂度的计算和代理任务，并且进一步证明LLM无法验证超出一定复杂度的任务的准确性。我们提供了这两方面的例子，然后讨论这项工作的后果。

论文及项目相关链接

PDF 6 pages; to be submitted to AAAI-26 after reviews

Summary

随着基于Transformer的语言模型在AI中的广泛应用，对于大型语言模型（LLM）的能力极限，特别是所谓的“幻觉”现象，人们产生了极大的兴趣。本文探讨了LLM的计算复杂性，展示了LLM无法执行超出一定复杂度的计算和代理任务，也无法验证超出一定复杂度的任务的准确性。

Key Takeaways

LLMs存在能力极限，无法执行超出一定复杂度的计算和代理任务。
LLMs在特定主题提示下可能会提供错误、无意义的信息，即所谓的“幻觉”现象。
LLMs无法验证超出一定复杂度的任务的准确性。
本文探讨了LLM的计算复杂性，并提供了相关的实例说明。
了解和掌握LLM的能力范围对于有效使用它们至关重要。
LLMs在现实世界应用中的代理使用需要谨慎，因为它们可能无法自主或半自主地完成复杂的任务。

Cool Papers

点此查看论文截图

Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery

Authors:Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekioui, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet

We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human-in-the-loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.

我们提出一个用于自动化科研任务的多智能体系统，名为cmbagent（https://github.com/CMBAgents/cmbagent）。该系统由大约30个大型语言模型（LLM）智能体组成，采用规划与控制策略来协调智能体的工作流程，且在任何时刻都不需要人工介入。每个智能体都专注于不同的任务（如执行科学论文和代码库的检索、编写代码、解读结果、评价其他智能体的输出），并且系统能够在本地执行代码。我们成功地将cmbagent应用于执行博士级别的宇宙学任务（使用超新星数据测量宇宙学参数），并在两个基准测试集上评估其性能，发现其性能优于现有的最先进的大型语言模型。源代码已在GitHub上提供，演示视频也已发布，该系统已部署在HuggingFace上，并将可在云端使用。

论文及项目相关链接

PDF Accepted contribution to the ICML 2025 Workshop on Machine Learning for Astrophysics. Code: https://github.com/CMBAgents/cmbagent Videos: https://www.youtube.com/@cmbagent HuggingFace: https://huggingface.co/spaces/astropilot-ai/cmbagent Cloud: https://cmbagent.cloud

Summary

一个集成了约30个大型语言模型（LLM）的多智能体系统被开发出来，用于自动化科研任务。该系统采用规划和控制策略来协调智能体的工作流程，整个过程中无需人工参与。每个智能体都能完成不同的任务，如检索科学论文和代码库、编写代码、解读结果、评估其他智能体的输出等。该系统已成功应用于一项博士级别的宇宙学任务（使用超新星数据测量宇宙学参数），并在两个基准测试集上表现出超越当前先进LLM的性能。源代码已在GitHub上提供，还有演示视频，该系统已部署在HuggingFace上，并将可在云端使用。

Key Takeaways

介绍了一个多智能体系统，用于自动化科研任务。
系统集成了约30个大型语言模型（LLM）。
采用规划和控制策略协调智能体的工作流程，无需人工参与。
智能体可完成不同任务，如检索、编写代码、解读结果、评估输出等。
成功应用于博士级别的宇宙学任务，测量宇宙学参数。
在两个基准测试集上表现出超越当前先进LLM的性能。

Cool Papers

点此查看论文截图

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference

Authors:Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral

Large language models have been widely adopted across different tasks, but their auto-regressive generation nature often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models. The code is publicly available at https://github.com/FerranAgulloLopez/vLLMBatchingMemoryGap.

大型语言模型已被广泛应用于各种任务，但其自回归生成特性往往导致推理过程中的资源利用率低下。虽然批处理通常用于提高吞吐量，但在达到一定批量大小后，性能提升会遇到瓶颈，尤其是对于较小的模型，现有文献通常将此现象解释为转向计算受限状态。在本文中，我们通过深入的GPU级别分析揭示，大批量推理仍然是内存受限的，由于DRAM带宽饱和是主要瓶颈，导致大多数GPU计算能力未得到充分利用。为了解决这一问题，我们提出了批处理配置顾问（BCA），通过优化内存分配，以最小限度地影响吞吐量的方式降低GPU内存要求。释放的内存和未充分利用的GPU计算能力可以被并发工作负载所利用。具体来说，我们使用模型复制来提高服务吞吐量和GPU利用率。我们的研究挑战了关于大型语言模型推理的常规假设，为改进资源利用提供了新见解和实际策略，特别是对于较小的语言模型。代码公开在https://github.com/FerranAgulloLopez/vLLMBatchingMemoryGap。

论文及项目相关链接

PDF Pol G. Recasens, Ferran Agullo: equal contribution. Paper accepted at IEEE CLOUD 2025

Summary
大语言模型广泛应用于各种任务，但其自回归生成特性导致推理过程中的资源利用效率低下。批处理通常用于提高吞吐量，但超过一定批次大小后性能提升平台化，特别是对小模型而言。本文深入GPU层面分析，揭示大批量推理仍存在内存瓶颈问题，GPU计算能力大部分未得到充分利用，主要是由于DRAM带宽饱和所致。为解决这一问题，本文提出批处理配置顾问（BCA），优化内存分配，以减小对吞吐量的影响。释放的内存和未充分利用的GPU计算能力可用于并发工作负载。具体来说，通过模型复制提高服务吞吐量和GPU利用率。本文挑战了关于LLM推理的常规假设，为改进资源利用提供了新见解和实用策略，特别是对小语言模型而言。

Key Takeaways

大语言模型的自回归生成特性导致推理过程中的资源利用效率低下，特别是对小模型而言，性能提升在超过一定批次大小后趋于平稳。
通过深入GPU层面的分析，发现大批量推理存在内存瓶颈问题，大部分GPU计算能力未被充分利用，主要原因是DRAM带宽饱和。
提出批处理配置顾问（BCA）以优化内存分配，减小对吞吐量的影响，并通过释放的内存和未充分利用的GPU计算能力来提高并发工作负载的效率。
通过模型复制来提高服务吞吐量和GPU利用率是一种有效的策略。
本文挑战了关于LLM推理的常规假设，为改进资源利用提供了新的见解和实用策略。
公开可用的代码为研究人员和实践者提供了实现上述策略的工具和参考。

Cool Papers

点此查看论文截图

Weak-to-Strong Jailbreaking on Large Language Models

Authors:Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang

Large language models (LLMs) are vulnerable to jailbreak attacks - resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference time attack for aligned LLMs to produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong

大型语言模型（LLM）容易受到越狱攻击，导致生成有害、不道德或偏见的文本。然而，现有的越狱方法计算成本高昂。在本文中，我们提出了弱到强越狱攻击，这是一种高效的推理时间攻击，用于针对对齐的大型语言模型生成有害文本。我们的主要直觉是基于这样的观察：越狱和对齐的模型只在初始解码分布上有所不同。弱到强攻击的关键技术见解是使用两个较小的模型（一个安全模型和一个不安全模型）来对抗性地修改一个更大的安全模型的解码概率。我们在来自三个组织的五个开源大型语言模型上评估了弱到强攻击。结果表明，我们的方法可以在两个数据集上将错位率提高到超过99%，每个例子只需进行一次正向传递。我们的研究揭示了当对齐大型语言模型时，需要解决一个紧急的安全问题。作为一种初步尝试，我们提出了一种防御策略来防止此类攻击，但创建更先进的防御手段仍然是一个挑战。复制该方法的代码可在https://github.com/XuandongZhao/weak-to-strong 找到。

论文及项目相关链接

PDF ICML 2025

Summary
大型语言模型（LLM）易受“越狱攻击”，产生有害、不道德或偏向的文本生成。现有越狱方法计算成本高。本文提出弱到强越狱攻击，这是一种高效的推理时间攻击，用于针对对齐的大型语言模型产生有害文本。我们的关键直觉是基于观察，即越狱和对齐模型之间的区别仅在于其初始解码分布。弱到强攻击的关键技术见解是使用两个较小的模型（一个安全模型和一个不安全模型）来对抗性地修改一个更大的安全模型的解码概率。我们在五个开源的大型语言模型上评估了弱到强攻击，来自三个组织。结果表明，我们的方法可以在两个数据集上将错位率提高到超过99%，每个示例只需进行一次前向传递。我们的研究揭示了一个紧急的安全问题，需要在对齐大型语言模型时解决。作为初步尝试，我们提出了一种防御策略来防止此类攻击，但创建更先进的防御仍然具有挑战性。可复制该方法的代码位于https://github.com/XuandongZhao/weak-to-strong。

Key Takeaways