⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-06-08 更新
SparseMM: Head Sparsity Emerges from Visual Concept Responses in MLLMs
Authors:Jiahui Wang, Zuyan Liu, Yongming Rao, Jiwen Lu
Multimodal Large Language Models (MLLMs) are commonly derived by extending pre-trained Large Language Models (LLMs) with visual capabilities. In this work, we investigate how MLLMs process visual inputs by analyzing their attention mechanisms. We reveal a surprising sparsity phenomenon: only a small subset (approximately less than 5%) of attention heads in LLMs actively contribute to visual understanding, termed visual heads. To identify these heads efficiently, we design a training-free framework that quantifies head-level visual relevance through targeted response analysis. Building on this discovery, we introduce SparseMM, a KV-Cache optimization strategy that allocates asymmetric computation budgets to heads in LLMs based on their visual scores, leveraging the sparsity of visual heads for accelerating the inference of MLLMs. Compared with prior KV-Cache acceleration methods that ignore the particularity of visual inputs, SparseMM prioritizes stressing and retaining visual semantics during decoding. Extensive evaluations across mainstream multimodal benchmarks demonstrate that SparseMM achieves superior accuracy-efficiency trade-offs. Notably, SparseMM delivers 1.38x real-time acceleration and 52% memory reduction during generation while maintaining performance parity on the efficiency test. Our project is open sourced at https://github.com/CR400AF-A/SparseMM.
多模态大型语言模型(MLLMs)通常是通过在预训练的大型语言模型(LLMs)的基础上增强其视觉能力而得到的。在这项研究中,我们通过分析注意力机制来研究MLLMs如何处理视觉输入。我们揭示了一个令人惊讶的稀疏现象:在LLMs中,只有一小部分(大约不到5%)的注意力头对视觉理解有积极贡献,这些被称为视觉头。为了有效地识别这些头部,我们设计了一个无训练框架,该框架通过有针对性的响应分析量化头部级别的视觉相关性。基于这一发现,我们引入了SparseMM,这是一种KV-Cache优化策略,根据视觉分数为LLMs中的头部分配不对称的计算预算,利用视觉头部的稀疏性来加速MLLMs的推理过程。与之前忽略视觉特性的KV-Cache加速方法相比,SparseMM在解码过程中优先强调并保留视觉语义。在主流的多模态基准测试上的广泛评估表明,SparseMM实现了卓越的准确性-效率权衡。值得注意的是,在生成过程中,SparseMM实现了1.38倍的实时加速和52%的内存减少,同时在效率测试中保持性能持平。我们的项目已开源在https://github.com/CR400AF-A/SparseMM。
论文及项目相关链接
Summary
本文研究了多模态大型语言模型(MLLMs)如何处理视觉输入,通过对其注意力机制进行分析,揭示了一种惊人的稀疏现象:只有一小部分(大约少于5%)的注意力头对视觉理解有积极贡献,被称为视觉头。基于此发现,本文提出了一种基于KV-Cache优化的SparseMM策略,根据视觉分数对LLM中的头进行不对称计算预算分配,利用视觉头的稀疏性加速MLLM的推理。实验表明,SparseMM在主流的多模态基准测试中实现了卓越的准确性-效率权衡,实现了1.38倍的实时加速和52%的内存减少,同时保持性能不变。
Key Takeaways
- 多模态大型语言模型(MLLMs)是通过扩展预训练的大型语言模型(LLMs)并加入视觉能力形成的。
- MLLMs处理视觉输入时,只有一小部分注意力头(视觉头)对视觉理解有贡献。
- 提出了一种无需训练的框架,通过有针对性的响应分析来量化头部级别的视觉相关性。
- 基于这一发现,引入了SparseMM策略,这是一种KV-Cache优化策略,根据视觉分数为LLM中的头分配不对称的计算预算。
- SparseMM利用视觉头的稀疏性来加速MLLM的推理。
- 与忽略视觉特性的KV-Cache加速方法相比,SparseMM在解码过程中优先强调并保留视觉语义。
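下面给出一个极简的示意代码(并非论文官方实现),用于说明"按视觉得分为各注意力头分配不对称KV-Cache预算"这一思路;其中的函数名、保底缓存量 min_per_head 以及按比例分配的规则均为本示例的假设:

```python
import numpy as np

def allocate_kv_budget(visual_scores, total_budget, min_per_head=4):
    """按视觉得分在各注意力头之间分配不对称的KV-Cache预算(示意)。"""
    scores = np.asarray(visual_scores, dtype=float)
    num_heads = scores.shape[0]
    base = min_per_head * num_heads
    assert total_budget >= base, "总预算必须覆盖每个头的保底缓存"
    # 剩余预算按视觉得分的比例分配:得分高的"视觉头"可以缓存更多token
    weights = scores / scores.sum() if scores.sum() > 0 else np.full(num_heads, 1.0 / num_heads)
    extra = np.floor(weights * (total_budget - base)).astype(int)
    return min_per_head + extra

# 玩具示例:32个头中只有少数头(约5%)具有显著的视觉得分
rng = np.random.default_rng(0)
is_visual = rng.random(32) < 0.05
scores = np.where(is_visual, rng.uniform(5.0, 10.0, 32), rng.uniform(0.0, 0.1, 32))
budgets = allocate_kv_budget(scores, total_budget=1024)
print("视觉头数量:", int(is_visual.sum()), "预算分布:", budgets)
```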




VideoMolmo: Spatio-Temporal Grounding Meets Pointing
Authors:Ghazi Shazan Ahmad, Ahmed Heakl, Hanan Gani, Abdelrahman Shaker, Zhiqiang Shen, Ranjay Krishna, Fahad Shahbaz Khan, Salman Khan
Spatio-temporal localization is vital for precise interactions across diverse domains, from biological research to autonomous navigation and interactive interfaces. Current video-based approaches, while proficient in tracking, lack the sophisticated reasoning capabilities of large language models, limiting their contextual understanding and generalization. We introduce VideoMolmo, a large multimodal model tailored for fine-grained spatio-temporal pointing conditioned on textual descriptions. Building upon the Molmo architecture, VideoMolmo incorporates a temporal module utilizing an attention mechanism to condition each frame on preceding frames, ensuring temporal consistency. Additionally, our novel temporal mask fusion pipeline employs SAM2 for bidirectional point propagation, significantly enhancing coherence across video sequences. This two-step decomposition, i.e., first using the LLM to generate precise pointing coordinates, then relying on a sequential mask-fusion module to produce coherent segmentation, not only simplifies the task for the language model but also enhances interpretability. Due to the lack of suitable datasets, we curate a comprehensive dataset comprising 72k video-caption pairs annotated with 100k object points. To evaluate the generalization of VideoMolmo, we introduce VPoS-Bench, a challenging out-of-distribution benchmark spanning five real-world scenarios: Cell Tracking, Egocentric Vision, Autonomous Driving, Video-GUI Interaction, and Robotics. We also evaluate our model on Referring Video Object Segmentation (Refer-VOS) and Reasoning VOS tasks. In comparison to existing models, VideoMolmo substantially improves spatio-temporal pointing accuracy and reasoning capability. Our code and models are publicly available at https://github.com/mbzuai-oryx/VideoMolmo.
时空定位在多个领域(从生物研究到自主导航和交互式界面)的精确交互中起着至关重要的作用。尽管当前基于视频的方法在跟踪方面非常熟练,但它们缺乏大型语言模型的复杂推理能力,从而限制了其在上下文理解和泛化方面的能力。我们引入了VideoMolmo,这是一个专为基于文本描述的细粒度时空指向而设计的大型多模态模型。VideoMolmo建立在Molmo架构之上,并加入了一个时间模块,该模块利用注意力机制使每一帧以先前各帧为条件,从而确保时间一致性。此外,我们新颖的时间掩膜融合管道采用SAM2进行双向点传播,显著提高了视频序列之间的连贯性。这种两步分解法,即首先使用大型语言模型生成精确的指向坐标,然后依赖顺序掩膜融合模块来产生连贯的分割,不仅简化了语言模型的任务,还提高了可解释性。由于缺乏合适的数据集,我们整理了一个综合数据集,包含7.2万对视频-字幕对,并标注了10万个目标点。为了评估VideoMolmo的泛化能力,我们引入了VPoS-Bench这一颇具挑战性的分布外基准测试,它涵盖五种现实场景:细胞追踪、以自我为中心的视觉、自动驾驶、视频界面交互和机器人技术。我们还评估了我们的模型在指代视频对象分割(Refer-VOS)和推理VOS任务上的表现。与现有模型相比,VideoMolmo大大提高了时空定位精度和推理能力。我们的代码和模型可在https://github.com/mbzuai-oryx/VideoMolmo公开访问。
论文及项目相关链接
PDF 20 pages, 13 figures
Summary
本文介绍了VideoMolmo,这是一个专为基于文本描述的细粒度时空指向而设计的大型多模态模型。它在Molmo架构上加入时序模块,并通过采用SAM2进行双向点传播的时间掩膜融合管道增强了视频序列的连贯性,从而提高了时空定位精度和推理能力。由于缺乏合适的数据集,作者整理了一个包含7.2万对视频-字幕、标注10万个目标点的数据集用于训练;并提出了涵盖五种现实场景的分布外基准VPoS-Bench用于评估泛化能力。总体上,VideoMolmo在时空定位精度和推理能力上显著优于现有模型。
Key Takeaways
- VideoMolmo是一个大型多模态模型,专门用于精细的时空定位任务。
- 该模型结合了文本描述进行时空定位,增强了模型的理解和推理能力。
- VideoMolmo采用了新颖的时空掩膜融合管道以提高视频序列连贯性。
- 为训练该模型,作者整理了一个包含7.2万对视频-字幕、10万个目标点标注的大规模数据集;并提出分布外基准VPoS-Bench用于评估。
- VideoMolmo模型引入了新的跨域基准测试来评估其泛化能力。该模型可广泛应用于不同领域如生物学研究、自动驾驶和交互式界面等。
- VideoMolmo在时空定位精度和推理能力方面显著优于现有模型。



Search Arena: Analyzing Search-Augmented LLMs
Authors:Mihran Miroyan, Tsung-Han Wu, Logan King, Tianle Li, Jiayi Pan, Xinyan Hu, Wei-Lin Chiang, Anastasios N. Angelopoulos, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez
Search-augmented language models combine web search with Large Language Models (LLMs) to improve response groundedness and freshness. However, analyzing these systems remains challenging: existing datasets are limited in scale and narrow in scope, often constrained to static, single-turn, fact-checking questions. In this work, we introduce Search Arena, a crowd-sourced, large-scale, human-preference dataset of over 24,000 paired multi-turn user interactions with search-augmented LLMs. The dataset spans diverse intents and languages, and contains full system traces with around 12,000 human preference votes. Our analysis reveals that user preferences are influenced by the number of citations, even when the cited content does not directly support the attributed claims, uncovering a gap between perceived and actual credibility. Furthermore, user preferences vary across cited sources, revealing that community-driven platforms are generally preferred and static encyclopedic sources are not always appropriate and reliable. To assess performance across different settings, we conduct cross-arena analyses by testing search-augmented LLMs in a general-purpose chat environment and conventional LLMs in search-intensive settings. We find that web search does not degrade and may even improve performance in non-search settings; however, the quality in search settings is significantly affected if solely relying on the model’s parametric knowledge. We open-sourced the dataset to support future research in this direction. Our dataset and code are available at: https://github.com/lmarena/search-arena.
搜索增强型语言模型结合了网络搜索与大型语言模型(LLM),以提高响应的真实性和实时性。然而,分析这些系统仍然具有挑战性:现有数据集规模有限,范围狭窄,通常仅限于静态、单轮的事实核查问题。在这项工作中,我们介绍了Search Arena,这是一个大规模、众源的人类偏好数据集,包含超过24,000个与搜索增强型LLM的多轮用户交互配对。该数据集涵盖多样的意图和语言,包含约12,000次人类偏好投票的完整系统跟踪记录。我们的分析表明,用户偏好受到引用数量的影响,即使引用的内容并没有直接支持所归属的主张,这揭示了感知可信度和实际可信度之间的差距。此外,不同来源的用户偏好各不相同,表明社区驱动的平台通常更受欢迎,而静态百科全书来源并不总是适当和可靠。为了评估不同环境下的性能,我们通过测试搜索增强型LLM在通用聊天环境和传统LLM在搜索密集型环境中的表现来进行跨领域分析。我们发现,在非搜索环境中,网络搜索并不会降低性能,甚至可能提高性能;然而,如果在仅依赖模型的参数知识的情况下,搜索环境中的质量会显著受到影响。我们公开了数据集,以支持未来在这方面的研究。我们的数据集和代码可在https://github.com/lmarena/search-arena找到。
论文及项目相关链接
PDF Preprint. Code: https://github.com/lmarena/search-arena. Dataset: https://huggingface.co/datasets/lmarena-ai/search-arena-24k
Summary
本文介绍了搜索增强语言模型(Search-augmented LLMs)的研究。为提高响应的扎实性和新颖性,研究者将网页搜索与大型语言模型(LLMs)结合。为分析这些系统,研究团队推出了一款名为Search Arena的大型、人群参与的大规模数据集,涵盖了超过2万四千次配对的多轮用户与搜索增强LLMs的互动。分析显示,用户偏好受引用数量影响,即使引用的内容并未直接支持所声明的观点,感知可信度与实际可信度之间存在差距。此外,用户对引用来源的偏好各异,社区驱动平台普遍受欢迎,而静态百科全书来源并非始终适当和可靠。评估不同设置中的性能时,研究发现网页搜索在非搜索环境中并不会降低性能,甚至可能提高性能;但如果仅依赖模型的参数知识,则在搜索环境中的质量会明显受到影响。数据集已开源,支持未来相关研究。
Key Takeaways
- 搜索增强语言模型结合了网页搜索和大型语言模型(LLMs),以提高响应的扎实性和新颖性。
- 现有数据集在规模和范围上存在局限性,常局限于静态、单轮、查证事实的问题。
- 推出名为Search Arena的大型数据集,包含超过2万四千次配对的多轮用户与搜索增强LLMs的互动记录。
- 用户偏好受引用数量影响,即使引用的内容未直接支持观点,也存在感知可信度与实际可信度的差距。
- 用户对引用来源的偏好不同,社区驱动平台更受欢迎,静态百科全书来源并非始终可靠。
- 网页搜索在非搜索环境中不会降低性能,甚至可能提高性能;但在搜索环境中,仅依赖模型的参数知识会影响质量。





MINT-CoT: Enabling Interleaved Visual Tokens in Mathematical Chain-of-Thought Reasoning
Authors:Xinyan Chen, Renrui Zhang, Dongzhi Jiang, Aojun Zhou, Shilin Yan, Weifeng Lin, Hongsheng Li
Chain-of-Thought (CoT) has widely enhanced mathematical reasoning in Large Language Models (LLMs), but it still remains challenging for extending it to multimodal domains. Existing works either adopt a similar textual reasoning for image input, or seek to interleave visual signals into mathematical CoT. However, they face three key limitations for math problem-solving: reliance on coarse-grained box-shaped image regions, limited perception of vision encoders on math content, and dependence on external capabilities for visual modification. In this paper, we propose MINT-CoT, introducing Mathematical INterleaved Tokens for Chain-of-Thought visual reasoning. MINT-CoT adaptively interleaves relevant visual tokens into textual reasoning steps via an Interleave Token, which dynamically selects visual regions of any shapes within math figures. To empower this capability, we construct the MINT-CoT dataset, containing 54K mathematical problems aligning each reasoning step with visual regions at the token level, accompanied by a rigorous data generation pipeline. We further present a three-stage MINT-CoT training strategy, progressively combining text-only CoT SFT, interleaved CoT SFT, and interleaved CoT RL, which derives our MINT-CoT-7B model. Extensive experiments demonstrate the effectiveness of our method for effective visual interleaved reasoning in mathematical domains, where MINT-CoT-7B outperforms the baseline model by +34.08% on MathVista, +28.78% on GeoQA, and +23.2% on MMStar, respectively. Our code and data are available at https://github.com/xinyan-cxy/MINT-CoT
链式思维(Chain-of-Thought,简称CoT)已在大型语言模型(LLM)中广泛提高了数学推理能力,但将其扩展到多模态领域仍然具有挑战性。现有工作要么采用类似的文本推理进行图像输入,要么寻求将视觉信号融入数学CoT。然而,它们在解决数学问题方面面临三个主要局限性:依赖粗粒度的框状图像区域、视觉编码器对数学内容的感知有限,以及依赖外部能力进行视觉修改。在本文中,我们提出MINT-CoT,引入用于链式思维视觉推理的数学交织令牌(Mathematical INterleaved Tokens)。MINT-CoT通过交织令牌(Interleave Token)自适应地将相关视觉令牌交织到文本推理步骤中,该令牌动态选择数学图形内任意形状的视觉区域。为了支持此功能,我们构建了MINT-CoT数据集,包含54K个数学问题,每个推理步骤都与令牌级别的视觉区域对齐,并配有严格的数据生成流程。我们还提出了一个三阶段的MINT-CoT训练策略,逐步结合纯文本CoT SFT、交织CoT SFT和交织CoT RL,从而得到我们的MINT-CoT-7B模型。大量实验证明了我们的方法在数学领域进行视觉交织推理的有效性:MINT-CoT-7B在MathVista上超越基线模型+34.08%,在GeoQA上+28.78%,在MMStar上+23.2%。我们的代码和数据可在https://github.com/xinyan-cxy/MINT-CoT找到。
论文及项目相关链接
PDF Code is released at https://github.com/xinyan-cxy/MINT-CoT
Summary
本文介绍了在数学领域中,Chain-of-Thought(CoT)在多模态领域的应用挑战。现有方法存在依赖粗粒度图像区域、对数学问题视觉编码的认知受限以及对视觉修改的外部能力依赖等三个关键局限性。本文提出了MINT-CoT,引入了数学交织令牌(Mathematical INterleaved Tokens)进行视觉推理。MINT-CoT自适应地将相关视觉令牌插入到文本推理步骤中,通过动态选择数学图形内的任何形状视觉区域来实现这一点。为了支持此功能,构建了包含54K数学问题的MINT-CoT数据集,每个推理步骤都与视觉区域在令牌级别对齐。同时介绍了三个阶段的MINT-CoT训练策略,包括纯文本CoT的SFT、交织CoT的SFT和交织CoT的RL。实验结果证明了MINT-CoT的有效性,其中MINT-CoT-7B模型在MathVista、GeoQA和MMStar上的性能分别优于基线模型+34.08%、+28.78%和+23.2%。
Key Takeaways
- MINT-CoT解决了现有方法在将Chain-of-Thought(CoT)应用于多模态领域的挑战。
- 现有方法在数学问题求解中存在三个关键局限性:依赖粗粒度图像区域、对数学问题视觉编码的认知受限以及对视觉修改的外部能力依赖。
- MINT-CoT引入了数学交织令牌(Mathematical INterleaved Tokens)进行视觉推理,可以自适应地将相关视觉令牌插入文本推理步骤中。
- MINT-CoT使用动态选择数学图形内任何形状视觉区域的Interleave Token技术。
- MINT-CoT数据集包含54K数学问题,每个推理步骤都与视觉区域在令牌级别对齐。
- MINT-CoT训练策略采用三个阶段的训练方法,包括纯文本CoT的SFT、交织CoT的SFT和交织CoT的RL。
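下面是一个简化示意(并非官方实现),演示"按当前推理步骤动态选取任意形状的视觉token子集并交织进该步骤"的基本思路;其中的嵌入维度、余弦相似度选取方式与阈值均为示例假设:

```python
import torch

def select_visual_tokens(step_emb, visual_tokens, threshold=0.1):
    """按与推理步骤嵌入的相似度,从数学图形的视觉token中
    选出任意形状的相关区域(即任意token子集);threshold为示例假设。"""
    sim = torch.nn.functional.cosine_similarity(
        visual_tokens, step_emb.unsqueeze(0), dim=-1)
    return (sim > threshold).nonzero(as_tuple=True)[0]

torch.manual_seed(0)
visual_tokens = torch.randn(196, 256)   # 14x14 个patch的视觉token(示意)
step_emb = torch.randn(256)             # 当前推理步骤的文本表示(示意)
idx = select_visual_tokens(step_emb, visual_tokens)
# 实际系统中,选中的视觉token会被交织进该步骤的输入序列;此处仅打印选中的patch编号
print("与该推理步骤相关的视觉token数量:", len(idx), "示例位置:", idx.tolist()[:10])
```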




Constrained Entropic Unlearning: A Primal-Dual Framework for Large Language Models
Authors:Taha Entesari, Arman Hatami, Rinat Khaziev, Anil Ramakrishna, Mahyar Fazlyab
Large Language Models (LLMs) deployed in real-world settings increasingly face the need to unlearn sensitive, outdated, or proprietary information. Existing unlearning methods typically formulate forgetting and retention as a regularized trade-off, combining both objectives into a single scalarized loss. This often leads to unstable optimization and degraded performance on retained data, especially under aggressive forgetting. We propose a new formulation of LLM unlearning as a constrained optimization problem: forgetting is enforced via a novel logit-margin flattening loss that explicitly drives the output distribution toward uniformity on a designated forget set, while retention is preserved through a hard constraint on a separate retain set. Compared to entropy-based objectives, our loss is softmax-free, numerically stable, and maintains non-vanishing gradients, enabling more efficient and robust optimization. We solve the constrained problem using a scalable primal-dual algorithm that exposes the trade-off between forgetting and retention through the dynamics of the dual variable. Evaluations on the TOFU and MUSE benchmarks across diverse LLM architectures demonstrate that our approach consistently matches or exceeds state-of-the-art baselines, effectively removing targeted information while preserving downstream utility.
大型语言模型(LLM)在现实世界的部署中越来越需要遗忘敏感、过时或专有信息。现有的遗忘方法通常将遗忘和保留表述为带正则化的权衡,将两个目标结合成一个单一的标量化损失。这通常会导致优化不稳定,以及在保留数据上的性能下降,尤其是在强烈的遗忘情况下。我们提出了一种新的LLM遗忘公式,将其视为约束优化问题:通过新的logit-margin平坦化损失强制执行遗忘,该损失显式地将输出分布推向指定遗忘集上的均匀分布,同时通过在单独的保留集上施加硬约束来保持模型在保留数据上的表现。与基于熵的目标相比,我们的损失无需softmax,数值稳定,保持非零梯度,能够实现更高效和稳健的优化。我们使用可扩展的原始-对偶(primal-dual)算法解决约束问题,通过对偶变量的动态变化揭示遗忘与保留之间的权衡。在TOFU和MUSE基准测试上对多种LLM架构的评估表明,我们的方法始终匹配或超过最新基线,在去除目标信息的同时保持下游实用性。
论文及项目相关链接
Summary
大语言模型部署在现实世界中时,需要去除敏感、过时或专有信息。现有去学习方法通常将遗忘和保留表述为正则化的权衡问题,导致优化不稳定和对保留数据的性能下降。为此,我们提出将大语言模型去学习表述为约束优化问题,通过新颖的logit-margin平坦化损失强制输出分布均匀化以实现遗忘,同时在保留集上施加硬约束以保持模型原有表现。我们的损失函数无需softmax,数值稳定,保持非零梯度,实现了更高效的优化。采用可扩展的原始-对偶算法解决约束问题,通过对偶变量的动态变化揭示遗忘和保留之间的权衡。在TOFU和MUSE基准上的评估表明,我们的方法能够达到或超过最新基线水平,在有效消除目标信息的同时保留下游实用性。
Key Takeaways
- 大语言模型在现实应用中需要去除敏感、过时或专有信息。
- 现有去学习方法存在优化不稳定和对保留数据性能下降的问题。
- 提出将大语言模型去学习表述为约束优化问题,通过logit-margin平坦损失实现遗忘。
- 损失函数无需softmax,数值稳定,保持非零梯度,优化更高效。
- 采用可扩展的原始-对偶算法解决约束问题,通过对偶变量的动态变化展现遗忘和保留之间的权衡。
- 方法在TOFU和MUSE基准上表现优异,能有效去除目标信息同时保留下游实用性。
- 这种方法为大型语言模型的去学习提供了一个新的、有效的解决方案。
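为直观说明"约束优化 + 原始-对偶求解"的机制,下面用一个玩具线性分类器给出示意(并非论文实现):logit拉平损失的具体形式、保留约束阈值 epsilon 和对偶步长 lr_dual 均为本示例的假设:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(16, 8)                      # 玩具"模型":16维特征 -> 8类logits
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ce = nn.CrossEntropyLoss()

retain_x, retain_y = torch.randn(64, 16), torch.randint(0, 8, (64,))   # 保留集
forget_x = torch.randn(64, 16)                                          # 指定遗忘集

for _ in range(300):                          # 先在保留集上训练出"已有知识"
    opt.zero_grad(); ce(model(retain_x), retain_y).backward(); opt.step()

def logit_flatten_loss(logits):
    # softmax-free的"logit拉平"替代损失:惩罚logit偏离其均值,把输出推向均匀分布
    return ((logits - logits.mean(dim=-1, keepdim=True)) ** 2).mean()

with torch.no_grad():                         # 保留约束:不能比训练后的水平差太多(阈值为假设)
    epsilon = 1.05 * ce(model(retain_x), retain_y).item()

lam, lr_dual = torch.tensor(0.0), 0.5         # 对偶变量及其步长
for _ in range(300):
    forget_loss = logit_flatten_loss(model(forget_x))
    retain_loss = ce(model(retain_x), retain_y)
    # 原始步:最小化拉格朗日函数 forget_loss + lam * (retain_loss - epsilon)
    opt.zero_grad(); (forget_loss + lam * (retain_loss - epsilon)).backward(); opt.step()
    # 对偶步:按保留约束的违背程度上调lam,并投影回非负
    with torch.no_grad():
        lam = torch.clamp(lam + lr_dual * (retain_loss - epsilon), min=0.0)

print(f"forget={forget_loss.item():.3f} retain={retain_loss.item():.3f} lambda={lam.item():.3f}")
```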


Sample Complexity and Representation Ability of Test-time Scaling Paradigms
Authors:Baihe Huang, Shanda Li, Tianhao Wu, Yiming Yang, Ameet Talwalkar, Kannan Ramchandran, Michael I. Jordan, Jiantao Jiao
Test-time scaling paradigms have significantly advanced the capabilities of large language models (LLMs) on complex tasks. Despite their empirical success, theoretical understanding of the sample efficiency of various test-time strategies – such as self-consistency, best-of-$n$, and self-correction – remains limited. In this work, we first establish a separation result between two repeated sampling strategies: self-consistency requires $\Theta(1/\Delta^2)$ samples to produce the correct answer, while best-of-$n$ only needs $\Theta(1/\Delta)$, where $\Delta < 1$ denotes the probability gap between the correct and second most likely answers. Next, we present an expressiveness result for the self-correction approach with verifier feedback: it enables Transformers to simulate online learning over a pool of experts at test time. Therefore, a single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query, extending the representation theory of Transformers from single-task to multi-task settings. Finally, we empirically validate our theoretical results, demonstrating the practical effectiveness of self-correction methods.
测试时缩放范式已显着提升大型语言模型(LLM)在复杂任务上的能力。尽管它们在经验上取得了成功,但关于各种测试时间策略(例如自洽性、best-of-$n$以及自我校正)的样本效率的理论理解仍然有限。在这项工作中,我们首先确定了两种重复采样策略之间的分离结果:自洽性需要$\Theta(1/\Delta^2)$样本才能得出正确答案,而best-of-$n$仅需要$\Theta(1/\Delta)$样本,其中$\Delta < 1$表示正确和第二大可能的答案之间的概率差距。接下来,我们为带有验证器反馈的自我校正方法提供了表现力结果:它使Transformer能够在测试时模拟专家池中的在线学习。因此,单个Transformer架构可以在不了解与用户查询相关的特定任务的情况下解决多个任务,从而将Transformer的表示理论从单任务扩展到多任务设置。最后,我们通过实验验证了我们的理论结果,证明了自我校正方法的实用性。
论文及项目相关链接
Summary
测试时间缩放范式在复杂任务中显著提升了大型语言模型的能力。虽然它们在经验上取得了成功,但对各种测试时策略(如自洽性、best-of-$n$ 和自我修正)样本效率的理论理解仍然有限。本文中,我们建立了两种重复采样策略之间的分离结果:自洽性需要 $\Theta(1/\Delta^2)$ 个样本才能得到正确答案,而 best-of-$n$ 仅需 $\Theta(1/\Delta)$ 个,其中 $\Delta$ 表示正确答案与次优答案之间的概率差距。此外,我们还给出了带验证器反馈的自我修正方法的表达能力结果:它使Transformer能够在测试时模拟对一组专家的在线学习。因此,单一Transformer架构可证明能够在不事先了解用户查询所对应的具体任务的情况下解决多任务问题,从而将Transformer的表示理论从单任务扩展至多任务场景。最后,我们通过实证验证了我们的理论结果,证明了自我修正方法的有效性。
Key Takeaways
- 测试时间缩放范式增强了大型语言模型在复杂任务上的表现。
- 样本效率对于不同的测试时间策略存在理论差异。
- 自洽性(self-consistency)需要 $\Theta(1/\Delta^2)$ 个样本,而 best-of-$n$ 仅需 $\Theta(1/\Delta)$ 个,两者的样本需求存在量级差异。
- 自我修正方法通过验证器反馈增强了Transformer的多任务学习能力。
- Transformer能够在不预先了解特定任务的情况下解决多任务问题。
- 自我修正方法实现了Transformer从单任务表示理论向多任务表示理论的扩展。
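下面的小型蒙特卡洛模拟(并非论文实验)用来直观感受这一分离结果:当正确答案与次优答案的概率差 $\Delta$ 较小时,多数投票(自洽性)需要的样本远多于 best-of-$n$;这里假设 best-of-$n$ 配有理想验证器,答案分布与采样次数均为示例设定:

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.40, 0.35, 0.25])   # 三个候选答案的概率;正确答案为0,Delta = 0.40 - 0.35 = 0.05

def self_consistency_acc(n, trials=2000):
    # 自洽性(多数投票):抽n个样本,取出现次数最多的答案
    samples = rng.choice(len(p), size=(trials, n), p=p)
    counts = np.apply_along_axis(lambda s: np.bincount(s, minlength=len(p)), 1, samples)
    return float((counts.argmax(axis=1) == 0).mean())

def best_of_n_acc(n, trials=2000):
    # best-of-n:假设理想验证器,只要正确答案被采样到就能选中它
    samples = rng.choice(len(p), size=(trials, n), p=p)
    return float((samples == 0).any(axis=1).mean())

for n in [1, 4, 16, 64, 256, 1024]:
    print(f"n={n:5d}  自洽性={self_consistency_acc(n):.2f}  best-of-n={best_of_n_acc(n):.2f}")
```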




Micro-Act: Mitigate Knowledge Conflict in Question Answering via Actionable Self-Reasoning
Authors:Nan Huo, Jinyang Li, Bowen Qin, Ge Qu, Xiaolong Li, Xiaodong Li, Chenhao Ma, Reynold Cheng
Retrieval-Augmented Generation (RAG) systems commonly suffer from Knowledge Conflicts, where retrieved external knowledge contradicts the inherent, parametric knowledge of large language models (LLMs). It adversely affects performance on downstream tasks such as question answering (QA). Existing approaches often attempt to mitigate conflicts by directly comparing two knowledge sources in a side-by-side manner, but this can overwhelm LLMs with extraneous or lengthy contexts, ultimately hindering their ability to identify and mitigate inconsistencies. To address this issue, we propose Micro-Act a framework with a hierarchical action space that automatically perceives context complexity and adaptively decomposes each knowledge source into a sequence of fine-grained comparisons. These comparisons are represented as actionable steps, enabling reasoning beyond the superficial context. Through extensive experiments on five benchmark datasets, Micro-Act consistently achieves significant increase in QA accuracy over state-of-the-art baselines across all 5 datasets and 3 conflict types, especially in temporal and semantic types where all baselines fail significantly. More importantly, Micro-Act exhibits robust performance on non-conflict questions simultaneously, highlighting its practical value in real-world RAG applications.
检索增强生成(RAG)系统通常面临知识冲突的问题,即检索到的外部知识与大型语言模型(LLM)的内在参数知识相矛盾。这会对问答等下游任务性能产生不利影响。现有方法往往试图通过并排比较两种知识来源来减轻冲突,但这可能会使语言模型面临过多或冗长的上下文,最终阻碍其识别和缓解不一致的能力。为了解决这一问题,我们提出了Micro-Act框架,它具有分层动作空间,可自动感知上下文复杂性,并自适应地将每个知识源分解为一系列精细的比较。这些比较表现为可操作的步骤,能够进行超越表面上下文的推理。在五组基准数据集上进行的大量实验表明,Micro-Act在所有五个数据集和三种冲突类型上,均较最先进的基线模型在问答准确性方面有显著提高,特别是在时间和语义类型方面,所有基线模型均表现不佳。更重要的是,Micro-Act在非冲突问题上也表现出稳健的性能,这凸显了其在现实世界RAG应用中的实用价值。
论文及项目相关链接
PDF Accepted by ACL 2025 Main
Summary
本文探讨了Retrieval-Augmented Generation(RAG)系统中常见的知识冲突问题,即检索的外部知识与大型语言模型(LLM)的内在参数知识相矛盾。这会对问答等下游任务性能产生负面影响。现有方法常常通过并排比较两种知识源来减轻冲突,但这可能使LLM面临繁琐或冗长的上下文,最终阻碍其识别和缓解不一致的能力。为解决这一问题,本文提出了Micro-Act框架,它具有分层动作空间,可自动感知上下文复杂性,并自适应地将每个知识源分解成一系列精细的比较。这些比较表现为可操作的步骤,使推理超越了表面上下文。在五个基准数据集上的广泛实验表明,Micro-Act在所有数据集和三种冲突类型上,尤其是在时间和语义类型上,相较于最新基线技术,问答准确性有了显著的提升。更重要的是,Micro-Act在非冲突问题上也表现出稳健的性能,凸显其在现实世界的RAG应用中的实用价值。
Key Takeaways
- RAG系统面临知识冲突问题,即外部检索知识与LLM的内在知识相矛盾。
- 现有方法通过并排比较知识源来减轻冲突,但可能使LLM面临繁琐的上下文。
- Micro-Act框架具有分层动作空间,可自动感知上下文复杂性并分解知识源。
- Micro-Act通过精细的比较和可操作的步骤进行推理,超越表面上下文。
- Micro-Act在多个数据集和三种冲突类型上显著提高了问答准确性。
- Micro-Act在非冲突问题上也表现出稳健的性能。





LeanPO: Lean Preference Optimization for Likelihood Alignment in Video-LLMs
Authors:Xiaodong Wang, Jinfa Huang, Li Yuan, Peixi Peng
Most Video Large Language Models (Video-LLMs) adopt preference alignment techniques, e.g., DPO~\citep{rafailov2024dpo}, to optimize the reward margin between a winning response ($y_w$) and a losing response ($y_l$). However, the likelihood displacement observed in DPO indicates that both $\log \pi_\theta (y_w\mid x)$ and $\log \pi_\theta (y_l\mid x) $ often decrease during training, inadvertently boosting the probabilities of non-target responses. In this paper, we systematically revisit this phenomenon from LLMs to Video-LLMs, showing that it intensifies when dealing with the redundant complexity of video content. To alleviate the impact of this phenomenon, we propose \emph{Lean Preference Optimization} (LeanPO), a reference-free approach that reformulates the implicit reward as the average likelihood of the response with respect to the policy model. A key component of LeanPO is the reward-trustworthiness correlated self-generated preference data pipeline, which carefully infuses relevant prior knowledge into the model while continuously refining the preference data via self-reflection. This allows the policy model to obtain high-quality paired data and accurately estimate the newly defined reward, thus mitigating the unintended drop. In addition, we introduce a dynamic label smoothing strategy that mitigates the impact of noise in responses from diverse video content, preventing the model from overfitting to spurious details. Extensive experiments demonstrate that LeanPO significantly enhances the performance of state-of-the-art Video-LLMs, consistently boosting baselines of varying capacities with minimal additional training overhead. Moreover, LeanPO offers a simple yet effective solution for aligning Video-LLM preferences with human trustworthiness, paving the way toward the reliable and efficient Video-LLMs.
大多数视频大语言模型(Video-LLMs)采用偏好对齐技术(例如DPO)来优化获胜响应($y_w$)和失败响应($y_l$)之间的奖励差距。然而,DPO中观察到的似然位移(likelihood displacement)表明,$\log \pi_\theta(y_w\mid x)$ 和 $\log \pi_\theta(y_l\mid x)$ 在训练过程中往往同时下降,这无意中提升了非目标响应的概率。本文系统地将这一现象从大语言模型延伸到视频大语言模型进行审视,表明在处理视频内容的冗余复杂性时,这种现象会加剧。为了缓解这种现象的影响,我们提出了精简偏好优化(Lean Preference Optimization,LeanPO),这是一种无需参考模型的方法,它将隐式奖励重新定义为响应在策略模型下的平均似然。LeanPO的关键组件是与奖励可信度相关的自我生成偏好数据管道,它谨慎地将相关先验知识融入模型中,同时通过自我反思不断精炼偏好数据。这使得策略模型能够获得高质量配对数据并准确估计新定义的奖励,从而缓解上述非预期的下降。此外,我们引入了一种动态标签平滑策略,减轻来自不同视频内容的响应噪声的影响,防止模型过度拟合于虚假细节。大量实验表明,LeanPO显著提高了最先进的Video-LLMs的性能,在不同容量的基线上均有稳定提升,且额外的训练开销极小。此外,LeanPO提供了一个简单而有效的解决方案,使Video-LLM的偏好与人类可信度保持一致,为可靠高效的Video-LLMs铺平了道路。
论文及项目相关链接
PDF Code: https://github.com/Wang-Xiaodong1899/LeanPO
摘要
本文主要探讨了视频大语言模型(Video-LLMs)在优化奖励边际时面临的问题。现有方法如DPO在训练过程中往往导致胜方回应(yw)和败方回应(yl)的概率对数下降,从而可能提升非目标回应的概率。本文系统性地从LLMs重新审视这一现象,并指出在处理视频内容的冗余复杂性时,该问题更加严重。为缓解这一现象,本文提出了无参考的LeanPO方法,该方法将隐式奖励重新定义为响应相对于策略模型的平均概率。LeanPO的关键组件是奖励信任度相关的自我生成偏好数据管道,它通过不断自我反思来精细地注入相关先验知识并持续精炼偏好数据。此外,本文还引入了动态标签平滑策略,以缓解来自各种视频内容中响应的噪声影响,防止模型过度拟合细节。实验表明,LeanPO显著提高了先进Video-LLM的性能,并且与不同容量的基线相比具有最小的额外训练开销。此外,LeanPO为可靠高效的Video-LLM提供了一种简单有效的偏好对齐方法。
关键见解
- Video-LLMs在优化奖励边际时采用偏好对齐技术,如DPO,但存在概率下降的问题。
- 在处理视频内容的冗余复杂性时,该问题更为严重。
- 提出LeanPO方法,通过平均概率重新定义了隐式奖励的概念。
- LeanPO利用奖励信任度相关的自我生成偏好数据管道来精细地注入相关先验知识并精炼偏好数据。
- 动态标签平滑策略用于缓解来自各种视频内容响应的噪声影响。
- LeanPO显著提高Video-LLM性能并有效对齐人类信任度。
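下面的示意代码(并非官方实现)演示"以响应的平均token对数似然作为隐式奖励"的计算,以及在偏好损失上做标签平滑的方式;损失形式、beta 与 smooth 的取值均为示例假设,论文中的标签平滑为依据噪声动态调整的策略:

```python
import torch
import torch.nn.functional as F

def avg_loglik(logits, labels, mask):
    """响应在策略模型下的平均token对数似然(逐token log-prob 的掩码平均)。"""
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logp * mask).sum(-1) / mask.sum(-1)

def leanpo_style_loss(r_w, r_l, beta=1.0, smooth=0.1):
    """无参考模型的偏好损失示意:-log sigmoid(beta*(r_w - r_l)) 加标签平滑;
    这里的 smooth 用固定值近似论文中的动态标签平滑。"""
    margin = beta * (r_w - r_l)
    return (1 - smooth) * F.softplus(-margin) + smooth * F.softplus(margin)

# 玩具示例:随机logits,仅演示计算流程
torch.manual_seed(0)
B, T, V = 2, 6, 50
logits_w, logits_l = torch.randn(B, T, V), torch.randn(B, T, V)
labels, mask = torch.randint(0, V, (B, T)), torch.ones(B, T)
r_w, r_l = avg_loglik(logits_w, labels, mask), avg_loglik(logits_l, labels, mask)
print("loss:", leanpo_style_loss(r_w, r_l).mean().item())
```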



LLM-First Search: Self-Guided Exploration of the Solution Space
Authors:Nathan Herr, Tim Rocktäschel, Roberta Raileanu
Large Language Models (LLMs) have demonstrated remarkable improvements in reasoning and planning through increased test-time compute, often by framing problem-solving as a search process. While methods like Monte Carlo Tree Search (MCTS) have proven effective in some domains, their reliance on fixed exploration hyperparameters limits their adaptability across tasks of varying difficulty, rendering them impractical or expensive in certain settings. In this paper, we propose \textbf{LLM-First Search (LFS)}, a novel \textit{LLM Self-Guided Search} method that removes the need for pre-defined search strategies by empowering the LLM to autonomously control the search process via self-guided exploration. Rather than relying on external heuristics or hardcoded policies, the LLM evaluates whether to pursue the current search path or explore alternative branches based on its internal scoring mechanisms. This enables more flexible and context-sensitive reasoning without requiring manual tuning or task-specific adaptation. We evaluate LFS on Countdown and Sudoku against three classic widely-used search algorithms, Tree-of-Thoughts’ Breadth First Search (ToT-BFS), Best First Search (BestFS), and MCTS, each of which have been used to achieve SotA results on a range of challenging reasoning tasks. We found that LFS (1) performs better on more challenging tasks without additional tuning, (2) is more computationally efficient compared to the other methods, especially when powered by a stronger model, (3) scales better with stronger models, due to its LLM-First design, and (4) scales better with increased compute budget. Our code is publicly available at \href{https://github.com/NathanHerr/LLM-First-Search}{LLM-First-Search}.
大型语言模型(LLM)通过增加测试时的计算量,在推理和规划方面取得了显著的改进,通常通过将问题解决框架设定为搜索过程。虽然蒙特卡洛树搜索(MCTS)等方法在某些领域已经证明是有效的,但它们对固定探索超参数的依赖限制了它们在不同难度任务中的适应性,因此在某些环境中它们是不切实际的或昂贵的。在本文中,我们提出了\textbf{LLM-First Search(LFS)},这是一种新型的\textit{LLM自我引导搜索}方法,它通过赋予LLM自主控制搜索过程的能力,从而消除了对预先定义的搜索策略的需求,实现自我引导的探索。LLM不需要依赖外部启发式或硬编码的策略,而是根据其内部评分机制来判断是继续当前搜索路径还是探索替代分支。这实现了更灵活、更依赖于上下文的推理,而无需手动调整或针对特定任务的适应。我们在倒计时和数独上评估了LFS,并与三种经典的广泛使用的搜索算法进行了比较:思维树广度优先搜索(ToT-BFS)、最佳优先搜索(BestFS)和MCTS,这些算法已在各种具有挑战性的推理任务上取得了最新成果。我们发现LFS(1)在更具挑战性的任务上表现更好,无需额外调整,(2)与其他方法相比具有更高的计算效率,尤其是当使用更强大的模型时,(3)由于其LLM-First设计,随着更强模型的增强而更好地扩展,(4)随着计算预算的增加而更好地扩展。我们的代码可在\href{https://github.com/NathanHerr/LLM-First-Search}{LLM-First-Search}上公开访问。
论文及项目相关链接
PDF 9 main pages, 2 figures, 2 tables, 36 appendix pages
Summary
大型语言模型(LLM)在测试时间的计算增加中展现出令人印象深刻的推理和规划能力,通常通过将问题解决视为搜索过程来实现。本文提出了一种新型的LLM自引导搜索方法——LLM-First Search(LFS),它消除了对预定义搜索策略的需求,通过赋予LLM自主控制搜索过程的能力,实现自我引导的探索。我们在倒计时和数独任务上评估了LFS,并与三种经典的搜索算法进行了比较:树状思维广度优先搜索(ToT-BFS)、最佳优先搜索(BestFS)和蒙特卡洛树搜索(MCTS)。我们发现LFS在更具挑战性的任务上表现更好,计算效率更高,尤其是在使用更强大的模型时,并且由于其LLM优先设计而具有更好的模型扩展性。
Key Takeaways
- LLM通过增加测试时间的计算,提高了推理和规划能力。
- LLM-First Search (LFS)是一种新型的LLM自引导搜索方法,无需预定义的搜索策略。
- LFS通过LLM的自我引导探索,增强了搜索的灵活性和上下文敏感性。
- LFS在倒计时和数独任务上的表现优于其他三种经典的搜索算法。
- LFS在更具挑战性的任务上表现更好,并且计算效率更高。
- LFS在使用更强大的模型时具有更好的扩展性。
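下面用占位函数勾勒LFS式自引导搜索的控制流程(并非官方实现):llm_propose / llm_evaluate 代表对LLM的调用,这里用简单函数代替;continue_threshold 与终止条件均为示例假设。核心是由模型自己的评分决定是继续当前路径,还是换到候选池中的其它分支:

```python
def llm_propose(state):
    """占位实现:真实系统中应由LLM在当前部分解上提出候选的下一步。"""
    return [state + [a] for a in ("a", "b")]

def llm_evaluate(state):
    """占位实现:真实系统中应由LLM依据内部评分判断该路径是否值得继续(0~1)。"""
    return 1.0 / (1 + len(state))

def is_solution(state):
    return len(state) >= 3                              # 占位的终止条件

def llm_first_search(root, max_expansions=50, continue_threshold=0.3):
    frontier, path = [], root                           # 候选分支池与当前路径
    for _ in range(max_expansions):
        if is_solution(path):
            return path
        scored = sorted(((llm_evaluate(c), c) for c in llm_propose(path)), reverse=True)
        best_score, best_child = scored[0]
        if best_score >= continue_threshold:
            frontier.extend(c for _, c in scored[1:])   # 其余分支留作备选
            path = best_child                           # 模型判断值得继续:沿当前路径深入
        else:
            frontier.extend(c for _, c in scored)       # 全部放回候选池
            if not frontier:
                return None
            frontier.sort(key=llm_evaluate)
            path = frontier.pop()                       # 切换到池中评分最高的分支
    return None

print(llm_first_search([]))
```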




FPTQuant: Function-Preserving Transforms for LLM Quantization
Authors:Boris van Breugel, Yelysei Bondarenko, Paul Whatmough, Markus Nagel
Large language models (LLMs) require substantial compute, and thus energy, at inference time. While quantizing weights and activations is effective at improving efficiency, naive quantization of LLMs can significantly degrade performance due to large magnitude outliers. This paper describes FPTQuant, which introduces four novel, lightweight, and expressive function-preserving transforms (FPTs) to facilitate quantization of transformers: (1) a mergeable pre-RoPE transform for queries and keys, (2) a mergeable transform for values, (3) a mergeable scaling transform within the MLP block, and (4) a cheap, dynamic scaling transform. By leveraging the equivariances and independencies inherent to canonical transformer operation, we designed these FPTs to maintain the model’s function while shaping the intermediate activation distributions to be more quantization friendly. FPTQuant requires no custom kernels and adds virtually no overhead during inference. The FPTs are trained both locally to reduce outliers, and end-to-end such that the outputs of the quantized and full-precision models match. FPTQuant enables static INT4 quantization with minimal overhead and shows SOTA speed-up of up to 3.9 times over FP. Empirically, FPTQuant has an excellent accuracy-speed trade-off – it is performing on par or exceeding most prior work and only shows slightly lower accuracy compared to a method that is up to 29% slower.
大型语言模型(LLM)在推理时需要大量的计算和能源。虽然量化权重和激活值能有效提高效率,但对LLM进行简单的量化会由于较大的幅度异常值而导致性能显著下降。本文介绍了FPTQuant,它引入了四种新颖、轻便、表达性强的保功能变换(FPTs),以促进变压器的量化:(1)用于查询和键的可合并预RoPE变换,(2)用于值的可合并变换,(3)MLP块内的可合并缩放变换,以及(4)一种廉价、动态缩放变换。我们利用规范变换器操作所固有的等变性和独立性来设计这些FPTs,以维持模型的性能,同时使中间激活值分布更利于量化。FPTQuant不需要自定义内核,在推理过程中几乎不会增加开销。FPTs既可以在本地进行训练以减少异常值,也可以端到端进行训练,以使量化模型和全精度模型的输出相匹配。FPTQuant实现了静态INT4量化,具有最小的开销,并显示出相对于浮点数的最高达3.9倍的速度提升。经验上,FPTQuant具有出色的精度-速度权衡——它的性能与大多数先前的工作相当或超过,并且与一种慢至多达29%的方法相比,其精度仅略有下降。
论文及项目相关链接
摘要
大型语言模型(LLM)在推理时需要大量的计算和能源。虽然量化权重和激活可以提高效率,但直接对LLM进行量化可能导致性能显著下降,特别是在存在大幅度异常值时。本文介绍的FPTQuant方法引入了四种新颖、轻便且表达性强的函数保持变换(FPT),旨在解决transformer的量化问题:(1)可合并的预RoPE变换用于查询和键;(2)可合并的变换用于值;(3)MLP块内的可合并缩放变换;(4)廉价的动态缩放变换。这些FPT的设计旨在保持模型的功能,同时使中间激活分布更适合量化。FPTQuant无需自定义内核,在推理过程中几乎不会增加开销。FPT既可进行局部训练以减少异常值,也可进行端到端训练,使量化模型与全精度模型的输出相匹配。FPTQuant可实现静态INT4量化,额外开销极小,并显示出最高达3.9倍的速度提升。经验表明,FPTQuant在准确性-速度权衡上表现出色:性能与大多数先前工作相当或更好,仅相较于一种速度慢至多29%的方法时精度略低。总体而言,FPTQuant为LLM的量化提供了一种高效且准确的方法。
关键见解
- 大型语言模型推理需要巨大的计算和能源资源。
- 量化LLM可以提高效率,但可能导致性能下降,特别是面临大型幅度异常值问题。
- FPTQuant引入四种函数保持变换(FPT)来改进LLM的量化。这些FPT针对transformer的结构而设计,用于重塑查询、键、值以及MLP块的中间激活分布,使其更易于量化。
- FPTQuant方法无需自定义内核且几乎不增加推理过程中的开销。它通过利用模型的内在等价性和独立性来实现高效量化。
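下面用NumPy演示"保功能变换"的核心思想(并非论文实现):在相邻两层之间插入可离线合并的缩放变换 S 与其逆,模型的输出函数保持不变,但中间激活的逐通道幅值被拉平,更便于低比特量化;离群通道的构造方式为示例假设:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 8))
W1[:, 3] *= 40.0                       # 人为制造一个幅值离群的中间通道

act = x @ W1
s = np.abs(act).max(axis=0)            # 每个中间通道的激活幅值
W1_t = W1 @ np.diag(1.0 / s)           # 缩放变换 S = diag(1/s),可离线合并进上一层权重
W2_t = np.diag(s) @ W2                 # 逆变换 S^{-1} = diag(s),合并进下一层权重

y_ref = (x @ W1) @ W2
y_fpt = (x @ W1_t) @ W2_t
print("函数保持(最大误差):", np.abs(y_ref - y_fpt).max())
print("变换前各通道幅值(前6个):", np.abs(x @ W1).max(axis=0).round(1)[:6])
print("变换后各通道幅值(前6个):", np.abs(x @ W1_t).max(axis=0).round(1)[:6])
```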


Fine-Tuning Video Transformers for Word-Level Bangla Sign Language: A Comparative Analysis for Classification Tasks
Authors:Jubayer Ahmed Bhuiyan Shawon, Hasan Mahmud, Kamrul Hasan
Sign Language Recognition (SLR) involves the automatic identification and classification of sign gestures from images or video, converting them into text or speech to improve accessibility for the hearing-impaired community. In Bangladesh, Bangla Sign Language (BdSL) serves as the primary mode of communication for many individuals with hearing impairments. This study fine-tunes state-of-the-art video transformer architectures – VideoMAE, ViViT, and TimeSformer – on BdSLW60 (arXiv:2402.08635), a small-scale BdSL dataset with 60 frequent signs. We standardized the videos to 30 FPS, resulting in 9,307 user trial clips. To evaluate scalability and robustness, the models were also fine-tuned on BdSLW401 (arXiv:2503.02360), a large-scale dataset with 401 sign classes. Additionally, we benchmark performance against public datasets, including LSA64 and WLASL. Data augmentation techniques such as random cropping, horizontal flipping, and short-side scaling were applied to improve model robustness. To ensure balanced evaluation across folds during model selection, we employed 10-fold stratified cross-validation on the training set, while signer-independent evaluation was carried out using held-out test data from unseen users U4 and U8. Results show that video transformer models significantly outperform traditional machine learning and deep learning approaches. Performance is influenced by factors such as dataset size, video quality, frame distribution, frame rate, and model architecture. Among the models, the VideoMAE variant (MCG-NJU/videomae-base-finetuned-kinetics) achieved the highest accuracies of 95.5% on the frame rate corrected BdSLW60 dataset and 81.04% on the front-facing signs of BdSLW401 – demonstrating strong potential for scalable and accurate BdSL recognition.
手势语言识别(SLR)涉及从图像或视频中自动识别和分类手势,将它们转换为文本或语音,以提高听力障碍者的可访问性。在孟加拉国,孟加拉手语(BdSL)是许多听力障碍者主要的交流方式。本研究对最前沿的视频转换器架构——VideoMAE、ViViT和TimeSformer进行了微调,应用于BdSLW60(arXiv:2402.08635)数据集,这是一个包含60种常见手势的小规模BdSL数据集。我们将视频标准化至30FPS,生成了9307个用户试验片段。为了评估模型的扩展性和稳健性,我们还在大规模的BdSLW401数据集(arXiv:2503.02360)上对模型进行了微调,该数据集包含401个手势类别。此外,我们还与LSA64和WLASL等公共数据集进行了基准测试。为了提高模型的稳健性,我们应用了数据增强技术,如随机裁剪、水平翻转和短边缩放。在模型选择过程中,为了确保跨折叠的评估平衡,我们对训练集采用了10折分层交叉验证,而对来自未见用户的U4和U8的保留测试数据进行了签名人独立评估。结果表明,视频转换器模型显著优于传统的机器学习和深度学习方法。性能受到数据集大小、视频质量、帧分布、帧率和模型架构等因素的影响。在帧速率校正的BdSLW60数据集上,VideoMAE变体(MCG-NJU/videomae-base-finetuned-kinetics)取得了最高准确率,达到95.5%,在面向正面的BdSLW401数据集上的准确率为81.04%,显示出强大的可扩展和准确识别BdSL的潜力。
论文及项目相关链接
PDF 16 pages, 8 figures, 6 tables
Summary
本文介绍了基于视频图像的手语识别技术(SLR)在孟加拉国的应用。研究者针对孟加拉手语(BdSL)的数据集进行了先进的视频模型架构的调整和优化。模型性能通过在不同的数据集上进行评估来体现,结果表明视频模型明显优于传统的机器学习和深度学习模型。其中,VideoMAE模型在修正帧率的BdSLW60数据集上准确率高达95.5%,在面向前方的BdSLW401数据集上的准确率为81.04%,展现出在可伸缩性和准确性方面的强大潜力。
Key Takeaways
- 手语识别(SLR)是将图像或视频中的手势自动识别和分类,并转换为文字或语音,以提高听障人士的访问性。
- 在孟加拉国,孟加拉手语(BdSL)是听障人士的主要沟通方式。
- 研究者使用先进的视频模型架构(如VideoMAE, ViViT和TimeSformer)对BdSL数据集进行了微调。
- 数据增强技术如随机裁剪、水平翻转和短边缩放被用来提高模型的稳健性。
- 视频模型在性能上显著优于传统的机器学习和深度学习模型。
- VideoMAE模型在修正帧率的BdSLW60数据集上的准确率最高,达到95.5%。
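下面是一个微调流程的最小示意,假设已安装 transformers 与 torch 且可联网下载预训练权重;批次数据为随机张量,仅用于演示前向与反向传播,真实训练应使用标准化到30FPS的视频剪辑,并配合文中的数据增强与分层交叉验证:

```python
import torch
from transformers import VideoMAEForVideoClassification

model = VideoMAEForVideoClassification.from_pretrained(
    "MCG-NJU/videomae-base-finetuned-kinetics",   # 文中表现最好的VideoMAE变体
    num_labels=60,                                # BdSLW60 共60个手语类别
    ignore_mismatched_sizes=True,                 # 替换分类头
)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# 伪造一个批次:2段视频,每段16帧,3通道,224x224
pixel_values = torch.randn(2, 16, 3, 224, 224)
labels = torch.randint(0, 60, (2,))

model.train()
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()
optimizer.step()
print("loss:", outputs.loss.item())
```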






Model Internal Sleuthing: Finding Lexical Identity and Inflectional Morphology in Modern Language Models
Authors:Michael Li, Nishant Subramani
Large transformer-based language models dominate modern NLP, yet our understanding of how they encode linguistic information is rooted in studies of early models like BERT and GPT-2. To better understand today’s language models, we investigate how both classical architectures (BERT, DeBERTa, GPT-2)and contemporary large language models (Pythia, OLMo-2, Gemma-2, Qwen2.5, Llama-3.1) represent lexical identity and inflectional morphology. We train linear and nonlinear classifiers on layer-wise activations to predict word lemmas and inflectional features. We discover that models concentrate lexical information linearly in early layers and increasingly nonlinearly in later layers, while keeping inflectional information uniformly accessible and linearly separable throughout the layers. Further analysis reveals that these models encode inflectional morphology through generalizable abstractions, but rely predominantly on memorization to encode lexical identity. Remarkably, these patterns emerge across all 16 models we test, despite differences in architecture, size, and training regime (including pretrained and instruction-tuned variants). This consistency suggests that, despite substantial advances in LLM technologies, transformer models organize linguistic information in similar ways, indicating that these properties could be fundamental for next token prediction and are learned early during pretraining. Our code is available at https://github.com/ml5885/model_internal_sleuthing
现代自然语言处理领域主要由大型基于Transformer的语言模型主导,然而我们对它们如何编码语言信息的理解仍基于早期模型,如BERT和GPT-2。为了更好地理解当今的语言模型,我们研究了经典架构(BERT、DeBERTa、GPT-2)和当代大型语言模型(Pythia、OLMo-2、Gemma-2、Qwen2.5、Llama-3.1)如何表示词汇身份和屈折形态。我们对分层激活训练了线性和非线性分类器,以预测词素和屈折特征。我们发现,模型在早期层次中以线性方式集中词汇信息,在后期层次中则越来越以非线性方式集中信息,同时保持屈折信息在整个层次中均匀可访问且可线性分离。进一步的分析表明,这些模型通过可推广的抽象来编码屈折形态,但主要依赖于记忆来编码词汇身份。值得注意的是,在我们测试的16个模型中,尽管它们在架构、规模和训练方案(包括预训练和指令调整变体)上存在差异,但这些模式仍然出现。这种一致性表明,尽管大型语言模型技术取得了重大进展,但Transformer模型以相似的方式组织语言信息,这表明这些属性对于下一个令牌预测至关重要,并且在预训练早期就已经学习。我们的代码可在https://github.com/ml5885/model_internal_sleuthing上找到。
论文及项目相关链接
Summary
近期研究表明,无论模型架构、规模或训练机制如何,大型基于transformer的语言模型都以相似的方式组织语言信息:词汇信息在早期层以线性方式集中,在后期层则越来越依赖非线性表示;屈折形态信息则在各层中保持均匀可访问且线性可分。这些模型通过可概括的抽象来编码屈折形态,但主要依赖记忆来编码词汇身份。这些发现揭示了现代NLP模型处理语言信息的核心机制。
Key Takeaways:
- 大型基于transformer的语言模型在编码语言信息时表现出一致性,无论模型架构、大小或训练机制如何。
- 这些模型在早期层级集中处理词汇信息,并在后续层级中涉及更复杂的非线性信息处理。
- 屈折形态信息在各层中保持均匀可访问且线性可分,并通过可概括的抽象进行编码。
- 模型主要依赖记忆来编码词汇身份。
- 训练分类器可以在不同层级预测单词词元和形态变化特征。
- 模型组织和处理语言信息的方式可能与下一个单词的预测有关。
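下面给出逐层线性探针的最小示意(并非论文的完整流程):取GPT-2各层激活,训练逻辑回归来预测名词单复数这一屈折特征;词表为玩具规模且训练与评估使用同一批词,真实研究应使用大规模词表、留出集以及线性/非线性多种探针:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import LogisticRegression

words = ["cat", "cats", "dog", "dogs", "car", "cars", "book", "books",
         "tree", "trees", "house", "houses", "bird", "birds", "idea", "ideas"]
labels = [0, 1] * (len(words) // 2)              # 0=单数, 1=复数

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True).eval()

feats_per_layer = None
with torch.no_grad():
    for w in words:
        hs = model(**tok(w, return_tensors="pt")).hidden_states   # (层数+1)个 [1, T, d]
        vecs = [h[0, -1].numpy() for h in hs]                     # 取最后一个token的表示
        if feats_per_layer is None:
            feats_per_layer = [[] for _ in vecs]
        for layer, v in enumerate(vecs):
            feats_per_layer[layer].append(v)

for layer, feats in enumerate(feats_per_layer):
    clf = LogisticRegression(max_iter=1000).fit(feats, labels)
    print(f"layer {layer:2d}  训练集准确率 = {clf.score(feats, labels):.2f}")
```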






Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem
Authors:Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, Jiachi Chen
The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources. Within a short period, thousands of MCP services have already been developed and deployed. However, the client-server integration architecture inherent in MCP may expand the attack surface against LLM Agent systems, introducing new vulnerabilities that allow attackers to exploit by designing malicious MCP servers. In this paper, we present the first systematic study of attack vectors targeting the MCP ecosystem. Our analysis identifies four categories of attacks, i.e., Tool Poisoning Attacks, Puppet Attacks, Rug Pull Attacks, and Exploitation via Malicious External Resources. To evaluate the feasibility of these attacks, we conduct experiments following the typical steps of launching an attack through malicious MCP servers: upload-download-attack. Specifically, we first construct malicious MCP servers and successfully upload them to three widely used MCP aggregation platforms. The results indicate that current audit mechanisms are insufficient to identify and prevent the proposed attack methods. Next, through a user study and interview with 20 participants, we demonstrate that users struggle to identify malicious MCP servers and often unknowingly install them from aggregator platforms. Finally, we demonstrate that these attacks can trigger harmful behaviors within the user’s local environment-such as accessing private files or controlling devices to transfer digital assets-by deploying a proof-of-concept (PoC) framework against five leading LLMs. Additionally, based on interview results, we discuss four key challenges faced by the current security ecosystem surrounding MCP servers. These findings underscore the urgent need for robust security mechanisms to defend against malicious MCP servers.
模型上下文协议(MCP)是一种新兴标准,旨在实现大型语言模型(LLM)应用程序与外部工具或资源之间的无缝交互。在很短的时间内,已经开发并部署了成千上万个MCP服务。然而,MCP所固有的客户端-服务器集成架构可能会扩大针对LLM代理系统的攻击面,引入新的漏洞,攻击者可利用这些漏洞通过设计恶意的MCP服务器进行攻击。在本文中,我们对针对MCP生态系统的攻击向量进行了首次系统研究。我们的分析确定了四类攻击,即工具中毒攻击、傀儡攻击、地毯式攻击和通过恶意外部资源的利用攻击。为了评估这些攻击的可行性,我们通过恶意MCP服务器发动攻击的典型步骤进行了实验:上传-下载-攻击。具体来说,我们首先构建了恶意MCP服务器,并成功将其上传到三个广泛使用的MCP聚合平台。结果表明,当前的审计机制不足以识别和预防所提出的攻击方法。接下来,通过对20名参与者的用户研究和采访,我们证明了用户很难识别恶意MCP服务器,并且经常无意中从聚合平台安装它们。最后,我们通过针对五款领先的大型语言模型部署概念验证(PoC)框架,证明这些攻击可以触发用户本地环境中的有害行为,例如访问私人文件或控制设备转移数字资产。此外,基于访谈结果,我们讨论了当前MCP服务器安全生态系统所面临的四大挑战。这些发现强调了亟需构建稳健的安全机制来防范恶意MCP服务器。
论文及项目相关链接
摘要
模型上下文协议(MCP)是一种新兴标准,旨在实现大型语言模型(LLM)应用程序与外部工具或资源之间的无缝交互。然而,MCP所固有的客户端-服务器集成架构可能会扩大针对LLM代理系统的攻击面,引入允许攻击者通过设计恶意MCP服务器进行利用的新漏洞。本文对针对MCP生态系统的攻击向量进行了系统的研究分析,识别出四种攻击类别。为了评估这些攻击的可行性,我们通过恶意MCP服务器发动攻击的典型步骤进行实验:上传-下载-攻击。我们发现当前审计机制不足以识别并预防所提出的攻击方法。此外,通过对20名参与者的用户研究和访谈,我们证明了用户难以识别恶意MCP服务器,并经常无意中从聚合平台安装它们。最后,通过一个针对五大领先LLM的概念验证框架,我们证明了这些攻击可以触发用户本地环境中的有害行为,如访问私人文件或控制设备转移数字资产。
关键见解
- 模型上下文协议(MCP)是一种旨在促进LLM与外部工具无缝交互的新兴标准,但存在安全漏洞。
- 识别了四种针对MCP的攻击类别:工具中毒攻击(Tool Poisoning)、傀儡攻击(Puppet)、Rug Pull攻击和通过恶意外部资源进行的利用。
- 通过实验证明,当前审计机制不足以识别和预防提出的攻击方法。
- 用户难以识别恶意MCP服务器,经常从聚合平台无意中安装它们。
- 攻击可以触发用户本地环境中的有害行为,如访问私人文件或控制设备转移数字资产。
- 目前MCP服务器的安全生态系统面临四大挑战。
- 急需建立稳健的安全机制来防范恶意MCP服务器。





Analysis of LLM Bias (Chinese Propaganda & Anti-US Sentiment) in DeepSeek-R1 vs. ChatGPT o3-mini-high
Authors:PeiHsuan Huang, ZihWei Lin, Simon Imbot, WenCheng Fu, Ethan Tu
Large language models (LLMs) increasingly shape public understanding and civic decisions, yet their ideological neutrality is a growing concern. While existing research has explored various forms of LLM bias, a direct, cross-lingual comparison of models with differing geopolitical alignments-specifically a PRC-system model versus a non-PRC counterpart-has been lacking. This study addresses this gap by systematically evaluating DeepSeek-R1 (PRC-aligned) against ChatGPT o3-mini-high (non-PRC) for Chinese-state propaganda and anti-U.S. sentiment. We developed a novel corpus of 1,200 de-contextualized, reasoning-oriented questions derived from Chinese-language news, presented in Simplified Chinese, Traditional Chinese, and English. Answers from both models (7,200 total) were assessed using a hybrid evaluation pipeline combining rubric-guided GPT-4o scoring with human annotation. Our findings reveal significant model-level and language-dependent biases. DeepSeek-R1 consistently exhibited substantially higher proportions of both propaganda and anti-U.S. bias compared to ChatGPT o3-mini-high, which remained largely free of anti-U.S. sentiment and showed lower propaganda levels. For DeepSeek-R1, Simplified Chinese queries elicited the highest bias rates; these diminished in Traditional Chinese and were nearly absent in English. Notably, DeepSeek-R1 occasionally responded in Simplified Chinese to Traditional Chinese queries and amplified existing PRC-aligned terms in its Chinese answers, demonstrating an “invisible loudspeaker” effect. Furthermore, such biases were not confined to overtly political topics but also permeated cultural and lifestyle content, particularly in DeepSeek-R1.
大型语言模型(LLM)越来越影响公众理解和公民决策,但它们的意识形态中立性却令人日益担忧。尽管现有研究已经探索了LLM偏见的各种形式,但针对具有不同地缘政治对齐方式的模型进行直接、跨语言的比较——特别是PRC系统模型与非PRC模型之间的比较——仍然缺乏。本研究通过系统地评估DeepSeek-R1(PRC对齐)与ChatGPT o3-mini-high(非PRC)的中文国家宣传和反美情绪,来弥补这一空白。我们开发了一个由1200个去语境化、以推理为导向的问题组成的新型语料库,这些问题源自中文新闻,并以简体中文、繁体中文和英文三种形式呈现。对两个模型给出的答案(共7200个)进行了评估,采用了一种将基于评分准则(rubric)引导的GPT-4o评分与人工标注相结合的混合评估管道。我们的研究结果揭示了模型级别和依赖于语言的偏见。与ChatGPT o3-mini-high相比,DeepSeek-R1始终表现出更高比例的宣传和反美偏见,而ChatGPT o3-mini-high则基本不存在反美情绪,宣传水平也较低。对于DeepSeek-R1来说,简体中文查询引发的偏见率最高;在繁体中文中这些偏见有所减少,英文中则几乎不存在。值得注意的是,DeepSeek-R1有时会用简体中文回答繁体中文查询,并在其中文答案中放大已有的PRC对齐术语,表现出一种"隐形扬声器"效应。此外,这种偏见不仅限于政治话题,也渗透到文化和生活方式的内容中,特别是在DeepSeek-R1中。
论文及项目相关链接
Summary
大型语言模型(LLM)在塑造公众理解和公民决策方面扮演着越来越重要的角色,但其意识形态中立性引发关注。现有研究已探索了LLM的偏见形式,但缺乏对不同地缘政治立场模型(特别是PRC系统模型与非PRC模型)的直接跨语言比较。本研究通过系统评估DeepSeek-R1(PRC对齐)与ChatGPT o3-mini-high(非PRC)的中文国家宣传与反美情绪,填补这一空白。研究发现模型级别和语言依赖的偏见。DeepSeek-R1较ChatGPT o3-mini-high表现出更高的宣传与反美偏见,后者基本无反美情绪且宣传程度较低。DeepSeek-R1在回答简体中文查询时表现出最高的偏见率,繁体中文查询中偏见率降低,英文回答中偏见几乎不存在。此外,DeepSeek-R1在回答繁体中文查询时会以简体中文回应,并在其中文答案中放大已有PRC立场术语,表现出“隐形扬声器”效应。这种偏见不仅存在于政治话题中,也渗透到文化与生活方式内容中,尤其在DeepSeek-R1中更为明显。
Key Takeaways
- 大型语言模型(LLM)在公众理解和公民决策中扮演重要角色,引发对其意识形态中立性的关注。
- 研究对比了DeepSeek-R1(PRC对齐)与ChatGPT o3-mini-high(非PRC)的语言模型。
- DeepSeek-R1较ChatGPT o3-mini-high存在更高的宣传与反美偏见。
- 模型偏见存在语言依赖性,DeepSeek-R1在简体中文查询中表现出最高偏见率,而在英文中几乎无偏见。
- DeepSeek-R1在回应时存在语言转换现象,并会放大已有PRC立场术语。
- 模型偏见不仅存在于政治话题,也渗透到文化与生活方式内容中。




Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers
Authors:Woomin Song, Sai Muralidhar Jayanthi, Srikanth Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, Sravan Babu Bodapati
As large language models increasingly gain popularity in real-world applications, processing extremely long contexts, often exceeding the model’s pre-trained context limits, has emerged as a critical challenge. While existing approaches to efficient long-context processing show promise, recurrent compression-based methods struggle with information preservation, whereas random access approaches require substantial memory resources. We introduce REFORM, a novel inference framework that efficiently handles long contexts through a two-phase approach. First, it incrementally processes input chunks while maintaining a compressed KV cache, constructs cross-layer context embeddings, and utilizes early exit strategy for improved efficiency. Second, it identifies and gathers essential tokens via similarity matching and selectively recomputes the KV cache. Compared to baselines, REFORM achieves over 50% and 27% performance gains on RULER and BABILong respectively at 1M context length. It also outperforms baselines on Infinite-Bench and MM-NIAH, demonstrating flexibility across diverse tasks and domains. Additionally, REFORM reduces inference time by 30% and peak memory usage by 5%, achieving both efficiency and superior performance.
随着大型语言模型在真实世界应用中的普及度不断提高,处理极长的上下文内容——通常超出模型的预训练上下文限制——已经成为一项关键挑战。虽然现有的高效长上下文处理方法显示出潜力,但基于循环压缩的方法在信息保留方面遇到困难,而随机访问方法则需要大量内存资源。我们引入了REFORM,这是一种新的推理框架,通过两阶段方法有效地处理长上下文。首先,它增量处理输入块,同时维护压缩的KV缓存,构建跨层上下文嵌入,并利用提前退出策略提高效率。其次,它通过相似度匹配来识别和收集关键令牌,并选择性重新计算KV缓存。与基线相比,REFORM在RULER和BABILong上的性能分别提高了50%和27%,上下文长度为1M。它在Infinite-Bench和MM-NIAH上也表现优于基线,证明了在不同任务和领域的灵活性。此外,REFORM将推理时间减少了30%,峰值内存使用率降低了5%,实现了效率和性能的双重提升。
论文及项目相关链接
Summary
长语境处理在大型语言模型中的应用中逐渐成为一大挑战。本文提出了REFORM这一新颖推理框架,它通过两个阶段处理长语境:一、通过增量处理输入片段、维持压缩KV缓存、构建跨层语境嵌入并采用早期退出策略提高效率;二、通过相似性匹配识别和收集关键令牌,并选择性重新计算KV缓存。REFORM在不同任务和领域都展现出灵活性,同时提高性能和效率。
Key Takeaways
- 大型语言模型在真实应用中面临处理长语境的挑战。
- 现有方法如压缩和随机访问在处理长语境时存在信息保留和内存需求问题。
- REFORM框架采用两阶段方法处理长语境,结合增量处理、KV缓存、跨层语境嵌入和早期退出策略提高效率。
- REFORM通过相似性匹配识别和收集关键令牌,选择性重新计算KV缓存。
- REFORM在多个基准测试中实现了显著的性能提升,并展示了在不同任务和领域的灵活性。
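下面是第二阶段"相似度匹配收集关键token"的简化示意(并非官方实现):用查询嵌入与上下文token嵌入的余弦相似度,选出需要重新计算完整KV的token位置;嵌入的构造方式、查询向量取平均以及 top_k 的取值均为示例假设:

```python
import torch
import torch.nn.functional as F

def gather_essential_tokens(context_emb, query_emb, top_k=16):
    """用相似度匹配从长上下文的token嵌入中找出与当前查询最相关的位置,
    仅对这些位置重新计算完整的KV(示意)。"""
    q = F.normalize(query_emb.mean(dim=0, keepdim=True), dim=-1)   # 查询取平均作为检索向量(简化假设)
    c = F.normalize(context_emb, dim=-1)
    scores = (c @ q.T).squeeze(-1)
    k = min(top_k, context_emb.size(0))
    return scores.topk(k).indices.sort().values

torch.manual_seed(0)
context_emb = torch.randn(1000, 64)   # 长上下文中每个token的跨层上下文嵌入(示意)
query_emb = torch.randn(12, 64)       # 当前查询的嵌入
idx = gather_essential_tokens(context_emb, query_emb)
print("需要重新计算KV的token位置:", idx.tolist())
```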






TRiMM: Transformer-Based Rich Motion Matching for Real-Time multi-modal Interaction in Digital Humans
Authors:Yueqian Guo, Tianzhao Li, Xin Lyu, Jiehaolin Chen, Zhaohan Wang, Sirui Xiao, Yurun Chen, Yezi He, Helin Li, Fan Zhang
Large Language Model (LLM)-driven digital humans have sparked a series of recent studies on co-speech gesture generation systems. However, existing approaches struggle with real-time synthesis and long-text comprehension. This paper introduces Transformer-Based Rich Motion Matching (TRiMM), a novel multi-modal framework for real-time 3D gesture generation. Our method incorporates three modules: 1) a cross-modal attention mechanism to achieve precise temporal alignment between speech and gestures; 2) a long-context autoregressive model with a sliding window mechanism for effective sequence modeling; 3) a large-scale gesture matching system that constructs an atomic action library and enables real-time retrieval. Additionally, we develop a lightweight pipeline implemented in the Unreal Engine for experimentation. Our approach achieves real-time inference at 120 fps and maintains a per-sentence latency of 0.15 seconds on consumer-grade GPUs (Geforce RTX3060). Extensive subjective and objective evaluations on the ZEGGS, and BEAT datasets demonstrate that our model outperforms current state-of-the-art methods. TRiMM enhances the speed of co-speech gesture generation while ensuring gesture quality, enabling LLM-driven digital humans to respond to speech in real time and synthesize corresponding gestures. Our code is available at https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching
基于大型语言模型(LLM)的数字人引发了一系列关于协同语音手势生成系统的研究。然而,现有方法在实时合成和长文本理解方面遇到了困难。本文介绍了基于Transformer的丰富动作匹配(TRiMM),这是一种用于实时3D手势生成的新型多模态框架。我们的方法结合了三个模块:1)跨模态注意力机制,实现语音和手势之间的精确时间对齐;2)具有滑动窗口机制的长上下文自回归模型,用于有效的序列建模;3)大规模手势匹配系统,构建原子动作库,实现实时检索。此外,我们在Unreal Engine中开发了一个轻量级的实验管道。我们的方法在消费级GPU(GeForce RTX 3060)上实现了120 FPS的实时推理,每句延迟仅为0.15秒。在ZEGGS和BEAT数据集上的主观和客观评估表明,我们的模型优于当前的最先进方法。TRiMM提高了协同语音手势生成的速度,同时确保手势质量,使LLM驱动的数字人能够实时响应语音并合成相应的手势。我们的代码可通过https://github.com/teroon/TRiMM-Transformer-Based-Rich-Motion-Matching获取。
论文及项目相关链接
PDF 24 pages,12 figures
Summary
基于Transformer的丰富动作匹配(TRiMM)是一种新型的多模态框架,用于实时3D手势生成。该方法结合了跨模态注意力机制、长上下文自回归模型和大规模手势匹配系统,实现了精确的时间对齐和高质量的手势生成。该模型在ZEGGS和BEAT数据集上的评估结果优于当前的最先进方法。TRiMM加快了协同语音手势生成的速度,使得LLM驱动的数字人能够实时响应语音并合成相应的手势。
Key Takeaways
- TRiMM是一种用于实时3D手势生成的多模态框架。
- 该方法结合了跨模态注意力机制、长上下文自回归模型和大规模手势匹配系统。
- TRiMM实现了精确的时间对齐,使手势与语音同步。
- 该模型在ZEGGS和BEAT数据集上的表现优于当前的最先进方法。
- TRiMM加快了协同语音手势生成的实时推理速度,达到每秒处理帧数高达120帧。
- TRiMM确保了手势质量,使得LLM驱动的数字人能实时响应语音并合成相应手势。



How do Transformer Embeddings Represent Compositions? A Functional Analysis
Authors:Aishik Nagar, Ishaan Singh Rawal, Mansi Dhanania, Cheston Tan
Compositionality is a key aspect of human intelligence, essential for reasoning and generalization. While transformer-based models have become the de facto standard for many language modeling tasks, little is known about how they represent compound words, and whether these representations are compositional. In this study, we test compositionality in Mistral, OpenAI Large, and Google embedding models, and compare them with BERT. First, we evaluate compositionality in the representations by examining six diverse models of compositionality (addition, multiplication, dilation, regression, etc.). We find that ridge regression, albeit linear, best accounts for compositionality. Surprisingly, we find that the classic vector addition model performs almost as well as any other model. Next, we verify that most embedding models are highly compositional, while BERT shows much poorer compositionality. We verify and visualize our findings with a synthetic dataset consisting of fully transparent adjective-noun compositions. Overall, we present a thorough investigation of compositionality.
组合性是人类智能的一个重要方面,对于推理和泛化至关重要。虽然基于Transformer的模型已成为许多语言建模任务的事实标准,但对于它们如何表示复合词,以及这些表示是否具有组合性,人们的了解仍然有限。在这项研究中,我们在Mistral、OpenAI Large和Google嵌入模型中测试了组合性,并将其与BERT进行了比较。首先,我们通过检查六种不同的组合性模型(加法、乘法、膨胀、回归等)来评估表示中的组合性。我们发现岭回归虽然是线性的,但最能解释组合性。令人惊讶的是,我们发现经典的向量加法模型的性能几乎与其他任何模型一样好。接下来,我们验证大多数嵌入模型具有很高的组合性,而BERT的组合性则较差。我们通过包含完全透明的形容词-名词组合的合成数据集来验证并可视化我们的发现。总的来说,我们对组合性进行了彻底的研究。
论文及项目相关链接
Summary
本研究探讨了模型在表现组合词方面的能力,即所谓的“组合性”。通过测试Mistral、OpenAI Large、Google嵌入模型以及BERT模型的组合性表现,发现大多数嵌入模型具有良好的组合性,而BERT的表现较差。研究中使用了六种不同的组合性模型来评估模型的表现,发现岭回归虽然在组合性方面表现优秀,但经典的向量加法模型同样表现良好。通过合成数据集验证了形容词和名词组合结构的可视化结果。总之,本研究对组合性进行了全面调查。
Key Takeaways
- 研究探讨了模型在表现组合词方面的能力,即组合性。
- 测试了Mistral、OpenAI Large、Google嵌入模型和BERT模型的组合性表现。
- 通过六种不同的组合性模型评估模型的表现,发现岭回归和向量加法模型表现较好。
- 大多数嵌入模型具有良好的组合性,而BERT表现较差。
- 通过合成数据集验证了形容词和名词组合结构的可视化结果。
- 研究强调了组合性是语言建模任务的关键方面,对于推理和泛化很重要。
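下面的示意代码演示这类组合性评估的方法学(并非论文原始实验):给定(形容词嵌入, 名词嵌入, 复合短语嵌入)三元组,比较"向量加法"与"岭回归"两种组合函数在留出集上的预测效果;这里的复合嵌入由线性组合加噪声人工合成,仅作方法演示,真实实验应替换为嵌入模型对形容词-名词短语的实际输出:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d, n = 64, 500
adj, noun = rng.normal(size=(n, d)), rng.normal(size=(n, d))
# 人工合成"复合短语"嵌入:线性组合 + 少量噪声(示例假设)
compound = 0.6 * adj + 0.7 * noun + 0.05 * rng.normal(size=(n, d))

def mean_cosine(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return float((a * b).sum(-1).mean())

X = np.concatenate([adj, noun], axis=1)          # 组合函数的输入:拼接成分词嵌入
X_tr, X_te, y_tr, y_te, a_tr, a_te, n_tr, n_te = train_test_split(
    X, compound, adj, noun, test_size=0.2, random_state=0)

ridge = Ridge(alpha=1.0).fit(X_tr, y_tr)
print("岭回归   留出集平均余弦相似度:", round(mean_cosine(ridge.predict(X_te), y_te), 3))
print("向量加法 留出集平均余弦相似度:", round(mean_cosine(a_te + n_te, y_te), 3))
```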




Probing the Geometry of Truth: Consistency and Generalization of Truth Directions in LLMs Across Logical Transformations and Question Answering Tasks
Authors:Yuntai Bao, Xuhong Zhang, Tianyu Du, Xinkui Zhao, Zhengwen Feng, Hao Peng, Jianwei Yin
Large language models (LLMs) are trained on extensive datasets that encapsulate substantial world knowledge. However, their outputs often include confidently stated inaccuracies. Earlier works suggest that LLMs encode truthfulness as a distinct linear feature, termed the “truth direction”, which can classify truthfulness reliably. We address several open questions about the truth direction: (i) whether LLMs universally exhibit consistent truth directions; (ii) whether sophisticated probing techniques are necessary to identify truth directions; and (iii) how the truth direction generalizes across diverse contexts. Our findings reveal that not all LLMs exhibit consistent truth directions, with stronger representations observed in more capable models, particularly in the context of logical negation. Additionally, we demonstrate that truthfulness probes trained on declarative atomic statements can generalize effectively to logical transformations, question-answering tasks, in-context learning, and external knowledge sources. Finally, we explore the practical application of truthfulness probes in selective question-answering, illustrating their potential to improve user trust in LLM outputs. These results advance our understanding of truth directions and provide new insights into the internal representations of LLM beliefs. Our code is public at https://github.com/colored-dye/truthfulness_probe_generalization
大型语言模型(LLM)是在包含大量世界知识的广泛数据集上进行训练的。然而,它们的输出通常包括自信的不准确之处。早期的研究表明,LLM将真实性编码为一个独特的线性特征,称为“真实方向”,可以可靠地分类真实性。我们针对关于真实方向的几个开放性问题进行探讨:(i)LLM是否普遍表现出一致的真实方向;(ii)是否需要复杂的探测技术来确定真实方向;(iii)真实方向如何在不同的语境中通用化。我们的研究发现,并非所有LLM都表现出一致的真实方向,在更强大的模型中观察到更强烈的表示,特别是在逻辑否定的背景下。此外,我们证明,在陈述性原子语句上训练的真实性探测器可以有效地推广到逻辑转换、问答任务、上下文学习和外部知识源。最后,我们探索了真实性探测器在选择性问答中的实际应用,说明了它们在提高用户对LLM输出的信任方面的潜力。这些结果推动了我们对于真实方向的理解,并为LLM信念的内部表示提供了新的见解。我们的代码公开在:https://github.com/colored-dye/truthfulness_probe_generalization 。
论文及项目相关链接
PDF 19 pages, 16 figures; accepted to Findings of ACL 2025
Summary
大型语言模型(LLM)虽基于广泛数据集培训,包含丰富世界知识,但其输出常含自信错误。先前研究提出LLM内存在一个称为“真实方向”的线性特征,可可靠地分类真实性。本文探讨关于真实方向的几个开放问题:LLM是否普遍展现一致的真实方向;是否需高级探测技术来识别真实方向;以及真实方向如何在不同语境中普及。研究发现,并非所有LLM都展现一致真实方向,更强大的模型在逻辑否定语境下表现更明显。此外,我们证明在陈述原子语句上训练的真实性探测器可有效推广到逻辑转换、问答任务、上下文学习和外部知识源。最后,我们探索了真实性探测器在选择性问答中的实际应用,展示了提高用户对LLM输出的信任度的潜力。
Key Takeaways
- LLMs常输出自信但可能不准确的答案。
- “真实方向”是LLMs中用于分类真实性的线性特征。
- 不同LLM可能展现不同的真实方向一致性。
- 更先进的LLM在逻辑否定方面表现出更强大的真实方向。
- 真实性探测器可在多种语境和任务中有效推广。
- 真实性探测器有助于提高用户对LLM输出的信任度。
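下面用人工合成的激活演示"真理方向"线性探针的基本做法(并非论文数据):在隐藏状态中注入一个线性可分的方向,训练逻辑回归探针,并检查探针学到的权重方向与该方向的一致性;真实实验应替换为LLM对真/假陈述句的某一层激活:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, n = 128, 400
truth_dir = rng.normal(size=d)
truth_dir /= np.linalg.norm(truth_dir)

# 合成"隐藏状态":真/假陈述沿 truth_dir 线性可分(示例假设)
labels = rng.integers(0, 2, n)
acts = rng.normal(size=(n, d)) + np.outer(2 * labels - 1, 1.5 * truth_dir)

probe = LogisticRegression(max_iter=1000).fit(acts[:300], labels[:300])
learned_dir = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("留出集准确率:", probe.score(acts[300:], labels[300:]))
print("学到的方向与注入的'真理方向'的余弦:", round(float(learned_dir @ truth_dir), 3))
```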





MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration
Authors:Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. Fung
In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning MLLM on this set of self-rewarded confidence estimation signal for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
近年来,多模态大型语言模型(MLLMs)取得了显著进展,但仍在多模态推理方面面临固有的挑战,这需要多层次(例如感知、推理)和多粒度(例如多步推理链)的高级推断。以往关于估计模型信心的工作往往集中在训练和校准的整体响应上,但未能评估每个推理步骤的信心,导致出现不希望看到的幻觉累积。在这项工作中,我们提出了MMBoundary,这是一个新型框架,通过推理步骤的信心校准,提高了MLLMs的知识边界意识。为实现这一目标,我们提议结合补充文本和跨模态自我奖励信号来估计MLLM推理过程中每一步的信心。除了使用自我奖励的信心估计信号对MLLM进行有监督的微调,以进行初始信心表达的预热,我们还引入了一个强化学习阶段,使用多个奖励函数来进一步对齐模型知识并校准每一步的信心,增强推理链的自我校正能力。实证结果表明,MMBoundary在跨不同领域的数据集和指标上显著优于现有方法,多模态信心校准误差平均减少了7.5%,任务性能提高了高达8.3%。
论文及项目相关链接
PDF 18 pages, ACL 2025
Summary
MLLM在多模态推理方面存在挑战,需要多级别和多粒度的推理能力。现有模型信心评估方法主要关注整体响应的培训和校准,而忽视了对每个推理步骤的信心评估,导致出现不必要的幻觉累积。本研究提出MMBoundary框架,通过推理步骤的信心校准提高MLLM的知识边界意识。结合补充文本和跨模态自奖励信号来估计MLLM推理过程中每一步的信心,并引入强化学习阶段进一步对齐模型知识和校准信心,提高推理链的自我校正能力。实证研究结果显示,MMBoundary在多个领域数据集和指标上的表现优于现有方法,平均减少7.5%的多模态信心校准错误,任务性能提高8.3%。
Key Takeaways
- MLLMs在多模态推理方面存在挑战,需要进一步提高多级别和多粒度的推理能力。
- 现有模型信心评估方法主要关注整体响应的训练和校准,缺乏对每个推理步骤的信心评估。
- MMBoundary框架通过推理步骤的信心校准提高MLLM的知识边界意识。
- MMBoundary结合补充文本和跨模态自奖励信号来估计MLLM推理过程中每一步的信心。
- MMBoundary引入强化学习阶段,进一步对齐模型知识并校准信心。
- MMBoundary能提高推理链的自我校正能力。




Curse of High Dimensionality Issue in Transformer for Long-context Modeling
Authors:Shuhai Zhang, Zeng You, Yaofo Chen, Zhiquan Wen, Qianyue Wang, Zhijie Qiu, Yuanqing Li, Mingkui Tan
Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to \textit{redundant} attention computations: while attention weights are often \textit{sparse}, all tokens consume \textit{equal} computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a \textit{supervised learning task}, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a \textit{group coding strategy}, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose \textit{Dynamic Group Attention} (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.Code is available at https://github.com/bolixinyu/DynamicGroupAttention.
基于Transformer的大型语言模型(LLM)通过自注意力机制捕捉长程依赖关系,从而在自然语言处理任务中表现出色。然而,由于冗余的注意力计算,长上下文建模面临重大的计算效率低下问题:虽然注意力权重通常是稀疏的,但所有令牌都消耗着平等的计算资源。在本文中,我们将传统的概率序列建模重新定义为“监督学习任务”,这能够区分相关和不相关的令牌,并提供对冗余的更清晰理解。基于这种重新定义,我们从理论上分析了注意力稀疏性,揭示只有少数令牌对预测产生了重大贡献。在此基础上,我们将注意力优化制定为线性编码问题,并提出“分组编码策略”,从理论上展示了其提高对抗随机噪声的稳健性和提高学习效率的能力。受此启发,我们提出了“动态组注意力”(DGA),它利用分组编码来通过聚合不太重要的令牌在注意力计算中明确减少冗余。经验结果表明,我们的DGA在保持竞争力性能的同时,显著降低了计算成本。代码可用在https://github.com/bolixinyu/DynamicGroupAttention。
论文及项目相关链接
PDF Accepted at ICML 2025
Summary
本文介绍了基于Transformer的大型语言模型(LLM)在自然语言处理任务中的优异表现,尤其是通过自注意力机制捕捉长距离依赖关系的能力。然而,长文本建模存在计算效率问题,主要是由于注意力计算的冗余性。对此,文章通过重新构建概率序列模型为监督学习任务,分离重要与非重要信息以识别冗余,对注意力稀疏性进行了理论分析。文章提出一种基于线性编码的注意力优化策略,并据此提出动态分组注意力(DGA)机制,通过聚合次要信息进行更明确的计算以减少冗余性。经验表明,该机制在降低计算成本的同时保持了竞争力。
Key Takeaways
- Transformer-based LLMs 擅长捕捉长距离依赖关系。
- 长文本建模面临计算效率问题,主要由于冗余的注意力计算。
- 通过将概率序列模型重新构建为监督学习任务,可分离重要与非重要信息,更好地理解冗余。
- 对注意力稀疏性的理论分析显示,只有少数标记对预测有重要贡献。
- 线性编码策略的提出是为了优化注意力机制。在此基础上提出的动态分组注意力(DGA)可以减少冗余性。
- 动态分组注意力机制在降低计算成本的同时保持了良好的性能表现。
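下面给出"动态分组注意力"思想的简化示意(并非官方实现):按平均注意力得分保留少量重要token,其余token按固定大小分组并做均值聚合后再参与注意力;keep、group_size 与聚合方式均为示例假设:

```python
import torch
import torch.nn.functional as F

def dynamic_group_attention(q, k, v, keep=8, group_size=16):
    """保留最重要的keep个key/value token,其余token按组均值聚合后再计算注意力(示意)。"""
    scale = k.size(-1) ** 0.5
    importance = ((q @ k.transpose(-1, -2)) / scale).mean(dim=0)   # 每个key token的平均重要性
    top = importance.topk(keep).indices
    top_set = set(top.tolist())
    rest = torch.tensor([i for i in range(k.size(0)) if i not in top_set])
    groups_k = [k[rest[i:i + group_size]].mean(dim=0) for i in range(0, len(rest), group_size)]
    groups_v = [v[rest[i:i + group_size]].mean(dim=0) for i in range(0, len(rest), group_size)]
    k_red = torch.cat([k[top], torch.stack(groups_k)], dim=0)
    v_red = torch.cat([v[top], torch.stack(groups_v)], dim=0)
    attn = F.softmax((q @ k_red.transpose(-1, -2)) / scale, dim=-1)
    return attn @ v_red, k_red.size(0)

torch.manual_seed(0)
T, d = 256, 64
q, k, v = torch.randn(8, d), torch.randn(T, d), torch.randn(T, d)
out, reduced_len = dynamic_group_attention(q, k, v)
print("输出形状:", tuple(out.shape), " KV长度从", T, "降到", reduced_len)
```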

