⚠️ All summaries below are generated by large language models and may contain errors; they are for reference only, so use with caution.
🔴 Note: never use these summaries for serious academic purposes; they are intended only for pre-screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-24
UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning
Authors:Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen
Recent advances in Large Multi-modal Models (LMMs) have demonstrated their remarkable success as general-purpose multi-modal assistants, with a particular focus on holistic image- and video-language understanding. Conversely, less attention has been given to scaling fine-grained pixel-level understanding capabilities, where the models are expected to realize pixel-level alignment between visual signals and language semantics. Some previous studies have applied LMMs to related tasks such as region-level captioning and referring expression segmentation. However, these models are limited to performing either referring or segmentation tasks independently and fail to integrate these fine-grained perception capabilities into visual reasoning. To bridge this gap, we propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Our model distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding capabilities. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioned on these intermediate pointers during inference, thereby enabling fine-grained pixel-level reasoning. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos. A novel PixelQA task that jointly requires referring, segmentation, and question answering is also designed to verify the flexibility of our method.
Paper and project links
PDF NeurIPS 2025 Camera Ready. Project Page: https://polyu-chenlab.github.io/unipixel/
Summary
Large Multi-modal Models (LMMs) have achieved remarkable success as general-purpose multi-modal assistants, particularly in holistic image- and video-language understanding, yet scaling fine-grained pixel-level understanding, where models must align visual signals with language semantics at the pixel level, has received far less attention. To bridge this gap, the paper proposes UniPixel, a model that flexibly comprehends visual prompt inputs and generates mask-grounded responses, distinguishing itself by seamlessly integrating pixel-level perception with general visual understanding. Experiments verify its effectiveness on tasks including pixel-level referring/segmentation and object-centric understanding in images and videos, and a new PixelQA task is designed to verify the method's flexibility.
Key Takeaways
- Large Multi-modal Models (LMMs) are widely used as general-purpose multi-modal assistants, with notable success in holistic image- and video-language understanding.
- Fine-grained pixel-level understanding, which requires pixel-level alignment between visual signals and language semantics, remains underexplored.
- UniPixel addresses this gap by flexibly comprehending visual prompt inputs and generating mask-grounded responses.
- UniPixel distinguishes itself by seamlessly integrating pixel-level perception with general visual understanding.
- Experiments on 10 benchmarks verify UniPixel's effectiveness across pixel-level referring, segmentation, and object-centric understanding in images and videos.
- A novel PixelQA task, jointly requiring referring, segmentation, and question answering, is designed to verify the method's flexibility.
TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs
Authors:Yunheng Li, Jing Cheng, Shaoyong Jia, Hangyi Kuang, Shaohui Jiao, Qibin Hou, Ming-Ming Cheng
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We reveal that existing reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), rely on on-policy sampling for policy updates. However, in tasks with large temporal search spaces, this strategy becomes both inefficient and limited in performance, as it often fails to identify temporally accurate solutions. To address this limitation, TempSamp-R1 leverages ground-truth annotations as off-policy supervision to provide temporally precise guidance, effectively compensating for the sparsity and misalignment in on-policy solutions. To further stabilize training and reduce variance in reward-based updates, TempSamp-R1 provides a non-linear soft advantage computation method that dynamically reshapes the reward feedback via an asymmetric transformation. By employing a hybrid Chain-of-Thought (CoT) training paradigm, TempSamp-R1 optimizes a single unified model to support both CoT and non-CoT inference modes, enabling efficient handling of queries with varying reasoning complexity. Experimental results demonstrate that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets: Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%). Moreover, TempSamp-R1 shows robust few-shot generalization capabilities under limited data. Code: https://github.com/HVision-NKU/TempSamp-R1
Paper and project links
PDF Accepted at NeurIPS 2025
Summary
This paper introduces TempSamp-R1, a reinforcement fine-tuning framework that improves how multimodal large language models (MLLMs) adapt to video temporal grounding tasks. Because on-policy methods such as GRPO struggle in large temporal search spaces, TempSamp-R1 uses ground-truth annotations as off-policy supervision for temporally precise guidance and introduces a non-linear soft advantage computation that dynamically reshapes reward feedback via an asymmetric transformation. A hybrid Chain-of-Thought training paradigm lets a single unified model support both CoT and non-CoT inference, handling queries of varying complexity efficiently. Experiments show new state-of-the-art results on Charades-STA, ActivityNet Captions, and QVHighlights.
Key Takeaways
- TempSamp-R1 is a reinforcement fine-tuning framework for adapting multimodal large language models (MLLMs) to video temporal grounding tasks.
- Existing RL methods such as GRPO rely on on-policy sampling and perform poorly in large temporal search spaces; TempSamp-R1 addresses this with off-policy supervision from ground-truth annotations for temporally precise guidance.
- A non-linear soft advantage computation dynamically reshapes reward feedback via an asymmetric transformation, stabilizing training and reducing variance (sketched after this list).
- A hybrid Chain-of-Thought training paradigm lets a single unified model support both CoT and non-CoT inference, efficiently handling queries of varying reasoning complexity.
- TempSamp-R1 sets new state-of-the-art results on Charades-STA (R1@0.7: 52.9%, +2.7%), ActivityNet Captions (R1@0.5: 56.0%, +5.3%), and QVHighlights (mAP: 30.0%, +3.0%).
- It shows robust few-shot generalization under limited data.
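The soft advantage computation can be pictured with a short sketch. This is a minimal illustration, assuming a tanh-based squashing with different scales on the positive and negative sides; the paper's exact transformation, reward values, and on-/off-policy mixing ratio are not specified here, and the numbers below are made up.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: standardize each reward within its group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def soft_asymmetric_advantage(rewards, pos_scale=1.0, neg_scale=0.5):
    """Hypothetical non-linear reshaping: squash group-relative advantages
    with tanh, scaling the positive and negative sides differently
    (the asymmetric transformation). Not the paper's exact formula."""
    adv = group_relative_advantage(rewards)
    return np.where(adv >= 0, pos_scale * np.tanh(adv), neg_scale * np.tanh(adv))

# A mixed group: three on-policy rollouts plus one ground-truth annotation
# used as an off-policy sample, each scored by temporal IoU (illustrative).
rewards = [0.21, 0.05, 0.48, 1.00]
print(soft_asymmetric_advantage(rewards))
```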
Everyday Physics in Korean Contexts: A Culturally Grounded Physical Reasoning Benchmark
Authors:Jihae Jeong, DaeYeop Lee, DongGeon Lee, Hwanjo Yu
Existing physical commonsense reasoning benchmarks predominantly focus on Western contexts, overlooking cultural variations in physical problem-solving. To address this gap, we introduce EPiK (Everyday Physics in Korean Contexts), a novel benchmark comprising 181 binary-choice problems that test physical reasoning within Korean cultural contexts, ranging from kimchi (Korean food) to traditional fermentation. EPiK is constructed using a two-stage generation and verification pipeline to create culturally-authentic problems across 9 reasoning subtasks and 84 scenarios. Unlike approaches based on simple translation, our method generates problems organically from Korean contexts while upholding rigorous physical reasoning standards. Our evaluations show that Korean-specialized models consistently outperform general-purpose models of comparable size. This performance gap highlights the limitations of culturally-agnostic models and demonstrates the critical need for culturally-aware benchmarks to truly measure language understanding. Our EPiK is publicly available at https://huggingface.co/datasets/jjae/EPiK.
Paper and project links
PDF Accepted to MRL@EMNLP 2025
Summary
Existing physical commonsense reasoning benchmarks focus mainly on Western contexts, overlooking cultural variation in physical problem-solving. To fill this gap, the paper introduces EPiK (Everyday Physics in Korean Contexts), a benchmark of 181 binary-choice problems that test physical reasoning within Korean cultural contexts, from kimchi to traditional fermentation. EPiK is built with a two-stage generation-and-verification pipeline to create culturally authentic problems across 9 reasoning subtasks and 84 scenarios; unlike translation-based approaches, problems are generated organically from Korean contexts while upholding rigorous physical reasoning standards. Evaluations show that Korean-specialized models consistently outperform general-purpose models of comparable size, highlighting the limitations of culturally-agnostic models and the need for culturally-aware benchmarks. EPiK is publicly available at https://huggingface.co/datasets/jjae/EPiK.
Key Takeaways
- EPiK is a culturally grounded benchmark for everyday physical commonsense reasoning in Korean contexts.
- It comprises 181 binary-choice problems covering topics such as Korean food (kimchi) and traditional fermentation (a loading and evaluation sketch follows this list).
- A two-stage generation-and-verification pipeline ensures cultural authenticity and rigorous physical reasoning.
- Unlike translation-based approaches, problems are generated organically from Korean cultural contexts.
- Korean-specialized models outperform general-purpose models of comparable size, underscoring the importance of cultural awareness.
- Existing physical commonsense benchmarks largely overlook cultural variation.
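Since the dataset is public on the Hugging Face Hub, a binary-choice evaluation loop is straightforward to sketch. The split and field names ("question", "options", "answer") below are assumptions; consult the dataset card for the actual schema.

```python
from datasets import load_dataset

# Dataset ID from the paper; split and field names are assumptions.
epik = load_dataset("jjae/EPiK", split="train")

def evaluate(predict_fn, dataset):
    """predict_fn(question, options) -> index of the chosen option (0 or 1)."""
    correct = sum(
        predict_fn(ex["question"], ex["options"]) == ex["answer"]
        for ex in dataset
    )
    return correct / len(dataset)

# Example: a trivial baseline that always picks the first option.
print(f"always-first accuracy: {evaluate(lambda q, o: 0, epik):.3f}")
```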
Adaptive Fast-and-Slow Visual Program Reasoning for Long-Form VideoQA
Authors:Chenglin Li, Feng Han, FengTao, Ruilin Li, Qianglong Chen, Jingqi Tong, Yin Zhang, Jiaqi Wang
Large language models (LLMs) have shown promise in generating program workflows for visual tasks. However, previous approaches often rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering (videoQA). To address these challenges, we introduce the FS-VisPR framework, an adaptive visual program reasoning approach that balances fast reasoning for simple queries with slow reasoning for difficult ones. First, we design efficient visual modules (e.g., key clip retrieval and subtitle retrieval) to support long-form video tasks. Then, we construct a diverse and high-quality fast-slow reasoning dataset with a strong LLM to align open-source language models’ ability to generate visual program workflows as FS-LLM. Next, we design a fast-slow reasoning framework with FS-LLM: Simple queries are directly solved by VideoLLMs, while difficult ones invoke visual program reasoning, motivated by human-like reasoning processes. During this process, low-confidence fast-thinking answers will trigger a second-stage slow-reasoning process, and a fallback mechanism to fast reasoning is activated if the program execution fails. Moreover, we improve visual programs through parameter search during both training and inference. By adjusting the parameters of the visual modules within the program, multiple variants are generated: during training, programs that yield correct answers are selected, while during inference, the program with the highest confidence result is applied. Experiments show that FS-VisPR improves both efficiency and reliability in visual program workflows. It achieves 50.4% accuracy on LVBench, surpassing GPT-4o, matching the performance of Qwen2.5VL-72B on VideoMME.
Paper and project links
Summary
Large language models show promise in generating program workflows for visual tasks, but prior approaches rely on closed-source models, lack systematic reasoning, and struggle with long-form video question answering. The FS-VisPR framework balances fast reasoning for simple queries with slow reasoning for difficult ones: efficient visual modules (key clip retrieval, subtitle retrieval) support long videos; a high-quality fast-slow reasoning dataset aligns open-source models (FS-LLM) to generate visual program workflows; low-confidence fast answers trigger a second-stage slow-reasoning process, with a fallback to fast reasoning if program execution fails; and visual-module parameters are searched during both training and inference. Experiments show FS-VisPR improves both the efficiency and reliability of visual program workflows.
Key Takeaways
- Large language models show promise for generating program workflows for visual tasks.
- Existing approaches rely on closed-source models, lack systematic reasoning, and struggle with long-form videoQA.
- FS-VisPR combines fast and slow reasoning to support long-form video tasks.
- A high-quality fast-slow reasoning dataset aligns open-source models; low-confidence fast answers trigger second-stage slow reasoning, with a fallback to the fast answer if program execution fails (see the control-flow sketch below).
- Parameter search over visual modules during both training and inference improves program quality.
- FS-VisPR improves the efficiency and reliability of visual program workflows, reaching 50.4% accuracy on LVBench (surpassing GPT-4o) and matching Qwen2.5VL-72B on VideoMME.
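The fast-slow control flow with its fallback is easy to make concrete. Below is a minimal sketch; every callable (video_llm, program_llm, executor) and the confidence threshold are placeholders, not FS-VisPR's actual interfaces.

```python
def answer_query(query, video, video_llm, program_llm, executor,
                 conf_threshold=0.7):
    """Fast thinking first; escalate to visual program reasoning on low
    confidence; fall back to the fast answer if program execution fails."""
    fast_answer, confidence = video_llm(query, video)      # stage 1: fast
    if confidence >= conf_threshold:
        return fast_answer
    try:
        program = program_llm(query)                       # stage 2: slow
        return executor(program, video)                    # run visual modules
    except Exception:
        return fast_answer                                 # fallback mechanism
```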
ConfClip: Confidence-Weighted and Clipped Reward for Reinforcement Learning in LLMs
Authors:Bonan Zhang, Zhongqi Chen, Bowen Song, Qinya Li, Fan Wu, Guihai Chen
Reinforcement learning (RL) has become a standard paradigm for refining large language models (LLMs) beyond pre-training and instruction tuning. A prominent line of work is RL with verifiable rewards (RLVR), which leverages automatically verifiable outcomes (e.g., correctness or executability) to generate reward signals. While efficient, this framework faces two key limitations: First, its binary feedback is too sparse to capture the quality of the reasoning process. Second, its coarse-grained rewards potentially lead to vanishing gradients. Inspired by observations from human learning, we introduce a RL technique that integrates verifiable outcomes with the model’s own confidence estimates. This joint design enriches the reward signal, providing finer-grained feedback and implicitly supervising the reasoning process. Experimental results demonstrate that our proposed method enhances RL performance across multiple datasets and reduces token consumption during inference, while incurring negligible additional training cost. Moreover, it can be used as a plug-in module to enhance other state-of-the-art RL methods.
Paper and project links
Summary
Reinforcement learning (RL) has become a standard paradigm for refining large language models beyond pre-training and instruction tuning, and RL with verifiable rewards (RLVR) derives reward signals from automatically verifiable outcomes such as correctness. The framework has two key limitations: its binary feedback is too sparse to capture the quality of the reasoning process, and its coarse-grained rewards can lead to vanishing gradients. Inspired by human learning, this work combines verifiable outcomes with the model's own confidence estimates, enriching the reward signal, providing finer-grained feedback, and implicitly supervising the reasoning process. The method improves RL performance across multiple datasets, reduces token consumption during inference, incurs negligible additional training cost, and can serve as a plug-in for other state-of-the-art RL methods.
Key Takeaways
- Reinforcement learning has become a standard approach for refining large language models.
- RLVR derives reward signals from automatically verifiable outcomes, but its feedback is sparse and its rewards coarse-grained.
- Combining verifiable outcomes with the model's own confidence estimates enriches the reward signal and provides finer-grained feedback (sketched below).
- The method improves RL performance across multiple datasets and reduces token consumption at inference.
- It incurs negligible additional training cost.
- It can be used as a plug-in module to enhance other state-of-the-art RL methods.
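One way to picture the idea is a reward that scales the verifiable outcome by the model's confidence and clips the result. This is a hypothetical shaping in the spirit of the paper, not its exact formula.

```python
import numpy as np

def confclip_reward(correct: bool, confidence: float,
                    clip_range=(0.2, 1.0)) -> float:
    """Weight the verifiable 0/1 outcome by the model's own confidence,
    then clip the magnitude so the signal neither vanishes nor saturates.
    Illustrative only; the paper defines the actual reward."""
    lo, hi = clip_range
    magnitude = float(np.clip(confidence, lo, hi))
    return magnitude if correct else -magnitude

# A confident wrong answer is punished harder than an unsure one.
print(confclip_reward(False, 0.95), confclip_reward(False, 0.30))
```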
EngiBench: A Benchmark for Evaluating Large Language Models on Engineering Problem Solving
Authors:Xiyuan Zhou, Xinlei Wang, Yirui He, Yang Wu, Ruixi Zou, Yuheng Cheng, Yulu Xie, Wenxuan Liu, Huan Zhao, Yan Xu, Jinjin Gu, Junhua Zhao
Large language models (LLMs) have shown strong performance on mathematical reasoning under well-posed conditions. However, real-world engineering problems require more than mathematical symbolic computation – they need to deal with uncertainty, context, and open-ended scenarios. Existing benchmarks fail to capture these complexities. We introduce EngiBench, a hierarchical benchmark designed to evaluate LLMs on solving engineering problems. It spans three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) and covers diverse engineering subfields. To facilitate a deeper understanding of model performance, we systematically rewrite each problem into three controlled variants (perturbed, knowledge-enhanced, and math abstraction), enabling us to separately evaluate the model’s robustness, domain-specific knowledge, and mathematical reasoning abilities. Experiment results reveal a clear performance gap across levels: models struggle more as tasks get harder, perform worse when problems are slightly changed, and fall far behind human experts on the high-level engineering tasks. These findings reveal that current LLMs still lack the high-level reasoning needed for real-world engineering, highlighting the need for future models with deeper and more reliable problem-solving capabilities. Our source code and data are available at https://github.com/EngiBench/EngiBench.
Paper and project links
Summary
Large language models perform strongly on well-posed mathematical reasoning, but real-world engineering problems involve uncertainty, context, and open-ended scenarios that existing benchmarks fail to capture. EngiBench is a hierarchical benchmark for evaluating LLMs on engineering problem solving, spanning three levels of increasing difficulty (foundational knowledge retrieval, multi-step contextual reasoning, and open-ended modeling) across diverse engineering subfields. Each problem is systematically rewritten into three controlled variants (perturbed, knowledge-enhanced, and math abstraction) to separately evaluate robustness, domain-specific knowledge, and mathematical reasoning. Results show a clear gap across levels: models struggle as tasks get harder, perform worse under slight perturbations, and fall far behind human experts on high-level engineering tasks, indicating that current LLMs still lack the high-level reasoning real-world engineering requires.
Key Takeaways
- Large language models excel at mathematical reasoning but face challenges on engineering problems.
- Existing benchmarks fail to capture the complexities of engineering problems, such as uncertainty, context, and open-ended scenarios.
- EngiBench is a hierarchical benchmark for evaluating LLMs on engineering problem solving.
- It spans three levels of increasing difficulty and covers multiple engineering subfields.
- Systematically rewritten problem variants allow separate evaluation of robustness, domain-specific knowledge, and mathematical reasoning (a per-variant scoring sketch follows this list).
- Results show models fall short on harder tasks and need stronger high-level reasoning.
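The variant-based evaluation can be sketched as a per-variant accuracy breakdown. The variant keys and problem schema below are assumptions; see the repository for the real data format.

```python
VARIANTS = ("original", "perturbed", "knowledge_enhanced", "math_abstraction")

def variant_accuracies(model, problems):
    """Per-variant accuracy: 'perturbed' probes robustness,
    'knowledge_enhanced' probes domain knowledge, and 'math_abstraction'
    probes pure mathematical reasoning. Schema is assumed."""
    return {
        v: sum(model(p[v]) == p["answer"] for p in problems) / len(problems)
        for v in VARIANTS
    }
```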
MSCoRe: A Benchmark for Multi-Stage Collaborative Reasoning in LLM Agents
Authors:Yuzhen Lei, Hongbin Xie, Jiaxing Zhao, Shuangxue Liu, Xuan Song
Large Language Models (LLMs) have excelled in question-answering (QA) tasks within single domains. However, their reasoning and coordination capabilities in complex, multi-stage scenarios remain underexplored. Existing benchmarks typically focus on isolated tasks or narrow domains, overlooking models’ abilities for multi-stage collaboration and optimization without explicit external guidance. To bridge this gap, we propose \textbf{MSCoRe}, a novel benchmark comprising 126696 domain-specific QA instances spanning scenarios in automotive, pharmaceutical, electronics, and energy sectors. The dataset is created using a structured three-phase pipeline: dynamic sampling, iterative question-answer generation, and a multi-level quality assessment to ensure data quality. Tasks are further categorized into three difficulty levels according to stage coverage and complexity. With MSCoRe, we have conducted a comprehensive evaluation of various state-of-the-art LLM agents. The commercial models performed best across all tasks and scenarios, but a notable gap in ROUGE scores remains between simple and complex tasks. We also tested the models’ robustness and found that their performance is negatively affected by noisy data. MSCoRe provides a valuable new resource for the community to evaluate and improve multi-stage reasoning in LLM agents. The code and data are available at https://github.com/D3E0-source/MSCoRE.
Paper and project links
PDF 10 pages, 5 figures
Summary
Large language models excel at question answering within single domains, but their reasoning and coordination in complex multi-stage scenarios remain underexplored; existing benchmarks focus on isolated tasks or narrow domains and overlook multi-stage collaboration and optimization without explicit external guidance. MSCoRe is a benchmark of 126,696 domain-specific QA instances spanning automotive, pharmaceutical, electronics, and energy scenarios, created with a structured three-phase pipeline (dynamic sampling, iterative question-answer generation, and multi-level quality assessment) and categorized into three difficulty levels by stage coverage and complexity. A comprehensive evaluation of state-of-the-art LLM agents shows commercial models perform best across all tasks and scenarios, yet a notable ROUGE gap remains between simple and complex tasks, and noisy data degrades performance.
Key Takeaways
- Large language models excel at single-domain QA, but multi-stage reasoning and collaboration in complex scenarios remain underexplored.
- Existing benchmarks focus on isolated tasks or narrow domains and lack tests of multi-stage collaboration and optimization.
- MSCoRe provides 126,696 domain-specific QA instances across automotive, pharmaceutical, electronics, and energy sectors for evaluating multi-stage reasoning in LLM agents.
- A structured three-phase pipeline (dynamic sampling, iterative QA generation, multi-level quality assessment) ensures data quality.
- Tasks are categorized into three difficulty levels by stage coverage and complexity.
- Commercial models perform best across all tasks and scenarios, but a notable ROUGE gap remains between simple and complex tasks (a scoring sketch follows this list).
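The ROUGE gap between simple and complex tasks can be measured with the rouge-score package; the difficulty bucketing and field names below are assumptions about the data layout.

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(reference: str, prediction: str) -> float:
    return scorer.score(reference, prediction)["rougeL"].fmeasure

def gap_by_difficulty(examples):
    """Mean ROUGE-L per difficulty level; each example is assumed to carry
    'level', 'reference', and 'prediction' fields."""
    levels = {}
    for ex in examples:
        levels.setdefault(ex["level"], []).append(
            rouge_l(ex["reference"], ex["prediction"]))
    return {lvl: sum(s) / len(s) for lvl, s in levels.items()}
```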
Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models
Authors:Jun Ling, Yao Qi, Tao Huang, Shibo Zhou, Yanqin Huang, Jiang Yang, Ziqi Song, Ying Zhou, Yang Yang, Heng Tao Shen, Peng Wang
In this work, we address the task of table image to LaTeX code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables – those with large sizes, deeply nested structures, and semantically rich or irregular cell content – where existing methods often fail. We begin with a comprehensive analysis, identifying key challenges and highlighting the limitations of current evaluation protocols. To overcome these issues, we propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset. To further improve generation quality, we introduce a dual-reward reinforcement learning strategy based on Group Relative Policy Optimization (GRPO). Unlike standard approaches that optimize purely over text outputs, our method incorporates both a structure-level reward on LaTeX code and a visual fidelity reward computed from rendered outputs, enabling direct optimization of the visual output quality. We adopt a hybrid evaluation protocol combining TEDS-Structure and CW-SSIM, and show that our method achieves state-of-the-art performance, particularly on structurally complex tables, demonstrating the effectiveness and robustness of our approach.
Paper and project links
PDF NeurIPS 2025
Summary
The paper addresses table-image-to-LaTeX generation, aiming to automate the reconstruction of high-quality, publication-ready tables from visual inputs; the central challenge is complex tables with large sizes, deeply nested structures, and semantically rich or irregular cell content, where existing methods often fail. A pre-trained multimodal LLM is fine-tuned on a large-scale table-to-LaTeX dataset, and a dual-reward reinforcement learning strategy based on Group Relative Policy Optimization (GRPO) combines a structure-level reward on the LaTeX code with a visual-fidelity reward computed from rendered outputs, enabling direct optimization of visual output quality. Under a hybrid evaluation protocol combining TEDS-Structure and CW-SSIM, the method achieves state-of-the-art performance, particularly on structurally complex tables, demonstrating effectiveness and robustness.
Key Takeaways
- Task: generating LaTeX code from table images, automating the reconstruction of high-quality, publication-ready tables from visual inputs.
- Key challenge: complex tables with large sizes, deeply nested structures, and semantically rich or irregular cell content, where existing methods often fail.
- Solution: a reinforced multimodal LLM framework that fine-tunes a pre-trained MLLM on a large-scale table-to-LaTeX dataset.
- Reinforcement learning: a dual-reward strategy based on Group Relative Policy Optimization (GRPO) combines a structure-level reward on the LaTeX code with a visual-fidelity reward from rendered outputs, directly optimizing visual quality (sketched after this list).
- Evaluation: a hybrid protocol combining TEDS-Structure and CW-SSIM more accurately reflects performance on structurally complex tables.
- Results: state-of-the-art performance, especially on structurally complex tables, demonstrating the method's effectiveness and robustness.
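The dual-reward combination can be sketched as below; teds_structure, cw_ssim, and renderer stand in for the actual scoring and rendering tools, and the mixing weight is an assumption.

```python
def dual_reward(latex_pred, latex_ref, teds_structure, cw_ssim, renderer,
                alpha=0.5):
    """Combine a structure-level reward on the LaTeX code with a visual
    fidelity reward on the rendered output. Placeholder callables; the
    paper's exact reward design may differ."""
    structural = teds_structure(latex_pred, latex_ref)
    try:
        visual = cw_ssim(renderer(latex_pred), renderer(latex_ref))
    except Exception:
        visual = 0.0  # unrenderable code earns no visual reward
    return alpha * structural + (1.0 - alpha) * visual
```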
Can LLMs Reason Over Non-Text Modalities in a Training-Free Manner? A Case Study with In-Context Representation Learning
Authors:Tianle Zhang, Wanlong Fang, Jonathan Woo, Paridhi Latawa, Deepak A. Subramanian, Alvin Chan
The remarkable performance of Large Language Models (LLMs) can be enhanced with test-time computation, which relies on external tools and even other deep learning models. However, existing approaches for integrating non-text modality representations into LLMs typically require additional costly supervised training, restricting on-the-fly adaptation to new domains and modalities. In this work, we explore the feasibility of integrating representations from non-text foundational models (FMs) into text-based LLMs in a training-free manner. We propose In-Context Representation Learning (ICRL) as a proof-of-concept to allow LLMs to adaptively utilize non-text modality representations with few-shot learning. Unlike traditional in-context learning, which incorporates text-label pairs, ICRL replaces text inputs with FM representations, enabling the LLM to perform multi-modal inference without fine-tuning. We evaluate ICRL on a suite of tasks in the molecular domain, investigating three core research questions: (i) how to map FM representations into LLMs in a training-free manner, (ii) what factors influence ICRL performance, and (iii) what mechanisms underlie the effectiveness of ICRL. To the best of our knowledge, ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs, presenting a promising direction for adaptable, multi-modal generalization.
Paper and project links
PDF NIPS 2025
Summary
This work explores integrating representations from non-text foundation models (FMs) into text-based LLMs without any training. The proposed In-Context Representation Learning (ICRL) lets an LLM adaptively use non-text modality representations with few-shot learning: unlike traditional in-context learning over text-label pairs, ICRL replaces text inputs with FM representations, enabling multi-modal inference without fine-tuning. ICRL is evaluated on molecular-domain tasks, investigating how to map FM representations into LLMs in a training-free manner, what factors influence performance, and what mechanisms underlie its effectiveness. It is the first training-free framework for integrating non-text modality representations into text-based LLMs, a promising direction for adaptable multi-modal generalization.
Key Takeaways
- LLM performance can be enhanced with test-time computation relying on external tools and other deep learning models.
- Existing methods for integrating non-text modality representations into LLMs require costly supervised training, limiting on-the-fly adaptation to new domains and modalities.
- In-Context Representation Learning (ICRL) lets LLMs use non-text foundation-model representations in a training-free manner (a prompt-construction sketch follows this list).
- ICRL enables few-shot, multi-modal inference without fine-tuning.
- ICRL is evaluated on molecular-domain tasks, addressing how to map FM representations into LLMs, what factors influence performance, and why it works.
- ICRL is the first training-free framework for integrating non-text modality representations into text-based LLMs.
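One simple way to hand a frozen LLM a non-text representation is to serialize the embedding as text in a few-shot prompt. This is only a sketch of the general idea with toy values; how ICRL actually maps FM representations into the LLM is the subject of the paper.

```python
def icrl_prompt(shots, query_repr, ndigits=3):
    """Build a few-shot prompt where FM embeddings replace text inputs.
    `shots` is a list of (vector, label) pairs; formatting is illustrative."""
    def fmt(vec):
        return "[" + ", ".join(f"{x:.{ndigits}f}" for x in vec) + "]"
    lines = [f"Representation: {fmt(v)}\nLabel: {y}" for v, y in shots]
    lines.append(f"Representation: {fmt(query_repr)}\nLabel:")
    return "\n\n".join(lines)

# e.g., molecular property prediction with 4-dim toy embeddings:
print(icrl_prompt([([0.12, -0.53, 0.88, 0.01], "soluble")],
                  [0.09, -0.47, 0.91, 0.05]))
```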
Correlation or Causation: Analyzing the Causal Structures of LLM and LRM Reasoning Process
Authors:Zhizhang FU, Guangsheng Bao, Hongbo Zhang, Chenkai Hu, Yue Zhang
LLMs suffer from critical reasoning issues such as unfaithfulness, bias, and inconsistency, since they lack robust causal underpinnings and may rely on superficial correlations rather than genuine understanding. Successive LRMs have emerged as a promising alternative, leveraging advanced training techniques such as reinforcement learning (RL) and distillation to improve task accuracy. However, the impact of these training methods on causality remains largely unexplored. In this study, we conduct a systematic causal analysis on LLMs and LRMs, examining structural causal models (SCMs) of four key variables: problem instruction (Z), thinking process (T), reasoning steps (X), and answer (Y). Our findings reveal that RLVR-trained LRMs exhibit enhanced causal reasoning capabilities, aligning more closely with ideal causal structures, while LLMs and distilled LRMs fail to address causality-related deficiencies. Our further investigation indicates that RLVR reduces spurious correlations and strengthens genuine causal patterns, thereby mitigating unfaithfulness and bias. In addition, our inspection on the dynamics of the RLVR training process observes a high correlation between reduced spurious features and improved causal structures, where the causal relationships consistently improve in the training process. This study contributes to the understanding of causality in reasoning models, highlights the critical role of RLVR in enhancing causal reasoning, and provides insights for designing future AI systems with stronger causal foundations. We release our code and data at https://github.com/Harryking1999/CoT_Causal_Analysis.
Paper and project links
Summary
LLMs exhibit critical reasoning issues such as unfaithfulness, bias, and inconsistency because they lack robust causal underpinnings and may rely on superficial correlations rather than genuine understanding. Large reasoning models (LRMs) trained with techniques such as reinforcement learning (RL) and distillation improve task accuracy, but the impact of these methods on causality was largely unexplored. This study performs a systematic causal analysis of LLMs and LRMs via structural causal models (SCMs) over four variables: problem instruction (Z), thinking process (T), reasoning steps (X), and answer (Y). RLVR-trained LRMs show enhanced causal reasoning, aligning more closely with ideal causal structures, while LLMs and distilled LRMs do not; RLVR reduces spurious correlations and strengthens genuine causal patterns, mitigating unfaithfulness and bias, and causal relationships improve consistently over the course of training.
Key Takeaways
- LLMs suffer from reasoning issues such as unfaithfulness, bias, and inconsistency.
- Large reasoning models (LRMs) improve task accuracy via training techniques such as reinforcement learning and distillation.
- Structural causal models (SCMs) over problem instruction (Z), thinking process (T), reasoning steps (X), and answer (Y) are used to analyze LLMs and LRMs.
- RLVR-trained LRMs exhibit stronger causal reasoning, aligning more closely with ideal causal structures.
- RLVR effectively reduces spurious correlations and strengthens genuine causal patterns.
- During RLVR training, reduced spurious features correlate strongly with improved causal structures.
Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation
Authors:Ahmed T. Elboardy, Ghada Khoriba, Essam A. Rashed
Automating radiology report generation poses a dual challenge: building clinically reliable systems and designing rigorous evaluation protocols. We introduce a multi-agent reinforcement learning framework that serves as both a benchmark and evaluation environment for multimodal clinical reasoning in the radiology ecosystem. The proposed framework integrates large language models (LLMs) and large vision models (LVMs) within a modular architecture composed of ten specialized agents responsible for image analysis, feature extraction, report generation, review, and evaluation. This design enables fine-grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). We demonstrate an implementation using chatGPT-4o on public radiology datasets, where LLMs act as evaluators alongside medical radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment, the proposed benchmark establishes a path toward trustworthy deviance-based radiology report generation.
Paper and project links
PDF NeurIPS2025 Workshop: Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
Summary
The paper introduces a multi-agent reinforcement learning framework that addresses the dual challenge of radiology report generation: building clinically reliable systems and designing rigorous evaluation protocols. The framework integrates large language models (LLMs) and large vision models (LVMs) in a modular architecture of ten specialized agents for image analysis, feature extraction, report generation, review, and evaluation, enabling fine-grained assessment at both the agent level (e.g., detection and segmentation accuracy) and the consensus level (e.g., report quality and clinical relevance). An implementation using chatGPT-4o on public radiology datasets is demonstrated, with LLMs acting as evaluators alongside radiologist feedback. By aligning evaluation protocols with the LLM development lifecycle (pretraining, finetuning, alignment, and deployment), the benchmark charts a path toward trustworthy radiology report generation.
Key Takeaways
- A multi-agent reinforcement learning framework addresses both clinical reliability and rigorous evaluation in radiology report generation.
- The framework integrates large language models and large vision models in a modular architecture.
- Ten specialized agents handle image analysis, feature extraction, report generation, review, and evaluation.
- Fine-grained assessment at both the agent and consensus levels offers a comprehensive view of report quality.
- An implementation with chatGPT-4o on public radiology datasets combines LLM evaluators with radiologist feedback.
- The evaluation protocols align with the LLM development lifecycle, including pretraining, finetuning, alignment, and deployment.
LLaVul: A Multimodal LLM for Interpretable Vulnerability Reasoning about Source Code
Authors:Ala Jararweh, Michael Adams, Avinash Sahu, Abdullah Mueen, Afsah Anwar
Increasing complexity in software systems places a growing demand on reasoning tools that unlock vulnerabilities manifest in source code. Many current approaches focus on vulnerability analysis as a classifying task, oversimplifying the nuanced and context-dependent real-world scenarios. Even though current code large language models (LLMs) excel in code understanding, they often pay little attention to security-specific reasoning. We propose LLaVul, a multimodal LLM tailored to provide fine-grained reasoning about code through question-answering (QA). Our model is trained to integrate paired code and natural queries into a unified space, enhancing reasoning and context-dependent insights about code vulnerability. To evaluate our model performance, we construct a curated dataset of real-world vulnerabilities paired with security-focused questions and answers. Our model outperforms state-of-the-art general-purpose and code LLMs in the QA and detection tasks. We further explain decision-making by conducting qualitative analysis to highlight capabilities and limitations. By integrating code and QA, LLaVul enables more interpretable and security-focused code understanding.
Paper and project links
Summary
Growing software complexity demands reasoning tools that surface vulnerabilities in source code. Current approaches treat vulnerability analysis as a classification task, oversimplifying nuanced, context-dependent real-world scenarios, and code LLMs pay little attention to security-specific reasoning. LLaVul is a multimodal LLM that provides fine-grained reasoning about code via question answering, integrating paired code and natural-language queries into a unified space for context-dependent insight into code vulnerabilities. On a curated dataset of real-world vulnerabilities paired with security-focused questions and answers, LLaVul outperforms state-of-the-art general-purpose and code LLMs on QA and detection tasks, with qualitative analysis highlighting its capabilities and limitations. By integrating code and QA, LLaVul enables more interpretable, security-focused code understanding.
Key Takeaways
- Increasing software complexity demands deeper understanding and analysis of vulnerabilities in source code.
- Most current methods treat vulnerability analysis as classification, missing the nuance and context dependence of real-world scenarios.
- Code LLMs understand code well but pay little attention to security-specific reasoning.
- LLaVul is a multimodal LLM providing fine-grained code reasoning through question answering.
- LLaVul integrates code and natural queries in a unified space, improving reasoning and contextual understanding of vulnerabilities.
- LLaVul outperforms existing general-purpose and code LLMs on a curated real-world vulnerability dataset.
Mano Report
Authors:Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, Yuyang Chen, Ruiyang Yu, Siran Peng, Menglin Li, Nan Huang, Haitian Wei, Jiawei Yu, Yi Xin, Xilin Zhao, Kai Gu, Ping Jiang, Sifan Zhou, Shuo Wang
Graphical user interfaces (GUIs) are the primary medium for human-computer interaction, yet automating GUI interactions remains challenging due to the complexity of visual elements, dynamic environments, and the need for multi-step reasoning. Existing methods based on vision-language models (VLMs) often suffer from limited resolution, domain mismatch, and insufficient sequential decisionmaking capability. To address these issues, we propose Mano, a robust GUI agent built upon a multi-modal foundation model pre-trained on extensive web and computer system data. Our approach integrates a novel simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano demonstrates state-of-the-art performance on multiple GUI benchmarks, including Mind2Web and OSWorld, achieving significant improvements in success rate and operational accuracy. Our work provides new insights into the effective integration of reinforcement learning with VLMs for practical GUI agent deployment, highlighting the importance of domain-specific data, iterative training, and holistic reward design.
Paper and project links
Summary
Automating GUI interaction is challenging due to complex visual elements, dynamic environments, and the need for multi-step reasoning. Mano is a robust GUI agent built on a multimodal foundation model pre-trained on extensive web and computer-system data. It combines a simulated environment for high-fidelity data generation, a three-stage training pipeline (supervised fine-tuning, offline reinforcement learning, and online reinforcement learning), and a verification module for error recovery. Mano achieves state-of-the-art results on GUI benchmarks including Mind2Web and OSWorld, with significant gains in success rate and operational accuracy, offering new insights into effectively combining reinforcement learning with VLMs for practical GUI agent deployment.
Key Takeaways
- GUI interaction automation faces challenges from complex visual elements, dynamic environments, and multi-step reasoning.
- Existing VLM-based methods suffer from limited resolution, domain mismatch, and insufficient sequential decision-making capability.
- Mano is a GUI agent built on a multimodal foundation model, with a simulated environment for high-fidelity data generation.
- Training proceeds in three stages: supervised fine-tuning, offline reinforcement learning, and online reinforcement learning.
- A verification module enables error recovery, improving the robustness of interactions.
- Mano achieves state-of-the-art success rate and operational accuracy on GUI benchmarks such as Mind2Web and OSWorld.
CogAtom: From Cognitive Atoms to Olympiad-level Mathematical Reasoning in Large Language Models
Authors:Zhuofan Chen, Jiyuan He, Yichi Zhang, Xing Hu, Haoxing Wen, Jun Bai, Wenge Rong
Mathematical reasoning poses significant challenges for Large Language Models (LLMs) due to its demand for multi-step reasoning and abstract conceptual integration. While recent test-time scaling techniques rely heavily on high-quality, challenging problems, the scarcity of Olympiad-level math problems remains a bottleneck. We introduce CogAtom, a novel cognitive atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems. Unlike prior approaches, CogAtom models problem construction as a process of selecting and recombining fundamental reasoning units, cognitive atoms, extracted from human-authored solutions. A diversity-promoting random walk algorithm enables exploration of the cognitive atom space, while a constraint-based recombination mechanism ensures logical soundness and structural validity. The combinatorial nature of the graph structure provides a near-infinite space of reasoning paths, and the walk algorithm systematically explores this space to achieve large-scale synthesis of high-quality problems; meanwhile, by controlling the number of cognitive atoms, we can precisely adjust problem difficulty, ensuring diversity, scalability, and controllability of the generated problems. Experimental results demonstrate that CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match the difficulty of AIME while exceeding it in structural variation. Our work offers a cognitively grounded pathway toward scalable, high-quality math problem generation. Our code is publicly available at https://github.com/Icarus-1111/CogAtom.
Paper and project links
Summary
Olympiad-level math problems are scarce, which bottlenecks test-time scaling techniques that rely on high-quality, challenging problems. CogAtom is a cognitive-atom-based framework for synthesizing mathematically rigorous, cognitively diverse problems: problem construction is modeled as selecting and recombining fundamental reasoning units (cognitive atoms) extracted from human-authored solutions. A diversity-promoting random walk explores the cognitive-atom space, a constraint-based recombination mechanism ensures logical soundness and structural validity, and controlling the number of atoms precisely adjusts problem difficulty. CogAtom outperforms existing methods in accuracy, reasoning depth, and diversity, generating problems that closely match AIME difficulty while exceeding it in structural variation. Code: https://github.com/Icarus-1111/CogAtom.
Key Takeaways
- Mathematical reasoning challenges LLMs through its demand for multi-step reasoning and abstract conceptual integration.
- CogAtom is a cognitive-atom-based framework for synthesizing mathematically rigorous and cognitively diverse problems.
- Problems are constructed by selecting and recombining cognitive atoms, with a constraint-based recombination mechanism ensuring logical soundness and structural validity.
- Cognitive atoms are extracted from human-authored solutions.
- A diversity-promoting random walk algorithm explores the cognitive-atom space (sketched after this list).
- The combinatorial graph structure provides a near-infinite space of reasoning paths, and the number of atoms controls problem difficulty.
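A diversity-promoting random walk is simple to sketch over an atom co-occurrence graph: atoms visited before are down-weighted so the walk explores new reasoning units. The graph format, decay parameter, and toy atoms below are illustrative assumptions, not the paper's.

```python
import random

def diverse_random_walk(graph, start, num_atoms, decay=2.0, seed=0):
    """graph: dict mapping a cognitive atom to its neighbor atoms.
    Repeated atoms are down-weighted by decay**visits to promote diversity;
    num_atoms controls problem difficulty (more atoms = harder problem)."""
    rng = random.Random(seed)
    path, visits = [start], {start: 1}
    while len(path) < num_atoms:
        neighbors = graph.get(path[-1], [])
        if not neighbors:
            break
        weights = [1.0 / (decay ** visits.get(n, 0)) for n in neighbors]
        nxt = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(nxt)
        visits[nxt] = visits.get(nxt, 0) + 1
    return path

toy = {"pigeonhole": ["modular arithmetic", "invariants"],
       "modular arithmetic": ["pigeonhole", "telescoping"],
       "invariants": ["pigeonhole"], "telescoping": ["invariants"]}
print(diverse_random_walk(toy, "pigeonhole", num_atoms=4))
```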
GRPOformer: Advancing Hyperparameter Optimization via Group Relative Policy Optimization
Authors:Haoxin Guo, Jiawen Pan, Weixin Zhai
Hyperparameter optimization (HPO) plays a critical role in improving model performance. Transformer-based HPO methods have shown great potential; however, existing approaches rely heavily on large-scale historical optimization trajectories and lack effective reinforcement learning (RL) techniques, thereby limiting their efficiency and performance improvements. Inspired by the success of Group Relative Policy Optimization (GRPO) in large language models (LLMs), we propose GRPOformer – a novel hyperparameter optimization framework that integrates reinforcement learning (RL) with Transformers. In GRPOformer, Transformers are employed to generate new hyperparameter configurations from historical optimization trajectories, while GRPO enables rapid trajectory construction and optimization strategy learning from scratch. Moreover, we introduce Policy Churn Regularization (PCR) to enhance the stability of GRPO training. Experimental results on OpenML demonstrate that GRPOformer consistently outperforms baseline methods across diverse tasks, offering new insights into the application of RL for HPO.
Paper and project links
Summary
Hyperparameter optimization (HPO) plays a critical role in improving model performance, but existing Transformer-based HPO methods rely heavily on large-scale historical optimization trajectories and lack effective reinforcement learning techniques. Inspired by the success of Group Relative Policy Optimization (GRPO) in large language models, GRPOformer integrates RL with Transformers: Transformers generate new hyperparameter configurations from historical optimization trajectories, while GRPO enables rapid trajectory construction and optimization-strategy learning from scratch. Policy Churn Regularization (PCR) is introduced to stabilize GRPO training. On OpenML, GRPOformer consistently outperforms baseline methods across diverse tasks, offering new insights into applying RL to HPO.
Key Takeaways
- HPO plays a critical role in improving model performance.
- Existing Transformer-based HPO methods depend heavily on historical optimization trajectories and lack effective RL techniques.
- GRPOformer is a new HPO framework that integrates reinforcement learning with Transformers.
- Transformers generate new hyperparameter configurations from historical optimization trajectories.
- GRPO enables rapid trajectory construction and optimization-strategy learning from scratch.
- Policy Churn Regularization (PCR) enhances the stability of GRPO training (both ingredients are sketched after this list).
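The two ingredients are easy to sketch: GRPO's group-relative advantage, plus a churn penalty that discourages the policy from drifting too fast between updates. The KL form of PCR below is an assumption; the paper defines the actual regularizer.

```python
import torch
import torch.nn.functional as F

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Group-relative advantages; rewards has shape (num_groups, group_size)."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def policy_churn_penalty(logits_new, logits_old):
    """Hypothetical PCR term: penalize KL divergence from the previous
    policy's output distribution to stabilize training."""
    p_old = F.softmax(logits_old.detach(), dim=-1)
    logp_new = F.log_softmax(logits_new, dim=-1)
    return F.kl_div(logp_new, p_old, reduction="batchmean")

print(grpo_advantages(torch.tensor([[0.1, 0.5, 0.9], [0.2, 0.2, 0.8]])))
```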
CoPlanner: An Interactive Motion Planner with Contingency-Aware Diffusion for Autonomous Driving
Authors:Ruiguo Zhong, Ruoyu Yao, Pei Liu, Xiaolong Chen, Rui Yang, Jun Ma
Accurate trajectory prediction and motion planning are crucial for autonomous driving systems to navigate safely in complex, interactive environments characterized by multimodal uncertainties. However, current generation-then-evaluation frameworks typically construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, leading to overconfident decisions and a lack of fallback strategies that are vital for safety in rare but critical scenarios. Moreover, the usual decoupling of prediction and planning modules could result in socially inconsistent or unrealistic joint trajectories, especially in highly interactive traffic. To address these challenges, we propose a contingency-aware diffusion planner (CoPlanner), a unified framework that jointly models multi-agent interactive trajectory generation and contingency-aware motion planning. Specifically, the pivot-conditioned diffusion mechanism anchors trajectory sampling on a validated, shared short-term segment to preserve temporal consistency, while stochastically generating diverse long-horizon branches that capture multimodal motion evolutions. In parallel, we design a contingency-aware multi-scenario scoring strategy that evaluates candidate ego trajectories across multiple plausible long-horizon evolution scenarios, balancing safety, progress, and comfort. This integrated design preserves feasible fallback options and enhances robustness under uncertainty, leading to more realistic interaction-aware planning. Extensive closed-loop experiments on the nuPlan benchmark demonstrate that CoPlanner consistently surpasses state-of-the-art methods on both Val14 and Test14 datasets, achieving significant improvements in safety and comfort under both reactive and non-reactive settings. Code and model will be made publicly available upon acceptance.
Paper and project links
Summary
Accurate trajectory prediction and motion planning are crucial for autonomous driving in complex, interactive environments with multimodal uncertainty. Generation-then-evaluation frameworks construct multiple plausible trajectory hypotheses but ultimately adopt a single most likely outcome, producing overconfident decisions without the fallback strategies vital for rare but critical scenarios, and decoupling prediction from planning can yield socially inconsistent or unrealistic joint trajectories, especially in highly interactive traffic. CoPlanner is a contingency-aware diffusion planner that jointly models multi-agent interactive trajectory generation and contingency-aware motion planning: a pivot-conditioned diffusion mechanism anchors sampling on a validated, shared short-term segment for temporal consistency while stochastically generating diverse long-horizon branches, and a contingency-aware multi-scenario scoring strategy evaluates candidate ego trajectories across plausible long-horizon evolutions, balancing safety, progress, and comfort. Closed-loop experiments on the nuPlan benchmark show CoPlanner surpasses state-of-the-art methods on both Val14 and Test14.
Key Takeaways
- Autonomous driving systems need accurate trajectory prediction and motion planning to handle uncertainty in complex, interactive environments.
- Committing to a single most likely trajectory leaves no fallback strategy, compromising safety in rare but critical scenarios.
- CoPlanner jointly models multi-agent interactive trajectory generation and contingency-aware motion planning, improving safety and robustness.
- A pivot-conditioned diffusion mechanism preserves short-term temporal consistency while stochastically generating diverse long-horizon branches that capture multimodal motion evolution.
- A contingency-aware multi-scenario scoring strategy balances safety, progress, and comfort for more realistic interaction-aware planning (sketched after this list).
- On the nuPlan benchmark, CoPlanner surpasses state-of-the-art methods on both Val14 and Test14, with notable gains in safety and comfort under both reactive and non-reactive settings.
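The multi-scenario scoring step can be pictured as below; the weights and the worst-case aggregation over scenarios are assumptions for illustration, not the paper's exact recipe.

```python
def select_ego_trajectory(candidates, scenarios, score_fn,
                          weights=(0.5, 0.3, 0.2)):
    """Score each candidate ego trajectory under every plausible long-horizon
    scenario on (safety, progress, comfort), aggregate by the worst case,
    and keep the best candidate. score_fn is a placeholder callable."""
    w_safety, w_progress, w_comfort = weights

    def total(traj, scenario):
        safety, progress, comfort = score_fn(traj, scenario)
        return w_safety * safety + w_progress * progress + w_comfort * comfort

    return max(candidates,
               key=lambda traj: min(total(traj, s) for s in scenarios))
```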
Interpretable Audio Editing Evaluation via Chain-of-Thought Difference-Commonality Reasoning with Multimodal LLMs
Authors:Yuhang Jia, Xu Zhang, Yang Chen, Hui Wang, Enzhi Wang, Yong Qin
Automatic mean opinion score (MOS) prediction provides a more perceptual alternative to objective metrics, offering deeper insights into the evaluated models. With the rapid progress of multimodal large language models (MLLMs), their enhanced perceptual and reasoning abilities enable more comprehensive and interpretable audio quality assessment. In this work, we tackle the challenging task of audio editing evaluation and propose the first natural language-based automated evaluation framework built on MLLMs. Our approach introduces two fine-tuning tasks to boost multi-audio understanding, combined with Chain-of-Thought prompting, and lightweight instruction tuning, to enhance step-by-step reasoning. Experiment demonstrate that our framework delivers accurate, interpretable, and text-based editing evaluation, closely aligning with human judgments and objective metrics while substantially improving over baselines. The code and demo are available at https://github.com/NKU-HLT/Eval_Reasoning.
Paper and project links
Summary
Automatic mean opinion score (MOS) prediction offers a more perceptual alternative to objective metrics, and the enhanced perceptual and reasoning abilities of multimodal large language models (MLLMs) enable more comprehensive, interpretable audio quality assessment. This work tackles audio editing evaluation and proposes the first natural-language-based automated evaluation framework built on MLLMs, introducing two fine-tuning tasks to boost multi-audio understanding, combined with Chain-of-Thought prompting and lightweight instruction tuning to enhance step-by-step reasoning. The framework delivers accurate, interpretable, text-based editing evaluation that aligns closely with human judgments and objective metrics while substantially improving over baselines. Code and demo: https://github.com/NKU-HLT/Eval_Reasoning.
Key Takeaways
- Automatic MOS prediction offers a more perceptual view of model evaluation than objective metrics.
- The rapid progress of multimodal LLMs brings stronger perception and reasoning abilities.
- The first natural-language-based automated framework for audio editing evaluation is introduced.
- Two fine-tuning tasks and Chain-of-Thought prompting improve multi-audio understanding and step-by-step reasoning.
- The framework's accurate, interpretable, text-based evaluations align with human judgments and objective metrics.
- It substantially outperforms baselines on audio editing evaluation.
LLMs as Layout Designers: A Spatial Reasoning Perspective
Authors:Sha Li
While Large Language Models (LLMs) have demonstrated impressive reasoning and planning abilities in textual domains and can effectively follow instructions for complex tasks, their capacity for spatial understanding and reasoning remains limited. Such capabilities, however, are critical for applications like content-aware graphic layout design, which demands precise placement, alignment, and structural organization of multiple elements within constrained visual spaces. To address this gap, we propose LaySPA, a reinforcement learning-based framework that augments LLM agents with explicit spatial reasoning capabilities. LaySPA leverages hybrid reward signals that capture geometric validity, structural fidelity, and visual quality, enabling agents to model inter-element relationships, navigate the canvas, and optimize spatial arrangements. Through iterative self-exploration and adaptive policy optimization, LaySPA produces both interpretable reasoning traces and structured layouts. Experimental results demonstrate that LaySPA generates structurally sound and visually appealing layouts, outperforming larger general-purpose LLMs and achieving results on par with state-of-the-art specialized layout models.
Paper and project links
Summary
LLMs reason and plan impressively in textual domains but remain limited in spatial understanding and reasoning, which is critical for applications like content-aware graphic layout design that demand precise placement, alignment, and structural organization of elements within constrained visual spaces. LaySPA is a reinforcement-learning-based framework that augments LLM agents with explicit spatial reasoning, using hybrid reward signals that capture geometric validity, structural fidelity, and visual quality so agents can model inter-element relationships, navigate the canvas, and optimize spatial arrangements. Through iterative self-exploration and adaptive policy optimization, LaySPA produces both interpretable reasoning traces and structured layouts, outperforming larger general-purpose LLMs and matching state-of-the-art specialized layout models.
Key Takeaways
- LLMs show strong reasoning and planning in text but limited spatial understanding and reasoning.
- Spatial reasoning is critical for applications such as content-aware graphic layout design.
- LaySPA is a reinforcement-learning framework that augments LLM agents with explicit spatial reasoning.
- Hybrid reward signals capture geometric validity, structural fidelity, and visual quality (the geometric part is sketched after this list).
- Agents learn to model inter-element relationships, navigate the canvas, and optimize spatial arrangements.
- Iterative self-exploration and adaptive policy optimization yield interpretable reasoning traces and structured layouts.
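The geometric-validity part of a hybrid layout reward is concrete enough to sketch: boxes must stay on the canvas and avoid overlap, with a crude alignment bonus. The weights, tolerance, and alignment heuristic are assumptions; LaySPA's full reward also covers structural fidelity and visual quality.

```python
def layout_reward(boxes, canvas=(1.0, 1.0), w_valid=0.7, w_align=0.3):
    """boxes: list of (x, y, w, h) in canvas coordinates."""
    W, H = canvas

    def inside(b):
        x, y, w, h = b
        return x >= 0 and y >= 0 and x + w <= W and y + h <= H

    def overlaps(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    pairs = [(a, b) for i, a in enumerate(boxes) for b in boxes[i + 1:]]
    valid = all(inside(b) for b in boxes) and not any(overlaps(a, b)
                                                      for a, b in pairs)
    # alignment bonus: fraction of pairs sharing a left edge (tolerance 0.01)
    aligned = (sum(abs(a[0] - b[0]) < 0.01 for a, b in pairs) / len(pairs)
               if pairs else 1.0)
    return w_valid * float(valid) + w_align * aligned
```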
Evaluating LLM Generated Detection Rules in Cybersecurity
Authors:Anna Bertiger, Bobby Filar, Aryan Luthra, Stefano Meschiari, Aiden Mitchell, Sam Scholten, Vivek Sharath
LLMs are increasingly pervasive in the security environment, with limited measures of their effectiveness, which limits trust and usefulness to security practitioners. Here, we present an open-source evaluation framework and benchmark metrics for evaluating LLM-generated cybersecurity rules. The benchmark employs a holdout set-based methodology to measure the effectiveness of LLM-generated security rules in comparison to a human-generated corpus of rules. It provides three key metrics inspired by the way experts evaluate security rules, offering a realistic, multifaceted evaluation of the effectiveness of an LLM-based security rule generator. This methodology is illustrated using rules from Sublime Security’s detection team and those written by Sublime Security’s Automated Detection Engineer (ADE), with a thorough analysis of ADE’s skills presented in the results section.
Paper and project links
PDF Preprint of a paper accepted at the Conference on Applied Machine Learning in Information Security (CAMLIS 2025). 11 pages, 3 figures, 4 tables
Summary
As LLMs become pervasive in the security environment, limited measures of their effectiveness constrain practitioners' trust and use. This work presents an open-source evaluation framework and benchmark metrics for LLM-generated cybersecurity rules, using a holdout set-based methodology to measure their effectiveness against a human-written corpus of rules. Three key metrics, inspired by the way experts evaluate security rules, provide a realistic, multifaceted assessment of an LLM-based rule generator. The methodology is illustrated with rules from Sublime Security's detection team and its Automated Detection Engineer (ADE), whose skills are analyzed in detail in the results.
Key Takeaways
- LLMs are increasingly pervasive in security, but limited measures of their effectiveness constrain practitioners' trust and use.
- An open-source evaluation framework and benchmark metrics assess the effectiveness of LLM-generated security rules.
- A holdout set-based methodology compares LLM-generated rules against a human-written corpus (sketched after this list).
- Three key metrics, inspired by how experts evaluate security rules, give a realistic, multifaceted assessment.
- The methodology is illustrated with rules from Sublime Security's detection team and its Automated Detection Engineer (ADE).
- ADE's skills are analyzed in detail in the results section.
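A holdout-style scoring loop for a detection rule is straightforward; the metric names here are generic stand-ins, since the paper defines its own three expert-inspired metrics.

```python
def rule_metrics(rule, malicious, benign):
    """rule: any predicate over a sample; malicious/benign: holdout sets."""
    tp = sum(bool(rule(m)) for m in malicious)
    fp = sum(bool(rule(b)) for b in benign)
    return {
        "recall": tp / len(malicious) if malicious else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "false_positive_rate": fp / len(benign) if benign else 0.0,
    }

# Example: a toy phishing rule over email dicts.
rule = lambda msg: "urgent" in msg["subject"].lower()
print(rule_metrics(rule,
                   malicious=[{"subject": "URGENT: verify account"}],
                   benign=[{"subject": "Team lunch"}]))
```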
EG-MLA: Embedding-Gated Multi-head Latent Attention for Scalable and Efficient LLMs
Authors:Zhengge Cai, Haowen Hou
Reducing the key-value (KV) cache size is a crucial step toward enabling efficient inference in large language models (LLMs), especially under latency and memory constraints. While Multi-Head Attention (MHA) offers strong representational power, it incurs significant memory overhead. Recent work on Multi-head Latent Attention (MLA) mitigates this by compressing KV representations into a shared latent space, achieving a better trade-off between performance and cache efficiency. While MLA already achieves significant KV cache reduction, the scope for further compression remains limited without performance loss. In this paper, we propose \textbf{Embedding-Gated Multi-head Latent Attention (EG-MLA)}, a novel extension of MLA that further reduces KV cache size while enhancing representational expressiveness. EG-MLA introduces a token-specific embedding gating mechanism applied in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared to MHA, EG-MLA achieves over 91.6% reduction in KV cache size with negligible performance degradation. Relative to MLA, EG-MLA consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Our theoretical analysis highlights how embedding gating induces implicit high-order interactions, and empirical evaluations demonstrate robust generalization across model scales and compression regimes. Notably, we successfully scale EG-MLA to over 1 billion parameters, demonstrating its practical viability for large-scale LLM deployment. These results establish EG-MLA as a memory- and compute-efficient attention mechanism that enables scalable, high-performance inference in modern LLMs.
Paper and project links
Summary
Reducing the key-value (KV) cache is crucial for efficient LLM inference under latency and memory constraints. Embedding-Gated Multi-head Latent Attention (EG-MLA) extends MLA by applying a token-specific embedding gating mechanism in the latent space, enabling fine-grained modulation of compressed KV vectors with minimal additional computation. Compared with MHA, EG-MLA reduces KV cache size by over 91.6% with negligible performance degradation; relative to MLA, it consistently improves task accuracy across diverse reasoning benchmarks while achieving up to 59.9% additional memory savings. Theoretical analysis shows that embedding gating induces implicit high-order interactions, empirical evaluations demonstrate robust generalization across model scales and compression regimes, and EG-MLA scales to over 1 billion parameters, establishing it as a memory- and compute-efficient attention mechanism for scalable, high-performance inference in modern LLMs.
Key Takeaways
- Reducing KV cache size is a key step toward efficient inference in large language models.
- Multi-head Latent Attention (MLA) compresses KV representations into a shared latent space, balancing performance and cache efficiency.
- The proposed Embedding-Gated Multi-head Latent Attention (EG-MLA) further reduces KV cache size while enhancing representational expressiveness.
- A token-specific embedding gating mechanism finely modulates compressed KV vectors with minimal additional computation (sketched after this list).
- Compared with MHA, EG-MLA cuts KV cache size by over 91.6% while maintaining performance.
- Relative to MLA, EG-MLA improves task accuracy while saving up to 59.9% additional memory.
- EG-MLA scales to over 1 billion parameters, demonstrating practical viability for large-scale LLM deployment.
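The gating idea can be sketched in a few lines of PyTorch: compress hidden states into a shared latent KV space as in MLA, then modulate the latent vector with a token-specific sigmoid gate. Dimensions and wiring are illustrative, not the paper's exact design.

```python
import torch
import torch.nn as nn

class EGLatentKV(nn.Module):
    """Sketch: shared latent KV compression plus token-specific gating."""
    def __init__(self, d_model=1024, d_latent=128):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)  # MLA-style compression
        self.gate = nn.Linear(d_model, d_latent)  # embedding-gated modulation

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model) -> cached latent: (batch, seq, d_latent)
        return self.down(h) * torch.sigmoid(self.gate(h))

# Only the small latent tensor is cached; per-head K/V are re-expanded from
# it at attention time, which is where the saving over MHA comes from.
kv = EGLatentKV()(torch.randn(2, 16, 1024))
print(kv.shape)  # torch.Size([2, 16, 128])
```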