⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-05-08 更新
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model
Authors:Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Lijiang Li, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, Ke Li, Rongrong Ji, Xing Sun
With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.
随着人类对自然人机交互的需求不断增长,基于语音的系统得到了越来越多的关注,因为语音是日常沟通中最常见的形式之一。然而,现有的语音模型在流模式下生成第一个音频令牌时仍然存在较高的延迟,这对部署构成了重大瓶颈。为了解决这一问题,我们提出了VITA-Audio,这是一种具有快速音频文本令牌生成功能的端到端大型语音模型。具体来说,我们引入了一个轻量级的跨模态令牌预测(MCTP)模块,该模块可以在单个模型前向传递过程中有效地生成多个音频令牌,这不仅加速了推理,而且显著降低了流场景中生成第一个音频的延迟。此外,我们探索了一种四阶段渐进式训练策略,以在尽可能减少语音质量损失的情况下实现模型加速。据我们所知,VITA-Audio是首个能够在第一次前向传递过程中生成音频输出的多模态大型语言模型,可实现实时对话功能,延迟极低。VITA-Audio可完全复制且只接受开源数据进行训练。实验结果表明,我们的模型在7B参数规模下实现了3~5倍的推理速度提升,并且在多个自动语音识别(ASR)、文本到语音(TTS)和语音问答(SQA)任务的基准测试中显著优于类似规模的开源模型。
论文及项目相关链接
PDF Training and Inference Codes: https://github.com/VITA-MLLM/VITA-Audio
Summary
本摘要介绍了随着人类计算机交互需求的增长,语音系统得到了越来越多的关注。为了解决现有语音模型在流式场景下生成首个音频令牌时的高延迟问题,提出了VITA-Audio端到端大型语音模型,具备快速音频文本令牌生成能力。通过引入轻量级的跨模态令牌预测模块,实现了一次前向传递中生成多个音频令牌,从而加速了推理并降低了延迟。此外,还探索了四阶段渐进训练策略,以在保持最小语音质量损失的情况下加速模型。VITA-Audio是首个能够在首次前向传递中产生音频输出的多模态大型语言模型,可实现实时对话功能并显著降低延迟。实验结果表明,该模型在参数规模为7B时实现了推理速度的3~5倍提升,并且在自动语音识别、文本转语音和语音问答任务上显著优于类似规模的开源模型。
Key Takeaways
- 随着人类对计算机交互的自然需求增长,语音系统备受关注。
- 现有语音模型在流式场景下生成首个音频令牌时存在高延迟问题。
- VITA-Audio是一个端到端的大型语音模型,旨在解决上述问题。
- VITA-Audio通过引入MCTP模块,实现了一次前向传递中生成多个音频令牌,加速了推理并降低了延迟。
- 采用四阶段渐进训练策略以加速模型,同时保持最小的语音质量损失。
- VITA-Audio是首个能够在首次前向传递中产生音频输出的多模态大型语言模型。
点此查看论文截图






WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch
Authors:Zimu Lu, Yunqiao Yang, Houxing Ren, Haotian Hou, Han Xiao, Ke Wang, Weikang Shi, Aojun Zhou, Mingjie Zhan, Hongsheng Li
LLM-based agents have demonstrated great potential in generating and managing code within complex codebases. In this paper, we introduce WebGen-Bench, a novel benchmark designed to measure an LLM-based agent’s ability to create multi-file website codebases from scratch. It contains diverse instructions for website generation, created through the combined efforts of human annotators and GPT-4o. These instructions span three major categories and thirteen minor categories, encompassing nearly all important types of web applications. To assess the quality of the generated websites, we use GPT-4o to generate test cases targeting each functionality described in the instructions, and then manually filter, adjust, and organize them to ensure accuracy, resulting in 647 test cases. Each test case specifies an operation to be performed on the website and the expected result after the operation. To automate testing and improve reproducibility, we employ a powerful web-navigation agent to execute tests on the generated websites and determine whether the observed responses align with the expected results. We evaluate three high-performance code-agent frameworks, Bolt.diy, OpenHands, and Aider, using multiple proprietary and open-source LLMs as engines. The best-performing combination, Bolt.diy powered by DeepSeek-R1, achieves only 27.8% accuracy on the test cases, highlighting the challenging nature of our benchmark. Additionally, we construct WebGen-Instruct, a training set consisting of 6,667 website-generation instructions. Training Qwen2.5-Coder-32B-Instruct on Bolt.diy trajectories generated from a subset of this training set achieves an accuracy of 38.2%, surpassing the performance of the best proprietary model.
基于LLM的代理在复杂的代码库中生成和管理代码方面表现出了巨大的潜力。在本文中,我们介绍了WebGen-Bench,这是一个新的基准测试,旨在衡量基于LLM的代理从零开始创建多文件网站代码库的能力。它包含了多样化的网站生成指令,这些指令是通过人类注释者和GPT-4o的共同努力创建的。这些指令涵盖了三个主要类别和十三个次要类别,涵盖了几乎所有重要类型的web应用程序。
论文及项目相关链接
摘要
LLM模型在生成和管理复杂代码库中的代码方面展现出巨大潜力。本文介绍了一种新型基准测试WebGen-Bench,旨在评估LLM模型创建多文件网站代码库的能力。该基准测试包含丰富多样的网站生成指令,这些指令由人类注释器和GPT-4o共同创建,涵盖三大类别和十三个小类别,囊括几乎所有重要类型的web应用程序。为评估生成的网站质量,我们使用GPT-4o针对指令生成测试用例,然后进行人工过滤、调整和整理,确保准确性,最终生成647个测试用例。我们还采用强大的网络导航代理执行测试,以自动化测试和提高结果的重复性。本文对三个高性能代码代理框架Bolt.diy、OpenHands和Aider进行了评估,使用多个专有和开源LLM作为引擎。表现最佳的组合是DeepSeek-R1驱动的Bolt.diy,在测试用例上仅达到27.8%的准确率,凸显了基准测试的艰巨性。此外,我们还构建了WebGen-Instruct训练集,包含6667条网站生成指令。使用Bolt.diy从部分训练集中生成的轨迹训练Qwen2.5-Coder-32B-Instruct模型,准确率达到了38.2%,超过了最佳专有模型的性能。
关键见解
- LLM模型在生成多文件网站代码库方面表现出巨大潜力。
- 引入WebGen-Bench基准测试,用于评估LLM模型创建网站代码的能力。
- WebGen-Bench包含由人类注释器和GPT-4o共同创建的丰富多样的网站生成指令。
- 使用GPT-4o生成测试用例,并结合人工过滤、调整和整理,确保准确性。
- 自动化测试通过强大的网络导航代理执行,提高结果的重复性。
- 评估了三个高性能代码代理框架,表现最佳的组合是DeepSeek-R1驱动的Bolt.diy,但准确率仅为27.8%。
点此查看论文截图





PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing
Authors:Yiping Xie, Bo Zhao, Mingtong Dai, Jian-Ping Zhou, Yue Sun, Tao Tan, Weicheng Xie, Linlin Shen, Zitong Yu
Remote photoplethysmography (rPPG) enables non-contact physiological measurement but remains highly susceptible to illumination changes, motion artifacts, and limited temporal modeling. Large Language Models (LLMs) excel at capturing long-range dependencies, offering a potential solution but struggle with the continuous, noise-sensitive nature of rPPG signals due to their text-centric design. To bridge this gap, we introduce PhysLLM, a collaborative optimization framework that synergizes LLMs with domain-specific rPPG components. Specifically, the Text Prototype Guidance (TPG) strategy is proposed to establish cross-modal alignment by projecting hemodynamic features into LLM-interpretable semantic space, effectively bridging the representational gap between physiological signals and linguistic tokens. Besides, a novel Dual-Domain Stationary (DDS) Algorithm is proposed for resolving signal instability through adaptive time-frequency domain feature re-weighting. Finally, rPPG task-specific cues systematically inject physiological priors through physiological statistics, environmental contextual answering, and task description, leveraging cross-modal learning to integrate both visual and textual information, enabling dynamic adaptation to challenging scenarios like variable illumination and subject movements. Evaluation on four benchmark datasets, PhysLLM achieves state-of-the-art accuracy and robustness, demonstrating superior generalization across lighting variations and motion scenarios.
远程光体积描记术(rPPG)能够实现非接触式生理测量,但仍受到光照变化、运动伪影和有限时间建模的高度影响。大型语言模型(LLM)擅长捕捉长距离依赖关系,提供潜在解决方案,但由于其以文本为中心的设计,它们在处理rPPG信号的连续性和对噪声的敏感性方面遇到困难。为了弥补这一差距,我们引入了PhysLLM,这是一个协同优化框架,旨在将LLM与特定领域的rPPG组件相结合。具体来说,提出了Text Prototype Guidance(TPG)策略,通过建立血流动力学特征与LLM可解释语义空间的跨模态对齐,有效地弥合了生理信号与语言符号之间的代表性差距。此外,还提出了一种新颖的Dual-Domain Stationary(DDS)算法,通过自适应时频域特征重新加权来解决信号不稳定的问题。最后,通过生理统计、环境上下文回答和任务描述等rPPG任务特定线索,利用跨模态学习整合视觉和文本信息,实现动态适应光照变化和主体移动等挑战场景。在四个基准数据集上的评估表明,PhysLLM达到了最先进的准确性和稳健性,在照明变化和运动场景中显示出优越的泛化能力。
论文及项目相关链接
Summary
远程光容积脉搏波成像(rPPG)能实现非接触生理测量,但易受光照变化、运动干扰影响,且缺乏长时间动态建模能力。大型语言模型(LLM)虽擅长捕捉长时依赖关系,但因其文本设计中心特性,对连续且噪声敏感的rPPG信号处理能力不足。为此,本研究提出一种融合LLM与rPPG技术优势的协同优化框架——PhysLLM。通过文本原型引导策略(TPG)建立跨模态对齐,将血流动力学特征映射至LLM可理解语义空间;采用双域稳定算法(DDS)解决信号不稳定问题,自适应调整时频域特征权重。此外,通过生理统计、环境上下文理解及任务描述等特定提示来增强模型的泛化能力。在四个基准数据集上的评估表明,PhysLLM达到了最新的精确度与稳健性标准,展现了在光照变化和动态场景中的优越表现。
Key Takeaways
- 远程光容积脉搏波成像(rPPG)面临光照变化、运动干扰和有限时间建模的挑战。
- 大型语言模型(LLM)具有捕捉长时依赖关系的优势,但对处理连续且噪声敏感的rPPG信号存在困难。
- 引入PhysLLM框架,结合LLM和rPPG技术,以优化处理生理信号。
- 采用文本原型引导策略(TPG)建立跨模态对齐,将生理信号转化为语言模型可理解的语义信息。
- 提出双域稳定算法(DDS)解决信号不稳定问题,实现自适应特征权重调整。
- 通过生理统计、环境上下文和任务描述等特定提示增强模型的泛化能力。
点此查看论文截图





DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes
Authors:Sergey Linok, Vadim Semenov, Anastasia Trunova, Oleg Bulichev, Dmitry Yudin
The analysis of events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly utilize visual models. However, these methods often capture information implicitly from images, lacking interpretable spatial-temporal object representations. To address this issue we introduce DyGEnc - a novel method for Encoding a Dynamic Graph. This method integrates compressed spatial-temporal structural observation representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on a sequence of textual scene graphs. Extended evaluations on the STAR and AGQA datasets indicate that DyGEnc outperforms existing visual methods by a large margin of 15-25% in addressing queries regarding the history of human-to-object interactions. Furthermore, the proposed method can be seamlessly extended to process raw input images utilizing foundational models for extracting explicit textual scene graphs, as substantiated by the results of a robotic experiment conducted with a wheeled manipulator platform. We hope that these findings will contribute to the implementation of robust and compressed graph-based robotic memory for long-horizon reasoning. Code is available at github.com/linukc/DyGEnc.
对动态环境中事件的分析,在开发能够与人类互动的智能代理和机器人方面,构成了根本性的挑战。目前的方法主要使用视觉模型。然而,这些方法通常从图像中隐式地获取信息,缺乏可解释的时空对象表示。为了解决这一问题,我们引入了DyGEnc——一种新型的动态图编码方法。该方法将压缩的时空结构观测表示与大型语言模型的认知能力相结合。这种结合的目的是根据一系列文本场景图进行高级问答。在STAR和AGQA数据集上的扩展评估表明,DyGEnc在解决有关人与对象交互历史的问题时,在性能上大大超越了现有视觉方法,差距达15-25%。此外,所提出的方法可以无缝扩展到处理原始输入图像,利用基础模型提取明确的文本场景图,正如在轮式操作平台机器人实验的结果所证实的那样。我们希望这些发现将有助于实现用于长期推理的稳健且压缩的图基机器人记忆。代码可在github.com/linukc/DyGEnc找到。
论文及项目相关链接
PDF 8 pages, 5 figures, 6 tables
Summary
在动态环境中分析事件对发展智能交互机器人提出了重大挑战。当前主流方法主要依赖视觉模型,但缺乏可解释的空间时间物体表示。为解决这个问题,本文提出DyGEnc方法,集成压缩的空间时间结构观测表示与大型语言模型的认知能力,以支持基于文本场景图的先进问答。在STAR和AGQA数据集上的评估显示,DyGEnc在解决有关人与物体交互历史的问题时,较现有视觉方法的性能提升幅度达15-25%。此外,该方法可轻松扩展到处理原始图像,并利用基础模型提取明确的文本场景图,如机器人实验所示。本文希望这些发现有助于实现用于长期推理的稳健和压缩图基机器人记忆。
Key Takeaways
- 动态环境中分析事件对智能机器人与人类交互的发展构成挑战。
- 当前方法主要依赖视觉模型,但缺乏可解释的空间时间物体表示。
- DyGEnc方法集成了压缩的空间时间结构观测表示与大型语言模型的认知能力。
- DyGEnc在解决有关人与物体交互历史的问题时表现出优异性能,较现有视觉方法的性能提升幅度达15-25%。
- DyGEnc可轻松扩展到处理原始图像,并能够通过基础模型提取文本场景图。
- 机器人实验证实了DyGEnc的有效性。
点此查看论文截图








LlamaFirewall: An open source guardrail system for building secure AI agents
Authors:Sahana Chennabasappa, Cyrus Nikolaidis, Daniel Song, David Molnar, Stephanie Ding, Shengye Wan, Spencer Whitman, Lauren Deason, Nicholas Doucette, Abraham Montilla, Alekhya Gampa, Beto de Paola, Dominik Gabi, James Crnkovich, Jean-Christophe Testud, Kat He, Rashnil Chaturvedi, Wu Zhou, Joshua Saxe
Large language models (LLMs) have evolved from simple chatbots into autonomous agents capable of performing complex tasks such as editing production code, orchestrating workflows, and taking higher-stakes actions based on untrusted inputs like webpages and emails. These capabilities introduce new security risks that existing security measures, such as model fine-tuning or chatbot-focused guardrails, do not fully address. Given the higher stakes and the absence of deterministic solutions to mitigate these risks, there is a critical need for a real-time guardrail monitor to serve as a final layer of defense, and support system level, use case specific safety policy definition and enforcement. We introduce LlamaFirewall, an open-source security focused guardrail framework designed to serve as a final layer of defense against security risks associated with AI Agents. Our framework mitigates risks such as prompt injection, agent misalignment, and insecure code risks through three powerful guardrails: PromptGuard 2, a universal jailbreak detector that demonstrates clear state of the art performance; Agent Alignment Checks, a chain-of-thought auditor that inspects agent reasoning for prompt injection and goal misalignment, which, while still experimental, shows stronger efficacy at preventing indirect injections in general scenarios than previously proposed approaches; and CodeShield, an online static analysis engine that is both fast and extensible, aimed at preventing the generation of insecure or dangerous code by coding agents. Additionally, we include easy-to-use customizable scanners that make it possible for any developer who can write a regular expression or an LLM prompt to quickly update an agent’s security guardrails.
大型语言模型(LLM)已经从简单的聊天机器人进化成能够执行复杂任务的自主代理,如编辑生产代码、协调工作流程和基于不可信的输入(如网页和电子邮件)采取更高风险的行动。这些功能引入了新的安全风险,现有的安全措施,如模型微调或针对聊天机器人的防护栏,并不能完全解决这些问题。考虑到风险较高且缺乏确定性的解决方案来减轻这些风险,实时防护栏监控器作为最后一道防线至关重要,并且需要支持系统级别、针对用例的安全政策定义和执行。我们推出了LlamaFirewall,这是一个开源的安全防护栏框架,旨在作为对抗与人工智能代理相关的安全风险的最后防线。我们的框架通过三个强大的防护栏来减轻风险:PromptGuard 2,一种通用越狱检测器,表现出卓越的性能;Agent Alignment Checks,一种思维链审计员,检查代理推理是否存在提示注入和目标错位,尽管它仍然处于实验阶段,但在一般场景中防止间接注入方面显示出比以前提出的方法更强的有效性;以及CodeShield,一个快速且可扩展的在线静态分析引擎,旨在防止编码代理生成不安全或危险的代码。此外,我们还包括易于使用且可定制化的扫描器,使任何能够编写正则表达式或LLM提示的开发人员都可以快速更新代理的安全防护栏。
论文及项目相关链接
摘要
大型语言模型(LLM)已从简单的聊天机器人进化为能够执行复杂任务的自主代理,如编辑生产代码、协调工作流程和根据不可信的输入(如网页和电子邮件)采取高风险行动。这些功能引入了新的安全风险,现有的安全措施(如模型微调或针对聊天机器人的防护栏)并未完全解决这些问题。鉴于风险较高且缺乏确定的解决方案来减轻这些风险,急需一种实时防护栏监控器作为最后一层防御,并支持系统级别、用例特定的安全策略定义和执行。我们引入了LlamaFirewall,这是一个专注于安全的防护栏框架,旨在作为对抗与AI代理相关的安全风险的最后防线。我们的框架通过三个强大的防护栏来减轻风险,包括PromptGuard 2通用越狱检测器、Agent Alignment Checks思维链审计器和CodeShield在线静态分析引擎,分别用于防止提示注入、代理对齐问题和代理生成的不安全或危险代码。此外,我们还包含了易于使用的可定制扫描器,使得任何能够编写正则表达式或LLM提示的开发者都能快速更新代理的安全防护栏。
关键见解
- LLMs已从简单聊天机器人进化为能执行复杂任务的自主代理。
- LLMs的新功能引入了未被现有安全措施充分解决的安全风险。
- 急需一种实时防护栏监控器作为对抗与AI代理相关的安全风险的最后防线。
- LlamaFirewall是一个专注于安全的防护栏框架,旨在服务这一需求。
- LlamaFirewall包含三个强大的防护栏:PromptGuard 2、Agent Alignment Checks和CodeShield。
- PromptGuard 2可防止提示注入等安全风险。
点此查看论文截图




A Comprehensive Survey of Large AI Models for Future Communications: Foundations, Applications and Challenges
Authors:Feibo Jiang, Cunhua Pan, Li Dong, Kezhi Wang, Merouane Debbah, Dusit Niyato, Zhu Han
The 6G wireless communications aim to establish an intelligent world of ubiquitous connectivity, providing an unprecedented communication experience. Large artificial intelligence models (LAMs) are characterized by significantly larger scales (e.g., billions or trillions of parameters) compared to typical artificial intelligence (AI) models. LAMs exhibit outstanding cognitive abilities, including strong generalization capabilities for fine-tuning to downstream tasks, and emergent capabilities to handle tasks unseen during training. Therefore, LAMs efficiently provide AI services for diverse communication applications, making them crucial tools for addressing complex challenges in future wireless communication systems. This study provides a comprehensive review of the foundations, applications, and challenges of LAMs in communication. First, we introduce the current state of AI-based communication systems, emphasizing the motivation behind integrating LAMs into communications and summarizing the key contributions. We then present an overview of the essential concepts of LAMs in communication. This includes an introduction to the main architectures of LAMs, such as transformer, diffusion models, and mamba. We also explore the classification of LAMs, including large language models (LLMs), large vision models (LVMs), large multimodal models (LMMs), and world models, and examine their potential applications in communication. Additionally, we cover the training methods and evaluation techniques for LAMs in communication systems. Lastly, we introduce optimization strategies such as chain of thought (CoT), retrieval augmented generation (RAG), and agentic systems. Following this, we discuss the research advancements of LAMs across various communication scenarios. Finally, we analyze the challenges in the current research and provide insights into potential future research directions.
6G无线通信旨在建立一个无处不在的智能互联世界,提供前所未有的通信体验。大型人工智能模型(LAMs)与典型的人工智能(AI)模型相比,具有显著更大的规模(例如,数十亿或数万亿参数)。LAMs表现出卓越的认知能力,包括强大的泛化能力以适应下游任务微调,以及处理训练期间未见任务的突发能力。因此,LAMs为各种通信应用程序有效地提供AI服务,成为解决未来无线通信系统中复杂挑战的关键工具。本研究全面回顾了通信中LAMs的基础、应用和挑战。首先,我们介绍了基于人工智能的通信系统的当前状态,强调了将LAMs集成到通信中的动机,并总结了关键贡献。然后,我们概述了通信中LAMs的基本概念。这包括介绍LAMs的主要架构,如变压器、扩散模型和玛姆巴。我们还探讨了LAMs的分类,包括大型语言模型(LLMs)、大型视觉模型(LVMs)、大型多模态模型(LMMs)和世界模型,并研究它们在通信中的潜在应用。此外,我们还介绍了通信系统中LAMs的训练方法和评估技术。之后,我们介绍了优化策略,如思维链(CoT)、检索增强生成(RAG)和代理系统。然后,我们讨论了LAMs在各种通信场景中的研究进展。最后,我们分析了当前研究的挑战,并提供了对未来研究方向的洞察。
论文及项目相关链接
Summary
6G无线通信追求建立普遍互联的智能世界,提供前所未有的通信体验。大型人工智能模型(LAMs)具有突出的认知能力,包括强大的泛化能力和处理未见任务的涌现能力,为各种通信应用提供高效的人工智能服务,对解决未来无线通信系统的复杂挑战至关重要。本文全面回顾了LAMs在通信领域的基础、应用和挑战。介绍了人工智能通信系统的当前状态,概述了将LAMs集成到通信中的动机和关键贡献。探讨了LAMs的主要架构、分类、训练方法和评估技术,并讨论了其在各种通信场景中的研究进展。
Key Takeaways
- 6G无线通信追求建立普遍互联的智能世界,提供前所未有的通信体验。
- 大型人工智能模型(LAMs)具有显著更大的规模,展现出强大的认知能力和泛化性能。
- LAMs包括大型语言模型(LLMs)、大型视觉模型(LVMs)、大型多模态模型(LMMs)和世界模型等分类。
- LAMs在通信系统中具有广泛的应用潜力,如智能通信、数据处理和自动化控制等。
- 当前的研究进展涵盖了LAMs的架构、分类、训练方法和评估技术等方面。
- 挑战在于解决LAMs的复杂性、计算资源和隐私保护等问题。
点此查看论文截图





am-ELO: A Stable Framework for Arena-based LLM Evaluation
Authors:Zirui Liu, Jiatong Li, Yan Zhuang, Qi Liu, Shuanghong Shen, Jie Ouyang, Mingyue Cheng, Shijin Wang
Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing framework based on ELO rating system suffers from the inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework to address these issues by enhancing the ELO Rating System. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we proposed the am-ELO, which modify the Elo Rating’s probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.
基于战场的评估是现代人工智能模型,尤其是大型语言模型(LLM)的基本且重要的评估范式。基于ELO评级系统的现有框架由于排名不一致和忽视评估者能力差异而面临不可避免的稳定性问题。在本文中,我们通过增强ELO评级系统,引入了一种新的稳定战场框架来解决这些问题。具体来说,我们用最大似然估计(MLE)方法替换迭代更新方法,称为m-ELO,并为模型排名的MLE方法的一致性和稳定性提供了理论证明。此外,我们提出了融入评估者能力的am-ELO,能够同时估计模型得分和评估者可靠性。实验证明,该方法确保了稳定性,证明该框架为LLM提供了更稳健、准确和稳定的评估方法。
论文及项目相关链接
PDF ICML2025 Accepted
Summary
基于竞技场的评估是现代人工智能模型,尤其是大型语言模型(LLM)的基本而重要的评估范式。现有基于ELO评分系统的框架存在排名不一致和忽视评估者能力差异等不稳定问题。本文引入了一种新型稳定的竞技场框架来解决这些问题,通过增强ELO评分系统,采用最大似然估计(MLE)方法替代迭代更新方法,提出m-ELO,并从理论上证明了MLE方法在模型排名中的一致性和稳定性。此外,还提出了结合评估者能力的am-ELO,可以同时估计模型得分和评估者可靠性。实验证明该方法具有稳定性,证明该框架为LLM提供了更稳健、准确和稳定的评估方法。
Key Takeaways
- 竞技场评估是现代AI模型的重要评估方式,尤其是LLM。
- 基于ELO评分系统的现有框架存在不稳定问题,如排名不一致和忽视评估者能力差异。
- 新型稳定竞技场框架通过增强ELO评分系统来解决这些问题。
- 采用最大似然估计(MLE)方法替代迭代更新方法,提出m-ELO。
- m-ELO具有一致性和稳定性,适用于模型排名。
- am-ELO结合评估者能力,能同时估计模型得分和评估者可靠性。
点此查看论文截图







Evaluation of LLMs on Long-tail Entity Linking in Historical Documents
Authors:Marta Boscariol, Luana Bulla, Lia Draetta, Beatrice Fiumanò, Emanuele Lenzi, Leonardo Piano
Entity Linking (EL) plays a crucial role in Natural Language Processing (NLP) applications, enabling the disambiguation of entity mentions by linking them to their corresponding entries in a reference knowledge base (KB). Thanks to their deep contextual understanding capabilities, LLMs offer a new perspective to tackle EL, promising better results than traditional methods. Despite the impressive generalization capabilities of LLMs, linking less popular, long-tail entities remains challenging as these entities are often underrepresented in training data and knowledge bases. Furthermore, the long-tail EL task is an understudied problem, and limited studies address it with LLMs. In the present work, we assess the performance of two popular LLMs, GPT and LLama3, in a long-tail entity linking scenario. Using MHERCL v0.1, a manually annotated benchmark of sentences from domain-specific historical texts, we quantitatively compare the performance of LLMs in identifying and linking entities to their corresponding Wikidata entries against that of ReLiK, a state-of-the-art Entity Linking and Relation Extraction framework. Our preliminary experiments reveal that LLMs perform encouragingly well in long-tail EL, indicating that this technology can be a valuable adjunct in filling the gap between head and long-tail EL.
实体链接(EL)在自然语言处理(NLP)应用中扮演着至关重要的角色,它通过将在文本中提及的实体链接到参考知识库(KB)中的相应条目来实现实体消歧。由于大型语言模型(LLM)具备深厚的上下文理解能力,它们为处理EL提供了新的视角,并有望带来比传统方法更好的结果。尽管LLM的泛化能力令人印象深刻,但链接不太流行、长尾实体仍然是一个挑战,因为这些实体在训练数据和知识库中往往表示不足。此外,长尾EL任务是一个尚未被深入研究的问题,只有有限的研究使用LLM来解决。在目前的工作中,我们评估了两种流行的大型语言模型GPT和LLama3在长尾实体链接场景中的性能。我们使用MHERCL v0.1作为基准测试集,该测试集包含来自特定领域历史文本的句子进行手动标注。我们定量比较了LLM在识别实体并将其链接到相应Wikidata条目的性能与最先进的关系抽取框架ReLiK的性能。我们的初步实验表明,大型语言模型在长尾EL方面表现良好,这表明这项技术可以很好地填补头部和长尾EL之间的差距。
论文及项目相关链接
Summary
基于大型语言模型(LLM)的自然语言处理(NLP)应用中,实体链接(EL)起着至关重要的作用。通过将提及的实体链接到参考知识库中的相应条目来实现歧义消解。LLM凭借其对上下文的深入理解展现了其在解决EL问题上优于传统方法的潜力。然而,对长尾实体的链接仍然存在挑战,因为它们在训练数据和知识库中常常被低估。本文评估了GPT和LLama3两种流行LLM在处理长尾实体链接场景中的性能。使用来自特定领域历史文本的MHERCL v0.1手动注释基准数据集进行定量比较,发现LLM在识别并将实体链接到其相应的Wikidata条目方面表现良好,表现出这一技术在缩小头部和长尾EL差距中的潜力。
Key Takeaways
- LLM在自然语言处理中扮演重要角色,特别是在实体链接任务上展现出超越传统方法的潜力。
- LLM能够理解语境深处的含义,从而提高实体链接的准确性。
- 长尾实体链接仍然存在挑战,这些实体在训练数据和知识库中常常被低估或缺失。
- 使用MHERCL v0.1数据集评估了GPT和LLama3在长尾实体链接任务上的性能表现。
- LLM在处理长尾实体链接方面表现出良好性能,有潜力缩小头部实体与长尾实体之间的链接差距。
点此查看论文截图




Image Recognition with Online Lightweight Vision Transformer: A Survey
Authors:Zherui Zhang, Rongtao Xu, Jie Zhou, Changwei Wang, Xingtian Pei, Wenhao Xu, Jiguang Zhang, Li Guo, Longxiang Gao, Wenbo Xu, Shibiao Xu
The Transformer architecture has achieved significant success in natural language processing, motivating its adaptation to computer vision tasks. Unlike convolutional neural networks, vision transformers inherently capture long-range dependencies and enable parallel processing, yet lack inductive biases and efficiency benefits, facing significant computational and memory challenges that limit its real-world applicability. This paper surveys various online strategies for generating lightweight vision transformers for image recognition, focusing on three key areas: Efficient Component Design, Dynamic Network, and Knowledge Distillation. We evaluate the relevant exploration for each topic on the ImageNet-1K benchmark, analyzing trade-offs among precision, parameters, throughput, and more to highlight their respective advantages, disadvantages, and flexibility. Finally, we propose future research directions and potential challenges in the lightweighting of vision transformers with the aim of inspiring further exploration and providing practical guidance for the community. Project Page: https://github.com/ajxklo/Lightweight-VIT
Transformer架构在自然语言处理领域取得了巨大成功,这促使其适应计算机视觉任务。与卷积神经网络不同,视觉Transformer天生就能捕捉长距离依赖关系并实现并行处理,但缺乏归纳偏见和效率优势,面临着计算和内存方面的挑战,这些挑战限制了其在现实世界中的应用。本文调查了在线生成轻量级视觉Transformer进行图像识别的各种策略,重点关注三个关键领域:高效组件设计、动态网络和知识蒸馏。我们在ImageNet-1K基准测试上对每个主题进行了相关探索,分析了精度、参数、吞吐量等之间的权衡,以突出各自的优势、劣势和灵活性。最后,我们提出了轻量级视觉Transformer的未来研究方向和潜在挑战,旨在激发进一步探索,为社区提供实用指导。项目页面:https://github.com/ajxklo/Lightweight-VIT
论文及项目相关链接
Summary
本文探讨了Transformer架构在计算机视觉任务中的应用,并指出了其面临的挑战。文章主要介绍了针对图像识别的轻量化愿景转换器生成的各种在线策略,重点介绍了高效组件设计、动态网络和知识蒸馏三个关键领域。在ImageNet-1K基准测试上评估了相关探索的优缺点和灵活性。文章旨在为社区提供进一步的探索灵感和实践指导。
Key Takeaways
- Transformer架构在计算机视觉任务中取得了显著成功,具有捕获长距离依赖性和并行处理的能力。
- 与卷积神经网络相比,愿景转换器缺乏归纳偏见和效率优势,面临计算和内存挑战,限制了其在现实世界中的应用。
- 文章介绍了针对图像识别的轻量化愿景转换器的在线策略,包括高效组件设计、动态网络和知识蒸馏三个关键领域。
- 文章评估了这些策略在ImageNet-1K基准测试上的性能,包括精度、参数、吞吐量和权衡方面的优缺点和灵活性。
- 通过评估各种策略,文章旨在提供对计算机视觉任务中轻量化愿景转换器实用性的深入理解。
- 文章指出了未来研究方向和潜在挑战,旨在为社区提供进一步的探索灵感和实践指导。
点此查看论文截图






The Art of Repair: Optimizing Iterative Program Repair with Instruction-Tuned Models
Authors:Fernando Vallecillos Ruiz, Max Hort, Leon Moonen
Automatic program repair (APR) aims to reduce the manual efforts required to identify and fix errors in source code. Before the rise of LLM-based agents, a common strategy was to increase the number of generated patches, sometimes to the thousands, to achieve better repair results on benchmarks. More recently, self-iterative capabilities enabled LLMs to refine patches over multiple rounds guided by feedback. However, literature often focuses on many iterations and disregards different numbers of outputs. We investigate an APR pipeline that balances these two approaches, the generation of multiple outputs and multiple rounds of iteration, while imposing a limit of 10 total patches per bug. We apply three SOTA instruction-tuned LLMs - DeepSeekCoder-Instruct, Codellama-Instruct, Llama3.1-Instruct - to the APR task. We further fine-tune each model on an APR dataset with three sizes (1K, 30K, 65K) and two techniques (Full Fine-Tuning and LoRA), allowing us to assess their repair capabilities on two APR benchmarks: HumanEval-Java and Defects4J. Our results show that by using only a fraction (<1%) of the fine-tuning dataset, we can achieve improvements of up to 78% in the number of plausible patches generated, challenging prior studies that reported limited gains using Full Fine-Tuning. However, we find that exceeding certain thresholds leads to diminishing outcomes, likely due to overfitting. Moreover, we show that base models greatly benefit from creating patches in an iterative fashion rather than generating them all at once. In addition, the benefit of iterative strategies becomes more pronounced in complex benchmarks. Even fine-tuned models, while benefiting less from iterations, still gain advantages, particularly on complex benchmarks. The research underscores the need for balanced APR strategies that combine multi-output generation and iterative refinement.
自动程序修复(APR)旨在减少识别和修复源代码中的错误所需的人工操作。在基于LLM的代理出现之前,一种常见策略是通过生成大量的补丁(有时高达数千个)来提高基准测试中的修复效果。最近,自我迭代能力使LLM能够在反馈的指导下通过多轮迭代来完善补丁。然而,文献通常关注多次迭代而忽略了不同数量的输出。我们研究了一种平衡这两种方法的APR管道,即生成多个输出和多次迭代,同时限制每个bug的总补丁数量为10。我们将三个最先进的指令调整LLM——DeepSeekCoder-Instruct、Codellama-Instruct、Llama3.1-Instruct——应用于APR任务。我们还对每个模型进行了APR数据集的三种规模(1K、30K、65K)以及两种技术(完全微调(Full Fine-Tuning)和LoRA)的微调,这使我们能够在两个APR基准测试(HumanEval-Java和Defects4J)上评估它们的修复能力。我们的结果表明,仅使用一小部分(<1%)的微调数据集,我们就可以提高生成合理补丁的数量,达到高达78%,这挑战了先前报告使用Full Fine-Tuning收益有限的研究。然而,我们发现超过某个阈值会导致结果减少,这可能是由于过度拟合导致的。此外,我们的结果表明基础模型在采用迭代方式创建补丁时受益匪浅,而不是一次性生成所有补丁。而且,在复杂的基准测试中,迭代策略的优势变得更加突出。虽然经过调整的模型从迭代中获益较少,但仍具有优势,特别是在复杂的基准测试中。该研究强调了需要平衡的APR策略,将多输出生成和迭代改进结合起来。
论文及项目相关链接
PDF Accepted for publication in the research track of the 29th International Conference on Evaluation and Assessment in Software Engineering (EASE), 17-20 June 2025, Istanbul, T"urkiye
摘要
自动程序修复(APR)旨在减少手动识别和修复源代码中错误的所需努力。随着大语言模型(LLM)的崛起,自我迭代能力使LLM能够在反馈的指导下对补丁进行多次迭代优化。本文研究了平衡生成多个输出和多次迭代的APR管道,同时限制每个错误的补丁总数为10个。本文应用了三种先进的指令微调LLM——DeepSeekCoder-Instruct、Codellama-Instruct和Llama3.1-Instruct来完成APR任务。进一步对每种模型进行APR数据集的不同规模(1K、30K、65K)和两种技术(全微调LoRA)的微调,使我们能够在两个APR基准测试套件HumanEval-Java和Defects4J上评估其修复能力。结果表明,仅使用一小部分(<1%)的微调数据集,我们就可以提高高达78%的可行补丁生成数量,挑战了先前报告使用全微调收益有限的研究。然而我们发现超出一定的阈值会导致收益递减,这可能是由于过度拟合造成的。此外,我们还表明基础模型通过迭代创建补丁的方式比一次性生成所有补丁的方式受益匪浅。即使在复杂的基准测试中,迭代策略的优势也变得更加突出。虽然微调模型从迭代中获益较少,但仍具有优势,特别是在复杂的基准测试中。该研究强调了需要平衡的APR策略,结合多输出生成和迭代优化。
关键见解
- 自动程序修复旨在减少修复源代码错误的手动努力。
- LLM的自我迭代能力使补丁优化成为可能。
- 研究了结合生成多个输出和多次迭代的APR策略,限制每个错误的补丁总数为10个。
- 使用先进的指令微调LLM完成APR任务。
- 使用较小的微调数据集即可显著提高补丁生成的质量。
- 迭代策略对基础模型和微调模型都有益,特别是在复杂基准测试中。
点此查看论文截图




Snakemaker: Seamlessly transforming ad-hoc analyses into sustainable Snakemake workflows with generative AI
Authors:Marco Masera, Alessandro Leone, Johannes Köster, Ivan Molineris
Reproducibility and sustainability present significant challenges in bioinformatics software development, where rapidly evolving tools and complex workflows often result in short-lived or difficult-to-adapt pipelines. This paper introduces Snakemaker, a tool that leverages generative AI to facilitate researchers build sustainable data analysis pipelines by converting unstructured code into well-defined Snakemake workflows. Snakemaker non-invasively tracks the work performed in the terminal by the researcher, analyzes execution patterns, and generates Snakemake workflows that can be integrated into existing pipelines. Snakemaker also supports the transformation of monolithic Ipython Notebooks into modular Snakemake pipelines, resolving the global state of the notebook into discrete, file-based interactions between rules. An integrated chat assistant provides users with fine-grained control through natural language instructions. Snakemaker generates high-quality Snakemake workflows by adhering to the best practices, including Conda environment tracking, generic rule generation and loop unrolling. By lowering the barrier between prototype and production-quality code, Snakemaker addresses a critical gap in computational reproducibility for bioinformatics research.
在生物信息学软件开发中,再现性和可持续性构成了重大挑战。在这里,快速演变的工具和复杂的工作流程经常导致短期存在或难以适应的管道。本文介绍了Snakemaker,它是一款利用生成人工智能的工具,帮助研究人员通过将从结构化的代码中生成明确的Snakemake工作流,建立可持续的数据分析管道。Snakemaker非侵入性地追踪研究人员在终端执行的工作,分析执行模式,并生成可以集成到现有管道中的Snakemake工作流。Snakemaker还支持将庞大的Ipython笔记本转换为模块化的Snakemake管道,解决笔记本全局状态的问题,将其转变为规则之间离散、基于文件的交互。一个集成的聊天助手通过自然语言指令为用户提供精细粒度的控制。Snakemaker通过遵循最佳实践生成高质量的Snakemake工作流,包括Conda环境跟踪、通用规则生成和循环展开。通过降低原型和生产质量代码之间的障碍,Snakemaker解决了生物信息学研究中计算再现性的关键差距。
论文及项目相关链接
Summary
蛇形流水线工具(Snakemaker)引入了一种基于生成式人工智能的方法,通过把非结构化的代码转化为定义的蛇形流水线工作流来提升数据管道研究的可持续性。该工具以非侵入的方式跟踪研究人员在终端中的工作表现,分析执行模式并生成可以集成到现有流水线中的蛇形流水线工作流。此外,它支持将大型Ipython笔记本转化为模块化的蛇形流水线,并通过精细粒度的自然语言指令控制用户交互。Snakemaker遵循最佳实践生成高质量的蛇形流水线,包括Conda环境跟踪、通用规则生成和循环展开等。该工具解决了从原型到生产质量代码的障碍,填补了生物信息学研究中的计算可重复性的关键空白。
Key Takeaways
- Snakemaker利用生成式人工智能将非结构化代码转化为蛇形流水线工作流。
- Snakemaker以非侵入方式跟踪终端工作表现并分析执行模式。
- Snakemaker支持将大型Ipython笔记本转化为模块化的蛇形流水线。
- Snakemaker解决了生物信息学软件开发的可持续性问题。
- 通过自然语言指令提供精细粒度的控制。
- Snakemaker遵循最佳实践生成高质量的蛇形流水线工作流。
点此查看论文截图




ReplaceMe: Network Simplification via Layer Pruning and Linear Transformations
Authors:Dmitriy Shopkhoev, Ammar Ali, Magauiya Zhussip, Valentin Malykh, Stamatios Lefkimmiatis, Nikos Komodakis, Sergey Zagoruyko
We introduce ReplaceMe, a generalized training-free depth pruning method that effectively replaces transformer blocks with a linear operation, while maintaining high performance for low compression ratios. In contrast to conventional pruning approaches that require additional training or fine-tuning, our approach requires only a small calibration dataset that is used to estimate a linear transformation to approximate the pruned blocks. This estimated linear mapping can be seamlessly merged with the remaining transformer blocks, eliminating the need for any additional network parameters. Our experiments show that ReplaceMe consistently outperforms other training-free approaches and remains highly competitive with state-of-the-art pruning methods that involve extensive retraining/fine-tuning and architectural modifications. Applied to several large language models (LLMs), ReplaceMe achieves up to 25% pruning while retaining approximately 90% of the original model’s performance on open benchmarks - without any training or healing steps, resulting in minimal computational overhead (see Fig.1). We provide an open-source library implementing ReplaceMe alongside several state-of-the-art depth pruning techniques, available at this repository.
我们介绍了ReplaceMe,这是一种通用的无训练深度剪枝方法,它可以通过线性运算有效地替换transformer块,同时在低压缩比的情况下保持高性能。与传统的需要额外训练或微调的剪枝方法不同,我们的方法只需要一个小型校准数据集,用于估计线性变换以近似剪枝块。估计的线性映射可以无缝地与其他transformer块合并,无需任何额外的网络参数。我们的实验表明,ReplaceMe始终优于其他无训练方法,并且在涉及大量重新训练/微调和结构修改的先进剪枝方法中保持高度竞争力。应用于多个大型语言模型(LLM)时,ReplaceMe实现了高达25%的剪枝率,同时在开放基准测试中保留了原始模型约90%的性能——无需任何训练或修复步骤,导致计算开销最小(见图1)。我们提供了一个开源库,实现了ReplaceMe以及几种先进的深度剪枝技术,可在该仓库中找到。
论文及项目相关链接
Summary
ReplaceMe是一种无需训练的深度剪枝方法,可在保持高性能的同时实现低压缩比的转换操作。与其他需要额外训练或微调的传统剪枝方法不同,该方法仅需要一个小的校准数据集来估计线性转换以近似剪枝块。估计的线性映射可以无缝地与其他转换器块合并,无需任何额外的网络参数。实验表明,ReplaceMe在大型语言模型上表现优异,无需训练或修复步骤即可实现高达25%的剪枝,同时保留原始模型在开放基准测试上约90%的性能,计算开销最小。
Key Takeaways
- ReplaceMe是一种无需训练的深度剪枝方法,可以在保持高性能的同时实现低压缩比转换操作。
- 该方法使用一个小的校准数据集来估计线性转换以近似剪枝块,无需额外的网络参数。
- ReplaceMe能够无缝地与其他转换器块合并。
- 实验表明,ReplaceMe在大型语言模型上表现优异,可实现高达25%的剪枝。
- ReplaceMe可以在不损失太多性能的情况下实现高压缩比,与其他先进的剪枝方法相比具有竞争力。
- 该方法无需任何训练或修复步骤即可实现高性能剪枝,从而降低了计算开销。
点此查看论文截图




SEFE: Superficial and Essential Forgetting Eliminator for Multimodal Continual Instruction Tuning
Authors:Jinpeng Chen, Runmin Cong, Yuzhi Zhao, Hongzheng Yang, Guangneng Hu, Horace Ho Shing Ip, Sam Kwong
Multimodal Continual Instruction Tuning (MCIT) aims to enable Multimodal Large Language Models (MLLMs) to incrementally learn new tasks without catastrophic forgetting. In this paper, we explore forgetting in this context, categorizing it into superficial forgetting and essential forgetting. Superficial forgetting refers to cases where the model’s knowledge may not be genuinely lost, but its responses to previous tasks deviate from expected formats due to the influence of subsequent tasks’ answer styles, making the results unusable. By contrast, essential forgetting refers to situations where the model provides correctly formatted but factually inaccurate answers, indicating a true loss of knowledge. Assessing essential forgetting necessitates addressing superficial forgetting first, as severe superficial forgetting can obscure the model’s knowledge state. Hence, we first introduce the Answer Style Diversification (ASD) paradigm, which defines a standardized process for transforming data styles across different tasks, unifying their training sets into similarly diversified styles to prevent superficial forgetting caused by style shifts. Building on this, we propose RegLoRA to mitigate essential forgetting. RegLoRA stabilizes key parameters where prior knowledge is primarily stored by applying regularization, enabling the model to retain existing competencies. Experimental results demonstrate that our overall method, SEFE, achieves state-of-the-art performance.
多模态持续指令调优(MCIT)旨在使多模态大型语言模型(MLLM)能够逐步学习新任务而无需面临灾难性遗忘的问题。在本文中,我们探索了在此背景下的遗忘问题,将其分为浅表性遗忘和本质性遗忘。浅表性遗忘是指模型的知识可能并未真正丢失,但由于后续任务的答案风格影响,其对先前任务的回答偏离了预期格式,导致结果无法使用。相比之下,本质性遗忘是指模型提供格式正确但事实错误的答案,这表明知识确实丢失了。评估本质性遗忘需要先解决浅表性遗忘问题,因为严重的浅表性遗忘可能会掩盖模型的知识状态。因此,我们首先引入了答案风格多样化(ASD)范式,该范式定义了一个标准化流程来转换不同任务的数据风格,将其训练集统一为类似多样化的风格,以防止因风格变化导致的浅表性遗忘。在此基础上,我们提出了RegLoRA来缓解本质性遗忘。RegLoRA通过应用正则化来稳定关键参数,其中主要存储先验知识,使模型能够保留现有技能。实验结果表明,我们的整体方法SEFE达到了最新技术水平。
论文及项目相关链接
Summary
本文介绍了多模态持续指令调优(MCIT)的目标,旨在使多模态大型语言模型(MLLMs)能够逐步学习新任务而不发生灾难性遗忘。文章探讨了该背景下的遗忘问题,将其分为表面性遗忘和本质性遗忘。通过引入答案风格多样化(ASD)范式和RegLoRA方法,有效防止因风格变化导致的表面性遗忘并减轻本质性遗忘。实验结果表明,整体方法SEFE达到最新技术水平。
Key Takeaways
- MCIT的目标是使MLLMs能够逐步学习新任务而不忘记旧知识。
- 遗忘在MLLMs中分为表面性遗忘和本质性遗忘。
- 表面性遗忘指的是模型知识并未真正丢失,但由于后续任务答案风格的影响,对任务的回应偏离了预期格式。
- 本质性遗忘是模型真正丧失了知识,表现为回答格式正确但内容失实。
- 为了评估本质性遗忘,必须先解决表面性遗忘,因为严重的表面性遗忘可能会掩盖模型的知识状态。
- 引入ASD范式来标准化不同任务的数据样式转换,防止因风格变化导致的表面性遗忘。
点此查看论文截图




Colombian Waitresses y Jueces canadienses: Gender and Country Biases in Occupation Recommendations from LLMs
Authors:Elisa Forcada Rodríguez, Olatz Perez-de-Viñaspre, Jon Ander Campos, Dietrich Klakow, Vagrant Gautam
One of the goals of fairness research in NLP is to measure and mitigate stereotypical biases that are propagated by NLP systems. However, such work tends to focus on single axes of bias (most often gender) and the English language. Addressing these limitations, we contribute the first study of multilingual intersecting country and gender biases, with a focus on occupation recommendations generated by large language models. We construct a benchmark of prompts in English, Spanish and German, where we systematically vary country and gender, using 25 countries and four pronoun sets. Then, we evaluate a suite of 5 Llama-based models on this benchmark, finding that LLMs encode significant gender and country biases. Notably, we find that even when models show parity for gender or country individually, intersectional occupational biases based on both country and gender persist. We also show that the prompting language significantly affects bias, and instruction-tuned models consistently demonstrate the lowest and most stable levels of bias. Our findings highlight the need for fairness researchers to use intersectional and multilingual lenses in their work.
自然语言处理(NLP)的公平性研究的目标之一是衡量和减轻由NLP系统传播的刻板偏见。然而,这样的工作往往集中在单一偏见轴(最常见的是性别)和英语语言上。为了解决这个问题,我们进行了第一项多语言交叉国家和性别偏见的研究,重点关注由大型语言模型生成的职业推荐。我们在英语、西班牙语和德语中构建了提示基准测试,系统地改变国家和性别,使用25个国家和四组代词。然后,我们在这一基准测试上评估了5个基于Llama的模型,发现大型语言模型包含了重大的性别和国家偏见。值得注意的是,即使模型在性别或国家方面表现出单一公平性,基于国家和性别的交叉职业偏见仍然存在。我们还表明,提示语言对偏见有很大影响,而指令调整模型始终表现出最低和最稳定的偏见水平。我们的研究结果强调,公平研究人员需要在其工作中使用交叉和多语言的视角。
论文及项目相关链接
Summary
在NLP的公平性研究中,一个目标是衡量和减轻由NLP系统传播的刻板偏见。然而,此类工作往往侧重于单一偏见轴(通常是性别)和英语。本研究弥补了这些不足,首次研究多语言国家与性别偏见交集,重点关注大型语言模型产生的职业推荐。我们在英语、西班牙语和德语中构建了基准测试集,通过系统地改变国家和性别(使用25个国家和四个代词集),评估了基于Llama的五个模型。我们发现LLM蕴含显著的性别和国家偏见。值得注意的是,即使模型单独对性别或国家显示公平性,基于两者结合的交叉职业偏见仍然存在。我们还发现提示语言显著影响偏见,指令微调模型表现出最低且最稳定的偏见水平。我们的研究强调了公平研究人员需要在工作中使用交叉和多语言视角的必要性。
Key Takeaways
- 本研究首次探讨了多语言环境下的国家与性别偏见交集问题,特别是在大型语言模型生成的职业推荐方面。
- 构建了针对英语、西班牙语和德语的多语言基准测试集。
- 系统地研究不同国家和性别对职业推荐的影响,使用25个国家和四种代词集进行评估。
- 发现大型语言模型存在显著的性别和国家偏见。
- 即使模型在单一维度(性别或国家)上表现公平,交叉职业偏见仍然存在。
- 提示语言对偏见有显著影响,指令微调模型表现出较低且稳定的偏见水平。
点此查看论文截图




Intra-Layer Recurrence in Transformers for Language Modeling
Authors:Anthony Nguyen, Wenjun Lin
Transformer models have established new benchmarks in natural language processing; however, their increasing depth results in substantial growth in parameter counts. While existing recurrent transformer methods address this issue by reprocessing layers multiple times, they often apply recurrence indiscriminately across entire blocks of layers. In this work, we investigate Intra-Layer Recurrence (ILR), a more targeted approach that applies recurrence selectively to individual layers within a single forward pass. Our experiments show that allocating more iterations to earlier layers yields optimal results. These findings suggest that ILR offers a promising direction for optimizing recurrent structures in transformer architectures.
Transformer模型已在自然语言处理领域建立了新的基准,但其深度增加导致参数数量大幅增长。现有的循环Transformer方法通过多次重新处理层来解决这个问题,但它们通常在整个层块中随意应用循环。在这项工作中,我们研究了逐层递归(ILR),这是一种更有针对性的方法,可以一次前向传递过程中有选择地对个别层应用递归。我们的实验表明,对早期层分配更多迭代次数可以获得最佳结果。这些发现表明,ILR为优化Transformer架构中的循环结构提供了一个有前途的方向。
论文及项目相关链接
PDF Accepted at Canadian AI 2025. Code available at https://github.com/ant-8/Layer-Recurrent-Transformers
Summary
变换器模型在自然语言处理领域树立了新的基准,但其深度增加导致参数数量大幅增长。现有循环变换器方法通过多次重新处理层来解决这个问题,但它们常常在整个层块上盲目应用循环。本研究探讨了单向前传中的选择性层内循环(ILR),更有针对性地应用于单个层。实验表明,为早期层分配更多迭代次数可获得最佳结果。这表明ILR为优化变换器架构中的循环结构提供了有前途的方向。
Key Takeaways
- 变换器模型在自然语言处理中表现优异,但深度增加导致参数增多。
- 现有循环变换器方法通常在整个层块上盲目应用循环。
- 本研究提出更有针对性的层内循环(ILR)方法。
- 实验显示,对早期层分配更多迭代次数可得到最佳结果。
- ILR方法为优化变换器架构中的循环结构提供了新方向。
- ILR能够选择性应用,避免不必要的计算开销。
点此查看论文截图







Harnessing the Power of LLMs, Informers and Decision Transformers for Intent-driven RAN Management in 6G
Authors:Md Arafat Habib, Pedro Enrique Iturria Rivera, Yigit Ozcan, Medhat Elsayed, Majid Bavand, Raimundas Gaigalas, Melike Erol-Kantarci
Intent-driven network management is critical for managing the complexity of 5G and 6G networks. It enables adaptive, on-demand management of the network based on the objectives of the network operators. In this paper, we propose an innovative three-step framework for intent-driven network management based on Generative AI (GenAI) algorithms. First, we fine-tune a Large Language Model (LLM) on a custom dataset using a Quantized Low-Rank Adapter (QLoRA) to enable memory-efficient intent processing within limited computational resources. A Retrieval Augmented Generation (RAG) module is included to support dynamic decision-making. Second, we utilize a transformer architecture for time series forecasting to predict key parameters, such as power consumption, traffic load, and packet drop rate, to facilitate intent validation proactively. Lastly, we introduce a Hierarchical Decision Transformer with Goal Awareness (HDTGA) to optimize the selection and orchestration of network applications and hence, optimize the network. Our intent guidance and processing approach improves BERTScore by 6% and the semantic similarity score by 9% compared to the base LLM model. Again, the proposed predictive intent validation approach can successfully rule out the performance-degrading intents with an average of 88% accuracy. Finally, compared to the baselines, the proposed HDTGA algorithm increases throughput at least by 19.3%, reduces delay by 48.5%, and boosts energy efficiency by 54.9%.
意图驱动的网络管理对于管理5G和6G网络的复杂性至关重要。它根据网络运营商的目标,实现了网络的自适应按需管理。本文提出了一种基于生成式人工智能(GenAI)算法的创新三步意图驱动网络管理框架。首先,我们使用量化低秩适配器(QLoRA)在自定义数据集上微调大型语言模型(LLM),以在有限的计算资源中实现高效的意图处理。包含检索增强生成(RAG)模块,以支持动态决策。其次,我们采用变压器架构进行时间序列预测,预测功耗、交通负载和包丢失率等关键参数,以主动验证意图。最后,我们引入具有目标意识分层决策变压器(HDTGA),优化网络应用程序的选择和协调,从而优化网络。与基础LLM模型相比,我们的意图指导和处理方法提高了BERTScore 6%和语义相似性得分9%。同样,所提出的预测意图验证方法可以成功排除性能下降的意图,平均准确率为88%。最后,与基线相比,所提出的HDTGA算法至少提高吞吐量19.3%,降低延迟48.5%,提高能源效率54.9%。
论文及项目相关链接
PDF Currently under review
Summary
本文提出了基于生成式人工智能(GenAI)算法的创新三步框架,用于意图驱动的网络管理。该框架通过微调大型语言模型(LLM),利用量化低秩适配器(QLoRA)在有限计算资源内实现高效的意图处理。同时,结合检索增强生成(RAG)模块支持动态决策制定。此外,采用转换器架构进行时间序列预测,以预测关键参数,如功率消耗、交通负载和包丢失率,以主动验证意图。最后,引入具有目标意识分层决策转换器(HDTGA)优化网络应用和网络的选型编排。该框架提高了网络管理的效率和性能。
Key Takeaways
- 意图驱动的网络管理对于管理5G和6G网络的复杂性至关重要。
- 提出了一种基于生成式人工智能(GenAI)算法的创新三步框架,包括微调大型语言模型(LLM)、量化低秩适配器(QLoRA)、检索增强生成(RAG)模块和支持动态决策的时间序列预测转换器架构。
- 通过微调LLM和QLoRA,可在有限的计算资源内实现高效的意图处理。
- 引入具有目标意识的分层决策转换器(HDTGA)以优化网络选择和编排,提高了网络性能。
- 该框架的意图指导和处理提高了BERTScore和语义相似性分数。
- 预测意图验证方法可以成功排除性能下降的意图,具有较高的准确性。
点此查看论文截图



How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias
Authors:Ruiquan Huang, Yingbin Liang, Jing Yang
Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as even pairs' and
parity check’, the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check need to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and approaches in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate those theoretical results.
语言识别任务是自然语言处理(NLP)中的根本任务,已被广泛应用于评估大型语言模型(LLM)的性能。这些任务在解释变压器的工作机制方面也起着至关重要的作用。在这项工作中,我们专注于常规语言识别类别中的两个代表性任务,即“偶数对”和“奇偶校验”,其目的是确定给定序列中某些子序列的出现是否为偶数。我们的目标是探索由注意力层加线性层组成的一层变压器如何学习解决这些任务,通过理论分析其在梯度下降下的训练动态。虽然“偶数对”可以直接由一层变压器解决,“奇偶校验”需要通过将思维链(CoT)集成到为“偶数对”任务训练良好的变压器的推理阶段或一层变压器的训练中来解决。对于这两个问题,我们的分析表明,注意力层和线性层的联合训练表现出两个明显的阶段。在第一阶段,注意力层迅速增长,将数据序列映射为可分离的向量。在第二阶段,注意力层变得稳定,而线性层的增长呈对数趋势,并朝向一个最大间隔超平面,该超平面能将注意力层的输出正确地分为正样本和负样本,损失以O(1/t)的速度减小。我们的实验验证了这些理论结果。
论文及项目相关链接
PDF accepted by ICML 2025
Summary
本文探讨了自然语言处理中的语言识别任务,特别是针对“偶数对”和“奇偶校验”这两个代表性任务的理论分析。文章重点研究了一个由注意力层和线性层组成的一层变压器如何通过梯度下降解决这些任务。研究发现,联合训练注意力层和线性层分为两个阶段:第一阶段注意力层迅速增长,将数据序列映射为可分向量;第二阶段注意力层稳定,线性层以对数方式增长并逐渐接近最大间隔超平面,将注意力层输出分为正负样本。实验验证了这些理论结果。
Key Takeaways
- 语言识别任务是自然语言处理中的重要基准测试,用于评估大型语言模型的性能。
- “偶数对”和“奇偶校验”是语言识别中的代表性任务,旨在确定给定序列中某些子序列的出现是否为偶数。
- 一层变压器由注意力层和线性层组成,通过梯度下降解决这些任务的理论分析是本研究的重点。
点此查看论文截图

HAIR: Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning for LLM Alignment
Authors:Ruoxi Cheng, Haoxuan Ma, Weixin Wang
The alignment of large language models (LLMs) with human values remains critical yet hindered by four key challenges: (1) scarcity of balanced safety datasets, (2) alignment tax, (3) vulnerability to jailbreak attacks due to shallow alignment, and (4) inability to dynamically adapt rewards according to task difficulty. To address these limitations, we introduce HAIR (Hardness-Aware Inverse Reinforcement Learning with Introspective Reasoning), a novel alignment approach inspired by shadow models in membership inference attacks. Our approach consists of two main components: (1) construction of a balanced safety Chain-of-Draft (CoD) dataset for seven harmful categories using structured prompts that leverage the introspective reasoning capabilities of LLMs; and (2) training of category-specific reward models with Group Relative Policy Optimization (GRPO), dynamically tuning optimization to task difficulty at both the data and model levels. Comprehensive experiments across four harmlessness and four usefulness benchmarks demonstrate that HAIR achieves state-of-the-art performance, outperforming all baseline methods in safety while maintaining high levels of usefulness.
大型语言模型(LLM)与人类价值观的契合仍然至关重要,但受到四个关键挑战的影响:(1)平衡安全数据集的稀缺性,(2)对齐税,(3)由于浅层对齐而容易受到越狱攻击,(4)无法根据任务难度动态调整奖励。为了解决这些局限性,我们引入了HAIR(基于内省推理的硬度感知逆向强化学习),这是一种受成员推理攻击中的影子模型启发的新型对齐方法。我们的方法主要包括两个组成部分:(1)使用结构化提示,构建用于七个有害类别的平衡安全“草案链”(CoD)数据集,利用LLM的内省推理能力;(2)使用组相对策略优化(GRPO)训练特定类别的奖励模型,在数据和模型层面根据任务的难度动态调整优化。在四个无害性和四个有用性基准上的综合实验表明,HAIR实现了最先进的性能,在安全方面优于所有基准方法,同时保持较高的实用性。
论文及项目相关链接
PDF The three authors contributed equally to this work
Summary
大型语言模型(LLM)与人类价值观的契合至关重要,但面临四大挑战。为解决这些问题,提出了一种新的对齐方法HAIR,通过构建平衡的安全Chain-of-Draft数据集和训练特定类别的奖励模型,实现动态适应任务难度的优化。实验证明,HAIR在安全性和实用性方面都达到了最新水平。
Key Takeaways
- 大型语言模型(LLM)与人类价值观的契合至关重要,但存在四大挑战。
- HAIR是一种新的LLM对齐方法,旨在解决现有挑战。
- HAIR通过构建平衡的安全Chain-of-Draft(CoD)数据集来提高LLM的安全性。
- CoD数据集利用结构化提示和LLM的直觉推理能力,针对七种有害类别进行构建。
- HAIR通过训练特定类别的奖励模型,实现动态适应任务难度的优化。
- Group Relative Policy Optimization(GRPO)是HAIR中的关键组件,用于在数据和模型层面进行动态优化。
点此查看论文截图




CASE – Condition-Aware Sentence Embeddings for Conditional Semantic Textual Similarity Measurement
Authors:Gaifan Zhang, Yi Zhou, Danushka Bollegala
The meaning conveyed by a sentence often depends on the context in which it appears. Despite the progress of sentence embedding methods, it remains unclear how to best modify a sentence embedding conditioned on its context. To address this problem, we propose Condition-Aware Sentence Embeddings (CASE), an efficient and accurate method to create an embedding for a sentence under a given condition. First, CASE creates an embedding for the condition using a Large Language Model (LLM), where the sentence influences the attention scores computed for the tokens in the condition during pooling. Next, a supervised nonlinear projection is learned to reduce the dimensionality of the LLM-based text embeddings. We show that CASE significantly outperforms previously proposed Conditional Semantic Textual Similarity (C-STS) methods on an existing standard benchmark dataset. We find that subtracting the condition embedding consistently improves the C-STS performance of LLM-based text embeddings. Moreover, we propose a supervised dimensionality reduction method that not only reduces the dimensionality of LLM-based embeddings but also significantly improves their performance.
句子所传达的意义往往取决于其出现的上下文环境。尽管句子嵌入方法已经取得了一定的进展,但如何根据上下文环境对句子嵌入进行最佳修改仍然不明确。为了解决这个问题,我们提出了条件感知句子嵌入(CASE),这是一种高效且准确的方法,用于在给定条件下为句子创建嵌入。首先,CASE使用大型语言模型(LLM)为条件创建嵌入,其中句子会影响条件聚合过程中计算的标记的注意力分数。接下来,学习监督非线性投影以降低基于LLM的文本嵌入的维度。我们表明,CASE在现有的标准基准数据集上显著优于先前提出的条件语义文本相似性(C-STS)方法。我们发现,减去条件嵌入可以持续提高基于LLM的文本嵌入的C-STS性能。此外,我们提出了一种监督降维方法,不仅降低了基于LLM的嵌入的维度,还显著提高了其性能。
论文及项目相关链接
Summary
条件感知句子嵌入(CASE)是一种高效且准确的方法,用于在给定的条件下为句子创建嵌入。它首先使用大型语言模型(LLM)为条件创建嵌入,然后通过监督非线性投影降低文本嵌入的维度。CASE显著优于现有的条件语义文本相似性(C-STS)方法,并发现减去条件嵌入和采用监督降维方法能改善LLM基于的文本嵌入性能。
Key Takeaways
- 条件感知句子嵌入(CASE)是为了解决句子嵌入如何根据上下文进行修改的问题而提出的方法。
- CASE使用大型语言模型(LLM)为条件创建嵌入,句子会影响条件池化时的注意力得分计算。
- CASE采用监督非线性投影来降低文本嵌入的维度。
- CASE显著优于现有的条件语义文本相似性(C-STS)方法。
- 减去条件嵌入可改善LLM基于的文本嵌入在C-STS上的性能。
- 提出的监督降维方法不仅降低了LLM基于嵌入的维度,还显著提高了其性能。
- 此方法对于理解和生成与特定上下文相关的句子嵌入具有潜在的应用价值。
点此查看论文截图


BIG-Bench Extra Hard
Authors:Mehran Kazemi, Bahare Fatemi, Hritik Bansal, John Palowitch, Chrysovalantis Anastasiou, Sanket Vaibhav Mehta, Lalit K. Jain, Virginia Aglietti, Disha Jindal, Peter Chen, Nishanth Dikkala, Gladys Tyen, Xin Liu, Uri Shalit, Silvia Chiappa, Kate Olszewska, Yi Tay, Vinh Q. Tran, Quoc V. Le, Orhan Firat
Large language models (LLMs) are increasingly deployed in everyday applications, demanding robust general reasoning capabilities and diverse reasoning skillset. However, current LLM reasoning benchmarks predominantly focus on mathematical and coding abilities, leaving a gap in evaluating broader reasoning proficiencies. One particular exception is the BIG-Bench dataset, which has served as a crucial benchmark for evaluating the general reasoning capabilities of LLMs, thanks to its diverse set of challenging tasks that allowed for a comprehensive assessment of general reasoning across various skills within a unified framework. However, recent advances in LLMs have led to saturation on BIG-Bench, and its harder version BIG-Bench Hard (BBH). State-of-the-art models achieve near-perfect scores on many tasks in BBH, thus diminishing its utility. To address this limitation, we introduce BIG-Bench Extra Hard (BBEH), a new benchmark designed to push the boundaries of LLM reasoning evaluation. BBEH replaces each task in BBH with a novel task that probes a similar reasoning capability but exhibits significantly increased difficulty. We evaluate various models on BBEH and observe a (harmonic) average accuracy of 9.8% for the best general-purpose model and 44.8% for the best reasoning-specialized model, indicating substantial room for improvement and highlighting the ongoing challenge of achieving robust general reasoning in LLMs. We release BBEH publicly at: https://github.com/google-deepmind/bbeh.
大型语言模型(LLM)在日常应用中的部署日益增多,要求具备稳健的通用推理能力和多样化的推理技能集。然而,当前的LLM推理基准测试主要集中在数学和编码能力上,在评估更广泛的推理能力方面存在差距。一个特别的例外是BIG-Bench数据集,它已成为评估LLM通用推理能力的重要基准,得益于其多样化的挑战任务集,能够在统一框架内全面评估跨各种技能的通用推理能力。然而,LLM的最新进展导致在BIG-Bench及其更高级版本BIG-Bench Hard (BBH)上的饱和。最先进的模型在BBH的许多任务上取得了近乎完美的成绩,从而降低了其实用性。为了解决这一局限性,我们推出了BIG-Bench Extra Hard(BBEH),这是一个新的基准测试,旨在突破LLM推理评估的界限。BBEH用新型任务替换BBH中的每个任务,这些任务探查相似的推理能力,但难度显著增加。我们在BBEH上评估了各种模型,观察到最佳通用模型的(调和)平均准确率为9.8%,最佳推理专用模型的平均准确率为44.8%,这表明有巨大的改进空间,并突出了在LLM中实现稳健通用推理的当前挑战。我们公开发布了BBEH:https://github.com/google-deepmind/bbeh。
论文及项目相关链接
Summary:大型语言模型(LLM)在日常应用中越来越广泛,需要强大的通用推理能力和多样化的技能集。当前LLM推理基准测试主要集中在数学和编码能力上,缺乏对更广泛推理能力的评估。BIG-Bench数据集是一个重要的例外,它已成为评估LLM通用推理能力的重要基准,包含多种具有挑战性的任务,能够全面评估各种技能的通用推理能力。然而,随着LLM的近期进展,BIG-Bench及其更难版本BBH已趋于饱和。最先进的模型在BBH的许多任务上取得了近乎完美的成绩,降低了其效用。为解决这一局限性,我们推出了BIG-Bench Extra Hard(BBEH),这是一个新的基准测试,旨在推动LLM推理评估的界限。BBEH用类似但难度更大的新任务替换了BBH中的每个任务。我们对不同的模型进行了评估,发现最好的通用模型的平均准确度为9.8%,最好的推理专业模型的平均准确度为44.8%,显示出巨大的改进空间并凸显了实现稳健的通用推理在LLM中的挑战。我们公开发布了BBEH:https://github.com/google-deepmind/bbeh。
Key Takeaways:
- LLM在日常应用中需求强大和多样化的推理能力。
- 当前LLM推理基准测试主要集中在数学和编码能力上,需要更全面的评估。
- BIG-Bench已成为评估LLM通用推理能力的重要基准。
- 随着LLM的进展,现有的基准测试如BIG-Bench和BBH已趋于饱和。
- 最先进的模型在现有基准测试上表现近乎完美,凸显了新挑战的必要性。
- 推出新的基准测试BBEH,以更全面地评估LLM的推理能力。
点此查看论文截图


