⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-02-09 更新
OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via Omniverse Computation Balance
Authors:Yongqiang Yao, Jingru Tan, Jiahao Hu, Feizhao Zhang, Xin Jin, Bo Li, Ruihao Gong, Pengfei Liu
Recently, vision-language instruct-tuning models have made significant progress due to their more comprehensive understanding of the world. In this work, we discovered that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. We rebalanced the computational loads from data, model, and memory perspectives to address this issue, achieving more balanced computation across devices. These three components are not independent but are closely connected, forming an omniverse balanced training framework. Specifically, for the data, we grouped instances into new balanced mini-batches within and across devices. For the model, we employed a search-based method to achieve a more balanced partitioning. For memory optimization, we adaptively adjusted the re-computation strategy for each partition to utilize the available memory fully. We conducted extensive experiments to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, we significantly reduced GPU days, achieving about 1.8x speed-up. Our method’s efficacy and generalizability were further demonstrated across various models and datasets. Codes will be released at https://github.com/ModelTC/OmniBal.
最近,由于视觉语言指令微调模型对世界的理解更加全面,它们取得了重大进展。在这项工作中,我们发现这些模型的大规模3D并行训练会导致不同设备之间的计算负载不平衡。视觉和语言部分本质上是异构的:它们的数据分布和模型架构存在很大差异,这影响了分布式训练的效率。我们从数据、模型和内存三个方面重新平衡计算负载,以解决这一问题,实现跨设备的更平衡计算。这三个组件不是独立的,而是紧密相联的,形成了一个全方位平衡的训练框架。具体来说,对于数据,我们将实例分组为跨设备内外的新平衡小批量。对于模型,我们采用基于搜索的方法来实现更平衡的划分。对于内存优化,我们自适应地调整每个分区的重新计算策略,以充分利用可用内存。我们进行了大量实验来验证我们方法的有效性。与InternVL-Chat的开源训练代码相比,我们大幅减少了GPU天数,实现了约1.8倍的速度提升。我们的方法在各种模型和数据集上的有效性和通用性得到了进一步证明。代码将在https://github.com/ModelTC/OmniBal发布。
论文及项目相关链接
Summary
大规模三维并行训练在视觉语言指令微调模型上的不均衡计算负载问题被解决。通过从数据、模型和内存三个方面进行平衡调整,实现了更均衡的计算负载分配。该研究提高了分布式训练效率,减少了GPU天数,并在不同的模型和数据集上验证了其有效性和通用性。相关代码将在GitHub上发布。
Key Takeaways
- 视觉语言指令微调模型在大规模三维并行训练中存在计算负载不均衡的问题。
- 视觉和语言部分在数据分布和模型架构上存在固有差异,影响分布式训练效率。
- 为解决计算负载不均衡问题,从数据、模型和内存三个方面进行了平衡调整。
- 通过平衡数据的实例分组和模型搜索方法实现更均衡的计算负载分配。
- 优化内存管理策略以充分利用可用内存,适应不同的分区需求。
- 该方法显著提高了训练效率,减少了GPU天数,实现了约1.8倍的速度提升。
点此查看论文截图






Advantage Alignment Algorithms
Authors:Juan Agustin Duque, Milad Aghajohari, Tim Cooijmans, Razvan Ciuca, Tianyu Zhang, Gauthier Gidel, Aaron Courville
Artificially intelligent agents are increasingly being integrated into human decision-making: from large language model (LLM) assistants to autonomous vehicles. These systems often optimize their individual objective, leading to conflicts, particularly in general-sum games where naive reinforcement learning agents empirically converge to Pareto-suboptimal Nash equilibria. To address this issue, opponent shaping has emerged as a paradigm for finding socially beneficial equilibria in general-sum games. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent shaping efficiently and intuitively. We achieve this by aligning the advantages of interacting agents, increasing the probability of mutually beneficial actions when their interaction has been positive. We prove that existing opponent shaping methods implicitly perform Advantage Alignment. Compared to these methods, Advantage Alignment simplifies the mathematical formulation of opponent shaping, reduces the computational burden and extends to continuous action domains. We demonstrate the effectiveness of our algorithms across a range of social dilemmas, achieving state-of-the-art cooperation and robustness against exploitation.
人工智能代理正越来越多地融入人类的决策过程:从大型语言模型(LLM)助理到自动驾驶汽车。这些系统通常优化各自的目标,从而导致冲突,特别是在总收益游戏中的冲突尤为明显,其中天真的强化学习代理通常会收敛到帕累托次优的纳什均衡。为解决这一问题,对手塑形已成为在总收益游戏中寻找社会利益均衡的一种范式。在这项工作中,我们介绍了优势对齐(Advantage Alignment)算法家族,这些算法基于基本原理,能够高效直观地进行对手塑形。我们通过调整交互代理的优势来实现这一点,当它们的交互呈现正面时,增加了互惠行为的可能性。我们证明了现有的对手塑形方法隐含地执行优势对齐。与这些方法相比,优势对齐简化了对手塑形的数学公式,减轻了计算负担,并扩展到了连续动作领域。我们在一系列社会困境中展示了算法的有效性,实现了最先进的合作和对抗剥削的稳健性。
论文及项目相关链接
PDF 25 Pages, 8 figures
Summary
:人工智能代理正越来越多地融入人类决策过程,从大型语言模型(LLM)助理到自动驾驶汽车。这些系统通常优化各自的目标,导致冲突,特别是在总和游戏中尤为明显,其中简单的强化学习代理往往会收敛到帕累托次优的纳什均衡。为解决这一问题,对手塑造已成为寻找一般总和游戏中社会利益均衡的一种范式。我们引入了一种优势对齐算法,这是一种从第一性原则派生出来的算法家族,能够高效直观地实现对手塑造。我们通过调整交互代理的优势来实现这一点,即在交互积极时增加互益行动的概率。我们证明了现有的对手塑造方法隐式地执行优势对齐。与这些方法相比,优势对齐简化了对手塑造的数学公式,降低了计算负担,并扩展到了连续动作领域。我们在一系列社会困境中展示了算法的有效性,实现了最先进的合作和对利用的稳健性。
Key Takeaways
- 人工智能代理正融入人类决策,引发冲突问题。
- 对手塑造是解决人工智能代理冲突的一种范式。
- 引入优势对齐算法家族,能够高效直观地实现对手塑造。
- 优势对齐通过调整交互代理的优势来实现互益行动的概率增加。
- 现有对手塑造方法隐式执行优势对齐。
- 优势对齐简化了对手塑造的数学公式,降低计算负担。
点此查看论文截图


Demystifying Language Model Forgetting with Low-rank Example Associations
Authors:Xisen Jin, Xiang Ren
Large Language models (LLMs) suffer from forgetting of upstream data when fine-tuned. Despite efforts on mitigating forgetting, few have investigated whether, and how forgotten upstream examples are dependent on newly learned tasks. Insights on such dependencies enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples of language modeling or instruction-tuning after fine-tuning LLMs on one of $M$ new tasks, visualized in $M\times N$ matrices. We show that the matrices are often well-approximated with low-rank matrices, indicating the dominance of simple associations between the learned tasks and forgotten upstream examples. Leveraging the analysis, we predict forgetting of upstream examples when fine-tuning on unseen tasks with matrix completion over the empirical associations. This enables fast identification of most forgotten examples without expensive inference on the entire upstream data. The approach, despite simplicity, outperforms prior approaches that learn semantic relationships of learned tasks and upstream examples with LMs for predicting forgetting. We demonstrate the practical utility of our analysis by showing statistically significantly reduced forgetting as we upweight predicted examples for replay at fine-tuning. Project page: https://inklab.usc.edu/lm-forgetting-prediction/
大型语言模型(LLM)在微调时会遗忘上游数据。尽管有人努力减轻遗忘,但很少有人研究遗忘的上游示例是否以及如何依赖于新学习的任务。对这种依赖性的洞察能够使人们有效地有针对性地缓解遗忘。在本文中,我们对在一个新的M任务上微调LLM后发生的上游语言建模或指令调整中的N个上游示例的遗忘进行了实证分析,并以$M\times N$矩阵的形式进行可视化。我们发现这些矩阵通常可以用低阶矩阵很好地逼近,这表明学到的任务和遗忘的上游示例之间存在简单的关联占据主导地位。利用分析,我们通过在经验关联上应用矩阵补全来预测在未见任务上进行微调时的上游示例遗忘。这能够在不耗费昂贵的整个上游数据的推理情况下快速识别出最容易遗忘的例子。尽管我们的方法很简单,但它超越了以前的方法,后者使用语言模型来学习学习到的任务和上游示例之间的语义关系来预测遗忘。我们通过展示在微调时通过回放预测的例子来显著减少遗忘,证明了我们的分析的实际效用。项目页面:[https://inklab.usc.edu/lm-forgetting-prediction/]
论文及项目相关链接
PDF 8 pages; preprint
Summary
大型语言模型(LLM)在微调时会遗忘上游数据。本文实证分析了在微调LLM后,对上游数据示例的遗忘情况,并探讨了这种遗忘与新增任务之间的关系。通过矩阵可视化分析,发现遗忘矩阵通常可以用低阶矩阵近似表示,表明新任务与遗忘的上游数据示例之间存在简单的关联。利用这种分析,可以通过矩阵补全对经验关联进行预测,从而快速识别最易遗忘的示例,无需在整个上游数据集上进行昂贵的推理操作。通过加权预测示例进行回放的方法显示,在减少遗忘方面表现出显著的实际效果。
Key Takeaways
- LLM在微调时会遗忘上游数据。
- 遗忘与新增任务之间存在关联。
- 通过矩阵可视化分析遗忘情况。
- 遗忘矩阵可以用低阶矩阵近似表示。
- 利用矩阵补全预测遗忘。
- 可快速识别最易遗忘的示例,无需全数据集推理。
点此查看论文截图








SituationalLLM: Proactive language models with scene awareness for dynamic, contextual task guidance
Authors:Muhammad Saif Ullah Khan, Muhammad Zeshan Afzal, Didier Stricker
Large language models (LLMs) have achieved remarkable success in text-based tasks but often struggle to provide actionable guidance in real-world physical environments. This is because of their inability to recognize their limited understanding of the user’s physical context. We present SituationalLLM, a novel approach that integrates structured scene information into an LLM to deliver proactive, context-aware assistance. By encoding objects, attributes, and relationships in a custom Scene Graph Language, SituationalLLM actively identifies gaps in environmental context and seeks clarifications during user interactions. This behavior emerges from training on the Situational Awareness Database for Instruct-Tuning (SAD-Instruct), which combines diverse, scenario-specific scene graphs with iterative, dialogue-based refinements. Experimental results indicate that SituationalLLM outperforms generic LLM baselines in task specificity, reliability, and adaptability, paving the way for environment-aware AI assistants capable of delivering robust, user-centric guidance under real-world constraints.
大型语言模型(LLM)在文本任务中取得了显著的成功,但在现实世界的物理环境中往往难以提供可操作的指导。这是因为它们无法识别自己对用户物理环境的有限理解。我们提出了SituationalLLM,这是一种新型方法,通过将结构化场景信息集成到LLM中,以提供主动、基于上下文的任务辅助。通过自定义场景图语言对物体、属性和关系进行编码,SituationalLLM能够主动识别环境上下文中的空白,并在用户交互过程中寻求澄清。这种行为源于对情境意识数据库指令微调(SAD-Instruct)的训练,它将多样化的场景特定场景图与迭代式的对话式微调相结合。实验结果表明,在任务特异性、可靠性和适应性方面,SituationalLLM优于通用LLM基线,为环境感知AI助理铺平了道路,能够在现实世界的约束下提供稳健的、以用户为中心的指导。
论文及项目相关链接
PDF Revised Submission to Open Research Europe
Summary
大型语言模型(LLM)在文本任务中取得了显著成功,但在现实世界的物理环境中往往难以提供可操作的指导。本文提出了一种名为SituationalLLM的新方法,该方法将结构化场景信息集成到LLM中,以提供主动、基于上下文的帮助。通过自定义场景图语言编码对象、属性和关系,SituationalLLM能够主动识别环境上下文中的差距,在用户交互过程中寻求澄清。这种行为来自于对情境意识数据库指令训练(SAD-Instruct)的训练,它将多样化的场景特定场景图与迭代式的对话式改进相结合。实验结果表明,SituationalLLM在任务特异性、可靠性和适应性方面优于通用LLM基线,为环境感知AI助理开辟了道路,能够在现实世界的约束条件下提供稳健、以用户为中心的指导。
Key Takeaways
- LLM在现实世界的物理环境中提供指导时存在局限性,无法识别用户物理上下文的理解限制。
- SituationalLLM是一种将结构化场景信息集成到LLM中的新方法,以提供基于上下文的主动帮助。
- SituationalLLM通过自定义场景图语言编码对象、属性和关系。
- SituationalLLM能够主动识别环境上下文中的差距,并在用户交互中寻求澄清。
- SituationalLLM的训练基于情境意识数据库指令训练(SAD-Instruct),结合了场景特定场景图和对话式改进。
- 实验结果表明,SituationalLLM在任务特异性、可靠性和适应性方面优于通用LLM基线。
点此查看论文截图







How Out-of-Distribution Detection Learning Theory Enhances Transformer: Learnability and Reliability
Authors:Yijin Zhou, Yutang Ge, Xiaowen Dong, Yuguang Wang
Transformer networks excel in natural language processing and computer vision tasks. However, they still face challenges in generalizing to Out-of-Distribution (OOD) datasets, i.e. data whose distribution differs from that seen during training. The OOD detection aims to distinguish outliers while preserving in-distribution (ID) data performance. This paper introduces the OOD detection Probably Approximately Correct (PAC) Theory for transformers, which establishes the conditions for data distribution and model configurations for the learnability of transformers in terms of OOD detection. The theory demonstrates that outliers can be accurately represented and distinguished with sufficient data. The theoretical implications highlight the trade-off between theoretical principles and practical training paradigms. By examining this trade-off, we naturally derived the rationale for leveraging auxiliary outliers to enhance OOD detection. Our theory suggests that by penalizing the misclassification of outliers within the loss function and strategically generating soft synthetic outliers, one can robustly bolster the reliability of transformer networks. This approach yields a novel algorithm that ensures learnability and refines the decision boundaries between inliers and outliers. In practice, the algorithm consistently achieves state-of-the-art performance across various data formats.
Transformer网络在自然语言处理和计算机视觉任务上表现出色。然而,它们在泛化到分布外(OOD)数据集时仍面临挑战,即数据分布与训练期间所见的数据分布不同。OOD检测旨在区分异常值,同时保持分布内(ID)数据性能。本文引入了针对变压器的OOD检测可能近似正确(PAC)理论,该理论建立了数据分布和模型配置的条件,以在OOD检测方面学习变压器的知识。该理论表明,使用足够的数据可以准确表示并区分异常值。理论上的含义突出了理论原则和实践训练范式之间的权衡。通过考察这种权衡,我们自然地得出了利用辅助异常值来提高OOD检测的合理性。我们的理论认为,通过惩罚损失函数中异常值的误分类,并战略性地生成柔软的合成异常值,可以稳健地提高变压器网络的可靠性。这种方法产生了一种新型算法,该算法确保可学习性并细化了内聚点和异常值之间的决策边界。在实践中,该算法在各种数据格式上始终实现最先进的性能。
论文及项目相关链接
Summary
本文介绍了针对Transformer网络在Out-of-Distribution(OOD)数据集上的挑战,提出了基于PAC理论的OOD检测方案。该方案通过建立数据分布和模型配置的条件,展示了Transformer在OOD检测方面的可学习性,并强调了利用辅助异常值提高OOD检测性能的重要性。通过惩罚损失函数中异常值的误分类并生成软合成异常值,能够增强Transformer网络的可靠性。该算法在实践中达到了各种数据格式的最新性能水平。
Key Takeaways
- Transformer网络在自然语言处理和计算机视觉任务中表现出色,但在处理与训练数据分布不同的OOD数据集时仍面临挑战。
- OOD检测旨在区分异常值同时保持ID数据性能。
- 本文提出了基于PAC理论的OOD检测方案,该方案为Transformer在OOD检测方面的可学习性建立了条件。
- 理论强调利用辅助异常值提高OOD检测性能的重要性。
- 通过惩罚损失函数中异常值的误分类并生成软合成异常值,可以增强Transformer网络的可靠性。
- 该方案在实践中实现了各种数据格式的最新性能水平。
点此查看论文截图



Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Authors:Andis Draguns, Andrew Gritsevskiy, Sumeet Ramesh Motwani, Charlie Rogers-Smith, Jeffrey Ladish, Christian Schroeder de Witt
The rapid proliferation of open-source language models significantly increases the risks of downstream backdoor attacks. These backdoors can introduce dangerous behaviours during model deployment and can evade detection by conventional cybersecurity monitoring systems. In this paper, we introduce a novel class of backdoors in transformer models, that, in contrast to prior art, are unelicitable in nature. Unelicitability prevents the defender from triggering the backdoor, making it impossible to properly evaluate ahead of deployment even if given full white-box access and using automated techniques, such as red-teaming or certain formal verification methods. We show that our novel construction is not only unelicitable thanks to using cryptographic techniques, but also has favourable robustness properties. We confirm these properties in empirical investigations, and provide evidence that our backdoors can withstand state-of-the-art mitigation strategies. Additionally, we expand on previous work by showing that our universal backdoors, while not completely undetectable in white-box settings, can be harder to detect than some existing designs. By demonstrating the feasibility of seamlessly integrating backdoors into transformer models, this paper fundamentally questions the efficacy of pre-deployment detection strategies. This offers new insights into the offence-defence balance in AI safety and security.
开源语言模型的快速增殖显著增加了下游后门攻击的风险。这些后门可以在模型部署期间引入危险行为,并且可以逃避传统网络安全监控系统的检测。在本文中,我们介绍了变压器模型中一类新型后门,与现有技术相比,这些后门具有不可激发的性质。不可激发性可以防止防御者触发后门,即使在部署前给予完全的白盒访问并使用自动化技术(如红队或某些形式验证方法),也无法对其进行适当评估。我们表明,我们的新型构造不仅由于使用了加密技术而不可激发,而且还具有有利的稳健性属性。我们通过实证研究证实了这些属性,并提供证据表明我们的后门可以抵御最先进的缓解策略。此外,我们通过展示我们的通用后门虽然在白盒设置下并非完全不可检测,但比某些现有设计更难检测,从而扩展了以前的工作。本文通过展示将后门无缝集成到变压器模型中的可行性,从根本上质疑了预部署检测策略的有效性。这为人工智能安全和防御攻防平衡提供了新的见解。
论文及项目相关链接
PDF 19 pages, 7 figures
Summary
开源语言模型的快速增殖增加了下游后门攻击的风险。本文介绍了一种新型的后门攻击方法,通过在转换器模型中植入后门来改变模型的行为。后门攻击具有隐蔽性,使得防御者在部署前无法触发后门,即使拥有完全的白盒访问权限并使用自动化技术也无法检测。这种后门攻击具有强大的稳健性,可以抵御最先进的缓解策略。本文展示了后门攻击的设计,对预部署检测策略的有效性提出了质疑,为人工智能安全和防御平衡提供了新的见解。
Key Takeaways
- 开源语言模型的快速普及增加了下游后门攻击的风险。
- 介绍了一种新型的后门攻击方法,该方法是隐性的,可改变模型的行为。
- 后门攻击具有隐蔽性,使得防御者在部署前无法检测和触发后门。
- 后门攻击具有强大的稳健性,能够抵御先进的缓解策略。
- 该攻击方法的存在对预部署检测策略的有效性提出了质疑。
- 后门攻击设计展示了对转换器模型的完美集成。
点此查看论文截图



Quality Assessment for AI Generated Images with Instruction Tuning
Authors:Jiarui Wang, Huiyu Duan, Guangtao Zhai, Xiongkuo Min
Artificial Intelligence Generated Content (AIGC) has grown rapidly in recent years, among which AI-based image generation has gained widespread attention due to its efficient and imaginative image creation ability. However, AI-generated Images (AIGIs) may not satisfy human preferences due to their unique distortions, which highlights the necessity to understand and evaluate human preferences for AIGIs. To this end, in this paper, we first establish a novel Image Quality Assessment (IQA) database for AIGIs, termed AIGCIQA2023+, which provides human visual preference scores and detailed preference explanations from three perspectives including quality, authenticity, and correspondence. Then, based on the constructed AIGCIQA2023+ database, this paper presents a MINT-IQA model to evaluate and explain human preferences for AIGIs from Multi-perspectives with INstruction Tuning. Specifically, the MINT-IQA model first learn and evaluate human preferences for AI-generated Images from multi-perspectives, then via the vision-language instruction tuning strategy, MINT-IQA attains powerful understanding and explanation ability for human visual preference on AIGIs, which can be used for feedback to further improve the assessment capabilities. Extensive experimental results demonstrate that the proposed MINT-IQA model achieves state-of-the-art performance in understanding and evaluating human visual preferences for AIGIs, and the proposed model also achieves competing results on traditional IQA tasks compared with state-of-the-art IQA models. The AIGCIQA2023+ database and MINT-IQA model are available at: https://github.com/IntMeGroup/MINT-IQA.
人工智能生成内容(AIGC)近年来发展迅速,其中基于人工智能的图像生成因其高效和富有创造力的图像创作能力而受到广泛关注。然而,由于人工智能生成的图像(AIGI)具有独特的失真,可能无法满足人类的偏好,这强调了了解和评估人类对AIGI的偏好的必要性。为此,本文首先建立了一个新型的人工智能生成图像质量评估(IQA)数据库,名为AIGCIQA2023+,该数据库提供从质量、真实性和对应性三个方面的人类视觉偏好分数和详细的偏好解释。然后,基于构建的AIGCIQA2023+数据库,本文提出了一个MINT-IQA模型,该模型能够从多角度评估并解释人类对AIGI的偏好,并且能够通过指令微调策略增强理解和解释能力。具体来说,MINT-IQA模型首先学习并评估人类对AI生成的图像的多角度偏好,然后通过视觉语言指令微调策略,MINT-IQA获得了对AIGI的人类视觉偏好的强大理解和解释能力,可用于反馈以进一步提高评估能力。大量的实验结果证明,所提出的MINT-IQA模型在理解和评估人类对AIGI的视觉偏好方面达到了最先进的性能,并且在传统的IQA任务上也取得了与最先进的IQA模型相竞争的结果。AIGCIQA2023+数据库和MINT-IQA模型可在https://github.com/IntMeGroup/MINT-IQA获取。
论文及项目相关链接
Summary
本文首先构建了名为AIGCIQA2023+的新型AI生成图像质量评估数据库,提供从质量、真实性和对应性三个方面的人类视觉偏好分数和详细偏好解释。然后,基于该数据库,提出了一个名为MINT-IQA的多角度评价模型,该模型通过指令微调策略,能够强大地理解和解释人类对AI生成图像的多角度视觉偏好。实验结果显示,MINT-IQA模型在理解和评估人类对AI生成图像视觉偏好方面取得了最佳性能,并在传统图像质量评估任务上取得了与最新模型相当的结果。
Key Takeaways
- 建立了名为AIGCIQA2023+的AI生成图像质量评估数据库,包含人类视觉偏好分数和详细偏好解释。
- 从质量、真实性和对应性三个方面评估AI生成图像。
- 提出了基于MINT-IQA的多角度评价模型,能够理解和解释人类对AI生成图像的多角度视觉偏好。
- MINT-IQA模型通过指令微调策略,具有强大的理解和解释能力。
- MINT-IQA模型在理解和评估人类对AI生成图像视觉偏好方面取得最佳性能。
- MINT-IQA模型在传统图像质量评估任务上取得了与最新模型相当的结果。
点此查看论文截图






HMT: Hierarchical Memory Transformer for Efficient Long Context Language Processing
Authors:Zifan He, Yingqi Cao, Zongyue Qin, Neha Prakriya, Yizhou Sun, Jason Cong
Transformer-based large language models (LLM) have been widely used in language processing applications. However, due to the memory constraints of the devices, most of them restrict the context window. Even though recurrent models in previous works can memorize past tokens to enable unlimited context and maintain effectiveness, they have ``flat’’ memory architectures. Such architectures have limitations in selecting and filtering information. Since humans are good at learning and self-adjustment, we believe that imitating brain memory hierarchy is beneficial for model memorization. Thus, we propose the Hierarchical Memory Transformer (HMT), a novel framework that facilitates a model’s long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input segments, passing memory embeddings along the sequence, and recalling relevant information from history. Evaluating general language modeling, question-answering tasks, and the summarization task, we show that HMT consistently improves the long-context processing ability of existing models. Furthermore, HMT achieves a comparable or superior generation quality to long-context LLMs with $2 \sim 57\times$ fewer parameters and $2.5 \sim 116\times$ less inference memory, significantly outperforming previous memory-augmented models. Code on Github: https://github.com/OswaldHe/HMT-pytorch.
基于Transformer的大型语言模型(LLM)已广泛应用于语言处理应用。然而,由于设备的内存限制,大多数模型都限制了上下文窗口。尽管之前的工作中的循环模型可以记住过去的标记来实现无限上下文并保持有效性,但它们具有“平面”内存架构。这种架构在选择和过滤信息方面存在局限性。由于人类擅长学习和自我调整,我们相信模仿大脑记忆层次结构对模型记忆是有益的。因此,我们提出了分层记忆转换器(HMT),这是一个新的框架,通过模仿人类的记忆行为,促进模型的长期上下文处理能力。通过利用增强内存的段级递归,我们保留早期输入段的标记来组织记忆层次结构,将记忆嵌入物沿着序列传递,并从历史中回忆相关信息。通过对通用语言建模、问答任务和摘要任务进行评估,我们证明了HMT在改进现有模型的长上下文处理能力方面的一致性。此外,HMT在长上下文LLM的生成质量方面达到了相当或更高的水平,使用了较少的参数(2至57倍)和较少的推理内存(2.5至116倍),显著优于之前的增强记忆模型。GitHub上的代码为:https://github.com/OswaldHe/HMT-pytorch。
论文及项目相关链接
PDF NAACL 2025 Main Conference
Summary
本文提出一种名为Hierarchical Memory Transformer(HMT)的新型框架,该框架通过模仿人类记忆机制来提高模型的长期上下文处理能力。该框架通过分段级别的记忆增强复现机制构建记忆层次结构,保存早期输入片段的令牌,传递内存嵌入序列,并从历史中回忆相关信息。实验表明,HMT在通用语言建模、问答任务和摘要任务中能够一致提高现有模型的长期上下文处理能力,并且在较少的参数和较低推理内存的情况下达到或超过长上下文大型语言模型(LLM)的生成质量。
Key Takeaways
- Hierarchical Memory Transformer(HMT)框架模仿人类记忆机制来提高模型的长期上下文处理能力。
- HMT通过分段级别的记忆增强复现机制构建记忆层次结构。
- HMT能够保存早期输入片段的令牌,传递内存嵌入序列,并回忆相关信息。
- HMT在通用语言建模、问答任务和摘要任务中表现出提高现有模型的长期上下文处理能力的效果。
- HMT在较少的参数和较低推理内存的情况下达到或超过长上下文大型语言模型(LLM)的生成质量。
点此查看论文截图




CollagePrompt: A Benchmark for Budget-Friendly Visual Recognition with GPT-4V
Authors:Siyu Xu, Yunke Wang, Daochang Liu, Bo Du, Chang Xu
Recent advancements in generative AI have suggested that by taking visual prompts, GPT-4V can demonstrate significant proficiency in visual recognition tasks. Despite its impressive capabilities, the financial cost associated with GPT-4V’s inference presents a substantial barrier to its wide use. To address this challenge, we propose a budget-friendly collage prompting task that collages multiple images into a single visual prompt and makes GPT-4V perform visual recognition on several images simultaneously, thereby reducing the cost. We collect a dataset of various collage prompts to assess its performance in GPT-4V’s visual recognition. Our evaluations reveal several key findings: 1) Recognition accuracy varies with different positions in the collage. 2) Grouping images of the same category together leads to better visual recognition results. 3) Incorrect labels often come from adjacent images. These findings highlight the importance of image arrangement within collage prompt. To this end, we construct a benchmark called CollagePrompt, which offers a platform for designing collage prompt to achieve more cost-effective visual recognition with GPT-4V. A baseline method derived from genetic algorithms to optimize collage layouts is proposed and two metrics are introduced to measure the efficiency of the optimized collage prompt. Our benchmark enables researchers to better optimize collage prompts, thus making GPT-4V more cost-effective in visual recognition. The code and data are available at this project page https://collageprompting.github.io/.
最近生成式人工智能的进步表明,通过视觉提示,GPT-4V在视觉识别任务中表现出了显著的专业能力。尽管其能力令人印象深刻,但GPT-4V推理的财务成本却成为其广泛使用的巨大障碍。为了应对这一挑战,我们提出了一种经济实惠的拼贴提示任务,该任务将多张图像拼贴成单个视觉提示,使GPT-4V能够同时对多张图像进行视觉识别,从而降低推理成本。我们收集了各种拼贴提示数据集来评估GPT-4V的视觉识别性能。我们的评估揭示了几个关键发现:1)识别准确率因拼贴中的不同位置而异。2)将同一类别的图像分组在一起有助于获得更好的视觉识别结果。3)错误的标签往往来自相邻的图像。这些发现突显了拼贴提示中图像排列的重要性。为此,我们构建了一个名为CollagePrompt的基准测试平台,该平台为设计拼贴提示以实现更具成本效益的GPT-4V视觉识别提供了平台。提出了一种基于遗传算法优化拼贴布局的基础方法,并引入了两个指标来衡量优化拼贴提示的效率。我们的基准测试使研究人员能够更好地优化拼贴提示,从而使GPT-4V在视觉识别方面更具成本效益。代码和数据可在项目页面https://collageprompting.github.io/上找到。
论文及项目相关链接
PDF Accepted by NAACL2025 Findings
Summary
GPT-4V在视觉识别任务中展现出显著的能力,但其推理成本高昂。为降低成本,研究提出一种经济高效的拼贴提示任务,将多个图像拼贴成单一视觉提示,使GPT-4V能同时执行多个图像的视觉识别。研究评估发现拼贴位置、图像分组和相邻图像影响识别结果。为此构建了一个名为CollagePrompt的基准测试平台,提供设计拼贴提示以实现更具成本效益的GPT-4V视觉识别。提出一种基于遗传算法的基线方法来优化拼贴布局,并引入两个指标来衡量优化拼贴提示的效率。
Key Takeaways
- GPT-4V具备强大的视觉识别能力。
- GPT-4V推理成本高昂,成为其广泛应用的障碍。
- 提出拼贴提示任务,通过拼贴多个图像降低GPT-4V的视觉识别成本。
- 评估发现拼贴位置、图像分组和相邻图像对识别结果有影响。
- 构建CollagePrompt基准测试平台,为优化拼贴提示提供工具。
- 提出基于遗传算法的基线方法来优化拼贴布局。
点此查看论文截图





Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Authors:Egor Zverev, Sahar Abdelnabi, Soroush Tabesh, Mario Fritz, Christoph H. Lampert
Instruction-tuned Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation and an empirical variant that is calculable from a model’s outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility. The source code and SEP dataset are openly accessible at https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.
指令优化的大型语言模型(LLM)在多个实际应用中表现出令人印象深刻的结果,但它们缺少计算机科学其他领域中常见的关键安全功能,特别是指令和数据的明确分离。这使得它们容易受到间接提示注入等操作的操纵,并且通常不适合执行安全关键任务。令人惊讶的是,目前尚没有既定的定义或基准来衡量这种现象。在这项工作中,我们通过引入指令与数据分离的正式度量标准以及可从模型输出中计算的经验变体来填补这一空白。我们还提供了一个新的数据集SEP,可以用来估计真实世界模型中的度量标准。我们在各种LLM上的结果表明,指令与数据分离的问题确实存在:所有模型的分离程度均不高,常见的缓解技术(如提示工程和微调)要么无法实质性提高分离程度,要么会降低模型的实用性。源代码和SEP数据集可在https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed公开访问。
论文及项目相关链接
PDF Published as a conference paper at ICLR 2025, GitHub: https://github.com/egozverev/Shold-It-Be-Executed-Or-Processed. 10 pages main text, 30 pages in total
Summary
大型语言模型(LLM)在诸多实际应用中表现出色,但缺乏计算机科学其他领域常见的安全特性,特别是缺少指令与数据的明确分离。这使其容易受到间接提示注入等操纵,并不适合安全关键任务。为解决此问题,本研究提出了指令与数据分离的正式衡量标准及其可计算的实证变体,并引入SEP数据集来估计真实模型的指标。结果表明所有模型都存在指令与数据分离问题,常规缓解技术如提示工程和微调要么无法实质改善分离,要么会降低模型效用。数据集源代码和SEP数据集已在GitHub上公开。
Key Takeaways
- 大型语言模型(LLM)在多个实际应用中展现优异性能,但缺乏必要的安全特性。
- LLM缺少指令与数据的明确分离,容易受到操纵,不适合安全关键任务。
- 研究提出了指令与数据分离的正式衡量标准和实证变体。
- 引入SEP数据集用于估计真实模型的指令与数据分离程度。
- 研究表明所有被测试的LLM都存在指令与数据分离问题。
- 常见缓解技术如提示工程和微调在改善指令与数据分离方面效果有限。
点此查看论文截图





What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning
Authors:Yifan Du, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Jinpeng Wang, Chuyuan Wang, Mingchen Cai, Ruihua Song, Ji-Rong Wen
Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). In this paper, we aim to investigate a fundamental question: ‘’what makes for good visual instructions’’. Through a comprehensive empirical study, we find that instructions focusing on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs, with results correlating to instruction complexity. Based on this insight, we develop a systematic approach to automatically create high-quality complex visual reasoning instructions. Our approach employs a synthesize-complicate-reformulate paradigm, leveraging multiple stages to gradually increase the complexity of the instructions while guaranteeing quality. Based on this approach, we create the ComVint dataset with 32K examples, and fine-tune four MLLMs on it. Experimental results consistently demonstrate the enhanced performance of all compared MLLMs, such as a 27.86% and 27.60% improvement for LLaVA on MME-Perception and MME-Cognition, respectively. Our code and data are publicly available at the link: https://github.com/RUCAIBox/ComVint.
视觉指令调整对于提高多模态大型语言模型(MLLMs)的零样本泛化能力至关重要。本文旨在探究一个基本问题:“什么样的视觉指令是好的”。通过全面的实证研究,我们发现专注于复杂视觉推理任务的指令在提高MLLM性能方面特别有效,结果与指令复杂性相关。基于此见解,我们开发了一种系统的方法,自动创建高质量的复杂视觉推理指令。我们的方法采用合成-复杂化-重构的模式,利用多个阶段来逐步增加指令的复杂性,同时保证质量。基于此方法,我们创建了包含32K实例的ComVint数据集,并在其上微调了四个MLLM。实验结果一致表明所有对比MLLM的性能都有所提高,例如LLaVA在MME-Perception和MME-Cognition上的改进分别为27.86%和27.60%。我们的代码和数据在以下链接公开可用:https://github.com/RUCAIBox/ComVint。
论文及项目相关链接
PDF Accepted by COLING2025
Summary
视觉指令调优对于提升多模态大型语言模型的零样本泛化能力至关重要。本文旨在探究“什么是好的视觉指令”这一基本问题。通过全面的实证研究,我们发现关注复杂视觉推理任务的指令在提高MLLM性能方面特别有效,并且结果与指令的复杂性相关。基于此,我们开发了一种自动创建高质量复杂视觉推理指令的系统方法。使用合成-复杂化-改写的模式,通过多个阶段逐步增加指令的复杂性同时保证质量。基于该方法,我们创建了包含32K实例的ComVint数据集,并在其上微调了四个MLLM。实验结果一致表明所有对比的MLLM性能有所提升,如在MME-Perception和MME-Cognition上LLaVA分别提升了27.86%和27.60%。
Key Takeaways
- 视觉指令调优对增强多模态大型语言模型的零样本泛化能力至关重要。
- 复杂视觉推理任务的指令对提高语言模型性能效果显著。
- 指令的复杂性对模型性能有直接影响。
- 提出了一种自动创建高质量复杂视觉推理指令的系统方法。
- 创建了包含32K实例的ComVint数据集用于模型训练。
- 使用ComVint数据集微调的MLLM性能有所提升。
点此查看论文截图




Chain-of-Factors Paper-Reviewer Matching
Authors:Yu Zhang, Yanzhen Shen, SeongKu Kang, Xiusi Chen, Bowen Jin, Jiawei Han
With the rapid increase in paper submissions to academic conferences, the need for automated and accurate paper-reviewer matching is more critical than ever. Previous efforts in this area have considered various factors to assess the relevance of a reviewer’s expertise to a paper, such as the semantic similarity, shared topics, and citation connections between the paper and the reviewer’s previous works. However, most of these studies focus on only one factor, resulting in an incomplete evaluation of the paper-reviewer relevance. To address this issue, we propose a unified model for paper-reviewer matching that jointly considers semantic, topic, and citation factors. To be specific, during training, we instruction-tune a contextualized language model shared across all factors to capture their commonalities and characteristics; during inference, we chain the three factors to enable step-by-step, coarse-to-fine search for qualified reviewers given a submission. Experiments on four datasets (one of which is newly contributed by us) spanning various fields such as machine learning, computer vision, information retrieval, and data mining consistently demonstrate the effectiveness of our proposed Chain-of-Factors model in comparison with state-of-the-art paper-reviewer matching methods and scientific pre-trained language models.
随着提交到学术会议的论文数量迅速增加,对自动化和准确的论文审稿人匹配的需求比以往任何时候都更为关键。在此领域的先前研究已经考虑了各种因素来评估审稿人的专业知识与论文的相关性,例如语义相似性、共同的主题以及论文与审稿人之前作品之间的引用联系。然而,大多数这些研究只专注于一个因素,导致对论文与审稿人相关性的评估不完整。为了解决这一问题,我们提出了一种统一的论文审稿人匹配模型,该模型同时考虑语义、主题和引用因素。具体来说,在训练过程中,我们对所有因素共享一个上下文化的语言模型进行微调,以捕捉它们的共同点和特征;在推理过程中,我们将这三个因素串联起来,以在给定的提交中进行逐步的、从粗到细的搜索合格审稿人。在涵盖机器学习、计算机视觉、信息检索和数据挖掘等不同领域的四个数据集上进行实验,与我们提出的Chain-of-Factors模型相比,一贯证明了其在论文审稿人匹配方法与科学预训练语言模型中的有效性。
论文及项目相关链接
PDF 10 pages; Accepted to WWW 2025 (Code: https://github.com/yuzhimanhua/CoF)
Summary
学术会议论文提交数量激增,对自动化精准匹配审稿人的需求愈发重要。先前研究主要聚焦于单一因素评估审稿人与论文相关性,如语义相似性、共同主题和引用联系等。本研究提出一种统一的论文与审稿人匹配模型,同时考虑语义、主题和引用因素。实验结果表明,Chain-of-Factors模型在多个数据集上的表现优于当前先进的论文审稿匹配方法和科学预训练语言模型。
Key Takeaways
- 学术会议论文提交数量迅速增长,对自动化精准匹配审稿人的需求愈发迫切。
- 先前研究主要聚焦于单一因素评估审稿人与论文的相关性。
- 本研究提出一种统一的论文与审稿人匹配模型,整合语义、主题和引用因素。
- 模型采用分阶段粗到细的搜索策略寻找合格审稿人。
- 模型在多个涵盖不同领域的数据集上进行了实验验证。
- 实验结果表明,Chain-of-Factors模型在论文与审稿人匹配方面的表现优于当前先进的匹配方法和预训练语言模型。
点此查看论文截图




Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction
Authors:Ji Qi, Chuchun Zhang, Xiaozhi Wang, Kaisheng Zeng, Jifan Yu, Jinxin Liu, Jiuding Sun, Yuxiang Chen, Lei Hou, Juanzi Li, Bin Xu
The robustness to distribution changes ensures that NLP models can be successfully applied in the realistic world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates the evaluation of open information extraction models in the real world, where the syntactic and expressive distributions under the same knowledge meaning may drift variously. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique that consists of sentences with structured knowledge of the same meaning but with different syntactic and expressive forms. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques. We perform experiments on typical models published in the last decade as well as a popular large language model, the results show that the existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 F1 score. Our resources and code are available at https://github.com/qijimrc/ROBUST.
模型的分布变化鲁棒性保证了自然语言处理模型能够在现实世界中成功应用,特别是在信息提取任务中。然而,大多数先前的评估基准测试主要集中在验证配对匹配的正确性,忽略了鲁棒性的关键度量。在本文中,我们提出了第一个模拟现实世界开放信息提取模型评估的基准测试,其中相同知识含义下的语法和表达分布可能会发生变化。我们设计并标注了一个大规模测试平台,其中的每个例子都是一个知识不变的小团体,由具有相同含义的结构化知识的句子组成,但具有不同的语法和表达形式。通过进一步阐述鲁棒性指标,如果一个模型在整个群体中的性能始终准确,则该模型被认为是稳健的。我们在过去十年中发布的典型模型以及流行的大型语言模型上进行了实验,结果表明,现有成功模型的性能令人沮丧地下降,最大F1分数下降了23.43分。我们的资源和代码可在https://github.com/qijimrc/ROBUST找到。
论文及项目相关链接
PDF Accepted by EMNLP 2023 Main Conference
Summary
本文提出了首个模拟现实世界开放信息提取模型评估的基准测试,重点关注模型在语法和表达分布变化下的稳健性。通过构建大规模测试平台,对模型在不同语法和表达形式但意义相同的知识集合上进行评估。实验结果显示,现有成功模型在稳健性方面存在显著下降,最大F1分数下降23.43。
Key Takeaways
- 本文强调了NLP模型在现实世界中应用时,对分布变化稳健性的重要性,特别是在信息提取任务中。
- 提出了首个模拟现实世界开放信息提取模型评估的基准测试,关注模型在知识意义相同但语法和表达分布变化下的性能。
- 构建了一个大规模测试平台,每个例子都是一个知识不变集合,包含结构化知识相同但不同语法和表达形式的句子。
- 定义了模型的稳健性评估标准,即模型在整体集合上的性能表现要稳定准确。
- 实验结果显示,现有成功模型在稳健性方面存在显著不足,最大F1分数下降23.43。
- 提供了研究资源和代码公开链接,便于其他研究者使用和进一步探索。
点此查看论文截图




