LLM

发布日期: 2025-09-20

更新日期: 2025-11-27

文章字数: 17k

阅读时长: 69 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-09-20 更新

LNE-Blocking: An Efficient Framework for Contamination Mitigation Evaluation on Large Language Models

Authors:Ruijie Hou, Yueyang Jiao, Hanxu Hu, Yingming Li, Wai Lam, Huajian Zhang, Hongyuan Lu

The problem of data contamination is now almost inevitable during the development of large language models (LLMs), with the training data commonly integrating those evaluation benchmarks even unintentionally. This problem subsequently makes it hard to benchmark LLMs fairly. Instead of constructing contamination-free datasets (quite hard), we propose a novel framework, \textbf{LNE-Blocking}, to restore model performance prior to contamination on potentially leaked datasets. Our framework consists of two components: contamination detection and disruption operation. For the prompt, the framework first uses the contamination detection method, \textbf{LNE}, to assess the extent of contamination in the model. Based on this, it adjusts the intensity of the disruption operation, \textbf{Blocking}, to elicit non-memorized responses from the model. Our framework is the first to efficiently restore the model’s greedy decoding performance. This comes with a strong performance on multiple datasets with potential leakage risks, and it consistently achieves stable recovery results across different models and varying levels of data contamination. We release the code at https://github.com/RuijieH/LNE-Blocking to facilitate research.

数据污染问题在大型语言模型（LLM）的发展过程中几乎不可避免，训练数据通常会无意中融入评估基准，从而使得公平评估LLM变得困难。我们并不主张构建无污染的数据集（这相当困难），而是提出一种新型框架，名为“LNE-Blocking”，以在可能泄露的数据集上恢复模型性能。我们的框架包含两个组成部分：污染检测与干扰操作。对于提示，框架首先使用污染检测方法“LNE”来评估模型中的污染程度。基于此，它调整干扰操作“Blocking”的强度，以激发模型给出非记忆化的回应。我们的框架是首个能够高效恢复模型贪婪解码性能的技术。它在多个潜在泄露风险的数据集上表现出强大的性能，并且在不同的模型和不同程度的数据污染情况下，都能实现稳定的恢复结果。为了方便研究，我们在https://github.com/RuijieH/LNE-Blocking上发布了代码。

论文及项目相关链接

PDF

Summary
数据污染问题几乎在大型语言模型（LLM）的发展过程中不可避免，训练数据通常无意中整合了评估基准，这使得公平评估LLM变得困难。我们提出了一种名为LNE-Blocking的新颖框架，旨在通过在可能泄露的数据集上恢复模型性能来应对这一问题。该框架包含两个组件：污染检测与干扰操作。对于提示，该框架首先使用污染检测方法LNE来评估模型的污染程度。基于此，它调整干扰操作Blocking的强度，以激发模型的非记忆响应。该框架是首个能够高效恢复模型贪婪解码性能的方法，在多个潜在泄露风险的数据集上具有出色的性能，并且在不同模型和不同级别的数据污染情况下都能实现稳定的恢复结果。

Key Takeaways

数据污染在LLM的发展中几乎不可避免，训练数据经常无意中包含评估基准。
LNE-Blocking框架旨在解决数据污染问题，通过恢复模型性能来应对潜在的数据泄露。
LNE-Blocking框架包含两个主要组件：污染检测和干扰操作。
污染检测方法LNE用于评估模型的污染程度。
基于污染程度的评估，调整干扰操作的强度，激发模型非记忆响应。
LNE-Blocking框架能够高效恢复模型的贪婪解码性能。

Cool Papers

点此查看论文截图

Generalizable Geometric Image Caption Synthesis

Authors:Yue Xin, Wenyuan Wang, Rui Pan, Ruida Wang, Howard Meng, Renjie Pi, Shizhe Diao, Tong Zhang

Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8%\text{-}4.8%$ in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with $2.4%\text{-}3.9%$ improvements in Art, Design, Tech, and Engineering tasks in MMMU.

多模态大型语言模型具有各种需要强大推理能力的实际应用。尽管最近有所进展，但这些模型在解决复杂的几何问题时仍然感到困难。主要挑战源于缺乏用于理解几何图像的高质量图像文本对数据集。此外，大多数基于模板的数据合成管道通常无法推广到其预定义模板之外的问题。在本文中，我们通过引入强化学习与可验证奖励（RLVR）的互补过程来填补这一空白，将其纳入数据生成管道。通过采用RLVR来完善由50种基本几何关系合成的几何图像的标题，并使用来自数学问题解决任务的奖励信号，我们的管道成功捕捉了解决几何问题的关键特征。这能够实现更好的任务泛化，并产生了不小的改进。此外，即使在超出分布的场景下，生成的数据集也增强了多模态大型语言模型的通用推理能力，在MathVista和MathVerse的非几何输入图像中，统计、算术、代数和数值任务的准确率提高了2.8%~4.8%，在MMMU的艺术、设计、技术和工程任务中提高了2.4%~3.9%。

论文及项目相关链接

PDF

Summary

多模态大型语言模型在需要强大推理能力的实际应用中存在挑战，特别是在解决复杂几何问题时。本文引入了一种基于强化学习与可验证奖励（RLVR）的数据生成管道，通过精炼合成几何图像的标题，并使用来自数学问题解决任务的奖励信号，成功捕捉几何问题解决的关键特征。这提高了任务泛化能力，并为非几何输入图像的数学视觉和数学世界任务带来显著性能提升。

Key Takeaways

多模态大型语言模型在解决复杂几何问题时面临挑战。
缺乏高质量图像文本对数据集是理解几何图像的一个关键难题。
大多数基于模板的数据合成管道无法推广到超出其预定义模板的问题。
引入强化学习与可验证奖励（RLVR）到数据生成管道中，成功解决上述问题。
通过精炼几何图像的标题并使用数学问题解决任务的奖励信号，捕捉到几何问题解决的关键特征。
该方法提高了任务泛化能力，带来显著性能提升。

Cool Papers

点此查看论文截图

Assessing Historical Structural Oppression Worldwide via Rule-Guided Prompting of Large Language Models

Authors:Sreejato Chatterjee, Linh Tran, Quoc Duy Nguyen, Roni Kirson, Drue Hamlin, Harvest Aquino, Hanjia Lyu, Jiebo Luo, Timothy Dye

Traditional efforts to measure historical structural oppression struggle with cross-national validity due to the unique, locally specified histories of exclusion, colonization, and social status in each country, and often have relied on structured indices that privilege material resources while overlooking lived, identity-based exclusion. We introduce a novel framework for oppression measurement that leverages Large Language Models (LLMs) to generate context-sensitive scores of lived historical disadvantage across diverse geopolitical settings. Using unstructured self-identified ethnicity utterances from a multilingual COVID-19 global study, we design rule-guided prompting strategies that encourage models to produce interpretable, theoretically grounded estimations of oppression. We systematically evaluate these strategies across multiple state-of-the-art LLMs. Our results demonstrate that LLMs, when guided by explicit rules, can capture nuanced forms of identity-based historical oppression within nations. This approach provides a complementary measurement tool that highlights dimensions of systemic exclusion, offering a scalable, cross-cultural lens for understanding how oppression manifests in data-driven research and public health contexts. To support reproducible evaluation, we release an open-sourced benchmark dataset for assessing LLMs on oppression measurement (https://github.com/chattergpt/llm-oppression-benchmark).

传统的历史结构压迫测量努力由于每个国家的独特、局部特定的排斥、殖民和社会地位历史，在跨国有效性方面存在困难，并且通常依赖于以物质资源为优势的结构化指数，同时忽视了基于生活的身份排斥。我们引入了一种新型的压迫测量框架，该框架利用大型语言模型（LLM）在多种地理政治背景下生成与上下文相关的生活历史不利境遇得分。我们从全球新冠肺炎疫情的多语种研究中提取自我认定的种族主义言论，设计受规则引导的回话策略，鼓励模型生成可解释且基于理论基础的压迫估计值。我们系统地评估了这些策略在多个最先进的大型语言模型中的应用。我们的结果表明，当受到明确规则指导时，大型语言模型能够捕捉到国家内部的基于身份的微妙历史压迫形式。这种方法提供了一个补充性的测量工具，突出体现了系统排斥的各个方面，提供了一个跨文化视角，用于了解压迫如何在数据驱动的研究和公共卫生环境中出现。为了支持可重复评估，我们发布了一个开源基准数据集，用于评估大型语言模型在压迫测量方面的表现（https://github.com/chattergpt/llm-oppression-benchmark）。

论文及项目相关链接

PDF

Summary

基于大型语言模型（LLM）的压迫感测量新框架，利用多元新冠疫情全球研究中的自我认定的民族言论，设计规则引导提示策略，产生可解释、理论支撑的历史压迫感评估。该框架考虑了地域性历史因素，如排斥、殖民和社会地位等，并强调身份认同的重要性。通过跨多个前沿LLM的系统评估，证明该框架可捕捉到复杂的身份历史性压迫现象。这一工具对于公共健康研究、理解压迫的数据驱动背景有重要作用，同时为研究者提供一个开源评估基准数据集（https://github.com/chattergpt/llm-oppression-benchmark）。

Key Takeaways

传统历史结构压迫测量方法存在局限性，难以跨国家有效应用。
新框架利用LLM测量压迫感，考虑地域性历史因素，包括排斥、殖民和社会地位等。
框架结合自我认定的民族言论，设计规则引导提示策略，产生可解释的历史压迫感评估。
评估策略跨多个LLM进行验证，证明其有效性。
该方法有助于捕捉复杂的身份历史性压迫现象，为公共健康研究和数据驱动的背景理解提供重要工具。
开源评估基准数据集支持可复制性评价。

Cool Papers

点此查看论文截图

Evil Vizier: Vulnerabilities of LLM-Integrated XR Systems

Authors:Yicheng Zhang, Zijian Huang, Sophie Chen, Erfan Shayegani, Jiasi Chen, Nael Abu-Ghazaleh

Extended reality (XR) applications increasingly integrate Large Language Models (LLMs) to enhance user experience, scene understanding, and even generate executable XR content, and are often called “AI glasses”. Despite these potential benefits, the integrated XR-LLM pipeline makes XR applications vulnerable to new forms of attacks. In this paper, we analyze LLM-Integated XR systems in the literature and in practice and categorize them along different dimensions from a systems perspective. Building on this categorization, we identify a common threat model and demonstrate a series of proof-of-concept attacks on multiple XR platforms that employ various LLM models (Meta Quest 3, Meta Ray-Ban, Android, and Microsoft HoloLens 2 running Llama and GPT models). Although these platforms each implement LLM integration differently, they share vulnerabilities where an attacker can modify the public context surrounding a legitimate LLM query, resulting in erroneous visual or auditory feedback to users, thus compromising their safety or privacy, sowing confusion, or other harmful effects. To defend against these threats, we discuss mitigation strategies and best practices for developers, including an initial defense prototype, and call on the community to develop new protection mechanisms to mitigate these risks.

扩展现实（XR）应用程序越来越多地集成大型语言模型（LLM），以改善用户体验、场景理解，甚至生成可执行XR内容，通常被称为“人工智能眼镜”。尽管有这些潜在优势，但集成的XR-LLM管道使XR应用程序容易受到新形式攻击的影响。在本文中，我们从文献和实践中分析了集成LLM的XR系统，并从系统角度沿不同维度对它们进行分类。基于这种分类，我们确定了常见的威胁模型，并在多个XR平台上对采用各种LLM模型（在Meta Quest 3、Meta Ray-Ban、Android和Microsoft HoloLens 2上运行Llama和GPT模型）进行了概念验证攻击。尽管这些平台各自实现LLM集成的方式不同，但它们都存在共同的漏洞，即攻击者可以修改合法LLM查询周围的公开上下文，导致向用户提供错误的视觉或听觉反馈，从而危及他们的安全或隐私，造成混淆或其他有害影响。为了应对这些威胁，我们讨论了开发者的缓解策略和最佳实践，包括初步防御原型，并呼吁社区开发新的保护机制来减轻这些风险。

论文及项目相关链接

PDF

Summary

XR应用集成大型语言模型（LLM）以增强用户体验、场景理解，甚至生成可执行XR内容，被称为“AI眼镜”。然而，这种集成使得XR应用面临新型攻击风险。本文对文献和实际中的LLM集成XR系统进行了分析，并从系统角度对其进行了分类。在此分类的基础上，我们确定了通用的威胁模型，并在多个XR平台上对采用不同LLM模型的攻击进行了概念验证。这些平台虽然LLM集成方式各异，但都存在一种攻击方式：攻击者可以修改合法LLM查询周围的公开上下文，导致用户收到错误的视觉或听觉反馈，从而危及他们的安全或隐私，引发混乱或其他有害影响。为此，我们讨论了开发者应对的缓解策略和最佳实践，并呼吁社区开发新的保护机制来减轻这些风险。

Key Takeaways

XR应用通过集成LLM来提升用户体验和内容生成，但这也增加了系统的新安全威胁。
LLM集成的XR系统面临公共上下文被修改的风险，可能导致错误反馈。
不同XR平台（如Meta Quest 3、Meta Ray-Ban、Android和Microsoft HoloLens 2）采用不同LLM模型集成方式，但都存在类似的安全隐患。
攻击者能够通过修改公开上下文影响用户接收的视觉或听觉信息，造成安全、隐私和混淆等风险。
为应对这些威胁，开发者需要采取缓解策略和最佳实践。
需要社区共同努力开发新的保护机制来减轻这些风险。

Cool Papers

点此查看论文截图

Evolving Language Models without Labels: Majority Drives Selection, Novelty Promotes Variation

Authors:Yujun Zhou, Zhenwen Liang, Haolin Liu, Wenhao Yu, Kishan Panaganti, Linfeng Song, Dian Yu, Xiangliang Zhang, Haitao Mi, Dong Yu

Large language models (LLMs) are increasingly trained with reinforcement learning from verifiable rewards (RLVR), yet real-world deployment demands models that can self-improve without labels or external judges. Existing label-free methods, confidence minimization, self-consistency, or majority-vote objectives, stabilize learning but steadily shrink exploration, causing an entropy collapse: generations become shorter, less diverse, and brittle. Unlike prior approaches such as Test-Time Reinforcement Learning (TTRL), which primarily adapt models to the immediate unlabeled dataset at hand, our goal is broader: to enable general improvements without sacrificing the model’s inherent exploration capacity and generalization ability, i.e., evolving. We formalize this issue and propose EVolution-Oriented and Label-free Reinforcement Learning (EVOL-RL), a simple rule that couples stability with variation under a label-free setting. EVOL-RL keeps the majority-voted answer as a stable anchor (selection) while adding a novelty-aware reward that favors responses whose reasoning differs from what has already been produced (variation), measured in semantic space. Implemented with GRPO, EVOL-RL also uses asymmetric clipping to preserve strong signals and an entropy regularizer to sustain search. This majority-for-selection + novelty-for-variation design prevents collapse, maintains longer and more informative chains of thought, and improves both pass@1 and pass@n. EVOL-RL consistently outperforms the majority-only TTRL baseline; e.g., training on label-free AIME24 lifts Qwen3-4B-Base AIME25 pass@1 from TTRL’s 4.6% to 16.4%, and pass@16 from 18.5% to 37.9%. EVOL-RL not only prevents diversity collapse but also unlocks stronger generalization across domains (e.g., GPQA). Furthermore, we demonstrate that EVOL-RL also boosts performance in the RLVR setting, highlighting its broad applicability.

大型语言模型（LLM）越来越多地采用可验证奖励的强化学习（RLVR）进行训练，但在现实世界的部署中，需要模型能够在没有标签或外部评判的情况下自我改进。现有的无标签方法，如最小化置信度、自我一致性或多数投票目标，虽然可以稳定学习，但会不断缩小探索范围，导致熵崩溃：生成的内容变得更短、更少样化和脆弱。

论文及项目相关链接

PDF

Summary

大型语言模型在强化学习验证奖励（RLVR）中进行训练，但在现实世界的部署中需要模型能够自我改进，无需标签或外部评判。现有的无标签方法，如信心最小化、自我一致性或多数投票目标，虽然可以稳定学习，但会缩减探索，导致熵崩溃，生成的文本变短、缺乏多样性且脆弱。本文提出一种名为EVOL-RL的进化导向无标签强化学习规则，旨在解决这一问题。EVOL-RL将多数投票答案作为稳定锚点（选择），同时增加一个重视响应推理的新颖性奖励（变异），在语义空间中衡量。该方法防止了崩溃，保持了更长的、更有信息量的思考链，并提高了pass@1和pass@n。EVOL-RL在无需标签的AIME24上的表现始终优于仅使用多数投票的TTRL基线。

Key Takeaways

大型语言模型采用强化学习验证奖励（RLVR）进行训练。
现有无标签方法虽稳定学习但会缩减探索，导致熵崩溃。
EVOL-RL旨在解决这一问题，结合稳定与变异在无标签设置下。
EVOL-RL采用多数投票答案作为稳定锚点，同时增加重视响应推理的新颖性奖励。
EVOL-RL在无需标签的AIME24上表现优异，提高了pass@1和pass@n。
EVOL-RL不仅防止多样性崩溃，还提高了跨领域的泛化能力。

Cool Papers

点此查看论文截图

Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Authors:Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau

Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as \textit{grounding tokens}, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding due to the inability to fully integrate the cues in the text query (\textit{e.g.}, attributes, actions) for inference. Based on these insights, we propose a MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries for inquiring the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model’s attention to reliable spatial and temporal related visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy to assemble the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.

时空视频定位（STVG）旨在根据输入的文本查询定位视频中的时空管道。在本文中，我们利用多模态大型语言模型（MLLMs）来探索STVG中的零样本解决方案。我们揭示了关于MLLMs的两个关键见解：（1）MLLMs倾向于动态分配特殊令牌，称为“定位令牌”，用于定位文本查询；（2）由于无法完全整合文本查询中的线索（例如属性、动作）进行推理，MLLMs经常遭受次优定位。基于这些见解，我们提出了基于MLLM的STVG零样本框架，该框架包括新颖的分解时空突出显示（DSTH）和时间增强组装（TAS）策略，以释放MLLM的推理能力。DSTH策略首先将原始查询分解为属性和动作子查询，以询问目标在空间和时间上的存在。然后，它使用新型的逻辑引导再注意（LRA）模块来学习作为空间和时间提示的潜在变量，通过规范化每个子查询的令牌预测。这些提示分别突出属性和动作线索，引导模型关注可靠的空间和时间相关的视觉区域。此外，由于属性子查询的空间定位在时间上应该是一致的，我们引入了TAS策略，使用原始视频帧和时间增强帧作为输入来组合预测，以帮助提高时间一致性。我们在各种MLLM上评估了我们的方法，并在三个常见的STVG基准测试上证明了其优于最新方法。相关代码将在https://github.com/zaiquanyang/LLaVA_Next_STVG中提供。

论文及项目相关链接

PDF

Summary

本文利用多模态大语言模型（MLLMs）探索时空视频定位（STVG）的零样本解决方案。研究发现MLLMs在定位时倾向于使用特定的标记，但受限于无法完全整合文本查询中的线索。为此，提出了基于MLLM的零样本STVG框架，包括分解时空突出显示（DSTH）和时间增强装配（TAS）策略，以释放MLLM的推理能力。通过DSTH策略将查询分解为属性动作子查询，并利用LR模块学习潜在变量作为时空提示。此外，采用TAS策略结合原始视频帧和时间增强帧进行预测，以提高时间一致性。在多个MLLM上的实验表明，该方法在三个常见的STVG基准测试上优于现有技术。

Key Takeaways

利用多模态大语言模型（MLLMs）探索时空视频定位（STVG）的零样本解决方案。
MLLMs在定位时倾向于使用特定的标记，即“grounding tokens”，但仍存在难以整合文本查询中的所有线索的问题。
提出了一种基于MLLM的零样本STVG框架，包括分解时空突出显示（DSTH）和时间增强装配（TAS）策略。
DSTH策略通过将查询分解为属性动作子查询并利用LR模块学习潜在变量，以突出属性动作线索并引导模型关注可靠的时空相关视觉区域。
TAS策略结合原始视频帧和时间增强帧进行预测，旨在提高时间一致性。
在多个MLLM上的实验验证了该方法的有效性，并在三个常见的STVG基准测试上表现出优越性能。

Cool Papers

点此查看论文截图

Mind the Gap: Data Rewriting for Stable Off-Policy Supervised Fine-Tuning

Authors:Shiwan Zhao, Xuyang Zhao, Jiaming Zhou, Aobo Kong, Qicheng Li, Yong Qin

Supervised fine-tuning (SFT) of large language models can be viewed as an off-policy learning problem, where expert demonstrations come from a fixed behavior policy while training aims to optimize a target policy. Importance sampling is the standard tool for correcting this distribution mismatch, but large policy gaps lead to high variance and training instability. Existing approaches mitigate this issue using KL penalties or clipping, which passively constrain updates rather than actively reducing the gap. We propose a simple yet effective data rewriting framework that proactively shrinks the policy gap by keeping correct solutions as on-policy data and rewriting incorrect ones with guided re-solving, falling back to expert demonstrations only when needed. This aligns the training distribution with the target policy before optimization, reducing importance sampling variance and stabilizing off-policy fine-tuning. Experiments on five mathematical reasoning benchmarks demonstrate consistent and significant gains over both vanilla SFT and the state-of-the-art Dynamic Fine-Tuning (DFT) approach. The data and code will be released at https://github.com/NKU-HLT/Off-Policy-SFT.

大型语言模型的监督微调（SFT）可以被视为一种离线学习策略问题，其中专家演示来自固定的行为策略，而训练的目标是优化目标策略。重要性采样是纠正这种分布不匹配的标准工具，但策略差距较大导致方差较大和训练不稳定。现有方法通过使用KL惩罚或裁剪来缓解这个问题，这些方法被动地约束更新，而不是主动地缩小差距。我们提出了一种简单有效的数据重写框架，它通过保持正确的解决方案作为在线策略数据，并用引导解决方式重写错误的解决方案，仅在必要时回溯到专家演示。这在对优化之前对齐训练分布与目标策略，减少重要性采样方差并稳定离线微调。在五个数学推理基准测试上的实验表明，与普通的SFT和最新的动态微调（DFT）方法相比，该框架具有一致且显著的收益。数据和代码将在https://github.com/NKU-HLT/Off-Policy-SFT发布。

论文及项目相关链接

PDF

Summary
大型语言模型的监督微调（SFT）可以看作是一个离线策略学习问题，其中专家演示来自固定的行为策略，而训练的目标是优化目标策略。重要性采样是解决这种分布不匹配的标准工具，但当策略差距较大时，会导致高方差和训练不稳定。我们提出了一种简单有效的数据重写框架，通过保持正确的解决方案为在线策略数据，并用指导解决方式重写错误的解决方案，仅在必要时回退到专家演示，这可以在优化之前对齐训练分布与目标策略，减少重要性采样的方差并稳定离线策略的微调。在五个数学推理基准测试上的实验表明，与普通的SFT和最新的动态微调（DFT）方法相比，该框架具有一致且显著的增益。

Key Takeaways

大型语言模型的监督微调可以被视为离线策略学习问题。
重要性采样是解决分布不匹配的标准工具，但在策略差距较大时存在高方差和训练不稳定的问题。
现有方法使用KL惩罚或裁剪来被动约束更新，而非主动缩小策略差距。
提出的简单而有效的数据重写框架通过保持正确的解决方案并主动缩小策略差距来改善训练稳定性。
该框架在五个数学推理基准测试上表现优于普通SFT和最新的DFT方法。
该方法和实验细节将在NKU-HLT的Off-Policy SFT项目中发布。

Cool Papers

点此查看论文截图

Self-Improving Embodied Foundation Models

Authors:Seyed Kamyar Seyed Ghasemipour, Ayzaan Wahid, Jonathan Tompson, Pannag Sanketi, Igor Mordatch

Foundation models trained on web-scale data have revolutionized robotics, but their application to low-level control remains largely limited to behavioral cloning. Drawing inspiration from the success of the reinforcement learning stage in fine-tuning large language models, we propose a two-stage post-training approach for robotics. The first stage, Supervised Fine-Tuning (SFT), fine-tunes pretrained foundation models using both: a) behavioral cloning, and b) steps-to-go prediction objectives. In the second stage, Self-Improvement, steps-to-go prediction enables the extraction of a well-shaped reward function and a robust success detector, enabling a fleet of robots to autonomously practice downstream tasks with minimal human supervision. Through extensive experiments on real-world and simulated robot embodiments, our novel post-training recipe unveils significant results on Embodied Foundation Models. First, we demonstrate that the combination of SFT and Self-Improvement is significantly more sample-efficient than scaling imitation data collection for supervised learning, and that it leads to policies with significantly higher success rates. Further ablations highlight that the combination of web-scale pretraining and Self-Improvement is the key to this sample-efficiency. Next, we demonstrate that our proposed combination uniquely unlocks a capability that current methods cannot achieve: autonomously practicing and acquiring novel skills that generalize far beyond the behaviors observed in the imitation learning datasets used during training. These findings highlight the transformative potential of combining pretrained foundation models with online Self-Improvement to enable autonomous skill acquisition in robotics. Our project website can be found at https://self-improving-efms.github.io .

基于网络规模数据训练的基石模型已经为机器人技术带来了革命性的变革，但其在低级控制方面的应用仍然主要局限于行为克隆。受大型语言模型微调中强化学习阶段成功的启发，我们为机器人提出了一种两阶段的后训练法。第一阶段是监督微调（SFT），使用a）行为克隆和b）待进步预测目标对预训练的基石模型进行微调。在第二阶段，即自我提升阶段，待进步预测有助于提取形状良好的奖励函数和稳健的成功检测器，从而允许一系列机器人在几乎无需人工监督的情况下自主练习下游任务。通过在实际机器人和模拟机器人上的大量实验，我们的新型后训练配方在嵌入式基石模型上取得了显著成果。首先，我们证明了SFT与自我提升的结合在样本效率上明显优于扩大模仿数据的收集用于监督学习，并且它导致的政策成功率更高。进一步的消融研究强调，结合网络规模预训练和自我提升是这种样本效率的关键。接下来，我们证明了我们的组合具有目前方法无法实现的独特能力：自主练习并获取超出在训练期间使用的模仿学习数据集中观察到的行为的泛化能力。这些发现凸显了将预训练的基石模型与在线自我提升相结合，以实现机器人在技能获取方面的自主性的变革潜力。我们的项目网站可以在https://self-improving-efms.github.io找到。

论文及项目相关链接

PDF Appearing in the Conference on Neural Information Processing Systems (NeurIPS 2025)

摘要
基于互联网规模数据的预训练模型已在机器人领域引发革命，但其应用于低层次控制主要局限于行为克隆。受大型语言模型微调中强化学习阶段成功的启发，我们提出了一种用于机器人的两阶段后训练法。第一阶段为监督微调（SFT），使用行为克隆和步距预测目标对预训练模型进行微调。第二阶段为自我提升，步距预测可提取出形状良好的奖励函数和稳健的成功检测器，使机器人群体能够在极少人类监督的情况下自主练习下游任务。通过在实际和模拟机器人实体上的广泛实验，我们的新型后训练配方在嵌入式基础模型上取得了显著成果。首先，我们证明了SFT与自我提升的结合比扩大模仿数据收集进行有监督学习更加样本高效，且产生的策略成功率更高。进一步的剖析表明，结合互联网规模预训练和自我提升是样本效率的关键。接下来，我们证明了我们的组合独特地解锁了一种当前方法无法实现的能力：自主练习并获取超越训练过程中使用的模仿学习数据集的新技能。这些发现突显了将预训练基础模型与在线自我提升相结合，在机器人领域实现自主技能获取的变革性潜力。

关键见解

提出了一种两阶段后训练法用于机器人领域，包括监督微调（SFT）和自我提升阶段。
SFT结合了行为克隆和步距预测目标，以优化预训练模型的性能。
自我提升阶段通过步距预测提取奖励函数和成功检测器，使机器人能自主实践下游任务。
与仅通过扩大模仿数据收集进行有监督学习相比，SFT与自我提升的结合更加样本高效，且策略成功率更高。
结合互联网规模预训练和自我提升是取得样本效率的关键。
该方法能够解锁自主获取并练习新技能的能力，这些技能能够泛化到超越训练期间观察到的行为。

Cool Papers

点此查看论文截图

A1: Asynchronous Test-Time Scaling via Conformal Prediction

Authors:Jing Xiong, Qiujiang Chen, Fanghua Ye, Zhongwei Wan, Chuanyang Zheng, Chenyang Zhao, Hui Shen, Alexander Hanbo Li, Chaofan Tao, Haochen Tan, Haoli Bai, Lifeng Shang, Lingpeng Kong, Ngai Wong

Large language models (LLMs) benefit from test-time scaling, but existing methods face significant challenges, including severe synchronization overhead, memory bottlenecks, and latency, especially during speculative decoding with long reasoning chains. We introduce A1 (Asynchronous Test-Time Scaling), a statistically guaranteed adaptive inference framework that addresses these challenges. A1 refines arithmetic intensity to identify synchronization as the dominant bottleneck, proposes an online calibration strategy to enable asynchronous inference, and designs a three-stage rejection sampling pipeline that supports both sequential and parallel scaling. Through experiments on the MATH, AMC23, AIME24, and AIME25 datasets, across various draft-target model families, we demonstrate that A1 achieves a remarkable 56.7x speedup in test-time scaling and a 4.14x improvement in throughput, all while maintaining accurate rejection-rate control, reducing latency and memory overhead, and no accuracy loss compared to using target model scaling alone. These results position A1 as an efficient and principled solution for scalable LLM inference. We have released the code at https://github.com/menik1126/asynchronous-test-time-scaling.

大型语言模型（LLM）受益于测试时间缩放，但现有方法面临重大挑战，包括严重的同步开销、内存瓶颈和延迟，特别是在具有长推理链的投机解码过程中。我们引入了A1（异步测试时间缩放），这是一种统计上保证的自适应推理框架，解决了这些挑战。A1通过优化算术强度来确定同步是主要的瓶颈，提出了在线校准策略以实现异步推理，并设计了一个三阶段拒绝采样管道，支持顺序和并行缩放。我们在MATH、AMC23、AIME24和AIME25数据集上，对各种草稿目标模型家族进行了实验，结果表明，A1在测试时间缩放方面实现了惊人的56.7倍加速，吞吐量提高了4.14倍，同时保持了精确的拒绝率控制，降低了延迟和内存开销，且相较于仅使用目标模型缩放，没有精度损失。这些结果使A1成为高效且有原则的可扩展LLM推理解决方案。我们已将代码发布在https://github.com/menik1126/asynchronous-test-time-scaling上。

论文及项目相关链接

PDF Tech Report

Summary

本论文介绍了大型语言模型（LLMs）测试时间缩放面临的挑战，并提出了一种名为A1的异步测试时间缩放框架。该框架通过优化算术强度识别同步瓶颈，采用在线校准策略实现异步推理，并设计了一个支持串行和并行缩放的三阶段拒绝采样管道。实验结果表明，A1在测试时间缩放方面实现了显著的速度提升，同时保持了高效的拒绝率控制、降低延迟和内存开销，且没有损失精度。

Key Takeaways

LLMs可从测试时间缩放中获益，但现有方法存在同步开销、内存瓶颈和延迟等挑战。
A1框架是一种针对这些问题的异步测试时间缩放解决方案，具有统计保证的适应性推理。
A1通过优化算术强度识别同步瓶颈，并提出在线校准策略实现异步推理。
A1设计了一个三阶段拒绝采样管道，支持串行和并行缩放。
实验结果表明，A1在多个数据集和模型家族上实现了显著的速度提升，最高达到56.7倍。
A1在保持拒绝率控制、降低延迟和内存开销的同时，没有损失精度。

Cool Papers

点此查看论文截图

From Pixels to Urban Policy-Intelligence: Recovering Legacy Effects of Redlining with a Multimodal LLM

Authors:Anthony Howell, Nancy Wu, Sharmistha Bagchi, Yushim Kim, Chayn Sun

This paper shows how a multimodal large language model (MLLM) can expand urban measurement capacity and support tracking of place-based policy interventions. Using a structured, reason-then-estimate pipeline on street-view imagery, GPT-4o infers neighborhood poverty and tree canopy, which we embed in a quasi-experimental design evaluating the legacy of 1930s redlining. GPT-4o recovers the expected adverse socio-environmental legacy effects of redlining, with estimates statistically indistinguishable from authoritative sources, and it outperforms a conventional pixel-based segmentation baseline-consistent with the idea that holistic scene reasoning extracts higher-order information beyond object counts alone. These results position MLLMs as policy-grade instruments for neighborhood measurement and motivate broader validation across policy-evaluation settings.

这篇论文展示了多模态大型语言模型（MLLM）如何扩大城市测量能力，并支持基于地点的政策干预的跟踪。通过使用街道景观图像的结构化、推理估算流程，GPT-4o推断出邻里贫困和树冠覆盖情况，我们将这些情况嵌入到对20世纪30年代红线政策的遗产进行准实验设计评估中。GPT-4o恢复了红线政策预期的负面社会环境影响，其估计值与权威来源无法区分，且表现优于基于像素的常规分割基线，这符合整体场景推理能够提取出单纯计数对象之外的高阶信息的理念。这些结果确立了MLLM作为政策级邻里测量工具的地位，并激励我们在更广泛的政策评估环境中进行验证。

论文及项目相关链接

PDF

Summary

这篇论文展示了多模态大型语言模型（MLLM）如何提升城市测量能力并支持基于地点的政策干预跟踪。研究通过运用街道景观图像的结构化推理估计流程，使用GPT-4o推断出邻里贫困和树木覆盖情况，并嵌入准实验设计评估了上世纪30年代红线政策的遗留影响。GPT-4o恢复了预期的红线政策带来的不良社会环境影响，其估计结果与权威来源无法区分，且优于传统的基于像素的分割基线。这表明整体场景推理能够提取超出单纯物体计数的高阶信息。这些结果确立了MLLMs在邻里测量方面的政策级地位，并激励我们在更广泛的政策评估环境中进行验证。

Key Takeaways

多模态大型语言模型（MLLM）可提升城市测量能力，支持跟踪基于地点的政策干预。
GPT-4o能通过街道景观图像推断邻里贫困和树木覆盖情况。
GPT-4o在评估红线政策的遗留影响方面表现出色，其估计结果与权威来源相近。
GPT-4o的表现在评估方面优于传统的基于像素的分割基线。
整体场景推理能够提取超出单纯物体计数的高阶信息。
MLLMs可成为政策评估的可靠工具，特别是在邻里测量方面。

Cool Papers

点此查看论文截图

Evaluating Large Language Models for Cross-Lingual Retrieval

Authors:Longfei Zuo, Pingjun Hong, Oliver Kraus, Barbara Plank, Robert Litschko

Multi-stage information retrieval (IR) has become a widely-adopted paradigm in search. While Large Language Models (LLMs) have been extensively evaluated as second-stage reranking models for monolingual IR, a systematic large-scale comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior work shows that LLM-based rerankers improve CLIR performance, their evaluation setup relies on lexical retrieval with machine translation (MT) for the first stage. This is not only prohibitively expensive but also prone to error propagation across stages. Our evaluation on passage-level and document-level CLIR reveals that further gains can be achieved with multilingual bi-encoders as first-stage retrievers and that the benefits of translation diminishes with stronger reranking models. We further show that pairwise rerankers based on instruction-tuned LLMs perform competitively with listwise rerankers. To the best of our knowledge, we are the first to study the interaction between retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that, without MT, current state-of-the-art rerankers fall severely short when directly applied in CLIR.

多阶段信息检索（IR）已成为搜索中广泛采用的范式。虽然大型语言模型（LLM）作为单语IR的第二阶段重排序模型已经得到了广泛评估，但在跨语言IR（CLIR）方面，仍缺乏系统的大规模比较。此外，虽然先前的工作表明，基于LLM的重排序器可以改善CLIR的性能，但它们的评估设置依赖于机器翻译（MT）的第一阶段词汇检索。这不仅成本高昂，而且在各阶段之间容易出现错误传播。我们在段落级和文档级的CLIR评估中发现，使用多语言双编码器作为第一阶段的检索器可以取得进一步的收益，并且随着重排序模型的增强，翻译的优势在减小。我们还表明，基于指令训练LLM的配对重排序器与列表重排序器的表现具有竞争力。据我们所知，我们是首批在两阶段CLIR中研究检索器和重排序器交互的LLM的研究人员。我们的研究结果表明，在不使用机器翻译的情况下，当前最先进的重排序器在直接应用于CLIR时存在严重缺陷。

论文及项目相关链接

PDF Accepted at EMNLP 2025 (Findings)

Summary

基于多阶段信息检索在自然搜索中的广泛应用，大语言模型在第二阶段的排序模型方面已在单语信息检索中得到广泛评估。尽管之前的研究显示，基于大型语言模型的排序器能提升跨语言信息检索（CLIR）的性能，但其评估设置依赖于机器翻译进行第一阶段的词汇检索，这既昂贵又容易出现阶段间的错误传播。本文采用跨语言情境评估不同阶段的CLIR方法，揭示采用多语言bi编码器作为第一阶段的检索器可取得进一步的提高，而且翻译的好处会随着更强大的排序模型的增加而减少。另外还发现，基于指令调优的大型语言模型的配对排序器与列表排序器的性能竞争相当。据了解，我们首次研究了两阶段CLIR的大型语言模型检索器和排序器之间的交互作用。研究发现，如果不使用机器翻译，当前先进的排序器在直接应用于CLIR时表现较差。

Key Takeaways

大型语言模型在跨语言信息检索（CLIR）的第二阶段排序模型中的评估仍然缺乏系统性大规模对比。
基于大型语言模型的排序器能提高CLIR性能，但依赖于机器翻译的第一阶段检索成本高昂且易出现错误传播。
采用多语言bi编码器作为第一阶段的检索器可以进一步提高CLIR性能。
随着排序模型的增强，翻译的重要性逐渐减少。
基于指令调优的大型语言模型的配对排序器表现良好。
在两阶段CLIR中，大型语言模型的检索器和排序器之间的交互作用尚未得到充分研究。

Cool Papers

点此查看论文截图

CodeLSI: Leveraging Foundation Models for Automated Code Generation with Low-Rank Optimization and Domain-Specific Instruction Tuning

Authors:Huy Le, Phong Nguyen, Hao Do, Tuan Nguyen, Thien Pham, Anh Nguyen-Duc, Tho Quan

Context: Automated code generation using Foundation Models (FMs) offers promising solutions for enhancing software development efficiency. However, challenges remain in ensuring domain specificity, cost-effectiveness, and security - especially when relying on third-party APIs. This paper introduces CodeLSI, a framework that combines low-rank optimization and domain-specific instruction tuning to address these challenges. Objectives: The aim of this study is to develop and evaluate CodeLSI, a novel approach for generating high-quality code tailored to specific domains, using FMs fine-tuned on company infrastructure without dependence on external APIs. Methods: CodeLSI applies low-rank adaptation techniques to reduce the computational cost of model pre-training and fine-tuning. Domain-specific instruction tuning is employed to align code generation with organizational needs. We implemented and tested the framework on real-world JavaScript coding tasks using datasets drawn from internal software projects. Results: Experimental evaluations show that CodeLSI produces high-quality, context aware code. It outperforms baseline models in terms of relevance, accuracy, and domain fit. The use of low-rank optimization significantly reduced resource requirements, enabling scalable training on company-owned infrastructure. Conclusion: CodeLSI demonstrates that combining low-rank optimization with domain specific tuning can enhance the practicality and performance of FMs for automated code generation. This approach provides a secure, cost-efficient alternative to commercial API based solutions and supports faster, more targeted innovation in software development.

背景：利用基础模型（FMs）进行自动代码生成为提高软件开发效率提供了有前景的解决方案。然而，在保障领域特异性、成本效益和安全方面仍存在挑战，特别是在依赖第三方API时。本文介绍了CodeLSI框架，它结合了低秩优化和领域特定指令调整来解决这些挑战。

目标：本研究的目标是开发和评估CodeLSI，这是一种利用FMs生成针对特定领域的高质量代码的新型方法，通过对公司内部基础设施进行微调，不依赖外部API。

方法：CodeLSI应用低秩适应技术来降低模型预训练和精细调整的计算成本。采用领域特定指令调整使代码生成与组织需求保持一致。我们在现实世界的JavaScript编程任务上实现了该框架，并使用从内部软件项目中抽取的数据集进行了测试。

结果：实验评估表明，CodeLSI产生了高质量、上下文感知的代码。它在相关性、准确性和领域适应性方面优于基准模型。低秩优化的使用显著降低了资源要求，能够在公司拥有的基础设施上进行可扩展的培训。

论文及项目相关链接

PDF

摘要

本文介绍了一种名为CodeLSI的新框架，该框架旨在使用基础模型（FMs）针对特定领域生成高质量代码，从而提高软件开发效率。CodeLSI结合了低阶优化和领域特定指令调整，以解决在计算成本、安全性和领域特异性方面的挑战。通过对真实世界JavaScript编程任务的测试，CodeLSI表现出卓越的性能，生成了高质量、上下文感知的代码，优于基线模型。此外，低阶优化显著降低了资源需求，使在公司自有基础设施上进行可扩展性训练成为可能。CodeLSI展示了将低阶优化与领域特定调整相结合可以增强自动代码生成的实用性，为基于商业API的解决方案提供了安全、成本效益高的替代方案，并支持更快、更有针对性的软件开发创新。

关键见解

CodeLSI框架结合了低阶优化和领域特定指令调整，旨在提高软件开发效率并满足特定需求。
该框架旨在解决在计算成本、安全性和领域特异性方面的挑战。
CodeLSI通过真实世界的JavaScript编程任务测试，表现出卓越的性能。
CodeLSI生成的高质量代码具有上下文感知能力，优于基线模型。
低阶优化显著降低了资源需求，使训练过程更具可扩展性。
CodeLSI提供了一个安全、成本效益高的替代方案，用于基于商业API的解决方案。

Cool Papers

点此查看论文截图

Don’t Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning

Authors:Bo Yin, Xingyi Yang, Xinchao Wang

Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce \textbf{NoRA}, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), achieving accuracy gains of +0.17% and +0.27%. When combined with LoRA (\textbf{NoRA++}), it outperforms LoRA and DoRA under matched training budgets by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3%–0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.

现有参数高效微调（PEFT）方法主要适应权重矩阵，同时保持激活函数固定。我们引入了NoRA，这是第一个直接适应预训练基于transformer模型的非线性激活函数的PEFT框架。NoRA用可学习的有理函数替换固定激活函数，并对分子和分母系数应用结构化低秩更新，采用分组设计实现局部适应，并在几乎不增加成本的情况下提高稳定性。在CIFAR-10和CIFAR-100上训练的视觉变压器中，NoRA在仅更新0.4%（即0.02M）的参数时，就能达到或超过完全微调的效果，准确率分别提高了+0.17%和+0.27%。与LoRA结合使用时（**NoRA++**），在匹配的训练预算下，通过添加更少的可训练参数，它优于LoRA和DoRA。在LLaMA3-8B指令调整中，NoRA++持续提高生成质量，平均MMLU增益为+0.3%~+0.8%，其中STEM（Alpaca）上提高+1.6%，OpenOrca上提高+1.3%。我们进一步表明，NoRA将适应约束在低维函数子空间内，隐含地正则化更新幅度和方向。这些结果确立了激活空间调整作为基于权重的PEFT的互补和高度参数高效的替代方案，将激活函数定位为模型适应的一流对象。

论文及项目相关链接

PDF

Summary

本文介绍了一种新的参数高效微调（PEFT）方法——NoRA，它直接适应预训练transformer模型中的非线性激活函数。NoRA用可学习的有理函数替换固定激活，并对分子和分母系数应用结构化低秩更新，以局部化适应并提高稳定性，同时成本极低。在CIFAR-10和CIFAR-100的视觉上，NoRA匹配或超过全微调，仅更新0.4%的参数（0.02M），准确率提高+0.17%和+0.27%。与LoRA结合使用时（NoRA++），在匹配的训练预算下，通过添加更少的可训练参数，它优于LoRA和DoRA。在LLaMA3-8B指令调整中，NoRA++持续提高生成质量，平均MMLU增益为+0.3%~+0.8%，其中STEM（Alpaca）和OpenOrca分别提高+1.6%和+1.3%。此外，NoRA将适应限制在低维功能子空间中，隐式地规范更新幅度和方向。这些结果确立了激活空间调整作为权重基础PEFT的互补和高度参数高效替代方案，将激活函数定位为模型适应的一流对象。

Key Takeaways

NoRA是一种新的参数高效微调（PEFT）方法，专注于适应预训练transformer模型中的非线性激活函数。
NoRA通过用可学习的有理函数替换固定激活，并利用结构化低秩更新来提高模型的稳定性和性能。
在视觉任务上，NoRA在参数更新极少的情况下，能够达到或超过全调的效果，并有一定的准确率提升。
NoRA++是NoRA与LoRA的结合，它在匹配的训练预算下表现出更好的性能。
在LLaMA3-8B指令调整中，NoRA++显著提高生成质量。
NoRA将模型适应限制在低维功能子空间中，隐式地规范更新幅度和方向。
这些结果确立了激活空间调整作为参数高效微调的一种重要且有效的替代方案。

Cool Papers

点此查看论文截图

Probing the Representational Power of Sparse Autoencoders in Vision Models

Authors:Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng

Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LMMs and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential improving interpretability, generalization, and steerability in the visual domain.

稀疏自编码器（Sparse Autoencoders，简称SAE）已成为解释大型语言模型（Large Language Models，简称LLM）隐藏状态的一种流行工具。通过学会从稀疏瓶颈层重建激活，SAE从LLM的高维内部表示中发现可解释的特征。尽管SAE在语言模型中很受欢迎，但在视觉领域它们仍被研究得不够深入。在这项工作中，我们通过对一系列图像任务的大量评估，全面评估了SAE在视觉模型中的表征能力。实验结果表明，SAE特征是语义上有意义的，能够改善分布外的泛化能力，并在三种视觉模型架构中实现了可控生成：视觉嵌入模型、多模态LLM和扩散模型。在视觉嵌入模型中，我们发现学到的SAE特征可用于OOD检测，并提供证据表明它们恢复了底层模型的本体结构。对于扩散模型，我们展示了SAE通过文本编码器操作实现语义转向，并开发了一个自动化管道来发现人类可解释的属性。最后，我们对多模态LLM进行了探索性实验，发现证据表明SAE特征揭示了跨视觉和语言模态的共享表示。我们的研究为SAE在视觉模型中的评估奠定了基础，突显了它们在提高视觉领域的可解释性、泛化和可控性方面的强大潜力。

论文及项目相关链接

PDF ICCV 2025 Findings

Summary

基于Sparse Autoencoders（SAE）在大型语言模型（LLM）中的解释隐藏状态的应用，本文对其在视觉模型中的代表性能力进行了广泛的评估。实验结果表明，SAE特征具有语义意义，可提高离群分布泛化能力，并在三种视觉模型架构中实现可控生成。本文的研究为SAE在视觉模型中的应用提供了评估基础，突显其在提高视觉领域的解释性、泛化和可控性方面的潜力。

Key Takeaways

SAEs用于解析视觉模型的隐藏状态表现良好。
SAE特征具有语义意义，可提高模型的离群分布泛化能力。
SAEs在三种视觉模型架构中实现了可控生成。
在视觉嵌入模型中，SAE特征可用于离群值检测，并揭示了底层模型的本体结构。
在扩散模型中，SAE通过文本编码器操作实现了语义控制，并开发了发现人类可解释属性的自动化管道。
SAE特征在多模态LLM中显示出跨视觉和语言模态的共享表示。
本文研究为SAE在视觉模型中的应用提供了评估基础。

Cool Papers

点此查看论文截图

SMART: Simulated Students Aligned with Item Response Theory for Question Difficulty Prediction

Authors:Alexander Scarlatos, Nigel Fernandez, Christopher Ormerod, Susan Lottridge, Andrew Lan

Item (question) difficulties play a crucial role in educational assessments, enabling accurate and efficient assessment of student abilities and personalization to maximize learning outcomes. Traditionally, estimating item difficulties can be costly, requiring real students to respond to items, followed by fitting an item response theory (IRT) model to get difficulty estimates. This approach cannot be applied to the cold-start setting for previously unseen items either. In this work, we present SMART (Simulated Students Aligned with IRT), a novel method for aligning simulated students with instructed ability, which can then be used in simulations to predict the difficulty of open-ended items. We achieve this alignment using direct preference optimization (DPO), where we form preference pairs based on how likely responses are under a ground-truth IRT model. We perform a simulation by generating thousands of responses, evaluating them with a large language model (LLM)-based scoring model, and fit the resulting data to an IRT model to obtain item difficulty estimates. Through extensive experiments on two real-world student response datasets, we show that SMART outperforms other item difficulty prediction methods by leveraging its improved ability alignment.

题目难度在教育评估中扮演重要角色，能够准确高效地评估学生能力并实现个性化，以最大化学习效果。传统上，估算题目难度可能会成本高昂，需要真实学生对题目做出回应，然后对题目反应理论（IRT）模型进行拟合以获取难度估算。这种方法也不能应用于之前未见过的题目的冷启动设置。在这项工作中，我们提出了SMART（与IRT对齐的模拟学生），这是一种将模拟学生与指定能力对齐的新方法，然后可在模拟中用于预测开放式题目的难度。我们通过直接偏好优化（DPO）实现这种对齐，根据真实IRT模型下响应的可能性来形成偏好对。我们通过生成数千个响应来进行模拟，使用基于大型语言模型（LLM）的评分模型进行评估，并将得到的数据拟合到IRT模型中以获得题目难度估算。我们在两个真实世界学生响应数据集上进行了大量实验，结果表明，利用改进的能力对齐功能，SMART在题目难度预测方法上具有优越性。

论文及项目相关链接

PDF Published in EMNLP 2025: The 2025 Conference on Empirical Methods in Natural Language Processing

摘要

在教育评估中，题目难度扮演着重要角色，能准确高效地评估学生能力，并根据个人情况进行调整，以最大化学习效果。传统上，估算题目难度成本较高，需真实学生对题目作出反应，再拟合项目反应理论（IRT）模型得到难度估算值。此方法无法应用于新题目的冷启动设置。本研究提出SMART（与IRT对齐的模拟学生）方法，通过指令能力对齐模拟学生，可在模拟中预测开放式题目的难度。我们采用直接偏好优化（DPO）技术实现对齐，根据真实IRT模型下的回答可能性形成偏好对。我们在两个真实学生回答数据集上进行了广泛实验，证明了SMART利用改进的能力对齐在题目难度预测方面的优势。

关键见解

题目难度在教育评估中至关重要，影响学生能力的准确和高效评估。
传统估算题目难度的方法成本较高，且无法应用于新题目的冷启动环境。
本研究提出了SMART方法，通过模拟学生与IRT模型的对齐，预测开放式题目的难度。
SMART采用直接偏好优化技术实现模拟学生与IRT模型的对齐。
在模拟中，通过生成数千个回答并使用大型语言模型（LLM）评分模型进行评估，再拟合IRT模型得到题目难度估算值。
在两个真实学生回答数据集上的实验表明，SMART在题目难度预测方面表现出色。

Cool Papers

点此查看论文截图

QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA

Authors:Jacob Dineen, Aswin RRV, Qin Liu, Zhikun Xu, Xiao Ye, Ming Shen, Zhaonan Li, Shijie Lu, Chitta Baral, Muhao Chen, Ben Zhou

Alignment of large language models (LLMs) with principles like helpfulness, honesty, and harmlessness typically relies on scalar rewards that obscure which objectives drive the training signal. We introduce QA-LIGN, which decomposes monolithic rewards into interpretable principle-specific evaluations through structured natural language programs. Models learn through a draft, critique, and revise pipeline, where symbolic evaluation against the rubrics provides transparent feedback for both initial and revised responses during GRPO training. Applied to uncensored Llama-3.1-8B-Instruct, QA-LIGN reduces attack success rates by up to 68.7% while maintaining a 0.67% false refusal rate, achieving Pareto optimal safety-helpfulness performance and outperforming both DPO and GRPO with state-of-the-art reward models given equivalent training. These results demonstrate that making reward signals interpretable and modular improves alignment effectiveness, suggesting transparency enhances LLM safety.

大型语言模型（LLM）与有助益性、诚实性和无害性等原则的对齐，通常依赖于标量奖励，这些奖励会掩盖驱动训练信号的多个目标。我们引入了QA-LIGN，它通过结构化的自然语言程序将单一的奖励分解为可解释的原则特定评估。模型通过草案、评论和修订管道进行学习，其中符号评估与评分标准对照为在GRPO训练期间的初始和修订响应提供透明反馈。在未经审查的Llama-3.1-8B-Instruct上应用QA-LIGN，可将攻击成功率降低高达68.7%，同时保持0.67%的误拒绝率，实现了帕累托最优的安全与有益性能，并且在等效训练的情况下使用最先进的奖励模型，其性能优于DPO和GRPO。这些结果表明，使奖励信号可解释和模块化可提高对齐效率，这表明透明度提高了LLM的安全性。

论文及项目相关链接

PDF Accepted to Findings of EMNLP 2025

Summary

大型语言模型（LLM）与诸如有用性、诚实性和无害性等原则的对齐通常依赖于标量奖励，这会掩盖驱动训练信号的各个目标。本研究提出了QA-LIGN，它通过结构化的自然语言程序将单一的奖励分解成可解释的原则特定评估，使模型学习一个草案、评估和修订的管道。在此过程中，符号评价提供初始和修订回应时的透明反馈，以促进GRPO训练。在对非管控的Llama-3.1-8B-Instruct的应用中，QA-LIGN将攻击成功率降低了高达68.7%，同时保持0.67%的误拒绝率，实现帕累托最优的安全性能和提高的有用性表现。结果显示，将奖励信号设计为可解释和模块化能够提高对齐效率，表明透明度增强了LLM的安全性。

Key Takeaways

大型语言模型（LLM）与原则的对齐通常依赖标量奖励，这导致训练信号的各个目标变得模糊。
QA-LIGN通过结构化的自然语言程序将奖励分解成可解释的原则特定评估。
模型学习包括草案、评估和修订的管道，以提高对齐效率。
在非管控的Llama模型中，QA-LIGN显著提高了安全性和性能表现。
QA-LIGN降低了攻击成功率高达68.7%，同时维持较低的误拒绝率。
奖励信号的透明度和模块化设计对于提高LLM的对齐效果至关重要。

Cool Papers

点此查看论文截图

PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models

Authors:Chenzhuo Zhao, Ziqian Liu, Xinda Wang, Junting Lu, Chaoyi Ruan

Prompt optimization is a practical and widely applicable alternative to fine tuning for improving large language model performance. Yet many existing methods evaluate candidate prompts by sampling full outputs, often coupled with self critique or human annotated preferences, which limits scalability, especially for smaller models or models that are not instruction tuned. We present PMPO (Probabilistic Metric Prompt Optimization), a unified framework that uses token level cross entropy as a direct, lightweight evaluation signal. PMPO locates low quality prompt segments via a masking based analysis and iteratively rewrites them to propose improved variants. Crucially, during evaluation, PMPO selects among variants by minimizing loss in a single forward pass, eliminating output sampling and human or judge based scoring for selection while still using standard generation only to propose rewrites. This unified, loss based strategy supports both supervised and preference based tasks. Across model sizes and datasets, PMPO outperforms prior prompt optimizers: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA RAT, and raises AlpacaEval 2.0 win rates by over 19 points. These results demonstrate PMPO’s effectiveness, efficiency, and broad applicability.

提示优化是一种实用的、广泛应用于改进大型语言模型性能的微调替代方法。然而，许多现有方法通过采样完整输出来评估候选提示，通常与自我批评或人工注释偏好相结合，这限制了可扩展性，特别是对于未进行指令调整的小型模型或模型。我们提出了PMPO（基于概率度量的提示优化），这是一个统一框架，使用令牌级别的交叉熵作为直接、轻量级的评估信号。PMPO通过基于遮罩的分析找到质量较低的提示片段，并对其进行迭代重写，以提出改进后的变体。关键的是，在评估过程中，PMPO通过最小化单次前向传递中的损失来选择变体，消除了输出采样和基于人类或法官的评分选择，同时仍仅使用标准生成来提出重写。这种统一的、基于损失的策略既支持有监督的任务，也支持基于偏好的任务。在各种模型大小和数据集上，PMPO的表现都优于先前的提示优化器：它在BBH上达到了最高的平均准确率，在GSM8K和AQUA RAT上表现强劲，并将AlpacaEval 2.0的胜率提高了超过19个百分点。这些结果证明了PMPO的有效性、效率和广泛的适用性。

论文及项目相关链接

PDF

Summary

本文主要介绍了一种名为PMPO（概率度量提示优化）的实用框架，它通过直接使用标记级别的交叉熵作为评估信号来优化大型语言模型的性能。与传统的基于采样输出、结合自我评估或人工标注偏好的评估方法不同，PMPO能够在无需采样输出的情况下找到质量不佳的提示段并进行迭代重写。它通过一次前向传递最小化损失来自动选择最佳提示变体，无需使用输出采样或基于人类或评委的评分机制进行筛选。这一基于损失的方法既支持监督任务也支持偏好任务，并在不同模型大小和数据集上均优于先前的提示优化器。

Key Takeaways