LLM

Published: 2025-10-10

Updated: 2025-11-27

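⚠️ All summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Note: do not use these summaries in serious academic settings; they are intended only for screening papers before reading them.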
2025-10-10 Update

Vibe Checker: Aligning Code Evaluation with Human Preference

Authors:Ming Zhong, Xiang Zhou, Ting-Yun Chang, Qingze Wang, Nan Xu, Xiance Si, Dan Garrette, Shyam Upadhyay, Jeremiah Liu, Jiawei Han, Benoit Schillings, Jiao Sun

Large Language Models (LLMs) have catalyzed vibe coding, where users leverage LLMs to generate and iteratively refine code through natural language interactions until it passes their vibe check. Vibe check is tied to real-world human preference and goes beyond functionality: the solution should feel right, read cleanly, preserve intent, and remain correct. However, current code evaluation remains anchored to pass@k and captures only functional correctness, overlooking the non-functional instructions that users routinely apply. In this paper, we hypothesize that instruction following is the missing piece underlying vibe check that represents human preference in coding besides functional correctness. To quantify models’ code instruction following capabilities with measurable signals, we present VeriCode, a taxonomy of 30 verifiable code instructions together with corresponding deterministic verifiers. We use the taxonomy to augment established evaluation suites, resulting in Vibe Checker, a testbed to assess both code instruction following and functional correctness. Upon evaluating 31 leading LLMs, we show that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates the best with human preference, with the latter emerging as the primary differentiator on real-world programming tasks. Our work identifies core factors of the vibe check, providing a concrete path for benchmarking and developing models that better align with user preferences in coding.
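
To make the idea of a deterministic verifier concrete, here is a minimal sketch in Python. The two instructions it checks (a maximum line length and a docstring on every function), the aggregate rate, and all function names are illustrative assumptions; they are not taken from VeriCode itself.

```python
import ast


def check_max_line_length(code: str, limit: int = 79) -> bool:
    """Return True if no line in `code` exceeds `limit` characters."""
    return all(len(line) <= limit for line in code.splitlines())


def check_functions_have_docstrings(code: str) -> bool:
    """Return True if every function defined in `code` has a docstring."""
    tree = ast.parse(code)
    funcs = [node for node in ast.walk(tree)
             if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))]
    return all(ast.get_docstring(f) is not None for f in funcs)


def instruction_following_rate(code: str, verifiers) -> float:
    """Fraction of verifiers that `code` satisfies (hypothetical aggregate)."""
    results = [verify(code) for verify in verifiers]
    return sum(results) / len(results)


# A tiny generated solution and the instructions it is checked against.
sample = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
verifiers = [check_max_line_length, check_functions_have_docstrings]
rate = instruction_following_rate(sample, verifiers)
```

Each verifier is a pure function of the source text, so the same code always receives the same verdict, which is what makes this kind of check usable as a measurable signal alongside unit tests.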

Paper and project links

PDF Preprint

Summary

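Large language models (LLMs) have given rise to "vibe coding", in which users generate code and refine it iteratively through natural-language interaction until it passes their vibe check. Current code evaluation, however, focuses on functional correctness and overlooks users' non-functional instructions. The paper hypothesizes that instruction following is the missing piece of the vibe check and reflects human preference in coding. To quantify how well models follow code instructions, the authors present VeriCode, a taxonomy of 30 verifiable code instructions with corresponding deterministic verifiers, and use it to augment established evaluation suites, yielding Vibe Checker, a testbed for both code instruction following and functional correctness. An evaluation of 31 leading LLMs shows that even the strongest models struggle to comply with multiple instructions and exhibit clear functional regression. Most importantly, a composite score of functional correctness and instruction following correlates best with human preference, with instruction following emerging as the primary differentiator on real-world programming tasks. The work identifies the core factors of the vibe check and offers a concrete path for benchmarking and developing models that better match user preferences in coding.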

Key Takeaways

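LLMs have catalyzed vibe coding, in which users generate and iteratively refine code through natural-language interaction.
Current code evaluation focuses on functional correctness (pass@k) and overlooks users' non-functional instructions.
Instruction following is hypothesized to be the missing piece of the vibe check, reflecting human preference in coding.
VeriCode is a taxonomy of 30 verifiable code instructions with corresponding deterministic verifiers.
Vibe Checker, built by augmenting established evaluation suites with VeriCode, assesses both code instruction following and functional correctness.
Even the strongest of the 31 LLMs evaluated struggle to comply with multiple instructions and exhibit functional regression.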

Prompt, Synthesize, Fine-Tune: A Secure Code Generation Recipe

Authors:Junjie Li, Fazle Rabbi, Bo Yang, Song Wang, Jinqiu Yang

Although Large Language Models (LLMs) show promising solutions to automated code generation, they often produce insecure code that threatens software security. Current approaches (e.g., SafeCoder) to improve secure code generation suffer from limited and imbalanced datasets, reducing their effectiveness and generalizability. In this work, we present Secure-Instruct, a novel framework that automatically synthesizes high-quality vulnerable and secure code examples, generates fine-tuning instructions, and instruction-tunes LLMs to align task description and secure code generation abilities. We evaluate Secure-Instruct on four representative LLMs using two benchmarks: our own CWEBench and the existing CWEval. CWEBench comprises 93 scenarios on 44 CWEs, all without overlap with Secure-Instruct’s synthetic instruction-tuning dataset, while CWEval covers 31 CWEs with 119 manually verified security-critical tasks. We find that Secure-Instruct improves not only the security but also the functional correctness of the generated code. On CWEBench, Secure-Instruct substantially improves secure code generation, giving a 14.3% average increase in secure ratio over the pretrained models and outperforms SafeCoder by 7.6%. On CWEval, Secure-Instruct achieves a 14% increase for CodeLlama-7B and 5.8% for Mistral-7B in Func-Sec@1 over pretrained models, and surpasses SafeCoder by 15.8% and 6.8% respectively.

Summary

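Large language models (LLMs) show great promise for automated code generation, but they often produce insecure code that threatens software security. Existing approaches to secure code generation, such as SafeCoder, are constrained by limited and imbalanced datasets, which reduces their effectiveness and generalizability. This work proposes Secure-Instruct, a framework that automatically synthesizes high-quality vulnerable and secure code examples, generates fine-tuning instructions, and instruction-tunes LLMs to align task descriptions with secure code generation. Evaluation on two benchmarks, the authors' own CWEBench and the existing CWEval, shows that Secure-Instruct improves both the security and the functional correctness of generated code. On CWEBench, it raises the secure ratio by 14.3% on average over the pretrained models and outperforms SafeCoder by 7.6%. On CWEval, it improves Func-Sec@1 over the pretrained models by 14% for CodeLlama-7B and 5.8% for Mistral-7B, and surpasses SafeCoder by 15.8% and 6.8% respectively.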

Key Takeaways

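LLMs used for automated code generation often produce insecure code that threatens software security.
Existing approaches such as SafeCoder are constrained by limited and imbalanced datasets, which hurts effectiveness and generalizability.
Secure-Instruct automatically synthesizes high-quality vulnerable and secure code examples, generates fine-tuning instructions, and instruction-tunes LLMs.
Secure-Instruct improves both the security and the functional correctness of generated code.
On CWEBench, Secure-Instruct raises the secure ratio by 14.3% on average over the pretrained models and outperforms SafeCoder by 7.6%.
On CWEval, Secure-Instruct improves Func-Sec@1 by 14% for CodeLlama-7B and 5.8% for Mistral-7B over the pretrained models.
Secure-Instruct offers a new approach to addressing the weaknesses of LLMs in secure code generation.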

TRIM: Token-wise Attention-Derived Saliency for Data-Efficient Instruction Tuning

Authors:Manish Nagaraj, Sakshi Choudhary, Utkarsh Saxena, Deepak Ravikumar, Kaushik Roy

Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and commonly relies on large, diverse corpora. However, small, high-quality subsets, known as coresets, can deliver comparable or superior results, though curating them remains challenging. Existing methods often rely on coarse, sample-level signals like gradients, an approach that is computationally expensive and overlooks fine-grained features. To address this, we introduce TRIM (Token Relevance via Interpretable Multi-layer Attention), a forward-only, token-centric framework. Instead of using gradients, TRIM operates by matching underlying representational patterns identified via attention-based “fingerprints” from a handful of target samples. Such an approach makes TRIM highly efficient and uniquely sensitive to the structural features that define a task. Coresets selected by our method consistently outperform state-of-the-art baselines by up to 9% on downstream tasks and even surpass the performance of full-data fine-tuning in some settings. By avoiding expensive backward passes, TRIM achieves this at a fraction of the computational cost. These findings establish TRIM as a scalable and efficient alternative for building high-quality instruction-tuning datasets.

Summary

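Instruction tuning is essential for aligning large language models (LLMs) to downstream tasks and usually relies on large, diverse corpora. Small, high-quality subsets known as coresets can deliver comparable or better results, but selecting them remains difficult. Existing methods often rely on coarse sample-level signals such as gradients, which are computationally expensive and overlook fine-grained features. TRIM (Token Relevance via Interpretable Multi-layer Attention) is a forward-only, token-centric framework that, instead of using gradients, matches underlying representational patterns identified through attention-based "fingerprints" from a handful of target samples. This makes TRIM efficient and sensitive to the structural features that define a task. Coresets selected by TRIM outperform state-of-the-art baselines by up to 9% on downstream tasks and in some settings even surpass full-data fine-tuning. Because it avoids expensive backward passes, TRIM does so at a fraction of the computational cost, making it a scalable and efficient way to build high-quality instruction-tuning datasets.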

Key Takeaways

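Instruction tuning is essential for aligning LLMs to downstream tasks and commonly relies on large, diverse corpora.
Small, high-quality coresets can match or exceed full-corpus results, but curating them is challenging.
Existing coreset selection methods rely on sample-level signals such as gradients, which are computationally expensive and overlook fine-grained features.
TRIM selects coresets by matching representational patterns identified through attention-based "fingerprints", without using gradients.
Coresets selected by TRIM outperform state-of-the-art baselines by up to 9% on downstream tasks.
By avoiding expensive backward passes, TRIM runs at a fraction of the computational cost.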

GPT-5 Model Corrected GPT-4V’s Chart Reading Errors, Not Prompting

Authors:Kaichun Yang, Jian Chen

We present a quantitative evaluation to understand the effect of zero-shot large language models (LLMs) and prompting on chart reading tasks. We asked LLMs to answer 107 visualization questions to compare inference accuracy between the agentic GPT-5 and the multimodal GPT-4V on difficult image instances for which GPT-4V had failed to produce correct answers. Our results show that model architecture dominates inference accuracy: GPT-5 largely improved accuracy, while prompt variants yielded only small effects. Pre-registration of this work is available here: https://osf.io/u78td/?view_only=6b075584311f48e991c39335c840ded3; the Google Drive materials are here: https://drive.google.com/file/d/1ll8WWZDf7cCNcfNWrLViWt8GwDNSvVrp/view.

Summary

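This paper quantitatively evaluates zero-shot large language models (LLMs) and prompt variants on chart reading tasks, comparing the inference accuracy of the agentic GPT-5 and the multimodal GPT-4V on 107 visualization questions built from difficult image instances that GPT-4V had answered incorrectly. The results show that model architecture dominates inference accuracy: GPT-5 largely improved accuracy, while prompt variants had only small effects.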

Key Takeaways

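The paper quantitatively evaluates zero-shot LLMs on chart reading tasks.
It compares the inference accuracy of GPT-5 and GPT-4V on 107 visualization questions.
Model architecture dominates inference accuracy, and GPT-5 largely improved accuracy.
Prompt variants had only small effects on inference accuracy.
A pre-registration link and Google Drive materials are provided.
On the difficult image instances studied, the choice of model mattered more than the choice of prompt.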

TimeFormer: Transformer with Attention Modulation Empowered by Temporal Characteristics for Time Series Forecasting

Authors:Zhipeng Liu, Peibo Duan, Xuan Tang, Baixin Li, Yongsheng Huang, Mingyang Geng, Changsheng Zhang, Bin Zhang, Binwu Wang

Although Transformers excel in natural language processing, their extension to time series forecasting remains challenging due to insufficient consideration of the differences between textual and temporal modalities. In this paper, we develop a novel Transformer architecture designed for time series data, aiming to maximize its representational capacity. We identify two key but often overlooked characteristics of time series: (1) unidirectional influence from the past to the future, and (2) the phenomenon of decaying influence over time. These characteristics are introduced to enhance the attention mechanism of Transformers. We propose TimeFormer, whose core innovation is a self-attention mechanism with two modulation terms (MoSA), designed to capture these temporal priors of time series under the constraints of the Hawkes process and causal masking. Additionally, TimeFormer introduces a framework based on multi-scale and subsequence analysis to capture semantic dependencies at different temporal scales, enriching the temporal dependencies. Extensive experiments conducted on multiple real-world datasets show that TimeFormer significantly outperforms state-of-the-art methods, achieving up to a 7.45% reduction in MSE compared to the best baseline and setting new benchmarks on 94.04% of evaluation metrics. Moreover, we demonstrate that the MoSA mechanism can be broadly applied to enhance the performance of other Transformer-based models.
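
The two temporal priors named in the abstract, one-way influence from past to future and influence that decays over time, can be illustrated with a small attention routine. The sketch below is an assumption-laden illustration in plain Python, not the paper's MoSA formulation: it applies a causal mask plus a linear penalty `decay * (i - j)` on the logits, which corresponds to an exponential decay kernel of the kind commonly used in Hawkes processes. The function name and the `decay` parameter are hypothetical.

```python
import math


def decayed_causal_attention(scores, decay=0.1):
    """Row-wise softmax over scores[i][j] with two modulations.

    Causal mask: position i only attends to positions j <= i.
    Temporal decay: the logit for position j is reduced by decay * (i - j),
    so older positions receive exponentially less weight.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Keep only past and present positions, then apply the decay penalty.
        logits = [scores[i][j] - decay * (i - j) for j in range(i + 1)]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        # Future positions get exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (n - i - 1))
    return weights


# With uniform raw scores, the weights are driven by recency alone.
uniform = [[0.0] * 3 for _ in range(3)]
attn = decayed_causal_attention(uniform, decay=1.0)
```

With uniform scores, each row of `attn` places the most weight on the current position and progressively less on older ones, while all entries above the diagonal stay at zero.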

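Summary

Although Transformers excel in natural language processing, applying them to time series forecasting remains challenging because the differences between textual and temporal modalities are not sufficiently considered. This paper proposes a new Transformer architecture for time series data that aims to maximize representational capacity. It highlights two key but often overlooked characteristics of time series: unidirectional influence from the past to the future, and influence that decays over time. The proposed model, TimeFormer, is built around a self-attention mechanism with two modulation terms (MoSA) that captures these temporal priors under the constraints of the Hawkes process and causal masking. TimeFormer also introduces a framework based on multi-scale and subsequence analysis to capture semantic dependencies at different temporal scales. Extensive experiments on multiple real-world datasets show that TimeFormer significantly outperforms state-of-the-art methods, reducing MSE by up to 7.45% compared with the best baseline and setting new benchmarks on 94.04% of evaluation metrics. The MoSA mechanism can also be applied broadly to improve other Transformer-based models.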

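Key Takeaways

Transformers excel in natural language processing, but extending them to time series forecasting remains challenging.
The paper proposes TimeFormer, a new Transformer architecture designed for time series data.
TimeFormer accounts for two key characteristics of time series: unidirectional influence from past to future, and influence that decays over time.
TimeFormer introduces a self-attention mechanism with two modulation terms (MoSA) to capture these temporal priors.
On multiple real-world datasets, TimeFormer reduces MSE by up to 7.45% compared with the best baseline.
The MoSA mechanism can be applied broadly to improve other Transformer-based models.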

Reproducibility Study of “XRec: Large Language Models for Explainable Recommendation”

Authors:Ranjan Mishra, Julian I. Bibo, Quinten van Engelen, Henk Schaapman

In this study, we reproduced the work done in the paper “XRec: Large Language Models for Explainable Recommendation” by Ma et al. (2024). The original authors introduced XRec, a model-agnostic collaborative instruction-tuning framework that enables large language models (LLMs) to provide users with comprehensive explanations of generated recommendations. Our objective was to replicate the results of the original paper, albeit using Llama 3 as the LLM for evaluation instead of GPT-3.5-turbo. We built on the source code provided by Ma et al. (2024) to achieve our goal. Our work extends the original paper by modifying the input embeddings or deleting the output embeddings of XRec’s Mixture of Experts module. Based on our results, XRec effectively generates personalized explanations and its stability is improved by incorporating collaborative information. However, XRec did not consistently outperform all baseline models in every metric. Our extended analysis further highlights the importance of the Mixture of Experts embeddings in shaping the explanation structures, showcasing how collaborative signals interact with language modeling. Through our work, we provide an open-source evaluation implementation that enhances accessibility for researchers and practitioners alike. Our complete code repository can be found at https://github.com/julianbibo/xrec-reproducibility.

Summary

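This study reproduces "XRec: Large Language Models for Explainable Recommendation" by Ma et al. (2024), using Llama 3 instead of GPT-3.5-turbo as the LLM for evaluation and building on the original source code. It extends the original work by modifying the input embeddings or deleting the output embeddings of XRec's Mixture of Experts module. The results show that XRec effectively generates personalized explanations and that incorporating collaborative information improves its stability, although XRec does not consistently outperform all baseline models on every metric. The extended analysis highlights the role of the Mixture of Experts embeddings in shaping explanation structures and shows how collaborative signals interact with language modeling. The complete code repository is available at https://github.com/julianbibo/xrec-reproducibility.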

Key Takeaways

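The study reproduces XRec (Ma et al., 2024) using Llama 3 instead of GPT-3.5-turbo as the LLM for evaluation.
XRec is a model-agnostic collaborative instruction-tuning framework that lets LLMs explain generated recommendations.
XRec effectively generates personalized explanations, and incorporating collaborative information improves its stability.
XRec did not consistently outperform all baseline models on every metric.
Modifying the input embeddings or deleting the output embeddings of the Mixture of Experts module shows how strongly these embeddings shape explanation structures.
An open-source evaluation implementation is provided at https://github.com/julianbibo/xrec-reproducibility.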

Knowledge Graph-Guided Multi-Agent Distillation for Reliable Industrial Question Answering with Datasets

Authors:Jiqun Pan, Zhenke Duan, Jiani Tu, Anzhi Cheng, Yanqing Wang

Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, as errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Although multi-agent large language models enhance reasoning depth, they suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning capabilities to lightweight, deployable student models. To address these challenges, we propose Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD). Our approach formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By integrating collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4 per cent to 20.1 per cent over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios. Code and data are available at https://github.com/erwinmsmith/KG-MAD/.

Paper and project links

PDF 41 pages, 12 figures, 6 tables

Summary

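Industrial question-answering (QA) systems require higher safety and reliability than general-purpose dialogue models, because errors in high-risk scenarios such as equipment fault diagnosis can have severe consequences. Multi-agent large language models deepen reasoning but suffer from uncontrolled iterations and unverifiable outputs, and conventional distillation methods struggle to transfer collaborative reasoning to lightweight, deployable student models. To address this, the paper proposes Knowledge Graph-guided Multi-Agent System Distillation (KG-MASD), which formulates distillation as a Markov Decision Process and incorporates a knowledge graph as a verifiable structured prior to enrich state representation and ensure convergence. By combining collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data and jointly distills reasoning depth and verifiability into compact student models suitable for edge deployment. Experiments on an industrial QA dataset show that KG-MASD improves accuracy by 2.4% to 20.1% over baselines and significantly enhances reliability, enabling trustworthy AI deployment in safety-critical industrial scenarios.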

Key Takeaways

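Industrial QA systems need high safety and reliability, because errors in high-risk scenarios can have severe consequences.
Multi-agent large language models deepen reasoning but suffer from uncontrolled iterations and unverifiable outputs.
Conventional distillation methods struggle to transfer collaborative reasoning to lightweight, deployable student models.
KG-MASD addresses these problems by combining a knowledge graph with a Markov Decision Process formulation of distillation.
The knowledge graph serves as a verifiable structured prior that enriches state representation and ensures convergence.
By combining collaborative reasoning with knowledge grounding, KG-MASD generates high-confidence instruction-tuning data.
KG-MASD improves accuracy on an industrial QA dataset by 2.4% to 20.1% over baselines and significantly enhances reliability for safety-critical industrial deployment.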

AWARE, Beyond Sentence Boundaries: A Contextual Transformer Framework for Identifying Cultural Capital in STEM Narratives

Authors:Khalid Mehtab Khan, Anagha Kulkarni

Identifying cultural capital (CC) themes in student reflections can offer valuable insights that help foster equitable learning environments in classrooms. However, themes such as aspirational goals or family support are often woven into narratives, rather than appearing as direct keywords. This makes them difficult to detect for standard NLP models that process sentences in isolation. The core challenge stems from a lack of awareness, as standard models are pre-trained on general corpora, leaving them blind to the domain-specific language and narrative context inherent to the data. To address this, we introduce AWARE, a framework that systematically attempts to improve a transformer model’s awareness for this nuanced task. AWARE has three core components: 1) Domain Awareness, adapting the model’s vocabulary to the linguistic style of student reflections; 2) Context Awareness, generating sentence embeddings that are aware of the full essay context; and 3) Class Overlap Awareness, employing a multi-label strategy to recognize the coexistence of themes in a single sentence. Our results show that by making the model explicitly aware of the properties of the input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. This work provides a robust and generalizable methodology for any text classification task in which meaning depends on the context of the narrative.

Summary

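Identifying cultural capital (CC) themes in student reflections can help foster equitable learning environments in classrooms. Themes such as aspirational goals or family support are usually woven into narratives rather than appearing as direct keywords, which makes them hard to detect for standard NLP models that process sentences in isolation. To address this, the paper introduces AWARE, a framework that improves a transformer model's awareness through three core components: Domain Awareness, Context Awareness, and Class Overlap Awareness. By making the model explicitly aware of the properties of its input, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1 and shows considerable improvements across all themes. The work provides a robust and generalizable methodology for any text classification task in which meaning depends on narrative context.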

Key Takeaways

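Identifying cultural capital themes in student reflections helps foster equitable learning environments.
These themes are usually woven into narratives rather than appearing as keywords, which makes them hard for standard NLP models to detect.
AWARE has three core components: Domain Awareness, Context Awareness, and Class Overlap Awareness.
By improving the model's awareness, AWARE outperforms a strong baseline by 2.1 percentage points in Macro-F1.
AWARE shows considerable improvements across all themes.
The method applies to any text classification task in which meaning depends on narrative context.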

Language Model Based Text-to-Audio Generation: Anti-Causally Aligned Collaborative Residual Transformers

Authors:Juncheng Wang, Chao Xu, Cheng Yu, Zhe Hu, Haoyu Xie, Guoqi Yu, Lei Shang, Shujun Wang

While language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation, they still lag behind diffusion-based models by a non-trivial margin. We identify a critical dilemma underpinning this gap: incorporating more RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. To address this, we first analyze RVQ dynamics and uncover two key limitations: 1) orthogonality of features across RVQ layers hinders effective LMs training, and 2) descending semantic richness in tokens from deeper RVQ layers exacerbates exposure bias during autoregressive decoding. Based on these insights, we propose Siren, a novel LM-based framework that employs multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Extensive experiments demonstrate that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, our approach repositions LMs as competitive contenders against diffusion models in T2A tasks. Moreover, by aligning audio representations with linguistic structures, Siren facilitates a promising pathway toward unified multi-modal generation frameworks.
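
Residual vector quantization itself is a standard technique, and a toy version helps explain why deeper layers carry progressively finer detail. The sketch below is a minimal pure-Python illustration with hand-written two-dimensional codebooks; the codebook values and function names are assumptions for demonstration and have no connection to Siren's tokenizer.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes the residual
    left over by the layers before it."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes, residual


def rvq_decode(codes, codebooks):
    """Reconstruct a vector by summing the selected entry of every layer."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy codebooks: a coarse first layer and a finer second layer.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[-0.25, -0.25], [0.0, 0.0], [0.25, 0.25]],
]
codes, residual = rvq_encode([1.2, 1.3], codebooks)
reconstruction = rvq_decode(codes, codebooks)
```

Each added layer shrinks the reconstruction error, but it also adds one more token per frame that a language model has to predict, which is the trade-off the abstract describes.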

Paper and project links

PDF Accepted to EMNLP 2025

Summary

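Language models (LMs) paired with residual vector quantization (RVQ) tokenizers have shown promise in text-to-audio (T2A) generation but still lag behind diffusion-based models. The paper identifies the dilemma behind this gap: adding RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs. An analysis of RVQ dynamics uncovers two key limitations: the orthogonality of features across RVQ layers hinders effective LM training, and the decreasing semantic richness of tokens from deeper layers worsens exposure bias during autoregressive decoding. Based on these insights, the paper proposes Siren, an LM-based framework that uses multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning. Experiments show that Siren outperforms both existing LM-based and diffusion-based T2A systems, achieving state-of-the-art results. By bridging the representational strengths of LMs with the fidelity demands of audio synthesis, the approach repositions LMs as competitive with diffusion models on T2A tasks, and by aligning audio representations with linguistic structures it opens a promising path toward unified multi-modal generation frameworks.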

Key Takeaways

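LMs paired with RVQ tokenizers show promise in text-to-audio generation but still lag behind diffusion-based models.
Adding RVQ layers improves audio reconstruction fidelity but exceeds the generation capacity of conventional LMs.
RVQ dynamics reveal two key limitations: feature orthogonality across layers hinders LM training, and the decreasing semantic richness of deeper-layer tokens worsens exposure bias during autoregressive decoding.
Siren uses multiple isolated transformers with causal conditioning and anti-causal alignment via reinforcement learning.
Siren outperforms both existing LM-based and diffusion-based T2A systems.
Siren repositions LMs as competitive contenders against diffusion models in T2A tasks.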

CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling

Authors:Zhengyang Tang, Zihan Ye, Chenyu Huang, Xuhan Huang, Chengpeng Li, Sihang Li, Guanhua Chen, Ming Yan, Zizhuo Wang, Hongyuan Zha, Dayiheng Liu, Benyou Wang

Large Reasoning Models (LRMs) have demonstrated strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. However, existing domain adaptation methods, originally designed for earlier instruction-tuned models, often fail to exploit the advanced reasoning patterns of modern LRMs. In particular, we show that direct fine-tuning on traditional non-reflective datasets leads to limited gains. To fully leverage LRMs' inherent reasoning abilities, we propose CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. In CALM, an expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens, but generate high-quality data for soft adaptation through supervised fine-tuning. The adapted model is then further improved through reinforcement learning. Building on CALM, we develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that achieves a new state-of-the-art average accuracy of 68.9% across five popular optimization modeling benchmarks, matching the performance of a 671B LRM. These results demonstrate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path towards expert-level performance on challenging optimization modeling tasks.

Paper and project links

PDF Work in progress

Summary

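Large Reasoning Models (LRMs) show strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling. Existing domain adaptation methods, however, fail to exploit the advanced reasoning patterns of modern LRMs. To make full use of these abilities, the paper proposes CALM (Corrective Adaptation with Lightweight Modification), a framework that progressively refines LRMs within their native reasoning modes for optimization modeling tasks. An expert intervener identifies reasoning flaws and provides concise corrective hints, which the LRM incorporates to produce improved reasoning trajectories. These interventions modify fewer than 2.6% of generated tokens yet produce high-quality data for soft adaptation through supervised fine-tuning, after which the model is further improved through reinforcement learning. Building on CALM, the authors develop STORM (Smart Thinking Optimization Reasoning Model), a 4B-parameter LRM that reaches an average accuracy of 68.9% across five popular optimization modeling benchmarks, matching a 671B LRM. The results indicate that dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs, offering a more effective and scalable path toward expert-level performance on challenging optimization modeling tasks.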

Key Takeaways

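LRMs show strong capabilities in complex multi-step reasoning, opening new opportunities for automating optimization modeling.
Existing domain adaptation methods fail to exploit the advanced reasoning patterns of modern LRMs.
CALM progressively refines LRMs within their native reasoning modes, using concise corrective hints from an expert intervener to improve reasoning trajectories.
The interventions modify fewer than 2.6% of generated tokens while producing high-quality data for soft adaptation.
STORM, a 4B-parameter LRM built on CALM, achieves a new state-of-the-art average accuracy of 68.9% across five optimization modeling benchmarks.
Dynamic, hint-based data synthesis both preserves and amplifies the native reasoning patterns of modern LRMs.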

Rare Text Semantics Were Always There in Your Diffusion Transformer

Authors:Seil Kang, Woojung Han, Dayun Ju, Seong Jae Hwang

Starting from flow- and diffusion-based transformers, Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation, gaining acclaim for exceptional visual fidelity. As these models advance, users continually push the boundary with imaginative or rare prompts, which advanced models still falter in generating, since their concepts are often too scarce to leave a strong imprint during pre-training. In this paper, we propose a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules (e.g., large language models). In particular, the joint-attention mechanism intrinsic to MM-DiT sequentially updates text embeddings alongside image embeddings throughout transformer blocks. We find that by mathematically expanding representational basins around text token embeddings via variance scale-up before the joint-attention blocks, rare semantics clearly emerge in MM-DiT’s outputs. Furthermore, our results generalize effectively across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing. Our work invites generative models to reveal the semantics that users intend, once hidden yet ready to surface.
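
The abstract describes the intervention only as a "variance scale-up" of text token embeddings before the joint-attention blocks. As a rough illustration of what such an operation can look like, the sketch below widens the spread of a single embedding vector around its own mean. Treating the embedding as a plain list, computing the mean over its components, and the names `scale_up_variance` and `factor` are all assumptions made here; the paper's actual operation may differ.

```python
def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)


def scale_up_variance(embedding, factor=1.5):
    """Stretch an embedding's components away from their mean.

    The mean is unchanged and the variance grows by factor ** 2.
    (Hypothetical helper, not the authors' implementation.)
    """
    mean = sum(embedding) / len(embedding)
    return [mean + factor * (x - mean) for x in embedding]


# Toy "text token embedding" with four components.
token_embedding = [1.0, 2.0, 3.0, 4.0]
scaled = scale_up_variance(token_embedding, factor=2.0)
```

No parameters are trained and no extra data is needed for this kind of transformation, which is consistent with the abstract's claim that the method requires no additional training steps.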

Paper and project links

PDF Accepted to NeurIPS 2025

Summary
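Multi-modal Diffusion Transformers (MM-DiTs), which build on flow- and diffusion-based transformers, have reshaped text-to-vision generation and earned acclaim for their visual fidelity. Even advanced models still falter on imaginative or rare prompts, because the underlying concepts are often too scarce to leave a strong imprint during pre-training. The paper proposes a simple yet effective intervention that surfaces rare semantics inside MM-DiTs without additional training steps, data, denoising-time optimization, or reliance on external modules such as large language models. By mathematically expanding the representational basins around text token embeddings through a variance scale-up applied before the joint-attention blocks, rare semantics clearly emerge in MM-DiT's outputs. The results generalize across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing. The work invites generative models to reveal the semantics that users intend, which were hidden yet ready to surface.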

Key Takeaways

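Multi-modal Diffusion Transformers (MM-DiTs) have reshaped text-to-vision generation and deliver high visual fidelity.
Rare or imaginative prompts remain difficult for advanced models, because the underlying concepts are too scarce to leave a strong imprint during pre-training.
The paper proposes a simple, effective intervention that surfaces rare semantics in MM-DiTs without additional training, data, or external modules.
Expanding the representational basins around text token embeddings through a variance scale-up lets rare semantics clearly emerge in MM-DiT's outputs.
The method generalizes across text-to-vision tasks, including text-to-image, text-to-video, and text-driven image editing.
The work helps generative models reveal the semantics that users intend.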

DHQA-4D: Perceptual Quality Assessment of Dynamic 4D Digital Human

Authors:Yunhao Li, Sijing Wu, Yucheng Zhu, Huiyu Duan, Zicheng Zhang, Guangtao Zhai

With the rapid development of 3D scanning and reconstruction technologies, dynamic digital human avatars based on 4D meshes have become increasingly popular. A high-precision dynamic digital human avatar can be applied to various fields such as game production, animation generation, and remote immersive communication. However, these 4D human avatar meshes are prone to being degraded by various types of noise during the processes of collection, compression, and transmission, thereby affecting the viewing experience of users. In light of this fact, quality assessment of dynamic 4D digital humans becomes increasingly important. In this paper, we first propose a large-scale dynamic digital human quality assessment dataset, DHQA-4D, which contains 32 high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D human meshes degraded by 11 textured distortions, as well as their corresponding textured and non-textured mean opinion scores (MOSs). Equipped with DHQA-4D dataset, we analyze the influence of different types of distortion on human perception for textured dynamic 4D meshes and non-textured dynamic 4D meshes. Additionally, we propose DynaMesh-Rater, a novel large multimodal model (LMM) based approach that is able to assess both textured 4D meshes and non-textured 4D meshes. Concretely, DynaMesh-Rater elaborately extracts multi-dimensional features, including visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D human mesh to provide comprehensive quality-related information. Then we utilize a LMM model to integrate the multi-dimensional features and conduct a LoRA-based instruction tuning technique to teach the LMM model to predict the quality scores. Extensive experimental results on the DHQA-4D dataset demonstrate the superiority of our DynaMesh-Rater method over previous quality assessment methods.

Summary

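With the rapid development of 3D scanning and reconstruction technologies, dynamic digital human avatars based on 4D meshes have become increasingly popular, with applications in game production, animation generation, and remote immersive communication. These 4D meshes are prone to degradation by various types of noise during collection, compression, and transmission, which harms the viewing experience, so quality assessment of dynamic 4D digital humans is increasingly important. The paper first proposes DHQA-4D, a large-scale dataset containing 32 high-quality real-scanned 4D human mesh sequences, 1920 distorted textured 4D human meshes degraded by 11 textured distortions, and the corresponding textured and non-textured mean opinion scores (MOSs). Using DHQA-4D, the authors analyze how different distortion types affect human perception of textured and non-textured dynamic 4D meshes. They also propose DynaMesh-Rater, a large multimodal model (LMM) based approach that assesses both textured and non-textured 4D meshes. It extracts visual features from a projected 2D video, motion features from cropped video clips, and geometry features from the 4D human mesh, then uses an LMM to integrate these features and applies LoRA-based instruction tuning to teach the model to predict quality scores. Extensive experiments on DHQA-4D demonstrate that DynaMesh-Rater outperforms previous quality assessment methods.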

Key Takeaways

动态数字人类化身使用基于4D网格的技术变得越来越流行，可应用于游戏制作、动画制作和远程沉浸式通信。
动态4D数字人的质量评估变得重要，因为这些网格在收集、压缩和传输过程中容易因噪声而质量下降。
DHQA-4D数据集包含各种高质量和失真纹理的4D人类网格序列及其对应的平均意见得分（MOSs），用于分析不同类型失真对感知的影响。
提出了一种新型的多模态模型方法——动态网格评分器（DynaMesh-Rater），能够评估纹理和非纹理的4D网格的质量。
DynaMesh-Rater通过提取多维特征来提供全面的质量相关信息，并利用大型多模态模型（LMM）进行预测。
DHQA-4D数据集上的实验结果表明，与其他质量评估方法相比，DynaMesh-Rater具有优越性。

Rainbow Padding: Mitigating Early Termination in Instruction-Tuned Diffusion LLMs

Authors:Bumjun Kim, Dongjae Jeon, Dueun Kim, Wonje Jeung, Albert No

Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks. However, instruction-tuned dLLMs exhibit a critical vulnerability we term <eos> overflow: as allocated sequence length increases, responses paradoxically become shorter, collapsing into early termination or degenerating into streams of <eos> tokens. Although noticed in practice, this issue has not been systematically analyzed. We trace its root cause to the dual role of <eos> as both termination and padding, which concentrates probability mass on <eos> at later positions and propagates backward to trigger early termination. To address this, we introduce Rainbow Padding, a simple remedy that replaces repeated <eos> placeholders with a repeating cycle of distinct padding tokens, distributing probability mass and breaking <eos> dominance. Experiments show that Rainbow Padding substantially improves length robustness and output quality, with as few as seven padding tokens sufficient to prevent early termination. Moreover, the method integrates efficiently into existing instruction-tuned models: LoRA fine-tuning for a single epoch on minimal data yields significant improvements, making this solution highly practical. The code is publicly available at https://github.com/quasar529/rainbow-padding.

Paper and project links

PDF 25 pages. Project page: https://ai-isl.github.io/rainbow-padding

Summary

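Diffusion large language models (dLLMs) have emerged as a promising alternative to autoregressive models, offering flexible generation orders and strong performance on complex reasoning tasks. However, instruction-tuned dLLMs suffer from a critical vulnerability the authors call <eos> overflow: as the allocated sequence length grows, responses become shorter, terminating early or degenerating into a stream of <eos> tokens. The paper traces the root cause to <eos> serving as both the termination token and the padding token, and proposes Rainbow Padding, which replaces the repeated <eos> placeholders with a repeating cycle of distinct padding tokens to break the dominance of <eos>. Experiments show that Rainbow Padding substantially improves length robustness and output quality, and that as few as seven padding tokens are enough to prevent early termination. The method also integrates efficiently into existing instruction-tuned models: a single epoch of LoRA fine-tuning on minimal data yields significant improvements.

Key Takeaways

dLLMs are an emerging alternative to autoregressive models, with flexible generation orders and strong performance on complex reasoning tasks.
Instruction-tuned dLLMs exhibit <eos> overflow: responses become shorter as the allocated sequence length increases.
The root cause is the dual role of the <eos> token as both termination and padding.
Rainbow Padding addresses this by replacing repeated <eos> placeholders with a repeating cycle of distinct padding tokens.
Experiments show that Rainbow Padding substantially improves length robustness and output quality.
As few as seven padding tokens are sufficient to prevent early termination.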
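
The padding change described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the token strings (`<pad_0>` to `<pad_6>`), the helper names, and the choice to keep exactly one `<eos>` as the terminator are all hypothetical. The only details taken from the abstract are that repeated `<eos>` placeholders are replaced by a repeating cycle of distinct padding tokens and that seven such tokens can suffice.

```python
from itertools import cycle, islice

EOS = "<eos>"
# Seven distinct padding tokens; the names are hypothetical.
RAINBOW_PADS = [f"<pad_{i}>" for i in range(7)]


def eos_pad(tokens, max_len):
    """Baseline: fill every unused position with <eos>."""
    return (tokens + [EOS] * (max_len - len(tokens)))[:max_len]


def rainbow_pad(tokens, max_len, pads=RAINBOW_PADS):
    """Assumed scheme: one <eos> to terminate, then a repeating cycle of pads."""
    if len(tokens) >= max_len:
        return tokens[:max_len]
    n_fill = max_len - len(tokens) - 1  # positions left after the single <eos>
    return tokens + [EOS] + list(islice(cycle(pads), n_fill))


response = ["The", "answer", "is", "42"]
baseline = eos_pad(response, 16)     # 12 copies of <eos> in the tail
padded = rainbow_pad(response, 16)   # 1 <eos>, then <pad_0>, <pad_1>, ...
```

In the baseline, every padded position is a training target for the same `<eos>` token, which is the concentration of probability mass the abstract describes. In the sketch, the tail targets are spread over seven different tokens.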

Scaled Signed Averaging Improves In-Context and Early Learning Benchmark Performance in Small Transformers

Authors:Omar Naim, Swarnadeep Bhar, Jérôme Bolte, Nicholas Asher

While Large Language models’ abilities for in-context learning (ICL) have drawn much attention, we examine some of its limitations on semantic tasks involving quantifiers like “all” and “some”, as well as on tasks with linear functions. We identify Softmax, the scoring function in attention mechanism, as a contributing factor to these limitations. We propose scaled signed averaging (SSA), a novel alternative to Softmax to mitigate these problems. We show that SSA significantly improves performance on our ICL tasks. In addition, SSA outperforms transformer models with Softmax on several early learning NLP benchmarks and linguistic probing tasks on zero and few-shot settings.

Paper and project links

PDF

Summary

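The in-context learning (ICL) abilities of large language models have drawn much attention, but the paper examines their limitations on semantic tasks involving quantifiers such as "all" and "some" and on tasks with linear functions. It identifies Softmax, the scoring function in the attention mechanism, as a contributing factor. To mitigate these problems, the authors propose scaled signed averaging (SSA) as an alternative to Softmax. SSA significantly improves performance on the ICL tasks and outperforms Softmax-based Transformer models on several early learning NLP benchmarks and linguistic probing tasks in zero-shot and few-shot settings.

Key Takeaways

Despite the attention paid to in-context learning (ICL), large language models show limitations on semantic tasks involving quantifiers and on tasks with linear functions.
Softmax, the scoring function in the attention mechanism, is identified as a contributing factor to these limitations.
Scaled signed averaging (SSA) is proposed as a new alternative to Softmax.
SSA significantly improves performance on the paper's ICL tasks.
In zero-shot and few-shot settings, SSA outperforms Softmax-based Transformers on several early learning NLP benchmarks and linguistic probing tasks.
The results suggest that replacing Softmax with SSA can help small Transformers on these tasks.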

FLEx: Personalized Federated Learning for Mixture-of-Experts LLMs via Expert Grafting

Authors:Fan Liu, Bikang Pan, Zhongyi Wang, Xi Yao, Xiaoying Tang, Jingya Wang, Ye Shi

Federated instruction tuning of large language models (LLMs) is challenged by significant data heterogeneity across clients, demanding robust personalization. The Mixture of Experts (MoE) architecture, where experts can specialize in distinct data patterns, presents a natural architectural solution to this challenge. The inherent sparsity of the MoE architecture, achieved by selectively activating experts, poses a significant challenge to its integration with federated learning (FL). Conventional FL frameworks, designed for dense models, naively aggregate all expert parameters irrespective of their local activation patterns. This naive approach not only undermines MoE’s dynamic sparsity but also risks corrupting the world knowledge within pretrained experts. To address this, we propose FLEx (Federated LLMs with Personalized Experts), a novel framework that leverages pretrained MoE-based LLMs for efficient personalization. By aggregating only the shared non-expert parameters, FLEx significantly reduces communication overhead and preserves the world knowledge stored within the frozen pretrained experts. For personalization, we introduce a novel expert grafting mechanism that leverages dynamic sparsity to construct a client-specific expert from selected components of pretrained experts, tailored to local data. This grafted expert is then fine-tuned locally alongside the gating mechanism. This joint training enables the model to learn when to leverage the shared knowledge from frozen experts and when to employ the personalized one. Evaluations on diverse, non-IID instruction tuning datasets show that FLEx consistently outperforms federated baselines on average, while demonstrating strong knowledge preservation on the knowledge-driven benchmark MMLU. Our code is available at https://anonymous.4open.science/r/FLEx-8F12.

Paper and project links

PDF

Summary

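The paper addresses the challenges of using the Mixture of Experts (MoE) architecture in federated learning and proposes a solution. Federated instruction tuning of large language models (LLMs) faces significant data heterogeneity across clients and therefore requires robust personalization. MoE, whose experts can specialize in distinct data patterns, is a natural architectural fit, but its inherent sparsity makes it hard to integrate with federated learning (FL). Conventional FL frameworks aggregate all expert parameters regardless of local activation patterns, which undermines MoE's dynamic sparsity and risks corrupting the world knowledge in the pretrained experts. FLEx instead builds on pretrained MoE-based LLMs for efficient personalization. It aggregates only the shared non-expert parameters, which significantly reduces communication overhead and preserves the world knowledge stored in the frozen pretrained experts. For personalization, an expert grafting mechanism uses dynamic sparsity to construct a client-specific expert from selected components of the pretrained experts, tailored to local data. The grafted expert is fine-tuned locally together with the gating mechanism, so the model learns when to use the shared knowledge of the frozen experts and when to use the personalized expert. On diverse non-IID instruction tuning datasets, FLEx outperforms federated baselines on average while showing strong knowledge preservation on the knowledge-driven benchmark MMLU.

Key Takeaways

Federated instruction tuning of large language models (LLMs) faces data heterogeneity across clients and needs robust personalization.
The Mixture of Experts (MoE) architecture is a natural fit for this challenge, but its sparsity makes integration with federated learning difficult.
FLEx builds on pretrained MoE-based LLMs and aggregates only the shared non-expert parameters, reducing communication overhead and preserving world knowledge.
FLEx introduces an expert grafting mechanism that uses dynamic sparsity to construct a client-specific expert tailored to local data.
The grafted expert is trained jointly with the gating mechanism, so the model learns when to use shared knowledge and when to use the personalized expert.
On diverse non-IID instruction tuning datasets, FLEx outperforms federated baselines on average.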
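
The aggregation rule summarized above (average only the shared non-expert parameters, keep expert weights on the client) can be illustrated with a small Python sketch. It is not the FLEx code: the parameter names, the `.experts.` naming convention used to tell experts apart, and the unweighted average are assumptions made for the example, and which parameters count as shared in the real system is not specified in the abstract.

```python
def is_expert_param(name):
    """Hypothetical naming convention: expert weights contain '.experts.'."""
    return ".experts." in name


def aggregate_shared(client_states):
    """Average only non-expert parameters across clients (unweighted)."""
    shared_names = [n for n in client_states[0] if not is_expert_param(n)]
    n_clients = len(client_states)
    averaged = {}
    for name in shared_names:
        # zip(*...) walks the same coordinate of this parameter on every client
        columns = zip(*(state[name] for state in client_states))
        averaged[name] = [sum(col) / n_clients for col in columns]
    return averaged


def apply_update(local_state, averaged):
    """Overwrite shared parameters; expert parameters stay local."""
    merged = dict(local_state)
    merged.update(averaged)
    return merged


clients = [
    {"attn.q_proj": [1.0, 2.0], "mlp.experts.grafted": [0.5, 0.5]},
    {"attn.q_proj": [3.0, 4.0], "mlp.experts.grafted": [9.0, 9.0]},
]
global_shared = aggregate_shared(clients)            # only attn.q_proj is sent
client0 = apply_update(clients[0], global_shared)    # grafted expert untouched
```

Because expert tensors never enter `aggregate_shared`, they are neither transmitted nor averaged, which is the source of the communication savings and knowledge preservation the abstract claims.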

Enhancing Transformers Through Conditioned Embedded Tokens

Authors:Hemanth Saratchandran, Simon Lucey

Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics. At the core of their success lies the attention mechanism, which enables the modeling of global dependencies among input tokens. However, we reveal that the attention block in transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and leads to inefficient training. To address this, we develop a theoretical framework that establishes a direct relationship between the conditioning of the attention block and that of the embedded tokenized data. Building on this insight, we introduce conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Our analysis demonstrates that this approach significantly mitigates ill-conditioning, leading to more stable and efficient training. We validate our methodology across various transformer architectures, achieving consistent improvements in image classification, object detection, instance segmentation, and natural language processing, highlighting its broad applicability and effectiveness.

Paper and project links

PDF ICCV 2025

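Summary

Transformers have transformed modern machine learning, driving breakthroughs in computer vision, natural language processing, and robotics, and the attention mechanism is at the core of that success. The paper shows, however, that the attention block in Transformers suffers from inherent ill-conditioning, which hampers gradient-based optimization and makes training inefficient. To address this, the authors develop a theoretical framework that directly relates the conditioning of the attention block to the conditioning of the embedded tokenized data. Building on this, they propose conditioned embedded tokens, a method that systematically modifies the embedded tokens to improve the conditioning of the attention mechanism. Their analysis shows that the method significantly mitigates ill-conditioning and leads to more stable and efficient training. The method is validated across various Transformer architectures, with consistent improvements in image classification, object detection, instance segmentation, and natural language processing.

Key Takeaways

The attention mechanism underlies the success of Transformers in computer vision, NLP, and robotics.
The attention block in Transformers is inherently ill-conditioned, which hurts training efficiency and stability.
The paper develops a theoretical framework relating the conditioning of the attention block to that of the embedded tokenized data.
Conditioned embedded tokens modify the embedded tokens to improve the conditioning of the attention mechanism.
The method significantly mitigates the ill-conditioning of the attention block.
It is validated on multiple Transformer architectures, with consistent gains in image classification, object detection, instance segmentation, and natural language processing.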

FVQ: A Large-Scale Dataset and an LMM-based Method for Face Video Quality Assessment

Authors:Sijing Wu, Yunhao Li, Ziwen Xu, Yixuan Gao, Huiyu Duan, Wei Sun, Guangtao Zhai

Face video quality assessment (FVQA) deserves to be explored in addition to general video quality assessment (VQA), as face videos are the primary content on social media platforms and human visual system (HVS) is particularly sensitive to human faces. However, FVQA is rarely explored due to the lack of large-scale FVQA datasets. To fill this gap, we present the first large-scale in-the-wild FVQA dataset, FVQ-20K, which contains 20,000 in-the-wild face videos together with corresponding mean opinion score (MOS) annotations. Along with the FVQ-20K dataset, we further propose a specialized FVQA method named FVQ-Rater to achieve human-like rating and scoring for face video, which is the first attempt to explore the potential of large multimodal models (LMMs) for the FVQA task. Concretely, we elaborately extract multi-dimensional features including spatial features, temporal features, and face-specific features (i.e., portrait features and face embeddings) to provide comprehensive visual information, and take advantage of the LoRA-based instruction tuning technique to achieve quality-specific fine-tuning, which shows superior performance on both FVQ-20K and CFVQA datasets. Extensive experiments and comprehensive analysis demonstrate the significant potential of the FVQ-20K dataset and FVQ-Rater method in promoting the development of FVQA.

Paper and project links

PDF Accepted by ACM MM 2025. Project page: https://github.com/wsj-sjtu/FVQ

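Summary

Face video quality assessment (FVQA) deserves study alongside general video quality assessment (VQA): face videos are the primary content on social media platforms, and the human visual system is particularly sensitive to faces. FVQA has nevertheless been rarely explored because large-scale datasets are lacking. To fill this gap, the authors present FVQ-20K, the first large-scale in-the-wild FVQA dataset, containing 20,000 in-the-wild face videos with mean opinion score (MOS) annotations. They also propose FVQ-Rater, a specialized FVQA method that produces human-like ratings and scores for face videos and is the first attempt to explore large multimodal models (LMMs) for the FVQA task. The method extracts multi-dimensional features, including spatial features, temporal features, and face-specific features (portrait features and face embeddings), and uses LoRA-based instruction tuning for quality-specific fine-tuning. It shows superior performance on both the FVQ-20K and CFVQA datasets.

Key Takeaways

Face video quality assessment (FVQA) is worth studying because face videos are the primary content on social media platforms.
The human visual system is particularly sensitive to faces, which motivates treating FVQA separately from general VQA.
The lack of large-scale FVQA datasets has limited progress in this area.
FVQ-20K is the first large-scale in-the-wild FVQA dataset, with 20,000 face videos and MOS annotations.
FVQ-Rater is proposed to give human-like quality ratings and scores for face videos.
FVQ-Rater combines multi-dimensional feature extraction with LoRA-based quality-specific fine-tuning and achieves superior performance on FVQ-20K and CFVQA.
The FVQ-20K dataset and the FVQ-Rater method show significant potential for advancing FVQA.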
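
For readers unfamiliar with the LoRA technique mentioned above, the sketch below shows the standard low-rank update in plain Python: a frozen weight matrix W is adapted as W + (alpha / r) * B A, where only the small matrices B and A are trained. The matrix sizes, values, and function names are illustrative assumptions; the abstract does not state the rank, the scaling, or which layers of the LMM are adapted.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w, b, a, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(a)  # A has shape r x k, so its row count is the rank
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Frozen 3x3 weight and a rank-1 adapter (values chosen for illustration).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[1.0], [0.0], [2.0]]   # 3 x 1, trainable
A = [[0.5, 0.0, -0.5]]      # 1 x 3, trainable
W_adapted = lora_weight(W, B, A, alpha=2.0)

# With B initialised to zero, the adapted weight equals the frozen weight.
B_zero = [[0.0], [0.0], [0.0]]
W_initial = lora_weight(W, B_zero, A, alpha=2.0)
```

The base weights stay untouched during tuning, which is why this kind of fine-tuning is comparatively cheap for large multimodal models.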

AtmosSci-Bench: Evaluating the Recent Advance of Large Language Model for Atmospheric Science

Authors:Chenyue Li, Wen Deng, Mengqian Lu, Binhang Yuan

The rapid advancements in large language models (LLMs), particularly in their reasoning capabilities, hold transformative potential for addressing complex challenges and boosting scientific discovery in atmospheric science. However, leveraging LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. Toward this end, we present AtmosSci-Bench, a novel benchmark designed to systematically assess LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. AtmosSci-Bench features a dual-format design comprising both multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. We employ a template-based MCQ generation framework to create diverse, graduate-level problems with symbolic perturbation, while OEQs are used to probe open-ended reasoning. We conduct a comprehensive evaluation of representative LLMs, categorized into four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. Our analysis provides some interesting insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science. We believe AtmosSci-Bench can serve as a critical step toward advancing LLM applications in climate services by offering a standard and rigorous evaluation framework. Our source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.

Paper and project links

PDF 37 pages, 4 figures, 13 tables

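Summary

Rapid advances in large language models (LLMs), particularly in reasoning, hold transformative potential for tackling complex challenges and boosting scientific discovery in atmospheric science. Using LLMs effectively in this domain requires a robust and comprehensive evaluation benchmark. The authors present AtmosSci-Bench, a benchmark that systematically assesses LLM performance across five core categories of atmospheric science problems: hydrology, atmospheric dynamics, atmospheric physics, geophysics, and physical oceanography. It has a dual-format design with multiple-choice questions (MCQs) and open-ended questions (OEQs), enabling scalable automated evaluation alongside deeper analysis of conceptual understanding. A template-based MCQ generation framework creates diverse, graduate-level problems with symbolic perturbation, while OEQs probe open-ended reasoning. Representative LLMs are evaluated in four groups: instruction-tuned models, advanced reasoning models, math-augmented models, and domain-specific climate models. The analysis offers insights into the reasoning and problem-solving capabilities of LLMs in atmospheric science, and the authors see AtmosSci-Bench as a step toward advancing LLM applications in climate services by providing a standard and rigorous evaluation framework. The source code is available at https://github.com/Relaxed-System-Lab/AtmosSci-Bench.

Key Takeaways

Large language models (LLMs) have considerable potential for solving atmospheric science problems.
AtmosSci-Bench is a new benchmark that evaluates LLMs on five core categories of atmospheric science problems.
The benchmark uses a dual format of multiple-choice and open-ended questions to assess both automated accuracy and conceptual understanding.
A template-based framework with symbolic perturbation generates diverse, graduate-level multiple-choice questions.
Four groups of LLMs are evaluated: instruction-tuned, advanced reasoning, math-augmented, and domain-specific climate models.
AtmosSci-Bench provides a standard evaluation framework for advancing LLM applications in climate services.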
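
One plausible reading of "template-based MCQ generation with symbolic perturbation" is that each question is a template whose variables are resampled, with the answer recomputed from a formula. The Python sketch below illustrates that idea only; the template text, the value ranges, the distractor rules, and the function names are all hypothetical and are not taken from the AtmosSci-Bench code. The physics used is the ideal gas law for dry air.

```python
import random

R_DRY_AIR = 287.0  # J kg^-1 K^-1, specific gas constant for dry air

TEMPLATE = ("Dry air has pressure {p_hpa} hPa and temperature {t_k} K. "
            "What is its density in kg/m^3?")


def density(p_hpa, t_k):
    """Ideal gas law for dry air: rho = p / (R_d * T), with p in Pa."""
    return (p_hpa * 100.0) / (R_DRY_AIR * t_k)


def make_mcq(rng):
    """Instantiate the template with perturbed values and build four options."""
    p_hpa = rng.randrange(700, 1021, 10)
    t_k = rng.randrange(250, 301, 5)
    correct = density(p_hpa, t_k)
    # Index 0 is the correct value; the rest mimic plausible slips (assumed).
    options = [
        correct,
        correct / 100.0,            # forgot the hPa -> Pa conversion
        correct * 10.0,             # converted hPa -> Pa with the wrong factor
        correct * 287.0 / 461.5,    # used the water-vapour gas constant
    ]
    order = list(range(4))
    rng.shuffle(order)
    letters = "ABCD"
    return {
        "question": TEMPLATE.format(p_hpa=p_hpa, t_k=t_k),
        "options": {letters[i]: round(options[j], 3) for i, j in enumerate(order)},
        "answer": letters[order.index(0)],
    }


# Different seeds give different numeric instances of the same question.
questions = [make_mcq(random.Random(seed)) for seed in range(3)]
```

Because the answer is recomputed for every instance, a model cannot score well by memorising one fixed question, and grading stays fully automatic.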

Imagining the Unseen: Generative Location Modeling for Object Placement

Authors:Jooyeol Yun, Davide Abati, Mohamed Omran, Jaegul Choo, Amirhossein Habibian, Auke Wiggers

Location modeling, or determining where non-existing objects could feasibly appear in a scene, has the potential to benefit numerous computer vision tasks, from automatic object insertion to scene creation in virtual reality. Yet, this capability remains largely unexplored to date. In this paper, we develop a generative location model that, given an object class and an image, learns to predict plausible bounding boxes for such an object. Our approach first tokenizes the image and target object class, then decodes bounding box coordinates through an autoregressive transformer. This formulation effectively addresses two core challenges in location modeling: the inherent one-to-many nature of plausible locations, and the sparsity of existing location modeling datasets, where fewer than 1% of valid placements are labeled. Furthermore, we incorporate Direct Preference Optimization to leverage negative labels, refining the spatial predictions. Empirical evaluations reveal that our generative location model achieves superior placement accuracy on the OPA dataset as compared to discriminative baselines and image composition approaches. We further test our model in the context of object insertion, where it proposes locations for an off-the-shelf inpainting model to render objects. In this respect, our proposal exhibits improved visual coherence relative to state-of-the-art instruction-tuned editing methods, demonstrating a high-performing location model’s utility in a downstream application.

Paper and project links

PDF Accepted by ICCV 2025 DRL4Real Workshop

Summary

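The paper proposes a generative location model that, given an object class and an image, predicts plausible bounding boxes for that object. The model first tokenizes the image and the target object class, then decodes bounding box coordinates with an autoregressive transformer. This formulation addresses two core challenges in location modeling: plausible locations are inherently one-to-many, and existing datasets are sparsely labeled, with fewer than 1% of valid placements annotated. Direct Preference Optimization is added to exploit negative labels and refine the spatial predictions. On the OPA dataset, the model achieves better placement accuracy than discriminative baselines and image composition approaches. In an object insertion setting, where the model proposes locations for an off-the-shelf inpainting model to render objects, it yields better visual coherence than state-of-the-art instruction-tuned editing methods, showing the value of a strong location model in a downstream application.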
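
To make "decoding bounding box coordinates through an autoregressive transformer" concrete, the sketch below shows one common way to turn a box into a short token sequence: normalize each coordinate and quantize it into a fixed number of bins. This is an assumption for illustration, since the abstract does not say how coordinates are encoded; the bin count, the coordinate order, and the function names are hypothetical.

```python
NUM_BINS = 100  # hypothetical number of discrete bins per coordinate


def box_to_tokens(box, width, height, num_bins=NUM_BINS):
    """Map a pixel box (x_min, y_min, x_max, y_max) to four integer tokens."""
    x0, y0, x1, y1 = box
    normalized = (x0 / width, y0 / height, x1 / width, y1 / height)
    # Clamp so that a coordinate on the far image edge stays in the last bin.
    return [min(int(v * num_bins), num_bins - 1) for v in normalized]


def tokens_to_box(tokens, width, height, num_bins=NUM_BINS):
    """Invert the quantization, returning each bin's centre in pixels."""
    centres = [(t + 0.5) / num_bins for t in tokens]
    return (centres[0] * width, centres[1] * height,
            centres[2] * width, centres[3] * height)


tokens = box_to_tokens((64, 120, 320, 360), width=640, height=480)
recovered = tokens_to_box(tokens, width=640, height=480)
```

With coordinates expressed as tokens, a sequence model can emit a box one token at a time and can be sampled repeatedly to propose several different plausible placements, which suits the one-to-many nature of the task.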

Key Takeaways