LLM

Published: 2025-09-18

Updated: 2025-10-07

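⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are only suitable for a first screening before reading a paper.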

2025-09-18 Update

Scaling Agents via Continual Pre-training

Authors:Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou

Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

Summary

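Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. Yet post-training approaches built on general-purpose foundation models consistently underperform on agentic tasks, particularly in open-source implementations. The root cause is the absence of robust agentic foundation models: during post-training, a model must learn diverse agentic behaviors and align them with expert demonstrations at the same time, which creates fundamental optimization tensions. The authors therefore propose adding Agentic Continual Pre-training (Agentic CPT) to the deep research agent training pipeline, which they describe as a first, to build strong agentic foundation models. On this basis they develop a deep research agent model named AgentFounder. AgentFounder-30B is evaluated on 10 benchmarks and achieves state-of-the-art performance while retaining strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.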

Key Takeaways

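LLMs have evolved into agentic systems capable of autonomous tool use and multi-step reasoning on complex problems.
Post-training on top of general-purpose foundation models underperforms on agentic tasks, particularly in open-source implementations.
Without a robust agentic foundation model, post-training must learn agentic behaviors and align them with expert demonstrations simultaneously, which creates optimization tensions.
The paper proposes integrating Agentic CPT into the deep research agent training pipeline to build strong agentic foundation models.
The resulting deep research agent model is named AgentFounder.
AgentFounder-30B achieves state-of-the-art performance on 10 benchmarks while retaining strong tool-use ability.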

Large Language Model-assisted Meta-optimizer for Automated Design of Constrained Evolutionary Algorithm

Authors:Xu Yang, Rui Wang, Kaiwen Li, Wenhua Li, Weixiong Huang

Meta-black-box optimization has been significantly advanced through the use of large language models (LLMs), yet it remains in its infancy for constrained evolutionary optimization. In this work, AwesomeDE is proposed, which leverages LLMs as the strategy of the meta-optimizer to generate update rules for constrained evolutionary algorithms without human intervention. Meanwhile, the $RTO^2H$ framework is introduced to standardize the prompt design of LLMs. The meta-optimizer is trained on a diverse set of constrained optimization problems. Key components, including prompt design and iterative refinement, are systematically analyzed to determine their impact on design quality. Experimental results demonstrate that the proposed approach outperforms existing methods in terms of computational efficiency and solution accuracy. Furthermore, AwesomeDE is shown to generalize well across distinct problem domains, suggesting its potential for broad applicability. This research contributes to the field by providing a scalable and data-driven methodology for automated constrained algorithm design, while also highlighting limitations and directions for future work.

Summary

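LLMs have significantly advanced meta-black-box optimization, but their use for constrained evolutionary optimization is still in its infancy. The paper proposes AwesomeDE, which uses an LLM as the meta-optimizer's strategy to generate update rules for constrained evolutionary algorithms without human intervention, and introduces the $RTO^2H$ framework to standardize LLM prompt design. The meta-optimizer is trained on a diverse set of constrained optimization problems, and key components such as prompt design and iterative refinement are analyzed systematically. Experiments show that the approach outperforms existing methods in computational efficiency and solution accuracy and generalizes well across distinct problem domains.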

Key Takeaways

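AwesomeDE uses an LLM as the meta-optimizer to generate update rules for constrained evolutionary algorithms without human intervention.
The $RTO^2H$ framework standardizes LLM prompt design.
The meta-optimizer is trained on a diverse set of constrained optimization problems, and AwesomeDE generalizes well across distinct problem domains.
Prompt design and iterative refinement are analyzed systematically for their impact on design quality.
Experiments show that AwesomeDE outperforms existing methods in computational efficiency and solution accuracy.
The work offers a scalable, data-driven methodology for automated constrained algorithm design.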

Don’t Forget the Nonlinearity: Unlocking Activation Functions in Efficient Fine-Tuning

Authors:Bo Yin, Xingyi Yang, Xinchao Wang

Existing parameter-efficient fine-tuning (PEFT) methods primarily adapt weight matrices while keeping activation functions fixed. We introduce NoRA, the first PEFT framework that directly adapts nonlinear activation functions in pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to numerator and denominator coefficients, with a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), achieving accuracy gains of +0.17% and +0.27%. When combined with LoRA (NoRA++), it outperforms LoRA and DoRA under matched training budgets by adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, yielding average MMLU gains of +0.3%–0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. We further show that NoRA constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction. These results establish activation-space tuning as a complementary and highly parameter-efficient alternative to weight-based PEFT, positioning activation functions as first-class objects for model adaptation.

Summary

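This paper introduces NoRA, a parameter-efficient fine-tuning (PEFT) method that directly adapts the nonlinear activation functions of pretrained transformer-based models. NoRA replaces fixed activations with learnable rational functions and applies structured low-rank updates to the numerator and denominator coefficients, using a group-wise design that localizes adaptation and improves stability at minimal cost. On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA matches or exceeds full fine-tuning while updating only 0.4% of parameters (0.02M), with accuracy gains of +0.17% and +0.27%. Combined with LoRA (NoRA++), it outperforms LoRA and DoRA under matched training budgets while adding fewer trainable parameters. On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, with average MMLU gains of +0.3% to +0.8%, including +1.6% on STEM (Alpaca) and +1.3% on OpenOrca. NoRA also constrains adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction.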

Key Takeaways

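NoRA is a PEFT method that adapts the nonlinear activation functions of pretrained transformer-based models.
It replaces fixed activations with learnable rational functions and applies structured low-rank updates to their numerator and denominator coefficients.
On vision transformers trained on CIFAR-10 and CIFAR-100, NoRA gains +0.17% and +0.27% accuracy while updating only 0.4% of parameters.
NoRA++, the combination of NoRA and LoRA, outperforms LoRA and DoRA under matched training budgets.
On LLaMA3-8B instruction tuning, NoRA++ consistently improves generation quality, with average MMLU gains of +0.3% to +0.8%.
NoRA confines adaptation to a low-dimensional functional subspace, implicitly regularizing update magnitude and direction.
Activation-space tuning is a complementary, highly parameter-efficient alternative to weight-based PEFT.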
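
The idea of a learnable rational activation with a low-rank coefficient update can be illustrated with a small sketch. Everything here is an assumption made for illustration: the "safe" denominator form Q(x) = 1 + |b_1 x + b_2 x^2 + ...|, the rank-1 update over a (groups x coefficients) table, and the function names `rational` and `low_rank_update`. The abstract does not specify NoRA's exact parameterisation, so this should not be read as the authors' implementation.

```python
from typing import List


def rational(x: float, num: List[float], den: List[float]) -> float:
    """Evaluate P(x) / Q(x) with a denominator kept >= 1 to avoid poles.

    num holds a_0..a_m for P(x) = sum a_i * x^i.
    den holds b_1..b_n for Q(x) = 1 + |sum b_j * x^j|  (assumed 'safe' form).
    """
    p = sum(a * x ** i for i, a in enumerate(num))
    q = 1.0 + abs(sum(b * x ** (j + 1) for j, b in enumerate(den)))
    return p / q


def low_rank_update(base: List[List[float]], u: List[float], v: List[float]) -> List[List[float]]:
    """Return base + outer(u, v): a rank-1 update over a (groups x coeffs) table."""
    return [[base[g][k] + u[g] * v[k] for k in range(len(v))] for g in range(len(u))]


# Two channel groups, each with its own frozen numerator coefficients.
base_num = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]  # P(x) = x for both groups
den = [0.0, 0.0]                                # Q(x) = 1, so the base activation is the identity

# Only u and v would be trained: 2 + 3 numbers instead of 2 * 3.
u = [1.0, 0.0]       # adapt group 0 only
v = [0.0, 0.0, 0.5]  # add 0.5 * x^2 to its numerator
adapted_num = low_rank_update(base_num, u, v)

y_group0 = rational(2.0, adapted_num[0], den)  # 2 + 0.5 * 4 = 4.0
y_group1 = rational(2.0, adapted_num[1], den)  # group 1 is unchanged: 2.0
```

The group-wise table is what makes the update local: a zero entry in `u` leaves that group's activation exactly as it was.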

The Few-shot Dilemma: Over-prompting Large Language Models

Authors:Yongjian Tang, Doruk Tuncel, Christian Koerner, Thomas Runkler

Over-prompting, a phenomenon where excessive examples in prompts lead to diminished performance in Large Language Models (LLMs), challenges the conventional wisdom about in-context few-shot learning. To investigate this few-shot dilemma, we outline a prompting framework that leverages three standard few-shot selection methods - random sampling, semantic embedding, and TF-IDF vectors - and evaluate these methods across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral. Our experimental results reveal that incorporating excessive domain-specific examples into prompts can paradoxically degrade performance in certain LLMs, which contradicts the prior empirical conclusion that more relevant few-shot examples universally benefit LLMs. Given the trend of LLM-assisted software engineering and requirement analysis, we experiment with two real-world software requirement classification datasets. By gradually increasing the number of TF-IDF-selected and stratified few-shot examples, we identify their optimal quantity for each LLM. This combined approach achieves superior performance with fewer examples, avoiding the over-prompting problem, thus surpassing the state-of-the-art by 1% in classifying functional and non-functional requirements.

Paper and project links

PDF accepted for the main track of FLLM

Summary

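This paper studies over-prompting in LLMs, where too many examples in a prompt degrade performance. To investigate it, the authors outline a prompting framework that uses three standard few-shot selection methods (random sampling, semantic embedding, and TF-IDF vectors) and evaluate them across multiple LLMs. The experiments show that adding excessive domain-specific examples can degrade performance in certain LLMs, contradicting the earlier empirical conclusion that more relevant few-shot examples always help. Motivated by LLM-assisted software engineering and requirement analysis, the authors experiment with two real-world software requirement classification datasets and, by gradually increasing the number of TF-IDF-selected, stratified few-shot examples, identify the optimal quantity for each LLM. The combined approach achieves better performance with fewer examples, avoids over-prompting, and surpasses the state of the art by 1% in classifying functional and non-functional requirements.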

Key Takeaways

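Over-prompting: too many examples in a prompt can degrade LLM performance.
Method: a prompting framework built on random sampling, semantic embedding, and TF-IDF vector selection.
Evaluation: the methods are compared across multiple LLMs, including GPT-4o, GPT-3.5-turbo, DeepSeek-V3, Gemma-3, LLaMA-3.1, LLaMA-3.2, and Mistral.
Finding against conventional wisdom: adding excessive domain-specific examples can backfire, so more is not always better.
Application: experiments use two real-world software requirement classification datasets.
Practice: gradually increasing the number of TF-IDF-selected, stratified examples identifies the optimal quantity for each LLM and avoids over-prompting.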
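
TF-IDF-based, stratified selection of few-shot examples can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's pipeline: whitespace tokenisation, a smoothed IDF, a fixed per-class cap as the stratification rule, and the helper names `tfidf_vectors`, `cosine`, and `select_examples` are all hypothetical, and the requirement sentences are toy data.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build simple TF-IDF vectors (raw term frequency times smoothed IDF)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def select_examples(query: str, pool: List[Tuple[str, str]], per_class: int) -> List[Tuple[str, str]]:
    """Pick the per_class most similar pool examples for each label (stratified)."""
    vecs = tfidf_vectors([text for text, _ in pool] + [query])
    qvec = vecs[-1]
    ranked = sorted(zip(pool, vecs[:-1]), key=lambda pv: cosine(qvec, pv[1]), reverse=True)
    taken: Counter = Counter()
    chosen = []
    for (text, label), _ in ranked:
        if taken[label] < per_class:  # cap each class so the prompt stays balanced
            chosen.append((text, label))
            taken[label] += 1
    return chosen


pool = [
    ("the system shall export reports as pdf", "functional"),
    ("the user shall reset a password by email", "functional"),
    ("pages shall load within two seconds", "non-functional"),
    ("the system shall encrypt stored data", "non-functional"),
]
shots = select_examples("the system shall export invoices as pdf", pool, per_class=1)
```

Raising `per_class` step by step and re-scoring the classifier each time is one way to look for the point where extra examples stop helping.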

More performant and scalable: Rethinking contrastive vision-language pre-training of radiology in the LLM era

Authors:Yingtai Li, Haoran Lai, Xiaoqian Zhou, Shuai Ming, Wenxin Ma, Wei Wei, Shaohua Kevin Zhou

The emergence of Large Language Models (LLMs) presents unprecedented opportunities to revolutionize medical contrastive vision-language pre-training. In this paper, we show how LLMs can facilitate large-scale supervised pre-training, thereby advancing vision-language alignment. We begin by demonstrating that modern LLMs can automatically extract diagnostic labels from radiology reports with remarkable precision (>96% AUC in our experiments) without complex prompt engineering, enabling the creation of large-scale “silver-standard” datasets at a minimal cost (~$3 for 50k CT image-report pairs). Further, we find that a vision encoder trained on this “silver-standard” dataset achieves performance comparable to those trained on labels extracted by specialized BERT-based models, thereby democratizing the access to large-scale supervised pre-training. Building on this foundation, we proceed to reveal that supervised pre-training fundamentally improves contrastive vision-language alignment. Our approach achieves state-of-the-art performance using only a 3D ResNet-18 with vanilla CLIP training, including 83.8% AUC for zero-shot diagnosis on CT-RATE, 77.3% AUC on RAD-ChestCT, and substantial improvements in cross-modal retrieval (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image). These results demonstrate the potential of utilizing LLMs to facilitate more performant and scalable medical AI systems. Our code is available at https://github.com/SadVoxel/More-performant-and-scalable.

Paper and project links

PDF MICCAI 2025

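Summary

The emergence of LLMs opens new opportunities for medical contrastive vision-language pre-training. The paper shows that LLMs can facilitate large-scale supervised pre-training and thereby advance vision-language alignment. Modern LLMs automatically extract diagnostic labels from radiology reports with high precision (>96% AUC) without complex prompt engineering, which allows large-scale "silver-standard" datasets to be built at minimal cost (about $3 for 50k CT image-report pairs). A vision encoder trained on such a dataset performs comparably to one trained on labels extracted by specialized BERT-based models. Building on this, the authors find that supervised pre-training fundamentally improves contrastive vision-language alignment. Using only a 3D ResNet-18 with vanilla CLIP training, the approach reaches 83.8% AUC for zero-shot diagnosis on CT-RATE and 77.3% AUC on RAD-ChestCT, with substantial improvements in cross-modal retrieval.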

Key Takeaways

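LLMs open new opportunities for medical contrastive vision-language pre-training.
LLMs can automatically extract diagnostic labels from radiology reports and build large-scale "silver-standard" datasets at low cost.
A vision encoder trained on the "silver-standard" dataset performs comparably to one trained on labels from specialized BERT-based models.
Supervised pre-training fundamentally improves contrastive vision-language alignment.
The approach achieves 83.8% AUC for zero-shot diagnosis on CT-RATE and 77.3% AUC on RAD-ChestCT.
Cross-modal retrieval also improves substantially (MAP@50=53.7% for image-image, Recall@100=52.2% for report-image).
The results indicate that LLMs can help build more performant and scalable medical AI systems.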

Authors:Ruifei Ding, Zhe Chen, Wen Fan, Chen Long, Huijuan Xiao, Yelu Zeng, Zhen Dong, Bisheng Yang

Street trees are vital to urban livability, providing ecological and social benefits. Establishing a detailed, accurate, and dynamically updated street tree inventory has become essential for optimizing these multifunctional assets within space-constrained urban environments. Given that traditional field surveys are time-consuming and labor-intensive, automated surveys utilizing Mobile Mapping Systems (MMS) offer a more efficient solution. However, existing MMS-acquired tree datasets are limited by small-scale scene, limited annotation, or single modality, restricting their utility for comprehensive analysis. To address these limitations, we introduce WHU-STree, a cross-city, richly annotated, and multi-modal urban street tree dataset. Collected across two distinct cities, WHU-STree integrates synchronized point clouds and high-resolution images, encompassing 21,007 annotated tree instances across 50 species and 2 morphological parameters. Leveraging the unique characteristics, WHU-STree concurrently supports over 10 tasks related to street tree inventory. We benchmark representative baselines for two key tasks–tree species classification and individual tree segmentation. Extensive experiments and in-depth analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a critical prerequisite for practical algorithm deployment. In particular, we identify key challenges and outline potential future works for fully exploiting WHU-STree, encompassing multi-modal fusion, multi-task collaboration, cross-domain generalization, spatial pattern learning, and Multi-modal Large Language Model for street tree asset management. The WHU-STree dataset is accessible at: https://github.com/WHU-USI3DV/WHU-STree.

Summary

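Street trees are vital to urban livability, providing ecological and social benefits. A detailed, accurate, and dynamically updated street tree inventory is essential for optimizing these multifunctional assets in space-constrained urban environments. Traditional field surveys are time-consuming and labor-intensive, and automated surveys with Mobile Mapping Systems (MMS) are more efficient. However, existing MMS-acquired tree datasets are limited by small scenes, limited annotation, or a single modality, which restricts comprehensive analysis. To address this, the authors introduce WHU-STree, a cross-city, richly annotated, multi-modal urban street tree dataset. Collected across two distinct cities, it integrates synchronized point clouds and high-resolution images and covers 21,007 annotated tree instances across 50 species and 2 morphological parameters. The dataset supports over 10 tasks related to street tree inventory, and representative baselines are benchmarked for two key tasks: tree species classification and individual tree segmentation. Experiments and analysis demonstrate the significant potential of multi-modal data fusion and underscore cross-domain applicability as a prerequisite for practical algorithm deployment.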

Key Takeaways

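Street trees provide important ecological and social benefits, and optimizing these multifunctional assets requires a detailed, accurate, and dynamically updated inventory.
Traditional field surveys are time-consuming and labor-intensive; automated surveys with Mobile Mapping Systems (MMS) are more efficient.
Existing MMS tree datasets are limited by small scenes, limited annotation, or a single modality, which restricts comprehensive analysis.
WHU-STree is a cross-city, richly annotated, multi-modal street tree dataset that integrates synchronized point clouds and high-resolution images, covering two cities and 50 species.
WHU-STree supports over 10 tasks related to street tree inventory, including tree species classification and individual tree segmentation.
Experiments and analysis show the potential of multi-modal data fusion and the importance of cross-domain applicability for practical algorithm deployment.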

LLM Hallucination Detection: A Fast Fourier Transform Method Based on Hidden Layer Temporal Signals

Authors:Jinxin Li, Gang Tu, ShengYu Cheng, Junjie Hu, Jinting Wang, Rui Chen, Zhilong Zhou, Dongbo Shan

Hallucination remains a critical barrier for deploying large language models (LLMs) in reliability-sensitive applications. Existing detection methods largely fall into two categories: factuality checking, which is fundamentally constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. As a result, their effectiveness and robustness remain limited. We propose HSAD (Hidden Signal Analysis-based Detection), a novel hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. Furthermore, by leveraging the autoregressive nature of LLMs, HSAD identifies optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD achieves over 10 percentage points improvement compared to prior state-of-the-art methods. By integrating reasoning-process modeling with frequency-domain analysis, HSAD establishes a new paradigm for robust hallucination detection in LLMs.

Summary

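Hallucination remains a critical barrier to deploying LLMs in reliability-sensitive applications. Existing detection methods fall largely into two categories: factuality checking, which is constrained by external knowledge coverage, and static hidden-state analysis, which fails to capture deviations in reasoning dynamics. The paper proposes HSAD (Hidden Signal Analysis-based Detection), a hallucination detection framework that models the temporal dynamics of hidden representations during autoregressive generation. HSAD constructs hidden-layer signals by sampling activations across layers, applies the Fast Fourier Transform (FFT) to obtain frequency-domain representations, and extracts the strongest non-DC frequency component as spectral features. It also exploits the autoregressive nature of LLMs to identify optimal observation points for effective and reliable detection. Across multiple benchmarks, including TruthfulQA, HSAD improves on prior state-of-the-art methods by over 10 percentage points. According to the authors, combining reasoning-process modeling with frequency-domain analysis establishes a new paradigm for robust hallucination detection in LLMs.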

Key Takeaways

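Hallucination is a critical barrier to deploying LLMs in reliability-sensitive applications.
Existing detection methods fall into two categories, factuality checking and static hidden-state analysis, and both have limitations.
HSAD addresses this by modeling the temporal dynamics of hidden representations during autoregressive generation.
HSAD samples activations across layers to build hidden-layer signals and applies the FFT to extract the strongest non-DC frequency component as spectral features.
HSAD uses the autoregressive nature of LLMs to identify optimal observation points for reliable detection.
Across multiple benchmarks, including TruthfulQA, HSAD improves on prior state-of-the-art methods by over 10 percentage points.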
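
Extracting the strongest non-DC frequency component from a signal can be sketched as below. This is an illustration under assumptions, not HSAD itself: the signal is a made-up one-dimensional sequence standing in for sampled hidden activations, a naive O(N^2) discrete Fourier transform replaces the FFT to keep the example dependency-free (an FFT returns the same values faster), and the names `dft_magnitudes` and `strongest_non_dc` are hypothetical.

```python
import cmath
import math
from typing import List, Tuple


def dft_magnitudes(signal: List[float]) -> List[float]:
    """Magnitudes of the discrete Fourier transform for bins 0..N//2."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal))
        mags.append(abs(coeff))
    return mags


def strongest_non_dc(signal: List[float]) -> Tuple[int, float]:
    """Return (bin index, magnitude) of the largest component, skipping bin 0 (DC)."""
    mags = dft_magnitudes(signal)
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k, mags[k]


# Hypothetical hidden-layer signal: a constant offset (the DC part)
# plus an oscillation that completes 2 cycles over 8 samples.
signal = [3.0 + math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
peak_bin, peak_mag = strongest_non_dc(signal)  # bin 2, magnitude close to 4.0
```

Skipping bin 0 matters because the constant offset dominates the spectrum (magnitude 24 here) while carrying no information about how the signal varies.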

Reasoning with Preference Constraints: A Benchmark for Language Models in Many-to-One Matching Markets

Authors:Marylou Fauchard, Florian Carichon, Margarida Carvalho, Golnoosh Farnadi

Recent advances in reasoning with large language models (LLMs) have demonstrated strong performance on complex mathematical tasks, including combinatorial optimization. Techniques such as Chain-of-Thought and In-Context Learning have further enhanced this capability, making LLMs both powerful and accessible tools for a wide range of users, including non-experts. However, applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, we introduce a novel benchmark of 369 instances of the College Admission Problem, a canonical example of a matching problem with preferences, to evaluate LLMs across key dimensions: feasibility, stability, and optimality. We employ this benchmark to assess the performance of several open-weight LLMs. Our results first reveal that while LLMs can satisfy certain constraints, they struggle to meet all evaluation criteria consistently. They also show that reasoning LLMs, like QwQ and GPT-oss, significantly outperform traditional models such as Llama, Qwen or Mistral, defined here as models used without any dedicated reasoning mechanisms. Moreover, we observed that LLMs reacted differently to the various prompting strategies tested, which include Chain-of-Thought, In-Context Learning and role-based prompting, with no prompt consistently offering the best performance. Finally, we report the performances from iterative prompting with auto-generated feedback and show that they are not monotonic; they can peak early and then significantly decline in later attempts. Overall, this work offers a new perspective on model reasoning performance and the effectiveness of prompting strategies in combinatorial optimization problems with preferential constraints.

Summary

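LLMs have shown strong performance on complex mathematical tasks, including combinatorial optimization, and techniques such as Chain-of-Thought and In-Context Learning have made them powerful tools that are accessible to non-experts. Applying LLMs to matching problems, which require reasoning under preferential and structural constraints, remains underexplored. To address this gap, the paper introduces a benchmark of 369 instances of the College Admission Problem and evaluates several open-weight LLMs on feasibility, stability, and optimality. LLMs can satisfy certain constraints but struggle to meet all evaluation criteria consistently. Reasoning LLMs such as QwQ and GPT-oss significantly outperform traditional models such as Llama, Qwen, or Mistral, defined here as models used without any dedicated reasoning mechanisms. LLMs react differently to the prompting strategies tested, and no prompt consistently performs best. With iterative prompting and auto-generated feedback, performance is not monotonic: it can peak early and then decline significantly in later attempts.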

Key Takeaways

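LLMs perform strongly on complex mathematical tasks, including combinatorial optimization.
Techniques such as Chain-of-Thought and In-Context Learning further enhance this reasoning capability.
Applying LLMs to matching problems with preferential and structural constraints remains underexplored.
A new benchmark of 369 College Admission Problem instances evaluates LLMs on feasibility, stability, and optimality.
LLMs satisfy certain constraints but struggle to meet all evaluation criteria consistently.
Reasoning LLMs such as QwQ and GPT-oss significantly outperform models used without dedicated reasoning mechanisms, such as Llama, Qwen, or Mistral.
No prompting strategy is consistently best, and performance under iterative prompting is not monotonic.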
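
Feasibility and stability of a many-to-one matching can be checked mechanically. The sketch below uses the textbook definition of a blocking pair for the College Admission Problem and is not the benchmark's evaluation code; the function names, the toy preference lists, and the assumptions that every matched college appears in the student's list and every admitted student appears in the college's ranking are mine.

```python
from typing import Dict, List, Optional, Tuple


def is_feasible(match: Dict[str, Optional[str]], capacity: Dict[str, int]) -> bool:
    """A matching is feasible if no college exceeds its capacity."""
    counts: Dict[str, int] = {}
    for college in match.values():
        if college is not None:
            counts[college] = counts.get(college, 0) + 1
    return all(n <= capacity.get(c, 0) for c, n in counts.items())


def blocking_pairs(
    match: Dict[str, Optional[str]],
    student_prefs: Dict[str, List[str]],
    college_prefs: Dict[str, List[str]],
    capacity: Dict[str, int],
) -> List[Tuple[str, str]]:
    """List (student, college) pairs that would both rather be matched together."""
    admitted = {c: [s for s, m in match.items() if m == c] for c in capacity}
    pairs = []
    for s, prefs in student_prefs.items():
        current = match.get(s)
        # Colleges the student strictly prefers to the current assignment.
        better = prefs if current is None else prefs[: prefs.index(current)]
        for c in better:
            rank = college_prefs[c]
            if s not in rank:
                continue  # the college finds this student unacceptable
            has_seat = len(admitted[c]) < capacity[c]
            displaces = any(rank.index(s) < rank.index(o) for o in admitted[c])
            if has_seat or displaces:
                pairs.append((s, c))
    return pairs


student_prefs = {"ann": ["x", "y"], "bob": ["x", "y"], "cat": ["x", "y"]}
college_prefs = {"x": ["ann", "bob", "cat"], "y": ["ann", "bob", "cat"]}
capacity = {"x": 1, "y": 2}

stable = {"ann": "x", "bob": "y", "cat": "y"}
unstable = {"ann": "y", "bob": "x", "cat": "y"}

no_blocks = blocking_pairs(stable, student_prefs, college_prefs, capacity)   # []
blocks = blocking_pairs(unstable, student_prefs, college_prefs, capacity)    # [("ann", "x")]
```

A matching is stable when the list of blocking pairs is empty. Optimality, the third criterion in the benchmark, needs a separate comparison against a reference solution and is not covered here.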

Empowering LLMs with Parameterized Skills for Adversarial Long-Horizon Planning

Authors:Sijia Cui, Shuai Xu, Aiyao He, Yanna Wang, Bo Xu

Recent advancements in Large Language Models (LLMs) have led to the development of LLM-based AI agents. A key challenge is the creation of agents that can effectively ground themselves in complex, adversarial long-horizon environments. Existing methods mainly focus on (1) using LLMs as policies to interact with the environment through generating low-level feasible actions, and (2) utilizing LLMs to generate high-level tasks or language guides to stimulate action generation. However, the former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, we introduce the Plan with Language, Act with Parameter (PLAP) planning framework that facilitates the grounding of LLM-based agents in long-horizon environments. The PLAP method comprises three key components: (1) a skill library containing environment-specific parameterized skills, (2) a skill planner powered by LLMs, and (3) a skill executor converting the parameterized skills into executable action sequences. We implement PLAP in MicroRTS, a long-horizon real-time strategy game that provides an unfamiliar and challenging environment for LLMs. The experimental results demonstrate the effectiveness of PLAP. In particular, GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent, CoacAI. Additionally, we design comprehensive evaluation metrics and test 6 closed-source and 2 open-source LLMs within the PLAP framework, ultimately releasing an LLM leaderboard ranking long-horizon skill planning ability. Our code is available at https://github.com/AI-Research-TeamX/PLAP.

Paper and project links

PDF Accepted to IJCNN 2025

Summary

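Recent advances in LLMs have led to LLM-based AI agents. A key challenge is creating agents that can ground themselves effectively in complex, adversarial long-horizon environments. Existing methods mainly either use LLMs as policies that generate low-level feasible actions, or use LLMs to generate high-level tasks or language guides that stimulate action generation. The former struggles to generate reliable actions, while the latter relies heavily on expert experience to translate high-level tasks into specific action sequences. To address these challenges, the paper introduces the Plan with Language, Act with Parameter (PLAP) planning framework, which facilitates the grounding of LLM-based agents in long-horizon environments. Experiments in MicroRTS demonstrate its effectiveness: GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents, and Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent CoacAI.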

Key Takeaways

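Recent advances in LLMs have led to the development of LLM-based AI agents.
Creating agents that can ground themselves in complex, adversarial long-horizon environments is a key challenge.
Current methods either struggle to generate reliable low-level actions or rely heavily on expert experience to turn high-level tasks into action sequences.
The PLAP planning framework addresses this with three components: a skill library, an LLM-powered skill planner, and a skill executor.
PLAP is implemented in MicroRTS, a long-horizon real-time strategy game, where experiments demonstrate its effectiveness.
GPT-4o-driven PLAP in a zero-shot setting outperforms 80% of baseline agents.
Qwen2-72B-driven PLAP, with carefully crafted few-shot examples, surpasses the top-tier scripted agent CoacAI.
The authors design comprehensive evaluation metrics, test 6 closed-source and 2 open-source LLMs within PLAP, and release a leaderboard ranking long-horizon skill planning ability.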
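
The three-part structure (skill library, planner output, skill executor) can be illustrated with a toy sketch. All names here are hypothetical: the `harvest` and `attack` skills, the text format of the plan, and the low-level command strings are invented for illustration and are neither the MicroRTS API nor the PLAP code. The planner itself is replaced by a fixed string standing in for an LLM response.

```python
from typing import Callable, Dict, List, Tuple

Cell = Tuple[int, ...]            # a grid coordinate such as (1, 1)
Action = Tuple[str, Cell]         # (low-level command, target cell)
Skill = Callable[..., List[Action]]


def harvest(worker: Cell, mine: Cell, base: Cell) -> List[Action]:
    """Parameterized skill: fetch a resource from `mine` and return it to `base`."""
    return [("move", mine), ("gather", mine), ("move", base), ("deposit", base)]


def attack(unit: Cell, target: Cell) -> List[Action]:
    """Parameterized skill: approach `target` and attack it."""
    return [("move", target), ("attack", target)]


# Skill library: environment-specific skills looked up by name.
SKILL_LIBRARY: Dict[str, Skill] = {"harvest": harvest, "attack": attack}


def parse_plan(text: str) -> List[Tuple[str, List[Cell]]]:
    """Parse lines like 'harvest (1,1) (0,0) (2,2)' into (skill name, parameters)."""
    plan = []
    for line in text.strip().splitlines():
        name, *cells = line.split()
        params = [tuple(int(v) for v in c.strip("()").split(",")) for c in cells]
        plan.append((name, params))
    return plan


def execute(plan: List[Tuple[str, List[Cell]]]) -> List[Action]:
    """Skill executor: expand each planned skill into low-level actions."""
    actions: List[Action] = []
    for name, params in plan:
        if name not in SKILL_LIBRARY:
            continue  # ignore skills the library does not define
        actions.extend(SKILL_LIBRARY[name](*params))
    return actions


# Stand-in for the skill planner's LLM output.
llm_output = """
harvest (1,1) (0,0) (2,2)
attack (1,2) (5,5)
"""
actions = execute(parse_plan(llm_output))
```

Constraining the planner to skill names and parameters keeps the LLM's output small and checkable, while the executor stays deterministic.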

A comparison of pipelines for the translation of a low resource language based on transformers

Authors:Chiara Bonfanti, Michele Colombino, Giulia Coucourde, Faeze Memari, Stefano Pinardi, Rosa Meo

This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandé language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models using decoder-only architectures for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores, evaluated on both test sentences and official Bambara benchmarks. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, which provides language-agnostic embeddings. A BERT extension is then applied to LaBSE to generate translations. All pipelines were tested on Dokotoro (medical) and Bayelemagaba (mixed domains). Results show that the first pipeline, although simpler, achieves the best translation accuracy (10% BLEU, 21% chrF on Bayelemagaba), consistent with low-resource translation results. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.

Paper and project links

PDF 9 pages, 4 figures

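Summary

This work compares three pipelines for training transformer-based neural networks to produce machine translators for Bambara, a Mandé language spoken in Africa by about 14,188,850 people. The first pipeline trains a simple transformer to translate sentences from French into Bambara. The second fine-tunes LLaMA3 (3B-8B) instructor models, which use decoder-only architectures, for French-to-Bambara translation. Models from the first two pipelines were trained with different hyperparameter combinations to improve BLEU and chrF scores. The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model, then applies a BERT extension to generate translations. In tests on Dokotoro (medical) and Bayelemagaba (mixed domains), the first pipeline achieves the best translation accuracy. On the Yiri dataset, created for this work, it achieves 33.81% BLEU and 41% chrF. Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.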

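Key Takeaways

The study compares three pipelines for training neural networks to translate from French into Bambara.
The first pipeline, a simple transformer, achieves the best translation accuracy despite being the simplest.
The second pipeline fine-tunes LLaMA3 (3B-8B) instructor models for French-to-Bambara translation.
Models in the first two pipelines are trained with different hyperparameter combinations to improve BLEU and chrF scores.
The third pipeline uses language distillation with a student-teacher dual neural network to integrate Bambara into a pre-trained LaBSE model.
Instructor-based models perform better on single datasets than on aggregated collections, suggesting they capture dataset-specific patterns more effectively.
The results show the strengths and limitations of each pipeline for this low-resource translation task.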
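
Since the results are reported in BLEU and chrF, a simplified chrF-style score may help readers interpret the numbers. The sketch below is an assumption-laden approximation: it strips spaces, averages character n-gram precision and recall over orders 1 to 6, and combines them with an F-beta score (beta = 2). It is not the scorer used in the paper, and scores from standard tools can differ. The strings are toy inputs, not Bambara data.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i : i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


identical = chrf("abc", "abc")   # 1.0
disjoint = chrf("abc", "xyz")    # 0.0
partial = chrf("abcd", "abce")   # strictly between 0 and 1
```

With beta = 2, recall is weighted more heavily than precision, so a hypothesis that misses reference characters is penalized more than one that adds extra ones.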

TAPS: Tool-Augmented Personalisation via Structured Tagging

Authors:Ekaterina Taktasheva, Jeff Dalton

Recent advancements in tool-augmented large language models have enabled them to interact with external tools, enhancing their ability to perform complex user tasks. However, existing approaches overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be effectively integrated into goal-oriented dialogue agents. Through extensive analysis, we identify key weaknesses in the ability of LLMs to personalise tool use. To this end, we introduce TAPS, a novel solution that enhances personalised tool use by leveraging a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences, achieving the new state-of-the-art for open source models on the NLSI task.

Paper and project links

PDF Accepted to EMNLP 2025 Main

Summary

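Recent advances in tool-augmented LLMs let them interact with external tools, which improves their ability to perform complex user tasks. Existing approaches, however, overlook the role of personalisation in guiding tool use. This work investigates how user preferences can be integrated effectively into goal-oriented dialogue agents. Through extensive analysis, the authors identify key weaknesses in the ability of LLMs to personalise tool use. They introduce TAPS, a solution that enhances personalised tool use with a structured tagging tool and an uncertainty-based tool detector. TAPS significantly improves the ability of LLMs to incorporate user preferences and sets a new state of the art for open-source models on the NLSI task.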

Key Takeaways

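Tool-augmented LLMs can interact with external tools, which improves their ability to perform complex tasks.
Existing approaches overlook the role of personalisation in guiding tool use.
The work focuses on integrating user preferences into goal-oriented dialogue agents.
LLMs show key weaknesses in personalising tool use.
TAPS addresses them with a structured tagging tool and an uncertainty-based tool detector.
TAPS significantly improves the ability of LLMs to incorporate user preferences and achieves a new state of the art for open-source models on the NLSI task.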

Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation

Authors:Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li

Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://github.com/ECNU-Text-Computing/cot-hallu-detect .

Paper and project links

PDF Accepted at EMNLP 2025 Findings

Summary

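LLMs often hallucinate, generating factually incorrect or semantically irrelevant content. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. The paper fills this gap with a systematic empirical evaluation. A pilot experiment shows that CoT reasoning significantly affects the LLM's internal states and token probability distributions. Building on this, the authors evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs, examining changes in hallucination score distributions, detection accuracy, and detection confidence. The findings show that CoT prompting reduces hallucination frequency but also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. The study highlights an overlooked trade-off in the use of reasoning.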

Key Takeaways

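LLMs hallucinate, generating factually incorrect or semantically irrelevant content.
Chain-of-Thought (CoT) prompting mitigates hallucinations by encouraging step-by-step reasoning.
CoT reasoning significantly affects the LLM's internal states and token probability distributions.
Across instruction-tuned and reasoning-oriented LLMs, CoT prompting impairs the effectiveness of mainstream hallucination detection methods.
While reducing hallucination frequency, CoT prompting tends to obscure the critical signals that detection relies on.
The study reveals an overlooked trade-off between reducing hallucinations through reasoning and keeping them detectable.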

UniversalCEFR: Enabling Open Multilingual Research on Language Proficiency Assessment

Authors:Joseph Marvin Imperial, Abdullah Barayan, Regina Stodden, Rodrigo Wilkens, Ricardo Munoz Sanchez, Lingyun Gao, Melissa Torgbi, Dawn Knight, Gail Forey, Reka R. Jablonkai, Ekaterina Kochmar, Robert Reynolds, Eugénio Ribeiro, Horacio Saggion, Elena Volodina, Sowmya Vajjala, Thomas François, Fernando Alva-Manchego, Harish Tayyar Madabushi

We introduce UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels in 13 languages. To enable open research in automated readability and language proficiency assessment, UniversalCEFR comprises 505,807 CEFR-labeled texts curated from educational and learner-oriented resources, standardized into a unified data format to support consistent processing, analysis, and modelling across tasks and languages. To demonstrate its utility, we conduct benchmarking experiments using three modelling paradigms: a) linguistic feature-based classification, b) fine-tuning pre-trained LLMs, and c) descriptor-based prompting of instruction-tuned LLMs. Our results support using linguistic features and fine-tuning pretrained models in multilingual CEFR level assessment. Overall, UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats, and promoting their accessibility to the global research community.

Paper and project links

PDF Accepted to EMNLP 2025 (Main Conference)

Summary

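This paper introduces UniversalCEFR, a large-scale multilingual and multidimensional dataset of texts annotated with CEFR (Common European Framework of Reference) levels. It comprises 505,807 CEFR-labeled texts in 13 languages, curated from educational and learner-oriented resources. A unified data format supports consistent processing, analysis, and modelling across tasks and languages. Benchmarking experiments support using linguistic features and fine-tuning pre-trained models for multilingual CEFR level assessment. UniversalCEFR aims to establish best practices in data distribution for language proficiency research by standardising dataset formats and promoting their accessibility to the global research community.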

Key Takeaways

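UniversalCEFR is a large-scale multilingual and multidimensional dataset of CEFR-labeled texts drawn from educational and learner-oriented resources.
The dataset uses a unified format that supports consistent processing, analysis, and modelling across tasks and languages.
Benchmarking experiments demonstrate the dataset's utility for linguistic feature-based classification and for fine-tuning pre-trained models.
The dataset aims to establish best practices for language proficiency research by standardising dataset formats and promoting accessibility to the global research community.
The benchmarks cover three modelling paradigms: linguistic feature-based classification, fine-tuning pre-trained LLMs, and descriptor-based prompting of instruction-tuned LLMs.
The results support using linguistic features and fine-tuning pre-trained models for multilingual CEFR level assessment.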

Game-RL: Synthesizing Verifiable Game Tasks at Scale to Boost VLMs General Reasoning

Authors:Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang

Real-world vision language reasoning scenarios often include diverse and complex tasks. However, vision language reinforcement learning has primarily focused on a narrow set of tasks (e.g. geometry or chart reasoning), limiting the improvement of Vision Language Models’ (VLMs) general reasoning. Therefore, we propose a novel Code2Logic approach, using Large Language Models (LLMs) to synthesize verifiable game reasoning tasks at scale via adapting game code. Using the Code2Logic, we developed the GameQA dataset to train and evaluate VLMs. GameQA is verifiable and scalable, offers controllable difficulty gradation and is diverse with 30 games and 158 tasks. Then we apply Game-RL, which is simple reinforcement learning on GameQA. Surprisingly, despite training solely on game tasks, VLMs demonstrated out of domain generalization, specifically Qwen2.5-VL-7B improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at the GitHub repository.

Paper and project links

PDF 63 pages, 23 figures, submitted to NeurIPS 2025

Summary

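This paper proposes Code2Logic, an approach that uses large language models (LLMs) to synthesize verifiable game reasoning tasks at scale by adapting game code. With it, the authors build the GameQA dataset to train and evaluate vision language models (VLMs). GameQA is diverse and offers controllable difficulty gradation. Applying reinforcement learning on GameQA (Game-RL) yields out-of-domain generalization even though training uses only game tasks. The code, dataset, and models are available in the GitHub repository.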

Key Takeaways

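Code2Logic synthesizes verifiable game reasoning tasks at scale by adapting game code.
The resulting GameQA dataset is used to train and evaluate vision language models (VLMs).
GameQA covers 30 games and 158 tasks, and offers diversity and controllable difficulty gradation.
Game-RL is simple reinforcement learning applied to GameQA.
VLMs trained only on game tasks show out-of-domain generalization.
Qwen2.5-VL-7B improves by 2.33% across 7 diverse vision-language benchmarks.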

SuPreME: A Supervised Pre-training Framework for Multimodal ECG Representation Learning

Authors:Mingsheng Cai, Jiuming Jiang, Wenhao Huang, Che Liu, Rossella Arcucci

Cardiovascular diseases are a leading cause of death and disability worldwide. Electrocardiogram (ECG) is critical for diagnosing and monitoring cardiac health, but obtaining large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG Self-Supervised Learning (eSSL) methods mitigate this by learning features without extensive labels but fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, we propose SuPreME, a Supervised Pre-training framework for Multimodal ECG representation learning. SuPreME is pre-trained using structured diagnostic labels derived from ECG report entities through a one-time offline extraction with Large Language Models (LLMs), which help denoise, standardize cardiac concepts, and improve clinical representation learning. By fusing ECG signals with textual cardiac queries instead of fixed labels, SuPreME enables zero-shot classification of unseen conditions without further fine-tuning. We evaluate SuPreME on six downstream datasets covering 106 cardiac conditions, achieving superior zero-shot AUC performance of 77.20%, surpassing state-of-the-art eSSLs by 4.98%. Results demonstrate SuPreME’s effectiveness in leveraging structured, clinically relevant knowledge for high-quality ECG representations.

Paper and project links

PDF Findings of The 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)

Summary

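Cardiovascular diseases are a leading cause of death and disability worldwide. The electrocardiogram (ECG) is critical for diagnosing and monitoring cardiac health, but building large-scale annotated ECG datasets is labor-intensive and time-consuming. Recent ECG self-supervised learning (eSSL) methods learn features without extensive labels, yet they fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning. To address these challenges, the paper proposes SuPreME, a supervised pre-training framework for multimodal ECG representation learning. SuPreME is pre-trained on structured diagnostic labels derived from ECG report entities through a one-time offline extraction with large language models (LLMs), which helps denoise and standardize cardiac concepts and improves clinical representation learning. By fusing ECG signals with textual cardiac queries instead of fixed labels, SuPreME enables zero-shot classification of unseen conditions without further fine-tuning. Evaluated on six downstream datasets covering 106 cardiac conditions, SuPreME achieves a zero-shot AUC of 77.20%, surpassing state-of-the-art eSSL methods by 4.98%, which demonstrates the value of structured, clinically relevant knowledge for high-quality ECG representations.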

Key Takeaways

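Cardiovascular diseases are a leading cause of death and disability worldwide, and the ECG is critical for diagnosing and monitoring cardiac health.
Building large-scale annotated ECG datasets is labor-intensive and time-consuming.
Recent eSSL methods learn features without extensive labels, but they fail to capture fine-grained clinical semantics and require extensive task-specific fine-tuning.
SuPreME is a supervised pre-training framework for multimodal ECG representation learning.
SuPreME is pre-trained on structured diagnostic labels that LLMs extract from ECG report entities in a one-time offline step.
By fusing ECG signals with textual cardiac queries, SuPreME enables zero-shot classification of unseen conditions.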
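
The headline result is reported as zero-shot AUC. As a reminder of what that metric measures, the sketch below computes binary ROC AUC from its rank-based definition: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are hypothetical, and how the paper aggregates AUC across conditions and datasets is not stated in the abstract.

```python
def roc_auc(labels, scores):
    """Binary ROC AUC via pairwise comparison of positive and negative scores.

    labels: iterable of 0/1 ground-truth values.
    scores: iterable of model scores, where higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    # Count positive/negative pairs ranked correctly; a tie counts as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: scores for one hypothetical cardiac condition.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.3, 0.4, 0.5, 0.1]
print(round(roc_auc(toy_labels, toy_scores), 4))  # 5 of 6 pairs ordered correctly
```

Because AUC depends only on how cases are ranked, it needs no decision threshold, which makes it a natural fit for zero-shot settings where no threshold has been tuned on the target condition.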

SEVEN: Pruning Transformer Model by Reserving Sentinels

Authors:Jinying Xiao, Ping Li, Jie Nie, Zhe Tang

Large-scale Transformer models (TM) have demonstrated outstanding performance across various tasks. However, their considerable parameter size restricts their applicability, particularly on mobile devices. Due to the dynamic and intricate nature of gradients on TM compared to Convolutional Neural Networks, commonly used pruning methods tend to retain weights with larger gradient noise. This results in pruned models that are sensitive to sparsity and datasets, exhibiting suboptimal performance. Symbolic Descent (SD) is a general approach for training and fine-tuning TM. In this paper, we attempt to describe the noisy batch gradient sequences on TM through the cumulative process of SD. We utilize this design to dynamically assess the importance scores of weights. We introduce SEVEN, which particularly favors weights with consistently high sensitivity, i.e., weights with small gradient noise, and tends to preserve them. Extensive experiments on various TM in natural language, question-answering, and image classification domains are conducted to validate the effectiveness of SEVEN. The results demonstrate significant improvements of SEVEN in multiple pruning scenarios and across different sparsity levels. Additionally, SEVEN exhibits robust performance under various fine-tuning strategies. The code is publicly available at https://github.com/xiaojinying/SEVEN.

Paper and project links

PDF IJCNN 2024

Summary

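Large-scale Transformer models (TM) perform strongly across many tasks, but their parameter size limits their use on mobile devices. This paper proposes SEVEN, which describes the noisy batch gradient sequences on TM through the cumulative process of Symbolic Descent (SD) and uses that description to dynamically assess the importance scores of weights. SEVEN favors preserving weights with consistently high sensitivity, i.e., weights with small gradient noise. Experiments show significant improvements across multiple pruning scenarios and sparsity levels, along with robust performance under various fine-tuning strategies.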

Key Takeaways

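Large-scale Transformer models (TM) perform strongly, but their parameter size limits their use on mobile devices.
Gradients on TM are dynamic and intricate, so commonly used pruning methods tend to retain weights with larger gradient noise.
Symbolic Descent (SD) is a general approach for training and fine-tuning TM.
SEVEN describes the noisy batch gradient sequences on TM through the cumulative process of SD.
SEVEN favors preserving weights with consistently high sensitivity, i.e., weights with small gradient noise.
Experiments show that SEVEN yields significant improvements across multiple pruning scenarios and sparsity levels.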
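
The idea of preferring weights whose sensitivity is consistently high can be illustrated with a toy scoring rule. The sketch below is not SEVEN's actual scoring procedure, which the abstract does not spell out. It swaps in a simple, assumed heuristic: a first-order sensitivity term |w · mean(g)| divided by the standard deviation of the gradient across batches, so that weights with noisy gradients rank lower. All names and numbers are hypothetical.

```python
import statistics


def importance(weight, grads, eps=1e-8):
    """Noise-penalized sensitivity score for a single weight (assumed heuristic).

    grads holds the gradient of this weight observed over several mini-batches.
    """
    mean_grad = statistics.fmean(grads)
    grad_noise = statistics.pstdev(grads)  # spread of the gradient across batches
    return abs(weight * mean_grad) / (grad_noise + eps)


def prune_mask(weights, grad_history, sparsity):
    """Return a keep-mask (True = keep) that removes the lowest-scoring weights.

    sparsity is the fraction of weights to prune, between 0 and 1.
    """
    scores = [importance(w, g) for w, g in zip(weights, grad_history)]
    n_keep = len(weights) - int(len(weights) * sparsity)
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [i in keep for i in range(len(weights))]


# Four weights of equal magnitude with different gradient behavior.
toy_weights = [0.5, 0.5, 0.5, 0.5]
toy_grads = [
    [0.2, 0.2, 0.2],     # stable gradient: low noise
    [1.0, -0.8, 0.4],    # same mean gradient as above, but very noisy
    [0.1, 0.12, 0.08],   # small but stable gradient
    [0.5, -0.5, 0.0],    # gradients cancel out: zero mean sensitivity
]
print(prune_mask(toy_weights, toy_grads, sparsity=0.5))
```

With half of the weights pruned, the two stable-gradient weights are kept, while the noisy weight is dropped even though its mean gradient matches that of the first weight.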

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-18/LLM/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
