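⚠️ All summaries below were generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not use this for serious academic purposes; it is intended only for initial screening before reading the papers.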
Updated 2025-09-30
Death of the Novel(ty): Beyond n-Gram Novelty as a Metric for Textual Creativity
Authors:Arkadiy Saakyan, Najoung Kim, Smaranda Muresan, Tuhin Chakrabarty
N-gram novelty is widely used to evaluate language models’ ability to generate text outside of their training data. More recently, it has also been adopted as a metric for measuring textual creativity. However, theoretical work on creativity suggests that this approach may be inadequate, as it does not account for creativity’s dual nature: novelty (how original the text is) and appropriateness (how sensical and pragmatic it is). We investigate the relationship between this notion of creativity and n-gram novelty through 7542 expert writer annotations (n=26) of novelty, pragmaticality, and sensicality via close reading of human and AI-generated text. We find that while n-gram novelty is positively associated with expert writer-judged creativity, ~91% of top-quartile expressions by n-gram novelty are not judged as creative, cautioning against relying on n-gram novelty alone. Furthermore, unlike human-written text, higher n-gram novelty in open-source LLMs correlates with lower pragmaticality. In an exploratory study with frontier close-source models, we additionally confirm that they are less likely to produce creative expressions than humans. Using our dataset, we test whether zero-shot, few-shot, and finetuned models are able to identify creative expressions (a positive aspect of writing) and non-pragmatic ones (a negative aspect). Overall, frontier LLMs exhibit performance much higher than random but leave room for improvement, especially struggling to identify non-pragmatic expressions. We further find that LLM-as-a-Judge novelty scores from the best-performing model were predictive of expert writer preferences.
Paper and project links
PDF 26 pages, 10 figures, under review
Summary
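This paper examines n-gram novelty, a metric widely used to measure how far language models generate text beyond their training data, and asks whether it is adequate as a measure of textual creativity. Creativity has two components, novelty and appropriateness, so n-gram novelty alone cannot capture it. Based on 7,542 expert writer annotations (n=26), the study finds that n-gram novelty is positively associated with expert-judged creativity, yet about 91% of expressions in the top quartile by n-gram novelty are not judged creative. Unlike in human-written text, higher n-gram novelty in open-source LLM output correlates with lower pragmaticality. An exploratory study also finds that frontier closed-source models are less likely than humans to produce creative expressions. Using the dataset, the authors test whether zero-shot, few-shot, and fine-tuned models can identify creative and non-pragmatic expressions: frontier LLMs perform well above chance but leave room for improvement, especially on non-pragmatic expressions.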
Key Takeaways
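- N-gram novelty is widely used to evaluate how far language models generate text beyond their training data, but it may be inadequate as a measure of textual creativity.
- Creativity combines novelty and appropriateness, so judging it by n-gram novelty alone is incomplete.
- About 91% of expressions in the top quartile by n-gram novelty were not judged creative by expert writers.
- In open-source LLM output, unlike in human-written text, higher n-gram novelty correlates with lower pragmaticality.
- Frontier closed-source models are less likely than humans to produce creative expressions.
- Tests of zero-shot, few-shot, and fine-tuned models on identifying creative and non-pragmatic expressions show that LLMs still have room to improve.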
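The abstract does not spell out how n-gram novelty is computed, so the following is only a minimal sketch of one common reading: the share of a text's n-grams that never occur in a reference corpus. Whitespace tokenization, lowercasing, and the function names are assumptions made for illustration, not the paper's procedure.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_novelty(text: str, reference_corpus: Iterable[str], n: int = 3) -> float:
    """Fraction of the text's n-grams that are absent from the reference corpus."""
    seen: Set[Tuple[str, ...]] = set()
    for doc in reference_corpus:
        # Assumption: naive lowercased whitespace tokenization.
        seen.update(ngrams(doc.lower().split(), n))
    candidates = ngrams(text.lower().split(), n)
    if not candidates:
        return 0.0  # text shorter than n tokens has no n-grams to score
    novel = sum(1 for gram in candidates if gram not in seen)
    return novel / len(candidates)
```

A score like this measures only surface novelty. It says nothing about whether the unseen n-grams are sensical or pragmatic, which is the gap the paper points to.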

Training-Free Synthetic Data Generation with Dual IP-Adapter Guidance
Authors:Luc Boudier, Loris Manganelli, Eleftherios Tsonis, Nicolas Dufour, Vicky Kalogeiton
Few-shot image classification remains challenging due to the limited availability of labeled examples. Recent approaches have explored generating synthetic training data using text-to-image diffusion models, but often require extensive model fine-tuning or external information sources. We present a novel training-free approach, called DIPSY, that leverages IP-Adapter for image-to-image translation to generate highly discriminative synthetic images using only the available few-shot examples. DIPSY introduces three key innovations: (1) an extended classifier-free guidance scheme that enables independent control over positive and negative image conditioning; (2) a class similarity-based sampling strategy that identifies effective contrastive examples; and (3) a simple yet effective pipeline that requires no model fine-tuning or external captioning and filtering. Experiments across ten benchmark datasets demonstrate that our approach achieves state-of-the-art or comparable performance, while eliminating the need for generative model adaptation or reliance on external tools for caption generation and image filtering. Our results highlight the effectiveness of leveraging dual image prompting with positive-negative guidance for generating class-discriminative features, particularly for fine-grained classification tasks.
Paper and project links
PDF BMVC 2025. Project page: https://www.lix.polytechnique.fr/vista/projects/2025_bmvc_dipsy/
Summary
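This paper presents DIPSY, a novel training-free method for few-shot image classification. It uses IP-Adapter for image-to-image translation to generate highly discriminative synthetic images from only the available few-shot examples. DIPSY introduces three key innovations: an extended classifier-free guidance scheme, a class similarity-based sampling strategy, and a simple pipeline that needs no model fine-tuning or external captioning. Experiments on ten benchmark datasets show state-of-the-art or comparable performance.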
Key Takeaways
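- DIPSY is a novel training-free method for few-shot image classification.
- DIPSY uses IP-Adapter for image-to-image translation to generate synthetic images.
- It introduces an extended classifier-free guidance scheme that gives independent control over positive and negative image conditioning.
- A class similarity-based sampling strategy identifies effective contrastive examples.
- The pipeline is simple yet effective and needs no model fine-tuning or external captioning and filtering.
- Experiments on ten benchmark datasets show state-of-the-art or comparable performance.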
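The abstract names an extended classifier-free guidance scheme with separate positive and negative image conditioning but gives no formula. The sketch below shows one plausible form, assumed here for illustration only: the unconditional noise estimate is pushed toward the positive-image estimate and away from the negative-image estimate, each with its own weight. The function name, the argument names, and the exact combination rule are hypothetical.

```python
from typing import List


def dual_guidance(
    eps_uncond: List[float],
    eps_pos: List[float],
    eps_neg: List[float],
    w_pos: float,
    w_neg: float,
) -> List[float]:
    """Combine three noise estimates with independent positive/negative weights.

    Assumed rule: eps = eps_uncond + w_pos * (eps_pos - eps_uncond)
                                   - w_neg * (eps_neg - eps_uncond)
    """
    return [
        u + w_pos * (p - u) - w_neg * (n - u)
        for u, p, n in zip(eps_uncond, eps_pos, eps_neg)
    ]
```

With `w_neg = 0` and `w_pos = 1` the rule reduces to the positive-conditioned estimate, so the negative term acts as an optional contrastive push.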

FoodSEM: Large Language Model Specialized in Food Named-Entity Linking
Authors:Ana Gjorgjevikj, Matej Martinc, Gjorgjina Cenikj, Sašo Džeroski, Barbara Koroušić Seljak, Tome Eftimov
This paper introduces FoodSEM, a state-of-the-art fine-tuned open-source large language model (LLM) for named-entity linking (NEL) to food-related ontologies. To the best of our knowledge, food NEL is a task that cannot be accurately solved by state-of-the-art general-purpose (large) language models or custom domain-specific models/systems. Through an instruction-response (IR) scenario, FoodSEM links food-related entities mentioned in a text to several ontologies, including FoodOn, SNOMED-CT, and the Hansard taxonomy. The FoodSEM model achieves state-of-the-art performance compared to related models/systems, with F1 scores even reaching 98% on some ontologies and datasets. The presented comparative analyses against zero-shot, one-shot, and few-shot LLM prompting baselines further highlight FoodSEM’s superior performance over its non-fine-tuned version. By making FoodSEM and its related resources publicly available, the main contributions of this article include (1) publishing a food-annotated corpora into an IR format suitable for LLM fine-tuning/evaluation, (2) publishing a robust model to advance the semantic understanding of text in the food domain, and (3) providing a strong baseline on food NEL for future benchmarking.
Paper and project links
PDF To appear in the Proceedings of the 28th International Conference on Discovery Science (DS 2025)
Summary
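This paper introduces FoodSEM, a state-of-the-art fine-tuned open-source large language model for named-entity linking (NEL) in the food domain. Through an instruction-response (IR) setup, the model links food-related entities mentioned in text to several ontologies, including FoodOn, SNOMED-CT, and the Hansard taxonomy. FoodSEM outperforms related models and systems, with F1 scores reaching 98% on some ontologies and datasets. The main contributions are food-annotated corpora published in an IR format suitable for LLM fine-tuning and evaluation, a robust model that advances semantic understanding of food-domain text, and a strong food NEL baseline for future benchmarking.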
Key Takeaways
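- FoodSEM is a state-of-the-art fine-tuned open-source LLM for food named-entity linking (NEL).
- Through fine-tuning, it links food-related entities in text to several ontologies, including FoodOn, SNOMED-CT, and the Hansard taxonomy.
- It outperforms related models and systems, with F1 scores reaching 98% on some ontologies and datasets.
- Its release is meant to advance semantic understanding of text in the food domain.
- The contributions include food-annotated corpora published in an IR format suitable for LLM fine-tuning and evaluation.
- FoodSEM provides a strong food NEL baseline for future benchmarking.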
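To make the instruction-response framing concrete, here is a rough sketch of how an annotated sentence might be turned into an IR pair and how linking output could be scored with F1. The prompt wording, the `mention -> id` response layout, and the placeholder identifiers are all assumptions for illustration; the actual FoodSEM format is not given in the abstract.

```python
from typing import Dict, List, Tuple

Link = Tuple[str, str]  # (mention text, ontology identifier)


def to_instruction_response(text: str, links: List[Link], ontology: str) -> Dict[str, str]:
    """Build one hypothetical instruction-response record for fine-tuning."""
    instruction = f"Link each food entity in the text to its {ontology} identifier.\nText: {text}"
    response = "\n".join(f"{mention} -> {entity_id}" for mention, entity_id in links)
    return {"instruction": instruction, "response": response}


def linking_f1(predicted: List[Link], gold: List[Link]) -> float:
    """F1 over exact (mention, identifier) matches."""
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Scoring on exact (mention, identifier) pairs is a strict choice; a looser variant could credit a correct identifier even when the mention span differs.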

Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
Authors:Zilun Zhang, Zian Guan, Tiancheng Zhao, Haozhan Shen, Tianyu Li, Yuxiang Cai, Zhonggen Su, Zhaojun Liu, Jianwei Yin, Xiang Li
Referring expression understanding in remote sensing poses unique challenges, as it requires reasoning over complex object-context relationships. While supervised fine-tuning (SFT) on multimodal large language models achieves strong performance with massive labeled datasets, they struggle in data-scarce scenarios, leading to poor generalization. To address this limitation, we propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 enforces the model to first generate explicit, interpretable reasoning chains that decompose referring expressions, and then leverage these rationales to localize target objects. This “reason first, then act” process enables the model to make more effective use of limited annotations, enhances generalization, and provides interpretability. We validate Geo-R1 on three carefully designed few-shot geospatial referring benchmarks, where our model consistently and substantially outperforms SFT baselines. It also demonstrates strong cross-dataset generalization, highlighting its robustness. Code and data will be released at http://geo-r1.github.io.
Summary
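Referring expression understanding in remote sensing is uniquely challenging because it requires reasoning over complex object-context relationships. Supervised fine-tuning (SFT) of multimodal large language models performs well with massive labeled datasets but generalizes poorly when data is scarce. To address this, the authors propose Geo-R1, a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring. Geo-R1 makes the model first generate explicit, interpretable reasoning chains that decompose the referring expression, then use those rationales to localize the target objects. This “reason first, then act” process makes better use of limited annotations, improves generalization, and provides interpretability. On three carefully designed few-shot geospatial referring benchmarks, Geo-R1 consistently and substantially outperforms SFT baselines, and it also shows strong cross-dataset generalization.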
Key Takeaways
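- Referring expression understanding in remote sensing is uniquely challenging because it requires reasoning over complex object-context relationships.
- Supervised fine-tuning (SFT) performs well with abundant labeled data but generalizes poorly when data is scarce.
- Geo-R1 is a reasoning-centric reinforcement fine-tuning (RFT) paradigm for few-shot geospatial referring.
- Geo-R1 generates explicit, interpretable reasoning chains that decompose referring expressions.
- The “reason first, then act” process makes better use of limited annotations and improves generalization.
- Geo-R1 outperforms SFT baselines on three few-shot benchmarks and shows strong cross-dataset generalization.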

ChaosNexus: A Foundation Model for Universal Chaotic System Forecasting with Multi-scale Representations
Authors:Chang Liu, Bohao Zhao, Jingtao Ding, Yong Li
Accurately forecasting chaotic systems, prevalent in domains such as weather prediction and fluid dynamics, remains a significant scientific challenge. The inherent sensitivity of these systems to initial conditions, coupled with a scarcity of observational data, severely constrains traditional modeling approaches. Since these models are typically trained for a specific system, they lack the generalization capacity necessary for real-world applications, which demand robust zero-shot or few-shot forecasting on novel or data-limited scenarios. To overcome this generalization barrier, we propose ChaosNexus, a foundation model pre-trained on a diverse corpus of chaotic dynamics. ChaosNexus employs a novel multi-scale architecture named ScaleFormer augmented with Mixture-of-Experts layers, to capture both universal patterns and system-specific behaviors. The model demonstrates state-of-the-art zero-shot generalization across both synthetic and real-world benchmarks. On a large-scale testbed comprising over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% compared to the leading baseline. This robust performance extends to real-world applications with exceptional data efficiency. For instance, in 5-day global weather forecasting, ChaosNexus achieves a competitive zero-shot mean error below 1 degree, a result that further improves with few-shot fine-tuning. Moreover, experiments on the scaling behavior of ChaosNexus provide a guiding principle for scientific foundation models: cross-system generalization stems from the diversity of training systems, rather than sheer data volume.
Summary
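This paper addresses the forecasting of chaotic systems, which are prevalent in domains such as weather prediction and fluid dynamics. Traditional models are constrained by sensitivity to initial conditions and scarce observational data, and because they are typically trained for one specific system they generalize poorly to real-world scenarios. The authors propose ChaosNexus, a foundation model pre-trained on a diverse corpus of chaotic dynamics. It uses a multi-scale architecture named ScaleFormer, augmented with Mixture-of-Experts layers, to capture both universal patterns and system-specific behaviors, and it achieves state-of-the-art zero-shot generalization on synthetic and real-world benchmarks. On a testbed of over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% over the leading baseline. In 5-day global weather forecasting, it reaches a competitive zero-shot mean error below 1 degree, which improves further with few-shot fine-tuning. Scaling experiments indicate that cross-system generalization stems from the diversity of training systems rather than sheer data volume.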
Key Takeaways
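- ChaosNexus is a foundation model for forecasting chaotic systems, which arise in domains such as weather prediction and fluid dynamics.
- Traditional modeling approaches are constrained by sensitivity to initial conditions and scarce observational data, and they generalize poorly beyond the system they were trained on.
- ChaosNexus is pre-trained on a diverse corpus of chaotic dynamics and uses the multi-scale ScaleFormer architecture with Mixture-of-Experts layers to capture universal patterns and system-specific behaviors.
- It achieves state-of-the-art zero-shot generalization on both synthetic and real-world benchmarks.
- On a testbed of over 9,000 synthetic chaotic systems, it improves the fidelity of long-term attractor statistics by more than 40% over the leading baseline.
- It is data-efficient in real-world applications: in 5-day global weather forecasting, its zero-shot mean error is below 1 degree.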
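The sensitivity to initial conditions mentioned above is easy to see on a toy system. The sketch below uses the logistic map with r = 4, a textbook chaotic system chosen purely for illustration; it is not one of the systems in the paper's corpus, and the function names are made up for this example.

```python
from typing import List


def logistic_trajectory(x0: float, r: float = 4.0, steps: int = 50) -> List[float]:
    """Iterate the logistic map x -> r * x * (1 - x), returning steps + 1 values."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def max_divergence(x0: float, delta: float = 1e-9, r: float = 4.0, steps: int = 50) -> float:
    """Largest gap between two trajectories whose starting points differ by delta."""
    a = logistic_trajectory(x0, r, steps)
    b = logistic_trajectory(x0 + delta, r, steps)
    return max(abs(p - q) for p, q in zip(a, b))
```

Calling `max_divergence(0.2)` shows a starting gap of one part in a billion growing by many orders of magnitude within 50 steps, which is why pointwise long-range forecasts are hard and why the paper evaluates long-term attractor statistics.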

Vision Language Models Cannot Plan, but Can They Formalize?
Authors:Muyu He, Yuxi Zheng, Yuchen Liu, Zijian An, Bill Cai, Jiani Huang, Lifeng Zhou, Feng Liu, Ziyang Li, Li Zhang
The advancement of vision language models (VLMs) has empowered embodied agents to accomplish simple multimodal planning tasks, but not long-horizon ones requiring long sequences of actions. In text-only simulations, long-horizon planning has seen significant improvement brought by repositioning the role of LLMs. Instead of directly generating action sequences, LLMs translate the planning domain and problem into a formal planning language like the Planning Domain Definition Language (PDDL), which can call a formal solver to derive the plan in a verifiable manner. In multimodal environments, research on VLM-as-formalizer remains scarce, usually involving gross simplifications such as predefined object vocabulary or overly similar few-shot examples. In this work, we present a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, and multimodal PDDL formalization. We evaluate those on an existing benchmark while presenting another two that for the first time account for planning with authentic, multi-view, and low-quality images. We conclude that VLM-as-formalizer greatly outperforms end-to-end plan generation. We reveal the bottleneck to be vision rather than language, as VLMs often fail to capture an exhaustive set of necessary object relations. While generating intermediate, textual representations such as captions or scene graphs partially compensate for the performance, their inconsistent gain leaves headroom for future research directions on multimodal planning formalization.
Summary
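This paper studies vision language models (VLMs) as formalizers for long-horizon planning in multimodal environments. It presents a suite of five VLM-as-formalizer pipelines that tackle one-shot, open-vocabulary, multimodal PDDL formalization, evaluated on an existing benchmark and on two new ones that account for planning with authentic, multi-view, and low-quality images. VLM-as-formalizer greatly outperforms end-to-end plan generation. The bottleneck is vision rather than language, since VLMs often fail to capture an exhaustive set of necessary object relations. Generating intermediate textual representations such as captions or scene graphs partially compensates, but the inconsistent gains leave headroom for future research on multimodal planning formalization.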
Key Takeaways
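- Advances in VLMs let embodied agents complete simple multimodal planning tasks, but not long-horizon ones that require long action sequences.
- In text-only simulations, long-horizon planning improved once LLMs were repositioned to translate the planning domain and problem into a formal language such as PDDL, which a formal solver then uses to derive a verifiable plan.
- In multimodal environments, research on VLM-as-formalizer remains scarce and usually relies on gross simplifications such as a predefined object vocabulary or overly similar few-shot examples.
- The paper presents five VLM-as-formalizer pipelines for one-shot, open-vocabulary, multimodal PDDL formalization.
- VLM-as-formalizer greatly outperforms end-to-end plan generation.
- The bottleneck is vision: VLMs often fail to capture an exhaustive set of necessary object relations.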
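As a rough illustration of the formalizer idea, the sketch below assembles a PDDL problem file from a list of objects and relations, the kind of structured output a VLM would have to extract from images before a solver can plan. The helper name and the blocksworld-style predicates in the usage note are assumptions for illustration, not the paper's pipelines.

```python
from typing import List, Sequence


def to_pddl_problem(
    name: str,
    domain: str,
    objects: List[str],
    init: List[Sequence[str]],
    goal: List[Sequence[str]],
) -> str:
    """Render objects, initial facts, and goal facts as a PDDL problem string."""
    obj_str = " ".join(objects)
    # Each fact is a sequence such as ("on", "a", "b"), rendered as "(on a b)".
    init_str = " ".join("(" + " ".join(fact) + ")" for fact in init)
    goal_str = " ".join("(" + " ".join(fact) + ")" for fact in goal)
    return (
        f"(define (problem {name})\n"
        f"  (:domain {domain})\n"
        f"  (:objects {obj_str})\n"
        f"  (:init {init_str})\n"
        f"  (:goal (and {goal_str})))"
    )
```

For example, `to_pddl_problem("stack", "blocksworld", ["a", "b"], [("clear", "a"), ("on-table", "b")], [("on", "a", "b")])` yields a problem whose correctness depends entirely on the extracted facts. A relation missing from `init` makes the solver plan for the wrong world, which matches the vision bottleneck reported above.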

pFedMMA: Personalized Federated Fine-Tuning with Multi-Modal Adapter for Vision-Language Models
Authors:Sajjad Ghiasvand, Mahnoosh Alizadeh, Ramtin Pedarsani
Vision-Language Models (VLMs) like CLIP have demonstrated remarkable generalization in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. While prompt tuning has emerged as a popular parameter-efficient approach in personalized federated learning, existing methods often sacrifice generalization in favor of personalization, struggling particularly on unseen classes or domains. In this work, we propose pFedMMA, the first personalized federated learning framework that leverages multi-modal adapters for vision-language tasks. Each adapter contains modality-specific up- and down-projection layers alongside a globally shared projection that aligns cross-modal features. Our optimization strategy allows clients to locally adapt to personalized data distributions while collaboratively training the shared projection to improve global generalization. This design is also communication-efficient, as only the shared component is exchanged during communication rounds. Through extensive experiments across eleven datasets, including domain- and label-shift scenarios, we show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization, outperforming recent federated prompt tuning methods.
Summary
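Vision-language models (VLMs) such as CLIP generalize remarkably well in zero- and few-shot settings, but adapting them efficiently to decentralized, heterogeneous data remains a challenge. This work proposes pFedMMA, a personalized federated learning framework that uses multi-modal adapters for vision-language tasks. Its optimization strategy lets clients adapt locally to personalized data distributions while collaboratively training a shared projection to improve global generalization. The design is communication-efficient because only the shared component is exchanged. Experiments across eleven datasets show that pFedMMA achieves state-of-the-art trade-offs between personalization and generalization and outperforms recent federated prompt tuning methods.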
Key Takeaways
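- pFedMMA is the first personalized federated learning framework to use multi-modal adapters for vision-language tasks.
- It combines adaptation to personalized data distributions with improved global generalization.
- Each adapter contains modality-specific up- and down-projection layers plus a globally shared projection that aligns cross-modal features.
- The optimization strategy lets clients adapt locally while collaboratively training the shared projection.
- The design is communication-efficient: only the shared component is exchanged during communication rounds.
- Experiments across eleven datasets show state-of-the-art trade-offs between personalization and generalization.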
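The communication pattern described above can be sketched in a few lines: each client keeps its modality-specific parameters private and only the shared projection is averaged. Plain unweighted averaging, the flat list representation, and the key names `local` and `shared` are assumptions for illustration; the paper's actual aggregation rule is not given in the abstract.

```python
from typing import Dict, List

Client = Dict[str, List[float]]  # "local": private parameters, "shared": exchanged parameters


def aggregate_shared(clients: List[Client]) -> List[float]:
    """Average the shared projection parameters across clients (assumed unweighted)."""
    count = len(clients)
    dim = len(clients[0]["shared"])
    return [sum(c["shared"][i] for c in clients) / count for i in range(dim)]


def communication_round(clients: List[Client]) -> List[Client]:
    """Replace every client's shared part with the average; leave local parts untouched."""
    global_shared = aggregate_shared(clients)
    return [
        {"local": list(c["local"]), "shared": list(global_shared)}
        for c in clients
    ]
```

Because only the `shared` entries cross the network, the per-round traffic scales with the size of the shared projection rather than with the whole adapter.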

Intercept Cancer: Cancer Pre-Screening with Large Scale Healthcare Foundation Models
Authors:Liwen Sun, Hao-Ren Yao, Gary Gao, Ophir Frieder, Chenyan Xiong
Cancer screening, leading to early detection, saves lives. Unfortunately, existing screening techniques require expensive and intrusive medical procedures, not globally available, resulting in too many lost would-be-saved lives. We present CATCH-FM, CATch Cancer early with Healthcare Foundation Models, a cancer pre-screening methodology that identifies high-risk patients for further screening solely based on their historical medical records. With millions of electronic healthcare records (EHR), we establish the scaling law of EHR foundation models pretrained on medical code sequences, pretrain compute-optimal foundation models of up to 2.4 billion parameters, and finetune them on clinician-curated cancer risk prediction cohorts. In our retrospective evaluation comprising of thirty thousand patients, CATCH-FM achieves strong efficacy, with 50% sensitivity in predicting first cancer risks at 99% specificity cutoff, and outperforming feature-based tree models and both general and medical LLMs by up to 20% AUPRC. Despite significant demographic, healthcare system, and EHR coding differences, CATCH-FM achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard, outperforming EHR foundation models pretrained using on-site patient data. Our analysis demonstrates the robustness of CATCH-FM in various patient distributions, the benefits of operating in the ICD code space, and its ability to capture non-trivial cancer risk factors. Our code will be open-sourced.
Summary
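This paper presents CATCH-FM, a cancer pre-screening method that identifies high-risk patients for further screening based solely on their historical medical records. Using millions of electronic healthcare records (EHR), the authors establish a scaling law for EHR foundation models pretrained on medical code sequences, pretrain compute-optimal models of up to 2.4 billion parameters, and fine-tune them on clinician-curated cancer risk prediction cohorts. In a retrospective evaluation of thirty thousand patients, CATCH-FM reaches 50% sensitivity in predicting first cancer risks at a 99% specificity cutoff, and it achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard. The analysis shows robustness across patient distributions and an ability to capture non-trivial cancer risk factors. The code will be open-sourced.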
Key Takeaways
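- CATCH-FM is a cancer pre-screening method that uses historical medical records to identify high-risk patients for further screening.
- It builds foundation models on electronic healthcare record (EHR) data and pretrains them compute-optimally.
- In the retrospective evaluation, it reaches 50% sensitivity in predicting first cancer risks at a 99% specificity cutoff.
- It achieves state-of-the-art pancreatic cancer risk prediction on the EHRSHOT few-shot leaderboard and is robust across patient distributions.
- It can capture non-trivial cancer risk factors.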
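The headline metric, sensitivity at a fixed specificity cutoff, can be computed as in the sketch below. It assumes that higher scores mean higher risk and that a patient is flagged when the score is strictly above the threshold; the function name and the simple quadratic scan are illustrative choices, not the authors' evaluation code.

```python
from typing import List


def sensitivity_at_specificity(
    scores: List[float], labels: List[int], target_specificity: float = 0.99
) -> float:
    """Sensitivity at the lowest threshold whose specificity meets the target.

    labels: 1 for patients who developed cancer, 0 otherwise.
    """
    negatives = sorted(s for s, y in zip(scores, labels) if y == 0)
    positives = [s for s, y in zip(scores, labels) if y == 1]
    if not negatives or not positives:
        raise ValueError("need at least one positive and one negative example")
    # Candidate thresholds are the negative scores, tried from lowest to highest.
    for threshold in negatives:
        # Negatives at or below the threshold are correctly left unflagged.
        specificity = sum(1 for s in negatives if s <= threshold) / len(negatives)
        if specificity >= target_specificity:
            return sum(1 for s in positives if s > threshold) / len(positives)
    return 0.0  # unreachable: the highest negative score gives specificity 1.0
```

At a 99% cutoff, at most 1 in 100 patients without cancer is flagged, so a 50% sensitivity means half of the future cancer cases are caught under that constraint.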

Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
Authors:Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Vision-Language Models (VLMs) such as CLIP have shown remarkable performance in cross-modal tasks through large-scale contrastive pre-training. To adapt these large transformer-based models efficiently for downstream tasks, Parameter-Efficient Fine-Tuning (PEFT) techniques like (Low-Rank Adaptation) LoRA have emerged as scalable alternatives to full fine-tuning, especially in few-shot scenarios. However, like traditional deep neural networks, VLMs are highly vulnerable to adversarial attacks, where imperceptible perturbations can significantly degrade model performance. Adversarial training remains the most effective strategy for improving model robustness in PEFT. In this work, we propose AdvCLIP-LoRA, to our knowledge the first method designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. Our method formulates training as a minimax optimization over low-rank adapters and adversarial perturbations, enabling robust adaptation with a small trainable footprint. Across eight datasets and two backbones (ViT-B/16 and ViT-B/32), AdvCLIP-LoRA achieves state-of-the-art performance in few-shot classification, adversarial base-to-new generalization, and cross-dataset transfer, delivering higher adversarial robustness than prompt tuning baselines without sacrificing much clean accuracy. These findings highlight AdvCLIP-LoRA as a practical approach for robust adaptation of VLMs in resource-constrained settings.
Summary
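This paper addresses the adversarial robustness of CLIP models and proposes AdvCLIP-LoRA, which the authors describe as, to their knowledge, the first method designed to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. The method formulates training as a minimax optimization over low-rank adapters and adversarial perturbations. Across eight datasets and two backbones (ViT-B/16 and ViT-B/32), AdvCLIP-LoRA achieves state-of-the-art performance, delivering higher adversarial robustness than prompt tuning baselines without sacrificing much clean accuracy. These findings position AdvCLIP-LoRA as a practical approach for robust adaptation of VLMs in resource-constrained settings.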
Key Takeaways
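- Vision-language models (VLMs) such as CLIP perform remarkably well on cross-modal tasks.
- Parameter-efficient fine-tuning techniques such as Low-Rank Adaptation (LoRA) have emerged for adapting these models efficiently to downstream tasks, especially in few-shot scenarios.
- VLMs are nonetheless highly vulnerable to adversarial attacks, where imperceptible perturbations can significantly degrade performance; adversarial training is an effective strategy for improving robustness.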
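The minimax formulation mentioned in the summary can be written out in a generic form. This is a sketch under assumptions rather than the paper's exact objective: $W_0$ denotes the frozen pretrained weights, $BA$ the LoRA update with low-rank factors $B$ and $A$, $\delta$ the adversarial perturbation, and the $\ell_\infty$ budget $\epsilon$ is an assumed choice of threat model.

```latex
\min_{A,\,B}\;
\mathbb{E}_{(x,\,y)\sim\mathcal{D}}
\Big[
  \max_{\|\delta\|_\infty \le \epsilon}
  \mathcal{L}\big(f_{W_0 + BA}(x + \delta),\, y\big)
\Big]
```

The inner maximization searches for the worst-case perturbation of each input, while the outer minimization updates only the low-rank factors, which keeps the trainable footprint small.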

Texture or Semantics? Vision-Language Models Get Lost in Font Recognition
Authors:Zhecheng Li, Guoxian Song, Yujun Cai, Zhen Xiong, Junsong Yuan, Yiwei Wang
Modern Vision-Language Models (VLMs) exhibit remarkable visual and linguistic capabilities, achieving impressive performance in various tasks such as image recognition and object localization. However, their effectiveness in fine-grained tasks remains an open question. In everyday scenarios, individuals encountering design materials, such as magazines, typography tutorials, research papers, or branding content, may wish to identify aesthetically pleasing fonts used in the text. Given their multimodal capabilities and free accessibility, many VLMs are often considered potential tools for font recognition. This raises a fundamental question: Do VLMs truly possess the capability to recognize fonts? To investigate this, we introduce the Font Recognition Benchmark (FRB), a compact and well-structured dataset comprising 15 commonly used fonts. FRB includes two versions: (i) an easy version, where 10 sentences are rendered in different fonts, and (ii) a hard version, where each text sample consists of the names of the 15 fonts themselves, introducing a stroop effect that challenges model perception. Through extensive evaluation of various VLMs on font recognition tasks, we arrive at the following key findings: (i) Current VLMs exhibit limited font recognition capabilities, with many state-of-the-art models failing to achieve satisfactory performance and being easily affected by the stroop effect introduced by textual information. (ii) Few-shot learning and Chain-of-Thought (CoT) prompting provide minimal benefits in improving font recognition accuracy across different VLMs. (iii) Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
Paper and project links
PDF Accepted to COLM 2025
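Summary
Modern vision-language models (VLMs) show strong visual and linguistic capabilities in tasks such as image recognition and object localization, but their effectiveness on fine-grained tasks remains an open question. Motivated by the everyday need to identify fonts in design materials, the authors introduce the Font Recognition Benchmark (FRB), which comprises 15 commonly used fonts. They find that current VLMs have limited font recognition capabilities and are easily affected by the Stroop effect, and that few-shot learning and Chain-of-Thought (CoT) prompting bring minimal gains in accuracy. Attention analysis sheds light on the inherent limitations of VLMs in capturing semantic features.
Key Takeaways
- Current VLMs have limited font recognition capabilities and fall short on this fine-grained task.
- FRB has two versions, easy and hard, that test models at different difficulty levels.
- Few-shot learning and Chain-of-Thought (CoT) prompting do little to improve font recognition accuracy.
- VLMs are easily misled when the text sample consists of font names, which shows their susceptibility to the Stroop effect.
- Attention analysis sheds light on the limitations of VLMs in capturing semantic features.
- Modern VLMs need further improvement on fine-grained tasks such as font recognition.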
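The two benchmark versions can be pictured with a small sketch. It reflects one reading of the abstract, assumed here for illustration: easy samples pair ordinary sentences with fonts, while hard samples render one font's name in a different font so that the text and the typeface conflict. The function names and the "misled" measure are hypothetical, and image rendering is left out.

```python
from typing import Dict, List

Sample = Dict[str, str]  # "text": what is written, "font": typeface it is rendered in


def easy_samples(sentences: List[str], fonts: List[str]) -> List[Sample]:
    """Every sentence rendered in every font."""
    return [{"text": s, "font": f} for s in sentences for f in fonts]


def hard_samples(fonts: List[str]) -> List[Sample]:
    """Font names rendered in a different font, creating a Stroop-style conflict."""
    return [{"text": name, "font": f} for f in fonts for name in fonts if name != f]


def stroop_error_rate(samples: List[Sample], predictions: List[str]) -> float:
    """Share of samples where the model answered with the written name, not the typeface."""
    if not samples:
        return 0.0
    misled = sum(
        1 for s, p in zip(samples, predictions)
        if p == s["text"] and p != s["font"]
    )
    return misled / len(samples)
```

A model that reads the text instead of looking at the glyph shapes would score near 1.0 on this measure for the hard split.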

Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning
Authors:Bokai Hu, Sai Ashish Somayajula, Xin Pan, Pengtao Xie
Instruction-fine-tuned large language models (LLMs) under 14B parameters continue to underperform on natural language understanding (NLU) tasks, often trailing smaller models like BERT-base on benchmarks such as GLUE and SuperGLUE. Motivated by the success of reinforcement learning in reasoning tasks (e.g., DeepSeek), we explore Proximal Policy Optimization (PPO) as a framework to improve the NLU capabilities of LLMs. We frame NLU as a reinforcement learning environment, treating token generation as a sequence of actions and optimizing for reward signals based on alignment with ground-truth labels. PPO consistently outperforms supervised fine-tuning, yielding an average improvement of 6.3 points on GLUE, and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively. Notably, PPO-tuned models outperform GPT-4o by over 4% on average across sentiment and natural language inference tasks, including gains of 7.3% on the Mental Health dataset and 10.9% on SIGA-nli. This work highlights a promising direction for adapting LLMs to new tasks by reframing them as reinforcement learning problems, enabling learning through simple end-task rewards rather than extensive data curation.
Summary
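Instruction-fine-tuned large language models (LLMs) under 14B parameters underperform on natural language understanding (NLU) tasks, often trailing smaller models such as BERT-base on benchmarks like GLUE and SuperGLUE. Motivated by the success of reinforcement learning in reasoning tasks (e.g., DeepSeek), the authors explore Proximal Policy Optimization (PPO) as a framework for improving the NLU capabilities of LLMs. They frame NLU as a reinforcement learning environment, treat token generation as a sequence of actions, and optimize reward signals based on alignment with ground-truth labels. PPO outperforms supervised fine-tuning by an average of 6.3 points on GLUE and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively. PPO-tuned models also outperform GPT-4o by over 4% on average across sentiment and natural language inference tasks, including gains of 7.3% on the Mental Health dataset and 10.9% on SIGA-nli. The work points to a promising way of adapting LLMs to new tasks by reframing them as reinforcement learning problems, so that models learn from simple end-task rewards rather than extensive data curation.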
Key Takeaways
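- Instruction-fine-tuned LLMs under 14B parameters underperform on NLU tasks, notably on the GLUE and SuperGLUE benchmarks.
- Proximal Policy Optimization (PPO) is explored as a framework for improving the NLU capabilities of LLMs.
- NLU is framed as a reinforcement learning environment in which token generation is a sequence of actions.
- PPO outperforms supervised fine-tuning by an average of 6.3 points on GLUE and surpasses zero-shot and few-shot prompting by 38.7 and 26.1 points, respectively.
- PPO-tuned models outperform GPT-4o by over 4% on average across sentiment and natural language inference tasks.
- Reframing tasks as reinforcement learning problems lets LLMs adapt to new tasks through simple end-task rewards rather than extensive data curation.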
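A reward based on alignment with ground-truth labels could look like the sketch below. The exact reward shape is not given in the abstract, so the values used here (1 for a match, 0 for a wrong but valid label, and a penalty of -1 for output that is not a valid label) are assumptions for illustration, as is the function name.

```python
from typing import Iterable


def label_reward(generated: str, gold_label: str, valid_labels: Iterable[str]) -> float:
    """Score one generated answer against the gold label for an NLU task."""
    prediction = generated.strip().lower()
    allowed = {label.lower() for label in valid_labels}
    if prediction not in allowed:
        return -1.0  # assumed penalty for output that is not a valid label
    return 1.0 if prediction == gold_label.lower() else 0.0
```

A scalar like this is all the PPO loop needs from the task: no rationale annotations or curated demonstrations, only the label already present in the dataset.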

Leveraging Model Guidance to Extract Training Data from Personalized Diffusion Models
Authors:Xiaoyu Wu, Jiaru Zhang, Zhiwei Steven Wu
Diffusion Models (DMs) have become powerful image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small image set to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the data leakage risks when releasing fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: “Can training data be extracted from these fine-tuned DMs shared online?” A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution – from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets including WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting about 20% of fine-tuning data in most cases. The code is available https://github.com/Nicholas0228/FineXtract.
Paper and project links
PDF Accepted at the International Conference on Machine Learning (ICML) 2025
Summary
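Diffusion models (DMs) are powerful image generation tools, especially in few-shot fine-tuning, where a pretrained DM is fine-tuned on a small image set to capture specific styles or objects. Sharing these fine-tuned checkpoints online can leak training data and raise copyright concerns. This paper proposes FineXtract, a framework for extracting fine-tuning data from fine-tuned DMs shared online. The method approximates fine-tuning as a gradual shift in the model's learned distribution, extrapolates between the models before and after fine-tuning to guide generation toward high-probability regions of the fine-tuned data distribution, and then clusters the generated images to extract the most probable ones. Experiments on DMs fine-tuned with WikiArt and DreamBooth data, and on real-world checkpoints posted online, show that it extracts about 20% of the fine-tuning data in most cases.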
Key Takeaways
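- Diffusion models (DMs) are powerful in few-shot fine-tuning, where they are adapted to specific styles or objects.
- Sharing fine-tuned checkpoints carries risks of data leakage and copyright infringement.
- The paper proposes FineXtract, a framework for extracting fine-tuning data from fine-tuned DMs shared online.
- The method approximates fine-tuning as a gradual shift in the model's learned distribution, from the pretrained DM toward the fine-tuning data.
- A clustering algorithm extracts the most probable images from those generated under the extrapolated guidance.
- Experiments on WikiArt, DreamBooth, and real-world checkpoints posted online validate the method, which extracts about 20% of the fine-tuning data in most cases.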
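The extrapolation step can be illustrated with a linear rule on noise predictions. This is an assumed form for illustration only: the direction from the pretrained model's prediction to the fine-tuned model's prediction is scaled by a weight `w`, so that `w = 1` recovers the fine-tuned model and `w > 1` extrapolates past it. The function name, the argument names, and the exact rule are not taken from the paper.

```python
from typing import List


def extrapolated_noise(
    eps_pretrained: List[float], eps_finetuned: List[float], w: float
) -> List[float]:
    """Move from the pretrained prediction toward, and for w > 1 beyond, the fine-tuned one."""
    return [p + w * (f - p) for p, f in zip(eps_pretrained, eps_finetuned)]
```

Pushing past the fine-tuned prediction exaggerates whatever the fine-tuning changed, which is the intuition behind steering samples toward the fine-tuning images before clustering them.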

TEXT2AFFORD: Probing Object Affordance Prediction abilities of Language Models solely from Text
Authors:Sayantan Adak, Daivik Agrawal, Animesh Mukherjee, Somak Aditya
We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained Vision-Language models (VLMs). A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances – Text2Afford, characterized by 15 affordance classes. Unlike affordance datasets collected in vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks, and presents insights into LM capabilities, advancing the understanding of object affordances. Codes and data are available at https://github.com/sayantan11995/Text2Afford
Paper and project links
PDF Accepted at Conference on Computational Natural Language Learning 2024
Summary
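This paper investigates how well pre-trained language models (PTLMs) and pre-trained vision-language models (VLMs) know object affordances. Prior work shows that PTLMs fail inconsistently and non-intuitively, which points to a lack of reasoning and grounding. To begin quantifying the effect of grounding, the authors curate Text2Afford, a novel dataset of object affordances with 15 affordance classes, annotated on in-the-wild sentences. Experiments show that PTLMs have limited reasoning abilities for uncommon object affordances and that pre-trained VLMs do not necessarily capture object affordances effectively. Few-shot fine-tuning improves affordance knowledge in both PTLMs and VLMs. The work contributes a new dataset for language grounding tasks and insights into LM capabilities.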
Key Takeaways
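- The study investigates knowledge of object affordances in pre-trained language models and vision-language models.
- Prior work shows that PTLMs fail inconsistently and non-intuitively, indicating a lack of reasoning and grounding.
- The authors curate Text2Afford, a novel object affordance dataset with 15 affordance classes.
- The dataset annotates in-the-wild sentences with objects and affordances.
- PTLMs show limited reasoning abilities for uncommon object affordances, and pre-trained VLMs do not necessarily capture affordances effectively.
- Few-shot fine-tuning improves affordance knowledge in both PTLMs and VLMs.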
