
Few-Shot


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-13

Decoupling Clinical and Class-Agnostic Features for Reliable Few-Shot Adaptation under Shift

Authors:Umaima Rahman, Raza Imam, Mohammad Yaqub, Dwarikanath Mahapatra

Medical vision-language models (VLMs) offer promise for clinical decision support, yet their reliability under distribution shifts remains a major concern for safe deployment. These models often learn task-agnostic correlations due to variability in imaging protocols and free-text reports, limiting their generalizability and increasing the risk of failure in real-world settings. We propose DRiFt, a structured feature decoupling framework that explicitly separates clinically relevant signals from task-agnostic noise using parameter-efficient tuning (LoRA) and learnable prompt tokens. To enhance cross-modal alignment and reduce uncertainty, we curate high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. Our approach improves in-distribution performance by +11.4% Top-1 accuracy and +3.3% Macro-F1 over prior prompt-based methods, while maintaining strong robustness across unseen datasets. Ablation studies reveal that disentangling task-relevant features and careful alignment significantly enhance model generalization and reduce unpredictable behavior under domain shift. These insights contribute toward building safer, more trustworthy VLMs for clinical use. The code is available at https://github.com/rumaima/DRiFt.


Paper and project links

PDF

Summary

Medical vision-language models (VLMs) hold broad promise for clinical decision support, but their reliability under distribution shift remains the main concern for safe deployment. This paper proposes the DRiFt framework, which explicitly separates clinically relevant signals from task-agnostic noise through parameter-efficient tuning (LoRA) and learnable prompt tokens. To improve cross-modal alignment and reduce uncertainty, the paper also curates high-quality, clinically grounded image-text pairs by generating captions for a diverse medical dataset. The framework improves Top-1 accuracy by +11.4% and Macro-F1 by +3.3% over existing prompt-based methods while remaining highly robust on unseen datasets. Ablation studies show that decoupling task-relevant features and aligning them carefully significantly strengthen generalization and reduce unpredictable behavior under domain shift. These insights help build safer, more trustworthy VLMs for clinical use.

Key Takeaways

  1. Medical vision-language models (VLMs) have broad applications in clinical decision support, but reliability under distribution shift is the main deployment challenge.
  2. The DRiFt framework separates clinically relevant signals from task-agnostic noise through parameter-efficient tuning and learnable prompt tokens.
  3. DRiFt improves performance while remaining highly robust on unseen datasets.
  4. By decoupling task-relevant features and aligning them carefully, DRiFt significantly strengthens generalization and reduces unpredictable behavior.
  5. DRiFt uses high-quality, clinically grounded image-text pairs to improve cross-modal alignment and reduce uncertainty.
  6. DRiFt is an important step toward safer, more reliable VLMs for clinical use.
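
To give a feel for the two ingredients named in the abstract, parameter-efficient tuning via LoRA and learnable prompt tokens, here is a minimal PyTorch sketch. It is an illustrative toy under my own assumptions (dimensions, module names, and the identity "encoder" are invented), not the authors' DRiFt implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (generic LoRA)."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

class PromptedTextEncoder(nn.Module):
    """Prepends learnable prompt tokens to token embeddings before a (frozen) encoder."""
    def __init__(self, encoder: nn.Module, embed_dim: int, n_prompts: int = 8):
        super().__init__()
        self.encoder = encoder
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)

    def forward(self, token_embeds):                    # (batch, seq, dim)
        batch = token_embeds.size(0)
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        return self.encoder(torch.cat([prompts, token_embeds], dim=1))

# Toy usage: only the LoRA factors and prompt tokens are trainable.
proj = LoRALinear(nn.Linear(512, 512), rank=4)
text_enc = PromptedTextEncoder(nn.Identity(), embed_dim=512, n_prompts=8)
tokens = torch.randn(2, 16, 512)                        # dummy token embeddings
out = proj(text_enc(tokens).mean(dim=1))                # pooled, adapted text feature
print(out.shape)                                        # torch.Size([2, 512])
```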

Cool Papers

Click here to view paper screenshots

Exploring Pre-training Across Domains for Few-Shot Surgical Skill Assessment

Authors:Dimitrios Anastasiou, Razvan Caramalau, Nazir Sirajudeen, Matthew Boal, Philip Edwards, Justin Collins, John Kelly, Ashwin Sridhar, Maxine Tran, Faiz Mumtaz, Nevil Pavithran, Nader Francis, Danail Stoyanov, Evangelos B. Mazomenos

Automated surgical skill assessment (SSA) is a central task in surgical computer vision. Developing robust SSA models is challenging due to the scarcity of skill annotations, which are time-consuming to produce and require expert consensus. Few-shot learning (FSL) offers a scalable alternative enabling model development with minimal supervision, though its success critically depends on effective pre-training. While widely studied for several surgical downstream tasks, pre-training has remained largely unexplored in SSA. In this work, we formulate SSA as a few-shot task and investigate how self-supervised pre-training strategies affect downstream few-shot SSA performance. We annotate a publicly available robotic surgery dataset with Objective Structured Assessment of Technical Skill (OSATS) scores, and evaluate various pre-training sources across three few-shot settings. We quantify domain similarity and analyze how domain gap and the inclusion of procedure-specific data into pre-training influence transferability. Our results show that small but domain-relevant datasets can outperform large scale, less aligned ones, achieving accuracies of 60.16%, 66.03%, and 73.65% in the 1-, 2-, and 5-shot settings, respectively. Moreover, incorporating procedure-specific data into pre-training with a domain-relevant external dataset significantly boosts downstream performance, with an average gain of +1.22% in accuracy and +2.28% in F1-score; however, applying the same strategy with less similar but large-scale sources can instead lead to performance degradation. Code and models are available at https://github.com/anastadimi/ssa-fsl.


Paper and project links

PDF Accepted at MICCAI 2025 DEMI Workshop

Summary

This paper studies pre-training strategies for few-shot automated surgical skill assessment (SSA). It shows that, when annotations are scarce, self-supervised pre-training improves the performance of few-shot SSA models. Experiments show that small but task-relevant datasets achieve higher accuracy than large-scale, less aligned ones. Furthermore, combining procedure-specific data with a domain-relevant external dataset during pre-training further improves downstream performance. Code and models are available at https://github.com/anastadimi/ssa-fsl.

Key Takeaways

  1. Few-shot learning (FSL) is applied to automated surgical skill assessment (SSA).
  2. The pre-training strategy is crucial in few-shot SSA and improves model performance.
  3. Small but task-relevant datasets outperform large-scale, less aligned ones.
  4. Incorporating procedure-specific data into pre-training improves downstream performance.
  5. Pre-training on a domain-relevant external dataset combined with procedure-specific data works best.
  6. Applying the same strategy with less similar but large-scale sources can instead degrade performance.
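
As a rough illustration of the k-shot evaluation protocol mentioned above (1-, 2-, and 5-shot), the sketch below classifies query clips by their nearest class prototype computed from k labelled support examples per class on frozen, pre-trained features. The feature dimensions and the skill-level classes are placeholders, not the authors' setup.

```python
import torch

def prototype_accuracy(support_feats, support_labels, query_feats, query_labels):
    """Nearest-prototype classification on frozen (pre-trained) features.

    support_feats: (n_support, d) features of the k labelled examples per class
    query_feats:   (n_query, d) features to classify
    """
    classes = support_labels.unique()
    # One prototype per class: the mean of its k support features.
    protos = torch.stack([support_feats[support_labels == c].mean(0) for c in classes])
    dists = torch.cdist(query_feats, protos)            # (n_query, n_classes)
    preds = classes[dists.argmin(dim=1)]
    return (preds == query_labels).float().mean().item()

# Toy 2-shot episode with 3 hypothetical skill bins and random features.
torch.manual_seed(0)
d, k, n_classes = 64, 2, 3
support_feats = torch.randn(n_classes * k, d)
support_labels = torch.arange(n_classes).repeat_interleave(k)
query_feats = torch.randn(12, d)
query_labels = torch.randint(0, n_classes, (12,))
acc = prototype_accuracy(support_feats, support_labels, query_feats, query_labels)
print(f"2-shot accuracy: {acc:.2%}")
```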

Cool Papers

Click here to view paper screenshots

You Share Beliefs, I Adapt: Progressive Heterogeneous Collaborative Perception

Authors:Hao Si, Ehsan Javanmardi, Manabu Tsukada

Collaborative perception enables vehicles to overcome individual perception limitations by sharing information, allowing them to see further and through occlusions. In real-world scenarios, models on different vehicles are often heterogeneous due to manufacturer variations. Existing methods for heterogeneous collaborative perception address this challenge by fine-tuning adapters or the entire network to bridge the domain gap. However, these methods are impractical in real-world applications, as each new collaborator must undergo joint training with the ego vehicle on a dataset before inference, or the ego vehicle stores models for all potential collaborators in advance. Therefore, we pose a new question: Can we tackle this challenge directly during inference, eliminating the need for joint training? To answer this, we introduce Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. Unlike previous work, PHCP dynamically aligns features by self-training an adapter during inference, eliminating the need for labeled data and joint training. Extensive experiments on the OPV2V dataset demonstrate that PHCP achieves strong performance across diverse heterogeneous scenarios. Notably, PHCP achieves performance comparable to SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.


Paper and project links

PDF

Summary

Collaborative perception lets vehicles overcome individual perception limits by sharing information, allowing them to see farther and through occlusions. Existing methods for heterogeneous collaborative perception bridge the domain gap by fine-tuning adapters or the entire network, but this is impractical in the real world: each new collaborator must be jointly trained with the ego vehicle before inference, or the ego vehicle must store models for all potential collaborators in advance. The paper therefore asks a new question: can the challenge be tackled directly during inference, without joint training? It introduces Progressive Heterogeneous Collaborative Perception (PHCP), a novel framework that formulates the problem as few-shot unsupervised domain adaptation. PHCP dynamically aligns features by self-training an adapter during inference, removing the need for labeled data and joint training. Extensive experiments on the OPV2V dataset show that PHCP performs strongly across diverse heterogeneous scenarios and matches SOTA methods trained on the entire dataset while using only a small amount of unlabeled data.

Key Takeaways

  1. Collaborative perception overcomes individual perception limits through information sharing, extending each vehicle's view.
  2. Heterogeneous models across vehicles make collaborative perception difficult.
  3. Existing methods adapt domains by fine-tuning adapters or the whole network, which is impractical in real deployments.
  4. Progressive Heterogeneous Collaborative Perception (PHCP), a new method that needs no joint training, is proposed.
  5. PHCP formulates the problem as few-shot unsupervised domain adaptation.
  6. PHCP dynamically aligns features by self-training an adapter during inference.
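
To make the inference-time idea concrete, here is a minimal, hypothetical sketch of self-training a small adapter on unlabeled incoming features during inference. The entropy-minimization objective is my own stand-in for illustration; PHCP's actual self-training objective, adapter design, and fusion pipeline may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
# Stand-in for the ego vehicle's frozen task head.
ego_head = nn.Linear(128, 10)
for p in ego_head.parameters():
    p.requires_grad = False

# Small trainable adapter that re-maps incoming heterogeneous features.
adapter = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-3)

collaborator_feats = torch.randn(256, 128)   # unlabeled features from a new collaborator

for step in range(50):                        # self-training during inference, no labels
    logits = ego_head(adapter(collaborator_feats))
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
    optimizer.zero_grad()
    entropy.backward()                        # encourage confident, aligned predictions
    optimizer.step()

print(f"final mean prediction entropy: {entropy.item():.3f}")
```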

Cool Papers

Click here to view paper screenshots

Can Multimodal LLMs See Materials Clearly? A Multimodal Benchmark on Materials Characterization

Authors:Zhengzhao Lai, Youbin Zheng, Zhenyang Cai, Haonan Lyu, Jinpu Yang, Hongqing Liang, Yan Hu, Benyou Wang

Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. While multimodal large language models (MLLMs) have recently shown promise in generative and predictive tasks within materials science, their capacity to understand real-world characterization imaging data remains underexplored. To bridge this gap, we present MatCha, the first benchmark for materials characterization image understanding, comprising 1,500 questions that demand expert-level domain expertise. MatCha encompasses four key stages of materials research comprising 21 distinct tasks, each designed to reflect authentic challenges faced by materials scientists. Our evaluation of state-of-the-art MLLMs on MatCha reveals a significant performance gap compared to human experts. These models exhibit degradation when addressing questions requiring higher-level expertise and sophisticated visual perception. Simple few-shot and chain-of-thought prompting struggle to alleviate these limitations. These findings highlight that existing MLLMs still exhibit limited adaptability to real-world materials characterization scenarios. We hope MatCha will facilitate future research in areas such as new material discovery and autonomous scientific agents. MatCha is available at https://github.com/FreedomIntelligence/MatCha.


Paper and project links

PDF

Summary

Materials characterization is fundamental to acquiring materials information, revealing the processing-microstructure-property relationships that guide material design and optimization. To probe multimodal large language models (MLLMs) in this domain, the paper introduces MatCha, a benchmark of 1,500 questions that require expert-level domain knowledge. MatCha spans four stages of materials research with 21 distinct tasks, each reflecting authentic challenges faced by materials scientists. Evaluation shows a significant gap between state-of-the-art MLLMs and human experts, with models degrading on questions that require higher-level expertise and sophisticated visual perception, and simple few-shot and chain-of-thought prompting doing little to close the gap. Existing MLLMs therefore remain poorly adapted to real-world materials characterization scenarios. The authors hope MatCha will advance research on new-material discovery and autonomous scientific agents. MatCha is available at the linked repository.

Key Takeaways

  1. Materials characterization reveals the processing-microstructure-property relationships that underpin material design and optimization.
  2. Multimodal large language models show promise in materials science but still struggle to understand real-world characterization imaging data.
  3. MatCha is the first benchmark for materials characterization image understanding, with tasks reflecting authentic challenges and requiring expert-level domain knowledge.
  4. Evaluation shows current models are limited on questions that demand high-level expertise and sophisticated visual perception, and simple prompting strategies cannot fix this.
  5. MatCha is intended to advance research on new-material discovery and autonomous scientific agents.
  6. MatCha is publicly available.
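
For readers unfamiliar with the prompting baselines evaluated above, the sketch below shows one plausible way a few-shot, chain-of-thought prompt for a multiple-choice characterization-image question could be assembled. The exemplar, question text, and answer options are invented for illustration; MatCha's actual question format may differ.

```python
def build_fewshot_cot_prompt(exemplars, question, options):
    """Assemble a few-shot chain-of-thought prompt for a multiple-choice image question."""
    parts = []
    for ex in exemplars:
        parts.append(
            f"Image: {ex['image']}\nQuestion: {ex['question']}\n"
            f"Options: {', '.join(ex['options'])}\n"
            f"Reasoning: {ex['reasoning']}\nAnswer: {ex['answer']}\n"
        )
    parts.append(
        f"Image: <query image>\nQuestion: {question}\n"
        f"Options: {', '.join(options)}\n"
        "Reasoning: let's think step by step.\nAnswer:"
    )
    return "\n".join(parts)

# One invented exemplar about an SEM micrograph, purely for illustration.
exemplars = [{
    "image": "<exemplar SEM image>",
    "question": "Which microscopy technique most likely produced this image?",
    "options": ["SEM", "TEM", "AFM", "Optical"],
    "reasoning": "Surface topography with large depth of field suggests a scanning electron microscope.",
    "answer": "SEM",
}]
print(build_fewshot_cot_prompt(
    exemplars,
    "What phase does the bright region most likely correspond to?",
    ["Martensite", "Austenite", "Ferrite", "Pearlite"],
))
```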

Cool Papers

Click here to view paper screenshots

Bridging the Gap Between Ideal and Real-world Evaluation: Benchmarking AI-Generated Image Detection in Challenging Scenarios

Authors:Chunxiao Li, Xiaoxiao Wang, Meiling Li, Boming Miao, Peng Sun, Yunjian Zhang, Xiangyang Ji, Yao Zhu

With the rapid advancement of generative models, highly realistic image synthesis has posed new challenges to digital security and media credibility. Although AI-generated image detection methods have partially addressed these concerns, a substantial research gap remains in evaluating their performance under complex real-world conditions. This paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models across three dimensions: 1) Scenario Generalization: RRDataset encompasses high-quality images from seven major scenarios (War and Conflict, Disasters and Accidents, Political and Social Events, Medical and Public Health, Culture and Religion, Labor and Production, and everyday life), addressing existing dataset gaps from a content perspective. 2) Internet Transmission Robustness: examining detector performance on images that have undergone multiple rounds of sharing across various social media platforms. 3) Re-digitization Robustness: assessing model effectiveness on images altered through four distinct re-digitization methods. We benchmarked 17 detectors and 10 vision-language models (VLMs) on RRDataset and conducted a large-scale human study involving 192 participants to investigate human few-shot learning capabilities in detecting AI-generated images. The benchmarking results reveal the limitations of current AI detection methods under real-world conditions and underscore the importance of drawing on human adaptability to develop more robust detection algorithms.


Paper and project links

PDF ICCV2025

Summary
With the rapid progress of generative models, highly realistic image synthesis poses new challenges to digital security and media credibility. Although AI-generated-image detection methods partially address these concerns, evaluating their performance under complex real-world conditions remains a substantial research gap. The paper introduces the Real-World Robustness Dataset (RRDataset) for comprehensive evaluation of detection models along three dimensions: scenario generalization, internet-transmission robustness, and re-digitization robustness. The authors benchmark 17 detectors and 10 vision-language models and conduct a large-scale human study with 192 participants to explore human few-shot learning in detecting AI-generated images. The results reveal the limitations of current AI detection methods under real-world conditions and underscore the need to draw on human adaptability to develop more robust detection algorithms.

Key Takeaways

  1. Rapid progress in generative models makes highly realistic image synthesis a challenge for digital security and media credibility.
  2. AI-generated-image detection methods partially address these challenges, but their performance under complex real-world conditions is under-evaluated.
  3. The Real-World Robustness Dataset (RRDataset) is introduced for comprehensive evaluation of detection models.
  4. RRDataset covers three dimensions: scenario generalization, internet-transmission robustness, and re-digitization robustness.
  5. 17 detectors and 10 vision-language models are benchmarked.
  6. A large-scale human study investigates human few-shot learning in detecting AI-generated images.
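
To make the "internet transmission robustness" dimension concrete, here is one plausible way to simulate repeated social-media sharing: several rounds of downscaling and lossy JPEG re-encoding applied to an image before a detector is evaluated on it. The number of rounds, quality, and scale factor are guesses for illustration, not the dataset's actual pipeline; the sketch assumes Pillow is installed.

```python
import io
from PIL import Image

def simulate_sharing(image: Image.Image, rounds: int = 3, quality: int = 70,
                     scale: float = 0.9) -> Image.Image:
    """Roughly mimic repeated platform sharing: downscale + JPEG re-encode per round."""
    img = image.convert("RGB")
    for _ in range(rounds):
        w, h = img.size
        img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
        buf = io.BytesIO()
        img.save(buf, format="JPEG", quality=quality)   # lossy re-encode
        buf.seek(0)
        img = Image.open(buf).convert("RGB")
    return img

# Toy usage on a synthetic image; a real study would run a detector on both versions.
original = Image.new("RGB", (512, 512), color=(120, 180, 60))
degraded = simulate_sharing(original, rounds=3)
print(original.size, "->", degraded.size)
```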

Cool Papers

Click here to view paper screenshots

Automated Evidence Extraction and Scoring for Corporate Climate Policy Engagement: A Multilingual RAG Approach

Authors:Imene Kolli, Ario Saeid Vaghefi, Chiara Colesanti Senni, Shantam Raj, Markus Leippold

InfluenceMap’s LobbyMap Platform monitors the climate policy engagement of over 500 companies and 250 industry associations, assessing each entity’s support or opposition to science-based policy pathways for achieving the Paris Agreement’s goal of limiting global warming to 1.5°C. Although InfluenceMap has made progress with automating key elements of the analytical workflow, a significant portion of the assessment remains manual, making it time- and labor-intensive and susceptible to human error. We propose an AI-assisted framework to accelerate the monitoring of corporate climate policy engagement by leveraging Retrieval-Augmented Generation to automate the most time-intensive extraction of relevant evidence from large-scale textual data. Our evaluation shows that a combination of layout-aware parsing, the Nomic embedding model, and few-shot prompting strategies yields the best performance in extracting and classifying evidence from multilingual corporate documents. We conclude that while the automated RAG system effectively accelerates evidence extraction, the nuanced nature of the analysis necessitates a human-in-the-loop approach where the technology augments, rather than replaces, expert judgment to ensure accuracy.


Paper and project links

PDF

Summary

Building on InfluenceMap's LobbyMap platform, the paper proposes an AI-assisted framework that accelerates the monitoring of corporate climate policy engagement by using Retrieval-Augmented Generation to automate the extraction of relevant evidence from large-scale text. The evaluation finds that combining layout-aware parsing, the Nomic embedding model, and few-shot prompting performs best at extracting and classifying evidence. The authors conclude that while automation speeds up evidence extraction, the analysis still requires human involvement: pairing the technology with expert judgment is key to ensuring accuracy.

Key Takeaways

  1. InfluenceMap monitors the climate policy engagement of companies and industry associations, assessing their support for or opposition to the goals of the Paris Agreement.
  2. The current assessment is largely manual, time- and labor-intensive, and prone to error.
  3. An AI-assisted framework is proposed that uses Retrieval-Augmented Generation to automate the extraction of relevant evidence.
  4. Combining layout-aware parsing, the Nomic embedding model, and few-shot prompting performs best.
  5. The automated pipeline effectively accelerates evidence extraction.
  6. Because the analysis is nuanced, a human-in-the-loop setup in which technology augments expert judgment is key to accuracy.
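
As a rough, hypothetical sketch of the retrieval step in such a RAG pipeline: embed the document chunks, rank them by cosine similarity to the query, and place the top hits into a prompt for classification. The embed function below is a deterministic random stand-in, not the Nomic embedding model the paper uses, and the chunks and prompt wording are invented.

```python
import numpy as np

def embed(texts):
    """Stand-in embedding: deterministic random unit vectors per string (replace with a real model)."""
    vecs = []
    for t in texts:
        rng = np.random.default_rng(abs(hash(t)) % (2**32))
        v = rng.normal(size=256)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def retrieve(query, chunks, top_k=3):
    """Return the top-k document chunks by cosine similarity to the query."""
    chunk_vecs = embed(chunks)
    query_vec = embed([query])[0]
    scores = chunk_vecs @ query_vec              # cosine similarity (vectors are unit-norm)
    best = np.argsort(-scores)[:top_k]
    return [chunks[i] for i in best]

chunks = [
    "We support carbon pricing aligned with a 1.5C pathway.",
    "Our association opposes the proposed emissions standard.",
    "Quarterly revenue grew by 4% year on year.",
]
evidence = retrieve("position on climate regulation", chunks, top_k=2)
prompt = ("Classify the company's stance on science-based climate policy.\n"
          "Evidence:\n- " + "\n- ".join(evidence) + "\nStance:")
print(prompt)
```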

Cool Papers

Click here to view paper screenshots

Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models

Authors:Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Zhenjun Li, Bing Zhou, Rong-Hua Li, Guoren Wang

Graph foundation models, inspired by the success of LLMs, are designed to learn the optimal embedding from multi-domain TAGs for the downstream cross-task generalization capability. During our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.


Paper and project links

PDF

Summary
Graph foundation models (GFMs), inspired by the success of LLMs, aim to learn optimal embeddings from multi-domain text-attributed graphs for downstream cross-task generalization. Graph VQ-MAE stands out among GFM architectures because it jointly encodes topology and textual attributes into discrete embedding spaces with clear semantic boundaries. However, domain-generalization conflicts introduce two pitfalls: model degradation and representation collapse. The paper's empirical investigation traces these problems to an information bottleneck and a regularization deficit, and proposes MoT (Mixture-of-Tinkers) to address them: an Information Tinker for the two pitfalls and a Regularization Tinker for the optimization coin, which together improve information capacity and gradient supervision. As a flexible architecture, MoT follows GFM scaling laws and offers a controllable model scale. Experiments on 22 datasets across 6 domains show significant gains in supervised, few-shot, and zero-shot scenarios.

Key Takeaways

  1. Graph foundation models, inspired by the success of LLMs, aim to improve downstream cross-task generalization by learning optimal embeddings from multi-domain text-attributed graphs.
  2. Graph VQ-MAE jointly encodes topology and textual attributes into discrete embedding spaces with clear semantic boundaries.
  3. Domain-generalization conflicts can cause model degradation and representation collapse.
  4. Empirical investigation attributes these problems to an information bottleneck and a regularization deficit.
  5. MoT (Mixture-of-Tinkers) is proposed, combining an Information Tinker for the two pitfalls with a Regularization Tinker for the optimization coin.
  6. MoT mitigates the domain-generalization conflicts by improving information capacity and gradient supervision.
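
To give a feel for the mixture-of-codebooks idea, here is a minimal sketch under my own assumptions (sizes, router design, and routing-by-argmax are invented, and this is not the MoT architecture): a domain-aware router picks a codebook per hidden vector, and each vector is quantized to its nearest code in that book, with a straight-through estimator so gradients still reach the encoder.

```python
import torch
import torch.nn as nn

class MixtureOfCodebooks(nn.Module):
    """Quantize hidden vectors with one of several codebooks chosen by a router."""
    def __init__(self, n_codebooks=3, codes_per_book=32, dim=64):
        super().__init__()
        self.codebooks = nn.Parameter(torch.randn(n_codebooks, codes_per_book, dim))
        self.router = nn.Linear(dim, n_codebooks)      # stand-in for domain-aware routing

    def forward(self, h):                              # h: (n_nodes, dim)
        book_idx = self.router(h).argmax(dim=-1)       # pick one codebook per node
        quantized = torch.empty_like(h)
        for b in range(self.codebooks.size(0)):
            mask = book_idx == b
            if mask.any():
                codes = self.codebooks[b]              # (codes_per_book, dim)
                nearest = torch.cdist(h[mask], codes).argmin(dim=-1)
                quantized[mask] = codes[nearest]
        # Straight-through estimator: forward uses codes, backward passes gradients to h.
        return h + (quantized - h).detach(), book_idx

torch.manual_seed(0)
vq = MixtureOfCodebooks()
hidden = torch.randn(10, 64)                           # e.g. node embeddings from a graph encoder
z_q, routes = vq(hidden)
print(z_q.shape, routes.tolist())
```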

Cool Papers

Click here to view paper screenshots

Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis

Authors:Ifrat Ikhtear Uddin, Longwei Wang, KC Santosh

Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.


Paper and project links

PDF Accepted for publication in the proceedings of MICCAI Workshop on Data Engineering in Medical Imaging 2025

Summary

Medical image analysis is hampered by the scarcity of expert-annotated data. To address this, the paper proposes an expert-guided, explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) into model training, improving both classification performance and interpretability. Using Grad-CAM for spatial attention supervision, it introduces a Dice-similarity-based explanation loss that aligns model attention with diagnostically relevant regions. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even with limited data. On the BraTS and VinDr-CXR datasets, accuracy improves from 77.09% to 83.61% and from 54.33% to 73.29%, respectively, compared with non-guided models. Grad-CAM visualizations further confirm that expert-guided training keeps attention aligned with diagnostic regions, improving predictive reliability and clinical trustworthiness.

Key Takeaways

  1. Medical image analysis faces the challenge of limited expert-annotated data.
  2. An expert-guided, explainable few-shot learning framework is proposed that integrates radiologist-provided regions of interest (ROIs).
  3. Grad-CAM provides spatial attention supervision, and a Dice-similarity-based explanation loss is introduced.
  4. The explanation loss is jointly optimized with a standard prototypical network objective, keeping the model focused on clinically meaningful features under limited data.
  5. Evaluations on BraTS and VinDr-CXR show significant accuracy gains.
  6. Expert-guided training improves the model's predictive reliability and clinical trustworthiness.
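
A minimal sketch of how a Dice-based explanation loss on an attention map can be combined with a prototypical-network loss. The tensor shapes, loss weighting, and random "attention maps" are assumptions made for illustration; in the paper the attention comes from Grad-CAM and the ROIs from radiologists.

```python
import torch
import torch.nn.functional as F

def dice_explanation_loss(attention, roi_mask, eps=1e-6):
    """1 - Dice similarity between a (soft) attention map and an expert ROI mask."""
    attention = attention.flatten(1)
    roi_mask = roi_mask.flatten(1)
    intersection = (attention * roi_mask).sum(dim=1)
    dice = (2 * intersection + eps) / (attention.sum(dim=1) + roi_mask.sum(dim=1) + eps)
    return (1 - dice).mean()

def prototypical_loss(query_feats, query_labels, prototypes):
    """Standard prototypical-network loss: softmax over negative distances to class prototypes."""
    logits = -torch.cdist(query_feats, prototypes)
    return F.cross_entropy(logits, query_labels)

# Toy joint objective on random tensors.
torch.manual_seed(0)
attn = torch.rand(4, 1, 32, 32)                    # model attention maps (Grad-CAM in the paper)
roi = (torch.rand(4, 1, 32, 32) > 0.7).float()     # expert-annotated ROI masks
feats = torch.randn(4, 128)                        # query embeddings
labels = torch.tensor([0, 1, 0, 1])
protos = torch.randn(2, 128)                       # class prototypes from the support set

lam = 0.5                                          # assumed weighting between the two terms
loss = prototypical_loss(feats, labels, protos) + lam * dice_explanation_loss(attn, roi)
print(f"joint loss: {loss.item():.3f}")
```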

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!