
Few-Shot


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on them in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-12

Implicit Shape-Prior for Few-Shot Assisted 3D Segmentation

Authors:Mathilde Monvoisin, Louise Piecuch, Blanche Texier, Cédric Hémon, Anaïs Barateau, Jérémie Huet, Antoine Nordez, Anne-Sophie Boureau, Jean-Claude Nunes, Diana Mateus

The objective of this paper is to significantly reduce the manual workload required from medical professionals in complex 3D segmentation tasks that cannot yet be fully automated. For instance, in radiotherapy planning, organs at risk must be accurately identified in computed tomography (CT) or magnetic resonance imaging (MRI) scans to ensure they are spared from harmful radiation. Similarly, diagnosing age-related degenerative diseases such as sarcopenia, which involves progressive loss of muscle volume and strength, is commonly based on muscle-mass measurements often obtained from manual segmentation of medical volumes. To alleviate the manual-segmentation burden, this paper introduces an implicit shape prior to segment volumes from sparse manual slice annotations, generalized to the multi-organ case, along with a simple framework for automatically selecting the most informative slices to guide and minimize the next interactions. The experimental validation shows the method’s effectiveness on two medical use cases: assisted segmentation of organs at risk for brain cancer patients, and acceleration of the creation of a new database with unseen muscle shapes for patients with sarcopenia.
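The abstract describes two components: an implicit shape prior fitted to a few annotated slices, and a mechanism for picking the most informative slice to annotate next. The sketch below is a minimal illustration of that kind of pipeline, not the authors' method: a small coordinate MLP is fitted to voxels from the annotated slices, and the remaining slices are ranked by predictive entropy. The network size, the entropy criterion, and all shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitSegField(nn.Module):
    """Maps normalized (x, y, z) coordinates to per-organ logits."""
    def __init__(self, num_organs: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_organs + 1),  # +1 for background
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        return self.net(coords)

def fit_on_annotated_slices(model, coords, labels, steps=200, lr=1e-3):
    """coords: (N, 3) voxel coordinates from the annotated slices, labels: (N,) organ ids."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(coords), labels)
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def most_informative_slice(model, slice_coords):
    """slice_coords: list of (Ni, 3) coordinate tensors, one per unannotated slice.
    Returns the index of the slice with the highest mean predictive entropy."""
    scores = []
    for coords in slice_coords:
        probs = F.softmax(model(coords), dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()
        scores.append(entropy.item())
    return max(range(len(scores)), key=scores.__getitem__)
```

In an interactive loop, the selected slice would be annotated by the expert, added to the training set, and the field refitted before the next query.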


Paper & Project Links

PDF Both first authors contributed equally to this work; last names are in alphabetical order. This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution will be published in a Springer Nature Computer Science book series (CCIS, LNAI, LNBI, LNBIP, LNCS), and the DOI will be released soon.

Summary

This paper aims to reduce the manual workload of medical experts in complex 3D segmentation tasks that cannot yet be fully automated. For example, in radiotherapy planning, organs at risk must be accurately identified in CT or MRI scans to spare them from harmful radiation. Likewise, diagnosing age-related degenerative diseases such as sarcopenia relies on muscle-mass measurements obtained from manual segmentation of medical volumes. The paper introduces an implicit shape prior that segments volumes from sparse manually annotated slices and generalizes to the multi-organ case, together with a simple framework that automatically selects the most informative slices to guide and minimize the next interactions. Experiments validate the method on two medical use cases: assisted segmentation of organs at risk for brain cancer patients, and acceleration of the creation of a new database with unseen muscle shapes for patients with sarcopenia.

Key Takeaways

  1. The paper aims to reduce the manual workload of medical experts in complex 3D segmentation tasks that cannot yet be fully automated.
  2. Accurate segmentation of medical volumes is a key step in radiotherapy planning and in diagnosing degenerative diseases such as sarcopenia.
  3. An implicit shape prior is introduced that segments volumes from sparse manually annotated slices and generalizes to the multi-organ case.
  4. A simple framework automatically selects the most informative slices to guide and minimize the next interactions.
  5. Experiments show the method is effective for segmenting organs at risk and for muscle segmentation in sarcopenia patients.
  6. The approach reduces the burden on medical experts and improves the quality and efficiency of the workflow.
  7. The technique may further advance automation in medical imaging pipelines.

Cool Papers

Click here to view paper screenshots

Two Sides of the Same Optimization Coin: Model Degradation and Representation Collapse in Graph Foundation Models

Authors:Xunkai Li, Daohan Su, Sicheng Liu, Ru Zhang, Rong-Hua Li, Guoren Wang

Graph foundation models, inspired by the success of LLMs, are designed to learn the optimal embedding from multi-domain TAGs for the downstream cross-task generalization capability. During our investigation, graph VQ-MAE stands out among the increasingly diverse landscape of GFM architectures. This is attributed to its ability to jointly encode topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. Despite its potential, domain generalization conflicts cause imperceptible pitfalls. In this paper, we instantiate two of them, and they are just like two sides of the same GFM optimization coin - Side 1 Model Degradation: The encoder and codebook fail to capture the diversity of inputs; Side 2 Representation Collapse: The hidden embedding and codebook vector fail to preserve semantic separability due to constraints from narrow representation subspaces. These two pitfalls (sides) collectively impair the decoder and generate the low-quality reconstructed supervision, causing the GFM optimization dilemma during pre-training (coin). Through empirical investigation, we attribute the above challenges to Information Bottleneck and Regularization Deficit. To address them, we propose MoT (Mixture-of-Tinkers) - (1) Information Tinker for Two Pitfalls, which utilizes an edge-wise semantic fusion strategy and a mixture-of-codebooks with domain-aware routing to improve information capacity. (2) Regularization Tinker for Optimization Coin, which utilizes two additional regularizations to further improve gradient supervision in our proposed Information Tinker. Notably, as a flexible architecture, MoT adheres to the scaling laws of GFM, offering a controllable model scale. Compared to SOTA baselines, experiments on 22 datasets across 6 domains demonstrate that MoT achieves significant improvements in supervised, few-shot, and zero-shot scenarios.
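The Information Tinker described above includes a mixture-of-codebooks with domain-aware routing. Below is a minimal PyTorch sketch of that idea under my own assumptions (hard routing from a domain embedding, a standard VQ commitment loss, and straight-through gradients); it is not the MoT implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainRoutedVQ(nn.Module):
    """One codebook per domain group; a router picks the codebook from a
    domain embedding, then quantizes with a straight-through estimator."""
    def __init__(self, num_codebooks: int, codebook_size: int, dim: int):
        super().__init__()
        self.codebooks = nn.Parameter(torch.randn(num_codebooks, codebook_size, dim))
        self.router = nn.Linear(dim, num_codebooks)

    def forward(self, h: torch.Tensor, domain_emb: torch.Tensor):
        # h: (N, dim) hidden embeddings, domain_emb: (N, dim) per-node domain features
        route = self.router(domain_emb).argmax(dim=-1)           # (N,) codebook index
        books = self.codebooks[route]                            # (N, codebook_size, dim)
        dists = torch.cdist(h.unsqueeze(1), books).squeeze(1)    # (N, codebook_size)
        idx = dists.argmin(dim=-1)                               # nearest codeword per node
        q = books[torch.arange(h.size(0)), idx]                  # (N, dim) quantized vectors
        commit = F.mse_loss(h, q.detach()) + F.mse_loss(q, h.detach())
        q = h + (q - h).detach()                                 # straight-through gradient
        return q, commit
```

Routing each node to a domain-specific codebook is one way to enlarge the effective information capacity without growing any single codebook, which is the capacity issue the abstract attributes to the Information Bottleneck.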


Paper & Project Links

PDF

Summary
Graph foundation models, inspired by the success of large language models, aim to learn optimal embeddings from multi-domain text-attributed graphs for downstream cross-task generalization. Among the increasingly diverse GFM architectures, graph VQ-MAE stands out because it jointly encodes topology and textual attributes from multiple domains into discrete embedding spaces with clear semantic boundaries. However, domain-generalization conflicts introduce subtle pitfalls. To address them, this paper proposes MoT (Mixture-of-Tinkers), consisting of an Information Tinker for the two pitfalls and a Regularization Tinker for the optimization coin. MoT follows GFM scaling laws and offers a controllable model scale. Experiments on 22 datasets across 6 domains show significant improvements in supervised, few-shot, and zero-shot scenarios.

Key Takeaways

  1. Inspired by the success of LLMs, graph foundation models aim to learn optimal embeddings from multi-domain text-attributed graphs to improve downstream cross-task generalization.
  2. Graph VQ-MAE jointly encodes topology and textual attributes into discrete embedding spaces.
  3. Domain-generalization conflicts cause pitfalls such as model degradation and representation collapse.
  4. Model degradation occurs when the encoder and codebook fail to capture the diversity of the inputs.
  5. Representation collapse occurs when hidden embeddings and codebook vectors lose semantic separability within narrow representation subspaces.
  6. MoT addresses these issues with an Information Tinker for the two pitfalls and a Regularization Tinker for the optimization coin.

Cool Papers

Click here to view paper screenshots

Few-shot Personalization via In-Context Learning for Speech Emotion Recognition based on Speech-Language Model

Authors:Mana Ihori, Taiga Yamane, Naotaka Kawata, Naoki Makishima, Tomohiro Tanaka, Satoshi Suzuki, Shota Orihashi, Ryo Masumura

This paper proposes a personalization method for speech emotion recognition (SER) through in-context learning (ICL). Since the expression of emotions varies from person to person, speaker-specific adaptation is crucial for improving the SER performance. Conventional SER methods have been personalized using emotional utterances of a target speaker, but it is often difficult to prepare utterances corresponding to all emotion labels in advance. Our idea to overcome this difficulty is to obtain speaker characteristics by conditioning a few emotional utterances of the target speaker in ICL-based inference. ICL is a method to perform unseen tasks by conditioning a few input-output examples through inference in large language models (LLMs). We meta-train a speech-language model extended from the LLM to learn how to perform personalized SER via ICL. Experimental results using our newly collected SER dataset demonstrate that the proposed method outperforms conventional methods.
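The core idea is to condition the speech-language model on a few (utterance, emotion) pairs from the target speaker and let it label the query utterance. The sketch below shows one way such an in-context sequence might be assembled; the segment format, the `tokenize` callback, and the prompt text are assumptions, since the paper's exact prompt design is not given here.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class SpeakerExample:
    speech_features: Sequence[float]  # pooled acoustic embedding of one utterance
    emotion_label: str                # e.g. "happy", "angry"

def build_icl_sequence(
    examples: List[SpeakerExample],
    query_features: Sequence[float],
    tokenize: Callable[[str], List[int]],
) -> List[Tuple[str, object]]:
    """Interleave a few (speech, label) pairs from the target speaker, then
    append the query utterance and an open label slot for the model to fill."""
    segments: List[Tuple[str, object]] = []
    for ex in examples:
        segments.append(("speech", ex.speech_features))
        segments.append(("text", tokenize(f" emotion: {ex.emotion_label}\n")))
    segments.append(("speech", query_features))
    segments.append(("text", tokenize(" emotion:")))
    return segments
```

Each segment would be embedded (speech via the acoustic front end, text via the LLM's token embeddings), concatenated in order, and passed to the meta-trained speech-language model, which continues the sequence with the predicted emotion label.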


Paper & Project Links

PDF Accepted by ASRU 2025

Summary

This paper proposes a personalization method for speech emotion recognition (SER) based on in-context learning (ICL). Because emotional expression varies across individuals, speaker-specific adaptation is crucial for SER performance. Conventional personalization relies on emotional utterances of the target speaker, but preparing utterances for every emotion label in advance is often difficult. Instead, speaker characteristics are obtained by conditioning on a few emotional utterances of the target speaker during ICL-based inference, in which a few input-output examples are provided to a large language model. A speech-language model extended from the LLM is meta-trained to perform personalized SER via ICL. Experiments on a newly collected SER dataset show that the proposed method outperforms conventional methods.

Key Takeaways

  1. A personalization method for speech emotion recognition based on in-context learning is proposed.
  2. Emotional expression varies across individuals, so speaker-specific adaptation is important.
  3. Conventional SER personalization uses the target speaker's emotional utterances, but preparing utterances for every emotion label in advance is difficult.
  4. ICL performs unseen tasks by conditioning on a few input-output examples at inference time.
  5. A speech-language model extended from an LLM is meta-trained to perform personalized SER via ICL.
  6. Experiments on a newly collected SER dataset demonstrate the effectiveness of the method.

Cool Papers

Click here to view paper screenshots

Expert-Guided Explainable Few-Shot Learning for Medical Image Diagnosis

Authors:Ifrat Ikhtear Uddin, Longwei Wang, KC Santosh

Medical image analysis often faces significant challenges due to limited expert-annotated data, hindering both model generalization and clinical adoption. We propose an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions-of-interests (ROIs) into model training to simultaneously enhance classification performance and interpretability. Leveraging Grad-CAM for spatial attention supervision, we introduce an explanation loss based on Dice similarity to align model attention with diagnostically relevant regions during training. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even under limited data conditions. We evaluate our framework on two distinct datasets: BraTS (MRI) and VinDr-CXR (Chest X-ray), achieving significant accuracy improvements from 77.09% to 83.61% on BraTS and from 54.33% to 73.29% on VinDr-CXR compared to non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving both predictive reliability and clinical trustworthiness. Our findings demonstrate the effectiveness of incorporating expert-guided attention supervision to bridge the gap between performance and interpretability in few-shot medical image diagnosis.
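The explanation loss described here is a Dice similarity between the Grad-CAM attention map and the expert ROI, optimized jointly with the prototypical-network objective. Below is a minimal sketch of such a loss; the per-image max normalization of the CAM, the soft Dice formulation, and the weighting factor are my assumptions, not the authors' exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_dice(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Soft Dice similarity between a normalized attention map and a binary ROI mask.
    pred, target: (B, H, W) with values in [0, 1]."""
    inter = (pred * target).sum(dim=(1, 2))
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))
    return (2 * inter + eps) / (union + eps)

def explanation_loss(gradcam_map: torch.Tensor, roi_mask: torch.Tensor) -> torch.Tensor:
    """1 - Dice, encouraging the Grad-CAM map to overlap the expert ROI."""
    cam = gradcam_map.clamp(min=0)
    cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp_min(1e-6)  # normalize per image
    if cam.shape[-2:] != roi_mask.shape[-2:]:
        cam = F.interpolate(cam.unsqueeze(1), size=roi_mask.shape[-2:],
                            mode="bilinear", align_corners=False).squeeze(1)
    return (1.0 - soft_dice(cam, roi_mask)).mean()

# Joint objective, with lambda_expl an assumed weighting hyperparameter:
#   loss = prototypical_loss + lambda_expl * explanation_loss(cam, roi)
```

Because Grad-CAM maps are differentiable in the activations, the Dice term can backpropagate into the feature extractor and pull its attention toward the annotated regions during episodic training.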


Paper & Project Links

PDF Accepted for publication in the proceedings of MICCAI Workshop on Data Engineering in Medical Imaging 2025

Summary

The paper proposes an expert-guided explainable few-shot learning framework that integrates radiologist-provided regions of interest (ROIs) into model training to improve both classification performance and interpretability. Grad-CAM is used for spatial attention supervision, and an explanation loss based on Dice similarity aligns model attention with diagnostically relevant regions. This explanation loss is jointly optimized with a standard prototypical network objective, encouraging the model to focus on clinically meaningful features even with limited data. On BraTS and VinDr-CXR, accuracy improves from 77.09% to 83.61% and from 54.33% to 73.29%, respectively, compared with non-guided models. Grad-CAM visualizations further confirm that expert-guided training consistently aligns attention with diagnostic regions, improving predictive reliability and clinical trustworthiness.

Key Takeaways

  1. Medical image analysis is constrained by scarce expert-annotated data, which hinders model generalization and clinical adoption.
  2. An expert-guided explainable few-shot learning framework incorporates radiologist-provided regions of interest (ROIs) into model training.
  3. Grad-CAM provides spatial attention supervision, and a Dice-based explanation loss aligns model attention with diagnostically relevant regions.
  4. The explanation loss is jointly optimized with a prototypical network objective, keeping the focus on clinically meaningful features under limited data.
  5. Evaluation on BraTS and VinDr-CXR shows significant accuracy gains.
  6. Grad-CAM visualizations show that expert-guided training aligns attention with diagnostic regions.
  7. Expert guidance improves predictive reliability and clinical trustworthiness.

Cool Papers

Click here to view paper screenshots

TerraMind: Large-Scale Generative Multimodality for Earth Observation

Authors:Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind’s dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces “Thinking-in-Modalities” (TiM) – the capability of generating additional artificial data during finetuning and inference to improve the model output – and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.
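The "Thinking-in-Modalities" capability amounts to generating an intermediate, artificial modality first and then conditioning the final prediction on it. The snippet below is only a schematic of that two-step inference flow; `model.generate`, the modality names, and the argument structure are hypothetical placeholders, not the TerraMind API.

```python
# A minimal sketch of the "Thinking-in-Modalities" idea from the abstract:
# synthesize a helper modality, then predict the target conditioned on both
# the real input and the generated data. All names here are hypothetical.
def predict_with_tim(model, s2_image, target="landcover", intermediate="ndvi"):
    # Step 1: let the any-to-any model synthesize an intermediate modality.
    generated = model.generate(inputs={"S2": s2_image}, output=intermediate)
    # Step 2: predict the target conditioned on real and generated inputs together.
    return model.generate(inputs={"S2": s2_image, intermediate: generated}, output=target)
```

The released code and weights (open-sourced under a permissive license, per the abstract) are the authoritative reference for the actual interface.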


Paper & Project Links

PDF Accepted at ICCV’25

Summary

TerraMind is the first any-to-any generative, multimodal foundation model for Earth observation. It is pretrained on dual-scale representations that combine token-level and pixel-level data across modalities, using a large-scale global dataset covering nine geospatial modalities. The model enables zero-shot and few-shot applications, introduces a "Thinking-in-Modalities" (TiM) capability that generates additional artificial data during finetuning and inference to improve outputs, and achieves beyond state-of-the-art performance on community-standard Earth observation benchmarks such as PANGAEA.

Key Takeaways

  1. TerraMind is the first any-to-any generative multimodal foundation model for Earth observation.
  2. It uses a dual-scale pretraining approach that combines token-level and pixel-level data across modalities.
  3. The model can be applied in zero-shot and few-shot settings.
  4. The "Thinking-in-Modalities" (TiM) capability generates additional artificial data during finetuning and inference.
  5. The pretraining dataset is large-scale and covers nine geospatial modalities.
  6. TerraMind exceeds state-of-the-art performance on community-standard Earth observation benchmarks such as PANGAEA.

Cool Papers

Click here to view paper screenshots

Investigating Compositional Reasoning in Time Series Foundation Models

Authors:Willa Potosnak, Cristian Challu, Mononito Goswami, Kin G. Olivares, Michał Wiliński, Nina Żukowska, Artur Dubrawski

Large pre-trained time series foundation models (TSFMs) have demonstrated promising zero-shot performance across a wide range of domains. However, a question remains: Do TSFMs succeed by memorizing patterns in training data, or do they possess the ability to reason about such patterns? While reasoning is a topic of great interest in the study of Large Language Models (LLMs), it is undefined and largely unexplored in the context of TSFMs. In this work, inspired by language modeling literature, we formally define compositional reasoning in forecasting and distinguish it from in-distribution generalization. We evaluate the reasoning and generalization capabilities of 16 popular deep learning forecasting models on multiple synthetic and real-world datasets. Additionally, through controlled studies, we systematically examine which design choices in 7 popular open-source TSFMs contribute to improved reasoning capabilities. Our study yields key insights into the impact of TSFM architecture design on compositional reasoning and generalization. We find that patch-based Transformers have the best reasoning performance, closely followed by residualized MLP-based architectures, which are 97% less computationally complex in terms of FLOPs and 86% smaller in terms of the number of trainable parameters. Interestingly, in some zero-shot out-of-distribution scenarios, these models can outperform moving average and exponential smoothing statistical baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, had a significant (negative) impact on Transformer model performance.
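To make the distinction between compositional reasoning and in-distribution generalization concrete, the toy harness below builds a composite series (trend + seasonality) and scores a simple exponential-smoothing baseline on it; a TSFM trained only on the pure components would be scored the same way. This is my own illustrative setup, not the paper's benchmark.

```python
import numpy as np

def make_series(n=200, trend=0.05, period=24, amp=1.0, noise=0.1, seed=0):
    """Composite series = linear trend + seasonality + noise. A compositional test
    asks a model that only saw the pure parts in training to forecast this mixture."""
    rng = np.random.default_rng(seed)
    t = np.arange(n)
    return trend * t + amp * np.sin(2 * np.pi * t / period) + noise * rng.standard_normal(n)

def exp_smoothing_forecast(history, horizon, alpha=0.3):
    """Simple exponential smoothing baseline (flat forecast of the last level)."""
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return np.full(horizon, level)

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

series = make_series()
history, future = series[:-24], series[-24:]
baseline = exp_smoothing_forecast(history, horizon=24)
print("SES baseline MAE:", mae(future, baseline))
# A TSFM's zero-shot forecast over the same horizon is scored identically,
# isolating compositional reasoning from in-distribution generalization.
```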


Paper & Project Links

PDF

Summary
Large pre-trained time series foundation models (TSFMs) show promising zero-shot performance across domains, but it is unclear whether they merely memorize training patterns or can reason about them. Inspired by the language-modeling literature, the paper formally defines compositional reasoning in forecasting and distinguishes it from in-distribution generalization. The reasoning and generalization capabilities of 16 popular deep learning forecasting models are evaluated on synthetic and real-world datasets, and controlled studies examine which design choices in 7 open-source TSFMs improve reasoning. Patch-based Transformers achieve the best reasoning performance, closely followed by residualized MLP-based architectures that require 97% fewer FLOPs and 86% fewer trainable parameters. In some zero-shot out-of-distribution scenarios, these models outperform moving-average and exponential-smoothing baselines trained on in-distribution data. Only a few design choices, such as the tokenization method, had a significant (negative) impact on Transformer performance.

Key Takeaways

  1. TSFMs show promising zero-shot performance across domains.
  2. Whether TSFMs reason about patterns or merely memorize training data is an open question.
  3. The study formally distinguishes compositional reasoning in forecasting from in-distribution generalization.
  4. Across multiple datasets, patch-based Transformers achieve the best reasoning performance.
  5. Residualized MLP-based architectures perform nearly as well with far lower computational cost and far fewer parameters.
  6. In some zero-shot out-of-distribution scenarios, these models outperform classical statistical baselines.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!