Published: 2025-10-11

Updated: 2025-11-27

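⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.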

2025-10-11 Update

CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation

Authors: Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu

Instance segmentation demands costly per-pixel annotations and computationally expensive models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pre-trained vision foundation models (VFM) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM(s) via self-training with contrastive calibration, (2) knowledge transfer through a unified multi-objective loss, and (3) student refinement to mitigate residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x smaller student improves over its zero-shot VFM teacher(s) by +8.5 and +7.1 AP, surpasses adapted teacher(s) by +3.4 and +1.5 AP, and further outperforms state-of-the-art SSKD methods on both benchmarks.

Paper and project links

PDF

Summary

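The paper introduces CAST, a semi-supervised knowledge distillation (SSKD) framework that uses limited labeled data and abundant unlabeled data to compress pre-trained vision foundation models (VFMs) into compact expert models. CAST proceeds in three stages: domain adaptation of the VFM via self-training with contrastive calibration, knowledge transfer through a unified multi-objective loss, and student refinement to mitigate residual pseudo-label bias. At its core is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to extract informative negatives and enforce clear inter-instance margins. On Cityscapes and ADE20K, the roughly 11x smaller student improves over its zero-shot VFM teacher by +8.5 and +7.1 AP, surpasses the adapted teacher by +3.4 and +1.5 AP, and outperforms state-of-the-art SSKD methods on both benchmarks.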
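
To make the idea of a pixel-wise contrastive loss with score-weighted negatives more concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the abstract does not give the exact loss, so the fusion rule (mask score times class score), the use of that fused score as a weight on each negative, the cosine similarity, the temperature value, and every name below (`instance_contrastive_loss`, `fused`, and so on) are assumptions chosen for illustration. The sketch uses an InfoNCE-style objective in which pixels of the same instance are positives and pixels of other instances are negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def instance_contrastive_loss(embeddings, instance_ids, mask_scores,
                              class_scores, temperature=0.1):
    """InfoNCE-style pixel loss with confidence-weighted negatives.

    Assumption (not from the paper): each pixel's mask and class scores
    are fused by multiplication, and the fused value weights that pixel
    when it acts as a negative.
    """
    fused = [m * c for m, c in zip(mask_scores, class_scores)]
    n = len(embeddings)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and instance_ids[j] == instance_ids[i]]
        negatives = [j for j in range(n)
                     if instance_ids[j] != instance_ids[i]]
        if not positives or not negatives:
            continue  # anchor needs at least one positive and one negative
        neg_term = sum(
            fused[j] * math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            for j in negatives
        )
        for j in positives:
            pos_term = math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
            total += -math.log(pos_term / (pos_term + neg_term))
            count += 1
    return total / count if count else 0.0


# Toy example: four pixels, two instances, 2-D embeddings.
ids = [0, 0, 1, 1]
scores = [0.9, 0.9, 0.9, 0.9]
separated = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mixed = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]

loss_separated = instance_contrastive_loss(separated, ids, scores, scores)
loss_mixed = instance_contrastive_loss(mixed, ids, scores, scores)
print(loss_separated < loss_mixed)
```

In the toy example, `separated` places pixels of the same instance close together, while `mixed` places each pixel next to a pixel of the other instance, so the loss is lower for `separated`. Under the assumed weighting, low-confidence pixels contribute less as negatives, which is one plausible reading of "informative negatives"; a hard top-k selection over the fused score would be another.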

Key Takeaways