⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these summaries in serious academic settings; they are intended only for initial screening before reading a paper!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-10-18 Update
Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
Authors:Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
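The abstract only names CRFR-P masking without spelling it out; as a rough illustration, here is a minimal NumPy sketch of one plausible reading, in which one randomly chosen facial region is fully covered and the remaining masking budget is spent proportionally across the other regions. The function name, the per-patch region labels, and the uniform fallback sampling are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical sketch of a CRFR-P-style masking strategy: fully cover one
# randomly chosen facial region, then spend the remaining masking budget on
# the other regions. Real per-patch region ids would come from an
# off-the-shelf face parser; here they are random stand-ins.

def crfr_p_mask(region_ids, mask_ratio=0.75, rng=None):
    """region_ids: (num_patches,) int array, one facial-region id per patch.
    Returns a boolean mask of the same shape (True = patch is masked)."""
    rng = rng or np.random.default_rng()
    num_patches = region_ids.shape[0]
    budget = int(mask_ratio * num_patches)

    # 1) Cover a Random Facial Region: mask every patch of one region.
    covered = rng.choice(np.unique(region_ids))
    mask = region_ids == covered
    budget -= int(mask.sum())

    # 2) Proportional masking: uniform sampling over the remaining patches
    #    masks each other region in proportion to its patch count.
    rest = np.flatnonzero(~mask)
    if budget > 0:
        picked = rng.choice(rest, size=min(budget, rest.size), replace=False)
        mask[picked] = True
    return mask

# Example: a 14x14 ViT patch grid with 5 pseudo facial regions.
rng = np.random.default_rng(0)
regions = rng.integers(0, 5, size=14 * 14)
print(crfr_p_mask(regions, 0.75, rng).mean())  # ~0.75 of patches masked
```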
Paper and Project Links
PDF 18 pages, 9 figures, project page: https://fsfm-3c.github.io/fsvfm.html
Summary
This paper presents FS-VFM, a new framework for learning robust and transferable representations of real face images. By introducing three learning objectives and a range of facial masking strategies, the framework encodes both the local patterns and the global semantics of faces. The paper also proposes a reliable self-distillation mechanism that couples MIM with ID to establish local-to-global correspondence. After pre-training, vanilla vision transformers serve as universal vision foundation models for downstream face security tasks. To transfer the pre-trained FS-VFM efficiently, the paper further proposes FS-Adapter, a lightweight plug-in with a novel real-anchor contrastive objective. Experiments show that FS-VFM performs strongly across multiple public benchmarks.
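To make the coupling of MIM and ID concrete, below is a toy, hedged sketch of a generic self-distillation step of this kind: a student encoder sees the masked view and both reconstructs the masked tokens and matches its pooled global embedding to an EMA teacher that sees the full view. The tiny MLP encoder, the loss weighting, and the EMA momentum are illustrative placeholders, not FS-VFM's actual modules.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy sketch of self-distillation coupling MIM with ID: the student encodes a
# masked view and (a) reconstructs the masked tokens, (b) matches its pooled
# global embedding to an EMA teacher that encodes the full view. A tiny MLP
# stands in for the ViT; all module and loss choices here are illustrative.

DIM = 64
student = nn.Sequential(nn.Linear(DIM, DIM), nn.GELU(), nn.Linear(DIM, DIM))
decoder = nn.Linear(DIM, DIM)                        # MIM reconstruction head
teacher = copy.deepcopy(student).requires_grad_(False)
opt = torch.optim.AdamW(list(student.parameters()) + list(decoder.parameters()))

tokens = torch.randn(8, 16, DIM)                     # (batch, patches, dim)
mask = torch.rand(8, 16) < 0.75                      # True = masked patch

z = student(tokens.masked_fill(mask.unsqueeze(-1), 0.0))
mim_loss = F.mse_loss(decoder(z)[mask], tokens[mask])    # local patterns

with torch.no_grad():
    z_t = teacher(tokens)                            # teacher sees full view
id_loss = 1 - F.cosine_similarity(z.mean(1), z_t.mean(1)).mean()  # global semantics

(mim_loss + id_loss).backward()
opt.step()

# The teacher tracks the student as a slow exponential moving average.
with torch.no_grad():
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(0.996).add_(ps, alpha=0.004)
```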
Key Takeaways
- Introduces FS-VFM, a framework for learning fundamental representations of real face images.
- Proposes three learning objectives (3C) that combine masked image modeling (MIM) and instance discrimination (ID).
- Formulates multiple facial masking strategies, among which CRFR-P masking drives the model toward intra-region consistency and inter-region coherency.
- Couples MIM with ID to establish local-to-global correspondence.
- Uses vanilla vision transformers as universal vision foundation models for downstream face security tasks.
- Proposes FS-Adapter, a lightweight plug-in with a real-anchor contrastive objective, for efficiently transferring the pre-trained FS-VFM (see the sketch below).
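As a rough picture of what a bottleneck adapter with a real-anchor contrastive objective could look like, here is a hedged PyTorch sketch: a residual down-/up-projection on top of frozen backbone features, plus a margin loss that pulls real embeddings toward a learnable "real" prototype and pushes fake ones away. The class, the margin form of the loss, and the extra BCE head are assumptions for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of an FS-Adapter-style module: a residual bottleneck on top
# of FROZEN backbone features, trained with a "real-anchor" contrastive loss
# that pulls real embeddings toward a learnable real prototype and pushes
# fake embeddings beyond a margin. The margin form and the BCE head are
# illustrative assumptions, not the paper's exact objective.

class BottleneckAdapter(nn.Module):
    def __init__(self, dim=768, hidden=64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)            # lightweight bottleneck
        self.up = nn.Linear(hidden, dim)
        self.anchor = nn.Parameter(torch.randn(dim))  # learnable real anchor
        self.head = nn.Linear(dim, 1)                 # real/fake logit

    def forward(self, feats):
        return feats + self.up(F.gelu(self.down(feats)))  # residual adapter

def real_anchor_loss(emb, labels, anchor, margin=0.5):
    """labels: 1 = real, 0 = fake; cosine distance to the real anchor."""
    d = 1 - F.cosine_similarity(emb, anchor.unsqueeze(0).expand_as(emb))
    pull = (labels * d).sum() / labels.sum().clamp(min=1)
    push = ((1 - labels) * F.relu(margin - d)).sum() / (1 - labels).sum().clamp(min=1)
    return pull + push

adapter = BottleneckAdapter()
feats = torch.randn(8, 768)                # CLS features from a frozen ViT
labels = torch.randint(0, 2, (8,)).float()
emb = adapter(feats)
loss = real_anchor_loss(emb, labels, adapter.anchor) \
       + F.binary_cross_entropy_with_logits(adapter.head(emb).squeeze(1), labels)
loss.backward()                            # only the adapter receives grads
```

Because the backbone stays frozen, only the adapter's few parameters are updated, which is where the efficiency-performance trade-off mentioned in the abstract would come from.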