Published: 2025-06-04

Updated: 2025-07-06

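⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for initial screening before reading a paper.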

2025-06-04 Update

Authors: Yaning Zhang, Tianyi Wang, Zitong Yu, Zan Gao, Linlin Shen, Shengyong Chen

The rapid development of photo-realistic face generation methods has raised significant concerns in society and academia, highlighting the urgent need for robust and generalizable face forgery detection (FFD) techniques. Although existing approaches mainly capture face forgery patterns using the image modality, other modalities such as fine-grained noise and text are not fully explored, which limits the generalization capability of the model. In addition, most FFD methods tend to identify facial images generated by GANs, but struggle to detect unseen diffusion-synthesized ones. To address these limitations, we aim to leverage the cutting-edge foundation model, contrastive language-image pre-training (CLIP), to achieve generalizable diffusion face forgery detection (DFFD). In this paper, we propose a novel multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive and fine-grained forgery traces across image-noise modalities via language-guided face forgery representation learning, to facilitate the advancement of DFFD. Specifically, we devise a fine-grained language encoder (FLE) that extracts fine global language features from hierarchical text prompts. We design a multi-modal vision encoder (MVE) to capture global image forgery embeddings as well as fine-grained noise forgery patterns extracted from the richest patch, and integrate them to mine general visual forgery traces. Moreover, we build an innovative plug-and-play sample pair attention (SPA) method to emphasize relevant negative pairs and suppress irrelevant ones, allowing cross-modality sample pairs to conduct more flexible alignment. Extensive experiments and visualizations show that our model outperforms the state of the art in settings such as cross-generator, cross-forgery, and cross-dataset evaluations.
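
The abstract describes sample pair attention (SPA) only at a high level: negative image-text pairs are re-weighted so that relevant ones count more in the contrastive alignment. The sketch below illustrates that general idea and is not the authors' formulation. It is a CLIP-style image-to-text InfoNCE loss with a per-pair weight on each negative. The function names, the `sharpness` parameter, and the choice to weight negatives by a softmax over their similarities are all assumptions made for illustration.

```python
import math


def negative_pair_weights(sim, sharpness=1.0):
    """Hypothetical weighting: a softmax over each row's negatives, rescaled
    so the weights in a row sum to the number of negatives (mean weight 1).

    sim[i][j] is the similarity between image i and text j; the diagonal
    holds the positive pairs and always gets weight 0 here.
    """
    n = len(sim)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        negatives = [j for j in range(n) if j != i]
        if not negatives:
            continue
        # Subtract the row maximum before exponentiating for stability.
        top = max(sim[i][j] for j in negatives)
        exps = {j: math.exp(sharpness * (sim[i][j] - top)) for j in negatives}
        norm = sum(exps.values())
        for j in negatives:
            weights[i][j] = len(negatives) * exps[j] / norm
    return weights


def weighted_contrastive_loss(sim, neg_weight, temperature=0.07):
    """Image-to-text InfoNCE loss with re-weighted negative pairs.

    With every negative weight equal to 1 this is the standard
    (unweighted) image-to-text contrastive loss.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / temperature)
        neg = sum(
            neg_weight[i][j] * math.exp(sim[i][j] / temperature)
            for j in range(n)
            if j != i
        )
        total += -math.log(pos / (pos + neg))
    return total / n


if __name__ == "__main__":
    # Toy similarity matrix for three image-text pairs (made-up values).
    sim = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.0], [0.3, 0.4, 0.7]]
    weights = negative_pair_weights(sim, sharpness=5.0)
    print(weighted_contrastive_loss(sim, weights))
```

Setting `sharpness=0.0` gives every negative a weight of 1 and recovers the plain contrastive loss, which makes the weighting easy to switch off when comparing against a baseline.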

Paper and project links

PDF Accepted by IEEE Transactions on Information Forensics and Security 2025

Summary
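Face forgery detection has drawn attention from both society and academia. Existing methods mainly capture forgery traces from the image modality and overlook other modalities such as fine-grained noise and text, which limits their generalization. To address this, the paper uses contrastive language-image pre-training (CLIP) to achieve generalizable diffusion face forgery detection (DFFD). It proposes the multi-modal fine-grained CLIP (MFCLIP) model, which mines comprehensive, fine-grained forgery traces across the image and noise modalities through language-guided face forgery representation learning. Experiments show that the model outperforms existing methods across different evaluation settings.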

Key Takeaways