Published: 2025-06-04

Updated: 2025-07-06

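⚠️ All summaries below were generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings; they are intended only for initial screening before reading a paper.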

2025-06-04 Update

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Authors: Jiaxin Liu, Jia Wang, Saihui Hou, Min Ren, Huijia Wu, Zhaofeng He

In recent years, the explosive advancement of deepfake technology has posed a critical and escalating threat to public security: diffusion-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency via multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the new large-scale multimodal digital human forgery dataset based on diffusion models. Leveraging five of the latest digital human generation methods and a voice cloning method, we systematically construct a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies demonstrate that the misrecognition rate by participants for DigiFakeAV reaches as high as 68%. Moreover, the substantial performance degradation of existing detection models on our dataset further highlights its challenges. To address this problem, we propose DigiShield, an effective detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves state-of-the-art (SOTA) performance on the DigiFakeAV and shows strong generalization on other datasets.
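As a quick sanity check on the dataset scale quoted in the abstract, the two reported totals imply an average clip length in frames. The sketch below only divides the reported numbers; the 8.4 million figure is presumably rounded, so the result is approximate. The frame rate `ASSUMED_FPS` is a hypothetical value for illustration and is not stated in the abstract.

```python
# Totals as reported in the abstract (the frame count is likely rounded).
TOTAL_VIDEOS = 60_000
TOTAL_FRAMES = 8_400_000

# Hypothetical frame rate, assumed only to convert frames to seconds.
ASSUMED_FPS = 25


def average_frames_per_video(total_frames: int, total_videos: int) -> float:
    """Return the mean number of frames per video."""
    if total_videos <= 0:
        raise ValueError("total_videos must be positive")
    return total_frames / total_videos


avg_frames = average_frames_per_video(TOTAL_FRAMES, TOTAL_VIDEOS)
avg_seconds = avg_frames / ASSUMED_FPS  # depends entirely on ASSUMED_FPS

print(f"average frames per video: {avg_frames:.0f}")
print(f"average duration at {ASSUMED_FPS} fps (assumed): {avg_seconds:.1f} s")
```

With the reported totals this works out to about 140 frames per video; the duration in seconds changes with whatever frame rate the clips actually use.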

Paper and project links

PDF

Summary

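This paper addresses a new threat to public security created by the rapid development of deepfake technology: diffusion-based digital human generation. To meet this challenge, the researchers introduce the DigiFakeAV dataset, which contains 60,000 videos covering multiple ethnicities, skin tones, genders, and real-world scenarios, substantially improving data diversity and realism. Because existing detection models degrade markedly on this dataset, the researchers propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion, which achieves state-of-the-art results.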
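The summary describes DigiShield only at a high level, as fusing video and audio features. As a rough illustration of what cross-modal fusion can look like in general, the sketch below concatenates one video feature vector and one audio feature vector and scores the result with a logistic layer. This is not the paper's architecture: the function name `concat_fusion_score`, the feature values, and the weights are all hypothetical, and a real detector would learn the weights and use far richer encoders.

```python
import math
from typing import Sequence


def concat_fusion_score(
    video_feat: Sequence[float],
    audio_feat: Sequence[float],
    weights: Sequence[float],
    bias: float = 0.0,
) -> float:
    """Fuse two modality vectors by concatenation and return a score in (0, 1).

    Hypothetical illustration: the score is a logistic function of a
    weighted sum over the concatenated features.
    """
    fused = list(video_feat) + list(audio_feat)
    if len(fused) != len(weights):
        raise ValueError("weights must match the fused feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Made-up features and weights, purely to show the call shape.
score = concat_fusion_score(
    video_feat=[0.2, -0.1],
    audio_feat=[0.4],
    weights=[1.0, 0.5, -2.0],
    bias=0.1,
)
print(f"fused score: {score:.3f}")
```

Concatenation is the simplest fusion choice; it lets the scoring layer weigh evidence from both modalities jointly, which is the general idea behind combining audio and video cues.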

Key Takeaways