Published: 2025-05-31

Updated: 2025-06-24

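⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use this content in serious academic settings. It is intended only for initial screening before reading a paper.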

2025-05-31 update

Beyond Face Swapping: A Diffusion-Based Digital Human Benchmark for Multimodal Deepfake Detection

Authors: Jiaxin Liu, Jia Wang, Saihui Hou, Min Ren, Huijia Wu, Zhaofeng He

In recent years, the rapid development of deepfake technology has given rise to an emerging and serious threat to public security: diffusion-based digital human generation. Unlike traditional face manipulation methods, such models can generate highly realistic videos with consistency through multimodal control signals. Their flexibility and covertness pose severe challenges to existing detection strategies. To bridge this gap, we introduce DigiFakeAV, the new large-scale multimodal digital human forgery dataset based on diffusion models. Employing five of the latest digital human generation methods and the voice cloning methods, we systematically produce a dataset comprising 60,000 videos (8.4 million frames), covering multiple nationalities, skin tones, genders, and real-world scenarios, significantly enhancing data diversity and realism. User studies demonstrate that participants misclassify forged videos as real in 68% of tests, and existing detection models exhibit a large drop in performance on DigiFakeAV, highlighting the challenge of the dataset. To address this problem, we propose DigiShield, an effective detection baseline based on spatiotemporal and cross-modal fusion. By jointly modeling the 3D spatiotemporal features of videos and the semantic-acoustic features of audio, DigiShield achieves state-of-the-art (SOTA) performance on the DigiFakeAV and shows strong generalization on other datasets.
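As a quick sanity check on the scale quoted in the abstract, the two dataset figures (60,000 videos and 8.4 million frames) imply an average clip length in frames. The sketch below works that out. The frame rate of 25 fps is purely an assumption for illustration, since the abstract does not state one, and the function names are hypothetical.

```python
# Figures quoted in the abstract.
TOTAL_VIDEOS = 60_000
TOTAL_FRAMES = 8_400_000

# ASSUMPTION: the abstract does not give a frame rate; 25 fps is illustrative only.
ASSUMED_FPS = 25.0


def average_frames_per_video(total_frames: int, total_videos: int) -> float:
    """Return the mean number of frames per video."""
    if total_videos <= 0:
        raise ValueError("total_videos must be positive")
    return total_frames / total_videos


def approx_duration_seconds(frames: float, fps: float) -> float:
    """Convert a frame count to seconds at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


avg_frames = average_frames_per_video(TOTAL_FRAMES, TOTAL_VIDEOS)
print(f"Average frames per video: {avg_frames:.0f}")
print(
    f"Approximate clip length at an assumed {ASSUMED_FPS:.0f} fps: "
    f"{approx_duration_seconds(avg_frames, ASSUMED_FPS):.1f} s"
)
```

Under these figures the average works out to 140 frames per video, so the clips are short. The duration estimate depends entirely on the assumed frame rate and should not be read as a property of the dataset.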

Paper and project links

PDF

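Summary
The rise of highly realistic, diffusion-based digital human generation poses a serious threat to public security. In response, the authors introduce DigiFakeAV, a new large-scale multimodal digital human forgery dataset. It contains 60,000 videos (8.4 million frames) covering multiple nationalities, skin tones, genders, and real-world scenarios, which significantly increases data diversity and realism. In user studies, participants misclassified forged videos as real in 68% of tests, and existing detection models show a large performance drop on DigiFakeAV. The authors therefore propose DigiShield, a detection baseline based on spatiotemporal and cross-modal fusion, which achieves state-of-the-art performance on DigiFakeAV.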

Key takeaways