
Face Swapping


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use them for serious academic work; they are only meant for a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-09-07 Update

Uncertainty-Guided Face Matting for Occlusion-Aware Face Transformation

Authors:Hyebin Cho, Jaehyup Lee

Face filters have become a key element of short-form video content, enabling a wide array of visual effects such as stylization and face swapping. However, their performance often degrades in the presence of occlusions, where objects like hands, hair, or accessories obscure the face. To address this limitation, we introduce the novel task of face matting, which estimates fine-grained alpha mattes to separate occluding elements from facial regions. We further present FaceMat, a trimap-free, uncertainty-aware framework that predicts high-quality alpha mattes under complex occlusions. Our approach leverages a two-stage training pipeline: a teacher model is trained to jointly estimate alpha mattes and per-pixel uncertainty using a negative log-likelihood (NLL) loss, and this uncertainty is then used to guide the student model through spatially adaptive knowledge distillation. This formulation enables the student to focus on ambiguous or occluded regions, improving generalization and preserving semantic consistency. Unlike previous approaches that rely on trimaps or segmentation masks, our framework requires no auxiliary inputs, making it well-suited for real-time applications. In addition, we reformulate the matting objective by explicitly treating skin as foreground and occlusions as background, enabling clearer compositing strategies. To support this task, we newly constructed CelebAMat, a large-scale synthetic dataset specifically designed for occlusion-aware face matting. Extensive experiments show that FaceMat outperforms state-of-the-art methods across multiple benchmarks, enhancing the visual quality and robustness of face filters in real-world, unconstrained video scenarios. The source code and CelebAMat dataset are available at https://github.com/hyebin-c/FaceMat.git
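
The abstract does not spell out the NLL objective, but a common way to make a network jointly predict a value and its per-pixel uncertainty is the heteroscedastic Gaussian negative log-likelihood, where the network outputs a log-variance map alongside the alpha matte. The PyTorch sketch below is a minimal illustration under that assumption; the parameterization and tensor names are ours and may differ from the paper's actual loss.

```python
import torch

def nll_matting_loss(alpha_pred: torch.Tensor,
                     log_var: torch.Tensor,
                     alpha_gt: torch.Tensor) -> torch.Tensor:
    """Heteroscedastic Gaussian NLL for joint alpha-matte and per-pixel
    uncertainty estimation (a common parameterization, assumed here,
    not taken from the paper). All tensors are (B, 1, H, W);
    log_var is the predicted log-variance, kept in log space for
    numerical stability and implicit positivity."""
    squared_error = (alpha_pred - alpha_gt) ** 2
    # 0.5 * exp(-s) * err^2 + 0.5 * s is the Gaussian NLL up to a constant,
    # with s = log(sigma^2) acting as the per-pixel uncertainty.
    per_pixel = 0.5 * torch.exp(-log_var) * squared_error + 0.5 * log_var
    return per_pixel.mean()
```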


Paper and project links

PDF Accepted to ACM MM 2025. 9 pages, 8 figures, 6 tables

Summary

This paper highlights the key role of face filters in short-form video content and the performance drop they suffer under occlusion. To address this, it introduces the new task of face matting, which separates occluding objects from facial regions via fine-grained alpha mattes. The authors present FaceMat, a trimap-free framework that requires no auxiliary inputs and predicts high-quality alpha mattes through a two-stage training pipeline while accounting for per-pixel uncertainty. A teacher model is trained with a negative log-likelihood loss to jointly predict alpha mattes and per-pixel uncertainty; this uncertainty then guides the student model through spatially adaptive knowledge distillation, letting it focus on ambiguous or occluded regions, improving generalization, and preserving semantic consistency. In addition, the matting objective is reformulated to explicitly treat skin as foreground and occlusions as background, and a large-scale synthetic dataset, CelebAMat, is constructed for this task. Experiments show that FaceMat outperforms state-of-the-art methods on multiple benchmarks and improves the visual quality and robustness of face filters in real-world, unconstrained video scenarios.
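
The summary describes the uncertainty-guided, spatially adaptive distillation only at a high level. One plausible realization, sketched below, turns the teacher's per-pixel uncertainty into a spatial weight map for the student's distillation loss; the specific weighting (more uncertain pixels get larger weights, so the student concentrates on ambiguous or occluded regions) is our assumption, not necessarily the paper's formulation.

```python
import torch
import torch.nn.functional as F

def uncertainty_guided_distillation_loss(student_alpha: torch.Tensor,
                                         teacher_alpha: torch.Tensor,
                                         teacher_log_var: torch.Tensor) -> torch.Tensor:
    """Spatially adaptive distillation (illustrative weighting only):
    the per-pixel L1 gap between student and teacher mattes is
    re-weighted by the teacher's uncertainty map.
    All inputs are (B, 1, H, W)."""
    # Map log-variance to a bounded weight in (0, 1); more uncertain pixels
    # (occlusion boundaries, ambiguous regions) receive larger weights.
    weight = torch.sigmoid(teacher_log_var).detach()
    per_pixel = F.l1_loss(student_alpha, teacher_alpha.detach(), reduction="none")
    return (weight * per_pixel).mean()
```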

Key Takeaways

  1. Face filters are now ubiquitous in short-form video content, but their performance degrades when the face is occluded.
  2. The new task of face matting is introduced, separating occluding objects from facial regions via fine-grained alpha mattes.
  3. FaceMat is a trimap-free framework that requires no auxiliary inputs and predicts high-quality alpha mattes with a two-stage training pipeline.
  4. A teacher model is trained with a negative log-likelihood loss to jointly predict alpha mattes and per-pixel uncertainty.
  5. That uncertainty guides the student model through spatially adaptive knowledge distillation, improving generalization and preserving semantic consistency.
  6. The matting objective is reformulated to treat skin as foreground and occlusions as background (see the compositing sketch after this list).
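
Treating skin as foreground and occlusions as background makes the downstream compositing rule simple: a face filter is blended in only where the alpha matte marks facial skin, so occluders such as hands or hair stay untouched. The snippet below is a minimal sketch of that standard alpha-compositing step, not code from the paper.

```python
import torch

def composite_face_filter(original: torch.Tensor,
                          filtered: torch.Tensor,
                          alpha: torch.Tensor) -> torch.Tensor:
    """Standard alpha compositing with skin as foreground: filtered
    pixels are used where alpha -> 1 (facial skin) and original pixels
    are kept where alpha -> 0 (occluders and background).
    original, filtered: (B, 3, H, W); alpha: (B, 1, H, W) in [0, 1]."""
    return alpha * filtered + (1.0 - alpha) * original
```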

Cool Papers

Click here to view paper screenshots

Towards a Universal Synthetic Video Detector: From Face or Background Manipulations to Fully AI-Generated Content

Authors:Rohit Kundu, Hao Xiong, Vishal Mohanty, Athula Balachandran, Amit K. Roy-Chowdhury

Existing DeepFake detection techniques primarily focus on facial manipulations, such as face-swapping or lip-syncing. However, advancements in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging face-centric detection methods and demanding more versatile approaches. To address this, we introduce the Universal Network for Identifying Tampered and synthEtic videos (UNITE) model, which, unlike traditional detectors, captures full-frame manipulations. UNITE extends detection capabilities to scenarios without faces, non-human subjects, and complex background modifications. It leverages a transformer-based architecture that processes domain-agnostic features extracted from videos via the SigLIP-So400M foundation model. Given limited datasets encompassing both facial/background alterations and T2V/I2V content, we integrate task-irrelevant data alongside standard DeepFake datasets in training. We further mitigate the model's tendency to over-focus on faces by incorporating an attention-diversity (AD) loss, which promotes diverse spatial attention across video frames. Combining AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations demonstrate that UNITE outperforms state-of-the-art detectors on datasets (in cross-data settings) featuring face/background manipulations and fully synthetic T2V/I2V videos, showcasing its adaptability and generalizable detection capabilities.
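
The attention-diversity (AD) loss is described only qualitatively. As a hedged sketch, one plausible implementation penalizes pairwise cosine similarity between the spatial attention maps of different heads or frames so that attention spreads across the frame instead of collapsing onto faces; the exact formulation in the paper may differ. Its combination with cross-entropy as a weighted sum is also shown, with an illustrative weight.

```python
import torch
import torch.nn.functional as F

def attention_diversity_loss(attn_maps: torch.Tensor) -> torch.Tensor:
    """Hypothetical AD loss: attn_maps has shape (B, K, N), one flattened
    spatial attention map of length N per head/frame (K >= 2). Penalizing
    the mean pairwise cosine similarity pushes the K maps toward attending
    to different spatial regions."""
    maps = F.normalize(attn_maps, dim=-1)               # unit-normalize each map
    sim = torch.bmm(maps, maps.transpose(1, 2))         # (B, K, K) cosine similarities
    k = maps.size(1)
    off_diag = 1.0 - torch.eye(k, device=sim.device)    # mask out self-similarity
    return (sim * off_diag).sum(dim=(1, 2)).div(k * (k - 1)).mean()

def detection_loss(logits: torch.Tensor, labels: torch.Tensor,
                   attn_maps: torch.Tensor, lambda_ad: float = 0.1) -> torch.Tensor:
    """Cross-entropy for real/fake classification plus the AD regularizer;
    lambda_ad is an illustrative weight, not a value from the paper."""
    return F.cross_entropy(logits, labels) + lambda_ad * attention_diversity_loss(attn_maps)
```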


Paper and project links

PDF

Summary

Advances in text-to-video (T2V) and image-to-video (I2V) generative models now allow fully AI-generated synthetic content and seamless background alterations, challenging existing face-centric DeepFake detection techniques. The authors introduce UNITE (Universal Network for Identifying Tampered and synthEtic videos), which captures full-frame manipulations and extends detection to scenarios without faces, with non-human subjects, and with complex background modifications. UNITE uses a transformer-based architecture that processes domain-agnostic features extracted from videos by the SigLIP-So400M foundation model. Because datasets covering both face/background alterations and T2V/I2V content are limited, training combines task-irrelevant data with standard DeepFake datasets. An attention-diversity (AD) loss mitigates the model's tendency to over-focus on faces by promoting diverse spatial attention across video frames, and combining the AD loss with cross-entropy improves detection performance across varied contexts. Comparative evaluations show that UNITE outperforms state-of-the-art detectors on datasets featuring face/background manipulations and fully synthetic T2V/I2V videos, demonstrating its adaptability and generalizable detection capability.
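
The architecture itself is only described at a high level: frame features from the SigLIP-So400M foundation model feed a transformer-based classifier. The sketch below assumes those per-frame features are already extracted and shows a generic transformer-encoder classifier over the frame sequence; the feature dimension, layer sizes, CLS-token design, and module names are illustrative assumptions rather than the paper's actual architecture. The logits it produces would be trained with the cross-entropy plus AD objective sketched earlier.

```python
import torch
import torch.nn as nn

class FrameSequenceDetector(nn.Module):
    """Illustrative detector head: a transformer encoder over per-frame
    feature vectors (e.g. pooled SigLIP-So400M embeddings, assumed to be
    precomputed) followed by a real/fake classification head."""

    def __init__(self, feat_dim: int = 1152, d_model: int = 512,
                 num_layers: int = 4, num_heads: int = 8, num_classes: int = 2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, feat_dim) -- one feature vector per sampled frame.
        x = self.proj(frame_feats)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([cls, x], dim=1))
        return self.head(x[:, 0])                       # logits from the CLS position
```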

Key Takeaways

  1. Existing DeepFake detection techniques focus mainly on facial manipulations, but advances in T2V and I2V generative models pose new challenges.
  2. The proposed UNITE model detects full-frame manipulations and handles scenarios without faces, with non-human subjects, and with complex background modifications.
  3. UNITE uses a transformer-based architecture over domain-agnostic features, improving its adaptability and generalization.
  4. Training on limited datasets incorporates task-irrelevant data alongside standard DeepFake datasets, strengthening generalization.
  5. The attention-diversity (AD) loss encourages the model to attend to different parts of each frame, improving detection performance.
  6. Combining the AD loss with cross-entropy yields effective detection across varied contexts.
  7. UNITE outperforms other detectors on datasets featuring face/background manipulations and T2V/I2V videos.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!