GAN

Published: 2025-09-17

Updated: 2025-10-07

Word count: 4.7k

Reading time: 19 min

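⚠️ All summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for initial screening before reading a paper.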

Updated 2025-09-17

AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective

Authors:Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han

Existing talking-head animation approaches based on Generative Adversarial Networks (GANs) or diffusion models often suffer from inter-frame flicker, identity drift, and slow inference. These limitations inherent to their video generation pipelines restrict their suitability for applications. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. In addition, AvatarSync adopts a two-stage generation strategy, decoupling semantic modeling from visual dynamics, which is a deliberate “Divide and Conquer” design. The first stage, Facial Keyframe Generation (FKG), focuses on phoneme-level semantic representation by leveraging the many-to-one mapping from text or audio to phonemes. A Phoneme-to-Visual Mapping is constructed to anchor abstract phonemes to character-level units. Combined with a customized Text-Frame Causal Attention Mask, the keyframes are generated. The second stage, inter-frame interpolation, emphasizes temporal coherence and visual smoothness. We introduce a timestamp-aware adaptive strategy based on a selective state space model, enabling efficient bidirectional context reasoning. To support deployment, we optimize the inference pipeline to reduce latency without compromising visual fidelity. Extensive experiments show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.

Paper and project links

PDF

Summary

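Existing talking-head animation methods based on Generative Adversarial Networks (GANs) or diffusion models often suffer from inter-frame flicker, identity drift, and slow inference, and these limitations of their video generation pipelines restrict their practical use. To address this, the paper presents AvatarSync, an autoregressive framework built on phoneme representations that generates realistic, controllable talking-head animation from a single reference image, driven directly by text or audio input. AvatarSync uses a two-stage generation strategy that decouples semantic modeling from visual dynamics, a deliberate "divide and conquer" design. The first stage focuses on phoneme-level semantic representation: a Phoneme-to-Visual Mapping anchors abstract phonemes to character-level units, and keyframes are generated with a customized Text-Frame Causal Attention Mask. The second stage targets temporal coherence and visual smoothness between frames, using a timestamp-aware adaptive strategy based on a selective state space model for efficient bidirectional context reasoning. For deployment, the inference pipeline is optimized to reduce latency without compromising visual fidelity. Experiments show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, offering a scalable and controllable solution.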

Key Takeaways

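AvatarSync is an autoregressive framework built on phoneme representations that generates realistic, controllable talking-head animation from a single reference image.
Animation generation is driven directly by text or audio input.
AvatarSync uses a two-stage generation strategy that handles semantic modeling and visual dynamics separately.
The first stage focuses on phoneme-level semantic representation and builds a Phoneme-to-Visual Mapping.
The second stage targets inter-frame temporal coherence and visual smoothness, using a timestamp-aware adaptive strategy for efficient context reasoning.
The inference pipeline is optimized to reduce latency while preserving visual fidelity.
Experiments show that AvatarSync outperforms existing methods in visual quality, temporal consistency, and computational efficiency.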
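
The Text-Frame Causal Attention Mask is named in the abstract but not specified there. As a rough illustration only, the sketch below builds one plausible mask under assumptions that are not taken from the paper: the sequence is laid out as text (phoneme) tokens followed by keyframe tokens, text tokens attend only to text, and each keyframe token attends to all text plus the keyframes up to and including itself. The function name and the layout are hypothetical.

```python
def text_frame_causal_mask(num_text, num_frame):
    """Return allowed[i][j]: may query position i attend to key position j?

    Assumed layout (hypothetical, not from the paper): positions
    0..num_text-1 are text/phoneme tokens, the rest are keyframe tokens.
    """
    n = num_text + num_frame
    allowed = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < num_text:
                # Text tokens see the whole text block, never any frame.
                allowed[i][j] = j < num_text
            else:
                # Frame tokens see all text, plus frames up to themselves.
                allowed[i][j] = j < num_text or j <= i
    return allowed


mask = text_frame_causal_mask(num_text=3, num_frame=2)
for row in mask:
    print("".join("1" if a else "." for a in row))
```

Each printed row is one query position; a `1` marks a key it may attend to. In a real model the boolean matrix would typically be converted to additive `-inf` entries before the softmax.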

Sphere-GAN: a GAN-based Approach for Saliency Estimation in 360° Videos

Authors:Mahmoud Z. A. Wahba, Sara Baldoni, Federica Battisti

The recent success of immersive applications is pushing the research community to define new approaches to process 360° images and videos and optimize their transmission. Among these, saliency estimation provides a powerful tool that can be used to identify visually relevant areas and, consequently, adapt processing algorithms. Although saliency estimation has been widely investigated for 2D content, very few algorithms have been proposed for 360° saliency estimation. Towards this goal, we introduce Sphere-GAN, a saliency detection model for 360° videos that leverages a Generative Adversarial Network with spherical convolutions. Extensive experiments were conducted using a public 360° video saliency dataset, and the results demonstrate that Sphere-GAN outperforms state-of-the-art models in accurately predicting saliency maps.

Paper and project links

PDF

Summary

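This paper introduces Sphere-GAN, a model that combines a Generative Adversarial Network (GAN) with spherical convolutions to detect saliency in 360° videos. Extensive experiments on a public 360° video saliency dataset show that Sphere-GAN outperforms existing models at predicting saliency maps.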

Key Takeaways

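Sphere-GAN is a saliency detection model designed for 360° videos.
The model is built on a Generative Adversarial Network (GAN).
It uses spherical convolutions to handle the geometry of 360° video.
Experiments show that Sphere-GAN predicts saliency maps more accurately than prior models.
The model can be applied to optimizing the transmission and processing of 360° video.
Saliency estimation helps identify visually important regions, which in turn lets processing algorithms adapt.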
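
To see why 360° content needs sphere-aware processing at all, it helps to look at how unevenly an equirectangular frame samples the sphere: a pixel's solid angle shrinks with the cosine of its latitude, so rows near the poles are heavily over-represented. The sketch below is not Sphere-GAN's spherical convolution; it only illustrates that geometric fact with a latitude-weighted mean of a toy saliency map. The function names and the toy map are assumptions.

```python
import math


def equirect_row_weights(height):
    """Relative solid angle of one pixel in each row of an equirectangular image."""
    weights = []
    for row in range(height):
        # Latitude of the row centre, from -pi/2 (one pole) to +pi/2 (the other).
        latitude = math.pi * ((row + 0.5) / height - 0.5)
        weights.append(math.cos(latitude))
    return weights


def sphere_weighted_mean(saliency):
    """Mean saliency over the sphere rather than over the flat pixel grid."""
    weights = equirect_row_weights(len(saliency))
    total = sum(w * v for w, row in zip(weights, saliency) for v in row)
    norm = sum(w * len(row) for w, row in zip(weights, saliency))
    return total / norm


# Toy 4x2 map that is salient only in the two polar rows.
polar_map = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]]
planar_mean = sum(sum(row) for row in polar_map) / 8
print(planar_mean, sphere_weighted_mean(polar_map))
```

The planar mean treats the polar rows as half of the image, while the sphere-weighted mean gives them noticeably less influence because they cover less of the viewing sphere.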

StegOT: Trade-offs in Steganography via Optimal Transport

Authors:Chengde Lin, Xuezhu Gong, Shuxue Ding, Mingzhe Yang, Xijun Lu, Chengjun Mo

Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on generative adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.

Paper and project links

PDF

Summary

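This paper presents StegOT, an autoencoder-based steganography model that incorporates optimal transport theory. A multiple channel optimal transport (MCOT) module transforms a feature distribution with multiple peaks into a single-peak distribution to balance the information. Experiments show that the method not only achieves a trade-off between the cover and secret images but also improves the quality of both the stego and recovered images.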

Key Takeaways

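Image hiding, often called steganography, aims to hide a secret image in a cover image of the same resolution.
Many steganography models are based on Generative Adversarial Networks (GANs) and variational autoencoders (VAEs).
Most existing models suffer from mode collapse, which causes an information imbalance between the cover and secret images.
StegOT is an autoencoder-based steganography model that uses optimal transport theory to address mode collapse.
The MCOT module transforms a multi-peak feature distribution into a single-peak one to balance the information.
Experiments show that StegOT achieves a trade-off between the cover and secret images and improves the quality of both the stego and recovered images.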
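
The MCOT module is not described here in enough detail to reproduce. As a generic illustration of what optimal transport between a multi-peak and a single-peak distribution means, the sketch below computes the exact 1D transport plan between two toy histograms with the monotone (north-west corner) rule, which is optimal on ordered bins when the cost is a convex function of the bin distance. The histograms, the function names, and the unit bin-distance cost are assumptions for illustration, not part of StegOT.

```python
def monotone_transport_plan(source, target):
    """Exact 1D optimal transport plan between histograms on the same ordered bins."""
    assert abs(sum(source) - sum(target)) < 1e-9, "total mass must match"
    plan = [[0.0] * len(target) for _ in source]
    src, tgt = list(source), list(target)
    i = j = 0
    while i < len(src) and j < len(tgt):
        moved = min(src[i], tgt[j])   # ship as much mass as both bins allow
        plan[i][j] += moved
        src[i] -= moved
        tgt[j] -= moved
        if src[i] <= 1e-12:
            i += 1                    # source bin emptied
        else:
            j += 1                    # target bin filled
    return plan


def transport_cost(plan):
    """Total cost when moving one unit of mass from bin i to bin j costs |i - j|."""
    return sum(m * abs(i - j) for i, row in enumerate(plan) for j, m in enumerate(row))


two_peaks = [4, 1, 0, 1, 4]   # toy bimodal histogram (mass counts)
one_peak = [1, 2, 4, 2, 1]    # toy unimodal target with the same total mass
plan = monotone_transport_plan(two_peaks, one_peak)
print(transport_cost(plan))
```

Row `i` of `plan` says how the mass in source bin `i` is split across target bins, so the rows sum to the two-peak histogram and the columns sum to the single-peak one.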

An End-to-End Depth-Based Pipeline for Selfie Image Rectification

Authors:Ahmed Alhawwary, Janne Mustaniemi, Phong Nguyen-Ha, Janne Heikkilä

Portraits or selfie images taken from a close distance typically suffer from perspective distortion. In this paper, we propose an end-to-end deep learning-based rectification pipeline to mitigate the effects of perspective distortion. We learn to predict the facial depth by training a deep CNN. The estimated depth is utilized to adjust the camera-to-subject distance by moving the camera farther, increasing the camera focal length, and reprojecting the 3D image features to the new perspective. The reprojected features are then fed to an inpainting module to fill in the missing pixels. We leverage a differentiable renderer to enable end-to-end training of our depth estimation and feature extraction nets to improve the rectified outputs. To boost the results of the inpainting module, we incorporate an auxiliary module to predict the horizontal movement of the camera which decreases the area that requires hallucination of challenging face parts such as ears. Unlike previous works, we process the full-frame input image at once without cropping the subject’s face and processing it separately from the rest of the body, eliminating the need for complex post-processing steps to attach the face back to the subject’s body. To train our network, we utilize the popular game engine Unreal Engine to generate a large synthetic face dataset containing various subjects, head poses, expressions, eyewear, clothes, and lighting. Quantitative and qualitative results show that our rectification pipeline outperforms previous methods, and produces comparable results with a time-consuming 3D GAN-based method while being more than 260 times faster.

Paper and project links

PDF Accepted at IEEE TPAMI

Summary

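This paper proposes an end-to-end deep learning rectification pipeline that mitigates the perspective distortion found in portraits and selfies taken at close range. A deep CNN is trained to predict facial depth, and the estimated depth is used to adjust the camera-to-subject distance. A differentiable renderer enables end-to-end training of the depth estimation and feature extraction networks, improving the rectified output. To help the inpainting module, an auxiliary module predicts a horizontal camera movement, which shrinks the area where challenging face parts such as ears must be hallucinated. Unlike previous methods, the pipeline processes the full-frame input image at once instead of handling the subject's face separately from the rest of the body, which removes the need for complex post-processing. Quantitative and qualitative results show that the pipeline outperforms previous methods and produces results comparable to a time-consuming 3D GAN-based method while being more than 260 times faster.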

Key Takeaways

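Portraits and selfies taken at close range often suffer from perspective distortion.
The paper proposes an end-to-end deep learning rectification pipeline to address it.
A deep CNN is trained to predict facial depth, which is used to adjust the camera-to-subject distance.
A differentiable renderer enables end-to-end training of the networks, improving the rectified output.
An auxiliary module predicts the horizontal camera movement, reducing the facial area that must be hallucinated.
Unlike previous methods, the full-frame input image is processed at once, avoiding complex post-processing.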
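
The geometric idea behind the rectification (move the camera farther away, lengthen the focal length, reproject) can be checked with a plain pinhole model. The sketch below is only that textbook geometry, not the paper's learned pipeline; the focal length, depths, and function names are made-up values for illustration. It shows that a near feature (say, the nose) is magnified 1.4 times relative to a far one (say, an ear) at selfie distance, and that the ratio drops to about 1.08 once the camera is moved back, while a compensating focal length keeps the face the same size in the image.

```python
def project(point, focal, camera_shift=0.0):
    """Pinhole projection of (x, y, z) after moving the camera back by camera_shift."""
    x, y, z = point
    depth = z + camera_shift
    return (focal * x / depth, focal * y / depth)


def compensating_focal(focal, ref_depth, camera_shift):
    """Focal length that keeps objects at ref_depth the same size after the shift."""
    return focal * (ref_depth + camera_shift) / ref_depth


focal = 1000.0                     # pixels (hypothetical)
shift = 1.2                        # metres the virtual camera moves back (hypothetical)
near = (0.05, 0.0, 0.30)           # same lateral offset, different depths (metres)
far = (0.05, 0.0, 0.42)
ref = (0.05, 0.0, 0.36)            # reference depth whose image size is preserved

new_focal = compensating_focal(focal, ref[2], shift)
ratio_before = project(near, focal)[0] / project(far, focal)[0]
ratio_after = project(near, new_focal, shift)[0] / project(far, new_focal, shift)[0]
print(ratio_before, ratio_after)
```

After the shift some pixels that were occluded at close range become visible, which is why the pipeline described above still needs an inpainting step.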

SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis

Authors:Huan-ang Gao, Mingju Gao, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao

Semantic image synthesis (SIS) shows good promise for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has set new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding an FID as low as 10.53 on Cityscapes. The code and models can be accessed via the project page.

Summary

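Semantic image synthesis (SIS) shows promise for sensor simulation, but current best practices, which are based on Generative Adversarial Networks (GANs), have not yet reached the desired quality. An evaluation of ControlNet reveals two problems: odd sub-structures inside large semantic areas, and content that is misaligned with the semantic mask. An empirical study traces both to a mismatch between the noised training data distribution and the standard normal prior applied at inference. To address this, the authors design noise priors specific to SIS: a spatial prior, a categorical prior, and a novel spatial-categorical joint prior. The resulting method, SCP-Diff, sets new state-of-the-art results for SIS on Cityscapes, ADE20K, and COCO-Stuff, with an FID as low as 10.53 on Cityscapes. Code and models are available via the project page.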
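
The abstract attributes the artifacts to a gap between the noised training data distribution and the standard normal prior used at inference, but it does not spell out how the priors are built. One plausible reading of a categorical prior, offered here purely as an assumption, is to estimate a mean and standard deviation of the noised training latents per semantic class, then draw the starting noise for each pixel from its class's Gaussian instead of N(0, 1). All names and the toy data below are hypothetical, and latents are flattened to one value per pixel for brevity.

```python
import random
from collections import defaultdict
from statistics import mean, pstdev


def fit_categorical_prior(noised_latents, masks):
    """Per-class (mean, std) of noised latent values, grouped by semantic label."""
    per_class = defaultdict(list)
    for latent, mask in zip(noised_latents, masks):
        for value, label in zip(latent, mask):
            per_class[label].append(value)
    return {label: (mean(vals), pstdev(vals)) for label, vals in per_class.items()}


def sample_initial_noise(mask, prior, rng):
    """Draw one starting value per pixel from that pixel's class Gaussian."""
    return [rng.gauss(*prior[label]) for label in mask]


# Toy "noised training latents" and their semantic masks (made-up numbers).
noised_latents = [[0.9, 1.1, -0.5, -0.3], [1.0, 1.2, -0.4, -0.6]]
masks = [["road", "road", "sky", "sky"], ["road", "road", "sky", "sky"]]

prior = fit_categorical_prior(noised_latents, masks)
start = sample_initial_noise(["sky", "road", "road", "sky"], prior, random.Random(0))
print(prior, start)
```

A spatial prior would follow the same pattern with statistics indexed by pixel position instead of class label, and a joint prior by both; how SCP-Diff actually combines them is not stated in the text above.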

Key Takeaways