GAN

Published: 2025-10-03

Updated: 2025-11-27

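⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not rely on them in serious academic settings; they are meant only for initial screening before reading the papers.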

Updated 2025-10-03

H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models

Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov

Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective to unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single VAE network but with enhanced quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. Latent consistency loss outperforms prior auxiliary losses including LPIPS, GAN and DWT in terms of both quality improvements and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile, and outperforms prior arts in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.

Paper and project links

PDF 17 pages, 6 figures, 9 tables

Summary

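This work systematically examines autoencoder (AE) architecture design and optimizes how computation is distributed across the network, producing a series of efficient, high-compression video AEs that can decode in real time even on mobile devices. It proposes an omni-training objective that unifies the plain autoencoder and the image-conditioned I2V VAE, so that a single VAE network serves multiple functions with improved quality. It also introduces a latent consistency loss that gives stable gains in reconstruction quality and, according to the authors, outperforms earlier auxiliary losses such as LPIPS, GAN, and DWT in both quality and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding on GPU and mobile, and surpasses prior work on reconstruction metrics by a large margin. Finally, the authors validate the AE by training a DiT on its latent space, demonstrating fast, high-quality text-to-video generation.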
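To make "compression ratio" concrete, here is a minimal sketch of how it is commonly computed for a video autoencoder: the number of values in the input clip divided by the number of values in the latent. The function name and the downsampling factors and channel counts in the example are hypothetical, chosen only for illustration and not taken from the paper. The sketch also assumes the frame count divides evenly by the temporal factor, which causal video VAEs with a separately encoded first frame do not follow.

```python
def compression_ratio(frames, height, width, t_down, s_down,
                      latent_channels, in_channels=3):
    """Return input element count divided by latent element count.

    t_down / s_down are the temporal and spatial downsampling factors.
    All names here are illustrative, not the paper's API.
    """
    if frames % t_down or height % s_down or width % s_down:
        raise ValueError("dimensions must be divisible by the downsampling factors")
    input_elems = in_channels * frames * height * width
    latent_elems = (latent_channels
                    * (frames // t_down)
                    * (height // s_down)
                    * (width // s_down))
    return input_elems / latent_elems


# Hypothetical configuration (NOT from the paper): 4x temporal and
# 8x spatial downsampling with 16 latent channels.
ratio = compression_ratio(frames=16, height=256, width=256,
                          t_down=4, s_down=8, latent_channels=16)
print(ratio)  # 48.0
```

Because the clip size cancels out, the ratio depends only on the factors: `in_channels * t_down * s_down**2 / latent_channels`. A higher ratio means the diffusion model denoises fewer latent values per clip, which is the efficiency gain the abstract refers to.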

Key Takeaways