⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never rely on these summaries in serious academic settings; they are intended only for initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-11
Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Authors: Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although the continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the “mode-covering” nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the “mode-seeking” reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
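For context on the JVP bottleneck: continuous-time consistency training needs the total time derivative of the student along the probability-flow ODE, $\frac{\mathrm{d}}{\mathrm{d}t} f_\theta(x_t,t)$, obtained with a single forward-mode Jacobian-vector product through the network; standard FlashAttention kernels do not support forward-mode AD, which is the infrastructure gap the paper's FlashAttention-2 JVP kernel closes. Below is a minimal sketch of the quantity involved, using `torch.func.jvp` on a toy stand-in network; all names and shapes are illustrative, not the paper's code.

```python
import torch
from torch.func import jvp

class TinyNet(torch.nn.Module):
    """Toy stand-in for the student diffusion model f_theta(x, t)."""
    def __init__(self, dim=16):
        super().__init__()
        self.body = torch.nn.Sequential(
            torch.nn.Linear(dim + 1, 64),
            torch.nn.SiLU(),
            torch.nn.Linear(64, dim),
        )

    def forward(self, x, t):
        # Append the (broadcast) time as an extra input feature.
        return self.body(torch.cat([x, t.expand(x.shape[0], 1)], dim=-1))

net = TinyNet()
x_t = torch.randn(8, 16)       # noisy samples at time t
t = torch.tensor([0.7])        # diffusion time
dx_dt = torch.randn_like(x_t)  # tangent: dx_t/dt along the PF-ODE (placeholder values)
dt = torch.ones_like(t)        # tangent for the time input (dt/dt = 1)

# One forward-mode pass returns f(x_t, t) together with its total time derivative
#   d/dt f(x_t, t) = (df/dx) . dx_t/dt + df/dt,
# the quantity the sCM loss consumes at every training step.
f_out, df_dt = jvp(net, (x_t, t), (dx_dt, dt))
print(f_out.shape, df_dt.shape)  # torch.Size([8, 16]) torch.Size([8, 16])
```

At scale, the hard part is not this call but making the attention kernel itself propagate tangents efficiently under model parallelism, which is what the paper's custom kernel provides.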
Paper and project links
Summary
This work is the first to scale continuous-time consistency distillation to large-scale text-to-image and video diffusion models. The authors resolve the infrastructure challenges of application-level diffusion models by developing a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and on high-dimensional video tasks. Finding that sCM suffers quality limitations in fine-detail generation, they propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer to improve visual quality while preserving high generation diversity. Validated on large-scale models, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while showing clear advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only 1~4 steps, accelerating diffusion sampling by 15x~50x. These results establish rCM as a practical and theoretically grounded framework for large-scale diffusion distillation.
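To make the "1~4 steps" concrete, here is a minimal sketch of generic multistep consistency sampling, the standard way a distilled consistency model is run in a few steps. It assumes a VE-style noise schedule and a student `f(x, t)` that predicts the clean sample; the function name, schedule, and step times are illustrative, not the paper's actual sampler.

```python
import torch

@torch.no_grad()
def multistep_consistency_sample(f, shape, ts=(1.0, 0.75, 0.5, 0.25), sigma_max=80.0):
    """Generic multistep consistency sampling (a common scheme; the paper's
    exact sampler and schedule may differ). f(x, t) maps a noisy sample at
    time t to a prediction of the clean sample."""
    x = torch.randn(shape) * sigma_max      # start from pure noise at t = t_max
    sample = f(x, torch.tensor(ts[0]))      # 1-step generation
    for t in ts[1:]:                        # optional refinement steps (2-4 total)
        noise = torch.randn(shape)
        x = sample + t * sigma_max * noise  # re-noise to time t (assumed sigma(t) = t * sigma_max)
        sample = f(x, torch.tensor(t))      # denoise again in one jump
    return sample

# Shape check with a dummy "student" (not a real model):
out = multistep_consistency_sample(lambda x, t: x * 0.0, (2, 3, 64, 64))
print(out.shape)  # torch.Size([2, 3, 64, 64])
```

Each extra step trades a little latency for detail, which is why the reported speedups span 15x~50x depending on the step count.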
Key Takeaways
- This work is the first attempt to scale continuous-time consistency distillation to application-level image and video diffusion models.
- A parallelism-compatible FlashAttention-2 JVP kernel is developed, making sCM training feasible for models with over 10 billion parameters and for high-dimensional video tasks.
- sCM shows fundamental quality limitations in fine-detail generation, attributed to error accumulation and the “mode-covering” nature of its forward-divergence objective.
- rCM is proposed, incorporating score distillation as a “mode-seeking” reverse-divergence regularizer that effectively improves visual quality while maintaining high generation diversity (see the schematic objective after this list).
- rCM performs strongly on large-scale models, matching or surpassing the state-of-the-art distillation method DMD2 on quality metrics while showing advantages in diversity.
- rCM requires no GAN tuning or extensive hyperparameter searches, which simplifies the training pipeline.
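At a high level, the rCM objective can be read as the sCM consistency loss plus a score-distillation regularizer. The following is a schematic form in our own notation (the paper's exact losses, weightings, and parameterizations may differ):

$$
\mathcal{L}_{\mathrm{rCM}}(\theta)
\;=\;
\underbrace{\mathcal{L}_{\mathrm{sCM}}(\theta)}_{\text{consistency: forward divergence, “mode-covering”}}
\;+\;
\lambda\,
\underbrace{\mathcal{L}_{\mathrm{score}}(\theta)}_{\text{score distillation: reverse divergence, “mode-seeking”}}
$$

Here $\mathcal{L}_{\mathrm{sCM}}$ enforces that the student $f_\theta(x_t, t)$ stays constant along the probability-flow ODE via its total time derivative $\frac{\mathrm{d}}{\mathrm{d}t} f_\theta(x_t, t) = \partial_x f_\theta \cdot \frac{\mathrm{d}x_t}{\mathrm{d}t} + \partial_t f_\theta$ (the JVP sketched above), while the score term acts as a long-skip regularizer pulling few-step samples toward high-density regions of the teacher; $\lambda$ is an assumed weighting coefficient.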