⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace.
Updated on 2025-10-03
DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space
Authors: Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai
Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model’s latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model’s inherent generation quality. We verify DC-Gen’s effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.
Paper and Project Links
PDF Tech Report. The first three authors contributed equally to this work
Summary
This paper presents DC-Gen, a framework that accelerates text-to-image diffusion models by exploiting a deeply compressed latent space. Instead of costly training from scratch, DC-Gen uses an efficient post-training pipeline that preserves the base model's quality. To handle the representation gap between the base model's latent space and the deeply compressed one, it first applies lightweight embedding-alignment training, then unlocks the base model's generation quality with a small amount of LoRA fine-tuning. Validated on SANA and FLUX.1-Krea, the resulting models match their base models in quality while running much faster. In particular, DC-Gen-FLUX reduces 4K image-generation latency by 53x on an NVIDIA H100 GPU, and combined with NVFP4 SVDQuant it generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, a 138x total latency reduction over the base FLUX.1-Krea model.
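The two-stage recipe above (align the latent embeddings, then lightly fine-tune with LoRA) can be pictured with a minimal sketch. The paper's actual alignment procedure is not spelled out in the abstract, so the code below is only one plausible reading: a lightweight projection pulls latents from the deeply compressed autoencoder toward the base model's latent space. All names (`LatentAligner`, `base_encode`, `dc_encode`) are hypothetical.

```python
# Hypothetical sketch of the embedding-alignment stage; not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentAligner(nn.Module):
    """Lightweight 1x1-conv projection between two latent spaces."""
    def __init__(self, dc_channels: int, base_channels: int):
        super().__init__()
        self.proj = nn.Conv2d(dc_channels, base_channels, kernel_size=1)

    def forward(self, z_dc: torch.Tensor) -> torch.Tensor:
        return self.proj(z_dc)

def alignment_loss(aligner, base_encode, dc_encode, images):
    # Encode the same batch with both (frozen) autoencoders.
    with torch.no_grad():
        z_base = base_encode(images)  # e.g. (B, C_base, H, W)
        z_dc = dc_encode(images)      # e.g. (B, C_dc, H/4, W/4), more compressed
    # Match spatial sizes, then pull projected latents toward the base space.
    z_dc = F.interpolate(z_dc, size=z_base.shape[-2:], mode="bilinear",
                         align_corners=False)
    return F.mse_loss(aligner(z_dc), z_base)
```

Once alignment converges, the abstract suggests only a small amount of LoRA fine-tuning of the diffusion backbone is needed to recover the base model's quality in the new latent space.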
Key Takeaways
- DC-Gen is a framework for accelerating text-to-image diffusion models by working in a deeply compressed latent space.
- It uses a post-training pipeline instead of training from scratch, preserving the base model's quality.
- It tackles the representation gap between the base model's latent space and the deeply compressed latent space.
- Lightweight embedding-alignment training is used to bridge that representation gap.
- Only a small amount of LoRA fine-tuning is then needed to unlock the base model's generation quality.
- Applied to SANA and FLUX.1-Krea, DC-Gen yields models with comparable image quality at much higher speed.
- Combined with FLUX, DC-Gen achieves a dramatic latency reduction for 4K image generation, making high-resolution generation far more practical.
Click here to view paper screenshots





Imagining Alternatives: Towards High-Resolution 3D Counterfactual Medical Image Generation via Language Guidance
Authors: Mohamed Mohamed, Brennan Nichyporuk, Douglas L. Arnold, Tal Arbel
Vision-language models have demonstrated impressive capabilities in generating 2D images under various conditions; however, the success of these models is largely enabled by extensive, readily available pretrained foundation models. Critically, comparable pretrained models do not exist for 3D, significantly limiting progress. As a result, the potential of vision-language models to produce high-resolution 3D counterfactual medical images conditioned solely on natural language remains unexplored. Addressing this gap would enable powerful clinical and research applications, such as personalized counterfactual explanations, simulation of disease progression, and enhanced medical training by visualizing hypothetical conditions in realistic detail. Our work takes a step toward this challenge by introducing a framework capable of generating high-resolution 3D counterfactual medical images of synthesized patients guided by free-form language prompts. We adapt state-of-the-art 3D diffusion models with enhancements from Simple Diffusion and incorporate augmented conditioning to improve text alignment and image quality. To our knowledge, this is the first demonstration of a language-guided native-3D diffusion model applied to neurological imaging, where faithful three-dimensional modeling is essential. On two neurological MRI datasets, our framework simulates varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer’s disease, generating high-quality images while preserving subject fidelity. Our results lay the groundwork for prompt-driven disease progression analysis in 3D medical imaging. Project link - https://lesupermomo.github.io/imagining-alternatives/.
Paper and Project Links
PDF Accepted to the 2025 MICCAI ELAMI Workshop
Summary
Successful vision-language generative models largely depend on pretrained foundation models, which remain scarce for 3D, so high-resolution 3D generation is underexplored. This paper introduces a framework that generates high-resolution 3D counterfactual medical images of synthesized patients from free-form language prompts, enabling clinical and research uses such as personalized counterfactual explanations and disease-progression simulation. Validated on two neurological MRI datasets, the work lays the groundwork for prompt-driven disease-progression analysis in 3D medical imaging.
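As a rough illustration of what a language-guided native-3D diffusion training step looks like (the abstract does not give the architecture or the exact "augmented conditioning" recipe, so every interface below, `unet3d`, `text_encoder`, `scheduler`, is a placeholder), one common setup randomly drops the text condition, classifier-free-guidance style:

```python
# Illustrative only: a generic text-conditioned 3D diffusion training step.
import torch
import torch.nn.functional as F

def training_step(unet3d, text_encoder, scheduler, volumes, prompts, drop_p=0.1):
    # volumes: (B, 1, D, H, W) MRI volumes; prompts: list of strings.
    cond = text_encoder(prompts)                      # (B, L, C) text tokens
    if torch.rand(()) < drop_p:                       # conditioning augmentation:
        cond = torch.zeros_like(cond)                 # drop the text (CFG-style)
    t = torch.randint(0, scheduler.num_steps, (volumes.size(0),),
                      device=volumes.device)
    noise = torch.randn_like(volumes)
    noisy = scheduler.add_noise(volumes, noise, t)    # forward diffusion
    pred = unet3d(noisy, t, cond)                     # predict the added noise
    return F.mse_loss(pred, noise)
```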
Key Takeaways
- Current vision-language models struggle to generate high-quality 3D images because comparable pretrained 3D foundation models do not exist.
- The proposed framework is the first demonstration of a language-guided native-3D diffusion model applied to neurological imaging.
- It generates high-quality 3D medical images from natural-language prompts, with clinical and research applications such as personalized counterfactual explanations and disease-progression simulation.
- The framework is validated on two neurological MRI datasets, simulating varying counterfactual lesion loads in Multiple Sclerosis and cognitive states in Alzheimer's disease.
Click here to view paper screenshots





Text-to-CT Generation via 3D Latent Diffusion Model with Contrastive Vision-Language Pretraining
Authors: Daniele Molino, Camillo Maria Caruso, Filippo Ruffini, Paolo Soda, Valerio Guarrasi
Objective: While recent advances in text-conditioned generative models have enabled the synthesis of realistic medical images, progress has been largely confined to 2D modalities such as chest X-rays. Extending text-to-image generation to volumetric CT remains a significant challenge, due to its high dimensionality, anatomical complexity, and the absence of robust frameworks that align vision-language data in 3D medical imaging. Methods: We introduce a novel architecture for Text-to-CT generation that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. Our approach leverages a dual-encoder CLIP-style model trained on paired CT volumes and radiology reports to establish a shared embedding space, which serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space via a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without requiring external super-resolution stages. Results: We evaluate our method on the CT-RATE dataset and conduct a comprehensive assessment of image fidelity, clinical relevance, and semantic alignment. Our model achieves competitive performance across all tasks, significantly outperforming prior baselines for text-to-CT generation. Moreover, we demonstrate that CT scans synthesized by our framework can effectively augment real data, improving downstream diagnostic performance. Conclusion: Our results show that modality-specific vision-language alignment is a key component for high-quality 3D medical image generation. By integrating contrastive pretraining and volumetric diffusion, our method offers a scalable and controllable solution for synthesizing clinically meaningful CT volumes from text, paving the way for new applications in data augmentation, medical education, and automated clinical simulation. Code at https://github.com/cosbidev/Text2CT.
Paper and Project Links
Summary
This paper presents a Text-to-CT generation method that combines a latent diffusion model with a 3D contrastive vision-language pretraining scheme. A dual-encoder CLIP-style model, trained on paired CT volumes and radiology reports, builds a shared embedding space that serves as the conditioning input for generation. CT volumes are compressed into a low-dimensional latent space by a pretrained volumetric VAE, enabling efficient 3D denoising diffusion without external super-resolution stages. On the CT-RATE dataset the method performs strongly on image fidelity, clinical relevance, and semantic alignment, clearly outperforming prior text-to-CT baselines; moreover, its synthesized CT scans effectively augment real data and improve downstream diagnostic performance.
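The "dual-encoder CLIP-style model" that conditions the generator is typically trained with a symmetric contrastive (InfoNCE) objective over paired CT volumes and reports; a standard form is sketched below as a reference point (the paper may use a variant):

```python
import torch
import torch.nn.functional as F

def clip_style_loss(volume_emb, report_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (CT volume, report) embeddings.
    volume_emb, report_emb: (B, D) outputs of the two encoders."""
    v = F.normalize(volume_emb, dim=-1)
    r = F.normalize(report_emb, dim=-1)
    logits = v @ r.t() / temperature                  # (B, B) similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Matched pairs sit on the diagonal; contrast volumes against reports
    # and reports against volumes.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```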
Key Takeaways
- Introduces a new text-to-CT generation method that synthesizes realistic 3D medical images.
- Combines a latent diffusion model with 3D contrastive vision-language pretraining to improve efficiency and performance.
- A dual-encoder CLIP-style model builds a shared embedding space used as the generation condition, strengthening text-image correspondence.
- Compressing CT volumes into a low-dimensional latent space enables efficient 3D denoising diffusion without an external super-resolution stage.
- Evaluations show strong image fidelity, clinical relevance, and semantic alignment.
- Synthesized CT scans effectively augment real data and improve downstream diagnostic performance.
- Modality-specific vision-language alignment is a key component of high-quality 3D medical image generation.
Click here to view paper screenshots


STORK: Faster Diffusion And Flow Matching Sampling By Resolving Both Stiffness And Structure-Dependence
Authors: Zheng Tan, Weizhen Wang, Andrea L. Bertozzi, Ernest K. Ryu
Diffusion models (DMs) and flow-matching models have demonstrated remarkable performance in image and video generation. However, such models require a significant number of function evaluations (NFEs) during sampling, leading to costly inference. Consequently, quality-preserving fast sampling methods that require fewer NFEs have been an active area of research. However, prior training-free sampling methods fail to simultaneously address two key challenges: the stiffness of the ODE (i.e., the non-straightness of the velocity field) and dependence on the semi-linear structure of the DM ODE (which limits their direct applicability to flow-matching models). In this work, we introduce the Stabilized Taylor Orthogonal Runge–Kutta (STORK) method, addressing both design concerns. We demonstrate that STORK consistently improves the quality of diffusion and flow-matching sampling for image and video generation. Code is available at https://github.com/ZT220501/STORK.
Paper and Project Links
Summary
Diffusion models (DMs) and flow-matching models perform remarkably well in image and video generation, but sampling requires many function evaluations (NFEs), making inference costly. Quality-preserving fast samplers that need fewer NFEs are therefore an active research area, yet prior training-free methods fail to address two key challenges at once: the stiffness of the ODE (i.e., the non-straightness of the velocity field) and the dependence on the semi-linear structure of the DM ODE (which prevents direct application to flow-matching models). This work introduces the Stabilized Taylor Orthogonal Runge-Kutta (STORK) method, which addresses both design concerns, and shows that STORK consistently improves the quality of diffusion and flow-matching sampling for image and video generation.
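The exact STORK update goes beyond what the abstract states, but the structure it improves on is an explicit Runge-Kutta step over the probability-flow/flow-matching ODE. For orientation, here is a plain Heun (second-order RK) sampler, the kind of training-free integrator that stiffness makes inaccurate at low NFE counts; `velocity` is a placeholder for the learned velocity field:

```python
import torch

@torch.no_grad()
def heun_sampler(velocity, x, steps=10):
    """Baseline RK2 (Heun) integration of dx/dt = v(x, t) from t=0 to t=1.
    STORK replaces this kind of update with a stabilized Taylor orthogonal
    Runge-Kutta step; this sketch only shows the structure being improved."""
    ts = torch.linspace(0.0, 1.0, steps + 1)
    for i in range(steps):
        t0, t1 = float(ts[i]), float(ts[i + 1])
        h = t1 - t0
        v0 = velocity(x, t0)
        x_pred = x + h * v0                 # Euler predictor
        v1 = velocity(x_pred, t1)
        x = x + 0.5 * h * (v0 + v1)         # trapezoidal corrector
    return x
```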
Key Takeaways
- Diffusion models (DMs) and flow-matching models deliver outstanding performance in image and video generation.
- Sampling requires many function evaluations (NFEs), making inference expensive.
- Prior training-free samplers could not simultaneously resolve ODE stiffness and the dependence on the DM ODE's semi-linear structure.
- The proposed STORK method addresses both design concerns.
- STORK consistently improves the quality of diffusion and flow-matching sampling for image and video generation.
- The STORK code is publicly available on GitHub.
Click here to view paper screenshots





H3AE: High Compression, High Speed, and High Quality AutoEncoder for Video Diffusion Models
Authors: Yushu Wu, Yanyu Li, Ivan Skorokhodov, Anil Kag, Willi Menapace, Sharath Girish, Aliaksandr Siarohin, Yanzhi Wang, Sergey Tulyakov
Autoencoder (AE) is the key to the success of latent diffusion models for image and video generation, reducing the denoising resolution and improving efficiency. However, the power of AE has long been underexplored in terms of network design, compression ratio, and training strategy. In this work, we systematically examine the architecture design choices and optimize the computation distribution to obtain a series of efficient and high-compression video AEs that can decode in real time even on mobile devices. We also propose an omni-training objective to unify the design of plain Autoencoder and image-conditioned I2V VAE, achieving multifunctionality in a single VAE network but with enhanced quality. In addition, we propose a novel latent consistency loss that provides stable improvements in reconstruction quality. Latent consistency loss outperforms prior auxiliary losses including LPIPS, GAN and DWT in terms of both quality improvements and simplicity. H3AE achieves ultra-high compression ratios and real-time decoding speed on GPU and mobile, and outperforms prior arts in terms of reconstruction metrics by a large margin. We finally validate our AE by training a DiT on its latent space and demonstrate fast, high-quality text-to-video generation capability.
Paper and Project Links
PDF 17 pages, 6 figures, 9 tables
Summary
Autoencoders (AEs) are key to the success of latent diffusion models for image and video generation, reducing denoising resolution and improving efficiency. This work systematically examines architecture design choices and optimizes the computation distribution to obtain a series of efficient, high-compression video AEs that decode in real time even on mobile devices. An omni-training objective unifies the design of a plain autoencoder and an image-conditioned I2V VAE, achieving multifunctionality in a single VAE network with improved quality. A novel latent consistency loss yields stable gains in reconstruction quality, outperforming prior auxiliary losses such as LPIPS, GAN, and DWT in both quality and simplicity. H3AE reaches ultra-high compression ratios with real-time decoding on GPU and mobile, outperforms prior art on reconstruction metrics by a large margin, and is further validated by training a DiT in its latent space, demonstrating fast, high-quality text-to-video generation.
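The abstract names a "latent consistency loss" but not its formula. One plausible reading, offered purely as an illustration, is to require that a decoded reconstruction re-encode to the same latent; the weighting below is a guess:

```python
import torch
import torch.nn.functional as F

def h3ae_style_loss(encoder, decoder, x, lam=0.1):
    """Illustrative guess at a latent-consistency auxiliary term; the
    paper's actual definition may differ."""
    z = encoder(x)
    x_rec = decoder(z)
    z_rec = encoder(x_rec)                       # re-encode the reconstruction
    recon = F.l1_loss(x_rec, x)                  # standard reconstruction term
    consistency = F.mse_loss(z_rec, z.detach())  # latents should round-trip
    return recon + lam * consistency
```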
Key Takeaways
- Autoencoders (AEs) are key to latent diffusion models, reducing denoising resolution and improving the efficiency of image and video generation.
- The work optimizes video-autoencoder architecture design for efficient, high-compression processing with real-time decoding, including on mobile devices.
- An omni-training objective unifies the plain autoencoder and the image-conditioned I2V VAE, improving versatility and quality in a single network.
- A novel latent consistency loss stably improves reconstruction quality and outperforms prior auxiliary losses.
- H3AE achieves ultra-high compression ratios and real-time decoding on GPU and mobile, significantly outperforming prior art on reconstruction metrics.
- Training a DiT in its latent space validates the AE, demonstrating fast, high-quality text-to-video generation.
Click here to view paper screenshots




BlobCtrl: Taming Controllable Blob for Element-level Image Editing
Authors: Yaowei Li, Lingen Li, Zhaoyang Zhang, Xiaoyu Li, Guangzhi Wang, Hongxiang Li, Xiaodong Cun, Ying Shan, Yuexian Zou
As user expectations for image editing continue to rise, the demand for flexible, fine-grained manipulation of specific visual elements presents a challenge for current diffusion-based methods. In this work, we present BlobCtrl, a framework for element-level image editing based on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, affording fine-grained, controllable object-level manipulation. Our key contributions are twofold: (1) an in-context dual-branch diffusion model that separates foreground and background processing, incorporating blob representations to explicitly decouple layout and appearance, and (2) a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss function, along with tailored strategies to efficiently leverage blob-image pairs. To foster further research, we introduce BlobData for large-scale training and BlobBench, a benchmark for systematic evaluation. Experimental results demonstrate that BlobCtrl achieves state-of-the-art performance in a variety of element-level editing tasks, such as object addition, removal, scaling, and replacement, while maintaining computational efficiency. Project Webpage: https://liyaowei-stu.github.io/project/BlobCtrl/
Paper and Project Links
PDF Project Webpage: https://liyaowei-stu.github.io/project/BlobCtrl/ This version presents a major update with rephrased writing. Accepted to SIGGRAPH Asia 2025
Summary
As user expectations for image editing rise, flexible, fine-grained manipulation of specific visual elements challenges current diffusion-based methods. This work presents BlobCtrl, a framework for element-level image editing built on a probabilistic blob-based representation. Treating blobs as visual primitives, BlobCtrl disentangles layout from appearance, enabling fine-grained, controllable object-level manipulation. Its key contributions are an in-context dual-branch diffusion model that separates foreground from background processing and uses blob representations to explicitly decouple layout and appearance, and a self-supervised disentangle-then-reconstruct training paradigm with an identity-preserving loss plus tailored strategies for efficiently exploiting blob-image pairs. BlobData (for large-scale training) and BlobBench (for systematic evaluation) are introduced, and experiments show state-of-the-art performance on element-level editing tasks such as object addition, removal, scaling, and replacement, while remaining computationally efficient.
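To make the blob representation concrete: a blob can be thought of as a compact tuple (position, scale, appearance vector) that is rasterized into a spatial layout map before entering the diffusion branches. The sketch below uses isotropic Gaussians for simplicity; the paper's probabilistic blobs may carry orientation and anisotropy, so treat this as an assumption:

```python
import torch

def splat_blobs(centers, scales, feats, size):
    """Rasterize blobs into a (C, H, W) layout/appearance map.
    centers: (N, 2) in [0, 1]; scales: (N,); feats: (N, C)."""
    H, W = size
    ys = torch.linspace(0.0, 1.0, H)
    xs = torch.linspace(0.0, 1.0, W)
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")               # (H, W) grids
    grid = torch.stack([yy, xx], dim=-1)                         # (H, W, 2)
    d2 = ((grid[None] - centers[:, None, None]) ** 2).sum(-1)    # (N, H, W)
    alpha = torch.exp(-d2 / (2.0 * scales[:, None, None] ** 2))  # soft masks
    return torch.einsum("nhw,nc->chw", alpha, feats)             # feature splat
```

Keeping layout (the soft masks) separate from appearance (the feature vectors) is what makes object-level edits like moving or scaling a blob straightforward.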
Key Takeaways
- Rising user expectations make flexible, fine-grained manipulation of specific visual elements a challenge for image editing.
- BlobCtrl performs element-level image editing based on a probabilistic blob representation.
- BlobCtrl disentangles layout from appearance, enabling fine-grained, controllable object-level manipulation.
- Key contributions are an in-context dual-branch diffusion model and a self-supervised disentangle-then-reconstruct training paradigm.
- BlobData is introduced for large-scale training and BlobBench as a benchmark for systematic evaluation.
- BlobCtrl achieves state-of-the-art performance on element-level editing tasks.
Click here to view paper screenshots




Diffusion Model as a Noise-Aware Latent Reward Model for Step-Level Preference Optimization
Authors: Tao Zhang, Cheng Da, Kun Ding, Huan Yang, Kun Jin, Yan Li, Tingting Gao, Di Zhang, Shiming Xiang, Chunhong Pan
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods typically use Vision-Language Models (VLMs) as pixel-level reward models to approximate human preferences. However, when used for step-level preference optimization, these models face challenges in handling noisy images of different timesteps and require complex transformations into pixel space. In this work, we show that pre-trained diffusion models are naturally suited for step-level reward modeling in the noisy latent space, as they are explicitly designed to process latent images at various noise levels. Accordingly, we propose the Latent Reward Model (LRM), which repurposes components of the diffusion model to predict preferences of latent images at arbitrary timesteps. Building on LRM, we introduce Latent Preference Optimization (LPO), a step-level preference optimization method conducted directly in the noisy latent space. Experimental results indicate that LPO significantly improves the model’s alignment with general, aesthetic, and text-image alignment preferences, while achieving a 2.5-28x training speedup over existing preference optimization methods. Our code and models are available at https://github.com/Kwai-Kolors/LPO.
Paper and Project Links
PDF NeurIPS 2025
Summary
Preference optimization for diffusion models aims to align them with human preferences for images. Previous methods use Vision-Language Models (VLMs) as pixel-level reward models, which struggle with noisy images at different timesteps when used for step-level preference optimization. This work shows that pre-trained diffusion models are naturally suited to step-level reward modeling in the noisy latent space and proposes the Latent Reward Model (LRM), which repurposes diffusion-model components to predict preferences over latent images at arbitrary timesteps. Built on LRM, Latent Preference Optimization (LPO) performs step-level preference optimization directly in the noisy latent space. Experiments show that LPO markedly improves alignment with general, aesthetic, and text-image alignment preferences while training 2.5-28x faster than existing preference-optimization methods.
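Given a latent reward model, a step-level preference objective is naturally a Bradley-Terry loss on preferred-vs-rejected noisy latents at the same timestep. The sketch below assumes an `lrm(z, t, text_emb) -> (B,)` scoring interface, which is an illustrative stand-in rather than the paper's API:

```python
import torch
import torch.nn.functional as F

def step_preference_loss(lrm, z_win, z_lose, t, text_emb):
    """Bradley-Terry preference loss in the noisy latent space: the
    preferred latent z_win should score above the rejected z_lose."""
    r_w = lrm(z_win, t, text_emb)    # (B,) reward for the preferred sample
    r_l = lrm(z_lose, t, text_emb)   # (B,) reward for the rejected sample
    return -F.logsigmoid(r_w - r_l).mean()
```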
Key Takeaways
- Preference optimization for diffusion models aims to align image generation with human preferences.
- Prior methods use Vision-Language Models (VLMs) as pixel-level reward models, which struggle with noisy images at different timesteps.
- Pre-trained diffusion models are naturally suited to step-level reward modeling in the noisy latent space.
- The Latent Reward Model (LRM) predicts preferences over latent images at arbitrary timesteps.
- Built on LRM, Latent Preference Optimization (LPO) performs step-level preference optimization directly in the noisy latent space.
- Experiments show LPO significantly improves preference alignment while delivering a 2.5-28x training speedup.
Click here to view paper screenshots






Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner
Authors: Mengfei Xia, Yujun Shen, Changsong Lei, Yu Zhou, Ran Yi, Deli Zhao, Wenping Wang, Yong-Jin Liu
A diffusion model, which is formulated to produce an image using thousands of denoising steps, usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integral process, we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue, we propose a **timestep tuner** that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically, at each denoising step, we replace the original parameterization by conditioning the network on a new timestep, enforcing the sampling distribution towards the real one. Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods, especially when there are few denoising steps. For example, when using 10 denoising steps on LSUN Bedroom dataset, we improve the FID of DDIM from 9.65 to 6.07, simply by adopting our method for a more appropriate set of timesteps. Code is available at https://github.com/THU-LYJ-Lab/time-tuner.
Paper and Project Links
Summary
This paper proposes a timestep tuner for diffusion models, addressing the quality degradation that existing acceleration algorithms incur by skipping most denoising steps. Viewing generation as a discretized integral process, the authors argue the quality drop partly stems from applying an inaccurate integral direction over each timestep interval. At every denoising step, the tuner replaces the original parameterization by conditioning the network on a new timestep, pushing the sampling distribution toward the real one. Experiments show the plug-in can be trained efficiently and boosts the inference performance of various state-of-the-art acceleration methods, especially with few denoising steps: with 10 steps on LSUN Bedroom, it improves DDIM's FID from 9.65 to 6.07. The code is publicly available.
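The plug-in idea, conditioning the frozen denoiser on a learned replacement timestep for each sampler step, can be sketched as a tiny learnable table of offsets. The parameterization here is a guess from the abstract, not the paper's implementation:

```python
import torch
import torch.nn as nn

class TimestepTuner(nn.Module):
    """One learnable timestep shift per sampler step (illustrative guess)."""
    def __init__(self, nominal_ts):
        super().__init__()
        self.register_buffer(
            "nominal", torch.as_tensor(nominal_ts, dtype=torch.float32))
        self.offset = nn.Parameter(torch.zeros(len(nominal_ts)))

    def tuned_t(self, step: int) -> torch.Tensor:
        # Tuned timestep the frozen denoiser is conditioned on at this step.
        return (self.nominal[step] + self.offset[step]).clamp(min=0.0)

def denoise_step(denoiser, x, step, tuner):
    t = tuner.tuned_t(step).expand(x.size(0))   # broadcast over the batch
    return denoiser(x, t)                       # frozen base denoiser
```

Because only the offsets are trained while the base model stays frozen, such a tuner is cheap to fit and composes with other accelerated samplers, consistent with the plug-in behavior the paper reports.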
Key Takeaways
- Diffusion models typically generate images through thousands of denoising steps, making inference slow.
- Existing acceleration algorithms simplify sampling by skipping most steps, at a considerable cost in performance.
- Viewing generation as a discretized integral process, the paper proposes a timestep tuner to correct inaccurate integral directions over timestep intervals.
- At each denoising step, the tuner conditions the network on a new timestep, pushing the sampling distribution closer to the real one.
- Experiments show the plug-in trains efficiently and boosts the inference performance of various state-of-the-art acceleration methods.
- The method is especially effective when few denoising steps are used.
Click here to view paper screenshots



