Published: 2025-10-02

Updated: 2025-11-27

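⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on them in serious academic settings; they are suitable only for an initial screening before reading a paper.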

Papers updated on 2025-10-02

Query-Kontext: An Unified Multimodal Model for Image Generation and Editing

Authors:Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, Jingdong Wang

Unified Multimodal Models (UMMs) have demonstrated remarkable performance in text-to-image generation (T2I) and editing (TI2I), whether instantiated as assembled unified frameworks which couple powerful vision-language model (VLM) with diffusion-based generator, or as naive Unified Multimodal Models with an early fusion of understanding and generation modalities. We contend that in current unified frameworks, the crucial capability of multimodal generative reasoning which encompasses instruction understanding, grounding, and image referring for identity preservation and faithful reconstruction, is intrinsically entangled with high-fidelity synthesis. In this work, we introduce Query-Kontext, a novel approach that bridges the VLM and diffusion model via a multimodal "kontext" composed of semantic cues and coarse-grained image conditions encoded from multimodal inputs. This design delegates the complex ability of multimodal generative reasoning to powerful VLM while reserving diffusion model’s role for high-quality visual synthesis. To achieve this, we propose a three-stage progressive training strategy. First, we connect the VLM to a lightweight diffusion head via multimodal kontext tokens to unleash the VLM’s generative reasoning ability. Second, we scale this head to a large, pre-trained diffusion model to enhance visual detail and realism. Finally, we introduce a low-level image encoder to improve image fidelity and perform instruction tuning on downstream tasks. Furthermore, we build a comprehensive data pipeline integrating real, synthetic, and open-source datasets, covering diverse multimodal reference-to-image scenarios, including image generation, instruction-driven editing, customized generation, and multi-subject composition. Experiments show that our approach matches strong unified baselines and even outperforms task-specific state-of-the-art methods in several cases.

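Paper and project links

PDF 23 pages, 10 figures

Summary

This paper introduces Query-Kontext, a unified multimodal model for text-to-image generation (T2I) and editing (TI2I). It bridges a vision-language model (VLM) and a diffusion model through a multimodal "kontext" made up of semantic cues and coarse-grained image conditions encoded from the multimodal inputs. Multimodal generative reasoning is delegated to the VLM, while the diffusion model is reserved for high-quality visual synthesis. Training follows a three-stage progressive strategy. Experiments show that the approach matches strong unified baselines and, in several cases, outperforms task-specific state-of-the-art methods.

Key Takeaways

Unified Multimodal Models (UMMs) perform strongly on text-to-image generation and editing.
Query-Kontext bridges the VLM and the diffusion model through a multimodal kontext.
The kontext consists of semantic cues and coarse-grained image conditions encoded from the multimodal inputs.
The three-stage training strategy first connects the VLM to a lightweight diffusion head, then scales that head to a large pre-trained diffusion model, and finally adds a low-level image encoder together with instruction tuning.
The method matches strong unified baselines and in several cases outperforms task-specific state-of-the-art methods.
A data pipeline that integrates real, synthetic, and open-source datasets covers image generation, instruction-driven editing, customized generation, and multi-subject composition.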

MotionRAG: Motion Retrieval-Augmented Image-to-Video Generation

Authors:Chenhui Zhu, Yilu Wu, Shuai Wang, Gangshan Wu, Limin Wang

Image-to-video generation has made remarkable progress with the advancements in diffusion models, yet generating videos with realistic motion remains highly challenging. This difficulty arises from the complexity of accurately modeling motion, which involves capturing physical constraints, object interactions, and domain-specific dynamics that are not easily generalized across diverse scenarios. To address this, we propose MotionRAG, a retrieval-augmented framework that enhances motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). The key technical innovations include: (i) a retrieval-based pipeline extracting high-level motion features using video encoder and specialized resamplers to distill semantic motion representations; (ii) an in-context learning approach for motion adaptation implemented through a causal transformer architecture; (iii) an attention-based motion injection adapter that seamlessly integrates transferred motion features into pretrained video diffusion models. Extensive experiments demonstrate that our method achieves significant improvements across multiple domains and various base models, all with negligible computational overhead during inference. Furthermore, our modular design enables zero-shot generalization to new domains by simply updating the retrieval database without retraining any components. This research enhances the core capability of video generation systems by enabling the effective retrieval and transfer of motion priors, facilitating the synthesis of realistic motion dynamics.

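Paper and project links

PDF

Summary

Diffusion models have driven rapid progress in image-to-video generation, but producing videos with realistic motion remains difficult. MotionRAG is a retrieval-augmented framework that improves motion realism by adapting motion priors from relevant reference videos through Context-Aware Motion Adaptation (CAMA). It has three components: a retrieval-based pipeline that extracts high-level motion features with a video encoder and specialized resamplers, an in-context learning approach to motion adaptation built on a causal transformer, and an attention-based adapter that injects the transferred motion features into pretrained video diffusion models. Experiments show significant improvements across multiple domains and base models with negligible computational overhead at inference. Because the design is modular, updating the retrieval database is enough to generalize zero-shot to new domains without retraining any component.

Key Takeaways

Diffusion models have advanced image-to-video generation, but realistic motion remains hard to generate.
MotionRAG improves motion realism by adapting motion priors retrieved from relevant reference videos through CAMA.
The framework comprises a retrieval-based pipeline, an in-context learning approach built on a causal transformer, and an attention-based motion injection adapter.
Experiments show significant improvements across multiple domains and base models with negligible computational overhead at inference.
The modular design generalizes zero-shot to new domains by updating the retrieval database, with no component retrained.
Retrieving and transferring motion priors strengthens the ability of video generation systems to synthesize realistic motion dynamics.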
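
The retrieval step described in this entry can be illustrated with a minimal sketch. It assumes each reference video has already been reduced to a fixed-size motion embedding and ranks the references by cosine similarity to a query embedding. The function name `retrieve_top_k`, the 4-dimensional vectors, and the use of cosine similarity are illustrative assumptions; the abstract does not state which similarity measure or embedding size MotionRAG uses.

```python
import numpy as np

def retrieve_top_k(query, database, k=2):
    """Return indices and scores of the k rows of `database` closest to `query`.

    Similarity is cosine similarity (an assumption made for this sketch).
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    scores = db @ q                      # cosine similarity per reference video
    order = np.argsort(-scores)[:k]      # highest similarity first
    return order, scores[order]

# Hypothetical motion embeddings for three reference videos.
database = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.0, 0.1, 0.0])

indices, scores = retrieve_top_k(query, database, k=2)
print(indices)  # [0 1]
```

Under this sketch, extending to a new domain amounts to appending rows to `database`, which mirrors the abstract's point that updating the retrieval database requires no retraining.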

Data-to-Energy Stochastic Dynamics

Authors:Kirill Tamogashev, Nikolay Malkin

The Schrödinger bridge problem is concerned with finding a stochastic dynamical system bridging two marginal distributions that minimises a certain transportation cost. This problem, which represents a generalisation of optimal transport to the stochastic case, has received attention due to its connections to diffusion models and flow matching, as well as its applications in the natural sciences. However, all existing algorithms allow to infer such dynamics only for cases where samples from both distributions are available. In this paper, we propose the first general method for modelling Schrödinger bridges when one (or both) distributions are given by their unnormalised densities, with no access to data samples. Our algorithm relies on a generalisation of the iterative proportional fitting (IPF) procedure to the data-free case, inspired by recent developments in off-policy reinforcement learning for training of diffusion samplers. We demonstrate the efficacy of the proposed data-to-energy IPF on synthetic problems, finding that it can successfully learn transports between multimodal distributions. As a secondary consequence of our reinforcement learning formulation, which assumes a fixed time discretisation scheme for the dynamics, we find that existing data-to-data Schrödinger bridge algorithms can be substantially improved by learning the diffusion coefficient of the dynamics. Finally, we apply the newly developed algorithm to the problem of sampling posterior distributions in latent spaces of generative models, thus creating a data-free image-to-image translation method. Code: https://github.com/mmacosha/d2e-stochastic-dynamics

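Paper and project links

PDF

Summary

This paper addresses Schrödinger bridge modelling when data samples are unavailable. For the case where one or both distributions are given only by their unnormalised densities, it proposes an algorithm based on a generalisation of iterative proportional fitting (IPF) to the data-free setting. On synthetic problems the method learns transports between multimodal distributions. A by-product of its reinforcement learning formulation is that existing data-to-data Schrödinger bridge algorithms improve substantially when the diffusion coefficient of the dynamics is learned. The algorithm is also applied to sampling posterior distributions in the latent spaces of generative models, which yields a data-free image-to-image translation method.

Key Takeaways

The Schrödinger bridge problem seeks a stochastic dynamical system that bridges two marginal distributions while minimising a transportation cost. It generalises optimal transport to the stochastic case, is connected to diffusion models and flow matching, and has applications in the natural sciences.
Existing algorithms can infer such dynamics only when samples from both distributions are available. This paper handles the case where one or both distributions are given by unnormalised densities with no data samples.
The proposed algorithm generalises IPF to the data-free case and is inspired by off-policy reinforcement learning for training diffusion samplers. On synthetic problems it learns transports between multimodal distributions.
The reinforcement learning formulation assumes a fixed time discretisation of the dynamics; within it, learning the diffusion coefficient substantially improves existing data-to-data Schrödinger bridge algorithms.
Applied to posterior sampling in the latent spaces of generative models, the algorithm gives a data-free image-to-image translation method.
The work draws on optimal transport, diffusion models, and reinforcement learning to treat Schrödinger bridges when samples are unavailable.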
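
For readers unfamiliar with iterative proportional fitting, the sketch below shows the classical procedure on a small discrete problem where both marginals are known exactly. It is not the paper's data-to-energy algorithm, which works with continuous dynamics and unnormalised densities; the three-point grid, the Gaussian reference kernel, and the marginals `mu` and `nu` are assumptions chosen only to make the alternating projections visible.

```python
import numpy as np

def ipf(reference, mu, nu, n_iters=200):
    """Classical iterative proportional fitting on a discrete coupling.

    Alternately rescales the rows and columns of `reference` until the
    coupling has row marginal `mu` and column marginal `nu`.
    """
    coupling = reference.copy()
    for _ in range(n_iters):
        coupling *= (mu / coupling.sum(axis=1))[:, None]  # match row marginal
        coupling *= (nu / coupling.sum(axis=0))[None, :]  # match column marginal
    return coupling

# Toy reference process: a Gaussian kernel between three source and target points.
x = np.array([0.0, 1.0, 2.0])
reference = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2)
mu = np.array([0.5, 0.3, 0.2])   # source marginal (assumed)
nu = np.array([0.2, 0.3, 0.5])   # target marginal (assumed)

coupling = ipf(reference, mu, nu)
print(coupling.sum(axis=1), coupling.sum(axis=0))
```

Each half-step projects onto the set of couplings with one correct marginal. The data-free setting in the abstract is harder because those marginals cannot be estimated from samples.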

EchoGen: Generating Visual Echoes in Any Scene via Feed-Forward Subject-Driven Auto-Regressive Model

Authors:Ruixiao Dong, Zhendong Wang, Keli Liu, Li Li, Ying Chen, Kai Li, Daowen Li, Houqiang Li

Subject-driven generation is a critical task in creative AI; yet current state-of-the-art methods present a stark trade-off. They either rely on computationally expensive, per-subject fine-tuning, sacrificing efficiency and zero-shot capability, or employ feed-forward architectures built on diffusion models, which are inherently plagued by slow inference speeds. Visual Auto-Regressive (VAR) models are renowned for their rapid sampling speeds and strong generative quality, making them an ideal yet underexplored foundation for resolving this tension. To bridge this gap, we introduce EchoGen, a pioneering framework that empowers VAR models with subject-driven generation capabilities. The core design of EchoGen is an effective dual-path injection strategy that disentangles a subject’s high-level semantic identity from its low-level fine-grained details, enabling enhanced controllability and fidelity. We employ a semantic encoder to extract the subject’s abstract identity, which is injected through decoupled cross-attention to guide the overall composition. Concurrently, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to ensure high-fidelity texture and structural preservation. To the best of our knowledge, EchoGen is the first feed-forward subject-driven framework built upon VAR models. Both quantitative and qualitative results substantiate our design, demonstrating that EchoGen achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency. Code and models will be released soon.

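Paper and project links

PDF

Summary

This paper introduces EchoGen, a framework that gives Visual Auto-Regressive (VAR) models subject-driven generation capabilities while keeping their fast sampling. Its core design is a dual-path injection strategy that disentangles a subject's high-level semantic identity from its low-level fine-grained details, improving controllability and fidelity. A semantic encoder extracts the subject's abstract identity, which is injected through decoupled cross-attention to guide the overall composition. In parallel, a content encoder captures intricate visual details, which are integrated via a multi-modal attention mechanism to preserve texture and structure. To the authors' knowledge, EchoGen is the first feed-forward subject-driven framework built on VAR models, and it achieves subject fidelity and image quality comparable to state-of-the-art diffusion-based methods with significantly lower sampling latency.

Key Takeaways

EchoGen equips VAR models with subject-driven generation while retaining fast sampling and strong generative quality.
A dual-path injection strategy separates the subject's high-level semantics from its low-level details, improving controllability and fidelity.
A semantic encoder extracts the subject's abstract identity, which guides the overall composition through decoupled cross-attention.
A content encoder captures intricate visual details, which a multi-modal attention mechanism integrates to preserve texture and structure.
To the authors' knowledge, EchoGen is the first feed-forward subject-driven framework built on VAR models.
Its subject fidelity and image quality are comparable to state-of-the-art diffusion-based methods, at significantly lower sampling latency.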
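
One common way to realise "decoupled cross-attention" is to give the subject features their own key/value branch and add its output to the text branch. The sketch below follows that pattern under stated assumptions: the additive combination, the `subject_scale` weight, and all tensor shapes are illustrative choices, and the abstract does not confirm that EchoGen uses exactly this form.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def decoupled_cross_attention(q, text_kv, subject_kv, subject_scale=1.0):
    """Attend to text and subject tokens in separate branches, then sum."""
    text_out = attention(q, *text_kv)
    subject_out = attention(q, *subject_kv)
    return text_out + subject_scale * subject_out

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))                                      # 5 image tokens
text_kv = (rng.normal(size=(7, 8)), rng.normal(size=(7, 8)))     # 7 text tokens
subject_kv = (rng.normal(size=(3, 8)), rng.normal(size=(3, 8)))  # 3 subject tokens

out = decoupled_cross_attention(q, text_kv, subject_kv, subject_scale=0.5)
print(out.shape)  # (5, 8)
```

Keeping the branches separate means the subject's influence can be scaled, or switched off with `subject_scale=0.0`, without touching the text pathway.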

EVODiff: Entropy-aware Variance Optimized Diffusion Inference

Authors:Shigui Li, Wei Chen, Delu Zeng

Diffusion models (DMs) excel in image generation, but suffer from slow inference and the training-inference discrepancies. Although gradient-based solvers like DPM-Solver accelerate the denoising inference, they lack theoretical foundations in information transmission efficiency. In this work, we introduce an information-theoretic perspective on the inference processes of DMs, revealing that successful denoising fundamentally reduces conditional entropy in reverse transitions. This principle leads to our key insights into the inference processes: (1) data prediction parameterization outperforms its noise counterpart, and (2) optimizing conditional variance offers a reference-free way to minimize both transition and reconstruction errors. Based on these insights, we propose an entropy-aware variance optimized method for the generative process of DMs, called EVODiff, which systematically reduces uncertainty by optimizing conditional entropy during denoising. Extensive experiments on DMs validate our insights and demonstrate that our method significantly and consistently outperforms state-of-the-art (SOTA) gradient-based solvers. For example, compared to the DPM-Solver++, EVODiff reduces the reconstruction error by up to 45.5% (FID improves from 5.10 to 2.78) at 10 function evaluations (NFE) on CIFAR-10, cuts the NFE cost by 25% (from 20 to 15 NFE) for high-quality samples on ImageNet-256, and improves text-to-image generation while reducing artifacts. Code is available at https://github.com/ShiguiLi/EVODiff.

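Paper and project links

PDF NeurIPS 2025, 40 pages, 14 figures

Summary

Diffusion models (DMs) excel at image generation but suffer from slow inference and training-inference discrepancies. Gradient-based solvers such as DPM-Solver accelerate denoising inference, yet they lack a theoretical foundation in information transmission efficiency. This paper takes an information-theoretic view of DM inference and shows that successful denoising fundamentally reduces conditional entropy in reverse transitions. Two insights follow: (1) data prediction parameterization outperforms noise parameterization, and (2) optimizing conditional variance is a reference-free way to minimize both transition and reconstruction errors. Building on them, the authors propose EVODiff, an entropy-aware variance optimized method that systematically reduces uncertainty by optimizing conditional entropy during denoising. Compared with DPM-Solver++, EVODiff reduces reconstruction error by up to 45.5% on CIFAR-10 (FID improves from 5.10 to 2.78 at 10 NFE), cuts the NFE cost for high-quality samples on ImageNet-256 by 25% (from 20 to 15 NFE), and improves text-to-image generation while reducing artifacts.

Key Takeaways

Diffusion models excel at image generation but suffer from slow inference and training-inference discrepancies.
Gradient-based solvers such as DPM-Solver accelerate denoising inference but lack a theoretical foundation in information transmission efficiency.
Successful denoising reduces conditional entropy in reverse transitions, which is the principle behind the paper's analysis of inference.
Data prediction parameterization outperforms noise parameterization.
Optimizing conditional variance is a reference-free way to minimize both transition and reconstruction errors.
EVODiff systematically reduces uncertainty by optimizing conditional entropy during denoising and consistently outperforms state-of-the-art gradient-based solvers.
Reported gains include lower reconstruction error on CIFAR-10, fewer function evaluations on ImageNet-256, and better text-to-image generation with fewer artifacts.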
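
The percentages quoted for EVODiff can be checked directly from the figures in the abstract. The short sketch below recomputes them as relative reductions; `relative_reduction` is a helper introduced here for illustration, and lower FID is better.

```python
def relative_reduction(before, after):
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before

# FID on CIFAR-10 at 10 NFE: DPM-Solver++ 5.10 versus EVODiff 2.78.
fid_reduction = relative_reduction(5.10, 2.78)
# Function evaluations for high-quality ImageNet-256 samples: 20 versus 15.
nfe_reduction = relative_reduction(20, 15)

print(f"{fid_reduction:.1%}")  # 45.5%
print(f"{nfe_reduction:.0%}")  # 25%
```

Both results agree with the abstract: the 45.5% figure is the relative drop in FID, and the 25% figure is the relative drop in NFE.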

Training-Free Reward-Guided Image Editing via Trajectory Optimal Control

Authors:Jinho Chang, Jaemin Kim, Jong Chul Ye

Recent advancements in diffusion and flow-matching models have demonstrated remarkable capabilities in high-fidelity image synthesis. A prominent line of research involves reward-guided guidance, which steers the generation process during inference to align with specific objectives. However, leveraging this reward-guided approach to the task of image editing, which requires preserving the semantic content of the source image while enhancing a target reward, is largely unexplored. In this work, we introduce a novel framework for training-free, reward-guided image editing. We formulate the editing process as a trajectory optimal control problem where the reverse process of a diffusion model is treated as a controllable trajectory originating from the source image, and the adjoint states are iteratively updated to steer the editing process. Through extensive experiments across distinct editing tasks, we demonstrate that our approach significantly outperforms existing inversion-based training-free guidance baselines, achieving a superior balance between reward maximization and fidelity to the source image without reward hacking.

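Paper and project links

PDF 18 pages, 5 figures

Summary

This paper introduces a training-free framework for reward-guided image editing with diffusion and flow-matching models. Editing is formulated as a trajectory optimal control problem: the reverse process of a diffusion model is treated as a controllable trajectory that starts from the source image, and the adjoint states are updated iteratively to steer the edit. Across distinct editing tasks, the method significantly outperforms existing inversion-based training-free guidance baselines, striking a better balance between reward maximization and fidelity to the source image without reward hacking.

Key Takeaways

Reward-guided guidance steers the generation process of diffusion and flow-matching models at inference time toward specific objectives.
Applying reward guidance to image editing, which must preserve the source image's semantic content while raising a target reward, is largely unexplored.
The paper proposes a training-free, reward-guided image editing framework.
Editing is cast as a trajectory optimal control problem in which the reverse diffusion process is a controllable trajectory originating from the source image.
Adjoint states are updated iteratively to steer the editing process.
Experiments across distinct editing tasks show that the method significantly outperforms existing inversion-based training-free guidance baselines, without reward hacking.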

Editable Noise Map Inversion: Encoding Target-image into Noise For High-Fidelity Image Manipulation

Authors:Mingyu Kang, Yong Suk Choi

Text-to-image diffusion models have achieved remarkable success in generating high-quality and diverse images. Building on these advancements, diffusion models have also demonstrated exceptional performance in text-guided image editing. A key strategy for effective image editing involves inverting the source image into editable noise maps associated with the target image. However, previous inversion methods face challenges in adhering closely to the target text prompt. The limitation arises because inverted noise maps, while enabling faithful reconstruction of the source image, restrict the flexibility needed for desired edits. To overcome this issue, we propose Editable Noise Map Inversion (ENM Inversion), a novel inversion technique that searches for optimal noise maps to ensure both content preservation and editability. We analyze the properties of noise maps for enhanced editability. Based on this analysis, our method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Extensive experiments demonstrate that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity with target prompts. Our approach can also be easily applied to video editing, enabling temporal consistency and content manipulation across frames.

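Paper and project links

PDF ICML 2025

Summary

Text-to-image diffusion models generate high-quality, diverse images and also perform well in text-guided image editing. A key editing strategy inverts the source image into editable noise maps associated with the target image, but previous inversion methods struggle to adhere closely to the target text prompt: the inverted noise maps reconstruct the source image faithfully yet restrict the flexibility that the desired edits need. The paper proposes Editable Noise Map Inversion (ENM Inversion), which searches for optimal noise maps that ensure both content preservation and editability. After analyzing which properties of noise maps improve editability, the method introduces an editable noise refinement that aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps. Experiments show that ENM Inversion outperforms existing approaches across a wide range of image editing tasks in both preservation and edit fidelity, and it can also be applied to video editing with temporal consistency across frames.

Key Takeaways

Diffusion models perform well in text-guided image editing.
A key editing strategy inverts the source image into editable noise maps associated with the target image.
Previous inversion methods struggle to adhere closely to the target text prompt.
ENM Inversion searches for optimal noise maps that ensure both content preservation and editability.
The method is grounded in an analysis of the noise-map properties that improve editability.
An editable noise refinement aligns with the desired edits by minimizing the difference between the reconstructed and edited noise maps.
Experiments show that ENM Inversion outperforms existing approaches across a wide range of image editing tasks, and the approach also extends to video editing.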

Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs

Authors:Jia Jun Cheng Xian, Muchen Li, Haotian Yang, Xin Tao, Pengfei Wan, Leonid Sigal, Renjie Liao

Recent advances in diffusion-based text-to-image (T2I) models have led to remarkable success in generating high-quality images from textual prompts. However, ensuring accurate alignment between the text and the generated image remains a significant challenge for state-of-the-art diffusion models. To address this, existing studies employ reinforcement learning with human feedback (RLHF) to align T2I outputs with human preferences. These methods, however, either rely directly on paired image preference data or require a learned reward function, both of which depend heavily on costly, high-quality human annotations and thus face scalability limitations. In this work, we introduce Text Preference Optimization (TPO), a framework that enables “free-lunch” alignment of T2I models, achieving alignment without the need for paired image preference data. TPO works by training the model to prefer matched prompts over mismatched prompts, which are constructed by perturbing original captions using a large language model. Our framework is general and compatible with existing preference-based algorithms. We extend both DPO and KTO to our setting, resulting in TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that our methods consistently outperform their original counterparts, delivering better human preference scores and improved text-to-image alignment. Our Open-source code is available at https://github.com/DSL-Lab/T2I-Free-Lunch-Alignment.

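Paper and project links

PDF

Summary

Diffusion-based text-to-image (T2I) models generate high-quality images, but accurate alignment between the text and the generated image remains a challenge. Existing work aligns T2I outputs with human preferences through reinforcement learning with human feedback (RLHF), which relies either on paired image preference data or on a learned reward function; both depend on costly, high-quality human annotation and therefore scale poorly. This paper proposes Text Preference Optimization (TPO), a framework for "free-lunch" alignment that needs no paired image preference data. TPO trains the model to prefer matched prompts over mismatched prompts, where the mismatched prompts are built by perturbing the original captions with a large language model. The framework is general and compatible with existing preference-based algorithms: extending DPO and KTO yields TDPO and TKTO. Quantitative and qualitative evaluations across multiple benchmarks show that these methods consistently outperform their original counterparts, with better human preference scores and improved text-to-image alignment.

Key Takeaways

Diffusion models have advanced text-to-image generation, but accurate alignment between text and generated image remains a challenge.
Existing RLHF-based alignment depends on paired image preference data or a learned reward function, both of which require costly human annotation and limit scalability.
TPO achieves "free-lunch" alignment of T2I models without paired image preference data.
TPO trains the model to prefer matched prompts over mismatched prompts; the mismatched prompts are built by perturbing the original captions with a large language model.
The framework is compatible with existing preference-based algorithms, and extending DPO and KTO gives TDPO and TKTO.
Across multiple benchmarks, TDPO and TKTO consistently outperform their original counterparts in human preference scores and text-to-image alignment.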
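
The idea of preferring a matched prompt over a mismatched one can be sketched with the standard DPO objective, with the preference pair formed over prompts rather than images. This is a hedged illustration, not the paper's exact TDPO loss: the scalar log-likelihood inputs, the function name `text_preference_loss`, and `beta=0.1` are assumptions, and diffusion models in practice substitute a denoising-error proxy for these log-likelihoods.

```python
import math

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def text_preference_loss(logp_matched, logp_mismatched,
                         ref_logp_matched, ref_logp_mismatched, beta=0.1):
    """DPO-style loss for one image scored under two prompts.

    `logp_*` come from the model being trained and `ref_logp_*` from a frozen
    reference model; all four are placeholder scalars in this sketch.
    """
    margin = ((logp_matched - ref_logp_matched)
              - (logp_mismatched - ref_logp_mismatched))
    return softplus(-beta * margin)  # equals -log(sigmoid(beta * margin))

# Before training the model equals the reference, so the margin is zero.
loss_at_init = text_preference_loss(-10.0, -10.0, -10.0, -10.0)
# After training the image is more likely under the matched prompt.
loss_after = text_preference_loss(-8.0, -12.0, -10.0, -10.0)
print(loss_at_init, loss_after)
```

The loss starts at log 2 and falls as the model assigns relatively more likelihood to the image under its matched prompt, which is the behaviour the abstract describes.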

ART-VITON: Measurement-Guided Latent Diffusion for Artifact-Free Virtual Try-On

Authors:Junseo Park, Hyeryung Jang

Virtual try-on (VITON) aims to generate realistic images of a person wearing a target garment, requiring precise garment alignment in try-on regions and faithful preservation of identity and background in non-try-on regions. While latent diffusion models (LDMs) have advanced alignment and detail synthesis, preserving non-try-on regions remains challenging. A common post-hoc strategy directly replaces these regions with original content, but abrupt transitions often produce boundary artifacts. To overcome this, we reformulate VITON as a linear inverse problem and adopt trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. However, existing solvers still suffer from semantic drift during generation, leading to artifacts. We propose ART-VITON, a measurement-guided diffusion framework that ensures measurement adherence while maintaining artifact-free synthesis. Our method integrates residual prior-based initialization to mitigate training-inference mismatch and artifact-free measurement-guided sampling that combines data consistency, frequency-level correction, and periodic standard denoising. Experiments on VITON-HD, DressCode, and SHHQ-1.0 demonstrate that ART-VITON effectively preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.

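Paper and project links

PDF 21 pages

Summary

Virtual try-on (VITON) with latent diffusion models (LDMs) has improved garment alignment and detail synthesis, but preserving identity and background in non-try-on regions remains challenging. This paper reformulates VITON as a linear inverse problem and proposes ART-VITON, a measurement-guided diffusion framework that enforces measurement adherence while keeping synthesis artifact-free. It combines residual prior-based initialization, which mitigates training-inference mismatch, with artifact-free measurement-guided sampling. Experiments on VITON-HD, DressCode, and SHHQ-1.0 show that ART-VITON preserves identity and background, eliminates boundary artifacts, and consistently improves visual fidelity and robustness over state-of-the-art baselines.

Key Takeaways

VITON aims to generate realistic images of a person wearing a target garment, which requires precise garment alignment in try-on regions and faithful preservation of identity and background elsewhere.
LDMs have advanced alignment and detail synthesis, but preserving non-try-on regions remains challenging; the common post-hoc strategy of replacing those regions with the original content produces boundary artifacts.
The paper reformulates VITON as a linear inverse problem and adopts trajectory-aligned solvers that progressively enforce measurement consistency, reducing abrupt changes in non-try-on regions. Existing solvers still suffer from semantic drift during generation, which leads to artifacts.
ART-VITON ensures measurement adherence with artifact-free synthesis by combining residual prior-based initialization with measurement-guided sampling that integrates data consistency, frequency-level correction, and periodic standard denoising.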
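
Treating preservation of the non-try-on region as a linear inverse problem can be sketched with a binary mask as the measurement operator. The code below contrasts the post-hoc hard replacement mentioned in the abstract with a single gradient step toward measurement consistency. The 2x2 arrays, the step size, and the function names are illustrative assumptions, and the sketch omits the frequency-level correction and periodic denoising that ART-VITON adds.

```python
import numpy as np

def hard_replace(x, y, mask):
    """Post-hoc strategy: overwrite the measured region with the measurement."""
    return mask * y + (1 - mask) * x

def consistency_step(x, y, mask, step=0.5):
    """One gradient step on 0.5 * ||mask * x - y||^2 with respect to x."""
    return x - step * mask * (mask * x - y)

source = np.array([[0.2, 0.4], [0.6, 0.8]])  # stand-in for the original image
mask = np.array([[1.0, 0.0], [1.0, 0.0]])    # 1 marks the region to preserve
y = mask * source                            # measurement of the preserved region
x = np.zeros_like(source)                    # stand-in for the current sample

x_soft = consistency_step(x, y, mask)        # moves halfway toward y where mask is 1
x_hard = hard_replace(x, y, mask)            # jumps all the way in one step
print(x_soft)
print(x_hard)
```

Repeating the soft step along the sampling trajectory pulls the preserved region toward the source gradually, which is the intuition behind enforcing consistency progressively instead of pasting the region in at the end.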

LieHMR: Autoregressive Human Mesh Recovery with $SO(3)$ Diffusion

Authors:Donghwan Kim, Tae-Kyun Kim

We tackle the problem of Human Mesh Recovery (HMR) from a single RGB image, formulating it as an image-conditioned human pose and shape generation. While recovering 3D human pose from 2D observations is inherently ambiguous, most existing approaches have regressed a single deterministic output. Probabilistic methods attempt to address this by generating multiple plausible outputs to model the ambiguity. However, these methods often exhibit a trade-off between accuracy and sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. To overcome these limitations, we propose a novel approach that models well-aligned distribution to 2D observations. In particular, we introduce $SO(3)$ diffusion model, which generates the distribution of pose parameters represented as 3D rotations unconditional and conditional to image observations via conditioning dropout. Our model learns the hierarchical structure of human body joints using the transformer. Instead of using transformer as a denoising model, the time-independent transformer extracts latent vectors for the joints and a small MLP-based denoising model learns the per-joint distribution conditioned on the latent vector. We experimentally demonstrate and analyze that our model predicts accurate pose probability distribution effectively.

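Paper and project links

PDF 17 pages, 13 figures

Summary

This paper tackles Human Mesh Recovery (HMR) from a single RGB image, formulated as image-conditioned human pose and shape generation. Recovering 3D pose from 2D observations is inherently ambiguous, yet most existing approaches regress a single deterministic output. Probabilistic methods model the ambiguity by generating multiple plausible outputs, but they tend to trade accuracy against sample diversity, and their single predictions are not competitive with state-of-the-art deterministic models. The proposed approach models a distribution that is well aligned with the 2D observations. It introduces an $SO(3)$ diffusion model that generates the distribution of pose parameters, represented as 3D rotations, both unconditionally and conditioned on image observations via conditioning dropout. A time-independent transformer learns the hierarchical structure of the body joints and extracts a latent vector per joint, and a small MLP-based denoising model learns each joint's distribution conditioned on that latent vector. Experiments show that the model predicts accurate pose probability distributions.

Key Takeaways

The paper addresses HMR from a single RGB image, formulated as image-conditioned human pose and shape generation.
Most existing methods regress a single deterministic output; probabilistic methods generate multiple outputs to model the ambiguity but trade accuracy against sample diversity.
An $SO(3)$ diffusion model generates the distribution of pose parameters as 3D rotations, with conditioning dropout enabling both unconditional and image-conditioned generation.
The transformer is time-independent and is not used as the denoiser; it captures the hierarchical joint structure and extracts a latent vector for each joint.
A small MLP-based denoising model learns the per-joint distribution conditioned on the latent vector.
Experiments show that the model predicts accurate pose probability distributions.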
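
Diffusion on $SO(3)$ has to add noise without leaving the rotation group. A minimal way to show this is to sample a small axis-angle vector, map it to a rotation with Rodrigues' formula, and compose it with the clean rotation. The tangent-space Gaussian and the names `exp_so3` and `perturb_rotation` are assumptions for illustration; the abstract does not say which noise distribution LieHMR uses.

```python
import numpy as np

def exp_so3(omega):
    """Rodrigues' formula: axis-angle vector to a 3x3 rotation matrix."""
    theta = np.linalg.norm(omega)
    if theta < 1e-8:
        return np.eye(3)
    k = omega / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])   # skew-symmetric matrix of the axis
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def perturb_rotation(R, sigma, rng):
    """Compose R with a random rotation whose axis-angle vector is Gaussian."""
    noise = rng.normal(scale=sigma, size=3)
    return R @ exp_so3(noise)

rng = np.random.default_rng(0)
R = exp_so3(np.array([0.0, 0.0, np.pi / 2]))  # 90-degree rotation about z
R_noisy = perturb_rotation(R, sigma=0.3, rng=rng)
print(np.round(R, 3))
```

Because the noise enters through the exponential map, `R_noisy` stays orthogonal with determinant 1. Adding Gaussian noise directly to the matrix entries would not preserve that.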

DC-Gen: Post-Training Diffusion Acceleration with Deeply Compressed Latent Space

Authors:Wenkun He, Yuchao Gu, Junyu Chen, Dongyun Zou, Yujun Lin, Zhekai Zhang, Haocheng Xi, Muyang Li, Ligeng Zhu, Jincheng Yu, Junsong Chen, Enze Xie, Song Han, Han Cai

Existing text-to-image diffusion models excel at generating high-quality images, but face significant efficiency challenges when scaled to high resolutions, like 4K image generation. While previous research accelerates diffusion models in various aspects, it seldom handles the inherent redundancy within the latent space. To bridge this gap, this paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by leveraging a deeply compressed latent space. Rather than a costly training-from-scratch approach, DC-Gen uses an efficient post-training pipeline to preserve the quality of the base model. A key challenge in this paradigm is the representation gap between the base model’s latent space and a deeply compressed latent space, which can lead to instability during direct fine-tuning. To overcome this, DC-Gen first bridges the representation gap with a lightweight embedding alignment training. Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model’s inherent generation quality. We verify DC-Gen’s effectiveness on SANA and FLUX.1-Krea. The resulting DC-Gen-SANA and DC-Gen-FLUX models achieve quality comparable to their base models but with a significant speedup. Specifically, DC-Gen-FLUX reduces the latency of 4K image generation by 53x on the NVIDIA H100 GPU. When combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in just 3.5 seconds on a single NVIDIA 5090 GPU, achieving a total latency reduction of 138x compared to the base FLUX.1-Krea model. Code: https://github.com/dc-ai-projects/DC-Gen.

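Paper and project links

PDF Tech Report. The first three authors contributed equally to this work

Summary

This paper introduces DC-Gen, a general framework that accelerates text-to-image diffusion models by moving them to a deeply compressed latent space. Instead of costly training from scratch, DC-Gen uses an efficient post-training pipeline that preserves the quality of the base model. The main obstacle is the representation gap between the base model's latent space and the deeply compressed one, which can destabilize direct fine-tuning. DC-Gen first closes this gap with lightweight embedding alignment training; once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed to unlock the base model's inherent generation quality. The framework is verified on SANA and FLUX.1-Krea. DC-Gen-FLUX reduces 4K image generation latency by 53x on the NVIDIA H100 GPU. Combined with NVFP4 SVDQuant, it generates a 4K image in 3.5 seconds on a single NVIDIA 5090 GPU, a total latency reduction of 138x relative to the base FLUX.1-Krea model.

Key Takeaways

DC-Gen accelerates text-to-image diffusion models by exploiting a deeply compressed latent space.
An efficient post-training pipeline avoids the cost of training from scratch while preserving the base model's quality.
Lightweight embedding alignment training closes the representation gap between the base latent space and the deeply compressed latent space.
Once the latent embeddings are aligned, only a small amount of LoRA fine-tuning is needed.
DC-Gen-SANA and DC-Gen-FLUX reach quality comparable to their base models with a significant speedup.
DC-Gen-FLUX cuts 4K image generation latency by 53x on the NVIDIA H100 GPU.
Combined with NVFP4 SVDQuant, DC-Gen-FLUX generates a 4K image in 3.5 seconds on a single NVIDIA 5090 GPU, a 138x total latency reduction relative to the base FLUX.1-Krea model.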
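
A rough way to see why a more deeply compressed latent space helps at high resolution is to count latent positions as a function of the autoencoder's spatial compression factor. The abstract does not state the factors involved, so the values 8, 32, and 64 below are hypothetical, as is the 4096x4096 reading of "4K"; the sketch also ignores any patchification inside the diffusion backbone.

```python
def latent_tokens(height, width, compression):
    """Number of spatial latent positions for a given compression factor."""
    if height % compression or width % compression:
        raise ValueError("image size must be divisible by the compression factor")
    return (height // compression) * (width // compression)

# Hypothetical compression factors applied to a 4096x4096 image.
for factor in (8, 32, 64):
    print(factor, latent_tokens(4096, 4096, factor))
# 8 262144
# 32 16384
# 64 4096
```

Each doubling of the compression factor cuts the number of latent positions by four, and attention cost grows faster than linearly in that count, so small changes in the factor can translate into large latency differences.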

GLASS Flows: Transition Sampling for Alignment of Flow and Diffusion Models

Authors:Peter Holderrieth, Uriel Singer, Tommi Jaakkola, Ricky T. Q. Chen, Yaron Lipman, Brian Karrer

The performance of flow matching and diffusion models can be greatly improved at inference time using reward alignment algorithms, yet efficiency remains a major limitation. While several algorithms were proposed, we demonstrate that a common bottleneck is the sampling method these algorithms rely on: many algorithms require to sample Markov transitions via SDE sampling, which is significantly less efficient and often less performant than ODE sampling. To remove this bottleneck, we introduce GLASS Flows, a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions. As we show in this work, this “inner” flow matching model can be retrieved from a pre-trained model without any re-training, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, we show that GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art performance in text-to-image generation, making it a simple, drop-in solution for inference-time scaling of flow and diffusion models.

Paper and project links

PDF

Summary

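Reward alignment algorithms can substantially improve flow matching and diffusion models at inference time, but efficiency remains a major limitation. The authors show that the algorithms proposed so far share a bottleneck in the sampling method they rely on: many of them sample Markov transitions via stochastic differential equation (SDE) sampling, which is significantly less efficient and often performs worse than ordinary differential equation (ODE) sampling. To remove this bottleneck, they introduce GLASS Flows, a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions. This “inner” flow matching model can be retrieved from a pre-trained model without any retraining, combining the efficiency of ODEs with the stochastic evolution of SDEs. On large-scale text-to-image models, GLASS Flows eliminate the trade-off between stochastic evolution and efficiency. Combined with Feynman-Kac Steering, GLASS Flows improve state-of-the-art text-to-image performance, making them a simple, drop-in solution for inference-time scaling of flow and diffusion models.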

Key Takeaways

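Reward alignment algorithms can substantially improve flow matching and diffusion models at inference time.
Efficiency remains a major limitation of these inference-time alignment algorithms.
Many existing algorithms sample Markov transitions via SDE sampling, which is significantly less efficient and often performs worse than ODE sampling.
GLASS Flows is a new sampling paradigm that simulates a “flow matching model within a flow matching model” to sample Markov transitions.
The “inner” flow matching model can be retrieved from a pre-trained model without any retraining.
GLASS Flows combine the efficiency of ODEs with the stochastic evolution of SDEs.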
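
The ODE versus SDE contrast in the abstract can be made concrete with a toy example. The sketch below is not GLASS Flows and says nothing about its efficiency claims. It assumes a one-dimensional Gaussian target with hypothetical parameters `M` and `S`, so that the marginal velocity and score of the linear flow matching path are available in closed form. It then samples with plain Euler steps (ODE, `g=0`) and Euler-Maruyama steps (SDE, `g>0`). Both samplers are built to share the same marginals, so their final samples should have roughly the same mean and standard deviation. All function names are illustrative.

```python
import math
import random
import statistics

# Hypothetical toy target: x1 ~ N(M, S^2); the source is x0 ~ N(0, 1).
M, S = 2.0, 0.5


def var_t(t):
    # Variance of x_t = (1 - t) * x0 + t * x1 for independent x0 and x1.
    return (1.0 - t) ** 2 + (t * S) ** 2


def velocity(x, t):
    # Closed-form marginal velocity E[x1 - x0 | x_t = x] for the Gaussian toy.
    return M + (t * S * S - (1.0 - t)) / var_t(t) * (x - t * M)


def score(x, t):
    # Score of the Gaussian marginal N(t * M, var_t(t)).
    return -(x - t * M) / var_t(t)


def sample(n_samples=1000, n_steps=100, g=0.0, seed=0):
    """Euler (g == 0) or Euler-Maruyama (g > 0) sampling from t = 0 to t = 1.

    The SDE drift velocity + 0.5 * g^2 * score keeps the same marginals
    as the ODE while injecting noise of strength g at every step.
    """
    rng = random.Random(seed)
    dt = 1.0 / n_steps
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        for k in range(n_steps):
            t = k * dt
            drift = velocity(x, t) + 0.5 * g * g * score(x, t)
            noise = g * math.sqrt(dt) * rng.gauss(0.0, 1.0) if g > 0 else 0.0
            x += drift * dt + noise
        samples.append(x)
    return samples


if __name__ == "__main__":
    for label, g in (("ODE", 0.0), ("SDE", 1.0)):
        xs = sample(g=g)
        print(label, round(statistics.mean(xs), 3), round(statistics.pstdev(xs), 3))
```

In this toy both samplers are equally cheap per step. The bottleneck described in the abstract concerns real pretrained models, where SDE sampling typically needs more steps than ODE sampling for comparable quality.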

Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models

Authors:Bowei Chen, Sai Bi, Hao Tan, He Zhang, Tianyuan Zhang, Zhengqi Li, Yuanjun Xiong, Jianming Zhang, Kai Zhang

In this work, we propose aligning pretrained visual encoders to serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, our approach leverages the rich semantic structure of foundation encoders. We introduce a three-stage alignment strategy: (1) freeze the encoder and train an adapter and a decoder to establish a semantic latent space; (2) jointly optimize all components with an additional semantic preservation loss, enabling the encoder to capture perceptual details while retaining high-level semantics; and (3) refine the decoder for improved reconstruction quality. This alignment yields semantically rich image tokenizers that benefit diffusion models. On ImageNet 256$\times$256, our tokenizer accelerates the convergence of diffusion models, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. Scaling to LAION, a 2B-parameter text-to-image model trained with our tokenizer consistently outperforms FLUX VAE under the same training steps. Overall, our method is simple, scalable, and establishes a semantically grounded paradigm for continuous tokenizer design.

Paper and project links

PDF Project Page: https://aligntok.github.io/

Summary

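This work proposes aligning pretrained visual encoders so that they can serve as tokenizers for latent diffusion models in image generation. Unlike training a variational autoencoder (VAE) from scratch, which primarily emphasizes low-level details, the approach leverages the rich semantic structure of foundation encoders. The three-stage alignment strategy (1) freezes the encoder and trains an adapter and a decoder to establish a semantic latent space; (2) jointly optimizes all components with an additional semantic preservation loss, so that the encoder captures perceptual details while retaining high-level semantics; and (3) refines the decoder to improve reconstruction quality. The resulting semantically rich image tokenizers benefit diffusion models. On ImageNet 256×256, the tokenizer accelerates diffusion model convergence, reaching a gFID of 1.90 within just 64 epochs, and improves generation both with and without classifier-free guidance. At LAION scale, a 2B-parameter text-to-image model trained with the tokenizer consistently outperforms the FLUX VAE under the same number of training steps. Overall, the method is simple and scalable, and it establishes a semantically grounded paradigm for continuous tokenizer design.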

Key Takeaways

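The work aligns pretrained visual encoders to serve as tokenizers for latent diffusion models.
A three-stage alignment strategy establishes a semantic latent space, jointly optimizes all components with a semantic preservation loss, and then refines the decoder.
The tokenizer accelerates diffusion model convergence, reaching a gFID of 1.90 on ImageNet 256×256 within 64 epochs.
The approach scales to LAION, where it is used to train a 2B-parameter text-to-image model.
Under the same number of training steps, the model trained with this tokenizer consistently outperforms the FLUX VAE.
The method is simple and scalable, and it offers a semantically grounded paradigm for continuous tokenizer design.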
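
The three-stage schedule can be written down as a small table of which modules train in each stage. This is only a sketch of the schedule as the abstract describes it. The stage names and module labels are hypothetical, and freezing the encoder and adapter in stage 3 is an assumption, since the abstract says only that the decoder is refined.

```python
# Module and stage names are hypothetical labels, not identifiers from the paper.
MODULES = {"encoder", "adapter", "decoder"}

STAGES = [
    {   # Stage 1: encoder frozen; adapter and decoder learn a semantic latent space.
        "name": "latent_alignment",
        "trainable": {"adapter", "decoder"},
        "semantic_preservation_loss": False,
    },
    {   # Stage 2: all components train jointly, with the extra semantic preservation loss.
        "name": "perceptual_alignment",
        "trainable": {"encoder", "adapter", "decoder"},
        "semantic_preservation_loss": True,
    },
    {   # Stage 3: only the decoder is refined (freezing the rest is an assumption).
        "name": "decoder_refinement",
        "trainable": {"decoder"},
        "semantic_preservation_loss": False,
    },
]


def frozen_modules(stage):
    """Return the modules that receive no gradient updates in this stage."""
    return MODULES - stage["trainable"]


if __name__ == "__main__":
    for stage in STAGES:
        print(stage["name"], "frozen:", sorted(frozen_modules(stage)))
```

In a real training loop, the `trainable` set would decide which parameters are handed to the optimizer at the start of each stage.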

FlashEdit: Decoupling Speed, Structure, and Semantics for Precise Image Editing

Authors:Junyi Wu, Zhiteng Li, Haotong Qin, Xiaohong Liu, Linghe Kong, Yulun Zhang, Xiaokang Yang

Text-guided image editing with diffusion models has achieved remarkable quality but suffers from prohibitive latency, hindering real-world applications. We introduce FlashEdit, a novel framework designed to enable high-fidelity, real-time image editing. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that guarantees background preservation by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. Extensive experiments demonstrate that FlashEdit maintains superior background consistency and structural integrity, while performing edits in under 0.2 seconds, which is an over 150$\times$ speedup compared to prior multi-step methods. Our code will be made publicly available at https://github.com/JunyiWuCode/FlashEdit.

Paper and project links

PDF We need to improve our work

Summary

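This paper introduces FlashEdit, a new framework for high-fidelity, real-time text-guided image editing with diffusion models. Its efficiency stems from three key innovations: (1) a One-Step Inversion-and-Editing (OSIE) pipeline that bypasses costly iterative processes; (2) a Background Shield (BG-Shield) technique that preserves the background by selectively modifying features only within the edit region; and (3) a Sparsified Spatial Cross-Attention (SSCA) mechanism that ensures precise, localized edits by suppressing semantic leakage to the background. FlashEdit maintains superior background consistency and structural integrity while performing edits in under 0.2 seconds, a speedup of more than 150× over prior multi-step methods.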

Key Takeaways

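FlashEdit is a new framework for high-fidelity, real-time image editing.
Its efficiency comes from three key innovations: the One-Step Inversion-and-Editing (OSIE) pipeline, the Background Shield (BG-Shield) technique, and the Sparsified Spatial Cross-Attention (SSCA) mechanism.
FlashEdit performs an edit in under 0.2 seconds, more than 150× faster than prior multi-step methods.
The OSIE pipeline bypasses costly iterative processes.
BG-Shield preserves the background by selectively modifying features only within the edit region.
SSCA enables precise, localized edits by suppressing semantic leakage to the background.
Experiments show that FlashEdit maintains superior background consistency and structural integrity.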
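
The idea of modifying features only within the edit region can be illustrated with a masked blend. This is a minimal sketch and not the BG-Shield implementation, whose details are not given in the abstract. It assumes a binary mask and plain nested lists standing in for feature maps, and the function name `shield_background` is hypothetical.

```python
def shield_background(original, edited, mask):
    """Take `edited` values where mask is 1 and keep `original` values elsewhere.

    All three arguments are 2D lists of identical shape; `mask` holds 0 or 1.
    """
    if not (len(original) == len(edited) == len(mask)):
        raise ValueError("inputs must have the same number of rows")
    blended = []
    for row_o, row_e, row_m in zip(original, edited, mask):
        if not (len(row_o) == len(row_e) == len(row_m)):
            raise ValueError("rows must have the same length")
        blended.append([e if m else o for o, e, m in zip(row_o, row_e, row_m)])
    return blended


# Toy 2x3 "feature maps": the edit only touches the masked positions.
original = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
edited = [[9.0, 9.0, 9.0], [9.0, 9.0, 9.0]]
mask = [[0, 1, 0], [0, 1, 1]]
blended = shield_background(original, edited, mask)
```

Because every position outside the mask is copied from `original`, the background is unchanged by construction, which is the property the abstract attributes to BG-Shield.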

TADA: Improved Diffusion Sampling with Training-free Augmented Dynamics

Authors:Tianrong Chen, Huangjie Zheng, David Berthelot, Jiatao Gu, Josh Susskind, Shuangfei Zhai

Diffusion models have demonstrated exceptional capabilities in generating high-fidelity images but typically suffer from inefficient sampling. Many solver designs and noise scheduling strategies have been proposed to dramatically improve sampling speeds. In this paper, we introduce a new sampling method that is up to 186% faster than the current state of the art solver for comparative FID on ImageNet512. This new sampling method is training-free and uses an ordinary differential equation (ODE) solver. The key to our method resides in using higher-dimensional initial noise, allowing to produce more detailed samples with less function evaluations from existing pretrained diffusion models. In addition, by design our solver allows to control the level of detail through a simple hyper-parameter at no extra computational cost. We present how our approach leverages momentum dynamics by establishing a fundamental equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms. Moreover, we observe the use of higher-dimensional noise naturally exhibits characteristics similar to stochastic differential equations (SDEs). Finally, we demonstrate strong performances on a set of representative pretrained diffusion models, including EDM, EDM2, and Stable-Diffusion 3, which cover models in both pixel and latent spaces, as well as class and text conditional settings. The code is available at https://github.com/apple/ml-tada.

Paper and project links

PDF

Summary

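This paper introduces a new training-free sampling method, based on an ordinary differential equation (ODE) solver, that is up to 186% faster than the current state-of-the-art solver at comparable FID on ImageNet512. The method uses higher-dimensional initial noise, which lets existing pretrained diffusion models produce more detailed samples with fewer function evaluations. By design, the solver also allows the level of detail to be controlled through a simple hyper-parameter at no extra computational cost. The approach leverages momentum dynamics by establishing a fundamental equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms. The authors also observe that using higher-dimensional noise naturally exhibits characteristics similar to stochastic differential equations (SDEs). The method performs strongly on representative pretrained diffusion models, including EDM, EDM2, and Stable-Diffusion 3.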

Key Takeaways

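The new sampling method is up to 186% faster than the current state-of-the-art solver at comparable FID on ImageNet512.
It is training-free and uses an ordinary differential equation (ODE) solver.
Higher-dimensional initial noise yields more detailed samples with fewer function evaluations.
A simple hyper-parameter controls the level of detail at no extra computational cost.
The method establishes an equivalence between momentum diffusion models and conventional diffusion models with respect to their training paradigms.
Using higher-dimensional noise naturally exhibits characteristics similar to stochastic differential equations (SDEs).
The method performs strongly on representative pretrained diffusion models, including EDM, EDM2, and Stable-Diffusion 3, covering pixel-space and latent-space models as well as class-conditional and text-conditional settings.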

InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences

Authors:Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy T. Feng, Caifeng Zou, Yu Sun, Nikola Kovachki, Zachary E. Ross, Katherine L. Bouman, Yisong Yue

Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce InverseBench, a framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics. With InverseBench, we benchmark 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. To facilitate further research and development, we open-source the codebase, along with datasets and pre-trained models, at https://devzhk.github.io/InverseBench/.

Paper and project links

PDF

Summary

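Plug-and-play diffusion priors (PnPDP) have emerged as a promising research direction for solving inverse problems. However, current studies focus mainly on natural image restoration, leaving the performance of these algorithms on scientific inverse problems largely unexplored. To address this gap, the authors introduce InverseBench, a framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks and arise from critical scientific applications such as optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics. With InverseBench, the authors benchmark 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, offering new insights into the strengths and weaknesses of existing algorithms. To facilitate further research and development, the codebase, datasets, and pre-trained models are open-sourced at https://devzhk.github.io/InverseBench/.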

Key Takeaways

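Plug-and-play diffusion priors (PnPDP) are a promising research direction for solving inverse problems.
Current studies focus mainly on natural image restoration, leaving scientific inverse problems largely unexplored.
InverseBench is a framework that evaluates diffusion models across five distinct scientific inverse problems.
These scientific inverse problems present unique structural challenges that differ from existing benchmarks.
InverseBench covers critical scientific applications: optical tomography, medical imaging, black hole imaging, seismology, and fluid dynamics.
The benchmark compares 14 inverse problem algorithms that use plug-and-play diffusion priors against strong, domain-specific baselines, revealing the strengths and weaknesses of existing algorithms.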

CE-SDWV: Effective and Efficient Concept Erasure for Text-to-Image Diffusion Models via a Semantic-Driven Word Vocabulary

Authors:Jiahang Tu, Qian Feng, Jiahua Dong, Hanbin Zhao, Chao Zhang, Nicu Sebe, Hui Qian

Large-scale text-to-image (T2I) diffusion models have achieved remarkable generative performance about various concepts. With the limitation of privacy and safety in practice, the generative capability concerning NSFW (Not Safe For Work) concepts is undesirable, e.g., producing sexually explicit photos, and licensed images. The concept erasure task for T2I diffusion models has attracted considerable attention and requires an effective and efficient method. To achieve this goal, we propose a CE-SDWV framework, which removes the target concepts (e.g., NSFW concepts) of T2I diffusion models in the text semantic space by only adjusting the text condition tokens and does not need to re-train the original T2I diffusion model’s weights. Specifically, our framework first builds a target concept-related word vocabulary to enhance the representation of the target concepts within the text semantic space, and then utilizes an adaptive semantic component suppression strategy to ablate the target concept-related semantic information in the text condition tokens. To further adapt the above text condition tokens to the original image semantic space, we propose an end-to-end gradient-orthogonal token optimization strategy. Extensive experiments on I2P and UnlearnCanvas benchmarks demonstrate the effectiveness and efficiency of our method. Code is available at https://github.com/TtuHamg/CE-SDWV.

Paper and project links

PDF 25 pages, 14 figures

Summary

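This paper addresses the privacy and safety concerns raised when large-scale text-to-image (T2I) diffusion models generate NSFW (Not Safe For Work) content. It proposes the CE-SDWV framework, which erases target concepts in the text semantic space by adjusting only the text condition tokens, without retraining the weights of the original T2I diffusion model. The framework first builds a word vocabulary related to the target concept to strengthen the representation of that concept in the text semantic space, and then applies an adaptive semantic component suppression strategy to ablate the concept-related semantic information in the text condition tokens. To further adapt these tokens to the original image semantic space, it adds an end-to-end gradient-orthogonal token optimization strategy. Experiments on the I2P and UnlearnCanvas benchmarks demonstrate that the method is effective and efficient.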

Key Takeaways

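Large-scale text-to-image (T2I) diffusion models achieve remarkable generative performance across a wide range of concepts.
In practice, privacy and safety constraints make the ability to generate NSFW (Not Safe For Work) content undesirable, for example sexually explicit photos and licensed images.
The CE-SDWV framework erases target concepts such as NSFW concepts from T2I diffusion models without retraining the model weights.
The framework builds a word vocabulary related to the target concept and applies an adaptive semantic component suppression strategy to ablate the concept's semantic information in the text condition tokens.
An end-to-end gradient-orthogonal token optimization strategy adapts the adjusted tokens to the original image semantic space.
Experiments on the I2P and UnlearnCanvas benchmarks demonstrate the method's effectiveness and efficiency.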
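
One simple way to picture suppressing a semantic component in a token embedding is orthogonal projection: subtract from the token its components along a set of concept directions. The sketch below illustrates only that generic idea. It is not the paper's adaptive semantic component suppression strategy, and the vectors, directions, and function names are all hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(directions, eps=1e-12):
    """Gram-Schmidt: turn concept directions into an orthonormal basis."""
    basis = []
    for d in directions:
        residual = list(d)
        for b in basis:
            coeff = dot(residual, b)
            residual = [r - coeff * bi for r, bi in zip(residual, b)]
        norm = math.sqrt(dot(residual, residual))
        if norm > eps:  # skip directions already spanned by the basis
            basis.append([r / norm for r in residual])
    return basis


def suppress(token, concept_directions):
    """Remove from `token` every component lying in the span of the concept directions."""
    out = list(token)
    for b in orthonormalize(concept_directions):
        coeff = dot(out, b)
        out = [o - coeff * bi for o, bi in zip(out, b)]
    return out


# Toy 3-dimensional example with two hypothetical concept directions.
token = [3.0, 4.0, 5.0]
concept_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
cleaned = suppress(token, concept_directions)
```

After projection, `cleaned` is orthogonal to every concept direction, while the part of the token outside their span is left untouched.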

Unpicking Data at the Seams: Understanding Disentanglement in VAEs

Authors:Carl Allen

A generative latent variable model is said to be disentangled when varying a single latent co-ordinate changes a single aspect of samples generated, e.g. object position or facial expression in an image. Related phenomena are seen in several generative paradigms, including state-of-the-art diffusion models, but disentanglement is most notably observed in Variational Autoencoders (VAEs), where oft-used diagonal posterior covariances are argued to be the cause. We make this picture precise. From a known exact link between optimal Gaussian posteriors and decoder derivatives, we show how diagonal posteriors “lock” a decoder’s local axes so that density over the data manifold factorises along independent one-dimensional seams that map to axis-aligned directions in latent space. This gives a clear definition of disentanglement, explains why it emerges in VAEs and shows that, under stated assumptions, ground truth factors are identifiable even with a symmetric prior.

Paper and project links

PDF 9 pages

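Summary

A generative latent variable model is said to be disentangled when varying a single latent coordinate changes a single aspect of the generated samples, such as object position or facial expression in an image. Related phenomena appear in several generative paradigms, including state-of-the-art diffusion models, but disentanglement is most notably observed in Variational Autoencoders (VAEs), where the commonly used diagonal posterior covariances are argued to be the cause. Starting from a known exact link between optimal Gaussian posteriors and decoder derivatives, the paper shows how diagonal posteriors “lock” a decoder's local axes so that the density over the data manifold factorises along independent one-dimensional seams that map to axis-aligned directions in latent space. This gives a clear definition of disentanglement, explains why it emerges in VAEs, and shows that, under stated assumptions, ground truth factors are identifiable even with a symmetric prior.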

Key Takeaways