
Diffusion Models


⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验

2025-10-25 更新

Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge

Authors:Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot

Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.

生成建模领域的最新进展使扩散模型成为从复杂数据分布中采样的最先进工具。虽然这些模型在图像和音频等单模态领域取得了显著成功,但将其能力扩展到模态转换(Modality Translation,MT),即在不同感官模态之间转换信息,仍然是一个开放挑战。现有方法往往依赖一些限制性假设,包括共享维度、高斯源先验和特定模态架构等,这限制了它们的通用性和理论基础。在这项工作中,我们提出了潜在去噪扩散桥模型(Latent Denoising Diffusion Bridge Model,LDDBM),这是去噪扩散桥模型的一种潜在变量扩展,可作为通用的模态转换框架。通过在共享潜在空间中进行操作,我们的方法能够在任意模态之间建立桥梁,无需维度对齐。我们引入了一种对比对齐损失来强制配对样本之间的语义一致性,并设计了一种面向潜在空间噪声预测的域无关编码器-解码器架构。此外,我们提出了一种预测损失来引导训练以实现准确的跨域转换,并探索了多种训练策略来提高稳定性。我们的方法支持任意模态对,并在多种模态转换任务上表现出色,包括多视图到三维形状生成、图像超分辨率和多视图场景合成等。全面的实验和消融研究验证了我们框架的有效性,为通用模态转换建立了新的强基线。如需更多信息,请访问我们的项目页面:https://sites.google.com/view/lddbm/home。
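下面给出一个极简的代码示意,对应上文的"对比对齐损失":用两个(此处纯属假设的)模态编码器把配对样本映射到同一共享潜在空间,再用 InfoNCE 风格的对称交叉熵拉近配对样本、推远非配对样本。编码器结构、维度和温度系数都是演示用的假设,并非论文的原始实现。

```python
# 极简示意:共享潜在空间中的对比对齐损失(InfoNCE 风格),仅为说明思路。
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_src, z_tgt, temperature=0.07):
    """z_src, z_tgt: [B, D],同一行索引对应一对配对的跨模态样本。"""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature              # [B, B] 相似度矩阵
    labels = torch.arange(z_src.size(0), device=z_src.device)
    # 对称交叉熵:对角线(配对样本)应互为最近邻
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# 用法示意:两个假设的编码器,把不同维度的模态映射到同一 128 维潜在空间
enc_a = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(32 * 32, 128))
enc_b = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64, 128))
x_a, x_b = torch.randn(8, 1, 32, 32), torch.randn(8, 64)
loss = contrastive_alignment_loss(enc_a(x_a), enc_b(x_b))
```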

论文及项目相关链接

PDF

摘要

扩散模型近期在生成建模方面取得重大进展,已成为复杂数据分布采样的顶尖工具。虽然这些模型在单模态领域(如图像和音频)表现出显著的成功,但将其能力扩展到跨不同感官模态的模态翻译(MT)仍是一个开放挑战。现有方法通常依赖于共享维度、高斯源先验和模态特定架构等限制性假设,这限制了其通用性和理论根基。本研究提出基于去噪扩散桥模型的潜在去噪扩散桥模型(Latent Denoising Diffusion Bridge Model,LDDBM),这是一种用于模态翻译的通用框架。通过在共享潜在空间中进行操作,该方法能够在任意模态之间搭建桥梁,无需对齐维度。研究引入了对比对齐损失来强制配对样本之间的语义一致性,并设计了一种针对潜在空间中噪声预测的域无关编码器-解码器架构。此外,研究还提出了预测损失来指导训练以实现准确的跨域翻译,并探索了若干训练策略以提高稳定性。该方法支持任意模态对,并在多种MT任务上表现出色,包括多视角到3D形状生成、图像超分辨率和多视角场景合成。全面的实验和消融验证了框架的有效性,在通用模态翻译方面建立了新的强劲基线。更多信息请参见我们的项目页面:https://sites.google.com/view/lddbm/home。

要点摘要

  1. 扩散模型已成为复杂数据采样领域的顶尖工具。
  2. 模态翻译(MT)是将信息从一种感官模态转换到另一种模态的任务,是一个开放挑战。
  3. 现有方法受限于共享维度、高斯源先验和特定架构,降低了其通用性。
  4. 本研究提出了Latent Denoising Diffusion Bridge Model(LDDBM),一个用于模态翻译的通用框架。
  5. LDDBM在共享潜在空间中进行操作,无需对齐维度,支持任意模态对。
  6. 研究引入了对比对齐损失和预测损失来提高翻译的准确性。

Cool Papers

点此查看论文截图

AutoScape: Geometry-Consistent Long-Horizon Scene Generation

Authors:Jiacheng Chen, Ziyu Jiang, Mingfu Liang, Bingbing Zhuang, Jong-Chyi Su, Sparsh Garg, Ying Wu, Manmohan Chandraker

This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, serving as reliable anchors for the scene’s appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles image and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with a warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense and coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.

本文提出了AutoScape,一个长视野(long-horizon)驾驶场景生成框架。其核心是一个新型的RGB-D扩散模型,该模型迭代地生成稀疏且几何一致的关键帧,作为场景外观和几何的可靠锚点。为了保持长程几何一致性,该模型:1)在共享潜在空间中联合处理图像和深度;2)显式地以先前生成关键帧的现有场景几何(即渲染点云)为条件;3)通过warp一致性引导来控制采样过程。给定高质量RGB-D关键帧后,再由视频扩散模型在它们之间进行插值,生成密集且连贯的视频帧。AutoScape能够生成超过20秒的逼真且几何一致的驾驶视频,长视野FID和FVD指标相对于此前最优方法分别改善了48.6%和43.0%。
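下面用一段高度简化的代码示意"warp 一致性引导"的大致形式:在每个采样步,将去噪预测与由已生成关键帧几何重投影(warp)得到的参考图作比较,并用一致性误差的梯度修正当前步输出。其中 denoiser、warped_ref、mask 及引导权重 guidance_scale 都是为演示引入的假设,并非 AutoScape 的原始实现。

```python
# 极简示意:warp 一致性引导的一个采样步,仅为说明思路。
import torch

def warp_guided_step(x_t, denoiser, warped_ref, mask, guidance_scale=1.0):
    """x_t: 当前潜变量;warped_ref: 由已有关键帧点云渲染/重投影得到的参考图;
    mask: 参考图中有效像素的掩码(新视角下未被旧几何覆盖的区域为 0)。"""
    x_t = x_t.detach().requires_grad_(True)
    x0_pred = denoiser(x_t)                                  # 假设 denoiser 直接预测 x0
    consistency = ((x0_pred - warped_ref) ** 2 * mask).mean()
    grad = torch.autograd.grad(consistency, x_t)[0]
    return (x0_pred - guidance_scale * grad).detach()        # 朝减小不一致性的方向修正

# 用法示意(去噪器用恒等映射代替)
x_t = torch.randn(1, 4, 32, 32)
ref = torch.randn(1, 4, 32, 32)
mask = (torch.rand(1, 1, 32, 32) > 0.5).float()
x0 = warp_guided_step(x_t, lambda x: x, ref, mask)
```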

论文及项目相关链接

PDF ICCV 2025. Project page: https://auto-scape.github.io

Summary

本文提出了AutoScape框架,一个用于长远视距驾驶场景生成的框架。其核心是一个新颖的RGB-D扩散模型,该模型可迭代生成稀疏且几何一致的关键帧,为场景外观和几何提供可靠锚点。通过联合处理图像和深度信息,显式条件化现有场景几何,以及用warp一致性指导采样过程来保持长距离几何一致性。给定高质量RGB-D关键帧,视频扩散模型会在它们之间进行插值,生成密集且连贯的视频帧。AutoScape可生成超过20秒的逼真且几何一致的驾驶视频,相较于现有技术,其长距离FID和FVD得分分别提高了48.6%和43.0%。

Key Takeaways

  1. AutoScape是一个用于长远视距驾驶场景生成的框架。
  2. 框架核心是一个RGB-D扩散模型,可生成稀疏且几何一致的关键帧。
  3. 模型联合处理图像和深度信息,并在共享潜在空间中工作。
  4. 模型显式条件化现有场景几何,并用以指导采样过程。
  5. 给定高质量RGB-D关键帧,视频扩散模型能生成密集且连贯的视频帧。
  6. AutoScape生成的驾驶视频具有逼真性和几何一致性,并能维持较长时间。

Cool Papers

点此查看论文截图

Downsizing Diffusion Models for Cardinality Estimation

Authors:Xinhe Mu, Zhaoqi Zhou, Zaijiu Shang, Chuan Zhou, Gang Fu, Guiying Yan, Guoliang Li, Zhiming Ma

Inspired by the performance of score-based diffusion models in estimating complex text, video, and image distributions with thousands of dimensions, we introduce Accelerated Diffusion Cardest (ADC), the first joint distribution cardinality estimator based on a downsized diffusion model. To calculate the pointwise density value of data distributions, ADC’s density estimator uses a formula that evaluates log-likelihood by integrating the score function, a gradient mapping which ADC has learned to efficiently approximate using its lightweight score estimator. To answer ranged queries, ADC’s selectivity estimator first predicts their selectivity using a Gaussian Mixture Model (GMM), then uses importance sampling Monte Carlo to correct its predictions with more accurate pointwise density values calculated by the density estimator. ADC+ further trains a decision tree to identify the high-volume, high-selectivity queries that the GMM alone can predict very accurately, in which case it skips the correction phase to prevent Monte Carlo from adding more variance. Doing so lowers median Q-error and cuts per-query latency by 25 percent, making ADC+ usually twice as fast as Naru, arguably the state-of-the-art joint distribution cardinality estimator. Numerical experiments using well-established benchmarks show that on all real-world datasets tested, ADC+ is capable of rivaling Naru and outperforming MSCN, DeepDB, LW-Tree, and LW-NN using around 66 percent their storage space, being at least 3 times as accurate as MSCN on 95th and 99th percentile error. Furthermore, on a synthetic dataset where attributes exhibit complex, multilateral correlations, ADC and ADC+ are considerably robust while almost every other learned model suffered significant accuracy declines. In this case, ADC+ performs better than any other tested model, being 10 times as accurate as Naru on 95th and 99th percentile error.

受基于分数的扩散模型在估计数千维的复杂文本、视频和图像分布方面表现的启发,我们提出了Accelerated Diffusion Cardest(ADC),这是首个基于小型化扩散模型的联合分布基数估计器。为了计算数据分布的逐点密度值,ADC的密度估计器通过对分数函数(一个梯度映射)进行积分来计算对数似然,而ADC已学会用其轻量级分数估计器高效地近似该分数函数。为了回答范围查询,ADC的选择率估计器首先用高斯混合模型(GMM)预测查询的选择率,再用重要性采样蒙特卡洛方法,借助密度估计器给出的更准确的逐点密度值对预测进行校正。ADC+进一步训练一棵决策树,用于识别那些仅凭GMM即可非常准确预测的大数据量、高选择率查询;对这类查询跳过校正阶段,以避免蒙特卡洛引入额外方差。这样做降低了中位数Q误差,并将每次查询的延迟降低25%,使ADC+的速度通常达到Naru(可以说是最先进的联合分布基数估计器)的两倍。使用公认基准的数值实验表明,在所有测试的真实数据集上,ADC+能够与Naru相匹敌,并优于MSCN、DeepDB、LW-Tree和LW-NN,而存储空间只有它们的约66%;在95和99百分位误差上,其准确度至少是MSCN的3倍。此外,在属性之间存在复杂多边相关性的合成数据集上,ADC和ADC+相当稳健,而几乎所有其他已学习模型的准确性都显著下降。在这种情况下,ADC+的表现优于所有其他被测模型,在95和99百分位误差上准确度是Naru的10倍。
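下面用一个小例子示意"GMM 预测 + 重要性采样蒙特卡洛校正"这一选择率估计流程:以 GMM 为提议分布采样,再用(此处假设可直接调用的)逐点密度函数 density_fn 对样本重新加权,得到校正后的范围查询选择率。GMM 参数、查询范围和 density_fn 均为演示用的假设,并非 ADC 的原始实现。

```python
# 极简示意:GMM 选择率预测 + 重要性采样校正,仅为说明思路。
import numpy as np
from scipy.stats import multivariate_normal

# 一个两分量二维 GMM,充当"选择率估计器"的代理
weights = np.array([0.6, 0.4])
comps = [multivariate_normal(np.zeros(2), np.eye(2)),
         multivariate_normal(np.ones(2) * 2, np.eye(2) * 0.5)]

def gmm_pdf(x):
    return sum(w * c.pdf(x) for w, c in zip(weights, comps))

def density_fn(x):
    # 假设:由(扩散)密度估计器给出的"更准确"逐点密度,这里用另一个分布代替
    return multivariate_normal(np.array([0.3, 0.8]), np.eye(2)).pdf(x)

lo, hi = np.array([-1.0, -1.0]), np.array([1.5, 1.5])        # 范围查询 [lo, hi]
def in_range(x):
    return np.all((x >= lo) & (x <= hi), axis=-1)

# 1) 以 GMM 自身的样本直接估计选择率
samples = np.vstack([c.rvs(size=int(w * 20000)) for w, c in zip(weights, comps)])
sel_gmm = in_range(samples).mean()

# 2) 以 GMM 为提议分布做自归一化重要性采样,用 density_fn 重新加权得到校正值
w_is = density_fn(samples) / gmm_pdf(samples)
sel_corrected = np.sum(w_is * in_range(samples)) / np.sum(w_is)
print(sel_gmm, sel_corrected)
```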

论文及项目相关链接

PDF

摘要

受基于分数的扩散模型在估计数千维复杂文本、视频和图像分布中的出色表现启发,我们推出了加速扩散基数估计器(ADC),这是首个基于小型化扩散模型的联合分布基数估计器。ADC的密度估计器通过对分数函数积分来计算对数似然,从而得到数据分布的逐点密度值;ADC已学会用其轻量级分数估计器高效近似该分数函数。为了回答范围查询,ADC的选择率估计器先用高斯混合模型(GMM)预测选择率,再用重要性采样蒙特卡洛方法、借助密度估计器给出的更准确逐点密度值对预测进行校正。ADC+进一步训练决策树,识别GMM单独即可非常准确预测的大数据量、高选择率查询,并对其跳过校正阶段以避免蒙特卡洛引入额外方差。这样做降低了中位数Q误差,并将每次查询的延迟减少25%,使ADC+的速度通常是Naru(目前联合分布基数估计器中的佼佼者)的两倍。数值实验表明,在所有真实数据集上,ADC+能够与Naru竞争,并超越MSCN、DeepDB、LW-Tree和LW-NN等模型,而仅使用它们约66%的存储空间。在属性呈现复杂多边相关性的合成数据集上,ADC和ADC+表现出显著的稳健性,而几乎所有其他已学习模型的准确性都明显下降;此时ADC+优于所有其他被测模型,在95和99百分位误差上准确度是Naru的10倍。

关键见解

  1. 提出了基于简化扩散模型的联合分布基数估计器——加速扩散基数估计器(ADC)。
  2. ADC使用密度估计器计算逐点密度值,通过积分得分函数评估对数似然性。
  3. ADC利用高斯混合模型(GMM)预测选择性,并结合重要性抽样蒙特卡洛方法校正预测结果。
  4. ADC+通过训练决策树优化性能,能识别高容量、高选择性的查询,从而跳过校正阶段提高效率。
  5. ADC+相较于其他模型如Naru、MSCN等展现出卓越性能,减少查询延迟并减少存储空间使用。
  6. 在处理具有复杂多边关系的合成数据集时,ADC和ADC+表现出显著稳健性,远超其他模型。

Cool Papers

点此查看论文截图

UltraHR-100K: Enhancing UHR Image Synthesis with A Large-Scale High-Quality Dataset

Authors:Chen Zhao, En Ci, Yunzhe Xu, Tiehan Fan, Shanyan Guan, Yanhao Ge, Jian Yang, Ying Tai

Ultra-high-resolution (UHR) text-to-image (T2I) generation has seen notable progress. However, two key challenges remain: (1) the absence of a large-scale high-quality UHR T2I dataset, and (2) the neglect of tailored training strategies for fine-grained detail synthesis in UHR scenarios. To tackle the first challenge, we introduce UltraHR-100K, a high-quality dataset of 100K UHR images with rich captions, offering diverse content and strong visual fidelity. Each image exceeds 3K resolution and is rigorously curated based on detail richness, content complexity, and aesthetic quality. To tackle the second challenge, we propose a frequency-aware post-training method that enhances fine-detail generation in T2I diffusion models. Specifically, we design (i) Detail-Oriented Timestep Sampling (DOTS) to focus learning on detail-critical denoising steps, and (ii) Soft-Weighting Frequency Regularization (SWFR), which leverages the Discrete Fourier Transform (DFT) to softly constrain frequency components, encouraging high-frequency detail preservation. Extensive experiments on our proposed UltraHR-eval4K benchmarks demonstrate that our approach significantly improves the fine-grained detail quality and overall fidelity of UHR image generation. The code is available at https://github.com/NJU-PCALab/UltraHR-100k.

超高分辨率(UHR)文本到图像(T2I)生成已经取得了显著进步。然而,仍存在两个关键挑战:(1)缺乏大规模高质量的UHR T2I数据集;(2)忽视了针对UHR场景精细细节合成的定制训练策略。为了解决第一个挑战,我们引入了UltraHR-100K,这是一个包含10万张UHR图像并配有丰富描述文本的高质量数据集,内容多样且视觉保真度高。每张图像分辨率超过3K,并根据细节丰富程度、内容复杂性和美学质量进行严格筛选。为了解决第二个挑战,我们提出了一种频率感知的后训练方法,用于提高T2I扩散模型的精细细节生成能力。具体来说,我们设计了(i)面向细节的时间步采样(Detail-Oriented Timestep Sampling,DOTS),使学习聚焦于对细节至关重要的去噪步骤;(ii)软加权频率正则化(Soft-Weighting Frequency Regularization,SWFR),利用离散傅里叶变换(DFT)对频率成分进行软约束,鼓励保留高频细节。在我们提出的UltraHR-eval4K基准测试上的大量实验表明,我们的方法显著提高了UHR图像生成的精细细节质量和整体保真度。代码可在 https://github.com/NJU-PCALab/UltraHR-100k 获取。
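下面给出 SWFR 思路的一个极简示意:用 torch.fft 计算生成图与参考图的频谱差异,并用随频率半径增大的软权重加权,从而更强地约束高频成分。权重曲线、归一化方式等均为假设,并非论文中 SWFR 的原始定义。

```python
# 极简示意:基于 DFT 的软加权频率正则项,仅为说明思路。
import torch

def soft_weighted_freq_loss(pred, target, alpha=2.0):
    """pred, target: [B, C, H, W]"""
    Fp = torch.fft.fftshift(torch.fft.fft2(pred, norm="ortho"), dim=(-2, -1))
    Ft = torch.fft.fftshift(torch.fft.fft2(target, norm="ortho"), dim=(-2, -1))
    _, _, H, W = pred.shape
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    radius = torch.sqrt(xx ** 2 + yy ** 2)          # 到频谱中心的归一化距离
    weight = (radius / radius.max()) ** alpha        # 半径越大(频率越高)权重越大
    return (weight * (Fp - Ft).abs() ** 2).mean()

loss = soft_weighted_freq_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```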

论文及项目相关链接

PDF Accepted by NeurIPS 2025

Summary

本文主要介绍了针对超高分辨率(UHR)文本到图像(T2I)生成领域两个挑战的解决方案。首先,为了应对缺乏大规模高质量UHR T2I数据集的问题,引入了UltraHR-100K数据集,包含10万张超过3K分辨率的高质量图像和丰富标注。其次,为了解决精细细节合成缺乏针对性训练策略的问题,提出了一种频率感知的后训练方法,其中包括面向细节的时间步采样(DOTS)和软加权频率正则化(SWFR)。实验证明,该方法显著提高了UHR图像生成的细节质量和整体保真度。

Key Takeaways

  1. UltraHR-100K数据集被引入,以解决超高分辨率文本到图像生成领域缺乏大规模高质量数据集的问题。
  2. 该数据集包含超过3K分辨率的图像,并注重细节丰富度、内容复杂度和美学质量。
  3. 提出了频率感知的后训练方法,旨在提高文本到图像扩散模型在超高分辨率下的精细细节生成能力。
  4. 面向细节的时间步采样(DOTS)被设计用于聚焦对细节至关重要的去噪步骤。
  5. 软加权频率正则化(SWFR)利用离散傅里叶变换(DFT)来柔和地约束频率成分,促进高频细节的保留。
  6. 在UltraHR-eval4K基准测试上的实验证明,该方法显著提高了超高分辨率图像生成的细节质量和整体保真度。

Cool Papers

点此查看论文截图

EchoDistill: Bidirectional Concept Distillation for One-Step Diffusion Personalization

Authors:Yixiong Yang, Tao Wu, Senmao Li, Shiqi Yang, Yaxing Wang, Joost van de Weijer, Kai Wang

Recent advances in accelerating text-to-image (T2I) diffusion models have enabled the synthesis of high-fidelity images even in a single step. However, personalizing these models to incorporate novel concepts remains a challenge due to the limited capacity of one-step models to capture new concept distributions effectively. We propose a bidirectional concept distillation framework, EchoDistill, to enable one-step diffusion personalization (1-SDP). Our approach involves an end-to-end training process where a multi-step diffusion model (teacher) and a one-step diffusion model (student) are trained simultaneously. The concept is first distilled from the teacher model to the student, and then echoed back from the student to the teacher. During the EchoDistill, we share the text encoder between the two models to ensure consistent semantic understanding. Following this, the student model is optimized with adversarial losses to align with the real image distribution and with alignment losses to maintain consistency with the teacher’s output. Furthermore, we introduce the bidirectional echoing refinement strategy, wherein the student model leverages its faster generation capability to feedback to the teacher model. This bidirectional concept distillation mechanism not only enhances the student ability to personalize novel concepts but also improves the generative quality of the teacher model. Our experiments demonstrate that this collaborative framework significantly outperforms existing personalization methods over the 1-SDP setup, establishing a novel paradigm for rapid and effective personalization in T2I diffusion models.

近期在加速文本到图像(T2I)扩散模型方面的进展,使得单步即可合成高保真图像。然而,由于单步模型有效捕捉新概念分布的能力有限,将这些模型个性化以纳入新概念仍然是一个挑战。我们提出了一种双向概念蒸馏框架EchoDistill,以实现单步扩散个性化(1-SDP)。我们的方法采用端到端的训练过程,其中多步扩散模型(教师)和单步扩散模型(学生)同时训练。概念首先从教师模型蒸馏到学生模型,再由学生模型回传给教师模型。在EchoDistill过程中,我们在两个模型之间共享文本编码器,以确保一致的语义理解。在此基础上,学生模型通过对抗损失进行优化以贴近真实图像分布,并通过对齐损失保持与教师模型输出的一致性。此外,我们引入了双向回声细化策略,学生模型利用其更快的生成能力向教师模型提供反馈。这种双向概念蒸馏机制不仅提高了学生模型对新概念的个性化能力,也提高了教师模型的生成质量。实验表明,这一协作框架在1-SDP设置下显著优于现有个性化方法,为T2I扩散模型的快速有效个性化建立了新范式。
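下面是学生(单步)模型训练目标的一个极简示意:对抗损失使输出贴近真实图像分布,对齐损失使输出与教师(多步)模型的结果保持一致。判别器结构、损失形式与权重均为假设,并非 EchoDistill 的原始实现。

```python
# 极简示意:学生模型的"对抗 + 对齐"组合损失,仅为说明思路。
import torch
import torch.nn.functional as F

def student_loss(student_out, teacher_out, discriminator, lambda_align=1.0):
    adv = F.softplus(-discriminator(student_out)).mean()     # 非饱和 GAN 生成器损失
    align = F.mse_loss(student_out, teacher_out.detach())    # 与教师输出保持一致
    return adv + lambda_align * align

# 用法示意:判别器用一个很小的卷积网络代替
D = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3, stride=2, padding=1),
                        torch.nn.Flatten(), torch.nn.LazyLinear(1))
loss = student_loss(torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32), D)
```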

论文及项目相关链接

PDF Project page available at https://liulisixin.github.io/EchoDistill-page/

摘要
提出了一种双向概念蒸馏框架EchoDistill,用于实现一步扩散个性化(1-SDP)。该框架通过端到端训练过程,将多步扩散模型(教师模型)和一步扩散模型(学生模型)同时训练。概念首先由教师模型蒸馏给学生模型,再从学生模型反馈回教师模型。共享文本编码器以确保一致的语义理解。学生模型通过对抗性损失进行优化,以与现实图像分布对齐,并通过对齐损失以保持与教师模型输出的一致性。此外,引入了双向回声优化策略,学生模型利用其更快的生成能力反馈给教师模型。这种双向概念蒸馏机制不仅提高了学生对新概念的个性化能力,还提高了教师模型的生成质量。实验表明,该协作框架在1-SDP设置中显著优于现有个性化方法,为T2I扩散模型的快速有效个性化建立了新范式。

关键见解

  1. 提出了EchoDistill双向概念蒸馏框架,用于加速文本到图像(T2I)扩散模型的个性化。
  2. 通过端到端训练过程,同时训练教师模型和学生模型,实现概念蒸馏与反馈。
  3. 共享文本编码器以确保在蒸馏和反馈过程中语义理解的一致性。
  4. 学生模型通过优化对抗性损失和对齐损失,以与现实图像分布对齐并维持与教师模型的一致性。
  5. 引入双向回声优化策略,利用学生模型的快速生成能力反馈给教师模型,进一步提高两者性能。
  6. 实验证明,该框架在一步扩散个性化(1-SDP)设置中显著优于现有方法。

Cool Papers

点此查看论文截图

EditInfinity: Image Editing with Binary-Quantized Generative Models

Authors:Jiahuan Wang, Yuxin Chen, Jun Yu, Guangming Lu, Wenjie Pei

Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of VQ-based generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose \emph{EditInfinity}, which adapts \emph{Infinity}, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our \emph{EditInfinity} to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across “add”, “change”, and “delete” editing operations, demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.

适应预训练的基于扩散的生成模型,用于文本驱动的图片编辑,且几乎不需要调整开销,已经显示出显著潜力。这些方法遵循的经典适应模式首先通过对给定源图像进行图像反演来逆向推断生成轨迹,然后沿着推断出的轨迹在目标文本提示的指导下进行图像编辑。然而,图像编辑的性能受到扩散模型在图像反演过程中引入的近似误差的严重限制,这些误差源于中间生成步骤中缺乏精确的监督。为了解决这个问题,我们研究了基于VQ的生成模型在图像编辑中的参数高效适应性问题,并利用了它们的固有特性,即可以获得源图像的精确中间量化表示,为精确图像反演提供了更有效的监督。具体来说,我们提出了“EditInfinity”,它适应了“Infinity”这一二进制量化生成模型用于图像编辑。我们提出了一种高效且有效的图像反演机制,融合了文本提示校正和图像风格保持,实现了精确的图像反演。此外,我们还设计了一种整体平滑策略,使我们的“EditInfinity”能够在源图像上进行高保真度的图像编辑,并与文本提示实现精确语义对齐。在PIE-Bench基准测试上的广泛实验,涵盖了“添加”、“更改”和“删除”编辑操作,证明了我们模型相较于最先进的扩散基准模型的优越性。代码可在:https://github.com/yx-chen-ust/EditInfinity找到。

论文及项目相关链接

PDF 28 pages, 13 figures, accepted by The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Summary

基于预训练的扩散生成模型进行文本驱动图像编辑,无需大量调整,展现出显著潜力。当前方法通过逆向推断给定源图像的生成轨迹,然后沿推断轨迹进行图像编辑,同时受目标文本提示引导。然而,由于扩散模型在图像反转过程中的近似误差,限制了图像编辑的效果。这些误差源于中间生成步骤中缺乏精确监督。为了解决这个问题,我们研究了VQ基生成模型的参数高效适应图像编辑方法,利用其固有的特性——获取源图像的中间量化表示,为精确图像反转提供更有效的监督。我们提出了EditInfinity,它适应了Infinity这一二元量化生成模型用于图像编辑。我们设计了一个高效而精确的图像反转机制,融合了文本提示校正和图像风格保持,使图像反转更加精确。此外,我们还提出了整体平滑策略,使EditInfinity能够在保持源图像高保真度的同时进行精确的语义文本对齐编辑。在PIE-Bench基准测试上的实验表明,我们的模型在添加、更改和删除编辑操作上的性能均优于最先进的扩散基准模型。

Key Takeaways

  1. 扩散生成模型在文本驱动的图像编辑中表现出潜力。
  2. 当前方法通过推断生成轨迹进行图像编辑,但存在近似误差问题。
  3. VQ基生成模型的中间量化表示可提高图像反转的精确性。
  4. 提出了EditInfinity模型,适应了Infinity二元量化生成模型用于图像编辑。
  5. EditInfinity设计了一个高效精确图像反转机制,融合文本提示校正和图像风格保持。
  6. 整体平滑策略使EditInfinity能在保持源图像高保真度的同时进行精确的语义文本对齐编辑。
  7. 在PIE-Bench基准测试上,EditInfinity性能优于其他扩散基准模型。

Cool Papers

点此查看论文截图

StableSketcher: Enhancing Diffusion Model for Pixel-based Sketch Generation via Visual Question Answering Feedback

Authors:Jiho Park, Sieun Choi, Jaeyoon Seo, Jihie Kim

Although recent advancements in diffusion models have significantly enriched the quality of generated images, challenges remain in synthesizing pixel-based human-drawn sketches, a representative example of abstract expression. To combat these challenges, we propose StableSketcher, a novel framework that empowers diffusion models to generate hand-drawn sketches with high prompt fidelity. Within this framework, we fine-tune the variational autoencoder to optimize latent decoding, enabling it to better capture the characteristics of sketches. In parallel, we integrate a new reward function for reinforcement learning based on visual question answering, which improves text-image alignment and semantic consistency. Extensive experiments demonstrate that StableSketcher generates sketches with improved stylistic fidelity, achieving better alignment with prompts compared to the Stable Diffusion baseline. Additionally, we introduce SketchDUO, to the best of our knowledge, the first dataset comprising instance-level sketches paired with captions and question-answer pairs, thereby addressing the limitations of existing datasets that rely on image-label pairs. Our code and dataset will be made publicly available upon acceptance.

尽管扩散模型最近的进展极大地提高了生成图像的质量,但在合成基于像素的手绘草图等抽象表达代表性示例方面仍存在挑战。为了应对这些挑战,我们提出了StableSketcher,这是一个新型框架,能够赋能扩散模型生成具有高提示保真度的手绘草图。在此框架内,我们微调了变分自编码器以优化潜在解码,使其能够更好地捕捉草图的特点。同时,我们结合了一种基于视觉问答的新型奖励函数,用于强化学习,以提高文本-图像对齐和语义一致性。大量实验表明,StableSketcher生成的草图在风格保真度上有所提高,与提示的对齐程度优于Stable Diffusion基线。此外,我们还推出了SketchDUO数据集,据我们所知,它是第一个包含实例级草图与标题和问答对配对的数据集,从而解决了现有数据集仅依赖图像标签对的局限性。我们的代码和数据集将在接受后公开提供。
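下面用一小段代码示意"基于视觉问答(VQA)的奖励"这一思路:对生成的素描提出若干问答对,奖励取回答正确的比例。其中 vqa_model 的调用接口完全是假设的,并非某个真实库的 API,也不是论文的原始奖励定义。

```python
# 极简示意:VQA 奖励 = 回答正确的问答对比例,仅为说明思路。
def vqa_reward(image, qa_pairs, vqa_model):
    """qa_pairs: [(question, expected_answer), ...];vqa_model(image, question) -> 文本答案(假设接口)。"""
    correct = 0
    for question, expected in qa_pairs:
        answer = vqa_model(image, question)
        correct += int(answer.strip().lower() == expected.strip().lower())
    return correct / max(len(qa_pairs), 1)

# 用法示意:用一个总是回答 "yes" 的假模型演示
dummy_vqa = lambda img, q: "yes"
print(vqa_reward(None, [("Is there a cat?", "yes"), ("Is it raining?", "no")], dummy_vqa))  # 0.5
```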

论文及项目相关链接

PDF Under review at IEEE Access. Author-submitted preprint. Not the IEEE-published version

Summary

本文提出一种名为StableSketcher的新型框架,用于增强扩散模型生成手绘素描图的能力。该框架通过微调变分自编码器以优化潜在解码,并引入基于视觉问答的强化学习奖励函数,提高了文本与图像的对齐和语义一致性。实验表明,StableSketcher生成的素描图在风格保真度上有所提高,与提示对齐的效果更好。此外,还推出了SketchDUO数据集,包含带注释的实例级素描图,解决了现有数据集依赖图像标签对的局限性。

Key Takeaways

  1. StableSketcher框架被提出,旨在增强扩散模型生成手绘素描图的能力。
  2. 通过微调变分自编码器优化潜在解码,以更好地捕捉素描特征。
  3. 引入基于视觉问答的强化学习奖励函数,提高文本与图像的对齐和语义一致性。
  4. 实验证明StableSketcher生成的素描图在风格保真度上有所提升。
  5. StableSketcher相比基线模型在提示对齐方面表现更佳。
  6. 推出了SketchDUO数据集,包含带注释的实例级素描图,解决现有数据集的局限性。

Cool Papers

点此查看论文截图

CBDiff: Conditional Bernoulli Diffusion Models for Image Forgery Localization

Authors:Zhou Lei, Pan Gang, Wang Jiahao, Sun Di

Image Forgery Localization (IFL) is a crucial task in image forensics, aimed at accurately identifying manipulated or tampered regions within an image at the pixel level. Existing methods typically generate a single deterministic localization map, which often lacks the precision and reliability required for high-stakes applications such as forensic analysis and security surveillance. To enhance the credibility of predictions and mitigate the risk of errors, we introduce an advanced Conditional Bernoulli Diffusion Model (CBDiff). Given a forged image, CBDiff generates multiple diverse and plausible localization maps, thereby offering a richer and more comprehensive representation of the forgery distribution. This approach addresses the uncertainty and variability inherent in tampered regions. Furthermore, CBDiff innovatively incorporates Bernoulli noise into the diffusion process to more faithfully reflect the inherent binary and sparse properties of forgery masks. Additionally, CBDiff introduces a Time-Step Cross-Attention (TSCAttention), which is specifically designed to leverage semantic feature guidance with temporal steps to improve manipulation detection. Extensive experiments on eight publicly benchmark datasets demonstrate that CBDiff significantly outperforms existing state-of-the-art methods, highlighting its strong potential for real-world deployment.

图像伪造定位(IFL)是图像取证中的一项关键任务,旨在准确地在像素级别识别图像中被操纵或篡改的区域。现有方法通常生成单个确定性定位图,这通常缺乏用于高风险应用(例如法医分析和安全监控)所需的高精度和可靠性。为了提高预测的可靠性并降低错误风险,我们引入了一种先进的条件伯努利扩散模型(CBDiff)。给定伪造图像,CBDiff生成多个多样化和合理的定位图,从而为伪造分布提供更丰富和全面的表示。此方法解决了篡改区域固有的不确定性和变化性。此外,CBDiff创新地将伯努利噪声融入扩散过程,以更真实地反映伪造掩码的固有二进制和稀疏属性。此外,CBDiff还引入了时间步交叉注意力(TSCAttention),专门设计用于利用语义特征指导与时间步的结合来提高操纵检测的性能。在八个公开基准数据集上的大量实验表明,CBDiff显著优于现有的最先进方法,突显其在现实世界部署的强大潜力。
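下面用一小段代码示意"对二值伪造掩码做伯努利加噪"的大致形态:随着时间步增大,越来越多的像素被替换成伯努利随机噪声,从而体现掩码的二值、稀疏特性。加噪核与噪声调度均为假设,并非 CBDiff 的原始定义。

```python
# 极简示意:二值掩码的伯努利前向加噪,仅为说明思路。
import torch

def bernoulli_noising(mask, t, T, p_noise=0.5):
    """mask: [B,1,H,W],取值 {0,1};t/T 越大,被替换为噪声的像素比例越高。"""
    keep_prob = 1.0 - t / T
    keep = torch.bernoulli(torch.full_like(mask, keep_prob))   # 1 表示保留原值
    noise = torch.bernoulli(torch.full_like(mask, p_noise))    # 伯努利噪声
    return keep * mask + (1 - keep) * noise

m0 = (torch.rand(1, 1, 16, 16) > 0.8).float()   # 一个稀疏的二值掩码
mt = bernoulli_noising(m0, t=50, T=100)
```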

论文及项目相关链接

PDF

Summary

本文介绍了图像伪造定位(IFL)任务的重要性及其在图像取证领域的应用。现有方法通常生成单个确定性定位图,缺乏在高级应用(如取证分析和安全监控)中所需的高精度和可靠性。为改进预测的可信度并降低错误风险,本文引入了先进的条件伯努利扩散模型(CBDiff)。给定伪造图像,CBDiff能够生成多个多样化和合理的定位图,从而提供更丰富和全面的伪造分布表示。此方法解决了篡改区域中固有的不确定性和变化性。此外,CBDiff还将伯努利噪声融入扩散过程,更真实地反映了伪造掩膜的固有二进制和稀疏属性。同时,CBDiff引入了时间步交叉注意力(TSCAttention)机制,专门设计用于利用语义特征指导与临时步骤来提高操作检测。在八个公开基准数据集上的广泛实验表明,CBDiff显著优于现有最先进的方法,突显其在现实世界部署的强大潜力。

Key Takeaways

  1. 图像伪造定位(IFL)是图像取证中的关键任务,旨在准确识别图像中的操纵或篡改区域。
  2. 现有方法通常生成单一确定性定位图,这在高风险应用中可能缺乏足够的精度和可靠性。
  3. CBDiff模型能够生成多个多样化和合理的定位图,提供更全面和真实的伪造表示。
  4. CBDiff处理篡改区域中的不确定性和变化性。
  5. CBDiff通过将伯努利噪声融入扩散过程,更真实地反映伪造掩膜的固有属性。
  6. TSCAttention机制有助于提高操作检测的准确性。

Cool Papers

点此查看论文截图

The Intricate Dance of Prompt Complexity, Quality, Diversity, and Consistency in T2I Models

Authors:Xiaofeng Zhang, Aaron Courville, Michal Drozdzal, Adriana Romero-Soriano

Text-to-image (T2I) models offer great potential for creating virtually limitless synthetic data, a valuable resource compared to fixed and finite real datasets. Previous works evaluate the utility of synthetic data from T2I models on three key desiderata: quality, diversity, and consistency. While prompt engineering is the primary means of interacting with T2I models, the systematic impact of prompt complexity on these critical utility axes remains underexplored. In this paper, we first conduct synthetic experiments to motivate the difficulty of generalization w.r.t. prompt complexity and explain the observed difficulty with theoretical derivations. Then, we introduce a new evaluation framework that can compare the utility of real data and synthetic data, and present a comprehensive analysis of how prompt complexity influences the utility of synthetic data generated by commonly used T2I models. We conduct our study across diverse datasets, including CC12M, ImageNet-1k, and DCI, and evaluate different inference-time intervention methods. Our synthetic experiments show that generalizing to more general conditions is harder than the other way round, since the former needs an estimated likelihood that is not learned by diffusion models. Our large-scale empirical experiments reveal that increasing prompt complexity results in lower conditional diversity and prompt consistency, while reducing the synthetic-to-real distribution shift, which aligns with the synthetic experiments. Moreover, current inference-time interventions can augment the diversity of the generations at the expense of moving outside the support of real data. Among those interventions, prompt expansion, by deliberately using a pre-trained language model as a likelihood estimator, consistently achieves the highest performance in both image diversity and aesthetics, even higher than that of real data.

文本到图像(T2I)模型在创建几乎无限的合成数据方面潜力巨大,相比固定且有限的真实数据集,这是一种宝贵的资源。以往的工作从三个关键维度评估T2I模型合成数据的效用:质量、多样性和一致性。虽然提示工程是与T2I模型交互的主要手段,但提示复杂度对这些关键效用维度的系统性影响仍缺乏深入研究。在本文中,我们首先通过合成实验说明随提示复杂度泛化的困难,并用理论推导解释观察到的困难。随后,我们引入了一个可以比较真实数据与合成数据效用的新评估框架,并全面分析了提示复杂度如何影响常用T2I模型生成的合成数据的效用。我们的研究涵盖CC12M、ImageNet-1k和DCI等多个数据集,并评估了不同的推理时干预方法。合成实验表明,向更一般的条件泛化比反方向更难,因为前者需要一个扩散模型并未学习到的似然估计。大规模实证实验表明,提高提示复杂度会降低条件多样性和提示一致性,同时减小合成数据与真实数据之间的分布偏移,这与合成实验的结论一致。此外,当前的推理时干预方法可以提升生成的多样性,但代价是偏离真实数据的支撑范围。在这些干预方法中,提示扩展(有意使用预训练语言模型作为似然估计器)在图像多样性和美学质量上始终取得最高性能,甚至高于真实数据。

论文及项目相关链接

PDF

Summary

本文探讨了文本转图像(T2I)模型合成数据的潜力,并关注于如何通过调整提示复杂性来影响合成数据的效用。文章通过合成实验和理论推导,分析了提示复杂性对T2I模型生成合成数据效用的影响,并引入新的评估框架来比较真实数据和合成数据的效用。研究发现,增加提示复杂性会降低条件多样性和一致性,但可以减少合成数据与真实数据分布的差异。此外,通过扩大提示或使用预训练语言模型作为似然估计器等方法可以在一定程度上提高生成的图像的多样性和美学效果。综合来看,合理利用提示复杂性对优化T2I模型的性能具有积极意义。

Key Takeaways

  1. 文本转图像(T2I)模型能够创建几乎无限合成数据,为数据科学家提供有价值的资源补充有限的真实数据集。
  2. 通过对合成实验的分析,文章揭示了关于提示复杂性对合成数据效用影响的观察难度,并通过理论推导解释了这一现象。
  3. 引入了一个评估框架,用于比较真实数据和由常见T2I模型生成的合成数据的效用。
  4. 实验表明,提高提示复杂性会降低条件多样性和一致性,但可以减少合成数据与真实数据分布的差异。
  5. 当前推理时间干预措施能够增加生成的多样性,但可能会偏离真实数据的范围。
  6. 通过扩大提示或使用预训练语言模型作为似然估计器等方法可以有效提高图像生成的质量和美学效果。

Cool Papers

点此查看论文截图

SCEESR: Semantic-Control Edge Enhancement for Diffusion-Based Super-Resolution

Authors:Yun Kai Zhuang

Real-world image super-resolution (Real-ISR) must handle complex degradations and inherent reconstruction ambiguities. While generative models have improved perceptual quality, a key trade-off remains with computational cost. One-step diffusion models offer speed but often produce structural inaccuracies due to distillation artifacts. To address this, we propose a novel SR framework that enhances a one-step diffusion model using a ControlNet mechanism for semantic edge guidance. This integrates edge information to provide dynamic structural control during single-pass inference. We also introduce a hybrid loss combining L2, LPIPS, and an edge-aware AME loss to optimize for pixel accuracy, perceptual quality, and geometric precision. Experiments show our method effectively improves structural integrity and realism while maintaining the efficiency of one-step generation, achieving a superior balance between output quality and inference speed. The results of test datasets will be published at https://drive.google.com/drive/folders/1amddXQ5orIyjbxHgGpzqFHZ6KTolinJF?usp=drive_link and the related code will be published at https://github.com/ARBEZ-ZEBRA/SCEESR.

现实世界图像超分辨率(Real-ISR)需要处理复杂的退化和固有的重建歧义。虽然生成模型已经提高了感知质量,但计算成本之间的权衡仍然是一个关键问题。一步扩散模型虽然速度快,但由于蒸馏产生的伪影通常会导致结构不准确。为了解决这个问题,我们提出了一种新的SR框架,该框架使用ControlNet机制增强了一步扩散模型,用于语义边缘引导。这通过集成边缘信息在单次传递推理过程中提供动态结构控制。我们还引入了一种混合损失,结合了L2、LPIPS和边缘感知AME损失,以优化像素精度、感知质量和几何精度。实验表明,我们的方法在保持一步生成的效率的同时,有效地提高了结构的完整性和逼真度,在输出质量和推理速度之间达到了优越的平衡。测试数据集的结果将发布在:https://drive.google.com/drive/folders/1amddXQ5orIyjbxHgGpzqFHZ6KTolinJF?usp=drive_link,相关代码将发布在https://github.com/ARBEZ-ZEBRA/SCEESR。
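下面给出混合损失的一个极简示意:像素 L2 项、可外部传入的感知项(例如 LPIPS 一类的度量,此处作为可调用参数 perceptual_fn 传入),以及用 Sobel 梯度差异近似的边缘项。各项权重与论文中 AME 损失的具体定义均为假设。

```python
# 极简示意:L2 + 感知项 + 边缘感知项的混合超分损失,仅为说明思路。
import torch
import torch.nn.functional as F

def sobel_edges(x):
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(-1, -2)
    gray = x.mean(dim=1, keepdim=True)
    return torch.sqrt(F.conv2d(gray, kx, padding=1) ** 2 +
                      F.conv2d(gray, ky, padding=1) ** 2 + 1e-6)

def hybrid_sr_loss(pred, target, perceptual_fn=None, w_pix=1.0, w_per=1.0, w_edge=0.5):
    loss = w_pix * F.mse_loss(pred, target)
    if perceptual_fn is not None:                      # 感知度量由外部传入(假设)
        loss = loss + w_per * perceptual_fn(pred, target).mean()
    loss = loss + w_edge * F.l1_loss(sobel_edges(pred), sobel_edges(target))
    return loss

loss = hybrid_sr_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64))
```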

论文及项目相关链接

PDF 10 pages, 5 figures, 3 tables

Summary

本文提出了一种基于ControlNet机制的新型超分辨率(SR)框架,该框架旨在增强一步扩散模型,通过语义边缘引导来解决结构失真问题。该框架结合了边缘信息,在单次传递推理过程中提供动态结构控制。同时,引入了一种结合L2、LPIPS和边缘感知AME损失的混合损失,以优化像素精度、感知质量和几何精度。实验表明,该方法在保持一步生成效率的同时,有效提高了结构完整性和逼真度,实现了输出质量和推理速度之间的卓越平衡。

Key Takeaways

  1. 提出的SR框架基于ControlNet机制增强了一步扩散模型,解决结构失真问题。
  2. 该框架通过结合边缘信息,在单次传递推理过程中提供动态结构控制。
  3. 引入了一种混合损失,包括L2、LPIPS和边缘感知AME损失,以优化图像质量的不同方面。
  4. 实验表明,该方法在提高结构完整性和逼真度的同时,保持了高效的一步生成。
  5. 该方法实现了输出质量和推理速度之间的平衡。
  6. 测试数据集的结果将在指定的Google Drive链接上发布。

Cool Papers

点此查看论文截图

DP$^2$O-SR: Direct Perceptual Preference Optimization for Real-World Image Super-Resolution

Authors:Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Shihao Wang, Tianhe Wu, Qiaosi Yi, Shuai Li, Lei Zhang

Benefiting from pre-trained text-to-image (T2I) diffusion models, real-world image super-resolution (Real-ISR) methods can synthesize rich and realistic details. However, due to the inherent stochasticity of T2I models, different noise inputs often lead to outputs with varying perceptual quality. Although this randomness is sometimes seen as a limitation, it also introduces a wider perceptual quality range, which can be exploited to improve Real-ISR performance. To this end, we introduce Direct Perceptual Preference Optimization for Real-ISR (DP$^2$O-SR), a framework that aligns generative models with perceptual preferences without requiring costly human annotations. We construct a hybrid reward signal by combining full-reference and no-reference image quality assessment (IQA) models trained on large-scale human preference datasets. This reward encourages both structural fidelity and natural appearance. To better utilize perceptual diversity, we move beyond the standard best-vs-worst selection and construct multiple preference pairs from outputs of the same model. Our analysis reveals that the optimal selection ratio depends on model capacity: smaller models benefit from broader coverage, while larger models respond better to stronger contrast in supervision. Furthermore, we propose hierarchical preference optimization, which adaptively weights training pairs based on intra-group reward gaps and inter-group diversity, enabling more efficient and stable learning. Extensive experiments across both diffusion- and flow-based T2I backbones demonstrate that DP$^2$O-SR significantly improves perceptual quality and generalizes well to real-world benchmarks.

受益于预训练的文本到图像(T2I)扩散模型,现实世界图像超分辨率(Real-ISR)方法可以合成丰富且逼真的细节。然而,由于T2I模型固有的随机性,不同的噪声输入往往导致输出的感知质量参差不齐。虽然这种随机性有时被视为局限,但它也带来了更宽的感知质量范围,可以被用来提升Real-ISR性能。为此,我们提出了面向Real-ISR的直接感知偏好优化(DP$^2$O-SR)框架,该框架无需昂贵的人工标注即可使生成模型与感知偏好对齐。我们结合在大规模人类偏好数据集上训练的全参考和无参考图像质量评估(IQA)模型,构建了一个混合奖励信号。该奖励同时鼓励结构保真和自然外观。为了更好地利用感知多样性,我们超越标准的"最好对最差"选择,从同一模型的输出中构造多组偏好对。我们的分析表明,最优的选择比例取决于模型容量:较小的模型受益于更宽的覆盖范围,而较大的模型对监督信号中更强的对比反应更好。此外,我们提出了分层偏好优化,根据组内奖励差距和组间多样性自适应地加权训练对,从而实现更高效、更稳定的学习。在基于扩散和基于流(flow-based)的T2I主干上的大量实验表明,DP$^2$O-SR显著提高了感知质量,并能很好地泛化到真实世界基准测试。
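下面用一小段代码示意"混合奖励 + 多偏好对"的构造方式:对同一输入的多个候选输出,用(此处以随机函数代替的)全参考与无参考 IQA 打分并加权成混合奖励;再按奖励排序,取前 k 与后 k 的样本两两配对,而不是只取最好对最差。IQA 函数、权重与配比均为假设,分层加权部分未展示。

```python
# 极简示意:混合奖励与偏好对构造,仅为说明思路。
import torch

def hybrid_reward(candidates, reference, fr_iqa, nr_iqa, w_fr=0.5):
    """candidates: [N,C,H,W];reference: [C,H,W];fr_iqa/nr_iqa 为外部传入的打分函数(假设)。"""
    fr = torch.stack([fr_iqa(c, reference) for c in candidates])
    nr = torch.stack([nr_iqa(c) for c in candidates])
    return w_fr * fr + (1 - w_fr) * nr

def build_preference_pairs(rewards, ratio=0.5):
    """按奖励排序后取前 k 与后 k 两两配对;ratio 控制覆盖范围(文中指出小模型宜取更宽)。"""
    order = torch.argsort(rewards, descending=True)
    k = max(1, int(len(rewards) * ratio / 2))
    return [(int(w), int(l)) for w in order[:k] for l in order[-k:]]

cands, ref = torch.rand(6, 3, 32, 32), torch.rand(3, 32, 32)
r = hybrid_reward(cands, ref, lambda c, x: torch.rand(()), lambda c: torch.rand(()))
print(build_preference_pairs(r))
```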

论文及项目相关链接

PDF Accept by NeurIPS 2025

摘要

受益于预训练的文本到图像(T2I)扩散模型,现实图像超分辨率(Real-ISR)方法可以合成丰富且逼真的细节。然而,由于T2I模型固有的随机性,不同的噪声输入通常会导致输出品的感知质量有所不同。虽然这种随机性有时被视为限制,但它也引入了更广泛的感知质量范围,可以被用来提高Real-ISR性能。为此,我们引入了面向Real-ISR的直接感知偏好优化(DP$^2$O-SR)框架,该框架能够使生成模型与感知偏好对齐,而无需昂贵的人力注释。我们结合全参考和无参考图像质量评估(IQA)模型,构建了一个混合奖励信号,这些模型是在大规模人类偏好数据集上训练的。该奖励鼓励结构保真和自然外观。为了更好地利用感知多样性,我们从同一模型的输出中构建多个偏好对,而不仅仅是最好的与最差的选项。我们的分析表明,最佳选择比例取决于模型容量:小型模型从更广泛的覆盖中受益,而大型模型对更强的监督对比有更好的反应。此外,我们提出了分层偏好优化,它根据组内奖励差距和组间多样性自适应地加权训练对,使学习更加高效和稳定。跨扩散和流基T2I主干的广泛实验表明,DP$^2$O-SR显著提高了感知质量,并很好地推广到了现实世界基准测试。

关键见解

  1. 受益于预训练文本到图像(T2I)扩散模型,现实图像超分辨率(Real-ISR)方法可以合成高质量细节。
  2. T2I模型的随机性导致输出感知质量存在差异,这既是一种限制也是一种可用来提高Real-ISR性能的资源。
  3. 引入Direct Perceptual Preference Optimization for Real-ISR(DP$^2$O-SR)框架,该框架与感知偏好对齐,无需昂贵的人力注释。
  4. 通过结合全参考和无参考IQA模型,构建混合奖励信号,鼓励结构保真和自然外观。
  5. 利用多个偏好对来提高感知多样性的利用,最佳选择比例取决于模型容量。
  6. 提出分层偏好优化,使学习更加高效和稳定。
  7. DP$^2$O-SR显著提高了感知质量,并在多个基准测试中表现良好。

Cool Papers

点此查看论文截图

Identity-Preserving Image-to-Video Generation via Reward-Guided Optimization

Authors:Liao Shen, Wentao Jiang, Yiran Zhu, Jiahe Li, Tiezheng Ge, Zhiguo Cao, Bo Zheng

Recent advances in image-to-video (I2V) generation have achieved remarkable progress in synthesizing high-quality, temporally coherent videos from static images. Among all the applications of I2V, human-centric video generation includes a large portion. However, existing I2V models encounter difficulties in maintaining identity consistency between the input human image and the generated video, especially when the person in the video exhibits significant expression changes and movements. This issue becomes critical when the human face occupies merely a small fraction of the image. Since humans are highly sensitive to identity variations, this poses a critical yet under-explored challenge in I2V generation. In this paper, we propose Identity-Preserving Reward-guided Optimization (IPRO), a novel video diffusion framework based on reinforcement learning to enhance identity preservation. Instead of introducing auxiliary modules or altering model architectures, our approach introduces a direct and effective tuning algorithm that optimizes diffusion models using a face identity scorer. To improve performance and accelerate convergence, our method backpropagates the reward signal through the last steps of the sampling chain, enabling richer gradient feedback. We also propose a novel facial scoring mechanism that treats faces in ground-truth videos as facial feature pools, providing multi-angle facial information to enhance generalization. A KL-divergence regularization is further incorporated to stabilize training and prevent overfitting to the reward signal. Extensive experiments on Wan 2.2 I2V model and our in-house I2V model demonstrate the effectiveness of our method. Our project and code are available at https://ipro-alimama.github.io/.

在图像到视频(I2V)生成方面的最新进展,已经在从静态图像合成高质量、时间连贯的视频方面取得了显著进展。在所有I2V应用中,以人类为中心的视频生成占据很大一部分。然而,现有的I2V模型在保持输入人像与生成视频之间身份一致性方面遇到困难,尤其是在视频中人物表情变化和动作显著时。当人脸只占图像一小部分时,这个问题变得更为关键。由于人类对身份变化高度敏感,这为I2V生成提出了一个至关重要但尚未被充分研究的挑战。在本文中,我们提出了基于强化学习的身份保留奖励引导优化(IPRO)新型视频扩散框架,以提高身份保留能力。我们的方法不是引入辅助模块或改变模型架构,而是引入了一种直接有效的调整算法,该算法使用面部身份评分者对扩散模型进行优化。为了提高性能和加速收敛,我们的方法通过采样链的最后几步反向传播奖励信号,从而实现更丰富的梯度反馈。我们还提出了一种新型的面部评分机制,将真实视频中的面部视为面部特征池,提供多角度的面部信息以增强泛化能力。还进一步融入了KL散度正则化,以稳定训练并防止对奖励信号的过度拟合。在Wan 2.2 I2V模型和我们的内部I2V模型上的大量实验证明了我们的方法的有效性。我们的项目和代码可在https://ipro-alimama.github.io/找到。
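下面是"只对采样链最后几步反传奖励信号"这一思路的极度简化示意:前面的步骤不保留梯度,最后 K 步保留梯度,用(此处以均值代替的)人脸身份奖励反传,并用与参考模型输出的差异充当 KL 正则的粗略代理。采样器、奖励模型与正则形式都是假设,并非 IPRO 的原始实现。

```python
# 极简示意:仅在最后 K 个采样步上反传奖励,并加正则项,仅为说明思路。
import torch

def reward_finetune_step(model, ref_model, x_T, reward_fn, steps=10, K=3, beta=0.1):
    x = x_T
    for t in reversed(range(steps)):
        with torch.set_grad_enabled(t < K):      # 只有最后 K 步参与反向传播
            x = model(x, t)
        if t >= K:
            x = x.detach()
    with torch.no_grad():
        x_ref = ref_model(x_T.clone(), 0)        # 参考模型输出(极度简化)
    loss = -reward_fn(x) + beta * ((x - x_ref) ** 2).mean()
    loss.backward()
    return loss.item()

# 用法示意:扩散模型、参考模型与奖励都用简单函数代替
net = torch.nn.Linear(8, 8)
reward_finetune_step(lambda x, t: net(x), lambda x, t: x, torch.randn(4, 8),
                     lambda x: x.mean(), steps=6, K=2)
```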

论文及项目相关链接

PDF

Summary
该论文针对图像到视频生成中的人为中心的视频生成问题,提出了一种基于强化学习的身份保留奖励引导优化(IPRO)框架。针对现有模型在人脸表情变化和移动时身份一致性维护的困难,IPRO通过优化扩散模型,使用面部身份评分者进行直接有效的调整算法。该框架还引入了面部评分机制和KL-散度正则化,以提高性能和加速收敛,同时稳定训练并防止过度拟合奖励信号。

Key Takeaways

  1. 图像到视频生成(I2V)领域面临身份一致性维护的挑战,特别是在人脸表情变化和移动时。
  2. IPRO框架通过强化学习解决了这一挑战,提供了一种身份保留奖励引导的优化方法。
  3. IPRO采用直接有效的调整算法,使用面部身份评分者对扩散模型进行优化。
  4. 引入面部评分机制,利用地面真实视频中的面部特征池提供多角度面部信息,提高泛化能力。
  5. KL-散度正则化被纳入以稳定训练并防止过度拟合奖励信号。
  6. 在Wan 2.2 I2V模型和内部I2V模型上的广泛实验证明了IPRO方法的有效性。

Cool Papers

点此查看论文截图

RODS: Robust Optimization Inspired Diffusion Sampling for Detecting and Reducing Hallucination in Generative Models

Authors:Yiqi Tian, Pengfei Jin, Mingze Yuan, Na Li, Bo Zeng, Quanzheng Li

Diffusion models have achieved state-of-the-art performance in generative modeling, yet their sampling procedures remain vulnerable to hallucinations-often stemming from inaccuracies in score approximation. In this work, we reinterpret diffusion sampling through the lens of optimization and introduce RODS (Robust Optimization-inspired Diffusion Sampler), a novel method that detects and corrects high-risk sampling steps using geometric cues from the loss landscape. RODS enforces smoother sampling trajectories and adaptively adjusts perturbations, reducing hallucinations without retraining and at minimal additional inference cost. Experiments on AFHQv2, FFHQ, and 11k-hands demonstrate that RODS maintains comparable image quality and preserves generation diversity. More importantly, it improves both sampling fidelity and robustness, detecting over 70% of hallucinated samples and correcting more than 25%, all while avoiding the introduction of new artifacts. We release our code at https://github.com/Yiqi-Verna-Tian/RODS.

扩散模型在生成建模方面取得了最先进的性能,但其采样程序仍然容易受幻觉影响,这通常源于评分估计不准确。在这项工作中,我们从优化的角度重新解释了扩散采样,并引入了一种新型方法RODS(受稳健优化启发的扩散采样器)。该方法利用损失景观的几何线索来检测和纠正高风险的采样步骤。RODS强制平滑采样轨迹并自适应调整扰动,无需重新训练即可减少幻觉,并且只需极少的额外推理成本。在AFHQv2、FFHQ和11k-hands上的实验表明,RODS保持了相当的图片质量并保持了生成多样性。更重要的是,它提高了采样保真度和稳健性,检测到超过70%的幻觉样本并纠正了超过25%,同时避免了新伪影的产生。我们在https://github.com/Yiqi-Verna-Tian/RODS发布了我们的代码。
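下面用一小段代码示意"用几何线索检测并修正高风险采样步"的一种可能形式:当相邻两步的更新方向发生剧烈反转(余弦相似度低于阈值)时,视为高风险步并缩小该步的更新幅度。这里的风险判据与修正方式只是演示用的假设,并非 RODS 论文中的具体判据。

```python
# 极简示意:采样过程中检测方向突变并缩小更新幅度,仅为说明思路。
import torch
import torch.nn.functional as F

def robust_sampling(x, denoise_step, steps=20, angle_thresh=0.0, shrink=0.5):
    prev_update = None
    for t in reversed(range(steps)):
        x_new = denoise_step(x, t)
        update = x_new - x
        if prev_update is not None:
            cos = F.cosine_similarity(update.flatten(), prev_update.flatten(), dim=0)
            if cos < angle_thresh:               # 更新方向剧烈反转:视为高风险步
                x_new = x + shrink * update      # 缩小该步更新
                update = x_new - x
        prev_update, x = update, x_new
    return x

x0 = robust_sampling(torch.randn(1, 4, 16, 16), lambda x, t: 0.9 * x, steps=10)
```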

论文及项目相关链接

PDF

Summary

本文介绍了扩散模型在生成建模中的最新进展。尽管其已达到业界领先的性能,但采样过程仍容易产生幻觉(hallucination),主要源于分数近似的不准确。本文从优化的视角重新解读扩散采样,提出一种新型采样方法——RODS(基于稳健优化的扩散采样器),该方法利用损失景观的几何线索来检测并纠正高风险的采样步骤。RODS可实现更平滑的采样轨迹,自适应地调整扰动,在不重新训练的情况下减少幻觉,且几乎不增加额外的推理成本。在AFHQv2、FFHQ和11k-hands上的实验表明,RODS保持了相当的图像质量并保留了生成多样性。更重要的是,它提高了采样保真度和稳健性,能够检测出超过70%的幻觉样本并纠正其中超过25%,同时避免引入新的伪影。我们已将代码发布在 https://github.com/Yiqi-Verna-Tian/RODS。

Key Takeaways

  1. 扩散模型在生成建模领域取得了最先进的性能。
  2. 采样过程中容易出现hallucination问题,主要由于评分估算的不准确。
  3. 提出了一种新型采样方法——RODS(基于稳健优化的扩散采样器)。
  4. RODS利用损失景观的几何线索来检测和纠正高风险采样步骤。
  5. RODS能自适应调整扰动并达到更平滑的采样轨迹。
  6. RODS在不进行再训练的情况下减少了hallucination,且几乎没有增加额外的推理成本。

Cool Papers

点此查看论文截图

Latent Diffusion Models with Masked AutoEncoders

Authors:Junho Lee, Jeongwoo Shin, Hyungwook Choi, Joonseok Lee

In spite of the remarkable potential of Latent Diffusion Models (LDMs) in image generation, the desired properties and optimal design of the autoencoders have been underexplored. In this work, we analyze the role of autoencoders in LDMs and identify three key properties: latent smoothness, perceptual compression quality, and reconstruction quality. We demonstrate that existing autoencoders fail to simultaneously satisfy all three properties, and propose Variational Masked AutoEncoders (VMAEs), taking advantage of the hierarchical features maintained by Masked AutoEncoders. We integrate VMAEs into the LDM framework, introducing Latent Diffusion Models with Masked AutoEncoders (LDMAEs). Our code is available at https://github.com/isno0907/ldmae.

尽管潜在扩散模型(Latent Diffusion Models,LDM)在图像生成方面潜力显著,但自编码器的理想属性与最优设计仍未得到充分探索。在这项工作中,我们分析了自编码器在LDM中的作用,并确定了三个关键属性:潜在空间平滑性、感知压缩质量和重建质量。我们证明现有自编码器无法同时满足这三个属性,并提出了变分掩码自编码器(Variational Masked AutoEncoders,VMAEs),利用掩码自编码器所保留的层次化特征。我们将VMAEs集成到LDM框架中,提出了带掩码自编码器的潜在扩散模型(Latent Diffusion Models with Masked AutoEncoders,LDMAEs)。我们的代码可在 https://github.com/isno0907/ldmae 找到。

论文及项目相关链接

PDF

Summary

潜在扩散模型(LDM)在图像生成方面具有显著潜力,但自动编码器的理想属性和设计尚未得到充分探索。本文分析了自动编码器在LDM中的作用,并确定了三个关键属性:潜在平滑性、感知压缩质量和重建质量。我们证明现有自动编码器无法满足所有这三个属性,并提出采用掩膜自动编码器(Masked AutoEncoders)优势的变分掩膜自动编码器(VMAEs)。将其集成到LDM框架中,引入带有掩膜自动编码器的潜在扩散模型(LDMAEs)。我们的代码位于https://github.com/isno0907/ldmae。

Key Takeaways

  1. 潜在扩散模型(LDM)在图像生成方面具有显著潜力。
  2. 自动编码器在LDM中的作用尚未充分探索。
  3. 确定自动编码器的三个关键属性:潜在平滑性、感知压缩质量和重建质量。
  4. 现有自动编码器无法满足所有这三个属性。
  5. 提出变分掩膜自动编码器(VMAEs),利用掩膜自动编码器的层次特征。
  6. 将VMAEs集成到LDM框架中,形成带有掩膜自动编码器的潜在扩散模型(LDMAEs)。

Cool Papers

点此查看论文截图

Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback

Authors:Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm

Paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability can not be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, deteriorating the model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.

医疗数据的匮乏严重限制了诊断类机器学习模型的泛化能力,因为小规模的临床数据集无法覆盖疾病变异的全部范围。为了解决这一问题,扩散模型(DMs)被视为合成图像生成与数据增强的有前途的途径。然而,它们经常产生医学上不准确的图像,从而降低模型性能。在数据稀缺、质量重于数量的情况下,领域专家知识对于合成能正确编码临床信息的图像至关重要。现有融入人类反馈的方法,如强化学习(RL)和直接偏好优化(DPO),要么依赖稳健的奖励函数,要么需要劳动密集型的专家评估。多模态大型语言模型(MLLMs)的最新进展显示出强大的视觉推理能力,使其成为合适的评估者。在这项工作中,我们提出了一个名为MAGIC(Medically Accurate Generation of Images through AI-Expert Collaboration,通过AI-专家协作生成医学上准确的图像)的新框架,用于合成临床上准确的皮肤病图像以进行数据增强。我们的方法创造性地将专家定义的标准转化为对扩散模型图像合成的可操作反馈,在显著提高临床准确性的同时减少了直接的人力工作量。实验表明,我们的方法大大提高了合成皮肤病图像的临床质量,输出结果与皮肤科医生的评估相符。此外,使用这些合成图像增强训练数据,在具有挑战性的20类皮肤病分类任务中将诊断准确率提高了9.02%,在少样本设置下提高了13.89%。

论文及项目相关链接

PDF NeurIPS 2025

摘要

医疗数据的匮乏严重限制了诊断机器学习模型的泛化能力。扩散模型在合成图像生成和增强方面展现出巨大潜力,但往往产生医学上不准确的图像,影响模型性能。专家领域知识对于合成正确编码临床信息的图像至关重要,特别是在数据稀缺、质量胜过数量的情况下。结合人类反馈的现有方法,如强化学习和直接偏好优化,依赖于稳健的奖励函数或需要大量专家评估。最近多模态大型语言模型的进展显示出其强大的视觉推理能力,使其成为评估者的合适候选。本研究提出一种名为MAGIC(通过人工智能专家合作进行医学准确图像生成)的新框架,用于合成用于数据增强的临床准确皮肤病图像。我们的方法创造性地将专家定义的准则转化为对扩散模型的图像合成可操作反馈,显著提高了临床准确性,同时减少了直接人力工作量。实验表明,我们的方法大大提高了合成皮肤病图像的临床质量,输出结果与皮肤科医生的评估相符。此外,使用这些合成图像增强训练数据,在具有挑战性的20种皮肤病分类任务中,诊断准确率提高9.02%,在少量样本情况下提高13.89%。

关键见解

  1. 扩散模型在医疗图像生成和增强中具有潜力,但缺乏医学准确性。
  2. 专家领域知识对于合成临床准确图像至关重要。
  3. 现有结合人类反馈的方法依赖于稳健的奖励函数或需要大量专家评估。
  4. 多模态大型语言模型展现出强大的视觉推理能力,适用于图像评估。
  5. 提出的MAGIC框架通过人工智能与专家的合作,能合成临床准确的皮肤病图像。
  6. MAGIC方法提高了合成皮肤病图像的临床质量,并与皮肤科医生评估相符。

Cool Papers

点此查看论文截图

Beyond Masked and Unmasked: Discrete Diffusion Models via Partial Masking

Authors:Chen-Hao Chao, Wei-Fang Sun, Hanwen Liang, Chun-Yi Lee, Rahul G. Krishnan

Masked diffusion models (MDM) are powerful generative models for discrete data that generate samples by progressively unmasking tokens in a sequence. Each token can take one of two states: masked or unmasked. We observe that token sequences often remain unchanged between consecutive sampling steps; consequently, the model repeatedly processes identical inputs, leading to redundant computation. To address this inefficiency, we propose the Partial masking scheme (Prime), which augments MDM by allowing tokens to take intermediate states interpolated between the masked and unmasked states. This design enables the model to make predictions based on partially observed token information, and facilitates a fine-grained denoising process. We derive a variational training objective and introduce a simple architectural design to accommodate intermediate-state inputs. Our method demonstrates superior performance across a diverse set of generative modeling tasks. On text data, it achieves a perplexity of 15.36 on OpenWebText, outperforming previous MDM (21.52), autoregressive models (17.54), and their hybrid variants (17.58), without relying on an autoregressive formulation. On image data, it attains competitive FID scores of 3.26 on CIFAR-10 and 6.98 on ImageNet-32, comparable to leading continuous generative models.

掩码扩散模型(MDM)是为离散数据设计的强大生成模型,它通过逐步揭示序列中的令牌来生成样本。每个令牌可以处于两种状态之一:掩码状态或未掩码状态。我们观察到,在连续的采样步骤之间,令牌序列往往保持不变;因此,模型会反复处理相同的输入,导致冗余计算。为了解决这种低效问题,我们提出了部分掩码方案(Prime),它通过允许令牌处于掩码状态和未掩码状态之间的中间状态,来增强MDM的功能。这种设计使模型能够基于部分观察到的令牌信息进行预测,并促进了精细的降噪过程。我们推导了变分训练目标,并引入了一种简单的架构设计来适应中间状态输入。我们的方法在多种生成建模任务上表现出卓越的性能。在文本数据上,它在OpenWebText上的困惑度达到15.36,优于先前的MDM(21.52)、自回归模型(17.54)及其混合变体(17.58),且无需依赖自回归公式。在图像数据上,它在CIFAR-10上的FID得分为3.26,在ImageNet-32上的得分为6.98,与领先的连续生成模型相当。
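下面用一小段代码示意"令牌可以处于掩码与未掩码之间的中间状态"的一种理解方式:把每个 token 的 id 分解成若干进制位(子状态),只揭开其中一部分,token 就处于"部分可见"的中间状态。这种分解方式与调度仅用于说明 partial masking 的思路,并非论文 Prime 的原始构造。

```python
# 极简示意:把 token 分解为进制位并部分掩码,形成介于掩码/未掩码之间的中间状态。
import torch

def to_digits(tokens, base=16, n_digits=3):
    """tokens: [B, L] 的整数 id -> [B, L, n_digits](低位在前)。"""
    digits, x = [], tokens.clone()
    for _ in range(n_digits):
        digits.append(x % base)
        x = x // base
    return torch.stack(digits, dim=-1)

def partial_mask(digits, visible_digits, mask_value=-1):
    """只保留每个 token 的前 visible_digits 位,其余位置用 mask_value 表示未揭开。"""
    out = digits.clone()
    out[..., visible_digits:] = mask_value
    return out

tokens = torch.randint(0, 16 ** 3, (2, 5))
d = to_digits(tokens)
print(partial_mask(d, 0))   # 完全掩码
print(partial_mask(d, 2))   # 部分可见:介于掩码与未掩码之间的中间状态
```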

论文及项目相关链接

PDF Published at NeurIPS 2025. Project Page: https://chen-hao-chao.github.io/mdm-prime

Summary

掩码扩散模型(MDM)在离散数据生成任务中表现出强大的性能,它通过逐步揭开序列中的令牌来生成样本。研究团队提出了一种名为Prime的部分掩码方案,允许令牌处于掩码与未掩码之间的中间状态,以提高模型的预测能力并细化去噪过程。新方案在多种生成建模任务上表现出卓越性能:在OpenWebText文本上的困惑度降至15.36,优于此前的MDM、自回归模型及其混合变体;在CIFAR-10和ImageNet-32图像数据上的FID得分也具有竞争力。

Key Takeaways

  1. Masked Diffusion Models (MDM) 是离散数据生成的有力工具,通过逐步揭示令牌来实现样本生成。
  2. 令牌在掩盖与未掩盖之间可以存在中间状态,这一特性称为Partial masking scheme(Prime)。
  3. Prime方案提高了模型的预测能力,并促进了更精细的去噪过程。
  4. 在生成建模任务上,Prime方案表现卓越,尤其是在处理文本和图像数据时。
  5. 在OpenWebText文本数据上,新方法的困惑度低于其他模型。
  6. 在CIFAR-10和ImageNet-32图像数据上,新方法的FID得分具有竞争力,与领先的连续生成模型相当。

Cool Papers

点此查看论文截图

FairGen: Controlling Sensitive Attributes for Fair Generations in Diffusion Models via Adaptive Latent Guidance

Authors:Mintong Kang, Vinayshekhar Bannihatti Kumar, Shamik Roy, Abhishek Kumar, Sopan Khosla, Balakrishnan Murali Narayanaswamy, Rashmi Gangadharaiah

Text-to-image diffusion models often exhibit biases toward specific demographic groups, such as generating more males than females when prompted to generate images of engineers, raising ethical concerns and limiting their adoption. In this paper, we tackle the challenge of mitigating generation bias towards any target attribute value (e.g., “male” for “gender”) in diffusion models while preserving generation quality. We propose FairGen, an adaptive latent guidance mechanism which controls the generation distribution during inference. In FairGen, a latent guidance module dynamically adjusts the diffusion process to enforce specific attributes, while a memory module tracks the generation statistics and steers latent guidance to align with the targeted fair distribution of the attribute values. Furthermore, we address the limitations of existing datasets by introducing the Holistic Bias Evaluation (HBE) benchmark, which covers diverse domains and incorporates complex prompts to assess bias more comprehensively. Extensive evaluations on HBE and Stable Bias datasets demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction (e.g., 68.5% gender bias reduction on Stable Diffusion 2). Ablation studies highlight FairGen’s ability to flexibly control the output distribution at any user-specified granularity, ensuring adaptive and targeted bias mitigation.

文本到图像的扩散模型往往会对特定的群体产生偏见,例如在提示生成工程师图像时,生成更多的男性而不是女性,这引发了伦理担忧,并限制了其采用。在本文中,我们解决了在扩散模型中缓解对任何目标属性值的生成偏见(例如,“性别”中的“男性”)的挑战,同时保持生成质量。我们提出了FairGen,这是一种自适应的潜在指导机制,用于控制推理过程中的生成分布。在FairGen中,潜在指导模块动态调整扩散过程以强制执行特定属性,而记忆模块则跟踪生成统计信息,并引导潜在指导与属性值的公平分布目标保持一致。此外,我们通过引入全面的偏见评估(HBE)基准,解决了现有数据集的限制,该基准涵盖多个领域并包含复杂的提示,以更全面地评估偏见。在HBE和Stable Bias数据集上的广泛评估表明,FairGen优于现有的偏见缓解方法,实现了显著的偏见减少(例如在Stable Diffusion 2上实现了68.5%的性别偏见减少)。消融研究突出了FairGen在用户指定的任何粒度上灵活控制输出分布的能力,确保自适应和有针对性的偏见缓解。
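下面用一小段代码示意"记忆模块 + 分布指示器"的协同方式:记忆模块统计已生成样本各属性取值的数量,分布指示器据此选出当前最欠代表的取值,并在潜变量上叠加对应的属性方向。属性方向、叠加方式和属性判定都为演示而假设,并非 FairGen 的原始实现。

```python
# 极简示意:跟踪生成统计并把生成引向目标属性分布,仅为说明思路。
import torch

class DistributionIndicator:
    def __init__(self, target):                      # target: 目标比例,如 {"male": 0.5, "female": 0.5}
        self.target = target
        self.counts = {k: 0 for k in target}

    def next_value(self):
        total = max(sum(self.counts.values()), 1)
        # 选择"实际占比 - 目标占比"最小的取值(当前最欠代表)
        return min(self.target, key=lambda k: self.counts[k] / total - self.target[k])

    def update(self, value):
        self.counts[value] += 1

indicator = DistributionIndicator({"male": 0.5, "female": 0.5})
directions = {"male": torch.randn(4), "female": torch.randn(4)}   # 假设已学到的属性潜方向

for _ in range(6):
    v = indicator.next_value()
    latent = torch.randn(4) + 0.5 * directions[v]   # 在潜变量上叠加对应属性方向(极度简化)
    indicator.update(v)                             # 实际中应由属性检测器判定生成结果后再更新
print(indicator.counts)
```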

论文及项目相关链接

PDF EMNLP 2025 Main Conference (Camera Ready)

Summary

本文关注文本到图像扩散模型中的生成偏见问题,并提出了FairGen方案以缓解对目标属性值的生成偏见,同时保持生成质量。通过引入自适应潜在指导机制,FairGen能够在推理过程中控制生成分布。此外,本文还介绍了全新的Holistic Bias Evaluation(HBE)基准测试,以更全面地评估不同领域的复杂提示中的偏见。实验表明,FairGen在HBE和Stable Bias数据集上的表现优于现有的偏见缓解方法,实现了显著的偏见减少。

Key Takeaways

  1. 文本到图像扩散模型存在对特定人群(如工程师中的男性)的偏见生成问题。
  2. FairGen方案被提出以缓解对目标属性值的生成偏见,同时保持生成质量。
  3. FairGen引入了自适应潜在指导机制,能在推理过程中控制生成分布。
  4. 首次介绍了Holistic Bias Evaluation(HBE)基准测试,用于更全面地评估不同领域的复杂提示中的偏见。
  5. 实验证明FairGen在HBE和Stable Bias数据集上的表现优于现有方法,实现显著偏见减少。
  6. FairGen能够灵活控制用户指定粒度的输出分布,具有自适应和针对性的偏见缓解能力。

Cool Papers

点此查看论文截图

Graph Representation Learning with Diffusion Generative Models

Authors:Daniel Wesego

Diffusion models have established themselves as state-of-the-art generative models across various data modalities, including images and videos, due to their ability to accurately approximate complex data distributions. Unlike traditional generative approaches such as VAEs and GANs, diffusion models employ a progressive denoising process that transforms noise into meaningful data over multiple iterative steps. This gradual approach enhances their expressiveness and generation quality. Not only that, diffusion models have also been shown to extract meaningful representations from data while learning to generate samples. Despite their success, the application of diffusion models to graph-structured data remains relatively unexplored, primarily due to the discrete nature of graphs, which necessitates discrete diffusion processes distinct from the continuous methods used in other domains. In this work, we leverage the representational capabilities of diffusion models to learn meaningful embeddings for graph data. By training a discrete diffusion model within an autoencoder framework, we enable both effective autoencoding and representation learning tailored to the unique characteristics of graph-structured data. We extract the representation from the combination of the encoder’s output and the decoder’s first time step hidden embedding. Our approach demonstrates the potential of discrete diffusion models to be used for graph representation learning. The code can be found at https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models

扩散模型凭借其准确逼近复杂数据分布的能力,已成为图像和视频等多种数据模态的先进生成模型。与传统的生成方法(如变分自动编码器和生成对抗网络)不同,扩散模型采用渐进的去噪过程,通过多个迭代步骤将噪声转化为有意义的数据。这种逐步的方法增强了其表达力和生成质量。不仅如此,扩散模型还显示出在生成样本的同时从数据中提取有意义表示的能力。尽管它们取得了成功,但将扩散模型应用于图形结构数据仍然相对未被探索,这主要是因为图形的离散性质需要不同于其他领域使用的连续方法的离散扩散过程。在这项工作中,我们利用扩散模型的表示能力来学习图形数据的有意义嵌入。通过在自编码器框架内训练离散扩散模型,我们实现了针对图形结构数据的有效自编码和表示学习。我们从编码器输出和解码器第一步隐藏嵌入的组合中提取表示。我们的方法展示了离散扩散模型在图形表示学习中的潜力。代码可在 https://github.com/DanielMitiku/Graph-Representation-Learning-with-Diffusion-Generative-Models 找到。

论文及项目相关链接

PDF

Summary

扩散模型通过逐步去噪过程,将噪声转化为有意义的数据,在图像、视频等多种数据模态上表现出优秀的生成能力。其在图结构数据的应用尚未得到充分探索,主要由于图的离散性质需要离散扩散过程,有别于其他领域的连续方法。本研究利用扩散模型的表示能力,学习图数据的有意义嵌入。通过训练离散扩散模型于自编码器框架内,实现了针对图结构数据的有效自编码和表示学习。代码可通过链接访问。

Key Takeaways

  1. 扩散模型是当下先进生成模型,能准确近似复杂数据分布。
  2. 扩散模型采用逐步去噪过程,可增强表达性和生成质量。
  3. 离散扩散模型适用于图结构数据学习,因图的离散性质需特定处理。
  4. 研究通过自编码器框架训练离散扩散模型,实现图数据的有效自编码和表示学习。
  5. 扩散模型在图表示学习中的潜力得到验证。
  6. 该研究代码可通过指定链接访问。

Cool Papers

点此查看论文截图

FairGen: Enhancing Fairness in Text-to-Image Diffusion Models via Self-Discovering Latent Directions

Authors:Yilei Jiang, Weihong Li, Yiyuan Zhang, Minghong Cai, Xiangyu Yue

While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias presented in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods on debiasing DMs usually requires model retraining with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets causes expensive annotation cost; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose FairGen, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference dataset. Specifically, FairGen consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process. Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for retraining. Extensive experiments on debiasing gender, racial, and their intersectional biases show that our method outperforms previous SOTA by a large margin.

尽管扩散模型(DM)在各种图像生成任务中表现出卓越的性能,但它们仍然反映出训练集所呈现的固有偏见。由于扩散模型现在广泛应用于现实世界的应用中,这些偏见可能会持续扭曲世界观并阻碍少数群体的机会。现有的关于去偏扩散模型的方法通常需要与人工构建的参考数据集或额外的分类器一起进行模型再训练,这存在两个主要局限性:(1)收集参考数据集导致昂贵的标注成本;(2)去偏性能在很大程度上受到参考数据集或附加分类器质量的影响。为了解决上述局限性,我们提出了FairGen,这是一种即插即用方法,能够以自我发现的方式学习属性潜在方向,从而消除了对参考数据集的依赖。具体来说,FairGen由两部分组成:一组属性适配器和一个分布指标。集合中的每个适配器旨在学习一个属性潜在方向,并通过自我发现过程进行优化噪声组合。然后,分布指标与适配器集合相乘,以指导生成过程朝向规定的分布。我们的方法能够同时去除扩散模型中的多个属性偏见,同时保持轻量化,并且容易与其他扩散模型集成,无需进行再训练。对消除性别、种族及其交叉偏见的广泛实验表明,我们的方法大大超越了之前的最优水平。

论文及项目相关链接

PDF

Summary

本文讨论了Diffusion Models(DM)在处理图像生成任务时存在的偏见问题。现有去偏方法需要昂贵的标注成本,并受限于参考数据集或附加分类器的质量。为此,提出了一种名为FairGen的即插即用方法,通过自我发现的方式学习属性潜在方向,无需依赖参考数据集。FairGen包括属性适配器和分布指示器两部分,能同时去偏多个属性,且易于与其他DM集成,无需重新训练。实验表明,该方法在性别、种族及其交叉偏见的去偏任务上大幅超越了先前最佳方法。

Key Takeaways

  1. Diffusion Models在处理图像生成任务时存在偏见问题,可能加剧对少数群体的不公平待遇。
  2. 现有去偏方法通常需要重新训练模型,并依赖昂贵的标注成本和参考数据集的质量。
  3. FairGen方法通过自我发现的方式学习属性潜在方向,无需依赖参考数据集。
  4. FairGen包括属性适配器和分布指示器两部分,可以同时去偏多个属性。
  5. FairGen方法易于与其他Diffusion Models集成,且无需重新训练。
  6. 实验表明,FairGen在性别、种族及其交叉偏见的去偏任务上表现优异,大幅超越了先前最佳方法。

Cool Papers

点此查看论文截图

GenLit: Reformulating Single-Image Relighting as Video Generation

Authors:Shrisha Bharadwaj, Haiwen Feng, Giorgio Becherini, Victoria Fernandez Abrevaya, Michael J. Black

Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible – one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing. . Project page: https://genlit.is.tue.mpg.de/.

在单幅图像内操纵3D场景的照明是计算机视觉和图形学中的基本挑战。传统上,这个问题是通过逆向渲染技术来解决的,这涉及到明确的3D资产重建和昂贵的光线追踪模拟。同时,视觉基础模型的最新进展表明,一种新的范式即将到来——一个用大量图像和视频数据训练的网络来替代明确的物理模型。在本文中,我们利用视频扩散模型的隐式场景理解,特别是稳定的视频扩散,来对单幅图像进行重新照明。我们介绍了GenLit框架,它将图形引擎进行光线操作的能力提炼成视频生成模型,使用户能够在给定的图像中的3D世界里直接插入和操纵点光源,并直接生成视频序列结果。我们发现仅在小型合成数据集上进行微调后的模型可以推广到真实场景,能够实现具有可信阴影和相互反射的单图像重新照明。我们的结果突出了视频基础模型在捕捉关于照明、材料和形状丰富信息方面的能力,我们的研究结果表明,这种模型可以在极少的训练下用于执行重新照明,无需明确的资产重建或光线追踪。项目页面:https://genlit.is.tue.mpg.de/。

论文及项目相关链接

PDF

Summary

本文介绍了一种利用视频扩散模型进行图像照明操控的新方法。通过引入GenLit框架,将图形引擎的照明操控能力提炼并融入视频生成模型,实现在单一图像内对3D世界的点光源进行直接插入与操控,并生成视频序列。该研究实现了在少量合成数据集上的微调,并能在真实场景中进行单图像照明,生成具有逼真阴影和反射效果的结果。

Key Takeaways

  1. 介绍了利用视频扩散模型操纵单一图像内3D场景照明的新方法。
  2. 提出了GenLit框架,将图形引擎的照明操控能力融入视频生成模型。
  3. 实现了在给定图像内直接插入和操控点光源的能力。
  4. 通过生成视频序列展示照明操控结果。
  5. 在少量合成数据集上微调模型,能够泛化到真实场景。
  6. 生成结果具有逼真的阴影和反射效果。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !