⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-28
Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Authors:Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
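To make the "adaptive background blending" idea above concrete, here is a minimal, hypothetical sketch of mask-based blending in latent space; the shapes, the box-filter feathering, and the `blend_background` helper are illustrative stand-ins, not the paper's actual implementation.

```python
# A toy sketch of adaptive background blending in latent space (assumed, not SHINE's code).
import torch
import torch.nn.functional as F

def blend_background(edit_latent, bg_latent, mask, feather=9):
    """Keep generated content inside `mask` and the original background outside.

    edit_latent, bg_latent: (B, C, H, W) latents; mask: (B, 1, H, W) in {0, 1}.
    A box filter feathers the mask boundary so the seam is less visible.
    """
    soft = F.avg_pool2d(mask, kernel_size=feather, stride=1, padding=feather // 2)
    return soft * edit_latent + (1.0 - soft) * bg_latent

edit = torch.randn(1, 4, 64, 64)   # stand-in for the composited latent
bg = torch.randn(1, 4, 64, 64)     # stand-in for the clean background latent
mask = torch.zeros(1, 1, 64, 64)
mask[..., 16:48, 16:48] = 1.0      # object region
print(blend_background(edit, bg, mask).shape)
```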
Paper & Project Links
PDF Preprint
Summary
This paper presents the SHINE framework, a training-free technique for seamless, high-fidelity insertion. By introducing a manifold-steered anchor loss driven by pretrained customization adapters, it faithfully represents the subject without disrupting background integrity. Degradation-suppression guidance and adaptive background blending are further proposed to eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, the authors introduce ComplexCompo, which evaluates image composition across diverse resolutions and challenging conditions. Experiments show that SHINE matches or surpasses existing methods on ComplexCompo and DreamEditBench.
Key Takeaways
- SHINE targets the difficulty existing models have with seamless insertion under complex lighting and diverse, high-resolution inputs.
- It introduces a manifold-steered anchor loss that uses pretrained customization adapters to guide the latents, keeping the subject faithful while preserving background integrity.
- Degradation-suppression guidance and adaptive background blending further remove low-quality outputs and visible seams.
- The ComplexCompo benchmark is introduced to evaluate composition under diverse resolutions and challenging conditions.
- Experiments show SHINE matches or surpasses existing methods on ComplexCompo and DreamEditBench.
- SHINE requires no training and avoids problems of existing approaches such as latent inversion.
Click to view paper screenshots





T2I-Diff: fMRI Signal Generation via Time-Frequency Image Transform and Classifier-Free Denoising Diffusion Models
Authors:Hwa Hui Tew, Junn Yong Loo, Yee-Fan Tan, Xinyu Tang, Hernando Ombao, Fuad Noman, Raphael C. -W. Phan, Chee-Ming Ting
Functional Magnetic Resonance Imaging (fMRI) is an advanced neuroimaging method that enables in-depth analysis of brain activity by measuring dynamic changes in the blood oxygenation level-dependent (BOLD) signals. However, the resource-intensive nature of fMRI data acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they often underperform because they overlook the complex non-stationarity and nonlinear BOLD dynamics. To address these challenges, we introduce T2I-Diff, an fMRI generation framework that leverages time-frequency representation of BOLD signals and classifier-free denoising diffusion. Specifically, our framework first converts BOLD signals into windowed spectrograms via a time-dependent Fourier transform, capturing both the underlying temporal dynamics and spectral evolution. Subsequently, a classifier-free diffusion model is trained to generate class-conditioned frequency spectrograms, which are then reverted to BOLD signals via inverse Fourier transforms. Finally, we validate the efficacy of our approach by demonstrating improved accuracy and generalization in downstream fMRI-based brain network classification.
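As a rough illustration of the time-frequency step described above, the snippet below converts a synthetic BOLD-like signal into a windowed spectrogram with `scipy.signal.stft` and inverts it back; the sampling rate, window length, and toy signal are assumptions, not values from the paper.

```python
# Round-tripping a toy BOLD-like signal through a short-time Fourier transform.
import numpy as np
from scipy.signal import stft, istft

fs = 0.5                         # assumed sampling rate (Hz), e.g. TR = 2 s
t = np.arange(0, 600, 1 / fs)    # 10-minute synthetic recording
bold = np.sin(2 * np.pi * 0.05 * t) + 0.3 * np.random.randn(t.size)

f, tau, Z = stft(bold, fs=fs, nperseg=64)   # complex windowed spectrogram
_, rec = istft(Z, fs=fs, nperseg=64)        # back to the time domain
err = np.max(np.abs(bold - rec[: bold.size]))
print(Z.shape, err)                          # tiny error: the default window satisfies COLA
```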
Paper & Project Links
PDF Accepted at the 28th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2025)
Summary
fMRI is an advanced neuroimaging method that analyzes brain activity in depth by measuring dynamic changes in blood oxygenation level-dependent (BOLD) signals, but its resource-intensive acquisition limits the availability of the high-fidelity samples needed by data-driven brain analysis models. To address the tendency of modern generative models to overlook complex non-stationary and nonlinear BOLD dynamics, the authors introduce the T2I-Diff fMRI generation framework. It converts BOLD signals into windowed spectrograms via a time-dependent Fourier transform to capture the underlying temporal dynamics and spectral evolution, trains a classifier-free denoising diffusion model to generate class-conditioned frequency spectrograms, and maps them back to BOLD signals via the inverse Fourier transform. The method's effectiveness is validated by improved accuracy and generalization in downstream fMRI-based brain network classification.
Key Takeaways
- fMRI is an advanced neuroimaging method that analyzes brain activity in depth by measuring dynamic changes in blood oxygenation.
- The resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples.
- Modern generative models often underperform when synthesizing fMRI data because they overlook complex non-stationary and nonlinear BOLD dynamics.
- T2I-Diff converts BOLD signals into windowed spectrograms via a time-dependent Fourier transform to capture the underlying temporal dynamics and spectral evolution.
- T2I-Diff uses a classifier-free denoising diffusion model to generate class-conditioned frequency spectrograms.
- The generated windowed spectrograms are converted back to BOLD signals via the inverse Fourier transform.
- T2I-Diff improves the accuracy and generalization of downstream fMRI-based brain network classification.
Click to view paper screenshots




CusEnhancer: A Zero-Shot Scene and Controllability Enhancement Method for Photo Customization via ResInversion
Authors:Maoye Ren, Praneetha Vaddamanu, Jianjin Xu, Fernando De la Torre Frade
Recently, remarkable progress has been made in synthesizing realistic human photos using text-to-image diffusion models. However, current approaches suffer from degraded scenes, insufficient control, and suboptimal perceptual identity. We introduce CustomEnhancer, a novel framework to augment existing identity customization models. CustomEnhancer is a zero-shot enhancement pipeline that leverages face-swapping techniques and a pretrained diffusion model to obtain additional representations in a zero-shot manner for encoding into personalized models. Through our proposed triple-flow fused PerGeneration approach, which identifies and combines two compatible counter-directional latent spaces to manipulate a pivotal space of the personalized model, we unify the generation and reconstruction processes, realizing generation from three flows. Our pipeline also enables comprehensive training-free control over the generation process of personalized models, offering precise controlled personalization and eliminating the need for per-model controller retraining. Besides, to address the high time complexity of null-text inversion (NTI), we introduce ResInversion, a novel inversion method that performs noise rectification via a pre-diffusion mechanism, reducing the inversion time by 129 times. Experiments demonstrate that CustomEnhancer reaches SOTA results in scene diversity, identity fidelity, and training-free control, while also showing the efficiency of our ResInversion over NTI. The code will be made publicly available upon paper acceptance.
Paper & Project Links
Summary
Recent work synthesizes realistic portrait photos with text-to-image diffusion models, but current approaches suffer from degraded scenes, insufficient control, and suboptimal perceptual identity. To address this, the authors propose CustomEnhancer, a framework that combines face-swapping techniques with a pretrained diffusion model to obtain additional representations in a zero-shot manner and encode them into personalized models. The proposed triple-flow fused PerGeneration approach unifies the generation and reconstruction processes, realizing generation from three flows. CustomEnhancer also provides comprehensive training-free control over the generation process of personalized models, removing the need to retrain a controller for each model. To tackle the high time complexity of null-text inversion (NTI), the authors introduce ResInversion, a new inversion method that performs noise rectification via a pre-diffusion mechanism and reduces inversion time by a factor of 129. Experiments show state-of-the-art scene diversity, identity fidelity, and training-free control, with ResInversion also proving far more efficient than NTI.
Key Takeaways
- Recent work synthesizes realistic portraits with text-to-image diffusion models, but scene quality, control, and perceptual identity remain problematic.
- The CustomEnhancer framework combines face-swapping techniques with a pretrained diffusion model to improve the generation quality of personalized models.
- The PerGeneration approach fuses three flows to unify the generation and reconstruction processes.
- CustomEnhancer offers comprehensive training-free control, enabling precise personalization without retraining a controller for each model.
- ResInversion, a new inversion method, is introduced to address the high time complexity of null-text inversion.
- ResInversion performs noise rectification via a pre-diffusion mechanism, substantially reducing inversion time.
Click to view paper screenshots






PhysCtrl: Generative Physics for Controllable and Physics-Grounded Video Generation
Authors:Chen Wang, Chuhao Chen, Yiming Huang, Zhiyang Dou, Yuan Liu, Jiatao Gu, Lingjie Liu
Existing video generation models excel at producing photo-realistic videos from text or images, but often lack physical plausibility and 3D controllability. To overcome these limitations, we introduce PhysCtrl, a novel framework for physics-grounded image-to-video generation with physical parameters and force control. At its core is a generative physics network that learns the distribution of physical dynamics across four materials (elastic, sand, plasticine, and rigid) via a diffusion model conditioned on physics parameters and applied forces. We represent physical dynamics as 3D point trajectories and train on a large-scale synthetic dataset of 550K animations generated by physics simulators. We enhance the diffusion model with a novel spatiotemporal attention block that emulates particle interactions and incorporates physics-based constraints during training to enforce physical plausibility. Experiments show that PhysCtrl generates realistic, physics-grounded motion trajectories which, when used to drive image-to-video models, yield high-fidelity, controllable videos that outperform existing methods in both visual quality and physical plausibility. Project Page: https://cwchenwang.github.io/physctrl
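A hypothetical sketch of a spatiotemporal attention block over point-trajectory tokens is given below: it attends across particles within each frame and then across time for each particle. The layer sizes, shapes, and the `SpatioTemporalAttention` class are illustrative assumptions rather than the paper's architecture.

```python
# Assumed spatiotemporal attention over 3D point-trajectory tokens (not PhysCtrl's code).
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):                      # x: (B, T, N, D) trajectory tokens
        B, T, N, D = x.shape
        s = x.reshape(B * T, N, D)             # particles interact within each frame
        s = s + self.spatial(self.norm1(s), self.norm1(s), self.norm1(s))[0]
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        t = t + self.temporal(self.norm2(t), self.norm2(t), self.norm2(t))[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to (B, T, N, D)

block = SpatioTemporalAttention()
out = block(torch.randn(2, 8, 256, 128))       # 2 clips, 8 frames, 256 particles
print(out.shape)                               # torch.Size([2, 8, 256, 128])
```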
Paper & Project Links
PDF Accepted by NeurIPS 2025. This is the preview version; the camera-ready version is still in preparation
Summary
This paper introduces PhysCtrl, a framework for physics-grounded image-to-video generation with physical parameters and force control. It uses a diffusion model to learn the distribution of physical dynamics for four materials, trained on a large-scale synthetic dataset, and adds a novel spatiotemporal attention block that emulates particle interactions while incorporating physics-based constraints during training to produce realistic, physically plausible trajectories. These trajectories can drive image-to-video models to generate high-quality, controllable videos that outperform existing methods in both visual quality and physical plausibility.
Key Takeaways
- PhysCtrl combines physical parameters with force control to address the lack of physical plausibility and 3D controllability in existing video generation models.
- A generative physics network learns the distribution of physical dynamics for four materials via a diffusion model.
- Training uses a large-scale synthetic dataset of 550K animations.
- A new spatiotemporal attention block emulates particle interactions, and physics-based constraints are incorporated during training.
- PhysCtrl generates realistic, physically plausible motion trajectories.
- When used to drive image-to-video models, PhysCtrl yields high-quality, controllable videos that surpass existing methods in visual quality and physical plausibility.
Click to view paper screenshots



4D Driving Scene Generation With Stereo Forcing
Authors:Hao Lu, Zhuang Ma, Guangfeng Jiang, Wenhang Ge, Bohan Li, Yuzhan Cai, Wenzhao Zheng, Yunpeng Zhang, Yingcong Chen
Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis (NVS) without per-scene optimization. Bridging generation and novel view synthesis remains a major challenge. We present PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, PhiGenesis produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. In its first stage, PhiGenesis leverages a pre-trained video VAE with a novel range-view adapter to enable feed-forward 4D reconstruction from multi-view images. This architecture supports single-frame or video inputs and outputs complete 4D scenes including geometry, semantics, and motion. In the second stage, PhiGenesis introduces a geometric-guided video diffusion model, using rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, we propose Stereo Forcing, a novel conditioning strategy that integrates geometric uncertainty during denoising. This method enhances temporal coherence by dynamically adjusting generative influence based on uncertainty-aware perturbations. Our experimental results demonstrate that our method achieves state-of-the-art performance in both appearance and geometric reconstruction, temporal generation and novel view synthesis (NVS) tasks, while simultaneously delivering competitive performance in downstream evaluations. Homepage: https://jiangxb98.github.io/PhiGensis
Paper & Project Links
Summary
To address the difficulty current generative models have in synthesizing dynamic 4D driving scenes that support both temporal extrapolation and spatial novel view synthesis, this paper proposes PhiGenesis, a unified framework for 4D scene generation that extends video generation techniques with geometric and temporal consistency. Given multi-view image sequences and camera parameters, it produces temporally continuous 4D Gaussian splatting representations along target 3D trajectories. The framework has two stages: the first uses a pretrained video VAE with a novel range-view adapter for feed-forward 4D reconstruction, and the second introduces a geometry-guided video diffusion model that uses rendered historical 4D scenes as priors to generate future views conditioned on trajectories. To address geometric exposure bias in novel views, the authors propose Stereo Forcing, a new conditioning strategy that integrates geometric uncertainty during denoising. Experiments show state-of-the-art performance in appearance and geometric reconstruction, temporal generation, and novel view synthesis.
Key Takeaways
- Current generative models struggle to synthesize dynamic 4D driving scenes that simultaneously support temporal extrapolation and spatial novel view synthesis.
- PhiGenesis is a unified framework that addresses this by extending video generation techniques with geometric and temporal consistency.
- PhiGenesis takes multi-view image sequences and camera parameters and produces 4D Gaussian splatting representations.
- The framework has two stages: feed-forward 4D reconstruction and geometry-guided video diffusion for future views.
- The first stage uses a video VAE with a range-view adapter for reconstruction.
- The second stage introduces the Stereo Forcing strategy to address geometric exposure bias in novel views.
Click to view paper screenshots







Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing
Authors:Zizheng Yang, Hu Yu, Bing Li, Jinghao Zhang, Jie Huang, Feng Zhao
Diffusion models have recently been investigated as powerful generative solvers for image dehazing, owing to their remarkable capability to model the data distribution. However, the massive computational burden imposed by the retraining of diffusion models, coupled with the extensive sampling steps during the inference, limit the broader application of diffusion models in image dehazing. To address these issues, we explore the properties of hazy images in the semantic latent space of frozen pre-trained diffusion models, and propose a Diffusion Latent Inspired network for Image Dehazing, dubbed DiffLI$^2$D. Specifically, we first reveal that the semantic latent space of pre-trained diffusion models can represent the content and haze characteristics of hazy images, as the diffusion time-step changes. Building upon this insight, we integrate the diffusion latent representations at different time-steps into a delicately designed dehazing network to provide instructions for image dehazing. Our DiffLI$^2$D avoids re-training diffusion models and iterative sampling process by effectively utilizing the informative representations derived from the pre-trained diffusion models, which also offers a novel perspective for introducing diffusion models to image dehazing. Extensive experiments on multiple datasets demonstrate that the proposed method achieves superior performance to existing image dehazing methods. Code is available at https://github.com/aaaasan111/difflid.
Paper & Project Links
Summary
Diffusion models have been explored as generative solvers for image dehazing thanks to their strong ability to model data distributions, but their heavy computational burden and the many sampling steps required at inference limit wider application. To address this, the authors study the properties of hazy images in the semantic latent space of frozen pretrained diffusion models and propose DiffLI$^2$D, a diffusion-latent-inspired network for image dehazing. They first show that, as the diffusion time-step changes, the semantic latent space of pretrained diffusion models can represent both the content and the haze characteristics of hazy images. Building on this insight, diffusion latent representations at different time-steps are integrated into a carefully designed dehazing network to guide dehazing. DiffLI$^2$D avoids retraining diffusion models and iterative sampling by exploiting the informative representations of pretrained diffusion models, offering a new perspective on bringing diffusion models to image dehazing. Extensive experiments on multiple datasets show superior performance over existing dehazing methods; code is available at the linked repository.
Key Takeaways
- Diffusion models have drawn attention for image dehazing because of their strong ability to model data distributions.
- Their heavy computational burden and the sampling steps required at inference limit their applicability.
- The semantic latent space of frozen pretrained diffusion models can represent both the content and the haze characteristics of hazy images.
- A diffusion-latent-inspired dehazing network, DiffLI$^2$D, is proposed.
- DiffLI$^2$D exploits the informative representations of pretrained diffusion models, avoiding model retraining and iterative sampling.
- DiffLI$^2$D offers a new perspective on applying diffusion models to image dehazing.
- Experiments on multiple datasets show superior performance over existing dehazing methods.
Click to view paper screenshots



Learnable Sampler Distillation for Discrete Diffusion Models
Authors:Feiyang Fu, Tongxian Guo, Zhaoqiang Liu
Discrete diffusion models (DDMs) have shown powerful generation ability for discrete data modalities like text and molecules. However, their practical application is hindered by inefficient sampling, requiring a large number of sampling steps. Accelerating DDMs by using larger step sizes typically introduces significant problems in generation quality, as it amplifies the impact of both the compounding decoding error due to factorized predictions and discretization error from numerical approximations, leading to a significant decrease in sampling quality. To address these challenges, we propose learnable sampler distillation (LSD), a novel approach to train fast and high-fidelity samplers for DDMs. LSD employs a distillation approach where a student sampler with a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with numerous steps. This alignment is achieved by optimizing learnable sampler coefficients that adaptively adjust sampling dynamics. Additionally, we further propose LSD+, which also learns time schedules that allocate steps non-uniformly. Experiments across text generation, image generation, and synthetic tasks demonstrate that our proposed approaches outperform existing samplers for DDMs, achieving substantially higher sampling quality with significantly fewer sampling steps. Our code is available at https://github.com/feiyangfu/LSD.
Paper & Project Links
PDF NeurIPS 2025
Summary
Discrete diffusion models (DDMs) generate discrete data such as text and molecules well, but practical use is hindered by inefficient sampling that requires many steps. This paper proposes learnable sampler distillation (LSD) to train fast, high-fidelity samplers for DDMs. LSD uses a distillation approach in which a student sampler with only a few steps learns to align its intermediate score trajectory with that of a high-quality teacher sampler with many steps, achieved by optimizing learnable sampler coefficients that adaptively adjust the sampling dynamics. LSD+ additionally learns time schedules that allocate steps non-uniformly. Experiments on text generation, image generation, and synthetic tasks show substantially higher sampling quality with far fewer sampling steps.
Key Takeaways
- Discrete diffusion models (DDMs) have strong generative ability but face challenges in sampling efficiency.
- Existing acceleration methods amplify errors from factorized predictions and numerical approximations, degrading sampling quality.
- The proposed learnable sampler distillation (LSD) trains efficient, high-quality samplers by optimizing learnable sampler coefficients.
- The student sampler learns, with only a few steps, to follow the intermediate score trajectory of a high-quality teacher sampler, improving sampling quality.
- LSD+ additionally learns time schedules that allocate steps non-uniformly to further optimize the sampling process.
- Experiments show higher sampling quality with significantly fewer sampling steps across text generation, image generation, and synthetic tasks.
Click to view paper screenshots


Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation
Authors:Sherwin Bahmani, Tianchang Shen, Jiawei Ren, Jiahui Huang, Yifeng Jiang, Haithem Turki, Andrea Tagliasacchi, David B. Lindell, Zan Gojcic, Sanja Fidler, Huan Ling, Jun Gao, Xuanchi Ren
The ability to generate virtual environments is crucial for applications ranging from gaming to physical AI domains such as robotics, autonomous driving, and industrial AI. Current learning-based 3D reconstruction methods rely on the availability of captured real-world multi-view data, which is not always readily available. Recent advancements in video diffusion models have shown remarkable imagination capabilities, yet their 2D nature limits the applications to simulation where a robot needs to navigate and interact with the environment. In this paper, we propose a self-distillation framework that aims to distill the implicit 3D knowledge in the video diffusion models into an explicit 3D Gaussian Splatting (3DGS) representation, eliminating the need for multi-view training data. Specifically, we augment the typical RGB decoder with a 3DGS decoder, which is supervised by the output of the RGB decoder. In this approach, the 3DGS decoder can be purely trained with synthetic data generated by video diffusion models. At inference time, our model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering. Our framework further extends to dynamic 3D scene generation from a monocular input video. Experimental results show that our framework achieves state-of-the-art performance in static and dynamic 3D scene generation.
Paper & Project Links
PDF Project Page: https://research.nvidia.com/labs/toronto-ai/lyra/
Summary
This paper proposes a self-distillation framework that distills the implicit 3D knowledge in video diffusion models into an explicit 3D Gaussian splatting (3DGS) representation, eliminating the need for multi-view training data. A 3DGS decoder is added alongside the usual RGB decoder and supervised by the RGB decoder's output, so it can be trained purely on synthetic data generated by video diffusion models. At inference, the model can synthesize 3D scenes from either a text prompt or a single image for real-time rendering, and the framework further extends to dynamic 3D scene generation from a monocular input video. Experiments show state-of-the-art performance in both static and dynamic 3D scene generation.
Key Takeaways
- A self-distillation framework based on video diffusion models is proposed for generating virtual environments.
- Distilling implicit 3D knowledge into an explicit 3D Gaussian splatting (3DGS) representation removes the need for multi-view training data.
- The RGB decoder is paired with an additional 3DGS decoder, with the two working together for better 3D scene synthesis.
- The model can synthesize 3D scenes from a text prompt or a single image and render them in real time.
- The framework extends to dynamic 3D scene generation from a monocular input video.
- Experiments show state-of-the-art performance in both static and dynamic 3D scene generation.
Click to view paper screenshots



Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation
Authors:Shufan Li, Jiuxiang Gu, Kangning Liu, Zhe Lin, Zijun Wei, Aditya Grover, Jason Kuen
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs such as MMaDa and Muddit which only support simple image-level understanding tasks and low-resolution image generation, Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution (1024px) text-to-image synthesis. Lavida-O incorporates a novel Elastic Mixture-of-Transformers (Elastic-MoT) architecture that couples a lightweight generation branch with a larger understanding branch, supported by token compression, universal text conditioning and stratified sampling for efficient and high-quality generation. Lavida-O further incorporates planning and iterative self-reflection in image generation and editing tasks, seamlessly boosting generation quality with its understanding capabilities. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, outperforming existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev, while offering considerable speedup at inference. These advances establish Lavida-O as a new paradigm for scalable multimodal reasoning and generation.
Paper & Project Links
PDF 31 pages, 15 figures
Summary
Lavida-O is a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Unlike existing multimodal MDMs, Lavida-O supports image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis within a single framework. Its novel Elastic Mixture-of-Transformers architecture enables efficient, high-quality generation. Lavida-O achieves leading performance on benchmarks such as RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing, demonstrating strong reasoning and generation capabilities.
Key Takeaways
- Lavida-O is a unified multimodal Masked Diffusion Model (MDM).
- Compared with other MDMs, Lavida-O supports a wider range of tasks, including image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis.
- Lavida-O adopts an Elastic Mixture-of-Transformers architecture for efficient, high-quality generation.
- Lavida-O performs strongly on benchmarks such as RefCOCO, GenEval, and ImgEdit.
- Lavida-O outperforms existing autoregressive models and continuous diffusion models such as Qwen2.5-VL and FluxKontext-dev.
- Lavida-O offers considerable speedups at inference.
Click to view paper screenshots




Efficient Rectified Flow for Image Fusion
Authors:Zirui Wang, Jiayi Zhang, Tianwei Guan, Yuhan Zhou, Xingyuan Li, Minjing Dong, Jinyuan Liu
Image fusion is a fundamental and important task in computer vision, aiming to combine complementary information from different modalities to fuse images. In recent years, diffusion models have made significant developments in the field of image fusion. However, diffusion models often require complex computations and redundant inference time, which reduces the applicability of these methods. To address this issue, we propose RFfusion, an efficient one-step diffusion model for image fusion based on Rectified Flow. We incorporate Rectified Flow into the image fusion task to straighten the sampling path in the diffusion model, achieving one-step sampling without the need for additional training, while still maintaining high-quality fusion results. Furthermore, we propose a task-specific variational autoencoder (VAE) architecture tailored for image fusion, where the fusion operation is embedded within the latent space to further reduce computational complexity. To address the inherent discrepancy between conventional reconstruction-oriented VAE objectives and the requirements of image fusion, we introduce a two-stage training strategy. This approach facilitates the effective learning and integration of complementary information from multi-modal source images, thereby enabling the model to retain fine-grained structural details while significantly enhancing inference efficiency. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods in terms of both inference speed and fusion quality. Code is available at https://github.com/zirui0625/RFfusion.
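The toy example below illustrates why a rectified (straightened) flow admits one-step sampling: with straight transport paths, a single Euler step of the probability-flow ODE already reaches the target. The constant velocity field and Gaussian target are made-up stand-ins, not the learned fusion model.

```python
# Toy: straight transport paths make one Euler step as good as many (assumed setup).
import torch

mu = torch.tensor([3.0, -2.0])

def velocity(x, t):
    # Perfectly straightened flow from N(0, I) to N(mu, I): constant velocity.
    # (t is unused here because the field is time-independent.)
    return mu.expand_as(x)

def euler_sample(n_steps, n_samples=10000):
    x = torch.randn(n_samples, 2)            # start from Gaussian noise at t = 0
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x = x + velocity(x, t) * dt          # explicit Euler step along the ODE
        t += dt
    return x

one_step = euler_sample(1)
fifty_step = euler_sample(50)
print(one_step.mean(0), fifty_step.mean(0))  # both close to mu: the paths are straight
```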
Paper & Project Links
PDF Accepted by NeurIPS 2025
Summary
RFfusion is an efficient one-step diffusion model for image fusion based on Rectified Flow. Incorporating Rectified Flow straightens the diffusion model's sampling path, enabling one-step sampling without additional training while maintaining high-quality fusion results. The authors also design a task-specific variational autoencoder (VAE) architecture that embeds the fusion operation in the latent space to further reduce computational complexity, and introduce a two-stage training strategy to bridge the gap between conventional reconstruction-oriented VAE objectives and the requirements of image fusion. The approach effectively learns and integrates complementary information from multi-modal source images, retaining fine structural details while significantly improving inference efficiency.
Key Takeaways
- Diffusion models have advanced image fusion, but their complex computation and long inference times limit applicability.
- RFfusion, based on Rectified Flow, achieves efficient one-step image fusion.
- Rectified Flow straightens the diffusion model's sampling path, enabling one-step sampling without extra training.
- A task-specific VAE embeds the fusion operation in the latent space to reduce computational complexity.
- A two-stage training strategy addresses the mismatch between reconstruction-oriented VAE objectives and the needs of image fusion.
- The method retains fine structural details while significantly improving inference efficiency.
Click to view paper screenshots




LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
Authors:Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a "tennis ball", or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.
Paper & Project Links
Summary
This paper introduces LazyDrag, a drag-based image editing method for Multi-Modal Diffusion Transformers. It removes the reliance on implicit point matching by generating an explicit correspondence map from user drag inputs, which serves as a reliable reference to strengthen attention control and enhances the model's generative capability. LazyDrag supports precise geometric control together with text guidance, enabling complex edits such as opening a dog's mouth and inpainting its interior, generating new objects, or making context-aware changes for ambiguous drags. It supports multi-round workflows with simultaneous move and scale operations and performs excellently on DragBench.
Key Takeaways
- LazyDrag is the first drag-based image editing method for Multi-Modal Diffusion Transformers.
- By generating an explicit correspondence map, it removes the reliance on implicit point matching and improves attention control.
- LazyDrag combines precise geometric control with text guidance, enabling complex edits.
- The method supports multi-round editing workflows with simultaneous move and scale operations.
- LazyDrag outperforms other methods on DragBench in drag accuracy and perceptual quality.
- LazyDrag not only sets a new state of the art but also opens a new path for editing paradigms.
Click to view paper screenshots






CNS-Bench: Benchmarking Image Classifier Robustness Under Continuous Nuisance Shifts
Authors:Olaf Dünkel, Artur Jesslen, Jiahao Xie, Christian Theobalt, Christian Rupprecht, Adam Kortylewski
An important challenge when using computer vision models in the real world is to evaluate their performance in potential out-of-distribution (OOD) scenarios. While simple synthetic corruptions are commonly applied to test OOD robustness, they often fail to capture nuisance shifts that occur in the real world. Recently, diffusion models have been applied to generate realistic images for benchmarking, but they are restricted to binary nuisance shifts. In this work, we introduce CNS-Bench, a Continuous Nuisance Shift Benchmark to quantify OOD robustness of image classifiers for continuous and realistic generative nuisance shifts. CNS-Bench allows generating a wide range of individual nuisance shifts in continuous severities by applying LoRA adapters to diffusion models. To address failure cases, we propose a filtering mechanism that outperforms previous methods, thereby enabling reliable benchmarking with generative models. With the proposed benchmark, we perform a large-scale study to evaluate the robustness of more than 40 classifiers under various nuisance shifts. Through carefully designed comparisons and analyses, we find that model rankings can change for varying shifts and shift scales, which cannot be captured when applying common binary shifts. Additionally, we show that evaluating the model performance on a continuous scale allows the identification of model failure points, providing a more nuanced understanding of model robustness. Project page including code and data: https://genintel.github.io/CNS.
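The sketch below shows, on random weights, how a LoRA update can be scaled by a continuous severity before being merged into a frozen layer, which is the generic mechanism behind continuous nuisance shifts; the dimensions and the `merged_weight` helper are assumptions, not the benchmark's code.

```python
# Assumed illustration of continuous-severity LoRA merging (not CNS-Bench's implementation).
import torch

d_out, d_in, rank = 320, 320, 8
W = torch.randn(d_out, d_in)                 # frozen pretrained weight
A = torch.randn(rank, d_in) * 0.01           # LoRA down-projection
B = torch.randn(d_out, rank) * 0.01          # LoRA up-projection

def merged_weight(severity: float) -> torch.Tensor:
    """severity = 0 reproduces the original model, 1 applies the full nuisance shift."""
    return W + severity * (B @ A)

x = torch.randn(4, d_in)
for severity in (0.0, 0.25, 0.5, 1.0):
    y = x @ merged_weight(severity).T
    print(severity, float((y - x @ W.T).norm()))   # deviation grows with severity
```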
Paper & Project Links
PDF ICCV 2025. Project page: https://genintel.github.io/CNS
Summary
This paper introduces CNS-Bench, a Continuous Nuisance Shift Benchmark for quantifying the OOD robustness of image classifiers under continuous, realistic generative nuisance shifts. The benchmark generates a wide range of individual nuisance shifts at continuous severities by applying LoRA adapters to diffusion models, and introduces a filtering mechanism that outperforms previous methods to handle failure cases, enabling reliable benchmarking with generative models. A large-scale study of more than 40 classifiers shows that model rankings can change across different shifts and shift scales, which common binary shifts cannot capture, and that evaluating performance on a continuous scale reveals model failure points and gives a more nuanced understanding of robustness.
Key Takeaways
- CNS-Bench is a new continuous nuisance shift benchmark for evaluating the robustness of image classifiers in real-world OOD scenarios.
- CNS-Bench uses diffusion models with LoRA adapters to generate a wide range of nuisance shifts that are continuous and realistic.
- A proposed filtering mechanism handles generation failure cases, making benchmarking with generative models more reliable.
- A large-scale study shows that model rankings can change under different nuisance shifts and shift scales, which traditional binary shift tests cannot capture.
- Evaluating on a continuous severity scale exposes model weaknesses and provides a deeper understanding of robustness.
- CNS-Bench, including code and data, is publicly available.
Click to view paper screenshots






EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation
Authors:Carl Qi, Dan Haramati, Tal Daniel, Aviv Tamar, Amy Zhang
Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into an object-centric representation, which is then processed by our entity-centric Transformer that computes attention at the object level, simultaneously predicting object dynamics and the agent’s actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training. We provide video rollouts on our webpage: https://sites.google.com/view/ec-diffuser.
Paper & Project Links
Summary
This paper proposes a behavioral cloning approach that combines object-centric representations with an entity-centric Transformer and diffusion-based optimization, enabling efficient learning from offline image data. Observations are first decomposed into an object-centric representation, which is processed by the entity-centric Transformer that computes attention at the object level while simultaneously predicting object dynamics and the agent's actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this yields substantial performance improvements on multi-object tasks and enables compositional generalization.
Key Takeaways
- Object manipulation is a common component of everyday tasks, but learning it from high-dimensional observations is challenging.
- The challenge is amplified in multi-object environments because of the combinatorial complexity of the state space and of the desired behaviors.
- Recent methods train on large-scale offline pixel data and gain performance through scaling, but struggle with compositional generalization to unseen object configurations.
- The proposed behavioral cloning approach combines object-centric representations with an entity-centric Transformer and diffusion-based optimization for efficient learning.
- The method simultaneously predicts object dynamics and agent actions and, together with diffusion models' ability to capture multi-modal behavior distributions, improves multi-object task performance.
- The resulting agents generalize zero-shot to tasks with novel compositions of objects and goals.
Click to view paper screenshots




From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
Authors:Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Freeman, Fredo Durand, Eli Shechtman, Xun Huang
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. The generation of a single frame requires the model to process the entire sequence, including the future. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. To further reduce latency, we extend distribution matching distillation (DMD) to videos, distilling 50-step diffusion model into a 4-step generator. To enable stable and high-quality distillation, we introduce a student initialization scheme based on teacher’s ODE trajectories, as well as an asymmetric distillation strategy that supervises a causal student model with a bidirectional teacher. This approach effectively mitigates error accumulation in autoregressive generation, allowing long-duration video synthesis despite training on short clips. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models. It enables fast streaming generation of high-quality videos at 9.4 FPS on a single GPU thanks to KV caching. Our approach also enables streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner.
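A minimal illustration of KV caching for block-causal, frame-by-frame generation is sketched below with plain tensor algebra; the dimensions and single attention layer are assumptions and stand in for the full autoregressive video transformer.

```python
# Toy KV cache: each new frame attends only to cached past keys/values plus itself.
import torch

dim, frames, tokens_per_frame = 64, 4, 16
wq = torch.randn(dim, dim) / dim ** 0.5
wk = torch.randn(dim, dim) / dim ** 0.5
wv = torch.randn(dim, dim) / dim ** 0.5

k_cache, v_cache, outputs = [], [], []
for f in range(frames):
    x = torch.randn(tokens_per_frame, dim)          # stand-in for this frame's tokens
    q, k, v = x @ wq, x @ wk, x @ wv
    k_cache.append(k)
    v_cache.append(v)
    K = torch.cat(k_cache)                          # past + current keys only
    V = torch.cat(v_cache)
    attn = torch.softmax(q @ K.T / dim ** 0.5, dim=-1)
    outputs.append(attn @ V)                        # no dependence on future frames

print(torch.stack(outputs).shape)                   # (frames, tokens_per_frame, dim)
```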
Paper & Project Links
PDF CVPR 2025. Project Page: https://causvid.github.io/
Summary
To address the bidirectional attention dependencies that make current video diffusion models unsuitable for interactive applications, this paper adapts a pretrained bidirectional diffusion transformer into an autoregressive transformer that generates frames on the fly. To further improve efficiency, distribution matching distillation (DMD) is extended to video, distilling a 50-step diffusion model into a 4-step generator. A student initialization scheme based on the teacher's ODE trajectories and an asymmetric distillation strategy, in which a bidirectional teacher supervises a causal student, keep distillation stable and high quality and reduce error accumulation in autoregressive generation, enabling long-duration video synthesis despite training on short clips. The model scores 84.27 on the VBench-Long benchmark, surpassing all previous video generation models, and streams high-quality video at 9.4 FPS on a single GPU thanks to KV caching; it also supports streaming video-to-video translation, image-to-video, and dynamic prompting in a zero-shot manner.
Key Takeaways
- Current video diffusion models struggle in interactive applications because of bidirectional attention dependencies.
- A pretrained bidirectional diffusion transformer is adapted into an autoregressive model that generates frames on the fly.
- Distribution matching distillation (DMD) is extended to video, distilling a 50-step diffusion model into a 4-step generator.
- A student initialization scheme based on the teacher's ODE trajectories ensures stable, high-quality distillation.
- An asymmetric distillation strategy reduces error accumulation in autoregressive generation.
- The model scores 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
Click to view paper screenshots




Supercharged One-step Text-to-Image Diffusion Models with Negative Prompts
Authors:Viet Nguyen, Anh Nguyen, Trung Dao, Khoi Nguyen, Cuong Pham, Toan Tran, Anh Tran
The escalating demand for real-time image synthesis has driven significant advancements in one-step diffusion models, which inherently offer expedited generation speeds compared to traditional multi-step methods. However, this enhanced efficiency is frequently accompanied by a compromise in the controllability of image attributes. While negative prompting, typically implemented via classifier-free guidance (CFG), has proven effective for fine-grained control in multi-step models, its application to one-step generators remains largely unaddressed. Due to the lack of iterative refinement, as in multi-step diffusion, directly applying CFG to one-step generation leads to blending artifacts and diminished output quality. To fill this gap, we introduce Negative-Away Steer Attention (NASA), an efficient method that integrates negative prompts into one-step diffusion models. NASA operates within the intermediate representation space by leveraging cross-attention mechanisms to suppress undesired visual attributes. This strategy avoids the blending artifacts inherent in output-space guidance and achieves high efficiency, incurring only a minimal 1.89% increase in FLOPs compared to the computational doubling of CFG. Furthermore, NASA can be seamlessly integrated into existing timestep distillation frameworks, enhancing the student's output quality. Experimental results demonstrate that NASA substantially improves controllability and output quality, achieving an HPSv2 score of 31.21, setting a new state-of-the-art benchmark for one-step diffusion models.
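The toy sketch below conveys the general idea of steering intermediate features away from a negative prompt's cross-attention output rather than mixing two full denoising outputs as CFG does; the simplified attention (no learned projections), shapes, and the `negative_away_steer` helper are illustrative assumptions, not the paper's exact formulation.

```python
# Assumed feature-space steering away from a negative prompt (not NASA's actual code).
import torch

def cross_attention(q, kv):
    # Simplified attention without learned projections, for illustration only.
    attn = torch.softmax(q @ kv.transpose(-1, -2) / q.size(-1) ** 0.5, dim=-1)
    return attn @ kv

def negative_away_steer(img_tokens, pos_text, neg_text, weight=0.8):
    pos_out = cross_attention(img_tokens, pos_text)   # content to keep
    neg_out = cross_attention(img_tokens, neg_text)   # attributes to suppress
    return img_tokens + pos_out - weight * neg_out    # steer away from the negative

img = torch.randn(1, 1024, 64)      # 32x32 image tokens, 64-dim (illustrative)
pos = torch.randn(1, 77, 64)        # positive-prompt embeddings
neg = torch.randn(1, 77, 64)        # negative-prompt embeddings ("blurry", ...)
print(negative_away_steer(img, pos, neg).shape)
```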
Paper & Project Links
PDF Accepted at ICCV 2025
Summary
The growing demand for real-time image synthesis has driven the development of one-step diffusion models, which generate much faster than traditional multi-step methods but often sacrifice control over image attributes. Negative prompting, typically implemented via classifier-free guidance (CFG), enables fine-grained control in multi-step models, but applying it directly to one-step generation causes blending artifacts and lower output quality because there is no iterative refinement. To fill this gap, the paper introduces Negative-Away Steer Attention (NASA), an efficient method that integrates negative prompts into one-step diffusion models. NASA operates in the intermediate representation space, using cross-attention mechanisms to suppress undesired visual attributes, which avoids the blending artifacts of output-space guidance and adds only 1.89% more FLOPs, compared with the doubled computation of CFG. NASA can also be integrated seamlessly into existing timestep distillation frameworks, improving the student's output quality. Experiments show substantial gains in controllability and output quality, with an HPSv2 score of 31.21, a new state of the art for one-step diffusion models.
Key Takeaways
- The growing demand for real-time image synthesis has accelerated the development of one-step diffusion models, which offer faster generation.
- Negative prompting is effective for fine-grained control in multi-step models, but applying it to one-step generators can cause blending artifacts and degraded output quality.
- NASA is an efficient method for integrating negative prompts into one-step diffusion models, suppressing undesired visual attributes.
- NASA operates in the intermediate representation space, avoiding the blending artifacts of output-space guidance.
- Compared with the doubled computation of CFG, NASA adds only an extra 1.89% of FLOPs, making it highly efficient.
- NASA integrates seamlessly into existing timestep distillation frameworks, further improving the student model's output quality.
Click to view paper screenshots



Training-Free Layout-to-Image Generation with Marginal Attention Constraints
Authors:Huancheng Chen, Jingtao Li, Weiming Zhuang, Haris Vikalo, Lingjuan Lyu
Recently, many text-to-image diffusion models excel at generating high-resolution images from text but struggle with precise control over spatial composition and object counting. To address these challenges, prior works developed layout-to-image (L2I) approaches that incorporate layout instructions into text-to-image models. However, existing L2I methods typically require fine-tuning of pre-trained parameters or training additional control modules for the diffusion models. In this work, we propose a training-free L2I approach, MAC (Marginal Attention Constrained Generation), which eliminates the need for additional modules or fine-tuning. Specifically, we use text-visual cross-attention feature maps to quantify inconsistencies between the layout of the generated images and the provided instructions, and then compute loss functions to optimize latent features during the diffusion reverse process. To enhance spatial controllability and mitigate semantic failures in complex layout instructions, we leverage pixel-to-pixel correlations in the self-attention feature maps to align cross-attention maps and combine three loss functions constrained by boundary attention to update latent features. Comprehensive experimental results on both L2I and non-L2I pretrained diffusion models demonstrate that our method outperforms existing training-free L2I techniques both quantitatively and qualitatively in terms of image composition on the DrawBench and HRS benchmarks.
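To make the cross-attention-based layout constraint concrete, the snippet below computes a generic training-free layout loss on random stand-in attention maps, penalizing attention mass that falls outside each object's box; the exact losses in the paper (including the boundary-attention terms) differ, so treat this purely as a sketch.

```python
# Generic layout loss on cross-attention maps (assumed stand-in, not MAC's losses).
import torch

def layout_loss(attn, boxes):
    """attn: (num_tokens, H, W) cross-attention maps; boxes: one (x0, y0, x1, y1) per token."""
    loss = 0.0
    for k, (x0, y0, x1, y1) in enumerate(boxes):
        a = attn[k] / (attn[k].sum() + 1e-8)          # normalize each token's attention mass
        loss = loss + (1.0 - a[y0:y1, x0:x1].sum())   # penalize mass outside the box
    return loss / len(boxes)

raw = torch.rand(2, 64, 64, requires_grad=True)       # stand-in for UNet cross-attention
attn = raw.softmax(dim=-1)
boxes = [(4, 4, 28, 28), (36, 36, 60, 60)]
loss = layout_loss(attn, boxes)
loss.backward()                                       # in the real method this gradient
print(float(loss), raw.grad.abs().mean().item())      # updates the latent, not `raw`
```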
Paper & Project Links
Summary
This paper proposes MAC (Marginal Attention Constrained Generation), a training-free layout-to-image (L2I) method that needs no additional modules or fine-tuning. It uses text-visual cross-attention feature maps to quantify inconsistencies between the layout of generated images and the provided instructions, and computes loss functions to optimize latent features during the diffusion reverse process, improving spatial controllability and reducing semantic failures under complex layout instructions. Experiments show that MAC outperforms existing training-free L2I techniques on both L2I and non-L2I pretrained diffusion models, particularly in image composition.
Key Takeaways
- Text-to-image diffusion models struggle with precise control over spatial composition and object counting.
- Existing L2I methods typically require fine-tuning pretrained parameters or training additional control modules.
- MAC is a training-free L2I technique that uses text-visual cross-attention feature maps to quantify layout inconsistencies.
- MAC optimizes latent features during the diffusion reverse process by computing loss functions.
- MAC improves spatial controllability and reduces semantic failures under complex layout instructions.
- Experiments show that MAC performs strongly on both L2I and non-L2I pretrained diffusion models.
Click to view paper screenshots





IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
Authors:Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang
As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks-techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR’s high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR’s strong transferability and automated process, we introduce the VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses. VLJailbreakBench is publicly available at https://roywang021.github.io/VLJailbreakBench.
Paper & Project Links
Summary
This paper focuses on the safe deployment of large Vision-Language Models (VLMs). Targeting jailbreak attacks on VLMs, it proposes IDEATOR, a new attack method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR uses a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Experiments show that IDEATOR is highly effective and transferable, achieving a 94% attack success rate (ASR) on MiniGPT-4 and transferring well to other models. Building on IDEATOR, the authors introduce VLJailbreakBench, a safety benchmark of 3,654 multimodal jailbreak samples whose results reveal significant gaps in the safety alignment of current VLMs.
Key Takeaways
- Safe deployment of large Vision-Language Models (VLMs) is critical.
- Existing VLMs are vulnerable to jailbreak attacks, which exploit model weaknesses to elicit harmful outputs.
- IDEATOR is a new attack method that autonomously generates malicious image-text pairs for black-box jailbreak attacks.
- IDEATOR uses a VLM to generate targeted jailbreak texts and pairs them with diffusion-generated images.
- IDEATOR is highly effective and transferable, achieving high attack success rates across different models.
- The VLJailbreakBench safety benchmark contains a large set of multimodal jailbreak samples for evaluating VLM safety.
Click to view paper screenshots






Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion
Authors:Yijun Liang, Shweta Bhardwaj, Tianyi Zhou
Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute very different new data, diffusion models open a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images' proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome the limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn. With weaker image guidance, the synthetic images are easier for the model but contribute to a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel "Diffusion Curriculum (DisCL)". DisCL adjusts the image guidance level of image synthesis for each training stage: it identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on lower-guidance images of high quality to learn prototypical features as a warm-up for learning higher-guidance images that might be weak in diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to the iWildCam dataset. On ImageNet-LT, DisCL improves the base model's tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.
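A toy sketch of the curriculum idea follows: pick hard classes, request synthetic images at the guidance level assigned to the current stage, and move from weaker to stronger image guidance over training. The guidance grid, confidence values, and schedule below are invented for illustration.

```python
# Invented curriculum schedule over image-guidance levels (not DisCL's implementation).
guidance_levels = [0.2, 0.4, 0.6, 0.8, 1.0]   # available image-guidance strengths
class_confidence = {"wolf": 0.15, "deer": 0.82, "bobcat": 0.25, "fox": 0.90}

def stage_guidance(stage: int) -> float:
    """Move from weaker image guidance (diverse, prototypical) toward stronger
    guidance (closer to the real data) as training progresses."""
    return guidance_levels[min(stage, len(guidance_levels) - 1)]

for stage in range(5):
    hard_classes = [c for c, conf in class_confidence.items() if conf < 0.5]
    g = stage_guidance(stage)
    synth_requests = [(cls, g) for cls in hard_classes]
    print(f"stage {stage}: guidance={g}, request synthetic images for {synth_requests}")
    # ...train on real data plus the requested synthetic images, then
    # re-estimate per-class confidence before the next stage (omitted).
```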
Paper & Project Links
PDF Accepted in ICCV2025. 22 pages, including references and appendix. Code is available at http://github.com/tianyi-lab/DisCL
Summary
Diffusion models open a new door for building self-evolving AI by generating high-quality, diverse synthetic data from text-guided prompts, but text-only guidance cannot control how close synthetic images stay to the original images, which harms model performance. This work uses image guidance to obtain a spectrum of interpolations between synthetic and real images: by adjusting the guidance level, generated images can stay similar to the training data while still posing a useful learning challenge. On top of this, the authors build a Diffusion Curriculum (DisCL) that adjusts the image guidance level of image synthesis at each training stage, identifying and focusing on hard samples and selecting the most effective guidance level for learning them. DisCL delivers clear gains on long-tail classification and on learning from low-quality data.
Key Takeaways
- Diffusion models generate high-quality, diverse synthetic data from text prompts, opening a new route to self-evolving AI.
- Text-only guidance cannot control how close synthetic images stay to the original images.
- Image guidance overcomes this limitation, enabling a spectrum of interpolations between synthetic and real images.
- Adjusting the image guidance level trades off similarity to the training data against learning difficulty.
- DisCL adjusts the guidance level of synthesized images at each training stage to improve the model's learning of hard data.
- DisCL is particularly effective for long-tail classification and handling low-quality data.
- On the iWildCam dataset and ImageNet-LT, DisCL achieves significant performance gains.
Click to view paper screenshots




Sample what you can't compress
Authors:Vighnesh Birodkar, Gabriel Barcik, James Lyon, Sergey Ioffe, David Minnen, Joshua V. Dillon
For learned image representations, basic autoencoders often produce blurry results. Reconstruction quality can be improved by incorporating additional penalties such as adversarial (GAN) and perceptual losses. Arguably, these approaches lack a principled interpretation. Concurrently, in generative settings diffusion has demonstrated a remarkable ability to create crisp, high quality results and has solid theoretical underpinnings (from variational inference to direct study as the Fisher Divergence). Our work combines autoencoder representation learning with diffusion and is, to our knowledge, the first to demonstrate jointly learning a continuous encoder and decoder under a diffusion-based loss and showing that it can lead to higher compression and better generation. We demonstrate that this approach yields better reconstruction quality as compared to GAN-based autoencoders while being easier to tune. We also show that the resulting representation is easier to model with a latent diffusion model as compared to the representation obtained from a state-of-the-art GAN-based loss. Since our decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation; we therefore name our approach “Sample what you can’t compress”, or SWYCC for short.
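Below is a heavily simplified toy of the overall recipe: an encoder produces a deterministic code while the decoder is trained as a conditional denoiser with a diffusion-style objective instead of a plain MSE or GAN loss. The MLP architectures, linear noising schedule, and clean-image prediction target are assumptions chosen for brevity, not the paper's design.

```python
# Toy autoencoder whose decoder is trained with a diffusion-style denoising loss
# (assumed setup, not SWYCC's architecture or noise schedule).
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Flatten(), nn.Linear(784, 32))                  # image -> code z
dec = nn.Sequential(nn.Linear(784 + 32 + 1, 512), nn.ReLU(), nn.Linear(512, 784))
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-4)

x = torch.rand(16, 1, 28, 28).flatten(1)                               # dummy images
for step in range(5):
    z = enc(x)                                                         # deterministic latent
    t = torch.rand(x.size(0), 1)                                       # noise level in (0, 1)
    x_t = (1 - t) * x + t * torch.randn_like(x)                        # corrupt the image
    pred = dec(torch.cat([x_t, z, t], dim=1))                          # denoise, conditioned on z
    loss = ((pred - x) ** 2).mean()                                    # diffusion-style objective
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```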
Paper & Project Links
Summary
For learned image representations, basic autoencoders often produce blurry results; reconstruction quality can be improved with adversarial (GAN) and perceptual losses, but these approaches lack a principled interpretation. Meanwhile, diffusion has shown a remarkable ability to create crisp, high-quality results in generative settings and has solid theoretical underpinnings. This work combines autoencoder representation learning with diffusion and is, to the authors' knowledge, the first to jointly learn a continuous encoder and decoder under a diffusion-based loss, showing that this leads to higher compression and better generation. The approach yields better reconstruction quality than GAN-based autoencoders while being easier to tune, and the resulting representation is easier to model with a latent diffusion model than one obtained from a state-of-the-art GAN-based loss. Because the decoder is stochastic, it can generate details not encoded in the otherwise deterministic latent representation, hence the name "Sample what you can't compress" (SWYCC).
Key Takeaways
- Basic autoencoders produce blurry results; improving reconstruction quality requires additional penalties.
- Diffusion creates crisp, high-quality results in generative settings and has solid theoretical foundations.
- This is the first work to combine autoencoder representation learning with diffusion, jointly learning a continuous encoder and decoder.
- The proposed approach achieves higher compression and better generation.
- Compared with GAN-based autoencoders, it yields better reconstruction quality and is easier to tune.
- The resulting representation is easier to model with a latent diffusion model.
Click to view paper screenshots






Ctrl-Room: Controllable Text-to-3D Room Meshes Generation with Layout Constraints
Authors:Chuan Fang, Yuan Dong, Kunming Luo, Xiaotao Hu, Rakesh Shrestha, Ping Tan
Text-driven 3D indoor scene generation is useful for gaming, the film industry, and AR/VR applications. However, existing methods cannot faithfully capture the room layout, nor do they allow flexible editing of individual objects in the room. To address these problems, we present Ctrl-Room, which can generate convincing 3D rooms with designer-style layouts and high-fidelity textures from just a text prompt. Moreover, Ctrl-Room enables versatile interactive editing operations such as resizing or moving individual furniture items. Our key insight is to separate the modeling of layouts and appearance. Our proposed method consists of two stages: a Layout Generation Stage and an Appearance Generation Stage. The Layout Generation Stage trains a text-conditional diffusion model to learn the layout distribution with our holistic scene code parameterization. Next, the Appearance Generation Stage employs a fine-tuned ControlNet to produce a vivid panoramic image of the room guided by the 3D scene layout and text prompt. We thus achieve a high-quality 3D room generation with convincing layouts and lively textures. Benefiting from the scene code parameterization, we can easily edit the generated room model through our mask-guided editing module, without expensive edit-specific training. Extensive experiments on the Structured3D dataset demonstrate that our method outperforms existing methods in producing more reasonable, view-consistent, and editable 3D rooms from natural language prompts.
Paper & Project Links
Summary
This paper introduces Ctrl-Room, a text-driven 3D indoor scene generation method. From just a text prompt it can generate convincing 3D rooms with designer-style layouts and high-fidelity textures, and it supports flexible interactive editing such as resizing or moving individual furniture items. Its key insight is to separate layout modeling from appearance modeling, using a two-stage approach: a Layout Generation Stage and an Appearance Generation Stage.
Key Takeaways
- Ctrl-Room is a text-driven 3D indoor scene generation method useful for gaming, film, and AR/VR.
- Existing methods cannot faithfully capture room layouts; Ctrl-Room generates designer-style layouts from a text prompt.
- Ctrl-Room enables flexible interactive editing, such as moving or resizing furniture.
- The method separates layout and appearance modeling into two stages: a Layout Generation Stage and an Appearance Generation Stage.
- A fine-tuned ControlNet produces a vivid panoramic image of the room, guided by the 3D scene layout and text prompt.
- Thanks to the scene code parameterization, the generated room model can be edited easily without edit-specific training.
Click to view paper screenshots




