⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-10-11 更新
VideoCanvas: Unified Video Completion from Arbitrary Spatiotemporal Patches via In-Context Conditioning
Authors:Minghong Cai, Qiulin Wang, Zongli Ye, Wenze Liu, Quande Liu, Weicai Ye, Xintao Wang, Pengfei Wan, Kun Gai, Xiangyu Yue
We introduce the task of arbitrary spatio-temporal video completion, where a video is generated from arbitrary, user-specified patches placed at any spatial location and timestamp, akin to painting on a video canvas. This flexible formulation naturally unifies many existing controllable video generation tasks–including first-frame image-to-video, inpainting, extension, and interpolation–under a single, cohesive paradigm. Realizing this vision, however, faces a fundamental obstacle in modern latent video diffusion models: the temporal ambiguity introduced by causal VAEs, where multiple pixel frames are compressed into a single latent representation, making precise frame-level conditioning structurally difficult. We address this challenge with VideoCanvas, a novel framework that adapts the In-Context Conditioning (ICC) paradigm to this fine-grained control task with zero new parameters. We propose a hybrid conditioning strategy that decouples spatial and temporal control: spatial placement is handled via zero-padding, while temporal alignment is achieved through Temporal RoPE Interpolation, which assigns each condition a continuous fractional position within the latent sequence. This resolves the VAE’s temporal ambiguity and enables pixel-frame-aware control on a frozen backbone. To evaluate this new capability, we develop VideoCanvasBench, the first benchmark for arbitrary spatio-temporal video completion, covering both intra-scene fidelity and inter-scene creativity. Experiments demonstrate that VideoCanvas significantly outperforms existing conditioning paradigms, establishing a new state of the art in flexible and unified video generation.
我们提出了任意时空视频补全任务:根据用户在任意空间位置和任意时间戳放置的补丁生成视频,就像在视频画布上作画一样。这种灵活的形式化自然地将许多现有的可控视频生成任务(包括首帧图像到视频生成、视频修复、扩展和插值)统一到一个连贯的范式之下。然而,实现这一愿景面临现代潜在视频扩散模型中的一个根本障碍:因果VAE带来的时间模糊性,即多个像素帧被压缩成单一的潜在表示,使得精确的帧级条件控制在结构上变得困难。我们通过VideoCanvas来解决这一挑战,这是一个新型框架,它在不引入任何新参数的情况下,将上下文条件(ICC)范式适配到这项精细的控制任务上。我们提出了一种混合条件策略,将空间控制与时间控制解耦:空间位置通过零填充处理,时间对齐则通过Temporal RoPE插值实现,为每个条件在潜在序列中分配一个连续的分数位置。这解决了VAE的时间模糊问题,并在冻结的骨干网络上实现了像素帧感知的控制。为了评估这一新能力,我们构建了VideoCanvasBench,这是任意时空视频补全的首个基准测试,涵盖场景内保真度和跨场景创造力。实验表明,VideoCanvas显著优于现有的条件范式,在灵活统一的视频生成方面确立了新的最先进水平。
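下面给出一个极简的示意性代码草图(基于摘要描述的假设写法,并非论文官方实现),演示"Temporal RoPE插值"的核心思想:把像素帧索引按因果VAE的时间压缩率映射为潜在序列中的连续分数位置,再在该分数位置上施加旋转位置编码。其中 `vae_stride`、特征维度等均为假设值。

```python
# 示意性草图(非官方实现):把像素帧索引映射为潜在序列中的连续分数位置,
# 再在该位置上施加 RoPE 旋转;VAE 时间压缩率与特征维度均为假设值。
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """在(可为分数的)位置 pos 上计算 RoPE 各通道对的旋转角度。"""
    i = np.arange(dim // 2)
    freqs = 1.0 / (base ** (2 * i / dim))   # 每对通道的频率
    return pos * freqs                       # 角度 = 位置 × 频率

def apply_rope(x, pos, base=10000.0):
    """对特征向量 x(长度为偶数)施加位置 pos 处的 RoPE 旋转。"""
    d = x.shape[-1]
    theta = rope_angles(pos, d, base)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(theta), np.sin(theta)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

vae_stride = 4                                   # 假设因果VAE把4个像素帧压成1个潜帧
pixel_frame_idx = 10                             # 用户在第10个像素帧放置条件补丁
frac_latent_pos = pixel_frame_idx / vae_stride   # = 2.5,落在两个潜帧之间的分数位置

cond_feat = np.random.randn(64)
print(frac_latent_pos, apply_rope(cond_feat, frac_latent_pos).shape)
```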
论文及项目相关链接
PDF Project page: https://onevfall.github.io/project_page/videocanvas
Summary
本文提出了任意时空视频补全这一生成任务,即根据用户放置在任意空间位置和时间戳上的补丁生成视频。研究提出了名为VideoCanvas的新框架,在不新增任何参数的前提下利用上下文条件(ICC)范式,解决了现有潜在视频扩散模型中的时间模糊问题。通过空间零填充和Temporal RoPE插值技术,实现了像素帧级的精细控制。实验证明,VideoCanvas在任意时空视频补全任务上显著优于现有条件范式,确立了新的技术标杆。
Key Takeaways
- 任意时空视频补全任务介绍:该技术允许在视频的任意位置和时间生成新的片段。
- VideoCanvas框架解决了现有潜在视频扩散模型中的时间模糊问题。
- VideoCanvas在不新增参数的情况下利用上下文条件(ICC)范式实现精细控制。
- 空间控制通过零填充实现,时间控制则通过Temporal RoPE插值实现。
- VideoCanvas显著提高了任意时空视频补全任务的性能。
- VideoCanvasBench作为首个针对任意时空视频补全任务的基准测试,用于评估视频生成的质量和创新性。

X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering
Authors:Zhitong Huang, Mohan Zhang, Renhan Wang, Rui Tang, Hao Zhu, Jing Liao
We present X2Video, the first diffusion model for rendering photorealistic videos guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance, while supporting intuitive multi-modal controls with reference images and text prompts for both global and local regions. The intrinsic guidance allows accurate manipulation of color, material, geometry, and lighting, while reference images and text prompts provide intuitive adjustments in the absence of intrinsic information. To enable these functionalities, we extend the intrinsic-guided image generation model XRGB to video generation by employing a novel and efficient Hybrid Self-Attention, which ensures temporal consistency across video frames and also enhances fidelity to reference images. We further develop a Masked Cross-Attention to disentangle global and local text prompts, applying them effectively onto respective local and global regions. For generating long videos, our novel Recursive Sampling method incorporates progressive frame sampling, combining keyframe prediction and frame interpolation to maintain long-range temporal consistency while preventing error accumulation. To support the training of X2Video, we assembled a video dataset named InteriorVideo, featuring 1,154 rooms from 295 interior scenes, complete with reliable ground-truth intrinsic channel sequences and smooth camera trajectories. Both qualitative and quantitative evaluations demonstrate that X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions. Additionally, X2Video effectively accommodates multi-modal controls with reference images, global and local text prompts, and simultaneously supports editing on color, material, geometry, and lighting through parametric tuning. Project page: https://luckyhzt.github.io/x2video
我们推出了X2Video,这是首个在反照率、法线、粗糙度、金属度和辐照度等内在通道引导下渲染照片级真实感视频的扩散模型,同时支持用参考图像和文本提示对全局和局部区域进行直观的多模态控制。内在通道引导可以精确操控颜色、材质、几何和光照,而参考图像和文本提示则在缺乏内在信息时提供直观的调整手段。为了实现这些功能,我们通过一种新颖且高效的混合自注意力(Hybrid Self-Attention)机制,将内在引导的图像生成模型XRGB扩展到视频生成,既保证了视频帧之间的时间一致性,也提高了对参考图像的保真度。我们进一步设计了掩码交叉注意力(Masked Cross-Attention)来解耦全局与局部文本提示,并将其分别有效地作用于全局和局部区域。为了生成长视频,我们提出了新颖的递归采样(Recursive Sampling)方法,结合渐进式帧采样、关键帧预测与帧插值,在保持长程时间一致性的同时防止误差累积。为了支持X2Video的训练,我们构建了名为InteriorVideo的视频数据集,包含来自295个室内场景的1154个房间,并配有可靠的真值内在通道序列和平滑的相机轨迹。定性和定量评估均表明,X2Video可以在内在条件引导下生成长时、时间一致且照片级真实感的视频。此外,X2Video能够很好地支持参考图像、全局与局部文本提示等多模态控制,并支持通过参数调节对颜色、材质、几何和光照进行编辑。项目页面:https://luckyhzt.github.io/x2video
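下面是一个示意性的代码草图(非论文官方实现),演示"掩码交叉注意力"如何把全局与局部文本提示分别作用到对应的图像区域:属于局部编辑区域的图像 token 只关注局部提示的文本 token,其余 token 只关注全局提示。张量形状与提示编码方式均为假设。

```python
# 示意性草图(非官方实现):用注意力掩码把全局/局部文本提示路由到对应的图像区域。
import torch
import torch.nn.functional as F

def masked_cross_attention(img_tokens, global_txt, local_txt, region_mask):
    """
    img_tokens : (N, d)  图像 token
    global_txt : (Lg, d) 全局提示的文本 token
    local_txt  : (Ll, d) 局部提示的文本 token
    region_mask: (N,)    bool,True 表示该图像 token 属于局部编辑区域
    """
    txt = torch.cat([global_txt, local_txt], dim=0)              # (Lg+Ll, d)
    scores = img_tokens @ txt.t() / img_tokens.shape[-1] ** 0.5  # (N, Lg+Ll)

    Lg = global_txt.shape[0]
    attn_mask = torch.zeros_like(scores, dtype=torch.bool)
    attn_mask[region_mask, :Lg] = True     # 局部区域屏蔽全局提示
    attn_mask[~region_mask, Lg:] = True    # 非局部区域屏蔽局部提示
    scores = scores.masked_fill(attn_mask, float("-inf"))

    attn = F.softmax(scores, dim=-1)
    return attn @ txt                      # (N, d)

img = torch.randn(16, 64)
out = masked_cross_attention(img, torch.randn(8, 64), torch.randn(8, 64),
                             torch.rand(16) > 0.5)
print(out.shape)  # torch.Size([16, 64])
```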
论文及项目相关链接
PDF Code, model, and dataset will be released at project page soon: https://luckyhzt.github.io/x2video
摘要
本文介绍了X2Video,这是首个由内在通道(反照率、法线、粗糙度、金属度、辐照度)引导的照片级真实感视频扩散模型。该模型支持直观的多模态控制,可以通过参考图像和文本提示对全局和局部区域进行操控;内在引导可实现对颜色、材质、几何和光照的精确操控。为支持这些功能,本文将内在引导的图像生成模型XRGB扩展到视频生成,采用高效的混合自注意力机制,确保视频帧间的时序一致性并提高对参考图像的保真度;进一步提出掩码交叉注意力机制,将全局和局部文本提示有效地应用于相应区域。对于长视频生成,本文提出递归采样方法,结合关键帧预测和帧插值,维持长程时序一致性并防止误差累积。为支持X2Video的训练,本文构建了名为InteriorVideo的视频数据集,包含来自295个室内场景的1,154个房间,并配有可靠的真值内在通道序列和平滑相机轨迹。评估结果表明,X2Video可生成长时、时间一致且照片级真实感的视频,并有效支持多模态控制及基于全局和局部文本提示的编辑。
关键见解
- X2Video是首个结合内在通道(反照率、法线、粗糙度、金属度、辐照度)引导的扩散模型,用于生成照片级真实感视频。
- 模型支持通过参考图像和文本提示对全局和局部区域进行直观调整。
- 采用高效混合自注意力机制确保视频帧间的时序一致性。
- 掩膜交叉注意力机制使全局和局部文本提示能更精准作用于相应区域。
- 提出递归采样方法以生成长视频,维持长时间的一致性和防止误差累积。
- 创建了InteriorVideo数据集,提供可靠的真值内在通道序列和平滑相机轨迹。
- X2Video在生成长、时间上一致且光写实视频方面表现出优越性能,并支持多模式控制和编辑。

InstructX: Towards Unified Visual Editing with MLLM Guidance
Authors:Chong Mou, Qichao Sun, Yanze Wu, Pengze Zhang, Xinghui Li, Fulong Ye, Songtao Zhao, Qian He
With recent advances in Multimodal Large Language Models (MLLMs) showing strong visual understanding and reasoning, interest is growing in using them to improve the editing performance of diffusion models. Despite rapid progress, most studies lack an in-depth analysis of MLLM design choices. Moreover, the integration of MLLMs and diffusion models remains an open challenge in some difficult tasks, such as video editing. In this paper, we present InstructX, a unified framework for image and video editing. Specifically, we conduct a comprehensive study on integrating MLLMs and diffusion models for instruction-driven editing across diverse tasks. Building on this study, we analyze the cooperation and distinction between images and videos in unified modeling. (1) We show that training on image data can lead to emergent video editing capabilities without explicit supervision, thereby alleviating the constraints imposed by scarce video training data. (2) By incorporating modality-specific MLLM features, our approach effectively unifies image and video editing tasks within a single model. Extensive experiments demonstrate that our method can handle a broad range of image and video editing tasks and achieves state-of-the-art performance.
随着多模态大型语言模型(MLLMs)在视觉理解和推理方面的强大能力日益显现,人们对其改进扩散模型的编辑性能的兴趣也在增长。尽管进展迅速,但大多数研究缺乏对MLLM设计选择的深入分析。此外,将MLLMs和扩散模型集成在一起在某些困难的任务(如视频编辑)中仍然是一个开放性的挑战。在本文中,我们提出了InstructX,这是一个用于图像和视频编辑的统一框架。具体来说,我们对集成MLLMs和扩散模型以进行指令驱动的跨多种任务的编辑进行了深入研究。基于这项研究,我们分析了图像和视频在统一建模中的合作和区别。(1)我们表明,在图像数据上进行训练可以在没有显式监督的情况下产生视频编辑能力,从而减轻了因视频训练数据稀缺而受到的约束。(2)通过融入模态特定的MLLM特性,我们的方法有效地将图像和视频编辑任务统一到一个单一模型中。大量实验表明,我们的方法可以处理广泛的图像和视频编辑任务,并达到了最先进的性能。
论文及项目相关链接
Summary
随着多模态大型语言模型(MLLMs)在视觉理解和推理方面的进展,利用它们提升扩散模型的编辑性能成为研究热点。本文提出InstructX,一个用于图像和视频编辑的统一框架,对MLLM与扩散模型在多种任务上的指令驱动编辑整合进行了系统研究。研究发现,仅在图像数据上训练即可在无显式监督的情况下涌现出视频编辑能力;结合模态特定的MLLM特征,则实现了图像和视频编辑任务在单一模型中的统一。
Key Takeaways
- 多模态大型语言模型(MLLMs)在视觉理解和推理方面取得进展,有助于提升扩散模型的编辑性能。
- InstructX框架用于图像和视频编辑的指令驱动编辑,具有广泛的应用潜力。
- 整合MLLMs和扩散模型的研究对于解决视频编辑等困难任务具有重要意义。
- 在图像数据上训练即可在无显式监督的情况下涌现出视频编辑能力。
- InstructX框架通过结合模态特定的MLLM特性,实现了图像和视频编辑任务的统一建模。
- 该方法能够处理广泛的图像和视频编辑任务,并达到最新技术水平。

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Authors:Kaiwen Zheng, Yuji Wang, Qianli Ma, Huayu Chen, Jintao Zhang, Yogesh Balaji, Jianfei Chen, Ming-Yu Liu, Jun Zhu, Qinsheng Zhang
This work represents the first effort to scale up continuous-time consistency distillation to general application-level image and video diffusion models. Although continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of standard evaluation benchmarks. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the “mode-covering” nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the “mode-seeking” reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM matches or surpasses the state-of-the-art distillation method DMD2 on quality metrics while offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation.
本文首次尝试将连续时间一致性蒸馏扩展到通用应用级的图像和视频扩散模型。尽管连续时间一致性模型(sCM)在理论上有良好依据,并且在加速学术规模的扩散模型方面有很强的实证效果,但由于雅可比-向量积(JVP)计算的基础设施挑战以及标准评估基准的局限,它在大规模文本到图像和视频任务上的适用性仍不明确。我们首先开发了一个与并行策略兼容的FlashAttention-2 JVP内核,使得在超过100亿参数的模型和高维视频任务上训练sCM成为可能。我们的研究发现sCM在细节生成方面存在根本性的质量限制,并将其归因于误差累积及其前向散度目标的"模式覆盖"性质。为了解决这个问题,我们提出了得分正则化的连续时间一致性模型(rCM),它将得分蒸馏作为长跳连接式的正则项引入。这一设计用"模式寻求"的反向散度对sCM加以补充,在保持高生成多样性的同时有效提升了视觉质量。在高达14B参数、5秒视频的大规模模型(Cosmos-Predict2、Wan2.1)上的验证表明,rCM在质量指标上与最先进的蒸馏方法DMD2持平或更优,同时在多样性方面具有显著优势,而且无需GAN调参或大规模超参数搜索。蒸馏后的模型仅需1~4步即可生成高保真样本,将扩散采样加速15~50倍。这些结果使rCM成为推进大规模扩散蒸馏的一个实用且有理论依据的框架。
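下面是一个高度简化的示意性代码草图(并非论文公式或官方实现),仅用来说明"一致性目标 + 蒸馏正则项"这种组合的大致结构;此处的正则项被简化为向教师(预训练扩散模型)预测回归,而真实的rCM使用的是类似DMD的得分蒸馏(反向散度)目标,形式与此不同。加噪方式、模型接口与权重 `lam` 均为假设。

```python
# 高度简化的示意性草图:一致性损失 + 简化的蒸馏正则项(真实 rCM 使用得分蒸馏)。
import torch

def rcm_like_loss(student, teacher, x0, t, s, lam=1.0):
    """student/teacher: 可调用对象,输入 (x_t, t) 返回对干净样本的预测;t > s 为两个噪声水平。"""
    noise = torch.randn_like(x0)
    x_t = x0 + t * noise          # 简化的前向加噪,仅作示意
    x_s = x0 + s * noise

    # (1) 一致性项:同一条轨迹上不同噪声水平的预测应当一致
    pred_t = student(x_t, t)
    with torch.no_grad():
        pred_s = student(x_s, s)  # 作为一致性目标(停止梯度)
    loss_cm = ((pred_t - pred_s) ** 2).mean()

    # (2) 蒸馏正则项:学生的预测应靠近教师的预测(真实 rCM 中为得分蒸馏)
    with torch.no_grad():
        teacher_pred = teacher(x_t, t)
    loss_reg = ((pred_t - teacher_pred) ** 2).mean()

    return loss_cm + lam * loss_reg

f = lambda x, t: 0.9 * x          # 占位模型,仅用于演示可运行
print(rcm_like_loss(f, f, torch.randn(4, 8), torch.tensor(0.8), torch.tensor(0.3)).item())
```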
论文及项目相关链接
摘要
本研究首次将连续时间一致性蒸馏扩展到大规模文本到图像和视频任务。通过开发并行兼容的FlashAttention-2 JVP内核,实现了在超过100亿参数的模型和高维视频任务上的sCM训练。研究发现sCM在细节生成上存在质量局限,为此提出了得分正则化的连续时间一致性模型(rCM)。rCM把得分蒸馏作为长跳连接式的正则项引入,用"模式寻求"的反向散度补充sCM的前向散度目标,在保持高生成多样性的同时提高了视觉质量。在大型模型(Cosmos-Predict2、Wan2.1)上的实验表明,rCM在质量指标上达到或超越了最新的蒸馏方法DMD2,且在多样性上具有显著优势,无需GAN调参和大规模超参数搜索。蒸馏后的模型仅需1~4步即可生成高保真样本,将扩散采样加速15~50倍。
关键见解
- 本研究首次扩展了连续时间一致性蒸馏在通用应用级别的图像和视频扩散模型中的应用。
- 开发了FlashAttention-2 JVP内核,支持在大型模型和高维视频任务上进行sCM训练。
- 揭示了sCM在细节生成方面的质量局限,归因于误差累积和它的前向发散目标的“模式覆盖”特性。
- 提出了评分正则化的连续时间一致性模型(rCM),通过结合sCM和评分蒸馏,提高了视觉质量并保持高生成多样性。
- rCM在大型模型上的实验表现优于现有的蒸馏方法DMD2,特别是在质量指标和多样性上。
- rCM不需要复杂的GAN调整和超参数搜索,简化了模型训练过程。
- 蒸馏后的模型能够在较少的步骤内生成高保真样本,显著加速了扩散采样过程。

Hyperspectral data augmentation with transformer-based diffusion models
Authors:Mattia Ferrari, Lorenzo Bruzzone
The introduction of new generation hyperspectral satellite sensors, combined with advancements in deep learning methodologies, has significantly enhanced the ability to discriminate detailed land-cover classes at medium-large scales. However, a significant challenge in deep learning methods is the risk of overfitting when training networks with small labeled datasets. In this work, we propose a data augmentation technique that leverages a guided diffusion model. To effectively train the model with a limited number of labeled samples and to capture complex patterns in the data, we implement a lightweight transformer network. Additionally, we introduce a modified weighted loss function and an optimized cosine variance scheduler, which facilitate fast and effective training on small datasets. We evaluate the effectiveness of the proposed method on a forest classification task with 10 different forest types using hyperspectral images acquired by the PRISMA satellite. The results demonstrate that the proposed method outperforms other data augmentation techniques in both average and weighted average accuracy. The effectiveness of the method is further highlighted by the stable training behavior of the model, which addresses a common limitation in the practical application of deep generative models for data augmentation.
新一代高光谱卫星传感器的引入,结合深度学习方法的进步,显著提高了在中大尺度上识别精细土地覆盖类别的能力。然而,深度学习方法的一个重大挑战是,在使用小规模标注数据集训练网络时存在过拟合风险。在这项工作中,我们提出了一种利用引导扩散模型的数据增强技术。为了在有限数量的标注样本上有效训练模型并捕获数据中的复杂模式,我们实现了一个轻量级的Transformer网络。此外,我们引入了一种改进的加权损失函数和优化后的余弦方差调度器,有助于在小型数据集上进行快速有效的训练。我们在森林分类任务上评估了所提方法的有效性,该任务利用PRISMA卫星获取的高光谱图像对10种不同类型的森林进行分类。结果表明,该方法在平均准确率和加权平均准确率上均优于其他数据增强技术。模型稳定的训练行为进一步凸显了该方法的有效性,这也缓解了深度生成模型用于数据增强时在实际应用中的常见局限。
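作为参考,下面给出标准余弦方差(噪声)调度的示意性实现(Nichol & Dhariwal, 2021 提出的形式);论文使用的是其"优化版本",具体改动未在摘要中给出,此处仅展示基线写法。

```python
# 示意性草图:扩散模型常用的余弦 beta 调度(基线形式,非论文的优化版本)。
import numpy as np

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    t = np.arange(T + 1) / T
    alpha_bar = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = alpha_bar / alpha_bar[0]          # 归一化使 alpha_bar(0) = 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]    # 由累计量反推每一步的 beta
    return np.clip(betas, 0.0, max_beta)

betas = cosine_beta_schedule(1000)
print(betas[:3], betas[-3:])
```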
论文及项目相关链接
PDF 10 pages, 2 figures, accepted at SPIE REMOTE SENSING conference 16-20 September 2024 Edinburgh, United Kingdom
Summary
新一代高光谱卫星传感器与深度学习方法的结合,提高了对中大尺度土地覆盖类别的鉴别能力。针对小规模标注数据集训练网络时易出现的过拟合问题,本研究提出了一种基于引导扩散模型的数据增强技术,并实现了轻量级Transformer网络以捕获数据中的复杂模式。通过改进加权损失函数和优化余弦方差调度器,该方法能在有限标注样本上快速有效地训练模型。在利用PRISMA卫星获取的高光谱图像进行的森林分类任务中,该方法在平均和加权平均精度上均优于其他数据增强技术,并展现出稳定的训练行为,缓解了深度生成模型在实际应用中面临的常见局限。
Key Takeaways
- 新一代高光谱卫星传感器与深度学习结合,提高了对土地覆盖类别的鉴别能力。
- 针对小标签数据集训练网络易过拟合的问题,提出了基于引导扩散模型的数据增强技术。
- 实现了轻量级Transformer网络以更有效地捕获数据中的复杂模式。
- 改进了加权损失函数和优化了余弦方差调度器,促进模型的快速有效训练。
- 在森林分类任务中,该方法在平均和加权平均精度上表现出优越性能。
- 该方法展示的稳定训练行为解决了深度生成模型在实际应用中的常见局限性。

One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting
Authors:Haipeng Liu, Yang Wang, Meng Wang
Text-guided image inpainting aims at reconstructing the masked regions as per text prompts, where the longstanding challenges lie in the preservation for unmasked regions, while achieving the semantics consistency between unmasked and inpainted masked regions. Previous arts failed to address both of them, always with either of them to be remedied. Such facts, as we observed, stem from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties, which exhibit different robustness to text prompts during the denoising process. In this paper, we propose a null-text-null frequency-aware diffusion models, dubbed \textbf{NTN-Diff}, for text-guided image inpainting, by decomposing the semantics consistency across masked and unmasked regions into the consistencies as per each frequency band, while preserving the unmasked regions, to circumvent two challenges in a row. Based on the diffusion process, we further divide the denoising process into early (high-level noise) and late (low-level noise) stages, where the mid-and-low frequency bands are disentangled during the denoising process. As observed, the stable mid-frequency band is progressively denoised to be semantically aligned during text-guided denoising process, which, meanwhile, serves as the guidance to the null-text denoising process to denoise low-frequency band for the masked regions, followed by a subsequent text-guided denoising process at late stage, to achieve the semantics consistency for mid-and-low frequency bands across masked and unmasked regions, while preserve the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over the state-of-the-art diffusion models to text-guided diffusion models. Our code can be accessed from https://github.com/htyjers/NTN-Diff.
文本引导的图像修复旨在根据文本提示重建遮挡区域,长期以来的挑战在于保留未遮挡区域,同时实现未遮挡区域和修复遮挡区域之间的语义一致性。以前的方法未能同时解决这两个问题,通常只能解决其中之一。我们观察到,这种情况源于混合(例如中低频)频带之间的纠缠,这些频带编码了不同的图像属性,在降噪过程中对不同文本提示表现出不同的稳健性。在本文中,我们提出了一种用于文本引导的图像修复的零文本零频率感知扩散模型,称为NTN-Diff。它通过分解遮挡和未遮挡区域之间的语义一致性,针对每个频带进行一致性处理,同时保留未遮挡区域,从而连续解决两个挑战。基于扩散过程,我们将降噪过程进一步分为早期(高级噪声)和晚期(低级噪声)阶段,其中中低频带在降噪过程中被解开。观察到稳定的中频带在文本引导的降噪过程中逐渐去噪以实现语义对齐,同时作为零文本降噪过程的指导来去除遮挡区域的低频带噪声,随后在后期进行后续的文本引导降噪过程,以实现遮挡和未遮挡区域之间的中低频带的语义一致性,同时保留未遮挡区域。大量实验验证了NTN-Diff在文本引导扩散模型方面优于最先进的扩散模型。我们的代码可从https://github.com/htyjers/NTN-Diff访问。
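下面是一个示意性的代码草图(非论文官方实现),用FFT半径掩码把图像分解为低频、中频、高频三个频带,对应文中"按频带分别处理语义一致性"的直觉;半径阈值为假设值。

```python
# 示意性草图(非官方实现):基于 FFT 半径掩码的频带分解。
import numpy as np

def split_frequency_bands(img, r_low=0.1, r_mid=0.4):
    h, w = img.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)          # 归一化频率半径

    spec = np.fft.fft2(img)
    masks = {
        "low": radius < r_low,
        "mid": (radius >= r_low) & (radius < r_mid),
        "high": radius >= r_mid,
    }
    return {name: np.real(np.fft.ifft2(spec * m)) for name, m in masks.items()}

img = np.random.rand(64, 64)
bands = split_frequency_bands(img)
print({k: v.shape for k, v in bands.items()})
# 三个频带相加应当近似还原原图
print(np.allclose(img, bands["low"] + bands["mid"] + bands["high"]))
```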
论文及项目相关链接
PDF 25 pages, 11 figures, to appear NeurIPS 2025
摘要
文本引导的图像修复是根据文本提示重构被遮蔽区域的过程,其长期存在的挑战在于如何保持未遮蔽区域的完整性,同时在未遮蔽和修复的被遮蔽区域之间实现语义一致性。先前的方法无法解决这两个问题,常常只能解决其中之一。这种情况源于混合频率波段(如中低频)的纠缠,这些频率波段编码了不同的图像属性,在降噪过程中对文本提示的响应不同。在本文中,我们提出了一种用于文本引导的图像修复的零文本零频率感知扩散模型(NTN-Diff),通过分解被遮蔽和未遮蔽区域之间的语义一致性到每个频率波段,同时保持未遮蔽区域的完整性,以此连续解决两个挑战。基于扩散过程,我们将降噪过程分为早期(高级噪声)和晚期(低级噪声)阶段,其中中低频波段在降噪过程中被分离。观察到稳定的中频波段在文本引导的降噪过程中逐步去噪以实现语义对齐,同时作为零文本降噪过程的指导来修复被遮蔽区域的低频波段,随后在晚期阶段进行后续的文本引导降噪过程,以实现被遮蔽和未遮蔽区域之间的中低频波段的语义一致性,同时保持未遮蔽区域的完整性。大量实验验证了NTN-Diff相较于先进的扩散模型在文本引导扩散模型中的优越性。我们的代码可从https://github.com/htyjers/NTN-Diff访问。
关键见解
- 文本引导的图像修复旨在根据文本提示重建被遮蔽区域,同时保持未遮蔽区域的完整性。
- 此前的方法在中低频波段的纠缠问题上存在挑战,导致无法同时解决语义一致性和未遮蔽区域保持的问题。
- NTN-Diff模型通过分解语义一致性到每个频率波段,解决了上述问题。
- NTN-Diff模型将降噪过程分为早期和晚期阶段,并发现稳定的中频波段在文本引导的降噪过程中起着关键作用。
- 零文本降噪过程用于修复被遮蔽区域的低频波段,随后进行文本引导的第二阶段降噪。
- 通过这种方式,NTN-Diff实现了被遮蔽和未遮蔽区域之间的中低频波段的语义一致性。

SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
Authors:Andreas Engelhardt, Mark Boss, Vikram Voletti, Chun-Han Yao, Hendrik P. A. Lensch, Varun Jampani
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
我们提出了Stable Video Materials 3D(SViM3D)框架,可从单张图像预测多视角一致的基于物理的渲染(PBR)材质。最近,视频扩散模型已被成功用于从单张图像高效重建3D对象;然而,反射属性仍由简单的材质模型表示,或需要额外步骤进行估计,才能支持重光照和可控的外观编辑。我们扩展了潜在视频扩散模型,使其在显式相机控制下,随每个生成视图联合输出空间变化的PBR参数和表面法线。这种独特的设置允许以我们的模型作为神经先验进行重光照并生成3D资产。我们在该流程中引入了多种机制,以在这一不适定问题中提升质量。我们在多个以物体为中心的数据集上展示了最先进的重光照和新视角合成性能。我们的方法可以推广到多样的输入,能够生成可重光照的3D资产,适用于AR/VR、电影、游戏和其他视觉媒体。
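下面是一个极简的示意性代码草图,说明"拿到逐像素反照率与法线后即可重光照"的直觉:这里只使用最基础的Lambertian着色,与论文实际采用的PBR渲染模型无关,光照方向、颜色等均为假设。

```python
# 示意性草图:用反照率和法线做最简单的 Lambertian 重光照(与论文渲染模型无关)。
import numpy as np

def lambertian_relight(albedo, normals, light_dir, light_color=(1.0, 1.0, 1.0)):
    """
    albedo : (H, W, 3) 反照率
    normals: (H, W, 3) 单位法线
    light_dir: (3,) 指向光源的方向
    """
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    ndotl = np.clip(normals @ l, 0.0, None)[..., None]   # (H, W, 1) 余弦项
    return np.clip(albedo * np.asarray(light_color) * ndotl, 0.0, 1.0)

H = W = 4
albedo = np.full((H, W, 3), 0.6)
normals = np.tile(np.array([0.0, 0.0, 1.0]), (H, W, 1))
img = lambertian_relight(albedo, normals, light_dir=[0.3, 0.3, 1.0])
print(img.shape, img[0, 0])
```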
论文及项目相关链接
PDF Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
Summary
本文提出了一个名为Stable Video Materials 3D(SViM3D)的框架,该框架能够从单张图像预测多视角一致性的物理基础渲染(PBR)材料。通过扩展潜在视频扩散模型,该框架能够联合生成每个视角的空间变化PBR参数和表面法线,并基于明确的相机控制进行输出。这一独特设置允许使用我们的模型作为神经先验来进行重新照明和生成3D资产。本文介绍了多种机制,可以在这个不适定的环境中提高质量。在多个以物体为中心的数据集上,我们的方法显示了最先进的重新照明和新视角合成性能。我们的方法能够推广到各种输入,生成的重新照明的3D资产可用于AR/VR、电影、游戏和其他视觉媒体。
Key Takeaways
- SViM3D框架能从单张图像预测多视角一致性的PBR材料。
- 框架扩展了潜在视频扩散模型以联合生成空间变化的PBR参数和表面法线。
- 通过明确的相机控制进行输出,允许重新照明和生成3D资产。
- 引入多种机制以提高在不适定环境中的质量。
- 在多个物体为中心的数据集上显示了最先进的重新照明和新视角合成性能。
- 方法能够推广到多种输入,具有广泛的应用性。

UniMMVSR: A Unified Multi-Modal Framework for Cascaded Video Super-Resolution
Authors:Shian Du, Menghan Xia, Chang Liu, Quande Liu, Xintao Wang, Pengfei Wan, Xiangyang Ji
Cascaded video super-resolution has emerged as a promising technique for decoupling the computational burden associated with generating high-resolution videos using large foundation models. Existing studies, however, are largely confined to text-to-video tasks and fail to leverage additional generative conditions beyond text, which are crucial for ensuring fidelity in multi-modal video generation. We address this limitation by presenting UniMMVSR, the first unified generative video super-resolution framework to incorporate hybrid-modal conditions, including text, images, and videos. We conduct a comprehensive exploration of condition injection strategies, training schemes, and data mixture techniques within a latent video diffusion model. A key challenge was designing distinct data construction and condition utilization methods to enable the model to precisely utilize all condition types, given their varied correlations with the target video. Our experiments demonstrate that UniMMVSR significantly outperforms existing methods, producing videos with superior detail and a higher degree of conformity to multi-modal conditions. We also validate the feasibility of combining UniMMVSR with a base model to achieve multi-modal guided generation of 4K video, a feat previously unattainable with existing techniques.
级联视频超分辨率已成为一种有前景的技术,可以把使用大型基础模型生成高分辨率视频的计算负担解耦开来。然而,现有研究大多局限于文本到视频任务,未能利用文本以外的其他生成条件,而这些条件对于确保多模态视频生成的保真度至关重要。我们提出UniMMVSR来解决这一局限,它是第一个引入混合模态条件(包括文本、图像和视频)的统一生成式视频超分辨率框架。我们在潜在视频扩散模型中全面探索了条件注入策略、训练方案和数据混合技术。一个关键挑战在于,鉴于各类条件与目标视频的相关性各不相同,需要设计不同的数据构建和条件利用方法,使模型能够精确利用所有条件类型。实验表明,UniMMVSR显著优于现有方法,生成的视频细节更丰富,也更符合多模态条件。我们还验证了将UniMMVSR与基础模型相结合、实现多模态引导的4K视频生成的可行性,这是现有技术此前无法做到的。
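下面是一个示意性的代码草图(非论文官方实现),展示级联超分中一种常见的条件注入方式:把低分辨率视频潜变量上采样后与噪声潜变量在通道维拼接;张量形状均为假设,文本与参考图像等其他条件此处省略。

```python
# 示意性草图(非官方实现):低分辨率视频潜变量的通道拼接式条件注入。
import torch
import torch.nn.functional as F

def inject_lr_condition(noisy_latent, lr_latent):
    """
    noisy_latent: (B, C, T, H, W)  当前去噪步的高分辨率潜变量
    lr_latent   : (B, C, T, h, w)  低分辨率视频的潜变量(h<H, w<W)
    """
    B, C, T, H, W = noisy_latent.shape
    lr_up = F.interpolate(
        lr_latent.flatten(0, 1), size=(H, W), mode="bilinear", align_corners=False
    ).view(B, C, T, H, W)                             # 空间上采样到目标分辨率
    return torch.cat([noisy_latent, lr_up], dim=1)    # 通道维拼接,C -> 2C

x = torch.randn(1, 4, 8, 64, 64)
lr = torch.randn(1, 4, 8, 16, 16)
print(inject_lr_condition(x, lr).shape)  # torch.Size([1, 8, 8, 64, 64])
```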
论文及项目相关链接
Summary
该研究提出了名为UniMMVSR的统一生成式视频超分辨率框架,首次引入文本、图像和视频等混合模态条件,解决了现有级联超分研究局限于文本到视频任务、难以在多模态条件下保证保真度的问题。研究团队基于潜在视频扩散模型,深入探讨了条件注入策略、训练方案和数据混合技术。实验表明,UniMMVSR能够精确利用不同类型的条件,生成细节更丰富、更符合多模态条件的视频,并能与基础模型结合实现多模态引导的4K视频生成。
Key Takeaways
- UniMMVSR是首个结合多种模态条件的统一视频超分辨率生成框架,包括文本、图像和视频。
- 研究团队探讨了条件注入策略、训练方案和数据混合技术。
- 设计了不同的数据构建和条件利用方法,以适应各类模态条件与目标视频之间不同的相关性。
- UniMMVSR显著提高了视频生成的质量,并在多模态条件下表现出更好的性能。
- 与现有技术相比,UniMMVSR能够生成更精细的视频内容。
- UniMMVSR能与基础模型结合,实现多模态指导下的4K视频生成,这是以前的技术无法达到的。

Real-Time Motion-Controllable Autoregressive Video Diffusion
Authors:Kesen Zhao, Jiaxin Shi, Beier Zhu, Junbao Zhou, Xiaolong Shen, Yuan Zhou, Qianru Sun, Hanwang Zhang
Real-time motion-controllable video generation remains challenging due to the inherent latency of bidirectional diffusion models and the lack of effective autoregressive (AR) approaches. Existing AR video diffusion models are limited to simple control signals or text-to-video generation, and often suffer from quality degradation and motion artifacts in few-step generation. To address these challenges, we propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout mechanism and accelerates training by selectively introducing stochasticity in denoising steps. Extensive experiments demonstrate that AR-Drag achieves high visual fidelity and precise motion alignment, significantly reducing latency compared with state-of-the-art motion-controllable VDMs, while using only 1.3B parameters. Additional visualizations can be found on our project page: https://kesenzhao.github.io/AR-Drag.github.io/.
由于双向扩散模型的固有延迟以及缺乏有效的自回归(AR)方法,实时运动可控的视频生成仍然具有挑战性。现有的AR视频扩散模型仅支持简单的控制信号或文本到视频生成,并且在少步生成时经常出现质量下降和运动伪影。为应对这些挑战,我们提出了AR-Drag,这是第一个经强化学习增强的少步AR视频扩散模型,用于支持多样运动控制的实时图像到视频生成。我们首先微调一个基础I2V模型以支持基本的运动控制,然后借助基于轨迹的奖励模型通过强化学习进一步改进它。我们的设计通过Self-Rollout机制保留了马尔可夫性质,并通过在去噪步骤中有选择地引入随机性来加速训练。大量实验表明,AR-Drag实现了高视觉保真度和精确的运动对齐,与最先进的运动可控VDM相比显著降低了延迟,而参数量仅为1.3B。更多可视化内容请参见我们的项目页面:https://kesenzhao.github.io/AR-Drag.github.io/。
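下面是一个示意性的代码草图(非论文官方实现),给出一种最简单的"基于轨迹的奖励":生成视频中被跟踪关键点的轨迹与用户指定的目标轨迹越接近,奖励越高;奖励函数形式与 `sigma` 均为假设。

```python
# 示意性草图(非官方实现):以轨迹误差为基础的简单奖励函数。
import numpy as np

def trajectory_reward(pred_traj, target_traj, sigma=10.0):
    """
    pred_traj / target_traj: (T, K, 2) T 帧、K 个关键点的像素坐标
    返回 (0, 1] 之间的奖励,轨迹完全重合时为 1。
    """
    err = np.linalg.norm(pred_traj - target_traj, axis=-1)   # (T, K) 逐点误差
    return float(np.exp(-err.mean() / sigma))

target = np.cumsum(np.random.randn(16, 2, 2), axis=0)   # 模拟一条目标轨迹
pred = target + np.random.randn(16, 2, 2) * 2.0         # 模拟带噪声的生成轨迹
print(trajectory_reward(pred, target))
```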
论文及项目相关链接
Summary
针对实时运动控制视频生成面临的挑战,如双向扩散模型的固有延迟和缺乏有效的自回归(AR)方法,我们提出了AR-Drag模型。它是首个结合强化学习(RL)技术的少步骤自回归视频扩散模型,支持实时图像到视频的生成并具有多样的运动控制。通过微调基础I2V模型并借助基于轨迹的奖励模型进行强化学习,AR-Drag设计保留了马尔可夫属性,同时通过选择性引入去噪步骤中的随机性来加速训练。实验表明,AR-Drag在视觉保真度和运动对齐方面表现出色,相较于当前先进的运动控制VDM显著降低了延迟,仅使用1.3B参数。更多可视化内容请见我们的项目页面。
Key Takeaways
- AR-Drag是首个结合强化学习(RL)的少步骤自回归视频扩散模型,支持实时图像到视频的生成并具有多样的运动控制。
- AR-Drag通过微调基础I2V模型以实现基本运动控制,并借助基于轨迹的奖励模型进行强化学习。
- 模型设计保留了马尔可夫属性,并通过选择性引入去噪步骤中的随机性来加速训练。
- AR-Drag实现了高视觉保真和精确运动对齐。
- 与现有运动控制VDM相比,AR-Drag显著降低了延迟。
- AR-Drag模型仅使用1.3B参数,实现了高效性能。

ComGS: Efficient 3D Object-Scene Composition via Surface Octahedral Probes
Authors:Jian Gao, Mengqi Yuan, Yifei Zeng, Chang Zeng, Zhihao Li, Zhenyu Chen, Weichao Qiu, Xiao-Xiao Long, Hao Zhu, Xun Cao, Yao Yao
Gaussian Splatting (GS) enables immersive rendering, but realistic 3D object-scene composition remains challenging. Baked appearance and shadow information in GS radiance fields cause inconsistencies when combining objects and scenes. Addressing this requires relightable object reconstruction and scene lighting estimation. For relightable object reconstruction, existing Gaussian-based inverse rendering methods often rely on ray tracing, leading to low efficiency. We introduce Surface Octahedral Probes (SOPs), which store lighting and occlusion information and allow efficient 3D querying via interpolation, avoiding expensive ray tracing. SOPs provide at least a 2x speedup in reconstruction and enable real-time shadow computation in Gaussian scenes. For lighting estimation, existing Gaussian-based inverse rendering methods struggle to model intricate light transport and often fail in complex scenes, while learning-based methods predict lighting from a single image and are viewpoint-sensitive. We observe that 3D object-scene composition primarily concerns the object’s appearance and nearby shadows. Thus, we simplify the challenging task of full scene lighting estimation by focusing on the environment lighting at the object’s placement. Specifically, we capture a 360 degrees reconstructed radiance field of the scene at the location and fine-tune a diffusion model to complete the lighting. Building on these advances, we propose ComGS, a novel 3D object-scene composition framework. Our method achieves high-quality, real-time rendering at around 28 FPS, produces visually harmonious results with vivid shadows, and requires only 36 seconds for editing. Code and dataset are available at https://nju-3dv.github.io/projects/ComGS/.
高斯泼溅(Gaussian Splatting, GS)能够实现沉浸式渲染,但逼真的3D物体-场景组合仍然具有挑战性:GS辐射场中烘焙进去的外观和阴影信息会在组合物体与场景时造成不一致。解决这一问题需要可重光照的物体重建和场景光照估计。在可重光照物体重建方面,现有基于高斯的逆渲染方法通常依赖光线追踪,效率较低。我们引入了表面八面体探针(Surface Octahedral Probes, SOPs),它存储光照和遮挡信息,并允许通过插值进行高效的3D查询,从而避免昂贵的光线追踪。SOPs使重建速度至少提升2倍,并支持在高斯场景中实时计算阴影。在光照估计方面,现有基于高斯的逆渲染方法难以建模复杂的光线传输,在复杂场景中经常失效;而基于学习的方法从单张图像预测光照,对视角敏感。我们观察到,3D物体-场景组合主要关心物体本身的外观及其附近的阴影。因此,我们把完整场景光照估计这一困难任务简化为估计物体放置位置处的环境光照:具体而言,我们在该位置捕获场景的360度重建辐射场,并微调一个扩散模型来补全光照。基于这些进展,我们提出了ComGS,一个新型的3D物体-场景组合框架。我们的方法以约28 FPS实现高质量实时渲染,产生视觉和谐、阴影生动的结果,且编辑仅需36秒。代码和数据集见 https://nju-3dv.github.io/projects/ComGS/。
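下面是一个示意性的代码草图,给出标准的八面体方向映射(octahedral mapping)写法:把单位方向向量编码到 [-1,1]² 的二维网格再解码回来。这是"表面八面体探针"用少量二维纹素存储各方向光照与遮挡信息的常见基础写法,与论文的具体实现细节无关。

```python
# 示意性草图:标准的八面体方向编码/解码(octahedral mapping)。
import numpy as np

def oct_encode(d):
    """把单位方向向量 d 编码为 [-1,1]^2 内的二维坐标。"""
    d = d / np.abs(d).sum()                  # L1 归一化,投影到八面体表面
    x, y, z = d
    if z < 0:                                # 下半球折叠到正方形外圈
        x, y = (1 - abs(y)) * np.sign(x), (1 - abs(x)) * np.sign(y)
    return np.array([x, y])

def oct_decode(e):
    """把二维坐标解码回单位方向向量。"""
    x, y = e
    z = 1 - abs(x) - abs(y)
    if z < 0:
        x, y = (1 - abs(y)) * np.sign(x), (1 - abs(x)) * np.sign(y)
    v = np.array([x, y, z])
    return v / np.linalg.norm(v)

d = np.array([0.3, -0.5, 0.8])
d = d / np.linalg.norm(d)
print(d, oct_decode(oct_encode(d)))   # 编码后再解码应当几乎还原原方向
```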
论文及项目相关链接
Summary
本文介绍了Gaussian Splatting(GS)在沉浸式渲染方面的应用,并针对3D对象场景组合的挑战性进行了深入研究。文章通过引入Surface Octahedral Probes(SOPs)来解决对象和场景组合时的不一致性,提高了重建和实时阴影计算效率。同时,文章还提出了一种简化场景照明估计的方法,专注于对象放置处的环境照明,并结合扩散模型完成照明。最终,文章提出了ComGS,一个新型的3D对象场景组合框架,实现了高质量、实时的渲染效果。
Key Takeaways
- Gaussian Splatting(GS)可实现沉浸式渲染,但3D物体-场景组合仍然具有挑战性。
- SOPs的引入解决了物体与场景组合时的不一致问题:通过存储光照和遮挡信息,并允许通过插值进行高效3D查询,避免了昂贵的光线追踪。
- 现有基于高斯的逆渲染方法难以在复杂场景中建模光线传输,而基于学习的方法从单张图像预测光照,对视角敏感。
- 文章通过专注于物体放置位置处的环境光照,简化了复杂的场景光照估计任务。
- 结合扩散模型补全光照,提高了渲染质量和实时性。
- 提出了ComGS,一个新型的3D物体-场景组合框架,实现了高质量、实时的渲染,并具备编辑功能。

Rectified-CFG++ for Flow Based Models
Authors:Shreshth Saini, Shashank Gupta, Alan C. Bovik
Classifier-free guidance (CFG) is the workhorse for steering large diffusion models toward text-conditioned targets, yet its native application to rectified flow (RF) based models provokes severe off-manifold drift, yielding visual artifacts, text misalignment, and brittle behaviour. We present Rectified-CFG++, an adaptive predictor-corrector guidance that couples the deterministic efficiency of rectified flows with a geometry-aware conditioning rule. Each inference step first executes a conditional RF update that anchors the sample near the learned transport path, then applies a weighted conditional correction that interpolates between conditional and unconditional velocity fields. We prove that the resulting velocity field is marginally consistent and that its trajectories remain within a bounded tubular neighbourhood of the data manifold, ensuring stability across a wide range of guidance strengths. Extensive experiments on large-scale text-to-image models (Flux, Stable Diffusion 3/3.5, Lumina) show that Rectified-CFG++ consistently outperforms standard CFG on benchmark datasets such as MS-COCO, LAION-Aesthetic, and T2I-CompBench. Project page: https://rectified-cfgpp.github.io/
无分类器引导(CFG)是将大型扩散模型引向文本条件目标的主力方法,然而将其直接用于基于修正流(rectified flow, RF)的模型会引发严重的偏离流形的漂移,导致视觉伪影、文本不对齐和脆弱的行为。我们提出了Rectified-CFG++,这是一种自适应的预测-校正引导方法,将修正流的确定性效率与几何感知的条件规则相结合。每个推理步骤首先执行一次条件RF更新,把样本锚定在学习到的传输路径附近,然后施加加权的条件校正,在条件速度场和无条件速度场之间进行插值。我们证明了由此得到的速度场是边缘一致的,并且其轨迹保持在数据流形的一个有界管状邻域内,从而保证在很宽的引导强度范围内都保持稳定。在大规模文本到图像模型(Flux、Stable Diffusion 3/3.5、Lumina)上的大量实验表明,Rectified-CFG++在MS-COCO、LAION-Aesthetic和T2I-CompBench等基准数据集上持续优于标准CFG。项目页面:https://rectified-cfgpp.github.io/
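下面是一个示意性的代码草图(非论文官方实现),按摘要描述的"预测-校正"结构写出单步更新:先用条件速度场走一小步把样本锚定在传输路径附近,再用条件与无条件速度场的加权插值做校正;子步划分比例与引导系数均为假设。

```python
# 示意性草图(非官方实现):预测-校正式的流模型引导单步更新。
import torch

def rectified_cfg_pp_step(x, t, dt, v_cond, v_uncond, w=3.0, corr=0.5):
    """
    v_cond / v_uncond: 可调用对象,输入 (x, t) 返回速度场预测
    w: 引导强度;corr: 校正子步占整步的比例(假设值)
    """
    # 预测步:仅用条件速度场,把样本锚定在学习到的传输路径附近
    v_c = v_cond(x, t)
    x_pred = x + (1 - corr) * dt * v_c

    # 校正步:在条件与无条件速度场之间做加权插值
    t_mid = t + (1 - corr) * dt
    v_c2 = v_cond(x_pred, t_mid)
    v_u2 = v_uncond(x_pred, t_mid)
    v_guided = v_u2 + w * (v_c2 - v_u2)
    return x_pred + corr * dt * v_guided

x = torch.randn(2, 4)
f_c = lambda x, t: -x            # 占位的速度场,仅用于演示
f_u = lambda x, t: -0.5 * x
print(rectified_cfg_pp_step(x, t=0.0, dt=0.05, v_cond=f_c, v_uncond=f_u).shape)
```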
论文及项目相关链接
PDF Accepted at NeurIPS 2025
Summary
本文介绍了一种面向基于修正流(rectified flow)的大规模扩散模型的改进型无分类器引导方法:Rectified-CFG++。该方法将修正流的确定性效率与几何感知的条件规则相结合,以解决标准CFG直接应用于此类模型时出现的视觉伪影、文本不对齐等问题:每一步推理先进行条件RF更新,把样本锚定在学习到的传输路径附近,再施加加权条件校正,在条件速度场和无条件速度场之间插值,从而保证采样的稳定性。实验证明,Rectified-CFG++在大型文本到图像模型上表现优异,在MS-COCO、LAION-Aesthetic和T2I-CompBench等基准数据集上超越了标准CFG。
Key Takeaways
- Rectified-CFG++解决了标准CFG直接应用于修正流模型时出现的视觉伪影和文本不对齐问题。
- 该方法将修正流(rectified flow)的确定性效率与几何感知的条件规则相结合,旨在提高大规模扩散模型的引导质量。
- Rectified-CFG++通过预测和校正机制确保样本的稳定性,即使在不同强度的引导下也能保持稳定性。
- 实验证明,Rectified-CFG++在大型文本到图像模型上的表现优于标准CFG。
- 该方法在多个基准数据集上表现出优异性能,如MS-COCO、LAION-Aesthetic和T2I-CompBench等。
- 项目页面提供了更多详细信息:https://rectified-cfgpp.github.io/。

EMPalm: Exfiltrating Palm Biometric Data via Electromagnetic Side-Channels
Authors:Haowen Xu, Tianya Zhao, Xuyu Wang, Lei Ma, Jun Dai, Alexander Wyglinski, Xiaoyan Sun
Palm recognition has emerged as a dominant biometric authentication technology in critical infrastructure. These systems operate in either single-modal form, using palmprint or palmvein individually, or dual-modal form, fusing the two modalities. Despite this diversity, they share similar hardware architectures that inadvertently emit electromagnetic (EM) signals during operation. Our research reveals that these EM emissions leak palm biometric information, motivating us to develop EMPalm–an attack framework that covertly recovers both palmprint and palmvein images from eavesdropped EM signals. Specifically, we first separate the interleaved transmissions of the two modalities, identify and combine their informative frequency bands, and reconstruct the images. To further enhance fidelity, we employ a diffusion model to restore fine-grained biometric features unique to each domain. Evaluations on seven prototype and two commercial palm acquisition devices show that EMPalm can recover palm biometric information with high visual fidelity, achieving SSIM scores up to 0.79, PSNR up to 29.88 dB, and FID scores as low as 6.82 across all tested devices, metrics that collectively demonstrate strong structural similarity, high signal quality, and low perceptual discrepancy. To assess the practical implications of the attack, we further evaluate it against four state-of-the-art palm recognition models, achieving a model-wise average spoofing success rate of 65.30% over 6,000 samples from 100 distinct users.
手掌识别已在关键基础设施中崭露头角,成为主要的生物特征认证技术。这些系统以单模态形式运行(单独使用掌纹或掌静脉),或以双模态形式运行(融合这两种模态)。尽管形式多样,它们却拥有类似的硬件架构,在运行时会无意中发出电磁(EM)信号。我们的研究表明,这些EM辐射会泄露手掌生物特征信息,这促使我们开发出EMPalm,一种能从窃听到的EM信号中隐秘恢复掌纹和掌静脉图像的攻击框架。具体来说,我们首先分离两种模态交错的传输,识别并组合各自的信息频段,然后重建图像。为了进一步提高保真度,我们采用扩散模型来恢复每种模态特有的细粒度生物特征。在七个原型设备和两个商用手掌采集设备上的评估表明,EMPalm可以高保真地恢复手掌生物特征信息:在所有测试设备上,SSIM最高达0.79,PSNR最高达29.88 dB,FID最低至6.82,这些指标共同体现了较强的结构相似性、较高的信号质量和较低的感知差异。为了评估该攻击的实际影响,我们进一步在四个最先进的手掌识别模型上进行测试,在来自100个不同用户的6000个样本上,各模型的平均欺骗成功率达到65.30%。
论文及项目相关链接
Summary
本文研究了手掌生物特征识别系统的电磁侧信道泄露问题。研究发现,手掌识别系统在运行过程中会无意中发出包含手掌生物特征信息的电磁信号,并由此提出了EMPalm攻击框架,可以从窃听到的电磁信号中恢复掌纹和掌静脉图像,其中利用扩散模型来恢复细粒度的生物特征。实验证明,EMPalm可以在多种设备上实现高保真的信息恢复,并能成功欺骗多种手掌识别模型。
Key Takeaways
- 手部识别技术成为关键基础设施中的主要生物认证技术,有单模态和双模态两种形式。
- 研究揭示了手部生物特征识别系统操作中会无意发出电磁信号,其中包含生物特征信息。
- EMPalm框架可以从窃听的电磁信号中恢复手掌纹路和掌静脉图像。
- EMPalm通过分离两种模态的传输信号、识别并结合其信息频段,进而重建图像。
- 采用扩散模型增强图像的细节恢复,提高各领域的生物特征辨识度。
- 实验证明EMPalm在多种设备上的信息恢复具有高质量,包括高结构相似性、高信号质量和低感知差异。

DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images
Authors:Ozgur Kara, Harris Nisar, James M. Rehg
Numerous models have been developed for scanpath and saliency prediction, which are typically trained on scanpaths, which model eye movement as a sequence of discrete fixation points connected by saccades, while the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: https://diff-eye.github.io/
目前已有大量模型用于扫描路径(scanpath)和显著性预测,它们通常在扫描路径上训练:即把眼动建模为由眼跳连接的一系列离散注视点,而原始注视轨迹中蕴含的丰富信息往往被丢弃。此外,大多数现有方法无法刻画不同受试者观看同一图像时表现出的差异性:它们通常只预测一条固定、预设长度的扫描路径,这与现实世界视觉注意固有的多样性和随机性相冲突。为了解决这些挑战,我们提出了DiffEye,这是一个基于扩散的训练框架,用于建模自然图像自由观看过程中连续且多样的眼动轨迹。我们的方法建立在以视觉刺激为条件的扩散模型之上,并引入了一个新组件:对应位置嵌入(Corresponding Positional Embedding, CPE),它将空间注视信息与视觉输入的基于patch的语义特征对齐。DiffEye利用原始眼动追踪轨迹而非扫描路径进行训练,因而能够捕捉人类注视行为的内在变化,并在相对较小的数据集上生成高质量、逼真的眼动模式。生成的轨迹还可以转换为扫描路径和显著性图,从而得到更准确反映人类视觉注意分布的输出。DiffEye是第一个在自然图像上使用扩散模型处理该任务、并充分利用原始眼动数据丰富信息的方法。我们的大量评估表明,DiffEye不仅在扫描路径生成上达到了最先进的性能,还首次实现了连续眼动轨迹的生成。项目网页:https://diff-eye.github.io/
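下面是一个示意性的代码草图(非论文官方实现),演示"对应位置嵌入(CPE)"背后的直觉:把连续的注视坐标落到ViT式的图像patch网格上,为每个注视点取出与之空间对应的patch位置嵌入;网格大小与嵌入维度均为假设。

```python
# 示意性草图(非官方实现):把注视坐标映射到 patch 网格并取对应的位置嵌入。
import torch

def corresponding_positional_embedding(gaze_xy, patch_pos_emb, grid_size=14):
    """
    gaze_xy      : (N, 2) 归一化到 [0,1) 的注视坐标 (x, y)
    patch_pos_emb: (grid_size*grid_size, d) 每个图像 patch 的位置嵌入
    返回每个注视点对应 patch 的位置嵌入,形状 (N, d)。
    """
    col = (gaze_xy[:, 0] * grid_size).long().clamp(0, grid_size - 1)
    row = (gaze_xy[:, 1] * grid_size).long().clamp(0, grid_size - 1)
    idx = row * grid_size + col
    return patch_pos_emb[idx]

pos_emb = torch.randn(14 * 14, 64)
gaze = torch.rand(5, 2)
print(corresponding_positional_embedding(gaze, pos_emb).shape)  # torch.Size([5, 64])
```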
论文及项目相关链接
PDF Accepted to NeurIPS 2025
Summary
本文介绍了针对视觉注视路径预测的新模型DiffEye。该模型基于扩散模型训练,能够模拟连续且多样的眼动轨迹。模型通过引入CPE(对应位置嵌入)来将空间注视信息与视觉输入的语义特征对齐。使用原始眼动轨迹数据,DiffEye能够捕捉人类注视行为的内在变化,生成逼真的眼动模式,并在较小的数据集上训练也能达到高质量的结果。DiffEye不仅能够生成扫描路径和显著性地图,而且能够更准确地反映人类视觉注意的分布。它在自然图像任务中达到了利用原始眼动轨迹数据的最新技术水平。
Key Takeaways
- DiffEye是一个基于扩散模型的训练框架,用于模拟自然图像观看时的连续和多样化的眼动轨迹。
- 该模型引入CPE(对应位置嵌入),将空间注视信息与视觉输入的语义特征对齐。
- DiffEye使用原始眼动轨迹数据,能够捕捉人类注视行为的内在变化并生成高质量的眼动模式。
- DiffEye能够在较小的数据集上进行训练,并生成逼真的眼动轨迹。
- 该模型可以生成扫描路径和显著性地图,更准确反映人类视觉注意的分布。
- DiffEye是首个在自然图像任务中使用扩散模型并充分利用原始眼动轨迹数据的模型。

MAMBO: High-Resolution Generative Approach for Mammography Images
Authors:Milica Škipina, Nikola Jovišić, Nicola Dall’Asen, Vanja Švenda, Anil Osman Tur, Slobodan Ilić, Elisa Ricci, Dubravko Ćulibrk
Mammography is the gold standard for the detection and diagnosis of breast cancer. This procedure can be significantly enhanced with Artificial Intelligence (AI)-based software, which assists radiologists in identifying abnormalities. However, training AI systems requires large and diverse datasets, which are often difficult to obtain due to privacy and ethical constraints. To address this issue, the paper introduces MAMmography ensemBle mOdel (MAMBO), a novel patch-based diffusion approach designed to generate full-resolution mammograms. Diffusion models have shown breakthrough results in realistic image generation, yet few studies have focused on mammograms, and none have successfully generated high-resolution outputs required to capture fine-grained features of small lesions. To achieve this, MAMBO integrates separate diffusion models to capture both local and global (image-level) contexts. The contextual information is then fed into the final model, significantly aiding the noise removal process. This design enables MAMBO to generate highly realistic mammograms of up to 3840x3840 pixels. Importantly, this approach can be used to enhance the training of classification models and extended to anomaly segmentation. Experiments, both numerical and radiologist validation, assess MAMBO’s capabilities in image generation, super-resolution, and anomaly segmentation, highlighting its potential to enhance mammography analysis for more accurate diagnoses and earlier lesion detection. The source code used in this study is publicly available at: https://github.com/iai-rs/mambo.
乳腺X光摄影是检测和诊断乳腺癌的金标准。通过人工智能(AI)软件可以显著增强这一程序的效果,帮助放射科医生识别异常。然而,训练AI系统需要大型且多样的数据集,由于隐私和道德约束,这些数据集往往难以获得。为了解决这一问题,本文介绍了MAMmography ensemBle mOdel(MAMBO),这是一种基于补丁的扩散方法,旨在生成全分辨率乳腺X线影像。扩散模型在真实图像生成方面取得了突破性成果,但对乳腺X线影像的研究很少,且尚无研究能够生成高分辨率输出,以捕捉小病灶的精细特征。为了实现这一点,MAMBO集成了单独的扩散模型来捕捉局部和全局(图像级别)的上下文。然后将上下文信息输入到最终模型中,极大地有助于去噪过程。这种设计使MAMBO能够生成高达3840x3840像素的非常逼真的乳腺X线影像。重要的是,这种方法可用于增强分类模型的训练,并可扩展到异常分割。实验包括数值验证和放射科医生验证,评估了MAMBO在图像生成、超分辨率和异常分割方面的能力,突显其在提高乳腺X线摄影分析、更准确的诊断和更早的病灶检测方面的潜力。本研究中使用的源代码可在https://github.com/iai-rs/mambo公开访问。
论文及项目相关链接
PDF 21 pages, 14 figures, 7 tables
Summary
人工智能(AI)辅助的乳腺X光摄影(mammography)能显著提高乳腺癌的检测和诊断水平。为解决AI系统训练所需的大型多样数据集难以获取的问题,研究者提出了MAMmography ensemBle mOdel(MAMBO)模型。该模型采用基于扩散的方法生成高分辨率的乳腺X光图像,通过集成局部和全局上下文信息,生成高度逼真的图像,分辨率高达3840x3840像素。该模型可应用于分类模型的训练增强和异常分割,有望提高乳腺X光摄影的分析准确性,实现更早的病变检测。
Key Takeaways
- AI在乳腺X光摄影中的应用能显著提高乳腺癌的检测和诊断水平。
- MAMBO模型是一种新颖的基于扩散的方法,用于生成高分辨率的乳腺X光图像。
- MAMBO通过集成局部和全局上下文信息,生成高度逼真的图像。
- MAMBO模型可应用于分类模型的训练增强。
- MAMBO模型有助于实现异常分割,提高乳腺X光摄影的分析准确性。
- MAMBO模型的源代码已公开发布,可供研究使用。

Feedback Guidance of Diffusion Models
Authors:Felix Koulischer, Florian Handke, Johannes Deleu, Thomas Demeester, Luca Ambrogioni
While Classifier-Free Guidance (CFG) has become standard for improving sample fidelity in conditional diffusion models, it can harm diversity and induce memorization by applying constant guidance regardless of whether a particular sample needs correction. We propose FeedBack Guidance (FBG), which uses a state-dependent coefficient to self-regulate guidance amounts based on need. Our approach is derived from first principles by assuming the learned conditional distribution is linearly corrupted by the unconditional distribution, contrasting with CFG’s implicit multiplicative assumption. Our scheme relies on feedback of its own predictions about the conditional signal informativeness to adapt guidance dynamically during inference, challenging the view of guidance as a fixed hyperparameter. The approach is benchmarked on ImageNet512x512, where it significantly outperforms Classifier-Free Guidance and is competitive to Limited Interval Guidance (LIG) while benefitting from a strong mathematical framework. On Text-To-Image generation, we demonstrate that, as anticipated, our approach automatically applies higher guidance scales for complex prompts than for simpler ones and that it can be easily combined with existing guidance schemes such as CFG or LIG.
虽然无分类器引导(CFG)已成为提升条件扩散模型样本保真度的标准方法,但由于它不论特定样本是否需要校正都施加恒定强度的引导,可能会损害多样性并诱发记忆化。我们提出了反馈引导(FeedBack Guidance, FBG),它使用一个依赖于状态的系数,按需自我调节引导量。我们的方法从第一性原理推导而来,假设学习到的条件分布被无条件分布线性污染,这与CFG隐含的乘性假设形成对比。该方案依赖模型自身对条件信号信息量的预测反馈,在推理过程中动态调整引导,从而挑战了把引导强度当作固定超参数的观点。我们在ImageNet 512x512上进行了基准测试,该方法显著优于无分类器引导,并与有限区间引导(LIG)性能相当,同时受益于坚实的数学框架。在文本到图像生成中,我们展示了正如预期的那样,该方法会自动为复杂提示施加比简单提示更高的引导强度,并且可以方便地与CFG或LIG等现有引导方案结合使用。
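下面是一个示意性的代码草图(并非论文公式或官方实现),只用来说明"依赖状态的引导系数"的大致形态:用条件预测与无条件预测的差异作为"该样本还需要多少修正"的反馈信号,据此逐样本地调节引导强度;具体的反馈量与系数函数均为本示例的假设。

```python
# 示意性草图(非官方实现):由预测反馈动态决定逐样本的引导系数。
import torch

def feedback_guidance_step(eps_cond, eps_uncond, w_max=8.0, tau=1.0):
    """
    eps_cond / eps_uncond: (B, ...) 同一噪声水平下的条件/无条件噪声预测
    返回逐样本的引导系数 w 以及引导后的预测。
    """
    diff = (eps_cond - eps_uncond).flatten(1).norm(dim=1)   # 条件信号信息量的粗糙代理
    w = 1.0 + (w_max - 1.0) * torch.tanh(diff / tau)        # 差异越大,引导越强(假设的系数函数)
    w = w.view(-1, *([1] * (eps_cond.dim() - 1)))
    return w, eps_uncond + w * (eps_cond - eps_uncond)

e_c, e_u = torch.randn(2, 4, 8, 8), torch.randn(2, 4, 8, 8)
w, eps = feedback_guidance_step(e_c, e_u)
print(w.flatten(), eps.shape)
```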
论文及项目相关链接
PDF Article accepeted as poster at the 39th Annual Conference on Neural Information Processing Systems (NeurIPS25). Code is available at: https://github.com/FelixKoulischer/FBG_using_edm2
Summary
本文介绍了反馈指导(FBG)在条件扩散模型中的应用,旨在改进样本保真度并提高多样性。FBG使用状态依赖系数来根据需求自我调节指导量,与Classifier-Free Guidance(CFG)不同,FBG通过假设学习的条件分布被无条件分布线性破坏来实现。这种方法依靠自身对条件信号重要性的预测反馈来动态调整指导过程,打破了将指导作为固定超参数的看法。在ImageNet和文本转图像生成任务的测试中,FBG表现出强大的性能,可自动对复杂提示应用更高的指导比例,并可轻松与其他指导方案结合使用。
Key Takeaways
- FBG旨在改进条件扩散模型的样本保真度和多样性。
- FBG使用状态依赖系数来根据需求自我调节指导量。
- 与CFG不同,FBG基于假设学习的条件分布被无条件分布线性破坏来实现。
- FBG依靠自身预测反馈来动态调整指导过程。
- FBG在ImageNet任务上显著优于CFG,与Limited Interval Guidance(LIG)具有竞争力。
- FBG可自动对复杂提示应用更高的指导比例。

IMAGHarmony: Controllable Image Editing with Consistent Object Quantity and Layout
Authors:Fei Shen, Yutong Gao, Jian Yu, Xiaoyu Du, Jinhui Tang
Recent diffusion models have advanced image editing by improving fidelity and controllability across creative and personalized applications. However, multi-object scenes remain challenging, as reliable control over object categories, counts, and spatial layout is difficult to achieve. For that, we first study quantity and layout consistent image editing, abbreviated as QL-Edit, which targets control of object quantity and spatial layout in multi-object scenes. Then, we present IMAGHarmony, a straightforward framework featuring a plug-and-play harmony aware (HA) module that fuses perception semantics while modeling object counts and locations, resulting in accurate edits and strong structural consistency. We further observe that diffusion models are sensitive to the choice of initial noise and tend to prefer certain noise patterns. Based on this finding, we present a preference-guided noise selection (PNS) strategy that selects semantically aligned initial noise through vision and language matching, thereby further improving generation stability and layout consistency in multiple object editing. To support evaluation, we develop HarmonyBench, a comprehensive benchmark that covers a diverse range of quantity and layout control scenarios. Extensive experiments demonstrate that IMAGHarmony outperforms prior methods in both structural alignment and semantic accuracy, utilizing only 200 training images and 10.6M of trainable parameters. Code, models, and data are available at https://github.com/muzishen/IMAGHarmony.
最近的扩散模型提升了图像编辑在创意和个性化应用中的保真度与可控性。然而,多目标场景仍然具有挑战性,因为很难对目标类别、数量和空间布局实现可靠控制。为此,我们首先研究数量与布局一致的图像编辑(简称QL-Edit),其目标是控制多目标场景中的目标数量和空间布局。随后,我们提出了IMAGHarmony,这是一个简洁的框架,带有即插即用的和谐感知(harmony aware, HA)模块,在建模目标数量和位置的同时融合感知语义,从而实现精准编辑和较强的结构一致性。我们进一步观察到,扩散模型对初始噪声的选择非常敏感,并倾向于偏好某些噪声模式。基于这一发现,我们提出了偏好引导的噪声选择(preference-guided noise selection, PNS)策略,通过视觉-语言匹配选择语义对齐的初始噪声,从而进一步提高多目标编辑中的生成稳定性和布局一致性。为了支持评估,我们构建了HarmonyBench,一个涵盖多种数量与布局控制场景的综合基准。大量实验表明,IMAGHarmony在结构对齐和语义准确性方面均优于先前方法,而仅使用了200张训练图像和1060万(10.6M)个可训练参数。代码、模型和数据见 https://github.com/muzishen/IMAGHarmony。
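下面是一个示意性的代码草图(非论文官方实现),展示"偏好引导的噪声选择(PNS)"的基本流程:采样多份初始噪声,各自快速生成低成本预览,再用一个视觉-语言匹配打分函数挑选与文本语义最对齐的噪声。`generate_preview` 与 `clip_score` 均为假设的占位接口(例如少步去噪与CLIP图文相似度)。

```python
# 示意性草图(非官方实现):按视觉-语言匹配得分挑选初始噪声。
import torch

def select_initial_noise(prompt, shape, n_candidates, generate_preview, clip_score):
    """返回与 prompt 语义最匹配的初始噪声及其得分。"""
    best_noise, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn(shape)
        preview = generate_preview(noise, prompt)   # 例如:少步数的快速去噪(假设接口)
        score = clip_score(preview, prompt)         # 例如:CLIP 图文相似度(假设接口)
        if score > best_score:
            best_noise, best_score = noise, score
    return best_noise, best_score

# 用随机打分的占位函数演示流程本身可以运行
noise, s = select_initial_noise(
    "two cats on a sofa", (1, 4, 64, 64), n_candidates=4,
    generate_preview=lambda z, p: z,
    clip_score=lambda img, p: float(torch.randn(())),
)
print(noise.shape, s)
```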
论文及项目相关链接
Summary
近期扩散模型通过提高图像编辑的保真度和可控性,推动了创意和个性化应用的发展。在多对象场景中,我们研究了目标数量与布局一致的图像编辑(QL-Edit),并推出了IMAGHarmony框架,具有即插即用的和谐感知(HA)模块,能融合感知语义同时建模对象计数和位置。此外,我们发现扩散模型对初始噪声的选择很敏感,于是提出了基于视觉和语言匹配的偏好引导噪声选择(PNS)策略。为了支持评估,我们开发了HarmonyBench基准测试平台,涵盖多种数量与布局控制场景。实验表明,IMAGHarmony在结构对齐和语义准确性方面优于先前方法,仅使用200张训练图像和10.6M可训练参数。
Key Takeaways
- 扩散模型提高了图像编辑的保真度和可控性,促进了创意和个性化应用的发展。
- 在多对象场景中,实现对象类别、数量和空间布局的可控性是一个挑战。
- QL-Edit技术针对多对象场景中的对象数量和空间布局控制进行了研究。
- IMAGHarmony框架通过融合感知语义同时建模对象计数和位置,提高了图像编辑的准确性。
- 扩散模型对初始噪声的选择非常敏感,因此提出了偏好引导噪声选择(PNS)策略。
- HarmonyBench基准测试平台用于评估图像编辑的效果,涵盖多种数量与布局控制场景。

DvD: Unleashing a Generative Paradigm for Document Dewarping via Coordinates-based Diffusion Model
Authors:Weiguang Zhang, Huangcheng Lu, Maizhen Ning, Xiaowei Huang, Wei Wang, Kaizhu Huang, Qiufeng Wang
Document dewarping aims to rectify deformations in photographic document images, thus improving text readability, which has attracted much attention and made great progress, but it is still challenging to preserve document structures. Given recent advances in diffusion models, it is natural for us to consider their potential applicability to document dewarping. However, it is far from straightforward to adopt diffusion models in document dewarping due to their unfaithful control on highly complex document images (e.g., 2000$\times$3000 resolution). In this paper, we propose DvD, the first generative model to tackle document Dewarping via a Diffusion framework. To be specific, DvD introduces a coordinate-level denoising instead of typical pixel-level denoising, generating a mapping for deformation rectification. In addition, we further propose a time-variant condition refinement mechanism to enhance the preservation of document structures. In experiments, we find that current document dewarping benchmarks can not evaluate dewarping models comprehensively. To this end, we present AnyPhotoDoc6300, a rigorously designed large-scale document dewarping benchmark comprising 6,300 real image pairs across three distinct domains, enabling fine-grained evaluation of dewarping models. Comprehensive experiments demonstrate that our proposed DvD can achieve state-of-the-art performance with acceptable computational efficiency on multiple metrics across various benchmarks, including DocUNet, DIR300, and AnyPhotoDoc6300. The new benchmark and code will be publicly available at https://github.com/hanquansanren/DvD.
文档去扭曲旨在纠正摄影文档图像中的变形,从而提高文本的可读性,这已引起了很多关注并取得了很大的进展,但在保留文档结构方面仍然具有挑战性。考虑到最近扩散模型的进步,我们自然会考虑它们在文档去扭曲方面的潜在适用性。然而,由于扩散模型对高度复杂的文档图像(例如2000x3000分辨率)控制不忠实,因此在文档去扭曲中采用扩散模型并不简单。在本文中,我们提出了DvD,这是第一个通过扩散框架解决文档去扭曲的生成模型。具体来说,DvD引入了坐标级去噪,而不是典型的像素级去噪,为变形校正生成映射。此外,我们进一步提出了一种时间可变条件细化机制,以提高文档结构的保留。在实验中,我们发现当前的文档去扭曲基准测试无法全面评估去扭曲模型。为此,我们推出了AnyPhotoDoc6300,这是一个严格设计的大规模文档去扭曲基准测试,包含三个不同领域的6300张真实图像对,能够对去扭曲模型进行精细评估。综合实验表明,我们提出的DvD可以在多个基准测试中的多个指标上实现卓越的性能和可接受的计算效率,包括DocUNet、DIR300和AnyPhotoDoc6300。新基准测试和代码将在https://github.com/hanquansanren/DvD上公开。
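下面是一个示意性的代码草图(非论文官方实现),说明文档去扭曲中"坐标级映射"的常见用法:模型输出一张反向映射网格(每个输出像素对应输入图像中的采样坐标),再用 `grid_sample` 按该网格重采样即可得到矫正后的图像;此处仅用恒等网格验证流程。

```python
# 示意性草图(非官方实现):用反向映射网格 + grid_sample 做图像矫正。
import torch
import torch.nn.functional as F

def rectify_with_mapping(img, grid):
    """
    img : (B, C, H, W) 拍摄到的扭曲文档图像
    grid: (B, H, W, 2) 归一化到 [-1,1] 的反向映射坐标(通常由模型预测)
    """
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)

B, C, H, W = 1, 3, 8, 8
img = torch.rand(B, C, H, W)
ys, xs = torch.meshgrid(
    torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
)
identity_grid = torch.stack([xs, ys], dim=-1).unsqueeze(0)   # 恒等映射,仅作演示
out = rectify_with_mapping(img, identity_grid)
print(torch.allclose(out, img, atol=1e-5))   # 恒等网格应当原样返回图像
```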
论文及项目相关链接
摘要
本文关注文档图像去扭曲技术,旨在改善文档图像的文本可读性。文章提出了DvD,首个通过扩散模型解决文档去扭曲问题的生成模型。DvD引入坐标级别的去噪,而非典型的像素级别去噪,为变形矫正生成映射。此外,还提出时间可变条件优化机制,以更好地保留文档结构。实验发现现有文档去扭曲基准测试无法全面评估去扭曲模型,因此推出了AnyPhotoDoc6300,一个严格设计的大型文档去扭曲基准测试,包含6300张真实图像对,可精细评估去扭曲模型。实验表明,DvD模型在多个基准测试上取得了最先进的性能,同时保持了可接受的计算效率。
关键见解
- 文档图像去扭曲技术旨在提高文档图像的文本可读性,已引起广泛关注并取得重大进展。
- 扩散模型在文档去扭曲中具有潜力,但现有模型在高度复杂的文档图像控制上不忠实。
- DvD模型是首个通过扩散模型解决文档去扭曲问题的生成模型。
- DvD引入坐标级别去噪,生成变形矫正映射。
- 提出时间可变条件优化机制,以更好地保留文档结构。
- 现有文档去扭曲基准测试无法全面评估去扭曲模型,需要新的评估方法。
- 推出AnyPhotoDoc6300大型文档去扭曲基准测试,可精细评估去扭曲模型,DvD模型在多个基准测试上表现先进。

Uncertainty-Aware Diffusion Guided Refinement of 3D Scenes
Authors:Sarosij Bose, Arindam Dutta, Sayak Nag, Junge Zhang, Jiachen Li, Konstantinos Karydis, Amit K. Roy Chowdhury
Reconstructing 3D scenes from a single image is a fundamentally ill-posed task due to the severely under-constrained nature of the problem. Consequently, when the scene is rendered from novel camera views, existing single image to 3D reconstruction methods render incoherent and blurry views. This problem is exacerbated when the unseen regions are far away from the input camera. In this work, we address these inherent limitations in existing single image-to-3D scene feedforward networks. To alleviate the poor performance due to insufficient information beyond the input image’s view, we leverage a strong generative prior in the form of a pre-trained latent video diffusion model, for iterative refinement of a coarse scene represented by optimizable Gaussian parameters. To ensure that the style and texture of the generated images align with that of the input image, we incorporate on-the-fly Fourier-style transfer between the generated images and the input image. Additionally, we design a semantic uncertainty quantification module that calculates the per-pixel entropy and yields uncertainty maps used to guide the refinement process from the most confident pixels while discarding the remaining highly uncertain ones. We conduct extensive experiments on real-world scene datasets, including in-domain RealEstate-10K and out-of-domain KITTI-v2, showing that our approach can provide more realistic and high-fidelity novel view synthesis results compared to existing state-of-the-art methods.
从单一图像重建3D场景是一个根本上的不适定任务,因为这个问题具有严重的约束不足的特性。因此,当场景从新视角呈现时,现有的单一图像到3D重建方法会产生不连贯和模糊的视图。当未见的区域远离输入相机时,这个问题更加严重。在这项工作中,我们解决了现有单一图像到3D场景前馈网络的这些固有局限性。为了缓解因输入图像视图以外的信息不足而导致的性能不佳,我们利用预训练的潜在视频扩散模型作为强生成先验,对由可优化高斯参数表示的粗略场景进行迭代优化。为了确保生成的图像的风格和纹理与输入图像一致,我们在生成的图像和输入图像之间进行了实时的傅里叶风格转换。此外,我们设计了一个语义不确定性量化模块,该模块计算像素熵并产生不确定性映射,用于指导从最具信心的像素进行细化过程,同时丢弃剩余的高度不确定的像素。我们在包括域内RealEstate-10K和域外KITTI-v2的真实场景数据集上进行了大量实验,结果表明我们的方法可以提供比现有最先进的方法更真实、更高保真度的新视角合成结果。
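下面是一个示意性的代码草图(非论文官方实现),按摘要的描述计算逐像素熵并据此得到"高置信度像素掩码",用于只在最可信的像素上引导细化、丢弃高度不确定的像素;类别数与保留比例均为假设。

```python
# 示意性草图(非官方实现):逐像素熵不确定性图与高置信度掩码。
import torch

def entropy_uncertainty_map(logits, keep_ratio=0.7):
    """
    logits: (B, K, H, W) 某个语义预测头的输出
    返回 (B, H, W) 的熵图,以及熵最低(最可信)的 keep_ratio 比例像素的掩码。
    """
    p = torch.softmax(logits, dim=1)
    entropy = -(p * torch.log(p.clamp_min(1e-8))).sum(dim=1)        # (B, H, W)
    thresh = torch.quantile(entropy.flatten(1), keep_ratio, dim=1)  # 每张图取分位数阈值
    confident_mask = entropy <= thresh.view(-1, 1, 1)
    return entropy, confident_mask

logits = torch.randn(2, 19, 32, 32)
ent, mask = entropy_uncertainty_map(logits)
print(ent.shape, mask.float().mean().item())   # 约等于 keep_ratio
```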
论文及项目相关链接
PDF ICCV 2025
Summary
本文解决了从单张图像重建3D场景时面临的固有局限性,如渲染不清晰、视角不一致等问题。通过引入预训练的潜在视频扩散模型,实现了场景的迭代优化;利用傅立叶风格转移技术,确保生成的图像与输入图像的风格和纹理一致;同时设计了语义不确定性量化模块,生成真实场景的熵图指导最准确的像素细化过程。实验证明,此方法比现有技术更为真实和高效。
Key Takeaways
- 解决了从单张图像重建的3D场景出现的渲染不清晰和视角不一致的问题。
- 通过引入预训练的潜在视频扩散模型进行迭代优化,提高场景重建的质量。
- 利用傅立叶风格转移技术,确保生成的图像与输入图像风格和纹理的一致性。
- 设计了语义不确定性量化模块,生成熵图以指导最准确的像素细化过程。
- 方法可应用于真实场景数据集,如RealEstate-10K和KITTI-v2数据集。
- 与现有技术相比,该方法能够提供更真实、高保真度的视角合成结果。

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks
Authors:Canyu Zhao, Yanlong Sun, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen
This paper’s primary objective is to develop a robust generalist perception model capable of addressing multiple tasks under constraints of computational resources and limited training data. We leverage text-to-image diffusion models pre-trained on billions of images and successfully introduce our DICEPTION, a visual generalist model. Exhaustive evaluations demonstrate that DICEPTION effectively tackles diverse perception tasks, even achieving performance comparable to SOTA single-task specialist models. Specifically, we achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs.\ 1B pixel-level annotated images). We designed comprehensive experiments on architectures and input paradigms, demonstrating that the key to successfully re-purposing a single diffusion model for multiple perception tasks lies in maximizing the preservation of the pre-trained model’s prior knowledge. Consequently, DICEPTION can be trained with substantially lower computational costs than conventional models requiring training from scratch. Furthermore, adapting DICEPTION to novel tasks is highly efficient, necessitating fine-tuning on as few as 50 images and approximately 1% of its parameters. Finally, we demonstrate that a subtle application of classifier-free guidance can improve the model’s performance on depth and normal estimation. We also show that pixel-aligned training, as is characteristic of perception tasks, significantly enhances the model’s ability to preserve fine details. DICEPTION offers valuable insights and presents a promising direction for the development of advanced diffusion-based visual generalist models. Code and Model: https://github.com/aim-uofa/Diception
本文的主要目标是开发一个稳健的通用感知模型,该模型能够在计算资源有限和训练数据稀缺的情况下处理多个任务。我们利用在数十亿图像上预训练的文本到图像扩散模型,并成功推出了我们的通用视觉模型DICEPTION。详尽的评估表明,DICEPTION能够有效地处理各种感知任务,甚至实现了与最新单任务专业模型相当的性能。具体来说,我们仅使用其0.06%的数据(例如,600K对比1B像素级注释图像)就达到了与SAM-vit-h相当的结果。我们对架构和输入范式进行了全面的实验,结果表明,成功地将单个扩散模型重新用于多个感知任务的关键在于最大限度地保留预训练模型的先验知识。因此,DICEPTION的计算成本远低于需要从头开始训练的常规模型。此外,将DICEPTION适应新任务非常高效,只需对大约50个图像进行微调,并且只需要其大约1%的参数。最后,我们证明了通过无分类器引导的精妙应用可以提高模型在深度和正常估计方面的性能。我们还表明,感知任务特有的像素对齐训练显著增强了模型的细节保留能力。DICEPTION提供了有价值的见解,并为基于扩散的高级视觉通用模型的发展提供了有前景的方向。代码和模型:https://github.com/aim-uofa/Diception
论文及项目相关链接
PDF Homepage: https://aim-uofa.github.io/Diception, Code and Model: https://github.com/aim-uofa/Diception
摘要
本文旨在开发一个稳健的通用感知模型,该模型能够在计算资源有限和训练数据有限的情况下处理多个任务。通过利用在数十亿图像上预训练的文本到图像扩散模型,成功推出视觉通用模型DICEPTION。评估表明,DICEPTION能够有效应对多样化的感知任务,甚至达到单任务专业模型的性能水平。在仅使用0.06%数据的情况下,其效果与SAM-vit-h相当(例如,使用600K vs. 1B像素级标注图像)。实验证明,成功将单一扩散模型用于多个感知任务的关键在于最大限度地保留预训练模型的先验知识。因此,DICEPTION的计算成本远低于传统的从头训练模型。此外,将DICEPTION适配到新任务非常高效,仅需在最少50张图像上微调约1%的参数。最后,研究证明巧妙地应用无分类器引导可提升模型在深度和法线估计方面的性能,而像素对齐训练也显著提高了模型对细节的保留能力。DICEPTION为扩散式视觉通用模型的发展提供了有价值的见解和前景。
关键见解
- 论文目标是开发一个能在计算资源和训练数据限制下处理多个任务的稳健通用感知模型。
- 引入视觉通用模型DICEPTION,通过利用文本到图像的扩散模型预训练于大量图像数据。
- DICEPTION能够有效处理多样化的感知任务,性能与先进的专业模型相当。
- 使用极少量数据(仅0.06%)即可达到与SAM-vit-h相当的效果。
- 成功将单一扩散模型用于多个感知任务的关键在于保留预训练模型的先验知识。
- DICEPTION的计算成本远低于传统模型,适应新任务更加高效。
- 通过无分类器引导和像素对齐训练,提高了模型在深度和法线估计任务上的性能,并增强了细节保留能力。

The Poisson Midpoint Method for Langevin Dynamics: Provably Efficient Discretization for Diffusion Models
Authors:Saravanan Kandasamy, Dheeraj Nagaraj
Langevin Dynamics is a Stochastic Differential Equation (SDE) central to sampling and generative modeling and is implemented via time discretization. Langevin Monte Carlo (LMC), based on the Euler-Maruyama discretization, is the simplest and most studied algorithm. LMC can suffer from slow convergence - requiring a large number of steps of small step-size to obtain good quality samples. This becomes stark in the case of diffusion models where a large number of steps gives the best samples, but the quality degrades rapidly with smaller number of steps. Randomized Midpoint Method has been recently proposed as a better discretization of Langevin dynamics for sampling from strongly log-concave distributions. However, important applications such as diffusion models involve non-log concave densities and contain time varying drift. We propose its variant, the Poisson Midpoint Method, which approximates a small step-size LMC with large step-sizes. We prove that this can obtain a quadratic speed up of LMC under very weak assumptions. We apply our method to diffusion models for image generation and show that it maintains the quality of DDPM with 1000 neural network calls with just 50-80 neural network calls and outperforms ODE based methods with similar compute.
朗之万动力学(Langevin Dynamics)是采样和生成建模中的核心随机微分方程(SDE),通过时间离散化来实现。基于Euler-Maruyama离散的朗之万蒙特卡洛(LMC)是最简单、研究最充分的算法。LMC存在收敛缓慢的问题:需要大量小步长的迭代才能获得高质量样本。这一点在扩散模型中尤为突出,步数很多时样本质量最好,但步数减少后质量会迅速下降。最近提出的随机中点法(Randomized Midpoint Method)被认为是朗之万动力学更好的离散化方式,用于从强对数凹分布中采样;然而,扩散模型等重要应用涉及非对数凹密度和随时间变化的漂移项。我们提出了它的变体:泊松中点法(Poisson Midpoint Method),用大步长来近似小步长的LMC。我们证明,在非常弱的假设下,该方法可以获得相对LMC的二次加速。我们将该方法应用于图像生成的扩散模型,结果表明它仅用50-80次神经网络调用就能达到DDPM使用1000次调用时的质量,并且在相近计算量下优于基于ODE的方法。
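下面是一个示意性的代码草图,给出Euler-Maruyama离散化的朗之万蒙特卡洛(LMC)单链采样;目标分布取标准高斯以便直接写出score函数。这是论文所对比的基线算法,而非泊松中点法本身;步长与步数为示例值。

```python
# 示意性草图:Euler-Maruyama 离散化的 Langevin Monte Carlo(基线算法演示)。
import numpy as np

def lmc_sample(score_fn, x0, step_size, n_steps, rng):
    """x_{k+1} = x_k + eta * score(x_k) + sqrt(2*eta) * xi"""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + step_size * score_fn(x) + np.sqrt(2.0 * step_size) * noise
    return x

rng = np.random.default_rng(0)
score = lambda x: -x                      # 标准高斯 N(0, I) 的 score:grad log p(x) = -x
samples = np.stack([
    lmc_sample(score, rng.standard_normal(2), step_size=0.05, n_steps=500, rng=rng)
    for _ in range(2000)
])
print(samples.mean(axis=0), samples.var(axis=0))   # 均值接近 0,方差因离散化略大于 1
```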
论文及项目相关链接
PDF “One often meets his destiny on the road he takes to avoid it” - Master Oogway. My destiny seems to be to write triangle inequalities for the rest of my life
Summary
Langevin动力学是采样和生成建模中的核心随机微分方程(SDE),通过时间离散化来实现。文章提出了一种新方法:Poisson中点法,用大步长近似小步长的Langevin Monte Carlo(LMC),并可实现二次加速。在图像生成的扩散模型中应用此方法,显著减少了神经网络调用的数量,并保持了高质量的生成结果。
Key Takeaways
- Langevin Dynamics是一种基于随机微分方程(SDE)的采样和生成模型的核心方法,通过时间离散化实现。
- Langevin Monte Carlo(LMC)虽然是最简单且被广泛研究的方法,但其收敛速度较慢,特别是在扩散模型中需要大量步骤才能获得高质量的样本。
- 对于强对数凹分布,采用随机中点法作为对Langevin动力学的更好离散化方法。
- 扩散模型等重要应用涉及非对数凹密度和随时间变化的漂移;为此本文提出Poisson中点法,用大步长近似小步长的LMC。
- 该方法可以实现在非常弱的假设下LMC的二次加速效果。
- 在图像生成的扩散模型中应用Poisson中点方法,显著减少了神经网络调用的数量。