⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: never use these summaries in serious academic settings; they are intended only for preliminary screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-02-28
HDM: Hybrid Diffusion Model for Unified Image Anomaly Detection
Authors: Zekang Weng, Jinjin Shi, Jinwei Wang, Zeming Han
Image anomaly detection plays a vital role in applications such as industrial quality inspection and medical imaging, where it directly contributes to improving product quality and system reliability. However, existing methods often struggle with complex and diverse anomaly patterns. In particular, the separation between generation and discrimination tasks limits the effective coordination between anomaly sample generation and anomaly region detection. To address these challenges, we propose a novel hybrid diffusion model (HDM) that integrates generation and discrimination into a unified framework. The model consists of three key modules: the Diffusion Anomaly Generation Module (DAGM), the Diffusion Discriminative Module (DDM), and the Probability Optimization Module (POM). DAGM generates realistic and diverse anomaly samples, improving their representativeness. DDM then applies a reverse diffusion process to capture the differences between generated and normal samples, enabling precise anomaly region detection and localization based on probability distributions. POM refines the probability distributions during both the generation and discrimination phases, ensuring high-quality samples are used for training. Extensive experiments on multiple industrial image datasets demonstrate that our method outperforms state-of-the-art approaches, significantly improving both image-level and pixel-level anomaly detection performance, as measured by AUROC.
Paper and Project Links
Summary
This paper proposes a novel hybrid diffusion model (HDM) that integrates the generation and discrimination tasks in a unified framework for image anomaly detection. The model comprises three key modules: the Diffusion Anomaly Generation Module (DAGM), the Diffusion Discriminative Module (DDM), and the Probability Optimization Module (POM). DAGM generates realistic and diverse anomaly samples to improve their representativeness; DDM captures differences through a reverse diffusion process to precisely detect and localize anomaly regions; POM refines the probability distributions in both the generation and discrimination phases to raise training quality. Experiments show the method outperforms existing approaches, significantly improving both image-level and pixel-level anomaly detection.
Key Takeaways
- Image anomaly detection is vital in applications such as industrial quality inspection and medical imaging.
- Existing methods are limited when handling complex and diverse anomaly patterns.
- A novel hybrid diffusion model (HDM) is proposed that integrates the generation and discrimination tasks.
- The model consists of three key modules: the Diffusion Anomaly Generation Module (DAGM), the Diffusion Discriminative Module (DDM), and the Probability Optimization Module (POM).
- DAGM generates realistic and diverse anomaly samples, improving their representativeness.
- DDM captures differences through a reverse diffusion process, enabling precise detection and localization of anomaly regions (a toy sketch of the pipeline follows this list).
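To make the three-module pipeline concrete, here is a minimal, heavily simplified PyTorch sketch. Everything below is an assumption for illustration: the toy DAGM synthesizes anomalies by masked noise injection, the DDM step is approximated by a per-pixel reconstruction difference standing in for the reverse-diffusion comparison, and pom_filter is a hypothetical stand-in for POM's quality filtering; none of this is the authors' released code.

```python
import torch

class DAGM(torch.nn.Module):
    """Toy stand-in for the Diffusion Anomaly Generation Module:
    injects noise inside a random mask to synthesize anomaly samples."""
    def forward(self, x):
        mask = (torch.rand_like(x[:, :1]) < 0.1).float()         # random anomaly region
        return x * (1 - mask) + torch.randn_like(x) * mask, mask

def ddm_anomaly_map(x, x_recon):
    """Toy DDM step: the per-pixel difference between an input and its
    reconstruction stands in for the reverse-diffusion comparison."""
    return (x - x_recon).abs().mean(dim=1, keepdim=True)         # (B, 1, H, W) score map

def pom_filter(scores, keep_ratio=0.8):
    """Hypothetical POM stand-in: keep the most plausible generated samples."""
    k = max(1, int(scores.numel() * keep_ratio))
    return torch.topk(-scores, k).indices                        # lowest scores kept

x_normal = torch.rand(8, 3, 64, 64)
x_anom, mask = DAGM()(x_normal)
amap = ddm_anomaly_map(x_anom, x_normal)
kept = pom_filter(amap.flatten(1).mean(dim=1))
print(amap.shape, kept.shape)
```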
Click here to view paper screenshots


A Dual-Purpose Framework for Backdoor Defense and Backdoor Amplification in Diffusion Models
Authors: Vu Tuan Truong Long, Bao Le
Diffusion models have emerged as state-of-the-art generative frameworks, excelling in producing high-quality multi-modal samples. However, recent studies have revealed their vulnerability to backdoor attacks, where backdoored models generate specific, undesirable outputs called backdoor targets (e.g., harmful images) when a pre-defined trigger is embedded in their inputs. In this paper, we propose PureDiffusion, a dual-purpose framework that simultaneously serves two contrasting roles: backdoor defense and backdoor attack amplification. For defense, we introduce two novel loss functions to invert backdoor triggers embedded in diffusion models. The first leverages trigger-induced distribution shifts across multiple timesteps of the diffusion process, while the second exploits the denoising consistency effect when a backdoor is activated. Once an accurate trigger inversion is achieved, we develop a backdoor detection method that analyzes both the inverted trigger and the generated backdoor targets to identify backdoor attacks. For attack amplification, taking the role of an attacker, we describe how our trigger inversion algorithm can be used to reinforce the original trigger embedded in the backdoored diffusion model. This significantly boosts attack performance while reducing the required backdoor training time. Experimental results demonstrate that PureDiffusion achieves near-perfect detection accuracy, outperforming existing defenses by a large margin, particularly against complex trigger patterns. Additionally, in an attack scenario, our attack amplification approach elevates the attack success rate (ASR) of existing backdoor attacks to nearly 100% while reducing training time by up to 20x.
Paper and Project Links
Summary
Diffusion models have attracted attention for their generative power but are vulnerable to backdoor attacks. This paper proposes PureDiffusion, a framework that can both defend against and amplify backdoor attacks. Two novel loss functions invert backdoor triggers embedded in diffusion models, and a backdoor detection method is built on top of the inversion. The same framework can also reinforce the original backdoor trigger, boosting attack performance while cutting training time. Experiments show PureDiffusion performs strongly in both the defense and attack roles.
Key Takeaways
- Although diffusion models are powerful generators, they are vulnerable to backdoor attacks that can force specific, harmful outputs.
- The PureDiffusion framework is dual-purpose: it can defend against backdoor attacks and also amplify them.
- Two novel loss functions invert backdoor triggers, exploiting distribution shifts across diffusion timesteps and the denoising-consistency effect.
- A backdoor detection method identifies attacks by analyzing both the inverted trigger and the generated backdoor targets.
- PureDiffusion can reinforce the original backdoor trigger, significantly boosting attack performance.
- Experiments show PureDiffusion excels at defense, particularly against complex trigger patterns (a trigger-inversion sketch follows this list).
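The denoising-consistency idea lends itself to a compact sketch: when the true trigger is present, a backdoored model's noise predictions become unusually consistent across independent noise draws, so a trigger can be inverted by minimizing that prediction variance. The loop below is a schematic reconstruction under that assumption, not the authors' exact loss functions; the eps_model(x_t, t) noise-prediction interface and the dummy network are hypothetical.

```python
import torch

def invert_trigger(eps_model, shape, steps=200, n_noise=8, lr=1e-2):
    """Schematic trigger inversion via the denoising-consistency effect:
    optimize a trigger `delta` so the model's noise predictions on
    delta + noise agree across independent noise draws."""
    delta = torch.zeros(shape, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, 1000, (n_noise,))          # random diffusion timesteps
        noise = torch.randn(n_noise, *shape)
        preds = eps_model(delta.unsqueeze(0) + noise, t)
        loss = preds.var(dim=0).mean()                  # low variance = consistent denoising
        opt.zero_grad(); loss.backward(); opt.step()
    return delta.detach()

# Toy usage with a stand-in for a backdoored noise-prediction network.
dummy_unet = lambda x_t, t: 0.1 * x_t
trigger = invert_trigger(dummy_unet, (3, 32, 32), steps=5)
print(trigger.shape)   # torch.Size([3, 32, 32])
```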
Click here to view paper screenshots




Optimal Stochastic Trace Estimation in Generative Modeling
Authors: Xinyang Liu, Hengrong Du, Wei Deng, Ruqi Zhang
Hutchinson estimators are widely employed in training divergence-based likelihoods for diffusion models to ensure optimal transport (OT) properties. However, this estimator often suffers from high variance and scalability concerns. To address these challenges, we investigate Hutch++, an optimal stochastic trace estimator for generative models, designed to minimize training variance while maintaining transport optimality. Hutch++ is particularly effective for handling ill-conditioned matrices with large condition numbers, which commonly arise when high-dimensional data exhibits a low-dimensional structure. To mitigate the need for frequent and costly QR decompositions, we propose practical schemes that balance frequency and accuracy, backed by theoretical guarantees. Our analysis demonstrates that Hutch++ leads to generations of higher quality. Furthermore, this method exhibits effective variance reduction in various applications, including simulations, conditional time series forecasts, and image generation.
Paper and Project Links
PDF Accepted by AISTATS 2025
Summary
Hutch++ is an optimal stochastic trace estimator that improves on the widely used Hutchinson estimator when training divergence-based likelihoods for diffusion models: it minimizes training variance while preserving optimal transport properties. It is especially effective for ill-conditioned matrices with large condition numbers, which arise when high-dimensional data has low-dimensional structure. Practical schemes with theoretical guarantees balance the frequency and accuracy of the otherwise costly QR decompositions, and the method shows notable variance reduction and higher-quality generation across simulations, conditional time series forecasting, and image generation.
Key Takeaways
- Hutch++ is an optimal stochastic trace estimator for generative models.
- Hutch++ aims to minimize training variance while maintaining transport optimality.
- Hutch++ is particularly effective at handling ill-conditioned matrices with large condition numbers.
- Hutch++ performs well when high-dimensional data exhibits a low-dimensional structure.
- Practical schemes balancing frequency and accuracy reduce the need for costly QR decompositions.
- Hutch++ achieves effective variance reduction in simulations, conditional time series forecasting, and image generation (a reference implementation follows this list).
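For reference, here is a minimal NumPy implementation of the standard Hutch++ estimator (Meyer, Musco, Musco & Woodruff, 2021), which splits the matrix-vector budget between a low-rank sketch and a Hutchinson pass on the deflated residual; the paper's diffusion-specific scheduling of QR decompositions is not reproduced here. Because the sketch captures the dominant subspace exactly, Hutchinson variance is paid only on the small residual, which is why the estimator shines when high-dimensional data has low-dimensional structure.

```python
import numpy as np

def hutchpp(matvec, dim, num_queries, seed=None):
    """Hutch++ trace estimator.

    matvec      : function returning A @ X for a (dim, k) block X
    num_queries : total matrix-vector products; 2/3 build a low-rank
                  sketch of A, 1/3 run Hutchinson on the deflated residual
    """
    rng = np.random.default_rng(seed)
    k = num_queries // 3
    S = rng.choice([-1.0, 1.0], size=(dim, k))       # Rademacher sketch
    Q, _ = np.linalg.qr(matvec(S))                   # basis for the dominant subspace
    t_top = np.trace(Q.T @ matvec(Q))                # exact trace on that subspace
    G = rng.choice([-1.0, 1.0], size=(dim, k))
    G -= Q @ (Q.T @ G)                               # deflate: (I - Q Q^T) G
    return t_top + np.trace(G.T @ matvec(G)) / k     # + Hutchinson on the residual

# Sanity check on a matrix with low-rank structure, the regime the paper targets.
rng = np.random.default_rng(0)
B = rng.standard_normal((500, 20))
A = B @ B.T                                          # rank-20 PSD matrix in 500 dims
print(hutchpp(lambda X: A @ X, dim=500, num_queries=60), np.trace(A))
```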
Click here to view paper screenshots


Autoregressive Image Generation Guided by Chains of Thought
Authors: Miaomiao Cai, Guanjie Wang, Wei Li, Zhijun Tu, Hanting Chen, Shaohui Lin, Jie Hu
In the field of autoregressive (AR) image generation, models based on the ‘next-token prediction’ paradigm of LLMs have shown comparable performance to diffusion models by reducing inductive biases. However, directly applying LLMs to complex image generation can struggle with reconstructing the structure and details of the image, impacting the accuracy and stability of generation. Additionally, the ‘next-token prediction’ paradigm in the AR model does not align with the contextual scanning and logical reasoning processes involved in human visual perception, limiting effective image generation. Chain-of-Thought (CoT), as a key reasoning capability of LLMs, utilizes reasoning prompts to guide the model, improving reasoning performance on complex natural language processing (NLP) tasks, enhancing accuracy and stability of generation, and helping the model maintain contextual coherence and logical consistency, similar to human reasoning. Inspired by CoT from the field of NLP, we propose autoregressive Image Generation with Thoughtful Reasoning (IGTR) to enhance autoregressive image generation. IGTR adds reasoning prompts without modifying the model structure or raster generation order. Specifically, we design specialized image-related reasoning prompts for AR image generation to simulate the human reasoning process, which enhances contextual reasoning by allowing the model to first perceive overall distribution information before generating the image, and improve generation stability by increasing the inference steps. Compared to the AR method without prompts, our method shows outstanding performance and achieves an approximate improvement of 20%.
Paper and Project Links
Summary
Inspired by the Chain-of-Thought (CoT) capability of LLMs, this paper proposes autoregressive Image Generation with Thoughtful Reasoning (IGTR). The method uses reasoning prompts to guide the model, improving the accuracy and stability of generation without modifying the model structure or the raster generation order. Specialized image-related reasoning prompts simulate the human reasoning process: the model first perceives overall distribution information before generating the image, which strengthens contextual reasoning, and the added inference steps improve generation stability. Compared with the prompt-free AR baseline, the method achieves roughly a 20% improvement.
Key Takeaways
- In autoregressive (AR) image generation, models built on the LLM-style "next-token prediction" paradigm reduce inductive biases but can struggle with complex images, failing to reconstruct structure and detail and hurting the accuracy and stability of generation.
- The "next-token prediction" paradigm does not match the contextual scanning and logical reasoning of human visual perception, limiting effective image generation.
- Drawing on Chain-of-Thought (CoT) from NLP, the paper introduces Image Generation with Thoughtful Reasoning (IGTR), which adds reasoning prompts to improve performance on complex tasks.
- The designed image-related reasoning prompts simulate the human reasoning process, allowing the model to perceive overall distribution information before generating the image and strengthening contextual reasoning.
- Increasing the number of inference steps improves generation stability.
- Compared with the AR method without reasoning prompts, IGTR delivers roughly a 20% performance improvement (a decoding sketch follows this list).
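Since IGTR adds reasoning prompts without changing the model or the raster order, the decoding loop reduces to prepending prompt tokens and sampling as usual. The sketch below illustrates that flow; the token layout, model interface, and dummy model are assumptions for illustration, not the paper's API.

```python
import torch

def generate_with_reasoning_prompts(ar_model, prompt_ids, n_image_tokens):
    """Illustrative IGTR-style decoding: reasoning-prompt tokens are
    prepended so the model conditions on global cues before emitting
    image tokens in the usual raster order."""
    seq = prompt_ids.clone()                                  # (1, P) prompt token ids
    for _ in range(n_image_tokens):
        logits = ar_model(seq)[:, -1]                         # logits for the next token
        nxt = torch.multinomial(logits.softmax(dim=-1), 1)    # sample next image token
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, prompt_ids.shape[1]:]                       # strip the prompt tokens

# Toy usage: a random-logit stand-in for a real AR image model.
vocab = 1024
dummy_model = lambda s: torch.randn(s.shape[0], s.shape[1], vocab)
tokens = generate_with_reasoning_prompts(dummy_model, torch.zeros(1, 8, dtype=torch.long), 16)
print(tokens.shape)   # torch.Size([1, 16])
```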
Click here to view paper screenshots







CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing
Authors: Ziqi Jiang, Zhen Wang, Long Chen
Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the modified areas, most editing methods can be divided into two main types: global editing and local editing. In this paper, we choose the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage these two signals, we treat text signals as global guidance and drag points as local information. Then we introduce a novel global-local motion supervision method to integrate text signals into existing drag-based methods by adapting a pre-trained language-vision model like CLIP. Furthermore, we also address the problem of slow convergence in CLIPDrag by presenting a fast point-tracking method that enforces drag points to move in the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing methods that use drag or text signals alone.
Paper and Project Links
PDF 17 pages
Summary
This paper proposes CLIPDrag, a novel image editing method that combines text and drag signals for precise, ambiguity-free manipulation on diffusion models. Text signals act as global guidance while drag points supply local information, and a global-local motion supervision method integrates the text signal into existing drag-based editing. To address slow convergence, a fast point-tracking method forces drag points to move in the correct directions. Experiments show CLIPDrag outperforms existing drag-only and text-only methods.
Key Takeaways
- Image editing methods divide into global and local editing; the two most common approaches (text-based and drag-based editing) each have drawbacks.
- CLIPDrag is the first method to combine text and drag signals for image editing.
- Text signals serve as global guidance, while drag points provide local information.
- A global-local motion supervision method integrates text signals into existing drag-based methods (a loss sketch follows this list).
- A fast point-tracking method addresses CLIPDrag's slow convergence.
- Experiments show CLIPDrag outperforms existing methods.
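A minimal sketch of how a global-local objective of this kind can be combined, assuming the drag term pulls diffusion features at handle points toward their targets (local) while a CLIP image-text similarity keeps the edit aligned with the instruction (global). The shapes, the L1 feature loss, and the weight lam are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def global_local_loss(feat, handles, targets, img_emb, txt_emb, lam=0.1):
    """Illustrative global-local objective: a local drag term pulls the
    feature at each handle point toward the feature at its target point,
    and a global CLIP term keeps the edit aligned with the text prompt.

    feat: (C, H, W) feature map; handles, targets: (N, 2) integer (y, x) coords;
    img_emb, txt_emb: CLIP embeddings of the current image and the text prompt.
    """
    drag = feat.new_zeros(())
    for (hy, hx), (ty, tx) in zip(handles.tolist(), targets.tolist()):
        drag = drag + F.l1_loss(feat[:, hy, hx], feat[:, ty, tx].detach())
    clip_term = 1.0 - F.cosine_similarity(img_emb, txt_emb, dim=-1)
    return drag / len(handles) + lam * clip_term

# Toy usage with random tensors standing in for real features and embeddings.
feat = torch.randn(64, 32, 32, requires_grad=True)
handles, targets = torch.tensor([[10, 10]]), torch.tensor([[12, 14]])
img_emb, txt_emb = torch.randn(512), torch.randn(512)
print(global_local_loss(feat, handles, targets, img_emb, txt_emb))
```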
Click here to view paper screenshots



CCDM: Continuous Conditional Diffusion Models for Image Generation
Authors: Xin Ding, Yongwei Wang, Kao Zhang, Z. Jane Wang
Continuous Conditional Generative Modeling (CCGM) estimates high-dimensional data distributions, such as images, conditioned on scalar continuous variables (aka regression labels). While Continuous Conditional Generative Adversarial Networks (CcGANs) were designed for this task, their instability during adversarial learning often leads to suboptimal results. Conditional Diffusion Models (CDMs) offer a promising alternative, generating more realistic images, but their diffusion processes, label conditioning, and model fitting procedures are either not optimized for or incompatible with CCGM, making it difficult to integrate CcGANs’ vicinal approach. To address these issues, we introduce Continuous Conditional Diffusion Models (CCDMs), the first CDM specifically tailored for CCGM. CCDMs address existing limitations with specially designed conditional diffusion processes, a novel hard vicinal image denoising loss, a customized label embedding method, and efficient conditional sampling procedures. Through comprehensive experiments on four datasets with resolutions ranging from 64x64 to 192x192, we demonstrate that CCDMs outperform state-of-the-art CCGM models, establishing a new benchmark. Ablation studies further validate the model design and implementation, highlighting that some widely used CDM implementations are ineffective for the CCGM task. Our code is publicly available at https://github.com/UBCDingXin/CCDM.
Paper and Project Links
Summary
Continuous Conditional Generative Modeling (CCGM) estimates high-dimensional data distributions, such as images, conditioned on scalar continuous variables (regression labels). Continuous Conditional GANs (CcGANs) were designed for this task, but instability during adversarial learning often yields suboptimal results. Conditional Diffusion Models (CDMs) generate more realistic images, yet their diffusion processes, label conditioning, and model fitting are not optimized for, or are incompatible with, CCGM, making it hard to integrate CcGANs' vicinal approach. The paper introduces Continuous Conditional Diffusion Models (CCDMs), the first CDM tailored to CCGM, featuring specially designed conditional diffusion processes, a novel hard vicinal image denoising loss, a customized label embedding method, and efficient conditional sampling. Experiments on four datasets at resolutions from 64x64 to 192x192 show CCDMs outperform state-of-the-art CCGM models, establishing a new benchmark.
Key Takeaways
- Continuous Conditional Generative Modeling (CCGM) estimates high-dimensional data distributions conditioned on scalar regression labels.
- Continuous Conditional GANs (CcGANs) suffer from instability during adversarial learning.
- Conditional Diffusion Models (CDMs) generate more realistic images but are not optimized for the continuous conditional generation task.
- The proposed Continuous Conditional Diffusion Models (CCDMs) combine the strengths of both lines of work and are the first CDMs tailored to CCGM.
- CCDMs address existing limitations with specially designed conditional diffusion processes, a hard vicinal image denoising loss, and a customized label embedding method (a loss sketch follows this list).
- Experiments show CCDMs outperform other models across multiple datasets.
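The hard vicinal idea can be sketched as a masked denoising objective: only samples whose regression labels fall inside a hard vicinity of the conditioning label contribute to the MSE. The mask, kappa, and normalization below are illustrative assumptions, not the paper's exact loss.

```python
import torch

def hard_vicinal_denoising_loss(eps_pred, eps_true, labels, target, kappa=0.02):
    """Illustrative hard vicinal denoising loss: only samples whose
    regression labels lie within |y_i - y| <= kappa of the conditioning
    label contribute to the per-sample MSE."""
    w = (labels - target).abs().le(kappa).float()               # 0/1 hard vicinity mask
    per_sample = ((eps_pred - eps_true) ** 2).flatten(1).mean(dim=1)
    return (w * per_sample).sum() / w.sum().clamp(min=1.0)

# Toy usage: a batch of 16 noise predictions conditioned on label 0.5.
eps_pred, eps_true = torch.randn(16, 3, 64, 64), torch.randn(16, 3, 64, 64)
labels = torch.rand(16)
print(hard_vicinal_denoising_loss(eps_pred, eps_true, labels, target=0.5))
```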
Click here to view paper screenshots


Faster Diffusion via Temporal Attention Decomposition
Authors: Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber
We explore the role of attention mechanism during inference in text-conditional diffusion models. Empirical observations suggest that cross-attention outputs converge to a fixed point after several inference steps. The convergence time naturally divides the entire inference process into two phases: an initial phase for planning text-oriented visual semantics, which are then translated into images in a subsequent fidelity-improving phase. Cross-attention is essential in the initial phase but almost irrelevant thereafter. However, self-attention initially plays a minor role but becomes crucial in the second phase. These findings yield a simple and training-free method known as temporally gating the attention (TGATE), which efficiently generates images by caching and reusing attention outputs at scheduled time steps. Experimental results show when widely applied to various existing text-conditional diffusion models, TGATE accelerates these models by 10%-50%. The code of TGATE is available at https://github.com/HaozheLiu-ST/T-GATE.
Paper and Project Links
PDF Accepted by TMLR: https://openreview.net/forum?id=xXs2GKXPnH
Summary
This paper examines the role of the attention mechanism during inference in text-conditional diffusion models. Cross-attention outputs converge to a fixed point after a few inference steps, which naturally splits inference into two phases: an initial phase that plans text-oriented visual semantics, followed by a fidelity-improving phase that translates them into images. Cross-attention is essential in the first phase and nearly irrelevant afterward, while self-attention plays a minor role initially but becomes crucial in the second phase. These findings yield a simple, training-free method, temporally gating the attention (TGATE), which caches and reuses attention outputs at scheduled time steps. Applied to a wide range of existing text-conditional diffusion models, TGATE accelerates them by 10%-50%.
Key Takeaways
- Analyzes the role of the attention mechanism in text-conditional diffusion models.
- Cross-attention outputs converge to a fixed point during inference.
- Inference divides into two phases: planning text-oriented visual semantics, then improving image fidelity.
- Cross-attention matters in the initial phase and fades afterward; self-attention becomes crucial in the second phase.
- TGATE is a simple, training-free attention optimization method (temporally gating the attention).
- TGATE improves generation efficiency by caching and reusing attention outputs at scheduled time steps (a caching sketch follows this list).
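A minimal sketch of the caching idea, assuming a simplified attention interface: run cross-attention normally until a gate step, cache the converged output, and reuse it for all remaining denoising steps. The real integration lives in the official repo at https://github.com/HaozheLiu-ST/T-GATE; the wrapper below is only illustrative.

```python
import torch

class TGateAttention(torch.nn.Module):
    """Illustrative TGATE-style wrapper: compute cross-attention normally
    until `gate_step`, cache the converged output, then reuse the cache
    for all later denoising steps."""
    def __init__(self, attn, gate_step=10):
        super().__init__()
        self.attn = attn            # callable: (x, context) -> tensor
        self.gate_step = gate_step
        self.cache = None

    def forward(self, x, context, step):
        if step < self.gate_step:
            out = self.attn(x, context)
            if step == self.gate_step - 1:
                self.cache = out.detach()   # freeze the converged output
            return out
        # Past the gate: skip the attention computation entirely.
        return self.cache if self.cache is not None else self.attn(x, context)

# Toy usage with a stand-in attention function.
toy_attn = lambda x, ctx: x + ctx.mean()
gate = TGateAttention(toy_attn, gate_step=2)
x, ctx = torch.randn(1, 4, 8), torch.randn(1, 7, 8)
outs = [gate(x, ctx, step=s) for s in range(5)]   # steps 2-4 reuse the cached output
```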
Click here to view paper screenshots



