⚠️ All of the summaries below are produced by a large language model; they may contain errors, are provided for reference only, and should be used with caution.
🔴 Please note: never rely on them for serious academic work — they are only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-02-12
A Large-scale AI-generated Image Inpainting Benchmark
Authors:Paschalis Giakoumoglou, Dimitrios Karageorgiou, Symeon Papadopoulos, Panagiotis C. Petrantonakis
Recent advances in generative models enable highly realistic image manipulations, creating an urgent need for robust forgery detection methods. Current datasets for training and evaluating these methods are limited in scale and diversity. To address this, we propose a methodology for creating high-quality inpainting datasets and apply it to create DiQuID, comprising over 95,000 inpainted images generated from 78,000 original images sourced from MS-COCO, RAISE, and OpenImages. Our methodology consists of three components: (1) Semantically Aligned Object Replacement (SAOR) that identifies suitable objects through instance segmentation and generates contextually appropriate prompts, (2) Multiple Model Image Inpainting (MMII) that employs various state-of-the-art inpainting pipelines primarily based on diffusion models to create diverse manipulations, and (3) Uncertainty-Guided Deceptiveness Assessment (UGDA) that evaluates image realism through comparative analysis with originals. The resulting dataset surpasses existing ones in diversity, aesthetic quality, and technical quality. We provide comprehensive benchmarking results using state-of-the-art forgery detection methods, demonstrating the dataset’s effectiveness in evaluating and improving detection algorithms. Through a human study with 42 participants on 1,000 images, we show that while humans struggle with images classified as deceiving by our methodology, models trained on our dataset maintain high performance on these challenging cases. Code and dataset are available at https://github.com/mever-team/DiQuID.
Paper and Project Links
Summary
To meet the need for detecting image-manipulation forgeries, and because current training and evaluation datasets are limited in scale and diversity, this paper proposes a methodology for creating high-quality inpainting datasets and uses it to build DiQuID. The methodology has three components: Semantically Aligned Object Replacement, Multiple Model Image Inpainting, and Uncertainty-Guided Deceptiveness Assessment. The resulting DiQuID dataset surpasses existing datasets in diversity, aesthetic quality, and technical quality. Comprehensive benchmarking and a human study with 42 participants confirm the dataset's effectiveness for evaluating and improving detection algorithms.
Key Takeaways
- Recent advances in generative models enable highly realistic image manipulation, creating an urgent need for forgery detection.
- Current datasets for training and evaluating forgery detection methods are limited in scale and diversity.
- The paper proposes a methodology for building high-quality inpainting datasets, combining Semantically Aligned Object Replacement, Multiple Model Image Inpainting, and Uncertainty-Guided Deceptiveness Assessment (a minimal inpainting sketch follows below).
- The resulting DiQuID dataset surpasses existing datasets in diversity, aesthetic quality, and technical quality.
- Comprehensive benchmarking demonstrates DiQuID's effectiveness for evaluating and improving detection algorithms.
- The human study shows that people struggle to identify the images the methodology classifies as deceiving, while models trained on DiQuID remain accurate on these challenging cases.
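As a concrete point of reference, the sketch below shows how a single diffusion-based inpainting pass of the kind MMII aggregates might be run with the Hugging Face diffusers library. It is a minimal illustration rather than the authors' pipeline: the checkpoint name, prompt, and file names are placeholder assumptions, and DiQuID additionally layers SAOR-generated prompts, multiple pipelines, and UGDA filtering on top of such calls.

```python
# Minimal sketch of one diffusion inpainting call (not the DiQuID pipeline itself).
# Checkpoint, prompt, and file names are illustrative assumptions.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",  # any inpainting-capable checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

image = Image.open("original.png").convert("RGB")    # source image
mask = Image.open("object_mask.png").convert("RGB")  # white = region to repaint

# In DiQuID, the prompt would come from SAOR's instance segmentation + captioning step.
result = pipe(prompt="a brown leather backpack on a park bench",
              image=image, mask_image=mask).images[0]
result.save("inpainted.png")
```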
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06593v1/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06593v1/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06593v1/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06593v1/page_3_1.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06593v1/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06593v1/page_5_0.jpg)
Diffusion Models for Computational Neuroimaging: A Survey
Authors:Haokai Zhao, Haowei Lou, Lina Yao, Wei Peng, Ehsan Adeli, Kilian M Pohl, Yu Zhang
Computational neuroimaging involves analyzing brain images or signals to provide mechanistic insights and predictive tools for human cognition and behavior. While diffusion models have shown stability and high-quality generation in natural images, there is increasing interest in adapting them to analyze brain data for various neurological tasks such as data enhancement, disease diagnosis and brain decoding. This survey provides an overview of recent efforts to integrate diffusion models into computational neuroimaging. We begin by introducing the common neuroimaging data modalities, follow with the diffusion formulations and conditioning mechanisms. Then we discuss how the variations of the denoising starting point, condition input and generation target of diffusion models are developed and enhance specific neuroimaging tasks. For a comprehensive overview of the ongoing research, we provide a publicly available repository at https://github.com/JoeZhao527/dm4neuro.
Paper and Project Links
PDF 9 pages, 1 figure
Summary
Computational neuroimaging analyzes brain images or signals to provide mechanistic insight into, and predictive tools for, human cognition and behavior. Diffusion models have shown stable, high-quality generation on natural images, and interest is growing in adapting them to brain data for tasks such as data enhancement, disease diagnosis, and brain decoding. This survey reviews recent efforts to integrate diffusion models into computational neuroimaging: it introduces the common neuroimaging data modalities and the diffusion formulations and conditioning mechanisms, then discusses how variations of the denoising starting point, condition input, and generation target are developed to enhance specific neuroimaging tasks. A publicly maintained repository is available at https://github.com/JoeZhao527/dm4neuro.
Key Takeaways
- Computational neuroimaging analyzes brain images and signals to gain mechanistic insight into human cognition and behavior.
- Diffusion models are increasingly applied to neuroimaging, for tasks such as data enhancement, disease diagnosis, and brain decoding.
- The survey reviews recent work integrating diffusion models into computational neuroimaging.
- It covers the basics: neuroimaging data modalities, diffusion formulations, and conditioning mechanisms (summarized in the equations below).
- It discusses how varying the denoising starting point, the condition input, and the generation target tailors diffusion models to specific neuroimaging tasks.
- A public repository accompanies the survey for a comprehensive overview of ongoing research.
- The field still leaves ample room for continued research and development.
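For readers new to the formulations the survey reviews, the standard DDPM forward process, the conditional reverse process, and the usual noise-prediction training loss can be written as below; the conditioning variable c (for example a diagnosis label or another imaging modality) is where the surveyed variations of denoising starting point, condition input, and generation target plug in. These are textbook equations, not formulas specific to this survey.

```latex
q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\big),
\qquad \bar\alpha_t = \prod_{s=1}^{t} (1-\beta_s)

p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\big(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\big),
\qquad
\mathcal{L} = \mathbb{E}_{x_0, c,\, t,\, \epsilon}\big[\, \|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2 \,\big]
```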
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06552v1/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06552v1/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.06552v1/page_4_0.jpg)
Diffusion Models for Inverse Problems in the Exponential Family
Authors:Alessandro Micheli, Mélodie Monod, Samir Bhatt
Diffusion models have emerged as powerful tools for solving inverse problems, yet prior work has primarily focused on observations with Gaussian measurement noise, restricting their use in real-world scenarios. This limitation persists due to the intractability of the likelihood score, which until now has only been approximated in the simpler case of Gaussian likelihoods. In this work, we extend diffusion models to handle inverse problems where the observations follow a distribution from the exponential family, such as a Poisson or a Binomial distribution. By leveraging the conjugacy properties of exponential family distributions, we introduce the evidence trick, a method that provides a tractable approximation to the likelihood score. In our experiments, we demonstrate that our methodology effectively performs Bayesian inference on spatially inhomogeneous Poisson processes with intensities as intricate as ImageNet images. Furthermore, we demonstrate the real-world impact of our methodology by showing that it performs competitively with the current state-of-the-art in predicting malaria prevalence estimates in Sub-Saharan Africa.
Paper and Project Links
Summary
Diffusion models have become powerful tools for solving inverse problems, but prior work has focused mainly on observations with Gaussian measurement noise, limiting real-world applicability. The limitation persists because the likelihood score is intractable and, until now, has only been approximated in the simpler Gaussian case. This work extends diffusion models to inverse problems whose observations follow an exponential-family distribution, such as a Poisson or Binomial distribution. Exploiting the conjugacy properties of exponential-family distributions, the authors introduce the evidence trick, a tractable approximation to the likelihood score. Experiments demonstrate effective Bayesian inference on spatially inhomogeneous Poisson processes with intensities as intricate as ImageNet images, and competitive performance with the current state of the art in predicting malaria prevalence in Sub-Saharan Africa.
Key Takeaways
- Diffusion models are important tools for inverse problems, but earlier work handled mainly Gaussian measurement noise, limiting real-world use; this work removes that restriction.
- The previous limitation stems from the intractability of the likelihood score; the proposed evidence trick provides a tractable approximation to it (see the sketch below).
- The method exploits the conjugacy properties of exponential-family distributions, covering observation models such as the Poisson and Binomial distributions, and thereby broadens the class of inverse problems diffusion models can address.
- Experiments show effective Bayesian inference on spatially inhomogeneous Poisson processes with intensities as complex as ImageNet images, and competitive real-world performance on malaria-prevalence prediction in Sub-Saharan Africa; the authors note that applying the method to larger and more complex settings may still require further model tuning and adaptation per application domain.
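The core difficulty the paper addresses can be stated with the standard posterior-score decomposition used in diffusion-based inverse-problem solvers: the first term comes from the pretrained prior, while the second, likelihood term is intractable in general and is what the evidence trick approximates for exponential-family observations (the Poisson case is shown only as an example). The exact form of the paper's approximation is not reproduced here.

```latex
\nabla_{x_t} \log p(x_t \mid y)
  = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{diffusion prior (score network)}}
  + \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{likelihood score (intractable in general)}}

\text{e.g.\ Poisson observations: } \quad
\log p(y \mid x_0) = \sum_i \big( y_i \log \lambda_i(x_0) - \lambda_i(x_0) - \log y_i! \big),
\qquad
p(y \mid x_t) = \int p(y \mid x_0)\, p(x_0 \mid x_t)\, dx_0
```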
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05994v1/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05994v1/page_1_0.jpg)
Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation
Authors:Chenkai Xu, Xu Wang, Zhenyi Liao, Yishun Li, Tianqi Hou, Zhijie Deng
There has been increasing research interest in building unified multimodal understanding and generation models, among which Show-o stands as a notable representative, demonstrating great promise for both text-to-image and image-to-text generation. The inference of Show-o involves progressively denoising image tokens and autoregressively decoding text tokens, and hence, unfortunately, suffers from inefficiency issues from both sides. This paper introduces Show-o Turbo to bridge the gap. We first identify a unified denoising perspective for the generation of images and text in Show-o based on the parallel decoding of text tokens. We then propose to extend consistency distillation (CD), a qualified approach for shortening the denoising process of diffusion models, to the multimodal denoising trajectories of Show-o. We introduce a trajectory segmentation strategy and a curriculum learning procedure to improve the training convergence. Empirically, in text-to-image generation, Show-o Turbo displays a GenEval score of 0.625 at 4 sampling steps without using classifier-free guidance (CFG), outperforming that of the original Show-o with 8 steps and CFG; in image-to-text generation, Show-o Turbo exhibits a 1.5x speedup without significantly sacrificing performance. The code is available at https://github.com/zhijie-group/Show-o-Turbo.
Paper and Project Links
Summary
Show-o Turbo accelerates Show-o on both text-to-image and image-to-text generation by introducing a unified denoising perspective (based on parallel decoding of text tokens) and extending consistency distillation to Show-o's multimodal denoising trajectories. A trajectory segmentation strategy and a curriculum learning procedure improve training convergence. In text-to-image generation, Show-o Turbo reaches a GenEval score of 0.625 at 4 sampling steps without classifier-free guidance, outperforming the original Show-o with 8 steps and CFG; in image-to-text generation it achieves a 1.5x speedup without significantly sacrificing performance.
Key Takeaways
- Show-o Turbo is an improved version of Show-o aimed at making unified multimodal understanding and generation more efficient.
- It introduces a unified denoising perspective for image and text generation, built on parallel decoding of text tokens.
- Consistency distillation (CD), a technique for shortening the denoising process of diffusion models, is extended to Show-o's multimodal denoising trajectories (the generic CD objective is shown below).
- A trajectory segmentation strategy and a curriculum learning procedure improve training convergence.
- In text-to-image generation, Show-o Turbo attains a higher GenEval score with fewer sampling steps than the original model.
- In image-to-text generation, Show-o Turbo delivers a 1.5x speedup with little loss in performance.
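For context on the objective being extended, a generic consistency-distillation loss over a denoising trajectory can be written as follows, where f_theta is the student consistency function, theta^- an EMA copy used as target, d a distance, and Phi one solver step taken with the teacher phi. Show-o Turbo applies this idea to Show-o's unified multimodal (image-token and text-token) trajectories, with trajectory segmentation and curriculum learning on top; those specifics are in the paper and not captured by this generic form.

```latex
\mathcal{L}_{\mathrm{CD}}(\theta)
  = \mathbb{E}\Big[\, d\big(\, f_\theta(x_{t_{n+1}}, t_{n+1}),\ f_{\theta^-}(\hat{x}_{t_n}, t_n) \,\big) \Big],
\qquad
\hat{x}_{t_n} = \Phi(x_{t_{n+1}}, t_{n+1}, t_n;\ \phi)
```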
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05415v1/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05415v1/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05415v1/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05415v1/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05415v1/page_5_0.jpg)
Beautiful Images, Toxic Words: Understanding and Addressing Offensive Text in Generated Images
Authors:Aditya Kumar, Tom Blanchard, Adam Dziedzic, Franziska Boenisch
State-of-the-art visual generation models, such as Diffusion Models (DMs) and Vision Auto-Regressive Models (VARs), produce highly realistic images. While prior work has successfully mitigated Not Safe For Work (NSFW) content in the visual domain, we identify a novel threat: the generation of NSFW text embedded within images. This includes offensive language, such as insults, racial slurs, and sexually explicit terms, posing significant risks to users. We show that all state-of-the-art DMs (e.g., SD3, Flux, DeepFloyd IF) and VARs (e.g., Infinity) are vulnerable to this issue. Through extensive experiments, we demonstrate that existing mitigation techniques, effective for visual content, fail to prevent harmful text generation while substantially degrading benign text generation. As an initial step toward addressing this threat, we explore safety fine-tuning of the text encoder underlying major DM architectures using a customized dataset. Thereby, we suppress NSFW generation while preserving overall image and text generation quality. Finally, to advance research in this area, we introduce ToxicBench, an open-source benchmark for evaluating NSFW text generation in images. ToxicBench provides a curated dataset of harmful prompts, new metrics, and an evaluation pipeline assessing both NSFW-ness and generation quality. Our benchmark aims to guide future efforts in mitigating NSFW text generation in text-to-image models.
Paper and Project Links
Summary
This paper identifies a new safety risk in state-of-the-art visual generation models (diffusion models and vision auto-regressive models): the images they generate can contain Not-Safe-For-Work text, such as insults, racial slurs, and sexually explicit terms. Existing mitigation techniques that work for visual content fail to prevent this harmful text generation and substantially degrade benign text rendering. As an initial countermeasure, the authors safety fine-tune the text encoder underlying major diffusion architectures on a customized dataset, suppressing NSFW text while preserving overall image and text generation quality. They also introduce ToxicBench, an open-source benchmark with a curated dataset of harmful prompts, new metrics, and an evaluation pipeline assessing both NSFW-ness and generation quality, to guide future work.
Key Takeaways
- Diffusion models (DMs) and vision auto-regressive models (VARs) can embed NSFW text inside the images they generate.
- Existing content-filtering techniques fail to prevent this harmful text generation and can degrade benign text generation.
- Safety fine-tuning the text encoder underlying major DM architectures suppresses NSFW generation while preserving overall image and text quality.
- A customized dataset is introduced for this safety fine-tuning.
- ToxicBench, an open-source benchmark for evaluating NSFW text generation in images, is released.
- ToxicBench provides a curated dataset of harmful prompts, new metrics, and an evaluation pipeline assessing both NSFW-ness and generation quality.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05066v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05066v2/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05066v2/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05066v2/page_5_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.05066v2/page_5_1.jpg)
Towards Consistent and Controllable Image Synthesis for Face Editing
Authors:Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao
Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves the combinations of target background, identity and face attributes aimed to edit. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides presise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models. Code is publicly available at https://github.com/weimengting/RigFace.
Paper and Project Links
Summary
This paper proposes RigFace, which combines Stable Diffusion with crude 3D face models to control the lighting, facial expression, and head pose of a portrait photo. The task is framed as combining a target background, identity, and the face attributes to be edited, and the method disentangles control over these factors: a Spatial Attribute Encoder provides precise, decoupled conditions for background, pose, expression, and lighting; a high-consistency FaceFusion step transfers identity features from the Identity Encoder into the denoising UNet of a pre-trained SD model; and an Attribute Rigger injects the conditions into the denoising UNet. The model matches or exceeds existing face-editing models in identity preservation and photorealism.
Key Takeaways
- The paper presents a face-editing method built on Stable Diffusion and crude 3D face models.
- It combines the target background, identity, and face attributes to be edited, addressing the difficulty diffusion models have in controlling specific attributes while preserving identity.
- A Spatial Attribute Encoder provides precise, decoupled conditions for background, pose, expression, and lighting.
- A high-consistency FaceFusion method transfers identity features from the Identity Encoder into the denoising UNet of a pre-trained SD model.
- An Attribute Rigger injects these conditions into the denoising UNet.
- The model achieves comparable or superior identity preservation and photorealism relative to existing face-editing models.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.02465v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.02465v2/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.02465v2/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.02465v2/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.02465v2/page_5_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.02465v2/page_5_1.jpg)
Compressed Image Generation with Denoising Diffusion Codebook Models
Authors:Guy Ohayon, Hila Manor, Tomer Michaeli, Michael Elad
We present a novel generative approach based on Denoising Diffusion Models (DDMs), which produces high-quality image samples along with their losslessly compressed bit-stream representations. This is obtained by replacing the standard Gaussian noise sampling in the reverse diffusion with a selection of noise samples from pre-defined codebooks of fixed iid Gaussian vectors. Surprisingly, we find that our method, termed Denoising Diffusion Codebook Model (DDCM), retains sample quality and diversity of standard DDMs, even for extremely small codebooks. We leverage DDCM and pick the noises from the codebooks that best match a given image, converting our generative model into a highly effective lossy image codec achieving state-of-the-art perceptual image compression results. More generally, by setting other noise selections rules, we extend our compression method to any conditional image generation task (e.g., image restoration), where the generated images are produced jointly with their condensed bit-stream representations. Our work is accompanied by a mathematical interpretation of the proposed compressed conditional generation schemes, establishing a connection with score-based approximations of posterior samplers for the tasks considered.
Paper and Project Links
PDF Code and demo are available at https://ddcm-2025.github.io/
Summary
Building on Denoising Diffusion Models (DDMs), the authors propose a generative approach that produces high-quality image samples together with their losslessly compressed bit-stream representations. The standard Gaussian noise sampling in the reverse diffusion is replaced by selecting noise samples from pre-defined codebooks of fixed i.i.d. Gaussian vectors; the resulting Denoising Diffusion Codebook Model (DDCM) retains the sample quality and diversity of standard DDMs even with extremely small codebooks. By picking the codebook noises that best match a given image, DDCM becomes a highly effective lossy image codec achieving state-of-the-art perceptual compression results, and other noise-selection rules extend the scheme to any conditional generation task (e.g., image restoration), where images are produced jointly with condensed bit-stream representations. The paper also gives a mathematical interpretation connecting these schemes to score-based approximations of posterior samplers.
Key Takeaways
- A new DDM-based generative approach produces high-quality image samples together with losslessly compressed bit-stream representations.
- Replacing standard Gaussian noise sampling with noise drawn from pre-defined codebooks yields the Denoising Diffusion Codebook Model (DDCM); a schematic of the selection step is sketched below.
- DDCM preserves sample quality and diversity even with extremely small codebooks.
- DDCM can be turned into an effective lossy image codec achieving state-of-the-art perceptual compression results.
- The approach extends to any conditional image generation task, such as image restoration.
- The proposed compressed conditional generation schemes come with a mathematical interpretation linking them to score-based approximations of posterior samplers.
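The central mechanism, replacing fresh Gaussian noise at each reverse step with a selection from a fixed codebook so that the chosen indices double as a bit-stream, can be sketched as below. This is a schematic illustration under assumed shapes and a made-up alignment-based selection rule; the actual DDCM selection rules, variance handling, and codec details are defined in the paper.

```python
import torch

def ddcm_style_reverse_step(x_t, eps_pred, codebook, target_residual, sigma_t, posterior_mean_fn):
    """Schematic DDCM-style reverse step: instead of sampling fresh Gaussian noise,
    pick the codebook entry best aligned with the direction toward the target image.
    Shapes, the alignment rule, and posterior_mean_fn are illustrative assumptions."""
    # codebook: (K, C, H, W) fixed i.i.d. Gaussian vectors shared by encoder and decoder
    scores = codebook.flatten(1) @ target_residual.flatten()   # one alignment score per entry
    k = int(torch.argmax(scores))                              # index stored in the bit-stream
    mean = posterior_mean_fn(x_t, eps_pred)                    # usual DDPM posterior mean
    return mean + sigma_t * codebook[k], k

# Decoding replays the stored indices against the same codebook, so the index
# sequence acts as a compact, losslessly decodable representation of the sample.
```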
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.01189v3/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.01189v3/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.01189v3/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.01189v3/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.01189v3/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2502.01189v3/page_5_0.jpg)
Grounding Text-to-Image Diffusion Models for Controlled High-Quality Image Generation
Authors:Ahmad Süleyman, Göksel Biricik
Text-to-image (T2I) generative diffusion models have demonstrated outstanding performance in synthesizing diverse, high-quality visuals from text captions. Several layout-to-image models have been developed to control the generation process by utilizing a wide range of layouts, such as segmentation maps, edges, and human keypoints. In this work, we propose ObjectDiffusion, a model that conditions T2I diffusion models on semantic and spatial grounding information, enabling the precise rendering and placement of desired objects in specific locations defined by bounding boxes. To achieve this, we make substantial modifications to the network architecture introduced in ControlNet to integrate it with the grounding method proposed in GLIGEN. We fine-tune ObjectDiffusion on the COCO2017 training dataset and evaluate it on the COCO2017 validation dataset. Our model improves the precision and quality of controllable image generation, achieving an AP$_{\text{50}}$ of 46.6, an AR of 44.5, and an FID of 19.8, outperforming the current SOTA model trained on open-source datasets across all three metrics. ObjectDiffusion demonstrates a distinctive capability in synthesizing diverse, high-quality, high-fidelity images that seamlessly conform to the semantic and spatial control layout. Evaluated in qualitative and quantitative tests, ObjectDiffusion exhibits remarkable grounding capabilities in closed-set and open-set vocabulary settings across a wide variety of contexts. The qualitative assessment verifies the ability of ObjectDiffusion to generate multiple detailed objects in varying sizes, forms, and locations.
Paper and Project Links
Summary
The paper proposes ObjectDiffusion, a text-to-image diffusion model conditioned on semantic and spatial grounding information, enabling precise rendering and placement of desired objects at locations defined by bounding boxes. It substantially modifies the ControlNet architecture to integrate the grounding method of GLIGEN, is fine-tuned on the COCO2017 training set, and is evaluated on the COCO2017 validation set, improving the precision and quality of controllable image generation with an AP50 of 46.6, an AR of 44.5, and an FID of 19.8.
Key Takeaways
- ObjectDiffusion is a text-to-image diffusion model conditioned on semantic and spatial grounding information.
- It can precisely render and place desired objects within specified bounding boxes.
- ObjectDiffusion is fine-tuned on the COCO2017 training set and performs strongly on the validation set.
- It outperforms the current state-of-the-art model trained on open-source datasets on all three metrics (AP50 46.6, AR 44.5, FID 19.8).
- It synthesizes diverse, high-quality, high-fidelity images that conform to the semantic and spatial control layout.
- It shows strong grounding ability in both closed-set and open-set vocabulary settings across a wide range of contexts.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2501.09194v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2501.09194v2/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2501.09194v2/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2501.09194v2/page_5_0.jpg)
Guided and Variance-Corrected Fusion with One-shot Style Alignment for Large-Content Image Generation
Authors:Shoukun Sun, Min Xian, Tiankai Yao, Fei Xu, Luca Capriotti
Producing large images using small diffusion models is gaining increasing popularity, as the cost of training large models could be prohibitive. A common approach involves jointly generating a series of overlapped image patches and obtaining large images by merging adjacent patches. However, results from existing methods often exhibit noticeable artifacts, e.g., seams and inconsistent objects and styles. To address the issues, we proposed Guided Fusion (GF), which mitigates the negative impact from distant image regions by applying a weighted average to the overlapping regions. Moreover, we proposed Variance-Corrected Fusion (VCF), which corrects data variance at post-averaging, generating more accurate fusion for the Denoising Diffusion Probabilistic Model. Furthermore, we proposed a one-shot Style Alignment (SA), which generates a coherent style for large images by adjusting the initial input noise without adding extra computational burden. Extensive experiments demonstrated that the proposed fusion methods improved the quality of the generated image significantly. The proposed method can be widely applied as a plug-and-play module to enhance other fusion-based methods for large image generation. Code: https://github.com/TitorX/GVCFDiffusion
Paper and Project Links
Summary
Generating large images with small diffusion models by merging overlapping patches often produces artifacts such as seams and inconsistent objects and styles. The paper proposes Guided Fusion (GF), which reduces the negative influence of distant image regions by applying a weighted average over overlapping regions; Variance-Corrected Fusion (VCF), which corrects the data variance after averaging for more accurate fusion in the Denoising Diffusion Probabilistic Model; and one-shot Style Alignment (SA), which produces a coherent style across the large image by adjusting the initial input noise at no extra computational cost. These fusion methods significantly improve generated image quality and can be applied as plug-and-play modules to other fusion-based large-image generation methods.
Key Takeaways
- Generating large images with small diffusion models is increasingly popular, since training large models can be prohibitively expensive.
- Existing patch-merging methods often produce visible artifacts such as seams and inconsistent objects and styles.
- Guided Fusion (GF) applies a weighted average over overlapping regions to reduce the negative influence of distant image regions (a minimal fusion sketch follows below).
- Variance-Corrected Fusion (VCF) corrects the data variance after averaging, yielding more accurate fusion for the Denoising Diffusion Probabilistic Model.
- One-shot Style Alignment (SA) produces a coherent style for large images by adjusting the initial input noise, without extra computational burden.
- The proposed fusion methods significantly improve the quality of generated images.
- The method can be widely applied as a plug-and-play module to enhance other fusion-based large-image generation approaches.
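To make the fusion idea concrete, here is a minimal sketch of merging overlapping patches with a weighted average whose weights decay toward patch borders, so that distant regions contribute less at each pixel. This illustrates only the generic weighted-fusion step; the specific guidance weighting of GF, the post-averaging variance correction of VCF, and the noise-level style alignment of SA are described in the paper. Shapes and the triangular weight map are assumptions.

```python
import numpy as np

def fuse_patches(patches, coords, canvas_hw, patch_hw):
    """Weighted-average fusion of overlapping patches (patches must tile the canvas).
    patches: list of (ph, pw, c) arrays; coords: list of (top, left) offsets."""
    h, w = canvas_hw
    ph, pw = patch_hw
    c = patches[0].shape[-1]
    acc = np.zeros((h, w, c))
    wsum = np.zeros((h, w, 1))

    # Separable weight map that peaks at the patch center and decays to the edges.
    wy = 1.0 - np.abs(np.linspace(-1, 1, ph))
    wx = 1.0 - np.abs(np.linspace(-1, 1, pw))
    weight = np.outer(wy, wx)[..., None] + 1e-6

    for patch, (top, left) in zip(patches, coords):
        acc[top:top + ph, left:left + pw] += weight * patch
        wsum[top:top + ph, left:left + pw] += weight
    return acc / wsum
```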
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_2_1.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_4_1.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.12771v2/page_4_2.jpg)
Analyzing and Mitigating Model Collapse in Rectified Flow Models
Authors:Huminhao Zhu, Fangyikang Wang, Tianyu Ding, Qing Qu, Zhihui Zhu
Training with synthetic data is becoming increasingly inevitable as synthetic content proliferates across the web, driven by the remarkable performance of recent deep generative models. This reliance on synthetic data can also be intentional, as seen in Rectified Flow models, whose Reflow method iteratively uses self-generated data to straighten the flow and improve sampling efficiency. However, recent studies have shown that repeatedly training on self-generated samples can lead to model collapse (MC), where performance degrades over time. Despite this, most recent work on MC either focuses on empirical observations or analyzes regression problems and maximum likelihood objectives, leaving a rigorous theoretical analysis of reflow methods unexplored. In this paper, we aim to fill this gap by providing both theoretical analysis and practical solutions for addressing MC in diffusion/flow models. We begin by studying Denoising Autoencoders and prove performance degradation when DAEs are iteratively trained on their own outputs. To the best of our knowledge, we are the first to rigorously analyze model collapse in DAEs and, by extension, in diffusion models and Rectified Flow. Our analysis and experiments demonstrate that rectified flow also suffers from MC, leading to potential performance degradation in each reflow step. Additionally, we prove that incorporating real data can prevent MC during recursive DAE training, supporting the recent trend of using real data as an effective approach for mitigating MC. Building on these insights, we propose a novel Real-data Augmented Reflow and a series of improved variants, which seamlessly integrate real data into Reflow training by leveraging reverse flow. Empirical evaluations on standard image benchmarks confirm that RA Reflow effectively mitigates model collapse, preserving high-quality sample generation even with fewer sampling steps.
Paper and Project Links
Summary
Training on self-generated data, as in the Reflow procedure of Rectified Flow models, can cause model collapse (MC), where performance degrades over time. This paper provides both theoretical analysis and practical remedies: it proves performance degradation when Denoising Autoencoders are iteratively trained on their own outputs, extends the analysis to diffusion models and Rectified Flow, and proves that incorporating real data prevents collapse during recursive DAE training. Building on these insights, the authors propose Real-data Augmented (RA) Reflow and improved variants that integrate real data into Reflow training via the reverse flow; experiments on standard image benchmarks show RA Reflow mitigates model collapse while preserving high-quality generation even with fewer sampling steps.
Key Takeaways
- Synthetic data is increasingly used to train deep generative models, but repeated training on self-generated samples can cause model collapse (MC).
- The paper gives the first rigorous analysis of model collapse in Denoising Autoencoders and extends it to diffusion models and Rectified Flow, showing that each Reflow step can degrade performance.
- The analysis proves that incorporating real data prevents MC during recursive training (a schematic data-mixing sketch follows below).
- The proposed Real-data Augmented Reflow and its improved variants integrate real data into Reflow training through the reverse flow, mitigating collapse and preserving high-quality sampling even with fewer steps.
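The practical takeaway, that recursive training on purely self-generated pairs degrades while injecting real data prevents it, can be pictured with a schematic batch-construction step that mixes a fraction of real images into each training batch. This is a generic sketch of the "mix in real data" idea under assumed interfaces, not the Real-data Augmented Reflow algorithm itself, which couples real images to the Reflow objective through the reverse flow as described in the paper.

```python
import torch

def mixed_batch(real_iter, model, batch_size, real_frac=0.5, device="cuda"):
    """Build one batch containing both real images and self-generated ones.
    real_iter is assumed to yield batches of real image tensors; model.sample
    is an assumed few-step sampler of the current model; real_frac is illustrative."""
    n_real = int(batch_size * real_frac)
    x_real = next(real_iter)[:n_real].to(device)
    with torch.no_grad():
        z = torch.randn(batch_size - n_real, *x_real.shape[1:], device=device)
        x_synth = model.sample(z)          # self-generated samples (assumed interface)
    return torch.cat([x_real, x_synth], dim=0)
```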
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.08175v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.08175v2/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.08175v2/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.08175v2/page_5_0.jpg)
Cross-Attention Head Position Patterns Can Align with Human Visual Concepts in Text-to-Image Generative Models
Authors:Jungwon Park, Jungmin Ko, Dongnam Byun, Jangwon Suh, Wonjong Rhee
Recent text-to-image diffusion models leverage cross-attention layers, which have been effectively utilized to enhance a range of visual generative tasks. However, our understanding of cross-attention layers remains somewhat limited. In this study, we introduce a mechanistic interpretability approach for diffusion models by constructing Head Relevance Vectors (HRVs) that align with human-specified visual concepts. An HRV for a given visual concept has a length equal to the total number of cross-attention heads, with each element representing the importance of the corresponding head for the given visual concept. To validate HRVs as interpretable features, we develop an ordered weakening analysis that demonstrates their effectiveness. Furthermore, we propose concept strengthening and concept adjusting methods and apply them to enhance three visual generative tasks. Our results show that HRVs can reduce misinterpretations of polysemous words in image generation, successfully modify five challenging attributes in image editing, and mitigate catastrophic neglect in multi-concept generation. Overall, our work provides an advancement in understanding cross-attention layers and introduces new approaches for fine-controlling these layers at the head level.
Paper and Project Links
PDF Accepted by ICLR 2025
Summary
The paper studies the cross-attention layers of text-to-image diffusion models through a mechanistic interpretability lens, constructing Head Relevance Vectors (HRVs) aligned with human-specified visual concepts: each HRV has one element per cross-attention head, representing that head's importance for the concept. An ordered weakening analysis validates HRVs as interpretable features, and concept strengthening and concept adjusting methods based on them improve three visual generative tasks: reducing misinterpretation of polysemous words in image generation, modifying five challenging attributes in image editing, and mitigating catastrophic neglect in multi-concept generation. The work advances the understanding of cross-attention layers and introduces head-level fine control.
Key Takeaways
- Cross-attention layers play a key role in text-to-image diffusion models.
- Head Relevance Vectors (HRVs), aligned with human-specified visual concepts, are introduced to improve the interpretability of diffusion models.
- An ordered weakening analysis validates HRVs as interpretable features.
- Concept strengthening and concept adjusting methods built on HRVs enhance visual generative tasks.
- HRVs help reduce misinterpretation of polysemous words during image generation.
- HRVs enable successful modification of challenging attributes in image editing and mitigate catastrophic neglect in multi-concept generation.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.02237v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.02237v2/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.02237v2/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2412.02237v2/page_5_0.jpg)
Mining Your Own Secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models
Authors:Saurav Jha, Shiqi Yang, Masato Ishii, Mengjie Zhao, Christian Simon, Muhammad Jehanzeb Mirza, Dong Gong, Lina Yao, Shusuke Takahashi, Yuki Mitsufuji
Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones – a challenge that continual personalization (CP) aims to solve. Inspired by the successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known as diffusion classifier (DC) scores, for continual personalization of text-to-image diffusion models. Namely, we propose using DC scores for regularizing the parameter-space and function-space of text-to-image diffusion models, to achieve continual personalization. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA, and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art. Our project page: https://srvcodes.github.io/continual_personalization/
Paper and Project Links
PDF Accepted to ICLR 2025
Summary
Continual personalization of text-to-image diffusion models is hard because a user learns new concepts one at a time, without access to data from previous concepts, and most personalization methods cannot balance acquiring new concepts with retaining old ones. Inspired by continual-learning methods that regularize with class-specific information, the authors use the inherent class-conditioned density estimates, known as diffusion classifier (DC) scores, to regularize both the parameter space and the function space of text-to-image diffusion models. Across diverse evaluation setups, datasets, and metrics, these regularization-based methods outperform the state-of-the-art C-LoRA and other baselines, and because they operate in a replay-free setup with low-rank adapters, they add zero storage and zero parameter overhead.
Key Takeaways
- Personalized text-to-image diffusion models can efficiently acquire a new concept from user-defined text descriptions and a few images.
- In the continual learning (CL) setup, most personalization methods struggle to balance acquiring new concepts with retaining previous ones.
- Diffusion classifier (DC) scores are used for continual personalization (CP) by regularizing the parameter space and function space of the diffusion model (the DC score is written out below).
- The proposed methods outperform state-of-the-art baselines across multiple evaluation setups and datasets.
- The method runs replay-free, incurring no storage overhead.
- Operating on low-rank adapters, it also adds zero parameter overhead.
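The class-conditional density estimates being regularized with, diffusion classifier (DC) scores, follow the standard diffusion-classifier construction: the class whose conditional denoiser best predicts the injected noise receives the highest posterior. How these scores are then turned into parameter-space and function-space regularizers for continual personalization is the paper's contribution and is not reproduced here.

```latex
p_\theta(c \mid x) \;\propto\; p(c)\, p_\theta(x \mid c),
\qquad
\log p_\theta(x \mid c) \;\approx\; -\,\mathbb{E}_{t,\epsilon}\big[\, w_t\, \|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2 \,\big] + \text{const}

\hat{c}(x) = \arg\min_{c}\ \mathbb{E}_{t,\epsilon}\big[\, \|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2 \,\big]
```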
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2410.00700v3/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2410.00700v3/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2410.00700v3/page_5_0.jpg)
Scalable Autoregressive Image Generation with Mamba
Authors:Haopeng Li, Jinyue Yang, Kexin Wang, Xuerui Qiu, Yuhong Chou, Xin Li, Guoqi Li
We introduce AiM, an autoregressive (AR) image generative model based on Mamba architecture. AiM employs Mamba, a novel state-space model characterized by its exceptional performance for long-sequence modeling with linear time complexity, to supplant the commonly utilized Transformers in AR image generation models, aiming to achieve both superior generation quality and enhanced inference speed. Unlike existing methods that adapt Mamba to handle two-dimensional signals via multi-directional scan, AiM directly utilizes the next-token prediction paradigm for autoregressive image generation. This approach circumvents the need for extensive modifications to enable Mamba to learn 2D spatial representations. By implementing straightforward yet strategically targeted modifications for visual generative tasks, we preserve Mamba’s core structure, fully exploiting its efficient long-sequence modeling capabilities and scalability. We provide AiM models in various scales, with parameter counts ranging from 148M to 1.3B. On the ImageNet1K 256*256 benchmark, our best AiM model achieves a FID of 2.21, surpassing all existing AR models of comparable parameter counts and demonstrating significant competitiveness against diffusion models, with 2 to 10 times faster inference speed. Code is available at https://github.com/hp-l33/AiM
Paper and Project Links
PDF 9 pages, 8 figures
Summary
AiM is an autoregressive (AR) image generation model that replaces the Transformer commonly used in AR image generation with Mamba, a state-space model offering linear-time long-sequence modeling, aiming for both better generation quality and faster inference. Rather than adapting Mamba to two-dimensional signals via multi-directional scanning, AiM keeps the plain next-token prediction paradigm, with only targeted modifications for visual generation. AiM models range from 148M to 1.3B parameters; the best model reaches an FID of 2.21 on the ImageNet1K 256*256 benchmark, surpassing existing AR models of comparable size and offering 2 to 10 times faster inference than diffusion models.
Key Takeaways
- AiM is an autoregressive image generation model built on the Mamba architecture.
- Mamba is a state-space model with linear time complexity for long-sequence modeling.
- AiM targets both higher generation quality and faster inference.
- AiM generates images directly via the next-token prediction paradigm, without multi-directional 2D scanning (a generic sampling loop is sketched below).
- AiM models range from 148M to 1.3B parameters.
- The best AiM model reaches an FID of 2.21 on the ImageNet1K 256*256 benchmark.
- AiM infers 2 to 10 times faster than diffusion models.
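The "plain next-token prediction, no 2D-specific scanning" point can be pictured with a generic autoregressive sampling loop over a flattened sequence of discrete image tokens, which a tokenizer/decoder then maps back to pixels. The loop below is a schematic sketch with assumed interfaces (backbone returning per-position logits, tokenizer with a decode method), not AiM's implementation; AiM's contribution is placing a Mamba state-space backbone inside such a loop in place of a Transformer.

```python
import torch

@torch.no_grad()
def ar_sample_image(backbone, tokenizer, start_id, seq_len=256, temperature=1.0, device="cuda"):
    """Generic next-token sampling over flattened image tokens (interfaces assumed)."""
    tokens = torch.full((1, 1), start_id, dtype=torch.long, device=device)  # e.g. a class/BOS token
    for _ in range(seq_len):
        logits = backbone(tokens)[:, -1, :] / temperature   # next-token logits, shape (1, vocab)
        nxt = torch.multinomial(torch.softmax(logits, dim=-1), num_samples=1)
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokenizer.decode(tokens[:, 1:])  # drop the start token, map codes back to pixels
```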
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2408.12245v4/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2408.12245v4/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2408.12245v4/page_3_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2408.12245v4/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2408.12245v4/page_4_1.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2408.12245v4/page_5_0.jpg)
Iterative Ensemble Training with Anti-Gradient Control for Mitigating Memorization in Diffusion Models
Authors:Xiao Liu, Xiaoliu Guan, Yu Wu, Jiaxu Miao
Diffusion models, known for their tremendous ability to generate novel and high-quality samples, have recently raised concerns due to their data memorization behavior, which poses privacy risks. Recent approaches for memory mitigation either only focused on the text modality problem in cross-modal generation tasks or utilized data augmentation strategies. In this paper, we propose a novel training framework for diffusion models from the perspective of visual modality, which is more generic and fundamental for mitigating memorization. To facilitate forgetting of stored information in diffusion model parameters, we propose an iterative ensemble training strategy by splitting the data into multiple shards for training multiple models and intermittently aggregating these model parameters. Moreover, practical analysis of losses illustrates that the training loss for easily memorable images tends to be obviously lower. Thus, we propose an anti-gradient control method to exclude the sample with a lower loss value from the current mini-batch to avoid memorizing. Extensive experiments and analysis on four datasets are conducted to illustrate the effectiveness of our method, and results show that our method successfully reduces memory capacity while even improving the performance slightly. Moreover, to save the computing cost, we successfully apply our method to fine-tune the well-trained diffusion models by limited epochs, demonstrating the applicability of our method. Code is available in https://github.com/liuxiao-guan/IET_AGC.
Paper and Project Links
PDF Accepted in ECCV 2024, 20 pages with 7 figures
Summary
The paper proposes a training framework that mitigates memorization in diffusion models from the visual-modality perspective, which is more generic and fundamental than prior text-modality or data-augmentation approaches. An iterative ensemble training strategy splits the data into multiple shards, trains multiple models, and intermittently aggregates their parameters to encourage forgetting of stored information. Since easily memorized images tend to show clearly lower training losses, an anti-gradient control method excludes low-loss samples from the current mini-batch to avoid memorization. Experiments on four datasets show reduced memorization with slightly improved performance, and the method can also fine-tune well-trained diffusion models within a limited number of epochs to save compute.
Key Takeaways
- Diffusion models generate novel, high-quality samples, but their data-memorization behavior has raised privacy concerns.
- Existing mitigation approaches either target the text modality in cross-modal generation or rely on data augmentation.
- This paper tackles memorization from the visual-modality perspective, which is more generic and fundamental.
- Data sharding with iterative ensemble training and intermittent parameter aggregation encourages the diffusion model to forget stored information.
- Easily memorized images tend to have lower training losses, so an anti-gradient control method excludes such low-loss samples from the mini-batch (see the sketch below).
- Experiments on four datasets show the method reduces memorization while slightly improving performance.
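The anti-gradient control step described above, dropping easily memorized low-loss samples from the current mini-batch before back-propagating, can be sketched as follows. The per-sample loss is the ordinary denoising loss; the thresholding rule shown (a running-mean-based cutoff) is an illustrative assumption, and the exact criterion used in IET-AGC is defined in the paper.

```python
import torch

def anti_gradient_control_loss(per_sample_loss, threshold):
    """Keep only samples whose denoising loss exceeds `threshold`, so that
    easily memorized (very low-loss) images contribute no gradient."""
    keep = per_sample_loss > threshold              # boolean mask over the mini-batch
    if keep.any():
        return per_sample_loss[keep].mean()
    return per_sample_loss.mean() * 0.0             # degenerate batch: zero gradient contribution

# Illustrative usage (names are assumptions):
# loss_i = ((eps - eps_pred) ** 2).flatten(1).mean(dim=1)      # per-sample denoising MSE
# loss = anti_gradient_control_loss(loss_i, threshold=0.5 * running_mean_loss)
```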
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2407.15328v3/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2407.15328v3/page_4_0.jpg)
Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation
Authors:Eyal Michaeli, Ohad Fried
Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset’s diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug.
Paper and Project Links
PDF Accepted to NeurIPS 2024
Summary
Fine-grained visual classification (FGVC) is hard because of subtle inter-class differences, high intra-class variance, and small, hard-to-collect datasets, making effective data augmentation essential. The paper presents SaSPA (Structure and Subject Preserving Augmentation), a generative augmentation method that, unlike recent approaches, does not use real images as direct guidance, which increases generation flexibility and diversity; accurate class representation is ensured by conditioning on image edges and subject representation. SaSPA consistently outperforms traditional and recent generative augmentation baselines across full-dataset training, contextual bias, and few-shot classification, and the experiments also reveal a relationship between the amount of real data used and the optimal proportion of synthetic data.
Key Takeaways
- Fine-grained visual classification (FGVC) faces subtle inter-class differences, high intra-class variance, and datasets that are small and difficult to gather.
- Data augmentation is therefore crucial for FGVC, and recent text-to-image diffusion models open new possibilities.
- SaSPA increases generation diversity and flexibility by not relying on real images as guidance.
- SaSPA ensures accurate class representation through conditioning mechanisms, specifically conditioning on image edges and subject representation.
- Experiments show SaSPA outperforms established baselines in full-dataset training, contextual bias, and few-shot classification settings.
- The study also reveals interesting patterns in using synthetic data for FGVC, including a relationship between the amount of real data used and the optimal proportion of synthetic data.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.14551v3/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.14551v3/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.14551v3/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.14551v3/page_5_0.jpg)
Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models
Authors:Alireza Ganjdanesh, Reza Shirkavand, Shangqian Gao, Heng Huang
Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internal target data. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models. Central to our approach is a prompt router model, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing into a single one. We demonstrate APTP’s effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO as target datasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover previously empirically found challenging prompts for SD, e.g. prompts for generating text images, assigning them to higher capacity codes.
Paper and Project Links
Summary
T2I diffusion models generate high-quality images but are too computationally heavy for resource-constrained organizations to deploy after fine-tuning on internal data. The paper introduces Adaptive Prompt-Tailored Pruning (APTP), a prompt-based pruning method whose core is a prompt router that learns how much capacity an input text prompt needs and routes it to an architecture code under a total compute budget; each code corresponds to a specialized pruned model, similar prompts are mapped to nearby codes via contrastive learning, and optimal transport prevents the codes from collapsing into one. Pruning Stable Diffusion V2.1 with CC3M and COCO as target datasets, APTP outperforms single-model pruning baselines on FID, CLIP, and CMMD scores; the learned clusters are semantically meaningful, and APTP automatically discovers empirically challenging prompts for SD (e.g., prompts for generating text images) and assigns them to higher-capacity codes.
Key Takeaways
- T2I diffusion models generate high-quality images, but their computational intensity makes deployment difficult in resource-constrained environments.
- Static pruning methods ignore that different prompts require different amounts of capacity.
- Dynamic pruning uses a separate sub-network per prompt but prevents batch parallelism on GPUs.
- APTP is a prompt-based pruning method for T2I diffusion models, built around a prompt router model and learned architecture codes.
- The prompt router determines the capacity an input text prompt needs, given a total compute budget.
- Contrastive learning trains the prompt router and architecture codes so that similar prompts map to nearby codes, while optimal transport prevents code collapse.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.12042v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.12042v2/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.12042v2/page_3_0.jpg)
Guided Score identity Distillation for Data-Free One-Step Text-to-Image Generation
Authors:Mingyuan Zhou, Zhendong Wang, Huangjie Zheng, Hai Huang
Diffusion-based text-to-image generation models trained on extensive text-image pairs have demonstrated the ability to produce photorealistic images aligned with textual descriptions. However, a significant limitation of these models is their slow sample generation process, which requires iterative refinement through the same network. To overcome this, we introduce a data-free guided distillation method that enables the efficient distillation of pretrained Stable Diffusion models without access to the real training data, often restricted due to legal, privacy, or cost concerns. This method enhances Score identity Distillation (SiD) with Long and Short Classifier-Free Guidance (LSG), an innovative strategy that applies Classifier-Free Guidance (CFG) not only to the evaluation of the pretrained diffusion model but also to the training and evaluation of the fake score network. We optimize a model-based explicit score matching loss using a score-identity-based approximation alongside our proposed guidance strategies for practical computation. By exclusively training with synthetic images generated by its one-step generator, our data-free distillation method rapidly improves FID and CLIP scores, achieving state-of-the-art FID performance while maintaining a competitive CLIP score. Notably, the one-step distillation of Stable Diffusion 1.5 achieves an FID of 8.15 on the COCO-2014 validation set, a record low value under the data-free setting. Our code and checkpoints are available at https://github.com/mingyuanzhou/SiD-LSG.
Paper and Project Links
PDF ICLR 2025; fixed typos in Table 1; Code and model checkpoints available at https://github.com/mingyuanzhou/SiD-LSG; More efficient code using AMP is coming soon
Summary
The paper introduces a data-free guided distillation method that distills pretrained Stable Diffusion models without access to the real training data, which is often restricted for legal, privacy, or cost reasons. It enhances Score identity Distillation (SiD) with Long and Short Classifier-Free Guidance (LSG), applying CFG not only when evaluating the pretrained diffusion model but also when training and evaluating the fake score network, and optimizes a model-based explicit score-matching loss with a score-identity-based approximation. Training exclusively on synthetic images from its one-step generator, the method rapidly improves FID and CLIP scores, achieving state-of-the-art FID in the data-free setting while maintaining a competitive CLIP score.
Key Takeaways
- Diffusion-based text-to-image models produce photorealistic images aligned with textual descriptions.
- Their iterative sampling process makes generation slow.
- A data-free guided distillation method is proposed that requires no access to the real training data, overcoming this speed problem.
- It combines SiD with the LSG strategy, applying classifier-free guidance to the evaluation of the pretrained diffusion model as well as to the training and evaluation of the fake score network (the standard CFG rule is recalled below).
- Training on synthetic images generated by the one-step generator rapidly improves FID and CLIP scores.
- The method achieves state-of-the-art FID while keeping a competitive CLIP score.
- One-step distillation of Stable Diffusion 1.5 reaches an FID of 8.15 on the COCO-2014 validation set, a record low in the data-free setting.
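For reference, the classifier-free guidance (CFG) rule that LSG builds on is the standard linear combination below, where the empty-set symbol denotes the unconditional prediction and w the guidance scale. According to the abstract, the "long and short" strategy applies such guidance not only when evaluating the pretrained teacher but also when training and evaluating the fake score network; the specific scales and schedule are the paper's design choices and are not reproduced here.

```latex
\tilde{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + w \,\big( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \big)
```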
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.01561v4/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.01561v4/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2406.01561v4/page_5_0.jpg)
Weakly-Supervised PET Anomaly Detection using Implicitly-Guided Attention-Conditional Counterfactual Diffusion Modeling: a Multi-Center, Multi-Cancer, and Multi-Tracer Study
Authors:Shadab Ahamed, Arman Rahmim
Minimizing the need for pixel-level annotated data to train PET lesion detection and segmentation networks is highly desired and can be transformative, given time and cost constraints associated with expert annotations. Current un-/weakly-supervised anomaly detection methods rely on autoencoder or generative adversarial networks trained only on healthy data; however GAN-based networks are more challenging to train due to issues with simultaneous optimization of two competing networks, mode collapse, etc. In this paper, we present the weakly-supervised Implicitly guided COuNterfactual diffusion model for Detecting Anomalies in PET images (IgCONDA-PET). The solution is developed and validated using PET scans from six retrospective cohorts consisting of a total of 2652 cases containing both local and public datasets. The training is conditioned on image class labels (healthy vs. unhealthy) via attention modules, and we employ implicit diffusion guidance. We perform counterfactual generation which facilitates “unhealthy-to-healthy” domain translation by generating a synthetic, healthy version of an unhealthy input image, enabling the detection of anomalies through the calculated differences. The performance of our method was compared against several other deep learning based weakly-supervised or unsupervised methods as well as traditional methods like 41% SUVmax thresholding. We also highlight the importance of incorporating attention modules in our network for the detection of small anomalies. The code is publicly available at: https://github.com/ahxmeds/IgCONDA-PET.git.
Paper and Project Links
PDF 32 pages, 6 figures, 4 tables
Summary
The paper presents IgCONDA-PET, a weakly-supervised, implicitly guided, attention-conditional counterfactual diffusion model for detecting anomalies in PET images, aimed at reducing the need for pixel-level annotations. Training is conditioned on image class labels (healthy vs. unhealthy) via attention modules and uses implicit diffusion guidance; counterfactual generation translates an unhealthy input into a synthetic healthy version, so anomalies can be detected from the computed differences. The method is developed and validated on PET scans from six retrospective cohorts (2,652 cases across local and public datasets) and compared against deep-learning-based weakly-/unsupervised methods as well as traditional baselines such as 41% SUVmax thresholding; attention modules prove important for detecting small anomalies.
Key Takeaways
- IgCONDA-PET is a new weakly-supervised method for lesion detection and segmentation in PET images.
- It reduces the need for pixel-level annotated data, saving expert annotation time and cost.
- Training is conditioned on image class labels (healthy vs. unhealthy) through attention modules.
- Implicit diffusion guidance and counterfactual generation enable "unhealthy-to-healthy" domain translation (the resulting anomaly-map step is sketched below).
- The method is validated on multiple retrospective cohorts combining local and public datasets and compares favorably with other deep-learning and traditional approaches.
- Code is publicly available at https://github.com/ahxmeds/IgCONDA-PET.git.
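Two quantities from the abstract are simple enough to show directly: the anomaly map obtained by differencing an unhealthy input against its generated healthy counterfactual, and the traditional 41% SUVmax thresholding baseline it is compared with. The counterfactual itself comes from the attention-conditional diffusion model described in the paper; the snippet below only sketches the comparison step, with array shapes as assumptions.

```python
import numpy as np

def anomaly_map(unhealthy_img, healthy_counterfactual):
    """Voxel-wise anomaly score: difference between the input and its
    'unhealthy-to-healthy' counterfactual generated by the diffusion model."""
    return np.abs(unhealthy_img - healthy_counterfactual)

def suvmax_threshold_mask(pet_volume, fraction=0.41):
    """Traditional baseline: segment voxels above 41% of the maximum SUV."""
    return pet_volume >= fraction * pet_volume.max()
```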
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2405.00239v2/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2405.00239v2/page_2_0.jpg)
Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting
Authors:Zhiqi Li, Yiming Chen, Lingzhe Zhao, Peidong Liu
While text-to-3D and image-to-3D generation tasks have received considerable attention, one important but under-explored field between them is controllable text-to-3D generation, which we mainly focus on in this work. To address this task, 1) we introduce Multi-view ControlNet (MVControl), a novel neural network architecture designed to enhance existing pre-trained multi-view diffusion models by integrating additional input conditions, such as edge, depth, normal, and scribble maps. Our innovation lies in the introduction of a conditioning module that controls the base diffusion model using both local and global embeddings, which are computed from the input condition images and camera poses. Once trained, MVControl is able to offer 3D diffusion guidance for optimization-based 3D generation. And, 2) we propose an efficient multi-stage 3D generation pipeline that leverages the benefits of recent large reconstruction models and score distillation algorithm. Building upon our MVControl architecture, we employ a unique hybrid diffusion guidance method to direct the optimization process. In pursuit of efficiency, we adopt 3D Gaussians as our representation instead of the commonly used implicit representations. We also pioneer the use of SuGaR, a hybrid representation that binds Gaussians to mesh triangle faces. This approach alleviates the issue of poor geometry in 3D Gaussians and enables the direct sculpting of fine-grained geometry on the mesh. Extensive experiments demonstrate that our method achieves robust generalization and enables the controllable generation of high-quality 3D content. Project page: https://lizhiqi49.github.io/MVControl/.
Paper and Project Links
PDF 3DV-2025
Summary
The paper targets controllable text-to-3D generation. It introduces Multi-view ControlNet (MVControl), a neural architecture that enhances pre-trained multi-view diffusion models with additional input conditions such as edge, depth, normal, and scribble maps; a conditioning module controls the base diffusion model with local and global embeddings computed from the condition images and camera poses, so the trained MVControl can provide 3D diffusion guidance for optimization-based 3D generation. An efficient multi-stage 3D generation pipeline combines recent large reconstruction models with score distillation, uses a hybrid diffusion guidance scheme, adopts 3D Gaussians instead of implicit representations for efficiency, and pioneers the use of SuGaR, a hybrid representation binding Gaussians to mesh triangle faces, which alleviates the poor geometry of 3D Gaussians and allows direct sculpting of fine-grained geometry on the mesh. Experiments show robust generalization and controllable generation of high-quality 3D content.
Key Takeaways
- Multi-view ControlNet (MVControl) is a new neural architecture that strengthens existing pre-trained multi-view diffusion models.
- MVControl enables controllable text-to-3D generation by integrating additional input conditions such as edge, depth, normal, and scribble maps.
- A conditioning module controls the base diffusion model with local and global embeddings computed from the condition images and camera poses.
- Once trained, MVControl provides 3D diffusion guidance for optimization-based 3D generation.
- An efficient multi-stage 3D generation pipeline leverages large reconstruction models and score distillation.
- 3D Gaussians are used as the representation for efficiency, and the hybrid SuGaR representation, which binds Gaussians to mesh triangle faces, addresses the poor geometry of plain 3D Gaussians.
Click here to view paper screenshots
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2403.09981v3/page_0_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2403.09981v3/page_1_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2403.09981v3/page_2_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2403.09981v3/page_4_0.jpg)
![](D:\MyBlog\AutoFX\arxiv\2025-02-12\./crop_Diffusion Models/2403.09981v3/page_5_0.jpg)