Published: 2025-09-13

Updated: 2025-10-07

Word count: 3k

Reading time: 12 min

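⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not use them in serious academic settings. They are intended only for screening papers before reading them.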

Updated 2025-09-13

Locality in Image Diffusion Models Emerges from Data Statistics

Authors: Artem Lukoianov, Chenyang Yuan, Justin Solomon, Vincent Sitzmann

Among generative models, diffusion models are uniquely intriguing due to the existence of a closed-form optimal minimizer of their training objective, often referred to as the optimal denoiser. However, diffusion using this optimal denoiser merely reproduces images in the training set and hence fails to capture the behavior of deep diffusion models. Recent work has attempted to characterize this gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models that can generate images that resemble those generated by a trained UNet. The best-performing method hypothesizes that shift equivariance and locality inductive biases of convolutional neural networks are the cause of the performance gap, hence incorporating these assumptions into its analytical model. In this work, we present evidence that the locality in deep diffusion models emerges as a statistical property of the image dataset, not due to the inductive bias of convolutional neural networks. Specifically, we demonstrate that an optimal parametric linear denoiser exhibits similar locality properties to the deep neural denoisers. We further show, both theoretically and experimentally, that this locality arises directly from the pixel correlations present in natural image datasets. Finally, we use these insights to craft an analytical denoiser that better matches scores predicted by a deep diffusion model than the prior expert-crafted alternative.

Paper and project links

PDF 30 pages, 18 figures, 6 tables

Summary

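This paper examines why diffusion with the closed-form optimal denoiser fails to behave like deep diffusion models: the optimal denoiser merely reproduces images from the training set. Recent work has tried to characterize the gap between the optimal denoiser and deep diffusion models by proposing analytical, training-free models that generate images. The paper presents evidence that locality in deep diffusion models does not stem from the inductive bias of convolutional neural networks but emerges as a statistical property of the image dataset. Building on this, it proposes an analytical denoiser that matches the scores predicted by a deep diffusion model better than the prior expert-crafted alternative.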
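
To make the "optimal denoiser only reproduces training images" point concrete, here is a minimal sketch in plain Python. It assumes the simple noise model x_t = x_0 + sigma * eps and a data distribution equal to the empirical training set; under those assumptions the posterior mean is a softmax-weighted average of the training samples. The function name `optimal_denoiser` and the toy 4-pixel "images" are illustrative choices, not taken from the paper, whose parameterization may differ.

```python
import math

def optimal_denoiser(x_t, train, sigma):
    """Posterior mean E[x_0 | x_t] when the data prior is the empirical training set."""
    # Log-weight of each training sample: -||x_t - x_i||^2 / (2 * sigma^2)
    logits = [
        -sum((a - b) ** 2 for a, b in zip(x_t, x_i)) / (2 * sigma ** 2)
        for x_i in train
    ]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    # Weighted average of the training samples, one pixel at a time
    return [
        sum(w * x_i[d] for w, x_i in zip(weights, train)) / z
        for d in range(len(x_t))
    ]

# Three toy "images" with four pixels each (made-up data)
train = [[0.0, 0.0, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 1.0, 0.0]]
noisy = [0.1, -0.1, 0.9, 1.2]

low = optimal_denoiser(noisy, train, sigma=0.1)    # low noise level
high = optimal_denoiser(noisy, train, sigma=10.0)  # high noise level
```

At the low noise level the weights collapse onto the nearest training sample, so the output is essentially a copy of `train[0]`; at the high noise level the output approaches the mean of the training set. This is the memorization behavior the abstract describes.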

Key Takeaways

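Diffusion models admit a closed-form optimal denoiser, but it only reproduces images from the training set and fails to capture the behavior of deep diffusion models.
Recent work has tried to characterize the gap between the optimal denoiser and deep diffusion models, proposing analytical, training-free models.
Locality in deep diffusion models does not stem from the inductive bias of convolutional neural networks; it is a statistical property of the image dataset.
An optimal parametric linear denoiser exhibits locality properties similar to those of deep neural denoisers.
This locality arises directly from the pixel correlations present in natural image datasets.
These insights yield an analytical denoiser that matches a deep diffusion model more closely than the prior expert-crafted alternative.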
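
The link between pixel correlations and a local linear denoiser can be illustrated with a small sketch. It is not the paper's construction: it assumes zero-mean data on a 1D row of pixels with a made-up covariance `cov[i][j] = rho ** abs(i - j)` and the noise model x_t = x_0 + sigma * eps, for which the best linear denoiser is the classical Wiener filter W = cov (cov + sigma^2 I)^{-1}. The helper names `solve` and `linear_denoiser` are hypothetical.

```python
def solve(a, b):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        # Move the row with the largest pivot into place
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def linear_denoiser(n, rho, sigma):
    """Wiener filter W = cov (cov + sigma^2 I)^{-1} for an assumed covariance."""
    cov = [[rho ** abs(i - j) for j in range(n)] for i in range(n)]
    noisy_cov = [
        [cov[i][j] + (sigma ** 2 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    # cov and noisy_cov are symmetric and commute, so
    # cov @ inv(noisy_cov) equals inv(noisy_cov) @ cov, which solve() returns.
    return solve(noisy_cov, cov)

W = linear_denoiser(n=15, rho=0.9, sigma=0.5)
center = 7
row = W[center]  # weights used to denoise the center pixel

for offset in range(6):
    print(offset, round(row[center + offset], 4))
```

The printed weights shrink quickly with distance from the center pixel, even though no convolution or other architectural prior is involved: the filter is local only because the assumed covariance is. Real image covariances are two-dimensional and not exactly of this form, so the sketch shows the mechanism rather than the paper's measured sensitivity fields.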

Prompt Pirates Need a Map: Stealing Seeds helps Stealing Prompts

Authors: Felix Mächtle, Ashwath Shetty, Jonas Sander, Nils Loose, Sören Pirk, Thomas Eisenbarth

Diffusion models have significantly advanced text-to-image generation, enabling the creation of highly realistic images conditioned on textual prompts and seeds. Given the considerable intellectual and economic value embedded in such prompts, prompt theft poses a critical security and privacy concern. In this paper, we investigate prompt-stealing attacks targeting diffusion models. We reveal that numerical optimization-based prompt recovery methods are fundamentally limited as they do not account for the initial random noise used during image generation. We identify and exploit a noise-generation vulnerability (CWE-339), prevalent in major image-generation frameworks, originating from PyTorch’s restriction of seed values to a range of $2^{32}$ when generating the initial random noise on CPUs. Through a large-scale empirical analysis conducted on images shared via the popular platform CivitAI, we demonstrate that approximately 95% of these images’ seed values can be effectively brute-forced in 140 minutes per seed using our seed-recovery tool, SeedSnitch. Leveraging the recovered seed, we propose PromptPirate, a genetic algorithm-based optimization method explicitly designed for prompt stealing. PromptPirate surpasses state-of-the-art methods, i.e., PromptStealer, P2HP, and CLIP-Interrogator, achieving an 8-11% improvement in LPIPS similarity. Furthermore, we introduce straightforward and effective countermeasures that render seed stealing, and thus optimization-based prompt stealing, ineffective. We have disclosed our findings responsibly and initiated coordinated mitigation efforts with the developers to address this critical vulnerability.

Paper and project links

PDF

Summary

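This paper studies prompt-stealing attacks against diffusion models. The authors show that prompt-recovery methods based on numerical optimization are fundamentally limited because they ignore the initial random noise, and they identify a noise-generation vulnerability (CWE-339) in image-generation frameworks. A large-scale empirical study shows that the SeedSnitch tool can brute-force image seed values effectively. Using the recovered seed, the authors propose PromptPirate, a genetic-algorithm-based prompt-stealing method that outperforms existing methods such as PromptStealer, P2HP, and CLIP-Interrogator. Finally, they propose effective countermeasures that defeat seed stealing and, with it, optimization-based prompt stealing.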
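
Why a small seed range is a weakness can be shown with a toy brute-force search. This sketch is not SeedSnitch: it uses Python's `random.Random` as a stand-in for the framework's CPU noise generator, searches a deliberately tiny 16-bit seed space, and assumes the attacker can compare candidate noise directly against the observed noise. The names `initial_noise`, `recover_seed`, and `secret_seed` are hypothetical.

```python
import random

def initial_noise(seed, n=8):
    """Toy stand-in for the initial latent noise drawn from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def recover_seed(observed, seed_bits=16, tol=1e-9):
    """Try every seed in the range and return the first whose noise matches."""
    for candidate in range(2 ** seed_bits):
        noise = initial_noise(candidate, len(observed))
        if all(abs(a - b) <= tol for a, b in zip(noise, observed)):
            return candidate
    return None  # the seed lies outside the searched range

secret_seed = 48611                     # unknown to the attacker
observed = initial_noise(secret_seed)   # what the attacker manages to observe
recovered = recover_seed(observed)

# Back-of-the-envelope rate, assuming the 140 minutes quoted in the abstract
# corresponds to one full scan of the 2**32 range (our reading, not a stated fact).
seeds_per_second = 2 ** 32 / (140 * 60)
```

Under that reading, a full scan needs about 511,000 candidate checks per second, which illustrates why a 32-bit seed space is within reach of exhaustive search while a much larger one would not be.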

Key Takeaways

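Diffusion models have significantly advanced text-to-image generation, but they also raise security and privacy concerns around prompt theft.
Prompt-recovery methods based on numerical optimization are limited because they ignore the initial random noise used during image generation.
Major image-generation frameworks share a noise-generation vulnerability (CWE-339) that originates from PyTorch restricting seed values to a range of $2^{32}$ when generating the initial random noise on CPUs.
A large-scale empirical study shows that the SeedSnitch tool can brute-force image seed values effectively.
The authors propose PromptPirate, a genetic-algorithm-based method for prompt stealing that outperforms existing methods.
The authors propose effective countermeasures that defeat seed stealing and, with it, optimization-based prompt stealing.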
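
The abstract describes PromptPirate only as a genetic-algorithm-based optimizer, so the following is a generic toy genetic algorithm rather than the authors' method. It assumes a hypothetical vocabulary `VOCAB`, a hidden prompt `TARGET`, and a stand-in `fitness` that counts matching tokens; in a real attack the score would come from comparing images rendered with the recovered seed (the paper reports LPIPS).

```python
import random

# Hypothetical vocabulary and hidden prompt, for illustration only
VOCAB = ["cat", "dog", "castle", "forest", "neon", "oil",
         "painting", "photo", "sunset", "robot", "portrait", "mist"]
TARGET = ["neon", "robot", "portrait", "photo"]

def fitness(prompt):
    """Stand-in for an image-similarity score; higher is better."""
    return sum(a == b for a, b in zip(prompt, TARGET))

def evolve(pop_size=40, generations=60, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    length = len(TARGET)
    pop = [[rng.choice(VOCAB) for _ in range(length)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        children = [pop[0][:]]  # elitism: keep the best candidate unchanged
        while len(children) < pop_size:
            # Tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Uniform crossover followed by per-token mutation
            child = [rng.choice(pair) for pair in zip(p1, p2)]
            child = [rng.choice(VOCAB) if rng.random() < mutation_rate else t
                     for t in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve()
print(best, history[-1])
```

Because the best candidate is always carried over, the best score never decreases from one generation to the next. Knowing the seed matters because it lets candidate prompts be rendered from the same initial noise as the target image, which makes such a score far more informative.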

SV-DRR: High-Fidelity Novel View X-Ray Synthesis Using Diffusion Model

Authors: Chun Xie, Yuichi Yoshii, Itaru Kitahara

X-ray imaging is a rapid and cost-effective tool for visualizing internal human anatomy. While multi-view X-ray imaging provides complementary information that enhances diagnosis, intervention, and education, acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address these challenges, we propose a novel view-conditioned diffusion model for synthesizing multi-view X-ray images from a single view. Unlike prior methods, which are limited in angular range, resolution, and image quality, our approach leverages the Diffusion Transformer to preserve fine details and employs a weak-to-strong training strategy for stable high-resolution image generation. Experimental results demonstrate that our method generates higher-resolution outputs with improved control over viewing angles. This capability has significant implications not only for clinical applications but also for medical education and data extension, enabling the creation of diverse, high-quality datasets for training and analysis. Our code is available at GitHub.

Paper and project links

PDF Accepted by MICCAI2025

Summary

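Multi-view X-ray imaging plays an important role in diagnosis, intervention, and education, but acquiring images from multiple angles increases radiation exposure and complicates clinical workflows. To address this, the authors propose a view-conditioned diffusion model that synthesizes multi-view X-ray images from a single view. The method uses the Diffusion Transformer to preserve fine details and a weak-to-strong training strategy for stable high-resolution generation. Experiments show that it produces higher-resolution outputs with improved control over viewing angles, with implications for clinical applications, medical education, and data extension.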

Key Takeaways