发布日期: 2025-01-24

更新日期: 2025-01-24

文章字数: 13.6k

阅读时长: 54 分

阅读次数:

⚠️ 以下所有内容总结都来自于大语言模型的能力，如有错误，仅供参考，谨慎使用
🔴 请注意：千万不要用于严肃的学术场景，只能用于论文阅读前的初筛！
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ，还请您给我们一些鼓励！⭐️ HuggingFace免费体验

2025-01-24 更新

Accelerate High-Quality Diffusion Models with Inner Loop Feedback

Authors:Matthew Gwilliam, Han Cai, Di Wu, Abhinav Shrivastava, Zhiyu Cheng

We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models’ inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions; (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward can be tempered with a learnable scaling factor from zero initialization. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone, training only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with diffusion transformer (DiT) and text-to-image generation with DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF’s 1.7x-1.8x speedups are confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons.

我们提出了内部循环反馈（ILF），这是一种加速扩散模型推断的新型方法。ILF训练了一个轻量级模块，通过利用在给定时间步长选择的扩散主干块输出，来预测去噪过程中未来的特征。这种方法利用了两个关键直觉：（1）给定块在相邻时间步的输出是相似的，以及（2）对一步进行部分计算比完全跳过这一步给模型带来的负担更低。我们的方法具有很高的灵活性，因为我们发现反馈模块本身可以仅仅是扩散主干的一个块，所有设置都被复制。它对扩散正向的影响可以通过从零初始化开始的一个可学习的缩放因子来调节。我们使用蒸馏损失来训练这个模块；然而，与一些先前的工作不同，其中整个扩散主干作为学生，我们的模型会冻结主干，只训练反馈模块。虽然许多优化扩散模型的努力都集中在在极少的步骤（1-4步）内达到可接受的图像质量，但我们的重点是在达到最佳结果（通常在20步内实现）的同时显著减少运行时间。ILF有效地实现了这种平衡，在利用扩散变压器（DiT）进行类到图像生成和基于DiT的PixArt-alpha和PixArt-sigma进行文本到图像生成时，均表现出强劲的性能。ILF的1.7倍至1.8倍加速质量得到了FID、CLIP分数、CLIP图像质量评估、ImageReward和定性比较的确证。

论文及项目相关链接

PDF submission currently under review; 20 pages, 17 figures, 6 tables

Summary
我们提出了内部循环反馈（ILF）这一新方法，旨在加速扩散模型的推理过程。ILF训练了一个轻量级模块，通过利用给定时间步长下选择的扩散主干块的输出来预测去噪过程中的未来特征。该方法基于两个关键直觉：（1）相邻时间步长下给定块的输出是相似的；（2）执行部分计算步骤比完全跳过步骤对模型的负担更小。ILF方法灵活度高，反馈模块本身可以是扩散主干的一个块，所有设置均被复制。它对扩散正向的影响可以通过从零初始化中学习到的缩放因子进行调节。我们仅训练该模块使用蒸馏损失，但与某些先前的工作不同，我们的模型冻结了主干，只训练反馈模块。虽然许多优化扩散模型的努力都集中在如何在极少的步骤（1-4步）内达到可接受的图像质量，但我们的重点是在匹配最佳结果（通常在20步内达到）的同时显著减少运行时间。ILF有效地实现了这一平衡，在基于扩散变压器的类到图像生成和基于PixArt-alpha和PixArt-sigma的文本到图像生成中均表现出强大的性能。ILF的速度提升1.7倍至1.8倍的质量得到了FID、CLIP分数、CLIP图像质量评估、ImageReward和定性比较的确证。

Key Takeaways

提出了一种名为内部循环反馈（ILF）的方法，旨在加速扩散模型的推理过程。
ILF利用给定时间步长下的扩散主干块输出来预测未来特征。
该方法基于两个关键直觉：相邻时间步长下的输出相似性，以及部分计算步骤的有效性。
ILF方法具有灵活性，其反馈模块可以是扩散主干的一个块。
仅训练反馈模块使用蒸馏损失，而主干保持冻结。
ILF在匹配最佳结果的同时显著减少运行时间，实现了在扩散模型优化中的有效平衡。

Cool Papers

点此查看论文截图

Orchid: Image Latent Diffusion for Joint Appearance and Geometry Generation

Authors:Akshay Krishnan, Xinchen Yan, Vincent Casser, Abhijit Kundu

Diffusion models are state-of-the-art for image generation. Trained on large datasets, they capture expressive image priors that have been used for tasks like inpainting, depth, and (surface) normal prediction. However, these models are typically trained for one specific task, e.g., a separate model for each of color, depth, and normal prediction. Such models do not leverage the intrinsic correlation between appearance and geometry, often leading to inconsistent predictions. In this paper, we propose using a novel image diffusion prior that jointly encodes appearance and geometry. We introduce a diffusion model Orchid, comprising a Variational Autoencoder (VAE) to encode color, depth, and surface normals to a latent space, and a Latent Diffusion Model (LDM) for generating these joint latents. Orchid directly generates photo-realistic color images, relative depth, and surface normals from user-provided text, and can be used to create image-aligned partial 3D scenes seamlessly. It can also perform image-conditioned tasks like joint monocular depth and normal prediction and is competitive in accuracy to state-of-the-art methods designed for those tasks alone. Lastly, our model learns a joint prior that can be used zero-shot as a regularizer for many inverse problems that entangle appearance and geometry. For example, we demonstrate its effectiveness in color-depth-normal inpainting, showcasing its applicability to problems in 3D generation from sparse views.

扩散模型是图像生成领域的最前沿技术。它们经过大规模数据集训练，能够捕捉图像的表达先验，被用于图像修复、深度估计和（表面）法线预测等任务。然而，这些模型通常针对特定任务进行训练，例如针对颜色、深度和法线预测等任务分别使用不同的模型。这种模型没有利用外观和几何之间的内在关联，常常导致预测结果不一致。在本文中，我们提出了一种新的图像扩散先验，能够同时编码外观和几何信息。我们引入了一种扩散模型——Orchid，它包含一个变分自编码器（VAE），用于将颜色、深度和表面法线编码到潜在空间，以及一个潜在扩散模型（LDM）来生成这些联合潜在变量。Orchid可以根据用户提供的文本直接生成逼真的彩色图像、相对深度图和表面法线图，可用于创建无缝的图像对齐部分3D场景。它还可以执行图像条件任务，如联合单目深度估计和法线预测，并且在准确性方面与针对这些任务单独设计的最先进的方法相竞争。最后，我们的模型学习了一个联合先验，可以零样本作为许多纠缠外观和几何的逆问题的正则化器。例如，我们通过彩色深度法线图像修复任务展示了其应用于3D生成的稀疏视图问题的有效性。

论文及项目相关链接

PDF Project webpage: https://orchid3d.github.io

Summary

扩散模型是目前图像生成的最先进方法。通过大型数据集的训练，它们捕捉了生动的图像先验知识，可用于图像补全、深度预测和表面法线预测等任务。然而，这些模型通常针对单一任务进行训练，导致颜色、深度和法线预测等任务需要单独的模型处理。这些模型未能充分利用外观与几何之间的内在关联，导致预测结果的不一致性。本文提出了一种新型图像扩散先验，能联合编码外观和几何信息。提出的Orchid模型包括一个用于编码颜色、深度和表面法线的变分自编码器（VAE）和一个用于生成这些联合潜在特征的潜在扩散模型（LDM）。Orchid可根据用户提供的文本直接生成逼真的彩色图像、相对深度及表面法线，并可用于创建无缝的图像对齐的局部三维场景。此外，它还能执行图像条件下的联合单眼深度和法线预测任务，并能在准确性方面与针对这些任务单独设计的最新方法相竞争。最后，该模型学习了一个联合先验，可零样本用作许多纠缠外观和几何的反问题的正则化器。例如，我们在彩色深度法线补全任务中展示了其有效性，展示了其在从稀疏视图生成三维问题中的适用性。

Key Takeaways

扩散模型用于图像生成效果卓越，但面临任务特定训练的问题，缺乏跨任务一致性。
论文提出了Orchid模型，结合变分自编码器（VAE）和潜在扩散模型（LDM），联合编码颜色和几何信息。
Orchid能直接从文本生成逼真的彩色图像、相对深度及表面法线，支持创建无缝的图像对齐的局部三维场景。
该模型在图像条件下的联合单眼深度和法线预测任务中具有竞争力。
模型学习了一个联合先验，可作为多种反问题的正则化器，适用于彩色深度法线补全等任务。
Orchid模型展示了在从稀疏视图生成三维问题中的潜力。

Cool Papers

点此查看论文截图

3D Object Manipulation in a Single Image using Generative Models

Authors:Ruisi Zhao, Zechuan Zhang, Zongxin Yang, Yi Yang

Object manipulation in images aims to not only edit the object’s presentation but also gift objects with motion. Previous methods encountered challenges in concurrently handling static editing and dynamic generation, while also struggling to achieve fidelity in object appearance and scene lighting. In this work, we introduce \textbf{OMG3D}, a novel framework that integrates the precise geometric control with the generative power of diffusion models, thus achieving significant enhancements in visual performance. Our framework first converts 2D objects into 3D, enabling user-directed modifications and lifelike motions at the geometric level. To address texture realism, we propose CustomRefiner, a texture refinement module that pre-train a customized diffusion model, aligning the details and style of coarse renderings of 3D rough model with the original image, further refine the texture. Additionally, we introduce IllumiCombiner, a lighting processing module that estimates and corrects background lighting to match human visual perception, resulting in more realistic shadow effects. Extensive experiments demonstrate the outstanding visual performance of our approach in both static and dynamic scenarios. Remarkably, all these steps can be done using one NVIDIA 3090. Project page is at https://whalesong-zrs.github.io/OMG3D-projectpage/

图像中的物体操作旨在不仅编辑物体的呈现，还赋予物体动态。之前的方法在处理静态编辑和动态生成时遇到了挑战，同时在实现物体外观和场景光照的保真度方面也遇到了困难。在这项工作中，我们介绍了\textbf{OMG3D}，这是一个新型框架，它将精确的几何控制与扩散模型的生成能力相结合，从而在视觉性能上实现了重大改进。我们的框架首先将2D对象转换为3D，从而在几何级别实现用户导向的修改和逼真的动作。为了解决纹理逼真度的问题，我们提出了CustomRefiner，这是一种纹理优化模块，它预先训练了一个定制的扩散模型，将3D粗糙模型的粗糙渲染的细节和风格与原始图像对齐，进一步优化纹理。此外，我们还介绍了IllumiCombiner，这是一种光照处理模块，它估计并纠正背景光照，以匹配人类视觉感知，从而产生更逼真的阴影效果。大量实验证明了我们方法在静态和动态场景中的出色视觉性能。值得一提的是，所有这些步骤都可以使用一台NVIDIA 3090完成。项目页面是https://whalesong-zrs.github.io/OMG3D-projectpage/。

论文及项目相关链接

PDF

Summary

本文介绍了一种名为OMG3D的新型框架，它将精确的几何控制与扩散模型的生成能力相结合，实现了物体操作图像中的静态编辑和动态生成的显著改善，并在视觉性能上取得了重大提升。该框架通过转换2D对象为3D对象，使用户可以在几何级别进行指导性修改，并赋予物体逼真的动态效果。为解决纹理真实性问题，提出了CustomRefiner纹理优化模块；为解决光照问题，引入了IllumiCombiner光照处理模块。实验证明，该方法在静态和动态场景中均表现出卓越视觉性能。

Key Takeaways

1.OMG3D框架结合了精确的几何控制与扩散模型的生成能力。
2.OMG3D能实现2D对象到3D对象的转换，允许用户在几何级别进行指导性修改，并赋予物体动态效果。
3.CustomRefiner模块用于优化纹理，提高渲染细节和风格与原始图像的一致性。
4.IllumiCombiner模块用于处理光照，估计和校正背景光照，以匹配人类视觉感知，实现更逼真的阴影效果。
5.该框架在静态和动态场景中都表现出了卓越的视觉性能。
6.所有步骤都在一个NVIDIA 3090上完成。

Cool Papers

点此查看论文截图

AMM-Diff: Adaptive Multi-Modality Diffusion Network for Missing Modality Imputation

Authors:Aghiles Kebaili, Jérôme Lapuyade-Lahorgue, Pierre Vera, Su Ruan

In clinical practice, full imaging is not always feasible, often due to complex acquisition protocols, stringent privacy regulations, or specific clinical needs. However, missing MR modalities pose significant challenges for tasks like brain tumor segmentation, especially in deep learning-based segmentation, as each modality provides complementary information crucial for improving accuracy. A promising solution is missing data imputation, where absent modalities are generated from available ones. While generative models have been widely used for this purpose, most state-of-the-art approaches are limited to single or dual target translations, lacking the adaptability to generate missing modalities based on varying input configurations. To address this, we propose an Adaptive Multi-Modality Diffusion Network (AMM-Diff), a novel diffusion-based generative model capable of handling any number of input modalities and generating the missing ones. We designed an Image-Frequency Fusion Network (IFFN) that learns a unified feature representation through a self-supervised pretext task across the full input modalities and their selected high-frequency Fourier components. The proposed diffusion model leverages this representation, encapsulating prior knowledge of the complete modalities, and combines it with an adaptive reconstruction strategy to achieve missing modality completion. Experimental results on the BraTS 2021 dataset demonstrate the effectiveness of our approach.

在临床实践中，全面成像并不总是可行的，这往往是由于复杂的采集协议、严格的隐私规定或特定的临床需求。然而，缺失的MR模式对于如脑肿瘤分割等任务提出了重大挑战，特别是在基于深度学习的分割中，因为每种模式都提供了对提高准确性至关重要的补充信息。一种可行的解决方案是数据插值填补缺失部分，其中缺失的模式由存在的模式生成。虽然生成模型已被广泛用于此目的，但大多数最先进的方法仅限于单个或双目标翻译，缺乏根据不同输入配置生成缺失模式的适应性。为了解决这个问题，我们提出了一种自适应多模态扩散网络（AMM-Diff），这是一种基于扩散的新型生成模型，能够处理任何数量的输入模式并生成缺失的模式。我们设计了一个图像频率融合网络（IFFN），它通过全输入模式及其选定的高频傅里叶分量的自监督前期任务来学习统一的特征表示。所提出的扩散模型利用这种表示，封装了完整模式的先验知识，并结合自适应重建策略来实现缺失模式的完成。在BraTS 2021数据集上的实验结果证明了我们的方法的有效性。

论文及项目相关链接

PDF

Summary

本文主要探讨了在临床实践中，由于复杂的采集协议、严格的隐私规定或特定的临床需求，全面成像并不总是可行的。缺失的MR模式态对脑肿瘤分割等任务提出了重大挑战，特别是在基于深度学习的方法中，因为每个模态都提供了改善准确性的重要补充信息。一种有前途的解决方案是缺失数据插补，其中缺失的模态是从现有的模态生成的。虽然生成模型已被广泛用于此目的，但大多数最先进的方法仅限于单一或双重目标翻译，缺乏适应不同输入配置生成缺失模态的能力。为了解决这个问题，本文提出了一种自适应多模态扩散网络（AMM-Diff），这是一种新型的基于扩散的生成模型，可以处理任何数量的输入模态并生成缺失的模态。设计了一种图像频率融合网络（IFFN），它通过全输入模态及其选定的高频傅立叶分量的自监督预文本任务学习统一特征表示。所提出的扩散模型利用这种表示，包含完整模态的先验知识，并结合自适应重建策略来实现缺失模态的完成。在BraTS 2021数据集上的实验结果证明了该方法的有效性。

Key Takeaways

临床实践中全面成像不总是可行，缺失MR模态对任务如脑肿瘤分割带来挑战。
缺失数据插补是一种解决方案，其中缺失模态从现有模态生成。
现有生成模型方法受限于单一或双重目标翻译，缺乏适应不同输入配置的能力。
提出了一种新型自适应多模态扩散网络（AMM-Diff），能处理任意数量的输入模态并生成缺失模态。
设计了图像频率融合网络（IFFN）以学习统一特征表示，通过自监督预文本任务融合全输入模态及其高频傅立叶分量。
扩散模型利用包含完整模态先验知识的表示，结合自适应重建策略实现缺失模态完成。

Cool Papers

点此查看论文截图

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Authors:Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao

Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under https://github.com/adwardlee/t2i_safety.

文本到图像（T2I）模型已得到迅速发展，能够在各个领域根据文本提示生成高质量图像。然而，这些模型引发了显著的安全问题，包括生成有害、偏见或私人内容的潜在风险。当前对于T2I安全性的评估研究仍处于初级阶段。尽管已经有一些努力在某些特定的安全维度上评估模型，但还有许多关键风险尚未探索。为了解决这一差距，我们推出了T2ISafety安全基准，用于在毒性、公平性和偏见这三个关键领域评估T2I模型。我们在这三个领域的基础上建立了包含三个层级任务（总计四级任务分类）的详细层次结构，并精心收集了相应的提示语共七万条。基于这种分类和提示集，我们建立了一个大规模的T2I数据集（共六十八万张人工注释的图像），并训练出一种能够检测出以前未识别出重大风险的评估器。甚至能检测到如GPT等大型专有模型未能察觉的风险。我们对十二个主流扩散模型进行了T2ISafety评估，并揭示了包括种族公平问题持续存在、倾向于生成有毒内容以及隐私保护在模型间存在显著差异等担忧，即使采用概念消除等防御方法也是如此。数据和评估器已在https://github.com/adwardlee/t2i_safety上发布。

论文及项目相关链接

PDF

Summary

文本转图像（T2I）模型在安全领域存在风险隐患，如可能生成有害、有偏见或涉及隐私的内容。针对这一问题，研究团队推出了T2ISafety安全基准评估系统，针对毒性、公平性和偏见三个关键领域进行细致评估。他们建立了包含12项任务和44个类别的详细层次结构，并收集了7万条相应提示。基于这一分类和提示集，他们建立了一个大规模的T2I数据集，包含6.8万张手动标注的图像，并训练了一个能够检测出先前工作未能识别的关键风险的评估器。他们对12个主流的扩散模型进行了T2ISafety评估，并揭示了包括种族公平问题、生成有毒内容倾向以及模型间隐私保护显著差异等隐患。

Key Takeaways

T2I模型在安全领域存在风险，可能生成有害、有偏见或涉及隐私的内容。
研究团队推出了T2ISafety安全基准评估系统，针对毒性、公平性和偏见三大领域进行评估。
T2ISafety建立了包含12项任务和44个类别的详细层次结构，并收集了7万条提示用于评估。
研究团队建立了一个大规模的T2I数据集，包含6.8万张手动标注的图像。
训练了一个评估器，可检测出先前未识别的关键风险。
对12个主流的扩散模型进行了T2ISafety评估，发现存在种族公平问题、生成有毒内容倾向等隐患。

Cool Papers

点此查看论文截图

Image Motion Blur Removal in the Temporal Dimension with Video Diffusion Models

Authors:Wang Pang, Zhihao Zhan, Xiang Zhu, Yechao Bai

Most motion deblurring algorithms rely on spatial-domain convolution models, which struggle with the complex, non-linear blur arising from camera shake and object motion. In contrast, we propose a novel single-image deblurring approach that treats motion blur as a temporal averaging phenomenon. Our core innovation lies in leveraging a pre-trained video diffusion transformer model to capture diverse motion dynamics within a latent space. It sidesteps explicit kernel estimation and effectively accommodates diverse motion patterns. We implement the algorithm within a diffusion-based inverse problem framework. Empirical results on synthetic and real-world datasets demonstrate that our method outperforms existing techniques in deblurring complex motion blur scenarios. This work paves the way for utilizing powerful video diffusion models to address single-image deblurring challenges.

大部分的运动去模糊算法都依赖于空间域卷积模型，这些模型在处理由相机抖动和物体运动引起的复杂、非线性模糊时显得捉襟见肘。与此相反，我们提出了一种新型的单图像去模糊方法，它将运动模糊视为一种时间平均现象。我们的核心创新在于利用预训练的视频扩散transformer模型，在潜在空间内捕捉各种运动动态。它避开了明确的核估计，有效地适应了各种运动模式。我们在基于扩散的逆向问题框架内实现了该算法。在合成和真实世界数据集上的经验结果表明，我们的方法在去除复杂运动模糊场景方面的表现优于现有技术。这项工作为利用强大的视频扩散模型解决单图像去模糊挑战铺平了道路。

论文及项目相关链接

PDF

Summary
针对传统运动去模糊算法在应对由相机抖动和物体运动引起的复杂非线性模糊时的局限性，我们提出了一种新的单图像去模糊方法。该方法将运动模糊视为时间平均现象，并利用预训练的扩散模型视频扩散转换器模型在潜在空间内捕捉各种运动动态。该方法避免了明确的核估计，并有效地适应了各种运动模式。实验结果证明，在合成和真实数据集上，我们的方法在复杂的运动模糊场景的去模糊技术上优于现有技术。

Key Takeaways

当前大多数运动去模糊算法主要依赖空间域卷积模型，难以处理复杂的非线性模糊。
我们提出了一种新的单图像去模糊方法，将运动模糊视为时间平均现象。
方法的核心创新在于利用预训练的扩散模型视频扩散转换器模型。
该方法能够在潜在空间内捕捉各种运动动态，避免了明确的核估计。
方法可以适应多种运动模式，对复杂的运动模糊场景有很好的去模糊效果。
实验结果证明，该方法在合成和真实数据集上的性能优于现有技术。

Cool Papers

点此查看论文截图

Ensemble score filter with image inpainting for data assimilation in tracking surface quasi-geostrophic dynamics with partial observations

Authors:Siming Liang, Hoang Tran, Feng Bao, Hristo G. Chipilski, Peter Jan van Leeuwen, Guannan Zhang

Data assimilation plays a pivotal role in understanding and predicting turbulent systems within geoscience and weather forecasting, where data assimilation is used to address three fundamental challenges, i.e., high-dimensionality, nonlinearity, and partial observations. Recent advances in machine learning (ML)-based data assimilation methods have demonstrated encouraging results. In this work, we develop an ensemble score filter (EnSF) that integrates image inpainting to solve the data assimilation problems with partial observations. The EnSF method exploits an exclusively designed training-free diffusion models to solve high-dimensional nonlinear data assimilation problems. Its performance has been successfully demonstrated in the context of having full observations, i.e., all the state variables are directly or indirectly observed. However, because the EnSF does not use a covariance matrix to capture the dependence between the observed and unobserved state variables, it is nontrivial to extend the original EnSF method to the partial observation scenario. In this work, we incorporate various image inpainting techniques into the EnSF to predict the unobserved states during data assimilation. At each filtering step, we first use the diffusion model to estimate the observed states by integrating the likelihood information into the score function. Then, we use image inpainting methods to predict the unobserved state variables. We demonstrate the performance of the EnSF with inpainting by tracking the Surface Quasi-Geostrophic (SQG) model dynamics under a variety of scenarios. The successful proof of concept paves the way to more in-depth investigations on exploiting modern image inpainting techniques to advance data assimilation methodology for practical geoscience and weather forecasting problems.

数据同化在地球科学和天气预报中理解和预测湍流系统方面起着至关重要的作用，数据同化用于应对三个基本挑战，即高维性、非线性和部分观测。基于机器学习的数据同化方法的最新进展已经取得了令人鼓舞的结果。在这项工作中，我们开发了一种集成图像修复技术的一体化评分过滤方法（EnSF），以解决部分观测数据同化问题。EnSF方法利用专门设计的无训练扩散模型来解决高维非线性数据同化问题。在具有全观测的情境中，即所有状态变量直接或间接观测的情况下，其性能已经得到了成功验证。然而，由于EnSF没有使用协方差矩阵来捕捉观测状态变量和未观测状态变量之间的依赖关系，因此将原始EnSF方法扩展到部分观测情景并不简单。在这项工作中，我们将各种图像修复技术融入EnSF中，以在数据同化过程中预测未观测状态。在每个过滤步骤中，我们首先使用扩散模型通过整合可能性信息来估计观测状态的值。然后，我们使用图像修复方法来预测未观测状态变量。我们通过跟踪各种场景下的表面准地转模型动力学来展示带有修复功能的EnSF的性能。成功的概念验证为利用现代图像修复技术进一步推动数据同化方法在地球科学和天气预报问题中的应用开辟了道路。

论文及项目相关链接

PDF

摘要
本研究开发了一种集成图像修复技术的集合评分过滤器（EnSF），以解决具有部分观测的数据同化问题。EnSF利用专门设计的无需训练扩散模型解决高维非线性数据同化问题。研究成功证明了其在全观测环境下的性能，但面对部分观测情景时面临挑战。本研究将各种图像修复技术融入EnSF中，用于预测数据同化中的未观测状态。每一步过滤都先利用扩散模型结合似然信息估计观测状态，然后使用图像修复方法预测未观测状态变量。通过对Surface Quasi-Geostrophic（SQG）模型动态进行追踪，证明了EnSF与图像修复相结合的性能。这为利用现代图像修复技术推动数据同化方法在地球科学和天气预报问题中的实际应用铺平了道路。

关键见解

数据同化在理解和预测地球科学和天气预报中的湍流系统方面起着至关重要的作用，解决了高维、非线性和部分观测三大挑战。
基于机器学习的数据同化方法取得了令人鼓舞的结果。
集合评分过滤器（EnSF）结合了图像修复技术以解决具有部分观测的数据同化问题。
EnSF利用无需训练的扩散模型解决高维非线性问题。
EnSF在全观测环境下表现出成功的性能。
面对部分观测情景的挑战，EnSF通过融入图像修复技术预测未观测状态。
通过追踪SQG模型动态证明了EnSF与图像修复结合的性能，这为未来在地球科学和天气预报中实际应用提供了基础。

Cool Papers

点此查看论文截图

GPS as a Control Signal for Image Generation

Authors:Chao Feng, Ziyang Chen, Aleksander Holynski, Alexei A. Efros, Andrew Owens

We show that the GPS tags contained in photo metadata provide a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city. In particular, we train a diffusion model to generate images conditioned on both GPS and text. The learned model generates images that capture the distinctive appearance of different neighborhoods, parks, and landmarks. We also extract 3D models from 2D GPS-to-image models through score distillation sampling, using GPS conditioning to constrain the appearance of the reconstruction from each viewpoint. Our evaluations suggest that our GPS-conditioned models successfully learn to generate images that vary based on location, and that GPS conditioning improves estimated 3D structure.

我们证明，照片元数据中包含的GPS标签为图像生成提供了有用的控制信号。我们训练GPS到图像的模型，并将它们用于需要精细理解图像如何在城市内部变化的任务。特别是，我们训练了一个扩散模型，以根据GPS和文本生成图像。学习到的模型生成能够捕捉不同街区、公园和地标独特外观的图像。我们还通过评分蒸馏采样从二维GPS到图像的模型中提取三维模型，使用GPS条件来约束每个视点的重建外观。我们的评估表明，我们的GPS条件模型成功学习根据位置生成图像，并且GPS条件改进了估计的3D结构。

Summary
本文展示了照片元数据中的GPS标签对于图像生成具有有用的控制信号。研究团队训练了GPS到图像的模型，并用于需要精细理解城市内图像如何变化的任务。特别是，他们训练了一个扩散模型，该模型可以根据GPS和文本生成图像。该模型能够捕捉不同街区、公园和地标的独特外观。此外，研究团队通过评分蒸馏采样从二维GPS到图像模型中提取三维模型，利用GPS条件约束从每个视角进行重建的外观。评估表明，GPS条件模型成功学习根据位置生成图像，并且GPS条件改进了估计的三维结构。

Key Takeaways

GPS标签在图像生成中提供有用的控制信号。
通过训练GPS到图像的模型，实现了对城市内图像变化的精细理解。
扩散模型可以根据GPS和文本生成图像，能捕捉不同地点的独特外观。
通过评分蒸馏采样从二维GPS到图像模型中提取三维模型。
GPS条件用于约束从各个视角的重建外观。
GPS条件模型能够成功根据位置生成图像。
GPS条件改进了估计的三维结构。

Cool Papers

点此查看论文截图

A Survey on Diffusion Models for Anomaly Detection

Authors:Jing Liu, Zhenchao Ma, Zepu Wang, Yang Liu, Zehua Wang, Peng Sun, Liang Song, Bo Hu, Azzedine Boukerche, Victor C. M. Leung

Diffusion models (DMs) have emerged as a powerful class of generative AI models, showing remarkable potential in anomaly detection (AD) tasks across various domains, such as cybersecurity, fraud detection, healthcare, and manufacturing. The intersection of these two fields, termed diffusion models for anomaly detection (DMAD), offers promising solutions for identifying deviations in increasingly complex and high-dimensional data. In this survey, we review recent advances in DMAD research. We begin by presenting the fundamental concepts of AD and DMs, followed by a comprehensive analysis of classic DM architectures including DDPMs, DDIMs, and Score SDEs. We further categorize existing DMAD methods into reconstruction-based, density-based, and hybrid approaches, providing detailed examinations of their methodological innovations. We also explore the diverse tasks across different data modalities, encompassing image, time series, video, and multimodal data analysis. Furthermore, we discuss critical challenges and emerging research directions, including computational efficiency, model interpretability, robustness enhancement, edge-cloud collaboration, and integration with large language models. The collection of DMAD research papers and resources is available at https://github.com/fdjingliu/DMAD.

扩散模型（DMs）作为一类强大的生成人工智能模型已经崭露头角，在异常检测（AD）任务中显示出显著潜力，广泛应用于网络安全、欺诈检测、医疗保健和制造等多个领域。这两个领域的交集，被称为用于异常检测的扩散模型（DMAD），为识别日益复杂和高维数据中的偏差提供了有前景的解决方案。在这篇综述中，我们回顾了DMAD研究的最新进展。首先介绍AD和DM的基本概念，然后全面分析经典的DM架构，包括DDPMs、DDIMS和Score SDEs。我们进一步将现有的DMAD方法分为基于重建的、基于密度的和混合方法，并对其方法创新进行详细检查。我们还探讨了不同数据模态的多样化任务，包括图像、时间序列、视频和多模态数据分析。此外，我们还讨论了关键的挑战和新兴的研究方向，包括计算效率、模型可解释性、鲁棒性增强、边缘云协作以及与大型语言模型的集成。DMAD研究论文和资源集可在https://github.com/fdjingliu/DMAD找到。

论文及项目相关链接

PDF

Summary
扩散模型（DMs）作为生成式人工智能模型的新兴强大类别，在异常检测（AD）任务中展现出巨大潜力，广泛应用于网络安全、欺诈检测、医疗保健和制造等领域。本文综述了扩散模型在异常检测方面的最新研究进展，介绍了异常检测和扩散模型的基本概念，分析了经典的扩散模型架构，包括DDPMs、DDIIMs和Score SDEs，并将现有的扩散模型异常检测方法分为重建型、密度型和混合型方法，详细探讨了其方法创新。此外，还介绍了不同数据模态的任务，包括图像、时间序列、视频和多模态数据分析。文章还讨论了计算效率、模型可解释性、鲁棒性增强、边缘云协作以及与大型语言模型的集成等关键挑战和新兴研究方向。相关资源可访问：https://github.com/fdjingliu/DMAD。

Key Takeaways

扩散模型（DMs）在异常检测（AD）任务中展现出显著潜力，应用范围广泛。
文章综述了DMs在异常检测方面的最新研究进展。
介绍了异常检测和扩散模型的基本概念。
分析了经典的扩散模型架构，包括DDPMs、DDIIMs和Score SDEs。
将现有的扩散模型异常检测方法分为重建型、密度型和混合型方法。
介绍了不同数据模态的异常检测任务。

Cool Papers

点此查看论文截图

Authors:Shao-Hao Lu, Ren Wang, Ching-Chun Huang, Wei-Chen Chiu

Recently, diffusion-based blind super-resolution (SR) methods have shown great ability to generate high-resolution images with abundant high-frequency detail, but the detail is often achieved at the expense of fidelity. Meanwhile, another line of research focusing on rectifying the reverse process of diffusion models (i.e., diffusion guidance), has demonstrated the power to generate high-fidelity results for non-blind SR. However, these methods rely on known degradation kernels, making them difficult to apply to blind SR. To address these issues, we present DADiff in this paper. DADiff incorporates degradation-aware models into the diffusion guidance framework, eliminating the need to know degradation kernels. Additionally, we propose two novel techniques – input perturbation and guidance scalar – to further improve our performance. Extensive experimental results show that our proposed method has superior performance over state-of-the-art methods on blind SR benchmarks.

最近，基于扩散的盲超分辨率（SR）方法显示出生成富含高频细节的高分辨率图像的强大能力，但往往以保真度的损失为代价来实现这些细节。与此同时，另一条研究聚焦于纠正扩散模型的逆向过程（即扩散引导），已显示出为非盲SR生成高保真结果的能力。然而，这些方法依赖于已知的退化核，使得它们难以应用于盲SR。为了解决这些问题，我们在本文中提出了DADiff。DADiff将退化感知模型纳入扩散引导框架，无需了解退化核。此外，我们提出了两种新技术——输入扰动和引导标量——来进一步提高我们的性能。大量的实验结果表明，我们在盲SR基准测试上的表现超过了最新技术的方法。

论文及项目相关链接

PDF To appear in WACV 2025. Code is available at: https://github.com/ryanlu2240/DADiff

Summary

本文提出了一种基于扩散模型的盲超分辨率（SR）方法——DADiff。该方法结合了感知退化模型与扩散引导框架，无需知道退化核信息。同时，通过引入输入扰动和引导标量两项新技术，进一步提升了性能。实验表明，在盲SR基准测试中，该方法性能卓越，超越了现有技术。

Key Takeaways

DADiff结合感知退化模型与扩散引导框架，解决扩散模型在盲超分辨率（SR）中的性能问题。
无需知道退化核信息，扩大了应用范围。
引入输入扰动技术，提升模型性能。
采用引导标量技术，进一步提高效果。
在盲SR基准测试中表现卓越。
对比现有技术，DADiff具有更高的性能。

Cool Papers

点此查看论文截图

Deep Geometric Moments Promote Shape Consistency in Text-to-3D Generation

Authors:Utkarsh Nath, Rajeev Goel, Eun Som Jeon, Changhoon Kim, Kyle Min, Yezhou Yang, Yingzhen Yang, Pavan Turaga

To address the data scarcity associated with 3D assets, 2D-lifting techniques such as Score Distillation Sampling (SDS) have become a widely adopted practice in text-to-3D generation pipelines. However, the diffusion models used in these techniques are prone to viewpoint bias and thus lead to geometric inconsistencies such as the Janus problem. To counter this, we introduce MT3D, a text-to-3D generative model that leverages a high-fidelity 3D object to overcome viewpoint bias and explicitly infuse geometric understanding into the generation pipeline. Firstly, we employ depth maps derived from a high-quality 3D model as control signals to guarantee that the generated 2D images preserve the fundamental shape and structure, thereby reducing the inherent viewpoint bias. Next, we utilize deep geometric moments to ensure geometric consistency in the 3D representation explicitly. By incorporating geometric details from a 3D asset, MT3D enables the creation of diverse and geometrically consistent objects, thereby improving the quality and usability of our 3D representations. Project page and code: https://moment-3d.github.io/

针对与3D资产相关的数据稀缺问题，如分数蒸馏采样（SDS）的2D提升技术已成为文本到3D生成管道中广泛采用的实践。然而，这些技术中使用的扩散模型容易出现视点偏差，从而导致几何不一致性问题，例如 Janus 问题。为了解决这个问题，我们引入了MT3D，这是一个文本到3D的生成模型，它利用高保真3D对象来克服视点偏差，并将几何理解明确融入生成管道。首先，我们采用从高质量3D模型派生的深度图作为控制信号，以保证生成的2D图像保持基本形状和结构，从而减少固有的视点偏差。其次，我们利用深层几何矩来确保3D表示中的几何一致性。通过融入3D资产的几何细节，MT3D能够创建多样且几何一致的物体，从而提高我们3D表示的质量和可用性。项目页面和代码：https://moment-3d.github.io/

论文及项目相关链接

PDF This paper has been accepted to WACV 2025

Summary

文本介绍了为解决文本到三维生成中的视角偏差问题，提出了一个名为MT3D的文本到三维生成模型。该模型采用高质量的三维模型深度图作为控制信号，确保了生成的二维图像保持了基本的形状和结构，从而减少了固有的视角偏差。同时，模型利用深度几何时刻确保三维表示中的几何一致性。通过从三维资产中融入几何细节，MT3D能够创建多样且几何一致的物体，从而提高三维表示的质量和可用性。

Key Takeaways

数据稀缺是文本到三维资产生成的一大挑战。因此，二维提升技术（如分数蒸馏采样）广泛应用于文本到三维生成的流程中。但此方法涉及的数据视角偏差可能导致几何不一致问题。
新模型MT3D采用高质量的三维模型深度图作为控制信号来生成二维图像，以减少视角偏差并保持物体的基本形状和结构。这一创新策略提高了生成的图像质量。
MT3D利用深度几何时刻确保三维表示的几何一致性，进一步增强了模型的准确性和可靠性。这一特点使得生成的物体更加真实和多样。
通过融入三维资产的几何细节，MT3D可以创建既多样又几何一致的物体，大大提高了三维表示的质量和可用性。这是该模型在文本到三维生成领域的重大突破。
新模型克服了传统的几何不一致问题（如Janus问题），为后续的三维生成研究提供了宝贵的参考和启示。项目页面和代码可供查阅，便于进一步研究和应用。
MT3D模型展示了其在文本到三维生成领域的潜力，为未来的应用场景提供了广阔的可能性。随着技术的进一步发展，该模型有望在各种领域中发挥更大的作用。

Cool Papers

点此查看论文截图

VisMin: Visual Minimal-Change Understanding

Authors:Rabiul Awal, Saba Ahmadi, Le Zhang, Aishwarya Agrawal

Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs’ capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models’ understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP’s general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.

对于视觉语言模型（VLMs）而言，对物体、属性以及物体间关系的精细理解是至关重要的。现有的基准测试主要侧重于评估VLMs在给定图像的情况下区分两个非常相似的标题的能力。在本文中，我们引入了一个新的具有挑战性的基准测试，称为视觉最小变化理解（VisMin），该测试要求模型在给定的两个图像和两个标题中预测正确的图像-标题匹配。图像对和标题对中包含了最小的变化，即每次只有一个方面的变化，包括：物体、属性、数量和空间关系。这些变化测试了模型对物体、属性（如颜色、材质、形状）、数量和物体之间空间关系的理解。我们使用大型语言模型和扩散模型建立了一个自动框架，随后经过人工注释者进行的严格四步验证过程。经验实验表明，目前的VLM在理解空间关系和计数能力方面存在明显的缺陷。我们还生成了一个大规模的训练数据集来微调CLIP和Idefics2，在基准测试和CLIP的通用图像文本对齐方面显示出对精细理解的显著改善。我们发布的所有资源，包括基准测试、训练数据和微调后的模型检查点，都可以在https://vismin.net/找到。

论文及项目相关链接

PDF Accepted at NeurIPS 2024. Project URL at https://vismin.net/

Summary

本文介绍了一种新的视觉语言模型评估基准——Visual Minimal-Change Understanding（VisMin）。该基准要求模型预测给定两个图像和两个仅在一个方面发生变化的描述时，正确匹配图像和描述。通过这种方式，VisMin评估模型对对象、属性（如颜色、材质、形状）、数量和对象间空间关系的理解。采用大型语言模型和扩散模型构建自动框架，并通过人类注释者进行严格的四步验证过程。实验表明，当前视觉语言模型在理解和计数能力方面存在明显缺陷。通过微调CLIP和Idefics2的训练数据集，在基准测试中显示出精细理解的显著改善。所有资源，包括基准测试、训练数据和微调模型检查点，均已发布在https://vismin.net/上。

Key Takeaways

视觉语言模型（VLMs）需要对对象、属性和对象间关系的精细理解。
提出了一个新的评估基准——Visual Minimal-Change Understanding（VisMin），以评估模型对微小变化的敏感度。
VisMin要求模型预测给定两个图像和两个描述中只有一个方面发生变化时的正确匹配。
VisMin测试模型在对象、属性（如颜色、材质、形状）、数量和空间关系方面的理解。
采用大型语言模型和扩散模型构建自动评估框架，并通过人工验证确保准确性。
现有VLM在理解和计数能力方面存在缺陷。

Cool Papers

点此查看论文截图

MD-Dose: A diffusion model based on the Mamba for radiation dose prediction

Authors:Linjie Fu, Xia Li, Xiuding Cai, Yingkai Wang, Xueyao Wang, Yali Shen, Yu Yao

Radiation therapy is crucial in cancer treatment. Experienced experts typically iteratively generate high-quality dose distribution maps, forming the basis for excellent radiation therapy plans. Therefore, automated prediction of dose distribution maps is significant in expediting the treatment process and providing a better starting point for developing radiation therapy plans. With the remarkable results of diffusion models in predicting high-frequency regions of dose distribution maps, dose prediction methods based on diffusion models have been extensively studied. However, existing methods mainly utilize CNNs or Transformers as denoising networks. CNNs lack the capture of global receptive fields, resulting in suboptimal prediction performance. Transformers excel in global modeling but face quadratic complexity with image size, resulting in significant computational overhead. To tackle these challenges, we introduce a novel diffusion model, MD-Dose, based on the Mamba architecture for predicting radiation therapy dose distribution in thoracic cancer patients. In the forward process, MD-Dose adds Gaussian noise to dose distribution maps to obtain pure noise images. In the backward process, MD-Dose utilizes a noise predictor based on the Mamba to predict the noise, ultimately outputting the dose distribution maps. Furthermore, We develop a Mamba encoder to extract structural information and integrate it into the noise predictor for localizing dose regions in the planning target volume (PTV) and organs at risk (OARs). Through extensive experiments on a dataset of 300 thoracic tumor patients, we showcase the superiority of MD-Dose in various metrics and time consumption.

放射治疗在癌症治疗中占有至关重要的地位。经验丰富的专家通常会通过迭代生成高质量的剂量分布图，为优秀的放射治疗计划奠定基础。因此，剂量分布图的自动预测在加速治疗过程以及为制定放射治疗计划提供更好的起点方面具有重要意义。扩散模型在预测剂量分布图的高频区域方面取得了显著成果，因此基于扩散模型的剂量预测方法已经得到了广泛的研究。然而，现有方法主要利用卷积神经网络（CNN）或Transformer作为去噪网络。CNN缺乏全局感受野的捕捉，导致预测性能不佳。虽然Transformer擅长全局建模，但随着图像大小的增加，其面临二次复杂性，导致计算开销较大。为了应对这些挑战，我们提出了一种基于Mamba架构的新型扩散模型MD-Dose，用于预测胸部癌症患者的放射治疗剂量分布。在正向过程中，MD-Dose向剂量分布图添加高斯噪声以获得纯噪声图像。在逆向过程中，MD-Dose利用基于Mamba的噪声预测器来预测噪声，并最终输出剂量分布图。此外，我们开发了一个Mamba编码器来提取结构信息，并将其整合到噪声预测器中，以定位计划靶区（PTV）和风险器官（OARs）中的剂量区域。通过对300例胸部肿瘤患者数据集进行的广泛实验，我们展示了MD-Dose在各种指标和时间消耗方面的优越性。

论文及项目相关链接

PDF

Summary：辐射治疗在癌症治疗中至关重要。专家们通过迭代生成高质量剂量分布图，形成优秀的放射治疗计划的基础。因此，自动预测剂量分布图对于加快治疗过程以及为制定放射治疗计划提供更好的起点具有重要意义。基于扩散模型在预测剂量分布图高频区域方面的出色表现，基于扩散模型的剂量预测方法已经得到了广泛的研究。本研究针对现有方法存在的问题，提出了一种基于Mamba架构的新型扩散模型MD-Dose，用于预测胸部癌症患者的放射治疗剂量分布。该模型在正向过程中向剂量分布图添加高斯噪声以获得纯噪声图像，在逆向过程中利用基于Mamba的噪声预测器进行预测，最终输出剂量分布图。此外，还开发了Mamba编码器以提取结构信息并将其整合到噪声预测器中，以定位计划靶区（PTV）和风险器官（OARs）的剂量区域。在300例胸部肿瘤患者数据集上的大量实验表明，MD-Dose在各种指标和时间消耗方面的优越性。

Key Takeaways：