
Diffusion Models


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: do not rely on them in serious academic settings; they are only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

2025-02-22 Update

A Survey on Text-Driven 360-Degree Panorama Generation

Authors:Hai Wang, Xiaoyu Xiang, Weihao Xia, Jing-Hao Xue

The advent of text-driven 360-degree panorama generation, enabling the synthesis of 360-degree panoramic images directly from textual descriptions, marks a transformative advancement in immersive visual content creation. This innovation significantly simplifies the traditionally complex process of producing such content. Recent progress in text-to-image diffusion models has accelerated the rapid development in this emerging field. This survey presents a comprehensive review of text-driven 360-degree panorama generation, offering an in-depth analysis of state-of-the-art algorithms and their expanding applications in 360-degree 3D scene generation. Furthermore, we critically examine current limitations and propose promising directions for future research. A curated project page with relevant resources and research papers is available at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/.


Paper & Project Links

PDF

Summary

Text-driven 360-degree panorama generation simplifies the traditionally complex process of producing such content by synthesizing 360-degree panoramic images directly from textual descriptions, marking a transformative advance in immersive visual content creation. Recent progress in text-to-image diffusion models has driven rapid development in this emerging field. This paper comprehensively reviews text-driven 360-degree panorama generation, analyzes state-of-the-art algorithms and their applications in 360-degree 3D scene generation, and examines current limitations and future research directions. Related resources and research papers can be found at https://littlewhitesea.github.io/Text-Driven-Pano-Gen/.

Key Takeaways

  1. Text-driven 360-degree panorama generation synthesizes 360-degree panoramic images directly from textual descriptions, simplifying the content-production process.
  2. The technique marks a major advance in immersive visual content creation.
  3. Recent progress in text-to-image diffusion models has accelerated the rapid development of this field.
  4. The paper comprehensively reviews text-driven 360-degree panorama generation, including applications of state-of-the-art algorithms.
  5. The paper analyzes the limitations of current techniques and proposes directions for future research.
  6. Related resources and research papers are available on the project page.

Cool Papers

Click here to view paper screenshots

DiffExp: Efficient Exploration in Reward Fine-tuning for Text-to-Image Diffusion Models

Authors:Daewon Chae, June Suk Choi, Jinkyu Kim, Kimin Lee

Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for enhancing model performance. However, reward fine-tuning methods often suffer from slow convergence due to online sample generation. Therefore, obtaining diverse samples with strong reward signals is crucial for improving sample efficiency and overall performance. In this work, we introduce DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. Our approach employs two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to enhance sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. We demonstrate that these strategies significantly enhance exploration during online sample generation, improving the sample efficiency of recent reward fine-tuning methods, such as DDPO and AlignProp.
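
To make the two exploration strategies more concrete, below is a minimal, hypothetical sketch (not the authors' code) of how they might look inside an online sampling loop: the classifier-free guidance scale is drawn at random per rollout, and per-phrase text embeddings are randomly reweighted. The weight ranges, tensor shapes, and function names are assumptions for illustration.

```python
import torch

def guided_noise_with_random_cfg(eps_uncond, eps_cond, scale_range=(3.0, 12.0)):
    """Classifier-free guidance with a randomly drawn scale (range is an assumption).
    Varying the guidance weight across online rollouts is the diversity-oriented
    half of the exploration strategy described in the abstract."""
    w = torch.empty(1).uniform_(*scale_range).item()
    return eps_uncond + w * (eps_cond - eps_uncond)

def randomly_weight_phrases(phrase_embeddings, low=0.5, high=1.5):
    """Randomly rescale per-phrase prompt embeddings (shape [num_phrases, dim]).
    Emphasizing different phrases on different rollouts is the reward-exploitation
    half of the strategy; the weight range is again an assumption."""
    weights = torch.empty(phrase_embeddings.size(0), 1).uniform_(low, high)
    return phrase_embeddings * weights

# Toy usage with random tensors standing in for a real diffusion model's outputs.
eps_uncond, eps_cond = torch.randn(4, 64), torch.randn(4, 64)
guided = guided_noise_with_random_cfg(eps_uncond, eps_cond)
phrases = randomly_weight_phrases(torch.randn(3, 768))   # e.g. embeddings of 3 prompt phrases
```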


Paper & Project Links

PDF AAAI 2025

Summary

Fine-tuning text-to-image diffusion models to maximize rewards has proven effective for improving model performance, but reward fine-tuning methods often converge slowly because of online sample generation, so obtaining diverse samples with strong reward signals is crucial for sample efficiency and overall performance. This work introduces DiffExp, a simple yet effective exploration strategy for reward fine-tuning of text-to-image models. It relies on two key strategies: (a) dynamically adjusting the scale of classifier-free guidance to increase sample diversity, and (b) randomly weighting phrases of the text prompt to exploit high-quality reward signals. These strategies significantly enhance exploration during online sample generation and improve the sample efficiency of recent reward fine-tuning methods such as DDPO and AlignProp.

Key Takeaways

  1. Reward fine-tuning is effective for improving the performance of text-to-image diffusion models.
  2. Reward fine-tuning of text-to-image models converges slowly because of online sample generation.
  3. Obtaining diverse samples with strong reward signals is crucial for improving model performance.
  4. DiffExp is an effective exploration strategy for reward fine-tuning of text-to-image models.
  5. DiffExp increases sample diversity and exploits high-quality reward signals by dynamically adjusting the classifier-free guidance scale and randomly weighting phrases of the text prompt.
  6. DiffExp significantly enhances exploration during online sample generation.
  7. DiffExp improves the sample efficiency of recent reward fine-tuning methods.

Cool Papers

Click here to view paper screenshots

d-Sketch: Improving Visual Fidelity of Sketch-to-Image Translation with Pretrained Latent Diffusion Models without Retraining

Authors:Prasun Roy, Saumik Bhattacharya, Subhankar Ghosh, Umapada Pal, Michael Blumenstein

Structural guidance in an image-to-image translation allows intricate control over the shapes of synthesized images. Generating high-quality realistic images from user-specified rough hand-drawn sketches is one such task that aims to impose a structural constraint on the conditional generation process. While the premise is intriguing for numerous use cases of content creation and academic research, the problem becomes fundamentally challenging due to substantial ambiguities in freehand sketches. Furthermore, balancing the trade-off between shape consistency and realistic generation contributes to additional complexity in the process. Existing approaches based on Generative Adversarial Networks (GANs) generally utilize conditional GANs or GAN inversions, often requiring application-specific data and optimization objectives. The recent introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap for low-level visual attributes in general image synthesis. However, directly retraining a large-scale diffusion model on a domain-specific subtask is often extremely difficult due to demanding computation costs and insufficient data. In this paper, we introduce a technique for sketch-to-image translation by exploiting the feature generalization capabilities of a large-scale diffusion model without retraining. In particular, we use a learnable lightweight mapping network to achieve latent feature translation from source to target domain. Experimental results demonstrate that the proposed method outperforms the existing techniques in qualitative and quantitative benchmarks, allowing high-resolution realistic image synthesis from rough hand-drawn sketches.
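
The key architectural idea here, a learnable lightweight mapping network that translates latent features from the sketch domain to the photo domain while the large diffusion model stays frozen, can be sketched roughly as below. The residual-MLP design, latent sizes, and training objective are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMapper(nn.Module):
    """Hypothetical lightweight mapping network: translates sketch-domain latents
    into photo-domain latents; the frozen diffusion backbone is never touched."""
    def __init__(self, dim=4 * 64 * 64, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, z_sketch):
        flat = z_sketch.flatten(1)
        return (flat + self.net(flat)).view_as(z_sketch)   # residual translation

# Training sketch: only the small mapper receives gradients.
mapper = LatentMapper()
optimizer = torch.optim.Adam(mapper.parameters(), lr=1e-4)
z_sketch = torch.randn(2, 4, 64, 64)   # stand-in for encoded sketches
z_photo = torch.randn(2, 4, 64, 64)    # stand-in for encoded target photos
loss = F.mse_loss(mapper(z_sketch), z_photo)
loss.backward()
optimizer.step()
```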


Paper & Project Links

PDF Accepted in The International Conference on Pattern Recognition (ICPR) 2024

Summary
This paper presents a technique for sketch-to-image translation that exploits the feature generalization capabilities of a large-scale diffusion model without retraining it. A learnable lightweight mapping network translates latent features from the source domain to the target domain, avoiding retraining the large diffusion model on a domain-specific subtask. Experimental results show that the method performs strongly on qualitative and quantitative benchmarks, synthesizing high-resolution realistic images from rough hand-drawn sketches.

Key Takeaways

  1. Structural guidance in image-to-image translation allows fine-grained control over the shapes of synthesized images.
  2. Generating high-quality realistic images from user-specified rough hand-drawn sketches imposes a structural constraint on the conditional generation process.
  3. The task is fundamentally challenging because of the substantial ambiguities in freehand sketches and the trade-off between shape consistency and realistic generation.
  4. Methods based on Generative Adversarial Networks (GANs) typically use conditional GANs or GAN inversion and require application-specific data and optimization objectives.
  5. The introduction of Denoising Diffusion Probabilistic Models (DDPMs) achieves a generational leap in low-level visual attributes for general image synthesis.
  6. Directly retraining a large diffusion model on a domain-specific subtask is hindered by demanding computation costs and insufficient data.

Cool Papers

Click here to view paper screenshots

SigStyle: Signature Style Transfer via Personalized Text-to-Image Models

Authors:Ye Wang, Tongyuan Bai, Xuping Xie, Zili Yi, Yilin Wang, Rui Ma

Style transfer enables the seamless integration of artistic styles from a style image into a content image, resulting in visually striking and aesthetically enriched outputs. Despite numerous advances in this field, existing methods did not explicitly focus on the signature style, which represents the distinct and recognizable visual traits of the image such as geometric and structural patterns, color palettes and brush strokes etc. In this paper, we introduce SigStyle, a framework that leverages the semantic priors that embedded in a personalized text-to-image diffusion model to capture the signature style representation. This style capture process is powered by a hypernetwork that efficiently fine-tunes the diffusion model for any given single style image. Style transfer then is conceptualized as the reconstruction process of content image through learned style tokens from the personalized diffusion model. Additionally, to ensure the content consistency throughout the style transfer process, we introduce a time-aware attention swapping technique that incorporates content information from the original image into the early denoising steps of target image generation. Beyond enabling high-quality signature style transfer across a wide range of styles, SigStyle supports multiple interesting applications, such as local style transfer, texture transfer, style fusion and style-guided text-to-image generation. Quantitative and qualitative evaluations demonstrate our approach outperforms existing style transfer methods for recognizing and transferring the signature styles.
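
The time-aware attention swapping idea, injecting content information from the original image only during the early denoising steps, can be sketched as follows. The scaled-dot-product form, the timestep threshold, and the tensor shapes are illustrative assumptions rather than the paper's exact mechanism.

```python
import torch

def time_aware_attention_swap(q_tgt, k_tgt, v_tgt, k_src, v_src, t, T=1000, swap_ratio=0.6):
    """During the early (high-noise) denoising steps, attend to keys/values cached
    from the content image so its layout is preserved; later steps fall back to the
    target's own features. The 60% threshold is an assumption."""
    use_content = t > int(swap_ratio * T)          # early steps correspond to large t
    k, v = (k_src, v_src) if use_content else (k_tgt, v_tgt)
    attn = torch.softmax(q_tgt @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v

# Toy usage: one attention head, 256 spatial tokens, 64-dim features, early step t=800.
q = torch.randn(1, 256, 64)
k_tgt, v_tgt = torch.randn(1, 256, 64), torch.randn(1, 256, 64)
k_src, v_src = torch.randn(1, 256, 64), torch.randn(1, 256, 64)
out = time_aware_attention_swap(q, k_tgt, v_tgt, k_src, v_src, t=800)
```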


Paper & Project Links

PDF

Summary

This paper proposes SigStyle, a framework that leverages the semantic priors embedded in a personalized text-to-image diffusion model to capture the signature style, i.e., the distinct and recognizable visual traits of an image. A hypernetwork efficiently fine-tunes the diffusion model for any given single style image, and style transfer is formulated as reconstructing the content image through the style tokens learned by the personalized diffusion model. To keep the content consistent throughout the transfer, a time-aware attention swapping technique injects content information from the original image into the early denoising steps of target image generation. Beyond high-quality signature style transfer across a wide range of styles, SigStyle supports local style transfer, texture transfer, style fusion, and style-guided text-to-image generation, and it outperforms existing style transfer methods at recognizing and transferring signature styles.

Key Takeaways

  1. SigStyle captures signature style representations using the semantic priors in a personalized text-to-image diffusion model.
  2. A hypernetwork efficiently fine-tunes the diffusion model for a single style image.
  3. Style transfer is reformulated as reconstructing the content image through the style tokens learned by the personalized diffusion model.
  4. A time-aware attention swapping technique keeps the content consistent during style transfer.
  5. SigStyle supports multiple applications, including local style transfer, texture transfer, style fusion, and style-guided text-to-image generation.
  6. SigStyle outperforms existing methods at recognizing and transferring signature styles.

Cool Papers

Click here to view paper screenshots

SelfAge: Personalized Facial Age Transformation Using Self-reference Images

Authors:Taishi Ito, Yuki Endo, Yoshihiro Kanamori

Age transformation of facial images is a technique that edits age-related person’s appearances while preserving the identity. Existing deep learning-based methods can reproduce natural age transformations; however, they only reproduce averaged transitions and fail to account for individual-specific appearances influenced by their life histories. In this paper, we propose the first diffusion model-based method for personalized age transformation. Our diffusion model takes a facial image and a target age as input and generates an age-edited face image as output. To reflect individual-specific features, we incorporate additional supervision using self-reference images, which are facial images of the same person at different ages. Specifically, we fine-tune a pretrained diffusion model for personalized adaptation using approximately 3 to 5 self-reference images. Additionally, we design an effective prompt to enhance the performance of age editing and identity preservation. Experiments demonstrate that our method achieves superior performance both quantitatively and qualitatively compared to existing methods. The code and the pretrained model are available at https://github.com/shiiiijp/SelfAge.
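
As a rough sketch of the personalization recipe described above (fine-tuning on roughly 3 to 5 self-reference images with an age-conditioned prompt), the snippet below shows a standard denoising-loss fine-tuning step with stand-in components; the prompt template, toy denoiser, and noise corruption are placeholders, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def age_prompt(age: int) -> str:
    """Hypothetical age-conditioned prompt template (the paper designs its own prompt)."""
    return f"a photo of <person> at age {age}"

class ToyDenoiser(nn.Module):
    """Stand-in for the pretrained diffusion denoiser, only so the snippet runs."""
    def __init__(self, dim=4 * 32 * 32):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
    def forward(self, x, t, cond):
        return self.proj(x.flatten(1)).view_as(x)

denoiser = ToyDenoiser()
optimizer = torch.optim.Adam(denoiser.parameters(), lr=1e-5)
ref_latents = torch.randn(4, 4, 32, 32)   # encoded self-reference images (roughly 3 to 5)
text_cond = torch.randn(4, 77, 768)       # stand-in embedding of age_prompt(...)

t = torch.randint(1, 1000, (1,)).item()
noise = torch.randn_like(ref_latents)
noisy = ref_latents + noise * (t / 1000.0)          # toy corruption, not a real scheduler
loss = F.mse_loss(denoiser(noisy, t, text_cond), noise)
loss.backward()
optimizer.step()
```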


Paper & Project Links

PDF

Summary

A diffusion model-based method for personalized facial age transformation. The model takes a facial image and a target age as input and outputs an age-edited face image. Self-reference images provide additional supervision that reflects individual-specific features: a pretrained diffusion model is fine-tuned for personalized adaptation using roughly 3 to 5 self-reference images, and an effective prompt is designed to improve age editing and identity preservation. Compared with existing methods, the approach achieves superior performance both quantitatively and qualitatively.

Key Takeaways

  1. Diffusion models are applied to personalized age transformation for the first time.
  2. The method takes a facial image and a target age as input and outputs an age-edited face image.
  3. Self-reference images provide additional supervision to reflect individual-specific features.
  4. A pretrained diffusion model is fine-tuned with approximately 3 to 5 self-reference images for personalized adaptation.
  5. An effective prompt is designed to improve age editing and identity preservation.
  6. The method surpasses existing methods in both quantitative and qualitative evaluations.

Cool Papers

Click here to view paper screenshots

Compression-Aware One-Step Diffusion Model for JPEG Artifact Removal

Authors:Jinpei Guo, Zheng Chen, Wenbo Li, Yong Guo, Yulun Zhang

Diffusion models have demonstrated remarkable success in image restoration tasks. However, their multi-step denoising process introduces significant computational overhead, limiting their practical deployment. Furthermore, existing methods struggle to effectively remove severe JPEG artifact, especially in highly compressed images. To address these challenges, we propose CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. The core of CODiff is the compression-aware visual embedder (CaVE), which extracts and leverages JPEG compression priors to guide the diffusion model. We propose a dual learning strategy that combines explicit and implicit learning. Specifically, explicit learning enforces a quality prediction objective to differentiate low-quality images with different compression levels. Implicit learning employs a reconstruction objective that enhances the model’s generalization. This dual learning allows for a deeper and more comprehensive understanding of JPEG compression. Experimental results demonstrate that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. The code and models will be released at https://github.com/jp-guo/CODiff.
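
The dual learning strategy, an explicit quality-prediction objective combined with an implicit reconstruction objective, could be wired up roughly as below; the toy embedder and restorer, the discretized quality labels, and the equal loss weighting are assumptions for illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyCaVE(nn.Module):
    """Stand-in for the compression-aware visual embedder: a shared encoder, a head
    predicting the JPEG quality level (explicit branch), and an embedding consumed
    by the restorer (implicit branch). Layers and sizes are assumptions."""
    def __init__(self, n_quality_bins=5, dim=3 * 64 * 64, emb=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, emb), nn.ReLU())
        self.quality_head = nn.Linear(emb, n_quality_bins)
    def forward(self, lq_image):
        z = self.encoder(lq_image.flatten(1))
        return z, self.quality_head(z)

cave = ToyCaVE()
restorer = nn.Linear(3 * 64 * 64 + 256, 3 * 64 * 64)      # stand-in one-step restorer

lq = torch.rand(2, 3, 64, 64)                 # JPEG-compressed inputs
hq = torch.rand(2, 3, 64, 64)                 # clean targets
quality_label = torch.tensor([1, 3])          # discretized compression levels

z, q_logits = cave(lq)
restored = restorer(torch.cat([lq.flatten(1), z], dim=1)).view_as(hq)

loss_explicit = F.cross_entropy(q_logits, quality_label)   # quality prediction objective
loss_implicit = F.mse_loss(restored, hq)                   # reconstruction objective
(loss_explicit + loss_implicit).backward()                 # equal weighting assumed
```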


Paper & Project Links

PDF

Summary

Diffusion models have achieved remarkable success in image restoration, but their multi-step denoising process incurs significant computational overhead, which limits practical deployment, and existing methods struggle to remove severe JPEG artifacts. To address this, the paper proposes CODiff, a compression-aware one-step diffusion model for JPEG artifact removal. Its core is the compression-aware visual embedder (CaVE), which extracts and exploits JPEG compression priors to guide the diffusion model and is trained with a dual learning strategy that combines explicit and implicit learning. Experimental results show that CODiff surpasses recent leading methods in both quantitative and visual quality metrics. Code and models will be released at https://github.com/jp-guo/CODiff.

Key Takeaways

  1. Diffusion models excel at image restoration, but their computational overhead limits practical deployment.
  2. Existing methods struggle to remove severe JPEG artifacts, especially in highly compressed images.
  3. CODiff is a compression-aware one-step diffusion model designed to address these issues.
  4. The core of CODiff is the compression-aware visual embedder (CaVE), which uses JPEG compression priors to guide the diffusion model.
  5. A dual learning strategy combining explicit and implicit learning gives the model a deeper understanding of JPEG compression.
  6. Experiments show that CODiff surpasses recent leading methods in both quantitative and visual quality metrics.

Cool Papers

Click here to view paper screenshots

Conditional diffusion model with spatial attention and latent embedding for medical image segmentation

Authors:Behzad Hejrati, Soumyanil Banerjee, Carri Glide-Hurst, Ming Dong

Diffusion models have been used extensively for high quality image and video generation tasks. In this paper, we propose a novel conditional diffusion model with spatial attention and latent embedding (cDAL) for medical image segmentation. In cDAL, a convolutional neural network (CNN) based discriminator is used at every time-step of the diffusion process to distinguish between the generated labels and the real ones. A spatial attention map is computed based on the features learned by the discriminator to help cDAL generate more accurate segmentation of discriminative regions in an input image. Additionally, we incorporated a random latent embedding into each layer of our model to significantly reduce the number of training and sampling time-steps, thereby making it much faster than other diffusion models for image segmentation. We applied cDAL on 3 publicly available medical image segmentation datasets (MoNuSeg, Chest X-ray and Hippocampus) and observed significant qualitative and quantitative improvements with higher Dice scores and mIoU over the state-of-the-art algorithms. The source code is publicly available at https://github.com/Hejrati/cDAL/.
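
One way to picture the discriminator-driven spatial attention is sketched below: a small CNN discriminator scores a generated label map at some timestep, and its intermediate features are collapsed into a per-pixel attention map. The layer sizes and the channel-mean-plus-sigmoid pooling are illustrative choices, not necessarily the paper's.

```python
import torch
import torch.nn as nn

class ToyDiscriminator(nn.Module):
    """Stand-in CNN discriminator: returns a real/fake logit and its last feature
    map, from which a spatial attention map is derived."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))

    def forward(self, label_map):
        feats = self.features(label_map)
        return self.classifier(feats), feats

def spatial_attention(feats):
    """Collapse discriminator features into a per-pixel map in [0, 1]."""
    return torch.sigmoid(feats.mean(dim=1, keepdim=True))

disc = ToyDiscriminator()
generated_labels = torch.rand(2, 1, 64, 64)    # segmentation maps produced at some timestep
logit, feats = disc(generated_labels)
attention = spatial_attention(feats)           # highlights discriminative regions for the denoiser
```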


Paper & Project Links

PDF 13 pages, 5 figures, 3 tables, Accepted in MICCAI 2024

Summary
This paper proposes cDAL, a novel conditional diffusion model with spatial attention and latent embedding for medical image segmentation. At every time-step of the diffusion process, a CNN-based discriminator distinguishes generated labels from real ones, and a spatial attention map computed from the discriminator's features helps cDAL segment the discriminative regions of the input image more accurately. A random latent embedding incorporated into each layer of the model significantly reduces the number of training and sampling time-steps, making cDAL much faster than other diffusion models for image segmentation. On three public medical image segmentation datasets, cDAL delivers significant qualitative and quantitative improvements, with higher Dice scores and mIoU than state-of-the-art algorithms.

Key Takeaways

  1. cDAL is a novel conditional diffusion model for medical image segmentation that combines spatial attention and latent embedding.
  2. A CNN-based discriminator distinguishes real labels from generated ones at every step of the diffusion process.
  3. The spatial attention map helps cDAL segment the discriminative regions of the input image more accurately.
  4. A random latent embedding reduces the number of training and sampling time-steps, improving efficiency.
  5. cDAL achieves higher Dice scores and mIoU on three medical image segmentation datasets.
  6. cDAL is much faster than other diffusion models for image segmentation.

Cool Papers

Click here to view paper screenshots

T2ISafety: Benchmark for Assessing Fairness, Toxicity, and Privacy in Image Generation

Authors:Lijun Li, Zhelun Shi, Xuhao Hu, Bowen Dong, Yiran Qin, Xihui Liu, Lu Sheng, Jing Shao

Text-to-image (T2I) models have rapidly advanced, enabling the generation of high-quality images from text prompts across various domains. However, these models present notable safety concerns, including the risk of generating harmful, biased, or private content. Current research on assessing T2I safety remains in its early stages. While some efforts have been made to evaluate models on specific safety dimensions, many critical risks remain unexplored. To address this gap, we introduce T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. We build a detailed hierarchy of 12 tasks and 44 categories based on these three domains, and meticulously collect 70K corresponding prompts. Based on this taxonomy and prompt set, we build a large-scale T2I dataset with 68K manually annotated images and train an evaluator capable of detecting critical risks that previous work has failed to identify, including risks that even ultra-large proprietary models like GPTs cannot correctly detect. We evaluate 12 prominent diffusion models on T2ISafety and reveal several concerns including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across the models, even with defense methods like concept erasing. Data and evaluator are released under https://github.com/adwardlee/t2i_safety.


Paper & Project Links

PDF

Summary

Text-to-image (T2I) models have advanced rapidly in generating high-quality images, but they raise notable safety concerns, including the risk of producing harmful, biased, or private content, and research on assessing T2I safety is still at an early stage with many critical risks unexplored. The paper introduces T2ISafety, a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias. It builds a detailed hierarchy of tasks and categories over these domains, collects corresponding prompts, constructs a large-scale T2I dataset with manually annotated images, and trains an evaluator that detects critical risks previous work failed to identify. Evaluating 12 prominent diffusion models on T2ISafety reveals several concerns, including persistent issues with racial fairness, a tendency to generate toxic content, and significant variation in privacy protection across models.

Key Takeaways

  1. T2I models have advanced rapidly in generating high-quality images but present notable safety concerns.
  2. Research on assessing the safety of T2I models is still at an early stage, and many critical risks remain unexplored.
  3. T2ISafety is a safety benchmark that evaluates T2I models across three key domains: toxicity, fairness, and bias.
  4. A detailed hierarchy of 12 tasks and 44 categories is built over these domains, along with 70K corresponding prompts.
  5. Based on this taxonomy and prompt set, a large-scale T2I dataset is constructed and an evaluator is trained to detect critical risks that previous work failed to identify.
  6. Evaluating 12 prominent diffusion models on T2ISafety reveals issues with racial fairness, a tendency to generate toxic content, and differences in privacy protection across models.

Cool Papers

Click here to view paper screenshots

Text-to-Image Rectified Flow as Plug-and-Play Priors

Authors:Xiaofeng Yang, Cheng Chen, Xulei Yang, Fayao Liu, Guosheng Lin

Large-scale diffusion models have achieved remarkable performance in generative tasks. Beyond their initial training applications, these models have proven their ability to function as versatile plug-and-play priors. For instance, 2D diffusion models can serve as loss functions to optimize 3D implicit models. Rectified flow, a novel class of generative models, enforces a linear progression from the source to the target distribution and has demonstrated superior performance across various domains. Compared to diffusion-based methods, rectified flow approaches surpass in terms of generation quality and efficiency, requiring fewer inference steps. In this work, we present theoretical and experimental evidence demonstrating that rectified flow based methods offer similar functionalities to diffusion models - they can also serve as effective priors. Besides the generative capabilities of diffusion priors, motivated by the unique time-symmetry properties of rectified flow models, a variant of our method can additionally perform image inversion. Experimentally, our rectified flow-based priors outperform their diffusion counterparts - the SDS and VSD losses - in text-to-3D generation. Our method also displays competitive performance in image inversion and editing.
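
As a rough illustration of how a rectified flow model could act as a plug-and-play prior in the spirit of SDS, the sketch below interpolates a rendering with noise along the straight path x_t = (1 - t) * noise + t * x, asks a frozen velocity predictor for its direction, and pushes the deviation from the path's ideal velocity (x - noise) back into the rendering. The toy velocity network, weighting, and conventions are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Stand-in for a pretrained rectified-flow model that predicts the velocity
    moving noise toward the data distribution."""
    def __init__(self, dim=3 * 32 * 32):
        super().__init__()
        self.net = nn.Linear(dim + 1, dim)
    def forward(self, x_t, t):
        inp = torch.cat([x_t.flatten(1), t.expand(x_t.size(0), 1)], dim=1)
        return self.net(inp).view_as(x_t)

def rectified_flow_prior_loss(velocity_net, x_render):
    """SDS-style prior under a frozen rectified flow: the ideal velocity on the
    straight path from noise to x_render is (x_render - noise), so the residual
    between the model's prediction and that target is routed to the rendering
    parameters via the usual stop-gradient trick."""
    noise = torch.randn_like(x_render)
    t = torch.rand(1)
    x_t = (1.0 - t) * noise + t * x_render
    with torch.no_grad():
        v_pred = velocity_net(x_t, t)
    residual = v_pred - (x_render - noise)
    return (residual.detach() * x_render).sum()   # d(loss)/d(x_render) equals the residual

# Toy usage: the "rendering" is just a learnable image here.
flow = ToyVelocityNet()
x_render = torch.rand(1, 3, 32, 32, requires_grad=True)
loss = rectified_flow_prior_loss(flow, x_render)
loss.backward()                                   # x_render.grad now carries the prior's guidance
```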


Paper & Project Links

PDF ICLR 2025 Camera Ready. Code: https://github.com/yangxiaofeng/rectified_flow_prior

Summary

Large-scale diffusion models have achieved remarkable performance in generative tasks, and beyond their initial training applications they have proven to be versatile plug-and-play priors; for example, 2D diffusion models can serve as loss functions to optimize 3D implicit models. Rectified flow, a new class of generative models, enforces a linear progression from the source to the target distribution and surpasses diffusion-based methods in generation quality and efficiency while requiring fewer inference steps. This work provides theoretical and experimental evidence that rectified flow based methods offer similar functionality: they can also serve as effective priors. Motivated by the unique time-symmetry of rectified flow models, a variant of the method can additionally perform image inversion. Experimentally, the rectified flow-based priors outperform their diffusion counterparts, the SDS and VSD losses, in text-to-3D generation, and they also show competitive performance in image inversion and editing.

Key Takeaways

  1. Large-scale diffusion models perform remarkably well in generative tasks.
  2. Diffusion models can serve as versatile plug-and-play priors.
  3. 2D diffusion models can be used as loss functions to optimize 3D implicit models.
  4. Rectified flow, a new class of generative models, enforces a linear progression from the source to the target distribution and shows excellent performance.
  5. Rectified flow approaches surpass diffusion-based methods in generation quality and efficiency.
  6. Rectified flow based methods can serve as effective priors, offering functionality similar to diffusion models.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!