
Diffusion Models

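⚠️ All summaries below were generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on them in serious academic settings; they are meant only for a first-pass screening before reading the papers.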

Updated 2025-01-03

AdaDiff: Adaptive Step Selection for Fast Diffusion Models

Authors: Hui Zhang, Zuxuan Wu, Zhen Xing, Jie Shao, Yu-Gang Jiang

Diffusion models, as a type of generative model, have achieved impressive results in generating images and videos conditioned on text. However, the generation process involves dozens of denoising steps to produce photorealistic images or videos, which is computationally expensive. Unlike previous methods that design "one-size-fits-all" approaches for speed-up, we argue that the number of denoising steps should be sample-specific, conditioned on the richness of the input text. To this end, we introduce AdaDiff, a lightweight framework designed to learn instance-specific step usage policies, which are then used by the diffusion model for generation. AdaDiff is optimized using a policy gradient method to maximize a carefully designed reward function that balances inference time and generation quality. We conduct experiments on three image generation and two video generation benchmarks and demonstrate that our approach achieves visual quality similar to a baseline using a fixed 50 denoising steps while reducing inference time by at least 33% and by as much as 40%. Furthermore, our method can be used on top of other acceleration methods to provide further speed benefits. Lastly, qualitative analysis shows that AdaDiff allocates more steps to more informative prompts and fewer steps to simpler prompts.



Accepted by AAAI 2025



Key Takeaways

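  1. Diffusion models produce strong image and video results, but generation is computationally expensive.
  2. AdaDiff is a lightweight framework for speeding up the generation process of diffusion models.
  3. AdaDiff learns an instance-specific policy that decides how many denoising steps each input needs.
  4. The policy is optimized with a policy gradient method to maximize a carefully designed reward function that balances inference time and generation quality.
  5. AdaDiff reaches visual quality similar to a fixed 50-step baseline while cutting inference time by 33% to 40%.
  6. AdaDiff can be used on top of other acceleration methods for further speed gains.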
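
The policy-gradient idea in takeaways 3 and 4 can be illustrated with a toy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the candidate step counts in `STEP_CHOICES`, the linear time penalty in `reward`, the stand-in `toy_quality` score, and the reduction of a prompt to a single `prompt_richness` number. A real step-selection policy would be a network conditioned on the prompt text; this version only keeps one logit vector per prompt.

```python
import math
import random

STEP_CHOICES = [10, 20, 30, 40, 50]  # hypothetical candidate step counts
MAX_STEPS = 50  # the fixed-step baseline mentioned in the abstract


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(steps, quality, time_weight=0.5):
    """Assumed reward: quality in [0, 1] minus a penalty for steps used."""
    return quality - time_weight * steps / MAX_STEPS


def toy_quality(steps, prompt_richness):
    """Stand-in for an image-quality score; richer prompts need more steps."""
    needed = 10 + 40 * prompt_richness
    return min(1.0, steps / needed)


def reinforce_update(logits, prompt_richness, lr=0.1, rng=random):
    """One REINFORCE step on a categorical policy over STEP_CHOICES."""
    probs = softmax(logits)
    action = rng.choices(range(len(STEP_CHOICES)), weights=probs)[0]
    steps = STEP_CHOICES[action]
    r = reward(steps, toy_quality(steps, prompt_richness))
    # For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - p_i.
    return [
        logit + lr * r * ((1.0 if i == action else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


def train_policy(prompt_richness, iterations=2000, seed=0):
    """Run repeated REINFORCE updates and return the final step distribution."""
    rng = random.Random(seed)
    logits = [0.0] * len(STEP_CHOICES)
    for _ in range(iterations):
        logits = reinforce_update(logits, prompt_richness, rng=rng)
    return softmax(logits)
```

Under this toy reward, the best choice for a simple prompt (`prompt_richness=0.0`) is the smallest step count and for a rich prompt (`prompt_richness=1.0`) the largest, which mirrors the qualitative behavior the abstract reports. The sketch omits a reward baseline, so individual training runs can be noisy.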



Language-Guided Diffusion Model for Visual Grounding

Authors: Sijia Chen, Baochun Li

Visual grounding (VG) tasks involve explicit cross-modal alignment, since the image regions that semantically correspond to the given language phrases must be located. Existing approaches complete this visual-text reasoning in a single step. Their performance depends heavily on large-scale anchors and over-designed multi-modal fusion modules based on human priors, leading to complicated frameworks that may be difficult to train and may overfit to specific scenarios. Even worse, such once-for-all reasoning mechanisms cannot refine boxes continuously to enhance query-region matching. In contrast, in this paper we formulate an iterative reasoning process through denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason about queried object boxes by denoising a set of noisy boxes under language guidance. To achieve this, LG-DVG gradually perturbs query-aligned ground truth boxes into noisy ones and reverses this process step by step, conditioned on query semantics. Extensive experiments on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source code is available at https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.



20 pages, 16 figures



Key Takeaways

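  1. Visual grounding involves explicit cross-modal alignment: the image regions corresponding to the given language phrases must be located.
  2. Existing methods perform visual-text reasoning in a single step and depend on large-scale anchors and complex multi-modal fusion modules, which makes them hard to train and prone to overfitting to specific scenarios.
  3. The proposed LG-DVG framework uses an iterative reasoning process based on denoising diffusion to progressively infer the queried object boxes.
  4. LG-DVG gradually perturbs query-aligned ground truth boxes into noisy boxes, then reverses this process step by step, conditioned on the query semantics.
  5. Extensive experiments on five widely used datasets validate the advantage of solving visual grounding in a generative way.
  6. The source code is available at https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.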
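
The perturb-then-reverse process in takeaways 3 and 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the LG-DVG implementation: the cosine-style schedule in `alpha_bar`, the normalized `(cx, cy, w, h)` box format, and the deterministic update in `denoise_box` are all choices made here, and `predict_box` is a hypothetical placeholder for a trained, language-conditioned model.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Assumed cosine-style schedule: fraction of signal kept at step t."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def perturb_box(box, t, num_steps, rng):
    """Forward process: blend a ground-truth box with Gaussian noise."""
    a = alpha_bar(t, num_steps)
    return [
        math.sqrt(a) * c + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
        for c in box
    ]


def denoise_box(noisy_box, query, num_steps, predict_box):
    """Reverse process: refine the box step by step, conditioned on the query.

    predict_box(box, query, t) is a hypothetical callable that returns the
    model's current estimate of the clean box.
    """
    box = list(noisy_box)
    for t in range(num_steps, 0, -1):
        a_t = alpha_bar(t, num_steps)
        a_prev = alpha_bar(t - 1, num_steps)
        clean = predict_box(box, query, t)
        # Noise implied by the current box and the predicted clean box.
        eps = [
            (b - math.sqrt(a_t) * c) / math.sqrt(1.0 - a_t)
            for b, c in zip(box, clean)
        ]
        # Move to the previous, less noisy step using the same noise estimate.
        box = [
            math.sqrt(a_prev) * c + math.sqrt(1.0 - a_prev) * e
            for c, e in zip(clean, eps)
        ]
    return box
```

A quick way to exercise the loop is to pass an oracle in place of the model, for example `lambda box, query, t: ground_truth`, which should bring a fully noised box back to `ground_truth`. In training, the model would instead learn to make this prediction from the noisy box, the image features, and the query text.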



Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!