Published: 2025-02-27

Updated: 2025-05-14

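⚠️ All summaries below were generated by a large language model and may contain errors. They are for reference only; use them with caution.
🔴 Note: do not use them in serious academic settings. They are only suitable for a first-pass screening before reading a paper.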

Updated 2025-02-27

LDGen: Enhancing Text-to-Image Synthesis via Large Language Model-Driven Language Representation

Authors:Pengzhi Li, Pengfei Yu, Zide Liu, Wei He, Xuhao Pan, Xudong Rao, Tao Wei, Wei Chen

In this paper, we introduce LDGen, a novel method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders, such as CLIP and T5, exhibit limitations in multilingual processing, hindering image generation across diverse languages. We address these challenges by leveraging the advanced capabilities of LLMs. Our approach employs a language representation strategy that applies hierarchical caption optimization and human instruction techniques to derive precise semantic information. Subsequently, we incorporate a lightweight adapter and a cross-modal refiner to facilitate efficient feature alignment and interaction between LLMs and image features. LDGen reduces training time and enables zero-shot multilingual image generation. Experimental results indicate that our method surpasses baseline models in both prompt adherence and image aesthetic quality, while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.

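This paper introduces LDGen, a new method for integrating large language models (LLMs) into existing text-to-image diffusion models while minimizing computational demands. Traditional text encoders such as CLIP and T5 are limited in multilingual processing, which hinders image generation across languages. We address these challenges by drawing on the advanced capabilities of LLMs. Our approach adopts a language representation strategy that uses hierarchical caption optimization and human instruction techniques to derive precise semantic information. We then introduce a lightweight adapter and a cross-modal refiner to enable efficient feature alignment and interaction between LLM and image features. LDGen shortens training time and enables zero-shot multilingual image generation. Experimental results show that our method surpasses baseline models in both prompt adherence and image aesthetic quality while seamlessly supporting multiple languages. Project page: https://zrealli.github.io/LDGen.

Paper and project links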

PDF

Summary

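LDGen is a new method for integrating large language models (LLMs) into existing text-to-image diffusion models with as little computational overhead as possible. It targets the limited multilingual processing of traditional text encoders such as CLIP and T5, which hinders image generation across languages. LDGen adopts a language representation strategy that extracts precise semantic information through hierarchical caption optimization and human instruction techniques. It then adds a lightweight adapter and a cross-modal refiner for efficient feature alignment and interaction between LLM and image features. LDGen shortens training time and enables zero-shot multilingual image generation. Experiments show that it surpasses baseline models in prompt adherence and image aesthetic quality while seamlessly supporting multiple languages.

Key Takeaways

LDGen integrates large language models (LLMs) into text-to-image diffusion models to improve image generation quality and broaden language support.
Traditional text encoders are limited in multilingual processing; LDGen uses the capabilities of LLMs to address this.
LDGen adopts a dedicated language representation strategy that combines hierarchical caption optimization with human instruction techniques to extract precise semantic information.
A lightweight adapter and a cross-modal refiner enable efficient interaction and feature alignment between LLM and image features.
LDGen reduces training time and supports zero-shot multilingual image generation.
Experiments show that LDGen surpasses baseline models in prompt adherence and image aesthetic quality.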

Synthesizing Consistent Novel Views via 3D Epipolar Attention without Re-Training

Authors:Botao Ye, Sifei Liu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang

Large diffusion models demonstrate remarkable zero-shot capabilities in novel view synthesis from a single image. However, these models often face challenges in maintaining consistency across novel and reference views. A crucial factor leading to this issue is the limited utilization of contextual information from reference views. Specifically, when there is an overlap in the viewing frustum between two views, it is essential to ensure that the corresponding regions maintain consistency in both geometry and appearance. This observation leads to a simple yet effective approach, where we propose to use epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, eliminating the need for training or fine-tuning, as the process requires no learnable parameters. Furthermore, to enhance the overall consistency of generated views, we extend the utilization of epipolar attention to a multi-view setting, allowing retrieval of overlapping information from the input view and other target views. Qualitative and quantitative experimental results demonstrate the effectiveness of our method in significantly improving the consistency of synthesized views without the need for any fine-tuning. Moreover, this enhancement also boosts the performance of downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.

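Large diffusion models show remarkable zero-shot capability in novel view synthesis from a single image. However, these models often struggle to maintain consistency between novel and reference views. A key cause is the limited use of contextual information from the reference views. Specifically, when the viewing frustums of two views overlap, the corresponding regions must remain consistent in both geometry and appearance. Based on this observation, we propose a simple yet effective approach that uses epipolar geometry to locate and retrieve overlapping information from the input view. This information is then incorporated into the generation of target views, with no training or fine-tuning, because the process requires no learnable parameters. To further improve the overall consistency of the generated views, we extend epipolar attention to a multi-view setting, so that overlapping information can be retrieved from the input view and from other target views. Qualitative and quantitative experiments show that our method significantly improves the consistency of synthesized views without any fine-tuning. This improvement also boosts downstream applications such as 3D reconstruction. The code is available at https://github.com/botaoye/ConsisSyn.

Paper and project links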

PDF 3DV 2025

Summary

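Large diffusion models show strong zero-shot capability in novel view synthesis from a single image, but they struggle to keep novel and reference views consistent. The key problem is insufficient use of contextual information from the reference views. When the viewing frustums of two views overlap, the corresponding regions must be consistent in geometry and appearance. This work uses epipolar geometry to locate and retrieve overlapping information from the input view and incorporates it into target view generation, with no training or fine-tuning. It also extends epipolar attention to a multi-view setting to improve the overall consistency of the generated views. Experiments show that the method significantly improves the consistency of synthesized views without any fine-tuning, and that it also improves downstream applications such as 3D reconstruction.

Key Takeaways

Large diffusion models show strong zero-shot capability in novel view synthesis from a single image.
These models struggle to keep novel and reference views consistent.
The key cause is insufficient use of contextual information from the reference views.
Epipolar geometry is used to locate and retrieve overlapping information from the input view, which improves the consistency of generated target views.
The method requires no training or fine-tuning.
Epipolar attention is extended to a multi-view setting, further improving the overall consistency of generated views.
Experiments show that the method significantly improves the consistency of synthesized views and boosts downstream applications such as 3D reconstruction.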
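
To make the epipolar retrieval step more concrete, the sketch below computes, for one target-view pixel, the epipolar line in the input view and a mask of input pixels lying near that line. This is only a toy illustration under loud assumptions: identity intrinsics, no rotation, a pure sideways translation (so the fundamental matrix reduces to the cross-product matrix of the translation), and the convention that the line is `F` applied to the homogeneous target pixel. The function names and the `max_dist` threshold are hypothetical and are not taken from the paper or its code.

```python
import math

def skew(t):
    """Cross-product matrix [t]_x, so that [t]_x v equals t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]

def epipolar_line(F, pixel):
    """Line (a, b, c) in the input view for a target-view pixel (u, v)."""
    p = (pixel[0], pixel[1], 1.0)
    return tuple(sum(F[i][j] * p[j] for j in range(3)) for i in range(3))

def distance_to_line(line, pixel):
    """Perpendicular distance from a pixel to the line a*x + b*y + c = 0."""
    a, b, c = line
    return abs(a * pixel[0] + b * pixel[1] + c) / math.hypot(a, b)

def epipolar_mask(F, pixel, width, height, max_dist=0.5):
    """Boolean mask of input-view pixels within max_dist of the epipolar line."""
    line = epipolar_line(F, pixel)
    return [[distance_to_line(line, (x, y)) <= max_dist for x in range(width)]
            for y in range(height)]

# Assumed toy camera pair: identity intrinsics, no rotation, translation along x.
F = skew((1.0, 0.0, 0.0))
mask = epipolar_mask(F, pixel=(2, 3), width=6, height=5)
```

In this toy setup the epipolar line of pixel (2, 3) is the image row y = 3, so only that row of `mask` is selected. An attention layer could restrict its keys and values to the masked positions; how the paper actually weights or samples along the line is not specified in the abstract.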

Training Consistency Models with Variational Noise Coupling

Authors:Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and improving its training dynamics is an active area of research. In this work, we propose a novel CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which is instead fixed by the choice of the forward process in classical CT. Empirical results across diverse image datasets show significant generative improvements, with our model outperforming baselines and achieving the state-of-the-art (SoTA) non-distillation CT FID on CIFAR-10, and attaining FID on par with SoTA on ImageNet at $64 \times 64$ resolution in 2-step generation. Our code is available at https://github.com/sony/vct .

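Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, so analyzing and improving its training dynamics is an active area of research. In this work, we propose a new CT training approach based on the Flow Matching framework. Our main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model implemented as an encoder architecture, our method can indirectly learn the geometry of the noise-to-data mapping, which in classical CT is fixed by the choice of the forward process. Empirical results on diverse image datasets show significant generative improvements: our model outperforms baselines and achieves state-of-the-art non-distillation CT FID on CIFAR-10, and it attains FID on par with the state of the art on ImageNet at 64×64 resolution in 2-step generation. Our code is available at https://github.com/sony/vct.

Paper and project links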

PDF 23 pages, 11 figures

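Summary

This work proposes a new Consistency Training (CT) approach based on the Flow Matching framework. Its main contribution is a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE). By training a data-dependent noise emission model, implemented as an encoder architecture, the method can indirectly learn the geometry of the noise-to-data mapping, which in classical CT is fixed by the choice of the forward process. Empirical results on several image datasets show significant generative improvements: the model outperforms baselines, achieves state-of-the-art non-distillation CT FID on CIFAR-10, and attains FID on par with the state of the art on ImageNet at 64×64 resolution in 2-step generation.

Key Takeaways

Consistency Training (CT) is a competitive alternative to diffusion models in image generation tasks.
Non-distillation consistency training suffers from high variance and instability.
The paper proposes a new CT training approach based on the Flow Matching framework.
It introduces a trained noise-coupling scheme inspired by the architecture of Variational Autoencoders (VAE).
Training a data-dependent noise emission model lets the method indirectly learn the geometry of the noise-to-data mapping.
The method achieves state-of-the-art non-distillation CT FID on CIFAR-10 and FID on par with the state of the art on ImageNet in 2-step generation.
The code is available at https://github.com/sony/vct.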
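
The idea of a data-dependent noise coupling can be sketched in a few lines. The example below is a minimal illustration, not the authors' implementation: the encoder is a hypothetical stand-in (`toy_encoder`), and the linear interpolation path and the KL regularizer toward a standard normal are assumptions borrowed from common Flow Matching and VAE practice rather than details confirmed by the abstract.

```python
import math
import random

def toy_encoder(x):
    """Hypothetical stand-in for a learned encoder: returns mean and log-variance."""
    mu = [0.1 * xi for xi in x]
    log_var = [0.0 for _ in x]
    return mu, log_var

def sample_coupled_noise(x, rng):
    """Reparameterized sample z = mu(x) + sigma(x) * eps, so z depends on the data x."""
    mu, log_var = toy_encoder(x)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), keeping the coupling close to the prior."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def interpolate(x, z, t):
    """Assumed linear path: x_t = (1 - t) * x + t * z, with t = 0 data and t = 1 noise."""
    return [(1.0 - t) * xi + t * zi for xi, zi in zip(x, z)]

rng = random.Random(0)
x = [0.5, -1.0, 2.0]
z = sample_coupled_noise(x, rng)
x_half = interpolate(x, z, 0.5)
```

With a fixed forward process, `z` would be drawn independently of `x`; here the encoder shifts the noise toward each data point, which is the sense in which the noise-to-data geometry becomes learnable.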

Authors:Han Nie, Bin Luo, Jun Liu, Zhitao Fu, Huan Zhou, Shuo Zhang, Weixing Liu

The ideal goal of image matching is to achieve stable and efficient performance in unseen domains. However, many existing learning-based optical-SAR image matching methods, despite their effectiveness in specific scenarios, exhibit limited generalization and struggle to adapt to practical applications. Repeatedly training or fine-tuning matching models to address domain differences is not only inelegant but also introduces additional computational overhead and data production costs. In recent years, general foundation models have shown great potential for enhancing generalization. However, the disparity in visual domains between natural and remote sensing images poses challenges for their direct application. Therefore, effectively leveraging foundation models to improve the generalization of optical-SAR image matching remains a challenge. To address the above challenges, we propose PromptMID, a novel approach that constructs modality-invariant descriptors using text prompts based on land use classification as prior information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features by leveraging pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules effectively fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four diverse regions demonstrate that PromptMID outperforms state-of-the-art matching methods, achieving superior results in both seen and unseen domains and exhibiting strong cross-domain generalization capabilities. The source code will be made publicly available at https://github.com/HanNieWHU/PromptMID.

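The ideal goal of image matching is stable and efficient performance in unseen domains. However, although many existing learning-based optical-SAR image matching methods are effective in specific scenarios, their generalization is limited and they adapt poorly to practical applications. Repeatedly training or fine-tuning matching models to handle domain differences is inelegant and also adds computational overhead and data production costs. In recent years, general foundation models have shown great potential for improving generalization, but the gap between the visual domains of natural and remote sensing images makes their direct application difficult. Effectively using foundation models to improve the generalization of optical-SAR image matching therefore remains a challenge. To address it, we propose PromptMID, a new approach that constructs modality-invariant descriptors using text prompts based on land use classification as prior information for optical and SAR image matching. PromptMID extracts multi-scale modality-invariant features using pre-trained diffusion models and visual foundation models (VFMs), while specially designed feature aggregation modules fuse features across different granularities. Extensive experiments on optical-SAR image datasets from four different regions show that PromptMID outperforms state-of-the-art matching methods, achieves superior results in both seen and unseen domains, and shows strong cross-domain generalization. The source code will be made publicly available at https://github.com/HanNieWHU/PromptMID.

Paper and project links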

PDF 15 pages, 8 figures

Summary

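The challenge in optical-SAR image matching is to achieve stable and efficient performance in unseen domains. Existing methods perform well in some scenarios, but their generalization is limited and they adapt poorly to practical applications. This paper proposes PromptMID, a new method that builds modality-invariant descriptors from text prompts, using land use classification as prior information for optical and SAR image matching. The method extracts multi-scale modality-invariant features with pre-trained diffusion models and visual foundation models, and uses purpose-built feature aggregation modules to fuse features of different granularities. Experiments on optical-SAR image datasets from four different regions show that PromptMID outperforms current state-of-the-art matching methods in both seen and unseen domains and shows strong cross-domain generalization.

Key Takeaways

The ideal goal of image matching is stable and efficient performance in unseen domains.
Existing optical-SAR image matching methods generalize poorly and adapt badly to practical applications.
PromptMID is a new method that builds modality-invariant descriptors from text prompts.
PromptMID uses land use classification as prior information for optical and SAR image matching.
The method combines pre-trained diffusion models and visual foundation models to extract multi-scale modality-invariant features.
Feature aggregation modules fuse features of different granularities.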

3D Anatomical Structure-guided Deep Learning for Accurate Diffusion Microstructure Imaging

Authors:Xinrui Ma, Jian Cheng, Wenxin Fan, Ruoyou Wu, Yongquan Ye, Shanshan Wang

Diffusion magnetic resonance imaging (dMRI) is a crucial non-invasive technique for exploring the microstructure of the living human brain. Traditional hand-crafted and model-based tissue microstructure reconstruction methods often require extensive diffusion gradient sampling, which can be time-consuming and limits the clinical applicability of tissue microstructure information. Recent advances in deep learning have shown promise in microstructure estimation; however, accurately estimating tissue microstructure from clinically feasible dMRI scans remains challenging without appropriate constraints. This paper introduces a novel framework that achieves high-fidelity and rapid diffusion microstructure imaging by simultaneously leveraging anatomical information from macro-level priors and mutual information across parameters. This approach enhances time efficiency while maintaining accuracy in microstructure estimation. Experimental results demonstrate that our method outperforms four state-of-the-art techniques, achieving a peak signal-to-noise ratio (PSNR) of 30.51$\pm$0.58 and a structural similarity index measure (SSIM) of 0.97$\pm$0.004 in estimating parametric maps of multiple diffusion models. Notably, our method achieves a 15$\times$ acceleration compared to the dense sampling approach, which typically utilizes 270 diffusion gradients.

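Diffusion magnetic resonance imaging (dMRI) is an important non-invasive technique for exploring the microstructure of the living human brain. Traditional hand-crafted and model-based tissue microstructure reconstruction methods usually require extensive diffusion gradient sampling, which is time-consuming and limits the clinical applicability of tissue microstructure information. Recent advances in deep learning have shown promise for microstructure estimation; however, accurately estimating tissue microstructure from clinically feasible dMRI scans remains challenging without appropriate constraints. This paper introduces a new framework that achieves high-fidelity and rapid diffusion microstructure imaging by simultaneously using anatomical information from macro-level priors and mutual information across parameters. The approach improves time efficiency while maintaining the accuracy of microstructure estimation. Experimental results show that our method outperforms four state-of-the-art techniques in estimating parametric maps of multiple diffusion models, achieving a peak signal-to-noise ratio (PSNR) of 30.51±0.58 and a structural similarity index measure (SSIM) of 0.97±0.004. Notably, our method achieves a 15× acceleration compared with the dense sampling approach, which typically uses 270 diffusion gradients.

Paper and project links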

PDF

Summary

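This paper introduces a new diffusion microstructure imaging framework that combines anatomical information from macro-level priors with mutual information across parameters to achieve high-fidelity and rapid diffusion microstructure imaging. The method improves time efficiency while maintaining the accuracy of microstructure estimation, and it outperforms four state-of-the-art techniques.

Key Takeaways

Diffusion magnetic resonance imaging (dMRI) is an important non-invasive technique for exploring the microstructure of the living human brain.
Traditional hand-crafted and model-based tissue microstructure reconstruction methods require extensive diffusion gradient sampling, which is time-consuming and limits the clinical applicability of tissue microstructure information.
Deep learning has shown promise for microstructure estimation, but accurately estimating tissue microstructure from clinically feasible dMRI scans remains challenging.
The paper proposes a new diffusion microstructure imaging framework that uses anatomical information from macro-level priors and mutual information across parameters to achieve rapid, high-fidelity microstructure imaging.
The method outperforms four state-of-the-art techniques, reaching a peak signal-to-noise ratio (PSNR) of 30.51±0.58 and a structural similarity index measure (SSIM) of 0.97±0.004 when estimating parametric maps of multiple diffusion models.
The method achieves a 15× acceleration compared with the dense sampling approach, which typically uses 270 diffusion gradients.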
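
For readers unfamiliar with the PSNR figure quoted above, the sketch below shows the standard definition, PSNR = 10 · log10(MAX² / MSE), on flat lists of values. It is a generic illustration, not the paper's evaluation code; the `max_value` default of 1.0 assumes maps normalized to [0, 1]. The last line works out what a 15× reduction of 270 gradients would be if the acceleration is read purely as a reduction in gradient count, which is an assumption, since the abstract does not state the reduced number.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized value lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)

example_db = psnr([0.0, 0.0], [0.1, 0.1])  # MSE = 0.01, so about 20 dB

# Assumption: reading "15x acceleration" as 15 times fewer diffusion gradients.
implied_gradients = 270 // 15
```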

Authors:Peipei Yuan, Zijing Xie, Shuo Ye, Hong Chen, Yulong Wang

Generative artificial intelligence holds significant potential for abuse, and generative image detection has become a key focus of research. However, existing methods primarily focus on detecting a specific generative model and emphasize the localization of synthetic regions, while neglecting the interference caused by image size and style on model learning. Our goal is to reach a fundamental conclusion: Is the image real or generated? To this end, we propose a diffusion model-based generative image detection framework termed Hierarchical Retrospection Refinement (HRR). It designs a multi-scale style retrospection module that encourages the model to generate detailed and realistic multi-scale representations, while alleviating the learning biases introduced by dataset styles and generative models. Additionally, based on the principle of correntropy sparse additive machine, a feature refinement module is designed to reduce the impact of redundant features on learning and capture the intrinsic structure and patterns of the data, thereby improving the model’s generalization ability. Extensive experiments demonstrate that the HRR framework consistently delivers significant performance improvements, outperforming state-of-the-art methods in the generated image detection task.

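Generative artificial intelligence carries significant potential for abuse, and generated image detection has become a key research focus. However, existing methods mainly focus on detecting a specific generative model and emphasize the localization of synthetic regions, while neglecting the interference that image size and style cause in model learning. Our goal is to reach a basic conclusion: is the image real or generated? To this end, we propose a diffusion model-based generated image detection framework called Hierarchical Retrospection Refinement (HRR). It designs a multi-scale style retrospection module that encourages the model to generate detailed and realistic multi-scale representations while alleviating the learning biases introduced by dataset styles and generative models. In addition, based on the principle of the correntropy sparse additive machine, a feature refinement module is designed to reduce the impact of redundant features on learning and to capture the intrinsic structure and patterns of the data, thereby improving the model's generalization ability. Extensive experiments show that the HRR framework consistently delivers significant performance improvements on the generated image detection task and outperforms state-of-the-art methods.

Paper and project links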

PDF

Summary
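This paper addresses the abuse risk of generative artificial intelligence and proposes Hierarchical Retrospection Refinement (HRR), a diffusion model-based framework for detecting generated images. The framework includes a multi-scale style retrospection module that alleviates the learning biases introduced by dataset styles and generative models, and a feature refinement module, based on the principle of the correntropy sparse additive machine, that improves the model's generalization ability. Experiments show that HRR performs strongly on the generated image detection task and delivers significant performance improvements.

Key Takeaways

Generative artificial intelligence carries a risk of abuse, and generated image detection has become a research focus.
Existing methods mainly detect a specific generative model and emphasize the localization of synthetic regions, neglecting the effect of image size and style on model learning.
The paper proposes Hierarchical Retrospection Refinement (HRR), a diffusion model-based framework for generated image detection.
HRR includes a multi-scale style retrospection module that encourages detailed and realistic multi-scale representations.
HRR includes a feature refinement module, based on the principle of the correntropy sparse additive machine, that reduces the interference of redundant features and captures the intrinsic structure and patterns of the data.
By improving generalization, HRR achieves strong performance on the generated image detection task.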

On the Vulnerability of Concept Erasure in Diffusion Models

Authors:Lucas Beerens, Alex D. Richardson, Kaicheng Zhang, Dongdong Chen

The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly regarding the generation of copyrighted or harmful images. To address these issues, research on machine unlearning has developed various concept erasure methods, which aim to remove the effect of unwanted data through post-hoc training. However, we show these erasure techniques are vulnerable, where images of supposedly erased concepts can still be generated using adversarially crafted prompts. We introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting the generation of erased content. We demonstrate that RECORD significantly beats the attack success rate of current state-of-the-art attack methods. Furthermore, our findings reveal that models subjected to concept erasure are more susceptible to adversarial attacks than previously anticipated, highlighting the urgency for more robust unlearning approaches. We open source all our code at https://github.com/LucasBeerens/RECORD

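The proliferation of text-to-image diffusion models has raised significant privacy and security concerns, particularly about the generation of copyrighted or harmful images. To address these issues, research on machine unlearning has developed various concept erasure methods, which aim to remove the effect of unwanted data through post-hoc training. However, we show that these erasure techniques are vulnerable: images of supposedly erased concepts can still be generated using adversarially crafted prompts. We introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting the generation of erased content. We demonstrate that RECORD significantly exceeds the attack success rate of current state-of-the-art attack methods. Furthermore, our findings reveal that models subjected to concept erasure are more susceptible to adversarial attacks than previously anticipated, which highlights the urgency of more robust unlearning approaches. All our code is open source at https://github.com/LucasBeerens/RECORD.

Paper and project links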

PDF

Summary

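The proliferation of text-to-image diffusion models has raised significant privacy and security concerns about the generation of copyrighted or harmful images. To address them, machine unlearning research has developed various concept erasure methods that aim to remove the effect of unwanted data through post-hoc training. However, this work shows that these erasure techniques are vulnerable: adversarially crafted prompts can still generate images of supposedly erased concepts. The authors introduce RECORD, a coordinate-descent-based algorithm that discovers prompts capable of eliciting erased content, and show that its attack success rate is significantly higher than that of current state-of-the-art attack methods. The findings also show that models subjected to concept erasure are more susceptible to adversarial attacks than previously anticipated, which underlines the need for more robust unlearning approaches. All code is available at https://github.com/LucasBeerens/RECORD.

Key Takeaways

The proliferation of text-to-image diffusion models raises privacy and security concerns, particularly about generating copyrighted or harmful images.
Concept erasure methods have been proposed to address this by removing the effect of unwanted data through post-hoc training.
Current methods are vulnerable: adversarially crafted prompts can generate images of supposedly erased concepts.
RECORD is an algorithm that discovers prompts capable of eliciting erased content, with a higher attack success rate than existing methods.
Models subjected to concept erasure are more susceptible to adversarial attacks than previously anticipated.
More robust unlearning approaches are needed.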
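
The abstract describes RECORD only as a coordinate-descent-based search over prompts. The sketch below shows what coordinate descent over discrete tokens looks like in general: one prompt position is changed at a time, and a change is kept only if it lowers the loss. It is not the RECORD algorithm itself. The loss here is a hypothetical toy (mismatch count against a hidden target), whereas a real attack would score candidate prompts against the erased diffusion model; all names are illustrative.

```python
def coordinate_descent(prompt, vocab, loss_fn, sweeps=3):
    """Greedy coordinate descent: optimize one token position at a time."""
    prompt = list(prompt)
    best = loss_fn(prompt)
    for _ in range(sweeps):
        improved = False
        for pos in range(len(prompt)):
            for token in vocab:
                candidate = prompt[:pos] + [token] + prompt[pos + 1:]
                value = loss_fn(candidate)
                if value < best:  # keep only strict improvements
                    prompt, best, improved = candidate, value, True
        if not improved:
            break  # a full sweep changed nothing: local minimum reached
    return prompt, best

# Hypothetical toy objective: number of positions that differ from a hidden target.
target = ["a", "red", "car"]

def toy_loss(tokens):
    return sum(1 for got, want in zip(tokens, target) if got != want)

found, final_loss = coordinate_descent(["x", "x", "x"], ["a", "red", "car", "x"], toy_loss)
```

Each inner loop costs one loss evaluation per vocabulary entry, so with a real model the candidate set per position would need to be kept small; how RECORD selects candidates is not described in the abstract.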

DICEPTION: A Generalist Diffusion Model for Visual Perceptual Tasks

Authors:Canyu Zhao, Mingyu Liu, Huanyi Zheng, Muzhi Zhu, Zhiyue Zhao, Hao Chen, Tong He, Chunhua Shen

Our primary goal here is to create a good, generalist perception model that can tackle multiple tasks, within limits on computational resources and training data. To achieve this, we resort to text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluation metrics demonstrate that DICEPTION effectively tackles multiple perception tasks, achieving performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of their data (e.g., 600K vs. 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding; and we show that the strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation enables us to fully leverage pre-trained text-to-image models. Thus, DICEPTION can be efficiently trained at a cost of orders of magnitude lower, compared to conventional models that were trained from scratch. When adapting our model to other tasks, it only requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.

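Our primary goal is to create a good generalist perception model that can handle multiple tasks within limits on computational resources and training data. To achieve this, we use text-to-image diffusion models pre-trained on billions of images. Our exhaustive evaluation metrics show that DICEPTION effectively handles multiple perception tasks, with performance on par with state-of-the-art models. We achieve results on par with SAM-vit-h using only 0.06% of its data (for example, 600K versus 1B pixel-level annotated images). Inspired by Wang et al., DICEPTION formulates the outputs of various perception tasks using color encoding, and we show that assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. Unifying various perception tasks as conditional image generation lets us fully use pre-trained text-to-image models. As a result, DICEPTION can be trained at a cost orders of magnitude lower than conventional models trained from scratch. Adapting our model to other tasks requires fine-tuning on as few as 50 images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo.

Paper and project links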

PDF 29 pages, 20 figures. Homepage: https://aim-uofa.github.io/Diception, Huggingface Demo: https://huggingface.co/spaces/Canyu/Diception-Demo

Summary
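This paper aims to create a good generalist perception model that can handle multiple tasks under limited computational resources and training data. By building on text-to-image diffusion models pre-trained on billions of images, DICEPTION handles multiple perception tasks effectively, with performance on par with state-of-the-art models. Its strategy of assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation. By unifying various perception tasks as conditional image generation, DICEPTION makes full use of pre-trained text-to-image models and can be trained at a much lower cost than conventional models trained from scratch. Adapting the model to other tasks requires fine-tuning on only a few images and 1% of its parameters. DICEPTION provides valuable insights and a more promising solution for visual generalist models.

Key Takeaways

DICEPTION aims to be a generalist perception model that handles multiple tasks under limited computational resources and training data.
The model builds on pre-trained text-to-image diffusion models to handle multiple perception tasks.
Assigning random colors to different instances is highly effective in both entity segmentation and semantic segmentation.
Unifying perception tasks as conditional image generation lets DICEPTION use pre-trained text-to-image models and lowers training cost.
Adapting DICEPTION to a new task requires only a small amount of data and fine-tuning of a small share of its parameters.
DICEPTION performs on par with state-of-the-art models while using only 0.06% of the data used by SAM-vit-h.
DICEPTION offers a new solution and valuable insights for visual generalist models.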
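
The random-color encoding of instances can be illustrated with a small helper that turns an integer instance mask into an RGB target image. This is a minimal sketch under assumptions: the mask is a nested list with 0 as background, background is rendered black, and colors are drawn uniformly and forced to be distinct. None of these details are taken from the DICEPTION code.

```python
import random

def colorize_instances(mask, seed=0):
    """Map each instance id in a 2D mask to a distinct random RGB color."""
    rng = random.Random(seed)
    palette = {0: (0, 0, 0)}  # assumption: id 0 is background, drawn black
    used = {(0, 0, 0)}
    for row in mask:
        for inst in row:
            if inst not in palette:
                color = (0, 0, 0)
                while color in used:  # resample until the color is unused
                    color = tuple(rng.randrange(256) for _ in range(3))
                palette[inst] = color
                used.add(color)
    image = [[palette[inst] for inst in row] for row in mask]
    return image, palette

mask = [[0, 1, 1],
        [2, 2, 0]]
image, palette = colorize_instances(mask)
```

Because the colors carry no class meaning, the same encoding works for any number of instances, and a generated image can be decoded back into a mask by grouping pixels of equal color. The data-efficiency figure quoted above follows from simple arithmetic: 600K / 1B = 0.0006, that is, 0.06%.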

ConsistentDreamer: View-Consistent Meshes Through Balanced Multi-View Gaussian Optimization

Authors:Onat Şahin, Mohammad Altillawi, George Eskandar, Carlos Carbone, Ziyuan Liu

Recent advances in diffusion models have significantly improved 3D generation, enabling the use of assets generated from an image for embodied AI simulations. However, the one-to-many nature of the image-to-3D problem limits their use due to inconsistent content and quality across views. Previous models optimize a 3D model by sampling views from a view-conditioned diffusion prior, but diffusion models cannot guarantee view consistency. Instead, we present ConsistentDreamer, where we first generate a set of fixed multi-view prior images and sample random views between them with another diffusion model through a score distillation sampling (SDS) loss. Thereby, we limit the discrepancies between the views guided by the SDS loss and ensure a consistent rough shape. In each iteration, we also use our generated multi-view prior images for fine-detail reconstruction. To balance between the rough shape and the fine-detail optimizations, we introduce dynamic task-dependent weights based on homoscedastic uncertainty, updated automatically in each iteration. Additionally, we employ opacity, depth distortion, and normal alignment losses to refine the surface for mesh extraction. Our method ensures better view consistency and visual quality compared to the state-of-the-art.

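Recent advances in diffusion models have significantly improved 3D generation, making it possible to use assets generated from an image in embodied AI simulations. However, the one-to-many nature of the image-to-3D problem limits their use, because content and quality are inconsistent across views. Previous models optimize a 3D model by sampling views from a view-conditioned diffusion prior, but diffusion models cannot guarantee view consistency. Instead, we present ConsistentDreamer, in which we first generate a set of fixed multi-view prior images and then sample random views between them with another diffusion model through a score distillation sampling (SDS) loss. In this way we limit the discrepancies between the views guided by the SDS loss and ensure a consistent rough shape. In each iteration, we also use the generated multi-view prior images for fine-detail reconstruction. To balance the rough-shape and fine-detail optimizations, we introduce dynamic task-dependent weights based on homoscedastic uncertainty, which are updated automatically in each iteration. In addition, we use opacity, depth distortion, and normal alignment losses to refine the surface for mesh extraction. Our method ensures better view consistency and visual quality than the state of the art.

Paper and project links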

PDF Manuscript accepted by Pattern Recognition Letters. Project Page: https://onatsahin.github.io/ConsistentDreamer/

Summary
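Recent improvements in diffusion models have advanced 3D generation, so that assets generated from an image can be used in embodied AI simulations. However, the one-to-many nature of the image-to-3D problem leads to inconsistent content across views. This paper proposes ConsistentDreamer, which generates fixed multi-view prior images and samples random views between them, using a score distillation sampling (SDS) loss to keep the views consistent. Dynamic task-dependent weights balance rough-shape and fine-detail optimization, and opacity, depth distortion, and normal alignment losses refine the surface for mesh extraction. The method ensures better view consistency and visual quality than existing techniques.

Key Takeaways

Recent advances in diffusion models have improved 3D generation, so that assets generated from an image can be used in embodied AI simulations.
The one-to-many nature of the image-to-3D problem leads to inconsistent content across views.
ConsistentDreamer generates fixed multi-view prior images and uses a score distillation sampling (SDS) loss to achieve view consistency.
Dynamic task-dependent weights balance rough-shape and fine-detail optimization.
Opacity, depth distortion, and normal alignment losses refine the surface for mesh extraction.
The method ensures better view consistency and visual quality than existing techniques.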
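
Weights "based on homoscedastic uncertainty" usually refer to the multi-task weighting in which each loss L_i is scaled by exp(-s_i) and a regularizer s_i is added, where s_i is a learnable log-variance. The sketch below implements that common formulation (up to constant factors) for two losses, standing in for the rough-shape and fine-detail terms. That ConsistentDreamer uses exactly this form is an assumption, and the learning rate and loss values are made up for illustration.

```python
import math

def weighted_total(losses, log_vars):
    """Total loss: sum_i exp(-s_i) * L_i + s_i, with s_i a learnable log-variance."""
    return sum(math.exp(-s) * loss + s for loss, s in zip(losses, log_vars))

def update_log_vars(losses, log_vars, lr=0.1):
    """One gradient step on each s_i; d/ds [exp(-s) * L + s] = 1 - exp(-s) * L."""
    return [s - lr * (1.0 - math.exp(-s) * loss) for loss, s in zip(losses, log_vars)]

# Hypothetical values: a large rough-shape loss and a small fine-detail loss.
losses = [2.0, 0.5]
log_vars = [0.0, 0.0]
total_before = weighted_total(losses, log_vars)
log_vars = update_log_vars(losses, log_vars)
weights = [math.exp(-s) for s in log_vars]
```

After one step the larger loss receives a smaller weight and the smaller loss a larger one, so neither term dominates; for a fixed loss value the stationary point is s_i = log L_i. In a real training loop the s_i would be optimizer parameters updated alongside the 3D representation.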

SG-I2V: Self-Guided Trajectory Control in Image-to-Video Generation

Authors:Koichi Namekata, Sherwin Bahmani, Ziyi Wu, Yash Kant, Igor Gilitschenski, David B. Lindell

Methods for image-to-video generation have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is often a tedious process of trial and error, e.g., involving re-generating videos with different random seeds. Recent techniques address this issue by fine-tuning a pre-trained model to follow conditioning signals, such as bounding boxes or point trajectories. Yet, this fine-tuning procedure can be computationally expensive, and it requires datasets with annotated object motion, which can be difficult to procure. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided, offering zero-shot control by relying solely on the knowledge present in a pre-trained image-to-video diffusion model without the need for fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines while significantly narrowing down the performance gap with supervised models in terms of visual quality and motion fidelity. Additional details and video results are available on our project page: https://kmcode1.github.io/Projects/SG-I2V

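Image-to-video generation methods have achieved impressive, photo-realistic quality. However, adjusting specific elements in generated videos, such as object motion or camera movement, is usually a tedious process of trial and error, for example re-generating videos with different random seeds. Recent techniques address this by fine-tuning a pre-trained model to follow conditioning signals such as bounding boxes or point trajectories. Yet this fine-tuning can be computationally expensive, and it requires datasets with annotated object motion, which can be hard to obtain. In this work, we introduce SG-I2V, a framework for controllable image-to-video generation that is self-guided: it offers zero-shot control by relying solely on the knowledge in a pre-trained image-to-video diffusion model, with no fine-tuning or external knowledge. Our zero-shot method outperforms unsupervised baselines in visual quality and motion fidelity and significantly narrows the performance gap with supervised models. More details and video results are available on our project page: https://kmcode1.github.io/Projects/SG-I2V.

Paper and project links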

PDF ICLR 2025. Project page: https://kmcode1.github.io/Projects/SG-I2V/

Summary

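This paper introduces SG-I2V, a self-guided framework for controllable image-to-video generation. It relies on a pre-trained image-to-video diffusion model and achieves zero-shot control with no fine-tuning or external knowledge. In visual quality and motion fidelity, the method outperforms unsupervised baselines and significantly narrows the performance gap with supervised models.

Key Takeaways

Image-to-video generation has reached impressive, photo-realistic quality.
Adjusting specific elements in generated videos, such as object motion or camera movement, is usually a tedious process of trial and error.
Recent techniques address this by fine-tuning a pre-trained model to follow conditioning signals such as bounding boxes or point trajectories.
Fine-tuning is computationally expensive and requires datasets with annotated object motion, which are hard to obtain.
SG-I2V is a self-guided framework for controllable image-to-video generation.
SG-I2V relies on a pre-trained image-to-video diffusion model and achieves zero-shot control with no fine-tuning or external knowledge.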

Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation

Authors:Peiwen Sun, Sitong Cheng, Xiangtai Li, Zhen Ye, Huadai Liu, Honggang Zhang, Wei Xue, Yike Guo

Recently, diffusion models have achieved great success in mono-channel audio generation. However, when it comes to stereo audio generation, the soundscapes often have a complex scene of multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging due to high data costs and unstable generative models. To the best of our knowledge, this work represents the first attempt to address these issues. We first construct a large-scale, simulation-based, and GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions even including moving and multiple sources. Beyond text modality, we have also acquired a set of images and rationally paired stereo audios through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for Latent Diffusion Models, we introduce the SpatialSonic model utilizing spatial-aware encoders and azimuth state matrices to reveal reasonable spatial guidance. By leveraging spatial guidance, our model not only achieves the objective of generating immersive and controllable spatial audio from text but also extends to other modalities as the pioneer attempt. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method, highlighting its capability to generate spatial audio that adheres to physical rules.

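Recently, diffusion models have achieved great success in mono-channel audio generation. However, in stereo audio generation the soundscapes often form a complex scene with multiple objects and directions. Controlling stereo audio with spatial contexts remains challenging because of high data costs and unstable generative models. To the best of our knowledge, this work is the first attempt to address these issues. We first construct a large-scale, simulation-based, GPT-assisted dataset, BEWO-1M, with abundant soundscapes and descriptions, including moving and multiple sources. Beyond the text modality, we have also acquired a set of images and rationally paired stereo audio through retrieval to advance multimodal generation. Existing audio generation models tend to generate rather random and indistinct spatial audio. To provide accurate guidance for latent diffusion models, we introduce the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to provide reasonable spatial guidance. With this guidance, our model not only generates immersive and controllable spatial audio from text but also extends to other modalities as a first attempt. Finally, under fair settings, we conduct subjective and objective evaluations on simulated and real-world data to compare our approach with prevailing methods. The results demonstrate the effectiveness of our method and highlight its ability to generate spatial audio that adheres to physical rules.

Paper and project links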

PDF Accepted by ICLR 2025

Summary

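Diffusion models have achieved great success in mono-channel audio generation, but in stereo audio generation, controlling sound in complex scenes with multiple objects and directions remains challenging because of high data costs and unstable generative models. This work is the first to construct BEWO-1M, a large-scale, simulation-based, GPT-assisted dataset with abundant soundscapes and descriptions, including moving and multiple sources. Beyond the text modality, a set of images and rationally paired stereo audio was also obtained through retrieval to advance multimodal generation. Existing audio generation models tend to generate random and indistinct spatial audio. To give latent diffusion models accurate guidance, the paper introduces the SpatialSonic model, which uses spatial-aware encoders and azimuth state matrices to provide reasonable spatial guidance. With this guidance, the model generates immersive and controllable spatial audio from text and, as a first attempt, extends to other modalities. Subjective and objective evaluations on simulated and real-world data show that the method is effective and can generate spatial audio that adheres to physical rules.

Key Takeaways

Stereo audio generation is challenging for diffusion models, mainly because of high data costs and unstable generative models.
The work constructs BEWO-1M, a large-scale, simulation-based, GPT-assisted dataset with abundant soundscapes and descriptions, including moving and multiple sources.
Beyond the text modality, images and paired stereo audio are obtained through retrieval to advance multimodal generation.
Existing audio generation models tend to generate random and indistinct spatial audio.
The SpatialSonic model uses spatial-aware encoders and azimuth state matrices to give latent diffusion models accurate spatial guidance.
The model generates immersive and controllable spatial audio from text and extends to other modalities.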

Training-free Camera Control for Video Generation

Authors:Chen Hou, Zhibo Chen

We propose a training-free and robust solution to offer camera movement control for off-the-shelf video diffusion models. Unlike previous work, our method does not require any supervised finetuning on camera-annotated datasets or self-supervised training via data augmentation. Instead, it can be plug-and-play with most pretrained video diffusion models and generate camera-controllable videos with a single image or text prompt as input. The inspiration for our work comes from the layout prior that intermediate latents encode for the generated results, thus rearranging noisy pixels in them will cause the output content to relocate as well. As camera moving could also be seen as a type of pixel rearrangement caused by perspective change, videos can be reorganized following specific camera motion if their noisy latents change accordingly. Building on this, we propose CamTrol, which enables robust camera control for video diffusion models. It is achieved by a two-stage process. First, we model image layout rearrangement through explicit camera movement in 3D point cloud space. Second, we generate videos with camera motion by leveraging the layout prior of noisy latents formed by a series of rearranged images. Extensive experiments have demonstrated its superior performance in both video generation and camera motion alignment compared with other finetuned methods. Furthermore, we show the capability of CamTrol to generalize to various base models, as well as its impressive applications in scalable motion control, dealing with complicated trajectories and unsupervised 3D video generation. Videos available at https://lifedecoder.github.io/CamTrol/.

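We propose a training-free and robust solution that provides camera movement control for off-the-shelf video diffusion models. Unlike previous work, our method requires neither supervised fine-tuning on camera-annotated datasets nor self-supervised training via data augmentation. Instead, it is plug-and-play with most pre-trained video diffusion models and generates camera-controllable videos from a single image or text prompt. Our work is inspired by the layout prior that intermediate latents hold for the generated results: rearranging the noisy pixels in them causes the output content to relocate as well. Since camera movement can also be seen as a kind of pixel rearrangement caused by a change of perspective, videos can be reorganized to follow a specific camera motion if their noisy latents change accordingly. Building on this, we propose CamTrol, which enables robust camera control for video diffusion models. It works in two stages. First, we model image layout rearrangement through explicit camera movement in 3D point cloud space. Second, we generate videos with camera motion by using the layout prior of noisy latents formed from a series of rearranged images. Extensive experiments show superior performance in both video generation and camera motion alignment compared with other fine-tuned methods. We also show that CamTrol generalizes to various base models, and we demonstrate its impressive applications in scalable motion control, handling complicated trajectories, and unsupervised 3D video generation. Videos are available at https://lifedecoder.github.io/CamTrol/.

Paper and project links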

PDF

Summary

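This paper proposes a training-free and robust solution that provides camera control for off-the-shelf video diffusion models. Unlike earlier methods, it requires neither supervised fine-tuning on camera-annotated datasets nor self-supervised training via data augmentation. It is plug-and-play with most pre-trained video diffusion models and generates camera-controllable videos from a single image or text prompt. Building on the layout prior, the authors propose CamTrol, which enables robust camera control for video diffusion models in two stages: first, it models image layout rearrangement through explicit camera movement in 3D point cloud space; second, it generates videos with camera motion by using the layout prior of noisy latents formed from a series of rearranged images. Experiments show that CamTrol outperforms fine-tuned methods in video generation and camera motion alignment, and that it applies to various base models, complicated trajectories, and unsupervised 3D video generation.

Key Takeaways

The paper proposes a training-free camera control solution for video diffusion models.
The method is plug-and-play and compatible with most pre-trained video diffusion models.
Camera control is achieved through the layout prior, with a two-stage process for generating camera-controllable videos.
The first stage models image layout rearrangement in 3D point cloud space.
The second stage uses the layout prior to generate videos with a specific camera motion.
Experiments show that CamTrol performs strongly in video generation and camera motion alignment.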
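
The first stage, rearranging image layout through explicit camera movement in 3D, can be illustrated with a pinhole camera model: lift a pixel to 3D using its depth, move the camera, and project the point back. The sketch below is a toy version under stated assumptions: per-pixel depth is taken as given, the camera only translates (no rotation), and the intrinsics `fx`, `fy`, `cx`, `cy` are made-up values. It is not CamTrol's implementation.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with known depth to a 3D point in camera coordinates."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def project(point, fx, fy, cx, cy):
    """Project a 3D camera-space point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def move_camera(u, v, depth, translation, fx=100.0, fy=100.0, cx=320.0, cy=240.0):
    """Pixel position after translating the camera (pure translation, no rotation)."""
    x, y, z = unproject(u, v, depth, fx, fy, cx, cy)
    tx, ty, tz = translation
    # Moving the camera by t shifts every point by -t in the new camera frame.
    return project((x - tx, y - ty, z - tz), fx, fy, cx, cy)

near = move_camera(320.0, 240.0, depth=2.0, translation=(1.0, 0.0, 0.0))
far = move_camera(320.0, 240.0, depth=4.0, translation=(1.0, 0.0, 0.0))
```

Moving the camera to the right shifts the nearer point 50 pixels to the left and the farther point only 25, which is the parallax that makes the rearranged images look like a real camera move. Applying this to every pixel of an image along a trajectory yields the series of rearranged images that the second stage uses as a layout prior.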

You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs

Authors:Yihong Luo, Xiaolong Chen, Xinghua Qu, Tianyang Hu, Jing Tang

Recently, some works have tried to combine diffusion and Generative Adversarial Networks (GANs) to alleviate the computational cost of the iterative denoising inference in Diffusion Models (DMs). However, existing works in this line suffer from either training instability and mode collapse or subpar one-step generation learning efficiency. To address these issues, we introduce YOSO, a novel generative model designed for rapid, scalable, and high-fidelity one-step image synthesis with high training stability and mode coverage. Specifically, we smooth the adversarial divergence by the denoising generator itself, performing self-cooperative learning. We show that our method can serve as a one-step generation model training from scratch with competitive performance. Moreover, we extend our YOSO to one-step text-to-image generation based on pre-trained models by several effective training techniques (i.e., latent perceptual loss and latent discriminator for efficient training along with the latent DMs; the informative prior initialization (IPI), and the quick adaption stage for fixing the flawed noise scheduler). Experimental results show that YOSO achieves the state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, we show that the YOSO-PixArt-$\alpha$ can generate images in one step trained on 512 resolution, with the capability of adapting to 1024 resolution without extra explicit training, requiring only ~10 A800 days for fine-tuning. Our code is provided at https://github.com/Luo-Yihong/YOSO.

Paper and project links

PDF ICLR 2025

Summary
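Recent work has tried to combine diffusion with Generative Adversarial Networks (GANs) to reduce the computational cost of iterative denoising inference in Diffusion Models (DMs). Existing methods, however, suffer from training instability and mode collapse, or from inefficient one-step generation learning. To address this, the authors introduce YOSO, a generative model designed for rapid, scalable, high-fidelity one-step image synthesis with high training stability and mode coverage. It smooths the adversarial divergence using the denoising generator itself, which amounts to self-cooperative learning. The method can be trained from scratch as a one-step generation model with competitive performance. With several training techniques, YOSO is also extended to one-step text-to-image generation based on pre-trained models. Experiments show that YOSO achieves state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning. In particular, YOSO-PixArt-$\alpha$, trained at 512 resolution, generates images in one step and adapts to 1024 resolution without extra explicit training, requiring only about 10 A800 days for fine-tuning.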

Key Takeaways

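Combining diffusion models with GANs aims to reduce the computational cost of iterative denoising inference.
Existing methods suffer from training instability and mode collapse, or from inefficient one-step generation learning.
YOSO is designed for fast, high-fidelity one-step image synthesis with high training stability and mode coverage.
Self-cooperative learning and several training techniques (latent perceptual loss, latent discriminator, informative prior initialization, and a quick adaptation stage) underpin YOSO's performance.
YOSO-PixArt-$\alpha$, trained at 512 resolution, generates images in one step and adapts to 1024 resolution without extra explicit training.
YOSO achieves state-of-the-art one-step generation performance even with Low-Rank Adaptation (LoRA) fine-tuning.
The code is publicly available on GitHub.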
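
The abstract reports strong one-step results even with LoRA fine-tuning but gives no implementation details. As a reminder of what LoRA changes, the following is a minimal sketch of the standard low-rank update W + (alpha / r) · B · A applied to a single weight matrix, in pure Python. The function names and toy values are hypothetical, and this is not YOSO's training code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    w:      frozen base weight, shape (out, in)
    lora_a: trainable down-projection, shape (r, in)
    lora_b: trainable up-projection, shape (out, r)
    """
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)  # (out, r) @ (r, in) -> (out, in)
    scale = alpha / rank
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


if __name__ == "__main__":
    base = [[1.0, 0.0], [0.0, 1.0]]  # toy 2x2 frozen weight
    a = [[1.0, 2.0]]                 # rank-1 down-projection
    b = [[1.0], [0.0]]               # rank-1 up-projection
    print(lora_effective_weight(base, a, b, alpha=2.0))
```

Only `lora_a` and `lora_b` would be trained in such a setup; the base weight stays frozen, which is why fine-tuning can be much cheaper than updating the full matrix.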

Continuous Diffusion for Mixed-Type Tabular Data

Authors:Markus Mueller, Kathrin Gruber, Dennis Fok

Score-based generative models, commonly referred to as diffusion models, have proven to be successful at generating text and image data. However, their adaptation to mixed-type tabular data remains underexplored. In this work, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. CDTD is based on a novel combination of score matching and score interpolation to enforce a unified continuous noise distribution for both continuous and categorical features. We explicitly acknowledge the necessity of homogenizing distinct data types by relying on model-specific loss calibration and initialization schemes. To further address the high heterogeneity in mixed-type tabular data, we introduce adaptive feature- or type-specific noise schedules. These ensure balanced generative performance across features and optimize the allocation of model capacity across features and diffusion time. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality. Replication code is available at https://github.com/muellermarkus/cdtd.

Paper and project links

PDF published in ICLR 2025

Summary

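Diffusion models have been highly successful at generating text and image data, but their application to mixed-type tabular data remains underexplored. This paper proposes CDTD, a continuous diffusion model for mixed-type tabular data. CDTD combines score matching and score interpolation in a novel way to enforce a unified continuous noise distribution for both continuous and categorical features. The authors explicitly acknowledge the need to homogenize distinct data types, relying on model-specific loss calibration and initialization schemes. To address the high heterogeneity of mixed-type tabular data, they also introduce adaptive feature- or type-specific noise schedules. Experiments show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts sample quality.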

Key Takeaways

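Diffusion models generate text and image data well, but their adaptation to mixed-type tabular data remains underexplored.
CDTD is a continuous diffusion model for mixed-type tabular data that handles different data types by enforcing a unified continuous noise distribution.
CDTD relies on a novel combination of score matching and score interpolation to enforce that distribution.
Model-specific loss calibration and initialization schemes homogenize the distinct data types.
To handle the high heterogeneity of mixed-type tabular data, CDTD introduces adaptive feature- or type-specific noise schedules.
Experiments show that CDTD consistently outperforms state-of-the-art benchmark models.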
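
To make the idea of a shared continuous noise process more concrete, here is a minimal sketch, not the authors' implementation. It assumes categorical values are mapped to embedding vectors, that Gaussian noise is added to continuous values and embeddings alike, and that the score for a categorical feature is obtained by score interpolation: the probability-weighted average of class embeddings stands in for the clean embedding. The geometric schedule, the fixed per-type bounds in `SCHEDULES`, and all names are hypothetical; the paper describes adaptive schedules, which are not reproduced here.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigma_at(t, sigma_min, sigma_max):
    """Assumed geometric schedule: sigma_min at t=0, sigma_max at t=1."""
    return sigma_min ** (1.0 - t) * sigma_max ** t


def add_noise(vector, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma."""
    return [x + sigma * rng.gauss(0.0, 1.0) for x in vector]


def interpolated_score(noisy_embedding, class_logits, class_embeddings, sigma):
    """Score estimate: (expected clean embedding - noisy embedding) / sigma^2."""
    probs = softmax(class_logits)
    dim = len(noisy_embedding)
    expected = [
        sum(p * emb[d] for p, emb in zip(probs, class_embeddings))
        for d in range(dim)
    ]
    return [(expected[d] - noisy_embedding[d]) / sigma ** 2 for d in range(dim)]


# Hypothetical per-type schedule bounds (illustrative values, not from the paper).
SCHEDULES = {"continuous": (0.002, 80.0), "categorical": (0.01, 100.0)}

if __name__ == "__main__":
    rng = random.Random(0)
    t = 0.5
    embeddings = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 classes, 2-d each
    noisy_value = add_noise([0.3], sigma_at(t, *SCHEDULES["continuous"]), rng)
    sigma_cat = sigma_at(t, *SCHEDULES["categorical"])
    noisy_emb = add_noise(embeddings[1], sigma_cat, rng)
    logits = [0.1, 2.0, -0.5]  # stand-in for a trained classifier's output
    print(noisy_value, interpolated_score(noisy_emb, logits, embeddings, sigma_cat))
```

Because both feature types end up as noisy real-valued vectors, a single model can process them together, while the separate schedule entries let each type be noised at its own rate.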

Language-Guided Diffusion Model for Visual Grounding

Authors:Sijia Chen, Baochun Li

Visual grounding (VG) tasks involve explicit cross-modal alignment, as semantically corresponding image regions are to be located for the language phrases provided. Existing approaches complete such visual-text reasoning in a single-step manner. Their performance causes high demands on large-scale anchors and over-designed multi-modal fusion modules based on human priors, leading to complicated frameworks that may be difficult to train and overfit to specific scenarios. Even worse, such once-for-all reasoning mechanisms are incapable of refining boxes continuously to enhance query-region matching. In contrast, in this paper, we formulate an iterative reasoning process by denoising diffusion modeling. Specifically, we propose a language-guided diffusion framework for visual grounding, LG-DVG, which trains the model to progressively reason queried object boxes by denoising a set of noisy boxes with the language guide. To achieve this, LG-DVG gradually perturbs query-aligned ground truth boxes to noisy ones and reverses this process step by step, conditional on query semantics. Extensive experiments for our proposed framework on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way. The source codes are available at https://github.com/iQua/vgbase/tree/main/examples/DiffusionVG.

Paper and project links

PDF 20 pages, 16 figures

Summary

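This paper formulates visual grounding as an iterative reasoning process based on denoising diffusion modeling. The proposed language-guided diffusion framework, LG-DVG, trains the model to progressively reason queried object boxes by denoising a set of noisy boxes under language guidance. Experiments on five widely used datasets validate the superior performance of solving visual grounding, a cross-modal alignment task, in a generative way.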

Key Takeaways

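Visual grounding (VG) tasks involve explicit cross-modal alignment: semantically corresponding image regions must be located for the given language phrases.
Existing approaches perform visual-text reasoning in a single step, which places high demands on large-scale anchors and over-designed multi-modal fusion modules, leading to complicated frameworks that may be difficult to train and may overfit to specific scenarios.
Such once-for-all reasoning cannot refine boxes continuously to enhance query-region matching.
This paper formulates an iterative reasoning process via denoising diffusion modeling, realized in the language-guided diffusion framework LG-DVG.
LG-DVG gradually perturbs query-aligned ground-truth boxes into noisy ones and reverses this process step by step, conditioned on query semantics.
Experiments on five widely used datasets validate the superior performance of LG-DVG on visual grounding.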
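
The forward half of this process, perturbing ground-truth boxes into noisy ones, can be sketched as follows. The abstract does not specify the noise process, so this sketch assumes a DDPM-style Gaussian forward step, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, a linear beta schedule, and boxes in normalized (cx, cy, w, h) format. All function names are hypothetical, and the learned, language-conditioned reverse step is omitted.

```python
import math
import random


def linear_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def q_sample(box, alpha_bar, eps):
    """Noise one box: sqrt(alpha_bar) * box + sqrt(1 - alpha_bar) * eps."""
    a = math.sqrt(alpha_bar)
    b = math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(box, eps)]


def perturb_boxes(boxes, t, alpha_bars, rng):
    """Perturb every ground-truth box to diffusion step t with Gaussian noise."""
    return [
        q_sample(box, alpha_bars[t], [rng.gauss(0.0, 1.0) for _ in box])
        for box in boxes
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    alpha_bars = linear_alpha_bars(1000)
    gt_boxes = [[0.5, 0.5, 0.2, 0.3]]  # one box as (cx, cy, w, h) in [0, 1]
    for t in (0, 499, 999):
        print(t, perturb_boxes(gt_boxes, t, alpha_bars, rng))
```

At small `t` the boxes stay close to the ground truth, and at large `t` they approach pure noise. A grounding model in this setting would be trained to undo such perturbations step by step, given the query text.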

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-02-27/Diffusion%20Models/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.
