Published: 2025-09-24

Updated: 2025-11-27

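⚠️ All summaries below were generated by a large language model and may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use them for serious academic work. They are suitable only for initial screening before reading the papers.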

2025-09-24 update

Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers

Authors:Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, Seungryong Kim

Text-to-image diffusion models excel at translating language prompts into photorealistic images by implicitly grounding textual concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over concatenated image and text tokens, enabling richer and more scalable cross-modal alignment. However, a detailed understanding of how and where these attention maps contribute to image generation remains limited. In this paper, we introduce Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. Through comprehensive analysis, we identify a semantic grounding expert layer, a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions, naturally producing high-quality semantic segmentation masks. We further demonstrate that applying a lightweight fine-tuning scheme with mask-annotated image data enhances the semantic grouping capabilities of these layers and thereby improves both segmentation performance and generated image fidelity. Our findings demonstrate that semantic grouping is an emergent property of diffusion transformers and can be selectively amplified to advance both segmentation and generation performance, paving the way for unified models that bridge visual perception and generation.

Paper and project links

PDF NeurIPS 2025. Project page: https://cvlab-kaist.github.io/Seg4Diff/

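Summary

Text-to-image diffusion models implicitly ground textual concepts in images through cross-modal attention. Recent multi-modal diffusion transformers (MM-DiT) extend this with joint self-attention over concatenated image and text tokens, which gives richer and more scalable cross-modal alignment, yet how and where these attention maps contribute to image generation is still poorly understood. The paper introduces Seg4Diff (Segmentation for Diffusion), a systematic framework for analyzing MM-DiT attention structures, focused on how specific layers propagate semantic information from text to image. The analysis identifies a semantic grounding expert layer: a specific MM-DiT block that consistently aligns text tokens with spatially coherent image regions and thereby yields high-quality semantic segmentation masks. Lightweight fine-tuning on mask-annotated image data strengthens the semantic grouping ability of these layers and improves both segmentation performance and generated image fidelity. The authors conclude that semantic grouping is an emergent property of diffusion transformers that can be selectively amplified, pointing toward unified models for visual perception and generation.

Key Takeaways

Text-to-image diffusion models ground textual concepts in images through cross-modal attention.
Multi-modal diffusion transformers (MM-DiT) achieve cross-modal alignment through joint self-attention over concatenated image and text tokens.
Seg4Diff analyzes the attention structures of MM-DiT to show how semantic information propagates from text to image.
The analysis identifies a semantic grounding expert layer that aligns text tokens with spatially coherent image regions.
Semantic grouping is an emergent property of diffusion transformers; lightweight fine-tuning improves both segmentation and generation performance.
Mask-annotated image data is used to strengthen the model's semantic grouping ability.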
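
As a rough illustration of how a segmentation mask can be read out of attention, the sketch below assigns every image token to the text token it attends to most. It assumes the image-to-text slice of one block's joint attention is already available as an array of shape (height * width, num_text_tokens). The function name `masks_from_attention`, the per-token min-max normalization, and the toy input are hypothetical choices for this example, not the authors' pipeline.

```python
import numpy as np

def masks_from_attention(img_to_text_attn, height, width):
    """Label each image token with the index of its strongest text token.

    img_to_text_attn: array of shape (height * width, num_text_tokens).
    How this slice is extracted from a real MM-DiT block is outside the
    scope of this sketch (assumption: it is given).
    """
    attn = np.asarray(img_to_text_attn, dtype=np.float64)
    if attn.ndim != 2 or attn.shape[0] != height * width:
        raise ValueError("expected shape (height * width, num_text_tokens)")
    # Min-max normalize each text token's map so that a token with globally
    # larger scores does not win everywhere (an illustrative choice).
    lo = attn.min(axis=0, keepdims=True)
    span = attn.max(axis=0, keepdims=True) - lo
    normalized = (attn - lo) / np.where(span > 0, span, 1.0)
    # Argmax over text tokens gives one label per image token.
    return normalized.argmax(axis=1).reshape(height, width)

# Toy 2x2 "image" with two text tokens.
toy_attn = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
label_map = masks_from_attention(toy_attn, 2, 2)
```

In the toy input the top row attends mostly to token 0 and the bottom row to token 1, so the label map splits the image into two horizontal bands.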

ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation

Authors:Guocheng Gordon Qian, Daniil Ostashev, Egor Nemchinov, Avihay Assouline, Sergey Tulyakov, Kuan-Chieh Jackson Wang, Kfir Aberman

Generating high-fidelity images of humans with fine-grained control over attributes such as hairstyle and clothing remains a core challenge in personalized text-to-image synthesis. While prior methods emphasize identity preservation from a reference image, they lack modularity and fail to provide disentangled control over specific visual attributes. We introduce a new paradigm for attribute-specific image prompting, in which distinct sets of reference images are used to guide the generation of individual aspects of human appearance, such as hair, clothing, and identity. Our method encodes these inputs into attribute-specific tokens, which are injected into a pre-trained text-to-image diffusion model. This enables compositional and disentangled control over multiple visual factors, even across multiple people within a single image. To promote natural composition and robust disentanglement, we curate a cross-reference training dataset featuring subjects in diverse poses and expressions, and propose a multi-attribute cross-reference training strategy that encourages the model to generate faithful outputs from misaligned attribute inputs while adhering to both identity and textual conditioning. Extensive experiments show that our method achieves state-of-the-art performance in accurately following both visual and textual prompts. Our framework paves the way for more configurable human image synthesis by combining visual prompting with text-driven generation. Webpage is available at: https://snap-research.github.io/composeme/.

Paper and project links

PDF Accepted to SIGGRAPH Asia 2025, webpage: https://snap-research.github.io/composeme/

Summary

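This paper proposes a new attribute-specific image prompting paradigm in which distinct sets of reference images guide individual aspects of human appearance, such as hair, clothing, and identity. The inputs are encoded into attribute-specific tokens and injected into a pre-trained text-to-image diffusion model, which gives compositional and disentangled control over multiple visual factors, even across several people in a single image. A multi-attribute cross-reference training strategy encourages the model to generate faithful outputs from misaligned attribute inputs while still adhering to identity and text conditioning. By combining visual prompting with text-driven generation, the framework makes human image synthesis more configurable.

Key Takeaways

Proposes an attribute-specific image prompting paradigm in which distinct sets of reference images guide different aspects of human appearance (hair, clothing, identity).
Encodes the inputs into attribute-specific tokens and injects them into a pre-trained text-to-image diffusion model, enabling compositional and disentangled control over multiple visual factors.
Supports attribute control for multiple people within a single image.
A multi-attribute cross-reference training strategy lets the model respond faithfully to misaligned attribute inputs while adhering to identity and text conditioning.
The method makes human image synthesis more configurable.
It combines visual prompting with text-driven generation and reports state-of-the-art performance in following both visual and textual prompts.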

StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models

Authors:Haoxin Yang, Bangzhen Liu, Xuemiao Xu, Cheng Xu, Yuyang Yu, Zikai Huang, Yi Wang, Shengfeng He

The advancement of diffusion models has enhanced the realism of AI-generated content but also raised concerns about misuse, necessitating robust copyright protection and tampering localization. Although recent methods have made progress toward unified solutions, their reliance on post hoc processing introduces considerable application inconvenience and compromises forensic reliability. We propose StableGuard, a novel framework that seamlessly integrates a binary watermark into the diffusion generation process, ensuring copyright protection and tampering localization in Latent Diffusion Models through an end-to-end design. We develop a Multiplexing Watermark VAE (MPW-VAE) by equipping a pretrained Variational Autoencoder (VAE) with a lightweight latent residual-based adapter, enabling the generation of paired watermarked and watermark-free images. These pairs, fused via random masks, create a diverse dataset for training a tampering-agnostic forensic network. To further enhance forensic synergy, we introduce a Mixture-of-Experts Guided Forensic Network (MoE-GFN) that dynamically integrates holistic watermark patterns, local tampering traces, and frequency-domain cues for precise watermark verification and tampered region detection. The MPW-VAE and MoE-GFN are jointly optimized in a self-supervised, end-to-end manner, fostering a reciprocal training between watermark embedding and forensic accuracy. Extensive experiments demonstrate that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.

Paper and project links

PDF Accepted by NeurIPS 2025

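Summary

Advances in diffusion models have made AI-generated content more realistic but have also raised concerns about misuse, which calls for robust copyright protection and tampering localization. StableGuard is a framework that integrates a binary watermark into the diffusion generation process, providing both capabilities in Latent Diffusion Models through an end-to-end design. The authors build a Multiplexing Watermark VAE (MPW-VAE) by equipping a pretrained Variational Autoencoder (VAE) with a lightweight latent residual-based adapter, so that it can generate paired watermarked and watermark-free images. These pairs are fused via random masks to create a diverse dataset for training a tampering-agnostic forensic network. To strengthen forensic synergy, a Mixture-of-Experts Guided Forensic Network (MoE-GFN) dynamically integrates holistic watermark patterns, local tampering traces, and frequency-domain cues for precise watermark verification and tampered-region detection. MPW-VAE and MoE-GFN are jointly optimized in a self-supervised, end-to-end manner so that watermark embedding and forensic accuracy improve each other. Experiments show that StableGuard outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.

Key Takeaways

Advances in diffusion models make AI-generated content more realistic but create misuse risks that require copyright protection and tampering localization.
StableGuard integrates a binary watermark into the diffusion generation process through an end-to-end design.
MPW-VAE adds a lightweight latent residual-based adapter to a pretrained VAE to generate paired watermarked and watermark-free images.
Fusing the pairs via random masks produces a diverse dataset for training a tampering-agnostic forensic network.
MoE-GFN combines holistic watermark patterns, local tampering traces, and frequency-domain cues for precise watermark verification and tampered-region detection.
MPW-VAE and MoE-GFN are jointly optimized in a self-supervised manner, so watermark embedding and forensic accuracy reinforce each other.
StableGuard outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.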
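
The mask-fusion step can be pictured with a small NumPy sketch. It pastes a random rectangle from the watermark-free image into the watermarked one and returns the mask as the ground-truth tampered region. The rectangular mask shape, the `max_frac` parameter, and the function name are assumptions made for illustration; the abstract only states that pairs are "fused via random masks".

```python
import numpy as np

def fuse_with_random_mask(watermarked, clean, rng, max_frac=0.5):
    """Simulate tampering by mixing a watermarked image with its clean twin.

    watermarked, clean: arrays of shape (H, W, C) showing the same content.
    Returns (fused_image, mask) where mask is True on pixels taken from the
    watermark-free image, i.e. the region a forensic network should flag.
    """
    h, w = watermarked.shape[:2]
    # Rectangle size: at least 1 pixel, at most max_frac of each side.
    rh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    rw = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - rh + 1))
    left = int(rng.integers(0, w - rw + 1))
    mask = np.zeros((h, w), dtype=bool)
    mask[top:top + rh, left:left + rw] = True
    # Broadcast the 2-D mask over the channel axis.
    fused = np.where(mask[..., None], clean, watermarked)
    return fused, mask

rng = np.random.default_rng(0)
wm = np.ones((8, 8, 3))      # stand-in for a watermarked image
plain = np.zeros((8, 8, 3))  # stand-in for its watermark-free twin
fused, mask = fuse_with_random_mask(wm, plain, rng)
```

Because the mask is known at construction time, each fused image comes with a free localization label, which is what makes this kind of training self-supervised.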

Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology

Authors:Saghir Alfasly, Wataru Uegami, MD Enamul Hoq, Ghazal Alabtah, H. R. Tizhoosh

Synthetic data generation in histopathology faces unique challenges: preserving tissue heterogeneity, capturing subtle morphological features, and scaling to unannotated datasets. We present a latent diffusion model that generates realistic heterogeneous histopathology images through a novel dual-conditioning approach combining semantic segmentation maps with tissue-specific visual crops. Unlike existing methods that rely on text prompts or abstract visual embeddings, our approach preserves critical morphological details by directly incorporating raw tissue crops from corresponding semantic regions. For annotated datasets (i.e., Camelyon16, Panda), we extract patches ensuring 20-80% tissue heterogeneity. For unannotated data (i.e., TCGA), we introduce a self-supervised extension that clusters whole-slide images into 100 tissue types using foundation model embeddings, automatically generating pseudo-semantic maps for training. Our method synthesizes high-fidelity images with precise region-wise annotations, achieving superior performance on downstream segmentation tasks. When evaluated on annotated datasets, models trained on our synthetic data show competitive performance to those trained on real data, demonstrating the utility of controlled heterogeneous tissue generation. In quantitative evaluation, prompt-guided synthesis reduces Frechet Distance by up to 6X on Camelyon16 (from 430.1 to 72.0) and yields 2-3x lower FD across Panda and TCGA. Downstream DeepLabv3+ models trained solely on synthetic data attain test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of real-data baselines (0.72 and 0.96). By scaling to 11,765 TCGA whole-slide images without manual annotations, our framework offers a practical solution for an urgent need for generating diverse, annotated histopathology data, addressing a critical bottleneck in computational pathology.

Paper and project links

PDF NeurIPS 2025

Summary

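The paper presents a latent diffusion model that generates realistic, heterogeneous histopathology images. It uses a dual-conditioning approach that combines semantic segmentation maps with tissue-specific visual crops. For annotated datasets, training patches are extracted so that they have 20-80% tissue heterogeneity; for unannotated data, a self-supervised extension automatically generates pseudo-semantic maps for training. The synthesized images have high fidelity and precise region-wise annotations and perform well on downstream segmentation tasks. On annotated datasets, models trained on the synthetic data are competitive with models trained on real data. The framework offers a practical way to meet the urgent need for diverse, annotated histopathology data.

Key Takeaways

Presents a latent diffusion model that generates realistic, heterogeneous histopathology images.
Dual conditioning on semantic segmentation maps and tissue-specific visual crops preserves critical morphological details.
The method handles both annotated and unannotated datasets; for unannotated data, a self-supervised extension generates pseudo-semantic maps automatically.
Synthesized images have high fidelity and precise region-wise annotations and perform well on downstream segmentation tasks.
DeepLabv3+ models trained only on synthetic data reach test IoU of 0.71 and 0.95 on Camelyon16 and Panda, within 1-2% of the real-data baselines (0.72 and 0.96).
The framework addresses a bottleneck in computational pathology: the shortage of diverse, annotated histopathology data.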
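
The patch-selection rule for annotated datasets can be sketched as a simple filter over a label mask. The abstract does not define how "20-80% tissue heterogeneity" is measured, so this example assumes it means the fraction of pixels that belong to a target class; the function names, the sliding-window scan, and the `target_label` parameter are all hypothetical.

```python
import numpy as np

def is_heterogeneous(mask_patch, target_label=1, low=0.2, high=0.8):
    """True if the target class covers between low and high of the patch.

    Assumption: heterogeneity is read as the target-class pixel fraction.
    """
    frac = float(np.mean(np.asarray(mask_patch) == target_label))
    return low <= frac <= high

def select_patches(mask, patch_size, stride, **kwargs):
    """Return (top, left) corners of patches that pass the heterogeneity test."""
    coords = []
    h, w = mask.shape
    for top in range(0, h - patch_size + 1, stride):
        for left in range(0, w - patch_size + 1, stride):
            patch = mask[top:top + patch_size, left:left + patch_size]
            if is_heterogeneous(patch, **kwargs):
                coords.append((top, left))
    return coords

# Toy mask: columns 3..7 are "tumor" (label 1), the rest background.
toy_mask = np.zeros((4, 8), dtype=int)
toy_mask[:, 3:] = 1
kept = select_patches(toy_mask, patch_size=4, stride=2)
```

Patches that are entirely one tissue type are skipped, so every selected patch contains a boundary between regions for the model to learn from.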

Elucidating the Design Space of FP4 training

Authors:Robert Hu, Carlo Luschi, Paul Balanca

The increasing computational demands of foundation models have spurred research into low-precision training, with 4-bit floating-point (FP4) formats emerging as a frontier for maximizing hardware throughput. While numerous techniques have been proposed to stabilize FP4 training, they often present isolated solutions with varying, and not always clear, computational overheads. This paper aims to provide a unified view of the design space of FP4 training. We introduce a comprehensive, quantisation gradient-based framework for microscaling quantization that allows for a theoretical analysis of the computational costs associated with different stabilization methods on both the forward and backward passes. Using a simulator built on this framework, we conduct an extensive empirical study across a wide range of machine learning tasks, including regression, image classification, diffusion models, and language models. By systematically evaluating thousands of combinations of techniques, such as novel gradient approximations, rounding strategies, and scaling methods, we identify which configurations offer the most favourable performance-to-overhead trade-off. We find that the techniques enabling the best trade-off involve carefully combining Hadamard transformations, tensor scaling and stochastic rounding. We further find that using UE5M3 as a scaling factor potentially offers a good compromise between range and precision with manageable computational overhead.

Paper and project links

PDF

Summary

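The paper studies FP4 formats for low-precision training and introduces a comprehensive, quantisation gradient-based framework for microscaling quantization that supports a theoretical analysis of the computational cost of different stabilization methods on the forward and backward passes. Using a simulator built on this framework, the authors run an extensive empirical study across regression, image classification, diffusion models, and language models, and systematically evaluate thousands of technique combinations for their performance-to-overhead trade-off. The best trade-offs come from carefully combining Hadamard transformations, tensor scaling, and stochastic rounding, and using UE5M3 as a scaling factor potentially offers a good compromise between range and precision with manageable overhead.

Key Takeaways

Studies FP4 formats for low-precision training, motivated by the growing computational demands of foundation models.
Introduces a quantisation gradient-based framework for microscaling quantization that allows theoretical analysis of the computational cost of different stabilization methods.
Runs an extensive empirical study with a simulator across a range of machine learning tasks.
Systematically evaluates combinations of techniques, including novel gradient approximations, rounding strategies, and scaling methods.
The best trade-offs involve carefully combining Hadamard transformations, tensor scaling, and stochastic rounding.
Using UE5M3 as a scaling factor potentially offers a good compromise between range and precision.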
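
To make block scaling and stochastic rounding concrete, here is a minimal NumPy sketch that fake-quantizes one block of values onto the FP4 E2M1 magnitude grid. It is a simplification under stated assumptions: the per-block scale is kept in full precision (it is not rounded to UE5M3 or any other scale format), and the function name and interface are made up for this example rather than taken from the paper's simulator.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_fp4(x, rng=None):
    """Fake-quantize one block to FP4 with a shared per-block scale.

    If rng is given, use stochastic rounding (round up with probability
    proportional to the distance from the lower grid point); otherwise
    round to the nearest grid point. Assumption: the scale stays in float.
    """
    x = np.asarray(x, dtype=np.float64)
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return np.zeros_like(x)
    scale = amax / FP4_GRID[-1]          # map the block maximum to 6.0
    mag = np.abs(x) / scale
    # Index of the grid point at or above each magnitude.
    hi_idx = np.clip(np.searchsorted(FP4_GRID, mag, side="left"),
                     1, len(FP4_GRID) - 1)
    lo, hi = FP4_GRID[hi_idx - 1], FP4_GRID[hi_idx]
    p_up = (mag - lo) / (hi - lo)
    if rng is None:
        round_up = p_up >= 0.5
    else:
        round_up = rng.random(x.shape) < p_up
    return np.sign(x) * np.where(round_up, hi, lo) * scale

exact = quantize_block_fp4([6.0, -3.0, 0.5, 0.0])
rng = np.random.default_rng(0)
block = np.concatenate([[6.0], np.full(20000, 2.5)])
noisy = quantize_block_fp4(block, rng)
```

The value 2.5 lies halfway between the grid points 2 and 3, so stochastic rounding sends it to each with probability one half and the average stays close to 2.5. That unbiasedness is the usual reason stochastic rounding is applied to gradients.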

SISMA: Semantic Face Image Synthesis with Mamba

Authors:Filippo Botti, Alex Ergasti, Tomaso Fontanini, Claudio Ferrari, Massimo Bertozzi, Andrea Prati

Diffusion Models have become very popular for Semantic Image Synthesis (SIS) of human faces. Nevertheless, their training and inference are computationally expensive due to the quadratic complexity of attention layers. In this paper, we propose a novel architecture called SISMA, based on the recently proposed Mamba. SISMA generates high quality samples by controlling their shape using a semantic mask at a reduced computational demand. We validated our approach through comprehensive experiments with CelebAMask-HQ, revealing that our architecture not only achieves a better FID score but also operates at three times the speed of state-of-the-art architectures. This indicates that the proposed design is a viable, lightweight substitute to transformer-based models.

Paper and project links

PDF

Summary

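The paper proposes SISMA, a new Mamba-based architecture for Semantic Image Synthesis (SIS) of human faces. SISMA controls the shape of generated faces with a semantic mask and produces high-quality samples at reduced computational cost. Experiments on CelebAMask-HQ show that it achieves a better FID score and runs three times faster than state-of-the-art architectures, making it a viable, lightweight substitute for transformer-based models.

Key Takeaways

Diffusion models are popular for Semantic Image Synthesis (SIS) of human faces.
Their training and inference are computationally expensive because attention layers have quadratic complexity.
SISMA is a new architecture, based on Mamba, for generating high-quality samples.
SISMA controls sample shape with a semantic mask at reduced computational demand.
Experiments on CelebAMask-HQ validate the approach.
SISMA achieves a better FID score and runs at three times the speed of state-of-the-art architectures.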

CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Authors:Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.

Paper and project links

PDF

Summary

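Text-to-image diffusion models produce high-quality, diverse images but often fail at compositional alignment when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time methods address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment, without fine-tuning the model. Each strategy has limits on its own: optimization can stall because of poor initialization or unfavorable search trajectories, while exploration may need a prohibitively large number of samples. In addition, neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, so guidance is weak or inconsistent. CARINOX (Category-Aware Reward-based Initial Noise Optimization and Exploration) is a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. It raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, outperforming state-of-the-art optimization and exploration-based methods across all major categories while preserving image quality and diversity.

Key Takeaways

Text-to-image diffusion models such as Stable Diffusion generate high-quality images but struggle with compositional alignment for complex object relationships, attributes, and spatial arrangements.
Existing inference-time methods either optimize or explore the initial noise, and each has limitations when used alone.
Neither single reward metrics nor ad-hoc combinations capture all aspects of compositionality, which weakens guidance.
CARINOX combines optimization and exploration and selects rewards per category based on correlation with human judgments.
CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark.
It outperforms other state-of-the-art methods across all major categories while preserving image quality and diversity.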
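
The combination of exploration and optimization can be sketched on a toy problem. The example below samples several candidate noises, keeps the one with the highest reward, and then refines it by gradient ascent. Everything here is a stand-in: the reward is a simple quadratic with an analytic gradient rather than a text-image alignment score on a generated image, and the category-aware reward selection step is not modeled.

```python
import numpy as np

def explore_then_optimize(reward_fn, grad_fn, dim, rng,
                          n_candidates=8, n_steps=20, lr=0.1):
    """Pick the best of several random noises, then refine it.

    reward_fn(z) -> float and grad_fn(z) -> array are hypothetical
    callables; in a real system the reward would score the image
    generated from the initial noise z.
    """
    # Exploration: draw candidates and keep the best starting point.
    candidates = rng.standard_normal((n_candidates, dim))
    rewards = np.array([reward_fn(z) for z in candidates])
    z = candidates[int(np.argmax(rewards))].copy()
    start_reward = float(rewards.max())
    # Optimization: gradient ascent on the reward from that start.
    for _ in range(n_steps):
        z = z + lr * grad_fn(z)
    return z, start_reward, float(reward_fn(z))

# Toy reward: negative squared distance to a fixed target vector.
target = np.ones(4)
toy_reward = lambda z: -float(np.sum((z - target) ** 2))
toy_grad = lambda z: -2.0 * (z - target)

rng = np.random.default_rng(0)
z_final, r_start, r_final = explore_then_optimize(toy_reward, toy_grad, 4, rng)
```

Starting the optimizer from the best explored sample is one way to reduce the risk of a poor initialization, while the refinement step means far fewer samples are needed than with exploration alone.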

Graph Signal Generative Diffusion Models

Authors:Yigit Berkay Uslu, Samar Hadou, Sergio Rozada, Shirin Saeedi Bidokhti, Alejandro Ribeiro

We introduce U-shaped encoder-decoder graph neural networks (U-GNNs) for stochastic graph signal generation using denoising diffusion processes. The architecture learns node features at different resolutions with skip connections between the encoder and decoder paths, analogous to the convolutional U-Net for image generation. The U-GNN is prominent for a pooling operation that leverages zero-padding and avoids arbitrary graph coarsening, with graph convolutions layered on top to capture local dependencies. This technique permits learning feature embeddings for sampled nodes at deeper levels of the architecture that remain convolutional with respect to the original graph. Applied to stock price prediction – where deterministic forecasts struggle to capture uncertainties and tail events that are paramount – we demonstrate the effectiveness of the diffusion model in probabilistic forecasting of stock prices.

Paper and project links

PDF Submitted to 2026 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2026)

Summary

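The paper introduces U-shaped encoder-decoder graph neural networks (U-GNNs) for stochastic graph signal generation using denoising diffusion processes. The architecture learns node features at different resolutions, with skip connections between the encoder and decoder paths, analogous to the convolutional U-Net for image generation. Its distinguishing feature is a pooling operation that uses zero-padding and avoids arbitrary graph coarsening, with graph convolutions layered on top to capture local dependencies. This lets deeper levels learn feature embeddings for sampled nodes that remain convolutional with respect to the original graph. Applied to stock price prediction, where deterministic forecasts struggle to capture the uncertainties and tail events that matter most, the diffusion model proves effective for probabilistic forecasting.

Key Takeaways

U-shaped encoder-decoder graph neural networks (U-GNNs) are introduced for stochastic graph signal generation with denoising diffusion processes.
The U-GNN learns node features at different resolutions, with skip connections between the encoder and decoder paths.
Its pooling operation uses zero-padding to avoid arbitrary graph coarsening, with graph convolutions on top to capture local dependencies.
Feature embeddings learned for sampled nodes remain convolutional with respect to the original graph.
The diffusion model is effective for probabilistic forecasting of stock prices.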
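
One way to read "pooling with zero-padding" is that sampled nodes keep their values while the remaining nodes are set to zero, so the signal stays defined on the original graph and the same shift operator can be reused at every depth. The NumPy sketch below illustrates that reading together with a polynomial graph filter; it is an interpretation for illustration only, and the function names and the filter form are assumptions, not the paper's code.

```python
import numpy as np

def graph_conv(S, x, taps):
    """Polynomial graph filter: y = sum_k taps[k] * S^k x.

    S is a graph shift operator (e.g. an adjacency matrix) and x a
    signal with one value per node.
    """
    z = np.asarray(x, dtype=np.float64)
    y = np.zeros_like(z)
    for h in taps:
        y = y + h * z
        z = S @ z  # advance to the next power of S applied to x
    return y

def zero_pad_pool(x, keep):
    """Keep the sampled nodes and zero out the rest (no graph coarsening)."""
    x = np.asarray(x, dtype=np.float64)
    pooled = np.zeros_like(x)
    pooled[keep] = x[keep]
    return pooled

# Path graph on three nodes: 0 - 1 - 2.
S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
x = np.array([1.0, 0.0, 0.0])
y = graph_conv(S, x, taps=[1.0, 0.5])
pooled = zero_pad_pool(y, keep=[0, 2])
deeper = graph_conv(S, pooled, taps=[0.0, 1.0])
```

Because the pooled signal still has one entry per original node, the deeper layer applies the same operator `S` without building a smaller graph.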

Stencil: Subject-Driven Generation with Context Guidance

Authors:Gordon Chen, Ziqi Huang, Cheston Tan, Ziwei Liu

Recent text-to-image diffusion models can generate striking visuals from text prompts, but they often fail to maintain subject consistency across generations and contexts. One major limitation of current fine-tuning approaches is the inherent trade-off between quality and efficiency. Fine-tuning large models improves fidelity but is computationally expensive, while fine-tuning lightweight models improves efficiency but compromises image fidelity. Moreover, fine-tuning pre-trained models on a small set of images of the subject can damage the existing priors, resulting in suboptimal results. To this end, we present Stencil, a novel framework that jointly employs two diffusion models during inference. Stencil efficiently fine-tunes a lightweight model on images of the subject, while a large frozen pre-trained model provides contextual guidance during inference, injecting rich priors to enhance generation with minimal overhead. Stencil excels at generating high-fidelity, novel renditions of the subject in less than a minute, delivering state-of-the-art performance and setting a new benchmark in subject-driven generation.

Paper and project links

PDF Accepted as Spotlight at ICIP 2025

Summary

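Current text-to-image diffusion models generate striking visuals from text prompts but often fail to keep a subject consistent across generations and contexts. Existing fine-tuning approaches face a trade-off between quality and efficiency: fine-tuning large models improves fidelity but is computationally expensive, while fine-tuning lightweight models is efficient but compromises image fidelity. Fine-tuning a pre-trained model on a small set of subject images can also damage its existing priors and lead to suboptimal results. Stencil is a framework that jointly employs two diffusion models during inference. It efficiently fine-tunes a lightweight model on images of the subject, while a large frozen pre-trained model provides contextual guidance during inference, injecting rich priors to enhance generation with minimal overhead. Stencil generates high-fidelity, novel renditions of the subject in less than a minute and reports state-of-the-art performance in subject-driven generation.

Key Takeaways

Current text-to-image diffusion models struggle to maintain subject consistency.
Existing fine-tuning approaches trade quality against efficiency.
Fine-tuning a pre-trained model on a small set of subject images can damage its existing priors.
Stencil addresses these issues by jointly employing two diffusion models during inference.
Stencil efficiently fine-tunes a lightweight model on images of the subject.
A large frozen pre-trained model supplies contextual guidance and rich priors during inference.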

VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation

Authors:Feng Han, Chao Gong, Zhipeng Wei, Jingjing Chen, Yu-Gang Jiang

Recently, autoregressive image generation models have wowed audiences with their remarkable capability in creating surprisingly realistic images. Models such as GPT-4o and LlamaGen can not only produce images that faithfully mimic renowned artistic styles like Ghibli, Van Gogh, or Picasso, but also potentially generate Not-Safe-For-Work (NSFW) content, raising significant concerns regarding copyright infringement and ethical use. Despite these concerns, methods to safeguard autoregressive text-to-image models remain underexplored. Previous concept erasure methods, primarily designed for diffusion models that operate in denoising latent space, are not directly applicable to autoregressive models that generate images token by token. To address this critical gap, we propose Visual Contrast Exploitation (VCE), a novel framework comprising: (1) an innovative contrastive image pair construction paradigm that precisely decouples unsafe concepts from their associated content semantics, and (2) a sophisticated DPO-based training approach that enhances the model’s ability to identify and leverage visual contrastive features from image pairs, enabling precise concept erasure. Our comprehensive experiments across three challenging tasks (artist style erasure, explicit content erasure, and object removal) demonstrate that our method effectively secures the model, achieving state-of-the-art results while erasing unsafe concepts and maintaining the integrity of unrelated safe concepts. The code and models are available at https://github.com/Maplebb/VCE.

Paper and project links

PDF

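Summary

Autoregressive image generation models such as GPT-4o and LlamaGen create strikingly realistic images. They can faithfully mimic renowned artistic styles and can also generate Not-Safe-For-Work (NSFW) content, raising concerns about copyright infringement and ethical use. Methods for safeguarding autoregressive text-to-image models remain underexplored, and concept erasure methods designed for diffusion models do not transfer directly to models that generate images token by token. To fill this gap, the paper proposes Visual Contrast Exploitation (VCE), which combines a contrastive image pair construction paradigm that decouples unsafe concepts from their associated content semantics with a DPO-based training approach that helps the model identify and use visual contrastive features from the image pairs. Experiments on artist style erasure, explicit content erasure, and object removal show state-of-the-art results, erasing unsafe concepts while keeping unrelated safe concepts intact. Code and models are available at https://github.com/Maplebb/VCE.

Key Takeaways

Autoregressive image generation models create realistic images and can mimic renowned artistic styles.
They can also generate Not-Safe-For-Work (NSFW) content, raising copyright and ethical concerns.
Safeguards for autoregressive text-to-image models remain underexplored.
The proposed Visual Contrast Exploitation (VCE) framework combines contrastive image pair construction with DPO-based training.
VCE decouples unsafe concepts from their associated content semantics and improves the model's use of visual contrastive features.
Experiments show state-of-the-art results on artist style erasure, explicit content erasure, and object removal.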
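
The abstract names a DPO-based training approach without stating the objective. For orientation, the sketch below implements the standard Direct Preference Optimization loss for a single preference pair. Treating the safe image's token sequence as the "chosen" sample and the unsafe one as "rejected" is an assumption about how the contrastive pairs would be used, and the log-probabilities here are plain numbers rather than outputs of a real model.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full sequence under
    the trainable policy or the frozen reference model.
    """
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the loss equals log(2).
neutral = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Policy prefers the chosen sample more than the reference does.
good = dpo_loss(-1.0, -5.0, -3.0, -3.0)
# Policy prefers the rejected sample instead.
bad = dpo_loss(-5.0, -1.0, -3.0, -3.0)
```

The loss falls as the policy raises the chosen sample's likelihood relative to the reference and lowers the rejected sample's, which is the behavior a concept erasure objective would want for safe versus unsafe images.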

DiffEye: Diffusion-Based Continuous Eye-Tracking Data Generation Conditioned on Natural Images

Authors:Ozgur Kara, Harris Nisar, James M. Rehg

Numerous models have been developed for scanpath and saliency prediction, which are typically trained on scanpaths, which model eye movement as a sequence of discrete fixation points connected by saccades, while the rich information contained in the raw trajectories is often discarded. Moreover, most existing approaches fail to capture the variability observed among human subjects viewing the same image. They generally predict a single scanpath of fixed, pre-defined length, which conflicts with the inherent diversity and stochastic nature of real-world visual attention. To address these challenges, we propose DiffEye, a diffusion-based training framework designed to model continuous and diverse eye movement trajectories during free viewing of natural images. Our method builds on a diffusion model conditioned on visual stimuli and introduces a novel component, namely Corresponding Positional Embedding (CPE), which aligns spatial gaze information with the patch-based semantic features of the visual input. By leveraging raw eye-tracking trajectories rather than relying on scanpaths, DiffEye captures the inherent variability in human gaze behavior and generates high-quality, realistic eye movement patterns, despite being trained on a comparatively small dataset. The generated trajectories can also be converted into scanpaths and saliency maps, resulting in outputs that more accurately reflect the distribution of human visual attention. DiffEye is the first method to tackle this task on natural images using a diffusion model while fully leveraging the richness of raw eye-tracking data. Our extensive evaluation shows that DiffEye not only achieves state-of-the-art performance in scanpath generation but also enables, for the first time, the generation of continuous eye movement trajectories. Project webpage: https://diff-eye.github.io/

Paper and project links

PDF Accepted to NeurIPS 2025

Summary

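The paper introduces DiffEye, a diffusion-based training framework that models continuous and diverse eye movement trajectories during free viewing of natural images. DiffEye uses a Corresponding Positional Embedding (CPE) to align spatial gaze information with the patch-based semantic features of the visual input, and it is trained on raw eye-tracking trajectories rather than scanpaths. This addresses two limitations of scanpath-based models: the loss of information in raw trajectories and the prediction of a single fixed-length scanpath. The generated trajectories can be converted into scanpaths and saliency maps that more accurately reflect the distribution of human visual attention. DiffEye achieves state-of-the-art performance in scanpath generation and, for the first time, enables the generation of continuous eye movement trajectories.

Key Takeaways

DiffEye is a diffusion-based training framework for modeling continuous and diverse eye movement trajectories during free viewing of natural images.
DiffEye uses Corresponding Positional Embedding (CPE) to align spatial gaze information with the semantic features of the visual stimulus.
The model is trained on raw eye-tracking trajectories and captures the inherent variability of human gaze behavior.
DiffEye addresses limitations of scanpath-based models, such as information loss and neglect of the diversity of real-world visual attention.
DiffEye achieves state-of-the-art performance in scanpath generation.
It is the first method to generate continuous eye movement trajectories on natural images with a diffusion model.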
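
The conversion from generated trajectories to a saliency map can be sketched as a smoothed 2-D histogram of gaze positions. The example below is a common generic recipe, not necessarily the conversion the authors use: the Gaussian width `sigma`, the (x, y) pixel convention, and the function name are assumptions.

```python
import numpy as np

def saliency_from_trajectories(trajectories, height, width, sigma=2.0):
    """Accumulate gaze samples into a grid and blur them into a density.

    trajectories: iterable of sequences of (x, y) gaze positions in pixels.
    Returns an array of shape (height, width) that sums to 1 when at
    least one sample is present.
    """
    counts = np.zeros((height, width), dtype=np.float64)
    for traj in trajectories:
        pts = np.asarray(traj, dtype=np.float64)
        cols = np.clip(np.round(pts[:, 0]).astype(int), 0, width - 1)
        rows = np.clip(np.round(pts[:, 1]).astype(int), 0, height - 1)
        np.add.at(counts, (rows, cols), 1.0)  # handles repeated pixels
    # Separable Gaussian blur; the radius is capped so the kernel never
    # exceeds the image size, which keeps mode="same" shape-preserving.
    radius = max(0, min(int(3 * sigma), (min(height, width) - 1) // 2))
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    blur = lambda v: np.convolve(v, kernel, mode="same")
    blurred = np.apply_along_axis(blur, 1, counts)
    blurred = np.apply_along_axis(blur, 0, blurred)
    total = blurred.sum()
    return blurred / total if total > 0 else blurred

toy_trajs = [[(10.0, 12.0), (10.2, 12.1), (11.0, 12.0)]]
sal = saliency_from_trajectories(toy_trajs, 32, 32)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(sal), sal.shape))
```

Pooling samples from many generated trajectories for the same image would give a map that reflects the variability across simulated viewers rather than a single path.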

InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention

Authors:Qiang Xiang, Shuang Sun, Binglei Li, Dejia Song, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu, Junping Zhang

Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advancements in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. Therefore, we propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control including texts and additional visual content. Our method achieves flexible adaption to existing DiT-based T2I models through light-weighted LoRA modules. Additionally, we propose a Layout-to-Image benchmark, Denselayout, a comprehensive benchmark for layout-to-image generation, containing 5k images with 90k instances in total. We further introduce Layout Grounding Score (LGS), an interpretable evaluation metric to more precisely assess the accuracy of L2I generation. Experiments demonstrate that our InstanceAssemble method achieves state-of-the-art performance under complex layout conditions, while exhibiting strong compatibility with diverse style LoRA modules.

Paper and project links

PDF Accepted in NeurIPS 2025

Summary

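Diffusion models are highly capable of generating high-quality images. Recent Layout-to-Image (L2I) generation methods use positional conditions and textual descriptions for precise, controllable image synthesis, but their performance is still suboptimal. The paper proposes InstanceAssemble, an architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control that includes text and additional visual content. The method adapts flexibly to existing DiT-based T2I models through lightweight LoRA modules. The authors also introduce Denselayout, a comprehensive layout-to-image benchmark with 5k images and 90k instances in total, and the Layout Grounding Score (LGS), an interpretable metric for assessing the accuracy of L2I generation more precisely. Experiments show that InstanceAssemble achieves state-of-the-art performance under complex layout conditions and is strongly compatible with diverse style LoRA modules.

Key Takeaways

Diffusion models are highly capable of generating high-quality images.
Layout-to-Image (L2I) generation uses positional conditions and textual descriptions for precise, controllable image synthesis.
Current L2I methods still perform suboptimally.
InstanceAssemble is a new architecture that provides flexible position and content control through instance-assembling attention.
InstanceAssemble adapts to existing DiT-based T2I models through lightweight LoRA modules.
Denselayout is a comprehensive Layout-to-Image benchmark with 5k images and 90k instances.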


A Novel Metric for Detecting Memorization in Generative Models for Brain MRI Synthesis

Authors:Antonio Scardace, Lemuel Puglisi, Francesco Guarnera, Sebastiano Battiato, Daniele Ravì

Deep generative models have emerged as a transformative tool in medical imaging, offering substantial potential for synthetic data generation. However, recent empirical studies highlight a critical vulnerability: these models can memorize sensitive training data, posing significant risks of unauthorized patient information disclosure. Detecting memorization in generative models remains particularly challenging, necessitating scalable methods capable of identifying training data leakage across large sets of generated samples. In this work, we propose DeepSSIM, a novel self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to: i) project images into a learned embedding space and ii) force the cosine similarity between embeddings to match the ground-truth SSIM (Structural Similarity Index) scores computed in the image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, allowing DeepSSIM to estimate similarity reliably without requiring precise spatial alignment. We evaluate DeepSSIM in a case study involving synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions, using 2,195 MRI scans from two publicly available datasets (IXI and CoRR). Compared to state-of-the-art memorization metrics, DeepSSIM achieves superior performance, improving F1 scores by an average of +52.03% over the best existing method. Code and data of our approach are publicly available at the following link: https://github.com/brAIn-science/DeepSSIM.

深度生成模型作为医学影像中的变革性工具，具有生成合成数据的巨大潜力。然而，最近的实证研究突出了一个关键漏洞：这些模型会记忆敏感的训练数据，存在未经授权的泄露患者信息的风险。在生成模型中检测记忆化仍然是一个特别具有挑战性的任务，需要能够在大量生成的样本中识别训练数据泄露的可扩展方法。在这项工作中，我们提出了DeepSSIM，这是一种新型的自我监督指标，用于量化生成模型中的记忆化。DeepSSIM经过训练能够：i)将图像投影到学习到的嵌入空间；ii)强制嵌入之间的余弦相似性与图像空间中计算的真实SSIM（结构相似性指数）分数相匹配。为了捕捉特定领域的解剖特征，训练过程中采用了保持结构增强的方法，允许DeepSSIM在不需要精确空间对齐的情况下可靠地估计相似性。我们在一项涉及由潜在扩散模型（LDM）在易于记忆条件下训练的合成大脑MRI数据的案例研究中评估了DeepSSIM，该研究使用了两个公开数据集（IXI和CoRR）的2,195个MRI扫描。与最新的记忆化指标相比，DeepSSIM性能卓越，平均将F1分数提高了+52.03%，超过了现有最佳方法。我们的方法的相关代码和数据可在以下链接中找到：https://github.com/brAIn-science/DeepSSIM。

Paper and project links

PDF

Summary
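Deep generative models offer substantial potential for synthetic data generation in medical imaging. However, recent empirical studies reveal a critical vulnerability: these models can memorize sensitive training data, risking unauthorized disclosure of patient information. Detecting memorization is challenging and calls for scalable methods that identify training-data leakage across large sets of generated samples. This work proposes DeepSSIM, a self-supervised metric for quantifying memorization in generative models. DeepSSIM is trained to project images into a learned embedding space and to force the cosine similarity between embeddings to match the SSIM (Structural Similarity Index) scores computed in image space. To capture domain-specific anatomical features, training incorporates structure-preserving augmentations, so DeepSSIM estimates similarity reliably without precise spatial alignment. It is evaluated in a case study on synthetic brain MRI data generated by a Latent Diffusion Model (LDM) trained under memorization-prone conditions. Compared with state-of-the-art memorization metrics, DeepSSIM improves F1 scores by an average of +52.03% over the best existing method. Code and data are publicly available at https://github.com/brAIn-science/DeepSSIM.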

Key Takeaways

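Deep generative models have great potential in medical imaging but risk memorizing sensitive training data.
Detecting memorization in generative models requires scalable methods.
DeepSSIM is a new self-supervised metric for quantifying memorization in generative models.
DeepSSIM projects images into an embedding space and trains the cosine similarity between embeddings to match SSIM scores.
Structure-preserving augmentations help DeepSSIM capture domain-specific anatomical features.
In a case study on synthetic brain MRI data, DeepSSIM outperforms state-of-the-art memorization metrics.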
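
The DeepSSIM training objective described above (cosine similarity between embeddings regressed onto SSIM scores) can be illustrated with a minimal sketch. It assumes mean squared error as the regression loss, which the abstract does not specify, and it takes embeddings and SSIM targets as plain lists instead of computing them with a network. The function names are hypothetical and are not taken from the DeepSSIM repository.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similarity_matching_loss(embedding_pairs, ssim_targets):
    """Mean squared error between embedding cosine similarity and target SSIM.

    MSE is an assumption made for this sketch; the abstract only says the
    cosine similarity is forced to match the ground-truth SSIM.
    """
    errors = [
        (cosine_similarity(a, b) - target) ** 2
        for (a, b), target in zip(embedding_pairs, ssim_targets)
    ]
    return sum(errors) / len(errors)

# Toy data: identical embeddings paired with SSIM 1.0,
# orthogonal embeddings paired with SSIM 0.0.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([1.0, 0.0], [0.0, 1.0])]
targets = [1.0, 0.0]
loss = similarity_matching_loss(pairs, targets)
```

Once such an embedding is trained, comparing a generated sample against many training images reduces to cosine similarities between vectors, which is what makes the approach scalable across large sets of samples.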


V-CECE: Visual Counterfactual Explanations via Conceptual Edits

Authors:Nikolaos Spanos, Maria Lymperaiou, Giorgos Filandrianos, Konstantinos Thomas, Athanasios Voulodimos, Giorgos Stamou

Recent black-box counterfactual generation frameworks fail to take into account the semantic content of the proposed edits, while relying heavily on training to guide the generation process. We propose a novel, plug-and-play black-box counterfactual generation framework, which suggests step-by-step edits based on theoretical guarantees of optimal edits to produce human-level counterfactual explanations with zero training. Our framework utilizes a pre-trained image editing diffusion model, and operates without access to the internals of the classifier, leading to an explainable counterfactual generation process. Throughout our experimentation, we showcase the explanatory gap between human reasoning and neural model behavior by utilizing Convolutional Neural Network (CNN), Vision Transformer (ViT), and Large Vision Language Model (LVLM) classifiers, substantiated through a comprehensive human evaluation.

最近的黑盒反事实生成框架未能考虑到提议编辑的语义内容，同时却严重依赖于训练来指导生成过程。我们提出了一种新颖即插即用的黑盒反事实生成框架，该框架基于最优编辑的理论保证，提出逐步编辑，以产生无需训练即可达到人类水平的反事实解释。我们的框架利用预训练的图像编辑扩散模型，无需访问分类器的内部组件，从而实现可解释的反事实生成过程。在我们的实验中，我们展示了人类推理和神经网络模型行为之间的解释差距，使用的分类器包括卷积神经网络（CNN）、视觉转换器（ViT）和大型视觉语言模型（LVLM），并通过全面的人类评估得到了证实。

Paper and project links

PDF Accepted in NeurIPS 2025

Summary

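Recent black-box counterfactual generation frameworks ignore the semantic content of the proposed edits and rely heavily on training to guide generation. The paper proposes a plug-and-play black-box counterfactual generation framework that suggests step-by-step edits based on theoretical guarantees of optimal edits, producing human-level counterfactual explanations with zero training. The framework uses a pre-trained image editing diffusion model and operates without access to the classifier's internals, which makes the counterfactual generation process explainable. Experiments with Convolutional Neural Network (CNN), Vision Transformer (ViT), and Large Vision Language Model (LVLM) classifiers expose the explanatory gap between human reasoning and neural model behavior, substantiated by a comprehensive human evaluation.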

Key Takeaways

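Current black-box counterfactual generation frameworks ignore the semantic content of edits.
The proposed black-box framework suggests step-by-step edits based on theoretical guarantees of optimal edits.
The framework produces human-level counterfactual explanations without any training.
It uses a pre-trained image editing diffusion model.
It needs no access to the classifier's internals, which yields an explainable counterfactual generation process.
Experiments show an explanatory gap between human reasoning and neural model behavior.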


Dynamic Classifier-Free Diffusion Guidance via Online Feedback

Authors:Pinelopi Papalampidi, Olivia Wiles, Ira Ktena, Aleksandar Shtedritski, Emanuele Bugliarello, Ivana Kajic, Isabela Albuquerque, Aida Nematzadeh

Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, yet its effectiveness is limited by the use of static guidance scales. This “one-size-fits-all” approach fails to adapt to the diverse requirements of different prompts; moreover, prior solutions like gradient-based correction or fixed heuristic schedules introduce additional complexities and fail to generalize. In this work, we challenge this static paradigm by introducing a framework for dynamic CFG scheduling. Our method leverages online feedback from a suite of general-purpose and specialized small-scale latent-space evaluations, such as CLIP for alignment, a discriminator for fidelity and a human preference reward model, to assess generation quality at each step of the reverse diffusion process. Based on this feedback, we perform a greedy search to select the optimal CFG scale for each timestep, creating a unique guidance schedule tailored to every prompt and sample. We demonstrate the effectiveness of our approach on both small-scale models and the state-of-the-art Imagen 3, showing significant improvements in text alignment, visual quality, text rendering and numerical reasoning. Notably, when compared against the default Imagen 3 baseline, our method achieves up to 53.8% human preference win-rate for overall preference, a figure that increases up to 55.5% on prompts targeting specific capabilities like text rendering. Our work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and provides an efficient and generalizable framework to achieve it.

无分类器引导（CFG）是文本到图像扩散模型的核心，但其有效性受到静态引导尺度使用的限制。“一刀切”的方法无法满足不同提示的多样化需求；此外，先前的解决方案，如基于梯度的校正或固定的启发式时间表，引入了额外的复杂性，并且无法推广。在这项工作中，我们通过引入动态CFG调度框架来挑战这一静态模式。我们的方法利用一系列通用和专用小型潜在空间评估的在线反馈，如CLIP对齐、鉴别器保真度以及人类偏好奖励模型，来评估反向扩散过程每一步的生成质量。基于这些反馈，我们执行贪婪搜索来选择每个时间步的最佳CFG尺度，为每个提示和样本创建独特的指导时间表。我们在小型模型以及最先进的Imagen 3上都证明了我们的方法的有效性，在文本对齐、视觉质量、文本渲染和数值推理方面取得了显著的改进。特别是与默认的Imagen 3基线相比，我们的方法在总体偏好上达到了53.8%的人类偏好胜率，在针对特定能力（如文本渲染）的提示上，这一数字最高可达55.5%。我们的工作证明，最佳的引导时间表本质上是动态的，并且依赖于提示，并提供了一个高效且可推广的框架来实现它。

Paper and project links

PDF

Summary

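This paper introduces a dynamic scheduling framework for classifier-free guidance (CFG) in text-to-image diffusion models. The framework assesses generation quality through online feedback and, based on that feedback, runs a greedy search to select the best CFG scale at each timestep, producing a guidance schedule tailored to every prompt and sample. The method is validated on small-scale models and on the state-of-the-art Imagen 3, with significant improvements in text alignment, visual quality, text rendering, and numerical reasoning. Against the default Imagen 3 baseline, it reaches a human preference win-rate of up to 53.8% for overall preference, rising to up to 55.5% on prompts that target specific capabilities such as text rendering. The work establishes that the optimal guidance schedule is inherently dynamic and prompt-dependent, and it provides an efficient, generalizable framework for achieving it.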

Key Takeaways

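Classifier-free guidance (CFG) is a cornerstone of text-to-image diffusion models, but static guidance scales limit its effectiveness.
Prior solutions such as gradient-based correction or fixed heuristic schedules add complexity and fail to generalize.
The paper proposes a dynamic CFG scheduling framework that uses online feedback to assess generation quality.
A greedy search selects the best CFG scale at each timestep, giving every prompt and sample its own guidance schedule.
The method is validated on small-scale models and on the state-of-the-art Imagen 3.
Text alignment, visual quality, text rendering, and numerical reasoning improve significantly, and the human preference win-rate against the default Imagen 3 baseline reaches up to 53.8%.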
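
The greedy per-timestep selection of the CFG scale can be sketched as below. The guidance formula `eps_uncond + scale * (eps_cond - eps_uncond)` is the standard CFG combination. Everything else is an assumption made for illustration: the latent is a single float, and `predict_noise`, `denoise_step`, and `score` are hypothetical stand-ins for the diffusion model, the sampler update, and the latent-space evaluators (CLIP, discriminator, reward model) that the paper describes.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance combination of two noise predictions."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def greedy_cfg_schedule(latent, num_steps, candidate_scales,
                        predict_noise, denoise_step, score):
    """At each reverse step, try every candidate scale and keep the best-scoring one.

    predict_noise, denoise_step and score are hypothetical callables supplied
    by the caller; they are placeholders, not the paper's components.
    """
    schedule = []
    for t in reversed(range(num_steps)):
        eps_uncond, eps_cond = predict_noise(latent, t)
        best = None
        for scale in candidate_scales:
            eps = cfg_combine(eps_uncond, eps_cond, scale)
            candidate = denoise_step(latent, eps, t)
            candidate_score = score(candidate, t)
            if best is None or candidate_score > best[0]:
                best = (candidate_score, scale, candidate)
        _, chosen_scale, latent = best
        schedule.append(chosen_scale)
    return latent, schedule

# Toy stand-ins: constant noise predictions, a fixed step size, and a scorer
# that prefers latents close to zero.
final_latent, schedule = greedy_cfg_schedule(
    latent=1.0,
    num_steps=3,
    candidate_scales=[1.0, 3.0, 7.0],
    predict_noise=lambda x, t: (0.0, 1.0),
    denoise_step=lambda x, eps, t: x - 0.1 * eps,
    score=lambda x, t: -(x ** 2),
)
```

In this toy run the selected scale changes from step to step, which is the behavior the paper argues for: no single static scale is best at every timestep.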


Image-to-Brain Signal Generation for Visual Prosthesis with CLIP Guided Multimodal Diffusion Models

Authors:Ganxi Xu, Jinyi Long, Jia Zhang

Visual prostheses hold great promise for restoring vision in blind individuals. While researchers have successfully utilized M/EEG signals to evoke visual perceptions during the brain decoding stage of visual prostheses, the complementary process of converting images into M/EEG signals in the brain encoding stage remains largely unexplored, hindering the formation of a complete functional pipeline. In this work, we present, to our knowledge, the first image-to-brain signal framework that generates M/EEG from images by leveraging denoising diffusion probabilistic models enhanced with cross-attention mechanisms. Specifically, the proposed framework comprises two key components: a pretrained CLIP visual encoder that extracts rich semantic representations from input images, and a cross-attention enhanced U-Net diffusion model that reconstructs brain signals through iterative denoising. Unlike conventional generative models that rely on simple concatenation for conditioning, our cross-attention modules capture the complex interplay between visual features and brain signal representations, enabling fine-grained alignment during generation. We evaluate the framework on two multimodal benchmark datasets and demonstrate that it generates biologically plausible brain signals. We also present visualizations of M/EEG topographies across all subjects in both datasets, providing intuitive demonstrations of intra-subject and inter-subject variations in brain signals.

视觉假体对恢复盲人的视力有着巨大潜力。虽然研究者已成功在视觉假体的脑解码阶段利用M/EEG信号来激发视觉感知，但大脑编码阶段将图像转换为M/EEG信号这一互补过程仍被大大忽视，阻碍了完整功能管道的形成。在这项工作中，我们呈现了（据我们所知）首个图像到脑信号的框架，该框架借助去噪扩散概率模型并增强交叉注意力机制，能够从图像生成M/EEG信号。具体来说，所提出的框架包含两个关键组件：预训练的CLIP视觉编码器，用于从输入图像中提取丰富的语义表示；增强交叉注意力的U-Net扩散模型，通过迭代去噪重建脑信号。不同于依赖简单拼接进行条件设置的传统生成模型，我们的交叉注意力模块捕捉视觉特征和脑信号表示之间的复杂交互，从而在生成过程中实现精细对齐。我们在两个多模式基准数据集上评估了该框架，并证明其能够生成生物上合理的脑信号。此外，我们还展示了两个数据集中所有受试者的M/EEG地形图可视化，直观地展示了脑信号的受试者内和受试者间变化。

Paper and project links

PDF

Summary

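This paper presents an image-to-brain signal framework, based on denoising diffusion probabilistic models enhanced with cross-attention mechanisms, that converts images into M/EEG signals for visual prosthesis research. The framework comprises a pretrained CLIP visual encoder, which extracts rich semantic representations from input images, and a cross-attention enhanced U-Net diffusion model, which reconstructs brain signals through iterative denoising. The cross-attention modules capture the complex interplay between visual features and brain signal representations, enabling fine-grained alignment during generation. Evaluation on two multimodal benchmark datasets shows that the framework generates biologically plausible brain signals, and visualizations of M/EEG topographies across all subjects in both datasets illustrate intra-subject and inter-subject variation.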

Key Takeaways

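The paper proposes what the authors describe as, to their knowledge, the first image-to-brain signal framework, built on denoising diffusion probabilistic models with cross-attention mechanisms.
The framework has two parts: a pretrained CLIP visual encoder that extracts image semantics and a cross-attention enhanced U-Net diffusion model that reconstructs brain signals through iterative denoising.
Cross-attention modules capture the complex interplay between visual features and brain signals, enabling finer alignment during generation.
Evaluation on two multimodal benchmark datasets shows that the framework generates biologically plausible brain signals.
Visualizations of M/EEG topographies show intra-subject and inter-subject variation across both datasets.
The study supports the further development of visual prostheses and may contribute to progress in restoring vision for blind individuals.
The process is still at an exploratory stage and needs further research and refinement before practical application.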
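
Cross-attention conditioning, as opposed to simple concatenation, can be illustrated with a single-head scaled dot-product sketch. It assumes that queries come from brain-signal tokens and that keys and values come from image features. It omits the learned projection matrices a real model would have, and all names are illustrative, not taken from the paper's code.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: signal-side tokens; keys/values: image-side features
    (this assignment of roles is an assumption for the sketch).
    """
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

# A zero query attends uniformly, so the output is the mean of the values.
image_keys = [[1.0, 0.0], [0.0, 1.0]]
image_values = [[1.0, 0.0], [3.0, 2.0]]
attended = cross_attention([[0.0, 0.0]], image_keys, image_values)
```

Unlike concatenation, each signal token here receives its own weighting of the image features, which is the property behind the fine-grained alignment the paper attributes to its cross-attention modules.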


Subjective Camera 1.0: Bridging Human Cognition and Visual Reconstruction through Sequence-Aware Sketch-Guided Diffusion

Authors:Haoyang Chen, Dongfang Sun, Caoyuan Ma, Shiqin Wang, Kewei Zhang, Zheng Wang, Zhixiang Wang

We introduce the concept of a subjective camera to reconstruct meaningful moments that physical cameras fail to capture. We propose Subjective Camera 1.0, a framework for reconstructing real-world scenes from readily accessible subjective readouts, i.e., textual descriptions and progressively drawn rough sketches. Built on optimization-based alignment of diffusion models, our approach avoids large-scale paired training data and mitigates generalization issues. To address the challenge of integrating multiple abstract concepts in real-world scenarios, we design a Sequence-Aware Sketch-Guided Diffusion framework with three loss terms for concept-wise sequential optimization, following the natural order of subjective readouts. Experiments on two datasets demonstrate that our method achieves state-of-the-art performance in image quality as well as spatial and semantic alignment with target scenes. User studies with 40 participants further confirm that our approach is consistently preferred. Our project page is at: subjective-camera.github.io

我们引入了主观相机的概念，以重建物理相机未能捕捉的有意义时刻。我们提出了主观相机1.0，这是一个从易于获取的主观读数（如文本描述和逐步绘制的粗略草图）重建现实场景的框架。我们的方法建立在基于优化的扩散模型对齐基础上，避免了大规模配对训练数据，并缓解了泛化问题。为了解决在现实场景中整合多个抽象概念的挑战，我们设计了一个序列感知草图引导扩散框架，该框架包含三个用于概念级顺序优化的损失项，遵循主观读数的自然顺序。在两个数据集上的实验表明，我们的方法在图像质量以及空间和目标场景语义对齐方面达到了最新技术水平。有40名参与者参与的用户研究进一步证实了我们方法的优越性。我们的项目页面为：[主观相机的github.io页面]（subjective-camera.github.io）。

Paper and project links

PDF

Summary
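The concept of a subjective camera is introduced to reconstruct meaningful moments that physical cameras fail to capture. Subjective Camera 1.0 is a framework for reconstructing real-world scenes from readily accessible subjective readouts, i.e., textual descriptions and progressively drawn rough sketches. Built on optimization-based alignment of diffusion models, it avoids large-scale paired training data and mitigates generalization issues. To integrate multiple abstract concepts in real-world scenarios, the authors design a Sequence-Aware Sketch-Guided Diffusion framework with three loss terms for concept-wise sequential optimization, following the natural order of subjective readouts. Experiments on two datasets show state-of-the-art performance in image quality and in spatial and semantic alignment with target scenes. A user study with 40 participants further confirms that the approach is consistently preferred.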

Key Takeaways

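The concept of a subjective camera is introduced to reconstruct meaningful moments that physical cameras fail to capture.
Subjective Camera 1.0 reconstructs real-world scenes from textual descriptions and rough sketches.
Optimization-based alignment of diffusion models avoids large-scale paired training data and mitigates generalization issues.
A Sequence-Aware Sketch-Guided Diffusion framework performs concept-wise sequential optimization with three loss terms.
The framework follows the natural order of subjective readouts to integrate multiple abstract concepts into real-world scenes.
Experiments on two datasets show state-of-the-art image quality and spatial and semantic alignment.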


Single-step Diffusion for Image Compression at Ultra-Low Bitrates

Authors:Chanung Park, Joo Chan Lee, Jong Hwan Ko

Although there have been significant advancements in image compression techniques, such as standard and learned codecs, these methods still suffer from severe quality degradation at extremely low bits per pixel. While recent diffusion-based models provided enhanced generative performance at low bitrates, they often yield limited perceptual quality and prohibitive decoding latency due to multiple denoising steps. In this paper, we propose the single-step diffusion model for image compression that delivers high perceptual quality and fast decoding at ultra-low bitrates. Our approach incorporates two key innovations: (i) Vector-Quantized Residual (VQ-Residual) training, which factorizes a structural base code and a learned residual in latent space, capturing both global geometry and high-frequency details; and (ii) rate-aware noise modulation, which tunes denoising strength to match the desired bitrate. Extensive experiments show that our method achieves comparable compression performance to state-of-the-art methods while improving decoding speed by about 50x compared to prior diffusion-based methods, greatly enhancing the practicality of generative codecs.

尽管图像压缩技术（如标准和基于学习的编码技术）已经取得了重大进展，但这些方法仍然面临极低比特率下的严重质量下降问题。虽然最近的基于扩散的模型在低比特率下提供了增强的生成性能，但由于多次去噪步骤，它们通常产生有限的感知质量和禁止的解码延迟。在本文中，我们提出了用于图像压缩的单步扩散模型，该模型在超低比特率下提供高感知质量和快速解码。我们的方法包含两个关键创新点：（i）向量量化残差（VQ-Residual）训练，它在潜在空间中分解结构基础代码和学习的残差，捕获全局几何和高频细节；（ii）速率感知噪声调制，它调整去噪强度以匹配所需的比特率。大量实验表明，我们的方法达到了与最新技术相当的性能，同时与先前的基于扩散的方法相比，解码速度提高了约50倍，大大提高了生成编码器的实用性。

Paper and project links

PDF

Summary
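This paper proposes a single-step diffusion model for image compression that delivers high perceptual quality and fast decoding at ultra-low bitrates. Two techniques underpin it: Vector-Quantized Residual (VQ-Residual) training, which factorizes a structural base code and a learned residual in latent space to capture both global geometry and high-frequency details, and rate-aware noise modulation, which tunes denoising strength to match the desired bitrate. Experiments show compression performance comparable to state-of-the-art methods and decoding about 50x faster than prior diffusion-based methods, greatly enhancing the practicality of generative codecs.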

Key Takeaways

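Recent diffusion-based models improve generative performance for image compression at low bitrates.
Existing approaches still suffer from severe quality degradation at extremely low bits per pixel, and multi-step diffusion decoding adds prohibitive latency.
The proposed single-step diffusion model targets both problems, delivering high perceptual quality and fast decoding.
VQ-Residual training factorizes a structural base code and a learned residual in latent space.
Rate-aware noise modulation tunes denoising strength to match the desired bitrate.
Experiments show compression performance comparable to state-of-the-art methods.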


QVGen: Pushing the Limit of Quantized Video Generative Models

Authors:Yushi Huang, Ruihao Gong, Jing Liu, Yifu Ding, Chengtao Lv, Haotong Qin, Jun Zhang

Video diffusion models (DMs) have enabled high-quality video synthesis. Yet, their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. As a commonly adopted solution, quantization has shown notable success in reducing cost for image DMs, while its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules ($\Phi$) to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of $\Phi$, we propose a rank-decay strategy that progressively eliminates $\Phi$. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization $\mathbf{\gamma}$ to identify and decay low-contributing components. This strategy retains performance while zeroing out inference overhead. Extensive experiments across $4$ state-of-the-art (SOTA) video DMs, with parameter sizes ranging from $1.3$B to $14$B, show that QVGen is the first to reach full-precision comparable quality under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements of $+25.28$ in Dynamic Degree and $+8.43$ in Scene Consistency on VBench.

视频扩散模型（DMs）已经能够实现高质量的视频合成。然而，其巨大的计算和内存需求对实际部署带来了严峻挑战，即使在高端GPU上也是如此。作为一种常用的解决方案，量化在降低图像DM的成本方面取得了显著的成效，而直接应用于视频DM则仍然无效。在本文中，我们提出了QVGen，这是一个针对极端低比特量化（例如4位及以下）下高性能和推理效率的视频DM量身定制的量化感知训练（QAT）框架。我们首先进行理论分析，证明降低梯度范数对于促进QAT的收敛至关重要。为此，我们引入了辅助模块（Φ）来缓解大量化误差，从而显着增强收敛性。为了消除Φ的推理开销，我们提出了一种排名衰减策略，该策略逐步消除Φ。具体来说，我们反复使用奇异值分解（SVD）和提出的基于排名的正则化γ来识别和衰减低贡献成分。此策略在保持性能的同时消除了推理开销。在4种最先进的视频DMs上的广泛实验，参数大小从1.3B到14B不等，表明QVGen首次在4位设置下达到全精度相当的质量。而且，它显著优于现有方法。例如，我们的3位CogVideoX-2B在VBench上的动态度提高了+25.28，场景一致性提高了+8.43。

Paper and project links

PDF Our code will be released upon acceptance

Summary

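Video diffusion models (DMs) enable high-quality video synthesis, but their substantial computational and memory demands hinder real-world deployment, even on high-end GPUs. Quantization, a commonly adopted solution, has reduced cost for image DMs, yet applying it directly to video DMs remains ineffective. The paper presents QVGen, a quantization-aware training (QAT) framework for high-performance, inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). A theoretical analysis shows that reducing the gradient norm is essential for QAT convergence. Auxiliary modules ($\Phi$) mitigate large quantization errors and significantly improve convergence. To eliminate their inference overhead, a rank-decay strategy progressively removes $\Phi$, retaining performance while zeroing out inference overhead. Experiments show that QVGen is the first to reach quality comparable to full precision under 4-bit settings and that it significantly outperforms existing methods. For instance, the 3-bit CogVideoX-2B improves Dynamic Degree by $+25.28$ and Scene Consistency by $+8.43$ on VBench.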

Key Takeaways

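Video diffusion models (DMs) face substantial computational and memory demands.
Quantization reduces cost effectively for image DMs, but its direct application to video DMs is ineffective.
QVGen is a quantization-aware training (QAT) framework for video DMs under extremely low-bit quantization.
Theoretical analysis shows that reducing the gradient norm is essential for QAT convergence.
Auxiliary modules mitigate quantization errors and improve convergence.
A rank-decay strategy progressively eliminates the auxiliary modules, removing their inference overhead.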


DiCo: Revitalizing ConvNets for Scalable and Efficient Diffusion Modeling

Authors:Yuang Ai, Qihang Fan, Xuefeng Hu, Zhenheng Yang, Ran He, Huaibo Huang

Diffusion Transformer (DiT), a promising diffusion model for visual generation, demonstrates impressive performance but incurs significant computational overhead. Intriguingly, analysis of pre-trained DiT models reveals that global self-attention is often redundant, predominantly capturing local patterns, highlighting the potential for more efficient alternatives. In this paper, we revisit convolution as an alternative building block for constructing efficient and expressive diffusion models. However, naively replacing self-attention with convolution typically results in degraded performance. Our investigations attribute this performance gap to the higher channel redundancy in ConvNets compared to Transformers. To resolve this, we introduce a compact channel attention mechanism that promotes the activation of more diverse channels, thereby enhancing feature diversity. This leads to Diffusion ConvNet (DiCo), a family of diffusion models built entirely from standard ConvNet modules, offering strong generative performance with significant efficiency gains. On class-conditional ImageNet generation benchmarks, DiCo-XL achieves an FID of 2.05 at 256x256 resolution and 2.53 at 512x512, with a 2.7x and 3.1x speedup over DiT-XL/2, respectively. Furthermore, experimental results on MS-COCO demonstrate that the purely convolutional DiCo exhibits strong potential for text-to-image generation. Code: https://github.com/shallowdream204/DiCo.

扩散Transformer（DiT）是一种有前景的视觉生成扩散模型，它表现出令人印象深刻的效果，但带来了巨大的计算开销。有趣的是，对预训练的DiT模型的分析表明，全局自注意力通常是冗余的，主要捕捉局部模式，这突显了更高效替代方案的潜力。在本文中，我们重新审视卷积作为构建高效且表达性扩散模型的替代构建块。然而，天真地用卷积替换自注意力通常会导致性能下降。我们的调查将这一性能差距归因于与Transformer相比，ConvNet中的通道冗余度更高。为解决这一问题，我们引入了一种紧凑的通道注意力机制，该机制促进了更多不同通道的激活，从而增强了特征多样性。这导致了完全由标准ConvNet模块构建的扩散模型家族——扩散卷积网络（DiCo）。在条件ImageNet生成基准测试中，DiCo-XL在256x256分辨率下实现了FID为2.05，在512x512分辨率下实现了2.53，相对于DiT-XL/2分别实现了2.7x和3.1x的加速。此外，在MS-COCO上的实验结果表明，完全卷积的DiCo在文本到图像生成方面表现出强大的潜力。代码：https://github.com/shallowdream204/DiCo。

Paper and project links

PDF NeurIPS 2025 Spotlight

Summary

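This paper examines the computational cost of Diffusion Transformer (DiT) and proposes an alternative. Analysis of pre-trained DiT models reveals that global self-attention is often redundant and predominantly captures local patterns. The authors therefore revisit convolution as a building block for efficient diffusion models and propose Diffusion ConvNet (DiCo). To address the higher channel redundancy of ConvNets, they introduce a compact channel attention mechanism that enhances feature diversity. DiCo offers strong generative performance with significant efficiency gains on class-conditional ImageNet generation, and results on MS-COCO show strong potential for text-to-image generation. Code is available at https://github.com/shallowdream204/DiCo.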
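
The abstract does not spell out how the compact channel attention mechanism is built. As a rough illustration of channel attention in general, the sketch below uses a generic gate: global average pooling per channel, a linear mixing of the pooled values, a sigmoid, and a per-channel rescaling. This design, the function name, and the weight shapes are all assumptions for illustration and may differ from what DiCo actually implements.

```python
import math

def channel_attention(feature_map, weight, bias):
    """Generic channel gate over a C x H x W feature map given as nested lists.

    weight is a C x C mixing matrix and bias has length C; both are
    hypothetical learned parameters in this sketch.
    """
    # Squeeze: one average value per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Mix pooled statistics across channels and squash each gate into (0, 1).
    gates = []
    for weight_row, b in zip(weight, bias):
        z = sum(w * p for w, p in zip(weight_row, pooled)) + b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    # Rescale every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for gate, channel in zip(gates, feature_map)
    ]
    return scaled, gates

# With zero weights and biases every gate is 0.5, so activations are halved.
features = [[[2.0, 4.0]], [[1.0, 1.0]]]
scaled, gates = channel_attention(
    features,
    weight=[[0.0, 0.0], [0.0, 0.0]],
    bias=[0.0, 0.0],
)
```

A gate of this kind adds only a small number of parameters relative to the convolutions it accompanies, which is consistent with the "compact" description, though the actual DiCo module should be checked in the linked repository.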

Key Takeaways

The main points of the text are: