⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on them in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-05
Who Made This? Fake Detection and Source Attribution with Diffusion Features
Authors:Simone Bonechi, Paolo Andreini, Barbara Toniella Corradini
The rapid progress of generative diffusion models has enabled the creation of synthetic images that are increasingly difficult to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often struggle to generalize across unseen generators, requiring extensive labeled data and frequent retraining. We introduce FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that leverages internal activations from a pre-trained diffusion model for deepfake detection and source generator attribution. A k-nearest-neighbor classifier applied to diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. These results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic image forensics.
Paper and project links
Summary
The rapid progress of generative diffusion models makes it possible to create synthetic images that are increasingly hard to distinguish from real ones, raising concerns about authenticity, copyright, and misinformation. Existing supervised detectors often fail to generalize to unseen generators and need large amounts of labeled data and frequent retraining. This paper introduces FRIDA (Fake-image Recognition and source Identification via Diffusion-features Analysis), a lightweight framework that uses the internal activations of a pre-trained diffusion model for deepfake detection and source-generator attribution. A k-nearest-neighbor classifier applied to the diffusion features achieves state-of-the-art cross-generator performance without fine-tuning, while a compact neural model enables accurate source attribution. The results show that diffusion representations inherently encode generator-specific patterns, providing a simple and interpretable foundation for synthetic-image forensics.
Key Takeaways
- The rapid progress of generative diffusion models makes synthetic images hard to distinguish from real ones, raising authenticity concerns.
- Existing supervised detectors struggle to generalize across unseen generators.
- The FRIDA framework uses internal activations of a pre-trained diffusion model for deepfake detection and source attribution.
- A k-nearest-neighbor classifier achieves strong cross-generator performance without any fine-tuning (see the sketch below).
- A compact neural model enables accurate source attribution.
- Diffusion representations inherently encode generator-specific patterns.
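Below is a minimal sketch of the kind of k-NN detector FRIDA describes, written in Python. It assumes the diffusion features have already been extracted (which layers and timesteps to use is the paper's contribution); the random stand-in features, feature dimension, and `k` here are illustrative assumptions only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fit_knn_detector(real_feats: np.ndarray, fake_feats: np.ndarray, k: int = 5):
    """Fit a k-NN real-vs-fake detector on diffusion features.

    real_feats / fake_feats: (N, D) arrays of internal activations taken from a
    pre-trained diffusion model (the extraction step itself is not shown here).
    """
    X = np.concatenate([real_feats, fake_feats])
    y = np.concatenate([np.zeros(len(real_feats)), np.ones(len(fake_feats))])
    return KNeighborsClassifier(n_neighbors=k, metric="cosine").fit(X, y)

# Toy usage with random stand-in features; real features would come from the model.
rng = np.random.default_rng(0)
detector = fit_knn_detector(rng.normal(size=(100, 512)),
                            rng.normal(size=(100, 512)) + 0.5)
print(detector.predict(rng.normal(size=(3, 512)) + 0.5))  # likely [1. 1. 1.]
```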
Click here to view paper screenshots
Understanding the Implicit User Intention via Reasoning with Large Language Model for Image Editing
Authors:Yijia Wang, Yiqing Shen, Weiming Chen, Zhihai He
Existing image editing methods can handle simple editing instructions very well. To deal with complex editing instructions, they often need to jointly fine-tune the large language models (LLMs) and diffusion models (DMs), which involves very high computational complexity and training cost. To address this issue, we propose a new method, called \textbf{C}omplex \textbf{I}mage \textbf{E}diting via \textbf{L}LM \textbf{R}easoning (CIELR), which converts a complex user instruction into a set of simple and explicit editing actions, eliminating the need for jointly fine-tuning the large language models and diffusion models. Specifically, we first construct a structured semantic representation of the input image using foundation models. Then, we introduce an iterative update mechanism that can progressively refine this representation, obtaining a fine-grained visual representation of the image scene. This allows us to perform complex and flexible image editing tasks. Extensive experiments on the SmartEdit Reasoning Scenario Set show that our method surpasses the previous state-of-the-art by 9.955 dB in PSNR, indicating its superior preservation of regions that should remain consistent. Due to the limited number of samples of public datasets of complex image editing with reasoning, we construct a benchmark named CIEBench, containing 86 image samples, together with a metric specifically for reasoning-based image editing. CIELR also outperforms previous methods on this benchmark. The code and dataset are available at \href{https://github.com/Jia-shao/Reasoning-Editing}{https://github.com/Jia-shao/Reasoning-Editing}.
Paper and project links
Summary
This paper proposes a new image-editing method, CIELR, which performs complex image editing through LLM reasoning. It converts a complex user instruction into a set of simple, explicit editing actions, removing the need to jointly fine-tune large language models and diffusion models and thereby reducing computational complexity and training cost. CIELR builds a structured semantic representation of the input image with foundation models and introduces an iterative update mechanism that progressively refines this representation into a fine-grained visual representation of the scene, enabling complex and flexible editing. Experiments on the SmartEdit Reasoning Scenario Set show a 9.955 dB PSNR improvement over the previous state of the art, and CIELR also performs strongly on the CIEBench benchmark.
Key Takeaways
- CIELR converts complex editing instructions into simple, explicit editing actions.
- It removes the need to jointly fine-tune large language models and diffusion models.
- Foundation models are used to build a structured semantic representation of the input image.
- An iterative update mechanism progressively refines this representation.
- CIELR surpasses prior methods by 9.955 dB in PSNR on the SmartEdit Reasoning Scenario Set.
- CIEBench, a benchmark for reasoning-based complex image editing, is introduced.
Click here to view paper screenshots
H2-Cache: A Novel Hierarchical Dual-Stage Cache for High-Performance Acceleration of Generative Diffusion Models
Authors:Mingyu Sung, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
Diffusion models have emerged as state-of-the-art in image generation, but their practical deployment is hindered by the significant computational cost of their iterative denoising process. While existing caching techniques can accelerate inference, they often create a challenging trade-off between speed and fidelity, suffering from quality degradation and high computational overhead. To address these limitations, we introduce H2-Cache, a novel hierarchical caching mechanism designed for modern generative diffusion model architectures. Our method is founded on the key insight that the denoising process can be functionally separated into a structure-defining stage and a detail-refining stage. H2-cache leverages this by employing a dual-threshold system, using independent thresholds to selectively cache each stage. To ensure the efficiency of our dual-check approach, we introduce pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture demonstrate that H2-cache achieves significant acceleration (up to 5.08x) while maintaining image quality nearly identical to the baseline, quantitatively and qualitatively outperforming existing caching methods. Our work presents a robust and practical solution that effectively resolves the speed-quality dilemma, significantly lowering the barrier for the real-world application of high-fidelity diffusion models. Source code is available at https://github.com/Bluear7878/H2-cache-A-Hierarchical-Dual-Stage-Cache.
Paper and project links
Summary
Diffusion models are state of the art for image generation, but the high computational cost of their iterative denoising hinders practical deployment. Existing caching techniques accelerate inference but trade speed against fidelity, suffering from quality degradation and high overhead. H2-Cache is a hierarchical caching mechanism designed for modern generative diffusion architectures, built on the key insight that denoising can be functionally separated into a structure-defining stage and a detail-refining stage. It uses a dual-threshold system with independent thresholds to selectively cache each stage, and introduces pooled feature summarization (PFS), a lightweight technique for robust and fast similarity estimation. Extensive experiments on the Flux architecture show up to 5.08x acceleration with image quality nearly identical to the baseline, outperforming existing caching methods both quantitatively and qualitatively.
Key Takeaways
- Diffusion models excel at image generation, but the iterative denoising process is computationally expensive and needs acceleration.
- Existing caching techniques trade speed against fidelity, causing quality degradation and high computational overhead.
- H2-Cache is a novel hierarchical caching mechanism for modern generative diffusion models that addresses these limitations.
- It builds on the insight that denoising separates into a structure-defining stage and a detail-refining stage.
- A dual-threshold system selectively caches each stage with independent thresholds to improve efficiency (see the sketch below).
- Pooled feature summarization (PFS) provides robust and fast similarity estimation within H2-Cache.
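The two ingredients, pooled feature summarization for cheap similarity checks and a threshold test that decides whether a cached stage output can be reused, can be sketched as follows. This is only an illustration of the idea; the pooling size, similarity measure, and threshold values are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def pfs(feat: torch.Tensor, pool: int = 8) -> torch.Tensor:
    """Pooled feature summarization: compress a (B, C, H, W) feature map
    into a small summary vector used for fast similarity estimation."""
    return F.adaptive_avg_pool2d(feat, pool).flatten(1)

def can_reuse(summary_now: torch.Tensor, summary_cached: torch.Tensor, thresh: float) -> bool:
    """Reuse the cached stage output when the summaries are similar enough."""
    sim = F.cosine_similarity(summary_now, summary_cached, dim=1).mean().item()
    return sim > thresh

# Illustrative dual-threshold use inside a denoising loop (threshold values are made up):
#   if can_reuse(pfs(x_t), cache["structure_summary"], thresh=0.95): reuse the structure stage
#   if can_reuse(pfs(h_mid), cache["detail_summary"], thresh=0.90): reuse the detail stage
```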
Click here to view paper screenshots
DANCER: Dance ANimation via Condition Enhancement and Rendering with diffusion model
Authors:Yucheng Xing, Jinxing Yin, Xiaodong Liu
Recently, diffusion models have shown their impressive ability in visual generation tasks. Besides static images, more and more research attention has been drawn to the generation of realistic videos. The video generation not only has a higher requirement for the quality, but also brings a challenge in ensuring the video continuity. Among all the video generation tasks, human-involved contents, such as human dancing, are even more difficult to generate due to the high degrees of freedom associated with human motions. In this paper, we propose a novel framework, named as DANCER (Dance ANimation via Condition Enhancement and Rendering with Diffusion Model), for realistic single-person dance synthesis based on the most recent stable video diffusion model. As the video generation is generally guided by a reference image and a video sequence, we introduce two important modules into our framework to fully benefit from the two inputs. More specifically, we design an Appearance Enhancement Module (AEM) to focus more on the details of the reference image during the generation, and extend the motion guidance through a Pose Rendering Module (PRM) to capture pose conditions from extra domains. To further improve the generation capability of our model, we also collect a large amount of video data from the Internet, and generate a novel dataset TikTok-3K to enhance the model training. The effectiveness of the proposed model has been evaluated through extensive experiments on real-world datasets, where the performance of our model is superior to that of the state-of-the-art methods. All the data and codes will be released upon acceptance.
Paper and project links
Summary
This paper proposes DANCER, a new framework for realistic single-person dance synthesis built on the latest stable video diffusion model. The framework introduces two key modules, an Appearance Enhancement Module (AEM) and a Pose Rendering Module (PRM), to fully exploit the reference-image and video-sequence inputs. In addition, a large amount of video data was collected from the Internet to build a new dataset, TikTok-3K, for stronger model training. Experiments show the model outperforms existing state-of-the-art methods on real-world datasets.
Key Takeaways
- Diffusion models show strong capability in visual generation and are increasingly used to generate realistic videos.
- Video generation demands both high quality and temporal continuity; human-centered content such as dancing is especially difficult.
- The proposed DANCER framework builds on the latest stable video diffusion model for realistic single-person dance synthesis.
- An Appearance Enhancement Module (AEM) and a Pose Rendering Module (PRM) focus on reference-image details and pose conditions, respectively.
- A large amount of video data was collected from the Internet to build the new TikTok-3K dataset and improve training.
- Experiments show DANCER outperforms existing methods on real-world datasets.
Click here to view paper screenshots
E-MMDiT: Revisiting Multimodal Diffusion Transformer Design for Fast Image Synthesis under Limited Resources
Authors:Tong Shen, Jingai Yu, Dong Zhou, Dong Li, Emad Barsoum
Diffusion models have shown strong capabilities in generating high-quality images from text prompts. However, these models often require large-scale training data and significant computational resources to train, or suffer from heavy structure with high latency. To this end, we propose Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight multimodal diffusion model with only 304M parameters for fast image synthesis requiring low training resources. We provide an easily reproducible baseline with competitive results. Our model for 512px generation, trained with only 25M public data in 1.5 days on a single node of 8 AMD MI300X GPUs, achieves 0.66 on GenEval and easily reaches to 0.72 with some post-training techniques such as GRPO. Our design philosophy centers on token reduction as the computational cost scales significantly with the token count. We adopt a highly compressive visual tokenizer to produce a more compact representation and propose a novel multi-path compression module for further compression of tokens. To enhance our design, we introduce Position Reinforcement, which strengthens positional information to maintain spatial coherence, and Alternating Subregion Attention (ASA), which performs attention within subregions to further reduce computational cost. In addition, we propose AdaLN-affine, an efficient lightweight module for computing modulation parameters in transformer blocks. Our code is available at https://github.com/AMD-AGI/Nitro-E and we hope E-MMDiT serves as a strong and practical baseline for future research and contributes to democratization of generative AI models.
Paper and project links
Summary
This paper presents the Efficient Multimodal Diffusion Transformer (E-MMDiT), an efficient and lightweight diffusion model with only 304M parameters that generates high-quality images quickly under low training resources. Its design centers on token reduction: a highly compressive visual tokenizer and a multi-path compression module produce a compact representation, while Position Reinforcement and Alternating Subregion Attention (ASA) preserve spatial coherence and further cut computation. An AdaLN-affine module efficiently computes modulation parameters in the transformer blocks. The code is publicly released on GitHub, and the authors hope E-MMDiT serves as a strong, practical baseline for future research and helps democratize generative AI models.
Key Takeaways
- E-MMDiT is an efficient, lightweight diffusion model (304M parameters) for fast, high-quality image synthesis with little compute and training time.
- A compact visual tokenizer and a multi-path compression module reduce the token count and the computation.
- Trained with only 25M public data, the model reaches 0.66 on GenEval and 0.72 with post-training techniques such as GRPO.
- Position Reinforcement maintains spatial coherence, and Alternating Subregion Attention (ASA) restricts attention to subregions to cut cost.
- AdaLN-affine is proposed as an efficient lightweight module for computing modulation parameters in transformer blocks (see the sketch below).
- The code for E-MMDiT is publicly available, and the model is intended as a strong baseline for future research.
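For readers unfamiliar with adaLN-style modulation in DiT blocks, the sketch below shows what "computing modulation parameters" means in a generic form. It is not the paper's AdaLN-affine module (whose lightweight design is the contribution); the projection layout and gating follow the standard DiT recipe and are assumptions here.

```python
import torch
import torch.nn as nn

class AdaLNModulation(nn.Module):
    """Generic adaLN-style modulation (shift / scale / gate) for a transformer block.
    Illustrative only; the paper's AdaLN-affine computes these parameters more cheaply."""

    def __init__(self, cond_dim: int, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 3 * hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) token features; cond: (B, cond_dim) timestep/text conditioning
        shift, scale, gate = self.proj(cond).unsqueeze(1).chunk(3, dim=-1)
        return x + gate * (self.norm(x) * (1 + scale) + shift)
```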
Click here to view paper screenshots
SViM3D: Stable Video Material Diffusion for Single Image 3D Generation
Authors:Andreas Engelhardt, Mark Boss, Vikram Voleti, Chun-Han Yao, Hendrik P. A. Lensch, Varun Jampani
We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games and other visual media.
Paper and project links
PDF Accepted by International Conference on Computer Vision (ICCV 2025). Project page: http://svim3d.aengelhardt.com
Summary
This paper presents Stable Video Materials 3D (SViM3D), a framework that predicts multi-view consistent physically based rendering (PBR) materials from a single image. It extends a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view under explicit camera control, which allows relighting and generating a 3D asset with the model as a neural prior. Several mechanisms are introduced to improve quality in this ill-posed setting, and the method achieves state-of-the-art relighting and novel-view-synthesis performance on multiple object-centric datasets. It generalizes to diverse inputs, producing relightable 3D assets useful for AR/VR, movies, games, and other visual media.
Key Takeaways
- SViM3D predicts multi-view consistent PBR materials from a single image.
- It extends a latent video diffusion model to jointly output spatially varying PBR parameters and surface normals.
- The model enables relighting and serves as a neural prior for generating 3D assets.
- Several mechanisms are introduced to improve quality in this ill-posed setting.
- It achieves state-of-the-art relighting and novel-view synthesis on multiple object-centric datasets.
- The method generalizes to diverse types of input.
Click here to view paper screenshots
Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Authors:Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
Paper and project links
PDF Preprint
Summary
This paper introduces SHINE, a training-free framework for seamless, high-fidelity insertion of a user-specified object into a new scene. SHINE uses a manifold-steered anchor loss with pretrained customization adapters to guide the latents toward a faithful subject representation while preserving background integrity, and adds degradation-suppression guidance and adaptive background blending to remove low-quality outputs and visible seams. To address the lack of rigorous benchmarks, the authors introduce ComplexCompo, which covers diverse resolutions and challenging conditions such as low light, strong illumination, intricate shadows, and reflective surfaces. Experiments show state-of-the-art performance on both standard metrics and human-aligned scores.
Key Takeaways
- SHINE is a training-free framework for seamless, high-fidelity insertion of user-specified objects.
- It introduces a manifold-steered anchor loss that uses pretrained customization adapters to guide the latents.
- Background integrity is preserved while the subject is represented faithfully.
- Degradation-suppression guidance and adaptive background blending further improve output quality.
- The ComplexCompo benchmark is introduced to test image composition under diverse resolutions and challenging conditions.
- Experiments show SHINE achieves state-of-the-art results on standard metrics and human-aligned scores.
Click here to view paper screenshots
Where and How to Perturb: On the Design of Perturbation Guidance in Diffusion and Flow Models
Authors:Donghoon Ahn, Jiwon Kang, Sanghyun Lee, Minjae Kim, Jaewon Min, Wooseok Jang, Sangwu Lee, Sayak Paul, Susung Hong, Seungryong Kim
Recent guidance methods in diffusion models steer reverse sampling by perturbing the model to construct an implicit weak model and guide generation away from it. Among these approaches, attention perturbation has demonstrated strong empirical performance in unconditional scenarios where classifier-free guidance is not applicable. However, existing attention perturbation methods lack principled approaches for determining where perturbations should be applied, particularly in Diffusion Transformer (DiT) architectures where quality-relevant computations are distributed across layers. In this paper, we investigate the granularity of attention perturbations, ranging from the layer level down to individual attention heads, and discover that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this insight, we propose “HeadHunter”, a systematic framework for iteratively selecting attention heads that align with user-centric objectives, enabling fine-grained control over generation quality and visual attributes. In addition, we introduce SoftPAG, which linearly interpolates each selected head’s attention map toward an identity matrix, providing a continuous knob to tune perturbation strength and suppress artifacts. Our approach not only mitigates the oversmoothing issues of existing layer-level perturbation but also enables targeted manipulation of specific visual styles through compositional head selection. We validate our method on modern large-scale DiT-based text-to-image models including Stable Diffusion 3 and FLUX.1, demonstrating superior performance in both general quality enhancement and style-specific guidance. Our work provides the first head-level analysis of attention perturbation in diffusion models, uncovering interpretable specialization within attention layers and enabling practical design of effective perturbation strategies.
Paper and project links
PDF Accepted at NeurIPS 2025. Project page: https://cvlab-kaist.github.io/HeadHunter/
Summary
This paper studies guidance methods in diffusion models that perturb the model to construct an implicit weak model and steer generation away from it, focusing on attention perturbation. Examining perturbation granularity from whole layers down to individual attention heads in Diffusion Transformer (DiT) architectures, the authors find that specific heads govern distinct visual concepts such as structure, style, and texture quality. Building on this, they propose HeadHunter, a framework that iteratively selects attention heads aligned with user-centric objectives for fine-grained control over generation quality and visual attributes, and SoftPAG, which linearly interpolates each selected head's attention map toward an identity matrix to provide a continuous knob for perturbation strength and artifact suppression. The approach mitigates the oversmoothing of layer-level perturbation and enables targeted manipulation of visual styles, with superior results demonstrated on large DiT-based text-to-image models such as Stable Diffusion 3 and FLUX.1.
Key Takeaways
- Guidance methods in diffusion models perturb the model to build an implicit weak model that steers generation away from it.
- Attention perturbation shows strong empirical performance in unconditional settings, particularly in Diffusion Transformer architectures.
- Specific attention heads govern distinct visual concepts such as structure, style, and texture.
- HeadHunter iteratively selects attention heads aligned with user-centric objectives, enabling fine-grained control.
- SoftPAG linearly interpolates each selected head's attention map toward an identity matrix, giving a continuous knob for perturbation strength and artifact suppression (see the sketch below).
- The approach mitigates the oversmoothing of layer-level perturbation and enables targeted manipulation of specific visual styles.
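The SoftPAG operation itself is simple enough to state in a few lines: blend a selected head's attention map with the identity matrix, with the mixing weight acting as the perturbation-strength knob. A minimal sketch follows; the tensor layout is an assumption, and where in the sampler the perturbed heads are used follows the guidance procedure described in the paper.

```python
import torch

def softpag_attention(attn: torch.Tensor, strength: float) -> torch.Tensor:
    """Linearly interpolate an attention map toward the identity matrix.

    attn:     (..., N, N) row-stochastic attention weights of a selected head.
    strength: 0.0 leaves the head untouched, 1.0 replaces it with the identity.
    """
    n = attn.shape[-1]
    eye = torch.eye(n, device=attn.device, dtype=attn.dtype).expand_as(attn)
    return (1.0 - strength) * attn + strength * eye
```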
Click here to view paper screenshots
EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Authors:Mingzhe Li, Gehao Zhang, Zhenting Wang, Guanhong Tao, Siqi Pan, Richard Cartwright, Juan Zhai, Shiqing Ma
Text-to-image generation models (e.g., Stable Diffusion) have achieved significant advancements, enabling the creation of high-quality and realistic images based on textual descriptions. Prompt inversion, the task of identifying the textual prompt used to generate a specific artifact, holds significant potential for applications including data attribution, model provenance, and watermarking validation. Recent studies introduced a delayed projection scheme to optimize for prompts representative of the vocabulary space, though challenges in semantic fluency and efficiency remain. Advanced image captioning models or visual large language models can generate highly interpretable prompts, but they often lack in image similarity. In this paper, we propose a prompt inversion technique called EDITOR for text-to-image diffusion models, which includes initializing embeddings using a pre-trained image captioning model, refining them through reverse-engineering in the latent space, and converting them to texts using an embedding-to-text model. Our experiments on the widely-used datasets, such as MS COCO, LAION, and Flickr, show that our method outperforms existing methods in terms of image similarity, textual alignment, prompt interpretability and generalizability. We further illustrate the application of our generated prompts in tasks such as cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation and unsupervised segmentation.
Paper and project links
Summary
Text-to-image generation models such as Stable Diffusion have advanced substantially and can generate high-quality, realistic images from textual descriptions. This paper proposes a prompt-inversion technique for text-to-image diffusion models that initializes embeddings with a pre-trained image captioning model, refines them via reverse-engineering in the latent space, and converts them to text with an embedding-to-text model. Experiments show the method outperforms existing approaches in image similarity, textual alignment, prompt interpretability, and generalizability, and the generated prompts are applied to cross-concept image synthesis, concept manipulation, evolutionary multi-concept generation, and unsupervised segmentation.
Key Takeaways
- Text-to-image generation models such as Stable Diffusion can produce high-quality images from textual descriptions.
- Prompt inversion has application potential in data attribution, model provenance, and watermark validation.
- Challenges in semantic fluency and efficiency remain in existing work.
- The paper proposes EDITOR, a prompt-inversion technique for text-to-image diffusion models.
- The technique initializes embeddings with a pre-trained image captioning model and refines them by reverse-engineering in the latent space (see the sketch below).
- Experiments show strong performance in image similarity, textual alignment, prompt interpretability, and generalizability.
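The refinement stage, optimizing continuous prompt embeddings so that the diffusion model reconstructs the target image, can be sketched as a generic textual-inversion-style loop. This is a hedged illustration, not the paper's exact procedure: the noising rule, optimizer, and the `denoiser` and `sigmas` arguments are placeholders/assumptions.

```python
import torch
import torch.nn.functional as F

def refine_prompt_embeddings(init_emb, image_latents, denoiser, sigmas,
                             steps: int = 200, lr: float = 1e-2):
    """Refine prompt embeddings by minimizing the denoising error on the target image.

    init_emb:      (1, L, D) embeddings from a pre-trained image captioning model.
    image_latents: (1, C, H, W) latents of the image whose prompt we want to invert.
    denoiser:      callable(noisy_latents, t, emb) -> predicted noise (placeholder).
    sigmas:        1-D tensor of noise levels indexed by timestep (placeholder).
    """
    emb = init_emb.clone().requires_grad_(True)
    opt = torch.optim.Adam([emb], lr=lr)
    for _ in range(steps):
        t = torch.randint(0, len(sigmas), (1,))
        noise = torch.randn_like(image_latents)
        noisy = image_latents + sigmas[t] * noise       # simplified noising rule
        loss = F.mse_loss(denoiser(noisy, t, emb), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return emb.detach()  # afterwards, map back to text with an embedding-to-text model
```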
Click here to view paper screenshots
SPIRAL: Semantic-Aware Progressive LiDAR Scene Generation and Understanding
Authors:Dekai Zhu, Yixuan Hu, Youquan Liu, Dongyue Lu, Lingdong Kong, Slobodan Ilic
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to producing unlabeled LiDAR scenes. Relying on pretrained segmentation models to predict the semantic maps often results in suboptimal cross-modal consistency. To address this limitation while preserving the advantages of range-view representations, such as computational efficiency and simplified network design, we propose Spiral, a novel range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Furthermore, we introduce novel semantic-aware metrics to evaluate the quality of the generated labeled range-view data. Experiments on the SemanticKITTI and nuScenes datasets demonstrate that Spiral achieves state-of-the-art performance with the smallest parameter size, outperforming two-step methods that combine the generative and segmentation models. Additionally, we validate that range images generated by Spiral can be effectively used for synthetic data augmentation in the downstream segmentation training, significantly reducing the labeling effort on LiDAR data.
Paper and project links
PDF NeurIPS 2025; 24 pages, 10 figures, 9 tables; Code at https://dekai21.github.io/SPIRAL/
Summary
Leveraging recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success. While recent voxel-based approaches can generate both geometric structures and semantic labels, existing range-view methods are limited to unlabeled LiDAR scenes, and relying on pretrained segmentation models for the semantic maps often yields poor cross-modal consistency. To overcome this while keeping the advantages of range-view representations, such as computational efficiency and simplified network design, the authors propose Spiral, a range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps. Experiments show Spiral achieves state-of-the-art performance with the smallest parameter count, outperforming two-step methods that combine generative and segmentation models, and its generated range images are effective as synthetic data augmentation for downstream segmentation training, substantially reducing labeling effort.
Key Takeaways
- With recent diffusion models, LiDAR-based large-scale 3D scene generation has achieved great success.
- Existing range-view methods mainly generate unlabeled LiDAR scenes.
- Spiral is a new range-view LiDAR diffusion model that simultaneously generates depth, reflectance images, and semantic maps.
- Spiral achieves state-of-the-art performance with the smallest parameter size.
- Range images generated by Spiral serve effectively as synthetic data augmentation, reducing the labeling effort on LiDAR data.
- By retaining the advantages of range-view representations, such as computational efficiency and simple network design, Spiral improves performance.
Click here to view paper screenshots
Spatial Knowledge Graph-Guided Multimodal Synthesis
Authors:Yida Xue, Zhen Bi, Jinnan Yang, Jungang Lou, Kehai Chen, Min Zhang, Huajun Chen, Ningyu Zhang
Recent advances in Multimodal Large Language Models (MLLMs) have significantly enhanced their capabilities; however, their spatial perception abilities remain a notable limitation. To address this challenge, multimodal data synthesis offers a promising solution. Yet, ensuring that synthesized data adhere to spatial common sense is a non-trivial task. Our approach addresses this critical gap by providing a systematic framework for generating spatially coherent data. In this work, we introduce SKG2DATA, a novel multimodal synthesis approach guided by spatial knowledge graphs, grounded in the concept of knowledge-to-data generation. SKG2DATA employs an automated pipeline for constructing Spatial Knowledge Graph (SKG) that effectively captures human-like spatial cognition, including directional and distance relationships. These structured representations then serve as precise guidance for our integrated synthesis pipeline, where a diffusion model generates spatially-consistent images while a MLLM produces corresponding textual descriptions. The automated construction of SKG enables scalable generation of diverse yet realistic spatial configurations, overcoming the limitations of manual data collection and annotation. Extensive experiments demonstrate that data synthesized from diverse types of spatial knowledge, including direction and distance, enhance the spatial perception and reasoning abilities of MLLMs markedly, albeit with a slight cost to their general capabilities. We hope that the idea of knowledge-based data synthesis can advance the development of spatial intelligence. Code is available at https://github.com/zjunlp/Knowledge2Data.
Paper and project links
PDF IEEE/ACM Transactions on Audio, Speech and Language Processing
Summary
This paper addresses the limited spatial perception of Multimodal Large Language Models (MLLMs) by proposing SKG2DATA, a new multimodal synthesis method guided by spatial knowledge graphs (SKGs) to generate spatially coherent data. Experiments show that the synthesized data markedly improve the spatial perception and reasoning abilities of MLLMs.
Key Takeaways
- Despite significant progress, Multimodal Large Language Models (MLLMs) remain limited in spatial perception.
- Multimodal data synthesis is a promising way to address this challenge.
- SKG2DATA is a new multimodal synthesis method that guides data generation with a spatial knowledge graph (SKG).
- The SKG effectively captures human-like spatial cognition, including directional and distance relationships.
- The synthesized data improve the spatial perception and reasoning abilities of MLLMs.
- Automated SKG construction enables scalable generation of diverse yet realistic spatial configurations.
Click here to view paper screenshots
Joint Reconstruction of Activity and Attenuation in PET by Diffusion Posterior Sampling in Wavelet Coefficient Space
Authors:Clémentine Phung-Ngoc, Alexandre Bousse, Antoine De Paepe, Thibaut Merlin, Baptiste Laurent, Hong-Phuong Dang, Olivier Saut, Dimitris Visvikis
Attenuation correction (AC) is necessary for accurate activity quantification in positron emission tomography (PET). Conventional reconstruction methods typically rely on attenuation maps derived from a co-registered computed tomography (CT) or magnetic resonance imaging (MRI) scan. However, this additional scan may complicate the imaging workflow, introduce misalignment artifacts and increase radiation exposure. In this paper, we propose a joint reconstruction of activity and attenuation (JRAA) approach that eliminates the need for auxiliary anatomical imaging by relying solely on emission data. This framework combines wavelet diffusion model (WDM) and diffusion posterior sampling (DPS) to reconstruct fully three-dimensional (3-D) data. Experimental results show our method outperforms maximum likelihood activity and attenuation (MLAA) and MLAA with U-Net-based post processing, and yields high-quality noise-free reconstructions across various count settings when time-of-flight (TOF) information is available. It is also able to reconstruct non-TOF data, although the reconstruction quality significantly degrades in low-count (LC) conditions, limiting its practical effectiveness in such settings. Nonetheless, a non-TOF Biograph mMR data reconstruction with joint scatter estimation highlights the potential of the method for clinical applications. This approach represents a step towards stand-alone PET imaging by reducing the dependence on anatomical modalities while maintaining quantification accuracy, even in low-count scenarios when TOF information is available. Code will soon be available on GitHub at https://github.com/clemphg/jraa-dps.
Paper and project links
PDF 11 pages, 8 figures, 3 tables
Summary
This paper proposes a joint reconstruction of activity and attenuation (JRAA) approach for PET that relies solely on emission data, removing the need for auxiliary anatomical imaging. The framework combines a wavelet diffusion model (WDM) with diffusion posterior sampling (DPS) to reconstruct fully three-dimensional data. Experiments show it outperforms maximum likelihood activity and attenuation (MLAA) and MLAA with U-Net-based post-processing when time-of-flight (TOF) information is available, and it can also reconstruct non-TOF data, although quality degrades in low-count settings. The method shows potential for clinical applications.
Key Takeaways
- Attenuation correction (AC) is necessary for accurate activity quantification in PET.
- Conventional reconstruction relies on attenuation maps derived from co-registered CT or MRI scans.
- The proposed joint reconstruction of activity and attenuation (JRAA) removes the need for auxiliary anatomical imaging and uses only emission data.
- JRAA combines a wavelet diffusion model (WDM) with diffusion posterior sampling (DPS) to reconstruct fully 3-D data.
- With time-of-flight (TOF) information available, JRAA outperforms the competing methods and can also reconstruct non-TOF data.
- In low-count conditions the reconstruction quality degrades, limiting practical effectiveness in such settings.
Click here to view paper screenshots
Diffusion Classifiers Understand Compositionality, but Conditions Apply
Authors:Yujin Jeong, Arnas Uselis, Seong Joon Oh, Anna Rohrbach
Understanding visual scenes is fundamental to human intelligence. While discriminative models have significantly advanced computer vision, they often struggle with compositional understanding. In contrast, recent generative text-to-image diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities. Building on this, zero-shot diffusion classifiers have been proposed to repurpose diffusion models for discriminative tasks. While prior work offered promising results in discriminative compositional scenarios, these results remain preliminary due to a small number of benchmarks and a relatively shallow analysis of conditions under which the models succeed. To address this, we present a comprehensive study of the discriminative capabilities of diffusion classifiers on a wide range of compositional tasks. Specifically, our study covers three diffusion models (SD 1.5, 2.0, and, for the first time, 3-m) spanning 10 datasets and over 30 tasks. Further, we shed light on the role that target dataset domains play in respective performance; to isolate the domain effects, we introduce a new diagnostic benchmark \textsc{Self-Bench} comprised of images created by diffusion models themselves. Finally, we explore the importance of timestep weighting and uncover a relationship between domain gap and timestep sensitivity, particularly for SD3-m. To sum up, diffusion classifiers understand compositionality, but conditions apply! Code and dataset are available at https://github.com/eugene6923/Diffusion-Classifiers-Compositionality.
Paper and project links
PDF NeurIPS 2025 Datasets and Benchmarks
Summary
Diffusion models excel at synthesizing complex scenes, suggesting inherent compositional capabilities, and zero-shot diffusion classifiers repurpose them for discriminative tasks. However, their discriminative behavior in compositional settings had only been studied on a few benchmarks and rather shallowly. This work presents a comprehensive study covering three diffusion models (SD 1.5, 2.0, and 3-m), 10 datasets, and over 30 tasks. It also introduces Self-Bench, a diagnostic benchmark of images generated by the diffusion models themselves, to isolate the effect of the target dataset domain, and explores timestep weighting, revealing a relationship between domain gap and timestep sensitivity, particularly for SD3-m. Overall, diffusion classifiers understand compositionality, but conditions apply.
Key Takeaways
- Diffusion models excel at synthesizing complex scenes and show inherent compositional capabilities.
- Zero-shot diffusion classifiers repurpose diffusion models for discriminative tasks (see the sketch below).
- Prior analysis of their discriminative abilities in compositional settings was shallow.
- This study comprehensively evaluates three diffusion models across many compositional tasks.
- The new Self-Bench diagnostic benchmark isolates the effect of the target dataset domain on performance.
- Timestep weighting matters, and there is a relationship between domain gap and timestep sensitivity.
- Diffusion classifiers understand compositionality, but conditions apply.
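For context, a zero-shot diffusion classifier in its standard form scores each candidate label by the denoising error the model makes when conditioned on that label's text, and picks the label with the lowest error. Below is a minimal sketch of that formulation; the callables and the uniform timestep sampling are assumptions, and the paper additionally studies timestep weighting on top of this.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def diffusion_classify(latents, class_embs, denoiser, sigmas, n_samples: int = 32) -> int:
    """Return the index of the condition with the lowest average denoising error.

    latents:    (1, C, H, W) latents of the image to classify.
    class_embs: list of text-condition embeddings, one per candidate label.
    denoiser:   callable(noisy, t, emb) -> predicted noise (placeholder).
    sigmas:     1-D tensor of noise levels indexed by timestep (placeholder).
    """
    errors = []
    for emb in class_embs:
        err = 0.0
        for _ in range(n_samples):
            t = torch.randint(0, len(sigmas), (1,))
            noise = torch.randn_like(latents)
            noisy = latents + sigmas[t] * noise          # simplified noising rule
            err += F.mse_loss(denoiser(noisy, t, emb), noise).item()
        errors.append(err / n_samples)
    return int(torch.tensor(errors).argmin())
```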
Click here to view paper screenshots
LocDiff: Identifying Locations on Earth by Diffusing in the Hilbert Space
Authors:Zhangyu Wang, Zeping Liu, Jielu Zhang, Zhongliang Zhou, Qian Cao, Nemin Wu, Lan Mu, Yang Song, Yiqun Xie, Ni Lao, Gengchen Mai
Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. State-of-the-art methods employ either grid-based classification or gallery-based image-location retrieval, whose spatial generalizability significantly suffers if the spatial distribution of test images does not align with the choices of grids and galleries. Recently emerging generative approaches, while getting rid of grids and galleries, use raw geographical coordinates and suffer quality losses due to their lack of multi-scale information. To address these limitations, we propose a multi-scale latent diffusion model called LocDiff for image geolocalization. We developed a novel positional encoding-decoding framework called Spherical Harmonics Dirac Delta (SHDD) Representations, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking on spherical probability distributions. We also propose a novel SirenNet-based architecture (CS-UNet) to learn an image-based conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. To the best of our knowledge, LocDiff is the first image geolocalization model that performs latent diffusion in a multi-scale location encoding space and generates geolocations under the guidance of images. Experimental results show that LocDiff can outperform all state-of-the-art grid-based, retrieval-based, and diffusion-based baselines across 5 challenging global-scale image geolocalization datasets, and demonstrates significantly stronger generalizability to unseen geolocations.
Paper and project links
Summary
Image geolocalization aims to infer where on Earth an image was taken. State-of-the-art methods rely on grid-based classification or gallery-based retrieval, whose spatial generalizability suffers when the test images do not match the chosen grids or galleries, while recent generative approaches that use raw coordinates lack multi-scale information and lose quality. To address this, the paper proposes LocDiff, a multi-scale latent diffusion model for image geolocalization. It introduces a Spherical Harmonics Dirac Delta (SHDD) positional encoding-decoding framework, which encodes points on the sphere into a Hilbert space of spherical-harmonics coefficients and decodes locations by mode-seeking on spherical probability distributions, together with a SirenNet-based CS-UNet architecture that learns an image-conditioned backward process in the latent SHDD space by minimizing a latent KL-divergence loss. Experiments on five challenging global-scale datasets show LocDiff outperforms grid-based, retrieval-based, and diffusion-based baselines and generalizes markedly better to unseen geolocations.
Key Takeaways
- Image geolocalization, inferring where an image was taken, is both important and challenging.
- Existing grid-based classification and gallery-based retrieval methods have limited spatial generalizability.
- Emerging generative approaches drop grids and galleries but suffer quality losses from the lack of multi-scale information.
- LocDiff addresses these issues by combining multi-scale location encoding with a latent diffusion model.
- The novel SHDD encoding-decoding framework maps geolocations into a Hilbert space of spherical-harmonics coefficients (see the sketch below).
- A SirenNet-based architecture (CS-UNet) learns the image-conditioned backward process.
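A back-of-the-envelope sketch of the SHDD idea, encoding a geolocation as the spherical-harmonic coefficients of a Dirac delta and decoding by brute-force mode-seeking on a lat/lon grid, is shown below. It is only meant to make the encoding concrete; LocDiff itself diffuses a learned latent version of this representation conditioned on the image, and the degree cutoff and grid resolution here are arbitrary assumptions.

```python
import numpy as np
from scipy.special import sph_harm

def shdd_encode(lat_deg: float, lon_deg: float, l_max: int = 8) -> np.ndarray:
    """Spherical-harmonic coefficients of a Dirac delta at the given geolocation:
    c_{l,m} = conj(Y_{l,m}(theta, phi)), truncated at degree l_max."""
    theta = np.deg2rad(lon_deg % 360.0)     # azimuthal angle
    phi = np.deg2rad(90.0 - lat_deg)        # polar angle (colatitude)
    return np.array([np.conj(sph_harm(m, l, theta, phi))
                     for l in range(l_max + 1) for m in range(-l, l + 1)])

def shdd_decode(coeffs: np.ndarray, l_max: int = 8, step: float = 5.0):
    """Mode-seeking decode: evaluate the truncated density on a lat/lon grid, take the argmax."""
    best, best_val = (0.0, 0.0), -np.inf
    for lat in np.arange(-87.5, 90.0, step):
        for lon in np.arange(-177.5, 180.0, step):
            val = np.real(np.vdot(shdd_encode(lat, lon, l_max), coeffs))
            if val > best_val:
                best, best_val = (float(lat), float(lon)), val
    return best

# Round-trip example: encode Paris (~48.9N, 2.4E) and recover the nearest grid cell.
print(shdd_decode(shdd_encode(48.9, 2.4)))
```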
Click here to view paper screenshots
Split Gibbs Discrete Diffusion Posterior Sampling
Authors:Wenda Chu, Zihui Wu, Yifan Chen, Yang Song, Yisong Yue
We study the problem of posterior sampling in discrete-state spaces using discrete diffusion models. While posterior sampling methods for continuous diffusion models have achieved remarkable progress, analogous methods for discrete diffusion models remain challenging. In this work, we introduce a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which we call SGDD. Our algorithm enables reward-guided generation and solving inverse problems in discrete-state spaces. We demonstrate the convergence of SGDD to the target posterior distribution and verify this through controlled experiments on synthetic benchmarks. Our method enjoys state-of-the-art posterior sampling performance on a range of benchmarks for discrete data, including DNA sequence design, discrete image inverse problems, and music infilling, achieving more than 30% improved performance compared to existing baselines. Our code is available at https://github.com/chuwd19/Split-Gibbs-Discrete-Diffusion-Posterior-Sampling.
Paper and project links
PDF Accepted to NeurIPS 2025
Summary
Posterior sampling for discrete diffusion models remains challenging. This work introduces SGDD, a principled plug-and-play discrete diffusion posterior sampling algorithm based on split Gibbs sampling, which enables reward-guided generation and solving inverse problems in discrete-state spaces. The authors demonstrate convergence of SGDD to the target posterior distribution and verify it with controlled experiments on synthetic benchmarks. The method achieves state-of-the-art posterior sampling performance on DNA sequence design, discrete image inverse problems, and music infilling, improving on existing baselines by more than 30%.
Key Takeaways
- The paper studies posterior sampling for discrete diffusion models in discrete-state spaces.
- It proposes SGDD, a plug-and-play posterior sampling algorithm based on split Gibbs sampling.
- SGDD enables reward-guided generation and solving inverse problems in discrete-state spaces.
- Convergence of SGDD to the target posterior distribution is demonstrated.
- The analysis is verified through controlled experiments on synthetic benchmarks.
- SGDD achieves strong performance on DNA sequence design, discrete image inverse problems, and music infilling.
Click here to view paper screenshots
Multi-scale Latent Point Consistency Models for 3D Shape Generation
Authors:Bi’an Du, Wei Hu, Renjie Liao
Consistency Models (CMs) have significantly accelerated the sampling process in diffusion models, yielding impressive results in synthesizing high-resolution images. To explore and extend these advancements to point-cloud-based 3D shape generation, we propose a novel Multi-scale Latent Point Consistency Model (MLPCM). Our MLPCM follows a latent diffusion framework and introduces hierarchical levels of latent representations, ranging from point-level to super-point levels, each corresponding to a different spatial resolution. We design a multi-scale latent integration module along with 3D spatial attention to effectively denoise the point-level latent representations conditioned on those from multiple super-point levels. Additionally, we propose a latent consistency model, learned through consistency distillation, that compresses the prior into a one-step generator. This significantly improves sampling efficiency while preserving the performance of the original teacher model. Extensive experiments on standard benchmarks ShapeNet and ShapeNet-Vol demonstrate that MLPCM achieves a 100x speedup in the generation process, while surpassing state-of-the-art diffusion models in terms of both shape quality and diversity.
Paper and project links
Summary
This paper introduces the Multi-scale Latent Point Consistency Model (MLPCM) for point-cloud-based 3D shape generation. Following a latent diffusion framework, it introduces hierarchical latent representations from the point level up to super-point levels, and uses a multi-scale latent integration module with 3D spatial attention to denoise the point-level latents conditioned on the super-point levels. A latent consistency model learned via consistency distillation compresses the prior into a one-step generator, greatly improving sampling efficiency while preserving the teacher model's performance. On ShapeNet and ShapeNet-Vol, MLPCM achieves a 100x speedup in generation while surpassing state-of-the-art diffusion models in shape quality and diversity.
Key Takeaways
- Consistency Models (CMs) significantly accelerate the sampling process of diffusion models.
- MLPCM is a new point-cloud consistency model for 3D shape generation.
- It follows a latent diffusion framework with hierarchical latent representations ranging from the point level to super-point levels.
- A multi-scale latent integration module with 3D spatial attention effectively denoises the point-level latent representations.
- Consistency distillation compresses the prior into a one-step generator, improving sampling efficiency.
- MLPCM achieves a 100x speedup in generation on standard benchmarks.
Click here to view paper screenshots
FIRE: Robust Detection of Diffusion-Generated Images via Frequency-Guided Reconstruction Error
Authors:Beilin Chu, Xuan Xu, Xin Wang, Yufei Zhang, Weike You, Linna Zhou
The rapid advancement of diffusion models has significantly improved high-quality image generation, making generated content increasingly challenging to distinguish from real images and raising concerns about potential misuse. In this paper, we observe that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, suggesting the limitation could serve as a cue for detecting diffusion model generated images. Motivated by this observation, we propose a novel method called Frequency-guided Reconstruction Error (FIRE), which, to the best of our knowledge, is the first to investigate the influence of frequency decomposition on reconstruction error. FIRE assesses the variation in reconstruction error before and after the frequency decomposition, offering a robust method for identifying diffusion model generated images. Extensive experiments show that FIRE generalizes effectively to unseen diffusion models and maintains robustness against diverse perturbations.
Paper and project links
PDF 14 pages, 14 figures. Accepted to CVPR 2025
Summary
The rapid advance of diffusion models has made generated images hard to distinguish from real ones, raising concerns about potential misuse. The paper observes that diffusion models struggle to accurately reconstruct mid-band frequency information in real images, and this limitation can serve as a cue for detecting diffusion-generated images. Based on this, the authors propose FIRE (Frequency-guided Reconstruction Error), to their knowledge the first method to investigate how frequency decomposition affects reconstruction error. FIRE assesses the change in reconstruction error before and after frequency decomposition, giving a robust way to identify diffusion-generated images. Experiments show FIRE generalizes to unseen diffusion models and stays robust under diverse perturbations.
Key Takeaways
- Diffusion-generated images have become hard to distinguish from real ones, raising concerns about misuse.
- Diffusion models have difficulty reconstructing the mid-band frequency information of real images.
- FIRE (Frequency-guided Reconstruction Error) investigates the effect of frequency decomposition on reconstruction error.
- FIRE identifies diffusion-generated images by assessing how the reconstruction error changes before and after frequency decomposition (see the sketch below).
- FIRE generalizes effectively to unseen diffusion models.
- FIRE remains robust under a variety of image perturbations.
- The work offers a new perspective and method for detecting diffusion-generated images.
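As a rough illustration of the mid-band cue, the sketch below computes a reconstruction error restricted to a radial frequency band of the spectrum. The band limits and the use of grayscale FFTs are illustrative assumptions; obtaining `recon` (the diffusion-model reconstruction of the input) and the exact decomposition follow the paper.

```python
import numpy as np

def band_mask(shape, lo_frac: float, hi_frac: float) -> np.ndarray:
    """Boolean mask selecting a radial band of a 2-D spectrum (fractions of the max radius)."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    r = np.hypot(fx, fy)
    return (r >= lo_frac * r.max()) & (r < hi_frac * r.max())

def midband_recon_error(image: np.ndarray, recon: np.ndarray,
                        lo_frac: float = 0.25, hi_frac: float = 0.75) -> float:
    """Reconstruction error restricted to mid-band frequencies (grayscale arrays in [0, 1]).
    Real images tend to give a larger mid-band error than diffusion-generated ones,
    so this value can be thresholded as a detection score."""
    diff = np.fft.fft2(image) - np.fft.fft2(recon)
    mask = band_mask(image.shape, lo_frac, hi_frac)
    return float(np.mean(np.abs(diff[mask]) ** 2))
```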
Click here to view paper screenshots
High Resolution Seismic Waveform Generation using Denoising Diffusion
Authors:Kadek Hendrawan Palgunadi, Andreas Bergmeister, Andrea Bosisio, Laura Ermert, Maria Koroni, Nathanaël Perraudin, Simon Dirmeier, Men-Andrin Meier
Accurate prediction and synthesis of seismic waveforms are crucial for seismic-hazard assessment and earthquake-resistant infrastructure design. Existing prediction methods, such as ground-motion models and physics-based wave-field simulations, often fail to capture the full complexity of seismic wavefields, particularly at higher frequencies. This study introduces HighFEM, a novel, computationally efficient, and scalable (i.e., capable of generating many seismograms simultaneously) generative model for high-frequency seismic-waveform generation. Our approach leverages a spectrogram representation of the seismic-waveform data, which is reduced to a lower-dimensional manifold via an autoencoder. A state-of-the-art diffusion model is trained to generate this latent representation conditioned on key input parameters: earthquake magnitude, recording distance, site conditions, hypocenter depth, and azimuthal gap. The model generates waveforms with frequency content up to 50 Hz. Any scalar ground-motion statistic, such as peak ground-motion amplitudes and spectral accelerations, can be readily derived from the synthesized waveforms. We validate our model using commonly employed seismological metrics and performance metrics from image-generation studies. Our results demonstrate that the openly available model can generate realistic high-frequency seismic waveforms across a wide range of input parameters, even in data-sparse regions. For the scalar ground-motion statistics commonly used in seismic-hazard and earthquake-engineering studies, we show that our model accurately reproduces both the median trends of the real data and their variability. To evaluate and compare the growing number of these and similar Generative Waveform Models (GWMs), we argue that they should be openly available and included in community ground-motion-model evaluation efforts.
Paper and project links
Summary
This paper introduces HighFEM, a new method for high-frequency seismic waveform generation. Seismic waveforms are represented as spectrograms and reduced to a lower-dimensional manifold with an autoencoder; a state-of-the-art diffusion model then generates this latent representation conditioned on key input parameters such as earthquake magnitude, recording distance, site conditions, hypocenter depth, and azimuthal gap. The model produces waveforms with frequency content up to 50 Hz, from which scalar ground-motion statistics such as peak amplitudes and spectral accelerations can be readily derived. Validation with standard seismological and image-generation metrics shows that the model generates realistic high-frequency waveforms across a wide range of inputs, even in data-sparse regions, and accurately reproduces both the median trends and the variability of real data for common ground-motion statistics.
Key Takeaways
- HighFEM is a novel, computationally efficient, and scalable generative model for high-frequency seismic waveform generation.
- Seismic waveforms are represented as spectrograms and compressed with an autoencoder.
- A diffusion model generates the latent representation conditioned on key input parameters such as earthquake magnitude and recording distance.
- The model generates waveforms with frequency content up to 50 Hz.
- Scalar ground-motion statistics can be readily derived from the synthesized waveforms.
- The model is validated with commonly used seismological metrics.
- It performs well across a wide range of input parameters, even in data-sparse regions, and reproduces the median trends and variability of real data.
Click here to view paper screenshots
ReviveDiff: A Universal Diffusion Model for Restoring Images in Adverse Weather Conditions
Authors:Wenfeng Huang, Guoan Xu, Wenjing Jia, Stuart Perry, Guangwei Gao
Images captured in challenging environments–such as nighttime, smoke, rainy weather, and underwater–often suffer from significant degradation, resulting in a substantial loss of visual quality. The effective restoration of these degraded images is critical for the subsequent vision tasks. While many existing approaches have successfully incorporated specific priors for individual tasks, these tailored solutions limit their applicability to other degradations. In this work, we propose a universal network architecture, dubbed ``ReviveDiff’’, which can address various degradations and bring images back to life by enhancing and restoring their quality. Our approach is inspired by the observation that, unlike degradation caused by movement or electronic issues, quality degradation under adverse conditions primarily stems from natural media (such as fog, water, and low luminance), which generally preserves the original structures of objects. To restore the quality of such images, we leveraged the latest advancements in diffusion models and developed ReviveDiff to restore image quality from both macro and micro levels across some key factors determining image quality, such as sharpness, distortion, noise level, dynamic range, and color accuracy. We rigorously evaluated ReviveDiff on seven benchmark datasets covering five types of degrading conditions: Rainy, Underwater, Low-light, Smoke, and Nighttime Hazy. Our experimental results demonstrate that ReviveDiff outperforms the state-of-the-art methods both quantitatively and visually.
Paper and project links
Summary
This paper proposes a universal network architecture, ReviveDiff, that handles a variety of degradations and restores image quality under adverse conditions. Building on recent advances in diffusion models, it restores quality at both macro and micro levels across key factors such as sharpness, distortion, noise level, dynamic range, and color accuracy. Experiments on seven benchmark datasets covering five degradation types show that ReviveDiff outperforms state-of-the-art methods both quantitatively and visually.
Key Takeaways
- Images captured in challenging environments (nighttime, smoke, rain, underwater) often suffer significant degradation and loss of visual quality.
- Restoring such degraded images is critical for subsequent vision tasks.
- Many existing approaches rely on task-specific priors, which limits their applicability to other degradations.
- ReviveDiff is a universal network architecture that addresses diverse degradations and restores image quality.
- It is motivated by the observation that degradation under adverse conditions mainly stems from natural media (fog, water, low luminance), which largely preserve the original structure of objects.
- Leveraging recent diffusion models, ReviveDiff restores quality at macro and micro levels across sharpness, distortion, noise level, dynamic range, and color accuracy.
Click here to view paper screenshots
Neural Entropy
Authors:Akhil Premkumar
We explore the connection between deep learning and information theory through the paradigm of diffusion models. A diffusion model converts noise into structured data by reinstating, imperfectly, information that is erased when data was diffused to noise. This information is stored in a neural network during training. We quantify this information by introducing a measure called neural entropy, which is related to the total entropy produced by diffusion. Neural entropy is a function of not just the data distribution, but also the diffusive process itself. Measurements of neural entropy on a few simple image diffusion models reveal that they are extremely efficient at compressing large ensembles of structured data.
Paper and project links
PDF 29 pages + references, 18 figures. Camera-ready version from NeurIPS 2025
Summary
This paper explores the connection between deep learning and information theory through diffusion models. A diffusion model converts noise into structured data by imperfectly reinstating information that was erased when the data was diffused to noise; this information is stored in a neural network during training. The authors quantify it with a measure called neural entropy, which is related to the total entropy produced by diffusion and depends on both the data distribution and the diffusive process itself. Measurements on several simple image diffusion models show they are extremely efficient at compressing large ensembles of structured data.
Key Takeaways
- Diffusion models convert noise into structured data.
- During training, the neural network stores the information that is erased when data is diffused to noise.
- Neural entropy is introduced to quantify this information in diffusion models.
- Neural entropy depends on both the data distribution and the diffusive process.
- Measurements on simple image diffusion models show they compress structured data very efficiently.
- The result highlights the theoretical and practical value of diffusion models for deep learning.
Click here to view paper screenshots