Published: 2025-09-16

Updated: 2025-10-07

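⚠️ All summaries below were generated by a large language model. They may contain errors, so treat them as reference only and use them with caution.
🔴 Note: do not use them in serious academic work. They are intended only for initial screening before reading a paper.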

2025-09-16 Update

InfGen: A Resolution-Agnostic Paradigm for Scalable Image Synthesis

Authors:Tao Han, Wanghan Xu, Junchao Gong, Xiaoyu Yue, Song Guo, Luping Zhou, Lei Bai

Arbitrary resolution image generation provides a consistent visual experience across devices, having extensive applications for producers and consumers. Current diffusion models increase computational demand quadratically with resolution, causing 4K image generation delays over 100 seconds. To solve this, we explore the second generation upon the latent diffusion models, where the fixed latent generated by diffusion models is regarded as the content representation and we propose to decode arbitrary resolution images with a compact generated latent using a one-step generator. Thus, we present InfGen, replacing the VAE decoder with the new generator, for generating images at any resolution from a fixed-size latent without retraining the diffusion models, which simplifies the process, reducing computational complexity and can be applied to any model using the same latent space. Experiments show InfGen is capable of improving many models into the arbitrary high-resolution era while cutting 4K image generation time to under 10 seconds.

Paper and project links

PDF Accepted by ICCV 2025

Summary

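InfGen targets arbitrary-resolution image generation, which gives a consistent visual experience across devices. Because the compute cost of current diffusion models grows quadratically with resolution, generating a 4K image takes over 100 seconds. InfGen treats the fixed-size latent produced by a latent diffusion model as a content representation and replaces the VAE decoder with a one-step generator that decodes that latent into an image at any resolution. The diffusion model itself is not retrained, so the approach applies to any model that shares the same latent space. In the reported experiments, it cuts 4K generation time to under 10 seconds.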

Key Takeaways

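Arbitrary-resolution image generation provides a consistent visual experience across devices.
The compute cost of current diffusion models grows quadratically with resolution, so generating a 4K image takes over 100 seconds.
InfGen adds a second generation stage on top of a latent diffusion model: the fixed-size latent produced by the diffusion model is treated as the content representation.
A one-step generator replaces the VAE decoder and decodes the compact latent at any resolution, with no retraining of the diffusion model.
This simplifies the pipeline, lowers computational complexity, and applies to any model that uses the same latent space.
Experiments show that InfGen brings many existing models to arbitrary high resolution while cutting 4K generation time to under 10 seconds.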
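
The scaling argument behind these points can be checked with a little arithmetic. The sketch below is only an illustration, not InfGen itself: it assumes a VAE with 8x spatial downsampling and treats a 4K image as 4096 x 4096 pixels, neither of which is stated in the abstract, and `latent_positions` is a hypothetical helper name. It counts how many latent positions a diffusion model would have to denoise if the latent grew with the output size.

```python
def latent_positions(side_px: int, downsample: int = 8) -> int:
    """Number of spatial positions in a square latent.

    `downsample` is an assumed VAE downsampling factor (8 is a common
    choice, but it is an assumption here, not a value from the paper).
    """
    side_latent = side_px // downsample
    return side_latent * side_latent


base = latent_positions(512)     # latent for a 512 x 512 image
large = latent_positions(4096)   # latent if it scaled with a 4096 x 4096 image

# The latent grows with the square of the side length: 8x the side, 64x the positions.
growth = large // base
print(f"512px latent positions:  {base}")
print(f"4096px latent positions: {large}")
print(f"growth factor:           {growth}")
```

Under these assumptions, keeping the latent at a fixed size holds the diffusion cost at the 512-pixel figure whatever the output size, and the resolution-dependent work moves into the one-step decoder.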

GARD: Gamma-based Anatomical Restoration and Denoising for Retinal OCT

Authors:Botond Fazekas, Thomas Pinetz, Guilherme Aresta, Taha Emre, Hrvoje Bogunovic

Optical Coherence Tomography (OCT) is a vital imaging modality for diagnosing and monitoring retinal diseases. However, OCT images are inherently degraded by speckle noise, which obscures fine details and hinders accurate interpretation. While numerous denoising methods exist, many struggle to balance noise reduction with the preservation of crucial anatomical structures. This paper introduces GARD (Gamma-based Anatomical Restoration and Denoising), a novel deep learning approach for OCT image despeckling that leverages the strengths of diffusion probabilistic models. Unlike conventional diffusion models that assume Gaussian noise, GARD employs a Denoising Diffusion Gamma Model to more accurately reflect the statistical properties of speckle. Furthermore, we introduce a Noise-Reduced Fidelity Term that utilizes a pre-processed, less-noisy image to guide the denoising process. This crucial addition prevents the reintroduction of high-frequency noise. We accelerate the inference process by adapting the Denoising Diffusion Implicit Model framework to our Gamma-based model. Experiments on a dataset with paired noisy and less-noisy OCT B-scans demonstrate that GARD significantly outperforms traditional denoising methods and state-of-the-art deep learning models in terms of PSNR, SSIM, and MSE. Qualitative results confirm that GARD produces sharper edges and better preserves fine anatomical details.

Paper and project links

PDF

Summary

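This paper introduces GARD (Gamma-based Anatomical Restoration and Denoising), a deep learning method for despeckling retinal optical coherence tomography (OCT) images, built on diffusion probabilistic models. Instead of the Gaussian noise assumed by conventional diffusion models, GARD uses a Denoising Diffusion Gamma Model that better matches the statistics of speckle. A Noise-Reduced Fidelity Term uses a pre-processed, less-noisy image to guide denoising and prevents high-frequency noise from being reintroduced. Inference is accelerated by adapting the Denoising Diffusion Implicit Model framework to the Gamma-based model. On a dataset of paired noisy and less-noisy OCT B-scans, GARD outperforms traditional denoising methods and state-of-the-art deep learning models in PSNR, SSIM, and MSE, and it produces sharper edges with better-preserved anatomical detail.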

Key Takeaways

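GARD is a new deep learning method for despeckling OCT images, based on diffusion probabilistic models.
It uses a Denoising Diffusion Gamma Model, which reflects the statistics of speckle noise more accurately than a Gaussian assumption.
A Noise-Reduced Fidelity Term guides denoising with a pre-processed, less-noisy image and prevents high-frequency noise from being reintroduced.
GARD outperforms traditional denoising methods and state-of-the-art deep learning models in PSNR, SSIM, and MSE.
Qualitatively, GARD produces sharper edges and better preserves fine anatomical detail in OCT images.
Inference is accelerated by adapting the Denoising Diffusion Implicit Model framework to the Gamma-based model.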
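
To see why a Gamma noise assumption can suit speckle better than a Gaussian one, consider the toy comparison below. It is not GARD's Denoising Diffusion Gamma Model. It only assumes the common simplification that speckle acts as multiplicative, unit-mean Gamma noise, and the function names, the shape parameter, and the pixel values are all illustrative choices.

```python
import random


def add_gamma_speckle(pixels, shape=4.0, rng=None):
    """Multiply each pixel by Gamma(shape, 1/shape) noise, which has mean 1."""
    rng = rng or random.Random(0)
    scale = 1.0 / shape
    return [p * rng.gammavariate(shape, scale) for p in pixels]


def add_gaussian_noise(pixels, sigma=0.5, rng=None):
    """Add zero-mean Gaussian noise to each pixel."""
    rng = rng or random.Random(0)
    return [p + rng.gauss(0.0, sigma) for p in pixels]


clean = [0.2] * 20000  # a flat, dark region of an intensity image
speckled = add_gamma_speckle(clean, rng=random.Random(0))
gaussian = add_gaussian_noise(clean, rng=random.Random(0))

mean_speckled = sum(speckled) / len(speckled)
print("gamma speckle stays positive:", all(v > 0 for v in speckled))
print("gaussian noise goes negative:", any(v < 0 for v in gaussian))
print("mean after gamma speckle is close to 0.2:", abs(mean_speckled - 0.2) < 0.01)
```

Multiplicative Gamma noise keeps intensities positive and scales with the signal, while additive Gaussian noise of the same order pushes dark pixels below zero.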

Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Authors:Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, Linfeng Zhang

Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverage the temporal similarity of diffusion models while ignoring the similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective for previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates their information to all the other tokens, which is able to reduce the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.

Paper and project links

PDF 11 pages, 11 figures; Accepted by ACM MM2025; Mainly focus on feature caching for diffusion transformers acceleration

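Summary

Diffusion transformers generate high-quality images and videos, but their iterative denoising process is computationally expensive. Feature caching speeds them up by storing features computed at earlier timesteps and reusing them at later ones. It exploits temporal similarity but ignores similarity along the spatial dimension. This paper introduces Cluster-Driven Feature Caching (ClusCa) as an orthogonal, complementary approach. At each timestep ClusCa clusters tokens spatially, computes only one token per cluster, and propagates its information to the other tokens, reducing the number of computed tokens by over 90%. Experiments on DiT, FLUX, and HunyuanVideo show that it is effective for both text-to-image and text-to-video generation, and it applies to any diffusion transformer without training. On FLUX, for example, ClusCa achieves a 4.96x speedup with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.

Key Takeaways

Diffusion transformers generate high-quality images and videos, but iterative denoising makes them computationally expensive.
Existing feature caching reuses features from earlier timesteps, exploiting temporal similarity while ignoring spatial similarity.
ClusCa clusters tokens spatially at each timestep, computes one token per cluster, and propagates the result to the other tokens.
This reduces the number of computed tokens by over 90%. On FLUX it yields a 4.96x speedup with an ImageReward of 99.49%.
Experiments on DiT, FLUX, and HunyuanVideo cover both text-to-image and text-to-video generation.
The method applies to any diffusion transformer without training.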
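
The cluster-then-propagate idea can be sketched in a few lines. This is a toy version under loud assumptions, not the ClusCa implementation: the clustering is a small hand-written k-means, propagation is a plain copy of the representative's output, and `kmeans`, `clustered_forward`, and `expensive_fn` are hypothetical names. The abstract does not specify how ClusCa clusters tokens or propagates information.

```python
import math


def kmeans(tokens, k, iters=10):
    """Tiny k-means; returns (cluster index per token, centroids)."""
    # Deterministic init: pick k evenly spaced tokens as starting centroids.
    centroids = [tokens[i * len(tokens) // k] for i in range(k)]
    assign = [0] * len(tokens)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(t, centroids[c]))
                  for t in tokens]
        for c in range(k):
            members = [t for t, a in zip(tokens, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign, centroids


def clustered_forward(tokens, k, expensive_fn):
    """Run expensive_fn on one representative per cluster and share its output."""
    assign, centroids = kmeans(tokens, k)
    outputs = [None] * len(tokens)
    computed = 0
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        # Representative: the member closest to the cluster centroid.
        rep = min(members, key=lambda i: math.dist(tokens[i], centroids[c]))
        out = expensive_fn(tokens[rep])
        computed += 1
        for i in members:
            outputs[i] = out  # naive propagation (assumption): copy the output
    return outputs, computed


tokens = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
outputs, computed = clustered_forward(tokens, k=2, expensive_fn=lambda t: sum(t))
print(computed, "of", len(tokens), "tokens computed")
```

Here the stand-in for the expensive layer runs twice instead of six times. The saving grows with the ratio of tokens to clusters, which is the lever behind the reported reduction of over 90%.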

Scalable Training for Vector-Quantized Networks with 100% Codebook Utilization

Authors:Yifan Chang, Jie Qin, Limeng Qiao, Xiaofeng Wang, Zheng Zhu, Lin Ma, Xingang Wang

Vector quantization (VQ) is a key component in discrete tokenizers for image generation, but its training is often unstable due to straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction performance and low codebook usage. In this work, we analyze these fundamental challenges and provide a simple yet effective solution. To maintain high codebook usage in VQ networks (VQN) during learning annealing and codebook size expansion, we propose VQBridge, a robust, scalable, and efficient projector based on the map function method. VQBridge optimizes code vectors through a compress-process-recover pipeline, enabling stable and effective codebook training. By combining VQBridge with learning annealing, our VQN achieves full (100%) codebook usage across diverse codebook configurations, which we refer to as FVQ (FullVQ). Through extensive experiments, we demonstrate that FVQ is effective, scalable, and generalizable: it attains 100% codebook usage even with a 262k-codebook, achieves state-of-the-art reconstruction performance, consistently improves with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants. Moreover, when integrated with LlamaGen, FVQ significantly enhances image generation performance, surpassing visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, highlighting the importance of high-quality tokenizers for strong autoregressive image generation.

Paper and project links

PDF

Summary

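Vector quantization (VQ) is a key component of discrete tokenizers for image generation, but its training is often unstable because of straight-through estimation bias, one-step-behind updates, and sparse codebook gradients. The paper proposes VQBridge, a projector based on the map function method that optimizes code vectors through a compress-process-recover pipeline for stable codebook training. Combining VQBridge with learning annealing yields FVQ (FullVQ), which reaches 100% codebook usage across diverse codebook configurations, including a 262k codebook. FVQ achieves state-of-the-art reconstruction performance, and when integrated with LlamaGen it significantly improves image generation.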

Key Takeaways

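Vector quantization (VQ) is a key component of discrete tokenizers for image generation, but its training is often unstable.
The instability stems from straight-through estimation bias, one-step-behind updates, and sparse codebook gradients, which lead to suboptimal reconstruction and low codebook usage.
VQBridge is a robust, scalable, and efficient projector based on the map function method that optimizes code vectors through a compress-process-recover pipeline.
Combined with learning annealing, VQBridge gives FVQ (FullVQ), which achieves 100% codebook usage across diverse codebook configurations, even with a 262k codebook.
FVQ improves consistently with larger codebooks, higher vector channels, or longer training, and remains effective across different VQ variants.
Integrated with LlamaGen, FVQ surpasses visual autoregressive models (VAR) by 0.5 and diffusion models (DiT) by 0.2 rFID, as reported in the abstract.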
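
Codebook usage, the quantity these points revolve around, is easy to measure. The sketch below is a generic illustration of nearest-neighbor quantization and the usage metric, not VQBridge or FVQ. The function name `codebook_usage` and the toy data are assumptions made for the example.

```python
import math


def codebook_usage(vectors, codebook):
    """Fraction of codebook entries selected by at least one input vector."""
    used = set()
    for v in vectors:
        # Standard VQ assignment: index of the nearest code vector.
        nearest = min(range(len(codebook)), key=lambda j: math.dist(v, codebook[j]))
        used.add(nearest)
    return len(used) / len(codebook)


# Two codes sit near the data; the other two are far away and never chosen.
codebook = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0), (-10.0, -10.0)]
vectors = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (1.2, 0.8)]

usage = codebook_usage(vectors, codebook)
print(f"codebook usage: {usage:.0%}")
```

Codes that are never selected receive no gradient from the assignment step, which is one way to read the "sparse codebook gradients" problem named in the abstract.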

Realism Control One-step Diffusion for Real-World Image Super-Resolution

Authors:Zongliang Wu, Siming Zheng, Peng-Tao Jiang, Xin Yuan

Pre-trained diffusion models have shown great potential in real-world image super-resolution (Real-ISR) tasks by enabling high-resolution reconstructions. While one-step diffusion (OSD) methods significantly improve efficiency compared to traditional multi-step approaches, they still have limitations in balancing fidelity and realism across diverse scenarios. Since the OSDs for SR are usually trained or distilled by a single timestep, they lack flexible control mechanisms to adaptively prioritize these competing objectives, which are inherently manageable in multi-step methods through adjusting sampling steps. To address this challenge, we propose a Realism Controlled One-step Diffusion (RCOD) framework for Real-ISR. RCOD provides a latent domain grouping strategy that enables explicit control over fidelity-realism trade-offs during the noise prediction phase with minimal training paradigm modifications and original training data. A degradation-aware sampling strategy is also introduced to align distillation regularization with the grouping strategy and enhance the controlling of trade-offs. Moreover, a visual prompt injection module is used to replace conventional text prompts with degradation-aware visual tokens, enhancing both restoration accuracy and semantic consistency. Our method achieves superior fidelity and perceptual quality while maintaining computational efficiency. Extensive experiments demonstrate that RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual qualities, with flexible realism control capabilities in the inference stage. The code will be released.

Paper and project links

PDF

Summary

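Pre-trained diffusion models show great potential for real-world image super-resolution (Real-ISR). One-step diffusion (OSD) methods are far more efficient than multi-step approaches, but because they are usually trained or distilled at a single timestep, they lack a flexible way to trade fidelity against realism, something multi-step methods can do by adjusting the number of sampling steps. The proposed Realism Controlled One-step Diffusion (RCOD) framework adds a latent domain grouping strategy that gives explicit control over the fidelity-realism trade-off during noise prediction, with minimal changes to the training paradigm and using the original training data. A degradation-aware sampling strategy aligns distillation regularization with the grouping strategy, and a visual prompt injection module replaces text prompts with degradation-aware visual tokens to improve restoration accuracy and semantic consistency. Experiments show that RCOD outperforms state-of-the-art OSD methods in quantitative metrics and visual quality while allowing realism to be adjusted at inference time.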

Key Takeaways

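Pre-trained diffusion models show great potential for real-world image super-resolution (Real-ISR).
One-step diffusion (OSD) methods are more efficient than multi-step approaches but struggle to balance fidelity and realism across diverse scenarios.
RCOD's latent domain grouping strategy gives explicit control over the fidelity-realism trade-off during noise prediction.
A degradation-aware sampling strategy and a visual prompt injection module improve restoration accuracy and semantic consistency.
RCOD outperforms state-of-the-art OSD methods in both quantitative metrics and visual quality.
Realism can be controlled flexibly at the inference stage.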

Chord: Chain of Rendering Decomposition for PBR Material Estimation from Generated Texture Images

Authors:Zhi Ying, Boxiang Rong, Jingyu Wang, Maoyuan Xu

Material creation and reconstruction are crucial for appearance modeling but traditionally require significant time and expertise from artists. While recent methods leverage visual foundation models to synthesize PBR materials from user-provided inputs, they often fall short in quality, flexibility, and user control. We propose a novel two-stage generate-and-estimate framework for PBR material generation. In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with user input. In the estimation stage, we introduce a chained decomposition scheme that sequentially predicts SVBRDF channels by passing previously extracted representation as input into a single-step image-conditional diffusion model. Our method is efficient, high quality, and enables flexible user control. We evaluate our approach against existing material generation and estimation methods, demonstrating superior performance. Our material estimation method shows strong robustness on both generated textures and in-the-wild photographs. Furthermore, we highlight the flexibility of our framework across diverse applications, including text-to-material, image-to-material, structure-guided generation, and material editing.

Paper and project links

PDF Accepted to SIGGRAPH Asia 2025. Project page: https://ubisoft-laforge.github.io/world/chord

Summary

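This paper proposes a two-stage generate-and-estimate framework for PBR material generation. In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with the user's input. In the estimation stage, a chained decomposition scheme predicts SVBRDF channels one after another, passing previously extracted representations as input to a single-step image-conditional diffusion model. The method is efficient, produces high-quality results, allows flexible user control, and supports text-to-material, image-to-material, structure-guided generation, and material editing.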

Key Takeaways

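The paper proposes a two-stage framework for PBR material generation, with a generation stage and an estimation stage.
In the generation stage, a fine-tuned diffusion model synthesizes shaded, tileable texture images aligned with the user's input.
In the estimation stage, a chained decomposition scheme predicts SVBRDF channels sequentially with a single-step image-conditional diffusion model, feeding previously extracted representations back in as input.
The method is efficient, produces high-quality results, and allows flexible user control.
It outperforms existing material generation and estimation methods in the reported evaluation.
The material estimation method is robust on both generated textures and in-the-wild photographs.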

Just Say the Word: Annotation-Free Fine-Grained Object Counting

Authors:Adriano D’Alessandro, Ali Mahdavi-Amiri, Ghassan Hamarneh

Fine-grained object counting remains a major challenge for class-agnostic counting models, which overcount visually similar but incorrect instances (e.g., jalapeño vs. poblano). Addressing this by annotating new data and fully retraining the model is time-consuming and does not guarantee generalization to additional novel categories at test time. Instead, we propose an alternative paradigm: Given a category name, tune a compact concept embedding derived from the prompt using synthetic images and pseudo-labels generated by a text-to-image diffusion model. This embedding conditions a specialization module that refines raw overcounts from any frozen counter into accurate, category-specific estimates, without requiring real images or human annotations. We validate our approach on Lookalikes, a challenging new benchmark containing 1,037 images across 27 fine-grained subcategories, and show substantial improvements over strong baselines. Code will be released upon acceptance. Dataset - https://dalessandro.dev/datasets/lookalikes/

Summary

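This paper proposes an annotation-free approach to fine-grained object counting. Given only a category name, it tunes a compact concept embedding derived from the prompt, using synthetic images and pseudo-labels generated by a text-to-image diffusion model. The embedding conditions a specialization module that refines the raw overcounts of any frozen class-agnostic counter into accurate, category-specific estimates, without real images or human annotations.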

Key Takeaways

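Fine-grained object counting is a major challenge: class-agnostic counters overcount instances that look similar but belong to the wrong category.
Annotating new data and fully retraining is time-consuming and does not guarantee generalization to novel categories at test time.
The proposed method tunes a compact concept embedding from a category name, using synthetic images and pseudo-labels generated by a text-to-image diffusion model.
The embedding conditions a specialization module that refines raw overcounts from any frozen counter into category-specific estimates, with no real images or human annotations.
The paper introduces Lookalikes, a new benchmark with 1,037 images across 27 fine-grained subcategories.
On Lookalikes, the method shows substantial improvements over strong baselines.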

Can Generative Geospatial Diffusion Models Excel as Discriminative Geospatial Foundation Models?

Authors:Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, Andrea Nascetti

Self-supervised learning (SSL) has revolutionized representation learning in Remote Sensing (RS), advancing Geospatial Foundation Models (GFMs) to leverage vast unlabeled satellite imagery for diverse downstream tasks. Currently, GFMs primarily employ objectives like contrastive learning or masked image modeling, owing to their proven success in learning transferable representations. However, generative diffusion models, which demonstrate the potential to capture multi-grained semantics essential for RS tasks during image generation, remain underexplored for discriminative applications. This prompts the question: can generative diffusion models also excel and serve as GFMs with sufficient discriminative power? In this work, we answer this question with SatDiFuser, a framework that transforms a diffusion-based generative geospatial foundation model into a powerful pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, we develop three fusion strategies to effectively leverage these diverse representations. Extensive experiments on remote sensing benchmarks show that SatDiFuser outperforms state-of-the-art GFMs, achieving gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, demonstrating the capacity of diffusion-based generative foundation models to rival or exceed discriminative GFMs. The source code is available at: https://github.com/yurujaja/SatDiFuser.

Paper and project links

PDF ICCV 2025, camera ready

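Summary

Self-supervised learning (SSL) has transformed representation learning in remote sensing (RS), allowing Geospatial Foundation Models (GFMs) to use vast amounts of unlabeled satellite imagery for diverse downstream tasks. Current GFMs rely mainly on objectives such as contrastive learning or masked image modeling. Generative diffusion models can capture the multi-grained semantics that RS tasks need, yet they remain underexplored for discriminative applications. SatDiFuser is a framework that turns a diffusion-based generative geospatial foundation model into a pretraining tool for discriminative RS. By systematically analyzing multi-stage, noise-dependent diffusion features, the authors develop three fusion strategies to combine these representations. On remote sensing benchmarks, SatDiFuser outperforms state-of-the-art GFMs, with gains of up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification, showing that diffusion-based generative foundation models can rival or exceed discriminative GFMs.

Key Takeaways

Self-supervised learning has advanced Geospatial Foundation Models, which use large amounts of unlabeled satellite imagery for diverse downstream tasks.
Current GFMs mainly use objectives such as contrastive learning or masked image modeling.
Generative diffusion models can capture the multi-grained semantics needed for RS tasks but remain underexplored for discriminative applications.
SatDiFuser turns a diffusion-based generative geospatial foundation model into a pretraining tool for discriminative RS tasks.
Three fusion strategies are developed from a systematic analysis of multi-stage, noise-dependent diffusion features.
On remote sensing benchmarks, SatDiFuser outperforms state-of-the-art GFMs by up to +5.7% mIoU in semantic segmentation and +7.9% F1-score in classification.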

Your Image is Secretly the Last Frame of a Pseudo Video

Authors:Wenlong Chen, Wenlin Chen, Lapo Rastrelli, Yingzhen Li

Diffusion models, which can be viewed as a special case of hierarchical variational autoencoders (HVAEs), have shown profound success in generating photo-realistic images. In contrast, standard HVAEs often produce images of inferior quality compared to diffusion models. In this paper, we hypothesize that the success of diffusion models can be partly attributed to the additional self-supervision information for their intermediate latent states provided by corrupted images, which along with the original image form a pseudo video. Based on this hypothesis, we explore the possibility of improving other types of generative models with such pseudo videos. Specifically, we first extend a given image generative model to their video generative model counterpart, and then train the video generative model on pseudo videos constructed by applying data augmentation to the original images. Furthermore, we analyze the potential issues of first-order Markov data augmentation methods, which are typically used in diffusion models, and propose to use more expressive data augmentation to construct more useful information in pseudo videos. Our empirical results on the CIFAR10 and CelebA datasets demonstrate that improved image generation quality can be achieved with additional self-supervised information from pseudo videos.

Paper and project links

PDF Presented at the ICLR 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy (DeLTa). 1-frame results for CIFAR10 in Table 2 corrected. Code released

Summary

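Diffusion models can be viewed as a special case of hierarchical variational autoencoders (HVAEs), yet they generate far more photo-realistic images than standard HVAEs. The paper hypothesizes that part of this success comes from the extra self-supervision that corrupted images provide for the intermediate latent states. Together with the original image, these corrupted images form a pseudo video. Based on this hypothesis, the authors extend an image generative model to its video counterpart and train it on pseudo videos built by applying data augmentation to the original images. They also analyze potential issues with the first-order Markov data augmentation typically used in diffusion models and propose more expressive augmentation to put more useful information into the pseudo videos. Results on CIFAR10 and CelebA show that the additional self-supervised information from pseudo videos improves image generation quality.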

Key Takeaways

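Diffusion models generate photo-realistic images, and the paper attributes part of that success to extra self-supervision from corrupted images.
The corrupted images supervise the intermediate latent states and, together with the original image, form a pseudo video.
The paper explores whether such pseudo videos can improve other types of generative models.
An image generative model is extended to its video counterpart and trained on pseudo videos built by augmenting the original images.
First-order Markov data augmentation, typical of diffusion models, has potential issues, so the paper proposes more expressive augmentation.
Experiments on CIFAR10 and CelebA show that self-supervised information from pseudo videos improves image generation quality.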
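
The pseudo-video construction can be pictured with a short sketch. It is an illustration under assumptions, not the authors' code: the corruption is additive Gaussian noise with a fixed `sigma`, and the function name `pseudo_video` and the flat pixel list are stand-ins chosen for brevity. It builds the first-order Markov variant, in which each frame depends only on the previous one. This is the scheme the paper analyzes and proposes to replace with more expressive augmentation.

```python
import random


def pseudo_video(image, num_frames=4, sigma=0.1, seed=0):
    """Build a pseudo video whose last frame is the original image.

    Each earlier frame is produced by corrupting the frame that follows it,
    so the chain is first-order Markov. Gaussian noise is an assumption.
    """
    rng = random.Random(seed)
    frames = [list(image)]
    for _ in range(num_frames - 1):
        prev = frames[-1]
        frames.append([p + rng.gauss(0.0, sigma) for p in prev])
    frames.reverse()  # most corrupted frame first, original image last
    return frames


image = [0.0, 0.5, 1.0]  # a toy "image" with three pixels
video = pseudo_video(image)
print("frames:", len(video))
print("last frame is the original:", video[-1] == image)
```

A video generative model trained on such sequences has to predict every frame, so the corrupted frames act as extra training targets alongside the final image.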

Kedreamix

https://kedreamix.github.io/Talk2Paper/Paper/2025-09-16/Diffusion%20Models/

Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting.

Diffusion Models
