
Diffusion Models


⚠️ All of the summaries below are produced by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-09-20

Lightweight and Accurate Multi-View Stereo with Confidence-Aware Diffusion Model

Authors:Fangjinhua Wang, Qingshan Xu, Yew-Soon Ong, Marc Pollefeys

To reconstruct the 3D geometry from calibrated images, learning-based multi-view stereo (MVS) methods typically perform multi-view depth estimation and then fuse depth maps into a mesh or point cloud. To improve the computational efficiency, many methods initialize a coarse depth map and then gradually refine it at higher resolutions. Recently, diffusion models have achieved great success in generation tasks. Starting from random noise, diffusion models gradually recover the sample with an iterative denoising process. In this paper, we propose a novel MVS framework which introduces diffusion models into MVS. Specifically, we formulate depth refinement as a conditional diffusion process. Considering the discriminative characteristic of depth estimation, we design a condition encoder to guide the diffusion process. To improve efficiency, we propose a novel diffusion network combining a lightweight 2D U-Net and a convolutional GRU. Moreover, we propose a novel confidence-based sampling strategy to adaptively sample depth hypotheses based on the confidence estimated by the diffusion model. Based on our novel MVS framework, we propose two novel MVS methods, DiffMVS and CasDiffMVS. DiffMVS achieves competitive performance with state-of-the-art efficiency in run-time and GPU memory. CasDiffMVS achieves state-of-the-art performance on DTU, Tanks & Temples and ETH3D. Code is available at: https://github.com/cvg/diffmvs.


Paper & Project Links

PDF Accepted to IEEE T-PAMI 2025. Code: https://github.com/cvg/diffmvs

Summary

This paper introduces diffusion models into a multi-view stereo (MVS) framework, formulating depth refinement as a conditional diffusion process. A condition encoder is designed to strengthen the discriminative power of depth estimation, and a new diffusion network combining a lightweight 2D U-Net with a convolutional GRU is proposed. In addition, a confidence-based sampling strategy adaptively samples depth hypotheses according to the confidence estimated by the diffusion model. Built on this framework, two new MVS methods are proposed: DiffMVS, which offers highly competitive efficiency in run-time and GPU memory, and CasDiffMVS, which reaches state-of-the-art performance on DTU, Tanks & Temples and ETH3D.

Key Takeaways

  1. Diffusion models are introduced into the multi-view stereo (MVS) framework to improve both the efficiency and the quality of 3D reconstruction from calibrated images.
  2. Depth refinement is formulated as a conditional diffusion process to improve the accuracy of depth estimation.
  3. A condition encoder is designed to guide the diffusion process, strengthening the discriminative power of depth estimation.
  4. A new diffusion network combining a lightweight 2D U-Net with a convolutional GRU is proposed.
  5. A confidence-based sampling strategy adaptively samples depth hypotheses according to the confidence estimated by the diffusion model (see the sketch below).
  6. Two MVS methods, DiffMVS and CasDiffMVS, are built on the new framework.
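
To make the confidence-based sampling idea concrete, below is a minimal illustrative sketch (not the authors' implementation): given the current depth map and a per-pixel confidence predicted by the diffusion model, depth hypotheses are drawn in a window whose radius shrinks as confidence grows. The tensor shapes, the number of hypotheses and the linear scaling rule are assumptions made for illustration.

```python
import torch

def sample_depth_hypotheses(depth, confidence, num_hyp=4, max_radius=0.1):
    """Sample per-pixel depth hypotheses around the current estimate (sketch).

    depth:      (B, 1, H, W) current depth map
    confidence: (B, 1, H, W) in [0, 1], higher means more certain
    Returns:    (B, num_hyp, H, W) depth hypotheses

    The search radius is scaled by (1 - confidence), so uncertain pixels
    explore a wider depth range while confident pixels stay close to the
    current estimate. The linear scaling is an assumption for illustration.
    """
    radius = max_radius * (1.0 - confidence)                  # (B, 1, H, W)
    # Evenly spaced offsets in [-1, 1], one per hypothesis.
    offsets = torch.linspace(-1.0, 1.0, num_hyp, device=depth.device)
    offsets = offsets.view(1, num_hyp, 1, 1)
    return depth + offsets * radius                           # (B, num_hyp, H, W)

# Usage: hyps = sample_depth_hypotheses(depth, conf); the matching cost of each
# hypothesis would then be fed to the refinement network.
```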

Cool Papers

Click here to view paper screenshots

AutoEdit: Automatic Hyperparameter Tuning for Image Editing

Authors:Chau Pham, Quan Dao, Mahesh Bhosale, Yunjie Tian, Dimitris Metaxas, David Doermann

Recent advances in diffusion models have revolutionized text-guided image editing, yet existing editing methods face critical challenges in hyperparameter identification. To obtain reasonable editing performance, these methods often require the user to brute-force tune multiple interdependent hyperparameters, such as inversion timesteps, attention modification, etc. This process incurs high computational costs due to the huge hyperparameter search space. We consider searching for the optimal editing hyperparameters as a sequential decision-making task within the diffusion denoising process. Specifically, we propose a reinforcement learning framework, which establishes a Markov Decision Process that dynamically adjusts hyperparameters across denoising steps, integrating editing objectives into a reward function. The method achieves time efficiency through proximal policy optimization while maintaining optimal hyperparameter configurations. Experiments demonstrate a significant reduction in search time and computational overhead compared to existing brute-force approaches, advancing the practical deployment of diffusion-based image editing frameworks in the real world.


Paper & Project Links

PDF Accepted to NeurIPS 2025

Summary

This paper proposes a reinforcement-learning-based approach to the hyperparameter-identification challenge in text-guided image editing with diffusion models. Hyperparameter search is cast as a sequential decision-making task within the diffusion denoising process, and a reinforcement learning framework dynamically adjusts the hyperparameters across denoising steps while integrating the editing objectives into a reward function, balancing time efficiency with optimal hyperparameter configurations. Experiments show that, compared with existing brute-force search, the method significantly reduces search time and computational overhead, advancing the practical deployment of diffusion-based image editing frameworks.

Key Takeaways

  1. Diffusion models have made rapid progress in text-guided image editing, yet they still face challenges in hyperparameter identification.
  2. Hyperparameter search is cast as a sequential decision-making task within the diffusion denoising process (see the sketch below).
  3. A reinforcement learning framework dynamically adjusts the hyperparameters across denoising steps.
  4. Editing objectives are integrated into the reward function, balancing time efficiency with optimal hyperparameter configurations.
  5. Experiments validate the method and demonstrate its advantages over existing approaches.
  6. The method significantly reduces search time and computational overhead.
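
The sequential decision-making view can be illustrated with a schematic rollout loop in which a policy picks per-step editing hyperparameters and the editing objective is scored as a reward. This is only a hedged sketch: `policy`, `edit_step` and `editing_reward` are hypothetical stand-ins for the learned PPO policy, one guided denoising step of the editor, and the editing objective.

```python
def rollout_episode(policy, edit_step, editing_reward, latent, num_steps=50):
    """One episode of the hyperparameter-tuning MDP (illustrative sketch).

    At every denoising step the policy observes the current state and picks
    the hyperparameters for that step (e.g. guidance scale, attention
    strength); the reward here is only a placeholder for the editing
    objective, scored once at the end of the episode.
    """
    trajectory = []
    for t in reversed(range(num_steps)):
        state = {"latent": latent, "timestep": t}
        action = policy(state)                    # per-step hyperparameters
        latent = edit_step(latent, t, action)     # one guided denoising step
        trajectory.append((state, action))
    reward = editing_reward(latent)
    return trajectory, reward

# A PPO learner would collect many (trajectory, reward) pairs and update
# `policy` with a clipped surrogate loss; that part is omitted here.
```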

Cool Papers

Click here to view paper screenshots

Controllable Localized Face Anonymization Via Diffusion Inpainting

Authors:Ali Salar, Qing Liu, Guoying Zhao

The growing use of portrait images in computer vision highlights the need to protect personal identities. At the same time, anonymized images must remain useful for downstream computer vision tasks. In this work, we propose a unified framework that leverages the inpainting ability of latent diffusion models to generate realistic anonymized images. Unlike prior approaches, we have complete control over the anonymization process by designing an adaptive attribute-guidance module that applies gradient correction during the reverse denoising process, aligning the facial attributes of the generated image with those of the synthesized target image. Our framework also supports localized anonymization, allowing users to specify which facial regions are left unchanged. Extensive experiments conducted on the public CelebA-HQ and FFHQ datasets show that our method outperforms state-of-the-art approaches while requiring no additional model training. The source code is available on our page.


Paper & Project Links

PDF

Summary

This work proposes a unified framework that leverages the inpainting ability of latent diffusion models to generate realistic anonymized portrait images. An adaptive attribute-guidance module applies gradient correction during the reverse denoising process so that the facial attributes of the generated image align with those of a synthesized target image, giving full control over the anonymization process. The framework also supports localized anonymization, letting users specify which facial regions are left unchanged. Extensive experiments on the public CelebA-HQ and FFHQ datasets show that the method outperforms state-of-the-art approaches without any additional model training.

Key Takeaways

  1. A unified framework leveraging the inpainting ability of latent diffusion models to generate realistic anonymized portraits is proposed.
  2. An adaptive attribute-guidance module applies gradient correction so that the facial attributes of the generated image align with those of the target image (see the sketch below).
  3. Localized anonymization is supported, allowing users to specify which facial regions are left unchanged.
  4. The method outperforms state-of-the-art approaches on the public CelebA-HQ and FFHQ datasets.
  5. No additional model training is required.
  6. The source code has been released.
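
As a rough illustration of attribute-guided gradient correction during reverse denoising, the sketch below nudges each intermediate sample toward target facial attributes using the gradient of an attribute mismatch, in the spirit of classifier-style guidance. `denoise_step` and `attr_encoder` are hypothetical stand-ins, and the MSE attribute loss is an assumption, not the paper's exact formulation.

```python
import torch

def attribute_guided_step(denoise_step, attr_encoder, x_t, t, target_attrs, scale=1.0):
    """One reverse-diffusion step with attribute gradient correction (sketch).

    `denoise_step(x_t, t)` returns (x_prev, x0_pred); `attr_encoder` maps an
    image to facial-attribute features. The gradient of the attribute
    mismatch w.r.t. x_t nudges the sample toward the target attributes.
    """
    x_t = x_t.detach().requires_grad_(True)
    x_prev, x0_pred = denoise_step(x_t, t)
    loss = torch.nn.functional.mse_loss(attr_encoder(x0_pred), target_attrs)
    grad = torch.autograd.grad(loss, x_t)[0]
    return (x_prev - scale * grad).detach()   # corrected sample for the next step
```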

Cool Papers

Click here to view paper screenshots

Radiology Report Conditional 3D CT Generation with Multi Encoder Latent diffusion Model

Authors:Sina Amirrajab, Zohaib Salahuddin, Sheng Kuang, Henry C. Woodruff, Philippe Lambin

Text-to-image latent diffusion models have recently advanced medical image synthesis, but applications to 3D CT generation remain limited. Existing approaches rely on simplified prompts, neglecting the rich semantic detail in full radiology reports, which reduces text-image alignment and clinical fidelity. We propose Report2CT, a radiology-report-conditional latent diffusion framework for synthesizing 3D chest CT volumes directly from free-text radiology reports, incorporating both the findings and impression sections using multiple text encoders. Report2CT integrates three pretrained medical text encoders (BiomedVLP-CXR-BERT, MedEmbed, and ClinicalBERT) to capture nuanced clinical context. Radiology reports and voxel spacing information condition a 3D latent diffusion model trained on 20,000 CT volumes from the CT-RATE dataset. Model performance was evaluated using the Fréchet Inception Distance (FID) for real-synthetic distributional similarity and CLIP-based metrics for semantic alignment, with additional qualitative and quantitative comparisons against the GenerateCT model. Report2CT generated anatomically consistent CT volumes with excellent visual quality and text-image alignment. Multi-encoder conditioning improved CLIP scores, indicating stronger preservation of fine-grained clinical details in the free-text radiology reports. Classifier-free guidance further enhanced alignment with only a minor trade-off in FID. We ranked first in the VLM3D Challenge at MICCAI 2025 on text-conditional CT generation and achieved state-of-the-art performance across all evaluation metrics. By leveraging complete radiology reports and multi-encoder text conditioning, Report2CT advances 3D CT synthesis, producing clinically faithful and high-quality synthetic data.


Paper & Project Links

PDF

Summary

The paper presents Report2CT, a radiology-report-conditional latent diffusion framework that synthesizes 3D chest CT volumes directly from free-text radiology reports. The framework combines several medical text encoders to capture nuanced clinical context, and a 3D latent diffusion model is trained on 20,000 CT volumes from the CT-RATE dataset. Performance is evaluated with the Fréchet Inception Distance (FID) for real-synthetic distributional similarity and CLIP-based metrics for semantic alignment. Report2CT generates anatomically consistent CT volumes with excellent visual quality and text-image alignment. Multi-encoder conditioning improves CLIP scores, indicating better preservation of the fine-grained clinical details in the free-text reports, and classifier-free guidance further enhances alignment with only a minor trade-off in FID. Report2CT ranked first in the VLM3D Challenge at MICCAI 2025 on text-conditional CT generation and achieved state-of-the-art performance across all evaluation metrics. By leveraging complete radiology reports and multi-encoder text conditioning, Report2CT advances 3D CT synthesis and produces clinically faithful, high-quality synthetic data.

Key Takeaways

  1. Report2CT is a new method that synthesizes 3D chest CT volumes directly from free-text radiology reports.
  2. Report2CT combines several medical text encoders to capture nuanced clinical context.
  3. A 3D latent diffusion model is trained on CT volumes from the CT-RATE dataset.
  4. The generated CT volumes show excellent visual quality and text-image alignment.
  5. Multi-encoder conditioning improves CLIP scores and preserves finer clinical details.
  6. Classifier-free guidance strengthens text-image alignment at only a small cost in FID (see the sketch below).
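
Two of the mechanisms mentioned above are easy to sketch: fusing embeddings from several text encoders into one conditioning tensor, and the standard classifier-free guidance combination of noise predictions. The encoder callables and the token-wise concatenation are assumptions for illustration, not the paper's exact fusion scheme.

```python
import torch

def encode_report(encoders, report: str) -> torch.Tensor:
    """Concatenate embeddings from several pretrained text encoders (sketch).

    `encoders` is a list of callables, each mapping a report string to a
    (num_tokens, dim) tensor; concatenating along the token axis is one
    plausible way to fuse them.
    """
    return torch.cat([enc(report) for enc in encoders], dim=0)

def cfg_noise(eps_cond: torch.Tensor, eps_uncond: torch.Tensor, w: float = 2.0):
    """Standard classifier-free guidance combination of noise predictions."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```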

Cool Papers

Click here to view paper screenshots

Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models

Authors:Sunwoo Cho, Yejin Jung, Nam Ik Cho, Jae Woong Soh

Training deep neural networks has become increasingly demanding, requiring large datasets and significant computational resources, especially as model complexity advances. Data distillation methods, which aim to improve data efficiency, have emerged as promising solutions to this challenge. In the field of single image super-resolution (SISR), the reliance on large training datasets highlights the importance of these techniques. Recently, a generative adversarial network (GAN) inversion-based data distillation framework for SR was proposed, showing potential for better data utilization. However, the current method depends heavily on pre-trained SR networks and class-specific information, limiting its generalizability and applicability. To address these issues, we introduce a new data distillation approach for image SR that does not need class labels or pre-trained SR models. In particular, we first extract high-gradient patches and categorize images based on CLIP features, then fine-tune a diffusion model on the selected patches to learn their distribution and synthesize distilled training images. Experimental results show that our method achieves state-of-the-art performance while using significantly less training data and requiring less computational time. Specifically, when we train a baseline Transformer model for SR with only 0.68% of the original dataset, the performance drop is just 0.3 dB. In this case, diffusion model fine-tuning takes 4 hours, and SR model training completes within 1 hour, much shorter than the 11-hour training time with the full dataset.


Paper & Project Links

PDF

Summary
Training deep neural networks demands ever larger datasets and more compute, especially as model complexity grows. Data distillation methods, which aim to improve data efficiency, are a promising answer to this challenge, and in single image super-resolution (SISR) the reliance on large training sets makes them particularly important. A recent GAN-inversion-based data distillation framework for SR showed better data utilization, but it depends heavily on pre-trained SR networks and class-specific information, which limits its generality. This work introduces a new data distillation approach for image SR that needs neither class labels nor pre-trained SR models: high-gradient patches are extracted, images are categorized by their CLIP features, and a diffusion model is fine-tuned on the selected patches to learn their distribution and synthesize distilled training images. The method reaches state-of-the-art performance with far less training data and computation: training a baseline Transformer SR model on only 0.68% of the original dataset costs just 0.3 dB in performance, with 4 hours of diffusion fine-tuning and under 1 hour of SR training, compared with 11 hours on the full dataset.

Key Takeaways

  1. Training deep neural networks requires large datasets and substantial compute; data distillation has emerged to improve data efficiency and address this challenge.
  2. Data distillation is especially relevant for single image super-resolution, where current methods rely heavily on pre-trained SR networks and class-specific information.
  3. A new data distillation method for image SR is proposed that needs neither class labels nor pre-trained SR models: high-gradient patches are extracted and images are categorized by CLIP features before fine-tuning a diffusion model (see the sketch below).
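
A minimal sketch of the patch selection and grouping steps is given below: patches are scored by mean gradient magnitude, the top fraction is kept, and their CLIP features are grouped with a tiny k-means. The scoring rule, keep ratio and cluster count are illustrative assumptions rather than the paper's exact recipe.

```python
import numpy as np

def select_high_gradient_patches(patches: np.ndarray, keep_ratio: float = 0.2):
    """Keep the patches with the largest mean gradient magnitude (sketch).

    patches: (N, H, W) grayscale patches.
    """
    gy, gx = np.gradient(patches.astype(float), axis=(1, 2))
    scores = np.sqrt(gx ** 2 + gy ** 2).mean(axis=(1, 2))
    k = max(1, int(len(patches) * keep_ratio))
    idx = np.argsort(scores)[-k:]
    return patches[idx], idx

def group_by_clip(features: np.ndarray, num_groups: int = 8, iters: int = 20):
    """Tiny k-means over CLIP features to categorize the selected patches."""
    rng = np.random.default_rng(0)
    centers = features[rng.choice(len(features), num_groups, replace=False)].copy()
    for _ in range(iters):
        dists = ((features[:, None] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for g in range(num_groups):
            if np.any(labels == g):
                centers[g] = features[labels == g].mean(0)
    return labels
```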

Cool Papers

Click here to view paper screenshots

DICE: Diffusion Consensus Equilibrium for Sparse-view CT Reconstruction

Authors:Leon Suarez-Rodriguez, Roman Jacome, Romario Gualdron-Hurtado, Ana Mantilla-Dulcey, Henry Arguello

Sparse-view computed tomography (CT) reconstruction is fundamentally challenging due to undersampling, leading to an ill-posed inverse problem. Traditional iterative methods incorporate handcrafted or learned priors to regularize the solution but struggle to capture the complex structures present in medical images. In contrast, diffusion models (DMs) have recently emerged as powerful generative priors that can accurately model complex image distributions. In this work, we introduce Diffusion Consensus Equilibrium (DICE), a framework that integrates a two-agent consensus equilibrium into the sampling process of a DM. DICE alternates between: (i) a data-consistency agent, implemented through a proximal operator enforcing measurement consistency, and (ii) a prior agent, realized by a DM performing a clean image estimation at each sampling step. By balancing these two complementary agents iteratively, DICE effectively combines strong generative prior capabilities with measurement consistency. Experimental results show that DICE significantly outperforms state-of-the-art baselines in reconstructing high-quality CT images under uniform and non-uniform sparse-view settings of 15, 30, and 60 views (out of a total of 180), demonstrating both its effectiveness and robustness.


Paper & Project Links

PDF 8 pages, 4 figures, conference

Summary

This paper introduces Diffusion Consensus Equilibrium (DICE), a framework that combines the strong generative prior of diffusion models (DMs) with measurement consistency. DICE integrates a two-agent consensus equilibrium into the sampling process, alternating between a data-consistency agent and a prior agent. The approach performs strongly in sparse-view computed tomography (CT) reconstruction, clearly outperforming existing baselines and reconstructing high-quality CT images under both uniform and non-uniform sparse-view settings.

Key Takeaways

  1. Diffusion models (DMs) serve as powerful generative priors that accurately model complex image distributions.
  2. DICE combines a data-consistency agent with a prior agent and balances them iteratively for effective CT reconstruction (see the sketch below).
  3. The two-agent consensus equilibrium embeds the strong generative prior and measurement consistency directly into the sampling process.
  4. DICE shows a clear advantage in sparse-view CT reconstruction, producing high-quality images across different view settings.
  5. DICE is effective and robust under both uniform and non-uniform sparse-view settings.
  6. Traditional iterative methods with handcrafted or learned priors struggle to capture the complex structures of medical images; DICE overcomes this limitation.
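
The two-agent alternation can be illustrated with a short sketch: a data-consistency step against the CT measurements followed by a diffusion denoiser acting as the prior, combined by a consensus rule. `forward_op`, `adjoint_op` and `denoiser` are hypothetical stand-ins, and the gradient step and convex combination are simplifying assumptions; the paper's consensus equilibrium may differ in detail.

```python
def dice_iteration(x, y, forward_op, adjoint_op, denoiser, sigma, step=1.0, rho=0.5):
    """One consensus-style iteration between two agents (illustrative sketch).

    Agent 1 (data consistency): a gradient step on ||A x - y||^2, standing in
    for the proximal operator of the data-fidelity term.
    Agent 2 (prior): a diffusion model used as a denoiser that returns a
    clean-image estimate at noise level `sigma`.
    The convex combination with weight `rho` is an assumed consensus rule.
    """
    # Data-consistency agent
    residual = forward_op(x) - y
    x_data = x - step * adjoint_op(residual)
    # Prior agent (diffusion model as denoiser)
    x_prior = denoiser(x, sigma)
    # Consensus update
    return rho * x_data + (1.0 - rho) * x_prior
```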

Cool Papers

Click here to view paper screenshots

DiffVL: Diffusion-Based Visual Localization on 2D Maps via BEV-Conditioned GPS Denoising

Authors:Li Gao, Hongyang Sun, Liu Liu, Yunhao Li, Yang Cai

Accurate visual localization is crucial for autonomous driving, yet existing methods face a fundamental dilemma: while high-definition (HD) maps provide high-precision localization references, their costly construction and maintenance hinder scalability, which drives research toward standard-definition (SD) maps like OpenStreetMap. Current SD-map-based approaches primarily focus on Bird's-Eye View (BEV) matching between images and maps, overlooking a ubiquitous signal: noisy GPS. Although GPS is readily available, it suffers from multipath errors in urban environments. We propose DiffVL, the first framework to reformulate visual localization as a GPS denoising task using diffusion models. Our key insight is that noisy GPS trajectories, when conditioned on visual BEV features and SD maps, implicitly encode the true pose distribution, which can be recovered through iterative diffusion refinement. Unlike prior BEV-matching methods (e.g., OrienterNet) or transformer-based registration approaches, DiffVL learns to reverse GPS noise perturbations by jointly modeling GPS, SD map, and visual signals, achieving sub-meter accuracy without relying on HD maps. Experiments on multiple datasets demonstrate that our method achieves state-of-the-art accuracy compared to BEV-matching baselines. Crucially, our work shows that diffusion models can enable scalable localization by treating noisy GPS as a generative prior, marking a paradigm shift away from traditional matching-based methods.


Paper & Project Links

PDF

Summary

DiffVL is a new diffusion-based approach to visual localization that recasts the problem as GPS denoising. Conditioned on standard-definition maps and bird's-eye-view (BEV) image features, it recovers the true pose distribution from noisy GPS trajectories through iterative diffusion refinement, reaching sub-meter accuracy without relying on high-definition maps.

Key Takeaways

  1. Existing visual localization methods for autonomous driving rely mainly on high-definition maps, whose costly construction and maintenance limit scalability.
  2. The proposed DiffVL framework reformulates visual localization as a GPS denoising task handled by a diffusion model (see the sketch below).
  3. DiffVL conditions noisy GPS trajectories on BEV image features and standard-definition maps, implicitly encoding the true pose distribution and recovering it through iterative diffusion refinement.
  4. DiffVL differs from previous BEV-matching methods and transformer-based registration approaches.
  5. DiffVL learns to reverse GPS noise perturbations by jointly modeling GPS, SD-map and visual signals, achieving sub-meter accuracy.
  6. Experiments show that DiffVL achieves state-of-the-art localization accuracy on multiple datasets.
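
To illustrate the idea of refining a noisy GPS pose with a conditional diffusion model, here is a deterministic DDIM-style denoising loop conditioned on BEV and map features. The pose parameterization, `eps_model` and `cond` are hypothetical stand-ins; the paper's actual network and schedule may differ.

```python
import torch

@torch.no_grad()
def denoise_gps(pose_noisy, cond, eps_model, alphas_cumprod):
    """Iteratively refine a noisy GPS pose (e.g. x, y, yaw) with a conditional
    diffusion model (illustrative DDIM-style sketch, eta = 0).

    `eps_model(pose, cond, t)` predicts the noise; `cond` bundles BEV image
    features and SD-map features. `alphas_cumprod` is a 1-D tensor of the
    cumulative noise-schedule products.
    """
    pose = pose_noisy
    T = len(alphas_cumprod)
    for t in reversed(range(T)):
        a_bar = alphas_cumprod[t]
        a_bar_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = eps_model(pose, cond, t)
        pose0 = (pose - (1 - a_bar).sqrt() * eps) / a_bar.sqrt()   # predicted clean pose
        pose = a_bar_prev.sqrt() * pose0 + (1 - a_bar_prev).sqrt() * eps
    return pose
```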

Cool Papers

Click here to view paper screenshots

TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement

Authors:Jibai Lin, Bo Ma, Yating Yang, Xi Zhou, Rong Ma, Turghun Osman, Ahtamjan Ahmat, Rui Dong, Lei Wang

Subject-driven image generation (SDIG) aims to manipulate specific subjects within images while adhering to textual instructions, a task crucial for advancing text-to-image diffusion models. SDIG requires reconciling the tension between maintaining subject identity and complying with dynamic edit instructions, a challenge inadequately addressed by existing methods. In this paper, we introduce the Target-Instructed Diffusion Enhancing (TIDE) framework, which resolves this tension through target supervision and preference learning without test-time fine-tuning. TIDE pioneers target-supervised triplet alignment, modelling subject adaptation dynamics using a (reference image, instruction, target images) triplet. This approach leverages the Direct Subject Diffusion (DSD) objective, training the model with paired “winning” (balanced preservation-compliance) and “losing” (distorted) targets, systematically generated and evaluated via quantitative metrics. This enables implicit reward modelling for optimal preservation-compliance balance. Experimental results on standard benchmarks demonstrate TIDE’s superior performance in generating subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods across multiple quantitative metrics. TIDE’s versatility is further evidenced by its successful application to diverse tasks, including structural-conditioned generation, image-to-image generation, and text-image interpolation. Our code is available at https://github.com/KomJay520/TIDE.


Paper & Project Links

PDF

Summary

This paper addresses subject-driven image generation, where specific subjects in an image must be manipulated while following textual instructions, and proposes the Target-Instructed Diffusion Enhancing (TIDE) framework. TIDE resolves the tension between preserving subject identity and complying with dynamic edit instructions through target supervision and preference learning, without any test-time fine-tuning. It pioneers target-supervised triplet alignment, modeling subject adaptation dynamics with (reference image, instruction, target image) triplets, and trains with the Direct Subject Diffusion (DSD) objective on systematically generated and quantitatively evaluated pairs of "winning" (balanced preservation and compliance) and "losing" (distorted) targets. Experiments show that TIDE produces subject-faithful outputs while maintaining instruction compliance, outperforming baseline methods, and that it generalizes to structural-conditioned generation, image-to-image generation and text-image interpolation.

Key Takeaways

  1. TIDE resolves the tension in subject-driven image generation between preserving subject identity and following edit instructions.
  2. It does so through target supervision and preference learning, with no test-time fine-tuning.
  3. TIDE pioneers target-supervised triplet alignment, modeling subject adaptation dynamics with (reference image, instruction, target image) triplets.
  4. Training uses the Direct Subject Diffusion (DSD) objective on paired "winning" and "losing" targets that are systematically generated and evaluated (see the sketch below).
  5. TIDE outperforms baselines across multiple quantitative metrics, generating subject-faithful outputs while maintaining instruction compliance.
  6. TIDE generalizes to diverse tasks, including structural-conditioned generation, image-to-image generation and text-image interpolation.
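
The abstract does not spell out the DSD objective, so the sketch below shows only one plausible reading: a DPO-style preference loss over paired diffusion losses of "winning" and "losing" targets relative to a frozen reference model. This mapping is an assumption for illustration, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def paired_preference_loss(loss_win, loss_lose, loss_win_ref, loss_lose_ref, beta=1.0):
    """DPO-style preference loss over diffusion losses (illustrative assumption).

    loss_win / loss_lose: per-sample denoising losses of the trained model on
    the "winning" (balanced) and "losing" (distorted) targets; *_ref are the
    same losses under a frozen reference model. Improving on the winner
    relative to the reference, more than on the loser, is rewarded.
    """
    margin = (loss_lose - loss_lose_ref) - (loss_win - loss_win_ref)
    return -F.logsigmoid(beta * margin).mean()
```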

Cool Papers

Click here to view paper screenshots

Probing the Representational Power of Sparse Autoencoders in Vision Models

Authors:Matthew Lyle Olson, Musashi Hinck, Neale Ratzlaff, Changbai Li, Phillip Howard, Vasudev Lal, Shao-Yen Tseng

Sparse Autoencoders (SAEs) have emerged as a popular tool for interpreting the hidden states of large language models (LLMs). By learning to reconstruct activations from a sparse bottleneck layer, SAEs discover interpretable features from the high-dimensional internal representations of LLMs. Despite their popularity with language models, SAEs remain understudied in the visual domain. In this work, we provide an extensive evaluation the representational power of SAEs for vision models using a broad range of image-based tasks. Our experimental results demonstrate that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision model architectures: vision embedding models, multi-modal LMMs and diffusion models. In vision embedding models, we find that learned SAE features can be used for OOD detection and provide evidence that they recover the ontological structure of the underlying model. For diffusion models, we demonstrate that SAEs enable semantic steering through text encoder manipulation and develop an automated pipeline for discovering human-interpretable attributes. Finally, we conduct exploratory experiments on multi-modal LLMs, finding evidence that SAE features reveal shared representations across vision and language modalities. Our study provides a foundation for SAE evaluation in vision models, highlighting their strong potential improving interpretability, generalization, and steerability in the visual domain.


Paper & Project Links

PDF ICCV 2025 Findings

Summary

Sparse autoencoders (SAEs) have shown great promise for interpreting the hidden states of large language models (LLMs). This work extensively evaluates their representational power for vision models. The experiments show that SAE features are semantically meaningful, improve out-of-distribution generalization, and enable controllable generation across three vision architectures: vision embedding models, multi-modal LLMs and diffusion models. SAE features can be used for OOD detection in vision embedding models and enable semantic steering in diffusion models. The study lays a foundation for SAE evaluation in vision models, highlighting their strong potential for improving interpretability, generalization and steerability in the visual domain.

Key Takeaways

  1. SAEs are used to interpret the hidden states of large language models (see the minimal sketch below).
  2. The representational power of SAEs is evaluated extensively in the visual domain.
  3. SAE features are semantically meaningful and improve out-of-distribution generalization.
  4. SAEs enable controllable generation across three vision model architectures.
  5. SAE features can be used for OOD detection in vision embedding models.
  6. In diffusion models, SAEs enable semantic steering through text encoder manipulation.
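
For readers unfamiliar with sparse autoencoders, a minimal SAE over model activations looks roughly like this: a wide ReLU bottleneck trained with a reconstruction loss plus an L1 sparsity penalty. The architecture and penalty weight are generic choices, not the specific SAEs evaluated in the paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over model activations (illustrative sketch).

    A linear encoder with ReLU produces a wide, mostly-zero code; a linear
    decoder reconstructs the activation. The L1 penalty encourages sparsity.
    """
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts: torch.Tensor):
        code = torch.relu(self.encoder(acts))
        recon = self.decoder(code)
        return recon, code

def sae_loss(recon, acts, code, l1_coeff: float = 1e-3):
    """Reconstruction error plus L1 sparsity penalty on the code."""
    return ((recon - acts) ** 2).mean() + l1_coeff * code.abs().mean()
```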

Cool Papers

Click here to view paper screenshots

Style Transfer with Diffusion Models for Synthetic-to-Real Domain Adaptation

Authors:Estelle Chigot, Dennis G. Wilson, Meriem Ghrib, Thomas Oberlin

Semantic segmentation models trained on synthetic data often perform poorly on real-world images due to domain gaps, particularly in adverse conditions where labeled data is scarce. Yet, recent foundation models enable to generate realistic images without any training. This paper proposes to leverage such diffusion models to improve the performance of vision models when learned on synthetic data. We introduce two novel techniques for semantically consistent style transfer using diffusion models: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention Filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, preventing artifacts in regions with weak cross-attention correspondences. Our methods transfer style characteristics while preserving semantic boundaries and structural coherence, unlike approaches that apply global transformations or generate content without constraints. Experiments using GTA5 as source and Cityscapes/ACDC as target domains show that our approach produces higher quality images with lower FID scores and better content preservation. Our work demonstrates that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real domain gap even with minimal target domain data, advancing robust perception systems for challenging real-world applications. The source code is available at: https://github.com/echigot/cactif.


Paper & Project Links

PDF Published in Computer Vision and Image Understanding, September 2025 (CVIU 2025)

Summary

This paper uses diffusion models to improve semantic segmentation models trained on synthetic data when they are applied to real-world images. To bridge the synthetic-to-real domain gap, two new diffusion-based techniques are introduced: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention filtering (CACTIF). CACTI applies statistical normalization selectively based on semantic classes, while CACTIF further filters cross-attention maps based on feature similarity, avoiding artifacts in regions with weak cross-attention correspondences. The methods transfer style characteristics while preserving semantic boundaries and structural coherence. Experiments with GTA5 as the source domain and Cityscapes/ACDC as target domains show higher-quality images, lower FID scores and better content preservation, demonstrating that class-aware diffusion-based style transfer effectively bridges the synthetic-to-real gap even with minimal target-domain data.

Key Takeaways

  1. Semantic segmentation models trained on synthetic data are limited on real-world images by the synthetic-to-real domain gap.
  2. Diffusion models are leveraged to improve vision models trained on synthetic data.
  3. Two new techniques are introduced: Class-wise Adaptive Instance Normalization and Cross-Attention (CACTI) and its extension with selective attention filtering (CACTIF).
  4. CACTI transfers style by applying statistical normalization selectively based on semantic classes (see the sketch below).
  5. CACTIF filters cross-attention maps based on feature similarity to reduce artifacts.
  6. Experiments show higher image quality, lower FID scores and better content preservation.
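
The class-wise normalization idea can be sketched as follows: for each semantic class present in both images, content features are re-normalized to the style statistics of the same class. The feature level, the statistics used and the handling of missing classes are assumptions for illustration; the cross-attention part of CACTI/CACTIF is not shown.

```python
import torch

def classwise_adain(content, style, content_mask, style_mask, num_classes, eps=1e-5):
    """Class-wise adaptive instance normalization (illustrative sketch).

    content, style: (C, H, W) feature maps; *_mask: (H, W) integer semantic
    labels. For each class present in both images, content features are
    re-normalized to the style statistics of the same class.
    """
    out = content.clone()
    for cls in range(num_classes):
        cm, sm = content_mask == cls, style_mask == cls
        if cm.sum() > 1 and sm.sum() > 1:
            c = content[:, cm]                                   # (C, Nc)
            s = style[:, sm]                                     # (C, Ns)
            c_mu, c_std = c.mean(1, keepdim=True), c.std(1, keepdim=True) + eps
            s_mu, s_std = s.mean(1, keepdim=True), s.std(1, keepdim=True) + eps
            out[:, cm] = (c - c_mu) / c_std * s_std + s_mu
        # Classes absent from the style image are left unchanged.
    return out
```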

Cool Papers

Click here to view paper screenshots

PVLM: Parsing-Aware Vision Language Model with Dynamic Contrastive Learning for Zero-Shot Deepfake Attribution

Authors:Yaning Zhang, Jiahe Zhang, Chunjie Ma, Weili Guan, Tian Gan, Zan Gao

The challenge of tracing the source attribution of forged faces has gained significant attention due to the rapid advancement of generative models. However, existing deepfake attribution (DFA) works primarily focus on the interaction among various domains in vision modality, and other modalities such as texts and face parsing are not fully explored. Besides, they tend to fail to assess the generalization performance of deepfake attributors to unseen advanced generators like diffusion in a fine-grained manner. In this paper, we propose a novel parsing-aware vision language model with dynamic contrastive learning(PVLM) method for zero-shot deepfake attribution (ZS-DFA),which facilitates effective and fine-grained traceability to unseen advanced generators. Specifically, we conduct a novel and fine-grained ZS-DFA benchmark to evaluate the attribution performance of deepfake attributors to unseen advanced generators like diffusion. Besides, we propose an innovative parsing-guided vision language model with dynamic contrastive learning (PVLM) method to capture general and diverse attribution features. We are motivated by the observation that the preservation of source face attributes in facial images generated by GAN and diffusion models varies significantly. We employ the inherent face attributes preservation differences to capture face parsing-aware forgery representations. Therefore, we devise a novel parsing encoder to focus on global face attribute embeddings, enabling parsing-guided DFA representation learning via dynamic vision-parsing matching. Additionally, we present a novel deepfake attribution contrastive center loss to pull relevant generators closer and push irrelevant ones away, which can be introduced into DFA models to enhance traceability. Experimental results show that our model exceeds the state-of-the-art on the ZS-DFA benchmark via various protocol evaluations.


Paper & Project Links

PDF

Summary

This paper addresses source attribution of forged faces. Existing deepfake attribution (DFA) methods focus mainly on the vision modality and overlook other modalities such as text and face parsing, so the authors propose a parsing-aware vision language model with dynamic contrastive learning (PVLM) for zero-shot deepfake attribution (ZS-DFA) that can trace unseen advanced generators such as diffusion models. A novel fine-grained ZS-DFA benchmark is built to evaluate attribution performance on unseen generators, and a parsing-guided dynamic contrastive learning mechanism captures general and diverse attribution features. Motivated by the observation that GAN- and diffusion-generated face images preserve source face attributes to very different degrees, a novel parsing encoder focuses on global face attribute embeddings and enables parsing-guided DFA representation learning via dynamic vision-parsing matching. A deepfake attribution contrastive center loss further pulls relevant generators closer and pushes irrelevant ones away, enhancing traceability. Experiments show that the model surpasses the previous state of the art on the ZS-DFA benchmark under various evaluation protocols.

Key Takeaways

  1. Existing deepfake attribution methods focus mainly on interactions within the vision modality and neglect other modalities such as text and face parsing.
  2. A novel parsing-aware vision language model with dynamic contrastive learning (PVLM) is proposed for zero-shot deepfake attribution (ZS-DFA).
  3. A fine-grained ZS-DFA benchmark is built to evaluate attribution performance on unseen advanced generators.
  4. A novel parsing encoder exploits the differences in how well generated face images preserve source face attributes.
  5. A deepfake attribution contrastive center loss is introduced to enhance traceability (see the sketch below).
  6. Experiments show that the model performs strongly on the ZS-DFA benchmark under various evaluation protocols.
  7. The approach offers new ideas and tools for more effective tracing of forged faces.
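
A contrastive center loss over generator classes can be sketched as below: each known generator keeps a learnable center, features are pulled toward their own generator's center and pushed away from the other centers by a margin. The margin formulation is an assumption, not necessarily the paper's exact loss.

```python
import torch
import torch.nn as nn

class AttributionCenterLoss(nn.Module):
    """Contrastive center loss over generator classes (illustrative sketch)."""

    def __init__(self, num_generators: int, dim: int, margin: float = 1.0):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_generators, dim))
        self.margin = margin

    def forward(self, feats: torch.Tensor, labels: torch.Tensor):
        # feats: (B, dim), labels: (B,) integer generator ids.
        d = torch.cdist(feats, self.centers)                  # (B, G) distances
        pull = d.gather(1, labels[:, None]).squeeze(1)        # distance to own center
        mask = torch.ones_like(d, dtype=torch.bool)
        mask.scatter_(1, labels[:, None], False)              # drop own center
        push = torch.relu(self.margin - d[mask].view(len(feats), -1)).mean(1)
        return (pull + push).mean()
```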

Cool Papers

Click here to view paper screenshots

Boost 3D Reconstruction using Diffusion-based Monocular Camera Calibration

Authors:Junyuan Deng, Wei Yin, Xiaoyang Guo, Qian Zhang, Xiaotao Hu, Weiqiang Ren, Xiao-Xiao Long, Ping Tan

In this paper, we present DM-Calib, a diffusion-based approach for estimating pinhole camera intrinsic parameters from a single input image. Monocular camera calibration is essential for many 3D vision tasks. However, most existing methods depend on handcrafted assumptions or are constrained by limited training data, resulting in poor generalization across diverse real-world images. Recent advancements in stable diffusion models, trained on massive data, have shown the ability to generate high-quality images with varied characteristics. Emerging evidence indicates that these models implicitly capture the relationship between camera focal length and image content. Building on this insight, we explore how to leverage the powerful priors of diffusion models for monocular pinhole camera calibration. Specifically, we introduce a new image-based representation, termed Camera Image, which losslessly encodes the numerical camera intrinsics and integrates seamlessly with the diffusion framework. Using this representation, we reformulate the problem of estimating camera intrinsics as the generation of a dense Camera Image conditioned on an input image. By fine-tuning a stable diffusion model to generate a Camera Image from a single RGB input, we can extract camera intrinsics via a RANSAC operation. We further demonstrate that our monocular calibration method enhances performance across various 3D tasks, including zero-shot metric depth estimation, 3D metrology, pose estimation and sparse-view reconstruction. Extensive experiments on multiple public datasets show that our approach significantly outperforms baselines and provides broad benefits to 3D vision tasks.


Paper & Project Links

PDF

Summary
DM-Calib is a diffusion-based method for estimating pinhole camera intrinsics from a single input image, exploiting the strong priors of diffusion models for monocular camera calibration. It avoids the handcrafted assumptions and limited training data of traditional methods and generalizes better across diverse real-world images. By introducing the Camera Image representation, intrinsic estimation is reformulated as generating a dense Camera Image conditioned on the input image: a stable diffusion model is fine-tuned to generate the Camera Image, and the intrinsics are then extracted with a RANSAC procedure. Experiments on several public datasets show that the method clearly outperforms baselines and benefits a wide range of 3D vision tasks.

Key Takeaways

  1. DM-Calib is a diffusion-based method for estimating pinhole camera intrinsics.
  2. It addresses the reliance of traditional calibration methods on handcrafted assumptions or limited training data.
  3. The Camera Image representation turns intrinsic estimation into a generation problem.
  4. The strong priors of diffusion models are leveraged for camera calibration.
  5. A stable diffusion model is fine-tuned to generate the Camera Image from a single RGB input.
  6. Camera intrinsics are extracted from the generated Camera Image with a RANSAC procedure (see the sketch below).
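
If the generated Camera Image can be decoded into a per-pixel ray-direction map (an assumption; the paper's exact encoding may differ), the intrinsics can be recovered with a simple RANSAC over the pinhole relation u = fx * x/z + cx, v = fy * y/z + cy, as sketched below.

```python
import numpy as np

def ransac_intrinsics(rays: np.ndarray, iters: int = 200, thresh: float = 1.5, seed: int = 0):
    """Fit pinhole intrinsics from a dense per-pixel ray map with RANSAC (sketch).

    rays: (H, W, 3) ray directions in camera coordinates with z > 0, assumed to
    be decodable from the generated Camera Image. Returns (fx, fy, cx, cy).
    """
    H, W, _ = rays.shape
    v, u = np.mgrid[0:H, 0:W]
    x, y, z = rays[..., 0], rays[..., 1], rays[..., 2]
    xs, ys = (x / z).ravel(), (y / z).ravel()
    us, vs = u.ravel().astype(float), v.ravel().astype(float)

    rng = np.random.default_rng(seed)
    best, best_inliers = None, -1
    for _ in range(iters):
        i, j = rng.choice(len(us), size=2, replace=False)
        if np.isclose(xs[i], xs[j]) or np.isclose(ys[i], ys[j]):
            continue
        # Two pixels fix (fx, cx) from the u-equation and (fy, cy) from v.
        fx = (us[i] - us[j]) / (xs[i] - xs[j]); cx = us[i] - fx * xs[i]
        fy = (vs[i] - vs[j]) / (ys[i] - ys[j]); cy = vs[i] - fy * ys[i]
        err = np.hypot(fx * xs + cx - us, fy * ys + cy - vs)
        inliers = int((err < thresh).sum())
        if inliers > best_inliers:
            best, best_inliers = (fx, fy, cx, cy), inliers
    return best
```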

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!