⚠️ All of the summaries below were generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Please note: never rely on this for serious academic work; it is only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-08-02
UniLDiff: Unlocking the Power of Diffusion Priors for All-in-One Image Restoration
Authors:Zihan Cheng, Liangtai Zhou, Dian Chen, Ni Tang, Xiaotong Luo, Yanyun Qu
All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a novel unified image restoration framework based on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the powerful generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. Furthermore, to mitigate detail loss caused by the high compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance texture and fine-structure recovery. Extensive experiments across multi-task and mixed degradation settings demonstrate that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.
All-in-One Image Restoration (AiOIR) has emerged as a promising yet challenging research direction. To address its core challenges, we propose a unified image restoration framework built on latent diffusion models (LDMs). Our approach structurally integrates low-quality visual priors into the diffusion process, unlocking the strong generative capacity of diffusion models for diverse degradations. Specifically, we design a Degradation-Aware Feature Fusion (DAFF) module to enable adaptive handling of diverse degradation types. In addition, to mitigate the detail loss caused by the heavy compression and iterative sampling of LDMs, we design a Detail-Aware Expert Module (DAEM) in the decoder to enhance the recovery of textures and fine structures. Extensive experiments across multi-task and mixed-degradation settings show that our method consistently achieves state-of-the-art performance, highlighting the practical potential of diffusion priors for unified image restoration. Our code will be released.
Paper and project links
Summary
Building on latent diffusion models (LDMs), the authors propose a unified image restoration framework that injects low-quality visual priors into the diffusion process to tackle the core challenge of diverse degradations. A Degradation-Aware Feature Fusion (DAFF) module enables adaptive handling of multiple degradation types, while a Detail-Aware Expert Module (DAEM) in the decoder reduces the detail loss caused by the heavy compression and iterative sampling of LDMs and strengthens the recovery of textures and fine structures. Experiments show that the method consistently achieves state-of-the-art performance in multi-task and mixed-degradation settings.
Key Takeaways
- Proposes a unified image restoration framework based on latent diffusion models (LDMs).
- The framework incorporates low-quality visual priors to address the core challenge of diverse degradations.
- A Degradation-Aware Feature Fusion (DAFF) module adaptively handles multiple degradation types.
- A Detail-Aware Expert Module (DAEM) reduces detail loss and enhances the recovery of textures and fine structures.
- The method performs strongly in multi-task and mixed-degradation settings, reaching state-of-the-art results.
- The approach shows practical potential for real applications.
Click here to view paper screenshots
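The DAFF and DAEM modules above are only named in the abstract, so the snippet below is a minimal PyTorch sketch of the general idea of degradation-aware feature fusion: diffusion features are modulated by statistics of the low-quality input and then merged. All class names, shapes, and the modulation scheme are assumptions for illustration, not UniLDiff's actual design.

```python
# Minimal sketch of degradation-aware feature fusion, assuming a PyTorch setting.
# The abstract does not describe DAFF's internals; this toy module (hypothetical
# names and shapes) only illustrates conditioning diffusion features on
# low-quality-image features via learned modulation.
import torch
import torch.nn as nn

class DegradationAwareFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Predict per-channel scale/shift from the low-quality feature map.
        self.modulation = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2 * channels, kernel_size=1),
        )
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, diff_feat: torch.Tensor, lq_feat: torch.Tensor) -> torch.Tensor:
        scale, shift = self.modulation(lq_feat).chunk(2, dim=1)
        modulated = diff_feat * (1 + scale) + shift            # degradation-aware modulation
        return self.merge(torch.cat([modulated, lq_feat], 1))  # fuse with LQ features

x = torch.randn(2, 64, 32, 32)   # diffusion (latent) features
lq = torch.randn(2, 64, 32, 32)  # features extracted from the low-quality input
print(DegradationAwareFusion(64)(x, lq).shape)  # torch.Size([2, 64, 32, 32])
```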







Adaptively Distilled ControlNet: Accelerated Training and Superior Sampling for Medical Image Synthesis
Authors:Kunpeng Qiu, Zhiying Zhou, Yongxin Guo
Medical image annotation is constrained by privacy concerns and labor-intensive labeling, significantly limiting the performance and generalization of segmentation models. While mask-controllable diffusion models excel in synthesis, they struggle with precise lesion-mask alignment. We propose \textbf{Adaptively Distilled ControlNet}, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training, a teacher model, conditioned on mask-image pairs, regularizes a mask-only student model via predicted noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves 2.6%/3.5% gains on Polyps, highlighting its effectiveness and superiority. Code is available at GitHub.
Medical image annotation is constrained by privacy concerns and labor-intensive labeling, which significantly limits the performance and generalization of segmentation models. While mask-controllable diffusion models excel at synthesis, they struggle with precise lesion-mask alignment. We propose Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. Specifically, during training a teacher model conditioned on mask-image pairs regularizes a mask-only student model via predicted-noise alignment in parameter space, further enhanced by adaptive regularization based on lesion-background ratios. During sampling, only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate state-of-the-art performance: TransUNet improves mDice/mIoU by 2.4%/4.2% on KiTS19, while SANet achieves gains of 2.6%/3.5% on Polyps, highlighting the method's effectiveness and superiority. Code is available on GitHub.
Paper and project links
PDF Accepted by MICCAI 2025
Summary
This paper proposes Adaptively Distilled ControlNet, a task-agnostic framework that accelerates training and optimization through dual-model distillation. It addresses the privacy and labeling-cost constraints of medical image annotation and improves precise lesion-mask alignment. During training, a teacher model conditioned on mask-image pairs regularizes a mask-only student model via predicted-noise alignment in parameter space; at sampling time only the student model is used, enabling privacy-preserving medical image generation. Comprehensive evaluations on two distinct medical datasets demonstrate its superior performance.
Key Takeaways
- Medical image annotation is limited by privacy concerns and labor-intensive labeling, which hurts the performance and generalization of segmentation models.
- Adaptively Distilled ControlNet accelerates training and optimization through dual-model distillation.
- A teacher model conditioned on mask-image pairs regularizes the student in parameter space via predicted-noise alignment.
- The framework adds adaptive regularization based on the lesion-to-background ratio.
- Only the student model is used at sampling time, enabling privacy-preserving medical image generation.
- Evaluations on two distinct medical datasets show strong results: TransUNet improves mDice/mIoU on KiTS19, and SANet achieves notable gains on Polyps.
Click here to view paper screenshots
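The abstract describes the teacher's predicted noise regularizing a mask-only student, weighted adaptively by the lesion-background ratio; it mentions alignment "in parameter space", whereas the toy loss below uses the simpler prediction-space form. The function name and the inverse-ratio weighting are assumptions, intended only to illustrate the dual-model distillation idea.

```python
# Hypothetical sketch of a dual-model distillation loss: the student's predicted
# noise is pulled toward both the ground-truth noise and the teacher's prediction,
# with a weight derived from the lesion-to-background ratio. The exact weighting
# used in the paper is not given; the formula here is an assumption.
import torch
import torch.nn.functional as F

def adaptive_distill_loss(student_eps, teacher_eps, true_eps, lesion_mask, alpha=1.0):
    # Fraction of lesion pixels per sample, in (0, 1]; smaller lesions -> larger weight.
    ratio = lesion_mask.float().mean(dim=(1, 2, 3)).clamp_min(1e-6)
    weight = (alpha / ratio).clamp(max=10.0)              # assumed inverse-ratio weighting
    denoise = F.mse_loss(student_eps, true_eps)           # standard diffusion objective
    distill = F.mse_loss(student_eps, teacher_eps.detach(), reduction="none")
    distill = (weight.view(-1, 1, 1, 1) * distill).mean()
    return denoise + distill

eps_s = torch.randn(4, 4, 32, 32)
eps_t = torch.randn(4, 4, 32, 32)
eps = torch.randn(4, 4, 32, 32)
mask = torch.rand(4, 1, 32, 32) > 0.9
print(adaptive_distill_loss(eps_s, eps_t, eps, mask))
```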






DivControl: Knowledge Diversion for Controllable Image Generation
Authors:Yucheng Xie, Fu Feng, Ruixiao Shi, Jing Wang, Yong Rui, Xin Geng
Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train separate models for each condition or rely on unified architectures with entangled representations, resulting in poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components-pairs of singular vectors-which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented via a dynamic gate that performs soft routing over tailors based on the semantics of condition instructions, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments demonstrate that DivControl achieves state-of-the-art controllability with 36.4$\times$ less training cost, while simultaneously improving average performance on basic conditions. It also delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
Diffusion models have advanced from text-to-image (T2I) to image-to-image (I2I) generation by incorporating structured inputs such as depth maps, enabling fine-grained spatial control. However, existing methods either train a separate model for each condition or rely on unified architectures with entangled representations, which leads to poor generalization and high adaptation costs for novel conditions. To this end, we propose DivControl, a decomposable pretraining framework for unified controllable generation and efficient adaptation. DivControl factorizes ControlNet via SVD into basic components, pairs of singular vectors, which are disentangled into condition-agnostic learngenes and condition-specific tailors through knowledge diversion during multi-condition training. Knowledge diversion is implemented with a dynamic gate that performs soft routing over the tailors according to the semantics of the condition instruction, enabling zero-shot generalization and parameter-efficient adaptation to novel conditions. To further improve condition fidelity and training efficiency, we introduce a representation alignment loss that aligns condition embeddings with early diffusion features. Extensive experiments show that DivControl achieves state-of-the-art controllability with 36.4x lower training cost while also improving average performance on basic conditions. It further delivers strong zero-shot and few-shot performance on unseen conditions, demonstrating superior scalability, modularity, and transferability.
Paper and project links
Summary
Diffusion models have moved from text-to-image (T2I) to image-to-image (I2I) generation by using structured inputs such as depth maps for fine-grained spatial control, but existing methods either need a separate model per condition or use unified architectures with entangled representations, which generalize poorly and are costly to adapt to new conditions. DivControl is a decomposable pretraining framework for unified controllable generation and efficient adaptation: it factorizes ControlNet via SVD into pairs of singular vectors and, through knowledge diversion during multi-condition training, disentangles them into condition-agnostic learngenes and condition-specific tailors. A dynamic gate softly routes over the tailors based on the semantics of the condition instruction, enabling zero-shot generalization and parameter-efficient adaptation, while a representation alignment loss aligns condition embeddings with early diffusion features. Experiments show state-of-the-art controllability at 36.4x lower training cost, better average performance on basic conditions, and strong zero-shot and few-shot results on unseen conditions, demonstrating scalability, modularity, and transferability.
Key Takeaways
- Diffusion models have extended to image-to-image (I2I) generation, using structured inputs for fine-grained spatial control.
- Existing methods generalize poorly to novel conditions and are costly to adapt.
- DivControl factorizes ControlNet to improve controllable generation and adaptation efficiency.
- Knowledge diversion with a dynamic gate softly routes over condition-specific tailors, enabling zero-shot generalization and parameter-efficient adaptation.
- A representation alignment loss improves condition fidelity and training efficiency.
- DivControl achieves state-of-the-art controllability while improving average performance on basic conditions.
Click here to view paper screenshots
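DivControl's factorization and routing are described only at a high level in the abstract. The toy below shows the two named ingredients in isolation: an SVD of a weight matrix into rank-one singular-vector pairs, and a softmax gate that softly mixes condition-specific components from a condition embedding. Shapes, the gating network, and the mixing rule are assumptions, not the paper's architecture.

```python
# Toy sketch of (1) factorizing a weight matrix into singular-vector pairs via SVD
# and (2) soft routing over condition-specific components with a gate driven by a
# condition embedding. All sizes here are placeholders.
import torch
import torch.nn as nn

W = torch.randn(256, 128)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)   # W == U @ diag(S) @ Vh
rank = 16
components = [S[i] * torch.outer(U[:, i], Vh[i]) for i in range(rank)]  # rank-1 pairs

gate = nn.Sequential(nn.Linear(64, rank), nn.Softmax(dim=-1))  # condition -> routing weights
cond = torch.randn(64)                                          # condition-instruction embedding
weights = gate(cond)                                            # soft routing over "tailors"
W_adapted = sum(w * c for w, c in zip(weights, components))     # condition-specific weight
print(W_adapted.shape)  # torch.Size([256, 128])
```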



MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction
Authors:Zijian Dong, Longteng Duan, Jie Song, Michael J. Black, Andreas Geiger
We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model, that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable
We present MoGA, a novel method for reconstructing high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, the generated views are sparse and inconsistent, producing unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can produce diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Because 3D training data are limited, such a 3D model alone cannot capture all image details of unseen identities. We therefore integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our approach formulates Gaussian avatar creation as a model inversion process that fits the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps refine pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.
Paper and project links
PDF ICCV 2025 (Highlight), Project Page: https://zj-dong.github.io/MoGA/
Summary
MoGA is a new technique for reconstructing high-fidelity 3D Gaussian avatars from a single-view image. It tackles the problem of maintaining 3D consistency and realism while inferring unseen appearance and geometric details. Unlike most methods that rely on 2D diffusion models to synthesize unseen views, MoGA uses a generative avatar model that samples deformed Gaussians from a learned prior to produce diverse 3D avatars. This generative model serves as a prior: input images are projected into its latent space and additional 3D appearance and geometric constraints are enforced, ensuring 3D consistency. Gaussian avatar creation is formulated as model inversion, fitting the generative avatar to synthetic views from 2D diffusion models. The method surpasses state-of-the-art techniques and generalizes well to real-world scenarios.
Key Takeaways
- MoGA reconstructs high-fidelity 3D Gaussian avatars from a single-view image.
- It addresses 3D consistency and realism when inferring unseen appearance and geometric details.
- Unlike methods that depend only on 2D diffusion models, MoGA uses a generative avatar model to produce diverse 3D avatars.
- The generative model serves as a prior: input images are projected into its latent space with additional 3D appearance and geometric constraints.
- Avatar creation is formulated as model inversion, fitting the generative avatar to synthesized views.
- MoGA surpasses existing techniques and generalizes well to real-world scenarios.
Click here to view paper screenshots



Out-of-Distribution Detection in Medical Imaging via Diffusion Trajectories
Authors:Lemar Abdi, Francisco Caetano, Amaan Valiuddin, Christiaan Viviers, Hamdi Joudeh, Fons van der Sommen
In medical imaging, unsupervised out-of-distribution (OOD) detection offers an attractive approach for identifying pathological cases with extremely low incidence rates. In contrast to supervised methods, OOD-based approaches function without labels and are inherently robust to data imbalances. Current generative approaches often rely on likelihood estimation or reconstruction error, but these methods can be computationally expensive, unreliable, and require retraining if the inlier data changes. These limitations hinder their ability to distinguish nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost during inference. Compared to existing methods, SBDDM achieves a relative improvement of up to 10.43% and 18.10% for Near-OOD and Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.
In medical imaging, unsupervised out-of-distribution (OOD) detection is an attractive way to identify pathological cases with extremely low incidence rates. Unlike supervised methods, OOD-based approaches work without labels and are inherently robust to data imbalance. Current generative approaches often rely on likelihood estimation or reconstruction error, but these can be computationally expensive and unreliable, and they require retraining when the inlier data changes. These limitations prevent them from distinguishing nominal from anomalous inputs efficiently, consistently, and robustly. We propose a reconstruction-free OOD detection method that leverages the forward diffusion trajectories of a Stein score-based denoising diffusion model (SBDDM). By capturing trajectory curvature via the estimated Stein score, our approach enables accurate anomaly scoring with only five diffusion steps. A single SBDDM pre-trained on a large, semantically aligned medical dataset generalizes effectively across multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art performance while drastically reducing computational cost at inference time. Compared with existing methods, SBDDM achieves relative improvements of up to 10.43% for Near-OOD and 18.10% for Far-OOD detection, making it a practical building block for real-time, reliable computer-aided diagnosis.
Paper and project links
PDF Accepted at Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, MICCAI 2025
Summary
Unsupervised out-of-distribution (OOD) detection based on a Stein score denoising diffusion model offers an effective way to flag rare pathological cases in medical images. The method scores anomalies by capturing the curvature of the forward diffusion trajectory and needs only five diffusion steps, making it cheaper, more reliable, and more adaptable than generative models that depend on reconstruction error. A single SBDDM pre-trained on one large, semantically aligned medical dataset handles multiple Near-OOD and Far-OOD benchmarks, achieving state-of-the-art results while greatly reducing inference cost. Compared with existing methods, SBDDM improves Near-OOD and Far-OOD detection by up to 10.43% and 18.10% respectively, making it a practical building block for reliable computer-aided diagnosis.
Key Takeaways
- Unsupervised out-of-distribution (OOD) detection is an effective way to identify rare pathological cases.
- The Stein score-based diffusion detection method is computationally cheap and reliable.
Click here to view paper screenshots



UniEmo: Unifying Emotional Understanding and Generation with Learnable Expert Queries
Authors:Yijie Zhu, Lingsen Zhang, Zitong Yu, Rui Shao, Tao Tan, Liqiang Nie
Emotional understanding and generation are often treated as separate tasks, yet they are inherently complementary and can mutually enhance each other. In this paper, we propose the UniEmo, a unified framework that seamlessly integrates these two tasks. The key challenge lies in the abstract nature of emotions, necessitating the extraction of visual representations beneficial for both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, thereby serving as a foundational step for unification. Simultaneously, we fuse these expert queries and emotional representations to guide the diffusion model in generating emotion-evoking images. To enhance the diversity and fidelity of the generated emotional images, we further introduce the emotional correlation coefficient and emotional condition loss into the fusion process. This step facilitates fusion and alignment for emotional generation guided by the understanding. In turn, we demonstrate that joint training allows the generation component to provide implicit feedback to the understanding part. Furthermore, we propose a novel data filtering algorithm to select high-quality and diverse emotional images generated by the well-trained model, which explicitly feedback into the understanding part. Together, these generation-driven dual feedback processes enhance the model’s understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods in both emotional understanding and generation tasks. The code for the proposed method is available at https://github.com/JiuTian-VL/UniEmo.
Emotional understanding and generation are usually treated as separate tasks, yet they are inherently complementary and can reinforce each other. In this paper we propose UniEmo, a unified framework that seamlessly integrates the two tasks. The key challenge lies in the abstract nature of emotions, which requires extracting visual representations that benefit both tasks. To address this, we propose a hierarchical emotional understanding chain with learnable expert queries that progressively extracts multi-scale emotional features, serving as the foundation for unification. At the same time, we fuse these expert queries with emotional representations to guide the diffusion model in generating emotion-evoking images. To improve the diversity and fidelity of the generated emotional images, we further introduce an emotional correlation coefficient and an emotional condition loss into the fusion process, which facilitates fusion and alignment for generation guided by understanding. In turn, we show that joint training lets the generation component provide implicit feedback to the understanding part. We also propose a novel data filtering algorithm that selects high-quality, diverse emotional images generated by the well-trained model and feeds them back explicitly into the understanding part. Together, these generation-driven dual feedback processes strengthen the model's understanding capacity. Extensive experiments show that UniEmo significantly outperforms state-of-the-art methods on both emotional understanding and generation tasks. The code is available at https://github.com/JiuTian-VL/UniEmo.
Paper and project links
Summary
UniEmo is a unified framework that couples emotional understanding and generation. Because emotions are abstract, the framework must extract visual representations useful for both tasks; it does so with a hierarchical emotional understanding chain whose learnable expert queries progressively extract multi-scale emotional features. An emotional correlation coefficient and an emotional condition loss improve the quality and diversity of generated emotional images. Joint training lets the generation component provide implicit feedback to the understanding part, while a data filtering algorithm selects high-quality, diverse generated images as explicit feedback. Together these generation-driven dual feedback processes strengthen the model's understanding ability.
Key Takeaways
- UniEmo unifies emotional understanding and generation, treating them as complementary tasks that reinforce each other.
- A hierarchical emotional understanding chain with learnable expert queries extracts multi-scale emotional features.
- An emotional correlation coefficient and an emotional condition loss improve the quality of generated emotional images.
- Joint training lets the generation component provide implicit feedback to the understanding part.
- A data filtering algorithm selects high-quality, diverse emotional images as explicit feedback for the understanding part.
- The dual feedback processes improve both emotional understanding and generation.
Click here to view paper screenshots
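The "learnable expert queries" above follow a pattern that is common in query-based vision models (DETR- or Q-Former-style designs): a small set of learned query tokens cross-attends over image features to pool task-relevant information. The sketch below shows only that generic pattern; the module name, dimensions, and number of queries are assumptions rather than UniEmo's actual design.

```python
# Minimal sketch of learnable query tokens cross-attending over image features.
# This is a generic pattern, not UniEmo's module; names and sizes are placeholders.
import torch
import torch.nn as nn

class ExpertQueryPooling(nn.Module):
    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, dim) image features from a visual backbone
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)   # queries attend over image tokens
        return pooled                            # (batch, num_queries, dim)

feats = torch.randn(2, 196, 256)
print(ExpertQueryPooling()(feats).shape)  # torch.Size([2, 8, 256])
```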





IN45023 Neural Network Design Patterns in Computer Vision Seminar Report, Summer 2025
Authors:Radu-Andrei Bourceanu, Neil De La Fuente, Jan Grimm, Andrei Jardan, Andriy Manucharyan, Cornelius Weiss, Roman Pflugfelder
This report analyzes the evolution of key design patterns in computer vision by examining six influential papers. The analysis begins with foundational architectures for image recognition. We review ResNet, which introduced residual connections to overcome the vanishing gradient problem and enable effective training of significantly deeper convolutional networks. Subsequently, we examine the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches, demonstrating the efficacy of attention-based models for large-scale image recognition. Building on these visual representation backbones, we investigate generative models. Generative Adversarial Networks (GANs) are analyzed for their novel adversarial training process, which challenges a generator against a discriminator to learn complex data distributions. Then, Latent Diffusion Models (LDMs) are covered, which improve upon prior generative methods by performing a sequential denoising process in a perceptually compressed latent space. LDMs achieve high-fidelity synthesis with greater computational efficiency, representing the current state-of-the-art for image generation. Finally, we explore self-supervised learning techniques that reduce dependency on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. We conclude with Masked Autoencoders (MAE), which utilize an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
This report analyzes the evolution of key design patterns in computer vision through six influential papers. The analysis begins with foundational architectures for image recognition. It reviews ResNet, which introduced residual connections to overcome the vanishing-gradient problem and enable effective training of much deeper convolutional networks. It then examines the Vision Transformer (ViT), which established a new paradigm by applying the Transformer architecture to sequences of image patches and demonstrated the effectiveness of attention-based models for large-scale image recognition. Building on these visual representation backbones, the report turns to generative models. Generative Adversarial Networks (GANs) are analyzed for their adversarial training process, which pits a generator against a discriminator to learn complex data distributions. Latent Diffusion Models (LDMs) are then covered; they improve on earlier generative methods by performing sequential denoising in a perceptually compressed latent space, achieving high-fidelity synthesis with greater computational efficiency and representing the current state of the art in image generation. Finally, the report explores self-supervised learning techniques that reduce the dependence on labeled data. DINO is a self-distillation framework in which a student network learns to match the output of a momentum-updated teacher, yielding features with strong k-NN classification performance. It concludes with Masked Autoencoders (MAE), which use an asymmetric encoder-decoder design to reconstruct heavily masked inputs, providing a highly scalable and effective method for pre-training large-scale vision models.
Paper and project links
Summary
The report traces the evolution of key design patterns in computer vision through a detailed study of six influential papers. It starts with foundational image recognition architectures, covering ResNet and the Vision Transformer, then moves to generative models, including Generative Adversarial Networks (GANs) with their adversarial training process and Latent Diffusion Models (LDMs), which denoise sequentially in a perceptually compressed latent space. It closes with self-supervised learning techniques such as DINO and Masked Autoencoders, which reduce the reliance on labeled data.
Key Takeaways
- Design patterns in computer vision have evolved from foundational recognition architectures to generative models and self-supervised learning.
- ResNet's residual connections overcome the vanishing-gradient problem and make deep convolutional networks trainable.
- The Vision Transformer (ViT) applies the Transformer architecture to sequences of image patches, showing the effectiveness of attention-based models for large-scale recognition.
- Generative Adversarial Networks (GANs) learn complex data distributions through an adversarial game between a generator and a discriminator.
- Latent Diffusion Models (LDMs) denoise in a perceptually compressed latent space, achieving high-fidelity synthesis with better computational efficiency.
- DINO is a self-distillation framework where a student network matches a momentum-updated teacher, producing features with strong k-NN classification performance.
Click here to view paper screenshots
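Since the report centers on concrete architectural patterns, a small example helps: the block below is a textbook PyTorch residual block illustrating the ResNet idea summarized above (learn a residual F(x) and add it to an identity shortcut so gradients keep flowing through deep networks). It is a standard sketch, not code taken from the report.

```python
# A minimal residual block in PyTorch: the block learns a residual F(x) and adds it
# to the identity shortcut. Channel count and layer choices are standard defaults.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)  # F(x) + x: the residual (skip) connection

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```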





The Cow of Rembrandt - Analyzing Artistic Prompt Interpretation in Text-to-Image Models
Authors:Alfio Ferrara, Sergio Picascia, Elisabetta Rocchetti
Text-to-image diffusion models have demonstrated remarkable capabilities in generating artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts, such as content and style in paintings, remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work, we investigate how transformer-based text-to-image diffusion models encode content and style concepts when generating artworks. We leverage cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, enabling us to isolate image regions influenced by content-describing versus style-describing tokens. Our findings reveal that diffusion models demonstrate varying degrees of content-style separation depending on the specific artistic prompt and style requested. In many cases, content tokens primarily influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights contribute to our understanding of how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps at https://github.com/umilISLab/artistic-prompt-interpretation.
Text-to-image diffusion models have shown a remarkable ability to generate artistic content by learning from billions of images, including popular artworks. However, the fundamental question of how these models internally represent concepts such as content and style in paintings remains unexplored. Traditional computer vision assumes content and style are orthogonal, but diffusion models receive no explicit guidance about this distinction during training. In this work we investigate how transformer-based text-to-image diffusion models encode content and style when generating artworks. We use cross-attention heatmaps to attribute pixels in generated images to specific prompt tokens, which lets us isolate image regions influenced by content-describing versus style-describing tokens. Our findings show that diffusion models exhibit varying degrees of content-style separation depending on the specific artistic prompt and the requested style. In many cases, content tokens mainly influence object-related regions while style tokens affect background and texture areas, suggesting an emergent understanding of the content-style distinction. These insights help explain how large-scale generative models internally represent complex artistic concepts without explicit supervision. We share the code and dataset, together with an exploratory tool for visualizing attention maps, at https://github.com/umilISLab/artistic-prompt-interpretation.
Paper and project links
PDF to be published in: Applications of AI in the Analysis of Cultural and Artistic Heritage, organized within the 35th IEEE International Workshop on Machine Learning for Signal Processing (MLSP) 2025
Summary
The paper studies how transformer-based text-to-image diffusion models encode content and style when generating artworks. Cross-attention heatmaps attribute pixels in generated images to specific prompt tokens, isolating the regions influenced by content-describing versus style-describing tokens. The models show varying degrees of content-style separation depending on the artistic prompt and the requested style. These findings help explain how large generative models internally represent complex artistic concepts without explicit supervision.
Key Takeaways
- Text-to-image diffusion models can generate artistic content, but how they internally represent concepts such as content and style in paintings has not been explored.
- Traditional computer vision assumes content and style are orthogonal, yet diffusion models receive no explicit guidance about this distinction during training.
- Cross-attention heatmaps link pixels in generated images to specific prompt tokens, isolating the regions influenced by content versus style.
- The degree of content-style separation varies with the artistic prompt and the requested style.
- In many cases content tokens mainly influence object-related regions while style tokens affect background and texture areas, suggesting an emergent grasp of the content-style distinction.
- These findings add to our understanding of how large generative models internally represent complex artistic concepts, showing content-style separation without explicit supervision.
Click here to view paper screenshots
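The paper's attribution relies on per-token cross-attention heatmaps. The toy below assumes those maps have already been captured from a diffusion UNet (for example with forward hooks) and simply averages them into "content" and "style" heatmaps; the token grouping and normalization are illustrative assumptions, not the released tool.

```python
# Toy sketch of token-level attribution from cross-attention maps that are assumed
# to be pre-extracted. It only groups and averages maps; capturing them from a real
# diffusion UNet is out of scope here.
import numpy as np

def aggregate_heatmaps(attn, token_groups):
    """attn: (num_tokens, H, W) cross-attention maps, one per prompt token."""
    out = {}
    for name, idxs in token_groups.items():
        m = attn[idxs].mean(axis=0)                        # average the group's maps
        out[name] = (m - m.min()) / (m.max() - m.min() + 1e-8)  # normalize for viewing
    return out

attn = np.random.rand(6, 64, 64)             # e.g. prompt "a cow in the style of Rembrandt"
groups = {"content": [1], "style": [5]}      # hypothetical token indices
maps = aggregate_heatmaps(attn, groups)
print({k: v.shape for k, v in maps.items()})
```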




Adversarial-Guided Diffusion for Multimodal LLM Attacks
Authors:Chengwei Xia, Fan Ma, Ruijie Quan, Kun Zhan, Yi Yang
This paper addresses the challenge of generating adversarial image using a diffusion model to deceive multimodal large language models (MLLMs) into generating the targeted responses, while avoiding significant distortion of the clean image. To address the above challenges, we propose an adversarial-guided diffusion (AGD) approach for adversarial attack MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks which embed high-frequency perturbations directly into the clean image, AGD injects target semantics into the noise component of the reverse diffusion. Since the added noise in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded within it also inherits this full-spectrum property. Importantly, during reverse diffusion, the adversarial image is formed as a linear combination of the clean image and the noise. Thus, when applying defenses such as a simple low-pass filtering, which act independently on each component, the adversarial image within the noise component is less likely to be suppressed, as it is not confined to the high-frequency band. This makes AGD inherently robust to variety defenses. Extensive experiments demonstrate that our AGD outperforms state-of-the-art methods in attack performance as well as in model robustness to some defenses.
This paper addresses the challenge of using a diffusion model to generate adversarial images that deceive multimodal large language models (MLLMs) into producing targeted responses while avoiding significant distortion of the clean image. To this end, we propose an adversarial-guided diffusion (AGD) approach for adversarial attacks on MLLMs. We introduce adversarial-guided noise to ensure attack efficacy. A key observation in our design is that, unlike most traditional adversarial attacks that embed high-frequency perturbations directly into the clean image, AGD injects the target semantics into the noise component of the reverse diffusion. Since the noise added in a diffusion model spans the entire frequency spectrum, the adversarial signal embedded in it inherits this full-spectrum property. Importantly, during reverse diffusion the adversarial image is formed as a linear combination of the clean image and the noise. As a result, when defenses such as simple low-pass filtering are applied, which act independently on each component, the adversarial signal in the noise component is less likely to be suppressed because it is not confined to the high-frequency band. This makes AGD inherently robust to a variety of defenses. Extensive experiments show that AGD outperforms state-of-the-art methods in attack performance as well as in robustness to several defenses.
Paper and project links
Summary
The paper proposes adversarial-guided diffusion (AGD), a diffusion-based adversarial attack that deceives multimodal large language models into producing targeted responses without significantly distorting the clean image. Unlike traditional attacks that embed high-frequency perturbations directly into the clean image, AGD injects the target semantics into the noise component of reverse diffusion, so the adversarial signal spans the full frequency spectrum. Because the adversarial image is a linear combination of the clean image and the noise, defenses such as simple low-pass filtering, which act on each component independently, are less likely to suppress the adversarial content, making AGD inherently robust to a range of defenses. Experiments show that AGD outperforms state-of-the-art methods in both attack performance and robustness to several defenses.
Key Takeaways
- The paper proposes adversarial-guided diffusion (AGD), a diffusion-based adversarial attack on multimodal large language models.
- AGD introduces adversarial-guided noise to ensure attack efficacy.
- Target semantics are injected into the noise component of reverse diffusion rather than as high-frequency perturbations on the clean image.
- The embedded adversarial signal inherits the full-spectrum property of diffusion noise.
- During reverse diffusion the adversarial image is a linear combination of the clean image and the noise.
- AGD is inherently robust to many defenses because the adversarial signal in the noise is not easily suppressed by simple low-pass filtering.
Click here to view paper screenshots





Hate in Plain Sight: On the Risks of Moderating AI-Generated Hateful Illusions
Authors:Yiting Qu, Ziqing Yang, Yihan Ma, Michael Backes, Savvas Zannettou, Yang Zhang
Recent advances in text-to-image diffusion models have enabled the creation of a new form of digital art: optical illusions–visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions, which embed specific hate messages into harmless scenes and disseminate them across web communities. In this work, we take the first step toward investigating the risks of scalable hateful illusion generation and the potential for bypassing current content moderation models. Specifically, we generate 1,860 optical illusions using Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 are hateful illusions that successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset, we evaluate the performance of six moderation classifiers and nine vision language models (VLMs) in identifying hateful illusions. Experimental results reveal significant vulnerabilities in existing moderation models: the detection accuracy falls below 0.245 for moderation classifiers and below 0.102 for VLMs. We further identify a critical limitation in their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., hidden messages. To address this risk, we explore preliminary mitigation measures and identify the most effective approaches from the perspectives of image transformations and training-level strategies.
Recent advances in text-to-image diffusion models have enabled a new form of digital art: optical illusions, visual tricks that create different perceptions of reality. However, adversaries may misuse such techniques to generate hateful illusions that embed specific hate messages into harmless scenes and spread them across web communities. In this work we take a first step toward investigating the risk of scalable hateful-illusion generation and the potential to bypass current content moderation models. Specifically, we generate 1,860 optical illusions with Stable Diffusion and ControlNet, conditioned on 62 hate messages. Of these, 1,571 successfully embed hate messages, either overtly or subtly, forming the Hateful Illusion dataset. Using this dataset we evaluate six moderation classifiers and nine vision language models (VLMs) on identifying hateful illusions. The results reveal significant vulnerabilities in existing moderation models: detection accuracy falls below 0.245 for the moderation classifiers and below 0.102 for the VLMs. We further identify a key limitation of their vision encoders, which mainly focus on surface-level image details while overlooking the secondary layer of information, i.e., the hidden messages. To address this risk, we explore preliminary mitigation measures and identify the most effective approaches from the perspectives of image transformations and training-level strategies.
Paper and project links
PDF Accepted at ICCV 2025
Summary
Recent text-to-image diffusion models make it possible to create optical illusions that embed specific hate messages in otherwise harmless scenes, which could then be spread across web communities. This work is the first to study the risk of scalable hateful-illusion generation and of bypassing current content moderation models. Using Stable Diffusion and ControlNet conditioned on 62 hate messages, the authors generate 1,860 optical illusions and build the Hateful Illusion dataset. Evaluations show that existing moderation models are highly vulnerable, with detection accuracy below 0.245, and that their vision encoders miss hidden information in images. The authors explore preliminary mitigations and identify the most effective ones from the perspectives of image transformations and training-level strategies.
Key Takeaways
- Advances in text-to-image diffusion models enable the creation of optical illusions.
- Such illusions can be misused to embed hate messages in harmless scenes and spread them across web communities.
- This is the first investigation of the risk of scalable hateful-illusion generation and its threat to existing content moderation models.
- The 1,571 generated illusions that successfully embed hate messages form the Hateful Illusion dataset.
- Existing moderation models show significant vulnerabilities, with detection accuracy below 0.245.
- Their vision encoders focus on surface-level image details and overlook the hidden, secondary layer of information.
Click here to view paper screenshots





DACA-Net: A Degradation-Aware Conditional Diffusion Network for Underwater Image Enhancement
Authors:Chang Huang, Jiahang Cao, Jun Ma, Kieren Yu, Cong Li, Huayong Yang, Kaishun Wu
Underwater images typically suffer from severe colour distortions, low visibility, and reduced structural clarity due to complex optical effects such as scattering and absorption, which greatly degrade their visual quality and limit the performance of downstream visual perception tasks. Existing enhancement methods often struggle to adaptively handle diverse degradation conditions and fail to leverage underwater-specific physical priors effectively. In this paper, we propose a degradation-aware conditional diffusion model to enhance underwater images adaptively and robustly. Given a degraded underwater image as input, we first predict its degradation level using a lightweight dual-stream convolutional network, generating a continuous degradation score as semantic guidance. Based on this score, we introduce a novel conditional diffusion-based restoration network with a Swin UNet backbone, enabling adaptive noise scheduling and hierarchical feature refinement. To incorporate underwater-specific physical priors, we further propose a degradation-guided adaptive feature fusion module and a hybrid loss function that combines perceptual consistency, histogram matching, and feature-level contrast. Comprehensive experiments on benchmark datasets demonstrate that our method effectively restores underwater images with superior colour fidelity, perceptual quality, and structural details. Compared with SOTA approaches, our framework achieves significant improvements in both quantitative metrics and qualitative visual assessments.
Underwater images typically suffer from severe color distortion, low visibility, and reduced structural clarity due to complex optical effects such as scattering and absorption, which greatly degrade their visual quality and limit the performance of downstream visual perception tasks. Existing enhancement methods often struggle to adapt to diverse degradation conditions and fail to exploit underwater-specific physical priors effectively. In this paper we propose a degradation-aware conditional diffusion model that enhances underwater images adaptively and robustly. Given a degraded underwater image, we first predict its degradation level with a lightweight dual-stream convolutional network, producing a continuous degradation score as semantic guidance. Based on this score, we introduce a conditional diffusion-based restoration network with a Swin UNet backbone that enables adaptive noise scheduling and hierarchical feature refinement. To incorporate underwater-specific physical priors, we further propose a degradation-guided adaptive feature fusion module and a hybrid loss that combines perceptual consistency, histogram matching, and feature-level contrast. Comprehensive experiments on benchmark datasets show that our method restores underwater images with superior color fidelity, perceptual quality, and structural detail, achieving significant improvements over state-of-the-art approaches in both quantitative metrics and qualitative visual assessments.
Paper and project links
PDF accepted by ACM MM 2025
Summary
The paper proposes a degradation-aware conditional diffusion model for adaptive and robust underwater image enhancement. A lightweight dual-stream convolutional network first predicts the degradation level of the input and produces a continuous degradation score as semantic guidance. Conditioned on this score, a diffusion-based restoration network with a Swin UNet backbone performs adaptive noise scheduling and hierarchical feature refinement. A degradation-guided adaptive feature fusion module and a hybrid loss combining perceptual consistency, histogram matching, and feature-level contrast inject underwater-specific physical priors. Experiments on benchmark datasets show that the method restores underwater images with superior color fidelity, perceptual quality, and structural detail, clearly improving over state-of-the-art approaches in both quantitative metrics and qualitative assessments.
Key Takeaways
- Underwater images often suffer from severe color distortion, low visibility, and reduced structural clarity.
- Complex optical effects such as scattering and absorption degrade visual quality and hurt downstream perception tasks.
- Existing enhancement methods struggle to adapt to diverse degradation conditions and under-use underwater-specific physical priors.
- The paper proposes a degradation-aware conditional diffusion model for adaptive and robust underwater image enhancement.
- A lightweight dual-stream convolutional network predicts the degradation level and outputs a continuous degradation score as semantic guidance.
- A conditional diffusion restoration network with a Swin UNet backbone enables adaptive noise scheduling and hierarchical feature refinement.
Click here to view paper screenshots
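The abstract only states that a lightweight dual-stream convolutional network predicts a continuous degradation score used as semantic guidance. The sketch below is one hypothetical way to realize that: two small convolutional streams over the RGB image and a luminance view, pooled and mapped to a score in (0, 1). The streams, widths, and score head are assumptions, not the paper's network.

```python
# Illustrative dual-stream predictor mapping an underwater image to a degradation
# score in (0, 1). All architectural choices here are placeholders.
import torch
import torch.nn as nn

def stream(in_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    )

class DegradationScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_stream = stream(3)     # operates on the raw RGB image
        self.gray_stream = stream(1)    # operates on a luminance/structure view
        self.head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(inplace=True),
                                  nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        gray = img.mean(dim=1, keepdim=True)
        feat = torch.cat([self.rgb_stream(img), self.gray_stream(gray)], dim=1)
        return self.head(feat)          # continuous degradation score in (0, 1)

print(DegradationScorer()(torch.randn(2, 3, 128, 128)).shape)  # torch.Size([2, 1])
```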



Visual Language Models as Zero-Shot Deepfake Detectors
Authors:Viacheslav Pirogov
The contemporary phenomenon of deepfakes, utilizing GAN or diffusion models for face swapping, presents a substantial and evolving threat in digital media, identity verification, and a multitude of other systems. The majority of existing methods for detecting deepfakes rely on training specialized classifiers to distinguish between genuine and manipulated images, focusing only on the image domain without incorporating any auxiliary tasks that could enhance robustness. In this paper, inspired by the zero-shot capabilities of Vision Language Models, we propose a novel VLM-based approach to image classification and then evaluate it for deepfake detection. Specifically, we utilize a new high-quality deepfake dataset comprising 60,000 images, on which our zero-shot models demonstrate superior performance to almost all existing methods. Subsequently, we compare the performance of the best-performing architecture, InstructBLIP, on the popular deepfake dataset DFDC-P against traditional methods in two scenarios: zero-shot and in-domain fine-tuning. Our results demonstrate the superiority of VLMs over traditional classifiers.
The contemporary phenomenon of deepfakes, which use GANs or diffusion models for face swapping, poses a substantial and evolving threat to digital media, identity verification, and many other systems. Most existing deepfake detection methods train specialized classifiers to distinguish genuine from manipulated images, focusing only on the image domain without auxiliary tasks that could improve robustness. Inspired by the zero-shot capabilities of Vision Language Models, this paper proposes a VLM-based approach to image classification and evaluates it for deepfake detection. Specifically, we use a new high-quality deepfake dataset of 60,000 images, on which our zero-shot models outperform almost all existing methods. We then compare the best-performing architecture, InstructBLIP, with traditional methods on the popular DFDC-P deepfake dataset in two scenarios, zero-shot and in-domain fine-tuning. The results demonstrate the superiority of VLMs over traditional classifiers.
Paper and project links
PDF Accepted to the ICML 2025 Workshop on Reliable and Responsible Foundation Models
Summary
Deepfakes produced with GANs or diffusion models for face swapping threaten digital media, identity verification, and other systems. Most existing detectors train specialized classifiers on images alone, without auxiliary tasks that could improve robustness. Inspired by the zero-shot ability of Vision Language Models, the paper proposes a VLM-based image classification approach and evaluates it for deepfake detection. On a new high-quality dataset of 60,000 deepfake images, the zero-shot models outperform almost all existing methods. The best architecture, InstructBLIP, is then compared with traditional methods on the DFDC-P dataset in zero-shot and in-domain fine-tuning settings, confirming the superiority of VLMs over traditional classifiers.
Key Takeaways
- Deepfakes based on GANs or diffusion models for face swapping threaten digital media and identity verification systems.
- Existing detection methods rely mainly on specialized image classifiers and lack auxiliary tasks that would improve robustness.
- The paper proposes a zero-shot, Vision Language Model (VLM) based approach to image classification for deepfake detection.
- On a new high-quality deepfake dataset of 60,000 images, the zero-shot models achieve top performance.
- The best architecture, InstructBLIP, is compared with traditional methods on the DFDC-P dataset.
- InstructBLIP outperforms traditional methods in both zero-shot and in-domain fine-tuning settings.
- The results confirm the advantage of VLMs for deepfake detection.
Click here to view paper screenshots
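The paper evaluates VLMs such as InstructBLIP; to stay within APIs that are certain, the sketch below instead illustrates the underlying zero-shot classification idea with CLIP from Hugging Face transformers, scoring an image against "real" and "manipulated" text prompts. The prompts, the dummy input image, and the decision rule are assumptions, and this is not the paper's InstructBLIP pipeline.

```python
# Generic zero-shot image classification with CLIP, used only to illustrate the
# zero-shot idea the paper builds on. The dummy image stands in for a face crop.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # replace with a real face crop in practice
prompts = ["a real photograph of a person's face",
           "a digitally manipulated or AI-generated face"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]

label = "fake" if probs[1] > probs[0] else "real"
print(label, probs.tolist())
```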


TopoLiDM: Topology-Aware LiDAR Diffusion Models for Interpretable and Realistic LiDAR Point Cloud Generation
Authors:Jiuming Liu, Zheng Huang, Mengmeng Liu, Tianchen Deng, Francesco Nex, Hao Cheng, Hesheng Wang
LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their interpretable ability to model detailed geometric structures and preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topological-preserving VAE to extract latent graph representations by graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through the latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM’s superiority over state-of-the-art methods, achieving improvements of 22.6% lower Frechet Range Image Distance (FRID) and 9.2% lower Minimum Matching Distance (MMD). Notably, our model also enables fast generation speed with an average inference time of 1.68 samples/s, showcasing its scalability for real-world applications. We will release the related codes at https://github.com/IRMVLab/TopoLiDM.
LiDAR scene generation is critical for reducing real-world LiDAR data collection costs and improving the robustness of downstream perception tasks in autonomous driving. However, existing methods often struggle to capture geometric realism and global topological consistency. Recent LiDAR diffusion models (LiDMs) mostly embed LiDAR points into a latent space for generation efficiency, which limits their ability to model detailed geometric structure and preserve global topological consistency in an interpretable way. To address these challenges, we propose TopoLiDM, a framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topology-preserving VAE that extracts latent graph representations through graph construction and multiple graph convolutional layers. We then freeze the VAE and generate novel latent topological graphs with latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints to ensure the generated LiDAR scenes follow real-world global topological structure. Extensive experiments on the KITTI-360 dataset show TopoLiDM's superiority over state-of-the-art methods, with a 22.6% lower Frechet Range Image Distance (FRID) and a 9.2% lower Minimum Matching Distance (MMD). Notably, the model also generates quickly, averaging 1.68 samples/s at inference, demonstrating its scalability for real-world applications. The related code will be released at https://github.com/IRMVLab/TopoLiDM.
Paper and project links
PDF Accepted by IROS 2025. Code: https://github.com/IRMVLab/TopoLiDM
Summary
TopoLiDM combines graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR scene generation. A topology-preserving VAE first extracts latent graph representations, after which latent diffusion models generate new latent topological graphs. Zero-dimensional persistent homology (PH) constraints keep the generated LiDAR scenes consistent with real-world global topology. Experiments on KITTI-360 show that TopoLiDM outperforms existing methods while keeping generation fast and improving the robustness of downstream perception. The code will be released on the IRMVLab GitHub.
Key Takeaways
- LiDAR scene generation helps reduce real-world data collection costs and improves the robustness of downstream perception in autonomous driving.
- Existing methods struggle to capture geometric realism and global topological consistency.
- TopoLiDM combines graph neural networks (GNNs) with diffusion models for high-fidelity LiDAR generation.
- A topology-preserving VAE extracts latent graph representations, and latent diffusion models generate new topological graphs.
- Zero-dimensional persistent homology (PH) constraints keep generated LiDAR scenes consistent with real-world global structure.
- On KITTI-360, TopoLiDM outperforms existing methods, lowering Frechet Range Image Distance (FRID) and Minimum Matching Distance (MMD).
Click here to view paper screenshots






Learning to See in the Extremely Dark
Authors:Hai Jiang, Binhao Guan, Zhen Liu, Xiaohong Liu, Jian Yu, Zheng Liu, Songchen Han, Shuaicheng Liu
Learning-based methods have made promising advances in low-light RAW image enhancement, while their capability to extremely dark scenes where the environmental illuminance drops as low as 0.0001 lux remains to be explored due to the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline capable of generating well-calibrated extremely low-light RAW images at three precise illuminance ranges of 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references to comprise a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) to benchmark low-light RAW image enhancement approaches. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, in which an Adaptive Illumination Correction Module (AICM) and a color consistency loss are introduced to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED and publicly available benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.
Learning-based methods have made promising advances in low-light RAW image enhancement, but their behavior in extremely dark scenes, where the environmental illuminance drops as low as 0.0001 lux, remains unexplored because of the lack of corresponding datasets. To this end, we propose a paired-to-paired data synthesis pipeline that generates well-calibrated, extremely low-light RAW images at three precise illuminance ranges, 0.01-0.1 lux, 0.001-0.01 lux, and 0.0001-0.001 lux, together with high-quality sRGB references, forming a large-scale paired dataset named See-in-the-Extremely-Dark (SIED) for benchmarking low-light RAW image enhancement methods. Furthermore, we propose a diffusion-based framework that leverages the generative ability and intrinsic denoising property of diffusion models to restore visually pleasing results from extremely low-SNR RAW inputs, introducing an Adaptive Illumination Correction Module (AICM) and a color consistency loss to ensure accurate exposure correction and color restoration. Extensive experiments on the proposed SIED dataset and publicly available benchmarks demonstrate the effectiveness of the method. The code and dataset are available at https://github.com/JianghaiSCU/SIED.
Paper and project links
PDF Accepted by ICCV 2025
Summary
The paper introduces a data synthesis pipeline that produces precisely calibrated, extremely low-light RAW images with corresponding sRGB references, forming the large-scale paired See-in-the-Extremely-Dark (SIED) dataset. It also proposes a diffusion-based framework that exploits the generative ability and intrinsic denoising property of diffusion models to recover visually pleasing results from extremely low-SNR RAW inputs. An Adaptive Illumination Correction Module and a color consistency loss ensure accurate exposure correction and color restoration. Experiments on SIED and public benchmarks confirm the effectiveness of the method.
Key Takeaways
- The lack of suitable datasets has limited learning-based methods in extremely dark scenes where illuminance drops to 0.0001 lux.
- A data synthesis pipeline generates extremely low-light RAW images with sRGB references, forming the large-scale paired See-in-the-Extremely-Dark (SIED) dataset.
- A diffusion-based framework exploits the generative ability and intrinsic denoising property of diffusion models for RAW image enhancement.
- An Adaptive Illumination Correction Module and a color consistency loss ensure accurate exposure correction and color restoration.
- Experiments on the SIED dataset and public benchmarks validate the method's effectiveness.
- The dataset and code can be used to benchmark low-light RAW image enhancement methods.
Click here to view paper screenshots
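The abstract describes a calibrated paired-to-paired synthesis pipeline without giving its noise model. The snippet below shows a common rough approximation of extreme low light: scale the exposure of a clean linear RAW frame and add Poisson shot noise plus Gaussian read noise. The gain and noise parameters are placeholders, not SIED's calibration.

```python
# Rough low-light RAW simulation: reduce exposure, then add shot (Poisson) and
# read (Gaussian) noise. This is a generic approximation, not the SIED pipeline.
import numpy as np

def simulate_low_light(raw, exposure_scale=0.001, full_well=2**14, read_noise=2.0):
    """raw: clean linear RAW in [0, 1]; returns a noisy low-light RAW in [0, 1]."""
    photons = raw * full_well * exposure_scale                   # fewer photons in the dark
    shot = np.random.poisson(photons).astype(np.float64)         # photon shot noise
    noisy = shot + np.random.normal(0.0, read_noise, raw.shape)  # sensor read noise
    return np.clip(noisy / (full_well * exposure_scale), 0.0, 1.0)

clean = np.random.rand(128, 128)  # stand-in for a clean RAW channel
dark = simulate_low_light(clean, exposure_scale=0.0005)
print(dark.mean(), dark.std())
```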






Diffusion-based Adversarial Identity Manipulation for Facial Privacy Protection
Authors:Liqin Wang, Qianyue Hu, Wei Lu, Xiangyang Luo
The success of face recognition (FR) systems has led to serious privacy concerns due to potential unauthorized surveillance and user tracking on social networks. Existing methods for enhancing privacy fail to generate natural face images that can protect facial privacy. In this paper, we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. To be specific, we manipulate facial identity within the low-dimensional latent space of a diffusion model. This involves iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process, progressively steering the generation toward the desired adversarial faces. The guidance is optimized for identity convergence towards a target while promoting semantic divergence from the source, facilitating effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to preserve facial structure consistency during manipulation. Extensive experiments on both face verification and identification tasks demonstrate that compared with the state-of-the-art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the proposed approach for commercial FR APIs, including Face++ and Aliyun.
The success of face recognition (FR) systems has raised serious privacy concerns because of potential unauthorized surveillance and user tracking on social networks. Existing privacy-enhancing methods fail to generate natural face images that protect facial privacy. In this paper we propose diffusion-based adversarial identity manipulation (DiffAIM) to generate natural and highly transferable adversarial faces against malicious FR systems. Specifically, we manipulate facial identity within the low-dimensional latent space of a diffusion model, iteratively injecting gradient-based adversarial identity guidance during the reverse diffusion process and progressively steering generation toward the desired adversarial faces. The guidance is optimized for identity convergence toward a target while promoting semantic divergence from the source, enabling effective impersonation while maintaining visual naturalness. We further incorporate structure-preserving regularization to keep the facial structure consistent during manipulation. Extensive experiments on face verification and identification tasks show that, compared with the state of the art, DiffAIM achieves stronger black-box attack transferability while maintaining superior visual quality. We also demonstrate the effectiveness of the approach against commercial FR APIs, including Face++ and Aliyun.
Paper and project links
PDF Accepted by ACM MM 2025
Summary
The success of face recognition (FR) systems raises privacy concerns about unauthorized surveillance and user tracking on social networks, and existing privacy-protection methods cannot generate natural face images that protect facial privacy. The paper proposes diffusion-based adversarial identity manipulation (DiffAIM), which manipulates facial identity in the low-dimensional latent space of a diffusion model. Gradient-based adversarial identity guidance is injected iteratively during reverse diffusion, steering generation toward the desired adversarial faces; the guidance drives identity convergence toward a target while encouraging semantic divergence from the source, achieving effective impersonation with natural-looking results. Structure-preserving regularization keeps the facial structure consistent. Experiments on face verification and identification show stronger black-box transferability and better visual quality than the state of the art, and the method is effective against commercial FR APIs such as Face++ and Aliyun.
Key Takeaways
- The success of face recognition systems raises serious privacy concerns.
- Existing privacy-enhancing methods cannot generate natural face images that protect facial privacy.
- The paper proposes diffusion-based adversarial identity manipulation (DiffAIM).
- Facial identity is manipulated in a low-dimensional latent space, with adversarial faces generated progressively during reverse diffusion.
- The guidance balances identity convergence toward the target with semantic divergence from the source, ensuring naturalness and effectiveness.
- Structure-preserving regularization maintains facial structure consistency.
- DiffAIM achieves stronger black-box attack transferability and better visual quality than the state of the art on verification and identification tasks, and it is effective against commercial FR APIs.
Click here to view paper screenshots






VistaDepth: Frequency Modulation with Bias Reweighting for Enhanced Far-range Depth Estimation
Authors:Mingxia Zhan, Li Zhang, Xiaomeng Chu, Beibei Wang, Yanyong Zhang
Monocular depth estimation predicts per-pixel depth from a single RGB image. While recent methods have shown promise by leveraging diffusion models, they often struggle to accurately reconstruct far-range regions. This difficulty stems from two compounding factors. First, the standard spatially uniform diffusion objective fails to adapt to the varying frequency content across a depth map. Second, the long-tail depth distribution heavily biases models toward near-range regions. To address these limitations, we introduce VistaDepth, a novel framework named for its ability to accurately reconstruct far-range vistas, which integrates adaptive frequency-domain feature processing with an adaptive loss-balancing mechanism into the diffusion pipeline. Central to our approach is the Latent Frequency Modulation module, which dynamically refines spectral responses in the latent feature space, effectively preserving structural detail. Additionally, we introduce BiasMap, a mechanism that applies adaptive weights directly to the diffusion loss in the latent space, focusing supervision on under-represented far-range regions. These innovations collectively achieve superior depth perception performance across near- and far-range depths while preserving fine detail. Experiments show that VistaDepth achieves state-of-the-art performance for diffusion-based MDE, particularly excelling in reconstructing detailed and accurate depth in far-range regions.
Monocular depth estimation predicts per-pixel depth from a single RGB image. Recent methods based on diffusion models show promise but often struggle to reconstruct far-range regions accurately. The difficulty stems from two compounding factors: the standard spatially uniform diffusion objective cannot adapt to the varying frequency content across a depth map, and the long-tailed depth distribution strongly biases models toward near-range regions. To address these limitations, we introduce VistaDepth, a framework named for its ability to reconstruct far-range vistas accurately, which integrates adaptive frequency-domain feature processing and an adaptive loss-balancing mechanism into the diffusion pipeline. Central to the approach is a Latent Frequency Modulation module that dynamically refines spectral responses in the latent feature space, effectively preserving structural detail. We also introduce BiasMap, a mechanism that applies adaptive weights directly to the diffusion loss in the latent space, focusing supervision on under-represented far-range regions. Together these innovations achieve superior depth perception across near and far ranges while preserving fine detail. Experiments show that VistaDepth achieves state-of-the-art performance among diffusion-based monocular depth estimation methods, excelling in particular at reconstructing detailed and accurate depth in far-range regions.
Paper and project links
Summary
Monocular depth estimation predicts per-pixel depth from a single RGB image, and recent diffusion-based methods still struggle in far-range regions. Two issues are to blame: the spatially uniform diffusion objective cannot adapt to the varying frequency content of a depth map, and the long-tailed depth distribution biases models toward near-range regions. VistaDepth addresses both by integrating adaptive frequency-domain feature processing and adaptive loss balancing into the diffusion pipeline. A Latent Frequency Modulation module dynamically refines spectral responses in the latent feature space to preserve structural detail, while BiasMap applies adaptive weights to the diffusion loss to focus supervision on under-represented far-range regions. Together these yield state-of-the-art performance for diffusion-based monocular depth estimation, with especially detailed and accurate far-range depth.
Key Takeaways
- Monocular depth estimation predicts per-pixel depth from a single RGB image.
- Existing diffusion-based methods show promise but have trouble reconstructing far-range regions.
- The main issues are a spatially uniform diffusion objective that ignores frequency variation in depth maps and a long-tailed depth distribution biased toward near ranges.
- VistaDepth addresses these with adaptive frequency-domain feature processing and an adaptive loss-balancing mechanism.
- The Latent Frequency Modulation module dynamically refines spectral responses in the latent feature space, preserving structural detail.
- BiasMap applies adaptive weights to the diffusion loss to strengthen supervision of far-range regions.
Click here to view paper screenshots
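BiasMap is described only as applying adaptive weights to the diffusion loss in latent space so that under-represented far-range regions receive more supervision. The toy below transplants that idea to a plain pixel-space depth regression loss with an assumed depth-proportional weighting; it is purely illustrative and not the paper's formulation.

```python
# Toy illustration of re-weighting a depth loss toward far-range pixels. The
# normalized-depth weighting and gamma value are assumptions for illustration.
import torch

def far_range_weighted_loss(pred, target, gamma=2.0):
    # Weight grows with normalized depth, so sparse far-range pixels contribute more.
    norm = target / (target.max() + 1e-8)
    weight = 1.0 + gamma * norm
    return (weight * (pred - target) ** 2).mean()

pred = torch.rand(1, 1, 64, 64) * 80.0    # predicted depth in meters (placeholder)
target = torch.rand(1, 1, 64, 64) * 80.0  # ground-truth depth (placeholder)
print(far_range_weighted_loss(pred, target))
```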



Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction
Authors:Gaoge Han, Yongkang Cheng, Zhe Chen, Shaoli Huang, Tongliang Liu
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions, causing significant difficulty in achieving plausible interaction alignment. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. To tackle this, we propose a dual-stage Foundation-to-Diffusion framework that precisely align 2D prior guidance from vision foundation models and diffusion-based generative 3D interaction refinement to achieve occlusion-robust two-hand reconstruction. First, we introduce a lightweight fusion alignment encoder that aligns fused multimodal 2D priors like key points, segmentation maps, and depth cues from vision foundation models during training. This provides robust structured guidance, further enabling efficient inference without heavy foundation model encoders at test time while maintaining high reconstruction accuracy. Second, we implement a two-hand diffusion model explicitly trained to convert interpenetrated 3D poses into plausible, penetration-free counterparts. Through collision gradient-guided denoising, the model rectifies artifacts while preserving natural spatial relationships between hands. Extensive evaluations demonstrate that our method achieves state-of-the-art performance on InterHand2.6M, HIC, and FreiHAND datasets, significantly advancing occlusion handling and interaction robustness. Our code will be publicly released.
Two-hand reconstruction from monocular images remains challenging because of complex, dynamic hand postures and occlusions, which make plausible interaction alignment hard to achieve. Existing approaches struggle with such alignment and often produce misalignment and penetration artifacts. To tackle this, we propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models with diffusion-based generative 3D interaction refinement, achieving occlusion-robust two-hand reconstruction. First, we introduce a lightweight fusion alignment encoder that, during training, aligns fused multimodal 2D priors from vision foundation models, such as keypoints, segmentation maps, and depth cues. This provides robust structured guidance and enables efficient inference without heavy foundation-model encoders at test time while maintaining high reconstruction accuracy. Second, we implement a two-hand diffusion model explicitly trained to convert interpenetrating 3D poses into plausible, penetration-free counterparts. Through collision-gradient-guided denoising, the model corrects artifacts while preserving the natural spatial relationship between the hands. Extensive evaluations show that our method achieves state-of-the-art performance on the InterHand2.6M, HIC, and FreiHAND datasets, significantly advancing occlusion handling and interaction robustness. Our code will be publicly released.
Paper and project links
Summary
Two-hand reconstruction from monocular images is difficult because of complex, changing hand poses and occlusion, and existing methods often suffer from misalignment and penetration artifacts. The paper proposes a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models with diffusion-based generative 3D interaction refinement for occlusion-robust reconstruction. A lightweight fusion alignment encoder aligns fused multimodal 2D priors such as keypoints, segmentation maps, and depth cues during training, providing robust structured guidance and allowing efficient inference without heavy foundation-model encoders at test time. A two-hand diffusion model then converts interpenetrating 3D poses into plausible, penetration-free poses via collision-gradient-guided denoising while preserving the natural spatial relationship between the hands. Extensive evaluations show state-of-the-art performance on InterHand2.6M, HIC, and FreiHAND, with clearly better occlusion handling and interaction robustness. The code will be released publicly.
Key Takeaways
- Proposes a dual-stage Foundation-to-Diffusion framework for two-hand reconstruction.
- A lightweight fusion alignment encoder aligns 2D prior guidance from vision foundation models.
- A two-hand diffusion model converts interpenetrating 3D poses into penetration-free counterparts.
- Collision-gradient-guided denoising corrects artifacts while preserving the natural spatial relationship between the hands.
- The framework achieves state-of-the-art results on InterHand2.6M, HIC, and FreiHAND.
- It offers strong occlusion handling and interaction robustness.
Click here to view paper screenshots





