
Diffusion Models


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them in serious academic settings; they are only intended as a first-pass screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-25

Learning Latent Transmission and Glare Maps for Lens Veiling Glare Removal

Authors:Xiaolong Qian, Qi Jiang, Lei Sun, Zongxi Yu, Kailun Yang, Peixuan Wu, Jiacheng Zhou, Yao Gao, Yaoguang Ma, Ming-Hsuan Yang, Kaiwei Wang

Beyond the commonly recognized optical aberrations, the imaging performance of compact optical systems-including single-lens and metalens designs-is often further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings, particularly in complex real-world environments. This compound degradation undermines traditional lens aberration correction yet remains underexplored. A major challenge is that conventional scattering models (e.g., for dehazing) fail to fit veiling glare due to its spatial-varying and depth-independent nature. Consequently, paired high-quality data are difficult to prepare via simulation, hindering application of data-driven veiling glare removal models. To this end, we propose VeilGen, a generative model that learns to simulate veiling glare by estimating its underlying optical transmission and glare maps in an unsupervised manner from target images, regularized by Stable Diffusion (SD)-based priors. VeilGen enables paired dataset generation with realistic compound degradation of optical aberrations and veiling glare, while also providing the estimated latent optical transmission and glare maps to guide the veiling glare removal process. We further introduce DeVeiler, a restoration network trained with a reversibility constraint, which utilizes the predicted latent maps to guide an inverse process of the learned scattering model. Extensive experiments on challenging compact optical systems demonstrate that our approach delivers superior restoration quality and physical fidelity compared with existing methods. These suggest that VeilGen reliably synthesizes realistic veiling glare, and its learned latent maps effectively guide the restoration process in DeVeiler. All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler.
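
As a rough illustration of the kind of scattering model the paper learns to invert, the sketch below simulates and inverts a spatially varying degradation of the form I = T ⊙ J + G with a transmission map T and glare map G; the exact blending form and the synthetic maps are illustrative assumptions, not the authors' learned model.

```python
import numpy as np

def apply_veiling_glare(clean, transmission, glare):
    """Assumed forward model: degraded = T * clean + G, applied per pixel."""
    return transmission * clean + glare

def invert_veiling_glare(degraded, transmission, glare, eps=1e-6):
    """Inverse of the assumed scattering model, driven by the two guide maps."""
    return (degraded - glare) / np.clip(transmission, eps, None)

# Toy example with smooth, spatially varying (depth-independent) maps.
h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
transmission = 0.6 + 0.3 * (xx / w)                       # varies across the field of view
glare = 0.2 * np.exp(-((xx - w / 2) ** 2 + (yy - h / 2) ** 2) / (2 * 15 ** 2))

clean = np.random.rand(h, w)
degraded = apply_veiling_glare(clean, transmission, glare)
restored = invert_veiling_glare(degraded, transmission, glare)
print(np.abs(restored - clean).max())                      # ~0 when the maps are known exactly
```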


Paper & Project Links

PDF All code and datasets will be publicly released at https://github.com/XiaolongQian/DeVeiler

Summary

Beyond common optical aberrations, the imaging performance of compact optical systems is further degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings. To address this, the paper proposes the VeilGen generative model and the DeVeiler restoration network, which simulate and then remove veiling glare to improve imaging quality.

Key Takeaways

  1. The imaging performance of compact optical systems (such as single-lens and metalens designs) is degraded by veiling glare caused by stray-light scattering from non-ideal optical surfaces and coatings.
  2. Conventional scattering models (e.g., those used for dehazing) cannot fit veiling glare because it is spatially varying and depth-independent.
  3. The VeilGen generative model simulates veiling glare by estimating latent optical transmission and glare maps in an unsupervised manner, without requiring paired high-quality data.
  4. Regularized by Stable Diffusion (SD)-based priors, VeilGen generates paired datasets with realistic compound degradation of optical aberrations and veiling glare.
  5. The DeVeiler restoration network is trained with a reversibility constraint and uses the predicted latent maps to guide an inverse process of the learned scattering model, improving restoration quality and physical fidelity.
  6. Experiments show that the VeilGen and DeVeiler combination outperforms existing methods on restoration for compact optical systems.

Cool Papers

Click here to view paper screenshots

Diversity Has Always Been There in Your Visual Autoregressive Models

Authors:Tong Wang, Guanyu Yang, Nian Liu, Kai Wang, Yaxing Wang, Abdelrahman M Shaker, Salman Khan, Fahad Shahbaz Khan, Senmao Li

Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm, offering notable advantages in both inference efficiency and image quality compared to traditional multi-step autoregressive (AR) and diffusion models. However, despite their efficiency, VAR models often suffer from the diversity collapse i.e., a reduction in output variability, analogous to that observed in few-step distilled diffusion models. In this paper, we introduce DiverseVAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training. Our analysis reveals the pivotal component of the feature map as a key factor governing diversity formation at early scales. By suppressing the pivotal component in the model input and amplifying it in the model output, DiverseVAR effectively unlocks the inherent generative potential of VAR models while preserving high-fidelity synthesis. Empirical results demonstrate that our approach substantially enhances generative diversity with only neglectable performance influences. Our code will be publicly released at https://github.com/wangtong627/DiverseVAR.
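
To make the core operation concrete, here is a hedged sketch of splitting off a "pivotal" rank-1 component of a feature map via SVD and rescaling it; the use of the top singular component and the scaling factors are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def split_pivotal_component(feat):
    """Split a (tokens x channels) feature map into its top rank-1 component
    (treated here as the pivotal component) and the residual, via SVD."""
    u, s, vt = np.linalg.svd(feat, full_matrices=False)
    pivotal = s[0] * np.outer(u[:, 0], vt[0])
    return pivotal, feat - pivotal

def rescale_pivotal(feat, scale):
    """Suppress (scale < 1) or amplify (scale > 1) the pivotal component."""
    pivotal, residual = split_pivotal_component(feat)
    return scale * pivotal + residual

feat = np.random.randn(256, 32)              # toy early-scale feature map
feat_in = rescale_pivotal(feat, scale=0.5)   # suppressed before feeding the model
feat_out = rescale_pivotal(feat, scale=1.5)  # amplified on the model output
print(feat_in.shape, feat_out.shape)
```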


Paper & Project Links

PDF

Summary

This paper introduces DiverseVAR, a method that restores the generative diversity of Visual Autoregressive (VAR) models while preserving efficient inference and high-fidelity synthesis. By suppressing the pivotal component of the feature map in the model input and amplifying it in the model output, DiverseVAR unlocks the inherent generative potential of VAR models. Empirical results show that the approach substantially enhances generative diversity with only a negligible impact on performance.

Key Takeaways

  1. VAR models offer notable advantages in inference efficiency and image quality thanks to their next-scale prediction paradigm.
  2. Despite their efficiency, VAR models often suffer from diversity collapse, i.e., reduced output variability.
  3. DiverseVAR restores the generative diversity of VAR models without any additional training.
  4. A pivotal component of the feature map is a key factor governing diversity formation at early scales.
  5. DiverseVAR suppresses this pivotal component in the model input and amplifies it in the model output to unlock the models' inherent generative potential.
  6. Empirical results show that DiverseVAR substantially enhances generative diversity while keeping the performance impact small.

Cool Papers

Click here to view paper screenshots

ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion

Authors:Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng

Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
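
The fallback for rare retrieval failures is spherical linear interpolation between neighboring references. A minimal slerp implementation is sketched below; the 512-dimensional embedding vectors are hypothetical stand-ins, not the paper's actual tensors.

```python
import numpy as np

def slerp(v0, v1, t, eps=1e-7):
    """Spherical linear interpolation between two vectors, for t in [0, 1]."""
    v0n = v0 / (np.linalg.norm(v0) + eps)
    v1n = v1 / (np.linalg.norm(v1) + eps)
    dot = np.clip(np.dot(v0n, v1n), -1.0, 1.0)
    theta = np.arccos(dot)
    if theta < eps:                       # nearly parallel: fall back to linear interpolation
        return (1 - t) * v0 + t * v1
    return (np.sin((1 - t) * theta) * v0 + np.sin(t * theta) * v1) / np.sin(theta)

prev_ref = np.random.randn(512)   # hypothetical embedding of the preceding reference slice
next_ref = np.random.randn(512)   # hypothetical embedding of the following reference slice
mid_ref = slerp(prev_ref, next_ref, t=0.5)
print(mid_ref.shape)
```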


Paper & Project Links

PDF 16 pages, 12 figures, 7 tables; Accepted by WACV 2026

Summary

This paper proposes ReBrain, a retrieval-augmented diffusion framework for reconstructing brain MRI from sparse CT slices. A Brownian Bridge Diffusion Model (BBDM) synthesizes MRI slices along the 2D dimension, while structurally and pathologically similar CT slices retrieved from a prior database serve as references that are incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. Experiments on SynthRAD2023 and BraTS show that ReBrain achieves state-of-the-art cross-modal reconstruction under sparse conditions.

Key Takeaways

  1. ReBrain is a retrieval-augmented diffusion framework for reconstructing MRI from a limited number of CT slices.
  2. A Brownian Bridge Diffusion Model (BBDM) synthesizes MRI slices along the 2D dimension.
  3. Structurally and pathologically similar CT slices are retrieved from a prior database and used as references.
  4. A ControlNet branch guides the generated MRI slices to ensure structural continuity.
  5. Spherical linear interpolation provides supplementary guidance for rare retrieval failures.
  6. Experiments on SynthRAD2023 and BraTS demonstrate state-of-the-art performance.

Cool Papers

Click here to view paper screenshots

Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation

Authors:Aniketh Iyengar, Jiaqi Han, Boris Ruf, Vincent Grari, Marcin Detyniecki, Stefano Ermon

The rapidly growing computational demands of diffusion models for image generation have raised significant concerns about energy consumption and environmental impact. While existing approaches to energy optimization focus on architectural improvements or hardware acceleration, there is a lack of principled methods to predict energy consumption across different model configurations and hardware setups. We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. We conduct comprehensive experiments across four state-of-the-art diffusion models (Stable Diffusion 2, Stable Diffusion 3.5, Flux, and Qwen) on three GPU architectures (NVIDIA A100, A4000, A6000), spanning various inference configurations including resolution (256x256 to 1024x1024), precision (fp16/fp32), step counts (10-50), and classifier-free guidance settings. Our energy scaling law achieves high predictive accuracy within individual architectures (R-squared > 0.9) and exhibits strong cross-architecture generalization, maintaining high rank correlations across models and enabling reliable energy estimation for unseen model-hardware combinations. These results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
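
As a minimal sketch of a Kaplan-style energy scaling law, the snippet below fits a power law E ≈ a · FLOPs^b by least squares in log-log space; the measurements are synthetic placeholders, not numbers from the paper.

```python
import numpy as np

def fit_power_law(flops, energy_j):
    """Fit energy = a * flops**b via linear regression in log-log space."""
    b, log_a = np.polyfit(np.log(flops), np.log(energy_j), deg=1)
    return np.exp(log_a), b

def predict_energy(flops, a, b):
    return a * flops ** b

# Synthetic measurements: energy roughly proportional to compute, with noise.
flops = np.array([1e12, 5e12, 1e13, 5e13, 1e14])
energy_j = 2e-10 * flops ** 0.98 * np.random.uniform(0.95, 1.05, flops.shape)

a, b = fit_power_law(flops, energy_j)
print(f"fitted exponent b = {b:.3f}")
print(f"predicted energy at 2e14 FLOPs: {predict_energy(2e14, a, b):.1f} J")
```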


Paper & Project Links

PDF Accepted at EurIPS 2025 workshop “Rethinking AI: Efficiency, Frugality, and Sustainability”

Summary

The rapidly growing computational demands of diffusion models for image generation raise concerns about energy consumption and environmental impact, yet existing optimization work focuses on architectural improvements or hardware acceleration and lacks a principled way to predict energy consumption across model configurations and hardware setups. This study adapts Kaplan scaling laws to predict the GPU energy consumption of diffusion models from their computational complexity (FLOPs). Experiments show that the proposed energy scaling law is highly predictive within individual architectures and generalizes well across architectures, providing a foundation for sustainable AI deployment planning and carbon footprint estimation.

Key Takeaways

  1. The rapidly growing computational demands of diffusion models for image generation have raised concerns about energy consumption and environmental impact.
  2. There is currently no principled method for predicting energy consumption across different model configurations and hardware setups.
  3. The study adapts Kaplan scaling laws to predict the GPU energy consumption of diffusion models from computational complexity (FLOPs).
  4. Diffusion model inference is decomposed into text encoding, iterative denoising, and decoding, with the hypothesis that denoising dominates energy consumption because it is executed repeatedly across inference steps.
  5. Comprehensive experiments cover four state-of-the-art diffusion models on three GPU architectures, spanning configurations such as resolution, precision, step counts (10-50), and classifier-free guidance settings.
  6. The energy scaling law is highly predictive within a single architecture (R-squared > 0.9) and generalizes across architectures.

Cool Papers

Click here to view paper screenshots

The Finer the Better: Towards Granular-aware Open-set Domain Generalization

Authors:Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen

Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into the dilemma between structural risk of known-classes and open-space risk from unknown-classes, and easily suffers from over-confidence, especially when distinguishing ``hard unknowns” that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module to decompose images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives, that is, repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.
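
The duplex contrastive idea, repulsion from known classes plus cohesion toward their semantic neighborhood, can be illustrated with a toy loss over normalized embeddings; the margins, weights, and exact form below are assumptions for illustration, not the paper's objective.

```python
import numpy as np

def duplex_contrastive_loss(unknown, known, repel_margin=0.3, cohere_margin=0.7,
                            w_repel=1.0, w_cohere=1.0):
    """unknown: (d,) embedding of an unknown-class prompt; known: (k, d) known-class prompts.
    Repulsion penalizes cosine similarity to any known class above repel_margin;
    cohesion penalizes similarity to the nearest known class below cohere_margin."""
    unknown = unknown / np.linalg.norm(unknown)
    known = known / np.linalg.norm(known, axis=1, keepdims=True)
    sims = known @ unknown                                   # cosine similarities, shape (k,)
    repel = np.maximum(sims - repel_margin, 0).mean()        # push away from all known classes
    cohere = np.maximum(cohere_margin - sims.max(), 0)       # stay near the closest known class
    return w_repel * repel + w_cohere * cohere

unknown = np.random.randn(64)
known = np.random.randn(10, 64)
print(duplex_contrastive_loss(unknown, known))
```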


Paper & Project Links

PDF 9 pages, 3 figures; AAAI 2026

Summary

This paper proposes Semantic-enhanced CLIP (SeeCLIP), a framework for Open-Set Domain Generalization (OSDG), where deployed models face both domain shifts and unknown object categories. SeeCLIP explicitly addresses the dilemma between the structural risk of known classes and the open-space risk of unknown classes through fine-grained semantic enhancement. A semantic-aware prompt enhancement module decomposes images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. Duplex contrastive learning positions unknown prompts with two complementary objectives: repulsion to maintain separability from known classes and cohesion to preserve semantic proximity. A semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing the extracted semantic tokens, generating challenging samples that are visually similar to known classes yet differ in key local details, forcing the model to learn finer decision boundaries. Experiments on five benchmarks show consistent improvements of 3% accuracy and 5% H-score over state-of-the-art methods.

Key Takeaways

  1. Open-Set Domain Generalization (OSDG) concerns deployed models that encounter both domain shifts and unknown object categories.
  2. The Semantic-enhanced CLIP (SeeCLIP) framework targets OSDG, especially the dilemma between the structural risk of known classes and the open-space risk of unknown classes.
  3. SeeCLIP decomposes images into discriminative semantic tokens via a semantic-aware prompt enhancement module, enabling nuanced vision-language alignment.
  4. Duplex contrastive learning positions unknown prompts with complementary repulsion and cohesion objectives.
  5. A semantic-guided diffusion module synthesizes pseudo-unknowns that force the model to learn finer decision boundaries.
  6. SeeCLIP outperforms existing methods on five benchmarks, improving both accuracy and H-score.

Cool Papers

Click here to view paper screenshots

Align & Invert: Solving Inverse Problems with Diffusion and Flow-based Models via Representational Alignment

Authors:Loukas Sfountouris, Giannis Daras, Paris Giampouras

Enforcing alignment between the internal representations of diffusion or flow-based generative models and those of pretrained self-supervised encoders has recently been shown to provide a powerful inductive bias, improving both convergence and sample quality. In this work, we extend this idea to inverse problems, where pretrained generative models are employed as priors. We propose applying representation alignment (REPA) between diffusion or flow-based models and a pretrained self-supervised visual encoder, such as DINOv2, to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, we show that aligning model representations with approximate target features can substantially enhance reconstruction fidelity and perceptual realism. We provide theoretical results showing (a) the relation between the REPA regularization and a divergence measure in the DINOv2 embedding space, and (b) how REPA updates steer the model’s internal representations toward those of the clean image. These results offer insights into the role of REPA in improving perceptual fidelity. Finally, we demonstrate the generality of our approach by integrating it into multiple state-of-the-art inverse problem solvers. Extensive experiments on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring confirm that our method consistently improves reconstruction quality across tasks, while also providing substantial efficiency gains by reducing the number of required discretization steps without compromising the performance of the underlying solver.
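
A minimal sketch of a representation-alignment regularizer is shown below: a negative cosine similarity between projected model features and encoder features, whose gradient can steer reconstruction at inference time. The placeholder tensors, dimensions, and the linear projection head are assumptions; they stand in for the solver's intermediate features and DINOv2 patch features.

```python
import torch
import torch.nn.functional as F

def repa_loss(model_feats, encoder_feats, proj):
    """Negative cosine similarity between projected model features and
    (approximate) encoder features; lower means better aligned."""
    z = F.normalize(proj(model_feats), dim=-1)    # (B, N, D) -> (B, N, D_enc), normalized
    y = F.normalize(encoder_feats, dim=-1)
    return -(z * y).sum(dim=-1).mean()

B, N, D, D_enc = 2, 256, 1024, 768
proj = torch.nn.Linear(D, D_enc)                  # assumed lightweight projection head
model_feats = torch.randn(B, N, D, requires_grad=True)
encoder_feats = torch.randn(B, N, D_enc)          # stand-in for DINOv2 features of the target

loss = repa_loss(model_feats, encoder_feats, proj)
loss.backward()                                   # gradients w.r.t. features can guide sampling
print(float(loss))
```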


Paper & Project Links

PDF

Summary

This paper extends representation alignment between diffusion or flow-based generative models and pretrained self-supervised encoders to inverse problems, where pretrained generative models serve as priors. Representation alignment (REPA) is applied between the generative model and a pretrained self-supervised visual encoder such as DINOv2 to guide the reconstruction process at inference time. Although ground-truth signals are unavailable in inverse problems, aligning model representations with approximate target features substantially improves reconstruction fidelity and perceptual realism. The paper also provides theoretical results relating the REPA regularizer to a divergence measure in the DINOv2 embedding space and showing how REPA updates steer the model's internal representations toward those of the clean image. Finally, the method is integrated into several state-of-the-art inverse problem solvers and validated on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring, improving reconstruction quality while also reducing the number of required discretization steps.

Key Takeaways

  1. Aligning the internal representations of diffusion or flow-based models with pretrained self-supervised encoders is an effective strategy for improving performance on inverse problems.
  2. Representation alignment (REPA) is applied between the generative model and a pretrained self-supervised visual encoder (e.g., DINOv2) to guide reconstruction at inference time.
  3. Even without ground-truth signals, aligning model representations with approximate target features improves reconstruction fidelity and perceptual realism.
  4. Theoretical results relate the REPA regularizer to a divergence measure in the DINOv2 embedding space.
  5. REPA updates steer the model's internal representations toward those of the clean image.
  6. The method is integrated into multiple inverse problem solvers and validated on super-resolution, box inpainting, Gaussian deblurring, and motion deblurring.

Cool Papers

Click here to view paper screenshots

SCALEX: Scalable Concept and Latent Exploration for Diffusion Models

Authors:E. Zhixuan Zeng, Yuhao Chen, Alexander Wong

Image generation models frequently encode social biases, including stereotypes tied to gender, race, and profession. Existing methods for analyzing these biases in diffusion models either focus narrowly on predefined categories or depend on manual interpretation of latent directions. These constraints limit scalability and hinder the discovery of subtle or unanticipated patterns. We introduce SCALEX, a framework for scalable and automated exploration of diffusion model latent spaces. SCALEX extracts semantically meaningful directions from H-space using only natural language prompts, enabling zero-shot interpretation without retraining or labelling. This allows systematic comparison across arbitrary concepts and large-scale discovery of internal model associations. We show that SCALEX detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision. By linking prompts to latent directions directly, SCALEX makes bias analysis in diffusion models more scalable, interpretable, and extensible than prior approaches.
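
One simple way to picture prompt-derived latent directions is the difference of mean activations collected under two contrasting prompts, as sketched below; the prompt pair, the activation-capture mechanism, and the dimensions are illustrative assumptions rather than SCALEX's actual procedure.

```python
import numpy as np

def semantic_direction(acts_prompt_a, acts_prompt_b):
    """Estimate a latent direction as the normalized difference of mean
    H-space activations gathered under two contrasting prompts."""
    direction = acts_prompt_a.mean(axis=0) - acts_prompt_b.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project_score(activation, direction):
    """How strongly a single activation expresses the concept."""
    return float(activation @ direction)

# Stand-ins for H-space activations collected over many sampled images.
acts_female_doctor = np.random.randn(100, 1280)   # e.g. prompt "a photo of a female doctor"
acts_male_doctor = np.random.randn(100, 1280)     # e.g. prompt "a photo of a male doctor"

gender_dir = semantic_direction(acts_female_doctor, acts_male_doctor)
print(project_score(np.random.randn(1280), gender_dir))
```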


Paper & Project Links

PDF Accepted to WACV2025

Summary

Image generation models often encode social biases such as stereotypes tied to gender, race, and profession. Existing analyses are restricted to predefined categories or rely on manual interpretation of latent directions, which limits scalability and the discovery of subtle or unanticipated patterns. The paper introduces SCALEX, a framework for scalable, automated exploration of diffusion model latent spaces. SCALEX extracts semantically meaningful directions from H-space using only natural-language prompts, enabling zero-shot interpretation without retraining or labelling, and thereby allows systematic comparison across arbitrary concepts and large-scale discovery of internal model associations. SCALEX detects gender bias in profession prompts, ranks semantic alignment across identity descriptors, and reveals clustered conceptual structure without supervision. By linking prompts directly to latent directions, it makes bias analysis in diffusion models more scalable, interpretable, and extensible than prior approaches.

Key Takeaways

  1. Diffusion models encode social biases, including stereotypes tied to gender, race, and profession.
  2. Existing methods for analyzing such bias are limited, relying on predefined categories or manual interpretation.
  3. The SCALEX framework automatically explores diffusion model latent spaces, enabling zero-shot interpretation without retraining or labelling.
  4. SCALEX detects gender bias in profession prompts.
  5. SCALEX ranks semantic alignment across identity descriptors.
  6. SCALEX reveals clustered conceptual structure without supervision.

Cool Papers

Click here to view paper screenshots

Model Inversion Attack Against Deep Hashing

Authors:Dongdong Zhao, Qiben Xu, Ranxin Fang, Baogang Song

Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.
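
The candidate-selection idea, scoring samples by a mix of hash proximity and classification consistency, can be sketched with a toy metric over binary codes and surrogate-classifier probabilities; the equal weighting and the specific score form are assumptions, not the paper's exact metric.

```python
import numpy as np

def hamming_distance(code_a, code_b):
    return int(np.sum(code_a != code_b))

def attack_score(candidate_code, anchor_code, class_probs, target_class, alpha=0.5):
    """Higher is better: close in Hamming space to the anchor hash center and
    consistently classified as the target class by the surrogate models."""
    n_bits = len(anchor_code)
    hash_proximity = 1.0 - hamming_distance(candidate_code, anchor_code) / n_bits
    consistency = float(np.mean(class_probs[:, target_class]))   # average over surrogates
    return alpha * hash_proximity + (1 - alpha) * consistency

anchor = np.random.randint(0, 2, 64)               # semantic hash center (surrogate anchor)
candidate = np.random.randint(0, 2, 64)            # hash code of a candidate reconstruction
probs = np.random.dirichlet(np.ones(10), size=3)   # 3 surrogate classifiers, 10 classes
print(attack_score(candidate, anchor, probs, target_class=4))
```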


Paper & Project Links

PDF

Summary

Deep hashing improves retrieval efficiency through compact binary codes, but it also introduces severe privacy risks: reconstructing original training data from hash codes could enable biometric forgery and privacy breaches. Model inversion attacks against deep hashing have remained unexplored because genuine training hash codes are inaccessible and the Hamming space is highly discrete. This work proposes DHMI, the first diffusion-based model inversion framework for deep hashing. DHMI clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors and introduces a surrogate-guided denoising optimization that uses a novel attack metric to dynamically select candidate samples. Experiments show that DHMI reconstructs high-quality images even in the black-box setting where no training hash codes are available.

Key Takeaways

  1. Deep hashing carries privacy risks such as biometric forgery and privacy breaches.
  2. Security research on deep hashing has a gap, mainly because genuine training hash codes are inaccessible and the Hamming space is highly discrete.
  3. DHMI is the first diffusion-based model inversion framework for deep hashing, designed to address these challenges.
  4. DHMI clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors.
  5. DHMI introduces an attack metric that fuses classification consistency and hash proximity to dynamically select candidate samples.
  6. Experiments show that DHMI successfully reconstructs high-quality images even in the black-box setting without access to training hash codes.

Cool Papers

Click here to view paper screenshots

Parameter-Efficient MoE LoRA for Few-Shot Multi-Style Editing

Authors:Cong Cao, Yujie Xu, Xiaodong Xu

In recent years, image editing has garnered growing attention. However, general image editing models often fail to produce satisfactory results when confronted with new styles. The challenge lies in how to effectively fine-tune general image editing models to new styles using only a limited amount of paired data. To address this issue, this paper proposes a novel few-shot style editing framework. For this task, we construct a benchmark dataset that encompasses five distinct styles. Correspondingly, we propose a parameter-efficient multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The style-specific routing ensures that different styles do not interfere with one another, while the style-shared routing adaptively allocates shared MoE LoRAs to learn common patterns. Our MoE LoRA can automatically determine the optimal ranks for each layer through a novel metric-guided approach that estimates the importance score of each single-rank component. Additionally, we explore the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model and integrate adversarial learning and flow matching to guide the diffusion training process. Experimental results demonstrate that our proposed method outperforms existing state-of-the-art approaches with significantly fewer LoRA parameters.
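
To make the routing idea concrete, here is a hedged PyTorch sketch of a linear layer augmented with one LoRA expert per style plus shared experts mixed by a learned router; the dimensions, rank, number of shared experts, and softmax gating are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Frozen base linear layer + per-style LoRA experts + shared LoRA experts."""
    def __init__(self, d_in, d_out, n_styles, n_shared=2, rank=4):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        for p in self.base.parameters():          # base model stays frozen
            p.requires_grad_(False)

        def lora():
            return nn.ModuleDict({"down": nn.Linear(d_in, rank, bias=False),
                                  "up": nn.Linear(rank, d_out, bias=False)})

        self.style_experts = nn.ModuleList([lora() for _ in range(n_styles)])
        self.shared_experts = nn.ModuleList([lora() for _ in range(n_shared)])
        self.router = nn.Linear(d_in, n_shared)   # style-shared routing weights

    def forward(self, x, style_id):
        out = self.base(x)
        expert = self.style_experts[style_id]     # style-specific path: styles do not interfere
        out = out + expert["up"](expert["down"](x))
        gates = torch.softmax(self.router(x), dim=-1)       # (..., n_shared)
        for i, shared in enumerate(self.shared_experts):    # shared path: common patterns
            out = out + gates[..., i:i + 1] * shared["up"](shared["down"](x))
        return out

layer = MoELoRALinear(d_in=64, d_out=64, n_styles=5)
y = layer(torch.randn(2, 16, 64), style_id=3)
print(y.shape)
```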


Paper & Project Links

PDF

Summary

This paper proposes a parameter-efficient few-shot style editing framework, built around a benchmark dataset covering five distinct styles, to address the poor performance of general image editing models on new styles. It introduces a multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms for jointly fine-tuning multiple styles. The MoE LoRA automatically determines the optimal rank for each layer, the optimal location to insert LoRA within the Diffusion in Transformer (DiT) model is explored, and adversarial learning and flow matching are combined to guide the diffusion training process. Experiments show that the method outperforms existing approaches with significantly fewer LoRA parameters.

Key Takeaways

The main takeaways are as follows:

  • General image editing models often fail to produce satisfactory results when confronted with new styles.
  • A parameter-efficient few-shot style editing framework is proposed to address this problem.
  • A benchmark dataset covering five distinct styles is constructed for training and evaluation.
  • A multi-style Mixture-of-Experts Low-Rank Adaptation (MoE LoRA) with style-specific and style-shared routing mechanisms jointly fine-tunes multiple styles.
  • MoE LoRA automatically determines the optimal rank for each layer via a metric-guided approach that estimates the importance score of each single-rank component.
  • The optimal location to insert LoRA within the Diffusion in Transformer (DiT) model is explored.
  • Adversarial learning and flow matching are combined to guide the diffusion training process.

Cool Papers

Click here to view paper screenshots

CharCom: Composable Identity Control for Multi-Character Story Illustration

Authors:Zhongsheng Wang, Ming Lin, Zhedong Lin, Yaser Shakib, Qian Liu, Jiamou Liu

Ensuring character identity consistency across varying prompts remains a fundamental limitation in diffusion-based text-to-image generation. We propose CharCom, a modular and parameter-efficient framework that achieves character-consistent story illustration through composable LoRA adapters, enabling efficient per-character customization without retraining the base model. Built on a frozen diffusion backbone, CharCom dynamically composes adapters at inference using prompt-aware control. Experiments on multi-scene narratives demonstrate that CharCom significantly enhances character fidelity, semantic alignment, and temporal coherence. It remains robust in crowded scenes and enables scalable multi-character generation with minimal overhead, making it well-suited for real-world applications such as story illustration and animation.


Paper & Project Links

PDF Accepted by ACM MMAsia 2025

Summary

Maintaining character identity consistency across varying prompts is a fundamental limitation of diffusion-based text-to-image generation. This paper proposes CharCom, a modular and parameter-efficient framework that achieves character-consistent story illustration through composable LoRA adapters, enabling efficient per-character customization without retraining the base model. Experiments on multi-scene narratives show that CharCom significantly improves character fidelity, semantic alignment, and temporal coherence, remains robust in crowded scenes, and supports scalable multi-character generation with minimal overhead, making it well suited for real-world applications such as story illustration and animation.

Key Takeaways

  1. Diffusion-based text-to-image generation struggles to keep character identities consistent across prompts.
  2. CharCom is a modular, parameter-efficient framework designed to address this challenge.
  3. CharCom achieves character-consistent story illustration through composable LoRA adapters.
  4. CharCom enables per-character customization without retraining the base model.
  5. CharCom improves character fidelity, semantic alignment, and temporal coherence in multi-scene narratives.
  6. CharCom remains robust in crowded scenes and supports scalable multi-character generation.

Cool Papers

Click here to view paper screenshots

Lung-DDPM+: Efficient Thoracic CT Image Synthesis using Diffusion Probabilistic Model

Authors:Yifan Jiang, Ahmad Shariftabrizi, Venkata SK. Manem

Generative artificial intelligence (AI) has been playing an important role in various domains. Leveraging its high capability to generate high-fidelity and diverse synthetic data, generative AI is widely applied in diagnostic tasks, such as lung cancer diagnosis using computed tomography (CT). However, existing generative models for lung cancer diagnosis suffer from low efficiency and anatomical imprecision, which limit their clinical applicability. To address these drawbacks, we propose Lung-DDPM+, an improved version of our previous model, Lung-DDPM. This novel approach is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, enabling the method to focus on lesion areas while achieving a better trade-off between sampling efficiency and quality. Evaluation results on the public LIDC-IDRI dataset suggest that the proposed method achieves 8$\times$ fewer FLOPs (floating point operations per second), 6.8$\times$ lower GPU memory consumption, and 14$\times$ faster sampling compared to Lung-DDPM. Moreover, it maintains comparable sample quality to both Lung-DDPM and other state-of-the-art (SOTA) generative models in two downstream segmentation tasks. We also conducted a Visual Turing Test by an experienced radiologist, showing the advanced quality and fidelity of synthetic samples generated by the proposed method. These experimental results demonstrate that Lung-DDPM+ can effectively generate high-quality thoracic CT images with lung nodules, highlighting its potential for broader applications, such as general tumor synthesis and lesion generation in medical imaging. The code and pretrained models are available at https://github.com/Manem-Lab/Lung-DDPM-PLUS.


Paper & Project Links

PDF Accepted by Computers in Biology and Medicine (CIBM)

Summary

This paper presents Lung-DDPM+, an improved generative model for lung cancer diagnosis. It is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver, improving the trade-off between sampling efficiency and quality. On the public LIDC-IDRI dataset, Lung-DDPM+ requires 8x fewer FLOPs, consumes 6.8x less GPU memory, and samples 14x faster than its predecessor, while its sample quality remains comparable to Lung-DDPM and other state-of-the-art generative models in two downstream segmentation tasks and is rated highly by an experienced radiologist in a Visual Turing Test. Overall, Lung-DDPM+ offers a new route to generating high-quality thoracic CT images with lung nodules and has broad application prospects.

Key Takeaways

  1. Lung-DDPM+ is an improved generative model for lung cancer diagnosis.
  2. The model is a denoising diffusion probabilistic model (DDPM) guided by nodule semantic layouts and accelerated by a pulmonary DPM-solver.
  3. Compared with its predecessor, Lung-DDPM+ clearly improves both efficiency and quality.
  4. Experiments on a public dataset show higher sampling efficiency and lower GPU memory consumption.
  5. The generated samples remain competitive in quality in two downstream segmentation tasks.
  6. An experienced radiologist rated the quality and fidelity of samples generated by Lung-DDPM+ highly.
  7. Lung-DDPM+ has broad application prospects, such as general tumor synthesis and lesion generation in medical imaging.

Cool Papers

Click here to view paper screenshots

Model-Agnostic Gender Bias Control for Text-to-Image Generation via Sparse Autoencoder

Authors:Chao Wu, Zhenyi Wang, Kangxian Xie, Naresh Kumar Devulapally, Vishnu Suresh Lokhande, Mingchen Gao

Text-to-image (T2I) diffusion models often exhibit gender bias, particularly by generating stereotypical associations between professions and gendered subjects. This paper presents SAE Debias, a lightweight and model-agnostic framework for mitigating such bias in T2I generation. Unlike prior approaches that rely on CLIP-based filtering or prompt engineering, which often require model-specific adjustments and offer limited control, SAE Debias operates directly within the feature space without retraining or architectural modifications. By leveraging a k-sparse autoencoder pre-trained on a gender bias dataset, the method identifies gender-relevant directions within the sparse latent space, capturing professional stereotypes. Specifically, a biased direction per profession is constructed from sparse latents and suppressed during inference to steer generations toward more gender-balanced outputs. Trained only once, the sparse autoencoder provides a reusable debiasing direction, offering effective control and interpretable insight into biased subspaces. Extensive evaluations across multiple T2I models, including Stable Diffusion 1.4, 1.5, 2.1, and SDXL, demonstrate that SAE Debias substantially reduces gender bias while preserving generation quality. To the best of our knowledge, this is the first work to apply sparse autoencoders for identifying and intervening in gender bias within T2I models. These findings contribute toward building socially responsible generative AI, providing an interpretable and model-agnostic tool to support fairness in text-to-image generation.
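
The suppression step itself, removing a feature's projection onto a unit-norm biased direction during inference, can be sketched in a few lines; the direction and feature tensors below are placeholders, not ones learned by the paper's sparse autoencoder.

```python
import numpy as np

def suppress_direction(features, direction, strength=1.0):
    """Remove (strength=1) or damp (0 < strength < 1) the component of each
    feature vector along a unit-norm biased direction."""
    direction = direction / np.linalg.norm(direction)
    coeffs = features @ direction                    # (N,) projection coefficients
    return features - strength * np.outer(coeffs, direction)

feats = np.random.randn(16, 512)                     # stand-in for intermediate features
bias_dir = np.random.randn(512)                      # placeholder per-profession bias direction
debiased = suppress_direction(feats, bias_dir)
print(np.allclose(debiased @ (bias_dir / np.linalg.norm(bias_dir)), 0))   # True
```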


Paper & Project Links

PDF

Summary

This paper proposes SAE Debias, a lightweight, model-agnostic framework for mitigating gender bias in text-to-image (T2I) generation. Using a k-sparse autoencoder pre-trained on a gender bias dataset, the framework operates directly in the feature space, without retraining or architectural modifications, to identify gender-relevant bias directions and suppress them during inference, steering generations toward more gender-balanced outputs. Evaluations across multiple T2I models show that it substantially reduces gender bias while preserving generation quality.

Key Takeaways

  1. T2I diffusion models exhibit gender bias, generating stereotypical associations between professions and gendered subjects.
  2. SAE Debias is a lightweight, model-agnostic framework for mitigating this bias.
  3. SAE Debias uses a k-sparse autoencoder pre-trained on a gender bias dataset and operates in the feature space without retraining or architectural modifications.
  4. The method identifies gender-relevant bias directions and suppresses them during inference to steer generations toward more gender-balanced outputs.
  5. The sparse autoencoder is trained only once and provides a reusable debiasing direction, offering interpretable control over biased subspaces.
  6. Evaluations on multiple T2I models show that SAE Debias substantially reduces gender bias while preserving generation quality.

Cool Papers

Click here to view paper screenshots

HDCompression: Hybrid-Diffusion Image Compression for Ultra-Low Bitrates

Authors:Lei Lu, Yize Li, Yanzhi Wang, Wei Wang, Wei Jiang

Image compression under ultra-low bitrates remains challenging for both conventional learned image compression (LIC) and generative vector-quantized (VQ) modeling. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling gives poor fidelity due to the mismatch between learned generative priors and specific inputs. In this work, we propose Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that utilizes both generative VQ-modeling and diffusion models, as well as conventional LIC, to achieve both high fidelity and high perceptual quality. Different from previous hybrid methods that directly use pre-trained LIC models to generate low-quality fidelity-preserving information from heavily quantized latent, we use diffusion models to extract high-quality complementary fidelity information from the ground-truth input, which can enhance the system performance in several aspects: improving index map prediction, enhancing the fidelity-preserving output of the LIC stream, and refining conditioned image reconstruction with VQ-latent correction. In addition, our diffusion model is based on a dense representative vector (DRV), which is lightweight with very simple sampling schedulers. Extensive experiments demonstrate that our HDCompression outperforms the previous conventional LIC, generative VQ-modeling, and hybrid frameworks in both quantitative metrics and qualitative visualization, providing balanced robust compression performance at ultra-low bitrates.


Paper & Project Links

PDF Accepted by PRICAI 2025 (Oral Presentation)

Summary

This paper proposes Hybrid-Diffusion Image Compression (HDCompression), a dual-stream framework that combines generative VQ modeling, conventional learned image compression (LIC), and diffusion models to achieve both high fidelity and high perceptual quality. The diffusion model extracts high-quality complementary fidelity information from the ground-truth input, which improves index map prediction, enhances the fidelity-preserving output of the LIC stream, and refines conditioned image reconstruction with VQ-latent correction. The diffusion model is built on a lightweight dense representative vector (DRV) with a very simple sampling scheduler. Experiments show that at ultra-low bitrates HDCompression outperforms conventional LIC, generative VQ modeling, and prior hybrid frameworks in both quantitative metrics and qualitative visualization.

Key Takeaways

  1. HDCompression combines generative VQ modeling, conventional LIC, and diffusion models to tackle image compression at ultra-low bitrates.
  2. Conventional LIC suffers from severe artifacts due to heavy quantization, while generative VQ modeling loses fidelity because learned generative priors do not match specific inputs.
  3. HDCompression uses diffusion models to extract high-quality complementary information from the ground-truth input, improving index map prediction, enhancing the fidelity-preserving LIC output, and refining conditioned reconstruction with VQ-latent correction.
  4. The diffusion model is based on a lightweight dense representative vector (DRV) with a very simple sampling scheduler.
  5. Experiments show that at ultra-low bitrates HDCompression outperforms conventional LIC and generative VQ modeling in both quantitative metrics and visual quality.
  6. HDCompression delivers balanced and robust compression performance with both high fidelity and high perceptual quality.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!