
Diffusion Models


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never use them in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-11

A Dual-stage Prompt-driven Privacy-preserving Paradigm for Person Re-Identification

Authors:Ruolin Li, Min Liu, Yuan Bian, Zhaoyang Li, Yuzhen Li, Xueping Wang, Yaonan Wang

With growing concerns over data privacy, researchers have started using virtual data as an alternative to sensitive real-world images for training person re-identification (Re-ID) models. However, existing virtual datasets produced by game engines still face challenges such as complex construction and poor domain generalization, making them difficult to apply in real scenarios. To address these challenges, we propose a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). In the first stage, we generate rich prompts incorporating multi-dimensional attributes such as pedestrian appearance, illumination, and viewpoint that drive the diffusion model to synthesize diverse data end-to-end, building a large-scale virtual dataset named GenePerson with 130,519 images of 6,641 identities. In the second stage, we propose a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features. With the aid of contrastive learning, we employ two textual inversion networks to map images into pseudo-words representing style and content, respectively, thereby constructing style-disentangled content prompts to guide the model in learning domain-invariant content features at the image level. Experiments demonstrate that models trained on GenePerson with PDM achieve state-of-the-art generalization performance, surpassing those on popular real and virtual Re-ID datasets.
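
The first stage described above drives a text-to-image diffusion model with multi-attribute prompts. Below is a minimal sketch of that idea using an off-the-shelf Stable Diffusion checkpoint from the diffusers library; the attribute vocabularies, prompt template, and checkpoint are illustrative assumptions rather than the paper's actual GenePerson pipeline.

```python
# Minimal sketch: multi-attribute prompt construction driving a text-to-image
# diffusion model (attribute lists and the checkpoint are illustrative, not
# the GenePerson pipeline itself).
import random
import torch
from diffusers import StableDiffusionPipeline

APPEARANCE = ["a young woman in a red coat", "a man in a grey hoodie and jeans"]
ILLUMINATION = ["bright midday sunlight", "dim evening street lighting"]
VIEWPOINT = ["seen from a low surveillance-camera angle", "seen from behind"]

def build_prompt() -> str:
    """Combine pedestrian appearance, illumination, and viewpoint attributes."""
    return (f"{random.choice(APPEARANCE)}, full body, walking outdoors, "
            f"{random.choice(ILLUMINATION)}, {random.choice(VIEWPOINT)}")

# Any text-to-image checkpoint would do; this id is just a placeholder choice.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

for identity_id in range(3):                      # one prompt per synthetic identity
    prompt = build_prompt()
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"geneperson_id{identity_id}.png")
```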


Paper and Project Links

PDF 10 pages, 6 figures

Summary
With growing concern over data privacy, researchers have begun training person re-identification (Re-ID) models on virtual data instead of sensitive real images. To tackle the complex construction and poor domain generalization of existing game-engine virtual datasets, the authors propose a Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP). They first generate rich prompts covering multi-dimensional attributes such as pedestrian appearance, illumination, and viewpoint to drive a diffusion model to synthesize diverse data, building the large-scale virtual dataset GenePerson. They then propose a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features: aided by contrastive learning, two textual inversion networks map images into pseudo-words representing style and content, forming style-disentangled content prompts that guide the model to learn domain-invariant content features at the image level. Experiments show that models trained on GenePerson with PDM achieve state-of-the-art generalization performance, surpassing those trained on popular real and virtual Re-ID datasets.

Key Takeaways

  1. Concerned about data privacy, researchers have begun using virtual data to train Re-ID models.
  2. Existing virtual datasets suffer from complex construction and poor domain generalization.
  3. A Dual-stage Prompt-driven Privacy-preserving Paradigm (DPPP) is proposed to address these issues.
  4. The first stage generates rich prompts with multi-dimensional attributes to synthesize diverse data and build the GenePerson dataset.
  5. The second stage uses a Prompt-driven Disentanglement Mechanism (PDM) to learn domain-invariant generalization features.
  6. With contrastive learning and textual inversion networks, images are mapped into pseudo-words representing style and content.

Cool Papers

Click here to view paper screenshots

Pattern-Aware Diffusion Synthesis of fMRI/dMRI with Tissue and Microstructural Refinement

Authors:Xiongri Shen, Jiaqi Wang, Yi Zhong, Zhenxi Song, Leilei Zhao, Yichen Wei, Lingyan Liang, Shuqiang Wang, Baiying Lei, Demao Deng, Zhiguo Zhang

Magnetic resonance imaging (MRI), especially functional MRI (fMRI) and diffusion MRI (dMRI), is essential for studying neurodegenerative diseases. However, missing modalities pose a major barrier to their clinical use. Although GAN- and diffusion model-based approaches have shown some promise in modality completion, they remain limited in fMRI-dMRI synthesis due to (1) significant BOLD vs. diffusion-weighted signal differences between fMRI and dMRI in the time/gradient axis, and (2) inadequate integration of disease-related neuroanatomical patterns during generation. To address these challenges, we propose PDS, introducing two key innovations: (1) a pattern-aware dual-modal 3D diffusion framework for cross-modality learning, and (2) a tissue refinement network integrated with an efficient microstructure refinement to maintain structural fidelity and fine details. Evaluated on OASIS-3, ADNI, and in-house datasets, our method achieves state-of-the-art results, with PSNR/SSIM scores of 29.83 dB/90.84% for fMRI synthesis (+1.54 dB/+4.12% over baselines) and 30.00 dB/77.55% for dMRI synthesis (+1.02 dB/+2.2%). In clinical validation, the synthesized data show strong diagnostic performance, achieving 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic experiments. Code is available at the PDS GitHub Repository: https://github.com/SXR3015/PDS
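
The PSNR/SSIM figures quoted above are standard volumetric image-quality metrics. A minimal sketch of how such scores could be computed with scikit-image is shown below; the random arrays merely stand in for real fMRI/dMRI volumes, and this is not the PDS evaluation code.

```python
# Minimal sketch: PSNR/SSIM evaluation of a synthesized 3D volume against a
# reference (random data stands in for real fMRI/dMRI volumes).
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
reference = rng.random((64, 64, 32)).astype(np.float32)       # ground-truth volume
synthesized = (reference
               + 0.05 * rng.standard_normal(reference.shape).astype(np.float32))

data_range = float(reference.max() - reference.min())
psnr = peak_signal_noise_ratio(reference, synthesized, data_range=data_range)
ssim = structural_similarity(reference, synthesized, data_range=data_range)

print(f"PSNR: {psnr:.2f} dB, SSIM: {100 * ssim:.2f}%")
```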


Paper and Project Links

PDF

Summary

This paper highlights the importance of MRI, especially functional MRI (fMRI) and diffusion MRI (dMRI), for studying neurodegenerative diseases, where missing modalities remain a major barrier to clinical use. To address the challenges of fMRI-dMRI synthesis, the authors propose PDS, which combines a pattern-aware dual-modal 3D diffusion framework with a tissue refinement network incorporating efficient microstructure refinement. The method achieves state-of-the-art results on multiple datasets, and the synthesized data show strong diagnostic performance.

Key Takeaways

  • MRI, especially fMRI and dMRI, is essential for studying neurodegenerative diseases.
  • Missing modalities are a major barrier to clinical use.
  • PDS addresses the challenges of fMRI-dMRI synthesis with a dual-modal 3D diffusion framework and a tissue refinement network.
  • The method achieves state-of-the-art results on multiple datasets and strong diagnostic performance.
  • PDS reaches PSNR/SSIM scores of 29.83 dB/90.84% for fMRI synthesis and 30.00 dB/77.55% for dMRI synthesis.
  • In clinical validation, the synthesized data achieve 67.92%/66.02%/64.15% accuracy (NC vs. MCI vs. AD) in hybrid real-synthetic experiments.

Cool Papers

Click here to view paper screenshots

Unified Multimodal Diffusion Forcing for Forceful Manipulation

Authors:Zixuan Huang, Huaidian Hou, Dmitry Berenson

Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay between different modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing, a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a fixed distribution, MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only delivers versatile functionalities, but also achieves strong performance and robustness under noisy observations. More visualizations can be found on our website https://unified-df.github.io
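
MDF's training objective applies random partial masking across modalities and reconstructs the full trajectory. The sketch below illustrates only that masking-and-reconstruction idea on toy tensors; the shapes, mask rate, and the stand-in linear "denoiser" are assumptions, not the authors' architecture.

```python
# Conceptual sketch: random partial masking over a multimodal robot trajectory
# (observations, actions, force signals) followed by a reconstruction loss.
import torch

T, obs_dim, act_dim, force_dim = 16, 32, 7, 6
trajectory = {
    "obs": torch.randn(T, obs_dim),
    "action": torch.randn(T, act_dim),
    "force": torch.randn(T, force_dim),
}

def random_partial_mask(traj, keep_prob=0.5):
    """Independently mask timesteps per modality; masked entries are zeroed."""
    masked, masks = {}, {}
    for name, x in traj.items():
        m = (torch.rand(x.shape[0], 1) < keep_prob).float()   # per-timestep mask
        masks[name] = m
        masked[name] = x * m
    return masked, masks

masked_traj, masks = random_partial_mask(trajectory)

# A diffusion model would be trained to reconstruct the full trajectory from
# the masked (and noised) inputs; here a single linear layer stands in for it.
total_dim = obs_dim + act_dim + force_dim
denoiser = torch.nn.Linear(total_dim, total_dim)
inp = torch.cat([masked_traj[k] for k in ("obs", "action", "force")], dim=-1)
target = torch.cat([trajectory[k] for k in ("obs", "action", "force")], dim=-1)
loss = torch.nn.functional.mse_loss(denoiser(inp), target)
loss.backward()
print(f"reconstruction loss: {loss.item():.4f}")
```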


Paper and Project Links

PDF Project website: https://unified-df.github.io

Summary

This paper proposes Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that goes beyond action generation. The framework applies random partial masking and trains a diffusion model to reconstruct the trajectory, learning temporal and cross-modal dependencies such as predicting the effect of actions on force signals or inferring states from partial observations. Evaluated on contact-rich, forceful manipulation tasks in simulated and real-world environments, MDF delivers versatile functionality along with strong performance and robustness.

Key Takeaways

  1. Multimodal Diffusion Forcing (MDF) is a unified framework for learning from multimodal robot trajectories.
  2. MDF goes beyond pure action generation by modeling the rich interplay between sensory inputs, actions, and rewards.
  3. MDF applies random partial masking and trains a diffusion model to reconstruct the trajectory, learning temporal and cross-modal dependencies.
  4. MDF can predict the effects of actions on force signals and infer states from partial observations.
  5. On contact-rich, forceful manipulation tasks in simulated and real-world environments, MDF shows strong performance and robustness.
  6. MDF remains robust under noisy observations.

Cool Papers

Click here to view paper screenshots

EditInfinity: Image Editing with Binary-Quantized Generative Models

Authors:Jiahuan Wang, Yuxin Chen, Jun Yu, Guangming Lu, Wenjie Pei

Adapting pretrained diffusion-based generative models for text-driven image editing with negligible tuning overhead has demonstrated remarkable potential. A classical adaptation paradigm, as followed by these methods, first infers the generative trajectory inversely for a given source image by image inversion, then performs image editing along the inferred trajectory guided by the target text prompts. However, the performance of image editing is heavily limited by the approximation errors introduced during image inversion by diffusion models, which arise from the absence of exact supervision in the intermediate generative steps. To circumvent this issue, we investigate the parameter-efficient adaptation of binary-quantized generative models for image editing, and leverage their inherent characteristic that the exact intermediate quantized representations of a source image are attainable, enabling more effective supervision for precise image inversion. Specifically, we propose EditInfinity, which adapts Infinity, a binary-quantized generative model, for image editing. We propose an efficient yet effective image inversion mechanism that integrates text prompting rectification and image style preservation, enabling precise image inversion. Furthermore, we devise a holistic smoothing strategy which allows our EditInfinity to perform image editing with high fidelity to source images and precise semantic alignment to the text prompts. Extensive experiments on the PIE-Bench benchmark across 'add', 'change', and 'delete' editing operations demonstrate the superior performance of our model compared to state-of-the-art diffusion-based baselines. Code available at: https://github.com/yx-chen-ust/EditInfinity.
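
The property EditInfinity exploits is that a binary-quantized model yields exact, deterministic intermediate codes for a source image, so inversion needs no approximation. The toy snippet below illustrates binary quantization with a straight-through estimator; it is a conceptual sketch, not the Infinity tokenizer.

```python
# Toy illustration: binary quantization with a straight-through estimator.
# Unlike diffusion inversion, the quantized code of a given input can be
# recomputed exactly, which is the property the paper builds on.
import torch

def binary_quantize(latent: torch.Tensor) -> torch.Tensor:
    """Quantize to {-1, +1}; gradients pass through unchanged (STE)."""
    codes = torch.where(latent >= 0, torch.ones_like(latent), -torch.ones_like(latent))
    return latent + (codes - latent).detach()

latent = torch.randn(4, 8, requires_grad=True)      # stand-in encoder output
codes_first = binary_quantize(latent)
codes_again = binary_quantize(latent)               # re-encoding the same input

# Exact, deterministic codes -> the source image's intermediate representation
# is recoverable without accumulated approximation error.
print(torch.equal(codes_first.detach(), codes_again.detach()))   # True
```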


Paper and Project Links

PDF 28 pages, 13 figures, accepted by The Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Summary
Adapting pretrained diffusion-based generative models for text-driven image editing with minimal tuning overhead has shown great potential, but editing quality is limited by the approximation errors of image inversion, which lacks exact supervision at intermediate generative steps. To address this, the authors study parameter-efficient adaptation of binary-quantized generative models, whose exact intermediate quantized representations of a source image enable more effective supervision for precise inversion. They propose EditInfinity, which adapts the binary-quantized generative model Infinity for image editing, with an efficient image inversion mechanism that integrates text prompting rectification and image style preservation, plus a holistic smoothing strategy that yields high fidelity to the source image and precise semantic alignment with the text prompts. Experiments show the model outperforms state-of-the-art diffusion-based baselines. More details: https://github.com/yx-chen-ust/EditInfinity.

Key Takeaways

  1. Adapting pretrained diffusion models for text-driven image editing has shown great potential.
  2. Image editing is limited by the approximation errors introduced during image inversion with diffusion models.
  3. Parameter-efficient adaptation of binary-quantized generative models helps circumvent these approximation errors.
  4. EditInfinity adapts the binary-quantized generative model Infinity, supporting precise image inversion with text prompting rectification and image style preservation.
  5. A holistic smoothing strategy improves editing, achieving high fidelity to the source image and precise semantic alignment.
  6. EditInfinity outperforms diffusion-based baselines on the PIE-Bench benchmark.

Cool Papers

Click here to view paper screenshots

Towards Understanding the Mechanisms of Classifier-Free Guidance

Authors:Xiang Li, Rongrong Wang, Qing Qu

Classifier-free guidance (CFG) is a core technique powering state-of-the-art image generation systems, yet its underlying mechanisms remain poorly understood. In this work, we begin by analyzing CFG in a simplified linear diffusion model, where we show its behavior closely resembles that observed in the nonlinear case. Our analysis reveals that linear CFG improves generation quality via three distinct components: (i) a mean-shift term that approximately steers samples in the direction of class means, (ii) a positive Contrastive Principal Components (CPC) term that amplifies class-specific features, and (iii) a negative CPC term that suppresses generic features prevalent in unconditional data. We then verify these insights in real-world, nonlinear diffusion models: over a broad range of noise levels, linear CFG resembles the behavior of its nonlinear counterpart. Although the two eventually diverge at low noise levels, we discuss how the insights from the linear analysis still shed light on CFG's mechanism in the nonlinear regime.
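
At sampling time, classifier-free guidance extrapolates from the unconditional to the conditional prediction. The snippet below shows the standard CFG combination and, on synthetic Gaussian data, the mean-shift direction that component (i) of the linear analysis refers to; the toy numbers are illustrative only, not the paper's setup.

```python
# Minimal sketch of the standard CFG combination plus a toy mean-shift
# direction on synthetic Gaussian data (illustrative only).
import torch

def cfg_prediction(eps_uncond, eps_cond, guidance_scale):
    """Standard CFG extrapolation of the noise prediction."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy linear/Gaussian setting: the gap between class mean and global mean
# acts as a mean-shift direction steering samples toward the class mean.
class_mean = torch.tensor([2.0, -1.0])
global_mean = torch.tensor([0.5, 0.0])
mean_shift = class_mean - global_mean          # component (i): mean-shift direction

eps_uncond = torch.randn(4, 2)
eps_cond = eps_uncond - 0.1 * mean_shift       # conditional prediction pulls toward the class
guided = cfg_prediction(eps_uncond, eps_cond, guidance_scale=7.5)
print(guided.shape, mean_shift)
```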


Paper and Project Links

PDF

Summary

This work analyzes classifier-free guidance (CFG) in a simplified linear diffusion model and identifies three mechanisms by which it improves generation quality: a mean-shift term, a positive Contrastive Principal Components (CPC) term, and a negative CPC term. These insights carry over to real-world, nonlinear diffusion models: although the linear and nonlinear cases eventually diverge at low noise levels, the linear analysis still helps explain CFG's mechanism in the nonlinear regime.

Key Takeaways

  1. Classifier-free guidance (CFG) is a core technique in state-of-the-art image generation systems.
  2. Analyzed in a simplified linear diffusion model, CFG behaves much like it does in the nonlinear case.
  3. CFG improves generation quality through three distinct components: a mean-shift term, a positive CPC term, and a negative CPC term.
  4. In real-world, nonlinear diffusion models, linear CFG resembles the behavior of its nonlinear counterpart over a broad range of noise levels.
  5. Although the linear and nonlinear cases diverge at low noise levels, the linear analysis still helps explain CFG's mechanism in the nonlinear regime, showing that insights from simple models remain valuable in complex settings.

Cool Papers

Click here to view paper screenshots

Consistency Trajectory Matching for One-Step Generative Super-Resolution

Authors:Weiyi You, Mingyang Zhang, Leheng Zhang, Xingyu Zhou, Kexuan Shi, Shuhang Gu

Current diffusion-based super-resolution (SR) approaches achieve commendable performance at the cost of high inference overhead. Therefore, distillation techniques are utilized to accelerate the multi-step teacher model into a one-step student model. Nevertheless, these methods significantly raise training costs and constrain the performance of the student model by the teacher model. To overcome these tough challenges, we propose Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that is able to generate photo-realistic SR results in one step. Concretely, we first formulate a Probability Flow Ordinary Differential Equation (PF-ODE) trajectory to establish a deterministic mapping from low-resolution (LR) images with noise to high-resolution (HR) images. Then we apply the Consistency Training (CT) strategy to directly learn the mapping in one step, eliminating the necessity of a pre-trained diffusion model. To further enhance the performance and better leverage the ground-truth during the training process, we aim to align the distribution of SR results more closely with that of the natural images. To this end, we propose to minimize the discrepancy between their respective PF-ODE trajectories from the LR image distribution by our meticulously designed Distribution Trajectory Matching (DTM) loss, resulting in improved realism of our recovered HR images. Comprehensive experimental results demonstrate that the proposed methods can attain comparable or even superior capabilities on both synthetic and real datasets while maintaining minimal inference latency.
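
The consistency-training step asks one network to map points at different noise levels on the same PF-ODE trajectory to the same output. Below is a generic consistency-training-style sketch with an EMA teacher and adjacent noise levels; it follows the standard CT recipe rather than the exact CTMSR formulation, and the DTM loss is not included.

```python
# Generic consistency-training sketch: the student at noise level sigma_{t+1}
# should match an EMA teacher at sigma_t on the same trajectory point.
# (Standard CT recipe, not the exact CTMSR/DTM objective.)
import copy
import torch

model = torch.nn.Sequential(torch.nn.Linear(17, 64), torch.nn.ReLU(), torch.nn.Linear(64, 16))
teacher = copy.deepcopy(model).requires_grad_(False)     # EMA copy of the student

def f(net, x, sigma):
    """Consistency function conditioned on the noise level sigma."""
    return net(torch.cat([x, sigma.expand(x.shape[0], 1)], dim=-1))

hr = torch.randn(8, 16)                                  # stand-in HR targets (flattened)
sigmas = torch.tensor([0.5, 0.6])                        # two adjacent noise levels
noise = torch.randn_like(hr)
x_t, x_t1 = hr + sigmas[0] * noise, hr + sigmas[1] * noise   # same trajectory, two levels

loss = torch.nn.functional.mse_loss(f(model, x_t1, sigmas[1]),
                                    f(teacher, x_t, sigmas[0]).detach())
loss.backward()

# After each optimizer step, the teacher would be updated as an EMA of the
# student: teacher_param = mu * teacher_param + (1 - mu) * student_param.
```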


Paper and Project Links

PDF Accepted by ICCV 2025

Summary

Current diffusion-based super-resolution (SR) methods perform well but incur high inference overhead, and distillation approaches that compress a multi-step teacher into a one-step student raise training costs and cap the student at the teacher's performance. To overcome this, the paper proposes Consistency Trajectory Matching for Super-Resolution (CTMSR), a distillation-free strategy that produces photo-realistic SR results in a single step. A Probability Flow ODE (PF-ODE) trajectory establishes a deterministic mapping from noisy low-resolution (LR) images to high-resolution (HR) images, and a Consistency Training (CT) strategy learns this mapping directly in one step without a pre-trained diffusion model. A Distribution Trajectory Matching (DTM) loss further minimizes the discrepancy between the PF-ODE trajectories of SR results and natural images from the LR distribution, improving the realism of the recovered HR images. Experiments show comparable or superior performance on synthetic and real datasets with minimal inference latency.

Key Takeaways

  1. Current diffusion-based SR methods perform well but have high inference overhead.
  2. Distillation techniques accelerate the teacher model but raise training costs and are limited by the teacher's performance.
  3. The distillation-free CTMSR strategy generates photo-realistic SR results in one step.
  4. A PF-ODE trajectory establishes a deterministic mapping from noisy LR images to HR images.
  5. A consistency training strategy learns this mapping directly in one step, without a pre-trained diffusion model.
  6. The DTM loss minimizes the discrepancy between the PF-ODE trajectories of SR results and natural images, improving the realism of the recovered images.

Cool Papers

Click here to view paper screenshots

CASteer: Steering Diffusion Models for Controllable Generation

Authors:Tatiana Gaintseva, Andreea-Maria Oncescu, Chengcheng Ma, Ziquan Liu, Martin Benning, Gregory Slabaugh, Jiankang Deng, Ismail Elezi

Diffusion models have transformed image generation, yet controlling their outputs to reliably erase undesired concepts remains challenging. Existing approaches usually require task-specific training and struggle to generalize across both concrete (e.g., objects) and abstract (e.g., styles) concepts. We propose CASteer (Cross-Attention Steering), a training-free framework for concept erasure in diffusion models using steering vectors to influence hidden representations dynamically. CASteer precomputes concept-specific steering vectors by averaging neural activations from images generated for each target concept. During inference, it dynamically applies these vectors to suppress undesired concepts only when they appear, ensuring that unrelated regions remain unaffected. This selective activation enables precise, context-aware erasure without degrading overall image quality. This approach achieves effective removal of harmful or unwanted content across a wide range of visual concepts, all without model retraining. CASteer outperforms state-of-the-art concept erasure techniques while preserving unrelated content and minimizing unintended effects. Pseudocode is provided in the supplementary.
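
CASteer precomputes a steering vector per concept by averaging activations and, at inference, suppresses that direction only where the concept is present. A minimal numerical sketch of that selective-suppression step follows; the detection threshold, suppression strength, and random activations are illustrative assumptions, not the paper's cross-attention implementation.

```python
# Minimal sketch: build a concept steering vector as an activation mean, then
# suppress that direction at inference only where the hidden state aligns
# with it (threshold and strength are illustrative).
import torch

def concept_vector(activations: torch.Tensor) -> torch.Tensor:
    """Average activations gathered from generations of the target concept."""
    v = activations.mean(dim=0)
    return v / v.norm()

def steer(hidden: torch.Tensor, v: torch.Tensor, strength=1.0, threshold=0.2):
    """Remove the concept direction only where it is actually present."""
    coeff = hidden @ v                              # per-token alignment with the concept
    gate = (coeff.abs() > threshold).float()        # apply only where the concept appears
    return hidden - strength * (gate * coeff).unsqueeze(-1) * v

concept_acts = torch.randn(128, 64) + 0.5           # activations from concept images
v = concept_vector(concept_acts)
hidden_states = torch.randn(16, 64)                 # hidden states during generation
steered = steer(hidden_states, v)
print(steered.shape)
```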


Paper and Project Links

PDF

Summary

Diffusion models have transformed image generation, yet reliably erasing undesired concepts from their outputs remains challenging; existing approaches usually need task-specific training and generalize poorly across concrete (e.g., objects) and abstract (e.g., styles) concepts. CASteer (Cross-Attention Steering) is a training-free framework for concept erasure in diffusion models that uses steering vectors to influence hidden representations dynamically. It precomputes concept-specific steering vectors by averaging neural activations from images generated for each target concept, and at inference applies them only when the undesired concept appears, leaving unrelated regions unaffected. This selective activation enables precise, context-aware erasure without degrading overall image quality, removes harmful or unwanted content across a wide range of visual concepts without retraining, and outperforms state-of-the-art concept erasure techniques while preserving unrelated content and minimizing unintended effects. Pseudocode is provided in the supplementary material.

Key Takeaways

  1. Diffusion models and their challenge: despite major progress in image generation, controlling outputs to erase undesired concepts remains difficult.
  2. Limitations of existing methods: they usually require task-specific training and struggle to generalize across concrete and abstract concepts.
  3. The CASteer framework: a training-free approach to concept erasure based on cross-attention steering.
  4. How CASteer works: steering vectors dynamically influence hidden representations, with concept-specific vectors precomputed by averaging neural activations from images generated for each target concept.
  5. Advantages: precise, context-aware erasure without degrading overall image quality, removing harmful or unwanted content across a wide range of visual concepts without retraining the model.
  6. Comparison with prior work: CASteer outperforms existing concept erasure techniques while preserving unrelated content and minimizing unintended effects.

Cool Papers

Click here to view paper screenshots

FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models

Authors:Barbara Toniella Corradini, Mustafa Shukor, Paul Couairon, Guillaume Couairon, Franco Scarselli, Matthieu Cord

Foundation models have exhibited unprecedented capabilities in tackling many domains and tasks. Models such as CLIP are currently widely used to bridge cross-modal representations, and text-to-image diffusion models are arguably the leading models in terms of realistic image generation. Image generative models are trained on massive datasets that provide them with powerful internal spatial representations. In this work, we explore the potential benefits of such representations, beyond image generation, in particular, for dense visual prediction tasks. We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets, with pixel-level annotations. To avoid the annotation cost or training large diffusion models, we constrain our setup to be zero-shot and training-free. In a nutshell, our pipeline leverages different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation. The pipeline is as follows: the image is passed to both a captioner model (i.e. BLIP) and a diffusion model (i.e., Stable Diffusion Model) to generate a text description and visual representation, respectively. The features are clustered and binarized to obtain class agnostic masks for each object. These masks are then mapped to a textual class, using the CLIP model to support open-vocabulary. Finally, we add a refinement step that allows to obtain a more precise segmentation mask. Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets. In addition, we show very competitive results compared to the recent weakly-supervised segmentation approaches. We provide comprehensive experiments showing the superiority of diffusion model features compared to other pretrained models. Project page: https://bcorrad.github.io/freesegdiff/
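
The pipeline clusters dense diffusion features into class-agnostic masks and then names each mask with CLIP. The sketch below covers only the clustering and mask-naming steps on stand-in features and stand-in text embeddings; extracting real BLIP captions, Stable Diffusion features, and CLIP embeddings is assumed to happen upstream and is not shown.

```python
# Sketch of the clustering / mask-naming steps: dense features (assumed to
# come from a diffusion model's intermediate layers) are clustered into
# class-agnostic masks, and each mask is assigned the class whose (assumed
# precomputed) text embedding best matches the mask's pooled feature.
import numpy as np
from sklearn.cluster import KMeans

H, W, C, n_clusters = 32, 32, 64, 4
features = np.random.rand(H, W, C).astype(np.float32)       # stand-in dense features

labels = (
    KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    .fit_predict(features.reshape(-1, C))
    .reshape(H, W)                                           # class-agnostic regions
)

class_names = ["person", "dog", "car", "background"]          # e.g. parsed from a caption
text_embs = np.random.rand(len(class_names), C)               # stand-in text embeddings
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

for k in range(n_clusters):
    mask = labels == k
    pooled = features[mask].mean(axis=0)                      # pooled feature of the mask
    pooled /= np.linalg.norm(pooled)
    best = class_names[int(np.argmax(text_embs @ pooled))]    # open-vocabulary assignment
    print(f"mask {k}: {mask.sum()} pixels -> '{best}'")
```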


Paper and Project Links

PDF

Summary

This paper explores the potential of diffusion-model representations for image segmentation. The authors propose FreeSeg-Diff, a zero-shot open-vocabulary segmentation method that combines a captioner model (BLIP) and a diffusion model (Stable Diffusion) to produce a text description and visual representations, enabling segmentation without any training. The approach outperforms many training-based methods on the Pascal VOC and COCO datasets and is competitive with recent weakly-supervised segmentation approaches.

Key Takeaways

  1. Diffusion models excel at image generation and hold promise for other visual tasks.
  2. FreeSeg-Diff is a zero-shot open-vocabulary segmentation method that combines a captioner model and a diffusion model.
  3. FreeSeg-Diff requires no training and outperforms many training-based methods on the Pascal VOC and COCO datasets.
  4. Diffusion model features outperform those of other pretrained models.
  5. The method segments images by generating text descriptions and visual representations, making it broadly applicable.
  6. The results demonstrate the great potential of diffusion models for visual tasks.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!