⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用。
🔴 请注意:切勿用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-09-17 更新
LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence
Authors:Zixin Yin, Xili Dai, Duomin Wang, Xianfang Zeng, Lionel M. Ni, Gang Yu, Heung-Yeung Shum
The reliance on implicit point matching via attention has become a core bottleneck in drag-based editing, resulting in a fundamental compromise on weakened inversion strength and costly test-time optimization (TTO). This compromise severely limits the generative capabilities of diffusion models, suppressing high-fidelity inpainting and text-guided creation. In this paper, we introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers, which directly eliminates the reliance on implicit point matching. In concrete terms, our method generates an explicit correspondence map from user drag inputs as a reliable reference to boost the attention control. This reliable reference opens the potential for a stable full-strength inversion process, which is the first in the drag-based editing task. It obviates the necessity for TTO and unlocks the generative capability of models. Therefore, LazyDrag naturally unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach: opening the mouth of a dog and inpainting its interior, generating new objects like a ``tennis ball’’, or for ambiguous drags, making context-aware changes like moving a hand into a pocket. Additionally, LazyDrag supports multi-round workflows with simultaneous move and scale operations. Evaluated on the DragBench, our method outperforms baselines in drag accuracy and perceptual quality, as validated by VIEScore and human evaluation. LazyDrag not only establishes new state-of-the-art performance, but also paves a new way to editing paradigms.
对基于注意力的隐式点匹配的依赖已成为基于拖拽的编辑中的核心瓶颈,导致反转强度减弱、测试时优化(TTO)成本高昂的根本性妥协。这种妥协严重限制了扩散模型的生成能力,抑制了高保真补全和文本引导的创作。在本文中,我们介绍了LazyDrag,这是首个针对多模态扩散Transformer的基于拖拽的图像编辑方法,它直接消除了对隐式点匹配的依赖。具体来说,我们的方法从用户的拖拽输入生成一个显式的对应图,作为增强注意力控制的可靠参考。这一可靠的参考为稳定的全强度反转过程打开了可能性,这在基于拖拽的编辑任务中尚属首次。它免去了TTO的必要,并释放了模型的生成能力。因此,LazyDrag自然地统一了精确的几何控制和文本指导,能够实现以前无法实现的复杂编辑:打开狗的嘴巴并补全其内部,生成新对象(如“网球”),或对模糊的拖拽进行上下文感知的更改(如将手放入口袋)。此外,LazyDrag支持多轮工作流程,可同时执行移动和缩放操作。在DragBench上的评估表明,我们的方法在拖拽准确性和感知质量方面超越了基线,并得到了VIEScore和人工评估的验证。LazyDrag不仅树立了新的最先进性能,还为编辑范式开辟了新的道路。
论文及项目相关链接
Summary
本文介绍了LazyDrag,一种基于多模态扩散Transformer的拖拽式图像编辑方法。该方法消除了对隐式点匹配的依赖,通过从用户拖拽输入生成显式对应图来增强注意力控制,使拖拽式编辑任务中首次实现稳定的全强度反转过程,解锁模型生成能力,将精确的几何控制与文本指导相结合,实现复杂编辑操作。
Key Takeaways
- 介绍了LazyDrag作为一种新的拖拽式图像编辑方法,特别适用于多模态扩散Transformer。
- LazyDrag通过生成显式对应图来消除对隐式点匹配的依赖,提高注意力控制。
- 该方法实现了稳定的完全强度反转过程,这在拖拽编辑任务中是首次实现。
- LazyDrag解锁了模型的生成能力,将精确的几何控制与文本指导相结合。
- LazyDrag支持多轮工作流程,可同时执行移动和缩放操作。
- 在DragBench上评估,LazyDrag在拖拽准确性和感知质量方面优于基准方法,得到VIEScore和人为评估的验证。
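
下面给出一个根据用户拖拽点构造显式对应图、并将其用作注意力偏置的极简示意(假设 latent 为 H×W 的 token 网格;函数名、坐标取整方式与偏置形式均为示意性假设,并非论文官方实现):

```python
import torch

def build_correspondence_map(src_pts, dst_pts, latent_hw, image_hw):
    """由拖拽的 (源点, 目标点) 对构造 latent token 网格上的显式对应图。
    返回 (H*W,) 的 LongTensor:目标位置的 token 指向源位置的 token 索引,
    其余位置指向自身。src_pts/dst_pts 为图像坐标 (x, y) 列表(假设的输入格式)。"""
    H, W = latent_hw
    img_h, img_w = image_hw
    def to_idx(x, y):
        r = min(int(y * H / img_h), H - 1)
        c = min(int(x * W / img_w), W - 1)
        return r * W + c
    corr = torch.arange(H * W)                       # 默认:每个 token 对应自身
    for (sx, sy), (dx, dy) in zip(src_pts, dst_pts):
        corr[to_idx(dx, dy)] = to_idx(sx, sy)        # 目标 token 引用源 token
    return corr

def correspondence_attention_bias(corr, strength=4.0):
    """把对应图转成注意力偏置:在 (i, corr[i]) 位置加正偏置,可叠加到 attention logits 上。"""
    N = corr.numel()
    bias = torch.zeros(N, N)
    bias[torch.arange(N), corr] = strength
    return bias
```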
点此查看论文截图






AvatarSync: Rethinking Talking-Head Animation through Autoregressive Perspective
Authors:Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han
Existing talking-head animation approaches based on Generative Adversarial Networks (GANs) or diffusion models often suffer from inter-frame flicker, identity drift, and slow inference. These limitations inherent to their video generation pipelines restrict their suitability for applications. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. In addition, AvatarSync adopts a two-stage generation strategy, decoupling semantic modeling from visual dynamics, which is a deliberate “Divide and Conquer” design. The first stage, Facial Keyframe Generation (FKG), focuses on phoneme-level semantic representation by leveraging the many-to-one mapping from text or audio to phonemes. A Phoneme-to-Visual Mapping is constructed to anchor abstract phonemes to character-level units. Combined with a customized Text-Frame Causal Attention Mask, the keyframes are generated. The second stage, inter-frame interpolation, emphasizes temporal coherence and visual smoothness. We introduce a timestamp-aware adaptive strategy based on a selective state space model, enabling efficient bidirectional context reasoning. To support deployment, we optimize the inference pipeline to reduce latency without compromising visual fidelity. Extensive experiments show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.
基于生成对抗网络(GANs)或扩散模型的现有说话人动画方法常常面临帧间闪烁、身份漂移和推理速度慢的问题。这些视频生成管道固有的局限性限制了其在实际应用中的适用性。为了解决这一问题,我们引入了AvatarSync,这是一个基于音素表示的自回归框架,能够从单张参考图像生成真实可控的说话人动画,直接由文本或音频输入驱动。
此外,AvatarSync采用了两阶段生成策略,将语义建模与视觉动力学解耦,这是一种有意的“分而治之”设计。第一阶段,面部关键帧生成(FKG)专注于音素级别的语义表示,通过文本或音频到音素的多对一映射。构建了音素到视觉的映射,将抽象音素锚定到字符级单元。结合定制的文本帧因果注意力掩码,生成关键帧。第二阶段,帧间插值强调时间连贯性和视觉平滑性。我们引入了一种基于选择性状态空间模型的时间戳感知自适应策略,实现高效的双向上下文推理。为了支持部署,我们优化了推理管道,减少了延迟,而不损害视觉保真度。大量实验表明,AvatarSync在视觉保真度、时间一致性和计算效率方面超越了现有的说话人动画方法,提供了一种可扩展且可控的解决方案。
论文及项目相关链接
Summary
基于生成对抗网络(GANs)或扩散模型的现有说话人动画方法常常存在帧间闪烁、身份漂移和推理速度慢等问题,这些固有的视频生成管道限制其在各种应用中的适用性。为解决这些问题,我们推出AvatarSync,一种基于音素表示的自回归框架,可从单个参考图像生成真实且可控的说话人动画,直接由文本或音频输入驱动。AvatarSync采用两阶段生成策略,将语义建模与视觉动态解耦,这是有意为之的“分而治之”设计。第一阶段专注于音素级语义表示,利用文本或音频到音素的多对一映射构建音素到视觉映射,将抽象音素锚定到字符级单元。结合定制的文本帧因果注意力掩码,生成关键帧。第二阶段着重于帧间插值,强调时间连贯性和视觉平滑度。我们引入一种基于选择性状态空间模型的带有时间戳的自适应策略,实现高效的双向上下文推理。为支持部署,我们优化推理管道,降低延迟而不影响视觉保真度。大量实验表明,AvatarSync在视觉保真度、时间一致性和计算效率方面优于现有说话人动画方法,提供可伸缩且可控的解决方案。
Key Takeaways
- AvatarSync是一个基于音素表示的自回归框架,能够生成真实且可控的说话人动画。
- 该方法采用两阶段生成策略,解耦语义建模与视觉动态。
- 第一阶段专注于音素级语义表示,构建音素到视觉映射并生成关键帧。
- 第二阶段强调帧间插值的时空连贯性和视觉平滑度。
- 引入带有时间戳的基于选择性状态空间模型的自适应策略,实现高效推理。
- 优化推理管道以降低延迟,同时保持视觉保真度。
- AvatarSync在视觉保真度、时间一致性和计算效率方面优于现有方法。
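
下面用常见的块状因果掩码写法示意“文本-帧因果注意力掩码”的思路(token 排布与掩码细节为假设,仅供理解,并非官方实现):

```python
import torch

def text_frame_causal_mask(num_text_tokens, num_frames, tokens_per_frame):
    """文本 token 可被所有位置关注;第 t 帧的 token 只能看到文本和前 t 帧(含自身)。
    返回 bool 掩码,True 表示允许注意力。"""
    N = num_text_tokens + num_frames * tokens_per_frame
    mask = torch.zeros(N, N, dtype=torch.bool)
    mask[:, :num_text_tokens] = True                      # 所有 token 都能看文本
    for t in range(num_frames):
        r0 = num_text_tokens + t * tokens_per_frame
        r1 = r0 + tokens_per_frame
        c1 = num_text_tokens + (t + 1) * tokens_per_frame
        mask[r0:r1, num_text_tokens:c1] = True            # 帧级因果:只看编号 <= t 的帧
    return mask
```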
点此查看论文截图





Robust Concept Erasure in Diffusion Models: A Theoretical Perspective on Security and Robustness
Authors:Zixuan Fu, Yan Ren, Finn Carter, Chenyue Wen, Le Ku, Daheng Yu, Emily Davis, Bo Zhang
Diffusion models have achieved unprecedented success in image generation but pose increasing risks in terms of privacy, fairness, and security. A growing demand exists to \emph{erase} sensitive or harmful concepts (e.g., NSFW content, private individuals, artistic styles) from these models while preserving their overall generative capabilities. We introduce \textbf{SCORE} (Secure and Concept-Oriented Robust Erasure), a novel framework for robust concept removal in diffusion models. SCORE formulates concept erasure as an \emph{adversarial independence} problem, theoretically guaranteeing that the model’s outputs become statistically independent of the erased concept. Unlike prior heuristic methods, SCORE minimizes the mutual information between a target concept and generated outputs, yielding provable erasure guarantees. We provide formal proofs establishing convergence properties and derive upper bounds on residual concept leakage. Empirically, we evaluate SCORE on Stable Diffusion and FLUX across four challenging benchmarks: object erasure, NSFW removal, celebrity face suppression, and artistic style unlearning. SCORE consistently outperforms state-of-the-art methods including EraseAnything, ANT, MACE, ESD, and UCE, achieving up to \textbf{12.5%} higher erasure efficacy while maintaining comparable or superior image quality. By integrating adversarial optimization, trajectory consistency, and saliency-driven fine-tuning, SCORE sets a new standard for secure and robust concept erasure in diffusion models.
扩散模型在图像生成方面取得了前所未有的成功,但也在隐私、公平和安全方面带来了日益增加的风险。因此,人们越来越需要在保持模型整体生成能力的同时,从这些模型中删除敏感或有害概念(例如成人内容、特定私人个体、艺术风格等)。我们引入了SCORE(安全且面向概念的稳健删除),一种用于扩散模型中稳健概念删除的新框架。SCORE将概念删除表述为对抗独立性问题,从理论上保证了模型的输出与被删除的概念在统计上独立。不同于先前的启发式方法,SCORE最小化目标概念与生成输出之间的互信息,从而提供可证明的删除保证。我们提供了形式化证明,建立了收敛性质并推导出残余概念泄漏的上界。在实证方面,我们在Stable Diffusion和FLUX上对SCORE进行了评估,涵盖四个具有挑战性的基准测试:对象删除、成人内容移除、名人面部抑制和艺术风格遗忘。SCORE持续优于包括EraseAnything、ANT、MACE、ESD和UCE在内的最新方法,在删除效果最高提升12.5%的同时,保持了相当或更优的图像质量。通过整合对抗优化、轨迹一致性以及显著性驱动的微调,SCORE为扩散模型中安全且稳健的概念删除设定了新的标准。
论文及项目相关链接
PDF Camera ready version
Summary
本文介绍了扩散模型在图像生成领域取得的巨大成功,同时也指出了其在隐私、公平性和安全性方面日益增长的风险。为了解决这些问题,提出了一种名为SCORE的新型框架,用于在扩散模型中实现稳健的概念去除。SCORE将概念删除制定为对抗独立性问题,从理论上保证了模型输出与删除概念之间的统计独立性。与先前的启发式方法不同,SCORE最小化目标概念与生成输出之间的互信息,从而实现可证明的概念删除保证。本文提供了形式化证明来建立收敛性质并推导出概念泄漏的剩余上限。在Stable Diffusion和FLUX上进行的实证评估表明,在对象删除、NSFW移除、名人面部抑制和艺术风格去除等四个挑战性基准测试中,SCORE始终优于其他最新方法,包括EraseAnything、ANT、MACE、ESD和UCE等,实现了高达12.5%的删除效果提升,同时保持了相当或更高的图像质量。
Key Takeaways
- 扩散模型在图像生成方面表现出显著的成功,但也存在隐私、公平性和安全性方面的风险。
- 提出了一种名为SCORE的新型框架,用于稳健地删除扩散模型中的特定概念。
- SCORE将概念删除问题转化为对抗独立性问题,确保模型输出与删除概念之间的统计独立性。
- SCORE通过最小化目标概念与生成输出之间的互信息来实现可证明的概念删除效果。
- 提供了形式化证明,建立了该框架的收敛性质并推导出残余概念泄漏的上界。
- 在多个基准测试中,SCORE在概念删除效果上显著优于其他现有方法。
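
以下是“对抗独立性 / 互信息最小化”目标的一个简化训练损失草图:判别器试图从生成特征中预测被擦除的概念,扩散模型则反向优化使概念不可预测(网络结构与损失形式均为示意性假设,并非 SCORE 的官方实现):

```python
import torch
import torch.nn as nn

class ConceptProbe(nn.Module):
    """判别器:从扩散模型的中间特征预测目标概念是否出现。"""
    def __init__(self, feat_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, feats):
        return self.net(feats)

def erasure_losses(probe, feats, concept_labels):
    """概念擦除的对抗目标(简化版):
    - probe_loss:训练判别器尽量预测概念(近似最大化互信息下界);
    - erase_loss:训练生成侧把判别器 logit 推向 0(概率 0.5,概念近似不可预测)。"""
    logits = probe(feats)
    probe_loss = nn.functional.binary_cross_entropy_with_logits(
        logits, concept_labels.float().view(-1, 1))
    erase_loss = logits.pow(2).mean()
    return probe_loss, erase_loss
```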
点此查看论文截图

Learning to Generate 4D LiDAR Sequences
Authors:Ao Liang, Youquan Liu, Yu Yang, Dongyue Lu, Linfeng Li, Lingdong Kong, Huaici Zhao, Wei Tsang Ooi
While generative world models have advanced video and occupancy-based data synthesis, LiDAR generation remains underexplored despite its importance for accurate 3D perception. Extending generation to 4D LiDAR data introduces challenges in controllability, temporal stability, and evaluation. We present LiDARCrafter, a unified framework that converts free-form language into editable LiDAR sequences. Instructions are parsed into ego-centric scene graphs, which a tri-branch diffusion model transforms into object layouts, trajectories, and shapes. A range-image diffusion model generates the initial scan, and an autoregressive module extends it into a temporally coherent sequence. The explicit layout design further supports object-level editing, such as insertion or relocation. To enable fair assessment, we provide EvalSuite, a benchmark spanning scene-, object-, and sequence-level metrics. On nuScenes, LiDARCrafter achieves state-of-the-art fidelity, controllability, and temporal consistency, offering a foundation for LiDAR-based simulation and data augmentation.
虽然生成式世界模型已经推动了视频和基于占用(occupancy)的数据合成的发展,但尽管激光雷达数据对精确的3D感知十分重要,其生成仍未得到充分探索。将生成扩展到4D激光雷达数据带来了可控性、时间稳定性和评估方面的挑战。我们提出了LiDARCrafter,这是一个将自由形式语言转换为可编辑激光雷达序列的统一框架。指令被解析为以自车为中心的场景图,三分支扩散模型将其转化为对象布局、轨迹和形状。距离图像(range image)扩散模型生成初始扫描,自回归模块将其扩展为时间连贯的序列。显式的布局设计进一步支持对象级别的编辑,如插入或重新定位。为了进行公平评估,我们提供了EvalSuite,这是一个涵盖场景、对象和序列级别指标的基准测试。在nuScenes上,LiDARCrafter达到了最先进的保真度、可控性和时间一致性,为基于激光雷达的模拟和数据增强奠定了基础。
论文及项目相关链接
PDF Abstract Paper (Non-Archival) @ ICCV 2025 Wild3D Workshop; GitHub Repo at https://lidarcrafter.github.io/
Summary
LiDARCrafter是一个统一框架,能将自由形式的语言转换为可编辑的LiDAR序列。它通过解析指令生成ego-centric场景图,并由三分支扩散模型将指令转换为对象布局、轨迹和形状。范围图像扩散模型生成初始扫描,自回归模块将其扩展为时间连贯的序列。明确的布局设计进一步支持对象级别的编辑,如插入或重新定位。EvalSuite基准测试涵盖了场景、对象和序列级别的指标,以便公平评估。在nuScenes上,LiDARCrafter达到了先进的逼真度、可控性和时间一致性,为LiDAR模拟和数据增强奠定了基础。
Key Takeaways
- 尽管LiDAR数据对准确的3D感知很重要,其生成在生成式世界模型中仍未得到充分探索。
- LiDARCrafter是一个能将自由形式语言转换为可编辑LiDAR序列的统一框架。
- 该框架通过解析指令生成ego-centric场景图,并由三分支扩散模型转换数据。
- 范围图像扩散模型生成初始扫描,自回归模块则确保时间的连贯性。
- LiDARCrafter支持对象级别的编辑,如插入或重新定位对象。
- EvalSuite基准测试用于公平评估LiDAR数据生成的场景、对象和序列级别指标。
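
下面给出“以自车为中心的场景图”的一种可能数据结构草图,用于说明语言指令解析结果如何组织为三分支扩散模型(布局/轨迹/形状)的条件输入(字段名与取值均为假设):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    category: str                           # 例如 "car"、"pedestrian"
    size_lwh: Tuple[float, float, float]    # 长宽高(米)
    position: Tuple[float, float, float]    # 相对自车的坐标 (x, y, z)
    heading: float                          # 航向角(弧度)
    trajectory: List[Tuple[float, float]] = field(default_factory=list)  # 未来帧 (x, y)

@dataclass
class EgoSceneGraph:
    ego_speed: float
    objects: List[SceneObject] = field(default_factory=list)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)  # (i, "in_front_of", j)

# 用法示意:由语言指令解析得到的图,可作为布局/轨迹/形状三分支的条件
graph = EgoSceneGraph(ego_speed=5.0, objects=[
    SceneObject("car", (4.5, 1.8, 1.6), (10.0, 0.0, 0.0), 0.0, [(10.0, 0.0), (12.0, 0.0)]),
])
```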
点此查看论文截图




Do It Yourself (DIY): Modifying Images for Poems in a Zero-Shot Setting Using Weighted Prompt Manipulation
Authors:Sofia Jamil, Kotla Sai Charan, Sriparna Saha, Koustava Goswami, K J Joseph
Poetry is an expressive form of art that invites multiple interpretations, as readers often bring their own emotions, experiences, and cultural backgrounds into their understanding of a poem. Recognizing this, we aim to generate images for poems and improve these images in a zero-shot setting, enabling audiences to modify images as per their requirements. To achieve this, we introduce a novel Weighted Prompt Manipulation (WPM) technique, which systematically modifies attention weights and text embeddings within diffusion models. By dynamically adjusting the importance of specific words, WPM enhances or suppresses their influence in the final generated image, leading to semantically richer and more contextually accurate visualizations. Our approach exploits diffusion models and large language models (LLMs) such as GPT in conjunction with existing poetry datasets, ensuring a comprehensive and structured methodology for improved image generation in the literary domain. To the best of our knowledge, this is the first attempt at integrating weighted prompt manipulation for enhancing imagery in poetic language.
诗歌是一种表达性艺术形式,邀请读者进行多重解读,因为读者经常会将他们自己的情感、经历和文化背景带入对诗歌的理解中。我们认识到这一点,旨在为诗歌生成图像,并在零样本环境中改进这些图像,使观众能够根据他们的需求修改图像。为了实现这一点,我们引入了一种新颖的加权提示操纵(WPM)技术,该技术系统地修改了扩散模型中注意力权重和文本嵌入。通过动态调整特定词语的重要性,WPM可以增强或抑制它们在最终生成图像中的影响,从而得到语义更丰富、上下文更准确的可视化图像。我们的方法利用扩散模型和大语言模型(如GPT),结合现有的诗歌数据集,确保在文学领域改进图像生成的全面和结构化的方法。据我们所知,这是首次尝试将加权提示操纵整合到诗歌语言的图像增强中。
论文及项目相关链接
Summary
本文旨在通过引入加权提示操控(WPM)技术,在零样本设置下为诗歌生成图像并进行改进,使观众能够根据需求修改图像。WPM技术通过动态调整扩散模型中特定词语的注意力权重和文本嵌入,增强或抑制它们对最终生成图像的影响,从而实现语义更丰富、上下文更准确的可视化。该方法结合扩散模型、大型语言模型(LLMs)如GPT和现有诗歌数据集,确保在文学领域实现更优质图像生成的综合结构化方法。此为首次尝试将加权提示操控技术应用于增强诗歌语言的图像表现。
Key Takeaways
- 诗歌是一种邀请多重解读的表达艺术形式,读者常带入自身情感、经历和文化背景理解诗歌。
- 提出一种新颖的Weighted Prompt Manipulation (WPM)技术,能在零样本环境下为诗歌生成并改进图像。
- WPM技术通过系统地调整扩散模型中特定词语的注意力权重和文本嵌入,实现图像生成的精细化控制。
- WPM能够增强或抑制词语对生成图像的影响,生成语义更丰富、上下文更准确的图像。
- 该方法结合扩散模型、大型语言模型(LLMs)如GPT,确保文学领域图像生成的全面和结构化。
- 此为首次尝试将加权提示操控技术应用于诗歌语言的图像表现增强。
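
WPM 的核心是按词权重调节文本嵌入与注意力的影响。下面是“按权重缩放特定 token 的文本嵌入”的极简示意(token id 与权重均为假设,缩放方式为常见做法之一,并非论文的完整实现):

```python
import torch

def weight_prompt_embeddings(token_ids, embeddings, word_weights):
    """embeddings: (seq_len, dim) 的文本嵌入;word_weights: {token_id: weight}。
    将指定 token 的嵌入按权重缩放,weight>1 增强、<1 抑制其对生成图像的影响。"""
    scaled = embeddings.clone()
    for i, tid in enumerate(token_ids):
        w = word_weights.get(int(tid), 1.0)
        scaled[i] = embeddings[i] * w
    return scaled

# 用法示意(分词结果与嵌入均为假设的占位数据)
token_ids = torch.tensor([320, 1125, 9458, 49407])
embeds = torch.randn(4, 768)
out = weight_prompt_embeddings(token_ids, embeds, {1125: 1.5, 9458: 0.6})
```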
点此查看论文截图





DRAG: Data Reconstruction Attack using Guided Diffusion
Authors:Wa-Kin Lei, Jun-Cheng Chen, Shang-Tse Chen
With the rise of large foundation models, split inference (SI) has emerged as a popular computational paradigm for deploying models across lightweight edge devices and cloud servers, addressing data privacy and computational cost concerns. However, most existing data reconstruction attacks have focused on smaller CNN classification models, leaving the privacy risks of foundation models in SI settings largely unexplored. To address this gap, we propose a novel data reconstruction attack based on guided diffusion, which leverages the rich prior knowledge embedded in a latent diffusion model (LDM) pre-trained on a large-scale dataset. Our method performs iterative reconstruction on the LDM’s learned image prior, effectively generating high-fidelity images resembling the original data from their intermediate representations (IR). Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art methods, both qualitatively and quantitatively, in reconstructing data from deep-layer IRs of the vision foundation model. The results highlight the urgent need for more robust privacy protection mechanisms for large models in SI scenarios. Code is available at: https://github.com/ntuaislab/DRAG.
随着大型基础模型的兴起,分割推理(SI)已成为在轻量级边缘设备和云服务器之间部署模型的流行计算范式,解决了数据隐私和计算成本方面的担忧。然而,现有的大多数数据重建攻击主要集中在较小的CNN分类模型上,而基础模型在分割推理设置中的隐私风险尚未得到充分探索。为了弥补这一空白,我们提出了一种基于引导扩散的新型数据重建攻击方法,该方法利用嵌入在潜在扩散模型(LDM)中的丰富先验知识,该模型在大规模数据集上进行预训练。我们的方法对学习到的图像先验进行迭代重建,有效地从其中间表示生成与原始数据相似的高保真图像。大量实验表明,我们的方法在重建视觉基础模型的深层中间表示数据方面,无论是在定性还是定量上,都显著优于最先进的方法。结果强调了分割推理场景中大型模型对更稳健的隐私保护机制的迫切需求。代码可通过以下网址获得:https://github.com/ntuaislab/DRAG 。
论文及项目相关链接
PDF ICML 2025
Summary
随着大型基础模型的兴起,分割推理(SI)已成为部署模型于轻量级边缘设备和云服务器之间的热门计算范式,解决了数据隐私和计算成本问题。然而,现有的数据重建攻击主要关注较小的CNN分类模型,对于分割推理设置中基础模型的数据隐私问题探索不足。为解决这一空白,我们提出了一种基于引导扩散的新型数据重建攻击方法,该方法利用在大型数据集上预训练的潜在扩散模型(LDM)中的丰富先验知识。我们的方法对LDM学习的图像先验进行迭代重建,有效地从其中间表示(IR)生成高保真度图像,这些图像类似于原始数据。实验证明,我们的方法在重建视觉基础模型的深层IRs中的数据时,无论在定性还是定量方面均显著优于现有最新方法。这凸显了在分割推理场景中为大模型提供更稳健的隐私保护机制的迫切需求。
Key Takeaways
- 分割推理(SI)已成为部署大型模型的一种流行计算范式,解决了数据隐私和计算成本问题。
- 现有数据重建攻击主要关注较小的CNN模型,基础模型在分割推理设置中的隐私风险尚未得到充分探索。
- 提出了一种新型数据重建攻击方法,基于引导扩散和潜在扩散模型(LDM)的丰富先验知识。
- 方法通过迭代重建生成高保真度图像,这些图像从基础模型的中间表示(IR)生成,类似于原始数据。
- 相比现有方法,新型攻击方法在重建深层IRs中的数据时表现更优秀。
- 结果凸显了为大型模型在分割推理场景中提供更强隐私保护机制的必要性。
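
下面用去噪循环中的一个引导步骤示意“利用中间表示(IR)引导扩散重建”的思路:在每个去噪步上,用特征匹配损失的梯度修正当前潜变量(采样器细节、步长与损失形式均为示意性假设):

```python
import torch

def ir_guidance_step(x_t, t, denoiser, feature_fn, target_ir, guidance_scale=1.0):
    """在单个去噪步上,用 ||feature_fn(x0_hat) - target_ir||^2 的梯度修正 x_t。
    denoiser(x_t, t) 返回对干净图像 x0 的估计;feature_fn 模拟被攻击模型的前几层(假设可微)。"""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)
    loss = (feature_fn(x0_hat) - target_ir).pow(2).mean()
    grad = torch.autograd.grad(loss, x_t)[0]
    return x_t.detach() - guidance_scale * grad       # 朝降低特征差异的方向修正潜变量
```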
点此查看论文截图





SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
Authors:Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, Linfeng Zhang
Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel ‘Forecast-then-verify’ acceleration framework that effectively addresses both limitations. SpeCa’s core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our codes have been released in Github: \textbf{https://github.com/Shenyi-Z/Cache4Diffusion}
扩散模型已经彻底改变了高保真图像和视频合成的方式,但它们对计算资源的需求仍然对实时应用构成了巨大挑战。这些模型面临两大根本挑战:严格的时序依赖性阻止了并行化,以及在每个去噪步骤中都需要进行密集的计算前向传递。从大型语言模型的投机解码中汲取灵感,我们提出了SpeCa,这是一种新型的“预测-验证”加速框架,能够有效地解决这两个限制。SpeCa的核心创新在于在扩散模型中引入投机采样,基于完全计算得到的参考时间步长来预测后续时间步长的中间特征。我们的方法实现了一种无参数验证机制,有效地评估预测可靠性,能够在实时决策时接受或拒绝每个预测,同时产生可忽略的计算开销。此外,SpeCa引入了样本自适应计算分配,根据生成复杂度动态调整资源,为简单样本分配减少计算量,同时保留对复杂实例的密集处理。实验表明,在FLUX上实现了6.34倍的加速,质量略有下降(下降5.5%),在DiT上实现了7.3倍的加速同时保持生成保真度,在HunyuanVideo上实现了6.1倍加速,VBench得分为79.84%。验证机制产生的开销极小(仅占全推理成本的1.67%-3.5%),为扩散模型的高效推理建立了新范式,即使在极端的加速比下也能保持生成质量。我们的代码已在GitHub上发布:\textbf{https://github.com/Shenyi-Z/Cache4Diffusion}。
论文及项目相关链接
PDF 15 pages, 9 figures, ACM Multimedia 2025
Summary
扩散模型已经实现了高保真图像和视频合成,但其计算需求仍然对实时应用造成阻碍。本文提出SpeCa框架,通过引入“预测-验证”机制解决这两个问题。SpeCa采用前瞻性采样对扩散模型进行预测,并基于完全计算的参考时间点评估预测可靠性。此外,SpeCa还引入了样本自适应计算分配,可根据生成复杂度动态调整资源分配。实验表明,SpeCa在保证质量的前提下加速了扩散模型的推理过程。
Key Takeaways
- 扩散模型在高保真图像和视频合成方面表现出显著优势,但计算需求大,不适用于实时应用。
- SpeCa框架通过引入“预测-验证”机制解决了扩散模型的两个挑战:严格的时间依赖性和计算密集型的正向传递。
- SpeCa采用前瞻性采样,基于完全计算的参考时间点预测后续时间步长的中间特征。
- SpeCa实现了无参数的验证机制,有效评估预测可靠性,支持实时决策。
- SpeCa引入样本自适应计算分配,根据生成复杂度动态调整资源,实现更高效推理。
- 实验结果表明,SpeCa在多个数据集上实现了显著加速,同时保持生成质量。
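
下面用“线性外推 + 轻量漂移校验”给出“预测-验证”机制的一个高度简化的草图(外推方式、校验判据与阈值均为假设,并非论文中无参数验证机制的具体实现):

```python
import torch

def forecast_feature(f_prev2, f_prev1):
    """由最近两个完整计算的时间步特征做线性外推,预测下一步的中间特征。"""
    return f_prev1 + (f_prev1 - f_prev2)

def accept_prediction(f_pred, f_last_full, rel_tol=0.15):
    """轻量校验:若预测特征相对最近一次完整计算的特征漂移过大,
    则拒绝该预测并回退到完整前向计算(阈值与判据均为示意性假设)。"""
    drift = (f_pred - f_last_full).norm() / (f_last_full.norm() + 1e-8)
    return bool(drift < rel_tol)
```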
点此查看论文截图





ToMA: Token Merge with Attention for Image Generation with Diffusion Models
Authors:Wenbo Lu, Shaoyi Zheng, Yuxuan Xia, Shengjie Wang
Diffusion models excel in high-fidelity image generation but face scalability limits due to transformers’ quadratic attention complexity. Plug-and-play token reduction methods like ToMeSD and ToFu reduce FLOPs by merging redundant tokens in generated images but rely on GPU-inefficient operations (e.g., sorting, scattered writes), introducing overheads that negate theoretical speedups when paired with optimized attention implementations (e.g., FlashAttention). To bridge this gap, we propose Token Merge with Attention (ToMA), an off-the-shelf method that redesigns token reduction for GPU-aligned efficiency, with three key contributions: 1) a reformulation of token merge as a submodular optimization problem to select diverse tokens; 2) merge/unmerge as an attention-like linear transformation via GPU-friendly matrix operations; and 3) exploiting latent locality and sequential redundancy (pattern reuse) to minimize overhead. ToMA reduces SDXL/Flux generation latency by 24%/23%, respectively (with DINO $\Delta < 0.07$), outperforming prior methods. This work bridges the gap between theoretical and practical efficiency for transformers in diffusion.
扩散模型在高保真图像生成方面表现出色,但由于Transformer的二次注意力复杂度而面临可扩展性限制。像ToMeSD和ToFu这样的即插即用令牌缩减方法通过合并生成图像中的冗余令牌来减少浮点运算量,但它们依赖于GPU效率低下的操作(例如排序、分散写入),当与优化的注意力实现(例如FlashAttention)配合使用时,这些操作引入的开销会抵消理论上的加速收益。为了弥补这一差距,我们提出了结合注意力的令牌合并(ToMA),这是一种即插即用的方法,针对GPU对齐的效率重新设计了令牌缩减,主要有三个关键贡献:首先,将令牌合并重新表述为子模优化问题,以选择多样化的令牌;其次,通过GPU友好的矩阵操作,将合并/还原实现为类似注意力的线性变换;最后,利用潜在的局部性和序列冗余(模式重用)来最小化开销。ToMA分别将SDXL/Flux的生成延迟降低了24%/23%(DINO $\Delta < 0.07$),优于先前的方法。这项工作缩小了扩散模型中Transformer理论效率与实际效率之间的差距。
论文及项目相关链接
PDF In proceedings of the 42nd International Conference on Machine Learning (ICML 2025). Code available at https://github.com/wenboluu/ToMA
Summary
本文探讨了扩散模型在高保真图像生成中的优势及其面临的挑战,即Transformer的二次注意力复杂度导致的可扩展性限制。文章介绍了即插即用型令牌缩减方法,如ToMeSD和ToFu,它们通过合并生成图像中的冗余令牌来减少FLOPs。然而,这些方法依赖于GPU效率不高的操作,如排序和分散写入,当与优化的注意力实现(如FlashAttention)配合使用时,引入的开销会抵消理论上的加速。为了弥补这一差距,本文提出了Token Merge with Attention(ToMA)方法,通过三项关键贡献重新设计令牌缩减以提高GPU对齐效率:1)将令牌合并重新表述为子模优化问题以选择多样化的令牌;2)通过GPU友好的矩阵操作,将合并/还原实现为类似注意力的线性变换;3)利用潜在局部性和序列冗余(模式重用)来最小化开销。ToMA将SDXL/Flux的生成延迟分别降低了24%/23%(DINO特征变化 $\Delta < 0.07$),优于先前的方法。这项工作拉近了Transformer在扩散模型中理论效率和实际效率之间的差距。
Key Takeaways
- 扩散模型在高保真图像生成中表现出色,但面临可扩展性挑战,主要是由于Transformer的二次注意力复杂度。
- 现有方法如ToMeSD和ToFu虽然能减少令牌数量,但存在GPU效率不高的问题,引入开销,抵消了理论加速优势。
- ToMA方法通过三项关键贡献重新设计令牌减少策略以提高GPU效率:令牌合并作为子模块优化问题,合并/解除合并作为类似注意力的线性转换,以及利用潜在局部性和顺序冗余。
- ToMA方法能有效减少SDXL和Flux的生成延迟,且对图像质量影响较小(DINO特征变化 Δ<0.07)。
- ToMA方法拉近了扩散模型中理论效率和实际效率之间的差距。
- ToMA是一种即插即用的方法,可以轻松地集成到现有的扩散模型中。
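
下面示意“合并/还原作为类注意力的线性变换”:合并为 M @ X(按簇平均),还原为把簇特征广播回原 token。簇分配此处用均匀分组代替论文中的子模选择,仅作演示:

```python
import torch

def build_merge_matrix(assign, num_clusters):
    """assign: (N,) 每个 token 的簇编号。返回 (K, N) 的合并矩阵,行内做平均。"""
    N = assign.numel()
    M = torch.zeros(num_clusters, N)
    M[assign, torch.arange(N)] = 1.0
    M = M / M.sum(dim=1, keepdim=True).clamp(min=1.0)   # 每簇取平均
    return M

N, K, D = 1024, 256, 64
x = torch.randn(N, D)
assign = torch.arange(N) % K                    # 演示用的均匀分组(实际应为子模选择结果)
M = build_merge_matrix(assign, K)
x_merged = M @ x                                # 合并:注意力只需在 K 个 token 上计算
x_unmerged = (M > 0).float().T @ x_merged       # 还原:把簇特征广播回 N 个 token
```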
点此查看论文截图




STADI: Fine-Grained Step-Patch Diffusion Parallelism for Heterogeneous GPUs
Authors:Han Liang, Jiahui Zhou, Zicheng Zhou, Xiaoxi Zhang, Xu Chen
The escalating adoption of diffusion models for applications such as image generation demands efficient parallel inference techniques to manage their substantial computational cost. However, existing diffusion parallelism inference schemes often underutilize resources in heterogeneous multi-GPU environments, where varying hardware capabilities or background tasks cause workload imbalance. This paper introduces Spatio-Temporal Adaptive Diffusion Inference (STADI), a novel framework to accelerate diffusion model inference in such settings. At its core is a hybrid scheduler that orchestrates fine-grained parallelism across both temporal and spatial dimensions. Temporally, STADI introduces a novel computation-aware step allocator applied after warmup phases, using a least-common-multiple-minimizing quantization technique to reduce denoising steps on slower GPUs and execution synchronization. To further minimize GPU idle periods, STADI executes an elastic patch parallelism mechanism that allocates variably sized image patches to GPUs according to their computational capability, ensuring balanced workload distribution through a complementary spatial mechanism. Extensive experiments on both load-imbalanced and heterogeneous multi-GPU clusters validate STADI’s efficacy, demonstrating improved load balancing and mitigation of performance bottlenecks. Compared to patch parallelism, a state-of-the-art diffusion inference framework, our method significantly reduces end-to-end inference latency by up to 45% and significantly improves resource utilization on heterogeneous GPUs.
随着扩散模型在图像生成等应用中的日益普及,需要高效的并行推理技术来管理其巨大的计算成本。然而,现有的扩散并行推理方案在异构多GPU环境中往往不能充分利用资源,因为硬件能力的差异或后台任务会导致工作负载不平衡。本文介绍了时空自适应扩散推理(STADI),这是一个在此类环境中加速扩散模型推理的新型框架。其核心是一个混合调度器,它在时间和空间两个维度上协调细粒度的并行。在时间维度上,STADI在预热阶段后引入了一种新型的计算感知步数分配器,使用最小化最小公倍数的量化技术来减少较慢GPU上的去噪步数并降低执行同步开销。为了进一步减少GPU空闲时间,STADI执行弹性分块(patch)并行机制,根据GPU的计算能力分配不同大小的图像分块,通过互补的空间机制确保工作负载的均衡分布。在负载不平衡和异构多GPU集群上进行的大量实验验证了STADI的有效性,显示其改善了负载均衡并缓解了性能瓶颈。与最先进的扩散推理框架——分块并行(patch parallelism)相比,我们的方法将端到端推理延迟最多降低了45%,并显著提高了异构GPU上的资源利用率。
论文及项目相关链接
Summary
本文主要探讨了在图像生成等应用中,扩散模型的日益增长的使用导致的计算成本问题。为了加速在这种环境中的扩散模型推理,提出了一种名为Spatio-Temporal Adaptive Diffusion Inference (STADI)的新型框架。该框架通过混合调度程序在时间和空间两个维度上实现精细的并行性,以提高资源利用率和推理速度。
Key Takeaways
- 扩散模型在图像生成等应用中的使用正在增加,导致巨大的计算成本。
- 现有扩散并行推理方案在异构多GPU环境中往往资源利用不足。
- Spatio-Temporal Adaptive Diffusion Inference (STADI)框架被引入以加速推理过程。
- STADI的核心是一个混合调度程序,它在时间和空间两个维度上实现精细的并行性。
- STADI使用计算感知的步骤分配器,在预热阶段后减少降噪步骤和执行同步,以减少GPU空闲时间。
- 通过弹性补丁并行机制,STADI能够根据GPU的计算能力分配不同大小的图像补丁,确保工作负载的平衡分布。
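
负载均衡的核心想法可以用“按各 GPU 吞吐量比例分配去噪步数或图像分块行数”的小函数来示意(比例分配与取整策略为假设,未包含论文中最小公倍数量化等细节):

```python
def allocate_by_throughput(total_units, throughputs):
    """按各 GPU 的吞吐量比例分配整数工作量(去噪步数或 patch 行数),
    余数按小数部分从大到小补齐,保证总量不变。"""
    total_tp = sum(throughputs)
    shares = [total_units * tp / total_tp for tp in throughputs]
    alloc = [int(s) for s in shares]
    remainder = total_units - sum(alloc)
    order = sorted(range(len(throughputs)), key=lambda i: shares[i] - alloc[i], reverse=True)
    for i in order[:remainder]:
        alloc[i] += 1
    return alloc

# 用法示意:3 块异构 GPU 分摊 50 个去噪步 / 64 行的 latent patch
print(allocate_by_throughput(50, [1.0, 0.6, 0.4]))   # [25, 15, 10]
print(allocate_by_throughput(64, [1.0, 0.6, 0.4]))   # [32, 19, 13]
```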
点此查看论文截图






MEPG:Multi-Expert Planning and Generation for Compositionally-Rich Image Generation
Authors:Yuan Zhao, Lin Liu
Text-to-image diffusion models have achieved remarkable image quality, but they still struggle with complex, multi-element prompts, and limited stylistic diversity. To address these limitations, we propose a Multi-Expert Planning and Generation Framework (MEPG) that synergistically integrates position- and style-aware large language models (LLMs) with spatial-semantic expert modules. The framework comprises two core components: (1) a Position-Style-Aware (PSA) module that utilizes a supervised fine-tuned LLM to decompose input prompts into precise spatial coordinates and style-encoded semantic instructions; and (2) a Multi-Expert Diffusion (MED) module that implements cross-region generation through dynamic expert routing across both local regions and global areas. During the generation process for each local region, specialized models (e.g., realism experts, stylization specialists) are selectively activated for each spatial partition via attention-based gating mechanisms. The architecture supports lightweight integration and replacement of expert models, providing strong extensibility. Additionally, an interactive interface enables real-time spatial layout editing and per-region style selection from a portfolio of experts. Experiments show that MEPG significantly outperforms baseline models with the same backbone in both image quality and style diversity.
文本到图像的扩散模型已经取得了显著的图像质量,但它们在复杂的多元素提示以及风格多样性方面仍然存在不足。为了解决这些限制,我们提出了一个多专家规划与生成框架(MEPG),该框架协同整合了位置感知和风格感知的大型语言模型(LLM)与空间语义专家模块。该框架包括两个核心组件:(1)位置-风格感知(PSA)模块,它利用经过监督微调的LLM将输入提示分解成精确的空间坐标和带风格编码的语义指令;(2)多专家扩散(MED)模块,通过在局部区域和全局区域之间进行动态专家路由实现跨区域生成。在每个局部区域的生成过程中,通过基于注意力的门控机制,针对每个空间分区有选择地激活专业模型(例如写实专家、风格化专家)。该架构支持专家模型的轻量级集成和替换,具有很强的可扩展性。此外,交互式界面支持实时编辑空间布局,并可从专家组合中为每个区域选择风格。实验表明,MEPG在图像质量和风格多样性方面都显著优于采用相同主干网络(backbone)的基线模型。
论文及项目相关链接
Summary
文本到图像扩散模型已实现了显著的图像质量,但仍面临复杂多元素提示和风格多样性受限的问题。针对这些局限性,提出了多专家规划生成框架(MEPG),该框架协同集成了位置感知和风格感知的大型语言模型(LLM)与空间语义专家模块。实验表明,MEPG在图像质量和风格多样性方面都显著优于基线模型。
Key Takeaways
- 文本到图像扩散模型虽已实现良好图像质量,但仍面临处理复杂多元素提示和风格多样性有限的挑战。
- 提出了一种新的多专家规划生成框架(MEPG),集成位置感知和风格感知的大型语言模型(LLM)与空间语义专家模块以解决这些挑战。
- MEPG框架包含两个核心组件:位置风格感知(PSA)模块和多专家扩散(MED)模块。
- PSA模块利用监督微调的大型语言模型将输入提示分解为精确的空间坐标和风格编码语义指令。
- MED模块通过动态专家路由在局部区域和全局区域进行跨区域生成。
- MEPG支持轻松集成和替换专家模型,具有强大的可扩展性。
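
下面用一个极简的数据结构和按区域选择专家的函数,示意 PSA 输出(空间坐标 + 风格指令)与 MED 专家路由的衔接方式(字段与路由规则均为假设,实际系统使用基于注意力的门控):

```python
from dataclasses import dataclass
from typing import Tuple, Dict

@dataclass
class RegionSpec:
    bbox: Tuple[float, float, float, float]   # 归一化坐标 (x0, y0, x1, y1)
    prompt: str                               # 该区域的语义描述
    style: str                                # 风格标签,例如 "realism" / "watercolor"

def route_expert(region: RegionSpec, experts: Dict[str, str], default="realism"):
    """按区域的风格标签选择专家;未注册的风格回退到默认专家(示意性路由规则)。"""
    return experts.get(region.style, experts[default])

regions = [
    RegionSpec((0.0, 0.5, 1.0, 1.0), "a calm lake", "realism"),
    RegionSpec((0.1, 0.0, 0.9, 0.5), "mountains at sunset", "watercolor"),
]
experts = {"realism": "realism-expert", "watercolor": "watercolor-expert"}
chosen = [route_expert(r, experts) for r in regions]
```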
点此查看论文截图






KB-DMGen: Knowledge-Based Global Guidance and Dynamic Pose Masking for Human Image Generation
Authors:Shibang Liu, Xuemei Xie, Guangming Shi
Recent methods using diffusion models have made significant progress in Human Image Generation (HIG) with various control signals such as pose priors. In HIG, both accurate human poses and coherent visual quality are crucial for image generation. However, most existing methods mainly focus on pose accuracy while neglecting overall image quality, often improving pose alignment at the cost of image quality. To address this, we propose Knowledge-Based Global Guidance and Dynamic pose Masking for human image Generation (KB-DMGen). The Knowledge Base (KB), implemented as a visual codebook, provides coarse, global guidance based on input text-related visual features, improving pose accuracy while maintaining image quality, while the Dynamic pose Mask (DM) offers fine-grained local control to enhance precise pose accuracy. By injecting KB and DM at different stages of the diffusion process, our framework enhances pose accuracy through both global and local control without compromising image quality. Experiments demonstrate the effectiveness of KB-DMGen, achieving new state-of-the-art results in terms of AP and CAP on the HumanArt dataset. The project page and code are available at https://lushbng.github.io/KBDMGen.
近期基于扩散模型的方法在人体图像生成(HIG)中利用姿态先验等多种控制信号取得了显著进展。在HIG中,准确的人体姿态和连贯的视觉质量对图像生成都至关重要。然而,大多数现有方法主要关注姿态准确性而忽视整体图像质量,往往以牺牲图像质量为代价来提升姿态对齐。为了解决这个问题,我们提出了基于知识的全局引导和动态姿态掩码的人体图像生成方法(KB-DMGen)。知识库(KB)以视觉码本的形式实现,基于与输入文本相关的视觉特征提供粗粒度的全局引导,在提高姿态准确性的同时保持图像质量;而动态姿态掩码(DM)则提供细粒度的局部控制,以进一步提升姿态的精确性。通过在扩散过程的不同阶段注入KB和DM,我们的框架通过全局和局部控制提高了姿态准确性,同时不牺牲图像质量。实验表明了KB-DMGen的有效性,在HumanArt数据集上取得了AP和CAP的最新最优结果。项目页面和代码可通过 https://lushbng.github.io/KBDMGen 访问。
论文及项目相关链接
Summary
基于扩散模型的新方法在人类图像生成(HIG)方面取得了显著进展,通过姿势先验等控制信号实现了准确的人体姿势和连贯的视觉质量。为解决多数方法仅注重姿势准确性而忽视整体图像质量的问题,提出基于知识的全局引导和动态姿势遮罩(KB-DMGen)。知识库提供基于输入文本的视觉特征进行粗略的全局指导,维持图像质量的同时提高姿势准确性;动态姿势遮罩则提供精细的局部控制,进一步提高精确姿势的准确性。通过在不同扩散阶段注入KB和DM,框架能在全局和局部控制中提高姿势准确性,且不会损害图像质量。
Key Takeaways
- 扩散模型在人体图像生成(HIG)中取得显著进展。
- 目前方法多侧重于姿势准确性,但可能忽视整体图像质量。
- KB-DMGen方法结合了知识库和动态姿势遮罩技术。
- 知识库提供全局指导以提高姿势准确性并维持图像质量。
- 动态姿势遮罩用于精细的局部控制,进一步提高姿势的准确性。
- KB-DMGen通过全局和局部控制增强姿势准确性,同时不损害图像质量。
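
“知识库(视觉码本)提供粗粒度全局引导”可以用最近邻码本查询来示意(码本来源、特征维度与注入方式均为假设):

```python
import torch
import torch.nn.functional as F

def codebook_lookup(text_feat, codebook):
    """text_feat: (B, D) 与输入文本相关的视觉特征;codebook: (K, D) 视觉码本。
    返回与每个样本最相近的码本向量,可作为全局引导注入扩散过程。"""
    sim = F.normalize(text_feat, dim=-1) @ F.normalize(codebook, dim=-1).T   # 余弦相似度 (B, K)
    idx = sim.argmax(dim=-1)
    return codebook[idx]                                                     # (B, D) 全局引导向量
```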
点此查看论文截图





SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models
Authors:Joon Hyun Park, Kumju Jo, Sungyong Baik
Entrusted with the goal of pixel-level object classification, the semantic segmentation networks entail the laborious preparation of pixel-level annotation masks. To obtain pixel-level annotation masks for a given class without human efforts, recent few works have proposed to generate pairs of images and annotation masks by employing image and text relationships modeled by text-to-image generative models, especially Stable Diffusion. However, these works do not fully exploit the capability of text-guided Diffusion models and thus require a pre-trained segmentation network, careful text prompt tuning, or the training of a segmentation network to generate final annotation masks. In this work, we take a closer look at attention mechanisms of Stable Diffusion, from which we draw connections with classical seeded segmentation approaches. In particular, we show that cross-attention alone provides very coarse object localization, which however can provide initial seeds. Then, akin to region expansion in seeded segmentation, we utilize the semantic-correspondence-modeling capability of self-attention to iteratively spread the attention to the whole class from the seeds using multi-scale self-attention maps. We also observe that a simple-text-guided synthetic image often has a uniform background, which is easier to find correspondences, compared to complex-structured objects. Thus, we further refine a mask using a more accurate background mask. Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion, without additional training procedure, prompt tuning, or a pre-trained segmentation network.
以像素级目标分类为任务目标的语义分割网络需要繁琐的像素级注释掩膜制备工作。为了在无需人工参与的情况下获得给定类别的像素级注释掩膜,近期的一些研究提出利用文本到图像生成模型(特别是Stable Diffusion)建模的图像与文本关系,生成图像与注释掩膜对。然而,这些研究并未充分利用文本引导扩散模型的能力,因此仍需要预训练的分割网络、精细的文本提示调整或训练一个分割网络来生成最终的注释掩膜。在本工作中,我们深入研究了Stable Diffusion的注意力机制,并将其与经典的种子分割(seeded segmentation)方法联系起来。具体而言,我们发现仅靠交叉注意力只能提供非常粗糙的目标定位,但足以提供初始种子。随后,类似于种子分割中的区域扩张,我们利用自注意力建模语义对应关系的能力,借助多尺度自注意力图将注意力从种子迭代扩散到整个类别。我们还观察到,简单文本引导合成的图像通常具有较为均匀的背景,相比结构复杂的物体更容易找到对应关系,因此我们进一步利用更准确的背景掩膜来细化结果。我们提出的方法SeeDiff无需额外的训练过程、提示调整或预训练分割网络,即可直接从Stable Diffusion生成高质量的掩膜。
论文及项目相关链接
PDF AAAI 2025
Summary
本文探讨了利用文本引导的稳定扩散模型生成像素级标注掩膜的方法。通过深入研究稳定扩散模型的注意力机制,提出了一种名为SeeDiff的方法,无需预训练分割网络、精细文本提示调整或训练分割网络,即可直接从稳定扩散模型中生成高质量掩膜。
Key Takeaways
- 稳定扩散模型被用于生成图像和标注掩膜对,以减轻像素级标注的繁重工作。
- 当前方法主要依赖于文本引导的稳定扩散模型的能力,但在生成最终标注掩膜时仍存在不足。
- 深入研究稳定扩散模型的注意力机制,发现交叉注意力可提供较粗糙的目标定位,而自注意力有助于从种子点扩展到整个类别的语义对应。
- SeeDiff方法利用稳定扩散模型生成高质量掩膜,无需额外的训练过程、提示调整或预训练分割网络。
- 简单文本引导的合成图像通常有统一的背景,这更容易找到对应关系,相比于复杂结构的物体。
- 使用更准确的背景掩膜进一步精细化生成的掩膜。
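
“交叉注意力给出初始种子、自注意力迭代扩张”可以用自注意力矩阵与种子掩码的反复相乘来示意(阈值与迭代次数为假设,省略了多尺度与背景细化步骤):

```python
import torch

def expand_seeds(self_attn, seed_mask, iters=5, thresh=0.3):
    """self_attn: (N, N) 行归一化的自注意力图;seed_mask: (N,) 0/1 初始种子。
    每次迭代把注意力从已选 token 传播到语义对应的 token,类似种子区域生长。"""
    mask = seed_mask.float()
    for _ in range(iters):
        spread = self_attn @ mask                 # 每个 token 对当前掩码区域的注意力总量
        spread = spread / (spread.max() + 1e-8)
        mask = torch.maximum(mask, (spread > thresh).float())
    return mask
```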
点此查看论文截图





STRICT: Stress Test of Rendering Images Containing Text
Authors:Tianyu Zhang, Xinyu Wang, Lu Li, Zhenghan Tai, Jijun Chi, Jingrui Tian, Hailin He, Suyuchen Wang
While diffusion models have revolutionized text-to-image generation with their ability to synthesize realistic and diverse scenes, they continue to struggle to generate consistent and legible text within images. This shortcoming is commonly attributed to the locality bias inherent in diffusion-based generation, which limits their ability to model long-range spatial dependencies. In this paper, we introduce $\textbf{STRICT}$, a benchmark designed to systematically stress-test the ability of diffusion models to render coherent and instruction-aligned text in images. Our benchmark evaluates models across multiple dimensions: (1) the maximum length of readable text that can be generated; (2) the correctness and legibility of the generated text, and (3) the ratio of not following instructions for generating text. We evaluate several state-of-the-art models, including proprietary and open-source variants, and reveal persistent limitations in long-range consistency and instruction-following capabilities. Our findings provide insights into architectural bottlenecks and motivate future research directions in multimodal generative modeling. We release our entire evaluation pipeline at https://github.com/tianyu-z/STRICT-Bench.
扩散模型凭借合成逼真且多样化场景的能力,在文本到图像的生成中引发了革命。然而,它们仍难以在图像内部生成连贯且清晰的文本。这一缺陷通常归因于扩散式生成中固有的局部性偏差,限制了它们对长程空间依赖关系的建模能力。在本文中,我们介绍了$\textbf{STRICT}$,这是一个旨在系统地压力测试扩散模型在图像中呈现连贯且符合指令的文本的能力的基准。我们的基准从多个维度评估模型:(1)可生成的可读文本的最大长度;(2)生成文本的正确性和清晰度;(3)文本生成过程中不遵循指令的比例。我们评估了包括专有和开源变体在内的若干最先进模型,并揭示了它们在长程一致性以及指令遵循能力方面的持续局限性。我们的发现为架构瓶颈提供了见解,并为多模态生成建模的未来研究指明了方向。我们已在 https://github.com/tianyu-z/STRICT-Bench 发布了完整的评估管道。
论文及项目相关链接
PDF Accepted as a main conference paper at EMNLP 2025
Summary
扩散模型在文本到图像生成领域表现出革命性的能力,能合成逼真且多样的场景。然而,它们仍然难以在图像内生成连贯和清晰的文本。这一缺陷通常归因于扩散生成中的局部偏见,限制了模型对长程空间依赖关系的建模能力。本文介绍了一个名为STRICT的基准测试,旨在系统地测试扩散模型在图像中呈现连贯且符合指令的文本的能力。该基准测试从多个维度评估模型的表现,包括可生成文本的最大长度、生成的文本的正确性和清晰度以及遵循指令生成文本的比例。我们对包括专有和开源变体在内的最先进模型进行了评估,并揭示了长程一致性和遵循指令能力方面的持续局限性。我们的发现为架构瓶颈提供了见解,并为多模态生成建模的未来研究提供了方向。我们已发布完整的评估管道:https://github.com/tianyu-z/STRICT-Bench。
Key Takeaways
- 扩散模型在文本到图像生成方面具有强大的能力,但生成连贯和清晰文本仍存在困难。
- 局部偏见限制了扩散模型对长程空间依赖关系的建模能力。
- 引入的STRICT基准测试旨在评估扩散模型在图像中呈现连贯且符合指令文本的能力。
- 评估模型表现包括可生成文本的最大长度、文本的清晰度和正确性,以及遵循指令的比例。
- 对现有模型的评估揭示了长程一致性和遵循指令能力方面的局限性。
- 研究发现了架构瓶颈的问题。
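
基准的三个维度(最大可读长度、正确性/可读性、不遵循指令的比例)可以用基于 OCR 结果的简单指标函数来示意(OCR 输出格式与判定阈值均为假设,并非官方评测脚本):

```python
def char_accuracy(ocr_text: str, target_text: str) -> float:
    """用编辑距离衡量生成文本的正确性:1 - 归一化 Levenshtein 距离。"""
    m, n = len(ocr_text), len(target_text)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (ocr_text[i - 1] != target_text[j - 1]))
            prev = cur
    return 1.0 - dp[n] / max(m, n, 1)

def instruction_violation_ratio(results, acc_thresh=0.8):
    """results: [(ocr_text, target_text), ...];字符准确率低于阈值视为未遵循指令。"""
    bad = sum(1 for o, t in results if char_accuracy(o, t) < acc_thresh)
    return bad / max(len(results), 1)
```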
点此查看论文截图




OSDM-MReg: Multimodal Image Registration based One Step Diffusion Model
Authors:Xiaochen Wei, Weiwei Guo, Wenxian Yu, Feiming Wei, Dongying Li
Multimodal remote sensing image registration aligns images from different sensors for data fusion and analysis. However, existing methods often struggle to extract modality-invariant features when faced with large nonlinear radiometric differences, such as those between SAR and optical images. To address these challenges, we propose OSDM-MReg, a novel multimodal image registration framework that bridges the modality gap through image-to-image translation. Specifically, we introduce a one-step unaligned target-guided conditional diffusion model (UTGOS-CDM) to translate source and target images into a unified representation domain. Unlike traditional conditional DDPM that require hundreds of iterative steps for inference, our model incorporates a novel inverse translation objective during training to enable direct prediction of the translated image in a single step at test time, significantly accelerating the registration process. After translation, we design a multimodal multiscale registration network (MM-Reg) that extracts and fuses both unimodal and translated multimodal images using the proposed multimodal fusion strategy, enhancing the robustness and precision of alignment across scales and modalities. Extensive experiments on the OSdataset demonstrate that OSDM-MReg achieves superior registration accuracy compared to state-of-the-art methods.
多模态遥感图像配准是为了数据融合和分析而对不同传感器的图像进行对齐。然而,现有的方法在面对如SAR和光学图像之间的大非线性辐射差异时,往往难以提取模态不变特征。为了解决这些挑战,我们提出了OSDM-MReg,这是一种新的多模态图像配准框架,它通过图像到图像的翻译来弥合模态之间的差距。具体来说,我们引入了一步式未对齐目标引导条件扩散模型(UTGOS-CDM),将源图像和目标图像翻译成一个统一的表现域。与传统的需要数百个迭代步骤进行推断的条件DDPM不同,我们的模型在训练过程中融入了新颖的反向翻译目标,从而在测试时能够在单步中直接预测翻译后的图像,显著加速了配准过程。翻译后,我们设计了一个多模态多尺度配准网络(MM-Reg),该网络使用所提出的多模态融合策略来提取和融合单模态和翻译后的多模态图像,提高了跨尺度和模态对齐的稳健性和精度。在OSdataset上的广泛实验表明,OSDM-MReg的配准精度优于最先进的方法。
论文及项目相关链接
PDF This version updates our previous submission. After rerunning the experiments, we found that the proposed high-frequency perceptual loss did not improve the overall performance of the model. Therefore, we removed this component, revised the corresponding ablation studies, and updated the contributions accordingly. This work has been submitted to the IEEE for possible publication
Summary
多模态遥感图像配准旨在将不同传感器的图像进行对齐,以便数据融合和分析。针对现有方法在应对大幅非线性辐射差异(如SAR和光学图像之间的差异)时难以提取模态不变特征的问题,我们提出了OSDM-MReg新型多模态图像配准框架,它通过图像到图像的翻译来弥补模态差异。我们引入了一步式未对齐目标引导条件扩散模型(UTGOS-CDM),将源图像和目标图像翻译到统一表示域。与传统需要数百次迭代推理的条件DDPM不同,我们的模型在训练过程中融入了逆向翻译目标,使得测试时能够在单步内直接预测翻译后的图像,大幅加速配准过程。翻译后,我们设计了一个多模态多尺度配准网络(MM-Reg),结合单模态和翻译后的多模态图像,采用提出的多模态融合策略,提高了跨尺度和模态对齐的稳健性和精度。在OSdataset上的广泛实验表明,OSDM-MReg的配准精度优于现有最先进的方法。
Key Takeaways
- 多模态遥感图像配准是为了数据融合和分析而将对齐来自不同传感器的图像的技术。
- 现有方法在应对大幅非线性辐射差异时面临挑战。
- OSDM-MReg框架通过图像到图像的翻译来弥补模态差异。
- UTGOS-CDM模型能够在单步内直接预测翻译后的图像,加速配准过程。
- MM-Reg网络结合了单模态和翻译后的多模态图像,提高了配准的稳健性和精度。
- OSDM-MReg在OSdataset上的实验表现优于现有的最先进的配准方法。
点此查看论文截图








Marigold-DC: Zero-Shot Monocular Depth Completion with Guided Diffusion
Authors:Massimiliano Viola, Kevin Qu, Nando Metzger, Bingxin Ke, Alexander Becker, Konrad Schindler, Anton Obukhov
Depth completion upgrades sparse depth measurements into dense depth maps guided by a conventional image. Existing methods for this highly ill-posed task operate in tightly constrained settings and tend to struggle when applied to images outside the training domain or when the available depth measurements are sparse, irregularly distributed, or of varying density. Inspired by recent advances in monocular depth estimation, we reframe depth completion as an image-conditional depth map generation guided by sparse measurements. Our method, Marigold-DC, builds on a pretrained latent diffusion model for monocular depth estimation and injects the depth observations as test-time guidance via an optimization scheme that runs in tandem with the iterative inference of denoising diffusion. The method exhibits excellent zero-shot generalization across a diverse range of environments and handles even extremely sparse guidance effectively. Our results suggest that contemporary monocular depth priors greatly robustify depth completion: it may be better to view the task as recovering dense depth from (dense) image pixels, guided by sparse depth; rather than as inpainting (sparse) depth, guided by an image. Project website: https://MarigoldDepthCompletion.github.io/
深度补全将稀疏的深度测量值升级为在常规图像引导下生成的密集深度图。针对这一高度不适定的任务,现有方法通常在严格受限的环境中使用,当应用于训练域外的图像或可用的深度测量值稀疏、分布不均或密度不同时,这些方法往往表现不佳。受单眼深度估计最新进展的启发,我们将深度补全重新构建为一种以稀疏测量为引导的图像条件深度图生成任务。我们的方法Marigold-DC建立在用于单眼深度估计的预训练潜在扩散模型的基础上,通过将深度观测值作为测试时的指导,并通过与去噪扩散的迭代推断并行运行的优化方案来注入。该方法在多种环境中具有良好的零样本泛化能力,并能有效地处理极其稀疏的引导。我们的结果建议,现代单眼深度先验可以极大地提高深度补全的鲁棒性:更好的做法是视此任务为从密集图像像素恢复密集深度,受稀疏深度引导;而非在图像引导下对稀疏深度进行填充修复。项目网站:https://MarigoldDepthCompletion.github.io/
论文及项目相关链接
PDF ICCV 2025
Summary
本文提出一种名为Marigold-DC的深度完成方法,它将深度完成任务重新定义为在稀疏测量引导下,基于图像的深度图生成。该方法利用预训练的潜在扩散模型进行单眼深度估计,并通过优化方案在测试时注入深度观测值,该优化方案与去噪扩散的迭代推断并行运行。该方法具有良好的零样本泛化能力,可处理各种环境,甚至对极稀疏的引导也能有效处理。
Key Takeaways
- Marigold-DC方法将深度完成定义为在稀疏测量引导下,基于图像的深度图生成。
- 利用预训练的潜在扩散模型进行单眼深度估计。
- 通过优化方案在测试时注入深度观测值。
- 该方法具有良好的零样本泛化能力,能适应多种环境。
- Marigold-DC能有效处理极稀疏的引导情况。
- 现有方法往往在某些约束条件下表现良好,但在图像超出训练域或可用深度测量稀疏、分布不均、密度不一的情况下表现不佳。
- 最佳实践可能是将任务视为从密集图像像素中恢复密集深度,由稀疏深度引导,而不是依靠图像引导来填充(稀疏)深度。
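
“把稀疏深度观测作为测试时引导”可以用每个去噪步上对稀疏观测点的拟合损失加几步梯度更新来示意(与论文的具体优化方案并不相同,仅为简化草图;depth_decoder 为假设的可微接口):

```python
import torch

def sparse_depth_guidance(latent, t, depth_decoder, sparse_depth, mask, lr=0.05, steps=3):
    """latent: 当前去噪步的潜变量;depth_decoder(latent, t) 返回稠密深度预测。
    只在 mask=1 的稀疏观测像素上计算 L1 损失,并对 latent 做几步梯度下降。"""
    latent = latent.detach().requires_grad_(True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = depth_decoder(latent, t)
        loss = ((pred - sparse_depth).abs() * mask).sum() / mask.sum().clamp(min=1)
        loss.backward()
        opt.step()
    return latent.detach()
```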
点此查看论文截图



SCP-Diff: Spatial-Categorical Joint Prior for Diffusion Based Semantic Image Synthesis
Authors:Huan-ang Gao, Mingju Gao, Jiaju Li, Wenyi Li, Rong Zhi, Hao Tang, Hao Zhao
Semantic image synthesis (SIS) shows good promises for sensor simulation. However, current best practices in this field, based on GANs, have not yet reached the desired level of quality. As latent diffusion models make significant strides in image generation, we are prompted to evaluate ControlNet, a notable method for its dense control capabilities. Our investigation uncovered two primary issues with its results: the presence of weird sub-structures within large semantic areas and the misalignment of content with the semantic mask. Through empirical study, we pinpointed the cause of these problems as a mismatch between the noised training data distribution and the standard normal prior applied at the inference stage. To address this challenge, we developed specific noise priors for SIS, encompassing spatial, categorical, and a novel spatial-categorical joint prior for inference. This approach, which we have named SCP-Diff, has set new state-of-the-art results in SIS on Cityscapes, ADE20K and COCO-Stuff, yielding a FID as low as 10.53 on Cityscapes. The code and models can be accessed via the project page.
语义图像合成(SIS)在传感器模拟方面显示出良好的前景。然而,目前该领域基于生成对抗网络(GANs)的最佳实践尚未达到理想的质量水平。随着潜在扩散模型在图像生成方面取得了显著进展,我们受到了启发,对ControlNet进行了评估,因其具有密集的控制能力而成为了一种值得注意的方法。我们的调查发现了其结果的两个主要问题:大型语义区域内存在奇怪的子结构以及内容与语义掩码的错位。通过实证研究,我们将这些问题的原因归结为噪声训练数据分布与推理阶段应用的标准正态先验之间的不匹配。为了应对这一挑战,我们为SIS开发了特定的噪声先验,包括空间先验、分类先验以及用于推理的新型空间分类联合先验。我们称这种方法为SCP-Diff,它在Cityscapes、ADE20K和COCO-Stuff的SIS任务上创下了最新纪录,其中在Cityscapes上的FID低至10.53。代码和模型可通过项目页面访问。
论文及项目相关链接
PDF Project Page: https://air-discover.github.io/SCP-Diff/
Summary
语义图像合成(SIS)在传感器模拟方面展现出良好前景,但基于生成对抗网络(GANs)的当前最佳实践尚未达到理想的质量水平。随着潜在扩散模型在图像生成方面的显著进展,我们对ControlNet方法进行了评估,该方法以其密集的控制能力而备受关注。研究发现,ControlNet存在两个主要问题:大型语义区域内存在奇怪的子结构,以及内容与语义掩码不对齐。通过实证研究,我们将这些问题归因于加噪训练数据分布与推理阶段应用的标准正态先验之间的不匹配。为解决这一挑战,我们为SIS开发了特定的噪声先验,包括空间先验、类别先验,以及一种新的用于推理的空间-类别联合先验,我们称之为SCP-Diff。该方法在Cityscapes、ADE20K和COCO-Stuff的SIS任务上达到了最新的最优结果,在Cityscapes上的FID低至10.53。
Key Takeaways
- 语义图像合成(SIS)在传感器模拟方面前景良好,但现有方法存在质量问题。
- ControlNet方法在密集控制方面表现出潜力,但存在奇怪的子结构问题和内容与语义掩码不匹配的问题。
- 这些问题被归因于加噪训练数据分布和推理阶段使用的标准正态先验之间的不匹配。
- 为解决这一问题,开发了特定的噪声先验,包括空间、类别和一种新的空间类别联合先验。
- SCP-Diff方法在SIS任务上达到了最新结果,并在Cityscapes、ADE20K和COCO-Stuff上表现出优异的性能。
- SCP-Diff方法降低了FID分数至10.53。
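
“空间-类别联合噪声先验”的核心是:推理起点不使用标准正态噪声,而是按语义类别统计得到的均值/方差来采样初始噪声。下面是一个简化草图(统计对象与采样方式均为假设):

```python
import torch

def build_class_noise_prior(noised_latents, class_maps, num_classes):
    """noised_latents: (B, C, H, W) 训练图像加噪到终止时刻的 latent;
    class_maps: (B, H, W) 对应的语义标签。统计每个类别的噪声均值与方差。"""
    C = noised_latents.shape[1]
    mean = torch.zeros(num_classes, C)
    var = torch.ones(num_classes, C)
    for k in range(num_classes):
        sel = noised_latents.permute(0, 2, 3, 1)[class_maps == k]   # (Nk, C)
        if sel.numel() > 0:
            mean[k], var[k] = sel.mean(0), sel.var(0)
    return mean, var

def sample_prior_noise(class_map, mean, var):
    """按推理时给定的语义掩码 (H, W),逐像素从对应类别的高斯先验采样初始噪声。"""
    m = mean[class_map]                                    # (H, W, C)
    s = var[class_map].sqrt()
    return (m + s * torch.randn_like(m)).permute(2, 0, 1)  # (C, H, W)
```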
点此查看论文截图


