⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use with caution.
🔴 Note: never use these summaries for serious academic work; they are only meant as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-10
SSDD: Single-Step Diffusion Decoder for Efficient Image Tokenization
Authors:Théophane Vallaeys, Jakob Verbeek, Matthieu Cord
Tokenizers are a key component of state-of-the-art generative image models, extracting the most important features from the signal while reducing data dimension and redundancy. Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual and adversarial losses. Diffusion decoders have been proposed as a more principled alternative to model the distribution over images conditioned on the latent. However, matching the performance of KL-VAE still requires adversarial losses, as well as a higher decoding time due to iterative sampling. To address these limitations, we introduce a new pixel diffusion decoder architecture for improved scaling and training stability, benefiting from transformer components and GAN-free training. We use distillation to replicate the performance of the diffusion decoder in an efficient single-step decoder. This makes SSDD the first diffusion decoder optimized for single-step reconstruction trained without adversarial losses, reaching higher reconstruction quality and faster sampling than KL-VAE. In particular, SSDD improves reconstruction FID from $0.87$ to $0.50$ with $1.4\times$ higher throughput and preserves the generation quality of DiTs with $3.8\times$ faster sampling. As such, SSDD can be used as a drop-in replacement for KL-VAE, and for building higher-quality and faster generative models.
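To make the two-stage idea concrete, here is a minimal PyTorch sketch of distilling a multi-step diffusion decoder into a single-step student. This is a hedged illustration, not the paper's implementation: the `denoise_step` interface, the 50-step schedule, the image resolution, and the plain MSE objective are all our own assumptions.

```python
# Hypothetical sketch: distill a multi-step diffusion decoder (teacher) into a
# single-step decoder (student), GAN-free. Interfaces and hyperparameters are
# assumed for illustration and are not taken from the SSDD paper.
import torch
import torch.nn.functional as F

@torch.no_grad()
def teacher_decode(teacher, z, num_steps=50):
    """Iteratively denoise from pure noise to an image, conditioned on latent z."""
    x = torch.randn(z.size(0), 3, 256, 256, device=z.device)
    for t in reversed(range(num_steps)):
        x = teacher.denoise_step(x, t, cond=z)  # one reverse-diffusion step (assumed API)
    return x

def distillation_step(student, teacher, z, opt):
    """One optimization step: match the teacher's full sampling chain in a single forward pass."""
    target = teacher_decode(teacher, z)  # expensive multi-step reconstruction
    pred = student(z)                    # single-step reconstruction
    loss = F.mse_loss(pred, target)      # simple regression objective, no adversarial loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

At inference only `student(z)` is called, which is where the reported throughput gain over iterative sampling comes from.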
Paper and project links
Summary:
SSDD is a new pixel diffusion decoder with improved scaling and training stability, built on transformer components and trained without GANs. Through distillation, SSDD replicates the performance of the diffusion decoder in an efficient single-step decoder, without adversarial losses. Compared with KL-VAE, SSDD improves reconstruction quality and sampling speed, and can serve as a drop-in replacement for building higher-quality and faster generative models.
Key Takeaways:
- Tokenizers play a key role in generative image models, extracting the most important features from the signal while reducing data dimension and redundancy.
- Most current tokenizers are based on KL-regularized variational autoencoders (KL-VAE), trained with reconstruction, perceptual, and adversarial losses.
- Diffusion decoders have been proposed as an alternative for modeling the image distribution, but matching KL-VAE performance previously required adversarial losses, and iterative sampling made decoding slow.
- The new pixel diffusion decoder (SSDD) improves scaling and training stability, combining transformer components with GAN-free training.
- SSDD uses distillation to achieve efficient single-step decoding, replicating the diffusion decoder's performance without adversarial losses.
- SSDD improves reconstruction quality over KL-VAE, lowering reconstruction FID from 0.87 to 0.50 while increasing throughput by 1.4×.
- SSDD preserves the generation quality of DiTs while sampling 3.8× faster.
Click to view paper screenshots


The best performance in the CARE 2025 – Liver Task (LiSeg-Contrast): Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation
Authors:Jincan Lou, Jingkun Chen, Haoquan Li, Hang Li, Wenjian Huang, Weihua Chen, Fan Wang, Jianguo Zhang
Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and CycleGAN cannot be integrated seamlessly into segmentation pipelines. Moreover, these methods were originally designed for cross-modality scenarios; they often introduce structural distortions and suffer from unstable training, which are drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.
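As a concrete reference for the semi-supervised component, below is a minimal mean-teacher training step in the spirit of the paper. The loss weighting, EMA decay, and function names are our own illustrative assumptions; the actual framework builds on nnU-Netv2 and adds the domain adaptation and test-time adaptation modules on top.

```python
# Hypothetical mean-teacher step for semi-supervised segmentation: supervised
# loss on labeled volumes, consistency loss against an EMA teacher on unlabeled
# volumes, then an EMA update of the teacher. Details are assumed, not the
# paper's exact recipe.
import torch
import torch.nn.functional as F

def mean_teacher_step(student, teacher, x_lab, y_lab, x_unlab, opt,
                      cons_weight=1.0, ema_decay=0.99):
    # Supervised segmentation loss on the labeled batch
    sup_loss = F.cross_entropy(student(x_lab), y_lab)

    # Consistency: student predictions should match the EMA teacher on unlabeled data
    with torch.no_grad():
        teacher_prob = teacher(x_unlab).softmax(dim=1)
    cons_loss = F.mse_loss(student(x_unlab).softmax(dim=1), teacher_prob)

    loss = sup_loss + cons_weight * cons_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

    # Teacher weights follow the student as an exponential moving average
    with torch.no_grad():
        for t_p, s_p in zip(teacher.parameters(), student.parameters()):
            t_p.mul_(ema_decay).add_(s_p, alpha=1.0 - ema_decay)
    return loss.item()
```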
Paper and project links
PDF 11 pages, 3 figures
Summary
This paper proposes CoSSeg-TTA, a compact segmentation framework built on nnU-Netv2 for liver segmentation in the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality. The framework combines a semi-supervised mean-teacher scheme that exploits large amounts of unlabeled volumes with a domain adaptation module and a continual test-time adaptation strategy, addressing limited annotations, heterogeneous enhancement protocols, and domain shifts across scanners and institutions to achieve accurate liver segmentation from contrast-enhanced MRI.
Key Takeaways
- Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring.
- Traditional image-to-image translation frameworks have advanced domain generalization, but they are not straightforward to apply to liver segmentation.
- The proposed CoSSeg-TTA framework is built on nnU-Netv2 and tailored to the GED4 modality.
- A semi-supervised mean-teacher scheme exploits large amounts of unlabeled data.
- A domain adaptation module enriches domain diversity and mitigates cross-center variability through a randomized histogram-based style transfer function and a trainable contrast-aware network.
- A continual test-time adaptation strategy improves robustness during inference.
Click to view paper screenshots


SDAKD: Student Discriminator Assisted Knowledge Distillation for Super-Resolution Generative Adversarial Networks
Authors:Nikolaos Kaparinos, Vasileios Mezaris
Generative Adversarial Networks (GANs) achieve excellent performance in generative tasks, such as image super-resolution, but their computational requirements make their deployment on resource-constrained devices difficult. While knowledge distillation is a promising research direction for GAN compression, effectively training a smaller student generator is challenging due to the capacity mismatch between the student generator and the teacher discriminator. In this work, we propose Student Discriminator Assisted Knowledge Distillation (SDAKD), a novel GAN distillation methodology that introduces a student discriminator to mitigate this capacity mismatch. SDAKD follows a three-stage training strategy, and integrates an adapted feature map distillation approach in its last two training stages. We evaluated SDAKD on two well-performing super-resolution GANs, GCFSR and Real-ESRGAN. Our experiments demonstrate consistent improvements over the baselines and SOTA GAN knowledge distillation methods. The SDAKD source code will be made openly available upon acceptance of the paper.
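Below is a hedged sketch of what one training step with a student discriminator plus feature-map distillation could look like. The non-saturating softplus losses, the `features` intermediate-feature hook, and the loss weight are illustrative assumptions rather than SDAKD's published three-stage procedure.

```python
# Hypothetical SDAKD-style step: a small student discriminator (matched in
# capacity to the student generator) provides the adversarial signal, while
# feature-map distillation transfers knowledge from the teacher generator.
import torch
import torch.nn.functional as F

def sdakd_step(g_student, d_student, g_teacher, lr_imgs, hr_imgs,
               g_opt, d_opt, fd_weight=1.0):
    # --- Student discriminator update: real HR vs. student-generated SR ---
    with torch.no_grad():
        sr_fake = g_student(lr_imgs)
    d_loss = (F.softplus(-d_student(hr_imgs)).mean()
              + F.softplus(d_student(sr_fake)).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Student generator update: adversarial + feature-map distillation ---
    sr_fake = g_student(lr_imgs)
    adv = F.softplus(-d_student(sr_fake)).mean()
    with torch.no_grad():
        t_feats = g_teacher.features(lr_imgs)   # assumed intermediate-feature hook
    s_feats = g_student.features(lr_imgs)
    fd = sum(F.mse_loss(s, t) for s, t in zip(s_feats, t_feats))
    g_loss = adv + fd_weight * fd
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return g_loss.item(), d_loss.item()
```

The point of the student discriminator is that its capacity tracks the student generator's, avoiding the overpowering gradients a full teacher discriminator would produce.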
Paper and project links
PDF Under review
Summary
GANs perform strongly on generative tasks such as image super-resolution, but their computational demands make deployment on resource-constrained devices difficult. Knowledge distillation is a promising direction for GAN compression, yet training a small student generator is challenging because of the capacity mismatch between the student generator and the teacher discriminator. This work proposes Student Discriminator Assisted Knowledge Distillation (SDAKD), a novel GAN distillation method that introduces a student discriminator to mitigate this mismatch. SDAKD uses a three-stage training strategy and integrates an adapted feature-map distillation approach in its last two stages. Experiments on two high-performing super-resolution GANs (GCFSR and Real-ESRGAN) show clear improvements over baseline methods and state-of-the-art GAN knowledge distillation methods.
Key Takeaways
- GANs perform well on generative tasks, but deploying them on resource-constrained devices is challenging.
- Knowledge distillation is a promising research direction for GAN compression.
- Training a small student generator is difficult, mainly because of the capacity mismatch between the student generator and the teacher discriminator.
- SDAKD, a novel GAN distillation method, introduces a student discriminator to mitigate this capacity mismatch.
- SDAKD uses a three-stage training strategy and integrates adapted feature-map distillation in the last two stages.
- Experiments on two high-performing super-resolution GANs show clear improvements over baselines and existing methods.
Click to view paper screenshots


Advances in Medical Image Segmentation: A Comprehensive Survey with a Focus on Lumbar Spine Applications
Authors:Ahmed Kabil, Ghada Khoriba, Mina Yousef, Essam A. Rashed
Medical Image Segmentation (MIS) stands as a cornerstone in medical image analysis, playing a pivotal role in precise diagnostics, treatment planning, and monitoring of various medical conditions. This paper presents a comprehensive and systematic survey of MIS methodologies, bridging the gap between traditional image processing techniques and modern deep learning approaches. The survey encompasses thresholding, edge detection, region-based segmentation, clustering algorithms, and model-based techniques while also delving into state-of-the-art deep learning architectures such as Convolutional Neural Networks (CNNs), Fully Convolutional Networks (FCNs), and the widely adopted U-Net and its variants. Moreover, the integration of attention mechanisms, semi-supervised learning, generative adversarial networks (GANs), and Transformer-based models is thoroughly explored. In addition to covering established methods, this survey highlights emerging trends, including hybrid architectures, cross-modality learning, federated and distributed learning frameworks, and active learning strategies, which aim to address challenges such as limited labeled datasets, computational complexity, and model generalizability across diverse imaging modalities. Furthermore, a specialized case study on lumbar spine segmentation is presented, offering insights into the challenges and advancements in this relatively underexplored anatomical region. Despite significant progress in the field, critical challenges persist, including dataset bias, domain adaptation, interpretability of deep learning models, and integration into real-world clinical workflows.
Paper and project links
PDF Computers in Biology and Medicine (to appear)
Summary
Medical image segmentation (MIS) is a cornerstone of medical image analysis, playing a key role in precise diagnosis, treatment planning, and the monitoring of various medical conditions. This paper presents a comprehensive, systematic survey of MIS methods, bridging traditional image processing techniques and modern deep learning approaches. It covers thresholding, edge detection, region-based segmentation, clustering algorithms, and model-based techniques, and examines state-of-the-art deep learning architectures such as convolutional neural networks (CNNs), fully convolutional networks (FCNs), and the widely adopted U-Net and its variants. The integration of attention mechanisms, semi-supervised learning, generative adversarial networks (GANs), and Transformer-based models is explored in depth. Beyond established methods, the survey highlights emerging trends, including hybrid architectures, cross-modality learning, federated and distributed learning frameworks, and active learning strategies, which aim to address challenges such as limited labeled datasets, computational complexity, and model generalizability across imaging modalities. A dedicated case study on lumbar spine segmentation offers insights into the challenges and advances in this relatively underexplored anatomical region. Despite significant progress, challenges remain, including dataset bias, domain adaptation, the interpretability of deep learning models, and integration into real-world clinical workflows.
Key Takeaways
- Medical image segmentation (MIS) is central to medical image analysis and crucial for precise diagnosis and treatment monitoring.
- The survey reviews both traditional image processing methods and modern deep learning techniques for MIS.
- Deep learning architectures such as CNNs, FCNs, and U-Net perform strongly in medical image segmentation.
- Integrating attention mechanisms, semi-supervised learning, and GANs addresses specific challenges in the field.
- Emerging trends include hybrid architectures, cross-modality learning, and federated and distributed learning frameworks.
- A case study on lumbar spine segmentation highlights the challenges and recent advances in this anatomical region.
Click to view paper screenshots


RoboSwap: A GAN-driven Video Diffusion Framework For Unsupervised Robot Arm Swapping
Authors:Yang Bai, Liudi Yang, George Eskandar, Fengyi Shen, Dong Chen, Mohammad Altillawi, Ziyuan Liu, Gitta Kutyniok
Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another: a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their isolated advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.
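To illustrate the pipeline's stages, here is a hedged, high-level sketch. The `segmenter`, `arm_translator` (the unpaired GAN), and `refiner` (the diffusion model) interfaces are hypothetical placeholders, and the real system refines with temporal context rather than naive per-frame compositing.

```python
# Hypothetical sketch of the RoboSwap editing pipeline: segment the source arm,
# translate it with an unpaired GAN, blend it onto the original background, and
# refine the composite with a diffusion model. All interfaces are assumed.
import torch

def roboswap_frame(frame, segmenter, arm_translator, refiner):
    mask = segmenter(frame)                          # soft/binary mask of the source arm
    arm = frame * mask                               # isolate the arm from the background
    swapped = arm_translator(arm)                    # unpaired GAN: arm A -> arm B
    composite = swapped * mask + frame * (1 - mask)  # paste onto the original background
    return refiner(composite)                        # diffusion refinement for coherence

def roboswap_video(frames, segmenter, arm_translator, refiner):
    # Per-frame sketch only; the actual model enforces motion consistency
    # across frames during the diffusion stage.
    return torch.stack([roboswap_frame(f, segmenter, arm_translator, refiner)
                        for f in frames])
```

Note that the GAN and diffusion stages are trained independently, so each component in this sketch could be swapped or retrained without touching the other.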
Paper and project links
Summary
Recent advances in generative models have transformed video synthesis and editing, but the scarcity of diverse, high-quality datasets still hinders video-conditioned robotic learning and limits cross-platform generalization. This work tackles swapping the robotic arm in one video with another, a key step for cross-embodiment learning. Unlike prior methods that rely on paired video demonstrations in the same environment, the proposed RoboSwap framework operates on unpaired data from diverse environments, easing data collection. RoboSwap introduces a video editing pipeline that integrates GANs and diffusion models, combining their respective strengths: robotic arms are segmented from their backgrounds, an unpaired GAN translates one arm into another, and the translated arm is blended with the original video background and refined with a diffusion model to improve coherence, motion realism, and object interaction. The GAN and diffusion stages are trained independently. Experiments show that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in structural coherence and motion consistency, offering a robust solution for generating reliable cross-embodiment data for robotic learning.
Key Takeaways
- Recent advances in generative models have revolutionized video synthesis and editing.
- The scarcity of diverse, high-quality datasets remains a challenge for video-conditioned robotic learning.
- The RoboSwap framework operates on unpaired data from diverse environments, simplifying data collection.
- RoboSwap integrates GANs and diffusion models, exploiting the strengths of each for video editing.
- RoboSwap segments robotic arms from their backgrounds and trains an unpaired GAN to translate one arm into another.
- The translated arm is blended with the original background and refined with a diffusion model to enhance coherence, motion realism, and object interaction.
Click to view paper screenshots


DiffMI: Breaking Face Recognition Privacy via Diffusion-Driven Training-Free Model Inversion
Authors:Hanrui Wang, Shuo Wang, Chun-Shien Lu, Isao Echizen
Face recognition poses serious privacy risks due to its reliance on sensitive and immutable biometric data. While modern systems mitigate privacy risks by mapping facial images to embeddings (commonly regarded as privacy-preserving), model inversion attacks reveal that identity information can still be recovered, exposing critical vulnerabilities. However, existing attacks are often computationally expensive and lack generalization, especially those requiring target-specific training. Even training-free approaches suffer from limited identity controllability, hindering faithful reconstruction of nuanced or unseen identities. In this work, we propose DiffMI, the first diffusion-driven, training-free model inversion attack. DiffMI introduces a novel pipeline combining robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective. DiffMI applies directly to unseen target identities and face recognition models, offering greater adaptability than training-dependent approaches while significantly reducing computational overhead. Our method achieves 84.42%–92.87% attack success rates against inversion-resilient systems and outperforms the best prior training-free GAN-based approach by 4.01%–9.82%. The implementation is available at https://github.com/azrealwang/DiffMI.
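For intuition, here is a heavily simplified, hedged sketch of training-free inversion: optimize a latent code so a diffusion generator's output matches the target identity embedding under the attacked face recognition model. The differentiable `diffusion_gen`, the latent shape, and the plain cosine objective are our assumptions; DiffMI's ranked adversarial refinement and confidence-aware objective are more involved.

```python
# Hypothetical training-free inversion loop: no per-identity training, only
# test-time optimization of a latent code against a frozen face recognition
# model. Interfaces and shapes are assumed, not DiffMI's actual pipeline.
import torch
import torch.nn.functional as F

def invert_identity(diffusion_gen, fr_model, target_emb, steps=200, lr=0.05):
    # Latent code to optimize (shape assumed); both networks stay frozen.
    z = torch.randn(1, 4, 64, 64, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        face = diffusion_gen(z)                      # assumed differentiable decode
        emb = F.normalize(fr_model(face), dim=-1)    # identity embedding of the sample
        # Pull the generated face's embedding toward the target identity
        loss = 1.0 - F.cosine_similarity(emb, target_emb, dim=-1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return diffusion_gen(z.detach())
```

Because nothing is trained per target, the same loop applies directly to unseen identities and unseen face recognition models, which is the adaptability advantage the paper emphasizes.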
Paper and project links
Summary
Face recognition relies on sensitive, immutable biometric data and therefore poses serious privacy risks. Although modern systems mitigate these risks by mapping facial images to embeddings, model inversion attacks show that identity information can still be recovered, exposing critical vulnerabilities. This paper proposes DiffMI, a diffusion-driven, training-free model inversion attack that combines robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective, and applies directly to unseen target identities and face recognition models. Compared with training-dependent approaches, DiffMI offers greater adaptability and much lower computational overhead, achieving 84.42%–92.87% attack success rates against inversion-resilient systems and outperforming the best prior training-free GAN-based approach by 4.01%–9.82%. For details, see https://github.com/azrealwang/DiffMI.
Key Takeaways
- Face recognition poses privacy risks because model inversion attacks can recover identity information from embeddings.
- Mapping facial images to embeddings mitigates privacy risks, but vulnerabilities remain.
- DiffMI is a novel training-free model inversion attack that applies directly to unseen target identities and face recognition models.
- DiffMI combines robust latent code initialization, a ranked adversarial refinement strategy, and a statistically grounded, confidence-aware optimization objective.
- DiffMI offers strong adaptability with low computational overhead.
- DiffMI achieves 84.42%–92.87% attack success rates against inversion-resilient systems.
Click to view paper screenshots
