⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings. They are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-18
A Multi-domain Image Translative Diffusion StyleGAN for Iris Presentation Attack Detection
Authors:Shivangi Yadav, Arun Ross
An iris biometric system can be compromised by presentation attacks (PAs) where artifacts such as artificial eyes, printed eye images, or cosmetic contact lenses are presented to the system. To counteract this, several presentation attack detection (PAD) methods have been developed. However, there is a scarcity of datasets for training and evaluating iris PAD techniques due to the implicit difficulties in constructing and imaging PAs. To address this, we introduce the Multi-domain Image Translative Diffusion StyleGAN (MID-StyleGAN), a new framework for generating synthetic ocular images that captures the PA and bonafide characteristics in multiple domains such as bonafide, printed eyes and cosmetic contact lens. MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to produce realistic and diverse synthetic data. Our approach utilizes a multi-domain architecture that enables the translation between bonafide ocular images and different PA domains. The model employs an adaptive loss function tailored for ocular data to maintain domain consistency. Extensive experiments demonstrate that MID-StyleGAN outperforms existing methods in generating high-quality synthetic ocular images. The generated data was used to significantly enhance the performance of PAD systems, providing a scalable solution to the data scarcity problem in iris and ocular biometrics. For example, on the LivDet2020 dataset, the true detect rate at 1% false detect rate improved from 93.41% to 98.72%, showcasing the impact of the proposed method.
Paper and project links
Summary
This paper addresses presentation attacks (PAs) on iris biometric systems by proposing a new image-generation framework, MID-StyleGAN. The framework combines the strengths of diffusion models and generative adversarial networks (GANs) to generate realistic iris image data across multiple domains. Experiments show that MID-StyleGAN outperforms existing methods in generating high-quality synthetic iris images, helps alleviate the data scarcity problem in iris biometrics, and significantly improves the performance of presentation attack detection (PAD) systems.
Key Takeaways
- Iris biometric systems are vulnerable to presentation attacks (PAs) and require presentation attack detection (PAD) methods to defend against them.
- The lack of datasets for training and evaluating iris PAD techniques remains a challenge in this area.
- A new image-generation framework, MID-StyleGAN, is introduced to generate synthetic iris image data.
- MID-StyleGAN combines the strengths of diffusion models and generative adversarial networks (GANs) to generate realistic iris image data across multiple domains.
- MID-StyleGAN uses an adaptive loss function to maintain domain consistency and improve image quality.
- Experiments show that MID-StyleGAN outperforms existing methods in generating high-quality synthetic iris images.
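The headline result above is reported as the true detect rate (TDR) at a fixed 1% false detect rate (FDR). As a point of reference, here is a minimal sketch of how that operating point can be read off from PAD scores; the score distributions and threshold rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tdr_at_fdr(bonafide_scores, attack_scores, target_fdr=0.01):
    """TDR at a fixed FDR, assuming higher scores mean 'attack'.

    FDR: fraction of bonafide samples wrongly flagged as attacks.
    TDR: fraction of attack samples correctly flagged at that threshold.
    """
    # Threshold chosen so that roughly `target_fdr` of bonafide scores exceed it.
    thr = np.quantile(bonafide_scores, 1.0 - target_fdr)
    return float(np.mean(attack_scores > thr))

# Hypothetical PAD scores, for illustration only.
rng = np.random.default_rng(0)
bona = rng.normal(0.2, 0.10, 5000)   # bonafide samples tend to score low
atk = rng.normal(0.8, 0.15, 5000)    # presentation attacks tend to score high
print(f"TDR @ 1% FDR: {tdr_at_fdr(bona, atk):.4f}")
```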

LOTA: Bit-Planes Guided AI-Generated Image Detection
Authors:Hongsong Wang, Renxi Cheng, Yang Zhang, Chaolei Han, Jie Gui
The rapid advancement of GAN and Diffusion models makes it more difficult to distinguish AI-generated images from real ones. Recent studies often use image-based reconstruction errors as an important feature for determining whether an image is AI-generated. However, these approaches typically incur high computational costs and also fail to capture intrinsic noisy features present in the raw images. To solve these problems, we innovatively refine error extraction by using bit-plane-based image processing, as lower bit planes indeed represent noise patterns in images. We introduce an effective bit-planes guided noisy image generation and exploit various image normalization strategies, including scaling and thresholding. Then, to amplify the noise signal for easier AI-generated image detection, we design a maximum gradient patch selection that applies multi-directional gradients to compute the noise score and selects the region with the highest score. Finally, we propose a lightweight and effective classification head and explore two different structures: noise-based classifier and noise-guided classifier. Extensive experiments on the GenImage benchmark demonstrate the outstanding performance of our method, which achieves an average accuracy of 98.9% (an 11.9% improvement) and shows excellent cross-generator generalization capability. Particularly, our method achieves an accuracy of over 98.2% from GAN to Diffusion and over 99.2% from Diffusion to GAN. Moreover, it performs error extraction at the millisecond level, nearly a hundred times faster than existing methods. The code is at https://github.com/hongsong-wang/LOTA.
Paper and project links
PDF Published in ICCV 2025. Code: https://github.com/hongsong-wang/LOTA
Summary
This paper presents a bit-plane-based approach to extracting image noise: a noise-guided image generation step enables efficient detection of AI-generated images. A maximum-gradient patch selection algorithm amplifies the noise signal, and a lightweight classification head is designed. The method achieves an outstanding average accuracy of 98.9% on the GenImage benchmark with strong cross-generator generalization. The code has been released on GitHub.
Key Takeaways
- With the rapid advancement of GAN and Diffusion models, distinguishing AI-generated images from real ones has become increasingly difficult.
- Existing studies often rely on image reconstruction errors to decide whether an image is AI-generated, but they suffer from high computational cost and fail to capture the intrinsic noise features of raw images.
- Bit-plane-based image processing is used to refine error extraction, since the lower bit planes represent the noise patterns in an image (see the sketch below).
- A bit-planes guided noisy image generation step is introduced, together with several image normalization strategies.
- A maximum-gradient patch selection algorithm amplifies the noise signal, making AI-generated images easier to detect.
- A lightweight classification head is designed, with two variants explored: a noise-based classifier and a noise-guided classifier.
- The method achieves an outstanding average accuracy of 98.9% on the GenImage benchmark and shows strong cross-generator generalization, with over 98.2% accuracy from GAN to Diffusion and over 99.2% from Diffusion to GAN.
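The core idea above is that the low-order bit planes of an image act like a noise residual, which is then normalized and searched for the patch with the largest multi-directional gradient. Below is a rough sketch of that pipeline; the chosen bit planes, normalization, and patch size are assumptions for illustration, not LOTA's actual settings.

```python
import numpy as np

def low_bit_planes(img_u8, planes=(0, 1)):
    """Stack selected low-order bit planes of an 8-bit grayscale image."""
    return np.stack([(img_u8 >> p) & 1 for p in planes], axis=0).astype(np.float32)

def noise_image(img_u8, planes=(0, 1)):
    """Bit-planes guided 'noisy image': sum of low planes, rescaled to [0, 1]."""
    noise = low_bit_planes(img_u8, planes).sum(axis=0)
    return noise / noise.max() if noise.max() > 0 else noise

def max_gradient_patch(noise, patch=32):
    """Pick the patch whose multi-directional gradient magnitude is largest."""
    gy, gx = np.gradient(noise)               # vertical and horizontal gradients
    score_map = np.abs(gx) + np.abs(gy)
    h, w = noise.shape
    best, best_yx = -1.0, (0, 0)
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            s = score_map[y:y + patch, x:x + patch].sum()
            if s > best:
                best, best_yx = s, (y, x)
    y, x = best_yx
    return noise[y:y + patch, x:x + patch]

# Hypothetical input: a random 8-bit "image".
img = np.random.default_rng(0).integers(0, 256, (128, 128), dtype=np.uint8)
print(max_gradient_patch(noise_image(img)).shape)  # (32, 32)
```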

Self-Supervised Selective-Guided Diffusion Model for Old-Photo Face Restoration
Authors:Wenjie Li, Xiangyi Wang, Heng Guo, Guangwei Gao, Zhanyu Ma
Old-photo face restoration poses significant challenges due to compounded degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods either rely on explicit degradation priors or global statistical guidance, which struggle with localized artifacts or face color. We propose Self-Supervised Selective-Guided Diffusion (SSDiff), which leverages pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels exhibit structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance applied throughout the denoising process and color refinement in later steps, aligned with the coarse-to-fine nature of diffusion. By incorporating face parsing maps and scratch masks, our method selectively restores breakage regions while avoiding identity mismatch. We further construct VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability. Code link: https://github.com/PRIS-CV/SSDiff.
Paper and project links
Summary
Old-photo face restoration must cope with compound degradations such as breakage, fading, and severe blur. Existing pre-trained diffusion-guided methods have limitations: they rely on explicit degradation priors or global statistical guidance and struggle with localized artifacts and face color. This paper proposes Self-Supervised Selective-Guided Diffusion (SSDiff), which uses pseudo-reference faces generated by a pre-trained diffusion model under weak guidance. These pseudo-labels have structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision: structural guidance throughout denoising and color refinement in later steps, matching the coarse-to-fine nature of diffusion. By combining face parsing maps and scratch masks, the method selectively restores breakage regions while avoiding identity mismatch. The authors also build VintageFace, a 300-image benchmark of real old face photos with varying degradation levels. SSDiff outperforms existing GAN-based and diffusion-based methods in perceptual quality, fidelity, and regional controllability.
Key Takeaways
- Old-photo face restoration faces challenges such as breakage, fading, and blur.
- Existing methods rely on explicit degradation priors or global statistical guidance and struggle with localized artifacts and color.
- SSDiff is proposed: self-supervised selective-guided diffusion driven by pseudo-reference images.
- The pseudo-labels have structurally aligned contours and natural colors, enabling region-specific restoration via staged supervision (illustrated below).
- Combining face parsing maps and scratch masks allows selective restoration of breakage regions while avoiding identity mismatch.
- The VintageFace benchmark is constructed, containing old face photos with varying degradation levels.
- SSDiff outperforms existing methods in perceptual quality, fidelity, and regional controllability.
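The staged supervision described above (structural guidance at every denoising step, color refinement only in the later, fine-detail steps) can be pictured as a per-timestep weighting of two guidance terms. The sketch below only illustrates such a schedule; the cut-over point and weights are assumed values, not the paper's.

```python
def guidance_weights(t, num_steps=1000, color_start_frac=0.3):
    """Return (structure_weight, color_weight) for denoising step t.

    t counts down from num_steps - 1 (noisy) to 0 (clean), as is common
    in diffusion samplers. Structure guidance is always on; color
    refinement switches on only in the last `color_start_frac` of steps.
    """
    structure_w = 1.0
    color_w = 1.0 if t < color_start_frac * num_steps else 0.0
    return structure_w, color_w

# Over a 1000-step sampler, color guidance joins only near the end.
for t in (999, 600, 299, 50):
    print(t, guidance_weights(t))
```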

DISC-GAN: Disentangling Style and Content for Cluster-Specific Synthetic Underwater Image Generation
Authors:Sneha Varur, Anirudh R Hanchinamani, Tarun S Bagewadi, Uma Mudenagudi, Chaitra D Desai, Sujata C, Padmashree Desai, Sumit Meharwade
In this paper, we propose a novel framework, Disentangled Style-Content GAN (DISC-GAN), which integrates style-content disentanglement with a cluster-specific training strategy towards photorealistic underwater image synthesis. The quality of synthetic underwater images is challenged by optical phenomena such as color attenuation and turbidity. These phenomena are represented by distinct stylistic variations across different waterbodies, such as changes in tint and haze. While generative models are well-suited to capture complex patterns, they often lack the ability to model the non-uniform conditions of diverse underwater environments. To address these challenges, we employ K-means clustering to partition a dataset into style-specific domains. We use separate encoders to get latent spaces for style and content; we further integrate these latent representations via Adaptive Instance Normalization (AdaIN) and decode the result to produce the final synthetic image. The model is trained independently on each style cluster to preserve domain-specific characteristics. Our framework demonstrates state-of-the-art performance, obtaining a Structural Similarity Index (SSIM) of 0.9012, an average Peak Signal-to-Noise Ratio (PSNR) of 32.5118 dB, and a Frechet Inception Distance (FID) of 13.3728.
Paper and project links
Summary
This paper proposes a new framework, Disentangled Style-Content GAN (DISC-GAN), which combines style-content disentanglement with a cluster-specific training strategy for photorealistic underwater image synthesis. To cope with the optical phenomena that challenge underwater image synthesis, such as color attenuation and turbidity, DISC-GAN uses K-means clustering to partition the dataset into style-specific domains and trains the generative model separately on each to model different underwater environments. Separate encoders produce latent spaces for style and content, and these latent representations are fused via Adaptive Instance Normalization (AdaIN) to produce the final synthetic image. The framework demonstrates state-of-the-art performance.
Key Takeaways
- The DISC-GAN framework combines style-content disentanglement with a cluster-specific training strategy for underwater image synthesis.
- Underwater image synthesis is challenged by optical phenomena such as color attenuation and turbidity.
- K-means clustering partitions the dataset into style-specific domains to address these challenges.
- Separate encoders produce latent spaces for style and content.
- The latent representations are fused via AdaIN to generate the final synthetic image (a minimal AdaIN sketch follows below).
- The model is trained independently on each style cluster to preserve domain-specific characteristics.
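The fusion step named above, Adaptive Instance Normalization (AdaIN), re-normalizes the content features so that they carry the channel-wise statistics of the style features. A minimal NumPy sketch of the standard AdaIN operation follows; DISC-GAN's encoders, decoder, and clustering are not reproduced here.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization for feature maps shaped (C, H, W).

    The content features are whitened per channel, then re-colored with
    the per-channel mean and std of the style features.
    """
    c_mean = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps
    s_mean = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

# Hypothetical latent feature maps from the content and style encoders.
rng = np.random.default_rng(0)
content_feat = rng.normal(size=(64, 32, 32))
style_feat = rng.normal(loc=2.0, scale=0.5, size=(64, 32, 32))
out = adain(content_feat, style_feat)
print(out.mean(), out.std())  # roughly matches the style statistics
```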

The 1st Solution for CARE Liver Task Challenge 2025: Contrast-Aware Semi-Supervised Segmentation with Domain Generalization and Test-Time Adaptation
Authors:Jincan Lou, Jingkun Chen, Haoquan Li, Hang Li, Wenjian Huang, Weihua Chen, Fan Wang, Jianguo Zhang
Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring. However, it remains challenging due to limited annotated data, heterogeneous enhancement protocols, and significant domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but their application is not straightforward. For example, Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. Meanwhile, these methods are originally used to deal with cross-modality scenarios, and often introduce structural distortions and suffer from unstable training, which may pose drawbacks in our single-modality scenario. To address these challenges, we propose CoSSeg-TTA, a compact segmentation framework for the GED4 (Gd-EOB-DTPA enhanced hepatobiliary phase MRI) modality built upon nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, incorporating a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. Furthermore, a continual test-time adaptation strategy is employed to improve robustness during inference. Extensive experiments demonstrate that our framework consistently outperforms the nnU-Netv2 baseline, achieving superior Dice score and Hausdorff Distance while exhibiting strong generalization to unseen domains under low-annotation conditions.
Paper and project links
PDF 11 pages, 3 figures
Summary
Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring, but it remains challenging because of limited annotated data, heterogeneous enhancement protocols, and domain shifts across scanners and institutions. Traditional image-to-image translation frameworks have made great progress in domain generalization, but they are not straightforward to apply: Pix2Pix requires image registration, and cycle-GAN cannot be integrated seamlessly into segmentation pipelines. To address these challenges, the authors propose CoSSeg-TTA, a compact segmentation framework for the GED4 modality built on nnU-Netv2 and enhanced with a semi-supervised mean teacher scheme to exploit large amounts of unlabeled volumes. A domain adaptation module, combining a randomized histogram-based style appearance transfer function and a trainable contrast-aware network, enriches domain diversity and mitigates cross-center variability. A continual test-time adaptation strategy further improves robustness during inference. Experiments show that the framework consistently outperforms the nnU-Netv2 baseline under limited annotation, achieving better Dice scores and Hausdorff distances and generalizing strongly to unseen domains.
Key Takeaways
- Accurate liver segmentation from contrast-enhanced MRI is essential for diagnosis, treatment planning, and disease monitoring.
- Traditional image-to-image translation frameworks face challenges in domain adaptation and are difficult to apply directly to liver segmentation.
- The proposed CoSSeg-TTA framework is built on nnU-Netv2 to address limited annotated data, heterogeneous enhancement protocols, and cross-institution domain shift.
- CoSSeg-TTA uses a semi-supervised mean teacher scheme to exploit large amounts of unlabeled data (a sketch of the teacher update follows below).
- A domain adaptation module with randomized histogram-based style transfer enriches domain diversity and mitigates cross-center variability.
- A continual test-time adaptation strategy improves robustness during inference.
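The semi-supervised mean teacher scheme mentioned above maintains a teacher network whose weights are an exponential moving average (EMA) of the student's, and the student is trained to be consistent with the teacher on unlabeled volumes. Below is a framework-agnostic sketch of the EMA update; the decay value is a common default, not necessarily the paper's.

```python
import numpy as np

def ema_update(teacher_params, student_params, decay=0.99):
    """Mean-teacher update: teacher <- decay * teacher + (1 - decay) * student.

    Both arguments are dicts mapping parameter names to NumPy arrays,
    stand-ins for the weights of two copies of the segmentation network.
    """
    for name, w_student in student_params.items():
        teacher_params[name] = decay * teacher_params[name] + (1 - decay) * w_student
    return teacher_params

rng = np.random.default_rng(0)
student = {"conv1": rng.normal(size=(3, 3)), "conv2": rng.normal(size=(3, 3))}
teacher = {k: v.copy() for k, v in student.items()}
# ... one student optimization step would happen here ...
student["conv1"] += 0.1
teacher = ema_update(teacher, student)
print(teacher["conv1"][0, 0])
```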

AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective
Authors:Yuchen Deng, Xiuyang Wu, Hai-Tao Zheng, Suiyang Zhang, Yi He, Yuxing Han
Talking-head animation focuses on generating realistic facial videos from audio input. Following Generative Adversarial Networks (GANs), diffusion models have become the mainstream, owing to their robust generative capacities. However, inherent limitations of the diffusion process often lead to inter-frame flicker and slow inference, restricting their practical deployment. To address this, we introduce AvatarSync, an autoregressive framework on phoneme representations that generates realistic and controllable talking-head animations from a single reference image, driven directly by text or audio input. To mitigate flicker and ensure continuity, AvatarSync leverages an autoregressive pipeline that enhances temporal modeling. In addition, to ensure controllability, we introduce phonemes, which are the basic units of speech sounds, and construct a many-to-one mapping from text/audio to phonemes, enabling precise phoneme-to-visual alignment. Additionally, to further accelerate inference, we adopt a two-stage generation strategy that decouples semantic modeling from visual dynamics, and incorporate a customized Phoneme-Frame Causal Attention Mask to support multi-step parallel acceleration. Extensive experiments conducted on both Chinese (CMLR) and English (HDTF) datasets demonstrate that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, providing a scalable and controllable solution.
Paper and project links
Summary
Following generative adversarial networks (GANs), diffusion models have become the mainstream approach to talking-head animation thanks to their strong generative capacity, but the diffusion process suffers from inter-frame flicker and slow inference, which limit practical deployment. To address this, AvatarSync is introduced: an autoregressive framework over phoneme representations that generates realistic and controllable talking-head animation from a single reference image, driven directly by text or audio. To mitigate flicker and ensure continuity, AvatarSync uses an autoregressive pipeline that strengthens temporal modeling. Controllability comes from introducing phonemes and building a many-to-one mapping from text/audio to phonemes, enabling precise phoneme-to-visual alignment. To further accelerate inference, a two-stage generation strategy decouples semantic modeling from visual dynamics, and a customized Phoneme-Frame Causal Attention Mask supports multi-step parallel acceleration. Extensive experiments on Chinese (CMLR) and English (HDTF) datasets show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency, offering a scalable and controllable solution.
Key Takeaways
- Talking-head animation has shifted from GANs to diffusion models, owing to the latter's strong generative capacity.
- Diffusion models have inherent limitations such as inter-frame flicker and slow inference.
- AvatarSync is an autoregressive framework that generates realistic and controllable talking-head animation from a single reference image, driven by text or audio.
- To mitigate flicker and ensure continuity, AvatarSync uses an autoregressive pipeline that strengthens temporal modeling.
- Introducing phonemes and a many-to-one text/audio-to-phoneme mapping improves precision and controllability.
- A two-stage generation strategy and a customized Phoneme-Frame Causal Attention Mask accelerate inference (a sketch of such a mask follows below).
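One plausible reading of the Phoneme-Frame Causal Attention Mask described above is a boolean matrix in which each video frame may attend only to its aligned phoneme and earlier ones, never to future phonemes. The sketch below builds such a mask from a hypothetical frame-to-phoneme alignment; the paper's actual mask construction may differ.

```python
import numpy as np

def phoneme_frame_causal_mask(frame_to_phoneme, num_phonemes):
    """Boolean mask of shape (num_frames, num_phonemes).

    mask[f, p] is True when frame f is allowed to attend to phoneme p,
    i.e. p is not later than the phoneme frame f is aligned to.
    """
    frame_to_phoneme = np.asarray(frame_to_phoneme)
    phoneme_idx = np.arange(num_phonemes)
    return phoneme_idx[None, :] <= frame_to_phoneme[:, None]

# Hypothetical alignment: 6 frames driven by 3 phonemes.
alignment = [0, 0, 1, 1, 2, 2]
mask = phoneme_frame_causal_mask(alignment, num_phonemes=3)
print(mask.astype(int))
```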

StegOT: Trade-offs in Steganography via Optimal Transport
Authors:Chengde Lin, Xuezhu Gong, Shuxue Ding, Mingzhe Yang, Xijun Lu, Chengjun Mo
Image hiding is often referred to as steganography, which aims to hide a secret image in a cover image of the same resolution. Many steganography models are based on generative adversarial networks (GANs) and variational autoencoders (VAEs). However, most existing models suffer from mode collapse. Mode collapse will lead to an information imbalance between the cover and secret images in the stego image and further affect the subsequent extraction. To address these challenges, this paper proposes StegOT, an autoencoder-based steganography model incorporating optimal transport theory. We designed the multiple channel optimal transport (MCOT) module to transform the feature distribution, which exhibits multiple peaks, into a single peak to achieve the trade-off of information. Experiments demonstrate that we not only achieve a trade-off between the cover and secret images but also enhance the quality of both the stego and recovery images. The source code will be released on https://github.com/Rss1124/StegOT.
Paper and project links
PDF Accepted by IEEE International Conference on Multimedia and Expo (ICME 2025)
Summary
This paper introduces StegOT, an autoencoder-based steganography model that incorporates optimal transport theory to address the mode collapse that is common in steganography models. A multiple channel optimal transport (MCOT) module transforms a feature distribution with multiple peaks into a single peak, achieving a trade-off of information. Experiments show that the method not only balances the cover and secret images but also improves the quality of both the stego and recovered images.
Key Takeaways
- StegOT is an autoencoder-based steganography model that incorporates optimal transport theory.
- The MCOT module transforms a multi-peak feature distribution into a single peak to balance information (illustrated in one dimension below).
- StegOT addresses the mode collapse commonly seen in steganography models.
- Experiments show that StegOT balances the information between the cover and secret images.
- StegOT improves the quality of both the stego and recovered images.
- The source code will be released at https://github.com/Rss1124/StegOT.
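MCOT itself is a learned module; the sketch below only illustrates, in one dimension, what transporting a multi-peak feature distribution onto a single peak means, using the classical quantile-matching optimal transport map onto a Gaussian target. It is an illustration of the underlying idea, not the paper's module.

```python
import numpy as np
from scipy.stats import norm

def ot_map_to_gaussian(samples):
    """1-D optimal transport of empirical samples onto a standard Gaussian.

    In one dimension the OT map is monotone, so it reduces to matching
    empirical quantiles to Gaussian quantiles.
    """
    n = len(samples)
    ranks = np.argsort(np.argsort(samples))   # rank of each sample
    quantiles = (ranks + 0.5) / n             # mid-point quantiles
    return norm.ppf(quantiles)                # Gaussian targets

# A bimodal ("two-peak") feature distribution, for illustration only.
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(-3, 0.5, 5000), rng.normal(3, 0.5, 5000)])
mapped = ot_map_to_gaussian(feats)
print(mapped.mean().round(3), mapped.std().round(3))  # ~0, ~1: single peak
```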

PanoLAM: Large Avatar Model for Gaussian Full-Head Synthesis from One-shot Unposed Image
Authors:Peng Li, Yisheng He, Yingdong Hu, Yuan Dong, Weihao Yuan, Yuan Liu, Siyu Zhu, Gang Cheng, Zilong Dong, Yike Guo
We present a feed-forward framework for Gaussian full-head synthesis from a single unposed image. Unlike previous work that relies on time-consuming GAN inversion and test-time optimization, our framework can reconstruct the Gaussian full-head model given a single unposed image in a single forward pass. This enables fast reconstruction and rendering during inference. To mitigate the lack of large-scale 3D head assets, we propose a large-scale synthetic dataset from trained 3D GANs and train our framework using only synthetic data. For efficient high-fidelity generation, we introduce a coarse-to-fine Gaussian head generation pipeline, where sparse points from the FLAME model interact with the image features by transformer blocks for feature extraction and coarse shape reconstruction, which are then densified for high-fidelity reconstruction. To fully leverage the prior knowledge residing in pretrained 3D GANs for effective reconstruction, we propose a dual-branch framework that effectively aggregates the structured spherical triplane feature and unstructured point-based features for more effective Gaussian head reconstruction. Experimental results show the effectiveness of our framework towards existing work. Project page at: https://panolam.github.io/.
Paper and project links
Summary
This paper presents a feed-forward framework for Gaussian full-head synthesis that reconstructs a Gaussian full-head model from a single unposed image. The framework needs no time-consuming GAN inversion or test-time optimization, enabling fast reconstruction and rendering. To address the lack of large-scale 3D head assets, the authors build a large-scale synthetic dataset from trained 3D GANs and train the framework on synthetic data only. A coarse-to-fine Gaussian head generation pipeline improves efficiency and fidelity, and a dual-branch framework leverages the prior knowledge in pretrained 3D GANs for more effective head reconstruction. Experimental results show the framework is more effective than existing work. See the project page for details.
Key Takeaways
- A feed-forward framework for Gaussian full-head synthesis is proposed that reconstructs the model from a single unposed image.
- The framework avoids time-consuming GAN inversion and test-time optimization, enabling fast reconstruction and rendering.
- A large-scale synthetic dataset generated with trained 3D GANs is used for training, addressing the scarcity of large-scale 3D head assets.
- A coarse-to-fine Gaussian head generation pipeline improves generation quality.
- A dual-branch framework aggregates structured spherical triplane features and unstructured point-based features, leveraging the prior knowledge of pretrained 3D GANs for more effective head reconstruction (a generic triplane lookup is sketched below).
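For intuition about the structured triplane branch mentioned above, the sketch below shows a generic (axis-aligned) triplane feature lookup: a 3D point is projected onto three feature planes, each plane is sampled bilinearly, and the results are summed. PanoLAM uses a spherical triplane variant, so this is only a simplified stand-in.

```python
import numpy as np

def bilinear(plane, u, v):
    """Bilinearly sample a (C, R, R) feature plane at continuous coords in [0, R-1]."""
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, plane.shape[1] - 1), min(v0 + 1, plane.shape[2] - 1)
    du, dv = u - u0, v - v0
    return ((1 - du) * (1 - dv) * plane[:, u0, v0] + du * (1 - dv) * plane[:, u1, v0]
            + (1 - du) * dv * plane[:, u0, v1] + du * dv * plane[:, u1, v1])

def triplane_feature(planes, point, res):
    """Sum of features sampled from the XY, XZ and YZ planes at a 3D point in [-1, 1]^3."""
    x, y, z = (np.asarray(point) + 1.0) / 2.0 * (res - 1)  # map to pixel coords
    xy, xz, yz = planes
    return bilinear(xy, x, y) + bilinear(xz, x, z) + bilinear(yz, y, z)

# Hypothetical 32-channel triplane at resolution 64, queried at one point.
rng = np.random.default_rng(0)
planes = [rng.normal(size=(32, 64, 64)) for _ in range(3)]
print(triplane_feature(planes, point=(0.1, -0.4, 0.7), res=64).shape)  # (32,)
```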

SQ-GAN: Semantic Image Communications Using Masked Vector Quantization
Authors:Francesco Pezone, Sergio Barbarossa, Giuseppe Caire
This work introduces Semantically Masked Vector Quantized Generative Adversarial Network (SQ-GAN), a novel approach integrating semantically driven image coding and vector quantization to optimize image compression for semantic/task-oriented communications. The method only acts on source coding and is fully compliant with legacy systems. The semantics is extracted from the image by computing its semantic segmentation map using off-the-shelf software. A new specifically developed semantic-conditioned adaptive mask module (SAMM) selectively encodes semantically relevant features of the image. The relevance of the different semantic classes is task-specific, and it is incorporated in the training phase by introducing appropriate weights in the loss function. SQ-GAN outperforms state-of-the-art image compression schemes such as JPEG2000, BPG, and deep-learning based methods across multiple metrics, including perceptual quality and semantic segmentation accuracy on the reconstructed image, at extremely low compression rates.
Paper and project links
PDF arXiv admin note: substantial text overlap with arXiv:2502.01675
Summary
This work introduces SQ-GAN, a semantically masked vector-quantized generative adversarial network that combines semantically driven image coding with vector quantization to optimize image compression for semantic/task-oriented communications. The method acts only on source coding and is compatible with legacy systems. Semantic information is extracted by computing the image's semantic segmentation map with off-the-shelf software, and a semantic-conditioned adaptive mask module (SAMM) selectively encodes the semantically relevant features. SQ-GAN outperforms state-of-the-art image compression schemes such as JPEG2000, BPG, and deep-learning-based methods on several metrics, particularly perceptual quality and semantic segmentation accuracy of the reconstructed image, even at extremely low compression rates.
Key Takeaways
- SQ-GAN is a new image compression method that combines semantically driven image coding with vector quantization.
- The method optimizes source coding only and remains compatible with legacy systems.
- SQ-GAN extracts semantic information by computing the image's semantic segmentation map.
- The SAMM module selectively encodes the semantically relevant features of the image, with task-specific relevance of the semantic classes reflected as class weights in the training loss (see the sketch below).
- SQ-GAN outperforms several image compression schemes, notably in perceptual quality and semantic segmentation accuracy.
- SQ-GAN maintains good performance even at extremely low compression rates.
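One simple way to read the class-weighted training objective described above is a per-pixel reconstruction loss scaled by a task-specific weight looked up from the semantic segmentation map. The classes, weights, and loss form below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def class_weighted_l1(recon, target, seg_map, class_weights):
    """L1 reconstruction loss with per-pixel weights from the segmentation map."""
    weights = class_weights[seg_map]   # (H, W) weight per pixel
    return np.mean(weights[None] * np.abs(recon - target))

# Hypothetical example: class 0 = background, 1 = person, 2 = vehicle.
class_weights = np.array([0.2, 1.0, 1.0])   # the task cares about people/vehicles
rng = np.random.default_rng(0)
target = rng.random((3, 64, 64))
recon = target + rng.normal(0, 0.05, (3, 64, 64))
seg_map = rng.integers(0, 3, (64, 64))
print(class_weighted_l1(recon, target, seg_map, class_weights))
```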

RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations
Authors:Chengde Lin, Xijun Lu, Guangxi Chen
Synthesizing high-quality photorealistic images with textual descriptions as a condition is very challenging. Generative Adversarial Networks (GANs), the classical model for this task, frequently suffer from low consistency between image and text descriptions and insufficient richness in synthesized images. Recently, conditional affine transformations (CAT), such as conditional batch normalization and instance normalization, have been applied to different layers of GAN to control content synthesis in images. CAT is a multi-layer perceptron that independently predicts data based on batch statistics between neighboring layers, with global textual information unavailable to other layers. To address this issue, we first model CAT and a recurrent neural network (RAT) to ensure that different layers can access global information. We then introduce shuffle attention between RAT to mitigate the characteristic of information forgetting in recurrent neural networks. Moreover, both our generator and discriminator utilize the powerful pre-trained model, Clip, which has been extensively employed for establishing associations between text and images through the learning of multimodal representations in latent space. The discriminator utilizes CLIP’s ability to comprehend complex scenes to accurately assess the quality of the generated images. Extensive experiments have been conducted on the CUB, Oxford, and CelebA-tiny datasets to demonstrate the superiority of the proposed model over current state-of-the-art models. The code is https://github.com/OxygenLu/RATLIP.
Paper and project links
PDF Accepted by 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC)
Summary
Synthesizing high-quality photorealistic images conditioned on text descriptions with generative adversarial networks (GANs) is challenging. Recently, conditional affine transformations (CAT) have been applied to different layers of a GAN to control content synthesis in images. To solve the problem that individual layers cannot access global textual information, this work models CAT together with a recurrent neural network (RAT) and introduces shuffle attention between RAT blocks to mitigate the information forgetting characteristic of recurrent networks. Both the generator and the discriminator use the powerful pre-trained CLIP model, which learns multimodal representations in latent space to associate text and images; the discriminator uses CLIP's ability to understand complex scenes to assess the quality of generated images accurately. Experiments on the CUB, Oxford, and CelebA-tiny datasets show that the model outperforms current state-of-the-art models.
Key Takeaways
- GANs face challenges when synthesizing high-quality photorealistic images, especially regarding image-text consistency and the richness of the generated images.
- Conditional affine transformations (CAT) are applied to different layers of the GAN to improve content synthesis (a minimal conditional-normalization sketch follows below).
- To give layers access to global information, CAT is combined with a recurrent neural network (RAT), and shuffle attention is introduced to mitigate information forgetting.
- Both the generator and the discriminator use the powerful pre-trained CLIP model, which learns multimodal representations in latent space to associate text and images.
- The discriminator uses CLIP's ability to understand complex scenes to assess the quality of generated images.
- Experiments on several datasets show that the model outperforms current state-of-the-art models in generating high-quality images.
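The conditional affine transformation (CAT) discussed above amounts to predicting a per-channel scale and shift from the text embedding and applying them to normalized feature maps (conditional batch or instance normalization). The sketch below uses a single random linear predictor purely for illustration; it is not RATLIP's RAT or shuffle-attention design.

```python
import numpy as np

def conditional_affine(features, text_emb, w_gamma, w_beta, eps=1e-5):
    """Conditional instance normalization: scale/shift predicted from text.

    features: (C, H, W) feature maps; text_emb: (D,) sentence embedding.
    w_gamma, w_beta: (C, D) linear layers predicting per-channel gamma, beta.
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True) + eps
    gamma = (w_gamma @ text_emb)[:, None, None]   # (C, 1, 1)
    beta = (w_beta @ text_emb)[:, None, None]
    return gamma * (features - mean) / std + beta

# Hypothetical shapes: 128 channels conditioned on a 512-d text embedding.
rng = np.random.default_rng(0)
feats = rng.normal(size=(128, 16, 16))
text = rng.normal(size=512)
out = conditional_affine(feats, text, rng.normal(size=(128, 512)) * 0.01,
                         rng.normal(size=(128, 512)) * 0.01)
print(out.shape)
```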