
GAN


⚠️ All of the summaries below are produced by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them for serious academic work; they are only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-05

GLYPH-SR: Can We Achieve Both High-Quality Image Super-Resolution and High-Fidelity Text Recovery via VLM-guided Latent Diffusion Model?

Authors:Mingyu Sung, Seungjae Ham, Kangwoo Kim, Yeokyoung Yoon, Sangseok Yun, Il-Min Kim, Jae-Mo Kang

Image super-resolution (SR) is fundamental to many vision systems - from surveillance and autonomy to document analysis and retail analytics - because recovering high-frequency details, especially scene text, enables reliable downstream perception. Scene text, i.e., text embedded in natural images such as signs, product labels, and storefronts, often carries the most actionable information; when characters are blurred or hallucinated, optical character recognition (OCR) and subsequent decisions fail even if the rest of the image appears sharp. Yet previous SR research has often been tuned to distortion metrics (PSNR/SSIM) or learned perceptual metrics (LPIPS, MANIQA, CLIP-IQA, MUSIQ) that are largely insensitive to character-level errors. Furthermore, studies that do address text SR often focus on simplified benchmarks with isolated characters, overlooking the challenges of text within complex natural scenes. As a result, scene text is effectively treated as generic texture. For SR to be effective in practical deployments, it is therefore essential to explicitly optimize for both text legibility and perceptual quality. We present GLYPH-SR, a vision-language-guided diffusion framework that aims to achieve both objectives jointly. GLYPH-SR utilizes a Text-SR Fusion ControlNet (TS-ControlNet) guided by OCR data, and a ping-pong scheduler that alternates between text- and scene-centric guidance. To enable targeted text restoration, we train these components on a synthetic corpus while keeping the main SR branch frozen. Across SVT, SCUT-CTW1500, and CUTE80 at x4 and x8, GLYPH-SR improves OCR F1 by up to +15.18 percentage points over diffusion/GAN baselines (SVT x8, OpenOCR) while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ. GLYPH-SR is designed to satisfy both objectives simultaneously - high readability and high visual realism - delivering SR that looks right and reads right.
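
The abstract's "ping-pong scheduler" alternates between text-centric and scene-centric guidance during sampling. The paper's implementation is not given here, so the following is only a minimal Python sketch of how such alternating guidance could be wired into a denoising loop; the callables (denoise_step, text_guidance, scene_guidance) and the alternation period are hypothetical placeholders.

```python
import numpy as np

def ping_pong_sample(z_T, denoise_step, text_guidance, scene_guidance,
                     num_steps=50, period=2):
    """Alternate text- and scene-centric guidance across denoising steps.

    z_T            : initial latent (noise)
    denoise_step   : callable(latent, t, guidance) -> less-noisy latent
    text_guidance  : callable(latent, t) -> guidance signal (e.g. from an
                     OCR-conditioned TS-ControlNet)
    scene_guidance : callable(latent, t) -> guidance signal for overall
                     scene fidelity
    period         : switch guidance every `period` steps (hypothetical)
    """
    z = z_T
    for step, t in enumerate(reversed(range(num_steps))):
        use_text = (step // period) % 2 == 0          # ping ... pong ...
        guidance = text_guidance(z, t) if use_text else scene_guidance(z, t)
        z = denoise_step(z, t, guidance)
    return z

# Toy usage with dummy components, just to show the control flow.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.standard_normal((4, 64, 64))
    dummy_guide = lambda z, t: 0.01 * z                       # placeholder
    dummy_step = lambda z, t, g: 0.98 * z + g                 # placeholder
    out = ping_pong_sample(z0, dummy_step, dummy_guide, dummy_guide)
    print(out.shape)
```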


Paper and Project Links

PDF 11 pages, 6 figures. Includes supplementary material. Under review as a conference paper at ICLR 2026

Summary
Image super-resolution (SR) is critical to many vision systems such as surveillance, autonomous driving, document analysis, and retail analytics. GLYPH-SR is a vision-language-guided diffusion framework that aims to jointly optimize text legibility and perceptual quality. It achieves this through a Text-SR Fusion ControlNet (TS-ControlNet) and a ping-pong scheduler, training these components on a synthetic corpus while keeping the main SR branch frozen. Across multiple datasets, GLYPH-SR improves OCR F1 while maintaining competitive MANIQA, CLIP-IQA, and MUSIQ scores.

Key Takeaways

  1. Image super-resolution (SR) matters to many vision systems because it recovers high-frequency details, especially scene text, making downstream perception more reliable.
  2. Previous SR research has typically been tuned to distortion or learned perceptual metrics, which are largely insensitive to character-level errors.
  3. Work that does address text SR tends to focus on simplified benchmarks, overlooking the challenges of text in complex natural scenes.
  4. GLYPH-SR is a vision-language-guided diffusion framework that jointly optimizes text legibility and perceptual quality.
  5. GLYPH-SR achieves targeted text restoration through TS-ControlNet and a ping-pong scheduler.
  6. GLYPH-SR trains these components on a synthetic corpus while keeping the main SR branch frozen.

Cool Papers

Click here to view paper screenshots

BOLT-GAN: Bayes-Optimal Loss for Stable GAN Training

Authors:Mohammadreza Tavasoli Naeini, Ali Bereyhi, Morteza Noshad, Ben Liang, Alfred O. Hero III

We introduce BOLT-GAN, a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). We show that with a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a different metric distance than the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks (CIFAR-10, CelebA-64, LSUN Bedroom-64, and LSUN Church-64) show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). Our results suggest that BOLT is a broadly applicable principle for enhancing GAN training.
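
The abstract describes BOLT-GAN as a modification of the WGAN objective but does not state the BOLT-derived loss here. For orientation, the sketch below shows the standard WGAN critic and generator losses that serve as the starting point; the lines a BOLT-based objective would replace are only marked, not reproduced, since the exact formulation is not given in this summary.

```python
import torch

def wgan_losses(critic, real, fake):
    """Standard WGAN losses (the baseline that BOLT-GAN modifies).

    With a Lipschitz-continuous critic these estimate the Earth Mover
    (Wasserstein) distance; a BOLT-derived objective would replace the
    marked terms (not shown here, since it is not given in the abstract).
    """
    d_real = critic(real).mean()
    d_fake = critic(fake.detach()).mean()
    critic_loss = d_fake - d_real          # <-- BOLT-GAN changes this term
    gen_loss = -critic(fake).mean()        # <-- and, correspondingly, this
    return critic_loss, gen_loss

if __name__ == "__main__":
    critic = torch.nn.Sequential(torch.nn.Flatten(),
                                 torch.nn.Linear(3 * 32 * 32, 1))
    real = torch.randn(8, 3, 32, 32)
    fake = torch.randn(8, 3, 32, 32)
    c_loss, g_loss = wgan_losses(critic, real, fake)
    print(c_loss.item(), g_loss.item())
```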


Paper and Project Links

PDF

Summary

BOLT-GAN is a simple yet effective modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT). With a Lipschitz continuous discriminator, it implicitly minimizes a metric distance different from the Earth Mover (Wasserstein) distance and achieves better training stability. Empirical evaluations on four standard image generation benchmarks show that BOLT-GAN consistently outperforms WGAN, achieving 10-60% lower Frechet Inception Distance (FID). The results suggest that BOLT is a broadly applicable principle for enhancing GAN training.

Key Takeaways

  1. BOLT-GAN is a modification of the WGAN framework inspired by the Bayes Optimal Learning Threshold (BOLT).
  2. With a Lipschitz continuous discriminator, BOLT-GAN implicitly minimizes a metric distance different from the Earth Mover (Wasserstein) distance.
  3. This change leads to better training stability than the standard WGAN objective.
  4. On four standard image generation benchmarks, BOLT-GAN consistently outperforms WGAN.
  5. BOLT-GAN achieves 10-60% lower Frechet Inception Distance (FID).
  6. The empirical evaluation demonstrates that BOLT-GAN offers clear advantages for image generation tasks.

Cool Papers

Click here to view paper screenshots

PSTF-AttControl: Per-Subject-Tuning-Free Personalized Image Generation with Controllable Face Attributes

Authors:Xiang liu, Zhaoxiang Liu, Huan Hu, Zipeng Wang, Ping Chen, Zezhou Chen, Kai Wang, Shiguo Lian

Recent advancements in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques like PreciseControl have shown promise by providing fine-grained control over facial features, but they often require extensive technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach utilizes a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 using the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity, attribute features, and text embeddings into the UNet architecture, ensuring clean separation of identity and attribute information. Trained on the FFHQ dataset, our method allows for the generation of personalized images with fine-grained control over facial attributes, without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at https://github.com/UnicomAI/PSTF-AttControl.
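
The abstract names a Triplet-Decoupled Cross-Attention module that injects identity, attribute, and text embeddings into the UNet while keeping them separated. A minimal sketch of one plausible reading, three decoupled cross-attention streams fused residually, is given below; the module names, dimensions, and residual fusion are assumptions, not the paper's actual code.

```python
import torch
import torch.nn as nn

class TripletDecoupledCrossAttention(nn.Module):
    """Sketch of a cross-attention block with three decoupled streams.

    Each condition (identity, attribute, text) gets its own key/value
    projection so the streams do not mix inside a single attention map.
    The real module in the paper may differ; dimensions are illustrative.
    """

    def __init__(self, dim=320, cond_dim=768, heads=8):
        super().__init__()
        self.attn_id = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                             vdim=cond_dim, batch_first=True)
        self.attn_attr = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                               vdim=cond_dim, batch_first=True)
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=cond_dim,
                                               vdim=cond_dim, batch_first=True)

    def forward(self, x, id_emb, attr_emb, text_emb):
        # x: UNet feature tokens (B, N, dim); conditions: (B, L, cond_dim)
        out_id, _ = self.attn_id(x, id_emb, id_emb)
        out_attr, _ = self.attn_attr(x, attr_emb, attr_emb)
        out_text, _ = self.attn_text(x, text_emb, text_emb)
        return x + out_id + out_attr + out_text   # residual fusion (assumed)

if __name__ == "__main__":
    block = TripletDecoupledCrossAttention()
    x = torch.randn(2, 64, 320)
    cond = lambda L: torch.randn(2, L, 768)
    print(block(x, cond(1), cond(4), cond(77)).shape)   # -> (2, 64, 320)
```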


Paper and Project Links

PDF Accepted by Image and Vision Computing (18 pages, 8 figures)

Summary
This paper proposes a novel personalized image generation method that, in a per-subject-tuning-free (PSTF) way, enables fine-grained control over facial attributes while faithfully preserving facial identity. The approach uses a face recognition model to extract identity features, maps them into the $W^+$ latent space of StyleGAN2 with the e4e encoder, and employs a Triplet-Decoupled Cross-Attention module to integrate facial identity, attribute features, and text embeddings while keeping identity and attribute information cleanly separated. Trained on the FFHQ dataset, the method generates personalized images with fine-grained attribute control without additional fine-tuning or training data for individual identities.

Key Takeaways

  1. The method achieves high-fidelity preservation of facial identity in personalized image generation.
  2. The novel PSTF approach enables fine-grained control over facial attributes without per-subject tuning.
  3. A face recognition model extracts facial identity features, which are mapped into the $W^+$ latent space of StyleGAN2.
  4. A Triplet-Decoupled Cross-Attention module integrates facial identity, attribute features, and text embeddings.
  5. The method is trained on the FFHQ dataset, demonstrating its effectiveness and applicability.
  6. The generated images show high-quality facial synthesis, balancing personalization with attribute control.

Cool Papers

Click here to view paper screenshots

Cross-view Localization and Synthesis – Datasets, Challenges and Opportunities

Authors:Ningli Xu, Rongjun Qin

Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery, while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched against tiled overhead-image features, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. The project page is available at https://github.com/GDAOSU/Awesome-Cross-View-Methods.
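
As the survey notes, cross-view localization is typically cast as image retrieval: a ground-level embedding is matched against embeddings of tiled overhead imagery. The toy sketch below illustrates that retrieval step with cosine similarity over random placeholder embeddings; in practice the features would come from CNN or ViT encoders trained for cross-view matching.

```python
import numpy as np

def localize_by_retrieval(ground_feat, tile_feats, tile_coords):
    """Toy cross-view localization as image retrieval.

    ground_feat : (D,) embedding of a ground-level image
    tile_feats  : (N, D) embeddings of overhead image tiles
    tile_coords : list of N (lat, lon) pairs for the tiles

    The embeddings here are random placeholders; real systems would use
    features from a cross-view-trained CNN or ViT encoder.
    """
    g = ground_feat / np.linalg.norm(ground_feat)
    t = tile_feats / np.linalg.norm(tile_feats, axis=1, keepdims=True)
    sims = t @ g                      # cosine similarity to every tile
    best = int(np.argmax(sims))
    return tile_coords[best], sims[best]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    tiles = rng.standard_normal((100, 128))
    coords = [(40.0 + i * 1e-4, -83.0) for i in range(100)]
    query = tiles[42] + 0.05 * rng.standard_normal(128)   # noisy match
    print(localize_by_retrieval(query, tiles, coords))
```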


Paper and Project Links

PDF 15 Figures

Summary

This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. It also discusses current limitations, offers comparative analyses, and outlines promising directions for future research.

Key Takeaways

  1. Cross-view localization and synthesis are two fundamental tasks over cross-view datasets (overhead and ground-level imagery), with broad applications in autonomous navigation, urban planning, and augmented reality.
  2. Cross-view localization estimates the geographic position of ground-level images from overhead imagery.
  3. Cross-view synthesis generates ground-level images from overhead imagery.
  4. The main challenges for both tasks are significant differences in viewing perspective, resolution, and occlusion.
  5. Cross-view localization is typically formulated as an image retrieval problem in which ground-level features are matched against tiled overhead-image features.
  6. Cross-view synthesis typically uses generative adversarial networks (GANs) or diffusion models to generate ground-level views from overhead imagery.

Cool Papers

Click here to view paper screenshots

Wavelet-based GAN Fingerprint Detection using ResNet50

Authors:Sai Teja Erukude, Suhasnadh Reddy Veluru, Viswa Chaitanya Marella

Identifying images generated by Generative Adversarial Networks (GANs) has become a significant challenge in digital image forensics. This research presents a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classifier to differentiate StyleGAN-generated images from real ones. Haar and Daubechies wavelet filters are applied to convert the input images into multi-resolution representations, which are then fed to a ResNet50 network for classification, capitalizing on subtle artifacts left by the generative process. Moreover, the wavelet-based models are compared to an identical ResNet50 model trained on spatial data. The Haar- and Daubechies-preprocessed models achieved accuracies of 93.8 percent and 95.1 percent, respectively, much higher than the model trained in the spatial domain (81.5 percent). The Daubechies-based model outperforms Haar, showing that richer descriptive frequency patterns can yield even greater discriminative power. These results indicate that GAN-generated images have unique wavelet-domain artifacts or “fingerprints.” The proposed method illustrates the effectiveness of wavelet-domain analysis for detecting GAN images and highlights its potential for future deepfake detection systems.
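
A minimal sketch of the described pipeline, one-level 2-D DWT preprocessing followed by a ResNet50 classifier, is shown below using PyWavelets and torchvision. Which sub-bands are stacked as input channels, and the normalization, are assumptions of this sketch rather than details taken from the paper.

```python
import numpy as np
import pywt                      # PyWavelets
import torch
from torchvision.models import resnet50

def dwt_channels(gray_img, wavelet="db2"):
    """One-level 2-D DWT; stack detail sub-bands as a 3-channel input.

    The paper compares Haar ('haar') and Daubechies filters; which
    sub-bands are fed to the classifier is an assumption of this sketch.
    """
    cA, (cH, cV, cD) = pywt.dwt2(gray_img, wavelet)
    x = np.stack([cH, cV, cD]).astype(np.float32)
    x = (x - x.mean()) / (x.std() + 1e-8)
    return torch.from_numpy(x)

if __name__ == "__main__":
    img = np.random.rand(256, 256)                       # placeholder image
    x = dwt_channels(img, wavelet="haar").unsqueeze(0)   # (1, 3, 128, 128)
    model = resnet50(weights=None, num_classes=2)        # real vs GAN
    logits = model(x)
    print(x.shape, logits.shape)
```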


Paper and Project Links

PDF 6 pages; Published in IEEE

Summary
This paper proposes a wavelet-based detection method that uses discrete wavelet transform (DWT) preprocessing and a ResNet50 classifier to distinguish StyleGAN-generated images from real ones. After preprocessing with Haar and Daubechies wavelet filters, the ResNet50 network identifies GAN-generated images with accuracies of 93.8% and 95.1%, respectively. The results also show that richer frequency-domain patterns improve discriminative power, indicating that GAN-generated images carry unique wavelet-domain "fingerprints."

Key Takeaways

  1. Detecting GAN-generated images has become a significant challenge in digital image forensics.
  2. The study proposes a wavelet-based detection method that preprocesses image data with the discrete wavelet transform.
  3. A ResNet50 classifier then distinguishes StyleGAN-generated images, improving detection accuracy.
  4. With Haar and Daubechies wavelet preprocessing, accuracies reach 93.8% and 95.1%, respectively.
  5. Comparative experiments demonstrate the superiority of wavelet-domain analysis and its effectiveness for detecting GAN images.
  6. The Daubechies filter outperforms the Haar filter, showing that richer descriptive frequency patterns increase the model's discriminative power.

Cool Papers

Click here to view paper screenshots

GENRE-CMR: Generalizable Deep Learning for Diverse Multi-Domain Cardiac MRI Reconstruction

Authors:Kian Anvari Hamedani, Narges Razizadeh, Shahabedin Nabavi, Mohsen Ebrahimi Moghaddam

Accelerated Cardiovascular Magnetic Resonance (CMR) image reconstruction remains a critical challenge due to the trade-off between scan time and image quality, particularly when generalizing across diverse acquisition settings. We propose GENRE-CMR, a generative adversarial network (GAN)-based architecture employing a residual deep unrolled reconstruction framework to enhance reconstruction fidelity and generalization. The architecture unrolls iterative optimization into a cascade of convolutional subnetworks, enriched with residual connections to enable progressive feature propagation from shallow to deeper stages. To further improve performance, we integrate two loss functions: (1) an Edge-Aware Region (EAR) loss, which guides the network to focus on structurally informative regions and helps prevent common reconstruction blurriness; and (2) a Statistical Distribution Alignment (SDA) loss, which regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation. Extensive experiments confirm that GENRE-CMR surpasses state-of-the-art methods on training and unseen data, achieving 0.9552 SSIM and 38.90 dB PSNR on unseen distributions across various acceleration factors and sampling trajectories. Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization. Our framework presents a unified and robust solution for high-quality CMR reconstruction, paving the way for clinically adaptable deployment across heterogeneous acquisition protocols.
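
The SDA loss is described as regularizing the feature space across data distributions via a symmetric KL divergence. Below is a hedged sketch of one such term, a symmetric KL between channel-wise Gaussian statistics of two feature batches; the paper's exact formulation may differ.

```python
import torch

def symmetric_kl_gaussian(feat_a, feat_b, eps=1e-6):
    """Symmetric KL divergence between per-channel Gaussian statistics.

    A sketch of an SDA-style regularizer: features from two domains or
    batches (B, C, H, W) are summarized by channel-wise means and
    variances and aligned with a symmetric KL term. This is an assumed
    form, not the paper's exact loss.
    """
    def stats(f):
        f = f.flatten(2)                       # (B, C, H*W)
        return f.mean(dim=(0, 2)), f.var(dim=(0, 2)) + eps

    m1, v1 = stats(feat_a)
    m2, v2 = stats(feat_b)
    kl_12 = 0.5 * (torch.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
    kl_21 = 0.5 * (torch.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1)
    return (kl_12 + kl_21).sum()

if __name__ == "__main__":
    a = torch.randn(4, 16, 32, 32)
    b = 1.5 * torch.randn(4, 16, 32, 32) + 0.3
    print(symmetric_kl_gaussian(a, b).item())
```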


Paper and Project Links

PDF

Summary

GENRE-CMR is a GAN-based CMR reconstruction technique that uses a residual deep unrolled reconstruction framework to improve reconstruction fidelity and generalization. It combines an Edge-Aware Region (EAR) loss with a Statistical Distribution Alignment (SDA) loss to optimize performance. The method performs strongly on unseen data across various acceleration factors and sampling trajectories, paving the way for clinical deployment.

Key Takeaways

  • GENRE-CMR is a GAN-based CMR image reconstruction method that uses a deep unrolled reconstruction framework to improve reconstruction quality and generalization.
  • It combines two loss functions: an EAR loss that focuses on structurally informative regions and helps prevent reconstruction blurriness, and an SDA loss that regularizes the feature space across diverse data distributions via a symmetric KL divergence formulation.
  • Experiments show that GENRE-CMR surpasses existing methods on both training and unseen data, achieving high reconstruction quality and generalization.
  • Ablation studies confirm the contribution of each proposed component to reconstruction quality and generalization.
  • The framework offers a unified and robust solution for high-quality CMR reconstruction, promising clinical deployment across heterogeneous acquisition protocols.

Cool Papers

Click here to view paper screenshots

ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

Authors:Chihan Huang, Hao Tang

Despite the success of deep learning across various domains, it remains vulnerable to adversarial attacks. Although many existing adversarial attack methods achieve high success rates, they typically rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perceptual capabilities. Consequently, researchers have shifted their focus toward generating natural, unrestricted adversarial examples (UAEs). GAN-based approaches suffer from inherent limitations, such as poor image quality due to instability and mode collapse. Meanwhile, diffusion models have been employed for UAE generation, but they still rely on iterative PGD perturbation injection, without fully leveraging their central denoising capabilities. In this paper, we introduce a novel approach for generating UAEs based on diffusion models, named ScoreAdv. This method incorporates an interpretable adversarial guidance mechanism to gradually shift the sampling distribution towards the adversarial distribution, while using an interpretable saliency map to inject the visual information of a reference image into the generated samples. Notably, our method is capable of generating an unlimited number of natural adversarial examples and can attack not only classification models but also retrieval models. We conduct extensive experiments on ImageNet and CelebA datasets, validating the performance of ScoreAdv across ten target models in both black-box and white-box settings. Our results demonstrate that ScoreAdv achieves state-of-the-art attack success rates and image quality, while maintaining inference efficiency. Furthermore, the dynamic balance between denoising and adversarial perturbation enables ScoreAdv to remain robust even under defensive measures.
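
The core idea, shifting the sampling distribution toward an adversarial one via guidance, can be illustrated with a single guided denoising step that adds the gradient of the target-class log-probability to the denoiser output. The sketch below is generic classifier-style guidance, not ScoreAdv's actual mechanism; the denoiser and guidance scale are placeholders, and the saliency-map injection is omitted.

```python
import torch

def adversarially_guided_step(x_t, t, denoiser, classifier, target, scale=1.0):
    """One denoising step nudged toward an adversarial target class.

    Sketch only: the denoiser output is combined with the gradient of the
    target-class log-probability w.r.t. the current sample, which captures
    the general idea of steering sampling toward an adversarial
    distribution. ScoreAdv's actual guidance is more involved.
    """
    x = x_t.detach().requires_grad_(True)
    log_prob = torch.log_softmax(classifier(x), dim=-1)[:, target].sum()
    grad = torch.autograd.grad(log_prob, x)[0]          # adversarial pull
    with torch.no_grad():
        x_denoised = denoiser(x_t, t)                   # plain denoising
        return x_denoised + scale * grad                # guided update

if __name__ == "__main__":
    clf = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 32 * 32, 10))
    denoiser = lambda x, t: 0.95 * x                    # placeholder model
    x_t = torch.randn(2, 3, 32, 32)
    x_next = adversarially_guided_step(x_t, t=500, denoiser=denoiser,
                                       classifier=clf, target=3)
    print(x_next.shape)
```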


Paper and Project Links

PDF

Summary

This paper proposes ScoreAdv, a novel diffusion-model-based method for generating natural, unrestricted adversarial examples (UAEs). The method gradually shifts the sampling distribution toward the adversarial distribution via an interpretable adversarial guidance mechanism and injects the visual information of a reference image into the generated samples using an interpretable saliency map. ScoreAdv can generate an unlimited number of natural adversarial examples and attack both classification and retrieval models. Experiments on ImageNet and CelebA show that ScoreAdv achieves state-of-the-art attack success rates and image quality across ten target models in black-box and white-box settings while maintaining inference efficiency. The dynamic balance between denoising and adversarial perturbation keeps ScoreAdv robust even under defensive measures.

Key Takeaways

  1. Existing adversarial attack methods mostly rely on $\ell_{p}$-norm perturbation constraints, which do not align with human perception; research has shifted toward generating natural, unrestricted adversarial examples (UAEs).
  2. GAN-based approaches suffer from inherent limitations such as poor image quality caused by instability and mode collapse.
  3. Diffusion models have been used for UAE generation, but they still rely on iterative PGD perturbation injection and do not fully exploit their core denoising capability.
  4. ScoreAdv builds on diffusion models, generating UAEs through an interpretable adversarial guidance mechanism and a saliency map.
  5. ScoreAdv can generate an unlimited number of natural adversarial examples and attack both classification and retrieval models.
  6. Experiments on ImageNet and CelebA show that ScoreAdv achieves state-of-the-art attack success rates and image quality while maintaining inference efficiency.
  7. Its dynamic balance between denoising and adversarial perturbation keeps ScoreAdv robust under defensive measures.

Cool Papers

Click here to view paper screenshots

A Racing Dataset and Baseline Model for Track Detection in Autonomous Racing

Authors:Shreya Ghosh, Yi-Huan Chen, Ching-Hsiang Huang, Abu Shafin Mohammad Mahdee Jameel, Chien Chou Ho, Aly El Gamal, Samuel Labi

A significant challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for the downstream task. In this paper, we introduce RoRaTrack, a novel dataset that contains annotated multi-camera image data from racing scenarios for track detection. The data is collected on a Dallara AV-21 at a racing circuit in Indiana, in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses common problems such as blurriness due to high speed, color inversion from the camera, and absence of lane markings on the track. Consequently, we propose RaceGAN, a baseline model based on a Generative Adversarial Network (GAN) that effectively addresses these challenges. The proposed model demonstrates superior performance compared to current state-of-the-art machine learning models in track detection. The dataset and code for this work are available at https://github.com/ghosh64/RaceGAN.
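
The abstract only states that RaceGAN is a GAN-based baseline for track detection, without architectural detail. As a point of reference, the sketch below shows a generic pix2pix-style conditional GAN training step that maps a camera image to a track mask; it is explicitly not RaceGAN, and the toy networks and loss weights are placeholders.

```python
import torch
import torch.nn.functional as F

def cgan_track_step(gen, disc, opt_g, opt_d, image, track_mask):
    """One training step of a generic conditional GAN for track masks.

    Not RaceGAN itself (the paper's architecture is not detailed in the
    abstract); a minimal pix2pix-style baseline where the generator maps
    a camera image to a track mask and the discriminator judges
    (image, mask) pairs.
    """
    # Discriminator update
    fake_mask = gen(image).detach()
    d_real = disc(torch.cat([image, track_mask], dim=1))
    d_fake = disc(torch.cat([image, fake_mask], dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update (adversarial + L1 reconstruction, weight assumed)
    fake_mask = gen(image)
    d_fake = disc(torch.cat([image, fake_mask], dim=1))
    g_loss = (F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
              + 100.0 * F.l1_loss(fake_mask, track_mask))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

if __name__ == "__main__":
    gen = torch.nn.Sequential(torch.nn.Conv2d(3, 1, 3, padding=1), torch.nn.Sigmoid())
    disc = torch.nn.Sequential(torch.nn.Conv2d(4, 1, 3, padding=1))
    opt_g = torch.optim.Adam(gen.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(disc.parameters(), lr=2e-4)
    img = torch.rand(2, 3, 64, 64)
    mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
    print(cgan_track_step(gen, disc, opt_g, opt_d, img, mask))
```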


Paper and Project Links

PDF Currently Under Review

Summary

A key challenge in racing-related research is the lack of publicly available datasets containing raw images with corresponding annotations for downstream tasks. To address this, the paper introduces RoRaTrack, a dataset of annotated multi-camera images from racing scenarios for track detection. The data were collected on a Dallara AV-21 at a racing circuit in Indiana in collaboration with the Indy Autonomous Challenge (IAC). RoRaTrack addresses problems such as blurriness due to high speed, color inversion from the camera, and the absence of lane markings on the track. The paper also proposes RaceGAN, a Generative Adversarial Network (GAN)-based baseline model that effectively handles these challenges and outperforms current state-of-the-art machine learning models in track detection. The dataset and code are available at https://github.com/ghosh64/RaceGAN.

Key Takeaways

  1. The lack of publicly available annotated raw-image datasets is a major challenge for racing-related research.
  2. The paper introduces RoRaTrack, a dataset of annotated multi-camera images for track detection.
  3. The data were collected in a real racing environment using a Dallara AV-21.
  4. RoRaTrack addresses blurriness from high speed, camera color inversion, and the absence of lane markings on the track.
  5. The paper proposes RaceGAN, a Generative Adversarial Network (GAN)-based baseline model, to address these challenges.
  6. RaceGAN delivers superior performance in track detection.
Cool Papers

Click here to view paper screenshots

SRAGAN: Saliency Regularized and Attended Generative Adversarial Network for Chinese Ink-wash Painting Style Transfer

Authors:Xiang Gao, Yuqi Zhang

Recent style transfer problems are still largely dominated by Generative Adversarial Networks (GANs) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide the stylization process. Besides, we also propose a saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate the superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.
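
The SIOU loss enforces saliency consistency before and after stylization. A hedged sketch of such a regularizer, one minus the soft IoU of the two saliency maps, is given below; the paper's exact formulation may differ.

```python
import torch

def saliency_iou_loss(sal_src, sal_stylized, eps=1e-6):
    """Soft-IoU consistency between saliency maps of content and output.

    Sketch of an SIOU-style regularizer: both maps are in [0, 1] with
    shape (B, 1, H, W); the loss is 1 minus their soft IoU, so minimizing
    it enforces saliency consistency before and after stylization. This
    is an assumed form, not necessarily the paper's exact loss.
    """
    inter = (sal_src * sal_stylized).sum(dim=(1, 2, 3))
    union = (sal_src + sal_stylized - sal_src * sal_stylized).sum(dim=(1, 2, 3))
    return (1.0 - inter / (union + eps)).mean()

if __name__ == "__main__":
    a = torch.rand(4, 1, 128, 128)
    b = torch.clamp(a + 0.1 * torch.randn_like(a), 0, 1)
    print(saliency_iou_loss(a, b).item())
```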


Paper and Project Links

PDF Pattern Recognition, Volume 162, June 2025, 111344

Summary
From the perspective of cross-domain image-to-image (I2I) translation, style transfer is still largely dominated by Generative Adversarial Networks (GANs). This paper addresses translating real photographs into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Although many I2I models tackle this problem, a notable challenge is that the content details of the source image are easily erased or corrupted during the transfer of ink-wash style elements. To remedy this, the authors incorporate saliency detection into the unpaired I2I framework to regularize image content, using the detected saliency map in two ways: a saliency IOU (SIOU) loss explicitly regularizes object content structure by enforcing saliency consistency before and after stylization, and saliency adaptive normalization (SANorm) implicitly enhances the structural integrity of generated paintings by dynamically injecting saliency information into the generator. A saliency attended discriminator further focuses adversarial attention on the drawn objects, producing more vivid and delicate brush strokes and ink-wash textures. Experiments show the approach outperforms advanced image stylization methods in both GAN and diffusion model paradigms.

Key Takeaways

  1. GANs still dominate cross-domain image style transfer, including traditional Chinese ink-wash painting style transfer.
  2. Saliency detection is incorporated into the unpaired I2I framework to regularize image content and prevent source content details from being lost during style transfer.
  3. A saliency IOU (SIOU) loss and saliency adaptive normalization (SANorm) respectively regularize object content structure explicitly and enhance structural integrity implicitly, improving stylization results.
  4. A saliency attended discriminator focuses adversarial attention on the drawn objects, yielding more vivid and delicate brush strokes and ink-wash textures.
  5. Experiments show the method outperforms other advanced image stylization approaches both quantitatively and qualitatively.
  6. The approach has potential value for preserving and developing traditional ink-wash painting art and for converting real photographs into artworks.
Cool Papers

Click here to view paper screenshots

Time Weaver: A Conditional Time Series Generation Model

Authors:Sai Shankar Narasimhan, Shubhankar Agarwal, Oguzhan Akcin, Sujay Sanghavi, Sandeep Chinchali

Imagine generating a city’s electricity demand pattern based on weather, the presence of an electric vehicle, and location, which could be used for capacity planning during a winter freeze. Such real-world time series are often enriched with paired heterogeneous contextual metadata (e.g., weather and location). Current approaches to time series generation often ignore this paired metadata. Additionally, the heterogeneity in metadata poses several practical challenges in adapting existing conditional generation approaches from the image, audio, and video domains to the time series domain. To address this gap, we introduce TIME WEAVER, a novel diffusion-based model that leverages the heterogeneous metadata in the form of categorical, continuous, and even time-variant variables to significantly improve time series generation. Additionally, we show that naive extensions of standard evaluation metrics from the image to the time series domain are insufficient. These metrics do not penalize conditional generation approaches for their poor specificity in reproducing the metadata-specific features in the generated time series. Thus, we innovate a novel evaluation metric that accurately captures the specificity of conditional generation and the realism of the generated time series. We show that TIME WEAVER outperforms state-of-the-art benchmarks, such as Generative Adversarial Networks (GANs), by up to 30% in downstream classification tasks on real-world energy, medical, air quality, and traffic datasets.
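
TIME WEAVER conditions a diffusion model on heterogeneous metadata: categorical, continuous, and time-variant variables. The sketch below shows one plausible way to embed such metadata into a per-timestep conditioning tensor for the electricity-demand example; the variable names, dimensions, and fusion scheme are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class MetadataEncoder(nn.Module):
    """Sketch of heterogeneous metadata conditioning for a time-series
    diffusion model: categorical, continuous, and time-variant metadata
    are embedded separately and fused into a per-timestep condition.
    Names and dimensions are illustrative, not TIME WEAVER's actual code.
    """

    def __init__(self, n_locations=100, d=64, seq_len=96):
        super().__init__()
        self.loc_emb = nn.Embedding(n_locations, d)     # categorical: location
        self.ev_emb = nn.Embedding(2, d)                # categorical: EV present?
        self.cont_proj = nn.Linear(1, d)                # continuous: e.g. capacity
        self.weather_proj = nn.Linear(2, d)             # time-variant: temp, humidity
        self.seq_len = seq_len

    def forward(self, location, has_ev, capacity, weather):
        # location, has_ev: (B,) ints; capacity: (B, 1); weather: (B, T, 2)
        static = self.loc_emb(location) + self.ev_emb(has_ev) + self.cont_proj(capacity)
        cond = static.unsqueeze(1).expand(-1, self.seq_len, -1) + self.weather_proj(weather)
        return cond                                     # (B, T, d) conditioning

if __name__ == "__main__":
    enc = MetadataEncoder()
    B, T = 4, 96
    cond = enc(torch.randint(0, 100, (B,)), torch.randint(0, 2, (B,)),
               torch.rand(B, 1), torch.randn(B, T, 2))
    print(cond.shape)   # torch.Size([4, 96, 64])
```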


Paper and Project Links

PDF

Summary

This paper proposes TIME WEAVER, a novel diffusion-based model that leverages heterogeneous metadata (categorical, continuous, and time-variant variables) to improve time series generation, e.g., generating a city's electricity demand pattern conditioned on paired metadata such as weather, the presence of an electric vehicle, and location. The paper also introduces a new evaluation metric that more accurately captures the specificity of conditional generation and the realism of the generated time series. Experiments show that TIME WEAVER outperforms state-of-the-art benchmarks such as Generative Adversarial Networks (GANs) by up to 30% on downstream classification tasks over real-world energy, medical, air quality, and traffic datasets.

Key Takeaways

  1. TIME WEAVER leverages heterogeneous metadata to improve time series generation.
  2. The model can generate a city's electricity demand pattern conditioned on paired metadata such as weather, the presence of an electric vehicle, and location.
  3. Naive extensions of standard image-domain evaluation metrics are insufficient for time series generation because they do not penalize poor specificity in reproducing metadata-specific features.
  4. A new evaluation metric is proposed to more accurately capture the specificity of conditional generation and the realism of the generated time series.
  5. TIME WEAVER performs strongly on downstream classification tasks across several real-world datasets.
  6. The model improves on GAN-based baselines, delivering higher generation quality.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit Kedreamix as the source when reposting!