⚠️ All of the summaries below were generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-16
VLF-MSC: Vision-Language Feature-Based Multimodal Semantic Communication System
Authors:Gwangyeon Ahn, Jiwan Seo, Joonhyuk Kang
We propose Vision-Language Feature-based Multimodal Semantic Communication (VLF-MSC), a unified system that transmits a single compact vision-language representation to support both image and text generation at the receiver. Unlike existing semantic communication techniques that process each modality separately, VLF-MSC employs a pre-trained vision-language model (VLM) to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation eliminates the need for modality-specific streams or retransmissions, improving spectral efficiency and adaptability. By leveraging foundation models, the system achieves robustness to channel noise while preserving semantic fidelity. Experiments demonstrate that VLF-MSC outperforms text-only and image-only baselines, achieving higher semantic accuracy for both modalities under low SNR with significantly reduced bandwidth.
Paper and Project Links
PDF To appear in the AI4NextG Workshop at NeurIPS 2025
Summary
VLF-MSC is a vision-language feature-based multimodal semantic communication system that transmits a single compact vision-language representation to support both image and text generation at the receiver. The system uses a pre-trained vision-language model to encode the source image into a vision-language semantic feature (VLF), which is transmitted over the wireless channel. At the receiver, a decoder-based language model and a diffusion-based image generator are both conditioned on the VLF to produce a descriptive text and a semantically aligned image. This unified representation improves spectral efficiency and adaptability, and achieves robustness to channel noise while preserving semantic fidelity. Experiments show that VLF-MSC attains high semantic accuracy under low SNR while substantially reducing bandwidth usage.
Key Takeaways
- VLF-MSC proposes a multimodal semantic communication approach built on vision-language features.
- A pre-trained vision-language model encodes the source image into a vision-language semantic feature (VLF).
- At the receiver, a decoder-based language model and a diffusion-based image generator produce text and images conditioned on the VLF.
- The unified representation improves spectral efficiency and adaptability.
- By leveraging foundation models, the system is robust to channel noise while preserving semantic fidelity (see the sketch below).
- Experiments show that VLF-MSC achieves high semantic accuracy for both modalities under low SNR.
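To make the channel-robustness claim concrete, here is a minimal, self-contained sketch (not from the paper) that pushes a stand-in semantic feature vector through an AWGN channel at several SNRs and reports cosine similarity as a rough proxy for semantic fidelity; the 512-dimensional vector, the AWGN model, and the similarity measure are all assumptions made for illustration.
```python
import numpy as np

def awgn(x: np.ndarray, snr_db: float, rng: np.random.Generator) -> np.ndarray:
    """Add white Gaussian noise to a feature vector at the given SNR (in dB)."""
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    return x + rng.normal(0.0, np.sqrt(noise_power), size=x.shape)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vlf = rng.standard_normal(512)           # stand-in for the encoder's VLF embedding
for snr_db in (20, 10, 0, -5):
    received = awgn(vlf, snr_db, rng)    # noisy channel output seen by the receiver
    print(f"SNR {snr_db:>3} dB -> cosine similarity {cosine(vlf, received):.3f}")
```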
STELLAR: Scene Text Editor for Low-Resource Languages and Real-World Data
Authors:Yongdeuk Seo, Hyun-seok Min, Sungchul Choi
Scene Text Editing (STE) is the task of modifying text content in an image while preserving its visual style, such as font, color, and background. While recent diffusion-based approaches have shown improvements in visual quality, key limitations remain: lack of support for low-resource languages, domain gap between synthetic and real data, and the absence of appropriate metrics for evaluating text style preservation. To address these challenges, we propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy that first pre-trains on synthetic data and then fine-tunes on real images. We also construct a new dataset, STIPLAR(Scene Text Image Pairs of Low-resource lAnguages and Real-world data), for training and evaluation. Furthermore, we propose Text Appearance Similarity (TAS), a novel metric that assesses style preservation by independently measuring font, color, and background similarity, enabling robust evaluation even without ground truth. Experimental results demonstrate that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, achieving an average TAS improvement of 2.2% across languages over the baselines.
Paper and Project Links
PDF Accepted to AAAI Workshop (Artificial Intelligence with Biased or Scarce Data)
Summary
Scene Text Editing (STE) modifies the text content of an image while preserving its visual style, such as font, color, and background. To address the challenges of low-resource languages, the domain gap between synthetic and real data, and the lack of suitable metrics for evaluating text style preservation, the authors propose STELLAR (Scene Text Editor for Low-resource LAnguages and Real-world data). STELLAR enables reliable multilingual editing through a language-adaptive glyph encoder and a multi-stage training strategy (pre-training on synthetic data, then fine-tuning on real images). A new dataset, STIPLAR, is constructed for training and evaluation, and a new metric, Text Appearance Similarity (TAS), independently measures font, color, and background similarity, enabling robust evaluation even without ground truth. Experiments show that STELLAR outperforms state-of-the-art models in visual consistency and recognition accuracy, with an average TAS improvement of 2.2% across languages over the baselines.
Key Takeaways
- Scene Text Editing (STE) modifies the text content of an image while preserving its visual style.
- Existing diffusion-based approaches face challenges such as limited support for low-resource languages and the domain gap between synthetic and real data.
- STELLAR addresses these issues with a language-adaptive glyph encoder and a multi-stage training strategy.
- A new dataset, STIPLAR, is constructed for training and evaluation.
- A new metric, Text Appearance Similarity (TAS), independently measures font, color, and background similarity (see the sketch below).
- STELLAR outperforms existing models in visual consistency and recognition accuracy.
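The paper's exact TAS formulation is not reproduced here; the toy sketch below only illustrates the general idea of scoring style preservation as a weighted combination of per-attribute similarities (mean color plus a crude background/texture cue; the font term is omitted). All function names, weights, and inputs are assumptions.
```python
import numpy as np

def color_similarity(patch_a: np.ndarray, patch_b: np.ndarray) -> float:
    """Similarity of mean RGB color, mapped to roughly [0, 1]."""
    dist = np.linalg.norm(patch_a.mean(axis=(0, 1)) - patch_b.mean(axis=(0, 1)))
    return 1.0 - dist / np.sqrt(3.0)        # RGB in [0, 1], so max distance is sqrt(3)

def histogram_similarity(patch_a: np.ndarray, patch_b: np.ndarray, bins: int = 16) -> float:
    """Histogram intersection over grayscale intensities, a crude background/texture cue."""
    ha, _ = np.histogram(patch_a.mean(axis=2), bins=bins, range=(0, 1))
    hb, _ = np.histogram(patch_b.mean(axis=2), bins=bins, range=(0, 1))
    ha, hb = ha / ha.sum(), hb / hb.sum()
    return float(np.minimum(ha, hb).sum())

def toy_style_similarity(patch_a, patch_b, weights=(0.5, 0.5)) -> float:
    """Weighted combination of per-attribute similarities (font term omitted here)."""
    return weights[0] * color_similarity(patch_a, patch_b) + \
           weights[1] * histogram_similarity(patch_a, patch_b)

rng = np.random.default_rng(0)
a = rng.random((64, 256, 3))                                   # a text-region crop
b = np.clip(a + 0.05 * rng.standard_normal(a.shape), 0, 1)     # slightly perturbed copy
print(f"toy style similarity: {toy_style_similarity(a, b):.3f}")
```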
Equivariant Sampling for Improving Diffusion Model-based Image Restoration
Authors:Chenxu Wu, Qingpeng Kong, Peiang Zhao, Wendi Yang, Wenxin Ma, Fenghe Tang, Zihang Jiang, S. Kevin Zhou
Recent advances in generative models, especially diffusion models, have significantly improved image restoration (IR) performance. However, existing problem-agnostic diffusion model-based image restoration (DMIR) methods face challenges in fully leveraging diffusion priors, resulting in suboptimal performance. In this paper, we address the limitations of current problem-agnostic DMIR methods by analyzing their sampling process and providing effective solutions. We introduce EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further boost EquS, we propose the Timestep-Aware Schedule (TAS) and introduce EquS$^+$. TAS prioritizes deterministic steps to enhance certainty and sampling efficiency. Extensive experiments on benchmarks demonstrate that our method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational costs. Our code is available at https://github.com/FouierL/EquS.
Paper and Project Links
PDF 12 pages, 9 figures
Summary
Diffusion models have substantially advanced image restoration (IR), but existing problem-agnostic diffusion model-based image restoration (DMIR) methods do not fully exploit diffusion priors, leaving room for improvement. By analyzing the sampling process of current methods, this paper proposes EquS, a DMIR method that imposes equivariant information through dual sampling trajectories. To further strengthen EquS, the authors propose the Timestep-Aware Schedule (TAS), which prioritizes deterministic steps to improve certainty and sampling efficiency, yielding EquS+. Experiments show the method is compatible with previous problem-agnostic DMIR methods and significantly boosts their performance without increasing computational cost.
Key Takeaways
- Diffusion models have significantly improved image restoration performance.
- Existing problem-agnostic DMIR methods are limited in how fully they exploit diffusion priors.
- EquS imposes equivariant information through dual sampling trajectories to improve restoration performance (see the sketch below).
- The Timestep-Aware Schedule (TAS) strengthens EquS by prioritizing deterministic steps, improving certainty and sampling efficiency.
- EquS+ is the upgraded variant; it is compatible with existing methods and significantly boosts their performance.
- Extensive experiments demonstrate excellent performance on standard benchmarks.
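EquS's actual dual-trajectory construction operates inside the diffusion sampler; the sketch below only illustrates the generic idea of imposing equivariant information by fusing two symmetry-related passes (plain, and flip-denoise-unflip) through a toy denoiser. The local-averaging denoiser and the horizontal-flip symmetry are placeholders, not the paper's method.
```python
import numpy as np

def toy_denoiser(x: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained denoising network (here: simple 3x3 local averaging)."""
    padded = np.pad(x, 1, mode="edge")
    out = np.zeros_like(x)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy : 1 + dy + x.shape[0], 1 + dx : 1 + dx + x.shape[1]]
    return out / 9.0

def equivariant_denoise(x: np.ndarray) -> np.ndarray:
    """Fuse two symmetry-related trajectories: plain, and flip -> denoise -> flip back."""
    plain = toy_denoiser(x)
    flipped = np.fliplr(toy_denoiser(np.fliplr(x)))
    return 0.5 * (plain + flipped)

rng = np.random.default_rng(0)
noisy = rng.standard_normal((32, 32))
out = equivariant_denoise(noisy)
# The fused estimate is, by construction, equivariant to horizontal flips of the input.
print(np.allclose(np.fliplr(equivariant_denoise(np.fliplr(noisy))), out))
```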
GPDM: Generation-Prior Diffusion Model for Accelerated Direct Attenuation and Scatter Correction of Whole-body 18F-FDG PET
Authors:Min Jeong Cho, Hyeong Seok Shim, Sungyu Kim, Jae Sung Lee
Accurate attenuation and scatter corrections are crucial in positron emission tomography (PET) imaging for accurate visual interpretation and quantitative analysis. Traditional methods relying on computed tomography (CT) or magnetic resonance imaging (MRI) have limitations in accuracy, radiation exposure, and applicability. Deep neural networks provide potential approaches to estimating attenuation and scatter-corrected (ASC) PET from non-attenuation and non-scatter-corrected (NASC) PET images based on VAE or CycleGAN. However, the limitations inherent to conventional GAN-based methods, such as unstable training and mode collapse, need further advancements. To address these limitations and achieve more accurate attenuation and scatter corrections, we propose a novel framework for generating high-quality ASC PET images from NASC PET images: Generation-Prior Diffusion Model (GPDM). Our GPDM framework is based on the Denoising Diffusion Probabilistic Model (DDPM), but instead of starting sampling from an entirely different image distribution, it begins from a distribution similar to the target images we aim to generate. This similar distribution is referred to as the Generation-Prior. By leveraging this Generation-Prior, the GPDM framework effectively reduces the number of sampling steps and generates more refined ASC PET images. Our experimental results demonstrate that GPDM outperforms existing methods in generating ASC PET images, achieving superior accuracy while significantly reducing sampling time. These findings highlight the potential of GPDM to address the limitations of conventional methods and establish a new standard for efficient and accurate attenuation and scatter correction in PET imaging.
Paper and Project Links
PDF 25 pages, 10 figures
Summary
The authors propose GPDM (Generation-Prior Diffusion Model), a new framework that generates high-quality attenuation- and scatter-corrected (ASC) PET images from non-attenuation- and non-scatter-corrected (NASC) PET images. The framework is based on the Denoising Diffusion Probabilistic Model (DDPM) but begins sampling from a distribution similar to the target images, called the Generation-Prior, which effectively reduces the number of sampling steps and yields more refined ASC PET images. Experiments show that GPDM outperforms existing methods in generating ASC PET images, with higher accuracy and significantly reduced sampling time, offering a path to efficient and accurate attenuation and scatter correction in PET imaging.
Key Takeaways
- Accurate attenuation and scatter correction is critical in PET imaging, affecting both visual interpretation and quantitative analysis.
- Traditional approaches (e.g., CT- or MRI-based) have limitations in accuracy, radiation exposure, and applicability.
- Deep neural networks offer a way to estimate ASC PET images directly from NASC PET images.
- GPDM is a new framework for generating high-quality ASC PET images.
- GPDM builds on DDPM but starts sampling from a Generation-Prior similar to the target distribution, reducing sampling steps and producing more refined images (see the sketch below).
- Experiments show GPDM outperforms existing methods in generating ASC PET images with higher accuracy.
- GPDM has the potential to overcome the limitations of conventional methods and set a new standard for efficient and accurate attenuation and scatter correction in PET imaging.
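As a rough illustration of starting the reverse process from a generation prior rather than pure noise, the toy DDPM loop below begins from a noised version of the NASC image at an intermediate timestep, so far fewer steps are run. The schedule, the zero-output noise predictor, and the choice of t_start are placeholders; GPDM's actual prior construction may differ.
```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def toy_eps_model(x_t: np.ndarray, t: int) -> np.ndarray:
    """Stand-in for a trained noise-prediction network."""
    return np.zeros_like(x_t)

def sample_from_prior(nasc_image: np.ndarray, t_start: int, rng) -> np.ndarray:
    """Start reverse diffusion from a noised version of the NASC image (the 'prior')
    instead of pure Gaussian noise, so only t_start steps are needed."""
    noise = rng.standard_normal(nasc_image.shape)
    x = np.sqrt(alphas_bar[t_start]) * nasc_image + np.sqrt(1 - alphas_bar[t_start]) * noise
    for t in range(t_start, 0, -1):
        eps = toy_eps_model(x, t)
        alpha_t, abar_t = 1.0 - betas[t], alphas_bar[t]
        x = (x - betas[t] / np.sqrt(1 - abar_t) * eps) / np.sqrt(alpha_t)
        if t > 1:
            x += np.sqrt(betas[t]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
nasc = rng.random((64, 64))                                    # placeholder NASC PET slice
asc_estimate = sample_from_prior(nasc, t_start=250, rng=rng)   # 250 steps instead of 1000
print(asc_estimate.shape)
```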
Debiased Dual-Invariant Defense for Adversarially Robust Person Re-Identification
Authors:Yuhang Zhou, Yanxiang Zhao, Zhongyun Hua, Zhipu Liu, Zhaoquan Gu, Qing Liao, Leo Yu Zhang
Person re-identification (ReID) is a fundamental task in many real-world applications such as pedestrian trajectory tracking. However, advanced deep learning-based ReID models are highly susceptible to adversarial attacks, where imperceptible perturbations to pedestrian images can cause entirely incorrect predictions, posing significant security threats. Although numerous adversarial defense strategies have been proposed for classification tasks, their extension to metric learning tasks such as person ReID remains relatively unexplored. Moreover, the several existing defenses for person ReID fail to address the inherent unique challenges of adversarially robust ReID. In this paper, we systematically identify the challenges of adversarial defense in person ReID into two key issues: model bias and composite generalization requirements. To address them, we propose a debiased dual-invariant defense framework composed of two main phases. In the data balancing phase, we mitigate model bias using a diffusion-model-based data resampling strategy that promotes fairness and diversity in training data. In the bi-adversarial self-meta defense phase, we introduce a novel metric adversarial training approach incorporating farthest negative extension softening to overcome the robustness degradation caused by the absence of classifier. Additionally, we introduce an adversarially-enhanced self-meta mechanism to achieve dual-generalization for both unseen identities and unseen attack types. Experiments demonstrate that our method significantly outperforms existing state-of-the-art defenses.
Paper and Project Links
PDF Accepted by AAAI 2026
Summary
This paper examines the challenges of defending person re-identification (ReID) against adversarial attacks and proposes a debiased dual-invariant defense framework consisting of a data balancing phase and a bi-adversarial self-meta defense phase. A diffusion-model-based data resampling strategy promotes fairness and diversity in the training data, and a new metric adversarial training approach counters the robustness degradation caused by the absence of a classifier. Experiments show the method significantly outperforms existing state-of-the-art defenses.
Key Takeaways
- Person re-identification (ReID) faces significant security threats from adversarial attacks.
- Adversarial defenses for ReID face challenges such as model bias and composite generalization requirements.
- The proposed debiased dual-invariant defense framework comprises a data balancing phase and a bi-adversarial self-meta defense phase.
- The data balancing phase uses diffusion-model-based resampling to promote fairness and diversity in the training data.
- The bi-adversarial self-meta defense phase introduces a new metric adversarial training approach to overcome robustness degradation (a generic sketch follows below).
- An adversarially-enhanced self-meta mechanism achieves dual generalization to unseen identities and unseen attack types.
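The paper's farthest-negative-extension softening and self-meta mechanism are not reproduced here; the sketch below only shows generic metric adversarial training, i.e., a PGD attack that pushes an image's embedding away from its clean embedding, followed by a training step that pulls it back. The toy embedder and hyperparameters are assumptions.
```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))   # toy ReID embedder

def pgd_on_embedding(x, steps=5, eps=8 / 255, alpha=2 / 255):
    """Craft a perturbation that pushes the embedding away from the clean one."""
    x_adv = x.clone().detach()
    with torch.no_grad():
        target = embed(x)                                   # clean embedding, kept fixed
    for _ in range(steps):
        x_adv.requires_grad_(True)
        dist = (embed(x_adv) - target).pow(2).sum()
        grad, = torch.autograd.grad(dist, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()             # ascend the metric loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)        # stay inside the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()

opt = torch.optim.Adam(embed.parameters(), lr=1e-4)
x = torch.rand(4, 3, 64, 64)                                # toy batch of pedestrian crops
x_adv = pgd_on_embedding(x)
loss = (embed(x_adv) - embed(x)).pow(2).sum(dim=1).mean()   # pull adversarial embeddings back
opt.zero_grad(); loss.backward(); opt.step()
print(f"metric adversarial loss: {loss.item():.4f}")
```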
FlowCast: Advancing Precipitation Nowcasting with Conditional Flow Matching
Authors:Bernardo Perrone Ribeiro, Jana Faganeli Pucer
Radar-based precipitation nowcasting, the task of forecasting short-term precipitation fields from previous radar images, is a critical problem for flood risk management and decision-making. While deep learning has substantially advanced this field, two challenges remain fundamental: the uncertainty of atmospheric dynamics and the efficient modeling of high-dimensional data. Diffusion models have shown strong promise by producing sharp, reliable forecasts, but their iterative sampling process is computationally prohibitive for time-critical applications. We introduce FlowCast, the first model to apply Conditional Flow Matching (CFM) to precipitation nowcasting. Unlike diffusion, CFM learns a direct noise-to-data mapping, enabling rapid, high-fidelity sample generation with drastically fewer function evaluations. Our experiments demonstrate that FlowCast establishes a new state-of-the-art in predictive accuracy. A direct comparison further reveals the CFM objective is both more accurate and significantly more efficient than a diffusion objective on the same architecture, maintaining high performance with significantly fewer sampling steps. This work positions CFM as a powerful and practical alternative for high-dimensional spatiotemporal forecasting.
Paper and Project Links
PDF Under Review
Summary
Radar-based precipitation nowcasting, i.e., forecasting short-term precipitation fields from previous radar images, is critical for flood risk management and decision-making. Deep learning has advanced the field considerably, but two challenges remain: the uncertainty of atmospheric dynamics and the efficient modeling of high-dimensional data. Diffusion models produce sharp, reliable forecasts, but their iterative sampling is too expensive for time-critical applications. This work introduces FlowCast, the first model to apply Conditional Flow Matching (CFM) to precipitation nowcasting. Unlike diffusion, CFM learns a direct noise-to-data mapping, enabling rapid, high-fidelity sample generation with far fewer function evaluations. Experiments show FlowCast sets a new state of the art in predictive accuracy, and a direct comparison shows the CFM objective is both more accurate and markedly more efficient than a diffusion objective on the same architecture, maintaining high performance with far fewer sampling steps. This positions CFM as a powerful and practical alternative for high-dimensional spatiotemporal forecasting.
Key Takeaways
- Radar-based precipitation nowcasting plays an important role in flood risk management and decision-making.
- Deep learning for radar nowcasting still faces the challenges of atmospheric uncertainty and high-dimensional data modeling.
- Diffusion models produce sharp forecasts but are computationally too costly for time-critical applications.
- FlowCast is the first model to apply Conditional Flow Matching (CFM) to precipitation nowcasting (see the sketch below).
- CFM learns a direct noise-to-data mapping, enabling rapid, high-fidelity sample generation.
- Experiments show FlowCast surpasses existing techniques in predictive accuracy.
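For readers unfamiliar with Conditional Flow Matching, here is a minimal training-and-sampling sketch with a linear (optimal-transport-style) probability path; conditioning on past radar frames is omitted, and the tiny MLP, frame size, and hyperparameters are placeholders rather than FlowCast's architecture.
```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Tiny stand-in for the nowcasting backbone; predicts a velocity field v(x_t, t)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

dim = 64 * 64                      # a flattened toy radar frame
model = ToyVelocityNet(dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One CFM training step with a straight path from noise to data.
x1 = torch.rand(8, dim)            # "data": future precipitation frames (random here)
x0 = torch.randn(8, dim)           # noise samples
t = torch.rand(8, 1)               # random time in [0, 1]
x_t = (1 - t) * x0 + t * x1        # point on the straight path from noise to data
target_v = x1 - x0                 # constant velocity along that path
loss = ((model(x_t, t) - target_v) ** 2).mean()
opt.zero_grad(); loss.backward(); opt.step()

# Sampling: integrate dx/dt = v(x, t) from t=0 to t=1 with a few Euler steps.
steps = 8
x = torch.randn(1, dim)
with torch.no_grad():
    for i in range(steps):
        t_i = torch.full((1, 1), i / steps)
        x = x + model(x, t_i) / steps
print(x.shape)
```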
MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
Authors:Ye Tian, Ling Yang, Jiongfan Yang, Anran Wang, Yu Tian, Jiani Zheng, Haochen Wang, Zhiyang Teng, Zhuochen Wang, Yinjie Wang, Yunhai Tong, Mengdi Wang, Xiangtai Li
While thinking-aware generation aims to improve performance on complex tasks, we identify a critical failure mode where existing sequential, autoregressive approaches can paradoxically degrade performance due to error propagation. To systematically analyze this issue, we propose ParaBench, a new benchmark designed to evaluate both text and image output modalities. Our analysis using ParaBench reveals that this performance degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, we propose a parallel multimodal diffusion framework, MMaDA-Parallel, that enables continuous, bidirectional interaction between text and images throughout the entire denoising trajectory. MMaDA-Parallel is trained with supervised finetuning and then further optimized by Parallel Reinforcement Learning (ParaRL), a novel strategy that applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments validate that our model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench compared to the state-of-the-art model, Bagel, establishing a more robust paradigm for thinking-aware image synthesis. Our code is open-sourced at https://github.com/tyfeld/MMaDA-Parallel
Paper and Project Links
PDF Project Page: https://tyfeld.github.io/mmadaparellel.github.io/
Summary
While thinking-aware generation aims to improve performance on complex tasks, existing sequential autoregressive approaches can paradoxically degrade performance due to error propagation. To study this, the authors propose ParaBench, a benchmark that evaluates both text and image output modalities; the analysis shows the degradation is strongly correlated with poor alignment between the generated reasoning and the final image. To resolve this, they propose MMaDA-Parallel, a parallel multimodal diffusion framework that enables continuous bidirectional interaction between text and images throughout the denoising trajectory. MMaDA-Parallel is trained with supervised fine-tuning and then further optimized with Parallel Reinforcement Learning (ParaRL), which applies semantic rewards along the trajectory to enforce cross-modal consistency. Experiments show the model significantly improves cross-modal alignment and semantic consistency, achieving a 6.9% improvement in Output Alignment on ParaBench over the state-of-the-art model Bagel and establishing a more robust paradigm for thinking-aware image synthesis.
Key Takeaways
- Existing sequential autoregressive approaches can paradoxically degrade performance on complex tasks due to error propagation.
- ParaBench evaluates text and image output modalities and reveals the link between performance degradation and poor alignment between generated reasoning and images.
- MMaDA-Parallel enables continuous bidirectional interaction between text and images throughout generation.
- MMaDA-Parallel is trained with supervised fine-tuning and further optimized with Parallel Reinforcement Learning (ParaRL).
- Semantic rewards are applied along the generation trajectory to enforce cross-modal consistency.
- Experiments show MMaDA-Parallel significantly improves cross-modal alignment and semantic consistency.
Bridging the Data Gap: Spatially Conditioned Diffusion Model for Anomaly Generation in Photovoltaic Electroluminescence Images
Authors:Shiva Hanifi, Sasan Jafarnejad, Marc Köntges, Andrej Wentnagel, Andreas Kokkas, Raphael Frank
Reliable anomaly detection in photovoltaic (PV) modules is critical for maintaining solar energy efficiency. However, developing robust computer vision models for PV inspection is constrained by the scarcity of large-scale, diverse, and balanced datasets. This study introduces PV-DDPM, a spatially conditioned denoising diffusion probabilistic model that generates anomalous electroluminescence (EL) images across four PV cell types: multi-crystalline silicon (multi-c-Si), mono-crystalline silicon (mono-c-Si), half-cut multi-c-Si, and interdigitated back contact (IBC) with dogbone interconnect. PV-DDPM enables controlled synthesis of single-defect and multi-defect scenarios by conditioning on binary masks representing structural features and defect positions. To the best of our knowledge, this is the first framework that jointly models multiple PV cell types while supporting simultaneous generation of diverse anomaly types. We also introduce E-SCDD, an enhanced version of the SCDD dataset, comprising 1,000 pixel-wise annotated EL images spanning 30 semantic classes, and 1,768 unlabeled synthetic samples. Quantitative evaluation shows our generated images achieve a Fréchet Inception Distance (FID) of 4.10 and Kernel Inception Distance (KID) of 0.0023 $\pm$ 0.0007 across all categories. Training the vision–language anomaly detection model AA-CLIP on E-SCDD, compared to the SCDD dataset, improves pixel-level AUC and average precision by 1.70 and 8.34 points, respectively.
Paper and Project Links
PDF 8 pages, 4 figures
Summary
This study introduces PV-DDPM, a spatially conditioned denoising diffusion probabilistic model that generates anomalous electroluminescence (EL) images for four photovoltaic (PV) cell types. By conditioning on binary masks that encode structural features and defect positions, it enables controlled synthesis of single-defect and multi-defect scenarios, augmenting the data available for reliable anomaly detection in PV modules. The study also releases E-SCDD, an enhanced version of the SCDD dataset containing pixel-wise annotated EL images and synthetic samples. Training the vision-language anomaly detection model AA-CLIP on E-SCDD improves detection accuracy.
Key Takeaways
- PV-DDPM is a spatially conditioned diffusion probabilistic model that generates anomalous electroluminescence (EL) images for multiple PV cell types.
- Conditioning on binary masks enables controlled synthesis of single-defect and multi-defect scenarios, augmenting data for PV anomaly detection (see the sketch below).
- The enhanced E-SCDD dataset provides richer pixel-wise annotated EL images plus synthetic samples.
- E-SCDD supplies additional training data for anomaly detection, helping improve model performance.
- Training AA-CLIP on E-SCDD improves pixel-level AUC and average precision over training on SCDD.
- The work jointly models multiple PV cell types while generating diverse anomaly types.
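A common way to realize spatial conditioning, and possibly close in spirit to PV-DDPM, is to concatenate the binary masks to the denoiser input as extra channels; the toy network below (which omits the timestep embedding) is an assumption-laden sketch, not the paper's architecture.
```python
import torch
import torch.nn as nn

class MaskConditionedDenoiser(nn.Module):
    """Toy noise predictor that sees the noisy EL image plus binary condition masks
    (cell structure and requested defect positions) as extra input channels."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + 2, 32, 3, padding=1), nn.SiLU(),   # 1 image + 2 mask channels
            nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x_t, structure_mask, defect_mask):
        return self.net(torch.cat([x_t, structure_mask, defect_mask], dim=1))

model = MaskConditionedDenoiser()
x_t = torch.randn(2, 1, 64, 64)                       # noisy EL image at some timestep
structure = (torch.rand(2, 1, 64, 64) > 0.5).float()  # placeholder busbar/cell layout mask
defect = torch.zeros(2, 1, 64, 64)
defect[:, :, 20:30, 20:30] = 1.0                      # requested defect region
eps_pred = model(x_t, structure, defect)
print(eps_pred.shape)                                 # same spatial size as the input
```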
WDT-MD: Wavelet Diffusion Transformers for Microaneurysm Detection in Fundus Images
Authors:Yifei Sun, Yuzhi He, Junhao Jia, Jinhong Wang, Ruiquan Ge, Changmiao Wang, Hongxia Xu
Microaneurysms (MAs), the earliest pathognomonic signs of Diabetic Retinopathy (DR), present as sub-60 $μm$ lesions in fundus images with highly variable photometric and morphological characteristics, rendering manual screening not only labor-intensive but inherently error-prone. While diffusion-based anomaly detection has emerged as a promising approach for automated MA screening, its clinical application is hindered by three fundamental limitations. First, these models often fall prey to “identity mapping”, where they inadvertently replicate the input image. Second, they struggle to distinguish MAs from other anomalies, leading to high false positives. Third, their suboptimal reconstruction of normal features hampers overall performance. To address these challenges, we propose a Wavelet Diffusion Transformer framework for MA Detection (WDT-MD), which features three key innovations: a noise-encoded image conditioning mechanism to avoid “identity mapping” by perturbing image conditions during training; pseudo-normal pattern synthesis via inpainting to introduce pixel-level supervision, enabling discrimination between MAs and other anomalies; and a wavelet diffusion Transformer architecture that combines the global modeling capability of diffusion Transformers with multi-scale wavelet analysis to enhance reconstruction of normal retinal features. Comprehensive experiments on the IDRiD and e-ophtha MA datasets demonstrate that WDT-MD outperforms state-of-the-art methods in both pixel-level and image-level MA detection. This advancement holds significant promise for improving early DR screening.
Paper and Project Links
PDF 9 pages, 6 figures, 8 tables, accepted by AAAI 2026
Summary
For automated detection of microaneurysms, the earliest pathological sign of diabetic retinopathy (DR), diffusion-based anomaly detection is promising but faces three obstacles in clinical use: identity mapping, difficulty distinguishing microaneurysms from other anomalies, and suboptimal reconstruction of normal features. The proposed Wavelet Diffusion Transformer framework for microaneurysm detection (WDT-MD) addresses these with a noise-encoded image conditioning mechanism, pseudo-normal pattern synthesis via inpainting, and an architecture combining the global modeling of diffusion Transformers with multi-scale wavelet analysis. Experiments on the IDRiD and e-ophtha MA datasets show WDT-MD outperforms prior methods at both pixel and image level, promising better early DR screening.
Key Takeaways
- Microaneurysms are the earliest pathological sign of diabetic retinopathy; manual screening is labor-intensive and error-prone, so automated detection matters.
- Diffusion-based anomaly detection is promising but faces identity mapping, anomaly discrimination, and normal-feature reconstruction challenges.
- WDT-MD avoids identity mapping with a noise-encoded image conditioning mechanism (see the sketch below).
- Pseudo-normal pattern synthesis introduces pixel-level supervision, improving discrimination between microaneurysms and other anomalies.
- Combining the global modeling of diffusion Transformers with multi-scale wavelet analysis improves reconstruction of normal retinal features.
- Experiments on the IDRiD and e-ophtha datasets confirm WDT-MD's advantage in microaneurysm detection.
- The framework holds promise for improving early diabetic retinopathy screening.
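The noise-encoded image conditioning idea can be sketched as follows: during training, the conditioning fundus image is perturbed with noise of random strength so the network cannot simply copy it back. The tiny convolutional denoiser, the toy timestep, and the noise range are illustrative assumptions, not the paper's wavelet diffusion Transformer.
```python
import torch
import torch.nn as nn

class CondDenoiser(nn.Module):
    """Toy conditional noise predictor: input = noisy target + conditioning image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
    def forward(self, x_t, cond):
        return self.net(torch.cat([x_t, cond], dim=1))

def noise_encode(cond: torch.Tensor, max_sigma: float = 0.5) -> torch.Tensor:
    """Perturb the conditioning image with random-strength noise during training,
    so the network cannot simply copy it (the 'identity mapping' failure mode)."""
    sigma = max_sigma * torch.rand(cond.shape[0], 1, 1, 1)
    return cond + sigma * torch.randn_like(cond)

model = CondDenoiser()
clean = torch.rand(4, 1, 64, 64)                        # placeholder fundus patches
noise = torch.randn_like(clean)
x_t = 0.5 ** 0.5 * clean + 0.5 ** 0.5 * noise           # toy mid-trajectory noisy sample
eps_pred = model(x_t, noise_encode(clean))
loss = ((eps_pred - noise) ** 2).mean()
print(f"training loss: {loss.item():.4f}")
```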
Improving Conditional VAE with approximation using Normalizing Flows
Authors:Tuhin Subhra De
Variational Autoencoders and Generative Adversarial Networks remained the state-of-the-art (SOTA) generative models until 2022. Now they are superseded by diffusion based models. Efforts to improve traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAE) to incorporate desired attributes within the images. VAEs are known to produce blurry images with less diversity, we refer a method that solve this issue by leveraging the variance of the gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels is equal to the prior distribution, which is not the case in reality. We show that estimating it using normalizing flows results in better image generation than existing methods by reducing the FID by 5% and increasing log likelihood by 7.7% than the previous case.
Paper and Project Links
PDF Independent Work
Summary
Variational Autoencoders (VAEs) and GANs were the state of the art in generative modeling until diffusion-based models superseded them, and efforts to improve the traditional models have since stagnated. This work revisits image generation with conditional VAEs (CVAEs) to incorporate desired attributes into the images, and addresses the blurriness and limited diversity of VAE samples by treating the variance of the Gaussian decoder as a learnable parameter during training. Previous CVAE work assumed that the conditional distribution of the latent space given the labels equals the prior, which does not hold in practice; estimating it with normalizing flows yields better image generation than existing methods, reducing FID by 5% and increasing log-likelihood by 7.7%.
Key Takeaways
- Diffusion models have superseded VAEs and GANs as the dominant generative modeling approach.
- Conditional VAEs (CVAEs) are used for image generation with desired attributes, addressing the blurriness and limited diversity of VAE samples.
- Treating the variance of the Gaussian decoder as a learnable parameter during training improves CVAE performance (see the sketch below).
- Previous CVAE work assumed the conditional latent distribution given the labels equals the prior, which does not hold in practice.
- Estimating this conditional distribution with normalizing flows improves generation quality, reducing the FID score and increasing the log-likelihood relative to the previous setup.
- The results show that carefully improved CVAEs remain competitive for attribute-conditioned image generation.
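A minimal sketch of the learnable-decoder-variance idea (often called a sigma-VAE-style reconstruction term) is shown below; the single shared log-variance, the placeholder KL term, and the tensor shapes are assumptions, and the normalizing-flow prior is not included.
```python
import torch
import torch.nn as nn

class GaussianDecoderNLL(nn.Module):
    """Gaussian reconstruction term with a single learnable log-variance.
    Plain MSE corresponds to fixing the variance; learning it re-weights the
    reconstruction term against the KL term and tends to sharpen samples."""
    def __init__(self):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(()))   # shared over all pixels

    def forward(self, x_recon: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        var = self.log_var.exp()
        nll = 0.5 * (((x - x_recon) ** 2) / var
                     + self.log_var
                     + torch.log(torch.tensor(2 * torch.pi)))
        return nll.sum(dim=(1, 2, 3)).mean()

nll = GaussianDecoderNLL()
x = torch.rand(8, 3, 32, 32)                 # a batch of images
x_recon = torch.rand(8, 3, 32, 32)           # decoder output (random here)
kl = torch.tensor(5.0)                       # placeholder KL(q(z|x,y) || p(z|y)) term
elbo_loss = nll(x_recon, x) + kl             # negative ELBO to minimize
print(f"negative ELBO: {elbo_loss.item():.2f}")
```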
From Structure to Detail: Hierarchical Distillation for Efficient Diffusion Model
Authors:Hanbo Cheng, Peng Wang, Kaixiang Lei, Qi Li, Zhen Zou, Pengfei Hu, Jun Du
The inference latency of diffusion models remains a critical barrier to their real-time application. While trajectory-based and distribution-based step distillation methods offer solutions, they present a fundamental trade-off. Trajectory-based methods preserve global structure but act as a “lossy compressor”, sacrificing high-frequency details. Conversely, distribution-based methods can achieve higher fidelity but often suffer from mode collapse and unstable training. This paper recasts them from independent paradigms into synergistic components within our novel Hierarchical Distillation (HD) framework. We leverage trajectory distillation not as a final generator, but to establish a structural ``sketch”, providing a near-optimal initialization for the subsequent distribution-based refinement stage. This strategy yields an ideal initial distribution that enhances the ceiling of overall performance. To further improve quality, we introduce and refine the adversarial training process. We find standard discriminator structures are ineffective at refining an already high-quality generator. To overcome this, we introduce the Adaptive Weighted Discriminator (AWD), tailored for the HD pipeline. By dynamically allocating token weights, AWD focuses on local imperfections, enabling efficient detail refinement. Our approach demonstrates state-of-the-art performance across diverse tasks. On ImageNet $256\times256$, our single-step model achieves an FID of 2.26, rivaling its 250-step teacher. It also achieves promising results on the high-resolution text-to-image MJHQ benchmark, proving its generalizability. Our method establishes a robust new paradigm for high-fidelity, single-step diffusion models.
Paper and Project Links
Summary
The inference latency of diffusion models remains a key barrier to real-time use. Trajectory-based and distribution-based step distillation offer solutions but trade off against each other: trajectory-based methods preserve global structure yet act as a lossy compressor that sacrifices high-frequency detail, while distribution-based methods can reach higher fidelity but often suffer mode collapse and unstable training. This paper recasts the two as synergistic components within a Hierarchical Distillation (HD) framework: trajectory distillation is used not as the final generator but to establish a structural "sketch" that provides a near-optimal initialization for the subsequent distribution-based refinement stage, raising the overall performance ceiling. To further improve quality, the adversarial training process is refined; since standard discriminators are ineffective at refining an already high-quality generator, the authors introduce the Adaptive Weighted Discriminator (AWD), which dynamically allocates token weights to focus on local imperfections and enables efficient detail refinement. The approach achieves state-of-the-art results across tasks: on ImageNet 256x256 the single-step model reaches an FID of 2.26, rivaling its 250-step teacher, and it also performs well on the high-resolution text-to-image MJHQ benchmark, establishing a robust new paradigm for high-fidelity single-step diffusion models.
Key Takeaways
- The inference latency of diffusion models limits their real-time application.
- Existing solutions trade off: trajectory-based distillation preserves global structure but loses high-frequency detail, while distribution-based distillation targets high fidelity but suffers mode collapse and unstable training.
- The Hierarchical Distillation (HD) framework combines both: trajectory distillation builds a structural "sketch" that initializes the distribution-based refinement stage.
- The adversarial training process is improved with an Adaptive Weighted Discriminator (AWD) that focuses on local imperfections to refine details (see the sketch below).
- The method performs strongly across tasks, including ImageNet and the MJHQ benchmark.
- The single-step model achieves high fidelity and competitive performance, rivaling multi-step models.
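The exact AWD weighting rule is not spelled out in the abstract; the sketch below only conveys the generic idea of a token-level discriminator loss whose per-token weights are larger where the hinge loss is currently worst, so local imperfections dominate the update. The hinge formulation, softmax weighting, and temperature are assumptions.
```python
import torch
import torch.nn.functional as F

def adaptive_weighted_d_loss(real_logits: torch.Tensor, fake_logits: torch.Tensor,
                             temperature: float = 1.0) -> torch.Tensor:
    """Toy per-token hinge loss whose token weights are a softmax over how badly each
    token is currently classified, so 'local imperfections' dominate the update.
    real_logits / fake_logits: (batch, num_tokens) patch- or token-level scores."""
    loss_real = F.relu(1.0 - real_logits)            # hinge terms, one per token
    loss_fake = F.relu(1.0 + fake_logits)
    w_real = torch.softmax(loss_real / temperature, dim=1).detach()
    w_fake = torch.softmax(loss_fake / temperature, dim=1).detach()
    return ((w_real * loss_real).sum(dim=1) + (w_fake * loss_fake).sum(dim=1)).mean()

real = torch.randn(4, 256)       # e.g. a 16x16 grid of token-level discriminator scores
fake = torch.randn(4, 256)
print(f"AWD-style loss: {adaptive_weighted_d_loss(real, fake).item():.4f}")
```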
LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization
Authors:Ronghuan Wu, Wanchao Su, Jing Liao
Image vectorization is a powerful technique that converts raster images into vector graphics, enabling enhanced flexibility and interactivity. However, popular image vectorization tools struggle with occluded regions, producing incomplete or fragmented shapes that hinder editability. While recent advancements have explored optimization-based and learning-based layer-wise image vectorization, these methods face limitations in vectorization quality and flexibility. In this paper, we introduce LayerPeeler, a novel layer-wise image vectorization approach that addresses these challenges through a progressive simplification paradigm. The key to LayerPeeler’s success lies in its autoregressive peeling strategy: by identifying and removing the topmost non-occluded layers while recovering underlying content, we generate vector graphics with complete paths and coherent layer structures. Our method leverages vision-language models to construct a layer graph that captures occlusion relationships among elements, enabling precise detection and description for non-occluded layers. These descriptive captions are used as editing instructions for a finetuned image diffusion model to remove the identified layers. To ensure accurate removal, we employ localized attention control that precisely guides the model to target regions while faithfully preserving the surrounding content. To support this, we contribute a large-scale dataset specifically designed for layer peeling tasks. Extensive quantitative and qualitative experiments demonstrate that LayerPeeler significantly outperforms existing techniques, producing vectorization results with superior path semantics, geometric regularity, and visual fidelity.
Paper and Project Links
PDF Project Page: https://layerpeeler.github.io/
Summary
This paper introduces LayerPeeler, a layer-wise image vectorization approach that resolves the problem of occluded regions through a progressive simplification paradigm. Its autoregressive peeling strategy identifies and removes the topmost non-occluded layers while recovering the underlying content, producing vector graphics with complete paths and coherent layer structures. LayerPeeler uses vision-language models to build a layer graph that captures occlusion relationships among elements, enabling precise detection and description of non-occluded layers; these descriptive captions serve as editing instructions for a finetuned image diffusion model that removes the identified layers. Extensive quantitative and qualitative experiments show that LayerPeeler produces vectorization results with superior path semantics, geometric regularity, and visual fidelity.
Key Takeaways
- LayerPeeler is a new layer-wise image vectorization method that handles occluded regions with an autoregressive peeling strategy.
- A progressive simplification paradigm yields vector graphics with complete paths and coherent layer structures.
- Vision-language models build a layer graph that captures occlusion relationships among elements.
- Descriptive captions serve as editing instructions for an image diffusion model that removes the identified layers.
- LayerPeeler precisely detects and describes non-occluded layers, ensuring accurate removal.
- Localized attention control precisely guides the model to target regions while faithfully preserving the surrounding content.
Towards Consistent and Efficient Dataset Distillation via Diffusion-Driven Selection
Authors:Xinhao Zhong, Shuoyang Sun, Xulin Gu, Zhaoyang Xu, Yaowei Wang, Min Zhang, Bin Chen
Dataset distillation provides an effective approach to reduce memory and computational costs by optimizing a compact dataset that achieves performance comparable to the full original. However, for large-scale datasets and complex deep networks (e.g., ImageNet-1K with ResNet-101), the vast optimization space hinders distillation effectiveness, limiting practical applications. Recent methods leverage pre-trained diffusion models to directly generate informative images, thereby bypassing pixel-level optimization and achieving promising results. Nonetheless, these approaches often suffer from distribution shifts between the pre-trained diffusion prior and target datasets, as well as the need for multiple distillation steps under varying settings. To overcome these challenges, we propose a novel framework that is orthogonal to existing diffusion-based distillation techniques by utilizing the diffusion prior for patch selection rather than generation. Our method predicts noise from the diffusion model conditioned on input images and optional text prompts (with or without label information), and computes the associated loss for each image-patch pair. Based on the loss differences, we identify distinctive regions within the original images. Furthermore, we apply intra-class clustering and ranking on the selected patches to enforce diversity constraints. This streamlined pipeline enables a one-step distillation process. Extensive experiments demonstrate that our approach consistently outperforms state-of-the-art methods across various metrics and settings.
Paper and Project Links
Summary
Dataset distillation reduces memory and computational cost by optimizing a compact dataset whose performance approaches that of the full original, but for large-scale datasets and deep networks (e.g., ImageNet-1K with ResNet-101) the huge optimization space limits its effectiveness. Recent methods use pre-trained diffusion models to generate informative images directly, bypassing pixel-level optimization, yet they suffer from distribution shift between the diffusion prior and the target dataset and require multiple distillation runs under different settings. To overcome this, the authors propose a framework, orthogonal to existing diffusion-based distillation, that uses the diffusion prior for patch selection rather than generation: noise is predicted by the diffusion model conditioned on the input image and optional text prompts, the associated loss is computed for each image-patch pair, and loss differences identify distinctive regions; intra-class clustering and ranking over the selected patches enforce diversity. This streamlined pipeline enables one-step distillation, and extensive experiments show it consistently outperforms state-of-the-art methods across metrics and settings.
Key Takeaways
- Dataset distillation reduces memory and computational cost, particularly for large-scale datasets and complex deep networks.
- Current methods face distribution shift between the pre-trained diffusion prior and the target dataset, and require multiple distillation steps.
- The proposed framework uses the diffusion model for patch selection rather than image generation.
- Noise predicted by the diffusion model, conditioned on the input image and optional text prompts (with or without label information), yields a per-patch loss (see the sketch below).
- Loss differences identify distinctive regions of the original images, and intra-class clustering and ranking enforce diversity constraints.
- The result is a one-step distillation process with improved efficiency and performance.
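A toy version of the per-patch scoring step might look like the following: average the diffusion model's noise-prediction error inside each patch and keep the highest-scoring patches. The random stand-in for the model's output, the patch size, and the top-k rule are assumptions; the paper's conditioning and selection criteria may differ.
```python
import torch
import torch.nn.functional as F

def per_patch_diffusion_loss(eps_pred: torch.Tensor, eps_true: torch.Tensor,
                             patch: int = 16) -> torch.Tensor:
    """Average the squared noise-prediction error inside each non-overlapping patch.
    eps_pred / eps_true: (B, C, H, W). Returns (B, H//patch, W//patch)."""
    err = (eps_pred - eps_true).pow(2).mean(dim=1, keepdim=True)      # per-pixel error
    return F.avg_pool2d(err, kernel_size=patch, stride=patch).squeeze(1)

B, C, H, W = 2, 3, 64, 64
eps_true = torch.randn(B, C, H, W)
eps_pred = eps_true + 0.3 * torch.randn(B, C, H, W)   # stand-in for a diffusion model's output
patch_loss = per_patch_diffusion_loss(eps_pred, eps_true)             # shape (2, 4, 4)

# Rank patches of each image by loss; the most "distinctive" ones are selected.
flat = patch_loss.flatten(1)
topk = flat.topk(k=3, dim=1)
print("selected patch indices per image:", topk.indices.tolist())
```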
The Visual Counter Turing Test (VCT2): A Benchmark for Evaluating AI-Generated Image Detection and the Visual AI Index (VAI)
Authors:Nasrin Imanpour, Abhilekh Borah, Shashwat Bajpai, Subhankar Ghosh, Sainath Reddy Sankepally, Hasnat Md Abdullah, Nishoak Kosaraju, Shreyas Dixit, Ashhar Aziz, Shwetangshu Biswas, Vinija Jain, Aman Chadha, Song Wang, Amit Sheth, Amitava Das
The rapid progress and widespread availability of text-to-image (T2I) generative models have heightened concerns about the misuse of AI-generated visuals, particularly in the context of misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and falter on outputs from newer or unseen models. We introduce the Visual Counter Turing Test (VCT2), a comprehensive benchmark of 166,000 images, comprising both real and synthetic prompt-image pairs produced by six state-of-the-art T2I systems: Stable Diffusion 2.1, SDXL, SD3 Medium, SD3.5 Large, DALL.E 3, and Midjourney 6. We curate two distinct subsets: COCOAI, featuring structured captions from MS COCO, and TwitterAI, containing narrative-style tweets from The New York Times. Under a unified zero-shot evaluation, we benchmark 17 leading AGID models and observe alarmingly low detection accuracy, 58% on COCOAI and 58.34% on TwitterAI. To transcend binary classification, we propose the Visual AI Index (VAI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features, enabling us to quantify and rank the perceptual quality of generated outputs with greater nuance. Correlation analysis reveals a moderate inverse relationship between VAI and detection accuracy: Pearson of -0.532 on COCOAI and -0.503 on TwitterAI, suggesting that more visually realistic images tend to be harder to detect, a trend observed consistently across generators. We release COCOAI, TwitterAI, and all codes to catalyze future advances in generalized AGID and perceptual realism assessment.
Paper and Project Links
PDF 13 pages, 9 figures
Summary
This paper addresses concerns about the misuse of text-to-image (T2I) generative models, especially in misinformation campaigns. Existing AI-generated image detection (AGID) methods often overfit to known generators and fail on outputs of newer or unseen models. The authors introduce the Visual Counter Turing Test (VCT2), a benchmark of 166,000 images containing real and synthetic prompt-image pairs produced by six state-of-the-art T2I systems, with two subsets: COCOAI (structured MS COCO captions) and TwitterAI (narrative-style tweets from The New York Times). Under a unified zero-shot evaluation, 17 leading AGID models achieve alarmingly low detection accuracy: 58% on COCOAI and 58.34% on TwitterAI. Going beyond binary classification, they propose the Visual AI Index (VAI), an interpretable, prompt-agnostic realism metric based on twelve low-level visual features that quantifies and ranks the perceptual quality of generated outputs with finer granularity. Correlation analysis shows a moderate inverse relationship between VAI and detection accuracy (Pearson of -0.532 on COCOAI and -0.503 on TwitterAI), suggesting that more visually realistic images are harder to detect. COCOAI, TwitterAI, and all code are released to spur progress in generalized AGID and perceptual realism assessment.
Key Takeaways
- The rapid progress of text-to-image (T2I) generation raises concerns about misuse and misinformation.
- Existing AI-generated image detection (AGID) methods generalize poorly to new generators.
- VCT2 provides a comprehensive benchmark covering multiple T2I systems and AGID methods.
- Binary classification alone is limiting; richer evaluation metrics are needed.
- The Visual AI Index (VAI) enables a finer-grained measure of the perceptual quality of generated images.
- VAI and AGID detection accuracy show a moderate inverse correlation: more realistic images are harder to detect (see the sketch below).
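The reported relationship can be illustrated with synthetic numbers: build a toy realism index from a few hypothetical low-level cues and correlate it with a hypothetical detector score. None of the features or values below come from the paper; only the use of the Pearson correlation mirrors its analysis.
```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-level features for 100 generated images (not the paper's twelve).
sharpness = rng.random(100)
colorfulness = rng.random(100)
noise_level = rng.random(100)

def zscore(v: np.ndarray) -> np.ndarray:
    return (v - v.mean()) / v.std()

# A toy realism index: z-score each cue and average (higher = more "realistic" here).
toy_index = (zscore(sharpness) + zscore(colorfulness) - zscore(noise_level)) / 3.0

# Hypothetical detector confidence that each image is AI-generated.
detector_score = 0.5 - 0.3 * toy_index + 0.1 * rng.standard_normal(100)

pearson = np.corrcoef(toy_index, detector_score)[0, 1]
print(f"Pearson correlation: {pearson:.3f}")   # negative: more 'realistic' -> harder to flag
```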
Chat2SVG: Vector Graphics Generation with Large Language Models and Image Diffusion Models
Authors:Ronghuan Wu, Wanchao Su, Jing Liao
Scalable Vector Graphics (SVG) has become the de facto standard for vector graphics in digital design, offering resolution independence and precise control over individual elements. Despite their advantages, creating high-quality SVG content remains challenging, as it demands technical expertise with professional editing software and a considerable time investment to craft complex shapes. Recent text-to-SVG generation methods aim to make vector graphics creation more accessible, but they still encounter limitations in shape regularity, generalization ability, and expressiveness. To address these challenges, we introduce Chat2SVG, a hybrid framework that combines the strengths of Large Language Models (LLMs) and image diffusion models for text-to-SVG generation. Our approach first uses an LLM to generate semantically meaningful SVG templates from basic geometric primitives. Guided by image diffusion models, a dual-stage optimization pipeline refines paths in latent space and adjusts point coordinates to enhance geometric complexity. Extensive experiments show that Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment. Additionally, our system enables intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
Paper and Project Links
PDF Project Page: https://chat2svg.github.io/
Summary
SVG is the de facto standard for vector graphics in digital design, but creating high-quality SVG content remains challenging. Chat2SVG is a hybrid framework that combines Large Language Models (LLMs) and image diffusion models for text-to-SVG generation: an LLM first generates semantically meaningful SVG templates from basic geometric primitives, and a dual-stage optimization pipeline guided by image diffusion models then refines paths in latent space and adjusts point coordinates to increase geometric complexity. Chat2SVG outperforms existing methods in visual fidelity, path regularity, and semantic alignment, and supports intuitive editing through natural language instructions, making professional vector graphics creation accessible to all users.
Key Takeaways
- SVG's strengths as the vector graphics standard are resolution independence and precise control over individual elements.
- Creating high-quality SVG content demands technical expertise and a considerable time investment for complex shapes.
- Existing text-to-SVG generation methods are limited in shape regularity, generalization, and expressiveness.
- Chat2SVG combines the strengths of large language models and image diffusion models for text-to-SVG generation.
- Chat2SVG has the LLM produce semantically meaningful SVG templates from geometric primitives, which are then refined with guidance from image diffusion models (see the sketch below).
- Chat2SVG excels in visual fidelity, path regularity, and semantic alignment.
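To show what an "SVG template from geometric primitives" might look like before any diffusion-guided refinement, the snippet below assembles a tiny SVG string by hand; the shapes, ids, and colors are invented for illustration, and no LLM call is made.
```python
# A toy "SVG template from geometric primitives", the kind of intermediate output an
# LLM stage might produce before diffusion-guided refinement. The shapes, ids, and
# palette are made up for illustration.
primitives = [
    '<circle id="head" cx="64" cy="40" r="24" fill="#f2c94c"/>',
    '<ellipse id="body" cx="64" cy="96" rx="30" ry="36" fill="#2d9cdb"/>',
    '<rect id="ground" x="0" y="120" width="128" height="8" fill="#6fcf97"/>',
]

svg_template = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 128 128">\n'
    + "\n".join("  " + p for p in primitives)
    + "\n</svg>"
)

with open("template.svg", "w") as f:   # any SVG viewer can render the result
    f.write(svg_template)
print(svg_template)
```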
CART: Compositional Auto-Regressive Transformer for Image Generation
Authors:Siddharth Roheda, Rohit Chowdhury, Aniruddha Bala, Rohan Jaiswal
We propose a novel Auto-Regressive (AR) image generation approach that models images as hierarchical compositions of interpretable visual layers. While AR models have achieved transformative success in language modeling, replicating this success in vision tasks remains challenging due to inherent spatial dependencies in images. Addressing the unique challenges of vision tasks, our method (CART) adds image details iteratively via semantically meaningful decompositions. We demonstrate the flexibility and generality of CART by applying it across three distinct decomposition strategies: (i) Base-Detail Decomposition (Mumford-Shah smoothness), (ii) Intrinsic Decomposition (albedo/shading), and (iii) Specularity Decomposition (diffuse/specular). This next-detail strategy outperforms traditional next-token and next-scale approaches, improving controllability, semantic interpretability, and resolution scalability. Experiments show CART generates visually compelling results while enabling structured image manipulation, opening new directions for controllable generative modeling via physically or perceptually motivated image factorization.
Paper and Project Links
PDF figures compressed to meet arxiv size limit
Summary
This paper proposes a novel auto-regressive (AR) image generation approach that models images as hierarchical compositions of interpretable visual layers. Although AR models transformed language modeling, replicating that success in vision is difficult because of the spatial dependencies inherent in images. The method, CART, adds image detail iteratively through semantically meaningful decompositions and is demonstrated with three decomposition strategies: base-detail decomposition (Mumford-Shah smoothness), intrinsic decomposition (albedo/shading), and specularity decomposition (diffuse/specular). This next-detail strategy outperforms traditional next-token and next-scale approaches, improving controllability, semantic interpretability, and resolution scalability. Experiments show CART produces visually compelling results while enabling structured image manipulation, opening new directions for controllable generative modeling via physically or perceptually motivated image factorization.
Key Takeaways
- A novel auto-regressive (AR) image generation approach treats images as hierarchical compositions of interpretable visual layers.
- Applying AR models to vision is challenging mainly because of the spatial dependencies in images.
- CART adds image detail iteratively via semantically meaningful decompositions.
- CART is demonstrated with three decomposition strategies: base-detail, intrinsic, and specularity decomposition (a toy base-detail split is sketched below).
- The next-detail strategy surpasses traditional next-token and next-scale approaches in controllability, semantic interpretability, and resolution scalability.
- CART generates visually compelling images and enables structured image manipulation.
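A base-detail split can be illustrated with a simple smoothing filter standing in for Mumford-Shah-style smoothing: the blurred image is the "base" layer generated first, and the residual is the "detail" added in a later autoregressive step. The box blur, its radius, and the random image are assumptions, not the paper's decomposition.
```python
import numpy as np

def box_blur(img: np.ndarray, radius: int = 4) -> np.ndarray:
    """Separable box blur as a cheap stand-in for Mumford-Shah-style smoothing."""
    k = 2 * radius + 1
    kernel = np.ones(k) / k
    pad = np.pad(img, ((radius, radius), (radius, radius)), mode="edge")
    # blur rows, then columns
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

rng = np.random.default_rng(0)
image = rng.random((64, 64))

base = box_blur(image)            # smooth "structure" layer, generated first
detail = image - base             # high-frequency residual, added in a later AR step
reconstructed = base + detail     # the composition recovers the original exactly
print(np.allclose(reconstructed, image))
```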
DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models
Authors:Xiaoxiao He, Quan Dao, Ligong Han, Song Wen, Minhao Bai, Di Liu, Han Zhang, Martin Renqiang Min, Felix Juefei-Xu, Chaowei Tan, Bo Liu, Kang Li, Hongdong Li, Junzhou Huang, Faez Ahmed, Akash Srivastava, Dimitris Metaxas
Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces.
Paper and Project Links
PDF Project webpage: https://hexiaoxiao-cs.github.io/DICE/. This paper was accepted to CVPR 2025 but later desk-rejected post camera-ready, due to a withdrawal from ICLR made 14 days before reviewer assignment
Summary
Discrete diffusion models have succeeded at tasks such as image generation and masked language modeling, but they are limited in controlled content editing. DICE (Discrete Inversion for Controllable Editing) is the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording the noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without predefined masks or attention manipulation. DICE is shown to be effective in both the image and text domains, evaluated on models such as VQ-Diffusion, Paella, and RoBERTa, and it preserves high data fidelity while enhancing editing capability, opening new opportunities for fine-grained content manipulation in discrete spaces.
Key Takeaways
- DICE (Discrete Inversion for Controllable Editing) provides precise inversion for discrete diffusion models, enabling controlled content editing.
- DICE supports both multinomial diffusion and masked generative models.
- By recording noise sequences and masking patterns, DICE enables accurate reconstruction and flexible editing of discrete data (see the sketch below).
- DICE requires no predefined masks or attention manipulation.
- DICE is effective in both image and text domains and has been validated on several models.
- DICE preserves high data fidelity while enhancing editing capability.
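DICE's actual inversion records the noise sequences and masking patterns of a trained discrete diffusion model; the toy sketch below mimics the principle on a random unmasking process, recording every stochastic choice so the sequence can be replayed exactly or replayed with targeted edits. The sampler, vocabulary, and edit interface are placeholders, not the paper's algorithm.
```python
import numpy as np

VOCAB, MASK, LENGTH = 10, -1, 8
rng = np.random.default_rng(0)

def toy_sampler(seq, rng):
    """Stand-in for a masked generative model: pick a masked position and a token."""
    pos = int(rng.choice(np.flatnonzero(seq == MASK)))
    token = int(rng.integers(VOCAB))
    return pos, token

def generate_with_record(rng):
    """Run the reverse (unmasking) process while recording every stochastic choice."""
    seq = np.full(LENGTH, MASK)
    record = []
    while (seq == MASK).any():
        pos, token = toy_sampler(seq, rng)
        record.append((pos, token))
        seq[pos] = token
    return seq, record

def replay(record, edits=None):
    """Deterministically reconstruct the sequence from the record; `edits` maps
    positions to replacement tokens, leaving every other step untouched."""
    seq = np.full(LENGTH, MASK)
    for pos, token in record:
        seq[pos] = (edits or {}).get(pos, token)
    return seq

original, record = generate_with_record(rng)
print("reconstruction exact:", np.array_equal(replay(record), original))
print("edited copy:         ", replay(record, edits={0: 7}))
```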