⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: do not rely on these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers.
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-21
LightsOut: Diffusion-based Outpainting for Enhanced Lens Flare Removal
Authors:Shr-Ruei Tsai, Wei-Cheng Chang, Jie-Ying Lee, Chih-Hai Su, Yu-Lun Liu
Lens flare significantly degrades image quality, impacting critical computer vision tasks like object detection and autonomous driving. Recent Single Image Flare Removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. We propose LightsOut, a diffusion-based outpainting framework tailored to enhance SIFR by reconstructing off-frame light sources. Our method leverages a multitask regression module and LoRA fine-tuned diffusion model to ensure realistic and physically consistent outpainting results. Comprehensive experiments demonstrate LightsOut consistently boosts the performance of existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution. Project page: https://ray-1026.github.io/lightsout/
Paper and project links
PDF ICCV 2025. Project page: https://ray-1026.github.io/lightsout/
Summary
Lens flare severely degrades image quality and affects key computer vision tasks such as object detection and autonomous driving. Existing single-image flare removal (SIFR) methods perform poorly when off-frame light sources are incomplete or absent. LightsOut is a diffusion-based outpainting framework that enhances SIFR by reconstructing off-frame light sources. It uses a multitask regression module and a LoRA fine-tuned diffusion model to keep the outpainted results realistic and physically consistent. Experiments show that LightsOut consistently improves existing SIFR methods across challenging scenarios without additional retraining, serving as a universally applicable plug-and-play preprocessing solution.
Key Takeaways
- Lens flare severely degrades image quality and affects computer vision tasks such as object detection and autonomous driving.
- Existing single-image flare removal (SIFR) methods perform poorly when off-frame light sources are incomplete or missing.
- LightsOut, a diffusion-based outpainting framework, is proposed to enhance SIFR.
- LightsOut uses a multitask regression module and a LoRA fine-tuned diffusion model to keep results realistic and physically consistent.
- LightsOut improves SIFR performance by reconstructing off-frame light sources.
- Experiments show that LightsOut boosts existing SIFR methods across a range of challenging scenarios.
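The plug-and-play use of outpainting described above can be illustrated with off-the-shelf components. The sketch below is a minimal approximation, not the released LightsOut pipeline: a generic Stable Diffusion inpainting checkpoint stands in for the paper's LoRA fine-tuned outpainting model, and `remove_flare` is a placeholder for whatever pretrained SIFR network is being preprocessed.

```python
# Minimal sketch: outpaint the borders (where an off-frame light source may sit),
# run flare removal on the extended canvas, then crop back to the original view.
import torch
from PIL import Image, ImageOps
from diffusers import StableDiffusionInpaintPipeline

def outpaint_then_deflare(image: Image.Image, remove_flare, pad: int = 128) -> Image.Image:
    w, h = image.size
    # Extend the canvas; the border is the region the diffusion model must fill.
    extended = ImageOps.expand(image, border=pad, fill=(0, 0, 0))
    mask = Image.new("L", extended.size, 255)          # white = regions to synthesize
    mask.paste(0, (pad, pad, pad + w, pad + h))        # black = keep the original pixels

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")
    # Resized to the checkpoint's native 512x512 for simplicity (ignores aspect ratio).
    outpainted = pipe(
        prompt="a photo with a bright light source near the image border",
        image=extended.resize((512, 512)),
        mask_image=mask.resize((512, 512)),
    ).images[0].resize(extended.size)

    deflared = remove_flare(outpainted)                # placeholder: any pretrained SIFR model
    return deflared.crop((pad, pad, pad + w, pad + h)) # back to the original framing
```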
Click here to view paper screenshots






BLIP3o-NEXT: Next Frontier of Native Image Generation
Authors:Jiuhai Chen, Le Xue, Zhiyang Xu, Xichen Pan, Shusheng Yang, Can Qin, An Yan, Honglu Zhou, Zeyuan Chen, Lifu Huang, Tianyi Zhou, Junnan Li, Silvio Savarese, Caiming Xiong, Ran Xu
We present BLIP3o-NEXT, a fully open-source foundation model in the BLIP3 series that advances the next frontier of native image generation. BLIP3o-NEXT unifies text-to-image generation and image editing within a single architecture, demonstrating strong image generation and image editing capabilities. In developing the state-of-the-art native image generation model, we identify four key insights: (1) Most architectural choices yield comparable performance; an architecture can be deemed effective provided it scales efficiently and supports fast inference; (2) The successful application of reinforcement learning can further push the frontier of native image generation; (3) Image editing still remains a challenging task, yet instruction following and the consistency between generated and reference images can be significantly enhanced through post-training and data engine; (4) Data quality and scale continue to be decisive factors that determine the upper bound of model performance. Building upon these insights, BLIP3o-NEXT leverages an Autoregressive + Diffusion architecture in which an autoregressive model first generates discrete image tokens conditioned on multimodal inputs, whose hidden states are then used as conditioning signals for a diffusion model to generate high-fidelity images. This architecture integrates the reasoning strength and instruction following of autoregressive models with the fine-detail rendering ability of diffusion models, achieving a new level of coherence and realism. Extensive evaluations of various text-to-image and image-editing benchmarks show that BLIP3o-NEXT achieves superior performance over existing models.
Paper and project links
Summary
BLIP3o-NEXT is a fully open-source foundation model in the BLIP3 series that pushes the frontier of native image generation. It unifies text-to-image generation and image editing within a single architecture and shows strong capability at both. The model is built on four key insights: most architectural choices perform comparably as long as they scale efficiently and support fast inference; reinforcement learning can push native image generation further; image editing remains challenging, but instruction following and consistency with the reference image improve markedly through post-training and a data engine; and data quality and scale ultimately set the upper bound of model performance. BLIP3o-NEXT adopts an autoregressive + diffusion architecture that combines the reasoning and instruction-following strengths of autoregressive models with the fine-detail rendering of diffusion models, producing high-fidelity images and achieving superior results on text-to-image and image-editing benchmarks.
Key Takeaways
- BLIP3o-NEXT is a fully open-source model in the BLIP3 series that advances native image generation.
- It handles both text-to-image generation and image editing within a single architecture.
- It is built on four key insights: architecture choices matter less than scalability and inference speed, reinforcement learning pushes the frontier further, image editing benefits from post-training and a data engine, and data quality and scale set the performance ceiling.
- The autoregressive + diffusion architecture combines the strengths of both model families to generate high-fidelity images.
- Data quality and scale are decisive factors for model performance.
- BLIP3o-NEXT performs strongly on text-to-image and image-editing benchmarks.
- The model is broadly applicable across a wide range of scenarios.
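The autoregressive + diffusion pattern described above can be sketched with a toy pair of PyTorch modules. This is purely illustrative and not the BLIP3o-NEXT code: a small causal transformer emits hidden states for discrete image tokens, and a denoiser attends to those states as its conditioning signal; all dimensions are made up for the shape check.

```python
# Toy sketch of "AR model emits image tokens; its hidden states condition a diffusion denoiser".
import torch
import torch.nn as nn

class ARTokenizerLM(nn.Module):
    def __init__(self, vocab=8192, dim=512, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, token_ids):
        causal = nn.Transformer.generate_square_subsequent_mask(token_ids.size(1))
        h = self.backbone(self.embed(token_ids), mask=causal)   # one hidden state per image token
        return self.head(h), h                                  # logits for next-token loss, states for diffusion

class ConditionedDenoiser(nn.Module):
    """Predicts noise on image latents while attending to the AR hidden states."""
    def __init__(self, latent_dim=4, dim=512):
        super().__init__()
        self.in_proj = nn.Linear(latent_dim, dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(dim, latent_dim)

    def forward(self, noisy_latents, t, ar_states):
        x = self.in_proj(noisy_latents) + t[:, None, None]      # crude timestep conditioning
        x, _ = self.cross_attn(x, ar_states, ar_states)         # condition on AR hidden states
        return self.out_proj(x)                                 # predicted noise

# Shapes only: 2 images, 64 discrete tokens each, 256 latent positions of width 4.
lm, denoiser = ARTokenizerLM(), ConditionedDenoiser()
logits, states = lm(torch.randint(0, 8192, (2, 64)))
eps = denoiser(torch.randn(2, 256, 4), torch.rand(2), states)
print(logits.shape, eps.shape)   # torch.Size([2, 64, 8192]) torch.Size([2, 256, 4])
```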
Click here to view paper screenshots





NDM: A Noise-driven Detection and Mitigation Framework against Implicit Sexual Intentions in Text-to-Image Generation
Authors:Yitong Sun, Yao Huang, Ruochen Zhang, Huanran Chen, Shouwei Ruan, Ranjie Duan, Xingxing Wei
Despite the impressive generative capabilities of text-to-image (T2I) diffusion models, they remain vulnerable to generating inappropriate content, especially when confronted with implicit sexual prompts. Unlike explicit harmful prompts, these subtle cues, often disguised as seemingly benign terms, can unexpectedly trigger sexual content due to underlying model biases, raising significant ethical concerns. However, existing detection methods are primarily designed to identify explicit sexual content and therefore struggle to detect these implicit cues. Fine-tuning approaches, while effective to some extent, risk degrading the model’s generative quality, creating an undesirable trade-off. To address this, we propose NDM, the first noise-driven detection and mitigation framework, which could detect and mitigate implicit malicious intention in T2I generation while preserving the model’s original generative capabilities. Specifically, we introduce two key innovations: first, we leverage the separability of early-stage predicted noise to develop a noise-based detection method that could identify malicious content with high accuracy and efficiency; second, we propose a noise-enhanced adaptive negative guidance mechanism that could optimize the initial noise by suppressing the prominent region’s attention, thereby enhancing the effectiveness of adaptive negative guidance for sexual mitigation. Experimentally, we validate NDM on both natural and adversarial datasets, demonstrating its superior performance over existing SOTA methods, including SLD, UCE, and RECE, etc. Code and resources are available at https://github.com/lorraine021/NDM.
Paper and project links
PDF 10 pages, 8 figures, accepted by ACMMM 2025
Summary
Text-to-image diffusion models are powerful generators but risk producing inappropriate content, especially when faced with implicit sexual prompts. Existing detection methods mainly target explicit sexual content and struggle to identify these implicit cues. This paper proposes NDM, a noise-driven detection and mitigation framework that detects and mitigates implicit malicious intent in image generation while preserving the model's original generative capability. The framework introduces two key techniques: a noise-based detection method that exploits the separability of early-stage predicted noise, and a noise-enhanced adaptive negative guidance mechanism that optimizes the initial noise. Experiments show that NDM outperforms existing state-of-the-art methods on both natural and adversarial datasets.
Key Takeaways
- Text-to-image diffusion models risk generating inappropriate content, especially when confronted with implicit sexual prompts.
- Existing detection methods focus on explicit sexual content and struggle to identify implicit cues.
- NDM, a noise-driven detection and mitigation framework, detects and mitigates implicit malicious intent in image generation.
- NDM introduces two key techniques: a detection method based on early-stage predicted noise and an adaptive negative guidance mechanism that optimizes the initial noise.
- NDM preserves the model's original generative capability while effectively detecting and mitigating implicit malicious intent.
- Experiments show that NDM outperforms existing state-of-the-art methods, including SLD, UCE, and RECE, on both natural and adversarial datasets.
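The noise-driven detection idea can be sketched as follows. This is a minimal stand-in, not the NDM release: per-channel statistics of the early-step predicted noise replace the paper's actual features, and `unet_predict_noise` is a placeholder wrapper around any T2I model's epsilon prediction.

```python
# Minimal sketch: pool the predicted noise from an early denoising step into a
# feature vector and feed it to a lightweight binary classifier.
import torch
import torch.nn as nn

class EarlyNoiseClassifier(nn.Module):
    def __init__(self, channels=4, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels * 2, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, predicted_noise):            # (B, C, H, W) epsilon from an early step
        mean = predicted_noise.mean(dim=(2, 3))    # per-channel statistics as a cheap
        std = predicted_noise.std(dim=(2, 3))      # stand-in for the paper's features
        return self.mlp(torch.cat([mean, std], dim=-1))   # logit: malicious vs. benign

def detect(unet_predict_noise, latents, prompt_embeds, classifier, early_t=981):
    """unet_predict_noise is any callable wrapping the T2I model's epsilon prediction."""
    with torch.no_grad():
        eps = unet_predict_noise(latents, early_t, prompt_embeds)
    return torch.sigmoid(classifier(eps)) > 0.5   # True -> route to mitigation / negative guidance

# Shape check with random tensors standing in for real model outputs.
clf = EarlyNoiseClassifier()
fake_eps = torch.randn(2, 4, 64, 64)
print(torch.sigmoid(clf(fake_eps)).shape)          # torch.Size([2, 1])
```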
Click here to view paper screenshots





Exploring Conditions for Diffusion models in Robotic Control
Authors:Heeseong Shin, Byeongho Heo, Dongyoon Han, Seungryong Kim, Taekyung Kim
While pre-trained visual representations have significantly advanced imitation learning, they are often task-agnostic as they remain frozen during policy learning. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions - a successful strategy in other vision domains - yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model’s training data and robotic control environments, leading us to argue for conditions that consider the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. Through facilitating task-adaptive representations with our newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.
Paper and project links
PDF Project page: https://orca-rc.github.io/
Summary
This work explores how to obtain task-adaptive visual representations for robotic control from pre-trained text-to-image diffusion models without fine-tuning the model itself. The authors find that naively applying textual conditions yields minimal or even negative gains on control tasks, which they attribute to the domain gap between the diffusion model's training data and robotic control environments. To address this, they propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. With these new conditions facilitating task-adaptive representations, the approach achieves state-of-the-art performance on a range of robotic control benchmarks, clearly surpassing prior methods.
Key Takeaways
- Pre-trained text-to-image diffusion models can provide visual representations for robotic control.
- Naively applying textual conditions brings limited benefit on robotic control tasks.
- There is a domain gap between the diffusion model's training data and robotic control environments.
- Learnable task prompts and visual prompts are introduced to adapt to the control environment and capture frame-specific details.
- By facilitating task-adaptive representations, ORCA achieves excellent performance on robotic control tasks.
- The method clearly surpasses prior approaches on a variety of robotic control benchmarks.
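The conditioning scheme described above (learnable task prompts plus frame-specific visual prompts) can be sketched with a toy module. This is not the ORCA code; the dimensions and the patch projection are illustrative assumptions about how such prompts could be built and concatenated where text embeddings would normally go.

```python
# Toy sketch: learnable task prompts + frame-derived visual prompts as the
# conditioning sequence for a frozen diffusion feature extractor.
import torch
import torch.nn as nn

class PromptConditioner(nn.Module):
    def __init__(self, n_task_tokens=8, dim=768, patch=16):
        super().__init__()
        # Task prompts: free parameters adapted to the control environment.
        self.task_prompts = nn.Parameter(torch.randn(1, n_task_tokens, dim) * 0.02)
        # Visual prompts: lightweight projection of the current frame's patches,
        # capturing fine-grained, frame-specific detail.
        self.visual_proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, frame):                       # frame: (B, 3, H, W)
        vis = self.visual_proj(frame).flatten(2).transpose(1, 2)   # (B, HW/p^2, dim)
        task = self.task_prompts.expand(frame.size(0), -1, -1)
        return torch.cat([task, vis], dim=1)        # conditioning sequence for the frozen denoiser

cond = PromptConditioner()(torch.randn(2, 3, 224, 224))
print(cond.shape)   # torch.Size([2, 204, 768]): 8 task tokens + 196 visual tokens
```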
Click here to view paper screenshots





Deep generative priors for 3D brain analysis
Authors:Ana Lawry Aguila, Dina Zemlyanker, You Cheng, Sudeshna Das, Daniel C. Alexander, Oula Puonti, Annabel Sorby-Adams, W. Taylor Kimberly, Juan Eugenio Iglesias
Diffusion models have recently emerged as powerful generative models in medical imaging. However, it remains a major challenge to combine these data-driven models with domain knowledge to guide brain imaging problems. In neuroimaging, Bayesian inverse problems have long provided a successful framework for inference tasks, where incorporating domain knowledge of the imaging process enables robust performance without requiring extensive training data. However, the anatomical modeling component of these approaches typically relies on classical mathematical priors that often fail to capture the complex structure of brain anatomy. In this work, we present the first general-purpose application of diffusion models as priors for solving a wide range of medical imaging inverse problems. Our approach leverages a score-based diffusion prior trained extensively on diverse brain MRI data, paired with flexible forward models that capture common image processing tasks such as super-resolution, bias field correction, inpainting, and combinations thereof. We further demonstrate how our framework can refine outputs from existing deep learning methods to improve anatomical fidelity. Experiments on heterogeneous clinical and research MRI data show that our method achieves state-of-the-art performance producing consistent, high-quality solutions without requiring paired training datasets. These results highlight the potential of diffusion priors as versatile tools for brain MRI analysis.
Paper and project links
Summary
Diffusion models have shown strong generative capability in medical imaging, but combining these data-driven models with domain knowledge to solve inverse problems in neuroimaging remains a major challenge. This work presents the first general-purpose use of diffusion models as priors for a wide range of medical imaging inverse problems. A score-based diffusion prior trained on diverse brain MRI data is paired with flexible forward models covering common image processing tasks such as super-resolution, bias field correction, and inpainting. Experiments show that the method achieves state-of-the-art performance without paired training datasets, produces consistent high-quality solutions, and can refine the outputs of existing deep learning methods to improve anatomical fidelity. These results highlight the potential of diffusion priors for brain MRI analysis.
Key Takeaways
- Diffusion models have strong generative capability in medical imaging.
- Combining domain knowledge with data-driven models is the key challenge in solving inverse problems in neuroimaging.
- This work is the first to use diffusion models as general-purpose priors for medical imaging inverse problems.
- A score-based diffusion prior is trained on diverse brain MRI data.
- Flexible forward models cover common image processing tasks such as super-resolution, bias field correction, and inpainting.
- The method achieves state-of-the-art performance on heterogeneous clinical and research MRI data and produces high-quality solutions.
- Diffusion priors show great potential for brain MRI analysis.
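One common way to combine a score-based diffusion prior with a known imaging forward model is a guided reverse step with a data-consistency gradient, in the spirit of the Bayesian framing above. The sketch below is a generic, DDIM-style step of the diffusion-posterior-sampling flavor and is not the authors' exact sampler; `score_model`, `forward_op`, and the noise schedule are assumed inputs.

```python
# Generic guided reverse-diffusion step: prior step toward t-1, plus a gradient
# that pulls the Tweedie estimate toward the observed measurement.
import torch

def guided_ddim_step(x_t, t, y, score_model, forward_op, alpha_bar, zeta=1.0):
    """One deterministic reverse step with data consistency (assumes t >= 1).

    x_t        : current noisy image estimate
    y          : observed measurement (low-res, biased, or masked image)
    score_model: pretrained prior that predicts the noise eps(x_t, t)
    forward_op : differentiable imaging model A(x), e.g. blur + downsample
    alpha_bar  : 1-D tensor of cumulative schedule values in (0, 1]
    """
    x_t = x_t.detach().requires_grad_(True)
    eps = score_model(x_t, t)
    # Tweedie-style estimate of the clean image under the prior.
    x0_hat = (x_t - torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alpha_bar[t])
    # Measurement misfit through the known forward model.
    misfit = torch.linalg.vector_norm(y - forward_op(x0_hat))
    grad = torch.autograd.grad(misfit, x_t)[0]
    # Prior step toward t-1, then nudge toward the observed data.
    x_prev = torch.sqrt(alpha_bar[t - 1]) * x0_hat + torch.sqrt(1 - alpha_bar[t - 1]) * eps
    return x_prev - zeta * grad
```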
Click here to view paper screenshots


Sequential Comics for Jailbreaking Multimodal Large Language Models via Structured Visual Storytelling
Authors:Deyue Zhang, Dongdong Yang, Junjie Mu, Quancheng Zou, Zonghao Ying, Wenzhuo Xu, Zhao Liu, Xuan Wang, Xiangzheng Zhang
Multimodal large language models (MLLMs) exhibit remarkable capabilities but remain susceptible to jailbreak attacks exploiting cross-modal vulnerabilities. In this work, we introduce a novel method that leverages sequential comic-style visual narratives to circumvent safety alignments in state-of-the-art MLLMs. Our method decomposes malicious queries into visually innocuous storytelling elements using an auxiliary LLM, generates corresponding image sequences through diffusion models, and exploits the models’ reliance on narrative coherence to elicit harmful outputs. Extensive experiments on harmful textual queries from established safety benchmarks show that our approach achieves an average attack success rate of 83.5%, surpassing prior state-of-the-art by 46%. Compared with existing visual jailbreak methods, our sequential narrative strategy demonstrates superior effectiveness across diverse categories of harmful content. We further analyze attack patterns, uncover key vulnerability factors in multimodal safety mechanisms, and evaluate the limitations of current defense strategies against narrative-driven attacks, revealing significant gaps in existing protections.
Paper and project links
Summary
This work introduces a new method that uses sequential comic-style visual narratives to circumvent the safety alignment of state-of-the-art multimodal large language models (MLLMs). An auxiliary LLM decomposes malicious queries into visually innocuous storytelling elements, diffusion models generate the corresponding image sequences, and the models' reliance on narrative coherence is exploited to elicit harmful outputs. Extensive experiments on harmful text queries from established safety benchmarks show an average attack success rate of 83.5%, surpassing the prior state of the art by 46%. Compared with existing visual jailbreak methods, the sequential narrative strategy is more effective across diverse categories of harmful content. The authors further analyze attack patterns, identify key vulnerability factors in multimodal safety mechanisms, and evaluate the limitations of current defenses against narrative-driven attacks, revealing significant gaps in existing protections.
Key Takeaways
- Multimodal large language models (MLLMs) are susceptible to jailbreak attacks that exploit cross-modal vulnerabilities.
- A new method uses sequential comic-style visual narratives to circumvent MLLM safety alignment.
- The method decomposes malicious queries into visually innocuous storytelling elements and uses diffusion models to generate image sequences.
- Exploiting the models' reliance on narrative coherence can elicit harmful outputs.
- On safety benchmarks, the method achieves an average attack success rate of 83.5%, well above prior techniques.
- Compared with other visual jailbreak methods, the sequential narrative strategy is more effective across categories of harmful content.
- The analysis of attack patterns exposes vulnerabilities in multimodal safety mechanisms and the limitations of current defenses against narrative-driven attacks.
Click here to view paper screenshots






Dataset Distillation for Super-Resolution without Class Labels and Pre-trained Models
Authors:Sunwoo Cho, Yejin Jung, Nam Ik Cho, Jae Woong Soh
Training deep neural networks has become increasingly demanding, requiring large datasets and significant computational resources, especially as model complexity advances. Data distillation methods, which aim to improve data efficiency, have emerged as promising solutions to this challenge. In the field of single image super-resolution (SISR), the reliance on large training datasets highlights the importance of these techniques. Recently, a generative adversarial network (GAN) inversion-based data distillation framework for SR was proposed, showing potential for better data utilization. However, the current method depends heavily on pre-trained SR networks and class-specific information, limiting its generalizability and applicability. To address these issues, we introduce a new data distillation approach for image SR that does not need class labels or pre-trained SR models. In particular, we first extract high-gradient patches and categorize images based on CLIP features, then fine-tune a diffusion model on the selected patches to learn their distribution and synthesize distilled training images. Experimental results show that our method achieves state-of-the-art performance while using significantly less training data and requiring less computational time. Specifically, when we train a baseline Transformer model for SR with only 0.68% of the original dataset, the performance drop is just 0.3 dB. In this case, diffusion model fine-tuning takes 4 hours, and SR model training completes within 1 hour, much shorter than the 11-hour training time with the full dataset.
Paper and project links
PDF Code: https://github.com/sunwoocho/SRDD
Summary
Training deep neural networks demands large datasets and significant computational resources; data distillation methods aim to improve data efficiency and offer a promising solution. In single image super-resolution (SISR), the reliance on large training datasets makes such techniques especially important. A recent GAN-inversion-based data distillation framework for SR showed better data utilization but depends heavily on pre-trained SR networks and class-specific information, limiting its generality. This work introduces a new data distillation approach for image SR that needs neither class labels nor pre-trained SR models: high-gradient patches are extracted, images are categorized by CLIP features, and a diffusion model is fine-tuned on the selected patches to learn their distribution and synthesize distilled training images. Experiments show state-of-the-art performance with far less training data and computation: training a baseline Transformer SR model on only 0.68% of the original dataset drops performance by just 0.3 dB, with 4 hours of diffusion fine-tuning and under 1 hour of SR training, far shorter than the 11-hour training time on the full dataset.
Key Takeaways
- Deep learning models face growing demands for large datasets and computational resources.
- Data distillation methods aim to improve data efficiency and address this challenge.
- Existing SR data distillation methods depend on pre-trained models and class-specific information, limiting their generality.
- A new data distillation approach for image SR is proposed that requires neither class labels nor pre-trained SR models.
- High-gradient patches are extracted and images are categorized by CLIP features; a diffusion model is then fine-tuned on the selected patches to learn their distribution (see the sketch below).
- Experiments show the new method reaches state-of-the-art performance with much less data and training time.
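The patch-selection stage referenced above can be sketched as follows. The patch size, the number of kept patches, and the CLIP checkpoint are illustrative choices, not the paper's settings.

```python
# Sketch: score patches by gradient magnitude (texture-rich regions matter most for SR)
# and embed the survivors with CLIP for grouping images before diffusion fine-tuning.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

def high_gradient_patches(img, patch=64, keep=16):
    """img: (3, H, W) float tensor in [0, 1]; returns the `keep` most textured patches."""
    gray = img.mean(0, keepdim=True)[None]                       # (1, 1, H, W)
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    grad = F.conv2d(gray, sobel_x, padding=1).abs() + \
           F.conv2d(gray, sobel_x.transpose(2, 3), padding=1).abs()
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)          # (3, nH, nW, p, p)
    scores = grad[0, 0].unfold(0, patch, patch).unfold(1, patch, patch).mean((-1, -2))
    idx = scores.flatten().topk(keep).indices
    flat = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3, patch, patch)
    return flat[idx]                                             # (keep, 3, patch, patch)

# CLIP features of the selected patches, later used to group images.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
patches = high_gradient_patches(torch.rand(3, 512, 512))
images = [(p.permute(1, 2, 0).numpy() * 255).astype("uint8") for p in patches]
features = clip.get_image_features(**proc(images=images, return_tensors="pt"))  # (keep, 512)
```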
Click here to view paper screenshots





RadioDiff-$k^2$: Helmholtz Equation Informed Generative Diffusion Model for Multi-Path Aware Radio Map Construction
Authors:Xiucheng Wang, Qiming Zhang, Nan Cheng, Ruijin Sun, Zan Li, Shuguang Cui, Xuemin Shen
In this paper, we propose a novel physics-informed generative learning approach, named RadioDiff-$k^2$, for accurate and efficient multipath-aware radio map (RM) construction. As future wireless communication evolves towards environment-aware paradigms, the accurate construction of RMs becomes crucial yet highly challenging. Conventional electromagnetic (EM)-based methods, such as full-wave solvers and ray-tracing approaches, exhibit substantial computational overhead and limited adaptability to dynamic scenarios. Although existing neural network (NN) approaches have efficient inferencing speed, they lack sufficient consideration of the underlying physics of EM wave propagation, limiting their effectiveness in accurately modeling critical EM singularities induced by complex multipath environments. To address these fundamental limitations, we propose a novel physics-inspired RM construction method guided explicitly by the Helmholtz equation, which inherently governs EM wave propagation. Specifically, based on the analysis of partial differential equations (PDEs), we theoretically establish a direct correspondence between EM singularities, which correspond to the critical spatial features influencing wireless propagation, and regions defined by negative wave numbers in the Helmholtz equation. We then design an innovative dual diffusion model (DM)-based large artificial intelligence framework comprising one DM dedicated to accurately inferring EM singularities and another DM responsible for reconstructing the complete RM using these singularities along with environmental contextual information. Experimental results demonstrate that the proposed RadioDiff-$k^2$ framework achieves state-of-the-art (SOTA) performance in both image-level RM construction and localization tasks, while maintaining inference latency within a few hundred milliseconds.
Paper and project links
Summary
This paper proposes RadioDiff-$k^2$, a physics-informed generative learning approach for accurate and efficient multi-path-aware radio map (RM) construction. As wireless communication moves toward environment-aware paradigms, accurate RM construction becomes crucial yet challenging: conventional electromagnetic methods such as full-wave solvers and ray tracing are computationally heavy and adapt poorly to dynamic scenarios, while existing neural approaches are fast but ignore the physics of EM wave propagation and therefore model multipath-induced EM singularities poorly. Guided by the Helmholtz equation, which governs EM wave propagation, the authors establish a direct correspondence between EM singularities and regions of negative wave number, and design a dual diffusion-model framework: one diffusion model infers the EM singularities, and the other reconstructs the complete radio map from these singularities and environmental context. Experiments show that RadioDiff-$k^2$ achieves state-of-the-art performance on image-level RM construction and localization while keeping inference latency within a few hundred milliseconds.
Key Takeaways
- RadioDiff-$k^2$ is a new physics-informed generative learning approach for radio map construction.
- The method combines physical principles with neural generative models, accounting for the physics of EM wave propagation.
- Guided by the Helmholtz equation, it links EM singularities to regions with negative wave numbers (see the equation below).
- A dual diffusion-model framework is used for radio map construction.
- One model accurately infers EM singularities and the other reconstructs the complete radio map.
- Experiments show RadioDiff-$k^2$ performs strongly on image-level radio map construction and localization tasks.
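For reference, the Helmholtz equation the method builds on has the standard source-free form below; the paper's contribution is to associate regions where the effective $k^2$ becomes negative with the EM singularities that the first diffusion model is trained to infer.

```latex
% Standard (source-free) Helmholtz equation for a field component E with wave number k.
\nabla^{2} E(\mathbf{r}) + k^{2}(\mathbf{r})\, E(\mathbf{r}) = 0,
\qquad k = \frac{2\pi f}{c}\sqrt{\varepsilon_r \mu_r}.
% Regions that behave as if k^{2}(\mathbf{r}) < 0 (evanescent, non-propagating
% behavior) are the ones the paper associates with multipath-induced EM singularities.
```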
Click here to view paper screenshots





SPICE: A Synergistic, Precise, Iterative, and Customizable Image Editing Workflow
Authors:Kenan Tang, Yanhong Li, Yao Qin
Prompt-based models have demonstrated impressive prompt-following capability at image editing tasks. However, the models still struggle with following detailed editing prompts or performing local edits. Specifically, global image quality often deteriorates immediately after a single editing step. To address these challenges, we introduce SPICE, a training-free workflow that accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and consistently improves image quality during more than 100 editing steps, while keeping the unedited regions intact. By synergizing the strengths of a base diffusion model and a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions from the user. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators. We release the workflow implementation for popular diffusion model Web UIs to support further research and artistic exploration.
Paper and project links
PDF The paper has been accepted to NeurIPS Creative AI Track 2025. Figure 4(c) has been accepted to CVPR AI Art Gallery 2025
Summary
This paper introduces SPICE, a training-free workflow for image editing. It accepts arbitrary resolutions and aspect ratios, accurately follows user requirements, and keeps improving image quality over more than 100 editing steps while leaving unedited regions intact. By synergizing a base diffusion model with a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions. On a challenging realistic image-editing dataset, SPICE quantitatively outperforms state-of-the-art baselines and is consistently preferred by human annotators.
Key Takeaways
- SPICE is a training-free workflow for image editing tasks.
- It accepts arbitrary resolutions and aspect ratios and accurately follows user requirements.
- SPICE keeps improving image quality over many consecutive editing steps while leaving unedited regions intact.
- By combining a base diffusion model with a Canny edge ControlNet model, SPICE robustly handles free-form editing instructions (a sketch of this combination follows).
- SPICE outperforms existing methods on a realistic image-editing dataset and is preferred by human evaluators.
- The workflow implementation is released for popular diffusion model Web UIs to support further research and artistic exploration.
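The base-diffusion-plus-Canny-ControlNet combination can be sketched with public components, as below. This is not the released SPICE workflow, which wires these pieces into a Web-UI graph with its own masking and iteration logic; the checkpoints, prompt, and parameters are illustrative.

```python
# Sketch: img2img editing guided by both the original content and its Canny edges,
# so free-form edits follow the prompt while the layout is preserved.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

image = Image.open("input.png").convert("RGB")
edges = cv2.Canny(cv2.cvtColor(np.array(image), cv2.COLOR_RGB2GRAY), 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))       # 3-channel control image

edited = pipe(
    prompt="replace the wooden table with a marble one",      # free-form local edit
    image=image,                                              # img2img keeps global content
    control_image=edges,                                      # Canny ControlNet keeps layout
    strength=0.6,
    num_inference_steps=30,
).images[0]
edited.save("edited.png")
```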
Click here to view paper screenshots





CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-Consistency from a Single Image
Authors:Arindam Dutta, Meng Zheng, Zhongpai Gao, Benjamin Planche, Anwesha Choudhuri, Terrence Chen, Amit K. Roy-Chowdhury, Ziyan Wu
Reconstructing clothed humans from a single image is a fundamental task in computer vision with wide-ranging applications. Although existing monocular clothed human reconstruction solutions have shown promising results, they often rely on the assumption that the human subject is in an occlusion-free environment. Thus, when encountering in-the-wild occluded images, these algorithms produce multiview inconsistent and fragmented reconstructions. Additionally, most algorithms for monocular 3D human reconstruction leverage geometric priors such as SMPL annotations for training and inference, which are extremely challenging to acquire in real-world applications. To address these limitations, we propose CHROME: Clothed Human Reconstruction with Occlusion-Resilience and Multiview-ConsistEncy from a Single Image, a novel pipeline designed to reconstruct occlusion-resilient 3D humans with multiview consistency from a single occluded image, without requiring either ground-truth geometric prior annotations or 3D supervision. Specifically, CHROME leverages a multiview diffusion model to first synthesize occlusion-free human images from the occluded input, compatible with off-the-shelf pose control to explicitly enforce cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and synthesized views, aligning cross-view details to produce a cohesive and accurate 3D representation. CHROME achieves significant improvements in terms of both novel view synthesis (upto 3 db PSNR) and geometric reconstruction under challenging conditions.
Paper and project links
PDF Accepted at ICCV 2025
Summary
This paper proposes CHROME, a new method for reconstructing clothed humans from a single image. It addresses the occlusion and multiview-inconsistency problems of existing approaches and reconstructs occlusion-resilient, multiview-consistent 3D humans from a single occluded image without ground-truth geometric prior annotations or 3D supervision. A multiview diffusion model first synthesizes occlusion-free human images from the occluded input, with pose control enforcing cross-view consistency during synthesis. A 3D reconstruction model is then trained to predict a set of 3D Gaussians conditioned on both the occluded input and the synthesized views, aligning cross-view details into a cohesive and accurate 3D representation.
Key Takeaways
- CHROME addresses the occlusion and multiview-inconsistency challenges of existing single-image clothed-human reconstruction methods.
- CHROME works without ground-truth geometric prior annotations or 3D supervision.
- A multiview diffusion model synthesizes occlusion-free human images, improving the robustness of reconstruction.
- Pose control is used to enforce cross-view consistency during synthesis.
- A 3D reconstruction model predicts 3D Gaussians conditioned on the occluded input and the synthesized views, improving reconstruction accuracy (a toy parameterization is sketched below).
- CHROME achieves significant improvements in both novel view synthesis (up to 3 dB PSNR) and geometric reconstruction.
- The method has broad application potential, especially for images captured in complex environments with occlusion.
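The "predict a set of 3D Gaussians" output mentioned above can be made concrete with a toy parameterization. This is not the CHROME architecture, just a sketch of what such a prediction head typically regresses per Gaussian.

```python
# Toy head that regresses 3D Gaussian parameters (position, scale, rotation,
# opacity, color) from fused image features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    PARAMS = 3 + 3 + 4 + 1 + 3   # position, log-scale, quaternion, opacity, RGB

    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                 nn.Linear(256, self.PARAMS))

    def forward(self, feats):                      # feats: (B, N, feat_dim), N Gaussians
        p = self.mlp(feats)
        pos, log_scale, quat, alpha, rgb = p.split([3, 3, 4, 1, 3], dim=-1)
        return {
            "position": pos,
            "scale": log_scale.exp(),              # keep scales positive
            "rotation": F.normalize(quat, dim=-1), # unit quaternion
            "opacity": torch.sigmoid(alpha),
            "color": torch.sigmoid(rgb),
        }

gaussians = GaussianHead()(torch.randn(1, 4096, 256))
print({k: v.shape for k, v in gaussians.items()})
```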
Click here to view paper screenshots





Bolt3D: Generating 3D Scenes in Seconds
Authors:Stanislaw Szymanowicz, Jason Y. Zhang, Pratul Srinivasan, Ruiqi Gao, Arthur Brussee, Aleksander Holynski, Ricardo Martin-Brualla, Jonathan T. Barron, Philipp Henzler
We present a latent diffusion model for fast feed-forward 3D scene generation. Given one or more images, our model Bolt3D directly samples a 3D scene representation in less than seven seconds on a single GPU. We achieve this by leveraging powerful and scalable existing 2D diffusion network architectures to produce consistent high-fidelity 3D scene representations. To train this model, we create a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared to prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces the inference cost by a factor of up to 300 times.
Paper and project links
PDF ICCV 2025. Project page: https://szymanowiczs.github.io/bolt3d
Summary
Given one or more images, Bolt3D directly samples a 3D scene representation in under seven seconds on a single GPU by leveraging powerful, scalable 2D diffusion network architectures to produce consistent, high-fidelity 3D scene representations. To train the model, the authors build a large-scale multiview-consistent dataset of 3D geometry and appearance by applying state-of-the-art dense 3D reconstruction techniques to existing multiview image datasets. Compared with prior multiview generative models that require per-scene optimization for 3D reconstruction, Bolt3D reduces inference cost by up to 300 times.
Key Takeaways
- Bolt3D generates a 3D scene representation in under seven seconds.
- The model builds on existing 2D diffusion network architectures.
- Bolt3D is trained on a large-scale dataset created with state-of-the-art dense 3D reconstruction techniques.
- The model produces multiview-consistent 3D geometry and appearance.
- Compared with prior multiview generative models, Bolt3D greatly reduces inference cost.
- Bolt3D can generate 3D scenes directly from one or more images.
Click here to view paper screenshots




Methods and Trends in Detecting AI-Generated Images: A Comprehensive Review
Authors:Arpan Mahara, Naphtali Rishe
The proliferation of generative models, such as Generative Adversarial Networks (GANs), Diffusion Models, and Variational Autoencoders (VAEs), has enabled the synthesis of high-quality multimedia data. However, these advancements have also raised significant concerns regarding adversarial attacks, unethical usage, and societal harm. Recognizing these challenges, researchers have increasingly focused on developing methodologies to detect synthesized data effectively, aiming to mitigate potential risks. Prior reviews have predominantly focused on deepfake detection and often overlook recent advancements in synthetic image forensics, particularly approaches that incorporate multimodal frameworks, reasoning-based detection, and training-free methodologies. To bridge this gap, this survey provides a comprehensive and up-to-date review of state-of-the-art techniques for detecting and classifying synthetic images generated by advanced generative AI models. The review systematically examines core detection paradigms, categorizes them into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, and offers concise descriptions of their underlying principles. We further provide detailed comparative analyses of these methods on publicly available datasets to assess their generalizability, robustness, and interpretability. Finally, the survey highlights open challenges and future directions, emphasizing the potential of hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models to advance trustworthy and explainable synthetic image forensics.
Paper and project links
PDF 34 pages, 4 Figures, 10 Tables
Summary
This survey comprehensively reviews state-of-the-art techniques for detecting and classifying synthetic images produced by advanced generative AI models. It systematically examines the core detection paradigms, grouping them into spatial-domain, frequency-domain, fingerprint-based, patch-based, training-free, and multimodal reasoning-based frameworks, and compares these methods on publicly available datasets to assess their generalizability, robustness, and interpretability. The survey also highlights open challenges and future directions, emphasizing the potential of hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models to advance trustworthy and explainable synthetic image forensics.
Key Takeaways
- Generative models (GANs, diffusion models, VAEs) now synthesize high-quality multimedia data, raising concerns about adversarial attacks, unethical use, and societal harm.
- Researchers are developing methods to detect synthetic data effectively and mitigate potential risks.
- This survey fills a gap by covering recent advances in synthetic image forensics, including multimodal frameworks, reasoning-based detection, and training-free methods.
- The survey describes the core principles of each detection framework and compares them on public datasets.
- Generalizability, robustness, and interpretability are key criteria for evaluating detection methods.
- Open challenges and future directions include hybrid frameworks that combine the efficiency of training-free approaches with the semantic reasoning of multimodal models.
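As a concrete illustration of one paradigm the survey categorizes, a frequency-domain detector often starts from a radially averaged FFT spectrum, since generated images tend to leave characteristic spectral artifacts. The sketch below is a generic baseline, not a method from the survey; the downstream classifier is left as a placeholder.

```python
# Azimuthally averaged log-magnitude spectrum as a simple frequency-domain feature.
import numpy as np

def radial_spectrum(img_gray: np.ndarray, bins: int = 64) -> np.ndarray:
    """img_gray: 2-D grayscale image in [0, 1]; returns a 1-D spectral feature vector."""
    f = np.fft.fftshift(np.fft.fft2(img_gray))
    mag = np.log1p(np.abs(f))
    h, w = img_gray.shape
    yy, xx = np.mgrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2)                  # radius of each frequency bin
    edges = np.linspace(0, r.max(), bins + 1)
    return np.array([mag[(r >= lo) & (r < hi)].mean()
                     for lo, hi in zip(edges[:-1], edges[1:])])

# These 1-D spectra are then fed to any off-the-shelf classifier (logistic
# regression, SVM, small MLP) trained on real vs. generated examples.
feat = radial_spectrum(np.random.rand(256, 256))
print(feat.shape)   # (64,)
```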
Click here to view paper screenshots


Diffusion Models are Efficient Data Generators for Human Mesh Recovery
Authors:Yongtao Ge, Wenjia Wang, Yongfan Chen, Fanzhou Wang, Lei Yang, Hao Chen, Chunhua Shen
Despite remarkable progress having been made on the problem of 3D human pose and shape estimation (HPS), current state-of-the-art methods rely heavily on either confined indoor mocap datasets or datasets generated by a rendering engine using computer graphics (CG). Both categories of datasets exhibit inadequacies in furnishing adequate human identities and authentic in-the-wild background scenes, which are crucial for accurately simulating real-world distributions. In this work, we show that synthetic data created by generative models is complementary to CG-rendered data for achieving remarkable generalization performance on diverse real-world scenes. We propose an effective data generation pipeline based on recent diffusion models, termed HumanWild, which can effortlessly generate human images and corresponding 3D mesh annotations. Specifically, we first collect a large-scale human-centric dataset with comprehensive annotations, e.g, text captions, the depth map, and surface normal images. To generate a wide variety of human images with initial labels, we train a customized, multi-condition ControlNet model. The key to this process is using a 3D parametric model, e.g, SMPL-X, to create various condition inputs easily. Our data generation pipeline is both flexible and customizable, making it adaptable to multiple real-world tasks, such as human interaction in complex scenes and humans captured by wide-angle lenses. By relying solely on generative models, we can produce large-scale, in-the-wild human images with high-quality annotations, significantly reducing the need for manual image collection and annotation. The generated dataset encompasses a wide range of viewpoints, environments, and human identities, ensuring its versatility across different scenarios. We hope that our work could pave the way for scaling up 3D human recovery to in-the-wild scenes.
Paper and project links
PDF Accepted by TPAMI, project page: https://yongtaoge.github.io/projects/humanwild
Summary
State-of-the-art 3D human pose and shape estimation (HPS) still relies heavily on confined indoor mocap datasets or data rendered with computer graphics (CG), both of which lack sufficient human identities and authentic in-the-wild backgrounds. This work shows that synthetic data created by generative models complements CG-rendered data and yields strong generalization to diverse real-world scenes. The authors propose HumanWild, a data generation pipeline based on recent diffusion models that can effortlessly produce human images together with 3D mesh annotations. A large-scale human-centric dataset with comprehensive annotations (text captions, depth maps, surface normal images) is collected, and a customized multi-condition ControlNet model is trained, with a 3D parametric model such as SMPL-X used to create the various condition inputs. The pipeline is flexible and customizable, adapting to tasks such as human interaction in complex scenes and humans captured by wide-angle lenses. Relying solely on generative models, it produces large-scale in-the-wild human images with high-quality annotations, greatly reducing manual collection and annotation, and the generated data covers a wide range of viewpoints, environments, and identities.
Key Takeaways
- Current HPS research is limited by its datasets and needs more realistic, diverse training data.
- Synthetic data from generative models complements indoor mocap and CG-rendered data.
- HumanWild is an effective diffusion-based data generation pipeline.
- The pipeline generates diverse human images together with initial labels.
- Using a 3D parametric model such as SMPL-X is the key to easily creating the condition inputs (a multi-condition sketch follows).
- The data generation pipeline is flexible and customizable, adapting to multiple real-world tasks.
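The multi-condition generation step described above can be sketched with public depth and normal ControlNets standing in for the paper's custom multi-condition model; the SMPL-X-derived condition maps are placeholder file names.

```python
# Sketch: generate a human image conditioned on several body-derived maps at once
# by passing multiple ControlNets (and one control image each) to the pipeline.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-normal", torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# Condition maps rendered from a fitted SMPL-X body (placeholder file names).
depth_map = Image.open("smplx_depth.png").convert("RGB")
normal_map = Image.open("smplx_normal.png").convert("RGB")

image = pipe(
    prompt="a hiker resting on a rocky trail, wide-angle photo",
    image=[depth_map, normal_map],                    # one control image per ControlNet
    controlnet_conditioning_scale=[1.0, 0.8],         # per-condition weighting
    num_inference_steps=30,
).images[0]
image.save("synthetic_human.png")
```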
Click here to view paper screenshots





