⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are intended only for an initial screening before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-09-14
A novel method and dataset for depth-guided image deblurring from smartphone Lidar
Authors: Antonio Montanaro, Diego Valsesia
Modern smartphones are equipped with Lidar sensors providing depth-sensing capabilities. Recent works have shown that this complementary sensor makes it possible to improve various image processing tasks, including deblurring. However, there is currently a lack of datasets with realistic blurred images and paired mobile Lidar depth maps to further study the topic. At the same time, there is also a lack of blind zero-shot methods that can deblur a real image using depth guidance without requiring extensive training sets of paired data. In this paper, we propose an image deblurring method based on denoising diffusion models that can leverage Lidar depth guidance and does not require training data with paired Lidar depth maps. We also present the first dataset of real blurred images with corresponding Lidar depth maps and sharp ground-truth images, acquired with an Apple iPhone 15 Pro, for the purpose of studying Lidar-guided deblurring. Experimental results on this novel dataset show that Lidar guidance is effective and that the proposed method outperforms state-of-the-art deblurring methods in terms of perceptual quality.
Paper and project links
Summary
Modern smartphones are equipped with Lidar sensors that provide depth sensing. However, research has been limited by the lack of datasets pairing realistic blurred images with mobile Lidar depth maps, and by the lack of blind zero-shot methods that can use depth guidance for deblurring without large paired training sets. This paper proposes an image deblurring method based on denoising diffusion models that leverages Lidar depth guidance without requiring training data with paired Lidar depth maps. It also introduces the first dataset of real blurred images with corresponding Lidar depth maps and sharp ground-truth images, all captured with an Apple iPhone 15 Pro, for studying Lidar-guided deblurring. Experimental results show that Lidar guidance is effective and that the method outperforms existing deblurring approaches in perceptual quality.
Key Takeaways
- Modern smartphones carry Lidar sensors with depth-sensing capabilities that can improve image processing tasks, including deblurring.
- The lack of datasets with realistic blurred images and paired mobile Lidar depth maps has limited further research.
- Blind zero-shot methods that deblur using depth guidance, without relying on large paired training sets, are also missing.
- The paper proposes an image deblurring method based on denoising diffusion models that uses Lidar depth guidance and requires no training data with paired Lidar depth maps.
- It introduces the first dataset of real blurred images with corresponding Lidar depth maps and sharp ground-truth images for studying Lidar-guided deblurring.
- Experimental results show that Lidar guidance is effective for the deblurring process.
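To make the depth-guided, training-free idea concrete, below is a minimal sketch of diffusion posterior sampling in which the data-consistency step uses a blur model whose strength grows with Lidar depth. This is an illustration under stated assumptions, not the paper's implementation: the `denoiser`, the DDPM schedule `alphas_cumprod`, the toy `depth_blur` forward model, and the `guidance` weight are all placeholders.

```python
# Sketch of depth-guided diffusion deblurring (illustrative, not the paper's code).
# Assumptions: `denoiser(x, t)` predicts noise at step t; `alphas_cumprod` is a
# standard DDPM schedule; blur strength increases with normalized Lidar depth.
import torch
import torch.nn.functional as F

def depth_blur(img, depth, max_kernel=9):
    """Toy spatially varying blur: blend sharp and blurred copies by depth."""
    c, k = img.shape[1], max_kernel
    kernel = torch.ones(c, 1, k, k, device=img.device) / (k * k)
    blurred = F.conv2d(img, kernel, padding=k // 2, groups=c)
    w = depth.clamp(0, 1)                   # depth map normalized to [0, 1]
    return (1 - w) * img + w * blurred      # deeper pixels receive more blur

@torch.no_grad()
def guided_sample(denoiser, y, depth, alphas_cumprod, guidance=1.0):
    """Simplified sampling loop with a depth-aware data-consistency correction."""
    x = torch.randn_like(y)
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        eps = denoiser(x, t)
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()  # predicted clean image
        with torch.enable_grad():                           # guidance gradient
            x0 = x0_hat.detach().requires_grad_(True)
            loss = F.mse_loss(depth_blur(x0, depth), y)     # match observed blur
            grad = torch.autograd.grad(loss, x0)[0]
        x0_hat = x0_hat - guidance * grad                   # pull toward consistency
        a_prev = alphas_cumprod[t - 1] if t > 0 else alphas_cumprod.new_ones(())
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = a_prev.sqrt() * x0_hat + (1 - a_prev).sqrt() * noise
    return x
```

In a real pipeline the forward model would use calibrated per-pixel blur kernels derived from the depth map rather than a single box blur blended by depth.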
Click here to view paper screenshots

Missing Fine Details in Images: Last Seen in High Frequencies
Authors: Tejaswini Medi, Hsien-Yi Wang, Arianna Rampini, Margret Keuper
Latent generative models have shown remarkable progress in high-fidelity image synthesis, typically using a two-stage training process that involves compressing images into latent embeddings via learned tokenizers in the first stage. The quality of generation strongly depends on how expressive and well-optimized these latent embeddings are. While various methods have been proposed to learn effective latent representations, generated images often lack realism, particularly in textured regions with sharp transitions, due to loss of fine details governed by high frequencies. We conduct a detailed frequency decomposition of existing state-of-the-art (SOTA) latent tokenizers and show that conventional objectives inherently prioritize low-frequency reconstruction, often at the expense of high-frequency fidelity. Our analysis reveals these latent tokenizers exhibit a bias toward low-frequency information during optimization, leading to over-smoothed outputs and visual artifacts that diminish perceptual quality. To address this, we propose a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components. This decoupling enables improved reconstruction of fine textures while preserving global structure. Moreover, we integrate our frequency-preserving latent embeddings into a SOTA latent diffusion model, resulting in sharper and more realistic image generation. Our approach bridges the fidelity gap in current latent tokenizers and emphasizes the importance of frequency-aware optimization for realistic image synthesis, with broader implications for applications in content creation, neural rendering, and medical imaging.
Paper and project links
Summary
Latent generative models have made remarkable progress in high-fidelity image synthesis, typically using a two-stage training process whose first stage compresses images into latent embeddings with a learned tokenizer. Generation quality depends strongly on how expressive and well optimized these embeddings are. Although many methods have been proposed to learn effective latent representations, generated images often lack realism, especially around sharp transitions in textured regions, because fine details carried by high frequencies are lost. A detailed frequency decomposition of state-of-the-art latent tokenizers shows that conventional objectives inherently prioritize low-frequency reconstruction at the expense of high-frequency fidelity, biasing optimization toward low-frequency information and producing over-smoothed outputs and visual artifacts that reduce perceptual quality. To address this, the paper proposes a wavelet-based, frequency-aware variational autoencoder (FA-VAE) framework that explicitly decouples the optimization of low- and high-frequency components, improving the reconstruction of fine textures while preserving global structure. Integrating these frequency-preserving latent embeddings into a state-of-the-art latent diffusion model yields sharper, more realistic image generation. The approach narrows the fidelity gap of current latent tokenizers and highlights the importance of frequency-aware optimization for realistic image synthesis, with broader implications for content creation, neural rendering, and medical imaging.
Key Takeaways
- Latent generative models for high-fidelity image synthesis use a two-stage training process whose first stage compresses images into latent embeddings.
- Generation quality depends on how expressive and well optimized the latent embeddings are.
- Existing latent tokenizers are biased toward low-frequency information, so generated images lack realism, especially at sharp transitions in textured regions.
- A frequency decomposition of state-of-the-art latent tokenizers shows that conventional objectives optimize low-frequency reconstruction at the expense of high-frequency fidelity.
- The proposed wavelet-based FA-VAE framework decouples the optimization of low- and high-frequency components, improving fine-texture reconstruction while preserving global structure.
- Integrating the frequency-preserving latent embeddings into a latent diffusion model yields sharper, more realistic image generation.
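A minimal sketch of the frequency-decoupling idea, assuming a one-level Haar transform, an MSE term on the low-pass band, an L1 term on the high-pass bands, and hand-picked weights (the paper's exact wavelet, weighting, and loss terms may differ):

```python
# Frequency-decoupled reconstruction loss via a one-level Haar wavelet transform.
# Illustrative sketch; FA-VAE's actual objective may use different terms/weights.
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """One-level Haar DWT (expects even H and W): low band + stacked high bands."""
    a = x[..., ::2, ::2]; b = x[..., 1::2, ::2]
    c = x[..., ::2, 1::2]; d = x[..., 1::2, 1::2]
    ll = (a + b + c + d) / 4                       # low-frequency approximation
    lh = (a - b + c - d) / 4                       # vertical detail
    hl = (a + b - c - d) / 4                       # horizontal detail
    hh = (a - b - c + d) / 4                       # diagonal detail
    return ll, torch.cat([lh, hl, hh], dim=1)

def frequency_aware_loss(recon, target, w_low=1.0, w_high=4.0):
    """Weight low- and high-frequency errors separately so the high-frequency
    term is not drowned out by the dominant low-frequency one."""
    ll_r, hi_r = haar_dwt(recon)
    ll_t, hi_t = haar_dwt(target)
    return w_low * F.mse_loss(ll_r, ll_t) + w_high * F.l1_loss(hi_r, hi_t)
```

Setting `w_high > w_low` is one simple way to counteract the low-frequency bias the paper identifies; the transform can be stacked for multi-level decompositions.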
Click here to view paper screenshots

Parasite: A Steganography-based Backdoor Attack Framework for Diffusion Models
Authors: Jiahao Chen, Yu Pan, Yi Du, Chunkai Wu, Lin Wang
Recently, the diffusion model has gained significant attention as one of the most successful image generation models, capable of generating high-quality images by iteratively sampling noise. However, recent studies have shown that diffusion models are vulnerable to backdoor attacks, allowing attackers to supply input data containing triggers that activate the backdoor and generate their desired output. Existing backdoor attack methods have primarily focused on noise-to-image and text-to-image tasks, with limited work on backdoor attacks in image-to-image tasks. Furthermore, traditional backdoor attacks often rely on a single, conspicuous trigger to generate a fixed target image, lacking concealability and flexibility. To address these limitations, we propose a novel backdoor attack method called “Parasite” for image-to-image tasks in diffusion models, which not only is the first to leverage steganography for trigger hiding, but also allows attackers to embed the target content as a backdoor trigger to achieve a more flexible attack. As a novel attack method, “Parasite” effectively bypasses existing detection frameworks to execute backdoor attacks. In our experiments, “Parasite” achieved a 0 percent backdoor detection rate against mainstream defense frameworks. In addition, in an ablation study, we discuss the influence of different hiding coefficients on the attack results. You can find our code at https://anonymous.4open.science/r/Parasite-1715/.
Paper and project links
Summary
Diffusion models have recently become one of the most successful families of image generation models, but they are vulnerable to backdoor attacks. The newly proposed “Parasite” method executes backdoor attacks on image-to-image diffusion models while bypassing existing detection frameworks, using steganography to hide the trigger. The attack is more concealable and flexible than prior methods and achieves a 0% backdoor detection rate against mainstream defense frameworks.
Key Takeaways
- Diffusion models are among the most successful image generation models today, but they are vulnerable to backdoor attacks.
- “Parasite” is the first attack method to use steganography to hide the trigger.
- “Parasite” lets attackers embed the target content itself as the backdoor trigger, enabling a more flexible attack.
- “Parasite” effectively bypasses existing detection frameworks when executing backdoor attacks.
- Against mainstream defense frameworks, “Parasite” achieves a 0% backdoor detection rate.
- An ablation study examines how different hiding coefficients affect the attack results.
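To illustrate the general trigger-hiding idea, here is a textbook least-significant-bit (LSB) embedding in NumPy, where the number of hidden bits plays a role loosely analogous to a hiding coefficient. This is a classic scheme for illustration only, not Parasite's actual embedding:

```python
# Classic LSB steganography: hide a target image in the low bits of a cover
# image. Illustrates trigger hiding in general; Parasite's scheme may differ.
import numpy as np

def embed(cover: np.ndarray, secret: np.ndarray, bits: int = 2) -> np.ndarray:
    """Replace the `bits` least significant bits of the cover with the most
    significant bits of the secret (both uint8 arrays of the same shape)."""
    keep_mask = 0xFF ^ ((1 << bits) - 1)      # e.g. 0b11111100 for bits=2
    return (cover & keep_mask) | (secret >> (8 - bits))

def extract(stego: np.ndarray, bits: int = 2) -> np.ndarray:
    """Recover a coarse version of the hidden secret from the stego image."""
    return (stego & ((1 << bits) - 1)) << (8 - bits)

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
secret = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
stego = embed(cover, secret, bits=2)
# The stego image stays visually close to the cover (only the 2 low bits change).
assert np.abs(stego.astype(int) - cover.astype(int)).max() < 4
```

Using more bits recovers the hidden content more faithfully but perturbs the cover more, which is the trade-off a hiding coefficient would control.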
Click here to view paper screenshots

IDEATOR: Jailbreaking and Benchmarking Large Vision-Language Models Using Themselves
Authors: Ruofan Wang, Juncheng Li, Yixu Wang, Bo Wang, Xiaosen Wang, Yan Teng, Yingchun Wang, Xingjun Ma, Yu-Gang Jiang
As large Vision-Language Models (VLMs) gain prominence, ensuring their safe deployment has become critical. Recent studies have explored VLM robustness against jailbreak attacks: techniques that exploit model vulnerabilities to elicit harmful outputs. However, the limited availability of diverse multimodal data has constrained current approaches to rely heavily on adversarial or manually crafted images derived from harmful text datasets, which often lack effectiveness and diversity across different contexts. In this paper, we propose IDEATOR, a novel jailbreak method that autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is grounded in the insight that VLMs themselves could serve as powerful red team models for generating multimodal jailbreak prompts. Specifically, IDEATOR leverages a VLM to create targeted jailbreak texts and pairs them with jailbreak images generated by a state-of-the-art diffusion model. Extensive experiments demonstrate IDEATOR’s high effectiveness and transferability, achieving a 94% attack success rate (ASR) in jailbreaking MiniGPT-4 with an average of only 5.34 queries, and high ASRs of 82%, 88%, and 75% when transferred to LLaVA, InstructBLIP, and Chameleon, respectively. Building on IDEATOR’s strong transferability and automated process, we introduce VLJailbreakBench, a safety benchmark comprising 3,654 multimodal jailbreak samples. Our benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment. For instance, our challenge set achieves ASRs of 46.31% on GPT-4o and 19.65% on Claude-3.5-Sonnet, underscoring the urgent need for stronger defenses. VLJailbreakBench is publicly available at https://roywang021.github.io/VLJailbreakBench.
Paper and project links
Summary
This paper addresses the safe deployment of large Vision-Language Models (VLMs). Targeting jailbreak attacks on VLMs, it proposes IDEATOR, a new method that autonomously generates malicious image-text pairs for black-box attacks. IDEATOR uses a VLM itself as a red-team model to create targeted jailbreak texts and pairs them with jailbreak images produced by a state-of-the-art diffusion model. Experiments show strong effectiveness and transferability, with a 94% attack success rate on MiniGPT-4 using an average of only 5.34 queries. The paper also introduces VLJailbreakBench, a safety benchmark of 3,654 multimodal jailbreak samples; benchmark results on 11 recently released VLMs reveal significant gaps in safety alignment.
Key Takeaways
- Safe deployment of large Vision-Language Models (VLMs) is critical.
- Current approaches are constrained by the scarcity of diverse multimodal data and rely mainly on adversarial or manually crafted images to test model safety.
- IDEATOR uses a VLM itself to generate malicious image-text pairs for black-box jailbreak attacks.
- IDEATOR demonstrates a high attack success rate and good transferability in experiments.
- VLJailbreakBench, a safety benchmark with a large set of multimodal jailbreak samples, is introduced to evaluate VLM safety.
- Benchmark results reveal significant gaps in safety alignment across different VLMs.
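The overall attack loop (a red-team VLM proposes a jailbreak text plus an image prompt, a diffusion model renders the image, the target VLM is queried, and its response feeds back into the next proposal) can be sketched as below. Every model call is a hypothetical stub supplied by the caller; no real model API is implied:

```python
# Skeleton of an IDEATOR-style black-box jailbreak loop (illustrative only).
# The four callables are hypothetical stubs for a red-team VLM, a text-to-image
# diffusion model, the target VLM, and a success judge.
from typing import Callable, Optional, Tuple

def ideator_attack(
    propose: Callable[[str, list], Tuple[str, str]],  # goal, history -> (text, image prompt)
    render: Callable[[str], bytes],                   # image prompt -> image bytes
    target: Callable[[str, bytes], str],              # (text, image) -> response
    judge: Callable[[str], bool],                     # did the response comply?
    goal: str,
    max_queries: int = 20,
) -> Optional[Tuple[str, bytes]]:
    history: list = []
    for _ in range(max_queries):
        text, image_prompt = propose(goal, history)   # generate a candidate pair
        image = render(image_prompt)
        response = target(text, image)                # one black-box query
        if judge(response):
            return text, image                        # successful jailbreak pair
        history.append((text, image_prompt, response))  # feedback for refinement
    return None
```

The reported average of 5.34 queries on MiniGPT-4 suggests the feedback-driven `propose` step converges quickly, though everything in this sketch beyond the loop structure is an assumption.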
Click here to view paper screenshots