
Vision Transformer


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are provided for reference only, and should be used with caution.
🔴 Note: never rely on them in serious academic settings; use them only as a first-pass filter before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-18

From Attention to Frequency: Integration of Vision Transformer and FFT-ReLU for Enhanced Image Deblurring

Authors:Syed Mumtahin Mahmud, Mahdi Mohd Hossain Noki, Prothito Shovon Majumder, Abdul Mohaimen Al Radi, Md. Haider Ali, Md. Mosaddek Khan

Image deblurring is vital in computer vision, aiming to recover sharp images from blurry ones caused by motion or camera shake. While deep learning approaches such as CNNs and Vision Transformers (ViTs) have advanced this field, they often struggle with complex or high-resolution blur and with computational demands. We propose a new dual-domain architecture that unifies Vision Transformers with a frequency-domain FFT-ReLU module, explicitly bridging spatial attention modeling and frequency sparsity. In this structure, the ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine details. Extensive experiments on benchmark datasets demonstrate that this architecture achieves superior PSNR, SSIM, and perceptual quality compared to state-of-the-art models. Quantitative metrics, qualitative comparisons, and human preference evaluations all confirm its effectiveness, establishing a practical and generalizable paradigm for real-world image restoration.


Paper and project links

PDF

Summary
The paper proposes a novel dual-domain architecture that combines a Vision Transformer with a frequency-domain FFT-ReLU module, bridging spatial attention modeling and frequency sparsity. The ViT backbone captures local and global dependencies, while the FFT-ReLU component enforces frequency-domain sparsity to suppress blur-related artifacts and preserve fine detail. Experiments on benchmark datasets show that it surpasses state-of-the-art models in PSNR, SSIM, and perceptual quality, establishing a practical and generalizable paradigm for image restoration.

Key Takeaways

  1. The paper proposes a novel dual-domain architecture combining a Vision Transformer (ViT) with a frequency-domain FFT-ReLU module.
  2. The architecture targets image deblurring, recovering sharp images from blurry inputs.
  3. The ViT backbone captures local and global dependencies in the image.
  4. The FFT-ReLU module enforces frequency-domain sparsity to suppress blur-related artifacts while preserving image detail (a minimal sketch of such a module follows this list).
  5. Experiments on multiple benchmark datasets show that the architecture outperforms current state-of-the-art models in PSNR, SSIM, and perceptual quality.
  6. Its effectiveness is confirmed not only by quantitative metrics but also by qualitative comparisons and human preference evaluations.
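For readers who want a concrete picture of the frequency-domain component, the PyTorch sketch below shows one way such an FFT-ReLU-style block could look, assuming a residual design, ReLU thresholding of the real and imaginary parts of the spectrum, and a learnable per-channel scale. These choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FFTReLUBlock(nn.Module):
    """Hypothetical FFT-ReLU-style block (an illustrative sketch, not the paper's code):
    transform features to the frequency domain, enforce sparsity with ReLU, transform back."""

    def __init__(self, channels: int):
        super().__init__()
        # Learnable per-channel scaling of the sparse spectrum (an assumption).
        self.scale = nn.Parameter(torch.ones(channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) spatial features coming from the ViT backbone.
        freq = torch.fft.rfft2(x, norm="ortho")                        # complex spectrum
        sparse = torch.relu(freq.real) + 1j * torch.relu(freq.imag)    # frequency-domain sparsity
        out = torch.fft.irfft2(self.scale * sparse, s=x.shape[-2:], norm="ortho")
        return x + out                                                 # residual keeps spatial content


if __name__ == "__main__":
    feats = torch.randn(2, 64, 32, 32)
    print(FFTReLUBlock(64)(feats).shape)   # torch.Size([2, 64, 32, 32])
```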

Cool Papers

Click here to view paper screenshots

LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Authors:Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang

How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.


Paper and project links

PDF AAAI 2026

Summary

This paper addresses how to accurately quantize pre-trained Vision Transformer models. Existing methods rely on uniform precision and ignore the differing quantization sensitivity of ViT components. The authors propose LampQ, a metric-based mixed precision quantization (MPQ) method that overcomes this limitation. LampQ performs layer-wise quantization for fine-grained control and efficient acceleration, uses a type-aware Fisher-based metric to measure sensitivity, assigns bit-widths optimally via integer linear programming, and refines them iteratively. Experiments show that LampQ achieves state-of-the-art quantization performance on ViTs pre-trained for tasks such as image classification, object detection, and zero-shot quantization.

Key Takeaways

  1. Quantization algorithms compress Vision Transformer (ViT) models into low-bit formats, reducing memory and computation demands with minimal accuracy loss.
  2. Existing methods rely on uniform precision and ignore the differing sensitivity of ViT components to quantization.
  3. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative.
  4. LampQ overcomes three major limitations of previous MPQ methods: coarse granularity, metric-scale mismatch across component types, and quantization-unaware bit allocation.
  5. LampQ performs layer-wise quantization for fine-grained control and efficient acceleration.
  6. LampQ introduces a type-aware Fisher-based metric to measure sensitivity and allocates bit-widths via integer linear programming (a toy allocation example follows this list).
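To make the bit-allocation step concrete, here is a toy sketch of layer-wise bit assignment via integer linear programming with SciPy's `milp`. The sensitivity values, layer sizes, candidate bit-widths, the 2^(-2b) error proxy, and the bit budget are all invented for illustration; they merely stand in for LampQ's type-aware Fisher metric and constraints.

```python
import numpy as np
from scipy.optimize import Bounds, LinearConstraint, milp

# Toy layer-wise bit allocation in the spirit of an ILP-based MPQ step.
# All numbers below are invented; `sens` stands in for a type-aware
# Fisher sensitivity computed on the pre-trained ViT.
sens = np.array([0.9, 0.2, 0.5, 0.1])            # per-layer quantization sensitivity
size = np.array([1.2e6, 0.3e6, 2.0e6, 0.8e6])    # parameters per layer
bits = np.array([4, 6, 8])                       # candidate bit-widths
budget = 6.0 * size.sum()                        # total bit budget (average 6 bits)

L, B = len(sens), len(bits)
# Binary variable x[l*B + b] = 1 iff layer l is quantized to bits[b].
# Quantization-error proxy ~ sens * 2^(-2b) (a rough model, an assumption).
cost = (sens[:, None] * 2.0 ** (-2 * bits[None, :])).ravel()

pick_one = np.kron(np.eye(L), np.ones((1, B)))              # each layer picks exactly one width
memory = (size[:, None] * bits[None, :]).ravel()[None, :]   # bits consumed by each choice

res = milp(
    c=cost,
    constraints=[LinearConstraint(pick_one, 1, 1), LinearConstraint(memory, 0, budget)],
    integrality=np.ones(L * B),
    bounds=Bounds(0, 1),
)
assignment = bits[res.x.reshape(L, B).argmax(axis=1)]
print("assigned bit-widths per layer:", assignment)   # sensitive layers tend to get more bits
```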

Cool Papers

Click here to view paper screenshots

Unifying Segment Anything in Microscopy with Vision-Language Knowledge

Authors:Manyu Li, Ruian He, Zixian Zhang, Chenxi Ma, Weimin Tan, Bo Yan

Accurate segmentation of regions of interest in biomedical images holds substantial value in image analysis. Although several foundation models for biomedical segmentation have currently achieved excellent performance on certain datasets, they typically demonstrate sub-optimal performance on unseen domain data. We attribute the deficiency to a lack of vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring outstanding understanding and reasoning capabilities to multimodal tasks, which inspires us to leverage MLLMs to inject Vision-Language Knowledge (VLK), thereby enabling vision models to demonstrate superior generalization capabilities on cross-domain datasets. In this paper, we propose a novel framework that seamlessly uses MLLMs to guide SAM in learning microscopy cross-domain data, unifying Segment Anything in Microscopy, named uLLSAM. Specifically, we propose the Vision-Language Semantic Alignment (VLSA) module, which injects VLK into Segment Anything Model (SAM). We find that after SAM receives global VLK prompts, its performance improves significantly, but there are deficiencies in boundary contour perception. Therefore, we further propose Semantic Boundary Regularization (SBR) to regularize SAM. Our method achieves performance improvements of 11.8% in SA across 9 in-domain microscopy datasets, reaching state-of-the-art performance. Our method also demonstrates improvements of 9.2% in SA across 10 out-of-domain datasets, exhibiting strong generalization capabilities. Code is available at https://github.com/ieellee/uLLSAM.


Paper and project links

PDF 15 pages, 5 figures

Summary: Accurate segmentation of regions of interest in biomedical images is of great value for image analysis. Current foundation models perform well on certain datasets but often degrade on cross-domain data, largely because they lack vision-language knowledge before segmentation. Multimodal Large Language Models (MLLMs) bring strong understanding and reasoning to multimodal tasks, motivating the injection of Vision-Language Knowledge (VLK) through MLLMs so that vision models generalize better across domains. The paper proposes uLLSAM, a framework that uses MLLMs to guide SAM in learning cross-domain microscopy data, unifying Segment Anything in Microscopy. A Vision-Language Semantic Alignment (VLSA) module injects VLK into the Segment Anything Model (SAM); since global VLK prompts improve performance significantly but leave boundary contour perception weak, a Semantic Boundary Regularization (SBR) term is further proposed to regularize SAM. The method improves SA by 11.8% across 9 in-domain microscopy datasets, reaching state-of-the-art performance, and by 9.2% across 10 out-of-domain datasets, showing strong generalization.

Key Takeaways:

  1. Injecting Vision-Language Knowledge (VLK) into biomedical image segmentation improves model performance.
  2. Multimodal Large Language Models (MLLMs) offer strong understanding and reasoning capabilities suited to multimodal tasks.
  3. The proposed uLLSAM framework uses MLLMs to guide SAM in learning cross-domain microscopy data.
  4. The Vision-Language Semantic Alignment (VLSA) module injects VLK into SAM.
  5. SAM improves significantly after receiving global VLK prompts, but its boundary contour perception remains weak.
  6. Semantic Boundary Regularization (SBR) is proposed to regularize SAM and further improve performance (a hypothetical sketch of prompt injection and a boundary loss follows this list).
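As a rough, non-authoritative illustration of the two ideas above, the sketch below projects a global MLLM embedding into SAM-style prompt tokens (in the spirit of VLSA) and pairs it with a simple boundary-aware loss (in the spirit of SBR). Module names, dimensions, and the morphological-gradient loss are hypothetical choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VLSAPrompt(nn.Module):
    """Hypothetical vision-language alignment head: project a global MLLM embedding
    into extra prompt tokens for a SAM-style mask decoder. Dimensions, token count,
    and the single linear projection are assumptions for illustration."""

    def __init__(self, mllm_dim: int = 4096, prompt_dim: int = 256, num_tokens: int = 4):
        super().__init__()
        self.proj = nn.Linear(mllm_dim, num_tokens * prompt_dim)
        self.num_tokens, self.prompt_dim = num_tokens, prompt_dim

    def forward(self, vlk_embedding: torch.Tensor) -> torch.Tensor:
        # vlk_embedding: (B, mllm_dim) global vision-language knowledge vector.
        tokens = self.proj(vlk_embedding)
        return tokens.view(-1, self.num_tokens, self.prompt_dim)  # (B, num_tokens, prompt_dim)


def semantic_boundary_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Toy boundary regularizer standing in for SBR: extract soft mask boundaries with a
    morphological gradient (dilation minus erosion) and compare them with a Dice-style term."""

    def soft_boundary(m: torch.Tensor) -> torch.Tensor:
        dilated = F.max_pool2d(m, kernel_size=3, stride=1, padding=1)
        eroded = -F.max_pool2d(-m, kernel_size=3, stride=1, padding=1)
        return dilated - eroded                     # large only near mask edges

    pb, tb = soft_boundary(pred), soft_boundary(target)   # pred/target: (B, 1, H, W) in [0, 1]
    inter = (pb * tb).sum(dim=(-2, -1))
    denom = pb.pow(2).sum(dim=(-2, -1)) + tb.pow(2).sum(dim=(-2, -1))
    return (1 - (2 * inter + eps) / (denom + eps)).mean()
```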

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when republishing!