⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are intended only for a first-pass screening before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-10-23 Update
Cross-Modal Scene Semantic Alignment for Image Complexity Assessment
Authors: Yuqing Luo, Yixiao Li, Jiang Liu, Jun Fu, Hadi Amirpour, Guanghui Yue, Baoquan Zhao, Padraig Corcoran, Hantao Liu, Wei Zhou
Image complexity assessment (ICA) is a challenging task in perceptual evaluation due to the subjective nature of human perception and the inherent semantic diversity in real-world images. Existing ICA methods predominantly rely on hand-crafted or shallow convolutional neural network-based features of a single visual modality, which are insufficient to fully capture the perceived representations closely related to image complexity. Recently, cross-modal scene semantic information has been shown to play a crucial role in various computer vision tasks, particularly those involving perceptual understanding. However, the exploration of cross-modal scene semantic information in the context of ICA remains unaddressed. Therefore, in this paper, we propose a novel ICA method called Cross-Modal Scene Semantic Alignment (CM-SSA), which leverages scene semantic alignment from a cross-modal perspective to enhance ICA performance, enabling complexity predictions to be more consistent with subjective human perception. Specifically, the proposed CM-SSA consists of a complexity regression branch and a scene semantic alignment branch. The complexity regression branch estimates image complexity levels under the guidance of the scene semantic alignment branch, while the scene semantic alignment branch is used to align images with corresponding text prompts that convey rich scene semantic information by pair-wise learning. Extensive experiments on several ICA datasets demonstrate that the proposed CM-SSA significantly outperforms state-of-the-art approaches. Codes are available at https://github.com/XQ2K/First-Cross-Model-ICA.
Paper and Project Links
PDF: 14 pages, 2 figures, British Machine Vision Conference
Summary
This paper proposes a new method for image complexity assessment (ICA) called Cross-Modal Scene Semantic Alignment (CM-SSA). The method leverages cross-modal scene semantic alignment to improve ICA performance, making complexity predictions more consistent with subjective human perception. CM-SSA consists of a complexity regression branch and a scene semantic alignment branch: the former estimates image complexity levels under the guidance of the latter, while the latter aligns images with text prompts that convey rich scene semantic information through pair-wise learning. Experiments on several ICA datasets show that CM-SSA significantly outperforms existing methods.
Key Takeaways
- Image complexity assessment (ICA) is a challenging perceptual evaluation task because of the subjectivity of human perception and the semantic diversity of real-world images.
- Existing ICA methods rely mainly on hand-crafted or shallow CNN-based features from a single visual modality, which cannot fully capture the perceptual representations closely related to image complexity.
- Cross-modal scene semantic information plays a key role in many computer vision tasks, especially those involving perceptual understanding.
- This paper proposes a new ICA method, Cross-Modal Scene Semantic Alignment (CM-SSA), which exploits cross-modal scene semantic alignment to enhance ICA performance.
- CM-SSA consists of a complexity regression branch and a scene semantic alignment branch: the former estimates image complexity levels, while the latter aligns images with text prompts rich in scene semantics via pair-wise learning (see the sketch after this list).
- Experiments show that CM-SSA significantly outperforms existing methods on several ICA datasets.
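The abstract does not spell out the training objective, so the following is a minimal, hypothetical PyTorch sketch of the two-branch idea: a complexity regression head trained jointly with a contrastive-style pair-wise alignment between image embeddings and scene-semantic text prompts. All module names, dimensions, and the exact loss form are illustrative assumptions, not the paper's implementation; the authors' code is at https://github.com/XQ2K/First-Cross-Model-ICA.

```python
# Hypothetical sketch of a two-branch CM-SSA-style model (not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMSSA(nn.Module):
    def __init__(self, img_dim=512, txt_dim=512, embed_dim=256):
        super().__init__()
        # Stand-ins for pretrained image/text encoders (e.g., CLIP towers).
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)
        # Complexity regression branch: image embedding -> scalar score.
        self.regressor = nn.Sequential(
            nn.Linear(embed_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, img_feat, txt_feat):
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_txt = F.normalize(self.txt_proj(txt_feat), dim=-1)
        score = self.regressor(z_img).squeeze(-1)  # predicted complexity level
        return score, z_img, z_txt

def pairwise_alignment_loss(z_img, z_txt, temperature=0.07):
    # Pair-wise learning: each image should match its own scene-semantic
    # text prompt more closely than the other prompts in the batch.
    logits = z_img @ z_txt.t() / temperature
    targets = torch.arange(z_img.size(0), device=z_img.device)
    return F.cross_entropy(logits, targets)

# Toy usage: a batch of 8 precomputed feature vectors and complexity labels.
img_feat, txt_feat, labels = torch.randn(8, 512), torch.randn(8, 512), torch.rand(8)
model = CMSSA()
score, z_img, z_txt = model(img_feat, txt_feat)
loss = F.mse_loss(score, labels) + pairwise_alignment_loss(z_img, z_txt)
loss.backward()
```

The point of the joint loss is that the alignment term shapes the image embedding space with scene semantics, which the regression head then exploits; the relative weighting of the two terms is another assumption here.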
Polyline Path Masked Attention for Vision Transformer
Authors: Zhongchen Zhao, Chaodong Xiao, Hui Lin, Qi Xie, Lei Zhang, Deyu Meng
Global dependency modeling and spatial position modeling are two core issues of the foundational architecture design in current deep learning frameworks. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated its significant potential in natural language processing tasks by explicitly modeling the spatial adjacency prior through the structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA) that integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis on the structural characteristics of the proposed polyline path mask and design an efficient algorithm for the computation of the polyline path mask. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
Paper and Project Links
Summary
This paper proposes Polyline Path Masked Attention (PPMA), which integrates the self-attention mechanism of Vision Transformers with the structured mask of Mamba2. A 2D polyline path scanning strategy improves Mamba2's structured mask into a polyline path mask that better preserves the adjacency relationships among image tokens. Embedding this mask into the self-attention mechanism of Vision Transformers enables explicit modeling of the spatial adjacency prior. On standard benchmarks including image classification, object detection, and segmentation, the model outperforms previous state-of-the-art methods based on both state-space models and Transformers.
Key Takeaways
- Vision Transformers achieve global dependency modeling through the self-attention mechanism and have been remarkably successful in computer vision.
- Mamba2 shows strong potential in natural language processing by explicitly modeling the spatial adjacency prior through a structured mask.
- PPMA combines the self-attention mechanism of Vision Transformers with an enhanced structured mask from Mamba2, harnessing the complementary strengths of both architectures.
- PPMA improves the structured mask with a 2D polyline path scanning strategy, better preserving the adjacency relationships among image tokens; a toy sketch of mask-augmented attention follows this list.
- PPMA performs strongly on standard benchmarks such as image classification, object detection, and semantic segmentation.
- On the ADE20K semantic segmentation task, the PPMA-T/S/B models reach 48.7%/51.1%/52.3% mIoU, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively.
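As a rough illustration of how a structured decay mask can be folded into self-attention, here is a minimal, hypothetical PyTorch sketch. It approximates the polyline path length between two tokens by their Manhattan distance on the token grid and uses a single decay rate `gamma`; the paper's actual polyline path mask and its efficient computation algorithm differ (see https://github.com/zhongchenzhao/PPMA).

```python
# Hypothetical sketch: self-attention with a 2D distance-decay mask
# (a stand-in for the polyline path mask, not the paper's algorithm).
import torch
import torch.nn.functional as F

def manhattan_log_mask(h, w, gamma=0.9):
    # Grid coordinates of each of the h*w image tokens.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = (pos[:, None, :] - pos[None, :, :]).abs().sum(-1)         # (N, N)
    # log(gamma ** dist) = dist * log(gamma); added to attention logits,
    # so distant token pairs are exponentially down-weighted.
    return dist * torch.log(torch.tensor(gamma))

def masked_attention(q, k, v, log_mask):
    # Standard scaled dot-product attention with the structured mask
    # folded into the logits before the softmax.
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    return F.softmax(scores + log_mask, dim=-1) @ v

# Toy usage: one attention head over a 4x4 token grid with dimension 32.
h, w, d = 4, 4, 32
q, k, v = (torch.randn(h * w, d) for _ in range(3))
out = masked_attention(q, k, v, manhattan_log_mask(h, w))
print(out.shape)  # torch.Size([16, 32])
```

Adding the mask in log-space keeps the attention rows properly normalized; the paper's contribution is a mask whose decay follows 2D polyline scanning paths rather than the plain Manhattan distance assumed here.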