
Vision Transformer


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these summaries in serious academic settings; they are only meant for pre-screening papers before reading!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-09-11

Benchmarking Vision Transformers and CNNs for Thermal Photovoltaic Fault Detection with Explainable AI Validation

Authors: Serra Aksoy

Artificial intelligence deployment for automated photovoltaic (PV) monitoring faces interpretability barriers that limit adoption in energy infrastructure applications. While deep learning achieves high accuracy in thermal fault detection, validation that model decisions align with thermal physics principles remains lacking, creating deployment hesitancy where understanding model reasoning is critical. This study provides a systematic comparison of convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, using XRAI saliency analysis to assess alignment with thermal physics principles. This represents the first systematic comparison of CNNs and vision transformers for thermal PV fault detection with physics-validated interpretability. Evaluation on 20,000 infrared images spanning normal operation and 11 fault categories shows that Swin Transformer achieves the highest performance (94% binary accuracy; 73% multiclass accuracy) compared to CNN approaches. XRAI analysis reveals that models learn physically meaningful features, such as localized hotspots for cell defects, linear thermal paths for diode failures, and thermal boundaries for vegetation shading, consistent with expected thermal signatures. However, performance varies significantly across fault types: electrical faults achieve strong detection (F1-scores >0.90) while environmental factors like soiling remain challenging (F1-scores 0.20-0.33), indicating limitations imposed by thermal imaging resolution. The thermal physics-guided interpretability approach provides a methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers in renewable energy infrastructure.


Paper and project links

PDF: 28 pages, 4 figures

Summary
The deployment of artificial intelligence for photovoltaic monitoring faces interpretability barriers that limit its adoption in energy infrastructure. This study systematically compares convolutional neural networks (ResNet-18, EfficientNet-B0) and vision transformers (ViT-Tiny, Swin-Tiny) for thermal PV fault detection, and uses XRAI saliency analysis to assess how well model decisions align with thermal physics principles. Swin Transformer performs best on both binary and multiclass detection, and the features the models learn are consistent with thermal physics. Detection performance differs markedly across fault types, however: electrical faults are detected well, while environment-driven issues such as soiling remain challenging. The study provides a methodology for validating AI decision-making in energy monitoring, helping remove barriers to deploying AI in renewable energy infrastructure.

Key Takeaways

  1. Deployment of AI for automated photovoltaic monitoring is limited by interpretability barriers, especially in applications where understanding model reasoning is critical.
  2. This study is the first systematic comparison of convolutional neural networks and vision transformers for thermal PV fault detection.
  3. Swin Transformer performs best at thermal fault detection, with the highest binary and multiclass classification accuracy.
  4. XRAI saliency analysis shows that the models learn features consistent with thermal physics, such as localized hotspots, linear thermal paths, and thermal boundaries (a minimal saliency sketch follows this list).
  5. Detection performance varies across fault types: electrical faults are detected well, while environment-driven factors such as soiling remain challenging.
  6. The work develops a methodology for validating AI decision-making in energy monitoring applications, addressing deployment barriers.
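The paper validates classifier attention with XRAI saliency maps. As a rough illustration of that pipeline, the sketch below loads a Swin-Tiny classifier from timm and computes a plain input-gradient saliency map; the 2-class head, the random stand-in image, and the gradient method (instead of XRAI from Google's saliency library) are all assumptions made to keep the example self-contained.

```python
# Minimal sketch: input-gradient saliency for a Swin-Tiny fault classifier.
# Assumptions: timm's ImageNet Swin-Tiny re-headed for 2 classes stands in for
# the paper's model; vanilla gradients stand in for XRAI; random data for IR images.
import timm
import torch

model = timm.create_model("swin_tiny_patch4_window7_224", pretrained=False, num_classes=2)
model.eval()

image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a thermal image

logits = model(image)
score = logits[0, logits.argmax()]  # score of the predicted class
score.backward()                    # gradients of that score w.r.t. input pixels

saliency_map = image.grad.abs().max(dim=1).values  # (1, 224, 224) heatmap
print(saliency_map.shape)
```

A physics-consistency check like the paper's would then compare such heatmaps against expected thermal signatures, e.g. whether high-saliency regions coincide with hotspots.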


Aesthetic Image Captioning with Saliency Enhanced MLLMs

Authors: Yilin Tao, Jiashui Huang, Huaze Xu, Ling Shao

Aesthetic Image Captioning (AIC) aims to generate textual descriptions of image aesthetics, becoming a key research direction in the field of computational aesthetics. In recent years, pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly, leading to a significant increase in image aesthetics research that integrates both visual and textual modalities. However, most existing studies on image aesthetics primarily focus on predicting aesthetic ratings and have shown limited application in AIC. Existing AIC works leveraging MLLMs predominantly rely on fine-tuning methods without specifically adapting MLLMs to focus on target aesthetic content. To address this limitation, we propose the Aesthetic Saliency Enhanced Multimodal Large Language Model (ASE-MLLM), an end-to-end framework that explicitly incorporates aesthetic saliency into MLLMs. Within this framework, we introduce the Image Aesthetic Saliency Module (IASM), which efficiently and effectively extracts aesthetic saliency features from images. Additionally, we design IAS-ViT as the image encoder for MLLMs; this module fuses aesthetic saliency features with original image features via a cross-attention mechanism. To the best of our knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks. Extensive experiments demonstrated that our approach significantly outperformed traditional methods and generic MLLMs on current mainstream AIC benchmarks, achieving state-of-the-art (SOTA) performance.


Paper and project links

PDF

Summary

This entry covers Aesthetic Image Captioning (AIC), a key research direction in computational aesthetics. Pretrained multimodal large language models (MLLMs) have advanced rapidly in recent years, driving image aesthetics research that integrates visual and textual modalities, but existing work focuses mostly on predicting aesthetic ratings and has seen limited use in AIC. To address this, the authors propose the Aesthetic Saliency Enhanced MLLM (ASE-MLLM) framework, which injects aesthetic saliency into MLLMs through an Image Aesthetic Saliency Module (IASM) and an IAS-ViT image encoder. To the authors' knowledge, ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC, and it significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance.

Key Takeaways

  1. Aesthetic Image Captioning (AIC), which generates textual descriptions of image aesthetics, is a key research direction in computational aesthetics.
  2. Pretrained Multimodal Large Language Models (MLLMs) have advanced rapidly in image aesthetics research.
  3. Current image aesthetics research focuses mostly on predicting aesthetic ratings and has seen limited application to AIC.
  4. The ASE-MLLM framework injects aesthetic saliency features into MLLMs by combining the Image Aesthetic Saliency Module (IASM) with the IAS-ViT image encoder (a cross-attention fusion sketch follows this list).
  5. ASE-MLLM is the first framework to integrate image aesthetic saliency into MLLMs specifically for AIC tasks.
  6. ASE-MLLM significantly outperforms traditional methods and generic MLLMs on mainstream AIC benchmarks, achieving state-of-the-art performance.
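The abstract states that IAS-ViT fuses aesthetic saliency features with the original image features through a cross-attention mechanism, but gives no layer details. Below is a minimal sketch of that general fusion pattern; the module name, dimensions, and the residual-plus-norm arrangement are all assumptions, not the paper's actual design.

```python
# Minimal sketch: fusing saliency-branch tokens into image tokens via cross-attention.
# Dimensions, names, and the residual/LayerNorm layout are illustrative only.
import torch
import torch.nn as nn

class SaliencyCrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_tokens, saliency_tokens):
        # Image tokens query the saliency tokens; a residual keeps original content.
        fused, _ = self.attn(query=image_tokens, key=saliency_tokens, value=saliency_tokens)
        return self.norm(image_tokens + fused)

fusion = SaliencyCrossAttentionFusion()
img = torch.rand(2, 196, 768)   # e.g., 14x14 patch tokens from a ViT
sal = torch.rand(2, 196, 768)   # saliency-branch tokens, same width here
print(fusion(img, sal).shape)   # torch.Size([2, 196, 768])
```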


GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images

Authors: Mengyu Ren, Yutong Li, Hua Li, Runmin Cong, Sam Kwong

Salient object detection (SOD) in optical remote sensing images (ORSIs) faces numerous challenges, including significant variations in target scales and low contrast between targets and the background. Existing methods based on vision transformers (ViTs) and convolutional neural networks (CNNs) architectures aim to leverage both global and local features, but the difficulty in effectively integrating these heterogeneous features limits their overall performance. To overcome these limitations, we propose a graph-enhanced contextual and regional perception network (GCRPNet), which builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. Specifically, we employ the visual state space (VSS) encoder to extract multi-scale features. To further achieve deep guidance and enhancement of these features, we first design a difference-similarity guided hierarchical graph attention module (DS-HGAM). This module strengthens cross-layer interaction capabilities between features of different scales while enhancing the model’s structural perception, allowing it to distinguish between foreground and background more effectively. Then, we design the LEVSS block as the decoder of GCRPNet. This module integrates our proposed adaptive scanning strategy and multi-granularity collaborative attention enhancement module (MCAEM). It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information and enhancing Mamba’s local modeling capability. Extensive experimental results demonstrate that the proposed model achieves state-of-the-art performance, validating its effectiveness and superiority.


Paper and project links

PDF

Summary

Salient object detection in optical remote sensing images faces many challenges, such as large variations in target scale and low contrast between targets and the background. Existing methods based on vision transformers and convolutional neural networks aim to exploit both global and local features, but struggle to fuse these heterogeneous features effectively, which limits their performance. To overcome this, the authors propose a graph-enhanced contextual and regional perception network (GCRPNet) built on the Mamba architecture, capturing long-range dependencies while strengthening regional feature representation. Specifically, a visual state space (VSS) encoder extracts multi-scale features. To guide and enhance these features, a difference-similarity guided hierarchical graph attention module (DS-HGAM) strengthens cross-layer interaction between features at different scales and improves the model's structural perception, helping it separate foreground from background. In addition, the LEVSS block serves as the decoder of GCRPNet, combining an adaptive scanning strategy with a multi-granularity collaborative attention enhancement module (MCAEM) to perform adaptive patch scanning on feature maps processed by multi-scale convolutions, capturing rich local region information and boosting Mamba's local modeling ability. Experiments show the model achieves state-of-the-art performance, validating its effectiveness and superiority.

Key Takeaways

  1. Salient object detection in optical remote sensing images faces challenges such as large variations in target scale and low contrast between targets and the background.
  2. Existing methods built on vision transformers and convolutional neural networks aim to fuse global and local features, but struggle to integrate these heterogeneous features effectively.
  3. The proposed graph-enhanced contextual and regional perception network (GCRPNet) builds on the Mamba architecture.
  4. A visual state space (VSS) encoder extracts multi-scale features.
  5. The difference-similarity guided hierarchical graph attention module (DS-HGAM) strengthens cross-layer feature interaction and the model's structural perception.
  6. The LEVSS block combines an adaptive scanning strategy with the multi-granularity collaborative attention enhancement module (MCAEM) to improve local modeling (a scan-order sketch follows this list).
  7. Experiments show the proposed model achieves state-of-the-art performance.
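The LEVSS decoder reportedly scans feature-map patches adaptively before Mamba-style sequence modeling, but the abstract gives no specifics. The sketch below therefore only shows the generic building block such scanning strategies rest on: serializing a 2D feature map under different scan orders. The function names and the zigzag variant are invented for illustration, not taken from the paper.

```python
# Minimal sketch: serializing a 2D feature map under different scan orders,
# the building block behind Mamba-style visual scanning strategies.
# Function names and the zigzag choice are illustrative, not the paper's LEVSS.
import torch

def row_major_scan(fmap: torch.Tensor) -> torch.Tensor:
    # (B, C, H, W) -> (B, H*W, C), left-to-right, top-to-bottom.
    return fmap.flatten(2).transpose(1, 2)

def zigzag_scan(fmap: torch.Tensor) -> torch.Tensor:
    # Same, but alternate rows are reversed so spatial neighbors stay adjacent.
    b, c, h, w = fmap.shape
    rows = fmap.permute(0, 2, 3, 1)  # (B, H, W, C)
    rows = torch.stack(
        [r if i % 2 == 0 else r.flip(1) for i, r in enumerate(rows.unbind(1))],
        dim=1,
    )
    return rows.reshape(b, h * w, c)

x = torch.rand(1, 96, 8, 8)
print(row_major_scan(x).shape, zigzag_scan(x).shape)  # both (1, 64, 96)
```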


MSCPT: Few-shot Whole Slide Image Classification with Multi-scale and Context-focused Prompt Tuning

Authors: Minghao Han, Linhao Qu, Dingkang Yang, Xukun Zhang, Xiaoying Wang, Lihua Zhang

Multiple instance learning (MIL) has become a standard paradigm for the weakly supervised classification of whole slide images (WSIs). However, this paradigm relies on using a large number of labeled WSIs for training. The lack of training data and the presence of rare diseases pose significant challenges for these methods. Prompt tuning combined with pre-trained Vision-Language models (VLMs) is an effective solution to the Few-shot Weakly Supervised WSI Classification (FSWC) task. Nevertheless, applying prompt tuning methods designed for natural images to WSIs presents three significant challenges: 1) These methods fail to fully leverage the prior knowledge from the VLM’s text modality; 2) They overlook the essential multi-scale and contextual information in WSIs, leading to suboptimal results; and 3) They lack exploration of instance aggregation methods. To address these problems, we propose a Multi-Scale and Context-focused Prompt Tuning (MSCPT) method for FSWC task. Specifically, MSCPT employs the frozen large language model to generate pathological visual language prior knowledge at multiple scales, guiding hierarchical prompt tuning. Additionally, we design a graph prompt tuning module to learn essential contextual information within WSI, and finally, a non-parametric cross-guided instance aggregation module has been introduced to derive the WSI-level features. Extensive experiments, visualizations, and interpretability analyses were conducted on five datasets and three downstream tasks using three VLMs, demonstrating the strong performance of our MSCPT. All codes have been made publicly accessible at https://github.com/Hanminghao/MSCPT.


Paper and project links

PDF: This work has been submitted to the IEEE TMI for possible publication

Summary
Multiple instance learning is the standard paradigm for weakly supervised classification of whole slide images (WSIs), but scarce labeled training data and rare diseases make it difficult to apply. To address this, the Multi-Scale and Context-focused Prompt Tuning (MSCPT) method targets the few-shot weakly supervised WSI classification task. MSCPT uses a frozen large language model to generate pathological visual-language prior knowledge at multiple scales to guide hierarchical prompt tuning, a graph prompt tuning module to learn key contextual information within a WSI, and a non-parametric cross-guided instance aggregation module to derive WSI-level features. Experiments show strong performance across multiple datasets and tasks; the code is public.

Key Takeaways

  1. Multiple instance learning has become the standard paradigm for weakly supervised classification of whole slide images, but the lack of training data and the presence of rare diseases challenge this paradigm.
  2. Prompt tuning combined with pre-trained vision-language models is an effective solution to the few-shot weakly supervised WSI classification task.
  3. Applying prompt tuning methods designed for natural images to WSIs faces three challenges: they fail to fully exploit the prior knowledge in the VLM's text modality, they overlook the multi-scale and contextual information in WSIs, and they lack exploration of instance aggregation methods.
  4. The proposed MSCPT method uses a large language model to generate pathological visual-language prior knowledge and performs prompt tuning at multiple scales to address these challenges (a prompt-tuning sketch follows this list).
  5. MSCPT includes a graph prompt tuning module for learning key contextual information within a WSI.
  6. A non-parametric cross-guided instance aggregation module derives WSI-level features to strengthen model performance.
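MSCPT builds on prompt tuning against a frozen VLM. As a generic illustration only (not the paper's hierarchical, multi-scale design), the sketch below shows the common pattern of prepending learnable prompt tokens to frozen text embeddings; every name and dimension here is assumed.

```python
# Minimal sketch: learnable prompt tokens prepended to frozen text embeddings,
# the basic pattern behind prompt tuning for frozen vision-language models.
# Dimensions and names are illustrative; MSCPT adds multi-scale/graph components.
import torch
import torch.nn as nn

class PromptTuner(nn.Module):
    def __init__(self, embed_dim: int = 512, num_prompts: int = 16):
        super().__init__()
        # Only these prompt vectors are trained; the VLM itself stays frozen.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, class_token_embeds: torch.Tensor) -> torch.Tensor:
        # class_token_embeds: (num_classes, L, D) frozen embeddings of class names.
        n = class_token_embeds.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(n, -1, -1)
        return torch.cat([prompts, class_token_embeds], dim=1)  # (num_classes, P+L, D)

tuner = PromptTuner()
frozen_class_embeds = torch.rand(3, 8, 512)  # 3 classes, 8 tokens each (stand-in)
print(tuner(frozen_class_embeds).shape)      # torch.Size([3, 24, 512])
```

The concatenated sequence would then be fed through the frozen text encoder, so that gradients update only the prompt vectors.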


