
Vision Transformer


⚠️ All of the summaries below are generated by a large language model. They may contain errors, are for reference only, and should be used with caution.
🔴 Note: never rely on these summaries in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-10-02

GastroViT: A Vision Transformer Based Ensemble Learning Approach for Gastrointestinal Disease Classification with Grad CAM & SHAP Visualization

Authors:Sumaiya Tabassum, Md. Faysal Ahamed, Hafsa Binte Kibria, Md. Nahiduzzaman, Julfikar Haider, Muhammad E. H. Chowdhury, Mohammad Tariqul Islam

The gastrointestinal (GI) tract of humans can have a wide variety of aberrant mucosal abnormality findings, ranging from mild irritations to extremely fatal illnesses. Prompt identification of gastrointestinal disorders greatly contributes to arresting the progression of the illness and improving therapeutic outcomes. This paper presents an ensemble of pre-trained vision transformers (ViTs) for accurately classifying endoscopic images of the GI tract to categorize gastrointestinal problems and illnesses. ViTs, attention-based neural networks, have revolutionized image recognition by leveraging the transformative power of the transformer architecture, achieving state-of-the-art (SOTA) performance across various visual tasks. The proposed model was evaluated on the publicly available HyperKvasir dataset with 10,662 images of 23 different GI diseases for the purpose of identifying GI tract diseases. An ensemble method is proposed utilizing the predictions of two pre-trained models, MobileViT_XS and MobileViT_V2_200, which achieved accuracies of 90.57% and 90.48%, respectively. All the individual models are outperformed by the ensemble model, GastroViT, with an average precision, recall, F1 score, and accuracy of 69%, 63%, 64%, and 91.98%, respectively, in the first testing that involves 23 classes. The model comprises only 20 million (M) parameters, even without data augmentation and despite the highly imbalanced dataset. For the second testing with 16 classes, the scores are even higher, with average precision, recall, F1 score, and accuracy of 87%, 86%, 87%, and 92.70%, respectively. Additionally, the incorporation of explainable AI (XAI) methods such as Grad-CAM (Gradient Weighted Class Activation Mapping) and SHAP (Shapley Additive Explanations) enhances model interpretability, providing valuable insights for reliable GI diagnosis in real-world settings.
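
The abstract states that GastroViT ensembles the predictions of MobileViT_XS and MobileViT_V2_200 but does not specify the combination rule. The sketch below illustrates one plausible form, simple probability averaging; the timm model names, the 256x256 input size, and the 23-class head are assumptions for illustration, not the paper's released code.

```python
# Minimal sketch of prediction-level ensembling of two pre-trained MobileViT backbones,
# assuming equal-weight averaging of their softmax outputs.
import torch
import timm

NUM_CLASSES = 23  # HyperKvasir, first test setting described in the abstract

model_a = timm.create_model("mobilevit_xs", pretrained=True, num_classes=NUM_CLASSES)
model_b = timm.create_model("mobilevitv2_200", pretrained=True, num_classes=NUM_CLASSES)
model_a.eval()
model_b.eval()

@torch.no_grad()
def ensemble_predict(images: torch.Tensor) -> torch.Tensor:
    """Average the class probabilities of the two backbones and return class indices."""
    probs_a = torch.softmax(model_a(images), dim=1)
    probs_b = torch.softmax(model_b(images), dim=1)
    return (0.5 * probs_a + 0.5 * probs_b).argmax(dim=1)

# Example: a batch of 4 endoscopic images resized to 256x256
dummy = torch.randn(4, 3, 256, 256)
print(ensemble_predict(dummy))
```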


Paper & Project Links

PDF

Summary
The human gastrointestinal (GI) tract can present a wide range of mucosal abnormalities, from mild irritation to extremely severe disease. This work uses pre-trained vision transformers (ViTs) to accurately classify endoscopic images of the GI tract into gastrointestinal conditions and diseases. Two pre-trained models, MobileViT_XS and MobileViT_V2_200, achieve accuracies of 90.57% and 90.48%, respectively. Ensembling their predictions yields GastroViT, which attains higher average precision, recall, F1 score, and accuracy, performing strongly on classifying images of diverse GI diseases. Incorporating explainable AI (XAI) methods such as Grad-CAM and SHAP improves interpretability and supports reliable real-world GI diagnosis.

Key Takeaways

  1. Pre-trained vision transformers (ViTs) are used to accurately classify endoscopic images of the gastrointestinal (GI) tract for diagnosing various GI conditions.
  2. The two models MobileViT_XS and MobileViT_V2_200 achieve high accuracy on GI disease recognition (90.57% and 90.48%, respectively).
  3. The ensemble model GastroViT performs strongly on classifying images of diverse GI diseases, reaching 91.98% accuracy on the 23-class test; on the second test its average precision, recall, F1 score, and accuracy are 87%, 86%, 87%, and 92.70%, respectively.
  4. GastroViT delivers this performance with only about 20 million parameters, even on a highly imbalanced dataset and without data augmentation.

Cool Papers

Click here to view paper screenshots

Zero-Shot Decentralized Federated Learning

Authors:Alessio Masano, Matteo Pennisi, Federica Proietto Salanitri, Concetto Spampinato, Giovanni Bellitto

CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques like CoOp and CoCoOp enhance CLIP’s adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning approaches, such as FedCoOp and FedTPG, improve performance but face generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. We propose Zero-shot Decentralized Federated Learning (ZeroDFL), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL employs an iterative prompt-sharing mechanism, allowing clients to optimize and exchange textual prompts to enhance generalization while drastically reducing communication overhead. We validate ZeroDFL on nine diverse image classification datasets, demonstrating that it consistently outperforms–or remains on par with–state-of-the-art federated prompt learning methods. More importantly, ZeroDFL achieves this performance in a fully decentralized setting while reducing communication overhead by 118x compared to FedTPG. These results highlight that our approach not only enhances generalization in federated zero-shot learning but also improves scalability, efficiency, and privacy preservation–paving the way for decentralized adaptation of large vision-language models in real-world applications.
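
As a rough illustration of the serverless, prompt-only communication pattern the abstract describes, the sketch below has each client take a local optimization step on its textual prompt vectors and then blend them with a peer's in a ring. The ring topology, the 50/50 blending rule, and the dummy local loss are illustrative assumptions, not ZeroDFL's actual algorithm (see the linked repository for the authors' code).

```python
# Illustrative sketch of one decentralized prompt-sharing round: clients optimize their
# prompt vectors locally, then exchange only the prompts (no model weights, no server).
import torch

NUM_CLIENTS, PROMPT_TOKENS, EMBED_DIM = 4, 16, 512

# Each client owns a small set of learnable textual prompt vectors (the only payload communicated).
prompts = [torch.randn(PROMPT_TOKENS, EMBED_DIM, requires_grad=True) for _ in range(NUM_CLIENTS)]

def local_step(prompt: torch.Tensor, lr: float = 0.01) -> torch.Tensor:
    """Placeholder local optimization: one gradient step on a dummy objective.
    In practice this would be a CLIP-based prompt-tuning loss on the client's private data."""
    loss = (prompt ** 2).mean()
    grad, = torch.autograd.grad(loss, prompt)
    return (prompt - lr * grad).detach().requires_grad_(True)

def exchange_round(prompts: list[torch.Tensor]) -> list[torch.Tensor]:
    """Each client sends its prompts to the next client in a ring and blends what it receives."""
    received = [prompts[(i - 1) % NUM_CLIENTS] for i in range(NUM_CLIENTS)]
    return [(0.5 * own + 0.5 * other).detach().requires_grad_(True)
            for own, other in zip(prompts, received)]

for round_idx in range(3):  # a few communication rounds
    prompts = [local_step(p) for p in prompts]
    prompts = exchange_round(prompts)
```

Because only the prompt tensors (here 16x512 floats per client) are ever transmitted, communication cost is tiny compared with exchanging model weights, which is the design point behind the reported overhead reduction.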


Paper & Project Links

PDF Accepted at International Joint Conference on Neural Networks (IJCNN) 2025. Code available at https://github.com/perceivelab/ZeroDFL

Summary

CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning. While prompting techniques such as CoOp and CoCoOp improve CLIP's adaptability, their effectiveness in Federated Learning (FL) remains an open challenge. Existing federated prompt learning methods such as FedCoOp and FedTPG improve performance but suffer from generalization issues, high communication costs, and reliance on a central server, limiting scalability and privacy. This paper proposes ZeroDFL (Zero-shot Decentralized Federated Learning), a fully decentralized framework that enables zero-shot adaptation across distributed clients without a central coordinator. ZeroDFL uses an iterative prompt-sharing mechanism that lets clients optimize and exchange textual prompts to improve generalization while greatly reducing communication overhead. Experiments on nine diverse image classification datasets show that ZeroDFL matches or outperforms state-of-the-art federated prompt learning methods while operating in a fully decentralized setting and cutting communication overhead by 118x relative to FedTPG. These results show that ZeroDFL improves generalization in federated zero-shot learning along with scalability, efficiency, and privacy preservation, paving the way for decentralized adaptation of large vision-language models in real-world applications.

Key Takeaways

  1. CLIP has revolutionized zero-shot learning by enabling task generalization without fine-tuning.
  2. Existing federated prompt learning methods face generalization issues, high communication costs, and reliance on a central server.
  3. The proposed ZeroDFL framework enables decentralized zero-shot learning without a central coordinator.
  4. ZeroDFL uses an iterative prompt-sharing mechanism to optimize and exchange textual prompts, improving generalization while reducing communication overhead.
  5. ZeroDFL is validated on nine image classification datasets, matching or outperforming state-of-the-art federated prompt learning methods.
  6. ZeroDFL achieves this in a fully decentralized setting while drastically reducing communication overhead.

Cool Papers

Click here to view paper screenshots

Self-Supervised Anatomical Consistency Learning for Vision-Grounded Medical Report Generation

Authors:Longzhen Yang, Zhangkai Ni, Ying Wen, Yihang Liu, Lianghua He, Heng Tao Shen

Vision-grounded medical report generation aims to produce clinically accurate descriptions of medical images, anchored in explicit visual evidence to improve interpretability and facilitate integration into clinical workflows. However, existing methods often rely on separately trained detection modules that require extensive expert annotations, introducing high labeling costs and limiting generalizability due to pathology distribution bias across datasets. To address these challenges, we propose Self-Supervised Anatomical Consistency Learning (SS-ACL) – a novel and annotation-free framework that aligns generated reports with corresponding anatomical regions using simple textual prompts. SS-ACL constructs a hierarchical anatomical graph inspired by the invariant top-down inclusion structure of human anatomy, organizing entities by spatial location. It recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, inherently guiding attention maps toward visually relevant areas prompted by text. To further enhance inter-sample semantic alignment for abnormality recognition, SS-ACL introduces a region-level contrastive learning based on anatomical consistency. These aligned embeddings serve as priors for report generation, enabling attention maps to provide interpretable visual evidence. Extensive experiments demonstrate that SS-ACL, without relying on expert annotations, (i) generates accurate and visually grounded reports – outperforming state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and (ii) achieves competitive performance on various downstream visual tasks, surpassing current leading visual foundation models by 8% in zero-shot visual grounding.
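
The abstract mentions region-level contrastive learning based on anatomical consistency. A minimal sketch of one plausible form is given below, where embeddings of the same anatomical region from two different images form positive pairs in a symmetric InfoNCE loss; the loss form, temperature, and batch layout are assumptions, not the paper's exact objective.

```python
# Sketch of region-level contrastive alignment: matching anatomical regions across images
# are pulled together, mismatched regions are pushed apart (symmetric InfoNCE).
import torch
import torch.nn.functional as F

def region_contrastive_loss(regions_a: torch.Tensor,
                            regions_b: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """regions_a, regions_b: (num_regions, dim) embeddings of the same ordered set of
    anatomical regions taken from two different images; row i of each is a positive pair."""
    a = F.normalize(regions_a, dim=1)
    b = F.normalize(regions_b, dim=1)
    logits = a @ b.t() / temperature      # (R, R) cosine-similarity matrix
    targets = torch.arange(a.size(0))     # positives lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example with 12 anatomical regions and 256-dimensional embeddings
loss = region_contrastive_loss(torch.randn(12, 256), torch.randn(12, 256))
print(loss.item())
```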


Paper & Project Links

PDF

Summary

This paper addresses the challenges of vision-grounded medical report generation. Existing methods rely on separately trained detection modules that require extensive annotation, incurring high labeling costs and limited generalizability due to pathology distribution bias across datasets. To address this, the authors propose Self-Supervised Anatomical Consistency Learning (SS-ACL), an annotation-free framework that aligns generated reports with the corresponding anatomical regions using simple textual prompts. SS-ACL builds a hierarchical anatomical graph that organizes entities by spatial location and recursively reconstructs fine-grained anatomical regions to enforce intra-sample spatial alignment, guiding attention maps toward the visually relevant areas prompted by text. It further introduces region-level contrastive learning based on anatomical consistency to strengthen inter-sample semantic alignment for abnormality recognition. The aligned embeddings serve as priors for report generation, so attention maps provide interpretable visual evidence. Experiments show that SS-ACL, without expert annotations, outperforms state-of-the-art methods by 10% in lexical accuracy and 25% in clinical efficacy, and surpasses current leading visual foundation models by 8% in zero-shot visual grounding on downstream tasks.

Key Takeaways

  1. Existing medical report generation methods face high annotation costs and limited generalizability.
  2. SS-ACL is an annotation-free, self-supervised framework that aligns generated reports with anatomical regions using simple textual prompts.
  3. SS-ACL builds a hierarchical anatomical graph to enforce intra-sample spatial alignment and guide attention maps toward visually relevant regions.
  4. SS-ACL uses region-level contrastive learning to strengthen inter-sample semantic alignment for abnormality recognition.
  5. The aligned embeddings serve as priors for report generation, providing interpretable visual evidence.
  6. SS-ACL outperforms existing methods in both lexical accuracy and clinical efficacy.

Cool Papers

Click here to view paper screenshots

AttentionViG: Cross-Attention-Based Dynamic Neighbor Aggregation in Vision GNNs

Authors:Hakan Emre Gedik, Andrew Martin, Mustafa Munir, Oguzhan Baser, Radu Marculescu, Sandeep P. Chinchali, Alan C. Bovik

Vision Graph Neural Networks (ViGs) have demonstrated promising performance in image recognition tasks against Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). An essential part of the ViG framework is the node-neighbor feature aggregation method. Although various graph convolution methods, such as Max-Relative, EdgeConv, GIN, and GraphSAGE, have been explored, a versatile aggregation method that effectively captures complex node-neighbor relationships without requiring architecture-specific refinements is needed. To address this gap, we propose a cross-attention-based aggregation method in which the query projections come from the node, while the key projections come from its neighbors. Additionally, we introduce a novel architecture called AttentionViG that uses the proposed cross-attention aggregation scheme to conduct non-local message passing. We evaluated the image recognition performance of AttentionViG on the ImageNet-1K benchmark, where it achieved SOTA performance. Additionally, we assessed its transferability to downstream tasks, including object detection and instance segmentation on MS COCO 2017, as well as semantic segmentation on ADE20K. Our results demonstrate that the proposed method not only achieves strong performance, but also maintains efficiency, delivering competitive accuracy with comparable FLOPs to prior vision GNN architectures.
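
The abstract describes the aggregation scheme concretely enough to sketch: the query is projected from the center node and the keys from its neighbors, and the node feature is updated with the attention-weighted neighbor sum. The single-head form, the value projection from neighbors, and all dimensions below are illustrative assumptions rather than the paper's exact module.

```python
# Sketch of cross-attention-based node-neighbor aggregation for a vision GNN:
# queries come from each node, keys and values from its K nearest neighbors.
import torch
import torch.nn as nn

class CrossAttentionAggregation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)   # projection applied to the center node
        self.k_proj = nn.Linear(dim, dim)   # projection applied to its neighbors
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, node: torch.Tensor, neighbors: torch.Tensor) -> torch.Tensor:
        """node: (N, dim) features of N graph nodes; neighbors: (N, K, dim) features of
        the K neighbors of each node. Returns aggregated node features of shape (N, dim)."""
        q = self.q_proj(node).unsqueeze(1)                                  # (N, 1, dim)
        k = self.k_proj(neighbors)                                          # (N, K, dim)
        v = self.v_proj(neighbors)                                          # (N, K, dim)
        attn = torch.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)  # (N, 1, K)
        return (attn @ v).squeeze(1)                                        # (N, dim)

# Example: 196 patch nodes with 192-dim features and 9 neighbors each
agg = CrossAttentionAggregation(dim=192)
out = agg(torch.randn(196, 192), torch.randn(196, 9, 192))
```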


Paper & Project Links

PDF WACV submission. 13 pages, including the main text (8 pages), references, and supplementary material

Summary

Vision Graph Neural Networks (ViGs) perform promisingly on image recognition tasks compared with CNNs and ViTs. This paper proposes a cross-attention-based node-neighbor feature aggregation method and introduces a new architecture, AttentionViG, that uses this scheme for non-local message passing. AttentionViG achieves SOTA performance on the ImageNet-1K benchmark and transfers well to downstream tasks, including object detection and instance segmentation on MS COCO 2017 and semantic segmentation on ADE20K. The method is both strong and efficient, delivering competitive accuracy with FLOPs comparable to prior vision GNN architectures.

Key Takeaways

  1. Vision Graph Neural Networks (ViGs) show promising performance on image recognition tasks.
  2. Node-neighbor feature aggregation is a core component of the ViG framework.
  3. Existing graph convolution methods (e.g., Max-Relative, EdgeConv, GIN, GraphSAGE) still lack a versatile aggregation scheme that captures complex node-neighbor relationships.
  4. A cross-attention-based aggregation method is proposed, with query projections from the node and key projections from its neighbors.
  5. The new AttentionViG architecture uses this scheme for non-local message passing.
  6. AttentionViG achieves SOTA performance on the ImageNet-1K benchmark.

Cool Papers

Click here to view paper screenshots

MIAFEx: An Attention-based Feature Extraction Method for Medical Image Classification

Authors:Oscar Ramos-Soto, Jorge Ramos-Frutos, Ezequiel Perez-Zarate, Diego Oliva, Sandra E. Balderas-Mata

Feature extraction techniques are crucial in medical image classification; however, classical feature extractors, in addition to traditional machine learning classifiers, often exhibit significant limitations in providing sufficient discriminative information for complex image sets. While Convolutional Neural Networks (CNNs) and Vision Transformer (ViT) have shown promise in feature extraction, they are prone to overfitting due to the inherent characteristics of medical imaging data, including small sample sizes or high intra-class variance. In this work, the Medical Image Attention-based Feature Extractor (MIAFEx) is proposed, a novel method that employs a learnable refinement mechanism to enhance the classification token within the Transformer encoder architecture. This mechanism adjusts the token based on learned weights, improving the extraction of salient features and enhancing the model’s adaptability to the challenges presented by medical imaging data. The MIAFEx output feature quality is compared against classical feature extractors using traditional and hybrid classifiers. Also, the performance of these features is compared against modern CNN and ViT models in classification tasks, demonstrating their superiority in accuracy and robustness across multiple complex medical imaging datasets. This advantage is particularly pronounced in scenarios with limited training data, where traditional and modern models often struggle to generalize effectively. The source code of this proposal can be found at https://github.com/Oscar-RamosS/Medical-Image-Attention-based-Feature-Extractor-MIAFEx
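
Based only on the abstract's wording, the refinement mechanism "adjusts the token based on learned weights"; one simple reading is an element-wise re-weighting of the classification token, sketched below. The gating form and shapes are assumptions, not the released MIAFEx code (see the linked repository for the authors' implementation).

```python
# Sketch of a learnable refinement of the [CLS] token: the classification token produced
# by a Transformer encoder is re-weighted by learned per-dimension parameters before
# being used as the extracted feature for a downstream classifier.
import torch
import torch.nn as nn

class RefinedClsExtractor(nn.Module):
    def __init__(self, encoder: nn.Module, dim: int):
        super().__init__()
        self.encoder = encoder                                # any ViT-style encoder returning tokens
        self.refine_weights = nn.Parameter(torch.ones(dim))   # learned refinement weights (assumed form)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(images)      # assumed output shape: (batch, num_tokens, dim)
        cls_token = tokens[:, 0]           # (batch, dim) classification token
        return cls_token * self.refine_weights  # refined feature fed to a traditional/hybrid classifier
```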


Paper & Project Links

PDF This is the preprint version of an article that has been accepted for publication in Knowledge-Based Systems

Summary

This paper highlights the importance of feature extraction in medical image classification. To overcome the limited discriminative information that classical feature extractors and traditional machine learning classifiers provide on complex image sets, the authors propose the Medical Image Attention-based Feature Extractor (MIAFEx). The method uses a learnable refinement mechanism that adjusts the classification token according to learned weights, improving the extraction of salient features and the model's adaptability to the challenges of medical imaging data. Experiments show that MIAFEx outperforms traditional and modern models across multiple complex medical imaging datasets, with the advantage especially pronounced when training data is limited.

Key Takeaways

  1. Classical feature extractors and traditional machine learning classifiers are limited when handling complex medical image sets.
  2. CNNs and Vision Transformers (ViTs) show promise for feature extraction but are prone to overfitting due to the small sample sizes and high intra-class variance of medical imaging data.
  3. The proposed Medical Image Attention-based Feature Extractor (MIAFEx) applies a learnable refinement mechanism to the classification token.
  4. MIAFEx adjusts the token based on learned weights, improving the extraction of salient features and the model's adaptability to medical imaging data.
  5. MIAFEx outperforms traditional and modern models on multiple medical imaging datasets.
  6. The advantage is especially pronounced when training data is limited.

Cool Papers

Click here to view paper screenshots


Post author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!