⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-11-05 更新
Vision Transformer for Robust Occluded Person Reidentification in Complex Surveillance Scenes
Authors:Bo Li, Duyuan Zheng, Xinyang Liu, Qingwen Li, Hong Li, Hongyan Cui, Ge Gao, Chen Liu
Person re-identification (ReID) in surveillance is challenged by occlusion, viewpoint distortion, and poor image quality. Most existing methods rely on complex modules or perform well only on clear frontal images. We propose Sh-ViT (Shuffling Vision Transformer), a lightweight and robust model for occluded person ReID. Built on ViT-Base, Sh-ViT introduces three components: First, a Shuffle module in the final Transformer layer to break spatial correlations and enhance robustness to occlusion and blur; Second, scenario-adapted augmentation (geometric transforms, erasing, blur, and color adjustment) to simulate surveillance conditions; Third, DeiT-based knowledge distillation to improve learning with limited labels. To support real-world evaluation, we construct the MyTT dataset, containing over 10,000 pedestrians and 30,000+ images from base station inspections, with frequent equipment occlusion and camera variations. Experiments show that Sh-ViT achieves 83.2% Rank-1 and 80.1% mAP on MyTT, outperforming CNN and ViT baselines, and 94.6% Rank-1 and 87.5% mAP on Market1501, surpassing state-of-the-art methods. In summary, Sh-ViT improves robustness to occlusion and blur without external modules, offering a practical solution for surveillance-based personnel monitoring.
监控场景下的行人再识别(ReID)面临遮挡、视角畸变和图像质量差等挑战。大多数现有方法依赖于复杂的模块,或者仅在清晰的正面图像上表现良好。我们提出了Sh-ViT(Shuffling Vision Transformer),这是一种用于遮挡行人ReID的轻量级且稳健的模型。Sh-ViT基于ViT-Base构建,引入了三个组件:首先,在最后一个Transformer层中加入Shuffle模块,打破空间相关性,增强对遮挡和模糊的鲁棒性;其次,采用场景适配的数据增强(几何变换、擦除、模糊和色彩调整)来模拟监控条件;第三,基于DeiT的知识蒸馏,以改善有限标签下的学习。为了支持真实世界评估,我们构建了MyTT数据集,包含超过10,000名行人和来自基站巡检的30,000多张图像,具有频繁的设备遮挡和相机变化。实验表明,Sh-ViT在MyTT上实现了83.2%的Rank-1和80.1%的mAP,超过了CNN和ViT基线;在Market1501上实现了94.6%的Rank-1和87.5%的mAP,超越了最先进的方法。总的来说,Sh-ViT在不依赖外部模块的情况下提高了对遮挡和模糊的鲁棒性,为基于监控视频的人员监测提供了实用的解决方案。
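下面给出一个按摘要描述写的极简示意,说明“在最后一个Transformer层中打乱patch token”这一思路;模块放置位置、是否保留CLS token等细节均为假设,并非论文官方实现:

```python
import torch
import torch.nn as nn

class TokenShuffle(nn.Module):
    """训练时随机打乱 patch token 的顺序(保留 CLS token),用于打破空间相关性;
    推理时不打乱。仅为按摘要思路写的假设性示意,非论文官方实现。"""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, D),约定第 0 个 token 为 CLS
        if not self.training:
            return x
        cls_tok, patches = x[:, :1], x[:, 1:]
        perm = torch.randperm(patches.size(1), device=x.device)
        return torch.cat([cls_tok, patches[:, perm]], dim=1)

# 用法示意:插入到最后一个 Transformer block 之前
tokens = torch.randn(2, 197, 768)   # ViT-Base 的典型 token 形状
module = TokenShuffle().train()
print(module(tokens).shape)         # torch.Size([2, 197, 768])
```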
论文及项目相关链接
PDF 12 pages, conference
Summary
Sh-ViT是一种针对遮挡行人再识别问题的轻量级且稳健的模型。它通过引入Shuffle模块、场景适应性增强和DeiT知识蒸馏技术,提高了对遮挡和模糊的鲁棒性。在MyTT数据集上的实验表明,Sh-ViT在行人再识别任务中取得了优异性能,并实现了在实际监控场景中的有效应用。
Key Takeaways
- Sh-ViT模型针对遮挡行人再识别问题进行了优化。
- Sh-ViT引入了Shuffle模块,打破了空间相关性,增强了模型对遮挡和模糊的鲁棒性。
- 场景适应性增强技术通过模拟监控条件,提高了模型的泛化能力。
- DeiT知识蒸馏技术用于提高模型在有限标签下的学习能力。
- MyTT数据集的构建,支持了模型在现实世界的评估。
- 实验结果表明,Sh-ViT在MyTT数据集上取得了83.2%的Rank-1准确率和80.1%的mAP,超越了CNN和ViT基线。
点此查看论文截图
CoMViT: An Efficient Vision Backbone for Supervised Classification in Medical Imaging
Authors:Aon Safdar, Mohamed Saadeldin
Vision Transformers (ViTs) have demonstrated strong potential in medical imaging; however, their high computational demands and tendency to overfit on small datasets limit their applicability in real-world clinical scenarios. In this paper, we present CoMViT, a compact and generalizable Vision Transformer architecture optimized for resource-constrained medical image analysis. CoMViT integrates a convolutional tokenizer, diagonal masking, dynamic temperature scaling, and pooling-based sequence aggregation to improve performance and generalization. Through systematic architectural optimization, CoMViT achieves robust performance across twelve MedMNIST datasets while maintaining a lightweight design with only ~4.5M parameters. It matches or outperforms deeper CNN and ViT variants, offering up to 5-20x parameter reduction without sacrificing accuracy. Qualitative Grad-CAM analyses show that CoMViT consistently attends to clinically relevant regions despite its compact size. These results highlight the potential of principled ViT redesign for developing efficient and interpretable models in low-resource medical imaging settings.
视觉Transformer(ViT)在医学成像领域表现出了强大的潜力。然而,它们计算量大且在小数据集上容易过度拟合,限制了它们在现实世界临床场景中的应用。在本文中,我们提出了CoMViT,这是一种针对资源受限医学图像分析而优化的紧凑且通用的视觉Transformer架构。CoMViT集成了卷积标记器、对角掩码、动态温度缩放和基于池化的序列聚合,以提高性能和通用性。通过系统的架构优化,CoMViT在十二个MedMNIST数据集上实现了稳健的性能,同时保持轻量级设计,仅有~450万参数。它匹配或超越了更深的CNN和ViT变体,在不影响精度的情况下实现了高达5-20倍的参数减少。定性Grad-CAM分析表明,尽管CoMViT尺寸紧凑,但它始终关注临床相关区域。这些结果突出了基于原则重新设计的ViT在低资源医学成像环境中开发高效和可解释模型的潜力。
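摘要提到的“对角掩码”和“动态温度缩放”可以用下面这个单头注意力的极简示意来理解(假设性写法,仅说明概念,具体结构以论文为准):屏蔽注意力矩阵的对角线,使 token 不只关注自身,并用一个可学习温度缩放 logits。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiagMaskedAttention(nn.Module):
    """带对角掩码与可学习温度的单头自注意力示意,非论文官方实现。"""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.log_temp = nn.Parameter(torch.zeros(1))   # “动态”温度:可学习参数

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / (D ** 0.5)
        attn = attn * torch.exp(self.log_temp)          # 温度缩放
        diag = torch.eye(N, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(diag, float("-inf"))    # 对角掩码:屏蔽自注意
        return self.proj(F.softmax(attn, dim=-1) @ v)

x = torch.randn(2, 64, 128)
print(DiagMaskedAttention(128)(x).shape)  # torch.Size([2, 64, 128])
```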
论文及项目相关链接
PDF Preprint (submitted manuscript). Accepted at the MICCAI 2025 MIRASOL Workshop; to appear in the Springer proceedings volume. This is the pre-review version (not the Version of Record). DOI will be added after publication. [Optional: 8 pages, 4 figures, 4 tables.]
Summary
本文提出一种适用于资源受限医学图像分析的紧凑通用型视觉Transformer架构——CoMViT。它通过整合卷积分词器、对角掩码、动态温度缩放和基于池化的序列聚合等技术,优化了性能并提高了泛化能力。CoMViT在十二个MedMNIST数据集上实现了稳健的性能,同时保持轻量级设计,仅约4.5M参数。它匹配或优于较深的CNN和ViT变体,提供了高达5-20倍的参数缩减,同时不损失精度。这些结果突出了有原则地重新设计ViT、在低资源医学成像环境中开发高效且可解释模型的潜力。
Key Takeaways
- CoMViT是一种针对资源受限医学图像分析优化的紧凑通用型Vision Transformer架构。
- CoMViT通过整合多种技术(卷积分词器、对角掩码等)提高了性能和泛化能力。
- CoMViT在多个医学图像数据集上实现了稳健性能,参数数量相对较少(仅约4.5M)。
- CoMViT与深度CNN和ViT变体相比,在参数效率上有所突破,实现了高参数缩减率(5-20倍)。
- CoMViT在保证准确性的同时,降低了计算需求,适用于资源受限的环境。
- 定性Grad-CAM分析显示,CoMViT能够关注临床上相关的区域,即使在其紧凑大小的情况下也如此。
点此查看论文截图
Sparse Model Inversion: Efficient Inversion of Vision Transformers for Data-Free Applications
Authors:Zixuan Hu, Yongxian Wei, Li Shen, Zhenyi Wang, Lei Li, Chun Yuan, Dacheng Tao
Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations–a phenomenon we term “hallucination” in model inversion. To address these limitations, we propose a novel sparse model inversion strategy, as a plug-and-play extension to speed up existing dense inversion methods with no need for modifying their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to 3.79 faster) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
模型反演旨在从预训练的判别模型中重构原始训练数据,在由于隐私、使用权或规模限制等原因导致原始训练数据不可用的情况下尤其有用。然而,现有的密集反演方法试图重构整个图像区域,在从大规模视觉Transformer(ViT)反演高分辨率图像时效率极低。我们进一步识别出这种低效率的两个根本原因:对嘈杂背景的冗余反演,以及对虚假关联的意外反演——我们将后者称为模型反演中的“幻觉”现象。为了解决这些局限性,我们提出了一种新的稀疏模型反演策略,作为一种即插即用扩展,可以加速现有的密集反演方法,而无需修改其原始损失函数。具体来说,我们有选择地反演语义前景,同时停止对嘈杂背景和潜在虚假关联的反演。通过理论研究和实证研究,我们验证了该方法在实现显著的反演加速(最高达3.79倍)的同时,在无数据模型量化和无数据知识迁移中保持相当甚至更好的下游性能。代码可通过以下网址获取:https://github.com/Egg-Hu/SMI。
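“只反演语义前景、停止反演背景”的思路可以用梯度掩码来粗略示意:对被优化的合成图像按 patch 施加一个二值掩码,被判定为背景的 patch 不再更新。下面是一个假设性的示意,其中损失函数与前景区域均为占位,并非论文的判别准则或官方实现:

```python
import torch

# 假设以 16x16 patch 为单位做稀疏反演
B, C, H, W, P = 1, 3, 224, 224, 16
x = torch.randn(B, C, H, W, requires_grad=True)          # 待反演的合成图像
keep = torch.zeros(H // P, W // P, dtype=torch.bool)
keep[3:11, 3:11] = True                                   # 占位:假定中心 patch 为“语义前景”
mask = keep.repeat_interleave(P, 0).repeat_interleave(P, 1).float()

opt = torch.optim.Adam([x], lr=0.1)
for step in range(10):
    opt.zero_grad()
    loss = x.pow(2).mean()                                # 占位损失,实际应为模型反演目标
    loss.backward()
    x.grad *= mask                                        # 背景 patch 梯度置零,停止反演
    opt.step()
print("仍在更新的像素比例:", mask.mean().item())
```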
论文及项目相关链接
摘要
针对预训练判别模型,提出一种新型稀疏模型反演策略,用于高效地从大规模视觉Transformer(ViT)中反演高分辨率图像。该策略通过只反演语义前景、停止反演嘈杂背景和潜在虚假关联,解决了现有密集反演方法效率低下的问题,实现了显著的反演加速(最高可达3.79倍),同时在无数据模型量化、无数据知识迁移等下游任务中保持甚至提升性能。代码已公开于GitHub。
要点分析
- 模型反演:该技术用于从预训练判别模型中重建原始训练数据,在因隐私保护或数据量限制而无法获取原始数据时尤其有用。但现有的密集反演方法尝试重建整个图像区域,对于大规模视觉Transformer(ViT)来说效率极低。
- 效率低下的原因:对嘈杂背景的冗余反演,以及对虚假关联的意外反演(被称为“幻觉”),使得反演过程缓慢且结果质量不佳。
- 稀疏模型反演策略:作为一种即插即用扩展,能够加速现有密集反演方法,无需修改其原始损失函数。该策略选择性地反演语义前景,同时停止反演嘈杂背景和潜在虚假关联。这种策略显著提高了反演速度,同时保持了或提高了在无数据模型量化、无数据知识迁移等下游任务中的性能。这提供了一个高效且实用的解决方案来解决现有模型反演的局限性。
点此查看论文截图
SteerVLM: Robust Model Control through Lightweight Activation Steering for Vision Language Models
Authors:Anushka Sivakumar, Andrew Zhang, Zaber Hakim, Chris Thomas
This work introduces SteerVLM, a lightweight steering module designed to guide Vision-Language Models (VLMs) towards outputs that better adhere to desired instructions. Our approach learns from the latent embeddings of paired prompts encoding target and converse behaviors to dynamically adjust activations connecting the language modality with image context. This allows for fine-grained, inference-time control over complex output semantics without modifying model weights while preserving performance on off-target tasks. Our steering module requires learning parameters equal to 0.14% of the original VLM’s size. Our steering module gains model control through dimension-wise activation modulation and adaptive steering across layers without requiring pre-extracted static vectors or manual tuning of intervention points. Furthermore, we introduce VNIA (Visual Narrative Intent Alignment), a multimodal dataset specifically created to facilitate the development and evaluation of VLM steering techniques. Our method outperforms existing intervention techniques on steering and hallucination mitigation benchmarks for VLMs and proposes a robust solution for multimodal model control through activation engineering.
本文介绍了SteerVLM,这是一个轻量级的引导模块,旨在引导视觉语言模型(VLM)产生更符合目标指令的输出。我们的方法从编码目标行为与相反行为的成对提示的潜在嵌入中学习,动态调整连接语言模态与图像上下文的激活。这使得在不修改模型权重的情况下,可以在推理时对复杂的输出语义进行细粒度控制,同时保持非目标任务上的性能。我们的引导模块所需学习的参数仅为原始VLM大小的0.14%。该模块通过按维度的激活调制和跨层的自适应引导来实现模型控制,无需预先提取的静态向量或手动调整干预点。此外,我们还引入了VNIA(视觉叙事意图对齐)数据集,该数据集是专门为促进VLM引导技术的发展和评估而创建的。我们的方法在VLM的引导和幻觉缓解基准测试上优于现有的干预技术,并通过激活工程为多模态模型控制提供了稳健的解决方案。
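“按维度的激活调制”可以理解为对某一层的隐藏状态做逐维缩放与平移。下面用 forward hook 给出一个假设性的极简示意(层与调制公式均为占位,并非 SteerVLM 的真实结构),核心是不修改原模型权重、只在推理时干预激活:

```python
import torch
import torch.nn as nn

class SteeringModule(nn.Module):
    """逐维激活调制示意:h' = h * (1 + tanh(gate)) + shift,仅为占位结构。"""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(dim))
        self.shift = nn.Parameter(torch.zeros(dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h * (1.0 + torch.tanh(self.gate)) + self.shift

backbone_layer = nn.Linear(768, 768)      # 占位:代表 VLM 中连接语言与图像上下文的某一层
steer = SteeringModule(768)

def hook(module, inputs, output):
    return steer(output)                   # 返回值会替换该层的输出

backbone_layer.register_forward_hook(hook)
out = backbone_layer(torch.randn(2, 16, 768))
print(out.shape)  # torch.Size([2, 16, 768])
```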
论文及项目相关链接
摘要
本文提出SteerVLM,一个轻量级的引导模块,旨在引导视觉语言模型(VLM)生成更符合指令要求的输出。SteerVLM通过学习目标行为与相反行为成对提示的潜在嵌入,动态调整连接语言模态与图像上下文的激活,从而在不修改模型权重的情况下实现推理时的细粒度控制,同时不影响模型在非目标任务上的性能。该模块所需的学习参数仅为原始VLM规模的0.14%。通过按维度的激活调制和跨层自适应引导,SteerVLM无需预先提取的静态向量或手动调整干预点即可实现模型控制。此外,本文还引入了VNIA(视觉叙事意图对齐)数据集,专为促进VLM引导技术的发展和评估而设计。在引导和幻觉缓解的基准测试中,我们的方法优于现有的干预技术,并通过激活工程为多模态模型控制提供了稳健的解决方案。
关键见解
- SteerVLM是一个轻量级模块,旨在引导视觉语言模型(VLMs)以符合指令的方式输出。
- 模块通过学习配对提示的潜在嵌入来动态调整激活。
- 实现推理时的细粒度控制,且不影响模型在非目标任务上的性能。
- 控制模块所需的学习参数仅为原始VLM规模的极小部分(0.14%)。
- 模块通过按维度的激活调制和跨层自适应引导实现模型控制,无需额外的静态向量或手动调整干预点。
- 引入了VNIA数据集,用于发展和评估视觉语言模型的控制技术。
- 该方法在控制和缓解幻觉的基准测试中表现出优于现有技术的性能。
点此查看论文截图
Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI Detection
Authors:Chanhyeong Yang, Taehoon Song, Jihwan Park, Hyunwoo J. Kim
Zero-shot Human-Object Interaction detection aims to localize humans and objects in an image and recognize their interaction, even when specific verb-object pairs are unseen during training. Recent works have shown promising results using prompt learning with pretrained vision-language models such as CLIP, which align natural language prompts with visual features in a shared embedding space. However, existing approaches still fail to handle the visual complexity of interaction, including (1) intra-class visual diversity, where instances of the same verb appear in diverse poses and contexts, and (2) inter-class visual entanglement, where distinct verbs yield visually similar patterns. To address these challenges, we propose VDRP, a framework for Visual Diversity and Region-aware Prompt learning. First, we introduce a visual diversity-aware prompt learning strategy that injects group-wise visual variance into the context embedding. We further apply Gaussian perturbation to encourage the prompts to capture diverse visual variations of a verb. Second, we retrieve region-specific concepts from the human, object, and union regions. These are used to augment the diversity-aware prompt embeddings, yielding region-aware prompts that enhance verb-level discrimination. Experiments on the HICO-DET benchmark demonstrate that our method achieves state-of-the-art performance under four zero-shot evaluation settings, effectively addressing both intra-class diversity and inter-class visual entanglement. Code is available at https://github.com/mlvlab/VDRP.
零样本人-物交互(HOI)检测旨在定位图像中的人和物体并识别其交互,即使训练过程中未见过特定的动词-物体对。近期的研究显示,使用预训练的视觉语言模型(如CLIP)进行提示学习很有前景,这类模型在共享嵌入空间中对齐自然语言提示和视觉特征。然而,现有方法仍无法处理交互的视觉复杂性,包括(1)类内视觉多样性,即同一动词的实例以不同的姿势和上下文出现;(2)类间视觉纠缠,即不同的动词产生视觉上相似的模式。为了应对这些挑战,我们提出了VDRP,一个视觉多样性和区域感知提示学习框架。首先,我们引入了一种视觉多样性感知的提示学习策略,将分组视觉方差注入上下文嵌入,并进一步施加高斯扰动,以鼓励提示捕捉动词的多种视觉变化。其次,我们从人、物体和联合区域检索区域特定概念,用于增强多样性感知的提示嵌入,得到能够提升动词级别判别能力的区域感知提示。在HICO-DET基准测试上的实验表明,我们的方法在四种零样本评估设置下达到了最先进的性能,有效地解决了类内多样性和类间视觉纠缠问题。代码可访问https://github.com/mlvlab/VDRP获取。
论文及项目相关链接
PDF Accepted by NeurIPS 2025
Summary
在零样本人-物交互检测中,使用预训练的视觉语言模型(如CLIP)进行提示学习展现出良好效果。然而,现有方法面临交互的视觉复杂性挑战,如类内视觉多样性与类间视觉纠缠。为解决这些问题,我们提出了VDRP框架,通过视觉多样性感知和区域感知的提示学习提升表现:先在上下文嵌入中注入分组视觉方差并施加高斯扰动,使提示能够捕捉同一动词的多样视觉变化;再从人、物体及其联合区域检索区域级概念来增强提示嵌入,得到区域感知提示,显著提升了动词级别的判别能力。代码已在GitHub上公开。
Key Takeaways
- 零样本人-物交互(HOI)检测旨在定位图像中的人与物体并识别其交互。
- CLIP等预训练的视觉语言模型在零样本检测中展现出潜力。
- 当前方法面临类内视觉多样性和类间视觉纠缠两大挑战。
点此查看论文截图
Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning
Authors:Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski
Remote sensing applications increasingly rely on deep learning for scene classification. However, their performance is often constrained by the scarcity of labeled data and the high cost of annotation across diverse geographic and sensor domains. While recent vision-language models like CLIP have shown promise by learning transferable representations at scale by aligning visual and textual modalities, their direct application to remote sensing remains suboptimal due to significant domain gaps and the need for task-specific semantic adaptation. To address this critical challenge, we systematically explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We evaluate several representative methods, including Context Optimization, Conditional Context Optimization, Multi-modal Prompt Learning, and Prompting with Self-Regulating Constraints. These approaches reflect complementary design philosophies: from static context optimization to conditional prompts for enhanced generalization, multi-modal prompts for joint vision-language adaptation, and semantically regularized prompts for stable learning without forgetting. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Through extensive experiments on multiple benchmark remote sensing datasets, including cross-dataset generalization tests, we demonstrate that prompt learning consistently outperforms both baselines in few-shot scenarios. Notably, Prompting with Self-Regulating Constraints achieves the most robust cross-domain performance. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery, providing a strong foundation for future research in this field.
遥感应用越来越依赖深度学习进行场景分类。然而,其性能往往受到标注数据稀缺和跨不同地理和传感器领域标注成本高昂的限制。虽然最近的视觉语言模型(如CLIP)通过视觉和文本模态的对齐,在大规模学习可转移表示方面显示出前景,但它们直接应用于遥感仍然不尽人意,这主要是由于领域差距显著以及需要任务特定的语义适应。为了应对这一关键挑战,我们系统地探索了提示学习作为一种轻量级、高效的适应策略,用于解决遥感图像场景分类的小样本问题。我们评估了几种代表性的方法,包括上下文优化、条件上下文优化、多模态提示学习和带有自我调节约束的提示。这些方法反映了互补的设计哲学:从静态上下文优化到增强泛化的条件提示,用于联合视觉语言适应的多模态提示,以及用于稳定学习的语义正则化提示,而不会出现遗忘现象。我们将这些提示学习方法与两种标准基线进行了比较:使用手工制作的提示进行零样本CLIP,以及使用冻结的CLIP特征进行线性探针训练。通过多个基准遥感数据集上的广泛实验,包括跨数据集泛化测试,我们证明了在少样本场景下,提示学习始终优于这两种基线。值得注意的是,“带有自我调节约束的提示”实现了最稳健的跨域性能。我们的研究强调,提示学习是弥合卫星和航空影像领域差距的可扩展和高效解决方案,为这一领域未来的研究提供了坚实的基础。
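文中评估的“上下文优化”(CoOp)一类方法,其核心是把提示里的上下文词换成可学习向量,与类别名嵌入拼接后经文本编码器得到类别特征,再与图像特征做相似度分类。下面是一个脱离真实 CLIP 接口的极简示意(文本编码器与类别嵌入均为占位模块,仅说明思路):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """CoOp 风格示意:[ctx_1 ... ctx_M, class_emb] -> 文本特征;编码器为占位。"""

    def __init__(self, n_ctx: int, n_cls: int, dim: int):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)   # 可学习上下文向量
        self.register_buffer("cls_emb", torch.randn(n_cls, 1, dim))
        self.text_encoder = nn.GRU(dim, dim, batch_first=True)    # 占位文本编码器

    def forward(self) -> torch.Tensor:
        ctx = self.ctx.unsqueeze(0).expand(self.cls_emb.size(0), -1, -1)
        tokens = torch.cat([ctx, self.cls_emb], dim=1)            # (n_cls, M+1, dim)
        _, h = self.text_encoder(tokens)
        return F.normalize(h[-1], dim=-1)                         # (n_cls, dim)

# 少样本训练示意:图像特征视为已冻结,仅更新上下文向量(及占位编码器)
prompts = PromptLearner(n_ctx=16, n_cls=10, dim=64)
img_feat = F.normalize(torch.randn(4, 64), dim=-1)                # 占位图像特征
logits = 100.0 * img_feat @ prompts().t()
loss = F.cross_entropy(logits, torch.randint(0, 10, (4,)))
loss.backward()
print(logits.shape)  # torch.Size([4, 10])
```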
论文及项目相关链接
Summary
遥感应用越来越多地依赖深度学习进行场景分类,但受限于标注数据的稀缺性和跨地理、跨传感器域标注的高成本。最近出现的视觉语言模型(如CLIP)通过大规模对齐视觉和文本模态来学习可迁移表示,显示出潜力,但它们在遥感领域的直接应用仍不理想,存在显著的领域差距和任务特定语义适应的需求。针对这一挑战,我们系统地探讨了提示学习作为轻量级、高效的适应策略,用于少样本遥感图像场景分类。我们评估了几种代表性的方法,包括上下文优化、条件上下文优化、多模态提示学习和带有自我调节约束的提示。这些方法反映了互补的设计哲学:从静态上下文优化,到增强泛化的条件提示、用于联合视觉-语言适应的多模态提示,以及实现稳定学习且不遗忘的语义正则化提示。我们将这些提示学习方法与两种标准基线进行比较:使用手工设计提示的零样本CLIP,以及在冻结的CLIP特征上训练的线性探针。通过在多个遥感基准数据集上的广泛实验(包括跨数据集泛化测试),我们证明了提示学习在少样本场景中的表现始终优于两种基线。特别是带有自我调节约束的提示实现了最稳健的跨域性能。我们的研究强调了提示学习是缩小卫星和航空图像领域差距的可扩展、高效解决方案,为未来该领域的研究奠定了坚实基础。
Key Takeaways
- 遥感应用面临标注数据稀缺和跨域标注成本高昂的问题。
- CLIP等视觉语言模型在遥感领域的应用存在领域差距和语义适应需求。
- 提示学习是轻量级、高效的适应策略,用于小样本的遥感图像场景分类。
- 评估了多种提示学习方法,包括上下文优化、条件上下文优化、多模态提示学习和带有自我调节约束的提示。
- 提示学习方法在跨数据集泛化测试中表现优异,尤其是带有自我调节约束的提示。
- 提示学习有助于缩小卫星和航空图像领域的差距。
点此查看论文截图
Probing the Representational Geometry of Color Qualia: Dissociating Pure Perception from Task Demands in Brains and AI Models
Authors:Jing Xu
Probing the computational underpinnings of subjective experience, or qualia, remains a central challenge in cognitive neuroscience. This project tackles this question by performing a rigorous comparison of the representational geometry of color qualia between state-of-the-art AI models and the human brain. Using a unique fMRI dataset with a “no-report” paradigm, we use Representational Similarity Analysis (RSA) to compare diverse vision models against neural activity under two conditions: pure perception (“no-report”) and task-modulated perception (“report”). Our analysis yields three principal findings. First, nearly all models align better with neural representations of pure perception, suggesting that the cognitive processes involved in task execution are not captured by current feedforward architectures. Second, our analysis reveals a critical interaction between training paradigm and architecture, challenging the simple assumption that Contrastive Language-Image Pre-training(CLIP) training universally improves neural plausibility. In our direct comparison, this multi-modal training method enhanced brain-alignment for a vision transformer(ViT), yet had the opposite effect on a ConvNet. Our work contributes a new benchmark task for color qualia to the field, packaged in a Brain-Score compatible format. This benchmark reveals a fundamental divergence in the inductive biases of artificial and biological vision systems, offering clear guidance for developing more neurally plausible models.
探究主观经验(感受质)的计算基础仍然是认知神经科学中的核心挑战。本项目通过严格比较最前沿AI模型与人脑在颜色感受质上的表征几何结构来应对这一挑战。我们利用采用“无报告”范式的独特fMRI数据集,使用表征相似性分析(RSA),在两种条件下将多种视觉模型与神经活动进行比较:纯粹感知(“无报告”)和任务调节感知(“报告”)。我们的分析得出三个主要发现。首先,几乎所有模型都与纯粹感知的神经表征更为一致,这表明当前的前馈架构并未捕捉到任务执行所涉及的认知过程。其次,我们的分析揭示了训练范式与架构之间的关键交互,挑战了“对比语言-图像预训练(CLIP)普遍提高神经合理性”这一简单假设:在我们的直接比较中,这种多模态训练方法提高了视觉Transformer(ViT)的脑对齐度,却对卷积网络(ConvNet)产生了相反的效果。我们的工作为该领域贡献了一个新的颜色感受质基准任务,以Brain-Score兼容格式打包。该基准揭示了人工与生物视觉系统在归纳偏置上的根本分歧,为开发更具神经合理性的模型提供了明确指导。
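表征相似性分析(RSA)的基本流程是:分别用模型特征和脑响应构建表征不相似度矩阵(RDM),再比较两者(通常取上三角)的等级相关。下面用随机数据给出一个极简示意(数据与维度均为占位):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 20
model_feats = rng.normal(size=(n_stimuli, 512))   # 占位:模型对每个颜色刺激的特征
brain_resps = rng.normal(size=(n_stimuli, 300))   # 占位:每个刺激的体素响应

# 以 1 - 皮尔逊相关作为不相似度;pdist 直接返回 RDM 的上三角向量
model_rdm = pdist(model_feats, metric="correlation")
brain_rdm = pdist(brain_resps, metric="correlation")

rho, p = spearmanr(model_rdm, brain_rdm)
print(f"RSA (Spearman rho) = {rho:.3f}, p = {p:.3f}")
```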
论文及项目相关链接
Summary
本文探讨了认知神经科学中的核心挑战——主观经验或感受性质的计算基础。该研究通过对比最先进的AI模型和人类大脑的颜色感受性质表示几何结构,对这一问题进行了深入研究。利用独特的fMRI数据集和“无报告”范式,通过代表性相似性分析(RSA)比较了不同视觉模型在两种条件下的表现:纯感知(“无报告”)和任务调节感知(“报告”)。研究发现,大多数模型与纯感知的神经网络表示更为一致,表明当前的前馈架构并未涉及任务执行的认知过程;多模态训练如对比语言-图像预训练(CLIP)对视觉变压器(ViT)和对卷积神经网络(ConvNet)的大脑对齐效果截然不同;该研究为颜色感受性质领域提供了一个新的基准任务,以Brain-Score兼容格式呈现。该基准任务揭示了人工和生物视觉系统的归纳偏置之间的根本差异,为开发更具神经可信度的模型提供了明确指导。
Key Takeaways
- 研究通过对比AI模型和人类大脑的颜色感受性质表示几何结构,探讨主观经验的计算基础。
- 利用fMRI数据集和代表性相似性分析方法,比较了不同视觉模型在纯感知和任务调节感知条件下的表现。
- 大多数模型与纯感知的神经网络表示更为一致,表明当前前馈架构未充分涉及任务执行的认知过程。
- 多模态训练如对比语言-图像预训练对不同的网络架构(如视觉变压器和卷积神经网络)的大脑对齐效果不同。
- 该研究为颜色感受性质领域提供了一个新的基准任务,以Brain-Score兼容格式呈现,有助于评估模型的神经可信度。
- 研究揭示了人工和生物视觉系统的归纳偏置之间的根本差异。
点此查看论文截图
ConMatFormer: A Multi-attention and Transformer Integrated ConvNext based Deep Learning Model for Enhanced Diabetic Foot Ulcer Classification
Authors:Raihan Ahamed Rifat, Fuyad Hasan Bhoyan, Md Humaion Kabir Mehedi, Md Kaviul Hossain, Md. Jakir Hossen, M. F. Mridha
Diabetic foot ulcer (DFU) detection is a clinically significant yet challenging task due to the scarcity and variability of publicly available datasets. To solve these problems, we propose ConMatFormer, a new hybrid deep learning architecture that combines ConvNeXt blocks, multiple attention mechanisms convolutional block attention module (CBAM) and dual attention network (DANet), and transformer modules in a way that works together. This design facilitates the extraction of better local features and understanding of the global context, which allows us to model small skin patterns across different types of DFU very accurately. To address the class imbalance, we used data augmentation methods. A ConvNeXt block was used to obtain detailed local features in the initial stages. Subsequently, we compiled the model by adding a transformer module to enhance long-range dependency. This enabled us to pinpoint the DFU classes that were underrepresented or constituted minorities. Tests on the DS1 (DFUC2021) and DS2 (diabetic foot ulcer (DFU)) datasets showed that ConMatFormer outperformed state-of-the-art (SOTA) convolutional neural network (CNN) and Vision Transformer (ViT) models in terms of accuracy, reliability, and flexibility. The proposed method achieved an accuracy of 0.8961 and a precision of 0.9160 in a single experiment, which is a significant improvement over the current standards for classifying DFUs. In addition, by 4-fold cross-validation, the proposed model achieved an accuracy of 0.9755 with a standard deviation of only 0.0031. We further applied explainable artificial intelligence (XAI) methods, such as Grad-CAM, Grad-CAM++, and LIME, to consistently monitor the transparency and trustworthiness of the decision-making process. Our findings set a new benchmark for DFU classification and provide a hybrid attention transformer framework for medical image analysis.
糖尿病足溃疡(DFU)检测是一项具有临床意义但充满挑战的任务,因为公开可用的数据集稀缺且多变。为了解决这些问题,我们提出了ConMatFormer,这是一种新的混合深度学习架构,它以协同工作的方式结合了ConvNeXt块、多种注意力机制(卷积块注意力模块(CBAM)和双注意力网络(DANet))以及Transformer模块。这种设计有助于提取更好的局部特征并理解全局上下文,使我们能够非常准确地建模不同类型DFU中的细微皮肤模式。为了解决类别不平衡问题,我们使用了数据增强方法。在初始阶段,使用ConvNeXt块来获取详细的局部特征;随后加入Transformer模块以增强长程依赖建模,从而能够识别代表性不足或属于少数类的DFU类别。在DS1(DFUC2021)和DS2(糖尿病足溃疡DFU)数据集上的测试表明,在准确性、可靠性和灵活性方面,ConMatFormer超越了最先进的卷积神经网络(CNN)和视觉Transformer(ViT)模型。在单次实验中,该方法达到了0.8961的准确率和0.9160的精确率,相比当前DFU分类标准有显著提升。此外,通过4折交叉验证,该模型达到了0.9755的准确率,标准差仅为0.0031。我们进一步应用了可解释人工智能(XAI)方法,如Grad-CAM、Grad-CAM++和LIME,以持续监控决策过程的透明度和可信度。我们的研究结果为DFU分类设定了新的基准,并为医学图像分析提供了一个混合注意力-Transformer框架。
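摘要中的卷积块注意力模块(CBAM)依次做通道注意力和空间注意力,下面给出其常见写法的极简示意(通用实现,非论文中的具体配置):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """通道注意力 + 空间注意力的常见写法,仅作概念示意。"""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # 通道注意力:平均池化与最大池化共享同一个 MLP
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # 空间注意力:沿通道维取平均/最大后做 7x7 卷积
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(2, 64, 32, 32)
print(CBAM(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```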
论文及项目相关链接
Summary
本文提出了一种名为ConMatFormer的新型混合深度学习架构,用于糖尿病足溃疡(DFU)检测。该架构结合了ConvNeXt块、多种注意力机制(如卷积块注意力模块CBAM和双注意力网络DANet)和Transformer模块,以提取更好的局部特征并理解全局上下文。为解决类别不平衡问题,采用了数据增强方法。在DS1(DFUC2021)和DS2(糖尿病足溃疡DFU)数据集上的测试表明,ConMatFormer在准确性、可靠性和灵活性方面超越了现有的卷积神经网络(CNN)和视觉Transformer(ViT)模型。该模型还通过4折交叉验证达到了较高的准确率,并应用了可解释人工智能(XAI)方法来监控决策过程的透明度和可信度。
Key Takeaways
- ConMatFormer是一种新型的混合深度学习架构,适用于糖尿病足溃疡(DFU)检测。
- 该架构结合了多种技术,包括ConvNeXt块、CBAM和DANet注意力机制以及变压器模块。
- 通过数据增强解决类别不平衡问题。
- 在DS1和DS2数据集上的测试表现超越了现有模型。
- 该模型通过4折交叉验证达到了较高的准确率。
- 应用了XAI方法来确保决策过程的透明度和可信度。
点此查看论文截图
Cross-view Localization and Synthesis – Datasets, Challenges and Opportunities
Authors:Ningli Xu, Rongjun Qin
Cross-view localization and synthesis are two fundamental tasks in cross-view visual understanding, which deals with cross-view datasets: overhead (satellite or aerial) and ground-level imagery. These tasks have gained increasing attention due to their broad applications in autonomous navigation, urban planning, and augmented reality. Cross-view localization aims to estimate the geographic position of ground-level images based on information provided by overhead imagery while cross-view synthesis seeks to generate ground-level images based on information from the overhead imagery. Both tasks remain challenging due to significant differences in viewing perspective, resolution, and occlusion, which are widely embedded in cross-view datasets. Recent years have witnessed rapid progress driven by the availability of large-scale datasets and novel approaches. Typically, cross-view localization is formulated as an image retrieval problem where ground-level features are matched with tiled overhead images feature, extracted by convolutional neural networks (CNNs) or vision transformers (ViTs) for cross-view feature embedding. Cross-view synthesis, on the other hand, seeks to generate ground-level views based on information from overhead imagery, generally using generative adversarial networks (GANs) or diffusion models. This paper presents a comprehensive survey of advances in cross-view localization and synthesis, reviewing widely used datasets, highlighting key challenges, and providing an organized overview of state-of-the-art techniques. Furthermore, it discusses current limitations, offers comparative analyses, and outlines promising directions for future research. We also include the project page via https://github.com/GDAOSU/Awesome-Cross-View-Methods.
跨视图定位与合成是跨视图视觉理解中的两个基本任务,该领域处理跨视图数据集:俯视(卫星或航空)图像和地面图像。由于在自主导航、城市规划和增强现实等领域的广泛应用,这些任务越来越受到关注。跨视图定位旨在根据俯视图像提供的信息估计地面图像的地理位置,而跨视图合成则旨在基于俯视图像的信息生成地面图像。由于跨视图数据集中普遍存在视角、分辨率和遮挡方面的显著差异,这两个任务仍然具有挑战性。近年来,得益于大规模数据集和新方法的出现,该领域进展迅速。通常,跨视图定位被表述为图像检索问题:用卷积神经网络(CNN)或视觉Transformer(ViT)提取跨视图特征嵌入,将地面特征与切块后的俯视图像特征进行匹配。另一方面,跨视图合成旨在基于俯视图像的信息生成地面视图,通常使用生成对抗网络(GAN)或扩散模型。本文对跨视图定位和合成的进展进行了全面综述,回顾了广泛使用的数据集,强调了关键挑战,并对最新技术进行了系统梳理。此外,本文还讨论了当前局限性、给出了比较分析,并指出了未来研究的有前景方向。项目页面见 https://github.com/GDAOSU/Awesome-Cross-View-Methods。
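把跨视图定位表述为图像检索时,流程大致是:将查询的地面图像特征与所有俯视图瓦片特征做余弦相似度,取最相似的瓦片作为位置估计。下面用随机特征给出一个极简示意(特征与真值均为占位):

```python
import torch
import torch.nn.functional as F

# 占位特征:1000 个卫星图瓦片、8 张查询地面图像,均已由骨干网络编码为 256 维
tile_feats = F.normalize(torch.randn(1000, 256), dim=-1)
query_feats = F.normalize(torch.randn(8, 256), dim=-1)

sim = query_feats @ tile_feats.t()          # (8, 1000) 余弦相似度
topk = sim.topk(k=5, dim=-1)                # 每个查询的前 5 个候选瓦片
print(topk.indices.shape)                   # torch.Size([8, 5])

# Recall@1 评估示意:假定第 i 个查询的真值瓦片编号就是 i(仅为演示)
gt = torch.arange(8)
recall_at_1 = (sim.argmax(dim=-1) == gt).float().mean()
print("Recall@1 =", recall_at_1.item())
```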
论文及项目相关链接
PDF 15 Figures
Summary
本文介绍了跨视角视觉理解的两个核心任务:跨视角定位和跨视角合成。两个任务都涉及处理空中(卫星或航空)和地面水平的图像数据。跨视角定位旨在基于空中图像信息估计地面图像的地理位置,而跨视角合成则旨在基于空中图像信息生成地面图像。由于显著的视角差异、分辨率差异和遮挡问题,这两个任务都面临挑战。近年来,由于大规模数据集的出现和新方法的提出,这两个领域都取得了快速进展。通常,跨视角定位被制定为图像检索问题,通过卷积神经网络(CNN)或视觉转换器(ViT)提取地面和空中图像的特征进行跨视角特征嵌入匹配。跨视角合成则使用生成对抗网络(GANs)或扩散模型基于空中图像信息生成地面视图。本文对跨视角定位和合成的进展进行了全面综述,介绍了常用数据集、关键挑战、最新技术,并讨论了当前局限性、对比分析以及未来研究的有前途方向。
Key Takeaways
- 跨视角视觉理解包含两个核心任务:跨视角定位和跨视角合成。
- 跨视角定位旨在基于空中图像信息估计地面图像的地理位置。
- 跨视角合成旨在基于空中图像信息生成地面图像。
- 两个任务面临视角差异、分辨率差异和遮挡问题的挑战。
- 跨视角定位通常被制定为图像检索问题,使用CNN或ViT进行特征匹配。
- 跨视角合成使用GANs或扩散模型生成地面视图。
点此查看论文截图
Alias-Free ViT: Fractional Shift Invariance via Linear Attention
Authors:Hagay Michaeli, Daniel Soudry
Transformers have emerged as a competitive alternative to convnets in vision tasks, yet they lack the architectural inductive bias of convnets, which may hinder their potential performance. Specifically, Vision Transformers (ViTs) are not translation-invariant and are more sensitive to minor image translations than standard convnets. Previous studies have shown, however, that convnets are also not perfectly shift-invariant, due to aliasing in downsampling and nonlinear layers. Consequently, anti-aliasing approaches have been proposed to certify convnets’ translation robustness. Building on this line of work, we propose an Alias-Free ViT, which combines two main components. First, it uses alias-free downsampling and nonlinearities. Second, it uses linear cross-covariance attention that is shift-equivariant to both integer and fractional translations, enabling a shift-invariant global representation. Our model maintains competitive performance in image classification and outperforms similar-sized models in terms of robustness to adversarial translations.
Transformer在视觉任务中已成为卷积神经网络(convnets)的有力竞争对手。然而,由于缺乏卷积神经网络的架构归纳偏见,可能会限制其潜在性能。具体来说,视觉Transformer(ViTs)不具有平移不变性,对于轻微的图像平移比标准卷积神经网络更敏感。然而,先前的研究表明,由于下采样和非线性层中的混叠,卷积神经网络也并非完全平移不变。因此,人们已经提出了抗混叠方法来验证卷积神经网络的平移鲁棒性。在此基础上,我们提出了一种无混叠ViT,它主要包括两个组成部分。首先,它使用无混叠的下采样和非线性。其次,它采用对整数和分数平移具有平移等价性的线性交叉协方差注意力,以实现平移不变的全局表示。我们的模型在图像分类中保持了竞争力,并且在面对对抗性平移时,优于同类大小的模型。
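摘要中的“线性交叉协方差注意力”与 XCiT 式的 cross-covariance attention 同源:注意力矩阵建立在特征维度之间(d×d),因此计算量对 token 数量是线性的,且不依赖 token 的排列。下面是一个单头极简示意(省略多头与论文中保证分数平移等变性的其余细节,非官方实现):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossCovarianceAttention(nn.Module):
    """注意力矩阵为 (D, D) 的通道间注意力,对 token 数 N 线性;仅为概念示意。"""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.temp = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)               # 均为 (B, N, D)
        q = F.normalize(q, dim=1)                             # 沿 token 维归一化
        k = F.normalize(k, dim=1)
        attn = F.softmax((q.transpose(-2, -1) @ k) * self.temp, dim=-1)  # (B, D, D)
        return self.proj(v @ attn.transpose(-2, -1))          # (B, N, D)

x = torch.randn(2, 196, 128)
print(CrossCovarianceAttention(128)(x).shape)  # torch.Size([2, 196, 128])
```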
论文及项目相关链接
PDF Accepted at NeurIPS 2025. Code is available at https://github.com/hmichaeli/alias_free_vit
Summary
本文探讨了视觉Transformer(ViT)在视觉任务中的优势与局限,特别是其相较于卷积神经网络(convnets)对图像平移更为敏感的问题。为提升ViT的平移鲁棒性,本文提出了无混叠(Alias-Free)ViT模型,通过采用无混叠的下采样与非线性层,以及对整数和分数平移均保持等变的线性交叉协方差注意力机制,实现了平移不变的全局表示;该模型在图像分类任务中保持了竞争力,并在对抗性平移的鲁棒性上优于同等规模的模型。
Key Takeaways
- Vision Transformers (ViTs) 已成为卷积神经网络(convnets)在视觉任务中的竞争替代方案。
- ViTs 缺乏 convnets 的架构归纳偏置,可能影响其性能。
- ViTs 对轻微图像平移敏感,缺乏平移不变性。
- convnets 并非完全平移不变,下采样和非线性层会产生混叠。
- 为提升 ViTs 的平移鲁棒性,提出了无混叠(Alias-Free)ViT 模型。
- 该模型采用无混叠的下采样和非线性层,以及具有平移等变性的线性交叉协方差注意力机制。
点此查看论文截图
A Multimodal, Multitask System for Generating E Commerce Text Listings from Images
Authors:Nayan Kumar Singh
Manually generating catchy descriptions and names is labor intensive and a slow process for retailers. Although generative AI provides an automation solution in form of Vision to Language Models (VLM), the current VLMs are prone to factual “hallucinations”. Siloed, single task models are not only inefficient but also fail to capture interdependent relationships between features. To address these challenges, we propose an end to end, multi task system that generates factually grounded textual listings from a single image. The contributions of this study are two proposals for the model architecture. First, application of multi task learning approach for fine tuning a vision encoder where a single vision backbone is jointly trained on attribute prediction such as color, hemline and neck style and price regression. Second, introduction of a hierarchical generation process where the model’s own predicted attributes are embedded in a prompt and fed to the text decoder to improve factual consistency. The experiments demonstrate the superiority of this architecture. The multi tasking approach outperforms both the independent price regression, with a 3.6% better R2 Value and attribute classification, with a 6.6% improvement F1 score. Critically, the hierarchical generation process proves highly effective, slashing the factual hallucination rate from 12.7% to 7.1%, a 44.5% relative reduction, compared to a non hierarchical ablation. The hierarchical approach also reduces the latency of the autoregressive text generation process by a factor of 3.5 when compared to direct vision to language model of similar size. One minor caveat is that the model does perform 3.5% worse than direct vision-to-language model on ROUGE-L score.
手动生成吸引人的描述和名称对零售商而言是一个劳动密集且缓慢的过程。尽管生成式人工智能以视觉到语言模型(VLM)的形式提供了自动化解决方案,但当前的VLM容易产生事实性“幻觉”。孤立的单任务模型不仅效率低下,而且无法捕捉特征之间的相互依赖关系。为了应对这些挑战,我们提出了一种端到端的多任务系统,可以从单张图像生成基于事实的文字商品列表。本研究的贡献是针对模型架构的两项提案。首先,应用多任务学习方法对视觉编码器进行微调,让单个视觉骨干网络联合训练属性预测(如颜色、裙摆和领口样式)和价格回归。其次,引入分层生成过程,将模型自身预测的属性嵌入提示中并馈送给文本解码器,以提高事实一致性。实验证明了该架构的优越性:多任务方法优于独立的价格回归(R²值提高3.6%)和属性分类(F1分数提高6.6%)。关键的是,分层生成过程被证明非常有效,与非分层消融相比,将事实幻觉率从12.7%降低到7.1%,相对减少了44.5%。与类似规模的直接视觉到语言模型相比,分层方法还将自回归文本生成的延迟降低到约1/3.5。一个小缺陷是,该模型在ROUGE-L分数上比直接视觉到语言模型低3.5%。
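“层次化生成”的关键一步是把视觉编码器自己预测出的属性与价格写进文本解码器的提示里,再做自回归生成。下面只演示提示拼接这一步(属性字段与模板均为假设,解码器部分省略):

```python
# 假设性示意:把多任务头预测出的属性嵌入提示,再交给文本解码器
predicted = {                 # 占位:视觉编码器多任务头的输出
    "color": "navy blue",
    "hemline": "midi",
    "neck_style": "v-neck",
    "price": 39.99,
}

prompt = (
    "Write a product title and description for a garment with "
    f"color: {predicted['color']}, hemline: {predicted['hemline']}, "
    f"neckline: {predicted['neck_style']}, target price: ${predicted['price']:.2f}."
)
print(prompt)
# 随后将该 prompt 与图像特征一起送入文本解码器进行自回归生成(此处省略)
```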
论文及项目相关链接
PDF 24 pages, 10 figures, 11 tables. Code can be found at: https://github.com/SinghNayanKumar/multimodal-product-lister/
摘要
本文提出了一种端到端多任务系统,旨在解决手动为零售商生成吸引人的商品描述和名称劳动强度大且过程缓慢的问题。当前的视觉到语言模型(VLM)容易产生事实“幻觉”,而孤立的单任务模型不仅效率低下,也无法捕捉特征之间的依赖关系。该系统能够从单张图像生成基于事实的文字商品列表。本研究的贡献在于提出了两项模型架构方面的方案:首先,采用多任务学习方法对视觉编码器进行微调,让单个视觉骨干网络联合训练属性预测(如颜色、裙摆和领口风格)和价格回归;其次,引入层次生成过程,将模型自身预测的属性嵌入提示中并馈送给文本解码器,以提高事实一致性。实验证明了该架构的优越性:多任务方法优于独立的价格回归和属性分类,R²值提高了3.6%,F1分数提高了6.6%。特别是,层次生成过程非常有效,将事实幻觉率从12.7%降低到7.1%,相对于非层次结构减少了44.5%,同时还将自回归文本生成的延迟降低到约1/3.5。但需注意,该模型在ROUGE-L得分上比直接视觉到语言模型低3.5%。
关键见解
- 当前手动为零售商生成商品描述和名称是一个劳动密集且低效的过程。
- 当前使用的生成性AI存在事实“幻觉”的问题。
- 提出了一种新型的多任务系统架构,能够从单一图像生成基于事实的文字描述。
- 通过多任务学习和层次生成过程提高了模型的性能。
- 多任务方法优化了价格回归和属性分类的性能。
- 层次生成方法显著降低了事实幻觉率并加快了文本生成速度。
- 该模型在ROUGE-L得分方面相对直接视觉到语言模型略有不足。
点此查看论文截图
Bridging the gap to real-world language-grounded visual concept learning
Authors:Whie Jung, Semin Kim, Junee Kim, Seunghoon Hong
Human intelligence effortlessly interprets visual scenes along a rich spectrum of semantic dimensions. However, existing approaches to language-grounded visual concept learning are limited to a few predefined primitive axes, such as color and shape, and are typically explored in synthetic datasets. In this work, we propose a scalable framework that adaptively identifies image-related concept axes and grounds visual concepts along these axes in real-world scenes. Leveraging a pretrained vision-language model and our universal prompting strategy, our framework identifies a diverse image-related axes without any prior knowledge. Our universal concept encoder adaptively binds visual features to the discovered axes without introducing additional model parameters for each concept. To ground visual concepts along the discovered axes, we optimize a compositional anchoring objective, which ensures that each axis can be independently manipulated without affecting others. We demonstrate the effectiveness of our framework on subsets of ImageNet, CelebA-HQ, and AFHQ, showcasing superior editing capabilities across diverse real-world concepts that are too varied to be manually predefined. Our method also exhibits strong compositional generalization, outperforming existing visual concept learning and text-based editing methods. The code is available at https://github.com/whieya/Language-grounded-VCL.
人类智能可以毫不费力地沿着丰富的语义维度解读视觉场景。然而,现有的语言引导视觉概念学习方法仅限于少数预定义的基本轴(如颜色和形状),并且通常只在合成数据集上进行探索。在这项工作中,我们提出了一个可扩展的框架,该框架可以自适应地识别与图像相关的概念轴,并在真实场景中沿这些轴对视觉概念进行语言落地。我们利用预训练的视觉语言模型和通用的提示策略,在没有任何先验知识的情况下识别出多样化的图像相关轴。我们的通用概念编码器可以自适应地将视觉特征绑定到所发现的轴上,而无需为每个概念引入额外的模型参数。为了沿所发现的轴落地视觉概念,我们优化了一个组合锚定目标,确保每个轴都可以独立操作而不影响其他轴。我们在ImageNet、CelebA-HQ和AFHQ的子集上展示了框架的有效性,其在多种难以手动预定义的现实概念上表现出卓越的编辑能力。我们的方法还表现出强大的组合泛化能力,优于现有的视觉概念学习方法和基于文本的编辑方法。代码可在https://github.com/whieya/Language-grounded-VCL找到。
论文及项目相关链接
Summary
本文提出了一种可扩展的框架,能够自适应地识别图像相关的概念轴,并在真实场景中将视觉概念与这些轴相结合。通过利用预训练的视觉语言模型和通用提示策略,该框架能够在没有任何先验知识的情况下发现多样化的图像相关轴。该框架通过优化组合锚定目标,确保每个轴能独立操作而不会相互影响。该方法在ImageNet、CelebA-HQ和AFHQ的子集上表现出优异的编辑能力,展示了在不同真实概念上的优越表现,并且具有良好的组合泛化能力。
Key Takeaways
- 该框架能够自适应地识别图像相关的概念轴,不再局限于预设的原始轴(如颜色和形状)。
- 该框架能在真实场景中将视觉概念与识别的轴相结合。
- 利用预训练的视觉语言模型和通用提示策略,可在无先验知识的情况下发现图像相关轴。
- 通用概念编码器能够自适应地将视觉特征绑定到发现的轴上,无需为每个概念引入额外的模型参数。
- 通过优化组合锚定目标,确保每个轴能独立操作而不影响其他轴。
- 该方法在多个数据集上表现出优异的编辑能力和强大的组合泛化性能。
点此查看论文截图
Knowledge-Driven Vision-Language Model for Plexus Detection in Hirschsprung’s Disease
Authors:Youssef Megahed, Atallah Madi, Dina El Demellawy, Adrian D. C. Chan
Hirschsprung’s disease is defined as the congenital absence of ganglion cells in some segment(s) of the colon. The muscle cannot make coordinated movements to propel stool in that section, most commonly leading to obstruction. The diagnosis and treatment for this disease require a clear identification of different region(s) of the myenteric plexus, where ganglion cells should be present, on the microscopic view of the tissue slide. While deep learning approaches, such as Convolutional Neural Networks, have performed very well in this task, they are often treated as black boxes, with minimal understanding gained from them, and may not conform to how a physician makes decisions. In this study, we propose a novel framework that integrates expert-derived textual concepts into a Contrastive Language-Image Pre-training-based vision-language model to guide plexus classification. Using prompts derived from expert sources (e.g., medical textbooks and papers) generated by large language models and reviewed by our team before being encoded with QuiltNet, our approach aligns clinically relevant semantic cues with visual features. Experimental results show that the proposed model demonstrated superior discriminative capability across different classification metrics as it outperformed CNN-based models, including VGG-19, ResNet-18, and ResNet-50; achieving an accuracy of 83.9%, a precision of 86.6%, and a specificity of 87.6%. These findings highlight the potential of multi-modal learning in histopathology and underscore the value of incorporating expert knowledge for more clinically relevant model outputs.
先天性巨结肠(Hirschsprung病,又译希尔施普龙病)被定义为结肠某些节段先天性缺乏神经节细胞。该节段的肌肉无法协调运动来推送粪便,最常导致梗阻。这种疾病的诊断和治疗需要在组织切片的显微视图中清楚地识别肌间神经丛的不同区域,这些区域本应存在神经节细胞。虽然深度学习方法(如卷积神经网络)在此任务中表现良好,但它们通常被视为黑箱模型,人们从中获得的理解有限,且可能不符合医生的决策方式。本研究提出了一种新的框架,将专家衍生的文本概念整合到基于对比语言-图像预训练(CLIP)的视觉语言模型中,以指导神经丛分类。提示来自专家资源(如医学教科书和论文),由大型语言模型生成并经我们团队审核后,再用QuiltNet编码;通过这种方式,我们的方法将临床相关的语义线索与视觉特征对齐。实验结果表明,所提出的模型在多种分类指标上表现出卓越的判别能力,优于基于CNN的模型(包括VGG-19、ResNet-18和ResNet-50),其准确率达到83.9%,精确率达到86.6%,特异性达到87.6%。这些发现突显了多模态学习在组织病理学中的潜力,并强调了融入专家知识对于获得更具临床相关性的模型输出的价值。
论文及项目相关链接
PDF Accepted into the ICAAI 2025 - The 9th International Conference on Advances in Artificial Intelligence
Summary:本研究提出了一种融合专家衍生文本概念的新型框架,用于在基于对比语言-图像预训练的视觉语言模型中进行神经丛分类。该框架利用来自专家资源(如医学教科书和论文)的提示,经QuiltNet编码后,将临床相关的语义线索与视觉特征对齐。实验结果表明,该模型在多种分类指标上表现出较高的判别能力,优于基于CNN的模型(包括VGG-19、ResNet-18和ResNet-50),准确率达到83.9%,精确率达到86.6%,特异性达到87.6%。这为多模态学习在组织病理学领域的应用展示了潜力,并凸显了融入专家知识对于获得更具临床相关性的模型输出的价值。
Key Takeaways:
- 先天性巨结肠(Hirschsprung病)是因结肠部分节段先天缺乏神经节细胞所致的疾病,会影响肠壁肌肉的协调运动并可能导致梗阻。
- 卷积神经网络等深度学习方法虽然在该任务上表现良好,但通常被视为黑箱,缺乏透明度,可能不符合医生的决策方式。
- 研究提出了一种融合专家知识的多模态学习框架,利用对比语言-图像预训练模型进行神经丛分类。
- 该框架使用来自专家资源的提示,结合视觉特征,提高模型的鉴别能力。
- 实验结果显示,该模型在分类指标上优于其他CNN模型,包括VGG-19、ResNet-18和ResNet-50。
- 模型准确率达到了83.9%,精确度达到了86.6%,特异性达到了87.6%。
点此查看论文截图
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Authors:Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
卷积神经网络(CNN)天生偏向纹理这一假设,在很大程度上塑造了深度学习中关于特征使用的讨论。我们通过审视Geirhos等人线索冲突实验的局限性来重新检验这一假设。为克服这些局限,我们提出了一个领域无关的框架,通过系统地抑制形状、纹理和颜色线索来量化特征依赖,避免了强制二选一冲突带来的混淆。通过在受控抑制条件下对人类和神经网络进行评估,我们发现CNN并非天生偏向纹理,而是主要依赖局部形状特征。不过,通过现代训练策略或架构(ConvNeXt、ViT),这种依赖可以被大幅减轻。我们进一步将分析扩展到计算机视觉、医学成像和遥感领域,结果显示依赖模式存在系统性差异:计算机视觉模型优先考虑形状,医学成像模型强调颜色,遥感模型则表现出更强的纹理依赖。相关代码可在https://github.com/tomburgert/feature-reliance获取。
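“系统性抑制形状、纹理与颜色线索”可以用简单的图像变换来近似理解:灰度化抑制颜色、强高斯模糊抑制纹理、打乱 patch 抑制全局形状。下面是一个假设性的示意(具体的抑制方式与参数以论文为准):

```python
import torch
import torchvision.transforms.functional as TF

def suppress_color(img: torch.Tensor) -> torch.Tensor:
    """灰度化后复制回 3 通道,近似去除颜色线索。img: (3, H, W),取值 [0, 1]。"""
    return TF.rgb_to_grayscale(img, num_output_channels=3)

def suppress_texture(img: torch.Tensor, sigma: float = 4.0) -> torch.Tensor:
    """强高斯模糊抹掉高频纹理,保留粗略形状与颜色。"""
    return TF.gaussian_blur(img, kernel_size=21, sigma=sigma)

def suppress_shape(img: torch.Tensor, patch: int = 32) -> torch.Tensor:
    """切成 patch 并随机重排,破坏全局形状、保留局部纹理与颜色。"""
    c, h, w = img.shape
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)       # (C, nH, nW, p, p)
    p = p.contiguous().view(c, -1, patch, patch)
    p = p[:, torch.randperm(p.size(1))]
    p = p.view(c, h // patch, w // patch, patch, patch)
    return p.permute(0, 1, 3, 2, 4).contiguous().view(c, h, w)

img = torch.rand(3, 224, 224)
for fn in (suppress_color, suppress_texture, suppress_shape):
    print(fn.__name__, fn(img).shape)
```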
论文及项目相关链接
PDF Accepted at NeurIPS 2025 (oral)
Summary
本文重新审视了卷积神经网络(CNNs)在深度学习中的特征使用方式是否天生偏向于纹理的观点。作者通过解决Geirhos等人在线索冲突实验中的局限性,提出了一个领域不可知的框架,该框架通过系统地抑制形状、纹理和颜色线索来量化特征依赖,避免了强制选择冲突所带来的混淆。通过控制抑制条件下的对人类和神经网络的评估发现,CNN并非天生偏向于纹理,而是主要依赖于局部形状特征。然而,这种依赖可以通过现代训练策略或架构(如ConvNeXt和ViTs)来大幅减轻。此外,本文还分析了计算机视觉、医学成像和遥感等不同领域中的依赖模式差异。
Key Takeaways
- 作者重新评估了CNN是否天生偏向于纹理的观点。
- 提出了一个框架来量化特征依赖,该框架通过系统地抑制形状、纹理和颜色线索来避免强制选择冲突带来的混淆。
- 通过实验发现,CNN主要依赖于局部形状特征,而非天生偏向于纹理。
- 现代训练策略或架构(如ConvNeXt和ViTs)可以大幅减轻对局部形状特征的依赖。
- 不同领域中的模型依赖模式存在差异,计算机视觉模型重视形状,医学成像模型强调颜色,而遥感模型更依赖纹理。
- 作者的研究提供了关于特征和依赖性的新见解,对深度学习领域的进一步发展有重要意义。
点此查看论文截图
3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Authors:Nojod M. Alotaibi, Areej M. Alhothali, Manar S. Ali
Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) an cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 78.98% accuracy, 76.54% sensitivity, 81.58% specificity, 81.58% precision, and 78.98% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
重性抑郁障碍(MDD)是一种常见的心理健康问题,对个人福祉和全球公共卫生都产生负面影响。利用结构磁共振成像(sMRI)和深度学习(DL)方法对MDD进行自动化检测,在提高诊断准确性和实现早期干预方面显示出越来越大的潜力。现有的大多数方法要么使用体素级特征,要么使用基于预定义脑图谱的手工区域表示,这限制了它们捕捉复杂脑模式的能力。本文开发了一个统一的流程,利用视觉Transformer(ViT)从sMRI数据中提取三维区域嵌入,并利用图神经网络(GNN)进行分类。我们探索了两种定义区域的方法:(1)基于图谱的方法,使用预定义的结构和功能脑图谱;(2)基于立方体块的方法,直接训练ViT从均匀提取的三维块中识别区域。此外,还生成余弦相似度图来建模区域间的关系,并引导基于GNN的分类。我们使用REST-meta-MDD数据集进行了大量实验,以证明模型的有效性。通过分层10折交叉验证,最佳模型获得了78.98%的准确率、76.54%的敏感性、81.58%的特异性、81.58%的精确率以及78.98%的F1分数。此外,基于图谱的模型始终优于基于立方体块的方法,强调了使用领域特定的解剖学先验进行MDD检测的重要性。
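“余弦相似度图”的构建方式大致是:先得到每个脑区的嵌入向量,计算两两余弦相似度,再按 top-k(或阈值)保留边,作为 GNN 的邻接结构。下面用随机嵌入给出一个极简示意(区域嵌入为占位,top-k 取值也是假设):

```python
import torch
import torch.nn.functional as F

num_regions, dim, k = 90, 128, 10
region_emb = torch.randn(num_regions, dim)        # 占位:ViT 提取的各脑区嵌入

emb = F.normalize(region_emb, dim=-1)
sim = emb @ emb.t()                               # (90, 90) 余弦相似度
sim.fill_diagonal_(0.0)                           # 去掉自环

adj = torch.zeros_like(sim)
adj.scatter_(1, sim.topk(k, dim=-1).indices, 1.0) # 每个节点保留 k 条最强的边
adj = torch.maximum(adj, adj.t())                 # 对称化

edge_index = adj.nonzero().t()                    # (2, E),可作为常见 GNN 库的输入
print(edge_index.shape)
```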
论文及项目相关链接
PDF 14 pages, 1 figure, 7 tables
Summary
本文介绍了一种利用视觉Transformer和图神经网络对MDD进行自动化检测的新方法。该方法通过提取sMRI数据的3D区域嵌入并构建余弦相似度图进行分类,探索了基于图谱和基于立方体块两种区域定义策略。实验表明,基于图谱的方法表现更佳,最佳模型准确率达到78.98%,并在敏感性、特异性、精确率和F1分数等指标上表现良好。该研究为提高抑郁症诊断准确性和实现早期干预提供了新的可能性。
Key Takeaways
- Vision Transformer和Graph Neural Network被应用于抑郁症的自动化检测,提高了诊断准确性。
- 通过提取sMRI数据的3D区域嵌入进行分类,探索了基于图谱的方法和基于立方体方法的区域定义策略。
- 实验在REST-meta-MDD数据集上进行,显示基于图谱的方法表现优于基于立方体方法。
- 基于图谱的最佳模型在多项指标上表现良好,包括准确度78.98%,敏感性76.54%,特异性81.58%,精确度81.58%,F1分数78.98%。
- 研究表明,利用领域特定的解剖学先验信息对于抑郁症检测至关重要。
- 该方法有望改善抑郁症的诊断准确性并促进早期干预。
点此查看论文截图
CapRecover: A Cross-Modality Feature Inversion Attack Framework on Vision Language Models
Authors:Kedong Xiu, Sai Qian Zhang
As Vision-Language Models (VLMs) are increasingly deployed in split-DNN configurations–with visual encoders (e.g., ResNet, ViT) operating on user devices and sending intermediate features to the cloud–there is a growing privacy risk from semantic information leakage. Existing approaches to reconstructing images from these intermediate features often result in blurry, semantically ambiguous images. To directly address semantic leakage, we propose CapRecover, a cross-modality inversion framework that recovers high-level semantic content, such as labels or captions, directly from intermediate features without image reconstruction. We evaluate CapRecover on multiple datasets and victim models, demonstrating strong performance in semantic recovery. Specifically, CapRecover achieves up to 92.71% Top-1 label accuracy on CIFAR-10 and generates fluent captions from ResNet50 features on COCO2017 with ROUGE-L scores up to 0.52. Our analysis further reveals that deeper convolutional layers encode significantly more semantic information compared to shallow layers. To mitigate semantic leakage, we introduce a simple yet effective protection method: adding random noise to intermediate features at each layer and removing the noise in the next layer. Experimental results show that this approach prevents semantic leakage without additional training costs. Our code is available at https://jus1mple.github.io/Image2CaptionAttack.
随着视觉语言模型(VLM)越来越多地以拆分DNN(split-DNN)配置部署——视觉编码器(例如ResNet、ViT)在用户设备上运行并将中间特征发送到云端——语义信息泄露带来的隐私风险也日益增加。现有方法从这些中间特征重建图像,往往得到模糊、语义不清的图像。为了直接应对语义泄露问题,我们提出了CapRecover,这是一个跨模态反演框架,无需重建图像即可直接从中间特征恢复高层语义内容,如标签或描述。我们在多个数据集和受害者模型上评估了CapRecover,展示了其在语义恢复方面的强大性能。具体来说,CapRecover在CIFAR-10上达到了高达92.71%的Top-1标签准确率,并在COCO2017上从ResNet50特征生成流畅的描述,ROUGE-L得分高达0.52。我们的分析进一步揭示,与浅层相比,较深的卷积层编码了显著更多的语义信息。为了缓解语义泄露,我们提出了一种简单而有效的保护方法:在每一层的中间特征上添加随机噪声,并在下一层去除该噪声。实验结果表明,该方法无需额外训练成本即可防止语义泄露。我们的代码可在https://jus1mple.github.io/Image2CaptionAttack上获取。
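“加噪后在下一层去噪”的防护思路可以这样理解:设备端在拆分点输出上叠加随机噪声再发送,云端(或下一层)在能够重建同一噪声的前提下先减去噪声再继续前向,从而不影响最终结果。下面是一个假设性的极简示意,其中用共享随机种子来重建噪声只是演示手段,论文的具体机制可能不同:

```python
import torch
import torch.nn as nn

layer_l = nn.Linear(64, 64)       # 占位:拆分点之前的最后一层(设备端)
layer_next = nn.Linear(64, 64)    # 占位:拆分点之后的第一层(云端)

x = torch.randn(4, 64)
h = layer_l(x)

# 设备端:用共享种子生成噪声并叠加,对外发送的是 h_noisy
shared_seed = 1234
gen = torch.Generator().manual_seed(shared_seed)
h_noisy = h + torch.randn(h.shape, generator=gen)

# 下一层:用同一种子重建噪声并去除,再继续前向计算
gen2 = torch.Generator().manual_seed(shared_seed)
h_recovered = h_noisy - torch.randn(h_noisy.shape, generator=gen2)
out = layer_next(h_recovered)

print(torch.allclose(h, h_recovered, atol=1e-6))  # True:最终计算不受影响
```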
论文及项目相关链接
PDF 9 pages, accepted by the 2025 ACM Multimedia Conference. Code is available at https://jus1mple.github.io/Image2CaptionAttack
Summary
本文指出,当视觉语言模型(VLM)采用拆分DNN(split-DNN)配置部署时,存在语义信息泄露的隐私风险。现有方法通过中间特征重构图像,往往得到模糊、语义不清的结果。针对这一问题,本文提出CapRecover方法,直接从中间特征恢复高级语义内容(如标签或描述),无需重构图像。实验证明,CapRecover在多个数据集和受害者模型上表现出强大的性能,如在CIFAR-10上的Top-1标签准确率高达92.71%,在COCO2017数据集上从ResNet50特征生成的流畅描述ROUGE-L得分达0.52。分析显示,深层卷积层比浅层编码更多语义信息。为缓解语义泄露问题,本文还提出了一种简单有效的保护方法:向中间特征添加随机噪声并在下一层去除噪声。实验结果表明,该方法可在不增加训练成本的情况下防止语义泄露。
Key Takeaways
- Vision-Language Models (VLMs) in split-DNN configurations face privacy risks due to semantic information leakage.
- 现有方法通过中间特征重构图像,结果模糊且语义不清。
- CapRecover方法能直接从中间特征恢复高级语义内容(如标签和描述),无需重构图像。
- CapRecover在多个数据集上表现出强大的性能,如CIFAR-10的Top-1标签准确率和COCO2017的ROUGE-L得分。
- 深层卷积层相较于浅层包含更多语义信息。
- 为解决语义泄露问题,提出了一种添加随机噪声至中间特征的保护方法。
点此查看论文截图
Boosting Generative Adversarial Transferability with Self-supervised Vision Transformer Features
Authors:Shangbo Wu, Yu-an Tan, Ruinan Ma, Wencong Ma, Dehua Zhu, Yuanzhang Li
The ability of deep neural networks (DNNs) come from extracting and interpreting features from the data provided. By exploiting intermediate features in DNNs instead of relying on hard labels, we craft adversarial perturbation that generalize more effectively, boosting black-box transferability. These features ubiquitously come from supervised learning in previous work. Inspired by the exceptional synergy between self-supervised learning and the Transformer architecture, this paper explores whether exploiting self-supervised Vision Transformer (ViT) representations can improve adversarial transferability. We present dSVA – a generative dual self-supervised ViT features attack, that exploits both global structural features from contrastive learning (CL) and local textural features from masked image modeling (MIM), the self-supervised learning paradigm duo for ViTs. We design a novel generative training framework that incorporates a generator to create black-box adversarial examples, and strategies to train the generator by exploiting joint features and the attention mechanism of self-supervised ViTs. Our findings show that CL and MIM enable ViTs to attend to distinct feature tendencies, which, when exploited in tandem, boast great adversarial generalizability. By disrupting dual deep features distilled by self-supervised ViTs, we are rewarded with remarkable black-box transferability to models of various architectures that outperform state-of-the-arts. Code available at https://github.com/spencerwooo/dSVA.
深度神经网络(DNNs)的能力来自于提取和解释所提供数据的特征。我们不是依赖硬标签,而是利用DNN中的中间特征,来制造更具通用性的对抗扰动,从而提升黑盒迁移性。这些特征在以前的工作中普遍来自监督学习。受自监督学习与Transformer架构之间卓越协同的启发,本文探讨了利用自监督视觉Transformer(ViT)表示是否可以提高对抗迁移性。我们提出了dSVA——一种生成式双自监督ViT特征攻击方法,它利用对比学习(CL)的全局结构特征和掩码图像建模(MIM)的局部纹理特征,这两者是自监督学习范式中ViT的双璧。我们设计了一个新颖的生成式训练框架,该框架包含一个生成器来创建黑盒对抗样本,以及通过利用自监督ViT的联合特征和注意力机制来训练生成器的策略。我们的研究发现,CL和MIM使ViT能够关注不同的特征倾向,当它们结合使用时,可以大大提高对抗性的通用性。通过破坏由自监督ViT提炼的双重深度特征,我们在各种架构的模型上获得了显著的黑盒迁移性,超过了现有技术的水平。代码可通过链接获取。
论文及项目相关链接
PDF 14 pages, 9 figures, accepted at ICCV 2025
Summary
深度神经网络(DNNs)通过提取和解释数据特征实现其功能。本研究通过利用DNNs的中间特征而非依赖硬标签,设计出更具通用性的对抗扰动,提高了黑箱迁移能力。本研究受自监督学习与Transformer架构间卓越协同的启发,探索了利用自监督Vision Transformer(ViT)表示是否可提高对抗性迁移能力。提出dSVA——一种生成式双自监督ViT特征攻击方法,它利用对比学习(CL)的全局结构特征和掩码图像建模(MIM)的局部纹理特征,这是ViT的自监督学习范式组合。设计了一种新颖的生成式训练框架,通过生成器创建黑箱对抗样本,并制定了通过利用自监督ViT的联合特征和注意力机制来训练生成器的策略。研究发现,CL和MIM使ViT能够关注不同的特征倾向,当结合使用时,可大大提高对抗性的一般性。通过破坏自监督ViT提炼的双重深度特征,实现对各种架构模型的卓越黑箱迁移能力,并超越现有技术。
Key Takeaways
- 研究利用深度神经网络(DNNs)的中间特征,通过设计对抗性扰动提高黑箱迁移能力。
- 受自监督学习与Transformer协同工作的启发,研究焦点转向自监督Vision Transformer(ViT)。
- 提出dSVA方法,结合对比学习(CL)的全局结构特征和掩码图像建模(MIM)的局部纹理特征。
- 采用了新颖的生成式训练框架,融入生成器来创建黑箱对抗样本。
- 结合CL和MIM,ViT能够关注不同的特征倾向,这有助于提高对抗性的一般性。
- 通过攻击自监督ViT提炼的双重深度特征,实现卓越的黑箱迁移能力。
点此查看论文截图
PatchGuard: Adversarially Robust Anomaly Detection and Localization through Vision Transformers and Pseudo Anomalies
Authors:Mojtaba Nafez, Amirhossein Koochakian, Arad Maleki, Jafar Habibi, Mohammad Hossein Rohban
Anomaly Detection (AD) and Anomaly Localization (AL) are crucial in fields that demand high reliability, such as medical imaging and industrial monitoring. However, current AD and AL approaches are often susceptible to adversarial attacks due to limitations in training data, which typically include only normal, unlabeled samples. This study introduces PatchGuard, an adversarially robust AD and AL method that incorporates pseudo anomalies with localization masks within a Vision Transformer (ViT)-based architecture to address these vulnerabilities. We begin by examining the essential properties of pseudo anomalies, and follow it by providing theoretical insights into the attention mechanisms required to enhance the adversarial robustness of AD and AL systems. We then present our approach, which leverages Foreground-Aware Pseudo-Anomalies to overcome the deficiencies of previous anomaly-aware methods. Our method incorporates these crafted pseudo-anomaly samples into a ViT-based framework, with adversarial training guided by a novel loss function designed to improve model robustness, as supported by our theoretical analysis. Experimental results on well-established industrial and medical datasets demonstrate that PatchGuard significantly outperforms previous methods in adversarial settings, achieving performance gains of 53.2% in AD and 68.5% in AL, while also maintaining competitive accuracy in non-adversarial settings. The code repository is available at https://github.com/rohban-lab/PatchGuard.
异常检测(AD)和异常定位(AL)在医学成像和工业监控等对可靠性要求很高的领域至关重要。然而,由于训练数据通常仅包含正常的无标注样本,当前的AD和AL方法往往容易受到对抗攻击。本研究提出了PatchGuard,这是一种具有对抗稳健性的AD与AL方法,它在基于视觉Transformer(ViT)的架构中引入带有定位掩码的伪异常,以弥补上述脆弱性。我们首先分析伪异常应具备的基本属性,随后从理论上探讨增强AD和AL系统对抗稳健性所需的注意力机制。然后我们提出了利用前景感知伪异常(Foreground-Aware Pseudo-Anomalies)的方法,以克服以往异常感知方法的缺陷。我们的方法将这些构造的伪异常样本融入基于ViT的框架,并在我们的理论分析支持下,使用由旨在提升模型稳健性的新型损失函数引导的对抗训练。在成熟的工业和医疗数据集上的实验结果表明,PatchGuard在对抗设置下显著优于以往方法,在AD和AL上的性能分别提升了53.2%和68.5%,同时在非对抗设置中也保持了有竞争力的准确性。代码仓库可通过https://github.com/rohban-lab/PatchGuard获取。
论文及项目相关链接
PDF Accepted to the Conference on Computer Vision and Pattern Recognition (CVPR) 2025
Summary
本文介绍了PatchGuard,一种针对Vision Transformer的对抗性稳健异常检测和定位方法。该方法通过融入伪异常和定位掩膜来克服现有方法的不足,提高对抗性环境下的模型稳健性。实验结果表明,PatchGuard在工业和医疗数据集上的性能显著优于其他方法,既提高了异常检测的准确性,又增强了异常定位的准确性。
Key Takeaways
- PatchGuard是一种基于Vision Transformer的异常检测(AD)和异常定位(AL)方法。
- 该方法利用伪异常和定位掩膜,融入ViT架构以增强模型在对抗环境下的稳健性。
- PatchGuard通过融入Foreground-Aware Pseudo-Anomalies克服了以往异常感知方法的不足。
- 方法结合了对抗训练,并通过新型损失函数提升模型稳健性。
- 实验结果显示,PatchGuard在AD和AL方面显著优于其他方法,分别提高了53.2%和68.5%的性能。
- PatchGuard在非对抗性环境下也保持了较高的准确性。
点此查看论文截图
Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection
Authors:Fangling Jiang, Qi Li, Bing Liu, Weining Wang, Caifeng Shan, Zhenan Sun, Ming-Hsuan Yang
3D mask presentation attack detection is crucial for protecting face recognition systems against the rising threat of 3D mask attacks. While most existing methods utilize multimodal features or remote photoplethysmography (rPPG) signals to distinguish between real faces and 3D masks, they face significant challenges, such as the high costs associated with multimodal sensors and limited generalization ability. Detection-related text descriptions offer concise, universal information and are cost-effective to obtain. However, the potential of vision-language multimodal features for 3D mask presentation attack detection remains unexplored. In this paper, we propose a novel knowledge-based prompt learning framework to explore the strong generalization capability of vision-language models for 3D mask presentation attack detection. Specifically, our approach incorporates entities and triples from knowledge graphs into the prompt learning process, generating fine-grained, task-specific explicit prompts that effectively harness the knowledge embedded in pre-trained vision-language models. Furthermore, considering different input images may emphasize distinct knowledge graph elements, we introduce a visual-specific knowledge filter based on an attention mechanism to refine relevant elements according to the visual context. Additionally, we leverage causal graph theory insights into the prompt learning process to further enhance the generalization ability of our method. During training, a spurious correlation elimination paradigm is employed, which removes category-irrelevant local image patches using guidance from knowledge-based text features, fostering the learning of generalized causal prompts that align with category-relevant local patches. Experimental results demonstrate that the proposed method achieves state-of-the-art intra- and cross-scenario detection performance on benchmark datasets.
3D面具呈现攻击检测对于保护人脸识别系统免受日益严重的3D面具攻击威胁至关重要。虽然大多数现有方法利用多模态特征或远程光电容积脉搏波(rPPG)信号来区分真实人脸和3D面具,但它们面临重大挑战,例如多模态传感器成本高昂以及泛化能力有限。与检测相关的文本描述提供简洁、通用的信息,且获取成本低。然而,视觉-语言多模态特征在3D面具呈现攻击检测中的潜力尚未被探索。在本文中,我们提出了一种新颖的基于知识的提示学习框架,以挖掘视觉语言模型在3D面具呈现攻击检测中的强泛化能力。具体来说,我们的方法将知识图谱中的实体和三元组融入提示学习过程,生成细粒度、任务特定的显式提示,有效利用预训练视觉语言模型中蕴含的知识。此外,考虑到不同的输入图像可能强调不同的知识图谱元素,我们引入了一种基于注意力机制的视觉特定知识过滤器,根据视觉上下文筛选相关元素。我们还将因果图理论的洞见引入提示学习过程,进一步提高方法的泛化能力。在训练过程中,我们采用虚假相关消除范式,在基于知识的文本特征引导下去除与类别无关的局部图像块,促使模型学习与类别相关局部块对齐的泛化因果提示。实验结果表明,所提方法在基准数据集上实现了最先进的场景内与跨场景检测性能。
论文及项目相关链接
PDF Accepted by TPAMI
Summary
三维面具呈现攻击对人脸识别系统的安全构成日益严重的威胁,因此相应的检测至关重要。当前方法主要利用多模态特征或远程光电容积脉搏波(rPPG)信号来区分真实人脸和三维面具,但存在传感器成本高和泛化能力有限等挑战。本文提出一种基于知识的提示学习框架,通过融入知识图谱中的实体和三元组来生成任务特定的显式提示,并利用基于注意力机制的视觉特定知识过滤器筛选与视觉上下文相关的元素。此外,结合因果图理论,进一步提高方法的泛化能力。实验结果表明,该方法在基准数据集上取得了优异的场景内与跨场景检测性能。
Key Takeaways
以下是关于这项工作的主要观点与洞察:
- 三维面具攻击对人脸识别系统的安全构成威胁。
- 当前利用多模态特征和远程光体积描记术信号的检测方法面临高成本和泛化困难等挑战。
- 知识提示学习框架可用于检测三维面具攻击。它融合了知识图谱中的实体和三元组以增强识别能力。
- 针对视觉上下文的信息差异,引入视觉特定知识过滤器以优化相关元素的选择。
- 结合因果图理论来提高方法的泛化性能。
- 方法采用虚假相关消除范式,在基于知识的文本特征引导下去除与类别无关的局部图像块,使学习到的提示与类别相关的局部区域更好地对齐。
点此查看论文截图
Efficient Remote Sensing Change Detection with Change State Space Models
Authors:Elman Ghazaei, Erchan Aptoula
Despite their frequent use for change detection, both ConvNets and Vision transformers (ViT) exhibit well-known limitations, namely the former struggle to model long-range dependencies while the latter are computationally inefficient, rendering them challenging to train on large-scale datasets. Vision Mamba, an architecture based on State Space Models has emerged as an alternative addressing the aforementioned deficiencies and has been already applied to remote sensing change detection, though mostly as a feature extracting backbone. In this article the Change State Space Model is introduced, that has been specifically designed for change detection by focusing on the relevant changes between bi-temporal images, effectively filtering out irrelevant information. By concentrating solely on the changed features, the number of network parameters is reduced, enhancing significantly computational efficiency while maintaining high detection performance and robustness against input degradation. The proposed model has been evaluated via three benchmark datasets, where it outperformed ConvNets, ViTs, and Mamba-based counterparts at a fraction of their computational complexity. The implementation will be made available at https://github.com/Elman295/CSSM upon acceptance.
尽管卷积神经网络(ConvNet)和视觉Transformer(ViT)经常用于变化检测,但它们存在众所周知的局限:前者难以建模长距离依赖关系,后者计算效率低下,难以在大规模数据集上训练。基于状态空间模型的Vision Mamba架构作为一种替代方案应运而生,解决了上述缺陷,并已应用于遥感变化检测,但大多只是作为特征提取骨干。本文介绍了专为变化检测而设计的变化状态空间模型(Change State Space Model):它关注双时相图像之间的相关变化,有效过滤无关信息。由于只关注变化的特征,网络参数数量得以减少,在保持高检测性能和对输入退化的稳健性的同时,显著提高了计算效率。该模型在三个基准数据集上进行了评估,以远低于ConvNet、ViT和基于Mamba的同类方法的计算复杂度取得了更优性能。论文被接收后,实现代码将发布于https://github.com/Elman295/CSSM。
论文及项目相关链接
Summary
基于状态空间模型的Vision Mamba架构解决了传统卷积神经网络(ConvNets)和视觉转换器(ViT)在变化检测中的局限性。文章提出了专为变化检测设计的“变化状态空间模型”(CSSM),通过专注于双时态图像之间的相关变化,有效过滤掉无关信息,从而提高计算效率并保持高检测性能和抗输入干扰的稳健性。在三个基准数据集上的评估表明,该模型在减少计算复杂性的同时,优于ConvNets、ViTs和基于Mamba的模型。
Key Takeaways
- Vision Mamba架构解决了ConvNets和ViT在变化检测中的局限性。
- 变化状态空间模型(CSSM)专注于双时态图像之间的相关变化,过滤无关信息。
- CSSM提高了计算效率,同时保持了高检测性能和稳健性。
- CSSM在三个基准数据集上的表现优于ConvNets、ViTs和基于Mamba的模型。
- 该模型减少了网络参数数量。
- 模型实现的开源代码将在https://github.com/Elman295/CSSM提供。
点此查看论文截图