⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用。
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace免费体验
2025-10-18 更新
Towards Generalist Intelligence in Dentistry: Vision Foundation Models for Oral and Maxillofacial Radiology
Authors:Xinrui Huang, Fan Xiao, Dongming He, Anqi Gao, Dandan Li, Xiaofan Zhang, Shaoting Zhang, Xudong Wang
Oral and maxillofacial radiology plays a vital role in dental healthcare, but radiographic image interpretation is limited by a shortage of trained professionals. While AI approaches have shown promise, existing dental AI systems are restricted by their single-modality focus, task-specific design, and reliance on costly labeled data, hindering their generalization across diverse clinical scenarios. To address these challenges, we introduce DentVFM, the first family of vision foundation models (VFMs) designed for dentistry. DentVFM generates task-agnostic visual representations for a wide range of dental applications and uses self-supervised learning on DentVista, a large curated dental imaging dataset with approximately 1.6 million multi-modal radiographic images from various medical centers. DentVFM includes 2D and 3D variants based on the Vision Transformer (ViT) architecture. To address gaps in dental intelligence assessment and benchmarks, we introduce DentBench, a comprehensive benchmark covering eight dental subspecialties, more diseases, imaging modalities, and a wide geographical distribution. DentVFM shows impressive generalist intelligence, demonstrating robust generalization to diverse dental tasks, such as disease diagnosis, treatment analysis, biomarker identification, and anatomical landmark detection and segmentation. Experimental results indicate DentVFM significantly outperforms supervised, self-supervised, and weakly supervised baselines, offering superior generalization, label efficiency, and scalability. Additionally, DentVFM enables cross-modality diagnostics, providing more reliable results than experienced dentists in situations where conventional imaging is unavailable. DentVFM sets a new paradigm for dental AI, offering a scalable, adaptable, and label-efficient model to improve intelligent dental healthcare and address critical gaps in global oral healthcare.
口腔颌面放射学在牙科健康护理中扮演着至关重要的角色,但由于训练有素的专业人员短缺,放射影像解读受到限制。虽然人工智能方法已经展现出潜力,但现有的牙科人工智能系统受到单一模态、特定任务设计和依赖昂贵标注数据的限制,阻碍了它们在各种临床场景中的泛化能力。为了应对这些挑战,我们推出了DentVFM,这是首个专为牙科设计的视觉基础模型(VFM)家族。DentVFM为各种牙科应用生成与任务无关的视觉表示,并使用自监督学习在DentVista(一个大型精选牙科影像数据集)上进行训练,该数据集包含来自不同医疗中心的约160万张多模态放射图像。DentVFM包括基于Vision Transformer(ViT)架构的2D和3D变体。为了填补牙科智能评估和基准测试的空白,我们推出了DentBench,这是一个全面的基准,涵盖八个牙科亚专科、更多疾病、多种成像方式和广泛的地理分布。DentVFM展现了令人印象深刻的通用智能,对各种牙科任务(如疾病诊断、治疗分析、生物标志物识别以及解剖标志点检测和分割)表现出稳健的泛化能力。实验结果表明,DentVFM显著优于监督、自监督和弱监督基线,具有更优的泛化能力、标注效率和可扩展性。此外,DentVFM能够实现跨模态诊断,在常规成像无法获取的情况下提供比经验丰富的牙医更可靠的结果。DentVFM为牙科人工智能确立了新的范式,提供了一个可扩展、可适配且标注高效的模型,以改善智能牙科医疗,并填补全球口腔健康服务中的关键缺口。
论文及项目相关链接
摘要
口腔颌面放射学在口腔健康护理中发挥着重要作用,但放射图像解读受限于专业人才的短缺。虽然人工智能方法已展现出潜力,但现有口腔AI系统受到单一模式关注点、特定任务设计和依赖昂贵标注数据的限制,难以在不同临床情景中通用化。为解决这些挑战,我们推出了DentVFM,这是首款为牙科设计的视觉基础模型家族(VFMs)。DentVFM为广泛的牙科应用生成任务无关的视觉表征,并使用自我监督学习在大型牙科影像数据集DentVista上进行训练,包含来自不同医疗中心的约160万多种模态放射图像。DentVFM包括基于视觉转换器(ViT)架构的二维和三维变体。为解决牙科智能评估和基准测试的空白,我们推出了DentBench,这是一个涵盖八个牙科专科、更多疾病、成像方式和广泛地理分布的全面基准测试。DentVFM展现出令人印象深刻的通用智能,证明其在各种牙科任务(如疾病诊断、治疗分析、生物标志物识别以及解剖地标检测和分割)上的稳健泛化能力。实验结果表明,DentVFM显著优于监督式、自我监督式和弱监督式基线,提供优越的泛化能力、标签效率和可扩展性。此外,DentVFM可实现跨模态诊断,在常规成像无法使用的情况下提供更可靠的结果。DentVFM为口腔AI设定了新的范式,提供了一个可扩展、可适应和标签效率高的模型,以改善智能口腔健康护理并解决全球口腔健康护理中的关键差距。
要点摘要
- 口腔颌面放射学在牙科健康护理中极为重要,但放射图像解读面临专业人才短缺的问题。
- 现有AI系统在牙科应用中存在局限性,如单一模式关注点、特定任务设计和依赖大量标注数据。
- DentVFM是首款为牙科设计的视觉基础模型家族,支持多种牙科应用并具备自我监督学习能力。
- DentVFM利用大型牙科影像数据集DentVista进行训练,包含多种模态的放射图像。
- DentVFM包括基于视觉转换器架构的二维和三维模型变体。
- DentBench填补了牙科智能评估和基准测试的空白,涵盖多个牙科领域、疾病、成像方式和地理分布。
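作为补充,下面给出一个"冻结基础模型 + 线性探针"的极简代码草图,用来说明摘要中强调的标签高效下游适配方式;其中的 encoder、特征维度等均为假设示例,并非 DentVFM 的官方实现。

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """冻结的视觉编码器 + 可训练线性分类头(标签高效适配的常见做法)。"""
    def __init__(self, encoder: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # 冻结主干,只训练线性头
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                 # 主干不参与反向传播
            feats = self.encoder(x)           # 假设输出形状为 [B, feat_dim]
        return self.head(feats)

if __name__ == "__main__":
    # encoder 可替换为任意预训练 ViT,此处用展平操作占位
    dummy_encoder = nn.Flatten()
    probe = LinearProbe(dummy_encoder, feat_dim=3 * 16 * 16, num_classes=8)
    logits = probe(torch.randn(2, 3, 16, 16))
    print(logits.shape)                       # torch.Size([2, 8])
```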
点此查看论文截图

Invited Paper: BitMedViT: Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
Authors:Mikolaj Walczak, Uttej Kallakuri, Edward Humes, Xiaomin Lin, Tinoosh Mohsenin
Vision Transformers (ViTs) have demonstrated strong capabilities in interpreting complex medical imaging data. However, their significant computational and memory demands pose challenges for deployment in real-time, resource-constrained mobile and wearable devices used in clinical environments. We introduce, BiTMedViT, a new class of Edge ViTs serving as medical AI assistants that perform structured analysis of medical images directly on the edge. BiTMedViT utilizes ternary- quantized linear layers tailored for medical imaging and com- bines a training procedure with multi-query attention, preserving stability under ternary weights with low-precision activations. Furthermore, BiTMedViT employs task-aware distillation from a high-capacity teacher to recover accuracy lost due to extreme quantization. Lastly, we also present a pipeline that maps the ternarized ViTs to a custom CUDA kernel for efficient memory bandwidth utilization and latency reduction on the Jetson Orin Nano. Finally, BiTMedViT achieves 86% diagnostic accuracy (89% SOTA) on MedMNIST across 12 datasets, while reducing model size by 43x, memory traffic by 39x, and enabling 16.8 ms inference at an energy efficiency up to 41x that of SOTA models at 183.62 GOPs/J on the Orin Nano. Our results demonstrate a practical and scientifically grounded route for extreme-precision medical imaging ViTs deployable on the edge, narrowing the gap between algorithmic advances and deployable clinical tools.
视觉Transformer(ViT)在解读复杂医学影像数据方面表现出强大的能力。然而,其庞大的计算和内存需求,给在临床环境中使用的资源受限的实时移动与可穿戴设备上的部署带来了挑战。我们提出了BiTMedViT,这是一类新的边缘ViT,作为医疗人工智能助手,直接在边缘设备上对医学图像进行结构化分析。BiTMedViT采用为医学影像量身定制的三元量化线性层,并将训练流程与多查询注意力(multi-query attention)相结合,在三元权重与低精度激活下保持稳定性。此外,BiTMedViT利用来自高容量教师模型的任务感知蒸馏,以恢复极端量化造成的精度损失。我们还提供了一套流程,将三元化的ViT映射到自定义CUDA内核,以在Jetson Orin Nano上高效利用内存带宽并降低延迟。最终,BiTMedViT在MedMNIST的12个数据集上实现了86%的诊断准确率(SOTA为89%),同时将模型大小缩小43倍、内存流量减少39倍,并在Orin Nano上以183.62 GOPs/J的能效(最高可达SOTA模型的41倍)实现16.8毫秒的推理。我们的结果展示了一条切实可行且有科学依据的路径,使极低精度的医学影像ViT能够部署在边缘设备上,缩小了算法进展与可部署临床工具之间的差距。
论文及项目相关链接
PDF Accepted at 2025 IEEE/ACM International Conf. on Computer-Aided Design (ICCAD) Oct. 26-30 2025, Munich, DE
Summary
本文介绍了针对医疗影像处理的边缘人工智能助手——BiTMedViT的设计与实现。针对资源受限的环境如移动设备或可穿戴设备,它通过利用特化量化技术和优化训练过程,实现了在边缘设备上直接进行医疗图像的结构化分析。同时,通过任务感知蒸馏技术恢复因量化导致的精度损失,并采用CUDA内核映射技术提高内存带宽利用率并降低延迟。在多个数据集上测试显示,BiTMedViT达到了高诊断准确率,并显著减少了模型大小、内存流量,实现了高效能源利用和快速推理。这一技术的提出,为将医疗影像处理领域的算法进展应用于临床实践开辟了切实有效的途径。
Key Takeaways
- BiTMedViT是一种用于医疗影像处理的边缘设备上的新型视觉转换器(ViT)。
- 它采用特化量化技术来减少计算需求和内存占用,适用于资源受限的环境。
- 通过多查询注意训练程序和任务感知蒸馏技术优化性能。
- CUDA内核映射技术用于提高内存效率和降低推理延迟。
- BiTMedViT在多个数据集上实现高诊断准确率,显著减少模型大小和内存流量,具有高效能源利用和快速推理性能。
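下面用几行 PyTorch 代码示意"三元权重量化"的基本思路(权重被约束为 {-1, 0, +1} 乘以逐通道缩放因子)。阈值系数 0.7 与缩放方式是三元量化的常见经验做法,并非 BiTMedViT 的确切量化配方,仅供理解概念。

```python
import torch

def ternarize(w: torch.Tensor, thr_ratio: float = 0.7):
    """把权重量化为 {-1, 0, +1} 乘以逐行缩放因子 alpha。
    w: [out_features, in_features]"""
    # 每个输出通道单独取阈值:delta = thr_ratio * mean(|w|)
    delta = thr_ratio * w.abs().mean(dim=1, keepdim=True)
    mask = (w.abs() > delta).float()          # 非零位置
    t = torch.sign(w) * mask                  # 三元值 {-1, 0, +1}
    # alpha 取非零位置的平均幅值,使 alpha*t 逼近 w
    alpha = (w.abs() * mask).sum(dim=1, keepdim=True) / mask.sum(dim=1, keepdim=True).clamp(min=1)
    return t, alpha

if __name__ == "__main__":
    w = torch.randn(4, 8)
    t, alpha = ternarize(w)
    print(t.unique())                         # 通常为 tensor([-1., 0., 1.])
    print((alpha * t - w).abs().mean())       # 量化误差
```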
点此查看论文截图






Multi-Scale High-Resolution Logarithmic Grapher Module for Efficient Vision GNNs
Authors:Mustafa Munir, Alex Zhang, Radu Marculescu
Vision graph neural networks (ViG) have demonstrated promise in vision tasks as a competitive alternative to conventional convolutional neural nets (CNN) and transformers (ViTs); however, common graph construction methods, such as k-nearest neighbor (KNN), can be expensive on larger images. While methods such as Sparse Vision Graph Attention (SVGA) have shown promise, SVGA’s fixed step scale can lead to over-squashing and missing multiple connections to gain the same information that could be gained from a long-range link. Through this observation, we propose a new graph construction method, Logarithmic Scalable Graph Construction (LSGC) to enhance performance by limiting the number of long-range links. To this end, we propose LogViG, a novel hybrid CNN-GNN model that utilizes LSGC. Furthermore, inspired by the successes of multi-scale and high-resolution architectures, we introduce and apply a high-resolution branch and fuse features between our high-resolution and low-resolution branches for a multi-scale high-resolution Vision GNN network. Extensive experiments show that LogViG beats existing ViG, CNN, and ViT architectures in terms of accuracy, GMACs, and parameters on image classification and semantic segmentation tasks. Our smallest model, Ti-LogViG, achieves an average top-1 accuracy on ImageNet-1K of 79.9% with a standard deviation of 0.2%, 1.7% higher average accuracy than Vision GNN with a 24.3% reduction in parameters and 35.3% reduction in GMACs. Our work shows that leveraging long-range links in graph construction for ViGs through our proposed LSGC can exceed the performance of current state-of-the-art ViGs. Code is available at https://github.com/mmunir127/LogViG-Official.
视觉图神经网络(ViG)作为传统卷积神经网络(CNN)和Transformer(ViT)的有竞争力的替代方案,已在视觉任务中展现出潜力。然而,常见的图构建方法(如k近邻,KNN)在较大图像上代价高昂。虽然稀疏视觉图注意力(SVGA)等方法已显示出潜力,但SVGA固定的步长尺度可能导致过度压缩(over-squashing),并且需要多跳连接才能获得本可由一条长距离链接直接获得的信息。基于这一观察,我们提出了一种新的图构建方法——对数可扩展图构建(LSGC),通过限制长距离链接的数量来提升性能。为此,我们提出了LogViG,这是一种利用LSGC的新型CNN-GNN混合模型。此外,受多尺度和高分辨率架构成功的启发,我们引入了一个高分辨率分支,并在高分辨率与低分辨率分支之间融合特征,构成一个多尺度高分辨率视觉GNN网络。大量实验表明,LogViG在图像分类和语义分割任务上的准确率、GMACs和参数量方面均超越了现有的ViG、CNN和ViT架构。我们最小的模型Ti-LogViG在ImageNet-1K上的平均top-1准确率为79.9%(标准差0.2%),平均准确率比Vision GNN高1.7%,同时参数减少24.3%、GMACs减少35.3%。我们的工作表明,通过所提出的LSGC在ViG的图构建中利用长距离链接,可以超越当前最先进ViG的性能。代码可在 https://github.com/mmunir127/LogViG-Official 获取。
论文及项目相关链接
PDF Published in the Proceedings of the Third Learning on Graphs Conference (LoG 2024)
Summary
本文介绍了LogViG,这是一种新型的混合CNN-GNN模型,它通过采用对数可伸缩图构建(LSGC)方法来提高性能。LogViG在图像分类和语义分割任务上的表现超过了现有的ViG、CNN和ViT架构。其中最小的模型Ti-LogViG在ImageNet-1K上的平均top-1准确率达到了79.9%,较Vision GNN有更高的平均准确率和显著的参数与计算量减少。这表明利用长程链接的图构建方法可以通过LSGC超越当前最先进的ViG性能。
Key Takeaways
- LogViG是一个新型的混合CNN-GNN模型,旨在提高视觉任务的性能。
- LogViG提出了一种新的图构建方法——对数可伸缩图构建(LSGC),以限制长程链接的数量。
- LogViG在图像分类和语义分割任务上的表现优于现有的ViG、CNN和ViT架构。
- Ti-LogViG模型在ImageNet-1K上的表现突出,达到了79.9%的平均top-1准确率。
- Ti-LogViG较Vision GNN有更高的平均准确率,同时显著减少了参数和计算量。
- LogViG的工作表明,通过LSGC利用长程链接可以超越当前最先进的ViG性能。
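下面是"对数尺度长程连接"这一图构建思路的极简示意:每个节点只连接距离为 1、2、4、8…… 的邻居,长程链接数量随规模按对数增长。示例在一维 token 序列上演示,并非论文中 LSGC 在二维图像网格上的精确定义。

```python
def log_scale_edges(num_tokens: int):
    """为每个 token 建立到距离 2^k 处邻居的双向边。"""
    edges = []
    for i in range(num_tokens):
        d = 1
        while d < num_tokens:
            if i + d < num_tokens:
                edges.append((i, i + d))
            if i - d >= 0:
                edges.append((i, i - d))
            d *= 2                            # 1, 2, 4, 8, ... 对数级数量的连接
    return edges

if __name__ == "__main__":
    e = log_scale_edges(16)
    # 每个节点的出度约为 2*log2(N),而 KNN/全连接会随 N 线性甚至平方增长
    deg0 = sum(1 for a, _ in e if a == 0)
    print(len(e), deg0)
```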
点此查看论文截图






Hybrid Explanation-Guided Learning for Transformer-Based Chest X-Ray Diagnosis
Authors:Shelley Zixin Shu, Haozhe Luo, Alexander Poellinger, Mauricio Reyes
Transformer-based deep learning models have demonstrated exceptional performance in medical imaging by leveraging attention mechanisms for feature representation and interpretability. However, these models are prone to learning spurious correlations, leading to biases and limited generalization. While human-AI attention alignment can mitigate these issues, it often depends on costly manual supervision. In this work, we propose a Hybrid Explanation-Guided Learning (H-EGL) framework that combines self-supervised and human-guided constraints to enhance attention alignment and improve generalization. The self-supervised component of H-EGL leverages class-distinctive attention without relying on restrictive priors, promoting robustness and flexibility. We validate our approach on chest X-ray classification using the Vision Transformer (ViT), where H-EGL outperforms two state-of-the-art Explanation-Guided Learning (EGL) methods, demonstrating superior classification accuracy and generalization capability. Additionally, it produces attention maps that are better aligned with human expertise.
基于Transformer的深度学习模型通过利用注意力机制进行特征表示和解释,已在医学影像领域展现出卓越性能。然而,这些模型容易学习虚假关联(spurious correlations),导致偏差和有限的泛化能力。虽然人类与AI的注意力对齐可以缓解这些问题,但它往往依赖昂贵的人工监督。在这项工作中,我们提出了混合解释引导学习(H-EGL)框架,它结合自监督约束与人类引导约束,以增强注意力对齐并改善泛化能力。H-EGL的自监督部分利用具有类别区分性的注意力,不依赖限制性先验,从而提升稳健性和灵活性。我们在使用Vision Transformer(ViT)进行胸部X光分类的任务上验证了该方法:H-EGL优于两种最先进的解释引导学习(EGL)方法,展现出更高的分类精度和泛化能力。此外,其生成的注意力图与人类专家知识更为一致。
论文及项目相关链接
PDF Accepted by iMIMIC at MICCAI 2025
Summary
本文提出了一个混合解释引导学习(H-EGL)框架,结合了自监督学习和人类引导约束,提高注意力对齐和改进了概括能力。在基于Vision Transformer的胸部X光分类任务上验证了该方法,表现出优于两种最新解释引导学习(EGL)方法的分类准确性和概括能力。
Key Takeaways
- Transformer模型在医疗成像领域表现优异,但存在学习虚假关联的问题,导致偏见和有限的概括能力。
- 人机注意力对齐可缓解这些问题,但需要昂贵的人工监督。
- 本文提出了一个混合解释引导学习(H-EGL)框架,结合了自监督学习和人类引导约束,以增强注意力对齐和提高概括能力。
- H-EGL框架利用类区别注意力,不依赖限制性先验,促进了稳健性和灵活性。
- 在胸部X光分类任务上验证了H-EGL框架的有效性,优于两种最新EGL方法。
- H-EGL框架产生的注意力图与人类专业知识更对齐。
- H-EGL为医疗成像领域提供了一种新的模型训练策略,有助于提高模型的性能和解释性。
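下面给出一个混合解释引导训练目标的示意性写法:分类交叉熵 + 与人工标注掩码对齐的注意力项 + 不依赖标注的类间注意力区分项。各项的具体形式与权重均为假设,仅用来说明"自监督约束 + 人类引导约束"的组合方式,并非论文的精确损失。

```python
import torch
import torch.nn.functional as F

def hybrid_egl_loss(logits, labels, attn_maps, human_masks=None,
                    lam_human: float = 1.0, lam_self: float = 0.1):
    """logits: [B, C];attn_maps: [B, C, H, W] 每个类别一张注意力图;
    human_masks: [B, 1, H, W] 可选的人工标注关注区域(0/1)。"""
    loss = F.cross_entropy(logits, labels)

    b = torch.arange(labels.size(0))
    target_attn = attn_maps[b, labels]                      # 真实类别的注意力图 [B, H, W]

    if human_masks is not None:
        # 人类引导项:鼓励真实类别的注意力落在人工标注区域内
        inside = (target_attn * human_masks.squeeze(1)).flatten(1).sum(-1)
        total = target_attn.flatten(1).sum(-1).clamp(min=1e-6)
        loss = loss + lam_human * (1.0 - inside / total).mean()

    # 自监督项(无需标注):鼓励真实类别注意力区别于所有类别的平均注意力
    mean_attn = attn_maps.mean(dim=1)                       # [B, H, W]
    sim = F.cosine_similarity(target_attn.flatten(1), mean_attn.flatten(1), dim=1)
    loss = loss + lam_self * sim.mean()
    return loss

if __name__ == "__main__":
    B, C, H, W = 2, 3, 8, 8
    loss = hybrid_egl_loss(torch.randn(B, C), torch.tensor([0, 2]),
                           torch.rand(B, C, H, W),
                           torch.randint(0, 2, (B, 1, H, W)).float())
    print(loss)
```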
点此查看论文截图


Hybrid Vision Transformer and Quantum Convolutional Neural Network for Image Classification
Authors:Mingzhu Wang, Yun Shang
Quantum machine learning (QML) holds promise for computational advantage, yet progress on real-world tasks is hindered by classical preprocessing and noisy devices. We introduce ViT-QCNN-FT, a hybrid framework that integrates a fine-tuned Vision Transformer with a quantum convolutional neural network (QCNN) to compress high-dimensional images into features suited for noisy intermediate-scale quantum (NISQ) devices. By systematically probing entanglement, we show that ansatzes with uniformly distributed entanglement entropy consistently deliver superior non-local feature fusion and state-of-the-art accuracy (99.77% on CIFAR-10). Surprisingly, quantum noise emerges as a double-edged factor: in some cases, it enhances accuracy (+2.71% under amplitude damping). Strikingly, substituting the QCNN with classical counterparts of equal parameter count leads to a dramatic 29.36% drop, providing unambiguous evidence of quantum advantage. Our study establishes a principled pathway for co-designing classical and quantum architectures, pointing toward practical QML capable of tackling complex, high-dimensional learning tasks.
量子机器学习(QML)有望带来计算优势,但其在真实任务上的进展受到经典预处理和噪声设备的制约。我们提出了ViT-QCNN-FT这一混合框架,它将经过微调的视觉Transformer与量子卷积神经网络(QCNN)相结合,将高维图像压缩为适合含噪中等规模量子(NISQ)设备的特征。通过系统地探测纠缠,我们发现纠缠熵分布均匀的拟设(ansatz)始终能带来更优的非局部特征融合和最先进的准确率(在CIFAR-10上达到99.77%)。令人惊讶的是,量子噪声是一把双刃剑:在某些情况下它反而提高了准确率(在振幅阻尼下提升2.71%)。更引人注目的是,用参数量相同的经典模块替代QCNN会导致准确率大幅下降29.36%,为量子优势提供了明确证据。我们的研究为协同设计经典与量子架构建立了有原则的路径,指向能够处理复杂高维学习任务的实用QML。
论文及项目相关链接
Summary
量子机器学习(QML)具备带来计算优势的潜力,但在现实任务上的进展受到经典预处理和噪声设备的阻碍。本研究提出ViT-QCNN-FT混合框架,将微调后的视觉Transformer与量子卷积神经网络(QCNN)结合,把高维图像压缩为适合含噪中等规模量子(NISQ)设备的特征。通过系统探测纠缠发现,纠缠熵分布均匀的拟设(ansatz)能稳定带来更优的非局部特征融合与最先进的准确率(在CIFAR-10上为99.77%)。令人惊讶的是,量子噪声是把双刃剑:某些情况下反而提升准确率(振幅阻尼下提升2.71%)。显著的是,用参数量相同的经典网络替代QCNN会导致性能下降29.36%,为量子优势提供了明确证据。本研究为协同设计经典与量子架构建立了原则性途径,为能够应对复杂高维学习任务的实用QML指明了方向。
Key Takeaways
- 量子机器学习面临预处理和噪声问题。
- ViT-QCNN-FT混合框架将视觉Transformer与量子卷积神经网络结合,压缩图像以适应NISQ设备。
- 纠缠熵的分布影响特征融合和准确性。
- 在某些情况下,量子噪声可以提高准确性。
- 用经典网络替代QCNN会导致显著的性能下降。
- 此研究证明了量子优势的存在。
点此查看论文截图



Chimera: State Space Models Beyond Sequences
Authors:Aakash Lahoti, Tanya Marwah, Ratish Puduppully, Albert Gu
Transformer-based deep learning methods have become the standard approach for modeling diverse data such as sequences, images, and graphs. These methods rely on self-attention, which treats data as an unordered set of elements. This ignores the neighborhood structure or graph topology of the data and requires inductive biases–such as position embeddings in sequences and images, or random walks in graphs–to incorporate topology. However, designing such task-specific biases requires significant effort and can introduce side effects that hinder generalization. We introduce Chimera, a unified model that directly incorporates data topology in a principled way, removing the need for domain-specific biases. The key idea is that state space models–which naturally do not require position embeddings–can be generalized to capture any graph topology. Our experiments show that Chimera achieves strong performance across language, vision, and graph domains, outperforming BERT on GLUE by 0.7 points, ViT on ImageNet-1k by 2.6%, and all baselines on the Long Range Graph Benchmark. We further propose algorithmic optimizations to improve Chimera’s efficiency: (1) for Directed Acyclic Graphs, Chimera can be implemented as a linear-time recurrence; (2) for general graphs, a simple mathematical relaxation achieves Transformer’s quadratic complexity without domain-specific heuristics. These results validate Chimera’s core contribution and support the idea that data topology is a powerful inductive bias across modalities.
基于Transformer的深度学习方法已成为对序列、图像和图等多样化数据建模的标准方法。这些方法依赖自注意力机制,将数据视为无序的元素集合,从而忽略了数据的邻域结构或图拓扑,需要借助归纳偏置(例如序列和图像中的位置嵌入,或图中的随机游走)来引入拓扑信息。然而,设计此类任务特定的偏置需要大量工作,并可能带来妨碍泛化的副作用。我们提出了Chimera,这是一种以有原则的方式直接融入数据拓扑的统一模型,从而无需领域特定偏置。其核心想法是:天然不需要位置嵌入的状态空间模型可以被推广到捕获任意图拓扑。实验表明,Chimera在语言、视觉和图领域均表现强劲:在GLUE上比BERT高0.7分,在ImageNet-1k上比ViT高2.6%,并在Long Range Graph Benchmark上超越所有基线。我们进一步提出了提升Chimera效率的算法优化:(1)对于有向无环图(DAG),Chimera可实现为线性时间的递归;(2)对于一般图,一个简单的数学松弛即可在不依赖领域特定启发式的情况下达到Transformer的二次复杂度。这些结果验证了Chimera的核心贡献,并支持"数据拓扑是一种强大的跨模态归纳偏置"这一观点。
论文及项目相关链接
PDF Published in TMLR (October 2025); 22 Pages, 6 Figures, 11 Tables
Summary
本文指出了基于Transformer的方法的局限:它们忽略数据的拓扑结构,需要依赖任务特定的归纳偏置。作者提出了新模型Chimera,能够直接整合数据拓扑,减少对领域特定偏置的需求。实验表明,Chimera在语言、视觉和图等多个领域都实现了强劲性能。作者还提出了算法优化措施以提高Chimera的效率。这些结果说明数据拓扑是一种强大的跨模态归纳偏置。
Key Takeaways
- 基于Transformer的方法依赖自注意力,把数据视为无序元素集合,忽略了邻域结构或图拓扑。
- 为引入拓扑信息通常需要任务特定的归纳偏置(如序列/图像中的位置嵌入、图中的随机游走),设计成本高且可能妨碍泛化。
- Chimera将状态空间模型推广到任意图拓扑,以统一且有原则的方式直接利用数据拓扑,无需领域特定偏置。
- 在语言、视觉和图三个领域均表现强劲:GLUE上超过BERT 0.7分,ImageNet-1k上超过ViT 2.6%,并在Long Range Graph Benchmark上超越所有基线。
- 效率优化:对有向无环图可实现线性时间递归;对一般图通过简单的数学松弛达到与Transformer相同的二次复杂度,且无需领域特定启发式。
- 结果支持"数据拓扑是一种强大的跨模态归纳偏置"这一观点。
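针对"对有向无环图(DAG)可实现线性时间递归"这一点,下面给出一个按拓扑序传播隐藏状态的极简示意(标量状态、求和聚合),整体复杂度为 O(V+E);它只用来说明该思路,并非 Chimera 模型本身。

```python
import numpy as np

def dag_linear_recurrence(edges, x, A=0.9, B=1.0):
    """按拓扑序计算 h[v] = A * sum_{u->v} h[u] + B * x[v]。
    edges: (u, v) 有向边列表(假设构成 DAG);x: 每个节点的输入标量。"""
    n = len(x)
    children = [[] for _ in range(n)]
    indeg = [0] * n
    for u, v in edges:
        children[u].append(v)
        indeg[v] += 1

    # Kahn 拓扑排序,边只被访问一次,因而是线性时间
    queue = [v for v in range(n) if indeg[v] == 0]
    h = np.zeros(n)
    agg = np.zeros(n)                       # 累积来自父节点的状态
    while queue:
        u = queue.pop()
        h[u] = A * agg[u] + B * x[u]
        for v in children[u]:
            agg[v] += h[u]
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return h

if __name__ == "__main__":
    # 链式图 0->1->2->3 退化为普通的序列状态空间递归
    print(dag_linear_recurrence([(0, 1), (1, 2), (2, 3)], np.array([1.0, 0.0, 0.0, 0.0])))
```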
点此查看论文截图


Evaluating the Explainability of Vision Transformers in Medical Imaging
Authors:Leili Barekatain, Ben Glocker
Understanding model decisions is crucial in medical imaging, where interpretability directly impacts clinical trust and adoption. Vision Transformers (ViTs) have demonstrated state-of-the-art performance in diagnostic imaging; however, their complex attention mechanisms pose challenges to explainability. This study evaluates the explainability of different Vision Transformer architectures and pre-training strategies - ViT, DeiT, DINO, and Swin Transformer - using Gradient Attention Rollout and Grad-CAM. We conduct both quantitative and qualitative analyses on two medical imaging tasks: peripheral blood cell classification and breast ultrasound image classification. Our findings indicate that DINO combined with Grad-CAM offers the most faithful and localized explanations across datasets. Grad-CAM consistently produces class-discriminative and spatially precise heatmaps, while Gradient Attention Rollout yields more scattered activations. Even in misclassification cases, DINO with Grad-CAM highlights clinically relevant morphological features that appear to have misled the model. By improving model transparency, this research supports the reliable and explainable integration of ViTs into critical medical diagnostic workflows.
在医学影像领域,理解模型决策至关重要,因为可解释性直接影响临床信任与采用。视觉Transformer(ViT)在诊断影像上展现了最先进的性能;然而,其复杂的注意力机制给可解释性带来了挑战。本研究使用Gradient Attention Rollout和Grad-CAM评估了不同视觉Transformer架构和预训练策略(ViT、DeiT、DINO和Swin Transformer)的可解释性。我们在两项医学影像任务上进行了定量和定性分析:外周血细胞分类和乳腺超声图像分类。结果表明,DINO与Grad-CAM相结合在各数据集上提供了最忠实且最局部化的解释。Grad-CAM持续产生具有类别判别力且空间精确的热图,而Gradient Attention Rollout产生的激活更为分散。即使在误分类的情况下,DINO与Grad-CAM也能突出显示似乎误导了模型的临床相关形态特征。通过提高模型透明度,本研究支持将ViT可靠且可解释地集成到关键的医学诊断工作流程中。
论文及项目相关链接
PDF Accepted at Workshop on Interpretability of Machine Intelligence in Medical Image Computing at MICCAI 2025
Summary
本文研究了Vision Transformer(ViT)在医学成像中的解释性,评估了不同ViT架构和预训练策略(包括ViT、DeiT、DINO和Swin Transformer)的解释性。通过Gradient Attention Rollout和Grad-CAM进行定量和定性分析,发现DINO结合Grad-CAM能提供最忠实和局部化的解释。此研究提高了模型的透明度,为ViT可靠地融入关键医疗诊断流程提供了支持。
Key Takeaways
- Vision Transformers (ViTs) 在医学成像中展现先进性能,但其复杂注意力机制对解释性构成挑战。
- 研究评估了ViT、DeiT、DINO和Swin Transformer等不同Vision Transformer架构和预训练策略的解释性。
- 采用Gradient Attention Rollout和Grad-CAM进行定量和定性分析。
- DINO结合Grad-CAM提供跨数据集最忠实和局部化的解释。
- Grad-CAM能生成类别判别性和空间精确性的热图,而Gradient Attention Rollout产生的激活更分散。
- 在误分类情况下,DINO与Grad-CAM能突出显示临床上相关的形态特征,这些特征似乎误导了模型。
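下面是 Grad-CAM 作用于 ViT patch token 的通用计算草图:用目标类得分对 token 激活的梯度做全局平均作为通道权重,加权求和后取 ReLU 并整形为热图。张量形状与接口均为假设,仅演示计算本身。

```python
import torch
import torch.nn.functional as F

def vit_grad_cam(tokens: torch.Tensor, grads: torch.Tensor, grid_hw):
    """tokens / grads: [B, N, D],为去掉 class token 后的 patch 表示及其关于目标类得分的梯度;
    grid_hw: (H, W) patch 网格大小,满足 H*W == N。返回 [B, H, W] 热图。"""
    weights = grads.mean(dim=1, keepdim=True)        # [B, 1, D] 每个通道的全局平均梯度
    cam = F.relu((weights * tokens).sum(dim=-1))     # [B, N] 加权求和后取正部
    cam = cam.reshape(cam.size(0), *grid_hw)
    # 归一化到 [0, 1] 便于可视化
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)
    cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)
    return cam

if __name__ == "__main__":
    B, N, D = 1, 196, 768                            # 14x14 patch 网格
    heatmap = vit_grad_cam(torch.randn(B, N, D), torch.randn(B, N, D), (14, 14))
    print(heatmap.shape, float(heatmap.max()))
```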
点此查看论文截图





SeeingSounds: Learning Audio-to-Visual Alignment via Text
Authors:Simone Carnemolla, Matteo Pennisi, Chiara Russo, Simone Palazzo, Daniela Giordano, Concetto Spampinato
We introduce SeeingSounds, a lightweight and modular framework for audio-to-image generation that leverages the interplay between audio, language, and vision-without requiring any paired audio-visual data or training on visual generative models. Rather than treating audio as a substitute for text or relying solely on audio-to-text mappings, our method performs dual alignment: audio is projected into a semantic language space via a frozen language encoder, and, contextually grounded into the visual domain using a vision-language model. This approach, inspired by cognitive neuroscience, reflects the natural cross-modal associations observed in human perception. The model operates on frozen diffusion backbones and trains only lightweight adapters, enabling efficient and scalable learning. Moreover, it supports fine-grained and interpretable control through procedural text prompt generation, where audio transformations (e.g., volume or pitch shifts) translate into descriptive prompts (e.g., “a distant thunder”) that guide visual outputs. Extensive experiments across standard benchmarks confirm that SeeingSounds outperforms existing methods in both zero-shot and supervised settings, establishing a new state of the art in controllable audio-to-visual generation.
我们介绍了SeeingSounds,这是一个用于音频到图像生成的轻量级和模块化框架,它利用音频、语言和视觉之间的相互作用,而无需任何配对音视频数据或在视觉生成模型上进行训练。我们的方法不是将音频视为文本的替代品或仅依赖于音频到文本的映射,而是执行双重对齐:音频通过冻结的语言编码器投射到语义语言空间,并使用视觉语言模型在视觉上将其上下文化。这种方法受到认知神经科学的启发,反映了人类感知中观察到的自然跨模态关联。该模型在冻结的扩散骨干网上运行,仅训练轻量级适配器,实现了高效和可扩展的学习。此外,它通过程序性文本提示生成支持精细和可解释的控制,其中音频转换(例如音量或音调变化)转化为描述性提示(例如,“远处雷声”),引导视觉输出。在标准基准测试上的广泛实验证实,SeeingSounds在零样本和受监督环境中均优于现有方法,在可控的音频到视觉生成领域树立了新的技术标杆。
论文及项目相关链接
PDF accepted to ACM Multimedia Asia 2025
Summary
本文介绍了SeeingSounds,这是一个轻量级、模块化的音频到图像生成框架,它利用音频、语言和视觉之间的相互作用,无需任何配对的音视频数据,也无需训练视觉生成模型。该方法通过冻结的语言编码器将音频投影到语义语言空间,并使用视觉语言模型将其在上下文中落地到视觉领域。该方法受认知神经科学启发,反映了人类感知中观察到的自然跨模态关联。模型在冻结的扩散主干上运行,只训练轻量级适配器,实现高效且可扩展的学习。此外,它支持通过程序化文本提示生成进行精细且可解释的控制,其中音频变换(例如音量或音调变化)被转化为描述性提示,引导视觉输出。在标准基准测试上的广泛实验证实,SeeingSounds在零样本和监督设置中都优于现有方法,在可控音频到视觉生成领域树立了新的技术标杆。
Key Takeaways
- SeeingSounds是一个轻量级、模块化的音频到图像生成框架。
- 该方法利用音频、语言和视觉的相互作用,无需配对音视频数据或视觉生成模型的训练。
- 通过冻结的语言编码器和视觉语言模型实现音频的语义表示与视觉落地。
- 该方法受到认知神经科学的启发,反映人类感知中的自然跨模态关联。
- 模型实现高效和可扩展的学习,只需训练轻量级适配器。
- 支持通过程序性文本提示进行精细和可解释的控制。
点此查看论文截图







Validation of an Artificial Intelligence Tool for the Detection of Sperm DNA Fragmentation Using the TUNEL In Situ Hybridization Assay
Authors:Byron Alexander Jacobs, Aqeel Morris, Ifthakaar Shaik, Frando Lin
Sperm DNA fragmentation (SDF) is a critical parameter in male fertility assessment that conventional semen analysis fails to evaluate. This study presents the validation of a novel artificial intelligence (AI) tool designed to detect SDF through digital analysis of phase contrast microscopy images, using the terminal deoxynucleotidyl transferase dUTP nick end labeling (TUNEL) assay as the gold standard reference. Utilising the established link between sperm morphology and DNA integrity, the present work proposes a morphology assisted ensemble AI model that combines image processing techniques with state-of-the-art transformer based machine learning models (GC-ViT) for the prediction of DNA fragmentation in sperm from phase contrast images. The ensemble model is benchmarked against a pure 'transformer vision' model as well as a 'morphology-only' model. Promising results show the proposed framework is able to achieve sensitivity of 60% and specificity of 75%. This non-destructive methodology represents a significant advancement in reproductive medicine by enabling real-time sperm selection based on DNA integrity for clinical diagnostic and therapeutic applications.
精子DNA碎片化(SDF)是男性生育力评估中的一个关键参数,而常规精液分析无法对其进行评估。本研究验证了一种新型人工智能(AI)工具,它通过对相差显微镜图像进行数字分析来检测SDF,并以末端脱氧核苷酸转移酶dUTP缺口末端标记(TUNEL)检测作为金标准参考。利用精子形态与DNA完整性之间已确立的联系,本工作提出了一个形态辅助的集成AI模型,将图像处理技术与最先进的基于Transformer的机器学习模型(GC-ViT)相结合,从相差图像预测精子的DNA碎片化。该集成模型与纯"Transformer视觉"模型以及"仅形态"模型进行了对比。令人鼓舞的结果表明,该框架能够达到60%的灵敏度和75%的特异度。这种非破坏性方法使得在临床诊断与治疗应用中能够基于DNA完整性进行实时精子筛选,是生殖医学领域的一项重要进展。
论文及项目相关链接
Summary:
研究验证了新型人工智能工具评估精子DNA碎片化(SDF)的有效性,该技术通过相位对比显微镜图像的数字分析来检测SDF,以TUNEL检测为金标准。研究提出了一种结合图像处理和基于变压器(GC-ViT)的最新机器学习模型的形态辅助集成AI模型来预测精子DNA碎片化。与纯“视觉”模型和仅形态模型相比,该集成模型表现出色,敏感度和特异度分别达到了60%和75%。这项非破坏性方法代表了生殖医学领域的一个重大进展,能够基于DNA完整性进行实时精子选择,对临床诊断和治疗应用具有重要意义。
Key Takeaways:
- 精子DNA碎片化(SDF)是男性生育能力评估的重要参数,传统精液分析无法评估。
- 研究验证了一种新型人工智能工具,通过相位对比显微镜图像检测SDF。
- 集成AI模型结合了图像处理和基于变压器(GC-ViT)的机器学习模型,预测精子DNA碎片化。
- 该模型的敏感度和特异度分别为60%和75%。
- 此方法具有非破坏性,对生殖医学领域有重大进展。
- 该技术可基于DNA完整性进行实时精子选择。
点此查看论文截图





Scalable Face Security Vision Foundation Model for Deepfake, Diffusion, and Spoofing Detection
Authors:Gaojian Wang, Feng Lin, Tong Wu, Zhisheng Yan, Kui Ren
With abundant, unlabeled real faces, how can we learn robust and transferable facial representations to boost generalization across various face security tasks? We make the first attempt and propose FS-VFM, a scalable self-supervised pre-training framework, to learn fundamental representations of real face images. We introduce three learning objectives, namely 3C, that synergize masked image modeling (MIM) and instance discrimination (ID), empowering FS-VFM to encode both local patterns and global semantics of real faces. Specifically, we formulate various facial masking strategies for MIM and devise a simple yet effective CRFR-P masking, which explicitly prompts the model to pursue meaningful intra-region Consistency and challenging inter-region Coherency. We present a reliable self-distillation mechanism that seamlessly couples MIM with ID to establish underlying local-to-global Correspondence. After pre-training, vanilla vision transformers (ViTs) serve as universal Vision Foundation Models for downstream Face Security tasks: cross-dataset deepfake detection, cross-domain face anti-spoofing, and unseen diffusion facial forensics. To efficiently transfer the pre-trained FS-VFM, we further propose FS-Adapter, a lightweight plug-and-play bottleneck atop the frozen backbone with a novel real-anchor contrastive objective. Extensive experiments on 11 public benchmarks demonstrate that our FS-VFM consistently generalizes better than diverse VFMs, spanning natural and facial domains, fully, weakly, and self-supervised paradigms, small, base, and large ViT scales, and even outperforms SOTA task-specific methods, while FS-Adapter offers an excellent efficiency-performance trade-off. The code and models are available on https://fsfm-3c.github.io/fsvfm.html.
面对大量未标记的真实人脸,我们如何学习稳健且可迁移的面部表示,以促进各种面部安全任务的泛化能力呢?我们首次尝试并提出FS-VFM,这是一种可扩展的自我监督预训练框架,用于学习真实人脸图像的基本表示。我们引入了三个学习目标,即3C,它结合了掩码图像建模(MIM)和实例鉴别(ID),使FS-VFM能够编码真实人脸的局部模式和全局语义。具体来说,我们为MIM制定了各种面部掩码策略,并设计了一种简单有效的CRFR-P掩码,它明确地提示模型追求区域内有意义的一致性(Consistency)和区域间协同性(Coherency)。我们还提出了一种可靠的自我蒸馏机制,无缝结合MIM和ID,建立基本的局部到全局对应关系。预训练后,普通的视觉变压器(ViTs)可作为面向下游面部安全任务的通用视觉基础模型:跨数据集深度伪造检测、跨域面部防伪以及未见扩散面部取证。为了有效地迁移预训练的FS-VFM,我们进一步提出FS-Adapter,这是一个轻量级的即插即用瓶颈,位于冻结的主干之上,具有新型真实锚点对比目标。在11个公共基准测试上的广泛实验表明,我们的FS-VFM始终比涵盖自然和面部领域的各种VFMs有更好的泛化能力,涵盖全监督、弱监督和自监督范式,小型、基础和大型ViT规模,甚至超越了特定任务的最先进方法,而FS-Adapter提供了卓越的效率性能权衡。代码和模型可在https://fsfm-3c.github.io/fsvfm.html上找到。
论文及项目相关链接
PDF 18 pages, 9 figures, project page: https://fsfm-3c.github.io/fsvfm.html
摘要
本文提出了FS-VFM框架,这是一种可扩展的自我监督预训练框架,用于学习真实面部图像的基本表示。引入三种学习目标,即3C,协同掩膜图像建模(MIM)和实例鉴别(ID),使FS-VFM能够编码真实面部的局部模式和全局语义。文章制定了各种面部掩模策略,并提出了简单有效的CRFR-P掩模,促使模型追求区域内一致性及区域间连贯性。建立可靠的自我蒸馏机制,无缝结合MIM和ID以建立基本的局部到全局对应关系。预训练后,普通视觉变压器(ViTs)可作为面向下游面部安全任务的通用视觉基础模型。为进一步高效转移预训练的FS-VFM,文章还提出了FS-Adapter,这是一个轻量级的即插即用瓶颈,位于冻结的主干之上,具有新型真实锚对比目标。在11个公共基准测试上的广泛实验表明,与跨越自然和面部领域的各种VFMs相比,FS-VFM具有更好的泛化能力,并且优于最先进的任务特定方法,而FS-Adapter则提供了出色的效率与性能的平衡。代码和模型可在https://fsfm-3c.github.io/fsvfm.html获取。
关键见解
- 引入FS-VFM框架,一种可扩展的自我监督预训练框架,用于学习真实面部图像的基本表示。
- 提出三种学习目标:3C(一致性、连贯性、对应性),结合MIM和ID进行面部表示学习。
- 介绍了多种面部掩模策略,特别是CRFR-P掩模方法以提高学习效果。
- 建立自我蒸馏机制以结合MIM和ID,形成局部到全局的面部表示对应关系。
- 预训练后,ViTs可作为下游面部安全任务的通用视觉基础模型。
- 提出FS-Adapter,一个轻量级结构,用于高效转移预训练的FS-VFM知识到下游任务。
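下面示意一种"区域感知"掩码采样:先整块遮住一个随机选中的面部区域,再在其余 patch 中随机补足到目标掩码率。这只是对 CRFR-P 思路的粗略示意,面部区域划分与具体比例均为假设。

```python
import torch

def region_aware_mask(region_ids: torch.Tensor, mask_ratio: float = 0.75):
    """region_ids: [N] 每个 patch 所属面部区域的编号(如眼、鼻、嘴等,由外部解析得到)。
    返回 [N] 的布尔掩码,True 表示该 patch 被遮住。"""
    n = region_ids.numel()
    mask = torch.zeros(n, dtype=torch.bool)

    # 1) 随机选一个区域整块遮住,促使模型进行跨区域推理
    regions = region_ids.unique()
    chosen = regions[torch.randint(len(regions), (1,))]
    mask |= region_ids.eq(chosen)

    # 2) 在剩余 patch 中随机补足,直到总掩码率达到 mask_ratio
    target = int(round(mask_ratio * n))
    remaining = (~mask).nonzero(as_tuple=True)[0]
    need = max(0, target - int(mask.sum()))
    if need > 0:
        pick = remaining[torch.randperm(len(remaining))[:need]]
        mask[pick] = True
    return mask

if __name__ == "__main__":
    region_ids = torch.randint(0, 5, (196,))   # 假设 14x14 patch 被分成 5 个面部区域
    m = region_aware_mask(region_ids)
    print(int(m.sum()), "/", m.numel())
```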
点此查看论文截图



MARS-Sep: Multimodal-Aligned Reinforced Sound Separation
Authors:Zihan Zhang, Xize Cheng, Zhennan Jiang, Dongjie Fu, Jingyuan Chen, Zhou Zhao, Tao Jin
Universal sound separation faces a fundamental misalignment: models optimized for low-level signal metrics often produce semantically contaminated outputs, failing to suppress perceptually salient interference from acoustically similar sources. To bridge this gap, we introduce MARS-Sep, a reinforcement learning framework that reformulates separation as decision making. Instead of simply regressing ground-truth masks, MARS-Sep learns a factorized Beta mask policy that is optimized by a clipped trust-region surrogate with entropy regularization and group-relative advantage normalization. Concretely, we sample masks from a frozen old policy, reconstruct waveforms, and update the current policy using clipped importance ratios-yielding substantially more stable and sample-efficient learning. Multimodal rewards, derived from an audio-text-vision encoder, directly incentivize semantic consistency with query prompts. We further propose a progressive alignment scheme to fine-tune this encoder, boosting its cross-modal discriminability and improving reward faithfulness. Extensive experiments on multiple benchmarks demonstrate consistent gains in Text-, Audio-, and Image-Queried separation, with notable improvements in signal metrics and semantic quality. Our code is available at https://anonymous.4open.science/r/MARS-Sep. Sound separation samples are available at https://mars-sep.github.io/.
通用声音分离面临一个根本性的错位:针对低层信号指标优化的模型往往会产生语义受污染的输出,无法抑制来自声学相似声源的、在感知上显著的干扰。为弥合这一差距,我们提出了MARS-Sep,一个将分离重新表述为决策制定的强化学习框架。MARS-Sep不是简单地回归真实掩码,而是学习一个因子化的Beta掩码策略,并通过带剪切的信任域替代目标进行优化,辅以熵正则化和组相对优势归一化。具体来说,我们从冻结的旧策略中采样掩码、重建波形,并使用剪切后的重要性比率更新当前策略,从而使学习更加稳定且样本效率更高。由音频-文本-视觉编码器得到的多模态奖励直接激励输出与查询提示保持语义一致。我们进一步提出一种渐进式对齐方案来微调该编码器,提升其跨模态判别力并改善奖励的可信度。在多个基准上的大量实验表明,MARS-Sep在文本、音频和图像查询的分离任务上都取得了一致的提升,在信号指标和语义质量上均有显著改进。我们的代码可在 https://anonymous.4open.science/r/MARS-Sep 获取。声音分离样例见 https://mars-sep.github.io/。
论文及项目相关链接
Summary
本文主要介绍了一个名为MARS-Sep的强化学习框架,用于解决通用声音分离中的根本性不匹配问题。该框架将声音分离重新定义为决策制定,而非简单地回归地面真实掩膜。它通过采用因子化的Beta掩膜策略、优化的信任区域替代物、熵正则化以及群体相对优势归一化等技术,大大提高了学习和样本效率。此外,本文还提出了一个多模态奖励方案,直接激励查询提示的语义一致性。实验结果表明,该框架在文本、音频和图像查询分离方面均表现出一致的优点,特别是在信号指标和语义质量方面取得了显著的改进。
Key Takeaways
- MARS-Sep是一个强化学习框架,用于解决声音分离中的根本性不匹配问题。
- 该框架将声音分离重新定义为决策制定,并采用因子化的Beta掩膜策略进行优化。
- MARS-Sep使用信任区域替代物、熵正则化和群体相对优势归一化等技术,提高了学习和样本效率。
- 多模态奖励方案直接激励查询提示的语义一致性。
- 提出的渐进对齐方案提高了音频文本视觉编码器的跨模态鉴别力和奖励真实性。
- 在多个基准测试上进行了广泛的实验,表明MARS-Sep在文本、音频和图像查询分离方面取得了显著的改进。
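摘要中的"带剪切的信任域替代目标 + 组相对优势归一化"与常见的 PPO/GRPO 式目标同属一类,下面给出这类目标的通用示意实现;符号与超参数为常规选择,并非论文的精确公式。

```python
import torch

def clipped_policy_loss(logp_new, logp_old, rewards,
                        clip_eps=0.2, ent_coef=0.01, entropy=None):
    """logp_new / logp_old: [B, G] 当前/旧策略下每组 G 个采样掩码的对数概率;
    rewards: [B, G] 多模态奖励。先做组内归一化得到相对优势,再做剪切的重要性采样目标。"""
    adv = (rewards - rewards.mean(dim=1, keepdim=True)) / (rewards.std(dim=1, keepdim=True) + 1e-6)
    ratio = torch.exp(logp_new - logp_old)                     # 重要性比
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    loss = -torch.min(unclipped, clipped).mean()               # 信任域式剪切目标
    if entropy is not None:                                    # 熵正则鼓励探索
        loss = loss - ent_coef * entropy.mean()
    return loss

if __name__ == "__main__":
    B, G = 2, 4
    loss = clipped_policy_loss(torch.randn(B, G), torch.randn(B, G),
                               torch.rand(B, G), entropy=torch.rand(B, G))
    print(loss)
```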
点此查看论文截图



Cooperative Pseudo Labeling for Unsupervised Federated Classification
Authors:Kuangpu Guo, Lijun Sheng, Yongcan Yu, Jian Liang, Zilei Wang, Ran He
Unsupervised Federated Learning (UFL) aims to collaboratively train a global model across distributed clients without sharing data or accessing label information. Previous UFL works have predominantly focused on representation learning and clustering tasks. Recently, vision language models (e.g., CLIP) have gained significant attention for their powerful zero-shot prediction capabilities. Leveraging this advancement, classification problems that were previously infeasible under the UFL paradigm now present promising new opportunities, yet remain largely unexplored. In this paper, we extend UFL to the classification problem with CLIP for the first time and propose a novel method, \underline{\textbf{Fed}}erated \underline{\textbf{Co}}operative \underline{\textbf{P}}seudo \underline{\textbf{L}}abeling (\textbf{FedCoPL}). Specifically, clients estimate and upload their pseudo label distribution, and the server adjusts and redistributes them to avoid global imbalance among classes. Moreover, we introduce a partial prompt aggregation protocol for effective collaboration and personalization. In particular, visual prompts containing general image features are aggregated at the server, while text prompts encoding personalized knowledge are retained locally. Extensive experiments demonstrate the superior performance of our FedCoPL compared to baseline methods. Our code is available at \href{https://github.com/krumpguo/FedCoPL}{https://github.com/krumpguo/FedCoPL}.
无监督联邦学习(UFL)旨在在不共享数据或访问标签信息的情况下,协同训练分布式客户端上的全局模型。之前的UFL工作主要集中在表示学习和聚类任务上。最近,视觉语言模型(例如CLIP)由于其强大的零样本预测能力而备受关注。利用这一进展,之前在UFL范式下不可行的分类问题现在呈现出有希望的新的机会,但仍有待广泛探索。在本文中,我们首次将UFL扩展到使用CLIP的分类问题,并提出了一种新方法,即联邦合作伪标签法(FedCoPL)。具体来说,客户端估计并上传其伪标签分布,服务器进行调整和重新分配,以避免类别之间的全局不平衡。此外,我们引入了一种部分提示聚合协议,以实现有效的协作和个性化。特别是,包含一般图像特征的视觉提示在服务器上进行聚合,而编码个性化知识的文本提示则保留在本地。大量实验表明,我们的FedCoPL相比基线方法具有优越的性能。我们的代码可在https://github.com/krumpguo/FedCoPL上找到。
论文及项目相关链接
PDF Accepted by ICCV 2025
Summary
基于CLIP的联邦学习分类问题新方法——FedCoPL。该方法在无监督联邦学习(UFL)框架下处理分类问题,通过伪标签分布估计与调整、部分提示聚合协议实现有效协作与个性化。
Key Takeaways
- UFL扩展至分类问题:利用CLIP模型实现强大的零样本预测能力,解决UFL在分类问题上的局限性。
- 提出FedCoPL方法:利用伪标签分布估计与调整,避免全局类别不平衡问题。
- 部分提示聚合协议:实现有效协作和个性化,视觉提示在服务器聚合,文本提示保留本地。
- FedCoPL方法相较于基线方法表现出优越性能。
- 该方法适用于分布式环境下的分类问题,特别是在数据隐私保护需求较高的场景中。
- 代码公开可用,便于其他研究者使用与进一步改进。
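下面用一个最小示例说明"客户端上传伪标签分布、服务器做全局再平衡"的思路:服务器汇总各客户端的类别直方图,并返回与全局频率成反比的类别权重。具体的调整与再分配协议以论文为准,此处仅为假设性演示。

```python
import numpy as np

def server_rebalance(client_histograms):
    """client_histograms: List[np.ndarray],每个元素是某客户端伪标签的类别计数 [C]。
    返回 (全局分布, 每类权重),权重与全局频率成反比,用于缓解类别不平衡。"""
    total = np.sum(client_histograms, axis=0).astype(float)
    global_dist = total / total.sum()
    weights = 1.0 / np.clip(global_dist, 1e-6, None)
    weights = weights / weights.mean()                  # 归一化,使平均权重为 1
    return global_dist, weights

if __name__ == "__main__":
    hists = [np.array([90, 5, 5]), np.array([80, 15, 5]), np.array([70, 20, 10])]
    dist, w = server_rebalance(hists)
    print(np.round(dist, 3), np.round(w, 2))            # 稀有类别获得更大权重
```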
点此查看论文截图





Zero-shot image privacy classification with Vision-Language Models
Authors:Alina Elena Baia, Alessio Xompero, Andrea Cavallaro
While specialized learning-based models have historically dominated image privacy prediction, the current literature increasingly favours adopting large Vision-Language Models (VLMs) designed for generic tasks. This trend risks overlooking the performance ceiling set by purpose-built models due to a lack of systematic evaluation. To address this problem, we establish a zero-shot benchmark for image privacy classification, enabling a fair comparison. We evaluate the top-3 open-source VLMs, according to a privacy benchmark, using task-aligned prompts and we contrast their performance, efficiency, and robustness against established vision-only and multi-modal methods. Counter-intuitively, our results show that VLMs, despite their resource-intensive nature in terms of high parameter count and slower inference, currently lag behind specialized, smaller models in privacy prediction accuracy. We also find that VLMs exhibit higher robustness to image perturbations.
长期以来,专用的基于学习的模型一直主导着图像隐私预测,但当前文献越来越倾向于采用为通用任务设计的大型视觉语言模型(VLM)。由于缺乏系统评估,这一趋势可能忽视专门构建的模型所设定的性能上限。为解决这一问题,我们为图像隐私分类建立了一个零样本基准,以实现公平比较。我们依据隐私基准评估了排名前三的开源VLM,使用与任务对齐的提示词,并将其性能、效率和稳健性与成熟的纯视觉方法和多模态方法进行对比。与直觉相反,结果显示,尽管VLM在参数量和推理速度方面资源消耗巨大,但其在隐私预测准确率上目前仍落后于专门的小型模型。我们还发现,VLM对图像扰动表现出更高的稳健性。
论文及项目相关链接
PDF 5 pages, 3 figures, 3 tables. This work has been submitted to the ICASSP 2026
Summary
采用大规模视觉语言模型(VLMs)进行图像隐私分类的趋势逐渐明显,但由于缺乏系统评估,容易忽视专业模型的性能上限。本研究建立了一个零样本基准测试来公平比较模型性能,对三种顶级开源VLMs进行评估,并对比其性能、效率和稳健性。结果显示,尽管VLMs参数众多、推理速度慢且资源消耗大,但在隐私预测准确性方面目前仍落后于专业的小型模型,但VLMs对图像扰动具有更高的稳健性。
Key Takeaways
- 当前文献越来越倾向于采用用于通用任务的视觉语言模型(VLMs)进行图像隐私分类。
- 缺乏系统评估可能导致忽视专业模型的性能上限。
- 建立了一个零样本基准测试来公平比较不同模型的性能。
- 对三种顶级开源VLMs进行了评估。
- VLMs在隐私预测准确性方面目前仍落后于专业的小型模型。
- VLMs对图像扰动具有更高的稳健性。
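作为零样本图像隐私分类的一个简单参照,下面用 CLIP 的图文相似度演示"任务对齐提示 + 零样本打分"的流程;提示词为假设示例,且论文实际评测的是更大的开源 VLM,此处仅作思路演示。

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompts = ["a photo containing private or sensitive personal information",
           "a photo that is safe to share publicly"]          # 任务对齐的提示词(假设示例)

image = Image.new("RGB", (224, 224))                          # 占位图像,实际应读入待测图片
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)  # 图像与两类提示的相似度
print({name: float(p) for name, p in zip(["private", "public"], probs[0])})
```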
点此查看论文截图





Modeling Time-Lapse Trajectories to Characterize Cranberry Growth
Authors:Ronan John, Anis Chihoub, Ryan Meegan, Gina Sidelli, Jeffery Neyhart, Peter Oudemans, Kristin Dana
Change monitoring is an essential task for cranberry farming as it provides both breeders and growers with the ability to analyze growth, predict yield, and make treatment decisions. However, this task is often done manually, requiring significant time on the part of a cranberry grower or breeder. Deep learning based change monitoring holds promise, despite the caveat of hard-to-interpret high dimensional features and hand-annotations for fine-tuning. To address this gap, we introduce a method for modeling crop growth based on fine-tuning vision transformers (ViTs) using a self-supervised approach that avoids tedious image annotations. We use a two-fold pretext task (time regression and class prediction) to learn a latent space for the time-lapse evolution of plant and fruit appearance. The resulting 2D temporal tracks provide an interpretable time-series model of crop growth that can be used to: 1) predict growth over time and 2) distinguish temporal differences of cranberry varieties. We also provide a novel time-lapse dataset of cranberry fruit featuring eight distinct varieties, observed 52 times over the growing season (span of around four months), annotated with information about fungicide application, yield, and rot. Our approach is general and can be applied to other crops and applications (code and dataset can be found at https://github. com/ronan-39/tlt/).
变化监测对蔓越莓种植至关重要,它使育种者和种植者能够分析生长、预测产量并做出处理决策。然而,这项工作通常由人工完成,需要蔓越莓种植者或育种者投入大量时间。基于深度学习的变化监测前景可期,但也存在高维特征难以解释以及微调需要人工标注的问题。为弥补这一差距,我们提出了一种基于自监督方式微调视觉Transformer(ViT)的作物生长建模方法,避免了繁琐的图像标注。我们使用双重前置任务(时间回归和类别预测)来学习植物与果实外观随时间演变的潜在空间。由此得到的二维时间轨迹提供了一个可解释的作物生长时间序列模型,可用于:1)预测随时间的生长;2)区分不同蔓越莓品种的时间差异。我们还提供了一个新的蔓越莓果实延时数据集,涵盖八个不同品种,在生长季(约四个月)内观测52次,并标注了杀菌剂施用、产量和腐烂信息。我们的方法具有通用性,可应用于其他作物和场景(代码和数据集见 https://github.com/ronan-39/tlt/)。
论文及项目相关链接
PDF Accepted to ICCV Workshops 2025
Summary
本文探讨了基于深度学习的变化监测在蔓越莓种植中的应用。为提高效率,研究人员开发了一种使用自监督方法微调视觉Transformer(ViT)的模型,用于模拟作物生长。该模型通过双重前置任务(时间回归和类别预测)学习植物和果实外观随时间变化的潜在空间,可提供可解释的作物生长时间序列模型,用于预测作物生长过程和区分不同品种的蔓越莓。此外,该研究还提供了一个新型的蔓越莓果实延时数据集,包含八个品种在生长季节内的52次观测数据,并附带杀菌剂施用、产量和腐烂等信息的标注。该方法具有通用性,可应用于其他作物和应用场景。
Key Takeaways
- 变化监测对于蔓越莓种植至关重要,但当前的手动方法耗时耗力。
- 深度学习在变化监测中有应用前景,但存在高维特征难以解释和图像标注繁琐的问题。
- 研究人员引入了一种基于视觉Transformer(ViT)的模型,采用自监督方法,避免了繁琐的图像标注。
- 模型使用双重前置任务(时间回归和类别预测)学习时间演变的潜在空间,提供可解释的作物生长模型。
- 该模型可预测作物生长过程并区分不同品种的蔓越莓。
- 研究人员还提供了一个蔓越莓果实的延时数据集,包含多个品种和丰富的标注信息。
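下面示意"双重前置任务(时间回归 + 品种分类)"的联合训练目标:共享特征分别接回归头与分类头,总损失为两者加权和。头部结构与权重均为假设,仅说明该训练目标的组织方式。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoFoldPretextHead(nn.Module):
    """共享特征 -> (拍摄时间回归, 品种分类) 两个预测头。"""
    def __init__(self, feat_dim: int, num_varieties: int):
        super().__init__()
        self.time_head = nn.Linear(feat_dim, 1)
        self.cls_head = nn.Linear(feat_dim, num_varieties)

    def forward(self, feats):
        return self.time_head(feats).squeeze(-1), self.cls_head(feats)

def pretext_loss(t_pred, t_true, cls_logits, variety, w_time=1.0, w_cls=1.0):
    # 时间回归用 MSE,品种分类用交叉熵,两者加权求和
    return w_time * F.mse_loss(t_pred, t_true) + w_cls * F.cross_entropy(cls_logits, variety)

if __name__ == "__main__":
    feats = torch.randn(4, 256)                       # 假设来自 ViT 编码器的特征
    head = TwoFoldPretextHead(256, num_varieties=8)
    t_pred, logits = head(feats)
    loss = pretext_loss(t_pred, torch.rand(4), logits, torch.randint(0, 8, (4,)))
    print(loss)
```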
点此查看论文截图








MedDINOv3: How to adapt vision foundation models for medical image segmentation?
Authors:Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
在CT和MRI扫描中准确地分割器官和肿瘤对于诊断、治疗规划和疾病监测至关重要。尽管深度学习已经推动了自动化分割的进展,但大多数模型仍然针对特定任务,在不同模态和机构之间缺乏通用性。视觉基础模型(FMs)在百亿级自然图像上的预训练提供了强大且可迁移的表示。然而,将其适应医学成像面临两个主要挑战:(1)大多数基础模型的ViT主干在医学图像分割方面仍然表现不佳,不如专业化的CNN;(2)自然图像和医学图像之间的巨大领域差距限制了可迁移性。我们引入了MedDINOv3,这是一个简单有效的框架,用于将DINOv3适应医学分割。我们首先回顾了普通的ViTs,并设计了一个简单有效的架构,具有多尺度令牌聚合。然后,我们在CT-3M上进行域自适应预训练,这是一个精选的包含387万张轴向CT切片的集合,使用多阶段的DINOv3配方来学习稳健的密集特征。MedDINOv3在四个分割基准测试中达到了或超过了最先进的表现,证明了视觉基础模型作为医学图像分割统一主干的潜力。代码可在https://github.com/ricklisz/MedDINOv3找到。
论文及项目相关链接
Summary
本文介绍了MedDINOv3框架,该框架旨在将DINOv3适配于医学图像分割。通过重新审视普通ViT并设计具有多尺度令牌聚合的简单有效架构,以及使用多阶段DINOv3配方在CT-3M(精选的387万张轴向CT切片)上进行域自适应预训练以学习稳健的密集特征,MedDINOv3在四个分割基准测试中达到或超越了最新技术水平,展示了视觉基础模型作为医学图像分割统一骨干网络的潜力。
Key Takeaways
- 准确进行CT和MRI扫描中的器官和肿瘤分割对于诊断、治疗规划和疾病监测至关重要。
- 当前大多数深度学习模型在进行医学图像分割时仍面临特定任务挑战,缺乏跨不同模态和机构的泛化能力。
- Vision foundation models(FMs)通过在大规模自然图像上的预训练,提供了强大的可迁移表示能力。
- 将这些模型应用于医学成像面临两大挑战:ViT主干在医学图像分割方面仍落后于专业CNN;自然图像与医学图像之间的域差距限制了模型的迁移能力。
- MedDINOv3框架旨在解决这些问题,通过设计简单有效的架构和进行域自适应预训练来适应医学分割任务。
- MedDINOv3框架在四个分割基准测试中实现了卓越性能,证明了其在医学图像分割中的潜力。
点此查看论文截图





Automatic Synthesis of High-Quality Triplet Data for Composed Image Retrieval
Authors:Haiwen Li, Delong Liu, Zhaohui Hou, Zhicheng Zhao, Fei Su
As a challenging vision-language (VL) task, Composed Image Retrieval (CIR) aims to retrieve target images using multimodal (image+text) queries. Although many existing CIR methods have attained promising performance, their reliance on costly, manually labeled triplets hinders scalability and zero-shot capability. To address this issue, we propose a scalable pipeline for automatic triplet generation, along with a fully synthetic dataset named Composed Image Retrieval on High-quality Synthetic Triplets (CIRHS). Our pipeline leverages a large language model (LLM) to generate diverse prompts, controlling a text-to-image generative model to produce image pairs with identical elements in each pair, which are then filtered and reorganized to form the CIRHS dataset. In addition, we introduce Hybrid Contextual Alignment (CoAlign), a novel CIR framework, which can accomplish global alignment and local reasoning within a broader context, enabling the model to learn more robust and informative representations. By utilizing the synthetic CIRHS dataset, CoAlign achieves outstanding zero-shot performance on three commonly used benchmarks, demonstrating for the first time the feasibility of training CIR models on a fully synthetic dataset. Furthermore, under supervised training, our method outperforms all the state-of-the-art supervised CIR approaches, validating the effectiveness of our proposed retrieval framework. The code and the CIRHS dataset will be released soon.
作为具有挑战性的视觉语言(VL)任务,组合图像检索(CIR)旨在使用多模态(图像+文本)查询来检索目标图像。尽管许多现有的CIR方法已经取得了有前景的性能,但它们对成本高昂的手动标注三元组的依赖阻碍了其可扩展性和零样本能力。为了解决这一问题,我们提出了一个用于自动三元组生成的可扩展管道,以及一个名为“基于高质量合成三元组的组合图像检索(CIRHS)”的完全合成数据集。我们的管道利用大型语言模型(LLM)生成各种提示,控制文本到图像的生成模型,以产生具有相同元素的图像对,然后对它们进行过滤和重组以形成CIRHS数据集。此外,我们介绍了混合上下文对齐(CoAlign),这是一种新型的CIR框架,可以在更广泛的背景下完成全局对齐和局部推理,使模型能够学习更稳健和更有用的表示。通过利用合成的CIRHS数据集,CoAlign在三个常用的基准测试上取得了出色的零样本性能,首次证明了在完全合成的数据集上训练CIR模型的可行性。此外,在监督训练下,我们的方法超越了所有最新的监督CIR方法,验证了所提出的检索框架的有效性。代码和CIRHS数据集将很快发布。
论文及项目相关链接
PDF This paper was originally submitted to ACM MM 2025 on April 12, 2025
Summary
这是一项针对组合图像检索(CIR)的挑战性研究,提出了一种可扩展到大规模数据的自动三元组生成管道以及一个名为CIRHS的全合成数据集。该管道利用大型语言模型生成多种提示,控制文本到图像生成模型以产生具有相同元素的图像对,并通过过滤和重组形成CIRHS数据集。同时,引入了一种新的CIR框架Hybrid Contextual Alignment(CoAlign),能在更广泛的背景下进行全局对齐和局部推理,使模型学习更稳健和更具信息性的表示。CoAlign在三个常用基准测试上首次实现了全合成数据集训练CIR模型的可行性,并在监督训练下优于所有最先进的CIR方法。
Key Takeaways
- 组合图像检索(CIR)是一个使用多模态(图像+文本)查询来检索目标图像的视觉语言任务。
- 当前CIR方法依赖昂贵的手动标签数据,限制了其可扩展性和零样本能力。
- 提出了一种自动三元组生成管道和名为CIRHS的全合成数据集,解决了上述问题。
- 利用大型语言模型生成提示,并使用文本到图像生成模型产生相同元素的图像对。
- 引入了Hybrid Contextual Alignment(CoAlign)框架,实现全局对齐和局部推理,提高模型的表示能力。
- CoAlign在三个常用基准测试上实现了零样本训练的可行性,并超越了现有的监督学习方法。
点此查看论文截图




TMT: Cross-domain Semantic Segmentation with Region-adaptive Transferability Estimation
Authors:Enming Zhang, Zhengyu Li, Yanru Wu, Jingge Wang, Yang Tan, Guan Wang, Yang Li, Xiaoping Zhang
Recent advances in Vision Transformers (ViTs) have significantly advanced semantic segmentation performance. However, their adaptation to new target domains remains challenged by distribution shifts, which often disrupt global attention mechanisms. While existing global and patch-level adaptation methods offer some improvements, they overlook the spatially varying transferability inherent in different image regions. To address this, we propose the Transferable Mask Transformer (TMT), a region-adaptive framework designed to enhance cross-domain representation learning through transferability guidance. First, we dynamically partition the image into coherent regions, grouped by structural and semantic similarity, and estimates their domain transferability at a localized level. Then, we incorporate region-level transferability maps directly into the self-attention mechanism of ViTs, allowing the model to adaptively focus attention on areas with lower transferability and higher semantic uncertainty. Extensive experiments across 20 diverse cross-domain settings demonstrate that TMT not only mitigates the performance degradation typically associated with domain shift but also consistently outperforms existing approaches.
最近,Vision Transformers(ViTs)的进展大大提高了语义分割的性能。然而,它们适应新目标领域时仍然面临分布转移的挑战,这往往会破坏全局注意力机制。虽然现有的全局和补丁级适应方法提供了一些改进,但它们忽略了不同图像区域固有的空间可转移性。为了解决这一问题,我们提出了Transferable Mask Transformer(TMT),这是一种区域自适应框架,旨在通过可转移性指导增强跨域表示学习。首先,我们根据结构和语义相似性将图像动态划分为连贯的区域并对其进行分组,然后在局部级别估计它们的域可转移性。然后,我们将区域级可转移性图直接纳入ViT的自注意力机制中,使模型能够自适应地关注可转移性较低、语义不确定性较高的区域。在20个不同的跨域设置上的大量实验表明,TMT不仅减轻了与域偏移相关的性能下降问题,而且总体上优于现有方法。
论文及项目相关链接
Summary
本文介绍了Vision Transformers(ViTs)的最新进展在语义分割性能上的显著提升,但其在适应新目标领域时仍面临分布转移的挑战。为解决这一问题,提出了Transferable Mask Transformer(TMT)这一区域自适应框架,通过转移性指导增强跨域表示学习。TMT通过动态划分图像区域,并估计其域转移性,将区域级转移图直接融入ViTs的自注意力机制中,使模型能够自适应地关注转移性较低、语义不确定性较高的区域。实验结果证明了TMT在跨域设置下的优越性。
Key Takeaways
- Vision Transformers(ViTs)在语义分割性能上有显著提升。
- 跨域适应面临分布转移的挑战,影响全局注意力机制。
- 现有全局和补丁级适应方法虽有所改善,但忽略了不同图像区域固有的空间可转移性。
- 提出Transferable Mask Transformer(TMT)框架,通过转移性指导增强跨域表示学习。
- TMT通过动态划分图像区域并估计其域转移性,实现区域自适应。
- 将区域级转移图融入ViTs的自注意力机制中,提高模型在转移性较低、语义不确定性较高区域的关注度。
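下面示意把 token 级可迁移性注入自注意力的一种简单做法:将可迁移性转为加性偏置,让可迁移性低(语义不确定性高)的 token 获得更多注意力。注入方式与符号约定为假设,并非 TMT 的精确实现。

```python
import torch
import torch.nn.functional as F

def transferability_biased_attention(q, k, v, transferability, tau=1.0):
    """q, k, v: [B, N, D];transferability: [B, N],取值越小表示越难迁移。
    在注意力 logits 上加 -tau*transferability(作用于"被关注者"维度),
    使低可迁移性 token 相对获得更多注意力。"""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5             # [B, N, N]
    bias = -tau * transferability.unsqueeze(1)               # [B, 1, N]
    attn = F.softmax(logits + bias, dim=-1)
    return attn @ v

if __name__ == "__main__":
    B, N, D = 2, 16, 32
    out = transferability_biased_attention(torch.randn(B, N, D), torch.randn(B, N, D),
                                           torch.randn(B, N, D), torch.rand(B, N))
    print(out.shape)                                         # torch.Size([2, 16, 32])
```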
点此查看论文截图





ChA-MAEViT: Unifying Channel-Aware Masked Autoencoders and Multi-Channel Vision Transformers for Improved Cross-Channel Learning
Authors:Chau Pham, Juan C. Caicedo, Bryan A. Plummer
Prior work using Masked Autoencoders (MAEs) typically relies on random patch masking based on the assumption that images have significant redundancies across different channels, allowing for the reconstruction of masked content using cross-channel correlations. However, this assumption does not hold in Multi-Channel Imaging (MCI), where channels may provide complementary information with minimal feature overlap. Thus, these MAEs primarily learn local structures within individual channels from patch reconstruction, failing to fully leverage cross-channel interactions and limiting their MCI effectiveness. In this paper, we present ChA-MAEViT, an MAE-based method that enhances feature learning across MCI channels via four key strategies: (1) dynamic channel-patch masking, which compels the model to reconstruct missing channels in addition to masked patches, thereby enhancing cross-channel dependencies and improving robustness to varying channel configurations; (2) memory tokens, which serve as long-term memory aids to promote information sharing across channels, addressing the challenges of reconstructing structurally diverse channels; (3) hybrid token fusion module, which merges fine-grained patch tokens with a global class token to capture richer representations; and (4) Channel-Aware Decoder, a lightweight decoder utilizes channel tokens to effectively reconstruct image patches. Experiments on satellite and microscopy datasets, CHAMMI, JUMP-CP, and So2Sat, show that ChA-MAEViT significantly outperforms state-of-the-art MCI-ViTs by 3.0-21.5%, highlighting the importance of cross-channel interactions in MCI. Our code is publicly available at https://github.com/chaudatascience/cha_mae_vit.
先前使用掩码自动编码器(MAE)的工作通常依赖随机补丁掩码,其前提假设是图像在不同通道间存在大量冗余,因而可以利用跨通道相关性重建被掩盖的内容。然而,这一假设在多通道成像(MCI)中并不成立:各通道可能提供互补信息,特征重叠很少。因此,这些MAE主要通过补丁重建学习单个通道内的局部结构,未能充分利用跨通道交互,限制了它们在MCI中的有效性。本文提出基于MAE的方法ChA-MAEViT,通过四个关键策略增强MCI通道间的特征学习:(1)动态通道-补丁掩码,迫使模型除重建被掩盖的补丁外还需重建缺失的通道,从而增强跨通道依赖并提高对不同通道配置的稳健性;(2)记忆令牌,作为长期记忆辅助促进跨通道信息共享,应对重建结构差异较大通道的挑战;(3)混合令牌融合模块,将细粒度的补丁令牌与全局类别令牌合并,以捕获更丰富的表示;(4)通道感知解码器,一个利用通道令牌高效重建图像补丁的轻量级解码器。在卫星和显微镜数据集CHAMMI、JUMP-CP和So2Sat上的实验表明,ChA-MAEViT比最先进的MCI-ViT方法高出3.0-21.5个百分点,凸显了跨通道交互在MCI中的重要性。我们的代码公开于 https://github.com/chaudatascience/cha_mae_vit。
论文及项目相关链接
PDF Accepted to NeurIPS 2025
Summary
本文介绍了针对多通道成像(MCI)的ChA-MAEViT方法,通过四种策略增强跨通道特征学习:动态通道补丁掩模提高跨通道依赖性和鲁棒性;记忆令牌促进通道间信息共享;混合令牌融合模块捕获更丰富表示;以及通道感知解码器有效重建图像补丁。在卫星和显微镜数据集上的实验表明,ChA-MAEViT显著优于其他MCI-ViTs模型,突显跨通道交互的重要性。
Key Takeaways
- ChA-MAEViT是一种基于Masked Autoencoder(MAE)的方法,用于增强多通道成像(MCI)中的跨通道特征学习。
- 传统MAE方法主要学习单个通道内的局部结构,未能充分利用跨通道交互,在MCI中的有效性受限。
- ChA-MAEViT通过四种关键策略提高MCI效果:动态通道补丁掩模、记忆令牌、混合令牌融合模块和通道感知解码器。
- 动态通道补丁掩模增强跨通道依赖性和鲁棒性,迫使模型重建缺失通道。
- 记忆令牌作为长期记忆辅助,促进通道间信息共享,解决结构多样通道的重构挑战。
- 混合令牌融合模块合并细粒度补丁令牌与全局类别令牌,以捕获更丰富表示。
- 通道感知解码器利用通道令牌有效地重建图像补丁。
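下面示意"动态通道-补丁掩码"的采样过程:先随机遮住若干完整通道,再对其余通道的 patch 做常规随机掩码,从而同时要求跨通道与跨 patch 的重建。比例与采样方式均为假设。

```python
import torch

def channel_patch_mask(num_channels: int, num_patches: int,
                       channel_ratio: float = 0.25, patch_ratio: float = 0.5):
    """返回 [C, N] 布尔掩码,True 表示被遮住。
    先遮住约 channel_ratio 的整条通道,再对其余通道按 patch_ratio 随机遮 patch。"""
    mask = torch.zeros(num_channels, num_patches, dtype=torch.bool)

    n_drop = max(1, int(round(channel_ratio * num_channels)))
    dropped = torch.randperm(num_channels)[:n_drop]
    mask[dropped] = True                                   # 整通道缺失,需跨通道重建

    keep = [c for c in range(num_channels) if c not in set(dropped.tolist())]
    n_mask = int(round(patch_ratio * num_patches))
    for c in keep:
        idx = torch.randperm(num_patches)[:n_mask]
        mask[c, idx] = True                                # 常规的随机 patch 掩码
    return mask

if __name__ == "__main__":
    m = channel_patch_mask(num_channels=8, num_patches=196)
    print(m.shape, m.float().mean().item())
```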
点此查看论文截图



Isolated Channel Vision Transformers: From Single-Channel Pretraining to Multi-Channel Finetuning
Authors:Wenyi Lian, Patrick Micke, Joakim Lindblad, Nataša Sladoje
Vision Transformers (ViTs) have achieved remarkable success in standard RGB image processing tasks. However, applying ViTs to multi-channel imaging (MCI) data, e.g., for medical and remote sensing applications, remains a challenge. In particular, MCI data often consist of layers acquired from different modalities. Directly training ViTs on such data can obscure complementary information and impair the performance. In this paper, we introduce a simple yet effective pretraining framework for large-scale MCI datasets. Our method, named Isolated Channel ViT (IC-ViT), patchifies image channels individually and thereby enables pretraining for multimodal multi-channel tasks. We show that this channel-wise patchifying is a key technique for MCI processing. More importantly, one can pretrain the IC-ViT on single channels and finetune it on downstream multi-channel datasets. This pretraining framework captures dependencies between patches as well as channels and produces robust feature representation. Experiments on various tasks and benchmarks, including JUMP-CP and CHAMMI for cell microscopy imaging, and So2Sat-LCZ42 for satellite imaging, show that the proposed IC-ViT delivers 4-14 percentage points of performance improvement over existing channel-adaptive approaches. Further, its efficient training makes it a suitable candidate for large-scale pretraining of foundation models on heterogeneous data. Our code is available at https://github.com/shermanlian/IC-ViT.
视觉Transformer(ViTs)在标准RGB图像处理任务中取得了显著的成功。然而,将ViTs应用于多通道成像(MCI)数据,例如医疗和遥感应用,仍然是一个挑战。特别是,MCI数据通常由从不同模态获得的层组成。直接在MCI数据上训练ViTs可能会掩盖互补信息并损害性能。在本文中,我们针对大规模MCI数据集引入了一个简单有效的预训练框架。我们的方法名为隔离通道视觉Transformer(IC-ViT),它单独地对图像通道进行划分,从而实现对多模态多通道任务的预训练。我们证明这种通道级的划分是多通道成像处理的关键技术。更重要的是,可以在单个通道上预训练IC-ViT,并在下游多通道数据集上进行微调。该预训练框架捕获了补丁之间的依赖关系以及通道间的依赖关系,并产生了稳健的特征表示。在各种任务和基准测试上的实验,包括用于细胞显微镜成像的JUMP-CP和CHAMMI,以及用于卫星成像的So2Sat-LCZ42,表明所提出的IC-ViT在现有的通道自适应方法上实现了4-14个百分点的性能改进。此外,其高效的训练使其成为在异质数据上进行大规模基础模型预训练的合适候选者。我们的代码可在https://github.com/shermanlian/IC-ViT获得。
论文及项目相关链接
PDF Paper has been accepted by BMVC as an Oral presentation
Summary
多通道成像数据在计算机视觉领域具有广泛的应用潜力,但对于在大型多通道成像数据集上训练的ViT模型仍然面临诸多挑战。该文提出一种新的预训练框架——Isolated Channel Vision Transformer(IC-ViT),该方法通过对图像通道进行单独的划分(patchifying),实现了多模态多通道任务的预训练。实验证明,该方法能够捕获patch之间的依赖关系以及通道间的联系,生成稳健的特征表示,且在多个任务上取得了显著的性能提升。
Key Takeaways
- Vision Transformers (ViTs) 在处理多通道成像(MCI)数据时面临挑战。
- IC-ViT 是一种新的预训练框架,针对大型 MCI 数据集设计。
- IC-ViT 通过单独划分图像通道(patchifying)实现多模态多通道任务的预训练。
- 实验证明,IC-ViT 在多个任务上比现有通道自适应方法性能提升4-14个百分点。
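下面示意"逐通道切分 patch"的做法:用同一个单通道 patch 嵌入分别处理每个通道,并给 token 加上通道嵌入,使任意通道数的输入都能得到统一的 token 序列。实现细节为假设,仅演示思路。

```python
import torch
import torch.nn as nn

class ChannelWisePatchEmbed(nn.Module):
    """对每个通道独立做 patch embedding,再拼接所有通道的 token。"""
    def __init__(self, patch_size=16, embed_dim=192, max_channels=16):
        super().__init__()
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.channel_embed = nn.Embedding(max_channels, embed_dim)

    def forward(self, x):                                  # x: [B, C, H, W]
        B, C, H, W = x.shape
        tokens = []
        for c in range(C):                                 # 每个通道当作单通道图像处理
            t = self.proj(x[:, c:c + 1])                   # [B, D, H/ps, W/ps]
            t = t.flatten(2).transpose(1, 2)               # [B, N, D]
            t = t + self.channel_embed.weight[c]           # 注入通道身份信息
            tokens.append(t)
        return torch.cat(tokens, dim=1)                    # [B, C*N, D]

if __name__ == "__main__":
    embed = ChannelWisePatchEmbed()
    out = embed(torch.randn(2, 5, 64, 64))                 # 5 通道输入
    print(out.shape)                                       # torch.Size([2, 80, 192]):5 通道 × 每通道 16 个 patch
```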
点此查看论文截图






Open Vocabulary Multi-Label Video Classification
Authors:Rohit Gupta, Mamshad Nayeem Rizve, Jayakrishnan Unnikrishnan, Ashish Tawari, Son Tran, Mubarak Shah, Benjamin Yao, Trishul Chilimbi
Pre-trained vision-language models (VLMs) have enabled significant progress in open vocabulary computer vision tasks such as image classification, object detection and image segmentation. Some recent works have focused on extending VLMs to open vocabulary single label action classification in videos. However, previous methods fall short in holistic video understanding which requires the ability to simultaneously recognize multiple actions and entities e.g., objects in the video in an open vocabulary setting. We formulate this problem as open vocabulary multilabel video classification and propose a method to adapt a pre-trained VLM such as CLIP to solve this task. We leverage large language models (LLMs) to provide semantic guidance to the VLM about class labels to improve its open vocabulary performance with two key contributions. First, we propose an end-to-end trainable architecture that learns to prompt an LLM to generate soft attributes for the CLIP text-encoder to enable it to recognize novel classes. Second, we integrate a temporal modeling module into CLIP’s vision encoder to effectively model the spatio-temporal dynamics of video concepts as well as propose a novel regularized finetuning technique to ensure strong open vocabulary classification performance in the video domain. Our extensive experimentation showcases the efficacy of our approach on multiple benchmark datasets.
预训练的视觉语言模型(VLMs)在开放词汇计算机视觉任务(如图像分类、目标检测和图像分割)方面取得了重大进展。一些最新研究着眼于将VLMs扩展到开放词汇的单标签视频动作分类。然而,以前的方法在整体视频理解方面存在不足,这需要同时识别多个动作和视频中的实体(例如,在开放词汇设置中识别视频中的对象)的能力。我们将问题定位为开放词汇的多标签视频分类,并提出了一种方法,以适应预训练的VLM(如CLIP)来解决此任务。我们利用大型语言模型(LLMs)为VLM提供关于类别标签的语义指导,以提高其开放词汇性能,这得益于两项关键贡献。首先,我们提出了一种端到端可训练架构,该架构可以学习提示LLM为CLIP文本编码器生成软属性,以使其能够识别新型类别。其次,我们将时序建模模块集成到CLIP的视觉编码器中,以有效地对视频概念的空间时间动态进行建模,并提出了一种新型正则微调技术,以确保在视频领域的开放词汇分类性能强大。我们在多个基准数据集上进行了广泛的实验,展示了我们的方法的有效性。
论文及项目相关链接
PDF Accepted at ECCV 2024
Summary
预训练视觉语言模型(VLMs)在开放词汇计算机视觉任务(如图像分类、目标检测和图像分割)方面取得了显著进展。然而,在需要同时识别视频中的多个动作和实体(例如,开放词汇设置中的视频中的对象)的整体视频理解方面,以前的方法有所不足。本研究将此问题表述为开放词汇的多标签视频分类问题,并提出了一种适应预训练的VLM(如CLIP)来解决此任务的方法。我们利用大型语言模型(LLMs)为VLM提供关于类别标签的语义指导,以提高其开放词汇性能,并做出了两项重要贡献。首先,我们提出了一种端到端可训练架构,该架构可以学习提示LLM为CLIP文本编码器生成软属性,以使其能够识别新型类别。其次,我们将时间建模模块集成到CLIP的视觉编码器中,以有效地对视频概念的空间时间动态进行建模,并提出了一种新的正则微调技术,以确保在视频领域的强开放词汇分类性能。
Key Takeaways
- 预训练视觉语言模型(VLMs)在开放词汇计算机视觉任务中表现优异,包括图像分类、目标检测和图像分割。
- 当前方法在多动作和实体识别的整体视频理解方面存在局限,尤其是在开放词汇环境下。
- 研究者提出了一个方法来解决开放词汇的多标签视频分类问题,并成功适应了预训练的VLM如CLIP。
- 利用大型语言模型(LLMs)提供语义指导,提高模型在开放词汇下的性能。
- 提出了一种端到端可训练架构,能够提示LLM生成软属性用于识别新类别。
- 在CLIP中集成了时间建模模块来捕捉视频的空间时间动态。
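下面示意开放词汇多标签视频分类的打分方式:对帧级视觉特征做时间池化,与各类别文本嵌入计算余弦相似度,再用 sigmoid 做逐类独立判定(区别于单标签的 softmax)。特征与文本嵌入此处用随机张量代替,仅演示打分逻辑。

```python
import torch
import torch.nn.functional as F

def multilabel_video_scores(frame_feats, class_text_embeds, temperature=0.07, threshold=0.5):
    """frame_feats: [T, D] 视频各帧的视觉特征;class_text_embeds: [C, D] 各类别文本嵌入。
    返回 (每个类别的多标签概率 [C], 超过阈值的预测)。"""
    video_feat = F.normalize(frame_feats.mean(dim=0), dim=-1)   # 简单的时间平均池化
    text = F.normalize(class_text_embeds, dim=-1)
    logits = text @ video_feat / temperature                    # [C] 余弦相似度 / 温度
    probs = torch.sigmoid(logits)                               # 多标签:各类别独立判定
    return probs, probs > threshold

if __name__ == "__main__":
    probs, preds = multilabel_video_scores(torch.randn(16, 512), torch.randn(10, 512))
    print(probs.shape, int(preds.sum()))
```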
点此查看论文截图


