
Vision Transformer


⚠️ 以下所有内容摘要均由大语言模型自动生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请给我们一些鼓励!⭐️ HuggingFace 免费体验

2025-11-16 更新

LampQ: Towards Accurate Layer-wise Mixed Precision Quantization for Vision Transformers

Authors:Minjun Kim, Jaeri Lee, Jongjin Kim, Jeongin Yun, Yongmo Kwon, U Kang

How can we accurately quantize a pre-trained Vision Transformer model? Quantization algorithms compress Vision Transformers (ViTs) into low-bit formats, reducing memory and computation demands with minimal accuracy degradation. However, existing methods rely on uniform precision, ignoring the diverse sensitivity of ViT components to quantization. Metric-based Mixed Precision Quantization (MPQ) is a promising alternative, but previous MPQ methods for ViTs suffer from three major limitations: 1) coarse granularity, 2) mismatch in metric scale across component types, and 3) quantization-unaware bit allocation. In this paper, we propose LampQ (Layer-wise Mixed Precision Quantization for Vision Transformers), an accurate metric-based MPQ method for ViTs to overcome these limitations. LampQ performs layer-wise quantization to achieve both fine-grained control and efficient acceleration, incorporating a type-aware Fisher-based metric to measure sensitivity. Then, LampQ assigns bit-widths optimally through integer linear programming and further updates them iteratively. Extensive experiments show that LampQ provides the state-of-the-art performance in quantizing ViTs pre-trained on various tasks such as image classification, object detection, and zero-shot quantization.

我们如何准确地量化预训练的Vision Transformer模型?量化算法将Vision Transformer(ViT)压缩成低位宽格式,在精度损失极小的情况下降低内存和计算需求。然而,现有方法依赖统一精度,忽略了ViT各组件对量化的敏感性差异。基于度量的混合精度量化(MPQ)是一个有前途的替代方案,但以往面向ViT的MPQ方法存在三个主要局限:1)粒度过粗;2)不同组件类型之间度量尺度不匹配;3)位宽分配未考虑量化本身的影响。在本文中,我们提出了LampQ(Layer-wise Mixed Precision Quantization for Vision Transformers,面向ViT的逐层混合精度量化),这是一种准确的基于度量的MPQ方法,旨在克服上述局限。LampQ执行逐层量化,兼顾细粒度控制与高效加速,并引入类型感知的Fisher度量来衡量敏感性。随后,LampQ通过整数线性规划最优地分配位宽,并进一步迭代更新。大量实验表明,LampQ在量化针对图像分类、目标检测和零样本量化等多种任务预训练的ViT时,都达到了最先进的性能。

论文及项目相关链接

PDF AAAI 2026

Summary

本文介绍了如何准确量化预训练的Vision Transformer模型。现有方法主要依赖统一精度,忽略了Vision Transformer组件对量化的不同敏感度。为此,提出了一种名为LampQ的基于度量的混合精度量化方法,以克服现有方法的局限性。LampQ实现了对Vision Transformer的层级混合精度量化,通过类型感知的Fisher度量来测量敏感度,并通过整数线性规划最优地分配位宽,进一步迭代更新。实验表明,LampQ在多种任务预训练的Vision Transformer量化方面达到了最新性能。

Key Takeaways

  1. 量化算法能够压缩Vision Transformer(ViTs)模型以降低内存和计算需求,同时保持最小的精度损失。
  2. 现有方法主要依赖统一精度进行量化,忽略了ViT组件对量化的不同敏感度。
  3. Metric-based Mixed Precision Quantization(MPQ)是一种有前途的替代方法,但存在三大局限性。
  4. LampQ(Layer-wise Mixed Precision Quantization for Vision Transformers)克服了这些局限性,实现了对Vision Transformer的层级混合精度量化。
  5. LampQ通过类型感知的Fisher度量来测量敏感度,实现了精细粒度的控制和高效加速。
  6. 通过整数线性规划,LampQ能够最优地分配位宽,并进一步迭代更新。
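
下面给出一个极简的示意代码(非论文官方实现):假设每层的 Fisher 敏感度与参数量已知,用"敏感度 × 2^(-2b)"作为量化误差代理,在平均位宽预算约束下穷举逐层位宽组合,以近似论文中用整数线性规划(ILP)求解的分配问题;其中的数值、候选位宽与函数名均为假设。

```python
# 极简示意:逐层混合精度位宽分配(假设的敏感度与误差代理,非 LampQ 官方实现)
from itertools import product

def allocate_bits(sensitivity, n_params, candidates=(2, 4, 8), avg_bit_budget=4.0):
    """在平均位宽不超过 avg_bit_budget 的约束下,最小化敏感度加权的量化误差代理。"""
    n_layers = len(sensitivity)
    total_budget = avg_bit_budget * sum(n_params)
    best_cost, best_plan = float("inf"), None
    # 层数较少时直接穷举;真实方法用整数线性规划(ILP)求解同一类问题
    for plan in product(candidates, repeat=n_layers):
        total_bits = sum(b * p for b, p in zip(plan, n_params))
        if total_bits > total_budget:
            continue
        # 误差代理:均匀量化的 MSE 随位宽近似按 2^(-2b) 衰减
        cost = sum(s * 2 ** (-2 * b) for s, b in zip(sensitivity, plan))
        if cost < best_cost:
            best_cost, best_plan = cost, plan
    return best_plan, best_cost

# 假设的每层 Fisher 敏感度与参数量(单位:百万参数,仅作演示)
sens   = [5.0, 1.2, 0.3, 2.4]
params = [1.0, 1.0, 4.0, 4.0]
plan, cost = allocate_bits(sens, params)
print("bit-widths per layer:", plan, "proxy cost:", round(cost, 4))
```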

Cool Papers

点此查看论文截图

Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

Authors:Konstantinos M. Dafnis, Dimitris N. Metaxas

Vision-Language Models (VLMs) excel at zero-shot inference but often degrade under test-time domain shifts. For this reason, episodic test-time adaptation strategies have recently emerged as powerful techniques for adapting VLMs to a single unlabeled image. However, existing adaptation strategies, such as test-time prompt tuning, typically require backpropagating through large encoder weights or altering core model components. In this work, we introduce Spectrum-Aware Test-Time Steering (STS), a lightweight adaptation framework that extracts a spectral subspace from the textual embeddings to define principal semantic directions and learns to steer latent representations in a spectrum-aware manner by adapting a small number of per-sample shift parameters to minimize entropy across augmented views. STS operates entirely at inference in the latent space, without backpropagation through or modification of the frozen encoders. Building on standard evaluation protocols, our comprehensive experiments demonstrate that STS largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8x faster with a 12x smaller memory footprint than conventional test-time prompt tuning. The code is available at https://github.com/kdafnis/STS.

视觉语言模型(VLMs)在零样本推理方面表现出色,但在测试时间域偏移时通常性能下降。因此,近期出现了针对VLMs的即时测试时间适应策略,这是一种强大的技术,用于将VLMs适应单个未标记图像。然而,现有的适应策略,如测试时间提示调整,通常需要反向传播大型编码器权重或更改核心模型组件。在这项工作中,我们引入了Spectrum-Aware Test-Time Steering(STS),这是一个轻量级的适应框架,它从文本嵌入中提取频谱子空间来定义主要的语义方向,并学习通过适应少量的每个样本偏移参数来以频谱感知的方式控制潜在表示,以最小化增强视图之间的熵。STS完全在潜在空间中的推理阶段运行,无需通过冻结的编码器进行反向传播或修改。基于标准评估协议的综合实验表明,STS在很大程度上超越了或表现优于最先进的测试时间适应方法,同时只引入少数额外的参数,并且相较于传统的测试时间提示调整实现了高达8倍的推理速度提升和高达12倍的内存占用减少。代码可在https://github.com/kdafnis/STS找到。

论文及项目相关链接

PDF NeurIPS 2025

Summary

该文章介绍了一种名为Spectrum-Aware Test-Time Steering(STS)的轻量级自适应框架,该框架能够从文本嵌入中提取频谱子空间,定义主要语义方向,并通过适应少量的每样本偏移参数来学习以频谱感知的方式引导潜在表示,从而最小化增强视图之间的熵。STS完全在潜在空间中进行推理,无需通过冻结编码器进行反向传播或修改。实验表明,STS大幅超越或媲美最先进的测试时自适应方法,同时只引入少量额外参数,并且相对于传统的测试时间提示调整,推理速度最高提升8倍,内存占用约缩小12倍。

Key Takeaways

  1. Vision-Language Models (VLMs) 在零样本推断方面表现出色,但在测试时间域转移时性能会下降。
  2. 测试时间的自适应策略已被视为适应VLMs到单个无标签图像的强大技术。
  3. 现有自适应策略如测试时间提示调整通常需要反向传播大量编码器权重或改变核心模型组件。
  4. STS是一个轻量级自适应框架,通过提取文本嵌入的频谱子空间来定义主要语义方向。
  5. STS学习通过适应少量样本偏移参数来引导潜在表示,以最小化增强视图之间的熵。
  6. STS完全在潜在空间中进行推理,无需修改或反向传播冻结的编码器。
  7. 实验表明STS在测试时自适应方法上具有显著优势,具有较少的额外参数、较小的内存占用和更快的推理速度。
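
下面是一个高度简化的 PyTorch 示意(非官方实现):先对类别文本嵌入做 SVD 得到若干主语义方向,再为单个样本学习少量沿这些方向的偏移系数,通过最小化各增强视图预测分布的平均熵来更新;张量均为随机数据,维度、迭代步数与温度系数均为假设。

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_cls, n_views, k = 64, 10, 4, 8          # 特征维度、类别数、增强视图数、主方向数(假设值)

text_emb = F.normalize(torch.randn(n_cls, d), dim=-1)     # 冻结的类别文本嵌入(示意)
img_feats = F.normalize(torch.randn(n_views, d), dim=-1)  # 同一图像多个增强视图的特征(示意)

# 从文本嵌入中提取谱子空间:取前 k 个右奇异向量作为主语义方向
_, _, Vh = torch.linalg.svd(text_emb, full_matrices=False)
U = Vh[:k]                                   # (k, d)

alpha = torch.zeros(k, requires_grad=True)   # 每个样本只需学习 k 个偏移参数
opt = torch.optim.Adam([alpha], lr=0.05)

for _ in range(20):                          # 测试时的少量迭代
    steered = F.normalize(img_feats + alpha @ U, dim=-1)   # 沿主语义方向平移潜在表示
    logits = 100.0 * steered @ text_emb.T
    probs = logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(-1).mean()  # 各增强视图的平均熵
    opt.zero_grad(); entropy.backward(); opt.step()

print("per-view prediction:", logits.argmax(-1).tolist())
```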

Cool Papers

点此查看论文截图

vMFCoOp: Towards Equilibrium on a Unified Hyperspherical Manifold for Prompting Biomedical VLMs

Authors:Minye Shao, Sihan Guo, Xinrun Li, Xingyu Miao, Haoran Duan, Yang Long

Recent advances in context optimization (CoOp) guided by large language model (LLM)-distilled medical semantic priors offer a scalable alternative to manual prompt engineering and full fine-tuning for adapting biomedical CLIP-based vision-language models (VLMs). However, prompt learning in this context is challenged by semantic misalignment between LLMs and CLIP variants due to divergent training corpora and model architectures; it further lacks scalability across continuously evolving families of foundation models. More critically, pairwise multimodal alignment via conventional Euclidean-space optimization lacks the capacity to model unified representations or apply localized geometric constraints, which tends to amplify modality gaps in complex biomedical imaging and destabilize few-shot adaptation. In this work, we propose vMFCoOp, a framework that inversely estimates von Mises-Fisher (vMF) distributions on a shared Hyperspherical Manifold, aligning semantic biases between arbitrary LLMs and CLIP backbones via Unified Semantic Anchors to achieve robust biomedical prompting and superior few-shot classification. Grounded in three complementary constraints, vMFCoOp demonstrates consistent improvements across 14 medical datasets, 12 medical imaging modalities, and 13 anatomical regions, outperforming state-of-the-art methods in accuracy, generalization, and clinical applicability. This work aims to continuously expand to encompass more downstream applications, and the corresponding resources are intended to be shared through https://github.com/VinyehShaw/UniEqui.

近期,利用大型语言模型(LLM)提炼的医学语义先验来指导上下文优化(CoOp)的方法,为适配基于CLIP的生物医学视觉语言模型(VLMs)提供了可替代手动提示工程和完全微调的可扩展方案。然而,在这种情况下,提示学习面临LLM与CLIP变体之间语义不对齐的挑战,这是由于两者采用不同的训练语料库和模型架构造成的;此外,它还难以扩展到不断演进的基础模型家族。更关键的是,基于传统欧几里得空间优化的成对多模态对齐,既无法建模统一表示,也难以施加局部几何约束,这往往会放大复杂生物医学成像中的模态差距,并使少样本适应变得不稳定。在这项工作中,我们提出了vMFCoOp框架,它在共享的超球流形上逆向估计von Mises-Fisher(vMF)分布,借助统一语义锚点(Unified Semantic Anchors)对齐任意LLM与CLIP骨干之间的语义偏差,从而实现稳健的生物医学提示和出色的少样本分类。基于三项互补约束,vMFCoOp在14个医学数据集、12种医学成像模态和13个解剖区域上均表现出一致的改进,在准确性、泛化能力和临床适用性方面优于现有最先进方法。本工作旨在持续扩展以涵盖更多下游应用,相关资源将通过https://github.com/VinyehShaw/UniEqui分享。

论文及项目相关链接

PDF Accepted as an Oral Presentation at AAAI 2026 Main Technical Track (this version is not peer-reviewed; it is the extended version)

Summary

基于大型语言模型(LLM)蒸馏医学语义先验的上下文优化(CoOp)进展为解决生物医学CLIP基视觉语言模型(VLMs)的适应问题提供了可伸缩的替代方案,相较于手动提示工程和完全微调的方式。然而,在此背景下的提示学习面临LLMs和CLIP变体间因不同训练语料库和模型架构导致的语义不匹配挑战,且缺乏跨不断演变的家族基础模型的扩展性。更重要的是,通过常规欧几里得空间优化的配对多模式对齐无法建模统一表示或应用局部几何约束,这在复杂的生物医学成像中会放大模式差距并破坏少量样本适应的稳定性。在本文中,我们提出了vMFCoOp框架,它通过共享超球面流形上反向估计von Mises-Fisher(vMF)分布,并通过统一语义锚对齐任意LLMs和CLIP骨架之间的语义偏见,实现稳健的生物医学提示和优越的小样本分类。基于三项互补约束,vMFCoOp在14个医疗数据集、12种医疗成像模式和13个解剖区域上实现了持续的改进,在准确性、通用性和临床适用性方面优于最先进的方法。

Key Takeaways

  1. LLM-distilled medical semantic priors advance context optimization (CoOp) for adapting biomedical CLIP-based vision-language models (VLMs).
  2. Prompt learning faces challenges due to semantic misalignment between LLMs and CLIP variants.
  3. 传统的欧几里得空间优化在复杂生物医学成像中存在模态差距和适应稳定性问题。
  4. vMFCoOp框架通过估计von Mises-Fisher分布和对齐语义偏见,实现了稳健的生物医学提示和少量样本分类。
  5. vMFCoOp框架在医疗数据集、医疗成像模式和解剖区域上的表现优于最先进的方法。
  6. vMFCoOp框架旨在不断扩展以涵盖更多下游应用,并通过https://github.com/VinyehShaw/UniEqui共享相关资源。
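
论文在共享超球面流形上逆向估计 von Mises-Fisher(vMF)分布。下面是一个与具体框架无关的小示例,演示如何从一组单位向量估计 vMF 的均值方向 μ 与集中度 κ(采用 Banerjee 等人的近似公式);数据为随机生成,仅说明数学步骤,并非 vMFCoOp 的实现。

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_vmf(x):
    """估计 vMF 分布参数:x 为 (n, d) 的向量,先归一化到单位球面。"""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    n, d = x.shape
    s = x.sum(axis=0)
    r_norm = np.linalg.norm(s)
    mu = s / r_norm                                        # 均值方向
    r_bar = r_norm / n                                     # 平均合成向量长度
    kappa = r_bar * (d - r_bar ** 2) / (1 - r_bar ** 2)    # Banerjee et al. 的近似估计
    return mu, kappa

# 构造围绕某个方向聚集的向量(假设这是某一类别的嵌入)
d = 16
center = rng.normal(size=d); center /= np.linalg.norm(center)
samples = center + 0.2 * rng.normal(size=(200, d))
mu, kappa = fit_vmf(samples)
print("cosine(mu, center) =", round(float(mu @ center), 3), " kappa ≈", round(float(kappa), 1))
```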

Cool Papers

点此查看论文截图

Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding

Authors:Jingtian Ma, Jingyuan Wang, Wayne Xin Zhao, Guoping Liu, Xiang Wen

Nowadays, navigation and ride-sharing apps have collected numerous images with spatio-temporal data. A core technology for analyzing such images, associated with spatiotemporal information, is Traffic Scene Understanding (TSU), which aims to provide a comprehensive description of the traffic scene. Unlike traditional spatio-temporal data analysis tasks, the dependence on both spatio-temporal and visual-textual data introduces distinct challenges to TSU task. However, recent research often treats TSU as a common image understanding task, ignoring the spatio-temporal information and overlooking the interrelations between different aspects of the traffic scene. To address these issues, we propose a novel SpatioTemporal Enhanced Model based on CILP (ST-CLIP) for TSU. Our model uses the classic vision-language model, CLIP, as the backbone, and designs a Spatio-temporal Context Aware Multiaspect Prompt (SCAMP) learning method to incorporate spatiotemporal information into TSU. The prompt learning method consists of two components: A dynamic spatio-temporal context representation module that extracts representation vectors of spatio-temporal data for each traffic scene image, and a bi-level ST-aware multi-aspect prompt learning module that integrates the ST-context representation vectors into word embeddings of prompts for the CLIP model. The second module also extracts low-level visual features and image-wise high-level semantic features to exploit interactive relations among different aspects of traffic scenes. To the best of our knowledge, this is the first attempt to integrate spatio-temporal information into visionlanguage models to facilitate TSU task. Experiments on two realworld datasets demonstrate superior performance in the complex scene understanding scenarios with a few-shot learning strategy.

如今,导航和拼车应用程序已经收集了大量带有时空数据的图像。分析这些与时空信息相关的图像的核心技术是交通场景理解(TSU),旨在为交通场景提供全面的描述。与传统的时空数据分析任务不同,对时空数据和视觉文本数据的依赖给TSU任务带来了独特的挑战。然而,最近的研究往往将TSU视为常见的图像理解任务,忽略了时空信息,并忽视了交通场景不同方面的相互关系。为了解决这些问题,我们提出了一种基于CLIP的新型时空增强模型(ST-CLIP)用于TSU。我们的模型以经典的视觉语言模型CLIP作为骨干网,并设计了一种时空上下文感知多方面提示(SCAMP)学习方法,将时空信息融入TSU中。提示学习方法由两个组件组成:一个动态时空上下文表示模块,用于提取每个交通场景图像的时空数据表示向量;一个两级ST感知多方面提示学习模块,将ST上下文表示向量集成到CLIP模型的提示词嵌入中。第二个模块还提取低级视觉特征和图像级高级语义特征,以利用交通场景不同方面的交互关系。据我们所知,这是首次尝试将时空信息整合到视觉语言模型中,以促进TSU任务。在两个真实世界数据集上的实验表明,在复杂的场景理解场景中,采用小样本学习策略可以取得卓越的性能。

论文及项目相关链接

PDF

Summary

该文主要探讨交通场景理解(TSU)技术,该技术需要同时处理时空数据和视觉文本数据,为现代导航和拼车应用提供对交通场景的综合描述。为解决现有研究忽视时空信息的问题,提出了基于CLIP的时空增强模型ST-CLIP,并结合时空上下文感知多方面提示学习方法SCAMP,实现对交通场景的深入理解和分析。此模型在真实世界的两个数据集上的实验结果表明,其在复杂场景理解方面表现优越。

Key Takeaways

  1. 交通场景理解(TSU)是处理导航和拼车应用中的时空数据和视觉文本数据的核心技术。
  2. 现有研究常常忽视时空信息以及交通场景中不同方面的相互关系。
  3. 提出了基于CLIP的时空增强模型ST-CLIP,首次尝试将时空信息融入视觉语言模型以促进TSU任务。
  4. ST-CLIP模型包含两大模块:动态时空上下文表示模块和双级ST感知多方面提示学习模块。
  5. 动态时空上下文表示模块为每个交通场景图像提取时空数据表示向量。
  6. 双级ST感知多方面提示学习模块将这些向量融入CLIP模型的词嵌入提示中,并提取低级别视觉特征和图像级高级语义特征,以利用交通场景中不同方面的交互关系。
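
下面用 PyTorch 给出一个非常简化的示意(非官方实现):先把每个场景的时空数据编码成一个上下文向量,再把它加到可学习的多方面提示词嵌入上,作为 CLIP 风格文本编码器的输入;其中的维度、"方面"数量与模块结构均为假设。

```python
import torch
import torch.nn as nn

class STPrompt(nn.Module):
    """把时空上下文向量注入可学习提示词嵌入的极简示意。"""
    def __init__(self, st_dim=8, embed_dim=512, n_ctx=4, n_aspects=3):
        super().__init__()
        self.st_encoder = nn.Sequential(nn.Linear(st_dim, 64), nn.ReLU(), nn.Linear(64, embed_dim))
        # 每个"方面"(例如天气、路况、拥堵程度)一组可学习提示词
        self.ctx = nn.Parameter(torch.randn(n_aspects, n_ctx, embed_dim) * 0.02)

    def forward(self, st_feat):
        st_vec = self.st_encoder(st_feat)                            # (B, embed_dim)
        # 把时空上下文向量加到每个方面的每个提示词上
        return self.ctx.unsqueeze(0) + st_vec[:, None, None, :]      # (B, n_aspects, n_ctx, embed_dim)

st_feat = torch.tensor([[0.3, 0.7, 0.1, 0.0, 1.0, 0.5, 0.2, 0.9]])   # 假设的时空特征(时间、位置编码等)
prompts = STPrompt()(st_feat)
print(prompts.shape)   # torch.Size([1, 3, 4, 512]),随后可拼接类别词嵌入送入文本编码器
```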

Cool Papers

点此查看论文截图

Improve Contrastive Clustering Performance by Multiple Fusing-Augmenting ViT Blocks

Authors:Cheng Wang, Shuisheng Zhou, Fengjiao Peng, Jin Sheng, Feng Ye, Yinli Dong

In the field of image clustering, the widely used contrastive learning networks improve clustering performance by maximizing the similarity between positive pairs and the dissimilarity of negative pairs of the inputs. Extant contrastive learning networks, whose two encoders often implicitly interact with each other by parameter sharing or momentum updating, may not fully exploit the complementarity and similarity of the positive pairs to extract clustering features from input data. To explicitly fuse the learned features of positive pairs, we design a novel multiple fusing-augmenting ViT blocks (MFAVBs) based on the excellent feature learning ability of Vision Transformers (ViT). Firstly, two preprocessed augmentions as positive pairs are separately fed into two shared-weight ViTs, then their output features are fused to input into a larger ViT. Secondly, the learned features are split into a pair of new augmented positive samples and passed to the next FAVBs, enabling multiple fusion and augmention through MFAVBs operations. Finally, the learned features are projected into both instance-level and clustering-level spaces to calculate the cross-entropy loss, followed by parameter updates by backpropagation to finalize the training process. To further enhance ability of the model to distinguish between similar images, our input data for the network we propose is preprocessed augmentions with features extracted from the CLIP pretrained model. Our experiments on seven public datasets demonstrate that MFAVBs serving as the backbone for contrastive clustering outperforms the state-of-the-art techniques in terms of clustering performance.

在图像聚类领域,广泛使用的对比学习网络通过最大化正对之间的相似性和负对之间的不相似性来提高聚类性能。现有的对比学习网络,其两个编码器通常通过参数共享或动量更新隐式地相互交互,可能无法充分利用正对之间的互补性和相似性,以从输入数据中提取聚类特征。为了显式融合正对的已学习特征,我们基于Vision Transformer(ViT)出色的特征学习能力,设计了一种新型的多重融合增强ViT块(MFAVBs)。首先,将两个预处理的增强视图作为正对分别输入到两个共享权重的ViT中,然后将它们的输出特征融合并输入到一个更大的ViT中。其次,将学习到的特征分割成一对新的增强正样本并传递给下一个FAVB,通过MFAVBs操作实现多次融合和增强。最后,将学习到的特征投影到实例级和聚类级空间,计算交叉熵损失,然后通过反向传播进行参数更新以完成训练过程。为了进一步提高模型区分相似图像的能力,所提网络的输入数据是经过预处理的增强样本,并结合了由预训练CLIP模型提取的特征。我们在七个公共数据集上的实验表明,作为对比聚类骨干网络的MFAVBs在聚类性能上超越了最新技术。

论文及项目相关链接

PDF

Summary
基于对比学习的图像聚类中,通过最大化正样本对间的相似性和负样本对间的差异性来提升性能。现有对比学习网络通过参数共享或动量更新使两个编码器隐式交互,可能无法充分利用正样本对的相似性和互补性来提取聚类特征。为此,设计基于Vision Transformer(ViT)的新型多重融合增强ViT块(MFAVBs),通过两个预处理增强作为正样本对分别输入两个共享权重ViTs,再融合其输出特征输入更大的ViT。此外,将学习到的特征分割成新的增强正样本对,通过MFAVBs操作进行多次融合和增强。最终,将学习到的特征投影到实例级和聚类级空间,计算交叉熵损失,通过反向传播更新参数完成训练过程。使用CLIP预训练模型的特征预处理输入数据,进一步提高模型区分相似图像的能力。在七个公开数据集上的实验表明,作为对比聚类的骨干网,MFAVBs的性能优于现有先进技术。

Key Takeaways

  1. 对比学习网络通过最大化正样本对间的相似性和负样本对间的差异性,提升了图像聚类的性能。
  2. 现有对比学习网络在提取聚类特征时,可能无法充分利用正样本对的相似性和互补性。
  3. 提出了基于Vision Transformer(ViT)的新型多重融合增强ViT块(MFAVBs),以更好地融合和提取特征。
  4. MFAVBs通过多次融合和增强学习到的特征,提高了模型的性能。
  5. 使用CLIP预训练模型的特征进行输入数据的预处理,增强了模型区分相似图像的能力。
  6. 在七个公开数据集上的实验表明,MFAVBs作为对比聚类的骨干网表现出优异性能。
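
下面是对"融合-增强 ViT 块"数据流的极简 PyTorch 示意(非官方实现):两路增强视图经共享权重的 Transformer 编码器处理后拼接融合,送入一个处理更长序列的编码器,再把输出对半拆成新的"正样本对"传给下一个块;层数、维度与头数均为假设值。

```python
import torch
import torch.nn as nn

class FAVB(nn.Module):
    """单个融合-增强块:共享权重双分支 -> 拼接融合 -> 更大的编码器 -> 拆分成新正样本对。"""
    def __init__(self, dim=128):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer(), num_layers=1)   # 两个分支共享同一套权重
        self.fused = nn.TransformerEncoder(layer(), num_layers=1)    # 处理拼接后更长的 token 序列

    def forward(self, x1, x2):
        h1, h2 = self.shared(x1), self.shared(x2)
        fused = self.fused(torch.cat([h1, h2], dim=1))   # (B, 2*n_tokens, dim)
        n = fused.shape[1] // 2
        return fused[:, :n], fused[:, n:]                # 拆成新的增强正样本对

blocks = nn.ModuleList([FAVB() for _ in range(3)])       # 多个 FAVB 级联即 MFAVBs
x1, x2 = torch.randn(2, 16, 128), torch.randn(2, 16, 128)  # 两路增强视图的 token(示意)
for blk in blocks:
    x1, x2 = blk(x1, x2)
print(x1.shape, x2.shape)
```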

Cool Papers

点此查看论文截图

VLMDiff: Leveraging Vision-Language Models for Multi-Class Anomaly Detection with Diffusion

Authors:Samet Hicsonmez, Abd El Rahman Shabayek, Djamila Aouada

Detecting visual anomalies in diverse, multi-class real-world images is a significant challenge. We introduce VLMDiff, a novel unsupervised multi-class visual anomaly detection framework. It integrates a Latent Diffusion Model (LDM) with a Vision-Language Model (VLM) for enhanced anomaly localization and detection. Specifically, a pre-trained VLM with a simple prompt extracts detailed image descriptions, serving as additional conditioning for LDM training. Current diffusion-based methods rely on synthetic noise generation, limiting their generalization and requiring per-class model training, which hinders scalability. VLMDiff, however, leverages VLMs to obtain normal captions without manual annotations or additional training. These descriptions condition the diffusion model, learning a robust normal image feature representation for multi-class anomaly detection. Our method achieves competitive performance, improving the pixel-level Per-Region-Overlap (PRO) metric by up to 25 points on the Real-IAD dataset and 8 points on the COCO-AD dataset, outperforming state-of-the-art diffusion-based approaches. Code is available at https://github.com/giddyyupp/VLMDiff.

检测多样化、多类别的现实世界图像中的视觉异常是一项重大挑战。我们提出了VLMDiff,一个新颖的无监督多类别视觉异常检测框架。它将潜在扩散模型(Latent Diffusion Model,LDM)与视觉语言模型(Vision-Language Model,VLM)相结合,以增强异常的定位和检测能力。具体来说,预训练的VLM通过简单的提示提取详细的图像描述,作为LDM训练的附加条件。当前基于扩散的方法依赖合成噪声生成,这限制了它们的泛化能力,并且需要按类别训练模型,阻碍了可扩展性。而VLMDiff利用VLM获得正常图像的描述,无需手动标注或额外训练。这些描述作为扩散模型的条件,使其学习到适用于多类别异常检测的稳健正常图像特征表示。我们的方法具有竞争力,在Real-IAD数据集上将像素级每区域重叠(PRO)指标提高了多达25个点,在COCO-AD数据集上提高了8个点,超越了最先进的基于扩散的方法。代码可在https://github.com/giddyyupp/VLMDiff获取。

论文及项目相关链接

PDF WACV 2026

Summary

一种新型的无监督多类视觉异常检测框架被提出,它结合了潜在扩散模型(LDM)和视觉语言模型(VLM)以增强异常定位和检测能力。利用预训练的VLM模型和简单提示来提取图像详细描述,为LDM训练提供额外的条件。该方法克服了现有扩散方法依赖合成噪声生成的局限性,无需每类模型训练即可实现良好的泛化能力。利用VLM获得正常图像的描述,无需手动注释或额外训练,从而条件扩散模型,学习多类异常检测的鲁棒正常图像特征表示。该方法在Real-IAD和COCO-AD数据集上实现了出色的性能提升。

Key Takeaways

  1. 提出了一种新型的无监督多类视觉异常检测框架。
  2. 结合了潜在扩散模型(LDM)和视觉语言模型(VLM)。
  3. 利用预训练的VLM模型提取图像详细描述,为LDM训练提供额外的条件。
  4. 克服了现有扩散方法的局限性,实现了良好的泛化能力。
  5. 利用VLM获得正常图像的描述,无需手动注释或额外训练。
  6. 条件扩散模型学习多类异常检测的鲁棒正常图像特征表示。

Cool Papers

点此查看论文截图

VectorSynth: Fine-Grained Satellite Image Synthesis with Structured Semantics

Authors:Daniel Cher, Brian Wei, Srikumar Sastry, Nathan Jacobs

We introduce VectorSynth, a diffusion-based framework for pixel-accurate satellite image synthesis conditioned on polygonal geographic annotations with semantic attributes. Unlike prior text- or layout-conditioned models, VectorSynth learns dense cross-modal correspondences that align imagery and semantic vector geometry, enabling fine-grained, spatially grounded edits. A vision language alignment module produces pixel-level embeddings from polygon semantics; these embeddings guide a conditional image generation framework to respect both spatial extents and semantic cues. VectorSynth supports interactive workflows that mix language prompts with geometry-aware conditioning, allowing rapid what-if simulations, spatial edits, and map-informed content generation. For training and evaluation, we assemble a collection of satellite scenes paired with pixel-registered polygon annotations spanning diverse urban scenes with both built and natural features. We observe strong improvements over prior methods in semantic fidelity and structural realism, and show that our trained vision language model demonstrates fine-grained spatial grounding. The code and data are available at https://github.com/mvrl/VectorSynth.

我们介绍了VectorSynth,这是一个基于扩散的像素级卫星图像合成框架,它根据带有语义属性的多边形地理注释进行条件合成。与之前的文本或布局条件模型不同,VectorSynth学习密集的多模式对应,对齐图像和语义矢量几何,从而实现精细的、空间基础的编辑。视觉语言对齐模块根据多边形语义生成像素级嵌入;这些嵌入引导条件图像生成框架,既尊重空间范围也尊重语义线索。VectorSynth支持交互工作流程,将语言提示与几何感知条件相结合,允许快速模拟“如果”场景、空间编辑和地图信息内容生成。为了训练和评估,我们收集了一系列卫星场景,配以覆盖多种城市场景的像素级多边形注释,包括人造和自然特征。我们观察到与先前的方法相比,语义保真度和结构逼真性都有显著改善,并且显示我们训练过的视觉语言模型具有精细的空间定位能力。代码和数据集可在https://github.com/mvrl/VectorSynth找到。

论文及项目相关链接

PDF

Summary

VectorSynth是一款基于扩散的像素级卫星图像合成框架,根据多边形地理注释和其语义属性进行条件合成。不同于以往的文本或布局条件模型,VectorSynth学习密集的多模态对应关系,对齐图像与语义矢量几何,实现精细的空间定位编辑。该框架包括一个视觉语言对齐模块,可从多边形语义中产生像素级嵌入,这些嵌入引导条件图像生成框架,尊重空间范围和语义线索。VectorSynth支持交互式工作流程,将语言提示与几何感知条件相结合,可实现快速场景模拟、空间编辑和地图内容生成。通过训练和评估,该框架在语义保真和结构现实感方面表现出对先前方法的显著改进。

Key Takeaways

  1. VectorSynth是一个基于扩散的卫星图像合成框架,能够根据多边形地理注释和其语义属性进行像素级合成。
  2. 该框架学习了密集的多模态对应关系,使图像与语义矢量几何对齐,实现了精细的空间定位编辑。
  3. VectorSynth包括视觉语言对齐模块,产生像素级嵌入,引导条件图像生成框架。
  4. 该框架支持交互式工作流程,结合语言提示和几何感知条件,实现快速场景模拟、空间编辑和地图内容生成。
  5. VectorSynth在语义保真和结构现实感方面表现出对先前方法的显著改进。
  6. 该框架能够尊重空间范围和语义线索,在生成图像时充分考虑这些因素。

Cool Papers

点此查看论文截图

From ACR O-RADS 2022 to Explainable Deep Learning: Comparative Performance of Expert Radiologists, Convolutional Neural Networks, Vision Transformers, and Fusion Models in Ovarian Masses

Authors:Ali Abbasian Ardakani, Afshin Mohammadi, Alisa Mohebbi, Anushya Vijayananthan, Sook Sam Leong, Lim Yi Ting, Mohd Kamil Bin Mohamad Fabell, U Rajendra Acharya, Sepideh Hatamikia

Background: The 2022 update of the Ovarian-Adnexal Reporting and Data System (O-RADS) ultrasound classification refines risk stratification for adnexal lesions, yet human interpretation remains subject to variability and conservative thresholds. Concurrently, deep learning (DL) models have demonstrated promise in image-based ovarian lesion characterization. This study evaluates radiologist performance applying O-RADS v2022, compares it to leading convolutional neural network (CNN) and Vision Transformer (ViT) models, and investigates the diagnostic gains achieved by hybrid human-AI frameworks. Methods: In this single-center, retrospective cohort study, a total of 512 adnexal mass images from 227 patients (110 with at least one malignant cyst) were included. Sixteen DL models, including DenseNets, EfficientNets, ResNets, VGGs, Xception, and ViTs, were trained and validated. A hybrid model integrating radiologist O-RADS scores with DL-predicted probabilities was also built for each scheme. Results: Radiologist-only O-RADS assessment achieved an AUC of 0.683 and an overall accuracy of 68.0%. CNN models yielded AUCs of 0.620 to 0.908 and accuracies of 59.2% to 86.4%, while ViT16-384 reached the best performance, with an AUC of 0.941 and an accuracy of 87.4%. Hybrid human-AI frameworks further significantly enhanced the performance of CNN models; however, the improvement for ViT models was not statistically significant (P-value >0.05). Conclusions: DL models markedly outperform radiologist-only O-RADS v2022 assessment, and the integration of expert scores with AI yields the highest diagnostic accuracy and discrimination. Hybrid human-AI paradigms hold substantial potential to standardize pelvic ultrasound interpretation, reduce false positives, and improve detection of high-risk lesions.

背景:卵巢附件报告和数据系统(O-RADS)的2022版超声分类细化了附件病变的风险分层,但人为解读仍存在差异和保守阈值。同时,深度学习(DL)模型在基于图像的卵巢病变表征方面显示出巨大潜力。本研究评估了放射科医生应用O-RADS v2022的表现,将其与领先的卷积神经网络(CNN)和视觉转换器(ViT)模型进行比较,并研究了混合人机框架实现的诊断增益。方法:在这项单中心回顾性队列研究中,共纳入来自227名患者(其中110名患者至少有一个恶性囊肿)的512张附件肿块图像。训练并验证了包括DenseNets、EfficientNets、ResNets、VGGs、Xception和ViTs在内的16种深度学习模型。此外,针对每种方案还构建了将放射科医生的O-RADS评分与深度学习预测概率进行整合的混合模型。结果:仅使用放射科医生进行O-RADS评估的AUC为0.683,总体准确率为68.0%。CNN模型的AUC为0.620至0.908,准确率为59.2%至86.4%,其中ViT16-384表现最佳,AUC为0.941,准确率为87.4%。混合人机框架进一步显著提高了CNN模型的性能,但对于ViT模型的改进不具有统计学显著性(P值>0.05)。结论:深度学习模型明显优于仅由放射科医生进行的O-RADS v2022评估,并且将专家评分与人工智能相结合可带来最高的诊断准确性和区分度。人机混合范式在标准化盆腔超声解读、减少误报和提高高风险病变检出方面拥有巨大潜力。

论文及项目相关链接

PDF 18 pages, 4 figures

Summary

该文本介绍了Ovarian-Adnexal Reporting and Data System(O-RADS)的超声分类的更新及其存在的问题。通过深度学习模型对卵巢病变进行图像识别展现了良好前景。本文对比了放射科医生运用O-RADS v2022的性能和领先的卷积神经网络(CNN)与视觉转换器(ViT)模型的表现,并探讨了混合人类人工智能框架的诊断优势。结果显示,深度学习模型显著优于放射科医生单独使用O-RADS v2022的评估结果,结合人工智能的专家评分可实现最高的诊断准确性和鉴别力。混合人类人工智能模式在标准化盆腔超声解读、减少误判阳性以及提高高风险病变检测方面具有巨大潜力。

Key Takeaways

  1. O-RADS 2022更新了其超声分类,对附件病变的风险分层进行了改进,但仍存在人为解读的变异性问题。
  2. 深度学习模型在基于图像的卵巢病变表征中显示出巨大潜力。
  3. 放射科医生运用O-RADS v2022的性能评估显示其AUC为0.683,整体准确率为68.0%。
  4. CNN模型的AUC范围在0.620至0.908之间,准确率在59.2%至86.4%之间;而ViT16-384表现最佳,AUC为0.941,准确率为87.4%。
  5. 混合人类人工智能框架显著增强了CNN模型的性能,但对ViT模型的改进没有统计学意义。
  6. 深度学习模型显著优于放射科医生单独使用O-RADS v2022的评估结果。
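
下面是一个用逻辑回归把放射科医生 O-RADS 评分与模型预测概率融合的最小示例(scikit-learn,合成数据,且为演示直接在训练集上评估),仅用于说明"人机混合"打分的一般做法,并非论文所用的具体融合方案。

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, n)                                       # 1 = 恶性(合成标签)
orads = np.clip(2 + 2 * y + rng.normal(0, 1.2, n), 1, 5)        # 模拟放射科医生的 O-RADS 评分(1-5)
dl_prob = np.clip(0.5 * y + 0.3 + rng.normal(0, 0.2, n), 0, 1)  # 模拟深度模型预测的恶性概率

X = np.column_stack([orads, dl_prob])
fusion = LogisticRegression().fit(X, y)                         # 混合模型:评分 + 概率 -> 恶性概率
hybrid_prob = fusion.predict_proba(X)[:, 1]

print("AUC radiologist only:", round(roc_auc_score(y, orads), 3))
print("AUC DL only         :", round(roc_auc_score(y, dl_prob), 3))
print("AUC hybrid          :", round(roc_auc_score(y, hybrid_prob), 3))
```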

Cool Papers

点此查看论文截图

A Dual-Mode ViT-Conditioned Diffusion Framework with an Adaptive Conditioning Bridge for Breast Cancer Segmentation

Authors:Prateek Singh, Moumita Dholey, P. K. Vinod

In breast ultrasound images, precise lesion segmentation is essential for early diagnosis; however, low contrast, speckle noise, and unclear boundaries make this difficult. Even though deep learning models have demonstrated potential, standard convolutional architectures frequently fall short in capturing enough global context, resulting in segmentations that are anatomically inconsistent. To overcome these drawbacks, we suggest a flexible, conditional Denoising Diffusion Model that combines an enhanced UNet-based generative decoder with a Vision Transformer (ViT) encoder for global feature extraction. We introduce three primary innovations: 1) an Adaptive Conditioning Bridge (ACB) for efficient, multi-scale fusion of semantic features; 2) a novel Topological Denoising Consistency (TDC) loss component that regularizes training by penalizing structural inconsistencies during denoising; and 3) a dual-head architecture that leverages the denoising objective as a powerful regularizer, enabling a lightweight auxiliary head to perform rapid and accurate inference on smaller datasets and a noise prediction head. Our framework establishes a new state-of-the-art on public breast ultrasound datasets, achieving Dice scores of 0.96 on BUSI, 0.90 on BrEaST and 0.97 on BUS-UCLM. Comprehensive ablation studies empirically validate that the model components are critical for achieving these results and for producing segmentations that are not only accurate but also anatomically plausible.

在乳腺超声图像中,精确的病灶分割对于早期诊断至关重要;然而,低对比度、斑点噪声和边界不清使这一任务十分困难。尽管深度学习模型已展现出潜力,但标准卷积架构往往难以捕捉足够的全局上下文,导致分割结果在解剖学上不一致。为了克服这些缺点,我们提出了一种灵活的条件去噪扩散模型,该模型结合了基于增强型UNet的生成解码器和用于全局特征提取的Vision Transformer(ViT)编码器。我们引入了三项主要创新:1)自适应条件桥(ACB),用于高效的多尺度语义特征融合;2)一种新的拓扑去噪一致性(TDC)损失项,通过惩罚去噪过程中的结构不一致来正则化训练;3)由轻量辅助头和噪声预测头组成的双头架构,其中去噪目标充当强正则化器,使辅助头能够在较小的数据集上进行快速而准确的推理。我们的框架在公共乳腺超声数据集上确立了新的最先进水平,在BUSI上取得0.96的Dice得分,在BrEaST上取得0.90,在BUS-UCLM上取得0.97。全面的消融研究从经验上验证了各模型组件对于取得上述结果、并产生既准确又符合解剖学合理性的分割至关重要。

论文及项目相关链接

PDF 5 pages, 2 figures, 3 tables, submitted to ISBI 2026

Summary
乳腺超声图像中的精确病灶分割对早期诊断至关重要,但低对比度、斑点噪声和边界不清等问题使其颇具挑战。本研究提出了一种灵活的、有条件的去噪扩散模型,该模型结合了增强的UNet生成解码器和Vision Transformer(ViT)编码器进行全局特征提取。主要创新包括自适应条件桥(ACB)、拓扑去噪一致性(TDC)损失组件和双头架构。该研究在公共乳腺超声数据集上达到了新的最先进水平,实现了BUSI上的Dice得分0.96、BrEaST上的0.90和BUS-UCLM上的0.97。

Key Takeaways

  1. 乳腺超声图像中的病灶分割对早期诊断至关重要,但由于低对比度、斑点噪声和边界不清等挑战,分割困难。
  2. 现有的深度学习模型如标准卷积架构在捕捉全局上下文方面存在不足,导致分割结果解剖上不一致。
  3. 研究提出了一种灵活的、有条件的去噪扩散模型,结合UNet生成解码器和Vision Transformer(ViT)编码器进行特征提取。
  4. 引入三个主要创新点:自适应条件桥(ACB)、拓扑去噪一致性(TDC)损失组件和双头架构。
  5. 该模型在公共乳腺超声数据集上表现优异,达到最先进的Dice得分。
  6. 全面的消融研究证明,模型组件对于实现这些结果和产生解剖上合理的分割至关重要。
  7. 该模型能在较小的数据集上进行快速准确的推断。
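
论文以 Dice 得分评估分割质量。下面给出二值分割 Dice 系数的标准计算方式(PyTorch 小示例,数据为手工构造),便于对照文中 0.96/0.90/0.97 等数值的含义;与论文模型本身无关。

```python
import torch

def dice_score(pred, target, eps=1e-6):
    """二值分割的 Dice 系数:2|A∩B| / (|A|+|B|)。pred/target 取值为 0 或 1。"""
    pred, target = pred.float().flatten(), target.float().flatten()
    inter = (pred * target).sum()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

pred   = torch.tensor([[0, 1, 1, 0],
                       [0, 1, 1, 0]])
target = torch.tensor([[0, 1, 1, 1],
                       [0, 1, 0, 0]])
print(float(dice_score(pred, target)))   # 2*3 / (4+4) = 0.75
```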

Cool Papers

点此查看论文截图

DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation

Authors:Boyi Li, Ce Zhang, Richard M. Timmerman, Wenxuan Bao

The emergence of vision language models (VLMs) bridges the gap between vision and language, enabling multimodal understanding beyond traditional visual-only deep learning models. However, transferring VLMs from the natural image domain to remote sensing (RS) segmentation remains challenging due to the large domain gap and the diversity of RS inputs across tasks, particularly in open-vocabulary semantic segmentation (OVSS) and referring expression segmentation (RES). Here, we propose a training-free unified framework, termed DGL-RSIS, which decouples visual and textual representations and performs visual-language alignment at both local semantic and global contextual levels. Specifically, a Global-Local Decoupling (GLD) module decomposes textual inputs into local semantic tokens and global contextual tokens, while image inputs are partitioned into class-agnostic mask proposals. Then, a Local Visual-Textual Alignment (LVTA) module adaptively extracts context-aware visual features from the mask proposals and enriches textual features through knowledge-guided prompt engineering, achieving OVSS from a local perspective. Furthermore, a Global Visual-Textual Alignment (GVTA) module employs a global-enhanced Grad-CAM mechanism to capture contextual cues for referring expressions, followed by a mask selection module that integrates pixel-level activations into mask-level segmentation outputs, thereby achieving RES from a global perspective. Experiments on the iSAID (OVSS) and RRSIS-D (RES) benchmarks demonstrate that DGL-RSIS outperforms existing training-free approaches. Ablation studies further validate the effectiveness of each module. To the best of our knowledge, this is the first unified training-free framework for RS image segmentation, which effectively transfers the semantic capability of VLMs trained on natural images to the RS domain without additional training.

视觉语言模型(VLMs)的出现架起了视觉与语言之间的桥梁,实现了超越传统纯视觉深度学习模型的多模态理解。然而,将VLMs从自然图像领域迁移到遥感(RS)分割仍然具有挑战性,主要由于领域差距大以及不同任务下遥感输入的多样性,特别是在开放词汇语义分割(OVSS)和引用表达式分割(RES)中。在这里,我们提出了一个无需训练的统一框架,名为DGL-RSIS,该框架解耦视觉和文本表示,并在局部语义和全局上下文两个层面进行视觉语言对齐。具体而言,全局局部解耦(GLD)模块将文本输入分解为局部语义标记和全局上下文标记,而图像输入则被划分为类别无关的掩膜提案。然后,局部视觉文本对齐(LVTA)模块自适应地从掩膜提案中提取上下文感知的视觉特征,并通过知识引导的提示工程丰富文本特征,从局部视角实现OVSS。此外,全局视觉文本对齐(GVTA)模块采用全局增强的Grad-CAM机制来捕获引用表达式的上下文线索,随后的掩膜选择模块将像素级激活整合为掩膜级分割输出,从而从全局视角实现RES。在iSAID(OVSS)和RRSIS-D(RES)基准测试上的实验表明,DGL-RSIS的性能超过了现有的免训练方法。消融研究进一步验证了每个模块的有效性。据我们所知,这是第一个面向遥感图像分割的免训练统一框架,无需额外训练即可将基于自然图像训练的VLMs的语义能力有效迁移到遥感领域。

论文及项目相关链接

PDF

Summary

该文本介绍了一种名为DGL-RSIS的无训练统一框架,它将视觉和文本表示分离,并在局部语义和全局上下文级别进行视觉语言对齐,从而实现了遥感图像分割中的开放词汇语义分割和引用表达式分割。实验表明,该框架在iSAID和RRSIS-D基准测试中表现优异,且无需额外训练即可将在自然图像上训练的视觉语言模型的语义能力转移到遥感领域。

Key Takeaways

  1. 引入新的无训练统一框架DGL-RSIS,实现遥感图像分割中的开放词汇语义分割和引用表达式分割。
  2. 通过分离视觉和文本表示,并在局部语义和全局上下文级别进行视觉语言对齐,提高了模型的性能。
  3. 利用全局局部解耦模块将文本输入分解为局部语义令牌和全局上下文令牌,同时图像输入被划分为类无关的掩膜提案。
  4. 通过局部视觉文本对齐模块自适应提取上下文感知的视觉特征,并通过知识引导提示工程丰富文本特征,实现局部视角的开放词汇语义分割。
  5. 全球视觉文本对齐模块采用全球增强Grad-CAM机制捕获引用表达式的上下文线索,并通过掩膜选择模块将像素级激活集成到掩膜级分割输出中,实现全局视角的引用表达式分割。
  6. 在iSAID和RRSIS-D基准测试中的实验结果表明,DGL-RSIS框架的性能优于现有的无训练方法。
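
下面用 PyTorch 给出"免训练开放词汇分割"中最基本一步的示意:对每个类别无关的掩膜提案做特征池化,再与类别文本嵌入计算余弦相似度,取最大者作为该掩膜的类别;特征与掩膜均为随机数据,并非 DGL-RSIS 的完整流程(未包含全局 Grad-CAM 与提示工程)。

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_cls, n_masks, H, W = 64, 5, 3, 32, 32              # 假设的维度

pixel_feat = torch.randn(d, H, W)                        # 视觉编码器输出的密集特征(示意)
text_emb = F.normalize(torch.randn(n_cls, d), dim=-1)    # 类别文本嵌入,如 "ship"、"plane" 等(示意)
masks = torch.rand(n_masks, H, W) > 0.7                  # 类别无关的掩膜提案(示意)

for i, m in enumerate(masks):
    feat = pixel_feat[:, m].mean(dim=1)                  # 掩膜内平均池化得到该提案的视觉特征
    feat = F.normalize(feat, dim=0)
    sim = feat @ text_emb.T                              # 与每个类别文本的余弦相似度
    print(f"mask {i}: class={int(sim.argmax())}, score={float(sim.max()):.3f}")
```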

Cool Papers

点此查看论文截图

LPLC: A Dataset for License Plate Legibility Classification

Authors:Lucas Wojcik, Gabriel E. Lima, Valfride Nascimento, Eduil Nascimento, Rayson Laroca, David Menotti

Automatic License Plate Recognition (ALPR) faces a major challenge when dealing with illegible license plates (LPs). While reconstruction methods such as super-resolution (SR) have emerged, the core issue of recognizing these low-quality LPs remains unresolved. To optimize model performance and computational efficiency, image pre-processing should be applied selectively to cases that require enhanced legibility. To support research in this area, we introduce a novel dataset comprising 10,210 images of vehicles with 12,687 annotated LPs for legibility classification (the LPLC dataset). The images span a wide range of vehicle types, lighting conditions, and camera/image quality levels. We adopt a fine-grained annotation strategy that includes vehicle- and LP-level occlusions, four legibility categories (perfect, good, poor, and illegible), and character labels for three categories (excluding illegible LPs). As a benchmark, we propose a classification task using three image recognition networks to determine whether an LP image is good enough, requires super-resolution, or is completely unrecoverable. The overall F1 score, which remained below 80% for all three baseline models (ViT, ResNet, and YOLO), together with the analyses of SR and LP recognition methods, highlights the difficulty of the task and reinforces the need for further research. The proposed dataset is publicly available at https://github.com/lmlwojcik/lplc-dataset.

在自动车牌识别(ALPR)中,遇到不清晰的车牌(LPs)是一大挑战。虽然出现了超分辨率(SR)等重建方法,但识别这些低质量车牌的核心问题仍未解决。为了优化模型性能和计算效率,应选择性地对需要提高清晰度的案例进行图像预处理。为了支持这一领域的研究,我们引入了一个包含10210张车辆图像和12687个带注释车牌的新数据集,用于清晰度分类(LPLC数据集)。这些图像涵盖了各种车型、光照条件和摄像头/图像质量水平。我们采用了精细的注释策略,包括车辆和车牌级别的遮挡情况,四个清晰度类别(完美、良好、较差和不可读),以及三个类别(不包括不可读车牌)的字符标签。作为基准测试,我们提出了一个分类任务,使用三个图像识别网络来确定车牌图像是否足够好、是否需要超分辨率处理或是否完全无法恢复。三个基准模型(ViT、ResNet和YOLO)的总体F1分数均低于80%,结合超分辨率和车牌识别方法的分析,突显了任务的难度并强调了需要进一步研究的必要性。所提出的数据集可在https://github.com/lmlwojcik/lplc-dataset公开访问。

论文及项目相关链接

PDF Accepted for presentation at the Conference on Graphics, Patterns and Images (SIBGRAPI) 2025

Summary

本文介绍了针对车牌识别(ALPR)中遇到的不清晰车牌问题,提出了一种新的数据集LPLC。该数据集包含10,210张车辆图像和12,687个带注释的车牌,用于分类车牌的可读性。采用精细标注策略,包括车辆和车牌级别的遮挡情况,以及四个可读类别(完美、良好、不良和不可读)。同时,提出了一种基于三个图像识别网络的分类任务基准,用于判断车牌图像是否足够好、是否需要超分辨率处理或完全无法恢复。实验结果显示,所有基准模型的F1得分均低于80%,强调了任务的难度和对进一步研究的需要。数据集已在GitHub上公开。

Key Takeaways

  1. ALPR在处理不清晰车牌时面临挑战。
  2. 重建方法如超分辨率(SR)虽已出现,但核心问题仍未解决。
  3. 针对需要提高可读性的情况,应选择性应用图像预处理。
  4. 介绍了包含10,210张车辆图像和12,687个带注释车牌的LPLC数据集,用于车牌可读性分类。
  5. 采用精细标注策略,包括车辆和车牌级别的遮挡信息及四个可读类别。
  6. 提出了一种基于三个图像识别网络的分类任务基准,用于判断车牌图像的状态。
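
基准任务本质上是一个三分类问题(足够清晰 / 需要超分辨率 / 不可恢复),并以总体 F1 分数衡量(三个基线均低于 80%)。下面是用 scikit-learn 计算 F1 的最小示例,标签为虚构数据,类别名称为假设;具体的平均方式以论文为准,此处以宏平均为例说明指标口径。

```python
from sklearn.metrics import classification_report, f1_score

CLASSES = ["good", "needs_sr", "unrecoverable"]          # 假设的三类可读性标签
y_true = ["good", "good", "needs_sr", "unrecoverable", "needs_sr", "good"]
y_pred = ["good", "needs_sr", "needs_sr", "unrecoverable", "good", "good"]

print("macro F1 =", round(f1_score(y_true, y_pred, average="macro", labels=CLASSES), 3))
print(classification_report(y_true, y_pred, labels=CLASSES))
```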

Cool Papers

点此查看论文截图

ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays

Authors:Shehroz S. Khan, Petar Przulj, Ahmed Ashraf, Ali Abedi

The global demand for radiologists is increasing rapidly due to a growing reliance on medical imaging services, while the supply of radiologists is not keeping pace. Advances in computer vision and image processing technologies present significant potential to address this gap by enhancing radiologists’ capabilities and improving diagnostic accuracy. Large language models (LLMs), particularly generative pre-trained transformers (GPTs), have become the primary approach for understanding and generating textual data. In parallel, vision transformers (ViTs) have proven effective at converting visual data into a format that LLMs can process efficiently. In this paper, we present ChestGPT, a deep-learning framework that integrates the EVA ViT with the Llama 2 LLM to classify diseases and localize regions of interest in chest X-ray images. The ViT converts X-ray images into tokens, which are then fed, together with engineered prompts, into the LLM, enabling joint classification and localization of diseases. This approach incorporates transfer learning techniques to enhance both explainability and performance. The proposed method achieved strong global disease classification performance on the VinDr-CXR dataset, with an F1 score of 0.76, and successfully localized pathologies by generating bounding boxes around the regions of interest. We also outline several task-specific prompts, in addition to general-purpose prompts, for scenarios radiologists might encounter. Overall, this framework offers an assistive tool that can lighten radiologists’ workload by providing preliminary findings and regions of interest to facilitate their diagnostic process.

随着对医学影像服务日益增长的依赖,对放射科医师的全球需求迅速增加,然而放射科医师的供给却跟不上需求。计算机视觉和图像处理技术的进展在增强放射科医师的能力和提高诊断准确性方面展现出巨大潜力,为解决这一差距提供了可能。大型语言模型(LLMs)尤其是预训练生成式转换器(GPTs)已成为理解和生成文本数据的主要方法。与此同时,视觉转换器(ViTs)在将视觉数据转换为LLMs可以高效处理的格式方面证明了其有效性。本文介绍了ChestGPT,这是一个深度学习框架,它将EVA ViT与Llama 2 LLM相结合,用于对胸部X射线图像中的疾病进行分类并定位感兴趣区域。ViT将X射线图像转换为令牌,然后连同工程提示一起输入LLM,实现了疾病的联合分类和定位。该方法结合了迁移学习技术,提高了可解释性和性能。所提出的方法在VinDr-CXR数据集上实现了强大的整体疾病分类性能,F1分数为0.76,并通过围绕感兴趣区域生成边界框成功地定位了病理。我们还概述了除通用提示外,针对放射科医生可能遇到的各种情景的特定任务提示。总的来说,该框架提供了一个辅助工具,可以通过提供初步发现和感兴趣区域来减轻放射科医生的工作量,从而有助于他们的诊断过程。

论文及项目相关链接

PDF 8 pages, 5 figures, 4 tables

Summary

本文介绍了全球对放射科医生的需求迅速增长,而放射科医生供应不足的问题。计算机视觉和图像处理技术的进步为解决这一差距提供了潜力。文章提出了一种名为ChestGPT的深度学习框架,它结合了EVA Vision Transformer(ViT)和Llama 2大型语言模型(LLM),用于对胸部X射线图像进行疾病分类和感兴趣区域定位。ViT将X射线图像转换为令牌,然后与工程提示一起输入LLM,实现疾病的联合分类和定位。该方法采用迁移学习技术,提高了可解释性和性能。在VinDr-CXR数据集上,该方法取得了较强的整体疾病分类性能,F1分数为0.76,并能成功定位病理,在感兴趣区域生成边界框。该框架为放射科医生提供了一种辅助工具,通过提供初步发现和感兴趣区域,减轻他们的工作量,促进诊断过程。

Key Takeaways

  1. 医学成像服务的需求增长导致对放射科医生的需求迅速增加,而放射科医生的供应不足。
  2. 计算机视觉和图像处理技术的进步有助于解决这一差距。
  3. ChestGPT是一个结合EVA Vision Transformer(ViT)和Llama 2大型语言模型(LLM)的深度学习框架。
  4. ViT将X射线图像转换为令牌,与工程提示一起输入LLM,实现疾病分类和定位。
  5. 该方法采用迁移学习技术,提高了模型的可解释性和性能。
  6. 在VinDr-CXR数据集上,该方法取得了较强的疾病分类性能,F1分数为0.76。
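
下面是"把 ViT 的图像 token 投影到 LLM 词嵌入空间,再与提示词拼接"这一数据流的极简 PyTorch 示意(非官方实现,未加载 EVA 或 Llama 2 权重);维度与模块均为假设。

```python
import torch
import torch.nn as nn

vit_dim, llm_dim, n_patches, prompt_len = 768, 4096, 196, 12   # 假设的维度

vit_tokens = torch.randn(1, n_patches, vit_dim)       # ViT 对一张胸片输出的 patch token(示意)
prompt_emb = torch.randn(1, prompt_len, llm_dim)      # 工程化提示词经 LLM 词嵌入后的向量(示意)

# 线性投影把视觉 token 映射到 LLM 的嵌入空间
projector = nn.Linear(vit_dim, llm_dim)
visual_emb = projector(vit_tokens)                    # (1, 196, 4096)

# 拼接后作为 LLM 的输入序列,由 LLM 输出疾病类别与感兴趣区域(边界框)文本
llm_inputs = torch.cat([visual_emb, prompt_emb], dim=1)
print(llm_inputs.shape)                               # torch.Size([1, 208, 4096])
```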

Cool Papers

点此查看论文截图

evMLP: An Efficient Event-Driven MLP Architecture for Vision

Authors:Zhentan Zheng

Deep neural networks have achieved remarkable results in computer vision tasks. In the early days, Convolutional Neural Networks (CNNs) were the mainstream architecture. In recent years, Vision Transformers (ViTs) have become increasingly popular. In addition, exploring applications of multi-layer perceptrons (MLPs) has provided new perspectives for research into vision model architectures. In this paper, we present evMLP accompanied by a simple event-driven local update mechanism. The proposed evMLP can independently process patches on images or feature maps via MLPs. We define changes between consecutive frames as ``events’’. Under the event-driven local update mechanism, evMLP selectively processes patches where events occur. For sequential image data (e.g., video processing), this approach improves computational performance by avoiding redundant computations. Through ImageNet image classification experiments, evMLP attains accuracy competitive with state-of-the-art models. More significantly, experimental results on multiple video datasets demonstrate that evMLP reduces computational cost via its event-driven local update mechanism while maintaining output consistency with its non-event-driven baseline. The code and pre-trained models are available at https://github.com/i-evi/evMLP.

深度神经网络在计算机视觉任务中取得了显著成果。在早期,卷积神经网络(CNN)是主流架构。近年来,视觉转换器(ViT)越来越受欢迎。此外,多层感知器(MLP)的应用探索为视觉模型架构的研究提供了新的视角。在本文中,我们提出了带有简单事件驱动局部更新机制的evMLP。所提出的evMLP可以通过MLP独立处理图像或特征图上的补丁。我们将连续帧之间的变化定义为“事件”。在事件驱动的局部更新机制下,evMLP会选择性地处理发生事件的地方。对于序列图像数据(例如视频处理),这种方法通过避免冗余计算来提高计算性能。通过ImageNet图像分类实验,evMLP的准确率与最先进的模型具有竞争力。更重要的是,多个视频数据集上的实验结果表明,evMLP通过其事件驱动的局部更新机制降低了计算成本,同时保持了与非事件驱动基准线的输出一致性。代码和预训练模型可在https://github.com/i-evi/evMLP上找到。

论文及项目相关链接

PDF

Summary
深度神经网络在计算机视觉任务中取得了显著成果。近年来,Vision Transformer(ViT)越来越受欢迎。本文提出evMLP,并配备了简单的事件驱动局部更新机制。evMLP可独立通过MLP处理图像或特征图的patches。事件被定义为连续帧之间的变化。在事件驱动局部更新机制下,evMLP会选择性地处理发生事件的patches。对于序列图像数据(如视频处理),此方法通过避免冗余计算提高了计算性能。evMLP在ImageNet图像分类实验中达到了与最新模型相当的性能。更重要的是,在多个视频数据集上的实验结果表明,evMLP通过其事件驱动局部更新机制降低了计算成本,同时保持了与非事件驱动的基线一致的输出。

Key Takeaways

  1. 深度神经网络在计算机视觉任务中表现优秀,其中CNN和Vision Transformer(ViT)是主流架构。
  2. evMLP结合了多层感知机(MLP)和事件驱动局部更新机制,能独立处理图像或特征图的patches。
  3. 事件被定义为连续帧之间的变化,evMLP通过事件驱动选择性地处理发生变化的patches。
  4. 对于序列图像数据(如视频处理),evMLP提高了计算性能,通过避免冗余计算降低了计算成本。
  5. evMLP在ImageNet图像分类实验中的性能与最新模型相当。
  6. 在多个视频数据集上,evMLP通过事件驱动局部更新机制在保持输出一致性的同时降低了计算成本。
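
下面用 PyTorch 演示"事件驱动的局部更新"这一核心思想的极简实现(非官方代码):把连续两帧划分为 patch,只有与上一帧差异超过阈值的 patch 才重新经过 MLP,其余直接复用缓存结果;阈值与网络结构均为假设。

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
P, C, n_patches, thresh = 8, 3, 16, 0.05                # patch 大小、通道数、patch 数、事件阈值(假设)

mlp = nn.Sequential(nn.Linear(P * P * C, 64), nn.GELU(), nn.Linear(64, 64))

def to_patches(frame):                                   # frame: (C, 32, 32) -> (16, P*P*C)
    p = frame.unfold(1, P, P).unfold(2, P, P)            # (C, 4, 4, P, P)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, P * P * C)

frame0 = torch.rand(C, 32, 32)
frame1 = frame0.clone()
frame1[:, :8, :8] += 0.5                                 # 只有左上角一个 patch 发生"事件"

with torch.no_grad():
    prev_patches = to_patches(frame0)
    cache = mlp(prev_patches)                            # 第一帧:全部计算并缓存
    cur_patches = to_patches(frame1)
    events = (cur_patches - prev_patches).abs().mean(dim=1) > thresh   # 判断哪些 patch 发生了事件
    out = cache.clone()
    out[events] = mlp(cur_patches[events])               # 只重新计算发生事件的 patch
print("recomputed patches:", int(events.sum()), "/", n_patches)
```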

Cool Papers

点此查看论文截图

FaSDiff: Balancing Perception and Semantics in Face Compression via Stable Diffusion Priors

Authors:Yimin Zhou, Yichong Xia, Bin Chen, Mingyao Hong, Jiawei Li, Zhi Wang, Yaowei Wang

With the increasing deployment of facial image data across a wide range of applications, efficient compression tailored to facial semantics has become critical for both storage and transmission. While recent learning-based face image compression methods have achieved promising results, they often suffer from degraded reconstruction quality at low bit rates. Directly applying diffusion-based generative priors to this task leads to suboptimal performance in downstream machine vision tasks, primarily due to poor preservation of high-frequency details. In this work, we propose FaSDiff (\textbf{Fa}cial Image Compression with a \textbf{S}table \textbf{Diff}usion Prior), a novel diffusion-driven compression framework designed to enhance both visual fidelity and semantic consistency. FaSDiff incorporates a high-frequency-sensitive compressor to capture fine-grained details and generate robust visual prompts for guiding the diffusion model. To address low-frequency degradation, we further introduce a hybrid low-frequency enhancement module that disentangles and preserves semantic structures, enabling stable modulation of the diffusion prior during reconstruction. By jointly optimizing perceptual quality and semantic preservation, FaSDiff effectively balances human visual fidelity and machine vision accuracy. Extensive experiments demonstrate that FaSDiff outperforms state-of-the-art methods in both perceptual metrics and downstream task performance.

随着面部图像数据在众多应用领域的广泛应用,针对面部语义的高效压缩对于存储和传输都至关重要。虽然基于学习的面部图像压缩方法已经取得了有前景的结果,但在低码率下,它们往往面临重建质量下降的问题。直接将基于扩散的生成先验应用于此任务会导致在下游机器视觉任务中的性能不佳,主要是因为高频细节的保留不佳。在本研究中,我们提出了FaSDiff(具有稳定扩散先验的面部图像压缩),这是一个新颖的扩散驱动压缩框架,旨在提高视觉保真度和语义一致性。FaSDiff结合了一个高频敏感压缩器,用于捕捉精细细节并为扩散模型生成稳健的视觉提示。为了解决低频退化问题,我们进一步引入了一个混合低频增强模块,该模块可以分离并保留语义结构,从而在重建过程中实现扩散先验的稳定调制。通过联合优化感知质量和语义保留,FaSDiff有效地平衡了人类视觉保真度和机器视觉准确性。大量实验表明,FaSDiff在感知指标和下游任务性能方面都优于最新方法。

论文及项目相关链接

PDF

Summary

高效压缩面部图像数据对存储和传输至关重要。尽管基于学习的面部图像压缩方法取得了一定成果,但在低比特率下重建质量往往下降。直接将扩散先验应用于该任务导致下游机器视觉任务性能不佳,主要因为高频细节保存不佳。本文提出FaSDiff(一种带有稳定扩散先验的面部图像压缩方法),旨在提高视觉保真度和语义一致性。FaSDiff结合高频敏感压缩器捕捉精细细节,为扩散模型生成稳健的视觉提示。为解决低频退化问题,引入混合低频增强模块,解开并保留语义结构,在重建过程中实现稳定的扩散先验调制。通过优化感知质量和语义保留,FaSDiff有效平衡了人类视觉保真和机器视觉准确性。实验表明,FaSDiff在感知指标和下游任务性能方面均优于现有方法。

Key Takeaways

  1. 高效压缩面部图像数据对实际应用至关重要。
  2. 基于学习的面部图像压缩方法在低比特率下重建质量下降。
  3. 直接应用扩散先验于面部图像压缩会影响下游机器视觉任务性能。
  4. FaSDiff结合了高频敏感压缩器以捕捉精细细节并生成视觉提示。
  5. 为解决低频退化问题,引入了混合低频增强模块来保留语义结构。
  6. FaSDiff在优化感知质量和语义保留之间达到了平衡,提高了视觉保真度和语义一致性。
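
论文强调分别处理高频细节与低频语义。下面给出一个与模型无关的小示例:用 FFT 把图像分解为低频与高频两部分(低通半径为假设值),示意"高频敏感"与"低频增强"各自作用的信号成分;这并非 FaSDiff 的实际模块。

```python
import torch

def frequency_split(img, radius=8):
    """img: (C, H, W)。返回 (低频分量, 高频分量),按构造满足 low + high == img。"""
    C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = (((yy - H // 2) ** 2 + (xx - W // 2) ** 2).float()).sqrt()
    lowpass = (dist <= radius).float()                   # 以频谱中心为圆心的低通掩膜
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * lowpass, dim=(-2, -1))).real
    high = img - low                                     # 剩余部分即高频细节
    return low, high

img = torch.rand(3, 64, 64)
low, high = frequency_split(img)
print(low.shape, high.shape)                             # 两个分量形状与原图一致
```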

Cool Papers

点此查看论文截图

ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation

Authors:Tobias Christian Nauen, Brian Moser, Federico Raue, Stanislav Frolov, Andreas Dengel

Transformers, particularly Vision Transformers (ViTs), have achieved state-of-the-art performance in large-scale image classification. However, they often require large amounts of data and can exhibit biases that limit their robustness and generalizability. This paper introduces ForAug, a novel data augmentation scheme that addresses these challenges and explicitly includes inductive biases, which commonly are part of the neural network architecture, into the training data. ForAug is constructed by using pretrained foundation models to separate and recombine foreground objects with different backgrounds, enabling fine-grained control over image composition during training. It thus increases the data diversity and effective number of training samples. We demonstrate that training on ForNet, the application of ForAug to ImageNet, significantly improves the accuracy of ViTs and other architectures by up to 4.5 percentage points (p.p.) on ImageNet and 7.3 p.p. on downstream tasks. Importantly, ForAug enables novel ways of analyzing model behavior and quantifying biases. Namely, we introduce metrics for background robustness, foreground focus, center bias, and size bias and show that training on ForNet substantially reduces these biases compared to training on ImageNet. In summary, ForAug provides a valuable tool for analyzing and mitigating biases, enabling the development of more robust and reliable computer vision models. Our code and dataset are publicly available at https://github.com/tobna/ForAug.

Transformer,特别是Vision Transformer(ViT),已经在大规模图像分类中取得了最先进的性能。然而,它们通常需要大量的数据,并可能表现出限制其稳健性和泛化能力的偏见。本文介绍了一种新的数据增强方案ForAug,该方案解决了这些挑战,并将通常作为神经网络架构一部分的归纳偏见显式地纳入训练数据中。ForAug通过使用预训练的基础模型来分离和重组前景物体与不同的背景,实现对训练过程中图像组成的精细控制。因此,它增加了数据多样性和有效的训练样本数量。我们证明,在ForNet上应用ForAug对ImageNet进行训练,显著提高了ViT和其他架构的准确性,在ImageNet上提高了高达4.5个百分点(p.p.),在下游任务上提高了7.3个百分点。重要的是,ForAug提供了新的分析模型行为和量化偏见的方法。具体来说,我们引入了背景稳健性、前景焦点、中心偏见和大小偏见的指标,并表明与在ImageNet上进行训练相比,在ForNet上进行训练可以大大减少这些偏见。总之,ForAug为分析和缓解偏见提供了有价值的工具,使开发更稳健、更可靠的计算机视觉模型成为可能。我们的代码和数据集可在https://github.com/tobna/ForAug公开访问。

论文及项目相关链接

PDF v2: added DeiT, added ablation vs simple copy-paste

Summary
基于Vision Transformer在图像分类领域的出色表现,该文提出了一种新的数据增强方案ForAug。它通过引入归纳偏见并将其纳入训练数据,解决了模型需要大量的数据和可能出现的偏见问题。ForAug利用预训练的基准模型分离并重新组合前景物体与不同的背景,实现对训练图像组成的精细控制,提高了数据多样性和有效训练样本数量。在ImageNet及下游任务上的实验表明,使用ForAug训练的ViTs和其他架构的精度显著提高。此外,ForAug还提供了分析和缓解模型偏见的新方法,有助于开发更稳健和可靠的计算机视觉模型。

Key Takeaways

  1. Vision Transformers (ViTs) 在大规模图像分类上表现出卓越的性能。
  2. ForAug是一种新的数据增强方案,旨在解决ViTs需要大量数据和存在偏见的问题。
  3. ForAug通过将预训练的基准模型用于分离和重新组合前景物体与背景,提高了数据多样性和有效训练样本数量。
  4. ForAug显著提高了ViTs和其他架构在ImageNet及下游任务上的精度。
  5. ForAug提供了分析和缓解模型偏见的新方法,包括背景稳健性、前景焦点、中心偏见和大小偏见等指标。
  6. 使用ForAug训练的模型相比在ImageNet上训练的模型,显著减少了偏见。
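
下面用 NumPy 给出"前景-背景重组"这一数据增强思想的最小示意:给定前景掩膜,把前景像素粘贴到另一张背景图上,得到新的训练样本;真实的 ForAug 依赖预训练基础模型来分离前景并做更精细的合成,此处的数组与掩膜均为随机或手工构造的示例。

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 64, 64

foreground_img = rng.random((H, W, 3))            # 含前景物体的原图(示意)
background_img = rng.random((H, W, 3))            # 另一张背景图(示意)

# 前景掩膜:真实流程中由预训练分割/基础模型给出,这里用一个居中的方块代替
mask = np.zeros((H, W, 1), dtype=np.float32)
mask[16:48, 16:48] = 1.0

# 重组:掩膜内取前景像素,掩膜外取新背景
recombined = mask * foreground_img + (1.0 - mask) * background_img
print(recombined.shape, recombined.min() >= 0, recombined.max() <= 1)
```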

Cool Papers

点此查看论文截图

DeNAS-ViT: Data Efficient NAS-Optimized Vision Transformer for Ultrasound Image Segmentation

Authors:Renqi Chen, Xinzhe Zheng, Haoyang Su, Kehan Wu

Accurate segmentation of ultrasound images is essential for reliable medical diagnoses but is challenged by poor image quality and scarce labeled data. Prior approaches have relied on manually designed, complex network architectures to improve multi-scale feature extraction. However, such handcrafted models offer limited gains when prior knowledge is inadequate and are prone to overfitting on small datasets. In this paper, we introduce DeNAS-ViT, a data-efficient NAS-optimized Vision Transformer, the first method to leverage neural architecture search (NAS) for ultrasound image segmentation by automatically optimizing model architecture through token-level search. Specifically, we propose an efficient NAS module that performs multi-scale token search prior to the ViT’s attention mechanism, effectively capturing both contextual and local features while minimizing computational costs. Given ultrasound’s data scarcity and NAS’s inherent data demands, we further develop a NAS-guided semi-supervised learning (SSL) framework. This approach integrates network independence and contrastive learning within a stage-wise optimization strategy, significantly enhancing model robustness under limited-data conditions. Extensive experiments on public datasets demonstrate that DeNAS-ViT achieves state-of-the-art performance, maintaining robustness with minimal labeled data. Moreover, we highlight DeNAS-ViT’s generalization potential beyond ultrasound imaging, underscoring its broader applicability.

超声图像的精确分割对于可靠的医学诊断至关重要,但面临着图像质量差和标注数据稀缺的挑战。之前的方法依赖于手动设计复杂的网络架构来改善多尺度特征提取。然而,当先验知识不足时,此类手工模型提供的收益有限,并且容易在小数据集上过度拟合。在本文中,我们介绍了DeNAS-ViT,这是一种数据高效的NAS优化视觉转换器,是第一种利用神经架构搜索(NAS)进行超声图像分割的方法,通过token级别的搜索自动优化模型架构。具体来说,我们提出了一个高效的NAS模块,在ViT的注意力机制之前进行多尺度token搜索,有效地捕捉上下文和局部特征,同时降低计算成本。考虑到超声数据稀缺和NAS固有的数据需求,我们还开发了一个NAS引导的半监督学习(SSL)框架。该方法将网络独立性和对比学习纳入分阶段优化策略中,显著提高了在有限数据条件下的模型稳健性。在公共数据集上的广泛实验表明,DeNAS-ViT达到了最先进的性能,用最小的标注数据维持了稳健性。此外,我们突出了DeNAS-ViT在超声成像之外的通用化潜力,强调了其更广泛的应用性。

论文及项目相关链接

PDF Accepted by AAAI-26 Main Technical Track

Summary

本文介绍了针对超声图像分割的DeNAS-ViT方法。该方法结合了神经网络架构搜索(NAS)和视觉转换器(ViT),通过token级别的搜索自动优化模型架构。为提高数据效率,开发了一个高效的NAS模块,实现了多尺度token搜索与ViT注意力机制的结合。此外,开发了一种NAS引导的半监督学习(SSL)框架,以在有限数据条件下提高模型的稳健性。在公共数据集上的实验表明,DeNAS-ViT实现了最先进的性能,并在少量标注数据的情况下保持稳健性。此外,DeNAS-ViT还具有广泛的适用性。

Key Takeaways

  1. DeNAS-ViT结合了神经网络架构搜索(NAS)和视觉转换器(ViT),为超声图像分割带来了创新方法。
  2. 通过token级别的搜索自动优化模型架构,提高了超声图像分割的准确性。
  3. 高效的NAS模块实现了多尺度token搜索与ViT注意力机制的结合,提升了特征提取的效果。
  4. 开发NAS引导的半监督学习(SSL)框架,提高了在有限数据条件下的模型稳健性。
  5. 在公共数据集上,DeNAS-ViT实现了最先进的性能,并展示了在少量标注数据下的稳健性。
  6. DeNAS-ViT具有广泛的适用性,不仅适用于超声图像分割,还可能应用于其他领域。
  7. 该方法对于解决超声图像质量差和标注数据稀缺的问题具有积极意义。
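
下面是"token 级多尺度搜索"这一思路的一个可微示意(非 DeNAS-ViT 官方实现):对不同 patch 大小得到的 token 化结果按可学习的架构权重加权混合,类似 DARTS 式的松弛;patch 尺寸、维度与池化对齐方式均为假设。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTokenSearch(nn.Module):
    """可微的多尺度 token 选择示意:按架构权重混合不同 patch 大小的 token。"""
    def __init__(self, in_ch=1, dim=96, patch_sizes=(4, 8, 16), out_tokens=49):
        super().__init__()
        self.embeds = nn.ModuleList(
            nn.Conv2d(in_ch, dim, kernel_size=p, stride=p) for p in patch_sizes)
        self.arch_logits = nn.Parameter(torch.zeros(len(patch_sizes)))  # 待搜索的架构参数
        self.out_tokens = out_tokens

    def forward(self, x):
        w = F.softmax(self.arch_logits, dim=0)
        mixed = 0.0
        for wi, embed in zip(w, self.embeds):
            t = embed(x)                                                # (B, dim, H/p, W/p)
            t = F.adaptive_avg_pool2d(t, int(self.out_tokens ** 0.5))   # 对齐 token 数
            mixed = mixed + wi * t.flatten(2).transpose(1, 2)           # (B, out_tokens, dim)
        return mixed                                                    # 送入后续 ViT 注意力层

x = torch.rand(2, 1, 224, 224)     # 假设的单通道超声图像
tokens = MultiScaleTokenSearch()(x)
print(tokens.shape)                # torch.Size([2, 49, 96])
```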

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !