⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Please note: do not use these summaries in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-06
Diffusion Models are Robust Pretrainers
Authors:Mika Yagoda, Shady Abu-Hussein, Raja Giryes
Diffusion models have gained significant attention for high-fidelity image generation. Our work investigates the potential of exploiting diffusion models for adversarial robustness in image classification and object detection. Adversarial attacks challenge standard models in these tasks by perturbing inputs to force incorrect predictions. To address this issue, many approaches use training schemes for forcing the robustness of the models, which increase training costs. In this work, we study models built on top of off-the-shelf diffusion models and demonstrate their practical significance: they provide a low-cost path to robust representations, allowing lightweight heads to be trained on frozen features without full adversarial training. Our empirical evaluations on ImageNet, CIFAR-10, and PASCAL VOC show that diffusion-based classifiers and detectors achieve meaningful adversarial robustness with minimal compute. While clean and adversarial accuracies remain below state-of-the-art adversarially trained CNNs or ViTs, diffusion pretraining offers a favorable tradeoff between efficiency and robustness. This work opens a promising avenue for integrating diffusion models into resource-constrained robust deployments.
Paper and project links
PDF To be published in IEEE Signal Processing Letters
Summary
This paper explores the potential of diffusion models for adversarial robustness in image classification and object detection. Adversarial attacks challenge standard models by perturbing inputs to force incorrect predictions, and many existing defenses rely on training schemes that enforce robustness at the cost of higher training expense. The paper instead studies models built on off-the-shelf diffusion models and shows their practical value: they provide a low-cost path to robust representations, allowing lightweight heads to be trained on frozen features without full adversarial training. Empirical evaluations on ImageNet, CIFAR-10, and PASCAL VOC show that diffusion-based classifiers and detectors achieve meaningful adversarial robustness with limited compute. Although clean and adversarial accuracy remain below state-of-the-art adversarially trained CNNs or ViTs, diffusion pretraining offers a favorable trade-off between efficiency and robustness, opening a promising path to integrating diffusion models into resource-constrained robust deployments.
Key Takeaways
- Diffusion models have attracted attention for high-fidelity image generation.
- The paper explores the adversarial-robustness potential of diffusion models for image classification and object detection.
- Adversarial attacks challenge standard models by perturbing their inputs.
- Many methods rely on costly training schemes to strengthen model robustness.
- Models built on diffusion backbones have practical value: they offer a low-cost route to robust representations.
- Empirical evaluations on several datasets show that diffusion models achieve meaningful adversarial robustness with limited compute.
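To make the low-cost recipe above concrete, here is a minimal PyTorch sketch of training a lightweight head on frozen features. The small convolutional backbone is only a stand-in for the intermediate features of an off-the-shelf diffusion model, and all names and hyperparameters are our own illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained feature extractor. In the paper the features
# would come from an off-the-shelf diffusion model's denoising network.
backbone = nn.Sequential(
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in backbone.parameters():          # freeze the backbone
    p.requires_grad_(False)
backbone.eval()

head = nn.Linear(128, 10)                # lightweight classification head
opt = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One toy training step on random data; only the head receives gradients.
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
with torch.no_grad():
    feats = backbone(images)             # frozen features, no adversarial training
opt.zero_grad()
loss = loss_fn(head(feats), labels)
loss.backward()
opt.step()
print(float(loss))
```

Only the head is updated, which is what keeps the training cost low compared with full adversarial training of the backbone.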
Click to view paper screenshots
PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
Authors:Antonio Oroz, Matthias Nießner, Tobias Kirschstein
We present PercHead, a method for single-image 3D head reconstruction and semantic 3D editing - two tasks that are inherently challenging due to severe view occlusions, weak perceptual supervision, and the ambiguity of editing in 3D space. We develop a unified base model for reconstructing view-consistent 3D heads from a single input image. The model employs a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space through iterative cross-attention. Rendering is performed using Gaussian Splatting. At the heart of our approach is a novel perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. Our model achieves state-of-the-art performance in novel-view synthesis and, furthermore, exhibits exceptional robustness to extreme viewing angles compared to established baselines. Furthermore, this base model can be seamlessly extended for semantic 3D editing by swapping the encoder and finetuning the network. In this variant, we disentangle geometry and style through two distinct input modalities: a segmentation map to control geometry and either a text prompt or a reference image to specify appearance. We highlight the intuitive and powerful 3D editing capabilities of our model through a lightweight, interactive GUI, where users can effortlessly sculpt geometry by drawing segmentation maps and stylize appearance via natural language or image prompts. Project Page: https://antoniooroz.github.io/PercHead Video: https://www.youtube.com/watch?v=4hFybgTk4kE
Paper and project links
PDF Project Page: https://antoniooroz.github.io/PercHead/ Video: https://www.youtube.com/watch?v=4hFybgTk4kE
Summary
This paper presents PercHead, a method for single-image 3D head reconstruction and semantic 3D editing. A unified base model reconstructs view-consistent 3D heads from a single image using a dual-branch encoder followed by a ViT-based decoder that lifts 2D features into 3D space, with rendering performed by Gaussian Splatting. At its core is a perceptual supervision strategy based on DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric and appearance fidelity. The method achieves state-of-the-art novel-view synthesis and is exceptionally robust to extreme viewing angles. The base model can also be extended seamlessly to semantic 3D editing by swapping the encoder and fine-tuning the network, letting users sculpt geometry by drawing segmentation maps and stylize appearance with natural-language or image prompts.
Key Takeaways
- PercHead is a method for single-image 3D head reconstruction and semantic 3D editing.
- A unified base model lifts 2D features into 3D space with a dual-branch encoder and a ViT-based decoder.
- Rendering uses Gaussian Splatting, and supervision follows a perceptual strategy based on DINOv2 and SAM2.1.
- The model reaches state-of-the-art novel-view synthesis and is highly robust to extreme viewing angles.
- The base model extends to semantic 3D editing by swapping the encoder and fine-tuning the network.
- Users can sculpt geometry by drawing segmentation maps and stylize appearance via natural-language or image prompts.
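The perceptual supervision idea — comparing a rendered view with its target in the feature space of a frozen foundation model — can be sketched with DINOv2 as the feature extractor. This is a generic illustration under our own assumptions (torch.hub access to facebookresearch/dinov2 and a cosine-distance loss on global embeddings), not PercHead's actual objective, which also involves SAM2.1.

```python
import torch
import torch.nn.functional as F

# Frozen DINOv2 backbone as a perceptual feature extractor
# (assumes the facebookresearch/dinov2 torch.hub entrypoint and network access).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dino.eval()
for p in dino.parameters():
    p.requires_grad_(False)

def perceptual_loss(rendered: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine distance between frozen DINOv2 embeddings of rendered and target views.

    Both inputs are (B, 3, H, W) with H and W divisible by the patch size (14).
    """
    f_rendered = dino(rendered)                 # (B, D) global embeddings
    with torch.no_grad():
        f_target = dino(target)
    return (1.0 - F.cosine_similarity(f_rendered, f_target, dim=-1)).mean()

rendered = torch.rand(2, 3, 224, 224, requires_grad=True)  # e.g. Gaussian-splat renders
target = torch.rand(2, 3, 224, 224)                        # ground-truth views
loss = perceptual_loss(rendered, target)
loss.backward()                                            # gradients reach the renders
print(float(loss))
```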
Click to view paper screenshots
RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Authors:Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He
Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed “BBox and Index as Visual Prompt” (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
Paper and project links
Summary:
Large-scale chemical reaction datasets are crucial for AI research in chemistry, but existing reaction data mostly exist as images inside papers, so they are not machine-readable and cannot be used to train machine-learning models. This paper proposes the RxnCaption framework for the chemical Reaction Diagram Parsing (RxnDP) task, reformulating the traditional coordinate-prediction-driven parsing process as an image-captioning problem that large vision-language models (LVLMs) handle naturally. The "BBox and Index as Visual Prompt" (BIVP) strategy uses the state-of-the-art MolYOLO molecular detector to pre-draw molecular bounding boxes and indices directly onto the input image, which simplifies model design while markedly improving structure-extraction quality. The authors also build RxnCaption-11k, a dataset an order of magnitude larger than prior real-world literature benchmarks, and RxnCaption-VL achieves state-of-the-art performance on multiple metrics. The method, dataset, and models are expected to advance structured information extraction from the chemical literature and accelerate AI applications in chemistry.
Key Takeaways:
- Large-scale chemical reaction datasets are crucial for AI research in chemistry.
- Existing reaction data usually exist as images and cannot be used directly to train machine-learning models.
- The RxnCaption framework recasts reaction-diagram parsing as an image-captioning problem that large vision-language models handle naturally.
- The BIVP strategy simplifies model design and improves structure-extraction quality.
- The large RxnCaption-11k dataset covers multiple layout types, and the model outperforms prior benchmarks on it.
- The method, dataset, and models advance structured information extraction from the chemical literature.
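The BIVP step itself — pre-drawing numbered molecule boxes onto the diagram before handing it to the captioning model — is easy to picture with a few lines of PIL. The detections below are hard-coded stand-ins for MolYOLO outputs, and the helper name, colors, and coordinates are our illustrative assumptions.

```python
from PIL import Image, ImageDraw

def draw_visual_prompts(image, boxes):
    """Overlay numbered bounding boxes (x0, y0, x1, y1) onto a copy of the image."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        draw.text((x0 + 4, y0 + 4), str(idx), fill="red")   # molecule index
    return out

# Stand-ins for a reaction-diagram image and MolYOLO detections.
diagram = Image.new("RGB", (640, 320), "white")
detections = [(40, 60, 180, 200), (260, 60, 400, 200), (460, 60, 600, 200)]

prompted = draw_visual_prompts(diagram, detections)
prompted.save("diagram_with_bivp.png")
# The prompted image, plus a text instruction, is what the LVLM would caption,
# referring to molecules by their drawn indices instead of raw coordinates.
```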
Click to view paper screenshots
Dynamic Multi-level Weighted Alignment Network for Zero-shot Sketch-based Image Retrieval
Authors:Hanwen Su, Ge Song, Jiyan Wang, Yuanbo Zhu
The problem of zero-shot sketch-based image retrieval (ZS-SBIR) has achieved increasing attention due to its wide applications, e.g. e-commerce. Despite progress made in this field, previous works suffer from using imbalanced samples of modalities and inconsistent low-quality information during training, resulting in sub-optimal performance. Therefore, in this paper, we introduce an approach called Dynamic Multi-level Weighted Alignment Network for ZS-SBIR. It consists of three components: (i) a Uni-modal Feature Extraction Module that includes a CLIP text encoder and a ViT for extracting textual and visual tokens, (ii) a Cross-modal Multi-level Weighting Module that produces an alignment weight list by the local and global aggregation blocks to measure the aligning quality of sketch and image samples, (iii) a Weighted Quadruplet Loss Module aiming to improve the balance of domains in the triplet loss. Experiments on three benchmark datasets, i.e., Sketchy, TU-Berlin, and QuickDraw, show our method delivers superior performances over the state-of-the-art ZS-SBIR methods.
Paper and project links
Summary
For zero-shot sketch-based image retrieval (ZS-SBIR), this paper proposes a Dynamic Multi-level Weighted Alignment Network. It combines a uni-modal feature extraction module, a cross-modal multi-level weighting module, and a weighted quadruplet loss module to address imbalanced modality samples and inconsistent, low-quality information during training. Experiments on the Sketchy, TU-Berlin, and QuickDraw benchmarks show that the method outperforms existing ZS-SBIR approaches.
Key Takeaways
- ZS-SBIR has attracted attention because of its wide applications, such as e-commerce.
- Existing methods suffer from imbalanced modality samples and inconsistent, low-quality information during training.
- The proposed Dynamic Multi-level Weighted Alignment Network has three components: a uni-modal feature extraction module, a cross-modal multi-level weighting module, and a weighted quadruplet loss module.
- The uni-modal feature extraction module uses a CLIP text encoder and a ViT to extract textual and visual tokens.
- The cross-modal multi-level weighting module produces an alignment weight list via local and global aggregation blocks to measure how well sketch and image samples align.
- The weighted quadruplet loss module aims to improve the balance of domains in the triplet loss.
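A weighted quadruplet loss of the general form used in retrieval can be sketched as follows. The per-sample weights would come from the alignment-weight list described above; here they are random placeholders, and the margins and distance choice are our assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def weighted_quadruplet_loss(anchor, positive, negative1, negative2,
                             weights, margin1=0.5, margin2=0.25):
    """Generic weighted quadruplet loss over a batch of embeddings.

    anchor/positive/negative1/negative2: (B, D) tensors; weights: (B,) in [0, 1].
    """
    d_ap = F.pairwise_distance(anchor, positive)
    d_an = F.pairwise_distance(anchor, negative1)
    d_nn = F.pairwise_distance(negative1, negative2)
    term1 = F.relu(d_ap - d_an + margin1)   # triplet-style anchor/negative term
    term2 = F.relu(d_ap - d_nn + margin2)   # keeps the two negatives apart as well
    return (weights * (term1 + term2)).mean()

B, D = 8, 256
rand_emb = lambda: F.normalize(torch.randn(B, D), dim=-1)
alignment_weights = torch.rand(B)           # placeholder for the alignment weight list
loss = weighted_quadruplet_loss(rand_emb(), rand_emb(), rand_emb(), rand_emb(),
                                alignment_weights)
print(float(loss))
```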
Click to view paper screenshots
Linear Differential Vision Transformer: Learning Visual Contrasts via Pairwise Differentials
Authors:Yifan Pu, Jixuan Ying, Qixiu Li, Tianzhu Ye, Dongchen Han, Xiaochen Wang, Ziyi Wang, Xinyu Shao, Gao Huang, Xiu Li
Vision Transformers (ViTs) have become a universal backbone for both image recognition and image generation. Yet their Multi-Head Self-Attention (MHSA) layer still performs a quadratic query-key interaction for every token pair, spending the bulk of computation on visually weak or redundant correlations. We introduce Visual-Contrast Attention (VCA), a drop-in replacement for MHSA that injects an explicit notion of discrimination while reducing the theoretical complexity from O(N N C) to O(N n C) with n << N. VCA first distils each head’s dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and negative stream whose differential interaction highlights what truly separates one region from another. The module adds fewer than 0.3M parameters to a DeiT-Tiny backbone, requires no extra FLOPs, and is wholly architecture-agnostic. Empirically, VCA lifts DeiT-Tiny top-1 accuracy on ImageNet-1K from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K by 2.1 to 5.2 points across both diffusion (DiT) and flow (SiT) models. Extensive ablations confirm that (i) spatial pooling supplies low-variance global cues, (ii) dual positional embeddings are indispensable for contrastive reasoning, and (iii) combining the two in both stages yields the strongest synergy. VCA therefore offers a simple path towards faster and sharper Vision Transformers. The source code is available at https://github.com/LeapLabTHU/LinearDiff.
Paper and project links
PDF NeurIPS 2025
Summary
This paper introduces Visual-Contrast Attention (VCA), a drop-in replacement for the Multi-Head Self-Attention (MHSA) layer in Vision Transformers (ViTs). VCA injects an explicit notion of discrimination while lowering the theoretical complexity of attention. It first distills each head's dense query field into a handful of spatially pooled visual-contrast tokens, then splits them into a learnable positive and a negative stream whose differential interaction highlights what separates one region from another. On ImageNet-1K, VCA lifts DeiT-Tiny top-1 accuracy from 72.2% to 75.6% (+3.4) and improves three strong hierarchical ViTs by up to 3.1%, while in class-conditional ImageNet generation it lowers FID-50K for both diffusion (DiT) and flow (SiT) models. The source code is publicly available.
Key Takeaways
- Vision Transformers (ViTs) have become a universal backbone for image recognition and image generation.
- The Multi-Head Self-Attention (MHSA) layer spends much of its computation on visually weak or redundant correlations.
- Visual-Contrast Attention (VCA) is a drop-in replacement for MHSA that introduces an explicit notion of visual contrast while reducing computational complexity.
- VCA pools queries into a few visual-contrast tokens and distinguishes regions through the differential interaction of a positive and a negative stream.
- VCA raises DeiT-Tiny top-1 accuracy by 3.4 points and also improves other hierarchical ViTs.
- In class-conditional ImageNet generation, VCA lowers the FID-50K score for both diffusion (DiT) and flow (SiT) models.
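One plausible, heavily simplified reading of the mechanism — every token attends to a small set of pooled tokens through a positive and a negative stream whose outputs are subtracted — is sketched below. This is our single-head toy interpretation for intuition only; the module name, pooling choice, and shapes are assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn

class SimplifiedContrastAttention(nn.Module):
    """Toy single-head attention over n << N pooled tokens, with a positive
    and a negative stream whose outputs are subtracted."""

    def __init__(self, dim: int, n_pooled: int = 8):
        super().__init__()
        self.n = n_pooled
        self.q = nn.Linear(dim, dim)
        self.kv_pos = nn.Linear(dim, 2 * dim)
        self.kv_neg = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, C)
        B, N, C = x.shape
        # Pool the N tokens down to n "contrast" tokens: (B, n, C).
        pooled = nn.functional.adaptive_avg_pool1d(x.transpose(1, 2), self.n).transpose(1, 2)
        q = self.q(x)                                         # (B, N, C)
        out = 0
        for kv, sign in ((self.kv_pos, 1.0), (self.kv_neg, -1.0)):
            k, v = kv(pooled).chunk(2, dim=-1)                # (B, n, C) each
            attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)  # (B, N, n)
            out = out + sign * (attn @ v)                     # differential combination
        return self.proj(out)                                 # (B, N, C), cost ~ O(N * n * C)

x = torch.randn(2, 196, 64)
print(SimplifiedContrastAttention(64)(x).shape)               # torch.Size([2, 196, 64])
```

Because every query only interacts with n pooled tokens instead of all N tokens, the attention cost scales with N·n·C rather than N·N·C.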
Click to view paper screenshots
Weakly Supervised Pneumonia Localization from Chest X-Rays Using Deep Neural Network and Grad-CAM Explanations
Authors:Kiran Shahi, Anup Bagale
This study proposes a weakly supervised deep learning framework for pneumonia classification and localization from chest X-rays, utilizing Grad-CAM explanations. Instead of costly pixel-level annotations, our approach utilizes image-level labels to generate clinically meaningful heatmaps that highlight regions affected by pneumonia. We evaluate seven ImageNet-pretrained architectures ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V2/V3, and ViT-B16 under identical training conditions with focal loss and patient-wise splits to prevent data leakage. Experimental results on the Kermany CXR dataset demonstrate that ResNet-18 and EfficientNet-B0 achieve the best overall test accuracy of 98%, ROC-AUC = 0.997, and F1 = 0.987, while MobileNet-V2 provides an optimal trade-off between accuracy and computational cost. Grad-CAM visualizations confirm that the proposed models focus on clinically relevant lung regions, supporting the use of interpretable AI for radiological diagnostics. This work highlights the potential of weakly supervised explainable models that enhance pneumonia screening transparency, and clinical trust in AI-assisted medical imaging. https://github.com/kiranshahi/pneumonia-analysis
Paper and project links
Summary
This study proposes a weakly supervised deep-learning framework for pneumonia classification and localization from chest X-rays, using Grad-CAM explanations. Instead of costly pixel-level annotations, it uses image-level labels to generate clinically meaningful heatmaps that highlight regions affected by pneumonia. On the Kermany CXR dataset, ResNet-18 and EfficientNet-B0 perform best, with 98% test accuracy, ROC-AUC of 0.997, and an F1 score of 0.987, while MobileNet-V2 offers a good trade-off between accuracy and computational cost. Grad-CAM visualizations confirm that the models focus on clinically relevant lung regions, supporting interpretable AI for radiological diagnosis. The work highlights the potential of weakly supervised, explainable models to increase the transparency of pneumonia screening and clinical trust in AI-assisted medical imaging.
Key Takeaways
- The study proposes a weakly supervised deep-learning approach for pneumonia classification and localization from chest X-rays.
- Grad-CAM explanations turn image-level labels into clinically meaningful heatmaps that highlight pneumonia-affected regions.
- Several pretrained architectures are evaluated, including ResNet-18/50, DenseNet-121, EfficientNet-B0, MobileNet-V2/V3, and ViT-B16.
- On the Kermany CXR dataset, ResNet-18 and EfficientNet-B0 perform best, reaching 98% test accuracy.
- Grad-CAM visualizations confirm that the models focus on clinically relevant lung regions, strengthening the interpretability of AI in radiological diagnosis.
- Weakly supervised models show promise for pneumonia screening, improving diagnostic transparency and clinical trust in AI.
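Grad-CAM itself is easy to reproduce with a forward hook on the last convolutional block: weight each activation map by the average gradient of the target class and apply a ReLU. The sketch below uses an untrained torchvision ResNet-18 as a stand-in for the fine-tuned pneumonia classifier; the layer choice and random input are illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()       # stand-in for the fine-tuned classifier
store = {}

def save_activation(module, inputs, output):
    store["act"] = output                                        # feature maps of layer4
    output.register_hook(lambda grad: store.update(grad=grad))   # and their gradients

model.layer4.register_forward_hook(save_activation)        # last convolutional block

x = torch.randn(1, 3, 224, 224)             # a preprocessed chest X-ray would go here
logits = model(x)
logits[0, logits[0].argmax()].backward()    # backprop the top-scoring class

# Grad-CAM: weight each activation map by its average gradient, then ReLU.
weights = store["grad"].mean(dim=(2, 3), keepdim=True)            # (1, C, 1, 1)
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))   # (1, 1, h, w)
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)          # heatmap in [0, 1]
print(cam.shape)                            # torch.Size([1, 1, 224, 224])
```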
Click to view paper screenshots
CompAgent: An Agentic Framework for Visual Compliance Verification
Authors:Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Sungmin Hong, Chun-Hao Liu
Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent multi-modal large language models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools - such as object detectors, face analyzers, NSFW detectors, and captioning models - and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A verification agent then integrates image, tool outputs, and policy context to perform multi-modal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving up to 76% F1 score and a 10% improvement over the state-of-the-art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.
Paper and project links
PDF Under review
Summary
Visual compliance verification is a critical yet under-explored problem in computer vision, especially in media, entertainment, and advertising, where content must follow complex and evolving policy rules. Existing methods typically rely on task-specific deep-learning models trained on manually labeled datasets, which are costly to build and generalize poorly. This paper proposes CompAgent, the first agentic framework for visual compliance verification. CompAgent augments multi-modal large language models (MLLMs) with a suite of visual tools (object detectors, face analyzers, NSFW detectors, and captioning models) and introduces a planning agent that dynamically selects tools according to the compliance policy. A verification agent then integrates the image, tool outputs, and policy context to perform multi-modal reasoning. On public benchmarks, CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, reaching up to a 76% F1 score and a 10% improvement over the state of the art on the UnsafeBench dataset. The results demonstrate the effectiveness of agentic planning and tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.
Key Takeaways
- Visual compliance verification is a key problem in computer vision, particularly for media, entertainment, and advertising.
- Existing methods depend on task-specific deep-learning models trained on manually labeled datasets, which are expensive to build and generalize poorly.
- CompAgent is the first agentic framework for visual compliance verification, combining multi-modal large language models with a suite of visual tools.
- CompAgent includes a planning agent that selects tools based on the compliance policy and a verification agent that integrates the image, tool outputs, and policy context for multi-modal reasoning.
- Experiments on public benchmarks show strong performance, with a high F1 score.
- The approach is scalable, accurate, and adaptable.
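The plan-then-verify structure can be pictured as plain Python: a planning step routes policy keywords to tools, the chosen tools run on the image, and a verification step fuses their evidence with the policy. Everything below — the tool stubs, the keyword routing, and the verdict string — is our own schematic stand-in; a real system would call actual detectors and prompt an MLLM.

```python
from typing import Callable, Dict, List

# Stub visual tools; a real system would wrap object/face/NSFW detectors and captioners.
TOOLS: Dict[str, Callable[[str], str]] = {
    "object_detector": lambda img: "objects: bottle, logo",
    "face_analyzer":   lambda img: "faces: 1 adult, no minors",
    "nsfw_detector":   lambda img: "nsfw_score: 0.02",
    "captioner":       lambda img: "caption: a person holding a drink",
}

def plan(policy: str) -> List[str]:
    """Planning agent (stub): pick the tools whose evidence the policy needs."""
    routing = {"alcohol": ["object_detector", "captioner"],
               "minors": ["face_analyzer"],
               "nudity": ["nsfw_detector"]}
    selected = [tool for key, tools in routing.items()
                if key in policy.lower() for tool in tools]
    return selected or ["captioner"]

def verify(image_path: str, policy: str) -> str:
    """Verification agent (stub): fuse tool outputs with the policy context."""
    evidence = {name: TOOLS[name](image_path) for name in plan(policy)}
    # A real verifier would prompt an MLLM with the image, the evidence, and the policy.
    return f"policy={policy!r}\nevidence={evidence}\nverdict=needs_review"

print(verify("ad_frame_017.png", "No alcohol may be shown near minors"))
```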
Click to view paper screenshots
Integrating ConvNeXt and Vision Transformers for Enhancing Facial Age Estimation
Authors:Gaby Maroun, Salah Eddine Bekhouche, Fadi Dornaika
Age estimation from facial images is a complex and multifaceted challenge in computer vision. In this study, we present a novel hybrid architecture that combines ConvNeXt, a state-of-the-art advancement of convolutional neural networks (CNNs), with Vision Transformers (ViT). While each model independently delivers excellent performance on a variety of tasks, their integration leverages the complementary strengths of the CNNs localized feature extraction capabilities and the Transformers global attention mechanisms. Our proposed ConvNeXt-ViT hybrid solution was thoroughly evaluated on benchmark age estimation datasets, including MORPH II, CACD, and AFAD, and achieved superior performance in terms of mean absolute error (MAE). To address computational constraints, we leverage pre-trained models and systematically explore different configurations, using linear layers and advanced regularization techniques to optimize the architecture. Comprehensive ablation studies highlight the critical role of individual components and training strategies, and in particular emphasize the importance of adapted attention mechanisms within the CNN framework to improve the model focus on age-relevant facial features. The results show that the ConvNeXt-ViT hybrid not only outperforms traditional methods, but also provides a robust foundation for future advances in age estimation and related visual tasks. This work underscores the transformative potential of hybrid architectures and represents a promising direction for the seamless integration of CNNs and transformers to address complex computer vision challenges.
Paper and project links
Summary
This study presents a hybrid architecture that combines ConvNeXt, a state-of-the-art advance on convolutional neural networks (CNNs), with Vision Transformers (ViT). The combination exploits the CNN's localized feature extraction and the Transformer's global attention mechanism for facial age estimation. Evaluated on benchmark datasets including MORPH II, CACD, and AFAD with mean absolute error (MAE) as the metric, the hybrid achieves superior performance. Pre-trained models, different configurations, and training strategies are explored under computational constraints, and ablation studies emphasize the importance of adapted attention mechanisms within the CNN framework for focusing on age-relevant facial features.
Key Takeaways
- The study proposes a hybrid ConvNeXt and Vision Transformer (ViT) architecture for facial age estimation.
- The hybrid combines the CNN's localized feature extraction with the Transformer's global attention mechanism.
- It is evaluated on several benchmark datasets, including MORPH II, CACD, and AFAD.
- The architecture achieves superior performance measured by mean absolute error (MAE).
- Pre-trained models, linear layers, and advanced regularization techniques are used to optimize the architecture under computational constraints.
- Systematic ablation studies emphasize the importance of adapted attention mechanisms within the CNN framework.
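A simple way to realize such a hybrid is to concatenate the pooled features of both backbones and regress a single age value with a small head, trained with an L1 (MAE) objective. The sketch below uses timm model names (`convnext_tiny`, `vit_small_patch16_224`) and a fusion head of our own choosing; the paper's exact configuration may differ.

```python
import timm
import torch
import torch.nn as nn

class ConvNeXtViTAge(nn.Module):
    """Concatenate ConvNeXt and ViT pooled features and regress a single age value."""

    def __init__(self):
        super().__init__()
        # num_classes=0 makes timm models return pooled feature vectors.
        self.convnext = timm.create_model("convnext_tiny", pretrained=False, num_classes=0)
        self.vit = timm.create_model("vit_small_patch16_224", pretrained=False, num_classes=0)
        fused_dim = self.convnext.num_features + self.vit.num_features
        self.head = nn.Sequential(nn.Linear(fused_dim, 256), nn.GELU(),
                                  nn.Dropout(0.2), nn.Linear(256, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, 3, 224, 224)
        fused = torch.cat([self.convnext(x), self.vit(x)], dim=-1)
        return self.head(fused).squeeze(-1)                   # one predicted age per image

model = ConvNeXtViTAge()
ages = model(torch.randn(2, 3, 224, 224))
mae = nn.L1Loss()(ages, torch.tensor([34.0, 52.0]))           # MAE-style training objective
print(ages.shape, float(mae))
```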
Click to view paper screenshots
Invited Paper: BitMedViT: Ternary-Quantized Vision Transformer for Medical AI Assistants on the Edge
Authors:Mikolaj Walczak, Uttej Kallakuri, Edward Humes, Xiaomin Lin, Tinoosh Mohsenin
Vision Transformers (ViTs) have demonstrated strong capabilities in interpreting complex medical imaging data. However, their significant computational and memory demands pose challenges for deployment in real-time, resource-constrained mobile and wearable devices used in clinical environments. We introduce, BiTMedViT, a new class of Edge ViTs serving as medical AI assistants that perform structured analysis of medical images directly on the edge. BiTMedViT utilizes ternary- quantized linear layers tailored for medical imaging and com- bines a training procedure with multi-query attention, preserving stability under ternary weights with low-precision activations. Furthermore, BiTMedViT employs task-aware distillation from a high-capacity teacher to recover accuracy lost due to extreme quantization. Lastly, we also present a pipeline that maps the ternarized ViTs to a custom CUDA kernel for efficient memory bandwidth utilization and latency reduction on the Jetson Orin Nano. Finally, BiTMedViT achieves 86% diagnostic accuracy (89% SOTA) on MedMNIST across 12 datasets, while reducing model size by 43x, memory traffic by 39x, and enabling 16.8 ms inference at an energy efficiency up to 41x that of SOTA models at 183.62 GOPs/J on the Orin Nano. Our results demonstrate a practical and scientifically grounded route for extreme-precision medical imaging ViTs deployable on the edge, narrowing the gap between algorithmic advances and deployable clinical tools.
Paper and project links
PDF Accepted at 2025 IEEE/ACM International Conf. on Computer-Aided Design (ICCAD) Oct. 26-30 2025, Munich, DE
Summary
BiTMedViT is a new class of edge ViT that serves as a medical AI assistant, performing structured analysis of medical images directly on the edge. It uses ternary-quantized linear layers tailored to medical imaging, a training procedure combined with multi-query attention to stay stable under ternary weights with low-precision activations, and task-aware distillation from a high-capacity teacher to recover the accuracy lost to extreme quantization. The ternarized ViTs are further mapped to a custom CUDA kernel for efficient memory-bandwidth use and lower latency on the Jetson Orin Nano. BiTMedViT reaches 86% diagnostic accuracy (89% for the state of the art) on MedMNIST across 12 datasets, while reducing model size by 43x and memory traffic by 39x, and running 16.8 ms inference at up to 41x the energy efficiency of state-of-the-art models (183.62 GOPs/J) on the Orin Nano. It offers a practical, scientifically grounded route to deploying extreme-precision medical-imaging ViTs on the edge.
Key Takeaways
- BiTMedViT is an edge ViT for structured analysis of medical images, suited to resource-constrained mobile and wearable devices.
- BiTMedViT uses ternary-quantized linear layers designed specifically for medical imaging.
- A training procedure with multi-query attention keeps the model stable and accurate under extreme quantization.
- Task-aware distillation from a high-capacity teacher recovers accuracy lost to extreme quantization.
- BiTMedViT maps the ternarized ViT to a custom CUDA kernel to improve memory-bandwidth utilization and reduce latency.
- BiTMedViT achieves high diagnostic accuracy across multiple datasets while sharply reducing model size, memory traffic, and inference time.
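Ternary weight quantization in general maps every weight to {-1, 0, +1} times a per-tensor scale. The sketch below uses the classic threshold-and-scale formulation from Ternary Weight Networks purely as an illustration; BiTMedViT's actual quantizer, activation precision, and kernel mapping are not reproduced here.

```python
import torch

def ternarize(w: torch.Tensor, delta_factor: float = 0.75):
    """Map a weight tensor to {-1, 0, +1} * alpha with a per-tensor threshold and scale."""
    delta = delta_factor * w.abs().mean()                   # sparsity threshold
    mask = (w.abs() > delta).float()                        # which weights stay nonzero
    ternary = torch.sign(w) * mask                          # values in {-1, 0, +1}
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1.0)  # scale over surviving weights
    return ternary, alpha

w = torch.randn(256, 256)                                   # a linear layer's weight matrix
t, alpha = ternarize(w)
w_hat = alpha * t                                           # dequantized approximation
rmse = (w - w_hat).pow(2).mean().sqrt()
print(t.unique().tolist(), float(alpha), float(rmse))
```

Storing only the ternary codes and one scale per tensor is what drives the large reductions in model size and memory traffic.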
Click to view paper screenshots
3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Authors:Nojod M. Alotaibi, Areej M. Alhothali, Manar S. Ali
Major depressive disorder (MDD) is a prevalent mental health condition that negatively impacts both individual well-being and global public health. Automated detection of MDD using structural magnetic resonance imaging (sMRI) and deep learning (DL) methods holds increasing promise for improving diagnostic accuracy and enabling early intervention. Most existing methods employ either voxel-level features or handcrafted regional representations built from predefined brain atlases, limiting their ability to capture complex brain patterns. This paper develops a unified pipeline that utilizes Vision Transformers (ViTs) for extracting 3D region embeddings from sMRI data and Graph Neural Network (GNN) for classification. We explore two strategies for defining regions: (1) an atlas-based approach using predefined structural and functional brain atlases, and (2) an cube-based method by which ViTs are trained directly to identify regions from uniformly extracted 3D patches. Further, cosine similarity graphs are generated to model interregional relationships, and guide GNN-based classification. Extensive experiments were conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of our model. With stratified 10-fold cross-validation, the best model obtained 81.51% accuracy, 85.94% sensitivity, 76.36% specificity, 80.88% precision, and 83.33% F1-score. Further, atlas-based models consistently outperformed the cube-based approach, highlighting the importance of using domain-specific anatomical priors for MDD detection.
Paper and project links
PDF 17 pages, 3 figure, 9 tables
Summary
This paper presents an approach for automated detection of major depressive disorder (MDD) that uses Vision Transformers (ViTs) to extract 3D region embeddings from structural MRI (sMRI) data and a Graph Neural Network (GNN) for classification, with cosine-similarity graphs modeling inter-regional relationships. Two ways of defining regions are explored: an atlas-based approach using predefined structural and functional brain atlases, and a cube-based approach in which ViTs are trained directly to identify regions from uniformly extracted 3D patches. With stratified 10-fold cross-validation on the REST-meta-MDD dataset, the best model reaches 81.51% accuracy, and atlas-based models consistently outperform the cube-based approach, underlining the value of domain-specific anatomical priors.
Key Takeaways
- Vision Transformers and Graph Neural Networks are applied to automated MDD detection, promising higher diagnostic accuracy and earlier intervention.
- The method extracts 3D region embeddings from sMRI data and builds cosine-similarity graphs to model relationships between brain regions.
- Two ways of defining brain regions are compared: a predefined-atlas approach and a cube-based approach in which ViTs identify regions from 3D patches.
- The atlas-based approach performs better, highlighting the importance of anatomical priors for MDD detection.
- Extensive experiments on the REST-meta-MDD dataset show that the best model attains high accuracy, sensitivity, specificity, and F1 score.
- The work offers a new route to automated MDD detection that could complement existing diagnostic tools.
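Building the cosine-similarity graph from region embeddings amounts to normalizing the embeddings, taking pairwise dot products, and keeping only the strongest connections before handing the graph to the GNN. The top-k sparsification and the region count below are our assumptions, not necessarily the paper's exact construction.

```python
import torch
import torch.nn.functional as F

def cosine_knn_graph(region_emb: torch.Tensor, k: int = 5) -> torch.Tensor:
    """region_emb: (R, D) embeddings, one per brain region.
    Returns a symmetric (R, R) adjacency keeping each region's top-k neighbours."""
    z = F.normalize(region_emb, dim=-1)
    sim = z @ z.T                                   # pairwise cosine similarity
    sim.fill_diagonal_(0.0)                         # no self-loops
    topk = sim.topk(k, dim=-1).indices              # (R, k) nearest neighbours per region
    adj = torch.zeros_like(sim)
    adj.scatter_(1, topk, 1.0)                      # keep top-k edges per node
    return ((adj + adj.T) > 0).float() * sim        # symmetrize, weight edges by similarity

emb = torch.randn(116, 768)                         # e.g. one ViT embedding per atlas region
adj = cosine_knn_graph(emb)
print(adj.shape, int((adj != 0).sum()))             # adjacency handed to the GNN classifier
```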
Click to view paper screenshots
Face Spoofing Detection using Deep Learning
Authors: Najeebullah, Maaz Salman, Zar Nawab Khan Swati
Digital image spoofing has emerged as a significant security threat in biometric authentication systems, particularly those relying on facial recognition. This study evaluates the performance of three vision based models, MobileNetV2, ResNET50, and Vision Transformer, ViT, for spoof detection in image classification, utilizing a dataset of 150,986 images divided into training , 140,002, testing, 10,984, and validation ,39,574, sets. Spoof detection is critical for enhancing the security of image recognition systems, and this research compares the models effectiveness through accuracy, precision, recall, and F1 score metrics. Results reveal that MobileNetV2 outperforms other architectures on the test dataset, achieving an accuracy of 91.59%, precision of 91.72%, recall of 91.59%, and F1 score of 91.58%, compared to ViT 86.54%, 88.28%, 86.54%, and 86.39%, respectively. On the validation dataset, MobileNetV2, and ViT excel, with MobileNetV2 slightly ahead at 97.17% accuracy versus ViT 96.36%. MobileNetV2 demonstrates faster convergence during training and superior generalization to unseen data, despite both models showing signs of overfitting. These findings highlight MobileNetV2 balanced performance and robustness, making it the preferred choice for spoof detection applications where reliability on new data is essential. The study underscores the importance of model selection in security sensitive contexts and suggests MobileNetV2 as a practical solution for real world deployment.
Paper and project links
PDF The author’s school has a conflict of interest regarding the submission of this article prior to his graduation thesis submission
Summary
The study compares three vision models (MobileNetV2, ResNET50, and the Vision Transformer, ViT) for spoof detection in image classification, using a dataset of 150,986 images split into training, testing, and validation sets. MobileNetV2 performs best on the test set in accuracy, precision, recall, and F1 score, converges faster during training, and generalizes better to unseen data, although both it and ViT show signs of overfitting. MobileNetV2 is therefore the preferred choice for spoof-detection applications where reliability on new data is essential.
Key Takeaways
- Three vision models (MobileNetV2, ResNET50, and the Vision Transformer) are evaluated for spoof detection in image classification.
- MobileNetV2 outperforms the other models on the test set in accuracy, precision, recall, and F1 score.
- MobileNetV2 generalizes better to unseen data and converges faster during training.
- MobileNetV2 offers balanced performance and robustness, making it the preferred choice for spoof detection.
- Model selection matters in security-sensitive settings, and MobileNetV2 is a practical option for real-world deployment.
- Spoof detection is critical for improving the security of image-recognition systems.
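The four reported metrics are standard and can be reproduced directly from predictions with scikit-learn. The labels and predictions below are made-up placeholders, and the weighted averaging is our assumption about how the scores were aggregated.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder ground-truth and predicted labels (0 = live face, 1 = spoof).
y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```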
Click to view paper screenshots
RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing
Authors:Fengxiang Wang, Yulin Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Hongzhen Wang, Di Wang, Long Lan, Wenjing Yang, Jing Zhang
Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. RoMA enhances scalability for high-resolution images through a tailored auto-regressive learning strategy, incorporating two key innovations: 1) a rotation-aware pretraining mechanism combining adaptive cropping with angular embeddings to handle sparsely distributed objects with arbitrary orientations, and 2) multi-scale token prediction objectives that address the extreme variations in object scales inherent to RS imagery. Systematic empirical studies validate that Mamba adheres to RS data and parameter scaling laws, with performance scaling reliably as model and data size increase. Furthermore, experiments across scene classification, object detection, and semantic segmentation tasks demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in both accuracy and computational efficiency. The source code and pretrained models will be released at https://github.com/MiliLab/RoMA.
Paper and project links
PDF NeurIPS 2025
Summary
Recent advances in self-supervised learning for Vision Transformers (ViTs) have driven breakthroughs in remote sensing (RS) foundation models, but the quadratic complexity of self-attention limits scalability, especially for large models and high-resolution images. The linear-complexity Mamba architecture is a promising alternative, yet existing RS applications of Mamba are restricted to supervised tasks on small, domain-specific datasets. RoMA enables scalable self-supervised pretraining of Mamba-based RS foundation models on large-scale, diverse, unlabeled data. It improves scalability to high-resolution imagery with a tailored auto-regressive learning strategy built on two key innovations: a rotation-aware pretraining mechanism that combines adaptive cropping with angular embeddings to handle sparsely distributed, arbitrarily oriented objects, and multi-scale token prediction objectives that address the extreme scale variation of RS imagery. Experiments show that Mamba follows RS data and parameter scaling laws, with performance improving reliably as model and data size grow, and RoMA-pretrained Mamba models consistently outperform ViT-based counterparts in accuracy and computational efficiency on scene classification, object detection, and semantic segmentation.
Key Takeaways
- Recent self-supervised learning advances for Vision Transformers (ViTs) have driven breakthroughs in remote sensing foundation models.
- The quadratic complexity of self-attention hinders scalability, especially for large models and high-resolution images.
- The linear-complexity Mamba architecture is a promising alternative, but its existing remote sensing applications are limited to supervised tasks on small datasets.
- The RoMA framework enables scalable self-supervised pretraining of Mamba models on large-scale unlabeled data.
- RoMA handles high-resolution imagery with an auto-regressive learning strategy built on two key innovations: rotation-aware pretraining and multi-scale token prediction objectives.
- Mamba follows remote sensing data and parameter scaling laws, with performance improving as models and data grow.
- RoMA-pretrained Mamba models outperform ViT-based counterparts on scene classification, object detection, and semantic segmentation.
Click to view paper screenshots
Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Authors:Rongxin Liao, Feng Li, Yanyan Wei, Zenglin Shi, Le Zhang, Huihui Bai, Meng Wang
Universal adverse weather removal (UAWR) seeks to address various weather degradations within a unified framework. Recent methods are inspired by prompt learning using pre-trained vision-language models (e.g., CLIP), leveraging degradation-aware prompts to facilitate weather-free image restoration, yielding significant improvements. In this work, we propose CyclicPrompt, an innovative cyclic prompt approach designed to enhance the effectiveness, adaptability, and generalizability of UAWR. CyclicPrompt Comprises two key components: 1) a composite context prompt that integrates weather-related information and context-aware representations into the network to guide restoration. This prompt differs from previous methods by marrying learnable input-conditional vectors with weather-specific knowledge, thereby improving adaptability across various degradations. 2) The erase-and-paste mechanism, after the initial guided restoration, substitutes weather-specific knowledge with constrained restoration priors, inducing high-quality weather-free concepts into the composite prompt to further fine-tune the restoration process. Therefore, we can form a cyclic “Prompt-Restore-Prompt” pipeline that adeptly harnesses weather-specific knowledge, textual contexts, and reliable textures. Extensive experiments on synthetic and real-world datasets validate the superior performance of CyclicPrompt. The code is available at: https://github.com/RongxinL/CyclicPrompt.
Paper and project links
Summary: For universal adverse weather removal (UAWR), this paper proposes CyclicPrompt, a cyclic prompting approach that integrates weather-related information and context-aware representations to guide restoration and then applies an erase-and-paste mechanism to refine it further. The resulting cyclic "Prompt-Restore-Prompt" pipeline exploits weather-specific knowledge, textual context, and reliable textures, and delivers superior performance on both synthetic and real-world datasets.
Key Takeaways:
- CyclicPrompt is a new approach to adverse weather removal within a unified framework.
- It integrates weather-related information and context-aware representations to improve effectiveness, adaptability, and generalizability.
- CyclicPrompt comprises two key components: a composite context prompt and an erase-and-paste mechanism.
- The composite context prompt marries learnable input-conditional vectors with weather-specific knowledge, improving adaptability across different degradations.
- After the initial guided restoration, the erase-and-paste mechanism replaces weather-specific knowledge with constrained restoration priors, injecting high-quality weather-free concepts into the composite prompt to further fine-tune restoration.
- CyclicPrompt forms a cyclic "Prompt-Restore-Prompt" pipeline that exploits weather-specific knowledge, textual context, and reliable textures.
Click to view paper screenshots
Semantic Alignment and Reinforcement for Data-Free Quantization of Vision Transformers
Authors:Yunshan Zhong, Yuyao Zhou, Yuxin Zhang, Wanchen Sui, Shen Li, Yong Li, Fei Chao, Rongrong Ji
Data-free quantization (DFQ) enables model quantization without accessing real data, addressing concerns regarding data security and privacy. With the growing adoption of Vision Transformers (ViTs), DFQ for ViTs has garnered significant attention. However, existing DFQ methods exhibit two limitations: (1) semantic distortion, where the semantics of synthetic images deviate substantially from those of real images, and (2) semantic inadequacy, where synthetic images contain extensive regions with limited content and oversimplified textures, leading to suboptimal quantization performance. To address these limitations, we propose SARDFQ, a novel Semantics Alignment and Reinforcement Data-Free Quantization method for ViTs. To address semantic distortion, SARDFQ incorporates Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors. To mitigate semantic inadequacy, SARDFQ introduces Multi-Semantic Reinforcement (MSR), leveraging localized patch optimization to enhance semantic richness across synthetic images. Furthermore, SARDFQ employs Soft-Label Learning (SL), wherein multiple semantic targets are adapted to facilitate the learning of multi-semantic images augmented by MSR. Extensive experiments demonstrate the effectiveness of SARDFQ, significantly surpassing existing methods. For example, SARDFQ improves top-1 accuracy on ImageNet by 15.52% for W4A4 ViT-B. The code is at https://github.com/zysxmu/SARDFQ.
Paper and project links
PDF ICCV2025
Summary
Data-free quantization (DFQ) enables model quantization without access to real data, addressing data-security and privacy concerns. Existing DFQ methods for Vision Transformers (ViTs) suffer from semantic distortion and semantic inadequacy in their synthetic images. The proposed SARDFQ addresses semantic distortion with Attention Priors Alignment (APA), which optimizes synthetic images to follow randomly generated structure attention priors, and mitigates semantic inadequacy with Multi-Semantic Reinforcement (MSR), which uses localized patch optimization to enrich the semantics of synthetic images. Soft-Label Learning (SL) adapts multiple semantic targets to facilitate learning from the multi-semantic images produced by MSR. Experiments show that SARDFQ clearly surpasses existing methods, for example improving W4A4 ViT-B top-1 accuracy on ImageNet by 15.52%. Code is available at https://github.com/zysxmu/SARDFQ.
Key Takeaways
- Data-free quantization (DFQ) quantizes models without using real data, protecting data security and privacy.
- Existing DFQ methods for ViTs suffer from semantic distortion and semantic inadequacy.
- SARDFQ addresses semantic distortion with Attention Priors Alignment (APA).
- SARDFQ introduces Multi-Semantic Reinforcement (MSR) to mitigate semantic inadequacy and improve quantization performance.
- SARDFQ uses Soft-Label Learning (SL) to facilitate learning from multi-semantic images.
- Experiments show a significant accuracy improvement on ImageNet.
Click to view paper screenshots
Static for Dynamic: Towards a Deeper Understanding of Dynamic Facial Expressions Using Static Expression Data
Authors:Yin Chen, Jia Li, Yu Zhang, Zhenzhen Hu, Shiguang Shan, Meng Wang, Richang Hong
Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies solely on a single snapshot. This temporal analysis provides richer information and promises greater recognition capability. However, current DFER methods often exhibit unsatisfied performance largely due to fewer training samples compared to SFER. Given the inherent correlation between static and dynamic expressions, we hypothesize that leveraging the abundant SFER data can enhance DFER. To this end, we propose Static-for-Dynamic (S4D), a unified dual-modal learning framework that integrates SFER data as a complementary resource for DFER. Specifically, S4D employs dual-modal self-supervised pre-training on facial images and videos using a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations. The pre-trained encoder is then fine-tuned on static and dynamic expression datasets in a multi-task learning setup to facilitate emotional information interaction. Unfortunately, vanilla multi-task learning in our study results in negative transfer. To address this, we propose an innovative Mixture of Adapter Experts (MoAE) module that facilitates task-specific knowledge acquisition while effectively extracting shared knowledge from both static and dynamic expression data. Extensive experiments demonstrate that S4D achieves a deeper understanding of DFER, setting new state-of-the-art performance on FERV39K, MAFW, and DFEW benchmarks, with weighted average recall (WAR) of 53.65%, 58.44%, and 76.68%, respectively. Additionally, a systematic correlation analysis between SFER and DFER tasks is presented, which further elucidates the potential benefits of leveraging SFER.
Paper and project links
PDF The code and model are publicly available here https://github.com/MSA-LMC/S4D
Summary
Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions, unlike static facial expression recognition (SFER), which relies on a single snapshot. This paper proposes Static-for-Dynamic (S4D), a unified dual-modal learning framework that leverages abundant SFER data to strengthen DFER. S4D performs dual-modal self-supervised pretraining on facial images and videos with a shared Vision Transformer (ViT) encoder-decoder architecture, yielding improved spatiotemporal representations, and then fine-tunes the encoder on static and dynamic expression datasets in a multi-task setup. Because vanilla multi-task learning causes negative transfer, the authors introduce a Mixture of Adapter Experts (MoAE) module that acquires task-specific knowledge while extracting the knowledge shared between static and dynamic expression data. S4D sets new state-of-the-art results on the FERV39K, MAFW, and DFEW benchmarks.
Key Takeaways
- Dynamic facial expression recognition (DFER) infers emotions from the temporal evolution of expressions and is more informative than static facial expression recognition (SFER).
- Existing DFER methods underperform largely because they have fewer training samples.
- The proposed Static-for-Dynamic (S4D) method leverages abundant SFER data to strengthen DFER.
- S4D uses dual-modal self-supervised pretraining with a shared Vision Transformer architecture to obtain improved spatiotemporal representations.
- Vanilla multi-task learning is found to cause negative transfer.
- The Mixture of Adapter Experts (MoAE) module is proposed to resolve the negative transfer and to acquire both task-specific and shared knowledge.
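A generic mixture-of-adapters layer — a soft router weighting several bottleneck adapters that are added residually to the token stream — conveys the basic idea behind an adapter-expert mixture. This is a common pattern shown for illustration; the routing rule, expert count, and placement inside the ViT are our assumptions and may differ from the paper's MoAE.

```python
import torch
import torch.nn as nn

class MixtureOfAdapters(nn.Module):
    """Residual mixture of bottleneck adapters combined by a soft per-token router."""

    def __init__(self, dim: int, num_experts: int = 4, bottleneck: int = 64):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, N, C) tokens
        gates = torch.softmax(self.router(x), dim=-1)            # (B, N, E) expert weights
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, N, C, E)
        mixed = (expert_out * gates.unsqueeze(2)).sum(dim=-1)            # weighted combination
        return x + mixed                                          # residual adapter update

tokens = torch.randn(2, 196, 384)                                 # ViT tokens from either task
print(MixtureOfAdapters(384)(tokens).shape)                       # torch.Size([2, 196, 384])
```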
Click to view paper screenshots
REP: Resource-Efficient Prompting for Rehearsal-Free Continual Learning
Authors:Sungho Jeon, Xinyue Ma, Kwang In Kim, Myeongjae Jeon
Recent rehearsal-free continual learning (CL) methods guided by prompts achieve strong performance on vision tasks with non-stationary data but remain resource-intensive, hindering real-world edge deployment. We introduce resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based rehearsal-free continual learning methods while minimizing accuracy trade-offs. Our approach employs swift prompt selection to refine input data using a carefully provisioned model and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. AToM and ALD selectively skip data and model layers while preserving task-specific features during the learning of new tasks. Extensive experiments on multiple image classification datasets demonstrate REP’s superior resource efficiency over state-of-the-art rehearsal-free CL methods.
Paper and project links
PDF accepted to NeurIPS 2025
Summary
This paper introduces resource-efficient prompting (REP), which improves the computational and memory efficiency of prompt-based, rehearsal-free continual learning methods while minimizing accuracy trade-offs. REP refines input data through swift prompt selection with a carefully provisioned model, and introduces adaptive token merging (AToM) and adaptive layer dropping (ALD) for efficient prompt updates. Experiments on multiple image-classification datasets show that REP is more resource-efficient than state-of-the-art rehearsal-free continual learning methods.
Key Takeaways
- Resource-efficient prompting (REP) improves the efficiency of prompt-based, rehearsal-free continual learning while keeping accuracy losses small.
- REP refines input data via swift prompt selection using a carefully provisioned model.
- Adaptive token merging (AToM) and adaptive layer dropping (ALD) enable efficient prompt updates during continual learning.
- AToM and ALD selectively skip data and model layers while preserving task-specific features when learning new tasks.
- REP is validated by extensive experiments on multiple image-classification datasets, showing strong resource efficiency.
- REP is more resource-efficient than state-of-the-art rehearsal-free continual learning methods.
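Token merging in general shortens the sequence by averaging highly similar tokens. The sketch below merges each adjacent pair of tokens whose cosine similarity exceeds a threshold; it is a deliberately simplified stand-in for AToM, whose adaptive criterion and batched implementation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """tokens: (N, C). Average each adjacent pair whose cosine similarity exceeds
    the threshold, otherwise keep both tokens (single sequence, no batching)."""
    merged = []
    i = 0
    while i < tokens.shape[0] - 1:
        sim = F.cosine_similarity(tokens[i], tokens[i + 1], dim=0)
        if sim > threshold:
            merged.append((tokens[i] + tokens[i + 1]) / 2)   # merge the redundant pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    if i == tokens.shape[0] - 1:
        merged.append(tokens[-1])                            # keep a trailing token
    return torch.stack(merged)

x = torch.randn(197, 768)                                    # e.g. ViT tokens incl. [CLS]
x[10] = x[11]                                                # force one identical pair
print(merge_similar_tokens(x).shape)                         # fewer than 197 tokens remain
```

Fewer tokens per layer directly reduces the compute and memory spent on later transformer blocks, which is the efficiency lever REP targets.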
Click to view paper screenshots