Published: 2025-11-21

Updated: 2025-11-27

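⚠️ All summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic work. They are intended only for screening papers before reading them.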

2025-11-21 Update

IPTQ-ViT: Post-Training Quantization of Non-linear Functions for Integer-only Vision Transformers

Authors: Gihwan Kim, Jemin Lee, Hyungshin Kim

Previous Quantization-Aware Training (QAT) methods for vision transformers rely on expensive retraining to recover accuracy loss in non-linear layer quantization, limiting their use in resource-constrained environments. In contrast, existing Post-Training Quantization (PTQ) methods either partially quantize non-linear functions or adjust activation distributions to maintain accuracy but fail to achieve fully integer-only inference. In this paper, we introduce IPTQ-ViT, a novel PTQ framework for fully integer-only vision transformers without retraining. We present approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy in PTQ. In addition, we propose a unified metric integrating quantization sensitivity, perturbation, and computational cost to select the optimal approximation function per activation layer. IPTQ-ViT outperforms previous PTQ methods, achieving up to 6.44%p (avg. 1.78%p) top-1 accuracy improvement for image classification, 1.0 mAP for object detection. IPTQ-ViT outperforms partial floating-point PTQ methods under W8A8 and W4A8, and achieves accuracy and latency comparable to integer-only QAT methods. We plan to release our code https://github.com/gihwan-kim/IPTQ-ViT.git.

Paper and project links

PDF accepted in WACV 2026 (10 pages)

Summary

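This paper introduces IPTQ-ViT, a Post-Training Quantization (PTQ) framework that makes vision transformers fully integer-only without retraining. The authors propose two approximation functions: a polynomial-based GELU optimized for vision data and a bit-shifting-based Softmax designed to improve approximation accuracy under PTQ. They also propose a unified metric that selects the best approximation function for each activation layer. Compared with previous PTQ methods, IPTQ-ViT improves top-1 accuracy on image classification by up to 6.44 percentage points (1.78 on average) and improves object detection by 1.0 mAP. The authors plan to release the code on GitHub.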
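
To make the two approximation ideas concrete, here is a minimal floating-point sketch. It is not the functions proposed in IPTQ-ViT: the GELU variant borrows the second-order erf fit published for I-BERT's i-GELU (constants `A` and `B`), and the Softmax variant assumes a simple linear approximation of the fractional power of two. The names `gelu_poly` and `shift_softmax` are hypothetical.

```python
import math

# Assumption: constants of the second-order erf fit from I-BERT's i-GELU,
# used only as a stand-in for the polynomial that IPTQ-ViT fits itself.
A, B = -0.2888, -1.769
LOG2E = 1.4426950408889634  # log2(e)

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_poly(x):
    """GELU with erf replaced by a clipped second-order polynomial."""
    z = x / math.sqrt(2.0)
    u = min(abs(z), -B)                      # clip |z| at -B
    erf_approx = math.copysign(A * (u + B) ** 2 + 1.0, z)
    return 0.5 * x * (1.0 + erf_approx)

def softmax_exact(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def shift_softmax(xs):
    """Softmax with exp(t) rewritten as 2**(t * log2(e)).

    The exponent is split into an integer part q (a bit shift in integer
    arithmetic) and a fractional part r, with 2**r approximated by 1 + r.
    """
    m = max(xs)
    ws = []
    for x in xs:
        p = (x - m) * LOG2E                  # p <= 0
        q = math.floor(p)                    # integer part, q <= 0
        r = p - q                            # fractional part in [0, 1)
        ws.append((1.0 + r) * 2.0 ** q)      # 2**q would be a right shift by -q
    s = sum(ws)
    return [w / s for w in ws]

if __name__ == "__main__":
    grid = [i / 100.0 for i in range(-600, 601)]
    print(max(abs(gelu_poly(x) - gelu_exact(x)) for x in grid))
    print(shift_softmax([1.0, 2.0, 3.0, 0.5]))
```

A real integer-only kernel would run the same arithmetic on quantized integers with fixed-point scales; the float version only shows the shape of the approximations.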

Key Takeaways

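IPTQ-ViT is a Post-Training Quantization framework for vision transformers that achieves fully integer-only inference without retraining.
The authors introduce a polynomial-based GELU approximation and a bit-shifting-based Softmax to improve approximation accuracy.
A unified metric selects the best approximation function for each activation layer. It combines quantization sensitivity, perturbation, and computational cost.
Compared with previous PTQ methods, IPTQ-ViT improves top-1 accuracy on image classification by up to 6.44 percentage points (1.78 on average) and improves object detection by 1.0 mAP.
Under W8A8 and W4A8, IPTQ-ViT outperforms PTQ methods that keep part of the computation in floating point, and its accuracy and latency are comparable to integer-only QAT methods. This makes it a candidate for resource-constrained environments where retraining is too expensive.
The authors plan to release their code and give a repository link, which should make it easier for others to reproduce and build on the work.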
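
For readers unfamiliar with the W8A8 and W4A8 notation (8-bit or 4-bit weights, 8-bit activations), the following sketch shows symmetric per-tensor uniform quantization and an integer dot product. It is a generic illustration under stated assumptions, not IPTQ-ViT's quantizer; `quantize` and `int_dot` are hypothetical names.

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def int_dot(w_q, w_scale, a_q, a_scale):
    """Accumulate in integers, then apply the combined scale once."""
    acc = sum(w * a for w, a in zip(w_q, a_q))
    return acc * (w_scale * a_scale)

if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.75]
    acts = [1.0, 2.0, -0.5, 0.1]
    reference = sum(w * a for w, a in zip(weights, acts))
    a_q, a_s = quantize(acts, 8)
    for w_bits in (8, 4):                    # W8A8, then W4A8
        w_q, w_s = quantize(weights, w_bits)
        print(w_bits, int_dot(w_q, w_s, a_q, a_s), reference)
```

The final multiplication by the scale is done in floating point here for readability; integer-only pipelines typically replace it with a fixed-point multiply and shift.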

H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction

Authors: Xueyang Li, Zongren Wang, Yuliang Zhang, Zixuan Pan, Yu-Jen Chen, Nishchal Sapkota, Gelei Xu, Danny Z. Chen, Yiyu Shi

Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment study. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.

Paper and project links

PDF

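Summary

This work introduces a curated multi-sequence, multi-modal MRI dataset designed for bladder cancer recurrence prediction, which serves as a benchmark for future research. It then proposes H-CNN-ViT, a model whose hierarchical gated attention selectively weights features from the global (ViT) and local (CNN) paths to achieve a balanced and targeted fusion. The multi-branch architecture processes each modality independently so that the distinct properties of each imaging channel are captured and integrated. Evaluated on the authors' dataset, H-CNN-ViT reaches an AUC of 78.6%, surpassing state-of-the-art models.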
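
The gating idea can be sketched in a few lines. This is a hypothetical, simplified gate (one scalar weight per branch and feature dimension), not the architecture from the paper; `gated_fusion` and `fuse_modalities` are illustrative names, and the real model learns its gates with attention layers.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def gated_fusion(global_feat, local_feat, gate_w, gate_b):
    """Per-dimension gate g in (0, 1); output = g * global + (1 - g) * local."""
    fused = []
    for i, (g_f, l_f) in enumerate(zip(global_feat, local_feat)):
        # Hypothetical gate: a linear score of both branches, squashed to (0, 1).
        g = sigmoid(gate_w[i][0] * g_f + gate_w[i][1] * l_f + gate_b[i])
        fused.append(g * g_f + (1.0 - g) * l_f)
    return fused

def fuse_modalities(branches):
    """Fuse each modality independently, then concatenate the results."""
    out = []
    for global_feat, local_feat, gate_w, gate_b in branches:
        out.extend(gated_fusion(global_feat, local_feat, gate_w, gate_b))
    return out

if __name__ == "__main__":
    zeros_w = [[0.0, 0.0], [0.0, 0.0]]
    branch = ([1.0, 2.0], [3.0, 0.0], zeros_w, [0.0, 0.0])
    print(fuse_modalities([branch, branch]))  # gate of 0.5 averages the branches
```

With zero weights and bias the gate is 0.5 and the output is a plain average; a large positive bias pushes the output toward the global (ViT) features.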

Key Takeaways

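The authors introduce a multi-sequence, multi-modal MRI dataset designed for bladder cancer recurrence prediction.
The dataset serves as a benchmark for future research.
H-CNN-ViT combines a global (ViT) path and a local (CNN) path and weights their features selectively.
A hierarchical gated attention mechanism produces a balanced and targeted feature fusion.
The multi-branch architecture processes each modality independently so that the distinct properties of each imaging channel are captured and integrated.
On the authors' dataset, H-CNN-ViT reaches an AUC of 78.6%, surpassing the state-of-the-art models it was compared with.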
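
As a reminder of what the reported AUC measures, the sketch below computes ROC AUC as the probability that a randomly chosen positive case scores higher than a randomly chosen negative one (ties count half). The labels and scores are made up for illustration and have nothing to do with the paper's data.

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    labels = [1, 1, 0, 0, 1, 0]              # illustrative: 1 = recurrence
    scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
    print(roc_auc(labels, scores))
```

The pairwise form is quadratic in the number of samples, which is fine for small evaluation sets; rank-based implementations are preferable for large ones.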

CompAgent: An Agentic Framework for Visual Compliance Verification

Authors: Rahul Ghosh, Baishali Chaudhury, Hari Prasanna Das, Meghana Ashok, Ryan Razkenari, Sungmin Hong, Chun-Hao Liu

Visual compliance verification is a critical yet underexplored problem in computer vision, especially in domains such as media, entertainment, and advertising where content must adhere to complex and evolving policy rules. Existing methods often rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and limited in generalizability. While recent Multimodal Large Language Models (MLLMs) offer broad real-world knowledge and policy understanding, they struggle to reason over fine-grained visual details and apply structured compliance rules effectively on their own. In this paper, we propose CompAgent, the first agentic framework for visual compliance verification. CompAgent augments MLLMs with a suite of visual tools (such as object detectors, face analyzers, NSFW detectors, and captioning models) and introduces a planning agent that dynamically selects appropriate tools based on the compliance policy. A compliance verification agent then integrates image, tool outputs, and policy context to perform multimodal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, achieving up to 76% F1 score and a 10% improvement over the state-of-the-art on the UnsafeBench dataset. Our results demonstrate the effectiveness of agentic planning and robust tool-augmented reasoning for scalable, accurate, and adaptable visual compliance verification.

Paper and project links

PDF Under review

Summary

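Visual compliance verification is a critical yet underexplored problem in computer vision, especially in media, entertainment, and advertising, where content must follow complex and evolving policy rules. Existing methods usually rely on task-specific deep learning models trained on manually labeled datasets, which are costly to build and generalize poorly. This paper proposes CompAgent, the first agentic framework for visual compliance verification. CompAgent augments Multimodal Large Language Models (MLLMs) with a suite of visual tools (object detectors, face analyzers, NSFW detectors, and captioning models) and introduces a planning agent that dynamically selects the appropriate tools based on the compliance policy. A compliance verification agent then integrates the image, the tool outputs, and the policy context to perform multimodal reasoning. Experiments on public benchmarks show that CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, reaching up to a 76% F1 score and a 10% improvement over the state of the art on the UnsafeBench dataset. The results indicate that agentic planning and tool-augmented reasoning are effective for scalable, accurate, and adaptable visual compliance verification.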
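
The plan-then-verify flow described above can be sketched with stub components. Everything here is a hypothetical stand-in: the "tools" read precomputed fields from a dictionary instead of running models, the planner is a keyword match rather than the paper's planning agent, and the verifier is a pair of hand-written rules rather than MLLM reasoning.

```python
def object_detector(image):
    return {"objects": list(image.get("objects", []))}

def nsfw_detector(image):
    return {"nsfw_score": float(image.get("nsfw_score", 0.0))}

def captioner(image):
    return {"caption": str(image.get("caption", ""))}

TOOLS = {
    "object_detector": object_detector,
    "nsfw_detector": nsfw_detector,
    "captioner": captioner,
}

def plan_tools(policy):
    """Stand-in planner: choose tools from keywords in the policy text."""
    text = policy.lower()
    selected = ["captioner"]
    if any(k in text for k in ("weapon", "alcohol")):
        selected.append("object_detector")
    if any(k in text for k in ("nudity", "sexual")):
        selected.append("nsfw_detector")
    return selected

def verify(image, policy):
    """Stand-in verifier: gather tool evidence, then apply simple rules."""
    tools = plan_tools(policy)
    evidence = {}
    for name in tools:
        evidence.update(TOOLS[name](image))
    text = policy.lower()
    violations = []
    if "weapon" in text and "gun" in evidence.get("objects", []):
        violations.append("weapon detected")
    if "nudity" in text and evidence.get("nsfw_score", 0.0) > 0.5:
        violations.append("NSFW score above threshold")
    return {"compliant": not violations, "violations": violations, "tools": tools}

if __name__ == "__main__":
    image = {"objects": ["gun", "person"], "nsfw_score": 0.1}
    print(verify(image, "No weapons or nudity"))
```

The point of the structure is that only the tools relevant to a policy are run, and the final decision sees the policy together with all collected evidence.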

Key Takeaways

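Visual compliance verification is an important yet underexplored problem in computer vision, especially in media, entertainment, and advertising.
Existing methods mainly rely on task-specific deep learning models, which are costly to build and generalize poorly.
CompAgent is the first agentic framework for visual compliance verification, combining visual tools with Multimodal Large Language Models (MLLMs).
A planning agent dynamically selects the appropriate visual tools for a given compliance policy to augment the MLLM.
A compliance verification agent integrates the image, the tool outputs, and the policy context to perform multimodal reasoning.
On public benchmarks, CompAgent outperforms specialized classifiers, direct MLLM prompting, and curated routing baselines, reaching up to a 76% F1 score and a 10% improvement over the state of the art on UnsafeBench.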
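
For reference, the F1 score quoted above is the harmonic mean of precision and recall. The sketch below computes it for binary labels; the example labels are invented and unrelated to UnsafeBench.

```python
def f1_score(y_true, y_pred):
    """Binary F1 with 1 as the positive (non-compliant) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0                            # no true positives: F1 is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2.0 * precision * recall / (precision + recall)

if __name__ == "__main__":
    y_true = [1, 1, 1, 0, 0, 1]
    y_pred = [1, 0, 1, 1, 0, 1]
    print(f1_score(y_true, y_pred))
```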

Self Pre-training with Topology- and Spatiality-aware Masked Autoencoders for 3D Medical Image Segmentation

Authors: Pengfei Gu, Huimin Li, Yejia Zhang, Chaoli Wang, Danny Z. Chen

Masked Autoencoders (MAEs) have been shown to be effective in pre-training Vision Transformers (ViTs) for natural and medical image analysis problems. By reconstructing missing pixel/voxel information in visible patches, a ViT encoder can aggregate contextual information for downstream tasks. But, existing MAE pre-training methods, which were specifically developed with the ViT architecture, lack the ability to capture geometric shape and spatial information, which is critical for medical image segmentation tasks. In this paper, we propose a novel extension of known MAEs for self pre-training (i.e., models pre-trained on the same target dataset) for 3D medical image segmentation. (1) We propose a new topological loss to preserve geometric shape information by computing topological signatures of both the input and reconstructed volumes, learning geometric shape information. (2) We introduce a pre-text task that predicts the positions of the centers and eight corners of 3D crops, enabling the MAE to aggregate spatial information. (3) We extend the MAE pre-training strategy to a hybrid state-of-the-art (SOTA) medical image segmentation architecture and co-pretrain it alongside the ViT. (4) We develop a fine-tuned model for downstream segmentation tasks by complementing the pre-trained ViT encoder with our pre-trained SOTA model. Extensive experiments on five public 3D segmentation datasets show the effectiveness of our new approach.

Paper and project links

PDF

Summary

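Building on the effectiveness of Masked Autoencoders (MAEs) for pre-training Vision Transformers (ViTs), this paper proposes a self pre-training strategy for 3D medical image segmentation, in which the model is pre-trained on the same dataset as the target task. A topological loss and a pretext task that predicts the positions of the center and eight corners of each 3D crop let the MAE capture geometric shape and spatial information. The MAE pre-training strategy is also extended to a hybrid state-of-the-art (SOTA) medical image segmentation architecture, which is co-pretrained alongside the ViT. Finally, the fine-tuned model for downstream segmentation complements the pre-trained ViT encoder with the pre-trained SOTA model.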
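
The masking step that MAE pre-training relies on can be sketched as follows for a 3D volume. The patch size, the 75% mask ratio, and the function names are assumptions chosen for illustration; the digest does not state the settings used in the paper.

```python
import random

def patch_grid(volume_shape, patch_size):
    """Origins (z, y, x) of non-overlapping cubic patches inside the volume."""
    d, h, w = volume_shape
    p = patch_size
    return [
        (z, y, x)
        for z in range(0, d - p + 1, p)
        for y in range(0, h - p + 1, p)
        for x in range(0, w - p + 1, p)
    ]

def random_mask(patches, mask_ratio, seed=0):
    """Split patches into (visible, masked) lists by random sampling."""
    rng = random.Random(seed)
    order = list(patches)
    rng.shuffle(order)
    n_masked = int(round(len(order) * mask_ratio))
    return order[n_masked:], order[:n_masked]

if __name__ == "__main__":
    patches = patch_grid((64, 64, 64), 16)
    visible, masked = random_mask(patches, 0.75)
    # The encoder would see only `visible`; the decoder reconstructs `masked`.
    print(len(patches), len(visible), len(masked))
```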

Key Takeaways

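Masked Autoencoders (MAEs) are effective for pre-training Vision Transformers (ViTs) on both natural and medical image analysis problems.
Existing MAE pre-training methods do not capture geometric shape and spatial information well, although both are critical for medical image segmentation.
A new topological loss preserves geometric shape information by computing topological signatures of the input and reconstructed volumes.
A pretext task that predicts the positions of the center and eight corners of each 3D crop lets the MAE aggregate spatial information.
The MAE pre-training strategy is extended to a hybrid state-of-the-art (SOTA) medical image segmentation architecture, which is co-pretrained alongside the ViT.
The fine-tuned model for downstream segmentation complements the pre-trained ViT encoder with the pre-trained SOTA model.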
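
To illustrate the position-prediction pretext task, the sketch below computes one plausible set of regression targets for a 3D crop: the coordinates of its center and eight corners, normalized by the volume size. How the paper actually encodes these targets is not described in this digest, so the normalization and the name `crop_position_targets` are assumptions.

```python
from itertools import product

def crop_position_targets(volume_shape, crop_origin, crop_shape):
    """Normalized (z, y, x) of the crop center followed by its eight corners."""
    lo = tuple(crop_origin)
    hi = tuple(o + s for o, s in zip(crop_origin, crop_shape))
    center = tuple((a + b) / 2.0 for a, b in zip(lo, hi))
    # Each corner picks either the low or the high bound along every axis.
    corners = list(product(*zip(lo, hi)))
    points = [center] + corners
    return [tuple(v / n for v, n in zip(pt, volume_shape)) for pt in points]

if __name__ == "__main__":
    targets = crop_position_targets((128, 128, 128), (32, 0, 64), (64, 64, 64))
    for t in targets:
        print(t)
```

A network head trained on such targets would regress 9 points times 3 coordinates per crop, which forces the encoder to represent where the crop sits inside the full volume.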
