
Vision Transformer


⚠️ All of the summaries below are generated by a large language model; they may contain errors, are for reference only, and should be used with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-11-27

Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

Authors:Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, Min-Ling Zhang

Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP’s zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label correlations from CLIP’s textual encoder. The framework incorporates a graph convolutional network for label-aware propagation and learnable soft prompts for refined embeddings. It utilizes a distribution-balanced Focal loss with class-aware re-weighting for optimized training under imbalance. Moreover, it improves generalization through test-time ensembling and realigns visual-textual modalities using parameter-efficient fine-tuning to avert overfitting on tail classes without compromising head class performance. Extensive experiments and ablation studies on benchmarks including VOC-LT, COCO-LT, and NUS-WIDE demonstrate that CAPNET achieves substantial improvements over state-of-the-art methods, validating its effectiveness for real-world long-tailed multi-label visual recognition.
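
The abstract names two concrete ingredients: label-aware propagation of CLIP text embeddings through a graph convolutional network, and a distribution-balanced Focal loss with class-aware re-weighting. The snippet below is a minimal sketch of both pieces under placeholder assumptions (random correlation matrix, uniform class weights, toy shapes); it is not the authors' implementation.

```python
# Minimal sketch (not the authors' code): label-aware propagation of per-class
# CLIP text embeddings with a single GCN layer, plus a class-reweighted focal
# loss for multi-label training. The correlation matrix `adj` and the per-class
# weights are hypothetical placeholders; CAPNET derives them from CLIP's textual
# encoder and from the class frequencies, respectively.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LabelGCN(nn.Module):
    def __init__(self, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, label_embeds: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Row-normalize the label-correlation graph and propagate the embeddings.
        adj = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.proj(adj @ label_embeds))

def reweighted_focal_loss(logits, targets, class_weights, gamma: float = 2.0):
    # Multi-label focal loss; class_weights (shape [C]) re-weights each class.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    pt = targets * p + (1 - targets) * (1 - p)   # probability of the true label
    return (class_weights * (1 - pt) ** gamma * ce).mean()

# Toy usage with random tensors standing in for CLIP text/image features.
C, D, B = 20, 512, 4
label_embeds = torch.randn(C, D)              # one CLIP text embedding per class
adj = torch.rand(C, C)                        # hypothetical label-correlation matrix
refined = LabelGCN(D)(label_embeds, adj)      # [C, D] refined per-class embeddings
logits = torch.randn(B, D) @ refined.t()      # [B, C] scores from visual features
loss = reweighted_focal_loss(logits, torch.randint(0, 2, (B, C)).float(),
                             class_weights=torch.ones(C))
print(loss.item())
```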


Paper and Project Links

PDF

Summary
To address the challenge of long-tailed multi-label visual recognition, this paper proposes CAPNET, a framework that models inter-label relationships from CLIP's text encoder, using a graph convolutional network for label-aware propagation together with learnable soft prompts for refined embeddings. A distribution-balanced Focal loss with class-aware re-weighting handles the data imbalance. Experiments show that CAPNET achieves substantial improvements on long-tailed multi-label visual recognition.

Key Takeaways

  1. Long-tailed multi-label visual recognition is challenging: labels within images are highly imbalanced, so models favor head classes and neglect tail classes.
  2. Existing methods pair pre-trained vision-language models such as CLIP with long-tailed learning techniques, but deriving semantic inter-class relationships directly from imbalanced datasets is unreliable.
  3. CLIP's zero-shot paradigm is optimized for single-label image-text matching and is suboptimal for multi-label tasks.
  4. The proposed CAPNET framework couples CLIP's text encoder with a graph convolutional network for label-aware propagation.
  5. CAPNET uses learnable soft prompts for refined embeddings, and a distribution-balanced Focal loss with class-aware re-weighting to handle training imbalance.
  6. CAPNET improves generalization through test-time ensembling and realigns the visual and textual modalities with parameter-efficient fine-tuning, avoiding overfitting on tail classes without sacrificing head-class performance.

Cool Papers

Click here to view paper screenshots

Patch-Level Glioblastoma Subregion Classification with a Contrastive Learning-Based Encoder

Authors:Juexin Zhang, Qifeng Zhong, Ying Weng, Ke Chen

The significant molecular and pathological heterogeneity of glioblastoma, an aggressive brain tumor, complicates diagnosis and patient stratification. While traditional histopathological assessment remains the standard, deep learning offers a promising path toward objective and automated analysis of whole slide images. For the BraTS-Path 2025 Challenge, we developed a method that fine-tunes a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. Our model’s performance on the online validation set, evaluated via the Synapse platform, yielded a Matthews Correlation Coefficient (MCC) of 0.7064 and an F1-score of 0.7676. On the final test set, the model achieved an MCC of 0.6509 and an F1-score of 0.5330, which secured our team second place in the BraTS-Pathology 2025 Challenge. Our results establish a solid baseline for ViT-based histopathological analysis, and future efforts will focus on bridging the performance gap observed on the unseen validation data.
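
As a rough picture of the training setup, here is a minimal sketch of fine-tuning a pre-trained ViT encoder with a fresh classification head, as the abstract describes. It is not the challenge entry: the torchvision backbone, ImageNet weights, class count, and hyperparameters are stand-in assumptions (the title indicates the authors start from a contrastive-learning-based encoder).

```python
# Minimal sketch (assumptions only, not the challenge entry): fine-tune a
# pre-trained ViT encoder with a fresh classification head for patch-level
# glioblastoma subregion labels. Backbone, weights, class count, and
# hyperparameters are placeholders.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_CLASSES = 6  # hypothetical number of subregion classes
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    # One supervised step over a batch of 224x224 histopathology patches.
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for image patches.
print(train_step(torch.randn(2, 3, 224, 224), torch.tensor([0, 3])))
```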


Paper and Project Links

PDF Accepted by the International Brain Tumor Segmentation (BraTS) challenge organized at MICCAI 2025 conference

Summary
A method for glioblastoma whole-slide image analysis is developed by fine-tuning a pre-trained Vision Transformer (ViT) encoder with a dedicated classification head on the official training dataset. The method performs strongly on the online validation set of the BraTS-Path 2025 Challenge, showing clear promise, and achieves good results on the final test set, securing second place. This establishes a solid baseline for ViT-based histopathological analysis.

Key Takeaways

  1. Glioblastoma exhibits substantial molecular and pathological heterogeneity, which complicates diagnosis and patient stratification.
  2. Deep learning offers a path toward objective and automated whole-slide image analysis.
  3. Fine-tuning a pre-trained Vision Transformer (ViT) encoder proves effective for analyzing glioblastoma images.
  4. On the BraTS-Path 2025 online validation set, the method achieves a strong Matthews Correlation Coefficient (MCC) and F1-score.
  5. The model performs well on the final test set, securing second place and confirming the method's practicality.
  6. The approach establishes a solid baseline for ViT-based histopathological analysis.

Cool Papers

Click here to view paper screenshots

Vision–Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Authors:Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos

Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.
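
The reference-guided step, in which VESSA matches input features against an exemplar segmentation to extract cues for the mask decoder, can be pictured with a simple feature-matching sketch. The code below is an illustration under assumed shapes and thresholds, not the authors' implementation.

```python
# Illustration only (not the authors' implementation): reference-guided
# prompting by matching input features against exemplar foreground features,
# yielding a coarse similarity map that could seed a mask-decoder prompt.
# The feature extractor, tensor shapes, and threshold are hypothetical.
import torch
import torch.nn.functional as F

def reference_prompt(input_feats, exemplar_feats, exemplar_mask, thresh=0.5):
    """input_feats / exemplar_feats: [C, H, W]; exemplar_mask: [H, W] in {0, 1}."""
    C, H, W = input_feats.shape
    inp = F.normalize(input_feats.reshape(C, -1), dim=0)     # [C, H*W]
    ref = F.normalize(exemplar_feats.reshape(C, -1), dim=0)  # [C, H*W]
    fg = exemplar_mask.reshape(-1).bool()                    # exemplar foreground
    sim = inp.t() @ ref[:, fg]                   # [H*W, N_fg] cosine similarities
    score = sim.max(dim=1).values.reshape(H, W)  # best exemplar match per location
    return (score > thresh).float()              # coarse prompt mask for the decoder

# Toy usage with random features standing in for encoder outputs.
feat_in, feat_ref = torch.randn(64, 32, 32), torch.randn(64, 32, 32)
mask_ref = (torch.rand(32, 32) > 0.7).float()
print(reference_prompt(feat_in, feat_ref, mask_ref).shape)  # torch.Size([32, 32])
```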


Paper and Project Links

PDF

Summary

This paper integrates vision-language models (VLMs) into semi-supervised medical image segmentation. It introduces VESSA (Vision-Language Enhanced Semi-supervised Segmentation Assistant): in Stage 1, a VLM-enhanced segmentation foundation model is trained with a template bank of gold-standard exemplars; in Stage 2, VESSA is integrated into an SSL framework and interacts dynamically with the student model. Experiments show that VESSA-augmented SSL significantly improves segmentation accuracy under limited annotation, outperforming existing baselines.

Key Takeaways

  1. VLMs are brought into semi-supervised medical image segmentation to reduce reliance on extensive expert annotations.
  2. The proposed VESSA model is trained in Stage 1 with a template bank, simulating learning from limited labeled data.
  3. VESSA generates structured prompts for a SAM2-inspired mask decoder to produce segmentation masks.
  4. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model.
  5. Student predictions are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance.
  6. Experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly improves segmentation accuracy.

Cool Papers

Click here to view paper screenshots

HunyuanOCR Technical Report

Authors: Hunyuan Vision Team, Pengyuan Lyu, Xingyu Wan, Gengluo Li, Shangpin Peng, Weinong Wang, Liang Wu, Huawen Shen, Yu Zhou, Canhui Tang, Qi Yang, Qiming Peng, Bin Luo, Hower Yang, Houwen Peng, Hongming Yang, Senhao Xie, Binghong Wu, Mana Yang, Sergey Wang, Raccoon Liu, Dick Zhu, Jie Jiang, Linus, Han Hu, Chengquan Zhang

This paper presents HunyuanOCR, a commercial-grade, open-source, and lightweight (1B parameters) Vision-Language Model (VLM) dedicated to OCR tasks. The architecture comprises a Native Vision Transformer (ViT) and a lightweight LLM connected via an MLP adapter. HunyuanOCR demonstrates superior performance, outperforming commercial APIs, traditional pipelines, and larger models (e.g., Qwen3-VL-4B). Specifically, it surpasses current public solutions in perception tasks (Text Spotting, Parsing) and excels in semantic tasks (IE, Text Image Translation), securing first place in the ICDAR 2025 DIMT Challenge (Small Model Track). Furthermore, it achieves state-of-the-art (SOTA) results on OCRBench among VLMs with fewer than 3B parameters. HunyuanOCR achieves breakthroughs in three key aspects: 1) Unifying Versatility and Efficiency: We implement comprehensive support for core capabilities including spotting, parsing, IE, VQA, and translation within a lightweight framework. This addresses the limitations of narrow “OCR expert models” and inefficient “General VLMs”. 2) Streamlined End-to-End Architecture: Adopting a pure end-to-end paradigm eliminates dependencies on pre-processing modules (e.g., layout analysis). This fundamentally resolves error propagation common in traditional pipelines and simplifies system deployment. 3) Data-Driven and RL Strategies: We confirm the critical role of high-quality data and, for the first time in the industry, demonstrate that Reinforcement Learning (RL) strategies yield significant performance gains in OCR tasks. HunyuanOCR is officially open-sourced on HuggingFace. We also provide a high-performance deployment solution based on vLLM, placing its production efficiency in the top tier. We hope this model will advance frontier research and provide a solid foundation for industrial applications.
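
Given the vLLM-based deployment path mentioned above, a typical offline-inference call might resemble the hedged sketch below, written against vLLM's generic multimodal prompt format. The HuggingFace repo id, prompt text, and generation settings are assumptions; the official model card defines the actual identifiers and prompt conventions.

```python
# Hedged sketch (assumptions throughout): offline OCR inference through vLLM's
# generic multimodal API. The repo id, prompt text, and output handling are
# hypothetical; consult the official HuggingFace model card for the real
# identifiers and prompt conventions.
from PIL import Image
from vllm import LLM, SamplingParams

MODEL_ID = "tencent/HunyuanOCR"          # hypothetical HuggingFace repo name
llm = LLM(model=MODEL_ID, trust_remote_code=True)

image = Image.open("document.png").convert("RGB")   # any local document image
params = SamplingParams(temperature=0.0, max_tokens=1024)

outputs = llm.generate(
    {"prompt": "Extract all text in the image.",    # placeholder instruction
     "multi_modal_data": {"image": image}},
    sampling_params=params,
)
print(outputs[0].outputs[0].text)                   # recognized text
```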


Paper and Project Links

PDF

Summary

HunyuanOCR is a commercial-grade, open-source, lightweight (1B-parameter) vision-language model (VLM) dedicated to OCR. It couples a native Vision Transformer (ViT) with a lightweight LLM through an MLP adapter. HunyuanOCR delivers excellent performance on both perception tasks (text spotting, parsing) and semantic tasks (information extraction, text image translation), won first place in the ICDAR 2025 DIMT Challenge (Small Model Track), and reaches state-of-the-art results on OCRBench among VLMs with fewer than 3B parameters. Its breakthroughs lie in (1) unifying versatility and efficiency, (2) a streamlined end-to-end architecture, and (3) data-driven and reinforcement learning strategies. The model is open-sourced on HuggingFace, together with a high-performance vLLM-based deployment solution.

Key Takeaways

  1. HunyuanOCR is a commercial-grade, open-source OCR vision-language model (VLM) with a lightweight 1B-parameter footprint.
  2. It couples a native Vision Transformer (ViT) with a lightweight LLM via an MLP adapter to achieve strong performance.
  3. HunyuanOCR performs well across OCR tasks such as text spotting, parsing, information extraction, and text image translation.
  4. It won first place in the ICDAR 2025 DIMT Challenge (Small Model Track).
  5. It achieves state-of-the-art results on OCRBench among VLMs with fewer than 3B parameters.
  6. Its breakthroughs cover unified versatility and efficiency, an end-to-end architecture, and data-driven and reinforcement learning strategies.

Cool Papers

Click here to view paper screenshots

SafeFix: Targeted Model Repair via Controlled Image Generation

Authors:Ouyang Xu, Baoming Zhang, Ruiyu Mao, Yunhui Guo

Deep learning models for visual recognition often exhibit systematic errors due to underrepresented semantic subpopulations. Although existing debugging frameworks can pinpoint these failures by identifying key failure attributes, repairing the model effectively remains difficult. Current solutions often rely on manually designed prompts to generate synthetic training images – an approach prone to distribution shift and semantic errors. To overcome these challenges, we introduce a model repair module that builds on an interpretable failure attribution pipeline. Our approach uses a conditional text-to-image model to generate semantically faithful and targeted images for failure cases. To preserve the quality and relevance of the generated samples, we further employ a large vision-language model (LVLM) to filter the outputs, enforcing alignment with the original data distribution and maintaining semantic consistency. By retraining vision models with this rare-case-augmented synthetic dataset, we significantly reduce errors associated with rare cases. Our experiments demonstrate that this targeted repair strategy improves model robustness without introducing new bugs. Code is available at https://github.com/oxu2/SafeFix
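
The repair loop (generate targeted images for a failure attribute, then filter them for semantic fidelity before retraining) can be sketched as follows. This is not the SafeFix code: a CLIP similarity score stands in for the paper's LVLM filter, and the diffusion model, prompt, and acceptance threshold are placeholder assumptions.

```python
# Sketch only (not the SafeFix code): generate targeted images for a failure
# attribute with an off-the-shelf text-to-image model, then keep samples that
# pass a semantic check. A CLIP similarity score stands in for the paper's LVLM
# filter; model names, prompt, and threshold are assumptions. Assumes a GPU.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

failure_prompt = "a photo of a dog in low-light conditions"  # hypothetical rare case
accepted = []
for _ in range(8):                                    # small synthetic batch
    image = pipe(failure_prompt).images[0]
    inputs = proc(text=[failure_prompt], images=image,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
        img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
        txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
        score = (img @ txt.t()).item()                # image-text cosine similarity
    if score > 0.25:                                  # placeholder acceptance threshold
        accepted.append(image)                        # candidates for the retraining set
print(f"kept {len(accepted)} of 8 generated images")
```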


Paper and Project Links

PDF

Summary
Deep learning models for visual recognition exhibit systematic errors. Existing debugging frameworks can identify key failure attributes, yet repairing the model remains difficult; current solutions rely on manually designed prompts to generate synthetic training images, which is prone to distribution shift and semantic errors. To overcome these challenges, the authors introduce a model repair module built on an interpretable failure-attribution pipeline: a conditional text-to-image model generates semantically faithful, targeted images for failure cases, and a large vision-language model (LVLM) filters the outputs to keep them aligned with the original data distribution and semantically consistent. Retraining vision models on this rare-case-augmented synthetic dataset significantly reduces errors on rare cases, and experiments show that this targeted repair strategy improves robustness without introducing new bugs.

Key Takeaways

  1. Deep learning models for visual recognition exhibit systematic errors, particularly on underrepresented semantic subpopulations.
  2. Existing debugging frameworks can pinpoint key failure attributes, but effectively repairing the model remains challenging.
  3. Current solutions generate synthetic training images from manually designed prompts, which easily introduces distribution shift and semantic errors.
  4. A model repair module built on an interpretable failure-attribution pipeline is introduced.
  5. A conditional text-to-image model generates semantically faithful images targeted at failure cases.
  6. A large vision-language model (LVLM) filters the outputs to ensure alignment with the original data distribution and semantic consistency.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all articles on this blog are licensed under the CC BY 4.0 license. Please credit Kedreamix when reposting!