⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only, so use them with caution.
🔴 Please note: never rely on them in serious academic settings; they are only meant as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-10
Evaluating Fundus-Specific Foundation Models for Diabetic Macular Edema Detection
Authors:Franco Javier Arellano, José Ignacio Orlando
Diabetic Macular Edema (DME) is a leading cause of vision loss among patients with Diabetic Retinopathy (DR). While deep learning has shown promising results for automatically detecting this condition from fundus images, its application remains challenging due to the limited availability of annotated data. Foundation Models (FM) have emerged as an alternative solution. However, it is unclear if they can cope with DME detection in particular. In this paper, we systematically compare different FM and standard transfer learning approaches for this task. Specifically, we compare the two most popular FM for retinal images, RETFound and FLAIR, and an EfficientNet-B0 backbone, across different training regimes and evaluation settings in IDRiD, MESSIDOR-2 and OCT-and-Eye-Fundus-Images (OEFI). Results show that despite their scale, FM do not consistently outperform fine-tuned CNNs in this task. In particular, an EfficientNet-B0 ranked first or second in terms of area under the ROC and precision/recall curves in most evaluation settings, with RETFound only showing promising results in OEFI. FLAIR, on the other hand, demonstrated competitive zero-shot performance, achieving notable AUC-PR scores when prompted appropriately. These findings reveal that FM might not be a good tool for fine-grained ophthalmic tasks such as DME detection even after fine-tuning, suggesting that lightweight CNNs remain strong baselines in data-scarce environments.
Paper and project links
PDF Accepted for publication at SIPAIM 2025
Summary
This paper compares Foundation Models (FM) with standard transfer learning for Diabetic Macular Edema (DME) detection. It evaluates the two most popular retinal-image FM, RETFound and FLAIR, together with an EfficientNet-B0 backbone, across different training regimes and evaluation settings. The results show that, despite their scale, the FM do not consistently outperform fine-tuned CNNs on this task: EfficientNet-B0 performs strongly in most evaluation settings, while RETFound only does well on OEFI. FLAIR, in turn, shows competitive zero-shot performance when prompted appropriately. These findings suggest that FM may not be well suited to fine-grained ophthalmic tasks such as DME detection, even after fine-tuning, and that lightweight CNNs remain strong baselines in data-scarce environments.
Key Takeaways
- FM do not consistently outperform fine-tuned CNNs on DME detection.
- EfficientNet-B0 ranks first or second in most evaluation settings.
- RETFound only shows strong performance on the OEFI dataset.
- FLAIR achieves competitive zero-shot performance when prompted appropriately.
- FM are of limited use for fine-grained ophthalmic tasks, even after fine-tuning.
- Lightweight CNNs remain strong baselines in data-scarce environments.
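For readers who want to reproduce the kind of lightweight CNN baseline discussed above, here is a minimal sketch of fine-tuning an ImageNet-pretrained EfficientNet-B0 for binary DME detection and reporting AUC-ROC / AUC-PR. It assumes a PyTorch/torchvision setup; the data loaders, learning rate and other hyperparameters are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.metrics import roc_auc_score, average_precision_score

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.efficientnet_b0(weights=models.EfficientNet_B0_Weights.IMAGENET1K_V1)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 1)  # single DME logit
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.BCEWithLogitsLoss()

def run_epoch(loader, train=True):
    """Train or evaluate one epoch and return (AUC-ROC, AUC-PR)."""
    model.train(train)
    scores, labels = [], []
    for images, targets in loader:  # loader yields (B, 3, H, W) fundus crops and 0/1 labels
        images, targets = images.to(device), targets.float().to(device)
        with torch.set_grad_enabled(train):
            logits = model(images).squeeze(1)
            loss = criterion(logits, targets)
            if train:
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        scores.extend(torch.sigmoid(logits).detach().cpu().tolist())
        labels.extend(targets.cpu().tolist())
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)

# auc_roc, auc_pr = run_epoch(val_loader, train=False)  # val_loader is assumed to exist
```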



Continual Learning for Image Captioning through Improved Image-Text Alignment
Authors:Bertram Taetz, Gal Bordelius
Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embedding; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting, while achieving better semantic caption alignment compared to state-of-the-art methods. The code can be found via the following link https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
Paper and project links
PDF 11 pages, 3 figures
Summary
This work presents a new multi-loss framework for image captioning in a continual learning setting. The framework injects semantic guidance through prompt-based continual learning and contrastive alignment. Built on a pretrained ViT-GPT-2 backbone, it combines the standard cross-entropy loss with three additional components: a prompt-based cosine similarity loss, a CLIP-style loss, and a language-guided contrastive loss. The approach mitigates catastrophic forgetting while achieving better semantic caption alignment than state-of-the-art methods.
Key Takeaways
- Proposes a new multi-loss framework for image captioning under continual learning.
- Integrates prompt-based continual learning and contrastive alignment to provide semantic guidance.
- Uses a pretrained ViT-GPT-2 backbone combined with several loss components to improve performance.
- Introduces a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts.
- Adopts a CLIP-style loss that promotes alignment between image embeddings and target caption embeddings.
- Employs a language-guided contrastive loss with a triplet objective to strengthen class-level discriminability across tasks.
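To make the four loss terms concrete, here is a compact sketch of how they could be combined in PyTorch. The encoders, prompt construction, temperature and weighting factors are placeholders for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def caption_loss(logits, targets, img_emb, prompt_emb, cap_emb, anchor, positive, negative,
                 w_prompt=1.0, w_clip=1.0, w_triplet=1.0, temperature=0.07, margin=0.2):
    """Combine the four loss terms. Embeddings are (B, D) tensors; `logits` are
    token logits (B, T, V) and `targets` the ground-truth token ids (B, T)."""
    # (0) standard token-level cross-entropy for caption generation
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten(), ignore_index=-100)

    # (1) prompt-based cosine similarity: pull image embeddings towards the
    #     synthetic prompt (objects / attributes / actions) embeddings
    l_prompt = (1.0 - F.cosine_similarity(img_emb, prompt_emb, dim=-1)).mean()

    # (2) CLIP-style symmetric InfoNCE between image and target-caption embeddings
    img_n = F.normalize(img_emb, dim=-1)
    cap_n = F.normalize(cap_emb, dim=-1)
    sim = img_n @ cap_n.t() / temperature
    labels = torch.arange(sim.size(0), device=sim.device)
    l_clip = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.t(), labels))

    # (3) language-guided triplet loss for class-level discriminability across tasks
    l_triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)

    return ce + w_prompt * l_prompt + w_clip * l_clip + w_triplet * l_triplet
```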



Leveraging Vision Transformers for Enhanced Classification of Emotions using ECG Signals
Authors:Pubudu L. Indrasiri, Bipasha Kashyap, Pubudu N. Pathirana
Biomedical signals provide insights into various conditions affecting the human body. Beyond diagnostic capabilities, these signals offer a deeper understanding of how specific organs respond to an individual’s emotions and feelings. For instance, ECG data can reveal changes in heart rate variability linked to emotional arousal, stress levels, and autonomic nervous system activity. This data offers a window into the physiological basis of our emotional states. Recent advancements in the field diverge from conventional approaches by leveraging the power of advanced transformer architectures, which surpass traditional machine learning and deep learning methods. We begin by assessing the effectiveness of the Vision Transformer (ViT), a forefront model in image classification, for identifying emotions in imaged ECGs. Following this, we present and evaluate an improved version of ViT, integrating both CNN and SE blocks, aiming to bolster performance on imaged ECGs associated with emotion detection. Our method unfolds in two critical phases: first, we apply advanced preprocessing techniques for signal purification and converting signals into interpretable images using continuous wavelet transform and power spectral density analysis; second, we unveil a performance-boosted vision transformer architecture, cleverly enhanced with convolutional neural network components, to adeptly tackle the challenges of emotion recognition. Our methodology’s robustness and innovation were thoroughly tested using ECG data from the YAAD and DREAMER datasets, leading to remarkable outcomes. For the YAAD dataset, our approach outperformed existing state-of-the-art methods in classifying seven unique emotional states, as well as in valence and arousal classification. Similarly, in the DREAMER dataset, our method excelled in distinguishing between valence, arousal and dominance, surpassing current leading techniques.
Paper and project links
PDF 14 pages, 2 figures
Summary
This work applies biomedical signals to emotion recognition, focusing on the role of ECG data in revealing emotional changes. It assesses the Vision Transformer (ViT) for emotion recognition from imaged ECGs and presents an improved ViT that integrates CNN and SE blocks to boost performance on emotion detection. In testing, the method achieved strong results on emotion classification.
Key Takeaways
- Biomedical signals provide insight into various bodily conditions and into how specific organs respond to an individual's emotions and feelings.
- ECG data can reveal changes in heart rate variability linked to emotional arousal, stress levels, and autonomic nervous system activity.
- Advanced transformer architectures show promise for emotion recognition, surpassing traditional machine learning and deep learning methods.
- The Vision Transformer (ViT), a leading image classification model, is applied to emotion recognition from imaged ECGs.
- The improved ViT integrates CNN and SE blocks to strengthen emotion-detection performance.
- The method has two phases: signal-purification preprocessing, and conversion of signals into interpretable images via continuous wavelet transform and power spectral density analysis.
- The approach was tested on the YAAD and DREAMER datasets and achieved notable results in emotion classification.
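As a small illustration of the signal-to-image step mentioned above, the following sketch turns a 1-D ECG segment into a continuous wavelet transform scalogram image. It assumes the PyWavelets and Pillow packages; the wavelet choice, number of scales and output size are illustrative and may differ from the paper's preprocessing.

```python
import numpy as np
import pywt
from PIL import Image

def ecg_to_scalogram(signal, fs=256, wavelet="morl", n_scales=64, size=(224, 224)):
    """Turn a 1-D ECG segment into a CWT scalogram image suitable for a ViT."""
    scales = np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=1.0 / fs)
    power = np.abs(coeffs)                                    # (n_scales, n_samples)
    power = (power - power.min()) / (power.max() - power.min() + 1e-8)
    img = Image.fromarray((power * 255).astype(np.uint8))     # grayscale scalogram
    return img.resize(size)                                   # replicate channels for RGB models

# Example with a synthetic signal:
# img = ecg_to_scalogram(np.sin(np.linspace(0, 20 * np.pi, 2560)))
```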




Neuroplastic Modular Framework: Cross-Domain Image Classification of Garbage and Industrial Surfaces
Authors:Debojyoti Ghosh, Soumya K Ghosh, Adrijit Goswami
Efficient and accurate classification of waste and industrial surface defects is essential for ensuring sustainable waste management and maintaining high standards in quality control. This paper introduces the Neuroplastic Modular Classifier, a novel hybrid architecture designed for robust and adaptive image classification in dynamic environments. The model combines a ResNet-50 backbone for localized feature extraction with a Vision Transformer (ViT) to capture global semantic context. Additionally, FAISS-based similarity retrieval is incorporated to provide a memory-like reference to previously encountered data, enriching the model’s feature space. A key innovation of our architecture is the neuroplastic modular design composed of expandable, learnable blocks that dynamically grow during training when performance plateaus. Inspired by biological learning systems, this mechanism allows the model to adapt to data complexity over time, improving generalization. Beyond garbage classification, we validate the model on the Kolektor Surface Defect Dataset 2 (KolektorSDD2), which involves industrial defect detection on metal surfaces. Experimental results across domains show that the proposed architecture outperforms traditional static models in both accuracy and adaptability. The Neuroplastic Modular Classifier offers a scalable, high-performance solution for real-world image classification, with strong applicability in both environmental and industrial domains.
Paper and project links
Summary
This paper introduces the Neuroplastic Modular Classifier, a hybrid architecture for robust, adaptive image classification in dynamic environments. It combines a ResNet-50 backbone for local feature extraction with a Vision Transformer (ViT) for global semantic context, and adds FAISS-based similarity retrieval as a memory-like reference to previously encountered data, enriching the feature space. Its key innovation is a neuroplastic modular design of expandable, learnable blocks that grow dynamically during training when performance plateaus. Inspired by biological learning systems, this mechanism lets the model adapt to data complexity over time and improves generalization. Beyond garbage classification, the model is validated on the Kolektor Surface Defect Dataset 2 for industrial defect detection on metal surfaces. Experiments show that it outperforms traditional static models in both accuracy and adaptability, offering a scalable, high-performance solution for real-world image classification in environmental and industrial domains.
Key Takeaways
- The Neuroplastic Modular Classifier is a hybrid architecture designed for robust, adaptive image classification in dynamic environments.
- It combines a ResNet-50 backbone for local feature extraction with a Vision Transformer (ViT) for global semantic context.
- FAISS-based similarity retrieval provides a memory-like reference to previously encountered data, enriching the feature space.
- The key innovation is a neuroplastic modular design of expandable, learnable blocks that grow dynamically when performance plateaus.
- Inspired by biological learning systems, the mechanism adapts to changing data complexity and improves generalization.
- Beyond garbage classification, the model is validated on the Kolektor Surface Defect Dataset 2 for industrial defect detection.
- Experiments show it outperforms traditional static models in accuracy and adaptability, providing an effective solution for real-world image classification.
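A minimal sketch of the hybrid feature extraction plus FAISS retrieval idea follows, assuming torchvision backbones and the faiss package. The fusion head and the expandable-block mechanism described in the paper are omitted; shapes and the similarity metric are illustrative.

```python
import numpy as np
import torch
import faiss
from torchvision import models

# Two frozen feature extractors: local CNN features + global ViT features.
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
resnet.fc = torch.nn.Identity()                  # 2048-d local features
vit = models.vit_b_16(weights=models.ViT_B_16_Weights.IMAGENET1K_V1)
vit.heads = torch.nn.Identity()                  # 768-d global features
resnet.eval()
vit.eval()

@torch.no_grad()
def embed(images):                               # images: (B, 3, 224, 224)
    feats = torch.cat([resnet(images), vit(images)], dim=1)          # (B, 2816)
    return torch.nn.functional.normalize(feats, dim=1).cpu().numpy().astype("float32")

# Memory-like FAISS index over previously seen training embeddings.
index = faiss.IndexFlatIP(2048 + 768)            # inner product on normalized vectors

def add_to_memory(images):
    index.add(embed(images))

def retrieve(images, k=5):
    """Return similarities and ids of the k nearest stored embeddings, which can be
    concatenated with the query features before a classification head."""
    sims, ids = index.search(embed(images), k)
    return sims, ids
```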


ViTs: Teaching Machines to See Time Series Anomalies Like Human Experts
Authors:Zexin Wang, Changhua Pei, Yang Liu, Hengyue Jiang, Quan Zhou, Haotian Si, Hang Cui, Jianhui Li, Gaogang Xie, Jingjing Li, Dan Pei
Web service administrators must ensure the stability of multiple systems by promptly detecting anomalies in Key Performance Indicators (KPIs). Achieving the goal of “train once, infer across scenarios” remains a fundamental challenge for time series anomaly detection models. Beyond improving zero-shot generalization, such models must also flexibly handle sequences of varying lengths during inference, ranging from one hour to one week, without retraining. Conventional approaches rely on sliding-window encoding and self-supervised learning, which restrict inference to fixed-length inputs. Large Language Models (LLMs) have demonstrated remarkable zero-shot capabilities across general domains. However, when applied to time series data, they face inherent limitations due to context length. To address this issue, we propose ViTs, a Vision-Language Model (VLM)-based framework that converts time series curves into visual representations. By rescaling time series images, temporal dependencies are preserved while maintaining a consistent input size, thereby enabling efficient processing of arbitrarily long sequences without context constraints. Training VLMs for this purpose introduces unique challenges, primarily due to the scarcity of aligned time series image-text data. To overcome this, we employ an evolutionary algorithm to automatically generate thousands of high-quality image-text pairs and design a three-stage training pipeline consisting of: (1) time series knowledge injection, (2) anomaly detection enhancement, and (3) anomaly reasoning refinement. Extensive experiments demonstrate that ViTs substantially enhance the ability of VLMs to understand and detect anomalies in time series data. All datasets and code will be publicly released at: https://anonymous.4open.science/r/ViTs-C484/.
Paper and project links
PDF 13 pages
Summary
Time series anomaly detection models face the dual challenges of fixed-length inputs and limited zero-shot generalization. This work proposes ViTs, a Vision-Language Model (VLM)-based framework that converts time series curves into visual representations, enabling efficient processing of sequences of arbitrary length. By rescaling the time series images, the model preserves temporal dependencies while keeping the input size constant, sidestepping context-length limits. The authors use an evolutionary algorithm to generate high-quality image-text pairs and a three-stage training pipeline covering time series knowledge injection, anomaly detection enhancement, and anomaly reasoning refinement. Experiments show that ViTs substantially improves the ability of VLMs to understand and detect anomalies in time series data.
Key Takeaways
- Keeping systems stable requires promptly detecting anomalies in Key Performance Indicators (KPIs).
- "Train once, infer across scenarios" remains challenging, particularly zero-shot generalization on time series data.
- Conventional approaches such as sliding-window encoding and self-supervised learning are restricted to fixed-length inputs.
- Large Language Models (LLMs) show strong zero-shot abilities in general domains but hit context-length limits on time series data.
- ViTs addresses this by converting time series into visual representations, allowing efficient handling of arbitrarily long sequences.
- Rescaling the time series images preserves temporal dependencies while keeping the input size constant, resolving the context-length issue.
- An evolutionary algorithm generates image-text pairs to overcome data scarcity, and a three-stage training pipeline improves anomaly detection.
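The core trick above is rendering a KPI series of any length into a fixed-size image a VLM can consume. Here is a minimal sketch using matplotlib and Pillow; the figure style, resolution and output size are illustrative choices, not the paper's exact rendering.

```python
import io
import numpy as np
import matplotlib
matplotlib.use("Agg")                       # headless rendering
import matplotlib.pyplot as plt
from PIL import Image

def series_to_image(values, size=(448, 448)):
    """Render a 1-D KPI series of arbitrary length as a fixed-size RGB image."""
    fig, ax = plt.subplots(figsize=(6, 3), dpi=150)
    ax.plot(np.asarray(values), linewidth=1.0)
    ax.set_xticks([])
    ax.set_yticks([])                       # the VLM reads the curve shape, not the ticks
    buf = io.BytesIO()
    fig.savefig(buf, format="png", bbox_inches="tight")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB").resize(size)   # same input size for 1 hour or 1 week

# img = series_to_image(np.random.randn(7 * 24 * 60))    # e.g. one week of minutely data
```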








AgentTypo: Adaptive Typographic Prompt Injection Attacks against Black-box Multimodal Agents
Authors:Yanjie Li, Yiming Cao, Dong Wang, Bin Xiao
Multimodal agents built on large vision-language models (LVLMs) are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection, especially through visual inputs. We introduce AgentTypo, a black-box red-teaming framework that mounts adaptive typographic prompt injection by embedding optimized text into webpage images. Our automatic typographic prompt injection (ATPI) algorithm maximizes prompt reconstruction by substituting captioners while minimizing human detectability via a stealth loss, with a Tree-structured Parzen Estimator guiding black-box optimization over text placement, size, and color. To further enhance attack strength, we develop AgentTypo-pro, a multi-LLM system that iteratively refines injection prompts using evaluation feedback and retrieves successful past examples for continual learning. Effective prompts are abstracted into generalizable strategies and stored in a strategy repository, enabling progressive knowledge accumulation and reuse in future attacks. Experiments on the VWA-Adv benchmark across Classifieds, Shopping, and Reddit scenarios show that AgentTypo significantly outperforms the latest image-based attacks such as AgentAttack. On GPT-4o agents, our image-only attack raises the success rate from 0.23 to 0.45, with consistent results across GPT-4V, GPT-4o-mini, Gemini 1.5 Pro, and Claude 3 Opus. In image+text settings, AgentTypo achieves 0.68 ASR, also outperforming the latest baselines. Our findings reveal that AgentTypo poses a practical and potent threat to multimodal agents and highlight the urgent need for effective defense.
Paper and project links
PDF 13 pages, 8 figures. Submitted to IEEE Transactions on Information Forensics & Security
Summary
This paper studies the security threat that prompt injection poses to multimodal agents built on large vision-language models. The authors develop AgentTypo, a black-box red-teaming framework that performs adaptive typographic prompt injection by embedding optimized text into webpage images. AgentTypo includes an automatic typographic prompt injection (ATPI) algorithm, which maximizes prompt reconstruction while minimizing human detectability through a stealth loss, and AgentTypo-pro, a multi-LLM system that iteratively refines injection prompts using evaluation feedback and retrieval of successful past examples. Experiments show that AgentTypo clearly outperforms recent image-based attacks such as AgentAttack and raises attack success rates on GPT-4o agents, highlighting a practical threat to multimodal agents and the urgency of effective defenses.
Key Takeaways
- Multimodal agents are increasingly deployed in open-world settings but remain highly vulnerable to prompt injection through visual inputs.
- AgentTypo is a black-box red-teaming framework that performs adaptive typographic prompt injection by embedding optimized text into webpage images.
- It combines the automatic typographic prompt injection (ATPI) algorithm, which uses a Tree-structured Parzen Estimator to guide black-box optimization of text placement, size, and color, with the AgentTypo-pro multi-LLM system.
- AgentTypo significantly outperforms existing image-based attacks such as AgentAttack and raises the attack success rate on GPT-4o agents.
- In the image+text setting, AgentTypo reaches a high attack success rate.
- AgentTypo reveals a practical threat to multimodal agents and the associated risks.
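For context on the optimization component only, here is a generic sketch of a Tree-structured Parzen Estimator search using the Optuna library, rendering a text overlay with Pillow. The objective below is a placeholder standing in for the paper's caption-reconstruction and stealth losses, and only placement and color are searched (tuning font size would additionally require loading a TrueType font).

```python
import optuna
from PIL import Image, ImageDraw

base = Image.new("RGB", (512, 512), "white")     # placeholder webpage screenshot

def render(params):
    img = base.copy()
    draw = ImageDraw.Draw(img)
    draw.text((params["x"], params["y"]), "sample overlay text",
              fill=(params["r"], params["g"], params["b"]))
    return img

def score(image):
    # Placeholder objective; the paper scores prompt reconstruction by substitute
    # captioners minus a stealth (detectability) penalty.
    return 0.0

def objective(trial):
    params = {
        "x": trial.suggest_int("x", 0, 480),
        "y": trial.suggest_int("y", 0, 480),
        "r": trial.suggest_int("r", 0, 255),
        "g": trial.suggest_int("g", 0, 255),
        "b": trial.suggest_int("b", 0, 255),
    }
    return score(render(params))

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
```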





UGround: Towards Unified Visual Grounding with Unrolled Transformers
Authors:Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou
We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across Unrolled transformers as "mask as prompt", diverging from the prevailing pipeline that leverages the fixed last hidden layer as "<SEG> as prompt". UGround addresses two primary challenges of the existing paradigm: (1) its reliance on the fixed last hidden layer, which propagates accumulated errors layer by layer without intermediate correction and thus amplifies them sequentially; and (2) its use of <SEG> as prompt, which implicitly projects text embeddings into the visual space without explicit spatial cues (e.g., coordinates). At the core of UGround is policy-prompted masking, which consists of two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement-learning policy that, via stochastic sampling, lets each <SEG> token slide across the unrolled transformer layers and decide at which layer to form a skip connection to the vision model (e.g., SAM). Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and the image tokens as a soft logit mask to prompt SAM for mask generation, providing explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, unify visual grounding tasks within a single framework from an attribute perspective, spanning classical referring expression segmentation to newly proposed reasoning segmentation, single-target to multi-target, and positive queries to false premises (empty targets). All code and models are publicly available at https://github.com/rui-qian/UGround.
Paper and project links
PDF https://github.com/rui-qian/UGround
Summary
UGround is a unified visual grounding paradigm that dynamically selects intermediate layers of unrolled transformers as "mask as prompt", addressing two challenges of the existing paradigm: the fixed use of the last hidden layer and the use of <SEG> as prompt. Its core is policy-prompted masking, composed of a Stochastic Skip Connection and a Mask-as-Prompt component.
Key Takeaways
- UGround proposes a unified visual grounding paradigm that addresses problems in existing approaches.
- It dynamically selects intermediate layers of unrolled transformers as "mask as prompt", improving performance.
- It avoids the issues caused by always using the fixed last hidden layer as prompt, such as error accumulation.
- Through policy-prompted masking (Stochastic Skip Connection plus Mask as Prompt), it provides effective spatial cues.
- It unifies visual grounding tasks within a single framework for the first time, covering classical referring expression segmentation and emerging reasoning segmentation.
- It supports single-target to multi-target settings and positive queries to false premises (empty targets).
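A minimal tensor-level sketch of the mask-as-prompt idea described above: a similarity map between a <SEG>-style token embedding and the image token embeddings is reshaped into a spatial grid and used as a soft logit mask for a mask decoder such as SAM. Shapes, normalization and the interpolation step are illustrative.

```python
import torch
import torch.nn.functional as F

def soft_mask_prompt(seg_token, image_tokens, grid_hw, out_hw=(256, 256)):
    """seg_token: (B, D) hidden state of the segmentation token at the selected layer.
    image_tokens: (B, N, D) image token hidden states, with N == grid_hw[0] * grid_hw[1].
    Returns a (B, 1, H, W) soft logit mask that can prompt a mask decoder such as SAM."""
    seg = F.normalize(seg_token, dim=-1)                 # (B, D)
    img = F.normalize(image_tokens, dim=-1)              # (B, N, D)
    sim = torch.einsum("bd,bnd->bn", seg, img)           # cosine similarity per image token
    h, w = grid_hw
    sim = sim.view(-1, 1, h, w)                          # back onto the ViT patch grid
    return F.interpolate(sim, size=out_hw, mode="bilinear", align_corners=False)

# Example shapes: a 24x24 patch grid (576 tokens) with hidden size 4096
# mask_logits = soft_mask_prompt(torch.randn(2, 4096), torch.randn(2, 576, 4096), (24, 24))
```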





DuPLUS: Dual-Prompt Vision-Language Framework for Universal Medical Image Segmentation and Prognosis
Authors:Numan Saeed, Tausifa Jan Saleem, Fadillah Maani, Muhammad Ridzuan, Hu Wang, Mohammad Yaqub
Deep learning for medical imaging is hampered by task-specific models that lack generalizability and prognostic capabilities, while existing ‘universal’ approaches suffer from simplistic conditioning and poor medical semantic understanding. To address these limitations, we introduce DuPLUS, a deep learning framework for efficient multi-modal medical image analysis. DuPLUS introduces a novel vision-language framework that leverages hierarchical semantic prompts for fine-grained control over the analysis task, a capability absent in prior universal models. To enable extensibility to other medical tasks, it includes a hierarchical, text-controlled architecture driven by a unique dual-prompt mechanism. For segmentation, DuPLUS is able to generalize across three imaging modalities, ten different anatomically various medical datasets, encompassing more than 30 organs and tumor types. It outperforms the state-of-the-art task specific and universal models on 8 out of 10 datasets. We demonstrate extensibility of its text-controlled architecture by seamless integration of electronic health record (EHR) data for prognosis prediction, and on a head and neck cancer dataset, DuPLUS achieved a Concordance Index (CI) of 0.69. Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities from varying centers, establishing DuPLUS as a versatile and clinically relevant solution for medical image analysis. The code for this work is made available at: https://anonymous.4open.science/r/DuPLUS-6C52
Paper and project links
Summary
This paper introduces DuPLUS, a deep learning framework for multi-modal medical image analysis. It uses a vision-language design with hierarchical semantic prompts for fine-grained control over the analysis task, overcoming limitations of prior universal models. DuPLUS generalizes across multiple imaging modalities and datasets for segmentation, and its text-controlled architecture extends to prognosis prediction by integrating electronic health record data. Parameter-efficient fine-tuning allows rapid adaptation to new tasks and modalities, making it a versatile and clinically relevant solution for medical image analysis.
Key Takeaways
- DuPLUS is a deep learning framework for medical image analysis that addresses the lack of generalizability and prognostic capability in task-specific models.
- It uses a vision-language design with hierarchical semantic prompts for fine-grained control over the analysis task.
- DuPLUS generalizes across multi-modal medical image analysis and performs well on diverse datasets.
- The framework seamlessly integrates electronic health record (EHR) data for prognosis prediction.
- Parameter-efficient fine-tuning enables rapid adaptation to new tasks and modalities.
- It outperforms state-of-the-art task-specific and universal models on 8 out of 10 datasets.
- The code for DuPLUS is publicly released.
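As a side note on the prognosis result quoted above (a Concordance Index of 0.69), here is a short sketch of how a Concordance Index is typically computed with the lifelines package. The survival times and risk scores below are made up for illustration.

```python
import numpy as np
from lifelines.utils import concordance_index

# Toy prognosis example: higher predicted risk should mean shorter survival.
survival_time = np.array([12.0, 5.0, 30.0, 8.0, 22.0])    # months
event_observed = np.array([1, 1, 0, 1, 0])                  # 1 = event observed, 0 = censored
predicted_risk = np.array([0.7, 0.9, 0.2, 0.8, 0.3])

# lifelines expects a survival score (higher = longer survival), so negate the risk.
ci = concordance_index(survival_time, -predicted_risk, event_observed)
print(f"Concordance Index: {ci:.2f}")
```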




VT-FSL: Bridging Vision and Text with LLMs for Few-Shot Learning
Authors:Wenhao Li, Qiangchang Wang, Xianjing Meng, Zhibin Wu, Yilong Yin
Few-shot learning (FSL) aims to recognize novel concepts from only a few labeled support samples. Recent studies enhance support features by incorporating additional semantic information or designing complex semantic fusion modules. However, they still suffer from hallucinating semantics that contradict the visual evidence due to the lack of grounding in actual instances, resulting in noisy guidance and costly corrections. To address these issues, we propose a novel framework, bridging Vision and Text with LLMs for Few-Shot Learning (VT-FSL), which constructs precise cross-modal prompts conditioned on Large Language Models (LLMs) and support images, seamlessly integrating them through a geometry-aware alignment. It mainly consists of Cross-modal Iterative Prompting (CIP) and Cross-modal Geometric Alignment (CGA). Specifically, the CIP conditions an LLM on both class names and support images to generate precise class descriptions iteratively in a single structured reasoning pass. These descriptions not only enrich the semantic understanding of novel classes but also enable the zero-shot synthesis of semantically consistent images. The descriptions and synthetic images act respectively as complementary textual and visual prompts, providing high-level class semantics and low-level intra-class diversity to compensate for limited support data. Furthermore, the CGA jointly aligns the fused textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. It captures global and nonlinear relationships among all representations, enabling structured and consistent multimodal integration. The proposed VT-FSL method establishes new state-of-the-art performance across ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios. Code is available at https://github.com/peacelwh/VT-FSL.
Paper and project links
PDF Accepted by NeurIPS 2025
Summary
This work proposes VT-FSL, a few-shot learning framework that bridges vision and text with large language models (LLMs). Conditioned on LLMs and support images, it constructs precise cross-modal prompts and integrates them through geometry-aware alignment, improving the quality of support features. It consists of Cross-modal Iterative Prompting (CIP), which conditions an LLM on class names and support images to generate precise class descriptions, and Cross-modal Geometric Alignment (CGA), which aligns the textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span. VT-FSL sets new state-of-the-art results across multiple benchmarks.
Key Takeaways
- VT-FSL is a new few-shot learning framework that tackles the challenge of recognizing novel concepts by integrating visual and textual information to improve support features.
- Cross-modal Iterative Prompting (CIP) uses a large language model to generate precise class descriptions and enriches semantic understanding with synthesized images; the descriptions and images provide high-level class semantics and low-level intra-class diversity to compensate for limited support data.
- Cross-modal Geometric Alignment (CGA) aligns the textual, support, and synthetic visual representations by minimizing the kernelized volume of the 3-dimensional parallelotope they span, capturing global and nonlinear relationships for structured, consistent multimodal integration.
- VT-FSL achieves new state-of-the-art performance on ten diverse benchmarks, including standard, cross-domain, and fine-grained few-shot learning scenarios.
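A small sketch of the geometric idea behind CGA: the (kernelized) volume of the parallelotope spanned by three embeddings can be computed from the determinant of their Gram matrix. The RBF kernel and the loss reduction below are illustrative choices, not necessarily those of the paper.

```python
import torch

def rbf_gram(x, gamma=1.0):
    """x: (B, 3, D) stacked text / support / synthetic embeddings per episode.
    Returns the (B, 3, 3) RBF-kernel Gram matrix."""
    sq_dists = torch.cdist(x, x, p=2) ** 2
    return torch.exp(-gamma * sq_dists)

def parallelotope_volume_loss(text_emb, support_emb, synth_emb, gamma=1.0, eps=1e-6):
    """Kernelized volume of the 3-D parallelotope spanned by the three representations.
    Minimizing it pulls the three modality embeddings towards a consistent subspace."""
    x = torch.stack([text_emb, support_emb, synth_emb], dim=1)   # (B, 3, D)
    gram = rbf_gram(x, gamma)                                     # (B, 3, 3)
    vol_sq = torch.det(gram).clamp_min(eps)                       # squared volume per episode
    return vol_sq.sqrt().mean()

# loss = parallelotope_volume_loss(t, s, g)  # each of shape (B, D)
```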






Mask What Matters: Controllable Text-Guided Masking for Self-Supervised Medical Image Analysis
Authors:Ruilang Wang, Shuotong Xu, Bowen Liu, Runlin Huang, Donglong Chen, Weifeng Su
The scarcity of annotated data in specialized domains such as medical imaging presents significant challenges to training robust vision models. While self-supervised masked image modeling (MIM) offers a promising solution, existing approaches largely rely on random high-ratio masking, leading to inefficiency and poor semantic alignment. Moreover, region-aware variants typically depend on reconstruction heuristics or supervised signals, limiting their adaptability across tasks and modalities. We propose Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. By leveraging vision-language models for prompt-based region localization, our method flexibly applies differentiated masking to emphasize diagnostically relevant regions while reducing redundancy in background areas. This controllable design enables better semantic alignment, improved representation learning, and stronger cross-task generalizability. Comprehensive evaluation across multiple medical imaging modalities, including brain MRI, chest CT, and lung X-ray, shows that Mask What Matters consistently outperforms existing MIM methods (e.g., SparK), achieving gains of up to +3.1 percentage points in classification accuracy, +1.3 in box average precision (BoxAP), and +1.1 in mask average precision (MaskAP) for detection. Notably, it achieves these improvements with substantially lower overall masking ratios (e.g., 40% vs. 70%). This work demonstrates that controllable, text-driven masking can enable semantically aligned self-supervised learning, advancing the development of robust vision models for medical image analysis.
Paper and project links
Summary
Scarce annotations in medical imaging make it hard to train robust vision models, and existing self-supervised masked image modeling (MIM) methods rely on random high-ratio masking, which is inefficient and poorly aligned with semantics. This paper proposes Mask What Matters, a controllable text-guided masking framework for self-supervised medical image analysis. It uses vision-language models for prompt-based region localization and applies differentiated masking that emphasizes diagnostically relevant regions while reducing redundancy in the background. The controllable design yields better semantic alignment, stronger representation learning, and better cross-task generalization. Across brain MRI, chest CT, and lung X-ray, Mask What Matters consistently outperforms existing MIM methods, with gains of up to +3.1 points in classification accuracy, +1.3 in BoxAP, and +1.1 in MaskAP for detection.
Key Takeaways
- The lack of annotated data in medical imaging makes training vision models challenging.
- Self-supervised masked image modeling (MIM) is a promising way to address this.
- Existing MIM methods rely mainly on random high-ratio masking, which is inefficient and poorly aligned with semantics.
- Mask What Matters uses a vision-language model for region localization to enable controllable, text-guided masking.
- The method emphasizes diagnostically relevant regions and reduces background redundancy, yielding better semantic alignment.
- Mask What Matters outperforms existing MIM methods across several medical imaging modalities while using a substantially lower overall masking ratio (e.g., 40% vs. 70%).
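Here is a minimal sketch of differentiated patch masking guided by a relevance map. In the paper the relevance would come from a prompt-based vision-language localizer; here it is simply an input tensor, and the foreground/background ratios and threshold are illustrative.

```python
import torch

def text_guided_patch_mask(relevance, fg_ratio=0.6, bg_ratio=0.2, threshold=0.5):
    """relevance: (B, N) per-patch relevance in [0, 1] from a prompt-based localizer.
    Returns a boolean (B, N) mask (True = patch is masked) that masks a larger fraction
    of diagnostically relevant patches and a smaller fraction of background patches."""
    B, N = relevance.shape
    is_fg = relevance >= threshold
    scores = torch.rand(B, N, device=relevance.device)          # random per-patch draw
    ratios = torch.where(is_fg, torch.full_like(scores, fg_ratio),
                         torch.full_like(scores, bg_ratio))
    return scores < ratios                                       # Bernoulli(ratio) per patch

# The overall masking ratio lands between bg_ratio and fg_ratio (e.g. ~40%),
# instead of a uniform 70% random mask.
# mask = text_guided_patch_mask(torch.rand(8, 196))
```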




ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Authors:Tom Burgert, Oliver Stoll, Paolo Rota, Begüm Demir
The hypothesis that Convolutional Neural Networks (CNNs) are inherently texture-biased has shaped much of the discourse on feature use in deep learning. We revisit this hypothesis by examining limitations in the cue-conflict experiment by Geirhos et al. To address these limitations, we propose a domain-agnostic framework that quantifies feature reliance through systematic suppression of shape, texture, and color cues, avoiding the confounds of forced-choice conflicts. By evaluating humans and neural networks under controlled suppression conditions, we find that CNNs are not inherently texture-biased but predominantly rely on local shape features. Nonetheless, this reliance can be substantially mitigated through modern training strategies or architectures (ConvNeXt, ViTs). We further extend the analysis across computer vision, medical imaging, and remote sensing, revealing that reliance patterns differ systematically: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models exhibit a stronger reliance on texture. Code is available at https://github.com/tomburgert/feature-reliance.
Paper and project links
PDF Accepted at NeurIPS 2025 (oral)
Summary
This paper revisits the claim that CNNs are inherently texture-biased and discusses limitations of the cue-conflict experiment behind it. The authors propose a domain-agnostic evaluation framework that quantifies feature reliance by systematically suppressing shape, texture, and color cues. They find that CNNs are not inherently texture-biased but rely mainly on local shape features, and that this reliance can be substantially reduced with modern training strategies or architectures such as ConvNeXt and ViTs. Extending the analysis, they show that reliance patterns differ across domains: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models rely more strongly on texture. Code is available on GitHub.
Key Takeaways
The paper's main points and findings:
- It revisits whether CNNs are inherently texture-biased and discusses the limitations of existing studies.
- It proposes an evaluation framework that quantifies reliance on shape, texture, and color cues through controlled suppression experiments.
- CNNs are found to rely mainly on local shape features rather than being inherently texture-biased.
- Modern training strategies or architectures (e.g., ConvNeXt and ViTs) substantially reduce reliance on specific features.
- Reliance patterns differ across domains: computer vision models prioritize shape, medical imaging models emphasize color, and remote sensing models rely more on texture.
- Results span computer vision, medical imaging, and remote sensing, demonstrating broad applicability.
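A rough sketch of the controlled-suppression idea: apply transforms that remove one cue at a time and compare the resulting accuracy drops. The particular transforms below (Gaussian blur for texture, grayscale for color, patch shuffling for shape) are common illustrative choices and may differ from the paper's exact suppression operators.

```python
import torch
from torchvision.transforms import functional as TF

def suppress_texture(img, kernel_size=15, sigma=7.0):
    """Blur away high-frequency texture while keeping global shape."""
    return TF.gaussian_blur(img, kernel_size=kernel_size, sigma=sigma)

def suppress_color(img):
    """Remove color cues but keep 3 channels for the model."""
    return TF.rgb_to_grayscale(img, num_output_channels=3)

def suppress_shape(img, grid=4):
    """Shuffle image patches to destroy global shape while keeping local texture/color."""
    b, c, h, w = img.shape
    ph, pw = h // grid, w // grid
    patches = img.unfold(2, ph, ph).unfold(3, pw, pw)           # (B, C, g, g, ph, pw)
    patches = patches.contiguous().view(b, c, grid * grid, ph, pw)
    patches = patches[:, :, torch.randperm(grid * grid)]
    patches = patches.view(b, c, grid, grid, ph, pw).permute(0, 1, 2, 4, 3, 5)
    return patches.contiguous().view(b, c, h, w)

@torch.no_grad()
def accuracy_drop(model, images, labels, suppress):
    """Reliance on a cue ~ how much accuracy drops when that cue is suppressed."""
    base = (model(images).argmax(1) == labels).float().mean()
    supp = (model(suppress(images)).argmax(1) == labels).float().mean()
    return (base - supp).item()
```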





Do We Need All the Synthetic Data? Targeted Synthetic Image Augmentation via Diffusion Models
Authors:Dang Nguyen, Jiping Li, Jinghao Zheng, Baharan Mirzasoleiman
Synthetically augmenting training datasets with diffusion models has been an effective strategy for improving generalization of image classifiers. However, existing techniques struggle to ensure the diversity of generation and increase the size of the data by up to 10-30x to improve the in-distribution performance. In this work, we show that synthetically augmenting part of the data that is not learned early in training with faithful images-containing same features but different noise-outperforms augmenting the entire dataset. By analyzing a two-layer CNN, we prove that this strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Our extensive experiments show that by augmenting only 30%-40% of the data, our method boosts generalization by up to 2.8% in a variety of scenarios, including training ResNet, ViT, ConvNeXt, and Swin Transformer on CIFAR-10/100, and TinyImageNet, with various optimizers including SGD and SAM. Notably, our method applied with SGD outperforms the SOTA optimizer, SAM, on CIFAR-100 and TinyImageNet.
Paper and project links
Summary
Synthetically augmenting training data with diffusion models is an effective way to improve the generalization of image classifiers, but existing techniques struggle with generation diversity and inflate the dataset size. This work proposes augmenting only the part of the data that is not learned early in training, using faithful images that contain the same features but different noise, which outperforms augmenting the entire dataset. An analysis of a two-layer CNN shows that the strategy improves generalization by promoting homogeneity in feature learning speed without amplifying noise. Augmenting only 30%-40% of the data boosts generalization by up to 2.8% across a variety of scenarios.
Key Takeaways
- Diffusion models are used to synthetically augment training data and improve classifier generalization.
- Existing augmentation techniques struggle with generation diversity and require large increases in data size.
- The proposed strategy augments only the part of the data that is not learned early in training.
- Adding faithful images with the same features but different noise outperforms augmenting the whole dataset.
- The strategy promotes homogeneity in feature learning speed without amplifying noise.
- Augmenting only 30%-40% of the data significantly improves generalization across many scenarios.
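One simple way to pick the "not learned early" subset is to track per-example loss over the first few epochs and flag the slowest-learned 30-40% for synthetic augmentation. The sketch below illustrates that selection; the fraction and selection rule are illustrative, and the diffusion-generation step is left as a placeholder.

```python
import torch

@torch.no_grad()
def per_example_loss(model, loader, device="cpu"):
    """Return a tensor of per-example losses, ordered by dataset index."""
    model.eval()
    losses = {}
    for images, labels, idxs in loader:            # loader must also yield dataset indices
        logits = model(images.to(device))
        loss = torch.nn.functional.cross_entropy(
            logits, labels.to(device), reduction="none")
        for i, l in zip(idxs.tolist(), loss.cpu().tolist()):
            losses[i] = l
    return torch.tensor([losses[i] for i in sorted(losses)])

def select_for_augmentation(early_epoch_losses, fraction=0.35):
    """early_epoch_losses: (E, N) losses recorded over the first E epochs.
    Flags the `fraction` of examples with the highest average early loss,
    i.e. the ones not learned early, as targets for diffusion-based augmentation."""
    mean_loss = early_epoch_losses.mean(dim=0)
    k = int(fraction * mean_loss.numel())
    return torch.topk(mean_loss, k).indices        # dataset indices to augment

# For each selected index, generate faithful variants (same features, different noise)
# with a diffusion model, e.g. an image-to-image pipeline, and add them to the train set.
```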


AutoMiSeg: Automatic Medical Image Segmentation via Test-Time Adaptation of Foundation Models
Authors:Xingjian Li, Qifeng Wu, Adithya S. Ubaradka, Yiran Ding, Colleen Que, Runmin Jiang, Jianhua Xing, Tianyang Wang, Min Xu
Medical image segmentation is vital for clinical diagnosis, yet current deep learning methods often demand extensive expert effort, i.e., either through annotating large training datasets or providing prompts at inference time for each new case. This paper introduces a zero-shot and automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., “segment the optic disc in an eye fundus image”), our method uses a grounding model to generate an initial bounding box, followed by a visual prompt boosting module that enhance the prompts, which are then processed by a promptable segmentation model to produce the final mask. To address the challenges of domain gap and result verification, we introduce a test-time adaptation framework featuring a set of learnable adaptors that align the medical inputs with foundation model representations. Its hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model without requiring ground-truth labels. Our pipeline offers an annotation-efficient and scalable solution for zero-shot medical image segmentation across diverse tasks. Our pipeline is evaluated on seven diverse medical imaging datasets and shows promising results. By proper decomposition and test-time adaptation, our fully automatic pipeline not only substantially surpasses the previously best-performing method, yielding a 69% relative improvement in accuracy (Dice Score from 42.53 to 71.81), but also performs competitively with weakly-prompted interactive foundation models.
Paper and project links
Summary
Medical image segmentation is vital for clinical diagnosis, but current deep learning methods demand extensive expert effort, either to annotate large training sets or to provide prompts for each new case at inference time. This paper introduces a zero-shot, automatic segmentation pipeline that combines off-the-shelf vision-language and segmentation foundation models. Given a medical image and a task definition (e.g., "segment the optic disc in an eye fundus image"), a grounding model produces an initial bounding box, a visual prompt boosting module enhances the prompts, and a promptable segmentation model produces the final mask. To handle the domain gap and result verification, a test-time adaptation framework with learnable adaptors aligns medical inputs with foundation model representations; its hyperparameters are optimized with Bayesian Optimization guided by a proxy validation model, without ground-truth labels. Evaluated on seven diverse medical imaging datasets, the fully automatic pipeline substantially surpasses the previous best method (a 69% relative improvement, Dice from 42.53 to 71.81) and is competitive with weakly-prompted interactive foundation models.
Key Takeaways
- The paper proposes a zero-shot automatic medical image segmentation pipeline that combines vision-language and segmentation foundation models.
- A grounding model generates an initial bounding box, a visual prompt boosting module enhances the prompts, and a promptable segmentation model produces the final mask.
- To address the domain gap and result verification, the pipeline adds a test-time adaptation framework with learnable adaptors.
- Hyperparameters are optimized via Bayesian Optimization, guided by a proxy validation model that requires no ground-truth labels.
- Evaluated on seven medical imaging datasets, the pipeline clearly improves on the previous best method.
- With proper decomposition and test-time adaptation, it also performs competitively with weakly-prompted interactive foundation models.




CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Authors:Cristiano Patrício, Isabel Rio-Torto, Jaime S. Cardoso, Luís F. Teixeira, João C. Neves
The main challenges limiting the adoption of deep learning-based solutions in medical workflows are the availability of annotated data and the lack of interpretability of such systems. Concept Bottleneck Models (CBMs) tackle the latter by constraining the model output on a set of predefined and human-interpretable concepts. However, the increased interpretability achieved through these concept-based explanations implies a higher annotation burden. Moreover, if a new concept needs to be added, the whole system needs to be retrained. Inspired by the remarkable performance shown by Large Vision-Language Models (LVLMs) in few-shot settings, we propose a simple, yet effective, methodology, CBVLM, which tackles both of the aforementioned challenges. First, for each concept, we prompt the LVLM to answer if the concept is present in the input image. Then, we ask the LVLM to classify the image based on the previous concept predictions. Moreover, in both stages, we incorporate a retrieval module responsible for selecting the best examples for in-context learning. By grounding the final diagnosis on the predicted concepts, we ensure explainability, and by leveraging the few-shot capabilities of LVLMs, we drastically lower the annotation cost. We validate our approach with extensive experiments across four medical datasets and twelve LVLMs (both generic and medical) and show that CBVLM consistently outperforms CBMs and task-specific supervised methods without requiring any training and using just a few annotated examples. More information on our project page: https://cristianopatricio.github.io/CBVLM/.
Paper and project links
PDF Accepted for publication in Computers in Biology and Medicine
Summary
This paper proposes CBVLM, a method that tackles two major obstacles to deep learning in medical workflows: the cost of annotation and the lack of interpretability. By exploiting the few-shot capabilities of Large Vision-Language Models (LVLMs), CBVLM improves interpretability while drastically lowering annotation cost. The method first prompts the LVLM to predict whether each predefined concept is present, then asks it to classify the image based on those concept predictions, with a retrieval module selecting the best examples for in-context learning at both stages. Experiments show that CBVLM consistently outperforms Concept Bottleneck Models (CBMs) and task-specific supervised methods on four medical datasets, without any training and using only a few annotated examples.
Key Takeaways
- CBVLM targets two key problems of deep learning in medicine: annotated data availability and system interpretability.
- It leverages the few-shot capabilities of Large Vision-Language Models (LVLMs) to improve interpretability and reduce annotation cost.
- CBVLM first predicts concepts and then classifies the image based on those concept predictions, grounding the final diagnosis.
- A retrieval module selects the best examples for in-context learning, further improving accuracy.
- CBVLM consistently outperforms Concept Bottleneck Models (CBMs) and task-specific supervised methods on four medical datasets.
- The method requires no training and only a few annotated examples, making it flexible and efficient in practice.
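A schematic sketch of the two-stage prompting flow described above follows. `ask_lvlm` is a placeholder for whatever LVLM API is used, and the concept list, class list and prompt wording are illustrative rather than the paper's exact prompts.

```python
from typing import Callable, Dict, List, Optional

def cbvlm_diagnose(image, concepts: List[str], classes: List[str],
                   ask_lvlm: Callable[[object, str], str],
                   demos: Optional[List[str]] = None) -> Dict[str, object]:
    """Two-stage, training-free classification grounded on concept predictions."""
    context = "\n".join(demos or [])               # in-context examples from a retrieval module

    # Stage 1: ask about each human-interpretable concept individually.
    concept_preds = {}
    for concept in concepts:
        prompt = (f"{context}\nQuestion: Is the concept '{concept}' present "
                  f"in this image? Answer yes or no.")
        concept_preds[concept] = ask_lvlm(image, prompt).strip().lower().startswith("yes")

    # Stage 2: classify the image conditioned on the predicted concepts.
    findings = ", ".join(f"{c}: {'present' if p else 'absent'}"
                         for c, p in concept_preds.items())
    prompt = (f"{context}\nGiven these findings ({findings}), classify the image as one of "
              f"{classes}. Answer with the class name only.")
    label = ask_lvlm(image, prompt).strip()
    return {"concepts": concept_preds, "label": label}
```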




Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine
Authors:Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, Yehui Yang
In recent years, Multimodal Large Language Models (MLLM) have achieved notable advancements, demonstrating the feasibility of developing an intelligent biomedical assistant. However, current biomedical MLLMs predominantly focus on image-level understanding and restrict interactions to textual commands, thus limiting their capability boundaries and the flexibility of usage. In this paper, we introduce a novel end-to-end multimodal large language model for the biomedical domain, named MedPLIB, which possesses pixel-level understanding. Excitingly, it supports visual question answering (VQA), arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. We propose a novel Mixture-of-Experts (MoE) multi-stage training strategy, which divides MoE into separate training phases for a visual-language expert model and a pixel-grounding expert model, followed by fine-tuning using MoE. This strategy effectively coordinates multitask learning while maintaining the computational cost at inference equivalent to that of a single expert model. To advance the research of biomedical MLLMs, we introduce the Medical Complex Vision Question Answering Dataset (MeCoVQA), which comprises an array of 8 modalities for complex medical imaging question answering and image region understanding. Experimental results indicate that MedPLIB has achieved state-of-the-art outcomes across multiple medical visual language tasks. More importantly, in zero-shot evaluations for the pixel grounding task, MedPLIB leads the best small and large models by margins of 19.7 and 15.6 respectively on the mDice metric. The codes, data, and model checkpoints will be made publicly available at https://github.com/ShawnHuang497/MedPLIB.
Paper and project links
PDF Accepted by AAAI2025
Summary
This paper introduces MedPLIB, an end-to-end multimodal large language model for the biomedical domain with pixel-level understanding. It supports visual question answering, arbitrary pixel-level prompts (points, bounding boxes, and free-form shapes), and pixel-level grounding. A Mixture-of-Experts (MoE) multi-stage training strategy trains a visual-language expert and a pixel-grounding expert separately before fine-tuning with MoE, coordinating multitask learning while keeping inference cost comparable to a single expert model. The authors also release the Medical Complex Vision Question Answering dataset (MeCoVQA). Experiments show that MedPLIB achieves state-of-the-art results on multiple medical visual-language tasks and leads the best small and large models in zero-shot pixel grounding.
Key Takeaways
- MedPLIB is a biomedical multimodal large language model with pixel-level understanding.
- It supports visual question answering, arbitrary pixel-level prompts, and pixel-level grounding.
- A novel MoE multi-stage training strategy effectively coordinates multitask learning.
- MedPLIB achieves state-of-the-art results on multiple medical visual-language tasks.
- The MeCoVQA dataset, covering 8 modalities, is introduced for complex medical imaging question answering and image region understanding.
- MedPLIB leads zero-shot evaluation of the pixel grounding task, by 19.7 and 15.6 mDice over the best small and large models, respectively.
- Code, data, and model checkpoints will be made publicly available.





