Published: 2025-09-12

Updated: 2025-10-07


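⚠️ All of the summaries below were generated by a large language model. They may contain errors; treat them as reference only and use them with caution.
🔴 Note: do not use these summaries in serious academic settings. They are suitable only for initial screening before reading the papers.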

2025-09-12 Update

XBusNet: Text-Guided Breast Ultrasound Segmentation via Multimodal Vision-Language Learning

Authors: Raja Mallina, Bryar Shareef

Background: Precise breast ultrasound (BUS) segmentation supports reliable measurement, quantitative analysis, and downstream classification, yet remains difficult for small or low-contrast lesions with fuzzy margins and speckle noise. Text prompts can add clinical context, but directly applying weakly localized text-image cues (e.g., CAM/CLIP-derived signals) tends to produce coarse, blob-like responses that smear boundaries unless additional mechanisms recover fine edges. Methods: We propose XBusNet, a novel dual-prompt, dual-branch multimodal model that combines image features with clinically grounded text. A global pathway based on a CLIP Vision Transformer encodes whole-image semantics conditioned on lesion size and location, while a local U-Net pathway emphasizes precise boundaries and is modulated by prompts that describe shape, margin, and Breast Imaging Reporting and Data System (BI-RADS) terms. Prompts are assembled automatically from structured metadata, requiring no manual clicks. We evaluate on the Breast Lesions USG (BLU) dataset using five-fold cross-validation. Primary metrics are Dice and Intersection over Union (IoU); we also conduct size-stratified analyses and ablations to assess the roles of the global and local paths and the text-driven modulation. Results: XBusNet achieves state-of-the-art performance on BLU, with mean Dice of 0.8765 and IoU of 0.8149, outperforming six strong baselines. Small lesions show the largest gains, with fewer missed regions and fewer spurious activations. Ablation studies show complementary contributions of global context, local boundary modeling, and prompt-based modulation. Conclusions: A dual-prompt, dual-branch multimodal design that merges global semantics with local precision yields accurate BUS segmentation masks and improves robustness for small, low-contrast lesions.


Paper and project links

PDF 15 pages, 3 figures, 4 tables

Summary

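This paper proposes XBusNet, a dual-prompt, dual-branch multimodal model that combines image features with clinical text for precise breast ultrasound (BUS) segmentation. The model has a global pathway and a local pathway: the former uses a CLIP Vision Transformer to encode whole-image semantics, while the latter is U-Net-based, emphasizes precise boundaries, and is modulated by prompts describing shape, margin, and BI-RADS terms. Prompts are assembled automatically from structured metadata, with no manual clicks required. Evaluated on the Breast Lesions USG (BLU) dataset, XBusNet achieves state-of-the-art performance, with a mean Dice of 0.8765 and a mean IoU of 0.8149.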
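For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Dice and IoU are commonly computed for a pair of binary masks. It is not the authors' evaluation code: the function name `dice_and_iou`, the nested-list mask format, and the convention of scoring two empty masks as a perfect match are all assumptions made for illustration.

```python
def dice_and_iou(pred, target):
    """Return (Dice, IoU) for two binary masks given as equal-sized nested lists of 0/1."""
    flat_pred = [int(bool(v)) for row in pred for v in row]
    flat_target = [int(bool(v)) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels that are foreground in both masks.
    intersection = sum(p & t for p, t in zip(flat_pred, flat_target))
    pred_area = sum(flat_pred)
    target_area = sum(flat_target)
    union = pred_area + target_area - intersection

    # Assumed convention: two empty masks count as a perfect match.
    if union == 0:
        return 1.0, 1.0

    dice = 2.0 * intersection / (pred_area + target_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: each has 3 foreground pixels, and 2 of them overlap.
pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
dice, iou = dice_and_iou(pred, target)
print(f"Dice={dice:.4f} IoU={iou:.4f}")  # Dice=0.6667 IoU=0.5000
```

Both metrics reward overlap between the predicted and reference masks; Dice weights the intersection twice, so for a single mask pair it is never lower than IoU.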

Key Takeaways

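XBusNet is a dual-prompt, dual-branch multimodal model that combines image features with clinical text.
The global pathway uses a CLIP Vision Transformer to encode whole-image semantics; the local pathway emphasizes precise boundaries and is modulated by prompts describing shape, margin, and BI-RADS terms.
Prompts are assembled automatically from structured metadata, with no manual clicks required.
XBusNet achieves state-of-the-art performance on the Breast Lesions USG (BLU) dataset.
The largest gains appear on small lesions, and robustness improves for small, low-contrast lesions.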
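The abstract says prompts are built automatically from structured metadata, with size and location conditioning the global pathway and shape, margin, and BI-RADS terms modulating the local pathway. A rough sketch of what such assembly could look like follows. The field names, value vocabularies, and sentence templates are hypothetical; the digest does not give the paper's actual prompt wording.

```python
from dataclasses import dataclass


@dataclass
class LesionMetadata:
    # All field names and example values are hypothetical illustrations.
    size: str      # e.g. "small"
    location: str  # e.g. "upper outer quadrant"
    shape: str     # e.g. "oval"
    margin: str    # e.g. "circumscribed"
    birads: str    # e.g. "3"


def build_prompts(meta):
    """Return (global_prompt, local_prompt) assembled from structured metadata.

    The global prompt carries size and location; the local prompt carries
    shape, margin, and BI-RADS terms. The templates are assumptions.
    """
    global_prompt = f"a {meta.size} breast lesion in the {meta.location}"
    local_prompt = (
        f"{meta.shape} shape, {meta.margin} margin, BI-RADS {meta.birads}"
    )
    return global_prompt, local_prompt


meta = LesionMetadata(
    size="small",
    location="upper outer quadrant",
    shape="oval",
    margin="circumscribed",
    birads="3",
)
global_prompt, local_prompt = build_prompts(meta)
print(global_prompt)  # a small breast lesion in the upper outer quadrant
print(local_prompt)   # oval shape, circumscribed margin, BI-RADS 3
```

Because the prompts come from fields that already exist in the metadata, no user interaction such as clicks or boxes is needed at inference time.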


Grounding DINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models

Authors: Hamza Rasaee, Taha Koleilat, Hassan Rivaz

Accurate and generalizable object segmentation in ultrasound imaging remains a significant challenge due to anatomical variability, diverse imaging protocols, and limited annotated data. In this study, we propose a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model2) to enable object segmentation across multiple ultrasound organs. A total of 18 public ultrasound datasets, encompassing the breast, thyroid, liver, prostate, kidney, and paraspinal muscle, were utilized. These datasets were divided into 15 for fine-tuning and validation of Grounding DINO using Low Rank Adaptation (LoRA) to the ultrasound domain, and 3 were held out entirely for testing to evaluate performance in unseen distributions. Comprehensive experiments demonstrate that our approach outperforms state-of-the-art segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS on most seen datasets while maintaining strong performance on unseen datasets without additional fine-tuning. These results underscore the promise of VLMs in scalable and robust ultrasound image analysis, reducing dependence on large, organ-specific annotated datasets. We will publish our code on code.sonography.ai after acceptance.


Paper and project links

PDF 11 pages, 3 figures, 7 tables

Summary

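This study addresses accurate and generalizable object segmentation in ultrasound imaging. It proposes a prompt-driven vision-language model (VLM) that integrates Grounding DINO with SAM2 (Segment Anything Model 2) to segment objects across multiple organs. The study uses 18 public ultrasound datasets covering the breast, thyroid, liver, prostate, kidney, and paraspinal muscle. Fifteen datasets are used to fine-tune and validate Grounding DINO, which is adapted to the ultrasound domain with Low-Rank Adaptation (LoRA); the remaining three are held out entirely for testing on unseen distributions. Experiments show that the method outperforms recent segmentation methods, including UniverSeg, MedSAM, MedCLIP-SAM, BiomedParse, and SAMUS, on most seen datasets, and maintains strong performance on unseen datasets without additional fine-tuning. These results point to the potential of VLMs for scalable and robust ultrasound image analysis and reduce the dependence on large, organ-specific annotated datasets.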
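As background on the adaptation technique named above, here is a minimal sketch of the general LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank update, giving W + (alpha / r) * B A. This illustrates the standard formulation only. The rank, scaling factor, and choice of adapted layers in Grounding DINO are not stated in this digest, and the helper names `matmul` and `lora_effective_weight` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the rank (number of rows of A).

    w: frozen weight, shape d_out x d_in
    a: trainable factor, shape r x d_in
    b: trainable factor, shape d_out x r
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


w = [[1.0, 0.0],
     [0.0, 1.0]]   # frozen pretrained weight (toy 2x2 example)
a = [[1.0, 2.0]]   # rank r = 1
b_init = [[0.0],
          [0.0]]   # B starts at zero, so the update is initially a no-op

print(lora_effective_weight(w, a, b_init, alpha=2.0))
# [[1.0, 0.0], [0.0, 1.0]]

b_trained = [[1.0],
             [0.0]]  # stand-in for values learned during fine-tuning
print(lora_effective_weight(w, a, b_trained, alpha=2.0))
# [[3.0, 4.0], [0.0, 1.0]]
```

Only A and B are trained, which amounts to r * (d_in + d_out) parameters per adapted matrix instead of d_in * d_out. That parameter saving is what makes adapting a large pretrained model to a new domain comparatively cheap.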

Key Takeaways