
Medical Imaging


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: never use these summaries in serious academic settings; they are only meant for initial screening before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-09-28

Modelling the effect of stellar metallicity on the XUV evolution of low-mass stars and its impact on exoplanet atmospheres/habitability

Authors:Victor See, Charlotte Fairman, Louis Amard, Oliver Hall

Understanding how exoplanet atmospheres evolve is a key question in the context of habitability. One key process governing this evolution is atmospheric evaporation by stellar X-ray and EUV emission (collectively, XUV). As such, the evolution of exoplanet atmospheres is closely tied to the evolution of the host star’s magnetic activity. Many studies have modelled the combined evolution of exoplanet atmospheres and their host stars. However, to date, the impact of the host star’s metallicity on stellar activity/exoplanet atmosphere evolution has not been explored. In this work, we investigate how stellar metallicity affects the rotation and activity evolution of solar-like stars as well as the corresponding exoplanet atmospheric evolution. We reconfirm previous results that metal-rich stars spin down more rapidly than metal-poor stars. We also find that the XUV flux that an exoplanet in the habitable zone of its host star receives is larger when the host star is more metal-rich. As such, the atmospheres of exoplanets in the habitable zones of metal-rich stars are evaporated more rapidly than exoplanets in the habitable zones of metal-poor stars. Lastly, we find that the atmospheric evolution is most sensitive to the host star metallicity when the host star has a higher mass. In the highest-mass solar-like stars, the metallicity can have a larger influence on the atmospheric evolution than the initial rotation period of the star.


Paper & Project Links

PDF 12 pages, 7 figures, accepted for publication in MNRAS

Summary
This study investigates how host-star metallicity affects the rotation and activity evolution of solar-like stars and the corresponding evolution of exoplanet atmospheres. It finds that metal-rich stars spin down more rapidly, and that exoplanets in the habitable zones of metal-rich hosts receive more XUV radiation, so their atmospheres evaporate faster. Moreover, for higher-mass host stars, metallicity can influence atmospheric evolution more than the initial rotation period does.

Key Takeaways

  1. Host-star metallicity affects the rotation and activity evolution of solar-like stars.
  2. Metal-rich stars spin down more rapidly.
  3. Exoplanets in the habitable zones of metal-rich host stars receive more XUV radiation.
  4. The atmospheres of habitable-zone exoplanets around metal-rich hosts evaporate more quickly.
  5. The higher the host star's mass, the larger the influence of its metallicity on atmospheric evolution.
  6. In some cases, metallicity can influence atmospheric evolution more than the host star's initial rotation period.

Cool Papers

Click here to view paper screenshots

A sample of 3403 galaxy clusters identified in XMM-Newton X-ray images

Authors:Z. S. Yuan, Z. L. Wen, W. Xu, J. L. Han

Currently, the number of galaxy clusters identified using galaxy data has far exceeded the number derived from intracluster medium data. In this study, we used positional information from large optical cluster catalogues to search for previously unrecognized X-ray galaxy clusters in archival XMM-Newton data. We successfully identified 1490 galaxy clusters in X-ray images for the first time. By incorporating 1913 previously known X-ray clusters, we constructed a sample of 3403 galaxy clusters observed by XMM-Newton. Our cluster mass estimates show broad consistency with previous measurements. Comparative analyses between the known and newly identified subsamples revealed that new X-ray clusters exhibit systematically higher redshifts, lower masses, and smaller X-ray-to-optical mass ratios, but show no systematic differences in dynamical properties. The newly identified X-ray clusters are a valuable addition to previous X-ray samples and are important for future statistical studies.


Paper & Project Links

PDF 10 pages, 3 figures, 2 tables. Accepted for publication in MNRAS

Summary
Using positional information from large optical cluster catalogues, this study searches archival XMM-Newton data for previously unrecognized X-ray galaxy clusters and identifies 1490 clusters in X-ray images for the first time. Combined with 1913 previously known X-ray clusters, this yields a sample of 3403 galaxy clusters observed by XMM-Newton. The newly identified X-ray clusters are a valuable addition to existing X-ray samples and are important for future statistical studies.

Key Takeaways

  1. Using positions from optical cluster catalogues, the study identifies 1490 new X-ray galaxy clusters in archival XMM-Newton data.
  2. Combined with 1913 previously known X-ray clusters, the sample contains 3403 galaxy clusters.
  3. The newly identified X-ray clusters show systematically higher redshifts, lower masses, and smaller X-ray-to-optical mass ratios.
  4. The newly identified clusters show no systematic differences in dynamical properties from known clusters.
  5. The newly identified X-ray clusters are a valuable addition to existing X-ray samples.
  6. These new clusters are important for future statistical studies.

Cool Papers

Click here to view paper screenshots

4D Computational Ultrasound Imaging of Carotid Artery Flow

Authors:Yuyang Hu, Michael Brown, Didem Dogan, Mahé Bulot, Maxime Cheppe, Guillaume Ferin, Geert Leus, Antonius F. W. van der Steen, Pieter Kruizinga, Johannes G. Bosch

Computational ultrasound imaging (cUSi) with few elements and spatial field encoding can provide high-resolution volumetric B-mode imaging. In this work, we extend its application to 4D carotid artery (CA) flow imaging using a custom large-aperture 240-element matrix probe. We implemented a frequency band-based matched filtering strategy that balances resolution and contrast. The system’s inherent imaging capabilities were evaluated and validated in flow phantom and human CA experiments. In the phantom study, 3D/4D power Doppler image and speckle-tracking analyses confirmed the system’s ability to resolve flow structures and hemodynamics. In the human study, the CA bifurcation flow structure and its local pulsatile flow dynamics were successfully reconstructed. These results demonstrate the feasibility of using a large-footprint, few-element cUSi system for 4D CA flow assessment.

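The frequency band-based matched filtering mentioned above can be illustrated with a toy sketch: band-pass both the received echo and the known reference waveform, then correlate them within that band, so that wide bands favour resolution and narrow bands favour contrast. This is a minimal single-channel illustration with synthetic signals and made-up band edges, not the authors' implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, correlate

fs = 40e6                            # sampling rate (Hz), illustrative value
t = np.arange(0, 20e-6, 1 / fs)      # 20 microseconds of signal

# Synthetic reference waveform and a received echo: the reference delayed and buried in noise.
rng = np.random.default_rng(0)
ref = np.sin(2 * np.pi * 3e6 * t) * np.exp(-((t - 5e-6) ** 2) / (1e-6) ** 2)
rx = np.roll(ref, int(3e-6 * fs)) + 0.5 * rng.standard_normal(t.size)

def band_matched_filter(rx, ref, band, fs):
    """Band-limit both signals, then matched-filter (correlate) within that band."""
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    return correlate(sosfiltfilt(sos, rx), sosfiltfilt(sos, ref), mode="same")

# Narrow band favours contrast; wide band favours resolution (illustrative edges).
out_narrow = band_matched_filter(rx, ref, (2.5e6, 3.5e6), fs)
out_wide = band_matched_filter(rx, ref, (1e6, 6e6), fs)
print("narrow-band peak index:", np.argmax(np.abs(out_narrow)))
print("wide-band   peak index:", np.argmax(np.abs(out_wide)))
```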

Paper & Project Links

PDF The submission consists of an 11-page manuscript with 5 figures, followed by 2 pages of supplemental files

Summary
Computational ultrasound imaging (cUSi) with few elements and spatial field encoding can provide high-resolution volumetric B-mode imaging. This work extends it to 4D carotid artery (CA) flow imaging using a custom large-aperture 240-element matrix probe, with a frequency band-based matched-filtering strategy that balances resolution and contrast. The system's inherent imaging capabilities were evaluated and validated in flow-phantom and human carotid experiments: 3D/4D power Doppler imaging and speckle-tracking analyses confirmed its ability to resolve flow structures and hemodynamics in the phantom, and the carotid bifurcation flow structure and its local pulsatile flow dynamics were successfully reconstructed in the human study. These results demonstrate the feasibility of a large-footprint, few-element cUSi system for 4D carotid flow assessment.

Key Takeaways

  1. Computational ultrasound imaging (cUSi) achieves high-resolution volumetric B-mode imaging with few elements and spatial field encoding.
  2. The study extends cUSi to 4D carotid artery (CA) flow imaging.
  3. A custom large-aperture 240-element matrix probe is used for 4D carotid flow imaging.
  4. A frequency band-based matched-filtering strategy balances resolution and contrast.
  5. Flow-phantom and human carotid experiments validate the system's inherent imaging capabilities.
  6. The phantom study confirms that the system can resolve flow structures and hemodynamics.

Cool Papers

Click here to view paper screenshots

SwinMamba: A hybrid local-global mamba framework for enhancing semantic segmentation of remotely sensed images

Authors:Qinfeng Zhu, Han Li, Liang He, Lei Fan

Semantic segmentation of remote sensing imagery is a fundamental task in computer vision, supporting a wide range of applications such as land use classification, urban planning, and environmental monitoring. However, this task is often challenged by the high spatial resolution, complex scene structures, and diverse object scales present in remote sensing data. To address these challenges, various deep learning architectures have been proposed, including convolutional neural networks, Vision Transformers, and the recently introduced Vision Mamba. Vision Mamba features a global receptive field and low computational complexity, demonstrating both efficiency and effectiveness in image segmentation. However, its reliance on global scanning tends to overlook critical local features, such as textures and edges, which are essential for achieving accurate segmentation in remote sensing contexts. To tackle this limitation, we propose SwinMamba, a novel framework inspired by the Swin Transformer. SwinMamba integrates localized Mamba-style scanning within shifted windows with a global receptive field, to enhance the model’s perception of both local and global features. Specifically, the first two stages of SwinMamba perform local scanning to capture fine-grained details, while its subsequent two stages leverage global scanning to fuse broader contextual information. In our model, the use of overlapping shifted windows enhances inter-region information exchange, facilitating more robust feature integration across the entire image. Extensive experiments on the LoveDA and ISPRS Potsdam datasets demonstrate that SwinMamba outperforms state-of-the-art methods, underscoring its effectiveness and potential as a superior solution for semantic segmentation of remotely sensed imagery.

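As a rough illustration of the local-versus-global scanning idea (our reconstruction from the abstract, not the released model), the sketch below flattens a feature map into token sequences two ways: per-window sequences over shifted windows for the early, local stages, and one raster-order sequence for the later, global stages. A sequence model such as a Mamba block would then consume these sequences.

```python
import torch

def window_scan(x, win=4, shift=0):
    """Flatten a feature map into per-window token sequences (local scan).

    x: (B, H, W, C) feature map; returns (B * num_windows, win*win, C).
    A nonzero shift displaces the window grid, as in Swin-style shifted windows.
    """
    if shift:
        x = torch.roll(x, shifts=(-shift, -shift), dims=(1, 2))
    B, H, W, C = x.shape
    x = x.view(B, H // win, win, W // win, win, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, C)

def global_scan(x):
    """Flatten the whole map into one raster-order sequence (global scan)."""
    B, H, W, C = x.shape
    return x.reshape(B, H * W, C)

feat = torch.randn(2, 16, 16, 32)                   # toy feature map
local_tokens = window_scan(feat, win=4, shift=2)    # early stages: local detail
global_tokens = global_scan(feat)                   # later stages: global context
print(local_tokens.shape, global_tokens.shape)      # (32, 16, 32) (2, 256, 32)
```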

Paper & Project Links

PDF

Summary

This paper reviews the challenges of semantic segmentation of remote sensing imagery and proposes SwinMamba, a model that fuses local and global features. By combining Mamba-style local scanning within shifted windows with a global receptive field, it improves both the accuracy and the efficiency of remote sensing image segmentation. Experiments on the LoveDA and ISPRS Potsdam datasets demonstrate SwinMamba's superior performance.

Key Takeaways

  1. Semantic segmentation of remote sensing imagery is a fundamental computer vision task with wide applications such as land use classification, urban planning, and environmental monitoring.
  2. The task is challenged by high spatial resolution, complex scene structures, and diverse object scales.
  3. Deep learning architectures such as convolutional neural networks, Vision Transformers, and Vision Mamba have been applied to these challenges.
  4. Vision Mamba offers a global receptive field and low computational complexity but overlooks critical local features in remote sensing segmentation.
  5. SwinMamba combines Mamba-style local scanning with a global receptive field to enhance the perception of both local and global features.
  6. SwinMamba uses overlapping shifted windows to strengthen inter-region information exchange and achieve more robust feature integration.

Cool Papers

Click here to view paper screenshots

Revolutionizing Precise Low Back Pain Diagnosis via Contrastive Learning

Authors:Thanh Binh Le, Hoang Nhat Khang Vo, Tan-Ha Mai, Trong Nhan Phan

Low back pain affects millions worldwide, driving the need for robust diagnostic models that can jointly analyze complex medical images and accompanying text reports. We present LumbarCLIP, a novel multimodal framework that leverages contrastive language-image pretraining to align lumbar spine MRI scans with corresponding radiological descriptions. Built upon a curated dataset containing axial MRI views paired with expert-written reports, LumbarCLIP integrates vision encoders (ResNet-50, Vision Transformer, Swin Transformer) with a BERT-based text encoder to extract dense representations. These are projected into a shared embedding space via learnable projection heads, configurable as linear or non-linear, and normalized to facilitate stable contrastive training using a soft CLIP loss. Our model achieves state-of-the-art performance on downstream classification, reaching up to 95.00% accuracy and 94.75% F1-score on the test set, despite inherent class imbalance. Extensive ablation studies demonstrate that linear projection heads yield more effective cross-modal alignment than non-linear variants. LumbarCLIP offers a promising foundation for automated musculoskeletal diagnosis and clinical decision support.

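The training objective described here is, at its core, a CLIP-style contrastive loss over projected image and text embeddings. Below is a minimal, hedged sketch of one common "soft CLIP" formulation, in which soft targets come from intra-modal similarities rather than hard one-hot labels; the paper's exact variant may differ.

```python
import torch
import torch.nn.functional as F

def soft_clip_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style contrastive loss with soft targets.

    img_emb, txt_emb: (N, D) embeddings already passed through the projection
    heads. Targets are softened using intra-modal similarity structure -- one
    common 'soft CLIP' formulation; the paper's exact variant may differ.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature            # (N, N) image-to-text similarities

    # Soft targets: average of image-image and text-text similarity structure.
    with torch.no_grad():
        sim = (img @ img.t() + txt @ txt.t()) / 2
        targets = F.softmax(sim / temperature, dim=-1)

    loss_i2t = F.cross_entropy(logits, targets)     # soft-label cross entropy
    loss_t2i = F.cross_entropy(logits.t(), targets.t())
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 paired embeddings from linear projection heads.
loss = soft_clip_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```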

Paper & Project Links

PDF 12 pages, 4 figures

Summary

LumbarCLIP is a novel multimodal diagnostic framework that jointly analyzes medical images and text reports. It uses contrastive language-image pretraining to align lumbar spine MRI scans with the corresponding radiological descriptions, integrating vision encoders with a BERT-based text encoder to extract dense representations that learnable projection heads map into a shared embedding space. The model achieves state-of-the-art downstream classification performance, reaching 95.00% accuracy and a 94.75% F1-score on the test set.

Key Takeaways

  1. LumbarCLIP is a multimodal framework for analyzing medical images and text reports.
  2. It uses contrastive language-image pretraining to align lumbar spine MRI scans with radiological descriptions.
  3. LumbarCLIP integrates vision encoders with a text encoder to extract dense representations.
  4. Learnable projection heads project the representations into a shared embedding space.
  5. The model excels at downstream classification, achieving high accuracy and F1-score.
  6. Linear projection heads are more effective than non-linear ones for cross-modal alignment.

Cool Papers

Click here to view paper screenshots

The X-ray Emission of NGC 5005: An Unobscured Low-Luminosity AGN with a Weakly Accreting Broad-Line Region

Authors:Anna Trindade Falcão, R. Middei, G. Fabbiano, M. Elvis, P. Zhu, W. P. Maksym, D. Ł. Król, L. Feuillet

We present deep Chandra X-ray observations of NGC 5005, a LINER-dominated galaxy previously reported to host a broad H$\alpha$ emission line. The diffuse soft X-ray emission ($<$3 keV) extends out to $\sim$800 pc, while harder emission ($>$3 keV) is confined to the central $\sim$400 pc. Spatially resolved spectroscopy of the nuclear ($r<150$ pc) and extended ($150<r<500$ pc) regions reveals that these are best described by models including both photoionized and thermal plasma components, consistent with excitation by a low-luminosity AGN and shock-heated gas. Narrow-band imaging and excitation maps from the Hubble Space Telescope (HST) support this interpretation, closely matching the X-ray morphology and ionization structure. The detection of a faint hard X-ray nuclear source with Chandra, combined with stringent upper limits from NuSTAR and Swift, and consistency with the X-ray luminosity predicted from the HST [O III]$\lambda$5007 emission, indicates that NGC 5005 hosts an intrinsically low-luminosity ($L_{\rm bol} \sim 10^{41}$ erg s$^{-1}$), unobscured AGN. Despite the extremely low Eddington ratio inferred from our measurements ($\lambda_{\rm Edd} \sim 5 \times 10^{-6}$), the presence of a broad H$\alpha$ line in the optical spectrum suggests the persistence of a thin accretion disk, challenging standard paradigms of accretion flow configurations at such low accretion rates.


Paper & Project Links

PDF Accepted for publication on ApJ

Summary

Deep Chandra X-ray observations of NGC 5005 reveal diffuse soft X-ray emission extending to ~800 pc, with harder emission confined to the central ~400 pc. Spatially resolved spectroscopy of the nuclear and extended regions shows they are best described by models including both photoionized and thermal plasma components, consistent with excitation by a low-luminosity AGN plus shock-heated gas. Narrow-band imaging and excitation maps from the Hubble Space Telescope (HST) support this interpretation, matching the X-ray morphology and ionization structure. The faint hard X-ray nuclear source detected with Chandra, consistent with stringent NuSTAR and Swift upper limits and with the X-ray luminosity predicted from the HST [O III]λ5007 emission, indicates that NGC 5005 hosts an intrinsically low-luminosity (L_bol ~ 10^41 erg s^-1), unobscured AGN. Despite the extremely low inferred Eddington ratio (λ_Edd ~ 5 × 10^-6), the broad Hα line in the optical spectrum suggests a persistent thin accretion disk, challenging standard paradigms of accretion flow configurations at such low accretion rates.

Key Takeaways

  1. Deep Chandra X-ray observations of NGC 5005 reveal both diffuse soft X-ray emission and more centrally concentrated hard emission.
  2. Spectra of the nuclear and extended regions indicate both photoionized and thermal plasma components.
  3. Hubble Space Telescope observations support excitation by a low-luminosity AGN plus shock-heated gas, matching the X-ray results.
  4. A faint hard X-ray nuclear source is detected, consistent with NGC 5005 hosting an unobscured AGN.
  5. The AGN is intrinsically low-luminosity yet coexists with optical features such as a broad Hα line.
  6. The extremely low inferred Eddington ratio challenges standard accretion-flow configurations.
  7. The results underscore the complexity of nuclear structure and activity, calling for deeper study of the underlying physics.

Cool Papers

Click here to view paper screenshots

InstructVTON: Optimal Auto-Masking and Natural-Language-Guided Interactive Style Control for Inpainting-Based Virtual Try-On

Authors:Julien Han, Shuwen Qiu, Qi Li, Xingzi Xu, Mehmet Saygin Seyfioglu, Kavosh Asadi, Karim Bouyarmane

We present InstructVTON, an instruction-following interactive virtual try-on system that allows fine-grained and complex styling control of the resulting generation, guided by natural language, on single or multiple garments. A computationally efficient and scalable formulation of virtual try-on formulates the problem as an image-guided or image-conditioned inpainting task. These inpainting-based virtual try-on models commonly use a binary mask to control the generation layout. Producing a mask that yields desirable result is difficult, requires background knowledge, might be model dependent, and in some cases impossible with the masking-based approach (e.g. trying on a long-sleeve shirt with “sleeves rolled up” styling on a person wearing long-sleeve shirt with sleeves down, where the mask will necessarily cover the entire sleeve). InstructVTON leverages Vision Language Models (VLMs) and image segmentation models for automated binary mask generation. These masks are generated based on user-provided images and free-text style instructions. InstructVTON simplifies the end-user experience by removing the necessity of a precisely drawn mask, and by automating execution of multiple rounds of image generation for try-on scenarios that cannot be achieved with masking-based virtual try-on models alone. We show that InstructVTON is interoperable with existing virtual try-on models to achieve state-of-the-art results with styling control.


Paper & Project Links

PDF Submitted to CVPR 2025 and Published at CVPR 2025 AI for Content Creation workshop

Summary

InstructVTON is an interactive virtual try-on system that supports fine-grained, complex styling control guided by natural language. It formulates virtual try-on as an image-guided or image-conditioned inpainting task, a computationally efficient and scalable formulation. To avoid the difficulty of hand-drawing the masks required by conventional inpainting-based try-on models, InstructVTON uses vision language models and image segmentation models to automatically generate binary masks from user-provided images and free-text style instructions. It simplifies the end-user experience by automating multiple rounds of image generation for try-on scenarios that masking-based models alone cannot achieve, and it interoperates with existing virtual try-on models to reach state-of-the-art results with styling control.

Key Takeaways

  1. InstructVTON is an interactive virtual try-on system supporting fine-grained, complex styling control.
  2. It removes the need for the precisely drawn masks required by conventional try-on models.
  3. InstructVTON uses vision language models and image segmentation models to generate masks automatically.
  4. The system automates multiple rounds of image generation to realize richer try-on scenarios.
  5. InstructVTON is interoperable with existing virtual try-on models and achieves state-of-the-art styling control.
  6. Guided by natural language, users can easily control the style and details of the generated results.

Cool Papers

Click here to view paper screenshots

Boosting LiDAR-Based Localization with Semantic Insight: Camera Projection versus Direct LiDAR Segmentation

Authors:Sven Ochs, Philip Schörner, Marc René Zofka, J. Marius Zöllner

Semantic segmentation of LiDAR data presents considerable challenges, particularly when dealing with diverse sensor types and configurations. However, incorporating semantic information can significantly enhance the accuracy and robustness of LiDAR-based localization techniques for autonomous mobile systems. We propose an approach that integrates semantic camera data with LiDAR segmentation to address this challenge. By projecting LiDAR points into the semantic segmentation space of the camera, our method enhances the precision and reliability of the LiDAR-based localization pipeline. For validation, we utilize the CoCar NextGen platform from the FZI Research Center for Information Technology, which offers diverse sensor modalities and configurations. The sensor setup of CoCar NextGen enables a thorough analysis of different sensor types. Our evaluation leverages the state-of-the-art Depth-Anything network for camera image segmentation and an adaptive segmentation network for LiDAR segmentation. To establish a reliable ground truth for LiDAR-based localization, we make use of a Global Navigation Satellite System (GNSS) solution with Real-Time Kinematic corrections (RTK). Additionally, we conduct an extensive 55 km drive through the city of Karlsruhe, Germany, covering a variety of environments, including urban areas, multi-lane roads, and rural highways. This multimodal approach paves the way for more reliable and precise autonomous navigation systems, particularly in complex real-world environments.

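The core operation, projecting LiDAR points into the camera's semantic segmentation space, reduces to an extrinsic transform plus a pinhole projection, after which each point inherits the class of the pixel it lands on. A minimal numpy sketch with made-up calibration values (not the CoCar NextGen calibration):

```python
import numpy as np

def lidar_to_semantic(points, T_cam_lidar, K, sem_map):
    """Assign each LiDAR point the semantic class of the camera pixel it projects to.

    points:       (N, 3) LiDAR xyz.
    T_cam_lidar:  (4, 4) extrinsic transform, LiDAR frame -> camera frame.
    K:            (3, 3) camera intrinsics.
    sem_map:      (H, W) per-pixel class ids from the image segmentation network.
    Returns (N,) class ids, -1 for points outside the image or behind the camera.
    """
    N = points.shape[0]
    pts_h = np.hstack([points, np.ones((N, 1))])          # homogeneous coords
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]            # into camera frame
    labels = np.full(N, -1, dtype=np.int64)
    front = pts_cam[:, 2] > 0                             # keep points in front of camera
    uvw = (K @ pts_cam[front].T).T
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)   # perspective divide
    H, W = sem_map.shape
    ok = (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    idx = np.flatnonzero(front)[ok]
    labels[idx] = sem_map[uv[ok, 1], uv[ok, 0]]
    return labels

# Toy usage: identity extrinsics and a dummy 10-class segmentation map.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
labels = lidar_to_semantic(np.random.rand(100, 3) * [10, 10, 30],
                           np.eye(4), K, np.random.randint(0, 10, (480, 640)))
```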

Paper & Project Links

PDF

Summary

This paper proposes integrating semantic camera data with LiDAR segmentation to address the challenges of semantically segmenting LiDAR data. By projecting LiDAR points into the camera's semantic segmentation space, the method improves the precision and reliability of the LiDAR-based localization pipeline. Validation uses the CoCar NextGen platform from the FZI Research Center for Information Technology, whose diverse sensor modalities and configurations enable a thorough analysis. The method performs well in complex real-world environments, paving the way for reliable and precise multimodal autonomous navigation systems.

Key Takeaways

  1. Semantic segmentation of LiDAR data is challenging, especially across diverse sensor types and configurations.
  2. Incorporating semantic information significantly improves the accuracy and robustness of LiDAR-based localization.
  3. The paper proposes integrating semantic camera data with LiDAR segmentation to address this challenge.
  4. Projecting LiDAR points into the camera's semantic segmentation space improves localization precision and reliability.
  5. The approach is validated on the CoCar NextGen platform, which offers diverse sensor modalities and configurations.
  6. The evaluation uses the state-of-the-art Depth-Anything network for camera image segmentation and an adaptive segmentation network for LiDAR segmentation.

Cool Papers

Click here to view paper screenshots

A Contrastive Learning Framework for Breast Cancer Detection

Authors:Samia Saeed, Khuram Naveed

Breast cancer, the second leading cause of cancer-related deaths globally, accounts for a quarter of all cancer cases [1]. To lower this death rate, it is crucial to detect tumors early, as early-stage detection significantly improves treatment outcomes. Advances in non-invasive imaging techniques have made early detection possible through computer-aided detection (CAD) systems which rely on traditional image analysis to identify malignancies. However, there is a growing shift towards deep learning methods due to their superior effectiveness. Despite their potential, deep learning methods often struggle with accuracy due to the limited availability of large-labeled datasets for training. To address this issue, our study introduces a Contrastive Learning (CL) framework, which excels with smaller labeled datasets. To this end, we train a ResNet-50 in a semi-supervised CL approach using a similarity index on a large amount of unlabeled mammogram data, applying various augmentations and transformations that help improve the performance of our approach. Finally, we fine-tune our model on a small set of labelled data, outperforming the existing state of the art. Specifically, we observed a 96.7% accuracy in detecting breast cancer on the benchmark datasets INbreast and MIAS.

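The semi-supervised pretraining step is SimCLR-like: two augmented views of each unlabeled mammogram are pulled together in embedding space with a similarity-based contrastive loss. Assuming the standard NT-Xent formulation (the paper's "similarity index" may differ), a minimal sketch:

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent contrastive loss over two augmented views.

    z1, z2: (N, D) embeddings of the two views of the same N images
    (e.g., from a ResNet-50 backbone plus projection head).
    """
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)   # (2N, D)
    sim = z @ z.t() / temperature                         # cosine similarity matrix
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))                     # mask self-similarity
    # The positive for sample i is its other view at index i+n (mod 2n).
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy usage: embeddings of 16 images under two different augmentations.
loss = nt_xent_loss(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
```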

Paper & Project Links

PDF

Summary

Breast cancer is the second leading cause of cancer-related deaths worldwide, which makes early detection vital; non-invasive imaging makes such detection possible. The study introduces a Contrastive Learning (CL) framework: a ResNet-50 is trained with a semi-supervised CL approach using a similarity index on a large amount of unlabeled mammogram data, improving accuracy on smaller labeled datasets. The final model reaches 96.7% accuracy in detecting breast cancer on the INbreast and MIAS benchmark datasets.

Key Takeaways

  1. Breast cancer is the second leading cause of cancer-related deaths globally; early detection is crucial to lowering the death rate.
  2. Non-invasive imaging and computer-aided detection (CAD) systems play an important role in early breast cancer detection.
  3. Deep learning improves detection accuracy but is limited by the availability of large labeled datasets.
  4. A Contrastive Learning (CL) framework is introduced to address the limited labeled data and performs well.
  5. A ResNet-50 is trained with a semi-supervised CL approach using a similarity index on large amounts of unlabeled mammogram data.
  6. Data augmentation and transformation techniques improve the model's performance.

Cool Papers

Click here to view paper screenshots

HiPerformer: A High-Performance Global-Local Segmentation Model with Modular Hierarchical Fusion Strategy

Authors:Dayu Tan, Zhenpeng Xu, Yansen Su, Xin Peng, Chunhou Zheng, Weimin Zhong

Both local details and global context are crucial in medical image segmentation, and effectively integrating them is essential for achieving high accuracy. However, existing mainstream methods based on CNN-Transformer hybrid architectures typically employ simple feature fusion techniques such as serial stacking, endpoint concatenation, or pointwise addition, which struggle to address the inconsistencies between features and are prone to information conflict and loss. To address the aforementioned challenges, we innovatively propose HiPerformer. The encoder of HiPerformer employs a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel, enabling layer-wise deep integration of heterogeneous information. The modular hierarchical design not only retains the independent modeling capability of each branch in the encoder, but also ensures sufficient information transfer between layers, effectively avoiding the degradation of features and information loss that come with traditional stacking methods. Furthermore, we design a Local-Global Feature Fusion (LGFF) module to achieve precise and efficient integration of local details and global semantic information, effectively alleviating the feature inconsistency problem and resulting in a more comprehensive feature representation. To further enhance multi-scale feature representation capabilities and suppress noise interference, we also propose a Progressive Pyramid Aggregation (PPA) module to replace traditional skip connections. Experiments on eleven public datasets demonstrate that the proposed method outperforms existing segmentation techniques, demonstrating higher segmentation accuracy and robustness. The code is available at https://github.com/xzphappy/HiPerformer.


Paper & Project Links

PDF

Summary
Both local details and global context are crucial in medical image segmentation. Existing mainstream CNN-Transformer hybrids use simple fusion strategies such as serial stacking, endpoint concatenation, or pointwise addition, which struggle with feature inconsistency and are prone to information conflict and loss. HiPerformer addresses this with an encoder built on a novel modular hierarchical architecture that dynamically fuses multi-source features in parallel for layer-wise deep integration of heterogeneous information, a Local-Global Feature Fusion (LGFF) module that precisely and efficiently integrates local details with global semantics, and a Progressive Pyramid Aggregation (PPA) module that replaces traditional skip connections to enhance multi-scale representation and suppress noise. Experiments demonstrate higher segmentation accuracy and robustness on multiple public datasets.

Key Takeaways

  1. Medical image segmentation must account for both local details and global context.
  2. Existing CNN-Transformer hybrids fuse features poorly, suffering from feature inconsistency and information loss.
  3. HiPerformer dynamically fuses multi-source features through a modular hierarchical architecture, improving integration efficiency.
  4. The Local-Global Feature Fusion (LGFF) module alleviates feature inconsistency and achieves precise information integration.
  5. The Progressive Pyramid Aggregation (PPA) module enhances multi-scale feature representation and suppresses noise.
  6. Experiments on eleven public datasets show high segmentation accuracy and robustness.

Cool Papers

Click here to view paper screenshots

A co-evolving agentic AI system for medical imaging analysis

Authors:Songhao Li, Jonathan Xu, Tiancheng Bao, Yuxuan Liu, Yuchen Liu, Yihang Liu, Lilin Wang, Wenhui Lei, Sheng Wang, Yinuo Xu, Yan Cui, Jialu Yao, Shunsuke Koga, Zhi Huang

Agentic AI is rapidly advancing in healthcare and biomedical research. However, in medical image analysis, their performance and adoption remain limited due to the lack of a robust ecosystem, insufficient toolsets, and the absence of real-time interactive expert feedback. Here we present “TissueLab”, a co-evolving agentic AI system that allows researchers to ask direct questions, automatically plan and generate explainable workflows, and conduct real-time analyses where experts can visualize intermediate results and refine them. TissueLab integrates tool factories across pathology, radiology, and spatial omics domains. By standardizing inputs, outputs, and capabilities of diverse tools, the system determines when and how to invoke them to address research and clinical questions. Across diverse tasks with clinically meaningful quantifications that inform staging, prognosis, and treatment planning, TissueLab achieves state-of-the-art performance compared with end-to-end vision-language models (VLMs) and other agentic AI systems such as GPT-5. Moreover, TissueLab continuously learns from clinicians, evolving toward improved classifiers and more effective decision strategies. With active learning, it delivers accurate results in unseen disease contexts within minutes, without requiring massive datasets or prolonged retraining. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging while establishing a foundation for the next generation of medical AI.


Paper & Project Links

PDF

Summary

This article discusses the progress and challenges of agentic AI in healthcare and biomedical research, and introduces TissueLab, a co-evolving agentic AI system for medical image analysis. The system lets researchers ask direct questions, automatically plan and generate explainable workflows, and run real-time analyses. TissueLab integrates tool factories across pathology, radiology, and spatial omics, standardizing tool inputs, outputs, and capabilities to decide when and how to invoke them for research and clinical questions. Compared with end-to-end vision-language models and other agentic AI systems, TissueLab achieves state-of-the-art performance on clinically meaningful tasks, and it continuously learns from clinicians, evolving toward better classifiers and more effective decision strategies. Released as a sustainable open-source ecosystem, TissueLab aims to accelerate computational research and translational adoption in medical imaging and to lay a foundation for the next generation of medical AI.

Key Takeaways

  1. Agentic AI is advancing rapidly in healthcare and biomedical research, but medical image analysis still suffers from an immature ecosystem, insufficient toolsets, and a lack of real-time interactive expert feedback.
  2. TissueLab is a co-evolving agentic AI system that addresses these problems, letting researchers interact directly, plan workflows automatically, and run real-time analyses.
  3. TissueLab integrates tools across pathology, radiology, and spatial omics, standardizing tool inputs, outputs, and capabilities to address clinical and research questions.
  4. TissueLab achieves state-of-the-art performance, outperforming end-to-end vision-language models and other agentic AI systems on clinically meaningful tasks.
  5. TissueLab learns continuously from clinicians; with active learning it delivers accurate results in unseen disease contexts within minutes, without massive datasets or prolonged retraining.
  6. TissueLab is released as a sustainable open-source ecosystem to accelerate computational research and translational adoption in medical imaging.

Cool Papers

Click here to view paper screenshots

SHMoAReg: Spark Deformable Image Registration via Spatial Heterogeneous Mixture of Experts and Attention Heads

Authors:Yuxi Zheng, Jianhui Feng, Tianran Li, Marius Staring, Yuchuan Qiao

Encoder-Decoder architectures are widely used in deep learning-based Deformable Image Registration (DIR), where the encoder extracts multi-scale features and the decoder predicts deformation fields by recovering spatial locations. However, current methods lack specialized extraction of features (that are useful for registration) and predict deformation jointly and homogeneously in all three directions. In this paper, we propose a novel expert-guided DIR network with Mixture of Experts (MoE) mechanism applied in both encoder and decoder, named SHMoAReg. Specifically, we incorporate Mixture of Attention heads (MoA) into encoder layers, while Spatial Heterogeneous Mixture of Experts (SHMoE) into the decoder layers. The MoA enhances the specialization of feature extraction by dynamically selecting the optimal combination of attention heads for each image token. Meanwhile, the SHMoE predicts deformation fields heterogeneously in three directions for each voxel using experts with varying kernel sizes. Extensive experiments conducted on two publicly available datasets show consistent improvements over various methods, with a notable increase from 60.58% to 65.58% in Dice score for the abdominal CT dataset. Furthermore, SHMoAReg enhances model interpretability by differentiating experts’ utilities across/within different resolution layers. To the best of our knowledge, we are the first to introduce MoE mechanism into DIR tasks. The code will be released soon.

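A rough sketch of the decoder-side SHMoE idea: each voxel's displacement along each of the three directions comes from a per-voxel gated mixture of convolutional experts with different kernel sizes. This is an illustrative reconstruction from the abstract, not the released code.

```python
import torch
import torch.nn as nn

class SHMoE3D(nn.Module):
    """Spatially heterogeneous mixture of conv experts for deformation prediction.

    Each expert is a 3D conv with a different kernel size; a per-voxel gate
    mixes the experts separately for each of the three displacement directions.
    Illustrative reconstruction from the abstract, not the authors' code.
    """
    def __init__(self, in_ch, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Conv3d(in_ch, 3, k, padding=k // 2) for k in kernel_sizes
        )
        # Gate: per voxel, per direction, a distribution over experts.
        self.gate = nn.Conv3d(in_ch, 3 * len(kernel_sizes), 1)

    def forward(self, feat):                      # feat: (B, C, D, H, W)
        B, _, D, H, W = feat.shape
        expert_out = torch.stack([e(feat) for e in self.experts], dim=2)  # (B,3,E,D,H,W)
        w = self.gate(feat).view(B, 3, -1, D, H, W).softmax(dim=2)        # (B,3,E,D,H,W)
        return (expert_out * w).sum(dim=2)        # (B, 3, D, H, W) displacement field

flow = SHMoE3D(16)(torch.randn(1, 16, 8, 24, 24))
print(flow.shape)  # torch.Size([1, 3, 8, 24, 24])
```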

Paper & Project Links

PDF

Summary

This paper proposes SHMoAReg, a novel expert-guided deformable image registration (DIR) network that applies the Mixture of Experts (MoE) mechanism in both the encoder and the decoder. Mixture of Attention heads (MoA) enhances the specialization of feature extraction, while a Spatial Heterogeneous Mixture of Experts (SHMoE) predicts deformation fields heterogeneously in the three directions. Experiments on two public datasets show consistent improvements over various methods, with the Dice score on the abdominal CT dataset rising from 60.58% to 65.58%. SHMoAReg also improves model interpretability. This is the first work to introduce the MoE mechanism into DIR tasks.

Key Takeaways

  1. Encoder-decoder architectures are widely used in deep learning-based deformable image registration (DIR).
  2. Current methods lack registration-specific feature extraction and predict deformation jointly and homogeneously in all directions.
  3. SHMoAReg is a novel expert-guided DIR network that combines MoA and SHMoE mechanisms to optimize feature extraction and deformation prediction.
  4. The MoA mechanism specializes feature extraction by dynamically selecting attention heads.
  5. The SHMoE mechanism predicts deformation fields heterogeneously in three directions, improving performance.
  6. Experiments on public datasets show SHMoAReg outperforms other methods, with a notable Dice improvement on the abdominal CT dataset.

Cool Papers

Click here to view paper screenshots

PS3: A Multimodal Transformer Integrating Pathology Reports with Histology Images and Biological Pathways for Cancer Survival Prediction

Authors:Manahil Raza, Ayesha Azam, Talha Qaiser, Nasir Rajpoot

Current multimodal fusion approaches in computational oncology primarily focus on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data, demonstrating improved survival prediction. We hypothesize that incorporating pathology reports can further enhance prognostic performance. Pathology reports, as essential components of clinical workflows, offer readily available complementary information by summarizing histopathological findings and integrating expert interpretations and clinical context. However, fusing these modalities poses challenges due to their heterogeneous nature. WSIs are high-dimensional, each containing several billion pixels, whereas pathology reports consist of concise text summaries of varying lengths, leading to potential modality imbalance. To address this, we propose a prototype-based approach to generate balanced representations, which are then integrated using a Transformer-based fusion model for survival prediction that we term PS3 (Predicting Survival from Three Modalities). Specifically, we present: (1) Diagnostic prototypes from pathology reports, leveraging self-attention to extract diagnostically relevant sections and standardize text representation; (2) Histological prototypes to compactly represent key morphological patterns in WSIs; and (3) Biological pathway prototypes to encode transcriptomic expressions, accurately capturing cellular functions. PS3, the three-modal transformer model, processes the resulting prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs and transcriptomic data. The proposed model outperforms state-of-the-art methods when evaluated against clinical, unimodal and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is available at: https://github.com/manahilr/PS3.


Paper & Project Links

PDF Accepted at ICCV 2025. Copyright 2025 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works

Summary

Multimodal fusion in computational oncology has focused on integrating multi-gigapixel histology whole slide images (WSIs) with genomic or transcriptomic data to improve survival prediction. This paper argues that incorporating pathology reports can further enhance prognostic performance and proposes PS3 (Predicting Survival from Three Modalities), a prototype-based, Transformer-based fusion model that handles the heterogeneity of the modalities. PS3 processes prototype-based multimodal tokens and models intra-modal and cross-modal interactions across pathology reports, WSIs, and transcriptomic data, outperforming clinical, unimodal, and multimodal baselines on six datasets from The Cancer Genome Atlas (TCGA). The code is publicly available.

Key Takeaways

  1. Current multimodal fusion in computational oncology focuses on integrating multi-gigapixel histology WSIs with genomic or transcriptomic data to improve survival prediction.
  2. Incorporating pathology reports can further enhance prognostic performance, since they summarize histopathological findings and integrate expert interpretation and clinical context.
  3. Fusing WSIs with pathology reports is challenging because the modalities differ greatly in dimensionality and nature (WSIs are high-dimensional, reports are short text).
  4. A prototype-based approach generates balanced representations, which the Transformer-based fusion model PS3 integrates for survival prediction.
  5. PS3 builds diagnostic prototypes from pathology reports, histological prototypes from WSIs, and biological pathway prototypes encoding transcriptomic expressions.
  6. PS3 outperforms state-of-the-art methods on six TCGA datasets.

Cool Papers

Click here to view paper screenshots

Generalized Shortest Path-based Superpixels for 3D Spherical Image Segmentation

Authors:Rémi Giraud, Rodrigo Borba Pinheiro, Yannick Berthoumieu

The growing use of wide-angle image capture devices and the need for fast and accurate image analysis in computer vision have enforced the need for dedicated under-representation approaches. Most recent decomposition methods segment an image into a small number of irregular homogeneous regions, called superpixels. Nevertheless, these approaches are generally designed to segment standard 2D planar images, i.e., captured with a 90° angle view without distortion. In this work, we introduce a new general superpixel method called SphSPS (for Spherical Shortest Path-based Superpixels), dedicated to wide 360° spherical or omnidirectional images. Our method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center, to quickly extract relevant clustering features. We demonstrate that considering the geometry of the acquisition space to compute the shortest path enables us to jointly improve the segmentation accuracy and the shape regularity of superpixels. To evaluate this regularity aspect, we also generalize a global regularity metric to the spherical space, addressing the limitations of the only existing spherical compactness measure. Finally, the proposed SphSPS method is validated on the reference 360° spherical panorama segmentation dataset and on synthetic road omnidirectional images. Our method significantly outperforms both planar and spherical state-of-the-art approaches in terms of segmentation accuracy, robustness to noise and regularity, providing a very interesting tool for superpixel-based applications on 360° images.

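The key ingredient is measuring pixel-to-center distances on the sphere rather than in the distorted equirectangular plane. Below is a minimal SLIC-like assignment step using great-circle (geodesic) distance on the unit sphere; the arc length here is a simplified stand-in for the paper's generalized shortest-path features.

```python
import numpy as np

def equirect_to_unit_sphere(u, v, W, H):
    """Map equirectangular pixel coordinates to 3D unit vectors."""
    lon = (u / W) * 2 * np.pi - np.pi          # longitude in [-pi, pi)
    lat = np.pi / 2 - (v / H) * np.pi          # latitude in [-pi/2, pi/2]
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

def assign_superpixels(colors, centers_xyz, centers_col, m=10.0):
    """SLIC-like assignment mixing great-circle distance and color distance.

    colors:      (H, W, 3) image; centers_xyz: (K, 3) unit vectors;
    centers_col: (K, 3) mean colors; m balances spatial vs color terms.
    """
    H, W, _ = colors.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    p = equirect_to_unit_sphere(u, v, W, H)                    # (H, W, 3)
    # Geodesic distance = arc length = arccos of the dot product on the unit sphere.
    cosang = np.clip(np.einsum("hwc,kc->hwk", p, centers_xyz), -1, 1)
    d_geo = np.arccos(cosang)                                  # (H, W, K)
    d_col = np.linalg.norm(colors[..., None, :] - centers_col, axis=-1)
    return np.argmin(d_col + m * d_geo, axis=-1)               # (H, W) labels

rng = np.random.default_rng(0)
img = rng.random((64, 128, 3))
k_xyz = equirect_to_unit_sphere(rng.integers(0, 128, 8), rng.integers(0, 64, 8), 128, 64)
labels = assign_superpixels(img, k_xyz, rng.random((8, 3)))
```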

Paper & Project Links

PDF

Summary

The growing use of wide-angle capture devices and the need for fast, accurate image analysis in computer vision call for dedicated under-representation approaches. Recent decomposition methods segment an image into a small number of irregular homogeneous regions called superpixels, but they are generally designed for standard, distortion-free 2D planar images. This work introduces SphSPS (Spherical Shortest Path-based Superpixels), a general superpixel method dedicated to wide 360° spherical or omnidirectional images. The method respects the geometry of the 3D spherical acquisition space and generalizes the notion of shortest path between a pixel and a superpixel center to quickly extract relevant clustering features, jointly improving segmentation accuracy and superpixel shape regularity. A global regularity metric is also generalized to the spherical space, addressing the limitations of the only existing spherical compactness measure. Validated on the reference 360° spherical panorama segmentation dataset and on synthetic omnidirectional road images, SphSPS significantly outperforms planar and spherical state-of-the-art approaches in segmentation accuracy, robustness to noise, and regularity, making it a very useful tool for superpixel-based applications on 360° images.

Key Takeaways

  1. SphSPS (Spherical Shortest Path-based Superpixels) is a new general superpixel method dedicated to wide-angle spherical or omnidirectional images.
  2. Accounting for the geometry of the 3D spherical acquisition space improves both segmentation accuracy and superpixel shape regularity.
  3. The notion of shortest path between a pixel and a superpixel center is generalized to quickly extract relevant clustering features.
  4. A global regularity metric is generalized to the spherical space to evaluate superpixel regularity.
  5. The effectiveness of SphSPS is validated on a panorama segmentation dataset and synthetic omnidirectional images.
  6. The method significantly outperforms current state-of-the-art approaches in segmentation accuracy, robustness to noise, and regularity.

Cool Papers

Click here to view paper screenshots

PPGFlowECG: Latent Rectified Flow with Cross-Modal Encoding for PPG-Guided ECG Generation and Cardiovascular Disease Detection

Authors:Xiaocheng Fang, Jiarui Jin, Haoyu Wang, Che Liu, Jieyi Cai, Guangkun Nie, Jun Li, Hongyan Li, Shenda Hong

In clinical practice, electrocardiography (ECG) remains the gold standard for cardiac monitoring, providing crucial insights for diagnosing a wide range of cardiovascular diseases (CVDs). However, its reliance on specialized equipment and trained personnel limits feasibility for continuous routine monitoring. Photoplethysmography (PPG) offers accessible, continuous monitoring but lacks definitive electrophysiological information, preventing conclusive diagnosis. Generative models present a promising approach to translate PPG into clinically valuable ECG signals, yet current methods face substantial challenges, including the misalignment of physiological semantics in generative models and the complexity of modeling in high-dimensional signals. To this end, we propose PPGFlowECG, a two-stage framework that aligns PPG and ECG in a shared latent space via the CardioAlign Encoder and employs latent rectified flow to generate ECGs with high fidelity and interpretability. To the best of our knowledge, this is the first study to experiment on MCMED, a newly released clinical-grade dataset comprising over 10 million paired PPG-ECG samples from more than 118,000 emergency department visits with expert-labeled cardiovascular disease annotations. Results demonstrate the effectiveness of our method for PPG-to-ECG translation and cardiovascular disease detection. Moreover, cardiologist-led evaluations confirm that the synthesized ECGs achieve high fidelity and improve diagnostic reliability, underscoring our method’s potential for real-world cardiovascular screening.

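Rectified flow trains a velocity field to transport noise to data along straight lines in latent space: sample a point on the segment between a noise draw and the target latent, and regress the model's velocity toward their difference. A minimal, hedged sketch of this training objective (the conditioning interface on PPG-derived latents is our assumption, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Tiny stand-in for the velocity field v_theta(x_t, t, cond)."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x_t, t, cond):
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def rectified_flow_loss(model, z_ecg, z_ppg):
    """Rectified-flow objective: straight-line interpolation, velocity regression.

    z_ecg: (B, D) target ECG latents; z_ppg: (B, C) conditioning PPG latents.
    """
    x0 = torch.randn_like(z_ecg)                 # noise endpoint
    t = torch.rand(z_ecg.size(0), 1)             # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * z_ecg               # point on the straight path
    v_target = z_ecg - x0                        # constant velocity of that path
    return ((model(x_t, t, z_ppg) - v_target) ** 2).mean()

model = VelocityNet(dim=64, cond_dim=32)
loss = rectified_flow_loss(model, torch.randn(8, 64), torch.randn(8, 32))
loss.backward()
```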

Paper & Project Links

PDF

Summary

Electrocardiography (ECG) remains the gold standard for cardiac monitoring in clinical practice, but its reliance on specialized equipment and trained personnel limits continuous routine monitoring. Photoplethysmography (PPG) offers accessible continuous monitoring but lacks definitive electrophysiological information for conclusive diagnosis. The authors propose PPGFlowECG, a two-stage framework that aligns PPG and ECG in a shared latent space via the CardioAlign Encoder and uses latent rectified flow to generate high-fidelity, interpretable ECGs. The study is the first to experiment on MCMED, a newly released clinical-grade dataset with over 10 million paired PPG-ECG samples and expert-labeled cardiovascular disease annotations. Results demonstrate the method's effectiveness for PPG-to-ECG translation and cardiovascular disease detection, and cardiologist-led evaluations confirm the fidelity and diagnostic reliability of the synthesized ECGs.

Key Takeaways

  1. Electrocardiography (ECG) is the gold standard for diagnosing cardiovascular diseases (CVDs), but its feasibility for continuous routine monitoring is limited.
  2. Photoplethysmography (PPG) offers continuous monitoring but lacks the definitive electrophysiological information needed for conclusive diagnosis.
  3. PPGFlowECG is proposed to translate PPG into clinically valuable ECG signals.
  4. The framework is two-stage, aligning PPG and ECG in a shared latent space via the CardioAlign Encoder.
  5. The study is the first to experiment on MCMED, a dataset with over 10 million paired PPG-ECG samples and expert-labeled annotations.
  6. Experiments demonstrate the method's effectiveness for PPG-to-ECG translation and cardiovascular disease detection.

Cool Papers

Click here to view paper screenshots

nnFilterMatch: A Unified Semi-Supervised Learning Framework with Uncertainty-Aware Pseudo-Label Filtering for Efficient Medical Segmentation

Authors:Yi Yang

Semi-supervised learning (SSL) has emerged as a promising paradigm in medical image segmentation, offering competitive performance while substantially reducing the need for extensive manual annotation. When combined with active learning (AL), these strategies further minimize annotation burden by selectively incorporating the most informative samples. However, conventional SSL-AL hybrid approaches often rely on iterative and loop-based retraining cycles after each annotation round, incurring significant computational overhead and limiting scalability in clinical applications. In this study, we present a novel, annotation-efficient, and self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch), an AL-inspired mechanism, within the single-pass nnU-Net training segmentation framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, our method circumvents the need for retraining loops while preserving the benefits of uncertainty-guided learning. We validate the proposed framework across multiple clinical segmentation benchmarks and demonstrate that it achieves performance comparable to or exceeding fully supervised models, even with only 5%–20% labeled data. This work introduces a scalable, end-to-end learning strategy for reducing annotation demands in medical image segmentation without compromising accuracy. Code is available here: https://github.com/Ordi117/nnFilterMatch.git.

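The filtering mechanism can be illustrated in a few lines: compute the per-pixel entropy of the teacher's softmax prediction and threshold it to decide which pixels' pseudo-labels enter the unsupervised loss. Following the abstract, the sketch excludes high-confidence (low-entropy) pixels; the actual nnFilterMatch criterion may be more elaborate.

```python
import torch
import torch.nn.functional as F

def filtered_pseudo_label_loss(student_logits, teacher_logits, tau=0.5):
    """Pseudo-label CE loss with entropy-based pixel filtering.

    student_logits, teacher_logits: (B, C, H, W). Per the abstract, pixels whose
    teacher prediction is high-confidence (entropy below tau) are excluded;
    flip the comparison to keep them instead, depending on the desired variant.
    """
    with torch.no_grad():
        probs = teacher_logits.softmax(dim=1)                     # (B, C, H, W)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(1)   # (B, H, W)
        pseudo = probs.argmax(dim=1)                              # hard pseudo-labels
        keep = entropy > tau                                      # drop low-entropy pixels
    loss_map = F.cross_entropy(student_logits, pseudo, reduction="none")  # (B, H, W)
    return (loss_map * keep).sum() / keep.sum().clamp_min(1)

loss = filtered_pseudo_label_loss(torch.randn(2, 4, 32, 32), torch.randn(2, 4, 32, 32))
print(loss.item())
```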

Paper & Project Links

PDF

Summary

Semi-supervised learning (SSL) combined with active learning (AL) shows great promise in medical image segmentation, reducing the need for extensive manual annotation while delivering competitive performance. However, conventional SSL-AL hybrids rely on iterative, loop-based retraining after each annotation round, which adds computational overhead and limits clinical scalability. This study proposes an annotation-efficient, self-adaptive deep segmentation framework that integrates SSL with entropy-based pseudo-label filtering (FilterMatch) within the single-pass nnU-Net training framework (nnFilterMatch). By selectively excluding high-confidence pseudo-labels during training, the method avoids retraining loops while preserving the benefits of uncertainty-guided learning. Validated on multiple clinical segmentation benchmarks, it matches or exceeds fully supervised models with only 5%–20% labeled data, offering a scalable, end-to-end strategy for reducing annotation demands without sacrificing accuracy. Code: https://github.com/Ordi117/nnFilterMatch.git.

Key Takeaways

  1. Combining semi-supervised and active learning markedly reduces the need for manual annotation in medical image segmentation.
  2. nnFilterMatch is a novel annotation-efficient, self-adaptive deep segmentation framework.
  3. Integrating SSL with entropy-based pseudo-label filtering (FilterMatch) removes the need for retraining loops.
  4. The framework performs excellently on multiple clinical segmentation benchmarks, matching or exceeding fully supervised models.
  5. High-performance segmentation is achieved with only a small fraction (5%–20%) of labeled data.
  6. The scalable, end-to-end learning strategy improves practical applicability in medical image segmentation.
  7. The code is available online for researchers and developers.

Cool Papers

Click here to view paper screenshots

Frequency-domain Multi-modal Fusion for Language-guided Medical Image Segmentation

Authors:Bo Yu, Jianhua Yang, Zetao Du, Yan Huang, Chenglong Li, Liang Wang

Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases. Recent studies have demonstrated that the accuracy of the medical image segmentation can be improved by incorporating clinical text reports as semantic guidance. However, the complex morphological changes of lesions and the inherent semantic gap between vision-language modalities prevent existing methods from effectively enhancing the representation of visual features and eliminating semantically irrelevant information, ultimately resulting in suboptimal segmentation performance. To address these problems, we propose a Frequency-domain Multi-modal Interaction model (FMISeg) for language-guided medical image segmentation. FMISeg is a late fusion model that establishes interaction between linguistic features and frequency-domain visual features in the decoder. Specifically, to enhance the visual representation, our method introduces a Frequency-domain Feature Bidirectional Interaction (FFBI) module to effectively fuse frequency-domain features. Furthermore, a Language-guided Frequency-domain Feature Interaction (LFFI) module is incorporated within the decoder to suppress semantically irrelevant visual features under the guidance of linguistic information. Experiments on QaTa-COV19 and MosMedData+ demonstrated that our method outperforms the state-of-the-art methods qualitatively and quantitatively.

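A minimal sketch of the frequency-domain side: transform the visual feature map with a 2D FFT, modulate the spectrum with a gate conditioned on the language embedding so that text can suppress irrelevant components, and transform back. This is our simplified reading of the LFFI idea, not the paper's exact module.

```python
import torch
import torch.nn as nn

class LanguageGatedFrequencyFilter(nn.Module):
    """Gate frequency components of a visual feature map with a text embedding.

    Simplified reading of language-guided frequency-domain interaction:
    the text embedding produces a per-channel gate applied to the spectrum.
    """
    def __init__(self, channels, text_dim):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())

    def forward(self, feat, text_emb):            # feat: (B, C, H, W); text: (B, T)
        spec = torch.fft.rfft2(feat, norm="ortho")          # complex spectrum
        gate = self.to_gate(text_emb)[:, :, None, None]     # (B, C, 1, 1)
        spec = spec * gate                                  # suppress irrelevant components
        return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")

mod = LanguageGatedFrequencyFilter(channels=32, text_dim=128)
out = mod(torch.randn(2, 32, 64, 64), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 32, 64, 64])
```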

Paper & Project Links

PDF Accepted by MICCAI 2025

Summary
Automatically segmenting infected areas in radiological images is essential for diagnosing pulmonary infectious diseases, and incorporating clinical text reports as semantic guidance improves segmentation accuracy. To overcome the complex morphological changes of lesions and the semantic gap between vision and language modalities, the authors propose FMISeg, a Frequency-domain Multi-modal Interaction model for language-guided medical image segmentation. FMISeg is a late-fusion model in which linguistic features interact with frequency-domain visual features in the decoder: a Frequency-domain Feature Bidirectional Interaction (FFBI) module enhances the visual representation by fusing frequency-domain features, and a Language-guided Frequency-domain Feature Interaction (LFFI) module suppresses semantically irrelevant visual features under linguistic guidance. Experiments on QaTa-COV19 and MosMedData+ show the method outperforms state-of-the-art approaches both qualitatively and quantitatively.

Key Takeaways

  1. Automatic medical image segmentation is essential for diagnosing pulmonary infectious diseases.
  2. Incorporating clinical text reports as semantic guidance improves medical image segmentation accuracy.
  3. Existing methods struggle with complex lesion morphology and the semantic gap between vision and language modalities.
  4. The Frequency-domain Multi-modal Interaction model (FMISeg) is proposed to address these problems.
  5. FMISeg enhances visual representation with the FFBI module and suppresses semantically irrelevant visual features with the LFFI module.
  6. FMISeg outperforms existing methods in experiments on the QaTa-COV19 and MosMedData+ datasets.

Cool Papers

Click here to view paper screenshots

Towards Robust In-Context Learning for Medical Image Segmentation via Data Synthesis

Authors:Jiesi Hu, Yanwu Yang, Zhiyu Ye, Chenfei Ye, Hanyang Peng, Jianfeng Cao, Ting Ma

The rise of In-Context Learning (ICL) for universal medical image segmentation has introduced an unprecedented demand for large-scale, diverse datasets for training, exacerbating the long-standing problem of data scarcity. While data synthesis offers a promising solution, existing methods often fail to simultaneously achieve both high data diversity and a domain distribution suitable for medical data. To bridge this gap, we propose SynthICL, a novel data synthesis framework built upon domain randomization. SynthICL ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variations to create data cohorts suitable for ICL. Extensive experiments on four held-out datasets validate our framework’s effectiveness, showing that models trained with our data achieve performance gains of up to 63% in average Dice and substantially enhanced generalization to unseen anatomical domains. Our work helps mitigate the data bottleneck for ICL-based segmentation, paving the way for robust models. Our code and the generated dataset are publicly available at https://github.com/jiesihu/Neuroverse3D.


Paper & Project Links

PDF

Summary
The rise of In-Context Learning (ICL) for medical image segmentation sharply increases the demand for large-scale, diverse training datasets, exacerbating long-standing data scarcity. Data synthesis is promising, but existing methods rarely achieve both high data diversity and a domain distribution suited to medical data. SynthICL is a new data synthesis framework built on domain randomization: it ensures realism by leveraging anatomical priors from real-world datasets, generates diverse anatomical structures to cover a broad data distribution, and explicitly models inter-subject variation to create data cohorts suited to ICL. Experiments on four held-out datasets show that models trained with the synthesized data gain up to 63% in average Dice and generalize substantially better to unseen anatomical domains, helping mitigate the data bottleneck for ICL-based segmentation and paving the way for robust models.

Key Takeaways

  1. In-Context Learning (ICL) for medical image segmentation demands large-scale, diverse datasets.
  2. Data synthesis is a promising way to address data scarcity.
  3. Existing synthesis methods struggle to achieve both high data diversity and a domain distribution suited to medical data.
  4. The proposed SynthICL framework builds on domain randomization and leverages anatomical priors from real-world datasets to ensure realism.
  5. SynthICL generates diverse anatomical structures covering a broad data distribution and models inter-subject variation.
  6. Experiments on four held-out datasets show that models trained with SynthICL data achieve improved performance.

Cool Papers

Click here to view paper screenshots

Graph-Radiomic Learning (GrRAiL) Descriptor to Characterize Imaging Heterogeneity in Confounding Tumor Pathologies

Authors:Dheerendranath Battalapalli, Apoorva Safai, Maria Jaramillo, Hyemin Um, Gustavo Adalfo Pineda Ortiz, Ulas Bagci, Manmeet Singh Ahluwalia, Marwa Ismail, Pallavi Tiwari

A significant challenge in solid tumors is reliably distinguishing confounding pathologies from malignant neoplasms on routine imaging. While radiomics methods seek surrogate markers of lesion heterogeneity on CT/MRI, many aggregate features across the region of interest (ROI) and miss complex spatial relationships among varying intensity compositions. We present a new Graph-Radiomic Learning (GrRAiL) descriptor for characterizing intralesional heterogeneity (ILH) on clinical MRI scans. GrRAiL (1) identifies clusters of sub-regions using per-voxel radiomic measurements, then (2) computes graph-theoretic metrics to quantify spatial associations among clusters. The resulting weighted graphs encode higher-order spatial relationships within the ROI, aiming to reliably capture ILH and disambiguate confounding pathologies from malignancy. To assess efficacy and clinical feasibility, GrRAiL was evaluated in n=947 subjects spanning three use cases: differentiating tumor recurrence from radiation effects in glioblastoma (GBM; n=106) and brain metastasis (n=233), and stratifying pancreatic intraductal papillary mucinous neoplasms (IPMNs) into no+low vs high risk (n=608). In a multi-institutional setting, GrRAiL consistently outperformed state-of-the-art baselines - Graph Neural Networks (GNNs), textural radiomics, and intensity-graph analysis. In GBM, cross-validation (CV) and test accuracies for recurrence vs pseudo-progression were 89% and 78% with >10% test-accuracy gains over comparators. In brain metastasis, CV and test accuracies for recurrence vs radiation necrosis were 84% and 74% (>13% improvement). For IPMN risk stratification, CV and test accuracies were 84% and 75%, showing >10% improvement.

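GrRAiL's two steps can be sketched directly: cluster voxels by their radiomic feature vectors, then summarize the spatial association among clusters with graph-theoretic metrics. A hedged sketch using k-means and a cluster-adjacency graph; the specific radiomic features and graph metrics in the paper may differ.

```python
import numpy as np
import networkx as nx
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def grail_like_descriptor(features, coords, k=4, radius=2.0):
    """Cluster per-voxel radiomic features, then compute graph metrics.

    features: (N, F) per-voxel radiomic measurements inside the ROI.
    coords:   (N, 3) voxel coordinates. Clusters become graph nodes; edge
    weights count how often voxels of two clusters are spatial neighbors.
    """
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    G = nx.Graph()
    G.add_nodes_from(range(k))
    for i, j in cKDTree(coords).query_pairs(r=radius):
        a, b = int(labels[i]), int(labels[j])
        if a != b:
            w = G.get_edge_data(a, b, default={"weight": 0})["weight"]
            G.add_edge(a, b, weight=w + 1)
    # A few graph-theoretic summaries of intralesional heterogeneity.
    return {
        "density": nx.density(G),
        "clustering": nx.average_clustering(G, weight="weight"),
        "degree": dict(G.degree(weight="weight")),
    }

rng = np.random.default_rng(0)
desc = grail_like_descriptor(rng.random((500, 6)), rng.random((500, 3)) * 20)
print(desc["density"])
```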

Paper & Project Links

PDF Under Review: npj Digital Medicine

Summary
Reliably distinguishing confounding pathologies from malignant neoplasms on routine imaging of solid tumors is challenging. The paper proposes GrRAiL, a new Graph-Radiomic Learning descriptor that characterizes intralesional heterogeneity (ILH) on clinical MRI scans. GrRAiL identifies clusters of sub-regions from per-voxel radiomic measurements and computes graph-theoretic metrics to quantify the spatial associations among the clusters. Evaluated in a multi-institutional setting, GrRAiL consistently outperformed baselines such as graph neural networks (GNNs), textural radiomics, and intensity-graph analysis, with clear gains in cross-validation and test accuracy across glioblastoma (GBM), brain metastasis, and pancreatic intraductal papillary mucinous neoplasm (IPMN) risk-stratification use cases.

Key Takeaways

  1. Reliably distinguishing confounding pathologies from malignant neoplasms on routine imaging of solid tumors is challenging.
  2. The Graph-Radiomic Learning (GrRAiL) descriptor characterizes intralesional heterogeneity (ILH) on clinical MRI scans.
  3. GrRAiL identifies clusters of sub-regions and computes graph-theoretic metrics to quantify spatial associations.
  4. In a multi-institutional setting, GrRAiL outperforms graph neural networks (GNNs), textural radiomics, and intensity-graph analysis.
  5. GrRAiL achieves high accuracy in differentiating tumor recurrence from radiation effects in glioblastoma (GBM).
  6. GrRAiL likewise shows high accuracy in distinguishing recurrence from radiation necrosis in brain metastasis.

Cool Papers

Click here to view paper screenshots

Citrus-V: Advancing Medical Foundation Models with Unified Medical Image Grounding for Clinical Reasoning

Authors:Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, Zhenhong Yang

Medical imaging provides critical evidence for clinical diagnosis, treatment planning, and surgical decisions, yet most existing imaging models are narrowly focused and require multiple specialized networks, limiting their generalization. Although large-scale language and multimodal models exhibit strong reasoning and multi-task capabilities, real-world clinical applications demand precise visual grounding, multimodal integration, and chain-of-thought reasoning. We introduce Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning. The model integrates detection, segmentation, and multimodal chain-of-thought reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. We propose a novel multimodal training approach and release a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations demonstrate that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.


Paper & Project Links

PDF

Summary

This article introduces Citrus-V, a multimodal medical foundation model that combines image analysis with textual reasoning, enabling pixel-level lesion localization, structured report generation, and physician-like diagnostic inference in a single framework. It adopts a novel multimodal training approach and releases a curated open-source data suite covering reasoning, detection, segmentation, and document understanding tasks. Evaluations show that Citrus-V outperforms existing open-source medical models and expert-level imaging systems across multiple benchmarks, delivering a unified pipeline from visual grounding to clinical reasoning and supporting precise lesion quantification, automated reporting, and reliable second opinions.

Key Takeaways

  1. Citrus-V is a multimodal medical foundation model combining image analysis with textual reasoning.
  2. The model enables pixel-level lesion localization, structured report generation, and physician-like diagnostic inference.
  3. Citrus-V adopts a novel multimodal training approach aimed at improving generalization.
  4. A curated open-source data suite covering multiple tasks is released to advance medical imaging research.
  5. Citrus-V outperforms existing models and expert-level imaging systems across multiple benchmarks.
  6. It provides a unified pipeline from visual grounding to clinical reasoning, supporting precise lesion quantification and automated report generation.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!