⚠️ 以下所有内容总结均由大语言模型生成,如有错误仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-09-29 更新
Diffusion Bridge Variational Inference for Deep Gaussian Processes
Authors:Jian Xu, Qibin Zhao, John Paisley, Delu Zeng
Deep Gaussian processes (DGPs) enable expressive hierarchical Bayesian modeling but pose substantial challenges for posterior inference, especially over inducing variables. Denoising diffusion variational inference (DDVI) addresses this by modeling the posterior as a time-reversed diffusion from a simple Gaussian prior. However, DDVI’s fixed unconditional starting distribution remains far from the complex true posterior, resulting in inefficient inference trajectories and slow convergence. In this work, we propose Diffusion Bridge Variational Inference (DBVI), a principled extension of DDVI that initiates the reverse diffusion from a learnable, data-dependent initial distribution. This initialization is parameterized via an amortized neural network and progressively adapted using gradients from the ELBO objective, reducing the posterior gap and improving sample efficiency. To enable scalable amortization, we design the network to operate on the inducing inputs, which serve as structured, low-dimensional summaries of the dataset and naturally align with the inducing variables’ shape. DBVI retains the mathematical elegance of DDVI, including Girsanov-based ELBOs and reverse-time SDEs, while reinterpreting the prior via a Doob-bridged diffusion process. We derive a tractable training objective under this formulation and implement DBVI for scalable inference in large-scale DGPs. Across regression, classification, and image reconstruction tasks, DBVI consistently outperforms DDVI and other variational baselines in predictive accuracy, convergence speed, and posterior quality.
深度高斯过程(DGPs)能够实现富有表达力的层次化贝叶斯建模,但给后验推断带来了实质性的挑战,特别是在诱导变量方面。降噪扩散变分推断(DDVI)通过把后验建模为从简单高斯先验开始的时间反转扩散来解决这个问题。然而,DDVI的固定无条件起始分布与复杂的真实后验相差甚远,导致推理轨迹效率低下,收敛缓慢。在这项工作中,我们提出了扩散桥变分推断(DBVI),它是DDVI的一种原则性扩展,从可学习的、依赖于数据的初始分布开始反向扩散。这个初始化是通过摊销神经网络进行参数化的,并且通过使用ELBO目标的梯度进行逐步适应,从而缩小了后验差距并提高了样本效率。为了实现可扩展的摊销,我们设计网络在诱导输入上运行,这些输入作为数据集的结构化、低维摘要,自然与诱导变量的形状对齐。DBVI保留了DDVI的数学优雅性,包括基于Girsanov的ELBO和反向时间SDE,同时通过Doob桥扩散过程重新解释先验。在这种形式下,我们推导出了一个可处理的训练目标,并实现了DBVI用于在大规模DGP中进行可扩展推理。在回归、分类和图像重建任务中,DBVI在预测精度、收敛速度和后验质量方面始终优于DDVI和其他变分基线方法。
论文及项目相关链接
摘要
深度高斯过程(DGPs)能够进行灵活的分层贝叶斯建模,但为后验推断带来了巨大挑战,尤其是在诱导变量方面。去噪扩散变分推理(DDVI)通过模拟从简单高斯先验开始的时间反转扩散过程来解决这个问题。然而,DDVI固定的无条件起始分布与复杂的真实后验分布相差甚远,导致推理轨迹效率低下,收敛速度慢。针对这一问题,本文提出了扩散桥变分推理(DBVI),这是一种对DDVI的原则性扩展,它从可学习的、数据依赖的初始分布开始反向扩散。该初始化通过摊销神经网络进行参数化,并使用ELBO目标函数的梯度逐步适应,从而缩小后验差距并提高了样本效率。为了支持可扩展的摊销,我们设计了一个在诱导输入上运行的网络,这些输入作为数据集的结构化、低维摘要,自然地与诱导变量的形状对齐。DBVI保留了DDVI的数学优雅性,包括基于Girsanov的ELBO和反向时间SDE,同时通过Doob桥扩散过程重新解释先验。我们在此公式下推导出了一个可处理的训练目标,并实现了用于大规模DGPs中可扩展推理的DBVI。在回归、分类和图像重建任务中,DBVI在预测精度、收敛速度和后验质量方面均优于DDVI和其他变分基线方法。
关键见解
- 深度高斯过程(DGPs)在进行层次贝叶斯建模时面临后验推断的挑战,尤其是在诱导变量方面。
- Denoising Diffusion Variational Inference (DDVI) 通过模拟从简单高斯先验开始的时间反转扩散过程来解决这个问题,但存在固定起始分布的问题。
- 提出的Diffusion Bridge Variational Inference (DBVI) 从可学习的数据依赖初始分布开始反向扩散,缩小了与复杂真实后验分布的差距。
- DBVI通过摊销神经网络进行参数化初始化,并使用ELBO目标函数的梯度逐步适应,提高了样本效率和后验质量。
- DBVI设计网络在诱导输入上运行,这些输入作为数据集的结构化、低维摘要。
- DBVI保留了DDVI的数学优雅性,包括Girsanov-based ELBOs和反向时间SDEs,并通过Doob桥扩散过程重新解释先验。
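下面给出一个极简的示意代码(并非论文官方实现),用以说明"可学习的数据依赖初始分布 + 反向SDE采样"的思路:假设用一个小型摊销网络 AmortizedInit 从诱导输入生成初始分布参数,再用 Euler-Maruyama 离散化模拟反向扩散;其中的网络结构、漂移网络 drift_net 与全部超参数均为示意性假设。

```python
# 示意性草图:可学习的数据依赖初始分布 + 反向SDE的Euler-Maruyama采样
# (仅用于说明思路,非DBVI官方实现;AmortizedInit/drift_net等均为假设组件)
import torch
import torch.nn as nn

class AmortizedInit(nn.Module):
    """由诱导输入 (M x d_in) 摊销出初始分布 N(mu, diag(sigma^2)) 的参数。"""
    def __init__(self, d_in, d_out, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * d_out))
    def forward(self, z_inputs):                       # (M, d_in)
        mu, log_sigma = self.net(z_inputs).chunk(2, dim=-1)
        return mu, log_sigma.exp()

def reverse_sde_sample(mu, sigma, drift_net, n_steps=50, dt=1.0 / 50):
    """从可学习初始分布出发,用Euler-Maruyama离散化模拟反向时间SDE。"""
    u = mu + sigma * torch.randn_like(mu)              # 数据依赖的起点 u_T
    t = torch.ones(u.shape[0], 1)
    for _ in range(n_steps):
        drift = drift_net(torch.cat([u, t], dim=-1))   # 参数化漂移项
        u = u + drift * dt + (dt ** 0.5) * torch.randn_like(u)
        t = t - dt
    return u                                           # 近似后验样本 u_0

# 用法示意(维度均为假设值)
M, D = 32, 8                                           # 诱导点数与输出维度
init_net = AmortizedInit(d_in=3, d_out=D)
drift_net = nn.Sequential(nn.Linear(D + 1, 64), nn.ReLU(), nn.Linear(64, D))
mu, sigma = init_net(torch.randn(M, 3))                # 诱导输入 -> 初始分布
u0 = reverse_sde_sample(mu, sigma, drift_net)          # (M, D) 的诱导变量样本
```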
点此查看论文截图


DiSSECT: Structuring Transfer-Ready Medical Image Representations through Discrete Self-Supervision
Authors:Azad Singh, Deepak Mishra
Self-supervised learning (SSL) has emerged as a powerful paradigm for medical image representation learning, particularly in settings with limited labeled data. However, existing SSL methods often rely on complex architectures, anatomy-specific priors, or heavily tuned augmentations, which limit their scalability and generalizability. More critically, these models are prone to shortcut learning, especially in modalities like chest X-rays, where anatomical similarity is high and pathology is subtle. In this work, we introduce DiSSECT – Discrete Self-Supervision for Efficient Clinical Transferable Representations, a framework that integrates multi-scale vector quantization into the SSL pipeline to impose a discrete representational bottleneck. This constrains the model to learn repeatable, structure-aware features while suppressing view-specific or low-utility patterns, improving representation transfer across tasks and domains. DiSSECT achieves strong performance on both classification and segmentation tasks, requiring minimal or no fine-tuning, and shows particularly high label efficiency in low-label regimes. We validate DiSSECT across multiple public medical imaging datasets, demonstrating its robustness and generalizability compared to existing state-of-the-art approaches.
自监督学习(SSL)已成为医学图像表示学习的一种强大范式,特别是在标签数据有限的情况下。然而,现有的SSL方法往往依赖于复杂的架构、特定的解剖先验知识或经过高度调整的数据增强,这限制了其可扩展性和通用性。更重要的是,这些模型容易陷入捷径学习,特别是在胸部X射线这类解剖结构高度相似、病理表现细微的模态中。在这项工作中,我们引入了DiSSECT——离散自监督高效临床可转移表示框架,它将多尺度矢量量化集成到SSL管道中,以施加离散表示瓶颈。这约束模型学习可重复、结构感知的特征,同时抑制特定视图或低效用的模式,改进跨任务和跨领域的表示迁移。DiSSECT在分类和分割任务上均表现出强大的性能,几乎不需要或根本不需要微调,并且在低标签状态下显示出极高的标签效率。我们在多个公共医学成像数据集上验证了DiSSECT,与现有最先进的方法相比,证明了其稳健性和通用性。
论文及项目相关链接
Summary
本文指出了现有医学图像表示学习中自监督学习(SSL)方法的局限性,如架构复杂、依赖特定解剖结构先验以及需要大量调优的数据增强。针对这些问题,提出了一个新的SSL框架DiSSECT,通过将多尺度矢量量化整合到SSL管道中以施加离散表示瓶颈,从而学习可重复的结构感知特征并抑制特定视图或低效用模式,提高跨任务和跨领域的表示迁移能力。DiSSECT在分类和分割任务上表现出强大的性能,并在低标签情况下实现很高的标签效率。在多个公共医学成像数据集上的验证表明,与现有最先进的方法相比,DiSSECT具有更好的稳健性和通用性。
Key Takeaways
- 自监督学习(SSL)在医学图像表示学习中具有广泛的应用前景,尤其在有限标签数据的情况下。
- 现有的SSL方法存在复杂性高、依赖于特定解剖结构先验和需要大量调优的数据增强等问题。
- DiSSECT框架通过整合多尺度矢量量化到SSL管道中,以施加离散表示瓶颈,从而提高模型的性能。
- DiSSECT框架能够学习可重复的结构感知特征并抑制特定视图或低效用模式,提高跨任务和领域的表示转移能力。
- DiSSECT在分类和分割任务上表现出强大的性能,且在低标签情况下表现出很高的标签效率。
- DiSSECT框架在不同的公共医学成像数据集上验证过,表现出了其稳健性和通用性。
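下面是一个极简的向量量化瓶颈示意(带直通估计器),用于说明"离散表示瓶颈"如何约束SSL特征;码本大小、特征维度与损失权重均为假设,并非DiSSECT的官方实现。

```python
# 示意性草图:带直通估计器(straight-through)的向量量化瓶颈
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)    # 可学习码本
        self.beta = beta

    def forward(self, z):                               # z: (B, dim) 连续特征
        d = torch.cdist(z, self.codebook.weight)        # 到各码字的距离 (B, K)
        idx = d.argmin(dim=-1)                          # 最近码字索引
        z_q = self.codebook(idx)                        # 离散化后的特征
        # 码本损失 + 承诺损失;直通估计器让梯度绕过argmin
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

# 用法示意:把编码器输出的特征通过离散瓶颈,再接SSL目标
vq = VectorQuantizer()
feats = torch.randn(16, 128)                            # 假设来自图像编码器
z_q, codes, vq_loss = vq(feats)
```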
点此查看论文截图






OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
Authors:Zhuoxiao Chen, Hongyang Yu, Ying Xu, Yadan Luo, Long Duong, Yuan-Fang Li
Radiology report generation (RRG) aims to automatically produce clinically faithful reports from chest X-ray images. Prevailing work typically follows a scale-driven paradigm, by multi-stage training over large paired corpora and oversized backbones, making pipelines highly data- and compute-intensive. In this paper, we propose Oracle-educated GRPO (OraPO) with a FactScore-based reward (FactS) to tackle the RRG task under constrained budgets. OraPO enables single-stage, RL-only training by converting failed GRPO explorations on rare or difficult studies into direct preference supervision via a lightweight oracle step. FactS grounds learning in diagnostic evidence by extracting atomic clinical facts and checking entailment against ground-truth labels, yielding dense, interpretable sentence-level rewards. Together, OraPO and FactS create a compact and powerful framework that significantly improves learning efficiency on clinically challenging cases, setting the new SOTA performance on the CheXpert Plus dataset (0.341 in F1) with 2–3 orders of magnitude less training data using a small base VLM on modest hardware.
放射学报告生成(RRG)旨在从胸部X射线图像中自动生成忠实于临床的报告。现有的工作通常遵循规模驱动的范式,在大规模配对语料库和超大骨干网络上进行多阶段训练,使得整个管道高度依赖数据和计算资源。在本文中,我们提出了Oracle指导的GRPO(OraPO)并结合基于FactScore的奖励(FactS),以在有限预算下解决RRG任务。OraPO通过一个轻量级的Oracle步骤,将在罕见或困难病例上失败的GRPO探索转换为直接的偏好监督,从而实现单阶段、纯RL的训练。FactS通过提取原子临床事实并检查其与真实标签的蕴涵关系,使学习建立在诊断证据之上,从而产生密集、可解释的句子级奖励。OraPO与FactS相结合,形成了一个紧凑而强大的框架,在临床上具有挑战性的病例中显著提高了学习效率,在CheXpert Plus数据集上取得了新的最佳性能(F1为0.341),并且仅使用小型基础VLM和普通硬件,所需训练数据减少了2-3个数量级。
论文及项目相关链接
Summary
本文提出了一种将Oracle-educated GRPO(OraPO)与FactScore奖励机制(FactS)相结合的方法,旨在以有限的预算从胸部X光图像自动生成临床可靠的报告。该方法采用单阶段的强化学习训练,通过轻量级Oracle步骤将失败的探索转化为直接的偏好监督,从而提高学习效率,在临床挑战案例中表现尤为突出。该方法在CheXpert Plus数据集上取得了最新的最佳性能(F1得分为0.341),并且仅使用较小的基础VLM和普通硬件,所需训练数据减少了2-3个数量级。
Key Takeaways
- 论文旨在解决放射学报告生成(RRG)任务,即自动从胸部X光图像中产生临床可靠的报告。
- 现有方法通常采用多阶段大规模训练和大规模模型,导致数据和计算成本高昂。
- 论文提出了Oracle-educated GRPO(OraPO)方法,通过单一阶段的强化学习训练,提高了学习效率。
- OraPO通过将失败的探索转化为直接的偏好监督,使得模型在困难案例上的表现得到显著提升。
- 论文引入了FactScore奖励机制,通过提取原子临床事实并检查其与真实标签的蕴含关系,为学习提供密集、可解释的句子级奖励。
- OraPO和FactS结合形成了一个紧凑且强大的框架,在CheXpert Plus数据集上实现了卓越的性能。
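下面用一个玩具例子示意"句子级事实奖励"的基本思路:把报告中的每句话映射为原子事实,与真值标签核对蕴涵/矛盾关系,得到密集的句子级奖励。这里的事实抽取与蕴涵判断都只是基于关键词的占位实现,真实的FactS依赖更复杂的临床事实抽取与蕴涵模型。

```python
# 示意性玩具实现:句子级事实核对奖励(非FactS官方实现)
# 假设:事实用 (发现, 是否存在) 的元组表示;真值标签是同样形式的集合。
def extract_facts(sentence):
    """占位的事实抽取:按关键词把句子映射为原子事实(真实系统应使用NLP模型)。"""
    findings = ["cardiomegaly", "effusion", "pneumothorax", "edema"]
    negated = any(w in sentence.lower() for w in ["no ", "without", "absent"])
    return {(f, not negated) for f in findings if f in sentence.lower()}

def sentence_rewards(report_sentences, gold_facts):
    """对每句话计算奖励:被真值蕴涵的事实 +1,与真值矛盾的事实 -1。"""
    rewards = []
    for sent in report_sentences:
        score = 0.0
        for finding, present in extract_facts(sent):
            if (finding, present) in gold_facts:
                score += 1.0                      # 事实被真值支持
            elif (finding, not present) in gold_facts:
                score -= 1.0                      # 与真值矛盾
        rewards.append(score)
    return rewards

# 用法示意
gold = {("cardiomegaly", True), ("effusion", False)}
report = ["Mild cardiomegaly is noted.", "No pleural effusion.", "Lungs are clear."]
print(sentence_rewards(report, gold))             # 例如 [1.0, 1.0, 0.0]
```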
点此查看论文截图



MK-UNet: Multi-kernel Lightweight CNN for Medical Image Segmentation
Authors:Md Mostafijur Rahman, Radu Marculescu
In this paper, we introduce MK-UNet, a paradigm shift towards ultra-lightweight, multi-kernel U-shaped CNNs tailored for medical image segmentation. Central to MK-UNet is the multi-kernel depth-wise convolution block (MKDC) we design to adeptly process images through multiple kernels, while capturing complex multi-resolution spatial relationships. MK-UNet also emphasizes the images’ salient features through sophisticated attention mechanisms, including channel, spatial, and grouped gated attention. Our MK-UNet network, with a modest computational footprint of only 0.316M parameters and 0.314G FLOPs, represents not only a remarkably lightweight, but also significantly improved segmentation solution that provides higher accuracy over state-of-the-art (SOTA) methods across six binary medical imaging benchmarks. Specifically, MK-UNet outperforms TransUNet in DICE score with nearly 333$\times$ and 123$\times$ fewer parameters and FLOPs, respectively. Similarly, when compared against UNeXt, MK-UNet exhibits superior segmentation performance, improving the DICE score up to 6.7% margins while operating with 4.7$\times$ fewer #Params. Our MK-UNet also outperforms other recent lightweight networks, such as MedT, CMUNeXt, EGE-UNet, and Rolling-UNet, with much lower computational resources. This leap in performance, coupled with drastic computational gains, positions MK-UNet as an unparalleled solution for real-time, high-fidelity medical diagnostics in resource-limited settings, such as point-of-care devices. Our implementation is available at https://github.com/SLDGroup/MK-UNet.
本文介绍了MK-UNet,这是一种面向医学图像分割的超轻量级、多核U型CNN的范式转变。MK-UNet的核心是我们设计的多核深度卷积块(MKDC),它通过多个卷积核灵活地处理图像,同时捕获复杂的多分辨率空间关系。MK-UNet还通过精细的注意力机制强调图像的显著特征,包括通道、空间和分组门控注意力。我们的MK-UNet网络仅有0.316M参数和0.314G FLOPs的计算开销,不仅非常轻量,而且在六个二值医学成像基准测试中提供了超过最先进方法的更高精度。具体来说,MK-UNet在DICE得分上优于TransUNet,同时参数量和FLOPs分别约为其1/333和1/123。与UNeXt相比,MK-UNet展现出更优越的分割性能,DICE得分最高提升6.7%,同时参数量约为其1/4.7。我们的MK-UNet还以更低的计算资源优于其他最新的轻量级网络,如MedT、CMUNeXt、EGE-UNet和Rolling-UNet。这种性能的提升以及计算资源的大幅节省,使MK-UNet成为资源受限环境(如便携式医疗设备)中实时、高保真医学诊断的理想选择。我们的实现可在https://github.com/SLDGroup/MK-UNet找到。
论文及项目相关链接
PDF 11 pages, 3 figures, Accepted at ICCV 2025 Workshop CVAMD
Summary
本文介绍了MK-UNet,一种面向医疗图像分割的超轻量级多核U形CNN。MK-UNet采用多核深度卷积块(MKDC)处理图像,同时捕捉复杂的多分辨率空间关系,并通过通道、空间和分组门控注意力机制强调图像的重要特征。MK-UNet具有较小的计算开销,仅0.316M参数和0.314G FLOPs,但在六个二值医疗成像基准测试中实现了较高的准确性,优于现有方法。MK-UNet在DICE评分上优于TransUNet和UNeXt,同时需要更少的参数和FLOPs。此外,MK-UNet还优于其他轻量级网络,如MedT、CMUNeXt、EGE-UNet和Rolling-UNet。因此,MK-UNet被认为是资源受限环境中实时高保真医疗诊断的绝佳解决方案。
Key Takeaways
- MK-UNet是一种面向医疗图像分割的超轻量级多核U形CNN。
- MK-UNet采用多核深度卷积块(MKDC)处理图像,捕捉复杂的多分辨率空间关系。
- MK-UNet通过通道、空间和分组门控注意力机制强调图像的重要特征。
- MK-UNet在六个二值医疗成像基准测试中实现了较高的准确性,优于现有方法。
- MK-UNet在DICE评分上优于其他网络,如TransUNet和UNeXt,同时需要更少的参数和FLOPs。
- MK-UNet的计算开销较小,适合在资源受限的环境中进行实时高保真医疗诊断。
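下面给出多核深度卷积块(MKDC)思想的一个极简PyTorch示意:对同一特征图并行做多个核尺寸的depthwise卷积再做逐点融合;具体核尺寸、归一化与融合方式均为示意性假设,并非论文原始结构。

```python
# 示意性草图:多核深度卷积块(并行多尺寸depthwise卷积 + 1x1融合)
import torch
import torch.nn as nn

class MKDCBlock(nn.Module):
    def __init__(self, channels, kernel_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes                       # depthwise: groups=channels
        ])
        self.fuse = nn.Sequential(
            nn.Conv2d(channels, channels, 1),           # pointwise 融合
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches)   # 多尺度响应求和
        return self.fuse(out) + x                           # 残差连接

# 用法示意
block = MKDCBlock(32)
y = block(torch.randn(2, 32, 64, 64))                   # 形状保持 (2, 32, 64, 64)
```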
点此查看论文截图




Learning Contrastive Multimodal Fusion with Improved Modality Dropout for Disease Detection and Prediction
Authors:Yi Gu, Kuniaki Saito, Jiaxin Ma
As medical diagnoses increasingly leverage multimodal data, machine learning models are expected to effectively fuse heterogeneous information while remaining robust to missing modalities. In this work, we propose a novel multimodal learning framework that integrates enhanced modalities dropout and contrastive learning to address real-world limitations such as modality imbalance and missingness. Our approach introduces learnable modality tokens for improving missingness-aware fusion of modalities and augments conventional unimodal contrastive objectives with fused multimodal representations. We validate our framework on large-scale clinical datasets for disease detection and prediction tasks, encompassing both visual and tabular modalities. Experimental results demonstrate that our method achieves state-of-the-art performance, particularly in challenging and practical scenarios where only a single modality is available. Furthermore, we show its adaptability through successful integration with a recent CT foundation model. Our findings highlight the effectiveness, efficiency, and generalizability of our approach for multimodal learning, offering a scalable, low-cost solution with significant potential for real-world clinical applications. The code is available at https://github.com/omron-sinicx/medical-modality-dropout.
随着医学诊断越来越多地利用多模态数据,机器学习模型需要有效地融合异构信息,同时对于缺失的模态也要保持稳健。在这项工作中,我们提出了一种新型的多模态学习框架,该框架集成了增强模态丢弃和对比学习,以解决现实世界中的局限性,如模态不平衡和缺失。我们的方法引入可学习的模态令牌,以改进缺失感知的模态融合,并用融合的多模态表示增强传统的单模态对比目标。我们在大规模临床数据集上验证了我们的框架,用于疾病检测和预测任务,涵盖视觉和表格模态。实验结果表明,我们的方法达到了最新技术水平,特别是在只有单一模态可用的具有挑战性和实际情景中表现尤其出色。此外,我们还展示了其与最新的CT基础模型的成功集成,证明了其适应性。我们的研究结果表明了该方法在多模态学习中的有效性、效率和泛化能力,提供了一种可扩展、成本低廉的解决方案,在现实世界临床应用中具有显著潜力。代码可在https://github.com/omron-sinicx/medical-modality-dropout上找到。
论文及项目相关链接
PDF MICCAI 2025
Summary
本文提出一种新型的多模态学习框架,通过增强的模态丢弃(modality dropout)和对比学习,有效融合异质信息,并应对模态不平衡和缺失等现实世界的挑战。该框架引入可学习的模态令牌以改进缺失感知的模态融合,并用融合后的多模态表示来增强传统的单模态对比目标。在大型临床数据集上进行疾病检测和预测任务验证,涵盖视觉和表格模态。实验结果表明,该方法在仅有单一模态可用的实用场景中表现优异。此外,它与最新的CT基础模型成功集成,展现出其适应性。该框架在多模态学习中的有效性、效率和泛化能力提供了显著的优势,是一种可扩展、低成本且具有实际应用前景的解决方案。
Key Takeaways
- 提出一种新型多模态学习框架,结合增强的模态丢弃和对比学习技术。
- 框架能够处理模态不平衡和缺失的现实挑战。
- 引入可学习的模态令牌,改进缺失感知的模态融合。
- 融合了传统的单模态对比目标,形成融合的多模态表示。
- 在大型临床数据集上进行疾病检测和预测任务验证,表现优异。
- 框架成功集成最近的CT基础模型,展现出其适应性。
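下面示意"可学习模态令牌 + 模态丢弃"的一种最简实现:训练时随机丢弃某一模态,并用对应的可学习令牌替换其特征,使后续融合模块在模态缺失时仍能工作;特征维度与丢弃概率均为假设,并非论文官方实现。

```python
# 示意性草图:带可学习模态令牌的模态丢弃(非论文官方实现)
import torch
import torch.nn as nn

class TokenModalityDropout(nn.Module):
    def __init__(self, num_modalities=2, dim=128, p_drop=0.3):
        super().__init__()
        # 每个模态一个可学习令牌,用于替代被丢弃/缺失的模态特征
        self.tokens = nn.Parameter(torch.zeros(num_modalities, dim))
        self.p_drop = p_drop

    def forward(self, feats, missing_mask=None):
        # feats: (B, M, dim);missing_mask: (B, M),True 表示该模态真实缺失
        B, M, _ = feats.shape
        if missing_mask is None:
            missing_mask = torch.zeros(B, M, dtype=torch.bool, device=feats.device)
        if self.training:                               # 训练时额外随机丢弃
            drop = torch.rand(B, M, device=feats.device) < self.p_drop
            missing_mask = missing_mask | drop
        tokens = self.tokens.unsqueeze(0).expand(B, -1, -1)
        return torch.where(missing_mask.unsqueeze(-1), tokens, feats)

# 用法示意:影像特征 + 表格特征,随机缺失后再送入融合编码器
layer = TokenModalityDropout()
fused_in = layer(torch.randn(4, 2, 128))                # (4, 2, 128)
```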
点此查看论文截图



Multimodal Medical Image Classification via Synergistic Learning Pre-training
Authors:Qinghua Lin, Guang-Hai Liu, Zuoyong Li, Yang Li, Yuting Jiang, Xiang Wu
Multimodal pathological images are commonly used in clinical diagnosis, but computer vision-based multimodal image-assisted diagnosis faces challenges with modality fusion, especially in the absence of expert-annotated data. To achieve the modality fusion in multimodal images with label scarcity, we propose a novel “pretraining + fine-tuning” framework for multimodal semi-supervised medical image classification. Specifically, we propose a synergistic learning pretraining framework of consistency, reconstructive, and aligned learning. By treating one modality as an augmented sample of another modality, we implement a self-supervised learning pre-train, enhancing the baseline model’s feature representation capability. Then, we design a fine-tuning method for multimodal fusion. During the fine-tuning stage, we set different encoders to extract features from the original modalities and provide a multimodal fusion encoder for fusion modality. In addition, we propose a distribution shift method for multimodal fusion features, which alleviates the prediction uncertainty and overfitting risks caused by the lack of labeled samples. We conduct extensive experiments on the publicly available gastroscopy image datasets Kvasir and Kvasirv2. Quantitative and qualitative results demonstrate that the proposed method outperforms the current state-of-the-art classification methods. The code will be released at: https://github.com/LQH89757/MICS.
多模态病理图像在临床诊断中十分常见,但基于计算机视觉的多模态图像辅助诊断在模态融合方面面临挑战,特别是在缺乏专家标注数据的情况下。为了实现标签稀缺情况下多模态图像的模态融合,我们提出了一种新颖的“预训练+微调”框架,用于多模态半监督医学图像分类。具体来说,我们提出了一个包含一致性、重建和对齐学习的协同学习预训练框架。通过将一种模态视为另一种模态的增强样本,我们实现了自监督的预训练,提高了基线模型的特征表示能力。然后,我们设计了针对多模态融合的微调方法。在微调阶段,我们设置不同的编码器从原始模态中提取特征,并为融合模态提供一个多模态融合编码器。此外,我们还针对多模态融合特征提出了一种分布偏移方法,以减轻由于缺少标注样本导致的预测不确定性和过拟合风险。我们在公开可用的胃镜图像数据集Kvasir和Kvasirv2上进行了大量实验。定量和定性结果表明,所提出的方法优于当前最先进的分类方法。代码将在以下网址发布:https://github.com/LQH89757/MICS。
论文及项目相关链接
Summary
医学图像多模态诊断面临标签稀缺和模态融合的挑战。为此,我们提出了一种新型的“预训练+微调”框架,实现半监督医学图像多模态分类。通过一致性、重建和对齐学习的预训练框架,提高模型的特征表示能力,并设计了一种针对模态融合的微调方法。在公开可用的胃镜图像数据集Kvasir和Kvasirv2上进行了实验验证,该方法优于当前最先进的分类方法。
Key Takeaways
- 多模态医学图像诊断面临标签稀缺和模态融合的挑战。
- 提出了一种新型的“预训练+微调”框架,用于半监督医学图像多模态分类。
- 通过预训练框架提高模型的特征表示能力,采用一致性、重建和对齐学习的方法。
- 设计了一种针对模态融合的微调方法,包括不同编码器提取特征、融合模态的编码器以及特征分布转换方法。
- 特征分布转换方法能减轻因缺乏标签样本导致的预测不确定性和过拟合风险。
- 在公开数据集Kvasir和Kvasirv2上进行了实验验证,证明了该方法的有效性。
点此查看论文截图





Implicit Neural Representations of Intramyocardial Motion and Strain
Authors:Andrew Bell, Yan Kit Choi, Steffen E Petersen, Andrew King, Muhummad Sohaib Nazir, Alistair A Young
Automatic quantification of intramyocardial motion and strain from tagging MRI remains an important but challenging task. We propose a method using implicit neural representations (INRs), conditioned on learned latent codes, to predict continuous left ventricular (LV) displacement – without requiring inference-time optimisation. Evaluated on 452 UK Biobank test cases, our method achieved the best tracking accuracy (2.14 mm RMSE) and the lowest combined error in global circumferential (2.86%) and radial (6.42%) strain compared to three deep learning baselines. In addition, our method is $\sim$380$\times$ faster than the most accurate baseline. These results highlight the suitability of INR-based models for accurate and scalable analysis of myocardial strain in large CMR datasets. The code can be found at https://github.com/andrewjackbell/Displacement-INR
从标记(tagging)MRI中自动量化心肌内运动和应变仍然是一项重要且具有挑战性的任务。我们提出了一种以学习到的潜在编码为条件的隐式神经表示(INR)方法,用于预测左心室(LV)的连续位移,且无需在推理时进行优化。我们在452例英国生物银行(UK Biobank)测试案例上对该方法进行了评估,与三个深度学习基线相比,我们的方法达到了最佳的跟踪精度(2.14毫米RMSE),并且在整体周向应变(2.86%)和径向应变(6.42%)方面的综合误差最低。此外,我们的方法比最准确的基线快约380倍。这些结果突出了基于INR的模型在大型CMR数据集中进行准确且可扩展的心肌应变分析的适用性。代码可在https://github.com/andrewjackbell/Displacement-INR找到。
论文及项目相关链接
PDF STACOM 2025 @ MICCAI
Summary
本文提出了一种基于隐式神经表示(INR)的方法,以学习到的潜在编码为条件预测左心室位移,无需进行推理时优化。该方法在测试数据集上表现出良好的跟踪精度和更快的计算速度,为大规模心脏磁共振数据集中的心肌应变分析提供了准确且可扩展的工具。
Key Takeaways
- 文中提出了一种基于隐神经表示(INR)的方法,用于自动量化心肌运动和应变。
- 该方法能够在不需要推理时间优化的情况下预测左心室位移。
- 在UK Biobank测试数据集上,该方法的跟踪精度达到最佳(RMSE为2.14mm)。
- 与其他深度学习模型相比,该方法在整体周向和径向应变方面的综合误差最低。
- 该方法的计算速度比最准确的基线模型快约380倍。
- 实验结果证明了隐神经表示模型在心肌应变分析中的准确性和可扩展性。
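下面是条件化隐式神经表示的一个最小示意:MLP以(空间坐标, 时间, 受试者潜在编码)为输入,直接输出该点的位移向量,因此可以在任意连续坐标上查询位移;网络宽度、潜在维度等均为示意性假设,并非论文原实现。

```python
# 示意性草图:以潜在编码为条件的位移场INR(非官方实现)
import torch
import torch.nn as nn

class DisplacementINR(nn.Module):
    def __init__(self, coord_dim=2, latent_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(coord_dim + 1 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, coord_dim),               # 输出位移 (dx, dy)
        )

    def forward(self, coords, t, latent):
        # coords: (N, 2) 归一化坐标;t: (N, 1) 心动周期相位;latent: (N, latent_dim)
        return self.net(torch.cat([coords, t, latent], dim=-1))

# 用法示意:对任意连续坐标查询位移,无需逐例优化(潜在编码由训练得到)
inr = DisplacementINR()
coords = torch.rand(1024, 2)
t = torch.full((1024, 1), 0.4)
latent = torch.randn(1, 32).expand(1024, -1)
disp = inr(coords, t, latent)                            # (1024, 2)
```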
点此查看论文截图


Clustering methods for Categorical Time Series and Sequences : A scoping review
Authors:Ottavio Khalifa, Viet-Thi Tran, Alan Balendran, François Petit
Objective: To provide an overview of clustering methods for categorical time series (CTS), a data structure commonly found in epidemiology, sociology, biology, and marketing, and to support method selection in regards to data characteristics. Methods: We searched PubMed, Web of Science, and Google Scholar, from inception up to November 2024 to identify articles that propose and evaluate clustering techniques for CTS. Methods were classified according to three major families – distance-based, feature-based, and model-based – and assessed on their ability to handle data challenges such as variable sequence length, multivariate data, continuous time, missing data, time-invariant covariates, and large data volumes. Results: Out of 14607 studies, we included 124 articles describing 129 methods, spanning domains such as artificial intelligence, social sciences, and epidemiology. Distance-based methods, particularly those using Optimal Matching, were most prevalent, with 56 methods. We identified 28 model-based methods, which demonstrated superior flexibility for handling complex data structures such as multivariate data, continuous time and time-invariant covariates. We also recorded 45 feature-based approaches, which were on average more scalable but less flexible. A searchable Web application was developed to facilitate method selection based on dataset characteristics ( https://cts-clustering-scoping-review-7sxqj3sameqvmwkvnzfynz.streamlit.app/ ) Discussion: While distance-based methods dominate, model-based approaches offer the richest modeling potential but are less scalable. Feature-based methods favor performance over flexibility, with limited support for complex data structures. Conclusion: This review highlights methodological diversity and gaps in CTS clustering. The proposed typology aims to guide researchers in selecting methods for their specific use cases.
目标:概述分类时间序列(CTS)的聚类方法,并支持根据数据特性进行方法选择。CTS是常见于流行病学、社会学、生物学和市场营销等领域的一种数据结构。方法:我们检索了PubMed、Web of Science和Google Scholar自建库以来至2024年11月的文献,以识别提出并评估CTS聚类技术的文章。方法被分为三大类别——基于距离的、基于特征的、基于模型的,并评估了它们处理数据挑战的能力,例如可变序列长度、多元数据、连续时间、缺失数据、时不变协变量和大数据量。结果:在14607项研究中,我们纳入了124篇文章,描述了129种方法,涉及人工智能、社会科学和流行病学等领域。基于距离的方法最为普遍(共56种),其中尤以使用最优匹配(Optimal Matching)的方法为主。我们确定了28种基于模型的方法,它们在应对多元数据、连续时间和时不变协变量等复杂数据结构方面表现出较高的灵活性。我们还记录了45种基于特征的方法,这些方法平均而言更具可扩展性但灵活性较差。我们开发了一个可检索的Web应用程序,以便根据数据集特性进行方法选择( https://cts-clustering-scoping-review-7sxqj3sameqvmwkvnzfynz.streamlit.app/ )。讨论:虽然基于距离的方法占主导地位,但基于模型的方法建模潜力最为丰富,只是可扩展性较差。基于特征的方法更侧重于性能而非灵活性,对复杂数据结构的支持有限。结论:本综述突出了CTS聚类中的方法多样性和空白。提出的分类旨在指导研究人员为其特定用例选择合适的方法。
论文及项目相关链接
Summary
该文概述了针对类别时间序列(CTS)数据的聚类方法,该类数据广泛存在于流行病学、社会学、生物学和市场营销等领域。文章通过检索数据库和文献,将CTS聚类方法分为基于距离、基于特征和基于模型三大类,并结合数据特性进行评估和方法选择。研究发现基于距离的方法最为普遍,基于模型的方法在处理复杂数据结构方面灵活性较高,而基于特征的方法则更侧重性能。文章最后指出了聚类方法的多样性和现有差距,旨在为研究者选择合适的方法提供指导。
Key Takeaways
- 文章提供了类别时间序列(CTS)聚类方法的概述,旨在指导研究者根据特定用例选择合适的方法。
- 通过文献检索,将CTS聚类方法分为基于距离、基于特征和基于模型三大类。
- 基于距离的方法最为普遍,但基于模型的方法在处理复杂数据结构方面表现出较高灵活性。
- 基于特征的方法更侧重于性能,但对复杂数据结构的支持有限。
- 聚类方法的选择需考虑数据特性,如序列长度、多元数据、连续时间、缺失数据等。
- 文章指出当前聚类方法的多样性和现有差距,为未来的研究提供了方向。
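下面给出"最优匹配(Optimal Matching)距离 + 层次聚类"的一个最小示意:OM距离本质上是带代价的编辑距离,这里用统一的插入/删除/替换代价的动态规划实现;代价设置与聚类联动方式均为示意性假设。

```python
# 示意性草图:类别序列的最优匹配(编辑)距离 + 层次聚类
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def om_distance(a, b, indel=1.0, sub=2.0):
    """统一代价下的最优匹配距离(即加权编辑距离)。"""
    n, m = len(a), len(b)
    d = np.zeros((n + 1, m + 1))
    d[:, 0] = np.arange(n + 1) * indel
    d[0, :] = np.arange(m + 1) * indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub
            d[i, j] = min(d[i - 1, j] + indel, d[i, j - 1] + indel,
                          d[i - 1, j - 1] + cost)
    return d[n, m]

# 用法示意:对若干条状态序列两两计算距离,再做平均联动层次聚类
seqs = [list("AABBC"), list("AABBB"), list("CCBBA"), list("CCBAA")]
n = len(seqs)
D = np.array([[om_distance(seqs[i], seqs[j]) for j in range(n)] for i in range(n)])
labels = fcluster(linkage(squareform(D), method="average"), t=2, criterion="maxclust")
print(labels)                                        # 预期把前两条与后两条分成两簇
```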
点此查看论文截图


Robust Pan-Cancer Mitotic Figure Detection with YOLOv12
Authors:Raphaël Bourgade, Guillaume Balezo, Thomas Walter
Mitotic figures represent a key histoprognostic feature in tumor pathology, providing crucial insights into tumor aggressiveness and proliferation. However, their identification remains challenging, subject to significant inter-observer variability, even among experienced pathologists. To address this issue, the MItosis DOmain Generalization (MIDOG) 2025 challenge marks the third edition of an international competition aiming to develop robust mitosis detection algorithms. In this paper, we present a mitotic figure detection approach based on the state-of-the-art YOLOv12 object detection architecture. Our method achieved an F1-score of 0.801 on the preliminary test set (hotspots only) and ranked second on the final test leaderboard with an F1-score of 0.7216 across complex and heterogeneous whole-slide regions, without relying on external data.
有丝分裂象是肿瘤病理学中关键的组织学预后特征,为评估肿瘤的侵袭性和增殖提供了重要依据。然而,它们的识别仍然具有挑战性,即使是经验丰富的病理学家之间也存在显著的观察者间差异。为了解决这一问题,有丝分裂域泛化(MIDOG)2025挑战赛作为该国际竞赛的第三届,旨在开发稳健的有丝分裂检测算法。在本文中,我们提出了一种基于最新YOLOv12目标检测架构的有丝分裂象检测方法。我们的方法在初步测试集(仅热点区域)上达到了0.801的F1分数,并在复杂且异质的全切片区域上以0.7216的F1分数在最终测试排行榜上排名第二,且无需依赖外部数据。
论文及项目相关链接
Summary
本文介绍了一种基于最新YOLOv12目标检测架构的有丝分裂象检测方法,该方法面向第三届国际有丝分裂检测挑战赛(MIDOG 2025)提出。该方法可以在复杂的全切片区域中准确检测有丝分裂象,且无需依赖外部数据,在初步测试集上取得了0.801的F1分数,并在最终测试中排名第二,展现了其在复杂环境下的稳健性和可靠性。
Key Takeaways
- 有丝分裂象是肿瘤病理学中的关键组织学预后特征,对肿瘤侵袭性和增殖性的评估至关重要。
- 有丝分裂检测面临观察者间差异大的挑战,国际上的MIDOG挑战赛旨在开发稳健的有丝分裂检测算法。
- 本文提出了一种基于YOLOv12架构的有丝分裂象检测方法,初步测试成绩优异。
- 该方法能在复杂的全切片区域中准确检测有丝分裂象,展示了其在复杂环境下的稳健性和可靠性。
- 该方法在最终测试中排名第二,表明其性能优异且具有潜力。
点此查看论文截图



A Multimodal and Multi-centric Head and Neck Cancer Dataset for Segmentation, Diagnosis and Outcome Prediction
Authors:Numan Saeed, Salma Hassan, Shahad Hardan, Ahmed Aly, Darya Taratynova, Umair Nawaz, Ufaq Khan, Muhammad Ridzuan, Vincent Andrearczyk, Adrien Depeursinge, Yutong Xie, Thomas Eugene, Raphaël Metz, Mélanie Dore, Gregory Delpon, Vijay Ram Kumar Papineni, Kareem Wahid, Cem Dede, Alaa Mohamed Shawky Ali, Carlos Sjogreen, Mohamed Naser, Clifton D. Fuller, Valentin Oreiller, Mario Jreige, John O. Prior, Catherine Cheze Le Rest, Olena Tankyevych, Pierre Decazes, Su Ruan, Stephanie Tanadini-Lang, Martin Vallières, Hesham Elhalawani, Ronan Abgral, Romain Floch, Kevin Kerleguer, Ulrike Schick, Maelle Mauguen, David Bourhis, Jean-Christophe Leclere, Amandine Sambourg, Arman Rahmim, Mathieu Hatt, Mohammad Yaqub
We present a publicly available multimodal dataset for head and neck cancer research, comprising 1123 annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies from patients with histologically confirmed disease, acquired from 10 international medical centers. All studies contain co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity from a long-term, multi-institution retrospective collection. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following established guidelines. We provide anonymized NifTi files, expert-annotated segmentation masks, comprehensive clinical metadata, and radiotherapy dose distributions for a patient subset. The metadata include TNM staging, HPV status, demographics, long-term follow-up outcomes, survival times, censoring indicators, and treatment information. To demonstrate its utility, we benchmark three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, using state-of-the-art deep learning models like UNet, SegResNet, and multimodal prognostic frameworks.
我们提供了一个公开可用的头颈部癌症研究多模态数据集,包含来自10个国际医疗中心、经组织学确诊患者的1123项带注释的PET/CT研究。所有研究都包含配准后的PET/CT扫描,采集协议各不相同,反映了长期、多机构回顾性收集所具有的真实世界临床多样性。原发肿瘤大体体积(GTVp)和受累淋巴结(GTVn)由经验丰富的放射肿瘤学家和放射科医生按照既定指南进行手动分割。我们提供匿名化的NIfTI文件、专家注释的分割掩膜、全面的临床元数据以及部分患者的放射治疗剂量分布。元数据包括TNM分期、HPV状态、人口统计学信息、长期随访结果、生存时间、删失指标和治疗信息。为了证明其实用性,我们使用UNet、SegResNet等最先进的深度学习模型和多模态预后框架,对三个关键临床任务进行了基准测试:自动肿瘤分割、无复发生存预测和HPV状态分类。
论文及项目相关链接
PDF 10 pages, 5 figures. Numan Saeed is the corresponding author. Numan Saeed, Salma Hassan and Shahad Hardan contributed equally to this work. Project page: https://hecktor25.grand-challenge.org/
Summary
这是一个关于头颈癌研究的多模式数据集介绍,包含来自十个国际医疗中心的1123份经过注释的PET/CT研究。数据集中包含了手动分割的主要肿瘤体积和涉及的淋巴结,以及包括TNM分期、HPV状态、人口统计学信息、长期随访结果等在内的综合临床元数据。此外,还提供了用于患者子集的匿名NifTi文件、专家注释的分割掩模、放射治疗剂量分布图。数据集可用于自动化肿瘤分割、无复发生存预测和HPV状态分类等任务。
Key Takeaways
- 该数据集是一个公开可用的多模式数据集,用于头颈癌研究,涵盖了国际十个医疗中心的PET/CT研究数据。
- 数据集中包含了手动分割的主要肿瘤体积和涉及的淋巴结。
- 数据集包含了丰富的临床元数据,如TNM分期、HPV状态等。
- 数据集提供了匿名化的NifTi文件、专家注释的分割掩模和放射治疗剂量分布图。
- 数据集可用于自动化肿瘤分割任务。
- 数据集可用于无复发生存预测任务。
点此查看论文截图






Multimodal Deep Learning for Phyllodes Tumor Classification from Ultrasound and Clinical Data
Authors:Farhan Fuad Abir, Abigail Elliott Daly, Kyle Anderman, Tolga Ozmen, Laura J. Brattain
Phyllodes tumors (PTs) are rare fibroepithelial breast lesions that are difficult to classify preoperatively due to their radiological similarity to benign fibroadenomas. This often leads to unnecessary surgical excisions. To address this, we propose a multimodal deep learning framework that integrates breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. We developed a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata from 81 subjects with confirmed PTs. Class-aware sampling and subject-stratified 5-fold cross-validation were applied to prevent class imbalance and data leakage. The results show that our proposed multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 achieved the best performance in the multimodal setting, with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294, respectively. This study demonstrates the potential of multimodal AI to serve as a non-invasive diagnostic tool, reducing unnecessary biopsies and improving clinical decision-making in breast tumor management.
叶状肿瘤(PTs)是一种罕见的纤维上皮性乳腺病变,由于其与良性纤维腺瘤的放射学相似性,术前难以分类,这常常导致不必要的手术切除。为了解决这一问题,我们提出了一种多模式深度学习框架,该框架结合了乳腺超声(BUS)图像和结构化临床数据,以提高诊断准确性。我们开发了一个双分支神经网络,从超声图像和来自81名已确诊PTs患者的患者元数据中提取并融合特征。应用类别感知采样和受试者分层5倍交叉验证,以防止类别不平衡和数据泄露。结果表明,我们提出的多模式方法在分类良性与边界性或恶性PTs时优于单模式基线方法。在六种图像编码器中,ConvNeXt和ResNet18在多模式设置中表现最佳,AUC-ROC得分分别为0.9427和0.9349,F1得分分别为0.6720和0.7294。本研究展示了多模式人工智能作为非侵入性诊断工具的潜力,可以减少不必要的活检,提高乳腺肿瘤管理的临床决策水平。
论文及项目相关链接
PDF IEEE-EMBS International Conference on Body Sensor Networks (IEEE-EMBS BSN 2025)
Summary
本文提出了一种多模态深度学习框架,该框架结合了乳腺超声图像和结构化临床数据,旨在提高叶状肿瘤的诊断准确性。通过开发一个双分支神经网络,该网络从超声图像和患者元数据中提取特征,并从81名确诊为叶状肿瘤的受试者中获取数据。该研究采用类感知采样和分层5倍交叉验证,以避免类别不平衡和数据泄露问题。结果显示,所提出的多模态方法在分类良性与边界性或恶性叶状肿瘤方面优于单模态基线。在多种图像编码器中,ConvNeXt和ResNet18在多模态设置中表现最佳,AUC-ROC得分分别为0.9427和0.9349,F1分数分别为0.6720和0.7294。该研究证明了多模态人工智能作为非侵入性诊断工具的潜力,有望降低不必要的活检率,提高乳腺肿瘤管理的临床决策水平。
Key Takeaways
- 本文提出了一种多模态深度学习框架,旨在提高叶状肿瘤(PTs)的诊断准确性。
- 通过结合乳腺超声图像和结构化临床数据,该框架能够更准确地分类PTs。
- 研究采用双分支神经网络,能够从超声图像和患者元数据中提取特征。
- 实验结果展示了所提出的多模态方法优于单模态基线方法在分类PTs方面的表现。
- 在多种图像编码器中,ConvNeXt和ResNet18表现出最佳性能。
- 多模态人工智能的应用有望降低不必要的活检率,提高临床决策水平。
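下面示意"双分支网络融合超声图像特征与结构化临床数据"的最小实现:图像分支用torchvision的ResNet18提取特征,表格分支用小型MLP,拼接后分类;骨干网络的选择与各维度均为示意性假设,并非论文原始配置。

```python
# 示意性草图:超声图像 + 临床表格数据的双分支融合分类器
import torch
import torch.nn as nn
from torchvision.models import resnet18

class DualBranchClassifier(nn.Module):
    def __init__(self, num_clinical=10, num_classes=2):
        super().__init__()
        backbone = resnet18(weights=None)
        backbone.fc = nn.Identity()                      # 取512维图像特征
        self.image_branch = backbone
        self.tabular_branch = nn.Sequential(
            nn.Linear(num_clinical, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
        self.head = nn.Linear(512 + 64, num_classes)

    def forward(self, image, clinical):
        img_feat = self.image_branch(image)              # (B, 512)
        tab_feat = self.tabular_branch(clinical)         # (B, 64)
        return self.head(torch.cat([img_feat, tab_feat], dim=-1))

# 用法示意
model = DualBranchClassifier()
logits = model(torch.randn(4, 3, 224, 224), torch.randn(4, 10))   # (4, 2)
```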
点此查看论文截图






Cross-Cancer Knowledge Transfer in WSI-based Prognosis Prediction
Authors:Pei Liu, Luping Ji, Jiaxiang Gou, Xiangxiang Zeng
Whole-Slide Image (WSI) is an important tool for estimating cancer prognosis. Current studies generally follow a conventional cancer-specific paradigm where one cancer corresponds to one model. However, it naturally struggles to scale to rare tumors and cannot utilize the knowledge of other cancers. Although a multi-task learning-like framework has been studied recently, it usually has high demands on computational resources and needs considerable costs in iterative training on ultra-large multi-cancer WSI datasets. To this end, this paper makes a paradigm shift to knowledge transfer and presents the first preliminary yet systematic study on cross-cancer prognosis knowledge transfer in WSIs, called CROPKT. It has three major parts: (i) we curate a large dataset (UNI2-h-DSS) with 26 cancers and use it to measure the transferability of WSI-based prognostic knowledge across different cancers (including rare tumors); (ii) beyond a simple evaluation merely for benchmark, we design a range of experiments to gain deeper insights into the underlying mechanism of transferability; (iii) we further show the utility of cross-cancer knowledge transfer, by proposing a routing-based baseline approach (ROUPKT) that could often efficiently utilize the knowledge transferred from off-the-shelf models of other cancers. We hope CROPKT could serve as an inception and lay the foundation for this nascent paradigm, i.e., WSI-based prognosis prediction with cross-cancer knowledge transfer. Our source code is available at https://github.com/liupei101/CROPKT.
全切片图像(WSI)是评估癌症预后的重要工具。当前的研究通常遵循一种传统的特定癌症范式,即一种癌症对应一个模型。然而,这种范式难以扩展到罕见肿瘤,并且无法利用其他癌症的知识。尽管最近已经研究了类似多任务学习的框架,但它通常对计算资源有很高的要求,并且需要在超大型多癌症WSI数据集上进行迭代训练,成本相当高。为此,本文进行了范式转变,转向知识转移,并对WSI中的跨癌症预后知识转移进行了首次初步但系统的研究,称为CROPKT。它主要包括三个部分:(i)我们整理了一个包含26种癌症的大型数据集(UNI2-h-DSS),并用它来衡量不同癌症(包括罕见肿瘤)之间基于WSI的预后知识的可转移性;(ii)不仅进行用于基准测试的简单评估,我们还设计了一系列实验,以深入了解可转移性的内在机制;(iii)我们进一步展示了跨癌症知识转移的实用性,提出了一种基于路由的基线方法(ROUPKT),它通常能高效利用从其他癌症现成模型转移而来的知识。我们希望CROPKT能够作为一个开端,为这一新兴范式(即基于WSI、具有跨癌症知识转移的预后预测)奠定基础。我们的源代码可在 https://github.com/liupei101/CROPKT 找到。
论文及项目相关链接
PDF 20 pages (11 figures and 6 tables)
Summary
该研究突破了传统的癌症特异性模式限制,首次系统性研究了跨癌预后知识转移在WSI中的应用,命名为CROPKT。研究内容包括:建立包含26种癌症的大型数据集UNI2-h-DSS,衡量WSI预后知识在不同癌症间的可转移性;设计一系列实验深入了解可转移性的内在机制;提出基于路由的基线方法ROUPKT,展示跨癌知识转移的实用性。
Key Takeaways
- 研究突破了传统癌症特异性模式的限制,建立了跨癌预后知识转移的新模式(CROPKT)。
- 建立了包含26种癌症的大型数据集UNI2-h-DSS,用于衡量WSI预后知识在不同癌症间的可转移性。
- 通过一系列实验深入了解可转移性的内在机制。
- 提出了基于路由的基线方法ROUPKT,能够高效利用其他癌症的已训练模型进行知识转移。
- CROPKT为WSI为基础的预后预测提供了新的研究视角,即跨癌知识转移。
- 该研究提高了处理罕见肿瘤的能力,并能利用其他癌症的知识。
- 研究的源代码已公开,便于后续研究使用与参考。
点此查看论文截图





RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
Authors:Tianchen Fang, Guiru Liu
Medical image understanding plays a crucial role in enabling automated diagnosis and data-driven clinical decision support. However, its progress is impeded by two primary challenges: the limited availability of high-quality annotated medical data and an overreliance on global image features, which often miss subtle but clinically significant pathological regions. To address these issues, we introduce RegionMed-CLIP, a region-aware multimodal contrastive learning framework that explicitly incorporates localized pathological signals along with holistic semantic representations. The core of our method is an innovative region-of-interest (ROI) processor that adaptively integrates fine-grained regional features with the global context, supported by a progressive training strategy that enhances hierarchical multimodal alignment. To enable large-scale region-level representation learning, we construct MedRegion-500k, a comprehensive medical image-text corpus that features extensive regional annotations and multilevel clinical descriptions. Extensive experiments on image-text retrieval, zero-shot classification, and visual question answering tasks demonstrate that RegionMed-CLIP consistently exceeds state-of-the-art vision language models by a wide margin. Our results highlight the critical importance of region-aware contrastive pre-training and position RegionMed-CLIP as a robust foundation for advancing multimodal medical image understanding.
医学图像理解在实现自动化诊断和基于数据的临床决策支持中起着至关重要的作用。然而,其进展受到两个主要挑战的限制:高质量标注医学数据的有限可用性,以及对全局图像特征的过度依赖,这往往导致忽略细微但临床上重要的病理区域。为了解决这些问题,我们引入了RegionMed-CLIP,这是一个区域感知的多模式对比学习框架,它显式地结合了局部病理信号和整体语义表示。我们的方法的核心是一个创新的兴趣区域(ROI)处理器,它自适应地集成精细的局部特征与全局上下文,辅以一种增强分层多模式对齐的渐进式训练策略。为了实现大规模的区域级别表示学习,我们构建了MedRegion-500k,这是一个具有广泛区域注释和多层临床描述的综合医学图像-文本语料库。在图像-文本检索、零样本分类和视觉问答任务上的大量实验表明,RegionMed-CLIP始终大幅超越了最先进的视觉语言模型。我们的结果强调了区域感知对比预训练的关键重要性,并将RegionMed-CLIP定位为推动多模式医学图像理解发展的稳健基础。
论文及项目相关链接
PDF Upon further review, we identified that our dataset requires optimization to ensure research reliability and accuracy. Additionally, considering the target journal’s latest submission policies, we believe comprehensive manuscript revisions are necessary
Summary
医学图像理解在自动化诊断和数据驱动的临床决策支持中起着关键作用,但面临高质量标注医学数据有限和过度依赖全局图像特征两大挑战。为解决这个问题,我们提出了RegionMed-CLIP,一个结合局部病理信号和整体语义表示的区域感知多模态对比学习框架。其核心是创新的感兴趣区域(ROI)处理器,能自适应地集成细粒度区域特征与全局上下文,辅以渐进式训练策略,提高分层多模态对齐。为支持大规模区域级表示学习,我们构建了MedRegion-500k,一个包含丰富区域标注和多层临床描述的大规模医学图像文本语料库。实验表明,RegionMed-CLIP在图像文本检索、零样本分类和视觉问答任务上均显著超越现有视觉语言模型。
Key Takeaways
- 医学图像理解在自动化诊断与临床决策支持中起关键作用。
- 两大挑战:高质量标注医学数据有限和过度依赖全局图像特征。
- RegionMed-CLIP框架结合局部病理信号和整体语义表示。
- ROI处理器自适应集成细粒度区域特征与全局上下文。
- 渐进式训练策略提高分层多模态对齐。
- MedRegion-500k医学图像文本语料库支持大规模区域级表示学习。
- RegionMed-CLIP在多项任务上表现超越现有视觉语言模型。
点此查看论文截图





RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation
Authors:Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, Liujuan Cao
Referring Image Segmentation (RIS), which aims to segment specific objects based on natural language descriptions, plays an essential role in vision-language understanding. Despite its progress in remote sensing applications, RIS in Low-Altitude Drone (LAD) scenarios remains underexplored. Existing datasets and methods are typically designed for high-altitude and static-view imagery. They struggle to handle the unique characteristics of LAD views, such as diverse viewpoints and high object density. To fill this gap, we present RIS-LAD, the first fine-grained RIS benchmark tailored for LAD scenarios. This dataset comprises 13,871 carefully annotated image-text-mask triplets collected from realistic drone footage, with a focus on small, cluttered, and multi-viewpoint scenes. It highlights new challenges absent in previous benchmarks, such as category drift caused by tiny objects and object drift under crowded same-class objects. To tackle these issues, we propose the Semantic-Aware Adaptive Reasoning Network (SAARN). Rather than uniformly injecting all linguistic features, SAARN decomposes and routes semantic information to different stages of the network. Specifically, the Category-Dominated Linguistic Enhancement (CDLE) aligns visual features with object categories during early encoding, while the Adaptive Reasoning Fusion Module (ARFM) dynamically selects semantic cues across scales to improve reasoning in complex scenes. The experimental evaluation reveals that RIS-LAD presents substantial challenges to state-of-the-art RIS algorithms, and also demonstrates the effectiveness of our proposed model in addressing these challenges. The dataset and code will be publicly released soon at: https://github.com/AHideoKuzeA/RIS-LAD/.
参照图像分割(Referring Image Segmentation, RIS)旨在根据自然语言描述分割特定目标,在视觉-语言理解中扮演着至关重要的角色。尽管其在遥感应用方面取得了进展,但低空无人机(LAD)场景中的RIS仍然鲜有研究。现有的数据集和方法通常针对高空和静态视图图像设计,难以处理LAD视图的独特特征,例如多样的视角和高密度的目标。为了填补这一空白,我们推出了RIS-LAD,这是针对LAD场景定制的首个细粒度RIS基准数据集。该数据集包含从真实的无人机视频中收集的13,871个经过仔细注释的图像-文本-掩膜三元组,专注于小、杂乱和多视角的场景。它突出了以往基准中不存在的新挑战,例如由微小物体引起的类别漂移,以及拥挤的同类物体导致的对象漂移。为了解决这些问题,我们提出了语义感知自适应推理网络(SAARN)。SAARN不是统一地注入所有语言特征,而是将语义信息分解并路由到网络的不同阶段。具体来说,类别主导的语言增强(CDLE)在早期编码阶段将视觉特征与对象类别对齐,而自适应推理融合模块(ARFM)则动态选择跨尺度的语义线索,以改善复杂场景中的推理。实验评估表明,RIS-LAD给现有的RIS算法带来了实质性的挑战,同时也证明了我们提出的模型在应对这些挑战时的有效性。数据集和代码将很快在https://github.com/AHideoKuzeA/RIS-LAD/公开发布。
论文及项目相关链接
Summary
针对低空无人机(LAD)场景的参照图像分割(RIS)研究存在空白。现有数据集和方法主要面向高空和静态图像,难以处理LAD视角的独特特征,如不同视角和高物体密度。为此,我们推出RIS-LAD,首个针对LAD场景的精细RIS基准数据集。该数据集包含从真实无人机影像中收集的13,871个精心标注的图像-文本-掩膜三元组,重点关注小、杂乱、多视角场景。我们提出语义感知自适应推理网络(SAARN)来解决新问题,如因小物体引起的类别漂移和在拥挤的同类物体中的对象漂移。实验评估显示,RIS-LAD为当前RIS算法带来重大挑战,同时验证了我们的模型在应对这些挑战时的有效性。数据集和代码将在 https://github.com/AHideoKuzeA/RIS-LAD/ 公开。
Key Takeaways
- 低空无人机(LAD)场景的参照图像分割(RIS)研究尚待深入探索。
- 现有数据集和方法主要面向高空和静态图像,难以适应LAD视角的多样性。
- 推出RIS-LAD数据集,专为LAD场景设计,包含从真实无人机影像中精心标注的图像-文本-掩膜三元组。
- 数据集重点关注小、杂乱、多视角场景的分割问题。
- 提出语义感知自适应推理网络(SAARN)来解决类别漂移和对象漂移等问题。
- 实验评估显示,RIS-LAD对现有RIS算法构成重大挑战。
点此查看论文截图









DCFFSNet: Deep Connectivity Feature Fusion Separation Network for Medical Image Segmentation
Authors:Mingda Zhang, Xun Ye, Ruixiang Tang, Haiyan Ding
Medical image segmentation leverages topological connectivity theory to enhance edge precision and regional consistency. However, existing deep networks integrating connectivity often forcibly inject it as an additional feature module, resulting in coupled feature spaces with no standardized mechanism to quantify different feature strengths. To address these issues, we propose DCFFSNet (Dual-Connectivity Feature Fusion-Separation Network). It introduces an innovative feature space decoupling strategy. This strategy quantifies the relative strength between connectivity features and other features. It then builds a deep connectivity feature fusion-separation architecture. This architecture dynamically balances multi-scale feature expression. Experiments were conducted on the ISIC2018, DSB2018, and MoNuSeg datasets. On ISIC2018, DCFFSNet outperformed the next best model (CMUNet) by 1.3% (Dice) and 1.2% (IoU). On DSB2018, it surpassed TransUNet by 0.7% (Dice) and 0.9% (IoU). On MoNuSeg, it exceeded CSCAUNet by 0.8% (Dice) and 0.9% (IoU). The results demonstrate that DCFFSNet exceeds existing mainstream methods across all metrics. It effectively resolves segmentation fragmentation and achieves smooth edge transitions. This significantly enhances clinical usability.
医学图像分割可以利用拓扑连通性理论来提高边缘精度和区域一致性。然而,现有的深度网络在集成连通性时通常将其强行注入为附加特征模块,导致特征空间耦合,且缺乏量化不同特征强度的标准化机制。为了解决这些问题,我们提出了DCFFSNet(双连通特征融合分离网络)。它引入了一种创新的特征空间解耦策略,量化连通性特征与其他特征之间的相对强度,并在此基础上构建深度连通特征融合-分离架构,动态平衡多尺度特征表达。我们在ISIC2018、DSB2018和MoNuSeg数据集上进行了实验。在ISIC2018上,DCFFSNet的Dice系数和IoU分别比次优模型CMUNet高出1.3%和1.2%。在DSB2018上,它的Dice系数和IoU分别比TransUNet高出0.7%和0.9%。在MoNuSeg上,它的Dice系数和IoU分别比CSCAUNet高出0.8%和0.9%。结果表明,DCFFSNet在所有指标上都超过了现有的主流方法,有效解决了分割碎片化问题,并实现了平滑的边缘过渡,这显著提高了临床可用性。
论文及项目相关链接
PDF 16 pages , 11 figures
Summary
医学图像分割利用拓扑连通性理论提高边缘精度和区域一致性。针对现有深度网络集成连通性时强行注入作为附加特征模块的问题,提出DCFFSNet(双连通特征融合分离网络)。引入特征空间解耦策略,量化连通性特征与其他特征的相对强度,建立深度连通特征融合分离架构,动态平衡多尺度特征表达。在ISIC2018、DSB2018和MoNuSeg数据集上的实验表明,DCFFSNet在各项指标上均超过主流方法,有效解决分割碎片化问题,实现平滑边缘过渡,显著提高临床实用性。
Key Takeaways
- 医学图像分割利用拓扑连通性理论提高边缘和区域一致性。
- 现有网络集成连通性存在特征空间耦合问题,缺乏量化不同特征强度的标准化机制。
- DCFFSNet提出特征空间解耦策略,建立深度连通特征融合分离架构。
- DCFFSNet在多个数据集上表现优异,超过主流方法。
- DCFFSNet有效解决分割碎片化问题。
- DCFFSNet实现平滑边缘过渡。
点此查看论文截图





GeMix: Conditional GAN-Based Mixup for Improved Medical Image Augmentation
Authors:Hugo Carlesso, Maria Eliza Patulea, Moncef Garouani, Radu Tudor Ionescu, Josiane Mothe
Mixup has become a popular augmentation strategy for image classification, yet its naive pixel-wise interpolation often produces unrealistic images that can hinder learning, particularly in high-stakes medical applications. We propose GeMix, a two-stage framework that replaces heuristic blending with a learned, label-aware interpolation powered by class-conditional GANs. First, a StyleGAN2-ADA generator is trained on the target dataset. During augmentation, we sample two label vectors from Dirichlet priors biased toward different classes and blend them via a Beta-distributed coefficient. Then, we condition the generator on this soft label to synthesize visually coherent images that lie along a continuous class manifold. We benchmark GeMix on the large-scale COVIDx-CT-3 dataset using three backbones (ResNet-50, ResNet-101, EfficientNet-B0). When combined with real data, our method increases macro-F1 over traditional mixup for all backbones, reducing the false negative rate for COVID-19 detection. GeMix is thus a drop-in replacement for pixel-space mixup, delivering stronger regularization and greater semantic fidelity, without disrupting existing training pipelines. We publicly release our code at https://github.com/hugocarlesso/GeMix to foster reproducibility and further research.
Mixup已成为图像分类中流行的数据增强策略,但其简单的像素级插值经常生成不真实的图像,这可能会阻碍学习,特别是在高风险的医学应用中。我们提出了GeMix,这是一个两阶段的框架,它用基于类别条件生成对抗网络(GAN)的、可学习且感知标签的插值取代了启发式混合。首先,我们在目标数据集上训练StyleGAN2-ADA生成器。在数据增强过程中,我们从偏向不同类别的Dirichlet先验中采样两个标签向量,并通过服从Beta分布的系数将它们混合。然后,我们以该软标签为条件驱动生成器,合成位于连续类别流形上的、视觉上连贯的图像。我们在大规模的COVIDx-CT-3数据集上使用三种主干网络(ResNet-50、ResNet-101、EfficientNet-B0)对GeMix进行基准测试。当与真实数据结合时,我们的方法在所有主干网络上都比传统mixup提高了宏观F1分数,并降低了COVID-19检测的假阴性率。因此,GeMix可以直接替换像素空间的mixup,提供更强的正则化和更高的语义保真度,而不会破坏现有的训练管道。我们已在https://github.com/hugocarlesso/GeMix公开发布代码,以促进可复现性和进一步研究。
论文及项目相关链接
PDF Accepted at CBMI 2025
摘要
本文提出了一个名为GeMix的两阶段框架,用于改进图像分类中的mixup增强策略。传统的mixup使用像素级插值,容易产生不真实的图像,特别是在高风险的医学应用中可能阻碍学习。GeMix使用基于类条件生成对抗网络(GAN)的标签感知插值替代启发式混合方法:先在目标数据集上训练StyleGAN2-ADA生成器,再从偏向不同类别的Dirichlet先验采样两个标签向量,并通过Beta分布系数进行混合,然后将此软标签作为条件输入生成器,合成沿连续类别流形分布、视觉上连贯的图像。在大型COVIDx-CT-3数据集上的基准测试显示,与真实数据结合使用时,GeMix相对于传统混合方法提高了所有骨干网络的宏观F1分数,并降低了COVID-19检测的假阴性率。因此,GeMix可以作为像素空间混合的替代品,提供更强大的正则化和更高的语义保真度,而不会破坏现有的训练管道。我们公开发布了代码以促进可复现性和进一步研究。
关键见解
1. 传统混合策略常常产生不真实的图像,尤其是在医学应用中可能阻碍学习。
2. GeMix是一个基于类条件GANs的两阶段框架,通过标签感知插值替换启发式混合方法。
3. GeMix使用StyleGAN2-ADA生成器合成视觉上连贯的图像,这些图像沿着连续类流形分布。
4. 在大型COVIDx-CT-3数据集上进行的基准测试表明,GeMix提高了宏观F1分数并降低了COVID-19检测的假阴性率。
5. GeMix作为一种替代像素空间混合的方法,提供了更强的正则化和更高的语义保真度。
6. GeMix不会破坏现有的训练管道,公开发布的源代码可促进可复现性和进一步研究。
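下面示意GeMix中"软标签采样"这一核心步骤:从分别偏向两个类别的Dirichlet先验各采样一个标签向量,再用Beta分布系数混合,得到可用作类条件生成器条件输入的连续软标签;各超参数取值均为示意性假设。

```python
# 示意性草图:Dirichlet 先验 + Beta 系数的软标签混合(用于条件GAN的标签输入)
import numpy as np

def sample_soft_label(num_classes, class_a, class_b, bias=5.0, beta=2.0, rng=None):
    rng = rng or np.random.default_rng()
    # 两个分别偏向 class_a / class_b 的 Dirichlet 先验
    alpha_a = np.ones(num_classes); alpha_a[class_a] += bias
    alpha_b = np.ones(num_classes); alpha_b[class_b] += bias
    y_a = rng.dirichlet(alpha_a)
    y_b = rng.dirichlet(alpha_b)
    lam = rng.beta(beta, beta)                       # Beta 分布的混合系数
    return lam * y_a + (1.0 - lam) * y_b             # 仍是合法的概率向量

# 用法示意:得到的软标签可作为类条件生成器的条件向量
soft_label = sample_soft_label(num_classes=3, class_a=0, class_b=2)
print(soft_label, soft_label.sum())                  # 各分量非负,且和为1
```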
点此查看论文截图




Interpretability-Aware Pruning for Efficient Medical Image Analysis
Authors:Nikita Malik, Pratinav Seth, Neeraj Kumar Singh, Chintan Chitroda, Vinay Kumar Sankarapu
Deep learning has driven significant advances in medical image analysis, yet its adoption in clinical practice remains constrained by the large size and lack of transparency in modern models. Advances in interpretability techniques such as DL-Backtrace, Layer-wise Relevance Propagation, and Integrated Gradients make it possible to assess the contribution of individual components within neural networks trained on medical imaging tasks. In this work, we introduce an interpretability-guided pruning framework that reduces model complexity while preserving both predictive performance and transparency. By selectively retaining only the most relevant parts of each layer, our method enables targeted compression that maintains clinically meaningful representations. Experiments across multiple medical image classification benchmarks demonstrate that this approach achieves high compression rates with minimal loss in accuracy, paving the way for lightweight, interpretable models suited for real-world deployment in healthcare settings.
深度学习已在医学图像分析方面取得显著进展,但在临床实践中应用仍受到现代模型体积庞大和不透明的限制。诸如DL-Backtrace、逐层相关性传播和集成梯度等解释技术的进展,使得评估经过医学成像任务训练的神经网络中各个组件的贡献成为可能。在这项工作中,我们引入了一个以解释性为指导的修剪框架,该框架能够在保持预测性能和透明度的同时,降低模型的复杂性。通过有选择地保留每层最相关的部分,我们的方法能够实现有针对性的压缩,同时保持对临床有意义的表示。在多个医学图像分类基准测试上的实验表明,该方法实现了较高的压缩率,并且精度损失较小,这为在医疗环境中部署轻便、可解释性强的模型铺平了道路。
论文及项目相关链接
PDF Accepted at The 1st MICCAI Workshop on Efficient Medical AI 2025
Summary
深度学习在医学图像分析领域取得了显著进展,但仍存在模型体积大、透明度不足等问题,限制了其在临床实践中的应用。采用DL-Backtrace、逐层相关性传播和集成梯度等解释性技术,可以评估神经网络中对医学成像任务训练各组成部分的作用。本研究引入了一种以解释性为指导的剪枝框架,旨在降低模型复杂度,同时保留预测性能和透明度。通过选择性保留每层最相关的部分,我们的方法能够实现有针对性的压缩,保持临床意义的表示。在多医学图像分类基准测试上的实验表明,该方法实现了高压缩率且准确性损失极小,为适合在医疗环境中实际部署的轻量级、可解释性模型铺平了道路。
Key Takeaways
- 深度学习在医学图像分析上取得显著进展,但实际应用中面临模型体积大、透明度不足的问题。
- 采用解释性技术(如DL-Backtrace、逐层相关性传播和集成梯度)以评估神经网络中各组成部分的作用。
- 引入了一种以解释性为指导的剪枝框架,旨在降低模型复杂度同时保留预测性能和透明度。
- 通过选择性保留神经网络层中最相关的部分,实现有针对性的模型压缩。
- 方法能够保持临床意义的表示,使医学图像分析更加透明和可解释。
- 实验证明,该方法在多个医学图像分类基准测试上实现了高压缩率和较小的准确性损失。
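下面用一个玩具例子示意"按相关性分数保留最重要通道"的剪枝思路:假设每个卷积输出通道的相关性分数已由解释方法(如LRP或Integrated Gradients)得到,则只保留分数最高的若干通道;相关性来源与保留比例均为假设,并非论文官方流程。

```python
# 示意性草图:按通道相关性分数进行结构化剪枝(非论文官方实现)
import torch
import torch.nn as nn

def prune_conv_by_relevance(conv: nn.Conv2d, relevance: torch.Tensor, keep_ratio=0.5):
    """relevance: (out_channels,) 每个输出通道的相关性分数(假设已由解释方法得到)。"""
    k = max(1, int(conv.out_channels * keep_ratio))
    keep_idx = torch.topk(relevance, k).indices.sort().values
    new_conv = nn.Conv2d(conv.in_channels, k, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep_idx])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep_idx])
    return new_conv, keep_idx

# 用法示意:保留相关性最高的一半通道(后续层的输入通道需做相应调整)
conv = nn.Conv2d(3, 16, 3, padding=1)
relevance = torch.rand(16)                          # 假设来自 LRP / IG 的通道级分数
pruned, kept = prune_conv_by_relevance(conv, relevance)
print(pruned.weight.shape)                          # torch.Size([8, 3, 3, 3])
```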
点此查看论文截图



Uncertainty-Aware Information Pursuit for Interpretable and Reliable Medical Image Analysis
Authors:Md Nahiduzzaman, Steven Korevaar, Zongyuan Ge, Feng Xia, Alireza Bab-Hadiashar, Ruwan Tennakoon
To be adopted in safety-critical domains like medical image analysis, AI systems must provide human-interpretable decisions. Variational Information Pursuit (V-IP) offers an interpretable-by-design framework by sequentially querying input images for human-understandable concepts, using their presence or absence to make predictions. However, existing V-IP methods overlook sample-specific uncertainty in concept predictions, which can arise from ambiguous features or model limitations, leading to suboptimal query selection and reduced robustness. In this paper, we propose an interpretable and uncertainty-aware framework for medical imaging that addresses these limitations by accounting for upstream uncertainties in concept-based, interpretable-by-design models. Specifically, we introduce two uncertainty-aware models, EUAV-IP and IUAV-IP, that integrate uncertainty estimates into the V-IP querying process to prioritize more reliable concepts per sample. EUAV-IP skips uncertain concepts via masking, while IUAV-IP incorporates uncertainty into query selection implicitly for more informed and clinically aligned decisions. Our approach allows models to make reliable decisions based on a subset of concepts tailored to each individual sample, without human intervention, while maintaining overall interpretability. We evaluate our methods on five medical imaging datasets across four modalities: dermoscopy, X-ray, ultrasound, and blood cell imaging. The proposed IUAV-IP model achieves state-of-the-art accuracy among interpretable-by-design approaches on four of the five datasets, and generates more concise explanations by selecting fewer yet more informative concepts. These advances enable more reliable and clinically meaningful outcomes, enhancing model trustworthiness and supporting safer AI deployment in healthcare.
在医疗图像分析等安全关键领域,人工智能系统必须提供人类可解释的决策。变分信息追求(V-IP)提供了一个"设计即可解释"的框架,通过顺序查询输入图像中人类可理解的概念,并利用这些概念的存在与否来进行预测。然而,现有的V-IP方法忽略了概念预测中样本特定的不确定性,这种不确定性可能来源于特征模糊或模型局限性,从而导致查询选择欠佳和稳健性降低。在本文中,我们提出了一个用于医学影像的可解释且感知不确定性的框架,通过在基于概念、设计即可解释的模型中考虑上游不确定性来解决这些局限。具体来说,我们引入了两种不确定性感知模型EUAV-IP和IUAV-IP,它们将不确定性估计整合到V-IP查询过程中,以便对每个样本优先使用更可靠的概念。EUAV-IP通过屏蔽跳过不确定概念,而IUAV-IP则将不确定性隐式地纳入查询选择中,以做出更明智且与临床相符的决策。我们的方法允许模型基于针对每个样本量身定制的概念子集做出可靠决策,无需人工干预,同时保持整体可解释性。我们在涵盖四种模态(皮肤镜、X射线、超声和血细胞成像)的五个医学影像数据集上评估了我们的方法。在五个数据集中的四个上,所提出的IUAV-IP模型在"设计即可解释"方法中取得了最先进的准确率,并通过选择更少但更有信息量的概念给出更简洁的解释。这些进展使结果更加可靠和更具临床意义,提高了模型的可信度,并支持人工智能在医疗保健中更安全的部署。
论文及项目相关链接
Summary
在医学图像分析等重要领域应用AI系统时,需要提供人类可解释的决策。本研究提出了一种考虑概念预测不确定性的感知框架,以解决现有变分信息追求(V-IP)方法在概念查询中的不足。研究提出了两种不确定性感知模型EUAV-IP和IUAV-IP,将不确定性估计融入V-IP查询过程,优先处理每个样本更可靠的概念。其中,EUAV-IP通过屏蔽跳过不确定概念,而IUAV-IP则将不确定性纳入查询选择中,以做出更具针对性和临床对齐的决策。评估结果表明,该模型在五个医学影像数据集中的四个上取得了优异的准确性,并通过选择更少而更具信息性的概念生成更简洁的解释。该技术能更可靠和有意义地支持医疗健康领域的人工智能部署,并提升模型的可信度。
Key Takeaways
- AI系统在医学图像分析等领域需要提供人类可解释的决策。
- 变分信息追求(V-IP)框架通过查询输入图像中的可理解概念进行预测,但忽略了样本特定的不确定性。
- 现有方法由于忽略概念预测的不确定性,可能导致查询选择不理想和鲁棒性降低。
- 研究者提出了一种不确定性感知框架,旨在解决上述问题,考虑了概念性设计中上游的不确定性问题。
- EUAV-IP模型通过屏蔽不确定概念提高决策可靠性,而IUAV-IP模型将不确定性纳入查询选择中,以生成更精确且临床对齐的解释。
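下面用MC Dropout示意"先估计每个概念预测的不确定性,再屏蔽高不确定概念"的做法(对应EUAV-IP跳过不确定概念的思路):多次随机前向得到概念概率,用二值预测熵作为不确定性度量;网络结构与阈值均为示意性假设。

```python
# 示意性草图:用MC Dropout估计概念级不确定性并屏蔽高不确定概念
import torch
import torch.nn as nn

concept_net = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                            nn.Dropout(0.3), nn.Linear(128, 20))   # 假设有20个概念

def concept_uncertainty(x, n_samples=20):
    concept_net.train()                           # 保持Dropout开启以做MC采样
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(concept_net(x)) for _ in range(n_samples)])
    p = probs.mean(0)                             # (B, 20) 概念概率均值
    entropy = -(p * p.clamp_min(1e-8).log()
                + (1 - p) * (1 - p).clamp_min(1e-8).log())
    return p, entropy                             # 熵越大,概念预测越不确定

# 用法示意:屏蔽熵超过阈值的概念,只用可靠概念参与后续的V-IP查询
x = torch.randn(4, 64)
p, ent = concept_uncertainty(x)
mask = ent < 0.6                                  # 阈值为示意性假设
reliable_p = torch.where(mask, p, torch.full_like(p, 0.5))   # 不确定概念置为无信息值
```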
点此查看论文截图


A Quad-Step Approach to Uncertainty-Aware Deep Learning for Skin Cancer Classification
Authors:Hamzeh Asgharnezhad, Pegah Tabarisaadi, Abbas Khosravi, Roohallah Alizadehsani, U. Rajendra Acharya
Accurate skin cancer diagnosis is vital for early treatment and improved patient outcomes. Deep learning (DL) models have shown promise in automating skin cancer classification, yet challenges remain due to data scarcity and limited uncertainty awareness. This study presents a comprehensive evaluation of DL-based skin lesion classification with transfer learning and uncertainty quantification (UQ) on the HAM10000 dataset. We benchmark several pre-trained feature extractors – including CLIP variants, ResNet50, DenseNet121, VGG16, and EfficientNet-V2-Large – combined with traditional classifiers such as SVM, XGBoost, and logistic regression. Multiple principal component analysis (PCA) settings (64, 128, 256, 512) are explored, with LAION CLIP ViT-H/14 and ViT-L/14 at PCA-256 achieving the strongest baseline results. In the UQ phase, Monte Carlo Dropout (MCD), Ensemble, and Ensemble Monte Carlo Dropout (EMCD) are applied and evaluated using uncertainty-aware metrics (UAcc, USen, USpe, UPre). Ensemble methods with PCA-256 provide the best balance between accuracy and reliability. Further improvements are obtained through feature fusion of top-performing extractors at PCA-256. Finally, we propose a feature-fusion based model trained with a predictive entropy (PE) loss function, which outperforms all prior configurations across both standard and uncertainty-aware evaluations, advancing trustworthy DL-based skin cancer diagnosis.
精确的皮肤癌诊断对于早期治疗并改善患者的治疗效果至关重要。深度学习模型在自动进行皮肤癌分类方面展现出巨大潜力,但由于数据稀缺和缺乏不确定性意识,仍存在挑战。本研究对基于深度学习的皮肤病变分类进行了全面评估,采用迁移学习和不确定性量化(UQ)在HAM10000数据集上进行实验。我们基准测试了多种预训练特征提取器,包括CLIP变体、ResNet50、DenseNet121、VGG16和EfficientNet-V2-Large,结合传统分类器,如SVM、XGBoost和逻辑回归。探索了多个主成分分析(PCA)设置(64、128、256、512),在PCA-256设置下,LAION CLIP ViT-H/14和ViT-L/14取得了最强的基线结果。在不确定性量化阶段,应用了蒙特卡洛Dropout(MCD)、集成方法和集成蒙特卡洛Dropout(EMCD),并使用不确定性感知指标(UAcc、USen、USpe、UPre)进行评估。PCA-256的集成方法提供了准确性和可靠性之间的最佳平衡。通过融合表现最佳的提取器在PCA-256上的特征,获得了进一步的改进。最后,我们提出了一种基于特征融合的模型,该模型使用预测熵(PE)损失函数进行训练,在标准和不确定性感知评估中都优于所有先前的配置,从而推进了基于深度学习的可信皮肤癌诊断。
论文及项目相关链接
摘要
本文研究了深度学习在皮肤癌诊断中的应用,特别是在皮肤病变分类方面的表现。文章使用了迁移学习和不确定性量化技术,并在HAM10000数据集上进行了评估。研究对比了多种预训练特征提取器的性能,包括CLIP变体、ResNet50、DenseNet121、VGG16和EfficientNet-V2-Large等,并结合传统分类器如SVM、XGBoost和逻辑回归。研究发现,LAION CLIP ViT-H/14和ViT-L/14在PCA-256设置下取得了最佳基线结果。在不确定性量化阶段,应用了Monte Carlo Dropout、Ensemble和Ensemble Monte Carlo Dropout等方法,并使用不确定性感知指标进行评估。融合特征的方法在提升准确性和可靠性方面表现最佳。最后,研究提出了一种基于特征融合的模型,该模型使用预测熵损失函数进行训练,在标准评估和不确定性感知评估中都表现出最佳性能,为可信赖的深度学习皮肤癌诊断提供了新的方向。
关键见解
- 研究评估了深度学习模型在皮肤癌诊断中的潜力,特别是在自动化皮肤病变分类方面的应用。
- 使用迁移学习和不确定性量化技术来提高模型的性能。
- 多种预训练特征提取器与传统分类器的结合,在皮肤病变分类中进行了对比研究。
- LAION CLIP ViT模型在特定设置下取得了最佳基线结果。
- Ensemble方法和特征融合有助于提高模型的准确性和可靠性。
- 提出了基于特征融合的模型,采用预测熵损失函数,在评估和不确定性感知评估中都表现优异。
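下面示意"不确定性感知准确率(UAcc)"这类指标的一种常见计算方式:把每个预测按"正确/错误"与"确定/不确定"交叉划分,理想情况是"正确且确定"或"错误且不确定";这里的熵阈值与具体定义细节是示意性假设,未必与论文使用的公式完全一致。

```python
# 示意性草图:基于预测熵的不确定性感知准确率(UAcc)计算
import numpy as np

def uncertainty_aware_accuracy(probs, labels, threshold=0.5):
    """probs: (N, C) 多次MC采样平均后的类别概率;labels: (N,) 真实标签。"""
    preds = probs.argmax(axis=1)
    entropy = -(probs * np.log(probs + 1e-8)).sum(axis=1)
    entropy = entropy / np.log(probs.shape[1])         # 归一化到 [0, 1]
    certain = entropy < threshold
    correct = preds == labels
    tc = np.sum(correct & certain)                     # 正确且确定(理想)
    tu = np.sum(~correct & ~certain)                   # 错误且不确定(理想)
    return (tc + tu) / len(labels)

# 用法示意
probs = np.array([[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.1, 0.8, 0.1]])
labels = np.array([0, 2, 1])
print(uncertainty_aware_accuracy(probs, labels))
```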
点此查看论文截图




Image Segmentation and Classification of E-waste for Training Robots for Waste Segregation
Authors:Prakriti Tripathi
Industry partners provided a problem statement that involves classifying electronic waste using machine learning models that will be used by pick-and-place robots for waste segregation. This was achieved by taking common electronic waste items, such as a mouse and charger, unsoldering them, and taking pictures to create a custom dataset. Then state-of-the art YOLOv11 model was trained and run to achieve 70 mAP in real-time. Mask-RCNN model was also trained and achieved 41 mAP. The model can be integrated with pick-and-place robots to perform segregation of e-waste.
产业合作伙伴提供了一个问题陈述:使用机器学习模型对电子废物进行分类,这些模型将由拾取放置机器人用于废物分拣。为此,我们选取常见的电子废物物品(如鼠标和充电器),对其进行拆解(解焊)并拍照,构建了自定义数据集。随后训练并运行了最先进的YOLOv11模型,在实时运行下达到70 mAP;同时训练的Mask-RCNN模型达到41 mAP。该模型可以集成到拾取放置机器人中,以执行电子废物的分拣。
论文及项目相关链接
PDF 3 pages, 2 figures, submitted to 2025 5th International Conference on AI-ML-Systems (AIMLSystems)
Summary
本论文描述了如何利用机器学习模型对电子废物进行分类,供拾取放置机器人进行废物分拣。通过对常见的电子废物(如鼠标和充电器)进行拆解并拍照创建自定义数据集,训练了先进的YOLOv11模型,实现了实时70 mAP的检测效果;同时训练的Mask-RCNN模型取得了41 mAP。该模型可集成到拾取放置机器人中,实现电子废物的分拣。
Key Takeaways
- 行业合作伙伴提出了一个涉及电子废物分类的问题,该问题将通过机器学习模型解决,并由拾取放置机器人用于废物分离。
- 通过拆解常见的电子废物物品并拍照创建自定义数据集来解决这个问题。
- 先进的YOLOv11模型被训练并用于实时分类电子废物,达到了70 mAP的效果。
- Mask-RCNN模型也被训练用于电子废物分类,取得了41 mAP的分类效果。
- 训练出的模型可以集成到拾取放置机器人中,实现自动化的废物分离。
- 此方案对于处理电子废物具有实际应用价值。
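下面示意用Ultralytics接口在自定义电子废物数据集上微调YOLO检测模型的典型流程;这里假设已安装ultralytics包并准备好YOLO格式的 ewaste.yaml 数据描述文件,权重文件名与超参数均为示意(原文所用YOLOv11与Mask-RCNN的具体配置并未给出)。

```python
# 示意性草图:用Ultralytics接口在自定义电子废物数据集上训练/评估检测模型
# 假设:已 pip install ultralytics,且 ewaste.yaml 按YOLO格式描述了类别与图像路径
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                                 # 载入预训练权重(文件名为假设)
model.train(data="ewaste.yaml", epochs=100, imgsz=640)     # 迁移学习微调
metrics = model.val()                                      # 在验证集上计算mAP等指标
results = model.predict("sample_mouse.jpg")                # 对单张图片推理(路径为占位示例)
```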
点此查看论文截图

