⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace 免费体验
2025-09-08 更新
Masked Autoencoder Pretraining and BiXLSTM ResNet Architecture for PET/CT Tumor Segmentation
Authors:Moona Mazher, Steven A Niederer, Abdul Qayyum
The accurate segmentation of lesions in whole-body PET/CT imaging is essential for tumor characterization, treatment planning, and response assessment, yet current manual workflows are labor-intensive and prone to inter-observer variability. Automated deep learning methods have shown promise but often remain limited by modality specificity, isolated time points, or insufficient integration of expert knowledge. To address these challenges, we present a two-stage lesion segmentation framework developed for the fourth AutoPET Challenge. In the first stage, a Masked Autoencoder (MAE) is employed for self-supervised pretraining on unlabeled PET/CT and longitudinal CT scans, enabling the extraction of robust modality-specific representations without manual annotations. In the second stage, the pretrained encoder is fine-tuned with a bidirectional XLSTM architecture augmented with ResNet blocks and a convolutional decoder. By jointly leveraging anatomical (CT) and functional (PET) information as complementary input channels, the model achieves improved temporal and spatial feature integration. Evaluation on the AutoPET Task 1 dataset demonstrates that self-supervised pretraining significantly enhances segmentation accuracy, achieving a Dice score of 0.582 compared to 0.543 without pretraining. These findings highlight the potential of combining self-supervised learning with multimodal fusion for robust and generalizable PET/CT lesion segmentation. Code will be available at https://github.com/RespectKnowledge/AutoPet_2025_BxLSTM_UNET_Segmentation
在全身PET/CT成像中,病变的精确分割对于肿瘤特征描述、治疗计划制定和疗效评估至关重要。然而,当前的手动工作流程劳动强度大,且存在观察者间变异的可能性。虽然自动深度学习的方法已经展现出了一定的潜力,但通常受到模态特异性、孤立时间点或专家知识整合不足的限制。
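下面给出一个极简的示意性代码草图,帮助理解第一阶段“掩码自编码器(MAE)自监督预训练”的基本思路:随机遮蔽 PET/CT 双通道体素块,只对被遮蔽的部分计算重建损失。该代码并非论文官方实现,其中的网络结构、patch 大小与遮蔽比例均为假设,仅作演示(PyTorch):

```python
import torch
import torch.nn as nn

class TinyMAE3D(nn.Module):
    """极简的 3D Masked Autoencoder 示意:随机遮蔽体素块,仅对被遮蔽部分计算重建损失。"""
    def __init__(self, in_ch=2, dim=64, mask_ratio=0.75, patch=8):
        super().__init__()
        self.mask_ratio, self.patch = mask_ratio, patch
        # 将 8x8x8 的 PET/CT 双通道体素块映射为 token
        self.embed = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2)
        self.decoder = nn.Linear(dim, in_ch * patch ** 3)  # 把每个 token 解码回原始体素块

    def forward(self, x):                                    # x: (B, 2, D, H, W)
        tokens = self.embed(x).flatten(2).transpose(1, 2)    # (B, N, dim)
        B, N, _ = tokens.shape
        mask = torch.rand(B, N, device=x.device) < self.mask_ratio
        tokens = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # 简化:用 0 代替可学习的 mask token
        recon = self.decoder(self.encoder(tokens))             # (B, N, in_ch*patch^3)
        target = (x.unfold(2, self.patch, self.patch)
                    .unfold(3, self.patch, self.patch)
                    .unfold(4, self.patch, self.patch)
                    .permute(0, 2, 3, 4, 1, 5, 6, 7).reshape(B, N, -1))
        return ((recon - target) ** 2)[mask].mean()            # 仅在被遮蔽位置计算重建损失

# 用法示例:双通道(CT+PET)的 64^3 体素块
model = TinyMAE3D()
loss = model(torch.randn(2, 2, 64, 64, 64))
loss.backward()
```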
论文及项目相关链接
Summary
本文介绍了一种针对全身PET/CT影像的两阶段病变分割框架,以解决肿瘤特征化、治疗规划和疗效评估中的关键挑战。第一阶段采用Masked Autoencoder在无标注的PET/CT和纵向CT扫描上进行自监督预训练,提取稳健的模态特异性表示;第二阶段对预训练的编码器进行微调,结合带有ResNet块的双向XLSTM架构和卷积解码器。该模型利用解剖(CT)和功能(PET)信息作为互补输入通道,实现了时空特征的集成。在AutoPET Task 1数据集上的评估表明,自监督预训练显著提高了分割准确性,Dice得分从0.543提升至0.582。这表明自监督学习与多模态融合的结合在PET/CT病变分割中具有潜力和广阔前景。
Key Takeaways
- PET/CT影像的精确病变分割对于肿瘤表征、治疗规划和疗效评估至关重要。
- 当前手动工作流程劳动强度大,且存在观察者间变异的问题。
- 自动化深度学习方法虽具潜力,但仍面临模态特异性、孤立时间点或缺乏专家知识整合等挑战。
- 提出的两阶段病变分割框架结合了自监督预训练和双向XLSTM架构,以提高PET/CT影像的分割准确性。
- 自监督预训练在无标签PET/CT和纵向CT扫描数据上有效提取了稳健的模态特异性表示。
- 结合解剖(CT)和功能(PET)信息作为互补输入通道,实现了时空特征的集成,进一步提高了分割准确性。
点此查看论文截图





GPSToken: Gaussian Parameterized Spatially-adaptive Tokenization for Image Representation and Generation
Authors:Zhengqiang Zhang, Rongyuan Wu, Lingchen Sun, Lei Zhang
Effective and efficient tokenization plays an important role in image representation and generation. Conventional methods, constrained by uniform 2D/1D grid tokenization, are inflexible to represent regions with varying shapes and textures and at different locations, limiting their efficacy of feature representation. In this work, we propose $\textbf{GPSToken}$, a novel $\textbf{G}$aussian $\textbf{P}$arameterized $\textbf{S}$patially-adaptive $\textbf{Token}$ization framework, to achieve non-uniform image tokenization by leveraging parametric 2D Gaussians to dynamically model the shape, position, and textures of different image regions. We first employ an entropy-driven algorithm to partition the image into texture-homogeneous regions of variable sizes. Then, we parameterize each region as a 2D Gaussian (mean for position, covariance for shape) coupled with texture features. A specialized transformer is trained to optimize the Gaussian parameters, enabling continuous adaptation of position/shape and content-aware feature extraction. During decoding, Gaussian parameterized tokens are reconstructed into 2D feature maps through a differentiable splatting-based renderer, bridging our adaptive tokenization with standard decoders for end-to-end training. GPSToken disentangles spatial layout (Gaussian parameters) from texture features to enable efficient two-stage generation: structural layout synthesis using lightweight networks, followed by structure-conditioned texture generation. Experiments demonstrate the state-of-the-art performance of GPSToken, which achieves rFID and FID scores of 0.65 and 1.50 on image reconstruction and generation tasks using 128 tokens, respectively. Codes and models of GPSToken can be found at $\href{https://github.com/xtudbxk/GPSToken}{https://github.com/xtudbxk/GPSToken}$.
有效且高效的令牌化在图像表示和生成中起着重要作用。传统方法受到统一2D/1D网格令牌化的限制,难以表示形状和纹理各异以及位置不同的区域,从而限制了其特征表示的有效性。在这项工作中,我们提出了GPSToken这一新型的高斯参数化空间自适应令牌化框架,利用参数化2D高斯模型来动态模拟不同图像区域的形状、位置和纹理,从而实现非均匀图像令牌化。我们首先采用一种基于熵的算法将图像分割成不同大小的纹理均匀区域。然后,我们将每个区域参数化为一个包含纹理特征的二维高斯(均值用于位置,协方差用于形状)。训练了专门的变压器来优化高斯参数,实现了位置/形状的持续自适应和内容感知特征提取。在解码过程中,通过可微分的基于平铺的渲染器将高斯参数化令牌重建为二维特征图,使我们的自适应令牌化与标准解码器相结合,实现端到端的训练。GPSToken将空间布局(高斯参数)与纹理特征分离,实现了高效的两个阶段生成:使用轻量级网络进行结构布局合成,然后进行结构条件纹理生成。实验表明,GPSToken的性能处于领先水平,在图像重建和生成任务中使用128个令牌实现了rFID和FID得分分别为0.65和1.50。GPSToken的代码和模型可以在https://github.com/xtudbxk/GPSToken找到。
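下面给出一个可微“高斯溅射(splatting)”渲染的极简草图,演示如何把带有均值、协方差和特征向量的高斯参数化 token 渲染回二维特征图。其中协方差由尺度与旋转角构造、像素级归一化等做法均为假设,并非论文的官方渲染器,仅帮助理解这一步的机制(PyTorch):

```python
import torch

def splat_gaussian_tokens(means, log_scales, thetas, feats, H=32, W=32):
    """
    将 N 个高斯参数化 token 渲染为 (C, H, W) 特征图的示意实现。
    means: (N, 2) 取值在 [0,1];log_scales: (N, 2);thetas: (N,);feats: (N, C)
    """
    N, C = feats.shape
    ys, xs = torch.meshgrid(torch.linspace(0, 1, H), torch.linspace(0, 1, W), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)              # (H*W, 2) 像素坐标

    # 由尺度与旋转角构造每个 token 的 2x2 协方差:Sigma = R diag(s^2) R^T
    s = log_scales.exp()
    cos, sin = thetas.cos(), thetas.sin()
    R = torch.stack([torch.stack([cos, -sin], -1), torch.stack([sin, cos], -1)], -2)
    Sigma = R @ torch.diag_embed(s ** 2) @ R.transpose(-1, -2)
    Sigma_inv = torch.linalg.inv(Sigma + 1e-6 * torch.eye(2))

    d = grid.unsqueeze(0) - means.unsqueeze(1)                       # (N, H*W, 2)
    maha = torch.einsum("npi,nij,npj->np", d, Sigma_inv, d)          # 马氏距离平方
    w = torch.exp(-0.5 * maha)                                       # (N, H*W) 高斯权重
    w = w / (w.sum(dim=0, keepdim=True) + 1e-8)                      # 像素级归一化,保持可微
    feat_map = (w.unsqueeze(-1) * feats.unsqueeze(1)).sum(dim=0)     # (H*W, C)
    return feat_map.T.reshape(C, H, W)

# 用法:128 个 token、16 维特征渲染成 32x32 特征图
fm = splat_gaussian_tokens(torch.rand(128, 2), torch.zeros(128, 2) - 2,
                           torch.rand(128), torch.randn(128, 16))
```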
论文及项目相关链接
摘要
本文提出了一种新型的基于高斯参数化的空间自适应令牌化框架GPSToken,用于实现非均匀图像令牌化。该方法利用参数化二维高斯动态建模图像不同区域的形状、位置和纹理,实现了高效和有效的图像表示和生成。该框架包括基于熵驱动的算法将图像划分为纹理均匀的可变区域,并将每个区域参数化为二维高斯模型,与纹理特征相结合。训练专用转换器以优化高斯参数,实现位置/形状的连续适应和内容感知特征提取。解码过程中,高斯参数化令牌通过可微分的平铺渲染器重建为二维特征图,将自适应令牌化与标准解码器相结合进行端到端的训练。GPSToken将空间布局(高斯参数)与纹理特征分开,实现了高效的两阶段生成:结构布局的合成使用轻量级网络,然后进行结构条件下的纹理生成。实验表明,GPSToken在图像重建和生成任务上达到了领先水平,使用128个令牌分别实现了0.65的rFID和1.50的FID分数。
关键见解
- 提出了一种新型的高斯参数化空间自适应令牌化框架GPSToken,实现了非均匀图像令牌化。
- 利用参数化二维高斯动态建模图像不同区域的形状、位置和纹理。
- 采用了基于熵驱动的算法将图像划分为纹理均匀的可变区域。
- 通过训练专用转换器优化高斯参数,实现了位置/形状的连续适应和内容感知特征提取。
- 通过可微分的平铺渲染器将高斯参数化令牌重建为二维特征图,便于与标准解码器结合进行端到端训练。
- GPSToken实现了高效的两阶段生成过程,包括结构布局的合成和纹理生成。
点此查看论文截图




SpectMamba: Integrating Frequency and State Space Models for Enhanced Medical Image Detection
Authors:Yao Wang, Dong Yang, Zhi Qiao, Wenjian Huang, Liuzhi Yang, Zhen Qian
Abnormality detection in medical imaging is a critical task requiring both high efficiency and accuracy to support effective diagnosis. While convolutional neural networks (CNNs) and Transformer-based models are widely used, both face intrinsic challenges: CNNs have limited receptive fields, restricting their ability to capture broad contextual information, and Transformers encounter prohibitive computational costs when processing high-resolution medical images. Mamba, a recent innovation in natural language processing, has gained attention for its ability to process long sequences with linear complexity, offering a promising alternative. Building on this foundation, we present SpectMamba, the first Mamba-based architecture designed for medical image detection. A key component of SpectMamba is the Hybrid Spatial-Frequency Attention (HSFA) block, which separately learns high- and low-frequency features. This approach effectively mitigates the loss of high-frequency information caused by frequency bias and correlates frequency-domain features with spatial features, thereby enhancing the model’s ability to capture global context. To further improve long-range dependencies, we propose the Visual State-Space Module (VSSM) and introduce a novel Hilbert Curve Scanning technique to strengthen spatial correlations and local dependencies, further optimizing the Mamba framework. Comprehensive experiments show that SpectMamba achieves state-of-the-art performance while being both effective and efficient across various medical image detection tasks.
医学成像中的异常检测是一项既需要高效率又需要准确性的重要任务,以支持有效的诊断。虽然卷积神经网络(CNN)和基于Transformer的模型已经得到了广泛的应用,但它们都面临着固有的挑战:CNN的感受野有限,限制了它们捕获广泛上下文信息的能力;而Transformer在处理高分辨率医学图像时遭遇了过高的计算成本。作为自然语言处理领域的一项最新创新,Mamba因其以线性复杂度处理长序列的能力而受到关注,它为我们提供了一个有前景的替代方案。在此基础上,我们提出了首个专为医学图像检测设计的基于Mamba的架构——SpectMamba。SpectMamba的关键组件是混合空间频率注意力(HSFA)块,该块分别学习高频和低频特征。这种方法有效地减轻了由频率偏差导致的高频信息损失,并将频域特征与空间特征相关联,从而增强了模型捕获全局上下文的能力。为了进一步改善远程依赖关系,我们提出了视觉状态空间模块(VSSM),并引入了一种新型的Hilbert曲线扫描技术来加强空间关联和局部依赖关系,进一步优化了Mamba框架。综合实验表明,SpectMamba在多种医学图像检测任务中达到了最先进的性能,既有效又高效。
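下面是一个示意性的代码草图,演示 HSFA 中“高低频分离学习”这一核心想法:用 FFT 把特征分成低频与高频两路,各自经过一个分支后再融合。它只是对该思路的简化演示,并非论文的 HSFA 实现,通道数、截止频率等均为假设(PyTorch):

```python
import torch
import torch.nn as nn
import torch.fft

class FreqSplitAttention(nn.Module):
    """示意:用 FFT 低通/高通把特征分成低频与高频两路,分别学习后再融合。"""
    def __init__(self, channels=32, cutoff=0.25):
        super().__init__()
        self.cutoff = cutoff
        self.low_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.high_branch = nn.Conv2d(channels, channels, 3, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        freq = torch.fft.fft2(x, norm="ortho")
        fy = torch.fft.fftfreq(H, device=x.device).abs().view(1, 1, H, 1)
        fx = torch.fft.fftfreq(W, device=x.device).abs().view(1, 1, 1, W)
        low_mask = ((fy <= self.cutoff) & (fx <= self.cutoff)).float()
        low = torch.fft.ifft2(freq * low_mask, norm="ortho").real    # 低频成分
        high = x - low                                               # 高频成分(残差)
        out = self.fuse(torch.cat([self.low_branch(low), self.high_branch(high)], dim=1))
        return out + x                                               # 残差连接

feat = torch.randn(1, 32, 64, 64)
print(FreqSplitAttention()(feat).shape)   # torch.Size([1, 32, 64, 64])
```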
论文及项目相关链接
Summary
医学图像异常检测是一项需要兼顾效率与准确性的任务,以支持有效诊断。CNN和基于Transformer的模型虽然应用广泛,但各有局限。Mamba因能以线性复杂度处理长序列而受到关注。基于此,我们提出SpectMamba,首个用于医学图像检测的Mamba架构。它采用混合空间-频率注意力(HSFA)块分别学习高、低频特征,并引入视觉状态空间模块(VSSM)和Hilbert曲线扫描技术来优化Mamba框架。实验表明,SpectMamba在多种医学图像检测任务上实现卓越性能。
Key Takeaways
- 医学图像异常检测要求高效率和准确性以支持有效诊断。
- CNN和基于Transformer的模型在医学图像检测中存在局限性。
- Mamba因其处理长序列的线性复杂度能力而受到关注。
- SpectMamba是首个用于医学图像检测的Mamba架构。
- SpectMamba采用混合空间-频率注意力(HSFA)块,分别学习高、低频特征。
- 视觉状态空间模块(VSSM)和Hilbert曲线扫描技术用于优化Mamba框架。
点此查看论文截图


BSNeRF: Broadband Spectral Neural Radiance Fields for Snapshot Multispectral Light-field Imaging
Authors:Erqi Huang, John Restrepo, Xun Cao, Ivo Ihrke
Snapshot Multispectral Light-field Imaging (SMLI) is an emerging computational imaging technique that captures high-dimensional data (x, y, z, $\theta$, $\phi$, $\lambda$) in a single shot using a low-dimensional sensor. The accuracy of high-dimensional data reconstruction depends on representing the spectrum using neural radiance field models, which requires consideration of broadband spectral decoupling during optimization. Currently, some SMLI approaches avoid the challenge of model decoupling by either reducing light-throughput or prolonging imaging time. In this work, we propose a broadband spectral neural radiance field (BSNeRF) for SMLI systems. Experiments show that our model successfully decouples a broadband multiplexed spectrum. Consequently, this approach enhances multispectral light-field image reconstruction and further advances plenoptic imaging.
瞬时多光谱光场成像(SMLI)是一种新兴的计算成像技术,它使用低维传感器在一次拍摄中捕获高维数据(x,y,z,θ,φ,λ)。高维数据重建的准确性取决于使用神经辐射场模型表示光谱,这需要在优化过程中考虑宽谱解耦。目前,一些SMLI方法通过降低光通量或延长成像时间,避免模型解耦的挑战。在这项工作中,我们为SMLI系统提出了宽谱神经辐射场(BSNeRF)。实验表明,我们的模型成功实现了宽频多路复用光谱的解耦。因此,此方法增强了多光谱光场图像重建,并推动了全光成像的进一步发展。
论文及项目相关链接
PDF Presented in ISCS25
Summary
一种名为快照多光谱光场成像(SMLI)的计算成像技术正在兴起,该技术能够在单次拍摄中使用低维传感器捕获高维数据(x,y,z,θ,φ,λ)。高维数据重建的准确性依赖于使用神经辐射场模型表示光谱,这需要优化过程中考虑宽谱解耦。目前一些SMLI方法通过降低光通量或延长成像时间来避免模型解耦的挑战。本研究提出了一种用于SMLI系统的宽带谱神经辐射场(BSNeRF)。实验表明,该模型成功实现了宽带多路复用光谱的解耦。因此,此方法提高了多光谱光场图像的重建质量,并推动了全光成像的进一步发展。
Key Takeaways
- 快照多光谱光场成像(SMLI)是新兴的计算成像技术。
- SMLI技术能够在单次拍摄中捕获高维数据。
- 高维数据重建的准确性受神经辐射场模型表示光谱的影响。
- 宽谱解耦在优化过程中是必要的。
- 当前一些SMLI方法通过降低光通量或延长成像时间来解决模型解耦的挑战。
- 研究提出了一种名为宽带谱神经辐射场(BSNeRF)的方法,用于SMLI系统。
点此查看论文截图


Towards Early Detection: AI-Based Five-Year Forecasting of Breast Cancer Risk Using Digital Breast Tomosynthesis Imaging
Authors:Manon A. Dorster, Felix J. Dorfner, Mason C. Cleveland, Melisa S. Guelen, Jay Patel, Dania Daye, Jean-Philippe Thiran, Albert E. Kim, Christopher P. Bridge
As early detection of breast cancer strongly favors successful therapeutic outcomes, there is major commercial interest in optimizing breast cancer screening. However, current risk prediction models achieve modest performance and do not incorporate digital breast tomosynthesis (DBT) imaging, which was FDA-approved for breast cancer screening in 2011. To address this unmet need, we present a deep learning (DL)-based framework capable of forecasting an individual patient’s 5-year breast cancer risk directly from screening DBT. Using an unparalleled dataset of 161,753 DBT examinations from 50,590 patients, we trained a risk predictor based on features extracted using the Meta AI DINOv2 image encoder, combined with a cumulative hazard layer, to assess a patient’s likelihood of developing breast cancer over five years. On a held-out test set, our best-performing model achieved an AUROC of 0.80 on predictions within 5 years. These findings reveal the high potential of DBT-based DL approaches to complement traditional risk assessment tools, and serve as a promising basis for additional investigation to validate and enhance our work.
由于乳腺癌的早期检测对成功治疗结果有着极大的积极影响,优化乳腺癌筛查有着巨大的商业价值。然而,当前的风险预测模型表现一般,并未融入数字乳腺断层合成(DBT)成像技术,该技术于2011年获得FDA批准用于乳腺癌筛查。为满足这一未被满足的需求,我们提出了一种基于深度学习的框架,能够直接从筛查DBT预测个体患者5年的乳腺癌风险。我们使用包含50,590名患者的161,753次DBT检查组成的无与伦比的数据集进行训练。我们以Meta AI DINOv2图像编码器提取的特征为基础结合累积风险层训练风险预测器,以评估患者未来五年内患乳腺癌的可能性。在独立的测试集上,表现最佳的模型在五年内预测方面达到了0.8的AUROC。这些发现揭示了基于DBT的深度学习方法在补充传统风险评估工具方面的高潜力,并为我们进一步验证和改进工作提供了有前景的基础。
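论文用图像编码器提取的特征接一个“累积风险(cumulative hazard)层”来估计 5 年内患癌概率。下面是一个常见的离散时间生存建模草图,演示这类风险头的计算方式;其中特征维度、年份数以及损失形式均为假设,并非论文的官方实现(PyTorch):

```python
import torch
import torch.nn as nn

class CumulativeHazardHead(nn.Module):
    """示意:离散时间(按年)风险头,输出每年的条件发病概率并累乘得到生存曲线。"""
    def __init__(self, feat_dim=768, n_years=5):
        super().__init__()
        self.hazard = nn.Linear(feat_dim, n_years)

    def forward(self, feats):
        h = torch.sigmoid(self.hazard(feats))            # (B, 5) 每年的条件风险
        surv = torch.cumprod(1.0 - h, dim=1)             # S(t) = prod_{k<=t} (1 - h_k)
        return h, surv                                    # 5 年风险 = 1 - surv[:, -1]

def nll_loss(h, surv, event_year, observed):
    """event_year: 事件/删失发生的年份索引(0-4);observed=1 表示确诊,0 表示删失。"""
    B = h.shape[0]
    idx = torch.arange(B)
    s_prev = torch.cat([torch.ones(B, 1), surv[:, :-1]], dim=1)[idx, event_year]  # S(t-1)
    h_t = h[idx, event_year]
    ll = observed * torch.log(s_prev * h_t + 1e-8) + \
         (1 - observed) * torch.log(surv[idx, event_year] + 1e-8)
    return -ll.mean()

# 假设 feats 来自冻结的图像编码器(如 DINOv2 类特征,此处用随机张量代替)
head = CumulativeHazardHead()
h, surv = head(torch.randn(4, 768))
loss = nll_loss(h, surv, torch.tensor([2, 4, 0, 3]), torch.tensor([1., 0., 1., 0.]))
```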
论文及项目相关链接
PDF Deep Breath Workshop, MICCAI 2025
Summary
本文介绍了使用深度学习技术结合数字乳腺断层合成(DBT)成像来预测个体五年内乳腺癌风险的研究。利用大规模数据集训练的风险预测模型,取得了良好的性能表现,有望作为传统风险评估工具的补充。
Key Takeaways
- 研究背景强调早期检测对乳腺癌治疗成功的重要性,以及商业市场对优化乳腺癌筛查的兴趣。
- 当前风险预测模型性能有限,未充分利用数字乳腺断层合成(DBT)成像技术。
- 采用了基于深度学习的框架,直接从筛查DBT图像预测个人五年内乳腺癌风险。
- 使用了大规模数据集(包括来自五万多名患者的16万多次DBT检查)进行模型训练。
- 研究采用Meta AI DINOv2图像编码器提取特征,并结合累积危害层评估风险。
- 最佳模型在独立测试集上取得了较高的预测性能(AUROC为0.80)。
点此查看论文截图



Can General-Purpose Omnimodels Compete with Specialists? A Case Study in Medical Image Segmentation
Authors:Yizhe Zhang, Qiang Chen, Tao Zhou
The emergence of powerful, general-purpose omnimodels capable of processing diverse data modalities has raised a critical question: can these "jack-of-all-trades" systems perform on par with highly specialized models in knowledge-intensive domains? This work investigates this question within the high-stakes field of medical image segmentation. We conduct a comparative study analyzing the zero-shot performance of a state-of-the-art omnimodel (Gemini 2.5 Pro, the "Nano Banana" model) against domain-specific deep learning models on three distinct tasks: polyp (endoscopy), retinal vessel (fundus), and breast tumor segmentation (ultrasound). Our study focuses on performance at the extremes by curating subsets of the "easiest" and "hardest" cases based on the specialist models' accuracy. Our findings reveal a nuanced and task-dependent landscape. For polyp and breast tumor segmentation, specialist models excel on easy samples, but the omnimodel demonstrates greater robustness on hard samples where specialists fail catastrophically. Conversely, for the fine-grained task of retinal vessel segmentation, the specialist model maintains superior performance across both easy and hard cases. Intriguingly, qualitative analysis suggests omnimodels may possess higher sensitivity, identifying subtle anatomical features missed by human annotators. Our results indicate that while current omnimodels are not yet a universal replacement for specialists, their unique strengths suggest a potential complementary role with specialist models, particularly in enhancing robustness on challenging edge cases.
通用多功能全模型的强大出现,能够处理多种数据模态,引发了一个关键问题:这些“无所不能”的系统是否能在知识密集型领域与高度专业化的模型表现相当?本研究在医疗图像分割的高风险领域探讨了这个问题。我们进行了一项比较研究,分析了一种最先进的全模型(Gemini 2.5 Pro,“Nano Banana”模型)在三个不同任务上的零样本性能:息肉(内窥镜)、视网膜血管(眼底)和乳腺癌肿瘤分割(超声),并与特定领域的深度学习模型进行比较。我们的研究重点是通过基于专业模型的准确性整理“最容易”和“最困难”的案例子集来关注极端性能。我们的研究发现了一个微妙且依赖于任务的景观。在息肉和乳腺癌肿瘤分割方面,专业模型在简单样本上表现出色,但全模型在困难样本上表现出更大的稳健性,在这些困难样本上专业模型会遭遇灾难性的失败。相反,对于精细的视网膜血管分割任务,专业模型在简单和困难案例中均保持卓越性能。有趣的是,定性分析表明全模型可能具有更高的敏感性,能够识别出人类注释器遗漏的细微解剖特征。我们的结果表明,虽然当前的全模型尚未成为专业模型的通用替代品,但它们独特的优势表明可能与专业模型互补,特别是在提高挑战性边缘案例的稳健性方面。
论文及项目相关链接
PDF 15 pages, 7 figures
摘要
本文探讨了新兴的强大通用型全能模型在医学图像分割领域是否可与高度专业化的模型相提并论的问题。通过对最先进的全能模型(Gemini 2.5 Pro,“Nano Banana”模型)与特定领域的深度学习模型在三个不同任务上的零样本性能进行比较研究,发现结果呈现出微妙的、任务依赖性的格局。在息肉和乳腺癌分割方面,专业模型在简单样本上表现出色,而全能模型在困难样本上展现出更大的稳健性,专业模型会遭遇灾难性的失败。相反,对于精细的视网膜血管分割任务,专业模型在简单和困难案例中均保持卓越性能。此外,定性分析显示全能模型可能具有较高的敏感性,能够识别出人类注释器遗漏的细微解剖特征。因此,尽管全能模型尚未成为取代专家的普遍选择,但其独特优势表明其作为专家模型的补充角色具有潜力,特别是在提高挑战性边缘案例的稳健性方面。
关键见解
- 全能模型与专业模型在医学图像分割领域的性能对比是一个关键研究问题。
- 在息肉和乳腺癌分割方面,专业模型在简单样本上表现优秀,而全能模型在困难样本上更稳健。
- 对于视网膜血管分割的精细任务,专业模型在简单和困难情况下均表现优越。
- 全能模型可能具有较高的敏感性,能识别出人类注释器遗漏的细微解剖特征。
- 全能模型尚未普遍取代专业模型,但在处理挑战性案例时具有潜在的补充作用。
- 研究结果强调了不同任务中模型性能的差异性,表明需要根据具体任务选择适当的模型。
- 定性分析与定量评估相结合对于全面评估模型性能至关重要。
点此查看论文截图



Encoder-Only Image Registration
Authors:Xiang Chen, Renjiu Hu, Jinwei Zhang, Yuxi Zhang, Xinyao Yue, Min Liu, Yaonan Wang, Hang Zhang
Learning-based techniques have significantly improved the accuracy and speed of deformable image registration. However, challenges such as reducing computational complexity and handling large deformations persist. To address these challenges, we analyze how convolutional neural networks (ConvNets) influence registration performance using the Horn-Schunck optical flow equation. Supported by prior studies and our empirical experiments, we observe that ConvNets play two key roles in registration: linearizing local intensities and harmonizing global contrast variations. Based on these insights, we propose the Encoder-Only Image Registration (EOIR) framework, designed to achieve a better accuracy-efficiency trade-off. EOIR separates feature learning from flow estimation, employing only a 3-layer ConvNet for feature extraction and a set of 3-layer flow estimators to construct a Laplacian feature pyramid, progressively composing diffeomorphic deformations under a large-deformation model. Results on five datasets across different modalities and anatomical regions demonstrate EOIR’s effectiveness, achieving superior accuracy-efficiency and accuracy-smoothness trade-offs. With comparable accuracy, EOIR provides better efficiency and smoothness, and vice versa. The source code of EOIR is publicly available on https://github.com/XiangChen1994/EOIR.
基于学习的技术显著提高了可变形图像配准的准确性和速度。然而,减少计算复杂性和处理大变形等挑战仍然存在。为了解决这些挑战,我们分析了卷积神经网络(ConvNets)如何使用Horn-Schunck光流方程影响配准性能。通过先前的研究和我们的经验实验,我们发现ConvNets在配准中扮演两个关键角色:线性化局部强度和协调全局对比度变化。基于这些见解,我们提出了仅编码器图像配准(EOIR)框架,旨在实现更好的准确性-效率权衡。EOIR将特征学习从流估计中分离出来,仅使用3层ConvNet进行特征提取和一组3层流估计器来构建拉普拉斯特征金字塔,在大变形模型下逐步组成微分同胚变形。在五个不同模态和解剖区域的数据集上的结果表明,EOIR的有效性实现了优越的准确度-效率和准确度-平滑度权衡。在具有可比的准确度的情况下,EOIR提供了更好的效率和平滑度,反之亦然。EOIR的源代码可公开访问:https://github.com/XiangChen1994/EOIR。
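EOIR 的关键步骤之一是在拉普拉斯特征金字塔上逐级估计并复合形变场。下面用二维情形给出一个简化的位移场扭曲与复合的代码草图,帮助理解“逐级组合形变”的含义;它不是论文的微分同胚实现,仅作示意(PyTorch):

```python
import torch
import torch.nn.functional as F

def warp(img, disp):
    """用位移场 disp (B,2,H,W,归一化坐标) 对 img (B,C,H,W) 做空间变换。"""
    B, _, H, W = disp.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=disp.device),
                            torch.linspace(-1, 1, W, device=disp.device), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(B, -1, -1, -1)  # (B,H,W,2)
    grid = base + disp.permute(0, 2, 3, 1)                                   # 叠加位移
    return F.grid_sample(img, grid, align_corners=True, padding_mode="border")

def compose(disp_coarse, disp_fine):
    """位移场复合:总位移 = 细尺度位移 + 被细尺度位移扭曲后的粗尺度位移。"""
    return disp_fine + warp(disp_coarse, disp_fine)

# 由粗到细逐级合成(拉普拉斯金字塔式):每层的流估计器输出残差位移
total = torch.zeros(1, 2, 64, 64)
for level_disp in [torch.randn(1, 2, 64, 64) * 0.05 for _ in range(3)]:
    total = compose(total, level_disp)
moved = warp(torch.randn(1, 1, 64, 64), total)   # 用合成后的总形变扭曲移动图像
```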
论文及项目相关链接
Summary
基于学习的方法在可变形图像配准方面显著提高准确性和速度,但仍有挑战如减少计算复杂性及处理大变形的问题。针对这些挑战,通过分析卷积神经网络对基于Horn-Schunck光流方程注册性能的影响,我们发现ConvNets扮演了关键角色,它们有助于实现本地强度的线性化和全局对比度变化的调和。基于这些观察,我们提出了只采用Encoder的图像注册(EOIR)框架,旨在实现更佳的精准度与效率之间的平衡。EOIR通过将特征学习与流动估计分开,使用仅包含三层ConvNet的特征提取器和一组三层流动估计器构建拉普拉斯特征金字塔,在大变形模型下逐步组合微分同胚变形。在五组不同模态和解剖区域的测试集上进行的实验表明EOIR的有效性,实现了出色的精准度与效率和精准度与平滑度的平衡。代码已公开在链接。
Key Takeaways
- 基于学习的方法增强了可变形图像配准的准确性和速度。
- 面临减少计算复杂性及处理大变形的挑战。
- 卷积神经网络(ConvNets)在图像注册中扮演关键角色,线性化局部强度并调和全局对比度变化。
- 提出了Encoder-Only Image Registration (EOIR)框架,实现了精准度与效率之间的平衡。
- EOIR框架通过分离特征学习与流动估计,使用三层ConvNet和流动估计器构建拉普拉斯特征金字塔。
- EOIR在多种数据集上表现优异,实现了精准度、效率和平滑度的良好平衡。
点此查看论文截图





A Multimodal and Multi-centric Head and Neck Cancer Dataset for Tumor Segmentation and Outcome Prediction
Authors:Numan Saeed, Salma Hassan, Shahad Hardan, Ahmed Aly, Darya Taratynova, Umair Nawaz, Ufaq Khan, Muhammad Ridzuan, Vincent Andrearczyk, Adrien Depeursinge, Mathieu Hatt, Thomas Eugene, Raphaël Metz, Mélanie Dore, Gregory Delpon, Vijay Ram Kumar Papineni, Kareem Wahid, Cem Dede, Alaa Mohamed Shawky Ali, Carlos Sjogreen, Mohamed Naser, Clifton D. Fuller, Valentin Oreiller, Mario Jreige, John O. Prior, Catherine Cheze Le Rest, Olena Tankyevych, Pierre Decazes, Su Ruan, Stephanie Tanadini-Lang, Martin Vallières, Hesham Elhalawani, Ronan Abgral, Romain Floch, Kevin Kerleguer, Ulrike Schick, Maelle Mauguen, Arman Rahmim, Mohammad Yaqub
We describe a publicly available multimodal dataset of annotated Positron Emission Tomography/Computed Tomography (PET/CT) studies for head and neck cancer research. The dataset includes 1123 FDG-PET/CT studies from patients with histologically confirmed head and neck cancer, acquired from 10 international medical centers. All examinations consisted of co-registered PET/CT scans with varying acquisition protocols, reflecting real-world clinical diversity across institutions. Primary gross tumor volumes (GTVp) and involved lymph nodes (GTVn) were manually segmented by experienced radiation oncologists and radiologists following standardized guidelines and quality control measures. We provide anonymized NifTi files of all studies, along with expert-annotated segmentation masks, radiotherapy dose distribution for a subset of patients, and comprehensive clinical metadata. This metadata includes TNM staging, HPV status, demographics (age and gender), long-term follow-up outcomes, survival times, censoring indicators, and treatment information. We demonstrate how this dataset can be used for three key clinical tasks: automated tumor segmentation, recurrence-free survival prediction, and HPV status classification, providing benchmark results using state-of-the-art deep learning models, including UNet, SegResNet, and multimodal prognostic frameworks.
我们描述了一个公开可用的多模态头颈癌研究数据集,包含1123项经组织病理学证实的头颈癌患者的FDG-PET/CT检查,采集自10个国际医学中心。所有检查均为配准的PET/CT扫描,采集协议各不相同,反映了机构间真实世界的临床多样性。原发大体肿瘤体积(GTVp)和受累淋巴结(GTVn)由经验丰富的放射肿瘤学家和放射科医生按照标准化准则和质量控制措施进行手动分割。我们提供了所有研究的匿名化NifTi文件,以及专家注释的分割掩模、部分患者的放射治疗剂量分布和全面的临床元数据。这些元数据包括TNM分期、HPV状态、人口统计学信息(年龄和性别)、长期随访结果、生存时间、删失指标和治疗信息。我们展示了该数据集在三项关键临床任务中的应用:自动肿瘤分割、无复发生存预测和HPV状态分类,并使用最先进的深度学习模型(包括UNet、SegResNet和多模态预后框架)为这些任务提供了基准测试结果。
论文及项目相关链接
PDF 10 pages, 5 figures. Numan Saeed is the corresponding author. Numan Saeed, Salma Hassan and Shahad Hardan contributed equally to this work. Project page: https://hecktor25.grand-challenge.org/
Summary
本文介绍了一个公开可用的多模式数据集,包含经组织病理学证实头颈部癌症患者的PET/CT研究。数据集由来自国际医疗中心的PET/CT扫描组成,包含手动分割的主要肿瘤体积和涉及的淋巴结。此外,还提供了匿名化的NifTi文件、专家注释的分割掩模、部分患者的放疗剂量分布和全面的临床元数据。演示了如何使用该数据集进行自动肿瘤分割、无复发生存预测和HPV状态分类等三项关键临床任务,并使用前沿的深度学习方法提供基准测试结果。
Key Takeaways
- 描述了一个公开的多模式数据集,包含头颈部癌症患者的PET/CT研究。
- 数据集包含来自不同国际医疗中心的异构图像,反映了真实世界的临床多样性。
- 数据集包含了肿瘤和淋巴结的手动分割,以及详细的临床元数据。
- 提供了匿名化的图像文件和专家注释的分割掩模。
- 数据集可用于自动肿瘤分割、无复发生存预测和HPV状态分类等任务。
- 使用深度学习方法(如UNet、SegResNet和多模态预后框架)为这些任务提供了基准测试结果。
点此查看论文截图







MorphGen: Morphology-Guided Representation Learning for Robust Single-Domain Generalization in Histopathological Cancer Classification
Authors:Hikmat Khan, Syed Farhan Alam Zaidi, Pir Masoom Shah, Kiruthika Balakrishnan, Rabia Khan, Muhammad Waqas, Jia Wu
Domain generalization in computational histopathology is hindered by heterogeneity in whole slide images (WSIs), caused by variations in tissue preparation, staining, and imaging conditions across institutions. Unlike machine learning systems, pathologists rely on domain-invariant morphological cues such as nuclear atypia (enlargement, irregular contours, hyperchromasia, chromatin texture, spatial disorganization), structural atypia (abnormal architecture and gland formation), and overall morphological atypia that remain diagnostic across diverse settings. Motivated by this, we hypothesize that explicitly modeling biologically robust nuclear morphology and spatial organization will enable the learning of cancer representations that are resilient to domain shifts. We propose MorphGen (Morphology-Guided Generalization), a method that integrates histopathology images, augmentations, and nuclear segmentation masks within a supervised contrastive learning framework. By aligning latent representations of images and nuclear masks, MorphGen prioritizes diagnostic features such as nuclear and morphological atypia and spatial organization over staining artifacts and domain-specific features. To further enhance out-of-distribution robustness, we incorporate stochastic weight averaging (SWA), steering optimization toward flatter minima. Attention map analyses revealed that MorphGen primarily relies on nuclear morphology, cellular composition, and spatial cell organization within tumors or normal regions for final classification. Finally, we demonstrate resilience of the learned representations to image corruptions (such as staining artifacts) and adversarial attacks, showcasing not only OOD generalization but also addressing critical vulnerabilities in current deep learning systems for digital pathology. Code, datasets, and trained models are available at: https://github.com/hikmatkhan/MorphGen
在计算病理学中,领域泛化受到全幻灯片图像(WSI)异质性的阻碍,这种异质性是由不同机构在组织制备、染色和成像条件上的变化引起的。病理医师不同于机器学习系统,他们依赖于跨不同设置仍然具有诊断性的领域不变形态线索,如核异型性(增大、轮廓不规则、超染色、染色质纹理、空间无序)、结构异型性(异常结构和腺体形成)以及整体形态异型性。受此启发,我们假设通过显式建模生物学上稳健的核形态和空间组织,将能够学习对领域变化具有弹性的癌症表示。我们提出了MorphGen(形态学引导泛化)方法,它在一个有监督的对比学习框架内整合了病理组织学图像、增强图像和核分割掩膜。通过对齐图像和核掩膜的潜在表示,MorphGen优先考虑核和形态异型性以及空间组织等诊断特征,而不是染色伪影和领域特定特征。为了进一步提高分布外的稳健性,我们引入了随机权重平均(SWA),将优化转向较平坦的最小值。注意力图分析表明,MorphGen主要依赖于核形态、肿瘤内或正常区域的细胞组成和细胞空间组织来进行最终分类。最后,我们证明了所学表示的韧性能够对抗图像腐蚀(如染色伪影)和对抗性攻击,展示了其在领域外的泛化能力,并解决了当前深度学习系统在数字病理学中的关键脆弱性。代码、数据集和训练模型可通过以下网址获得:https://github.com/hikmatkhan/MorphGen 。
论文及项目相关链接
Summary
该文探讨计算组织病理学中的领域泛化问题,指出全幻灯片图像(WSIs)的异质性是阻碍因素之一。文章提出MorphGen方法,通过整合组织病理学图像、增强技术和核分段掩膜,在监督对比学习框架中显式建模生物学稳定的核形态和空间组织,以提高癌症表示的域转移抗性。此外,通过引入随机权重平均(SWA)技术,提高了模型的鲁棒性。MorphGen主要依赖核形态学、细胞组成和肿瘤或正常区域的细胞空间组织进行分类。该方法不仅展示了对图像腐蚀和对抗攻击的鲁棒性,还解决了当前深度学习系统在数字病理学中的关键脆弱性问题。
Key Takeaways
- 计算组织病理学中的领域泛化受到全幻灯片图像异质性的阻碍。
- 核形态和空间组织是病理诊断中跨越不同设置的重要不变特征。
- MorphGen方法通过整合图像、增强技术和核分段掩膜,在对比学习框架中建模这些特征。
- 监督对比学习强调诊断特征,如核和形态的不典型性以及空间组织,同时忽略染色伪影和特定域特征。
- 通过引入随机权重平均(SWA),提高了模型的鲁棒性和泛化能力。
- 注意力图分析显示MorphGen主要依赖核形态学、细胞组成和肿瘤或正常区域的细胞空间组织进行分类。
点此查看论文截图


Multimodal Deep Learning for Phyllodes Tumor Classification from Ultrasound and Clinical Data
Authors:Farhan Fuad Abir, Abigail Elliott Daly, Kyle Anderman, Tolga Ozmen, Laura J. Brattain
Phyllodes tumors (PTs) are rare fibroepithelial breast lesions that are difficult to classify preoperatively due to their radiological similarity to benign fibroadenomas. This often leads to unnecessary surgical excisions. To address this, we propose a multimodal deep learning framework that integrates breast ultrasound (BUS) images with structured clinical data to improve diagnostic accuracy. We developed a dual-branch neural network that extracts and fuses features from ultrasound images and patient metadata from 81 subjects with confirmed PTs. Class-aware sampling and subject-stratified 5-fold cross-validation were applied to prevent class imbalance and data leakage. The results show that our proposed multimodal method outperforms unimodal baselines in classifying benign versus borderline/malignant PTs. Among six image encoders, ConvNeXt and ResNet18 achieved the best performance in the multimodal setting, with AUC-ROC scores of 0.9427 and 0.9349, and F1-scores of 0.6720 and 0.7294, respectively. This study demonstrates the potential of multimodal AI to serve as a non-invasive diagnostic tool, reducing unnecessary biopsies and improving clinical decision-making in breast tumor management.
叶状肿瘤(PTs)是一种罕见的乳腺纤维上皮性病变,由于其放射学与良性纤维腺瘤相似,术前难以分类。这常常导致不必要的手术切除。为了解决这个问题,我们提出了一种多模式深度学习框架,该框架结合了乳腺超声(BUS)图像和结构化临床数据,以提高诊断准确性。我们开发了一个双分支神经网络,从超声图像和来自81名已确诊PTs患者的患者元数据中提取并融合特征。应用类别感知采样和分层5折交叉验证,以防止类别不平衡和数据泄露。结果表明,我们提出的多模式方法在分类良性与边界性或恶性PTs时优于单模式基线。在六种图像编码器中,ConvNeXt和ResNet18在多模式设置中表现最佳,AUC-ROC得分分别为0.9427和0.9349,F1得分分别为0.6720和0.7294。本研究展示了多模式人工智能作为非侵入性诊断工具的潜力,可以减少不必要的活检,提高乳腺肿瘤管理的临床决策水平。
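论文的核心是“双分支网络 + 特征融合”:图像分支编码超声图像,表格分支编码临床元数据,拼接后分类。下面是一个自足的最小示意实现,其中编码器结构、临床特征维度等均为假设(论文中图像分支为 ConvNeXt、ResNet18 等),仅供理解整体结构(PyTorch):

```python
import torch
import torch.nn as nn

class DualBranchClassifier(nn.Module):
    """示意:超声图像分支 + 临床表格数据分支,拼接特征后做良性/交界-恶性二分类。"""
    def __init__(self, n_clinical=8):
        super().__init__()
        self.img_branch = nn.Sequential(            # 占位图像编码器,实际可替换为 ConvNeXt/ResNet18
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> (B, 32)
        self.tab_branch = nn.Sequential(
            nn.Linear(n_clinical, 32), nn.ReLU(), nn.Linear(32, 16))  # -> (B, 16)
        self.head = nn.Linear(32 + 16, 2)

    def forward(self, image, clinical):
        fused = torch.cat([self.img_branch(image), self.tab_branch(clinical)], dim=1)
        return self.head(fused)

model = DualBranchClassifier()
logits = model(torch.randn(4, 1, 128, 128), torch.randn(4, 8))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 0, 1]))
```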
论文及项目相关链接
PDF IEEE-EMBS International Conference on Body Sensor Networks (IEEE-EMBS BSN 2025)
Summary
该研究提出了一种多模态深度学习框架,该框架结合了乳腺超声图像和结构化临床数据,旨在提高诊断的准确性。研究开发了双分支神经网络,该网络能够从确诊的病例中提取和融合来自超声图像和患者元数据的特征。该研究的结果显示,多模态方法相较于单模态基线在区分良性与交界性或恶性叶状肿瘤方面表现更佳。其中,ConvNeXt和ResNet18在多模态环境中表现最佳,AUC-ROC得分分别为0.9427和0.9349,F1分数分别为0.6720和0.7294。研究表明,多模态人工智能具有作为非侵入性诊断工具的潜力,有望降低不必要的活检次数,提高乳腺肿瘤管理的临床决策水平。
Key Takeaways
- 多模态深度学习框架结合了乳腺超声图像和结构化临床数据,旨在提高叶状肿瘤的诊断准确性。
- 双分支神经网络能够从确诊的病例中提取并融合超声图像和患者元数据特征。
- 研究采用类感知采样和分层五重交叉验证来防止类别不平衡和数据泄露。
- 多模态方法相较于单模态基线在区分良性与交界性或恶性叶状肿瘤方面表现更佳。
- ConvNeXt和ResNet18在多模态环境中表现最佳,AUC-ROC和F1分数均较高。
- 研究表明多模态人工智能具有作为非侵入性诊断工具的潜力。
点此查看论文截图






A Multi-Stage Fine-Tuning and Ensembling Strategy for Pancreatic Tumor Segmentation in Diagnostic and Therapeutic MRI
Authors:Omer Faruk Durugol, Maximilian Rokuss, Yannick Kirchhoff, Klaus H. Maier-Hein
Automated segmentation of Pancreatic Ductal Adenocarcinoma (PDAC) from MRI is critical for clinical workflows but is hindered by poor tumor-tissue contrast and a scarcity of annotated data. This paper details our submission to the PANTHER challenge, addressing both diagnostic T1-weighted (Task 1) and therapeutic T2-weighted (Task 2) segmentation. Our approach is built upon the nnU-Net framework and leverages a deep, multi-stage cascaded pre-training strategy, starting from a general anatomical foundation model and sequentially fine-tuning on CT pancreatic lesion datasets and the target MRI modalities. Through extensive five-fold cross-validation, we systematically evaluated data augmentation schemes and training schedules. Our analysis revealed a critical trade-off, where aggressive data augmentation produced the highest volumetric accuracy, while default augmentations yielded superior boundary precision (achieving a state-of-the-art MASD of 5.46 mm and HD95 of 17.33 mm for Task 1). For our final submission, we exploited this finding by constructing custom, heterogeneous ensembles of specialist models, essentially creating a mix of experts. This metric-aware ensembling strategy proved highly effective, achieving a top cross-validation Tumor Dice score of 0.661 for Task 1 and 0.523 for Task 2. Our work presents a robust methodology for developing specialized, high-performance models in the context of limited data and complex medical imaging tasks (Team MIC-DKFZ).
从MRI中自动分割胰腺导管腺癌(PDAC)对临床工作流程至关重要,但由于肿瘤与组织对比度差且标注数据稀缺,这一任务仍面临挑战。本文详细描述了我们针对PANTHER挑战提交的方案,同时解决了诊断T1加权(任务1)和治疗T2加权(任务2)的分割问题。我们的方法建立在nnU-Net框架之上,采用深度多阶段级联预训练策略,从一般解剖基础模型开始,并在CT胰腺病变数据集和目标MRI模态上顺序微调。通过五折交叉验证,我们系统地评估了数据增强方案和训练计划。我们的分析揭示了一个关键的权衡:激进的数据增强产生了最高的体积精度,而默认的数据增强产生了更高的边界精度(在任务1中达到了最先进的平均表面距离(MASD)5.46毫米和HD95 17.33毫米)。在最终提交中,我们利用这一发现构建了由不同专长模型组成的自定义异质集成,实质上创建了一种专家混合。这种指标感知的集成策略被证明非常有效,在任务1和任务2中分别达到了交叉验证肿瘤Dice系数的最高分0.661和0.523。我们的工作为在有限数据和复杂医学成像任务的背景下开发专业、高性能的模型提供了一种稳健的方法(团队MIC-DKFZ)。
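下面用一个极简草图示意“按验证指标加权的专家混合集成”的基本做法:对多个专长模型的前景概率图做加权平均后再阈值化。这只是对该思路的一般性演示,并非论文所用的具体集成或 STAPLE 实现,权重设定方式也是假设(PyTorch):

```python
import torch

def ensemble_probability_maps(prob_maps, weights=None):
    """
    示意:对若干“专家”模型的软预测做加权平均后取阈值,得到最终分割。
    prob_maps: (M, D, H, W) 的 M 个模型前景概率图;weights 可按验证集指标(如 Dice/MASD)设定。
    """
    if weights is None:
        weights = torch.ones(prob_maps.shape[0])
    weights = weights / weights.sum()
    fused = (weights.view(-1, 1, 1, 1) * prob_maps).sum(dim=0)
    return (fused > 0.5).float()

# 例如:体积精度较好的模型与边界精度较好的模型按验证指标加权
preds = torch.rand(3, 32, 64, 64)
seg = ensemble_probability_maps(preds, weights=torch.tensor([0.4, 0.4, 0.2]))
```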
论文及项目相关链接
PDF 11 pages, 1 figure, PANTHER Challenge submission
Summary
本文介绍了针对胰腺导管腺癌(PDAC)的MRI自动化分割方法。研究团队在PANTHER挑战中提出了针对诊断T1加权(任务1)和治疗T2加权(任务2)分割的解决策略。该研究基于nnU-Net框架,采用深度多阶段级联预训练策略,从一般解剖基础模型开始,并在CT胰腺病变数据集和目标MRI模式上进行顺序微调。研究通过五折交叉验证评估了数据增强方案和训练计划,发现激进的数据增强产生最高的体积精度,而默认增强则产生更高的边界精度。最终,团队采用混合专家模型的方式构建自定义的异质模型集合,采用指标感知集成策略,在任务1中达到0.661的肿瘤Dice分数,任务2达到0.523。该研究为有限数据和复杂医学成像任务下开发高性能模型提供了稳健的方法。
Key Takeaways
- 自动化分割胰腺导管腺癌(PDAC)从MRI对于临床工作流程至关重要,但受到肿瘤组织对比度差和标注数据稀缺的限制。
- 研究团队在PANTHER挑战中提出了针对任务1(诊断T1加权)和任务2(治疗T2加权)的分割方法。
- 基于nnU-Net框架,采用深度多阶段级联预训练策略,从一般解剖模型开始,逐步微调。
- 评估了数据增强方案和训练计划,发现激进数据增强提高体积精度,而默认增强提高边界精度。
- 团队采用混合专家模型的方式构建自定义的异质模型集合,实现高性能模型开发。
- 采用指标感知集成策略,在任务1中达到肿瘤Dice分数0.661,任务2达到0.523。
点此查看论文截图


CAD2DMD-SET: Synthetic Generation Tool of Digital Measurement Device CAD Model Datasets for fine-tuning Large Vision-Language Models
Authors:João Valente, Atabak Dehban, Rodrigo Ventura
Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities across various multimodal tasks. They continue, however, to struggle with trivial scenarios such as reading values from Digital Measurement Devices (DMDs), particularly in real-world conditions involving clutter, occlusions, extreme viewpoints, and motion blur; common in head-mounted cameras and Augmented Reality (AR) applications. Motivated by these limitations, this work introduces CAD2DMD-SET, a synthetic data generation tool designed to support visual question answering (VQA) tasks involving DMDs. By leveraging 3D CAD models, advanced rendering, and high-fidelity image composition, our tool produces diverse, VQA-labelled synthetic DMD datasets suitable for fine-tuning LVLMs. Additionally, we present DMDBench, a curated validation set of 1,000 annotated real-world images designed to evaluate model performance under practical constraints. Benchmarking three state-of-the-art LVLMs using Average Normalised Levenshtein Similarity (ANLS) and further fine-tuning LoRA’s of these models with CAD2DMD-SET’s generated dataset yielded substantial improvements, with InternVL showcasing a score increase of 200% without degrading on other tasks. This demonstrates that the CAD2DMD-SET training dataset substantially improves the robustness and performance of LVLMs when operating under the previously stated challenging conditions. The CAD2DMD-SET tool is expected to be released as open-source once the final version of this manuscript is prepared, allowing the community to add different measurement devices and generate their own datasets.
最近大型视觉语言模型(LVLMs)的进展在各种跨模态任务中展示了令人印象深刻的性能。然而,它们仍然难以处理从数字测量设备(DMDs)读取数值等平凡场景,特别是在现实世界中存在杂乱、遮挡、极端视角和运动模糊等常见情况,这些情况在头戴式相机和增强现实(AR)应用中很常见。针对这些局限性,这项工作引入了CAD2DMD-SET,这是一个合成数据生成工具,旨在支持涉及DMD的视觉问答(VQA)任务。我们的工具通过利用3D CAD模型、高级渲染和高保真图像组合,可以生成多样化、带有VQA标签的合成DMD数据集,适用于微调LVLMs。此外,我们还推出了DMDBench,这是一组包含1000个注释过的真实世界图像的精选验证集,旨在在实际约束下评估模型性能。使用平均标准化莱文斯坦相似度(ANLS)对三项最先进的LVLMs进行基准测试,并进一步使用CAD2DMD-SET生成的数据集对LoRA模型进行微调,取得了显著的提升效果,其中InternVL的得分提高了两倍,同时在其他任务上也没有出现性能下降。这表明CAD2DMD-SET训练数据集在应对之前提到的挑战条件下,显著提高了LVLMs的稳健性和性能。一旦最终版本的手稿准备完毕,CAD2DMD-SET工具将作为开源发布,允许社区添加不同的测量设备并生成自己的数据集。
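论文用平均归一化莱文斯坦相似度(ANLS)评估模型读数的正确性。下面给出该指标的一个常见实现草图(编辑距离 + 低于阈值记 0 的常用约定),阈值 tau=0.5 为通用设定,不一定与论文评测脚本完全一致(Python):

```python
def levenshtein(a: str, b: str) -> int:
    # 经典单行动态规划编辑距离
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def anls(predictions, answers, tau=0.5):
    """Average Normalised Levenshtein Similarity:低于阈值 tau 的相似度计为 0。"""
    scores = []
    for pred, ans in zip(predictions, answers):
        pred, ans = pred.strip().lower(), ans.strip().lower()
        if len(pred) == 0 and len(ans) == 0:
            scores.append(1.0)
            continue
        sim = 1.0 - levenshtein(pred, ans) / max(len(pred), len(ans))
        scores.append(sim if sim >= tau else 0.0)
    return sum(scores) / len(scores)

# 例:模型读取两个仪表读数,其中一个读错
print(anls(["12.4 V", "0.53 A"], ["12.4 V", "0.35 A"]))
```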
论文及项目相关链接
Summary
该研究针对大型视觉语言模型(LVLMs)在数字测量设备(DMD)读取值方面的不足,开发了CAD2DMD-SET合成数据生成工具,可支持涉及DMD的视觉问答(VQA)任务。通过3D CAD模型、高级渲染和高保真图像组合技术,该工具生成多样化的、带有VQA标签的合成DMD数据集,适用于精细调整LVLMs。此外,还推出了DMDBench,一个包含1000个注释过的真实世界图像的评价集,用于评估模型在实际约束下的性能。使用平均标准化莱文斯坦相似性(ANLS)评估三种先进的LVLMs,并进一步使用CAD2DMD-SET生成的数据集对它们进行微调,结果显示InternVL的性能提高了200%,同时不影响其他任务。这表明CAD2DMD-SET训练数据集显著提高了LVLMs在挑战条件下的稳健性和性能。
Key Takeaways
- 大型视觉语言模型(LVLMs)在读取数字测量设备(DMD)的值时面临挑战,特别是在现实世界的复杂环境中。
- CAD2DMD-SET是一个合成数据生成工具,用于支持涉及DMD的视觉问答(VQA)任务。
- CAD2DMD-SET利用3D CAD模型、高级渲染和高保真图像组合技术,生成适合精细调整LVLMs的多样化合成DMD数据集。
- DMDBench是一个真实世界图像评价集,用于评估模型在实用情况下的性能。
- 使用CAD2DMD-SET生成的数据集对三种先进的LVLMs进行微调,结果显示性能显著提高。
- InternVL模型在使用CAD2DMD-SET训练数据集后,性能提升200%,同时不影响其他任务的执行。
点此查看论文截图




Integrating Pathology and CT Imaging for Personalized Recurrence Risk Prediction in Renal Cancer
Authors:Daniël Boeke, Cedrik Blommestijn, Rebecca N. Wray, Kalina Chupetlovska, Shangqi Gao, Zeyu Gao, Regina G. H. Beets-Tan, Mireia Crispin-Ortuzar, James O. Jones, Wilson Silva, Ines P. Machado
Recurrence risk estimation in clear cell renal cell carcinoma (ccRCC) is essential for guiding postoperative surveillance and treatment. The Leibovich score remains widely used for stratifying distant recurrence risk but offers limited patient-level resolution and excludes imaging information. This study evaluates multimodal recurrence prediction by integrating preoperative computed tomography (CT) and postoperative histopathology whole-slide images (WSIs). A modular deep learning framework with pretrained encoders and Cox-based survival modeling was tested across unimodal, late fusion, and intermediate fusion setups. In a real-world ccRCC cohort, WSI-based models consistently outperformed CT-only models, underscoring the prognostic strength of pathology. Intermediate fusion further improved performance, with the best model (TITAN-CONCH with ResNet-18) approaching the adjusted Leibovich score. Random tie-breaking narrowed the gap between the clinical baseline and learned models, suggesting discretization may overstate individualized performance. Using simple embedding concatenation, radiology added value primarily through fusion. These findings demonstrate the feasibility of foundation model-based multimodal integration for personalized ccRCC risk prediction. Future work should explore more expressive fusion strategies, larger multimodal datasets, and general-purpose CT encoders to better match pathology modeling capacity.
在透明细胞肾细胞癌(ccRCC)中,复发风险评估对于指导术后监测和治疗至关重要。Leibovich评分仍广泛应用于分层远距离复发风险的评估,但其在患者层面的分辨率有限,且不包括成像信息。本研究通过整合术前计算机断层扫描(CT)和术后组织病理学全切片图像(WSI)来评估多模式复发预测。采用模块化深度学习框架,包含预训练编码器和基于Cox的生存建模,测试了单模态、后期融合和中间融合等多种设置。在真实世界的ccRCC队列中,基于WSI的模型始终优于仅使用CT的模型,突显了病理学的预后强度。中间融合进一步提高了性能,其中最佳模型(采用ResNet-18的TITAN-CONCH)接近调整后的Leibovich评分。随机解决平局问题缩小了临床基线与学习模型之间的差距,表明离散化可能会夸大个体化性能。通过简单的嵌入拼接,放射学主要通过融合增加了价值。这些发现证明了基于基础模型的多模式整合在个性化ccRCC风险预测中的可行性。未来的工作应探索更具表现力的融合策略、更大的多模式数据集和与病理学建模能力相匹配的通用CT编码器。
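论文将病理 WSI 嵌入与 CT 嵌入融合后接基于 Cox 的生存模型。下面是一个“中间融合 + Cox 偏似然损失”的最小草图,其中嵌入维度、网络结构均为假设,偏似然采用常见的 Breslow 近似,并非论文的官方实现(PyTorch):

```python
import torch
import torch.nn as nn

def cox_partial_likelihood(risk, time, event):
    """
    Breslow 近似下的负对数 Cox 偏似然。
    risk: (B,) 模型输出的对数风险;time: (B,) 随访时间;event: (B,) 1=复发, 0=删失。
    """
    order = torch.argsort(time, descending=True)          # 按时间降序 -> 前缀即风险集
    risk, event = risk[order], event[order]
    log_cumsum = torch.logcumsumexp(risk, dim=0)          # log sum_{j in 风险集} exp(risk_j)
    return -((risk - log_cumsum) * event).sum() / event.sum().clamp(min=1)

class IntermediateFusionCox(nn.Module):
    """示意:病理 WSI 嵌入与 CT 嵌入拼接后,经 MLP 输出对数风险(中间融合)。"""
    def __init__(self, wsi_dim=512, ct_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(wsi_dim + ct_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, wsi_emb, ct_emb):
        return self.mlp(torch.cat([wsi_emb, ct_emb], dim=1)).squeeze(1)

model = IntermediateFusionCox()
risk = model(torch.randn(8, 512), torch.randn(8, 256))
loss = cox_partial_likelihood(risk, time=torch.rand(8) * 60, event=(torch.rand(8) > 0.5).float())
```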
论文及项目相关链接
PDF 12 pages, 2 figures, 1 table. Accepted at the Multimodal Learning and Fusion Across Scales for Clinical Decision Support (ML-CDS) Workshop, MICCAI 2025. This is the submitted version with authors, affiliations, and acknowledgements included; it has not undergone peer review or revisions. The final version will appear in the Springer Lecture Notes in Computer Science (LNCS) proceedings
Summary
本文研究了使用多模态预测复发风险的方法,结合了术前计算机断层扫描(CT)和术后组织病理学全切片图像(WSI)。采用模块化深度学习框架和Cox生存模型,测试了单模态、后期融合和中间融合等多种设置。在真实世界肾透明细胞癌队列中,基于WSI的模型表现优于仅使用CT的模型,突显了病理学的预后价值。中间融合进一步提高了性能,最佳模型接近调整后的Leibovich评分。
Key Takeaways
- 复发风险评估在肾透明细胞癌(ccRCC)中非常重要,用于指导术后监测和治疗。
- Leibovich评分虽广泛应用于分层评估远处复发风险,但患者层面分辨率有限,且未包含成像信息。
- 本研究采用多模态预测复发风险,结合了术前CT和术后组织病理学全切片图像。
- 使用模块化深度学习框架和Cox生存模型进行测试,显示基于WSI的模型性能优于仅使用CT的模型。
- 中间融合策略进一步提高模型性能,最佳模型接近调整后的Leibovich评分。
- 随机平局机制缩小了临床基线与学习模型之间的差距,表明离散化可能夸大个性化性能。
点此查看论文截图


Temporal Flow Matching for Learning Spatio-Temporal Trajectories in 4D Longitudinal Medical Imaging
Authors:Nico Albert Disch, Yannick Kirchhoff, Robin Peretzke, Maximilian Rokuss, Saikat Roy, Constantin Ulrich, David Zimmerer, Klaus Maier-Hein
Understanding temporal dynamics in medical imaging is crucial for applications such as disease progression modeling, treatment planning and anatomical development tracking. However, most deep learning methods either consider only single temporal contexts, or focus on tasks like classification or regression, limiting their ability for fine-grained spatial predictions. While some approaches have been explored, they are often limited to single timepoints, specific diseases or have other technical restrictions. To address this fundamental gap, we introduce Temporal Flow Matching (TFM), a unified generative trajectory method that (i) aims to learn the underlying temporal distribution, (ii) by design can fall back to a nearest image predictor, i.e. predicting the last context image (LCI), as a special case, and (iii) supports $3D$ volumes, multiple prior scans, and irregular sampling. Extensive benchmarks on three public longitudinal datasets show that TFM consistently surpasses spatio-temporal methods from natural imaging, establishing a new state-of-the-art and robust baseline for $4D$ medical image prediction.
在医学成像中理解时间动态对于疾病进展建模、治疗计划拟定和解剖发育跟踪等应用至关重要。然而,大多数深度学习方法要么只考虑单一的时态背景,要么专注于分类或回归等任务,这限制了它们进行精细空间预测的能力。虽然有一些方法已经被探索出来,但它们通常仅限于单一时间点、特定疾病或其他技术限制。为了解决这一基本差距,我们引入了时间流匹配(TFM)这一统一的生成轨迹方法,其目标包括:(一)学习潜在的时态分布;(二)设计时能够回退到最接近的图像预测器,即预测最后一个上下文图像(LCI)作为特殊情况;(三)支持三维体积、多个先前扫描和不规则采样。在三个公共纵向数据集上的广泛基准测试显示,TFM始终超越了自然成像中的时空方法,为四维医学图像预测建立了新的先进和稳健的基线。
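下面的代码草图演示“流匹配(flow matching)”训练目标最基本的形式:在前一次扫描与下一时间点扫描之间取线性插值路径,让网络回归该路径的速度场。论文的 TFM 支持多次先验扫描与不规则采样,这里只给出单先验、规则网格的简化示意,网络结构与尺寸均为假设(PyTorch):

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """极简 3D 网络替身:输入当前插值体积与时间 t,预测流匹配的速度场。"""
    def __init__(self, ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(ch + 1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv3d(16, ch, 3, padding=1))

    def forward(self, x_t, t):                         # x_t: (B,1,D,H,W), t: (B,)
        t_map = t.view(-1, 1, 1, 1, 1).expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))

def flow_matching_loss(model, x_prev, x_next):
    """x_prev: 末次已有扫描(LCI);x_next: 下一时间点扫描。学习二者之间的速度场。"""
    t = torch.rand(x_prev.shape[0], device=x_prev.device)
    t_b = t.view(-1, 1, 1, 1, 1)
    x_t = (1 - t_b) * x_prev + t_b * x_next            # 线性插值路径
    target_v = x_next - x_prev                         # 该路径的真实速度
    return ((model(x_t, t) - target_v) ** 2).mean()

model = VelocityNet()
loss = flow_matching_loss(model, torch.randn(2, 1, 16, 32, 32), torch.randn(2, 1, 16, 32, 32))
loss.backward()
```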
论文及项目相关链接
Summary
本文介绍了医学成像中的时间动态理解对于疾病进展建模、治疗规划和解剖发育跟踪等应用的重要性。针对现有深度学习方法的局限性,如只能处理单一时间上下文或专注于分类和回归任务而无法进行精细的空间预测,本文提出了一种统一的生成轨迹方法——Temporal Flow Matching(TFM)。TFM旨在学习潜在的时空分布,并可在必要时退回到最近图像预测器进行预测。该方法支持三维体积、多个先验扫描和非规则采样。在三个公共纵向数据集上的广泛基准测试表明,TFM在时空方法上表现更优秀,建立了新的四维医学图像预测先进基线。
Key Takeaways
- 医学成像中的时间动态理解对于疾病进展建模、治疗规划和解剖发育跟踪至关重要。
- 现有深度学习方法在医学图像分析中存在局限性,如处理单一时间上下文或专注于分类和回归任务。
- TFM是一种统一的生成轨迹方法,旨在学习潜在的时空分布。
- TFM支持三维体积、多个先验扫描和非规则采样,增强了方法的实际应用能力。
- TFM具有退化机制,在无法精确预测未来图像时,可退回到最近图像预测器进行预测。
- 在三个公共纵向数据集上的测试表明TFM表现优于其他时空方法。
点此查看论文截图






Lightweight MRI-Based Automated Segmentation of Pancreatic Cancer with Auto3DSeg
Authors:Keshav Jha, William Sharp, Dominic LaBella
Accurate delineation of pancreatic tumors is critical for diagnosis, treatment planning, and outcome assessment, yet automated segmentation remains challenging due to anatomical variability and limited dataset availability. In this study, SegResNet models, as part of the Auto3DSeg architecture, were trained and evaluated on two MRI-based pancreatic tumor segmentation tasks as part of the 2025 PANTHER Challenge. Algorithm methodology included 5-fold cross-validation with STAPLE ensembling after focusing on an anatomically relevant region-of-interest. The Pancreatic Tumor Segmentation on Diagnostic MRI task 1 training set included 91 T1-weighted arterial contrast-enhanced MRI with expert annotated pancreas and tumor labels. The Pancreatic Tumor Segmentation on MR-Linac task 2 training set used 50 T2-weighted MR-Linac cases with expert annotated pancreas and tumor labels. Algorithm-automated segmentation performance of pancreatic tumor was assessed using Dice Similarity Coefficient (DSC), 5 mm DSC, 95th percentile Hausdorff Distance (HD95), Mean Average Surface Distance (MASD), and Root Mean Square Error (RMSE). For Task 1, the algorithm achieved a DSC of 0.56, 5 mm DSC of 0.73, HD95 of 41.1 mm, MASD of 26.0 mm, and RMSE of 5164 mm. For Task 2, performance decreased, with a DSC of 0.33, 5 mm DSC of 0.50, HD95 of 20.1 mm, MASD of 7.2 mm, and RMSE of 17,203 mm. These findings illustrate the challenges of MRI-based pancreatic tumor segmentation with small datasets, highlighting variability introduced by different MRI sequences. Despite modest performance, the results demonstrate potential for automated delineation and emphasize the need for larger, standardized MRI datasets to improve model robustness and clinical utility.
胰腺肿瘤的准确轮廓描绘对于诊断、治疗计划和疗效评估至关重要。然而,由于解剖结构差异和可用数据集有限,自动化分割仍然是一个挑战。本研究中,SegResNet模型作为Auto3DSeg架构的一部分,在基于MRI的胰腺肿瘤分割任务上进行了训练和评估,作为2025年PANTHER挑战赛的一部分。算法方法包括在关注解剖上相关感兴趣区域后进行5倍交叉验证和STAPLE集成。胰腺肿瘤分割诊断MRI任务1的训练集包括91例T1加权动脉增强MRI,带有专家注释的胰腺和肿瘤标签。胰腺肿瘤分割MR-Linac任务2的训练集使用了50例T2加权MR-Linac病例,同样带有专家注释的胰腺和肿瘤标签。使用Dice相似系数(DSC)、5mm DSC、95th百分位Hausdorff距离(HD95)、平均表面距离(MASD)和均方根误差(RMSE)来评估算法对胰腺肿瘤自动化分割的性能。对于任务1,该算法实现了DSC 0.56,5mm DSC 0.73,HD95 41.1mm,MASD 26.0mm,RMSE 5164mm。对于任务2,性能有所下降,DSC为0.33,5mm DSC为0.50,HD95为20.1mm,MASD为7.2mm,RMSE为17,203mm。这些发现说明了基于MRI的胰腺肿瘤分割使用小数据集的挑战,强调了不同MRI序列引入的变异性。尽管性能表现一般,但结果证明了自动化轮廓描绘的潜力,并强调了需要更大、标准化的MRI数据集来提高模型稳健性和临床实用性的必要性。
论文及项目相关链接
PDF 11 pages, 3 figures, 3 tables, MICCAI
Summary
本研究利用SegResNet模型作为Auto3DSeg架构的一部分,对基于MRI的胰腺肿瘤分割任务进行了训练和评估,参加2025年PANTHER挑战赛。算法在T1加权动脉增强MRI的Pancreatic Tumor Segmentation on Diagnostic MRI任务1训练集和T2加权MR-Linac病例的Pancreatic Tumor Segmentation on MR-Linac任务2训练集上进行了测试。虽然面临数据集小的挑战,但该研究展示了自动化分割的潜力,并强调了需要更大、标准化的MRI数据集来提高模型稳健性和临床实用性。
Key Takeaways
- 胰腺肿瘤的准确分割对于诊断、治疗规划和结果评估至关重要,但自动化分割由于解剖变异和有限的数据集可用性而具有挑战性。
- 研究使用了SegResNet模型作为Auto3DSeg架构的一部分,参与2025年PANTHER挑战赛中的两个MRI胰腺肿瘤分割任务。
- 在诊断MRI的胰腺肿瘤分割任务中,算法取得了相对较好的性能,使用专家标注的胰腺和肿瘤标签进行训练。
- 在MR-Linac任务中,算法性能有所下降,这可能与不同MRI序列引入的变异性有关。
- 研究通过评估不同指标(如Dice相似系数、Hausdorff距离等)来衡量算法自动化分割的性能。
- 虽然性能相对温和,但结果展示了自动化分割的潜力。
点此查看论文截图






Dino U-Net: Exploiting High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Authors:Yifan Gao, Haoyue Li, Feng Yuan, Xiaosong Wang, Xin Gao
Foundation models pre-trained on large-scale natural image datasets offer a powerful paradigm for medical image segmentation. However, effectively transferring their learned representations for precise clinical applications remains a challenge. In this work, we propose Dino U-Net, a novel encoder-decoder architecture designed to exploit the high-fidelity dense features of the DINOv3 vision foundation model. Our architecture introduces an encoder built upon a frozen DINOv3 backbone, which employs a specialized adapter to fuse the model’s rich semantic features with low-level spatial details. To preserve the quality of these representations during dimensionality reduction, we design a new fidelity-aware projection module (FAPM) that effectively refines and projects the features for the decoder. We conducted extensive experiments on seven diverse public medical image segmentation datasets. Our results show that Dino U-Net achieves state-of-the-art performance, consistently outperforming previous methods across various imaging modalities. Our framework proves to be highly scalable, with segmentation accuracy consistently improving as the backbone model size increases up to the 7-billion-parameter variant. The findings demonstrate that leveraging the superior, dense-pretrained features from a general-purpose foundation model provides a highly effective and parameter-efficient approach to advance the accuracy of medical image segmentation. The code is available at https://github.com/yifangao112/DinoUNet.
在大规模自然图像数据集上预训练的基础模型为医学图像分割提供了强大的范式。然而,如何有效地迁移其学习到的表示以用于精确的临床应用仍然是一个挑战。在这项工作中,我们提出了Dino U-Net,这是一种新型编码器-解码器架构,旨在利用DINOv3视觉基础模型的高保真密集特征。我们的架构引入了一个基于冻结的DINOv3骨干网的编码器,该编码器采用专用适配器融合模型的丰富语义特征与低级空间细节。为了在降维过程中保持这些表示的质量,我们设计了一个新型保真度感知投影模块(FAPM),该模块可以有效地改进并投影特征以供解码器使用。我们在七个公共医学图像分割数据集上进行了广泛实验。结果表明,Dino U-Net达到了最先进的性能,在各种成像模态下均优于以前的方法。我们的框架具有很高的可扩展性,随着骨干网模型规模增加到70亿参数变体,分割精度持续提高。研究结果表明,利用通用基础模型的优质密集预训练特征,是一种提升医学图像分割准确性的高效且参数高效的方法。代码可从https://github.com/yifangao112/DinoUNet获取。
论文及项目相关链接
Summary
本论文提出了Dino U-Net架构,该架构利用在大规模自然图像数据集上预训练的DINOv3模型的高保真密集特征,通过专用适配器融合模型的丰富语义特征与低级空间细节,并设计了一种新型的保真度感知投影模块(FAPM),以在降维过程中保持特征质量。在七个公共医学图像分割数据集上的实验表明,Dino U-Net表现卓越,在各种成像模态上均优于以前的方法。此外,该框架高度可扩展,随着骨干模型规模增大到70亿参数变体,分割精度持续提高。研究证明,利用通用基础模型的密集预训练特征是一种提升医学图像分割准确度的高效且参数高效的方法。
Key Takeaways
- Dino U-Net利用预训练的DINOv3模型进行医学图像分割。
- 该架构通过特殊适配器融合模型的丰富语义特征和低级空间细节。
- 引入了一种新型的保真度感知投影模块(FAPM),以在降维过程中保持特征质量。
- 在七个医学图像分割数据集上的实验表明,Dino U-Net表现优于其他方法。
- 随着基础模型规模增大,分割精度持续提高,证明该框架高度可扩展。
- 利用通用基础模型的预训练特征,能高效提升医学图像分割的准确度。
点此查看论文截图






Domain Adaptation Techniques for Natural and Medical Image Classification
Authors:Ahmad Chaddad, Yihang Wu, Reem Kateb, Christian Desrosiers
Domain adaptation (DA) techniques have the potential in machine learning to alleviate distribution differences between training and test sets by leveraging information from source domains. In image classification, most advances in DA have been made using natural images rather than medical data, which are harder to work with. Moreover, even for natural images, the use of mainstream datasets can lead to performance bias. {With the aim of better understanding the benefits of DA for both natural and medical images, this study performs 557 simulation studies using seven widely-used DA techniques for image classification in five natural and eight medical datasets that cover various scenarios, such as out-of-distribution, dynamic data streams, and limited training samples.} Our experiments yield detailed results and insightful observations highlighting the performance and medical applicability of these techniques. Notably, our results have shown the outstanding performance of the Deep Subdomain Adaptation Network (DSAN) algorithm. This algorithm achieved feasible classification accuracy (91.2%) in the COVID-19 dataset using Resnet50 and showed an important accuracy improvement in the dynamic data stream DA scenario (+6.7%) compared to the baseline. Our results also demonstrate that DSAN exhibits remarkable level of explainability when evaluated on COVID-19 and skin cancer datasets. These results contribute to the understanding of DA techniques and offer valuable insight into the effective adaptation of models to medical data.
领域自适应(DA)技术在机器学习中有潜力通过利用源域的信息来缓解训练集和测试集之间的分布差异。在图像分类中,大多数DA方面的进展都是使用自然图像而非医学数据取得的,医学数据更难处理。此外,即使对于自然图像,主流数据集的使用也可能导致性能偏见。本研究旨在更好地了解DA对于自然图像和医学图像的好处,对七种广泛应用于图像分类的DA技术进行了557项模拟研究,涉及五个自然数据集和八个医学数据集,涵盖各种场景,如跨分布、动态数据流和有限的训练样本。我们的实验产生了详细的结果和深刻的观察结果,突出了这些技术在性能和医学应用方面的表现。值得注意的是,我们的结果展示了Deep Subdomain Adaptation Network(DSAN)算法的出色性能。该算法在COVID-19数据集上使用Resnet50取得了可行的分类准确率(91.2%),并在动态数据流DA场景中实现了与基线相比的重要准确率改进(+6.7%)。我们的结果还表明,在COVID-19和皮肤癌数据集上评估时,DSAN表现出极高的可解释性。这些结果有助于了解DA技术,并为模型在医学数据上的有效适应提供了宝贵的见解。
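文中表现突出的 DSAN 属于基于差异度量的域适应方法:它用类条件的局部 MMD(LMMD)对齐源域与目标域的子域分布。下面给出一个普通多带宽 RBF 核 MMD 的最小草图,仅用于说明“分布对齐损失”的形态,并非 DSAN 的 LMMD 实现(PyTorch):

```python
import torch

def rbf_mmd(source, target, sigmas=(1.0, 2.0, 4.0)):
    """多带宽 RBF 核下的 MMD^2(有偏估计),用于衡量源域与目标域特征分布差异。"""
    def kernel(a, b):
        d2 = torch.cdist(a, b) ** 2
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in sigmas)
    k_ss, k_tt, k_st = kernel(source, source), kernel(target, target), kernel(source, target)
    return k_ss.mean() + k_tt.mean() - 2 * k_st.mean()

# 训练时作为对齐项加入总损失:total = 分类损失 + lambda * MMD
src_feat, tgt_feat = torch.randn(32, 128), torch.randn(32, 128)
print(rbf_mmd(src_feat, tgt_feat))
```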
论文及项目相关链接
PDF Accepted in Information Sciences
摘要
该研究利用机器学习中的域适应(DA)技术,通过利用源域的信息来缓解训练集和测试集之间的分布差异。在图像分类方面,大多数DA的进展都是使用自然图像而非医学数据来完成的,医学数据更难处理。此外,即使是自然图像,使用主流数据集也可能导致性能偏见。本研究为了更深入地了解DA在自然图像和医学图像中的优势,进行了557次模拟研究,使用了七种广泛使用的DA技术,涉及五个自然和八个医学数据集,涵盖了各种场景,如超出分布范围、动态数据流和有限的训练样本。实验产生了详细的结果和深刻的观察结果,突出了这些技术在性能和医学应用方面的表现。值得注意的是,Deep Subdomain Adaptation Network(DSAN)算法表现出了出色的性能。该算法在COVID-19数据集上使用Resnet50达到了可行的分类精度(91.2%),并且在动态数据流DA场景中相比基线方法精度提高了+6.7%。研究结果还表明,DSAN在COVID-19和皮肤癌数据集上的可解释性表现卓越。这些结果有助于理解DA技术,并为模型在医学数据上的有效适应提供了有价值的见解。
关键见解
- 研究通过557次模拟研究评估了七种广泛使用的DA技术在图像分类中的性能。
- 研究涉及五个自然和八个医学数据集,涵盖了多种场景,如超出分布、动态数据流和有限的训练样本。
- DSAN算法在COVID-19数据集上表现出较高的分类精度(91.2%),并且在动态数据流场景中相比其他方法具有更好的性能。
- DSAN在解释性方面也表现出色,特别是在COVID-19和皮肤癌数据集上。
- DA技术对于处理医学图像具有潜力,可以有效适应医学数据。
- 研究结果有助于深入理解DA技术在图像分类中的应用。
点此查看论文截图



Learning What is Worth Learning: Active and Sequential Domain Adaptation for Multi-modal Gross Tumor Volume Segmentation
Authors:Jingyun Yang, Guoqing Zhang, Jingge Wang, Yang Li
Accurate gross tumor volume segmentation on multi-modal medical data is critical for radiotherapy planning in nasopharyngeal carcinoma and glioblastoma. Recent advances in deep neural networks have brought promising results in medical image segmentation, leading to an increasing demand for labeled data. Since labeling medical images is time-consuming and labor-intensive, active learning has emerged as a solution to reduce annotation costs by selecting the most informative samples to label and adapting high-performance models with as few labeled samples as possible. Previous active domain adaptation (ADA) methods seek to minimize sample redundancy by selecting samples that are farthest from the source domain. However, such one-off selection can easily cause negative transfer, and access to source medical data is often limited. Moreover, the query strategy for multi-modal medical data remains unexplored. In this work, we propose an active and sequential domain adaptation framework for dynamic multi-modal sample selection in ADA. We derive a query strategy to prioritize labeling and training on the most valuable samples based on their informativeness and representativeness. Empirical validation on diverse gross tumor volume segmentation tasks demonstrates that our method achieves favorable segmentation performance, significantly outperforming state-of-the-art ADA methods. Code is available at the git repository: \href{https://github.com/Hiyoochan/mmActS}{mmActS}.
在多模态医学数据上进行精确的肿瘤体积分割对于鼻咽癌和胶质母细胞瘤的放射治疗计划至关重要。深度神经网络领域的最新进展为医学图像分割带来了有希望的结果,导致对标记数据的需求不断增加。由于医学图像标注耗时且劳动密集型,主动学习作为一种解决方案已经出现,通过选择最有信息量的样本进行标注,并利用尽可能少的标注样本适应高性能模型,从而减少标注成本。之前的主动域适应(ADA)方法试图通过选择距离源域最远的样本来减少样本冗余。然而,这种一次性选择很容易导致负迁移,并且往往难以获取源医学数据。此外,多模态医学数据的查询策略仍然未被探索。在这项工作中,我们提出了一个主动和序贯的域适应框架,用于ADA中的动态多模态样本选择。我们根据样本的信息量和代表性制定了一个查询策略,优先标注和训练最有价值的样本。在多种肿瘤体积分割任务上的实证验证表明,我们的方法实现了良好的分割性能,显著优于最新的ADA方法。代码可在git仓库中找到:mmActS。
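论文的查询策略按样本的“信息量”与“代表性”排序以决定优先标注对象。下面是一个通用的打分草图:用预测熵衡量信息量、用与已标注集合的特征距离衡量代表性,再加权求和。这只是对该类查询策略的一般性示意,并非论文的具体准则,各参数均为假设(PyTorch):

```python
import torch

def query_scores(probs, features, labeled_features, alpha=0.5):
    """
    示意:按“信息量 + 代表性”为未标注样本打分,分数越高越优先送标。
    probs: (N, C, ...) 逐体素类别概率;features: (N, D) 样本级特征;
    labeled_features: (M, D) 已标注样本特征;alpha 权衡两项。
    """
    # 信息量:预测熵的均值(越不确定,越值得标注)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).flatten(1).mean(dim=1)
    # 代表性:与已标注集合的最小特征距离(越远,冗余越小)
    dist = torch.cdist(features, labeled_features).min(dim=1).values
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * norm(entropy) + (1 - alpha) * norm(dist)

probs = torch.softmax(torch.randn(10, 3, 16, 16, 16), dim=1)   # 10 个候选样本的分割概率
scores = query_scores(probs, torch.randn(10, 64), torch.randn(4, 64))
next_to_label = scores.topk(2).indices                         # 本轮选 2 个样本送标
```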
论文及项目相关链接
Summary
本文提出了一种针对多模态医学数据的主动序列域自适应框架,用于动态选择样本。该框架通过考虑样本的信息性和代表性来优先标记和训练最有价值的样本,从而提高放射治疗计划中的肿瘤体积分割准确性。经验验证表明,该方法在多种肿瘤体积分割任务上表现优异,显著优于现有ADA方法。
Key Takeaways
- 准确的多模态医学数据肿瘤体积分割对鼻咽癌和胶质母细胞瘤的放射治疗计划至关重要。
- 深度神经网络在医学图像分割方面的最新进展导致了标注数据需求的增加。
- 医学图像标注耗时且劳力密集,因此出现了主动学习方法来减少标注成本。
- 以往的主动域适应(ADA)方法通过选择距离源域最远的样本来减少样本冗余,但这种方法容易导致负面迁移,并且获取源医学数据往往受限。
- 本文提出了一种新的主动序列域自适应框架,用于动态多模态样本选择,该框架考虑了样本的信息性和代表性。
- 该方法在多种肿瘤体积分割任务上进行了实证验证,表现出优越的性能,显著优于现有的ADA方法。
点此查看论文截图




Event-Enriched Image Analysis Grand Challenge at ACM Multimedia 2025
Authors:Thien-Phuc Tran, Minh-Quang Nguyen, Minh-Triet Tran, Tam V. Nguyen, Trong-Le Do, Duy-Nam Ly, Viet-Tham Huynh, Khanh-Duy Le, Mai-Khiem Tran, Trung-Nghia Le
The Event-Enriched Image Analysis (EVENTA) Grand Challenge, hosted at ACM Multimedia 2025, introduces the first large-scale benchmark for event-level multimodal understanding. Traditional captioning and retrieval tasks largely focus on surface-level recognition of people, objects, and scenes, often overlooking the contextual and semantic dimensions that define real-world events. EVENTA addresses this gap by integrating contextual, temporal, and semantic information to capture the who, when, where, what, and why behind an image. Built upon the OpenEvents V1 dataset, the challenge features two tracks: Event-Enriched Image Retrieval and Captioning, and Event-Based Image Retrieval. A total of 45 teams from six countries participated, with evaluation conducted through Public and Private Test phases to ensure fairness and reproducibility. The top three teams were invited to present their solutions at ACM Multimedia 2025. EVENTA establishes a foundation for context-aware, narrative-driven multimedia AI, with applications in journalism, media analysis, cultural archiving, and accessibility. Further details about the challenge are available at the official homepage: https://ltnghia.github.io/eventa/eventa-2025.
在ACM多媒体2025主办的事件丰富图像分析(EVENTA)大赛中,引入了事件级别多模式理解的首个大规模基准测试。传统的描述和检索任务主要关注对人物、物体和场景的表层识别,往往忽视了定义真实世界事件的上下文和语义维度。EVENTA通过整合上下文、时间和语义信息,捕捉图像的“谁、何时、何地、何事和为何”来弥补这一差距。该挑战基于OpenEvents V1数据集,设有两个赛道:事件丰富图像检索和描述,以及基于事件的图像检索。共有来自六个国家的45支队伍参赛,评估分为公开测试阶段和私下测试阶段进行,以确保公平性和可重复性。前三名队伍被邀请在ACM多媒体2025上展示他们的解决方案。EVENTA为语境感知、叙事驱动的多媒体人工智能奠定了基础,在新闻、媒体分析、文化归档和可访问性等方面有应用。有关该挑战的更多详细信息,请访问官方网站:https://ltnghia.github.io/eventa/eventa-2025。
论文及项目相关链接
PDF ACM Multimedia 2025
Summary
事件丰富的图像分析(EVENTA)挑战赛于ACM多媒体会议主办,建立首个大规模事件级别多媒体理解基准测试。该挑战通过集成上下文、时间和语义信息捕捉图像的“谁”、“何时”、“何地”、“何事”和“为何”,解决了传统标注和检索任务忽略的上下文和语义维度问题。该挑战包括两个赛道:事件丰富图像检索和标注,以及基于事件的图像检索。共有来自六个国家的45支队伍参与挑战,并通过公开和私下测试阶段进行公平和可重复性的评估。排名前三的团队受邀在ACM多媒体会议上进行方案展示。EVENTA为语境感知、叙事驱动的多媒体人工智能建立了基础,并应用于新闻业、媒体分析、文化归档和可访问性等领域。
Key Takeaways
- EVENTA是首个大规模事件级别多媒体理解的基准测试,于ACM多媒体会议举办。
- 该挑战解决了传统图像标注和检索任务忽略的上下文和语义维度问题。
- EVENTA通过集成上下文、时间和语义信息,深入捕捉图像内容。
- 挑战包括两个赛道:事件丰富图像检索和标注,以及基于事件的图像检索。
- 共有来自六个国家的45支队伍参与挑战,通过公开和私下测试阶段进行公平评估。
- EVENTA的成立为语境感知、叙事驱动的多媒体人工智能提供了基础。
点此查看论文截图




SWiFT: Soft-Mask Weight Fine-tuning for Bias Mitigation
Authors:Junyu Yan, Feng Chen, Yuyang Xue, Yuning Du, Konstantinos Vilouras, Sotirios A. Tsaftaris, Steven McDonagh
Recent studies have shown that Machine Learning (ML) models can exhibit bias in real-world scenarios, posing significant challenges in ethically sensitive domains such as healthcare. Such bias can negatively affect model fairness, model generalization abilities and further risks amplifying social discrimination. There is a need to remove biases from trained models. Existing debiasing approaches often necessitate access to original training data and need extensive model retraining; they also typically exhibit trade-offs between model fairness and discriminative performance. To address these challenges, we propose Soft-Mask Weight Fine-Tuning (SWiFT), a debiasing framework that efficiently improves fairness while preserving discriminative performance with much less debiasing costs. Notably, SWiFT requires only a small external dataset and only a few epochs of model fine-tuning. The idea behind SWiFT is to first find the relative, and yet distinct, contributions of model parameters to both bias and predictive performance. Then, a two-step fine-tuning process updates each parameter with different gradient flows defined by its contribution. Extensive experiments with three bias sensitive attributes (gender, skin tone, and age) across four dermatological and two chest X-ray datasets demonstrate that SWiFT can consistently reduce model bias while achieving competitive or even superior diagnostic accuracy under common fairness and accuracy metrics, compared to the state-of-the-art. Specifically, we demonstrate improved model generalization ability as evidenced by superior performance on several out-of-distribution (OOD) datasets.
近期研究表明,机器学习(ML)模型在现实世界场景中可能会表现出偏见,这在医疗等伦理敏感领域带来了重大挑战。这种偏见可能会对模型的公平性、模型泛化能力产生负面影响,并可能进一步加剧社会歧视。需要从训练模型中消除偏见。现有的去偏方法通常需要访问原始训练数据,并需要大量模型重新训练;它们通常在模型公平性和判别性能之间表现出权衡。为了解决这些挑战,我们提出了Soft-Mask Weight Fine-Tuning(SWiFT)方法,这是一种去偏框架,可以有效地提高模型的公平性,同时保持其判别性能,并且去偏成本较低。值得注意的是,SWiFT仅需一个外部小型数据集和几次模型微调周期。SWiFT背后的理念是首先找出模型参数对偏见和预测性能的相对且独特的贡献。然后,一个两阶段的微调过程会根据每个参数的贡献,使用不同的梯度流来更新它。在四个皮肤科和两个胸部X射线数据集上进行的针对三个敏感属性(性别、肤色和年龄)的广泛实验表明,SWiFT能够持续减少模型偏见,同时在常见的公平性和准确性指标上实现了具有竞争力的甚至更高的诊断准确率,与现有技术相比具有优势。具体来说,我们通过其在几个离群数据集上的卓越表现证明了模型泛化能力的增强。
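SWiFT 的核心想法是:先估计各参数对“偏差”与“判别性能”的相对贡献,再据此给每个参数的梯度加不同强度的软掩码,仅做少量轮次微调。下面给出一个高度简化的草图,演示“按参数软掩码缩放梯度”这一步;其中掩码的来源、公平性损失的形式均为假设,并非论文的官方算法(PyTorch):

```python
import torch
import torch.nn as nn

def soft_mask_finetune_step(model, fairness_loss, task_loss, masks, lr=1e-4):
    """
    示意:按参数软掩码缩放梯度后做一步更新。
    masks: 与参数同形状、取值 [0,1] 的字典;对偏差贡献大的参数掩码接近 1,更新幅度更大。
    """
    loss = task_loss + fairness_loss
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            p -= lr * masks[name] * p.grad      # 不同参数走不同强度的梯度流

# 玩具示例:掩码本应由参数贡献度打分得到,这里用随机值代替
model = nn.Linear(16, 2)
masks = {n: torch.rand_like(p) for n, p in model.named_parameters()}
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
sensitive = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])   # 假设的敏感属性分组
logits = model(x)
task = nn.CrossEntropyLoss()(logits, y)
# 一个简化的公平性代理:不同敏感组之间正类概率均值之差
p1 = torch.softmax(logits, dim=1)[:, 1]
fair = (p1[sensitive == 0].mean() - p1[sensitive == 1].mean()).abs()
soft_mask_finetune_step(model, fair, task, masks)
```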
论文及项目相关链接
PDF Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2025:015
摘要
机器学习模型在现实世界场景中可能表现出偏见,这在医疗等伦理敏感领域带来了挑战。偏见会影响模型的公平性和泛化能力,并可能加剧社会歧视。需要消除训练模型中的偏见。现有的去偏方法通常需要访问原始训练数据并进行大量的模型再训练,并且通常在模型公平性和辨别性能之间表现出权衡。为了应对这些挑战,我们提出了Soft-Mask Weight Fine-Tuning(SWiFT)去偏框架,该框架在保持辨别性能的同时,有效地提高了模型的公平性,并且去偏成本较低。SWiFT仅需一个小型外部数据集和几次模型微调周期。其基本思想是先找出模型参数对偏见和预测性能的相对但独特的贡献,然后通过一个两步骤微调过程,根据每个参数的贡献,用不同的梯度流进行更新。在多个皮肤病变和胸部X射线数据集上的实验表明,SWiFT在常见的公平性和准确性指标下,可以持续减少模型偏见,并且在诊断准确性方面与最新技术相比具有竞争力甚至更优。特别是,我们展示了提高的模型泛化能力,这在超出分布的数据集上的表现得到了证实。
关键见解
- 机器学习模型在现实世界应用中可能出现偏见,这在医疗等伦理敏感领域带来挑战。
- 偏见会影响模型的公平性和泛化能力,并可能加剧社会歧视。
- 现有去偏方法通常需要访问原始训练数据并进行大量再训练,且在模型公平性和性能间存在权衡。
- 提出的Soft-Mask Weight Fine-Tuning (SWiFT)框架旨在通过微调模型参数去偏,同时保持模型的辨别性能和公平性。
- SWiFT仅需一个小外部数据集和少量微调周期,降低了去偏成本。
- SWiFT通过识别模型参数对偏见和预测性能的贡献,进行两步微调过程。
点此查看论文截图



