⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ 可在 HuggingFace 免费体验
2025-10-25 更新
ARGenSeg: Image Segmentation with Autoregressive Image Generation Model
Authors:Xiaolong Wang, Lixiang Ru, Ziyuan Huang, Kaixiang Ji, Dandan Zheng, Jingdong Chen, Jun Zhou
We propose a novel AutoRegressive Generation-based paradigm for image Segmentation (ARGenSeg), achieving multimodal understanding and pixel-level perception within a unified framework. Prior works integrating image segmentation into multimodal large language models (MLLMs) typically employ either boundary points representation or dedicated segmentation heads. These methods rely on discrete representations or semantic prompts fed into task-specific decoders, which limits the ability of the MLLM to capture fine-grained visual details. To address these challenges, we introduce a segmentation framework for MLLM based on image generation, which naturally produces dense masks for target objects. We leverage MLLM to output visual tokens and detokenize them into images using a universal VQ-VAE, making the segmentation fully dependent on the pixel-level understanding of the MLLM. To reduce inference latency, we employ a next-scale-prediction strategy to generate required visual tokens in parallel. Extensive experiments demonstrate that our method surpasses prior state-of-the-art approaches on multiple segmentation datasets with a remarkable boost in inference speed, while maintaining strong understanding capabilities.
我们提出了一种新型的基于AutoRegressive Generation的图像分割范式(ARGenSeg),在一个统一框架内实现了多模态理解和像素级感知。先前将图像分割集成到多模态大型语言模型(MLLM)中的工作通常采用边界点表示或专用分割头。这些方法依赖于离散表示或语义提示,这些提示被输入到特定任务的解码器中,这限制了MLLM捕获精细粒度视觉细节的能力。为了解决这些挑战,我们引入了基于图像生成的多模态大型语言模型分割框架,该框架自然地产生目标对象的密集掩码。我们利用MLLM输出视觉令牌,并使用通用VQ-VAE将它们解码成图像,使分割完全依赖于MLLM的像素级理解。为了减少推理延迟,我们采用下一尺度预测策略来并行生成所需的视觉令牌。大量实验表明,我们的方法在多个分割数据集上的表现超过了最新前沿方法,并且在推理速度上有显著的提升,同时保持着强大的理解能力。
论文及项目相关链接
PDF Accepted to NeurIPS 2025, 18 pages
Summary
提出基于 AutoRegressive Generation 的新型图像分割方法 ARGenSeg,在统一框架内实现多模态理解与像素级感知。与现有将图像分割集成到多模态大型语言模型中的工作相比,该方法引入基于图像生成的分割框架,可自然生成目标对象的密集掩模,并使分割完全依赖于多模态大语言模型的像素级理解。采用下一尺度预测策略并行生成所需视觉令牌以降低推理延迟,在显著提升推理速度的同时保持了强大的理解能力和分割效果。
Key Takeaways
- 提出基于AutoRegressive Generation的ARGenSeg方法用于图像分割。
- 实现多模态理解与像素级感知的统一框架。
- 引入基于图像生成的分割框架,自然生成目标对象的密集掩模。
- 依赖于多模态大型语言模型的像素级理解进行分割。
- 使用下一个尺度预测策略减少推理延迟。
- 在多个分割数据集上超越现有方法,显著提高推理速度。
点此查看论文截图
Better Tokens for Better 3D: Advancing Vision-Language Modeling in 3D Medical Imaging
Authors:Ibrahim Ethem Hamamci, Sezgin Er, Suprosanna Shit, Hadrien Reynaud, Dong Yang, Pengfei Guo, Marc Edgar, Daguang Xu, Bernhard Kainz, Bjoern Menze
Recent progress in vision-language modeling for 3D medical imaging has been fueled by large-scale computed tomography (CT) corpora with paired free-text reports, stronger architectures, and powerful pretrained models. This has enabled applications such as automated report generation and text-conditioned 3D image synthesis. Yet, current approaches struggle with high-resolution, long-sequence volumes: contrastive pretraining often yields vision encoders that are misaligned with clinical language, and slice-wise tokenization blurs fine anatomy, reducing diagnostic performance on downstream tasks. We introduce BTB3D (Better Tokens for Better 3D), a causal convolutional encoder-decoder that unifies 2D and 3D training and inference while producing compact, frequency-aware volumetric tokens. A three-stage training curriculum enables (i) local reconstruction, (ii) overlapping-window tiling, and (iii) long-context decoder refinement, during which the model learns from short slice excerpts yet generalizes to scans exceeding 300 slices without additional memory overhead. BTB3D sets a new state-of-the-art on two key tasks: it improves BLEU scores and increases clinical F1 by 40% over CT2Rep, CT-CHAT, and Merlin for report generation; and it reduces FID by 75% and halves FVD compared to GenerateCT and MedSyn for text-to-CT synthesis, producing anatomically consistent 512×512×241 volumes. These results confirm that precise three-dimensional tokenization, rather than larger language backbones alone, is essential for scalable vision-language modeling in 3D medical imaging. The codebase is available at: https://github.com/ibrahimethemhamamci/BTB3D
近期,3D医学影像的视觉语言建模进步显著,这得益于大规模配有自由文本报告的计算机断层扫描(CT)数据集、更强大的架构和预训练模型。这推动了自动化报告生成和文本条件3D图像合成等应用的发展。然而,当前的方法在处理高分辨率、长序列体积数据时遇到困难:对比预训练往往会得到与临床语言不匹配的视觉编码器,而逐片的令牌化会模糊精细的解剖结构,降低下游任务的诊断性能。我们推出了BTB3D(更好的令牌用于更好的三维),这是一款因果卷积编码器-解码器,可以统一2D和3D的训练和推理,同时产生紧凑、频率感知的三维令牌。三阶段的训练课程使模型能够进行(i)局部重建,(ii)重叠窗口平铺,(iii)长上下文解码器细化,在此过程中,模型从短切片摘录中学习,并可以推广到超过300切片的扫描,无需额外的内存开销。在两项关键任务上,BTB3D创造了新的最先进的水平:在报告生成方面,相较于CT2Rep、CT-CHAT和Merlin,BLEU分数得到了提升,临床F1增加了40%;在文本到CT合成方面,相较于GenerateCT和MedSyn,FID降低了75%,FVD减半,生成了解剖结构一致的512×512×241体积图像。这些结果证实,精确的三维令牌化对于可扩展的三维医学影像视觉语言建模至关重要,而非单靠更大的语言骨干网络。代码库可在以下链接找到:https://github.com/ibrahimethemhamamci/BTB3D 。
论文及项目相关链接
PDF NeurIPS 2025
Summary
本文主要介绍了在三维医学成像中,精准的三维标记化对于可伸缩的视觉语言模型的重要性。研究引入了BTB3D模型,该模型采用因果卷积编码器解码器,统一了二维和三维的训练和推理过程,提高了报告生成和文本到CT图像合成的性能。通过三个阶段的训练课程,模型能够学习短切片摘录并在没有额外内存开销的情况下推广到超过300切片的扫描。该项目代码已在GitHub上公开。
Key Takeaways
- 最新进展:介绍了基于大规模计算机断层扫描(CT)语料库、强大的架构和预训练模型的视觉语言建模在三维医学成像中的最新进展。
- 应用实例:实现了自动化报告生成和文本条件的三维图像合成等应用。
- 当前挑战:对比预训练常常导致与临床语言不一致的视觉编码器,切片级别的标记化会模糊精细的解剖学结构,降低下游任务的诊断性能。
- BTB3D模型介绍:引入BTB3D模型,一个因果卷积编码器解码器,能统一二维和三维的训练和推理,生成紧凑的频率感知体积标记。
- 三阶段训练课程:通过局部重建、重叠窗口拼贴和长期上下文解码器细化三个阶段的训练,模型能在没有额外内存开销的情况下处理长序列体积。
- 性能表现:BTB3D模型在报告生成和文本到CT图像合成等关键任务上设定了新的最佳表现。
点此查看论文截图
Unlock Anionic Behavior of Calcium Through Pressure Engineering
Authors:Yang Lv, Junwei Li, Jianfu Li, Yong Liu, Jianan Yuan, Jiani Lin, Saori Kawaguchi-Imada, Qingyang Hu, Xiaoli Wang
An isolated calcium (Ca) atom has empty d-orbitals under ambient conditions. However, s-d band hybridization has been observed in both elemental Ca and compounds by manipulating thermodynamic conditions. Here, we reveal that the Ca 3d-band can even capture electrons from halogen atoms under pressure, exhibiting anionic behaviors in iodides. We predict a CsCl-type monovalent CaI at above 50 GPa by employing first-principles structural searching and successfully identified the phase at 84 GPa using in situ X-ray diffraction. We further reveal that, due to the effect of orbital broadening, unusual charge transfer from the 5p orbitals of I to the 3d orbitals of Ca in CaI, gradually reverses the ionicity of Ca and becomes the anionic ICa at 485 GPa. Multivalent Ca stabilizes a set of metallic iodides with eight- to ten-fold iodine hyper-coordination. Our findings demonstrate that the valence states of Ca can vary from negative to +2, suggesting much greater complexity of Ca chemistry under ultrahigh pressures.
一个孤立的钙(Ca)原子在环境条件下具有空的d轨道。然而,通过调节热力学条件,在元素钙和化合物中均观察到了s-d带杂化现象。在这里,我们发现Ca的3d轨道甚至可以在压力下从卤素原子中捕获电子,在碘化物中表现出阴离子行为。我们通过采用第一性原理结构搜索方法预测了在50 GPa以上存在的CsCl型一价CaI,并在84 GPa下通过原位X射线衍射成功鉴定了该相。我们进一步揭示,由于轨道展宽的影响,碘的5p轨道向钙的3d轨道的异常电荷转移在CaI中逐渐逆转了钙的离子性,并在485 GPa下形成阴离子ICa。多价钙稳定了一系列具有八至十重碘超高配位的金属碘化物。我们的研究结果表明,钙的价态可以从负价变化到+2价,表明超高压下的钙化学具有远超预期的复杂性。
论文及项目相关链接
Summary
本文探讨了钙原子在高压下的特殊行为。研究发现,在加压条件下,钙的3d轨道能捕获卤素原子的电子,表现出阴离子性质,特别是在碘化物中。通过第一性原理结构搜索预测了CsCl型一价CaI在50 GPa以上的存在,并通过原位X射线衍射在84 GPa下成功识别出该相。此外,由于轨道展宽效应,碘的5p轨道向钙的3d轨道发生电荷转移,在485 GPa下钙的离子性逐渐逆转,形成阴离子ICa。同时,多价钙稳定了一系列具有八至十重碘超高配位的金属碘化物,表明钙在超高压力下的化学性质更加复杂。
Key Takeaways
- 钙原子在高压下表现出特殊行为,其3d轨道能捕获卤素原子的电子。
- 在碘化物中,钙表现出阴离子性质。
- 通过第一性原理结构搜索预测了CsCl型一价CaI在高压下的存在。
- 在84 GPa的实验条件下成功识别出CsCl型CaI相。
- 轨道展宽效应导致碘的5p轨道向钙的3d轨道发生电荷转移。
- 在极高压力(485 GPa)下,钙的离子性逐渐逆转,形成阴离子ICa。
点此查看论文截图
FlowCycle: Pursuing Cycle-Consistent Flows for Text-based Editing
Authors:Yanghao Wang, Zhen Wang, Long Chen
Recent advances in pre-trained text-to-image flow models have enabled remarkable progress in text-based image editing. Mainstream approaches always adopt a corruption-then-restoration paradigm, where the source image is first corrupted into an ``intermediate state’’ and then restored to the target image under the prompt guidance. However, current methods construct this intermediate state in a target-agnostic manner, i.e., they primarily focus on realizing source image reconstruction while neglecting the semantic gaps towards the specific editing target. This design inherently results in limited editability or inconsistency when the desired modifications substantially deviate from the source. In this paper, we argue that the intermediate state should be target-aware, i.e., selectively corrupting editing-relevant contents while preserving editing-irrelevant ones. To this end, we propose FlowCycle, a novel inversion-free and flow-based editing framework that parameterizes corruption with learnable noises and optimizes them through a cycle-consistent process. By iteratively editing the source to the target and recovering back to the source with dual consistency constraints, FlowCycle learns to produce a target-aware intermediate state, enabling faithful modifications while preserving source consistency. Extensive ablations have demonstrated that FlowCycle achieves superior editing quality and consistency over state-of-the-art methods.
近期预训练文本到图像流模型的进步,为基于文本的图像编辑带来了显著的进展。主流方法通常采用“破坏-然后-恢复”的模式,首先将源图像破坏为“中间状态”,然后在提示的指导下将其恢复为目标图像。然而,当前的方法以目标无关的方式构建这个中间状态,即主要关注源图像的重建,而忽略了与特定编辑目标之间的语义差距。当所需的修改与源图像偏差较大时,这种设计固有地导致编辑能力受限或结果不一致。在本文中,我们认为中间状态应该是目标感知的,即选择性地破坏与编辑相关的内容,同时保留与编辑无关的内容。为此,我们提出了FlowCycle,这是一个新型的、无需反演、基于流的编辑框架,它通过可学习的噪声对破坏进行参数化,并通过循环一致的过程对其进行优化。通过迭代地从源编辑到目标并恢复到源,并施加双重一致性约束,FlowCycle学习产生目标感知的中间状态,实现在保留源一致性的同时进行忠实的修改。大量的消融实验表明,FlowCycle的编辑质量和一致性优于现有方法。
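下面给出一个极简的示意性草图(并非论文实现):用一个可学习的噪声参数来参数化“破坏”,并以“源→目标→源”的循环一致损失加上源一致项对其优化。其中 `DummyEditOp`、提示嵌入以及损失权重均为占位或示例性假设,实际应替换为预训练的文本到图像流模型。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# 示意性的可微“编辑算子”:实际应为预训练的文本到图像流模型,这里用一个小卷积网络占位
class DummyEditOp(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, image, noise, prompt_embed):
        # 将可学习的破坏噪声注入图像后交给“流模型”处理(仅为示意)
        return self.net(image + noise)

source = torch.rand(1, 3, 64, 64)            # 源图像
prompt_src = torch.zeros(1, 16)              # 源提示嵌入(占位)
prompt_tgt = torch.ones(1, 16)               # 目标提示嵌入(占位)

edit_op = DummyEditOp()
noise = nn.Parameter(torch.zeros_like(source))   # 可学习的破坏噪声,即目标感知的“中间状态”
optimizer = torch.optim.Adam([noise] + list(edit_op.parameters()), lr=1e-3)

for step in range(100):
    target = edit_op(source, noise, prompt_tgt)      # 源 -> 目标
    recovered = edit_op(target, noise, prompt_src)   # 目标 -> 源(循环)
    loss_cycle = F.mse_loss(recovered, source)       # 循环一致:应能恢复回源图像
    loss_consist = F.mse_loss(target, source)        # 源一致:抑制与编辑无关区域的改动
    loss = loss_cycle + 0.1 * loss_consist           # 权重 0.1 为示例性假设
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```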
论文及项目相关链接
Summary
近期文本到图像的预训练流模型在文本驱动的图像编辑上取得了显著进展。现有主流方法遵循“先破坏再恢复”的模式,将原始图像转化为中间状态,然后根据文本提示恢复为目标图像。然而,当前方法构建中间状态时忽视了目标导向性,主要关注源图像的重建,而忽视了与特定编辑目标之间的语义差距。这导致当所需修改与原始图像有较大差异时,编辑能力受限或结果不一致。本文主张中间状态应具备目标导向性,即选择性破坏与编辑相关的内容同时保留与编辑无关的内容。为此,我们提出了FlowCycle,一种无需反演、基于流的编辑框架,以可学习噪声参数化破坏过程,并在循环一致的过程中对其优化。FlowCycle通过迭代式地从源图像编辑到目标图像再恢复回源图像,学习生成目标导向的中间状态,实现了忠实修改的同时保持源一致性。
Key Takeaways
- 近期预训练文本到图像的流模型在文本驱动的图像编辑中取得显著进展。
- 当前主流方法采用“先破坏再恢复”的模式,但构建中间状态时缺乏目标导向性。
- 本文主张中间状态应具备目标导向性,即选择性破坏与编辑相关的内容。
- FlowCycle框架被提出,通过参数化噪声和优化实现周期一致性过程。
- FlowCycle通过迭代式编辑和恢复过程,学习生成目标导向的中间状态。
- FlowCycle在编辑质量和一致性方面优于现有方法。
点此查看论文截图
Filter-Based Reconstruction of Images from Events
Authors:Bernd Pfrommer
Reconstructing an intensity image from the events of a moving event camera is a challenging task that is typically approached with neural networks deployed on graphics processing units. This paper presents a much simpler, FIlter Based Asynchronous Reconstruction method (FIBAR). First, intensity changes signaled by events are integrated with a temporal digital IIR filter. To reduce reconstruction noise, stale pixels are detected by a novel algorithm that regulates a window of recently updated pixels. Arguing that for a moving camera, the absence of events at a pixel location likely implies a low image gradient, stale pixels are then blurred with a Gaussian filter. In contrast to most existing methods, FIBAR is asynchronous and permits image read-out at an arbitrary time. It runs on a modern laptop CPU at about 42(140) million events/s with (without) spatial filtering enabled. A few simple qualitative experiments are presented that show the difference in image reconstruction between FIBAR and a neural network-based approach (FireNet). FIBAR’s reconstruction is noisier than neural network-based methods and suffers from ghost images. However, it is sufficient for certain tasks such as the detection of fiducial markers. Code is available at https://github.com/ros-event-camera/event_image_reconstruction_fibar
从移动事件相机的事件中重建强度图像是一项具有挑战性的任务,通常通过部署在图形处理单元上的神经网络来解决。本文提出了一种更为简单的基于滤波器的异步重建方法(FIBAR)。首先,通过时间数字IIR滤波器整合由事件引起的强度变化。为了减少重建噪声,通过一种新型算法检测过时的像素,该算法可调控最近更新的像素窗口。论文主张,对于移动相机而言,某个像素位置没有事件很可能意味着该处图像梯度较低,因此随后对过时像素施加高斯模糊。与大多数现有方法不同,FIBAR是异步的,可以在任意时间进行图像读取。它在启用(禁用)空间滤波的情况下,在现代笔记本电脑CPU上的运行速度约为每秒4200万(1.4亿)个事件。本文进行了一些简单的定性实验,展示了FIBAR与基于神经网络的方法(如FireNet)在图像重建上的差异。FIBAR的重建结果较基于神经网络的方法更为嘈杂,并可能出现鬼影。然而,对于某些任务(如基准标记检测)而言,它是足够的。代码可在https://github.com/ros-event-camera/event_image_reconstruction_fibar找到。
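按照上文描述,可以用一个简化的 Python 草图来说明“IIR 泄漏积分 + 过时像素高斯模糊”的思路;其中时间常数 TAU、对比度步长 C、过时阈值 STALE_DT 与高斯 σ 均为示例性假设,并非论文的实际参数。

```python
import numpy as np
from scipy.ndimage import gaussian_filter

H, W = 240, 320
state = np.zeros((H, W), dtype=np.float32)        # 每像素的强度估计
last_update = np.zeros((H, W), dtype=np.float32)  # 每像素最近一次事件时间

TAU = 0.05        # IIR 泄漏时间常数(秒),示例值
C = 0.2           # 单个事件对应的对比度步长,示例值
STALE_DT = 0.1    # 超过该时间未更新即视为“过时”像素,示例值

def integrate_events(events):
    """events: (N, 4) 数组,列为 (t, x, y, polarity),polarity ∈ {-1, +1}。"""
    for t, x, y, p in events:
        x, y = int(x), int(y)
        # 一阶泄漏积分器:先按距上次更新的时间衰减,再累加事件带来的强度变化
        decay = np.exp(-(t - last_update[y, x]) / TAU)
        state[y, x] = state[y, x] * decay + p * C
        last_update[y, x] = t

def read_image(t_now, sigma=1.5):
    """在任意时刻读出图像;对“过时”像素做高斯模糊以抑制重建噪声。"""
    stale = (t_now - last_update) > STALE_DT
    blurred = gaussian_filter(state, sigma=sigma)
    return np.where(stale, blurred, state)

# 用随机事件流演示读出过程
rng = np.random.default_rng(0)
ev = np.stack([np.sort(rng.uniform(0, 1, 10000)),
               rng.integers(0, W, 10000),
               rng.integers(0, H, 10000),
               rng.choice([-1, 1], 10000)], axis=1)
integrate_events(ev)
frame = read_image(t_now=1.0)
```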
论文及项目相关链接
Summary
本文介绍了一种基于滤波的异步重建方法(FIBAR),用于从移动事件相机的事件重建强度图像。该方法采用时间数字IIR滤波器整合事件带来的强度变化,并引入新型算法检测过时像素以减少重建噪声。此外,对于移动相机而言,某像素位置没有事件很可能暗示该处图像梯度较低,因此使用高斯滤波器模糊过时像素。与其他方法不同,FIBAR是异步的,可在任意时间进行图像读取。在启用(禁用)空间滤波的情况下,其在现代笔记本电脑CPU上的运行速度约为每秒4200万(1.4亿)个事件。简单实验表明,与基于神经网络的方法(如FireNet)相比,FIBAR的图像重建存在噪声和鬼影现象,但对于基准标记检测等任务仍是足够的。
Key Takeaways
- FIBAR是一种用于从动态事件相机的事件重建强度图像的简化方法。
- 它采用数字IIR滤波器整合事件信号强度变化,并检测旧像素以减少重建噪声。
- 对于移动相机,某像素位置没有事件可能表示该处图像梯度较低,因此使用高斯滤波器模糊过时像素。
- FIBAR是异步方法,允许在任意时间进行图像读取。
- FIBAR在现代笔记本电脑CPU上的运行速度较快。
- 与基于神经网络的方法相比,FIBAR的图像重建存在噪声和鬼影现象。
- 尽管存在这些缺点,但FIBAR对于基准标记检测等任务仍然够用。
点此查看论文截图
Machine Learning-Based Localization Accuracy of RFID Sensor Networks via RSSI Decision Trees and CAD Modeling for Defense Applications
Authors:Curtis Lee Shull, Merrick Green
Radio Frequency Identification (RFID) tracking may be a viable solution for defense assets that must be stored in accordance with security guidelines. However, poor sensor specificity (vulnerabilities include long range detection, spoofing, and counterfeiting) can lead to erroneous detection and operational security events. We present a supervised learning simulation with realistic Received Signal Strength Indicator (RSSI) data and Decision Tree classification in a Computer Assisted Design (CAD)-modeled floor plan that encapsulates some of the challenges encountered in defense storage. In this work, we focused on classifying 12 lab zones (LabZoneA-L) to perform location inference. The raw dataset had approximately 980,000 reads. Class frequencies were imbalanced, and class weights were calculated to account for class imbalance in this multi-class setting. The model, trained on stratified subsamples to 5,000 balanced observations, yielded an overall accuracy of 34.2% and F1-scores greater than 0.40 for multiple zones (Zones F, G, H, etc.). However, rare classes (most notably LabZoneC) were often misclassified, even with the use of class weights. An adjacency-aware confusion matrix was calculated to allow better interpretation of physically adjacent zones. These results suggest that RSSI-based decision trees can be applied in realistic simulations to enable zone-level anomaly detection or misplacement monitoring for defense supply logistics. Reliable classification performance in low-coverage and low-signal zones could be improved with better antenna placement or additional sensors and sensor fusion with other modalities.
射频识别(RFID)跟踪对于必须按照安全指南存储的国防资产可能是一种可行的解决方案。然而,传感器特异性较差(漏洞包括远程检测、欺骗和伪造)可能导致错误检测和操作安全事件。我们采用了一种监督学习模拟方法,使用真实的接收信号强度指示(RSSI)数据和决策树分类,在计算机辅助设计(CAD)建模的平面图中,体现了国防存储中遇到的一些挑战。在这项工作中,我们专注于对12个实验室区域(LabZoneA-L)进行分类,以进行位置推断。原始数据集大约有98万个读取数据。类别频率分布不均衡,我们计算了类别权重,以弥补多类别设置中的类别不平衡问题。该模型经过分层子样本训练,达到5000个平衡观测值,总体准确度为34.2%,多个区域的F1得分大于0.40(区域F、G、H等)。然而,即使在使用类别权重的情况下,稀有类别(尤其是LabZoneC)通常会被误分类。计算了邻接感知混淆矩阵,以更好地解释物理上相邻的区域。这些结果表明,基于RSSI的决策树可以应用于现实模拟,以实现区域级别的异常检测或国防供应物流的错位监测。在低覆盖和低信号区域的可靠分类性能可以通过改善天线放置、增加传感器或其他模态的传感器融合来提高。
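下面用 scikit-learn 给出一个最小示例,演示摘要中提到的“分层子抽样 + 类别权重 + 决策树”的做法;RSSI 数据为随机生成的占位数据,特征维度与树深度均为假设值。

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score, accuracy_score

# 用随机数据模拟 RSSI 读数:每条读取含若干天线的信号强度,标签为 12 个实验室区域之一
rng = np.random.default_rng(42)
n_reads, n_antennas, n_zones = 20000, 8, 12
X = rng.normal(-60, 10, size=(n_reads, n_antennas))   # RSSI(dBm)占位数据
y = rng.integers(0, n_zones, size=n_reads)

# 分层抽样出较小的训练子集(论文中为 5000 条平衡观测,这里仅示意分层划分)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=5000, stratify=y, random_state=0)

# class_weight="balanced" 按类别频率的倒数加权,缓解多类不平衡
clf = DecisionTreeClassifier(max_depth=12, class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))
print("per-zone F1:", np.round(f1_score(y_test, pred, average=None), 3))
```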
论文及项目相关链接
PDF 10 pages, 5 figures. Submitted to the Journal of Defense Modeling and Simulation (JDMS) for the Special Issue Integrating AI/ML Into Modeling and Simulation (J22-4). This work evaluates machine learning-based RFID localization for defense logistics environments using CAD-modeled simulations and RSSI-driven decision tree classification
Summary
本文探讨了使用射频识别(RFID)技术跟踪国防资产的问题。虽然RFID具有可行性,但其传感器特异性较差,可能导致误检测和操作安全事件。研究在计算机辅助设计(CAD)建模的平面图中,利用逼真的RSSI数据进行监督学习模拟,采用决策树分类并通过类别权重处理类别不平衡问题。研究结果表明,RSSI决策树可用于实现区域级别的异常检测或误放置监控,但在低覆盖率和低信号区域,可通过改进天线放置或增加传感器及多模态传感器融合来提高可靠性。
Key Takeaways
- RFID技术在国防资产管理中有潜在应用价值,但需解决传感器特异性差的问题。
- 误检测和操作安全事件是RFID技术面临的挑战之一。
- 使用计算机建模的决策树分类方法可以处理RFID数据中的挑战性问题。
- 模拟中采用了RSSI数据并考虑了不平衡的分类问题。
- 模拟总体准确度为34.2%,特定区域的F1得分较高。
- 对某些稀有类别的分类效果不理想,需进一步优化算法和硬件配置。
点此查看论文截图
FairGRPO: Fair Reinforcement Learning for Equitable Clinical Reasoning
Authors:Shiqi Dai, Wei Dai, Jiaee Cheong, Paul Pu Liang
Medical artificial intelligence systems have achieved remarkable diagnostic capabilities, yet they consistently exhibit performance disparities across demographic groups, causing real-world harm to underrepresented populations. While recent multimodal reasoning foundation models have advanced clinical diagnosis through integrated analysis of diverse medical data, reasoning trainings via reinforcement learning inherit and often amplify biases present in training datasets dominated by majority populations. We introduce Fairness-aware Group Relative Policy Optimization (FairGRPO), a hierarchical reinforcement learning approach that promotes equitable learning across heterogeneous clinical populations. FairGRPO employs adaptive importance weighting of advantages based on representation, task difficulty, and data source. To address the common issue of missing demographic labels in the clinical domain, we further employ unsupervised clustering, which automatically discovers latent demographic groups when labels are unavailable. Through comprehensive experiments across 7 clinical diagnostic datasets spanning 5 clinical modalities across X-ray, CT scan, dermoscopy, mammography and ultrasound, we demonstrate that FairGRPO reduces predictive parity by 27.2% against all vanilla and bias mitigated RL baselines, while improving F1 score by 12.49%. Furthermore, training dynamics analysis reveals that FairGRPO progressively improves fairness throughout optimization, while baseline RL methods exhibit deteriorating fairness as training progresses. Based on FairGRPO, we release FairMedGemma-4B, a fairness-aware clinical VLLM that achieves state-of-the-art performance while demonstrating significantly reduced disparities across demographic groups.
医学人工智能系统已经具备了卓越的诊断能力,然而它们在不同人群中的表现始终存在差异,给代表性不足的人群带来了现实世界的伤害。虽然最近的多模态推理基础模型通过综合分析多样的医疗数据推动了临床诊断的进步,但强化学习中的推理训练会继承并经常放大由多数群体主导的训练数据集存在的偏见。我们引入了公平感知群组相对策略优化(FairGRPO),这是一种分层强化学习方法,旨在促进在异质临床人群中的公平学习。FairGRPO基于代表性、任务难度和数据来源,对优势(advantage)进行自适应重要性加权。为了解决临床领域常见的缺失人口统计标签的问题,我们还采用了无监督聚类,当标签不可用时,它会自动发现潜在的人群分组。通过对涵盖X光、CT扫描、皮肤镜检查、乳腺X线摄影和超声波等5种临床模态的7个临床诊断数据集进行全面实验,我们证明了FairGRPO相对于所有原始及带偏差缓解的强化学习基线,将预测均等性差异降低了27.2%,同时F1分数提高了12.49%。此外,训练动态分析表明,FairGRPO在优化过程中逐渐改善公平性,而基线强化学习方法在训练过程中公平性逐渐恶化。基于FairGRPO,我们发布了FairMedGemma-4B,这是一款具备公平意识的临床视觉大语言模型(VLLM),在达到业界领先性能的同时,显著减少了不同人群之间的差异。
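论文摘要并未给出具体的加权公式,下面仅以“组内相对优势 + 按人群占比与任务难度自适应加权”的形式做一个示意性草图,加权函数与超参均为假设。

```python
import torch

def group_relative_advantage(rewards, group_ids):
    """GRPO 风格的组内相对优势:每个样本的奖励减去组内均值,再除以组内标准差。"""
    adv = torch.zeros_like(rewards)
    for g in group_ids.unique():
        m = group_ids == g
        mu = rewards[m].mean()
        std = rewards[m].std(unbiased=False).clamp_min(1e-6)
        adv[m] = (rewards[m] - mu) / std
    return adv

def fairness_aware_weights(demo_ids, difficulty):
    """示意性的自适应重要性权重:人群占比越低、任务越难,权重越大(具体形式为假设)。"""
    counts = torch.bincount(demo_ids).float()
    freq = counts[demo_ids] / demo_ids.numel()
    w = (1.0 / freq) * (1.0 + difficulty)
    return w / w.mean()                            # 归一化,保持整体梯度尺度不变

rewards = torch.rand(32)                           # 每条推理轨迹的奖励
group_ids = torch.randint(0, 8, (32,))             # GRPO 的采样组
demo_ids = torch.randint(0, 3, (32,))              # (可能由无监督聚类得到的)人群分组
difficulty = torch.rand(32)                        # 任务难度估计,示例值

adv = group_relative_advantage(rewards, group_ids)
weighted_adv = fairness_aware_weights(demo_ids, difficulty) * adv
# weighted_adv 随后进入标准的策略梯度 / PPO 风格目标中
```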
论文及项目相关链接
PDF Accepted as Oral on NeurIPS 2025 GenAI4Health Workshop
Summary
本文关注医疗人工智能系统在不同人群中的性能差异问题,提出了一个公平感知的群组相对策略优化方法(FairGRPO)。该方法采用分层强化学习,促进不同临床人群之间的公平学习,并依据代表性、任务难度和数据来源对优势进行自适应加权。同时,针对临床领域常见的缺少人口统计标签问题,采用无监督聚类自动发现潜在人群。实验证明,FairGRPO能减少预测偏差,提高F1分数,并在训练过程中逐渐改善公平性。基于FairGRPO,发布了FairMedGemma-4B模型,在显著降低人群间差异的同时达到了业界领先性能。
Key Takeaways
- 医疗人工智能系统在不同人群中的诊断性能存在差异,导致代表性不足的群体受到实际伤害。
- 多模态推理基础模型通过综合分析多样医疗数据提升了临床诊断。
- 基于强化学习的推理训练可能继承或放大偏见,尤其在以主流人群为主的训练数据集上。
- FairGRPO方法采用分层强化学习促进公平学习,并考虑代表性、任务难度和来源等数据因素进行自适应权重调整。
- FairGRPO解决了临床领域缺少人口统计标签的问题,通过无监督聚类自动发现潜在人群。
- 实验证明FairGRPO能减少预测偏差和提高F1分数,与其他基准强化学习模型相比表现出优越性。
点此查看论文截图
Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
Authors:Zihao Chen, Yi Zhou, Xudong Jiang, Li Chen, Leopold Schmetterer, Bingyao Tan, Jun Cheng
Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
非配对图像到图像的翻译在医学成像中已经成为一项关键技术,它能够在没有昂贵的配对数据集的情况下实现跨模态合成、域适应和数据增强。然而,现有的方法往往会扭曲细微的曲线结构,如微血管,这既影响了诊断的可靠性,也影响了定量分析。这一局限性在眼科和血管成像中尤为重要,那里细微的形态变化具有重要的临床意义。我们提出了曲线结构保留翻译(CST),这是一个通用框架,通过整合结构一致性到训练中,在不成对的翻译中明确保留细微的曲线结构。具体来说,CST通过拓扑监督增强基线模型,引入了曲线提取模块。它可以无缝地融入现有方法。我们将其融入CycleGAN和UNSB作为两个代表性的主干。在三种成像模态的全面评估:光学相干断层扫描血管造影、彩色眼底和X射线冠状动脉造影表明,CST提高了翻译的准确性,并达到了最先进的性能。通过加强学习映射中的几何完整性,CST为医学成像中面向曲线结构的跨域翻译建立了有原则的路径。
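一个可以体现“结构一致性损失”思路的最小草图如下:用固定的 Sobel 梯度响应充当曲线结构提取模块的简化替代(论文中该模块是可学习的),并将源图与转换结果的结构响应差异作为附加损失项;算子选择与权重系数均为示例性假设。

```python
import torch
import torch.nn.functional as F

def curvilinear_map(x):
    """用 Sobel 梯度幅值近似曲线结构响应(论文中的提取模块为可学习网络,这里仅作占位)。"""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gray = x.mean(dim=1, keepdim=True)           # 转灰度后求梯度
    gx = F.conv2d(gray, kx, padding=1)
    gy = F.conv2d(gray, ky, padding=1)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-8)

def structure_consistency_loss(source, translated):
    """约束转换前后细小曲线结构(如微血管)的响应保持一致。"""
    return F.l1_loss(curvilinear_map(source), curvilinear_map(translated))

# 在 CycleGAN / UNSB 等基线的总损失中加入该项(权重 10.0 为示例值)
source = torch.rand(2, 3, 128, 128)
translated = torch.rand(2, 3, 128, 128)          # 生成器输出的占位
loss_total = 10.0 * structure_consistency_loss(source, translated)  # + 原有对抗与循环损失
```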
论文及项目相关链接
Summary
在医学成像中,无配对图像转换技术已发展为一项关键技术,可在无需成本高昂的配对数据集的情况下实现跨模态合成、域适应和数据增强。然而,现有方法常常会在细微曲线结构(如微血管)上产生扭曲,这会影响诊断的可靠性和定量分析。本文提出了曲线结构保留转换(CST)这一通用框架,通过集成结构一致性来显式保留细微曲线结构进行无配对转换。具体来说,CST通过拓扑监督为基线模型增加一个曲线提取模块进行增强。它可以无缝集成到现有方法中。我们将其集成到CycleGAN和UNSB作为两个代表性的骨干网。在光学相干断层扫描血管造影、彩色眼底和X射线冠状动脉造影三种成像模式进行的全面评估表明,CST提高了转换保真度,并达到了最先进的性能。通过强化几何完整性在学习的映射中,CST为医学成像中的曲线结构感知跨域转换建立了原则性的途径。
Key Takeaways
- 无配对图像转换技术已成为医学成像中的关键方法,可应用于跨模态合成、域适应和数据增强。
- 现有方法在处理细微曲线结构(如微血管)时会产生扭曲,影响诊断可靠性和定量分析。
- 提出的CST框架通过集成结构一致性来显式保留细微曲线结构进行无配对转换。
- CST通过拓扑监督增强基线模型,包括一个曲线提取模块。
- CST可以无缝集成到现有方法中,如CycleGAN和UNSB。
- 在多种成像模式下评估,CST提高了转换的保真度并达到了最先进的性能。
点此查看论文截图
MedReason-R1: Learning to Reason for CT Diagnosis with Reinforcement Learning and Local Zoom
Authors:Yifan Li, Fenghe Tang, Yingtai Li, Shaohua Kevin Zhou
General-purpose large Vision-Language Models (VLMs) demonstrate strong capabilities in generating detailed descriptions for natural images. However, their performance in the medical domain remains suboptimal, even for relatively straightforward tasks, primarily due to the lack of large-scale, high-quality, specialized medical imaging datasets and the neglect of the diagnostic process that progresses from coarse to fine-grained. To address the first issue, we construct the CT-RATE-VQA dataset, which has 84K QA pairs. For the second issue, we propose MedReason-R1, a medical VLM with explicit reasoning process for disease diagnosis. MedReason-R1 incorporates a novel strategy that embeds zoom-in disease region-of-interest areas into the image, highlighting the crucial role of both global localization and disease-specific details in enhancing the model’s diagnostic performance. Furthermore, we introduce the GRPO reinforcement learning framework to MedReason-R1, which enables effective reasoning without relying on costly manual annotations. Compared to recent general-purpose and medical VLMs, MedReason-R1 achieves state-of-the-art performance in CT disease diagnosis while retaining generalization. The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
通用的大型视觉语言模型(VLMs)在生成自然图像详细描述方面表现出强大的能力。然而,它们在医学领域的表现仍然不尽人意,即使在相对简单的任务上也是如此。这主要是因为缺乏大规模、高质量、专业的医学成像数据集,以及忽略了从粗略到精细的诊断过程的推进。为了解决第一个问题,我们构建了CT-RATE-VQA数据集,包含84K个问答对。对于第二个问题,我们提出了MedReason-R1,这是一个用于疾病诊断的医学VLM,具有明确的推理过程。MedReason-R1采用了一种新颖的策略,将放大疾病感兴趣区域嵌入图像中,强调了全局定位和疾病特定细节在提升模型诊断性能中的关键作用。此外,我们还将GRPO强化学习框架引入到MedReason-R1中,使其能够在不依赖昂贵的人工注释的情况下进行有效推理。与最新的通用和医学VLMs相比,MedReason-R1在CT疾病诊断方面达到了最先进的性能,同时保持了泛化能力。代码、检查点和数据集可通过以下网址获得:https://github.com/Leevan001/MedReason-R1
论文及项目相关链接
PDF The code, checkpoints, and dataset are available at: https://github.com/Leevan001/MedReason-R1
Summary
本文针对医学图像领域存在的问题,构建了CT-RATE-VQA数据集,并提出了一种具有明确推理过程的医疗VLM——MedReason-R1,用于疾病诊断。MedReason-R1采用了一种新的策略,将放大后的疾病感兴趣区域嵌入图像中,强调全局定位和疾病特异性细节的重要性。此外,还引入了GRPO强化学习框架,使MedReason-R1无需依赖成本高昂的手动注释也能进行有效推理。在CT疾病诊断方面,MedReason-R1达到了最先进的性能,并保持了泛化能力。
Key Takeaways
- VLM在医学图像领域的性能仍然不够理想,主要由于缺乏大规模、高质量的专业医学成像数据集,以及忽视了从粗到细的诊断过程。
- 构建了CT-RATE-VQA数据集,包含8.4万个问答对,以解决第一个问题。
- MedReason-R1是一种具有明确推理过程的医疗VLM,用于疾病诊断,结合了医学图像中的细节和全局定位信息。
- MedReason-R1采用了一种新的策略,将疾病区域的细节嵌入图像中,以提高模型的诊断性能。
- 引入了GRPO强化学习框架,使MedReason-R1在不需要昂贵的手动注释的情况下进行有效推理。
- MedReason-R1在CT疾病诊断方面达到了最先进的性能,并保持了泛化能力。
点此查看论文截图
Addressing the Depth-of-Field Constraint: A New Paradigm for High Resolution Multi-Focus Image Fusion
Authors:Luca Piano, Peng Huanwen, Radu Ciprian Bilcu
Multi-focus image fusion (MFIF) addresses the depth-of-field (DOF) limitations of optical lenses, where only objects within a specific range appear sharp. Although traditional and deep learning methods have advanced the field, challenges persist, including limited training data, domain gaps from synthetic datasets, and difficulties with regions lacking information. We propose VAEEDOF, a novel MFIF method that uses a distilled variational autoencoder for high-fidelity, efficient image reconstruction. Our fusion module processes up to seven images simultaneously, enabling robust fusion across diverse focus points. To address data scarcity, we introduce MattingMFIF, a new synthetic 4K dataset, simulating realistic DOF effects from real photographs. Our method achieves state-of-the-art results, generating seamless artifact-free fused images and bridging the gap between synthetic and real-world scenarios, offering a significant step forward in addressing complex MFIF challenges. The code and weights are available here:
多焦点图像融合(MFIF)解决了光学镜头的景深(DOF)限制问题,景深限制导致只有特定范围内的物体才显得清晰。尽管传统和深度学习方法已经推动了该领域的发展,但仍然存在挑战,包括训练数据有限、合成数据集与实际应用场景之间的领域差距,以及缺乏信息的区域处理困难等问题。我们提出了VAEEDOF,这是一种新型MFIF方法,它采用蒸馏的变分自编码器进行高保真、高效的图像重建。我们的融合模块可以同时处理多达七张图像,实现跨不同焦点的稳健融合。为了解决数据稀缺问题,我们引入了MattingMFIF这一新的合成4K数据集,模拟真实照片中的现实景深效果。我们的方法达到了最新技术水平,生成无缝且无伪影的融合图像,并缩小了合成场景和真实场景之间的差距,为解决复杂的MFIF挑战迈出了重要一步。代码和权重可在此处获取:
论文及项目相关链接
Summary
本文介绍了多焦点图像融合(MFIF)技术,该技术解决了光学透镜景深(DOF)限制的问题。针对传统方法和深度学习在该领域的挑战,提出了使用蒸馏变分自编码器(VAEEDOF)的新MFIF方法,实现了高保真、高效的图像重建。同时,为解决数据稀缺问题,引入了模拟真实景深效果的新合成4K数据集MattingMFIF。该方法实现了业界领先的结果,生成无缝、无瑕疵的融合图像,并缩小了合成与真实场景之间的差距,为解决复杂的MFIF挑战迈出了重要的一步。
Key Takeaways
- 多焦点图像融合(MFIF)技术解决了光学透镜景深(DOF)限制的问题,使不同焦点点的图像能够融合。
- 提出了使用蒸馏变分自编码器(VAEEDOF)的新MFIF方法,提高了图像重建的效率和保真度。
- VAEEDOF能够同时处理多达7张图像,实现跨不同焦点的稳健融合。
- 针对数据稀缺的问题,引入了新的合成4K数据集MattingMFIF,模拟真实景深效果。
- 所提出的方法实现了业界领先的结果,生成了无缝、无瑕疵的融合图像。
- 该方法缩小了合成图像和真实场景之间的差距,提高了MFIF技术的实用性。
点此查看论文截图
Predicting before Reconstruction: A generative prior framework for MRI acceleration
Authors:Juhyung Park, Rokgi Hong, Roh-Eul Yoo, Jaehyeon Koo, Se Young Chun, Seung Hong Choi, Jongho Lee
Recent advancements in artificial intelligence have created transformative capabilities in image synthesis and generation, enabling diverse research fields to innovate at revolutionary speed and spectrum. In this study, we leverage this generative power to introduce a new paradigm for accelerating Magnetic Resonance Imaging (MRI), introducing a shift from image reconstruction to proactive predictive imaging. Despite being a cornerstone of modern patient care, MRI’s lengthy acquisition times limit clinical throughput. Our novel framework addresses this challenge by first predicting a target contrast image, which then serves as a data-driven prior for reconstructing highly under-sampled data. This informative prior is predicted by a generative model conditioned on diverse data sources, such as other contrast images, previously scanned images, acquisition parameters, patient information. We demonstrate this approach with two key applications: (1) reconstructing FLAIR images using predictions from T1w and/or T2w scans, and (2) reconstructing T1w images using predictions from previously acquired T1w scans. The framework was evaluated on internal and multiple public datasets (total 14,921 scans; 1,051,904 slices), including multi-channel k-space data, for a range of high acceleration factors (x4, x8 and x12). The results demonstrate that our prediction-prior reconstruction method significantly outperforms other approaches, including those with alternative or no prior information. Through this framework we introduce a fundamental shift from image reconstruction towards a new paradigm of predictive imaging.
近期人工智能的进步为图像合成和生成领域带来了革命性的能力,促使不同研究领域以革命性的速度和广度进行创新。在这项研究中,我们利用这种生成能力,引入了一种加速磁共振成像(MRI)的新范式,实现从图像重建到主动的预测性成像的转变。尽管MRI是现代患者护理的基石,但其漫长的采集时间限制了临床吞吐量。我们的新框架通过首先预测目标对比图像来解决这一挑战,该图像然后作为数据驱动的先验来重建高度欠采样的数据。这一信息先验是由生成模型根据多种数据源预测的,如其他对比图像、先前扫描的图像、采集参数、患者信息等。我们通过两个关键应用展示了这种方法:(1)使用T1w和/或T2w扫描的预测结果重建FLAIR图像;(2)使用先前获取的T1w扫描的预测结果重建T1w图像。该框架在内部和多个公共数据集(共14,921次扫描、1,051,904个切片)上进行了评估,包括多通道k空间数据,涵盖了一系列高加速因子(x4、x8和x12)。结果表明,我们的预测先验重建方法显著优于其他方法,包括使用替代先验或没有先验信息的方法。通过这一框架,我们实现了从图像重建到预测性成像这一新范式的根本转变。
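这一“先预测、后重建”的范式可以用最简单的形式示意:把预测得到的先验图像作为正则项,与 k 空间数据一致性项一起做梯度下降。下面的草图使用随机合成数据,采样掩码、步长与正则权重均为示例,并非论文的具体算法。

```python
import numpy as np

rng = np.random.default_rng(0)
N = 128
x_true = rng.normal(size=(N, N))                   # 真实图像(占位)
x_prior = x_true + 0.1 * rng.normal(size=(N, N))   # 生成模型预测的先验图像(占位)

mask = rng.random((N, N)) < 0.25                   # 约 4 倍加速的随机欠采样掩码(示例)
y = mask * np.fft.fft2(x_true)                     # 观测到的欠采样 k 空间

lam, step = 0.5, 0.5
x = np.real(np.fft.ifft2(y))                       # 零填充初始化
for _ in range(200):
    # 目标:||M F x - y||^2 + lam * ||x - x_prior||^2
    grad_dc = np.real(np.fft.ifft2(mask * np.fft.fft2(x) - y))   # 数据一致性项梯度
    grad_prior = x - x_prior                                      # 预测先验项梯度
    x = x - step * (grad_dc + lam * grad_prior)

print("NRMSE:", np.linalg.norm(x - x_true) / np.linalg.norm(x_true))
```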
论文及项目相关链接
PDF 33 pages, 8figures
Summary
本研究利用人工智能的最新进展,引入一种新型范式加速磁共振成像(MRI),实现从图像重建到预测性成像的转变。通过预测目标对比图像作为数据驱动先验,重建高度欠采样的数据,解决MRI采集时间长的问题。
Key Takeaways
- 人工智能在图像合成和生成方面的最新进展为多个研究领域带来了创新。
- 本研究利用生成能力加速磁共振成像(MRI),从图像重建转向预测性成像。
- MRI采集时间长是临床吞吐量的瓶颈。
- 通过预测目标对比图像作为数据驱动先验,重建高度欠采样的数据。
- 预测性先验是由以多种数据源为条件的生成模型预测的,如其他对比图像、先前扫描的图像、采集参数、病人信息等。
- 研究展示了两个关键应用:利用T1w和/或T2w扫描的预测重建FLAIR图像和利用先前获取的T1w扫描的预测重建T1w图像。
点此查看论文截图
A Training-Free Framework for Open-Vocabulary Image Segmentation and Recognition with EfficientNet and CLIP
Authors:Ying Dai, Wei Yu Chen
This paper presents a novel training-free framework for open-vocabulary image segmentation and object recognition (OVSR), which leverages EfficientNetB0, a convolutional neural network, for unsupervised segmentation and CLIP, a vision-language model, for open-vocabulary object recognition. The proposed framework adopts a two stage pipeline: unsupervised image segmentation followed by segment-level recognition via vision-language alignment. In the first stage, pixel-wise features extracted from EfficientNetB0 are decomposed using singular value decomposition to obtain latent representations, which are then clustered using hierarchical clustering to segment semantically meaningful regions. The number of clusters is adaptively determined by the distribution of singular values. In the second stage, the segmented regions are localized and encoded into image embeddings using the Vision Transformer backbone of CLIP. Text embeddings are precomputed using CLIP’s text encoder from category-specific prompts, including a generic something else prompt to support open set recognition. The image and text embeddings are concatenated and projected into a shared latent feature space via SVD to enhance cross-modal alignment. Recognition is performed by computing the softmax over the similarities between the projected image and text embeddings. The proposed method is evaluated on standard benchmarks, including COCO, ADE20K, and PASCAL VOC, achieving state-of-the-art performance in terms of Hungarian mIoU, precision, recall, and F1-score. These results demonstrate the effectiveness, flexibility, and generalizability of the proposed framework.
本文介绍了一种用于开放词汇图像分割和对象识别(OVSR)的新型无训练框架。该框架利用EfficientNetB0卷积神经网络进行无监督分割,并利用CLIP视觉语言模型进行开放词汇对象识别。所提出框架采用两阶段流程:无监督图像分割,然后通过视觉语言对齐进行分段级识别。在第一阶段,从EfficientNetB0提取的像素级特征使用奇异值分解进行分解,以获得潜在表示,然后使用层次聚类对这些潜在表示进行聚类,从而分割出语义上有意义的区域。聚类的数量是根据奇异值的分布自适应确定的。在第二阶段,使用CLIP的视觉转换器主干对分割区域进行定位和编码,生成图像嵌入。使用CLIP的文本编码器根据类别特定的提示预先计算文本嵌入,包括一个通用的其他提示以支持开放集识别。然后将图像和文本嵌入通过SVD连接并投影到共享潜在特征空间,以增强跨模态对齐。通过计算投影图像和文本嵌入之间的相似性的softmax来执行识别。所提出的方法在COCO、ADE20K和PASCAL VOC等标准基准测试集上进行了评估,在匈牙利mIoU、精度、召回率和F1分数方面均达到了最新技术水平。这些结果证明了该框架的有效性、灵活性和通用性。
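第一阶段(无监督分割)的流程可以用如下草图示意:EfficientNetB0 特征 → SVD 潜在表示 → 层次聚类。为保证可直接运行,这里用 weights=None(实际应加载 ImageNet 预训练权重);按奇异值能量占比确定维数与簇数的阈值 0.9 亦为简化假设。

```python
import torch
import numpy as np
from torchvision.models import efficientnet_b0
from sklearn.cluster import AgglomerativeClustering

# 1. 提取像素级(此处为低分辨率特征图级)特征
model = efficientnet_b0(weights=None).eval()       # 实际使用时应加载 ImageNet 预训练权重
image = torch.rand(1, 3, 224, 224)
with torch.no_grad():
    feat = model.features(image)                   # (1, C, h, w)
C, h, w = feat.shape[1:]
X = feat.squeeze(0).permute(1, 2, 0).reshape(-1, C).numpy()   # (h*w, C)

# 2. SVD 得到潜在表示,并按奇异值能量占比自适应确定保留的维数(阈值 0.9 为示例)
Xc = X - X.mean(0, keepdims=True)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
energy = np.cumsum(S ** 2) / np.sum(S ** 2)
k = int(np.searchsorted(energy, 0.9) + 1)
Z = U[:, :k] * S[:k]                               # 潜在表示

# 3. 层次聚类得到语义区域,再上采样回原图大小即为分割结果
labels = AgglomerativeClustering(n_clusters=min(k, 8)).fit_predict(Z)
seg = labels.reshape(h, w)
print("num segments:", len(np.unique(seg)), "map size:", seg.shape)
```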
论文及项目相关链接
Summary
本文提出了一种无需训练的新型开放词汇图像分割与对象识别(OVSR)框架,结合了EfficientNetB0卷积神经网络进行无监督分割和CLIP视觉语言模型进行开放词汇对象识别。该框架采用两阶段管道:无监督图像分割和通过视觉语言对齐进行分段级别识别。
Key Takeaways
- 该论文介绍了一种新的无需训练的方法,用于开放词汇图像分割与对象识别。
- 利用EfficientNetB0进行无监督图像分割。
- 使用CLIP的视觉语言模型进行开放词汇对象识别。
- 框架包含两个阶段:无监督图像分割和分段级别识别。
- 通过奇异值分解(SVD)获得潜在表示并进行聚类以进行语义有意义的区域分割。
- 利用CLIP的Vision Transformer在第二阶段对分割区域进行定位和编码。
点此查看论文截图
Enhancing Early Alzheimer Disease Detection through Big Data and Ensemble Few-Shot Learning
Authors:Safa Ben Atitallah, Maha Driss, Wadii Boulila, Anis Koubaa
Alzheimer disease is a severe brain disorder that causes harm in various brain areas and leads to memory damage. The limited availability of labeled medical data poses a significant challenge for accurate Alzheimer disease detection. There is a critical need for effective methods to improve the accuracy of Alzheimer disease detection, considering the scarcity of labeled data, the complexity of the disease, and the constraints related to data privacy. To address this challenge, our study leverages the power of big data in the form of pre-trained Convolutional Neural Networks (CNNs) within the framework of Few-Shot Learning (FSL) and ensemble learning. We propose an ensemble approach based on a Prototypical Network (ProtoNet), a powerful method in FSL, integrating various pre-trained CNNs as encoders. This integration enhances the richness of features extracted from medical images. Our approach also includes a combination of class-aware loss and entropy loss to ensure a more precise classification of Alzheimer disease progression levels. The effectiveness of our method was evaluated using two datasets, the Kaggle Alzheimer dataset and the ADNI dataset, achieving an accuracy of 99.72% and 99.86%, respectively. The comparison of our results with relevant state-of-the-art studies demonstrated that our approach achieved superior accuracy and highlighted its validity and potential for real-world applications in early Alzheimer disease detection.
阿尔茨海默病是一种严重的脑疾病,会损害大脑的多个区域并导致记忆力下降。标记医疗数据的有限可用性给准确的阿尔茨海默病检测带来了巨大挑战。考虑到标记数据的稀缺性、疾病的复杂性以及与数据隐私相关的限制,迫切需要有效的方法来提高阿尔茨海默病检测的准确性。为了应对这一挑战,我们的研究在少样本学习(FSL)与集成学习的框架下,利用了以预训练卷积神经网络(CNN)为载体的大数据能力。我们提出了一种基于原型网络(ProtoNet)的集成方法,这是一种强大的FSL方法,将各种预训练的CNN作为编码器进行集成。这种集成提高了从医学图像中提取的特征的丰富性。我们的方法还包括结合类别感知损失和熵损失,以确保更精确地分类阿尔茨海默病的进展水平。我们的方法使用Kaggle阿尔茨海默症数据集和ADNI数据集进行了评估,准确率分别为99.72%和99.86%。将我们的结果与最新的相关研究进行比较表明,我们的方法实现了更高的准确性,并突出了其在早期阿尔茨海默病检测中的有效性、实用性和潜力。
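原型网络的核心是“类原型 + 距离 softmax”。下面的草图用随机嵌入代替预训练 CNN 编码器集成的输出,并示意在分类损失之外附加一项熵正则;损失组合方式与权重均为示例性假设。

```python
import torch
import torch.nn.functional as F

n_way, n_support, n_query, dim = 4, 5, 15, 256    # 4 个病程等级,每类 5 个支持样本

# 实际中这些嵌入来自多个预训练 CNN 编码器(EfficientNet、ResNet 等)的集成
support = torch.randn(n_way, n_support, dim)
query = torch.randn(n_way * n_query, dim)
query_labels = torch.arange(n_way).repeat_interleave(n_query)

prototypes = support.mean(dim=1)                   # 每类原型 = 支持样本嵌入的均值
dists = torch.cdist(query, prototypes)             # 查询样本到各原型的欧氏距离
logits = -dists                                    # 距离越近,得分越高

probs = F.softmax(logits, dim=1)
loss_cls = F.cross_entropy(logits, query_labels)                       # 类感知分类损失
loss_ent = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1).mean()    # 熵正则,鼓励更确定的预测
loss = loss_cls + 0.1 * loss_ent                   # 0.1 为示例权重
print(float(loss_cls), float(loss_ent))
```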
论文及项目相关链接
Summary
本研究利用大数据和预训练的卷积神经网络(CNN)来解决阿尔茨海默病检测的挑战。研究采用小样本学习(FSL)和集成学习的方法,基于原型网络(ProtoNet)提出一种集成策略,整合多种预训练CNN作为编码器,提高医学图像特征的丰富性。该方法结合类别感知损失和熵损失,确保更精确地分类阿尔茨海默病进展水平。在Kaggle阿尔茨海默数据集和ADNI数据集上的评估显示,该方法准确率分别高达99.72%和99.86%,与现有先进技术相比具有更高的准确性和潜力,可应用于早期阿尔茨海默病的实际检测。
Key Takeaways
- 阿尔茨海默病是一种严重的脑疾病,对大脑各区域造成损害,导致记忆力下降。
- 医学数据标注的有限性对准确的阿尔茨海默病检测构成重大挑战。
- 研究利用大数据和预训练的卷积神经网络(CNN)来解决此问题。
- 采用小样本学习(FSL)和集成学习的方法,基于原型网络(ProtoNet)提出一种集成策略。
- 整合多种预训练CNN以提高医学图像特征的丰富性。
- 结合类别感知损失和熵损失,以提高阿尔茨海默病进展水平的分类精度。
点此查看论文截图
A Multi-Evidence Framework Rescues Low-Power Prognostic Signals and Rejects Statistical Artifacts in Cancer Genomics
Authors:Gokturk Aytug Akarlar
Motivation: Standard genome-wide association studies in cancer genomics rely on statistical significance with multiple testing correction, but systematically fail in underpowered cohorts. In TCGA breast cancer (n=967, 133 deaths), low event rates (13.8%) create severe power limitations, producing false negatives for known drivers and false positives for large passenger genes. Results: We developed a five-criteria computational framework integrating causal inference (inverse probability weighting, doubly robust estimation) with orthogonal biological validation (expression, mutation patterns, literature evidence). Applied to TCGA-BRCA mortality analysis, standard Cox+FDR detected zero genes at FDR<0.05, confirming complete failure in underpowered settings. Our framework correctly identified RYR2 – a cardiac gene with no cancer function – as a false positive despite nominal significance (p=0.024), while identifying KMT2C as a complex candidate requiring validation despite marginal significance (p=0.047, q=0.954). Power analysis revealed median power of 15.1% across genes, with KMT2C achieving only 29.8% power (HR=1.55), explaining borderline statistical significance despite strong biological evidence. The framework distinguished true signals from artifacts through mutation pattern analysis: RYR2 showed 29.8% silent mutations (passenger signature) with no hotspots, while KMT2C showed 6.7% silent mutations with 31.4% truncating variants (driver signature). This multi-evidence approach provides a template for analyzing underpowered cohorts, prioritizing biological interpretability over purely statistical significance. Availability: All code and analysis pipelines available at github.com/akarlaraytu/causal-inference-for-cancer-genomics
动机:标准的全基因组关联研究在癌症基因组学中依赖于多重检验校正的统计显著性,但在功效不足的队列中系统性地失败。在TCGA乳腺癌(n=967,133例死亡)中,低事件率(13.8%)造成了严重的功效限制,对已知驱动基因产生假阴性,对大型乘客基因产生假阳性。结果:我们开发了一个五标准计算框架,融合了因果推理(逆概率加权、双重稳健估计)与正交生物验证(表达、突变模式、文献证据)。应用于TCGA-BRCA死亡分析,标准的Cox+FDR方法在FDR<0.05下未检测到任何基因,证实了其在功效不足情形下的完全失效。我们的框架正确地将RYR2(一个与癌症功能无关的心脏基因)识别为假阳性,尽管其名义上显著(p=0.024);同时将KMT2C识别为需要进一步验证的复杂候选基因,尽管其仅为边缘显著(p=0.047,q=0.954)。功效分析显示各基因的中位检验功效为15.1%,其中KMT2C仅达到29.8%的功效(风险比=1.55),解释了尽管有强有力的生物学证据但统计结果却处于临界状态的原因。该框架通过突变模式分析区分了真实信号与统计假象:RYR2显示出29.8%的沉默突变(乘客基因特征)且无热点,而KMT2C显示出6.7%的沉默突变和31.4%的截断变异(驱动基因特征)。这种多证据方法提供了分析功效不足队列的模板,优先考虑生物可解释性而非纯粹的统计显著性。可用性:所有代码和分析管道可在github.com/akarlaraytu/causal-inference-for-cancer-genomics找到。
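下面以合成数据演示框架中“逆概率加权”这一因果推断成分的一般做法(使用 lifelines 的加权 Cox 回归);这只是通用 IPW-Cox 的草图,并非论文完整的五标准框架,协变量与数据生成方式均为假设。

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)
n = 1000
age = rng.normal(60, 10, n)
stage = rng.integers(1, 5, n)
# “处理”变量:某基因是否突变,突变概率依赖协变量(制造混杂)
mutated = (rng.random(n) < 1 / (1 + np.exp(-(age - 60) / 10))).astype(int)
time = rng.exponential(5, n) * np.exp(-0.3 * mutated)
event = (rng.random(n) < 0.14).astype(int)         # 低事件率,模拟功效不足的队列

df = pd.DataFrame({"age": age, "stage": stage, "mut": mutated,
                   "time": time, "event": event})

# 1. 用协变量拟合倾向得分,并计算稳定化 IPW 权重
ps = LogisticRegression(max_iter=1000).fit(
    df[["age", "stage"]], df["mut"]).predict_proba(df[["age", "stage"]])[:, 1]
p_mut = df["mut"].mean()
df["ipw"] = np.where(df["mut"] == 1, p_mut / ps, (1 - p_mut) / (1 - ps))

# 2. 加权 Cox 回归估计突变对死亡风险的(近似)因果效应
cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event", weights_col="ipw", robust=True)
print(cph.summary[["coef", "exp(coef)", "p"]])
```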
论文及项目相关链接
PDF 17 pages (main text), 4 figures (main text), 7 supplementary figures, 4 supplementary tables. Focuses on a computational framework using causal inference and biological validation for underpowered cancer genomic studies
Summary
针对癌症基因组中的标准全基因组关联研究因低事件率导致的功效不足问题,提出一种结合因果推断与正交生物验证的计算框架。应用于TCGA乳腺癌死亡率分析,该框架能够识别出标准Cox+FDR方法无法检测到的基因,并通过多重证据区分真实信号与统计假象。
Key Takeaways
- 标准全基因组关联研究在癌症基因组中依赖统计显著性进行多重测试校正,但在效能不足的群体中系统性地失败。
- 在TCGA乳腺癌研究中,低事件率导致严重的效能限制,产生已知驱动因素的假阴性和大型乘客基因的假阳性。
- 提出了一种五标准计算框架,结合因果推断(逆向概率加权,双重稳健估计)和正交生物验证(表达,突变模式,文献证据)。
- 应用于TCGA-BRCA死亡率分析,该框架能够识别出标准方法未能检测到的基因,并正确区分真实信号和伪影。
- 通过突变模式分析,该框架能够进一步验证基因的真实性和功能类型。例如,RYR2显示出乘客基因特征,而KMT2C则显示出驱动基因特征。
- 该框架提供了一种分析效能不足群体的模板,优先考虑生物可解释性而非纯粹的统计显著性。
点此查看论文截图
Rethinking Hebbian Principle: Low-Dimensional Structural Projection for Unsupervised Learning
Authors:Shikuang Deng, Jiayuan Zhang, Yuhang Wu, Ting Chen, Shi Gu
Hebbian learning is a biological principle that intuitively describes how neurons adapt their connections through repeated stimuli. However, when applied to machine learning, it suffers serious issues due to the unconstrained updates of the connections and the lack of accounting for feedback mediation. Such shortcomings limit its effective scaling to complex network architectures and tasks. To this end, here we introduce the Structural Projection Hebbian Representation (SPHeRe), a novel unsupervised learning method that integrates orthogonality and structural information preservation through a local auxiliary nonlinear block. The loss for structural information preservation backpropagates to the input through an auxiliary lightweight projection that conceptually serves as feedback mediation while the orthogonality constraints account for the boundedness of updating magnitude. Extensive experimental results show that SPHeRe achieves SOTA performance among unsupervised synaptic plasticity approaches on standard image classification benchmarks, including CIFAR-10, CIFAR-100, and Tiny-ImageNet. Furthermore, the method exhibits strong effectiveness in continual learning and transfer learning scenarios, and image reconstruction tasks show the robustness and generalizability of the extracted features. This work demonstrates the competitiveness and potential of Hebbian unsupervised learning rules within modern deep learning frameworks, demonstrating the possibility of efficient and biologically inspired learning algorithms without the strong dependence on strict backpropagation. Our code is available at https://github.com/brain-intelligence-lab/SPHeRe.
赫布学习是一种生物原理,直观地描述了神经元如何通过重复刺激调整其连接。然而,当应用于机器学习时,它由于连接的无限更新和缺乏反馈调节而面临严重问题。这些缺点限制了其在复杂网络结构和任务中的有效扩展。为此,我们在这里引入了结构投影赫布表示(SPHeRe),这是一种新的无监督学习方法,通过局部辅助非线性块整合正交性和结构信息保留。结构信息保留的损失通过辅助的轻量级投影反向传播到输入端,这在概念上起到了反馈调节的作用,而正交性约束则考虑了更新幅度的有界性。大量的实验结果表明,SPHeRe在包括CIFAR-10、CIFAR-100和Tiny-ImageNet在内的标准图像分类基准测试中,在无监督突触可塑性方法中实现了最佳性能。此外,该方法在持续学习和迁移学习场景中具有很强的有效性,图像重建任务显示了所提取特征的稳健性和通用性。这项工作证明了赫布无监督学习规则在现代深度学习框架中的竞争力和潜力,展示了在没有严格依赖反向传播的情况下,高效且受生物启发的学习算法的可能性。我们的代码可在https://github.com/brain-intelligence-lab/SPHeRe找到。
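可以用一个单层的示意草图说明“正交性约束 + 结构信息保留”两项损失的形式:轻量辅助投影把隐藏表示映射回输入空间,要求批内样本的相似性结构(Gram 矩阵)被保留,同时约束权重近似正交以限制更新幅度;网络规模与损失权重均为示例性假设,并非论文的具体实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_in, d_hidden, batch = 784, 256, 64
W = nn.Parameter(torch.randn(d_in, d_hidden) * 0.02)      # 局部 Hebbian 层权重
aux_proj = nn.Linear(d_hidden, d_in)                       # 轻量辅助投影(概念上充当反馈中介)
opt = torch.optim.SGD([W] + list(aux_proj.parameters()), lr=0.1)

x = torch.randn(batch, d_in)

for _ in range(50):
    h = torch.relu(x @ W)                                  # 前向:局部非线性块
    x_back = aux_proj(h)                                   # 辅助投影回输入空间

    # 结构信息保留:输入批次的归一化 Gram 矩阵应被投影后的表示所保留
    g_in = F.normalize(x, dim=1) @ F.normalize(x, dim=1).t()
    g_back = F.normalize(x_back, dim=1) @ F.normalize(x_back, dim=1).t()
    loss_struct = F.mse_loss(g_back, g_in)

    # 正交性约束:W^T W 接近单位阵,限制连接更新的幅度
    eye = torch.eye(d_hidden)
    loss_ortho = ((W.t() @ W - eye) ** 2).mean()

    loss = loss_struct + 0.01 * loss_ortho                 # 0.01 为示例权重
    opt.zero_grad()
    loss.backward()
    opt.step()
```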
论文及项目相关链接
Summary
提出一种新型无监督学习方法——结构投影赫布表示(SPHeRe),该方法结合正交性和结构信息保留,通过局部辅助非线性块实现。该方法解决了赫布学习在机器学习应用中的不足,具有反馈中介和更新幅度有界性。在图像分类、持续学习和迁移学习等任务上表现优秀。
Key Takeaways
- 赫布学习是描述神经元如何通过重复刺激适应连接的生物原理,但在机器学习中的应用存在缺陷。
- SPHeRe是一种新型无监督学习方法,解决了赫布学习在复杂网络架构和任务中的缩放问题。
- SPHeRe通过结合正交性和结构信息保留来实现学习,利用局部辅助非线性块进行反馈中介和更新幅度有界性管理。
- SPHeRe在标准图像分类基准测试(如CIFAR-10、CIFAR-100和Tiny-ImageNet)上取得了最先进的性能。
- SPHeRe在持续学习和迁移学习场景中表现出强大的效果。
- 图像重建任务证明了SPHeRe提取特征的稳健性和通用性。
点此查看论文截图
Epistemic-aware Vision-Language Foundation Model for Fetal Ultrasound Interpretation
Authors:Xiao He, Huangxuan Zhao, Guojia Wan, Wei Zhou, Yanxing Liu, Juhua Liu, Yongchao Xu, Yong Luo, Dacheng Tao, Bo Du
Recent medical vision-language models have shown promise on tasks such as VQA, report generation, and anomaly detection. However, most are adapted to structured adult imaging and underperform in fetal ultrasound, which poses challenges of multi-view image reasoning, numerous diseases, and image diversity. To bridge this gap, we introduce FetalMind, a medical AI system tailored to fetal ultrasound for both report generation and diagnosis. Guided by clinical workflow, we propose Salient Epistemic Disentanglement (SED), which injects an expert-curated bipartite graph into the model to decouple view-disease associations and to steer preference selection along clinically faithful steps via reinforcement learning. This design mitigates variability across diseases and heterogeneity across views, reducing learning bottlenecks while aligning the model’s inference with obstetric practice. To train FetalMind at scale, we curate FetalSigma-1M dataset, the first large-scale fetal ultrasound report corpus, comprising 20K reports from twelve medical centers, addressing the scarcity of domain data. Extensive experiments show that FetalMind outperforms open- and closed-source baselines across all gestational stages, achieving +14% average gains and +61.2% higher accuracy on critical conditions while remaining efficient, stable, and scalable. Project Page: https://hexiao0275.github.io/FetalMind.
近期医学视觉语言模型在诸如视觉问答(VQA)、报告生成和异常检测等任务中展现出巨大潜力。然而,大多数模型都适应于结构化成人影像,而在胎儿超声检查中表现不佳,这带来了多视角图像推理、多种疾病和图像多样性的挑战。为了弥补这一差距,我们推出了专为胎儿超声检查设计的医疗人工智能系统FetalMind,用于报告生成和诊断。在临床工作流程的指导下,我们提出了显著认知分解(SED),它将专家定制的二部图注入模型,以解耦视图-疾病关联,并通过强化学习引导偏好选择沿着临床忠实步骤进行。这种设计减轻了疾病间的可变性和视图间的异质性,减少了学习瓶颈,使模型的推断与产科实践相符。为了大规模训练FetalMind,我们整理了FetalSigma-1M数据集,这是首个大规模胎儿超声报告语料库,包含来自十二个医疗中心的2万份报告,解决了领域数据的稀缺性。大量实验表明,FetalMind在所有妊娠阶段的表现均优于开源和闭源的基线,平均提升14%,在关键病症上的准确率提高61.2%,同时保持了高效、稳定和可扩展性。项目页面:https://hexiao0275.github.io/FetalMind。
论文及项目相关链接
PDF This paper contains fundamental errors and will not be replaced
Summary
该文本介绍了针对胎儿超声医学影像的人工智能系统FetalMind的设计和应用。系统通过引入临床工作流程指导的显著认知分解(SED)技术和大规模胎儿超声报告数据集FetalSigma-1M,实现了报告生成和诊断功能,并提高了对不同疾病和视图变化的适应性,降低了学习瓶颈,模型推理符合产科实践。实验表明,FetalMind在所有妊娠阶段的表现均优于开放和封闭基线,对关键疾病的诊断准确率提高61.2%,具有高效、稳定和可扩展性。
Key Takeaways
- FetalMind是一个针对胎儿超声医学影像的医疗人工智能系统,用于报告生成和诊断。
- 系统面临的主要挑战是胎儿超声图像的多视角性、疾病多样性和复杂性。
- 提出了显著认知分解(SED)技术,通过引入专家构建的二部图来解耦视图与疾病的关联,并用强化学习引导模型沿临床忠实的步骤进行推断。
- FetalMind采用临床工作流程指导的设计,以减少疾病间的差异和视图多样性对模型推理的影响。
- 训练FetalMind的数据集是首个大规模胎儿超声报告语料库FetalSigma-1M,包含来自十二家医疗中心的2万份报告。
- 实验结果显示,FetalMind在所有妊娠阶段的表现均优于现有系统,对关键疾病的诊断准确率显著提高。
点此查看论文截图
Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans
Authors:Theo Di Piazza, Carole Lazarus, Olivier Nempont, Loic Boussel
With the growing volume of CT examinations, there is an increasing demand for automated tools such as organ segmentation, abnormality detection, and report generation to support radiologists in managing their clinical workload. Multi-label classification of 3D Chest CT scans remains a critical yet challenging problem due to the complex spatial relationships inherent in volumetric data and the wide variability of abnormalities. Existing methods based on 3D convolutional neural networks struggle to capture long-range dependencies, while Vision Transformers often require extensive pre-training on large-scale, domain-specific datasets to perform competitively. In this work of academic research, we propose a 2.5D alternative by introducing a new graph-based framework that represents 3D CT volumes as structured graphs, where axial slice triplets serve as nodes processed through spectral graph convolution, enabling the model to reason over inter-slice dependencies while maintaining complexity compatible with clinical deployment. Our method, trained and evaluated on 3 datasets from independent institutions, achieves strong cross-dataset generalization, and shows competitive performance compared to state-of-the-art visual encoders. We further conduct comprehensive ablation studies to evaluate the impact of various aggregation strategies, edge-weighting schemes, and graph connectivity patterns. Additionally, we demonstrate the broader applicability of our approach through transfer experiments on automated radiology report generation and abdominal CT data.
随着CT检查量的增长,对自动化工具的需求也在增加,如器官分割、异常检测、报告生成等,以支持放射科医生管理他们的工作负担。对3D胸部CT扫描的多标签分类仍然是一个至关重要且具挑战性的问题,这主要是因为体积数据中的复杂空间关系和异常情况的广泛变化。基于三维卷积神经网络的现有方法难以捕获长距离依赖关系,而视觉Transformer往往需要在大规模特定领域的数据集上进行广泛的预训练才能取得良好的性能。在这项学术研究中,我们提出了一种2.5D的替代方案:引入一个新的基于图的框架,将三维CT体积表示为结构化图,其中轴向切片三元组作为节点通过谱图卷积进行处理,这使得模型能够推理切片之间的依赖关系,同时保持与临床部署兼容的复杂度。我们的方法在来自独立机构的三个数据集上进行训练和评估,实现了强大的跨数据集泛化能力,与最先进的视觉编码器相比显示出有竞争力的性能。我们还进行了全面的消融研究,以评估各种聚合策略、边缘加权方案和图连接模式的影响。此外,我们还通过在自动放射学报告生成和腹部CT数据上的迁移实验,展示了该方法更广泛的适用性。
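其核心的“切片三元组为节点 + 谱图卷积”可以用如下草图示意:相邻三元组连边并做对称归一化,再套一层 GCN 形式的图卷积;节点特征此处用随机向量代替 2D 编码器的输出,连边方式、特征维度与标签数均为示例。

```python
import torch
import torch.nn as nn

n_slices, d = 90, 512
n_nodes = n_slices // 3                                    # 每 3 张轴向切片组成一个节点
node_feats = torch.randn(n_nodes, d)                       # 实际来自 2D 视觉编码器的三元组特征

# 相邻三元组连边(含自环),构成链状图
A = torch.eye(n_nodes)
idx = torch.arange(n_nodes - 1)
A[idx, idx + 1] = 1.0
A[idx + 1, idx] = 1.0

# 对称归一化:A_hat = D^{-1/2} A D^{-1/2}
deg = A.sum(dim=1)
D_inv_sqrt = torch.diag(deg.pow(-0.5))
A_hat = D_inv_sqrt @ A @ D_inv_sqrt

# 一层谱图卷积(GCN 形式):H' = ReLU(A_hat H W)
gcn_weight = nn.Linear(d, 256, bias=False)
H = torch.relu(A_hat @ gcn_weight(node_feats))

# 图级读出后接多标签分类头(18 个异常标签仅为示例)
pooled = H.mean(dim=0)
logits = nn.Linear(256, 18)(pooled)
probs = torch.sigmoid(logits)
```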
论文及项目相关链接
PDF 24 pages, 15 figures
Summary
本文提出一种基于图的新框架,用于处理三维CT体积数据,将其表示为结构化图,以轴向切片三元组作为节点并通过谱图卷积进行处理,能够在切片之间建立依赖关系的同时保持与临床部署的兼容性。该研究实现了对多标签三维胸部CT扫描的分类,并在独立机构的三个数据集上进行了训练和评估,具有良好的跨数据集泛化能力和竞争力表现。同时,进行了全面的消融研究,探讨了各种聚合策略、边缘加权方案和图连接模式的影响。此外,还展示了该方法在自动放射学报告生成和腹部CT数据上的更广泛应用。
Key Takeaways
- 随着CT检查量的增长,对自动化工具(如器官分割、异常检测、报告生成等)的需求增加,以支持放射科医生管理临床工作量。
- 多标签三维胸部CT扫描分类是一个挑战性的问题,因为体积数据中的复杂空间关系和异常的广泛变化。
- 基于三维卷积神经网络的方法难以捕捉长期依赖关系,而视觉转换器则需要大规模特定领域的预训练数据才能表现良好。
- 引入了一种新的基于图的框架,将三维CT体积表示为结构化图,以轴向切片三元组作为节点并通过谱图卷积进行处理。
- 该方法实现了良好的跨数据集泛化能力和竞争力表现,并在独立机构的三个数据集上进行了验证。
- 消融研究探讨了各种聚合策略、边缘加权方案和图连接模式对模型性能的影响。
点此查看论文截图
Toward a Vision-Language Foundation Model for Medical Data: Multimodal Dataset and Benchmarks for Vietnamese PET/CT Report Generation
Authors:Huu Tien Nguyen, Dac Thai Nguyen, The Minh Duc Nguyen, Trung Thanh Nguyen, Thao Nguyen Truong, Huy Hieu Pham, Johan Barthelemy, Minh Quan Tran, Thanh Tam Nguyen, Quoc Viet Hung Nguyen, Quynh Anh Chau, Hong Son Mai, Thanh Trung Nguyen, Phi Le Nguyen
Vision-Language Foundation Models (VLMs), trained on large-scale multimodal datasets, have driven significant advances in Artificial Intelligence (AI) by enabling rich cross-modal reasoning. Despite their success in general domains, applying these models to medical imaging remains challenging due to the limited availability of diverse imaging modalities and multilingual clinical data. Most existing medical VLMs are trained on a subset of imaging modalities and focus primarily on high-resource languages, thus limiting their generalizability and clinical utility. To address these limitations, we introduce a novel Vietnamese-language multimodal medical dataset consisting of 2,757 whole-body PET/CT volumes from independent patients and their corresponding full-length clinical reports. This dataset is designed to fill two pressing gaps in medical AI development: (1) the lack of PET/CT imaging data in existing VLMs training corpora, which hinders the development of models capable of handling functional imaging tasks; and (2) the underrepresentation of low-resource languages, particularly the Vietnamese language, in medical vision-language research. To the best of our knowledge, this is the first dataset to provide comprehensive PET/CT-report pairs in Vietnamese. We further introduce a training framework to enhance VLMs’ learning, including data augmentation and expert-validated test sets. We conduct comprehensive experiments benchmarking state-of-the-art VLMs on downstream tasks. The experimental results show that incorporating our dataset significantly improves the performance of existing VLMs. We believe this dataset and benchmark will serve as a pivotal step in advancing the development of more robust VLMs for medical imaging, especially for low-resource languages and clinical use in Vietnamese healthcare. The source code is available at https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen.
视觉语言基础模型(VLMs)经过大规模多模态数据集的训练,通过丰富的跨模态推理推动了人工智能(AI)的重大进步。尽管它们在通用领域取得了成功,但这些模型在医学成像方面的应用仍然具有挑战性,这主要是因为多样化的成像模态和多语言临床数据十分有限。现有的大多数医学VLMs只在部分成像模态上进行训练,并主要关注资源丰富的语言,从而限制了它们的通用性和临床实用性。为了解决这些局限性,我们引入了一个新的越南语多模态医学数据集,该数据集包含来自独立患者的2757个全身PET/CT体积及其相应的完整临床报告。该数据集旨在填补医学人工智能发展中的两个紧迫空白:(1)现有VLMs训练语料库中缺乏PET/CT成像数据,这阻碍了能够处理功能性成像任务的模型的开发;(2)低资源语言(尤其是越南语)在医学视觉语言研究中的代表性不足。据我们所知,这是第一个提供越南语全面的PET/CT报告配对的数据集。我们还介绍了一个增强VLMs学习的训练框架,包括数据增强和专家验证的测试集。我们进行了全面的实验,评估了最先进VLMs在下游任务上的表现。实验结果表明,加入我们的数据集可以显著提高现有VLMs的性能。我们相信,该数据集和基准测试将是推动医学成像领域更稳健VLMs发展的关键一步,尤其有利于低资源语言以及越南医疗保健中的临床应用。源代码可在https://github.com/AIoT-Lab-BKAI/ViPET-ReportGen找到。
论文及项目相关链接
PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
Summary
本研究针对医学图像领域,引入了一款越南语的多模态医疗数据集,包含PET/CT影像与临床报告。此数据集解决了现有VLM模型中缺乏PET/CT影像数据和低资源语言代表性不足的问题。通过引入新的训练框架和实验验证,该数据集显著提升了VLM模型性能。这将为低资源语言尤其是越南语的医学影像发展迈出重要一步。
Key Takeaways
- 引入了越南语的多模态医疗数据集,涵盖PET/CT影像及其对应的临床报告。
- 数据集解决了现有VLM模型缺乏PET/CT影像数据和低资源语言代表性不足的问题。
- 通过数据增强和专家验证测试集,增强了VLM模型的学习能力。
- 数据集显著提升了VLM模型在下游任务上的性能表现。
- 该数据集和基准测试对于推动医学影像的稳健VLM模型发展,特别是在低资源语言和越南语临床应用方面,具有关键作用。
点此查看论文截图
Untangling Vascular Trees for Surgery and Interventional Radiology
Authors:Guillaume Houry, Tom Boeken, Stéphanie Allassonnière, Jean Feydy
The diffusion of minimally invasive, endovascular interventions motivates the development of visualization methods for complex vascular networks. We propose a planar representation of blood vessel trees which preserves the properties that are most relevant to catheter navigation: topology, length and curvature. Taking as input a three-dimensional digital angiography, our algorithm produces a faithful two-dimensional map of the patient’s vessels within a few seconds. To this end, we propose optimized implementations of standard morphological filters and a new recursive embedding algorithm that preserves the global orientation of the vascular network. We showcase our method on peroperative images of the brain, pelvic and knee artery networks. On the clinical side, our method simplifies the choice of devices prior to and during the intervention. This lowers the risk of failure during navigation or device deployment and may help to reduce the gap between expert and common intervention centers. From a research perspective, our method simulates the cadaveric display of artery trees from anatomical dissections. This opens the door to large population studies on the branching patterns and tortuosity of fine human blood vessels. Our code is released under the permissive MIT license as part of the scikit-shapes Python library (https://scikit-shapes.github.io ).
微创血管内干预的普及促使了针对复杂血管网络的可视化方法的发展。我们提出了一种血管树的平面表示方法,保留了与导管导航最相关的属性:拓扑结构、长度和曲率。以三维数字血管造影为输入,我们的算法可在几秒钟内生成患者血管的忠实二维地图。为此,我们对标准形态学滤波器进行了优化实现,并提出了一种新的递归嵌入算法,该算法保留了血管网络的整体方向。我们在脑、骨盆和膝关节动脉网络的术中图像上展示了我们的方法。在临床方面,我们的方法简化了术前与术中器械的选择。这降低了导航或器械部署过程中的失败风险,并有助于缩小专家与普通干预中心之间的差距。从研究的角度来看,我们的方法模拟了解剖学解剖中动脉树的离体(尸体)展示方式。这为对人类精细血管的分支模式和迂曲度进行大规模人群研究打开了大门。我们的代码作为scikit-shapes Python库的一部分,以宽松的MIT许可证发布(https://scikit-shapes.github.io)。
论文及项目相关链接
Summary
本文提出一种平面表示法展示血管树,以呈现导管导航最相关的属性:拓扑结构、长度和曲率。通过三维数字血管造影术输入,算法可在几秒内生成患者血管的忠实二维地图。该方法简化了介入手术前的设备选择,降低了导航或设备部署过程中的失败风险,有助于缩小专家与普通介入中心之间的差距。此外,该方法还模拟了动脉树的尸体解剖显示,为研究人类血管分支模式和弯曲度提供了机会。
Key Takeaways
- 文中提出了一种用于呈现复杂血管网络的平面表示法,旨在辅助微创性血管内干预的可视化方法。
- 该方法能够基于三维数字血管造影术快速生成患者血管的二维地图。
- 该方法能够保留对导管导航至关重要的拓扑结构、长度和曲率属性。
- 这种方法简化了手术前的设备选择,并降低了手术过程中的失败风险。
- 此方法有助于缩小专家与非专家介入中心之间的差距。
- 该方法在临床应用中展示了其在手术中对大脑、骨盆和膝盖动脉网络的良好表现。
点此查看论文截图
Cryo-RL: automating prostate cancer cryoablation planning with reinforcement learning
Authors:Trixia Simangan, Ahmed Nadeem Abbasi, Yipeng Hu, Shaheer U. Saeed
Cryoablation is a minimally invasive localised treatment for prostate cancer that destroys malignant tissue during de-freezing, while sparing surrounding healthy structures. Its success depends on accurate preoperative planning of cryoprobe placements to fully cover the tumour and avoid critical anatomy. This planning is currently manual, expertise-dependent, and time-consuming, leading to variability in treatment quality and limited scalability. In this work, we introduce Cryo-RL, a reinforcement learning framework that models cryoablation planning as a Markov decision process and learns an optimal policy for cryoprobe placement. Within a simulated environment that models clinical constraints and stochastic intraoperative variability, an agent sequentially selects cryoprobe positions and ice sphere diameters. Guided by a reward function based on tumour coverage, this agent learns a cryoablation strategy that leads to optimal cryoprobe placements without the need for any manually-designed plans. Evaluated on 583 retrospective prostate cancer cases, Cryo-RL achieved over 8 percentage-point Dice improvements compared with the best automated baselines, based on geometric optimisation, and matched human expert performance while requiring substantially less planning time. These results highlight the potential of reinforcement learning to deliver clinically viable, reproducible, and efficient cryoablation plans.
冷冻消融是一种对前列腺癌的微创局部治疗方法,它能在解冻过程中破坏恶性组织,同时保留周围的健康结构。其成功取决于冷冻探针放置的术前计划准确,以全面覆盖肿瘤并避免关键解剖结构。当前的规划是手动的,依赖于专家,并且耗时,导致治疗质量参差不齐,可扩展性有限。在这项工作中,我们引入了冷冻强化学习(Cryo-RL),这是一种强化学习框架,它将冷冻消融计划建模为马尔可夫决策过程,并学习冷冻探针放置的最优策略。在一个模拟的环境中,该环境模拟了临床约束和术中随机变化,智能体按顺序选择冷冻探针的位置和冰球直径。在肿瘤覆盖的奖励函数指导下,智能体学习一种冷冻消融策略,该策略能导致最优的冷冻探针放置,无需任何手动设计计划。在583例回顾性前列腺癌病例中评估显示,与基于几何优化的最佳自动化基线相比,Cryo-RL的Dice指数提高了超过8个百分点,并匹配了人类专家的性能,同时大大减少了规划时间。这些结果突显了强化学习在提供临床可行、可重复和高效的冷冻消融计划方面的潜力。
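将冷冻消融规划建模为序贯决策,关键在于环境与奖励的定义。下面是一个高度简化的示意:在体素网格上放置球形冰球,并以当前的肿瘤覆盖 Dice 作为奖励;球形冰球模型、网格尺寸与奖励形式都是简化假设,实际还需惩罚对关键解剖结构的损伤。

```python
import numpy as np

class CryoEnvSketch:
    def __init__(self, tumour_mask):
        self.tumour = tumour_mask.astype(bool)         # 3D 肿瘤体素掩码
        self.ablated = np.zeros_like(self.tumour)      # 已被冰球覆盖的体素
        grid = np.indices(self.tumour.shape)
        self.coords = np.stack(grid, axis=-1)          # 每个体素的 (z, y, x) 坐标

    def step(self, center, diameter):
        """放置一个冰球(位置 + 直径),返回当前的肿瘤覆盖 Dice 作为奖励。"""
        dist = np.linalg.norm(self.coords - np.asarray(center), axis=-1)
        self.ablated |= dist <= diameter / 2.0
        inter = np.logical_and(self.ablated, self.tumour).sum()
        dice = 2 * inter / (self.ablated.sum() + self.tumour.sum() + 1e-8)
        return dice

# 玩具示例:球形肿瘤 + 依次放置两个冰球
shape = (32, 32, 32)
zz, yy, xx = np.indices(shape)
tumour = (zz - 16) ** 2 + (yy - 16) ** 2 + (xx - 16) ** 2 <= 8 ** 2
env = CryoEnvSketch(tumour)
print("reward after probe 1:", round(env.step((16, 14, 16), 14), 3))
print("reward after probe 2:", round(env.step((16, 19, 16), 14), 3))
```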
论文及项目相关链接
PDF Accepted at MICAD (Medical Imaging and Computer-Aided Diagnosis) 2025
Summary
本文介绍了Cryo-RL这一强化学习框架在前列腺癌冷冻消融治疗计划中的应用。该框架将冷冻消融计划视为马尔可夫决策过程,学习冷冻探针放置的最优策略。在模拟的临床环境中,通过奖励函数引导,自主学会无需手动设计的冷冻消融策略,实现对肿瘤的最佳覆盖。与几何优化等自动化方法相比,其在回顾性前列腺癌病例上取得了超过8个百分点的Dice改善值,且匹配了专家的人类表现,同时大幅减少了规划时间。此研究展示了强化学习在冷冻消融计划中的临床应用潜力。
Key Takeaways
- Cryoablation是一种微创的局部前列腺癌治疗方法,通过冷冻消融恶性组织,同时保护周围健康结构。
- 当前Cryoablation的术前规划依赖于专家经验和时间消耗,导致治疗质量不一且难以规模化。
- 引入Cryo-RL强化学习框架,将冷冻消融规划视为马尔可夫决策过程。
- 在模拟环境中,通过奖励函数引导学习最优冷冻探针放置策略,实现对肿瘤的最佳覆盖。
- 与几何优化等自动化方法相比,Cryo-RL在回顾性前列腺癌病例上取得了显著效果,提高了Dice系数值。
- Cryo-RL匹配了专家的人类表现,并大幅减少了术前规划时间。
点此查看论文截图