⚠️ 以下所有内容总结均由大语言模型生成,可能存在错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-11-25 更新
Improving Multimodal Distillation for 3D Semantic Segmentation under Domain Shift
Authors:Björn Michele, Alexandre Boulch, Gilles Puy, Tuan-Hung Vu, Renaud Marlet, Nicolas Courty
Semantic segmentation networks trained under full supervision for one type of lidar fail to generalize to unseen lidars without intervention. To reduce the performance gap under domain shifts, a recent trend is to leverage vision foundation models (VFMs) providing robust features across domains. In this work, we conduct an exhaustive study to identify recipes for exploiting VFMs in unsupervised domain adaptation for semantic segmentation of lidar point clouds. Building upon unsupervised image-to-lidar knowledge distillation, our study reveals that: (1) the architecture of the lidar backbone is key to maximize the generalization performance on a target domain; (2) it is possible to pretrain a single backbone once and for all, and use it to address many domain shifts; (3) best results are obtained by keeping the pretrained backbone frozen and training an MLP head for semantic segmentation. The resulting pipeline achieves state-of-the-art results in four widely-recognized and challenging settings. The code will be available at: https://github.com/valeoai/muddos.
在完整监督下针对某一种激光雷达训练的语义分割网络,在不加干预的情况下无法泛化到未见过的激光雷达。为了缩小域偏移下的性能差距,最近的趋势是利用能提供跨域稳健特征的视觉基础模型(VFMs)。在这项工作中,我们进行了详尽的研究,以确定在激光雷达点云语义分割的无监督域适应中利用VFMs的策略。基于无监督的图像到激光雷达知识蒸馏,我们的研究发现:(1)激光雷达主干网络的架构是最大化目标域泛化性能的关键;(2)可以一次性预训练一个单一主干网络,并用它应对多种域偏移;(3)保持预训练主干冻结、仅训练一个用于语义分割的MLP头部可获得最佳结果。由此得到的流程在四个广受认可且具有挑战性的设定中取得了最先进的结果。代码将在以下网址提供:https://github.com/valeoai/muddos。
论文及项目相关链接
PDF Accepted at BMVC 2025
Summary
在一种激光雷达上以全监督方式训练的语义分割网络,在未见过的激光雷达数据上表现不佳。为了缩小域偏移下的性能差距,一种趋势是利用具有跨域鲁棒特征的视觉基础模型(VFMs)。在这项研究中,我们进行了详尽的实验,探索了在无监督域自适应的激光雷达点云语义分割中利用VFMs的策略。基于图像到激光雷达的无监督知识蒸馏,我们的研究揭示了以下关键发现:(1)激光雷达主干的架构是最大化目标域上泛化性能的关键;(2)有可能预先训练一个单一的主干网络,并用于解决多种域转移问题;(3)最好的结果是通过保持预训练的主干网络冻结状态,并训练一个用于语义分割的MLP头部得到的。由此产生的流程在四个公认的挑战性场景中达到了最新的结果。更多详细信息可访问 GitHub:https://github.com/valeoai/muddos。
Key Takeaways
- 全监督训练的语义分割网络在未见域的激光雷达数据上表现不佳。
- 利用视觉基础模型(VFMs)能有效提高语义分割网络的泛化能力。
- 激光雷达主干网络的架构是最大化目标域泛化性能的关键。
- 预训练一个主干网络可以适应多种域转移问题。
- 保持预训练的主干网络冻结状态,并训练特定的MLP头部可以获得最佳性能。
- 所提出的方法在多种公认的挑战场景中实现了最佳性能。
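下面给出"冻结预训练主干、只训练MLP分割头"这一做法的一个极简 PyTorch 示意。其中 backbone、feat_dim 等均为假设的占位参数,代表任意输出逐点特征的预训练点云主干,并非论文的官方实现:

```python
import torch
import torch.nn as nn

class FrozenBackboneSegmenter(nn.Module):
    """冻结的预训练主干 + 可训练的 MLP 分割头(示意)。"""

    def __init__(self, backbone: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # 冻结主干,多种域偏移可共用同一组预训练权重
            p.requires_grad = False
        self.head = nn.Sequential(             # 仅此 MLP 头参与训练
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, num_classes),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                  # 主干只做特征提取
            feats = self.backbone(points)      # (N, feat_dim) 逐点特征
        return self.head(feats)                # (N, num_classes) 逐点 logits
```

训练时只需把 model.head.parameters() 交给优化器,例如 torch.optim.AdamW(model.head.parameters(), lr=1e-3)。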
点此查看论文截图
REMSA: An LLM Agent for Foundation Model Selection in Remote Sensing
Authors:Binger Chen, Tacettin Emre Bök, Behnood Rasti, Volker Markl, Begüm Demir
Foundation Models (FMs) are increasingly used in remote sensing (RS) for tasks such as environmental monitoring, disaster assessment, and land-use mapping. These models include unimodal vision encoders trained on a single data modality and multimodal architectures trained on combinations of SAR, multispectral, hyperspectral, and image-text data. They support diverse RS tasks including semantic segmentation, image classification, change detection, and visual question answering. However, selecting an appropriate remote sensing foundation model (RSFM) remains difficult due to scattered documentation, heterogeneous formats, and varied deployment constraints. We introduce the RSFM Database (RS-FMD), a structured resource covering over 150 RSFMs spanning multiple data modalities, resolutions, and learning paradigms. Built on RS-FMD, we present REMSA, the first LLM-based agent for automated RSFM selection from natural language queries. REMSA interprets user requirements, resolves missing constraints, ranks candidate models using in-context learning, and provides transparent justifications. We also propose a benchmark of 75 expert-verified RS query scenarios, producing 900 configurations under an expert-centered evaluation protocol. REMSA outperforms several baselines, including naive agents, dense retrieval, and unstructured RAG-based LLMs. It operates entirely on publicly available metadata and does not access private or sensitive data.
基础模型(FMs)在遥感(RS)中得到了越来越广泛的应用,用于环境监测、灾害评估和土地利用制图等任务。这些模型既包括在单一数据模态上训练的单模态视觉编码器,也包括在SAR、多光谱、高光谱和图文数据组合上训练的多模态架构。它们支持多种遥感任务,包括语义分割、图像分类、变化检测和视觉问答。然而,由于文档分散、格式异构和部署约束各异,选择合适的遥感基础模型(RSFM)仍然是一项挑战。我们引入了RSFM数据库(RS-FMD),这是一个结构化资源,涵盖了横跨多种数据模态、分辨率和学习范式的150多个RSFM。基于RS-FMD,我们推出了REMSA,这是第一个基于LLM、可根据自然语言查询自动选择RSFM的代理。REMSA解释用户需求,补全缺失的约束,使用上下文学习对候选模型进行排序,并提供透明的理由。我们还提出了一个由75个专家验证的遥感查询场景组成的基准测试,在以专家为中心的评估协议下产生了900个配置。REMSA的表现优于多个基线,包括朴素代理、密集检索和基于非结构化RAG的LLM。它完全基于可公开获取的元数据运行,不访问私有或敏感数据。
论文及项目相关链接
PDF Code and data available at https://github.com/be-chen/REMSA
Summary
基础模型(FMs)在遥感领域被用于环境监控、灾害评估和土地利用制图等任务。我们引入RSFM数据库(RS-FMD),涵盖超过150个横跨多种数据模态、分辨率和学习范式的基础模型。基于RS-FMD,推出REMSA,首个可根据自然语言查询自动选择RSFM的LLM代理。REMSA解析用户需求、补全缺失约束,利用上下文学习对候选模型进行排序,并提供透明的理由。我们还推出了由专家验证的遥感查询场景基准测试,REMSA在其上表现优于其他基线。
Key Takeaways
- 基础模型(FMs)在遥感(RS)领域有广泛应用,用于环境监控、灾害评估等任务。
- RSFM数据库(RS-FMD)是一个结构化的资源,包含超过150个基础模型,覆盖多数据模态、分辨率和学习范式。
- REMSA是首个基于LLM的自动化RSFM选择代理,可从自然语言查询中选择模型。
- REMSA通过解释用户要求、解决约束、排名候选模型并提供透明理由来工作。
- 推出专家验证的遥感查询场景基准测试,以评估REMSA的性能。
- REMSA相较于其他基线表现出色,且完全基于公开元数据进行操作,不访问私有或敏感数据。
- 这项技术对于提高遥感领域基础模型的选择效率和准确性具有重要意义。
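REMSA 的前半部分可以理解为"在结构化元数据上按用户约束筛选候选模型,再交给 LLM 用上下文学习排序"。下面仅示意第一步的约束过滤;其中的元数据字段和条目均为虚构示例,并非 RS-FMD 的真实模式:

```python
# 虚构的元数据条目,仅演示"按约束筛选候选模型"这一步;LLM 的上下文学习排序未在此实现
RS_FMD_SAMPLE = [
    {"name": "FM-A", "modality": "SAR", "tasks": ["segmentation"], "params_m": 300},
    {"name": "FM-B", "modality": "multispectral", "tasks": ["classification", "change_detection"], "params_m": 90},
    {"name": "FM-C", "modality": "multispectral", "tasks": ["segmentation"], "params_m": 600},
]

def filter_candidates(db, modality=None, task=None, max_params_m=None):
    """按模态、任务与参数量上限过滤,返回可交给 LLM 进一步排序的候选列表。"""
    kept = []
    for entry in db:
        if modality is not None and entry["modality"] != modality:
            continue
        if task is not None and task not in entry["tasks"]:
            continue
        if max_params_m is not None and entry["params_m"] > max_params_m:
            continue
        kept.append(entry)
    return kept

print(filter_candidates(RS_FMD_SAMPLE, modality="multispectral", task="segmentation", max_params_m=700))
```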
点此查看论文截图
Quantum Masked Autoencoders for Vision Learning
Authors:Emma Andrews, Prabhat Mishra
Classical autoencoders are widely used to learn features of input data. To improve the feature learning, classical masked autoencoders extend classical autoencoders to learn the features of the original input sample in the presence of masked-out data. While quantum autoencoders exist, there is no design and implementation of quantum masked autoencoders that can leverage the benefits of quantum computing and quantum autoencoders. In this paper, we propose quantum masked autoencoders (QMAEs) that can effectively learn missing features of a data sample within quantum states instead of classical embeddings. We showcase that our QMAE architecture can learn the masked features of an image and can reconstruct the masked input image with improved visual fidelity in MNIST images. Experimental evaluation highlights that QMAE can significantly outperform (12.86% on average) in classification accuracy compared to state-of-the-art quantum autoencoders in the presence of masks.
经典自编码器被广泛应用于学习输入数据的特征。为了改进特征学习,经典掩码自编码器对其进行了扩展,使其能够在部分数据被掩盖的情况下学习原始输入样本的特征。虽然已有量子自编码器,但尚无能够同时利用量子计算和量子自编码器优势的量子掩码自编码器的设计与实现。在本文中,我们提出了量子掩码自编码器(QMAE),它能够在量子态中而非经典嵌入中有效学习数据样本的缺失特征。我们展示了QMAE架构可以学习图像中被掩盖的特征,并能够以更高的视觉保真度重建MNIST中被掩盖的输入图像。实验评估表明,在存在掩码的情况下,QMAE在分类精度上显著优于最先进的量子自编码器(平均高出12.86%)。
论文及项目相关链接
Summary
本文提出了量子掩码自编码器(QMAE),结合了量子计算和量子自编码器的优势,能有效学习数据样本在量子状态下的缺失特征,而非仅依赖经典嵌入。实验证明,QMAE能在存在掩码的情况下学习图像的特征并重建掩码输入图像,提高MNIST图像的视觉保真度,且在分类准确性上显著优于现有最先进的量子自编码器。
Key Takeaways
- 量子掩码自编码器(QMAE)结合了量子计算和量子自编码器的优点。
- QMAE能有效学习数据样本在量子状态下的缺失特征。
- QMAE能在存在掩码的情况下学习图像特征。
- QMAE能重建掩码输入图像,提高视觉保真度。
- 实验证明QMAE在分类准确性上显著优于现有最先进的量子自编码器。
- QMAE的设计有助于提升特征学习的效率和准确性。
点此查看论文截图
UAM: A Unified Attention-Mamba Backbone of Multimodal Framework for Tumor Cell Classification
Authors:Taixi Chen, Jingyun Chen, Nancy Guo
Cell-level radiomics features provide fine-grained insights into tumor phenotypes and have the potential to significantly enhance diagnostic accuracy on hematoxylin and eosin (H&E) images. By capturing micro-level morphological and intensity patterns, these features support more precise tumor identification and improve AI interpretability by highlighting diagnostically relevant cells for pathologist review. However, most existing studies focus on slide-level or patch-level tumor classification, leaving cell-level radiomics analysis largely unexplored. Moreover, there is currently no dedicated backbone specifically designed for radiomics data. Inspired by the recent success of the Mamba architecture in vision and language domains, we introduce a Unified Attention-Mamba (UAM) backbone for cell-level classification using radiomics features. Unlike previous hybrid approaches that integrate Attention and Mamba modules in fixed proportions, our unified design flexibly combines their capabilities within a single cohesive architecture, eliminating the need for manual ratio tuning and improving encode capability. We develop two UAM variants to comprehensively evaluate the benefits of this unified structure. Building on this backbone, we further propose a multimodal UAM framework that jointly performs cell-level classification and image segmentation. Experimental results demonstrate that UAM achieves state-of-the-art performance across both tasks on public benchmarks, surpassing leading image-based foundation models. It improves cell classification accuracy from 74% to 78% ($n$=349,882 cells), and tumor segmentation precision from 75% to 80% ($n$=406 patches). These findings highlight the effectiveness and promise of UAM as a unified and extensible multimodal foundation for radiomics-driven cancer diagnosis.
细胞级放射组学特征为肿瘤表型提供了精细的见解,并有可能显著提高苏木精和伊红(H&E)图像的诊断准确性。通过捕捉微观水平的形态和强度模式,这些特征支持更精确的肿瘤识别,并通过突出显示与诊断相关的细胞以供病理学家审查,从而提高了人工智能的可解释性。然而,大多数现有研究集中在全切片级别或图像块级别的肿瘤分类上,很少探索细胞级别的放射组学分析。此外,目前还没有专门为放射组学数据设计的专用主干网络。受Mamba架构在视觉和语言领域近期成功的启发,我们引入了使用放射组学特征的统一注意力Mamba(UAM)主干网络,用于细胞级别分类。不同于以前将注意力和Mamba模块以固定比例集成的方法,我们的统一设计在一个单一的内聚架构中灵活地结合了它们的能力,消除了对手动比例调整的需求,提高了编码能力。我们开发了两个UAM变体,以全面评估这种统一结构的好处。在此基础上,我们进一步提出了一个多模态UAM框架,该框架可以执行细胞级别的分类和图像分割。实验结果表明,UAM在公共基准测试中实现了最先进的性能,超越了领先的基于图像的基础模型。它将细胞分类准确率从74%提高到78%(n=349,882个细胞),肿瘤分割精度从75%提高到80%(n=406个图像块)。这些发现凸显了UAM作为放射组学驱动癌症诊断的统一且可扩展的多模态基础模型的有效性和潜力。
论文及项目相关链接
摘要
细胞层面的放射组学特征为肿瘤表型提供了精细的见解,并有可能在苏木精和伊红(H&E)图像上显著提高诊断准确性。通过捕捉微观层面的形态和强度模式,这些特征支持更精确的肿瘤识别,并通过突出显示与诊断相关的细胞供病理医师审查,从而提高AI的可解释性。然而,大多数现有研究集中在全切片级别或图像块级别的肿瘤分类上,而细胞级别的放射组学分析在很大程度上尚未被探索。此外,目前尚未有专门针对放射组学数据的专用骨干网络。受Mamba架构在视觉和语言领域近期成功的启发,我们引入了用于细胞级别分类的Unified Attention-Mamba(UAM)骨干网络。与其他将注意力机制和Mamba模块以固定比例结合的混合方法不同,我们的统一设计可以在一个单一的连贯架构中灵活地结合它们的能力,消除了对手动比例调整的需求,提高了编码能力。我们开发了两个UAM变种来全面评估这种统一结构的好处。在此基础上,我们进一步提出了一个多模态UAM框架,该框架可同时进行细胞级别的分类和图像分割。实验结果表明,UAM在公共基准测试上实现了最先进的性能表现,超越了领先的基于图像的基础模型。它将细胞分类准确率从74%提高到78%(n=349,882个细胞),肿瘤分割精度从75%提高到80%(n=406个图像块)。这些发现突显了UAM作为放射组学驱动癌症诊断的统一和可扩展的多模态基础架构的有效性和潜力。
关键见解
- 细胞级别的放射组学特征为肿瘤表现型提供了精细洞察,能提高诊断准确性。
- 目前缺乏对细胞级别放射组学分析的深入探索,且尚无专用骨干网络处理此类数据。
- 引入Unified Attention-Mamba(UAM)骨干网络,能灵活结合注意力机制和Mamba模块,提高编码能力。
- UAM在细胞级别分类和图像分割方面表现出卓越性能,超越了现有的图像基础模型。
- UAM细胞分类准确率提升至78%,肿瘤分割精度提升至80%,显示出其有效性和潜力。
- UAM的设计具有统一性和可扩展性,可应用于多模态放射组学驱动的癌症诊断。
点此查看论文截图
DSeq-JEPA: Discriminative Sequential Joint-Embedding Predictive Architecture
Authors:Xiangteng He, Shunsuke Sakai, Kun Yuan, Nicolas Padoy, Tatsuhito Hasegawa, Leonid Sigal
Image-based Joint-Embedding Predictive Architecture (I-JEPA) learns visual representations by predicting latent embeddings of masked regions from visible context. However, it treats all regions uniformly and independently, lacking an explicit notion of where or in what order predictions should be made. Inspired by human visual perception, which deploys attention selectively and sequentially from the most informative to secondary regions, we propose DSeq-JEPA, a Discriminative Sequential Joint-Embedding Predictive Architecture that bridges predictive and autoregressive self-supervised learning, integrating JEPA-style latent prediction with GPT-style sequential reasoning. Specifically, DSeq-JEPA (i) first identifies primary discriminative regions based on a transformer-derived saliency map, emphasizing the distribution of visual importance, and then (ii) predicts subsequent regions in this discriminative order, progressively forming a curriculum-like semantic progression from primary to secondary cues – a form of GPT-style pre-training. Extensive experiments across diverse tasks, including image classification (ImageNet), fine-grained visual categorization (iNaturalist21, CUB-200-2011, Stanford-Cars), detection and segmentation (MS-COCO, ADE20K), and low-level reasoning tasks (Clevr/Count, Clevr/Dist), demonstrate that DSeq-JEPA consistently focuses on more discriminative and generalizable representations than I-JEPA variants. Project page: https://github.com/SkyShunsuke/DSeq-JEPA.
基于图像的联合嵌入预测架构(I-JEPA)通过从可见上下文预测被遮挡区域的潜在嵌入来学习视觉表示。然而,它对待所有区域都是统一且独立的,缺乏明确的预测位置或顺序概念。受人类视觉感知的启发(人类视觉会从信息量最大的区域到次要区域,选择性地、有顺序地部署注意力),我们提出了DSeq-JEPA(判别式序列联合嵌入预测架构)。它衔接了预测式与自回归式的自监督学习,将JEPA风格的潜在预测与GPT风格的序列推理相结合。具体来说,DSeq-JEPA首先根据基于Transformer的显著性图识别主要判别区域,强调视觉重要性的分布,然后按照这种判别顺序预测后续区域,逐步形成一个从主要线索到次要线索的课程式语义进展,这是一种类似GPT的预训练方式。在多种任务上的广泛实验表明,包括图像分类(ImageNet)、细粒度视觉分类(iNaturalist21、CUB-200-2011、Stanford-Cars)、检测和分割(MS-COCO、ADE20K),以及低级推理任务(Clevr/Count、Clevr/Dist),DSeq-JEPA始终关注比I-JEPA变体更具判别力和泛化性的表示。项目页面:https://github.com/SkyShunsuke/DSeq-JEPA。
论文及项目相关链接
PDF Project page: https://github.com/SkyShunsuke/DSeq-JEPA
Summary
本文介绍了基于图像联合嵌入预测架构(I-JEPA)的改进版本——判别性序列联合嵌入预测架构(DSeq-JEPA)。DSeq-JEPA通过引入人类视觉感知的注意力机制,从最具信息量的区域开始,选择性地、顺序地进行预测。实验表明,DSeq-JEPA在图像分类、细粒度视觉分类、检测与分割以及低级推理任务上的表现均优于I-JEPA,具有更强的判别能力和泛化性。
Key Takeaways
- DSeq-JEPA结合预测和自回归自监督学习,整合了JEPA风格的潜在预测和GPT风格的顺序推理。
- DSeq-JEPA通过基于Transformer的显著性图识别主要判别区域,强调视觉重要性分布。
- DSeq-JEPA按照鉴别顺序预测后续区域,形成类似课程的语义进度,从主要线索到次要线索。
- 与I-JEPA相比,DSeq-JEPA在多个任务上表现优越,包括图像分类、细粒度视觉分类、检测与分割以及低层次推理任务。
- DSeq-JEPA关注更具判别力和泛化能力的表示。
- 人类视觉感知的注意力机制被有效地引入到DSeq-JEPA中,使得其可以从最具信息量的区域开始,选择性地、顺序地进行预测。
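DSeq-JEPA 的核心步骤之一是按显著性从高到低确定预测顺序。下面是一个假设性的示意:把 ViT 的 CLS 注意力当作各图块的显著性分数,先取出主要判别区域,再给出其余图块的预测顺序;变量名与数值均为示例,并非论文原始实现:

```python
import torch

def discriminative_order(cls_attn: torch.Tensor, num_primary: int):
    """
    cls_attn: (num_patches,) 由 Transformer 得到的显著性分数(如 CLS 对各图块的注意力)
    返回: (primary, rest) —— 主要判别区域的图块索引,以及按显著性降序排列的其余图块索引
    """
    order = torch.argsort(cls_attn, descending=True)  # 显著性从高到低
    primary = order[:num_primary]                     # 首先处理的主要判别区域
    rest = order[num_primary:]                        # 之后按此顺序依次预测的次要区域
    return primary, rest

# 用法示意:14x14 = 196 个图块
saliency = torch.rand(196)
primary, rest = discriminative_order(saliency, num_primary=20)
```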
点此查看论文截图
A XRISM View of Relativistic Reflection in Cygnus X-1
Authors:Paul A. Draghis, Jon M. Miller, Erin Kara, Elisa Costantini, Oluwashina Adegoke, Javier A. Garcia
We present the first high-resolution XRISM/Resolve view of the relativistically broadened Fe K line in Cygnus X-1. The data clearly separate the relativistic broad line from the underlying continuum and from narrow emission and absorption features in the Fe band. The unprecedented spectral resolution in the Fe K band clearly demonstrates that the flux excess can be attributed to a single, broad feature, as opposed to a superposition of previously unresolved narrow features. This broad feature can be best interpreted as emission consistent with an origin near the innermost stable circular orbit around a rapidly rotating black hole. By modeling the shape of the broad line, we find a black hole spin of $a\simeq0.98$ and an inclination of the inner accretion disk of $θ\simeq63^\circ$. The spin is consistent with prior reflection studies, reaffirming the robustness of past spin measurements using the relativistic reflection method. The measured inclination provides reinforcing evidence of a disk-orbit misalignment in Cygnus X-1. These results highlight the unique abilities of XRISM in separating overlapping spectral features and providing constraints on the geometry of accretion in X-ray binaries.
我们首次利用XRISM/Resolve对天鹅座X-1(Cygnus X-1)中相对论展宽的Fe K线进行了高分辨率观测。数据清楚地将相对论宽线与底层连续谱以及Fe波段中的窄发射和吸收特征区分开来。Fe K波段前所未有的光谱分辨率清楚地表明,流量过剩可以归因于单个宽特征,而不是先前未分辨的窄特征的叠加。这一宽特征最好被解释为源自快速旋转黑洞最内稳定圆轨道附近的发射。通过对宽线形状的建模,我们得到黑洞自旋约为a≈0.98,内吸积盘倾角约为θ≈63°。该自旋与先前的反射研究一致,再次证明了利用相对论反射方法进行自旋测量的稳健性。所测得的倾角为天鹅座X-1中盘与轨道错位提供了进一步的证据。这些结果突出了XRISM在分离重叠光谱特征以及约束X射线双星吸积几何方面的独特能力。
论文及项目相关链接
PDF Accepted for publication in ApJL
摘要
高分辨率XRISM/Resolve观测首次展示了天鹅座X-1(Cygnus X-1)中相对论展宽的Fe K线特征。数据显示,相对论宽线能清晰地与底层连续谱以及铁波段中的窄发射和吸收特征区分开来。铁K波段空前的光谱分辨率清楚地表明,流量过剩可归因于单个宽特征,而不是先前未分辨的窄特征的叠加。这一宽特征最好被解释为源自快速旋转黑洞周围最内稳定圆轨道附近的发射。通过对宽线形状进行建模,我们发现黑洞的自旋参数约为a≈0.98,内吸积盘的倾角约为θ≈63°。自旋参数与先前的反射研究结果一致,再次证明了使用相对论反射方法测量自旋的稳健性。所测得的倾角为天鹅座X-1的盘轨道错位提供了进一步的证据。这些结果突出了XRISM在分离重叠光谱特征以及为X射线双星的吸积几何提供约束方面的独特能力。
要点
- 高分辨率XRISM/Resolve视图展示了Cygnus X-1中相对论效应下的Fe K线特征。
- 数据清晰区分了相对论宽带线与底层连续谱及铁带中的窄特征。
- 流量过剩归因于单个宽特征,而非先前未分辨的窄特征的叠加。
- 宽线特征可能源于黑洞周围最内稳定圆轨道附近的发射。
- 通过建模,得出黑洞的自旋参数和内吸积盘的倾角。
- 自旋参数与先前研究一致,证明了测量自旋参数的稳健性。
- 观测结果支持Cygnus X-1的盘轨道错位假设,并突出了XRISM在分离光谱特征和约束吸积几何方面的能力。
点此查看论文截图
MuM: Multi-View Masked Image Modeling for 3D Vision
Authors:David Nordström, Johan Edstedt, Fredrik Kahl, Georg Bökman
Self-supervised learning on images seeks to extract meaningful visual representations from unlabeled data. When scaled to large datasets, this paradigm has achieved state-of-the-art performance and the resulting trained models such as DINOv3 have seen widespread adoption. However, most prior efforts are optimized for semantic understanding rather than geometric reasoning. One important exception is Cross-View Completion, CroCo, which is a form of masked autoencoding (MAE) tailored for 3D understanding. In this work, we continue on the path proposed by CroCo and focus on learning features tailored for 3D vision. In a nutshell, we extend MAE to arbitrarily many views of the same scene. By uniformly masking all views and employing a lightweight decoder with inter-frame attention, our approach is inherently simpler and more scalable than CroCo. We evaluate the resulting model, MuM, extensively on downstream tasks including feedforward reconstruction, dense image matching and relative pose estimation, finding that it outperforms the state-of-the-art visual encoders DINOv3 and CroCo v2.
自我监督学习在图像方面的目标是从未标记的数据中提取有意义的视觉表示。当扩展到大型数据集时,该范式已实现了最先进的性能,并且得到的训练模型(如DINOv3)已被广泛使用。然而,大多数先前的工作都是针对语义理解进行优化,而非几何推理。一个重要的例外是跨视图补全(Cross-View Completion),即CroCo,这是一种针对3D理解的掩码自编码(MAE)形式。在这项工作中,我们继续沿着CroCo提出的路径前进,专注于为3D视觉量身定制特征学习。简而言之,我们将MAE扩展到同一场景的任意多个视图。通过统一遮挡所有视图并使用具有帧间注意力的轻量级解码器,我们的方法比CroCo更简洁、更可扩展。我们在下游任务上对得到的模型MuM进行了广泛评估,包括前馈重建、密集图像匹配和相对姿态估计,发现它优于最先进的视觉编码器DINOv3和CroCo v2。
论文及项目相关链接
Summary
自监督学习旨在从无标签图像数据中提取有意义的视觉表征;当扩展到大型数据集时可达到顶尖性能,DINOv3等模型已被广泛采用。但多数方法侧重于语义理解而非几何推理。本研究延续CroCo的思路,专注于学习适用于3D视觉的特征,将MAE扩展到同一场景的任意多个视角。通过对所有视角统一掩码,并使用带帧间注意力的轻量级解码器,方法比CroCo更简单且更可扩展。在下游任务(前馈重建、密集图像匹配和相对姿态估计)上评估所得模型MuM,其表现超越现有视觉编码器DINOv3和CroCo v2。
Key Takeaways
- 自监督学习能从无标签图像数据中提取有意义视觉表征。
- 在大型数据集上,自监督学习达到顶尖性能,DINOv3等模型广泛应用。
- 多数自监督学习方法侧重于语义理解,但几何推理同样重要。
- CroCo是一种针对3D理解的MAE方法,本研究在此基础上进行扩展。
- 研究通过扩展MAE技术实现同一场景的多视角学习。
- 对所有视角统一掩码,并使用带帧间注意力的轻量级解码器,使方法更简单且更具可扩展性。
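MuM 把 MAE 扩展到同一场景的任意多个视角,并对所有视角统一掩码。下面是对 V 个视角的 token 各自随机掩码相同比例的极简示意(掩码率、张量形状均为假设):

```python
import torch

def mask_multi_view_tokens(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """
    tokens: (V, N, D) —— 同一场景 V 个视角、每个视角 N 个 token 的特征
    对每个视角独立地随机掩码相同比例的 token,返回可见 token 与掩码布尔图。
    """
    V, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(V, N)                          # 每个视角独立采样随机数
    keep_idx = noise.argsort(dim=1)[:, :num_keep]     # 每个视角保留的 token 索引
    mask = torch.ones(V, N, dtype=torch.bool)
    mask[torch.arange(V).unsqueeze(1), keep_idx] = False   # False = 可见, True = 被掩码
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, mask

# 用法示意:4 个视角、每视角 196 个 token、维度 256
visible, mask = mask_multi_view_tokens(torch.randn(4, 196, 256), mask_ratio=0.75)
```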
点此查看论文截图
Scaling Self-Supervised and Cross-Modal Pretraining for Volumetric CT Transformers
Authors:Cris Claessens, Christiaan Viviers, Giacomo D’Amicantonio, Egor Bondarev, Fons van der Sommen
We introduce SPECTRE, a fully transformer-based foundation model for volumetric computed tomography (CT). Our Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction (SPECTRE) approach utilizes scalable 3D Vision Transformer architectures and modern self-supervised and vision-language pretraining strategies to learn general-purpose CT representations. Volumetric CT poses unique challenges, such as extreme token scaling, geometric anisotropy, and weak or noisy clinical supervision, that make standard transformer and contrastive learning recipes ineffective out of the box. The framework jointly optimizes a local transformer for high-resolution volumetric feature extraction and a global transformer for whole-scan context modeling, making large-scale 3D attention computationally tractable. Notably, SPECTRE is trained exclusively on openly available CT datasets, demonstrating that high-performing, generalizable representations can be achieved without relying on private data. Pretraining combines DINO-style self-distillation with SigLIP-based vision-language alignment using paired radiology reports, yielding features that are both geometrically consistent and clinically meaningful. Across multiple CT benchmarks, SPECTRE consistently outperforms prior CT foundation models in both zero-shot and fine-tuned settings, establishing SPECTRE as a scalable, open, and fully transformer-based foundation model for 3D medical imaging.
我们介绍了SPECTRE,这是一种完全基于Transformer的体积计算机断层扫描(CT)基础模型。我们的SPECTRE方法(Self-Supervised & Cross-Modal Pretraining for CT Representation Extraction,即面向CT表示提取的自监督与跨模态预训练)利用可扩展的3D视觉Transformer架构以及现代的自监督和视觉-语言预训练策略来学习通用的CT表示。体积CT带来了独特的挑战,如极端的token规模、几何各向异性以及微弱或嘈杂的临床监督,这使得标准的Transformer和对比学习方案无法直接奏效。该框架联合优化一个用于高分辨率体积特征提取的局部Transformer和一个用于全扫描上下文建模的全局Transformer,使大规模三维注意力在计算上可行。值得注意的是,SPECTRE仅在公开可用的CT数据集上进行训练,表明无需依赖私有数据也能获得高性能、可泛化的表示。预训练将DINO风格的自蒸馏与基于SigLIP、利用配对放射学报告的视觉-语言对齐相结合,产生既具有几何一致性又具有临床意义的特征。在多个CT基准测试中,SPECTRE在零样本和微调设置下均优于先前的CT基础模型,确立了其作为可扩展、开放且完全基于Transformer的三维医学影像基础模型的地位。
论文及项目相关链接
Summary
SPECTRE是一种完全基于Transformer的体积计算机断层扫描(CT)基础模型预训练框架。它解决了传统Transformer模型在应对大规模CT数据时的难题,通过联合优化局部和全局Transformer模型,实现了高效的体积特征提取和全扫描上下文建模。该模型仅使用公开可用的CT数据集进行训练,展现出良好的泛化性能。预训练结合了DINO风格的自蒸馏技术和基于SigLIP的视觉-语言对齐技术,产生几何一致且具有临床意义的特征。在多个CT基准测试中,SPECTRE在零样本和微调设置中均表现出超越先前CT基础模型的性能。
Key Takeaways
- SPECTRE是一个基于Vision Transformer的CT体积图像预训练框架。
- 该框架解决了大规模CT数据的处理难题,包括极端令牌缩放、几何各向异性和弱或嘈杂的临床监督等问题。
- SPECTRE联合优化了局部和全局Transformer模型,以实现高效的特征提取和上下文建模。
- 该模型仅使用公开数据集进行训练,展示了不使用私有数据也能实现高性能的通用表示。
- 预训练结合了自蒸馏技术和基于SigLIP的视觉语言对齐技术,生成了几何一致且临床意义明确的特征。
- 对比多个CT基准测试,SPECTRE表现出出色的性能,优于其他CT基础模型。
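SPECTRE 预训练中使用的 SigLIP 式视觉-语言对齐,本质上是把图文对比学习改写为逐对的 sigmoid 二分类损失。下面给出该损失的通用形式示意(嵌入、温度与偏置均为示例变量,与论文的具体实现无关):

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                log_t: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    """
    img_emb, txt_emb: (B, D) 已 L2 归一化的图像 / 报告文本嵌入
    log_t, bias: 可学习的温度(对数形式)与偏置标量
    """
    logits = img_emb @ txt_emb.t() * log_t.exp() + bias                   # (B, B) 全部图文对的打分
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0  # 配对为 +1, 非配对为 -1
    return -F.logsigmoid(labels * logits).mean()                          # 逐对 sigmoid 二分类损失

# 用法示意
img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = siglip_loss(img, txt, log_t=torch.tensor(2.3), bias=torch.tensor(-10.0))
```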
点此查看论文截图
Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning
Authors:Jiayi Wang, Wei Dai, Haoyu Wang, Sihan Yang, Haixia Bi, Jian Sun
In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible, motivating continual image segmentation-learning from data streams without catastrophic forgetting. While the Segment Anything Model (SAM) offers strong zero-shot priors and has been widely fine-tuned across downstream tasks, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced. To this end, we introduce the Alignment Layer, a lightweight, plug-and-play module which aligns encoder-decoder feature distributions to efficiently adapt SAM to specific medical images, improving accuracy while reducing computation. Building on SAM and the Alignment Layer, we then propose Continual Alignment for SAM (CA-SAM), a continual learning strategy that automatically adapts the appropriate Alignment Layer to mitigate catastrophic forgetting, while leveraging SAM’s zero-shot priors to preserve strong performance on unseen medical datasets. Experimented across nine medical segmentation datasets under continual-learning scenario, CA-SAM achieves state-of-the-art performance. Our code, models and datasets will be released on \mbox{https://github.com/azzzzyo/Continual-Alignment-for-SAM.}
在医学图像分割领域,机构间异构的隐私政策通常使得在汇总数据集上进行联合训练不可行,这促使人们研究从数据流中持续学习图像分割而不发生灾难性遗忘。虽然Segment Anything Model(SAM)提供了强大的零样本先验,并已在下游任务中被广泛微调,但其庞大的参数量和计算开销对实际部署构成了挑战。本文表明,一旦在计算效率和性能之间取得平衡,SAM范式将非常有前景。为此,我们引入了Alignment Layer,这是一个轻量级的即插即用模块,通过对齐编码器-解码器的特征分布,使SAM能高效适配特定的医学图像,在提高精度的同时降低计算量。基于SAM和Alignment Layer,我们进一步提出了面向SAM的持续对齐方法(CA-SAM),这是一种持续学习策略,能自动适配合适的Alignment Layer以缓解灾难性遗忘,同时利用SAM的零样本先验在未见过的医学数据集上保持强劲性能。在持续学习场景下对九个医学分割数据集进行实验,CA-SAM达到了最先进的性能。我们的代码、模型和数据集将在https://github.com/azzzzyo/Continual-Alignment-for-SAM上发布。
论文及项目相关链接
Summary
该论文针对医学图像分割领域中的机构间隐私政策差异问题,提出了基于Segment Anything Model(SAM)的持续学习方案。通过引入Alignment Layer模块,提高了SAM对特定医学图像的适应性,并减少了计算量。在此基础上,论文提出了CA-SAM(Continual Alignment for SAM)的持续学习策略,能够自动调整Alignment Layer以缓解灾难性遗忘问题,同时利用SAM的零样本先验知识在新医学数据集上保持强劲性能。实验表明,CA-SAM在九个医学分割数据集上实现了最先进的性能。
Key Takeaways
- 医学图像分割中,机构间异质隐私政策导致联合训练难以实现。
- SAM模型虽然具有强大的零样本先验知识,但其计算效率和部署实践面临挑战。
- 引入Alignment Layer模块,提高SAM对特定医学图像的适应性并减少计算量。
- 提出CA-SAM作为持续学习策略,结合SAM和Alignment Layer,自动调整以适应灾难性遗忘问题。
- CA-SAM在九个医学分割数据集上实现了最先进的性能。
- 代码、模型和数据集将在相关GitHub仓库中公开。
- 该研究为医学图像分割领域提供了一种有效的、适应不同隐私政策的持续学习解决方案。
点此查看论文截图
From Cantilevers to Membranes: Advanced Scanning Protocols for Magnetic Resonance Force Microscopy
Authors:Nils Prumbaum, Christian L. Degen, Alexander Eichler
Magnetic Resonance Force Microscopy (MRFM) enables three-dimensional imaging of nuclear spin densities in nanoscale objects. Based on numerical simulations, we evaluate the performance of strained SiN resonators as force sensors and show that their out-of-plane oscillation direction improves the quality of the reconstructed sample. We further introduce a multislice, compressed-sensing scan protocol that maximizes the information obtained for a given measurement time. Our simulations predict that these new scanning protocols and optimized algorithms can shorten the total acquisition time by up to two orders of magnitude while maintaining the reconstruction fidelity. Our results demonstrate that combining advanced scanning protocols with state-of-the-art resonators is a promising path toward high-resolution MRFM for volumetric imaging of biological nanostructures.
磁共振力显微镜(MRFM)能够对纳米级物体中的核自旋密度进行三维成像。基于数值模拟,我们评估了应变氮化硅(SiN)谐振器作为力传感器的性能,并表明其面外振荡方向提高了重建样品的质量。我们还引入了一种多层切片的压缩感知扫描协议,以在给定测量时间内最大化所获得的信息。我们的模拟预测,这些新的扫描协议和优化算法可以将总采集时间缩短多达两个数量级,同时保持重建保真度。我们的结果表明,将先进的扫描协议与最先进的谐振器相结合,是实现对生物纳米结构进行体积成像的高分辨率MRFM的有前途的途径。
论文及项目相关链接
Summary
基于数值模拟,评估了应变氮化硅谐振器作为力传感器的性能,并展示了其面外振荡方向在重建样本中的改进效果。引入了一种新的多层压缩感知扫描协议,该协议可在给定测量时间内最大化获得的信息量。预测显示,新的扫描协议和优化算法的结合可以缩短总采集时间,同时保持重建的保真度。这些结果证明了结合先进的扫描协议和最新的谐振器是实现高分辨磁共振力显微镜体积生物纳米结构成像的有前途的途径。
Key Takeaways
- MRFM能够实现纳米级物体的核自旋密度的三维成像。
- 数值模拟用于评估应变氮化硅谐振器的性能表现。
- 面外振荡方向的改进有助于提升样本重建质量。
- 提出了一种新的多层压缩感知扫描协议,以最大化信息获取并缩短测量时间。
- 新扫描协议和优化算法的结合能显著缩短总采集时间。
- 这种结合有望提高磁共振力显微镜的分辨率,实现生物纳米结构的体积成像。
点此查看论文截图
UI-Styler: Ultrasound Image Style Transfer with Class-Aware Prompts for Cross-Device Diagnosis Using a Frozen Black-Box Inference Network
Authors:Nhat-Tuong Do-Tran, Ngoc-Hoang-Lam Le, Ching-Chun Huang
The appearance of ultrasound images varies across acquisition devices, causing domain shifts that degrade the performance of fixed black-box downstream inference models when reused. To mitigate this issue, it is practical to develop unpaired image translation (UIT) methods that effectively align the statistical distributions between source and target domains, particularly under the constraint of a reused inference-blackbox setting. However, existing UIT approaches often overlook class-specific semantic alignment during domain adaptation, resulting in misaligned content-class mappings that can impair diagnostic accuracy. To address this limitation, we propose UI-Styler, a novel ultrasound-specific, class-aware image style transfer framework. UI-Styler leverages a pattern-matching mechanism to transfer texture patterns embedded in the target images onto source images while preserving the source structural content. In addition, we introduce a class-aware prompting strategy guided by pseudo labels of the target domain, which enforces accurate semantic alignment with diagnostic categories. Extensive experiments on ultrasound cross-device tasks demonstrate that UI-Styler consistently outperforms existing UIT methods, achieving state-of-the-art performance in distribution distance and downstream tasks, such as classification and segmentation.
超声图像的外观因采集设备而异,由此导致的域偏移会降低被复用的固定黑箱下游推理模型的性能。为缓解这一问题,一种实用的做法是开发非配对图像转换(UIT)方法,在复用推理黑箱的约束下有效对齐源域和目标域之间的统计分布。然而,现有的UIT方法在域适应过程中常常忽略类别特定的语义对齐,导致内容与类别的映射错位,可能损害诊断准确性。为了解决这一局限,我们提出了UI-Styler,一个新型的、面向超声、具有类别感知能力的图像风格迁移框架。UI-Styler利用模式匹配机制,将目标图像中蕴含的纹理模式迁移到源图像上,同时保留源图像的结构内容。此外,我们引入了由目标域伪标签引导的类别感知提示策略,强制实现与诊断类别的准确语义对齐。在超声跨设备任务上的大量实验表明,UI-Styler持续优于现有的UIT方法,在分布距离和下游任务(如分类和分割)上均达到最先进的性能。
论文及项目相关链接
PDF Project page: https://dotrannhattuong.github.io/UIStyler, Accepted to WACV 2026
Summary
该文本介绍了超声图像在不同采集设备间呈现的差异,导致域偏移现象,影响了固定黑箱下游推理模型的性能。为解决这一问题,提出开发无配对图像翻译(UIT)方法,以在重用推理黑箱设置下有效对齐源域和目标域的统计分布。然而,现有的UIT方法常常忽视类特定的语义对齐,导致内容类映射错位,可能影响诊断准确性。为解决此问题,本文提出了一个超声专用的类感知图像风格转换框架UI-Styler。UI-Styler利用模式匹配机制将目标图像的纹理模式转移到源图像上,同时保留源的结构内容。此外,本文还引入了一种由目标域的伪标签引导的类感知提示策略,以实现与诊断类别的精确语义对齐。在超声跨设备任务上的广泛实验表明,UI-Styler持续优于现有的UIT方法,在分布距离和下游任务(如分类和分割)上达到最新性能水平。
Key Takeaways
- 超声图像在不同采集设备间存在域偏移现象,影响固定黑箱下游模型的性能。
- 无配对图像翻译(UIT)方法可有效解决域偏移问题,通过对齐源域和目标域的统计分布来改善模型性能。
- 现有UIT方法忽视类特定的语义对齐,可能导致诊断准确性下降。
- 提出的UI-Styler框架结合模式匹配和类感知提示策略,实现超声图像的精准风格转换和语义对齐。
- UI-Styler在超声跨设备任务上表现优异,优于现有UIT方法。
- UI-Styler在分布距离、分类和分割等下游任务上达到最新性能水平。
- UI-Styler框架具有潜力在医学图像分析和诊断中广泛应用。
点此查看论文截图
Learning to Look Closer: A New Instance-Wise Loss for Small Cerebral Lesion Segmentation
Authors:Luc Bouteille, Alexander Jaus, Jens Kleesiek, Rainer Stiefelhagen, Lukas Heine
Traditional loss functions in medical image segmentation, such as Dice, often under-segment small lesions because their small relative volume contributes negligibly to the overall loss. To address this, instance-wise loss functions and metrics have been proposed to evaluate segmentation quality on a per-lesion basis. We introduce CC-DiceCE, a loss function based on the CC-Metrics framework, and compare it with the existing blob loss. Both are benchmarked against a DiceCE baseline within the nnU-Net framework, which provides a robust and standardized setup. We find that CC-DiceCE loss increases detection (recall) with minimal to no degradation in segmentation performance, albeit at the cost of slightly more false positives. Furthermore, our multi-dataset study shows that CC-DiceCE generally outperforms blob loss.
在医学图像分割中,传统的损失函数(如Dice损失)通常会对小病灶欠分割,因为小病灶的相对体积很小,对整体损失的贡献微乎其微。为了解决这一问题,已经有人提出逐实例的损失函数和指标,以按病灶逐个评估分割质量。我们介绍了基于CC-Metrics框架的CC-DiceCE损失函数,并将其与现有的blob损失进行了比较。两者都在nnU-Net框架内以DiceCE为基线进行了评估,该框架提供了稳健且标准化的实验设置。我们发现,CC-DiceCE损失在几乎不降低分割性能的情况下提高了检测率(召回率),代价是略微增加误报。此外,我们的多数据集研究表明,CC-DiceCE通常优于blob损失。
论文及项目相关链接
PDF 5 pages, 2 figures, 2 tables
Summary
这篇文本主要介绍了在医学图像分割中,针对传统损失函数(如Dice)在分割小病灶时存在的问题(如忽略小病灶在总体损失中的贡献),提出了基于CC-Metrics框架的CC-DiceCE损失函数。实验表明,CC-DiceCE损失函数能提高检测召回率,且对分割性能影响较小,虽然会增加一定的误报率。此外,多数据集研究结果显示,CC-DiceCE通常优于blob损失函数。
Key Takeaways
- 传统医学图像分割损失函数(如Dice)在小病灶分割上存在问题,因为小病灶的相对体积对总体损失贡献较小。
- 提出了基于CC-Metrics框架的CC-DiceCE损失函数,以解决这一问题。
- CC-DiceCE损失函数能提高检测召回率,并且可以在保证较小误报率的情况下保持较好的分割性能。
- 与现有的blob损失函数相比,CC-DiceCE在多数据集研究中有更好的表现。
- nnU-Net框架为实验提供了稳健和标准化的设置。
- CC-DiceCE损失函数是针对每个病灶实例进行评估的,这在处理医学图像分割任务时可能更具实际意义。
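逐实例(per-lesion)评估的基本思想是:先把金标准分割拆成连通域,再对每个病灶单独计算 Dice,避免小病灶被整体体积淹没。下面是一个简化示意,预测被限制在每个病灶的包围盒内;这只是对该思路的示意,并非 CC-Metrics 或 CC-DiceCE 的官方实现:

```python
import numpy as np
from scipy import ndimage

def per_lesion_dice(gt: np.ndarray, pred: np.ndarray, eps: float = 1e-8):
    """对金标准中的每个连通域(病灶)单独计算 Dice;预测被限制在该病灶的包围盒内(简化做法)。"""
    labeled, num = ndimage.label(gt > 0)           # 把金标准拆成连通域
    slices = ndimage.find_objects(labeled)
    scores = []
    for i, sl in enumerate(slices, start=1):
        lesion = labeled[sl] == i                  # 该病灶的掩码
        pred_roi = pred[sl] > 0                    # 同一包围盒内的预测
        inter = np.logical_and(lesion, pred_roi).sum()
        dice = (2.0 * inter + eps) / (lesion.sum() + pred_roi.sum() + eps)
        scores.append(float(dice))
    return scores                                  # 每个病灶一个分数,可再取平均作为逐实例指标
```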
点此查看论文截图
ReBrain: Brain MRI Reconstruction from Sparse CT Slice via Retrieval-Augmented Diffusion
Authors:Junming Liu, Yifei Sun, Weihua Cheng, Yujin Kang, Yirong Chen, Ding Wang, Guosun Zeng
Magnetic Resonance Imaging (MRI) plays a crucial role in brain disease diagnosis, but it is not always feasible for certain patients due to physical or clinical constraints. Recent studies attempt to synthesize MRI from Computed Tomography (CT) scans; however, low-dose protocols often result in highly sparse CT volumes with poor through-plane resolution, making accurate reconstruction of the full brain MRI volume particularly challenging. To address this, we propose ReBrain, a retrieval-augmented diffusion framework for brain MRI reconstruction. Given any 3D CT scan with limited slices, we first employ a Brownian Bridge Diffusion Model (BBDM) to synthesize MRI slices along the 2D dimension. Simultaneously, we retrieve structurally and pathologically similar CT slices from a comprehensive prior database via a fine-tuned retrieval model. These retrieved slices are used as references, incorporated through a ControlNet branch to guide the generation of intermediate MRI slices and ensure structural continuity. We further account for rare retrieval failures when the database lacks suitable references and apply spherical linear interpolation to provide supplementary guidance. Extensive experiments on SynthRAD2023 and BraTS demonstrate that ReBrain achieves state-of-the-art performance in cross-modal reconstruction under sparse conditions.
磁共振成像(MRI)在脑疾病诊断中起着至关重要的作用,但由于物理或临床限制,并非所有患者都适用。最近的研究尝试从计算机断层扫描(CT)中合成MRI,但低剂量协议通常会导致CT体积高度稀疏,平面外分辨率较差,使得准确重建完整的脑MRI体积尤为困难。针对这一问题,我们提出了ReBrain,这是一种增强检索的扩散框架,用于脑MRI重建。对于任何具有有限切片的3D CT扫描,我们首先采用布朗桥扩散模型(BBDM)沿2D维度合成MRI切片。同时,我们通过精细调整的检索模型从综合先验数据库中检索结构和病理相似的CT切片。这些检索到的切片用作参考,通过ControlNet分支融入,以指导中间MRI切片的生成并确保结构连续性。当数据库缺少合适的参考时,我们进一步考虑罕见的检索失败情况,并应用球面线性插值以提供补充指导。在SynthRAD2023和BraTS上的广泛实验表明,ReBrain在稀疏条件下的跨模态重建中达到了最新技术水平。
论文及项目相关链接
PDF 16 pages, 12 figures, 7 tables; Accepted by WACV 2026
Summary
本文提出一种名为ReBrain的检索增强扩散框架,用于从有限的CT扫描切片中重建MRI。它采用布朗桥扩散模型(BBDM)在2D维度上合成MRI切片,同时从综合先验数据库中检索结构和病理相似的CT切片作为参考,通过ControlNet分支引导中间MRI切片的生成,确保结构连续性。ReBrain在SynthRAD2023和BraTS数据集上的实验表明,它在稀疏条件下的跨模态重建达到最佳性能。
Key Takeaways
- ReBrain是一种新型的检索增强扩散框架,旨在解决从稀疏CT扫描中重建MRI的挑战。
- 它结合了布朗桥扩散模型(BBDM)和检索模型,以合成高质量的MRI切片。
- ReBrain能够从综合先验数据库中检索与给定CT切片结构和病理相似的切片作为参考。
- 通过ControlNet分支,这些检索到的切片被用来引导中间MRI切片的生成,确保结构的连续性。
- ReBrain处理了可能的检索失败情况,当数据库中没有合适的参考切片时,使用球形线性插值提供补充指导。
- 在SynthRAD2023和BraTS数据集上的实验表明,ReBrain在跨模态重建方面达到了最佳性能。
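ReBrain 在数据库缺少合适参考时,使用球面线性插值(slerp)在相邻参考之间提供补充引导。slerp 是一个通用操作,下面给出其标准实现示意(与论文中具体插值的对象无关):

```python
import torch

def slerp(x0: torch.Tensor, x1: torch.Tensor, alpha: float, eps: float = 1e-7) -> torch.Tensor:
    """在两个向量之间做球面线性插值;alpha=0 返回 x0,alpha=1 返回 x1。"""
    x0_n = x0 / (x0.norm() + eps)
    x1_n = x1 / (x1.norm() + eps)
    cos_omega = (x0_n * x1_n).sum().clamp(-1 + eps, 1 - eps)
    omega = torch.acos(cos_omega)                  # 两向量夹角
    so = torch.sin(omega)
    return (torch.sin((1 - alpha) * omega) / so) * x0 + (torch.sin(alpha * omega) / so) * x1
```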
点此查看论文截图
REArtGS++: Generalizable Articulation Reconstruction with Temporal Geometry Constraint via Planar Gaussian Splatting
Authors:Di Wu, Liu Liu, Anran Huang, Yuyan Liu, Qiaoyu Jun, Shaofan Liu, Liangtu Song, Cewu Lu
Articulated objects are pervasive in daily environments, such as drawers and refrigerators. Towards their part-level surface reconstruction and joint parameter estimation, REArtGS~\cite{wu2025reartgs} introduces a category-agnostic approach using multi-view RGB images at two different states. However, we observe that REArtGS still struggles with screw-joint or multi-part objects and lacks geometric constraints for unseen states. In this paper, we propose REArtGS++, a novel method towards generalizable articulated object reconstruction with temporal geometry constraint and planar Gaussian splatting. We first model a decoupled screw motion for each joint without type prior, and jointly optimize part-aware Gaussians with joint parameters through part motion blending. To introduce time-continuous geometric constraint for articulated modeling, we encourage Gaussians to be planar and propose a temporally consistent regularization between planar normal and depth through Taylor first-order expansion. Extensive experiments on both synthetic and real-world articulated objects demonstrate our superiority in generalizable part-level surface reconstruction and joint parameter estimation, compared to existing approaches. Project Site: https://sites.google.com/view/reartgs2/home.
铰接物体在日常环境中随处可见,例如抽屉和冰箱。针对其部件级表面重建和关节参数估计,REArtGS [wu2025reartgs] 提出了一种类别无关的方法,利用物体在两种不同状态下的多视角RGB图像。然而,我们观察到REArtGS在处理螺旋关节或多部件物体时仍然存在困难,并且缺乏针对未见状态的几何约束。在本文中,我们提出了REArtGS++,这是一种借助时间几何约束和平面高斯泼溅实现可泛化铰接物体重建的新方法。我们首先在不依赖关节类型先验的情况下,为每个关节建模解耦的螺旋运动,并通过部件运动混合联合优化部件感知的高斯与关节参数。为了在铰接建模中引入时间连续的几何约束,我们鼓励高斯趋于平面化,并通过泰勒一阶展开提出平面法线与深度之间的时间一致性正则化。在合成和真实世界铰接物体上的大量实验表明,与现有方法相比,我们在可泛化的部件级表面重建和关节参数估计方面更具优势。项目网站地址为:https://sites.google.com/view/reartgs2/home。
论文及项目相关链接
PDF 10 pages, 7 figures
Summary
针对日常环境中的铰接物体(如抽屉、冰箱),REArtGS方法可进行部件级表面重建和关节参数估计。然而,REArtGS在处理螺旋关节或多部件物体时仍有困难,并且缺乏对未见状态的几何约束。本研究提出REArtGS++方法,通过时间连续的几何约束和平面高斯泼溅技术,实现更可泛化的铰接物体重建。该方法在不依赖类型先验的情况下对每个关节的螺旋运动进行解耦建模,并通过部件运动混合联合优化部件感知的高斯与关节参数。此外,引入时间连续的几何约束用于铰接建模,鼓励高斯平面化,并通过泰勒一阶展开提出平面法线和深度之间的时间一致性正则化。在合成和真实世界铰接物体上的广泛实验证明,与现有方法相比,REArtGS++在可泛化的部件级表面重建和关节参数估计方面表现优越。
Key Takeaways
- REArtGS++是一种针对铰接物体的新型重建方法,用于部件级表面重建和关节参数估计。
- 该方法通过时间连续的几何约束和平面高斯泼溅技术实现更可泛化的铰接物体重建。
- 在不依赖关节类型先验的情况下对关节进行解耦建模,并通过部件运动混合联合优化部件感知的高斯与关节参数。
- 引入时间连续的几何约束用于铰接建模,鼓励高斯平面化。
- 通过泰勒一阶展开提出平面法线和深度之间的时间一致性正则化。
- 在合成和真实世界铰接物体上的实验证明了REArtGS++的优越性。
点此查看论文截图
MedImageInsight for Thoracic Cavity Health Classification from Chest X-rays
Authors:Rama Krishna Boya, Mohan Kireeti Magalanadu, Azaruddin Palavalli, Rupa Ganesh Tekuri, Amrit Pattanayak, Prasanthi Enuga, Vignesh Esakki Muthu, Vivek Aditya Boya
Chest radiography remains one of the most widely used imaging modalities for thoracic diagnosis, yet increasing imaging volumes and radiologist workload continue to challenge timely interpretation. In this work, we investigate the use of MedImageInsight, a medical imaging foundational model, for automated binary classification of chest X-rays into Normal and Abnormal categories. Two approaches were evaluated: (1) fine-tuning MedImageInsight for end-to-end classification, and (2) employing the model as a feature extractor for a transfer learning pipeline using traditional machine learning classifiers. Experiments were conducted using a combination of the ChestX-ray14 dataset and real-world clinical data sourced from partner hospitals. The fine-tuned classifier achieved the highest performance, with an ROC-AUC of 0.888 and superior calibration compared to the transfer learning models, demonstrating performance comparable to established architectures such as CheXNet. These results highlight the effectiveness of foundational medical imaging models in reducing task-specific training requirements while maintaining diagnostic reliability. The system is designed for integration into web-based and hospital PACS workflows to support triage and reduce radiologist burden. Future work will extend the model to multi-label pathology classification to provide preliminary diagnostic interpretation in clinical environments.
胸部放射学检查仍然是胸部诊断中最常用的成像方式之一,然而,日益增长的成像量和放射科医生的工作量继续对及时解读构成挑战。在这项工作中,我们研究了MedImageInsight这一医学成像基础模型在将胸部X射线自动二分类为正常和异常类别中的应用。我们评估了两种方案:(1)对MedImageInsight进行微调,以实现端到端的分类;(2)将该模型用作特征提取器,利用传统机器学习分类器构建迁移学习管道。实验结合了ChestX-ray14数据集和来自合作医院的真实世界临床数据。经过微调后的分类器性能最高,ROC-AUC达到0.888,并且与迁移学习模型相比具有更好的校准效果,显示出与CheXNet等已建立架构相当的性能。这些结果突出了基础医学成像模型在减少特定任务培训要求的同时保持诊断可靠性的有效性。该系统旨在集成到基于Web和医院PACS工作流程中,以支持初步筛选并减轻放射科医生的工作负担。未来的工作将扩展模型进行多标签病理分类,以在临床环境中提供初步诊断解读。
论文及项目相关链接
PDF 9 pages, 5 figures and 3 tables
Summary
本文研究了MedImageInsight医学影像基础模型在胸部X射线图像自动二分类(正常与异常)中的应用。实验采用ChestX-ray14数据集和来自合作医院的真实临床数据,通过微调MedImageInsight模型进行端到端的分类,并评估了其性能。结果显示,微调分类器的性能最高,ROC-AUC达到0.888,与CheXNet等已建立的架构相比具有相当的校准性能。该系统的目的是集成到基于Web的医院PACS工作流程中,以支持筛选工作并减轻放射科医师的负担。未来工作将扩展模型以进行多标签病理分类,为临床环境中提供初步诊断解释。
Key Takeaways
- MedImageInsight模型用于胸部X射线图像的二分类(正常与异常)。
- 通过微调MedImageInsight模型进行端到端的分类实验,并使用ChestX-ray14数据集和真实临床数据评估性能。
- 微调分类器性能最高,ROC-AUC达到0.888,具有良好的校准性能。
- 该系统可集成到Web和医院PACS工作流程中,以支持筛选工作并减轻放射科医师的负担。
- 与已建立的架构(如CheXNet)相比,该模型表现出相当的性能。
- 未来工作将扩展模型以进行多标签病理分类。
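论文对比了两条路线:端到端微调,以及把基础模型当作冻结的特征提取器、再接传统机器学习分类器的迁移学习管线。后者的通用写法大致如下(此处以 scikit-learn 的逻辑回归为例,特征默认已由某个冻结的影像编码器提取,均为示意):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_probe(train_feats: np.ndarray, train_labels: np.ndarray,
                val_feats: np.ndarray, val_labels: np.ndarray):
    """
    train_feats / val_feats: (N, D) 由冻结的基础模型提取的影像特征
    labels: 0 = Normal, 1 = Abnormal
    """
    clf = LogisticRegression(max_iter=1000)        # 也可换成其他传统机器学习分类器
    clf.fit(train_feats, train_labels)
    probs = clf.predict_proba(val_feats)[:, 1]
    return clf, roc_auc_score(val_labels, probs)   # 返回分类器与验证集 ROC-AUC
```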
点此查看论文截图
Flow-Guided Implicit Neural Representation for Motion-Aware Dynamic MRI Reconstruction
Authors:Baoqing Li, Yuanyuan Liu, Congcong Liu, Qingyong Zhu, Jing Cheng, Yihang Zhou, Hao Chen, Zhuo-Xu Cui, Dong Liang
Dynamic magnetic resonance imaging (dMRI) captures temporally-resolved anatomy but is often challenged by limited sampling and motion-induced artifacts. Conventional motion-compensated reconstructions typically rely on pre-estimated optical flow, which is inaccurate under undersampling and degrades reconstruction quality. In this work, we propose a novel implicit neural representation (INR) framework that jointly models both the dynamic image sequence and its underlying motion field. Specifically, one INR is employed to parameterize the spatiotemporal image content, while another INR represents the optical flow. The two are coupled via the optical flow equation, which serves as a physics-inspired regularization, in addition to a data consistency loss that enforces agreement with k-space measurements. This joint optimization enables simultaneous recovery of temporally coherent images and motion fields without requiring prior flow estimation. Experiments on dynamic cardiac MRI datasets demonstrate that the proposed method outperforms state-of-the-art motion-compensated and deep learning approaches, achieving superior reconstruction quality, accurate motion estimation, and improved temporal fidelity. These results highlight the potential of implicit joint modeling with flow-regularized constraints for advancing dMRI reconstruction.
动态磁共振成像(dMRI)能够捕捉时间分辨的解剖结构,但常常受到采样限制和运动引起的伪影的挑战。传统的运动补偿重建通常依赖于预先估计的光流,在欠采样情况下光流估计不准确,会降低重建质量。在这项工作中,我们提出了一种新型隐式神经表示(INR)框架,该框架同时对动态图像序列和其底层运动场进行建模。具体来说,一个INR被用来参数化时空图像内容,另一个INR则表示光流。两者通过光流方程耦合,光流方程作为一种物理启发式的正则化方法,再加上数据一致性损失,强制与k空间测量值保持一致。这种联合优化能够同时恢复时间连贯的图像和运动场,而无需事先进行光流估计。在动态心脏MRI数据集上的实验表明,所提出的方法优于最先进的运动补偿和深度学习方法,实现了优质的重建、精确的运动估计和时间保真度的提高。这些结果突出了隐式联合建模与流正则化约束的潜力,有助于推动dMRI重建的发展。
论文及项目相关链接
PDF 10 pages, 7 figures
Summary
本文提出了一种基于隐神经表示(INR)框架的动态磁共振成像(dMRI)重建方法,该方法联合建模动态图像序列及其底层运动场。通过采用两个INR分别参数化时空图像内容和光流,并通过光流方程进行耦合,实现了无需预先估计光流的运动补偿重建。实验结果表明,该方法在动态心脏MRI数据集上的表现优于现有的运动补偿和深度学习方法,具有更高的重建质量、准确的运动估计和更好的时间保真度。
Key Takeaways
- 隐神经表示(INR)框架被用于动态磁共振成像(dMRI)的重建。
- 该方法联合建模动态图像序列及其底层运动场。
- 通过光流方程耦合两个INR,实现了运动补偿重建,无需预先估计光流。
- 实验结果表明,该方法在重建质量、运动估计和时间保真度方面优于现有方法。
- 该方法能够捕捉时序解析的解剖学信息,并解决了有限的采样和运动引起的伪影问题。
- 提出的框架具有潜在的应用价值,可推动dMRI重建技术的发展。
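该方法用两个 INR 分别表示时空图像 I(x, y, t) 与光流 v(x, y, t),并以光流方程 ∂I/∂t + v·∇I = 0 作为物理正则项将两者耦合。下面是这一正则项的一个极简 PyTorch 自动微分示意(网络结构与超参数均为假设,实际重建还需叠加与 k 空间测量一致的数据保真项):

```python
import torch
import torch.nn as nn

def make_inr(in_dim: int, out_dim: int, hidden: int = 128) -> nn.Module:
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

image_inr = make_inr(3, 1)   # (x, y, t) -> 图像强度 I
flow_inr = make_inr(3, 2)    # (x, y, t) -> 光流 (vx, vy)

def optical_flow_regularizer(coords: torch.Tensor) -> torch.Tensor:
    """coords: (N, 3) 随机采样的时空坐标;返回光流方程残差的均方值。"""
    coords = coords.clone().requires_grad_(True)
    intensity = image_inr(coords)
    grads = torch.autograd.grad(intensity.sum(), coords, create_graph=True)[0]  # (N,3): [dI/dx, dI/dy, dI/dt]
    v = flow_inr(coords)
    residual = grads[:, 2] + (v * grads[:, :2]).sum(dim=1)   # dI/dt + vx*dI/dx + vy*dI/dy
    return (residual ** 2).mean()

loss_flow = optical_flow_regularizer(torch.rand(1024, 3))
```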
点此查看论文截图
Automated Interpretable 2D Video Extraction from 3D Echocardiography
Authors:Milos Vukadinovic, Hirotaka Ieki, Yuki Sahashi, David Ouyang, Bryan He
Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos https://github.com/echonet/3d-echo .
尽管心脏具有复杂的三维(3D)解剖结构,但传统的医学成像方法(如心脏超声)依赖于一系列二维视频来显示单个心脏结构。三维超声心动图是近年来发展起来的一种成像方式,目前已能提供足够的图像质量供临床使用,并具有简化采集和改进离轴特征的评估的潜力。我们提出了一种从三维心脏超声体积中自动选择标准二维视图的方法,使医生能够以他们通常的格式解释数据,同时受益于三维扫描的速度和实用性。通过应用深度学习视图分类器和基于解剖标志的下游启发式方法以及心脏病专家提供的启发式方法,我们重建了标准超声心动图视图。这种方法经过三位心脏病专家的盲评验证(在来自两家医院的1600个视频中准确率为96%)。下游二维视频在检测心脏异常方面的能力也得到了验证,使用了人工智能超声心动图模型(EchoPrime和PanEcho),并展示了生成临床级心脏解剖测量的能力(EchoNet-Measurement)。我们证明,提取的二维视频保留了空间校准和诊断特征,使临床医生能够从三维体积中获得准确的实际世界解读。我们公开了代码和包含29个三维超声心动图视频的公开数据集:https://github.com/echonet/3d-echo。
论文及项目相关链接
PDF 12 pages, 5 figures
Summary
本文介绍了一种从3D心脏超声体积中自动选择标准2D视图的方法,使医生能够在保持其习惯的解读格式的同时,受益于3D扫描的速度和实用性。该方法结合了深度学习视图分类器、基于解剖标志的下游启发式算法以及心脏病专家提供的启发式规则,重建出标准的超声心动图视图。经过三位心脏病专家的盲评验证,该方法的准确率达到了96%。此外,该方法还能有效检测心脏异常并生成临床级的心脏解剖测量数据。
Key Takeaways
- 3D超声成像在医学领域正在逐渐普及,提供了更高质量的图像,有望简化获取过程并改善非轴向特征的评估。
- 一种新方法可以从3D心脏超声体积中自动选择标准2D视图,结合了深度学习和基于解剖标志的启发式算法。
- 该方法允许医生以习惯的方式解读数据,同时享受3D扫描带来的速度和便利性。
- 该方法经过大量实验验证,准确率为96%,并且在检测心脏异常和生成心脏解剖测量数据方面表现出良好的性能。
- 释放的代码和数据集将有助于其他研究者进一步开发和改进该方法。
- 提取的2D视频保留了空间校准和诊断特征,使临床医生能够从3D体积中获得准确的实际世界解读。
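从 3D 体数据中抽取任意方位的 2D 切面,本质上是在由平面中心和法向量确定的平面上做三线性采样。下面给出一个通用的切面采样示意;它只说明"从 3D 体中取出 2D 视图"这一步,与论文中的视图分类器和解剖标志启发式无关,参数均为假设:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def extract_oblique_slice(volume, center, normal, size=224, spacing=1.0):
    """volume: (Z, Y, X) 3D 体数据; center/normal: 平面中心与法向量(体素坐标系); 返回 (size, size) 切面。"""
    n = np.asarray(normal, float)
    n /= np.linalg.norm(n)
    ref = np.array([1.0, 0.0, 0.0]) if abs(n[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    u = np.cross(n, ref); u /= np.linalg.norm(u)     # 平面内第一条基向量
    v = np.cross(n, u)                               # 平面内第二条基向量
    grid = (np.arange(size) - size / 2) * spacing
    uu, vv = np.meshgrid(grid, grid, indexing="ij")
    coords = (np.asarray(center, float)[:, None, None]
              + u[:, None, None] * uu + v[:, None, None] * vv)
    return map_coordinates(volume, coords, order=1)  # 三线性插值采样
```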
点此查看论文截图
SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting
Authors:Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang, Zongqian Zhan
Lightweight building surface models are crucial for digital city, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website:https://lzh282140127-cell.github.io/SF-Recon-project/
轻量级建筑表面模型对于数字城市、导航和快速地理空间分析至关重要。然而,传统的多视角几何流水线由于依赖密集重建、网格化和后续简化,仍然繁琐且对质量敏感。本研究提出了SF-Recon方法,能够从多视角图像直接重建轻量级建筑表面,无需事后的网格简化。我们首先训练一个初始的3D高斯喷溅(3DGS)场,以获得视角一致的表示。然后,通过法线梯度引导的高斯优化提炼建筑结构,选择与屋顶和墙面边界对齐的高斯基元,随后通过多视角边缘一致性修剪增强结构锐度并抑制非结构伪影,且无需外部监督。最后,基于多视角深度约束的Delaunay三角剖分将结构化高斯场转换为轻量级、结构真实的建筑网格。基于所提出的SF数据集,实验结果表明,我们的SF-Recon方法能够从多视角图像直接重建轻量级建筑模型,在保持计算效率的同时大幅减少面和顶点的数量。网站地址:https://lzh282140127-cell.github.io/SF-Recon-project/
论文及项目相关链接
PDF This paper has been submitted to the 2026 ISPRS Congress
Summary
本文介绍了一种名为SF-Recon的方法,能够从多视角图像直接重建轻量级建筑表面模型,无需后续网格简化。该方法通过训练初始3D高斯喷溅场获得视角一致的表示,通过法线梯度引导的高斯优化和基于多视角边缘一致性的修剪技术来提炼建筑结构,最终将结构化高斯场转化为轻量级、结构真实的建筑网格。实验结果表明,SF-Recon方法能够实现从多视角图像直接重建轻量级建筑模型,在保持计算效率的同时大大减少面和顶点的数量。
Key Takeaways
- SF-Recon方法直接从多视角图像重建轻量级建筑表面模型,无需后续网格简化。
- 通过训练初始3D高斯喷溅场获得视角一致的表示。
- 采用法线梯度引导的高斯优化和基于多视角边缘一致性的修剪技术来提炼建筑结构。
- 多视角深度约束Delaunay三角剖分将结构化高斯场转化为轻量级、结构真实的建筑网格。
- 方法能够实现计算效率高,面的数量和顶点的数量大大减少。
- 提出的SF数据集用于实验验证,证明SF-Recon方法的有效性。
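SF-Recon 的最后一步利用多视角深度约束的 Delaunay 三角剖分把结构化高斯场转成轻量网格。下面仅示意"在参考视角下对投影点做 2D Delaunay 剖分,再按深度反投影回 3D 形成网格"的基本流程;不含多视角深度约束,相机内参等均为假设:

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulate_with_depth(pts2d: np.ndarray, depth: np.ndarray, K: np.ndarray):
    """
    pts2d: (N, 2) 高斯中心在参考视角下的像素坐标; depth: (N,) 对应深度; K: (3, 3) 相机内参。
    返回 (verts, faces): 3D 顶点与三角面片索引。
    """
    tri = Delaunay(pts2d)                                    # 2D Delaunay 三角剖分
    homo = np.hstack([pts2d, np.ones((pts2d.shape[0], 1))])  # 齐次像素坐标
    rays = (np.linalg.inv(K) @ homo.T).T                     # 反投影射线方向
    verts = rays * depth[:, None]                            # 按深度恢复 3D 顶点
    return verts, tri.simplices
```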
点此查看论文截图
Model Inversion Attack Against Deep Hashing
Authors:Dongdong Zhao, Qiben Xu, Ranxin Fang, Baogang Song
Deep hashing improves retrieval efficiency through compact binary codes, yet it introduces severe and often overlooked privacy risks. The ability to reconstruct original training data from hash codes could lead to serious threats such as biometric forgery and privacy breaches. However, model inversion attacks specifically targeting deep hashing models remain unexplored, leaving their security implications unexamined. This research gap stems from the inaccessibility of genuine training hash codes and the highly discrete Hamming space, which prevents existing methods from adapting to deep hashing. To address these challenges, we propose DHMI, the first diffusion-based model inversion framework designed for deep hashing. DHMI first clusters an auxiliary dataset to derive semantic hash centers as surrogate anchors. It then introduces a surrogate-guided denoising optimization method that leverages a novel attack metric (fusing classification consistency and hash proximity) to dynamically select candidate samples. A cluster of surrogate models guides the refinement of these candidates, ensuring the generation of high-fidelity and semantically consistent images. Experiments on multiple datasets demonstrate that DHMI successfully reconstructs high-resolution, high-quality images even under the most challenging black-box setting, where no training hash codes are available. Our method outperforms the existing state-of-the-art model inversion attacks in black-box scenarios, confirming both its practical efficacy and the critical privacy risks inherent in deep hashing systems.
深度哈希通过紧凑的二进制码提高了检索效率,但同时也引入了严重且常被忽视的隐私风险。从哈希码重建原始训练数据的能力可能导致生物特征伪造和隐私泄露等严重威胁。然而,专门针对深度哈希模型的模型逆向攻击仍未被探索,其安全影响也未被检验。这一研究空白源于真实训练哈希码的不可获取以及高度离散的汉明空间,使得现有方法难以适配深度哈希。为了解决这些挑战,我们提出了DHMI,这是第一个针对深度哈希设计的基于扩散的模型逆向框架。DHMI首先对一个辅助数据集进行聚类,以得到语义哈希中心作为替代锚点。然后,它引入了一种替代引导的去噪优化方法,利用一种新型攻击度量(融合分类一致性和哈希接近度)来动态选择候选样本。一组替代模型引导这些候选样本的细化,确保生成高保真且语义一致的图像。在多个数据集上的实验表明,即使在无法获得任何训练哈希码的最具挑战性的黑箱设置下,DHMI也能成功重建高分辨率、高质量的图像。我们的方法在黑箱场景中优于现有最先进的模型逆向攻击,证实了其实际有效性以及深度哈希系统固有的关键隐私风险。
论文及项目相关链接
Summary
深度哈希技术通过紧凑的二进制编码提高了检索效率,但同时也带来了严重且常被忽视的隐私风险:从哈希码重建原始训练数据的能力可能导致生物特征伪造和隐私泄露等严重威胁。然而,专门针对深度哈希模型的模型逆向攻击尚未被研究;由于真实训练哈希码不可获取且汉明空间高度离散,现有方法难以直接适用于深度哈希。为此,本文提出DHMI,首个针对深度哈希的基于扩散的模型逆向框架。DHMI通过聚类辅助数据集得到语义哈希中心作为替代锚点,并引入替代引导的去噪优化方法,利用新的攻击度量(融合分类一致性和哈希接近度)动态选择候选样本。实验证明,即使在无法获得训练哈希码的黑箱设置下,DHMI也能成功重建高分辨率、高质量的图像,优于现有最先进的模型逆向攻击方法。
Key Takeaways
- 深度哈希技术虽提高检索效率,但存在严重的隐私风险,能从哈希码重建原始训练数据。
- 生物特征伪造和隐私泄露是深度哈希技术的潜在威胁。
- 针对深度哈希模型的模型逆向攻击尚未得到充分研究。
- 现有方法难以适应深度哈希的安全挑战。
- DHMI是首个针对深度哈希的扩散式模型逆向框架。
- DHMI通过聚类辅助数据集和引入替代引导去噪优化方法来提高攻击效果。
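DHMI 用一个融合"分类一致性"与"哈希接近度"的攻击度量来动态筛选候选样本。论文摘要未给出具体公式,下面是按这一描述写的一个假设性示意:接近度用与目标哈希中心的归一化汉明距离衡量,一致性用多个替代模型预测目标类别的比例衡量:

```python
import torch

def attack_score(binary_code: torch.Tensor, target_code: torch.Tensor,
                 surrogate_preds: torch.Tensor, target_class: int, lam: float = 0.5):
    """
    binary_code, target_code: (K,) 取值为 ±1 的哈希码(候选样本与目标哈希中心)
    surrogate_preds: (M,) M 个替代模型对候选样本的预测类别
    分数越高表示候选样本越值得保留(假设性度量,并非论文原始定义)。
    """
    hamming = (binary_code != target_code).float().mean()            # 归一化汉明距离,越小越接近
    consistency = (surrogate_preds == target_class).float().mean()   # 分类一致性
    return lam * consistency + (1 - lam) * (1 - hamming)
```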
点此查看论文截图
Improving the Performance of Radiology Report De-identification with Large-Scale Training and Benchmarking Against Cloud Vendor Methods
Authors:Eva Prakash, Maayane Attias, Pierre Chambon, Justin Xu, Steven Truong, Jean-Benoit Delbrouck, Tessa Cook, Curtis Langlotz
Objective: To enhance automated de-identification of radiology reports by scaling transformer-based models through extensive training datasets and benchmarking performance against commercial cloud vendor systems for protected health information (PHI) detection. Materials and Methods: In this retrospective study, we built upon a state-of-the-art, transformer-based, PHI de-identification pipeline by fine-tuning on two large annotated radiology corpora from Stanford University, encompassing chest X-ray, chest CT, abdomen/pelvis CT, and brain MR reports and introducing an additional PHI category (AGE) into the architecture. Model performance was evaluated on test sets from Stanford and the University of Pennsylvania (Penn) for token-level PHI detection. We further assessed (1) the stability of synthetic PHI generation using a “hide-in-plain-sight” method and (2) performance against commercial systems. Precision, recall, and F1 scores were computed across all PHI categories. Results: Our model achieved overall F1 scores of 0.973 on the Penn dataset and 0.996 on the Stanford dataset, outperforming or maintaining the previous state-of-the-art model performance. Synthetic PHI evaluation showed consistent detectability (overall F1: 0.959 [0.958-0.960]) across 50 independently de-identified Penn datasets. Our model outperformed all vendor systems on synthetic Penn reports (overall F1: 0.960 vs. 0.632-0.754). Discussion: Large-scale, multimodal training improved cross-institutional generalization and robustness. Synthetic PHI generation preserved data utility while ensuring privacy. Conclusion: A transformer-based de-identification model trained on diverse radiology datasets outperforms prior academic and commercial systems in PHI detection and establishes a new benchmark for secure clinical text processing.
目标:通过在大规模训练数据集上扩展基于Transformer的模型来增强放射学报告的自动化去标识化,并针对受保护健康信息(PHI)检测,与商业云供应商系统进行性能基准对比。材料与方法:在这项回顾性研究中,我们在最先进的、基于Transformer的PHI去标识化流程基础上,使用来自斯坦福大学的两个大型带注释放射学语料库(涵盖胸部X射线、胸部CT、腹部/骨盆CT和脑部MR报告)进行微调,并在架构中引入了一个额外的PHI类别(AGE,年龄)。模型性能在斯坦福大学和宾夕法尼亚大学(宾大)的测试集上按token级PHI检测进行评估。我们进一步评估了(1)使用"隐藏于显眼处(hide-in-plain-sight)"方法生成合成PHI的稳定性,以及(2)与商业系统相比的性能。针对所有PHI类别计算了精确率、召回率和F1分数。结果:我们的模型在宾大数据集上的总体F1分数为0.973,在斯坦福数据集上为0.996,优于或保持了此前最先进模型的性能。合成PHI评估显示,在50个独立去标识化的宾大数据集上检测能力保持一致(总体F1:0.959 [0.958-0.960])。我们的模型在合成的宾大报告上优于所有供应商系统(总体F1:0.960,对比0.632-0.754)。讨论:大规模、多模态训练提高了跨机构的泛化能力和稳健性。合成PHI生成在确保隐私的同时保留了数据效用。结论:在多样化放射学数据集上训练的基于Transformer的去标识化模型,在PHI检测方面优于此前的学术和商业系统,并为安全的临床文本处理建立了新的基准。
论文及项目相关链接
PDF In submission to JAMIA
Summary
基于大规模多模态数据集训练的Transformer模型在放射学报告自动去标识方面表现出优异性能,在PHI检测方面建立了新的基准,超越了先前的学术和商用系统。
Key Takeaways
- 研究目标:通过大规模训练数据集扩展基于Transformer的模型,提高放射学报告的自动化去标识性能,并与商业云供应商系统进行PHI检测性能对比。
- 方法:基于斯坦福大学的两个大型带注释放射学语料库(包含胸部X光、胸部CT、腹部/骨盆CT和脑部MR报告),对最先进的基于Transformer的PHI去标识流程进行了微调,并将新的PHI类别(AGE,年龄)纳入架构。在斯坦福大学和宾夕法尼亚大学的测试集上评估了模型性能,同时评估了合成PHI生成的稳定性以及与商业系统的性能对比。