
Medical Imaging


⚠️ All of the summaries below are generated by a large language model and may contain errors; they are provided for reference only and should be used with caution.
🔴 Please note: never use these summaries in serious academic settings; they are only intended as a first-pass screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated on 2025-11-22

NoPo-Avatar: Generalizable and Animatable Avatars from Sparse Inputs without Human Poses

Authors:Jing Wen, Alexander G. Schwing, Shenlong Wang

We tackle the task of recovering an animatable 3D human avatar from a single or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate “ground-truth” camera poses and human poses as input to guide reconstruction at test-time. We show that pose-dependent reconstruction degrades results significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is not affected by noisy human pose estimates, making it more widely applicable. Experiments on challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

Paper & Project Links

PDF NeurIPS’25; project page: https://wenj.github.io/NoPo-Avatar/

Summary

This paper addresses the task of recovering an animatable 3D human avatar from a single image or a sparse set of images. For this task, many prior methods use accurate "ground-truth" camera poses and human poses, in addition to the images, to guide reconstruction at test time. The paper shows that pose-dependent reconstruction degrades significantly when pose estimates are noisy. To overcome this, NoPo-Avatar is introduced, which reconstructs avatars solely from images without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is unaffected by noisy pose estimates and is therefore more widely applicable. Experiments on the challenging THuman2.0, XHuman, and HuGe100K datasets show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).

Key Takeaways

  1. The paper tackles recovering an animatable 3D human avatar from a single image or a sparse set of images.
  2. Existing methods rely on accurate camera and human poses as input, but noisy pose estimates significantly degrade their results.
  3. NoPo-Avatar reconstructs avatars from images alone, without any pose input, which sidesteps the problem of noisy pose estimates.
  4. NoPo-Avatar outperforms existing baselines in practical settings and performs comparably in lab settings.
  5. Removing the dependence of test-time reconstruction on human poses makes NoPo-Avatar more widely applicable.
  6. Experiments were conducted on the THuman2.0, XHuman, and HuGe100K datasets.


SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking

Authors:Haofeng Liu, Ziyue Wang, Sudhanshu Mishra, Mingqi Gao, Guanyi Qin, Chang Han Low, Alex Y. W. Kong, Yueming Jin

Surgical video segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. To address these limitations, we construct SA-SV, the largest surgical iVOS benchmark with instance-level spatio-temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets), enabling comprehensive development and evaluation for long-term tracking and zero-shot generalization. Building on SA-SV, we propose SAM2S, a foundation model enhancing \textbf{SAM2} for \textbf{S}urgical iVOS through: (1) DiveMem, a trainable diverse memory mechanism for robust long-term tracking; (2) temporal semantic learning for instrument understanding; and (3) ambiguity-resilient learning to mitigate annotation inconsistencies across multi-source datasets. Extensive experiments demonstrate that fine-tuning on SA-SV enables substantial performance gains, with SAM2 improving by 12.99 average $\mathcal{J}$&$\mathcal{F}$ over vanilla SAM2. SAM2S further advances performance to 80.42 average $\mathcal{J}$&$\mathcal{F}$, surpassing vanilla and fine-tuned SAM2 by 17.10 and 4.11 points respectively, while maintaining 68 FPS real-time inference and strong zero-shot generalization. Code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.
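
The J&F scores quoted above average two standard video-segmentation measures: region similarity J (the Jaccard index, i.e. IoU of predicted vs. ground-truth masks) and boundary accuracy F (an F-measure over boundary pixels within a small tolerance). Below is a minimal sketch of how such a per-frame score could be computed; the 2-pixel tolerance and the use of scipy for boundary extraction are illustrative assumptions, not the benchmark's exact protocol.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def jaccard(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity J: intersection-over-union of two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return 1.0 if union == 0 else inter / union

def boundary_f(pred: np.ndarray, gt: np.ndarray, tol: int = 2) -> float:
    """Boundary F-measure: precision/recall of boundary pixels within `tol` pixels."""
    def boundary(mask):
        return mask & ~binary_erosion(mask)
    bp, bg = boundary(pred), boundary(gt)
    if bp.sum() == 0 and bg.sum() == 0:
        return 1.0
    struct = np.ones((2 * tol + 1, 2 * tol + 1), dtype=bool)
    prec = (bp & binary_dilation(bg, struct)).sum() / max(bp.sum(), 1)
    rec = (bg & binary_dilation(bp, struct)).sum() / max(bg.sum(), 1)
    return 0.0 if prec + rec == 0 else 2 * prec * rec / (prec + rec)

def j_and_f(pred, gt):
    return 0.5 * (jaccard(pred, gt) + boundary_f(pred, gt))

# toy example on a single 64x64 frame
gt = np.zeros((64, 64), dtype=bool); gt[16:48, 16:48] = True
pred = np.zeros_like(gt); pred[18:50, 18:50] = True
print(round(100 * j_and_f(pred, gt), 2))
```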

Paper & Project Links

PDF 11 pages, 4 figures

Summary

This paper applies interactive video object segmentation (iVOS) to surgical video. It introduces SA-SV, a large surgical iVOS benchmark, and proposes SAM2S, a new model designed to address the challenges of the surgical domain. By introducing the DiveMem mechanism, temporal semantic learning, and ambiguity-resilient learning, SAM2S enables accurate long-term tracking and recognition of instruments and tissues in surgical videos. Experiments show that SAM2S substantially improves performance on SA-SV while running in real time and generalizing well in zero-shot settings. The code and dataset will be released at https://jinlab-imvr.github.io/SAM2S.

Key Takeaways

  • Interactive video object segmentation (iVOS) models play an important role in surgical video segmentation, enabling precise localization and tracking of instruments and tissues.
  • The SA-SV benchmark, with instance-level spatio-temporal annotations across multiple procedure types, supports the development and evaluation of long-term tracking and zero-shot generalization for surgical iVOS.
  • SAM2S enhances SAM2 with the DiveMem mechanism, temporal semantic learning, and ambiguity-resilient learning.
  • Experiments show that fine-tuning SAM2 on SA-SV yields substantial gains, and SAM2S improves performance further while maintaining real-time inference and strong generalization.


Toward Artificial Palpation: Representation Learning of Touch on Soft Bodies

Authors:Zohar Rimon, Elisei Shafer, Tal Tepper, Efrat Shimron, Aviv Tamar

Palpation, the use of touch in medical examination, is almost exclusively performed by humans. We investigate a proof of concept for an artificial palpation method based on self-supervised learning. Our key idea is that an encoder-decoder framework can learn a $\textit{representation}$ from a sequence of tactile measurements that contains all the relevant information about the palpated object. We conjecture that such a representation can be used for downstream tasks such as tactile imaging and change detection. With enough training data, it should capture intricate patterns in the tactile measurements that go beyond a simple map of forces – the current state of the art. To validate our approach, we both develop a simulation environment and collect a real-world dataset of soft objects and corresponding ground truth images obtained by magnetic resonance imaging (MRI). We collect palpation sequences using a robot equipped with a tactile sensor, and train a model that predicts sensory readings at different positions on the object. We investigate the representation learned in this process, and demonstrate its use in imaging and change detection.
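
To make the encoder-decoder idea above concrete, here is a small PyTorch sketch: a sequence of (position, tactile reading) pairs is pooled into a latent representation, and a decoder predicts the reading at a new query position. The layer sizes, input dimensions, and mean-pooling encoder are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class PalpationAutoencoder(nn.Module):
    """Toy encoder-decoder: tactile sequence -> latent -> predicted reading at a query position."""
    def __init__(self, pos_dim=2, reading_dim=3, latent_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(pos_dim + reading_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + pos_dim, 128), nn.ReLU(),
            nn.Linear(128, reading_dim),
        )

    def forward(self, positions, readings, query_pos):
        # positions: (B, T, pos_dim), readings: (B, T, reading_dim), query_pos: (B, pos_dim)
        tokens = torch.cat([positions, readings], dim=-1)
        latent = self.encoder(tokens).mean(dim=1)          # permutation-invariant pooling
        return self.decoder(torch.cat([latent, query_pos], dim=-1))

model = PalpationAutoencoder()
pos, read, q = torch.rand(4, 32, 2), torch.rand(4, 32, 3), torch.rand(4, 2)
pred = model(pos, read, q)                                 # (4, 3) predicted sensor reading
loss = nn.functional.mse_loss(pred, torch.rand(4, 3))      # self-supervised reconstruction objective
```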

Paper & Project Links

PDF

Summary

This paper presents a proof of concept for an artificial palpation method based on self-supervised learning. An encoder-decoder framework learns, from a sequence of tactile measurements, a representation that contains the relevant information about the palpated object, and this representation can be used for downstream tasks such as tactile imaging and change detection. The approach is validated with both a simulation environment and a real-world dataset of soft objects with corresponding ground-truth images obtained by MRI. Palpation sequences are collected with a robot equipped with a tactile sensor, and a model is trained to predict sensory readings at different positions on the object.

Key Takeaways

  1. A proof of concept for an artificial palpation method based on self-supervised learning is presented.
  2. An encoder-decoder framework learns a representation from tactile measurements.
  3. The representation contains the relevant information about the palpated object and supports downstream tasks such as tactile imaging and change detection.
  4. A simulation environment is developed to model the palpation process.
  5. A real-world dataset of soft objects is collected, with ground-truth images obtained by MRI.
  6. Palpation sequences are collected with a robot equipped with a tactile sensor.
  7. A model is trained to predict sensory readings at different positions on the object, and the learned representation is analyzed.


LLaVA$^3$: Representing 3D Scenes like a Cubist Painter to Boost 3D Scene Understanding of VLMs

Authors:Doriand Petit, Steve Bourgeois, Vincent Gay-Bellile, Florian Chabot, Loïc Barthe

Developing a multi-modal language model capable of understanding 3D scenes remains challenging due to the limited availability of 3D training data, in contrast to the abundance of 2D datasets used for vision-language models (VLM). As an alternative, we introduce LLaVA$^3$ (pronounced LLaVA-Cube), a novel method that improves the 3D scene understanding capabilities of VLM using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who represented multiple viewpoints of a 3D object within a single picture, we propose to describe the 3D scene for the VLM through omnidirectional visual representations of each object. These representations are derived from an intermediate multi-view 3D reconstruction of the scene. Extensive experiments on 3D VQA and 3D language grounding show that our approach outperforms previous 2D-based VLM solutions.

Paper & Project Links

PDF Accepted at AAAI’26

Summary
Multimodal language models struggle with 3D scene understanding, largely because 3D training data is scarce. This work proposes LLaVA$^3$, which improves a VLM's 3D scene understanding using only multi-view 2D images and without any fine-tuning. Inspired by Cubist painters, who depict multiple viewpoints of a 3D object in a single picture, the method describes the 3D scene through omnidirectional visual representations of each object, derived from an intermediate multi-view 3D reconstruction. Experiments on 3D VQA and 3D language grounding show that the approach outperforms previous 2D-based VLM solutions.

Key Takeaways

  1. Multimodal language models face challenges in 3D scene understanding, mainly due to the lack of sufficient 3D training data.
  2. LLaVA$^3$ improves a language model's 3D scene understanding while relying only on multi-view 2D images.
  3. Inspired by Cubist painting, LLaVA$^3$ describes the objects in a 3D scene with omnidirectional visual representations.
  4. These representations are derived from an intermediate multi-view 3D reconstruction of the scene.
  5. On 3D VQA and 3D language grounding, LLaVA$^3$ outperforms previous 2D-based VLM solutions.
  6. LLaVA$^3$ improves the 3D understanding of existing language models without any fine-tuning.


StreetView-Waste: A Multi-Task Dataset for Urban Waste Management

Authors:Diogo J. Paulo, João Martins, Hugo Proença, João C. Neves

Urban waste management remains a critical challenge for the development of smart cities. Despite the growing number of litter detection datasets, the problem of monitoring overflowing waste containers, particularly from images captured by garbage trucks, has received little attention. While existing datasets are valuable, they often lack annotations for specific container tracking or are captured in static, decontextualized environments, limiting their utility for real-world logistics. To address this gap, we present StreetView-Waste, a comprehensive dataset of urban scenes featuring litter and waste containers. The dataset supports three key evaluation tasks: (1) waste container detection, (2) waste container tracking, and (3) waste overflow segmentation. Alongside the dataset, we provide baselines for each task by benchmarking state-of-the-art models in object detection, tracking, and segmentation. Additionally, we enhance baseline performance by proposing two complementary strategies: a heuristic-based method for improved waste container tracking and a model-agnostic framework that leverages geometric priors to refine litter segmentation. Our experimental results show that while fine-tuned object detectors achieve reasonable performance in detecting waste containers, baseline tracking methods struggle to accurately estimate their number; however, our proposed heuristics reduce the mean absolute counting error by 79.6%. Similarly, while segmenting amorphous litter is challenging, our geometry-aware strategy improves segmentation mAP@0.5 by 27% on lightweight models, demonstrating the value of multimodal inputs for this task. Ultimately, StreetView-Waste provides a challenging benchmark to encourage research into real-world perception systems for urban waste management.
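
The 79.6% figure above refers to the mean absolute error between the predicted and true number of waste containers. A short sketch of that metric and of the relative reduction is shown below; the per-sequence counts are made up for illustration.

```python
import numpy as np

def mean_absolute_counting_error(pred_counts, true_counts):
    """MAE between predicted and ground-truth container counts per sequence."""
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    return np.abs(pred - true).mean()

true = [4, 2, 6, 3, 5]                      # hypothetical per-sequence container counts
baseline = [7, 1, 10, 5, 9]                 # baseline tracker estimates (hypothetical)
heuristic = [4, 2, 7, 3, 5]                 # tracker + counting heuristics (hypothetical)

mae_base = mean_absolute_counting_error(baseline, true)
mae_heur = mean_absolute_counting_error(heuristic, true)
reduction = 100 * (mae_base - mae_heur) / mae_base
print(f"baseline MAE={mae_base:.2f}, heuristic MAE={mae_heur:.2f}, reduction={reduction:.1f}%")
```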

Paper & Project Links

PDF Accepted at WACV 2026

Summary

Urban waste management is a key challenge for smart cities, and current litter-detection datasets have notable gaps. To address them, this paper presents StreetView-Waste, a dataset supporting three evaluation tasks: waste container detection, waste container tracking, and waste overflow segmentation. Baselines are provided for all three tasks, and two complementary strategies, a heuristic-based tracking method and a geometry-prior framework, improve model performance. Experiments confirm the effectiveness of both strategies, and StreetView-Waste offers a challenging benchmark for real-world perception systems in urban waste management.

Key Takeaways

  1. Urban waste management is a major challenge for smart city development.
  2. Existing litter-detection datasets lack container-tracking annotations and are often captured in static, decontextualized settings.
  3. StreetView-Waste supports three key evaluation tasks: waste container detection, waste container tracking, and waste overflow segmentation.
  4. Baseline models are provided for all three tasks, and performance is improved through a heuristic-based tracking method and a geometry-prior framework.
  5. Experiments show the heuristics substantially reduce counting error, while the geometric priors improve litter segmentation accuracy.
  6. StreetView-Waste provides a challenging benchmark for research on real-world perception systems for urban waste management.


Beyond Visual Cues: Leveraging General Semantics as Support for Few-Shot Segmentation

Authors:Jin Wang, Bingfeng Zhang, Jian Pang, Mengyu Liu, Honglong Chen, Weifeng Liu

Few-shot segmentation (FSS) aims to segment novel classes under the guidance of limited support samples by a meta-learning paradigm. Existing methods mainly mine references from support images as meta guidance. However, due to intra-class variations among visual representations, the meta information extracted from support images cannot produce accurate guidance to segment untrained classes. In this paper, we argue that the references from support images may not be essential, the key to the support role is to provide unbiased meta guidance for both trained and untrained classes. We then introduce a Language-Driven Attribute Generalization (LDAG) architecture to utilize inherent target property language descriptions to build robust support strategy. Specifically, to obtain an unbiased support representation, we design a Multi-attribute Enhancement (MaE) module, which produces multiple detailed attribute descriptions of the target class through Large Language Models (LLMs), and then builds refined visual-text prior guidance utilizing multi-modal matching. Meanwhile, due to text-vision modal shift, attribute text struggles to promote visual feature representation, we design a Multi-modal Attribute Alignment (MaA) to achieve cross-modal interaction between attribute texts and visual feature. Experiments show that our proposed method outperforms existing approaches by a clear margin and achieves the new state-of-the art performance. The code will be released.
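
As an intuition for the visual-text prior guidance mentioned above, one can compute cosine similarity between attribute-text embeddings and per-pixel visual features to obtain prior maps. The sketch below is generic and hedged: random tensors stand in for real text/vision encoder outputs, and it is not the paper's MaE/MaA modules.

```python
import torch
import torch.nn.functional as F

# hypothetical embeddings: K attribute descriptions and a D x H x W visual feature map
text_emb = F.normalize(torch.randn(5, 256), dim=-1)        # K x D, e.g. from a text encoder
vis_feat = F.normalize(torch.randn(256, 32, 32), dim=0)    # D x H x W, e.g. from a vision backbone

# cosine similarity of every pixel with every attribute -> K prior maps in [-1, 1]
prior = torch.einsum("kd,dhw->khw", text_emb, vis_feat)
guidance = prior.max(dim=0).values                          # per-pixel best-matching attribute score
print(prior.shape, guidance.shape)                          # torch.Size([5, 32, 32]) torch.Size([32, 32])
```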

Paper & Project Links

PDF

Summary

This paper proposes a Language-Driven Attribute Generalization (LDAG) architecture that uses language descriptions of target attributes to build a robust support strategy, addressing the problem in few-shot segmentation (FSS) that meta information extracted from support images cannot accurately guide segmentation of untrained classes. A Multi-attribute Enhancement (MaE) module and multi-modal matching produce an unbiased support representation and refined visual-text prior guidance, while a Multi-modal Attribute Alignment (MaA) module handles the text-vision modality shift through cross-modal interaction between attribute texts and visual features. Experiments show the method outperforms existing approaches by a clear margin and sets a new state of the art.

Key Takeaways

  1. Few-shot segmentation (FSS) suffers from the problem that meta information extracted from support images cannot accurately guide segmentation of untrained classes.
  2. Intra-class variation makes such meta information unreliable for untrained classes.
  3. The paper argues that the key role of support is to provide unbiased meta guidance for both trained and untrained classes.
  4. The Language-Driven Attribute Generalization (LDAG) architecture uses language descriptions of target attributes to build a robust support strategy.
  5. The Multi-attribute Enhancement (MaE) module generates multiple detailed attribute descriptions of the target via large language models (LLMs) and builds refined visual-text prior guidance.
  6. The Multi-modal Attribute Alignment (MaA) module addresses the text-vision modality shift via cross-modal interaction between attribute texts and visual features.


Upsample Anything: A Simple and Hard to Beat Baseline for Feature Upsampling

Authors:Minseok Seo, Mark Hamilton, Changick Kim

We present \textbf{Upsample Anything}, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. Although Vision Foundation Models demonstrate strong generalization across diverse downstream tasks, their representations are typically downsampled by 14x/16x (e.g., ViT), which limits their direct use in pixel-level applications. Existing feature upsampling approaches depend on dataset-specific retraining or heavy implicit optimization, restricting scalability and generalization. Upsample Anything addresses these issues through a simple per-image optimization that learns an anisotropic Gaussian kernel combining spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers seamlessly across architectures and modalities, enabling precise high-resolution reconstruction of features, depth, or probability maps. It runs in only $\approx0.419 \text{s}$ per 224x224 image and achieves state-of-the-art performance on semantic segmentation, depth estimation, and both depth and probability map upsampling.
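
Since the learned kernel combines a spatial Gaussian with a range Gaussian on a guidance image in the spirit of Joint Bilateral Upsampling, a naive, loop-based sketch of that classical operation may help build intuition. The per-image optimized anisotropic kernel of Upsample Anything is not reproduced here, and the fixed sigma values and window radius are illustrative assumptions.

```python
import numpy as np

def joint_bilateral_upsample(lowres_feat, guide, sigma_s=2.0, sigma_r=0.1, radius=2):
    """Upsample `lowres_feat` (h, w, C) to the resolution of `guide` (H, W),
    weighting low-res samples by spatial and guidance-range Gaussians."""
    H, W = guide.shape[:2]
    h, w, C = lowres_feat.shape
    out = np.zeros((H, W, C))
    for y in range(H):
        for x in range(W):
            ly, lx = y * h / H, x * w / W                 # fractional low-res coordinate
            cy, cx = int(round(ly)), int(round(lx))
            num, den = np.zeros(C), 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy, xx = cy + dy, cx + dx
                    if not (0 <= yy < h and 0 <= xx < w):
                        continue
                    # spatial weight in the low-res grid, range weight on the guidance image
                    ws = np.exp(-((yy - ly) ** 2 + (xx - lx) ** 2) / (2 * sigma_s ** 2))
                    gy, gx = min(int(yy * H / h), H - 1), min(int(xx * W / w), W - 1)
                    wr = np.exp(-((guide[y, x] - guide[gy, gx]) ** 2) / (2 * sigma_r ** 2))
                    weight = ws * wr
                    num += weight * lowres_feat[yy, xx]
                    den += weight
            out[y, x] = num / max(den, 1e-8)
    return out

guide = np.random.rand(32, 32)           # high-res guidance (e.g. a grayscale image)
feat = np.random.rand(8, 8, 4)           # low-res feature map (e.g. ViT tokens)
print(joint_bilateral_upsample(feat, guide).shape)   # (32, 32, 4)
```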

Paper & Project Links

PDF 15 pages, 12 figures

Summary

This paper introduces Upsample Anything, a lightweight test-time optimization (TTO) framework that restores low-resolution features to high-resolution, pixel-wise outputs without any training. It avoids the dataset-specific retraining and heavy implicit optimization of existing feature-upsampling methods by learning, per image, an anisotropic Gaussian kernel that combines spatial and range cues, effectively bridging Gaussian Splatting and Joint Bilateral Upsampling. The learned kernel acts as a universal, edge-aware operator that transfers across architectures and modalities and enables precise high-resolution reconstruction of features, depth, or probability maps.

Key Takeaways

  1. Upsample Anything is a lightweight test-time optimization framework that restores low-resolution features to high-resolution, pixel-wise outputs.
  2. It removes the need for dataset-specific retraining or heavy implicit optimization found in existing feature-upsampling methods.
  3. It learns an anisotropic Gaussian kernel combining spatial and range cues, bridging Gaussian Splatting and Joint Bilateral Upsampling.
  4. The learned kernel is a universal, edge-aware operator that transfers across architectures and modalities.
  5. The framework enables precise high-resolution reconstruction of features, depth, or probability maps.
  6. It is fast, taking only about 0.419 s per 224x224 image.


Controllable Layer Decomposition for Reversible Multi-Layer Image Generation

Authors:Zihao Liu, Zunnan Xu, Shi Shu, Jun Zhou, Ruicheng Zhang, Zhenchao Tang, Xiu Li

This work presents Controllable Layer Decomposition (CLD), a method for achieving fine-grained and controllable multi-layer separation of raster images. In practical workflows, designers typically generate and edit each RGBA layer independently before compositing them into a final raster image. However, this process is irreversible: once composited, layer-level editing is no longer possible. Existing methods commonly rely on image matting and inpainting, but remain limited in controllability and segmentation precision. To address these challenges, we propose two key modules: LayerDecompose-DiT (LD-DiT), which decouples image elements into distinct layers and enables fine-grained control; and Multi-Layer Conditional Adapter (MLCA), which injects target image information into multi-layer tokens to achieve precise conditional generation. To enable a comprehensive evaluation, we build a new benchmark and introduce tailored evaluation metrics. Experimental results show that CLD consistently outperforms existing methods in both decomposition quality and controllability. Furthermore, the separated layers produced by CLD can be directly manipulated in commonly used design tools such as PowerPoint, highlighting its practical value and applicability in real-world creative workflows.
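
For context, the standard "over" compositing of RGBA layers that CLD seeks to invert is shown below; once layers are flattened this way, the per-layer colors and alphas can no longer be recovered from the raster alone, which is the irreversibility the paper addresses. This is a generic sketch of alpha compositing, not code from the paper.

```python
import numpy as np

def composite_over(layers):
    """Flatten a list of RGBA layers (H, W, 4), bottom first, with the 'over' operator."""
    H, W, _ = layers[0].shape
    out_rgb = np.zeros((H, W, 3))
    out_a = np.zeros((H, W, 1))
    for layer in layers:                      # paint each layer on top of the accumulator
        rgb, a = layer[..., :3], layer[..., 3:4]
        out_rgb = rgb * a + out_rgb * (1.0 - a)
        out_a = a + out_a * (1.0 - a)
    return np.concatenate([out_rgb, out_a], axis=-1)

background = np.ones((64, 64, 4))                       # opaque white layer
foreground = np.zeros((64, 64, 4))
foreground[16:48, 16:48] = [1.0, 0.0, 0.0, 0.5]         # semi-transparent red square
flat = composite_over([background, foreground])
print(flat.shape)  # (64, 64, 4); layer-level edits are no longer possible on `flat`
```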

Paper & Project Links

PDF 19 pages, 14 figures

Summary

This paper proposes Controllable Layer Decomposition (CLD), a method for fine-grained, controllable multi-layer separation of raster images. Through two key modules, LayerDecompose-DiT (LD-DiT) and the Multi-Layer Conditional Adapter (MLCA), it addresses the irreversibility of layer compositing and the limited controllability and precision of existing matting- and inpainting-based approaches. CLD decomposes an image into distinct layers with fine-grained control, and experiments show it outperforms existing methods in both decomposition quality and controllability; the separated layers can be manipulated directly in common design tools such as PowerPoint.

Key Takeaways

  • Controllable Layer Decomposition (CLD) enables fine-grained, controllable multi-layer separation of raster images.
  • The LayerDecompose-DiT (LD-DiT) module decouples image elements into distinct layers and enables fine-grained control over each layer.
  • The Multi-Layer Conditional Adapter (MLCA) injects target image information into multi-layer tokens for precise conditional generation.
  • A new benchmark and tailored evaluation metrics are introduced for comprehensive evaluation of CLD.
  • Experiments show CLD outperforms existing methods in both decomposition quality and controllability.


Video2Layout: Recall and Reconstruct Metric-Grounded Cognitive Map for Spatial Reasoning

Authors:Yibin Huang, Wang Xu, Wanyue Zhang, Helu Zhi, Jingjing Huang, Yangbin Xu, Yangang Sun, Conghui Zhu, Tiejun Zhao

Spatial intelligence is a critical frontier for Multimodal Large Language Models (MLLMs), empowering them to comprehend the physical world. Drawing inspiration from human perception mechanisms, existing studies attempt to construct a coherent spatial understanding via grid-based cognitive maps from multi-frame visual inputs. However, current grid-based map methods rely on discretized raster representations, which limit the model’s ability in fine-grained spatial reasoning. To overcome this limitation, we propose Video2Layout, a framework for reconstructing metric-grounded spatial layouts from video. The framework employs continuous object boundary coordinates to quantify inter-object physical distances and object size. This empowers the model with quantitative spatial computation capabilities, effectively alleviating the inherent ambiguity when describing spatial relationships in natural language. Specifically, our method comprises two core stages. First, in supervised fine-tuning stage, we construct a high-quality dataset from the AI2THOR simulator, which enables the model to learn the mapping from visual inputs to precise boundary coordinates. Subsequently, a reinforcement fine-tuning stage further enhances the model’s real-world generalization capabilities. To systematically evaluate the correlation between cognitive map accuracy and image quantity, as well as how the quantity of image inputs affects spatial reasoning accuracy, we introduce QVS-Bench, a diagnostic benchmark designed to analyze the relevant mechanisms. Evaluated on QVS-Bench and mainstream spatial reasoning benchmarks, our model, V2LO-7B achieves an average improvement of 4.92% over the model trained on grid maps, validating the superiority of our method. Our code is available at https://github.com/ybrrraway/Video2Layout.
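
Because the layout uses continuous, metric-grounded object boundary coordinates rather than grid cells, physical sizes and inter-object distances follow from simple arithmetic. A small sketch with made-up boxes (axis-aligned, in meters; the coordinate convention is an assumption for illustration) follows.

```python
import math

# hypothetical metric layout: axis-aligned boxes (x_min, y_min, x_max, y_max) in meters
layout = {
    "sofa":  (0.5, 1.0, 2.5, 2.0),
    "table": (3.0, 1.2, 3.8, 2.0),
}

def size(box):
    x0, y0, x1, y1 = box
    return (x1 - x0, y1 - y0)                     # width, depth in meters

def center_distance(a, b):
    (ax0, ay0, ax1, ay1), (bx0, by0, bx1, by1) = a, b
    ca = ((ax0 + ax1) / 2, (ay0 + ay1) / 2)
    cb = ((bx0 + bx1) / 2, (by0 + by1) / 2)
    return math.dist(ca, cb)

print(size(layout["sofa"]))                                        # (2.0, 1.0)
print(round(center_distance(layout["sofa"], layout["table"]), 2))  # about 1.9 m
```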

Paper & Project Links

PDF

Summary

Spatial intelligence is a challenge for multimodal large language models. Existing work builds spatial understanding with grid-based cognitive maps from multi-frame visual input, but discretized raster representations limit fine-grained spatial reasoning. Video2Layout reconstructs metric-grounded spatial layouts from video, using continuous object boundary coordinates to quantify inter-object distances and object sizes; this gives the model quantitative spatial computation abilities and reduces the ambiguity of describing spatial relations in natural language. The framework has a supervised fine-tuning stage on a high-quality dataset built with the AI2THOR simulator and a reinforcement fine-tuning stage for real-world generalization. The paper also introduces QVS-Bench to analyze how cognitive-map accuracy and the number of image inputs relate to spatial reasoning accuracy. On QVS-Bench and mainstream spatial reasoning benchmarks, the V2LO-7B model improves over a grid-map-trained model by 4.92% on average, validating the approach.

Key Takeaways

  1. Multimodal large language models (MLLMs) face spatial-intelligence challenges when reasoning about the physical world.
  2. Existing grid-based cognitive-map methods are limited by discretized raster representations, hurting fine-grained spatial reasoning.
  3. Video2Layout reconstructs metric-grounded spatial layouts with continuous object boundary coordinates, improving the model's quantitative spatial computation.
  4. The framework includes a supervised fine-tuning stage and a reinforcement fine-tuning stage to improve performance.
  5. A high-quality training dataset is constructed with the AI2THOR simulator.
  6. The QVS-Bench diagnostic benchmark evaluates how cognitive-map accuracy and the number of image inputs affect spatial reasoning accuracy.


Externally Validated Multi-Task Learning via Consistency Regularization Using Differentiable BI-RADS Features for Breast Ultrasound Tumor Segmentation

Authors:Jingru Zhang, Saed Moradi, Ashirbani Saha

Multi-task learning can suffer from destructive task interference, where jointly trained models underperform single-task baselines and limit generalization. To improve generalization performance in breast ultrasound-based tumor segmentation via multi-task learning, we propose a novel consistency regularization approach that mitigates destructive interference between segmentation and classification. The consistency regularization approach is composed of differentiable BI-RADS-inspired morphological features. We validated this approach by training all models on the BrEaST dataset (Poland) and evaluating them on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). Our comprehensive analysis demonstrates statistically significant (p<0.001) improvements in generalization for segmentation task of the proposed multi-task approach vs. the baseline one: UDIAT, BUSI, BUS-UCLM (Dice coefficient=0.81 vs 0.59, 0.66 vs 0.56, 0.69 vs 0.49, resp.). The proposed approach also achieves state-of-the-art segmentation performance under rigorous external validation on the UDIAT dataset.
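
The Dice coefficients quoted above compare predicted and ground-truth tumor masks; a minimal implementation of the metric on boolean masks is shown below for reference.

```python
import numpy as np

def dice_coefficient(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary segmentation masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt = np.zeros((128, 128), dtype=bool); gt[40:90, 40:90] = True
pred = np.zeros_like(gt); pred[45:95, 45:95] = True
print(round(dice_coefficient(pred, gt), 3))   # overlap of two shifted squares
```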

Paper & Project Links

PDF

Summary
Multi-task learning can suffer from destructive task interference, which degrades jointly trained models and limits generalization. To improve generalization in breast ultrasound tumor segmentation, the authors propose a consistency regularization approach, built from differentiable BI-RADS-inspired morphological features, that mitigates interference between segmentation and classification. All models are trained on the BrEaST dataset (Poland) and evaluated on three external datasets: UDIAT (Spain), BUSI (Egypt), and BUS-UCLM (Spain). The proposed multi-task approach yields statistically significant (p<0.001) generalization gains over the baseline, with Dice coefficients of 0.81 vs 0.59, 0.66 vs 0.56, and 0.69 vs 0.49 on UDIAT, BUSI, and BUS-UCLM respectively, and achieves state-of-the-art segmentation performance under rigorous external validation on UDIAT.

Key Takeaways

  1. Multi-task learning can suffer from destructive task interference, which hurts model generalization.
  2. A consistency regularization method is proposed to improve generalization in breast ultrasound tumor segmentation.
  3. The regularization is built from differentiable BI-RADS-inspired morphological features.
  4. Models are trained on the BrEaST dataset and evaluated on three external datasets.
  5. Compared with the baseline, the proposed method significantly improves generalization and reaches state-of-the-art segmentation performance on UDIAT.
  6. The reported improvements are statistically significant (p<0.001).


Automated Interpretable 2D Video Extraction from 3D Echocardiography

Authors:Milos Vukadinovic, Hirotaka Ieki, Yuki Sahasi, David Ouyang, Bryan He

Although the heart has complex three-dimensional (3D) anatomy, conventional medical imaging with cardiac ultrasound relies on a series of 2D videos showing individual cardiac structures. 3D echocardiography is a developing modality that now offers adequate image quality for clinical use, with potential to streamline acquisition and improve assessment of off-axis features. We propose an automated method to select standard 2D views from 3D cardiac ultrasound volumes, allowing physicians to interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. Applying a deep learning view classifier and downstream heuristics based on anatomical landmarks together with heuristics provided by cardiologists, we reconstruct standard echocardiography views. This approach was validated by three cardiologists in blinded evaluation (96% accuracy in 1,600 videos from 2 hospitals). The downstream 2D videos were also validated in their ability to detect cardiac abnormalities using AI echocardiography models (EchoPrime and PanEcho) as well as ability to generate clinical-grade measurements of cardiac anatomy (EchoNet-Measurement). We demonstrated that the extracted 2D videos preserve spatial calibration and diagnostic features, allowing clinicians to obtain accurate real-world interpretations from 3D volumes. We release the code and a dataset of 29 3D echocardiography videos https://github.com/echonet/3d-echo .
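
At the core of recovering standard 2D views from a 3D volume is resampling the volume along a plane. A minimal sketch of that resampling step with scipy is below; the landmark-driven choice of plane and the view classifier are the paper's contribution and are not reproduced here, so the plane parameters in the example are arbitrary assumptions.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def extract_plane(volume, center, u, v, size=(128, 128), spacing=1.0):
    """Sample a 2D image from `volume` on the plane through `center`
    spanned by (orthonormal) in-plane directions `u` and `v`, in voxel units."""
    u = np.asarray(u, float) / np.linalg.norm(u)
    v = np.asarray(v, float) / np.linalg.norm(v)
    rows, cols = size
    i = (np.arange(rows) - rows / 2) * spacing
    j = (np.arange(cols) - cols / 2) * spacing
    ii, jj = np.meshgrid(i, j, indexing="ij")
    # voxel coordinates of every output pixel, shape (3, rows, cols)
    coords = (np.asarray(center, float)[:, None, None]
              + u[:, None, None] * ii + v[:, None, None] * jj)
    return map_coordinates(volume, coords, order=1, mode="nearest")

volume = np.random.rand(96, 96, 96)                  # stand-in for a 3D echo volume
center, u, v = (48, 48, 48), (1, 0, 0), (0, 1, 0)    # here: a simple axial slice
print(extract_plane(volume, center, u, v).shape)     # (128, 128)
```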

Paper & Project Links

PDF 12 pages, 5 figures

Summary

This paper presents an automated method for selecting standard 2D views from 3D cardiac ultrasound volumes, letting physicians interpret the data in their usual format while benefiting from the speed and usability of 3D scanning. The approach combines a deep learning view classifier, heuristics based on anatomical landmarks, and heuristics provided by cardiologists to reconstruct standard echocardiography views. It was validated by three cardiologists in a blinded evaluation with 96% accuracy, and the extracted 2D videos were further shown to support detection of cardiac abnormalities and clinical-grade measurements of cardiac anatomy.

Key Takeaways

  1. 3D echocardiography now offers image quality adequate for clinical use, with potential to streamline acquisition and improve assessment of off-axis features.
  2. An automated method selects standard 2D views from 3D cardiac ultrasound volumes by combining deep learning with anatomical landmarks.
  3. The method was validated by cardiologists with 96% accuracy and supports clinical-grade measurements of cardiac anatomy.
  4. The downstream 2D videos were validated for detecting cardiac abnormalities with AI echocardiography models (EchoPrime and PanEcho).
  5. The extracted 2D videos preserve spatial calibration and diagnostic features, allowing accurate real-world interpretation from 3D volumes.
  6. The code and a dataset of 3D echocardiography videos are publicly released for research use.


OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

Authors:Xinli Tao, Xin Dong, Xuezhong Zhou

With the rapid expansion of unstructured clinical texts in electronic health records (EHRs), clinical named entity recognition (NER) has become a crucial technique for extracting medical information. However, traditional supervised models such as CRF and BioClinicalBERT suffer from high annotation costs. Although zero-shot NER based on large language models (LLMs) reduces the dependency on labeled data, challenges remain in aligning example selection with task granularity and effectively integrating prompt design with self-improvement frameworks. To address these limitations, we propose OEMA, a novel zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of three core components: (1) a self-annotator that autonomously generates candidate examples; (2) a discriminator that leverages SNOMED CT to filter token-level examples by clinical relevance; and (3) a predictor that incorporates entity-type descriptions to enhance inference accuracy. Experimental results on two benchmark datasets, MTSamples and VAERS, demonstrate that OEMA achieves state-of-the-art performance under exact-match evaluation. Moreover, under related-match criteria, OEMA performs comparably to the supervised BioClinicalBERT model while significantly outperforming the traditional CRF method. OEMA improves zero-shot clinical NER, achieving near-supervised performance under related-match criteria. Future work will focus on continual learning and open-domain adaptation to expand its applicability in clinical NLP.

Paper & Project Links

PDF 12 pages, 4 figures, 4 tables

Summary
To meet the need for clinical named entity recognition (NER) in electronic health records (EHRs), this paper proposes OEMA, a zero-shot clinical NER framework based on multi-agent collaboration. OEMA consists of a self-annotator that autonomously generates candidate examples, a discriminator that uses SNOMED CT to filter token-level examples by clinical relevance, and a predictor that incorporates entity-type descriptions to improve inference accuracy. Experiments show that OEMA achieves state-of-the-art results under exact-match evaluation and performs strongly under related-match criteria.

Key Takeaways

  1. Clinical named entity recognition (NER) is important for electronic health records (EHRs), but traditional supervised models such as CRF and BioClinicalBERT incur high annotation costs.
  2. Zero-shot NER based on large language models (LLMs) reduces the dependence on labeled data, but aligning example selection with task granularity and integrating prompt design with self-improvement frameworks remain challenging.
  3. OEMA is a new zero-shot clinical NER framework with three core components: a self-annotator, a discriminator, and a predictor.
  4. The self-annotator autonomously generates candidate examples, the discriminator filters them with SNOMED CT, and the predictor incorporates entity-type descriptions to improve inference accuracy.
  5. Experiments show OEMA leads under exact-match evaluation, performs comparably to the supervised BioClinicalBERT under related-match criteria, and clearly outperforms the traditional CRF method.
  6. OEMA advances zero-shot clinical NER, reaching near-supervised performance under related-match criteria.


BrainRotViT: Transformer-ResNet Hybrid for Explainable Modeling of Brain Aging from 3D sMRI

Authors:Wasif Jalal, Md Nafiu Rahman, Atif Hasan Rahman, M. Sohel Rahman

Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods face limitations such as manual feature engineering, limited receptive fields, and overfitting on heterogeneous data. Pure transformer models, while effective, require large datasets and high computational cost. We propose Brain ResNet over trained Vision Transformer (BrainRotViT), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features. The frozen encoder is then applied to all sagittal slices to generate a 2D matrix of embedding vectors, which is fed into a residual CNN regressor that incorporates subject sex at the final fully-connected layer to estimate continuous brain age. Our method achieves an MAE of 3.34 years (Pearson $r=0.98$, Spearman $ρ=0.97$, $R^2=0.95$) on validation across 11 MRI datasets encompassing more than 130 acquisition sites, outperforming baseline and state-of-the-art models. It also generalizes well across 4 independent cohorts with MAEs between 3.77 and 5.04 years. Analyses on the brain age gap (the difference between the predicted age and actual age) show that aging patterns are associated with Alzheimer’s disease, cognitive impairment, and autism spectrum disorder. Model attention maps highlight aging-associated regions of the brain, notably the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Our results demonstrate that this method provides an efficient, interpretable, and generalizable framework for brain-age prediction, bridging the gap between CNN- and transformer-based approaches while opening new avenues for aging and neurodegeneration research.
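
The evaluation quantities above (MAE, Pearson r, Spearman ρ, R²) and the brain age gap are simple to compute from predicted and chronological ages; a short sketch with synthetic numbers follows, purely to show how such figures are derived.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
age = rng.uniform(20, 90, 500)                       # chronological ages (synthetic)
pred = age + rng.normal(0, 4, 500)                   # predicted brain ages (synthetic)

mae = np.mean(np.abs(pred - age))
r, _ = pearsonr(pred, age)
rho, _ = spearmanr(pred, age)
r2 = 1 - np.sum((age - pred) ** 2) / np.sum((age - age.mean()) ** 2)
brain_age_gap = pred - age                           # positive gap = "older-looking" brain

print(f"MAE={mae:.2f} yr, r={r:.2f}, rho={rho:.2f}, R2={r2:.2f}")
print(f"mean brain age gap={brain_age_gap.mean():.2f} yr")
```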

Paper & Project Links

PDF

Summary

Accurate brain age estimation from structural MRI is a valuable biomarker for studying aging and neurodegeneration. Traditional regression and CNN-based methods suffer from manual feature engineering, limited receptive fields, and overfitting on heterogeneous data, while pure transformer models require large datasets and heavy compute. This work proposes BrainRotViT (Brain ResNet over a trained Vision Transformer), a hybrid architecture that combines the global context modeling of vision transformers (ViT) with the local refinement of residual CNNs. A ViT encoder is first trained on an auxiliary age and sex classification task to learn slice-level features; the frozen encoder is then applied to all sagittal slices to produce a matrix of embedding vectors, which a residual CNN regressor, given subject sex at the final fully connected layer, maps to a continuous brain age. The method reaches an MAE of 3.34 years (Pearson r=0.98, Spearman ρ=0.97, R²=0.95) on validation across 11 MRI datasets covering more than 130 acquisition sites, outperforming baseline and state-of-the-art models, and generalizes to four independent cohorts with MAEs between 3.77 and 5.04 years. Analyses of the brain age gap (predicted minus actual age) link aging patterns to Alzheimer's disease, cognitive impairment, and autism spectrum disorder, and attention maps highlight aging-associated regions such as the cerebellar vermis, precentral and postcentral gyri, temporal lobes, and medial superior frontal gyrus. Overall, the method offers an efficient, interpretable, and generalizable framework for brain-age prediction that bridges CNN- and transformer-based approaches.

Key Takeaways

  1. Brain age estimation from structural MRI is a valuable biomarker for aging and neurodegeneration research.
  2. Traditional regression and CNN methods have limitations, while pure transformer models require large amounts of data and compute.
  3. The hybrid BrainRotViT architecture (Brain ResNet over a trained Vision Transformer) combines ViT's global context modeling with the local refinement of residual CNNs.
  4. The method performs strongly across multiple MRI datasets, providing accurate brain age estimates.
  5. It generalizes well, achieving good results on independent cohorts.
  6. The brain age gap is associated with aging patterns seen in Alzheimer's disease, cognitive impairment, and autism spectrum disorder.


CD-DPE: Dual-Prompt Expert Network based on Convolutional Dictionary Feature Decoupling for Multi-Contrast MRI Super-Resolution

Authors:Xianming Gu, Lihui Wang, Ying Cao, Zeyu Deng, Yingfeng Ou, Guodong Hu, Yi Chen

Multi-contrast magnetic resonance imaging (MRI) super-resolution intends to reconstruct high-resolution (HR) images from low-resolution (LR) scans by leveraging structural information present in HR reference images acquired with different contrasts. This technique enhances anatomical detail and soft tissue differentiation, which is vital for early diagnosis and clinical decision-making. However, inherent contrasts disparities between modalities pose fundamental challenges in effectively utilizing reference image textures to guide target image reconstruction, often resulting in suboptimal feature integration. To address this issue, we propose a dual-prompt expert network based on a convolutional dictionary feature decoupling (CD-DPE) strategy for multi-contrast MRI super-resolution. Specifically, we introduce an iterative convolutional dictionary feature decoupling module (CD-FDM) to separate features into cross-contrast and intra-contrast components, thereby reducing redundancy and interference. To fully integrate these features, a novel dual-prompt feature fusion expert module (DP-FFEM) is proposed. This module uses a frequency prompt to guide the selection of relevant reference features for incorporation into the target image, while an adaptive routing prompt determines the optimal method for fusing reference and target features to enhance reconstruction quality. Extensive experiments on public multi-contrast MRI datasets demonstrate that CD-DPE outperforms state-of-the-art methods in reconstructing fine details. Additionally, experiments on unseen datasets demonstrated that CD-DPE exhibits strong generalization capabilities.

Paper & Project Links

PDF This paper has been accepted by AAAI, but due to the final camera-ready version not being finalized, there are still some expression errors. It will be re-published after correction

Summary
Multi-contrast MRI super-resolution reconstructs high-resolution images from low-resolution scans by exploiting structural information in HR reference images acquired with different contrasts, improving anatomical detail and soft-tissue differentiation that matter for early diagnosis and clinical decision-making. Because inherent contrast disparities between modalities make it hard to use reference textures effectively, the authors propose a dual-prompt expert network based on convolutional dictionary feature decoupling (CD-DPE). An iterative convolutional dictionary feature decoupling module (CD-FDM) separates features into cross-contrast and intra-contrast components to reduce redundancy and interference, and a dual-prompt feature fusion expert module (DP-FFEM) uses a frequency prompt to select relevant reference features and an adaptive routing prompt to choose how to fuse reference and target features. Experiments on public multi-contrast MRI datasets show CD-DPE reconstructs fine details better than state-of-the-art methods and generalizes well to unseen datasets.

Key Takeaways

  • Multi-contrast MRI super-resolution reconstructs HR images from LR scans by leveraging structural information from HR reference images acquired with different contrasts.
  • The technique improves anatomical detail and soft-tissue differentiation, aiding early diagnosis and clinical decision-making.
  • Inherent contrast disparities are the main obstacle to using reference image textures to guide target image reconstruction.
  • A dual-prompt expert network based on convolutional dictionary feature decoupling (CD-DPE) is proposed to address this.
  • The CD-FDM module separates cross-contrast and intra-contrast components, reducing redundancy and interference.
  • The DP-FFEM module uses frequency and adaptive routing prompts to optimize feature fusion and improve reconstruction quality.


TM-UNet: Token-Memory Enhanced Sequential Modeling for Efficient Medical Image Segmentation

Authors:Yaxuan Jiao, Qing Xu, Yuxiang Luo, Xiangjian He, Zhen Chen, Wenting Duan

Medical image segmentation is essential for clinical diagnosis and treatment planning. Although transformer-based methods have achieved remarkable results, their high computational cost hinders clinical deployment. To address this issue, we propose TM-UNet, a novel lightweight framework that integrates token sequence modeling with an efficient memory mechanism for efficient medical segmentation. Specifically, we introduce a multi-scale token-memory (MSTM) block that transforms 2D spatial features into token sequences through strategic spatial scanning, leveraging matrix memory cells to selectively retain and propagate discriminative contextual information across tokens. This novel token-memory mechanism acts as a dynamic knowledge store that captures long-range dependencies with linear complexity, enabling efficient global reasoning without redundant computation. Our MSTM block further incorporates exponential gating to identify token effectiveness and multi-scale contextual extraction via parallel pooling operations, enabling hierarchical representation learning without computational overhead. Extensive experiments demonstrate that TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks with substantially reduced computation cost. The code is available at https://github.com/xq141839/TM-UNet.
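
The MSTM block starts by turning a 2D feature map into token sequences via spatial scanning before the memory update. A minimal PyTorch sketch of such multi-directional scanning is shown below; the matrix memory cell and exponential gating of TM-UNet are not reproduced, and the four scan directions are an illustrative assumption.

```python
import torch

def spatial_scans(feat: torch.Tensor):
    """Turn a (B, C, H, W) feature map into token sequences of shape (B, H*W, C)
    scanned in four directions (row-major, reversed, column-major, reversed)."""
    B, C, H, W = feat.shape
    row = feat.flatten(2).transpose(1, 2)                       # (B, H*W, C), row-major scan
    col = feat.permute(0, 1, 3, 2).flatten(2).transpose(1, 2)   # column-major scan
    return [row, row.flip(1), col, col.flip(1)]

feat = torch.randn(2, 64, 16, 16)
tokens = spatial_scans(feat)
print([t.shape for t in tokens])    # four sequences of shape (2, 256, 64)
```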

Paper & Project Links

PDF

Summary

This paper proposes TM-UNet, a lightweight framework for medical image segmentation that combines token sequence modeling with an efficient memory mechanism. A multi-scale token-memory (MSTM) block converts 2D spatial features into token sequences via strategic spatial scanning and uses matrix memory cells to selectively retain and propagate discriminative contextual information across tokens. The block captures long-range dependencies with linear complexity, enabling efficient global reasoning at low computational cost. Experiments show TM-UNet outperforms state-of-the-art methods across diverse medical segmentation tasks.

Key Takeaways

  1. TM-UNet is a new lightweight medical image segmentation framework designed to address the high computational cost of transformer-based methods.
  2. The multi-scale token-memory (MSTM) block converts 2D spatial features into token sequences, capturing richer contextual information.
  3. Matrix memory cells in the MSTM block act as a dynamic knowledge store, capturing long-range dependencies for efficient global reasoning.
  4. Exponential gating identifies token effectiveness, and parallel pooling provides multi-scale context for hierarchical representation learning.
  5. By combining token sequence modeling with an efficient memory mechanism, TM-UNet achieves both efficiency and accuracy in medical image segmentation.
  6. Experiments show TM-UNet outperforms existing methods on diverse medical segmentation tasks at substantially lower computational cost.


MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

Authors:Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu

Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

Paper & Project Links

PDF 12pages, 6 figures

Summary
Medical image segmentation methods are mostly task-specific and lack interactivity, and recent text-prompt-based approaches are limited to single-round dialogues without multi-round reasoning. This work introduces Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support it, the authors build MR-MedSeg, a dataset of 177K multi-round medical segmentation dialogues, and propose MediRound, a baseline model for the task. A lightweight Judgment & Correction Mechanism during inference mitigates error propagation in the chain-like multi-round pipeline. Experiments show the method effectively addresses MEMR-Seg and outperforms conventional medical referring segmentation methods.

Key Takeaways

  1. Medical image segmentation still suffers from task specificity and a lack of interactivity.
  2. Recent text-prompt-based segmentation methods add user-driven reasoning, but remain confined to single-round dialogues.
  3. The new MEMR-Seg task requires generating segmentation masks through multi-round queries with entity-level reasoning.
  4. The MR-MedSeg dataset provides a large number of multi-round medical segmentation dialogues.
  5. MediRound is proposed as an effective baseline model for multi-round medical reasoning segmentation.
  6. A Judgment & Correction Mechanism is introduced to mitigate error propagation across rounds.


Prompt Triage: Structured Optimization Enhances Vision-Language Model Performance on Medical Imaging Benchmarks

Authors:Arnav Singhvi, Vasiliki Bikia, Asad Aali, Akshay Chaudhari, Roxana Daneshjou

Vision-language foundation models (VLMs) show promise for diverse imaging tasks but often underperform on medical benchmarks. Prior efforts to improve performance include model finetuning, which requires large domain-specific datasets and significant compute, or manual prompt engineering, which is hard to generalize and often inaccessible to medical institutions seeking to deploy these tools. These challenges motivate interest in approaches that draw on a model’s embedded knowledge while abstracting away dependence on human-designed prompts to enable scalable, weight-agnostic performance improvements. To explore this, we adapt the Declarative Self-improving Python (DSPy) framework for structured automated prompt optimization in medical vision-language systems through a comprehensive, formal evaluation. We implement prompting pipelines for five medical imaging tasks across radiology, gastroenterology, and dermatology, evaluating 10 open-source VLMs with four prompt optimization techniques. Optimized pipelines achieved a median relative improvement of 53% over zero-shot prompting baselines, with the largest gains ranging from 300% to 3,400% on tasks where zero-shot performance is low. These results highlight the substantial potential of applying automated prompt optimization to medical AI systems, demonstrating significant gains for vision-based applications requiring accurate clinical image interpretation. By reducing dependence on prompt design to elicit intended outputs, these techniques allow clinicians to focus on patient care and clinical decision-making. Furthermore, our experiments offer scalability and preserve data privacy, demonstrating performance improvement on open-source VLMs. We publicly release our evaluation pipelines to support reproducible research on specialized medical tasks, available at https://github.com/DaneshjouLab/prompt-triage-lab.
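
The headline numbers are relative improvements of the optimized pipelines over the zero-shot baseline; how such figures are derived is sketched below with made-up per-task scores (they do not reproduce the paper's results).

```python
import statistics

# hypothetical per-task F1 scores: zero-shot baseline vs. optimized prompting pipeline
zero_shot = {"task_a": 0.10, "task_b": 0.42, "task_c": 0.55, "task_d": 0.02, "task_e": 0.60}
optimized = {"task_a": 0.40, "task_b": 0.60, "task_c": 0.70, "task_d": 0.70, "task_e": 0.72}

rel_improvements = [
    100 * (optimized[t] - zero_shot[t]) / zero_shot[t] for t in zero_shot
]
print(f"median relative improvement: {statistics.median(rel_improvements):.0f}%")
print(f"largest gain: {max(rel_improvements):.0f}%")   # tasks with low zero-shot scores dominate
```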

Paper & Project Links

PDF

Summary

This paper explores automated prompt optimization for medical vision-language systems. Across five medical imaging tasks, optimized prompting pipelines achieve a median relative improvement of 53% over zero-shot baselines, with the largest gains on tasks where zero-shot performance is low. These techniques reduce the reliance on hand-crafted prompt design, letting clinicians focus on patient care and clinical decision-making, while offering scalability and preserving data privacy by improving open-source VLMs. The evaluation pipelines are released on GitHub to support reproducible research on specialized medical tasks.

Key Takeaways

  1. Vision-language models (VLMs) show promise for medical imaging tasks but often underperform on medical benchmarks.
  2. Model fine-tuning requires large domain-specific datasets and significant compute, and manual prompt engineering is hard to generalize and often inaccessible to medical institutions.
  3. Automated prompt optimization shows substantial potential for medical AI systems, especially vision-based applications requiring accurate clinical image interpretation, and reduces dependence on prompt design so clinicians can focus on patient care and clinical decision-making.
  4. Optimized pipelines achieve a median relative improvement of 53% over zero-shot baselines, with the largest gains on tasks where zero-shot performance is low.
  5. The approach is scalable, preserves data privacy, and demonstrates performance improvements on open-source VLMs.
  6. The evaluation pipelines are publicly released on GitHub to support reproducible research on specialized medical tasks.


Identifying Imaging Follow-Up in Radiology Reports: A Comparative Analysis of Traditional ML and LLM Approaches

Authors:Namu Park, Giridhar Kaushik Ramachandran, Kevin Lybarger, Fei Xia, Ozlem Uzuner, Meliha Yetisgen, Martin Gunn

Large language models (LLMs) have shown considerable promise in clinical natural language processing, yet few domain-specific datasets exist to rigorously evaluate their performance on radiology tasks. In this work, we introduce an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection systems. Using this corpus, we systematically compared traditional machine-learning classifiers, including logistic regression (LR), support vector machines (SVM), Longformer, and a fully fine-tuned Llama3-8B-Instruct, with recent generative LLMs. To evaluate generative LLMs, we tested GPT-4o and the open-source GPT-OSS-20B under two configurations: a baseline (Base) and a task-optimized (Advanced) setting that focused inputs on metadata, recommendation sentences, and their surrounding context. A refined prompt for GPT-OSS-20B further improved reasoning accuracy. Performance was assessed using precision, recall, and F1 scores with 95% confidence intervals estimated via non-parametric bootstrapping. Inter-annotator agreement was high (F1 = 0.846). GPT-4o (Advanced) achieved the best performance (F1 = 0.832), followed closely by GPT-OSS-20B (Advanced; F1 = 0.828). LR and SVM also performed strongly (F1 = 0.776 and 0.775), underscoring that while LLMs approach human-level agreement through prompt optimization, interpretable and resource-efficient models remain valuable baselines.
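
Performance was reported as F1 with 95% confidence intervals from non-parametric bootstrapping. A minimal sketch of that resampling procedure, using synthetic labels and sklearn's F1 implementation, is shown below; the number of bootstrap replicates is an illustrative choice.

```python
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 500)                                 # synthetic gold labels
y_pred = np.where(rng.random(500) < 0.85, y_true, 1 - y_true)    # ~85% agreement with gold

def bootstrap_f1_ci(y_true, y_pred, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for F1: resample reports with replacement."""
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        scores.append(f1_score(y_true[idx], y_pred[idx]))
    lo, hi = np.quantile(scores, [alpha / 2, 1 - alpha / 2])
    return f1_score(y_true, y_pred), (lo, hi)

f1, (lo, hi) = bootstrap_f1_ci(y_true, y_pred)
print(f"F1 = {f1:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```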

Paper & Project Links

PDF Submitted to LREC 2026

Summary

Large language models (LLMs) show considerable promise for clinical natural language processing, yet few domain-specific datasets exist to evaluate them on radiology tasks. This work introduces an annotated corpus of 6,393 radiology reports from 586 patients, each labeled for follow-up imaging status, to support the development and benchmarking of follow-up adherence detection. Traditional machine-learning classifiers are compared with recent generative LLMs: GPT-4o and the open-source GPT-OSS-20B are tested in baseline and task-optimized configurations, and a refined prompt further improves GPT-OSS-20B's reasoning accuracy. Performance is measured with precision, recall, and F1, with 95% confidence intervals estimated by non-parametric bootstrapping. GPT-4o (Advanced) performs best (F1 = 0.832), closely followed by GPT-OSS-20B (Advanced, F1 = 0.828), while logistic regression and SVM remain strong (F1 = 0.776 and 0.775). The results show that prompt-optimized LLMs approach human-level agreement, but interpretable, resource-efficient models remain valuable baselines, and the corpus provides a useful resource for further work on radiology follow-up detection.

Key Takeaways

  1. An annotated corpus of radiology reports is introduced for evaluating LLMs on follow-up imaging detection.
  2. Traditional machine-learning classifiers are systematically compared with generative LLMs.


FocusSDF: Boundary-Aware Learning for Medical Image Segmentation via Signed Distance Supervision

Authors:Muzammal Shafique, Nasir Rahim, Jamil Ahmad, Mohammad Siadat, Khalid Malik, Ghaus Malik

Segmentation of medical images constitutes an essential component of medical image analysis, providing the foundation for precise diagnosis and efficient therapeutic interventions in clinical practices. Despite substantial progress, most segmentation models do not explicitly encode boundary information; as a result, making boundary preservation a persistent challenge in medical image segmentation. To address this challenge, we introduce FocusSDF, a novel loss function based on the signed distance functions (SDFs), which redirects the network to concentrate on boundary regions by adaptively assigning higher weights to pixels closer to the lesion or organ boundary, effectively making it boundary aware. To rigorously validate FocusSDF, we perform extensive evaluations against five state-of-the-art medical image segmentation models, including the foundation model MedSAM, using four distance-based loss functions across diverse datasets covering cerebral aneurysm, stroke, liver, and breast tumor segmentation tasks spanning multiple imaging modalities. The experimental results consistently demonstrate the superior performance of FocusSDF over existing distance transform based loss functions.
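
To illustrate the general idea of weighting a pixel-wise loss by proximity to the lesion or organ boundary via a signed distance transform, here is a minimal sketch. The exponential weighting and the temperature `tau` are illustrative assumptions; the exact FocusSDF formulation is given in the paper.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def boundary_weights(gt_mask: np.ndarray, tau: float = 5.0) -> np.ndarray:
    """Per-pixel weights that peak at the object boundary and decay with |signed distance|."""
    inside = distance_transform_edt(gt_mask)           # distance to boundary, inside the object
    outside = distance_transform_edt(1 - gt_mask)      # distance to boundary, outside the object
    sdf = outside - inside                             # signed distance (negative inside)
    return np.exp(-np.abs(sdf) / tau)                  # assumption: exponential decay from the boundary

gt = np.zeros((96, 96), dtype=np.uint8); gt[30:70, 30:70] = 1
w = torch.from_numpy(boundary_weights(gt)).float()
logits = torch.randn(96, 96)                           # raw network output for one image
target = torch.from_numpy(gt).float()
per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
loss = (w * per_pixel).mean()                          # boundary-aware weighted loss
```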

Paper & Project Links

PDF

Summary
Medical image segmentation underpins precise diagnosis and treatment planning, but most segmentation models do not explicitly encode boundary information, so boundary preservation remains a challenge. FocusSDF, a loss function based on signed distance functions (SDFs), adaptively assigns higher weights to pixels closer to the lesion or organ boundary, making the network boundary aware. Evaluations against five state-of-the-art segmentation models and four distance-based loss functions on cerebral aneurysm, stroke, liver, and breast tumor segmentation across multiple imaging modalities consistently show that FocusSDF outperforms existing distance-transform-based losses.

Key Takeaways

  1. Medical image segmentation is a core component of medical image analysis and is essential for precise diagnosis and treatment.
  2. Most existing segmentation models do not explicitly encode boundary information, making boundary preservation a persistent challenge.
  3. FocusSDF, a new loss function based on signed distance functions (SDFs), is introduced to address this.
  4. FocusSDF adaptively assigns higher weights to pixels near the lesion or organ boundary, making the model boundary aware.
  5. FocusSDF is evaluated extensively against five state-of-the-art medical image segmentation models, including the foundation model MedSAM.
  6. Experiments across diverse datasets show FocusSDF outperforms existing distance-transform-based loss functions.


NURBGen: High-Fidelity Text-to-CAD Generation through LLM-Driven NURBS Modeling

Authors:Muhammad Usama, Mohammad Sadil Khan, Didier Stricker, Muhammad Zeshan Afzal

Generating editable 3D CAD models from natural language remains challenging, as existing text-to-CAD systems either produce meshes or rely on scarce design-history data. We present NURBGen, the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS). To achieve this, we fine-tune a large language model (LLM) to translate free-form texts into JSON representations containing NURBS surface parameters (\textit{i.e}, control points, knot vectors, degrees, and rational weights) which can be directly converted into BRep format using Python. We further propose a hybrid representation that combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly, while reducing token complexity. Additionally, we introduce partABC, a curated subset of the ABC dataset consisting of individual CAD components, annotated with detailed captions using an automated annotation pipeline. NURBGen demonstrates strong performance on diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy, as confirmed by expert evaluations. Code and dataset will be released publicly.
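
To make the text-to-NURBS output concrete, here is a hypothetical JSON-style record of the kind described above (control points, knot vectors, degrees, rational weights), together with the standard consistency check that a knot vector must have n_ctrl + degree + 1 entries. The field names are illustrative, not the paper's actual schema.

```python
import json

# hypothetical schema for one untrimmed NURBS patch (a flat bilinear 2x2 patch)
surface = {
    "degree_u": 1,
    "degree_v": 1,
    "control_points": [  # 2 x 2 grid, row-major, xyz in millimeters
        [[0, 0, 0], [10, 0, 0]],
        [[0, 10, 0], [10, 10, 0]],
    ],
    "weights": [[1.0, 1.0], [1.0, 1.0]],        # rational weights (all 1.0 = plain B-spline)
    "knots_u": [0, 0, 1, 1],
    "knots_v": [0, 0, 1, 1],
}

def check_patch(s):
    """A valid knot vector has length = number of control points + degree + 1."""
    n_u, n_v = len(s["control_points"]), len(s["control_points"][0])
    assert len(s["knots_u"]) == n_u + s["degree_u"] + 1
    assert len(s["knots_v"]) == n_v + s["degree_v"] + 1

check_patch(surface)
print(json.dumps(surface)[:80], "...")
```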

Paper & Project Links

PDF Accepted in AAAI 2026

Summary
Generating editable, high-quality 3D CAD models directly from text remains challenging: existing text-to-CAD systems either produce meshes or depend on scarce design-history data. NURBGen is the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS), fine-tuning a large language model to translate free-form text into JSON representations of NURBS surface parameters. The work also introduces partABC, a curated subset of the ABC dataset consisting of individual CAD components annotated with detailed captions via an automated pipeline. NURBGen performs strongly across diverse prompts, surpassing prior methods in geometric fidelity and dimensional accuracy according to expert evaluations.

Key Takeaways

  1. NURBGen is the first framework to generate high-fidelity 3D CAD models directly from text using Non-Uniform Rational B-Splines (NURBS).
  2. Existing text-to-CAD systems are limited: they mostly produce meshes or rely on scarce design-history data.
  3. NURBGen fine-tunes a large language model (LLM) to translate text into JSON representations containing NURBS surface parameters.
  4. A hybrid representation combines untrimmed NURBS with analytic primitives to handle trimmed surfaces and degenerate regions more robustly while reducing token complexity.
  5. The partABC dataset provides individual CAD components with detailed captions generated by an automated annotation pipeline.
  6. NURBGen performs strongly across diverse prompts.



Author: Kedreamix
Copyright: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!