
医学图像


⚠️ 以下所有内容总结均由大语言模型生成,如有错误,仅供参考,请谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目对您有帮助 ChatPaperFree ,还请您给我们一些鼓励!⭐️ HuggingFace免费体验

2025-11-19 更新

Free-Form Scene Editor: Enabling Multi-Round Object Manipulation like in a 3D Engine

Authors:Xincheng Shuai, Zhenyuan Qin, Henghui Ding, Dacheng Tao

Recent advances in text-to-image (T2I) diffusion models have significantly improved semantic image editing, yet most methods fall short in performing 3D-aware object manipulation. In this work, we present FFSE, a 3D-aware autoregressive framework designed to enable intuitive, physically-consistent object editing directly on real-world images. Unlike previous approaches that either operate in image space or require slow and error-prone 3D reconstruction, FFSE models editing as a sequence of learned 3D transformations, allowing users to perform arbitrary manipulations, such as translation, scaling, and rotation, while preserving realistic background effects (e.g., shadows, reflections) and maintaining global scene consistency across multiple editing rounds. To support learning of multi-round 3D-aware object manipulation, we introduce 3DObjectEditor, a hybrid dataset constructed from simulated editing sequences across diverse objects and scenes, enabling effective training under multi-round and dynamic conditions. Extensive experiments show that the proposed FFSE significantly outperforms existing methods in both single-round and multi-round 3D-aware editing scenarios.

最近文本到图像(T2I)扩散模型的进展显著地提高了语义图像编辑能力,但大多数方法在执行3D感知对象操作时仍然不足。在这项工作中,我们提出了FFSE,这是一个3D感知自回归框架,旨在实现在现实世界图像上直接进行直观、物理一致的对象编辑。不同于以前的方法,它们要么在图像空间中进行操作,要么需要缓慢且容易出错的3D重建,FFSE将编辑建模为一系列学习的3D转换,允许用户执行任意操作,如平移、缩放和旋转,同时保留逼真的背景效果(例如阴影、反射)并保持全局场景在多轮编辑之间的一致性。为了支持多轮3D感知对象操作的学习,我们引入了3DObjectEditor,这是一个混合数据集,由各种对象和场景中的模拟编辑序列构成,能够在多轮和动态条件下进行有效训练。大量实验表明,所提出的FFSE在单轮和多轮3D感知编辑场景中均显著优于现有方法。
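
下面给出一个极简的 numpy 示意草图(并非 FFSE 的实现),演示摘要中“把编辑建模为一系列 3D 变换”的基本思路:用 4×4 齐次矩阵表示平移、缩放和绕轴旋转,并把多轮编辑按顺序组合成一个累计的物体位姿。其中的函数名、编辑顺序和数值均为示例假设。

```python
import numpy as np

def translation(t):
    """平移矩阵,t 为 (tx, ty, tz)。"""
    M = np.eye(4)
    M[:3, 3] = t
    return M

def scaling(s):
    """各向同性缩放矩阵。"""
    M = np.eye(4)
    M[:3, :3] *= s
    return M

def rotation_y(theta):
    """绕 y 轴旋转 theta 弧度。"""
    c, s = np.cos(theta), np.sin(theta)
    M = np.eye(4)
    M[0, 0], M[0, 2] = c, s
    M[2, 0], M[2, 2] = -s, c
    return M

# 多轮编辑:每一轮给出一个 3D 变换,按顺序左乘得到累计位姿
edits = [rotation_y(np.pi / 6), translation([0.5, 0.0, -0.2]), scaling(1.2)]
pose = np.eye(4)
for T in edits:
    pose = T @ pose          # 新一轮编辑作用在上一轮结果之上

# 将累计变换应用到物体上的一个 3D 点(齐次坐标)
p = np.array([0.1, 0.0, 0.3, 1.0])
print(pose @ p)
```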

论文及项目相关链接

PDF AAAI 2026, Project Page: https://henghuiding.com/FFSE/

Summary

近期文本到图像(T2I)扩散模型的进步极大地推动了语义图像编辑的发展,但在3D感知物体操作方面,大多数方法仍显不足。本研究提出FFSE,一个3D感知的自回归框架,旨在实现对真实世界图像的直观、物理一致的物体编辑。不同于其他方法在图像空间操作或需要缓慢且易出错的3D重建,FFSE将编辑建模为一系列学习的3D转换,允许用户进行任意操作,如平移、缩放和旋转,同时保持背景效果(如阴影、反射)的真实性,并在多次编辑中保持全局场景的一致性。为支持多轮3D感知物体操作的学习,我们引入了3DObjectEditor,一个由模拟编辑序列构建的混合数据集,涵盖各种物体和场景,以支持多轮和动态条件下的有效训练。实验表明,FFSE在单轮和多轮3D感知编辑场景中均显著优于现有方法。

Key Takeaways

  1. 文本到图像扩散模型的最新进展已推动语义图像编辑的显著提升。
  2. 当前方法在3D感知物体操作方面存在局限性。
  3. FFSE是一个3D感知自回归框架,支持真实图像的直接编辑,包括平移、缩放和旋转等。
  4. FFSE能维持背景效果的真实性,并在多次编辑中保持全局场景的一致性。
  5. 引入的3DObjectEditor数据集用于支持多轮3D感知物体操作的学习。
  6. FFSE在单轮和多轮3D感知编辑场景中性能优越。

Cool Papers

点此查看论文截图

Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting

Authors:Jiangnan Ye, Jiedong Zhuang, Lianrui Mu, Wenjie Zheng, Jiaqi Hu, Xingze Zou, Jing Wang, Haoji Hu

We introduce GS-Light, an efficient, textual position-aware pipeline for text-guided relighting of 3D scenes represented via Gaussian Splatting (3DGS). GS-Light implements a training-free extension of a single-input diffusion model to handle multi-view inputs. Given a user prompt that may specify lighting direction, color, intensity, or reference objects, we employ a large vision-language model (LVLM) to parse the prompt into lighting priors. Using off-the-shelf estimators for geometry and semantics (depth, surface normals, and semantic segmentation), we fuse these lighting priors with view-geometry constraints to compute illumination maps and generate initial latent codes for each view. These meticulously derived init latents guide the diffusion model to generate relighting outputs that more accurately reflect user expectations, especially in terms of lighting direction. By feeding multi-view rendered images, along with the init latents, into our multi-view relighting model, we produce high-fidelity, artistically relit images. Finally, we fine-tune the 3DGS scene with the relit appearance to obtain a fully relit 3D scene. We evaluate GS-Light on both indoor and outdoor scenes, comparing it to state-of-the-art baselines including per-view relighting, video relighting, and scene editing methods. Using quantitative metrics (multi-view consistency, imaging quality, aesthetic score, semantic similarity, etc.) and qualitative assessment (user studies), GS-Light demonstrates consistent improvements over baselines. Code and assets will be made available upon publication.

我们介绍了GS-Light,这是一个高效的、感知文本中位置信息的流水线,用于对以高斯拼贴(3DGS)表示的3D场景进行文本引导的重照明。GS-Light以免训练的方式将单输入扩散模型扩展到多视图输入。给定可能指定照明方向、颜色、强度或参考对象的用户提示,我们采用大型视觉语言模型(LVLM)将提示解析为照明先验。借助现成的几何与语义估计器(深度、表面法线和语义分割),我们将这些照明先验与视图几何约束融合,计算照明图并为每个视图生成初始潜在编码。这些精心推导的初始潜在编码引导扩散模型生成更符合用户期望的重照明结果,尤其是在照明方向方面。将多视图渲染图像与初始潜在编码一并输入我们的多视图重照明模型,即可生成高保真、具有艺术感的重照明图像。最后,我们用重照明后的外观对3DGS场景进行微调,得到完全重照明的3D场景。我们在室内和室外场景上评估了GS-Light,并与最先进的基线方法(包括逐视图重照明、视频重照明和场景编辑方法)进行比较。基于定量指标(多视图一致性、成像质量、美学分数、语义相似性等)和定性评估(用户研究),GS-Light相较于各基线方法均表现出一致的提升。代码和资源将在论文发表后公开。
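
摘要中“将光照先验与几何约束融合以计算照明图”这一步,可以用最简单的朗伯(Lambertian)着色来直观说明。下面是一个示意性的 numpy 草图(并非 GS-Light 的实现),假设已有法线图以及由提示解析出的光照方向与颜色先验;所有输入和数值均为示例假设。

```python
import numpy as np

def illumination_map(normals, light_dir, light_color):
    """
    normals:     (H, W, 3) 单位表面法线(例如来自现成法线估计器)
    light_dir:   (3,)      由文本提示解析出的光照方向(指向光源)
    light_color: (3,)      光照颜色/强度先验
    返回 (H, W, 3) 的照明图:max(0, n·l) * color
    """
    l = np.asarray(light_dir, dtype=float)
    l = l / (np.linalg.norm(l) + 1e-8)
    ndotl = np.clip(np.einsum("hwc,c->hw", normals, l), 0.0, None)
    return ndotl[..., None] * np.asarray(light_color, dtype=float)

# 玩具示例:随机法线图 + “光从右上方来、偏暖色”的先验
H, W = 4, 4
n = np.random.randn(H, W, 3)
n /= np.linalg.norm(n, axis=-1, keepdims=True)
illum = illumination_map(n, light_dir=[1.0, 1.0, 0.5], light_color=[1.0, 0.9, 0.7])
print(illum.shape)  # (4, 4, 3)
```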

论文及项目相关链接

PDF Submitting for Neurocomputing

Summary

本文介绍了GS-Light,这是一种高效、对文本位置敏感的管道,用于对通过高斯拼贴(3DGS)表示的3D场景进行文本引导的重照明。GS-Light实现了一个无需训练的单输入扩散模型的多视图输入扩展。通过用户提示(可能指定照明方向、颜色、强度或参考对象),我们采用大型视觉语言模型(LVLM)将提示解析为照明先验。结合现成的几何和语义估计器(深度、表面法线和语义分割),我们将这些照明先验与视图几何约束融合,计算照明图并为每个视图生成初始潜在代码。这些精心得出的初始潜在代码指导扩散模型生成更准确地反映用户期望的照明输出,尤其是在照明方向方面。通过向多视图重照明模型输入多视图渲染图像以及初始潜在代码,我们生成了高保真、艺术性重照明的图像。最后,使用重照明的外观对3DGS场景进行微调,以获得完全重照明的3D场景。

Key Takeaways

  1. GS-Light是一种用于3D场景文本引导重照明的有效方法。
  2. 它通过结合用户提示、照明先验、视图几何约束来生成照明图。
  3. GS-Light采用多视图渲染图像和初始潜在代码来生成高保真、艺术性重照明图像。
  4. 该方法通过微调3D场景以匹配重照明的外观,实现全场景的照明。
  5. GS-Light在室内外场景上的表现均优于现有技术基线。
  6. 通过定量指标(如多视图一致性、成像质量、美学评分、语义相似性)和定性评估(用户研究),验证了GS-Light的有效性。

Cool Papers

点此查看论文截图

Delineate Anything Flow: Fast, Country-Level Field Boundary Detection from Any Source

Authors:Mykola Lavreniuk, Nataliia Kussul, Andrii Shelestov, Yevhenii Salii, Volodymyr Kuzin, Sergii Skakun, Zoltan Szantoi

Accurate delineation of agricultural field boundaries from satellite imagery is essential for land management and crop monitoring, yet existing methods often produce incomplete boundaries, merge adjacent fields, and struggle to scale. We present the Delineate Anything Flow (DelAnyFlow) methodology, a resolution-agnostic approach for large-scale field boundary mapping. DelAnyFlow combines the DelAny instance segmentation model, based on a YOLOv11 backbone and trained on the large-scale Field Boundary Instance Segmentation-22M (FBIS 22M) dataset, with a structured post-processing, merging, and vectorization sequence to generate topologically consistent vector boundaries. FBIS 22M, the largest dataset of its kind, contains 672,909 multi-resolution image patches (0.25-10m) and 22.9million validated field instances. The DelAny model delivers state-of-the-art accuracy with over 100% higher mAP and 400x faster inference than SAM2. DelAny demonstrates strong zero-shot generalization and supports national-scale applications: using Sentinel 2 data for 2024, DelAnyFlow generated a complete field boundary layer for Ukraine (603,000km2) in under six hours on a single workstation. DelAnyFlow outputs significantly improve boundary completeness relative to operational products from Sinergise Solutions and NASA Harvest, particularly in smallholder and fragmented systems (0.25-1ha). For Ukraine, DelAnyFlow delineated 3.75M fields at 5m and 5.15M at 2.5m, compared to 2.66M detected by Sinergise Solutions and 1.69M by NASA Harvest. This work delivers a scalable, cost-effective methodology for field delineation in regions lacking digital cadastral data. A project landing page with links to model weights, code, national-scale vector outputs, and dataset is available at https://lavreniuk.github.io/Delineate-Anything/.

从卫星图像中准确描绘农业田块边界对土地管理和作物监测至关重要,但现有方法往往产生不完整的边界、合并相邻田块,并且难以扩展。我们提出了Delineate Anything Flow(DelAnyFlow)方法,这是一种与分辨率无关的大规模田块边界制图方法。DelAnyFlow将基于YOLOv11主干网络、并在大规模Field Boundary Instance Segmentation-22M(FBIS 22M)数据集上训练的DelAny实例分割模型,与结构化的后处理、合并和矢量化流程相结合,以生成拓扑一致的矢量边界。FBIS 22M是同类中规模最大的数据集,包含672,909个多分辨率图像块(0.25-10米)和2290万个经过验证的田块实例。DelAny模型达到了最先进的精度,其mAP比SAM2高出100%以上,推理速度快400倍。DelAny表现出强大的零样本泛化能力,并支持国家级应用:使用2024年的Sentinel-2数据,DelAnyFlow在单台工作站上不到六小时内为乌克兰(603,000平方公里)生成了完整的田块边界图层。相对于Sinergise Solutions和NASA Harvest的业务化产品,DelAnyFlow的输出在边界完整性方面有显著改善,尤其是在小农户和碎片化的农业系统(0.25-1公顷)中。对于乌克兰,DelAnyFlow在5米分辨率下描绘了375万个田块,在2.5米分辨率下描绘了515万个,而Sinergise Solutions检测到266万个,NASA Harvest检测到169万个。这项工作为缺乏数字地籍数据的地区提供了一种可扩展且具有成本效益的田块描绘方法。项目主页提供了模型权重、代码、国家级矢量输出和数据集的链接:https://lavreniuk.github.io/Delineate-Anything/。
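
DelAnyFlow 在实例分割之后还需要结构化的后处理与矢量化才能得到矢量田块边界。下面是一个示意性草图(并非论文的实际流程),假设已有单个田块的二值掩膜,用 scikit-image 提取轮廓并用 shapely 化简为多边形;最小面积阈值与化简容差均为示例假设。

```python
import numpy as np
from skimage import measure
from shapely.geometry import Polygon

def mask_to_polygons(mask, min_area=10.0, tolerance=1.0):
    """把单个二值田块掩膜转换为化简后的矢量多边形列表(像素坐标)。"""
    polys = []
    # find_contours 返回亚像素轮廓,0.5 作为前景/背景分界
    for contour in measure.find_contours(mask.astype(float), 0.5):
        if len(contour) < 4:
            continue
        poly = Polygon(contour[:, ::-1])        # (row, col) -> (x, y)
        if not poly.is_valid or poly.area < min_area:
            continue
        polys.append(poly.simplify(tolerance, preserve_topology=True))
    return polys

# 玩具示例:一个 20x20 网格中的矩形“田块”
mask = np.zeros((20, 20), dtype=np.uint8)
mask[5:15, 4:16] = 1
for p in mask_to_polygons(mask):
    print(p.area, len(p.exterior.coords))
```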

论文及项目相关链接

PDF

Summary

本文介绍了一种用于大规模农田边界映射的Delineate Anything Flow(DelAnyFlow)方法。该方法结合DelAny实例分割模型和结构化后处理、合并和矢量化序列,生成拓扑一致的矢量边界,实现了农业田间界限的准确描绘。其中,所用的FBIS 22M数据集是目前最大规模的数据集之一,包含672,909个多分辨率图像斑块和2290万个验证的田间实例。DelAny模型具有超高的准确度和推理速度,支持大规模应用。在乌克兰的测试中,DelAnyFlow在单个工作站内不到六小时内完成了整个国家的农田边界层生成。此方法提高了边界完整性,特别是在小型和分散的系统中表现优异。

Key Takeaways

  1. Delineate Anything Flow (DelAnyFlow)方法结合了DelAny实例分割模型和结构化后处理序列,用于大规模农田边界映射。
  2. DelAnyFlow使用的FBIS 22M数据集是目前最大的同类数据集之一,包含大量多分辨率图像斑块和验证的田间实例。
  3. DelAny模型具有极高的准确度和推理速度,相较于SAM2,其mAP高出100%以上,推理速度快400倍。
  4. DelAnyFlow具有强大的零样本泛化能力,支持国家规模的应用。
  5. 在乌克兰的测试中,DelAnyFlow在六小时内生成了完整的农田边界层,显示出其高效性。
  6. DelAnyFlow提高了边界完整性,特别是在小型和分散的系统中表现优异,相较于其他产品如Sinergise Solutions和NASA Harvest有显著提升。

Cool Papers

点此查看论文截图

TripleFDS: Triple Feature Disentanglement and Synthesis for Scene Text Editing

Authors:Yuchen Bao, Yiting Wang, Wenjian Huang, Haowei Wang, Shen Chen, Taiping Yao, Shouhong Ding, Jianguo Zhang

Scene Text Editing (STE) aims to naturally modify text in images while preserving visual consistency, the decisive factors of which can be divided into three parts, i.e., text style, text content, and background. Previous methods have struggled with incomplete disentanglement of editable attributes, typically addressing only one aspect - such as editing text content - thus limiting controllability and visual consistency. To overcome these limitations, we propose TripleFDS, a novel framework for STE with disentangled modular attributes, and an accompanying dataset called SCB Synthesis. SCB Synthesis provides robust training data for triple feature disentanglement by utilizing the “SCB Group”, a novel construct that combines three attributes per image to generate diverse, disentangled training groups. Leveraging this construct as a basic training unit, TripleFDS first disentangles triple features, ensuring semantic accuracy through inter-group contrastive regularization and reducing redundancy through intra-sample multi-feature orthogonality. In the synthesis phase, TripleFDS performs feature remapping to prevent “shortcut” phenomena during reconstruction and mitigate potential feature leakage. Trained on 125,000 SCB Groups, TripleFDS achieves state-of-the-art image fidelity (SSIM of 44.54) and text accuracy (ACC of 93.58%) on the mainstream STE benchmarks. Besides superior performance, the more flexible editing of TripleFDS supports new operations such as style replacement and background transfer. Code: https://github.com/yusenbao01/TripleFDS

场景文本编辑(STE)旨在自然修改图像中的文本,同时保持视觉一致性,其决定因素可分为三部分,即文本样式、文本内容和背景。之前的方法在可编辑属性的不完全分离方面遇到了困难,通常只解决一个方面,例如编辑文本内容,从而限制了可控性和视觉一致性。为了克服这些限制,我们提出了TripleFDS,这是一个具有分离模块化属性的新型STE框架,以及一个名为SCB合成的数据集。SCB合成通过利用“SCB组”这一新颖构造(每幅图像组合三个属性来生成多样、分离的训练组),为三重特征分离提供了稳健的训练数据。利用这一构造作为基本训练单元,TripleFDS首先分离三重特征,通过组间对比正则化确保语义准确性,并通过样本内多特征正交性减少冗余。在合成阶段,TripleFDS执行特征重映射,以防止重建过程中的“捷径”现象,并减轻潜在的特征泄露。在125,000个SCB组的训练下,TripleFDS在主流STE基准测试上实现了最先进的图像保真度(SSIM得分为44.54)和文本准确率(ACC得分为93.58%)。除了卓越的性能外,TripleFDS的更灵活编辑还支持新的操作,如样式替换和背景转移。代码地址:https://github.com/yusenbao01/TripleFDS
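
摘要提到用“样本内多特征正交性”来减少文本风格、文本内容与背景三种特征之间的冗余。下面是一个极简的 numpy 草图(并非 TripleFDS 的实现),演示一种常见做法:把同一样本的三个特征向量归一化后,惩罚两两余弦相似度的平方;变量名与维度均为示例假设。

```python
import numpy as np

def orthogonality_penalty(style, content, background):
    """同一样本的三个特征向量之间的正交性惩罚(越小越接近两两正交)。"""
    F = np.stack([style, content, background], axis=0)          # (3, d)
    F = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-8)   # 归一化
    G = F @ F.T                                                  # 余弦相似度 Gram 矩阵
    off_diag = G - np.diag(np.diag(G))
    return float(np.sum(off_diag ** 2))

d = 16
rng = np.random.default_rng(0)
print(orthogonality_penalty(rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)))
```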

论文及项目相关链接

PDF Accepted by AAAI2026

Summary

本文介绍了场景文本编辑(STE)的目标,旨在自然修改图像中的文本同时保持视觉一致性。针对以往方法在编辑属性解耦方面的不足,提出了TripleFDS框架和SCB合成数据集。TripleFDS通过解耦模块属性实现文本风格、内容和背景的全面编辑,SCB合成数据集则为三重特征解耦提供稳健的训练数据。该框架在保证语义准确性的同时,减少了冗余,实现了灵活编辑操作,如风格替换和背景转移。

Key Takeaways

  1. 场景文本编辑(STE)旨在自然修改图像中的文本,同时保持视觉一致性。
  2. 以往方法在编辑属性解耦方面存在困难,仅关注单一方面的编辑,如文本内容。
  3. TripleFDS框架实现了文本风格、内容和背景的全面编辑,通过解耦模块属性克服以往方法的局限性。
  4. SCB合成数据集为TripleFDS框架提供稳健的训练数据,利用“SCB组”生成多样化、解耦的训练样本。
  5. TripleFDS通过组间对比正则化确保语义准确性,并通过样本内多特征正交性减少冗余。
  6. 在合成阶段,TripleFDS进行特征重映射,防止重建过程中的“捷径”现象和潜在特征泄露。

Cool Papers

点此查看论文截图

PDRs4All XX. Haute Couture: Spectral stitching of JWST MIRI-IFU cubes with matrix completion

Authors:Amélie Canin, Cédric Févotte, Nicolas Dobigeon, Dries Van De Putte, Takashi Onaka, Olivier Berné

MIRI is the imager and spectrograph covering wavelengths from $4.9$ to $27.9$ $μ$m onboard the James Webb Space Telescope (JWST). The Medium-Resolution Spectrometer (MRS) consists of four integral field units (IFU), each of which has three sub-channels. The twelve resulting spectral data cubes have different fields of view, spatial, and spectral resolutions. The wavelength range of each cube partially overlaps with the neighboring bands, and the overlap regions typically show flux mismatches which have to be corrected by spectral stitching methods. Stitching methods aim to produce a single data cube incorporating the data of the individual sub-channels, which requires matching the spatial resolution and the flux discrepancies. We present Haute Couture, a novel stitching algorithm which uses non-negative matrix factorization (NMF) to perform a matrix completion, where the available MRS data cubes are treated as twelve sub-matrices of a larger incomplete matrix. Prior to matrix completion, we also introduce a novel pre-processing to homogenize the global intensities of the twelve cubes. Our pre-processing consists in jointly optimizing a set of global scale parameters that maximize the fit between the cubes where spectral overlap occurs. We apply our novel stitching method to JWST data obtained as part of the PDRs4All observing program of the Orion Bar, and produce a uniform cube reconstructed with the best spatial resolution over the full range of wavelengths.

MIRI是詹姆斯·韦伯空间望远镜(JWST)上的成像仪和光谱仪,覆盖$4.9$到$27.9$ $μ$m的波长范围。中分辨率光谱仪(MRS)由四个积分视场单元(IFU)组成,每个单元有三个子通道。由此产生的十二个光谱数据立方体具有不同的视场、空间分辨率和光谱分辨率。每个立方体的波长范围与相邻波段部分重叠,重叠区域通常会出现流量不匹配,必须通过光谱拼接方法进行校正。拼接方法旨在生成一个整合各子通道数据的单一数据立方体,这要求同时匹配空间分辨率并校正流量差异。我们提出了一种新型拼接算法Haute Couture(高级定制),该算法采用非负矩阵分解(NMF)进行矩阵补全,将可用的MRS数据立方体视为一个更大的不完整矩阵的十二个子矩阵。在矩阵补全之前,我们还引入了一种新的预处理来统一十二个立方体的全局强度:联合优化一组全局尺度参数,使发生光谱重叠的立方体之间的拟合度最大化。我们将该拼接方法应用于PDRs4All观测项目中获取的猎户座棒(Orion Bar)JWST数据,生成了一个在全波长范围内以最佳空间分辨率重建的统一立方体。
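
矩阵补全式的 NMF 可以用带掩码的乘法更新来直观演示:只在已观测项上拟合,再用低秩重建去填补缺失项。下面是一个示意性的 numpy 草图(并非 Haute Couture 的实现,也未包含其全局尺度预处理),矩阵规模、秩和迭代次数均为示例假设。

```python
import numpy as np

def masked_nmf(X, mask, rank=4, n_iter=300, eps=1e-9):
    """
    对带缺失项的非负矩阵 X 做 NMF 补全。
    X:    (m, n) 非负数据,缺失位置的值会被忽略
    mask: (m, n) 0/1 观测掩码(1 = 已观测)
    返回补全后的重建 W @ H。
    """
    m, n = X.shape
    rng = np.random.default_rng(0)
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    MX = mask * X
    for _ in range(n_iter):
        WH = W @ H
        # 带掩码的乘法更新(只对已观测项拟合)
        H *= (W.T @ MX) / (W.T @ (mask * WH) + eps)
        WH = W @ H
        W *= (MX @ H.T) / ((mask * WH) @ H.T + eps)
    return W @ H

# 玩具示例:低秩非负真值矩阵,随机遮掉约 40% 的元素后再补全
rng = np.random.default_rng(1)
truth = rng.random((30, 4)) @ rng.random((4, 20))      # 低秩非负真值
mask = (rng.random(truth.shape) > 0.4).astype(float)   # 观测掩码
recon = masked_nmf(truth * mask, mask, rank=4)
print(np.abs(recon - truth)[mask == 0].mean())         # 缺失项上的平均误差
```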

论文及项目相关链接

PDF

Summary
詹姆斯·韦伯空间望远镜(JWST)上的MIRI成像仪与光谱仪涵盖$4.9$至$27.9$微米的波长范围。其中,中分辨率光谱仪(MRS)由四个积分视场单元(IFU)组成,每个单元有三个子通道。新型拼接算法Haute Couture(高级定制)采用非负矩阵分解(NMF)进行矩阵补全,将十二个子通道的数据立方体视为一个更大不完整矩阵的子矩阵。该算法能匹配空间分辨率并修正流量差异。将其应用于PDRs4All项目获取的猎户座棒(Orion Bar)JWST数据后,成功生成了一个在全波长范围内具有最佳空间分辨率的统一数据立方体。

Key Takeaways

  1. MIRI是JWST上的成像和光谱仪工具,覆盖波长范围广。
  2. MRS包含四个IFU,每个IFU有三个子通道,产生不同的光谱数据立方体。
  3. 光谱数据立方体之间存在波长重叠区域,通常会出现流量不匹配的问题。
  4. 需要采用光谱拼接方法来修正流量不匹配问题,并创建一个包含各子通道数据的大型数据立方体。
  5. 介绍了一种新型拼接算法——高级定制(Haute Couture),使用非负矩阵分解(NMF)进行矩阵补全。
  6. 在应用新型拼接方法处理JWST数据后,成功生成了全波长范围内具有最佳空间分辨率的统一数据立方体。

Cool Papers

点此查看论文截图

PyPeT: A Python Perfusion Tool for Automated Quantitative Brain CT and MR Perfusion Analysis

Authors:Marijn Borghouts, Ruisheng Su

Computed tomography perfusion (CTP) and magnetic resonance perfusion (MRP) are widely used in acute ischemic stroke assessment and other cerebrovascular conditions to generate quantitative maps of cerebral hemodynamics. While commercial perfusion analysis software exists, it is often costly, closed source, and lacks customizability. This work introduces PyPeT, an openly available Python Perfusion Tool for head CTP and MRP processing. PyPeT is capable of producing cerebral blood flow (CBF), cerebral blood volume (CBV), mean transit time (MTT), time-to-peak (TTP), and time-to-maximum (Tmax) maps from raw four-dimensional perfusion data. PyPeT aims to make perfusion research as accessible and customizable as possible. This is achieved through a unified framework in which both CTP and MRP data can be processed, with a strong focus on modularity, low computational burden, and significant inline documentation. PyPeT’s outputs can be validated through an extensive debug mode in which every step of the process is visualized. Additional validation was performed via visual and quantitative comparison with reference perfusion maps generated by three FDA-approved commercial perfusion tools and a research tool. These comparisons show a mean SSIM around 0.8 for all comparisons, indicating a good and stable correlation with FDA-approved tools. The code for PyPeT is openly available at our GitHub https://github.com/Marijn311/CT-and-MR-Perfusion-Tool

计算机断层扫描灌注(CTP)和磁共振灌注(MRP)广泛应用于急性缺血性卒中的评估和其他脑血管状况,以生成脑血流动力学定量图。尽管存在商业灌注分析软件,但它通常价格昂贵、源代码封闭且缺乏可定制性。这项工作介绍了PyPeT,这是一个公开可用的Python灌注工具,用于头部CTP和MRP处理。PyPeT能够产生脑血流量(CBF)、脑血容量(CBV)、平均通过时间(MTT)、峰值时间(TTP)和最大时间(Tmax)图,这些图是从原始的四维灌注数据中得出的。PyPeT旨在使灌注研究尽可能易于访问和可定制。这是通过一个统一框架实现的,该框架可以处理CTP和MRP数据,重点强调模块化、低计算负担和重要的内联文档。PyPeT的输出可以通过广泛的调试模式进行验证,在该模式下,过程中的每一步都会被可视化。通过与我们使用的三个FDA批准的商业灌注工具和一种研究工具生成的参考灌注图进行视觉和定量比较,对PyPeT进行了额外的验证。这些比较显示,所有比较的平均结构相似性度量(SSIM)约为0.8,表明与FDA批准的工具具有良好的稳定相关性。PyPeT的代码可在我们的GitHub上公开获取:https://github.com/Marijn311/CT-and-MR-Perfusion-Tool
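
灌注参数图通常由每个体素的时间-浓度曲线推导。下面是一个玩具级的 numpy 草图(并非 PyPeT 的实际算法;真实的 CBF、Tmax 还需要动脉输入函数和去卷积),仅演示达峰时间 TTP、用曲线下面积近似的 CBV 以及用一阶矩近似的 MTT;函数名与模拟曲线均为示例假设。

```python
import numpy as np

def toy_perfusion_maps(conc, t):
    """
    conc: (T, H, W) 每个体素的时间-浓度曲线(已做基线扣除)
    t:    (T,)      采样时间(秒)
    返回 TTP(达峰时间)、CBV 近似(曲线下面积)、MTT 近似(一阶矩/零阶矩)。
    """
    ttp = t[np.argmax(conc, axis=0)]                     # 达峰时间
    area = np.trapz(conc, t, axis=0)                     # 曲线下面积 ~ CBV
    first_moment = np.trapz(conc * t[:, None, None], t, axis=0)
    mtt = first_moment / (area + 1e-8)                   # 一阶矩近似的 MTT
    return ttp, area, mtt

# 玩具示例:用伽马变异型函数模拟一条团注曲线
t = np.linspace(0, 60, 61)
shifted = np.maximum(t - 5, 0)
curve = shifted ** 2 * np.exp(-shifted / 4.0)
conc = np.tile(curve[:, None, None], (1, 2, 2))          # (T, 2, 2) 的小体积
ttp, cbv, mtt = toy_perfusion_maps(conc, t)
print(ttp[0, 0], cbv[0, 0], mtt[0, 0])
```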

论文及项目相关链接

PDF

Summary

PyPeT是一种公开可用的Python灌注工具,用于处理头部计算机断层灌注(CTP)和磁共振灌注(MRP)数据,生成定量脑血流动力学图。PyPeT能够从原始四维灌注数据中生成脑血流量(CBF)、脑血容量(CBV)、平均通过时间(MTT)、达峰时间(TTP)和最大值时间(Tmax)图。其目标是使灌注研究尽可能易于获取和定制。通过与三种FDA批准的商用灌注工具和一种研究工具的比较验证,PyPeT的输出与这些工具表现出良好且稳定的相关性(平均SSIM约为0.8)。

Key Takeaways

  1. PyPeT是一个用于处理CTP和MRP数据的开源Python工具。
  2. 它能够生成多种灌注参数图,包括CBF、CBV、MTT、TTP和Tmax。
  3. PyPeT具有统一的框架,可处理CTP和MRP数据。
  4. 该工具注重模块化、低计算负担和详细的在线文档。
  5. PyPeT具有广泛的调试模式,可以可视化每一步的处理过程。
  6. 通过与FDA批准的商用灌注工具比较,PyPeT表现出良好且稳定的相关性(平均SSIM约0.8)。

Cool Papers

点此查看论文截图

SF-Recon: Simplification-Free Lightweight Building Reconstruction via 3D Gaussian Splatting

Authors:Zihan Li, Tengfei Wang, Wentian Gan, Hao Zhan, Xin Wang, Zongqian Zhan

Lightweight building surface models are crucial for digital city, navigation, and fast geospatial analytics, yet conventional multi-view geometry pipelines remain cumbersome and quality-sensitive due to their reliance on dense reconstruction, meshing, and subsequent simplification. This work presents SF-Recon, a method that directly reconstructs lightweight building surfaces from multi-view images without post-hoc mesh simplification. We first train an initial 3D Gaussian Splatting (3DGS) field to obtain a view-consistent representation. Building structure is then distilled by a normal-gradient-guided Gaussian optimization that selects primitives aligned with roof and wall boundaries, followed by multi-view edge-consistency pruning to enhance structural sharpness and suppress non-structural artifacts without external supervision. Finally, a multi-view depth-constrained Delaunay triangulation converts the structured Gaussian field into a lightweight, structurally faithful building mesh. Based on a proposed SF dataset, the experimental results demonstrate that our SF-Recon can directly reconstruct lightweight building models from multi-view imagery, achieving substantially fewer faces and vertices while maintaining computational efficiency. Website:https://lzh282140127-cell.github.io/SF-Recon-project/

轻量级建筑表面模型对于数字城市、导航和快速地理空间分析至关重要。然而,传统的多视角几何流水线仍然很笨拙且对质量敏感,因为它们依赖于密集重建、网格化以及随后的简化。这项工作提出了SF-Recon方法,一种直接从多视角图像重建轻量级建筑表面的方法,无需后续的网格简化。我们首先训练初始的3D高斯喷涂(3DGS)场以获得视图一致表示。然后,通过法线梯度引导的高斯优化提取建筑结构,该优化选择与屋顶和墙壁边界对齐的基本形状,随后通过多视角边缘一致性修剪增强结构清晰度,并在无需外部监督的情况下抑制非结构伪影。最后,多视角深度约束Delaunay三角剖分将结构化高斯场转换为轻量级、结构忠实的建筑网格。基于提出的SF数据集,实验结果表明,我们的SF-Recon能够直接从多视角图像重建轻量级建筑模型,在保持计算效率的同时实现更少的面和顶点。网站链接:https://lzh282140127-cell.github.io/SF-Recon-project/

论文及项目相关链接

PDF

Summary

本文介绍了一种名为SF-Recon的方法,可从多视角图像直接重建轻量级建筑表面模型,无需后续网格简化。该方法先训练初始3D高斯喷涂(3DGS)场以获得视角一致的表示,再通过法线梯度引导的高斯优化选择与屋顶和墙壁边界对齐的基元,随后进行多视角边缘一致性修剪,最后通过多视角深度约束的Delaunay三角剖分将结构化高斯场转换为轻量级、结构忠实的建筑网格。

Key Takeaways

  1. SF-Recon方法可直接从多视角图像重建轻量级建筑表面模型,无需后续网格简化。
  2. 通过训练初始3D高斯喷射场获得视角一致的表示,为建筑结构的重建提供基础。
  3. 法线梯度引导的高斯优化选择与屋顶和墙壁边界对齐的基元。
  4. 多视角边缘一致性修剪增强结构清晰度,抑制非结构伪影。
  5. 多视角深度约束Delaunay三角剖分将结构化高斯场转化为轻量级、结构真实的建筑网格。
  6. SF-Recon方法在保证计算效率的同时,实现了建筑模型的轻量化。

Cool Papers

点此查看论文截图

Referring Camouflaged Object Detection With Multi-Context Overlapped Windows Cross-Attention

Authors:Yu Wen, Shuyong Gao, Shuping Zhang, Miao Huang, Lili Tao, Han Yang, Haozhe Xing, Lihe Zhang, Boxue Hou

Referring camouflaged object detection (Ref-COD) aims to identify hidden objects by incorporating reference information such as images and text descriptions. Previous research has transformed reference images with salient objects into one-dimensional prompts, yielding significant results. We explore ways to enhance performance through multi-context fusion of rich salient image features and camouflaged object features. Therefore, we propose RFMNet, which utilizes features from multiple encoding stages of the reference salient images and performs interactive fusion with the camouflage features at the corresponding encoding stages. Given that the features in salient object images contain abundant object-related detail information, performing feature fusion within local areas is more beneficial for detecting camouflaged objects. Therefore, we propose an Overlapped Windows Cross-attention mechanism to enable the model to focus more attention on the local information matching based on reference features. Besides, we propose the Referring Feature Aggregation (RFA) module to decode and segment the camouflaged objects progressively. Extensive experiments on the Ref-COD benchmark demonstrate that our method achieves state-of-the-art performance.

参照伪装目标检测(Ref-COD)旨在通过利用图像和文本描述等参考信息来识别隐藏的目标。先前的研究已经将具有显著目标的参考图像转换为一维提示,并获得了显著的结果。我们探索了通过融合丰富显著的图像特征和伪装目标特征的多上下文来增强性能的方法。因此,我们提出了RFMNet,它利用参考显著图像多个编码阶段的特征,并在相应的编码阶段与伪装特征进行交互融合。鉴于显著目标图像中的特征包含大量的目标相关详细信息,在局部区域进行特征融合对于检测伪装目标更为有利。因此,我们提出了一种重叠窗口交叉注意力机制,使模型能够基于参考特征更加关注局部信息匹配。此外,我们提出了参照特征聚合(RFA)模块,以逐步解码和分割伪装目标。在Ref-COD基准测试上的大量实验表明,我们的方法达到了最先进的性能。
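
重叠窗口交叉注意力的基本思想是:在相互重叠的局部窗口内,以伪装特征为查询、参考显著特征为键/值做交叉注意力,重叠区域的输出取平均。下面是一个示意性的 numpy 草图(并非 RFMNet 的实现),窗口大小、步长和特征维度均为示例假设。

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def overlapped_window_cross_attention(query_feat, ref_feat, win=4, stride=2):
    """
    query_feat: (H, W, C) 伪装图特征(作为 Q)
    ref_feat:   (H, W, C) 参考显著图特征(作为 K 和 V)
    在重叠的局部窗口内做交叉注意力,重叠处对输出取平均。
    """
    H, W, C = query_feat.shape
    out = np.zeros_like(query_feat, dtype=float)
    count = np.zeros((H, W, 1), dtype=float)
    scale = 1.0 / np.sqrt(C)
    for i in range(0, H - win + 1, stride):
        for j in range(0, W - win + 1, stride):
            q = query_feat[i:i + win, j:j + win].reshape(-1, C)
            kv = ref_feat[i:i + win, j:j + win].reshape(-1, C)
            attn = softmax(q @ kv.T * scale, axis=-1)       # (win*win, win*win)
            o = (attn @ kv).reshape(win, win, C)
            out[i:i + win, j:j + win] += o
            count[i:i + win, j:j + win] += 1.0
    return out / np.maximum(count, 1.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 16))
r = rng.normal(size=(8, 8, 16))
print(overlapped_window_cross_attention(x, r).shape)   # (8, 8, 16)
```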

论文及项目相关链接

PDF 12 pages, 7 figures, This work is supported by the National Natural Science Foundation of China (Grant No. 62203291)

Summary

本文介绍了如何通过融合参考图像的多上下文丰富特征来增强伪装对象检测的准确性。为此,提出了RFMNet模型,该模型利用参考显著图像的多阶段编码特征,并与伪装特征进行交互式融合。同时,提出了基于重叠窗口的交叉注意力机制和参考特征聚合模块,以提高局部信息匹配的准确性和逐步解码和分割伪装对象的能力。实验证明,该方法在Ref-COD基准测试中取得了最先进的性能。

Key Takeaways

  1. Ref-COD旨在通过融入参考信息,如图像和文字描述,来识别隐藏物体。
  2. 之前的研究将参考图像中的显著物体转化为一维提示,取得了显著成果。
  3. 提出通过多上下文融合增强性能,融合丰富显著的图像特征和伪装对象特征。
  4. 引入RFMNet模型,该模型利用参考显著图像的多阶段编码特征,并与伪装特征进行交互式融合。
  5. 局部区域进行特征融合对于检测伪装物体更有益。
  6. 提出了基于重叠窗口的交叉注意力机制,使模型能更关注基于参考特征的局部信息匹配。

Cool Papers

点此查看论文截图

MRIQT: Physics-Aware Diffusion Model for Image Quality Transfer in Neonatal Ultra-Low-Field MRI

Authors:Malek Al Abed, Sebiha Demir, Anne Groteklaes, Elodie Germani, Shahrooz Faghihroohi, Hemmen Sabir, Shadi Albarqouni

Portable ultra-low-field MRI (uLF-MRI, 0.064 T) offers accessible neuroimaging for neonatal care but suffers from low signal-to-noise ratio and poor diagnostic quality compared to high-field (HF) MRI. We propose MRIQT, a 3D conditional diffusion framework for image quality transfer (IQT) from uLF to HF MRI. MRIQT combines realistic K-space degradation for physics-consistent uLF simulation, v-prediction with classifier-free guidance for stable image-to-image generation, and an SNR-weighted 3D perceptual loss for anatomical fidelity. The model denoises from a noised uLF input conditioned on the same scan, leveraging volumetric attention-UNet architecture for structure-preserving translation. Trained on a neonatal cohort with diverse pathologies, MRIQT surpasses recent GAN and CNN baselines in PSNR by 15.3%, and by 1.78% over the state of the art, while physicians rated 85% of its outputs as good quality with clear pathology present. MRIQT enables high-fidelity, diffusion-based enhancement of portable ultra-low-field (uLF) MRI for reliable neonatal brain assessment.

便携式超低场MRI(uLF-MRI,0.064 T)为新生儿护理提供了易于获取的神经成像,但与高场(HF)MRI相比,其信噪比低、诊断质量较差。我们提出了MRIQT,一个用于从uLF到HF MRI的图像质量迁移(IQT)的3D条件扩散框架。MRIQT结合了用于物理一致的uLF模拟的逼真K空间退化、用于稳定图像到图像生成的v预测与无分类器引导,以及用于保证解剖保真度的SNR加权3D感知损失。该模型以同一扫描为条件,从加噪的uLF输入中去噪,并利用体积注意力U-Net架构实现保结构的转换。在具有多种病理的新生儿队列上训练后,MRIQT在PSNR上比近期的GAN和CNN基线高出15.3%,比最新方法高出1.78%;医生将其85%的输出评为质量良好且病理清晰可见。MRIQT实现了对便携式超低场(uLF)MRI的基于扩散的高保真增强,可用于可靠的新生儿脑评估。
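
摘要中的 v-预测与无分类器引导是扩散模型的通用技巧。下面用 numpy 给出一个最小示意(并非 MRIQT 的实现):按常见的 v-参数化定义 v = α_t·ε − σ_t·x₀,并演示推理时把有条件与无条件预测线性组合;噪声水平与引导权重均为示例假设。

```python
import numpy as np

def v_target(x0, eps, alpha_t, sigma_t):
    """常见的 v-参数化训练目标:v = alpha_t * eps - sigma_t * x0。"""
    return alpha_t * eps - sigma_t * x0

def classifier_free_guidance(pred_cond, pred_uncond, w=3.0):
    """无分类器引导:在无条件预测的基础上放大条件信息。"""
    return pred_uncond + w * (pred_cond - pred_uncond)

# 玩具示例
rng = np.random.default_rng(0)
x0, eps = rng.normal(size=(4, 4)), rng.normal(size=(4, 4))
alpha_t, sigma_t = 0.8, 0.6            # 满足 alpha_t^2 + sigma_t^2 = 1 的示例噪声水平
v = v_target(x0, eps, alpha_t, sigma_t)
guided = classifier_free_guidance(pred_cond=v, pred_uncond=0.5 * v, w=3.0)
print(v.shape, guided.shape)
```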

论文及项目相关链接

PDF 5 pages, 4 figures

Summary
便携式超低场磁共振成像(uLF-MRI)在新生儿护理中提供可访问的神经成像,但与高场(HF)MRI相比,存在信号噪声比低和诊断质量差的问题。提出MRIQT,一种从uLF到HF MRI的图像质量转移(IQT)的3D条件扩散框架。MRIQT结合逼真的K空间退化实现物理一致的uLF模拟、v预测和无分类器指导的稳定图像到图像生成,以及信噪比加权的3D感知损失,以实现解剖真实性。该模型从噪声uLF输入中学习,以相同的扫描为条件,利用体积注意力UNet架构进行结构保留翻译。在具有多种病理的新生儿队列上进行训练,MRIQT在PSNR上超越最近的GAN和CNN基线,高出15.3%,同时医生评价其输出的85%为质量良好,病理清晰。MRIQT实现了基于扩散的便携式超低场(uLF)MRI高质量增强,可用于可靠的新生儿脑评估。

Key Takeaways

  1. 便携式超低场磁共振成像(uLF-MRI)在新生儿护理中有应用,但信噪比低、诊断质量较差。
  2. 提出了MRIQT框架,通过3D条件扩散技术提高uLF-MRI的图像质量。
  3. MRIQT框架包括逼真的uLF模拟、稳定的图像生成和保证解剖保真度的感知损失。
  4. 使用体积注意力UNet架构进行保结构的转换,从加噪的uLF输入中去噪。
  5. 在新生儿队列上训练,MRIQT在图像质量上超越现有方法,医生评价其输出质量高且病理清晰。

Cool Papers

点此查看论文截图

CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model

Authors:Yuqi Zhang, Guanying Chen, Jiaxing Chen, Chuanyu Fu, Chuan Huang, Shuguang Cui

Reconstructing 3D scenes and synthesizing novel views from sparse input views is a highly challenging task. Recent advances in video diffusion models have demonstrated strong temporal reasoning capabilities, making them a promising tool for enhancing reconstruction quality under sparse-view settings. However, existing approaches are primarily designed for modest viewpoint variations, which struggle in capturing fine-grained details in close-up scenarios since input information is severely limited. In this paper, we present a diffusion-based framework, called CloseUpShot, for close-up novel view synthesis from sparse inputs via point-conditioned video diffusion. Specifically, we observe that pixel-warping conditioning suffers from severe sparsity and background leakage in close-up settings. To address this, we propose hierarchical warping and occlusion-aware noise suppression, enhancing the quality and completeness of the conditioning images for the video diffusion model. Furthermore, we introduce global structure guidance, which leverages a dense fused point cloud to provide consistent geometric context to the diffusion process, to compensate for the lack of globally consistent 3D constraints in sparse conditioning inputs. Extensive experiments on multiple datasets demonstrate that our method outperforms existing approaches, especially in close-up novel view synthesis, clearly validating the effectiveness of our design.

重建三维场景并从稀疏的输入视角合成新视角是一项极具挑战性的任务。视频扩散模型的最新进展展现出强大的时间推理能力,使其成为在稀疏视图设置下提升重建质量的有前景工具。然而,现有方法主要针对视点变化较小的场景,在近距离(特写)场景中难以捕捉精细细节,因为输入信息严重受限。在本文中,我们提出了一种基于扩散的框架,称为CloseUpShot,通过点条件视频扩散从稀疏输入进行近距离新视角合成。具体来说,我们观察到像素扭曲条件在近距离设置中存在严重的稀疏性和背景泄漏问题。为解决此问题,我们提出分层扭曲和遮挡感知噪声抑制,以提高视频扩散模型条件图像的质量和完整性。此外,我们引入了全局结构引导,利用密集融合的点云为扩散过程提供一致的几何上下文,以弥补稀疏条件输入中缺乏全局一致3D约束的不足。在多个数据集上的大量实验表明,我们的方法优于现有方法,特别是在近距离新视角合成方面,清楚地验证了我们设计的有效性。

论文及项目相关链接

PDF Project Link: https://zyqz97.github.io/CloseUpShot/

Summary
从稀疏输入重建三维场景并合成新视角是一项极具挑战性的任务,现有方法在近距离(特写)场景中尤其困难。文章提出一种基于扩散模型的框架CloseUpShot,通过点条件视频扩散来解决此问题,并针对像素扭曲条件中存在的严重稀疏性和背景泄漏问题提出分层扭曲与遮挡感知噪声抑制。此外,引入全局结构引导,以弥补稀疏条件输入中缺乏全局一致3D约束的不足。实验证明该方法在多个数据集上表现优异,特别是在近距离新视角合成中表现突出。

Key Takeaways

  1. 现有技术面临从稀疏输入重建三维场景的挑战,尤其是在近距离(特写)场景中合成新视角的任务。
  2. 一种名为CloseUpShot的基于扩散模型的框架被提出,用于解决上述问题。
  3. 文章指出像素偏移条件在近距离设置中存在严重稀疏性和背景泄漏问题,并提出解决方案。
  4. 通过引入全局结构指导,为扩散过程提供一致的几何上下文,以弥补稀疏输入中的不足。
  5. 实验证明CloseUpShot在多个数据集上的表现优于现有方法,特别是在近距离新视角合成方面。
  6. 该框架设计有效,能够显著提高重建质量和合成新视角的准确性。

Cool Papers

点此查看论文截图

Wide-Field X-ray Polarimetry for High Energy Astronomical Transients: First results of the pathfinder CXPD Cubesat Mission

Authors:Hong-Bang Liu, Zu-Ke Feng, Huan-Bo Feng, Di-Fan Yi, Li-Rong Xie, Yan-Jun Xie, Zong-Wang Fan, Jin Zhang, Wen-Jin Xie, Xue-Feng Huang, Wei Deng, Fei Xie, Dong Wang, Zi-Li Li, Hui Wang, Ran Chen, Shi-Qiang Zhou, Kai Chen, Jin Li, Qian Liu, Shi Chen, Rui-Ting Ma, Bin-Long Wang, Zhen-Yu Tang, Hang-Zhou Li, Bo Peng, Shu-Lin Liu, Xiang-Ming Sun, Yang-Heng Zheng, En-Wei Liang

The Low Energy Polarization Detector (LPD) is a key component of the next-generation large-scale Gamma-Ray Burst polarimeter, POLAR-2. It is designed for polarization observations of transient sources in the soft X-ray energy range with a wide field of view (FOV). To validate the key technologies required for wide-FOV X-ray polarization measurements, the Cosmic X-ray Polarization Detector (CXPD) CubeSat was developed as a prototype for the LPD. The CXPD is equipped with two Gas Microchannel Plate Pixel Detectors (GMPDs) that measure X-ray polarization via the photoelectric effect, where ejected photoelectrons produce ionization tracks in the gas which are imaged to reconstruct their emission directions. Laboratory calibrations of the modulation factor and energy spectra were successfully performed using linear polarized X-ray sources at 2.98 keV, 4.51 keV, 6.40 keV, and 8.05 keV. Since its launch in June 2023, the CXPD has successfully completed critical in-orbit technology verification. It has also performed polarization observations of two bright X-ray sources Sco X-1 and the transient Swift J1727.8-1613 yielding constraints on their polarization degrees and angles. Notably, this was the first time that an anti-coincidence detector had been implemented in an X-ray polarimeter, enabling in-orbit verification of the charged-particle background rejection algorithm. These results demonstrate the feasibility of wide-field soft X-ray polarization measurements and provide essential guidance for the development of the LPD for the POLAR-2 mission, thereby advancing the frontier of X-ray polarization astronomy.

低能偏振探测器(LPD)是下一代大型伽马射线暴偏振仪POLAR-2的关键组件,设计用于在软X射线能段以宽视场(FOV)对瞬变源进行偏振观测。为了验证宽视场X射线偏振测量所需的关键技术,研制了宇宙X射线偏振探测器(CXPD)立方星作为LPD的原型。CXPD配备了两个气体微通道板像素探测器(GMPD),通过光电效应测量X射线偏振:出射的光电子在气体中产生电离径迹,通过对径迹成像来重建其出射方向。利用2.98 keV、4.51 keV、6.40 keV和8.05 keV的线偏振X射线源,成功完成了调制因子和能谱的实验室标定。自2023年6月发射以来,CXPD已成功完成关键的在轨技术验证,并对两个明亮的X射线源Sco X-1和瞬变源Swift J1727.8-1613进行了偏振观测,对它们的偏振度和偏振角给出了约束。值得注意的是,这是首次在X射线偏振仪中采用反符合探测器,实现了带电粒子本底抑制算法的在轨验证。这些结果证明了宽视场软X射线偏振测量的可行性,为POLAR-2任务中LPD的研制提供了重要指导,从而推动了X射线偏振天文学的前沿发展。
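
光电效应型 X 射线偏振测量通常通过光电子出射角的调制曲线 N(φ) = A·[1 + μ·cos 2(φ − φ₀)] 来确定调制因子 μ 和偏振角 φ₀。下面是一个示意性草图(并非 CXPD 的标定流程),用 scipy 对模拟的角度直方图做曲线拟合;真值参数与分箱数均为示例假设。

```python
import numpy as np
from scipy.optimize import curve_fit

def modulation_curve(phi, A, mu, phi0):
    """N(phi) = A * (1 + mu * cos(2*(phi - phi0)))"""
    return A * (1.0 + mu * np.cos(2.0 * (phi - phi0)))

# 模拟一组光电子出射角直方图(真值:mu=0.5, phi0=30°)
rng = np.random.default_rng(0)
phi = np.linspace(0, np.pi, 36, endpoint=False)
truth = modulation_curve(phi, A=1000.0, mu=0.5, phi0=np.deg2rad(30))
counts = rng.poisson(truth).astype(float)

popt, _ = curve_fit(modulation_curve, phi, counts, p0=[counts.mean(), 0.3, 0.0])
A_fit, mu_fit, phi0_fit = popt
print(f"拟合调制因子 mu ≈ {mu_fit:.3f}, 偏振角 ≈ {np.rad2deg(phi0_fit) % 180:.1f}°")
```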

论文及项目相关链接

PDF

Summary

该文本介绍了低能量偏振检测器(LPD)在下一代大规模伽马射线爆发偏振仪POLAR-2中的关键作用。为验证宽视场X射线偏振测量的关键技术的可行性,开发了宇宙X射线偏振探测器(CXPD)立方体卫星作为LPD的原型。CXPD配备了两个气体微通道板像素探测器(GMPDs),通过光电效应测量X射线偏振。实验室成功使用线性偏振X射线源对调制因子和能量谱进行了校准。自2023年6月发射以来,CXPD成功完成了关键在轨技术验证,并对两个明亮的X射线源进行了偏振观测,为LPD的发展提供了重要指导,推动了X射线偏振天文学的发展。

Key Takeaways

  1. LPD是POLAR-2的关键组件,用于软X射线能量范围的瞬态源的偏振观测,具有宽视场。
  2. CXPD作为LPD的原型,验证了宽视场X射线偏振测量的关键技术。
  3. CXPD采用GMPDs通过光电效应测量X射线偏振。
  4. 实验室校准成功,使用线性偏振X射线源验证了调制因子和能量谱。
  5. CXPD成功完成在轨技术验证,并观测了Sco X-1和Swift J1727.8-1613两个明亮X射线源的偏振。
  6. 首次在X射线偏振仪中实现反符合探测器,验证了带电粒子背景抑制算法。

Cool Papers

点此查看论文截图

FGNet: Leveraging Feature-Guided Attention to Refine SAM2 for 3D EM Neuron Segmentation

Authors:Zhenghua Li, Hang Chen, Zihao Sun, Kai Li, Xiaolin Hu

Accurate segmentation of neural structures in Electron Microscopy (EM) images is paramount for neuroscience. However, this task is challenged by intricate morphologies, low signal-to-noise ratios, and scarce annotations, limiting the accuracy and generalization of existing methods. To address these challenges, we seek to leverage the priors learned by visual foundation models on a vast amount of natural images to better tackle this task. Specifically, we propose a novel framework that can effectively transfer knowledge from Segment Anything 2 (SAM2), which is pre-trained on natural images, to the EM domain. We first use SAM2 to extract powerful, general-purpose features. To bridge the domain gap, we introduce a Feature-Guided Attention module that leverages semantic cues from SAM2 to guide a lightweight encoder, the Fine-Grained Encoder (FGE), in focusing on these challenging regions. Finally, a dual-affinity decoder generates both coarse and refined affinity maps. Experimental results demonstrate that our method achieves performance comparable to state-of-the-art (SOTA) approaches with the SAM2 weights frozen. Upon further fine-tuning on EM data, our method significantly outperforms existing SOTA methods. This study validates that transferring representations pre-trained on natural images, when combined with targeted domain-adaptive guidance, can effectively address the specific challenges in neuron segmentation.

在电子显微镜(EM)图像中精确分割神经结构对神经科学至关重要。然而,这一任务面临形态复杂、信噪比低和标注稀缺等挑战,限制了现有方法的准确性和泛化能力。为应对这些挑战,我们试图利用视觉基础模型在大量自然图像上学到的先验知识来更好地解决该任务。具体来说,我们提出了一个新框架,可有效地将预训练于自然图像的Segment Anything 2(SAM2)的知识迁移到EM领域。我们首先使用SAM2提取强大的通用特征。为了弥合领域差距,我们引入了特征引导注意力模块,利用SAM2的语义线索指导轻量级的精细粒度编码器(Fine-Grained Encoder, FGE)关注具有挑战性的区域。最后,双亲和力解码器生成粗略和精细化的亲和力图。实验结果表明,在冻结SAM2权重的情况下,我们的方法即可达到与最先进(SOTA)方法相当的性能;在EM数据上进一步微调后,我们的方法显著优于现有SOTA方法。这项研究证实,在自然图像上预训练的表示迁移,结合有针对性的领域自适应引导,能够有效应对神经元分割中的特定挑战。

论文及项目相关链接

PDF

Summary

基于电子显微镜(EM)图像的神经网络结构分割对神经科学至关重要。针对复杂形态、低信噪比和标注稀缺等挑战,本研究提出利用在大量自然图像上预训练的视觉基础模型的先验知识,以更好地解决此任务。通过引入新型框架,有效迁移Segment Anything 2(SAM2)的知识至EM领域。首先使用SAM2提取通用特征,并引入特征引导注意力模块以缩小领域差距。该模块利用SAM2的语义线索引导轻量级编码器(FGE)关注于挑战性区域。最后,双亲和力解码器生成粗粒度和精细化的亲和力地图。实验结果显示,在冻结SAM2权重的情况下,该方法性能与最先进的(SOTA)方法相当。在EM数据上进行微调后,该方法显著优于现有SOTA方法。此研究验证了结合有针对性的领域自适应指导,在预训练的自然图像上表示转移可以有效地解决神经元分割中的特定挑战。

Key Takeaways

  1. 电子显微镜(EM)图像的神经网络结构分割对神经科学研究至关重要。
  2. 当前方法面临挑战,包括复杂形态、低信噪比和标注稀缺等。
  3. 研究提出利用视觉基础模型(如在自然图像上预训练的模型)来提高分割准确性。
  4. 引入新型框架,有效迁移SAM2的知识至EM领域。
  5. 使用SAM2提取通用特征,并引入特征引导注意力模块来关注挑战性区域。
  6. 实验结果显示,该方法性能与最先进的(SOTA)方法相当,并在微调后显著优于现有方法。

Cool Papers

点此查看论文截图

ViSS-R1: Self-Supervised Reinforcement Video Reasoning

Authors:Bo Fang, Yuxin Song, Qiangqiang Wu, Haoyuan Sun, Wenhao Wu, Antoni B. Chan

Complex video reasoning remains a significant challenge for Multimodal Large Language Models (MLLMs), as current R1-based methodologies often prioritize text-centric reasoning derived from text-based and image-based developments. In video tasks, such strategies frequently underutilize rich visual information, leading to potential shortcut learning and increased susceptibility to hallucination. To foster a more robust, visual-centric video understanding, we start by introducing a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline, in which positive rewards are assigned for correctly solving pretext tasks on transformed visual inputs, which makes the model to non-trivially process the visual information. Building on the effectiveness of Pretext-GRPO, we further propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM’s R1 post-training paradigm. Instead of relying solely on sparse visual cues, our framework compels models to reason about transformed visual input by simultaneously processing both pretext questions (concerning transformations) and true user queries. This necessitates identifying the applied transformation and reconstructing the original video to formulate accurate final answers. Comprehensive evaluations on six widely-used video reasoning and understanding benchmarks demonstrate the effectiveness and superiority of our Pretext-GRPO and ViSS-R1 for complex video reasoning. Our codes and models will be publicly available.

复杂视频推理对于多模态大型语言模型(MLLMs)来说仍然是一个重大挑战,因为当前基于R1的方法往往优先采用源自文本和图像任务的以文本为中心的推理。在视频任务中,这些策略经常未能充分利用丰富的视觉信息,导致潜在的捷径学习并更易产生幻觉。为了促进更稳健、以视觉为中心的视频理解,我们首先在标准R1流程中引入了一种新型自监督强化学习GRPO算法(Pretext-GRPO):当模型正确解决施加在变换后视觉输入上的前置任务(pretext task)时给予正奖励,从而迫使模型实质性地处理视觉信息。基于Pretext-GRPO的有效性,我们进一步提出了ViSS-R1框架,将基于前置任务的自监督学习简化并直接整合进MLLM的R1后训练范式。我们的框架不再仅依赖稀疏的视觉线索,而是迫使模型同时处理前置任务问题(关于所施加的变换)和真实用户查询,从而对变换后的视觉输入进行推理。这需要识别出所施加的变换并重建原始视频,以形成准确的最终答案。在六个广泛使用的视频推理和理解基准上的综合评估证明了Pretext-GRPO和ViSS-R1在复杂视频推理中的有效性和优越性。我们的代码和模型将公开。
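
Pretext-GRPO 的核心想法是:对输入施加一个已知变换,让模型回答施加了哪种变换,答对才得到正奖励,从而迫使模型真正利用视觉信息。下面是一个与具体模型无关的玩具草图(并非论文实现),以帧旋转作为前置任务并计算 0/1 奖励;其中 model_predict 是假设的占位函数。

```python
import numpy as np

TRANSFORMS = {0: 0, 1: 90, 2: 180, 3: 270}   # 前置任务:预测旋转角度(度)

def apply_pretext_transform(frames, label):
    """frames: (T, H, W) 的视频帧;按 label 指定的 90 度倍数旋转每一帧。"""
    return np.rot90(frames, k=label, axes=(1, 2))

def pretext_reward(pred_label, true_label):
    """简化的前置任务奖励:答对得 1,否则得 0。"""
    return 1.0 if pred_label == true_label else 0.0

def model_predict(frames):
    """占位函数:真实系统中这里是 MLLM 对“视频被如何变换”的回答。"""
    return 0

rng = np.random.default_rng(0)
frames = rng.random((8, 32, 32))
true_label = int(rng.integers(0, 4))
transformed = apply_pretext_transform(frames, true_label)
print(pretext_reward(model_predict(transformed), true_label))
```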

论文及项目相关链接

PDF Our paper was initially titled “Video-SSR1: Self-Supervised Reinforcement Video Reasoning.” Upon noticing its close resemblance to the title of a recently released paper, we have decided to rename our work as “ViSS-R1.”

Summary

针对多模态大型语言模型(MLLMs)在复杂视频推理方面存在的挑战,本文提出一种新型的自监督强化学习GRPO算法(Pretext-GRPO),并将其纳入标准R1流程中。Pretext-GRPO为正确解决施加在变换后视觉输入上的前置任务(pretext task)分配正奖励,使模型必须实质性地处理视觉信息。在此基础上,进一步提出了ViSS-R1框架,该框架将基于前置任务的自监督学习简化并直接融入MLLM的R1后训练范式中。该框架要求模型同时处理前置任务问题和真实用户查询,通过对变换后的视觉输入进行推理来准确回答问题。在六个广泛使用的视频推理和理解基准上的综合评估证明了Pretext-GRPO和ViSS-R1在复杂视频推理中的有效性和优越性。

Key Takeaways

  1. 当前的多模态大型语言模型(MLLMs)在处理视频任务时存在挑战,尤其是复杂视频推理方面。
  2. 现有的R1方法倾向于文本中心化的推理,未能充分利用视频中的丰富视觉信息。
  3. 引入自我监督强化学习GRPO算法(Pretext-GRPO),有效处理视觉信息,提高模型对视频的解读能力。
  4. 提出ViSS-R1框架,将基于前置任务的自监督学习与R1后训练范式结合,要求模型同时处理前置任务问题和真实用户查询。
  5. ViSS-R1框架通过处理转换的视觉输入进行推理,准确回答问题,提高了模型的实用性。
  6. Pretext-GRPO和ViSS-R1在多个基准测试中表现出优越性能,证明其有效性。
  7. 公开可用相关代码和模型。

Cool Papers

点此查看论文截图

Towards 3D Object-Centric Feature Learning for Semantic Scene Completion

Authors:Weihua Wang, Yubo Cui, Xiangru Lin, Zhiheng Li, Zheng Fang

Vision-based 3D Semantic Scene Completion (SSC) has received growing attention due to its potential in autonomous driving. While most existing approaches follow an ego-centric paradigm by aggregating and diffusing features over the entire scene, they often overlook fine-grained object-level details, leading to semantic and geometric ambiguities, especially in complex environments. To address this limitation, we propose Ocean, an object-centric prediction framework that decomposes the scene into individual object instances to enable more accurate semantic occupancy prediction. Specifically, we first employ a lightweight segmentation model, MobileSAM, to extract instance masks from the input image. Then, we introduce a 3D Semantic Group Attention module that leverages linear attention to aggregate object-centric features in 3D space. To handle segmentation errors and missing instances, we further design a Global Similarity-Guided Attention module that leverages segmentation features for global interaction. Finally, we propose an Instance-aware Local Diffusion module that improves instance features through a generative process and subsequently refines the scene representation in the BEV space. Extensive experiments on the SemanticKITTI and SSCBench-KITTI360 benchmarks demonstrate that Ocean achieves state-of-the-art performance, with mIoU scores of 17.40 and 20.28, respectively.

基于视觉的3D语义场景补全(SSC)因其自动驾驶潜力而日益受到关注。虽然大多数现有方法采用以自我为中心的方法,通过在整个场景上聚集和扩散特征,但它们往往忽视了精细的对象级细节,导致语义和几何模糊,特别是在复杂环境中。为了解决这一局限性,我们提出了Ocean,一个以对象为中心的预测框架,它将场景分解成单个对象实例,以实现更准确的语义占用预测。具体来说,我们首先采用轻量级的分割模型MobileSAM从输入图像中提取实例掩码。然后,我们引入了一个3D语义组注意模块,该模块利用线性注意力在3D空间中聚集以对象为中心的特征。为了处理分割错误和缺失的实例,我们进一步设计了一个全局相似性引导注意模块,该模块利用分割特征进行全局交互。最后,我们提出了一个实例感知的局部扩散模块,该模块通过生成过程改进实例特征,然后细化鸟瞰空间中的场景表示。在SemanticKITTI和SSCBench-KITTI360基准测试上的大量实验表明,Ocean达到了最先进的性能,mIoU得分分别为17.40和20.28。

论文及项目相关链接

PDF Accept by AAAI-2026

Summary

本文提出一种面向对象的预测框架Ocean,用于解决基于视觉的3D语义场景补全(SSC)中的语义和几何模糊问题。该框架通过将场景分解为个体对象实例,实现更精确的语义占用预测。使用MobileSAM提取实例掩膜,引入3D语义组注意力模块和全局相似性引导注意力模块优化特征提取,最后通过实例感知局部扩散模块完善场景表示。在SemanticKITTI和SSCBench-KITTI360基准测试中,Ocean取得最先进的性能表现。

Key Takeaways

  1. Ocean是一个面向对象的预测框架,用于解决视觉3D语义场景补全中的语义和几何模糊问题。
  2. Ocean通过将场景分解为个体对象实例实现更精确的语义占用预测。
  3. 使用MobileSAM提取实例掩膜。
  4. Ocean引入3D语义组注意力模块,利用线性注意力在3D空间中聚合对象级特征。
  5. 为处理分割错误和缺失实例,Ocean设计了全局相似性引导注意力模块,利用分割特征进行全局交互。
  6. Ocean提出实例感知局部扩散模块,通过生成过程改进实例特征,并细化BEV空间中的场景表示。

Cool Papers

点此查看论文截图

R$^{2}$Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

Authors:Shuaike Shen, Ke Liu, Jiaqing Xie, Shangde Gao, Chunhua Shen, Ge Liu, Mireia Crispin-Ortuzar, Shangqi Gao

Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R$^{2}$Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R$^{2}$Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models. Code are available at https://github.com/Eurekashen/R2Seg.

医学图像分割中的基础模型在超出分布(OOD)的情况下会遇到挑战,经常在OOD肿瘤上产生分散的误报。我们推出了R$^{2}$Seg,这是一种无需训练的鲁棒OOD肿瘤分割框架,它通过两阶段的Reason-and-Reject流程进行工作。首先,Reason步骤采用LLM引导的解剖结构推理规划器来定位器官锚点并生成多尺度ROI。其次,Reject步骤对冻结的基础模型(BiomedParse)在这些ROI内生成的候选对象应用双样本统计测试。这种统计拒绝过滤器仅保留与正常组织有显著差异的候选对象,有效地抑制了误报。我们的框架无需更新参数,使其兼容零更新测试时间增强技术并避免灾难性遗忘。在多中心和跨模态肿瘤分割基准测试中,R$^{2}$Seg相较于强大的基准模型和原始基础模型在Dice系数、特异性和敏感性方面都有显著提高。代码可在https://github.com/Eurekashen/R2Seg找到。
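
Reject 步骤的统计拒绝过滤可以用一个简单的双样本检验来示意:比较候选区域内部与其周围“正常组织”环带的强度分布,只有显著不同才保留。下面是一个示意性草图(并非 R$^{2}$Seg 的实现),使用 Mann-Whitney U 检验;环带宽度与显著性水平均为示例假设。

```python
import numpy as np
from scipy import ndimage
from scipy.stats import mannwhitneyu

def reject_candidate(image, cand_mask, ring_width=5, alpha=0.01):
    """
    image:     2D 灰度图
    cand_mask: 候选区域的布尔掩膜
    返回 True 表示保留该候选(与周围正常组织显著不同),False 表示拒绝。
    """
    dilated = ndimage.binary_dilation(cand_mask, iterations=ring_width)
    ring = dilated & ~cand_mask                 # 候选周围的一圈“正常组织”
    inside, outside = image[cand_mask], image[ring]
    if inside.size < 10 or outside.size < 10:
        return False
    _, p = mannwhitneyu(inside, outside, alternative="two-sided")
    return p < alpha

# 玩具示例:背景服从 N(0,1),候选区域内均值抬高
rng = np.random.default_rng(0)
img = rng.normal(0, 1, size=(64, 64))
mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 20:30] = True
img[mask] += 2.0
print(reject_candidate(img, mask))   # 预期 True:显著高于周围组织
```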

论文及项目相关链接

PDF

Summary

R$^{2}$Seg框架解决了医学图像分割中的OOD肿瘤分割问题。该框架无需训练,通过两阶段的Reason-and-Reject过程进行稳健的OOD肿瘤分割。首先,使用LLM引导的解剖推理规划器定位器官锚点并生成多尺度ROI;然后,对冻结的基础模型(BiomedParse)在这些ROI内生成的候选结果进行双样本统计检验,仅保留与正常组织显著不同的候选,从而抑制假阳性。该框架无需参数更新,与零更新测试时间增强兼容,避免了灾难性遗忘。在多中心和多模态肿瘤分割基准测试中,R$^{2}$Seg较强基线和原始基础模型在Dice系数、特异性和敏感性方面均有显著提高。

Key Takeaways

  1. R$^{2}$Seg框架解决了医学图像分割中的OOD肿瘤分割问题。
  2. R$^{2}$Seg采用两阶段的Reason-and-Reject过程进行稳健的OOD肿瘤分割。
  3. 框架使用LLM引导的解剖推理规划器定位器官锚点并生成多尺度ROI。
  4. 在冻结的基础模型生成的候选样本中,利用两样本统计测试进行筛选。
  5. R$^{2}$Seg无需参数更新,避免了灾难性遗忘问题。
  6. 与零更新测试时间增强兼容。
  7. 在基准测试中,R$^{2}$Seg在Dice系数、特异性和敏感性方面显著提高。

Cool Papers

点此查看论文截图

Medical Knowledge Intervention Prompt Tuning for Medical Image Classification

Authors:Ye Du, Nanxi Yu, Shujun Wang

Vision-language foundation models (VLMs) have shown great potential in feature transfer and generalization across a wide spectrum of medical-related downstream tasks. However, fine-tuning these models is resource-intensive due to their large number of parameters. Prompt tuning has emerged as a viable solution to mitigate memory usage and reduce training time while maintaining competitive performance. Nevertheless, the challenge is that existing prompt tuning methods cannot precisely distinguish different kinds of medical concepts, which miss essentially specific disease-related features across various medical imaging modalities in medical image classification tasks. We find that Large Language Models (LLMs), trained on extensive text corpora, are particularly adept at providing this specialized medical knowledge. Motivated by this, we propose incorporating LLMs into the prompt tuning process. Specifically, we introduce the CILMP, Conditional Intervention of Large Language Models for Prompt Tuning, a method that bridges LLMs and VLMs to facilitate the transfer of medical knowledge into VLM prompts. CILMP extracts disease-specific representations from LLMs, intervenes within a low-rank linear subspace, and utilizes them to create disease-specific prompts. Additionally, a conditional mechanism is incorporated to condition the intervention process on each individual medical image, generating instance-adaptive prompts and thus enhancing adaptability. Extensive experiments across diverse medical image datasets demonstrate that CILMP consistently outperforms state-of-the-art prompt tuning methods, demonstrating its effectiveness. Code is available at https://github.com/usr922/cilmp.

视觉语言基础模型(VLMs)在广泛的医疗相关下游任务中展现出特征迁移和泛化的巨大潜力。然而,由于参数众多,对这些模型进行微调需要消耗大量资源。提示调整(prompt tuning)作为一种可行的解决方案,可以降低内存占用并减少训练时间,同时保持有竞争力的性能。然而,挑战在于现有的提示调整方法无法精确区分不同的医学概念,从而在医学图像分类任务中遗漏了各种医学成像模态下与特定疾病相关的特征。我们发现,在海量文本语料库上训练的大型语言模型(LLMs)特别擅长提供这类专业医学知识。受此启发,我们提出将LLMs纳入提示调整过程。具体来说,我们引入了CILMP(Conditional Intervention of Large Language Models for Prompt Tuning,用于提示调整的大型语言模型条件干预),一种连接LLMs和VLMs、便于将医学知识迁移到VLM提示中的方法。CILMP从LLMs中提取疾病特异性表示,在低秩线性子空间内进行干预,并利用它们创建疾病特异性提示。此外,还引入了一种条件机制,使干预过程以每张医学图像为条件,生成实例自适应的提示,从而增强适应性。在多种医学图像数据集上的大量实验表明,CILMP始终优于最先进的提示调整方法,证明了其有效性。代码可在 https://github.com/usr922/cilmp 获取。
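
CILMP 的关键步骤是把 LLM 给出的疾病相关表示限制在一个低秩线性子空间内,再用它干预提示向量。下面是一个极简的 numpy 草图(并非论文实现),用 SVD 取若干 LLM 表示的前 r 个主方向作为子空间,把干预向量投影进去后按一定强度叠加到提示上;变量名、维度与强度均为示例假设。

```python
import numpy as np

def low_rank_subspace(llm_vectors, rank=4):
    """llm_vectors: (n, d) 若干疾病相关的 LLM 表示;返回 (rank, d) 的子空间基。"""
    centered = llm_vectors - llm_vectors.mean(axis=0, keepdims=True)
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return Vt[:rank]                                  # 前 rank 个右奇异向量

def intervene_prompt(prompt, disease_vec, basis, strength=0.1):
    """把疾病向量投影到低秩子空间后,按给定强度叠加到提示向量上。"""
    proj = basis.T @ (basis @ disease_vec)            # 子空间内的分量
    return prompt + strength * proj

rng = np.random.default_rng(0)
d = 64
llm_reprs = rng.normal(size=(16, d))                  # 假设:16 条疾病描述的 LLM 表示
basis = low_rank_subspace(llm_reprs, rank=4)
prompt = rng.normal(size=d)                           # 假设:某个可学习的提示向量
new_prompt = intervene_prompt(prompt, llm_reprs[0], basis, strength=0.1)
print(new_prompt.shape)
```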

论文及项目相关链接

PDF IEEE Transactions on Medical Imaging (Early Access) July 2025

Summary

本文介绍了将大型语言模型(LLMs)应用于医学图像分类任务中的提示调整过程的方法。作者提出了一种名为CILMP的新方法,该方法结合LLMs和视觉语言模型(VLMs),从LLMs中提取疾病特异性表示,并在低阶线性子空间内进行干预,以创建疾病特异性提示。CILMP还融入了一种条件机制,根据每张医学图像生成实例适应性提示,从而提高适应性。实验证明,CILMP在多种医学图像数据集上的表现均优于现有提示调整方法。

Key Takeaways

  1. VLMs在特征迁移和泛化方面具有巨大潜力,但参数众多导致资源密集型精细调整。
  2. 提示调整是一种降低内存使用、减少训练时间并保持竞争力的解决方案。
  3. 现有提示调整方法无法精确区分不同的医学概念,导致医学图像分类任务中特定疾病相关特征的丢失。
  4. LLMs在提供医学知识方面表现出色。
  5. CILMP方法结合了LLMs和VLMs,通过从LLMs中提取疾病特异性表示并干预低阶线性子空间,创建疾病特异性提示。
  6. CILMP融入了条件机制,为每张医学图像生成实例适应性提示,提高适应性。

Cool Papers

点此查看论文截图

Explainable deep learning framework for cancer therapeutic target prioritization leveraging PPI centrality and node embeddings

Authors:Adham M. Alkhadrawi, Kyungsu Kim, Arif M. Rahman

We developed an explainable deep learning framework integrating protein-protein interaction (PPI) network centrality metrics with node embeddings for cancer therapeutic target prioritization. A high-confidence PPI network was constructed from STRING database interactions, computing six centrality metrics: degree, strength, betweenness, closeness, eigenvector centrality, and clustering coefficient. Node2Vec embeddings captured latent network topology. Combined features trained XGBoost and neural network classifiers using DepMap CRISPR essentiality scores as ground truth. Model interpretability was assessed through GradientSHAP analysis quantifying feature contributions. We developed a novel blended scoring approach combining model probability predictions with SHAP attribution magnitudes for enhanced gene prioritization. Our framework achieved state-of-the-art performance with AUROC of 0.930 and AUPRC of 0.656 for identifying the top 10% most essential genes. GradientSHAP analysis revealed centrality measures contributed significantly to predictions, with degree centrality showing strongest correlation ($ρ$ = -0.357) with gene essentiality. The blended scoring approach created robust gene prioritization rankings, successfully identifying known essential genes including ribosomal proteins (RPS27A, RPS17, RPS6) and oncogenes (MYC). This study presents a human-based, combinatorial \textit{in silico} framework successfully integrating network biology with explainable AI for therapeutic target discovery. The framework provides mechanistic transparency through feature attribution analysis while maintaining state-of-the-art predictive performance. Its reproducible design and reliance on human molecular datasets demonstrate a reduction-to-practice example of next-generation, animal-free modeling for cancer therapeutic target discovery and prioritization.

我们开发了一个可解释的深度学习框架,该框架整合了蛋白质-蛋白质相互作用(PPI)网络中心性度量和节点嵌入技术,用于癌症治疗靶点的优先排序。我们从STRING数据库的相互作用中构建了一个高可信度的PPI网络,计算了六种中心性度量:度中心性、强度、介数中心性、接近中心性、特征向量中心性和聚类系数。Node2Vec嵌入技术捕捉了潜在的网络拓扑结构。结合上述特征训练了XGBoost和神经网络分类器,使用DepMap CRISPR必需性得分作为真实标签。通过GradientSHAP分析评估模型的可解释性,量化特征贡献。我们还开发了一种新型混合评分方法,结合模型概率预测和SHAP归因幅度,以优化基因优先排序。我们的框架达到了最先进的性能,在识别前10%的最关键基因方面,AUROC为0.930,AUPRC为0.656。GradientSHAP分析显示,中心性度量对预测的贡献显著,其中度中心性与基因必需性的相关性最强(ρ=-0.357)。混合评分方法产生了稳健的基因优先排序,成功识别了已知的关键基因,包括核糖体蛋白(RPS27A、RPS17、RPS6)和致癌基因(MYC)。本研究提出了一个基于人类数据的组合式计算机模拟(in silico)框架,成功地将网络生物学与可解释的人工智能相结合,用于发现治疗靶点。该框架通过特征归因分析提供机制透明度,同时保持最先进的预测性能。其可重复的设计和对人类分子数据集的依赖,展示了下一代无动物建模在癌症治疗靶点发现和优先排序中的实践示例。
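
该框架的两个可以直接示意的环节是:在 PPI 网络上计算中心性特征,以及把模型预测概率与 |SHAP| 归因幅度归一化后混合打分。下面是一个玩具草图(并非论文实现),用 networkx 在一个虚构的小网络上计算几种中心性,并演示一种等权混合打分;基因名与概率、SHAP 数值均为虚构示例。

```python
import numpy as np
import networkx as nx

# 虚构的玩具 PPI 网络(节点 = 基因,边 = 相互作用)
G = nx.Graph([("MYC", "RPS6"), ("MYC", "RPS17"), ("RPS6", "RPS17"),
              ("RPS17", "RPS27A"), ("RPS27A", "GENE5")])

features = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "closeness": nx.closeness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
    "clustering": nx.clustering(G),
}
print({name: round(vals["MYC"], 3) for name, vals in features.items()})

def minmax(d):
    """把一个 {基因: 数值} 字典做 min-max 归一化。"""
    v = np.array(list(d.values()), dtype=float)
    lo, hi = v.min(), v.max()
    return {k: (x - lo) / (hi - lo + 1e-12) for k, x in d.items()}

# 假设:分类器输出的必需性概率与 |SHAP| 归因幅度(此处为虚构数值)
prob = {"MYC": 0.92, "RPS6": 0.88, "RPS17": 0.85, "RPS27A": 0.80, "GENE5": 0.20}
shap_mag = {"MYC": 0.31, "RPS6": 0.27, "RPS17": 0.22, "RPS27A": 0.18, "GENE5": 0.02}

p_n, s_n = minmax(prob), minmax(shap_mag)
blended = {g: 0.5 * p_n[g] + 0.5 * s_n[g] for g in prob}   # 一种简单的等权混合打分
for gene, score in sorted(blended.items(), key=lambda kv: -kv[1]):
    print(gene, round(score, 3))
```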

论文及项目相关链接

PDF

Summary

本文开发了一种融合蛋白质-蛋白质相互作用(PPI)网络中心性度量与节点嵌入的可解释深度学习框架,用于癌症治疗靶点优先排序。研究构建了高置信度的PPI网络,计算了六种中心性度量指标,并结合Node2Vec嵌入技术捕捉潜在网络拓扑结构。通过XGBoost和神经网络分类器结合特征进行训练,以DepMap CRISPR必需性分数作为真实依据。通过GradientSHAP分析评估模型解释性,量化特征贡献。开发了一种新型混合评分方法,结合模型预测概率与SHAP归因幅度,以提高基因优先排序的质量。该框架达到最先进的性能,AUROC为0.930,AUPRC为0.656,在识别前10%最重要基因方面表现优异。GradientSHAP分析显示中心性度量对预测贡献显著,其中度中心性与基因必需性相关性最强(ρ=-0.357)。混合评分方法产生稳健的基因优先排序排名,成功识别出已知的关键基因,包括核糖体蛋白(RPS27A、RPS17、RPS6)和原癌基因(MYC)。本研究提出了一种成功融合网络生物学与可解释人工智能、基于人类数据的组合式计算机模拟(in silico)框架,用于治疗靶点发现。该框架通过特征归因分析提供机制透明度,同时保持最先进的预测性能。其可重复的设计以及对人类分子数据的依赖,展示了无动物建模的下一代癌症治疗靶点发现和优先排序的实践范例。

Key Takeaways

  1. 开发了一种融合PPI网络中心性度量和节点嵌入的深度学习任务框架,用于癌症治疗靶点优先排序。
  2. 构建了高置信度的PPI网络,并计算了六种中心性度量指标来评估蛋白质的重要性。
  3. 使用Node2Vec嵌入技术捕捉网络拓扑结构。
  4. 结合特征训练的XGBoost和神经网络分类器表现优越。
  5. 通过GradientSHAP分析评估模型解释性。
  6. 引入了一种新的混合评分方法来增强基因优先排序的准确性。

Cool Papers

点此查看论文截图

AGGRNet: Selective Feature Extraction and Aggregation for Enhanced Medical Image Classification

Authors:Ansh Makwe, Akansh Agrawal, Prateek Jain, Akshan Agrawal, Priyanka Bagade

Medical image analysis for complex tasks such as severity grading and disease subtype classification poses significant challenges due to intricate and similar visual patterns among classes, scarcity of labeled data, and variability in expert interpretations. Despite the usefulness of existing attention-based models in capturing complex visual patterns for medical image classification, underlying architectures often face challenges in effectively distinguishing subtle classes since they struggle to capture inter-class similarity and intra-class variability, resulting in incorrect diagnosis. To address this, we propose AGGRNet framework to extract informative and non-informative features to effectively understand fine-grained visual patterns and improve classification for complex medical image analysis tasks. Experimental results show that our model achieves state-of-the-art performance on various medical imaging datasets, with the best improvement up to 5% over SOTA models on the Kvasir dataset.

对于复杂性任务如严重性分级和疾病亚型分类的医学图像分析,由于类别间复杂且相似的视觉模式、标记数据的稀缺以及专家解读的差异性,这带来了很大的挑战。尽管现有的基于注意力的模型在捕捉医学图像分类中的复杂视觉模式方面很有用,但基础架构通常面临着有效区分细微类别方面的挑战,因为它们难以捕捉类间相似性和类内变异性,从而导致诊断错误。为了解决这一问题,我们提出了AGGRNet框架,以提取信息性和非信息性特征,从而有效地理解细微的视觉模式,并改进复杂的医学图像分析任务的分类。实验结果表明,我们的模型在各种医学成像数据集上达到了最先进的性能,在Kvasir数据集上的最佳改进达到了5%。

论文及项目相关链接

PDF

Summary

针对医学图像分析中的复杂任务,如严重程度分级和疾病亚型分类,存在类间视觉模式复杂且相似、标注数据稀缺以及专家解读差异等挑战。现有基于注意力的模型在捕捉医学图像分类中的复杂视觉模式方面具有一定作用,但基础架构在区分细微类别时仍面临挑战,难以捕捉类间相似性和类内变异性,导致诊断错误。为解决这一问题,我们提出AGGRNet框架,以提取信息性和非信息性特征,从而有效理解细微的视觉模式,提高复杂医学图像分析任务的分类性能。实验结果表明,我们的模型在多种医学成像数据集上达到了最新技术水平,在Kvasir数据集上的最佳改进率高达5%。

Key Takeaways

  1. 医学图像分析面临复杂任务挑战,如类间视觉模式复杂、标注数据稀缺和专家解读差异。
  2. 现有注意力模型在捕捉医学图像分类中的复杂视觉模式方面有一定作用,但在区分细微类别时存在挑战。
  3. AGGRNet框架旨在通过提取信息性和非信息性特征,有效理解细微的视觉模式。
  4. AGGRNet框架可提高复杂医学图像分析任务的分类性能。
  5. 实验结果显示,AGGRNet模型在多种医学成像数据集上表现优异,达到最新技术水平。
  6. 在Kvasir数据集上的最佳改进率高达5%,显示出模型的有效性。
  7. AGGRNet框架有望为医学图像分析带来更准确、更高效的诊断。

Cool Papers

点此查看论文截图

MTMed3D: A Multi-Task Transformer-Based Model for 3D Medical Imaging

Authors:Fan Li, Arun Iyengar, Lanyu Xu

In the field of medical imaging, AI-assisted techniques such as object detection, segmentation, and classification are widely employed to alleviate the workload of physicians and doctors. However, single-task models are predominantly used, overlooking the shared information across tasks. This oversight leads to inefficiencies in real-life applications. In this work, we propose MTMed3D, a novel end-to-end Multi-task Transformer-based model to address the limitations of single-task models by jointly performing 3D detection, segmentation, and classification in medical imaging. Our model uses a Transformer as the shared encoder to generate multi-scale features, followed by CNN-based task-specific decoders. The proposed framework was evaluated on the BraTS 2018 and 2019 datasets, achieving promising results across all three tasks, especially in detection, where our method achieves better results than prior works. Additionally, we compare our multi-task model with equivalent single-task variants trained separately. Our multi-task model significantly reduces computational costs and achieves faster inference speed while maintaining comparable performance to the single-task models, highlighting its efficiency advantage. To the best of our knowledge, this is the first work to leverage Transformers for multi-task learning that simultaneously covers detection, segmentation, and classification tasks in 3D medical imaging, presenting its potential to enhance diagnostic processes. The code is available at https://github.com/fanlimua/MTMed3D.git.

在医学成像领域,人工智能辅助技术如目标检测、分割和分类被广泛应用于减轻医生和医师的工作量。然而,目前主要使用单任务模型,忽略了任务间的共享信息。这种疏忽导致了在实际应用中的效率不高。在这项工作中,我们提出了MTMed3D,这是一种新型端到端的基于多任务Transformer的模型,旨在通过联合执行医学成像中的3D检测、分割和分类来解决单任务模型的局限性。我们的模型使用Transformer作为共享编码器来生成多尺度特征,然后采用基于CNN的任务特定解码器。所提出的框架在BraTS 2018和2019数据集上进行了评估,在三个任务中都取得了有前景的结果,特别是在检测方面,我们的方法取得了比以前的工作更好的结果。此外,我们将我们的多任务模型与等效的单任务变体进行了分别训练比较。我们的多任务模型在大幅降低计算成本的同时提高了推理速度,而且性能与单任务模型相当,凸显了其效率优势。据我们所知,这是首次利用Transformer进行多任务学习的工作,同时覆盖医学成像中的检测、分割和分类任务,展现了其在增强诊断过程中的潜力。代码可通过https://github.com/fanlimua/MTMed3D.git获取。
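
多任务模型的关键在于“共享编码器 + 任务专用头”。下面是一个与 MTMed3D 无关的极简 PyTorch 骨架(仅作示意),用一个小型 3D 卷积编码器同时接出分割头和分类头(检测头从略);通道数、类别数与输入尺寸均为示例假设。

```python
import torch
import torch.nn as nn

class TinyMultiTask3D(nn.Module):
    """共享 3D 编码器 + 两个任务头(分割、分类)的最小示意。"""
    def __init__(self, in_ch=4, feat=16, n_seg_classes=2, n_cls_classes=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat, feat, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv3d(feat, n_seg_classes, 1)       # 体素级分割
        self.cls_head = nn.Sequential(                           # 全局分类
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(feat, n_cls_classes)
        )

    def forward(self, x):
        shared = self.encoder(x)              # 各任务共享的特征
        return self.seg_head(shared), self.cls_head(shared)

x = torch.randn(1, 4, 16, 16, 16)             # (B, C, D, H, W),例如 4 个 MRI 模态
seg_logits, cls_logits = TinyMultiTask3D()(x)
print(seg_logits.shape, cls_logits.shape)     # [1, 2, 16, 16, 16] 和 [1, 3]
```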

论文及项目相关链接

PDF

Summary

在医学成像领域,人工智能辅助技术如目标检测、分割和分类等广泛应用于减轻医生和医师的工作负担。然而,当前主要使用单任务模型,忽略了任务间共享信息的重要性,导致实际应用中的效率问题。本研究提出MTMed3D,一种基于多任务Transformer的端到端模型,旨在解决单任务模型的局限性,通过联合执行医学成像中的3D检测、分割和分类任务。该模型使用Transformer作为共享编码器生成多尺度特征,并基于CNN的任务特定解码器。在BraTS 2018和2019数据集上的评估结果表明,该框架在所有三个任务上都表现出有前景的结果,特别是在检测方面,我们的方法实现了优于先前工作的结果。此外,我们将多任务模型与等效的单任务变体进行比较,发现多任务模型在显著降低计算成本和提高推理速度的同时,保持与单任务模型相当的性能,突显了其效率优势。据我们所知,这是首次利用Transformer进行多任务学习的工作,同时涵盖3D医学成像中的检测、分割和分类任务,有潜力增强诊断过程。相关代码可通过https://github.com/fanlimua/MTMed3D.git获取。

Key Takeaways

  1. AI辅助技术在医学成像中广泛应用,主要用于减轻医生和医师的工作负担。
  2. 当前模型主要为单任务模型,忽略了任务间共享信息的重要性,导致实际应用中的效率问题。
  3. 提出了一种基于多任务Transformer的端到端模型MTMed3D,能联合执行医学成像中的检测、分割和分类任务。
  4. MTMed3D使用Transformer作为共享编码器生成多尺度特征,并基于CNN的任务特定解码器。
  5. 在BraTS数据集上的评估结果表明MTMed3D在所有任务上都有良好表现。
  6. 与单任务模型相比,多任务模型MTMed3D能显著降低计算成本和提高推理速度。

Cool Papers

点此查看论文截图


文章作者: Kedreamix
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 Kedreamix !