⚠️ 以下所有内容总结都来自于 大语言模型的能力,如有错误,仅供参考,谨慎使用
🔴 请注意:千万不要用于严肃的学术场景,只能用于论文阅读前的初筛!
💗 如果您觉得我们的项目 ChatPaperFree 对您有帮助,还请您给我们一些鼓励!⭐️ HuggingFace免费体验
2025-11-16 更新
From 2D to 3D Without Extra Baggage: Data-Efficient Cancer Detection in Digital Breast Tomosynthesis
Authors:Yen Nhi Truong Vu, Dan Guo, Sripad Joshi, Harshit Kumar, Jason Su, Thomas Paul Matthews
Digital Breast Tomosynthesis (DBT) enhances finding visibility for breast cancer detection by providing volumetric information that reduces the impact of overlapping tissues; however, limited annotated data has constrained the development of deep learning models for DBT. To address data scarcity, existing methods attempt to reuse 2D full-field digital mammography (FFDM) models by either flattening DBT volumes or processing slices individually, thus discarding volumetric information. Alternatively, 3D reasoning approaches introduce complex architectures that require more DBT training data. Tackling these drawbacks, we propose M&M-3D, an architecture that enables learnable 3D reasoning while remaining parameter-free relative to its FFDM counterpart, M&M. M&M-3D constructs malignancy-guided 3D features, and 3D reasoning is learned through repeatedly mixing these 3D features with slice-level information. This is achieved by modifying operations in M&M without adding parameters, thus enabling direct weight transfer from FFDM. Extensive experiments show that M&M-3D surpasses 2D projection and 3D slice-based methods by 11-54% for localization and 3-10% for classification. Additionally, M&M-3D outperforms complex 3D reasoning variants by 20-47% for localization and 2-10% for classification in the low-data regime, while matching their performance in high-data regime. On the popular BCS-DBT benchmark, M&M-3D outperforms previous top baseline by 4% for classification and 10% for localization.
数字乳腺断层合成技术(DBT)通过提供体积信息减少了组织重叠的影响,从而提高了乳腺癌检测中病灶的可见性。然而,标注数据有限,制约了DBT深度学习模型的发展。为了应对数据稀缺的问题,现有方法试图复用二维全场数字乳腺摄影(FFDM)模型,这些方法要么将DBT体积压平,要么逐张处理切片,从而丢弃了体积信息。相比之下,三维推理方法引入了复杂的架构,需要更多的DBT训练数据。为了解决这些缺点,我们提出了M&M-3D架构,它在实现可学习的三维推理的同时,相对于其FFDM对应模型M&M不引入任何额外参数。M&M-3D构建了以恶性病变为引导的三维特征,并通过反复混合这些三维特征与切片级信息来学习三维推理。这是通过在M&M中修改运算而不增加参数来实现的,从而可以直接迁移FFDM的权重。大量实验表明,M&M-3D在定位上较二维投影和基于三维切片的方法提升11%-54%,在分类上提升3%-10%。此外,在低数据情况下,M&M-3D在定位和分类方面分别超越复杂的三维推理变体20%-47%和2%-10%,而在高数据情况下与之表现相当。在流行的BCS-DBT基准测试中,M&M-3D在分类方面优于之前的最佳基线模型4%,在定位方面提高了10%。
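下面给出一个极简的概念性草图(基于摘要理解的假设性示意,并非论文的官方实现):用切片级的恶性度打分对各切片特征做跨切片加权聚合得到"3D特征",再将其广播回每张切片并与切片级信息混合,全程不引入任何新的可学习参数;函数名 malignancy_guided_mixing 以及恶性度打分的来源均为示意假设。

```python
# 概念性示意(非论文官方实现):恶性度引导的跨切片聚合 + 与切片级特征的无参数混合
import torch

def malignancy_guided_mixing(slice_feats: torch.Tensor,
                             malignancy_logits: torch.Tensor) -> torch.Tensor:
    """slice_feats: [D, C, H, W] 各切片特征;malignancy_logits: [D, 1, H, W] 切片级恶性度打分。
    返回与输入同形状、融合了体积信息的切片特征。"""
    # 沿切片维做 softmax,得到每个空间位置上各切片的权重
    weights = torch.softmax(malignancy_logits, dim=0)                 # [D, 1, H, W]
    # 恶性度引导的跨切片聚合特征,对所有切片共享
    volume_feat = (weights * slice_feats).sum(dim=0, keepdim=True)    # [1, C, H, W]
    # 将聚合特征广播回每张切片,与切片级信息做无参数混合(此处取平均,亦可反复迭代)
    return 0.5 * (slice_feats + volume_feat)

if __name__ == "__main__":
    feats = torch.randn(40, 64, 32, 32)     # 假设 40 张切片、64 通道特征
    logits = torch.randn(40, 1, 32, 32)
    print(malignancy_guided_mixing(feats, logits).shape)   # torch.Size([40, 64, 32, 32])
```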
论文及项目相关链接
摘要
DBT技术能提高乳腺癌检测中病灶的可见性,但其有限的标注数据限制了深度学习模型的发展。针对此问题,我们提出M&M-3D架构,可在无需额外参数的情况下实现可学习的三维推理。它通过构建恶性病变引导的三维特征,并通过反复混合这些特征与切片级信息来学习三维推理。实验表明,M&M-3D在定位和分类方面超越了二维投影和三维切片方法,且在低数据场景下表现出更出色的性能。在BCS-DBT基准测试中,M&M-3D的分类和定位性能均优于先前最佳模型。
关键见解
- DBT技术通过提供体积信息提高乳腺癌检测的病灶可见性,但标注数据的有限性限制了深度学习模型的发展。
- 现有方法试图通过平坦化DBT体积或单独处理切片来再利用二维全场数字乳腺摄影(FFDM)模型,从而忽略了体积信息。
- 复杂的三维推理方法需要大量DBT训练数据。
- M&M-3D架构能在无需额外参数的情况下实现可学习的三维推理,通过构建恶性病变引导的三维特征,并与切片级信息混合来学习三维推理。
- M&M-3D超越了二维投影和基于三维切片的方法,定位性能提升11-54%,分类性能提升3-10%。
- 在低数据场景下,M&M-3D在定位和分类方面表现出优于复杂三维推理方法的性能,提升范围分别为20-47%和2-10%。
点此查看论文截图
Learnable Total Variation with Lambda Mapping for Low-Dose CT Denoising
Authors:Yusuf Talha Basak, Mehmet Ozan Unal, Metin Ertas, Isa Yildirim
Although Total Variation (TV) performs well in noise reduction and edge preservation on images, its dependence on the lambda parameter limits its efficiency and makes it difficult to use effectively. In this study, we present a Learnable Total Variation (LTV) framework that couples an unrolled TV solver with a data-driven Lambda Mapping Network (LambdaNet) predicting a per-pixel regularization map. The pipeline is trained end-to-end so that reconstruction and regularization are optimized jointly, yielding spatially adaptive smoothing: strong in homogeneous regions, relaxed near anatomical boundaries. Experiments on the DeepLesion dataset, using a realistic noise model adapted from the LoDoPaB-CT methodology, show consistent gains over classical TV and FBP+U-Net: +2.9 dB PSNR and +6% SSIM on average. LTV provides an interpretable alternative to black-box CNNs and a basis for 3D and data-consistency-driven reconstruction. Our codes are available at: https://github.com/itu-biai/deep_tv_for_ldct
尽管总变差(TV)在图像降噪和边缘保持方面表现良好,但其对lambda参数的依赖限制了其效率,并且难以有效地使用。在本研究中,我们提出了一种可学习的总变差(LTV)框架,它将展开的TV求解器与数据驱动的Lambda映射网络(LambdaNet)相结合,预测每个像素的正则化图。该管道是端到端进行训练的,使重建和正则化能够联合优化,产生空间自适应平滑:在均匀区域中强烈,在解剖边界附近放松。在采用来自LoDoPaB-CT方法论的现实噪声模型的DeepLesion数据集上的实验表明,与经典TV和FBP+U-Net相比,LTV具有一致的优势,平均PSNR提高2.9 dB,SSIM提高6%。LTV为黑箱卷积神经网络提供了一种可解释的替代方案,并为3D和数据一致性驱动的重建提供了基础。我们的代码可在以下网址找到:https://github.com/itu-biai/deep_tv_for_ldct
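作为直观说明,下面给出一个带逐像素 λ 图的平滑化 TV 去噪展开式梯度迭代草图(假设性示意:论文中的 λ 图由 LambdaNet 端到端学习得到,此处仅手工构造一个 λ 图做演示,函数名、迭代步数与步长均为演示取值)。

```python
# 概念性示意:最小化 0.5*||x - y||^2 + sum(lambda_map * |grad x|) 的平滑近似(逐像素 λ)
import numpy as np

def grad(u):
    gx = np.zeros_like(u); gy = np.zeros_like(u)
    gx[:, :-1] = u[:, 1:] - u[:, :-1]          # 水平前向差分
    gy[:-1, :] = u[1:, :] - u[:-1, :]          # 垂直前向差分
    return gx, gy

def div(px, py):
    dx = np.zeros_like(px); dy = np.zeros_like(py)
    dx[:, 1:] = px[:, 1:] - px[:, :-1]; dx[:, 0] = px[:, 0]
    dy[1:, :] = py[1:, :] - py[:-1, :]; dy[0, :] = py[0, :]
    return dx + dy

def unrolled_tv(y, lambda_map, n_iters=50, step=0.2, eps=1e-2):
    x = y.copy()
    for _ in range(n_iters):
        gx, gy = grad(x)
        mag = np.sqrt(gx**2 + gy**2 + eps**2)
        tv_grad = -div(lambda_map * gx / mag, lambda_map * gy / mag)
        x -= step * ((x - y) + tv_grad)        # 数据项梯度 + 逐像素加权的 TV 项梯度
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
    noisy = clean + 0.2 * rng.standard_normal(clean.shape)
    lam = np.full_like(noisy, 0.15); lam[14:50, 14:50] = 0.05   # 结构边界所在区域采用较弱正则
    den = unrolled_tv(noisy, lam)
    print(f"noisy MAE={np.abs(noisy - clean).mean():.3f}, denoised MAE={np.abs(den - clean).mean():.3f}")
```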
论文及项目相关链接
Summary
可学习总变分(LTV)框架将展开式(unrolled)TV求解器与数据驱动的Lambda映射网络(LambdaNet)相结合,联合优化图像重建和正则化,提高图像去噪和边缘保持性能。在DeepLesion数据集上的实验表明,与经典TV和FBP+U-Net相比,LTV的平均峰值信噪比(PSNR)和结构相似性指数(SSIM)分别提高了2.9 dB和6%。LTV为黑箱CNN提供了一个可解释的替代方案,并为3D和数据一致性驱动的重建打下基础。
Key Takeaways
- LTV框架结合了展开式(unrolled)总变分求解器和Lambda映射网络(LambdaNet),以联合优化图像重建和正则化过程。
- LTV通过数据驱动的方式预测像素级的正则化映射,提高了图像去噪和边缘保持性能。
- LTV框架实现了空间自适应平滑,在均匀区域实现强去噪,在接近解剖边界处实现放松。
- 在DeepLesion数据集上的实验表明,与经典的总变分和FBP+U-Net相比,LTV提供了更好的图像重建结果,PSNR和SSIM分别提高了2.9 dB和6%。
- LTV为黑箱CNN模型提供了一个可解释的替代方案,具有更高的透明度。
- LTV框架具有潜力应用于3D图像重建和数据一致性驱动的重建。
点此查看论文截图
Utility of Pancreas Surface Lobularity as a CT Biomarker for Opportunistic Screening of Type 2 Diabetes
Authors:Tejas Sudharshan Mathai, Anisa V. Prasad, Xinya Wang, Praveen T. S. Balamuralikrishna, Yan Zhuang, Abhinav Suri, Jianfei Liu, Perry J. Pickhardt, Ronald M. Summers
Type 2 Diabetes Mellitus (T2DM) is a chronic metabolic disease that affects millions of people worldwide. Early detection is crucial as it can alter pancreas function through morphological changes and increased deposition of ectopic fat, eventually leading to organ damage. While studies have shown an association between T2DM and pancreas volume and fat content, the role of increased pancreatic surface lobularity (PSL) in patients with T2DM has not been fully investigated. In this pilot work, we propose a fully automated approach to delineate the pancreas and other abdominal structures, derive CT imaging biomarkers, and opportunistically screen for T2DM. Four deep learning-based models were used to segment the pancreas in an internal dataset of 584 patients (297 males, 437 non-diabetic, age: 45±15 years). PSL was automatically detected and it was higher for diabetic patients (p=0.01) at 4.26 ± 8.32 compared to 3.19 ± 3.62 for non-diabetic patients. The PancAP model achieved the highest Dice score of 0.79 ± 0.17 and lowest ASSD error of 1.94 ± 2.63 mm (p<0.05). For predicting T2DM, a multivariate model trained with CT biomarkers attained 0.90 AUC, 66.7% sensitivity, and 91.9% specificity. Our results suggest that PSL is useful for T2DM screening and could potentially help predict the early onset of T2DM.
2型糖尿病(T2DM)是一种慢性代谢性疾病,全球范围内影响了数以百万计的人群。早期检测至关重要,因为T2DM会通过形态改变和异位脂肪沉积的增加影响胰腺功能,最终导致器官损伤。虽然已有研究表明T2DM与胰腺体积和脂肪含量之间存在关联,但T2DM患者胰腺表面小叶增多(PSL)的作用尚未得到充分研究。在这项试点工作中,我们提出了一种全自动的方法,用于描绘胰腺和其他腹部结构,获取CT成像生物标志物,并进行T2DM的机会性筛查。使用四种基于深度学习的模型对内部数据集(包含584名患者,其中297名男性,437名非糖尿病患者,年龄45±15岁)进行胰腺分割。PSL被自动检测,糖尿病患者的PSL更高(p=0.01),为4.26±8.32,非糖尿病患者为3.19±3.62。PancAP模型获得最高的Dice得分(0.79±0.17)和最低的ASSD误差(1.94±2.63毫米)(p<0.05)。使用CT生物标志物训练的多元模型预测T2DM达到了AUC 0.90、敏感性66.7%和特异性91.9%。我们的结果表明,PSL对T2DM筛查有用,可能有助于预测T2DM的早期发生。
论文及项目相关链接
PDF Submitted to IEEE ISBI 2026
Summary
该文章研究了Ⅱ型糖尿病与胰腺表面叶状结构(PSL)之间的关系。通过深度学习模型对胰腺进行自动分割,发现糖尿病患者具有更高的胰腺表面叶状结构值。该研究表明,PSL可能是早期预测Ⅱ型糖尿病的重要指标之一。
Key Takeaways
- T2DM是一种常见的慢性代谢性疾病,早期检测至关重要。
- 胰腺体积和脂肪含量与T2DM有关联。
- 胰腺表面叶状结构(PSL)在T2DM患者中的作用尚未得到充分研究。
- 使用深度学习模型自动检测胰腺表面叶状结构,发现糖尿病患者值更高。
- PancAP模型表现出最佳的性能,用于自动分割胰腺。
- 利用CT影像生物标志物预测T2DM的多元模型表现出良好的预测性能。
点此查看论文截图
Domain Adaptation for Camera-Specific Image Characteristics using Shallow Discriminators
Authors:Maximiliane Gruber, Jürgen Seiler, André Kaup
Each image acquisition setup leads to its own camera-specific image characteristics degrading the image quality. In learning-based perception algorithms, characteristics occurring during the application phase, but absent in the training data, lead to a domain gap impeding the performance. Previously, pixel-level domain adaptation through unpaired learning of the pristine-to-distorted mapping function has been proposed. In this work, we propose shallow discriminator architectures to address limitations of these approaches. We show that a smaller receptive field size improves learning of unknown image distortions by more accurately reproducing local distortion characteristics at a low network complexity. In a domain adaptation setup for instance segmentation, we achieve mean average precision increases over previous methods of up to 0.15 for individual distortions and up to 0.16 for camera-specific image characteristics in a simplified camera model. In terms of number of parameters, our approach matches the complexity of one state of the art method while reducing complexity by a factor of 20 compared to another, demonstrating superior efficiency without compromising performance.
每个图像采集设置都会导致其特有的相机特定图像特征,从而降低了图像质量。在基于学习的感知算法中,应用阶段出现但在训练数据中不存在的特征会导致域差距,从而影响性能。先前已经提出了通过原始到失真的映射函数的非配对学习来进行像素级域适应。在这项工作中,我们提出了浅判别器架构来解决这些方法的局限性。我们表明,较小的感受野尺寸通过更准确地再现局部失真特征,在低网络复杂度下改进了对未知图像失真的学习。在实例分割的域适应设置中,我们实现了平均精度提高,与以前的方法相比,针对单个失真的提高幅度高达0.15,针对简化相机模型中的相机特定图像特性的提高幅度高达0.16。就参数数量而言,我们的方法与最新技术状态的复杂度相匹配,但与另一种方法相比减少了20倍的复杂度,在不影响性能的情况下展示了更高的效率。
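下面给出一个感受野很小的浅层 PatchGAN 式判别器草图,用以说明"小感受野、低复杂度"的思路(假设性示意,并非论文的具体网络配置;通道数与层数均为演示取值)。

```python
# 概念性示意:三层 3x3 卷积的浅层判别器,感受野约 7x7,输出逐 patch 的真/假打分
import torch
import torch.nn as nn

class ShallowPatchDiscriminator(nn.Module):
    def __init__(self, in_ch: int = 3, base_ch: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_ch, base_ch * 2, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base_ch * 2, 1, 3, stride=1, padding=1),   # 每个局部位置一个判别输出
        )

    def forward(self, x):
        return self.net(x)                                       # [B, 1, H, W]

if __name__ == "__main__":
    d = ShallowPatchDiscriminator()
    scores = d(torch.randn(2, 3, 64, 64))
    print(scores.shape, sum(p.numel() for p in d.parameters()))  # 输出形状与参数量
```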
论文及项目相关链接
PDF 5 pages, 7 figures, accepted for International Conference on Visual Communications and Image Processing (VCIP) 2025
Summary
本文提出利用浅判别器架构来解决图像采集设置中导致的领域差异问题,以提高学习感知算法的性能。通过减小感受野大小,更准确模拟局部失真特性,并在简化相机模型的实例分割任务中实现平均精度提升。该方法在参数数量上匹配一种先进技术,相较于另一种技术复杂度降低20倍,展现出卓越的效率且不影响性能。
Key Takeaways
- 图像采集设置导致相机特定的图像特性影响图像质量。
- 学习感知算法中的领域差异阻碍了性能。
- 浅判别器架构被提出来解决先前像素级领域适应方法的问题。
- 减小感受野大小能更准确地模拟局部失真特性。
- 在实例分割任务中实现了平均精度(mAP)提升:对单个失真最高提升0.15,对简化相机模型中的相机特定图像特性最高提升0.16。
- 该方法在参数数量上表现出高效性,匹配一种先进技术并相较于另一种降低复杂度。
点此查看论文截图
3DFETUS: Standardizing Fetal Facial Planes in 3D Ultrasound
Authors:Alomar Antonia, Rubio Ricardo, Albaiges Gerard, Salort-Benejam Laura, Caminal Julia, Prat Maria, Rueda Carolina, Cortes Berta, Piella Gemma, Sukno Federico
Acquiring standard facial planes during routine fetal ultrasound (US) examinations is often challenging due to fetal movement, variability in orientation, and operator-dependent expertise. These factors contribute to inconsistencies, increased examination time, and potential diagnostic bias. To address these challenges in the context of facial assessment, we present: 1) GT++, a robust algorithm that estimates standard facial planes from 3D US volumes using annotated anatomical landmarks; and 2) 3DFETUS, a deep learning model that automates and standardizes their localization in 3D fetal US volumes. We evaluated our methods both qualitatively, through expert clinical review, and quantitatively. The proposed approach achieved a mean translation error of 4.13 mm and a mean rotation error of 7.93 degrees per plane, outperforming other state-of-the-art methods on 3D US volumes. Clinical assessments further confirmed the effectiveness of both GT++ and 3DFETUS, demonstrating statistically significant improvements in plane estimation accuracy.
在常规胎儿超声检查过程中获取标准面部平面常常具有挑战性,因为胎儿会移动、方位多变,并且依赖于操作者的经验。这些因素会导致结果不一致、检查时间延长以及潜在的诊断偏差。为了解决面部评估中的这些挑战,我们提出:1)GT++,一种鲁棒的算法,利用标注的解剖标志点从三维超声体积中估计标准面部平面;2)3DFETUS,一种深度学习模型,可在三维胎儿超声体积中自动化并标准化这些平面的定位。我们通过专家临床审查进行了定性评估,并进行了定量评估。所提出的方法每个平面的平均平移误差为4.13毫米、平均旋转误差为7.93度,优于其他针对三维超声体积的最先进方法。临床评估进一步证实了GT++和3DFETUS的有效性,在平面估计准确性方面显示出统计学上的显著改善。
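下面给出一个计算平面估计误差的简单草图:假设每个平面用"平面上一点 + 单位法向量"表示,平移误差取两点间欧氏距离(毫米),旋转误差取两法向量夹角(度)。这只是对论文所报告指标的一种示意性实现,未必与其实际评估协议一致。

```python
# 概念性示意:标准平面估计的平移误差(mm)与旋转误差(度)
import numpy as np

def plane_errors(pred_point, pred_normal, gt_point, gt_normal):
    pred_normal = pred_normal / np.linalg.norm(pred_normal)
    gt_normal = gt_normal / np.linalg.norm(gt_normal)
    translation_mm = float(np.linalg.norm(pred_point - gt_point))
    cos = abs(float(np.dot(pred_normal, gt_normal)))     # 取绝对值,忽略法向正反号
    rotation_deg = float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
    return translation_mm, rotation_deg

if __name__ == "__main__":
    t, r = plane_errors(np.array([10.0, 5.0, 2.0]), np.array([0.0, 0.1, 1.0]),
                        np.array([8.0, 5.0, 1.0]), np.array([0.0, 0.0, 1.0]))
    print(f"translation error = {t:.2f} mm, rotation error = {r:.2f} deg")
```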
论文及项目相关链接
摘要
基于三维胎儿超声成像中的挑战,例如胎儿的频繁移动和操作的难度,文章提出两种针对面部评估的方法:GT++算法和深度学习模型3DFETUS。GT++算法通过标注解剖标志点来估计标准面部平面,而深度学习模型则自动化定位这些平面。经过定量和定性评估,该方法的平均平移误差为4.13毫米,平均旋转误差为每平面7.93度,相较于其他方法在三维超声成像中的使用具有更高的性能。临床评估进一步证实了GT++和3DFETUS在平面估计准确性方面的有效性。
关键见解
- 在常规胎儿超声检查中,获取标准面部平面具有挑战性,因为胎儿的移动和操作经验的差异会影响成像的一致性和准确性。
- 文章介绍了GT++算法和深度学习模型3DFETUS,以改进这一过程。GT++算法利用解剖标志点估计标准面部平面,而深度学习模型则用于自动化定位这些平面。
- 该方法通过定量评估和专家临床审查验证了其有效性。相较于其他方法,该方法的平均平移误差和旋转误差较小。
- 临床评估显示GT++和3DFETUS在平面估计准确性方面显著提高。
- 这些技术可以帮助减少因胎儿移动和操作经验差异导致的诊断偏差和不一致性。
- 这些技术的实施可能有助于减少检查时间并提高诊断效率。
点此查看论文截图
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
Authors:Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai
Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage parsing pipeline. The first stage employs a large multimodal model to jointly predict document layout and reading order, leveraging visual information to ensure structural and sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios.
文档解析是文档智能的核心任务之一,支持信息提取、检索增强生成和自动化文档分析等多种应用。然而,现实世界的文档通常具有复杂的布局,包含多层次表格、嵌入的图像或公式以及跨页结构,这对现有OCR系统仍然具有挑战性。我们推出了MonkeyOCR v1.5,这是一个统一的视觉语言框架,通过两阶段解析管道增强布局理解和内容识别。第一阶段采用大型多模态模型联合预测文档布局和阅读顺序,利用视觉信息确保结构和顺序一致性。第二阶段对检测到的区域进行文本、公式和表格的局部识别,保持高视觉保真度,同时减少错误传播。针对复杂的表格结构,我们提出了一种基于视觉一致性强化学习方案,通过渲染和比较对齐来评估识别质量,提高结构准确性,无需手动注释。此外,还引入了Image-Decoupled Table Parsing和Type-Guided Table Merging两个专用模块,能够实现包含嵌入图像的表格可靠解析以及跨页或列的表格重建。在OmniDocBench v1.5上的综合实验表明,MonkeyOCR v1.5达到了最先进的性能水平,优于PPOCR-VL和MinerU 2.5,在视觉复杂的文档场景中表现出卓越的稳健性。
论文及项目相关链接
Summary
MonkeyOCR v1.5是一款统一的视觉语言框架,用于增强文档布局理解和内容识别。它通过两阶段解析管道实现文档解析,提高复杂文档场景下的稳健性和性能。
Key Takeaways
- MonkeyOCR v1.5是一个视觉语言框架,用于文档智能中的文档解析任务。
- 它支持信息提取、检索增强生成和自动化文档分析应用程序。
- 引入了两阶段解析管道,第一阶段预测文档布局和阅读顺序,第二阶段进行局部文本、公式和表格识别。
- 提出了基于视觉一致性的强化学习方案,通过渲染和比较对齐来评估识别质量,提高结构准确性。
- 引入了Image-Decoupled Table Parsing和Type-Guided Table Merging两个专用模块,以处理包含嵌入式图像的表格解析和跨页或列的表格重建。
- 在OmniDocBench v1.5上的实验表明,MonkeyOCR v1.5达到最新性能水平,优于PPOCR-VL和MinerU 2.5。
点此查看论文截图
Depth-Consistent 3D Gaussian Splatting via Physical Defocus Modeling and Multi-View Geometric Supervision
Authors:Yu Deng, Baozhu Zhao, Junyan Su, Xiaohan Zhang, Qi Liu
Three-dimensional reconstruction in scenes with extreme depth variations remains challenging due to inconsistent supervisory signals between near-field and far-field regions. Existing methods fail to simultaneously address inaccurate depth estimation in distant areas and structural degradation in close-range regions. This paper proposes a novel computational framework that integrates depth-of-field supervision and multi-view consistency supervision to advance 3D Gaussian Splatting. Our approach comprises two core components: (1) Depth-of-field Supervision employs a scale-recovered monocular depth estimator (e.g., Metric3D) to generate depth priors, leverages defocus convolution to synthesize physically accurate defocused images, and enforces geometric consistency through a novel depth-of-field loss, thereby enhancing depth fidelity in both far-field and near-field regions; (2) Multi-View Consistency Supervision employing LoFTR-based semi-dense feature matching to minimize cross-view geometric errors and enforce depth consistency via least squares optimization of reliable matched points. By unifying defocus physics with multi-view geometric constraints, our method achieves superior depth fidelity, demonstrating a 0.8 dB PSNR improvement over the state-of-the-art method on the Waymo Open Dataset. This framework bridges physical imaging principles and learning-based depth regularization, offering a scalable solution for complex depth stratification in urban environments.
对于深度变化极端的场景,其三维重建仍然具有挑战性,因为近场和远场区域之间的监督信号不一致。现有方法未能同时解决远距离区域的深度估计不准确和近距离区域的结构退化问题。本文提出了一种新型计算框架,该框架结合了景深监督和多视角一致性监督,以推进3D高斯泼溅(3D Gaussian Splatting)技术的发展。我们的方法包括两个核心组件:(1)景深监督采用尺度恢复的单目深度估计器(例如Metric3D)来生成深度先验,利用散焦卷积合成物理准确的散焦图像,并通过新型的景深损失来强制执行几何一致性,从而提高远场和近场区域的深度保真度;(2)多视角一致性监督采用基于LoFTR的半密集特征匹配来最小化跨视角的几何误差,并通过可靠的匹配点的最小二乘优化来强制执行深度一致性。通过将散焦物理与多视角几何约束相结合,我们的方法实现了卓越的深度保真度,在Waymo Open Dataset上比最先进的方法提高了0.8 dB的PSNR。该框架架起了物理成像原理和基于学习的深度正则化之间的桥梁,为城市环境中复杂的深度分层提供了可扩展的解决方案。
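下面给出"景深(散焦)监督"中散焦图像合成的一个简化草图:按薄透镜弥散圆模型由深度图得到逐像素模糊强度,再按深度分层做高斯模糊后合成;focus_depth、aperture 等参数均为演示假设,与论文的具体散焦卷积实现无关。

```python
# 概念性示意:依据深度图合成散焦图像(弥散圆半径正比于 |1/d - 1/d_focus|)
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_defocus(image, depth, focus_depth=5.0, aperture=2.0, n_layers=8):
    """image: [H, W] 灰度图;depth: [H, W] 深度,与 focus_depth 同单位。"""
    coc = aperture * np.abs(1.0 / depth - 1.0 / focus_depth)    # 逐像素弥散圆(模糊)强度
    out = np.zeros_like(image, dtype=np.float64)
    edges = np.linspace(coc.min(), coc.max() + 1e-6, n_layers + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (coc >= lo) & (coc < hi)
        if not mask.any():
            continue
        sigma = 0.5 * (lo + hi)                                  # 该深度层的代表性模糊强度
        out[mask] = gaussian_filter(image, sigma=max(sigma, 1e-3))[mask]
    return out

if __name__ == "__main__":
    H, W = 64, 64
    img = np.tile(np.linspace(0, 1, W), (H, 1))
    dep = np.tile(np.linspace(2.0, 20.0, W), (H, 1))             # 近处 2、远处 20(示意单位)
    print(synthesize_defocus(img, dep).shape)
```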
论文及项目相关链接
Summary
本文提出一种结合景深监督和多视角一致性监督的新型计算框架,用于改进3D高斯泼溅(3D Gaussian Splatting)在极端深度变化场景中的应用。该框架包括两个核心组件:景深监督和多视角一致性监督,能够同时改善远场区域的深度估计精度与近场区域的结构退化问题。
Key Takeaways
- 极端深度变化场景中的三维重建具有挑战性,因为近场和远场区域之间的监督信号不一致。
- 现有方法不能同时解决远距离深度估计不准确和近程区域结构退化的问题。
- 本文提出了一种新型计算框架,集成了景深监督和多视角一致性监督,以改进3D高斯泼溅技术。
- 框架包括两个核心组件:景深监督通过尺度恢复的单目深度估计器生成深度先验,利用散焦卷积合成物理准确的散焦图像,并通过新的景深损失强制执行几何一致性。
- 多视角一致性监督采用LoFTR的半密集特征匹配来最小化跨视图的几何误差,并通过可靠的匹配点进行最小二乘优化来强制执行深度一致性。
- 该方法将散焦物理与多视角几何约束相结合,实现了卓越的深度保真度,在Waymo Open数据集上比现有技术提高了0.8 dB的PSNR。
点此查看论文截图
Revisiting Evaluation of Deep Neural Networks for Pedestrian Detection
Authors:Patrick Feifel, Benedikt Franke, Frank Bonarens, Frank Köster, Arne Raulf, Friedhelm Schwenker
Reliable pedestrian detection represents a crucial step towards automated driving systems. However, the current performance benchmarks exhibit weaknesses. The currently applied metrics for various subsets of a validation dataset prohibit a realistic performance evaluation of a DNN for pedestrian detection. As image segmentation supplies fine-grained information about a street scene, it can serve as a starting point to automatically distinguish between different types of errors during the evaluation of a pedestrian detector. In this work, eight different error categories for pedestrian detection are proposed and new metrics are proposed for performance comparison along these error categories. We use the new metrics to compare various backbones for a simplified version of the APD, and show a more fine-grained and robust way to compare models with each other especially in terms of safety-critical performance. We achieve SOTA on CityPersons-reasonable (without extra training data) by using a rather simple architecture.
在自动驾驶系统中,可靠的行人检测是至关重要的一步。然而,当前性能基准展现出了弱点。对于验证数据集中的不同子集目前所应用的评价指标,无法真实评估深度神经网络(DNN)在行人检测方面的性能。图像分割为街道场景提供了精细信息,可以作为自动区分行人检测中不同类型错误的起点。在这项工作中,提出了8种不同的行人检测错误类别,并针对这些错误类别提出了新的性能比较指标。我们使用新指标来比较APD简化版的各种骨干网络,并展示了一种更精细且稳健的方式来对比模型彼此之间的性能,特别是在安全性能上。通过使用相对简单的架构,我们在CityPersons-reasonable(无需额外训练数据)上实现了最新技术水平。
论文及项目相关链接
Summary
该文探讨了自动驾驶系统中行人检测的可靠性问题。现有评估指标无法真实反映深度神经网络(DNN)在行人检测方面的性能,因此提出八种行人检测错误类型及相应的新评估指标。通过新指标对不同骨干网络进行比较,展示了一种更精细且稳健的模型对比方法,特别是在安全关键性能方面的对比。采用相对简单的架构,在CityPersons-reasonable基准上(无需额外训练数据)达到了最先进性能。
Key Takeaways
- 自动驾驶系统中的行人检测是重要步骤,但现有性能评估指标存在局限性。
- 图像分割技术为街道场景提供了精细信息,有助于自动区分不同类型的错误。
- 提出了八种行人检测的误差类别及相应的新评估指标,用于更精细的性能比较。
- 通过新指标比较了不同骨干网络在行人检测模型中的表现。
- 一种相对简单的架构在CityPersons-reasonable基准上(无需额外训练数据)达到了先进性能。
- 新方法提供了更精细且稳健的模型对比方式,特别是在安全性能方面。
点此查看论文截图
Out-of-Context Misinformation Detection via Variational Domain-Invariant Learning with Test-Time Training
Authors:Xi Yang, Han Zhang, Zhijian Lin, Yibiao Hu, Hong Han
Out-of-context misinformation (OOC) is a low-cost form of misinformation in news reports, which refers to place authentic images into out-of-context or fabricated image-text pairings. This problem has attracted significant attention from researchers in recent years. Current methods focus on assessing image-text consistency or generating explanations. However, these approaches assume that the training and test data are drawn from the same distribution. When encountering novel news domains, models tend to perform poorly due to the lack of prior knowledge. To address this challenge, we propose \textbf{VDT} to enhance the domain adaptation capability for OOC misinformation detection by learning domain-invariant features and test-time training mechanisms. Domain-Invariant Variational Align module is employed to jointly encodes source and target domain data to learn a separable distributional space domain-invariant features. For preserving semantic integrity, we utilize domain consistency constraint module to reconstruct the source and target domain latent distribution. During testing phase, we adopt the test-time training strategy and confidence-variance filtering module to dynamically updating the VAE encoder and classifier, facilitating the model’s adaptation to the target domain distribution. Extensive experiments conducted on the benchmark dataset NewsCLIPpings demonstrate that our method outperforms state-of-the-art baselines under most domain adaptation settings.
脱离上下文的信息误导(OOC)是新闻报道中低成本的信息误导形式,它将真实的图片放入脱离上下文的或虚构的图像文本配对中。这个问题近年来引起了研究人员的广泛关注。当前的方法主要集中在评估图像文本的一致性或生成解释。然而,这些方法假设训练数据和测试数据来自同一分布。当遇到新的新闻领域时,由于缺乏先验知识,模型往往表现不佳。为了解决这一挑战,我们提出VDT方法,通过学习域不变特征和测试时间训练机制,增强OOC信息误导检测的域适应能力。采用域不变变分对齐模块,对源域和目标域数据进行联合编码,学习可分离的分布空间域不变特征。为了保持语义完整性,我们利用域一致性约束模块重建源域和目标域的潜在分布。在测试阶段,我们采用测试时间训练策略和信心方差过滤模块,动态更新VAE编码器和分类器,促进模型对目标域分布的适应。在NewsCLIPpings基准数据集上进行的广泛实验表明,我们的方法在大多数域适应设置下优于最新基线。
论文及项目相关链接
PDF accepted by the AAAI Conference on Artificial Intelligence (AAAI) 2026
Summary
本文介绍了脱离上下文的错误信息(OOC)问题——一种新闻报道中的低成本错误信息。该问题已引起研究者关注。当前方法主要评估图像和文字的一致性或生成解释,但假设训练和测试数据来自同一分布。面对新领域新闻时,由于缺乏先验知识,模型表现不佳。为解决此挑战,本文提出通过学习域不变特征和测试时间训练机制来增强模型对OOC误信息的域适应能力。通过采用域不变变分对齐模块和域一致性约束模块,该方法可以联合编码源和目标域数据,学习可分离的域不变特征空间并保持语义完整性。测试阶段采用测试时间训练策略和信心方差过滤模块,可动态更新VAE编码器和分类器,促进模型适应目标域分布。在NewsCLIPpings基准数据集上的实验表明,该方法在大多数域适应设置下优于最先进的基线。
Key Takeaways
- OOC(脱离上下文的错误信息)是新闻报道中一种低成本的错误信息形式,指将真实图像置于错误的上下文或伪造的图文配对中。
- 当前方法主要评估图像和文本的一致性或生成解释,但它们在面对新领域新闻时表现不佳,因为缺乏先验知识。
- 为解决这一问题,提出了VDT方法,通过联合编码源和目标域数据来增强域适应能力,学习可分离的域不变特征空间。
- 采用域不变变分对齐模块和域一致性约束模块来保持语义完整性。
- 测试阶段采用测试时间训练策略和信心方差过滤模块,促进模型适应目标域分布。
- 在NewsCLIPpings数据集上的实验表明,VDT方法在大多数域适应设置下优于其他最新方法。
点此查看论文截图
CephRes-MHNet: A Multi-Head Residual Network for Accurate and Robust Cephalometric Landmark Detection
Authors:Ahmed Jaheen, Islam Hassan, Mohanad Abouserie, Abdelaty Rehab, Adham Elasfar, Knzy Elmasry, Mostafa El-Dawlatly, Seif Eldawlatly
Accurate localization of cephalometric landmarks from 2D lateral skull X-rays is vital for orthodontic diagnosis and treatment. Manual annotation is time-consuming and error-prone, whereas automated approaches often struggle with low contrast and anatomical complexity. This paper introduces CephRes-MHNet, a multi-head residual convolutional network for robust and efficient cephalometric landmark detection. The architecture integrates residual encoding, dual-attention mechanisms, and multi-head decoders to enhance contextual reasoning and anatomical precision. Trained on the Aariz Cephalometric dataset of 1,000 radiographs, CephRes-MHNet achieved a mean radial error (MRE) of 1.23 mm and a success detection rate (SDR) @ 2.0 mm of 85.5%, outperforming all evaluated models. In particular, it exceeded the strongest baseline, the attention-driven AFPF-Net (MRE = 1.25 mm, SDR @ 2.0 mm = 84.1%), while using less than 25% of its parameters. These results demonstrate that CephRes-MHNet attains state-of-the-art accuracy through architectural efficiency, providing a practical solution for real-world orthodontic analysis.
从二维侧面颅骨X射线片中准确定位头影测量标志点对于正畸诊断和治疗至关重要。手动标注既耗时又容易出错,而自动方法则常常因对比度低和解剖结构复杂而难以应对。本文介绍了CephRes-MHNet,这是一种用于稳健高效头影测量标志点检测的多头残差卷积网络。该架构融合了残差编码、双重注意力机制和多个解码器,以增强上下文推理和解剖精度。在Aariz头影测量数据集(包含1000张放射图像)上进行训练的CephRes-MHNet达到了平均径向误差(MRE)为1.23毫米,在2.0毫米处的成功检测率(SDR)为85.5%,超过了所有评估的模型。尤其值得一提的是,它在参数使用量不到四分之一的情况下,超越了表现最强的基线模型——AFPF-Net(MRE=1.25毫米,SDR @ 2.0毫米=84.1%)。这些结果证明了CephRes-MHNet通过架构效率达到了最先进的精确度,为现实世界的正畸分析提供了切实可行的解决方案。
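摘要中的两个指标可以按如下方式计算:平均径向误差(MRE)为预测点与标注点欧氏距离的均值,SDR@2.0mm 为误差不超过 2.0 毫米的标志点比例。下面是一个最小化的 numpy 草图(假设坐标已换算为毫米,数据为随机模拟)。

```python
# 概念性示意:头影测量标志点检测的 MRE 与 SDR@2.0mm
import numpy as np

def mre_and_sdr(pred_mm: np.ndarray, gt_mm: np.ndarray, threshold_mm: float = 2.0):
    """pred_mm, gt_mm: [N_images, N_landmarks, 2],单位为毫米。"""
    radial_err = np.linalg.norm(pred_mm - gt_mm, axis=-1)        # 每个标志点的径向误差
    mre = float(radial_err.mean())
    sdr = float((radial_err <= threshold_mm).mean() * 100.0)
    return mre, sdr

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.uniform(0, 200, size=(10, 19, 2))                   # 模拟 10 张图、19 个标志点
    pred = gt + rng.normal(0, 1.2, size=gt.shape)
    mre, sdr = mre_and_sdr(pred, gt)
    print(f"MRE = {mre:.2f} mm, SDR@2.0mm = {sdr:.1f}%")
```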
论文及项目相关链接
PDF 5 Pages, Under Review at The IEEE International Symposium on Biomedical Imaging (ISBI 2026)
Summary
这篇论文介绍了一种用于头颅测量标志点检测的稳健而高效的方法——CephRes-MHNet。它通过集成残差编码、双重注意力机制和多头解码器来提高上下文推理和解剖精度。在Aariz头颅测量数据集上训练的CephRes-MHNet取得了平均径向误差(MRE)1.23毫米和成功检测率(SDR)@2.0毫米85.5%的结果,优于所有评估的模型,包括最强的基线模型AFPF-Net,而其参数量不到AFPF-Net的25%。这表明CephRes-MHNet通过架构效率达到了最先进的准确性,为实际的正畸分析提供了切实可行的解决方案。
Key Takeaways
- CephRes-MHNet是一种用于头颅测量标志点检测的多头残差卷积网络,旨在实现稳健高效的检测。
- 网络架构融合了残差编码、双重注意力机制和多个解码器,增强了上下文推理和解剖精度。
- CephRes-MHNet在Aariz头颅测量数据集上进行了训练,实现了平均径向误差(MRE)为1.23毫米的优异性能。
- CephRes-MHNet在成功检测率(SDR)上优于包括最强基线AFPF-Net在内的所有对比模型,且参数量不到AFPF-Net的25%。
- CephRes-MHNet达到了最先进的准确性,通过结构效率实现了高效性能。
- 该方法为正畸诊断提供了切实可行的解决方案,特别是在自动化处理头颅测量标志点检测方面。
点此查看论文截图
Split-Layer: Enhancing Implicit Neural Representation by Maximizing the Dimensionality of Feature Space
Authors:Zhicheng Cai, Hao Zhu, Linsen Chen, Qiu Shen, Xun Cao
Implicit neural representation (INR) models signals as continuous functions using neural networks, offering efficient and differentiable optimization for inverse problems across diverse disciplines. However, the representational capacity of INR defined by the range of functions the neural network can characterize, is inherently limited by the low-dimensional feature space in conventional multilayer perceptron (MLP) architectures. While widening the MLP can linearly increase feature space dimensionality, it also leads to a quadratic growth in computational and memory costs. To address this limitation, we propose the split-layer, a novel reformulation of MLP construction. The split-layer divides each layer into multiple parallel branches and integrates their outputs via Hadamard product, effectively constructing a high-degree polynomial space. This approach significantly enhances INR’s representational capacity by expanding the feature space dimensionality without incurring prohibitive computational overhead. Extensive experiments demonstrate that the split-layer substantially improves INR performance, surpassing existing methods across multiple tasks, including 2D image fitting, 2D CT reconstruction, 3D shape representation, and 5D novel view synthesis.
隐式神经网络表示(INR)使用神经网络将信号表示为连续函数,为各学科中的反问题提供了高效且可微分的优化方案。然而,INR的表示能力受限于神经网络能够表征的函数范围,这本质上是由传统多层感知器(MLP)架构中的低维特征空间所导致的。虽然拓宽MLP可以线性增加特征空间的维度,但这也会导致计算和内存成本的二次增长。为了解决这一限制,我们提出了分割层(split-layer)的概念,这是对MLP构建的一种新型重新表述。分割层将每一层分割成多个并行分支,并通过哈达玛积(Hadamard product)整合它们的输出,从而有效地构建了一个高次多项式空间。这种方法通过扩展特征空间的维度,显著提高了INR的表示能力,同时不会产生过高的计算开销。大量实验表明,分割层能显著提高INR的性能,在多个任务上超越现有方法,包括2D图像拟合、2D CT重建、3D形状表示和5D新颖视图合成。
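split-layer 的核心思想(把一层拆成多个并行分支、分支输出做 Hadamard 积融合)可以用几行 PyTorch 勾勒出来。以下为假设性草图,分支数、隐藏维度与层数均为演示取值,并非论文的具体配置。

```python
# 概念性示意:split-layer —— 并行分支 + Hadamard 积融合,构造高次多项式特征空间
import torch
import torch.nn as nn

class SplitLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int, n_branches: int = 2):
        super().__init__()
        self.branches = nn.ModuleList([nn.Linear(in_dim, out_dim) for _ in range(n_branches)])

    def forward(self, x):
        out = self.branches[0](x)
        for branch in self.branches[1:]:
            out = out * branch(x)          # Hadamard 积:两个线性分支相乘即得到二次项
        return out

class SplitLayerINR(nn.Module):
    """坐标 -> 信号值 的隐式表示,堆叠 split-layer 以扩大特征空间的有效维度。"""
    def __init__(self, coord_dim: int = 2, hidden: int = 64, n_layers: int = 3):
        super().__init__()
        layers, d = [], coord_dim
        for _ in range(n_layers):
            layers += [SplitLayer(d, hidden), nn.ReLU()]
            d = hidden
        layers.append(nn.Linear(d, 1))
        self.net = nn.Sequential(*layers)

    def forward(self, coords):
        return self.net(coords)

if __name__ == "__main__":
    model = SplitLayerINR()
    coords = torch.rand(1024, 2) * 2 - 1   # [-1, 1] 范围内的 2D 坐标
    print(model(coords).shape)             # torch.Size([1024, 1])
```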
论文及项目相关链接
PDF AAAI 2026
总结
隐式神经表示(INR)利用神经网络将信号建模为连续函数,为解决跨学科的逆问题提供了高效且可微分的优化方案。然而,由于常规多层感知器(MLP)架构的特征空间维度较低,INR的表征能力受到限制。为扩大特征空间维度而增加MLP层宽会导致计算和内存成本二次增长。为解决这一问题,我们提出了分裂层(split-layer)这一新型MLP构建方法。分裂层将每层分为多个并行分支,通过Hadamard积整合输出,构建出高次多项式特征空间。此方法在扩大特征空间维度的同时,不会造成过大的计算负担。大量实验表明,分裂层技术大幅提升了INR的性能,在多项任务上超越现有方法,包括2D图像拟合、2D CT重建、3D形状表示和5D新颖视图合成。
关键见解
- 隐式神经表示(INR)利用神经网络将信号建模为连续函数,适用于多种学科的逆问题。
- INR的表征能力受限于多层感知器(MLP)的特征空间维度。
- 单纯增加MLP层宽虽能提升特征空间维度,但会导致计算和内存成本显著增加。
- 提出的分裂层方法通过并行分支和Hadamard积有效扩大了特征空间,提升了INR的表征能力。
- 分裂层方法在多项任务上表现出卓越性能,包括图像拟合、CT重建、形状表示和视图合成。
- 分裂层方法在保证计算效率的同时,有效提升了INR的性能。
点此查看论文截图
Radiology Workflow-Guided Hierarchical Reinforcement Fine-Tuning for Medical Report Generation
Authors:Bodong Du, Honglong Yang, Xiaomeng Li
Radiologists compose diagnostic reports through a structured workflow: they describe visual findings, summarize them into impressions, and carefully refine statements in clinically critical cases. However, most existing medical report generation (MRG) systems treat reports as flat sequences, overlooking this hierarchical organization and leading to inconsistencies between descriptive and diagnostic content. To align model behavior with real-world reporting practices, we propose RadFlow, a hierarchical workflow-guided reinforcement optimization framework that explicitly models the structured nature of clinical reporting. RadFlow introduces a clinically grounded reward hierarchy that mirrors the organization of radiological reports. At the global level, the reward integrates linguistic fluency, medical-domain correctness, and cross-sectional consistency between Finding and Impression, promoting coherent and clinically faithful narratives. At the local level, a section-specific reward emphasizes Impression quality, reflecting its central role in diagnostic accuracy. Furthermore, a critical-aware policy optimization mechanism adaptively regularizes learning for high-risk or clinically sensitive cases, emulating the cautious refinement behavior of radiologists when documenting critical findings. Together, these components translate the structured reporting paradigm into the reinforcement fine-tuning process, enabling the model to generate reports that are both linguistically consistent and clinically aligned. Experiments on chest X-ray and carotid ultrasound datasets demonstrate that RadFlow consistently improves diagnostic coherence and overall report quality compared with state-of-the-art baselines.
放射科医生通过结构化工作流程撰写诊断报告:他们描述所见,将其总结为印象,并在临床关键病例中仔细精炼陈述。然而,大多数现有的医学报告生成(MRG)系统将报告视为平坦的序列,忽略了这种分层组织,导致描述性内容和诊断性内容之间出现不一致。为了将模型行为与真实世界的报告实践保持一致,我们提出了RadFlow,这是一个分层的、工作流引导的强化优化框架,它显式地建模临床报告的结构化特性。RadFlow引入了一个基于临床的奖励层次结构,该结构反映了放射报告的组织方式。在全局层面,奖励融合了语言流畅性、医学领域正确性,以及发现(Finding)与印象(Impression)两部分之间的一致性,促进连贯且忠实于临床的叙述。在局部层面,针对特定章节的奖励强调印象部分的质量,反映了其在诊断准确性中的核心作用。此外,一种关键病例感知的策略优化机制自适应地对高风险或临床敏感病例的学习进行正则化,模拟放射科医生在记录关键发现时的谨慎修正行为。这些组件共同将结构化报告范式融入强化微调过程,使模型能够生成既在语言上一致、又与临床对齐的报告。在胸部X射线和颈动脉超声数据集上的实验表明,与最先进的基线相比,RadFlow持续提高了诊断连贯性和总体报告质量。
论文及项目相关链接
Summary
该文介绍了一种名为RadFlow的医学报告生成系统,该系统采用分层工作流引导的强化优化框架,旨在模拟放射科医生撰写报告的结构化过程。RadFlow通过构建临床报告的结构化模型、引入奖励层次结构以及针对关键病例的政策优化机制,提高了报告的连贯性和临床准确性。实验证明,RadFlow相较于其他系统,能生成更高质量的医学报告。
Key Takeaways
- 医学报告的生成通常采用结构化工作流程,包括描述视觉发现、总结印象以及精细描述临床关键病例。
- 现有医学报告生成系统多将报告视为平面序列,忽视了其层次结构,导致描述与诊断内容的不一致。
- RadFlow是一个分层工作流引导的强化优化框架,旨在模拟真实世界的报告实践,并明确建模临床报告的结构化特性。
- RadFlow引入了一个临床基础的奖励层次结构,该结构反映了放射学报告的组织结构。
- 在全局层面,奖励结合了语言流畅性、医学领域正确性和跨节一致性,以促进连贯且符合临床实际的叙述。
- 在局部层面,针对印象部分的奖励强调了其质量,反映了印象在诊断准确性中的核心作用。
点此查看论文截图
MuSc-V2: Zero-Shot Multimodal Industrial Anomaly Classification and Segmentation with Mutual Scoring of Unlabeled Samples
Authors:Xurui Li, Feng Xue, Yu Zhou
Zero-shot anomaly classification (AC) and segmentation (AS) methods aim to identify and outline defects without using any labeled samples. In this paper, we reveal a key property that is overlooked by existing methods: normal image patches across industrial products typically find many other similar patches, not only in 2D appearance but also in 3D shapes, while anomalies remain diverse and isolated. To explicitly leverage this discriminative property, we propose a Mutual Scoring framework (MuSc-V2) for zero-shot AC/AS, which flexibly supports single 2D/3D or multimodality. Specifically, our method begins by improving 3D representation through Iterative Point Grouping (IPG), which reduces false positives from discontinuous surfaces. Then we use Similarity Neighborhood Aggregation with Multi-Degrees (SNAMD) to fuse 2D/3D neighborhood cues into more discriminative multi-scale patch features for mutual scoring. The core comprises a Mutual Scoring Mechanism (MSM) that lets samples within each modality assign scores to each other, and Cross-modal Anomaly Enhancement (CAE) that fuses 2D and 3D scores to recover modality-specific missing anomalies. Finally, Re-scoring with Constrained Neighborhood (RsCon) suppresses false classification based on similarity to more representative samples. Our framework flexibly works on both the full dataset and smaller subsets with consistently robust performance, ensuring seamless adaptability across diverse product lines. Aided by the novel framework, MuSc-V2 achieves significant performance improvements: a +23.7% AP gain on the MVTec 3D-AD dataset and a +19.3% boost on the Eyecandies dataset, surpassing previous zero-shot benchmarks and even outperforming most few-shot methods. The code will be available at https://github.com/HUST-SLOW/MuSc-V2.
零样本异常分类(AC)和分割(AS)方法旨在不使用任何标记样本来识别和描述缺陷。在本文中,我们揭示了一个被现有方法所忽视的关键属性:工业产品中的正常图像补丁通常会找到许多其他相似的补丁,不仅在2D外观方面,而且在3D形状方面也是如此,而异常值仍然多样且孤立。为了明确利用这种判别属性,我们提出了用于零样本AC/AS的Mutual Scoring框架(MuSc-V2),它灵活支持单2D/3D或多模式。具体来说,我们的方法首先通过迭代点分组(IPG)改进3D表示,从而减少来自不连续表面的误报。然后,我们使用具有多度的相似性邻域聚合(SNAMD)来将2D/3D邻域线索融合为更具判别力的多尺度补丁特征,以进行相互评分。核心部分包括让每种模态内的样本相互评分的Mutual Scoring Mechanism(MSM),以及融合2D和3D评分的Cross-modal Anomaly Enhancement(CAE),以恢复特定于模态的缺失异常值。最后,基于与更具代表性样本的相似性,通过受约束的邻域重新评分(RsCon)抑制误分类。我们的框架既适用于全数据集,也适用于较小的子集,并且具有一致稳定的性能,可确保在不同产品线之间进行无缝适应。借助新型框架MuSc-V2,实现了显著的性能提升:在MVTec 3D-AD数据集上提高了+23.7%的AP值,在Eyecandies数据集上提高了+19.3%的性能,超越了之前的零样本基准测试,甚至超越了大多数小样方法。代码将在https://github.com/HUST-SLOW/MuSc-V2提供。
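下面用一个极简草图示意"无标注样本相互打分"的直觉:某个图像块若在其他样本中找不到相似块(最近邻距离大),则更可能是异常。这里仅用最近邻距离代替论文中完整的 IPG/SNAMD/MSM/CAE/RsCon 流程,特征为随机模拟,仅供理解。

```python
# 概念性示意:互评打分 —— 以"与其他样本最相似图像块的距离"作为异常分数
import torch

def mutual_scores(patch_feats: torch.Tensor) -> torch.Tensor:
    """patch_feats: [N_images, N_patches, C];返回 [N_images, N_patches] 的异常分数。"""
    n = patch_feats.shape[0]
    scores = torch.zeros(patch_feats.shape[:2])
    for i in range(n):
        others = torch.cat([patch_feats[j] for j in range(n) if j != i], dim=0)  # 其余样本的所有块
        dists = torch.cdist(patch_feats[i], others)       # [N_patches, N_other_patches]
        scores[i] = dists.min(dim=1).values                # 最近邻距离越大,越像异常
    return scores

if __name__ == "__main__":
    torch.manual_seed(0)
    feats = torch.randn(8, 100, 32)
    feats[0, :5] += 4.0                                    # 人为注入 5 个"异常"块
    s = mutual_scores(feats)
    print(s[0, :5].mean() > s[0, 5:].mean())               # 异常块得分显著更高 -> True
```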
论文及项目相关链接
Summary
这篇论文提出了一种新的零样本异常分类与分割方法,即MuSc-V2框架。该方法通过利用正常图像补丁在工业品中的相似性和差异性,提高了异常检测和分割的性能。通过改进3D表示、融合多尺度补丁特征、建立相互评分机制和跨模态异常增强,MuSc-V2框架在MVTec 3D-AD和Eyecandies数据集上实现了显著的性能提升。
Key Takeaways
- 零样本异常分类和分割方法旨在无需标注样本进行缺陷识别和标注。
- 现有方法忽略了正常图像补丁在工业品中的相似性和差异性这一关键属性。
- MuSc-V2框架通过利用这一属性,提高了异常检测和分割的性能。
- MuSc-V2框架包括三个主要部分:改进3D表示、融合多尺度补丁特征的相互评分机制和跨模态异常增强。
- MuSc-V2框架具有灵活性和强大的性能,可适应不同的产品线和数据集大小。
- MuSc-V2框架在MVTec 3D-AD和Eyecandies数据集上的性能优于之前的零样本基准测试,甚至超过了大多数少样本方法。
点此查看论文截图
MIRNet: Integrating Constrained Graph-Based Reasoning with Pre-training for Diagnostic Medical Imaging
Authors:Shufeng Kong, Zijie Wang, Nuan Cui, Hao Tang, Yihan Meng, Yuanyuan Wei, Feifan Chen, Yingheng Wang, Zhuo Cai, Yaonan Wang, Yulong Zhang, Yuzheng Li, Zibin Zheng, Caihua Liu
Automated interpretation of medical images demands robust modeling of complex visual-semantic relationships while addressing annotation scarcity, label imbalance, and clinical plausibility constraints. We introduce MIRNet (Medical Image Reasoner Network), a novel framework that integrates self-supervised pre-training with constrained graph-based reasoning. Tongue image diagnosis is a particularly challenging domain that requires fine-grained visual and semantic understanding. Our approach leverages self-supervised masked autoencoder (MAE) to learn transferable visual representations from unlabeled data; employs graph attention networks (GAT) to model label correlations through expert-defined structured graphs; enforces clinical priors via constraint-aware optimization using KL divergence and regularization losses; and mitigates imbalance using asymmetric loss (ASL) and boosting ensembles. To address annotation scarcity, we also introduce TongueAtlas-4K, a comprehensive expert-curated benchmark comprising 4,000 images annotated with 22 diagnostic labels–representing the largest public dataset in tongue analysis. Validation shows our method achieves state-of-the-art performance. While optimized for tongue diagnosis, the framework readily generalizes to broader diagnostic medical imaging tasks.
医学图像自动解读要求对复杂的视觉语义关系进行稳健建模,同时解决标注稀缺、标签不平衡和临床可行性约束等问题。我们引入了MIRNet(医学图像推理网络),这是一个新型框架,它将自监督预训练与基于约束的图形推理相结合。舌苔图像诊断是一个特别具有挑战性的领域,需要精细的视觉和语义理解。我们的方法利用自监督掩码自动编码器(MAE)从非标记数据中学习可迁移的视觉表示;采用图注意力网络(GAT)通过专家定义的结构化图形对标签相关性进行建模;通过利用KL散度和正则化损失的约束感知优化来实施临床先验;并使用不对称损失(ASL)和增强集成来缓解不平衡问题。为解决标注稀缺问题,我们还推出了TongueAtlas-4K,这是一个包含4000张图像和22个诊断标签的综合性专家精选基准数据集——代表了舌苔分析领域最大的公开数据集。验证表明,我们的方法达到了最新技术水平。虽然该方法针对舌苔诊断进行了优化,但它很容易推广到更广泛的诊断医学影像任务中。
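针对标签不平衡,摘要提到使用非对称损失(ASL)。下面按公开文献中 ASL 的一般形式给出一个简化实现(超参数取值与"22 个诊断标签"的设定仅为演示,并非论文原始代码)。

```python
# 概念性示意:多标签不平衡场景下的非对称损失(ASL)简化版
import torch

def asymmetric_loss(logits, targets, gamma_pos=0.0, gamma_neg=4.0, clip=0.05, eps=1e-8):
    """logits, targets: [B, num_labels],targets 为 0/1 多标签。"""
    p = torch.sigmoid(logits)
    p_neg = (p - clip).clamp(min=0)                        # 负样本概率平移,弱化 easy negative
    loss_pos = targets * (1 - p).pow(gamma_pos) * torch.log(p.clamp(min=eps))
    loss_neg = (1 - targets) * p_neg.pow(gamma_neg) * torch.log((1 - p_neg).clamp(min=eps))
    return -(loss_pos + loss_neg).sum(dim=1).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    logits = torch.randn(4, 22, requires_grad=True)        # 假设 22 个诊断标签
    targets = (torch.rand(4, 22) < 0.1).float()            # 标签高度稀疏、不平衡
    loss = asymmetric_loss(logits, targets)
    loss.backward()
    print(float(loss))
```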
论文及项目相关链接
PDF To appear at AAAI-26
Summary
本文介绍了医疗图像自动解读中的MIRNet框架,该框架集成了自监督预训练和基于约束图的推理。针对舌图像诊断这一具有挑战的领域,MIRNet利用自监督掩码自动编码器学习可迁移的视觉表示,使用图注意力网络对专家定义的结构图进行标签关联建模,通过KL散度和正则化损失实施临床先验约束感知优化,并采用不对称损失和增强集成来缓解不平衡问题。为解决标注稀缺问题,还推出了TongueAtlas-4K数据集。验证结果表明,该方法达到业界领先水平,并易于推广至更广泛的医疗图像诊断任务。
Key Takeaways
- MIRNet框架结合了自监督预训练和基于约束图的推理,用于医疗图像自动解读。
- 舌图像诊断是一个需要精细视觉和语义理解的挑战领域。
- MIRNet利用自监督掩码自动编码器学习可迁移的视觉表示,从非标记数据中获取信息。
- 图注意力网络被用于通过专家定义的结构图进行标签关联建模。
- 实施临床先验约束感知优化,通过KL散度和正则化损失来处理临床可行性约束。
- 采用不对称损失和增强集成来缓解数据不平衡问题。
- 推出了TongueAtlas-4K数据集,这是舌分析领域最大的公开数据集。
点此查看论文截图
DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation
Authors:Xuexun Liu, Xiaoxu Xu, Qiudan Zhang, Lin Ma, Xu Wang
Weakly supervised 3D instance segmentation is essential for 3D scene understanding, especially as the growing scale of data and high annotation costs associated with fully supervised approaches. Existing methods primarily rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations, both of which aim to reduce labeling efforts. However, these approaches still encounter limitations, including labor-intensive annotation processes, high complexity, and reliance on expert annotators. To address these challenges, we propose \textbf{DBGroup}, a two-stage weakly supervised 3D instance segmentation framework that leverages scene-level annotations as a more efficient and scalable alternative. In the first stage, we introduce a Dual-Branch Point Grouping module to generate pseudo labels guided by semantic and mask cues extracted from multi-view images. To further improve label quality, we develop two refinement strategies: Granularity-Aware Instance Merging and Semantic Selection and Propagation. The second stage involves multi-round self-training on an end-to-end instance segmentation network using the refined pseudo-labels. Additionally, we introduce an Instance Mask Filter strategy to address inconsistencies within the pseudo labels. Extensive experiments demonstrate that DBGroup achieves competitive performance compared to sparse-point-level supervised 3D instance segmentation methods, while surpassing state-of-the-art scene-level supervised 3D semantic segmentation approaches. Code is available at https://github.com/liuxuexun/DBGroup.
弱监督3D实例分割对于3D场景理解至关重要,尤其是随着数据规模的增长和全监督方法相关的高标注成本。现有方法主要依赖于两种形式的弱监督:"一物一点击"(one-thing-one-click)标注和边界框标注,两者的目标都是减少标注工作量。然而,这些方法仍面临一些挑战,包括标注过程劳动强度高、复杂性高以及依赖专家标注者。为了解决这些挑战,我们提出了DBGroup,这是一个两阶段的弱监督3D实例分割框架,它利用场景级标注作为更高效和可扩展的替代方案。在第一阶段,我们引入了一个双分支点分组模块,该模块以多视角图像中提取的语义和掩膜线索为指导生成伪标签。为了进一步提高标签质量,我们开发了两种细化策略:粒度感知实例合并,以及语义选择与传播。第二阶段使用精炼后的伪标签,对端到端实例分割网络进行多轮自训练。此外,我们还引入了一个实例掩膜过滤策略,以解决伪标签内部的不一致性。大量实验表明,DBGroup与稀疏点级监督的3D实例分割方法相比具有竞争力,同时超越了最先进的场景级监督的3D语义分割方法。代码可通过https://github.com/liuxuexun/DBGroup获取。
论文及项目相关链接
摘要
该研究针对三维场景理解提出了一种弱监督下的三维实例分割方法,命名为DBGroup。它采用场景级别的标注作为高效可伸缩的替代方案,利用双分支点分组模块生成伪标签,该伪标签由多视角图像提取的语义和掩膜线索引导。通过粒度感知实例合并和语义选择和传播两种策略进一步优化标签质量。在伪标签的基础上,通过多轮端到端实例分割网络的自训练,提高分割性能。实验结果表明,DBGroup相较于稀疏点级监督的三维实例分割方法具有竞争力,并超越了场景级监督的三维语义分割方法。代码已公开在GitHub上。
关键见解
- DBGroup是一个两阶段的弱监督三维实例分割框架,旨在解决大规模数据下的标注成本高问题。
- 在第一阶段中引入了双分支点分组模块,使用场景级别标注生成伪标签,结合语义和掩膜线索指导伪标签的生成。
- 为提高标签质量,提出粒度感知实例合并和语义选择和传播两种细化策略。
- 第二阶段通过多轮自训练在端到端的实例分割网络上进行优化,并采用实例掩膜过滤策略解决伪标签的不一致性。
- 实验结果表明DBGroup在弱监督环境下的性能表现优异,与先进的场景级监督方法相比具有优势。
- 该方法的代码已经公开可供研究使用。
点此查看论文截图
GPDM: Generation-Prior Diffusion Model for Accelerated Direct Attenuation and Scatter Correction of Whole-body 18F-FDG PET
Authors:Min Jeong Cho, Hyeong Seok Shim, Sungyu Kim, Jae Sung Lee
Accurate attenuation and scatter corrections are crucial in positron emission tomography (PET) imaging for accurate visual interpretation and quantitative analysis. Traditional methods relying on computed tomography (CT) or magnetic resonance imaging (MRI) have limitations in accuracy, radiation exposure, and applicability. Deep neural networks provide potential approaches to estimating attenuation and scatter-corrected (ASC) PET from non-attenuation and non-scatter-corrected (NASC) PET images based on VAE or CycleGAN. However, the limitations inherent to conventional GAN-based methods, such as unstable training and mode collapse, need further advancements. To address these limitations and achieve more accurate attenuation and scatter corrections, we propose a novel framework for generating high-quality ASC PET images from NASC PET images: Generation-Prior Diffusion Model (GPDM). Our GPDM framework is based on the Denoising Diffusion Probabilistic Model (DDPM), but instead of starting sampling from an entirely different image distribution, it begins from a distribution similar to the target images we aim to generate. This similar distribution is referred to as the Generation-Prior. By leveraging this Generation-Prior, the GPDM framework effectively reduces the number of sampling steps and generates more refined ASC PET images. Our experimental results demonstrate that GPDM outperforms existing methods in generating ASC PET images, achieving superior accuracy while significantly reducing sampling time. These findings highlight the potential of GPDM to address the limitations of conventional methods and establish a new standard for efficient and accurate attenuation and scatter correction in PET imaging.
准确的衰减和散射校正在正电子发射断层扫描(PET)成像中对于准确的视觉解读和定量分析至关重要。传统的方法依赖于计算机断层扫描(CT)或磁共振成像(MRI),但在准确性、辐射暴露和适用性方面存在局限性。深度神经网络提供了基于变分自编码器(VAE)或CycleGAN从非衰减和非散射校正(NASC)PET图像估计衰减和散射校正(ASC)PET的潜在方法。然而,传统生成对抗网络(GAN)方法固有的局限性,如训练不稳定和模式崩溃,需要进一步改进。为了解决这些局限性,实现更准确的衰减和散射校正,我们提出了一种从NASC PET图像生成高质量ASC PET图像的新框架:生成先验扩散模型(GPDM)。我们的GPDM框架基于去噪扩散概率模型(DDPM),但与从一个完全不同的图像分布开始采样不同,它从与我们旨在生成的目标图像相似的分布开始。这个相似的分布被称为生成先验(Generation-Prior)。通过利用这种生成先验,GPDM框架有效地减少了采样步骤的数量,并生成了更精细的ASC PET图像。我们的实验结果表明,GPDM在生成ASC PET图像方面优于现有方法,实现了更高的准确性,同时显著减少了采样时间。这些发现突出了GPDM解决传统方法局限性的潜力,并为PET成像中的高效和准确衰减和散射校正建立了新标准。
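"生成先验"的思路可以用如下简化草图说明:不从纯噪声开始,而是把 NASC 图像前向加噪到某个中间时间步,再从该步开始做 DDPM 反向去噪,从而减少采样步数。以下代码中 eps_model 用恒零网络占位(实际应为训练好的去噪网络),t_start 与噪声调度均为演示假设。

```python
# 概念性示意:从"生成先验"(加噪后的 NASC 图像)出发的 DDPM 反向采样
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def eps_model(x_t, t):
    return torch.zeros_like(x_t)              # 占位去噪网络(假设),仅为使流程可运行

@torch.no_grad()
def sample_from_prior(nasc_image, t_start=300):
    """nasc_image: [B, 1, H, W];t_start 越小,保留的先验信息越多、采样步数越少。"""
    noise = torch.randn_like(nasc_image)
    # 将 NASC 图像前向加噪到中间时间步 t_start,作为采样起点
    x = alpha_bar[t_start].sqrt() * nasc_image + (1 - alpha_bar[t_start]).sqrt() * noise
    for t in range(t_start, -1, -1):           # 只需反向走 t_start+1 步,而非完整的 T 步
        eps = eps_model(x, t)
        mean = (x - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x

if __name__ == "__main__":
    nasc = torch.rand(1, 1, 64, 64)
    print(sample_from_prior(nasc, t_start=300).shape)
```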
论文及项目相关链接
PDF 25 pages, 10 figures
Summary
基于深度学习的医学图像重建技术已成为当今研究的热点,特别是对于PET成像中的衰减和散射校正问题。本文提出了一种新型的生成先验扩散模型(GPDM),采用扩散概率模型作为基础,借助生成先验(Generation-Prior)来生成高质量的衰减散射校正PET图像。实验结果显示,GPDM在生成ASC PET图像方面表现优异,准确度高且采样时间短,有望解决传统方法的局限性并确立新的标准。
Key Takeaways
- 衰减和散射校正在PET成像中至关重要,影响视觉解读和定量分析。
- 传统方法如依赖CT或MRI存在准确性、辐射暴露和适用性的局限性。
- 深度神经网络为估计衰减散射校正PET图像提供了潜在方法。
- 本文提出了基于生成先验扩散模型(GPDM)的新型框架,利用扩散概率模型生成高质量的ASC PET图像。
- GPDM借助生成先验来减少采样步骤并生成更精细的ASC PET图像。
- 实验证明GPDM在生成ASC PET图像方面表现优越,准确度高且采样时间短。
点此查看论文截图
Regional Attention-Enhanced Swin Transformer for Clinically Relevant Medical Image Captioning
Authors:Zubia Naz, Farhan Asghar, Muhammad Ishfaq Hussain, Yahya Hadadi, Muhammad Aasim Rafique, Wookjin Choi, Moongu Jeon
Automated medical image captioning translates complex radiological images into diagnostic narratives that can support reporting workflows. We present a Swin-BART encoder-decoder system with a lightweight regional attention module that amplifies diagnostically salient regions before cross-attention. Trained and evaluated on ROCO, our model achieves state-of-the-art semantic fidelity while remaining compact and interpretable. We report results as mean±std over three seeds and include 95% confidence intervals. Compared with baselines, our approach improves ROUGE (proposed 0.603, ResNet-CNN 0.356, BLIP2-OPT 0.255) and BERTScore (proposed 0.807, BLIP2-OPT 0.645, ResNet-CNN 0.623), with competitive BLEU, CIDEr, and METEOR. We further provide ablations (regional attention on/off and token-count sweep), per-modality analysis (CT/MRI/X-ray), paired significance tests, and qualitative heatmaps that visualize the regions driving each description. Decoding uses beam search (beam size = 4), length penalty = 1.1, no_repeat_ngram_size = 3, and max length = 128. The proposed design yields accurate, clinically phrased captions and transparent regional attributions, supporting safe research use with a human in the loop.
自动医学图像注释将复杂的放射学图像转化为诊断性叙述,支持报告工作流程。我们提出了一种基于Swin-BART编码器-解码器的系统,并带有轻量级区域注意模块,在交叉注意之前放大诊断显著的区域。我们的模型在ROCO数据集上进行训练和评估,实现了最先进的语义保真度,同时保持紧凑和可解释性。我们报告的结果为三个种子的平均值±标准差,并包括95%的置信区间。与基线相比,我们的方法在ROUGE(提出者得分为0.603,ResNet-CNN得分为0.356,BLIP2-OPT得分为0.255)和BERTScore(提出者得分为0.807,BLIP2-OPT得分为0.645,ResNet-CNN得分为0.623)上表现更好,同时BLEU、CIDEr和METEOR也具有竞争力。我们还提供了消融研究(区域注意力开关和令牌计数扫描)、模态分析(CT/MRI/X射线)、配对显著性测试以及可视化驱动描述的区域的定性热图。解码使用集束搜索(集束大小=4),长度惩罚=1.1,无重复ngram大小=3,最大长度=128。所提出的设计产生了准确、临床性措辞的注释和透明的区域归属,支持有人工参与的安全研究使用。
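摘要明确给出了解码超参数(束宽 4、长度惩罚 1.1、no_repeat_ngram_size 3、最大长度 128)。下面用公开的 facebook/bart-base 纯文本模型演示这组束搜索参数在 Hugging Face transformers 中的调用方式;该示例并非论文的 Swin-BART 图文模型,输入文本亦为示意。

```python
# 概念性示意:按摘要给出的解码超参数进行束搜索生成(以公开的 BART 文本模型代替 Swin-BART)
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

inputs = tokenizer("chest x-ray with right lower lobe opacity", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    num_beams=4,                # 束宽 4
    length_penalty=1.1,         # 长度惩罚 1.1
    no_repeat_ngram_size=3,     # 禁止重复的 3-gram
    max_length=128,             # 最大生成长度 128
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```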
论文及项目相关链接
Summary
本文介绍了一种基于Swin-BART编码器解码器的自动化医学图像描述系统,该系统配备了轻量级区域注意力模块,能够放大诊断关键区域。在ROCO数据集上训练与评估,该模型在保持紧凑和可解释性的同时,实现了先进的语义保真度。通过一系列实验和评估指标,证明了该模型相较于基线方法,在ROUGE、BERTScore等评估指标上取得了显著提升。同时,提供了可视化热图来展示驱动描述的关键区域。
Key Takeaways
- 介绍了基于Swin-BART编码器解码器的医学图像描述系统。
- 系统包含一个轻量级区域注意力模块,用于放大诊断关键区域。
- 模型在ROCO数据集上进行了训练与评估,实现高语义保真度。
- 模型相较于基线方法在ROUGE和BERTScore等评估指标上表现优越。
- 提供可视化热图来展示描述的关键区域。
- 模型的解码过程采用了特定的设置,包括beam搜索、长度惩罚、无重复ngram大小以及最大长度等参数。
- 该系统设计旨在为医学图像生成准确、临床术语的描述,并提供透明的区域归属。
点此查看论文截图
TomoGraphView: 3D Medical Image Classification with Omnidirectional Slice Representations and Graph Neural Networks
Authors:Johannes Kiechle, Stefan M. Fischer, Daniel M. Lang, Cosmin I. Bercea, Matthew J. Nyflot, Lina Felsner, Julia A. Schnabel, Jan C. Peeken
The growing number of medical tomography examinations has necessitated the development of automated methods capable of extracting comprehensive imaging features to facilitate downstream tasks such as tumor characterization, while assisting physicians in managing their growing workload. However, 3D medical image classification remains a challenging task due to the complex spatial relationships and long-range dependencies inherent in volumetric data. Training models from scratch suffers from low data regimes, and the absence of 3D large-scale multimodal datasets has limited the development of 3D medical imaging foundation models. Recent studies, however, have highlighted the potential of 2D vision foundation models, originally trained on natural images, as powerful feature extractors for medical image analysis. Despite these advances, existing approaches that apply 2D models to 3D volumes via slice-based decomposition remain suboptimal. Conventional volume slicing strategies, which rely on canonical planes such as axial, sagittal, or coronal, may inadequately capture the spatial extent of target structures when these are misaligned with standardized viewing planes. Furthermore, existing slice-wise aggregation strategies rarely account for preserving the volumetric structure, resulting in a loss of spatial coherence across slices. To overcome these limitations, we propose TomoGraphView, a novel framework that integrates omnidirectional volume slicing with spherical graph-based feature aggregation. We publicly share our accessible code base at http://github.com/compai-lab/2025-MedIA-kiechle and provide a user-friendly library for omnidirectional volume slicing at https://pypi.org/project/OmniSlicer.
随着医学断层扫描检查数量的不断增加,必须开发能够提取全面成像特征的自动化方法,以促进肿瘤特征化等下游任务,同时帮助医生管理日益增长的工作量。然而,由于体积数据中的复杂空间关系和长距离依赖关系,3D医学图像分类仍然是一项具有挑战性的任务。从头开始训练模型会受到数据不足的限制,缺乏大规模的3D多模式数据集限制了3D医学成像基础模型的发展。然而,最近的研究突出了原本在自然图像上训练的2D视觉基础模型在医学图像分析中的强大特征提取潜力。尽管取得了这些进展,但现有的将2D模型应用于3D体积数据的切片分解方法仍不理想。传统的体积切片策略依赖于标准平面(如轴向、矢状面或冠状面),当目标结构与标准化观看平面不对齐时,可能无法充分捕捉其空间范围。此外,现有的切片级聚合策略很少考虑保持体积结构,导致切片间空间连贯性的丧失。为了克服这些局限性,我们提出了TomoGraphView这一新型框架,它结合了全向体积切片和基于球形图的特征聚合。我们在http://github.com/compai-lab/2025-MedIA-kiechle公开分享了我们可访问的代码库,并在https://pypi.org/project/OmniSlicer提供了用户友好的全向体积切片库。
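"全方向体积切片"需要一组近似均匀覆盖球面的切片法向量。下面给出用 Fibonacci 球面采样生成这类方向的一个通用草图(假设性示意,与论文发布的 OmniSlicer 库的实际 API 无关;方向数 n_dirs 为演示取值)。

```python
# 概念性示意:Fibonacci 球面采样,得到近似均匀分布的切片法向量
import numpy as np

def fibonacci_sphere_directions(n_dirs: int = 64) -> np.ndarray:
    """返回 [n_dirs, 3] 的单位向量,近似均匀覆盖整个球面。"""
    i = np.arange(n_dirs)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i             # 黄金角步进的方位角
    z = 1.0 - 2.0 * (i + 0.5) / n_dirs                 # z 坐标在 [-1, 1] 上均匀分布
    r = np.sqrt(1.0 - z ** 2)
    dirs = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    return dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

if __name__ == "__main__":
    dirs = fibonacci_sphere_directions(64)
    print(dirs.shape, bool(np.allclose(np.linalg.norm(dirs, axis=1), 1.0)))
```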
论文及项目相关链接
PDF Preprint submitted to Medical Image Analysis (MedIA)
Summary
本文介绍了医学断层扫描检查数量的增长带来的挑战,包括需要从医学图像中提取全面特征以支持肿瘤表征等下游任务,以及协助医生应对日益增加的工作量。文章指出,尽管存在二维视觉基础模型在医学图像分析中的潜力,但由于对三维大规模多模态数据集的缺乏,以及传统的基于切片的二维模型在三维医学图像应用中的局限性,如切片过程中目标结构空间范围捕捉不足和切片间空间连贯性的丧失等问题,目前三维医学图像分类仍是一项挑战。为此,本文提出了一种新的框架TomoGraphView,结合了全方向体积切片和球形图特征聚合技术,并公开分享了其代码库和用户友好的全方向体积切片库。
Key Takeaways
- 随着医学断层扫描检查数量的增长,需要开发自动化方法从医学图像中提取全面的特征以支持下游任务。
- 三维医学图像分类面临挑战,主要由于数据复杂性、训练模型的局限性以及缺乏大规模三维多模态数据集。
- 二维视觉基础模型在医学图像分析中具有潜力,但传统的基于切片的二维模型应用于三维医学图像存在局限性。
- 全方向体积切片能够捕捉更多信息,而现有的切片方法可能无法充分捕捉目标的空间范围。
- TomoGraphView框架结合了全方向体积切片和球形图特征聚合技术来克服现有方法的局限性。
点此查看论文截图
Diffusion-Based Quality Control of Medical Image Segmentations across Organs
Authors:Vincenzo Marcianò, Hava Chaptoukaev, Virginia Fernandez, M. Jorge Cardoso, Sébastien Ourselin, Michela Antonelli, Maria A. Zuluaga
Medical image segmentation using deep learning (DL) has enabled the development of automated analysis pipelines for large-scale population studies. However, state-of-the-art DL methods are prone to hallucinations, which can result in anatomically implausible segmentations. With manual correction impractical at scale, automated quality control (QC) techniques have to address the challenge. While promising, existing QC methods are organ-specific, limiting their generalizability and usability beyond their original intended task. To overcome this limitation, we propose no-new Quality Control (nnQC), a robust QC framework based on a diffusion-generative paradigm that self-adapts to any input organ dataset. Central to nnQC is a novel Team of Experts (ToE) architecture, where two specialized experts independently encode 3D spatial awareness, represented by the relative spatial position of an axial slice, and anatomical information derived from visual features from the original image. A weighted conditional module dynamically combines the pair of independent embeddings, or opinions to condition the sampling mechanism within a diffusion process, enabling the generation of a spatially aware pseudo-ground truth for predicting QC scores. Within its framework, nnQC integrates fingerprint adaptation to ensure adaptability across organs, datasets, and imaging modalities. We evaluated nnQC on seven organs using twelve publicly available datasets. Our results demonstrate that nnQC consistently outperforms state-of-the-art methods across all experiments, including cases where segmentation masks are highly degraded or completely missing, confirming its versatility and effectiveness across different organs.
使用深度学习(DL)进行医学图像分割已经能够推动大规模人群研究的自动化分析流程的开发。然而,最先进的DL方法容易产生"幻觉",这可能导致解剖上不合理的分割结果。由于手动修正在大规模情况下不切实际,因此必须采用自动化质量控制(QC)技术来解决这一挑战。尽管前景广阔,但现有的QC方法是针对特定器官的,限制了其在原始任务之外的通用性和可用性。为了克服这一局限性,我们提出了无新质量控制(nnQC),这是一种基于扩散生成范式的稳健QC框架,可自适应于任何输入器官数据集。nnQC的核心是一种新型专家团队(ToE)架构,其中两个专门的专家分别独立编码3D空间感知(以轴向切片的相对空间位置表示)和由原始图像视觉特征导出的解剖信息。加权条件模块动态结合这对独立的嵌入(或称"意见"),以调节扩散过程中的采样机制,从而生成具有空间感知的伪真值(pseudo-ground truth),用于预测QC分数。在其框架下,nnQC集成了指纹自适应技术,以确保在不同器官、数据集和成像模式之间的适应性。我们在七个器官和十二个公开数据集上评估了nnQC。结果表明,nnQC在所有实验中均优于最先进的方法,包括分割掩膜高度退化或完全缺失的情况,这证实了其在不同器官上的通用性和有效性。
论文及项目相关链接
Summary
基于深度学习的医学图像分割技术为大规模人群研究提供了自动化分析管道。然而,最先进的深度学习方法容易出现幻觉,导致解剖结构不合理的分割结果。针对手动校正在大规模数据集中不实用的情况,必须采用自动化质量控制技术来解决挑战。尽管现有质量控制方法前景广阔,但它们具有器官特异性,限制了其在原始任务之外的一般化和可用性。为了克服这一局限性,我们提出了基于扩散生成范式的无新质量控制(nnQC)框架,该框架可自适应于任何输入器官数据集。nnQC的核心是一种新型专家团队(ToE)架构,其中两个专业专家独立编码3D空间感知和从原始图像派生的解剖学信息。加权条件模块动态结合了这两个独立的嵌入或意见,以调节扩散过程中的采样机制,生成具有空间感知的伪真值,用于预测质量控制分数。在框架内,nnQC通过指纹适应性确保了跨器官、数据集和成像模式的适应性。
Key Takeaways
点此查看论文截图
DualFete: Revisiting Teacher-Student Interactions from a Feedback Perspective for Semi-supervised Medical Image Segmentation
Authors:Le Yi, Wei Huang, Lei Zhang, Kefu Zhao, Yan Wang, Zizhou Wang
The teacher-student paradigm has emerged as a canonical framework in semi-supervised learning. When applied to medical image segmentation, the paradigm faces challenges due to inherent image ambiguities, making it particularly vulnerable to erroneous supervision. Crucially, the student’s iterative reconfirmation of these errors leads to self-reinforcing bias. While some studies attempt to mitigate this bias, they often rely on external modifications to the conventional teacher-student framework, overlooking its intrinsic potential for error correction. In response, this work introduces a feedback mechanism into the teacher-student framework to counteract error reconfirmations. Here, the student provides feedback on the changes induced by the teacher’s pseudo-labels, enabling the teacher to refine these labels accordingly. We specify that this interaction hinges on two key components: the feedback attributor, which designates pseudo-labels triggering the student’s update, and the feedback receiver, which determines where to apply this feedback. Building on this, a dual-teacher feedback model is further proposed, which allows more dynamics in the feedback loop and fosters more gains by resolving disagreements through cross-teacher supervision while avoiding consistent errors. Comprehensive evaluations on three medical image benchmarks demonstrate the method’s effectiveness in addressing error propagation in semi-supervised medical image segmentation.
教师-学生范式已成为半监督学习中的经典框架。当应用于医学图像分割时,由于图像固有的模糊性,该范式面临挑战,使其特别容易受到错误监督的影响。关键的是,学生对这些错误的反复确认会导致自我强化的偏差。虽然一些研究试图减轻这种偏差,但它们通常依赖于对传统教师-学生框架的外部修改,忽视了其内在的错误纠正潜力。为此,这项工作将反馈机制引入教师-学生框架以对抗错误确认。在这里,学生提供有关教师伪标签所引起变化的反馈,使教师能够相应地改进这些标签。这种交互依赖于两个关键组件:反馈归因器(feedback attributor)指定触发学生更新的伪标签,反馈接收器(feedback receiver)决定将反馈应用于何处。在此基础上,进一步提出了双教师反馈模型,使反馈回路更加动态,通过跨教师监督解决分歧、避免持续错误,从而带来更多收益。在三个医学图像基准测试上的综合评估证明了该方法在半监督医学图像分割中解决错误传播的有效性。
论文及项目相关链接
PDF Accepted by Proceedings of the AAAI Conference on Artificial Intelligence 40 (AAAI-26)
Summary
本文介绍了在医学图像分割中教师-学生模式的挑战及其解决方案。由于图像固有的模糊性,该模式容易受到错误的监督影响。学生反复确认这些错误会导致自我强化偏见。针对这一问题,本文引入反馈机制,让学生提供关于教师伪标签变化的反馈,使教师能够相应调整标签。通过双教师反馈模型的构建,为反馈循环提供更动态的方式,通过跨教师监督解决分歧,避免持续错误,从而获得更多收益。在三个医学图像基准测试上的综合评估证明了该方法在解决半监督医学图像分割中的错误传播方面的有效性。
Key Takeaways
- 教师-学生模式在医学图像分割中面临挑战,因图像模糊性易导致错误监督。
- 学生反复确认错误会导致自我强化偏见,需要有效机制来纠正。
- 引入反馈机制,学生提供关于教师伪标签变化的反馈,促进标签调整。
- 反馈机制包括反馈赋予器和反馈接收器两个关键组件。
- 提出双教师反馈模型,提供更多动态反馈方式,解决分歧并避免持续错误。
- 跨教师监督有助于增强模型的准确性及错误纠正能力。