⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, use with caution
🔴 Note: never use these for serious academic purposes; they are only for an initial screen before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-10-25
Towards General Modality Translation with Contrastive and Predictive Latent Diffusion Bridge
Authors:Nimrod Berman, Omkar Joglekar, Eitan Kosman, Dotan Di Castro, Omri Azencot
Recent advances in generative modeling have positioned diffusion models as state-of-the-art tools for sampling from complex data distributions. While these models have shown remarkable success across single-modality domains such as images and audio, extending their capabilities to Modality Translation (MT), translating information across different sensory modalities, remains an open challenge. Existing approaches often rely on restrictive assumptions, including shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding. In this work, we propose the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for modality translation based on a latent-variable extension of Denoising Diffusion Bridge Models. By operating in a shared latent space, our method learns a bridge between arbitrary modalities without requiring aligned dimensions. We introduce a contrastive alignment loss to enforce semantic consistency between paired samples and design a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space. Additionally, we propose a predictive loss to guide training toward accurate cross-domain translation and explore several training strategies to improve stability. Our approach supports arbitrary modality pairs and performs strongly on diverse MT tasks, including multi-view to 3D shape generation, image super-resolution, and multi-view scene synthesis. Comprehensive experiments and ablations validate the effectiveness of our framework, establishing a new strong baseline in general modality translation. For more information, see our project page: https://sites.google.com/view/lddbm/home.
Paper and Project Links
Summary
This paper introduces the Latent Denoising Diffusion Bridge Model (LDDBM), a general-purpose framework for Modality Translation (MT) based on a latent-variable extension of Denoising Diffusion Bridge Models. Operating in a shared latent space, the framework learns a bridge between arbitrary modalities without requiring aligned dimensions. With a contrastive alignment loss and a domain-agnostic encoder-decoder architecture for latent-space noise prediction, it achieves accurate cross-domain translation. The method supports arbitrary modality pairs and performs strongly across diverse MT tasks, establishing a new strong baseline for general modality translation. See the project page for more information.
Key Takeaways
- Diffusion models are state-of-the-art at sampling from complex data distributions, but Modality Translation (MT) remains an open challenge.
- Existing methods rely on restrictive assumptions, such as shared dimensionality, Gaussian source priors, and modality-specific architectures, which limit their generality and theoretical grounding.
- LDDBM is a general-purpose MT framework: a latent-variable extension of Denoising Diffusion Bridge Models that operates in a shared latent space and learns a bridge between arbitrary modalities without requiring aligned dimensions.
- LDDBM introduces a contrastive alignment loss to enforce semantic consistency between paired samples (a minimal sketch of such a loss follows this list).
- LDDBM uses a domain-agnostic encoder-decoder architecture tailored for noise prediction in latent space.
- LDDBM adds a predictive loss that guides training toward accurate cross-domain translation, and explores several training strategies to improve stability.
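The abstract does not give the exact form of the contrastive alignment loss, so the following is only a minimal sketch, assuming an InfoNCE-style objective over paired latent embeddings from the two modality encoders (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(z_src: torch.Tensor, z_tgt: torch.Tensor,
                               temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style alignment: pull paired latents together, push
    non-paired latents in the batch apart. z_src, z_tgt: (B, D) latents
    of paired samples from the source and target modality encoders."""
    z_src = F.normalize(z_src, dim=-1)
    z_tgt = F.normalize(z_tgt, dim=-1)
    logits = z_src @ z_tgt.t() / temperature                   # (B, B) similarities
    labels = torch.arange(z_src.size(0), device=z_src.device)  # positives on diagonal
    # Symmetric cross-entropy, as in CLIP-style contrastive training.
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```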
Monocular Visual 8D Pose Estimation for Articulated Bicycles and Cyclists
Authors:Eduardo R. Corral-Soto, Yang Liu, Yuan Ren, Bai Dongfeng, Liu Bingbing
In Autonomous Driving, cyclists belong to the safety-critical class of Vulnerable Road Users (VRU), and accurate estimation of their pose is critical for cyclist crossing intention classification, behavior prediction, and collision avoidance. Unlike rigid objects, articulated bicycles are composed of movable rigid parts linked by joints and constrained by a kinematic structure. 6D pose methods can estimate the 3D rotation and translation of rigid bicycles, but 6D becomes insufficient when the steering/pedals angles of the bicycle vary. That is because: 1) varying the articulated pose of the bicycle causes its 3D bounding box to vary as well, and 2) the 3D box orientation is not necessarily aligned to the orientation of the steering which determines the actual intended travel direction. In this work, we introduce a method for category-level 8D pose estimation for articulated bicycles and cyclists from a single RGB image. Besides being able to estimate the 3D translation and rotation of a bicycle from a single image, our method also estimates the rotations of its steering handles and pedals with respect to the bicycle body frame. These two new parameters enable the estimation of a more fine-grained bicycle pose state and travel direction. Our proposed model jointly estimates the 8D pose and the 3D Keypoints of articulated bicycles, and trains with a mix of synthetic and real image data to generalize on real images. We include an evaluation section where we evaluate the accuracy of our estimated 8D pose parameters, and our method shows promising results by achieving competitive scores when compared against state-of-the-art category-level 6D pose estimators that use rigid canonical object templates for matching.
Paper and Project Links
Summary
This paper presents an 8D pose estimation method for articulated bicycles and cyclists, a safety-critical class of Vulnerable Road Users (VRU). From a single RGB image, the method estimates not only the bicycle's 3D translation and rotation but also the rotations of the steering handles and pedals relative to the bicycle body frame, giving a finer-grained estimate of the bicycle's pose state and travel direction. Trained on a mix of synthetic and real image data, the method generalizes to real images.
Key Takeaways
- Cyclist and bicycle pose estimation is critical in autonomous driving, underpinning crossing-intention classification, behavior prediction, and collision avoidance.
- Conventional 6D pose estimation cannot handle variation in a bicycle's steering and pedal angles, motivating a finer-grained 8D pose.
- The proposed method jointly estimates the bicycle's 3D translation and rotation together with the rotations of the steering handles and pedals (a data-structure sketch follows this list).
- Training on a mix of synthetic and real image data lets the model generalize to real images.
- Evaluations show the method achieves competitive scores against state-of-the-art category-level 6D pose estimators that match rigid canonical object templates.
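The abstract defines the 8D state as 3D translation, 3D rotation, and two articulation angles. Below is a minimal sketch of how that state might be represented; the field names, angle conventions, and travel-direction rule are assumptions, not the paper's code:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class BicyclePose8D:
    """8D articulated pose: the rigid 6D pose of the bicycle body frame
    plus two articulation angles. Field names are illustrative."""
    translation: np.ndarray   # (3,) position in the camera frame, metres
    rotation: np.ndarray      # (3,) body orientation as (yaw, pitch, roll), radians
    steering_angle: float     # handlebar rotation about the steering axis, radians
    pedal_angle: float        # crank rotation about the bottom bracket, radians

    def travel_yaw(self) -> float:
        """Intended travel direction: body yaw offset by the steering angle
        (a simplification; the paper may derive it differently)."""
        return float(self.rotation[0]) + self.steering_angle

pose = BicyclePose8D(np.zeros(3), np.array([0.3, 0.0, 0.0]), 0.2, 1.5)
print(pose.travel_yaw())  # 0.5 rad
```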
Curvilinear Structure-preserving Unpaired Cross-domain Medical Image Translation
Authors:Zihao Chen, Yi Zhou, Xudong Jiang, Li Chen, Leopold Schmetterer, Bingyao Tan, Jun Cheng
Unpaired image-to-image translation has emerged as a crucial technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation without costly paired datasets. Yet, existing approaches often distort fine curvilinear structures, such as microvasculature, undermining both diagnostic reliability and quantitative analysis. This limitation is consequential in ophthalmic and vascular imaging, where subtle morphological changes carry significant clinical meaning. We propose Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into the training. Specifically, CST augments baseline models with a curvilinear extraction module for topological supervision. It can be seamlessly incorporated into existing methods. We integrate it into CycleGAN and UNSB as two representative backbones. Comprehensive evaluation across three imaging modalities: optical coherence tomography angiography, color fundus and X-ray coronary angiography demonstrates that CST improves translation fidelity and achieves state-of-the-art performance. By reinforcing geometric integrity in learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
Paper and Project Links
Summary
Unpaired image-to-image translation has become a key technique in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation. However, existing methods often distort fine curvilinear structures such as microvasculature, undermining diagnostic reliability and quantitative analysis. This work proposes Curvilinear Structure-preserving Translation (CST), a general framework that explicitly preserves fine curvilinear structures during unpaired translation by integrating structure consistency into training. CST augments baseline models with a curvilinear extraction module for topological supervision and can be incorporated seamlessly into existing methods such as CycleGAN and UNSB. Evaluations show that CST improves translation fidelity and achieves state-of-the-art performance, reinforcing geometric integrity in learned mappings and establishing a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
Key Takeaways
- Unpaired image-to-image translation plays an important role in medical imaging, enabling cross-modality synthesis, domain adaptation, and data augmentation.
- Existing methods tend to distort fine curvilinear structures such as microvasculature, undermining diagnosis and quantitative analysis.
- The CST framework preserves fine curvilinear structures by integrating structure consistency into training, improving translation fidelity (a sketch of such a consistency term follows this list).
- CST augments baseline models with a curvilinear extraction module for topological supervision and can be incorporated seamlessly into existing methods such as CycleGAN and UNSB.
- In comprehensive evaluations across three imaging modalities (optical coherence tomography angiography, color fundus, and X-ray coronary angiography), CST achieves state-of-the-art performance.
- By reinforcing geometric integrity in the learned mappings, CST establishes a principled pathway toward curvilinear structure-aware cross-domain translation in medical imaging.
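The abstract describes a curvilinear extraction module that supplies topological supervision on top of a baseline translator such as CycleGAN or UNSB. Here is a minimal sketch of what such a structure-consistency term could look like; the extractor, the L1 form, and the loss weight are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def curvilinear_consistency_loss(x: torch.Tensor, g_x: torch.Tensor,
                                 extractor: nn.Module) -> torch.Tensor:
    """Penalize changes in curvilinear structure between an input image x
    and its translation g_x = G(x). `extractor` stands in for the paper's
    curvilinear extraction module (e.g. a frozen vessel-segmentation net)."""
    with torch.no_grad():
        target_map = extractor(x)    # curvilinear map of the source image
    pred_map = extractor(g_x)        # gradients flow back into the generator G
    return F.l1_loss(pred_map, target_map)

# Hypothetical integration into a CycleGAN-style objective:
# total_loss = gan_loss + cycle_loss + lambda_cst * curvilinear_consistency_loss(x, g_x, extractor)
```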
Re-evaluating Minimum Bayes Risk Decoding for Automatic Speech Recognition
Authors:Yuu Jinnai
Recent work has shown that sample-based Minimum Bayes Risk (MBR) decoding outperforms beam search in text-to-text generation tasks, such as machine translation, text summarization, and image captioning. On the other hand, beam search is the current practice for speech-to-text tasks such as automatic speech recognition (ASR) and Speech Translation (ST). Given that MBR decoding is effective in text-to-text generation tasks, it is reasonable to expect it to also be effective for speech-to-text tasks. In this paper, we evaluate MBR decoding for ASR and ST tasks on English and Japanese using Whisper and its derivative models. We observe that the accuracy of MBR decoding outperforms that of beam search in most of the experimental settings we have evaluated. The results show that MBR decoding is a promising method for offline ASR and ST tasks that require high accuracy. The code is available at https://github.com/CyberAgentAILab/mbr-for-asr
Paper and Project Links
Summary
This paper examines sample-based Minimum Bayes Risk (MBR) decoding for speech-to-text tasks, namely automatic speech recognition (ASR) and speech translation (ST). Evaluating MBR decoding on English and Japanese ASR and ST with Whisper and its derivative models, the authors find that it outperforms the commonly used beam search in most experimental settings, making MBR decoding a promising method for offline ASR and ST tasks that require high accuracy.
Key Takeaways
- Sample-based MBR decoding has been shown to outperform beam search in text-to-text generation tasks.
- The paper evaluates MBR decoding on speech-to-text tasks (ASR and ST), where beam search is the current practice (a minimal MBR sketch follows this list).
- On English and Japanese ASR and ST tasks with Whisper and its derivatives, MBR decoding outperforms beam search in most experimental settings.
- MBR decoding is particularly suited to offline ASR and ST tasks that require high accuracy.
- The released code supports further research on, and application of, MBR decoding for ASR.
- MBR decoding is promising and could become a new default for speech-to-text tasks.
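As a concrete reference, here is a minimal sketch of sample-based MBR decoding, using expected word edit distance (a proxy for WER) as the risk; the paper's exact utility function and sampling setup may differ:

```python
def word_edit_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance over word tokens (the numerator of WER)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (wa != wb)))    # substitution
        prev = cur
    return prev[-1]

def mbr_decode(hypotheses: list[str]) -> str:
    """Sample-based MBR: return the hypothesis whose total edit distance
    to all sampled hypotheses is lowest, i.e. the minimum-risk candidate."""
    tokens = [h.split() for h in hypotheses]
    risks = [sum(word_edit_distance(t, other) for other in tokens) for t in tokens]
    return hypotheses[risks.index(min(risks))]

# Hypotheses would come from e.g. Whisper with temperature sampling:
samples = ["the cat sat down", "the cat sat", "a cat sat down", "the cat sat down"]
print(mbr_decode(samples))  # -> "the cat sat down"
```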
Real-time, inline quantitative MRI enabled by scanner-integrated machine learning: a proof of principle with NODDI
Authors:Samuel Rot, Iulius Dragonu, Christina Triantafyllou, Matthew Grech-Sollars, Anastasia Papadaki, Laura Mancini, Stephen Wastling, Jennifer Steeden, John S. Thornton, Tarek Yousry, Claudia A. M. Gandini Wheeler-Kingshott, David L. Thomas, Daniel C. Alexander, Hui Zhang
Purpose: The clinical feasibility and translation of many advanced quantitative MRI (qMRI) techniques are inhibited by their restriction to ‘research mode’, due to resource-intensive, offline parameter estimation. This work aimed to achieve ‘clinical mode’ qMRI, by real-time, inline parameter estimation with a trained neural network (NN) fully integrated into a vendor’s image reconstruction environment, therefore facilitating and encouraging clinical adoption of advanced qMRI techniques. Methods: The Siemens Image Calculation Environment (ICE) pipeline was customised to deploy trained NNs for advanced diffusion MRI parameter estimation with Open Neural Network Exchange (ONNX) Runtime. Two fully-connected NNs were trained offline with data synthesised with the neurite orientation dispersion and density imaging (NODDI) model, using either conventionally estimated (NNMLE) or ground truth (NNGT) parameters as training labels. The strategy was demonstrated online with an in vivo acquisition and evaluated offline with synthetic test data. Results: NNs were successfully integrated and deployed natively in ICE, performing inline, whole-brain, in vivo NODDI parameter estimation in <10 seconds. DICOM parametric maps were exported from the scanner for further analysis, generally finding that NNMLE estimates were more consistent than NNGT with conventional estimates. Offline evaluation confirms that NNMLE has comparable accuracy (or bias) and precision (or robustness to noise), whereas NNGT exhibits compromised accuracy at the benefit of higher precision. Conclusion: Real-time, inline parameter estimation with the proposed generalisable framework resolves a key practical barrier to clinical uptake of advanced qMRI methods and enables their efficient integration into clinical workflows.
Paper and Project Links
PDF 19 pages, 5 figures, 2 supporting materials
Summary
Real-time, inline parameter estimation with a trained neural network integrated into the vendor's image reconstruction environment breaks the barrier that keeps advanced quantitative MRI (qMRI) techniques in 'research mode', enabling their move to 'clinical mode' and their adoption in clinical workflows.
Key Takeaways
- The goal is 'clinical mode' qMRI: real-time, inline parameter estimation with a trained neural network (NN) fully integrated into the vendor's image reconstruction environment.
- The Siemens Image Calculation Environment (ICE) pipeline was customised to deploy trained NNs for advanced diffusion MRI parameter estimation.
- The trained NNs are deployed with Open Neural Network Exchange (ONNX) Runtime (a minimal export-and-inference sketch follows this list).
- Two fully-connected NNs were trained offline on data synthesised with the neurite orientation dispersion and density imaging (NODDI) model, using either conventionally estimated (NNMLE) or ground-truth (NNGT) parameters as training labels.
- The NNs were successfully integrated and deployed natively in ICE, performing inline, whole-brain, in vivo NODDI parameter estimation in under 10 seconds.
- DICOM parametric maps exported from the scanner show that NNMLE estimates are generally more consistent with conventional estimates than NNGT.
- Offline evaluation confirms that NNMLE matches conventional estimation in accuracy (bias) and precision (robustness to noise), whereas NNGT trades some accuracy for higher precision.
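The scanner-side specifics (Siemens ICE integration) are vendor-internal, but the export-and-inference pattern with ONNX Runtime is generic. A minimal sketch with a toy fully-connected network; the layer sizes, input length, and three NODDI outputs are placeholders, not the paper's architecture:

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Toy fully-connected regressor from a 64-measurement diffusion signal
# to 3 NODDI parameters (e.g. NDI, ODI, FWF); sizes are placeholders.
net = nn.Sequential(nn.Linear(64, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),
                    nn.Linear(256, 3))
# ... offline training on synthetic NODDI signals would happen here ...

torch.onnx.export(net, torch.randn(1, 64), "noddi_fc.onnx",
                  input_names=["signal"], output_names=["params"],
                  dynamic_axes={"signal": {0: "voxel"}, "params": {0: "voxel"}})

# Inference as a reconstruction pipeline would run it via ONNX Runtime:
sess = ort.InferenceSession("noddi_fc.onnx")
signals = np.random.rand(10_000, 64).astype(np.float32)   # stand-in voxel data
params = sess.run(["params"], {"signal": signals})[0]      # shape (10000, 3)
print(params.shape)
```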
Doctor Approved: Generating Medically Accurate Skin Disease Images through AI-Expert Feedback
Authors:Janet Wang, Yunbei Zhang, Zhengming Ding, Jihun Hamm
Paucity of medical data severely limits the generalizability of diagnostic ML models, as the full spectrum of disease variability can not be represented by a small clinical dataset. To address this, diffusion models (DMs) have been considered as a promising avenue for synthetic image generation and augmentation. However, they frequently produce medically inaccurate images, deteriorating the model performance. Expert domain knowledge is critical for synthesizing images that correctly encode clinical information, especially when data is scarce and quality outweighs quantity. Existing approaches for incorporating human feedback, such as reinforcement learning (RL) and Direct Preference Optimization (DPO), rely on robust reward functions or demand labor-intensive expert evaluations. Recent progress in Multimodal Large Language Models (MLLMs) reveals their strong visual reasoning capabilities, making them adept candidates as evaluators. In this work, we propose a novel framework, coined MAGIC (Medically Accurate Generation of Images through AI-Expert Collaboration), that synthesizes clinically accurate skin disease images for data augmentation. Our method creatively translates expert-defined criteria into actionable feedback for image synthesis of DMs, significantly improving clinical accuracy while reducing the direct human workload. Experiments demonstrate that our method greatly improves the clinical quality of synthesized skin disease images, with outputs aligning with dermatologist assessments. Additionally, augmenting training data with these synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.
Paper and Project Links
PDF NeurIPS 2025
Summary
This paper tackles the scarcity of medical data by generating clinically accurate skin disease images for data augmentation through AI-expert collaboration. The proposed MAGIC framework combines diffusion models with the visual reasoning ability of multimodal large language models, translating expert-defined criteria into actionable feedback for image synthesis, which improves both the clinical accuracy of the synthesized images and downstream diagnostic accuracy.
Key Takeaways
- The paucity of medical data limits the generalizability of diagnostic ML models.
- Diffusion models are promising for synthetic image generation and augmentation, but frequently produce medically inaccurate images.
- Expert domain knowledge is critical for correctly encoding clinical information when data is scarce and quality outweighs quantity.
- Existing ways to incorporate human feedback, such as reinforcement learning and Direct Preference Optimization, rely on robust reward functions or labor-intensive expert evaluation.
- Multimodal large language models (MLLMs) exhibit strong visual reasoning capabilities, making them apt evaluators.
- The proposed MAGIC framework uses AI-expert collaboration to synthesize clinically accurate skin disease images for data augmentation (a schematic feedback loop follows this list).
- Augmenting training data with the synthesized images improves diagnostic accuracy by +9.02% on a challenging 20-condition skin disease classification task, and by +13.89% in the few-shot setting.
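The abstract does not detail MAGIC's internals, so the following is only a schematic of an AI-expert feedback loop in its spirit: an MLLM scores diffusion samples against expert-defined criteria, and the accepted/rejected split could then drive preference optimization of the generator. Every function body is a stand-in:

```python
# Schematic only: the paper's actual prompting, scoring rubric, and
# fine-tuning procedure are not specified in the abstract.

def generate_candidates(prompt: str, n: int) -> list[str]:
    """Stand-in for sampling n images from a diffusion model."""
    return [f"image_{i}<{prompt}>" for i in range(n)]

def mllm_score(image: str, criteria: list[str]) -> float:
    """Stand-in for an MLLM grading one image against expert-defined
    criteria (e.g. lesion morphology, colour, distribution)."""
    return (hash((image, tuple(criteria))) % 100) / 100.0

def feedback_round(prompt: str, criteria: list[str], n: int = 8, keep: int = 2):
    """Rank candidates by MLLM score; the kept/rejected split could then
    feed preference optimisation (e.g. DPO) of the generator."""
    ranked = sorted(generate_candidates(prompt, n),
                    key=lambda im: mllm_score(im, criteria), reverse=True)
    return ranked[:keep], ranked[keep:]

accepted, rejected = feedback_round(
    "dermoscopy image of nodular melanoma",
    criteria=["asymmetry", "border irregularity", "colour variegation"])
print(len(accepted), len(rejected))  # 2 6
```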
OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection
Authors:Max Gutbrod, David Rauber, Danilo Weber Nunes, Christoph Palm
The growing reliance on Artificial Intelligence (AI) in critical domains such as healthcare demands robust mechanisms to ensure the trustworthiness of these systems, especially when faced with unexpected or anomalous inputs. This paper introduces the Open Medical Imaging Benchmarks for Out-Of-Distribution Detection (OpenMIBOOD), a comprehensive framework for evaluating out-of-distribution (OOD) detection methods specifically in medical imaging contexts. OpenMIBOOD includes three benchmarks from diverse medical domains, encompassing 14 datasets divided into covariate-shifted in-distribution, near-OOD, and far-OOD categories. We evaluate 24 post-hoc methods across these benchmarks, providing a standardized reference to advance the development and fair comparison of OOD detection methods. Results reveal that findings from broad-scale OOD benchmarks in natural image domains do not translate to medical applications, underscoring the critical need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD aims to support the advancement of reliable and trustworthy AI systems in healthcare. The repository is available at https://github.com/remic-othr/OpenMIBOOD.
Paper and Project Links
PDF Updated results for NNGuide and ViM
Summary
As AI is increasingly relied upon in critical domains such as healthcare, robust mechanisms are needed to ensure system trustworthiness in the face of unexpected or anomalous inputs. This paper introduces OpenMIBOOD, a comprehensive framework for evaluating out-of-distribution (OOD) detection methods in medical imaging. It comprises three benchmarks from diverse medical domains spanning 14 datasets, split into covariate-shifted in-distribution, near-OOD, and far-OOD categories, on which 24 post-hoc methods are evaluated as a standardized reference. The results show that findings from large-scale OOD benchmarks on natural images do not transfer to medical applications, underscoring the need for such benchmarks in the medical field. By mitigating the risk of exposing AI models to inputs outside their training distribution, OpenMIBOOD supports the development of reliable and trustworthy AI systems in healthcare.
Key Takeaways
- OpenMIBOOD is a comprehensive framework for evaluating OOD detection methods in medical imaging.
- It comprises three benchmarks spanning 14 datasets from diverse medical domains, split into covariate-shifted in-distribution, near-OOD, and far-OOD categories.
- 24 post-hoc methods are evaluated across the benchmarks, providing a standardized reference for developing and fairly comparing OOD detection methods (a sketch of one such post-hoc score follows this list).
- Findings from large-scale OOD benchmarks on natural images do not transfer to medical applications.
- The medical field needs dedicated benchmarks to ensure the trustworthiness and reliability of AI systems.
- OpenMIBOOD helps mitigate the risk of exposing AI models to inputs outside their training distribution.
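As an example of the kind of post-hoc method such a benchmark evaluates, here is a sketch of the energy score (Liu et al., 2020) with AUROC as the ID-vs-OOD separability metric; the logits below are synthetic stand-ins, not benchmark data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def energy_score(logits: np.ndarray, T: float = 1.0) -> np.ndarray:
    """Negative energy, T * logsumexp(logits / T): higher means the
    classifier treats the input as more in-distribution."""
    m = logits.max(axis=1)  # shift for numerical stability
    return m + T * np.log(np.exp((logits - m[:, None]) / T).sum(axis=1))

# Stand-in logits for ID (peaked) and OOD (flat) test sets:
rng = np.random.default_rng(0)
id_logits = rng.normal(0.0, 1.0, (1000, 10))
id_logits[np.arange(1000), rng.integers(0, 10, 1000)] += 5.0
ood_logits = rng.normal(0.0, 1.0, (1000, 10))

scores = np.concatenate([energy_score(id_logits), energy_score(ood_logits)])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])   # 1 = in-distribution
print(f"AUROC: {roc_auc_score(labels, scores):.3f}")        # close to 1.0 here
```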
MsEdF: A Multi-stream Encoder-decoder Framework for Remote Sensing Image Captioning
Authors:Swadhin Das, Raksha Sharma
Remote sensing images contain complex spatial patterns and semantic structures, which makes them difficult for captioning models to describe accurately. Encoder-decoder architectures have become the widely used approach for RSIC, translating visual content into descriptive text. However, many existing methods rely on a single-stream architecture, which weakens the model's ability to describe the image accurately. Such single-stream architectures typically struggle to extract diverse spatial features or capture complex semantic relationships, limiting their effectiveness in scenes with high intraclass similarity or contextual ambiguity. In this work, we propose a novel Multi-stream Encoder-decoder Framework (MsEdF) which improves the performance of RSIC by optimizing both the spatial representation and language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, thereby promoting feature diversity through the integration of multiscale and structurally distinct cues. To improve the capture of context-aware descriptions, we refine the input sequence's semantic modeling on the decoder side using a stacked GRU architecture with an element-wise aggregation scheme. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
Paper and Project Links
Summary
Remote sensing images have complex spatial patterns and semantic structures that make them hard to caption. The widely used encoder-decoder architectures for remote sensing image captioning (RSIC) are limited when built as single-stream models, which struggle to describe images accurately. This work proposes a Multi-stream Encoder-decoder Framework (MsEdF) that improves RSIC by optimizing both the spatial representation and the language generation of the encoder-decoder architecture. The encoder fuses information from two complementary image encoders, promoting feature diversity by integrating multiscale and structurally distinct cues. On the decoder side, semantic modeling of the input sequence is refined with a stacked GRU architecture and an element-wise aggregation scheme to better capture context-aware descriptions. Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
Key Takeaways
- Remote sensing images contain complex spatial patterns and semantic structures, which makes captioning difficult.
- Existing encoder-decoder architectures for remote sensing image captioning (RSIC) have clear limitations.
- Most existing methods rely on a single-stream architecture that struggles to extract diverse spatial features or capture complex semantic relationships, especially in scenes with high intraclass similarity or contextual ambiguity.
- The proposed MsEdF framework improves RSIC by optimizing both the spatial representation and the language generation of the encoder-decoder architecture.
- The encoder fuses information from two complementary image encoders, promoting feature diversity through multiscale and structurally distinct cues.
- A stacked GRU decoder with an element-wise aggregation scheme improves the capture of context-aware descriptions (a sketch follows this list).
- Experiments on three benchmark RSIC datasets show that MsEdF outperforms several baseline models.
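A minimal sketch of the decoder side described in the abstract: a stacked GRU whose per-layer outputs are merged element-wise. The layer sizes, the use of summation as the aggregation, and the hidden-state initialization are assumptions, not the paper's exact design:

```python
import torch
import torch.nn as nn

class StackedGRUDecoder(nn.Module):
    """Stacked GRU caption decoder with element-wise aggregation of the
    per-layer outputs, in the spirit of MsEdF's decoder."""
    def __init__(self, vocab: int, embed: int = 256, hidden: int = 512, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.grus = nn.ModuleList(
            [nn.GRU(embed if i == 0 else hidden, hidden, batch_first=True)
             for i in range(layers)])
        self.out = nn.Linear(hidden, vocab)

    def forward(self, tokens: torch.Tensor, fused_features: torch.Tensor):
        # fused_features: (B, hidden) element-wise fusion of the two image
        # encoder streams, used here to initialise each GRU layer's state.
        h0 = fused_features.unsqueeze(0)            # (1, B, hidden)
        x = self.embed(tokens)                      # (B, T, embed)
        layer_outputs = []
        for gru in self.grus:
            x, _ = gru(x, h0)
            layer_outputs.append(x)
        x = torch.stack(layer_outputs).sum(dim=0)   # element-wise aggregation
        return self.out(x)                          # (B, T, vocab) logits

# Encoder fusion (assumption): fused = proj_cnn(feat_cnn) + proj_vit(feat_vit)
decoder = StackedGRUDecoder(vocab=10_000)
logits = decoder(torch.randint(0, 10_000, (4, 20)), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 20, 10000])
```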