⚠️ All of the summaries below were generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not use them in serious academic settings; they are intended only as a first-pass screen before reading the papers!
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-11-20 Update
MRI Embeddings Complement Clinical Predictors for Cognitive Decline Modeling in Alzheimer’s Disease Cohorts
Authors:Nathaniel Putera, Daniel Vilet Rodríguez, Noah Videcrantz, Julia Machnio, Mostafa Mehdipour Ghazi
Accurate modeling of cognitive decline in Alzheimer’s disease is essential for early stratification and personalized management. While tabular predictors provide robust markers of global risk, their ability to capture subtle brain changes remains limited. In this study, we evaluate the predictive contributions of tabular and imaging-based representations, with a focus on transformer-derived Magnetic Resonance Imaging (MRI) embeddings. We introduce a trajectory-aware labeling strategy based on Dynamic Time Warping clustering to capture heterogeneous patterns of cognitive change, and train a 3D Vision Transformer (ViT) via unsupervised reconstruction on harmonized and augmented MRI data to obtain anatomy-preserving embeddings without progression labels. The pretrained encoder embeddings are subsequently assessed using both traditional machine learning classifiers and deep learning heads, and compared against tabular representations and convolutional network baselines. Results highlight complementary strengths across modalities. Clinical and volumetric features achieved the highest AUCs of around 0.70 for predicting mild and severe progression, underscoring their utility in capturing global decline trajectories. In contrast, MRI embeddings from the ViT model were most effective in distinguishing cognitively stable individuals with an AUC of 0.71. However, all approaches struggled in the heterogeneous moderate group. These findings indicate that clinical features excel in identifying high-risk extremes, whereas transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for AD progression modeling.
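To make the trajectory-aware labeling step concrete, here is a minimal sketch that clusters toy cognitive-score trajectories with a Dynamic Time Warping distance and hierarchical clustering; the subject trajectories, cluster count, and clustering method are illustrative assumptions, not the authors' pipeline.

```python
# A minimal sketch (not the authors' code) of trajectory-aware labeling:
# cluster longitudinal cognitive-score trajectories with a DTW distance,
# then use the cluster IDs as progression labels for downstream classifiers.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Hypothetical data: one MMSE-like trajectory per subject (values are made up).
trajectories = [
    np.array([29, 29, 28, 28]),   # stable
    np.array([28, 26, 24, 21]),   # mild decline
    np.array([27, 23, 18, 14]),   # severe decline
    np.array([30, 29, 29, 28]),
    np.array([28, 25, 22, 20]),
]

n = len(trajectories)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw_distance(trajectories[i], trajectories[j])

# Hierarchical clustering on the DTW distance matrix; cluster IDs become
# trajectory-aware progression labels (e.g., stable / mild / severe).
labels = fcluster(linkage(squareform(dist), method="average"), t=3, criterion="maxclust")
print(labels)
```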
Paper and project links
PDF Accepted at SPIE - Medical Imaging Conference 2026
Summary:
This study addresses the importance of accurately modeling cognitive decline in Alzheimer's disease, comparing tabular predictors with imaging-based representations, in particular transformer-derived MRI embeddings. A trajectory-aware labeling strategy based on Dynamic Time Warping clustering is introduced to capture heterogeneous patterns of cognitive change, and a 3D Vision Transformer (ViT) is trained via unsupervised reconstruction to obtain anatomy-preserving embeddings without progression labels. Evaluation of the pretrained encoder embeddings shows complementary strengths across modalities: clinical and volumetric features achieve the highest AUCs of about 0.70 for predicting mild and severe progression, underscoring their value for capturing global decline trajectories, whereas ViT-derived MRI embeddings are most effective at distinguishing cognitively stable individuals (AUC 0.71). All approaches struggle on the heterogeneous moderate group. These findings suggest that clinical features excel at identifying high-risk extremes, while transformer-based MRI embeddings are more sensitive to subtle markers of stability, motivating multimodal fusion strategies for modeling AD progression.
Key Takeaways:
- Accurate modeling of cognitive decline in Alzheimer's disease is essential for early stratification and personalized management.
- The study evaluates the predictive contributions of tabular predictors and imaging-based representations, with a focus on transformer-derived MRI embeddings.
- A trajectory-aware labeling strategy based on Dynamic Time Warping clustering captures heterogeneous patterns of cognitive change.
- A 3D Vision Transformer (ViT) is trained via unsupervised reconstruction to produce anatomy-preserving embeddings.
- Clinical and volumetric features perform best at predicting the extremes of cognitive decline (mild and severe progression).
- Transformer-based MRI embeddings are most effective at distinguishing cognitively stable individuals.
H-CNN-ViT: A Hierarchical Gated Attention Multi-Branch Model for Bladder Cancer Recurrence Prediction
Authors:Xueyang Li, Zongren Wang, Yuliang Zhang, Zixuan Pan, Yu-Jen Chen, Nishchal Sapkota, Gelei Xu, Danny Z. Chen, Yiyu Shi
Bladder cancer is one of the most prevalent malignancies worldwide, with a recurrence rate of up to 78%, necessitating accurate post-operative monitoring for effective patient management. Multi-sequence contrast-enhanced MRI is commonly used for recurrence detection; however, interpreting these scans remains challenging, even for experienced radiologists, due to post-surgical alterations such as scarring, swelling, and tissue remodeling. AI-assisted diagnostic tools have shown promise in improving bladder cancer recurrence prediction, yet progress in this field is hindered by the lack of dedicated multi-sequence MRI datasets for recurrence assessment study. In this work, we first introduce a curated multi-sequence, multi-modal MRI dataset specifically designed for bladder cancer recurrence prediction, establishing a valuable benchmark for future research. We then propose H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model that enables selective weighting of features from the global (ViT) and local (CNN) paths based on contextual demands, achieving a balanced and targeted feature fusion. Our multi-branch architecture processes each modality independently, ensuring that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on our dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing state-of-the-art models. Our model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
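To illustrate the gated fusion idea, here is a minimal PyTorch sketch of per-modality gating between a global (ViT) feature and a local (CNN) feature; the sigmoid-gate formulation, feature dimensions, and classification head are assumptions for illustration, not the released H-CNN-ViT architecture.

```python
# A minimal PyTorch sketch (assumed formulation, not the authors' exact H-CNN-ViT)
# of gated fusion between a global (ViT) branch and a local (CNN) branch:
# a learned sigmoid gate decides, per feature dimension, how much of each path to keep.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vit_feat, cnn_feat):
        g = self.gate(torch.cat([vit_feat, cnn_feat], dim=-1))  # (B, dim) in [0, 1]
        return g * vit_feat + (1.0 - g) * cnn_feat              # context-dependent mix

class MultiBranchRecurrenceHead(nn.Module):
    """Each MRI sequence (modality) gets its own fusion block; fused features
    are concatenated and mapped to a recurrence probability."""
    def __init__(self, dim=256, n_modalities=3):
        super().__init__()
        self.fusions = nn.ModuleList(GatedFusion(dim) for _ in range(n_modalities))
        self.classifier = nn.Linear(dim * n_modalities, 1)

    def forward(self, vit_feats, cnn_feats):
        fused = [f(v, c) for f, v, c in zip(self.fusions, vit_feats, cnn_feats)]
        return torch.sigmoid(self.classifier(torch.cat(fused, dim=-1)))

# Toy usage with random per-modality embeddings (batch of 2, 3 MRI sequences).
vit_feats = [torch.randn(2, 256) for _ in range(3)]
cnn_feats = [torch.randn(2, 256) for _ in range(3)]
print(MultiBranchRecurrenceHead()(vit_feats, cnn_feats).shape)  # torch.Size([2, 1])
```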
Paper and project links
Summary:
This paper addresses the challenge of detecting bladder cancer recurrence, introduces a multi-sequence, multi-modal MRI dataset dedicated to recurrence prediction, and proposes H-CNN-ViT, a new Hierarchical Gated Attention Multi-Branch model. The model selectively weights features from the global (ViT) and local (CNN) paths to achieve balanced, targeted feature fusion, and processes each modality independently so that the unique properties of each imaging channel are optimally captured and integrated. Evaluated on the dataset, H-CNN-ViT achieves an AUC of 78.6%, surpassing existing models. The model is publicly available at https://github.com/XLIAaron/H-CNN-ViT.
Key Takeaways:
- Detecting bladder cancer recurrence is challenging and requires accurate, effective post-operative monitoring.
- Multi-sequence MRI is the standard detection tool, but interpreting the scans remains difficult even for radiologists.
- The lack of a dedicated MRI dataset for recurrence assessment has held back research, making this contribution significant.
- The proposed Hierarchical Gated Attention Multi-Branch model, H-CNN-ViT, effectively fuses global and local features.
- H-CNN-ViT processes each modality independently, ensuring the unique properties of each imaging channel are optimally captured and integrated.
- Evaluated on the dedicated dataset, H-CNN-ViT reaches an AUC of 78.6%, demonstrating strong performance.
Continual Learning for Image Captioning through Improved Image-Text Alignment
Authors:Bertram Taetz, Gal Bordelius
Generating accurate and coherent image captions in a continual learning setting remains a major challenge due to catastrophic forgetting and the difficulty of aligning evolving visual concepts with language over time. In this work, we propose a novel multi-loss framework for continual image captioning that integrates semantic guidance through prompt-based continual learning and contrastive alignment. Built upon a pretrained ViT-GPT-2 backbone, our approach combines standard cross-entropy loss with three additional components: (1) a prompt-based cosine similarity loss that aligns image embeddings with synthetically constructed prompts encoding objects, attributes, and actions; (2) a CLIP-style loss that promotes alignment between image embeddings and target caption embedding; and (3) a language-guided contrastive loss that employs a triplet loss to enhance class-level discriminability between tasks. Notably, our approach introduces no additional overhead at inference time and requires no prompts during caption generation. We find that this approach mitigates catastrophic forgetting, while achieving better semantic caption alignment compared to state-of-the-art methods. The code can be found via the following link: https://github.com/Gepardius/Taetz_Bordelius_Continual_ImageCaptioning.
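The sketch below shows one way the four loss terms described above could be combined in PyTorch; the loss weights, temperature, and embedding shapes are illustrative assumptions rather than values taken from the paper.

```python
# A minimal sketch (assumptions: embedding shapes and weightings are illustrative,
# not the paper's exact values) of the multi-loss objective described above:
# cross-entropy for captioning plus (1) prompt-cosine, (2) CLIP-style, and
# (3) triplet losses on the image/text embeddings.
import torch
import torch.nn.functional as F

def continual_caption_loss(caption_logits, caption_targets,
                           img_emb, prompt_emb, target_caption_emb,
                           anchor, positive, negative,
                           w_prompt=0.1, w_clip=0.1, w_triplet=0.1):
    # (0) standard token-level cross-entropy for caption generation
    ce = F.cross_entropy(caption_logits.flatten(0, 1), caption_targets.flatten())

    # (1) prompt-based cosine similarity: pull image embeddings toward
    # synthetic prompts encoding objects / attributes / actions
    prompt_cos = (1.0 - F.cosine_similarity(img_emb, prompt_emb, dim=-1)).mean()

    # (2) CLIP-style symmetric contrastive loss between image and caption embeddings
    logits = F.normalize(img_emb, dim=-1) @ F.normalize(target_caption_emb, dim=-1).T
    labels = torch.arange(img_emb.size(0))
    clip = 0.5 * (F.cross_entropy(logits / 0.07, labels) +
                  F.cross_entropy(logits.T / 0.07, labels))

    # (3) language-guided triplet loss for class-level discriminability across tasks
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)

    return ce + w_prompt * prompt_cos + w_clip * clip + w_triplet * triplet

# Toy shapes: batch of 4 captions of length 12 over a 1000-token vocab, 512-d embeddings.
B, L, V, D = 4, 12, 1000, 512
loss = continual_caption_loss(
    torch.randn(B, L, V), torch.randint(0, V, (B, L)),
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
    torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```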
Paper and project links
PDF 11 pages, 3 figures
Summary
This paper proposes a multi-loss framework for continual image captioning that provides semantic guidance through prompt-based continual learning and contrastive alignment. Built on a pretrained ViT-GPT-2 backbone, it combines the standard cross-entropy loss with three additional losses: a prompt-based cosine similarity loss, a CLIP-style loss, and a language-guided contrastive loss. The approach mitigates catastrophic forgetting while achieving better semantic caption alignment. Code is available at the linked repository.
Key Takeaways
- A new multi-loss framework for continual image captioning addresses catastrophic forgetting and the difficulty of aligning visual concepts with language.
- A prompt-based cosine similarity loss aligns image embeddings with synthetic prompts encoding objects, attributes, and actions.
- A CLIP-style loss promotes alignment between image embeddings and target caption embeddings.
- A language-guided contrastive loss uses a triplet loss to enhance class-level discriminability between tasks.
- The method mitigates catastrophic forgetting while achieving better semantic caption alignment than existing methods.
- No prompts are required at inference time, and no additional overhead is introduced.
LENS: Learning to Segment Anything with Unified Reinforced Reasoning
Authors:Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang
Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM). Code is available at https://github.com/hustvl/LENS.
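The following sketch shows one way a unified reward mixing sentence-, box-, and segment-level cues could be scored for an RL rollout; the specific reward terms, weights, and toy data are assumptions for illustration, not the released LENS implementation.

```python
# A minimal sketch (the exact reward terms and weights are assumptions, not the
# released LENS implementation) of a unified reward that mixes sentence-, box-,
# and segment-level cues for RL fine-tuning of a text-prompted segmenter.
import numpy as np

def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def mask_iou(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / (union + 1e-6)

def unified_reward(cot_text, pred_box, gt_box, pred_mask, gt_mask,
                   w_sentence=0.2, w_box=0.3, w_segment=0.5):
    # sentence-level: crude check that the rollout produced a non-trivial rationale
    sentence_r = 1.0 if len(cot_text.split()) >= 10 else 0.0
    box_r = box_iou(pred_box, gt_box)          # box-level localization cue
    seg_r = mask_iou(pred_mask, gt_mask)       # segment-level mask-quality cue
    return w_sentence * sentence_r + w_box * box_r + w_segment * seg_r

# Toy rollout: a short rationale, a slightly shifted box, and a noisy mask.
gt_mask = np.zeros((64, 64), bool); gt_mask[10:40, 10:40] = True
pred_mask = np.zeros((64, 64), bool); pred_mask[12:42, 12:42] = True
r = unified_reward("the red mug on the left side of the table next to the laptop",
                   [12, 12, 42, 42], [10, 10, 40, 40], pred_mask, gt_mask)
print(round(r, 3))
```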
Paper and project links
PDF Code is released at https://github.com/hustvl/LENS
Summary
Text-prompted image segmentation enables fine-grained visual understanding and is critical for human-computer interaction and robotics, yet existing supervised fine-tuning methods ignore explicit chain-of-thought (CoT) reasoning at test time, limiting generalization to unseen prompts and domains. LENS is a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation end to end, with unified rewards spanning sentence-, box-, and segment-level cues that encourage informative CoT rationales while refining mask quality. Using the publicly available 3-billion-parameter Qwen2.5-VL-3B-Instruct vision-language model, LENS achieves an average cIoU of 81.2% on RefCOCO, RefCOCO+, and RefCOCOg, outperforming the strong fine-tuned GLaMM by up to 5.6%. These results show that RL-driven CoT reasoning substantially improves text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM).
Key Takeaways
- Text-prompted image segmentation is essential for fine-grained visual understanding and applications such as human-computer interaction and robotics.
- Existing supervised fine-tuning methods ignore chain-of-thought (CoT) reasoning, limiting generalization to unseen prompts and domains.
- LENS is a reinforcement-learning framework that jointly optimizes the reasoning process and image segmentation.
- LENS uses unified reinforcement-learning rewards spanning sentence-, box-, and segment-level cues to improve performance.
- Using a 3-billion-parameter vision-language model, LENS achieves a high average cIoU on the benchmarks.
- LENS shows clear gains over competing methods, demonstrating the effectiveness of reinforcement learning for text-prompted segmentation.
GMAT: Grounded Multi-Agent Clinical Description Generation for Text Encoder in Vision-Language MIL for Whole Slide Image Classification
Authors:Ngoc Bui Lam Quang, Nam Le Nguyen Binh, Thanh-Huy Nguyen, Le Thien Phuc Nguyen, Quan Nguyen, Ulas Bagci
Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of gigapixel pathology slides. Recent work has introduced vision-language models (VLMs) into MIL pipelines to incorporate medical knowledge through text-based class descriptions rather than simple class names. However, when these methods rely on large language models (LLMs) to generate clinical descriptions or use fixed-length prompts to represent complex pathology concepts, the limited token capacity of VLMs often constrains the expressiveness and richness of the encoded class information. Additionally, descriptions generated solely by LLMs may lack domain grounding and fine-grained medical specificity, leading to suboptimal alignment with visual features. To address these challenges, we propose a vision-language MIL framework with two key contributions: (1) A grounded multi-agent description generation system that leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate and diverse clinical descriptions; (2) A text encoding strategy using a list of descriptions rather than a single prompt, capturing fine-grained and complementary clinical signals for better alignment with visual features. Integrated into a VLM-MIL pipeline, our approach shows improved performance over single-prompt class baselines and achieves results comparable to state-of-the-art models, as demonstrated on renal and lung cancer datasets.
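The sketch below illustrates the list-of-descriptions encoding idea: each class keeps several grounded descriptions whose embeddings are averaged into a prototype and matched against patch features. The text encoder here is a random-projection stand-in for a real VLM text encoder (e.g., CLIP/PLIP), and the descriptions are illustrative, not the paper's agent outputs.

```python
# A minimal sketch (the text encoder below is a random-projection stand-in for a
# real VLM encoder such as CLIP/PLIP; descriptions are illustrative, not from the
# paper's agents) of class prototypes built from a *list* of clinical descriptions
# instead of a single prompt, for similarity scoring against WSI patch features.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
EMB_DIM = 512
_proj = torch.randn(4096, EMB_DIM)

def encode_text(text: str) -> torch.Tensor:
    """Stand-in text encoder: hash words into a bag-of-words vector, project, normalize."""
    bow = torch.zeros(4096)
    for w in text.lower().split():
        bow[hash(w) % 4096] += 1.0
    return F.normalize(bow @ _proj, dim=-1)

# Each class keeps a list of grounded descriptions (morphology, spatial context, ...).
class_descriptions = {
    "clear_cell_rcc": [
        "nests of cells with clear cytoplasm and distinct cell borders",
        "delicate branching vasculature surrounding tumor nests",
    ],
    "papillary_rcc": [
        "papillary fronds with fibrovascular cores",
        "foamy macrophages within papillae",
    ],
}

# Class prototype = mean of the normalized description embeddings (one design choice).
prototypes = {c: F.normalize(torch.stack([encode_text(d) for d in descs]).mean(0), dim=-1)
              for c, descs in class_descriptions.items()}

# Score a (toy) bag of patch embeddings against each class and pool with max over patches.
patches = F.normalize(torch.randn(100, EMB_DIM), dim=-1)          # one WSI, 100 patches
scores = {c: (patches @ p).max().item() for c, p in prototypes.items()}
print(scores)
```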
Paper and project links
PDF Accepted at MICCAI Workshop 2025
Summary
This paper applies vision-language models (VLMs) to Multiple Instance Learning (MIL) for whole slide image (WSI) classification, addressing the limited expressiveness of encoded pathology concepts. By introducing a grounded description generation system built on pathology textbooks and a text encoding strategy that uses a list of descriptions rather than a single prompt, the approach improves alignment with visual features and achieves performance comparable to state-of-the-art models on renal and lung cancer datasets.
Key Takeaways
- Multiple Instance Learning (MIL) is the leading approach for whole slide image (WSI) classification, enabling efficient analysis of pathology slides.
- Vision-language models (VLMs) are introduced into MIL pipelines to incorporate medical knowledge through text-based class descriptions.
- Existing methods rely on large language models (LLMs) to generate clinical descriptions or on fixed-length prompts for complex pathology concepts, which constrains expressiveness.
- The grounded multi-agent description generation system leverages curated pathology textbooks and agent specialization (e.g., morphology, spatial context) to produce accurate, diverse clinical descriptions.
- A text encoding strategy based on a list of descriptions captures fine-grained, complementary clinical signals for better alignment with visual features.
- The method achieves performance comparable to state-of-the-art models on renal and lung cancer datasets.