⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so use them with caution.
🔴 Note: do not rely on these summaries in serious academic settings; use them only for a first pass before reading the papers themselves!
2025-11-21 Update
AdCare-VLM: Towards a Unified and Pre-aligned Latent Representation for Healthcare Video Understanding
Authors: Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu
Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized LLaVA-based multimodal large vision language model (LVLM) by introducing a unified visual latent space with pre-alignment to facilitate visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, which have been labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient’s face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT) enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.
Paper and project links
PDF: 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: 7th International Workshop on Large Scale Holistic Video Understanding: Toward Video Foundation Models
Summary
This paper addresses medication adherence in patients with chronic diseases and proposes a LLaVA-based multimodal large vision-language model (LVLM) as a solution. Patient videos are used for visual question answering (VQA): a unified visual latent space with pre-alignment is built so that adherence can be monitored from video. A private dataset of 806 tuberculosis medication-monitoring videos is used to fine-tune the model, and the LLM-TB-VQA dataset is introduced, covering positive, negative, and ambiguous adherence cases. The method identifies correlations between visual features and medical concepts, improving multimodal interaction. Experimental results show that it outperforms PEFT-enabled VLM baselines such as LLaVA-V1.5 and Chat-UniVi.
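The "unified visual latent space with pre-alignment" is the core architectural idea in the abstract. The paper's code is not reproduced in this digest, so the following is only a minimal PyTorch sketch of what such a pre-alignment stage typically looks like in LLaVA-style models: a projector maps video-frame features into the language model's embedding space, and a contrastive loss aligns pooled video embeddings with caption embeddings. The module names, dimensions, and loss choice (`VisualPreAligner`, `alignment_loss`, symmetric InfoNCE) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualPreAligner(nn.Module):
    """Projects frozen vision-encoder features into the language model's embedding space."""
    def __init__(self, vision_dim=1024, text_dim=4096):
        super().__init__()
        # Two-layer MLP projector, the common choice in LLaVA-1.5-style models.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, vision_dim) from a frozen video/image encoder.
        video_tokens = self.proj(frame_feats)   # (batch, num_frames, text_dim)
        return video_tokens.mean(dim=1)         # pooled video-level embedding

def alignment_loss(video_emb, caption_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss pulling matched video/caption pairs together."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(caption_emb, dim=-1)
    logits = v @ t.T / temperature                    # (batch, batch) cosine similarities
    targets = torch.arange(len(v), device=v.device)   # matched pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```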
Key Takeaways
- Patients with chronic diseases must adhere strictly to medication regimens to avoid disease progression, manage symptoms, and reduce mortality.
- Medication adherence is undermined by many factors, including patient behavior, caregiver support, high medical costs, and inadequate healthcare infrastructure.
- AdCare-VLM, a LLaVA-based multimodal large vision-language model (LVLM), is proposed as a solution.
- A private dataset of 806 tuberculosis medication-monitoring videos is used to fine-tune the model for detecting adherence patterns.
- LLM-TB-VQA, a detailed medical adherence VQA dataset covering positive, negative, and ambiguous adherence cases, is introduced.
- The method identifies correlations between visual features and medical concepts, promoting aligned visual-linguistic representations and improved multimodal interaction.
- Experimental results show that the method outperforms PEFT-enabled VLM baselines across pre-trained, regular, and LoRA configurations (a minimal LoRA setup is sketched below for reference).
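The comparison against "LoRA configurations" refers to the standard low-rank adaptation recipe for parameter-efficient fine-tuning. As a point of reference only, here is a minimal sketch of such a setup using the HuggingFace `peft` library with a public LLaVA-1.5 checkpoint; the rank, alpha, dropout, and target modules are illustrative assumptions, not the values used in the paper.

```python
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Public LLaVA-1.5 checkpoint used purely for illustration; AdCare-VLM's actual
# base weights and hyperparameters are not specified in this digest.
base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # low-rank dimension (assumed value)
    lora_alpha=32,                         # scaling factor (assumed value)
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections, a common target choice
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()         # only the LoRA adapter weights are trainable
```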