⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only, so please use them with caution.
🔴 Note: do not use these summaries for serious academic purposes; they are intended only as a first-pass screen before reading the papers.
💗 If you find our project, ChatPaperFree, helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
Updated 2025-11-20
UVLM: Benchmarking Video Language Model for Underwater World Understanding
Authors:Xizhe Xue, Yang Zhou, Dawei Yan, Lijie Tao, Junjie Li, Ying Li, Haokui Zhang, Rong Xiao
Recently, the remarkable success of large language models (LLMs) has had a profound impact on the field of artificial intelligence. Numerous advanced works based on LLMs have been proposed and applied in various scenarios. Among them, video language models (VidLMs) are particularly widely used. However, existing works primarily focus on terrestrial scenarios, overlooking the highly demanding application needs of underwater observation. To close this gap, we introduce UVLM, an underwater observation benchmark built through a collaborative approach combining human expertise and AI models. To ensure data quality, we considered the design in depth from multiple perspectives. First, to address the unique challenges of underwater environments, we selected videos that represent typical underwater challenges, including light variations, water turbidity, and diverse viewing angles, to construct the dataset. Second, to ensure data diversity, the dataset covers a wide range of frame rates and resolutions, 419 classes of marine animals, and various static plants and terrains. Next, for task diversity, we adopted a structured design where observation targets are categorized into two major classes: biological and environmental. Each category includes content observation and change/action observation, totaling 20 distinct task types. Finally, we designed several challenging evaluation metrics to enable quantitative comparison and analysis of different methods. Experiments on two representative VidLMs demonstrate that fine-tuning VidLMs on UVLM significantly improves underwater world understanding while also showing potential for slight improvements on existing in-air VidLM benchmarks, such as VideoMME and Perception Test. The dataset and prompt engineering will be released publicly.
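Since the dataset and prompts have not yet been released, the exact annotation schema is unknown; the sketch below shows one plausible way to score a VidLM per task type (biological vs. environmental, content vs. change/action observation) on a UVLM-style multiple-choice benchmark. The field names (video, task_type, question, options, answer) and the predict callable are illustrative assumptions, not the authors' interface.

```python
# Minimal sketch, assuming a UVLM-style annotation format: per-task accuracy
# for multiple-choice video QA. All field names and the predict() callable are
# hypothetical placeholders, not the released UVLM schema.
from collections import defaultdict

def evaluate(items, predict):
    """Compute per-task-type and overall accuracy for multiple-choice video QA.

    items   -- list of dicts: {"video": str, "task_type": str,
               "question": str, "options": list[str], "answer": str}
    predict -- callable(video, question, options) -> chosen option letter
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        pred = predict(item["video"], item["question"], item["options"])
        total[item["task_type"]] += 1
        correct[item["task_type"]] += int(pred.strip().upper() == item["answer"])
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return per_task, overall

if __name__ == "__main__":
    # Two toy items standing in for real annotations (hypothetical schema).
    items = [
        {"video": "clip_001.mp4", "task_type": "biological/content",
         "question": "Which animal is visible?", "options": ["A", "B", "C", "D"],
         "answer": "A"},
        {"video": "clip_002.mp4", "task_type": "environmental/change",
         "question": "Does water turbidity increase?", "options": ["A", "B"],
         "answer": "B"},
    ]
    dummy_vidlm = lambda video, question, options: "A"  # stand-in for a real VidLM
    per_task, overall = evaluate(items, dummy_vidlm)
    print(f"overall accuracy: {overall:.3f}")
    for task, acc in sorted(per_task.items()):
        print(f"{task}: {acc:.3f}")
```

A real evaluation would replace dummy_vidlm with an actual video language model's answer-selection routine and read the items from the released annotations once they are public.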
Paper and project links
PDF 18 pages, 10 figures, 7 tables. Accepted to the Fortieth AAAI Conference on Artificial Intelligence (AAAI-26), 2026
Summary
This paper notes the remarkable success of large language models (LLMs) in artificial intelligence, and in particular the wide adoption of video language models (VidLMs). However, existing work focuses mainly on terrestrial scenarios and overlooks the demanding application needs of underwater observation. To address this, the paper introduces UVLM, an underwater observation benchmark built collaboratively by human experts and AI models. The UVLM dataset tackles the unique challenges of underwater environments and ensures data quality and diversity, covering a range of frame rates, resolutions, marine animal classes, and static plants and terrains. It further adopts a structured design that divides observation targets into two major classes, biological and environmental, for a total of 20 task types. Experiments show that fine-tuning VidLMs on UVLM significantly improves underwater world understanding and also shows potential for slight improvements on existing in-air VidLM benchmarks.
Key Takeaways
- Large language models (LLMs) have achieved remarkable success in artificial intelligence, and video language models (VidLMs) are widely applied.
- Existing work focuses mainly on terrestrial scenarios and overlooks the demanding application needs of underwater observation.
- UVLM is an underwater observation benchmark built through collaboration between human experts and AI models, addressing the unique challenges of underwater environments.
- The UVLM dataset ensures data quality and diversity, covering a range of frame rates, resolutions, marine animal classes, and more.
- UVLM adopts a structured design that divides observation targets into two major classes, biological and environmental, with 20 task types in total.
- Experiments on representative VidLMs show that fine-tuning on UVLM improves understanding of the underwater world.
AdCare-VLM: Towards a Unified and Pre-aligned Latent Representation for Healthcare Video Understanding
Authors:Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu
Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS, epilepsy, and tuberculosis, necessitate rigorous adherence to medication to avert disease progression, manage symptoms, and decrease mortality rates. Adherence is frequently undermined by factors including patient behavior, caregiver support, elevated medical costs, and insufficient healthcare infrastructure. We propose AdCare-VLM, a specialized LLaVA-based multimodal large vision language model (LVLM) that introduces a unified visual latent space with pre-alignment to facilitate visual question answering (VQA) concerning medication adherence through patient videos. We employ a private dataset comprising 806 custom-annotated tuberculosis (TB) medication monitoring videos, labeled by clinical experts, to fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a detailed medical adherence VQA dataset that encompasses positive, negative, and ambiguous adherence cases. Our method identifies correlations between visual features, such as the clear visibility of the patient's face, medication, water intake, and the act of ingestion, and their associated medical concepts in captions. This facilitates the integration of aligned visual-linguistic representations and improves multimodal interactions. Experimental results indicate that our method surpasses parameter-efficient fine-tuning (PEFT)-enabled VLM models, such as LLaVA-V1.5 and Chat-UniVi, with absolute improvements ranging from 3.1% to 3.54% across pre-trained, regular, and low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and attention map visualizations substantiate our approach, enhancing interpretability.
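The abstract compares pre-trained, regular, and LoRA (PEFT) fine-tuning configurations. Below is a minimal sketch of what a LoRA setup for a LLaVA-1.5-style model can look like with the HuggingFace transformers and peft libraries; it assumes the public llava-hf/llava-1.5-7b-hf checkpoint, the rank, alpha, and target modules are illustrative guesses rather than the authors' settings, and the adherence-VQA training loop is omitted.

```python
# Minimal sketch, assuming the public llava-hf/llava-1.5-7b-hf checkpoint:
# attach LoRA adapters to a LLaVA-1.5-style model with HuggingFace peft.
# Hyperparameters are illustrative, not the AdCare-VLM settings.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

MODEL_ID = "llava-hf/llava-1.5-7b-hf"  # assumed base checkpoint

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
)

# Low-rank adapters on the language model's attention projections; the vision
# tower and multimodal projector remain frozen, so only a small fraction of
# parameters is trained.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of total parameters

# Fine-tuning on adherence VQA pairs (sampled video frames + question -> answer)
# would then proceed with a standard Trainer or a custom training loop.
```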
Paper and project links
PDF 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: 7th International Workshop on Large Scale Holistic Video Understanding: Toward Video Foundation Models
Summary
AdCare-VLM is a LLaVA-based large vision-language model (LVLM) for monitoring medication adherence from patient videos. The study fine-tunes the model on a private dataset of 806 tuberculosis (TB) medication monitoring videos and introduces the LLM-TB-VQA dataset for adherence pattern detection. The method identifies correlations between visual features in the videos and medical concepts, improving the accuracy of adherence monitoring. Experimental results show that it outperforms parameter-efficient fine-tuning (PEFT)-enabled VLM models across pre-trained, regular, and low-rank adaptation (LoRA) configurations, with absolute improvements of 3.1% to 3.54%.
Key Takeaways
- Medication for chronic diseases must be taken rigorously to halt disease progression, manage symptoms, and reduce mortality.
- Medication adherence is undermined by factors such as patient behavior, caregiver support, high medical costs, and insufficient healthcare infrastructure.
- The paper proposes AdCare-VLM, a specialized LLaVA-based multimodal large vision-language model (LVLM) for medication adherence.
- The model is fine-tuned on a private dataset of 806 TB medication monitoring videos to detect adherence patterns.
- The LLM-TB-VQA dataset is introduced, a medical adherence VQA dataset covering positive, negative, and ambiguous adherence cases.
- The method identifies correlations between visual features in the videos and medical concepts, improving the accuracy of adherence monitoring.
- Compared with other models, the method performs better across multiple configurations.