⚠️ All of the summaries below are generated by a large language model and may contain errors; they are for reference only and should be used with caution.
🔴 Note: never use them in serious academic settings; they are intended only as a first-pass filter before reading a paper!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace
2025-09-19 Update
A Culturally-diverse Multilingual Multimodal Video Benchmark & Model
Authors:Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin’ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs, however, are limited to the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, covering both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse ones ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. It comprises both open-ended (short- and long-form) and multiple-choice questions spanning various video durations (short, medium, and long), with 8k samples manually verified by native speakers. In addition, we introduce a machine-translated multilingual video training set of 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, which is shown to provide a better trade-off between high- and low-resource languages for video understanding. We hope that ViMUL-Bench, the multilingual video LMM, and the large-scale multilingual training set will facilitate future research on culturally and linguistically inclusive multilingual video LMMs. The proposed benchmark, video LMM, and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.
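To make the benchmark's composition concrete, below is a minimal Python sketch of what a ViMUL-Bench-style sample record and a per-language scoring loop might look like. This is an illustrative assumption only: the field names, the `model.generate` call, and the exact-match scoring are hypothetical placeholders, not taken from the paper, which does not publish a schema or evaluation API here.

```python
from dataclasses import dataclass, field
from typing import Literal

# Hypothetical sample record mirroring the benchmark's stated axes:
# 14 languages, 15 categories, three duration buckets, and both
# open-ended and multiple-choice questions. Field names are illustrative.
@dataclass
class BenchmarkSample:
    video_path: str
    language: Literal[
        "en", "zh", "es", "fr", "de", "hi", "ar",
        "ru", "bn", "ur", "si", "ta", "sv", "ja",
    ]
    category: str                      # one of 15 categories (8 cultural)
    duration: Literal["short", "medium", "long"]
    question: str
    question_type: Literal["open_short", "open_long", "mcq"]
    choices: list[str] = field(default_factory=list)  # empty if open-ended
    answer: str = ""

def evaluate(model, samples: list[BenchmarkSample]) -> dict[str, float]:
    """Per-language accuracy; `model.generate` and exact-match scoring
    stand in for whatever inference API and judge the authors use."""
    correct: dict[str, int] = {}
    total: dict[str, int] = {}
    for s in samples:
        pred = model.generate(video=s.video_path, prompt=s.question)
        total[s.language] = total.get(s.language, 0) + 1
        if pred.strip().lower() == s.answer.strip().lower():
            correct[s.language] = correct.get(s.language, 0) + 1
    return {lang: correct.get(lang, 0) / n for lang, n in total.items()}
```

Reporting accuracy per language, as sketched above, is what would expose the high- versus low-resource trade-off the abstract highlights.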
Summary
This paper notes that large multimodal models (LMMs) are effective at understanding and generating descriptions of visual content, but that most existing LMMs are limited to English. To move toward more inclusive video LMMs, the authors introduce ViMUL-Bench, a multilingual video LMM benchmark for evaluating video LMMs across 14 languages, including English, Chinese, Spanish, and others. ViMUL-Bench contains open-ended questions (short and long form) and multiple-choice questions across varied video durations, with 8k samples verified by native speakers. The authors also release a machine-translated multilingual video training set and a simple multilingual video LMM, ViMUL, to support future research on culturally and linguistically inclusive multilingual video LMMs. The resources and datasets will be made public at https://mbzuai-oryx.github.io/ViMUL/.
Key Takeaways
- Large multimodal models (LMMs) are effective at understanding and generating descriptions of visual content.
- Most existing LMMs are limited to English and lack multilingual inclusivity.
- ViMUL-Bench, a multilingual video LMM benchmark, covers 14 languages spanning both high- and low-resource languages.
- ViMUL-Bench contains open-ended (short and long form) and multiple-choice questions across varied video durations, with 8k samples verified by native speakers.
- A machine-translated multilingual video training set of 1.2 million samples is released.
- A simple multilingual video LMM, ViMUL, is developed, achieving a better trade-off between high- and low-resource languages.