
MMT


⚠️ All of the summaries below are generated by large language models and may contain errors. They are for reference only; use with caution.
🔴 Note: never use these summaries in serious academic settings — they are only for an initial screening before reading the papers!
💗 If you find our project ChatPaperFree helpful, please give us some encouragement! ⭐️ Try it for free on HuggingFace

Updated 2025-10-02

A Culturally-diverse Multilingual Multimodal Video Benchmark & Model

Authors:Bhuiyan Sanjid Shafique, Ashmal Vayani, Muhammad Maaz, Hanoona Abdul Rasheed, Dinura Dissanayake, Mohammed Irfan Kurpath, Yahya Hmaiti, Go Inoue, Jean Lahoud, Md. Safirur Rashid, Shadid Intisar Quasem, Maheen Fatima, Franco Vidal, Mykola Maslych, Ketan Pravin More, Sanoojan Baliah, Hasindri Watawana, Yuhao Li, Fabian Farestam, Leon Schaller, Roman Tymtsiv, Simon Weber, Hisham Cholakkal, Ivan Laptev, Shin’ichi Satoh, Michael Felsberg, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan

Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are English-only. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language for cultural and linguistic inclusivity has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across 14 languages, including both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories ranging from lifestyles and festivals to foods and rituals, and from local landmarks to prominent cultural personalities. ViMUL-Bench comprises both open-ended (short and long-form) and multiple-choice questions spanning various video durations (short, medium, and long), with 8k samples that are manually verified by native speakers. In addition, we introduce a machine-translated multilingual video training set comprising 1.2 million samples and develop a simple multilingual video LMM, named ViMUL, which is shown to provide a better tradeoff between high- and low-resource languages for video understanding. We hope our ViMUL-Bench and multilingual video LMM, along with the large-scale multilingual video training set, will help ease future research in developing culturally and linguistically inclusive multilingual video LMMs. Our proposed benchmark, video LMM, and training data will be publicly released at https://mbzuai-oryx.github.io/ViMUL/.


Paper and project links

PDF

Summary

Large multimodal models (LMMs) have recently attracted attention for their effectiveness in understanding and describing visual content. While most existing LMMs support only English, to improve cultural and linguistic inclusivity we introduce ViMUL-Bench, a multilingual video LMM benchmark covering 14 languages including English and Chinese. Beyond evaluating model performance, this work also provides a multilingual video training dataset and a multilingual video LMM, ViMUL, aiming to advance culturally and linguistically inclusive multilingual video LMMs. For details, see the public project page.

Key Takeaways

  1. Large multimodal models (LMMs) can effectively understand and generate descriptions of visual content.
  2. Most current LMMs support only English; multilingual video LMMs are needed for greater cultural and linguistic inclusivity.
  3. ViMUL-Bench, a multilingual video LMM benchmark, covers 14 languages, including both high- and low-resource languages.
  4. ViMUL-Bench spans 15 categories, including culturally diverse ones such as lifestyles, festivals, foods, and rituals.
  5. A machine-translated multilingual video training dataset with 1.2 million samples is provided.
  6. ViMUL, a multilingual video LMM, achieves a good tradeoff between high- and low-resource languages.

Cool Papers

Click here to view paper screenshots


Author: Kedreamix
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under CC BY 4.0. Please credit Kedreamix when reposting!