科学研究

Research

首页 >  论文  >  详情

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

发表会议及期刊:arXiv

Kunchang Li1,2,3♠ Yali Wang1,3♡ Yinan He3  Yizhuo Li4,3♠ Yi Wang3 Yi Liu1,2,3♠

Zun Wang3 Jilan Xu5,3♠ Guo Chen6,3♠ Ping Luo4,3 Limin Wang6,3♡ Yu Qiao3,1♡

1Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences 2University of Chinese Academy of Sciences 3Shanghai AI Laboratory

4The University of Hong Kong 5Fudan University 6State Key Laboratory for Novel Software Technology, Nanjing University

 

Abstract

With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However, most benchmarks predominantly assess spatial understanding in the static image tasks, while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically, we first introduce a novel staticto-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones, we enable the systematic generation of video tasks that require a broad spectrum of temporal skills, ranging from perception to cognition. Then, guided by the task definition, we automatically convert public video annotations into multiplechoice QA to evaluate each task. On one hand, such a distinct paradigm allows us to build MVBench efficiently, without much manual intervention. On the other hand, it guarantees evaluation fairness with ground-truth video annotations, avoiding the biased scoring of LLMs. Moreover, we further develop a robust video MLLM baseline, i.e., VideoChat2, by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that, the existing MLLMs are far from satisfactory in temporal understanding, while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. All models and data are available at https://github.com/OpenGVLab/Ask-Anything.