Research


From GPT-4 to Gemini and Beyond: Assessing the Landscape of MLLMs on Generalizability, Trustworthiness and Causality through Four Modalities

Venue: arXiv

Chaochao Lu, Chen Qian, Guodong Zheng, Hongxing Fan, Hongzhi Gao, Jie Zhang, Jing Shao, Jingyi Deng, Jinlan Fu, Kexin Huang, Kunchang Li, Lijun Li, Limin Wang, Lu Sheng, Meiqi Chen, Ming Zhang, Qibing Ren, Sirui Chen, Tao Gui, Wanli Ouyang, Yali Wang, Yan Teng, Yaru Wang, Yi Wang, Yinan He, Yingchun Wang, Yixu Wang, Yongting Zhang, Yu Qiao, Yujiong Shen, Yurong Mou, Yuxi Chen, Zaibin Zhang, Zhelun Shi, Zhenfei Yin, Zhipin Wang

Shanghai AI Laboratory

 

Abstract

Multi-modal Large Language Models (MLLMs) have shown impressive abilities in generating reasonable responses to multi-modal content. However, there is still a wide gap between the performance of recent MLLM-based applications and the expectations of the broad public, even though the most powerful models, OpenAI's GPT-4 and Google's Gemini, have been deployed. This paper strives to enhance understanding of that gap through a qualitative study of the generalizability, trustworthiness, and causal reasoning capabilities of recent proprietary and open-source MLLMs across four modalities, i.e., text, code, image, and video, ultimately aiming to improve the transparency of MLLMs. We believe these properties are representative factors that define the reliability of MLLMs in supporting various downstream applications. Specifically, we evaluate the closed-source GPT-4 and Gemini as well as 6 open-source LLMs and MLLMs. In total, we evaluate 230 manually designed cases, and the qualitative results are summarized into 12 scores (i.e., 4 modalities × 3 properties). We uncover 14 empirical findings that are useful for understanding the capabilities and limitations of both proprietary and open-source MLLMs, toward more reliable downstream multi-modal applications.
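The 12 scores mentioned above arise from crossing the 4 modalities with the 3 properties over the 230 manually designed cases. The sketch below is only a rough illustration of such an aggregation; the field names, the pass/fail scoring rule, and the helper function are assumptions for exposition, not the authors' actual evaluation protocol.

```python
from collections import defaultdict

# Hypothetical illustration: 230 cases, each tagged with one of 4 modalities
# and one of 3 properties, aggregated into 12 (modality, property) scores.
# The "passed" field and the pass-rate scoring rule are assumptions, not the
# paper's actual protocol.

MODALITIES = ["text", "code", "image", "video"]
PROPERTIES = ["generalizability", "trustworthiness", "causality"]

def aggregate_scores(cases):
    """cases: list of dicts like {"modality": "text", "property": "causality", "passed": True}."""
    totals = defaultdict(lambda: [0, 0])  # (modality, property) -> [passed count, total count]
    for case in cases:
        key = (case["modality"], case["property"])
        totals[key][0] += int(case["passed"])
        totals[key][1] += 1
    # One score per modality-property pair: fraction of cases handled well,
    # or None when no case falls in that cell.
    return {
        (m, p): (totals[(m, p)][0] / totals[(m, p)][1] if totals[(m, p)][1] else None)
        for m in MODALITIES
        for p in PROPERTIES
    }
```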

