InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Published in (conference or journal): arXiv

Zhe Chen²,¹†, Jiannan Wu³,¹†, Wenhai Wang¹,⁴, Weijie Su⁶,¹†, Guo Chen²,¹†, Sen Xing⁵, Muyan Zhong⁵, Qinglong Zhang¹, Xizhou Zhu⁵,⁷,¹, Lewei Lu⁷,¹, Bin Li⁶, Ping Luo³, Tong Lu², Yu Qiao¹, Jifeng Dai⁵

¹OpenGVLab, Shanghai AI Laboratory ²Nanjing University

³The University of Hong Kong ⁴The Chinese University of Hong Kong ⁵Tsinghua University

⁶University of Science and Technology of China ⁷SenseTime Research

Abstract

The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However, the progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification, zero-shot image/video-text retrieval, and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models.
