mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Qinghao Ye, Haiyang Xu, Jiabo Ye, Ming Yan, Anwen Hu, Haowei Liu, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
Introduction
Large Language Models (LLMs) such as GPT-3, LLaMA, and GPT-4 have garnered significant attention due to their exceptional generalization abilities in text understanding and generation. To facilitate vision-language applications, GPT-4V (https://openai.com/research/gpt-4v-system-card) has recently demonstrated impressive multi-modal capabilities in diverse tasks, e.g., description, question answering, etc., sparking interest among researchers in the potential convergence of the vision-language field. This has led to the emergence of a group of Multi-modal Large Language Models (MLLMs), which aim to enhance LLMs with the ability to understand and handle visual problems.
Previous studies in multi-modal learning suggest that different modalities can effectively collaborate, thereby enhancing the performance of both text and multi-modal tasks simultaneously. However, an MLLM is a unified model that must support different modalities and tasks without fine-tuning for specific tasks. Recent works utilize cross-modal alignment modules (e.g., a Q-former or a linear layer) to map visual features from the vision encoder into frozen LLMs, carrying out multi-modal tasks by leveraging the preserved language capabilities. This strategy, unfortunately, restricts the potential of modality collaboration. As a result, some researchers opt to fine-tune LLMs during multi-modal instruction tuning. While fine-tuning significantly improves multi-modal tasks, it risks weakening text task performance. As illustrated in Figure 1, the challenge of modality collaboration in MLLMs stems from applying a single module to balance the gain of modality collaboration against modality interference, where modalities may interfere with each other when trained on a large number of instruction datasets across multiple modalities.
To mitigate this challenge, we present a new general-purpose multi-modal foundation model, mPLUG-Owl2, in this work. Our model features a modularized network design that takes both modality collaboration and modality interference into account, using the language decoder as a universal interface for managing multi-modal signals. Specifically, mPLUG-Owl2 incorporates certain shared functional modules to promote modality collaboration and introduces a modality-adaptive module that serves as a pivot across different modalities. Therefore, vision and language modalities are projected into a shared semantic space for cross-modality interaction, while the proposed module helps preserve modality-specific features. With our novel architecture, modalities with varying information densities are shielded from modality interference due to the modality-adaptive module and can collaborate effectively in capturing shared information. Furthermore, we introduce an innovative two-stage training paradigm that consists of vision-language pre-training and joint vision-language instruction tuning. This paradigm trains the vision encoder across two stages, enabling it to capture both low-level and high-level semantic visual information more effectively.
Extensive experiments illustrate the effectiveness and generalization abilities of mPLUG-Owl2, which achieves state-of-the-art performance on 8 classic vision-language benchmarks using a single generic model. Furthermore, it ranks either first or second on 5 recent zero-shot multi-modal benchmarks, underscoring its adaptability and proficiency in multi-modal instruction comprehension and generation. In addition to its cutting-edge performance in multi-modal tasks, mPLUG-Owl2 also achieves state-of-the-art results on multiple pure-text benchmarks. Moreover, we provide in-depth analysis to demonstrate and validate the impact of modality collaboration through our proposed modality-adaptive module, especially in enhancing text tasks, including understanding, knowledge, and reasoning. Finally, comprehensive ablation studies validate the effectiveness of the proposed MLLM training paradigm, which can help inspire the development of future multi-modal foundation models.
Related Work
The successful application of Large Language Models (LLMs) has paved the way for developing several approaches aiming to augment the perceptual capacities of LLMs with additional modalities, all within a unified model. There are three primary methods for constructing multi-modal large language foundational models, each showing promise for robust zero-shot generalization capabilities in the vision-language domain. For instance, Flamingo is a forerunner in this area, using a frozen vision encoder and a large language model equipped with gated cross-attention for cross-modality alignment. In contrast, PaLM-E integrates extracted visual features directly through linear layers into the pre-trained PaLM model, which boasts 540 billion parameters, thereby leading to robust performance across numerous real-world applications. This approach has been broadly adopted by models such as LLaVA, Shikra, etc. One significant limitation of this method, however, is the creation of lengthy visual sequences. To address this, BLIP-2, drawing inspiration from DETR, developed a Q-former to reduce the sequence length of visual features efficiently. This design has been mirrored by Kosmos-1, mPLUG-Owl, and MiniGPT-4. Nevertheless, it should be noted that these methods directly align the visual features with the LLMs, treating vision and language signals as equivalent and thereby overlooking the differing granularities of the vision and language modalities. To alleviate this problem, we introduce the modality-adaptive module. Our proposed model leads to superior performance in both zero-shot and fine-tuning evaluation settings on both image and video tasks.
Instruction Tuning with MLLMs.
Instruction tuning optimizes pre-trained large language models to comprehend and adhere to natural instructions, thereby enhancing their ability to generalize to unseen tasks in a zero-shot manner. Researchers often employ models such as ChatGPT and GPT-4 to generate diverse and expansive instruction datasets, such as Alpaca, ShareGPT, and WizardLM. As multi-modal large language models emerge, research communities are beginning to create high-quality, diverse multi-modal datasets. For instance, MiniGPT-4 utilizes GPT-3.5 to rephrase captions generated by pre-trained models. Concurrently, LLaVA, SVIT, and LRV-Instruction take advantage of image annotations, such as bounding boxes of objects, image captions, and region descriptions, to prompt GPT-4 to generate instructions and responses using self-instruction methods. Models such as mPLUG-Owl and LLaVA-1.5 further advance this area by undergoing joint training with language-only and vision-and-language instruction data, thereby mitigating the risk of catastrophic forgetting of language knowledge. Rather than merely preventing catastrophic forgetting, mPLUG-Owl2, with the help of the modality-adaptive module, can gain from the collaboration of modalities by being jointly trained with language-only and multi-modal instruction data, thus enhancing both multi-modal and language-only performance.
Methodology
Figure 2 (a) sketches an overview of mPLUG-Owl2. Specifically, our model comprises a vision encoder, a visual abstractor, a text embedding layer, and a language decoder. Notably, the text embedding layer and language decoder are typically implemented with a large language model, such as GPT or LLaMA. We first briefly introduce our model’s architecture in Section 3.2. We then describe how the modality-adaptive module handles different types of modalities in Section 3.3. Lastly, we introduce the training paradigm for training mPLUG-Owl2 with modality collaboration in Section 3.4.
3.2 Model Architecture
3.3 Modality-Adaptive Module
Prior approaches typically attempt to align visual features with language features by projecting image features into the language semantic space. However, this strategy can cause a mismatch in granularity, since image features often contain richer semantic information than the discrete semantic information within text embedding features. Those methods disregard the unique characteristics of visual and textual information, thus potentially limiting the model’s performance. To this end, we propose a new approach, namely, the Modality-Adaptive Module (MAM), which decouples vision-language representations by projecting visual features and language features into a shared semantic space while preserving the distinctive properties of each modality.
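Formally, given a vision-language sequence $X \in \mathbb{R}^{(L_V + L_T) \times d}$ and modality indicators $M \in \{0, 1\}^{(L_V + L_T)}$, we first define the modality-separated operation $\phi$ as:
$$\phi(X, M, m) = X \odot \mathbb{1}_{\{M = m\}},$$
where $m \in \{0, 1\}$ is the type of modality (i.e., vision or language). Given the previous layer’s output vectors $H_{l-1}$, $l \in [1, L]$, where $L$ is the number of language decoder layers, we first normalize the different modalities to the same magnitude as follows:
$$\tilde{H}_{l-1} = LN_V(\phi(H_{l-1}, M, 0)) + LN_T(\phi(H_{l-1}, M, 1)),$$
where $LN_V$ and $LN_T$ are layer normalization for visual features and language features respectively. Then, we reformulate the self-attention operation by using separate linear projection layers for the key projection matrix and the value projection matrix while keeping the query projection matrix shared, as follows:
$$H_l^Q = \tilde{H}_{l-1} W_l^Q,$$
$$H_l^K = \phi(\tilde{H}_{l-1}, M, 0) W_l^{K_0} + \phi(\tilde{H}_{l-1}, M, 1) W_l^{K_1},$$
$$H_l^V = \phi(\tilde{H}_{l-1}, M, 0) W_l^{V_0} + \phi(\tilde{H}_{l-1}, M, 1) W_l^{V_1},$$
$$C_l = \mathrm{Softmax}\left(\frac{H_l^Q {H_l^K}^{\top}}{\sqrt{d}}\right) H_l^V,$$
where $W_l^Q$, $W_l^{K_0}$, $W_l^{K_1}$, $W_l^{V_0}$, and $W_l^{V_1}$ are learnable projection matrices, and $C_l$ denotes the context features of the $l$-th layer.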
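The modality-adaptive attention described above can be illustrated with a minimal single-head NumPy sketch. It assumes a causal decoder, per-modality layer norms with their own gain and bias, and hypothetical parameter names (`ln_v`, `W_k_v`, and so on) that do not come from the original implementation; it is meant only to show the shared query projection and the modality-specific key and value projections.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-6):
    """Standard layer normalization over the feature dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(d, rng):
    """Hypothetical parameter container; all names are illustrative."""
    def mat():
        return rng.normal(scale=d ** -0.5, size=(d, d))
    return {
        "ln_v": (np.ones(d), np.zeros(d)),  # vision layer norm (gain, bias)
        "ln_t": (np.ones(d), np.zeros(d)),  # language layer norm (gain, bias)
        "W_q": mat(),                       # shared query projection
        "W_k_v": mat(), "W_k_t": mat(),     # per-modality key projections
        "W_v_v": mat(), "W_v_t": mat(),     # per-modality value projections
    }


def modality_adaptive_attention(h, modality, p):
    """h: (seq, d) hidden states; modality: (seq,) with 0 = vision, 1 = language."""
    vis = (modality == 0)[:, None].astype(h.dtype)  # 1.0 at visual positions
    txt = 1.0 - vis                                 # 1.0 at language positions

    # Modality-specific normalization, each applied only at its own positions.
    h_tilde = vis * layer_norm(h, *p["ln_v"]) + txt * layer_norm(h, *p["ln_t"])

    q = h_tilde @ p["W_q"]
    k = (vis * h_tilde) @ p["W_k_v"] + (txt * h_tilde) @ p["W_k_t"]
    v = (vis * h_tilde) @ p["W_v_v"] + (txt * h_tilde) @ p["W_v_t"]

    scores = q @ k.T / np.sqrt(h.shape[-1])
    # Assumption: causal masking, as in a standard language decoder.
    causal = np.tril(np.ones(scores.shape, dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
h = rng.normal(size=(6, 8))
modality = np.array([0, 0, 0, 1, 1, 1])  # three visual tokens, then three text tokens
out = modality_adaptive_attention(h, modality, init_params(8, rng))
print(out.shape)  # (6, 8)
```

Because the masks zero out the rows of the other modality before each key or value projection, the vision-specific weights have no effect on a text-only sequence, which is the property that lets language-only inputs bypass the visual pathway.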
3.4 Training Paradigm
As depicted in Figure 2 (c), we employ a two-stage approach in training mPLUG-Owl2, comprising pre-training and visual instruction tuning, similar to prior work. The pre-training phase aims to align the pre-trained vision encoder and language model, and the instruction tuning phase then fine-tunes the language model with a language modeling loss. However, we find that simply freezing a pre-trained vision encoder and training a vision-language projector to align visual data with language models can limit their capacity to interpret complex visual information, such as scene text and visual knowledge. To address this issue, we make the vision encoder trainable throughout both the pre-training and instruction tuning stages. This strategy allows the model to capture both low-level and high-level semantic visual information more effectively. Specifically, for the pre-training stage, we enable the vision encoder, visual abstractor, and a part of the modality-adaptive module to be trainable, while keeping the pre-trained language model frozen. Meanwhile, prior research in multi-modal learning has indicated that significant enhancements can be achieved through the collaborative learning of uni-modal and multi-modal sources. Based on this, we adopt a joint training approach by tuning the whole model during the instruction tuning stage, incorporating both text and multi-modal instructions. The multi-modal instructions enhance the model’s comprehension of visual concepts embedded within text. Concurrently, the text instruction data augments the model’s understanding of intricate natural instructions, thereby ensuring the preservation of its linguistic capabilities.
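The two-stage schedule above can be summarized as a small lookup of which components are updated in each stage. This is a sketch only: the module and stage names are hypothetical labels, and "partial" stands in for the unspecified subset of the modality-adaptive module that is trained during pre-training.

```python
# Hypothetical component names; the text does not name the actual modules.
MODULES = (
    "vision_encoder",
    "visual_abstractor",
    "modality_adaptive_module",
    "language_model",
)


def trainable_plan(stage):
    """Return which components are trained ("all"), partly trained, or frozen."""
    if stage == "pretrain":
        return {
            "vision_encoder": "all",
            "visual_abstractor": "all",
            "modality_adaptive_module": "partial",  # "a part of" the module
            "language_model": "frozen",
        }
    if stage == "instruction_tuning":
        # The whole model is tuned on text and multi-modal instructions.
        return {name: "all" for name in MODULES}
    raise ValueError(f"unknown stage: {stage!r}")


for stage in ("pretrain", "instruction_tuning"):
    print(stage, trainable_plan(stage))
```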
Experiments
mPLUG-Owl2 is first pre-trained on image-text pairs and then fine-tuned on mono-modal and multi-modal instruction data. For pre-training data, we randomly pick about 400 million image-text pairs from five public datasets: Conceptual Captions (CC3M/CC12M), COCO, Laion-en, COYO, and DataComp. For instruction data, we collect 5 types of datasets: 1) image captioning (i.e., TextCaps, COCO); 2) image question answering (i.e., VQAv2, OKVQA, OCR-VQA, GQA, and A-OKVQA); 3) region-aware QA (i.e., RefCOCO, VisualGenome); 4) multi-modal instruction data (i.e., LLaVA-instruct-150K); 5) text-only instruction data (i.e., ShareGPT-80K, SlimOrca). Details can be found in the Appendix.
Training Settings
We pre-train the model for 42,500 iterations with a batch size of 8,192, covering about 348 million image-text pairs. Since we adopt the language modeling loss, the large batch size can be easily achieved by the gradient accumulation technique. mPLUG-Owl2 adopts ViT-L with patch size and pre-trained at resolution . We use the same data augmentation as in BLIP-2, including random resized cropping and horizontal flipping with a probability of 0.5. The visual abstractor has 6 layers and is randomly initialized. The number of learnable queries is set to 64. For the language model, we employ LLaMA-2 with 7B parameters to handle multi-modal features, and the parameters of the modality-adaptive modules are initialized from the language model. We use the AdamW optimizer with , and 1e-6 for optimization. We use a cosine learning rate decay scheduler with a peak learning rate of 1e-4 and 1k warmup steps. For the learning rate of the vision encoder, we employ layer-wise learning rate decay with a factor of 0.9 to retain the low-level visual representation. For the instruction tuning stage, we train the whole model for 1 epoch with a learning rate of 2e-5 and a batch size of 256. Besides, we increase the resolution from to . Layer-wise learning rate decay is also employed, which we find crucial for retaining good visual representation in our experiments.
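As a quick check of the pre-training numbers, the sketch below multiplies the iteration count by the batch size and evaluates a cosine schedule with warmup. The linear warmup shape and the decay to a minimum learning rate of zero are assumptions; the text states only the peak rate, the warmup length, and the use of cosine decay.

```python
import math

ITERATIONS = 42_500
BATCH_SIZE = 8_192

# 42,500 iterations x 8,192 samples per iteration = 348,160,000 samples,
# which matches the "about 348 million" image-text pairs quoted above.
samples_seen = ITERATIONS * BATCH_SIZE


def lr_at(step, peak_lr=1e-4, warmup_steps=1_000, total_steps=ITERATIONS, min_lr=0.0):
    """Cosine decay with warmup (linear warmup and min_lr=0 are assumptions)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(f"{samples_seen:,}")  # 348,160,000
for step in (0, 500, 1_000, ITERATIONS):
    print(step, lr_at(step))
```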
4.2 Main Results
We assess mPLUG-Owl2 using a wide range of academic benchmarks for evaluating vision-language models. Our evaluation includes eight popular benchmarks, as summarized in Table 1. As the results show, mPLUG-Owl2 surpasses previous generalist models in both captioning and question answering tasks. Specifically, mPLUG-Owl2 achieves state-of-the-art performance on the Flickr30K dataset, even compared with models with more powerful backbones (e.g., Qwen-VL-Chat and InstructBLIP). Moreover, mPLUG-Owl2 exhibits distinct advantages in visual question answering, especially in OCR-free scenarios, where it achieves 54.3% accuracy on the TextVQA dataset in a zero-shot manner, demonstrating the benefits of our training strategy. Also worth noting is that mPLUG-Owl2 shows strong zero-shot performance on the ScienceQA (Image Set) and VizWizQA datasets.
MLLM-oriented Multi-modal Benchmarks.
Given the robust zero-shot capabilities of Multi-modal Large Language Models (MLLMs), traditional evaluation metrics often fall short in providing a detailed ability assessment. This problem is further exacerbated by the inability of these metrics to match free-form outputs to the given answer accurately, leading to significant robustness issues. To address these challenges, research communities have introduced a series of benchmarks including MME, MMBench, MM-Vet, SEED-Bench, and Q-Bench. These benchmarks systematically structure and evaluate complex multi-modal tasks. We applied our model, in a zero-shot manner, to five recently popular multi-modal benchmarks. For a fair comparison, we select models with similar language model sizes, particularly those from the LLaMA family, and detail their differences in the vision encoder. The results of our evaluation are listed in Table 2. In the table, mPLUG-Owl2 achieves higher zero-shot performance on MMBench, MM-Vet, and Q-Bench. Conversely, the performance on MME is lower, which we attribute to the limited number of test samples in MME, as this can make the scores sensitive to small fluctuations. In particular, mPLUG-Owl2 exhibits significant improvement on Q-Bench, a benchmark for examining the low-level visual perception of MLLMs. Notably, this improvement is obtained with a smaller visual backbone (i.e., ViT-L), which demonstrates the effectiveness of our strategy for training the visual backbone.
Natural Language Understanding and Generation.
Current MLLMs often perform strongly on various multi-modal downstream tasks by leveraging the power of large language models. Nevertheless, the intrinsic language capabilities of these models play a significant role in determining the performance of MLLMs, an aspect that has often been overlooked in prior multi-modal language model studies. Accordingly, we have also assessed the performance of our model on natural language understanding and generation. We perform the evaluation on MMLU, BBH, AGIEval, and ARC. The results are shown in Table 3. As observed in the table, mPLUG-Owl2 excels in examination and reasoning, showing a significant improvement on MMLU and BBH by 2.3% and 3.8% respectively. This indicates that mPLUG-Owl2 not only performs well on multi-modal tasks but also achieves better performance than other instruction-tuned LLMs, pointing to a promising way of developing strong MLLMs.
Zero-Shot Video Question Answering.
Given that videos can be viewed as a sequence of images, we conducted a comprehensive quantitative evaluation using several commonly employed video question-answering datasets, including MSRVTT-QA, MSVD-QA, and TGIF-QA. These datasets enable a zero-shot evaluation of the model’s ability to understand video content, with the results summarized in Table 4. We employed two types of evaluations: 1) exact matching, which is commonly used in previous video question-answering evaluations; and 2) GPT-assisted evaluation, which assesses the model’s capabilities by measuring the accuracy of the model’s generated predictions and providing a relative score on a scale of 1-5. We observe that our model achieves superior results on all three video datasets under a zero-shot setting. Furthermore, in terms of relevancy, our model generates more accurate answers than other video MLLMs, thereby demonstrating its superiority and excellent generalization capabilities.
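For reference, exact-matching accuracy can be sketched as below. The normalization rules (lower-casing, removing punctuation, collapsing whitespace) are assumptions chosen for illustration, not the evaluation script used for Table 4.

```python
import string


def normalize(answer):
    """Assumed normalization: lower-case, drop punctuation, collapse spaces."""
    answer = answer.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(answer.split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to the reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


# Hypothetical predictions and ground-truth answers.
preds = ["Dog.", "a cat", "two"]
refs = ["dog", "cat", "two"]
print(exact_match_accuracy(preds, refs))  # 2 of 3 match
```

The second pair does not match because the prediction contains an extra word, which illustrates why exact matching tends to understate the quality of free-form answers and why a GPT-assisted score is reported alongside it.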
4.3 Discussion
To demonstrate how modality collaboration enhances not only the multi-modal performance but also the text capability of MLLMs, we evaluate performance on text benchmarks in terms of various abilities including examination, knowledge, understanding, and reasoning. As observed in Figure 3, both the examination and knowledge capabilities of MLLMs improve significantly thanks to the modality collaboration facilitated by the modality-adaptive module. This improvement arises because multi-modal data allows the model to utilize visual information to understand concepts that cannot be described through language. Similarly, the model can generate richer and more substantial responses due to a more concrete understanding of these concepts. Additionally, multi-modal data enhances the reasoning ability of the model because images contain rich information (such as relationships and spatial aspects). The model learns from these aspects and associates them with the text, thereby indirectly enhancing its reasoning ability on text.
Impact of Joint Vision-Language Instruction Tuning.
Table 5 presents the results of instruction tuning with various types of data, with and without the modality-adaptive module. These results show that even without multi-modal instruction data, the model’s performance on multi-modal benchmarks is respectable due to the effective vision-language alignment achieved during pre-training. However, when solely using multi-modal instruction data, we observe an increase in performance on multi-modal datasets, while performance on text tasks decreases by about 5.7%. This drop can be counterbalanced by the proposed joint vision-language tuning, as shown in the table’s third row, although the multi-modal performance then begins to decrease slightly due to modality interference. To counter this drawback, we apply our proposed modality-adaptive module to the model. Results show that the performance on both multi-modal and text benchmarks improves, with a minimum increase of 0.6% on the VQAv2 dataset and 1.6% on MMLU.
Impact of Trainable Vision Encoder.
Table 6 reports the effect of training the vision encoder during instruction tuning with modality collaboration. It can be observed that making the vision encoder trainable improves performance on VQAv2 and Q-Bench by at least 1.4% and 0.9%, respectively, suggesting the benefits of modality collaboration. Conversely, it results in a 1.1% performance drop on MM-Bench, indicating a degree of forgetting and damage to the general visual representation due to the limited diversity of instruction data. To mitigate this challenge, we apply layer-wise learning rate decay with an exponential decay factor of 0.9, which preserves the representation of lower layers while modifying higher semantic representations. With layer-wise learning rate decay, performance on TextVQA increases further by 2.2%, showing the effectiveness of our training strategy.
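Layer-wise learning rate decay can be sketched as follows. The decay factor of 0.9 and the base rate of 2e-5 come from the training settings; the layer count of 24 (the usual depth of ViT-L) and the exact indexing, where the top layer keeps the base rate, are assumptions made for illustration.

```python
def layerwise_learning_rates(base_lr, num_layers, decay=0.9):
    """Per-layer learning rates; index 0 is the lowest (closest to the input).

    The top layer uses base_lr and each layer below it is scaled by one
    more factor of `decay`, so low-level layers change the least.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Assumption: 24 transformer blocks, the standard ViT-L depth.
rates = layerwise_learning_rates(base_lr=2e-5, num_layers=24, decay=0.9)
print(f"lowest layer: {rates[0]:.3e}, top layer: {rates[-1]:.3e}")
```

Under these assumptions the lowest layer is updated with less than a tenth of the base rate, which is consistent with the stated goal of preserving low-level representations while letting higher layers adapt.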
Impact of Number of Learnable Queries.
To investigate the effect of the number of learnable queries, we conduct experiments using different numbers of queries in the visual abstractor, as shown in Table 7. It can be observed that the model consistently improves as the number of learnable queries increases until it reaches a saturation point, suggesting that 64 may be the optimal number for representing an image. Notably, there is a significant performance boost when the number is increased from 8 to 64, e.g., the performance on VQAv2 increases by 18.5%. These findings suggest that a higher number of learnable queries can capture image information more comprehensively, thereby enhancing the model’s image comprehension capabilities.
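To make the role of the learnable queries concrete, the sketch below shows a single-head cross-attention step in which a fixed set of queries attends over patch features. It assumes a one-layer, single-head simplification with hypothetical weight names and patch counts; the actual visual abstractor has 6 layers and is not reproduced here. The point is only that the output length equals the number of queries, regardless of how many patches the vision encoder emits.

```python
import numpy as np


def abstract_visual_tokens(patch_feats, queries, W_q, W_k, W_v):
    """Cross-attention: (num_queries, d) queries attend over (num_patches, d) patches."""
    q = queries @ W_q
    k = patch_feats @ W_k
    v = patch_feats @ W_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v  # (num_queries, d)


rng = np.random.default_rng(0)
d = 16                                    # hypothetical feature width
queries = rng.normal(size=(64, d))        # 64 learnable queries, as in the text
W_q, W_k, W_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

for num_patches in (256, 1024):           # hypothetical patch counts
    patches = rng.normal(size=(num_patches, d))
    tokens = abstract_visual_tokens(patches, queries, W_q, W_k, W_v)
    print(num_patches, "->", tokens.shape)
```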
Impact of Image Resolution.
Image resolution plays a crucial role in vision-language tasks, as a higher resolution can reduce image blur and improve understanding of fine-grained details. To explore the impact of image resolution on performance across different benchmarks, we adjust the image resolution from to and the results are listed in Table 8. As observed in the table, using a higher resolution proves advantageous for multi-modal tasks, particularly in the question answering scenario. Specifically, the performance on VQAv2 increases from 76.8 to 79.4, a gain of 2.6 points. Simultaneously, there is an 11.8-point lift on the TextVQA benchmark when enlarging the resolution from to . This suggests that OCR-related tasks benefit significantly from increasing the resolution.
4.4 Qualitative Analysis
We investigate the impact of the Modality-Adaptive Module in multi-modal scenarios by visualizing the attention maps of mPLUG-Owl2 with and without this module using image caption input, as shown in Figure 4. Each attention map illustrates the attention scores of generated tokens on the input sequence during the generation process.
It can be observed that regardless of whether the Modality-Adaptive Module is incorporated or not, the model focuses more on the textual tokens in the earlier layers while paying more attention to the visual tokens in the later layers. This suggests that the modeling of visual and textual information plays different roles in the collaboration of multi-modal language models (MLLMs). An intuitive explanation is that MLLMs initially use syntactic information to comprehend instructions and then identify relevant visual content tokens by considering the textual input.
When using the Modality-Adaptive Module, it can be observed that the model explicitly pays more attention to the textual content in the earlier stages and focuses more on the visual content in the later stages. The Modality-Adaptive Module prevents visual and textual tokens from being treated as the same and encourages collaboration between different modalities.
Impact of Modality-Adaptive Module in Unrelated-Modality Scenarios.
We present a question: "What are the seven colors of the rainbow?" along with a randomly selected image. In this example, the image input acts as a disturbance to the model. We aim to investigate the impact of our module on data that contains unrelated modalities. The responses and attention maps of the model are shown in Figure 5. Our proposed model, mPLUG-Owl2, which incorporates the Modality-Adaptive Module, accurately identifies all seven colors. During the generation process, it can be observed that the model primarily focuses on the textual input. On the other hand, when the Modality-Adaptive Module is not utilized, mPLUG-Owl2 only identifies six colors. The model’s ability to comprehend text instructions is disrupted, and it is also evident that it places more emphasis on the image during generation. Thanks to the Modality-Adaptive Module, mPLUG-Owl2 is better able to capture modality-specific features when modeling multimodal inputs. This enhances the adaptability of modality collaboration, resulting in reduced disturbance when the text and image are unrelated.
Conclusion
In this paper, we present mPLUG-Owl2, a highly capable generalist model that leverages modality collaboration to enhance performance across both text and multi-modal tasks. The inclusion of shared functional modules and a modality-adaptive module in mPLUG-Owl2 strengthens the model’s ability to harmonize modality collaboration and preserve modality-specific characteristics. The extensive experimental evaluations highlight mPLUG-Owl2’s proficiency in generalizing across various tasks, achieving state-of-the-art performance with a single generalist model. Most notably, mPLUG-Owl2 stands as the first MLLM to exhibit the phenomenon of modality collaboration in both pure-text and multi-modal contexts. This not only enhances the model’s vision-language understanding but also improves its language capabilities in terms of understanding, knowledge, and reasoning. This represents a significant contribution to the field and opens up exciting opportunities for the future development of multi-modal foundation models.
References
Appendix A Additional Experimental Results
In this section, we provide more experimental results for the completeness of our proposed method.
We measure the hallucination of our model on image description using MMHal-Bench and compare the results with other recent vision-language models, including Kosmos-2, IDEFICS, InstructBLIP, LLaVA, and LLaVA-RLHF. Following prior work, we use GPT-4 to evaluate the overall score and hallucination rate of different MLLMs. As depicted in Figure 6, we find that mPLUG-Owl2 tends to generate responses with less hallucination than other methods, notably surpassing IDEFICS with 80 billion parameters, showing the superiority of our method. Besides, our model excels on attribute and counting questions because the visual abstractor can effectively identify the main parts of the image, which reduces hallucination.
We also study the hallucination of recent popular MLLMs and present the results in Figure 7. In the first example, the query asks the models to recognize the pattern on the wall. However, the pattern is not clearly visible in the image, causing other models to mistakenly perceive it as a solid color. Our model, on the other hand, accurately notices the white pattern on the wall and correctly answers the question. In the second example, there are only a few trees in the image. However, InstructBLIP incorrectly considers that there are no trees in the image. LLaVA and LLaVA-1.5, on the other hand, hallucinate and consider the tree in the image to be dense. MiniGPT-4 gives the correct answer, but with minimal explanation. Our mPLUG-Owl2, however, answers the question correctly and provides a more detailed explanation.
A.2 POPE Evaluation
We also conduct the hallucination evaluation using POPE; the results are shown in Table 9. As the table shows, mPLUG-Owl2 achieves higher F1 scores on the popular and adversarial splits, showing the robustness of our model in terms of object hallucination compared to other MLLMs.
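The F1 score referred to above is the harmonic mean of precision and recall. The sketch below computes it for yes/no object-presence answers, assuming "yes" is treated as the positive class; the sample predictions and labels are made up for illustration.

```python
def precision_recall_f1(predictions, labels, positive="yes"):
    """Precision, recall, and F1 for binary answers, with `positive` as the target class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and l == positive for p, l in pairs)
    fp = sum(p == positive and l != positive for p, l in pairs)
    fn = sum(p != positive and l == positive for p, l in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical model answers and ground-truth labels.
preds = ["yes", "yes", "no", "no"]
labels = ["yes", "no", "yes", "no"]
print(precision_recall_f1(preds, labels))  # (0.5, 0.5, 0.5)
```

A model that hallucinates objects answers "yes" too often, which lowers precision, so F1 penalizes it even when recall is high.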
A.3 Detailed Evaluation Results on MMBench
MMBench is a meticulously designed benchmark that comprehensively assesses the diverse skills of vision-language models. The results from the test set for various MLLMs are presented in Table 10.
A.4 Detailed Evaluation Results on MM-Vet
We provide the detailed results of MM-Vet in Table 11. It can be observed that, by training its visual encoder, mPLUG-Owl2 exhibits stronger OCR capability than models with the same backbone (i.e., LLaVA, Otter). Besides, mPLUG-Owl2 surpasses models with stronger language decoders such as LLaVA-13B, which uses an LLM with 13 billion parameters.
A.5 Detailed Evaluation Results on Q-Bench
To evaluate low-level visual perception abilities, we include the results of Q-Bench on the test set. By training the visual encoder, the low-level perception ability of mPLUG-Owl2 improves significantly, as it outperforms models with a stronger visual encoder (i.e., ViT-L (0.3B) vs. ViT-G (1.9B)), showing the effectiveness of our training paradigm.
A.6 Detailed Evaluation Results on MMHal-Bench
We include Table 13 for the full evaluation results on MMHal-Bench.
Appendix B Implementation
In this section, we detail our final training data mixture used during the instruction tuning stage in Table 14. Specifically, we process the VQAv2 data by selecting the answer with the highest confidence and combining question-answer pairs that share the same image. This combining strategy is also applied to the GQA, OKVQA, and OCRVQA datasets. Additionally, for multiple-choice questions in A-OKVQA, we augment the dataset by switching the order of options to enhance robustness in terms of multiple choices. For caption datasets like COCO and TextCaps, we randomly select one caption from the ground truth for each image. Concurrently, some regional-VQA datasets are also used to improve regional abilities.
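The processing steps above can be sketched as follows. Several details are assumptions: "highest confidence" is approximated by the most frequent annotator answer, option switching is implemented as a rotation, and the field names (`image_id`, `question`, `answers`) are hypothetical rather than taken from the actual data pipeline.

```python
from collections import Counter, defaultdict


def most_confident_answer(answers):
    """Assumption: the most frequent annotator answer is the most confident one."""
    return Counter(answers).most_common(1)[0][0]


def merge_by_image(samples):
    """Group question-answer pairs that share an image into one training sample."""
    grouped = defaultdict(list)
    for sample in samples:
        answer = most_confident_answer(sample["answers"])
        grouped[sample["image_id"]].append((sample["question"], answer))
    return dict(grouped)


def rotate_options(options, answer_idx, shift):
    """Reorder multiple-choice options by rotation and track the correct index."""
    n = len(options)
    shift %= n
    rotated = options[shift:] + options[:shift]
    return rotated, (answer_idx - shift) % n


# Hypothetical samples for illustration.
samples = [
    {"image_id": 1, "question": "What color is the bus?", "answers": ["red", "red", "maroon"]},
    {"image_id": 1, "question": "How many people?", "answers": ["2", "2", "3"]},
    {"image_id": 2, "question": "Is it raining?", "answers": ["no", "no", "no"]},
]
print(merge_by_image(samples))
print(rotate_options(["cat", "dog", "bird", "fish"], answer_idx=1, shift=1))
```

Rotating the options moves the correct answer to a different letter in each augmented copy, which discourages the model from learning a positional preference.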
B.2 Training Hyper-parameters
We report the detailed training hyper-parameter settings of mPLUG-Owl2 in Table 15. Specifically, we leverage model parallelism with the Megatron distributed training framework to enable training at a larger resolution while maintaining efficiency.
Appendix C Summary of the Evaluation Benchmarks
We provide a detailed summary of the used evaluation benchmarks and corresponding metrics in Table 16.
Appendix D Broader Impact
mPLUG-Owl2 employs an off-the-shelf LLM and web-sourced data. Consequently, it inherits some of the weaknesses of the original LLM and web-crawled data, such as generating uncensored text or producing biased outputs. We address these shortcomings by enhancing the model’s grounding on the visual and instructional input and executing joint vision-language instruction tuning on a diverse range of high-quality datasets. However, we advise against deploying mPLUG-Owl2 models for any downstream applications without prior evaluation of safety and fairness specific to the respective application.