MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning

Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, Mohamed Elhoseiny

cs.CV

Introduction

Multi-modal Large Language Models (LLMs) have emerged as an exciting research topic with a rich set of applications in the vision-language community, such as visual AI assistants, image captioning, visual question answering (VQA), and referring expression comprehension (REC). A key feature of multi-modal large language models is that they can inherit advanced capabilities (e.g., logical reasoning, common sense, and strong language expression) from the LLMs. When tuned with proper vision-language instructions, multi-modal LLMs, specifically vision-language models, demonstrate strong capabilities such as producing detailed image descriptions, generating code, localizing the visual objects in the image, and even performing multi-modal reasoning to better answer complicated visual questions. This evolution of LLMs enables interaction through both visual and language inputs when communicating with individuals, and has been shown to be quite effective for building visual chatbots.

However, learning to perform multiple vision-language tasks effectively and formulating their corresponding multi-modal instructions present considerable challenges due to the complexities inherent among different tasks. For instance, given a user input “tell me the location of a person”, there are many ways to interpret and respond based on the specific task. In the context of the referring expression comprehension task, it can be answered with one bounding box location of the person. For the visual question-answering task, the model might describe the person's spatial location in natural language. For the person detection task, the model might identify the spatial location of every human in a given image. To alleviate this issue and move towards a unified approach, we propose a task-oriented instruction training scheme to reduce the multi-modal instructional ambiguity, and a vision-language model, MiniGPT-v2. Specifically, we provide a unique task identifier token for each task. For example, we provide a [vqa] identifier token for training all the data samples from the visual question answering tasks. In total, we provide six different task identifiers during the model training stages.

Our model, MiniGPT-v2, has a simple architecture design. It directly takes the visual tokens from a ViT vision encoder and projects them into the feature space of a large language model. For better visual perception, we utilize higher-resolution images (448x448) during training. However, this results in a larger number of visual tokens. To make the model training more efficient, we concatenate every four neighboring visual tokens into a single token, reducing the total number by 75%. Additionally, we utilize a three-stage training strategy to effectively train our model with a mixture of weakly-labeled datasets, fine-grained image-text datasets, and multi-modal instructional datasets, with a different training focus at each stage.

To evaluate the performance of our model, we conducted extensive experiments on diverse vision-language tasks, including (detailed) image/grounded captioning, visual question answering, and visual grounding. The results demonstrate that our MiniGPT-v2 can achieve SOTA or comparable performance on diverse benchmarks compared to previous vision-language generalist models, such as MiniGPT-4, InstructBLIP, LLaVA, and Shikra. For example, our MiniGPT-v2 outperforms MiniGPT-4 by 21.3%, InstructBLIP by 11.3%, and LLaVA by 11.7% on the VSR benchmark, and it also performs better than the previously established strong baseline, Shikra, in most validations on RefCOCO, RefCOCO+, and RefCOCOg. Our model establishes new state-of-the-art results on these benchmarks among vision-language generalist models, as shown in Fig. 1.

Related Work

We briefly review relevant works on advanced large language models and multi-modal LLMs for visual aligning.

Advanced Large Language Models (LLMs). Early-stage models such as GPT-2 and BERT are foundation models trained on web-scale text datasets, marking a breakthrough in the NLP field. Following the success of foundation models, LLMs with higher capacity and increased training data were developed, including GPT-3, Megatron-Turing NLG, PaLM, Gopher, Chinchilla, OPT, and BLOOM. Most recently, efforts have focused on refining LLMs to work effectively with human instruction and feedback. Representative works in this direction are InstructGPT and ChatGPT, which demonstrate strong capabilities such as answering a diverse range of language questions, engaging in conversations with humans, and learning to perform complex tasks like writing refinement and coding assistance.

Concurrent with these advancements of LLMs is the rise of the LLaMA language models. To enable human instruction following abilities similar to ChatGPT, some works attempt to finetune the LLaMA model with additional high-quality instruction datasets. Examples of these models include Alpaca, Vicuna, and MPT. Some other open-sourced language models that learned from human feedback data, such as Falcon and LLaMA-2, have also been introduced to the NLP community with impressive performance.

Visual Aligning with LLMs. With the remarkable generalization abilities of LLMs, interesting studies have extended LLMs to multi-modal domains by aligning visual inputs with LLMs. Early works such as VisualGPT and Frozen used pre-trained language models to improve vision-language models on image captioning and visual question answering. This initial exploration paved the way for subsequent vision-language research such as Flamingo and BLIP-2. More recently, GPT-4 has been released and demonstrates many advanced multi-modal abilities, e.g., generating website code based on handwritten text instructions. Those demonstrated capabilities inspired other vision-language LLMs, including MiniGPT-4 and LLaVA, which align the image inputs with a large language model, Vicuna, using proper instructional tuning. These vision-language models also showcase many advanced multi-modal capabilities after the alignment. Recent works, such as Vision-LLM, Kosmos-2, Shikra, and our concurrent work, Qwen-VL, also demonstrate that multi-modal LLMs can perform visual grounding by generating bounding boxes in text format through the language model.

Method

We start by introducing our vision-language model, MiniGPT-v2, then discuss the basic idea of a multi-task instruction template with task identifiers for training, and finally adapt our task identifier idea to achieve task-oriented instruction tuning.

Our proposed model architecture, MiniGPT-v2, is shown in Fig. 2. It consists of three components: a visual backbone, a linear projection layer, and a large language model. We describe each component as follows:

Visual backbone. MiniGPT-v2 adopts EVA as its visual backbone. We freeze the visual backbone during the entire model training. We train our model with an image resolution of 448x448, and we interpolate the positional encoding to scale to this higher image resolution.

Linear projection layer. We aim to project all the visual tokens from the frozen vision backbone into the language model space. However, for higher-resolution images such as 448x448, projecting all the image tokens results in a very long-sequence input (e.g., 1024 tokens) and significantly lowers the training and inference efficiency. Hence, we simply concatenate 4 adjacent visual tokens in the embedding space and project them together into one single embedding in the same feature space of the large language model, thus reducing the number of visual input tokens by 4 times. With this operation, our MiniGPT-v2 can process high-resolution images much more efficiently during the training and inference stage.
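The token-merging step described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy feature dimensions, and the choice of "adjacent" as consecutive positions in the flattened token sequence are all assumptions made for illustration. Only the 1024-token sequence length and the group size of 4 come from the text.

```python
# Sketch: merge every 4 adjacent visual tokens, then apply one linear projection.
# Dimensions and weights are toy values, not the model's real sizes (assumption).

def merge_adjacent_tokens(tokens, group=4):
    """Concatenate each run of `group` consecutive token vectors into one vector."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by the group size")
    merged = []
    for start in range(0, len(tokens), group):
        vec = []
        for tok in tokens[start:start + group]:
            vec.extend(tok)
        merged.append(vec)
    return merged


def linear_project(vectors, weight, bias):
    """Apply y = W x + b to every vector; `weight` has shape (out_dim, in_dim)."""
    return [
        [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]
        for vec in vectors
    ]


# 1024 toy tokens (the example sequence length from the text), each of dimension 2.
num_tokens, vis_dim, llm_dim = 1024, 2, 3
tokens = [[float(i), float(-i)] for i in range(num_tokens)]

merged = merge_adjacent_tokens(tokens)            # 256 vectors of dimension 4 * vis_dim
weight = [[0.1] * (4 * vis_dim) for _ in range(llm_dim)]
bias = [0.0] * llm_dim
projected = linear_project(merged, weight, bias)  # 256 vectors of dimension llm_dim

print(len(tokens), "->", len(projected), "tokens")  # prints "1024 -> 256 tokens"
```

Because the concatenation happens before the projection, the linear layer sees an input four times wider than a single visual token, while the language model receives a sequence one quarter as long.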

Large language model. MiniGPT-v2 adopts the open-sourced LLaMA2-chat (7B) as the language model backbone. In our work, the language model is treated as a unified interface for various vision-language inputs. We directly rely on the LLaMA-2 language tokens to perform various vision-language tasks. For the visual grounding tasks that necessitate the generation of spatial locations, we directly ask the language model to produce textual representations of bounding boxes to denote their spatial positions.

Multi-task Instruction Template

When training a single unified model for multiple different tasks such as visual question answering, image captioning, referring expression comprehension, grounded image captioning, and region identification, the multi-modal model might fail to distinguish each task by just aligning visual tokens to language models. For instance, when you ask “Tell me the spatial location of the person wearing a red jacket?”, the model can either respond with the location in a bounding box format (e.g., $<\text{X}_{left}><\text{Y}_{top}><\text{X}_{right}><\text{Y}_{bottom}>$) or describe the object location using natural language (e.g., upper right corner). To reduce such ambiguity and make each task easily distinguishable, we introduce task-specific tokens in our designed multi-task instruction template for training. We now describe our multi-task instruction template in more detail.

General input format. We follow the LLaMA-2 conversation template design and adapt it for the multi-modal instructional template. The template is denoted as follows,

[INST] <Img> <ImageFeature> </Img> [Task Identifier] Instruction [/INST]

In this template, [INST] is considered as the user role, and [/INST] is considered as the assistant role. We structure the user input into three parts. The first part is the image features, the second part is the task identifier token, and the third part is the instruction input.

Task identifier tokens. Our model takes a distinct identifier for each task to reduce the ambiguity across various tasks. As illustrated in Table 1, we have proposed six different task identifiers for visual question answering, image caption, grounded image captioning, referring expression comprehension, referring expression generation, and phrase parsing and grounding respectively. For vision-irrelevant instructions, our model does not use any task identifier token.
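As a rough illustration of the template and the task identifier rule above, the sketch below assembles a prompt string. The function name `build_prompt`, the literal `<ImageFeature>` placeholder (the real model would substitute projected visual embeddings there), and the exact whitespace between template parts are assumptions, not details confirmed by the text.

```python
# Sketch: assemble a user turn in the multi-task instruction template.
# `build_prompt` and the whitespace handling are illustrative assumptions.

def build_prompt(instruction, task_identifier=None, image_placeholder="<ImageFeature>"):
    """Return '[INST] <Img>...</Img> [task] instruction [/INST]'.

    The task identifier is omitted when None, mirroring the rule that
    vision-irrelevant instructions carry no identifier token.
    """
    parts = ["[INST]", f"<Img>{image_placeholder}</Img>"]
    if task_identifier:
        parts.append(f"[{task_identifier}]")
    parts.append(instruction)
    parts.append("[/INST]")
    return " ".join(parts)


print(build_prompt("What color is the jacket?", "vqa"))
print(build_prompt("Tell me a joke."))
```

Keeping the identifier as a separate argument makes it easy to issue the same instruction under different task identifiers and compare the responses.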

Spatial location representation. For tasks such as referring expression comprehension (REC), referring expression generation (REG), and grounded image captioning, our model is required to identify the spatial location of the referred objects accurately. We represent the spatial location through the textual formatting of bounding boxes in our setting, specifically: “$\{<\text{X}_{left}><\text{Y}_{top}><\text{X}_{right}><\text{Y}_{bottom}>\}$”. Coordinates for X and Y are represented by integer values normalized to a fixed range. $<\text{X}_{left}>$ and $<\text{Y}_{top}>$ denote the x and y coordinates of the top-left corner of the generated bounding box, and $<\text{X}_{right}>$ and $<\text{Y}_{bottom}>$ denote the x and y coordinates of the bottom-right corner.
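A minimal sketch of how a pixel-space box could be turned into this textual form is given below. The function name `format_box` and the rounding rule are assumptions. The normalization range is not stated in this text, so the `scale` parameter and its default of 100 are placeholders chosen for illustration only.

```python
# Sketch: convert a pixel bounding box into the "{<X_left><Y_top><X_right><Y_bottom>}" text form.
# `scale=100` is an assumed placeholder; the actual normalization range is not given here.

def format_box(box_xyxy, image_width, image_height, scale=100):
    """Normalize (x_left, y_top, x_right, y_bottom) to integers and render as text."""
    x_left, y_top, x_right, y_bottom = box_xyxy
    coords = [
        int(round(x_left / image_width * scale)),
        int(round(y_top / image_height * scale)),
        int(round(x_right / image_width * scale)),
        int(round(y_bottom / image_height * scale)),
    ]
    return "{" + "".join(f"<{c}>" for c in coords) + "}"


# A box covering the lower-right region of a 448x448 image.
print(format_box((112, 224, 336, 448), 448, 448))  # prints "{<25><50><75><100>}"
```

Expressing coordinates as plain integers lets the language model emit boxes with its ordinary text vocabulary, without adding dedicated location tokens.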

Multi-task Instruction Training

We now adapt our designed multi-task instruction template for instruction training. The basic idea is to take instructions with task-specific identifier tokens as input for task-oriented instruction training of MiniGPT-v2. When input instructions carry task identifier tokens, the model can more readily distinguish between tasks during training. We train our model with task identifier instructions for better visual alignment in three stages. The first stage helps MiniGPT-v2 build broad vision-language knowledge through many weakly-labeled image-text datasets together with high-quality fine-grained vision-language annotation datasets (where we assign a high data sampling ratio to the weakly-labeled image-text datasets). The second stage improves the model with only fine-grained data for multiple tasks. The third stage finetunes our model with more multi-modal instruction and language datasets so that it answers diverse multi-modal instructions better and behaves as a multi-modal chatbot. The datasets used for training at each stage are listed in Table 2.

Stage 1: Pretraining. To acquire broad vision-language knowledge, our model is trained on a mix of weakly-labeled and fine-grained datasets. We give a high sampling ratio to the weakly-labeled datasets to gain more diverse knowledge in the first stage.

For the weakly-labeled datasets, we use LAION, CC3M, SBU, and GRIT-20M from Kosmos-2, which built the dataset for referring expression comprehension (REC), referring expression generation (REG), and grounded image captioning.

For fine-grained datasets, we use datasets like COCO caption and Text Captions for image captioning, and RefCOCO, RefCOCO+, and RefCOCOg for REC. For REG, we restructured the data from RefCOCO and its variants, reversing the order from phrase $\rightarrow$ bounding boxes to bounding boxes $\rightarrow$ phrase. For VQA, our training takes a variety of datasets, such as GQA, VQA-v2, OCR-VQA, OK-VQA, and AOK-VQA.

Stage 2: Multi-task training. To improve the performance of MiniGPT-v2 on each task, we only focus on using fine-grained datasets to train our model at this stage. We exclude the weakly-supervised datasets such as GRIT-20M and LAION from stage-1 and update the data sampling ratio according to the frequency of each task. This strategy enables our model to prioritize high-quality aligned image-text data for superior performance across various tasks.

Stage 3: Multi-modal instruction tuning. Subsequently, we focus on tuning our model with more multi-modal instruction datasets and enhancing its conversation ability as a chatbot. We continue using the datasets from the second stage and add instructional datasets, including LLaVA , Flickr30k dataset , our constructed mixing multi-task dataset, and the language dataset, Unnatural Instruction . We give a lower data sampling ratio for the fine-grained datasets from stage-2 and a higher data sampling ratio for the new instruction datasets.

– LLaVA instruction data. We add the multi-modal instruction tuning datasets, including the detailed descriptions and complex reasoning from LLaVA , with 23k and 58k data examples respectively.

– Flickr30k. After the second-stage training, our MiniGPT-v2 can effectively generate grounded image captions. Nevertheless, these descriptions tend to be short and often cover very few visual objects. This is because the GRIT-20M dataset from Kosmos-2 that our model was trained with features a limited number of grounded visual objects in each caption, and our model lacks proper multi-modal instruction tuning to teach it to recognize more visual objects. To improve this, we fine-tune our model using the Flickr30k dataset, which provides more contextual grounding of entities within its captions.

We prepare the Flickr30k dataset in two distinct formats for training our model to perform grounded image captioning and a new task, “object parsing and grounding”:

1) Grounded image caption. We select captions with a minimum of five grounded phrases, containing around 2.5k samples, and we directly instruct the model to produce the grounded image caption, e.g., a <p>wooden table</p> $\{<\text{X}_{left}><\text{Y}_{top}><\text{X}_{right}><\text{Y}_{bottom}>\}$ in the center of the room.

2) Object parsing and grounding. This new task is to parse all the objects from an input caption and then ground each object. To enable this, we use the task identifier [detection] to differentiate this capability from other tasks. We also use Flickr30k to construct two types of instruction datasets: caption $\rightarrow$ grounded phrases and phrase $\rightarrow$ grounded phrase, containing around 2.5k and 3k samples, respectively. We then prompt our model with the instruction “[detection] description”, and the model directly parses the objects from the input image description and grounds them into bounding boxes.

– Mixing multi-task dataset. After extensive training with single-round instruction-answer pairs, the model might not handle multiple tasks well during multi-round conversations since the context becomes more complex. To alleviate this situation, we create a new multi-round conversation dataset by mixing the data from different tasks. We include this dataset into our third-stage model training.

– Unnatural instruction. The conversation abilities of the language model can degrade after extensive vision-language training. To fix this, we add the language dataset, Unnatural Instruction, to our model’s third-stage training to help recover its language generation ability.
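The grounded caption format used for the Flickr30k data above (a phrase wrapped in <p> tags followed by a braced box) lends itself to simple post-processing. The sketch below shows one way to extract phrase and box pairs from generated text. The regular expression, the function name `parse_grounded_text`, and the concrete coordinate values in the example are illustrative assumptions, not taken from the paper.

```python
# Sketch: extract (phrase, box) pairs from grounded caption text.
# The regex and the example coordinates are assumptions for illustration.
import re

GROUNDED_PHRASE = re.compile(
    r"<p>\s*(.*?)\s*</p>\s*\{\s*<(\d+)>\s*<(\d+)>\s*<(\d+)>\s*<(\d+)>\s*\}"
)


def parse_grounded_text(text):
    """Return a list of (phrase, (x_left, y_top, x_right, y_bottom)) tuples."""
    results = []
    for match in GROUNDED_PHRASE.finditer(text):
        phrase = match.group(1)
        box = tuple(int(match.group(i)) for i in range(2, 6))
        results.append((phrase, box))
    return results


example = "a <p>wooden table</p> {<20><45><60><90>} in the center of the room."
print(parse_grounded_text(example))  # prints [('wooden table', (20, 45, 60, 90))]
```

The same parser would apply to the output of the object parsing and grounding task, provided the model emits its phrases and boxes in this format.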

Experiments

In this section, we present experimental settings and results. We primarily conduct experiments on (detailed) image/grounded captioning, visual question answering, and visual grounding tasks, including referring expression comprehension. We present both quantitative and qualitative results.

Implementation details. Throughout the entire training process, the visual backbone of MiniGPT-v2 remains frozen. We focus on training the linear projection layer and efficiently finetuning the language model using LoRA. With LoRA, we finetune $\mathcal{W}_{q}$ and $\mathcal{W}_{v}$ via low-rank adaptation. In our implementation, we set the rank to $r=64$. We train the model with an image resolution of 448x448 during all stages. During each stage, we use our designed multi-modal instructional templates for the various vision-language tasks.
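To make the low-rank adaptation of $\mathcal{W}_{q}$ and $\mathcal{W}_{v}$ more concrete, the sketch below counts the trainable parameters a rank-64 update adds to one square projection matrix and runs a toy forward pass. The hidden size of 4096 is an assumption about the 7B language model, the helper names are hypothetical, and the usual LoRA scaling factor is omitted because the text does not specify it.

```python
# Sketch: LoRA parameter count and a toy forward pass, h = W x + B (A x).
# hidden=4096 is an assumed hidden size; the LoRA scaling factor is omitted (not stated in the text).

def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters for one adapted matrix: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_forward(x, W, A, B):
    """Apply the frozen weight W plus the low-rank update B A without forming B A."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]


hidden, rank = 4096, 64
per_matrix = lora_param_count(hidden, hidden, rank)
full = hidden * hidden
print(per_matrix, full, per_matrix / full)  # prints "524288 16777216 0.03125"

# Toy example with a 2x2 frozen weight and a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [0.0]]
print(lora_forward([2.0, 3.0], W, A, B))  # prints "[4.5, 3.0]"
```

Under these assumptions, each adapted matrix trains roughly 3% as many parameters as full finetuning of that matrix would.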

Training and hyperparameters. We use the AdamW optimizer with a cosine learning rate scheduler to train our model. In the first stage, we train on 8xA100 GPUs for 400,000 steps with a global batch size of 96 and a maximum learning rate of 1e-4; this stage takes around 90 hours. In the second stage, the model is trained for 50,000 steps on 4xA100 GPUs with a global batch size of 64 and a maximum learning rate of 1e-5; this stage takes roughly 20 hours. In the last stage, we train for another 35,000 steps on 4xA100 GPUs with a global batch size of 24 and the same maximum learning rate of 1e-5; this stage takes around 7 hours.
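The figures above imply a training budget that is easy to tabulate. The sketch below multiplies steps by global batch size to get the number of samples drawn per stage (with repetition, not unique samples) and GPUs by hours to get approximate GPU-hours. It also includes a plain cosine decay curve. The absence of warmup and a minimum learning rate of zero are assumptions, since the text gives only the maximum learning rate.

```python
# Sketch: per-stage training budget from the reported hyperparameters,
# plus a plain cosine decay (no warmup, min_lr=0.0 are assumptions).
import math

STAGES = [
    {"name": "stage 1", "steps": 400_000, "batch": 96, "gpus": 8, "hours": 90, "max_lr": 1e-4},
    {"name": "stage 2", "steps": 50_000, "batch": 64, "gpus": 4, "hours": 20, "max_lr": 1e-5},
    {"name": "stage 3", "steps": 35_000, "batch": 24, "gpus": 4, "hours": 7, "max_lr": 1e-5},
]


def samples_seen(stage):
    """Number of training samples drawn in a stage (steps x global batch size)."""
    return stage["steps"] * stage["batch"]


def gpu_hours(stage):
    """Approximate GPU-hours for a stage (GPU count x wall-clock hours)."""
    return stage["gpus"] * stage["hours"]


def cosine_lr(step, total_steps, max_lr, min_lr=0.0):
    """Cosine decay from max_lr at step 0 to min_lr at total_steps."""
    progress = step / total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for stage in STAGES:
    print(stage["name"], samples_seen(stage), "samples,", gpu_hours(stage), "GPU-hours")
```

By this count the first stage accounts for 38.4M of the roughly 42.4M sample draws, so most of the compute goes into building broad vision-language knowledge.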

Dataset and evaluation metrics. We evaluate our model across a range of VQA and visual grounding benchmarks. For VQA benchmarks, we consider OKVQA, GQA, visual spatial reasoning (VSR), IconVQA, VizWiz, and HatefulMemes (HM). For visual grounding, we evaluate our model on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks.

To evaluate VQA benchmarks, we use an open-ended approach with a greedy decoding strategy. We evaluate each VQA question with the following instruction template: “[vqa] question”. Following previous work, we evaluate the performance by matching the model’s response to the ground truth and reporting top-1 accuracy. For visual grounding benchmarks, we use the template “[refer] give me the location of Referring expression” for each referring expression comprehension question, and a predicted bounding box is counted as correct if its IoU with the ground-truth box is higher than 0.5.
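The grounding criterion above can be written down directly. The sketch below computes the intersection over union of two boxes in (x_left, y_top, x_right, y_bottom) form and the resulting accuracy at a 0.5 threshold. The function names are hypothetical, and the boxes in the example are made up for illustration.

```python
# Sketch: IoU between two boxes and accuracy at an IoU threshold of 0.5.
# Boxes are (x_left, y_top, x_right, y_bottom); example values are illustrative.

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(predictions, ground_truths, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth exceeds the threshold."""
    hits = sum(iou(p, g) > threshold for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)


preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10)]
print(rec_accuracy(preds, gts))  # prints 0.5
```

In the example the second prediction overlaps half of its target but has an IoU of only 1/3, so it counts as a miss.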

Visual question answering results. Table 3 presents our experimental results on multiple VQA benchmarks. Our results compare favorably to baselines including MiniGPT-4, Shikra, LLaVA, and InstructBLIP across all the VQA tasks. For example, on OKVQA, our MiniGPT-v2 outperforms MiniGPT-4, Shikra, LLaVA, and BLIP-2 by 20.3%, 10.6%, 3.4%, and 11.9%, respectively. These results indicate the strong visual question answering capabilities of our model. Furthermore, we find that our MiniGPT-v2 (chat) variant shows higher performance than the version trained after the second stage. On OKVQA, VSR, IconVQA, VizWiz, and HM, MiniGPT-v2 (chat) outperforms MiniGPT-v2 by 0.9%, 2.3%, 4.2%, 20.7%, and 0.6%, respectively. We believe that the better performance can be attributed to the improved language skills acquired during the third-stage training, which benefit visual question comprehension and response, especially on VizWiz with its 20.7% increase in top-1 accuracy.

Referring expression comprehension results. Table 4 compares our model to baselines on REC benchmarks. Our MiniGPT-v2 shows strong REC performance on RefCOCO, RefCOCO+, and RefCOCOg, performing better than other vision-language generalist models. MiniGPT-v2 outperforms OFA-L by over 8% accuracy across all tasks of RefCOCO/RefCOCO+/RefCOCOg. Compared with a strong baseline, Shikra (13B), our model still shows better results, e.g., 84.29% vs. 83.96% accuracy on average. These results provide direct evidence for the competitive visual grounding capabilities of MiniGPT-v2. Although our model underperforms specialist models, the promising performance indicates its growing competence in visual grounding.

Ablation on task identifier. We conduct ablation studies on the effect of the task identifier on the performance of MiniGPT-v2. We compare our model with a variant trained without task identifiers on VQA benchmarks. Both models were trained on 4xA100 GPUs for 24 hours with an equal number of training steps for multiple vision-language tasks. The results in Table 5 cover multiple VQA benchmarks and consistently show that task identifier training benefits the overall performance of MiniGPT-v2. Specifically, our MiniGPT-v2 with task-oriented instruction training achieves a 1.2% top-1 accuracy improvement on average. These ablation results validate the advantage of adding task identifier tokens and support their use for improving multi-task learning efficiency.

Hallucination. We measure the hallucination of our model on image description generation and compare the results with other vision-language baselines, including MiniGPT-4, mPLUG-Owl, LLaVA, and MultiModal-GPT. Following the methodology of prior work, we use CHAIR to assess hallucination at both the object and sentence levels. As shown in Table 6, our MiniGPT-v2 tends to generate image descriptions with less hallucination than the other baselines. We evaluated three types of prompts with MiniGPT-v2. First, we use the prompt “generate a brief description of the given image” without any task identifier, which tends to produce more detailed image descriptions. Then we provide the instruction prompt “[grounding] describe this image in as detailed as possible” for evaluating grounded image captions. Lastly, we prompt our model with “[caption] briefly describe the image”. With these task identifiers, MiniGPT-v2 is able to produce a variety of image descriptions with different levels of hallucination. All three instruction variants show lower hallucination than the baselines, especially those using the task identifiers [caption] and [grounding].
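For readers unfamiliar with CHAIR, the sketch below shows the general shape of the two scores: the object-level score is the fraction of mentioned objects that are not in the image, and the sentence-level score is the fraction of captions that mention at least one such object. This is a simplified illustration. It assumes object mentions have already been extracted from each caption and matched to a fixed object vocabulary, which the real metric handles with synonym lists. The function name and example data are hypothetical.

```python
# Sketch: simplified CHAIR scores from pre-extracted object mentions.
# Assumes mentions are already mapped to the same vocabulary as the ground-truth objects.

def chair_scores(mentioned_per_caption, present_per_image):
    """Return (object-level, sentence-level) hallucination rates."""
    total_mentions = 0
    hallucinated_mentions = 0
    captions_with_hallucination = 0
    for mentioned, present in zip(mentioned_per_caption, present_per_image):
        hallucinated = [obj for obj in mentioned if obj not in present]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    num_captions = len(mentioned_per_caption)
    chair_i = hallucinated_mentions / total_mentions if total_mentions else 0.0
    chair_s = captions_with_hallucination / num_captions if num_captions else 0.0
    return chair_i, chair_s


mentioned = [["dog", "frisbee"], ["cat", "vase"]]
present = [{"dog", "frisbee"}, {"cat"}]
print(chair_scores(mentioned, present))  # prints (0.25, 0.5)
```

Lower is better for both scores. Longer descriptions mention more objects and therefore have more opportunities to hallucinate, which is one reason prompt choice affects the result.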

Qualitative Results

We now provide qualitative results for a complementary understanding of our model’s multi-modal capabilities. Some examples can be seen in Fig. 3. Specifically, the examples demonstrate various abilities, including a) object identification; b) detailed grounded image captioning; c) visual question answering; d) referring expression comprehension; e) visual question answering under task identifier; f) detailed image description; g) object parsing and grounding from an input text. More qualitative results can be found in the Appendix. These results demonstrate that our model has competitive vision-language understanding capabilities. Moreover, note that we train our model with only a few thousand instruction samples on the object parsing and grounding task in the third stage, and our model can effectively follow the instructions and generalize to the new task. This indicates that our model has the flexibility to adapt to many new tasks.

Note that our model still occasionally shows hallucinations when generating image descriptions or performing visual grounding. For example, our model may sometimes produce descriptions of non-existent visual objects or generate inaccurate visual locations of grounded objects. We believe that training with more high-quality image-text aligned data and integrating a stronger vision backbone or large language model holds the potential for alleviating this issue.

Conclusion

In this paper, we introduce MiniGPT-v2, a multi-modal LLM that can serve as a unified interface for vision-language multi-task learning. To develop a single model capable of handling multiple vision-language tasks, we propose using distinct identifiers for each task during training and inference. These identifiers help our model easily differentiate various tasks and also improve learning efficiency. Our MiniGPT-v2 achieves state-of-the-art results across many visual question answering and referring expression comprehension benchmarks. We also found that our model can efficiently adapt to new vision-language tasks, which suggests that MiniGPT-v2 has many potential applications in the vision-language community.

Appendix A

In the supplementary material, we list the instruction templates used for evaluation and provide more qualitative results generated by our model to demonstrate its vision-language multi-tasking capabilities.

RefCOCO/RefCOCO+/RefCOCOg: [refer] give me the location of question

VizWiz: [vqa] Based on the image, respond to this question with a single word or phrase: question, and reply ’unanswerable’ when the provided information is insufficient

Hateful Meme: [vqa] This is an image with: question written on it. Is it hateful? Answer:

VSR: [vqa] Based on the image, is this statement true or false? question

IconQA, GQA, OKVQA: [vqa] Based on the image, respond to this question with a single word or phrase: question
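The templates listed above can be collected into a small lookup table for building evaluation prompts. The template strings below are copied from the list, with `{question}` marking the slot. The dictionary keys and the function name `make_eval_prompt` are hypothetical names introduced for illustration.

```python
# Sketch: build evaluation prompts from the per-benchmark templates listed above.
# Dictionary keys and the function name are illustrative assumptions.

SHORT_ANSWER = "[vqa] Based on the image, respond to this question with a single word or phrase: {question}"

EVAL_TEMPLATES = {
    "refcoco": "[refer] give me the location of {question}",
    "vizwiz": SHORT_ANSWER + ", and reply 'unanswerable' when the provided information is insufficient",
    "hateful_memes": "[vqa] This is an image with: {question} written on it. Is it hateful? Answer:",
    "vsr": "[vqa] Based on the image, is this statement true or false? {question}",
    "iconqa": SHORT_ANSWER,
    "gqa": SHORT_ANSWER,
    "okvqa": SHORT_ANSWER,
}


def make_eval_prompt(benchmark, question):
    """Fill the benchmark's template with the question or referring expression."""
    return EVAL_TEMPLATES[benchmark].format(question=question)


print(make_eval_prompt("vsr", "The cat is under the table."))
print(make_eval_prompt("refcoco", "the person wearing a red jacket"))
```

The resulting string would still need to be wrapped in the conversation template together with the image features before being passed to the model.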

A.2 Additional Qualitative Results

To study how well our model is able to take visual input and answer questions based on task-oriented identifiers, we use our model to perform multiple vision-language tasks, including grounded image captioning in Fig. 4, Fig. 5, Fig. 6 and Fig. 7; object parsing and grounding in Fig. 8, Fig. 9, Fig. 10 and Fig. 11; referring expression comprehension in Fig. 12, Fig. 13, Fig. 14 and Fig. 15; and object identification in Fig. 16, Fig. 17, Fig. 18 and Fig. 19.

For each task, we share 4 examples to show the vision-language capabilities of our model. The results in the demo provide direct evidence for the competitive visual understanding capabilities of MiniGPT-v2 on multiple vision-language tasks. For example, in the grounded caption cases, our model is able to give correct grounded image captions with detailed spatial locations of objects. In the object identification cases, the model also generates the expected object names. MiniGPT-v2 can understand new scenes and follow the task identifier to respond. However, we also note that our model still shows some hallucination, e.g., in Fig. 6, several persons are not grounded accurately, and in Fig. 7, the model describes a vase that does not exist in the image.