InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi

cs.CV cs.LG

Introduction

A longstanding aspiration of Artificial Intelligence (AI) research is to build a single model that can solve arbitrary tasks specified by the user. In natural language processing (NLP), instruction tuning proves to be a promising approach toward that goal. By finetuning a large language model (LLM) on a wide range of tasks described by natural language instructions, instruction tuning enables the model to follow arbitrary instructions. Recently, instruction-tuned LLMs have also been leveraged for vision-language tasks. For example, BLIP-2 effectively adapts frozen instruction-tuned LLMs to understand visual inputs and exhibits preliminary capabilities to follow instructions in image-to-text generation.

Compared to NLP tasks, vision-language tasks are more diverse in nature due to the additional visual inputs from various domains. This poses a greater challenge to a unified model that is supposed to generalize to diverse vision-language tasks, many unseen during training. Most previous work can be grouped into two approaches. The first approach, multitask learning, formulates various vision-language tasks into the same input-output format. However, we empirically find multitask learning without instructions (Table 4) does not generalize well to unseen datasets and tasks. The second approach extends a pre-trained LLM with additional visual components, and trains the visual components with image caption data. Nevertheless, such data are too limited to allow broad generalization to vision-language tasks that require more than visual descriptions.

To address the aforementioned challenges, this paper presents InstructBLIP, a vision-language instruction tuning framework that enables general-purpose models to solve a wide range of vision-language tasks through a unified natural language interface. InstructBLIP uses a diverse set of instruction data to train a multimodal LLM. Specifically, we initialize training with a pre-trained BLIP-2 model consisting of an image encoder, an LLM, and a Query Transformer (Q-Former) to bridge the two. During instruction tuning, we finetune the Q-Former while keeping the image encoder and LLM frozen. Our paper makes the following key contributions:

We perform a comprehensive and systematic study on vision-language instruction tuning. We transform 26 datasets into the instruction tuning format and group them into 11 task categories. We use 13 held-in datasets for instruction tuning and 13 held-out datasets for zero-shot evaluation. Moreover, we withhold four entire task categories for zero-shot evaluation at the task level. Exhaustive quantitative and qualitative results demonstrate the effectiveness of InstructBLIP on vision-language zero-shot generalization.

We propose instruction-aware visual feature extraction, a novel mechanism that enables flexible and informative feature extraction according to the given instructions. Specifically, the textual instruction is given not only to the frozen LLM, but also to the Q-Former, so that it can extract instruction-aware visual features from the frozen image encoder. Also, we propose a balanced sampling strategy to synchronize learning progress across datasets.

We evaluate and open-source a suite of InstructBLIP models using two families of LLMs: 1) FlanT5, an encoder-decoder LLM finetuned from T5; 2) Vicuna, a decoder-only LLM finetuned from LLaMA. The InstructBLIP models achieve state-of-the-art zero-shot performance on a wide range of vision-language tasks. Furthermore, InstructBLIP models lead to state-of-the-art finetuning performance when used as the model initialization on individual downstream tasks.

Vision-Language Instruction Tuning

InstructBLIP aims to address the unique challenges in vision-language instruction tuning and provide a systematic study on the models’ improved generalization ability to unseen data and tasks. In this section, we first introduce the construction of instruction-tuning data, followed by the training and evaluation protocols. Next, we delineate two techniques to improve instruction-tuning performance from the model and data perspectives, respectively. Lastly, we present the implementation details.

To ensure the diversity of instruction tuning data while considering their accessibility, we gather a comprehensive set of publicly available vision-language datasets and transform them into the instruction tuning format. As shown in Figure 2, the final collection covers 11 task categories and 26 datasets, including image captioning, image captioning with reading comprehension, visual reasoning, image question answering, knowledge-grounded image question answering, image question answering with reading comprehension, image question generation (adapted from the QA datasets), video question answering, visual conversational question answering, image classification, and LLaVA-Instruct-150K. We include detailed descriptions and statistics of each dataset in Appendix C.

For every task, we meticulously craft 10 to 15 distinct instruction templates in natural language. These templates serve as the foundation for constructing instruction tuning data, which articulates the task and the objective. For public datasets inherently favoring short responses, we insert terms such as "short" and "briefly" into some of their corresponding instruction templates to reduce the risk of the model overfitting to always generating short outputs. For the LLaVA-Instruct-150K dataset, we do not incorporate additional instruction templates since it is naturally structured in the instruction format. The full list of instruction templates can be found in Appendix D.

2.2 Training and Evaluation Protocols

To ensure sufficient data and tasks for training and zero-shot evaluation, we divide the 26 datasets into 13 held-in datasets and 13 held-out datasets, indicated by yellow and white respectively in Figure 2. We employ the training sets of the held-in datasets for instruction tuning and their validation or test sets for held-in evaluation.

For held-out evaluation, our aim is to understand how instruction tuning improves the model’s zero-shot performance on unseen data. We define two types of held-out data: 1) datasets not exposed to the model during training, but whose tasks are present in the held-in cluster; 2) datasets and their associated tasks that remain entirely unseen during training. Addressing the first type of held-out evaluation is nontrivial due to the data distribution shift between held-in and held-out datasets. For the second type, we hold out several tasks completely, including visual reasoning, video question answering, visual conversational QA, and image classification.

To avoid data contamination, datasets are selected carefully so that no evaluation data appear in the held-in training cluster across different datasets. During instruction tuning, we mix all the held-in training sets and sample instruction templates uniformly for each dataset. The models are trained with the standard language modeling loss to directly generate the response given the instruction. Furthermore, for datasets that involve scene texts, we add OCR tokens in the instruction as supplementary information.
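The construction of a single training example described above (uniform template sampling, plus OCR tokens for scene-text datasets) can be sketched as follows. This is a minimal illustration, not the released implementation: the templates listed here are hypothetical stand-ins for those in Appendix D, `build_example` is a hypothetical helper, and the comma-separated OCR formatting is an assumption modeled on the TextVQA prompt in Appendix E.

```python
import random

# Hypothetical templates for a VQA-style dataset; the actual templates
# are listed in Appendix D and are not reproduced here.
VQA_TEMPLATES = [
    "Question: {question} Short answer:",
    "Given the image, answer the following question briefly. {question}",
    "{question} A short answer to the question is",
]


def build_example(question, answer, templates, ocr_tokens=None, rng=random):
    """Return an (instruction, target) pair for one training sample."""
    # Each template of the dataset is equally likely to be chosen.
    template = rng.choice(templates)
    instruction = template.format(question=question)
    if ocr_tokens:
        # Assumed formatting: OCR tokens are prepended as supplementary text.
        instruction = "OCR tokens: {}. ".format(", ".join(ocr_tokens)) + instruction
    # The model is trained with the language modeling loss to generate
    # `answer` given the image and `instruction`.
    return instruction, answer


instruction, target = build_example(
    "What does the sign say?", "stop", VQA_TEMPLATES, ocr_tokens=["STOP"]
)
```

Passing a seeded `random.Random` instance as `rng` makes the template choice reproducible.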

2.3 Instruction-aware Visual Feature Extraction

Existing zero-shot image-to-text generation methods, including BLIP-2, take an instruction-agnostic approach when extracting visual features. That results in a set of static visual representations being fed into the LLM, regardless of the task. In contrast, an instruction-aware vision model can adapt to the task instruction and produce visual representations most conducive to the task at hand. This is clearly advantageous if we expect the task instructions to vary considerably for the same input image.

We show the architecture of InstructBLIP in Figure 3. Similarly to BLIP-2, InstructBLIP utilizes a Query Transformer, or Q-Former, to extract visual features from a frozen image encoder. The input to the Q-Former contains a set of $K$ learnable query embeddings, which interact with the image encoder’s output through cross attention. The output of the Q-Former consists of $K$ encoded visual vectors, one per query embedding, which then go through a linear projection and are fed to the frozen LLM. As in BLIP-2, the Q-Former is pretrained in two stages using image-caption data before instruction tuning. The first stage pretrains the Q-Former with the frozen image encoder for vision-language representation learning. The second stage adapts the output of Q-Former as soft visual prompts for text generation with a frozen LLM. After pretraining, we finetune the Q-Former with instruction tuning, where the LLM receives as input the visual encodings from the Q-Former and the task instruction.

Extending BLIP-2, InstructBLIP proposes an instruction-aware Q-Former module, which takes in the instruction text tokens as additional input. The instruction interacts with the query embeddings through self-attention layers of the Q-Former, and encourages the extraction of task-relevant image features. As a result, the LLM receives visual information conducive to instruction following. We demonstrate empirically (Table 2) that instruction-aware visual feature extraction provides substantial performance improvements for both held-in and held-out evaluations.
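The information flow of one instruction-aware layer can be illustrated with a heavily simplified sketch. It assumes single-head dot-product attention with no learned projections, residual connections, normalization, or feed-forward sublayers, and the function names (`attend`, `instruction_aware_layer`) are hypothetical. It is meant only to show that the queries see the instruction in self-attention before cross-attending to the image features, so the same image yields different visual vectors under different instructions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention without learned projections."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)])
    return out


def instruction_aware_layer(query_embeds, instruction_embeds, image_feats):
    """Return K instruction-conditioned visual vectors (one per query)."""
    # Self-attention over the concatenation of queries and instruction tokens.
    joint = query_embeds + instruction_embeds
    joint = attend(joint, joint)
    # Only the K query positions go on to cross-attend to the image features.
    queries = joint[: len(query_embeds)]
    return attend(queries, image_feats)


# Toy inputs: K = 2 queries, 3 image patches, feature dimension 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out_a = instruction_aware_layer(queries, [[2.0, 0.0]], image)
out_b = instruction_aware_layer(queries, [[0.0, 2.0]], image)
```

In this toy setting `out_a` and `out_b` differ even though the image features are identical, which is the behavior that the instruction-agnostic approach cannot provide.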

2.4 Balancing Training Datasets

Due to the large number of training datasets and the significant differences in the size of each dataset, mixing them uniformly could cause the model to overfit smaller datasets and underfit larger datasets. To mitigate the problem, we propose to sample datasets with probabilities proportional to the square root of their sizes, or the numbers of training samples. Generally, given $D$ datasets with sizes $\{S_{1},S_{2},\dots,S_{D}\}$, the probability of a data sample being selected from a dataset $d$ during training is $p_{d}=\frac{\sqrt{S_{d}}}{\sum_{i=1}^{D}\sqrt{S_{i}}}$. On top of this formula, we make manual adjustments to the weights of certain datasets to improve optimization. This is warranted by inherent differences in the datasets and tasks that require varying levels of training intensity despite similar sizes. To be specific, we lower the weight of A-OKVQA, which features multiple-choice questions, and increase the weight of OKVQA, which requires open-ended text generation. In Table 2, we show that the balanced dataset sampling strategy improves overall performance for both held-in evaluation and held-out generalization.
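The square-root sampling formula can be checked with a short sketch. The dataset names and sizes below are hypothetical and chosen only to make the arithmetic easy to follow; the manual per-dataset weight adjustments are not modeled.

```python
import math


def sqrt_sampling_probs(sizes):
    """Map dataset name -> p_d = sqrt(S_d) / sum_i sqrt(S_i)."""
    roots = {name: math.sqrt(n) for name, n in sizes.items()}
    total = sum(roots.values())
    return {name: r / total for name, r in roots.items()}


# Hypothetical sizes: one dataset is 100x larger than the other.
sizes = {"small": 10_000, "large": 1_000_000}
probs = sqrt_sampling_probs(sizes)
# sqrt(10_000) = 100 and sqrt(1_000_000) = 1000, so
# probs["small"] = 100 / 1100 and probs["large"] = 1000 / 1100.
```

With these sizes, the larger dataset is sampled 10 times as often as the smaller one rather than 100 times as often, as it would be under size-proportional sampling.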

2.5 Inference Methods

During inference time, we adopt two slightly different generation approaches for evaluation on different datasets. For the majority of datasets, such as image captioning and open-ended VQA, the instruction-tuned model is directly prompted to generate responses, which are subsequently compared to the ground truth to calculate metrics. On the other hand, for classification and multi-choice VQA tasks, we employ a vocabulary ranking method following previous work. Specifically, we still prompt the model to generate answers, but restrict its vocabulary to a list of candidates. Then, we calculate log-likelihood for each candidate and select the one with the highest value as the final prediction. This ranking method is applied to ScienceQA, IconQA, A-OKVQA (multiple-choice), HatefulMemes, Visual Dialog, MSVD, and MSRVTT datasets. Furthermore, for binary classification, we expand the positive and negative labels into a slightly broader set of verbalizers to exploit word frequencies in natural text (e.g., yes and true for the positive class; no and false for the negative class).
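The vocabulary ranking step can be sketched as follows. The `log_likelihood` argument is a hypothetical callable standing in for the model's score of a candidate given the image and instruction; here it is replaced by a dictionary lookup with made-up values. How scores of several verbalizers are combined is not specified above, so the sketch assumes the class of the single best-scoring verbalizer is predicted.

```python
def rank_candidates(candidates, log_likelihood):
    """Return the candidate with the highest log-likelihood."""
    return max(candidates, key=log_likelihood)


# Verbalizers for binary classification, as in the example given above.
VERBALIZERS = {"positive": ["yes", "true"], "negative": ["no", "false"]}


def classify_binary(log_likelihood, verbalizers=VERBALIZERS):
    """Predict the class whose verbalizer scores highest (assumed rule)."""
    words = [w for ws in verbalizers.values() for w in ws]
    best = rank_candidates(words, log_likelihood)
    for label, ws in verbalizers.items():
        if best in ws:
            return label


# Made-up scores standing in for model log-likelihoods.
option_scores = {"blue": -1.2, "yellow": -0.3, "pink": -2.5, "black": -4.0}
answer = rank_candidates(list(option_scores), option_scores.get)

binary_scores = {"yes": -2.0, "true": -0.5, "no": -1.0, "false": -3.0}
label = classify_binary(binary_scores.get)
```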

For the video question-answering task, we utilize four uniformly-sampled frames per video. Each frame is processed by the image encoder and Q-Former individually, and the extracted visual features are concatenated before being fed into the LLM.
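A minimal sketch of the video handling is given below. The exact sampling scheme is not specified beyond "uniformly-sampled", so taking the midpoint of each of four equal segments is an assumption, and both function names are hypothetical. Each frame is assumed to have already been encoded into a list of visual vectors by the image encoder and Q-Former.

```python
def uniform_frame_indices(num_frames, num_samples=4):
    """Pick the midpoint frame of each of `num_samples` equal segments."""
    return [int((i + 0.5) * num_frames / num_samples) for i in range(num_samples)]


def concat_frame_features(per_frame_features):
    """Concatenate per-frame visual vectors along the sequence axis."""
    return [vec for frame in per_frame_features for vec in frame]


indices = uniform_frame_indices(100)  # [12, 37, 62, 87]

# Toy features: 4 frames, each encoded into 2 visual vectors of dimension 3.
frames = [[[float(f)] * 3, [float(f) + 0.5] * 3] for f in range(4)]
sequence = concat_frame_features(frames)  # 8 vectors fed to the LLM
```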

2.6 Implementation Details

Thanks to the flexibility enabled by the modular architectural design of BLIP-2, we can quickly adapt the model to a wide range of LLMs. In our experiments, we adopt four variations of BLIP-2 with the same image encoder (ViT-g/14) but different frozen LLMs, including FlanT5-XL (3B), FlanT5-XXL (11B), Vicuna-7B and Vicuna-13B. FlanT5 is an instruction-tuned model based on the encoder-decoder Transformer T5. Vicuna, on the other hand, is a recently released decoder-only Transformer instruction-tuned from LLaMA. During vision-language instruction tuning, we initialize the model from pre-trained BLIP-2 checkpoints, and only finetune the parameters of Q-Former while keeping both the image encoder and the LLM frozen. Since the original BLIP-2 models do not include checkpoints for Vicuna, we perform pre-training with Vicuna using the same procedure as BLIP-2.

Training and Hyper-parameters.

We use the LAVIS library for implementation, training, and evaluation. All models are instruction-tuned with a maximum of 60K steps and we validate the model’s performance every 3K steps. For each model, a single optimal checkpoint is selected and used for evaluations on all datasets. We employ a batch size of 192, 128, and 64 for the 3B, 7B, and 11/13B models, respectively. The AdamW optimizer is used, with $\beta_{1}=0.9$, $\beta_{2}=0.999$, and a weight decay of 0.05. Additionally, we apply a linear warmup of the learning rate during the initial 1,000 steps, increasing from $10^{-8}$ to $10^{-5}$, followed by a cosine decay with a minimum learning rate of 0. All models are trained utilizing 16 Nvidia A100 (40G) GPUs and are completed within 1.5 days.
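The learning-rate schedule described above can be written as a small function. This is a sketch rather than the LAVIS scheduler: it assumes the cosine decay runs from the end of warmup to the 60K-step maximum, and the function and parameter names are hypothetical.

```python
import math


def learning_rate(step, warmup_steps=1000, total_steps=60_000,
                  init_lr=1e-8, peak_lr=1e-5, min_lr=0.0):
    """Linear warmup from init_lr to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear interpolation between init_lr and peak_lr.
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    # Fraction of the decay phase completed, clipped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at each validation point (every 3K steps).
schedule = [(s, learning_rate(s)) for s in range(0, 60_001, 3000)]
```

Under these assumptions the rate starts at $10^{-8}$, reaches $10^{-5}$ at step 1,000, and falls to 0 at step 60,000.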

Experimental Results and Analysis

We first evaluate InstructBLIP models on the set of 13 held-out datasets with instructions provided in Appendix E. We compare InstructBLIP with the previous SOTA models BLIP-2 and Flamingo. As demonstrated in Table 1, we achieve new zero-shot SOTA results on all datasets. InstructBLIP consistently surpasses its original backbone, BLIP-2, by a significant margin across all LLMs, demonstrating the effectiveness of vision-language instruction tuning. For instance, InstructBLIP FlanT5XL yields an average relative improvement of 15.0% when compared to BLIP-2 FlanT5XL. Furthermore, instruction tuning boosts zero-shot generalization on unseen task categories such as video QA. InstructBLIP achieves up to 47.1% relative improvement on MSRVTT-QA over the previous SOTA despite having never been trained with temporal video data. Finally, our smallest InstructBLIP FlanT5XL with 4B parameters outperforms Flamingo-80B on all six shared evaluation datasets with an average relative improvement of 24.8%.

For the Visual Dialog dataset, we choose to report the Mean Reciprocal Rank (MRR) over the Normalized Discounted Cumulative Gain (NDCG) metric. This is because NDCG favors generic and uncertain answers while MRR prefers certain responses, making MRR better aligned with the zero-shot evaluation scenario.
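For reference, MRR averages the reciprocal of the rank at which the ground-truth answer appears among the ranked candidates. The sketch below uses hypothetical helper names and made-up scores; it assumes a candidate's rank is one plus the number of candidates with a strictly higher score.

```python
def rank_of(target_index, scores):
    """1-based rank of the target among candidates, by descending score."""
    return 1 + sum(1 for s in scores if s > scores[target_index])


def mean_reciprocal_rank(ranks):
    """Average of 1 / rank over all questions."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# Three questions whose ground-truth answers are ranked 1st, 2nd, and 4th.
mrr = mean_reciprocal_rank([1, 2, 4])  # (1 + 0.5 + 0.25) / 3

# Rank of candidate 2 given made-up candidate scores.
rank = rank_of(2, [-0.1, -3.0, -0.7, -2.2])  # one candidate scores higher
```

Because only the position of the single correct answer matters, a model that ranks it first with confidence is rewarded fully.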

3.2 Ablation Study on Instruction Tuning Techniques

To investigate the impact of the instruction-aware visual feature extraction (Section 2.3) and the balanced dataset sampling strategy (Section 2.4), we conduct ablation studies during the instruction tuning process. As illustrated in Table 2, the removal of instruction awareness in visual features degrades performance significantly across all datasets. The performance drop is more severe in datasets that involve spatial visual reasoning (e.g., ScienceQA) or temporal visual reasoning (e.g., iVQA), where the instruction input to the Q-Former can guide visual features to attend to informative image regions. The removal of the data balancing strategy causes unstable and uneven training, as different datasets achieve peak performance at drastically different training steps. The lack of synchronized progress over multiple datasets harms the overall performance.

3.3 Qualitative Evaluation

Besides the systematic evaluation on public benchmarks, we further qualitatively examine InstructBLIP with more diverse images and instructions. As illustrated in Figure 1, InstructBLIP demonstrates its capacity for complex visual reasoning. For example, it can reasonably infer from the visual scene what could have happened and deduce the type of disaster from the location of the scene, which it extrapolates based on visual evidence like the palm trees. Moreover, InstructBLIP is capable of connecting visual input with embedded textual knowledge and generating informative responses, such as introducing a famous painting. Furthermore, in descriptions of the overall atmosphere, InstructBLIP exhibits the ability to comprehend metaphorical implications of the visual imagery. Finally, we show that InstructBLIP can engage in multi-turn conversations, effectively considering the dialog history when making new responses.

In Appendix B, we qualitatively compare InstructBLIP with concurrent multimodal models (GPT-4, LLaVA, MiniGPT-4). Although all models are capable of generating long-form responses, InstructBLIP’s outputs generally contain more proper visual details and exhibit logically coherent reasoning steps. Importantly, we argue that long-form responses are not always preferable. For example, in Figure 2 of the Appendix, InstructBLIP directly addresses the user’s intent by adaptively adjusting the response length, while LLaVA and MiniGPT-4 generate long and less relevant sentences. These advantages of InstructBLIP are a result of the diverse instruction tuning data and an effective architectural design.

3.4 Instruction Tuning vs. Multitask Learning

A direct analogue to instruction tuning is multitask learning, a widely used method that involves the simultaneous training of multiple datasets with the goal of improving the performance of each individual dataset. To investigate whether the improvement in zero-shot generalization observed in instruction tuning is mainly from the formatting of instructions or merely from multitasking, we conduct a comparative analysis between these two approaches under identical training settings.

Following prior work, we consider two multitask training approaches. In the first approach, the model is trained using the vanilla input-output format of the training datasets without instructions. During evaluation, instructions are still provided to the model, indicating the specific task to be performed. However, an exception is made for image captioning, as the model achieves better scores when only receiving the image as input. For the second approach, we take a step towards instruction tuning by prepending a [Task:Dataset] identifier to the text input during training. For example, we prepend [Visual question answering:VQAv2] for the VQAv2 dataset. During evaluation, we explore both instructions and this identifier. Particularly, for the identifier of held-out datasets, we only use the task name since the model never sees the dataset name.
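The second multitask format can be illustrated with a short sketch. The helper name `multitask_input` is hypothetical, and the bracketed form used when only the task name is available (for held-out datasets) is an assumption, since only the [Task:Dataset] form is spelled out above.

```python
def multitask_input(text, task, dataset=None):
    """Prepend a [Task:Dataset] identifier, or [Task] if no dataset is given."""
    tag = f"[{task}:{dataset}]" if dataset else f"[{task}]"
    return f"{tag} {text}"


# Held-in dataset: both the task and the dataset name are known.
held_in = multitask_input(
    "What color is the car?", "Visual question answering", "VQAv2"
)

# Held-out dataset: only the task name is used (assumed bracket format).
held_out = multitask_input("What color is the car?", "Visual question answering")
```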

The results are shown in Figure 4, including BLIP-2 zero-shot, multitask training, and instruction tuning. All of these models are based on the BLIP-2 FlanT5XL backbone and adhere to the identical training configurations delineated in Section 2. Overall, we draw two insights from the results. Firstly, instruction tuning and multitask learning exhibit similar performance on the held-in datasets. This suggests that the model can fit these two different input patterns comparably well, as long as it has been trained with such data. On the other hand, instruction tuning yields a significant improvement over multitask learning on unseen held-out datasets, whereas multitask learning still performs on par with the original BLIP-2. This indicates that instruction tuning is the key to enhancing the model’s zero-shot generalization ability.

3.5 Finetuning InstructBLIP on Downstream Tasks

We further finetune the InstructBLIP models to investigate their performance when learning a specific dataset. Compared to most previous methods (e.g., Flamingo, BLIP-2), which increase the input image resolution and finetune the visual encoder on downstream tasks, InstructBLIP maintains the same image resolution (224 $\times$ 224) during instruction tuning and keeps the visual encoder frozen during finetuning. This significantly reduces the number of trainable parameters from 1.2B to 188M, thereby greatly improving finetuning efficiency.

The results are shown in Table 3. Compared to BLIP-2, InstructBLIP leads to better finetuning performance on all datasets, which validates InstructBLIP as a better weight initialization model for task-specific finetuning. InstructBLIP sets new state-of-the-art finetuning performance on ScienceQA (IMG), OCR-VQA, and A-OKVQA, but is outperformed on OKVQA by PaLM-E, which has 562B parameters.

Additionally, we observe that the FlanT5-based InstructBLIP is superior at multi-choice tasks, whereas Vicuna-based InstructBLIP is generally better at open-ended generation tasks. This disparity can be primarily attributed to the capabilities of their frozen LLMs, as they both employ the same image encoder. Although FlanT5 and Vicuna are both instruction-tuned LLMs, their instruction data significantly differ. FlanT5 is mainly finetuned on NLP benchmarks containing many multi-choice QA and classification datasets, while Vicuna is finetuned on open-ended instruction-following data.

Related Work

Instruction tuning aims to teach language models to follow natural language instructions, which has been shown to improve their generalization performance to unseen tasks. Some methods collect instruction tuning data by converting existing NLP datasets into instruction format using templates. Others use LLMs (e.g., GPT-3) to generate instruction data with improved diversity.

Instruction-tuned LLMs have been adapted for vision-to-language generation tasks by injecting visual information into the LLMs. BLIP-2 uses frozen FlanT5 models, and trains a Q-Former to extract visual features as input to the LLMs. MiniGPT-4 uses the same pretrained visual encoder and Q-Former from BLIP-2, but uses Vicuna as the LLM and performs training using ChatGPT-generated image captions longer than the BLIP-2 training data. LLaVA directly projects the output of a visual encoder as input to a LLaMA/Vicuna LLM, and finetunes the LLM on vision-language conversational data generated by GPT-4. mPLUG-owl performs low-rank adaptation to a LLaMA model using both text instruction data and vision-language instruction data from LLaVA. A separate work is MultiInstruct, which performs vision-language instruction tuning without a pretrained LLM, leading to less competitive performance.

Compared to existing methods, InstructBLIP uses a much wider range of vision-language instruction data, covering both template-based converted data and LLM-generated data. Architecture wise, InstructBLIP proposes an instruction-aware visual feature extraction mechanism. Furthermore, our paper provides a comprehensive analysis on various aspects of vision-language instruction tuning, validating its advantages on generalizing to unseen tasks.

Conclusion

In this paper, we present InstructBLIP, a simple yet novel instruction tuning framework towards generalized vision-language models. We perform a comprehensive study on vision-language instruction tuning and demonstrate the capability of InstructBLIP models to generalize to a wide range of unseen tasks with state-of-the-art performance. Qualitative examples also exhibit InstructBLIP’s various capabilities on instruction following, such as complex visual reasoning, knowledge-grounded image description, and multi-turn conversations. Furthermore, we show that InstructBLIP can serve as an enhanced model initialization for downstream task finetuning, achieving state-of-the-art results. We hope that InstructBLIP can spur new research in general-purpose multimodal AI and its applications.

Appendix A Broader Impact

InstructBLIP uses off-the-shelf frozen LLMs. Therefore it inherits some of the shortcomings from the original LLMs, such as hallucinating ungrounded text or generating outputs with bias. We mitigate such shortcomings by improving the model’s grounding on the vision and instruction input, and performing vision-language instruction tuning on a diverse set of high-quality datasets. Nevertheless, we do not recommend applying InstructBLIP models to any downstream applications without a prior assessment on safety and fairness specific to that application.

Appendix B More Case Studies

Appendix C Instruction Tuning Datasets

Appendix D Instruction Templates

Appendix E Instructions for Zero-shot Inference

We provide the instructions used for zero-shot inference. Note that for instructions with options, we label the options in alphabetical order and separate them with spaces, e.g. (a) blue (b) yellow (c) pink (d) black.

NoCaps, Flickr30k

TextVQA

OCR tokens: {}. Question: {} Short answer:

IconQA

Question: {} Options: {}. Short answer:

ScienceQA

Context: {} Question: {} Options: {}. Answer:

HatefulMemes

This is an image with: "{}" written on it. Is it hateful? Answer:

VSR

Based on the image, is this statement true or false? "{}" Answer:

Visual Dialog

Dialog history: {}\n Question: {} Short answer:
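As an illustration of how the templates above might be filled, the sketch below formats a list of options with alphabetical labels and inserts them into the IconQA template. The helper name `format_options` and the example question are hypothetical; only the template string and the option style are taken from this appendix.

```python
import string


def format_options(options):
    """Label options alphabetically: (a) first (b) second ..."""
    return " ".join(
        f"({letter}) {opt}" for letter, opt in zip(string.ascii_lowercase, options)
    )


# Template copied from the IconQA entry above.
ICONQA_TEMPLATE = "Question: {} Options: {}. Short answer:"

options = format_options(["blue", "yellow", "pink", "black"])
prompt = ICONQA_TEMPLATE.format("What color is the ball?", options)
```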