Improved Baselines with Visual Instruction Tuning

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Introduction

Large multimodal models (LMMs) have become increasingly popular in the research community, as they are the key building blocks towards general-purpose assistants. Recent studies on LMMs are converging on a central concept known as visual instruction tuning. The results are promising, e.g. LLaVA and MiniGPT-4 demonstrate impressive results on natural instruction-following and visual reasoning capabilities. To better understand the capability of LMMs, multiple benchmarks have been proposed. Recent works further demonstrate improved performance by scaling up the pretraining data, instruction-following data, visual encoders, or language models, respectively. The LLaVA architecture is also leveraged in different downstream tasks and domains, including region-level and pixel-level understanding, biomedical assistants, image generation, and adversarial studies.

This note establishes stronger and more feasible baselines built upon the LLaVA framework. We report that two simple improvements, namely, an MLP cross-modal connector and incorporating academic task related data such as VQA, are orthogonal to the framework of LLaVA, and when used with LLaVA, lead to better multimodal understanding capabilities. In contrast to InstructBLIP or Qwen-VL, which train specially designed visual resamplers on hundreds of millions or even billions of image-text pairs, LLaVA uses the simplest architecture design for LMMs and requires only training a simple fully-connected projection layer on merely 600K image-text pairs. Our final model can finish training in $\sim$ 1 day on a single 8-A100 machine and achieves state-of-the-art results on a wide range of benchmarks. Moreover, unlike Qwen-VL, which includes in-house data in training, LLaVA utilizes only publicly available data. We hope these improved and easily-reproducible baselines will provide a reference for future research in open-source LMMs.

Background

Common architectures include a pre-trained visual backbone to encode visual features, a pre-trained large language model (LLM) to comprehend the user instructions and produce responses, and a vision-language cross-modal connector to align the vision encoder outputs to the language models. As shown in Fig. 1, LLaVA is perhaps the simplest architecture for LMMs. Optionally, visual resamplers (e.g. Qformer) are used to reduce the number of visual patches. Training an instruction-following LMM usually follows a two-stage protocol. First, the vision-language alignment pretraining stage leverages image-text pairs to align the visual features with the language model’s word embedding space. Earlier works utilize relatively few image-text pairs (e.g. $\sim$ 600K or $\sim$ 6M), while some recent works pretrain the vision-language connector for a specific language model on a large amount of image-text pairs (e.g. 129M and 1.4B), to maximize the LMM’s performance. Second, the visual instruction tuning stage tunes the model on visual instructions, to enable the model to follow users’ diverse requests on instructions that involve the visual contents.

Multimodal instruction-following data.

In NLP, studies show that the quality of instruction-following data largely affects the capability of the resulting instruction-following models. For visual instruction tuning, LLaVA is the pioneer to leverage text-only GPT-4 to expand the existing COCO bounding box and caption dataset to a multimodal instruction-following dataset that contains three types of instruction-following data: conversational-style QA, detailed description, and complex reasoning. LLaVA’s pipeline has been employed to expand to textual understanding, million-scale data, and region-level conversations. InstructBLIP incorporates academic-task-oriented VQA datasets to further enhance the model’s visual capabilities. Conversely, another study identifies that such naive data merging can result in models that tend to overfit to VQA datasets and are thus unable to participate in natural conversations. Its authors further propose to leverage the LLaVA pipeline to convert VQA datasets to a conversational style. While this proves effective for training, it introduces added complexities in data scaling.

Improved Baselines of LLaVA

As the initial work of visual instruction tuning, LLaVA has showcased commendable proficiency in visual reasoning capabilities, surpassing even more recent models on diverse benchmarks for real-life visual instruction-following tasks, while only falling short on academic benchmarks that typically require short-form answers (e.g. single-word). The latter was attributed to the fact that LLaVA has not been pretrained on large-scale data, as other approaches do. In this note, we first study the scaling effect of data, models and input image resolution on a selection of three datasets in Table 1, and then compare the final model against existing LMMs on a diverse set of 12 benchmarks in Table 2. We show that LLaVA’s architecture is powerful and data-efficient for visual instruction tuning, and achieves the best performance using significantly less compute and training data than all other methods.

Response formatting prompts.

We find that the inability to balance between short- and long-form VQA for approaches like InstructBLIP is mainly due to the following reasons. First, ambiguous prompts on the response format. For example, Q: {Question} A: {Answer}. Such prompts do not clearly indicate the desirable output format, and can overfit an LLM behaviorally to short-form answers even for natural visual conversations. Second, not finetuning the LLM. The first issue is worsened by InstructBLIP only finetuning the Qformer for instruction-tuning. It requires the Qformer’s visual output tokens to control the length of the LLM’s output to be either long-form or short-form, as in prefix tuning, but Qformer may lack the capability of properly doing so, due to its limited capacity compared with LLMs like LLaMA. See Table 6 for a qualitative example.

To address this, we propose to use a single response formatting prompt that clearly indicates the output format, to be appended at the end of VQA questions when prompting for short answers: Answer the question using a single word or phrase. We empirically show that when the LLM is finetuned with such prompts, LLaVA is able to properly adjust the output format according to the user’s instructions, and does not require additional processing of the VQA data using ChatGPT, which further enables scaling to various data sources. As shown in Table 1, by merely including VQAv2 in training, LLaVA’s performance on MME significantly improves (1323.8 vs 502.8) and outperforms InstructBLIP by 111 points.
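As a minimal sketch of the idea above, the format prompt can be appended to a VQA question only when a short answer is wanted. The function name `with_format_prompt` and the newline separator are assumptions made for illustration; only the prompt string itself comes from the text.

```python
# Prompt string quoted from the text; everything else here is illustrative.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."


def with_format_prompt(question: str, short_answer: bool) -> str:
    """Append the response formatting prompt to a VQA question.

    The newline separator is an assumption, not taken from the paper.
    """
    question = question.strip()
    if not short_answer:
        # Leave natural conversational questions untouched.
        return question
    return f"{question}\n{SHORT_ANSWER_PROMPT}"


if __name__ == "__main__":
    print(with_format_prompt("What color is the bus?", short_answer=True))
    print(with_format_prompt("Describe the image in detail.", short_answer=False))
```

Because the instruction is plain text inside the question, the same data pipeline can serve short-form VQA and free-form conversation without a separate rewriting step.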

MLP vision-language connector.

Inspired by the improved performance in self-supervised learning by changing from a linear projection to an MLP, we find that improving the vision-language connector’s representation power with a two-layer MLP can improve LLaVA’s multimodal capabilities, compared with the original linear projection design.
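The following is a rough, dependency-free sketch of a two-layer MLP connector applied to each visual patch feature. The GELU activation, the hidden width (set equal to the LLM embedding width), the initialization, and all function names are assumptions for illustration; the text only states that the connector is a two-layer MLP. A real implementation would use a tensor library rather than Python lists.

```python
import math
import random


def gelu(x: float) -> float:
    # Exact GELU via the error function; using GELU at all is an assumption.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    # x: d_in floats; weight: d_out rows of d_in floats; bias: d_out floats.
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_layer(d_in, d_out, rng):
    # Small Gaussian weights and zero bias (illustrative initialization).
    weight = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(d_out)]
    return weight, [0.0] * d_out


def mlp_connector(patch_features, layer1, layer2):
    """Project each visual patch feature into the LLM embedding space."""
    tokens = []
    for feat in patch_features:
        hidden = [gelu(h) for h in linear(feat, *layer1)]
        tokens.append(linear(hidden, *layer2))
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    d_vision, d_llm, num_patches = 8, 16, 4  # toy sizes, not the real model's
    layer1 = init_layer(d_vision, d_llm, rng)
    layer2 = init_layer(d_llm, d_llm, rng)
    patches = [[rng.gauss(0.0, 1.0) for _ in range(d_vision)] for _ in range(num_patches)]
    tokens = mlp_connector(patches, layer1, layer2)
    print(len(tokens), len(tokens[0]))  # prints "4 16"
```

Dropping the activation and the second layer recovers the original linear projection, which makes the two designs easy to compare in an ablation.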

Academic task oriented data.

We further include additional academic-task-oriented VQA datasets for VQA, OCR, and region-level perception, to enhance the model’s capabilities in various ways, as shown in Table 1. We first include four additional datasets that are used in InstructBLIP: open-knowledge VQA (OKVQA, A-OKVQA) and OCR (OCRVQA, TextCaps). A-OKVQA is converted to multiple choice questions and a specific response formatting prompt is used: Answer with the option’s letter from the given choices directly. With only a subset of the datasets InstructBLIP uses, LLaVA already surpasses it on all three tasks in Table 1, suggesting LLaVA’s effective design. We also find that adding region-level VQA datasets (Visual Genome, RefCOCO) improves the model’s capability of localizing fine-grained visual details.
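To illustrate the multiple-choice conversion, here is a small sketch that lays out the options with letter labels and appends the response formatting prompt. The "A. choice" line layout and the function name `format_multiple_choice` are assumptions; the text specifies only the prompt sentence.

```python
from string import ascii_uppercase

# Prompt sentence taken from the text (written with a straight apostrophe).
MC_PROMPT = "Answer with the option's letter from the given choices directly."


def format_multiple_choice(question, choices, answer_index):
    """Return (prompt, target_letter) for one multiple-choice sample.

    The "A. choice" layout is an assumption, not taken from the paper.
    """
    if not 0 <= answer_index < len(choices) <= len(ascii_uppercase):
        raise ValueError("answer_index out of range or too many choices")
    option_lines = [f"{ascii_uppercase[i]}. {c}" for i, c in enumerate(choices)]
    prompt = "\n".join([question.strip(), *option_lines, MC_PROMPT])
    return prompt, ascii_uppercase[answer_index]


if __name__ == "__main__":
    prompt, target = format_multiple_choice(
        "What is the animal in the image?", ["cat", "dog", "horse"], 1
    )
    print(prompt)
    print("Target:", target)
```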

Additional scaling.

We further scale up the input image resolution to allow the LLM to clearly “see” the details of images, and add the GQA dataset as an additional visual knowledge source. We also incorporate ShareGPT data and scale up the LLM to 13B. Results on MM-Vet show the most significant improvement when scaling the LLM to 13B, suggesting the importance of the base LLM’s capability for visual conversations. We denote the final model with all the modifications as LLaVA-1.5 (the last two rows in Table 1), which significantly outperforms the original LLaVA.

Discussion

We benchmark LLaVA-1.5 on a wide range of academic VQA benchmarks and recent benchmarks specifically proposed for instruction-following LMMs, totalling 12 benchmarks. We show that it achieves the best performance across 11 out of 12 benchmarks, despite using orders of magnitude less pretraining and instruction tuning data than other methods. It is encouraging that LLaVA-1.5 achieves the best performance with the simplest architecture, academic compute and public datasets, and yields a fully-reproducible and affordable baseline for future research. The results also suggest that visual instruction tuning plays a more important role in improving an LMM’s capabilities than pretraining, and raise questions about the common belief that LMMs require a significant amount of vision-language alignment pretraining, even though the vision encoders (e.g. CLIP, OpenCLIP, EVA-CLIP, etc.) are already pretrained on web-scale image-text paired datasets. LLaVA-1.5 (even the 7B model) outperforms 80B IDEFICS, a Flamingo-like LMM with billions of trainable parameters for cross-modal connection. This also makes us rethink the benefits of the vision samplers and the necessity of the additional large-scale pretraining, in terms of multimodal instruction-following capabilities.

Zero-shot format instruction generalization.

Although LLaVA-1.5 is only trained with a limited number of format instructions, it generalizes to others. First, VizWiz requires the model to output “Unanswerable” when the provided content is insufficient to answer the question, and our response format prompt (Table 8) effectively instructs the model to do so (11.1% $\rightarrow$ 67.8% on unanswerable questions). We additionally present qualitative examples on instructing LLaVA-1.5 to verify the tricky questions (Fig. 3) and respond in a constrained JSON format (Fig. 4).

Zero-shot multilingual capability.

Though LLaVA-1.5 is not finetuned for multilingual multimodal instruction following at all, we find that it is capable of following multilingual instructions, partly due to the multilingual language instructions in ShareGPT. We quantitatively evaluate the model’s generalization capability to Chinese on MMBench-CN, where the questions of MMBench are converted to Chinese. Notably, LLaVA-1.5 outperforms Qwen-VL-Chat by 6.9 percentage points (63.6% vs 56.7%), despite Qwen being finetuned on Chinese multimodal instructions while LLaVA-1.5 is not.

Computational cost.

For LLaVA-1.5, we use the same pretraining dataset, LCS-558K (a subset of $\sim$ 558K image-text pairs from LAION-CC-SBU with BLIP captions, as used in the LLaVA-Lightning series), and keep the training iterations and batch size roughly the same for instruction tuning as LLaVA. Due to the increased image input resolution to 336px, the training of LLaVA-1.5 is $\sim$ 2 $\times$ as long as LLaVA: $\sim$ 6 hours of pretraining and $\sim$ 20 hours of visual instruction tuning, using 8 $\times$ A100s.
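A back-of-the-envelope check of these figures is sketched below. It assumes a ViT-style encoder with 14-pixel patches and a 224px input for the original LLaVA; neither value is stated in this section, so both are labeled as assumptions. Under those assumptions the number of visual tokens per image grows by 2.25 times, which is in line with the reported roughly 2 times longer training.

```python
def num_visual_tokens(image_size: int, patch_size: int) -> int:
    """Number of patch tokens for a square image with non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    return (image_size // patch_size) ** 2


PATCH_SIZE = 14               # assumption: ViT-L/14-style vision encoder
OLD_RES, NEW_RES = 224, 336   # 224px for the original LLaVA is an assumption

old_tokens = num_visual_tokens(OLD_RES, PATCH_SIZE)
new_tokens = num_visual_tokens(NEW_RES, PATCH_SIZE)
token_ratio = new_tokens / old_tokens

# Figures quoted in the text: ~6 h pretraining + ~20 h instruction tuning on 8 GPUs.
gpu_hours = (6 + 20) * 8

print(old_tokens, new_tokens, token_ratio, gpu_hours)  # prints "256 576 2.25 208"
```

The total of about 26 wall-clock hours matches the "$\sim$ 1 day on a single 8-A100 machine" figure given in the introduction.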

Limitations.

Despite the promising results demonstrated by LLaVA-1.5, several limitations must be acknowledged. First, LLaVA utilizes full image patches, potentially prolonging each training iteration. While visual resamplers reduce the number of visual patches in LLMs, they currently cannot achieve convergence as efficiently as LLaVA with a comparable amount of training data, probably due to more trainable parameters in the resamplers. The development of a sample-efficient visual resampler could pave the way for future scaling-up of instruction-following multimodal models. Second, LLaVA-1.5 is not yet capable of processing multiple images due to the lack of such instruction-following data, and the limit of the context length. Third, although LLaVA-1.5 exhibits proficiency in following complex instructions, its problem-solving capabilities can still be limited in certain domains, which could be improved with a more capable language model and with high-quality, targeted visual instruction tuning data. Finally, despite its significantly reduced propensity for hallucination, LLaVA is not exempt from producing hallucinations and occasionally disseminating misinformation, and should be used with caution in critical applications (e.g. medical).

Acknowledgements.

This work was supported in part by NSF CAREER IIS2150012, and Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration) and (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training).

Appendix

Our final training data mixture contains a variety of datasets: VQA, OCR, region-level VQA, visual conversation and language conversation data. We adopt multiple strategies to reduce training cost and enhance efficiency, detailed as follows:

For all VQA datasets, QA pairs from the same training image are merged into a single conversation.

For ShareGPT, we filter out invalid conversations. Unlike Vicuna, long conversations that surpass 2048 tokens are truncated rather than split into multiple conversations. This results in $\sim$ 40K conversations.

Each QA pair in A-OKVQA is augmented $k$ times, where $k$ is the number of choices per question, to counterbalance the lack of multiple-choice data.

80K conversations are sampled from OCRVQA.

For Visual Genome, we sample 10 annotations for images with additional annotations.

For RefCOCO, conversations are dissected into segments, each containing fewer than 10 conversations.

We observe that language conversations are often longer than visual ones. For each batch, we sample conversations only from a single modality, which speeds up training by 25% without affecting the final outcome.
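One way to form single-modality batches is sketched below: samples are grouped by modality, each group is cut into batches, and the batches are then shuffled together. The sample schema (a dict with a `"modality"` key), the function name, and the shuffling scheme are assumptions for illustration. A plausible reading of the speed-up is that batches of similar-length samples waste less padding, although the text does not state the cause.

```python
import random


def single_modality_batches(samples, batch_size, seed=0):
    """Return a shuffled list of batches whose samples all share one modality.

    `samples` is a list of dicts with a "modality" key, e.g. "visual" or
    "language" (hypothetical schema).
    """
    rng = random.Random(seed)
    by_modality = {}
    for sample in samples:
        by_modality.setdefault(sample["modality"], []).append(sample)

    batches = []
    for group in by_modality.values():
        rng.shuffle(group)
        for start in range(0, len(group), batch_size):
            batches.append(group[start:start + batch_size])

    # Interleave visual and language batches across the epoch.
    rng.shuffle(batches)
    return batches


if __name__ == "__main__":
    data = [{"modality": "visual", "id": i} for i in range(5)]
    data += [{"modality": "language", "id": i} for i in range(3)]
    for batch in single_modality_batches(data, batch_size=2):
        print([(s["modality"], s["id"]) for s in batch])
```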

All data splits are concatenated together and sampled with the same probability. We present the response formatting prompts of the final instruction-following data mixtures in Table 7 and the response format prompts used for each evaluation benchmark in Table 8.
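Three of the preprocessing strategies listed above (merging QA pairs per image, truncating long conversations, and segmenting long region-level conversations) can be sketched as follows. The record schema, the function names, and the reading of "fewer than 10 conversations" as at most 9 QA turns per segment are assumptions; tokens are represented as a plain list because the tokenizer is not specified here.

```python
from collections import defaultdict


def merge_by_image(qa_pairs):
    """Merge QA pairs that share an image into one multi-turn conversation."""
    merged = defaultdict(list)
    for pair in qa_pairs:
        merged[pair["image"]].append((pair["question"], pair["answer"]))
    return dict(merged)


def truncate_tokens(tokens, max_len=2048):
    """Keep the first `max_len` tokens and drop the rest (no splitting)."""
    return tokens[:max_len]


def chunk_turns(turns, max_turns=9):
    """Split a long conversation into segments of at most `max_turns` turns.

    max_turns=9 reflects "fewer than 10"; treating each unit as one QA turn
    is an interpretation, not something the text spells out.
    """
    return [turns[i:i + max_turns] for i in range(0, len(turns), max_turns)]


if __name__ == "__main__":
    pairs = [
        {"image": "a.jpg", "question": "q1", "answer": "a1"},
        {"image": "b.jpg", "question": "q2", "answer": "a2"},
        {"image": "a.jpg", "question": "q3", "answer": "a3"},
    ]
    print(merge_by_image(pairs))
    print(len(truncate_tokens(list(range(3000)))))
    print([len(c) for c in chunk_turns(list(range(20)))])
```

Merging QA pairs means each image is encoded once per conversation rather than once per question, which is one way such a step can reduce training cost.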

Hyperparameters.

LLaVA-1.5 uses the same set of hyperparameters as the original LLaVA, except that we halve the learning rate in pretraining due to the usage of the MLP projection layer instead of the original linear projection layer design. We show the training hyperparameters for both first-stage vision-language alignment pretraining and the second-stage visual instruction tuning in Table 5.