OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, Jenia Jitsev, Simon Kornblith, Pang Wei Koh, Gabriel Ilharco, Mitchell Wortsman, Ludwig Schmidt

cs.CV cs.AI cs.LG

Introduction

A popular format for vision and language models is (image, text) $\rightarrow$ text, i.e., models take as input an image and some text, and produce text as output, e.g., BLIP-2. This flexible format directly supports tasks like image classification and visual question answering (VQA).

However, assuming a single image as input is limiting: autoregressive vision-language models enable new capabilities by instead mapping an arbitrarily interleaved sequence of images and text to textual outputs. This interface provides important flexibility: the input sequence can include demonstrations for a new task, enabling few-shot, in-context learning or multi-round multi-modal chatbot interactions. Evaluations suggest that autoregressive vision-language models can be performant foundation models: models like Flamingo, CM3, Kosmos-1, PaLM-E, and multimodal GPT-4 generalize well across diverse vision-language tasks.

Unfortunately, these autoregressive vision-language models are closed-source, and their weights, training data, code, and hyperparameters are proprietary. This limits the academic community’s ability to conduct research on autoregressive vision-language models, e.g., to understand how web-scraped image-text data affects models’ performance and safety. Open-source alternatives, such as LLaVA, LLaMA-Adapter, BLIP-2, and mPLUG-Owl, only take in single images, and they often directly train on curated datasets like COCO rather than web data.

In this technical report, we document our experiences building an open-source reproduction of the Flamingo models. Following Flamingo, we augment the layers of pretrained, frozen language models so that they cross-attend to the outputs of a frozen vision encoder while predicting the next token. The cross-modal module is trained on web-scraped image-text sequences, in our case, two open-source datasets: LAION-2B and Multimodal C4. Our stack is built using publicly available components, including CLIP as a vision encoder and open-source language models as decoders.

We call the resulting family of five models OpenFlamingo. These models range from 3B to 9B parameters, with both standard and instruction-tuned language model backbones. When averaging performance across 7 evaluation datasets, OpenFlamingo-3B and -9B models attain 85% and 89% of their corresponding Flamingo models respectively (Figure 1). Models and code are open-sourced at https://github.com/mlfoundations/open_flamingo.

Related work

Generative vision-language models output text conditioned on an image-text sequence. While many of these architectures, such as BLIP-2 and LLaVA, can incorporate only one image in their context, autoregressive vision-language models accept interleaved image-text sequences, enabling in-context learning.

We chose to replicate Flamingo because of its strong in-context learning abilities. Aggregated across evaluation sets, Flamingo models see steady performance improvements up to 32 in-context examples. This is in contrast with other autoregressive vision-language models such as Kosmos-1: on the captioning tasks COCO and Flickr-30K, Kosmos-1 shows performance improvements up to 4 in-context examples, but performance degrades when using 8 in-context examples.

Proprietary autoregressive vision-language models are typically trained on closed-source datasets. For example, Flamingo relies on image-text pairs from the ALIGN dataset and interleaved image-text sequences from the M3W dataset; both are unavailable to the public. Recent efforts to replicate these web-scraped datasets include LAION-2B, a dataset of image-text pairs, and Multimodal C4 and OBELISC, datasets of image-text sequences. We use LAION-2B and Multimodal C4 for training OpenFlamingo models. Laurençon et al. also train 9B and 80B Flamingo-style models; their models differ in the choice of pretraining dataset (OBELISC instead of Multimodal C4) and language model (LLaMA-9B instead of the MPT and RedPajama-3B models).

Approach

We match the Flamingo architecture. Given an interleaved sequence of images with text tokens, OpenFlamingo models predict the next text token conditioned on all previous text tokens and the last preceding image. Text tokens attend to their corresponding images via dense cross-attention modules, which we attach to the layers of a frozen, autoregressive language model. To embed images, we extract patch features from a frozen vision encoder and pass these through a trainable Perceiver resampler.
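The conditioning rule above can be illustrated with a minimal sketch. It assumes the interleaved sequence is given as a list of position tags ("image" or "text"), which is a hypothetical representation chosen for illustration and not the OpenFlamingo implementation. For each position, the function returns which image (counted in order of appearance) a text token at that position would cross-attend to.

```python
from typing import List, Optional


def last_preceding_image(kinds: List[str]) -> List[Optional[int]]:
    """For each position, return the index (among images, in order of
    appearance) of the most recent image at or before that position,
    or None if no image has appeared yet."""
    result: List[Optional[int]] = []
    current: Optional[int] = None
    n_images = 0
    for kind in kinds:
        if kind == "image":
            current = n_images
            n_images += 1
        result.append(current)
    return result


# Hypothetical toy sequence: image, two text tokens, image, one text token.
kinds = ["image", "text", "text", "image", "text"]
print(last_preceding_image(kinds))  # [0, 0, 0, 1, 1]
```

In this sketch, text that appears before any image maps to None, i.e., it has no image to attend to.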

As a preprocessing step, we first mark the locations of images in the text sequence with <image> tokens. We also insert <|endofchunk|> tokens after the text tokens following an image; e.g., the sequence $x$ Hello world, where $x$ is an image, would be preprocessed into <image> Hello world <|endofchunk|>.
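A minimal sketch of this preprocessing step follows. The `Image` class is a hypothetical stand-in for an image object, the placeholder string `<image>` is assumed as the image marker, and joining pieces with single spaces is an assumption about whitespace; the actual tokenization in the released code may differ.

```python
from typing import List, Union

IMAGE_TOKEN = "<image>"            # assumed marker string for an image
END_OF_CHUNK = "<|endofchunk|>"    # closes the text that follows an image


class Image:
    """Hypothetical stand-in for an image object."""


def preprocess(sequence: List[Union[Image, str]]) -> str:
    """Replace each image with a marker token and close the text that
    follows an image with an end-of-chunk token."""
    parts: List[str] = []
    seen_image = False
    text_after_image = False
    for item in sequence:
        if isinstance(item, Image):
            if text_after_image:
                parts.append(END_OF_CHUNK)
            parts.append(IMAGE_TOKEN)
            seen_image = True
            text_after_image = False
        else:
            parts.append(item)
            text_after_image = seen_image
    if text_after_image:
        parts.append(END_OF_CHUNK)
    return " ".join(parts)


print(preprocess([Image(), "Hello world"]))
# <image> Hello world <|endofchunk|>
```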

Unlike Flamingo, we do not support video inputs at this time. We leave this for future work.

Table 1 describes the five OpenFlamingo models based on their language model and density of cross-attention layers; all models use CLIP ViT-L/14 as a vision encoder. In most cases, the <image> and <|endofchunk|> embeddings are trainable, while other text embeddings are frozen. For the OpenFlamingo-4B models, all embeddings are frozen, including the randomly initialized <image> and <|endofchunk|> embeddings. This was due to complications with gradient masking when using Fully Sharded Data Parallel (§3.3).

3.2 Training data

We train our models on a mixture of image-text pairs and interleaved image-text sequences. During training, we sample dataset shards with replacement using the WebDataset format.

When training Flamingo, Alayrac et al. use ALIGN, a closed-source dataset of over 1B single images paired with short alt-text captions. To train OpenFlamingo, we replace ALIGN with LAION-2B, an open-source web-scraped dataset consisting of 2B image-text pairs (Figure 3A). We use part of the English subset and truncate captions to 32 tokens. All image-text pairs in LAION-2B have a cosine similarity of at least 0.28 according to CLIP ViT-B/32.

Multimodal C4 [45].

In addition to image-text pairs, Alayrac et al. train Flamingo using M3W, an internal web-scraped dataset of 43M interleaved image-text sequences. We replace M3W with Multimodal C4 (MMC4), an open-source dataset of 101M interleaved samples (Figure 3B). Unlike M3W or OBELISC, which directly parse HTML documents to extract multimodal sequences, MMC4 uses CLIP to soft align images with sentences in a document. To ensure data quality, we exclude images if their cosine similarity with the subsequent text falls below 0.24, according to CLIP ViT-L/14. Sequences contain between 1 and 6 images (median 2). To encourage learning from sequences with multiple images, we reject single-image sequences with probability $0.5$. The resulting distribution is shown in Figure 4. Additional notes on MMC4 filtering are in Appendix B.
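The filtering rules above can be sketched as follows. This is a minimal illustration under stated assumptions: similarities are assumed to be precomputed per image, the function name and return convention are hypothetical, and applying the single-image rejection after similarity filtering (rather than before) is our assumption for the sketch. Discarding sequences in which no image survives follows the description in Appendix B.

```python
import random
from typing import List, Optional

SIM_THRESHOLD = 0.24          # CLIP ViT-L/14 similarity threshold from the text
SINGLE_IMAGE_REJECT_P = 0.5   # rejection probability for single-image sequences


def filter_sequence(image_sims: List[float],
                    rng: random.Random) -> Optional[List[int]]:
    """Return the indices of images to keep, or None if the whole
    sequence is dropped. `image_sims[i]` is the (precomputed) cosine
    similarity between image i and its matched text."""
    kept = [i for i, sim in enumerate(image_sims) if sim >= SIM_THRESHOLD]
    if not kept:
        return None  # no image survives filtering: discard the sequence
    if len(kept) == 1 and rng.random() < SINGLE_IMAGE_REJECT_P:
        return None  # down-weight single-image sequences
    return kept


rng = random.Random(0)
print(filter_sequence([0.30, 0.10, 0.25], rng))  # [0, 2]
```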

Synthetic data.

For the OpenFlamingo-4B models, we also experimented with training on ChatGPT-generated synthetic data (Figure 3C). These 417K image-text sequences were generated by prompting ChatGPT to generate a sequence of interleaved text and image alt-texts (in place of images). The alt-texts are used to retrieve corresponding images from LAION-5B. Additional details of the prompting and data construction process are described in Appendix C. The median number of images per sequence is higher than in MMC4, while the median number of text tokens is lower (Table 2). We release these sequences through the OpenFlamingo repository.

3.3 Training details

OpenFlamingo models were trained for 60M interleaved (MMC4) examples and 120M LAION-2B examples. (OpenFlamingo-4B models use both MMC4 and ChatGPT-generated data as interleaved sequences; for these models, 60M interleaved examples translates to approximately 240K ChatGPT-generated sequences and 59.8M MMC4 sequences. Other models train on 60M MMC4 examples.) All models are trained using the next-token prediction objective and optimized with AdamW. The learning rate is linearly increased at the beginning of training, and then held constant at 1e-4 for the remainder of training. We apply weight decay of 0.1 on the dense cross-attention layers. The batch size for LAION-2B is twice the batch size of the interleaved dataset (MMC4, optionally with ChatGPT-generated sequences), and the loss weights are set to Flamingo defaults of $1$ and $0.2$ for MMC4 and LAION-2B respectively. We accumulate gradients over both datasets between optimizer steps.
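As a minimal sketch of the schedule and loss weighting described above: the function names are hypothetical, and `warmup_steps` is an assumed parameter, since the number of warmup steps is not stated in this section. Summing the two weighted losses into a single objective is how we read "accumulate gradients over both datasets", stated here as an assumption.

```python
def learning_rate(step: int, warmup_steps: int, peak_lr: float = 1e-4) -> float:
    """Linear warmup to `peak_lr`, then constant.
    `warmup_steps` is a hypothetical parameter, not a reported value."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def combined_loss(mmc4_loss: float, laion_loss: float,
                  mmc4_weight: float = 1.0, laion_weight: float = 0.2) -> float:
    """Weighted sum of the per-dataset next-token prediction losses,
    using the loss weights given in the text."""
    return mmc4_weight * mmc4_loss + laion_weight * laion_loss


# Toy values for illustration only.
print(learning_rate(step=10_000, warmup_steps=1_000))  # 0.0001
```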

We train all models using 64 GPUs distributed across 8 nodes on Stability AI’s cluster (Table 3). OpenFlamingo-4B models were trained using model sharding with Fully Sharded Data Parallel; other models were trained using only data parallelism.

Loss curves.

Figure 5 tracks LAION-2B and MMC4 loss over the course of training. After an initial improvement, MMC4 loss decreases very slowly. We speculate that, since MMC4 sequences tend to include long paragraphs between images (Figure 2), most text tokens can be generated without referencing the image. Thus, the loss may be dominated by whether the frozen language model can fit unrelated paragraphs of text.

3.4 Evaluation method

We evaluate OpenFlamingo on seven vision-language datasets including captioning (COCO, Flickr-30K), visual question answering (VQAv2, OK-VQA, TextVQA, VizWiz), and rank classification (HatefulMemes). For each dataset, we measure performance at 0, 4, 8, 16, and 32 in-context examples. Evaluation was done in automatic mixed precision, with linear layers computed in bfloat16.

For each evaluation example, we sample in-context examples from the training split uniformly at random. Additionally, in Appendix A.2, we include evaluations of OpenFlamingo using Retrieval-based In-Context Example Selection (RICES).

Evaluation subsets.

We evaluate on the dataset splits used by Alayrac et al. We run each evaluation across three seeds, where the randomness is over selected in-context demonstrations, and average the results to obtain our final scores.
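The evaluation protocol of random demonstration selection averaged over seeds can be sketched as below. The helper names and the seed values are hypothetical, and sampling without replacement is our assumption; `score_fn` stands in for a full evaluation run that draws its demonstrations from the supplied random generator.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def sample_demonstrations(train: Sequence, num_shots: int,
                          rng: random.Random) -> List:
    """Pick `num_shots` in-context examples uniformly at random
    (without replacement, which is an assumption of this sketch)."""
    return rng.sample(list(train), num_shots)


def evaluate_over_seeds(score_fn: Callable[[random.Random], float],
                        seeds: Sequence[int] = (0, 1, 2)) -> float:
    """Run the evaluation once per seed and average the scores."""
    return mean(score_fn(random.Random(seed)) for seed in seeds)


# Toy usage: the "score" here is just a placeholder computation.
train_split = list(range(100))
toy_score = lambda rng: float(len(sample_demonstrations(train_split, 4, rng)))
print(evaluate_over_seeds(toy_score))  # 4.0
```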

Prompts.

For captioning tasks, we format demonstrations as <image> Output: [caption], replacing [caption] with the ground-truth caption. For VQA, we format examples as <image> Question: [question] Short answer: [answer]. For HatefulMemes, we prompt the model with <image> is an image with: ‘[text]’ written on it. Is it hateful? Answer: [answer].

Following Alayrac et al., we prompt the model with two in-context examples during zero-shot evaluations, removing their images, and for classification tasks, we implement prompt ensembling by averaging logits across 6 permutations of the in-context examples.
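A minimal sketch of how a VQA prompt could be assembled from these templates is given below. The function name is hypothetical; prefixing each example with an `<image>` marker, closing each demonstration with `<|endofchunk|>`, and the exact whitespace are assumptions made for illustration. Setting `keep_demo_images=False` mimics the zero-shot setting, in which the two demonstrations are text-only.

```python
from typing import List, Tuple

IMAGE = "<image>"          # assumed image marker
EOC = "<|endofchunk|>"     # assumed terminator for each demonstration


def build_vqa_prompt(demos: List[Tuple[str, str]], question: str,
                     keep_demo_images: bool = True) -> str:
    """Build a VQA prompt from (question, answer) demonstrations followed
    by the query. With keep_demo_images=False the demonstrations are
    text-only, as in the zero-shot evaluations described above."""
    image = IMAGE if keep_demo_images else ""
    parts = [f"{image}Question: {q} Short answer: {a}{EOC}" for q, a in demos]
    # The query always keeps its image and leaves the answer open.
    parts.append(f"{IMAGE}Question: {question} Short answer:")
    return "".join(parts)


demos = [("What color is the bus?", "red"), ("Is it raining?", "no")]
print(build_vqa_prompt(demos, "How many dogs are there?", keep_demo_images=False))
```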

Decoding parameters.

We evaluate captioning and VQA using beam search with 3 beams, stopping generation at 20 tokens for captioning, 5 tokens for VQA, or whenever the model produces an <|endofchunk|> token. For HatefulMemes, we compute the log-likelihood of completions “yes” and “no” and answer with the most likely completion.
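The rank-classification step for HatefulMemes can be sketched as follows. This is an illustration only: it assumes the per-token log-probabilities of each candidate completion have already been obtained from the model (the numbers below are made up), and the function names are hypothetical.

```python
from typing import Dict, List


def completion_log_likelihood(token_logprobs: List[float]) -> float:
    """Log-likelihood of a completion is the sum of its token log-probs."""
    return sum(token_logprobs)


def rank_classify(candidates: Dict[str, List[float]]) -> str:
    """Return the candidate completion with the highest log-likelihood."""
    return max(candidates,
               key=lambda c: completion_log_likelihood(candidates[c]))


# Made-up log-probabilities for the two completions.
print(rank_classify({"yes": [-0.4], "no": [-1.2]}))  # yes
```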

Metrics.

For captioning, we use CIDEr score. For VQA, we report VQA accuracy, i.e., exact match accuracy over a set of ground truth answers. For HatefulMemes, we compute AUC ROC.
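To make the VQA metric concrete, the sketch below uses the commonly cited simplified form of VQA accuracy, $\min(1, n/3)$, where $n$ is the number of human annotators who gave the predicted answer. This is a simplification stated as an assumption: the official evaluation additionally normalizes answers and averages over annotator subsets, and the function name here is hypothetical.

```python
from typing import List


def vqa_accuracy(prediction: str, human_answers: List[str]) -> float:
    """Simplified VQA accuracy: an answer is fully correct if at least
    three annotators gave exactly that answer (case-insensitive here)."""
    pred = prediction.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == pred)
    return min(1.0, matches / 3.0)


answers = ["2"] * 7 + ["3"] * 3
print(vqa_accuracy("2", answers))            # 1.0
print(vqa_accuracy("there are 2", answers))  # 0.0
```

Because the criterion is an exact match, a verbose but otherwise correct generation such as "there are 2" receives no credit in this sketch.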

Results

In Table 4, we compare OpenFlamingo and Flamingo models across 0, 4, and 32 in-context examples. On average, OpenFlamingo-3B, -3B (Instruct), -4B (Instruct), and -9B attain more than 86% of the performance of their corresponding Flamingo models (Figure 1).

In the 0- and 4-shot regimes, OpenFlamingo models approach or match Flamingo performances on several datasets. For example, OpenFlamingo-9B improves upon Flamingo-9B’s 0-shot performance on VQAv2 ( $51.8\%\rightarrow 52.7\%$ VQA accuracy) and COCO ( $79.4\rightarrow 79.5$ CIDEr), and OpenFlamingo-9B approaches Flamingo-9B’s 0-shot performance on Flickr-30K and VizWiz. Moreover, OpenFlamingo-9B approaches the 4-shot performance of Flamingo-9B on COCO, VQAv2, and VizWiz.

However, on OK-VQA and TextVQA, OpenFlamingo models are notably weaker than their Flamingo counterparts: OpenFlamingo-9B underperforms Flamingo-9B in 0-shot evaluations by 6.9 percentage points on OK-VQA and 7.8 percentage points on TextVQA. OpenFlamingo-3B also underperforms Flamingo-3B by 4.6 percentage points in 0-shot VQAv2 accuracy. The reason for generally low VQA performance is unclear, although discussions in §5.2 may be related.

In Figure 6, we plot performance as a function of the number of in-context examples. We observe that the OpenFlamingo-3B and -9B models generally improve with the number of in-context examples. However, the rate of improvement is lower than that of the Flamingo models: in the bottom right corner of Figure 6, we observe that gaps between OpenFlamingo-9B and Flamingo-9B widen with the number of in-context examples. We speculate that this behavior may stem from the quality of our pre-training data, which mostly consists of sequences with few images (Table 2). In contrast with the -3B and -9B models, which generally improve with more in-context examples, the OpenFlamingo-4B models unexpectedly degrade in performance after 4 or 8 shots. The 4B models use RedPajama language models instead of MPT backbones; they also use frozen <image> and <|endofchunk|> embeddings. We investigate the effect of the latter in §5.1.

Trends by model size.

OpenFlamingo-9B generally outperforms smaller models, except on HatefulMemes and for large numbers of in-context examples on Flickr-30K and TextVQA. However, OpenFlamingo-4B models often underperform the smaller 3B models, including on Flickr-30K, HatefulMemes, TextVQA, and VizWiz.

Effect of language instruction-tuning.

We train two OpenFlamingo models at each of the 3B and 4B scales: one model using a base language model, and one with an instruction-tuned variant of the same language model. In the lower right corner of Figure 6, we observe that the instruction-tuned variants of MPT-1B and RedPajama-3B on average outperform the base models. The difference is starkest for RedPajama-3B. Transfer of language instruction tuning to vision-language tasks was previously reported by Huang et al. and Li et al.

Comparison to fine-tuned state-of-the-art.

Figure 7 plots each model’s performance relative to fine-tuned state-of-the-art performance, as listed on Papers With Code on June 19, 2023. OpenFlamingo-9B averages more than 62% of fine-tuned state-of-the-art performance with 32 RICES-selected in-context examples, compared to 72% achieved by Flamingo-9B. For more details on the fine-tuned SoTAs, see Appendix A.1.

Discussion

In §4, we observed that OpenFlamingo-4B models underperform their 3B counterparts on most datasets. One notable way the OpenFlamingo-4B models differ from the 3B and 9B models is that their <image> and <|endofchunk|> embeddings are randomly initialized and frozen, rather than trained.

In Table 5, we investigate the effect of this difference. We train small models using OPT-125M as a language model for 20M interleaved samples (one-third of full training). Freezing the <image> and <|endofchunk|> embeddings results in a drop of 4.6 CIDEr for 0-shot COCO, and 12.1% accuracy for 0-shot VQAv2. This suggests that frozen <image> and <|endofchunk|> embeddings may impact downstream trends.

5.2 VQAv2 validation trends

During development, we used the VQAv2 validation set as a temperature check for visual question answering capabilities. In this section, we discuss trends observed during development.

To understand how evaluation performance evolves over the course of training, Figure 8 plots validation performance of OpenFlamingo-9B on COCO and VQAv2 throughout training. While COCO performance steadily improves, VQAv2 progress is flatter. This matches trends reported by Li et al.

Effect of language model.

Although additional training did not dramatically affect VQAv2 performance, changing language model backbones did. Table 7 illustrates this effect on the VQAv2 validation split; notably, switching from OPT-1.3B to MPT-1B (Instruct) added nearly 10 percentage points in 0-shot performance. We hypothesize that the language model has similarly large effects for other VQA tasks.

Common VQA failure modes (Table 6).

OpenFlamingo models struggle with counting; on the VQAv2 validation split, OpenFlamingo-9B scores 30.5% on questions with numerical answers, compared to 70.6% on yes / no questions. Additionally, because VQA accuracy uses an exact match criterion for generations, models must answer concisely to score well; OpenFlamingo models are often too verbose. Finally, VQA questions can ask about objects other than the central object in the image; models sometimes answer about the central item instead.

5.3 Applications of OpenFlamingo

Multiple models have already been developed on top of OpenFlamingo. Li et al. fine-tuned OpenFlamingo on MIMIC-IT, a multi-image/video instruction-following dataset, creating Otter, a multimodal assistant. Gong et al. released Multimodal-GPT, an OpenFlamingo model instruction fine-tuned on both vision and language instruction datasets. We hope the community continues to use OpenFlamingo models.

5.4 Limitations

OpenFlamingo models carry the same risks as their foundational language models. In particular, these models train on web-scraped data, and they have not undergone safety-focused fine-tuning. Models thus may produce unexpected, inappropriate, or inaccurate outputs. We hope to further investigate the safety properties of autoregressive vision-language models like OpenFlamingo.

Conclusion

In this technical report, we described OpenFlamingo, a family of five autoregressive vision-language models across the 3B, 4B, and 9B scales. OpenFlamingo remains an active research project, and we continue to work on training and releasing high-quality autoregressive vision-language models. We hope our contribution enables more researchers to train and study such models.

We would like to thank Jean-Baptiste Alayrac and Antoine Miech for their advice on reproducing Flamingo. We also thank Rohan Taori, Nicholas Schiefer, Deep Ganguli, Thomas Liao, Tatsunori Hashimoto, and Nicholas Carlini for their help with assessing the safety risks of our first release of OpenFlamingo. Thanks to Stability AI for compute resources.

Appendix A Extended results

Table 11 provides full evaluation results for 0, 4, 8, 16, and 32 in-context examples. For ease of comparison to Flamingo, we calculate each OpenFlamingo model’s performance as a fraction of corresponding Flamingo performance in Figure 11.
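As a small worked example of this "fraction of Flamingo performance" calculation, the sketch below uses the two 0-shot OpenFlamingo-9B versus Flamingo-9B comparisons quoted in the results discussion (COCO CIDEr 79.5 vs. 79.4, VQAv2 accuracy 52.7 vs. 51.8). The function name is hypothetical, and averaging the per-dataset ratios with an unweighted mean is an assumption about how the aggregate is formed.

```python
from statistics import mean
from typing import Dict


def relative_performance(ours: Dict[str, float],
                         reference: Dict[str, float]) -> Dict[str, float]:
    """Per-dataset score as a fraction of the reference model's score."""
    return {name: ours[name] / reference[name] for name in ours}


openflamingo_9b = {"COCO": 79.5, "VQAv2": 52.7}   # 0-shot, from the text
flamingo_9b = {"COCO": 79.4, "VQAv2": 51.8}       # 0-shot, from the text

ratios = relative_performance(openflamingo_9b, flamingo_9b)
for name, ratio in ratios.items():
    print(f"{name}: {100 * ratio:.1f}% of Flamingo-9B")
print(f"mean over these two datasets: {100 * mean(ratios.values()):.1f}%")
```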

In Figure 9, we compare OpenFlamingo models to fine-tuned SoTA performances for different numbers of in-context examples. The fine-tuned methods used were pulled from PapersWithCode on 06/19/23 (Table 8).

A.2 Evaluations using RICES

In the main text, we evaluate OpenFlamingo by selecting in-context examples uniformly at random. In this appendix, we include additional evaluation results using Retrieval-based In-Context Example Selection (RICES). For a given test example, RICES selects the top-k most similar training examples as demonstrations, where similarity is measured by cosine similarity of the images according to the frozen vision encoder (CLIP ViT-L/14). Full results with RICES are listed in Table 12 and illustrated in Figure 10.
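The selection rule can be sketched as a top-k search by cosine similarity. This is a minimal illustration, not the released implementation: image embeddings are assumed to be precomputed and are represented as plain lists of floats (the toy vectors below are made up), and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rices_select(query_emb: Sequence[float],
                 train_embs: List[Sequence[float]], k: int) -> List[int]:
    """Return indices of the k training images most similar to the query,
    ordered from most to least similar."""
    ranked = sorted(range(len(train_embs)),
                    key=lambda i: cosine(query_emb, train_embs[i]),
                    reverse=True)
    return ranked[:k]


# Toy 2-D "embeddings" for illustration only.
train = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
print(rices_select([1.0, 0.0], train, k=2))  # [1, 2]
```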

In Table 9, we compare OpenFlamingo-9B performance using RICES to performance using randomly selected in-context examples. We observe that RICES significantly boosts performance in most evaluation settings, including by 19.2 ROC AUC using 32 shots on HatefulMemes. However, on Flickr-30K, we observe significant degradations from using RICES: CIDEr degrades by 20.4 in 0-shot evaluations and 13.1 in 4-shot evaluations. (In 0-shot evaluations, RICES is still used to select the two text-only examples used for the prompt; see §3.4.) We hypothesize that the demonstrations RICES selects in Flickr-30K are more similar to the test example than in other datasets. This leads OpenFlamingo-9B to parrot captions from the in-context examples, including incorrect details. For an example, see Table 10 in Appendix A.

Appendix B Additional notes on filtering MMC4

When training contrastive vision-language models, filtering image-text pairs by CLIP cosine similarity has proven particularly helpful for improving data quality. We use a similar notion for filtering interleaved sequences in MMC4: if an image and its matched sentence had a cosine similarity below a fixed threshold (0.24), according to CLIP ViT-L/14 embeddings, we omitted the image from the sequence, keeping the text. If all images in a sequence are omitted, we discard the sequence entirely. This aims to ensure that each image is relevant to the text following it.

However, increasing the image-text similarity threshold has a side effect: it reduces the typical number of images per interleaved sequence. When using a similarity threshold of 0.32, nearly 58% of a sample of 1,000 MMC4 sequences contain only 1 image per sequence, compared to 38% in Figure 4, which uses a threshold of 0.24. Training with long sequences may be important for producing models that can handle a large number of in-context examples. Further, we estimate that 88.7% of MMC4 sequences are discarded completely when filtering with threshold 0.32, compared to 42.7% with threshold 0.24.

As future work, we are interested in understanding how to balance length, quality, and dataset size objectives to improve OpenFlamingo models.

Appendix C Synthetic data prompt

We provide the prompt used to generate the ChatGPT-generated data (see §3.2) in Table 12. After generating candidate sequences, we query LAION-5B to infill images. For each unique caption we generate, we attempt to retrieve 10 candidate images from the index using index=laion5B-L-14, aesthetic_score=9, and aesthetic_weight=0.5. After this search, we re-rank the retrieved images using CLIP ViT-L/16@336px and select the image with the highest similarity to interleave.

Appendix D Image credits

We include the links to the images we used in Figure 2 in Table 13.