FLAVA: A Foundational Language And Vision Alignment Model

Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, Douwe Kiela

Introduction

Large-scale pre-training of vision and language transformers has led to impressive performance gains in a wide variety of downstream tasks. In particular, contrastive methods such as CLIP and ALIGN have shown that natural language supervision can lead to very high quality visual models for transfer learning.

Purely contrastive methods, however, also have important shortcomings. Their cross-modal nature does not make them easily usable on multimodal problems that require dealing with both modalities at the same time. They require large corpora, which for both CLIP and ALIGN have not been made accessible to the research community and the details of which remain shrouded in mystery, notwithstanding well-known issues with the construction of such datasets.

In contrast, the recent literature is rich with transformer models that explicitly target the multimodal vision-and-language domain by having earlier fusion and shared self-attention across modalities. For those cases, however, the unimodal vision-only or language-only performance of the model is often either glossed over or ignored completely.

If the future of our field lies in generalized “foundation models” or “universal” transformers with many different capabilities, then the following limitation should be overcome: a true foundation model in the vision and language space should not only be good at vision, or language, or vision-and-language problems–it should be good at all three, at the same time.

Combining information from different modalities into one universal architecture holds promise not only because it is similar to how humans make sense of the world, but also because it may lead to better sample efficiency and much richer representations.

In this work, we introduce FLAVA, a foundational language and vision alignment model that explicitly targets vision, language, and their multimodal combination all at once. FLAVA learns strong representations through joint pretraining on both unimodal and multimodal data while encompassing cross-modal “alignment” objectives and multi-modal “fusion” objectives. We validate FLAVA by applying it to 35 tasks across vision, NLP, and multimodal domains and show impressive performance. An important advantage of our approach is that it was trained on a corpus of openly available datasets that is an order of magnitude smaller than datasets used in comparable models. Our models and code are available at https://flava-model.github.io/.

Background

The self-supervised pretraining paradigm has significantly advanced the state of the art across various domains, from natural language processing, to computer vision, to speech recognition and multimodal domains such as vision and language understanding. Even though this progress is based on a shared recipe of self-supervised learning on top of transformers, we are still missing major progress in building foundational models that work well across all of these different domains and modalities at once.

Table 1 compares popular and recent models with FLAVA along multiple axes. Recent work either (i) focuses on a single target domain; (ii) targets a specific unimodal domain along with the joint vision-and-language domain; or (iii) targets all domains but only a specific set of tasks in a particular domain.

SimVLM, ALIGN, and CLIP have demonstrated impressive gains by training transformer-based models on giant private paired image-and-text corpora, as opposed to the previous vision-and-language state-of-the-art such as VinVL and ViLT, which were trained on smaller public paired datasets.

Generally, models in the vision-and-language space can be divided into two categories: (i) dual encoders, where the image and text are encoded separately, followed by a shallow interaction layer for downstream tasks; and (ii) fusion encoder(s) with self-attention spanning across the modalities. The dual encoder approach works well for unimodal and cross-modal retrieval tasks, but its lack of fusion usually causes it to underperform on tasks that involve visual reasoning and question answering, which is where models based on fusion encoder(s) shine.

Within the fusion encoder category, a further distinction can be made as to whether the model uses a single transformer for early and unconstrained fusion between modalities (e.g., VisualBERT, UNITER, VLBERT, OSCAR) or allows cross-attention only in specific co-attention transformer layers while having some modality-specific layers (e.g., LXMERT, ViLBERT, ERNIE-ViL). Another distinguishing factor between different models lies in the image features that are used, ranging from region features, to patch embeddings, to convolution or grid features.

Dual encoder models use contrastive pretraining to predict the correct $N$ paired combinations among $N^2$ possibilities. On the other hand, with fusion encoders, inspired by unimodal pretraining schemes such as masked language modeling, masked image modeling, and causal language modeling, numerous pretraining tasks have been explored: (i) Masked Language Modeling (MLM) for V&L, where masked words in the caption are predicted with the help of the paired image; (ii) prefixLM, where with the help of an image, the model tries to complete a caption; (iii) image-text matching, where the model predicts whether a given pair of image and text match or not; and (iv) masked region modeling, where the model regresses onto the image features or predicts its object class.
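To make the "correct $N$ pairs among $N^2$ possibilities" formulation concrete, the following is a minimal sketch of a symmetric image-text contrastive loss over a batch of $N$ pairs. It is an illustration under stated assumptions rather than the implementation used in this work: the function names are hypothetical, the embeddings are plain Python lists, and the temperature of 0.07 is an illustrative value, not a reported FLAVA hyperparameter.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; temperature is an assumed, illustrative value."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    # N x N scaled cosine similarities: entry (i, j) scores image i against text j.
    sim = [[dot(img, txt) / temperature for txt in texts] for img in images]
    # Image-to-text: row i should select column i (its paired caption).
    i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # Text-to-image: column j should select row j (its paired image).
    t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

Only the $N$ diagonal entries of the similarity matrix are treated as positives; the remaining $N^2 - N$ entries act as in-batch negatives.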

Compared to previous work, our model FLAVA works on a wide range of tasks in each of the vision, language, and vision-and-language domains. FLAVA uses a shared trunk which was pretrained on only openly available public paired data. FLAVA combines dual and fusion encoder approaches into one holistic model that can be pretrained with our novel FLAVA pretraining scheme that leverages pretraining objectives from both categories. FLAVA is designed to be able to take advantage of unpaired unimodal data along with multimodal paired data, resulting in a model that can handle unimodal and retrieval tasks as well as cross-modal and multi-modal vision-and-language tasks.

FLAVA: A Foundational Language And Vision Alignment Model

The goal of this work is to learn a foundational language and vision representation that enables unimodal vision and language understanding as well as multimodal reasoning, all within a single pre-trained model. We show how this can be achieved with a simple and elegant architecture based on transformers (Sec. 3.1), which incorporates multimodal pretraining losses on image-text data (Sec. 3.2) as well as unimodal pretraining losses on unimodal data (Sec. 3.3). We discuss additional critical modeling insights in Sec. 3.4. Finally, we demonstrate that our pretrained models can be successfully applied to a wide range of image, text, and multimodal tasks through both zero-shot and fine-tuning evaluations.

The FLAVA model architecture is shown in Figure 2. The model involves an image encoder to extract unimodal image representations, a text encoder to obtain unimodal text representations, and a multimodal encoder to fuse and align the image and text representations for multimodal reasoning, all of which are based on transformers.

3.2 Multimodal pretraining objectives

We aim to obtain strong representations through pretraining on both multimodal data (paired image and text) as well as unimodal data (unpaired images or text). FLAVA pretraining involves the following multimodal objectives.

Specifically, for masked multimodal modeling (MMM), given an image and text input, we first tokenize the input image patches using a pretrained dVAE tokenizer, which maps each image patch into an index in a visual codebook similar to a word dictionary (we use the same dVAE tokenizer as in ). Then, we replace a subset of image patches based on rectangular block image regions following BEiT and 15% of text tokens following BERT with a special [MASK] token. Finally, from the multimodal encoder’s output $\{\mathbf{h}_{M}\}$, we apply a multi-layer perceptron to predict the visual codebook index of the masked image patches, or the word vocabulary index of the masked text tokens.

This objective can be seen as an extension of multimodal masked language modeling that also incorporates masking on the image side. In our experiments, we find that our MMM pretraining leads to improvements over and in addition to the contrastive loss pretraining, especially for multimodal downstream tasks such as VQA. Note that we apply the global contrastive loss on image patches and text tokens without any masking, which are forwarded through the image and text encoders separately from the MMM loss.

3.3 Unimodal pretraining objectives

While the objectives in Sec. 3.2 allow pretraining the FLAVA model on paired image-and-text data, the vast majority of datasets (such as ImageNet for images and CC-News for text) are unimodal without paired data from the other modality. To efficiently learn a representation for a wide range of downstream tasks, we would also like to leverage these datasets and incorporate unimodal and unaligned information into our representations.

In this work, we introduce knowledge and information from these unimodal datasets through 1) pretraining the image encoder and text encoder on unimodal datasets; 2) pretraining the entire FLAVA model jointly on both unimodal and multimodal datasets; or 3) a combination of both by starting from pretrained encoders and then jointly training. When applied to stand-alone image or text data, we adopt masked image modeling (MIM) and masked language modeling (MLM) losses over the image and text encoders respectively, as described in what follows.

Masked image modeling (MIM). On unimodal image datasets, we mask a set of image patches following the rectangular block-wise masking in BEiT and reconstruct them from other image patches. The input image is first tokenized using a pretrained dVAE tokenizer (same as the one used in the MMM objective in Sec. 3.2), and then a classifier is applied on the image encoder outputs $\{\mathbf{h}_{I}\}$ to predict the dVAE tokens of the masked patches.
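As an illustration of rectangular block-wise masking over a grid of image patches, the sketch below repeatedly samples rectangular blocks until a target number of patches is masked. It is a simplified approximation and not BEiT's exact procedure: the function name, the 14x14 grid, the target of 75 masked patches, the minimum block area, and the aspect-ratio range are all assumptions chosen for illustration.

```python
import math
import random


def blockwise_mask(grid_h, grid_w, target_masked, min_block=4, seed=0):
    """Return a grid_h x grid_w boolean mask built from random rectangular blocks.

    Simplified sketch: the final block may overshoot target_masked slightly.
    """
    if not 0 <= target_masked <= grid_h * grid_w:
        raise ValueError("target_masked must fit inside the patch grid")
    rng = random.Random(seed)
    mask = [[False] * grid_w for _ in range(grid_h)]
    masked = 0
    while masked < target_masked:
        # Sample a block area and aspect ratio, then derive its height and width.
        area = rng.uniform(min_block, max(min_block, target_masked - masked))
        aspect = rng.uniform(0.3, 1 / 0.3)
        h = min(grid_h, max(1, round(math.sqrt(area * aspect))))
        w = min(grid_w, max(1, round(math.sqrt(area / aspect))))
        # Place the block uniformly at random inside the grid.
        top = rng.randint(0, grid_h - h)
        left = rng.randint(0, grid_w - w)
        for r in range(top, top + h):
            for c in range(left, left + w):
                if not mask[r][c]:
                    mask[r][c] = True
                    masked += 1
    return mask


# Example: a 14x14 patch grid (assumed) with an illustrative target of 75 patches.
mask = blockwise_mask(14, 14, 75, seed=0)
```

The masked positions would then serve as prediction targets for the classifier over dVAE token indices, while unmasked patches provide context.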

Masked language modeling (MLM). We apply a masked language modeling loss on top of the text encoder to pretrain on stand-alone text datasets. A fraction (15%) of the text tokens are masked in the input, and reconstructed from the other tokens using a classifier over the unimodal text hidden states output $\{\mathbf{h}_{T}\}$.
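The 15% token masking can be sketched as follows. This is a minimal illustration under assumptions: the function name and the `[MASK]` string constant are hypothetical, tokens are plain strings rather than vocabulary indices, and every selected position is replaced by the mask token (BERT's additional random-replacement and keep-original cases are omitted for brevity).

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder string standing in for the special mask token


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Mask a fraction of tokens; return the masked sequence and the targets.

    targets maps each masked position to the original token to be predicted.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, targets
```

The classifier over the text hidden states is then trained only on the positions recorded in `targets`.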

Encoder initialization from unimodal pretraining. We use three sources of data for pretraining: unimodal image data (ImageNet-1K), unimodal text data (CCNews and BookCorpus), and multimodal image-text paired data (Sec. 3.5). We first pretrain the text encoder with the MLM objective on the unimodal text dataset. We experiment with different ways of pretraining the image encoder: we pretrain it on unpaired image datasets with either MIM or the DINO objective, before joint training on both unimodal and multimodal datasets. We empirically found the DINO initialization to work quite well, despite the switch to an MIM objective on images post-initialization (more details in supplemental). Then, we initialize the whole FLAVA model with the two respective unimodally-pretrained encoders, or when we train from scratch, we initialize randomly. We always initialize the multimodal encoder randomly for pretraining.

Joint unimodal and multimodal training. After unimodal pretraining of the image and text encoders, we continue training the entire FLAVA model jointly on the three types of datasets with round-robin sampling. In each training iteration, we choose one of the datasets according to a sampling ratio that we determine empirically (see supplemental) and obtain a batch of samples. Then, depending on the dataset type, we apply unimodal MIM on image data, unimodal MLM on text data, or the multimodal losses (contrastive, MMM, and ITM) in Sec. 3.2 on image-text pairs.
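The per-iteration dataset selection and loss dispatch can be sketched as below. This is a hedged illustration, not the training loop used in this work: the function and dictionary names are hypothetical, and the sampling ratios are placeholders only (the actual values are determined empirically and listed in the supplemental material).

```python
import random

# Losses applied for each dataset type, following the description in the text.
LOSSES_BY_TYPE = {
    "image": ("MIM",),
    "text": ("MLM",),
    "image-text": ("contrastive", "MMM", "ITM"),
}

# Dataset names follow the text; the ratios are PLACEHOLDER values, not the paper's.
DATASETS = {
    "PMD": {"type": "image-text", "ratio": 0.70},
    "ImageNet-1k": {"type": "image", "ratio": 0.15},
    "CCNews+BookCorpus": {"type": "text", "ratio": 0.15},
}


def sample_schedule(datasets, num_iters, seed=0):
    """For each iteration, pick one dataset by its ratio and list the losses to apply."""
    rng = random.Random(seed)
    names = list(datasets)
    weights = [datasets[name]["ratio"] for name in names]
    schedule = []
    for _ in range(num_iters):
        name = rng.choices(names, weights=weights, k=1)[0]
        schedule.append((name, LOSSES_BY_TYPE[datasets[name]["type"]]))
    return schedule
```

Each entry of the schedule corresponds to one full batch drawn from a single dataset, so a training step never mixes unimodal and multimodal samples.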

3.4 Implementation details

We find that the optimizer hyperparameters play a critical role in effective pretraining. A large batch size, a large weight decay, and a long warm-up are all important for preventing divergence with a large learning rate (we use 8,192 batch size, 1e-3 learning rate, 0.1 weight decay, and 10,000 iteration warm-up in our pretraining tasks together with the AdamW optimizer). In addition, the ViT transformer architecture (which applies layer norm before the multi-head attention rather than after) provides more robust learning for the text encoder under large learning rate than the BERT transformer architecture. FLAVA is implemented using the open-source MMF and fairseq libraries. We use Fully-Sharded Data Parallel (FSDP) and train in full FP16 precision except the layer norm to reduce GPU memory consumption.
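For reference, the reported optimizer settings and the warm-up can be written down as a small sketch. The numeric values come from the text above; the linear shape of the warm-up and the constant rate afterwards are assumptions made for illustration, since the schedule after warm-up is not described here. The names `PRETRAIN_HPARAMS` and `warmup_lr` are hypothetical.

```python
# Values reported in the text for pretraining with AdamW.
PRETRAIN_HPARAMS = {
    "batch_size": 8192,
    "peak_lr": 1e-3,
    "weight_decay": 0.1,
    "warmup_iters": 10_000,
}


def warmup_lr(step, peak_lr=PRETRAIN_HPARAMS["peak_lr"],
              warmup_iters=PRETRAIN_HPARAMS["warmup_iters"]):
    """Learning rate at a 0-indexed step.

    Assumption: linear warm-up to peak_lr, then held constant (any decay
    applied after warm-up is not modeled in this sketch).
    """
    if step < warmup_iters:
        return peak_lr * (step + 1) / warmup_iters
    return peak_lr
```

With these values, the learning rate grows by 1e-7 per iteration during warm-up, which keeps early updates small while the large batch statistics stabilize.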

3.5 Data: Public Multimodal Datasets (PMD)

For multimodal pretraining, we constructed a corpus out of publicly available sources of image-text data, which are presented in Table 2 with examples in Fig. 3. The total count of text-image pairs is 70M, including 68M unique images, and the average caption length is 12.1 words. For the YFCC100M dataset, we filter the image-text data by discarding non-English captions and only keeping captions that contain more than two words. We first consider the description field of each image; if this does not pass our filters, we consider the title field. Other than that, we did not do any additional filtering. Importantly, this corpus entirely consists of open datasets that are freely accessible by other researchers, facilitating reproducibility and enabling future work by the community.
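The YFCC100M caption selection described above can be sketched as a small filter. This is an illustration under assumptions: the function names are hypothetical, and the language check is passed in as a callable because the language-identification method is not specified here; `naive_is_english` is a crude ASCII-only stand-in used purely to make the example runnable.

```python
def passes_filters(caption, is_english):
    """Keep captions that are English and contain more than two words."""
    if not caption:
        return False
    return len(caption.split()) > 2 and is_english(caption)


def select_caption(description, title, is_english):
    """Prefer the description field; fall back to the title; else drop the image."""
    for candidate in (description, title):
        if passes_filters(candidate, is_english):
            return candidate.strip()
    return None


def naive_is_english(text):
    # Placeholder only: a real pipeline would use a language-identification model.
    return text.isascii()


# The description is too short, so the title is used instead.
caption = select_caption("IMG 1234", "Sunset over the bay", naive_is_english)
```

Images for which neither field passes the filters yield `None` and would be discarded from the corpus.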

Experiments

We evaluate FLAVA across vision, language, and multimodal tasks. For vision, we evaluate on 22 common vision tasks. For NLP, we evaluate on 8 tasks from the GLUE benchmark. For multimodal, we evaluate on VQAv2, SNLI-VE, Hateful Memes, as well as Flickr30K and COCO image and text retrieval.

However, we also observe from column 4 vs 5 that the macro average over all tasks decreases slightly. We suspect that this is because adding different tasks to the mix makes the optimization problem much harder, especially when the whole model is randomly initialized. Also, the round-robin sampling of tasks does not follow any particular curriculum to order the learning sequence of these tasks. Naturally, having some vision and language understanding is important before learning multimodal tasks, which motivates us to explore first leveraging unimodal pretraining before the joint training, as described below.

We compare our full FLAVA model (Table 4 column 6) with several state-of-the-art models on multimodal tasks, language tasks, and ImageNet linear evaluation, in Table 5. FLAVA largely outperforms previous multimodal approaches pretrained on public data (rows 4 to 11) on both language and multimodal tasks and approaches the well-established BERT model on several GLUE tasks.

FLAVA combines unimodal and multimodal losses and learns more generic representations which are transferable to vision, language, and multimodal tasks. We evaluate the best released CLIP ViT-B/16 model (pretrained on 400M image-text pairs, with the same image encoder architecture as in FLAVA) on our task benchmark, shown in Table 5 row 2. Compared to CLIP, we train FLAVA on just 70M image-text pairs, which is $\sim$6x smaller. In Fig. 4, we observe that FLAVA works significantly better on language and multimodal tasks while slightly worse than CLIP on some vision-only tasks. In addition, we note that FLAVA outperforms the variant of the CLIP model pretrained only on the PMD dataset (Table 5 row 10). Table 4 further shows a breakdown analysis between our model (column 6) and the released CLIP ViT-B/16 (400M) model (column 8) and the CLIP trained on PMD (column 7).

FLAVA also has comparable performance to SimVLM (Table 5 row 3) on language tasks while underperforming it on multimodal tasks and ImageNet linear evaluation. FLAVA is pretrained using a much smaller dataset compared to the 1.8B image-text pairs used by SimVLM, and we anticipate that FLAVA’s performance will improve substantially as the pretraining dataset size increases.

Conclusion

In this work, we have presented a foundational vision and language alignment model that performs well on all three target modalities: 1) vision, 2) language, and 3) vision & language. We introduced a novel set of objectives to achieve this goal and conducted experiments on a wide variety of 35 tasks to analyze the model’s performance. FLAVA was trained on a corpus of publicly available datasets that is roughly an order of magnitude smaller than those used by similar recent models, but still obtained better or competitive performance. Our work points the way forward towards generalized but open models that perform well on a wide variety of multimodal tasks.

Broader impacts and limitations. The models in this work are trained on public datasets widely used in the community. This enables reproducibility, and we hope that our work will motivate others to compare models across a wide range of tasks and domains with the same data. However, like all natural data, these datasets have biases, potentially affecting our models. We partly mitigate this by combining several public datasets to increase the diversity and evaluating on an even larger set of target datasets. Still, further study is needed to identify and reduce potentially harmful biases.

Acknowledgements. We thank Devi Parikh for her support and advice on this project. We are grateful to Dmytro Okhonko, Hu Xu, Armen Aghajanyan, Po-Yao Huang, Min Xu, and Aleksandra Piktus for joint explorations of multimodal data. We thank Ning Zhang, Madian Khabsa, Sasha Sheng, and Naman Goyal for useful technical discussions; Karan Desai for providing access to RedCaps; Vaibhav Singh and others on the Google TPU team for TPU support; Shubho Sengupta, Armand Joulin, Brian O’Horo, Arthur Menezes for compute and storage support; and Ryan Jiang, Kushal Tirumala and Russ Howes for help running experiments.

Appendix A Hyperparameters and details of FLAVA

We summarize the hyperparameters in our FLAVA model in Table A.1. We also list the sampling probabilities of the datasets for joint pretraining in Table A.2, including PMD (multimodal paired image and text), ImageNet-1k (unimodal unpaired images), and CCNews & BookCorpus (unimodal unpaired text).

We find that a large batch size, a large weight decay, and a long warmup are helpful to stabilize training and prevent divergence under a large learning rate. Based on this finding, we performed a hyperparameter search by monitoring the learning curve as well as the zero-shot image classification accuracy (computed from the image-text contrastive loss using the text templates from CLIP) to obtain the hyperparameters above.

Appendix B Training and evaluation details

Language encoder pretraining. We follow RoBERTa-base pretraining hyperparameters to train our pre-norm ViT-based text encoder. Specifically, we pretrain our text encoder using masked language modeling (MLM) on CCNews and BookCorpus for 125K iterations with a batch size of 2048 and a learning rate of 5e-4. We pick the best checkpoint based on the MLM loss without any further hyperparameter sweeps over RoBERTa’s default configuration.

Vision encoder pretraining. We pretrain the image encoder in FLAVA on the ImageNet-1k dataset following either BEiT or DINO. When pretraining a ViT-B/16 image encoder with BEiT, we adopt the BEiT hyperparameters and training details, with a masked image modeling loss that predicts the dVAE visual tokens of the masked image patches. We also follow the DINO training protocols to pretrain a DINO ViT-B/16 model as our image encoder. As discussed in Sec. C, we empirically find that the DINO-pretrained image encoder gives better final performance.

Full FLAVA pretraining. We pretrain jointly on the unimodal and multimodal datasets, following the sampling probabilities of these datasets as provided in Table A.2. Specifically, for each update, we pick a dataset based on its sampling probability and obtain a complete batch from it. In all our ablations, we use a training schedule such that the PMD dataset is sampled for a total of 150K iterations. We monitor the zero-shot accuracy on ImageNet classification every 8K updates and select the best checkpoint based on the ImageNet-1k zero-shot accuracy. We follow to calculate the zero-shot accuracy.

B.2 Vision, language and multimodal evaluation

We evaluate the pretrained FLAVA model across a broad range of vision, natural language, and multimodal tasks. We discuss our evaluation details of these tasks below.

Linear probing on vision tasks. We perform linear probe evaluations on the datasets by closely following the setup described in . We extract image features from the final layer of the image encoder (before the multi-modal encoder) and train a logistic regression classifier (L-BFGS implementation from ) on the extracted image features. We follow the hyperparameters similar to : 1000 iterations, logistic regression $\lambda$ parameter sweep from 1e-6 to 1e6.

Fine-tuning on NLP tasks. For NLP tasks, we finetune the language encoder end to end for all the GLUE tasks. We add a classification head on top of the language encoder for all the tasks, except for the STS-B task, where we use a regression head. The finetuning hyperparameters follow the RoBERTa setup: we use those from the fairseq RoBERTa repository for GLUE tasks, without any further sweeping.

We use the same approach above to also evaluate the CLIP model on VQAv2, SNLI-VE, and Hateful Memes datasets. Since CLIP does not have a multimodal encoder, we concatenate the image feature vector from its image encoder and the text feature vector from its text encoder, apply a 2-layer classifier head (with the same hidden dimension of 1536) over the concatenated feature, and finetune the model following the same hyperparameters as for FLAVA.
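The CLIP baseline head described above (concatenate the image and text feature vectors, then apply a 2-layer classifier) can be sketched in plain Python as follows. This is an illustrative forward pass only: the class and helper names are hypothetical, the ReLU activation and the uniform weight initialization are assumptions, and training is not shown. The hidden dimension of 1536 is the one stated in the text.

```python
import random


def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    if any(len(row) != len(x) for row in weight):
        raise ValueError("input size does not match layer weights")
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(in_dim, out_dim, rng):
    # Assumed initialization, chosen only to make the sketch runnable.
    scale = in_dim ** -0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


class ConcatClassifierHead:
    """2-layer classifier over concatenated image and text features."""

    def __init__(self, image_dim, text_dim, num_classes, hidden_dim=1536, seed=0):
        rng = random.Random(seed)
        self.fc1 = init_linear(image_dim + text_dim, hidden_dim, rng)
        self.fc2 = init_linear(hidden_dim, num_classes, rng)

    def __call__(self, image_feat, text_feat):
        fused = list(image_feat) + list(text_feat)  # simple late fusion
        hidden = [max(0.0, h) for h in linear(fused, *self.fc1)]  # ReLU (assumed)
        return linear(hidden, *self.fc2)
```

Because the two feature vectors only interact inside this shallow head, the baseline has no cross-modal attention, in contrast to FLAVA's multimodal encoder.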

Zero-shot multimodal text and image retrieval. We also evaluate the FLAVA model on the multimodal zero-shot retrieval tasks over the Flickr30K and COCO datasets, where the model needs to select a text caption based on a query image or select an image based on a query caption. We use the cosine similarities between the image and text features computed in the global contrastive loss in FLAVA as the matching scores between the image and text modalities. Then, the text caption (or image) with the highest matching score to the query is retrieved. Similarly, we also evaluate the zero-shot text and image retrieval performance of the CLIP model using the cosine similarities between its image and text features.
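The retrieval rule above reduces to an argmax over cosine similarities. The following minimal sketch illustrates it with plain Python lists standing in for encoder features; the function names and the recall-at-1 helper are hypothetical and included only for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query, candidates):
    """Return the index of the candidate with the highest cosine similarity."""
    scores = [cosine(query, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def recall_at_1(queries, candidates, gold):
    """Fraction of queries whose top-ranked candidate is the gold match."""
    hits = sum(retrieve(q, candidates) == g for q, g in zip(queries, gold))
    return hits / len(queries)
```

The same function serves both directions: pass an image feature as the query with caption features as candidates for text retrieval, or swap the roles for image retrieval.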

Appendix C Additional ablations and analyses

Observations on SST and VQA. Some of our vision tasks involve classifying an image using the text written on the image pixels, and require the model to perform OCR to read text from images. For example, in the SST task in Table C.1 (which is also evaluated as an image classification task in ), the model is asked to classify the sentiment of a natural language sentence by printing the sentence words onto an image and providing the image pixels to the model. It can be seen from Table C.1 that our FLAVA model does not perform well on this SST task, which we believe is mostly because our PMD dataset does not contain enough scene text information for the model to acquire text reading ability from images. We note that the CLIP model pretrained on PMD (column 13) has similarly lower performance on SST than the variant pretrained on 400M image-text pairs (column 14), and we anticipate that FLAVA will also be able to perform scene text reading when pretrained on a larger dataset with enough scene text information.

Our FLAVA model reaches a final accuracy of 72.49 on the VQAv2 dataset. While this accuracy is below the state-of-the-art on VQAv2, we note that this is a reasonable performance given the amount of data used in FLAVA pretraining. Recent models such as SimVLM often use a much larger dataset (e.g., 1.8B image-text pairs), and we believe more pretraining data will also benefit FLAVA.

Appendix D Architectural differences between FLAVA and CLIP encoders

FLAVA and CLIP use transformers as the image and text encoders in their comparable variations (column 3, FLAVA$_C$-local contrastive and column 13, CLIP-ViT-B/16 in Table C.1). Compared to CLIP, which uses a text vocabulary of size 49152, in FLAVA we use BERT’s text vocabulary with a size of 30522. CLIP uses lower-cased byte pair encoding, whereas we use BERT’s tokenizer to tokenize the text. Furthermore, we use a hidden size of 768 instead of 512 and use the ViT architecture (based on the implementation in Hugging Face) instead of the GPT-style transformer architecture in CLIP for both text and image encoders. Table D.1 shows the comparison of macro averages for the three domains between the original CLIP architecture and our optimized FLAVA architecture trained on PMD under the same settings with local contrastive loss (corresponding to columns 13 and 3 in Table C.1, respectively). A comparison between rows 1 and 2 in Table D.1 shows that our architecture optimizations help achieve a better macro average overall.