VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts

Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Furu Wei

Introduction

Vision-Language (VL) pre-training learns generic cross-modal representations from large-scale image-text pairs. Previous models usually employ image-text matching, image-text contrastive learning, masked region classification/feature regression, word-region/patch alignment and masked language modeling to aggregate and align visual and linguistic information. Then the pretrained models can be directly fine-tuned on downstream vision-language tasks, such as VL retrieval and classification (visual question answering, visual reasoning, etc.).

Two mainstream architectures are widely used in previous work. CLIP and ALIGN adopt a dual-encoder architecture to encode images and text separately. Modality interaction is handled by the cosine similarity of the image and text feature vectors. The dual-encoder architecture is effective for retrieval tasks, especially over large collections of images and text, because the feature vectors of images and text can be pre-computed and stored. However, the shallow interaction between images and text is not enough to handle complex VL classification tasks. ViLT finds that CLIP gives a relatively low accuracy on the visual reasoning task. Another line of work relies on a fusion encoder with cross-modal attention to model image-text pairs. Multi-layer Transformer networks are usually employed to fuse image and text representations. The fusion-encoder architecture achieves superior performance on VL classification tasks. However, it must jointly encode all possible image-text pairs to compute similarity scores for retrieval tasks. This quadratic time complexity leads to a much slower inference speed than that of dual-encoder models, whose time complexity is linear.

To take advantage of both types of architectures, we propose a unified Vision-Language pretrained Model (VLMo) that can be used either as a dual encoder to separately encode images and text for retrieval tasks, or as a fusion encoder to model the deep interaction of image-text pairs for classification tasks. This is achieved by introducing the Mixture-of-Modality-Experts (MoME) Transformer, which can encode various modalities (images, text, and image-text pairs) within a Transformer block. MoME employs a pool of modality experts to replace the feed-forward network in the standard Transformer. It captures modality-specific information by switching to different modality experts, and uses self-attention shared across modalities to align visual and linguistic information. Specifically, the MoME Transformer consists of three modality experts, namely a vision expert for image encoding, a language expert for text encoding, and a vision-language expert for image-text fusion. Thanks to this modeling flexibility, we can reuse the MoME Transformer with shared parameters for different purposes, i.e., as a text-only encoder, an image-only encoder, and an image-text fusion encoder.

VLMo is jointly learned with three pre-training tasks, namely image-text contrastive learning, image-text matching, and masked language modeling. In addition, we propose a stagewise pre-training strategy to effectively leverage large-scale image-only and text-only corpora in addition to image-text pairs in VLMo pre-training. We first pretrain the vision experts and self-attention modules of the MoME Transformer on image-only data using masked image modeling as proposed in BEiT. We then pretrain the language experts on text-only data using masked language modeling. Finally, the model is used to initialize vision-language pre-training. Because image-text pairs are limited in size and their captions are simple and short, stagewise pre-training on large amounts of image-only and text-only data helps VLMo learn more generalizable representations.

Experimental results demonstrate that VLMo achieves state-of-the-art results on vision-language retrieval and classification tasks. Our model, used as a dual encoder, outperforms fusion-encoder-based models while enjoying a much faster inference speed on retrieval tasks. Moreover, our model also achieves state-of-the-art results on visual question answering (VQA) and natural language for visual reasoning (NLVR2), where VLMo is used as a fusion encoder.

Our main contributions are summarized as follows:

We propose a unified vision-language pretrained model VLMo that can be used as a fusion encoder for classification tasks, or fine-tuned as a dual encoder for retrieval tasks.

We introduce a general-purpose multimodal Transformer for vision-language tasks, namely MoME Transformer, to encode different modalities. It captures modality-specific information by modality experts, and aligns contents of different modalities by the self-attention module shared across modalities.

We show that stagewise pre-training using large amounts of image-only and text-only data greatly improves our vision-language pretrained model.

Related Work

Pre-training with Transformer backbone networks has substantially advanced the state of the art across natural language processing, computer vision and vision-language tasks.

Approaches to vision-language pre-training can be divided into two categories. The first category utilizes a dual encoder to encode images and text separately, and uses cosine similarity or a linear projection layer to model the interaction between images and text. Image-text contrastive learning is usually employed to optimize the model. Dual-encoder models are effective for vision-language retrieval tasks. However, the simple interaction is not enough to handle tasks that require complex reasoning, such as visual reasoning and visual question answering (VL classification tasks). The second category models the interaction of images and text using a deep fusion encoder with cross-modal attention. Image-text matching, masked language modeling, word-region/patch alignment, masked region classification and feature regression are widely used to train fusion-encoder-based models. These models achieve better performance on vision-language classification tasks, while the joint encoding of all image-text pairs leads to a slow inference speed for retrieval tasks. A large portion of fusion-encoder-based models rely on an off-the-shelf object detector like Faster R-CNN to obtain image region features. Generating region features slows down inference and renders the approach less scalable. Recently, Pixel-BERT removes the object detector and encodes images into grid features by convolutional neural networks. ALBEF employs an image Transformer to obtain the representations of images, and uses a text Transformer to learn the contextualized representations of text. These representations are then fused by cross-modal attention. ViLT encodes images into patch embeddings, and then feeds the concatenation of image patch embeddings and word embeddings into a Transformer network to learn contextualized representations and model the interaction of images and text.

Different from previous work, our unified pre-training using the shared MoME Transformer enables the model to perform separate encoding for retrieval tasks, and to jointly encode image-text pairs to capture deeper interaction for classification tasks. Our model achieves competitive performance, while enjoying a faster inference speed for both retrieval and classification tasks.

Methods

Given image-text pairs, VLMo obtains image-only, text-only and image-text pair representations by the MoME Transformer network. As shown in Figure 1, the unified pre-training optimizes shared MoME Transformer with image-text contrastive learning on image-only and text-only representations, image-text matching and masked language modeling on image-text pair representations. Thanks to the modeling flexibility, the model can be used as a dual encoder for retrieval tasks to encode images and text separately during fine-tuning. It can also be fine-tuned as a fusion encoder to model deeper modality interaction of images and text for classification tasks.

Given an image-text pair, we encode the pair into image, text and image-text vector representations. These representations are then fed into the MoME Transformer to learn contextualized representations and align image and text feature vectors.

Text Representations

Image-Text Representations

We concatenate image and text input vectors to form the image-text input representations ${\bm{H}}_{0}^{vl}=[{\bm{H}}_{0}^{w};{\bm{H}}_{0}^{v}]$.

Mixture-of-Modality-Experts Transformer

Inspired by mixture-of-experts networks, we propose a general-purpose multimodal Transformer for vision-language tasks, namely the MoME Transformer, to encode different modalities. The MoME Transformer introduces a mixture of modality experts as a substitute for the feed-forward network of the standard Transformer. Given the previous layer's output vectors ${\bm{H}}_{l-1},l\in[1,L]$, each MoME Transformer block captures modality-specific information by switching to a different modality expert, and employs multi-head self-attention (MSA) shared across modalities to align visual and linguistic contents: ${\bm{H}}^{\prime}_{l}=\mathrm{MSA}(\mathrm{LN}({\bm{H}}_{l-1}))+{\bm{H}}_{l-1}$ and ${\bm{H}}_{l}=\text{MoME-FFN}(\mathrm{LN}({\bm{H}}^{\prime}_{l}))+{\bm{H}}^{\prime}_{l}$, where LN is short for layer normalization.
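The expert switch can be illustrated with a minimal sketch. It assumes that image-only and text-only inputs always use their own expert, and that image-text pairs use the vision-language expert only in the top fusion layers. The function name `select_experts`, the labels `V-FFN` and `L-FFN`, and the parameter `num_fusion_layers` are hypothetical and chosen for illustration.

```python
from typing import List


def select_experts(modalities: List[str], layer: int,
                   num_layers: int, num_fusion_layers: int) -> List[str]:
    """Return the expert used for each token at a given 1-based layer.

    modalities: per-token tags, each either "image" or "text".
    Assumption: the vision-language expert (VL-FFN) is applied only to
    image-text pair inputs, and only in the top `num_fusion_layers` layers.
    """
    if not 1 <= layer <= num_layers:
        raise ValueError("layer must be in [1, num_layers]")
    is_pair = len(set(modalities)) > 1            # both modalities present
    in_top_layers = layer > num_layers - num_fusion_layers
    if is_pair and in_top_layers:
        return ["VL-FFN"] * len(modalities)
    # Otherwise each token is routed to the expert of its own modality.
    return ["V-FFN" if m == "image" else "L-FFN" for m in modalities]
```

With a 12-layer model and two fusion layers, an image-text pair would be routed to the modality-specific experts in layers 1 to 10 and to the vision-language expert in layers 11 and 12, while the self-attention parameters stay shared throughout.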

Pre-Training Tasks

VLMo is jointly pretrained by image-text contrastive learning on the image and text representations, masked language modeling and image-text matching on the image-text pair representations with shared parameters.

Given a batch of $N$ image-text pairs, image-text contrastive learning aims to predict the matched pairs from $N\times N$ possible image-text pairs. There are $N^{2}-N$ negative image-text pairs within a training batch.

The final output vectors of the [I_CLS] token and the [T_CLS] token are used as the aggregated representations of the image and text, respectively. After a linear projection and normalization, we obtain image vectors $\{\hat{{\bm{h}}}^{v}_{i}\}_{i=1}^{N}$ and text vectors $\{\hat{{\bm{h}}}^{w}_{i}\}_{i=1}^{N}$ in a training batch, which are used to compute the image-to-text and text-to-image similarities.
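A minimal sketch of an in-batch contrastive objective over these vectors follows. It assumes dot-product similarities between L2-normalized vectors, a temperature-scaled softmax, and an average of the two directions. The fixed temperature of 0.07 and the function names are illustrative assumptions, not values taken from this paper.

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def contrastive_loss(image_vecs: List[Sequence[float]],
                     text_vecs: List[Sequence[float]],
                     temperature: float = 0.07) -> float:
    """Symmetric image-text contrastive loss over a batch of N matched pairs.

    Pair i is (image_vecs[i], text_vecs[i]); the other N*N - N combinations
    in the batch act as negatives.
    """
    n = len(image_vecs)
    img = [l2_normalize(v) for v in image_vecs]
    txt = [l2_normalize(v) for v in text_vecs]
    # sim[i][j]: scaled similarity between image i and text j.
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]
    i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)
              for j in range(n)) / n
    return 0.5 * (i2t + t2i)
```

The loss is close to zero when every image is most similar to its own text and grows when a negative pair scores higher than the matched pair.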

Masked Language Modeling

Following BERT, we randomly choose tokens in the text sequence and replace them with the [MASK] token. The model is trained to predict these masked tokens from all the other unmasked tokens and the visual clues. We use a $15\%$ masking probability as in BERT. The final output vectors of the masked tokens are fed into a classifier over the whole text vocabulary with cross-entropy loss.
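The masking step can be sketched as follows. This simplified version masks each token independently with the stated probability and always substitutes [MASK]. It omits whole word masking and any random or unchanged replacements, and the function name `mask_tokens` is hypothetical.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(tokens: List[str], mask_prob: float = 0.15,
                rng: Optional[random.Random] = None
                ) -> Tuple[List[str], List[Optional[str]]]:
    """Replace randomly chosen tokens with [MASK].

    Returns the corrupted sequence and per-position labels: the original
    token where a mask was applied, None elsewhere (no loss is computed there).
    """
    rng = rng or random.Random()
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)       # prediction target
        else:
            masked.append(tok)
            labels.append(None)      # position ignored by the loss
    return masked, labels
```

In the fusion setting the corrupted text is encoded together with the image, so the classifier can use visual context when recovering the masked positions.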

Image-Text Matching

Image-text matching aims to predict whether an image and a text are matched. We use the final hidden vector of the [T_CLS] token to represent the image-text pair, and feed the vector into a classifier with cross-entropy loss for binary classification. Inspired by ALBEF, we sample hard negative image-text pairs based on the contrastive image-to-text and text-to-image similarities. ALBEF samples hard negatives from the training examples of a single GPU, which we call local hard negative mining. In contrast, we propose global hard negative mining, which samples hard negative image-text pairs from the larger pool of training examples gathered from all GPUs. Global hard negative mining can find more informative image-text pairs and significantly improves our model.
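The difference between local and global mining lies in the candidate pool. The sketch below assumes negatives are drawn with probability proportional to a softmax over the contrastive similarities, which is one plausible reading of "based on the similarities". The helper names are hypothetical, and in a real distributed setup the gathering step would be a collective operation across devices rather than a list concatenation.

```python
import math
import random
from typing import List, Sequence


def gather_candidates(per_gpu_scores: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-GPU similarity scores into one global candidate row."""
    return [s for shard in per_gpu_scores for s in shard]


def sample_hard_negative(sim_row: Sequence[float], positive_index: int,
                         rng: random.Random) -> int:
    """Sample a negative index, favoring candidates with high similarity.

    sim_row[j] is the similarity between one query (e.g. an image) and
    candidate j (e.g. a text); the matched candidate is never returned.
    """
    candidates = [j for j in range(len(sim_row)) if j != positive_index]
    m = max(sim_row[j] for j in candidates)
    # Softmax weights over the negatives only (shifted for stability).
    weights = [math.exp(sim_row[j] - m) for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Local mining would call `sample_hard_negative` on the scores of a single shard, while global mining would call it on the output of `gather_candidates`, where a larger pool makes it more likely that a highly similar negative is available.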

Stagewise Pre-Training

We introduce a stagewise pre-training strategy, which leverages large-scale image-only and text-only corpora to improve the vision-language model. As presented in Figure 2, we first perform vision pre-training on image-only data, and then perform language pre-training on text-only data to learn general image and text representations. The resulting model is used to initialize the vision-language pre-training, which learns the alignment of visual and linguistic information. For vision pre-training, we train the attention module and vision expert of the MoME Transformer as in BEiT on image-only data. We directly utilize the pretrained parameters of BEiT to initialize the attention module and vision expert. For language pre-training, we freeze the parameters of the attention module and vision expert, and utilize masked language modeling to optimize the language expert on text-only data. Compared with image-text pairs, image-only and text-only data are easier to collect. In addition, the text data of image-text pairs is usually short and simple. Pre-training on image-only and text-only corpora improves generalization to complex pairs.
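The schedule of trainable modules can be summarized in a small sketch. The module names are hypothetical labels, and the final stage is assumed to update every module, which the description above does not state explicitly.

```python
from typing import Dict, List, Set, Tuple

MODULES: List[str] = ["self_attention", "vision_expert",
                      "language_expert", "vl_expert"]

# (stage name, training data, modules that receive gradient updates)
STAGES: List[Tuple[str, str, Set[str]]] = [
    ("vision pre-training", "image-only",
     {"self_attention", "vision_expert"}),
    ("language pre-training", "text-only",
     {"language_expert"}),                      # attention and vision expert frozen
    ("vision-language pre-training", "image-text pairs",
     set(MODULES)),                             # assumption: all modules trained
]


def trainable_flags(stage_index: int) -> Dict[str, bool]:
    """Map each module name to whether it is updated in the given stage."""
    _, _, trainable = STAGES[stage_index]
    return {name: name in trainable for name in MODULES}
```

For example, `trainable_flags(1)` marks only the language expert as trainable, which mirrors the frozen attention module and vision expert during text-only pre-training.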

Fine-Tuning VLMo on Downstream Tasks

As presented in Figure 3, our model can be fine-tuned to adapt to various vision-language retrieval and classification tasks.

For classification tasks such as visual question answering and visual reasoning, VLMo is used as a fusion encoder to model modality interaction of images and text. We use the final encoding vector of the token [T_CLS] as the representation of the image-text pair, and feed it to a task-specific classifier layer to predict the label.

Vision-Language Retrieval

For retrieval tasks, VLMo can be used as a dual encoder to encode images and text separately. During fine-tuning, our model is optimized for the image-text contrastive loss. During inference, we compute representations of all images and text, and then use dot product to obtain image-to-text and text-to-image similarity scores of all possible image-text pairs. Separate encoding enables a much faster inference speed than fusion-encoder-based models.
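A small sketch makes the speed argument concrete. It ranks candidate texts for one image by dot product over pre-computed vectors and counts encoder forward passes for the two architectures. The function names and the collection sizes in the usage note are illustrative assumptions.

```python
from typing import List, Sequence


def rank_texts(image_vec: Sequence[float],
               text_vecs: List[Sequence[float]]) -> List[int]:
    """Return text indices sorted by dot-product similarity, best first."""
    scores = [sum(a * b for a, b in zip(image_vec, t)) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda j: scores[j], reverse=True)


def encoder_passes(num_images: int, num_texts: int, dual_encoder: bool) -> int:
    """Count encoder forward passes needed to score all image-text pairs.

    A dual encoder encodes every image and every text once; a fusion
    encoder must encode every (image, text) combination jointly.
    """
    if dual_encoder:
        return num_images + num_texts
    return num_images * num_texts
```

For a hypothetical collection of 1,000 images and 5,000 texts, the dual encoder needs 6,000 forward passes followed by cheap dot products, whereas a fusion encoder needs 5,000,000 joint forward passes.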

Experiments

We pretrain our model using large-scale image-text pairs and evaluate the model on visual-linguistic classification and retrieval tasks.

Following previous work, our pre-training data consists of four image captioning datasets: Conceptual Captions (CC), SBU Captions, COCO and Visual Genome (VG). There are about $4$M images and $10$M image-text pairs in the pre-training data.

Our models adopt the same network configuration as ViT and BEiT. VLMo-Base consists of $12$ Transformer blocks with a hidden size of $768$ and $12$ attention heads. VLMo-Large is a $24$-layer Transformer network with a hidden size of $1024$ and $16$ attention heads. The intermediate size of the feed-forward networks is $3072$ and $4096$ for the base-size and large-size models, respectively. VLMo-Base uses the vision-language expert on the top two Transformer layers, and VLMo-Large uses it on the top three layers. For images, the input resolution is $224\times 224$ and the patch size is $16\times 16$ during pre-training. We apply RandAugment to the input images. The tokenizer of the uncased version of BERT is employed to tokenize the text. The maximum text sequence length is set to $40$. We also employ whole word masking for the masked language modeling pre-training task. We pretrain the models for $200$k steps with a batch size of $1024$. We use the AdamW optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.98$. The peak learning rate is 2e-4 for the base-size model and 5e-5 for the large-size model. Weight decay is set to $0.01$. We use linear warmup over the first $2.5$k steps followed by linear decay. The vision-language pre-training of the base-size model takes about two days using 64 Nvidia Tesla V100 32GB GPU cards, and that of the large-size model takes about three days using 128 Nvidia Tesla V100 32GB GPU cards.
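The learning-rate schedule described above can be sketched as a function of the training step. The defaults use the base-size settings, and the sketch assumes that the rate starts at zero and decays linearly to zero at the final step, since the end value is not stated. The function name is hypothetical.

```python
def learning_rate(step: int, peak_lr: float = 2e-4,
                  warmup_steps: int = 2500,
                  total_steps: int = 200_000) -> float:
    """Linear warmup to peak_lr, then linear decay.

    Assumption: the rate is 0 at step 0 and reaches 0 again at total_steps.
    """
    if step < warmup_steps:
        # Warmup phase: rate grows linearly from 0 to the peak.
        return peak_lr * step / warmup_steps
    # Decay phase: rate shrinks linearly over the remaining steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)
```

Under these assumptions the peak of 2e-4 is reached at step 2,500, and passing `peak_lr=5e-5` gives the corresponding schedule for the large-size model.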

Training on Larger-scale Datasets

We scale up vision-language representation learning by training VLMo-Large on one billion noisy web image-text pairs with a larger batch size. We first pretrain the model for 200k steps with a batch size of 16k, and then continue training the model for 100k steps with a batch size of 32k. The other hyper-parameters are the same as for training on the 4M data. Please refer to the supplementary material for more details of the hyper-parameters used for pre-training and fine-tuning.

Evaluation on Vision-Language Classification Tasks

We first conduct fine-tuning experiments on two widely used classification datasets: visual question answering and natural language for visual reasoning. The model is fine-tuned as a fusion encoder to model deeper interaction.

For VQA, given a natural image and a question, the task is to generate or choose the correct answer. We train and evaluate the model on the VQA 2.0 dataset. Following common practice, we convert VQA 2.0 to a classification task, and choose the answer from a shared set consisting of $3,129$ answers. We use the final encoding vector of the [T_CLS] token as the representation of the image-question pair and feed it to a classifier layer to predict the answer.

Natural Language for Visual Reasoning (NLVR2)

The NLVR2 dataset requires the model to predict whether a text description is true about a pair of images. Following OSCAR and VinVL, we convert the triplet input to two image-text pairs, each containing the text description and one image. We concatenate the final output vectors of the [T_CLS] token of the two input pairs. The concatenated vector is then fed into a classification layer to predict the label.
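The triplet conversion can be sketched as follows. The `encode_pair` argument is a hypothetical stand-in for the fusion encoder that returns the final [T_CLS] vector of one image-text pair. The sketch only shows how the classifier input is assembled.

```python
from typing import Callable, List, Sequence, TypeVar

Image = TypeVar("Image")


def nlvr2_features(encode_pair: Callable[[Image, str], Sequence[float]],
                   image_a: Image, image_b: Image, text: str) -> List[float]:
    """Build the classifier input for one NLVR2 example.

    The (image_a, image_b, text) triplet becomes two image-text pairs that
    share the same text; their [T_CLS] vectors are concatenated.
    """
    cls_a = encode_pair(image_a, text)
    cls_b = encode_pair(image_b, text)
    return list(cls_a) + list(cls_b)
```

The resulting vector has twice the hidden size of a single pair representation, so the classification layer would need an input dimension of twice the encoder width.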

We present the results of VL classification tasks in Table 1. VLMo achieves state-of-the-art performance and substantially outperforms previous methods. Our large-size model even outperforms SimVLM-Huge and Florence-Huge by a large margin, although both have more parameters and are also trained on larger-scale image-text pairs. Our model uses a simple linear projection to embed images as in ViLT. This leads to a significant speedup compared with previous models using image region features, which are extracted by an off-the-shelf object detector.

Evaluation on Vision-Language Retrieval Tasks

The retrieval tasks contain image-to-text retrieval and text-to-image retrieval. We evaluate the model on the widely used COCO and Flickr30K datasets, and use the Karpathy split for both datasets. The model is used as a dual encoder for retrieval tasks. We encode images and text separately and compute their similarity scores by the dot product of image and text vectors.

As presented in Table 2, VLMo achieves competitive performance with previous fusion-encoder-based models while being much faster. Fusion-encoder-based models need to jointly encode all possible image-text pairs to compute their similarity scores, which requires quadratic time complexity. Moreover, our large-size model even outperforms the huge-size model of Florence, which is also trained on massive image-text pairs using a larger batch size. VLMo pre-training can effectively leverage larger-scale noisy pairs and benefit from large-batch training.

Evaluation on Vision Tasks

As shown in Table 3, we use VLMo as an image-only encoder and evaluate it on image classification (ImageNet) and semantic segmentation (ADE20K) tasks. The model also achieves competitive performance, even slightly better than the BEiT model used for the initialization of VLMo. The image resolution is $224\times 224$ for ImageNet and $512\times 512$ for ADE20K. We perform intermediate fine-tuning on ImageNet-21k for all three models.

Ablation Studies

We first conduct ablation experiments on stagewise pre-training. ViLT shows that using the ViT model pretrained on image-only data as the initialization achieves better performance than the BERT model pretrained on text-only data. Therefore we start experiments with image-only pre-training. We compare two initializations: image-only pre-training, and image-only pre-training plus text-only pre-training. For image-only pre-training, we directly use the parameters of BEiT-Base to initialize the self-attention module and all modality experts. For image-only pre-training plus text-only pre-training, we use the pretrained parameters of BEiT-Base to initialize the vision expert and self-attention module of the MoME Transformer, and then pretrain its language expert on text corpora. As shown in Table 4, image-only pre-training plus text-only pre-training improves our vision-language model. We also tried vision-language pre-training with random initialization, but obtained a relatively low accuracy on downstream tasks. Stagewise pre-training effectively leverages large-scale image-only and text-only corpora, and improves our vision-language pre-training. Moreover, given the limited size of the image-text pairs we used during pre-training, stagewise pre-training on image-only and text-only data alleviates the need for image-text pair data.

MoME Transformer

We also conduct ablation experiments on the MoME Transformer. We employ ViT-Base to initialize the models for the ablation experiments. As presented in Table 5, using the MoME Transformer achieves better performance than the standard Transformer on both retrieval and classification tasks. In addition, we analyse the contribution of the vision-language expert (VL-FFN) used in the MoME Transformer by removing it from the top Transformer layers. Experimental results demonstrate that introducing the vision-language expert improves the model, as it captures more modality interaction. The shared self-attention module used in MoME also contributes positively to our model. Appendix A presents the ablation study of the shared self-attention module.

Pre-Training Tasks

We perform ablation studies to analyse the contribution of different pre-training tasks, and the results are presented in Table 5. Compared with the model trained using only the image-text contrastive loss, our unified training performs much better across classification and retrieval tasks. Introducing image-text matching with hard negative mining also greatly improves the model. This demonstrates the effectiveness of our unified training framework with the MoME Transformer. In addition, experimental results show that masked language modeling contributes positively to our model. Please refer to the supplementary material for more ablation studies.

Global Hard Negative Mining

Unlike ALBEF, which samples hard negatives from the training examples of a single GPU (local hard negative mining), we perform hard negative mining from more candidates by gathering the training examples of all GPUs (global hard negative mining). As shown in Table 6, global hard negative mining brings significant improvements.

Conclusion

In this work, we propose a unified vision-language pretrained model VLMo, which jointly learns a dual encoder and a fusion encoder with a shared MoME Transformer backbone. MoME introduces a pool of modality experts to encode modality-specific information, and aligns different modalities using the shared self-attention module. The unified pre-training with MoME enables the model to be used as a dual encoder for efficient vision-language retrieval, or as a fusion encoder to model cross-modal interactions for classification tasks. We also show that stagewise pre-training that leverages large-scale image-only and text-only corpus greatly improves vision-language pre-training. Experimental results demonstrate that VLMo outperforms previous state-of-the-art models on various vision-language classification and retrieval benchmarks.

In the future, we would like to work on improving VLMo from the following perspectives:

We will scale up the model size used in VLMo pre-training.

We are also interested in fine-tuning VLMo for vision-language generation tasks, such as image captioning, following the method proposed in UniLM.

We are going to explore to what extent vision-language pre-training can help each individual modality, especially as the shared MoME backbone naturally blends text and image representations.

We can extend the proposed model to integrate more modalities (e.g., speech, video, and structured knowledge), supporting general-purpose multimodal pre-training.


Appendix A Ablation Study of Shared Self-Attention

Table 7 presents the ablation study of the shared self-attention module used in the MoME Transformer for encoding image patches and text tokens. We compare shared self-attention with separate self-attention, which encodes image patches and text tokens using different attention parameters on the first $L-F$ layers. The shared self-attention used in MoME achieves better performance. The shared self-attention module helps VLMo learn the alignment of different modalities, and fuse images and text at the bottom layers for classification tasks.

Appendix B Hyperparameters for Text-Only Pre-Training

For the text-only pre-training data, we use English Wikipedia and BookCorpus. The AdamW optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.98$ is used to train the models. The maximum sequence length is set to $196$. The batch size is $1024$, and the peak learning rate is 2e-4. We set the weight decay to $0.01$. For the base-size model, we train the model for $500$k steps. The large-size model is trained for $200$k steps.

Appendix C Hyperparameters for Vision-Language Classification Fine-Tuning

We fine-tune the models for $10$ epochs with a batch size of $128$. The peak learning rate is 3e-5 for the base-size model, and 1.5e-5 for the large-size model. Following SimVLM, the input image resolution is $480\times 480$. For VLMo-Large++, we use $768\times 768$ image resolution.

Natural Language for Visual Reasoning (NLVR2)

For the results in Table 1, the models are fine-tuned for $10$ epochs with a batch size of $128$. The peak learning rates of the base-size and large-size models are set to 5e-5 and 3e-5, respectively. The input image resolution is $384\times 384$. For ablation experiments, we fine-tune the models for $10$ epochs with a batch size of $128$, and choose learning rates from {5e-5, 1e-4}. The input image resolution is $224\times 224$. All the ablation results of NLVR2 are averaged over $3$ runs.

Appendix D Hyperparameters for Vision-Language Retrieval Fine-Tuning

We fine-tune the base-size model for $20$ epochs and large-size model for $10$ epochs with $2048$ batch size. The peak learning rate is 2e-5 for the base-size model and 1e-5 for the large-size model. The input image resolution is $384\times 384$ .

Flickr30K

For results of Table 2, the base-size and large-size models are fine-tuned for $40$ epochs with a batch size of $2048$ and a peak learning rate of 1e-5. We use the fine-tuned model on COCO as the initialization. The input image resolution is $384\times 384$ . For all ablation experiments, we fine-tune the models for $10$ epochs with $1024$ batch size. The peak learning rate is set to 5e-5, and the input image resolution is $224\times 224$ .