mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections
Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou, Luo Si
Introduction
Large-scale pre-training of vision-language models has recently achieved tremendous success on a wide range of cross-modal tasks. Such vision-language models learn cross-modal representations from large quantities of image-text pairs by aligning the visual and linguistic modalities. A great challenge in learning vision-language models is to find a good alignment between the two modalities that closes the semantic gap between them.
To discover a cross-modal alignment, prior studies employ a pre-trained object detector to extract salient regions from images, which are then aligned with their language counterparts. Such an architecture, however, is generally limited by the power of the object detector, the pre-defined visual semantics it can represent, and the quantity of annotations available. Besides, it is also computationally expensive to extract region-based visual features from high-resolution (e.g. 600×1000) images. More recent work, which scales and performs better on many vision-language tasks, drops the requirement of a pre-trained object detector and enables a direct alignment between the image and text representations in an end-to-end manner. These models extract finer-grained visual representations with a long sequence of image patches or grids for good vision understanding. However, there exist two significant problems in modeling long visual sequences: 1) efficiency: full self-attention on long visual sequences requires much more computation than that on textual sequences, and 2) information asymmetry: the caption text in widely-used image-text pre-training data is usually short and highly abstract, while more detailed and diverse information can be extracted from the image. This asymmetry presents challenges for effective multi-modal fusion between the modalities.
One straightforward way of multi-modal fusion is the connected-attention network, as shown in Figure 1 (a). It adopts a single Transformer network for early fusion of vision and language by simply taking the concatenation of visual and linguistic features as input. This paradigm allows self-attention to discover alignments between the modalities from the bottom level, but it requires full self-attention on the concatenation of cross-modal sequences, which is rather time-consuming. Besides, this type of method processes information from both modalities equally, and may therefore suffer from the information asymmetry, especially when there is a big difference in information density or sequence length between the modalities.
Another line of work keeps separate Transformer networks for textual and visual features, and uses techniques such as cross-attention to enable cross-modal interaction, as shown in Figure 1 (b). This architecture design conducts multi-modal fusion on both modalities independently, which can help alleviate the information asymmetry problem. However, it still suffers from the computational inefficiency of full self-attention on long visual sequences, and it is not parameter-efficient because it keeps two separate Transformer networks.
In this work, we propose mPLUG, a unified Multi-modal Pre-training framework for both vision-Language Understanding and Generation. mPLUG performs effective and efficient vision-language learning with novel cross-modal skip-connections to address the fundamental information asymmetry problem. Instead of fusing visual and linguistic representations at the same levels, the cross-modal skip-connections enable the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. They create inter-layer shortcuts that skip a certain number of layers for visual representations to reflect the semantic richness of language compared to vision. As shown in Figure 1 (c), in each block of our cross-modal skip-connected network, mPLUG first adopts an asymmetric co-attention architecture at the first few layers for efficiency, by removing the co-attention on the vision side. It is then followed by one layer of connected-attention, which takes as input the concatenation of the original visual representation and the co-attention output on the language side. In addition to the modeling efficacy due to the asymmetry, the cross-modal skip-connections ease model training by alleviating vanishing gradients with the inserted shortcuts. Figure 1 shows that the new cross-modal skip-connected network achieves superior performance with a speed-up of at least four times over other cross-modal fusion networks.
Our key contributions can be summarized as follows:
We propose mPLUG, a unified vision-language pre-trained model for cross-modal understanding and generation that is both effective and efficient in cross-modal learning.
We introduce a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation inefficiency in multi-modal fusion.
mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks, including image captioning, image-text retrieval, visual grounding and visual question answering. mPLUG also demonstrates strong zero-shot transferability when directly transferred to a wide range of vision-language and video-language tasks.
Related Work
Vision-Language pre-training (VLP) has recently achieved tremendous success and state-of-the-art results across a variety of vision-language tasks. In terms of how information from different modalities is aggregated, typical approaches to VLP can be roughly divided into two categories: dual encoder and fusion encoder. The dual encoder approach utilizes two single-modal encoders to encode images and text separately, and then uses simple functions such as the dot product to model the instance-level cross-modal interaction between image and text. The advantage of dual encoder models like CLIP and ALIGN is that image and text representations can be pre-computed and cached, which is computationally efficient and well suited to retrieval tasks. However, they tend to fail on more complicated VL understanding tasks that require complex reasoning, such as visual question answering. In contrast, the fusion encoder approach uses deep fusion functions such as multi-layer self-attention and cross-attention networks to model the fine-grained cross-modal interaction between image and text sequences. Representative methods of this category include single-stream architectures such as UNITER and OSCAR, and two-stream architectures such as LXMERT, ALBEF and ERNIE-ViL. Methods of this kind can better capture the underlying association between image and text for vision-language understanding tasks, but they need to jointly encode all possible image-text pairs, which leads to a relatively slow inference speed.
To improve the inference speed, some recent work such as Pixel-BERT, E2E-VLP and ViLT removes the complicated object detector from feature extraction, and conducts end-to-end VL learning with CNN-based grid features (Pixel-BERT, E2E-VLP) or linearly projected patch embeddings (ViLT). To combine the benefits of both categories of architectures, VLMo further unifies the dual encoder and fusion encoder modules with a shared mixture-of-modality-experts Transformer. In this work, mPLUG introduces a new cross-modal fusion mechanism with cross-modal skip-connections, which enables the fusion to occur at disparate levels in the abstraction hierarchy across the modalities. It achieves superior effectiveness and efficiency across a wide range of VL tasks.
2.2 Skip-connection
Skip-connection is a popular technique for bypassing the exploding or vanishing gradient problem when optimizing deep neural networks, and is widely used in CV and NLP architectures such as ResNet and Transformer. A variety of skip-connection methods have been proposed in recent years. ResNet introduces summed shortcut connections between different layers using a simple identity mapping, while the highway network designs a transform gating function to control the balance between the input and the transformed input. DenseNet designs new architectures with concatenated skip-connections, allowing subsequent layers to re-use all the intermediate representations of previous layers. Layer normalization and recursive skip-connections are further used in combination with plain skip-connections to stabilize model optimization and better incorporate the transformed input. In this work, mPLUG proposes a new cross-modal skip-connection method to address the cross-modal fusion problem. It combines the concatenated skip-connection and the summed skip-connection to choose, at each layer, whether to attend to all the concatenated representations of the different modalities or to focus only on the cross-modal interaction part.
mPLUG
In this section, we will first introduce our new model architecture with the key module of the cross-modal skip-connected network, and then give the details of the pre-training objectives and scalable training infrastructure.
As shown in Figure 2, mPLUG consists of two unimodal encoders for image and text respectively, a cross-modal skip-connected network, and a decoder for text generation. To better model the inherent modality bias information, we first use the two unimodal encoders to encode image and text separately. Following prior work, we apply a visual transformer directly to the image patches as the visual encoder, which is more computation-friendly than using pre-trained object detectors for visual feature extraction. The visual encoder divides an input image into patches and encodes them as a sequence of embeddings with an additional [CLS] token. The input text is fed to the text encoder and represented as a sequence of embeddings, in which the embedding of the [CLS] token is used to summarize the input text. Then, the visual and linguistic representations are fed into a cross-modal skip-connected network, which consists of multiple skip-connected fusion blocks. In each skip-connected fusion block, we apply connected cross-modal fusion after every S asymmetric co-attention layers, where S is a fixed stride value. The aim of this network is to take advantage of the effectiveness of the connected cross-modal fusion and the efficiency of the asymmetric co-attention for enhanced cross-modal fusion in a recursive manner. Finally, the output cross-modal representations are fed into a Transformer decoder for sequence-to-sequence learning, which equips mPLUG with both understanding and generation capabilities.
3.2 Cross-modal Skip-connected Network
The cross-modal skip-connected network consists of several skip-connected fusion blocks. In each skip-connected fusion block, we apply one connected-attention layer after every S asymmetric co-attention layers, where S is a fixed stride value. We first pass the text feature and image feature from the unimodal encoders through the asymmetric co-attention layers, and then feed the output text feature together with the image feature into one connected-attention layer. The skip-connected fusion block is repeated to obtain the final connected image and text representation.
Specifically, the asymmetric co-attention layer is composed of a self-attention (SA) layer, a cross-attention (CA) layer and a feed-forward network (FFN). The input text feature is first fed to the self-attention layer, and the visual feature is then injected into the text feature by the cross-attention layer. The outputs of self-attention and cross-attention are added up and fed to the FFN layer, which produces the visual-aware text representation. Layer normalization (LN) is used in these layers.
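The asymmetric co-attention layer described above can be written compactly as follows. This is a sketch under assumed notation: $l^{n-1}$ denotes the input text feature, $v^{n-1}$ the visual feature, and $l^{n}$ the visual-aware text output. Applying LN after each residual sum is an assumption consistent with the description, not a form stated in the text.

```latex
\begin{aligned}
l^{n}_{SA} &= \mathrm{LN}\big(\mathrm{SA}(l^{n-1}) + l^{n-1}\big) \\
l^{n}_{CA} &= \mathrm{LN}\big(\mathrm{CA}(l^{n}_{SA},\, v^{n-1}) + l^{n}_{SA}\big) \\
l^{n} &= \mathrm{LN}\big(\mathrm{FFN}(l^{n}_{CA}) + l^{n}_{CA}\big)
\end{aligned}
```

Note that only the text feature is updated in this layer; the visual feature is read by the cross-attention but is itself left unchanged.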
The connected-attention layer is composed of a self-attention (SA) layer and a feed-forward network (FFN). We concatenate the image feature with the text feature output by the asymmetric co-attention layers. The concatenated image and text features are then fed to the self-attention layer and the FFN layer.
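The connected-attention layer can likewise be sketched in equation form. The notation is assumed for illustration: $[v^{n-1}; l^{n-1}]$ denotes the concatenation of the image feature and the text feature coming out of the asymmetric co-attention layers, and the residual-plus-LN placement is an assumption.

```latex
\begin{aligned}
[v^{n}_{SA};\, l^{n}_{SA}] &= \mathrm{LN}\big(\mathrm{SA}([v^{n-1};\, l^{n-1}]) + [v^{n-1};\, l^{n-1}]\big) \\
[v^{n};\, l^{n}] &= \mathrm{LN}\big(\mathrm{FFN}([v^{n}_{SA};\, l^{n}_{SA}]) + [v^{n}_{SA};\, l^{n}_{SA}]\big)
\end{aligned}
```

In contrast to the asymmetric co-attention layer, both modalities are updated here, at the cost of full self-attention over the concatenated sequence.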
The connected output is then fed into the next skip-connected fusion block, and this is repeated to obtain the final connected image and text representation. Finally, the connected output is fed into a Transformer decoder for sequence-to-sequence learning.
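The control flow of the network can be illustrated with a small schematic. It is a sketch rather than the authors' implementation: the function name `skip_connected_network`, its parameters, and the stub layers are hypothetical, and token lists stand in for tensors. The stubs only record the order in which layers are called.

```python
from typing import Callable, List, Tuple

Feature = List[str]  # stand-in for a token sequence; real features would be tensors


def skip_connected_network(
    text: Feature,
    image: Feature,
    co_attention: Callable[[Feature, Feature], Feature],
    connected_attention: Callable[[Feature, Feature], Tuple[Feature, Feature]],
    num_blocks: int,
    stride: int,
) -> Tuple[Feature, Feature]:
    """Run num_blocks fusion blocks, each with `stride` co-attention layers and 1 connected layer."""
    for _ in range(num_blocks):
        # Asymmetric co-attention: only the text side is updated. The image
        # feature is read but carried forward unchanged (the skip-connection).
        for _ in range(stride):
            text = co_attention(text, image)
        # Connected attention: both modalities are updated jointly.
        image, text = connected_attention(image, text)
    return image, text


# Stub layers (hypothetical) that log the call order instead of computing attention.
trace: List[str] = []


def demo_co_attention(text: Feature, image: Feature) -> Feature:
    trace.append("co")
    return text


def demo_connected(image: Feature, text: Feature) -> Tuple[Feature, Feature]:
    trace.append("connected")
    return image, text


skip_connected_network(["t"], ["v"], demo_co_attention, demo_connected, num_blocks=2, stride=3)
print(trace)
```

With a stride of 3 and two blocks, the trace shows three co-attention calls followed by one connected-attention call, twice over.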
3.3 Pre-training Tasks
We perform four pre-training tasks including three understanding tasks (Image-Text Contrastive Learning, Image-Text Matching, Masked Language Modeling) and one generation task (Prefix Language Modeling). These pre-training tasks are optimized jointly.
Image-Text Contrastive (ITC): Following prior work, we employ this task to align the image features and the text features from the unimodal encoders. Specifically, we calculate the softmax-normalized image-to-text and text-to-image similarities, and maintain two dynamic memory queues (text, image) to increase the number of negative examples, as in MoCo.
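A minimal sketch of the softmax-normalized contrastive objective is given below. It makes several simplifying assumptions: only in-batch negatives are used (the memory queues are omitted), similarities are given as a plain matrix, and the `temperature` parameter and the function names are hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def itc_loss(sim: List[List[float]], temperature: float) -> float:
    """sim[i][j] is the similarity of image i and text j; pair (i, i) is the positive."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        i2t = softmax([sim[i][j] / temperature for j in range(n)])  # image i against all texts
        t2i = softmax([sim[j][i] / temperature for j in range(n)])  # text i against all images
        loss += -math.log(i2t[i]) - math.log(t2i[i])
    return loss / (2 * n)


aligned = [[1.0, 0.0], [0.0, 1.0]]     # matched pairs score highest
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # mismatched pairs score highest
print(itc_loss(aligned, 0.1) < itc_loss(misaligned, 0.1))
```

The loss is low when each image is most similar to its own caption and high otherwise. Extending the candidate lists with queued features would add more negatives without changing the structure of the computation.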
Image-Text Matching (ITM): This task aims to predict, from the cross-modal representation, whether an image and a sentence match each other. We also select hard negative image-text pairs based on the contrastive text-image similarity, as in prior work.
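One common way to pick hard negatives from contrastive similarities is to sample a non-matching text with probability proportional to its softmax similarity to the image. The sketch below illustrates that idea only; the exact sampling rule used here is not specified in the text, and the function name and arguments are hypothetical.

```python
import math
import random
from typing import List


def sample_hard_negative_text(sim_row: List[float], positive: int, rng: random.Random) -> int:
    """Pick a non-matching text index, favouring texts that score as similar to the image."""
    candidates = [j for j in range(len(sim_row)) if j != positive]
    # exp(similarity) gives softmax weights up to a normalising constant.
    weights = [math.exp(sim_row[j]) for j in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


rng = random.Random(0)
sim_row = [5.0, 4.0, -50.0]  # text 0 is the true caption, text 1 is a near miss
draws = [sample_hard_negative_text(sim_row, positive=0, rng=rng) for _ in range(200)]
print(draws.count(1), draws.count(2))
```

The near-miss caption is drawn far more often than the unrelated one, which gives the matching head more informative negatives than uniform sampling would.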
Masked Language Modeling (MLM): The task setup is basically the same as in BERT: we randomly mask a portion of the tokens in the text, and the model is asked to predict the masked words from the cross-modal representations.
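The masking step can be sketched as follows. This is a simplified illustration: every selected token is replaced by a `[MASK]` placeholder, whereas BERT additionally keeps some selected tokens unchanged or swaps them for random ones. The function name is hypothetical, and `mask_rate` is left as a parameter because the rate is not given here.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(
    tokens: List[str], mask_rate: float, rng: random.Random, mask_token: str = "[MASK]"
) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked tokens, labels); a label is None where no prediction is required."""
    masked: List[str] = []
    labels: List[Optional[str]] = []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)  # the model must recover the original token here
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


tokens = "a dog runs on the beach".split()
masked, labels = mask_tokens(tokens, mask_rate=0.5, rng=random.Random(0))  # 0.5 is illustrative
print(masked)
```

The loss is then computed only at positions whose label is not `None`.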
Prefix Language Modeling (PrefixLM): This task aims to generate the caption given an image, and to predict the text segment that follows the cross-modal context, as in prior work. It optimizes a cross-entropy loss by maximizing the likelihood of the text in an autoregressive manner.
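The autoregressive objective described above has the standard form sketched below. The symbols are assumptions introduced for illustration: $x$ stands for the cross-modal context (the image and any text prefix), $y_1, \dots, y_T$ for the target tokens, and $\theta$ for the model parameters.

```latex
\mathcal{L}_{\mathrm{PrefixLM}} = -\sum_{t=1}^{T} \log P_{\theta}\big(y_t \mid y_{<t},\, x\big)
```

Minimizing this cross-entropy is equivalent to maximizing the likelihood of the target text given the context.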
Distributed Learning on a Large Scale
Training a big model like mPLUG on large-scale datasets faces many efficiency challenges. We increase the throughput from the perspective of reducing memory usage and computation time, thereby accelerating the training of the model.
The memory usage during model training is mainly composed of two parts: the static memory usage of parameters, optimizer states and gradients, and the runtime memory usage caused by intermediate variables such as activation values. For the static memory overhead, we use the ZeRO technique to partition parameters, optimizer states and gradients across the entire data-parallel group, so that the static memory overhead of a single GPU can be reduced to approximately 1/N, where N denotes the number of GPUs. For the runtime memory cost, we use gradient checkpointing, which greatly reduces runtime memory usage by recomputing part of the activation values during the backward pass instead of keeping them in memory, at the expense of additional forward computation.
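The effect of partitioning on static memory can be illustrated with a back-of-the-envelope estimate. The numbers are assumptions for illustration only: a hypothetical model with one billion parameters, and 16 bytes of static state per parameter (a common accounting for mixed-precision Adam: 2 for weights, 2 for gradients, 12 for optimizer states). Neither figure is taken from this paper.

```python
def static_memory_per_gpu_gb(
    num_params: float, bytes_per_param: float, num_gpus: int, partitioned: bool
) -> float:
    """Static memory (parameters + gradients + optimizer states) held by one GPU, in GiB."""
    total_bytes = num_params * bytes_per_param
    if partitioned:
        # ZeRO-style partitioning: each GPU keeps only its 1/N share of the state.
        total_bytes /= num_gpus
    return total_bytes / 1024**3


replicated = static_memory_per_gpu_gb(1e9, 16, num_gpus=16, partitioned=False)
sharded = static_memory_per_gpu_gb(1e9, 16, num_gpus=16, partitioned=True)
print(f"replicated: {replicated:.2f} GiB, partitioned: {sharded:.2f} GiB")
```

The estimate ignores activations and communication buffers, which is why the reduction is only approximately 1/N in practice.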
To reduce the computation time, we use BF16 precision training. BF16 is a data type supported by NVIDIA's Ampere-architecture GPUs such as the A100. Compared with the previously widely used mixed-precision training with FP16 and FP32, BF16 has the same representation range as FP32, which reduces the risk of numerical overflow and helps keep model convergence stable, while offering the same fast computing speed as FP16.
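The range argument can be checked from the bit layouts of the three formats. The sketch below uses the standard field widths (FP16: 5 exponent and 10 mantissa bits; BF16: 8 and 7; FP32: 8 and 23) to compute the largest finite value of each; the helper name is hypothetical.

```python
def max_finite(exponent_bits: int, mantissa_bits: int) -> float:
    """Largest finite value of an IEEE-style binary float with the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    return (2 - 2.0 ** -mantissa_bits) * 2.0 ** max_exponent


FORMATS = {"FP16": (5, 10), "BF16": (8, 7), "FP32": (8, 23)}
for name, (exp_bits, man_bits) in FORMATS.items():
    print(name, max_finite(exp_bits, man_bits))
```

BF16 shares FP32's 8-bit exponent, so its largest value is of the same order of magnitude as FP32's, while FP16 tops out at 65504. The price is a shorter mantissa, that is, lower precision per value.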
Experiments
Following previous work, we use the same pre-training dataset of 14M images with texts, which includes two in-domain datasets (MS COCO and Visual Genome) and three out-of-domain web datasets (Conceptual Captions, Conceptual 12M and SBU Captions).
We pretrain the model for 30 epochs with a total batch size of 1024 on 16 NVIDIA A100 GPUs. We use a 6-layer Transformer for both the text encoder and the cross-modal skip-connected network, and a 12-layer Transformer for the decoder. The text encoder is initialized using the first 6 layers of the BERTbase model, and the skip-connected network is initialized using the last 6 layers of BERTbase. We initialize the visual encoder with CLIP-ViT pretrained on 400M noisy image-text pairs. The visual transformer with ViT-B/16 is used as our base architecture, and the one with ViT-L/14 as the large architecture. We use the AdamW optimizer with a weight decay of 0.02. The learning rate is warmed up in the first 1000 iterations to 1e-5 (ViT-B/16) and 1e-4 (BERTbase) for mPLUGViT-B, and to 5e-6 (ViT-L/14) and 5e-5 (BERTbase) for mPLUGViT-L, and is then decayed to 1e-6 following a cosine schedule. During pre-training, we take random image crops of resolution 256 × 256 (ViT-B/16) or 224 × 224 (ViT-L/14) as input, and also apply RandAugment to improve the generalization of the vision encoders. For VQA and image captioning tasks, we perform additional continued pre-training on 4M image-text pairs. We increase the image resolution during finetuning. For image-text contrastive learning, the queue size is set to 65,536 and the momentum coefficient is set to 0.995.
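The warm-up and cosine decay described above can be sketched as a function of the training step. Some details are assumptions: the warm-up is taken to be linear (its shape is not stated), `total_steps` is a hypothetical value chosen for the demonstration, and the function name is illustrative. The peak of 1e-4, the 1000 warm-up iterations and the final value of 1e-6 follow the text.

```python
import math


def learning_rate(
    step: int,
    peak_lr: float,
    total_steps: int,
    warmup_steps: int = 1000,
    final_lr: float = 1e-6,
) -> float:
    """Linear warm-up to peak_lr, then cosine decay from peak_lr down to final_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # assumed linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # hypothetical; the real value depends on dataset size and epochs
for step in (0, 999, 5_000, TOTAL_STEPS):
    print(step, learning_rate(step, peak_lr=1e-4, total_steps=TOTAL_STEPS))
```

In a setup with separate peak rates for the visual encoder and the BERT-initialized layers, the same function would be evaluated once per parameter group with the corresponding `peak_lr`.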
5.2 Evaluation on Vision-Language Tasks
We compare our pre-trained model against other VLP models on six downstream V+L tasks. We introduce each task and our fine-tuning strategy below. Details of the datasets and fine-tuning hyperparameters are given in the Appendix.
The VQA task requires the model to answer natural language questions given an image. Most methods treat visual question answering as multi-label classification over a predefined answer set. This strategy achieves strong performance, but it is not suitable for real-world open scenarios. We treat VQA as an answer generation task and directly use unconstrained open-vocabulary generation during inference, in contrast to constrained closed-vocabulary generation models. Following prior work, we concatenate the question with the object labels and OCR tokens extracted from the image. As shown in Table 2, mPLUG achieves 81.27 on the Test-std split and outperforms the SOTA models including SimVLM and Florence, which use 100× and 60× more pre-training image-text pairs, respectively. Based on the same 4M pre-training data, mPLUG outperforms CLIP-ViL and METER, which also use CLIP as the visual encoder. Besides, under the same settings, mPLUG consistently and significantly outperforms ALBEF and BLIP, which rely only on co-attention from images to text for cross-modal fusion. The gain can be attributed to the design of the cross-modal skip-connections, which specifically targets the information asymmetry between the two modalities. Neither ALBEF nor BLIP addresses this problem well, and both are biased towards the language modality.
5.2.2 Image Captioning
The image captioning task requires a model to generate an appropriate and fluent caption for a given image. We evaluate image captioning on two datasets, COCO Caption and NoCaps. mPLUG is fine-tuned on the COCO Caption training data and tested on both the Karpathy test split of COCO Caption and the NoCaps validation set. Following prior work, we first fine-tune mPLUG with cross-entropy loss and then with CIDEr optimization for an extra 5 epochs. As shown in Table 1, mPLUG with only 14M pre-training images outperforms the SOTA models including LEMON and SimVLM on both the COCO Caption and NoCaps datasets, although these models use more than 10× and 100× as much pre-training data, respectively. On COCO Caption, mPLUG performs best on the CIDEr metric and surpasses the SOTA model by a large margin of 5.5 on the Karpathy test set. For NoCaps, we take the best checkpoint on COCO Caption and predict on the NoCaps validation set directly.
5.2.3 Image-Text Retrieval
We conduct experiments on both image-to-text retrieval (TR) and text-to-image retrieval (IR) on the COCO and Flickr30K datasets. Following prior work, we jointly optimize the ITC loss and the ITM loss during fine-tuning. During inference, we first select the top-k candidates by computing the dot-product similarity between the image and text encoder features, and then rerank the selected candidates based on their ITM scores. The value of k is set separately for COCO and for Flickr30K. As shown in Table 3, mPLUG outperforms all existing methods on both datasets. Using 14M images, mPLUG achieves better performance than BLIP with 129M and Florence with 0.9B pre-training data. Using the same 14M pre-training images, mPLUG substantially outperforms the previous best model BLIP by +2.7% in TR recall@1 on COCO and +1.0% in TR recall@1 on Flickr30K.
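The two-stage inference procedure (shortlist by dot product, then rerank by ITM score) can be sketched as follows. The function names, the toy feature vectors and the ITM scores are all hypothetical; in practice the ITM score would come from the fusion network applied to each shortlisted pair.

```python
from typing import Callable, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    itm_score: Callable[[int], float],
    k: int,
) -> List[int]:
    """Return candidate indices: top-k by dot product, then reordered by ITM score."""
    by_similarity = sorted(
        range(len(candidates)), key=lambda j: dot(query, candidates[j]), reverse=True
    )
    shortlist = by_similarity[:k]  # cheap stage: unimodal encoder features only
    return sorted(shortlist, key=itm_score, reverse=True)  # expensive stage: k pairs only


query = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9]]
toy_itm = [0.2, 0.9, 0.99]  # made-up matching scores per candidate
ranking = retrieve(query, candidates, itm_score=lambda j: toy_itm[j], k=2)
print(ranking)
```

Candidate 2 has the highest made-up ITM score but never reaches the reranker because it is filtered out by the dot-product stage. This is the trade-off that the choice of k controls: a larger k costs more fusion passes but risks fewer such misses.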
5.2.4 Visual Grounding
Given a query in plain text and an image, visual grounding requires models to localize the referred object in the image. Instead of regressing the bounding boxes directly, we concatenate visual features and attended textual features and feed them into the decoder to predict the coordinates. Table 4 shows that mPLUG outperforms all the SOTA methods. We observe that in RefCOCO testB the images often contain arbitrary objects, and in RefCOCOg test-u the expressions are longer than in the other datasets. Compared with the previous best model OFA, mPLUG achieves a 3.16% absolute improvement on RefCOCO testB and a 1.22% absolute improvement on RefCOCOg test-u. This demonstrates that mPLUG learns better multi-modal interaction from cross-modal skip-connections and is better at handling complex images and long queries.
5.2.5 Visual Reasoning
We consider two datasets for visual reasoning: NLVR2 and SNLI-VE. The NLVR2 task requires the model to predict whether a sentence describes a pair of images. Following prior work, we use two cross-attention layers to process the two input images, and their outputs are merged and fed to the FFN. An MLP classifier is then applied to the output embedding of the language [CLS] token. The SNLI-VE task requires the model to evaluate how the given image and text are semantically related, i.e., entailment, neutral, or contradiction. Following prior work, the image premise, text premise and text hypothesis are fed to the encoder. We remove the decoder and use only the encoder modules for three-way classification, which saves nearly half of the total computation cost. We predict the class probabilities using the multimodal encoder's output representation of the language [CLS] token. As shown in Table 5, mPLUG obtains performance competitive with the SOTA models on both visual reasoning tasks, and even outperforms SimVLM and BLIP, which use far more pre-training data. (The SOTA models such as OFA and VLMo both add large-scale text-only and image-only pre-training data to improve reasoning ability.)
5.3 Effectiveness and Efficiency
To validate the effectiveness and efficiency of our proposed cross-modal skip-connected network, we conduct in-depth analysis on different stride values and various cross-modal fusion methods.
The stride S is the key factor controlling the trade-off between effectiveness and efficiency. We therefore compare the running time and performance of different stride values S in the cross-modal skip-connected network on the VQA and NLVR2 tasks. Specifically, we test four different stride values, each of which divides the total number of cross-modal fusion layers. The model is mPLUGViT-B, and all other experimental settings are kept the same. As shown in Figure 3, the larger S is, the more efficient the cross-modal fusion is: by skipping the vision co-attention layers, the running time can be reduced by as much as 5 times across the tested strides. The performance of mPLUG on both datasets gradually increases with S up to a point, and slightly decreases beyond it. Compared with a smaller stride, mPLUG achieves comparable performance at a larger one while speeding up by nearly 30%. We therefore use the larger stride on mPLUGViT-L for faster pre-training.
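The trend that a larger stride is cheaper can be illustrated with a rough count of query-key pairs in the attention layers. This is a toy cost model, not a reproduction of the measured running times: it ignores the FFN layers and the unimodal encoders, it assumes that a network with L co-attention layers has L/S blocks of S asymmetric co-attention layers plus one connected-attention layer each, and the sequence lengths (256 visual tokens, 32 text tokens) are illustrative. The function name is hypothetical.

```python
def attention_pairs(num_layers: int, stride: int, num_visual: int, num_text: int) -> int:
    """Count query-key pairs scored by all attention layers of the fusion network."""
    if num_layers % stride != 0:
        raise ValueError("stride must divide the number of co-attention layers")
    # Asymmetric co-attention: text self-attention + text-to-image cross-attention.
    co_attention = num_text * num_text + num_text * num_visual
    # Connected attention: full self-attention over the concatenated sequence.
    connected = (num_visual + num_text) ** 2
    num_blocks = num_layers // stride
    return num_layers * co_attention + num_blocks * connected


for s in (1, 2, 3, 6):  # the divisors of 6, used here as illustrative strides
    print(s, attention_pairs(num_layers=6, stride=s, num_visual=256, num_text=32))
```

Under these assumptions the cost of the asymmetric layers is fixed, and the stride only changes how often the expensive connected-attention layer over the long visual sequence is executed.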
5.3.2 Analysis of Cross-modal Fusion
We compare the effectiveness and efficiency of different cross-modal fusion variants in terms of running time and performance on the VQA and NLVR2 tasks. Specifically, we pre-train mPLUG with different cross-modal fusion networks based on the same image encoder and text encoder. All the pre-training settings and the number of fusion layers are kept the same as in the original mPLUG pre-training. As shown in Figure 4, the co-attention and connected-attention fusion methods both require much more running time due to the long visual sequence. Compared with these two fusion methods, our proposed skip-connected network is 4× faster and obtains better performance on both datasets. We also compare it with the asymmetric co-attention used in BLIP, which relies only on the co-attention layers from images to text. Although it runs slightly faster than the skip-connected network, the asymmetric co-attention achieves lower accuracy on both datasets. The performance degradation is attributed to the information asymmetry and the bias towards language, as shown in Section 5.2.1.
5.3.3 Large-scale Training
Combining the techniques introduced in Section 4 dramatically increases the training throughput. With the memory-saving and accelerated-training techniques, the throughput of mPLUG improves by more than 3×, from 124 samples per second to 422 samples per second, as shown in Table 6.
5.4 Zero-shot Transferability
In this section, we examine the generalization of mPLUG and compare the zero-shot result on two Vision-Language and three Video-Language tasks.
The pretraining of mPLUG adopts image-text contrastive and prefix language modeling tasks on large-scale image-text pairs. Thus, mPLUG has zero-shot generalization ability in image-text retrieval and image captioning. Image Caption: First, we take the pretrained mPLUG model and directly decode on the NoCaps validation set without further finetuning. Following prior work, we feed a prefix prompt "A picture of" into the text encoder to improve the quality of the decoded captions. As shown in Table 7, the zero-shot performance of mPLUG is competitive with fully supervised baselines such as Oscar and VinVL. With further finetuning on the MSCOCO dataset, mPLUG outperforms SimVLMhuge, which uses more pre-training image-text pairs and has more model parameters. Image-text Retrieval: We perform zero-shot retrieval on Flickr30K. The result is shown in Table 8, where zero-shot mPLUG outperforms models (CLIP, ALIGN, Florence) pretrained with more image-text pairs. Following prior work, we also evaluate zero-shot retrieval with the model finetuned on the MSCOCO dataset. Table 8 shows that mPLUG achieves better performance than the previous SOTA models.
5.4.2 Zero-shot Transfer to Video-Language Tasks
To evaluate the generalization ability of mPLUG on video-language tasks, we conduct zero-shot experiments on Video-text Retrieval, Video Caption and Video Question Answering. Following prior work, we uniformly sample a fixed number of frames from each video (a separate number for each of Retrieval, QA and Caption), and concatenate the frame features into a single sequence. Video-text Retrieval: We evaluate the mPLUG models that are pretrained and further finetuned on the COCO-retrieval image-text dataset, without any video pre-training or supervision. Table 9 shows that zero-shot mPLUG outperforms the SOTA models pretrained on far more data (e.g., Florence, BLIP), and even outperforms models finetuned on the supervised video dataset without using temporal information (e.g., VideoCLIP, VIOLET). Video Question Answering: Following BLIP, we treat Video QA as an answer generation task and perform evaluation with models finetuned on VQA. As shown in Table 10, zero-shot mPLUG outperforms BLIP pretrained with more image-text pairs. Video Caption: We use a prefix prompt "A video of" to improve the quality of the decoded captions. Table 10 shows that zero-shot mPLUG also achieves better performance than BLIP.
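Uniform frame sampling can be sketched as picking the middle frame of equal-length segments of the video. This is one reasonable reading of "uniformly sample"; the exact rule is not stated in the text. The function name is hypothetical, and the frame counts in the demonstration are illustrative rather than the ones used in the experiments.

```python
from typing import List


def uniform_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Indices of num_samples frames spread evenly over a video of num_frames frames."""
    if not 0 < num_samples <= num_frames:
        raise ValueError("need 0 < num_samples <= num_frames")
    segment = num_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


indices = uniform_frame_indices(num_frames=100, num_samples=4)
print(indices)
```

Each sampled frame would then be encoded by the image encoder, and the per-frame features concatenated into one visual sequence for the fusion network.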
Conclusion
This paper presents mPLUG, an effective and efficient VLP framework for both cross-modal understanding and generation. mPLUG introduces a new asymmetric vision-language architecture with novel cross-modal skip-connections, to address two fundamental problems of information asymmetry and computation efficiency in cross-modal alignment. Pretrained on large-scale image-text pairs, mPLUG achieves state-of-the-art performance on a wide range of vision-language tasks. mPLUG also demonstrates strong zero-shot transfer ability when directly applied to multiple video-language tasks. Our work explores the cross-modal alignment with a newly-designed VLP architecture and we hope it can help promote future research on image-text foundation models.
More Experiments Details
We evaluate mPLUG on six downstream vision-language tasks. The hyperparameters that we use for finetuning on the downstream tasks are listed in Table 11. Following prior work, all tasks adopt RandAugment, the AdamW optimizer with a weight decay of 0.05, and a cosine learning rate schedule. We use an image resolution of 336 × 336, except for VQA, where we use 504 × 504 images. For VQA and image captioning tasks, we also perform additional continued pre-training on 4M image-text pairs, which brings an accuracy improvement of about 0.2 or more. Next we introduce the dataset settings in detail.
We conduct experiments on the VQA2.0 dataset, which contains 83k/41k/81k images for training/validation/test. Following prior work, we use both the training and validation splits for training, and incorporate additional training data from Visual Genome.
We finetune on COCO's Karpathy train split, and evaluate on COCO's Karpathy test split and the NoCaps validation split. Following prior work, we first fine-tune mPLUG with cross-entropy loss for 5 epochs with a learning rate of 1e-5 and a batch size of 256. Starting from the fine-tuned model, we then fine-tune it with CIDEr optimization for an extra 5 epochs with a smaller learning rate of 8e-7. During inference, we use beam search with a beam size of 10, and set the maximum generation length to 20.
We adopt the widely-used Karpathy split for both COCO and Flickr30K. COCO contains 113k/5k/5k images for train/validation/test, and Flickr30K contains 29k/1k/1k images for train/validation/test.
We evaluate our method on three referring expression grounding datasets: RefCOCO, RefCOCO+ and RefCOCOg. The RefCOCO and RefCOCO+ datasets share 19K images and contain 142K and 141K queries, respectively. The RefCOCOg dataset contains 25K images and 95K queries. To make full use of the training data, we first train the model on a mixture of the datasets with a learning rate of 2e-5. Then we continue fine-tuning the model on each dataset with a learning rate of 2e-6.
We conduct experiments on the official splits of both datasets.
A.2 Pre-training Dataset Details
Table 12 shows the statistics of the 14M pre-training images with texts.