Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation

Wenliang Dai, Lu Hou, Lifeng Shang, Xin Jiang, Qun Liu, Pascale Fung

Introduction

Recent large-scale dual-stream Vision-Language Pre-training (VLP) models such as CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021) have shown remarkable performance on various downstream multimodal alignment tasks, e.g., image-text retrieval and image classification. These models are pre-trained with cross-modal contrastive learning on massive collections of image-text pairs and learn strong multimodal representations. Despite their success, as mentioned by Radford et al. (2021), their text encoder is relatively weak because it is trained only with a discriminative multimodal pre-training objective, which makes them ill-suited to generative multimodal tasks such as image captioning and open-ended visual question answering (VQA).

Meanwhile, the Transformer-based (Vaswani et al., 2017) auto-regressive large-scale pre-trained language models (PLMs), such as GPT (Radford and Narasimhan, 2018; Brown et al., 2020), have dominated natural language generation (NLG) tasks. These models are usually trained with causal self-attention, which only allows the model to attend to past outputs (unidirectional) to satisfy their generative nature. More recently, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020) augmented the auto-regressive decoder with a bidirectional Transformer encoder to further capture bidirectional information of the input. These encoder-decoder architectures excel not only on NLG but also on natural language understanding (NLU) tasks.
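The causal self-attention constraint mentioned above can be pictured as a lower-triangular mask applied to the attention scores. The following is a minimal NumPy sketch for illustration only; the function names and the uniform scores are assumptions and are not taken from any of the cited models.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask: entry [i, j] is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over attention scores; disallowed positions get zero weight."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# Uniform scores, chosen only to make the effect of the mask visible.
scores = np.zeros((4, 4))
attn = masked_softmax(scores, causal_mask(4))
```

With uniform scores, each position spreads its attention evenly over itself and all earlier positions, and assigns no weight to later ones. A bidirectional encoder corresponds to dropping the mask.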

To tackle the aforementioned limitations of dual-stream VLP models and fully utilize PLMs, in this paper, we present Vision-Language Knowledge Distillation (VLKD), a simple yet effective approach to enable CLIP to perform generative multimodal tasks through knowledge distillation. Specifically, we align the BART encoder to CLIP’s joint multimodal embedding space so that it acquires multimodal knowledge, and add an image-conditioned language modeling loss to coordinate the BART encoder and decoder. During training, we freeze CLIP’s weights to keep its learned multimodal space. For the finetuning and inference of downstream tasks, the original CLIP text encoder is discarded, which can be interpreted as being replaced by the distilled BART. In this way, we leverage the strengths of both sides: the expressive multimodal representation space of CLIP and the strong text generation capability of BART.

Compared to VLP from scratch, VLKD uses several orders of magnitude fewer image-text pairs and computational resources. As depicted in Figure 1, after VLKD pre-training, the model exhibits strong zero-shot performance on generative multimodal tasks, including open-ended VQA and image captioning. Without finetuning, it can generate answers by reasoning over the question, the visual information, and the textual knowledge embedded in the pre-trained BART. Furthermore, it can also directly generate a plausible caption given an image. Empirical results show that our model achieves 44.5% accuracy on the VQAv2 dataset and 84.6 CIDEr on the COCO image caption dataset in a zero-shot manner. Moreover, the original NLU and NLG ability of BART is maintained, which makes the model versatile for both multimodal and unimodal tasks.

To summarize, our contributions are: 1) We introduce an efficient approach to distill knowledge from the dual-stream VLP model CLIP to BART. The resulting model shows strong zero-shot performance on generative multimodal tasks, as well as pure NLP tasks; 2) We exhaustively quantify these capabilities on six benchmarks under various settings; and 3) We conduct comprehensive analysis and ablation studies to provide insights and facilitate future work in this direction.

Related Work

Based on how the two modalities interact, recent VLP models mainly fall into two categories: single-stream and dual-stream models. Single-stream models (Chen et al., 2020; Li et al., 2019; Ramesh et al., 2021; Lin et al., 2021; Kim et al., 2021a; Shen et al., 2022) concatenate the patch-wise or regional visual features and textual embeddings and feed them into a single model. Dual-stream models (Lu et al., 2019; Radford et al., 2021; Jia et al., 2021; Zhai et al., 2021; Yao et al., 2022) use separate encoders for images and texts, allowing efficient inference for downstream multimodal alignment tasks like image-text retrieval by pre-computing image/text features offline. However, these models cannot be directly used for multimodal generation tasks. In this paper, we propose an efficient method to align the multimodal embedding space of the dual-stream VLP model CLIP with a powerful PLM, BART, to gain multimodal generation ability.

There are also VLP models that can perform multimodal generation tasks through expensive pre-training with an image-conditioned auto-regressive language modeling objective (Lin et al., 2021; Wang et al., 2021; Hu et al., 2021; Li et al., 2022). However, the pre-training of these models requires a large number of image-text pairs and substantial computational resources. Other models (Agrawal et al., 2019; Li et al., 2019, 2020; Cho et al., 2021; Li et al., 2021) rely on an extra pre-trained object detector such as Faster-RCNN, trained with labeled bounding-box data, to extract regional image features offline, and are therefore less scalable.

2.2 Knowledge Distillation

Knowledge distillation (KD) in deep learning was first proposed by Hinton et al. (2015). It transfers knowledge embedded in the logits learned by a cumbersome teacher model to a smaller student model without sacrificing too much performance. Besides logits, other forms of knowledge such as intermediate representations and attentions (Jiao et al., 2019; Hou et al., 2020) have also been used to transfer the knowledge embedded in Transformer-based models. Contrastive representation distillation (Tian et al., 2019) distills the knowledge from the teacher network to the student network by maximizing the mutual information between the two networks, and has recently been extended to transfer the knowledge from the pre-trained multimodal model CLIP for zero-shot detection (Gu et al., 2021) and the multilingual setting (Jain et al., 2021). In this paper, we apply conventional KD as well as contrastive KD to transfer the knowledge from the pre-trained CLIP to BART. In addition, we propose to transfer the knowledge in the CLIP image encoder to the BART decoder through cross-attention.

Proposed Method

We propose to distill multimodal knowledge from CLIP to BART for generative multimodal tasks, which takes the strengths from both sides (powerful multimodal representations of CLIP and text generation ability of BART). To this end, we propose three objectives (Section 3.2). The overall architecture is illustrated in Figure 2.

CLIP (Radford et al., 2021) is a dual-stream VLP model pre-trained with a contrastive loss on 400 million image-text pairs. It consists of a text encoder, which is a GPT-style (Radford et al., 2019) Transformer model, and an image encoder, which can be either a Vision Transformer (ViT) (Dosovitskiy et al., 2020) or a Residual Convolutional Neural Network (ResNet) (He et al., 2016). CLIP learns a joint multimodal embedding space with its text encoder and image encoder aligned. Given an input image-text pair, the ViT image encoder first reshapes the image into a sequence of 2D patches and then maps them into 1D embeddings with a prepended [CLS] token using a trainable linear projection. These embeddings are fed into the CLIP image encoder together with positional encodings. The output embedding of the [CLS] token represents the whole image. When a ResNet-based image encoder is used, the [CLS] embedding is the average of the output embeddings, which then goes through an attention pooling layer. The text sentence is bracketed with [SOS] and [EOS] tokens, and the output embedding of the latter is used as the sentence-level representation. In this paper, we explore four CLIP variants: ViT-B/16, ViT-L/14, RN50 $\times$ 16, and RN50 $\times$ 64.
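The ViT input pipeline described above (patchify, linear projection, prepended [CLS] token, positional encodings) can be sketched in a few lines of NumPy. This is an illustrative sketch and not CLIP's actual implementation: the image size, patch size, embedding width, and random weights are all assumptions chosen to resemble a ViT-B/16-like configuration.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches of shape (N, patch*patch*C)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch size"
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return grid.reshape(-1, patch * patch * c)

rng = np.random.default_rng(0)
image = rng.normal(size=(224, 224, 3))   # hypothetical input image
patch_size, width = 16, 768              # assumed ViT-B/16-like sizes

patches = patchify(image, patch_size)                          # (196, 768)
proj = rng.normal(size=(patches.shape[1], width)) * 0.02       # stand-in for the trainable projection
cls_token = np.zeros((1, width))                               # stand-in for the learned [CLS] embedding
pos_embed = rng.normal(size=(patches.shape[0] + 1, width)) * 0.02

# Sequence fed to the Transformer: [CLS] followed by projected patches, plus positions.
tokens = np.concatenate([cls_token, patches @ proj], axis=0) + pos_embed
```

After the Transformer layers, the output at index 0 of this sequence plays the role of the image-level [CLS] representation.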

BART is a Transformer-based (Vaswani et al., 2017) sequence-to-sequence model that has a bi-directional encoder and a uni-directional (left-to-right) decoder, which can be seen as a generalization of BERT (Devlin et al., 2019) and GPT (Radford and Narasimhan, 2018). It is pre-trained on 160GB of text data in a self-supervised way by performing the text span infilling task with the input sentences corrupted and shuffled. Similar to the CLIP text encoder, BART also tokenizes and converts the input text into a sequence of embeddings, which are then fed into the BART encoder. BART excels at both NLG (e.g., abstractive summarization) and NLU tasks.

3.2 Training Objectives

To distill multimodal knowledge from CLIP to BART, we propose three objective functions: 1) Text-Text Distance Minimization (TTDM); 2) Image-Text Contrastive Learning (ITCL); and 3) Image-Conditioned Text Infilling (ICTI). During training, the model parameters of CLIP are kept frozen, i.e., no gradients are back-propagated through them (marked as SG in Figure 2), to ensure that its two encoders remain aligned and the multimodal knowledge is not forgotten.

For each training batch with $B$ image-text pairs, denote the $k$ -th image-text pair as ${\bf x}^{k}=\{{\bf x}^{k}_{I},{\bf x}^{k}_{T}\}$ , and the output of multimodal encoders of CLIP and BART encoder as

3.2.2 Image-Text Contrastive Learning

Contrastive training has been shown to be very effective in cross-modal representation learning (Tian et al., 2020; Sigurdsson et al., 2020; Zhang et al., 2020; Radford et al., 2021). To further adapt the BART encoder to CLIP’s multimodal space, we optimize a symmetric InfoNCE loss between the output representations of the BART encoder and CLIP image encoder. The image-to-text contrastive loss $\mathcal{L}_{i2t}$ is formulated as

where $\tau$ is a learnable temperature parameter. Unlike Radford et al. (2021), we find that not clamping $\tau$ yields a slight improvement. Similarly, the text-to-image contrastive loss $\mathcal{L}_{t2i}$ is

Note that when computing the ITCL and TTDM losses, we do not introduce any new linear projections on the CLIP output features, to avoid destroying the pre-trained alignment between its image and text encoders. Instead, we add one linear layer (parameterized by ${\bf W}_{e}$ ) to project the BART encoder output into CLIP’s representation space and match the feature dimension.
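A symmetric InfoNCE objective of the kind described in this subsection can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' exact formulation: it assumes cosine similarity on L2-normalized features, division of the similarities by $\tau$, a mean over the batch, and a single pooled vector per text from the BART encoder. The function and argument names are hypothetical.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def symmetric_info_nce(image_feats, text_feats, w_e, tau):
    """
    image_feats: (B, d_clip) features from the frozen CLIP image encoder.
    text_feats:  (B, d_bart) pooled BART encoder outputs (pooling is an assumption).
    w_e:         (d_clip, d_bart) projection into CLIP's space; CLIP features get no new projection.
    tau:         temperature (learnable in the paper; a plain float here).
    Returns (loss_i2t, loss_t2i).
    """
    img = l2_normalize(image_feats)
    txt = l2_normalize(text_feats @ w_e.T)
    logits = img @ txt.T / tau            # [k, j] = similarity of image k and text j
    diag = np.arange(logits.shape[0])     # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return loss_i2t, loss_t2i
```

When the projected text features coincide with their paired image features and are far from all others, both losses approach zero; when all features in the batch are identical, both equal $\log B$.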

3.2.3 Image-Conditioned Text Infilling

With only TTDM and ITCL, the BART decoder is not updated at all. To coordinate the BART encoder and decoder, we propose to perform the text span infilling task conditioned on the corresponding image features. As depicted in Figure 2(b), for the $k$ -th image-text pair, following Lewis et al. (2020), we corrupt the input text by masking 15% of whole-word tokens with span lengths drawn from a Poisson distribution with $\lambda=3$ .
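The corruption step can be illustrated with a simplified span-masking routine. This is a hedged sketch, not the authors' preprocessing code: it works on a plain list of words, clamps sampled span lengths to at least one word (zero-length spans are ignored), keeps spans from touching each other, and replaces each span with a single mask token. The function name, the `[MASK]` string, and the seed are assumptions.

```python
import numpy as np

def corrupt_with_spans(words, mask_ratio=0.15, poisson_lambda=3.0,
                       mask_token="[MASK]", seed=0):
    """Mask roughly `mask_ratio` of the words using Poisson-length spans.

    Returns (corrupted_words, masked_flags). Each masked span is replaced by
    one mask token, so the corrupted sequence is shorter than the input.
    """
    rng = np.random.default_rng(seed)
    n = len(words)
    masked = np.zeros(n, dtype=bool)
    if n == 0:
        return [], masked
    budget = max(1, int(round(mask_ratio * n)))  # number of words to mask
    attempts = 0
    while masked.sum() < budget and attempts < 100 * n:
        attempts += 1
        length = int(rng.poisson(poisson_lambda))
        # Simplification: at least one word, and never exceed the remaining budget.
        length = min(max(length, 1), budget - int(masked.sum()))
        start = int(rng.integers(0, n - length + 1))
        # Reject candidates that overlap or touch an existing span.
        lo, hi = max(0, start - 1), min(n, start + length + 1)
        if masked[lo:hi].any():
            continue
        masked[start:start + length] = True
    corrupted = []
    for i, word in enumerate(words):
        if not masked[i]:
            corrupted.append(word)
        elif i == 0 or not masked[i - 1]:
            corrupted.append(mask_token)  # first word of a span emits the single mask
    return corrupted, masked
```

Because one mask token stands for a whole span, the decoder has to predict both the missing words and how many there are, which is consistent with the later observation that each mask can be filled with a variable number of tokens.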

Since ${\bf V}^{k}$ and ${\bf W}_{e}{\bf E}^{k}$ are already aligned in CLIP’s multimodal space through TTDM and ITCL but have a different feature dimension from the BART decoder, we further project them to the BART decoder dimension with ${\bf W}_{i}$ and ${\bf W}^{\prime}_{e}$ , respectively. We then concatenate them as ${\bf C}^{k}$ before feeding them into the BART decoder, as shown in Eq.(1). As mentioned in Section 3.1, we explore two types of CLIP image encoders. With a slight abuse of notation, for the ResNet-based CLIP, ${\bf V}^{k}$ is composed of all embeddings after the final attention pooling layer $\{{\bf v}^{k}_{i}\}_{i=1}^{n_{1}}$ , while for the ViT-based CLIP, ${\bf V}^{k}$ consists of the embedding of the [CLS] token ${\bf v}^{k}_{cls}$ only.

Note that the weight matrix ${\bf W}^{\prime}_{e}$ is initialized to be the pseudo-inverse of ${\bf W}_{e}$ , such that text representations after the two projections ${\bf W}^{\prime}_{e}{\bf W}_{e}{\bf E}^{k}$ are the closest to the original pre-trained BART encoder space at initialization (the pseudo-inverse matrix ${\bf W}^{\prime}_{e}$ satisfies ${\bf W}^{\prime}_{e}=\arg\min_{{\bf X}}\|{\bf W}_{e}{\bf X}-{\bf I}\|_{F}^{2}$ , where ${\bf I}$ is the identity matrix and $\|\cdot\|_{F}$ denotes the Frobenius norm). The BART decoder then interacts with ${\bf C}^{k}$ through standard Transformer cross-attention layers. We optimize a language modeling loss $\mathcal{L}_{ICTI}$ by minimizing the negative log-likelihood in Eq.(2), in which ${\bf w}_{j}$ denotes the token to be predicted at each decoding step.
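The pseudo-inverse initialization can be checked numerically. The sketch below uses `numpy.linalg.pinv` with small, illustrative dimensions (the real BART and CLIP widths are larger and depend on the variant) and a random stand-in for ${\bf W}_{e}$; these values are assumptions made only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_bart, d_clip = 64, 32                      # illustrative sizes, with d_clip < d_bart
w_e = rng.normal(size=(d_clip, d_bart)) / np.sqrt(d_bart)   # stand-in for W_e
w_e_prime = np.linalg.pinv(w_e)              # (d_bart, d_clip), the initial W'_e

# W_e W'_e is the identity on CLIP's space (W_e has full row rank here).
right_identity_error = np.linalg.norm(w_e @ w_e_prime - np.eye(d_clip))

# W'_e W_e is an orthogonal projector on the BART space, so the round trip
# keeps the component of each embedding that W_e can represent.
projector = w_e_prime @ w_e
e = rng.normal(size=(d_bart, 5))             # five token embeddings as columns
round_trip = projector @ e
```

Since the CLIP space is lower-dimensional in this sketch, the round trip cannot be exactly the identity on the BART space; the pseudo-inverse gives the orthogonal projection, which is the closest reconstruction in the least-squares sense.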

The ICTI loss is crucial for our methodology to work, as it not only coordinates the BART encoder and decoder, but also enables the BART decoder to understand the multimodal information by recovering texts with visual clues.

Finally, we simultaneously optimize the weighted sum of the three losses $\mathcal{L}$ as

$$\mathcal{L}=\gamma\,\mathcal{L}_{TTDM}+\mathcal{L}_{ITCL}+\mathcal{L}_{ICTI},$$

where $\gamma$ is set to $10^{3}$ by default, as $\mathcal{L}_{ITCL}$ and $\mathcal{L}_{ICTI}$ are about three orders of magnitude larger than $\mathcal{L}_{TTDM}$ .

3.3 Datasets for VLKD

Our model is trained on the Conceptual Captions (CC3M) (Sharma et al., 2018) dataset, which contains 3 million image-text pairs crawled from the Internet. For larger model variants (ViT-L/14 and RN50x64), we further include the Visual Genome Caption data, which contains $\sim$ 700K image-text pairs. No images for pre-training appear in the downstream datasets. Compared to previous VLP work (Radford et al., 2021; Jia et al., 2021; Wang et al., 2021), VLKD is much cheaper because it uses several orders of magnitude less data. Furthermore, we experiment with even smaller data (1M, 100K) by uniformly sampling a subset of CC3M to test the limit of dataset size of VLKD, with results discussed in Section 5.

Experiments

To demonstrate the effectiveness of VLKD, we evaluate it on generative multimodal tasks in both zero-shot and finetuning settings. Specifically, we test the image captioning task, and also the VQA task under the open-ended scenario. Furthermore, we also run the model on NLU and NLG tasks to investigate the influence of VLKD on the text processing ability of the original pre-trained BART.

Image captioning requires the model to generate a relevant description given an image. We use the COCO image caption dataset (Lin et al., 2014) with the Karpathy split (Karpathy and Fei-Fei, 2017). Additionally, we use the NoCaps (Agrawal et al., 2019) dataset to test the model performance when there are out-of-domain objects.

Unlike previous works (Anderson et al., 2018; Chen et al., 2020; Li et al., 2020; Yu et al., 2021a; Zhang et al., 2021; Kim et al., 2021b) that treat the VQA task as a discriminative problem, we let the model generate answers freely, which is more aligned with the real-world scenario of this task. We use the standard VQAv2 (Goyal et al., 2017), and also OK-VQA (Marino et al., 2019) which requires knowledge to answer questions correctly.

For NLU, we test our model on the GLUE benchmark (Wang et al., 2019), which consists of nine text classification tasks. We exclude the WNLI task as it is problematic (see https://gluebenchmark.com/faq). For NLG, we test the abstractive summarization task on the XSUM (Narayan et al., 2018) dataset, which requires the model to comprehend long texts and generate short summaries with key information.

4.2 Implementation Details

We use BART-large as the pre-trained backbone NLP model, which has 12 layers in both encoder and decoder with a hidden size of 1024 and 16 heads in each multi-head attention (MHA) layer. In total, it contains 406M parameters. For the pre-trained CLIP (Radford et al., 2021) model, we report four variants with different visual backbones: ViT-B/16, ViT-L/14, RN50 $\times$ 16, and RN50 $\times$ 64.

We use 64 Nvidia V100 GPUs for VLKD and 8 for the finetuning of downstream tasks. In total, we pre-train the model for 10 epochs, which takes about 5 hours. We use a batch size of 4608 for ViT-B/16 and ViT-L/14, 4096 for RN50x16, and 3840 for RN50x64. All of the models are optimized by the AdamW (Loshchilov and Hutter, 2019) optimizer. The learning rate is warmed up to $2.4\times 10^{-4}$ within the first 2% of steps and then linearly decayed to 0. More information on VLKD pre-training and the finetuning of each downstream task can be found in Appendix A.

4.3 Multimodal Zero-Shot Evaluation

Benefiting from the knowledge distillation, especially the ICTI loss, our model can perform various downstream multimodal tasks in a zero-shot manner.

During knowledge distillation, the ICTI loss can be seen as a simple version of the image captioning task, which asks the model to fill in the corrupted locations of image descriptions. If the masking ratio increases to 100%, it reduces to the image captioning task. Therefore, it is intuitive to test the zero-shot performance of our model.

Following Radford et al. (2021) and Wang et al. (2021), we compose the input with a text prompt and also $m$ mask tokens, i.e., “A picture of [MASK] $\times m$ .”, for the model to generate the caption for the image. The zero-shot results are included in Table 1. Our zero-shot model achieves comparable overall performance to the finetuned UpDown (Agrawal et al., 2019) model on the NoCaps dataset. As shown in Figure 3(b), the zero-shot generated captions are plausible with correct objects, relationships, and actions. However, details such as colors are sometimes omitted.

In our experiments, we use $m=6$ for COCO and $m=8$ for NoCaps. Although this could potentially limit the length of generation, we find that it has a negligible influence on performance, as for each [MASK] token, the model has learned to fill in one to three tokens depending on the context. Furthermore, this could be used to control the length of generated texts for different scenarios. See Section 5 for a more detailed discussion of the effect of the number of masks.

4.3.2 Zero-Shot VQA

Zero-shot VQA is much more challenging than image captioning, as it requires reasoning over both the image and question, which is very different from the ICTI loss during the knowledge distillation. As illustrated in Figure 1, we construct the input by appending a text prompt “Answer: [MASK] $\times n$ .” to the question. Given the context (image+question+prompt), the model is required to predict the answer by recovering the textual tokens in the [MASK] positions. In our experiments, we use $n=2$ for VQAv2, which we found to perform best among $n\in\{1,2,3\}$ .
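The two zero-shot prompt formats (for captioning and for VQA) can be built with simple string helpers. This is an illustrative sketch: the helper names, the literal `[MASK]` string (the actual tokenizer's mask symbol may differ), the spacing between masks, and the example question are all assumptions.

```python
def caption_prompt(num_masks: int, mask_token: str = "[MASK]") -> str:
    """Captioning prompt: 'A picture of' followed by m mask tokens and a period."""
    masks = " ".join([mask_token] * num_masks)
    return f"A picture of {masks}."

def vqa_prompt(question: str, num_masks: int, mask_token: str = "[MASK]") -> str:
    """VQA prompt: the question followed by 'Answer:' and n mask tokens."""
    masks = " ".join([mask_token] * num_masks)
    return f"{question} Answer: {masks}."

# m = 6 is the setting reported for COCO, n = 2 for VQAv2.
print(caption_prompt(6))
print(vqa_prompt("What is on the table?", 2))  # hypothetical question
```

The number of masks acts as a soft length control, since each mask can be filled with more than one token.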

In Table 2, compared to the strong baseline Frozen (Tsimpoukelli et al., 2021), our model improves the zero-shot accuracy by 13.1% on the VQAv2 validation set and 7.4% on the OK-VQA test set with 7 $\times$ fewer parameters, indicating the efficiency and effectiveness of VLKD. Our model achieves 44.5% zero-shot accuracy on the VQAv2 test-dev set, which to the best of our knowledge is the new state-of-the-art. Furthermore, as shown in Figure 3(a), our model can bind visual objects to conceptual knowledge stored in the PLM to answer questions. For example, it connects the visual object Turkey with the traditional food people usually eat at the Thanksgiving festival.

4.4 Multimodal Finetuning Evaluation

When finetuning VLKD on downstream multimodal tasks, we keep the same input format as zero-shot to obtain outputs in a generative way. The CLIP model parameters are still frozen during finetuning.

In Table 1, we demonstrate that our model can achieve decent performance when finetuned on the COCO dataset. The SCST CIDEr optimization method (Rennie et al., 2017) is used to further improve the performance. Our model outperforms VL-T5/BART (Cho et al., 2021) without using an extra object detector, which is fairly time-consuming as explained by Kim et al. (2021b). Compared to state-of-the-art models, however, there is still a small performance gap, which we conjecture is mainly due to their usage of object detectors/tags and many more pre-training image-text pairs. We also evaluate our VLKD models with ResNet visual backbones on the NoCaps dataset (Table 1). For zero-shot image captioning, the CIDEr score on the out-of-domain set is even higher than on the in- and near-domain sets, which shows that our knowledge distillation method generalizes to common visual objects. After being finetuned on the COCO training set, our model with the RN50 $\times$ 64 backbone achieves performance on NoCaps comparable to the state-of-the-art models.

4.4.2 Finetuning VQA

As shown in Table 2, the best performance on VQAv2 is achieved by VLP models that tackle this task in a discriminative way with a set of pre-defined answers. However, this approach does not generalize to real-world scenarios and cannot be directly applied to more diverse datasets (e.g., OK-VQA). In contrast, Frozen (Tsimpoukelli et al., 2021) and our proposed VLKD formulate VQA as a generative problem and generate answers conditioned on the questions and images in an open-ended manner, which also enables zero-shot VQA. Specifically, for each question-answer pair in the VQAv2 dataset, we optimize the model to generate the answer with the cross-entropy loss and a label smoothing of 0.1. The loss is weighted by the weight of each answer candidate. In addition, we augment the training data with VG-QA (Krishna et al., 2016).
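The finetuning loss described above (token-level cross-entropy with label smoothing, weighted per answer candidate) can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the common form of label smoothing that mixes the target with a uniform distribution over the vocabulary, a mean over answer tokens, and an unnormalized weighted sum over candidates. How the candidate weights are derived is not specified here, so they are passed in as given.

```python
import numpy as np

def label_smoothed_nll(logits: np.ndarray, targets: np.ndarray, smoothing: float = 0.1) -> float:
    """logits: (T, V) decoder scores; targets: (T,) token ids. Returns the mean loss over tokens."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(targets)), targets]   # standard cross-entropy term
    uniform = -log_probs.mean(axis=-1)                    # cross-entropy against a uniform target
    return float(((1.0 - smoothing) * nll + smoothing * uniform).mean())

def weighted_answer_loss(logits_per_answer, targets_per_answer, weights, smoothing: float = 0.1) -> float:
    """Sum of per-candidate losses, each scaled by that candidate's weight (assumed aggregation)."""
    total = 0.0
    for logits, targets, weight in zip(logits_per_answer, targets_per_answer, weights):
        total += weight * label_smoothed_nll(logits, targets, smoothing)
    return total
```

With uniform logits over a vocabulary of size $V$, the per-candidate loss equals $\log V$ regardless of the smoothing value, which gives a quick sanity check for an implementation.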

Furthermore, following Cho et al. (2021), we test the performance on out-of-domain questions with rare answers using the Karpathy test-split. As shown in Table 3, our method shows a salient advantage on out-of-domain questions due to the benefit from VLKD and its generative nature without defining the answer list.

4.5 Evaluation of NLU and NLG

Table 4 shows results on the GLUE benchmark. Although prior VLP models are either initialized from the pre-trained BERT model, or trained by a text-only language modeling loss together with the vision-language (VL) losses, they generally suffer from the weakened performance of NLU. For example, SIMVLM performs significantly worse than BART, though trained with four times more textual data. We speculate that the weakened NLU ability of these models is caused by the catastrophic forgetting of the language knowledge in the pre-trained BERT weights during the multimodal pre-training. Moreover, simultaneous optimization of multimodal and text-only objectives potentially shifts the latter to be an auxiliary loss, making the NLP ability not as effective.

On the other hand, the resulting model of VLKD performs only slightly worse than the original BART and significantly outperforms BERT, as the original knowledge embedded in BART is well maintained.

Additionally, as presented in Table 5, we also run VLKD on the abstractive summarization task to evaluate its NLG performance, since BART-based methods excel at summarization (Lewis et al., 2020; Dou et al., 2021; Yu et al., 2021b). The gap between VLKD and its backbone BART is negligible. Overall, we empirically demonstrate that VLKD enables the backbone PLM to perform multimodal tasks without hurting its original NLP ability.

Ablation Study

Table 6 shows the ablation on the knowledge distillation objectives, except the ICTI loss, which is necessary for our method to work. Without TTDM or ITCL, we observe a clear degradation of zero-shot performance on both VQAv2 and COCO image caption datasets. It is worth noting that ITCL contributes more to the image captioning task, which requires a deeper perception of visual features to generate captions. Conversely, TTDM helps more on the VQA task, which involves reasoning over the question and image features. Removing both of them incurs a large performance drop, which demonstrates the importance of aligning the embedding space between CLIP and BART.

Furthermore, we also test the influence of the number of masks for zero-shot image captioning in Table 7. As discussed in Section 4.3.1, it has only a minor influence, as the model learns to fill in a variable number of tokens for each masked position. We achieve the best performance on the COCO caption dataset when $m=6$ and on NoCaps when $m=8$ .

In Table 8, we vary the size of dataset used for knowledge distillation. VLKD only has a slight performance drop when the size is reduced from 3M to 1M, and a sharp drop when further reduced to 100K.

To quantitatively measure the importance of freezing the model weights of CLIP during VLKD pre-training, we unfroze CLIP’s weights and conducted VLKD pre-training using the ViT-B/16 variant on CC3M without modifying other settings. This model achieves 31.7 zero-shot accuracy on the VQAv2 validation set and 44.8 CIDEr on the COCO Caption test set. We speculate that unfreezing CLIP harms its pre-trained multimodal space, which in turn degrades the performance of VLKD.

Conclusion

Recent dual-stream VLP models (e.g., CLIP) are powerful in various multimodal classification and retrieval tasks. However, their ability on multimodal generation or pure NLP tasks is highly restricted. In this paper, we propose a novel knowledge distillation method to efficiently align CLIP’s multimodal encoders and BART’s textual encoder to the same multimodal space, as well as a cross-modal LM loss to coordinate the BART encoder and decoder. This enables multimodal generation under zero-shot and also fully-finetuned settings without losing the original BART’s NLP ability. Empirical results show that our model achieves new state-of-the-art zero-shot performance on VQA and excellent performance on both NLP and multimodal tasks when finetuned, demonstrating the effectiveness of our proposed method.

References

Appendix A Hyper-parameters

In this section, we show the hyper-parameters of vision-language knowledge distillation (VLKD), as well as downstream task finetuning.

For VLKD, the hyper-parameters are shown in Table 9 for both types of CLIP variants we explored. For finetuning multimodal downstream tasks, we use the hyper-parameters shown in Table 10. Within each task, we use the same setting for multiple datasets.

For the GLUE benchmark, we use the LAMB optimizer (You et al., 2020) to train for 10 epochs. We conduct a hyper-parameter grid search with batch size={16, 32, 64}, lr={1e-4, 5e-4, 1e-3}, weight decay={1e-4, 1e-3}. We warm up the learning rate in the first epoch, then linearly decay it to zero.
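The grid described above can be enumerated directly, which makes the size of the search explicit. This is a small illustrative sketch; the dictionary keys are hypothetical names and the training call itself is omitted.

```python
from itertools import product

# Search space as listed in the text; key names are illustrative.
grid = {
    "batch_size": [16, 32, 64],
    "lr": [1e-4, 5e-4, 1e-3],
    "weight_decay": [1e-4, 1e-3],
}

# One dict per combination of hyper-parameter values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

This yields 3 x 3 x 2 = 18 configurations per GLUE task.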

For XSUM, we directly follow the hyper-parameters used in Lewis et al. (2020).

Appendix B More Examples of Zero-shot Inference

In Figure 4, we show more examples of zero-shot image captioning. In Figure 5, we depict more cases of the results of zero-shot open-ended VQA.