CLIP Models are Few-shot Learners: Empirical Studies on VQA and Visual Entailment

Haoyu Song, Li Dong, Wei-Nan Zhang, Ting Liu, Furu Wei

Introduction

Vision-language understanding (VLU) tasks, such as visual question answering Antol et al. (2015) and visual entailment Xie et al. (2019), test a system’s ability to comprehensively understand the semantics of both the visual world and natural language. To capture the alignment between vision and language, various efforts have been made to build vision-language pre-trained models Lu et al. (2019); Chen et al. (2020); Su et al. (2020); Zhang et al. (2021); Wang et al. (2021). Despite their superior performance, these methods rely heavily on human-annotated training data that are expensive or require expert knowledge, such as object detection datasets Lin et al. (2014); Kuznetsova et al. (2020) and aligned image-text pairs Deng et al. (2009); Sharma et al. (2018). Collecting such datasets requires heavy work on data gathering and human annotation, so their scale is only in the realm of tens of millions, much smaller than the Internet text corpora used for NLP pre-training Devlin et al. (2019); Brown et al. (2020).

Recently, CLIP Radford et al. (2021) has been proposed to learn visual concepts with natural language supervision, using 400 million image-text pairs crawled from the Internet. CLIP consists of a visual encoder and a text encoder, and it learns visual representations by aligning images and texts through a contrastive loss. In this way, CLIP achieves strong zero-shot performance on vision benchmarks such as ImageNet. Besides, Shen et al. (2021) show that CLIP can be leveraged as a strong visual encoder to benefit downstream vision-language tasks. However, there are two major differences between CLIP and previous visual encoders: 1) it is trained on much larger yet noisier web data, and 2) it has only a shallow interaction between vision and language. The first property promises strong generalization, and the second equips CLIP with the ability to align the two modalities. Could the strong zero-shot ability of CLIP be transferred to vision-language understanding tasks?

To answer the above question, in this work, we empirically study how to transfer CLIP’s zero-shot ability to VLU tasks and further turn CLIP into a few-shot learner. We carry out experiments on two VLU tasks: 1) visual question answering, where the model needs to give an answer according to the details of an image and a natural language question, and 2) visual entailment, where the model needs to determine the entailment relation between an image and a natural language sentence. Figure 1 illustrates the basic forms of the two studied tasks.

For the zero-shot visual question answering task, the key to a successful zero-shot capability transfer is to mitigate the gap between the pre-training task of CLIP and the task form of question answering. Inspired by recent advances in few-shot learning in NLP Schick and Schütze (2021b); Gao et al. (2021), we address this issue by introducing a two-step prompt generation strategy: an automatic conversion from question to statement to obtain masked templates, followed by span infilling with a generative pre-trained T5 model Raffel et al. (2020) to obtain candidate answers.

We explore a zero-shot cross-modality (language and vision) transfer capability through the visual entailment task. Specifically, we replace the image with its captions during training and only update a small classification layer. Then at inference, as usual, we still use image-text pairs for testing. This allows us to investigate how well the language and vision representations are aligned in CLIP models.

We further leverage few-shot learning to improve CLIP’s visual question answering performance based on the zero-shot transferring methods. We find that optimizing only bias and normalization (BiNor) parameters would make better use of limited examples and yield better results than the latest few-shot model Frozen Tsimpoukelli et al. (2021). Experiments confirm that CLIP models can be good vision-language few-shot learners.

Our contributions are summarized as follows:

To the best of our knowledge, this is the first work that studies how to transfer CLIP’s zero-shot capabilities into VLU tasks and confirms CLIP models can be good few-shot learners.

A zero-shot cross-modality transfer capability in CLIP is demonstrated.

A parameter-efficient fine-tuning strategy, BiNor, is proposed to boost CLIP’s few-shot visual question answering performance.

Preliminaries

2.2 Vision-Language Understanding Tasks

The task of VQA requires the model to answer questions about the details of input images. Following previous work, we experiment on the VQAv2 Goyal et al. (2017) dataset and formulate the task as a classification problem over 3,129 pre-defined most frequent answers. The images in VQAv2 come from Microsoft COCO Lin et al. (2014), and there are 65 types of questions in the dataset, such as how many and what color is. For answers, there are three types, including yes/no, number, and other.

Similar to natural language inference (NLI), the task of visual entailment predicts the entailment relation (entailment, neutral, or contradiction) between a premise and a hypothesis. Under the VL setting, the premise in visual entailment is based on the details of an image rather than on a textual description as in NLI. The SNLI-VE dataset Xie et al. (2019) is adapted from SNLI Bowman et al. (2015) and replaces SNLI’s premises with the images in the Flickr30k dataset Young et al. (2014). Considering the above characteristics, we use the SNLI-VE dataset to verify the zero-shot cross-modality (language and vision) transfer capabilities of the CLIP models. This zero-shot setting investigates how well the vision and language representations are aligned in CLIP models.

Zero-shot VQA

Previous works Kim et al. (2021); Shen et al. (2021) have found that directly applying CLIP models to zero-shot VL tasks is infeasible. For example, nearly chance-level zero-shot performance is observed on the VQAv2 dataset when directly applying a “question: [question text] answer: [answer text]” prompt template Shen et al. (2021). After rethinking the essence of prompt engineering in CLIP, we find that the key to a successful zero-shot capability transfer for the VQA task is to mitigate the gap between natural language description and the form of question answering.

Motivated by the above observations, we propose a two-step automatic prompt generation method to enable zero-shot VQA capabilities in CLIP models, with the assistance of a pre-trained generative T5 model Raffel et al. (2020). The key ideas of the two-step prompt generation method are illustrated in Figure 3: the first step is to convert the question into a masked template $\mathcal{T}$, and the second step is to filter out impossible answers with a language model and get a candidate answer set $\mathcal{V}_{F}$. The infilled template connects the question and the answers in the manner of a natural description and thus could be an ideal form of prompt for the VQA task.

The first step converts the question into a template, which is a statement with a mask token. To tackle the conversion challenge, we explore two approaches: an in-context demonstration method and a dependency parsing based method.

The idea of this conversion method is relatively simple: by demonstrating question-to-template (with [mask] token) examples to the language model, the model can implicitly capture the conversion pattern. We define a few examples for each question type and convert the questions according to their types. Figure 3 shows a conversion example, and more cases can be found in Appendix D. Specifically, we use T5 Raffel et al. (2020), a large pre-trained text-to-text Transformer, for the question-to-template conversion. T5 is pre-trained to infill the missing spans of a sentence, which are replaced by T5 special tokens (e.g., <extra_id_0>). We present a concatenation of the examples, the question, and the <extra_id_0> token to T5 for conditional generation, and the generated span is our masked template, named $\mathcal{T}_{\text{demo}}$.
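The input construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' exact format: the `Question:`/`Statement:` field labels, the helper name `build_conversion_input`, and the example questions are assumptions made here; only the `<extra_id_0>` sentinel is T5's actual first span token.

```python
SENTINEL = "<extra_id_0>"  # T5's first sentinel token, marking the span to infill


def build_conversion_input(demos, question):
    """Concatenate (question, template) demos, the new question, and the sentinel.

    The "Question:/Statement:" layout is a hypothetical format for illustration.
    """
    parts = [f"Question: {q} Statement: {t}" for q, t in demos]
    parts.append(f"Question: {question} Statement: {SENTINEL}")
    return " ".join(parts)


# Hypothetical demonstrations for one question type ("what color is").
demos = [
    ("What color is the car?", "The color of the car is [mask]."),
    ("What color is the wall?", "The color of the wall is [mask]."),
]
text = build_conversion_input(demos, "What color is the umbrella?")
print(text)
```

The span that T5 generates for the sentinel would then be taken as the masked template for the new question.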

Although the T5 conversion method works well in most situations, it still faces some out-of-coverage problems. To compensate for this shortcoming, we turn to a traditional dependency parsing based approach. This method converts a question to a statement using its part-of-speech tags and parsing results: the wh-word, root word, auxiliary or copula, as well as prepositions and particles that are dependents of the wh-word or the root, are identified, and transformations are performed according to grammar rules. We use Stanza Qi et al. (2020) to POS tag and parse the question and leave the answer as a mask token. Then the rules of Demszky et al. (2018) (https://github.com/kelvinguu/qanli) are used to perform the conversion. We name the template obtained in this way $\mathcal{T}_{\text{parsing}}$.

Step II: Answer Filtering

By common sense, “the species of a flower” can never be a vase. Therefore, using pre-trained language models, which have learned such concepts well during pre-training, to filter out less likely answers should have a positive influence on the final question answering performance. Given a masked template $\mathcal{T}$, a language model $\mathcal{L}$, and the answer vocabulary $\mathcal{V}$, we get the filtered answers $\mathcal{V}_{F}$ as:

where [mask] is the answer span in template $\mathcal{T}$, and $P_{\mathcal{L}}$ is the output distribution of the language model. Here we also apply T5 to infill answers because it makes no assumption about the length and position of the span. Once we get the template $\mathcal{T}$ and the filtered answers $\mathcal{V}_{F}$, we replace the [mask] token in template $\mathcal{T}$ with every selected answer in $\mathcal{V}_{F}$ to get the prompts $\mathcal{P}$.
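The filtering and prompt construction described above can be sketched as follows. This is a hedged illustration: keeping the top-`k` answers by log-probability is an assumption about how "less likely" answers are removed, and `toy_log_prob` is a hypothetical stand-in for scoring an answer with a language model such as T5.

```python
def filter_answers(template, vocab, log_prob, k):
    """Keep the k answers the language model scores highest for the [mask] slot.

    Top-k selection is an assumption; the scorer is passed in as `log_prob`.
    """
    ranked = sorted(vocab, key=lambda v: log_prob(template, v), reverse=True)
    return ranked[:k]


def build_prompts(template, answers):
    """Replace the [mask] token with every kept answer to obtain the prompts."""
    return {a: template.replace("[mask]", a) for a in answers}


def toy_log_prob(template, answer):
    """Hypothetical scores standing in for log P_L([mask] = answer | template)."""
    scores = {"rose": -0.5, "tulip": -0.9, "vase": -6.0, "yes": -9.0}
    return scores.get(answer, -20.0)


template = "The species of the flower is [mask]."
vocab = ["vase", "rose", "yes", "tulip"]
kept = filter_answers(template, vocab, toy_log_prob, k=2)
prompts = build_prompts(template, kept)
print(prompts)
```

With the toy scores, implausible candidates such as "vase" are dropped before any image is consulted, which shrinks the set of prompts CLIP has to discriminate between.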

3.2 TAP-C Method for VQA

The proposed method follows a Template-Answer-Prompt then CLIP discrimination pipeline, and thus we name it TAP-C. To make better use of the templates $\mathcal{T}_{\text{parsing}}$ and $\mathcal{T}_{\text{demo}}$, we use an ensemble of both by simply setting a threshold on T5’s generation confidence. We prefer $\mathcal{T}_{\text{demo}}$ but fall back to $\mathcal{T}_{\text{parsing}}$ if the generation confidence is low. Finally, given an image $i$ and the generated prompts $\mathcal{P}$, the TAP-C method gets a zero-shot VQA prediction by:
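A minimal sketch of this pipeline step is given below, assuming that the prediction is the answer whose prompt embedding has the largest dot product with the image embedding. The function names, the threshold value of 0.5, and the two-dimensional toy embeddings are illustrative assumptions; in practice the vectors would come from CLIP's visual and text encoders.

```python
def choose_template(t_demo, t_parsing, confidence, threshold=0.5):
    """Prefer the T5 demonstration template unless its confidence is low.

    The threshold value is a hypothetical placeholder.
    """
    return t_demo if confidence >= threshold else t_parsing


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_answer(image_vec, prompt_vecs):
    """Return the answer whose prompt embedding best matches the image.

    `prompt_vecs` maps each candidate answer to the embedding of its prompt.
    """
    return max(prompt_vecs, key=lambda a: dot(image_vec, prompt_vecs[a]))


# Toy embeddings standing in for encoder outputs.
image_vec = [0.6, 0.8]
prompt_vecs = {"red": [0.0, 1.0], "blue": [1.0, 0.0]}
answer = predict_answer(image_vec, prompt_vecs)
print(answer)
```

Because only the prompts built from the filtered answers are scored, the cost of a prediction grows with the size of the filtered set rather than the full answer vocabulary.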

Zero-shot Cross-modality Transfer

Recent pre-trained multilingual language models Wu and Dredze (2019); Liu et al. (2020); Xue et al. (2021) have been shown to transfer representations successfully across languages. For example, they can be fine-tuned only on a source language and evaluated on various target languages without language-specific training, yet still achieve good performance. On the other hand, the CLIP models achieve strong zero-shot performance on both image-to-text and text-to-image retrieval tasks Radford et al. (2021) using only a dot product between vision and language representations, which suggests that the two modalities are well aligned in the CLIP models. Is there a cross-modality transfer capability between language and vision in the CLIP models, just like the cross-lingual one in multilingual models?

To answer the above question, we use the visual entailment task (§ 2.2) to explore the zero-shot cross-modality performance. Figure 4 summarizes the key idea. Specifically, we train an MLP classifier over the fused representations of premise and hypothesis, and the fusion function is:

where $v_{1}$ and $v_{2}$ are two input vectors. During training, text-only premise and hypothesis are used as the input of CLIP text encoder:
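As a rough illustration of this setup, the sketch below fuses two vectors by concatenating the inputs with their sum, difference, and element-wise product, a common choice for entailment classifiers. The exact fusion function is not reproduced in this text, so this particular form, the name `fuse`, and the toy vectors are assumptions rather than the authors' definition.

```python
def fuse(v1, v2):
    """Concatenate v1, v2, v1 + v2, v1 - v2, and the element-wise product.

    This fusion form is an assumption made for illustration.
    """
    added = [a + b for a, b in zip(v1, v2)]
    diff = [a - b for a, b in zip(v1, v2)]
    prod = [a * b for a, b in zip(v1, v2)]
    return list(v1) + list(v2) + added + diff + prod


# Toy vectors standing in for encoder outputs of the same dimension.
caption_vec = [1.0, 2.0]     # premise as text (training)
image_vec = [1.5, 1.5]       # premise as image (inference)
hypothesis_vec = [3.0, 4.0]  # hypothesis text (both stages)

train_features = fuse(caption_vec, hypothesis_vec)  # text-text pair
test_features = fuse(image_vec, hypothesis_vec)     # image-text pair
print(train_features)
```

The point of the sketch is that the classifier sees the same feature layout in both stages; only the encoder that produces the premise vector changes between training and inference.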

Few-shot Learning for VQA

In this section, we investigate whether the CLIP models can benefit from few-shot learning, using the visual question answering task as the testbed.

Here we briefly define the terminology used in our few-shot visual question answering settings:

Number of ways. Originally, this is defined as the number of distinct classes in a task. However, rather than defining a 3,129-way task according to the answer vocabulary, we define the number of ways as question types times answer types (§ 2.2), i.e., $65 \times 3 = 195$ ways, to encourage generalization: the model should learn to answer a type of question.

Number of shots. The number of distinct examples in each way. Here a shot is an image along with the question and the answer.

Support set and query set. Before training, we sample a 195-way K-shot subset from the VQAv2 training set, so there are $195 \times K$ distinct examples available during few-shot learning. In each training epoch, we select $C$ ways out of the 195 ways for parameter optimization and divide the K shots in each way into a support set and a query set with a fixed proportion. The support set is used for model training, and the query set is used for performance evaluation (similar to a typical dev set).
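The sampling scheme defined above can be sketched as follows. This is an illustrative outline under stated assumptions: the field names `qtype` and `atype`, the 0.75 support proportion, and the toy data are hypothetical, and in real data some (question type, answer type) combinations may have fewer than K examples or none at all.

```python
import random
from collections import defaultdict


def sample_few_shot(examples, k, seed=0):
    """Group examples by way = (question type, answer type); keep up to k per way."""
    rng = random.Random(seed)
    by_way = defaultdict(list)
    for ex in examples:
        by_way[(ex["qtype"], ex["atype"])].append(ex)
    return {way: rng.sample(items, min(k, len(items))) for way, items in by_way.items()}


def split_episode(subset, c, support_ratio=0.75, seed=0):
    """Pick c ways and split each way's shots into support and query sets.

    The support proportion is a hypothetical value.
    """
    rng = random.Random(seed)
    ways = rng.sample(sorted(subset), min(c, len(subset)))
    support, query = [], []
    for way in ways:
        shots = subset[way]
        cut = max(1, int(len(shots) * support_ratio))
        support.extend(shots[:cut])
        query.extend(shots[cut:])
    return support, query


# Toy data: 3 question types x 2 answer types, 5 examples per way.
examples = [
    {"qtype": q, "atype": a, "id": n}
    for q in ("how many", "what color is", "is the")
    for a in ("number", "other")
    for n in range(5)
]
subset = sample_few_shot(examples, k=4)
support, query = split_episode(subset, c=2)
print(len(subset), len(support), len(query))
```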

5.2 Parameter-efficient Fine-tuning

Under the few-shot setting, our goal is to make the CLIP models learn from N-way K-shot examples and improve the zero-shot VQA performance. Specifically, we identify only a very small set of parameters in CLIP models (about 0.3 million out of over 100 million, details in appendix B.3), including the bias term and normalization term, to be optimized. For either the BatchNorm in ResNet or the LayerNorm in Transformer, the normalization could be uniformly denoted as:

where $x$ and $y$ are the mini-batched input and output, and the $\gamma$ and $\beta$ are learned parameters. And for all the linear layers and projection layers in CLIP models, they could be denoted as:

where $h$ and $o$ are the input and output vectors. We define the learnable parameter set as:

We optimize the Bias and Normalization (BiNor) parameters on the few-shot examples with a standard cross-entropy loss over the dot products from each image-prompt pair (Eq.2).
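For reference, the definitions above match the standard forms of normalization and affine layers. The block below writes them out under assumed notation: $\epsilon$ (a small numerical-stability constant), $w$ and $b$ (the weight and bias of a linear or projection layer), and $\mathcal{P}_{\text{learn}}$ (the learnable set) are symbols introduced here for illustration and may differ from the authors' own.

```latex
\[
\begin{aligned}
y &= \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta, \\
o &= w \cdot h + b, \\
\mathcal{P}_{\text{learn}} &= \{\, b,\ \gamma,\ \beta \,\}.
\end{aligned}
\]
```

Under this reading, the weight $w$ of every linear and projection layer stays frozen, and only the additive terms and the normalization scale are updated.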

Besides, when a few examples are available, we can also use in-context demonstrations to improve the answer filtering process in TAP-C (§ 3.1) by:

where $\mathcal{D}$ denotes the demonstrations. $\mathcal{D}$ is similar to template $\mathcal{T}$ but has been infilled with the answers, and it is sampled from the same type of question in the available few-shot examples. The resulting filtered vocabulary is denoted as $\mathcal{V}_{\text{demo}}$. We report the few-shot training procedure in Appendix C.
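One way to write the demonstration-conditioned filtering, assuming that answers are kept by top-$k$ log-probability and that the demonstrations are simply prepended to the template, is sketched below. Both the top-$k$ operator and the concatenation notation $[\mathcal{D}, \mathcal{T}]$ are assumptions made for illustration.

```latex
\[
\mathcal{V}_{\text{demo}}
  = \underset{v \in \mathcal{V}}{\operatorname{top-}k}\;
    \log P_{\mathcal{L}}\bigl(\text{[mask]} = v \mid [\mathcal{D}, \mathcal{T}]\bigr)
\]
```

Compared with the zero-shot case, the only change in this form is the extra conditioning on $\mathcal{D}$; the language model itself is not updated.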

Experiments

For visual question answering and visual entailment, we carry out experiments on the VQAv2 Goyal et al. (2017) and the SNLI-VE Xie et al. (2019) datasets, respectively. We report the statistics of the two datasets in Appendix A. For the evaluation of the VQA task, we follow the Frozen model Tsimpoukelli et al. (2021) and calculate VQA scores on the VQAv2 validation set. For visual entailment, we calculate the accuracy on both the validation and test sets using the scikit-learn toolkit.
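The two metrics can be sketched as below. The VQA score here uses the commonly quoted simplified form, in which a prediction gets full credit once at least three annotators gave the same answer; the official evaluation additionally normalizes answer strings and averages over annotator subsets, which this sketch omits. The function names and toy inputs are illustrative assumptions.

```python
def vqa_score(prediction, human_answers):
    """Simplified VQA accuracy: min(#annotators agreeing with the prediction / 3, 1)."""
    matches = sum(1 for a in human_answers if a == prediction)
    return min(matches / 3.0, 1.0)


def accuracy(predictions, labels):
    """Plain classification accuracy, as used for the three entailment relations."""
    correct = sum(1 for p, l in zip(predictions, labels) if p == l)
    return correct / len(labels)


# Toy annotations: ten human answers per question.
partial = vqa_score("red", ["red"] * 2 + ["blue"] * 8)
full = vqa_score("red", ["red"] * 5 + ["blue"] * 5)
acc = accuracy(["entailment", "neutral"], ["entailment", "contradiction"])
print(partial, full, acc)
```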

CLIP models have different variants according to the type of visual encoder, e.g., ResNet or ViT, which results in a significant difference in the number of learnable bias and normalization parameters. We report the number of learnable parameters of the CLIP variants in Appendix B.3. We select the two best-performing (and publicly available) variants, one for each kind of visual encoder, namely CLIP Res50x16 and CLIP ViT-B/16, to empirically study their zero-shot and few-shot vision-language understanding performance by applying our transfer methods (§§ 3, 4 and 5).

6.2 Results of Zero-shot VQA

As previous VL models rely heavily on object detection sub-modules, it is not feasible to directly apply them under the zero-shot setting. Here we set up zero-shot VL baselines from two recent works:

Frozen. Frozen Tsimpoukelli et al. (2021) prompts a seven-billion-parameter 32-layer language model with image representations. It is trained on aligned image-caption data and is also the first model that shows promising zero-shot and few-shot VQA performances.

Question irrelevant prompt. Shen et al. (2021) explored directly prompting the CLIP models for the VQA task. They used a “question: [question text] answer: [answer text]” template, together with the prompt engineering of image classification, to prepare prompts. The resulting prompts are irrelevant to the questions, and thus we denote this method as QIP.

We report the zero-shot VQA results in Table 1. The experimental results verify our hypothesis (§ 3.1) that CLIP prompts should describe the labels rather than the task. As we can see, the question irrelevant prompting method simply presents the task description and answers to the CLIP models, and its results are barely better than random guessing. In contrast, by converting questions into templates and filtering answers with pre-trained language models, our TAP-C method gives CLIP models a strong zero-shot capability on the VQA task, even when compared with the seven-billion-parameter Frozen zero-shot model.

6.3 Zero-shot Cross-modality Transfer

We report the zero-shot cross-modality transfer results in Table 2. We first investigate the language to vision transfer capability. As introduced in § 4, we train a classifier on the text-only SNLI-VE dataset, where each image is replaced by its caption. At inference, the trained classifier is evaluated by taking the image and text as inputs. As shown in the first group of results, after being trained solely on text-text (caption as the premise) entailment data, different CLIP variants successfully gain a similar discriminative ability under the image-text setting. To ensure that the above results indeed reflect a transfer from language to vision, we double-check by masking out the images at inference time, and the results are reported as Image Masked. As we can see, these results are similar to a random guess among the three relations, indicating that the images are important in the cross-modality evaluation.

Now that we have observed the language to vision transfer capability in CLIP models, we further investigate whether there is also a vision to language transfer capability. We conduct a similar experiment but train the classifier on the original SNLI-VE dataset, i.e., with an image premise and a text hypothesis. At inference, we evaluate the classifier on the text-only validation and test data. The results are reported in Table 2 and confirm the vision to language capability. Since text data are usually much cheaper than visual data, the first kind of transfer tends to be more promising in practice.

6.4 Results of Few-shot VQA

We report the few-shot VQA results in Table 3. We take the Frozen model and Frozen$_{\text{blind}}$, its variant with the image blacked out, as baselines. Under different k, our methods always learn from the limited training examples and improve over the zero-shot results, which confirms that CLIP models can be VL few-shot learners. As the number of shots increases, significant performance gains are observed in the other category, which agrees with our intuition: as we sample examples from each question type, most answers in the other category are not revealed to the model, so the model can keep learning and improving. Similarly, presenting examples to T5 also improves the answer filtering process, leading to significant performance gains in the other category. In contrast, the score of the number category improves significantly when the model first begins to see some training examples, but the improvement slows down as k continues to increase.

6.5 Analyses and Discussion

Our TAP-C method uses an ensemble of dependency parsing template $\mathcal{T}_{\text{parsing}}$ and T5 demonstration template $\mathcal{T}_{\text{demo}}$ . Here we investigate whether it is necessary to use such an ensemble. We report the ablation results of two templates in Table 4. The results show that the two templates have different effects over different questions, and the ensemble could make the best use of their advantages.

The TAP-C method generates prompts through template generation (t.gen.) and answer filtering (a.filt.). Here we quantify how much each step contributes to the final zero/few-shot VQA performances. We report the ablation results in Table 5. When we remove the answer filtering step (w/o a.filt.), both the zero-shot and few-shot performances generally fall by about 20%, but the models still retain some few-shot learning capabilities. We further remove the template generation step and only use question irrelevant templates: all results are nearly cut in half, indicating the importance of considering questions in both zero-shot and few-shot scenarios.

We only update the bias and normalization parameters during few-shot learning (§ 5.2). To investigate whether our BiNor fine-tuning strategy works well, we compare BiNor with two fine-tuning methods: 1) Full-FT (full fine-tuning), which updates all parameters in the model, and 2) BitFit Ben Zaken et al. (2021), which only updates the bias terms in all model layers. We report the comparison results in Table 6. Both BiNor and BitFit significantly outperform full fine-tuning, since millions of parameters easily overfit to a few training examples. When k is small, the performance differences between BiNor and BitFit are very small. When k becomes larger, BiNor begins to outperform BitFit by a noticeable margin. Our BiNor fine-tuning strategy is similar to BitFit but differs in that it also updates the normalization parameters, which grants the ResNet CLIP models better flexibility to adapt to new examples due to their larger number of batch normalization parameters. For the specific number of different parameters in each CLIP variant, please refer to Appendix B.3.

The proposed TAP-C method explores CLIP models’ potential to conduct zero/few-shot VQA tasks. However, we also found several limitations that hinder further improvement of the few-shot performance and that could be rooted in the CLIP models. First, CLIP models struggle with counting fine-grained objects in an image, especially within a small area of the image. This shortcoming can hardly be remedied by any kind of language knowledge. Besides, the CLIP models perform poorly in distinguishing subtle semantic differences. For example, when asked “what is the man in the background doing?”, all the CLIP models in our experiments give predictions about the man “in the foreground”. In such cases, even if the TAP-C method perfectly converts the question into a prompt, the final results would still be wrong. Nevertheless, we believe this issue could be addressed by enhancing CLIP models with a stronger text encoder, and we will explore this in future work.

Related Work

Leveraging aligned caption data, vision-language models pre-trained with an image-text discriminative loss have recently enabled strong zero-shot generalization on image classification and cross-modality retrieval tasks Jia et al. (2021); Radford et al. (2021). In contrast to this discriminative approach, Tsimpoukelli et al. (2021) prompt a large frozen language model with a visual prefix in a generative way, yielding the first vision-language few-shot model.

This work is also inspired by the line of research on language model prompting Liu et al. (2021). Initiated by the GPT series Radford et al. (2018, 2019); Brown et al. (2020), prompting has become a popular way to mine knowledge from pre-trained language models Petroni et al. (2019) in a zero-shot or few-shot manner Shin et al. (2020); Gao et al. (2021); Qin and Eisner (2021). Besides mining knowledge from the language model, the PET work Schick and Schütze (2021a, b) presents a semi-supervised prompting method for improving few-shot language understanding performance.

Conclusions

In this work, we empirically studied how to transfer CLIP models to vision-language understanding tasks. We first explored the CLIP models’ zero-shot VQA capability by leveraging language prompts and further proposed a parameter-efficient fine-tuning method to boost the few-shot performance. We also demonstrated a zero-shot cross-modality transfer capability of CLIP models on the visual entailment task. Experiments and analyses on VQAv2 and SNLI-VE confirm that the CLIP models can be good VL few-shot learners.

Acknowledgements

Haoyu Song, Wei-Nan Zhang, and Ting Liu are supported by the Science and Technology Innovation 2030 Major Project of China (No.2020AAA0108605), National Natural Science Foundation of China (No.62076081, No.61772153, and No.61936010), and Natural Science Foundation of Heilongjiang (No.YQ2021F006).

References

Appendix

Appendix A Datasets Statistics

Appendix B Details of Implementation

In our experiments, we use two kinds of pre-trained models: the CLIP variants and T5. We briefly describe these models as follows.

For the CLIP models, the text encoder is always a transformer, but its hidden size varies according to the size of the visual encoder. There are two architectures of visual encoders: the vision transformer (ViT) and ResNet.

CLIP ViT-B/16: both the text and visual encoders are 12-layer, 512-hidden transformers.

CLIP RN101: the text encoder is a 12-layer transformer, and the visual encoder is ResNet101, both with a hidden size of 512.

CLIP RN50x16: the text encoder is a 12-layer transformer, and the visual encoder is ResNet50x16, both with a hidden size of 768.

All CLIP models we used are from the official CLIP repository (https://github.com/openai/CLIP). For the language model T5, we use a publicly available T5$_{\text{large}}$ checkpoint from the Huggingface repository (https://huggingface.co/models). T5$_{\text{large}}$ has 24 hidden layers, 16 self-attention heads, a hidden size of 1024, and a total of 770M parameters. It is trained on the Colossal Clean Crawled Corpus (C4). Note that the T5 model is not trained or fine-tuned in either the few-shot or the zero-shot setting.

B.2 Hyperparameters

We report the hyperparameter settings of few-shot CLIP training in Table 8. We apply the same set of hyperparameters to fine-tune both ResNet CLIP and ViT CLIP.

The hyperparameters used for the MLP classifier in the visual entailment task are reported in Table 9. We performed grid searches on the combination of the learning rate, batch size, and dropout. The CLIP variants reached the best performances under different parameter combinations.

Table 10 shows the hyperparameter configurations for T5’s conditional generation, which is leveraged to generate the masked template and filter answers.

B.3 The Number of Learnable Parameters

Table 11 shows the numbers of the different types of learnable parameters in the CLIP models. The counts of Bias and Normalization both include the $\beta$ in Eq. 7. The number of BiNor parameters is about 0.2M to 0.3M, accounting for less than 0.3% of all parameters.
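A count of this kind can be sketched by filtering parameters by name. The snippet below is an illustration only: the parameter names, the substring patterns used to recognize normalization layers, and the toy shapes are all hypothetical, and they would need to be adapted to the naming of an actual CLIP checkpoint.

```python
from math import prod


def is_binor(name):
    """True for bias terms and for normalization scale/shift parameters.

    The name patterns are assumptions, not CLIP's guaranteed naming scheme.
    """
    is_norm = any(tag in name for tag in ("ln_", "bn", "norm"))
    return name.endswith(".bias") or (is_norm and name.endswith(".weight"))


def count_binor(named_shapes):
    """Return (number of BiNor parameters, number of all parameters)."""
    selected = sum(prod(shape) for name, shape in named_shapes if is_binor(name))
    total = sum(prod(shape) for _, shape in named_shapes)
    return selected, total


# Hypothetical (name, shape) pairs standing in for a model's parameters.
toy_params = [
    ("visual.proj.weight", (768, 512)),
    ("visual.proj.bias", (512,)),
    ("visual.ln_post.weight", (768,)),
    ("visual.ln_post.bias", (768,)),
]
selected, total = count_binor(toy_params)
print(selected, total, f"{100 * selected / total:.2f}%")
```

With a PyTorch model, the same predicate could be applied to the names from `model.named_parameters()` to decide which parameters keep `requires_grad` set to `True`.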

Appendix C Few-shot Training Procedure

Appendix D Examples of Template Generation

In this section, we showcase several template generation examples to illustrate how the proposed method works. Since we have introduced how to convert a question into a masked template by demonstrating examples to T5 (§ 3.1), here we directly present several examples in Table 12. These examples are sampled from five different question types and also cover the three answer types. As shown in Table 12, a single demo in the demonstration consists of a question and an answer with the [mask] token. Notice that the [mask] token is only a placeholder rather than a real mask token of the pre-trained language model. Unlike the <extra_id_0> token in T5, which represents a corrupted span, the [mask] token is used to inform T5 where the answer words should be placed. After seeing several examples in the demonstration, the powerful T5$_{\text{large}}$ model can capture the conversion pattern of each question type and correctly complete most conversions, including subtle grammatical details. Once the masked template is generated, we can infill the [mask] position with answer words and then carry out further processing. The processing for the yes/no type is a little different: as it is a binary task, we directly generate a positive prompt and a negative prompt, rather than masked templates, for yes and no, respectively.