Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training

Anthony Meng Huat Tiong, Junnan Li, Boyang Li, Silvio Savarese, Steven C. H. Hoi

Introduction

Recent years have witnessed unprecedented performance gains on many natural language reasoning tasks, especially in zero-shot and few-shot settings, driven by scaling up pretrained language models (PLMs) and their training data Devlin et al. (2019); Liu et al. (2019); Brown et al. (2020); Raffel et al. (2020); Black et al. (2022); Sanh et al. (2022); Wei et al. (2021). Given this success, it is natural to expect that PLMs should also boost zero-shot performance on vision-language reasoning tasks.

However, to leverage PLMs for vision-language tasks, most existing methods require non-trivial adaptation of the PLMs for the vision modality, which necessitates the design of new network components and training objectives. For example, Sung et al. (2022) and Alayrac et al. (2022) insert into the PLMs new layers that are trained from scratch. Tsimpoukelli et al. (2021) train vision encoders that output soft prompts to frozen PLMs. Chen et al. (2022) and Eichenberg et al. (2021) train both the vision encoders and new layers inserted into PLMs. In the zero-shot setting, various vision-language pretraining objectives are employed, such as image captioning Alayrac et al. (2022) and image-conditioned masked language modeling Jin et al. (2022).

From the perspective of general-purpose AI, the ability to perform new tasks by simply recombining large-scale pretrained models, or foundation models Bommasani et al. (2021), without architectural changes or extra training would be highly desirable. Such a system would be able to dynamically adjust to previously unknown tasks by simply rewiring a small number of foundation models. However, to obtain high performance without some form of end-to-end training would seem difficult, if not impossible.

We present Plug-and-Play VQA (PnP-VQA), a framework for zero-shot visual question answering which conjoins large pretrained models with zero additional training and achieves state-of-the-art performance on zero-shot VQAv2 Goyal et al. (2017) and GQA Hudson and Manning (2019). For the purpose of bridging the vision and language modalities, we employ a pretrained vision-language model (PVLM) Li et al. (2022b) that describes visual information with textual captions. In order to obtain relevant and informative captions, we apply a network interpretability technique Selvaraju et al. (2017) to detect image patches that are relevant to the question. After that, we generate captions stochastically for these image patches. Finally, we employ a PLM Khashabi et al. (2022) to answer the question from the captions.

Research in cognitive science and neuroscience suggests that the human cognitive system is largely modular Shettleworth (2012); Bertolero et al. (2015). For instance, the pioneering work of Fodor (1983) argued that low-level human cognition consists of several fast, autonomous, and domain-specific modules. For purely practical purposes, a modular design of artificial general intelligence would make it easy to harness rapid progress in each individual component, as the components can be individually replaced and updated without affecting other parts of the system. With this paper, we offer such a modular design for zero-shot VQA that leverages recent advances in PLMs and PVLMs and combines them with an innovative application of network interpretability.

We summarize our contributions as follows:

We introduce PnP-VQA, a modular framework for zero-shot VQA without training. Its flexibility allows PnP-VQA to jointly evolve as pretrained models continue to advance.

Besides natural language, we propose the use of network interpretation as the interface between pretrained LMs and VLMs. With an interpretability technique, we create image captions that extensively cover information relevant to the question, which enable accurate QA.

We demonstrate state-of-the-art zero-shot VQA performance on multiple benchmarks. On VQAv2, PnP-VQA$_{\text{11B}}$ obtains an 8.5% improvement over Flamingo$_{\text{80B}}$ Alayrac et al. (2022), which applies extensive end-to-end VL pretraining. On GQA, PnP-VQA$_{\text{large}}$ outperforms FewVLM$_{\text{large}}$ Jin et al. (2022) by 9.1%.

Related Work

Large-scale image-text pretraining of neural networks is a popular research direction. Various vision-language pretraining tasks have been proposed, including image-conditioned language modeling Tsimpoukelli et al. (2021); Alayrac et al. (2022), masked language modeling Tan and Bansal (2019); Lu et al. (2019); Li et al. (2021b), prefix language modeling Wang et al. (2022), image-text matching Li et al. (2019); Chen et al. (2020); Li et al. (2020) and image-text contrastive learning Radford et al. (2021); Jia et al. (2021); Li et al. (2021a). After pretraining, several models exhibit zero-shot capabilities in image-text retrieval Jia et al. (2021); Radford et al. (2021); Zeng et al. (2022b) and image captioning Wang et al. (2022); Li et al. (2022b). However, zero-shot VQA remains a challenging task due to its high requirement on the model’s reasoning ability.

Adapting PLMs for zero-shot VQA has shown promising results. In order to incorporate vision information into PLMs, most existing methods perform additional vision-language training on image-text data. Frozen Tsimpoukelli et al. (2021) trains the vision encoder while keeping the gigantic PLM frozen to retain its knowledge in question answering. The output from the vision encoder is prepended to the text as prompts to the frozen language model. FewVLM Jin et al. (2022) finetunes the PLM using the prefix language modeling and masked language modeling objectives. VLKD Dai et al. (2022) distills multimodal knowledge to PLM by using CLIP Radford et al. (2021) as the teacher model during finetuning. Flamingo Alayrac et al. (2022) adds additional layers to both the pretrained vision model and the PLM and trains the new layers on billions of image-text pairs.

Different from the above work, PnP-VQA directly employs pretrained models with neither architectural modifications nor additional training.

Most similar to our work, PICa Yang et al. (2022) converts an image to a single caption and adopts GPT-3 Brown et al. (2020) for zero-shot VQA. In comparison, PnP-VQA generates multiple question-guided captions and performs fusion of captions after encoding to effectively utilize a large number of captions, yielding considerable performance gains.

An orthogonal research direction for zero-shot VQA is to train the VLMs on synthetic VQA examples generated from captions Changpinyo et al. (2022); Banerjee et al. (2021). PnP-VQA does not require additional training.

Natural language as an intermediate representation or interface between different models or multiple steps of reasoning is an emerging machine learning strategy. It dates back to at least Andreas et al. (2018) and saw renewed interest in the past few months due to the prevalence of large PLMs. Andreas et al. (2018) and Vong and Lake (2022) learn natural language descriptions that function as few-shot classifiers within an image-text matching model. Bostrom et al. (2022) generate intermediate reasoning steps with finetuned PLMs. Zhou et al. (2022) prompt a PLM to generate subproblem descriptions for a complex problem, and feed the subproblems back to the PLM to solve hierarchically. Wu et al. (2022) chain PLM outputs and inputs. Zeng et al. (2022a) show that language-conjoined LM and VLM successfully perform captioning and retrieval but do not evaluate their models on VQA. In comparison, PnP-VQA adopts both natural language and network interpretation as the interface between different pretrained models.

Method

The central idea of Plug-and-Play VQA (PnP-VQA) is to establish an interface between a pretrained language model and a pretrained vision-language model without training. We demonstrate that natural language image captions and network saliency maps together serve as an effective interface. Ideally, the generated captions should thoroughly cover information that is present in the image and be relevant to the question. We foster relevance by identifying image patches most related to the question with a saliency map-based interpretability technique and generating captions from these patches only. Further, we promote coverage by injecting stochasticity, including random sampling of relevant image patches and of the textual tokens during caption generation.

The overall system architecture (Figure 1) consists of three modules:

an image-question matching module that identifies the relevant image patches given a question,

an image captioning module that generates a diverse set of captions from a set of image patches, and

a question answering module that outputs an answer given the question and the generated captions.

In this section, we introduce the three modules in detail.

An image serves as a rich source of information, but the question at hand is likely focused only on particular objects or regions. Therefore, we encourage PnP-VQA to generate captions that describe image regions relevant to the question instead of generic captions with no specific aim.

We accomplish this goal by leveraging BLIP Li et al. (2022b), a large-scale pretrained vision-language model that contains a network branch outputting a similarity score $\text{sim}(v,t)$ between an image $v$ and a text $t$. This branch, called Image-grounded Text Encoder (ITE), employs a vision transformer Dosovitskiy et al. (2021) that encodes the image, and a textual encoder that attends to the image features using cross-attention. As input to the image encoder, the image is equally divided into $K$ patches.

Let $A$ denote the cross-attention matrix of the ITE, in which the $j$th row indicates the amount of attention the $j$th textual token allocates to all image patches. At a selected layer of the ITE network, we compute the derivative of the similarity score with respect to the cross-attention scores, $\partial\,\text{sim}(v,t)/\partial A$, and multiply the gradient matrix element-wise with the cross-attention scores. The relevance of the $i$th image patch, $\text{rel}(i)$, takes the average over $H$ attention heads and the sum over $M$ textual tokens of these products.
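Written out, the description above corresponds to a formula of roughly the following shape. This is a sketch under stated assumptions: $A^{(h)}_{ji}$ is taken to be the attention that head $h$ assigns from textual token $j$ to image patch $i$, and the clamp $\max(0,\cdot)$ on the gradient follows the usual GradCAM convention rather than anything stated explicitly in the text.

$$
\text{rel}(i) \;=\; \frac{1}{H}\sum_{j=1}^{M}\sum_{h=1}^{H}
\max\!\left(0,\ \frac{\partial\,\text{sim}(v,t)}{\partial A^{(h)}_{ji}}\right) A^{(h)}_{ji}
$$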

We provide the following motivation for the technique. The attention matrix $A$ may be taken as indicative of patch importance. However, much redundancy exists among these matrices and many attention heads may be pruned with little performance loss Bian et al. (2021), suggesting that some scores are uninformative. Inspired by GradCAM Selvaraju et al. (2017), we filter out uninformative attention scores by weighting each score with its gradient, which indicates how much an increase in that score would raise the image-text similarity.
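To make the gradient-weighted attention concrete, the sketch below computes a per-patch relevance from nested lists and then draws a subset of patches. Several details are assumptions rather than facts from the text: the index layout `attn[h][j][i]`, the GradCAM-style clamp on negative gradients, sampling without replacement with probability proportional to relevance, and the helper names `patch_relevance` and `sample_patches`.

```python
import random
from typing import List


def patch_relevance(attn: List[List[List[float]]],
                    grad: List[List[List[float]]]) -> List[float]:
    """attn[h][j][i] and grad[h][j][i]: head h, textual token j, image patch i."""
    num_heads, num_tokens, num_patches = len(attn), len(attn[0]), len(attn[0][0])
    rel = [0.0] * num_patches
    for h in range(num_heads):
        for j in range(num_tokens):
            for i in range(num_patches):
                # Assumed clamp: only gradients that raise the similarity count.
                rel[i] += max(0.0, grad[h][j][i]) * attn[h][j][i]
    # Sum over tokens, average over heads.
    return [r / num_heads for r in rel]


def sample_patches(rel: List[float], k_prime: int, rng: random.Random) -> List[int]:
    """Draw up to k_prime distinct patches, weighted by relevance (assumed scheme)."""
    candidates = [i for i, r in enumerate(rel) if r > 0]
    weights = [rel[i] for i in candidates]
    chosen: List[int] = []
    while candidates and len(chosen) < k_prime:
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(chosen)
```

Patches with zero relevance are never drawn, so a question that concerns a small region yields captions built from that region only.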

Figure 2 shows some examples of generic captions and question-guided captions with associated relevance heatmaps. We can clearly observe that question-guided captions contain more relevant information that helps produce the correct answers.

Table 1 gives a quantitative analysis of the effect of different patch selection methods on zero-shot VQA performance across three datasets. Question-guided patch sampling substantially outperforms generic captioning using all patches and random patch sampling, especially when the number of captions is large. With 100 question-guided captions, PnP-VQA outperforms the 5 human-written captions from MS COCO by 5.2% on VQAv2 and 6.0% on OK-VQA, demonstrating the merit of the proposed approach.

3.2 Informative Image Captioning

Even with relevant image regions, there may still be more than one way to describe these regions. Some descriptions may contain the desired answer to the question, whereas others may not. Without the ability to identify the answer a priori, we aim to generate maximally diverse captions to provide coverage of possible answers.

We adopt the image captioning network branch from BLIP Li et al. (2022b) and apply stochastic top-$k$ sampling Fan et al. (2018) instead of beam search, which is known to produce dull and repetitive captions Vijayakumar et al. (2018); Holtzman et al. (2020). The input to the network contains the $K^{\prime}$ image patches sampled according to relevance (see §3.1). We prepend a short prompt, “a picture of ”, as input to the text decoder. We repeat this process to generate $N$ captions per image to encourage diversity of captions and coverage of visual content. To prevent repetition, we keep a generated caption only if it is not subsumed by any previous caption as an exact substring.
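The two mechanisms in this paragraph, top-$k$ sampling and the substring filter, can be sketched in a few lines. This is an illustration only: `top_k_sample` operates on a plain list of logits for a single decoding step, and `keep_novel` is a hypothetical name for the filter; neither is taken from the BLIP or LAVIS code.

```python
import math
import random
from typing import List, Sequence


def top_k_sample(logits: Sequence[float], k: int, rng: random.Random) -> int:
    """Sample one token id from the k highest-scoring entries, renormalized."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    # Subtracting the peak keeps the exponentials numerically stable.
    weights = [math.exp(logits[i] - peak) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def keep_novel(captions: Sequence[str]) -> List[str]:
    """Drop a caption that is an exact substring of an earlier kept caption."""
    kept: List[str] = []
    for cap in captions:
        if not any(cap in prev for prev in kept):
            kept.append(cap)
    return kept
```

Checking only against kept captions is sufficient here, since a caption contained in a dropped caption is also contained in the caption that caused the drop.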

3.3 Answering the Question

The question-answering encoder-decoder model is pretrained on text data only and can only process text. Therefore, we include the question and the generated captions as input to the model. As discussed in §3.2, the image captioning module generates multiple diverse captions. To process such long inputs efficiently, we adopt the Fusion-in-Decoder (FiD) strategy Izacard and Grave (2021).

We illustrate the FiD strategy in Figure 3 by comparing it with the more straightforward Fusion-in-Encoder (FiE), which concatenates the question and all captions into a long paragraph as input to the encoder. In contrast, FiD encodes each caption with the question separately and concatenates the encoded representations of all tokens from all captions. The result is fed as input to the decoder and is processed through the cross-attention mechanism. Since the time complexity of the self-attention mechanism scales quadratically with input length, whereas the cross-attention scales linearly with the encoder’s output length, FiD is much more efficient than FiE. Further, FiE is constrained by the maximum input length of the encoder, caused by the positional encoding, but FiD does not have this constraint. Hence, with FiD, PnP-VQA can benefit from even more captions.
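The cost argument can be illustrated with a rough count of the token pairs that encoder self-attention must score under each strategy. The sketch below is a simplification under stated assumptions: whitespace-separated words stand in for tokens, the question and captions are joined with single spaces, and the function names are hypothetical.

```python
from typing import List


def fie_inputs(question: str, captions: List[str]) -> List[str]:
    """Fusion-in-Encoder: one long sequence with the question and every caption."""
    return [question + " " + " ".join(captions)]


def fid_inputs(question: str, captions: List[str]) -> List[str]:
    """Fusion-in-Decoder: one short sequence per caption, each paired with the question."""
    return [question + " " + cap for cap in captions]


def self_attention_pairs(inputs: List[str]) -> int:
    """Token pairs scored by encoder self-attention (whitespace tokens as a proxy)."""
    return sum(len(text.split()) ** 2 for text in inputs)
```

With a question of length $Q$ and $N$ captions of length $L$, this proxy gives $(Q+NL)^2$ pairs for FiE against $N(Q+L)^2$ for FiD, so the gap widens as $N$ grows, while the FiE input also grows toward the encoder's length limit.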

We plot the performance of FiD and FiE against the number of captions in Figure 4. Initially, both methods improve as the number of captions increases. However, the performance of FiE plateaus at around 40 captions, the point at which the maximum input length is exceeded, whereas the performance of FiD continues to rise.

Experiments

We adopt multiple zero-shot VQA benchmarks, including the validation set (214,354 questions) and test-dev set (107,394 questions) of VQAv2 Goyal et al. (2017), the test set (5,046 questions) of OK-VQA Marino et al. (2019), and the test-dev set (12,578 questions) of GQA-balanced Hudson and Manning (2019). We include the VQAv2 validation set as a few recent works Tsimpoukelli et al. (2021); Jin et al. (2022) evaluate their performance on this dataset only. We obtain the answer by open-ended generation and perform evaluation based on exact matching. We report soft-accuracy Goyal et al. (2017) for VQAv2 and OK-VQA to account for multiple ground-truth answers; for GQA, we report the standard accuracy.
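For readers unfamiliar with soft-accuracy, the sketch below shows the commonly used simplified form, in which a prediction earns full credit once at least three annotators gave the same answer. It assumes answers are already normalized strings; the official evaluation additionally normalizes answers and averages over annotator subsets, which is omitted here, and the function name is hypothetical.

```python
from typing import Sequence


def vqa_soft_accuracy(prediction: str, human_answers: Sequence[str]) -> float:
    """min(1, matches / 3) over the annotators' answers, using exact string match."""
    matches = sum(1 for ans in human_answers if ans == prediction)
    return min(1.0, matches / 3.0)
```

Under this form, an answer given by two of ten annotators scores 2/3, which is how the metric accounts for questions with more than one acceptable answer.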

4.2 Implementation Details

To obtain the image-question matching module and image captioning module, we adopt BLIP Li et al. (2022b) with the ViT-L/16 architecture pretrained on 129M image-text pairs. The original BLIP-ITM and BLIP-Caption models are further finetuned on the 2017 train split of COCO Captions Lin et al. (2014), which partially overlaps with VQAv2 and OK-VQA. To prevent data leakage, we instead finetune on the 2014 train split of COCO Captions, which does not overlap with the VQA evaluation datasets. We emphasize that this represents less, not more, training compared to the publicly released BLIP.

For the question answering module, we adopt UnifiedQAv2 Khashabi et al. (2022) trained on diverse textual QA datasets. It is worth noting that UnifiedQAv2 is completely unaware of the visual modality during training. Therefore, its training data do not overlap with the VQA datasets.

Unless otherwise stated, we utilize a total of 100 captions per question. We select the 8th cross-attention layer of the ITE network for GradCAM. We sample $K^{\prime}=20$ image patches for the generation of each caption, and use $k=50$ for top-$k$ decoding (see Fig. 9 in Appendix B). For VQAv2 and OK-VQA, we apply FiD and encode the question with one caption at a time. However, for GQA, we encode each question with a group of 5 captions. GQA requires compositional visual reasoning and thus benefits from more contextual information per question. We perform experiments using LAVIS Li et al. (2022a) on 8 Nvidia A100 GPUs.
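The settings in this paragraph can be collected in one place, together with the caption grouping used for GQA. The sketch below is illustrative: the class and field names are hypothetical and do not correspond to LAVIS configuration keys, and grouping consecutive captions joined by spaces is an assumption about how a group of 5 captions is formed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PnPVQAConfig:
    num_captions: int = 100         # captions per question
    gradcam_layer: int = 8          # cross-attention layer of the ITE used for GradCAM
    patches_per_caption: int = 20   # K'
    top_k: int = 50                 # k for top-k decoding
    captions_per_input: int = 1     # 1 for VQAv2 and OK-VQA, 5 for GQA


def group_captions(captions: List[str], size: int) -> List[str]:
    """Join consecutive captions so that each encoder input carries `size` of them."""
    return [" ".join(captions[i:i + size]) for i in range(0, len(captions), size)]
```

With 100 captions and groups of 5, the question is encoded 20 times for GQA instead of 100 times, and each encoded input carries more context.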

4.3 Comparison with the State of the Art

We compare with state-of-the-art methods that formulate zero-shot VQA as open-ended answer generation. We categorize the methods based on how the pretrained networks are conjoined. In the first group, including VL-T5$_{\text{no-vqa}}$ Cho et al. (2021), FewVLM Jin et al. (2022), VLKD Dai et al. (2022), Flamingo Alayrac et al. (2022), and Frozen Tsimpoukelli et al. (2021), a vision encoder (VE) embeds the image as a dense matrix and feeds it to the pretrained language model (PLM). After that, the system performs a round of end-to-end vision-language (VL) training on tasks other than VQA, such as image captioning. VL-T5$_{\text{no-vqa}}$ and FewVLM freeze the VE and finetune the PLM, whereas Frozen freezes the PLM and trains the VE. VLKD finetunes both the PLM and part of the VE. Flamingo partially finetunes both the VE and the PLM. In the second group, the two foundation models are not jointly trained. Instead, they use language in the form of captions as the intermediate representation for an image. This group includes PICa Yang et al. (2022) and our proposed model, PnP-VQA.

Table 2 shows the results. PnP-VQA outperforms previous methods by large margins on VQAv2 and GQA. On VQAv2 test-dev, PnP-VQA$_{\text{11B}}$ outperforms the second-best technique, Flamingo$_{\text{80B}}$ Alayrac et al. (2022), by 8.5%. PnP-VQA$_{\text{3B}}$ outperforms Flamingo$_{\text{80B}}$ by 7.2% despite its significantly smaller size, and the similar-sized Flamingo$_{\text{3B}}$ by 14.3%. On GQA, PnP-VQA$_{\text{large}}$ outperforms FewVLM$_{\text{large}}$ by 9.1% with a similar-sized PLM, despite the lack of end-to-end training. Only on OK-VQA does Flamingo perform better than PnP-VQA. OK-VQA requires external knowledge that is not present in the images and cannot be solved by good captions alone. We hypothesize that end-to-end training on Flamingo's gigantic vision-language dataset induces a mapping between images and knowledge concepts that helps with OK-VQA. However, PnP-VQA is still better on OK-VQA than all other baselines that are not trained on the gigantic Flamingo data. Compared with the language-conjoined PICa Yang et al. (2022) with 175B parameters, PnP-VQA$_{\text{11B}}$ achieves a sizable improvement of 18.2%.

The results underscore the difficulty of zero-shot VQA using language models without any vision-language (VL) training. PICa, with its 175B-parameter language model, achieves performance comparable to FewVLM$_{\text{large}}$, whose language model is 236x smaller but finetuned on VL data. On the other hand, finetuning a billion-scale language model could incur heavy computational cost and risk catastrophic forgetting Tsimpoukelli et al. (2021); Alayrac et al. (2022). PnP-VQA demonstrates the feasibility of a different paradigm: using billion-scale pretrained language models for VQA with zero training.

Analysis

Intuitively, if the captions contain the correct answer, the QA model would have a higher chance to answer correctly. To measure the utility of captions, we compute the answer hit rate (AHR), or the proportion of questions for which at least one caption contains the ground-truth answer verbatim. Here we exclude questions with yes/no answers as the meaning of “yes” and “no” can be contextual and these two words appear rarely in captions.
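The answer hit rate defined above is straightforward to compute. The sketch below makes a few assumptions that the text does not specify: each question has a single ground-truth answer string, "verbatim" is interpreted as a case-insensitive substring match, and the function name is hypothetical.

```python
from typing import List, Sequence


def answer_hit_rate(answers: Sequence[str], captions: Sequence[List[str]]) -> float:
    """Fraction of non-yes/no questions with at least one caption containing the answer."""
    hits = 0
    total = 0
    for ans, caps in zip(answers, captions):
        target = ans.lower()
        if target in ("yes", "no"):
            continue  # yes/no questions are excluded, as described in the text
        total += 1
        if any(target in cap.lower() for cap in caps):
            hits += 1
    return hits / total if total else 0.0
```

A plain substring test can over-count short answers that occur inside longer words, so a word-boundary match may be preferable in practice.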

Figure 5(a) shows the correlation between the AHR and VQA accuracy, computed over the VQAv2 validation set, for three techniques of image patch sampling: question-guided sampling, uniform random sampling, and all patches. We observe that, within each sampling method, the VQA accuracy increases as the AHR increases. This corroborates our hypothesis that the presence of the answer in the captions facilitates the generation of the correct answer.

The correlation between performance and AHR is not perfect, as AHR does not capture other factors that may affect the answer accuracy, such as the position of the answer in the sentence and the number of its occurrences. However, AHR provides an easy-to-compute and useful measure for the information quality of the captions.

Figure 5(b) shows how AHR changes with the number of captions. Among the three techniques, question-guided sampling produces captions with the highest AHR. Thus, we may attribute the good performance of PnP-VQA partially to its informative, question-guided captions that directly contain the correct answer. Further, as the number of captions increases from 20 to 100, question-guided AHR increases from 71.8% to 84.0%. This demonstrates the benefit of Fusion-in-Decoder, which allows PnP-VQA to utilize up to 100 captions.

5.2 How sensitive is PnP-VQA to the caption decoding method?

As the content of captions plays a crucial role in the performance of PnP-VQA, we investigate the sensitivity to the choice of the caption decoding method. We test four methods: the deterministic beam search and three stochastic methods, namely temperature sampling Ficler and Goldberg (2017); Caccia et al. (2020), nucleus sampling Holtzman et al. (2020), and top-$k$ sampling Fan et al. (2018). We generate 100 captions from each method, and report the results in Table 3. PnP-VQA performs very similarly across stochastic decoding methods, but beam search results in a noticeable drop. Upon close inspection, we observe that beam search generates repetitive captions that do not sufficiently cover different aspects of the image.
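For reference, the two stochastic alternatives to top-$k$ sampling differ only in how they reshape the next-token distribution before drawing from it. The sketch below shows both reshaping steps on a plain list of values for a single decoding step; the function names are hypothetical and the code is not taken from any captioning library.

```python
import math
from typing import Dict, List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Temperatures below 1 sharpen the distribution; temperatures above 1 flatten it."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus(probs: Sequence[float], p: float) -> Dict[int, float]:
    """Smallest set of most-probable tokens whose mass reaches p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept: List[int] = []
    mass = 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in kept}
```

Beam search, by contrast, keeps the highest-scoring continuations deterministically, which is consistent with the repetitive captions observed above.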

5.3 Can PnP-VQA work with other textual QA models?

We experiment with two other PLMs as the question answering module for PnP-VQA: T0 Sanh et al. (2022) and GPT-J Wang and Komatsuzaki (2021). T0 is an encoder-decoder model which is pretrained in a multi-task fashion on a collection of NLP tasks, including question answering. GPT-J is a decoder-only model, a much smaller open-source alternative to GPT-3 Brown et al. (2020), which is pretrained with a task-agnostic language modeling loss on a large-scale text corpus. Table 4 shows that UnifiedQAv2 performs better on VQA tasks compared to T0 and GPT-J. We attribute UnifiedQAv2’s good performance to the fact that it is a task-specific question answering model with superior textual QA performance. The result indicates that the choice of PLM is important when performing zero-shot VQA with zero training. The modular and flexible design of PnP-VQA leaves room for further performance improvements as more advanced PLMs emerge.

Conclusion

We propose PnP-VQA, a framework with zero additional training for zero-shot VQA by conjoining off-the-shelf pretrained models. PnP-VQA leverages an image-question matching module to determine image patches relevant to the current question. An image captioning module then generates question-guided captions, which are processed by a question answering module to produce an answer. PnP-VQA achieves state-of-the-art performance on multiple VQA benchmarks. We hope that our work will bring inspiration for further research in flexible, modular AI systems for solving vision-language tasks.

Limitations

Like two sides of the same coin, the strengths and weaknesses of PnP-VQA both result from the zero-training modular system design. PnP-VQA enjoys the power of pretrained models but also inherits the bias from these models. It enjoys the efficiency of zero training, but introduces additional inference cost due to the multi-step process. Nevertheless, we believe that the strengths of PnP-VQA outweigh its limitations, and welcome further investigations to help debias pretrained models and improve inference speed.

Acknowledgments

Anthony Meng Huat Tiong is supported by Salesforce and Singapore Economic Development Board under the Industrial Postgraduate Programme. Boyang Li is supported by the Nanyang Associate Professorship and the National Research Foundation Fellowship (NRF-NRFF13-2021-0006), Singapore. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.

Appendix A Visualization

We show visualizations of GradCAM heatmaps and the generated captions for VQAv2, OK-VQA, and GQA.

Appendix B Hyperparameter sensitivity

We study how VQAv2 validation accuracy varies with the cross-attention layer used for GradCAM and with the number of image patches sampled for question-guided caption generation. Figure 9(a) shows no clear relationship between VQA accuracy and the cross-attention layer used for GradCAM. The maximum difference in VQA accuracy across different cross-attention layers is 3%. Figure 9(b) shows that VQA accuracy is negatively correlated with the number of sampled image patches. As $K^{\prime}$ increases, the sampled patches become less relevant to the questions, and question-guided patch sampling becomes akin to using all patches.