All You May Need for VQA are Image Captions

Soravit Changpinyo, Doron Kukliansky, Idan Szpektor, Xi Chen, Nan Ding, Radu Soricut

Introduction

Visual Question Answering (VQA) is a complex multimodal task that, to be successfully modeled and evaluated, requires large amounts of annotations that are not naturally produced by existing business processes, the way translation-pair annotations Guo et al. (2018) or image alt-text annotations Sharma et al. (2018) are produced.

At present, a main bottleneck for developing robust VQA systems that are useful for downstream applications, such as for visually-impaired people and in the medical and education domains, appears to be a lack of large sets of image-question-answer training triplets (on the order of millions). Manual annotation of such triplets is costly, time-consuming, and prone to a variety of human biases that are difficult to account for Yuan (2021). In addition, the brittleness of VQA systems trained on such manual annotations is well-understood and documented Agrawal et al. (2018); Kafle and Kanan (2017).

To address the data limitation, we turn to a potential source for creating VQA examples: image-English caption pairs Chen et al. (2015); Sharma et al. (2018). Large-scale image caption datasets exist with millions Changpinyo et al. (2021), several hundred million Radford et al. (2021), or even billions Jia et al. (2021) of examples. Captions come mostly in the form of declarative sentences, e.g., “two bears are laying down on the ice”. Yet, the task of converting declarative captions into VQA question/answer pairs is still largely unexplored. It requires automatically inducing candidate answers fitting the VQA task, along with their respective questions based on the caption text (Fig. 1). We note that transforming declarative form to interrogative form plus answer(s) seems crucial, as there exists evidence that a vision-and-language model trained on declarative-language data cannot be successfully adapted or transferred “out-of-the-box” for VQA Wang et al. (2021).

Related Work

Question Generation (QG) is an active research topic in NLP. It is explored as a standalone task Heilman and Smith (2009); Nema et al. (2019), as a pre-training task for language models Narayan et al. (2020) and as a component in solutions for other textual tasks, such as question answering Alberti et al. (2019); Puri et al. (2020), information retrieval Mass et al. (2020); Gaur et al. (2021) and generation evaluation Durmus et al. (2020); Wang et al. (2020); Honovich et al. (2021). There are two main directions to QG: template-based Heilman and Smith (2009); Lyu et al. (2021); Dhole and Manning (2020) and neural-based, with the latter achieving state-of-the-art results Alberti et al. (2019); Narayan et al. (2020).

2.2 Question generation in computer vision

Question generation in computer vision aims at generating visual questions about a given image (or video), either for generating questions without knowing the answer Mostafazadeh et al. (2016); Zhang et al. (2017); Yang et al. (2018); Uehara et al. (2018); Krishna et al. (2019), e.g., for them to be answered by humans, or to help improve the VQA task Kafle et al. (2017); Li et al. (2018); Shah et al. (2019); Xu et al. (2021); Kil et al. (2021); Akula et al. (2021), e.g., for additional evaluation and as a means of data augmentation. Such QG models are typically based on VQA triplets as training data, whose language complexity is often limited, or require the collection of visual QG data Mostafazadeh et al. (2016). We take a different approach by leveraging models trained on textual QA datasets instead.

Multiple works leverage image captions or video transcripts as training sources Ren et al. (2015a); Banerjee et al. (2021); Yang et al. (2021a); Lee et al. (2021). In this approach, question-answer pairs are automatically generated from the text, ignoring the visual source, and are then combined with the related image/video to produce image-question-answer triplets. Banerjee et al. (2021) propose WeaQA, in which they generate questions from MSCOCO image captions Chen et al. (2015) using an improved template-based approach in COCOQA Ren et al. (2015a) as well as QA-SRL methods, enhanced by paraphrasing and backtranslation for linguistic variations. Lee et al. (2021) similarly train a VQA model from question-answer pairs derived from MSCOCO Captions but only use noun phrases as candidate answers, focusing on using it to verify generated captions but not on the VQA task itself. Yang et al. (2021a) generate question-answer pairs from instructional video ASR transcripts, which are then coupled with the related video.

In this work, we follow this direction, investigating what is required to generate data with good coverage for the VQA task in the image domain. We show that our neural-based textual question generation approach with captions is much more effective than previous approaches. Further, unlike previous work, we also explore automatically-curated out-of-domain image-text data sources.

2.3 Transfer learning for and in VQA

Existing work also explores the relationship between the image captioning task and the VQA task without question generation (Section 2.2). Fisch et al. (2020) perform image captioning by anticipating visual questions (i.e., using VQA data as additional supervision and post-inference evaluation). Wu et al. (2019) generate question-relevant image captions to aid VQA. Yang et al. (2021b) prompt GPT-3 Brown et al. (2020) to answer knowledge-based visual questions based on generated captions and tags and a few VQA examples.

Evidence suggests that image-text pre-training, especially when performed at scale, benefits vision-and-language tasks, including VQA Lu et al. (2019); Li et al. (2019); Chen et al. (2020); Tan and Bansal (2019); Su et al. (2020); Lu et al. (2020); Zhou et al. (2020); Li et al. (2020); Zhang et al. (2021); Cho et al. (2021); Wang et al. (2021); Yuan et al. (2021). However, these approaches do not work well without fine-tuning on the downstream VQA data Wang et al. (2021). Further, prompt-based learning and inference Liu et al. (2021) from a pre-trained image-text model that works for VQA is still an open research problem. In contrast, our approach works directly with the training data, explicitly transforming them into the interrogative form of question-answer pairs.

Our focus is the zero-shot transfer setting in WeaQA Banerjee et al. (2021) in which no manually-created VQA triplets are available during training. Note that the term zero-shot here is different from the one used in Teney and Hengel (2016), in which the model still has access to manually-created VQA triplets but is evaluated with unseen questions at test time. Similar to this, Chao et al. (2018b) explore cross-dataset VQA but they solely focus on human-annotated data along with approaches to transfer.

Textual Question Generation for VQA

We study whether automatically producing VQA annotations from existing image-text resources can alleviate or completely replace the need for manual data annotation. We only focus on English in this paper. To this end, we follow and improve upon some of the recent directions in Section 2.2 on automatic question-answer generation from text.

The only prior work on neural question generation from captions we are aware of, Lee et al. (2021), focuses on noun phrases as candidate answers. Yet, noun phrases alone do not cover the answer types included in typical VQA benchmarks such as VQA2.0 (as we will show in Section 5.1), including boolean, attribute, and verb answers, to name a few, which are required for questions such as “Is there…”, “What color…”, and “What is the dog doing”. We present a method that covers all of these answer types.

To extract candidate answers from a given caption, we parse it using spaCy (https://spacy.io/) and then extract candidates based on the Part-of-Speech (POS) and dependency parse tree annotations, as follows:

Noun Phrases. We extract all noun phrases annotated by spaCy, including named entities.

POS Spans. We extract sequences that begin with an open-class POS (nouns, verbs, adjectives and adverbs), that end with an open-class POS or an adverbial particle, and that do not contain any other POS in between except closed-class POS for determiners, adpositions and conjunctions.

Parse Tree Spans. We consider all sub-trees that include at least one open-class POS and no more than 3 words altogether. We only extract maximal spans, i.e., not extracting sub-trees that are fully included in other extracted sub-trees.

Boolean. Boolean questions are frequent in VQA benchmarks Goyal et al. (2017). Yet, ‘yes’ and ‘no’ are not found in captions, and so cannot be extracted as candidates by extracting text spans from captions. To this end, we also add ‘yes’ and ‘no’ as candidate answers and generate one question per candidate (see Section 3.2).

How many? 0. Captions do not normally contain mentions of ‘zero’ object counts. Hence, marking spans in a caption does not generate questions with the answer ‘0’. Therefore, we randomly sample a generated “How many?” question (with a non-zero answer) from a different caption and add it with the answer changed to ‘zero’ to the candidate set of the target caption. This procedure is potentially noisy because the answer for the sampled question could be non-zero also for the target image. From a manual inspection of 200 such questions, we found this to happen infrequently – about 4.5%.

Our extraction method covers various answer candidates such as compound nouns, noun phrases, named entities, boolean answers, cardinal and ordinal numbers, verbs and their compounds, (multi-word) adjectives and prepositional phrases. Table 1 provides an example of candidate answers of various types and the mechanism used to extract them.
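As an illustration of the POS Spans rule above, the following is a minimal sketch that operates on already-tagged tokens rather than on spaCy output. The tag sets, the `PART` stand-in label for adverbial particles, the `max_len` cap, and the function name `pos_spans` are all assumptions made for this example and are not taken from the paper's implementation.

```python
# Coarse POS labels follow the Universal POS tag set; the exact sets below are
# assumptions made for illustration, not the authors' configuration.
OPEN_CLASS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}
LINKERS = {"DET", "ADP", "CCONJ"}   # closed-class POS allowed inside a span
PARTICLE = {"PART"}                 # stand-in label for adverbial particles


def pos_spans(tagged, max_len=5):
    """Return candidate spans from a list of (token, pos) pairs."""
    spans = []
    for start in range(len(tagged)):
        if tagged[start][1] not in OPEN_CLASS:
            continue  # a span must begin with an open-class word
        for end in range(start, min(len(tagged), start + max_len)):
            pos = tagged[end][1]
            if pos in OPEN_CLASS or pos in PARTICLE:
                # Valid ending: emit the span covering tokens start..end.
                spans.append(" ".join(tok for tok, _ in tagged[start:end + 1]))
            if pos not in OPEN_CLASS and pos not in LINKERS:
                break  # this token may not appear inside a longer span
    return spans


caption = [("a", "DET"), ("black", "ADJ"), ("and", "CCONJ"),
           ("white", "ADJ"), ("dog", "NOUN")]
candidates = pos_spans(caption)
print(candidates)
```

For the caption "a black and white dog", this sketch yields multi-word attribute candidates such as "black and white" alongside single words such as "dog". Whether overlapping spans are all kept or pruned is not specified in the text, so the sketch simply keeps all of them.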

3.2 Question Generation

Given the advances in neural text generation, including models like T5 Raffel et al. (2020), we choose to use a neural generation model as $QG$ . Concretely, we use a T5-XXL model and further fine-tune it on SQuAD1.1 (Rajpurkar et al., 2016) for question generation. We take the top-scoring generated question for each caption-answer input. We note that our QG model is trained on a question answering dataset that is not caption-specific, and therefore is not optimized for caption inputs. From manual inspection of hundreds of generated questions, our QG model copes well with captions as input; see examples in Table 2 and Section 3.5.

3.3 Question-Answer Filtering

Generative models may hallucinate, that is, generate content that is inconsistent with its input source Alberti et al. (2019); Honovich et al. (2021). To mitigate this, we follow Alberti et al. (2019) and apply round-trip consistency by answering the generated question on the caption text with a question answering model. If the answer does not match the answer candidate offered as input to the question generation model, the generated question is discarded.

We use the token-level F1 score Wang et al. (2020) to determine if the candidate answer and the QA model’s answer are a match; if the score is above a threshold (manually set to 0.54, exemplified in Table 2), the pair is a match. For question answering, we use a T5-XXL model and further fine-tune it on SQuAD2.0 Rajpurkar et al. (2018) and Natural Questions Kwiatkowski et al. (2019).
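The matching step of the round-trip filter can be sketched as follows. This is a simplified illustration: tokens are obtained by lowercasing and whitespace splitting, which is an assumption (the paper does not describe its answer normalization), and the function names `token_f1` and `keep_pair` are hypothetical. The QA model itself is not included; `predicted` stands for whatever answer it returns.

```python
from collections import Counter


def token_f1(candidate, predicted):
    """Token-level F1 between the candidate answer and the QA model's answer."""
    cand = candidate.lower().split()
    pred = predicted.lower().split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(cand) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(cand)
    return 2 * precision * recall / (precision + recall)


def keep_pair(candidate, predicted, threshold=0.54):
    """Keep the generated question only if the two answers match closely enough."""
    return token_f1(candidate, predicted) > threshold
```

Under this sketch, a candidate "red bus" answered as "bus" has precision 1 and recall 1/2, giving an F1 of 2/3, which clears the 0.54 threshold; a candidate "laying down on the ice" answered as "ice" has an F1 of 1/3 and would be discarded.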

3.4 Sources of Image/Caption Data

To quantify the impact of our method, we focus on VQA classification for the VQA2.0 Goyal et al. (2017), GQA Hudson and Manning (2019), and OKVQA Marino et al. (2019) benchmarks (see Section 4.2). We thus restrict our classifier to the top 5,971 answers that are part of a unified answer vocabulary from these benchmarks (Appendix B.1). To this end, we remove triplets whose answers are not in the target answer vocabulary, and leave the study of using all generated triplets to future work. We then split our datasets into train/dev sets. In particular, since the images in VQA2.0 are taken from COCO, we split the COCO dataset based on the standard VQA2.0 train/dev splits of *train2014 and minival2014 Jiang et al. (2018) (with the exception of OKVQA, for which we split into train2014/val2014 to avoid using test images during training). For the CC3M dataset, we use the default CC3M train/dev splits Sharma et al. (2018). For each unique image-question pair in the dev split, we construct an answer target of size 10, following VQA2.0, by reducing or expanding the set of seed answers that occur for this image-question pair. Additional details are in Appendix B.1.

3.5 Quality Analysis

Visual Question Answering (VQA)

To assess the effectiveness of our automatic generation of VQA annotations, we perform extrinsic evaluations of the generated data by measuring its impact on a variety of established VQA benchmarks. We first describe the model, followed by the experimental setup and the results.

Following the literature, we treat VQA as a classification task, i.e., vocab-based VQA. In particular, we treat our target answers as labels, where a label could be multi-token (e.g., "Christmas tree", "black and white", "play tennis"). We define our set of labels based on top answers in the training set of downstream VQA datasets, which allows for a fair comparison with most work in the VQA literature since Antol et al. (2015).

Since our work explores the impact of automatically-generated training data, we fix the VQA model architecture across all experimental conditions. Our model fuses the input image and question (Fig. 4). On the image side, we take global image features from ResNet-152 He et al. (2016) pre-trained on ImageNet Russakovsky et al. (2015) plus 16 region-of-interest image features from Faster R-CNN Ren et al. (2015b) pre-trained on Visual Genome Krishna et al. (2017), following the bottom-up-features paradigm Anderson et al. (2018). On the question side, we use the encoder of a pre-trained T5-base checkpoint Raffel et al. (2020). Given the image features and the output token embeddings of the question encoder, a Transformer Vaswani et al. (2017) fuses the multi-modal intermediate representation and classifies it into the predefined answer space. We train the (randomly-initialized) fusing encoder and the text encoder end-to-end using standard cross-entropy loss. The parameters of both ResNet and Faster R-CNN are frozen during training. Additional details are given in Appendix B.2.

4.2 Experimental Setup

We consider three VQA benchmarks: VQA2.0 Goyal et al. (2017), GQA Hudson and Manning (2019), and OKVQA Marino et al. (2019). These datasets have their own characteristics and thus test different capabilities of VQA models. For instance, GQA puts emphasis on reasoning and OKVQA on external knowledge, whereas VQA2.0 is more general; VQA2.0 and GQA are an order of magnitude larger than OKVQA; GQA is generated using a question engine while VQA2.0 and OKVQA are human-annotated.

For training and evaluating on VQA2.0, we use the standard train/dev splits *train2014 and minival2014 Jiang et al. (2018). For GQA, we use the balanced v1.2 and combine the train and val splits for training and use the testdev split for evaluation, following the official guideline (https://cs.stanford.edu/people/dorarad/gqa/evaluate.html) and Tan and Bansal (2019). For OKVQA, we use the train/val splits for training/evaluation. Table 3 summarizes the sizes of the different datasets.

Metrics. To be compatible with prior work, on VQA2.0 and OKVQA we measure the standard VQA Accuracy. It is the average score over all 9-answer subsets of the 10 ground-truth answers (5 targets in OKVQA, replicated twice; Marino et al. (2019)), where each score is $\min\left(\frac{\#\text{answer occurrences}}{3}, 1\right)$. On GQA, we measure Top-1 Accuracy against the single ground-truth answer.
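The VQA Accuracy computation can be sketched as below. This is a minimal illustration that assumes the leave-one-out form of the official VQA evaluation (each of the 10 ground-truth answers is held out in turn, and the score is computed on the remaining 9). It compares answers by exact string match; any answer normalization applied beforehand is outside the sketch, and the function name `vqa_accuracy` is hypothetical.

```python
def vqa_accuracy(prediction, gt_answers):
    """Average of min(#matches / 3, 1) over all leave-one-out answer subsets."""
    scores = []
    for i in range(len(gt_answers)):
        # Hold out the i-th ground-truth answer and score on the rest.
        subset = gt_answers[:i] + gt_answers[i + 1:]
        matches = sum(ans == prediction for ans in subset)
        scores.append(min(matches / 3, 1.0))
    return sum(scores) / len(scores)
```

For example, with three annotators answering "yes" and seven answering "no", predicting "no" scores 1 on every subset, while predicting "yes" scores 2/3 on the three subsets that hold out a "yes" and 1 on the other seven, which averages to 0.9.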

Results

To gain further insights, we provide a breakdown of VQA Accuracy per VQA2.0 question types in Table 5. Boolean questions are the easiest and all models perform well on them. More challenging question types are ‘How many?’ and ‘What is’. One reason could be the validity of various answers, like “several” for counts. ‘What time?’ is the most difficult, probably due to lack of such information in captions.

5.2 Fully-Supervised Setting

5.3 Robustness of Existing VQA Training Sets

Considerations and Limitations

The resulting VQA model incorporates and may reinforce some of the biases and stereotypes present in the data. For instance, it may learn that answering questions such as “What is the gender of this person?” is a binary choice dictated by shallow cues, or that the answer to “For whom is this room decorated?” depends on stereotypical features present (or not) in the room depicted in the image. Mitigation strategies for such issues go beyond the scope of this paper, but we encourage the research community to consider addressing these issues as central for the successful deployment of this technology.

Conclusions

Acknowledgments. We would like to thank Or Honovich, Hagai Taitelbaum and Roee Aharoni for their help with question generation, Sebastian Goodman for his help with the VQA infrastructure, Piyush Sharma for his help with the Conceptual Captions, Nassim Oufattole for his early exploration of question generation, and Gal Elidan, Sasha Goldshtein, and Avinatan Hassidim for their useful feedback.

References

Appendix A Additional Examples and Analysis of Generated Data

Fig. 8 offers a more visual view of the differences between the question type distributions presented in Table 7.

Appendix B Implementation Details

Our default question and answer preprocessor is based on Jiang et al. (2018); Singh et al. (2020) (https://github.com/facebookresearch/mmf/blob/main/mmf/datasets/processors/processors.py), with the exception of GQA, for which we use https://github.com/stanfordnlp/mac-network/blob/gqa/preprocess.py. The unified answer vocabulary used in our experiments is the union of top answers from existing COCO-based VQA benchmarks: VQA2.0 (3,128, minimum answer frequency=9), GQA (1,843, all), OKVQA (2,000, top), and Visual7W (3,140, minimum answer frequency=3), for a total size of 5,971.

B.2 Details on Training and Evaluating Visual Question Answering

Our code for the VQA model is based on the Flaxformer framework (https://github.com/google/flaxformer). Both the text encoder and the multi-modal encoder have 6 blocks of Transformers, each of which consists of self-attention and a feed-forward network. We use 12 heads of inner dimension of 64, the embedding dimension of 768, and the MLP dimension of 2048. During training, we use Adafactor Shazeer and Stern (2018), with an initial learning rate of 0.0025, a linear warm-up of 5K steps for (pre-)training and 1K steps for fine-tuning, and an “inverse square root” learning rate schedule $\frac{1}{\sqrt{\max(n,k)}}$, where $n$ is the current training iteration and $k$ is the number of warm-up steps. We use a dropout rate of 0.0. We train each of the models with data parallelism using 16 Cloud TPU Pods (https://cloud.google.com/tpu), each with a batch size of 256, unless otherwise stated.
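One way to read the schedule described above is sketched below. The text gives only the decay factor $\frac{1}{\sqrt{\max(n,k)}}$, the initial learning rate, and the warm-up length; how these are combined is an assumption here. The sketch scales the decay factor so that it equals 1 at the end of warm-up and multiplies it by a linear ramp, so the rate peaks at the stated initial value at step $k$. The function name `learning_rate` is hypothetical.

```python
import math


def learning_rate(step, base_lr=0.0025, warmup_steps=5000):
    """Linear warm-up followed by inverse-square-root decay (assumed combination)."""
    # Linear ramp from 0 to 1 over the warm-up steps.
    warmup = min(1.0, step / warmup_steps)
    # sqrt(k / max(n, k)): constant 1 during warm-up, then decays as 1/sqrt(n).
    decay = math.sqrt(warmup_steps / max(step, warmup_steps))
    return base_lr * warmup * decay
```

Under this reading, the pre-training rate rises linearly to 0.0025 at step 5,000 and falls to half of that at step 20,000; passing `warmup_steps=1000` gives the fine-tuning variant.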

The hyperparameters for Transformers are selected to be consistent with a T5-base checkpoint, which has 220 million parameters Raffel et al. (2020) (except that now we have 2 encoders rather than an encoder and a decoder). We initially tuned the initial learning rate (0.0125, 0.075, 0.0025, 0.00125, 0.00075) and the dropout rate (0.0, 0.1, 0.2) on a fully-supervised VQA2.0 baseline model using VQA Accuracy and observed that 0.0025 and 0.0 work robustly across our experiments, but we did not extensively tune them in all of our experiments.

We implement VQA Accuracy ourselves based on the official challenge page for VQA2.0 (https://visualqa.org/evaluation.html).