From Recognition to Cognition: Visual Commonsense Reasoning

Rowan Zellers, Yonatan Bisk, Ali Farhadi, Yejin Choi

Introduction

With one glance at an image, we can immediately infer what is happening in the scene beyond what is visually obvious. For example, in the top image of Figure 1, not only do we see several objects (people, plates, and cups), we can also reason about the entire situation: three people are dining together, they ordered their food before the photo was taken, [person3] is serving and not eating with them, and [person1] ordered the pancakes and bacon (as opposed to the cheesecake), because [person4] is pointing to [person1] while looking at the server, [person3].

Visual understanding requires seamless integration between recognition and cognition: beyond recognition-level perception (e.g., detecting objects and their attributes), one must perform cognition-level reasoning (e.g., inferring the likely intents, goals, and social dynamics of people). State-of-the-art vision systems can reliably perform recognition-level image understanding, but struggle with complex inferences, like those in Figure 1. We argue that as the field has made significant progress on recognition-level building blocks, such as object detection, pose estimation, and segmentation, now is the right time to tackle cognition-level reasoning at scale.

As a critical step toward complete visual understanding, we present the task of Visual Commonsense Reasoning. Given an image, a machine must answer a question that requires a thorough understanding of the visual world evoked by the image. Moreover, the machine must provide a rationale justifying why that answer is true, referring to the details of the scene, as well as background knowledge about how the world works. These questions, answers, and rationales are expressed using a mixture of rich natural language as well as explicit references to image regions. To support clean-cut evaluation, all our tasks are framed as multiple choice QA.

Our new dataset for this task, VCR, is the first of its kind and is large-scale: 290k pairs of questions, answers, and rationales, over 110k unique movie scenes. A crucial challenge in constructing a dataset of this complexity at this scale is how to avoid annotation artifacts. A recurring challenge in most recent QA datasets has been that human-written answers contain unexpected but distinct biases that models can easily exploit. Often these biases are so prominent that models can select the right answers without even looking at the questions.

Thus, we present Adversarial Matching, a novel QA assignment algorithm that allows for robust multiple-choice dataset creation at scale. The key idea is to recycle each correct answer for a question exactly three times, as a negative answer for three other questions. Each answer thus has the same probability (25%) of being correct: this resolves the issue of answer-only biases, and disincentivizes machines from always selecting the most generic answer. We formulate the answer recycling problem as a constrained optimization based on the relevance and entailment scores between each candidate negative answer and the gold answer, as measured by state-of-the-art natural language inference models. A neat feature of our recycling algorithm is a knob that can control the tradeoff between human and machine difficulty: we want the problems to be hard for machines while easy for humans.

Narrowing the gap between recognition- and cognition-level image understanding requires grounding the meaning of the natural language passage in the visual data, understanding the answer in the context of the question, and reasoning over the shared and grounded understanding of the question, the answer, the rationale and the image. In this paper we introduce a new model, Recognition to Cognition Networks (R2C). Our model performs three inference steps. First, it grounds the meaning of a natural language passage with respect to the image regions (objects) that are directly referred to. It then contextualizes the meaning of an answer with respect to the question that was asked, as well as the global objects not mentioned. Finally, it reasons over this shared representation to arrive at an answer.

Experiments on VCR show that R2C greatly outperforms state-of-the-art visual question-answering systems, obtaining 65% accuracy at question answering, 67% at answer justification, and 44% at staged answering and justification. Still, the task and dataset are far from solved: humans score roughly 90% on each. We provide detailed insights and an ablation study to point to avenues for future research.

In sum, our major contributions are fourfold: (1) we formalize a new task, Visual Commonsense Reasoning, and (2) present a large-scale multiple-choice QA dataset, VCR, (3) that is automatically assigned using Adversarial Matching, a new algorithm for robust multiple-choice dataset creation. (4) We also propose a new model, R2C, that aims to mimic the layered inferences from recognition to cognition; this also establishes baseline performance on our new challenge. The dataset is available to download, along with code for our model, at visualcommonsense.com.

Task Overview

We present VCR, a new task that challenges vision systems to holistically and cognitively understand the content of an image. For instance, in Figure 1, we need to understand the activities ([person3] is delivering food), the roles of people ([person1] is a customer who previously ordered food), the mental states of people ([person1] wants to eat), and the likely events before and after the scene ([person3] will serve the pancakes next). Our task covers these categories and more: a distribution of the inferences required is in Figure 2.

Visual understanding requires not only answering questions correctly, but doing so for the right reasons. We thus require a model to give a rationale that explains why its answer is true. Our questions, answers, and rationales are written in a mixture of rich natural language as well as detection tags, like ‘[person2]’: this helps to provide an unambiguous link between the textual description of an object (‘the man on the left in the white shirt’) and the corresponding image region.

To make evaluation straightforward, we frame our ultimate task, staged answering and justification, in a multiple-choice setting. Given a question along with four answer choices, a model must first select the right answer. If its answer was correct, then it is provided four rationale choices (that could purportedly justify its correct answer), and it must select the correct rationale. We call this $Q{\rightarrow}AR$, since a prediction is correct only if both the chosen answer and the chosen rationale are correct.
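The staged metric can be sketched as follows. This is a minimal illustration, assuming predictions and gold labels are given as aligned lists of integer choice indices (0 to 3) and that rationale predictions are made with the correct answer provided; the function name and input layout are hypothetical and are not taken from the released evaluation code.

```python
def staged_accuracy(answer_preds, answer_gold, rationale_preds, rationale_gold):
    """Return (Q->A, QA->R, Q->AR) accuracies for aligned lists of choice indices."""
    n = len(answer_gold)
    answer_hits = [p == g for p, g in zip(answer_preds, answer_gold)]
    rationale_hits = [p == g for p, g in zip(rationale_preds, rationale_gold)]
    # A staged prediction counts only if both the answer and the rationale are right.
    joint_hits = [a and r for a, r in zip(answer_hits, rationale_hits)]
    return sum(answer_hits) / n, sum(rationale_hits) / n, sum(joint_hits) / n


# Toy example with four questions (all indices are made up for illustration).
q_a, qa_r, q_ar = staged_accuracy(
    answer_preds=[0, 2, 1, 3], answer_gold=[0, 2, 3, 3],
    rationale_preds=[1, 0, 2, 2], rationale_gold=[1, 3, 2, 2],
)
```

Because a staged prediction needs both parts right, the $Q{\rightarrow}AR$ score can never exceed either sub-task score.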

Our task can be decomposed into two multiple-choice sub-tasks that correspond to answering ( $Q{\rightarrow}A$ ) and justification ( $QA{\rightarrow}R$ ) respectively.

Data Collection

In this section, we describe how we collect the questions, correct answers and correct rationales for VCR. Our key insight, towards collecting commonsense visual reasoning problems at scale, is to carefully select interesting situations. We thus extract still images from movie clips. The images from these clips describe complex situations that humans can decipher without additional context: for instance, in Figure 1, we know that [person3] will serve [person1] pancakes, whereas a machine might not understand this unless it sees the entire clip.

To ensure diversity, we make no limiting assumptions about the predefined set of actions. Rather than searching for predefined labels, which can introduce search engine bias, we collect images from movie scenes. The underlying scenes come from the Large Scale Movie Description Challenge and YouTube movie clips (namely, Fandango MovieClips: youtube.com/user/movieclips). To avoid simple images, we train and apply an ‘interestingness filter’ (e.g. a closeup of a syringe in Figure 3). We annotated images for ‘interestingness’ and trained a classifier using CNN features and detection statistics; details are in the appendix, Sec B.

We center our task around challenging questions requiring cognition-level reasoning. To make these cognition-level questions simple to ask, and to avoid the clunkiness of referring expressions, VCR’s language integrates object tags ([person2]) and explicitly excludes referring expressions (‘the woman on the right’). These object tags are detected from Mask-RCNN, and the images are filtered so as to have at least three high-confidence tags.

Crowdsourcing Quality Annotations

Workers on Amazon Mechanical Turk were given an image with detections, along with additional context in the form of video captions. (This additional clip-level context helps workers ask and answer about what will happen next.) They then ask one to three questions about the image; for each question, they provide a reasonable answer and a rationale. To ensure top-tier work, we used a system of quality checks and paid our workers well (more details in the appendix, Sec B).

The result is an underlying dataset with high agreement and diversity of reasoning. Our dataset contains a myriad of interesting commonsense phenomena (Figure 2) and a great diversity in terms of unique examples (Supp Section A); almost every answer and rationale is unique.

Adversarial Matching

We cast VCR as a four-way multiple choice task, to avoid the evaluation difficulties of language generation or captioning tasks where current metrics often prefer incorrect machine-written text over correct human-written text. However, it is not obvious how to obtain high-quality incorrect choices, or counterfactuals, at scale. While past work has asked humans to write several counterfactual choices for each correct answer, this process is expensive. Moreover, it has the potential of introducing annotation artifacts: subtle patterns that are by themselves highly predictive of the ‘correct’ or ‘incorrect’ label.

In this work, we propose Adversarial Matching: a new method that allows for any ‘language generation’ dataset to be turned into a multiple choice test, while requiring minimal human involvement. An overview is shown in Figure 4. Our key insight is that the problem of obtaining good counterfactuals can be broken up into two subtasks: the counterfactuals must be as relevant as possible to the context (so that they appeal to machines), while they cannot be overly similar to the correct response (so that they don’t become correct answers incidentally). We balance between these two objectives to create a dataset that is challenging for machines, yet easy for humans.

Here, $\lambda{>}0$ controls the tradeoff between similarity and relevance. (We tuned this hyperparameter by asking crowd workers to answer multiple-choice questions at several thresholds, and chose the value for which human performance is above $90\%$; details are in appendix Sec C.) To obtain multiple counterfactuals, we perform several bipartite matchings. To ensure that the negatives are diverse, during each iteration we replace the similarity term with the maximum similarity between a candidate response $\bm{r}_{j}$ and all responses currently assigned to $\bm{q}_{i}$ .
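One round of this matching can be sketched on a toy problem. The sketch below is an illustration under stated assumptions, not the released implementation: it assumes the weight of pairing query $i$ with response $j$ is log-relevance plus $\lambda$ times the log of one minus similarity, it takes both score matrices as given (in practice they would come from trained models), and it solves the matching by brute force, which is only feasible for a handful of examples. The name `match_negatives` is hypothetical.

```python
import math
from itertools import permutations


def match_negatives(rel, sim, lam):
    """Assign one negative response to every query by maximum-weight matching.

    rel[i][j]: assumed relevance score in (0, 1) of response j to query i.
    sim[i][j]: assumed similarity score in (0, 1) between responses i and j.
    lam: tradeoff weight on the similarity penalty.
    """
    n = len(rel)

    def weight(i, j):
        if i == j:  # a query may not receive its own correct response
            return -math.inf
        return math.log(rel[i][j]) + lam * math.log(1.0 - sim[i][j])

    best_score, best_perm = -math.inf, None
    # Brute force over all one-to-one assignments; fine for a toy example only.
    for perm in permutations(range(n)):
        score = sum(weight(i, j) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(best_perm)


# Toy scores for three (query, response) pairs; all values are made up.
rel = [[0.9, 0.8, 0.1], [0.2, 0.9, 0.7], [0.6, 0.1, 0.9]]
sim = [[0.99, 0.95, 0.1], [0.95, 0.99, 0.2], [0.1, 0.2, 0.99]]
negatives = match_negatives(rel, sim, lam=1.0)
```

At dataset scale, a polynomial-time assignment solver (for example the Hungarian algorithm) would replace the permutation loop. Later rounds would swap the similarity term for the maximum similarity to the responses already assigned to each query.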

To guarantee that there is no question/answer overlap between the training and test sets, we split our full dataset (by movie) into 11 folds. We match the answers and rationales individually for each fold. Two folds are pulled aside for validation and testing.
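A movie-level split of this kind might look like the sketch below. It is an assumption-laden illustration: the text does not say how movies were assigned to folds, so a deterministic hash of a hypothetical `movie` field is used here, which keeps all examples from one movie together but does not balance fold sizes.

```python
import zlib


def fold_for_movie(movie_id, num_folds=11):
    """Map a movie identifier to a fold index deterministically."""
    return zlib.crc32(movie_id.encode("utf-8")) % num_folds


def split_by_movie(examples, num_folds=11):
    """Group examples (dicts with a hypothetical 'movie' key) into folds."""
    folds = [[] for _ in range(num_folds)]
    for example in examples:
        folds[fold_for_movie(example["movie"], num_folds)].append(example)
    return folds


# Toy data: six examples drawn from three made-up movie identifiers.
examples = [{"movie": m, "qid": i} for i, m in enumerate(["a", "b", "c", "a", "b", "a"])]
folds = split_by_movie(examples)
```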

Recognition to Cognition Networks

We introduce Recognition to Cognition Networks (R2C), a new model for visual commonsense reasoning. Performing well on this task requires a deep understanding of language, vision, and the world. For example, in Figure 5, answering ‘Why is [person4] pointing at [person1]?’ requires multiple inference steps. First, we ground the meaning of the query and each response, which involves referring to the image for the two people. Second, we contextualize the meaning of the query, response, and image together. This step includes resolving the referent ‘he,’ and why one might be pointing in a diner. Third, we reason about the interplay of relevant image regions, the query, and the response. In this example, the model must determine the social dynamics between [person1] and [person4]. We formulate our model as three high-level stages: grounding, contextualization, and reasoning, and use standard neural building blocks to implement each component.

In more detail, recall that a model is given an image, a set of objects $\bm{o}$ , a query $\bm{q}$ , and a set of responses $\bm{r}^{(i)}$ (of which exactly one is correct). The query $\bm{q}$ and response choices $\bm{r}^{(i)}$ are all expressed in terms of a mixture of natural language and pointing to image regions: notation-wise, we will represent the object tagged by a word $w$ as $o_{w}$ . If $w$ isn’t a detection tag, $o_{w}$ refers to the entire image boundary. Our model will then consider each response $\bm{r}$ separately, using the following three components:

Contextualization

Given a grounded representation of the query and response, we use attention mechanisms to contextualize these sentences with respect to each other and the image context. For each position $i$ in the response, we will define the attended query representation as $\hat{\mathbf{q}}_{i}$ using the following equation:

$$\alpha_{i,j}=\text{softmax}_{j}(\mathbf{r}_{i}\mathbf{W}\mathbf{q}_{j}),\qquad\hat{\mathbf{q}}_{i}=\sum_{j}\alpha_{i,j}\mathbf{q}_{j}.$$

To contextualize an answer with the image, including implicitly relevant objects that have not been picked up from the grounding stage, we perform another bilinear attention between the response $\mathbf{r}$ and each object $o$ ’s image features. Let the result of the object attention be $\hat{\mathbf{o}}_{i}$ .
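A bilinear attention step of the kind used in this stage can be sketched in plain Python. This is a minimal illustration, assuming scores of the form $\mathbf{r}_{i}\mathbf{W}\mathbf{q}_{j}$ normalized with a softmax over $j$; the matrix `W` would be learned in practice, and the function name and toy vectors are hypothetical.

```python
import math


def bilinear_attention(response, query, W):
    """For each response vector r_i, return a softmax-weighted sum of query vectors.

    Scores are r_i^T W q_j, where W stands in for a learned matrix.
    """
    attended = []
    for r in response:
        # Row vector r^T W.
        rW = [sum(r[k] * W[k][d] for k in range(len(r))) for d in range(len(W[0]))]
        scores = [sum(a * b for a, b in zip(rW, q)) for q in query]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]
        attended.append(
            [sum(a * q[d] for a, q in zip(alphas, query)) for d in range(len(query[0]))]
        )
    return attended


identity = [[1.0, 0.0], [0.0, 1.0]]
# The response vector aligns with the first query vector, so attention concentrates there.
focused = bilinear_attention([[1.0, 0.0]], [[10.0, 0.0], [0.0, 10.0]], identity)
# Identical query vectors: any attention weights return that same vector.
uniform = bilinear_attention([[1.0, 0.0]], [[1.0, 2.0], [1.0, 2.0]], identity)
```

The same routine applies to the object attention by passing object features in place of the query vectors.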

Reasoning

Last, we allow the model to reason over the response, attended query and objects. We accomplish this using a bidirectional LSTM that is given as context $\hat{\mathbf{q}}_{i}$ , $\mathbf{r}_{i}$ , and $\hat{\mathbf{o}}_{i}$ for each position $i$ . For better gradient flow through the network, we concatenate the output of the reasoning LSTM along with the question and answer representations for each timestep: the resulting sequence is max-pooled and passed through a multilayer perceptron, which predicts a logit for the query-response compatibility.

Neural architecture and training details

For our image features, we use ResNet50. To obtain strong representations for language, we used BERT representations. BERT is applied over the entire question and answer choice, and we extract a feature vector from the second-to-last layer for each word. We train R2C by minimizing the multi-class cross entropy between the prediction for each response $\bm{r}^{(i)}$ , and the gold label. See the appendix (Sec E) for detailed training information and hyperparameters. Our code is also available online at visualcommonsense.com.
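The training objective for a single example can be written out directly. The sketch below assumes one logit per response choice and uses a numerically stable log-sum-exp; the function name is hypothetical and the logits are made up.

```python
import math


def response_cross_entropy(logits, gold_index):
    """Softmax cross entropy for one example; logits holds one score per response."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[gold_index]


# With four equal logits the model is at chance, so the loss is log(4).
chance_loss = response_cross_entropy([0.0, 0.0, 0.0, 0.0], gold_index=2)
# A large logit on the gold response drives the loss toward zero.
confident_loss = response_cross_entropy([0.0, 0.0, 8.0, 0.0], gold_index=2)
```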

Results

In this section, we evaluate the performance of various models on VCR. Recall that our main evaluation mode is the staged setting ( $Q{\rightarrow}AR$ ). Here, a model must choose the right answer for a question (given four answer choices), and then choose the right rationale for that question and answer (given four rationale choices). If it gets either the answer or the rationale wrong, the entire prediction will be wrong. This holistic task decomposes into two sub-tasks wherein we can train individual models: question answering ( $Q{\rightarrow}A$ ) as well as answer justification ( $QA{\rightarrow}R$ ). Thus, in addition to reporting combined $Q{\rightarrow}AR$ performance, we will also report $Q{\rightarrow}A$ and $QA{\rightarrow}R$ .

A model is presented with a query $\bm{q}$ , and four response choices $\bm{r}^{(i)}$ . Like our model, we train the baselines using multi-class cross entropy between the set of responses and the label. Each model is trained separately for question answering and answer justification. We follow the standard train, val and test splits.

Baselines

We compare our R2C to several strong language and vision baselines.

We evaluate the level of visual reasoning needed for the dataset by also evaluating purely text-only models. For each model, we represent $\bm{q}$ and $\bm{r}^{(i)}$ as streams of tokens, with the detection tags replaced by the object name (e.g. chair5 $\rightarrow$ chair). To minimize the discrepancy between our task and pretrained models, we replace person detection tags with gender-neutral names.
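This preprocessing might be implemented along the following lines. The sketch assumes tags are written as `[person2]` or `[chair5]` in the text; the list of gender-neutral names is hypothetical, since the names actually used are not given here.

```python
import re

# Hypothetical list of gender-neutral names; the actual names are not specified.
NEUTRAL_NAMES = ["Casey", "Riley", "Jessie", "Jackie", "Avery"]
TAG_PATTERN = re.compile(r"\[([a-z ]+?)(\d+)\]")


def detag(text):
    """Replace person tags with a name and other tags with the bare object name."""
    def substitute(match):
        label, index = match.group(1), int(match.group(2))
        if label == "person":
            # The same tag index always maps to the same name.
            return NEUTRAL_NAMES[(index - 1) % len(NEUTRAL_NAMES)]
        return label
    return TAG_PATTERN.sub(substitute, text)


question = detag("Why is [person4] pointing at [person1]?")
statement = detag("[person2] sits on [chair5].")
```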

BERT: BERT is a recently released NLP model that achieves state-of-the-art performance on many NLP tasks.

BERT (response only): We use the same BERT model; however, during fine-tuning and testing, the model is only given the response choices $\bm{r}^{(i)}$ .

ESIM+ELMo: ESIM is another high performing model for sentence-pair classification tasks, particularly when used with ELMo embeddings.

LSTM+ELMo: Here an LSTM with ELMo embeddings is used to score responses $\bm{r}^{(i)}$ .

VQA Baselines

Additionally, we compare our approach to models developed on the VQA dataset. All models use the same visual backbone as R2C (ResNet 50) as well as text representations (GloVe) that match the original implementations.

RevisitedVQA: This model takes as input a query, response, and image features for the entire image, and passes the result through a multilayer perceptron, which has to classify ‘yes’ or ‘no’. For VQA, the model is trained by sampling positive or negative answers for a given question; for our dataset, we simply use the result of the perceptron (for response $\bm{r}^{(i)}$ ) as the $i$ -th logit.

Bottom-up and Top-down attention (BottomUpTopDown): This model attends over region proposals given by an object detector. To adapt to VCR, we pass this model object regions referenced by the query and response.

Multimodal Low-rank Bilinear Attention (MLB): This model uses Hadamard products to merge the vision and language representations given by a query and each region in the image.

Multimodal Tucker Fusion (MUTAN): This model expresses joint vision-language context in terms of a tensor decomposition, allowing for more expressivity.

We note that BottomUpTopDown, MLB, and MUTAN all treat VQA as a multilabel classification over the top 1000 answers. Because VCR is highly diverse (Supp A), for these models we represent each response $\bm{r}^{(i)}$ using a GRU (to match the other GRUs used in these models, which encode $\bm{q}$ ). The output logit for response $i$ is given by the dot product between the final hidden state of the GRU encoding $\bm{r}^{(i)}$ , and the final representation from the model.

Human performance

We asked five different workers on Amazon Mechanical Turk to answer 200 dataset questions from the test set. A different set of five workers were asked to choose rationales for those questions and answers. Predictions were combined using a majority vote.
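Combining the five workers' choices is a simple aggregation, sketched below. The tie-breaking rule (the choice seen first wins) is an arbitrary assumption; with five voters and four options a tie is possible, and the text does not say how ties were resolved.

```python
from collections import Counter


def majority_vote(choices):
    """Return the most common choice index among worker votes.

    Ties go to the choice that appears first, which is an assumption.
    """
    return Counter(choices).most_common(1)[0][0]


# Five made-up worker votes per question.
clear_winner = majority_vote([2, 2, 1, 2, 0])
split_vote = majority_vote([3, 1, 3, 1, 3])
```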

Results and Ablations

We present our results in Table 1. Of note, standard VQA models struggle on our task. The best model, in terms of $Q{\rightarrow}AR$ accuracy, is MLB, with 17.2% accuracy. Deep text-only models perform much better: most notably, BERT obtains 35.0% accuracy. One possible explanation for this gap in performance is a bottlenecking effect: whereas VQA models are often built around multilabel classification of the top 1000 answers, VCR requires reasoning over two (often long) text spans. Our model, R2C, obtains an additional boost over BERT of 9% accuracy, reaching a final performance of 44%. Still, this figure is nowhere near human performance (85% on the staged task), so there is significant headroom remaining.

We evaluated our model under several ablations to determine which components are most important. Removing the query representation (and query-response contextualization entirely) results in a drop of 21.6 accuracy points in terms of $Q{\rightarrow}AR$ performance. Interestingly, this setting allows it to leverage its image representation more heavily: the text based response-only models (BERT response only, and LSTM+ELMo) perform barely better than chance. Removing the reasoning module lowers performance by 1.9%, which suggests that it is beneficial, but not critical for performance. The model suffers most when using GloVe representations instead of BERT: a loss of 24%. This suggests that strong textual representations are crucial to VCR performance.

Qualitative results

Last, we present qualitative examples in Figure 6. R2C works well for many images: for instance, in the first row, it correctly infers that a bank robbery is happening. Moreover, it picks the right rationale: even though all of the options have something to do with ‘banks’ and ‘robbery,’ only c) makes sense. Similarly, analyzing the examples for which R2C chooses the right answer but the wrong rationale allows us to gain more insight into its understanding of the world. In the third row, the model incorrectly believes there is a crib while assigning less probability mass on the correct rationale: that [person2] is being shown a photo of [person4]’s children, which is why [person2] might say how cute they are.

Related Work

Visual Question Answering was one of the first large-scale datasets that framed visual understanding as a QA task, with questions about COCO images typically answered with a short phrase. This line of work also includes ‘pointing’ questions and templated questions with open ended answers. Recent datasets also focus on knowledge-base style content. On the other hand, the answers in VCR are entire sentences, and the knowledge required by our dataset is largely background knowledge about how the world works.

Recent work also includes movie or TV-clip based QA. In these settings, a model is given a video clip, often alongside additional language context such as subtitles, a movie script, or a plot summary. (As we find in Appendix D, including additional language context tends to boost model performance.) In contrast, VCR features no extra language context besides the question. Moreover, the use of explicit detection tags means that there is no need to perform person identification or linkage with subtitles.

An orthogonal line of work has been on referring expressions: asking which image region a natural language sentence refers to. We explicitly avoid referring expression-style questions by using indexed detection tags (like [person1]).

Last, some work focuses on commonsense phenomena, such as ‘what if’ and ‘why’ questions. However, the space of commonsense inferences is often limited by the underlying dataset chosen (synthetic or COCO scenes). In our work, we ask commonsense questions in the context of rich images from movies.

Explainability

AI models are often right, but for questionable or vague reasons. This has motivated work in having models provide explanations for their behavior, in the form of a natural language sentence or an attention map. Our rationales combine the best of both of these approaches, as they involve both natural language text as well as references to image regions. Additionally, while it is hard to evaluate the quality of generated model explanations, choosing the right rationale in VCR is a multiple choice task, making evaluation straightforward.

Commonsense Reasoning

Our task unifies work involving reasoning about commonsense phenomena, such as physics, social interactions, procedure understanding, and predicting what might happen next in a video.

Adversarial Datasets

Past work has proposed the idea of creating adversarial datasets, whether by balancing the dataset with respect to priors or switching them at test time. Most relevant to our dataset construction methodology is the idea of Adversarial Filtering. This was used to create the SWAG dataset, a multiple choice NLP dataset for natural language inference. Correct answers are human-written, while wrong answers are chosen from a pool of machine-generated text that is further validated by humans. However, the correct and wrong answers come from fundamentally different sources, which raises the concern that models can cheat by performing authorship identification rather than reasoning over the image. In contrast, in Adversarial Matching, the wrong choices come from the exact same distribution as the right choices, and no human validation is needed.

Conclusion

In this paper, we introduced Visual Commonsense Reasoning, along with a large dataset VCR for the task that was built using Adversarial Matching. We presented R2C, a model for this task, but the challenge of cognition-level visual understanding is far from solved.

Acknowledgements

We thank the Mechanical Turk workers for doing such an outstanding job with dataset creation - this dataset and paper would not exist without them. Thanks also to Michael Schmitz for helping with the dataset split and Jen Dumas for legal advice. This work was supported by the National Science Foundation through a Graduate Research Fellowship (DGE-1256082) and NSF grants (IIS-1524371, 1637479, 165205, 1703166), the DARPA CwC program through ARO (W911NF-15-1-0543), the IARPA DIVA program through D17PC00343, the Sloan Research Foundation through a Sloan Fellowship, the Allen Institute for Artificial Intelligence, the NVIDIA Artificial Intelligence Lab, and gifts by Google and Facebook. The views and conclusions contained herein are those of the authors and should not be interpreted as representing endorsements of IARPA, DOI/IBC, or the U.S. Government.

Appendix

Appendix A Dataset Analysis

In this section, we continue our high-level analysis of VCR.

How challenging is the language in VCR? We show several statistics in Table 3. Of note, unlike many question-answering datasets wherein the answer is a single word, our answers average to more than 7.5 words. The rationales are even longer, averaging at more than 16 words.

An additional informative statistic is the counts of unique answers and rationales in the dataset, which we plot in Figure 7. As shown, almost every answer and rationale is unique.

A.2 Objects covered

On average, there are roughly two objects mentioned over a question, answer, and rationale. Most of these objects are people (Figure 8), though other types of COCO objects are common too. Objects such as ‘chair,’ ‘tie,’ and ‘cup’ are often detected; however, these objects vary in terms of scene importance: even though more ties exist in the data than cars, workers refer to cars more in their questions, answers, and rationales. Some objects, such as hair driers and snowboards, are rarely detected.

A.3 Movies covered

Our dataset also covers a broad range of movies - over 2000 in all, mostly via MovieClips (Figure 9). We note that since we split the dataset by movie, the validation and test sets cover a completely disjoint set of movies, which forces a model to generalize. For each movie image, workers ask 2.6 questions on average (Figure 10), though the exact number varies - by design, workers ask more questions for more interesting images.

A.4 Inference types

It is challenging to accurately categorize commonsense and cognition-level phenomena in the dataset. One approach that we presented in Figure 2 is to categorize questions by type: to estimate this over the entire training set, we used several patterns, which we show in Table 4. Still, we note that automatic categorization of the inference types required for this task is hard. This is in part because a single question might require multiple types of reasoning: for example, ‘Why does person1 feel embarrassed?’ requires reasoning about person1’s mental state, as well as requiring an explanation. For this reason, we argue that this breakdown underestimates the task difficulty.

Appendix B Dataset Creation Details

In this section, we elaborate more on how we collected VCR, and about our crowdsourcing process.

The images in VCR are extracted from video clips from LSMDC and MovieClips. These clips vary in length from a few seconds (LSMDC) to several minutes (MovieClips). Thus, to obtain more still images from these clips, we performed shot detection. Our pipeline is as follows:

We iterate through a video clip at a speed of one frame per second.

During each iteration, we also perform shot detection: if we detect a mean difference of 30 pixels in HSV space, then we register a shot boundary.

After a shot boundary is found, we apply Mask-RCNN on the middle frame for the shot, and save the resulting image and detection information.

We used a threshold of 0.7 for Mask-RCNN, and the best detection/segmentation model available for us at the time: X-101-64x4d-FPN (available via the Detectron Model Zoo), which obtains 42.4 box mAP and 37.5 mask mAP on COCO.
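The shot-detection steps above can be sketched without any video library. This is a rough illustration under several assumptions: frames are already sampled at one per second and converted to HSV, each frame is a flat list of channel values, and "a mean difference of 30" is read as a mean absolute difference of at least 30 between consecutive sampled frames. The function names are hypothetical, and the detector call is omitted.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two frames given as flat lists of HSV values."""
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)


def middle_frames(frames, threshold=30.0):
    """Return the index of the middle frame of every detected shot.

    frames: one flat HSV list per sampled frame (one frame per second).
    """
    if not frames:
        return []
    boundaries = [0]
    for t in range(1, len(frames)):
        if mean_abs_diff(frames[t - 1], frames[t]) >= threshold:
            boundaries.append(t)  # frame t starts a new shot
    boundaries.append(len(frames))
    # Each shot spans [start, end); pick the frame halfway through it.
    return [(start + end - 1) // 2 for start, end in zip(boundaries, boundaries[1:])]


# Toy clip: three dark frames followed by four bright frames (values are made up).
clip = [[10, 10, 10]] * 3 + [[200, 200, 200]] * 4
selected = middle_frames(clip)
```

Each selected index marks the frame that would then be passed to the object detector.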

B.2 Interestingness Filter

Recall that we use an ‘interestingness filter’ to ensure that the images in our dataset are high quality. First, every image had to have at least two people in it, as detected by Mask RCNN. However, we also found that many images with two or more people were still not very interesting. The two main failure cases here are when there are one or two people detected, but they aren’t doing anything interesting (Figure 11a), or when the image is especially grainy and blurry. Thus, we opted to learn an additional classifier for determining which images were interesting.

Our filtering process evolved as we collected data for the task. The first author of this paper first manually annotated 2000 images from LSMDC as being ‘interesting’ or ‘not interesting’ and trained a logistic regression model to predict said label. The model is given as input the number of people detected by Mask RCNN, along with the number of objects (that are not people) detected. We used this model to identify interesting images in LSMDC, using a threshold that corresponded to 70% precision. This resulted in 72k images selected; these images were annotated first.
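Picking a score cutoff for a target precision can be done from held-out labeled scores. The sketch below is one plausible procedure, not necessarily the one used: it ranks examples by classifier score and returns the lowest cutoff at which the selected set still reaches the target precision. The function name and toy data are hypothetical, and scores are assumed to be distinct.

```python
def threshold_for_precision(scores, labels, target=0.7):
    """Return the lowest score cutoff whose selected set reaches the target precision.

    scores: classifier probabilities; labels: 1 for 'interesting', 0 otherwise.
    Returns None if no cutoff reaches the target.
    """
    ranked = sorted(zip(scores, labels), reverse=True)
    best, true_positives = None, 0
    for rank, (score, label) in enumerate(ranked, start=1):
        true_positives += label
        if true_positives / rank >= target:
            best = score  # selecting everything scored >= this value meets the target
    return best


# Six made-up validation images with classifier scores and manual labels.
cutoff = threshold_for_precision(
    scores=[0.9, 0.8, 0.7, 0.6, 0.5, 0.4], labels=[1, 1, 0, 1, 0, 0]
)
```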

During the crowdsourcing process, we obtained data that allowed us to build an even better interestingness filter later on. Workers were asked, along with each image, whether they thought that the image was especially interesting (and thus should go to more workers), just okay, or especially boring (and hard to ask even one good question for). We used this to train a deeper model for this task. The model uses a ResNet 50 backbone over the entire image as well as a multilayer perceptron over the object counts. The entire model is trained end-to-end: 2048 dimensional features from ResNet are concatenated with a 512 dimensional projection of the object counts, and used to predict the labels. (In addition to predicting interestingness, the model also predicts the number of questions a worker asks, but we never ended up using these predictions.) We used this model to select the most interesting 40k images from MovieClips, which finished off the annotation process.

B.3 Crowdsourcing quality data

As mentioned in the paper, crowdsourcing data at the quality and scale of VCR is challenging. We used several best practices for crowdsourcing, which we elaborate on in this section.

We used Amazon Mechanical Turk for our crowdsourcing. A screenshot of our interface is given in Figure 12. Given an image, workers asked questions, answered them, and provided a rationale explaining why their answer might be correct. These are all written in a mixture of natural language text, as well as referring to detection regions. In our annotation UI, workers refer to the regions by writing the tag number. (Note that this differs a bit from the format in the paper: we originally had workers write out the full tag, like [person5], but this is often long and the workers would sometimes forget the brackets. Thus, the tag format here is just a single number, like 5.)

Workers could ask anywhere between one and three questions per HIT. We paid the workers proportionally at $0.22 per triplet. According to workers, this resulted in $8–25/hr. This proved necessary as workers reported feeling “drained” by the high quality required.

We added several automated checks to the crowdsourcing UI to ensure high quality. The workers had to write at least four words for the question, three for the answer, and five for the rationale. Additionally, the workers had to explicitly refer to at least one detection on average per question, answer, and rationale triplet. This was automatically detected to ensure that the workers were referring to the detection tags in their submissions.
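The checks above can be sketched as a small validator. This is only an illustration, not the actual annotation UI code: the function names, the whitespace tokenization, and the rule that any bare number counts as a detection tag are all assumptions made for the example.

```python
import re

# Minimum word counts stated in the text above.
MIN_WORDS = {"question": 4, "answer": 3, "rationale": 5}

# Assumption: in the annotation UI a tag is written as a bare number, e.g. "5".
# This simplification also counts ordinary numbers ("2 cups") as tags.
TAG = re.compile(r"\b\d+\b")


def check_triplet(triplet):
    """Return a list of length problems for one question/answer/rationale triplet."""
    problems = []
    for field, minimum in MIN_WORDS.items():
        n_words = len(triplet[field].split())
        if n_words < minimum:
            problems.append(f"{field} has {n_words} words, needs at least {minimum}")
    return problems


def count_tags(triplet):
    """Count detection-tag references across all three fields of a triplet."""
    return sum(len(TAG.findall(triplet[field])) for field in MIN_WORDS)


def check_hit(triplets):
    """Validate every triplet in a HIT, plus the average-tag requirement."""
    problems = []
    for i, triplet in enumerate(triplets):
        problems += [f"triplet {i}: {p}" for p in check_triplet(triplet)]
    if triplets:
        average_tags = sum(count_tags(t) for t in triplets) / len(triplets)
        if average_tags < 1:
            problems.append("fewer than one detection tag per triplet on average")
    return problems
```

A submission would be accepted by this sketch only when `check_hit` returns an empty list.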

We also noticed early on that sometimes workers would write detailed stories that were only loosely connected with the semantic content of the image. To fix this, workers also had to self-report whether their answer was likely (above 75% probability), possible (25-75% probability), or unlikely (below 25% probability). We found that this helped deter workers from coming up with consistently unlikely answers for each image. The likelihood ratings were never used for the task, since we found they weren’t necessary to obtain high human agreement.

Instructions

Like for any crowdsourcing task, we found wording the instructions carefully to be crucial. We encouraged workers to ask about higher-level actions, versus lower-level ones (such as ‘What is person1 wearing?’), as well as to not ask questions and answers that were overly generic (and thus could apply to many images). Workers were encouraged to answer reasonably in a way that was not overly unlikely or unreasonable. To this end, we provided the workers with high-quality example questions, answers, and rationales.

Qualification exam

Since we were picky about the types of questions asked, and the format of the answers and rationales, workers had to pass a qualification task to double check that they understood the format. The qualification test included a mix of multiple-choice graded answers as well as a short written section, which was to provide a single question, answer, and rationale for an image. The written answer was checked manually by the first author of this paper.

Work verification

In addition to the initial qualification exam, we also periodically monitored the annotation quality. Every 48 hours, the first author of this paper would review work and provide aggregate feedback to ensure that workers were asking good questions, answering them well, and structuring the rationales in the right way. Because this took significant time, we then selected several outstanding workers and paid them to do this job for us: through a separate set of HITs, these outstanding workers were paid $0.40 to provide detailed feedback on a submission that another worker made. Roughly one in fifty HITs were annotated in this way to give extra feedback. Throughout this process, workers whose submission quality dropped were dequalified from the HITs.

Appendix C Adversarial Matching Details

There are a few more details that we found useful when performing the Adversarial Matching to create VCR, which we discuss in this section.

In practice, most responses in our dataset are not relevant to most queries, due to the diversity of responses in our dataset and the range of detection tags (person1, etc.).

To fix this, for each query $\bm{q}_{i}$ (with associated object list $\bm{o}_{i}$ and response $\bm{r}_{i}$ ) we turn each candidate $\bm{r}_{j}$ into a template, and use a rule based system to probabilistically remap its detection tags to match the objects in $\bm{o}_{i}$ . With some probability, a tag in $\bm{r}_{j}$ is replaced with a tag in $\bm{q}_{i}$ and $\bm{r}_{i}$ . Otherwise, it is replaced with a random tag from $\bm{o}_{i}$ .

We note that our approach isn’t perfect. The remapping system often produces responses that violate predicate/argument structure, such as ‘person1 is kissing person1.’ However, our approach does not need to be perfect: because the detections for response $\bm{r}_{j}$ are remapped uniquely for each query $\bm{q}_{i}$ , with some probability, there should be at least some remappings of $\bm{r}_{i}$ that make sense, and the question relevance model $P_{rel}$ should select them.
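A minimal sketch of the remapping step follows, under several assumptions that the text does not pin down: the bracketed tag syntax, the function name `remap_tags`, and the value `p_context=0.7` are placeholders, and the sketch ignores object classes entirely (the rule-based system described above is more involved).

```python
import random
import re

# Hypothetical tag syntax for this example, e.g. "[person1]" or "[cup2]".
TAG = re.compile(r"\[([a-z]+)(\d+)\]")


def remap_tags(candidate, context_tags, object_tags, p_context=0.7, rng=random):
    """Rewrite the tags in a candidate response r_j to refer to objects of query i.

    candidate:    text of r_j, containing tags like "[person4]"
    context_tags: tags appearing in q_i or r_i, e.g. ["person1", "person2"]
    object_tags:  all tags in o_i
    p_context:    assumed probability of drawing from context_tags (placeholder value)
    """
    def replace(match):
        use_context = bool(context_tags) and rng.random() < p_context
        pool = context_tags if use_context else object_tags
        return "[" + rng.choice(pool) + "]"

    return TAG.sub(replace, candidate)
```

Because each tag is resampled independently, this sketch can also produce outputs such as "[person1] is kissing [person1]", which is the kind of imperfect remapping that the relevance model is expected to filter out.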

Semantic categories

Recall that we use 11 folds for the dataset of around 290k questions, answers, and rationales. Since we must perform Adversarial Matching once for the answers, as well as for the rationales, this would naively involve 22 matchings on a fold size of roughly 26k. We found that the major computational bottleneck wasn’t the bipartite matching (we use the https://github.com/gatagat/lap implementation), but rather the computation of all-pairs similarity and relevance between $\sim$ 26k examples.

There is one additional potential problem: we want the dataset examples to require a lot of complex commonsense reasoning, rather than simple attribute identification. However, if the response and the query disagree in terms of gender pronouns, then many of the dataset examples can be reduced to gender identification.

We address both of these problems by dividing each fold into ‘buckets’ of 3k examples for matching. We divide the examples up in terms of the pronouns in the response: if the response contains a female or male pronoun, then we put the example into a ‘female’ or ‘male’ bucket, respectively; otherwise the response goes into the ‘neutral’ bucket. To further divide the dataset examples, we also put different question types in different buckets for the question answering task (e.g. who, what, etc.). For the answer justification task, we cluster the questions and answers using their average GloVe embeddings.
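The bucketing for the question answering task could look roughly like the sketch below. The pronoun lists, the set of question words, and the tie-breaking rule (a response with both female and male pronouns goes to the ‘female’ bucket) are assumptions for illustration; the text does not specify them, and the 3k bucket size cap is not enforced here.

```python
import re
from collections import defaultdict

# Assumed pronoun and question-word lists; the source does not enumerate them.
FEMALE = {"she", "her", "hers", "herself"}
MALE = {"he", "him", "his", "himself"}
WH_WORDS = {"who", "what", "why", "where", "when", "how"}


def pronoun_bucket(response):
    """Assign a response to the 'female', 'male', or 'neutral' bucket."""
    tokens = set(re.findall(r"[a-z]+", response.lower()))
    if tokens & FEMALE:  # assumption: female pronouns are checked first
        return "female"
    if tokens & MALE:
        return "male"
    return "neutral"


def question_type(question):
    """Use the first word as a crude question type (hypothetical rule)."""
    words = re.findall(r"[a-z]+", question.lower())
    return words[0] if words and words[0] in WH_WORDS else "other"


def bucket_examples(examples):
    """Group (question, response) pairs by pronoun bucket and question type."""
    buckets = defaultdict(list)
    for question, response in examples:
        key = (pronoun_bucket(response), question_type(question))
        buckets[key].append((question, response))
    return dict(buckets)
```

Matching is then run within each bucket, so a distractor never disagrees with the query on gendered pronouns and the all-pairs computation is limited to the bucket.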

Relevance model details

Recall that our relevance model $P_{rel}$ is trained to predict the probability that a response $\bm{r}$ is valid for a query $\bm{q}$. We used BERT for this task, as it achieves state-of-the-art results across many two-sentence inference tasks. Each input looks like the following, where the query and response are concatenated with a separator in between:

[CLS] what is casey doing ? [SEP] casey is getting out of car . [SEP]

Note that in the above example, object tags are replaced with the class name (car3 $\to$ car). Person tags are replaced with gender neutral names (person1 $\to$ casey).
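The input formatting just described can be sketched as follows. The bracketed tag syntax and the name list are hypothetical (only ‘casey’ and ‘riley’ appear in this appendix), and the mapping from person index to name is an assumption chosen so that it agrees with both worked examples here.

```python
import re

# Hypothetical list of gender-neutral names, indexed by person number.
NAMES = ["casey", "jessie", "riley", "jackie", "avery"]

# Hypothetical tag syntax for this example, e.g. "[person1]" or "[car3]".
TAG = re.compile(r"\[([a-z]+)(\d+)\]")


def detag(text):
    """Replace person tags with names and other object tags with their class name."""
    def replace(match):
        cls, index = match.group(1), int(match.group(2))
        if cls == "person":
            return NAMES[(index - 1) % len(NAMES)]
        return cls

    return TAG.sub(replace, text)


def bert_input(query, response):
    """Concatenate query and response with BERT's special tokens."""
    return f"[CLS] {detag(query)} [SEP] {detag(response)} [SEP]"


print(bert_input("what is [person1] doing ?", "[person1] is getting out of [car3] ."))
```

Keying the name on the person index keeps the same person consistently named across the query and the response.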

We fine-tune BERT by treating it as a two-way classification problem. With probability 25% for a query, BERT is given that query’s actual response, otherwise it is given a random response (where the detections were remapped). Then, the model must predict whether it was given the actual response or not. We used a learning rate of $2\cdot 10^{-5}$, the Adam optimizer, a batch size of 32, and 3 epochs of fine-tuning. (We note that during the Adversarial Matching process, for either Question Answering or Answer Justification, the dataset is broken up into 11 folds. For each fold, BERT is fine-tuned on the other folds, not on the final dataset splits.)

Due to computational limitations, we used BERT-Base as the architecture rather than BERT-Large, as the latter is significantly slower. (BERT-Large also requires much more memory, enough so that it’s harder to fine-tune due to the smaller feasible batch size.) Already, $P_{rel}$ has an immense computational requirement as it must compute all-pairs similarity for the entire dataset, over buckets of 3000 examples. Thus, we opted to use a larger bucket size rather than a more expensive model.

Similarity model details

While we want the responses to be highly relevant to the query, we also want to avoid cases where two responses might be conflated by humans - particularly when one is the correct response. This conflation might occur for several reasons: possibly, two responses are paraphrases of one another, or one response entails another. We lump both under the ‘similarity’ umbrella as mentioned in the paper and introduce a model, $P_{sim}$ , to predict the probability of this occurring - broadly speaking, that two responses $\bm{r}_{i}$ and $\bm{r}_{j}$ have the same meaning.

We used ESIM+ELMo for this task, as it still does quite well on two-sentence natural language inference tasks (although not as well as BERT), and can be made much more efficient. At test time, the model makes the similarity prediction when given two token sequences (again, with object tags replaced with the class name, and person tags replaced by gender neutral names).

We trained this model on freely available NLP corpora. We used the SNLI formalism, in which two sentences are an ‘entailment’ if the first entails the second, ‘contradiction’ if the first is contradicted by the second, and ‘neutral’ otherwise. We combined data from SNLI and MultiNLI as training data. Additionally, we found that even after training on these corpora, the model would struggle with paraphrases, so we also translated SNLI sentences from English to German and back using the Nematus machine translation system. These sentences served as extra paraphrase data and were assigned the ‘entailment’ label. We also used randomly sampled sentence pairs from SNLI as additional ‘neutral’ training data. We held out the SNLI validation set to determine when to stop training. We used standard hyperparameters for ESIM+ELMo as given by the AllenNLP library.

Given the trained model $P_{nli}$ , we defined the similarity model as the maximum entailment probability for either way of ordering the two responses:

$$P_{sim}(\bm{r}_{i},\bm{r}_{j})=\max\left\{P_{nli}(\textrm{ent}\mid\bm{r}_{i},\bm{r}_{j}),\;P_{nli}(\textrm{ent}\mid\bm{r}_{j},\bm{r}_{i})\right\},$$

where ‘ent’ refers to the ‘entailment’ label. If one response entails the other, we flag them as similar, even if the reverse entailment is not true, because such a response is likely to be a false positive as a distractor.

The benefit of using ESIM+ELMo for this task is that it can be made more efficient for the task of all-pairs sentence similarity. While much of the ESIM architecture involves computing attention between the two text sequences, everything before the first attention can be precomputed. This provides a large speedup, particularly as computing the ELMo representations is expensive. Now, for a fold size of $N$ , we only have to compute $2N$ ELMo representations rather than $N^{2}$ .
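The precomputation argument can be illustrated with a small sketch. Here `encode_premise`, `encode_hypothesis`, and `entail_prob` are hypothetical stand-ins for the expensive pre-attention part of the model and for the cheaper pairwise part; they are not AllenNLP functions. The similarity score takes the maximum entailment probability over both orderings, as described above.

```python
def all_pairs_similarity(responses, encode_premise, encode_hypothesis, entail_prob):
    """Compute a symmetric similarity matrix with only 2N encoder calls.

    encode_premise / encode_hypothesis: expensive per-sentence encoders (stand-ins)
    entail_prob(p, h): cheap pairwise scorer on precomputed encodings (stand-in)
    """
    # Each response is encoded once per role: 2N encoder calls in total.
    as_premise = [encode_premise(r) for r in responses]
    as_hypothesis = [encode_hypothesis(r) for r in responses]

    n = len(responses)
    sim = [[0.0] * n for _ in range(n)]  # diagonal is left at 0.0 (never needed)
    for i in range(n):
        for j in range(i + 1, n):
            forward = entail_prob(as_premise[i], as_hypothesis[j])
            backward = entail_prob(as_premise[j], as_hypothesis[i])
            sim[i][j] = sim[j][i] = max(forward, backward)
    return sim


# Toy stand-ins: encode a sentence as its word set, score by word overlap.
calls = {"encode": 0}


def toy_encode(sentence):
    calls["encode"] += 1
    return set(sentence.split())


def toy_entail(premise, hypothesis):
    return len(premise & hypothesis) / len(hypothesis)


toy_responses = ["a man is riding a horse", "a man is riding", "a dog sleeps"]
toy_sim = all_pairs_similarity(toy_responses, toy_encode, toy_encode, toy_entail)
```

The pairwise loop is still quadratic, but it only touches cached encodings, which is where the speedup comes from.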

Validating the $\lambda$ parameter

Recall that our hyperparameter $\lambda$ trades off between machine and human difficulty for our final dataset. We give more insight into how we chose the exact value for $\lambda$ in Figure 13. We tried several different values of $\lambda$ and chose $\lambda=0.1$ for $Q\rightarrow A$ and $\lambda=0.01$ for $QA\rightarrow R$ , as at these thresholds human performance was roughly $90\%$ . For an easier dataset for both humans and machines, we would increase the hyperparameter.

Appendix D Language Priors and Annotation Artifacts Discussion

There has been much research in the last few years in understanding what ‘priors’ datasets have. (This line of work is complementary to other notions of dataset bias, like understanding what phenomena datasets cover or don’t, particularly how that relates to how marginalized groups are represented and portrayed.) Broadly speaking, how well do models do on VCR, as well as other visual question answering tasks, without vision?

To be more general, we will consider problems where a model is given a question and answer choices, and picks exactly one answer. The answer choices are the outputs that the model is deciding between (like the responses in VCR) and the question is the shared input that is common to all answer choices (the query, image, and detected objects in VCR). With this terminology, we can categorize unwanted dataset priors in the following ways:

Answer Priors: A model can select a correct answer without even looking at the question. Many text-only datasets contain these priors. For instance, on the RocStories dataset (in which a model must classify endings to a story as correct or incorrect), a model can obtain 75% accuracy by looking at stylistic features (such as word choice and punctuation) in the endings.

Non-Visual Priors: A model can select a correct answer using only non-visual elements of the question. One example is VQA 1.0: given a question like ‘What color is the fire hydrant?’ a model will rank some answers higher than others (red). This was addressed in VQA 2.0; however, some answers will still be more likely than others (VQA’s answers are open-ended, and an answer to ‘What color is the fire hydrant?’ must be a color).

These priors can arise either from biases in the world (fire hydrants are usually red), or from annotation artifacts: patterns that arise when people write class-conditioned answers. Sometimes these biases are subliminal: when workers are asked to write a correct or incorrect story ending, the correct endings tend to be longer. Other cases are more obvious: workers often use patterns such as negation to write sentences that contradict a sentence. (For instance, the SNLI dataset contains pairs of sentences with labels such as ‘entailed’ or ‘contradiction’. For a sentence like ‘A skateboarder is doing tricks’ workers often write ‘Nobody is doing tricks’, which is a contradiction. The result is that the word ‘nobody’ is highly predictive of a sentence pair being a contradiction.)

To what extent do vision datasets suffer from annotation artifacts, versus world priors? We narrow our focus to multiple-choice question answering datasets, in which humans traditionally write correct and incorrect answers to a question (thus, potentially introducing the annotation artifacts). In Table 5 we consider several of these datasets: TVQA, containing video clips from TV shows, along with subtitles; MovieQA, with videos from movies and questions obtained from higher-level plot summaries; PororoQA, with cartoon videos; and TGIFQA, with templated questions from the TGIF dataset. We note that these all differ from our proposed VCR in terms of subject matter, questions asked, number of answers (each of the above has 5 answers possible, while we have 4) and format; our focus here is to investigate how difficult these datasets are for text-only models. (It should be noted that all of these datasets were released before the existence of strong text-only baselines such as BERT.) Our point of comparison is VCR, since our use of Adversarial Matching means that humans never write incorrect answers.

We tackle this problem by running BERT-Base on these datasets: given only the answer (A), the answer and the question (Q+A), or additional language context in the form of subtitles (S+Q+A), how well does BERT do? Our results in Table 5 help support our hypothesis regarding annotation artifacts: accuracy on VCR, given only the ending, is 27% for $Q\rightarrow A$ and 26% for $QA\rightarrow R$ , versus a 25% random baseline, whereas the other datasets, where humans write the incorrect answers, have answer-only accuracies from 33.8% (MovieQA) to 45.8% (TGIFQA), over a 20% baseline.

There is also some non-visual bias for all datasets considered: from 35.4% when given the question and the answers (MovieQA) to 72.5% (TGIFQA). While these results suggest that MovieQA is incredibly difficult without seeing the video clip, there are two things to consider here. First, MovieQA is roughly 20x smaller than our dataset, with 9.8k examples in training. Thus, we also tried training BERT on ‘VCR ${}^{\textrm{small}}$ ’: taking 9.8k examples at random from our training set. Performance is roughly 14% worse, to the point of being roughly comparable to MovieQA. (Assuming an equal chance of choosing each incorrect ending, the results for BERT on an imaginary 4-answer version of TVQA and MovieQA would be 54.5% and 42.2%, respectively.) Second, oftentimes the examples in MovieQA have similar structure, which might help to alleviate stylistic priors, for example:

“Who has followed Boyle to Eamon’s apartment?” Answers:

On the other hand, our dataset examples tend to be highly diverse in terms of syntax as well as high-level meaning, due to the similarity penalty. We hypothesize that this is why some language priors creep into VCR, particularly in the $QA\rightarrow R$ setting: given four very distinct rationales that ostensibly justify why an answer is true, some will likely serve as better justifications than others.

Furthermore, providing additional language information (such as subtitles) to a model tends to boost performance considerably. When given access to subtitles in TVQA, BERT scores 70.6%, which to the best of our knowledge is a new state-of-the-art on TVQA. (We prepend the subtitles that are aligned to the video clip to the beginning of the question, with a special token (;) in between. We trim tokens from the subtitles when the total sequence length is above 128 tokens.)
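The subtitle handling can be sketched as below. The function name is hypothetical, and two details are assumptions: the text does not say which end of the subtitles is trimmed (this sketch drops the earliest tokens, keeping those closest to the question), nor whether BERT's special tokens count toward the 128-token limit (they are ignored here).

```python
def build_tvqa_input(subtitle_tokens, question_tokens, max_len=128, sep=";"):
    """Prepend subtitles to the question, trimming subtitles to fit max_len.

    Only subtitle tokens are ever dropped; the separator and question are kept.
    """
    budget = max_len - len(question_tokens) - 1  # room left for subtitle tokens
    # Assumption: keep the most recent subtitle tokens and drop the earliest.
    kept = subtitle_tokens[-budget:] if budget > 0 else []
    return kept + [sep] + question_tokens


subtitles = [f"s{i}" for i in range(200)]
question = ["who", "is", "there", "?"]
tokens = build_tvqa_input(subtitles, question)
```

If the question alone already exceeds the limit, this sketch simply returns the separator and the question without any subtitles.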

In conclusion, dataset creation is highly difficult, particularly as there are many ways that unwanted bias can creep in during the dataset creation process. One such bias is annotation artifacts, which our analysis suggests are prevalent amongst multiple-choice VQA tasks wherein humans write the wrong endings. Our analysis also suggests Adversarial Matching can help minimize this effect, even when there are strong natural biases in the underlying textual data.

Appendix E Model details

In this section, we discuss implementation details for our model, R2C.

As mentioned in the paper, we used BERT to represent text. We wanted to provide a fair comparison between our model and BERT, so we used BERT-Base for each. We tried to make our use of BERT as simple as possible, matching our use of it as a baseline. Given a query $\bm{q}$ and response choice $\bm{r}^{(i)}$ , we merge both into a single sequence to give to BERT. One example might look like the following:

[CLS] why is riley riding motorcycle while wearing a hospital gown ? [SEP] she had to leave the hospital in a hurry . [SEP]

Note that in the above example, we replaced person tags with gender neutral names (person3 $\rightarrow$ riley) and replaced object detections by their class name (motorcycle1 $\rightarrow$ motorcycle), to minimize domain shift between BERT’s pretrained data (Wikipedia and the BookCorpus) and VCR.

Each token in the sequence corresponds to a different transformer unit in BERT. We can then use the later layers in BERT to extract contextualized representations for each token in the query (everything from why to ?) and the response (she to .). (The only slight difference is that, due to the WordPiece encoding scheme, rare words (like chortled) are broken up into subword units (cho ##rt ##led). In this case, we represent that word as the average of the BERT activations of its subwords.) Note that this gives us a different representation for each response choice $i$ .
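The subword averaging can be made concrete with a short sketch. It assumes the standard WordPiece convention that continuation pieces start with `##`; the function name and the toy two-dimensional vectors are illustrative only.

```python
def merge_wordpieces(pieces, vectors):
    """Merge WordPiece tokens back into words, averaging their vectors.

    pieces:  list of WordPiece strings, continuation pieces prefixed with "##"
    vectors: one activation vector (list of floats) per piece
    """
    words, groups = [], []
    for piece, vector in zip(pieces, vectors):
        if piece.startswith("##") and groups:
            words[-1] += piece[2:]      # extend the current word
            groups[-1].append(vector)   # collect its subword vector
        else:
            words.append(piece)
            groups.append([vector])
    # Average the vectors of each word dimension by dimension.
    averaged = [[sum(dim) / len(group) for dim in zip(*group)] for group in groups]
    return words, averaged


pieces = ["she", "cho", "##rt", "##led"]
vectors = [[1.0, 1.0], [0.0, 3.0], [3.0, 0.0], [3.0, 3.0]]
words, word_vectors = merge_wordpieces(pieces, vectors)
```

In this toy input, the three pieces of ‘chortled’ collapse into a single vector, so the downstream model sees one representation per word.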

We extract frozen BERT representations from the second-to-last layer of the Transformer. (Since the domain that BERT was pretrained on (Wikipedia and the BookCorpus) is still quite different from our domain, we fine-tuned BERT on the text of VCR, using the masked language modeling objective as well as next sentence prediction, for one epoch to account for the domain shift, and then extracted the representations.) Intuitively, this makes sense as the representations at that layer are used for both of BERT’s pretraining tasks: next sentence prediction (the unit corresponding to the [CLS] token at the last layer $L$ attends to all units at layer $L-1$ ), as well as masked language modeling (the unit for a word at layer $L$ looks at its hidden state at the previous layer $L-1$ , and uses that to attend to all other units as well). Experiments in prior work suggest that this works well, though not as well as fine-tuning BERT end-to-end or concatenating multiple layers of activations. (This suggests, however, that if we also fine-tuned BERT along with the rest of the model parameters, the results of R2C would be higher.) The tradeoff, however, is that precomputing BERT representations lets us substantially reduce the runtime of R2C and allows us to focus on learning more powerful vision representations.

Model Hyperparameters

A more detailed discussion of the hyperparameters used for R2C is as follows. We tried to stick to simple settings (and when possible, used similar configurations for the baselines, particularly with respect to learning rates and hidden state sizes).

Our projection of image features maps a 2176 dimensional hidden size (2048 from ResNet50 and 128 dimensional class embeddings) to a 512 dimensional vector.

Our grounding LSTM is a single-layer bidirectional LSTM with a 1280-dimensional input size (768 from BERT and 512 from image features) and uses 256 dimensional hidden states.

Our reasoning LSTM is a two-layer bidirectional LSTM with a 1536-dimensional input size (512 from image features, and 256 for each direction in the attended, grounded query and the grounded answer). It also uses 256-dimensional hidden states.

The representation from the reasoning LSTM, grounded answer, and attended question is maxpooled and projected to a 1024-dimensional vector. That vector is used to predict the $i$ th logit.

For all LSTMs, we initialized the hidden-hidden weights using orthogonal initialization, and applied recurrent dropout to the LSTM input with $p_{drop}=0.3$ .
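The input sizes quoted above follow from simple concatenation arithmetic, which the short check below spells out. The constant names are ours; the numbers are the ones given in the text.

```python
# Sizes stated in the hyperparameter list above.
RESNET_DIM = 2048      # ResNet50 region features
CLASS_EMB_DIM = 128    # object class embedding
IMAGE_PROJ_DIM = 512   # projected image feature
BERT_DIM = 768         # BERT-Base token representation
LSTM_HIDDEN = 256      # hidden size per LSTM direction

# Image projection input: ResNet features concatenated with class embeddings.
image_proj_in = RESNET_DIM + CLASS_EMB_DIM

# Grounding LSTM input: BERT token vector plus projected image feature.
grounding_in = BERT_DIM + IMAGE_PROJ_DIM

# A bidirectional LSTM emits both directions, so 2 * 256 per token.
bidirectional_out = 2 * LSTM_HIDDEN

# Reasoning LSTM input: image features, attended grounded query, grounded answer.
reasoning_in = IMAGE_PROJ_DIM + 2 * bidirectional_out

print(image_proj_in, grounding_in, reasoning_in)
```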

The ResNet50 backbone was pretrained on ImageNet. The parameters in the first three blocks of ResNet were frozen. The final block (after the RoiAlign is applied) is fine-tuned by our model. We were worried, however, that these representations would drift and so we added an auxiliary loss to the model inspired by prior work: the 2048-dimensional representation of each object (without class embeddings) had to be predictive of that object’s label (via a linear projection to the label space and a softmax).

Often times, there are a lot of objects in the image that are not referred to by the query or response set. We filtered the objects considered by the model to include only the objects mentioned in the query and responses. We also passed in the entire image as an ‘object’ that the model could attend to in the object contextualization layer.

We optimized R2C using Adam, with a learning rate of $2\cdot 10^{-4}$ and weight decay of $10^{-4}$ . Our batch size was 96. We clipped the gradients to have a total $L_{2}$ norm of at most $1.0$ . We lowered the learning rate by a factor of 2 when we noticed a plateau (validation accuracy not increasing for two epochs in a row). Each model was trained for 20 epochs, which took roughly 20 hours over 3 NVIDIA Titan X GPUs.
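The gradient clipping and learning rate schedule can be sketched in plain Python. This is a framework-free illustration with hypothetical names, not the training code; in particular, it reads "not increasing for two epochs in a row" as "no new best validation accuracy for two consecutive epochs", which is an assumption.

```python
import math


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)


class HalveOnPlateau:
    """Halve the learning rate after `patience` epochs without a new best accuracy."""

    def __init__(self, lr=2e-4, patience=2, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_accuracy):
        """Record one epoch's validation accuracy and return the learning rate to use."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation accuracy reproduces the "lower by a factor of 2 on plateau" rule under the reading stated above.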

Appendix F VQA baselines with BERT

We present additional results where baselines for VQA are augmented with BERT embeddings in Table 6. We didn’t include these results in the main paper, because to the best of our knowledge prior work hasn’t used contextualized representations for VQA. (Contextualized representations might be overkill, particularly as VQA questions are short and often simple). From the results, we find that while BERT also helps the baselines, our model R2C benefits even more, with a 2.5% overall boost in the holistic $Q\rightarrow AR$ setting.

Appendix G VCR Datasheet

A datasheet is a list of questions that accompany datasets that are released, in part so that people think hard about the phenomena in their data. In this section, we provide a datasheet for VCR.

The dataset was created to study the new task of Visual Commonsense Reasoning: essentially, to have models answer challenging cognition-level questions about images and also to choose a rationale justifying each answer.

Has the dataset been used already?

Yes, at the time of writing, several groups have submitted models to our leaderboard at visualcommonsense.com/leaderboard.

Who funded the dataset?

VCR was funded via a variety of sources; the biggest sponsor was the IARPA DIVA program through D17PC00343. (However, the views and conclusions contained herein are those of the authors and should not be interpreted as representing endorsements of IARPA, DOI/IBC, or the U.S. Government.)

G.2 Dataset Composition

Each instance contains an image, a sequence of object regions and classes, a query, and a list of response choices. Exactly one response is correct. There are two sub-tasks to the dataset: in Question Answering ( $Q{\rightarrow}A$ ) the query is a question and the response choices are answers. In Answer Justification ( $QA{\rightarrow}R$ ) the query is a question and the correct answer; the responses are rationales that justify why someone would conclude that the answer is true. Both the query and the rationale refer to the objects using detection tags like person1.

How many instances are there?

There are 212,923 training questions, 26,534 validation questions, and 25,263 test questions. Each is associated with four answer choices, and each question+correct answer is associated with four rationale choices.

What data does each instance consist of?

The image from each instance comes from a movie, while the object detector was trained to detect objects in the COCO dataset. Workers ask challenging high-level questions covering a wide variety of cognition-level phenomena. Then, workers provide a rationale: one to several sentences explaining how they arrived at their decision. The rationale points to details in the image, as well as background knowledge about how the world works. Each instance contains one correct answer and three incorrect counterfactual answers, along with one correct rationale and three incorrect rationales.

Does the data rely on external resources?

Are there recommended data splits or evaluation measures?

We release the training and validation sets, as well as the test set without labels. For the test set, researchers can submit their predictions to a public leaderboard. Evaluation is fairly straightforward as our task is multiple choice, but we will also release an evaluation script.
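Since the task is multiple choice, scoring reduces to accuracy. The sketch below is not the released evaluation script; the function names and the list-of-indices prediction format are hypothetical. It assumes the holistic $Q\rightarrow AR$ setting counts an example as correct only when both the answer and the rationale are chosen correctly.

```python
def accuracy(predictions, labels):
    """Fraction of examples where the predicted choice index matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def evaluate(answer_preds, answer_labels, rationale_preds, rationale_labels):
    """Score the two sub-tasks and the holistic setting.

    Each argument is a list of choice indices in {0, 1, 2, 3}, one per example.
    """
    both_correct = [
        a == ya and r == yr
        for a, ya, r, yr in zip(answer_preds, answer_labels, rationale_preds, rationale_labels)
    ]
    return {
        "Q->A": accuracy(answer_preds, answer_labels),
        "QA->R": accuracy(rationale_preds, rationale_labels),
        "Q->AR": sum(both_correct) / len(both_correct),  # assumed holistic rule
    }


scores = evaluate([0, 1, 2, 3], [0, 1, 2, 0], [1, 1, 0, 0], [1, 0, 0, 0])
```

With four choices per question, random guessing gives 25% on each sub-task and, under the assumed rule, 6.25% on the holistic setting.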

G.3 Data Collection Process

We used movie images, with objects detected using Mask RCNN. We collected the questions, answers, and rationales on Amazon Mechanical Turk.

Who was involved in the collection process and what were their roles?

We (the authors) did several rounds of pilot studies, and collected data at scale on Amazon Mechanical Turk. In the task, workers on Amazon Mechanical Turk could ask between one and three questions. For each question, they had to provide an answer, indicate its likelihood on an ordinal scale, and provide a rationale justifying why their answer is true. Workers were paid 22 cents per question, answer, and rationale.

Over what time frame was the data collected?

Does the dataset contain all possible instances?

No. Visual Commonsense Inference is very broad, and we focused on a limited set of (interesting) phenomena. Beyond looking at different types of movies, or looking at the world beyond still photographs, there are also different types of inferences that we didn’t cover in our work.

If the dataset is a sample, then what is the population?

The population is that of movie images that were deemed interesting by our interestingness filter (having at least three object detections, of which at least two are people).

G.4 Data Preprocessing

The line between data preprocessing and dataset collection is blurry for VCR. After obtaining crowdsourced questions, answers, and rationales, we applied Adversarial Matching, turning raw data into a multiple choice task. We also tokenized the text spans.

Was the raw data saved in addition to the cleaned data?

Yes - the raw data is the correct answers (and as such is a subset of the ‘cleaned’ data).

Does this dataset collection/preprocessing procedure achieve the initial motivation?

At this point, we think so. Our dataset is challenging for existing VQA systems, but easy for humans.

G.5 Dataset Distribution

VCR is freely available for research use at visualcommonsense.com.

G.6 Legal and Ethical Considerations

Yes - the instructions said that workers’ answers would be used in a dataset. We tried to be as upfront as possible to workers. Workers also consented to have their responses used in this way through the Amazon Mechanical Turk Participation Agreement.

If it relates to people, could this dataset expose people to harm or legal action?

No - the questions, answers, and responses don’t contain personal info about the crowd workers.

If it relates to people, does it unfairly advantage or disadvantage a particular social group?

Unfortunately, movie data is highly biased against women and minorities. Our data, deriving from movies as well as from worker elicitations, is no different. For these reasons, we recommend that users do not deploy models trained on VCR in the real world.

Appendix H Additional qualitative results

In this section, we present additional qualitative results from R2C. Our use of attention mechanisms allows us to better gain insight into how the model arrives at its decisions. In particular, the model uses the answer to attend over the question, and it uses the answer to attend over relevant objects in the image. Looking at the attention maps helps to visualize which items in the question are important (usually, the model focuses on the second half of the question, like ‘covering his face’ in Figure 14), as well as which objects are important (usually, the objects referred to by the answer are assigned the most weight).