SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

Rowan Zellers, Yonatan Bisk, Roy Schwartz, Yejin Choi

Introduction

When we read a story, we bring to it a large body of implicit knowledge about the physical world. For instance, given the context “on stage, a woman takes a seat at the piano,” shown in Table 1, we can easily infer what the situation might look like: a woman is giving a piano performance, with a crowd watching her. We can furthermore infer her likely next action: she will most likely set her fingers on the piano keys and start playing.

This type of natural language inference requires commonsense reasoning, substantially broadening the scope of prior work that focused primarily on linguistic entailment Chierchia and McConnell-Ginet (2000). Whereas the dominant entailment paradigm asks if two natural language sentences (the ‘premise’ and the ‘hypothesis’) describe the same set of possible worlds Dagan et al. (2006); Bowman et al. (2015), here we focus on whether a (multiple-choice) ending describes a possible (future) world that can be anticipated from the situation described in the premise, even when it is not strictly entailed. Making such inference necessitates a rich understanding about everyday physical situations, including object affordances Gibson (1979) and frame semantics Baker et al. (1998).

A first step toward grounded commonsense inference with today’s deep learning machinery is to create a large-scale dataset. However, recent work has shown that human-written datasets are susceptible to annotation artifacts: unintended stylistic patterns that give out clues for the gold labels Gururangan et al. (2018); Poliak et al. (2018). As a result, models trained on such datasets with human biases run the risk of over-estimating the actual performance on the underlying task, and are vulnerable to adversarial or out-of-domain examples Wang et al. (2018); Glockner et al. (2018).

In this paper, we introduce Adversarial Filtering (AF), a new method to automatically detect and reduce stylistic artifacts. We use this method to construct Swag: an adversarial dataset with 113k multiple-choice questions. We start with pairs of temporally adjacent video captions, each with a context and a follow-up event that we know is physically possible. We then use a state-of-the-art language model fine-tuned on this data to massively oversample a diverse set of possible negative sentence endings (or counterfactuals). Next, we filter these candidate endings aggressively and adversarially using a committee of trained models to obtain a population of de-biased endings with similar stylistic features to the real ones. Finally, these filtered counterfactuals are validated by crowd workers to further ensure data quality.

Extensive empirical results demonstrate unique contributions of our dataset, complementing existing datasets for natural language inference (NLI) Bowman et al. (2015); Williams et al. (2018) and commonsense reasoning Roemmele et al. (2011); Mostafazadeh et al. (2016); Zhang et al. (2017). First, our dataset poses a new challenge of grounded commonsense inference that is easy for humans (88%) while hard for current state-of-the-art NLI models (${<}60\%$). Second, our proposed adversarial filtering methodology allows for cost-effective construction of a large-scale dataset while substantially reducing known annotation artifacts. The generality of adversarial filtering allows it to be applied to build future datasets, ensuring that they serve as reliable benchmarks.

Swag: Our new dataset

We introduce a new dataset for studying physically grounded commonsense inference, called Swag. [Footnote 1: Short for Situations With Adversarial Generations.] Our task is to predict which event is most likely to occur next in a video. More formally, a model is given a context $\bm{c}=(\bm{s},\bm{n})$: a complete sentence $\bm{s}$ and a noun phrase $\bm{n}$ that begins a second sentence, as well as a list of possible verb phrase sentence endings $\bm{V}=\{\bm{v}_{1},\ldots,\bm{v}_{4}\}$. See Figure 1 for an example triple $(\bm{s},\bm{n},\bm{v}_{i})$. The model must then select the most appropriate verb phrase $\bm{v}_{\hat{i}}\in\bm{V}$.

Our corpus consists of 113k multiple choice questions (73k training, 20k validation, 20k test) and is derived from pairs of consecutive video captions from ActivityNet Captions Krishna et al. (2017); Heilbron et al. (2015) and the Large Scale Movie Description Challenge (LSMDC; Rohrbach et al., 2017). The two datasets are slightly different in nature and allow us to achieve broader coverage: ActivityNet contains 20k YouTube clips containing one of 203 activity types (such as doing gymnastics or playing guitar); LSMDC consists of 128k movie captions (audio descriptions and scripts). For each pair of captions, we use a constituency parser Stern et al. (2017) to split the second sentence into noun and verb phrases (Figure 1). [Footnote 2: We filter out sentences with rare tokens ($\leq$ 3 occurrences), that are short ($l\leq 5$), or that lack a verb phrase.] Each question has a human-verified gold ending and 3 distractors.
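The sentence filter described above (rare tokens, short sentences, missing verb phrase) might be sketched as follows. This is an illustration under assumptions rather than the authors' preprocessing code: the function name `keep_sentence`, measuring the length $l$ in tokens, and the `has_verb_phrase` flag (a stand-in for the constituency parser's output) are all hypothetical.

```python
def keep_sentence(tokens, token_counts, has_verb_phrase,
                  min_count=4, min_length=6):
    """Return True if a caption passes the three filters described above.

    tokens:          list of str, the tokenized sentence.
    token_counts:    mapping from token to its corpus-wide frequency.
    has_verb_phrase: bool, assumed to come from a constituency parser.
    """
    if not has_verb_phrase:
        return False
    if len(tokens) < min_length:   # drops sentences with l <= 5
        return False
    # Drops sentences containing any token that occurs <= 3 times.
    return all(token_counts.get(t, 0) >= min_count for t in tokens)
```

The thresholds are exposed as parameters so that the same check can be reused if a different corpus calls for different cut-offs.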

A solution to annotation artifacts

In this section, we outline the construction of Swag. We seek dataset diversity while minimizing annotation artifacts: conditional stylistic patterns such as length and word-preference biases. For many NLI datasets, these biases have been shown to allow shallow models (e.g. bag-of-words) to obtain artificially high performance.

To avoid introducing easily “gamed” patterns, we present Adversarial Filtering (AF), a generally-applicable treatment involving the iterative refinement of a set of assignments to increase the entropy under a chosen model family. We then discuss how we generate counterfactual endings, and finally, the models used for filtering.

In this section, we formalize what it means for a dataset to be adversarial. Intuitively, we say that an adversarial dataset for a model $f$ is one on which $f$ will not generalize, even if evaluated on test data from the same distribution. More formally, let our input space be $\mathcal{X}$ and the label space be $\mathcal{Y}$. Our trainable classifier $f$, taking parameters $\theta$, is defined as $f_{\theta}:\mathcal{X}\to\mathbb{R}^{|\mathcal{Y}|}$. Let our dataset of size $N$ be defined as $\mathcal{D}=\{(x_{i},y_{i})\}_{1\leq i\leq N}$, and let the loss function over the dataset be $L(f_{\theta},\mathcal{D})$. We say that a dataset is adversarial with respect to $f$ if we expect high empirical error $I$ over all leave-one-out train/test splits Vapnik (2000):

$$I(\mathcal{D},f)=\frac{1}{N}\sum_{i=1}^{N}L\big(f_{\theta_{i}^{\star}},\{(x_{i},y_{i})\}\big),\quad\text{where}\quad\theta_{i}^{\star}=\operatorname*{argmin}_{\theta}L\big(f_{\theta},\mathcal{D}\setminus\{(x_{i},y_{i})\}\big),$$

with regularization terms omitted for simplicity.
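To make the leave-one-out quantity concrete, the sketch below computes the average held-out loss for an arbitrary training procedure. The names `leave_one_out_error`, `train_majority`, and `zero_one` are hypothetical, and the majority-class "model" is only a toy stand-in for the model family $f$.

```python
def leave_one_out_error(dataset, train, loss):
    """Average held-out loss over all leave-one-out splits.

    dataset: list of (x, y) pairs.
    train:   callable mapping a list of (x, y) pairs to a predictor f(x).
    loss:    callable loss(prediction, y) -> float.
    """
    total = 0.0
    for i, (x_i, y_i) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]   # the dataset without example i
        f_i = train(rest)                      # fit parameters on the rest
        total += loss(f_i(x_i), y_i)           # evaluate on the held-out example
    return total / len(dataset)


def train_majority(pairs):
    """Toy model family: always predict the most common training label."""
    labels = [y for _, y in pairs]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def zero_one(pred, y):
    """0/1 loss."""
    return float(pred != y)
```

On a two-example dataset with one example per label, this toy model is wrong on every held-out example, so the dataset is adversarial for that (very weak) model family; on a dataset with a single label the error is zero.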

3.2 Adversarial filtering (AF) algorithm

In this section, we outline an approach for generating an adversarial dataset $\mathcal{D}$ , effectively maximizing empirical error $I$ with respect to a family of trainable classifiers $f$ . Without loss of generality, we consider the situation where we have $N$ contexts, each associated with a single positive example $(x_{i}^{+},1)\,{\in}\,\mathcal{X}\,{\times}\,\mathcal{Y}$ , and a large population of context-specific negative examples $(x_{i,j}^{-},0)\,{\in}\,\mathcal{X}\,{\times}\,\mathcal{Y}$ , where $1{\leq}j{\leq}N^{-}$ for each $i$ . For instance, the negative examples could be incorrect relations in knowledge-base completion Socher et al. (2013), or all words in a dictionary for a single-word cloze task Zweig and Burges (2011).

Our goal will be to filter the population of negative examples for each instance $i$ to a size of $k{\ll}N^{-}$. This will be captured by returning a set of assignments $\mathcal{A}$, where for each instance the assignment will be a $k$-subset $\mathcal{A}_{i}\in[1\ldots N^{-}]^{k}$. The filtered dataset will then be:

$$\mathcal{D}^{AF}=\big\{(x_{i}^{+},1),\{(x_{i,j}^{-},0)\}_{j\in\mathcal{A}_{i}}\big\}_{1\leq i\leq N}.$$

Unfortunately, optimizing $I(\mathcal{D}^{AF},f)$ is difficult as $\mathcal{A}$ is global and non-differentiable. To address this, we present Algorithm 1. On each iteration, we split the data into dummy ‘train’ and ‘test’ splits. We train a model $f$ on the training portion and obtain parameters $\theta$ , then use the remaining test portion to reassign the indices of $\mathcal{A}$ . For each context, we replace some number of ‘easy’ negatives in $\mathcal{A}$ that $f_{\theta}$ classifies correctly with ‘adversarial’ negatives outside of $\mathcal{A}$ that $f_{\theta}$ misclassifies.

This process can be thought of as increasing the overall entropy of the dataset: given a strong model $f_{\theta}$ that is compatible with a random subset of the data, we aim to ensure it cannot generalize to the held-out set. We repeat this for several iterations to reduce the generalization ability of the model family $f$ over arbitrary train/test splits.
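The iterative procedure described above can be sketched in a few lines. This is a simplified illustration, not the paper's Algorithm 1: the function names, the `train_fn` interface, the 20% dummy test split, and the rule of accepting any outside negative that the model scores higher than the easy negative it replaces are all assumptions. For brevity the toy model scores endings without looking at the context.

```python
import random


def adversarial_filter(positives, negatives, k, train_fn,
                       n_iters=10, test_frac=0.2, n_replace=1, seed=0):
    """Sketch of adversarial filtering over N contexts.

    positives: list of N real endings.
    negatives: list of N lists, the candidate pool for each context.
    train_fn:  callable(list of (ending, label)) -> score(ending) -> float,
               where a higher score means "looks more like a real ending".
    Returns the assignments A: a list of N lists of k pool indices.
    """
    rng = random.Random(seed)
    n = len(positives)
    # Start from a random k-subset of each pool.
    A = [rng.sample(range(len(pool)), k) for pool in negatives]
    for _ in range(n_iters):
        order = list(range(n))
        rng.shuffle(order)
        n_test = max(1, int(test_frac * n))
        test_ids, train_ids = order[:n_test], order[n_test:]
        # Train on the dummy 'train' split using the current assignments.
        train_data = []
        for i in train_ids:
            train_data.append((positives[i], 1))
            train_data.extend((negatives[i][j], 0) for j in A[i])
        score = train_fn(train_data)
        # Reassign indices on the dummy 'test' split.
        for i in test_ids:
            pos_score = score(positives[i])
            # 'Easy' negatives: the model ranks them below the real ending.
            easy = [j for j in A[i] if score(negatives[i][j]) < pos_score]
            outside = [j for j in range(len(negatives[i])) if j not in A[i]]
            # Try the hardest unused candidates first.
            outside.sort(key=lambda j: score(negatives[i][j]), reverse=True)
            for j_easy, j_new in zip(easy[:n_replace], outside):
                if score(negatives[i][j_new]) > score(negatives[i][j_easy]):
                    A[i][A[i].index(j_easy)] = j_new
    return A


def train_length_model(data):
    """Toy stylistic model: prefer endings whose length matches real endings."""
    pos_lens = [len(x.split()) for x, y in data if y == 1]
    mean_len = sum(pos_lens) / len(pos_lens)
    return lambda x: -abs(len(x.split()) - mean_len)
```

With the length-only toy model, negatives whose length differs from the real endings are gradually swapped out for negatives of matching length, which is the kind of stylistic cue the filtering is meant to remove.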

3.3 Generating candidate endings

To generate counterfactuals for Swag, we use an LSTM Hochreiter and Schmidhuber (1997) language model (LM), conditioned on contexts from video captions. We first pretrain on BookCorpus Zhu et al. (2015), then finetune on the video caption datasets. The architecture uses standard best practices and was validated on held-out perplexity of the video caption datasets; details are in the appendix. We use the LM to sample $N^{-}{=}1023$ unique endings for a partial caption. [Footnote 3: To ensure that the LM generates unique endings, we split the data into five validation folds and train five separate LMs, one for each set of training folds. This means that each LM never sees the found endings during training.]

Importantly, we greedily sample the endings, since beam search decoding biases the generated endings to be of lower perplexity (and thus easily distinguishable from found endings). We find this process gives good counterfactuals: the generated endings tend to use topical words, but often make little sense physically, making them perfect for our task. Further, the generated endings are marked as “gibberish” by humans only 9.1% of the time (Sec 3.5); in that case the ending is filtered out.

3.4 Stylistic models for adversarial filtering

In creating Swag, we designed the model family $f$ to pick up on low-level stylistic features that we posit should not be predictive of whether an event happens next in a video. These stylistic features are an obvious case of annotation artifacts Cai et al. (2017); Schwartz et al. (2017). [Footnote 4: A broad definition of annotation artifacts might include aspects besides lexical/stylistic features: for instance, certain events are less likely semantically regardless of the context (e.g. riding a horse using a hose). For this work, we erred more conservatively and only filtered based on style.] Our final classifier is an ensemble of four stylistic models:

A multilayer perceptron (MLP) given LM perplexity features and context/ending lengths.

A bag-of-words model that averages the word embeddings of the second sentence as features.

A one-layer CNN, with filter sizes ranging from 2-5, over the second sentence.

A bidirectional LSTM over the 100 most common words in the second sentence; uncommon words are replaced by their POS tags.
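The input preparation for the last model in the list (keep the most common words, replace everything else by its part-of-speech tag) could look roughly like this. The helper names `build_vocab` and `replace_uncommon` are hypothetical, and the POS tags are assumed to be supplied by whatever tagger is available.

```python
from collections import Counter


def build_vocab(tokenized_sentences, size=100):
    """Return the `size` most frequent tokens in the corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}


def replace_uncommon(tagged_sentence, vocab):
    """Keep common words; replace every other word by its POS tag.

    tagged_sentence: list of (word, pos_tag) pairs from any tagger.
    """
    return [word if word in vocab else tag for word, tag in tagged_sentence]
```

Restricting the vocabulary this way pushes the model toward function words and syntactic shape rather than content, which fits the stated goal of capturing style only.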

We ensemble the models by concatenating their final representations and passing them through an MLP. On every adversarial iteration, the ensemble is trained jointly to minimize cross-entropy.

The accuracies of these models (at each iteration, evaluated on a 20% split of the test dataset before indices of $\mathcal{A}$ get remapped) are shown in Figure 2. Performance decreases from 60% to close to random chance; moreover, confusing the perplexity-based MLP is not sufficient to lower performance of the ensemble. Only once the other stylistic models are added does the ensemble accuracy drop substantially, suggesting that our approach is effective at reducing stylistic artifacts.

3.5 Human verification

Figure 3: Instructions shown to Mechanical Turk workers.

Imagine that you are watching a video clip. The clip has a caption, but it is missing the final phrase. Please choose the best 2 caption endings, and classify each as:

likely, if it completes the caption in a reasonable way;

unlikely, if it sounds ridiculous or impossible;

gibberish, if it has such serious errors that it doesn’t feel like a valid English sentence.

Example: Someone is shown sitting on a fence and talking to the camera while pointing out horses. Someone

stands in front of a podium. (likely, second best)

, the horse in a plaza field. (gibberish)

The final data-collection step is to have humans verify the data. Workers on Amazon Mechanical Turk were given the caption context, as well as six candidate endings: one found ending and five adversarially-sampled endings. The task was twofold: Turkers ranked the endings independently as likely, unlikely, or gibberish, and selected the best and second best endings (Fig 3).

We obtained the correct answers to each context in two ways. If a Turker ranks the found ending as either best or second best (73.7% of the time), we add the found ending as a gold example, with negatives from the generations not labelled best or gibberish. Further, if a Turker ranks a generated ending as best, and the found ending as second best, then we have reason to believe that the generation is good. This lets us add an additional training example, consisting of the generated best ending as the gold, and remaining generations as negatives. [Footnote 5: These two examples share contexts. To prevent biasing the test and validation sets, we didn’t perform this procedure on answers from the evaluation sets’ context.] Examples with ${\leq}3$ non-gibberish endings were filtered out. [Footnote 6: To be data-efficient, we reannotated filtered-out examples by replacing gibberish endings, as well as generations that outranked the found ending, with candidates from $\mathcal{A}$.]

We found after 1000 examples that the annotators tended to have high agreement, also generally choosing found endings over generations (see Table 2). Thus, we collected the remaining 112k examples with one annotator each, periodically verifying that annotators preferred the found endings.

Experiments

In this section, we evaluate the performance of various NLI models on Swag. Recall that models for our dataset take the following form: given a sentence and a noun phrase as context $\bm{c}=(\bm{s},\bm{n})$, as well as a list of possible verb phrase endings $\bm{V}=\{\bm{v}_{1},\ldots,\bm{v}_{4}\}$, a model $f_{\theta}$ must select an ending index $\hat{i}$ that hopefully matches $i_{gold}$:

$$\hat{i}=\operatorname*{argmax}_{i}f_{\theta}(\bm{s},\bm{n},\bm{v}_{i}).$$
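The selection rule amounts to an argmax over the candidate endings, and accuracy is the fraction of questions where the chosen index equals the gold index. The sketch below shows this with a deliberately shallow word-overlap scorer; the function names and the scorer are hypothetical and are not one of the models evaluated in this paper.

```python
def predict(score, s, n, endings):
    """Return the index of the ending with the highest model score."""
    scores = [score(s, n, v) for v in endings]
    return max(range(len(endings)), key=lambda i: scores[i])


def accuracy(score, questions):
    """Fraction of questions where the predicted index equals the gold index.

    questions: list of (s, n, endings, gold_index) tuples.
    """
    correct = sum(predict(score, s, n, endings) == gold
                  for s, n, endings, gold in questions)
    return correct / len(questions)


def overlap_score(s, n, v):
    """Toy scorer: count ending words that also appear in the context."""
    context_words = set((s + " " + n).lower().split())
    return sum(w in context_words for w in v.lower().split())
```

Any of the models described in this section can be plugged in as `score`, as long as it maps a (sentence, noun phrase, ending) triple to a single number.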

To study the amount of bias in our dataset, we also consider models that take as input just the ending verb phrase $\bm{v}_{i}$ , or the entire second sentence $(\bm{n},\bm{v}_{i})$ . For our learned models, we train $f$ by minimizing multi-class cross-entropy. We consider three different types of word representations: 300d GloVe vectors from Common Crawl Pennington et al. (2014), 300d Numberbatch vectors retrofitted using ConceptNet relations Speer et al. (2017), and 1024d ELMo contextual representations that show improvement on a variety of NLP tasks, including standard NLI Peters et al. (2018). We follow the final dataset split (see Section 2) using two training approaches: training on the found data, and the found and highly-ranked generated data. See the appendix for more details.

The following models predict labels from a single span of text as input; this could be the ending only, the second sentence only, or the full passage.

fastText Joulin et al. (2017): This library models a single span of text as a bag of $n$-grams, and tries to predict the probability of an ending being correct or incorrect independently. [Footnote 7: The fastText model is trained using binary cross-entropy; at test time we extract the prediction by selecting the ending with the highest positive likelihood under the model.]

Pretrained sentence encoders We consider two types of pretrained RNN sentence encoders, SkipThoughts Kiros et al. (2015) and InferSent Conneau et al. (2017). SkipThoughts was trained by predicting adjacent sentences in book data, whereas InferSent was trained on supervised NLI data. For each second sentence (or just the ending), we feed the encoding into an MLP.

LSTM sentence encoder Given an arbitrary span of text, we run a two-layer BiLSTM over it. The final hidden states are then max-pooled to obtain a fixed-size representation, which is then used to predict the potential for that ending.

4.2 Binary models

The following models predict labels from two spans of text. We consider two possibilities for these models: using just the second sentence, where the two text spans are $\bm{n},\bm{v}_{i}$, or using the context and the second sentence, in which case the spans are $\bm{s},(\bm{n},\bm{v}_{i})$. The latter case includes many models developed for the NLI task.

Dual Bag-of-Words For this baseline, we treat each sentence as a bag-of-embeddings $(\mathbf{c},\mathbf{v}_{i})$. We model the probability of picking an ending $i$ using a bilinear model: $\textrm{softmax}_{i}(\mathbf{c}\mathbf{W}\mathbf{v}_{i}^{T})$. [Footnote 8: We also tried using an MLP, but got worse results.]
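The bilinear scoring rule can be written out directly. The sketch below uses plain Python lists so that it stays self-contained; the function name `bilinear_softmax` is hypothetical, and the inputs are assumed to be already-pooled embedding vectors (how the bag of embeddings is pooled is not shown here).

```python
import math


def bilinear_softmax(c, W, endings):
    """Softmax over endings of the bilinear score c W v_i^T.

    c:       context vector, list of d floats.
    W:       d x d weight matrix, list of d rows.
    endings: list of ending vectors, each a list of d floats.
    """
    d = len(c)
    # Row vector c times matrix W.
    cW = [sum(c[a] * W[a][b] for a in range(d)) for b in range(len(W[0]))]
    logits = [sum(x * y for x, y in zip(cW, v)) for v in endings]
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With an identity $\mathbf{W}$ the score reduces to a dot product between context and ending, so the learned matrix can be read as a generalization of embedding similarity.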

Dual pretrained sentence encoders Here, we obtain representations from SkipThoughts or InferSent for each span, and compute their pairwise compatibility using either 1) a bilinear model or 2) an MLP from their concatenated representations.

SNLI inference Here, we consider two models that do well on SNLI Bowman et al. (2015): Decomposable Attention Parikh et al. (2016) and ESIM Chen et al. (2017). We use pretrained versions of these models (with ELMo embeddings) on SNLI to obtain 3-way entailment, neutral, and contradiction probabilities for each example. We then train a log-linear model using these 3-way probabilities as features.

SNLI models (retrained) Here, we train ESIM and Decomposable Attention on our dataset: we simply change the output layer size to 1 (the potential of an ending $\bm{v}_{i}$ ) with a softmax over $i$ .

4.3 Other models

Length: Although length was used by the adversarial classifier, we want to verify that human validation didn’t reintroduce a length bias. For this baseline, we always choose the shortest ending.
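The length baseline is simple enough to state in code. In this sketch, length is measured in whitespace-separated tokens and ties go to the earliest ending; both choices are assumptions, since the text does not specify them.

```python
def shortest_ending(endings):
    """Return the index of the ending with the fewest whitespace tokens."""
    return min(range(len(endings)), key=lambda i: len(endings[i].split()))
```

If human validation had reintroduced a length bias, this rule would score noticeably away from the 25% chance level on four-way questions.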

ConceptNet As our task requires world knowledge, we tried a rule-based system on top of the ConceptNet knowledge base Speer et al. (2017). For an ending sentence, we use the spaCy dependency parser to extract the head verb and its dependent object. The ending score is given by the number of ConceptNet causal relations between synonyms of the verb and synonyms of the object. [Footnote 9: We used the relations ‘Causes’, ‘CapableOf’, ‘ReceivesAction’, ‘UsedFor’, and ‘HasSubevent’. Though their coverage is low (30.4% of questions have an answer with $\geq$ 1 causal relation), the more frequent relations in ConceptNet, such as ‘IsA’, at best only indirectly relate to our task.]

Human performance To benchmark human performance, five Mechanical Turk workers were asked to answer 100 dataset questions, as did an ‘expert’ annotator (the first author of this paper). Predictions were combined using a majority vote.
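Combining annotator answers by majority vote can be sketched as below. The function name is hypothetical, and the tie-breaking behaviour (the answer that appears first among the tied ones wins) is an assumption, since the text does not say how ties were resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent answer among the annotators' choices.

    predictions: list of chosen ending indices, one per annotator.
    """
    return Counter(predictions).most_common(1)[0][0]
```

With five annotators and four options, a strict majority is not guaranteed, so some tie-breaking rule is needed in practice.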

4.4 Results

We present our results in Table 3. The best model that only uses the ending is the LSTM sequence model with ELMo embeddings, which obtains 43.6%. This model, as with most models studied, greatly improves with more context: by 3.1% when given the initial noun phrase, and by an additional 4% when also given the first sentence.

Further improvement is gained from models that compute pairwise representations of the inputs. While the simplest such model, DualBoW, obtains only 35.1% accuracy, combining InferSent sentence representations gives 40.5% accuracy (InferSent-Bilinear). The best results come from pairwise NLI models: when fully trained on Swag, ESIM+ELMo obtains 59.2% accuracy.

When comparing machine results to human results, we see there exists a lot of headroom. Though there likely is some noise in the task, our results suggest that humans (even untrained) converge to a consensus. Our in-house “expert” annotator is outperformed by an ensemble of 5 Turk workers (with 88% accuracy); thus, the effective upper bound on our dataset is likely even higher.

Analysis

The past few years have yielded great advances in NLI and representation learning, due to the availability of large datasets like SNLI and MultiNLI Bowman et al. (2015); Williams et al. (2018). With the release of Swag, we hope to continue this trend, particularly as our dataset largely has the same input/output format as other NLI datasets. We observe three key differences between our dataset and others in this space:

First, as noted in Section 1, Swag requires a unique type of temporal reasoning. A state-of-the-art NLI model such as ESIM, when bottlenecked through the SNLI notion of entailment (SNLI-ESIM), only obtains 36.1% accuracy. [Footnote 10: The weights of SNLI-ESIM pick up primarily on entailment probability (0.59), as well as neutral (0.46), while contradiction is negatively correlated (-0.42).] This implies that these datasets necessitate different (and complementary) forms of reasoning.

Second, our use of videos results in wide coverage of dynamic and temporal situations. Compared with SNLI, with contexts from Flickr30K Plummer et al. (2017) image captions, Swag has more active verbs like ‘pull’ and ‘hit,’ and fewer static verbs like ‘sit’ and ‘wear’ (Figure 4). [Footnote 11: Video data has other language differences; notably, character names in LSMDC were replaced by ‘someone’.]

Third, our dataset suffers from few lexical biases. Whereas fastText, a bag of $n$-gram model, obtains 67.0% accuracy on SNLI versus a 34.3% baseline Gururangan et al. (2018), fastText obtains only 29.0% accuracy on Swag. [Footnote 12: The most predictive individual words on SWAG are infrequent in number: ‘dotted’ with P$(+|\textrm{dotted})=77\%$ with 10.3 counts, and ‘similar’ with P$(-|\textrm{similar})=81\%$ with 16.3 counts. (Counts from negative endings were discounted 3x, as there are 3 times as many negative endings as positive endings.)]
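The discounting of negative counts explains why the reported word counts are fractional. A minimal sketch of that computation is given below; the function name and the raw counts in the usage note are hypothetical and are not taken from the dataset.

```python
def discounted_stats(pos_count, neg_count, discount=3.0):
    """Return (P(+|word), P(-|word), effective count) for one word.

    pos_count: occurrences of the word in positive (gold) endings.
    neg_count: occurrences in negative endings, divided by `discount`
               because there are three negatives per positive.
    """
    neg_effective = neg_count / discount
    total = pos_count + neg_effective
    return pos_count / total, neg_effective / total, total
```

For example, a word seen 8 times in positive endings and 7 times in negative endings would have an effective count of about 10.3 and a positive rate of about 77%.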

5.2 Error analysis

We sought to quantify how human judgments differ from the best studied model, ESIM+ELMo. We randomly sampled 100 validation questions that ESIM+ELMo answered incorrectly, for each extracting both the gold ending and the model’s preferred ending. We asked 5 Amazon Mechanical Turk workers to pick the better ending (of which they preferred the gold endings 94% of the time) and to select one (or more) multiple choice reasons explaining why the chosen answer was better.

The options, and the frequencies, are outlined in Table 4. The most common reason for the turkers preferring the correct answer is situational (52.3% of the time), followed by weirdness (17.5%) and plausibility (14.4%). This suggests that ESIM+ELMo already does a good job at filtering out weird and implausible answers, with the main bottleneck being grounded physical understanding. The ambiguous percentage is also relatively low (12.0%), implying significant headroom.

5.3 Qualitative examples

Last, we show several qualitative examples in Table 5. Though models can do decently well by identifying complex alignment patterns between the two sentences (e.g. being “up a tree” implies that “tree” is the end phrase), the incorrect model predictions suggest this strategy is insufficient. For instance, answering “An old man rides a small bumper car” requires knowledge about bumper cars and how they differ from regular cars: bumper cars are tiny, don’t drive on roads, and don’t work in parking lots, eliminating the alternatives. However, this knowledge is difficult to extract from existing corpora: for instance, the ConceptNet entry for Bumper Car has only a single relation: bumper cars are a type of vehicle. Other questions require intuitive physical reasoning: e.g., “he pours the raw egg batter into the pan” requires reasoning about what happens next in making an omelet.

5.4 Where to go next?

Our results suggest that Swag is a challenging testbed for NLI models. However, the adversarial models used to filter the dataset are purely stylistic and focus on the second sentence; thus, subtle artifacts still likely remain in our dataset. These patterns are ostensibly picked up by the NLI models (particularly when using ELMo features), but the large gap between machine and human performance suggests that more is required to solve the dataset. As models are developed for commonsense inference, and more broadly as the field of NLP advances, we note that AF can be used again to create a more adversarial version of Swag using better language models and AF models.

Related Work

There has been a long history of NLI benchmarks focusing on linguistic entailment Cooper et al. (1996); Dagan et al. (2006); Marelli et al. (2014); Bowman et al. (2015); Lai et al. (2017); Williams et al. (2018). Recent NLI datasets in particular have supported learning broadly-applicable sentence representations Conneau et al. (2017); moreover, models trained on these datasets were used as components for performing better video captioning Pasunuru and Bansal (2017), summarization Pasunuru and Bansal (2018), and generation Holtzman et al. (2018), confirming the importance of NLI research. The NLI task requires a variety of commonsense knowledge LoBue and Yates (2011), which our work complements. However, previous datasets for NLI have been challenged by unwanted annotation artifacts Gururangan et al. (2018); Poliak et al. (2018) or scale issues. Our work addresses these challenges by constructing a new NLI benchmark focused on grounded commonsense reasoning, and by introducing an adversarial filtering mechanism that substantially reduces known and easily detectable annotation artifacts.

Commonsense NLI

Several datasets have been introduced to study NLI beyond linguistic entailment: for inferring likely causes and endings given a sentence (COPA; Roemmele et al., 2011), for choosing the most sensible ending to a short story (RocStories; Mostafazadeh et al., 2016; Sharma et al., 2018), and for predicting likelihood of a hypothesis by regressing to an ordinal label (JOCI; Zhang et al., 2017). These datasets are relatively small: 1k examples for COPA and 10k cloze examples for RocStories. [Footnote 13: For RocStories, this was by design to encourage learning from the larger corpus of 98k sensible stories.] JOCI increases the scale by generating the hypotheses using a knowledge graph or a neural model. In contrast to JOCI where the task was formulated as a regression task on the degree of plausibility of the hypothesis, we frame commonsense inference as a multiple choice question to reduce the potential ambiguity in the labels and to allow for direct comparison between machines and humans. In addition, Swag’s use of adversarial filtering increases diversity of situations and counterfactual generation quality.

Last, another related task formulation is sentence completion or cloze, where the task is to predict a single word that is removed from a given context Zweig and Burges (2011); Paperno et al. (2016). [Footnote 14: Prior work on sentence completion filtered negatives with heuristics based on LM perplexities. We initially tried something similar, but found the result to still be gameable.] Our work in contrast requires longer textual descriptions to reason about.

Vision datasets

Several resources have been introduced to study temporal inference in vision. The Visual Madlibs dataset has 20k image captions about hypothetical next/previous events Yu et al. (2015); similar to our work, the test portion is multiple-choice, with counterfactual answers retrieved from similar images and verified by humans. The question of ‘what will happen next?’ has also been studied in photo albums Huang et al. (2016), videos of team sports Felsen et al. (2017), and egocentric dog videos Ehsani et al. (2018). Last, annotation artifacts are also a recurring problem for vision datasets such as Visual Genome Zellers et al. (2018) and Visual QA Jabri et al. (2016); recent work was done to create a more challenging VQA dataset by annotating complementary image pairs Goyal et al. (2016).

Reducing gender/racial bias

Prior work has sought to reduce demographic biases in word embeddings Zhang et al. (2018) as well as in image recognition models Zhao et al. (2017). Our work has focused on producing a dataset with minimal annotation artifacts, which in turn helps to avoid some gender and racial biases that stem from elicitation Rudinger et al. (2017). However, it is not perfect in this regard, particularly due to biases in movies Schofield and Mehr (2016); Sap et al. (2017). Our methodology could potentially be extended to construct datasets free of (possibly intersectional) gender or racial bias.

Physical knowledge

Prior work has studied learning grounded knowledge about objects and verbs: from knowledge bases Li et al. (2016), syntax parses Forbes and Choi (2017), word embeddings Lucy and Gauthier (2017), and images and dictionary definitions Zellers and Choi (2017). An alternate thread of work has been to learn scripts: high-level representations of event chains Schank and Abelson (1975); Chambers and Jurafsky (2009). Swag evaluates both of these strands.

Conclusion

We propose a new challenge of physically situated commonsense inference that broadens the scope of natural language inference (NLI) with commonsense reasoning. To support research toward commonsense NLI, we create a large-scale dataset Swag with 113k multiple-choice questions. Our dataset is constructed using Adversarial Filtering (AF), a new paradigm for robust and cost-effective dataset construction that allows datasets to be constructed at scale while automatically reducing annotation artifacts that can be easily detected by a committee of strong baseline models. Our adversarial filtering paradigm is general, allowing potential applications to other datasets that require human composition of question answer pairs.

Acknowledgements

We thank the anonymous reviewers, members of the ARK and xlab at the University of Washington, researchers at the Allen Institute for AI, and Luke Zettlemoyer for their helpful feedback. We also thank the Mechanical Turk workers for doing a fantastic job with the human validation. This work was supported by the National Science Foundation Graduate Research Fellowship (DGE-1256082), the NSF grant (IIS-1524371, 1703166), the DARPA CwC program through ARO (W911NF-15-1-0543), the IARPA DIVA program through D17PC00343, and gifts by Google and Facebook. The views and conclusions contained herein are those of the authors and should not be interpreted as representing endorsements of IARPA, DOI/IBC, or the U.S. Government.

Appendix A

As mentioned in the main paper, we obtained contexts and found endings from video data. The videos in the ActivityNet dataset are already broken up into clips. However, the LSMDC dataset contains captions for the entire movie, so it is possible that temporally adjacent captions describe events that are far apart in time. Thus, we don’t include any pair of captions that have a time-difference of more than 25 seconds.
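The 25-second rule for pairing adjacent captions could be implemented roughly as follows. The function name and the `(start, end, text)` tuple layout are hypothetical, and measuring the gap from the end of the first caption to the start of the second is an assumption; the text does not say which timestamps are compared.

```python
def caption_pairs(captions, max_gap=25.0):
    """Pair temporally adjacent captions that are close enough in time.

    captions: list of (start_sec, end_sec, text) tuples sorted by start time.
    Returns a list of (context_text, follow_up_text) pairs.
    """
    pairs = []
    for (s1, e1, t1), (s2, e2, t2) in zip(captions, captions[1:]):
        if s2 - e1 <= max_gap:   # keep only pairs at most max_gap seconds apart
            pairs.append((t1, t2))
    return pairs
```

Only consecutive captions are considered, so a dropped pair does not cause the first caption to be matched with a later one.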

In addition to the datasets we used, we also considered the DiDeMo dataset, which consists of (often several) referring expressions in a video Hendricks et al. (2017). However, many of the referring expressions are themselves sentence fragments (e.g. “first time we see people”), so we ultimately did not use this dataset. Additionally, we considered the Visual Madlibs dataset Yu et al. (2015), as it contains 10k hypothetical captions written by Mechanical Turk workers about what might happen next given an image. However, these captions are fundamentally different from the rest of the data (as they’re about what might happen next); as a result, they use different types of language. They also have different tenses versus the other datasets that we considered (e.g. past tense), as a result of the “Mad-libs” style of data collection.

A.2 Details of the language model

Our language model follows standard best practices: the input and output embedding layers are tied (Inan et al., 2017; Press and Wolf, 2017), all embedding and hidden layers are set to 512, and we used recurrent dropout Gal and Ghahramani (2016) on the hidden states and embedding layer. We additionally train a backwards language model alongside the forward language model, and they share embedding parameters. This adds extra supervision to the embedding layer and gives us another way to score candidate generations. We first pretrain the language model for two epochs on pairs of two sentences in the Toronto Books dataset Zhu et al. (2015), and then train on sentence pairs from ActivityNet Captions and LSMDC, validating on held-out perplexity. For optimization, we use Adam Kingma and Ba (2015) with a learning rate of $10^{-3}$ and clip gradients to norm $1.0$ .

All of the above details were validated using perplexity on a held-out set of the video datasets during early experimentation. Our final development set forward perplexity was 31.2 and backward perplexity was 30.4. We tried more complicated language modeling architectures, such as those of Józefowicz et al. (2016), but ended up not seeing an improvement due to overfitting.

A.3 Language model features for the MLP, during adversarial filtering

We obtained LM perplexity features to be used during adversarial filtering in the following ways, using both directions of the bidirectional language model. We extract perplexities for the context by itself (going forward), the ending given the context (going forward), the context given the ending (going backward), and the ending by itself (going backward). We also extract the probability of the final generated token going forward, since sentences sometimes reach the length limit of 25 tokens and end unnaturally.
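
A minimal sketch of how these five features might be assembled is given below. It assumes that each language model direction has already produced per-token natural-log probabilities for the relevant span; the function and feature names are hypothetical, and taking the last entry of the forward ending scores as the "final generated token" is likewise an assumption.

```python
import math
from typing import Dict, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean natural-log probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def lm_features(
    fwd_context: Sequence[float],
    fwd_ending_given_context: Sequence[float],
    bwd_context_given_ending: Sequence[float],
    bwd_ending: Sequence[float],
) -> Dict[str, float]:
    """Collect the four perplexities and the final-token probability."""
    return {
        "ppl_context_fwd": perplexity(fwd_context),
        "ppl_ending_given_context_fwd": perplexity(fwd_ending_given_context),
        "ppl_context_given_ending_bwd": perplexity(bwd_context_given_ending),
        "ppl_ending_bwd": perplexity(bwd_ending),
        # Assumption: the last forward score belongs to the final token.
        "p_final_token_fwd": math.exp(fwd_ending_given_context[-1]),
    }
```

A low final-token probability would flag an ending that was cut off at the length limit rather than finishing naturally.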

A.4 Refining the generated answers to four distractors

In the main paper, we noted that we started with 1023 negatives per example, which the adversarial filtering process filtered down to 9. Five of these were passed to Mechanical Turk workers, and we were left with anywhere between 0 and 4 of these per example as “distractors.” (Note that we always filtered out the second-best option selected by the turkers.) This means that for many of our examples (62%) we actually have a fourth distractor. In these cases, we sorted the distractors by their “unlikely/likely” score, so that the fourth distractor was the one deemed most likely. We still provided the fourth distractor in the training set so that it can be used in future work; however, we didn’t train on it, for simplicity.
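
The ordering described above can be sketched as follows. The sketch assumes each surviving distractor carries a numeric score in which higher means "more likely"; the score scale and the function names are hypothetical.

```python
from typing import List, Tuple


def order_distractors(scored: List[Tuple[str, float]]) -> List[str]:
    """Sort distractors from least to most likely.

    Assumption: a higher score means annotators found the ending more likely,
    so the most likely distractor ends up last.
    """
    return [text for text, _ in sorted(scored, key=lambda pair: pair[1])]


def training_distractors(scored: List[Tuple[str, float]], k: int = 3) -> List[str]:
    """Keep the k least likely distractors; with four available, the fourth is held out."""
    return order_distractors(scored)[:k]
```

With four distractors, `training_distractors` would return the three least likely ones, leaving the most likely one as the unused fourth.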

A.5 More information about Mechanical Turk

We used several tricks to keep the interannotator agreement high (with a pairwise percent agreement of 79% at classifying an ending as either in the Top 2 or not). First, we had a screening HIT where turkers were given detailed instructions for the task, and only the best-scoring turk workers qualified for the remaining HITs. Second, we periodically dequalified turkers who had a low agreement with the gold endings: any turk worker with an accuracy of less than 55% at classifying the “gold” ending as the best or second best, over 10 or more HITs, had the qualification taken away. We also gave small bonuses to turkers with high accuracy.
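
The dequalification rule can be written as a small check. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, and it treats each HIT as contributing one gold-ending judgment, which the text does not specify.

```python
def should_dequalify(
    num_hits: int,
    num_gold_in_top2: int,
    min_hits: int = 10,
    min_accuracy: float = 0.55,
) -> bool:
    """Return True if a worker should lose the qualification.

    Assumption: accuracy is the fraction of HITs in which the worker ranked
    the gold ending as best or second best.
    """
    if num_hits < min_hits:
        # Too few HITs to judge the worker.
        return False
    return num_gold_in_top2 / num_hits < min_accuracy
```

For instance, a worker with 5 correct out of 10 HITs would be dequalified, while a worker with only 9 HITs would not yet be evaluated.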

During our crowdsourcing, we tried to pay the Turkers a fair wage (median $8.57 per hour) and they left positive comments for us on TurkOpticon and TurkerView. The total dataset cost was $23,000, or an average of 20 cents per example.

A.6 Implementation details of the models considered

We implemented the neural models in PyTorch using the AllenNLP library Gardner et al. (2018). Our experiments use the Adam optimizer Kingma and Ba (2015), with a learning rate of $10^{-3}$ and gradient clipping, except for Decomposable Attention and ESIM, where we use the AllenNLP default configurations.

A.7 More info about dataset diversity

The final dataset has a vocabulary size of 21000. We also visualize the coverage of the dataset with a topic model (see Table 7).

A.8 Comparing the distribution of verbs with MultiNLI

We also produced an extension to Figure 4 of the main paper that includes verbs from MultiNLI, shown in Figure 5. We ended up not including it in the paper because we wanted to focus our comparison on SNLI and Swag (as they are both grounded datasets). Interestingly, we find that Swag has a less skewed cumulative distribution of verbs up to around 120, after which MultiNLI has a slightly less skewed distribution. This is possibly due to the broader set of domains considered by MultiNLI, whereas we consider videos (which is also a broad domain, but one that still underrepresents words highly used in newswire text, for instance).

A.9 More examples

We have more qualitative examples in Table 8.