Adversarial NLI: A New Benchmark for Natural Language Understanding

Yixin Nie, Adina Williams, Emily Dinan, Mohit Bansal, Jason Weston, Douwe Kiela

Introduction

Progress in AI has been driven by, among other things, the development of challenging large-scale benchmarks like ImageNet Russakovsky et al. (2015) in computer vision, and SNLI Bowman et al. (2015), SQuAD Rajpurkar et al. (2016), and others in natural language processing (NLP). Recently, for natural language understanding (NLU) in particular, the focus has shifted to combined benchmarks like SentEval Conneau and Kiela (2018) and GLUE Wang et al. (2018), which track model performance on multiple tasks and provide a unified platform for analysis.

With the rapid pace of advancement in AI, however, NLU benchmarks struggle to keep up with model improvement. Whereas it took around 15 years to achieve ``near-human performance'' on MNIST LeCun et al. (1998); Cireşan et al. (2012); Wan et al. (2013) and approximately 7 years to surpass humans on ImageNet Deng et al. (2009); Russakovsky et al. (2015); He et al. (2016), the GLUE benchmark did not last as long as we would have hoped after the advent of BERT Devlin et al. (2018), and rapidly had to be extended into SuperGLUE Wang et al. (2019). This raises an important question: Can we collect a large benchmark dataset that can last longer?

The speed with which benchmarks become obsolete raises another important question: are current NLU models genuinely as good as their high performance on benchmarks suggests? A growing body of evidence shows that state-of-the-art models learn to exploit spurious statistical patterns in datasets Gururangan et al. (2018); Poliak et al. (2018); Tsuchiya (2018); Glockner et al. (2018); Geva et al. (2019); McCoy et al. (2019), instead of learning meaning in the flexible and generalizable way that humans do. Given this, human annotators—be they seasoned NLP researchers or non-experts—might easily be able to construct examples that expose model brittleness.

We propose an iterative, adversarial human-and-model-in-the-loop solution for NLU dataset collection that addresses both benchmark longevity and robustness issues. In the first stage, human annotators devise examples that our current best models cannot determine the correct label for. These resulting hard examples—which should expose additional model weaknesses—can be added to the training set and used to train a stronger model. We then subject the strengthened model to the same procedure and collect weaknesses over several rounds. After each round, we train a new model and set aside a new test set. The process can be iteratively repeated in a never-ending learning Mitchell et al. (2018) setting, with the model getting stronger and the test set getting harder in each new round. Thus, not only is the resultant dataset harder than existing benchmarks, but this process also yields a ``moving post'' dynamic target for NLU systems, rather than a static benchmark that will eventually saturate.

Our approach draws inspiration from recent efforts that gamify collaborative training of machine learning agents over multiple rounds Yang et al. (2017) and pit ``builders'' against ``breakers'' to learn better models Ettinger et al. (2017). Recently, Dinan et al. (2019) showed that such an approach can be used to make dialogue safety classifiers more robust. Here, we focus on natural language inference (NLI), arguably the most canonical task in NLU. We collected three rounds of data, and call our new dataset Adversarial NLI (ANLI).

Our contributions are as follows: 1) We introduce a novel human-and-model-in-the-loop dataset, consisting of three rounds that progressively increase in difficulty and complexity, that includes annotator-provided explanations. 2) We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks. 3) We provide a detailed analysis of the collected data that sheds light on the shortcomings of current models, categorizes the data by inference type to examine weaknesses, and demonstrates good performance on NLI stress tests. The ANLI dataset is available at github.com/facebookresearch/anli/. A demo is available at adversarialnli.com.

Dataset collection

The primary aim of this work is to create a new large-scale NLI benchmark on which current state-of-the-art models fail. This constitutes a new target for the field to work towards, and can elucidate model capabilities and limitations. As noted, however, static benchmarks do not last very long these days. If continuously deployed, the data collection procedure we introduce here can pose a dynamic challenge that allows for never-ending learning.

To paraphrase the great bard Shakespeare (1603), there is something rotten in the state of the art. We propose Human-And-Model-in-the-Loop Enabled Training (HAMLET), a training procedure to automatically mitigate problems with current dataset collection procedures (see Figure 1).

In our setup, our starting point is a base model, trained on NLI data. Rather than employing automated adversarial methods, here the model's ``adversary'' is a human annotator. Given a context (also often called a ``premise'' in NLI), and a desired target label, we ask the human writer to provide a hypothesis that fools the model into misclassifying the label. One can think of the writer as a ``white hat'' hacker, trying to identify vulnerabilities in the system. For each human-generated example that is misclassified, we also ask the writer to provide a reason why they believe it was misclassified.

For examples that the model misclassified, it is necessary to verify that they are actually correct, i.e., that the given context-hypothesis pairs genuinely have their specified target label. The best way to do this is to have them checked by another human. Hence, we provide the example to human verifiers. If two human verifiers agree with the writer, the example is considered a good example. If they disagree, we ask a third human verifier to break the tie. If there is still disagreement between the writer and the verifiers, the example is discarded. If the verifiers disagree, they can overrule the original target label of the writer.
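The verification rule described above can be sketched in a few lines of Python. This is only one reading of the text, not the authors' implementation: the function name `verify` and the list-of-labels input are hypothetical, and the case where verifiers overrule the writer's label is deliberately left out because the text does not specify how the replacement label is chosen.

```python
from collections import Counter
from typing import Optional, Sequence


def verify(writer_label: str, verifier_labels: Sequence[str]) -> Optional[str]:
    """Return the accepted label, or None if the example is discarded.

    verifier_labels holds judgments in the order they were collected. The
    third entry is consulted only when the first two verifiers do not both
    agree with the writer (hypothetical interface, for illustration only).
    """
    first_two = list(verifier_labels[:2])
    if len(first_two) < 2:
        raise ValueError("need at least two verifier labels")
    if all(label == writer_label for label in first_two):
        return writer_label  # both verifiers confirm the writer's label
    if len(verifier_labels) < 3:
        raise ValueError("tie-break needed but no third verifier label given")
    votes = Counter(verifier_labels[:3])
    if votes[writer_label] >= 2:
        return writer_label  # the third verifier sides with the writer
    return None  # disagreement persists, so the example is discarded
```

Under this reading, an example survives only if at least two verifiers end up agreeing with the writer's target label.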

Once data collection for the current round is finished, we construct a new training set from the collected data, with accompanying development and test sets, which are constructed solely from verified correct examples. The test set was further restricted so as to: 1) include pairs from ``exclusive'' annotators who are never included in the training data; and 2) be balanced by label classes (and genres, where applicable). We subsequently train a new model on this and other existing data, and repeat the procedure.

Annotation details

We employed Mechanical Turk workers with qualifications and collected hypotheses via the ParlAI framework (https://parl.ai/). Annotators are presented with a context and a target label (either `entailment', `contradiction', or `neutral') and asked to write a hypothesis that corresponds to the label. We phrase the label classes as ``definitely correct'', ``definitely incorrect'', or ``neither definitely correct nor definitely incorrect'' given the context, to make the task easier to grasp. Model predictions are obtained for the context and submitted hypothesis pair. The probability of each label is shown to the worker as feedback. If the model prediction was incorrect, the job is complete. If not, the worker continues to write hypotheses for the given (context, target-label) pair until the model predicts the label incorrectly or the number of tries exceeds a threshold (5 tries in the first round, 10 tries thereafter).
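The per-context annotation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ParlAI task code: `collect_for_context`, `write_hypothesis`, and `predict_proba` are hypothetical names, the writer is modeled as a callable that sees the feedback from earlier tries, and the model is a callable returning one probability per label.

```python
from typing import Callable, Dict, List, Optional, Tuple

Probs = Dict[str, float]


def collect_for_context(
    context: str,
    target_label: str,
    write_hypothesis: Callable[[str, str, List[Probs]], str],
    predict_proba: Callable[[str, str], Probs],
    max_tries: int,
) -> Tuple[Optional[str], int]:
    """Ask for hypotheses until the model errs or the tries run out.

    Returns (fooling_hypothesis or None, number_of_tries_used).
    """
    feedback: List[Probs] = []  # label probabilities shown back to the writer
    for attempt in range(1, max_tries + 1):
        hypothesis = write_hypothesis(context, target_label, feedback)
        probs = predict_proba(context, hypothesis)
        feedback.append(probs)
        predicted = max(probs, key=probs.get)
        if predicted != target_label:
            return hypothesis, attempt  # model error found: the job is complete
    return None, max_tries  # threshold reached without fooling the model
```

With `max_tries=5` this mirrors the first-round threshold, and with `max_tries=10` the later rounds. Any hypothesis returned here would still need human verification before it counts as a verified model error.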

To encourage workers, payments increased as rounds became harder. For hypotheses that the model predicted incorrectly, and that were verified by other humans, we paid an additional bonus on top of the standard rate.

Round 1

For the first round, we used a BERT-Large model Devlin et al. (2018) trained on a concatenation of SNLI Bowman et al. (2015) and MNLI Williams et al. (2017), and selected the best-performing model we could train as the starting point for our dataset collection procedure. For Round 1 contexts, we randomly sampled short multi-sentence passages from Wikipedia (of 250-600 characters) from the manually curated HotpotQA training set Yang et al. (2018). Contexts are either ground-truth contexts from that dataset, or they are Wikipedia passages retrieved using TF-IDF Chen et al. (2017) based on a HotpotQA question.

Round 2

For the second round, we used a more powerful RoBERTa model Liu et al. (2019b) trained on SNLI, MNLI, an NLI version of FEVER Thorne et al. (2018), and the training data from the previous round (A1). The NLI version of FEVER pairs claims with evidence retrieved by Nie et al. (2019) as (context, hypothesis) inputs. After a hyperparameter search, we selected the model with the best performance on the A1 development set. Then, using the hyperparameters selected from this search, we created a final set of models by training several models with different random seeds. During annotation, we constructed an ensemble by randomly picking a model from the model set as the adversary each turn. This helps us avoid annotators exploiting vulnerabilities in one single model. A new non-overlapping set of contexts was again constructed from Wikipedia via HotpotQA using the same method as Round 1.
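The "random model per turn" ensemble can be illustrated with a small wrapper. This is a sketch under assumptions: each model is represented as a callable from (context, hypothesis) to label probabilities, and `make_ensemble_adversary` is a hypothetical helper name rather than anything from the released code.

```python
import random
from typing import Callable, Dict, Optional, Sequence

Predictor = Callable[[str, str], Dict[str, float]]


def make_ensemble_adversary(
    models: Sequence[Predictor], seed: Optional[int] = None
) -> Predictor:
    """Wrap several models so that each turn is answered by a random member."""
    if not models:
        raise ValueError("need at least one model")
    rng = random.Random(seed)

    def predict_proba(context: str, hypothesis: str) -> Dict[str, float]:
        model = rng.choice(models)  # a fresh draw on every annotator turn
        return model(context, hypothesis)

    return predict_proba
```

Because the annotator cannot tell which member answered a given turn, a hypothesis that reliably fools the wrapper tends to reflect a weakness shared across the differently seeded models rather than a quirk of one of them.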

Round 3

For the third round, we selected a more diverse set of contexts, in order to explore robustness under domain transfer. In addition to contexts from Wikipedia for Round 3, we also included contexts from the following domains: News (extracted from Common Crawl), fiction (extracted from StoryCloze Mostafazadeh et al. (2016) and CBT Hill et al. (2015)), formal spoken text (excerpted from court and presidential debate transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus, anc.org/data/masc/corpus/), and causal or procedural text, which describes sequences of events or actions, extracted from WikiHow. Finally, we also collected annotations using the longer contexts present in the GLUE RTE training data, which came from the RTE5 dataset Bentivogli et al. (2009). We trained an even stronger RoBERTa ensemble by adding the training set from the second round (A2) to the training data.

Comparing with other datasets

The ANLI dataset, comprising three rounds, improves upon previous work in several ways. First, and most obviously, the dataset is collected to be more difficult than previous datasets, by design. Second, it remedies a problem with SNLI, namely that its contexts (or premises) are very short, because they were selected from the image captioning domain. We believe longer contexts should naturally lead to harder examples, and so we constructed ANLI contexts from longer, multi-sentence source material.

Following previous observations that models might exploit spurious biases in NLI hypotheses (Gururangan et al., 2018; Poliak et al., 2018), we conduct a study of the performance of hypothesis-only models on our dataset. We show that such models perform poorly on our test sets.

With respect to data generation with naïve annotators, Geva et al. (2019) noted that models can pick up on annotator bias, modelling annotator artefacts rather than the intended reasoning phenomenon. To counter this, we selected a subset of annotators (i.e., the ``exclusive'' workers) whose data would only be included in the test set. This enables us to avoid overfitting to the writing style biases of particular annotators, and also to determine how much individual annotator bias is present for the main portion of the data. Examples from each round of dataset collection are provided in Table 1.

Furthermore, our dataset poses new challenges to the community that were less relevant for previous work, such as: can we improve performance online without having to train a new model from scratch every round, how can we overcome catastrophic forgetting, how do we deal with mixed model biases, etc. Because the training set includes examples that the model got right but were not verified, learning from noisy and potentially unverified data becomes an additional interesting challenge.

Dataset statistics

The dataset statistics can be found in Table 2. The number of examples we collected increases per round, starting with approximately 19k examples for Round 1, to around 47k examples for Round 2, to over 103k examples for Round 3. We collected more data for later rounds not only because that data is likely to be more interesting, but also simply because the base model is better and so annotation took longer to collect good, verified correct examples of model vulnerabilities.

For each round, we report the model error rate, both on verified and unverified examples. The unverified model error rate captures the percentage of examples where the model disagreed with the writer's target label, but where we are not (yet) sure if the example is correct. The verified model error rate is the percentage of model errors from example pairs that other annotators confirmed the correct label for. Note that error rate is a useful way to evaluate model quality: the lower the model error rate—assuming constant annotator quality and context-difficulty—the better the model.
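The two error rates can be made concrete with a short computation. This is a sketch of one reading of the definitions above, not the script used to produce Table 2: the record fields `target`, `predicted`, and `verified` are hypothetical, and the denominator is assumed to be all examples collected in the round.

```python
from typing import Dict, Sequence


def model_error_rates(examples: Sequence[dict]) -> Dict[str, float]:
    """Compute unverified and verified model error rates as percentages.

    Each example is a dict with the writer's 'target' label, the model's
    'predicted' label, and a boolean 'verified' flag that is True when other
    annotators confirmed the label (hypothetical schema).
    """
    total = len(examples)
    if total == 0:
        raise ValueError("no examples")
    # Model errors: the prediction disagrees with the writer's target label.
    model_errors = [ex for ex in examples if ex["predicted"] != ex["target"]]
    # Verified errors: model errors whose label was confirmed by verifiers.
    verified_errors = [ex for ex in model_errors if ex["verified"]]
    return {
        "unverified": 100.0 * len(model_errors) / total,
        "verified": 100.0 * len(verified_errors) / total,
    }
```

For instance, a round with 10 collected examples, 3 of which fool the model and 2 of which are then confirmed by verifiers, gives an unverified rate of 30% and a verified rate of 20% under these assumptions.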

We observe that model error rates decrease as we progress through rounds. In Round 3, where we included a more diverse range of contexts from various domains, the overall error rate went slightly up compared to the preceding round, but for Wikipedia contexts the error rate decreased substantially. While for the first round roughly 1 in every 5 examples were verified model errors, this quickly dropped over consecutive rounds, and the overall model error rate is less than 1 in 10. On the one hand, this is impressive, and shows how far we have come with just three rounds. On the other hand, it shows that we still have a long way to go if even untrained annotators can fool ensembles of state-of-the-art models with relative ease.

Table 2 also reports the average number of ``tries'', i.e., attempts made for each context until a model error was found (or the number of possible tries is exceeded), and the average time this took (in seconds). Again, these metrics are useful for evaluating model quality: observe that the average number of tries and average time per verified error both go up with later rounds. This demonstrates that the rounds are getting increasingly more difficult. Further dataset statistics and inter-annotator agreement are reported in Appendix C.

Results

Table 3 reports the main results. In addition to BERT Devlin et al. (2018) and RoBERTa Liu et al. (2019b), we also include XLNet Yang et al. (2019) as an example of a strong, but different, model architecture. We show test set performance on the ANLI test sets per round, the total ANLI test set, and the exclusive test subset (examples from test-set-exclusive workers). We also show accuracy on the SNLI test set and the MNLI development set (for the purpose of comparing between different model configurations across table rows). In what follows, we discuss our observations.

Notice that the base model for each round performs very poorly on that round's test set. This is the expected outcome: For round 1, the base model gets the entire test set wrong, by design. For rounds 2 and 3, we used an ensemble, so performance is not necessarily zero. However, as it turns out, performance still falls well below chance (chance is at 33%, since the test set labels are balanced), indicating that workers did not find vulnerabilities specific to a single model, but generally applicable ones for that model class.

Rounds become increasingly more difficult.

As already foreshadowed by the dataset statistics, round 3 is more difficult (yields lower performance) than round 2, and round 2 is more difficult than round 1. This is true for all model architectures.

Training on more rounds improves robustness.

Generally, our results indicate that training on more rounds improves model performance. This is true for all model architectures. Simply training on more ``normal NLI'' data would not help a model be robust to adversarial attacks, but our data actively helps mitigate these.

RoBERTa achieves state-of-the-art performance…

We obtain state-of-the-art performance on both SNLI and MNLI with the RoBERTa model finetuned on our new data. The RoBERTa paper Liu et al. (2019b) reports a score of $90.2$ for both MNLI-matched and -mismatched dev, while we obtain $91.0$ and $90.7$. The state of the art on SNLI is currently held by MT-DNN Liu et al. (2019a), which reports $91.6$ compared to our $92.9$.

…but is outperformed when it is the base model.

However, the base (RoBERTa) models for rounds 2 and 3 are outperformed by both BERT and XLNet (rows 5, 6 and 10). This shows that annotators found examples that RoBERTa generally struggles with, which cannot be mitigated by more examples alone. It also implies that BERT, XLNet, and RoBERTa all have different weaknesses, possibly as a function of their training data (BERT, XLNet and RoBERTa were trained on different data sets, which might or might not have contained information relevant to the weaknesses).

Continuously augmenting training data does not downgrade performance.

Even though ANLI training data is different from SNLI and MNLI, adding it to the training set does not harm performance on those tasks. Our results (see also rows 2-3 of Table 6) suggest the method could successfully be applied for multiple additional rounds.

Exclusive test subset difference is small.

We included an exclusive test subset (ANLI-E) with examples from annotators never seen in training, and find negligible differences, indicating that our models do not over-rely on annotators' writing styles.

The effectiveness of adversarial training

We examine the effectiveness of the adversarial training data in two ways. First, we sample from respective datasets to ensure exactly equal amounts of training data. Table 5 shows that the adversarial data improves performance, including on SNLI and MNLI when we replace part of those datasets with the adversarial data. This suggests that the adversarial data is more data efficient than ``normally collected'' data. Figure 2 shows that adversarial data collected in later rounds is of higher quality and more data-efficient.
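The equal-budget comparison can be sketched as a sampling helper that swaps part of a standard training set for adversarial examples while keeping the total size fixed. This is an illustrative assumption about the setup, not the authors' sampling code: `fixed_budget_mix` and its parameters are hypothetical.

```python
import random
from typing import List, Sequence


def fixed_budget_mix(
    base: Sequence, adversarial: Sequence, n_total: int, n_adv: int, seed: int = 0
) -> List:
    """Sample a training set of exactly n_total examples, n_adv of them adversarial."""
    n_base = n_total - n_adv
    if n_adv < 0 or n_base < 0:
        raise ValueError("n_adv must be between 0 and n_total")
    if n_adv > len(adversarial) or n_base > len(base):
        raise ValueError("not enough examples to sample without replacement")
    rng = random.Random(seed)
    # Sampling without replacement keeps the two conditions the same size.
    return rng.sample(list(base), n_base) + rng.sample(list(adversarial), n_adv)
```

Comparing a model trained on `fixed_budget_mix(base, adversarial, n, 0)` with one trained on `fixed_budget_mix(base, adversarial, n, k)` holds the amount of data constant, so any difference in accuracy can be attributed to the kind of data rather than its quantity.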

Second, we compared verified correct examples of model vulnerabilities (examples that the model got wrong and were verified to be correct) to unverified ones. Figure 3 shows that the verified correct examples are much more valuable than the unverified examples, especially in the later rounds (where the latter drops to random).

Stress Test Results

We also test models on two recent hard NLI test sets: SNLI-Hard Gururangan et al. (2018) and the NLI stress tests (Naik et al., 2018) (see Appendix A for details). The results are in Table 4. We observe that all our models outperform the models presented in the original papers for these common stress tests. The RoBERTa models perform best on SNLI-Hard and achieve accuracy levels in the high 80s on the `antonym' (AT), `numerical reasoning' (NR), `length' (LN), and `spelling error' (SE) sub-datasets, and show marked improvement on both `negation' (NG) and `word overlap' (WO). Training on ANLI appears to be particularly useful for the AT, NR, NG and WO stress tests.

Hypothesis-only results

For SNLI and MNLI, concerns have been raised about the propensity of models to pick up on spurious artifacts that are present just in the hypotheses Gururangan et al. (2018); Poliak et al. (2018). Here, we compare full models to models trained only on the hypothesis (marked $H$). Table 6 reports results on ANLI, as well as on SNLI and MNLI. The table shows that hypothesis-only models perform poorly on ANLI, and obtain good performance on SNLI and MNLI. Without manual intervention, some bias obviously remains in how people phrase hypotheses (e.g., contradiction might have more negation), which explains why hypothesis-only performs slightly above chance when trained on ANLI. Hypothesis-only performance decreases over rounds for ANLI.
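The difference between a full model and a hypothesis-only model lies purely in what the classifier is allowed to see. A minimal sketch, assuming examples are dicts with hypothetical `context` and `hypothesis` fields and that the classifier takes up to two text segments:

```python
from typing import Dict, Optional, Tuple


def to_model_input(
    example: Dict[str, str], hypothesis_only: bool = False
) -> Tuple[str, Optional[str]]:
    """Return the (first_segment, second_segment) pair fed to the classifier."""
    if hypothesis_only:
        # The context is withheld, so only artifacts in the hypothesis can help.
        return example["hypothesis"], None
    return example["context"], example["hypothesis"]
```

If a model trained on the hypothesis-only inputs still scores well above chance, the labels must be partly predictable from how hypotheses are phrased, which is the kind of artifact this comparison is meant to detect.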

We observe that in rounds 2 and 3, RoBERTa is not much better than hypothesis-only. This could mean two things: either the test data is very difficult, or the training data is not good. To rule out the latter, we trained only on ANLI ($\sim$163k training examples): on MNLI, RoBERTa trained this way matches BERT trained on the much larger, fully in-domain SNLI+MNLI combined dataset (943k training examples), with both getting $\sim$86 (the third row in Table 6). Hence, this shows that the test sets are so difficult that state-of-the-art models cannot outperform a hypothesis-only prior.

Linguistic analysis

We explore the types of inferences that fooled models by manually annotating $500$ examples from each round's development set. A dynamically evolving dataset offers the unique opportunity to track how model error rates change over time. Since each round's development set contains only verified examples, we can investigate two interesting questions: which types of inference do writers employ to fool the models, and are base models differentially sensitive to different types of reasoning?

The results are summarized in Table 7. We devised an inference ontology containing six types of inference: Numerical & Quantitative (i.e., reasoning about cardinal and ordinal numbers, inferring dates and ages from numbers, etc.), Reference & Names (coreferences between pronouns and forms of proper names, knowing facts about name gender, etc.), Standard Inferences (conjunctions, negations, cause-and-effect, comparatives and superlatives etc.), Lexical Inference (inferences made possible by lexical information about synonyms, antonyms, etc.), Tricky Inferences (wordplay, linguistic strategies such as syntactic transformations/reorderings, or inferring writer intentions from contexts), and reasoning from outside knowledge or additional facts (e.g., ``You can't reach the sea directly from Rwanda''). The quality of annotations was also tracked; if a pair was ambiguous or a label debatable (from the expert annotator's perspective), it was flagged. Quality issues were rare at 3-4% per round. Any one example can have multiple types, and every example had at least one tag.

We observe that both round 1 and 2 writers rely heavily on numerical and quantitative reasoning in over 30% of the development set—the percentage in A2 (32%) dropped roughly 6% from A1 (38%)—while round 3 writers use numerical or quantitative reasoning for only 17%. The majority of numerical reasoning types were references to cardinal numbers that referred to dates and ages. Inferences predicated on references and names were present in about 10% of rounds 1 & 3 development sets, and reached a high of 20% in round 2, with coreference featuring prominently. Standard inference types increased in prevalence as the rounds increased, ranging from 18%–27%, as did `Lexical' inferences (increasing from 13%–31%). The percentage of sentences relying on reasoning and outside facts remains roughly the same, in the mid-50s, perhaps slightly increasing over the rounds. For round 3, we observe that the model used to collect it appears to be more susceptible to Standard, Lexical, and Tricky inference types. This finding is compatible with the idea that models trained on adversarial data perform better, since annotators seem to have been encouraged to devise more creative examples containing harder types of inference in order to stump them. Further analysis is provided in Appendix B.

Related work

Machine learning methods are well-known to pick up on spurious statistical patterns. For instance, in the first visual question answering dataset Antol et al. (2015), biases like ``2'' being the correct answer to 39% of the questions starting with ``how many'' allowed learning algorithms to perform well while ignoring the visual modality altogether Jabri et al. (2016); Goyal et al. (2017). In NLI, Gururangan et al. (2018), Poliak et al. (2018) and Tsuchiya (2018) showed that hypothesis-only baselines often perform far better than chance. NLI systems can often be broken merely by performing simple lexical substitutions Glockner et al. (2018), and struggle with quantifiers Geiger et al. (2018) and certain superficial syntactic properties McCoy et al. (2019).

In question answering, Kaushik and Lipton (2018) showed that question- and passage-only models can perform surprisingly well, while Jia and Liang (2017) added adversarially constructed sentences to passages to cause a drastic drop in performance. Many tasks do not actually require sophisticated linguistic reasoning, as shown by the surprisingly good performance of random encoders Wieting and Kiela (2019). Similar observations were made in machine translation Belinkov and Bisk (2017) and dialogue Sankar et al. (2019). Machine learning also has a tendency to overfit on static targets, even if that does not happen deliberately Recht et al. (2018). In short, the field is rife with dataset bias and papers trying to address this important problem. This work presents a potential solution: if such biases exist, they will allow humans to fool the models, resulting in valuable training examples until the bias is mitigated.

Dynamic datasets.

Bras et al. (2020) proposed AFLite, an approach for avoiding spurious biases through adversarial filtering, which is a model-in-the-loop approach that iteratively probes and improves models. Kaushik et al. (2019) offer a causal account of spurious patterns, and counterfactually augment NLI datasets by editing examples to break the model. That approach is human-in-the-loop, using humans to find problems with one single model. In this work, we employ both human and model-based strategies iteratively, in a form of human-and-model-in-the-loop training, to create completely new examples, in a potentially never-ending loop Mitchell et al. (2018).

Human-and-model-in-the-loop training is not a new idea. Mechanical Turker Descent proposes a gamified environment for the collaborative training of grounded language learning agents over multiple rounds Yang et al. (2017). The ``Build it Break it Fix it'' strategy in the security domain Ruef et al. (2016) has been adapted to NLP Ettinger et al. (2017) as well as dialogue safety Dinan et al. (2019). The QApedia framework Kratzwald and Feuerriegel (2019) continuously refines and updates its content repository using humans in the loop, while human feedback loops have been used to improve image captioning systems Ling and Fidler (2017). Wallace et al. (2019) leverage trivia experts to create a model-driven adversarial question writing procedure and generate a small set of challenge questions that QA-models fail on. Relatedly, Lan et al. (2017) propose a method for continuously growing a dataset of paraphrases.

There has been a flurry of work in constructing datasets with an adversarial component, such as Swag Zellers et al. (2018) and HellaSwag Zellers et al. (2019), CODAH Chen et al. (2019), Adversarial SQuAD Jia and Liang (2017), Lambada Paperno et al. (2016) and others. Our dataset is not to be confused with abductive NLI Bhagavatula et al. (2019), which calls itself $\alpha$ NLI, or ART.

Discussion & Conclusion

In this work, we used a human-and-model-in-the-loop training method to collect a new benchmark for natural language understanding. The benchmark is designed to be challenging to current state-of-the-art models. Annotators were employed to act as adversaries, and encouraged to find vulnerabilities that fool the model into misclassifying, but that another person would correctly classify. We found that non-expert annotators, in this gamified setting and with appropriate incentives, are remarkably creative at finding and exploiting weaknesses. We collected three rounds, and as the rounds progressed, the models became more robust and the test sets for each round became more difficult. Training on this new data yielded the state of the art on existing NLI benchmarks.

The ANLI benchmark presents a new challenge to the community. It was carefully constructed to mitigate issues with previous datasets, and was designed from first principles to last longer. The dataset also presents many opportunities for further study. For instance, we collected annotator-provided explanations for each example that the model got wrong. We provided inference labels for the development set, opening up possibilities for interesting, more fine-grained studies of NLI model performance. While we verified the development and test examples, we did not verify the correctness of each training example, which means there is probably some room for improvement there.

A concern might be that the static approach is probably cheaper, since dynamic adversarial data collection requires a verification step to ensure examples are correct. However, verifying examples is probably also a good idea in the static case, and adversarially collected examples can still prove useful even if they didn't fool the model and weren't verified. Moreover, annotators were better incentivized to do a good job in the adversarial setting. Our finding that adversarial data is more data-efficient corroborates this theory. Future work could explore a detailed cost and time trade-off between adversarial and static collection.

It is important to note that our approach is model-agnostic. HAMLET was applied against an ensemble of models in rounds 2 and 3, and it would be straightforward to put more diverse ensembles in the loop to examine what happens when annotators are confronted with a wider variety of architectures.

The proposed procedure can be extended to other classification tasks, as well as to ranking with hard negatives either generated (by adversarial models) or retrieved and verified by humans. It is less clear how the method can be applied in generative cases.

Adversarial NLI is meant to be a challenge for measuring NLU progress, even for as yet undiscovered models and architectures. Luckily, if the benchmark does turn out to saturate quickly, we will always be able to collect a new round.

Acknowledgments

YN interned at Facebook. YN and MB were sponsored by DARPA MCS Grant #N66001-19-2-4031, ONR Grant #N00014-18-1-2871, and DARPA YFA17-D17AP00022. Special thanks to Sam Bowman for comments on an earlier draft.

References

Appendix A Performance on challenge datasets

Recently, several hard test sets have been made available for revealing the biases NLI models learn from their training datasets (Nie and Bansal, 2017; McCoy et al., 2019; Gururangan et al., 2018; Naik et al., 2018). We examine model performance on two of these: the SNLI-Hard Gururangan et al. (2018) test set, which consists of examples that hypothesis-only models label incorrectly, and the NLI stress tests (Naik et al., 2018), in which sentences containing antonym pairs, negations, high word overlap, i.a., are heuristically constructed. We test our models on these stress tests after tuning on each test's respective development set to account for potential domain mismatches. For comparison, we also report results from the original papers: for SNLI-Hard from Gururangan et al.'s implementation of the hierarchical tensor-based Densely Interactive Inference Network (Gong et al., 2018, DIIN) on MNLI, and for the NLI stress tests, Naik et al.'s implementation of InferSent (Conneau et al., 2017) trained on SNLI.

Appendix B Further linguistic analysis

We compare the incidence of linguistic phenomena in ANLI with extant popular NLI datasets to get an idea of what our dataset contains. We observe that FEVER and SNLI datasets generally contain many fewer hard linguistic phenomena than MultiNLI and ANLI (see Table 8).

ANLI and MultiNLI have roughly the same percentage of hypotheses that exceed twenty words in length, and/or contain negation (e.g., `never', `no'), tokens of `or', and modals (e.g., `must', `can'). MultiNLI hypotheses generally contain more pronouns, quantifiers (e.g., `many', `every'), WH-words (e.g., `who', `why'), and tokens of `and' than do their ANLI counterparts, although A3 reaches nearly the same percentage as MultiNLI for negation and modals. However, ANLI contains more cardinal numerals and time terms (such as `before', `month', and `tomorrow') than MultiNLI. These differences might be due to the fact that the two datasets are constructed from different genres of text. Since A1 and A2 contexts are constructed from a single Wikipedia data source (i.e., HotpotQA data), and most Wikipedia articles include dates in the first line, annotators appear to prefer constructing hypotheses that highlight numerals and time terms, leading to their high incidence.

Focusing on ANLI more specifically, A1 has roughly the same incidence of most tags as A2 (i.e., within 2% of each other), which, again, accords with the fact that we used the same Wikipedia data source for A1 and A2 contexts. A3, however, has the highest incidence of every tag (except for numbers and time) in the ANLI dataset. This could be due to our sampling of A3 contexts from a wider range of genres, which likely affected how annotators chose to construct A3 hypotheses; this idea is supported by the fact that A3 contexts differ in tag percentage from A1 and A2 contexts as well. The higher incidence of all tags in A3 is also interesting, because it could be taken as providing yet another piece of evidence that our HAMLET data collection procedure generates increasingly more difficult data as rounds progress.

Appendix C Dataset properties

Table 9 shows the label distribution. Figure 4 shows a histogram of the number of tries per good verified example across the three rounds. Figure 5 shows the time taken per good verified example. Figure 6 shows a histogram of the number of tokens for contexts and hypotheses across the three rounds. Figure 7 shows the proportion of different types of collected examples across the three rounds.

Table 10 reports the inter-annotator agreement for verifiers on the dev and test sets. For reference, the Fleiss' kappa of FEVER Thorne et al. (2018) is $0.68$ and of SNLI Bowman et al. (2015) is $0.70$. Table 11 shows the percentage of agreement of verifiers with the intended author label.

Appendix D Examples

We include more examples of collected data in Table 12.

Appendix E User interface

Examples of the user interface are shown in Figures 8, 9 and 10.