WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation

Alisa Liu, Swabha Swayamdipta, Noah A. Smith, Yejin Choi

Introduction

As much as large-scale crowdsourced datasets have expedited progress on various NLP problems, a growing body of research has revealed fundamental limitations in existing datasets: they are often flooded with repetitive and spurious patterns, rather than covering the broad range of linguistic phenomena required by the task Bowman and Dahl (2021). This leads to models that seem to achieve human-level performance on in-domain test sets, yet are brittle when given out-of-domain or adversarial examples Ribeiro et al. (2020); Glockner et al. (2018).

We attribute this problem to an inherent challenge in the crowdsourcing design—the prevalent paradigm for creating large-scale NLP datasets—where a relatively small number of workers create a massive number of free text examples. While human annotators are generally reliable for writing correct examples, crafting diverse and creative examples at scale can be challenging. Thus, crowdworkers often resort to a limited set of writing strategies for speed, at the expense of diversity Geva et al. (2019); Gururangan et al. (2018). When models overfit to such repetitive patterns, they fail to generalize to out-of-domain examples where these patterns no longer hold Geirhos et al. (2020).

On the other hand, there has been remarkable progress in open-ended text generation based on massive language models (Brown et al., 2020; Raffel et al., 2020, i.a.). Despite known deficiencies such as incoherence or repetition Dou et al. (2021), these models often produce human-like text Clark et al. (2021) and show potential for creative writing tasks Lee et al. (2022). Importantly, these models are capable of replicating a pattern given just a few examples in context (Brown et al., 2020, GPT-3).

In this paper, we introduce a novel approach for dataset creation which brings together the generative strength of language models and the evaluative strength of humans through human and machine collaboration (§2). The key insight of our approach is that language models can create new examples by replicating linguistic patterns that are valuable for training, without necessarily “understanding” the task itself. Illustrated in Figure 1, our pipeline starts with an existing dataset. We use dataset cartography from Swayamdipta et al. (2020) to automatically identify pockets of examples that demonstrate challenging reasoning patterns relative to a trained model. Using each group as a set of in-context examples, we leverage a pretrained language model to generate new examples likely to have the same pattern (see Table 1). We then propose a novel metric, building on dataset cartography, to automatically filter generations that are most likely to aid model learning. Finally, we validate the generated examples by subjecting them to human review, where crowdworkers assign a gold label and (optionally) revise for quality.

We demonstrate the effectiveness of our approach on the task of natural language inference (NLI), which determines whether a premise entails (i.e., implies the truth of) a hypothesis, both expressed in natural language. Despite being one of the most resource-available tasks in NLP, analysis and challenge sets repeatedly demonstrate the limitations of existing datasets and the brittleness of NLI models trained on them Gururangan et al. (2018); Poliak et al. (2018); Tsuchiya (2018). Using MultiNLI Williams et al. (2018) as our original dataset, we use our pipeline to create a dataset of 107,885 examples, which we call Worker-and-AI NLI (WaNLI). The name is pronounced wan-li, like the Chinese characters 万理, as in ten thousand reasoning. A demo, data, and code are available at https://wanli.allenai.org/.

Remarkably, empirical results demonstrate that replacing MultiNLI supervision with WaNLI (which is $4$ times smaller) improves performance on eight different out-of-domain test sets, including datasets that are converted to the NLI format from downstream tasks such as question-answering and fact verification (§3). This result holds even when augmenting MultiNLI with other NLI datasets and recently proposed augmentation sets. Moreover, including WaNLI in the training data can help improve performance on certain in-domain test sets. We then analyze WaNLI and show that it has fewer previously documented spurious correlations than MultiNLI (§4), and provide insights into the collaborative framework (§5).

Our approach contrasts with previous instruction-based generation of dataset examples Schick and Schütze (2021); West et al. (2021), which require the model to understand the task from context, fundamentally limiting the complexity of generated output to what is accessible by the model. Moreover, our human-in-the-loop approach is collaborative, rather than adversarial Dinan et al. (2019); Nie et al. (2020); Bartolo et al. (2020). Overall, we leverage the best of both worlds: a powerful model’s ability to efficiently generate diverse examples, and humans’ ability to improve and ensure the quality of generations.

Our worker-AI collaborative approach is more scalable compared to the traditional crowdsourcing framework. Our approach is generalizable, allowing for rejuvenating datasets on many different classification tasks, especially when performance seems to stagnate due to overfitting to popular benchmarks Recht et al. (2019). Our work shows the promise of leveraging language models in a controlled way to aid the dataset creation process, and we encourage the community to think of dataset curation as an AI challenge itself.

Worker-AI Collaborative Dataset Creation for NLI

We describe our four-stage approach for dataset creation based on worker and AI collaboration. In this work, we apply it to the task of natural language inference (NLI), which involves predicting whether a premise entails, contradicts or is neutral to a hypothesis. NLI has broad applicability in NLP: it has proven useful for pretraining Clark et al. (2019); Phang et al. (2018), and can be applied to verify candidate answers in question-answering Chen et al. (2021) or factuality of generated summaries Maynez et al. (2020).

Our approach requires as prerequisites an initial dataset $\mathcal{D}_{0}$ and a strong task model $\mathcal{M}$ trained on $\mathcal{D}_{0}$ . We use MultiNLI Williams et al. (2018), a large-scale multi-genre NLI dataset, as $\mathcal{D}_{0}$ . We finetune RoBERTa-large Liu et al. (2019) on MultiNLI for our task model $\mathcal{M}$ (training details in Appendix B).

As an overview, we first automatically collect groups of examples exemplifying challenging reasoning patterns in $\mathcal{D}_{0}$ relative to $\mathcal{M}$ , using data maps (Swayamdipta et al., 2020; Stage 1, see §2.1). Then we overgenerate similar examples by leveraging the pattern replication capabilities of GPT-3 Brown et al. (2020) (Stage 2; §2.2). While GPT-3 can generate examples efficiently, it may not reliably replicate the desired pattern and its output quality will not be uniform. We address this by automatically filtering the generated examples using a metric derived from data maps (Stage 3; §2.3). We finally subject the collected data to human review, in which crowdworkers optionally revise examples and assign gold labels (Stage 4; §2.4).

A key component of our pipeline is inspired by data maps Swayamdipta et al. (2020), which automatically reveal different regions in a dataset, w.r.t. the behavior of a classification model during training. These include easy-to-learn examples which the model consistently predicts correctly through training, hard-to-learn examples on which it is consistently incorrect, and ambiguous examples for which the model’s confidence in the correct answer exhibits high variability across train epochs. Our pipeline focuses on ambiguous examples, which were shown to lead to more robust models. Additionally, ambiguous examples contain fewer spurious correlations Gardner et al. (2021), suggesting that they capture under-represented counterexamples to spurious correlations. Indeed, such counterexamples take more epochs of training to learn and are crucial for generalization Tu et al. (2020), providing a potential explanation for why they appear ambiguous across early epochs and lead to more robust models.
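To make the data-map regions concrete, the following is a minimal sketch of the two training-dynamics statistics involved: confidence (the mean probability of the gold label across epochs) and variability (its standard deviation across epochs). The input format, the function names, and the toy numbers are assumptions made for illustration; they are not taken from the authors' code.

```python
from statistics import mean, pstdev


def data_map_stats(gold_probs):
    """Summarize one training example from its per-epoch gold-label probabilities.

    gold_probs[e] is the probability the model assigned to the gold label at
    the end of epoch e (an assumed input format for this sketch).
    """
    return {
        "confidence": mean(gold_probs),     # high and stable -> easy-to-learn
        "variability": pstdev(gold_probs),  # high -> ambiguous
    }


def most_ambiguous(prob_table, fraction):
    """Return the ids of the top `fraction` of examples ranked by variability."""
    ranked = sorted(
        prob_table,
        key=lambda ex_id: data_map_stats(prob_table[ex_id])["variability"],
        reverse=True,
    )
    keep = max(1, round(fraction * len(ranked)))
    return ranked[:keep]


# Toy probabilities over five epochs for three hypothetical examples.
toy = {
    "easy": [0.90, 0.95, 0.97, 0.98, 0.99],
    "hard": [0.05, 0.04, 0.06, 0.05, 0.03],
    "ambiguous": [0.20, 0.80, 0.30, 0.90, 0.40],
}
selected = most_ambiguous(toy, fraction=1 / 3)
print(selected)
```

In this toy table, the easy and hard examples both have low variability and differ only in confidence, while the ambiguous example is the one whose gold-label probability swings between epochs.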

2.1 Stage 1: Collection of Exemplars

In this stage, we automatically collect groups of examples from $\mathcal{D}_{0}$ which represent linguistic patterns we wish to include in the target dataset. We begin with a seed example $(x_{i},y_{i})\in\mathcal{D}_{0}$ belonging to the most ambiguous $p=25\%$ of examples relative to $\mathcal{M}$. For exemplar collection, we exclude the telephone genre of MultiNLI, which consists of telephone conversation transcripts, due to their low fluency and ill-defined entailment relationships. During pilots, we found that generated examples mimicking telephone conversations would require crowdworkers to revise low-quality text for basic fluency.

To generate a new example with the same reasoning pattern, we wish to leverage the ability of GPT-3 Brown et al. (2020) for in-context learning; hence, we need to first collect examples that test a similar kind of reasoning to $x_{i}$ . To do this, we use the [CLS] token representation of each example relative to the task model $\mathcal{M}$ , and find the $k=4$ nearest neighbors via cosine similarity to $x_{i}$ that have the same label. Detailed qualitative inspection shows that the nearest neighbors in this representation space tend to capture a human-interpretable similarity in the reasoning required to solve an example, rather than lexical or semantic similarity (examples in Table 1).

Han and Tsvetkov (2021) give another interpretation for this approach: for examples with the same label, the similarity of [CLS] token embeddings actually represents the similarity of gradient updates in the row of the final projection layer corresponding to that label. Thus, two examples are close if training on them would “update” the final layer of the model similarly.
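The nearest-neighbor retrieval described above can be sketched as follows. This is an illustration only: the toy 2-dimensional vectors stand in for [CLS] representations from the task model, and the function and variable names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_same_label(seed_id, embeddings, labels, k=4):
    """Return the k ids most similar to the seed among examples sharing its label."""
    seed_vec = embeddings[seed_id]
    candidates = [
        ex_id for ex_id in embeddings
        if ex_id != seed_id and labels[ex_id] == labels[seed_id]
    ]
    candidates.sort(key=lambda ex_id: cosine(seed_vec, embeddings[ex_id]), reverse=True)
    return candidates[:k]


# Toy 2-d vectors standing in for [CLS] representations (hypothetical values).
embeddings = {
    "seed": [1.0, 0.0],
    "a": [0.9, 0.1],
    "b": [0.0, 1.0],
    "c": [1.0, 0.05],
    "d": [0.5, 0.5],
}
labels = {"seed": "neutral", "a": "neutral", "b": "neutral", "c": "entailment", "d": "neutral"}
neighbors = nearest_same_label("seed", embeddings, labels, k=2)
print(neighbors)
```

Example "c" is the closest vector to the seed but carries a different label, so it is never returned; the same-label restriction is applied before ranking.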

By automatically identifying areas for augmentation, our method does not require any prior knowledge of challenging patterns and makes our method tractable for building on top of large-scale datasets. Nonetheless, exemplar collection could potentially be approached in different ways (e.g., through expert curation or category labels).

2.2 Stage 2: Overgeneration

Given an automatically extracted group of $k+1$ examples from the original dataset $\mathcal{D}_{0}$ , we construct a natural language context (prompt) for a left-to-right language model; in this work, we use GPT-3 Curie (the second-largest GPT-3 model). The prompt template we use is shown in Figure 2, where we order the examples in increasing similarity to the seed example.

Note that our method leverages GPT-3 in a way that is distinct from its typical usage in few-shot settings, where, given examples demonstrating a task, GPT-3 performs the task on a new, unlabeled example. Here, we instead give GPT-3 examples representing a particular slice of the task, and ask GPT-3 to generate a new example in the same slice.

2.3 Stage 3: Automatic Filtering

In this step, we wish to filter generated examples from Stage 2 to retain those that are the most ambiguous with respect to $\mathcal{M}$. However, computing ambiguity for an example requires that it be a part of the original training set, whereas we wish to estimate the ambiguity of an unlabeled example without additional training. Thus we introduce a new metric called estimated max variability, which measures the worst-case spread of predictions on an example $x_{i}$ across checkpoints of a trained model. Let $E$ be the total epochs in training, $\mathcal{Y}$ the label set, and $p_{\theta^{(e)}}$ the probability assigned with parameters $\theta^{(e)}$ at the end of the $e$-th epoch. We define the estimated max variability as:

$$\sigma_{i}=\max_{y\in\mathcal{Y}}\,\sigma\left(\left\{p_{\theta^{(e)}}(y\mid x_{i})\right\}_{e=1}^{E}\right),$$

where $\sigma$ is the standard deviation function.

Concretely, we retroactively compute the prediction from each saved epoch of $\mathcal{M}$ on $x_{i}$. The only assumption made is that the single example, if it had been a part of the training set, would have made a negligible difference on each model checkpoint (at least as observed through its posterior probabilities). Indeed, we find a high correlation between variability and estimated max variability; see Appendix A. In taking a maximum across labels, we consider $x_{i}$ to be ambiguous as long as $\mathcal{M}$ is undecided on any label $y\in\mathcal{Y}$.
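A minimal sketch of the metric, under stated assumptions: the per-epoch predictions are given as plain dictionaries, the population standard deviation is used (the text does not say whether the population or sample form is intended), and the top-`n` cutoff in `keep_most_ambiguous` is a hypothetical way to apply the filter.

```python
from statistics import pstdev


def estimated_max_variability(epoch_probs):
    """Largest per-label standard deviation of predicted probabilities across epochs.

    epoch_probs[e][y] is the probability that the checkpoint saved after
    epoch e assigns to label y for one unlabeled example (assumed format).
    """
    labels = epoch_probs[0].keys()
    return max(pstdev([probs[y] for probs in epoch_probs]) for y in labels)


def keep_most_ambiguous(generations, n):
    """Keep the n generated examples with the highest estimated max variability.

    The cutoff n is a hypothetical parameter for this sketch.
    """
    ranked = sorted(
        generations,
        key=lambda g: estimated_max_variability(g["epoch_probs"]),
        reverse=True,
    )
    return ranked[:n]


# Toy predictions from three checkpoints (hypothetical numbers).
stable = [{"e": 0.90, "n": 0.05, "c": 0.05}] * 3
wavering = [
    {"e": 0.8, "n": 0.1, "c": 0.1},
    {"e": 0.2, "n": 0.7, "c": 0.1},
    {"e": 0.5, "n": 0.4, "c": 0.1},
]
generations = [
    {"id": "stable", "epoch_probs": stable},
    {"id": "wavering", "epoch_probs": wavering},
]
kept = keep_most_ambiguous(generations, n=1)
print(kept[0]["id"])
```

Because the maximum is taken over labels, the wavering example counts as ambiguous even though its contradiction probability never moves.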

2.4 Stage 4: Human Review

Crowdworkers annotate a total of 118,724 examples, with two distinct workers reviewing each example. For examples that both annotators labeled without revision, we achieved a Cohen’s $\kappa$ of $0.60$ , indicating substantial agreement. To create the final dataset, we discard an example if either annotator chose to discard it, and we keep a revision only if both annotators revise an example (and choose a revision uniformly at random). When both annotators label the example as-is but choose different labels, we sample one of the two labels uniformly at random. The rationale for this is discussed in Appendix D.4. This leads to a labeled dataset of 107,885 examples (90.87% of all annotated examples, with the remaining discarded). Of the labeled examples, 3.54% were revised.
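The aggregation rules above can be sketched as a small function. The data structure and names are hypothetical. One case is not spelled out in the text, namely what happens when exactly one annotator revises; the sketch assumes the original wording is kept with the label from the annotator who did not revise, and marks that branch as an assumption.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Annotation:
    label: Optional[str] = None                  # gold label chosen by the worker
    revision: Optional[Tuple[str, str]] = None   # (premise, hypothesis) if revised
    discard: bool = False                        # worker chose to discard


def aggregate(original, first, second, rng):
    """Merge two reviews of one example; returns (premise, hypothesis, label) or None."""
    if first.discard or second.discard:
        return None                              # either discard removes the example
    if first.revision and second.revision:
        chosen = rng.choice([first, second])     # pick one revision uniformly at random
        return (*chosen.revision, chosen.label)
    if first.revision or second.revision:
        # ASSUMPTION (not stated in the text): keep the original wording with
        # the label from the annotator who did not revise.
        keeper = second if first.revision else first
        return (*original, keeper.label)
    if first.label == second.label:
        return (*original, first.label)
    # Both labeled as-is but disagree: sample one label uniformly at random.
    return (*original, rng.choice([first.label, second.label]))


rng = random.Random(0)
original = ("premise text", "hypothesis text")
agreed = aggregate(original, Annotation("neutral"), Annotation("neutral"), rng)
print(agreed)
```

Passing the random generator in explicitly keeps the sampling reproducible, which matters when the final labels depend on coin flips.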

We randomly split the data into train and test sets. Key dataset statistics are summarized in Table 2. Unlike MultiNLI, WaNLI is not label-balanced; see §5.3 for a discussion.

In general, we believe the role of revision depends on the quality of machine-generated examples. Indeed, we need to strike a balance between leveraging human capabilities and avoiding the re-emergence of annotation artifacts that may come with too much freedom in revision.

Training NLI Models with WaNLI

We finetune different copies of RoBERTa-large Liu et al. (2019) on different training sets, and evaluate each resulting model’s performance on a large suite of NLI challenge sets. Given that the challenge sets were constructed independently of MultiNLI or WaNLI, we consider them out-of-distribution (OOD) for both training datasets.

The NLI challenge sets come from a wide array of domains, methodologies (e.g., crowdsourcing, expert curation, generation), and initial task formats (e.g., question-answering, fact verification). We evaluate on the development set for every dataset, except for Winograd NLI, where we combine the train and development set for greater statistical power, and Adversarial NLI, where we use the test set as the labels were not hidden.

NLI Diagnostics Wang et al. (2018) is a manually-curated test set that evaluates a variety of linguistic phenomena using naturally-occurring sentences from several domains.

HANS McCoy et al. (2019) targets unreliable syntactic heuristics based on lexical overlap between the premise and hypothesis.

QNLI was adapted from the Stanford Question-Answering Dataset Rajpurkar et al. (2016) by the GLUE benchmark Wang et al. (2018). Each example consists of a premise that is a sentence, and a hypothesis that is a question, which is entailed if the question is answered by the premise.

Winograd NLI was adapted by the GLUE benchmark from the Winograd Schema Challenge Levesque et al. (2011), which tests correct coreference via common sense. To convert this dataset to NLI, an entailed hypothesis is formed by substituting a correct referent and a non-entailed hypothesis is formed by substituting an incorrect referent.

Adversarial NLI (ANLI; Nie et al., 2020) is an adversarially-constructed dataset where crowdworkers are instructed to write examples that stump existing models. Examples are collected in three rounds that progressively increase in difficulty, with model adversaries trained on MultiNLI, SNLI Bowman et al. (2015), FEVER-NLI (discussed below), as well as ANLI sets from earlier rounds.

Natural Questions NLI (NQ-NLI, Chen et al., 2021) is created from the Natural Questions QA dataset Kwiatkowski et al. (2019). The premise is a decontextualized sentence from the original context; the hypothesis consists of a question and answer candidate converted into declarative form.

FEVER NLI is adapted from the FEVER fact verification dataset Thorne et al. (2018), and introduced along with ANLI. In each example, the premise is a short context from Wikipedia, and the hypothesis is a claim that is either supported (entailed), refuted (contradicted), or neither (neutral).

BIG-Bench NLI is a combination of four datasets from BIG-Bench Srivastava et al. (2022) about entailment: Analytic Entailment, Epistemic Reasoning, Disambiguation QA, Presuppositions NLI.

3.2 Training Datasets

In addition to stand-alone WaNLI and MultiNLI, we also consider combining MultiNLI with other NLI datasets. We use the train sets of SNLI Bowman et al. (2015), ANLI, and FEVER-NLI, as well as the augmentation set generated via Tailor Ross et al. (2022), which perturbed SNLI hypotheses to create examples with high lexical overlap between the premise and hypothesis, and the augmentation set Z-Aug Wu et al. (2022), which was created by generating in-distribution examples and filtering them based on spurious correlations.

We consider two schemes for combining datasets $\mathcal{A}$ and $\mathcal{B}$ : 1) augmentation ( $\mathcal{A}+\mathcal{B}$ ), in which the two datasets are concatenated, and 2) random replacement ( $\mathcal{A}\diamond\mathcal{B}$ ), where $\lvert\mathcal{B}\rvert$ examples from $\mathcal{A}$ are randomly swapped out and replaced with all examples from $\mathcal{B}$ .
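The two combination schemes can be sketched with plain lists. The function names and toy datasets are hypothetical, and the sketch assumes $\lvert\mathcal{B}\rvert\leq\lvert\mathcal{A}\rvert$ for random replacement.

```python
import random


def augment(a, b):
    """A + B: concatenate the two datasets."""
    return list(a) + list(b)


def random_replace(a, b, rng):
    """A ⋄ B: drop |B| randomly chosen examples from A, then add all of B."""
    if len(b) > len(a):
        raise ValueError("B must not be larger than A")
    kept = rng.sample(list(a), len(a) - len(b))  # the |A| - |B| survivors of A
    return kept + list(b)


# Toy datasets (hypothetical): ten examples in A, three in B.
a = [("A", i) for i in range(10)]
b = [("B", i) for i in range(3)]
rng = random.Random(0)
augmented = augment(a, b)
replaced = random_replace(a, b, rng)
print(len(augmented), len(replaced))
```

Random replacement holds the training set size fixed at $\lvert\mathcal{A}\rvert$, so any change in performance cannot be attributed to simply having more data.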

3.3 Results

Results are shown in Table 3. When comparing MultiNLI (MNLI) and WaNLI alone, training a model on WaNLI instead of MultiNLI leads to better performance on every test set we consider, including by $4\%$ on Diagnostics, $11\%$ on HANS, and $9\%$ on Adversarial NLI. This is remarkable given WaNLI is $4\times$ smaller than MultiNLI, and contains primarily machine-written examples.

A WaNLI-trained model continues to outperform baselines that combine MultiNLI with other NLI datasets and augmentation sets, in every OOD setting. This includes when comparing to a model trained on $9\times$ more data from three existing NLI datasets, MNLI $+$ SNLI $+$ ANLI. The consistent advantage of WaNLI over datasets that include ANLI (e.g., MNLI $+$ ANLI) is noteworthy, as ANLI’s adversarial creation pipeline posed a much greater challenge for human workers, and used more existing resources to train model adversaries.

Quite surprisingly, training on WaNLI alone also outperforms combining WaNLI with MultiNLI. This reinforces that more data might not necessarily be better, especially when the data predominantly consists of easy-to-learn examples.

In addition to the OOD setting, we consider whether augmentation with WaNLI can improve in-domain test performance for another dataset (Table 4). Indeed, augmenting ANLI’s train set with WaNLI improves test accuracy on ANLI by 1.4%, while greatly aiding OOD test performance.

Artifacts in WaNLI

We next investigate whether WaNLI contains similar artifacts to MultiNLI. We note, however, that recent work has challenged whether artifacts based on partial input and lexical correlations in the dataset pose genuine robustness threats Srikanth and Rudinger (2022); Eisenstein (2022). We find that while WaNLI contains fewer previously known spurious correlations, it has a distinct set of lexical correlations that may reflect artifacts in GPT-3 output.

Given that the task requires reasoning with both the premise and the hypothesis, a model that sees only one of the two inputs should have no information about the correct label. We reproduce the methodology from Gururangan et al. (2018) and train fastText classifiers to predict the label using partial input. After first balancing WaNLI, a model trained on just the hypotheses of WaNLI achieves $41.6\%$ accuracy on the test set compared to $49.6\%$ for MultiNLI, when restricted to the same size. A premise-only model trained on WaNLI achieves an accuracy of $42.9\%$. Unlike WaNLI, each MultiNLI premise is associated with hypotheses from all three labels; a premise-only baseline is thus guaranteed to have no information about the label.

4.2 Lexical Correlations

Gardner et al. (2021) posit that all correlations between single words and output labels are spurious. We plot the statistical correlation for every word and label in Figure 3, after balancing WaNLI and downsampling MultiNLI. We observe that WaNLI also contains words with detectable correlations, suggesting that GPT-3 may have some artifacts of its own due to the slightly different templates and different sets of in-context examples for each label. Interestingly, the correlations tend to be a different set of words than for MultiNLI (other than “not” and “no”), with less interpretable reasons for correlating with a certain label (e.g., “second”, “was”).
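One way to quantify a word-label correlation is a z-statistic that compares the observed rate $\hat{p}(y\mid w)$ against a uniform null $p_{0}=1/\lvert\mathcal{Y}\rvert$, which is sensible only after label balancing: $z=(\hat{p}-p_{0})/\sqrt{p_{0}(1-p_{0})/n}$, where $n$ is the number of examples containing the word. The sketch below assumes this statistic, whitespace tokenization, and one count per word per example; none of these details are specified in the text above, and the toy sentences are hypothetical.

```python
import math
from collections import Counter


def word_label_z(examples, num_labels=3):
    """Map (word, label) -> z-statistic against a uniform-label null hypothesis.

    examples: iterable of (text, label) pairs from a label-balanced dataset.
    Each word is counted once per example (an assumption of this sketch).
    """
    p0 = 1.0 / num_labels
    word_count = Counter()
    pair_count = Counter()
    for text, label in examples:
        for word in set(text.lower().split()):
            word_count[word] += 1
            pair_count[(word, label)] += 1
    scores = {}
    for (word, label), count in pair_count.items():
        n = word_count[word]
        p_hat = count / n  # observed p(label | word)
        scores[(word, label)] = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return scores


# Toy label-balanced data (hypothetical sentences).
toy_examples = [
    ("he did not go", "contradiction"),
    ("she was not there", "contradiction"),
    ("the cat sat", "entailment"),
    ("a dog ran", "entailment"),
    ("the bird flew", "neutral"),
    ("a fish swam", "neutral"),
]
scores = word_label_z(toy_examples)
print(scores[("not", "contradiction")], scores[("the", "entailment")])
```

Words seen only a handful of times get small $n$ and therefore modest z-scores even when they co-occur with a single label, which is why rare words are not automatically flagged.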

4.3 Premise-Hypothesis Semantic Similarity

We explore the semantic similarity between the premise and hypothesis within each label class using Sentence-BERT Reimers and Gurevych (2019); these distributions are shown in Figure 4. In both MultiNLI and WaNLI, entailed hypotheses are naturally most semantically similar to the premise. In MultiNLI, this is followed by neutral examples and then contradiction examples. In contrast, in WaNLI there is much greater overlap in the three distributions, and those for neutral and contradiction examples are nearly indistinguishable. This suggests in WaNLI, the semantic similarity between the premise and hypothesis provides less signal of the label.

What does WaNLI show about the human-machine collaboration pipeline?

We discuss observations from collecting WaNLI that may offer insight for future work on collaborative dataset creation.

We find that revisions fall broadly into two categories: improving the fluency of the text, and improving the clarity of the relationship. The majority of revisions change the length only slightly, with $74\%$ of both premise revisions and hypothesis revisions changing the word count between $-1$ and $+2$ words. Fluency revisions often target well-documented issues with text generation, such as redundancy and self-contradiction. Clarity revisions often resolve ambiguities in the example that make the entailment relationship difficult (or impossible) to determine, such as ambiguous coreference or temporal references. We provide examples of revisions in Appendix D.3.

5.2 What kinds of examples do annotators disagree on?

We find that examples on which annotators disagree provide an extremely interesting test bed for how ambiguities surface in classification tasks. Upon inspecting the examples (some are shown in Table 5), we observe that they represent genuinely ambiguous cases rather than careless mislabels, echoing previous findings Pavlick and Kwiatkowski (2019). See further discussion in Appendix D.4.

5.3 How reliably does GPT-3 reproduce the in-context pattern?

One characteristic of WaNLI is its imbalanced label distribution: even though the set of seed examples for generation was constructed to be balanced, after undergoing human labeling, only 15% of examples are given the contradiction label. We observe that contradiction patterns in in-context examples are generally much more challenging for GPT-3 to copy, likely because it was trained on (mostly) coherent sequences of sentences. More broadly, we find that more abstract reasoning patterns are harder for GPT-3 to mimic than patterns that involve simpler transformations.

Nonetheless, even when GPT-3 does not successfully copy the examples, the diverse set of in-context examples leads to a variety of creative output that may be challenging for human crowdworkers to achieve.

Related Work

The scalability and flexibility of crowdsourcing has enabled the creation of foundational NLP benchmarks across a wide range of subproblems, and made it the dominant paradigm for data collection (Mihaylov et al., 2018; Rajpurkar et al., 2016; Huang et al., 2019; Talmor et al., 2019, i.a.). Nonetheless, a growing body of research shows that resulting datasets may not isolate the key linguistic phenomena Jia and Liang (2017); Chen et al. (2016); Sugawara et al. (2020).

For crowdsourcing NLI datasets, where the annotator is given a premise and asked to write a hypothesis of each label Bowman et al. (2015); Williams et al. (2018), the presence of annotation artifacts is especially well-studied Gururangan et al. (2018); McCoy et al. (2019); Glockner et al. (2018). Recent work attempted to remedy this through different data collection protocols but found negative results Vania et al. (2020); Bowman et al. (2020), showing this is a hard problem requiring greater innovation.

Adversarial data collection

In this paradigm, annotators are asked to produce examples on which current systems fail (Kiela et al., 2021; Talmor et al., 2021; Zellers et al., 2019, i.a.). Beyond increasing annotator effort Bartolo et al. (2020), adversarial methods have been challenged for not leading to better generalization on non-adversarial test sets Kaushik et al. (2021) and decreasing data diversity Bowman and Dahl (2021). Moreover, the resulting data has been shown to depend strongly on the adversaries, inhibiting a fair evaluation Phang et al. (2021). Finally, these approaches may produce examples beyond the scope of the task. For example, in Adversarial NLI Nie et al. (2020), an estimated 58% of examples required “reasoning from outside knowledge or additional facts,” which is arguably separate from the underlying problem of understanding semantic entailments. We argue that we can better leverage the strengths of machines and humans by having them collaborate rather than act as adversaries.

Dataset generation

Another recent approach leverages language models toward fully automatic dataset creation (Schick and Schütze, 2021; Wu et al., 2022; West et al., 2021; Bartolo et al., 2021a, i.a.). Removing human input may fundamentally limit the complexity of examples to phenomena already accessible by the model, when our goal is precisely to teach models more diverse phenomena. The most similarly-motivated work to ours, Lee et al. (2021), trains a data generator on “data-rich slices” of an existing dataset, and applies it to under-represented slices. However, they use labels or metadata to represent slices, leaving automatic methods of identifying slices to future work.

Human-machine collaboration

In terms of human-machine collaboration, Tekiroğlu et al. (2020) and Yuan et al. (2021) employ a language model to generate counter-narratives to hate speech and biographies, respectively, which are validated and revised by humans. This was for a generative task, and we complement their findings by showing that human-machine collaboration can also be useful for generating labeled datasets for robust classification models. Contemporary work Bartolo et al. (2021b) finetunes a generative annotation assistant to produce question-answer pairs that humans can revise for extractive QA.

Conclusion

At the heart of dataset creation is distilling human linguistic competence into data that models can learn from. The traditional crowdsourcing paradigm takes the view that the best approach for this is to solicit people to write free-form examples expressing their capabilities. In this work, we present a worker-and-AI collaborative approach and apply it to create WaNLI, whose empirical utility suggests that a better way of eliciting human intelligence at scale is to ask workers to revise and evaluate content. To this end, we hope to encourage more work in developing generative algorithms to aid the dataset creation process, and therefore re-imagining the role of human annotation.

Acknowledgments

We thank members of UW NLP, AI2, and Mila NLP for valuable feedback and discussion, and especially Jena Hwang for help in designing the AMT template, Julian Michael for countless discussions of NLI examples, and Alexander Fang for feedback during writing. We thank OpenAI for offering access to the GPT-3 API and the anonymous reviewers for valuable feedback.

This work was funded in part by the DARPA MCS program through NIWC Pacific (N66001-19-2-4031). The first author is supported by the National Science Foundation Graduate Research Fellowship Program.

Ethics Statement

We acknowledge that text generated from large pretrained language models is susceptible to perpetuating social harms and containing toxic language Sheng et al. (2019); Gehman et al. (2020). To partially remedy this, we ask annotators to discard any examples that may be perceived as offensive. Nonetheless, it is possible that harmful examples (especially if they contain subtle biases) may have been missed by annotators and included in the final dataset. Specifically due to the above harms, we additionally caution readers and practitioners against fully automating any data creation pipeline.

In addition, we are cognizant of the asymmetrical relationship between requesters and workers in crowdsourcing. We took great care to pay fair wages, and were responsive to feedback and questions throughout the data collection process (see Appendix D for details). The only personal information we collect is the worker IDs from Amazon Mechanical Turk, which we will not release. The annotation effort received an IRB exemption.

Limitations

In this paper, we apply our collaborative dataset creation pipeline to a single language and task, English natural language inference, and leave application of the pipeline more broadly to future work.

It is possible (if not likely) that datasets partially authored by language models will have artifacts of their own, especially those reflecting social biases that may not be captured by our accuracy-based evaluation setup. For investigation of a specific generation artifact observed by Yuan et al. (2021) in their own collaborative dataset, namely the over-representation of Western entities, please see Appendix C.4.

We are not able to perform ablations on different parts of the pipeline to understand the effectiveness of each component, e.g., by comparing different means of collecting exemplar groups or different templates for prompting GPT-3. Unfortunately, such variations would be prohibitively expensive as they each require collecting a dataset of sufficient scale (along with the necessary human annotation).

Finally, although we uncover examples where annotators disagree for valid reasons (see Table 5), we only use one label per example for training and evaluation. This is because to show the effectiveness of WaNLI, we need to compare WaNLI to existing (singly-labeled) training datasets via performance on established (singly-labeled) benchmarks. We encourage future work to understand the limitations of forcing inherently ambiguous instances into the $n$ -way classification scheme, or otherwise discarding these potentially valuable examples of linguistic reasoning as noise.

References

Appendix A Estimated Max Variability

In order to test the correlation between variability and estimated max variability on a dataset $\mathcal{D}$ , we would have to repeatedly hold out a single example $x$ , train a model on $\mathcal{D}\setminus\{x\}$ , and evaluate how well the estimated max variability from the model trained on $\mathcal{D}\setminus\{x\}$ correlates with the true variability from the model trained on $\mathcal{D}$ , which saw $x$ during training.

Appendix B Modeling Details

All model training is implemented with the HuggingFace Wolf et al. (2020) library and uses the original hyperparameters from the RoBERTa paper for finetuning on GLUE Liu et al. (2019). We train the model for five epochs and evaluate the final model. We choose not to use an early stopping scheme in order to isolate the training data as the object of study and control for training length as a confounding factor. This is important since Tu et al. (2020) showed that counter-examples can be learned better with longer training.

All training was performed on a single Nvidia Quadro RTX 6000 GPU. The duration of training varied depending on the size of the training data, from 3 hours for WaNLI to 14 hours for MultiNLI + WaNLI.

Appendix C WaNLI Details and Discussion

We include some examples of full GPT-3 contexts in Tables 12, 13, 14, and 15.

C.2 GPT-3 Generation Hyperparameters

We queried the GPT-3 Curie model available through the OpenAI API (https://openai.com/api) from November 3 to November 5, 2021. In total, the generation cost $677.89. Hyperparameters for generation (described at https://beta.openai.com/docs/api-reference/completions/create) are shown in Table 7.

C.3 Dataset sizes at each stage

C.4 Investigation of Western entities in WaNLI versus MNLI

While we investigated known artifacts of crowdsourced datasets in §4, generated datasets may have distinct kinds of artifacts. Indeed, recent related work qualitatively observed an over-representation of Western entities in generated biographies Yuan et al. (2021). To investigate whether this is also characteristic of WaNLI, we use flair Akbik et al. (2019) to perform named entity recognition on MultiNLI and WaNLI. Due to the challenges and ethical risks of automatically determining the origin of names and organizations, we focus on the diversity of locations mentioned. We use geopy (https://geopy.readthedocs.io) to map all locations (e.g., cities, provinces, landmarks, as well as countries) to a country.

We find that 79% of location mentions in WaNLI are in Europe or North America, compared to 71% in MultiNLI. In particular, the United States is massively over-represented, accounting for 46% of mentions in WaNLI and 26% in MultiNLI. However, both datasets feature a diversity of location names: WaNLI mentions locations in 210 countries across 22K location entities, and MultiNLI mentions locations in 227 countries across 163K location entities. We conclude that over-representation of Western entities is indeed a concern for generated datasets, and encourage future work to consider this.
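The aggregation step behind these percentages can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the authors' script. Location mentions are assumed to have already been extracted by a named entity recognizer, and the geocoder is passed in as a function so that a real one could be substituted. The helper `country_shares` and the toy lookup table are hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional

def country_shares(
    mentions: Iterable[str],
    to_country: Callable[[str], Optional[str]],
) -> Dict[str, float]:
    """Map each location mention to a country and return each country's share."""
    counts: Counter = Counter()
    for mention in mentions:
        country = to_country(mention)
        if country is not None:  # skip mentions the geocoder cannot resolve
            counts[country] += 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

# Hypothetical lookup standing in for a real geocoder; not data from the paper.
TOY_GEOCODER = {
    "Paris": "France",
    "Texas": "United States",
    "Chicago": "United States",
    "Kyoto": "Japan",
}

shares = country_shares(
    ["Paris", "Texas", "Chicago", "Kyoto", "Atlantis"],
    TOY_GEOCODER.get,
)
```

Shares are computed over resolved mentions only, so how unresolvable mentions are handled is a design choice that affects the reported percentages.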

Appendix D Human Review

Screenshots of the instructions, guidelines, and annotation interface are shown in Figures 6, 7, and 8. The guidelines take inspiration from the design of the NLI Diagnostics dataset Wang et al. (2018). To collect a pool of qualified workers, we designed a qualification task with examples testing each of these categories. NLI is a challenging task, and many generated examples are especially challenging by design. Therefore, instructing annotators in how to think about the task and resolve common issues is key to collecting high-quality, label-consistent data.

Annotators were required to have a HIT approval rate of at least 98% and at least 10,000 approved HITs, and to be located in the United States.

In total, 300 Turkers took our qualification test, of whom 69 passed. Turkers who were later found to produce extremely careless annotations were removed from the qualification list (and oftentimes, their annotations were discarded, though they were still paid for their work). The number of workers who contributed to the final dataset is 62.

Throughout the data collection process, the authors would review annotations and write individualized emails to Turkers with feedback, as well as group emails to clarify common challenging cases of NLI (such as examples involving questions). This follows the recommended crowdsourcing protocol from Nangia et al. (2021).

D.2 Compensation

In designing the task, we aimed for a pay rate of at least $15 per hour. Workers were paid $0.12 for each example that they annotated. At the end of data collection, we aggregated the earnings and time spent for each crowdworker, and found that the median hourly rate was $22.72, with 85% of workers being paid over the $15/hour target.
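The arithmetic behind these figures can be sketched as follows. At $0.12 per example, the $15 target corresponds to 125 examples per hour, or roughly 29 seconds per example. The per-worker logs below are hypothetical and serve only to illustrate how a median hourly rate and the share of workers above the target might be computed; the function name is an assumption as well.

```python
from statistics import median

PAY_PER_EXAMPLE = 0.12  # dollars per annotated example, from the text
TARGET_HOURLY = 15.00   # dollars per hour, from the text

# Annotation pace at which per-example pay meets the hourly target.
examples_per_hour_needed = TARGET_HOURLY / PAY_PER_EXAMPLE

def hourly_rates(workers):
    """workers: iterable of (examples_annotated, hours_worked) pairs."""
    return [n * PAY_PER_EXAMPLE / hours for n, hours in workers]

# Hypothetical worker logs, not data from the study.
toy_workers = [(500, 3.0), (1200, 6.0), (250, 2.5)]

rates = hourly_rates(toy_workers)
median_rate = median(rates)
share_over_target = sum(r > TARGET_HOURLY for r in rates) / len(rates)
```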

D.3 Revision Analysis

We provide examples of revisions in Table 9. We find that revisions are generally targeted yet effective. The majority of revisions change the length only slightly, with 74% of both premise revisions and hypothesis revisions changing the word count by between $-1$ and $+2$ words. A notable proportion, 11.6% of premise revisions and 20.6% of hypothesis revisions, changed the set of pronouns present in the text, often to clarify coreference.
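The two revision statistics above could be computed along the following lines. This is a sketch under assumptions: whitespace tokenization, a small hand-picked pronoun list, and the helper names are all choices made here for illustration, as the exact procedure is not specified in the text.

```python
# Assumed pronoun inventory; the list actually used is not given in the text.
PRONOUNS = {
    "i", "me", "my", "you", "your", "he", "him", "his", "she", "her", "hers",
    "it", "its", "we", "us", "our", "they", "them", "their",
}

def pronouns(text: str) -> set:
    """Set of pronouns in the text, after lowercasing and stripping punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return {w for w in words if w in PRONOUNS}

def word_count_change(original: str, revised: str) -> int:
    """Revised word count minus original word count (whitespace tokens)."""
    return len(revised.split()) - len(original.split())

def pronoun_set_changed(original: str, revised: str) -> bool:
    """True if the revision added or removed any pronoun type."""
    return pronouns(original) != pronouns(revised)

# Toy revision that resolves an ambiguous pronoun without changing the length.
original = "John told Bill that he was late."
revised = "John told Bill that Bill was late."

delta = word_count_change(original, revised)
changed = pronoun_set_changed(original, revised)
```

In this toy pair the length is unchanged while the pronoun set shrinks, mirroring the kind of coreference-clarifying edit described above.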

We instructed annotators to revise examples only when it would make the example more “interesting” in some sense, or more clear without removing what’s interesting. Nonetheless, we still observed a large number of revisions that greatly simplified the example, oftentimes re-introducing the same artifacts that have been documented in prior work. Therefore, we ultimately chose to include revisions only when both annotators revised the example, indicating that the revision was necessary to improve the quality of the example.

D.4 Disagreement Analysis

In order to investigate the utility of collecting a third annotation, we randomly sampled 80 examples where the two annotators disagreed on the label (and neither revised nor discarded the example), and two of the authors separately annotated each one. Strikingly, the two authors agreed on the label only 49% of the time. Furthermore, in 12% of cases, all three labels were present among the four annotations. This suggests that disagreement is often due to true ambiguity rather than careless mislabeling, and a third annotation would be unlikely to have high payoff in terms of “correcting” the label. As a result, we choose not to collect a third annotation in this work. Instead, we believe that the doubly-annotated examples in WaNLI have flagged many interesting cases of ambiguity in NLI, and we encourage future work to design richer annotation frameworks to uncover the source(s) of ambiguity.
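The two rates reported in this analysis can be computed as in the sketch below, which assumes each sampled example is stored as a pair of crowdworker labels and a pair of author labels. The record layout, function name, and toy records are hypothetical.

```python
def agreement_stats(records):
    """records: list of (worker_labels, author_labels), each a pair of labels.

    Returns (fraction where the two authors agree,
             fraction where all three NLI labels occur among the four annotations).
    """
    n = len(records)
    authors_agree = sum(a1 == a2 for _, (a1, a2) in records)
    all_three = sum(len(set(workers) | set(authors)) == 3
                    for workers, authors in records)
    return authors_agree / n, all_three / n

# Hypothetical annotations, not data from the study.
toy_records = [
    (("entailment", "neutral"), ("neutral", "neutral")),
    (("entailment", "neutral"), ("contradiction", "neutral")),
    (("neutral", "contradiction"), ("contradiction", "contradiction")),
    (("entailment", "contradiction"), ("entailment", "neutral")),
]

author_agreement, three_label_rate = agreement_stats(toy_records)
```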

We choose to keep examples with disagreement in the WaNLI dataset because we believe that finetuning with one of multiple reasonable labels still provides valuable training signal.

Appendix E Additional Experiments

We additionally perform comparisons with several subsets of MultiNLI which are the same size as WaNLI: MultiNLI filtered with the AFLite algorithm (MultiNLI with AFLite; Le Bras et al., 2020), the most ambiguous examples of MultiNLI (MultiNLI ambiguous; Swayamdipta et al., 2020), and a random subset of MultiNLI (MultiNLI downsampled). Results in Table 10 show that a WaNLI-trained model outperforms these baselines on every test set.

E.2 Evaluation on MultiNLI

We report the results on MultiNLI’s development set in Table 8. We find that mixing WaNLI into the MultiNLI training data (either through swapping or augmentation) maintains in-domain accuracy within $\sim$1%. Training on WaNLI alone drops performance on MultiNLI’s development set by $\sim$10%; however, the higher performance on other out-of-domain test sets suggests that evaluation through MultiNLI may not be a definitive signal of model ability.

E.3 Finetuning T5

We demonstrate that the robustness improvements from training on WaNLI generalize to another model architecture, T5-base Raffel et al. (2020), which was never used in the data curation pipeline. As shown in Table 11, training T5-base on WaNLI also outperforms training on MultiNLI on every test set, including by 4% on NLI Diagnostics, 10% on HANS, and 8% on Adversarial NLI (similar margins compared to finetuning RoBERTa-large).

Appendix F Data Map of WaNLI

In Figure 9, we show a data map of MultiNLI relative to RoBERTa-large trained on MultiNLI, and of WaNLI relative to RoBERTa-large trained on WaNLI.