Evaluating Models' Local Decision Boundaries via Contrast Sets

Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hanna Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, Ben Zhou

cs.CL

Introduction

Contrast Sets

We first give a sketch of the problem that contrast sets attempt to solve in a toy two-dimensional classification setting as shown in Figure 1. Here, the true underlying data distribution requires a complex decision boundary (Figure 1(a)). However, as is common in practice, our toy dataset is rife with systematic gaps (e.g., due to annotator bias, repeated patterns, etc.). This causes simple decision boundaries to emerge (Figure 1(b)). And, because our biased dataset is split i.i.d. into train and test sets, this simple decision boundary will perform well on test data. Ideally, we would like to fill in all of a dataset’s systematic gaps; however, this is usually impossible. Instead, we create a contrast set: a collection of instances tightly clustered in input space around a single test instance, or pivot (Figure 1(c); an $\epsilon$-ball in our toy example). This contrast set allows us to measure how well a model’s decision boundary aligns with the correct decision boundary local to the pivot. In this case, the contrast set demonstrates that the model’s simple decision boundary is incorrect. We repeat this process around numerous pivots to form entire evaluation datasets.

When we move from toy settings to complex NLP tasks, the precise nature of a “systematic gap” in the data becomes harder to define. Indeed, the geometric view in our toy examples does not correspond directly to experts’ perception of data; there are many ways to “locally perturb” natural language. We do not expect intuition, even of experts, to exhaustively reveal gaps.

Nevertheless, the presence of these gaps is well-documented Gururangan et al. (2018); Poliak et al. (2018); Min et al. (2019), and Niven and Kao (2019) give an initial attempt at formally characterizing them. In particular, one common source is annotator bias from data collection processes Geva et al. (2019). For example, in the SNLI dataset Bowman et al. (2015), Gururangan et al. (2018) show that the words sleeping, tv, and cat almost never appear in an entailment example, either in the training set or the test set, though they often appear in contradiction examples. This is not because these words are particularly important to the phenomenon of entailment; their absence in entailment examples is a systematic gap in the data that can be exploited by models to achieve artificially high test accuracy. This is but one kind of systematic gap; there are also biases due to the writing styles of small groups of annotators Geva et al. (2019), the distributional biases in the data that was chosen for annotation, as well as numerous other biases that are more subtle and harder to discern Shah et al. (2020).

Completely removing these gaps in the initial data collection process would be ideal, but is likely impossible—language has too much inherent variability in a very high-dimensional space. Instead, we use contrast sets to fill in gaps in the test data to give more thorough evaluations than what the original data provides.

2 Definitions

We begin by defining a decision boundary as a partition of some space into labels. (In this discussion we are talking about the true decision boundary, not a model’s decision boundary.) This partition can be represented by the set of all points in the space with their associated labels: $\{(x,y)\}$. This definition differs somewhat from the canonical definition, which is a collection of hypersurfaces that separate labels. There is a bijection between partitions and these sets of hypersurfaces in continuous spaces, however, so they are equivalent definitions. We choose to use the partition to represent the decision boundary as it makes it very easy to define a local decision boundary and to generalize the notion to discrete spaces, which we deal with in NLP.

A local decision boundary around some pivot $x$ is the set of all points $x^{\prime}$ and their associated labels $y^{\prime}$ that are within some distance $\epsilon$ of $x$. That is, a local decision boundary around $x$ is the set $\{(x^{\prime},y^{\prime}) \mid d(x,x^{\prime})<\epsilon\}$. Note here that even though a “boundary” or “surface” is hard to visualize in a discrete input space, using this partition representation instead of hypersurfaces gives us a uniform definition of a local decision boundary in any input space; all that is needed is a distance function $d$.
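The definition above can be sketched for the toy two-dimensional setting. This is a minimal illustration, not the procedure used to build the contrast sets: the names `local_decision_boundary`, `local_alignment`, and `simple_model`, the sample points, and the choice of Euclidean distance for $d$ are all assumptions made for the example.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Labeled = Tuple[Point, int]


def local_decision_boundary(
    pivot: Point,
    labeled_points: Sequence[Labeled],
    epsilon: float,
    d: Callable[[Point, Point], float] = math.dist,  # assumed: Euclidean distance
) -> List[Labeled]:
    """Return {(x', y') | d(x, x') < epsilon} for the pivot x."""
    return [(x, y) for x, y in labeled_points if d(pivot, x) < epsilon]


def local_alignment(
    model: Callable[[Point], int], local_points: Sequence[Labeled]
) -> float:
    """Fraction of local points on which the model matches the true label."""
    if not local_points:
        raise ValueError("no points within epsilon of the pivot")
    return sum(model(x) == y for x, y in local_points) / len(local_points)


# Hypothetical toy data: the true label changes close to the pivot.
points = [((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.2), 1), ((1.0, 1.0), 1)]
pivot = (0.0, 0.0)


def simple_model(x: Point) -> int:
    """A deliberately simple boundary that ignores the second coordinate."""
    return int(x[0] >= 0.5)


local = local_decision_boundary(pivot, points, epsilon=0.25)
print(len(local), local_alignment(simple_model, local))
```

Here three of the four points fall inside the $\epsilon$-ball, and the simple model agrees with the true label on two of them, so the local set exposes an error that a distant test point such as `(1.0, 1.0)` would not.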

3 Contrast sets in practice

Given these definitions, we now turn to the actual construction of contrast sets in practical NLP settings. There were two things left unspecified in the definitions above: the distance function $d$ to use in discrete input spaces, and the method for sampling from a local decision boundary. While there has been some work trying to formally characterize distances for adversarial robustness in NLP Michel et al. (2019); Jia et al. (2019), we find it more useful in our setting to simply rely on expert judgments to generate a similar but meaningfully different $x^{\prime}$ given $x$ , addressing both the distance function and the sampling method.

Future work could try to give formal treatments of these issues, but we believe expert judgments are sufficient to make initial progress in improving our evaluation methodologies. And while expert-crafted contrast sets can only give us an upper bound on a model’s local alignment with the true decision boundary, an upper bound on local alignment is often more informative than a potentially biased i.i.d. evaluation that permits artificially simple decision boundaries. To give a tighter upper bound, we draw pivots $x$ from some i.i.d. test set, and we do not provide i.i.d. contrast sets at training time, which could provide additional artificially simple decision boundaries to a model.

The teaser figure displays an example contrast set for the NLVR2 visual reasoning dataset Suhr and Artzi (2019). Here, both the sentence and the image are modified in small ways (e.g., by changing a word in the sentence or finding a similar but different image) to make the output label change.

A contrast set is not a collection of adversarial examples Szegedy et al. (2014). Adversarial examples are almost the methodological opposite of contrast sets: they change the input such that a model’s decision changes but the gold label does not Jia and Liang (2017); Wallace et al. (2019a). On the other hand, contrast sets are model-agnostic, constructed by experts to characterize whether a model’s decision boundary locally aligns to the true decision boundary around some point. Doing this requires input changes that also induce changes to the gold label.

We recommend that the original dataset authors—the experts on the linguistic phenomena intended to be reflected in their dataset—construct the contrast sets. This is best done by first identifying a list of phenomena that characterize their dataset. In syntactic parsing, for example, this list might include prepositional phrase attachment ambiguities, coordination scope, clausal attachment, etc. After the standard dataset collection process, the authors should sample pivots from their test set and perturb them according to the listed phenomena.

4 Design Choices of Contrast Sets

Here, we discuss possible alternatives to our approach for constructing contrast sets and our reasons for choosing the process we did.

How to Create Contrast Sets

Here, we walk through our process for creating contrast sets for three datasets. Examples are shown in the teaser figure and Table 1.

DROP Dua et al. (2019) is a reading comprehension dataset that is intended to cover compositional reasoning over numbers in a paragraph, including filtering, sorting, and counting sets, and doing numerical arithmetic. The data has three main sources of paragraphs, all from Wikipedia articles: descriptions of American football games, descriptions of census results, and summaries of wars. There are many common patterns used by the crowd workers that make some questions artificially easy: 2 is the most frequent answer to How many…? questions, questions asking about the ordering of events typically follow the linear order of the paragraph, and a large fraction of the questions do not require compositional reasoning.

Our strategy for constructing contrast sets for DROP was three-fold. First, we added more compositional reasoning steps. The questions about American football passages in the original data very often had multiple reasoning steps (e.g., How many yards difference was there between the Broncos’ first touchdown and their last?), but the questions about the other passage types did not. We drew from common patterns in the training data and added additional reasoning steps to questions in our contrast sets. Second, we inverted the semantics of various parts of the question. This includes perturbations such as changing shortest to longest, later to earlier, as well as changing questions asking for counts to questions asking for sets (How many countries… to Which countries…). Finally, we changed the ordering of events. A large number of questions about war paragraphs ask which of two events happened first. We changed (1) the order the events were asked about in the question, (2) the order that the events showed up in the passage, and (3) the dates associated with each event to swap their temporal order.

NLVR2

We next consider NLVR2, a dataset where a model is given a sentence about two provided images and must determine whether the sentence is true Suhr et al. (2019). The data collection process encouraged highly compositional language, which was intended to require understanding the relationships between objects, properties of objects, and counting. We constructed NLVR2 contrast sets by modifying the sentence or replacing one of the images with freely-licensed images from web searches. For example, we might change The left image contains twice the number of dogs as the right image to The left image contains three times the number of dogs as the right image. Similarly, given an image pair with four dogs in the left and two dogs in the right, we can replace individual images with photos of variably-sized groups of dogs. The textual perturbations were often changes in quantifiers (e.g., at least one to exactly one), entities (e.g., dogs to cats), or properties thereof (e.g., orange glass to green glass). An example contrast set for NLVR2 is shown in the teaser figure.

UD Parsing

Finally, we discuss dependency parsing in the universal dependencies (UD) formalism Nivre et al. (2016). We look at dependency parsing to show that contrast sets apply not only to modern “high-level” NLP tasks but also to longstanding linguistic analysis tasks. We first chose a specific type of attachment ambiguity to target: the classic problem of prepositional phrase (PP) attachment Collins and Brooks (1995), e.g. We ate spaghetti with forks versus We ate spaghetti with meatballs. We use a subset of the English UD treebanks: GUM Zeldes (2017), the English portion of LinES Ahrenberg (2007), the English portion of ParTUT Sanguinetti and Bosco (2015), and the dependency-annotated English Web Treebank Silveira et al. (2014). We searched these treebanks for sentences that include a potentially structurally ambiguous attachment from the head of a PP to either a noun or a verb. We then perturbed these sentences by altering one of their noun phrases such that the semantics of the perturbed sentence required a different attachment for the PP. We then re-annotated these perturbed sentences to indicate the new attachment(s).

Summary

While the overall process we recommend for constructing contrast sets is simple and unified, its actual instantiation varies for each dataset. Dataset authors should use their best judgment to select which phenomena they are most interested in studying and craft their contrast sets to explicitly test those phenomena. Care should be taken during contrast set construction to ensure that the phenomena present in contrast sets are similar to those present in the original test set; the purpose of a contrast set is not to introduce new challenges, but to more thoroughly evaluate the original intent of the test set.

Datasets and Experiments

We create contrast sets for 10 NLP datasets (full descriptions are provided in Appendix A):

IMDb sentiment analysis Maas et al. (2011)

We choose these datasets because they span a variety of tasks (e.g., reading comprehension, sentiment analysis, visual reasoning) and input-output formats (e.g., classification, span extraction, structured prediction). We include high-level tasks for which dataset artifacts are known to be prevalent, as well as longstanding formalism-based tasks, where data artifacts have been less of an issue (or at least have been less well-studied).

2 Contrast Set Construction

The contrast sets were constructed by NLP researchers who were deeply familiar with the phenomena underlying the annotated dataset.

Related Work

The fundamental idea of finding or creating data that is “minimally different” has a very long history. In linguistics, for instance, the term minimal pair is used to denote two words with different meanings that differ by a single sound change, thus demonstrating that the sound change is phonemic in that language Pike (1946). Many people have used this idea in NLP (see below), creating challenge sets or providing training data that is “minimally different” in some sense, and we continue this tradition. Our main contribution to this line of work, in addition to the resources that we have created, is giving a simple and intuitive geometric interpretation of “bias” in dataset collection, and showing that this long-standing idea of minimal data changes can be effectively used to solve this problem on a wide variety of NLP tasks. We additionally generalize the idea of a minimal pair to a set, and use a consistency metric, which we contend more closely aligns with what NLP researchers mean by “language understanding”.

Many previous works have provided minimally contrastive examples on which to train models. Selsam et al. (2019), Tafjord et al. (2019), Lin et al. (2019), and Khashabi et al. (2020) designed their data collection process to include contrastive examples. Data augmentation methods have also been used to mitigate gender Zhao et al. (2018), racial Dixon et al. (2018), and other biases Kaushik et al. (2020) during training, or to introduce useful inductive biases Andreas (2020).

Challenge Sets

The idea of creating challenging contrastive evaluation sets has a long history Levesque et al. (2011); Ettinger et al. (2017); Glockner et al. (2018); Naik et al. (2018); Isabelle et al. (2017). Challenge sets exist for various phenomena, including ones with “minimal” edits similar to our contrast sets, e.g., in image captioning Shekhar et al. (2017), machine translation Sennrich (2017); Burlot and Yvon (2017); Burlot et al. (2018), and language modeling Marvin and Linzen (2018); Warstadt et al. (2019). Minimal pairs of edits that perturb gender or racial attributes are also useful for evaluating social biases Rudinger et al. (2018); Zhao et al. (2018); Lu et al. (2018). Our key contribution over this prior work is in grouping perturbed instances into a contrast set, for measuring local alignment of decision boundaries, along with our new, related resources. Additionally, rather than creating new data from scratch, contrast sets augment existing test examples to fill in systematic gaps. Thus contrast sets often require less effort to create, and they remain grounded in the original data distribution of some training set.

Since the initial publication of this paper, Shmidman et al. have further demonstrated the utility of contrast sets by applying these ideas to the evaluation of morphological disambiguation in Hebrew.

Conclusion

We presented a new annotation paradigm, based on long-standing ideas around contrastive examples, for constructing more rigorous test sets for NLP. Our procedure maintains most of the established processes for dataset creation but fills in some of the systematic gaps that are typically present in datasets. By shifting evaluations from accuracy on i.i.d. test sets to consistency on contrast sets, we can better examine whether models have learned the desired capabilities or simply captured the idiosyncrasies of a dataset. We created contrast sets for 10 NLP datasets and released this data as new evaluation benchmarks.

We recommend that future data collection efforts create contrast sets to provide more comprehensive evaluations for both existing and new NLP datasets. While we have created thousands of new test examples across a wide variety of datasets, we have only taken small steps towards the rigorous evaluations we would like to see in NLP. The last several years have given us dramatic modeling advancements; our evaluation methodologies and datasets need to see similar improvements.

Acknowledgements

We thank the anonymous reviewers for their helpful feedback on this paper, as well as many others who gave constructive comments on a publicly-available preprint. Various authors of this paper were supported in part by ERC grant 677352, NSF grant 1562364, NSF grant IIS-1756023, NSF CAREER 1750499, ONR grant N00014-18-1-2826 and DARPA grant N66001-19-2-403.

References

Appendix A Dataset Details

Here, we provide details for the datasets that we build contrast sets for.

(NLVR2) Given a natural language sentence about two photographs, the task is to determine if the sentence is true Suhr et al. (2019). The dataset has highly compositional language, e.g., The left image contains twice the number of dogs as the right image, and at least two dogs in total are standing. To succeed at NLVR2, a model is supposed to be able to detect and count objects, recognize spatial relationships, and understand the natural language that describes these phenomena.

Internet Movie Database

(IMDb) The task is to predict the sentiment (positive or negative) of a movie review Maas et al. (2011). We use the same set of reviews from Kaushik et al. (2020) in order to analyze the differences between crowd-edited reviews and expert-edited reviews.

Temporal relation extraction

(MATRES) The task is to determine what temporal relationship exists between two events, i.e., whether some event happened before or after another event Ning et al. (2018). MATRES has events and temporal relations labeled for approximately 300 news articles. The event annotations are taken from the data provided in the TempEval3 workshop UzZaman et al. (2013) and the temporal relations are re-annotated based on a multi-axis formalism. We assume that the events are given and only need to classify the relation label between them.

English UD Parsing

We use a combination of four English treebanks (GUM, EWT, LinES, ParTUT) in the Universal Dependencies parsing framework, covering a range of genres. We focus on the problem of prepositional phrase attachment: whether the head of a prepositional phrase attaches to a verb or to some other dependent of the verb. We manually selected a small set of sentences from these treebanks that had potentially ambiguous attachments.

Reasoning about perspectives

(PERSPECTRUM) Given a debate-worthy natural language claim, the task is to identify the set of relevant argumentative sentences that represent perspectives for/against the claim Chen et al. (2019). We focus on the stance prediction sub-task: a binary prediction of whether a relevant perspective is for/against the given claim.

Discrete Reasoning Over Paragraphs

(DROP) A reading comprehension dataset that requires numerical reasoning, e.g., adding, sorting, and counting numbers in paragraphs Dua et al. (2019). In order to compute the consistency metric for the span answers of DROP, we report the average number of contrast sets in which $F_{1}$ for all instances is above $0.8$.
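The thresholded consistency described above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation code behind the reported numbers: `token_f1` is a simplified whitespace-token F1 (not DROP's official scorer, which handles numbers and multi-span answers differently), the result is expressed as a fraction of contrast sets, and `contrast_consistency` and the sample predictions are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# One contrast set = (prediction, gold answer) pairs for every instance in it.
ContrastSet = Sequence[Tuple[str, str]]


def token_f1(prediction: str, gold: str) -> float:
    """Simplified bag-of-tokens F1 (an assumption, not the official DROP metric)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def contrast_consistency(
    contrast_sets: Sequence[ContrastSet], threshold: float = 0.8
) -> float:
    """Fraction of contrast sets in which every instance scores F1 > threshold."""
    if not contrast_sets:
        raise ValueError("need at least one contrast set")
    consistent = sum(
        all(token_f1(pred, gold) > threshold for pred, gold in cs)
        for cs in contrast_sets
    )
    return consistent / len(contrast_sets)


# Hypothetical predictions for two contrast sets.
example_sets: List[ContrastSet] = [
    [("2", "2"), ("3", "3")],                 # every instance answered exactly
    [("the Broncos", "Broncos"), ("4", "4")],  # one instance falls below 0.8
]
print(contrast_consistency(example_sets))
```

A single instance below the threshold marks its whole contrast set as inconsistent, which is what makes this measure stricter than per-instance accuracy or average $F_{1}$.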

Quoref

A reading comprehension task with span selection questions that require coreference resolution Dasigi et al. (2019). In this dataset, most questions can be localized to a single event in the passage, and reference an argument in that event that is typically a pronoun or other anaphoric reference. Correctly answering the question requires resolving the pronoun. We use the same definition of consistency for Quoref as we did for DROP.

Reasoning Over Paragraph Effects in Situations

(ROPES) A reading comprehension dataset that requires applying knowledge from a background passage to new situations Lin et al. (2019). This task has background paragraphs drawn mostly from science texts that describe causes and effects (e.g., that brightly colored flowers attract insects), and situations written by crowd workers that instantiate either the cause (e.g., bright colors) or the effect (e.g., attracting insects). Questions are written that query the application of the statements in the background paragraphs to the instantiated situation. Correctly answering the questions is intended to require understanding how free-form causal language can be understood and applied. We use the same consistency metric for ROPES as we did for DROP and Quoref.

BoolQ

A dataset of reading comprehension instances with Boolean (yes or no) answers Clark et al. (2019). These questions were obtained from organic Google search queries and paired with paragraphs from Wikipedia pages that are labeled as sufficient to deduce the answer. As the questions are drawn from a distribution of what people search for on the internet, there is no clear set of “intended phenomena” in this data; it is an eclectic mix of different kinds of questions.

MC-TACO

A dataset of reading comprehension questions about multiple temporal common-sense phenomena Zhou et al. (2019). Given a short paragraph (often a single sentence), a question, and a collection of candidate answers, the task is to determine which of the candidate answers are plausible. For example, the paragraph might describe a storm and the question might ask how long the storm lasted, with candidate answers ranging from seconds to weeks. This dataset is intended to test a system’s knowledge of typical event durations, orderings, and frequency. As the paragraph does not contain the information necessary to answer the question, this dataset is largely a test of background (common sense) knowledge.

Appendix B Contrast Set Details

We use the following text perturbation strategies for NLVR2:

Perturbing quantifiers, e.g., There is at least one dog $\to$ There is exactly one dog.

Perturbing numbers, e.g., There is at least one dog $\to$ There are at least two dogs.

Perturbing entities, e.g., There is at least one dog $\to$ There is at least one cat.

Perturbing properties of entities, e.g., There is at least one yellow dog $\to$ There is at least one green dog.

Image Perturbation Strategies

For image perturbations, the annotators collected images that are perceptually and/or conceptually close to the hypothesized decision boundary, i.e., they represent a minimal change in some concrete aspect of the image. For example, for an image pair with 2 dogs on the left and 1 dog on the right and the sentence There are more dogs on the left than the right, a reasonable image change would be to replace the right-hand image with an image of two dogs.

Model

We use LXMERT Tan and Bansal (2019) trained on the NLVR2 training dataset.

Contrast Set Statistics

Five annotators created 983 perturbed instances that form 479 contrast sets. Annotation took approximately thirty seconds per textual perturbation and two minutes per image perturbation.

B.2 IMDb

We minimally perturb reviews to flip the label while ensuring that the review remains coherent and factually consistent. Here, we provide example revisions:

Original (Negative): I had quite high hopes for this film, even though it got a bad review in the paper. I was extremely tolerant, and sat through the entire film. I felt quite sick by the end.

New (Positive): I had quite high hopes for this film, even though it got a bad review in the paper. I was extremely amused, and sat through the entire film. I felt quite happy by the end.

Original (Positive): This is the greatest film I saw in 2002, whereas I’m used to mainstream movies. It is rich and makes a beautiful artistic act from these 11 short films. From the technical info (the chosen directors), I feared it would have an anti-American basis, but … it’s a kind of (11 times) personal tribute. The weakest point comes from Y. Chahine : he does not manage to “swallow his pride” and considers this event as a well-merited punishment … It is really the weakest part of the movie, but this testifies of a real freedom of speech for the whole piece.

New (Negative): This is the most horrendous film I saw in 2002, whereas I’m used to mainstream movies. It is low budgeted and makes a less than beautiful artistic act from these 11 short films. From the technical info (the chosen directors), I feared it would have an anti-American basis, but … it’s a kind of (11 times) the same. One of the weakest point comes from Y. Chahine : he does not manage to “swallow his pride” and considers this event as a well-merited punishment … It is not the weakest part of the movie, but this testifies of a real freedom of speech for the whole piece.

Model

We use the same BERT model setup and training data as Kaushik et al. (2020) which allows us to fairly compare the crowd and expert revisions.

Contrast Set Statistics

We use 100 reviews from the validation set and 488 from the test set of Kaushik et al. (2020). Three annotators used approximately 70 hours to construct and validate the dataset.

B.3 MATRES

MATRES has three sections: TimeBank, AQUAINT, and Platinum, with the Platinum section serving as the test set. We use 239 instances (30% of the dataset) from Platinum.

The annotators perturb one or more of the following aspects: appearance order in text, tense of verb(s), and temporal conjunction words. Below are example revisions:

Colonel Collins followed a normal progression once she was picked as a NASA astronaut. (original sentence: “followed” is after “picked”)

Once Colonel Collins was picked as a NASA astronaut, she followed a normal progression. (appearance order change in text; “followed” is still after “picked”)

Colonel Collins followed a normal progression before she was picked as a NASA astronaut. (changed the temporal conjunction word from “once” to “before” and “followed” is now before “picked”)

Volleyball is a popular sport in the area, and more than 200 people were watching the game, the chief said. (original sentence: “watching” is before “said”)

Volleyball is a popular sport in the area, and more than 200 people would be watching the game, the chief said. (changed the verb tense: “watching” is after “said”)

Model

We use CogCompTime 2.0 Ning et al. (2019).

Contrast Set Statistics

Two annotators created 401 perturbed instances that form 239 contrast sets. The annotators used approximately 25 hours to construct and validate the dataset.

Analysis

We recorded the perturbation strategy used for each example. 49% of the perturbations changed the “appearance order”, 31% changed the “tense”, 24% changed the “temporal conjunction words”, and 10% had other changes. We double count the examples that have multiple perturbations. The model accuracy on the different perturbations is reported in the table below.
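Because examples with several perturbations are counted once per strategy, the reported shares sum to more than 100% (49 + 31 + 24 + 10 = 114). A minimal sketch of such a tally follows; the function name `strategy_shares` and the four sample examples are hypothetical and are not drawn from the MATRES annotations.

```python
from collections import Counter
from typing import Dict, Sequence


def strategy_shares(examples: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Share of examples using each strategy; multi-strategy examples count for each."""
    if not examples:
        raise ValueError("need at least one example")
    counts = Counter(s for strategies in examples for s in set(strategies))
    return {s: c / len(examples) for s, c in counts.items()}


# Hypothetical annotations: the third example uses two strategies at once.
annotations = [
    ["appearance order"],
    ["tense"],
    ["appearance order", "tense"],
    ["temporal conjunction words"],
]
shares = strategy_shares(annotations)
print(shares, sum(shares.values()))
```

With these four examples the shares add up to 1.25 rather than 1.0, mirroring how the percentages above exceed 100%.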

B.4 Syntactic Parsing

The annotators perturbed noun phrases adjacent to prepositions (leaving the preposition unchanged). For example, The clerics demanded talks with local US commanders $\to$ The clerics demanded talks with great urgency. The different semantic content of the noun phrase changes the syntactic path from the preposition with to the parent word of the parent of the preposition; in the initial example, the parent is commanders and the grandparent is the noun talks; in the perturbed version, the grandparent is now the verb demanded.

Model

We use a biaffine parser following the architecture of Dozat and Manning (2017) with ELMo embeddings Peters et al. (2018), trained on the combination of the training sets for the treebanks that we drew test examples from (GUM, EWT, LinES, and ParTUT).

Contrast Set Statistics

One annotator created 150 perturbed examples that form 150 contrast sets. 75 of the contrast sets consist of a sentence in which a prepositional phrase attaches to a verb, paired with an altered version where it attaches to a noun instead. The other 75 sentences were altered in the opposite direction.

Analysis

The process of creating a perturbation for a syntactic parse is highly time-consuming. Only a small fraction of sentences in the test set could be altered in the desired way, even after filtering to find relevant syntactic structures and eliminate unambiguous prepositions (e.g. of always attaches to a noun modifying a noun, making it impossible to change the attachment without changing the preposition). Further, once a potentially ambiguous sentence was identified, annotators had to come up with an alternative noun phrase that sounded natural and did not require extensive changes to the structure of the sentence. They then had to re-annotate the relevant section of the sentence, which could include new POS tags, new UD word features, and new arc labels. On average, each perturbation took 10–15 minutes. Expanding the scope of this augmented dataset to cover other syntactic features, such as adjective scope, apposition versus conjunction, and other forms of clausal attachment, would allow for a significantly larger dataset but would require a large amount of annotator time. The very poor contrast consistency on our dataset (17.3%) suggests that this would be a worthwhile investment to create a more rigorous parsing evaluation.

Notably, the model’s accuracy for predicting the target prepositions’ grandparents in the original, unaltered tree (64.7%) is significantly lower than the model’s accuracy for grandparents of all words (78.41%) and for grandparents of all prepositions (78.95%) in the original data. This indicates that these structures are already difficult for the parser due to structural ambiguity.

B.5 PERSPECTRUM

The annotators perturbed examples in multiple steps. First, they created non-trivial negations of the claim, e.g., Should we live in space? $\to$ Should we drop the ambition to live in space?. Next, they labeled the perturbed claim with respect to each perspective. For example:

Claim: Should we live in space?
Perspective: Humanity in many ways defines itself through exploration and space is the next logical frontier.
Label: True

Claim: Should we drop the ambition to live in space?
Perspective: Humanity in many ways defines itself through exploration and space is the next logical frontier.
Label: False

Model

We use a RoBERTa model Liu et al. (2019) finetuned on PERSPECTRUM following the training process from Chen et al. (2019).

Contrast Set Statistics

The annotators created 217 perturbed instances that form 217 contrast sets. Each example took approximately three minutes to annotate: one minute for an annotator to negate each claim and one minute each for two separate annotators to adjudicate stance labels for each contrastive claim-perspective pair.

B.6 DROP

See Section 3 in the main text for details about our perturbation strategies.

Model

We use MTMSN Hu et al. (2019), a DROP question answering model that is built on top of BERT Large Devlin et al. (2019).

Contrast Set Statistics

The augmented test set contains 947 examples in total, which form 623 contrast sets. Three annotators used approximately 16 hours to construct and validate the dataset.

Analysis

We bucket 100 of the perturbed instances into the three categories of perturbations described in Section 3. For each subset, we evaluate MTMSN’s performance and show the results in the table below.
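The bucketed evaluation amounts to grouping instances by perturbation category and scoring each group separately. Below is a minimal Python sketch of that grouping. The category names are placeholders, the rows are toy data, and the normalized exact-match score is a simplified stand-in for the official DROP metrics, so none of it reflects the paper's actual results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def exact_match_by_category(
    rows: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """Per-category exact match over (category, gold, prediction) rows.

    Answers are compared after stripping whitespace and lowercasing,
    which is a simplification of the official DROP scoring.
    """
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for category, gold, pred in rows:
        totals[category] += 1
        hits[category] += int(gold.strip().lower() == pred.strip().lower())
    return {c: hits[c] / totals[c] for c in totals}


# Placeholder categories and toy answers (hypothetical, for illustration).
rows = [
    ("category_a", "3", "3"),
    ("category_a", "7", "5"),
    ("category_b", "Bears", "bears"),
    ("category_c", "1990", "1991"),
]
scores = exact_match_by_category(rows)
```

Reporting one score per category, rather than a single pooled number, shows which kinds of perturbation the model handles worst.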

B.7 Quoref

We use the following perturbation strategies for Quoref:

Perturb questions whose answers are entities to instead make the answers a property of those entities, e.g., Who hides their identity … $\to$ What is the nationality of the person who hides their identity ….

Perturb questions to add compositionality, e.g., What is the name of the person … $\to$ What is the name of the father of the person ….

Add sentences between referring expressions and antecedents to the context paragraphs.

Replace antecedents with less frequent named entities of the same type in the context paragraphs.

Model

We use XLNet-QA, the best model from Dasigi et al. (2019), which is a span extraction model built on top of XLNet (Yang et al., 2019).

Contrast Set Statistics

Four annotators created 700 instances that form 415 contrast sets. The mean contrast set size (including the original example) is $2.7$ ($\pm 1.2$). The annotators used approximately 35 hours to construct and validate the dataset.
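As a quick sanity check, the reported mean set size follows from the two counts if we assume that the 700 instances are all perturbations and that each contrast set contains exactly one original example. The Python sketch below works through that arithmetic; the standard deviation cannot be recovered from these totals.

```python
# Counts taken from the statistics above.
perturbed_instances = 700
contrast_sets = 415

# Assumption: each contrast set contributes exactly one original example,
# so the total number of examples is perturbations plus one per set.
total_examples = perturbed_instances + contrast_sets
mean_set_size = total_examples / contrast_sets

print(f"mean contrast set size: {mean_set_size:.1f}")
```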

B.8 ROPES

We use the following perturbation strategies for ROPES:

Perturbing the background to have the opposite causes and effects or qualitative relation, e.g., Gibberellins are hormones that cause the plant to grow $\to$ Gibberellins are hormones that cause the plant to stop growing.

Perturbing the situation to associate different entities with different instantiations of a certain cause or effect. For example, Grey tree frogs live in wooded areas and are difficult to see when on tree trunks. Green tree frogs live in wetlands with lots of grass and tall plants. $\to$ Grey tree frogs live in wetlands areas and are difficult to see when on stormy days in the plants. Green tree frogs live in wetlands with lots of leaves to hide on.

Perturbing the situation to have more complex reasoning steps, e.g., Sue put 2 cubes of sugar into her tea. Ann decided to use granulated sugar and added the same amount of sugar to her tea. $\to$ Sue has 2 cubes of sugar but Ann has the same amount of granulated sugar. They exchange the sugar to each other and put the sugar to their ice tea.

Perturbing the questions to have presuppositions that match the situation and background.

Model

We use the best model from Lin et al. (2019), which is a span extraction model built on top of a RoBERTa model (Liu et al., 2019) that is first finetuned on RACE (Lai et al., 2017).

Contrast Set Statistics

Two annotators created 974 perturbed instances that form 974 contrast sets. The annotators used approximately 65 hours to construct and validate the dataset.

B.9 BoolQ

We use a diverse set of perturbations, including adjective, entity, and event changes. We show three representative examples below:

Paragraph: The Fate of the Furious premiered in Berlin on April 4, 2017, and was theatrically released in the United States on April 14, 2017, playing in 3D, IMAX 3D and 4DX internationally…A spinoff film starring Johnson and Statham’s characters is scheduled for release in August 2019, while the ninth and tenth films are scheduled for releases on the years 2020 and 2021.
Question: Is “Fate and the Furious” the last movie?
Answer: False
New Question: Is “Fate and the Furious” the first of multiple movies?
New Answer: True
Perturbation Strategy: Adjective Change

Paragraph: Sanders played football primarily at cornerback, but also as a kick returner, punt returner, and occasionally wide receiver…An outfielder in baseball, he played professionally for the New York Yankees, the Atlanta Braves, the Cincinnati Reds and the San Francisco Giants, and participated in the 1992 World Series with the Braves.
Question: Did Deion Sanders ever win a world series?
Answer: False
New Question: Did Deion Sanders ever play in a world series?
New Answer: True
Perturbation Strategy: Event Change

Paragraph: The White House is the official residence and workplace of the President of the United States. It is located at 1600 Pennsylvania Avenue NW in Washington, D.C. and has been the residence of every U.S. President since John Adams in 1800. The term is often used as a metonym for the president and his advisers.
Question: Does the president live in the White House?
Answer: True
New Question: Did George Washington live in the White House?
New Answer: False
Perturbation Strategy: Entity Change

Model

We use RoBERTa base and follow the standard finetuning process from Liu et al. (2019).

Contrast Set Statistics

The annotators created 339 perturbed questions that form 70 contrast sets. One annotator created the dataset and a separate annotator verified it. This entire process took approximately 16 hours.

B.10 MC-TACO

The main goal when perturbing MC-TACO questions is to retain a similar question that requires the same temporal knowledge to answer, while adding constraints or slightly changing the related context so that the answers change. We also modified the candidate answers accordingly to ensure that each question has a combination of plausible and implausible candidates.

Model

We use the best baseline model from the original paper (Zhou et al., 2019), which is based on $\textsc{RoBERTa}_{base}$ (Liu et al., 2019).

Contrast Set Statistics

The annotators created 646 perturbed question-answer pairs that form 646 contrast sets. Two annotators used approximately 12 hours to construct and validate the dataset.