Learning the Difference that Makes a Difference with Counterfactually-Augmented Data

Divyansh Kaushik, Eduard Hovy, Zachary C. Lipton

Introduction

What makes a document’s sentiment positive? What makes a loan applicant creditworthy? What makes a job candidate qualified? When does a photograph truly depict a dolphin? Moreover, what does it mean for a feature to be relevant to such a determination?

Statistical learning offers one framework for approaching these questions. First, we swap out the semantic question for a more readily answerable associative question. For example, instead of asking what conveys a document’s sentiment, we recast the question as which documents are likely to be labeled as positive (or negative)? Then, in this associative framing, we interpret as relevant, those features that are most predictive of the label. However, despite the rapid adoption and undeniable commercial success of associative learning, this framing seems unsatisfying.

Alongside deep learning’s predictive wins, critical questions have piled up concerning spurious patterns, artifacts, robustness, and discrimination, that the purely associative perspective appears ill-equipped to answer. For example, in computer vision, researchers have found that deep neural networks rely on surface-level texture (Jo & Bengio, 2017; Geirhos et al., 2018) or clues in the image’s background to recognize foreground objects even when that seems both unnecessary and somehow wrong: the beach is not what makes a seagull a seagull. And yet, researchers struggle to articulate precisely why models should not rely on such patterns.

In natural language processing (NLP), these issues have emerged as central concerns in the literature on annotation artifacts and societal biases. Across myriad tasks, researchers have demonstrated that models tend to rely on spurious associations (Poliak et al., 2018; Gururangan et al., 2018; Kaushik & Lipton, 2018; Kiritchenko & Mohammad, 2018). Notably, some models for question-answering tasks may not actually be sensitive to the choice of the question (Kaushik & Lipton, 2018), while in Natural Language Inference (NLI), classifiers trained on hypotheses only (vs hypotheses and premises) perform surprisingly well (Poliak et al., 2018; Gururangan et al., 2018). However, papers seldom make clear what, if anything, spuriousness means within the standard supervised learning framework. ML systems are trained to exploit the mutual information between features and a label to make accurate predictions. The standard statistical learning toolkit does not offer a conceptual distinction between spurious and non-spurious associations.

Causality, however, offers a coherent notion of spuriousness. Spurious associations owe to confounding rather than to a (direct or indirect) causal path. We might consider a factor of variation to be spuriously correlated with a label of interest if intervening upon it would not impact the applicability of the label or vice versa. While our paper does not call upon the mathematical machinery of causality, we draw inspiration from the underlying philosophy to design a new dataset creation procedure in which humans counterfactually revise documents.

Returning to NLP, although we lack automated tools for mapping between raw text and disentangled factors, we nevertheless describe documents in terms of these abstract representations. Moreover, it seems natural to speak of manipulating these factors directly (Hovy, 1987). Consider, for example, the following interventions: (i) Revise the letter to make it more positive; (ii) Edit the second sentence so that it appears to contradict the first. These edits might be thought of as intervening on only those aspects of the text that are necessary to make the counterfactual label applicable.

In this exploratory paper, we design a human-in-the-loop system for counterfactually manipulating documents. Our hope is that by intervening only upon the factor of interest, we might disentangle the spurious and non-spurious associations, yielding classifiers that hold up better when spurious associations do not transport out of domain. We employ crowd workers not to label documents, but rather to edit them, manipulating the text to make a targeted (counterfactual) class applicable. For sentiment analysis, we direct the worker to revise this negative movie review to make it positive, without making any gratuitous changes. We might regard the second part of this directive as a least action principle, ensuring that we perturb only those spans necessary to alter the applicability of the label. For NLI, a $3$ -class classification task (entailment, contradiction, neutral), we ask the workers to modify the premise while keeping the hypothesis intact, and vice versa, collecting edits corresponding to each of the (two) counterfactual classes. Using this platform, we collect thousands of counterfactually-manipulated examples for both sentiment analysis and NLI, extending the IMDb (Maas et al., 2011) and SNLI (Bowman et al., 2015) datasets, respectively. The result is two new datasets (each an extension of a standard resource) that enable us to both probe fundamental properties of language and train classifiers less reliant on spurious signal.

We show that classifiers trained on original IMDb reviews fail on counterfactually-revised data and vice versa. We further show that spurious correlations in these datasets are picked up even by linear models. However, augmenting the training data with the revised examples breaks up these correlations (e.g., genre ceases to be predictive of sentiment). For a Bidirectional LSTM (Graves & Schmidhuber, 2005) trained on IMDb reviews, classification accuracy goes down from $79.3\%$ to $55.7\%$ when evaluated on original vs revised reviews. The same classifier trained on revised reviews achieves an accuracy of $89.1\%$ on revised reviews compared to $62.5\%$ on their original counterparts. These numbers go to $81.7\%$ and $92.0\%$ on original and revised data, respectively, when the classifier is retrained on the combined dataset. Similar patterns are observed for linear classifiers. We find that BERT (Devlin et al., 2019) is more resilient to such drops in performance on sentiment analysis.

Additionally, SNLI models appear to rely on spurious associations as identified by Gururangan et al. (2018). Our experiments show that when fine-tuned on original SNLI sentence pairs, BERT fails on pairs with revised premise and vice versa, suffering more than a $30$ point drop in accuracy. Fine-tuned on the combined set, BERT’s performance improves significantly across all datasets. Similarly, a Bi-LSTM trained on (original) hypotheses alone correctly classifies $69\%$ of pairs but performs worse than the blind classifier when evaluated on the revised dataset. When trained on hypotheses only from the combined dataset, its performance is not appreciably better than random guessing.

Related Work

Several papers demonstrate cases where NLP systems appear not to learn what humans consider to be the difference that makes the difference. For example, otherwise state-of-the-art models have been shown to be vulnerable to synthetic transformations such as distractor phrases (Jia & Liang, 2017; Wallace et al., 2019), to misclassify paraphrased inputs (Iyyer et al., 2018; Pfeiffer et al., 2019), and to fail on template-based modifications (Ribeiro et al., 2018). Glockner et al. (2018) demonstrate that simply replacing words by synonyms or hypernyms, which should not alter the applicable label, nevertheless breaks ML-based NLI systems. Gururangan et al. (2018) and Poliak et al. (2018) show that classifiers given the hypotheses alone correctly classify about $69\%$ of the SNLI corpus. They further discover that crowd workers adopted specific annotation strategies and heuristics for data generation. Chen et al. (2016) identify similar issues with automatically-constructed benchmarks for question-answering (Hermann et al., 2015). Kaushik & Lipton (2018) discover that reported numbers in question-answering benchmarks could often be achieved by the same models when restricted to be blind either to the question or to the passages. Dixon et al. (2018), Zhao et al. (2018), and Kiritchenko & Mohammad (2018) show how imbalances in training data lead to unintended bias in the resulting models, and, consequently, potentially unfair applications. Shen et al. (2018) substitute words to test the behavior of sentiment analysis algorithms in the presence of stylistic variation, finding that similar word pairs produce significant differences in sentiment score.

Several papers explore richer feedback mechanisms for classification. Some ask annotators to highlight rationales, spans of text indicative of the label (Zaidan et al., 2007; Zaidan & Eisner, 2008; Poulis & Dasgupta, 2017). For each document, Zaidan et al. remove the rationales to generate contrast documents, learning classifiers to distinguish original documents from their contrasting counterparts. While this feedback is easier to collect than ours, how to leverage it for training deep NLP models, where features are not neatly separated, remains less clear.

Lu et al. (2018) programmatically alter text to invert gender bias and combine the original and manipulated data, yielding a gender-balanced dataset for learning word embeddings. In the simplest experiments, they swap each gendered word for its other-gendered counterpart. For example, "the doctor ran because he is late" becomes "the doctor ran because she is late". However, they do not substitute names even if they co-refer to a gendered pronoun. Building on their work, Zmigrod et al. (2019) describe a data augmentation approach for mitigating gender stereotypes associated with animate nouns for morphologically-rich languages like Spanish and Hebrew. They use a Markov random field to infer how the sentence must be modified while altering the grammatical gender of particular nouns to preserve morpho-syntactic agreement. In contrast, Maudslay et al. (2019) describe a method for probabilistic automatic in-place substitution of gendered words in a corpus. Unlike Lu et al., they propose an explicit treatment of first names by pre-defining name-pairs for swapping, thus expanding Lu et al.’s list of gendered word pairs significantly.

Data Collection

We use Amazon’s Mechanical Turk crowdsourcing platform to recruit editors to revise each document. To ensure high quality of the collected data, we restricted the pool to U.S. residents that had already completed at least $500$ HITs and had an over $97\%$ HIT approval rate. For each HIT, we conducted pilot tests to identify appropriate compensation per assignment, receive feedback from workers and revise our instructions accordingly. A total of $713$ workers contributed throughout the whole process, of which $518$ contributed edits reflected in the final datasets.

Sentiment Analysis The original IMDb dataset consists of $50k$ reviews divided equally across train and test splits. To keep the task of editing from growing unwieldy, we filter out the longest $20\%$ of reviews, leaving $20k$ reviews in the train split from which we randomly sample $2.5k$ reviews, enforcing a $50$:$50$ class balance. Following revision by the crowd workers, we partition this dataset into train/validation/test splits containing $1707$, $245$ and $488$ examples, respectively. We present each review to two workers, instructing them to revise the review such that (a) the counterfactual label applies; (b) the document remains coherent; and (c) no unnecessary modifications are made.
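The filtering and sampling step can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the function name `sample_balanced`, the whitespace-token length measure, and the `"pos"`/`"neg"` label strings are all assumptions introduced here.

```python
import random


def sample_balanced(reviews, n_total, drop_longest_frac=0.2, seed=0):
    """Drop the longest reviews, then draw a class-balanced random sample.

    reviews: list of (text, label) pairs with label in {"pos", "neg"}.
    Length is measured in whitespace tokens (an assumption of this sketch).
    """
    # Sort by length and discard the longest fraction of reviews.
    ranked = sorted(reviews, key=lambda r: len(r[0].split()))
    n_drop = int(len(ranked) * drop_longest_frac)
    kept = ranked[: len(ranked) - n_drop]

    # Sample the same number of reviews from each class.
    rng = random.Random(seed)
    per_class = n_total // 2
    sample = []
    for label in ("pos", "neg"):
        pool = [r for r in kept if r[1] == label]
        sample.extend(rng.sample(pool, per_class))
    rng.shuffle(sample)
    return sample
```

With the sizes reported in the text, `n_total` would be 2500 and the input would be the 25k-review train split.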

Over a four-week period, we manually inspected each generated review and rejected the ones that were outright wrong (the sentiment was unchanged or the review was spam). After review, we rejected roughly $2\%$ of revised reviews. For $60$ original reviews, we did not approve any among the counterfactually-revised counterparts supplied by the workers. To construct the new dataset, we chose one revised review (at random) corresponding to each original review. In qualitative analysis, we identified eight common patterns among the edits (Table 2).

By comparing original reviews to their counterfactually-revised counterparts we gain insight into which aspects are causally relevant. To analyze inter-editor agreement, we mark indices corresponding to replacements and insertions, representing the edits in each original review by a binary vector. Using these representations, we compute the Jaccard similarity between the two reviews (Table 1), finding it to be negatively correlated with the length of the review.
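As a rough sketch of this agreement measure, the snippet below marks the token positions in the original review that each editor replaced or inserted at, and compares the two sets of positions (equivalent to binary vectors) with Jaccard similarity. The token-level alignment via Python's `difflib` and the helper names are assumptions of this sketch; the text does not specify how edits were aligned.

```python
import difflib


def edited_indices(original, revised):
    """Positions in `original` (whitespace tokens) touched by a replacement or insertion."""
    a, b = original.split(), revised.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    indices = set()
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "replace":
            indices.update(range(i1, i2))  # every replaced token
        elif tag == "insert":
            indices.add(i1)  # position at which new text was inserted
    return indices


def jaccard(s1, s2):
    """Jaccard similarity of two index sets; defined as 1.0 when both are empty."""
    if not s1 and not s2:
        return 1.0
    return len(s1 & s2) / len(s1 | s2)


original = "the movie was terrible and boring"
editor_1 = "the movie was wonderful and boring"
editor_2 = "the movie was wonderful and exciting"
agreement = jaccard(edited_indices(original, editor_1),
                    edited_indices(original, editor_2))
```

In this invented example the first editor changes one position and the second changes that position plus one more, so the two edit sets share one of two marked positions.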

Natural Language Inference Unlike sentiment analysis, SNLI is a $3$-way classification task, with inputs consisting of two sentences, a premise and a hypothesis, and the three possible labels being entailment, contradiction, and neutral. The label is meant to describe the relationship between the facts stated in each sentence. We randomly sampled $1750$, $250$, and $500$ pairs from the train, validation, and test sets of SNLI respectively, constraining the new data to have balanced classes. In one HIT, we asked workers to revise the hypothesis while keeping the premise intact, seeking edits corresponding to each of the two counterfactual classes. We refer to this data as Revised Hypothesis (RH). In another HIT, we asked workers to revise the original premise, while leaving the original hypothesis intact, seeking similar edits, calling it Revised Premise (RP).

Following data collection, we employed a different set of workers to verify whether the given label accurately described the relationship between each premise-hypothesis pair. We presented each pair to three workers and performed a majority vote. When all three reviewers were in agreement, we approved or rejected the pair based on their decision; otherwise, we verified the data ourselves. Finally, we only kept premise-hypothesis pairs for which we had valid revised data in both RP and RH, corresponding to both counterfactual labels. As a result, we discarded $\approx 9\%$ of the data. RP and RH each comprise $3332$ pairs in train, $400$ in validation, and $800$ in test, leading to a total of $6664$ pairs in train, $800$ in validation, and $1600$ in test in the revised dataset. In qualitative analysis, we identified some common patterns among hypothesis and premise edits (Tables 3 and 4).
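The verification and filtering logic described above might be sketched as follows. The sketch follows the unanimity rule stated in the text (unanimous votes decide, split votes go to manual review); the function names, the boolean vote encoding, and the data layout are hypothetical.

```python
LABELS = ("entailment", "contradiction", "neutral")


def triage(votes):
    """Map three verifier votes (True = the label fits the pair) to a decision."""
    if len(votes) != 3:
        raise ValueError("expected exactly three votes")
    if all(votes):
        return "approve"
    if not any(votes):
        return "reject"
    return "manual"  # split vote: deferred to manual verification


def complete_originals(original_labels, approved):
    """Ids of originals with approved revisions for both counterfactual
    labels in both RP and RH.

    original_labels: dict mapping pair id -> original label
    approved: set of (pair_id, kind, target_label), kind in {"RP", "RH"}
    """
    kept = []
    for pair_id, label in original_labels.items():
        targets = [lab for lab in LABELS if lab != label]
        if all((pair_id, kind, t) in approved
               for kind in ("RP", "RH") for t in targets):
            kept.append(pair_id)
    return kept
```

Under this rule, each retained original contributes exactly two RP pairs and two RH pairs, which is why RP and RH end up the same size.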

We collected all data after IRB approval and measured the time taken to complete each HIT to ensure that all workers were paid more than the federal minimum wage. During our pilot studies, workers spent roughly $5$ minutes per revised review, and $4$ minutes per revised sentence (for NLI). We paid workers $\$0.65$ per revision, and $\$0.15$ per verification, totalling $\$10778.14$ for the study.
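As a quick sanity check on these figures, the implied hourly rates can be worked out from the pilot-study timings. The sketch assumes that the per-revision payment applies to both tasks, that the rough pilot timings are representative, and that the relevant threshold is the U.S. federal minimum wage of $7.25 per hour; none of these is stated explicitly in the text.

```python
from decimal import Decimal

# Assumed U.S. federal minimum wage in USD per hour (not given in the text).
FEDERAL_MIN_WAGE = Decimal("7.25")


def hourly_rate(pay_per_task, minutes_per_task):
    """Implied hourly pay for a task with a fixed payment and duration."""
    return pay_per_task * 60 / minutes_per_task


pay_per_revision = Decimal("0.65")
review_rate = hourly_rate(pay_per_revision, 5)    # sentiment: ~5 minutes per review
sentence_rate = hourly_rate(pay_per_revision, 4)  # NLI: ~4 minutes per sentence

print(f"reviews: ${review_rate}/hour, NLI sentences: ${sentence_rate}/hour")
```

Under these assumptions both implied rates come out above the threshold.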

Models

Our experiments rely on the following five models: Support Vector Machines (SVMs), Naïve Bayes (NB) classifiers, Bidirectional Long Short-Term Memory Networks (Bi-LSTMs; Graves & Schmidhuber, 2005), ELMo models with LSTM, and fine-tuned BERT models (Devlin et al., 2019). For brevity, we discuss only implementation details necessary for reproducibility.

Standard Methods We use scikit-learn (Pedregosa et al., 2011) implementations of SVMs and Naïve Bayes for sentiment analysis. We train these models on TF-IDF bag of words feature representations of the reviews. We identify parameters for both classifiers using grid search conducted over the validation set.

Experimental Results

We find that for sentiment analysis, linear models trained on the original $1.7k$ reviews achieve $80\%$ accuracy when evaluated on original reviews but only $51\%$ (level of random guessing) on revised reviews (Table 5). Linear models trained on revised reviews achieve $91\%$ accuracy on revised reviews but only $58.3\%$ on the original test set. We see a similar pattern for Bi-LSTMs, where accuracy drops substantially in both directions. Interestingly, while BERT models suffer drops too, they are less pronounced, perhaps a benefit of the exposure to a larger dataset where the spurious patterns may not have held. Classifiers trained on combined datasets perform well on both, often within $\approx 3$ pts of models trained on the same amount of data taken only from the original distribution. Thus, there may be a price to pay for breaking the reliance on spurious associations, but it may not be substantial.

We also conduct experiments to evaluate how well our sentiment models generalize out of domain. We evaluate models on Amazon reviews (Ni et al., 2019), aggregated over six genres (beauty, fashion, appliances, giftcards, magazines, and software); the Twitter sentiment dataset (Rosenthal et al., 2017), for which we use the development set because the test data is not public; and Yelp reviews released as part of the Yelp dataset challenge. We show that in almost all cases, models trained on the counterfactually-augmented IMDb dataset perform better than models trained on comparable quantities of original data.

To gain intuition about what is learnable absent the edited spans, we tried training several models on passages where the edited spans have been removed from training set sentences (but not from the test set). SVM, Naïve Bayes, and Bi-LSTM achieve $57.8\%$, $59.1\%$, and $60.2\%$ accuracy, respectively, on this task. Notably, these passages are predictive of the (true) label despite being semantically compatible with the counterfactual label. However, BERT performs worse than random guessing.

In one simple demonstration of the benefits of our approach, we note that seemingly irrelevant words such as romantic, will, my, has, especially, life, works, both, it, its, lives and gives (correlated with positive sentiment), and horror, own, jesus, cannot, even, instead, minutes, your, effort, script, seems and something (correlated with negative sentiment) are picked up as high-weight features by linear models trained on either original or revised reviews. However, because humans never edit these during revision owing to their lack of semantic relevance, combining the original and revised datasets breaks these associations and these terms cease to be predictive of sentiment (Fig 4). Models trained on original data but at the same scale as combined data are able to perform slightly better on the original test set but still fail on the revised reviews. All models trained on $19k$ original reviews receive a slight boost in accuracy on revised data (except Naïve Bayes), yet their performance is significantly worse compared to specialized models. Retraining models on a combination of the original $19k$ reviews with revised $1.7k$ reviews leads to significant increases in accuracy for all models on classifying revised reviews, while slightly improving the accuracy on classifying the original reviews. This underscores the importance of including counterfactually-revised examples in training data.

Natural Language Inference Fine-tuned on $1.67k$ original sentence pairs, BERT achieves $72.2\%$ accuracy on the SNLI dataset but correctly classifies only $39.7\%$ of sentence pairs from the RP set (Table 7). Fine-tuning BERT on the full SNLI training set ($500k$ sentence pairs) results in similar behavior. Fine-tuning it on RP sentence pairs improves its accuracy to $66.3\%$ on RP but causes a drop of roughly $20$ pts on SNLI. Fine-tuning on RH sentence pairs results in an accuracy of $67\%$ on RH and $71.9\%$ on the SNLI test set but $47.4\%$ on the RP set. To put these numbers in context, each individual hypothesis sentence in RP is associated with two labels, each in the presence of a different premise. A model that relies on hypotheses only would at best perform slightly better than choosing the majority class when evaluated on this dataset. However, fine-tuning BERT on a combination of RP and RH leads to consistent performance on all datasets, as the dataset design forces models to look at both premise and hypothesis. Combining original sentences with RP and RH improves these numbers even further. We compare this with the performance obtained by fine-tuning it on $8.3k$ sentence pairs sampled from the SNLI training set, and show that while the two perform roughly within $4$ pts of each other when evaluated on SNLI, the former outperforms the latter on both RP and RH.

To further isolate this effect, we note that a Bi-LSTM trained on SNLI hypotheses only achieves $69\%$ accuracy on the SNLI test set, which drops to $44\%$ if it is retrained on a combination of original, RP and RH data (Table 8). Note that this combined dataset consists of five variants of each original premise-hypothesis pair. Of these five pairs, three consist of the same hypothesis sentence, each associated with a different truth value given the respective premise. Using these hypotheses only would provide conflicting feedback to a classifier during training, thus causing the drop in performance. Further, we notice that the gain of the latter over the majority class baseline comes primarily from the original data, as the same model retrained only on RP and RH data experiences a further drop of $11.6\%$ in accuracy, performing worse than just choosing the majority class at all times.
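To make the conflicting-feedback argument concrete, the sketch below computes the best accuracy that any predictor seeing only the hypothesis could reach on a given set of pairs: for each distinct hypothesis it can do no better than guess that hypothesis's most frequent label. The function name and the toy sentences are invented for illustration and are not drawn from the datasets.

```python
from collections import Counter, defaultdict


def hypothesis_only_ceiling(examples):
    """Best in-sample accuracy of any predictor that sees only the hypothesis.

    examples: iterable of (premise, hypothesis, label) triples.
    """
    label_counts = defaultdict(Counter)
    for _premise, hypothesis, label in examples:
        label_counts[hypothesis][label] += 1
    total = sum(sum(c.values()) for c in label_counts.values())
    best = sum(max(c.values()) for c in label_counts.values())
    return best / total


# Five variants of one toy pair: the original, two RP pairs, two RH pairs.
original = [("A man plays guitar.", "A man makes music.", "entailment")]
rp = [  # revised premise, original hypothesis kept intact
    ("A man sits silently.", "A man makes music.", "contradiction"),
    ("A man holds a guitar.", "A man makes music.", "neutral"),
]
rh = [  # original premise kept intact, revised hypothesis
    ("A man plays guitar.", "A man is asleep.", "contradiction"),
    ("A man plays guitar.", "A man plays for a crowd.", "neutral"),
]

ceiling_rp = hypothesis_only_ceiling(rp)
ceiling_combined = hypothesis_only_ceiling(original + rp + rh)
```

In the combined toy set, three of the five pairs share one hypothesis with three different labels, so a hypothesis-only predictor can get at most one of those three right.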

One reasonable concern might be that our models would simply distinguish whether an example were from the original or revised dataset and thereafter treat them differently. The fear might be that our models would exhibit a hypersensitivity (rather than insensitivity) to domain. To test the potential for this behavior, we train several models to distinguish between original and revised data (Table 9). BERT distinguishes original reviews from revised reviews with $77.3\%$ accuracy. In the case of NLI, BERT and Naïve Bayes perform roughly within $3$ pts of the majority class baseline ($66.7\%$), whereas SVM performs substantially worse.

Conclusion

By leveraging humans not only to provide labels but also to intervene upon the data, revising documents to accord with various labels, we can elucidate the difference that makes a difference. Moreover, we can leverage the augmented data to train classifiers less dependent on spurious associations. Our study demonstrates the promise of leveraging human-in-the-loop feedback to disentangle the spurious and non-spurious associations, yielding classifiers that hold up better when spurious associations do not transport out of domain. Our methods appear useful on both sentiment analysis and NLI, two contrasting tasks. In sentiment analysis, expressions of opinion matter more than stated facts, while in NLI this is reversed. SNLI poses another challenge in that it is a $3$ -class classification task using two input sentences. In future work, we will extend these techniques, leveraging humans in the loop to build more robust systems for question answering and summarization.

Acknowledgements

The authors are grateful to Amazon AWS and NVIDIA for providing GPUs to conduct the experiments, Salesforce Research and Facebook AI for their generous grants that made the data collection possible, Sina Fazelpour, Sivaraman Balakrishnan, Shruti Rijhwani, Shruti Palaskar, Aishwarya Kamath, Michael Collins, Rajesh Ranganath and Sanjoy Dasgupta for their valuable feedback, and Tzu-Hsiang Lin for his generous help in creating the data collection platform. We also thank Abridge AI, UPMC, the Center for Machine Learning in Health, and the AI Ethics and Governance Fund for their support of our broader research on robust machine learning.