Hypothesis Only Baselines in Natural Language Inference

Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme

Introduction

Though datasets for the task of Natural Language Inference (NLI) may vary in just about every aspect (size, construction, genre, label classes), they generally share a common structure: each instance consists of two fragments of natural language text (a context, also known as a premise, and a hypothesis), and a label indicating the entailment relation between the two fragments (e.g., entailment, neutral, contradiction). Computationally, the task of NLI is to predict an entailment relation label (output) given a premise-hypothesis pair (input), i.e., to determine whether the truth of the hypothesis follows from the truth of the premise Dagan et al. (2006, 2013).

When these NLI datasets are constructed to facilitate the training and evaluation of natural language understanding (NLU) systems Nangia et al. (2017), it is tempting to claim that systems achieving high accuracy on such datasets have successfully “understood” natural language or at least a logical relationship between a premise and hypothesis. While this paper does not attempt to prescribe the sufficient conditions of such a claim, we argue for an obvious necessary, or at least desired, condition: that interesting natural language inference should depend on both premise and hypothesis. In other words, a baseline system with access only to hypotheses (Figure 1b) can be said to perform NLI only in the sense that it is understanding language based on prior background knowledge. If this background knowledge is about the world, this may be justifiable as an aspect of natural language understanding, if not in keeping with the spirit of NLI. But if the “background knowledge” consists of learned statistical irregularities in the data, this may not be ideal. Here we explore the question: do NLI datasets contain statistical irregularities that allow hypothesis-only models to outperform the dataset-specific prior?

We present the results of a hypothesis-only baseline across ten NLI-style datasets and advocate for its inclusion in future dataset reports. We find that this baseline can perform above the majority-class prior across most of the ten examined datasets. We examine whether: (1) hypotheses contain statistical irregularities within each entailment class that are “giveaways” to a well-trained hypothesis-only model, (2) the way in which an NLI dataset is constructed is related to how prone it is to this particular weakness, and (3) the majority baselines might not be as indicative of “the difficulty of the task” Bowman et al. (2015) as previously thought.

We are not the first to consider the inherent difficulty of NLI datasets. For example, MacCartney (2009) used a simple bag-of-words model to evaluate early iterations of Recognizing Textual Entailment (RTE) challenge sets (MacCartney (2009), Ch. 2.2: “the RTE1 test suite is the hardest, while the RTE2 test suite is roughly 4% easier, and the RTE3 test suite is roughly 9% easier”). Concerns have been raised previously about the hypotheses in the Stanford Natural Language Inference (SNLI) dataset specifically, such as by Rudinger et al. (2017) and in unpublished work (a course project constituting independent discovery of our observations on SNLI: https://leonidk.com/pdfs/cs224u.pdf). Here, we survey a large number of existing NLI datasets under the lens of a hypothesis-only model. Our code and data can be found at https://github.com/azpoliak/hypothesis-only-NLI. Concurrently, Tsuchiya (2018) and Gururangan et al. (2018) similarly trained an NLI classifier with access limited to hypotheses and discovered similar results on three of the ten datasets that we study.

Motivation

Our approach is inspired by recent studies that show how biases in an NLU dataset allow models to perform well on the task without understanding the meaning of the text. In the Story Cloze task Mostafazadeh et al. (2016, 2017), a model is presented with a short four-sentence narrative and asked to complete it by choosing one of two suggested concluding sentences. While the task is presented as a new common-sense reasoning framework, Schwartz et al. (2017b) achieved state-of-the-art performance by ignoring the narrative and training a linear classifier with features related to the writing style of the two potential endings, rather than their content. It has also been shown that features focusing on sentence length, sentiment, and negation are sufficient for achieving high accuracy on this dataset Schwartz et al. (2017a); Cai et al. (2017); Bugert et al. (2017).

NLI is often viewed as an integral part of NLU. Condoravdi et al. (2003) argue that it is a necessary metric for evaluating an NLU system, since it forces a model to perform many distinct types of reasoning. Goldberg (2017) suggests that “solving [NLI] perfectly entails human level understanding of language”, and Nangia et al. (2017) argue that “in order for a system to perform well at natural language inference, it needs to handle nearly the full complexity of natural language understanding.” However, if biases in NLI datasets, especially those that do not reflect commonsense knowledge, allow models to achieve high levels of performance without needing to reason about hypotheses based on corresponding contexts, our current datasets may fall short of these goals.

Methodology

We modify Conneau et al. (2017)’s InferSent method to train a neural model to classify just the hypotheses. We choose InferSent because it performed competitively with the best-scoring systems on the Stanford Natural Language Inference (SNLI) dataset Bowman et al. (2015), while being representative of the types of neural architectures commonly used for NLI tasks. InferSent uses a BiLSTM encoder, and constructs a sentence representation by max-pooling over its hidden states. This sentence representation of a hypothesis is used as input to an MLP classifier to predict the NLI tag.

We preprocess each recast dataset using the NLTK tokenizer Loper and Bird (2002). Following Conneau et al. (2017), we map the resulting tokens to 300-dimensional GloVe vectors Pennington et al. (2014) trained on 840 billion tokens from the Common Crawl, using the GloVe OOV vector for unknown words. We optimize via SGD, with an initial learning rate of $0.1$ and a decay rate of $0.99$. We allow at most $20$ epochs of training with optional early stopping according to the following policy: when the accuracy on the development set decreases, we divide the learning rate by $5$, and we stop training when the learning rate falls below $10^{-5}$.
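The training policy above can be sketched as a small simulation. This is a minimal illustration rather than the training code itself: the function name `simulate_lr_schedule` is hypothetical, the decay is assumed to be applied once per epoch, and a "decrease" in development accuracy is assumed to mean "no improvement over the best accuracy seen so far".

```python
def simulate_lr_schedule(dev_accuracies, lr=0.1, decay=0.99, shrink=5.0,
                         min_lr=1e-5, max_epochs=20):
    """Return the learning rate used in each epoch that actually runs.

    dev_accuracies: development-set accuracy observed after each epoch.
    Assumptions (not confirmed by the text): decay is applied per epoch,
    and the rate is shrunk whenever accuracy fails to beat the best so far.
    """
    rates_used = []
    best_acc = float("-inf")
    for acc in dev_accuracies[:max_epochs]:
        rates_used.append(lr)        # rate used to train this epoch
        if acc > best_acc:
            best_acc = acc           # improvement: keep going
        else:
            lr /= shrink             # dev accuracy dropped: divide by 5
            if lr < min_lr:
                break                # early stopping
        lr *= decay                  # per-epoch multiplicative decay
    return rates_used
```

With steadily improving development accuracy the loop simply runs for the full 20 epochs; with stagnating accuracy the repeated division by 5 pushes the rate below $10^{-5}$ and training stops early.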

Datasets

We collect ten NLI datasets and categorize them into three distinct groups based on the methods by which they were constructed. Table 1 summarizes the different NLI datasets that our investigation considers.

In cases where humans were given a context and asked to generate a corresponding hypothesis and label, we consider these datasets to be elicited. Although we consider only two such datasets, they are the largest datasets included in our study and are currently popular amongst researchers. The elicited NLI datasets we look at are:

Stanford Natural Language Inference (SNLI) To create SNLI, Bowman et al. (2015) showed crowdsourced workers a premise sentence (sourced from Flickr image captions), and asked them to generate a corresponding hypothesis sentence for each of the three labels (entailment, neutral, contradiction). SNLI is known to contain stereotypical biases based on gender, race, and ethnicity Rudinger et al. (2017). Furthermore, Zhang et al. (2017) commented that such “elicitation protocols can lead to biased responses unlikely to contain a wide range of possible common-sense inferences.”

Multi-NLI Multi-NLI is a recent expansion of SNLI that aims to add greater diversity to the existing dataset Williams et al. (2017). Premises in Multi-NLI can originate from fictional stories, personal letters, telephone speech, and a 9/11 report.

Human Judged

Alternatively, if hypotheses and premises were automatically paired but labeled by a human, we consider the dataset to be judged. Our human-judged data sets are:

Sentences Involving Compositional Knowledge (SICK) To evaluate how well compositional distributional semantic models handle “challenging phenomena”, Marelli et al. (2014) introduced SICK, which used rules to expand or normalize existing premises to create more difficult examples. Workers were asked to label the relatedness of these resulting pairs, and these labels were then converted into the same three-way label space as SNLI and Multi-NLI.

Add-one RTE This mixed-genre dataset tests whether NLI systems can understand adjective-noun compounds Pavlick and Callison-Burch (2016). Premise sentences were extracted from Annotated Gigaword Napoles et al. (2012), image captions Young et al. (2014), the Internet Argument Corpus Walker et al. (2012), and fictional stories from the GutenTag dataset Mac Kim and Cassidy (2015). To create hypotheses, adjectives were removed or inserted before nouns in a premise, and crowd-sourced workers were asked to provide reliable labels (entailed, not-entailed).

SciTail Recently released, SciTail is an NLI dataset created from 4th-grade science questions and multiple-choice answers Khot et al. (2018). Hypotheses are assertions converted from question-answer pairs found in SciQ Welbl et al. (2017). Hypotheses are automatically paired with premise sentences from domain-specific texts Clark et al. (2016), and labeled (entailment, neutral) by crowdsourced workers. Notably, the construction method allows for the same sentence to appear as a hypothesis for more than one premise.

Multiple Premise Entailment (MPE) Unlike the other datasets we consider, the premises in MPE Lai et al. (2017) are not single sentences, but four different captions that describe the same image in the FLICKR30K dataset Plummer et al. (2015). Hypotheses were generated by simplifying either a fifth caption that describes the same image or a caption corresponding to a different image, and given the standard 3-way tags. Each hypothesis has at most a 50% overlap with the words in its corresponding premise. Since the hypotheses are still just one sentence, our hypothesis-only baseline can easily be applied to MPE.

Johns Hopkins Ordinal Common-Sense Inference (JOCI) JOCI labels context-hypothesis instances on an ordinal scale from impossible ($1$) to very likely ($5$) Zhang et al. (2017). In JOCI, context (premise) sentences were taken from existing NLU datasets: SNLI, ROC Stories Mostafazadeh et al. (2016), and COPA Roemmele et al. (2011). Hypotheses were created automatically by systems trained to generate entailed facts from a premise. We only consider the hypotheses generated by either a seq2seq model or from external world knowledge. Crowd-sourced workers labeled the likelihood of the hypothesis following from the premise on an ordinal scale. We convert these into $3$-way NLI tags, where 1 maps to contradiction, 2-4 map to neutral, and 5 maps to entailment. Converting the annotations into a $3$-way classification problem allows us to limit the range of the number of NLI label classes in our investigation.
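The ordinal-to-categorical conversion described above can be written as a short function. This is only an illustrative sketch; the function name `joci_to_nli` and the label strings are assumptions made here.

```python
def joci_to_nli(score: int) -> str:
    """Map a JOCI ordinal score (1 = impossible ... 5 = very likely)
    to a 3-way NLI tag: 1 -> contradiction, 2-4 -> neutral, 5 -> entailment."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"JOCI scores range from 1 to 5, got {score!r}")
    if score == 1:
        return "contradiction"
    if score == 5:
        return "entailment"
    return "neutral"
```

Note that three of the five ordinal values collapse into neutral under this mapping.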

Automatically Recast

If an NLI dataset was automatically generated from existing datasets for other NLP tasks, and sentence pairs were constructed and labeled with minimal human intervention, we refer to such a dataset as recast. We use the recast datasets from White et al. (2017):

Semantic Proto-Roles (SPR) Inspired by Dowty (1991)’s thematic role theory, Reisinger et al. (2015) introduced the Semantic Proto-Role (SPR) labeling task, which can be viewed as decomposing semantic roles into finer-grained properties, such as whether a predicate’s argument was likely aware of the given predicated situation. 2-way labeled NLI sentence pairs were generated from SPR annotations by creating general templates.

Definite Pronoun Resolution (DPR) The DPR dataset targets an NLI model’s ability to perform anaphora resolution Rahman and Ng (2012). In the original dataset, sentences contain two entities and one pronoun, and the task is to link the pronoun to its referent. In the recast version, the premises are the original sentences and the hypotheses are the same sentences with the pronoun replaced with its correct (entailed) and incorrect (not-entailed) referent. For example, People raise dogs because they are obedient (context) and People raise dogs because dogs are obedient (hypothesis) form such a context-hypothesis pair. We note that this mechanism would appear to maximally benefit a hypothesis-only approach, as the hypothesis semantically subsumes the context.

FrameNet Plus (FN+) Using paraphrases from PPDB Ganitkevitch et al. (2013), Rastogi and Van Durme (2014) automatically replaced words with their paraphrases. Subsequently, Pavlick et al. (2015) asked crowd-source workers to judge how well a sentence with a paraphrase preserved the original sentence’s meanings. In this NLI dataset that targets a model’s ability to perform paraphrastic inference, premise sentences are the original sentences, the hypotheses are the edited versions, and the crowd-source judgments are converted to 2-way NLI-labels. For not-entailed examples, White et al. (2017) replaced a single token in a context sentence with a word that crowd-source workers labeled as not being a paraphrase of the token in the given context. In turn, we might suppose that positive entailments (b) are keeping in the spirit of NLI, but not-entailed examples might not because there are adequacy (c) and fluency (d) issues (in these examples, (a) is the corresponding context):

a. That is the way the system works
b. That is the way the framework works
c. That is the road the system works
d. That is the way the system creations

Results

Our goal is to determine whether a hypothesis-only model outperforms the majority baseline and investigate what may cause significant gains. In such cases a hypothesis-only model should be used as a stronger baseline instead of the majority class baseline. For all experiments except for JOCI, we use each NLI dataset’s standard train, dev, and test splits (JOCI was not released with such splits, so we randomly split the dataset into such a partition with 80:10:10 ratios). Table 2 compares the hypothesis-only model’s accuracy with the majority baseline on each dataset’s dev and test set (we only report results on the Multi-NLI development set since the test labels are only accessible on Kaggle).
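For concreteness, the random 80:10:10 partition and the majority-class baseline can be sketched as follows. The function names and the fixed seed are hypothetical, and taking the majority label from the training portion is an assumption made for this sketch rather than a detail stated in the text.

```python
import random
from collections import Counter


def split_80_10_10(examples, seed=0):
    """Shuffle the examples and cut them into train/dev/test portions."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = int(0.8 * len(shuffled))
    n_dev = int(0.1 * len(shuffled))
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]
    return train, dev, test


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(label == majority_label for label in eval_labels)
    return hits / len(eval_labels)
```

A hypothesis-only model is informative as a baseline only to the extent that its accuracy exceeds the value returned by `majority_baseline_accuracy` on the same evaluation set.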

Across six of the ten datasets, our hypothesis-only model significantly outperforms the majority baseline, even outperforming the best reported results on one dataset, recast SPR. This indicates that there exists a significant degree of exploitable signal that may help NLI models perform well on their corresponding test set without considering NLI contexts. From Table 2, it is unclear whether the construction method is responsible for these improvements. The largest relative gains are on human-elicited datasets, where the hypothesis-only model more than doubles the majority baseline.

However, there are no obvious unifying trends across these datasets. Among the judged and recast datasets, where humans do not generate the NLI hypothesis, we observe lower performance margins between majority and hypothesis-only models compared to the elicited datasets, although the baseline performances of these models are noticeably larger than on SNLI and Multi-NLI. The drop between SNLI and Multi-NLI suggests that by including multiple genres, an NLI dataset may contain fewer biases. Still, adding genres might not be enough to mitigate biases, as the hypothesis-only model still drastically outperforms the majority baseline. Therefore, we believe that models tested on SNLI and Multi-NLI should include a baseline version of the model that only accesses hypotheses.

We do not observe general trends across the datasets based on their construction methodology. On three of the five human judged datasets, the hypothesis-only model defaults to labeling each instance with the majority class tag. We find the same behavior in one recast dataset (DPR). However, across both these categories we find smaller relative improvements than on SNLI and Multi-NLI. These results suggest the existence of exploitable signal in the datasets that is unrelated to NLI contexts. Our focus now shifts to identifying precisely what these signals might be and understanding why they may appear in NLI hypotheses.

Statistical Irregularities

We are interested in determining what characteristics in the datasets may be responsible for the hypothesis-only model often outperforming the majority baseline. Here, we investigate the importance of specific words, grammaticality, and lexical semantics.

Since words in hypotheses have a distribution over the class of labels, we can determine the conditional probability of a label $l$ given the word $w$ by

$$p(l|w) = \frac{\mathrm{count}(w, l)}{\mathrm{count}(w)}$$

If $p(l|w)$ is highly skewed across labels, there exists the potential for a predictive bias. Consequently, such words may be “give-aways” that allow the hypothesis model to correctly predict an NLI label without considering the context.
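The quantity $p(l|w)$ can be estimated directly from co-occurrence counts over the hypotheses. The sketch below is one possible way to do so; the function name is hypothetical, and both the lowercased whitespace tokenization and the choice to count each word at most once per hypothesis are simplifying assumptions.

```python
from collections import Counter, defaultdict


def label_given_word(hypotheses, labels):
    """Estimate p(l | w) as count(w, l) / count(w) over hypothesis sentences."""
    word_count = Counter()                   # count(w)
    word_label_count = defaultdict(Counter)  # count(w, l)
    for hypothesis, label in zip(hypotheses, labels):
        # Assumption: whitespace tokens, each word counted once per hypothesis.
        for word in set(hypothesis.lower().split()):
            word_count[word] += 1
            word_label_count[word][label] += 1
    return {
        word: {label: n / word_count[word]
               for label, n in word_label_count[word].items()}
        for word in word_count
    }
```

A word whose distribution puts nearly all of its mass on one label is a candidate “give-away”, whereas a word spread evenly across labels carries little signal on its own.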

If a single occurrence of a highly label-specific word would allow a sentence to be deterministically classified, how many sentences in a dataset are prone to being trivially labeled? The plots in Figure 2 answer this question for SNLI and DPR. The $Y$-value where $X=1.0$ captures the number of such sentences. Other values of $X<1.0$ can also have strong correlative effects, but a priori the relationship between the value of $X$ and the coverage of trivially answerable instances in the data is unclear. We illustrate this relationship for varying values of $p(l|w)$. When $X=0$, all words are considered highly correlated with a specific class label, and thus the entire dataset would be treated as trivially answerable.

DPR has two class labels, so the uncertainty of a label is highest when $p(l|w)=0.5$. The sharp drop as $X$ moves away from this value indicates a weaker effect: proportionally fewer sentences contain highly label-specific words than in SNLI. As SNLI uses 3-way classification, we see a gradual decline from 0.33.
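The relationship between the threshold $X$ and the share of trivially answerable sentences can be sketched as a coverage function. This is an illustrative reading, not necessarily the exact procedure behind Figure 2: the function name is hypothetical, tokenization is by whitespace, and a sentence is assumed to be covered when at least one of its words $w$ has $p(l|w) \geq X$ for the sentence's own label $l$.

```python
from collections import Counter


def coverage_at_threshold(hypotheses, labels, threshold):
    """Fraction of hypotheses containing a word w with p(label | w) >= threshold."""
    tokenized = [set(h.lower().split()) for h in hypotheses]
    word_count = Counter()   # count(w)
    pair_count = Counter()   # count(w, l)
    for words, label in zip(tokenized, labels):
        for word in words:
            word_count[word] += 1
            pair_count[(word, label)] += 1
    covered = 0
    for words, label in zip(tokenized, labels):
        # Covered if any word is at least `threshold`-specific to this label.
        if any(pair_count[(w, label)] / word_count[w] >= threshold for w in words):
            covered += 1
    return covered / len(hypotheses)
```

Evaluating this function over a grid of thresholds between 0 and 1 yields one curve per dataset. Because a word seen only once is trivially label-specific, a minimum-frequency filter may be advisable on real data.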

What are “Give-away” Words?

Now that we have analyzed the extent to which highly label-correlated words may exist across sentences in a given label, we would like to understand what these words are and why they exist.

Figure 3 reports some of the words with the highest $p(l|w)$ for SNLI, a human elicited dataset, and MPE, a human judged dataset, on which our hypothesis model performed identically to the majority baseline. Because many of the most discriminative words are low frequency, we report only words which occur at least five times. We rank the words according to their overall frequency, since this statistic is perhaps more indicative of a word $w$ ’s effect on overall performance compared to $p(l|w)$ alone.
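The selection procedure described above (a minimum frequency of five, high $p(l|w)$, ranking by overall frequency) can be sketched as follows. The function name and the `top_k` cutoff are hypothetical, and selecting the `top_k` most label-specific words before ordering them by frequency is one possible reading of the procedure.

```python
from collections import Counter


def giveaway_words(hypotheses, labels, target_label, min_count=5, top_k=10):
    """Return (word, p(target_label | word), count) triples, most frequent first."""
    word_count = Counter()
    target_count = Counter()
    for hypothesis, label in zip(hypotheses, labels):
        for word in set(hypothesis.lower().split()):
            word_count[word] += 1
            if label == target_label:
                target_count[word] += 1
    # Drop rare words, whose probabilities are unreliable.
    scored = [(w, target_count[w] / n, n)
              for w, n in word_count.items() if n >= min_count]
    # Keep the most label-specific words, then order them by frequency.
    most_specific = sorted(scored, key=lambda t: t[1], reverse=True)[:top_k]
    return sorted(most_specific, key=lambda t: t[2], reverse=True)
```

Ordering by frequency puts first the words that can affect the most test instances, even when a rarer word has a slightly higher $p(l|w)$.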

The scores $p(l|w)$ of the words shown for SNLI deviate strongly from a uniform distribution, regardless of the label. In contrast, in MPE, scores are much closer to a uniform distribution of $p(l|w)$ across labels. Intuitively, the stronger the word’s deviation, the stronger the potential for it to be a “give-away” word. A high word frequency indicates a greater potential of the word to affect the overall accuracy on NLI.

Turning our attention to the qualities of the words themselves, we can easily identify trends among the words used in contradictory hypotheses in SNLI. In our top-10 list, for example, three words refer to the act of sleeping. Upon inspecting corresponding context sentences, we find that many contexts, which are sourced from Flickr, naturally deal with activities. This leads us to believe that as a common strategy, crowd-source workers often do not generate contradictory hypotheses that require fine-grained semantic reasoning, as a majority of such activities can be easily negated by removing an agent’s agency, i.e., describing the agent as sleeping. A second trend we notice is that universal negation constitutes four of the remaining seven terms in this list (“Nobody”, “alone”, “no”, and “empty”), and may also be used to similar effect. The human-elicited protocol neither guides nor incentivizes crowd-source workers to come up with less obvious examples. If not properly controlled, elicited datasets may be prone to many label-specific terms. The existence of label-specific terms in human-elicited NLI datasets does not invalidate the datasets, nor is it surprising. Studies in eliciting norming data are prone to repeated responses across subjects McRae et al. (2005) (see discussion in §2 of Zhang et al. (2017)).

On the Role of Grammaticality

Like MPE, FN+ contains few high frequency words with high $p(l|w)$ . However, unlike on MPE, our hypothesis-only model outperforms the majority-only baseline. If these gains do not arise from “give-away” words, then what is the statistical irregularity responsible for this discriminative power?

Upon further inspection, we notice an interesting imbalance in how our model performs for each of the two classes. The hypothesis-only model performs similarly to the majority baseline for entailed examples, while improving by over 34% on those that are not entailed, as shown in Table 3.

As shown by White et al. (2017) and noticed by Poliak et al. (2018), FN+ contains more grammatical errors than the other recast datasets. We explore whether grammaticality could be the statistical irregularity exploited in this case. We manually sample a total of $200$ FN+ sentences and categorize them based on their gold label and our model’s prediction. Out of $50$ sentences that the model correctly labeled as entailed, $88\%$ were grammatical. On the other hand, of the $50$ hypotheses incorrectly labeled as entailed, only $38\%$ were grammatical. Similarly, of the $50$ hypotheses that the model correctly labeled as not-entailed, only $20\%$ were grammatical, compared with $68\%$ of those incorrectly labeled as not-entailed. This suggests that a hypothesis-only model may be able to discover the correlation between grammaticality and NLI labels on this dataset.

Lexical Semantics

A survey of gains (Table 4) in the SPR dataset suggests that a number of its property-driven hypotheses, such as X was sentient in [the event], can be accurately guessed based on lexical semantics (background knowledge learned from training) of the argument. For example, the hypothesis-only baseline correctly predicts the truth of hypotheses in the dev set such as Experts were sentient … or Mr. Falls was sentient …, and the falsity of The campaign was sentient, while failing on referring expressions like Some or Each side. A model exploiting regularities of the real world would seem to be a different category of dataset bias: while not strictly wrong from the perspective of NLU, one should be aware of what the hypothesis-only baseline is capable of, to recognize those cases where access to the context is required and therefore more interesting under NLI.

Open Questions

There may remain statistical irregularities that we leave for future work to explore. For example, is there a correlation between sentence length and label class in these datasets? Is there a particular construction method that minimizes the number of “give-away” words present in the dataset? Lastly, our study is another in a line of research that looks for irregularities at the word level MacCartney et al. (2008); MacCartney (2009). Beyond bag-of-words, are there multi-word expressions or syntactic phenomena that might encode label biases?

Related Work

In NLI datasets, non-semantic linguistic features have been used to improve NLI models. Vanderwende and Dolan (2006) and Blake (2007) demonstrate how sentence structure alone can provide a high signal for NLI. Instead of using external sources of knowledge, which was a common trend at the time, Blake (2007) improved results on RTE by combining syntactic features. More recently, Bar-Haim et al. (2015) introduce an inference formalism based on syntactic-parse trees.

World Knowledge and NLI

As mentioned earlier, hypothesis-only models that perform without exploiting statistical irregularities may be performing NLI only in the sense that they are understanding language based on prior background knowledge. Here, we take the approach that interesting NLI should depend on both premise and hypothesis. Prior work in NLI reflects this approach. For example, Glickman and Dagan (2005) argue that “the notion of textual entailment is relevant only” for hypotheses that are not world facts, e.g., “Paris is the capital of France.” Glickman et al. (2005a, b) introduce a probabilistic framework for NLI where the premise entails a hypothesis if, and only if, the probability of the hypothesis being true increases as a result of the premise.
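In symbols, writing $t$ for the premise text and $h$ for the hypothesis (notation adopted here for illustration and not necessarily that of the cited work), the probabilistic condition just described can be sketched as:

$$t \text{ entails } h \iff P(h \mid t) > P(h)$$

Under this reading, a hypothesis that is already a known world fact has $P(h)$ close to $1$, which leaves little room for any premise to raise it. This is consistent with the view that such hypotheses fall outside the scope of interesting NLI.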

NLI’s resurgence

Starting in the mid-2000s, multiple community-wide shared tasks focused on NLI, then commonly referred to as RTE, i.e., recognizing textual entailment. Starting with Dagan et al. (2006), there have been eight iterations of the PASCAL RTE challenge, with the most recent being Dzikovska et al. (2013) (technically, Bentivogli et al. (2011) was the last challenge under PASCAL’s aegis, but Dzikovska et al. (2013) was branded as the 8th RTE challenge). NLI datasets were relatively small, ranging from thousands to tens of thousands of labeled sentence pairs. In turn, NLI models often used alignment-based techniques MacCartney et al. (2008) or manually engineered features Androutsopoulos and Malakasiotis (2010). Bowman et al. (2015) sparked a renewed interest in NLI, particularly among deep-learning researchers. By developing and releasing a large NLI dataset containing over 550K examples, Bowman et al. (2015) enabled the community to successfully apply deep learning models to the NLI problem.

Conclusion

We introduced a stronger baseline for ten NLI datasets. Our baseline reduces the task from labeling the relationship between two sentences to classifying a single hypothesis sentence. Our experiments demonstrated that in six of the ten datasets, always predicting the majority-class label is not a strong baseline, as it is significantly outperformed by the hypothesis-only model. Our analysis suggests that statistical irregularities, including word choice and grammaticality, may reduce the difficulty of the task on popular NLI datasets by not fully testing how well a model can determine whether the truth of a hypothesis follows from the truth of a corresponding premise.

We hope our findings will encourage the development of new NLI datasets that exhibit fewer exploitable irregularities and that encourage the development of richer models of inference. As a baseline, new NLI models should be compared against a corresponding version that only accesses hypotheses. In future work, we plan to apply a similar hypothesis-only baseline to multi-modal tasks that attempt to challenge a system to understand and classify the relationship between two inputs, e.g., Visual QA Antol et al. (2015).

Acknowledgements

This work was supported by Johns Hopkins University, the Human Language Technology Center of Excellence (HLTCOE), DARPA LORELEI, and the NSF Graduate Research Fellowships Program (GRFP). We would also like to thank three anonymous reviewers for their feedback. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsements of DARPA or the U.S. Government.