AdvEntuRe: Adversarial Training for Textual Entailment with Knowledge-Guided Examples

Dongyeop Kang, Tushar Khot, Ashish Sabharwal, Eduard Hovy

Introduction

The impressive success of machine learning models on large natural language datasets often does not carry over to moderate training data regimes, where models struggle with infrequently observed patterns and simple adversarial variations. A prominent example of this phenomenon is textual entailment, the fundamental task of deciding whether a premise text entails ($\vDash$) a hypothesis text. On certain datasets, recent deep learning entailment systems Parikh et al. (2016); Wang et al. (2017); Gong et al. (2018) have achieved close to human-level performance. Nevertheless, the problem is far from solved, as evidenced by how easy it is to generate minor adversarial examples that break even the best systems. As Table 1 illustrates, a state-of-the-art neural system for this task, namely the Decomposable Attention Model Parikh et al. (2016), fails when faced with simple linguistic phenomena such as negation, or a re-ordering of words. This is not unique to a particular model or task. Minor adversarial examples have also been found to easily break neural systems on other linguistic tasks such as reading comprehension Jia and Liang (2017).

A key contributor to this brittleness is the use of specific datasets such as SNLI Bowman et al. (2015) and SQuAD Rajpurkar et al. (2016) to drive model development. While large and challenging, these datasets also tend to be homogeneous. E.g., SNLI was created by asking crowd-source workers to generate entailing sentences, which then tend to have limited linguistic variations and annotation artifacts Gururangan et al. (2018). Consequently, models overfit to sufficiently repetitive patterns—and sometimes idiosyncrasies—in the datasets they are trained on. They fail to cover long-tail and rare patterns in the training distribution, or linguistic phenomena such as negation that would be obvious to a layperson.

To address this challenge, we propose to train textual entailment models more robustly using adversarial examples generated in two ways: (a) by incorporating knowledge from large linguistic resources, and (b) using a sequence-to-sequence neural model in a GAN-style framework.

The motivation stems from the following observation. While deep-learning based textual entailment models lead the pack, they generally do not incorporate intuitive rules such as negation, and ignore large-scale linguistic resources such as PPDB Ganitkevitch et al. (2013) and WordNet Miller (1995). These resources could help them generalize beyond specific words observed during training. For instance, while the SNLI dataset contains the pattern two men $\vDash$ people, it does not contain the analogous pattern two dogs $\vDash$ animals found easily in WordNet.

Effectively integrating simple rules or linguistic resources in a deep learning model, however, is challenging. Doing so directly by substantially adapting the model architecture Sha et al. (2016); Chen et al. (2018) can be cumbersome and limiting. Incorporating such knowledge indirectly via modified word embeddings Faruqui et al. (2015); Mrkšić et al. (2016), as we show, can have little positive impact and can even be detrimental.

Our proposed method, which is task-specific but model-independent, is inspired by data-augmentation techniques. We generate new training examples by applying knowledge-guided rules, via only a handful of rule templates, to the original training examples. Simultaneously, we also use a sequence-to-sequence or seq2seq model for each entailment class to generate new hypotheses from a given premise, adaptively creating new adversarial examples. These can be used with any entailment model without constraining model architecture.

We also introduce the first approach to train a robust entailment model using a Generative Adversarial Network or GAN Goodfellow et al. (2014) style framework. We iteratively improve both the entailment system (the discriminator) and the differentiable part of the data-augmenter (specifically the neural generator), by training the generator based on the discriminator’s performance on the generated examples. Importantly, unlike the typical use of GANs to create a strong generator, we use it as a mechanism to create a strong and robust discriminator.

Our new entailment system, called AdvEntuRe, demonstrates that in the moderate data regime, adversarial iterative data-augmentation via only a handful of linguistic rule templates can be surprisingly powerful. Specifically, we observe a 4.7% accuracy improvement on the challenging SciTail dataset Khot et al. (2018) and a 2.8% improvement on 10K-50K training subsets of SNLI. An evaluation of our algorithm on the negation examples in the test set of SNLI reveals a 6.1% improvement from just a single rule.

Related Work

Adversarial example generation has recently received much attention in NLP. For example, Jia and Liang (2017) generate adversarial examples using manually defined templates for the SQuAD reading comprehension task. Glockner et al. (2018) create an adversarial dataset from SNLI by using WordNet knowledge. Automatic methods Iyyer et al. (2018) have also been proposed to generate adversarial examples through paraphrasing. These works reveal how neural network systems trained on a large corpus can easily break when faced with carefully designed unseen adversarial patterns at test time. Our motivation is different. We use adversarial examples at training time, in a data augmentation setting, to train a more robust entailment discriminator. The generator uses explicit knowledge or hand-written rules, and is trained in an end-to-end fashion along with the discriminator.

Incorporating external rules or linguistic resources in a deep learning model generally requires substantially adapting the model architecture Sha et al. (2016); Liang et al. (2017); Kang et al. (2017). This is a model-dependent approach, which can be cumbersome and constraining. Similarly, non-neural textual entailment models have been developed that incorporate knowledge bases. However, these also require model-specific engineering Raina et al. (2005); Haghighi et al. (2005); Silva et al. (2018).

An alternative is the model- and task-independent route of incorporating linguistic resources via word embeddings that are retro-fitted Faruqui et al. (2015) or counter-fitted Mrkšić et al. (2016) to such resources. We demonstrate, however, that this has little positive impact in our setting and can even be detrimental. Further, it is unclear how to incorporate knowledge sources into advanced representations such as contextual embeddings McCann et al. (2017); Peters et al. (2018). We thus focus on a task-specific but model-independent approach.

Logical rules have also been defined to label existing examples based on external resources Hu et al. (2016). Our focus here is on generating new training examples.

Our use of the GAN framework to create a better discriminator is related to CatGANs Wang and Zhang (2017) and TripleGANs Chongxuan et al. (2017) where the discriminator is trained to classify the original training image classes as well as a new ‘fake’ image class. We, on the other hand, generate examples belonging to the same classes as the training examples. Further, unlike the earlier focus on the vision domain, this is the first approach to train a discriminator using GANs for a natural language task with discrete outputs.

Adversarial Example Generation

We present three different techniques to create adversarial examples for textual entailment. Specifically, we show how external knowledge resources, hand-authored rules, and neural language generation models can be used to generate such examples. Before describing these generators in detail, we introduce the notation used henceforth.

The seven generators we use for experimentation are summarized in Table 2 and discussed in more detail subsequently. While these particular generators are simplistic and one can easily imagine more advanced ones, we show that training using adversarial examples created using even these simple generators leads to substantial accuracy improvement on two datasets.

Large knowledge-bases such as WordNet and PPDB contain lexical equivalences and other relationships highly relevant for entailment models. However, even large datasets such as SNLI generally do not contain most of these relationships in the training data. E.g., that two dogs entails animals isn’t captured in the SNLI data. We define simple generators based on lexical resources to create adversarial examples that capture the underlying knowledge. This allows models trained on these examples to learn these relationships.

As discussed earlier, there are different ways of incorporating such symbolic knowledge into neural models. Unlike task-agnostic ways of approaching this goal from a word embedding perspective Faruqui et al. (2015); Mrkšić et al. (2016) or the model-specific approach Sha et al. (2016); Chen et al. (2018), we use this knowledge to generate task-specific examples. This allows any entailment model to learn how to use these relationships in the context of the entailment task, helping them outperform the above task-agnostic alternative.

This idea is similar to Natural Logic Inference or NLI Lakoff (1970); Sommers (1982); Angeli and Manning (2014) where words in a sentence can be replaced by their hypernym/hyponym to produce entailing/neutral sentences, depending on their context. We propose a context-agnostic use of lexical resources that, despite its simplicity, already results in significant gains. We use three sources for generators:

WordNet

(Miller, 1995) is a large, hand-curated, semantic lexicon with synonymous words grouped into synsets. Synsets are connected by many semantic relations, from which we use hyponym and synonym relations to generate entailing sentences, and antonym relations to generate contradicting sentences. (A similar approach was used in a parallel work to generate an adversarial dataset from SNLI Glockner et al. (2018).) Given a relation $r(x,y)$, the (partial) transformation function $f_{\rho}$ is the POS-tag matched replacement of $x$ in $s$ with $y$, and requires the POS tag to be noun or verb. NLI provides a more robust way of using these relations based on context, which we leave for future work.

PPDB

(Ganitkevitch et al., 2013) is a large resource of lexical, phrasal, and syntactic paraphrases. We use 24,273 lexical paraphrases in their smallest set, PPDB-S Pavlick et al. (2015), as equivalence relations, $x \equiv y$. The (partial) transformation function $f_{\rho}$ for this generator is the POS-tag matched replacement of $x$ in $s$ with $y$, and the label $g_{\rho}$ is entails.

SICK

(Marelli et al., 2014) is a dataset with entailment examples of the form $(p,h,c)$, created to evaluate an entailment model’s ability to capture compositional knowledge via hand-authored rules. We use the 12,508 patterns of the form $c(x,y)$ extracted by Beltagy et al. (2016) by comparing sentences in this dataset, with the property that for each SICK example $(p,h,c)$, replacing (when applicable) $x$ with $y$ in $p$ produces $h$. For simplicity, we ignore positional information in these patterns. The (partial) transformation function $f_{\rho}$ is the replacement of $x$ in $s$ with $y$, and the label $g_{\rho}$ is $c$.
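The replacement-style generators described above can be illustrated with a minimal Python sketch. It assumes the sentence is already POS-tagged and uses a tiny hand-written relation list. The names `TOY_RELATIONS` and `generate`, the tag set, and the relations themselves are hypothetical stand-ins for WordNet, PPDB-S, or SICK patterns, not the authors' implementation.

```python
# Hypothetical toy relations (x, y, pos, label). Real relations would be
# drawn from WordNet, PPDB-S, or the SICK patterns.
TOY_RELATIONS = [
    ("dogs", "animals", "NOUN", "entails"),
    ("quickly", "fast", "ADV", "entails"),
]


def generate(sentence, relations, allowed_pos=("NOUN", "VERB")):
    """Apply the partial transformation: POS-matched replacement of x by y.

    `sentence` is a list of (token, pos) pairs. Returns a list of
    (new_sentence, label) pairs; the list is empty when no relation applies.
    Pass allowed_pos=None to drop the noun/verb restriction.
    """
    out = []
    for x, y, pos, label in relations:
        if allowed_pos is not None and pos not in allowed_pos:
            continue
        for i, (tok, tag) in enumerate(sentence):
            if tok.lower() == x and tag == pos:
                new_tokens = [t for t, _ in sentence]
                new_tokens[i] = y
                out.append((" ".join(new_tokens), label))
    return out


premise = [("Two", "NUM"), ("dogs", "NOUN"), ("run", "VERB"), ("quickly", "ADV")]
examples = generate(premise, TOY_RELATIONS)
all_pos_examples = generate(premise, TOY_RELATIONS, allowed_pos=None)
```

With the noun/verb restriction, only the "dogs" relation fires in this toy input. Like the generators in the text, the sketch is context-agnostic: it never checks whether the replacement is valid in the surrounding sentence.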

2 Hand-Defined Generators

Even very large entailment datasets have no or very few examples of certain otherwise common linguistic constructs such as negation (only 211 examples (2.11%) in the SNLI training set contain negation triggers such as not, n’t, etc.), causing models trained on them to struggle with these constructs. A simple model-agnostic way to alleviate this issue is via a negation example generator whose transformation function $f_{\rho}(s)$ is negate($s$), described below, and the label $g_{\rho}$ is contradicts.

negate($s$): If $s$ contains a ‘be’ verb (e.g., is, was), add a “not” after the verb. Otherwise, add “did not” or “do not” in front of the verb, based on its tense. E.g., change “A person is crossing” to “A person is not crossing” and “A person crossed” to “A person did not cross.” While many other rules could be added, we found that this single rule covered a majority of the cases. Verb tenses are also considered and changed accordingly (https://www.nodebox.net/code/index.php/Linguistics). Other functions such as dropping adverbial clauses or changing tenses could be defined in a similar manner.
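The negate rule can be sketched as follows. This is a minimal illustration, not the authors' code. The `TOY_VERBS` lexicon is a hypothetical stand-in for the linguistics library used for tense handling, and the "does" auxiliary for third-person present is an added assumption. Tokenization is plain whitespace splitting, so trailing punctuation is not handled.

```python
BE_VERBS = {"am", "is", "are", "was", "were"}

# Hypothetical mini-lexicon: inflected verb -> (base form, auxiliary).
# A real system would derive this from a morphology/tense library.
TOY_VERBS = {
    "crossed": ("cross", "did"),
    "crosses": ("cross", "does"),  # "does" is an assumption beyond the text
    "cross": ("cross", "do"),
    "played": ("play", "did"),
    "plays": ("play", "does"),
}


def negate(sentence):
    """Return a negated copy of `sentence`, or None if no rule applies."""
    tokens = sentence.split()
    # Rule 1: a 'be' verb is present, so insert "not" right after it.
    for i, tok in enumerate(tokens):
        if tok.lower() in BE_VERBS:
            return " ".join(tokens[: i + 1] + ["not"] + tokens[i + 1 :])
    # Rule 2: otherwise use "<aux> not <base form>" for the first known verb.
    for i, tok in enumerate(tokens):
        entry = TOY_VERBS.get(tok.lower())
        if entry is not None:
            base, aux = entry
            return " ".join(tokens[:i] + [aux, "not", base] + tokens[i + 1 :])
    return None
```

Returning `None` when neither rule fires mirrors the partial nature of the transformation function: sentences the rule cannot handle simply yield no generated example.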

Both the knowledge-guided and hand-defined generators make local changes to the sentences based on simple rules. It should be possible to extend the hand-defined rules to cover the long tail (as long as they are procedurally definable). However, a more scalable approach would be to extend our generators to trainable models that can cover a wider range of phenomena than hand-defined rules. Moreover, the applicability of these rules generally depends on the context, which can also be incorporated in such trainable generators.

3 Neural Generators

The loss function for training the seq2seq is:

4 Example Generation

The generators described above are used to create new entailment examples from the training data. For each example $(p,h,c)$ in the data, we can create two new examples: $\left(p,f_{\rho}(p),g_{\rho}\right)$ and $\left(h,f_{\rho}(h),g_{\rho}\right)$.

First, we consider the second-order example between the original premise and the transformed hypothesis: $(p,f_{\rho}(h),\bigoplus(c,g_{\rho}))$, where $\bigoplus$, defined in the left half of Table 3, composes the input example label $c$ (connecting $p$ and $h$) and the generated example label $g_{\rho}$ to produce a new label. For instance, if $p$ entails $h$ and $h$ entails $f_{\rho}(h)$, $p$ would entail $f_{\rho}(h)$. In other words, $\bigoplus(\sqsubseteq,\sqsubseteq)$ is $\sqsubseteq$. For example, composing (“A man is playing soccer”, “A man is playing a game”, $\sqsubseteq$) with a generated hypothesis $f_{\rho}(h)$: “A person is playing a game.” will give a new second-order entailment example: (“A man is playing soccer”, “A person is playing a game”, $\sqsubseteq$).

Second, we create an example from the generated premise to the original hypothesis: $(f_{\rho}(p),h,\bigotimes(g_{\rho},c))$. The composition function here, denoted $\bigotimes$ and defined in the right half of Table 3, is often undetermined. For example, if $p$ entails $f_{\rho}(p)$ and $p$ entails $h$, the relation between $f_{\rho}(p)$ and $h$ is undetermined, i.e., $\bigotimes(\sqsubseteq,\sqsubseteq)=?$. While this particular composition $\bigotimes$ often leads to undetermined or neutral relations, we use it here for completeness. For example, composing the previous example with a generated neutral premise, $f_{\rho}(p)$: “A person is wearing a cap” would generate an example (“A person is wearing a cap”, “A man is playing a game”, $\#$).

The composition function $\bigoplus$ is the same as the “join” operation in natural logic reasoning Icard III and Moss (2014), except for two differences: (a) relations that do not belong to our three entailment classes are mapped to ‘?’, and (b) the exclusivity/alternation relation is mapped to contradicts. The composition function $\bigotimes$, on the other hand, does not map to the join operation.
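The example-generation scheme of this section can be sketched in Python as below. The composition tables here contain only the two entries stated explicitly in the text; the full tables are in Table 3 and are not reproduced. The function name `compose_examples`, the dictionary encoding of the tables, the choice to skip missing entries, and the toy generator are all illustrative assumptions.

```python
# Only the entries spelled out in the text; Table 3 defines the rest.
OPLUS = {("entails", "entails"): "entails"}   # label for (p, f(h))
OTIMES = {("entails", "entails"): "?"}        # label for (f(p), h): undetermined


def compose_examples(p, h, c, f, g, oplus, otimes):
    """Create first- and second-order examples from (p, h, c).

    f is a partial transformation (returns None when not applicable) and
    g is the label of the generated pair (s, f(s)).
    Table entries that are missing are skipped rather than guessed.
    """
    examples = []
    fp, fh = f(p), f(h)
    if fp is not None:
        examples.append((p, fp, g))                 # first-order, from premise
        label = otimes.get((g, c))
        if label is not None:
            examples.append((fp, h, label))         # generated premise -> h
    if fh is not None:
        examples.append((h, fh, g))                 # first-order, from hypothesis
        label = oplus.get((c, g))
        if label is not None:
            examples.append((p, fh, label))         # p -> generated hypothesis
    return examples


def man_to_person(sentence):
    """Hypothetical entailing generator: replace the token 'man' by 'person'."""
    tokens = sentence.split()
    if "man" not in tokens:
        return None
    return " ".join("person" if t == "man" else t for t in tokens)


examples = compose_examples(
    "A man is playing soccer", "A man is playing a game", "entails",
    man_to_person, "entails", OPLUS, OTIMES,
)
```

In this toy run the second-order example from the text, (“A man is playing soccer”, “A person is playing a game”, entails), is among the four outputs. Whether examples labeled ‘?’ are kept or filtered afterwards is left to the caller.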

5 Implementation Details

To avoid this, we sub-sample our synthetic examples to ensure that they are proportional to the input examples $X$, specifically they are bounded to $\alpha|X|$ where $\alpha$ is tuned for each dataset. Also, as seen in Table 3, our knowledge-guided generators are more likely to generate neutral examples than any other class. To make sure that the labels are not skewed, we also sub-sample the examples to ensure that our generated examples have the same class distribution as the input batch. The SciTail dataset only contains two classes: entails mapped to $\sqsubseteq$ and neutral mapped to $\curlywedge$. As a result, generated examples that do not belong to these two classes are ignored.

The sub-sampling, however, has a negative side-effect where our generated examples end up using a small number of lexical relations from the large knowledge bases. On moderate datasets, this would cause the entailment model to potentially just memorize these few lexical relations. Hence, we generate new entailment examples for each mini-batch and update the model parameters based on the training+generated examples in this batch.
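The per-batch sub-sampling described above can be sketched as follows. This is a minimal illustration under stated assumptions: examples are `(premise, hypothesis, label)` tuples, per-class quotas are rounded down, and the function name `subsample_generated` is hypothetical. If a class has too few generated examples, the result falls short of its quota, so the class distribution is matched only approximately.

```python
import random
from collections import Counter, defaultdict


def subsample_generated(batch, generated, alpha, rng):
    """Pick at most alpha * |batch| generated examples that mirror the
    label distribution of the input batch.

    Generated examples whose label does not occur in the batch are ignored.
    """
    budget = int(alpha * len(batch))
    label_counts = Counter(label for _, _, label in batch)
    by_label = defaultdict(list)
    for example in generated:
        by_label[example[2]].append(example)

    chosen = []
    for label, count in label_counts.items():
        # Rounding down keeps the total within the alpha * |batch| bound.
        quota = int(budget * count / len(batch))
        pool = by_label.get(label, [])
        chosen.extend(rng.sample(pool, min(quota, len(pool))))
    return chosen


batch = [("p1", "h1", "entails"), ("p2", "h2", "entails"),
         ("p3", "h3", "neutral"), ("p4", "h4", "neutral")]
generated = ([("p", "g%d" % i, "neutral") for i in range(5)]
             + [("p", "e%d" % i, "entails") for i in range(3)]
             + [("p", "c%d" % i, "contradicts") for i in range(2)])
sampled = subsample_generated(batch, generated, alpha=1.0, rng=random.Random(0))
```

Calling this once per mini-batch, with fresh random draws, exposes the model to different lexical relations over the course of training instead of a fixed small subset.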

AdvEntuRe

where $L$ is the cross-entropy loss function between the true labels $Y$ and the predicted classes, and $\hat{\theta}$ are the learned parameters.

2 Generator Training

3 Adversarial Training

Experiments

Our empirical assessment focuses on two key questions: (a) Can a handful of rule templates improve a state-of-the-art entailment system, especially with moderate amounts of training data? (b) Can iterative GAN-style training lead to an improved discriminator?

To this end, we assess various models on the two entailment datasets mentioned earlier: SNLI (570K examples) and SciTail (27K examples). (SNLI has a 96.4%/1.7%/1.7% split and SciTail has an 87.3%/4.8%/7.8% split on train, valid, and test sets, respectively.) To test our hypothesis that adversarial example based training prevents overfitting in small to moderate training data regimes, we compare model accuracies on the test sets when using 1%, 10%, 50%, and 100% subsamples of the train and dev sets.

The ratio between the number of generated vs. original examples, $\alpha$, is empirically chosen to be 1.0 for SNLI and 0.5 for SciTail, based on validation set performance. Generally, very few generated examples (small $\alpha$) have little impact, while too many of them overwhelm the original dataset, resulting in worse scores (cf. Appendix for more details).

Table 4 summarizes the test set accuracies of the different models using various subsampling ratios for SNLI and SciTail training data.

2 Ablation Study

Interestingly, while PPDB (phrasal paraphrases) helps the most (+3.6%) on SNLI, simple negation rules help significantly (+8.2%) on the SciTail dataset. Since most entailment examples in SNLI are minor rewrites by Turkers, PPDB often contains these simple paraphrases. For SciTail, the sentences are authored independently, with limited gains from simple paraphrasing. However, a model trained on only 10% of the dataset (2.3K examples) would end up relying purely on word overlap. We believe that the simple negation examples introduce neutral examples with high lexical overlap, forcing the model to find a more informative signal.

3 Qualitative Results

Table 6 shows examples generated by various methods in AdvEntuRe. As shown, both seq2seq and rule based generators produce reasonable sentences according to classes and rules. As expected, seq2seq models trained on very few examples generate noisy sentences. The quality of our knowledge-guided generators, on the other hand, does not depend on the training set size and they still produce reliable sentences.

4 Case Study: Negation

For further analysis of the negation-based generator in Table 1, we collect only the negation examples in the test set of SNLI, henceforth referred to as nega-SNLI. Specifically, we extract examples where either the premise or the hypothesis contains “not”, “no”, “never”, or a word that ends with “n’t”. These do not cover more subtle ways of expressing negation such as “seldom” and the use of antonyms. nega-SNLI contains 201 examples with the following label distribution: 51 (25.4%) neutral, 42 (20.9%) entails, 108 (53.7%) contradicts. Table 7 shows examples in each category.
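The extraction criterion for nega-SNLI can be sketched as a simple token filter. This is an illustrative reading of the rule above, not the authors' script. Whitespace tokenization, lowercasing, and punctuation stripping are assumptions, as are the function names.

```python
NEGATION_WORDS = {"not", "no", "never"}


def has_negation(sentence):
    """True if any token is "not", "no", "never", or ends with "n't"."""
    for raw in sentence.lower().split():
        # Normalize curly apostrophes, then strip surrounding punctuation.
        tok = raw.replace("’", "'").strip(".,!?;:\"()")
        if tok in NEGATION_WORDS or tok.endswith("n't"):
            return True
    return False


def is_nega_example(premise, hypothesis):
    """An example qualifies if either side contains a negation trigger."""
    return has_negation(premise) or has_negation(hypothesis)


# Label counts reported in the text for nega-SNLI.
counts = {"neutral": 51, "entails": 42, "contradicts": 108}
shares = {k: round(100.0 * v / sum(counts.values()), 1) for k, v in counts.items()}
```

As the text notes, a filter of this kind misses subtler negation such as “seldom” or antonyms. The `shares` computation simply recomputes the reported percentages from the reported counts.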

Conclusion

We introduced an adversarial training architecture for textual entailment. Our seq2seq and knowledge-guided example generators, trained in an end-to-end fashion, can be used to make any base entailment model more robust. The effectiveness of this approach is demonstrated by the significant improvement it achieves on both SNLI and SciTail, especially in the low to medium data regimes. Our rule-based generators can be expanded to cover more patterns and phenomena, and the seq2seq generator extended to incorporate per-example loss for adversarial training.

References

Appendix A Rules and Examples

Appendix B Training data sizes

Appendix C Effectiveness of Z/X Ratio, $\alpha$

Appendix D Retrofitting Experiment

Table 9 shows the grid search results of retro-fitting vectors Faruqui et al. (2015) with different lexical resources. To obtain the strongest baseline, we choose the best performing vectors for each sub-sample ratio and each dataset. Usually, PPDB and WordNet are the two most useful resources for both SNLI and SciTail.

Appendix E In-Depth Analysis: D+R