A Decomposable Attention Model for Natural Language Inference

Ankur P. Parikh, Oscar Täckström, Dipanjan Das, Jakob Uszkoreit

Introduction

Natural language inference (NLI) refers to the problem of determining entailment and contradiction relationships between a premise and a hypothesis. NLI is a central problem in language understanding [Katz, 1972, Bos and Markert, 2005, van Benthem, 2008, MacCartney and Manning, 2009] and recently the large SNLI corpus of 570K sentence pairs was created for this task [Bowman et al., 2015]. We present a new model for NLI and leverage this corpus for comparison with prior work.

A large body of work based on neural networks for text similarity tasks including NLI has been published in recent years [Hu et al., 2014, Rocktäschel et al., 2016, Wang and Jiang, 2016, Yin et al., 2016, inter alia]. The dominating trend in these models is to build complex, deep text representation models, for example, with convolutional networks [LeCun et al., 1990, CNNs henceforth] or long short-term memory networks [Hochreiter and Schmidhuber, 1997, LSTMs henceforth] with the goal of deeper sentence comprehension. While these approaches have yielded impressive results, they are often computationally very expensive, and result in models having millions of parameters (excluding embeddings).

Here, we take a different approach, arguing that for natural language inference it can often suffice to simply align bits of local text substructure and then aggregate this information. For example, consider the following sentences:

Bob is in his room, but because of the thunder and lightning outside, he cannot sleep.

Bob is awake.

It is sunny outside.

The first sentence is complex in structure and it is challenging to construct a compact representation that expresses its entire meaning. However, it is fairly easy to conclude that the second sentence follows from the first one, by simply aligning Bob with Bob and cannot sleep with awake and recognizing that these are synonyms. Similarly, one can conclude that It is sunny outside contradicts the first sentence, by aligning thunder and lightning with sunny and recognizing that these are most likely incompatible.

We leverage this intuition to build a simpler and more lightweight approach to NLI within a neural framework; with considerably fewer parameters, our model outperforms more complex existing neural architectures. In contrast to existing approaches, our approach only relies on alignment and is fully computationally decomposable with respect to the input text. An overview of our approach is given in Figure 1. Given two sentences, where each word is represented by an embedding vector, we first create a soft alignment matrix using neural attention [Bahdanau et al., 2015]. We then use the (soft) alignment to decompose the task into subproblems that are solved separately. Finally, the results of these subproblems are merged to produce the final classification. In addition, we optionally apply intra-sentence attention [Cheng et al., 2016] to endow the model with a richer encoding of substructures prior to the alignment step.

Asymptotically our approach does the same total work as a vanilla LSTM encoder, while being trivially parallelizable across sentence length, which can allow for considerable speedups in low-latency settings. Empirical results on the SNLI corpus show that our approach achieves state-of-the-art results, while using almost an order of magnitude fewer parameters compared to complex LSTM-based approaches.

Related Work

Our method is motivated by the central role played by alignment in machine translation [Koehn, 2009] and previous approaches to sentence similarity modeling [Haghighi et al., 2005, Das and Smith, 2009, Chang et al., 2010, Fader et al., 2013], natural language inference [Marsi and Krahmer, 2005, MacCartney et al., 2006, Hickl and Bensley, 2007, MacCartney et al., 2008], and semantic parsing [Andreas et al., 2013]. The neural counterpart to alignment, attention [Bahdanau et al., 2015], which is a key part of our approach, was originally proposed and has been predominantly used in conjunction with LSTMs [Rocktäschel et al., 2016, Wang and Jiang, 2016] and to a lesser extent with CNNs [Yin et al., 2016]. In contrast, our use of attention is purely based on word embeddings and our method essentially consists of feed-forward networks that operate largely independently of word order.

Approach

Let $\mathbf{a}=(a_{1},\ldots,a_{\ell_{a}})$ and $\mathbf{b}=(b_{1},\ldots,b_{\ell_{b}})$ be the two input sentences of length $\ell_{a}$ and $\ell_{b}$ , respectively. We assume that each $a_{i}$ , $b_{j}\in\mathbb{R}^{d}$ is a word embedding vector of dimension $d$ and that each sentence is prepended with a “NULL” token. Our training data comes in the form of labeled pairs $\{\mathbf{a}^{(n)},\mathbf{b}^{(n)},\mathbf{y}^{(n)}\}_{n=1}^{N}$ , where $\mathbf{y}^{(n)}=(y_{1}^{(n)},\ldots,y_{C}^{(n)})$ is an indicator vector encoding the label and $C$ is the number of output classes. At test time, we receive a pair of sentences $(\mathbf{a},\mathbf{b})$ and our goal is to predict the correct label $\mathbf{y}$ .

Let $\bar{\mathbf{a}}=(\bar{a}_{1},\ldots,\bar{a}_{\ell_{a}})$ and $\bar{\mathbf{b}}=(\bar{b}_{1},\ldots,\bar{b}_{\ell_{b}})$ denote the input representation of each fragment that is fed to subsequent steps of the algorithm. The vanilla version of our model simply defines $\bar{\mathbf{a}}:=\mathbf{a}$ and $\bar{\mathbf{b}}:=\mathbf{b}$ . With this input representation, our model does not make use of word order. However, we discuss an extension using intra-sentence attention in Section 3.4 that uses a minimal amount of sequence information.

The core model consists of the following three components (see Figure 1), which are trained jointly:

Attend.

First, soft-align the elements of $\bar{\mathbf{a}}$ and $\bar{\mathbf{b}}$ using a variant of neural attention [Bahdanau et al., 2015] and decompose the problem into the comparison of aligned subphrases.

Compare.

Second, separately compare each aligned subphrase to produce a set of vectors $\{\mathbf{v}_{1,i}\}_{i=1}^{\ell_{a}}$ for $\mathbf{a}$ and $\{\mathbf{v}_{2,j}\}_{j=1}^{\ell_{b}}$ for $\mathbf{b}$ . Each $\mathbf{v}_{1,i}$ is a nonlinear combination of $a_{i}$ and its (softly) aligned subphrase in $\mathbf{b}$ (and analogously for $\mathbf{v}_{2,j}$ ).

Aggregate.

Finally, aggregate the sets $\{\mathbf{v}_{1,i}\}_{i=1}^{\ell_{a}}$ and $\{\mathbf{v}_{2,j}\}_{j=1}^{\ell_{b}}$ from the previous step and use the result to predict the label $\hat{y}$ .

1 Attend

We first obtain unnormalized attention weights $e_{ij}$, computed by a function $F^{\prime}$, which decomposes as:

$$e_{ij}:=F^{\prime}(\bar{a}_{i},\bar{b}_{j}):=F(\bar{a}_{i})^{T}F(\bar{b}_{j})\,. \tag{1}$$

This decomposition avoids the quadratic complexity that would be associated with separately applying $F^{\prime}$ $\ell_{a}\times\ell_{b}$ times. Instead, only $\ell_{a}+\ell_{b}$ applications of $F$ are needed. We take $F$ to be a feed-forward neural network with ReLU activations [Glorot et al., 2011].
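The saving described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: `relu_layer` stands in for $F$ as a single ReLU layer, and the names `relu_layer`, `attention_scores`, `W` and `c` as well as the toy dimensions are assumptions made for the example.

```python
import random

def relu_layer(x, weights, bias):
    """One feed-forward layer with ReLU, standing in for F (an assumption)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def attention_scores(a_bar, b_bar, weights, bias):
    """Return e[i][j] = F(a_i) . F(b_j), applying F once per token."""
    fa = [relu_layer(a, weights, bias) for a in a_bar]  # len(a_bar) calls of F
    fb = [relu_layer(b, weights, bias) for b in b_bar]  # len(b_bar) calls of F
    # Only cheap dot products are computed for all len(a_bar) * len(b_bar) pairs.
    return [[sum(x * y for x, y in zip(u, v)) for v in fb] for u in fa]

rng = random.Random(0)
d, hidden = 4, 3  # toy sizes, chosen only for illustration
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(hidden)]
c = [0.0] * hidden
a_bar = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(2)]
b_bar = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
e = attention_scores(a_bar, b_bar, W, c)
```

Here `e` is a 2 by 3 matrix of scores obtained from 2 + 3 = 5 evaluations of the network rather than 6.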

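These attention weights are normalized as follows:

$$\beta_{i}:=\sum_{j=1}^{\ell_{b}}\frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_{b}}\exp(e_{ik})}\bar{b}_{j}\,,\qquad \alpha_{j}:=\sum_{i=1}^{\ell_{a}}\frac{\exp(e_{ij})}{\sum_{k=1}^{\ell_{a}}\exp(e_{kj})}\bar{a}_{i}\,. \tag{2}$$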

Here $\beta_{i}$ is the subphrase in $\bar{\mathbf{b}}$ that is (softly) aligned to $\bar{a}_{i}$ and vice versa for $\alpha_{j}$ .
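As a hedged illustration of this normalization, the sketch below applies a softmax over each row of the score matrix to obtain $\beta_{i}$ and over each column to obtain $\alpha_{j}$. The helper names (`softmax`, `weighted_sum`, `soft_align`) and the toy inputs are assumptions made for the example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def weighted_sum(weights, vectors):
    """Sum of vectors, each scaled by its weight."""
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]

def soft_align(e, a_bar, b_bar):
    """beta[i] mixes b_bar using row i of e; alpha[j] mixes a_bar using column j."""
    beta = [weighted_sum(softmax(row), b_bar) for row in e]
    columns = [list(col) for col in zip(*e)]
    alpha = [weighted_sum(softmax(col), a_bar) for col in columns]
    return beta, alpha

a_bar = [[1.0, 0.0], [0.0, 1.0]]
b_bar = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
e = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # uniform scores for a transparent check
beta, alpha = soft_align(e, a_bar, b_bar)
```

With uniform scores every aligned subphrase reduces to the plain average of the other sentence's vectors, which makes the result easy to check by hand.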

2 Compare

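Next, we separately compare the aligned phrases $\{(\bar{a}_{i},\beta_{i})\}_{i=1}^{\ell_{a}}$ and $\{(\bar{b}_{j},\alpha_{j})\}_{j=1}^{\ell_{b}}$ using a function $G$, which in this work is again a feed-forward network:

$$\mathbf{v}_{1,i}:=G([\bar{a}_{i},\beta_{i}])\quad\forall i\in[1,\ldots,\ell_{a}]\,,\qquad \mathbf{v}_{2,j}:=G([\bar{b}_{j},\alpha_{j}])\quad\forall j\in[1,\ldots,\ell_{b}]\,, \tag{3}$$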

where the brackets $[\cdot,\cdot]$ denote concatenation. Note that since there are only a linear number of terms in this case, we do not need to apply a decomposition as was done in the previous step. Thus $G$ can jointly take into account both $\bar{a}_{i}$ and $\beta_{i}$.
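The Compare step can be sketched as follows, assuming $G$ is a stack of ReLU layers applied to the concatenation of a token vector and its aligned subphrase. The names `feed_forward` and `compare`, and the hand-picked weights, are illustrative assumptions rather than trained parameters.

```python
def feed_forward(x, layers):
    """Apply a stack of (weights, bias) layers, each followed by ReLU."""
    for weights, bias in layers:
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, bias)]
    return x

def compare(tokens, aligned, layers):
    """Return one comparison vector G([token, aligned]) per position."""
    # List concatenation t + s plays the role of the brackets [., .].
    return [feed_forward(t + s, layers) for t, s in zip(tokens, aligned)]

# Toy layer that adds the token and its aligned subphrase elementwise.
layers = [([[1.0, 0.0, 1.0, 0.0],
            [0.0, 1.0, 0.0, 1.0]], [0.0, 0.0])]
tokens = [[1.0, 2.0]]
aligned = [[3.0, -5.0]]
v = compare(tokens, aligned, layers)  # [[4.0, 0.0]] after the ReLU
```

Because each position is processed independently, the calls inside `compare` could run in parallel across the sentence.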

3 Aggregate

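We now have two sets of comparison vectors $\{\mathbf{v}_{1,i}\}_{i=1}^{\ell_{a}}$ and $\{\mathbf{v}_{2,j}\}_{j=1}^{\ell_{b}}$. We first aggregate over each set by summation:

$$\mathbf{v}_{1}=\sum_{i=1}^{\ell_{a}}\mathbf{v}_{1,i}\,,\qquad \mathbf{v}_{2}=\sum_{j=1}^{\ell_{b}}\mathbf{v}_{2,j}\,, \tag{4}$$

and feed the result through a final classifier $H$, which is a feed-forward network followed by a linear layer:

$$\hat{\mathbf{y}}=H([\mathbf{v}_{1},\mathbf{v}_{2}])\,, \tag{5}$$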

where $\hat{\mathbf{y}}\in\mathbb{R}^{C}$ represents the predicted (unnormalized) scores for each class and consequently the predicted class is given by $\hat{y}=\textrm{argmax}_{i}\hat{\mathbf{y}}_{i}$ .
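A minimal sketch of the Aggregate step is given below, assuming the two summed vectors are concatenated before being passed to $H$. The function names (`aggregate`, `classify`, `predict`) and the toy weights are assumptions made for illustration.

```python
def aggregate(v1_set, v2_set):
    """Sum each set of comparison vectors and concatenate the two sums."""
    v1 = [sum(col) for col in zip(*v1_set)]
    v2 = [sum(col) for col in zip(*v2_set)]
    return v1 + v2

def classify(features, hidden_layers, out_weights, out_bias):
    """ReLU hidden layers followed by a final linear layer (no activation)."""
    x = features
    for weights, bias in hidden_layers:
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(weights, bias)]
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(out_weights, out_bias)]

def predict(scores):
    """Index of the highest unnormalized score."""
    return max(range(len(scores)), key=lambda c: scores[c])

identity = [[1.0 if r == k else 0.0 for k in range(4)] for r in range(4)]
hidden_layers = [(identity, [0.0] * 4)]
out_weights = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 1.0],
               [-1.0, 0.0, 0.0, 0.0]]
out_bias = [0.0, 0.0, 0.0]

features = aggregate([[1.0, 0.0], [2.0, 1.0]], [[0.0, 1.0]])
scores = classify(features, hidden_layers, out_weights, out_bias)
label = predict(scores)
```

Summation makes the aggregated representation independent of sentence length and of the order in which the comparison vectors are produced.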

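For training, we use multi-class cross-entropy loss with dropout regularization [Srivastava et al., 2014]:

$$L(\theta_{F},\theta_{G},\theta_{H})=-\frac{1}{N}\sum_{n=1}^{N}\sum_{c=1}^{C}y_{c}^{(n)}\log\frac{\exp(\hat{y}_{c})}{\sum_{c^{\prime}=1}^{C}\exp(\hat{y}_{c^{\prime}})}\,.$$

Here $\theta_{F},\theta_{G},\theta_{H}$ denote the learnable parameters of the functions $F$, $G$ and $H$, respectively, and $\hat{\mathbf{y}}$ is the score vector computed for the $n$-th training pair.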
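For concreteness, the sketch below computes a multi-class cross-entropy from unnormalized scores and indicator labels, assuming the standard negative log-likelihood form with a softmax over the scores. Dropout is not modeled, and the names `log_softmax` and `cross_entropy` are assumptions for this example.

```python
import math

def log_softmax(scores):
    """Log of the softmax, computed stably via the log-sum-exp trick."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def cross_entropy(batch_scores, batch_labels):
    """Mean negative log-probability of the gold class over the batch."""
    total = 0.0
    for scores, onehot in zip(batch_scores, batch_labels):
        log_probs = log_softmax(scores)
        total -= sum(y * lp for y, lp in zip(onehot, log_probs))
    return total / len(batch_scores)

batch_scores = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
batch_labels = [[1, 0, 0], [1, 0, 0]]
loss = cross_entropy(batch_scores, batch_labels)
```

The first example contributes $\log 3$ because its scores are uniform, while the second contributes less because the gold class already has the highest score.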

4 Intra-Sentence Attention (Optional)

In the above model, the input representations are simple word embeddings. However, we can augment this input representation with intra-sentence attention to encode compositional relationships between words within each sentence, as proposed by Cheng et al. [2016]. Similar to Eqs. 1 and 2, we define

$$f_{ij}:=F_{\textrm{intra}}(a_{i})^{T}F_{\textrm{intra}}(a_{j})\,, \tag{6}$$

where $F_{\textrm{intra}}$ is a feed-forward network. We then create the self-aligned phrases

$$a^{\prime}_{i}:=\sum_{j=1}^{\ell_{a}}\frac{\exp(f_{ij}+d_{i-j})}{\sum_{k=1}^{\ell_{a}}\exp(f_{ik}+d_{i-k})}a_{j}\,. \tag{7}$$

The distance-sensitive bias terms $d_{i-j}\in\mathbb{R}$ provide the model with a minimal amount of sequence information, while remaining parallelizable. These terms are bucketed such that all distances greater than 10 words share the same bias. The input representation for subsequent steps is then defined as $\bar{a}_{i}:=[a_{i},a^{\prime}_{i}]$ and analogously $\bar{b}_{i}:=[b_{i},b^{\prime}_{i}]$.
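The bucketed distance bias can be sketched as below. How the buckets are indexed is not spelled out in the text, so this example assumes signed offsets $i-j$ clipped to the range $[-11, 11]$, with $\pm 11$ acting as the shared bucket for distances greater than 10. The names `clipped_offset` and `intra_attention`, and the precomputed score matrix `f`, are likewise assumptions.

```python
import math

def clipped_offset(i, j, max_distance=10):
    """Signed offset i - j; all offsets beyond max_distance share one bucket."""
    limit = max_distance + 1
    return max(-limit, min(limit, i - j))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def intra_attention(a, f, bias):
    """Return [a_i, a'_i] for each position, given scores f and bias table."""
    dim = len(a[0])
    out = []
    for i in range(len(a)):
        logits = [f[i][j] + bias[clipped_offset(i, j)] for j in range(len(a))]
        weights = softmax(logits)
        a_prime = [sum(w * a[j][k] for j, w in enumerate(weights))
                   for k in range(dim)]
        out.append(a[i] + a_prime)  # concatenation [a_i, a'_i]
    return out

bias = {offset: 0.0 for offset in range(-11, 12)}  # one scalar per bucket
a = [[1.0], [3.0]]
f = [[0.0, 0.0], [0.0, 0.0]]
a_bar = intra_attention(a, f, bias)  # each a'_i is the average, 2.0
```

Each row of the output depends only on the scores for that row, so the computation remains parallelizable over positions.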

Computational Complexity

We now discuss the asymptotic complexity of our approach and how it offers a higher degree of parallelism than LSTM-based approaches. Recall that $d$ denotes the embedding dimension and $\ell$ the sentence length. For simplicity we assume that all hidden dimensions are $d$ and that multiplying a $d\times d$ matrix by a $d\times 1$ vector has complexity $O(d^{2})$.

A key assumption of our analysis is that $\ell<d$ , which we believe is reasonable and is true of the SNLI dataset [Bowman et al., 2015] where $\ell<80$ , whereas recent LSTM-based approaches have used $d\geq 300$ . This assumption allows us to bound the complexity of computing the $\ell^{2}$ attention weights.

The complexity of an LSTM cell is $O(d^{2})$, resulting in a complexity of $O(\ell d^{2})$ to encode the sentence. Adding attention as in Rocktäschel et al. [2016] increases this complexity to $O(\ell d^{2}+\ell^{2}d)$.

Complexity of our Approach.

Application of a feed-forward network requires $O(d^{2})$ steps. Thus, the Compare and Aggregate steps have complexity $O(\ell d^{2})$ and $O(d^{2})$ respectively. For the Attend step, $F$ is evaluated $O(\ell)$ times, giving a complexity of $O(\ell d^{2})$ . Each attention weight $e_{ij}$ requires one dot product, resulting in a complexity of $O(\ell^{2}d)$ .

Thus the total complexity of the model is $O(\ell d^{2}+\ell^{2}d)$ , which is equal to that of an LSTM with attention. However, note that with the assumption that $\ell<d$ , this becomes $O(\ell d^{2})$ which is the same complexity as a regular LSTM. Moreover, unlike the LSTM, our approach has the advantage of being parallelizable over $\ell$ , which can be useful at test time.
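As a worked illustration of why the $\ell d^{2}$ term dominates when $\ell<d$, the snippet below evaluates both terms for $\ell=80$ and $d=300$, the bounds quoted above. Constant factors are ignored, and the function name `attention_model_cost` is an assumption for this example.

```python
def attention_model_cost(length, dim):
    """Return the two leading terms of the cost, ignoring constant factors."""
    feed_forward_term = length * dim ** 2   # O(l d^2): F, G applied per token
    attention_term = length ** 2 * dim      # O(l^2 d): one dot product per pair
    return feed_forward_term, attention_term

ff, att = attention_model_cost(80, 300)
# ff = 80 * 90000 = 7200000 and att = 6400 * 300 = 1920000
```

The ratio of the two terms is $d/\ell$, so the attention term stays below the feed-forward term whenever $\ell<d$.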

Experiments

We evaluate our approach on the Stanford Natural Language Inference (SNLI) dataset [Bowman et al., 2015]. Given a sentence pair $(\mathbf{a},\mathbf{b})$, the task is to predict whether $\mathbf{b}$ is entailed by $\mathbf{a}$, $\mathbf{b}$ contradicts $\mathbf{a}$, or whether their relationship is neutral.

The method was implemented in TensorFlow [Abadi et al., 2015].

Data preprocessing: Following Bowman et al. [2015], we remove examples labeled “–” (no gold label) from the dataset, which leaves 549,367 pairs for training, 9,842 for development, and 9,824 for testing. We use the tokenized sentences from the non-binary parse provided in the dataset and prepend each sentence with a “NULL” token. During training, each sentence was padded up to the maximum length of the batch for efficient training (the padding was explicitly masked out so as not to affect the objective/gradients). For efficient batching in TensorFlow, we semi-sorted the training data to first contain examples where both sentences had length less than 20, followed by those with length less than 50, and then the rest. This ensured that most training batches contained examples of similar length.
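The semi-sorting and padding described above might look roughly like the following. This is a sketch under stated assumptions, not the original TensorFlow pipeline: the pad symbol `<PAD>` and the function names `length_bucket`, `semi_sort` and `pad_batch` are hypothetical.

```python
def length_bucket(pair, limits=(20, 50)):
    """0 if both sentences are shorter than 20 tokens, 1 if shorter than 50, else 2."""
    longest = max(len(pair[0]), len(pair[1]))
    for index, limit in enumerate(limits):
        if longest < limit:
            return index
    return len(limits)

def semi_sort(pairs):
    """Order pairs by bucket only; the stable sort keeps the order within a bucket."""
    return sorted(pairs, key=length_bucket)

def pad_batch(sentences, pad_token="<PAD>"):
    """Pad to the longest sentence in the batch and return a 0/1 mask."""
    width = max(len(s) for s in sentences)
    padded = [s + [pad_token] * (width - len(s)) for s in sentences]
    mask = [[1] * len(s) + [0] * (width - len(s)) for s in sentences]
    return padded, mask

pairs = [(["w"] * 60, ["w"] * 5), (["w"] * 3, ["w"] * 4), (["w"] * 25, ["w"] * 10)]
ordered = semi_sort(pairs)
padded, mask = pad_batch([["a", "b", "c"], ["d"]])
```

The mask marks real tokens with 1 and padding with 0, so padded positions can be excluded from the attention sums and from the loss.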

Embeddings: We use 300 dimensional GloVe embeddings [Pennington et al., 2014] to represent words. Each embedding vector was normalized to have $\ell_{2}$ norm of 1 and projected down to 200 dimensions, a number determined via hyperparameter tuning. Out-of-vocabulary (OOV) words are hashed to one of 100 random embeddings each initialized to mean 0 and standard deviation 1. All embeddings remain fixed during training, but the projection matrix is trained. All other parameter weights (hidden layers etc.) were initialized from random Gaussians with mean 0 and standard deviation 0.01.
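A hedged sketch of the embedding lookup follows. It assumes a deterministic CRC32 hash for assigning OOV words to buckets (the hash function actually used is not stated) and leaves the random OOV vectors unnormalized. The trained projection to 200 dimensions is omitted, and the names `l2_normalize`, `make_oov_table` and `lookup` are illustrative.

```python
import math
import random
import zlib

NUM_OOV_BUCKETS = 100  # number of shared random OOV embeddings, from the text

def l2_normalize(vector):
    """Scale a vector to unit Euclidean norm (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0.0 else list(vector)

def make_oov_table(dim, seed=0):
    """Random embeddings with mean 0 and standard deviation 1."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(NUM_OOV_BUCKETS)]

def lookup(word, vocab, oov_table):
    """Return the normalized pretrained vector, or a hashed OOV vector."""
    if word in vocab:
        return l2_normalize(vocab[word])
    bucket = zlib.crc32(word.encode("utf-8")) % NUM_OOV_BUCKETS
    return oov_table[bucket]

vocab = {"cat": [3.0, 4.0]}  # toy stand-in for the pretrained vectors
oov_table = make_oov_table(dim=2)
known = lookup("cat", vocab, oov_table)      # approximately [0.6, 0.8]
unknown = lookup("zyzzyva", vocab, oov_table)
```

Hashing gives each unseen word a consistent vector across training and test time without growing the vocabulary.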

Each hyperparameter setting was run on a single machine with 10 asynchronous gradient-update threads, using Adagrad [Duchi et al., 2011] for optimization with the default initial accumulator value of 0.1. Dropout regularization [Srivastava et al., 2014] was used for all ReLU layers, but not for the final linear layer. We additionally tuned the following hyperparameters and present their chosen values in parentheses: network size (2 layers, each with 200 neurons), batch size (4; 16 or 32 also work well and are a bit more stable), dropout ratio (0.2), and learning rate (0.05 for vanilla, 0.025 for intra-attention). All settings were run for 50 million steps (each step indicates one batch), but model parameters were saved frequently as training progressed and we chose the model that did best on the development set.
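For reference, a single Adagrad update with an initial accumulator of 0.1 can be sketched as follows. It assumes the common form in which the squared gradient is added to the accumulator before the step is scaled, with no extra epsilon term. The asynchronous threads are not modeled, and the name `adagrad_step` is an assumption for this example.

```python
import math

def adagrad_step(params, grads, accum, learning_rate=0.05):
    """Update params and accum in place with one Adagrad step."""
    for k, g in enumerate(grads):
        accum[k] += g * g                                   # running sum of squares
        params[k] -= learning_rate * g / math.sqrt(accum[k])

params = [1.0]
accum = [0.1]   # initial accumulator value quoted in the text
adagrad_step(params, [0.3], accum)
```

Because the accumulator only grows, the effective step size for each parameter shrinks over training, more quickly for parameters that receive large gradients.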

2 Results

Results in terms of 3-class accuracy are shown in Table 1. Our vanilla approach achieves state-of-the-art results with almost an order of magnitude fewer parameters than the LSTMN of Cheng et al. [2016]. Adding intra-sentence attention gives a considerable improvement of 0.5 percentage points over the existing state of the art. Table 2 gives a breakdown of accuracy on the development set showing that most of our gains stem from neutral, while most losses come from contradiction pairs.

Table 3 shows some wins and losses. Examples A-C are cases where both variants of our approach are correct while both SPINN-PI [Bowman et al., 2016] and the mLSTM [Wang and Jiang, 2016] are incorrect. In the first two cases, both sentences contain phrases that are either identical or highly lexically related (e.g. “Two kids” and “ocean / beach”) and our approach correctly favors neutral in these cases. In Example C, it is possible that relying on word-order may confuse SPINN-PI and the mLSTM due to how “fountain” is the object of a preposition in the first sentence but the subject of the second.

The second set of examples (D-F) are cases where our vanilla approach is incorrect but mLSTM and SPINN-PI are correct. Example F requires sequential information and neither variant of our approach can predict the correct class. Examples D-E are interesting, however, since they do not require word-order information, yet intra-attention seems to help. We suspect this may be because the word embeddings are not fine-grained enough for the algorithm to conclude that “play/watch” is a contradiction, but intra-attention, by adding an extra layer of composition/nonlinearity to incorporate context, compensates for this.

Finally, Examples G-I are cases that all methods get wrong. The first is actually representative of many examples in this category where there is one critical word that separates the two sentences (close vs. open in this case) and goes unnoticed by the algorithms. Example H requires inference about numbers and Example I needs sequence information.

Conclusion

We presented a simple attention-based approach to natural language inference that is trivially parallelizable. The approach outperforms considerably more complex neural methods aiming for text understanding. Our results suggest that, at least for this task, pairwise comparisons are relatively more important than global sentence-level representations.

Acknowledgements

We thank Slav Petrov, Tom Kwiatkowski, Yoon Kim, Erick Fonseca, and Mark Neumann for useful discussion, and Sam Bowman and Shuohang Wang for providing us with their model outputs for error analysis.