Neural Network Models for Paraphrase Identification, Semantic Textual Similarity, Natural Language Inference, and Question Answering

Wuwei Lan, Wei Xu

Introduction

Sentence pair modeling is a fundamental technique underlying many NLP tasks, including the following:

Semantic Textual Similarity (STS), which measures the degree of equivalence in the underlying semantics of paired snippets of text [Agirre et al., 2016].

Paraphrase Identification (PI), which identifies whether two sentences express the same meaning [Dolan and Brockett, 2005, Xu et al., 2014, Xu et al., 2015].

Natural Language Inference (NLI), also known as recognizing textual entailment (RTE), which concerns whether a hypothesis can be inferred from a premise, requiring understanding of the semantic similarity between the hypothesis and the premise [Dagan et al., 2006, Bowman et al., 2015].

Question Answering (QA), which can be approximated as ranking candidate answer sentences or phrases based on their similarity to the original question [Yang et al., 2015].

Machine Comprehension (MC), which requires sentence matching between a passage and a question, pointing out the text region that contains the answer [Rajpurkar et al., 2016].

Traditionally, researchers had to develop different methods specific to each task. Now neural networks can perform all the above tasks with the same architecture by training end to end. Various neural models [He and Lin, 2016, Chen et al., 2017, Parikh et al., 2016, Wieting et al., 2016, Tomar et al., 2017, Wang et al., 2017, Shen et al., 2017a, Yin et al., 2016] have reported state-of-the-art results for sentence pair modeling tasks; however, they were carefully designed and evaluated on selected (often one or two) datasets that can demonstrate the superiority of the model. The research questions are as follows: Do they perform well on other tasks and datasets? How much performance gain is due to certain system design choices and hyperparameter optimizations?

To answer these questions and better understand different network designs, we systematically analyze and compare the state-of-the-art neural models across multiple tasks and multiple domains. Namely, we implement five models and their variations on the same PyTorch platform: InferSent model [Conneau et al., 2017], Shortcut-stacked Sentence Encoder Model [Nie and Bansal, 2017], Pairwise Word Interaction Model [He and Lin, 2016], Decomposable Attention Model [Parikh et al., 2016], and Enhanced Sequential Inference Model [Chen et al., 2017]. They are representative of the two most common approaches: sentence encoding models, which learn vector representations of individual sentences and then calculate the semantic relationship between sentences based on vector distance, and sentence pair interaction models, which use some form of word alignment mechanism (e.g., attention) and then aggregate inter-sentence interactions. We focus on identifying important network designs and present a series of findings with quantitative measurements and in-depth analyses, including (i) incorporating inter-sentence interactions is critical; (ii) Tree-LSTM does not help as much as previously claimed but surprisingly improves performance on Twitter data; (iii) Enhanced Sequential Inference Model has the most consistent high performance for larger datasets, while Pairwise Word Interaction Model performs better on smaller datasets and Shortcut-Stacked Sentence Encoder Model is the best performing model on the Quora corpus. We release our implementations as a toolkit to the research community. The code is available on the authors’ homepages and GitHub: https://github.com/lanwuwei/SPM_toolkit

General Framework for Sentence Pair Modeling

Various neural networks have been proposed for sentence pair modeling, all of which fall into two types of approaches. The sentence encoding approach encodes each sentence into a fixed-length vector and then computes sentence similarity directly. Models of this type have the advantages of a simple network design and generalization to other NLP tasks. The sentence pair interaction approach takes word alignment and interactions between the sentence pair into account and often shows better performance when trained on in-domain data. Here we outline the two types of neural networks under the same general framework:

The Input Embedding Layer takes vector representations of words as input, where pretrained word embeddings are most commonly used, e.g. GloVe [Pennington et al., 2014] or Word2vec [Mikolov et al., 2013]. Some work used embeddings specially trained on phrase or sentence pairs that are paraphrases [Wieting and Gimpel, 2017, Tomar et al., 2017]; some used subword embeddings, which showed improvement on social media data [Lan and Xu, 2018].

The Context Encoding Layer incorporates word context and sequence order into modeling for better vector representation. This layer often uses CNN [He et al., 2015], LSTM [Chen et al., 2017], recursive neural network [Socher et al., 2011], or highway network [Gong et al., 2017]. The sentence encoding type of model will stop at this step, and directly use the encoded vectors to compute the semantic similarity through vector distances and/or the output classification layer.

The Interaction and Attention Layer calculates word pair (or n-gram pair) interactions using the outputs of the encoding layer. This is the key component for the interaction-aggregation type of model. In the PWIM model [He and Lin, 2016], the interactions are calculated by cosine similarity, Euclidean distance, and the dot product of the vectors. Various models put different weights on different interactions, primarily simulating the word alignment between two sentences. The alignment information is useful for sentence pair modeling because the semantic relation between two sentences depends largely on the relations of aligned chunks as shown in the SemEval-2016 task of interpretable semantic textual similarity [Agirre et al., 2016].

The Output Classification Layer adapts a CNN or MLP to extract semantic-level features from the attentive alignment and applies a softmax function to predict the probability of each class.

Representative Models for Sentence Pair Modeling

Table 1 gives a summary of typical models for sentence pair modeling in recent years. In particular, we investigate five models in depth: two are representative of the sentence encoding type of model, and three are representative of the interaction-aggregation type of model. These models have reported state-of-the-art results with varied architecture design (this section) and implementation details (Section 4.2).

We choose the simple Bi-LSTM max-pooling network from InferSent [Conneau et al., 2017]:

where $\overleftrightarrow{\bm{h}}_{i}$ represents the concatenation of hidden states in both directions. It has shown better transfer learning capabilities than several other sentence embedding models, including SkipThought [Kiros et al., 2015] and FastSent [Hill et al., 2016], when trained on the natural language inference datasets.
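As a rough illustration of the max-pooling step only, the sketch below takes a list of per-word hidden vectors and keeps the largest value in each dimension. The vectors are made-up stand-ins for Bi-LSTM outputs $\overleftrightarrow{\bm{h}}_{i}$, and the function name `max_pool_over_time` is ours, not part of InferSent.

```python
from typing import List


def max_pool_over_time(hidden_states: List[List[float]]) -> List[float]:
    """Element-wise max over time steps: v[d] = max_i h_i[d]."""
    if not hidden_states:
        raise ValueError("need at least one time step")
    # zip(*...) regroups the values by dimension across all time steps.
    return [max(dim_values) for dim_values in zip(*hidden_states)]


# Three words, each with a toy 4-dimensional hidden state (hypothetical values).
h = [
    [0.1, -0.3, 0.5, 0.0],
    [0.4, 0.2, -0.1, 0.0],
    [-0.2, 0.9, 0.3, -0.7],
]
v = max_pool_over_time(h)
print(v)  # [0.4, 0.9, 0.5, 0.0]
```

The result has the same dimensionality regardless of sentence length, which is what makes it usable as a fixed-length sentence vector.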

2 The Shortcut-Stacked Sentence Encoder Model (SSE)

The Shortcut-Stacked Sentence Encoder model [Nie and Bansal, 2017] is a sentence-based embedding model, which enhances multi-layer Bi-LSTM with skip connection to avoid training error accumulation, and calculates each layer as follows:

where $\bm{x}_{i}^{k}$ is the input of the $k$ th Bi-LSTM layer at time step $i$ , which is the combination of outputs from all previous layers, $\overleftrightarrow{\bm{h}}_{i}^{k}$ represents the hidden state of the $k$ th Bi-LSTM layer in both directions. The final sentence embedding $\bm{v}$ is the row-based max pooling over the output of the last Bi-LSTM layer, where $n$ denotes the number of words within a sentence and $m$ is the number of Bi-LSTM layers ( $m=3$ in SSE).
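The layer equations themselves are not reproduced in this text. Based only on the definitions in the paragraph above, one plausible way to write them is sketched below. Reading the "combination" of earlier outputs as a concatenation that also includes the word embedding $\bm{w}_{i}$ is our assumption, and the original formulation may differ in detail.

```latex
\begin{align}
\overleftrightarrow{\bm{h}}_{i}^{k} &= \mathrm{BiLSTM}\big(\bm{x}_{i}^{k},\, \overleftrightarrow{\bm{h}}_{i-1}^{k}\big) \\
\bm{x}_{i}^{1} &= \bm{w}_{i} \\
\bm{x}_{i}^{k} &= \big[\bm{w}_{i};\, \overleftrightarrow{\bm{h}}_{i}^{k-1};\, \dots;\, \overleftrightarrow{\bm{h}}_{i}^{1}\big], \quad k > 1 \\
\bm{v} &= \max\big(\overleftrightarrow{\bm{h}}_{1}^{m},\, \dots,\, \overleftrightarrow{\bm{h}}_{n}^{m}\big)
\end{align}
```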

3 The Pairwise Word Interaction Model (PWIM)

In the Pairwise Word Interaction model [He and Lin, 2016], each word vector $\bm{w}_{i}$ is encoded with context through forward and backward LSTMs: $\overrightarrow{\bm{h}}_{i}=LSTM^{f}(\bm{w}_{i},\overrightarrow{\bm{h}}_{i-1})$ and $\overleftarrow{\bm{h}}_{i}=LSTM^{b}(\bm{w}_{i},\overleftarrow{\bm{h}}_{i+1})$ . For every word pair $(\bm{w}^{a}_{i},\bm{w}^{b}_{j})$ across sentences, the model directly calculates word pair interactions using cosine similarity, Euclidean distance, and dot product over the outputs of the encoding layer:

The above equation not only applies to forward hidden state $\overrightarrow{\bm{h}}_{i}$ and backward hidden state $\overleftarrow{\bm{h}}_{i}$ , but also to the concatenation $\overleftrightarrow{\bm{h}}_{i}=[\overrightarrow{\bm{h}}_{i},\overleftarrow{\bm{h}}_{i}]$ and summation $\bm{h}^{+}_{i}=\overrightarrow{\bm{h}}_{i}+\overleftarrow{\bm{h}}_{i}$ , resulting in a tensor $\mathbf{D}^{13\times|sent1|\times|sent2|}$ after padding one extra bias term. A “hard” attention is applied to the interaction tensor to build word alignment: selecting the most related word pairs and increasing the corresponding weights by 10 times. Then a 19-layer deep CNN is applied to aggregate the word interaction features for final classification.
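One way to read the 13 channels is as 4 hidden-state views (forward, backward, concatenation, summation) times 3 measures, plus 1 bias channel. The sketch below computes such a feature vector for a single word pair in plain Python. The helper names and the bias value of 1.0 are assumptions made for illustration; this is not the PWIM implementation.

```python
import math
from typing import List


def dot(u: List[float], v: List[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def cosine(u: List[float], v: List[float]) -> float:
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm > 0 else 0.0


def euclidean(u: List[float], v: List[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def word_pair_interactions(
    fwd_a: List[float], bwd_a: List[float],
    fwd_b: List[float], bwd_b: List[float],
) -> List[float]:
    """Return 4 views x 3 measures + 1 bias = 13 values for one word pair."""
    views = [
        (fwd_a, fwd_b),                          # forward states
        (bwd_a, bwd_b),                          # backward states
        (fwd_a + bwd_a, fwd_b + bwd_b),          # concatenation (list +)
        ([x + y for x, y in zip(fwd_a, bwd_a)],  # element-wise summation
         [x + y for x, y in zip(fwd_b, bwd_b)]),
    ]
    feats: List[float] = []
    for u, v in views:
        feats.extend([cosine(u, v), euclidean(u, v), dot(u, v)])
    feats.append(1.0)  # bias channel; the value 1.0 is an assumption
    return feats


feats = word_pair_interactions([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0])
print(len(feats))  # 13
```

Repeating this for every word pair $(i, j)$ would fill a tensor of shape $13\times|sent1|\times|sent2|$.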

4 The Decomposable Attention Model (DecAtt)

The Decomposable Attention model [Parikh et al., 2016] is one of the earliest models to introduce attention-based alignment for sentence pair modeling, and it achieved state-of-the-art results on the SNLI dataset with about an order of magnitude fewer parameters than other models (see more in Table 5) without relying on word order information. It computes the word pair interaction between $\bm{w}_{i}^{a}$ and $\bm{w}_{j}^{b}$ (from input sentences $s_{a}$ and $s_{b}$ , each with $m$ and $n$ words, respectively) as ${e}_{ij}={F(\bm{w}_{i}^{a})}^{T}F(\bm{w}_{j}^{b})$ , where $F$ is a feedforward network; then alignment is determined as follows:

where $\bm{\beta}_{i}$ is the soft alignment between $\bm{w}_{i}^{a}$ and subphrases $\bm{w}_{j}^{b}$ in sentence $s_{b}$ , and vice versa for $\bm{\alpha}_{j}$ . The aligned phrases are fed into another feedforward network $G$ : $\bm{v}_{i}^{a}=G([\bm{w}_{i}^{a};\bm{\beta}_{i}])$ and $\bm{v}_{j}^{b}=G([\bm{w}_{j}^{b};\bm{\alpha}_{j}])$ to generate sets $\{\bm{v}_{i}^{a}\}$ and $\{\bm{v}_{j}^{b}\}$ , which are aggregated by summation and then concatenated together for classification.
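The soft alignment can be sketched in a few lines, assuming the usual attention form: each $\bm{\beta}_{i}$ is a softmax-weighted sum of the words in $s_{b}$ using row $i$ of the scores $e_{ij}$, and each $\bm{\alpha}_{j}$ is a softmax-weighted sum of the words in $s_{a}$ using column $j$. For brevity $F$ is taken to be the identity here, which is a simplification of ours; in the model it is a feedforward network.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights: List[float], vectors: List[Vector]) -> Vector:
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def soft_align(a: List[Vector], b: List[Vector]) -> Tuple[List[Vector], List[Vector]]:
    """Return (beta, alpha): beta[i] aligns a[i] to b, alpha[j] aligns b[j] to a."""
    # e[i][j] = F(a_i)^T F(b_j), with F = identity (an assumption for this sketch).
    e = [[sum(x * y for x, y in zip(wa, wb)) for wb in b] for wa in a]
    beta = [weighted_sum(softmax(row), b) for row in e]
    columns = [list(col) for col in zip(*e)]
    alpha = [weighted_sum(softmax(col), a) for col in columns]
    return beta, alpha


a = [[1.0, 0.0], [0.0, 1.0]]              # toy sentence s_a, m = 2 words
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy sentence s_b, n = 3 words
beta, alpha = soft_align(a, b)
print(len(beta), len(alpha))  # 2 3
```

There is one $\bm{\beta}_{i}$ per word in $s_{a}$ and one $\bm{\alpha}_{j}$ per word in $s_{b}$, so the aligned vectors can be concatenated word by word before being passed to $G$.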

5 The Enhanced Sequential Inference Model (ESIM)

The Enhanced Sequential Inference Model [Chen et al., 2017] is closely related to the DecAtt model, but it differs in a few aspects. First, Chen et al. [Chen et al., 2017] demonstrated that using Bi-LSTM to encode sequential contexts is important for performance improvement. They used the concatenation $\overline{\bm{w}}_{i}=\overleftrightarrow{\bm{h}}_{i}=[\overrightarrow{\bm{h}}_{i},\overleftarrow{\bm{h}}_{i}]$ of both directions as in the PWIM model. The word alignments $\bm{\beta}_{i}$ and $\bm{\alpha}_{j}$ between $\overline{\bm{w}}^{a}$ and $\overline{\bm{w}}^{b}$ are calculated the same way as in DecAtt. Second, they showed the competitive performance of a recursive architecture with constituency parsing, which complements the sequential LSTM. The feedforward function $G$ in DecAtt is replaced with Tree-LSTM:

Third, instead of using summation in aggregation, ESIM applies average and max pooling and concatenates the results, $\bm{v}=[\bm{v}_{ave}^{a};\bm{v}_{max}^{a};\bm{v}_{ave}^{b};\bm{v}_{max}^{b}]$ , before passing them through a multi-layer perceptron (MLP) for classification:

Experiments and Analysis

We conducted sentence pair modeling experiments on eight popular datasets: two NLI datasets, three PI datasets, one STS dataset and two QA datasets. Table 2 gives a comparison of these datasets:

SNLI [Bowman et al., 2015] contains 570k hypotheses written by crowdsourcing workers given the premises. It focuses on three semantic relations: the premise entails the hypothesis (entailment), they contradict each other (contradiction), or they are unrelated (neutral).

Multi-NLI [Williams et al., 2017] extends the SNLI corpus to multiple genres of written and spoken texts with 433k sentence pairs.

Quora [Iyer et al., 2017] contains 400k question pairs collected from the Quora website. This dataset has balanced positive and negative labels indicating whether the questions are duplicated or not.

Twitter-URL [Lan et al., 2017] includes 50k sentence pairs collected from tweets that share the same URL of news articles. This dataset contains both formal and informal language.

PIT-2015 [Xu et al., 2015] comes from SemEval-2015 and was collected from tweets under the same trending topic. It contains naturally occurring (i.e., written spontaneously by independent Twitter users) paraphrases and non-paraphrases with varied topics and language styles.

STS-2014 [Agirre et al., 2014] is from SemEval-2014, constructed from image descriptions, news headlines, tweet news, discussion forums, and OntoNotes [Hovy et al., 2006].

WikiQA [Yang et al., 2015] is an open-domain question-answering dataset. Following He and Lin [He and Lin, 2016], questions without correct candidate answer sentences are excluded, and answer sentences are truncated to 40 tokens, resulting in 12k question-answer pairs for our experiments.

TrecQA [Wang et al., 2007] is an answer selection dataset of 56k question-answer pairs created from the Text Retrieval Conferences (TREC). For both the WikiQA and TrecQA datasets, the best answer is selected according to its semantic relatedness to the question.

2 Implementation Details

We implement all the models with the same PyTorch framework. InferSent and SSE have open-source PyTorch implementations by the original authors, for which we reused part of the code. Our code is available at: https://github.com/lanwuwei/SPM_toolkit. Below, we summarize the implementation details that are key for reproducing results for each model:

SSE: This model can converge very fast, for example, 2 or 3 epochs for the SNLI dataset. We control the convergence speed by updating the learning rate for each epoch: specifically, $lr=\frac{1}{2^{\frac{epoch\_i}{2}}}*{init\_lr}$ , where $init\_lr$ is the initial learning rate and $epoch\_i$ is the index of current epoch.
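The decay rule above halves the learning rate every two epochs. A minimal sketch follows; the function name `decayed_lr`, the example initial rate, and the choice to start counting epochs at 0 are assumptions for illustration, since the text does not state whether $epoch\_i$ is 0- or 1-based.

```python
def decayed_lr(init_lr: float, epoch_i: int) -> float:
    """lr = init_lr / 2 ** (epoch_i / 2), i.e. the rate halves every two epochs."""
    return init_lr / (2 ** (epoch_i / 2))


# Hypothetical initial learning rate, epochs counted from 0 (an assumption).
for epoch_i in range(5):
    print(epoch_i, decayed_lr(0.0002, epoch_i))
```

In a PyTorch training loop the returned value could be written into each entry of `optimizer.param_groups` (key `"lr"`) at the start of every epoch.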

DecAtt: It is important to use gradient clipping for this model: for each gradient update, we check the L2 norm of all the gradient values, if it is greater than a threshold $b$ , we scale the gradient by a factor $\alpha=b/L2\_norm$ . Another useful procedure is to assemble batches of sentences with similar length.
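The clipping rule described above can be written out in plain Python as follows. Gradients are represented here as nested lists of floats rather than tensors, and the function name `clip_by_global_norm` is ours; this is a sketch of the rule, not the toolkit's code.

```python
import math
from typing import List


def clip_by_global_norm(grads: List[List[float]], threshold: float) -> List[List[float]]:
    """Scale all gradients by alpha = threshold / L2_norm when the norm exceeds the threshold."""
    l2_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if l2_norm <= threshold:
        return [list(tensor) for tensor in grads]  # nothing to clip
    alpha = threshold / l2_norm
    return [[g * alpha for g in tensor] for tensor in grads]


grads = [[3.0, 0.0], [0.0, 4.0]]  # toy gradients with global L2 norm 5
clipped = clip_by_global_norm(grads, threshold=1.0)
print(clipped)
```

In PyTorch, `torch.nn.utils.clip_grad_norm_` applies the same kind of global-norm scaling in place to a model's parameters.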

ESIM: Unlike DecAtt, ESIM batches sentences of varied length and uses masks to filter out padding information. In order to batch the parse trees within the Tree-LSTM recursion, we follow the procedure of Bowman et al. [Bowman et al., 2016], which converts tree structures into the linear sequential structure of a shift-reduce parser. Two additional masks are used for producing the left and right children of a tree node.

PWIM: The cosine and Euclidean distances used in the word interaction layer have smaller values for similar vectors while dot products have larger values. The performance increases if we add a negative sign to make all the vector similarity measurements behave consistently.

3 Analysis

Tables 3 and 4 show the results reported in the original papers and the replicated results with our implementation. We use accuracy, F1 score, Pearson’s $r$ , Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR) for evaluation on different datasets following the literature. Our reproduced results are slightly lower than the original results by 0.5 $\sim$ 1.5 points on accuracy. We suspect the following potential reasons: (i) less extensive hyperparameter tuning for each individual dataset; (ii) only one run with random seeding to report results; and (iii) use of different neural network toolkits: for example, the original ESIM model was implemented with Theano, and the PWIM model was in Torch.
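For readers less familiar with the two ranking metrics, the sketch below computes MAP and MRR using their standard definitions. The input format (one list of 0/1 relevance labels per question, already sorted by descending model score) and the function names are assumptions for illustration; the evaluation scripts actually used may differ.

```python
from typing import Callable, List


def average_precision(labels: List[int]) -> float:
    """labels: 1/0 relevance of candidates, sorted by descending model score."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at each correct answer
    return sum(precisions) / hits if hits else 0.0


def reciprocal_rank(labels: List[int]) -> float:
    """1 / rank of the first correct answer, or 0 if there is none."""
    for rank, relevant in enumerate(labels, start=1):
        if relevant:
            return 1.0 / rank
    return 0.0


def mean_metric(metric: Callable[[List[int]], float], ranked: List[List[int]]) -> float:
    return sum(metric(labels) for labels in ranked) / len(ranked)


# Two toy questions with three ranked candidate answers each.
ranked = [[0, 1, 1], [1, 0, 0]]
map_score = mean_metric(average_precision, ranked)
mrr_score = mean_metric(reciprocal_rank, ranked)
print(round(map_score, 4), mrr_score)  # 0.7917 0.75
```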

3.2 Effects of Model Components

Herein, we examine the main components that account for performance in sentence pair modeling.

How important is LSTM encoded context information for sentence pair modeling? Regarding DecAtt, Parikh et al. [Parikh et al., 2016] mentioned that “intra-sentence attention is optional”; they can achieve competitive results without considering context information. However, not surprisingly, our experiments consistently show that encoding sequential context information with LSTM is critical. Compared to DecAtt, ESIM shows better performance on every dataset (see Table 4 and Figure 3). The main difference between ESIM and DecAtt that contributes to performance improvement, we found, is the use of Bi-LSTM and Tree-LSTM for sentence encoding, rather than the different choices of aggregation functions.

Why does Tree-LSTM help with Twitter data? Chen et al. [Chen et al., 2017] offered a simple combination (ESIMseq+tree) by averaging the prediction probabilities of two ESIM variants that use sequential Bi-LSTM and Tree-LSTM respectively, and suggested “parsing information complements very well with ESIM and further improves the performance”. However, we found that adding Tree-LSTM only helps slightly or not at all for most datasets, but it helps noticeably with the two Twitter paraphrase datasets. We hypothesize the reason is that these two datasets come from real-world tweets which often contain extraneous text fragments, in contrast to SNLI and other datasets that have sentences written by crowdsourcing workers. For example, the segment “ever wondered ,” in the sentence pair “ever wondered , why your recorded #voice sounds weird to you?” and “why do our recorded voices sound so weird to us?” introduces a disruptive context into the Bi-LSTM encoder, while Tree-LSTM can put it in a less important position after constituency parsing.

How important is attentive interaction for sentence pair modeling? Why does SSE excel on Quora? Both ESIM and DecAtt (Eq. 7) calculate an attention-based soft alignment between a sentence pair, which was also proposed in [Rocktäschel et al., 2016] and [Wang and Jiang, 2017] for sentence pair modeling, whereas PWIM utilizes a hard attention mechanism. Both attention strategies are critical for model performance. In the PWIM model [He and Lin, 2016], we observed a 1 $\sim$ 2 point performance drop after removing the hard attention, and a 0 $\sim$ 3 point performance drop with a $\sim$ 25% training time reduction after removing the 19-layer CNN aggregation. Likely without even the authors of SSE knowing, the SSE model performs extraordinarily well on the Quora corpus, perhaps because Quora contains many sentence pairs with less complicated inter-sentence interactions (e.g., many identical words in the two sentences) and incorrect ground truth labels (e.g., “What is your biggest regret in life?” and “What’s the biggest regret you’ve had in life?” are labeled as non-duplicate questions by mistake).

3.3 Learning Curves and Training Time

Figure 3 shows the learning curves. The DecAtt model converges quickly and performs well on large NLI datasets due to its design simplicity. PWIM is the slowest model (see time comparison in Table 5) but shows very strong performance on semantic similarity and paraphrase identification datasets. ESIM and SSE keep a good balance between training time and performance.

3.4 Effects of Training Data Size

As shown in Figure 5, we experimented with different training set sizes on SNLI, the largest dataset. All the models show improved performance as we increase the training size. ESIM and SSE have very similar trends and clearly outperform PWIM on the SNLI dataset. DecAtt shows a performance jump when the training size exceeds a threshold.

3.5 Categorical Performance Comparison

We conducted an in-depth analysis of model performance on the Multi-domain NLI dataset based on different categories: text genre, sentence pair overlap, and sentence length. As shown in Table 6, all models have comparable performance between matched genre and unmatched genre. Sentence length and overlap turn out to be two important factors – the longer the sentences and the fewer tokens in common, the more challenging it is to determine their semantic relationship. These phenomena shared by the state-of-the-art systems reflect their similar design framework which is symmetric at processing both sentences in the pair, while question answering and natural language inference tasks are directional [Ghaeini et al., 2018]. How to incorporate asymmetry into model design will be worth more exploration in future research.

3.6 Transfer Learning Experiments

In addition to the cross-domain study (Table 6), we conducted transfer learning experiments on three paraphrase identification datasets (Table 5). The most noteworthy phenomenon is that the SSE model performs better on Twitter-URL and PIT-2015 when trained on the large out-of-domain Quora data than when trained on the small in-domain training data. Two likely reasons are: (i) the SSE model, with over 29 million parameters, is data hungry and (ii) the SSE model is a sentence encoding model, which generalizes better across domains/tasks than sentence pair interaction models. Sentence pair interaction models may encounter difficulties on Quora, which contains sentence pairs with the highest word overlap (51.5%) among all datasets and often causes the interaction patterns to focus on a few key words that differ. In contrast, the Twitter-URL dataset has the lowest overlap (23.0%) with a semantic relationship that is mainly based on the intention of the tweets.

Conclusion

We analyzed five different neural models (and their variations) for sentence pair modeling and conducted a series of experiments with eight representative datasets for different NLP tasks. We quantified the importance of the LSTM encoder and attentive alignment for inter-sentence interaction, as well as the transfer learning ability of sentence encoding based models. We showed that the SNLI corpus of over 550k sentence pairs cannot saturate the learning curve. We systematically compared the strengths and weaknesses of different network designs and provided insights for future work.

Acknowledgements

We thank Ohio Supercomputer Center [Center, 2012] for computing resources. This work was supported in part by NSF CRII award (RI-1755898) and DARPA through the ARO (W911NF-17-C-0095). The content of the information in this document does not necessarily reflect the position or the policy of the U.S. Government, and no official endorsement should be inferred.

References

Appendix A Pretrained Word Embeddings

We used the 200-dimensional GloVe word vectors [Pennington et al., 2014], trained on 27 billion words from Twitter (vocabulary size of 1.2 million words) for Twitter URL [Lan et al., 2017] and PIT-2015 [Xu et al., 2015] datasets, and the 300-dimensional GloVe vectors, trained on 840 billion words (vocabulary size of 2.2 million words) from Common Crawl for all other datasets. For out-of-vocabulary words, we initialized the word vectors using normal distribution with mean and deviation $1$ .

Appendix B Hyper-parameter Settings

We followed the original papers or code implementations to set the hyper-parameters for these models. In the InferSent model [Conneau et al., 2017], the hidden dimension size for the Bi-LSTM is 2048, and the fully connected layers have 512 hidden units. In the SSE model [Nie and Bansal, 2017], the hidden sizes for the three Bi-LSTMs are 512, 1024 and 2048, respectively, and the fully connected layers have 1600 units. PWIM [He and Lin, 2016] and ESIM [Chen et al., 2017] both use Bi-LSTM for context encoding, with 200 and 300 hidden units, respectively. The DecAtt model [Parikh et al., 2016] uses three kinds of feedforward networks, all of which have 300 hidden units. Other parameters, such as learning rate, batch size, and dropout rate, use the same settings as in the original papers.

Appendix C Fine-tuning the Models

It is not practical to fine-tune every hyper-parameter for every model on every dataset. Moreover, since we want to show how well these models generalize to other datasets, we avoid tuning these parameters on specific datasets, which could easily produce over-fitted models. Therefore, we keep the hyper-parameters unchanged across different datasets to demonstrate the generalization capability of each model. The default number of training epochs is set to 20; if a model converged earlier (no further performance gain on the development set), we stopped training before epoch 20. In our experiments, 20 epochs were enough for every model to converge on every dataset.