Bilateral Multi-Perspective Matching for Natural Language Sentences

Zhiguo Wang, Wael Hamza, Radu Florian

Introduction

Natural language sentence matching (NLSM) is the task of comparing two sentences and identifying the relationship between them. It is a fundamental technology for a variety of tasks. For example, in a paraphrase identification task, NLSM is used to determine whether two sentences are paraphrase or not Yin et al. (2015). For a natural language inference task, NLSM is utilized to judge whether a hypothesis sentence can be inferred from a premise sentence Bowman et al. (2015). For question answering and information retrieval tasks, NLSM is employed to assess the relevance between query-answer pairs and rank all the candidate answers Wang et al. (2016d). For machine comprehension tasks, NLSM is used for matching a passage with a question and pointing out the correct answer span Wang et al. (2016b).

With the renaissance of neural network models LeCun et al. (2015); Peng et al. (2015a, 2016), two types of deep learning frameworks were proposed for NLSM. The first framework is based on the “Siamese” architecture Bromley et al. (1993). In this framework, the same neural network encoder (e.g., a CNN or a RNN) is applied to two input sentences individually, so that both of the two sentences are encoded into sentence vectors in the same embedding space. Then, a matching decision is made solely based on the two sentence vectors Bowman et al. (2015); Tan et al. (2015). The advantage of this framework is that sharing parameters makes the model smaller and easier to train, and the sentence vectors can be used for visualization, sentence clustering and many other purposes Wang et al. (2016c). However, a disadvantage is that there is no explicit interaction between the two sentences during the encoding procedure, which may lose some important information. To deal with this problem, a second framework “matching-aggregation” has been proposed Wang and Jiang (2016); Wang et al. (2016d). Under this framework, smaller units (such as words or contextual vectors) of the two sentences are firstly matched, and then the matching results are aggregated (by a CNN or a LSTM) into a vector to make the final decision. The new framework captures more interactive features between the two sentences, therefore it acquires significant improvements. However, the previous “matching-aggregation” approaches still have some limitations. First, some of the approaches only explored the word-by-word matching Rocktäschel et al. (2015), but ignored other granular matchings (e.g., phrase-by-sentence); Second, the matching is only performed in a single direction (e.g., matching $P$ against $Q$ ) Wang and Jiang (2015), but neglected the reverse direction (e.g., matching $Q$ against $P$ ).

In this paper, to tackle these limitations, we propose a bilateral multi-perspective matching (BiMPM) model for NLSM tasks. Our model essentially belongs to the “matching-aggregation” framework. Given two sentences $P$ and $Q$ , our model first encodes them with a bidirectional Long Short-Term Memory Network (BiLSTM). Next, we match the two encoded sentences in two directions $P\rightarrow Q$ and $P\leftarrow Q$ . In each matching direction, let’s say $P\rightarrow Q$ , each time step of $Q$ is matched against all time-steps of $P$ from multiple perspectives. Then, another BiLSTM layer is utilized to aggregate the matching results into a fixed-length matching vector. Finally, based on the matching vector, a decision is made through a fully connected layer. We evaluate our model on three NLSM tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves the state-of-the-art performance on all tasks.

In following parts, we start with a brief definition of the NLSM task (Section 2), followed by the details of our model (Section 3). Then we evaluate our model on standard benchmark datasets (Section 4). We talk about related work in Section 5, and conclude this work in Section 6.

Task Definition

Formally, we can represent each example of the NLSM task as a triple $(P,Q,y)$ , where $P=(p_{1},...,p_{j},...,p_{M})$ is a sentence with a length $M$ , $Q=(q_{1},...,q_{i},...,q_{N})$ is the second sentence with a length $N$ , $y\in\mathcal{Y}$ is the label representing the relationship between $P$ and $Q$ , and $\mathcal{Y}$ is a set of task-specific labels. The NLSM task can be represented as estimating a conditional probability $\Pr{(y|P,Q)}$ based on the training set, and predicting the relationship for testing examples by $y^{*}=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\Pr(y|P,Q)$ . Concretely, for a paraphrase identification task, $P$ and $Q$ are two sentences, $\mathcal{Y}=\{0,1\}$ , where $y=1$ means that $P$ and $Q$ are paraphrase of each other, and $y=0$ otherwise. For a natural language inference task, $P$ is a premise sentence, $Q$ is a hypothesis sentence, and $\mathcal{Y}=\{entailment,contradiction,neutral\}$ where $entailment$ indicates $Q$ can be inferred from $P$ , $contradiction$ indicates $Q$ cannot be true condition on $P$ , and $neutral$ means $P$ and $Q$ are irrelevant to each other. In an answer sentence selection task, $P$ is a question, $Q$ is a candidate answer, and $\mathcal{Y}=\{0,1\}$ where $y=1$ means $Q$ is a correct answer for $P$ , and $y=0$ otherwise.

Method

In this section, we first give a high-level overview of our model in Sub-section 3.1, and then give more details about our novel multi-perspective matching operation in Sub-section 3.2.

We propose a bilateral multi-perspective matching (BiMPM) model to estimate the probability distribution $\Pr(y|P,Q)$ . Our model belongs to the “matching-aggregation” framework Wang and Jiang (2016). Contrarily to previous “matching-aggregation” approaches, our model matches $P$ and $Q$ in two directions ( $P\rightarrow Q$ and $P\leftarrow Q$ ). In each individual direction, our model matches the two sentences from multiple perspectives. Figure 1 shows the architecture of our model. Given a pair of sentences $P$ and $Q$ , the BiMPM model estimates the probability distribution $\Pr(y|P,Q)$ through the following five layers.

Word Representation Layer. The goal of this layer is to represent each word in $P$ and $Q$ with a $d$ -dimensional vector. We construct the $d$ -dimensional vector with two components: a word embedding and a character-composed embedding. The word embedding is a fixed vector for each individual word, which is pre-trained with GloVe Pennington et al. (2014) or word2vec Mikolov et al. (2013). The character-composed embedding is calculated by feeding each character (represented as a character embedding) within a word into a Long Short-Term Memory Network (LSTM) Hochreiter and Schmidhuber (1997), where the character embeddings are randomly initialized and learned jointly with other network parameters from NLSM tasks. The output of this layer are two sequences of word vectors $P:[\textbf{\emph{p}}_{1},...,\textbf{\emph{p}}_{M}]$ and $Q:[\textbf{\emph{q}}_{1},...,\textbf{\emph{q}}_{N}]$ .

Context Representation Layer. The purpose of this layer is to incorporate contextual information into the representation of each time step of $P$ and $Q$ . We utilize a bi-directional LSTM (BiLSTM) to encode contextual embeddings for each time-step of $P$ .

Meanwhile, we apply the same BiLSTM to encode $Q$ :

Matching Layer. This is the core layer within our model. The goal of this layer is to compare each contextual embedding (time-step) of one sentence against all contextual embeddings (time-steps) of the other sentence. As shown in Figure 1, we will match the two sentences $P$ and $Q$ in two directions: match each time-step of $P$ against all time-steps of $Q$ , and match each time-step of $Q$ against all time-steps of $P$ . To match one time-step of a sentence against all time-steps of the other sentence, we design a multi-perspective matching operation $\otimes$ . We will give more details about this operation in Sub-section 3.2. The output of this layer are two sequences of matching vectors (right above the operation $\otimes$ in Figure 1), where each matching vector corresponds to the matching result of one time-step against all time-steps of the other sentence.

Aggregation Layer. This layer is employed to aggregate the two sequences of matching vectors into a fixed-length matching vector. We utilize another BiLSTM model, and apply it to the two sequences of matching vectors individually. Then, we construct the fixed-length matching vector by concatenating (the four green) vectors from the last time-step of the BiLSTM models.

Prediction Layer. The purpose of this layer is to evaluate the probability distribution $\Pr(y|P,Q)$ . To this end, we employ a two layer feed-forward neural network to consume the fixed-length matching vector, and apply the $softmax$ function in the output layer. The number of nodes in the output layer is set based on each specific task described in Section 2.

2 Multi-perspective Matching Operation

We define the multi-perspective matching operation $\otimes$ in following two steps:

First, we define a multi-perspective cosine matching function $f_{m}$ to compare two vectors

where $\bm{v}_{1}$ and $\bm{v}_{2}$ are two $d$ -dimensional vectors, $\bm{W}\in\Re^{l\times d}$ is a trainable parameter with the shape $l\times d$ , $l$ is the number of perspectives, and the returned value $\bm{m}$ is a $l$ -dimensional vector $\bm{m}=[m_{1},...,m_{k},...,m_{l}]$ . Each element $m_{k}\in\bm{m}$ is a matching value from the $k$ -th perspective, and it is calculated by the cosine similarity between two weighted vectors

where $\circ$ is the element-wise multiplication, and $W_{k}$ is the $k$ -th row of $\bm{W}$ , which controls the $k$ -th perspective and assigns different weights to different dimensions of the $d$ -dimensional space.

Second, based on $f_{m}$ , we define four matching strategies to compare each time-step of one sentence against all time-steps of the other sentence. To avoid repetition, we only define these matching strategies for one matching direction $P\rightarrow Q$ . The readers can infer equations for the reverse direction easily.

(1) Full-Matching. Figure 2 (a) shows the diagram of this matching strategy. In this strategy, each forward (or backward) contextual embedding $\overrightarrow{\bm{h}}_{i}^{p}$ (or $\overleftarrow{\bm{h}}_{i}^{p}$ ) is compared with the last time step of the forward (or backward) representation of the other sentence $\overrightarrow{\bm{h}}_{N}^{q}$ (or $\overleftarrow{\bm{h}}_{1}^{q}$ ).

(2) Maxpooling-Matching. Figure 2 (b) gives the diagram of this matching strategy. In this strategy, each forward (or backward) contextual embedding $\overrightarrow{\bm{h}}_{i}^{p}$ (or $\overleftarrow{\bm{h}}_{i}^{p}$ ) is compared with every forward (or backward) contextual embeddings of the other sentence $\overrightarrow{\bm{h}}_{j}^{q}$ (or $\overleftarrow{\bm{h}}_{j}^{q}$ ) for $j\in(1...N)$ , and only the maximum value of each dimension is retained.

(3) Attentive-Matching. Figure 2 (c) shows the diagram of this matching strategy. We first calculate the cosine similarities between each forward (or backward) contextual embedding $\overrightarrow{\bm{h}}_{i}^{p}$ (or $\overleftarrow{\bm{h}}_{i}^{p}$ ) and every forward (or backward) contextual embeddings of the other sentence $\overrightarrow{\bm{h}}_{j}^{q}$ (or $\overleftarrow{\bm{h}}_{j}^{q}$ ):

Then, we take $\overrightarrow{\alpha}_{i,j}$ (or $\overleftarrow{\alpha}_{i,j}$ ) as the weight of $\overrightarrow{\bm{h}}_{j}^{q}$ (or $\overleftarrow{\bm{h}}_{j}^{q}$ ), and calculate an attentive vector for the entire sentence $Q$ by weighted summing all the contextual embeddings of $Q$ :

Finally, we match each forward (or backward) contextual embedding of $\overrightarrow{\bm{h}}_{i}^{p}$ (or $\overleftarrow{\bm{h}}_{i}^{p}$ ) with its corresponding attentive vector:

(4) Max-Attentive-Matching. Figure 2 (d) shows the diagram of this matching strategy. This strategy is similar to the Attentive-Matching strategy. However, instead of taking the weighed sum of all the contextual embeddings as the attentive vector, we pick the contextual embedding with the highest cosine similarity as the attentive vector. Then, we match each contextual embedding of the sentence $P$ with its new attentive vector.

We apply all these four matching strategies to each time-step of the sentence $P$ , and concatenate the generated eight vectors as the matching vector for each time-step of $P$ . We also perform the same process for the reverse matching direction.

Experiments

In this section, we evaluate our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. We will first introduce the general setting of our BiMPM models in Sub-section 4.1. Then, we demonstrate the properties of our model through some ablation studies in Sub-section 4.2. Finally, we compare our model with state-of-the-art models on some standard benchmark datasets in Sub-section 4.3, 4.4 and 4.5.

We initialize word embeddings in the word representation layer with the 300-dimensional GloVe word vectors pre-trained from the 840B Common Crawl corpus Pennington et al. (2014). For the out-of-vocabulary (OOV) words, we initialize the word embeddings randomly. For the character-composed embeddings, we initialize each character as a 20-dimensional vector, and compose each word into a 50-dimensional vector with a LSTM layer. We set the hidden size as 100 for all BiLSTM layers. We apply dropout to every layers in Figure 1, and set the dropout ratio as 0.1. To train the model, we minimize the cross entropy of the training set, and use the ADAM optimizer Kingma and Ba (2014) to update parameters. We set the learning rate as 0.001. During training, we do not update the pre-trained word embeddings. For all the experiments, we pick the model which works the best on the dev set, and then evaluate it on the test set.

2 Model Properties

To demonstrate the properties of our model, we choose the paraphrase identification task, and experiment on the “Quora Question Pairs” dataset https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs. This dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other. We randomly select 5,000 paraphrases and 5,000 non-paraphrases as the dev set, and sample another 5,000 paraphrases and 5,000 non-paraphrases as the test set. We keep the remaining instances as the training set We will release our source code and the dataset partition at https://zhiguowang.github.io/ ..

First, we study the influence of our multi-perspective cosine matching function in Eq.(3). We vary the number of perspectives $l$ among {1, 5, 10, 15, 20}Due to practical limitations, we did not experiment with more perspectives., and keep the other options unchanged. We also build a baseline model by replacing Eq.(3) with the vanilla cosine similarity function. Figure 3 shows the performance curve on the dev set, where $l=0$ corresponds to the performance of our baseline model. We can see that, even if we only utilize one perspective ( $l=1$ ), our model gets a significant improvement. When increasing the number of perspectives, the performance improves significantly. Therefore, our multi-perspective cosine matching function is really effective for matching vectors.

Second, to check the effectiveness of bilateral matching, we build two ablation models to matching sentences in only a single direction: 1) “Only $P\rightarrow Q$ ” which only matches $P$ against $Q$ ; 2) “Only $P\leftarrow Q$ ” which only matches $Q$ against $P$ . Table 1 shows the performance on the dev set. Comparing the two ablation models with the “Full Model”, we can observe that single direction matching hurts the performance for about 1 percent. Therefore, matching sentences in two directions is really necessary for acquiring better performance.

Third, we evaluate the effectiveness of different matching strategies. To this end, we construct four ablation models (w/o Full-Matching, w/o Maxpooling-Matching, w/o Attentive-Matching, w/o Max-Attentive-Matching) by eliminating a matching strategy at each time. Table 1 shows the performance on the dev set. We can see that eliminating any of the matching strategies would hurt the performance significantly.

3 Experiments on Paraphrase Identification

In this Sub-section, we compare our model with state-of-the-art models on the paraphrase identification task. We still experiment on the “Quora Question Pairs” dataset, and use the same dataset partition as Sub-section 4.2. This dataset is a brand-new dataset, and no previous results have been published yet. Therefore, we implemented three types of baseline models.

First, under the Siamese framework, we implement two baseline models: “Siamese-CNN” and “Siamese-LSTM”. Both of the two models encode two input sentences into sentence vectors with a neural network encoder, and make a decision based on the cosine similarity between the two sentence vectors. But they implement the sentence encoder with a CNN and a LSTM respectively. We design the CNN and the LSTM model according to the architectures in Wang et al. (2016c).

Second, based on the two baseline models, we implement two more baseline models “Multi-Perspective-CNN” and “Multi-Perspective-LSTM”. In these two models, we change the cosine similarity calculation layer with our multi-perspective cosine matching function in Eq.(3), and apply a fully-connected layer (with $sigmoid$ function on the top) to make the prediction.

Third, we re-implement the “L.D.C.” model proposed by Wang et al. (2016d), which is a model under the “matching-aggregation” framework and acquires the state-of-the-art performance on several tasks.

Table 2 shows the performances of all baseline models and our “BiMPM” model. We can see that “Multi-Perspective-CNN” (or “Multi-Perspective-LSTM”) works much better than “Siamese-CNN” (or “Siamese-LSTM”), which further indicates that our multi-perspective cosine matching function (Eq.(3)) is very effective for matching vectors. Our “BiMPM” model outperforms the “L.D.C.” model by more than two percent. Therefore, our model is very effective for the paraphrase identification task.

4 Experiments on Natural Language Inference

In this Sub-section, we evaluate our model on the natural language inference task over the SNLI dataset Bowman et al. (2015). We test four variations of our model on this dataset, where “Only $P\rightarrow Q$ ” and “Only $P\leftarrow Q$ ” are the single direction matching models described in Sub-section 4.2, “BiMPM” is our full model, and “BiMPM (Ensemble)” is an ensemble version of our “BiMPM” model. We design the ensemble model by simply averaging the probability distributions Peng et al. (2015b, 2017) of four “BiMPM” models, and each of the “BiMPM” model has the same architecture, but is initialized with a different seed.

Table 3 shows the performances of the state-of-the-art models and our models. First, we can see that “Only $P\leftarrow Q$ ” works significantly better than “Only $P\rightarrow Q$ ”, which tells us that, for natural language inference, matching the hypothesis against the premise is more effective than the other way around. Second, our “BiMPM” model works much better than “Only $P\leftarrow Q$ ”, which reveals that matching premise against the hypothesis can also bring some benefits. Finally, comparing our models with all the state-of-the-art models, we can observe that our single model “BiMPM” is on par with the state-of-the-art single models, and our ‘BiMPM (Ensemble)” works much better than “Chen et al. (2016) (Ensemble)”. Therefore, our models achieve the state-of-the-art performance in both single and ensemble scenarios for the natural language inference task.

5 Experiments on Answer Sentence Selection

In this Sub-section, we study the effectiveness of our model for answer sentence selection tasks. The answer sentence selection task is to rank a list of candidate answer sentences based on their similarities to the question, and the performance is measured by the mean average precision (MAP) and mean reciprocal rank (MRR). We experiment on two datasets: TREC-QA Wang et al. (2007) and WikiQA Yang et al. (2015). Experimental results of the state-of-the-art models Rao et al. (2016) pointed out that there are two versions of TREC-QA dataset: raw-version and clean-version. In this work, we utilized the clean-version. Therefore, we only compare with approaches reporting performance on this dataset. and our “BiMPM” model are listed in Table 4, where the performances are evaluated with the standard trec_eval-8.0 script http://trec.nist.gove/trec_eval/. We can see that the performance from our model is on par with the state-of-the-art models. Therefore, our model is also effective for answer sentence selection tasks.

Related Work

Natural language sentence matching (NLSM) has been studied for many years. Early approaches focused on designing hand-craft features to capture n-gram overlapping, word reordering and syntactic alignments phenomena Heilman and Smith (2010); Wang and Ittycheriah (2015). This kind of method can work well on a specific task or dataset, but it’s hard to generalize well to other tasks.

With the availability of large-scale annotated datasets Bowman et al. (2015), many deep learning models were proposed for NLSM. The first kind of framework is based the Siamese architecture Bromley et al. (1993), where sentences are encoded into sentence vectors based on some neural network encoders, and then the relationship between two sentences was decided solely based on the two sentence vectors Bowman et al. (2015); Yang et al. (2015); Tan et al. (2015). However, this kind of framework ignores the fact that the lower level interactive features between two sentences are indispensable. Therefore, many neural network models were proposed to match sentences from multiple level of granularity Yin et al. (2015); Wang and Jiang (2016); Wang et al. (2016d). Experimental results on many tasks have proofed that the new framework works significantly better than the previous methods. Our model also belongs to this framework, and we have shown its effectiveness in Section 4.

Conclusion

In this work, we propose a bilateral multi-perspective matching (BiMPM) model under the “matching-aggregation” framework. Different from the previous “matching-aggregation” approaches, our model matches sentences $P$ and $Q$ in two directions ( $P\rightarrow Q$ and $P\leftarrow Q$ ). And, in each individual direction, our model matches the two sentences from multiple perspectives. We evaluated our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves the state-of-the-art performance on all tasks.