Bilateral Multi-Perspective Matching for Natural Language Sentences

Zhiguo Wang, Wael Hamza, Radu Florian

Introduction

Natural language sentence matching (NLSM) is the task of comparing two sentences and identifying the relationship between them. It is a fundamental technology for a variety of tasks. For example, in a paraphrase identification task, NLSM is used to determine whether two sentences are paraphrase or not Yin et al. (2015). For a natural language inference task, NLSM is utilized to judge whether a hypothesis sentence can be inferred from a premise sentence Bowman et al. (2015). For question answering and information retrieval tasks, NLSM is employed to assess the relevance between query-answer pairs and rank all the candidate answers Wang et al. (2016d). For machine comprehension tasks, NLSM is used for matching a passage with a question and pointing out the correct answer span Wang et al. (2016b).

With the renaissance of neural network models LeCun et al. (2015); Peng et al. (2015a, 2016), two types of deep learning frameworks were proposed for NLSM. The first framework is based on the “Siamese” architecture Bromley et al. (1993). In this framework, the same neural network encoder (e.g., a CNN or a RNN) is applied to two input sentences individually, so that both of the two sentences are encoded into sentence vectors in the same embedding space. Then, a matching decision is made solely based on the two sentence vectors Bowman et al. (2015); Tan et al. (2015). The advantage of this framework is that sharing parameters makes the model smaller and easier to train, and the sentence vectors can be used for visualization, sentence clustering and many other purposes Wang et al. (2016c). However, a disadvantage is that there is no explicit interaction between the two sentences during the encoding procedure, which may lose some important information. To deal with this problem, a second framework “matching-aggregation” has been proposed Wang and Jiang (2016); Wang et al. (2016d). Under this framework, smaller units (such as words or contextual vectors) of the two sentences are firstly matched, and then the matching results are aggregated (by a CNN or a LSTM) into a vector to make the final decision. The new framework captures more interactive features between the two sentences, therefore it acquires significant improvements. However, the previous “matching-aggregation” approaches still have some limitations. First, some of the approaches only explored the word-by-word matching Rocktäschel et al. (2015), but ignored other granular matchings (e.g., phrase-by-sentence); Second, the matching is only performed in a single direction (e.g., matching PP against QQ) Wang and Jiang (2015), but neglected the reverse direction (e.g., matching QQ against PP).

In this paper, to tackle these limitations, we propose a bilateral multi-perspective matching (BiMPM) model for NLSM tasks. Our model essentially belongs to the “matching-aggregation” framework. Given two sentences PP and QQ, our model first encodes them with a bidirectional Long Short-Term Memory Network (BiLSTM). Next, we match the two encoded sentences in two directions PQP\rightarrow Q and PQP\leftarrow Q. In each matching direction, let’s say PQP\rightarrow Q, each time step of QQ is matched against all time-steps of PP from multiple perspectives. Then, another BiLSTM layer is utilized to aggregate the matching results into a fixed-length matching vector. Finally, based on the matching vector, a decision is made through a fully connected layer. We evaluate our model on three NLSM tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves the state-of-the-art performance on all tasks.

In following parts, we start with a brief definition of the NLSM task (Section 2), followed by the details of our model (Section 3). Then we evaluate our model on standard benchmark datasets (Section 4). We talk about related work in Section 5, and conclude this work in Section 6.

Task Definition

Formally, we can represent each example of the NLSM task as a triple (P,Q,y)(P,Q,y), where P=(p1,...,pj,...,pM)P=(p_{1},...,p_{j},...,p_{M}) is a sentence with a length MM, Q=(q1,...,qi,...,qN)Q=(q_{1},...,q_{i},...,q_{N}) is the second sentence with a length NN, yYy\in\mathcal{Y} is the label representing the relationship between PP and QQ, and Y\mathcal{Y} is a set of task-specific labels. The NLSM task can be represented as estimating a conditional probability Pr(yP,Q)\Pr{(y|P,Q)} based on the training set, and predicting the relationship for testing examples by y=arg maxyYPr(yP,Q)y^{*}=\operatorname*{arg\,max}_{y\in\mathcal{Y}}\Pr(y|P,Q). Concretely, for a paraphrase identification task, PP and QQ are two sentences, Y={0,1}\mathcal{Y}=\{0,1\}, where y=1y=1 means that PP and QQ are paraphrase of each other, and y=0y=0 otherwise. For a natural language inference task, PP is a premise sentence, QQ is a hypothesis sentence, and Y={entailment,contradiction,neutral}\mathcal{Y}=\{entailment,contradiction,neutral\} where entailmententailment indicates QQ can be inferred from PP, contradictioncontradiction indicates QQ cannot be true condition on PP, and neutralneutral means PP and QQ are irrelevant to each other. In an answer sentence selection task, PP is a question, QQ is a candidate answer, and Y={0,1}\mathcal{Y}=\{0,1\} where y=1y=1 means QQ is a correct answer for PP, and y=0y=0 otherwise.

Method

In this section, we first give a high-level overview of our model in Sub-section 3.1, and then give more details about our novel multi-perspective matching operation in Sub-section 3.2.

We propose a bilateral multi-perspective matching (BiMPM) model to estimate the probability distribution Pr(yP,Q)\Pr(y|P,Q). Our model belongs to the “matching-aggregation” framework Wang and Jiang (2016). Contrarily to previous “matching-aggregation” approaches, our model matches PP and QQ in two directions (PQP\rightarrow Q and PQP\leftarrow Q). In each individual direction, our model matches the two sentences from multiple perspectives. Figure 1 shows the architecture of our model. Given a pair of sentences PP and QQ, the BiMPM model estimates the probability distribution Pr(yP,Q)\Pr(y|P,Q) through the following five layers.

Word Representation Layer. The goal of this layer is to represent each word in PP and QQ with a dd-dimensional vector. We construct the dd-dimensional vector with two components: a word embedding and a character-composed embedding. The word embedding is a fixed vector for each individual word, which is pre-trained with GloVe Pennington et al. (2014) or word2vec Mikolov et al. (2013). The character-composed embedding is calculated by feeding each character (represented as a character embedding) within a word into a Long Short-Term Memory Network (LSTM) Hochreiter and Schmidhuber (1997), where the character embeddings are randomly initialized and learned jointly with other network parameters from NLSM tasks. The output of this layer are two sequences of word vectors P:[p1,...,pM]P:[\textbf{\emph{p}}_{1},...,\textbf{\emph{p}}_{M}] and Q:[q1,...,qN]Q:[\textbf{\emph{q}}_{1},...,\textbf{\emph{q}}_{N}].

Context Representation Layer. The purpose of this layer is to incorporate contextual information into the representation of each time step of PP and QQ. We utilize a bi-directional LSTM (BiLSTM) to encode contextual embeddings for each time-step of PP.

Meanwhile, we apply the same BiLSTM to encode QQ:

Matching Layer. This is the core layer within our model. The goal of this layer is to compare each contextual embedding (time-step) of one sentence against all contextual embeddings (time-steps) of the other sentence. As shown in Figure 1, we will match the two sentences PP and QQ in two directions: match each time-step of PP against all time-steps of QQ, and match each time-step of QQ against all time-steps of PP. To match one time-step of a sentence against all time-steps of the other sentence, we design a multi-perspective matching operation \otimes. We will give more details about this operation in Sub-section 3.2. The output of this layer are two sequences of matching vectors (right above the operation \otimes in Figure 1), where each matching vector corresponds to the matching result of one time-step against all time-steps of the other sentence.

Aggregation Layer. This layer is employed to aggregate the two sequences of matching vectors into a fixed-length matching vector. We utilize another BiLSTM model, and apply it to the two sequences of matching vectors individually. Then, we construct the fixed-length matching vector by concatenating (the four green) vectors from the last time-step of the BiLSTM models.

Prediction Layer. The purpose of this layer is to evaluate the probability distribution Pr(yP,Q)\Pr(y|P,Q). To this end, we employ a two layer feed-forward neural network to consume the fixed-length matching vector, and apply the softmaxsoftmax function in the output layer. The number of nodes in the output layer is set based on each specific task described in Section 2.

2 Multi-perspective Matching Operation

We define the multi-perspective matching operation \otimes in following two steps:

First, we define a multi-perspective cosine matching function fmf_{m} to compare two vectors

where v1\bm{v}_{1} and v2\bm{v}_{2} are two dd-dimensional vectors, Wl×d\bm{W}\in\Re^{l\times d} is a trainable parameter with the shape l×dl\times d, ll is the number of perspectives, and the returned value m\bm{m} is a ll-dimensional vector m=[m1,...,mk,...,ml]\bm{m}=[m_{1},...,m_{k},...,m_{l}]. Each element mkmm_{k}\in\bm{m} is a matching value from the kk-th perspective, and it is calculated by the cosine similarity between two weighted vectors

where \circ is the element-wise multiplication, and WkW_{k} is the kk-th row of W\bm{W}, which controls the kk-th perspective and assigns different weights to different dimensions of the dd-dimensional space.

Second, based on fmf_{m}, we define four matching strategies to compare each time-step of one sentence against all time-steps of the other sentence. To avoid repetition, we only define these matching strategies for one matching direction PQP\rightarrow Q. The readers can infer equations for the reverse direction easily.

(1) Full-Matching. Figure 2 (a) shows the diagram of this matching strategy. In this strategy, each forward (or backward) contextual embedding hip\overrightarrow{\bm{h}}_{i}^{p} (or hip\overleftarrow{\bm{h}}_{i}^{p}) is compared with the last time step of the forward (or backward) representation of the other sentence hNq\overrightarrow{\bm{h}}_{N}^{q} (or h1q\overleftarrow{\bm{h}}_{1}^{q}).

(2) Maxpooling-Matching. Figure 2 (b) gives the diagram of this matching strategy. In this strategy, each forward (or backward) contextual embedding hip\overrightarrow{\bm{h}}_{i}^{p} (or hip\overleftarrow{\bm{h}}_{i}^{p}) is compared with every forward (or backward) contextual embeddings of the other sentence hjq\overrightarrow{\bm{h}}_{j}^{q} (or hjq\overleftarrow{\bm{h}}_{j}^{q}) for j(1...N)j\in(1...N), and only the maximum value of each dimension is retained.

(3) Attentive-Matching. Figure 2 (c) shows the diagram of this matching strategy. We first calculate the cosine similarities between each forward (or backward) contextual embedding hip\overrightarrow{\bm{h}}_{i}^{p} (or hip\overleftarrow{\bm{h}}_{i}^{p}) and every forward (or backward) contextual embeddings of the other sentence hjq\overrightarrow{\bm{h}}_{j}^{q} (or hjq\overleftarrow{\bm{h}}_{j}^{q}):

Then, we take αi,j\overrightarrow{\alpha}_{i,j} (or αi,j\overleftarrow{\alpha}_{i,j}) as the weight of hjq\overrightarrow{\bm{h}}_{j}^{q} (or hjq\overleftarrow{\bm{h}}_{j}^{q}), and calculate an attentive vector for the entire sentence QQ by weighted summing all the contextual embeddings of QQ:

Finally, we match each forward (or backward) contextual embedding of hip\overrightarrow{\bm{h}}_{i}^{p} (or hip\overleftarrow{\bm{h}}_{i}^{p}) with its corresponding attentive vector:

(4) Max-Attentive-Matching. Figure 2 (d) shows the diagram of this matching strategy. This strategy is similar to the Attentive-Matching strategy. However, instead of taking the weighed sum of all the contextual embeddings as the attentive vector, we pick the contextual embedding with the highest cosine similarity as the attentive vector. Then, we match each contextual embedding of the sentence PP with its new attentive vector.

We apply all these four matching strategies to each time-step of the sentence PP, and concatenate the generated eight vectors as the matching vector for each time-step of PP. We also perform the same process for the reverse matching direction.

Experiments

In this section, we evaluate our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. We will first introduce the general setting of our BiMPM models in Sub-section 4.1. Then, we demonstrate the properties of our model through some ablation studies in Sub-section 4.2. Finally, we compare our model with state-of-the-art models on some standard benchmark datasets in Sub-section 4.3, 4.4 and 4.5.

We initialize word embeddings in the word representation layer with the 300-dimensional GloVe word vectors pre-trained from the 840B Common Crawl corpus Pennington et al. (2014). For the out-of-vocabulary (OOV) words, we initialize the word embeddings randomly. For the character-composed embeddings, we initialize each character as a 20-dimensional vector, and compose each word into a 50-dimensional vector with a LSTM layer. We set the hidden size as 100 for all BiLSTM layers. We apply dropout to every layers in Figure 1, and set the dropout ratio as 0.1. To train the model, we minimize the cross entropy of the training set, and use the ADAM optimizer Kingma and Ba (2014) to update parameters. We set the learning rate as 0.001. During training, we do not update the pre-trained word embeddings. For all the experiments, we pick the model which works the best on the dev set, and then evaluate it on the test set.

2 Model Properties

To demonstrate the properties of our model, we choose the paraphrase identification task, and experiment on the “Quora Question Pairs” dataset https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs. This dataset consists of over 400,000 question pairs, and each question pair is annotated with a binary value indicating whether the two questions are paraphrase of each other. We randomly select 5,000 paraphrases and 5,000 non-paraphrases as the dev set, and sample another 5,000 paraphrases and 5,000 non-paraphrases as the test set. We keep the remaining instances as the training set We will release our source code and the dataset partition at https://zhiguowang.github.io/ ..

First, we study the influence of our multi-perspective cosine matching function in Eq.(3). We vary the number of perspectives ll among {1, 5, 10, 15, 20}Due to practical limitations, we did not experiment with more perspectives., and keep the other options unchanged. We also build a baseline model by replacing Eq.(3) with the vanilla cosine similarity function. Figure 3 shows the performance curve on the dev set, where l=0l=0 corresponds to the performance of our baseline model. We can see that, even if we only utilize one perspective (l=1l=1), our model gets a significant improvement. When increasing the number of perspectives, the performance improves significantly. Therefore, our multi-perspective cosine matching function is really effective for matching vectors.

Second, to check the effectiveness of bilateral matching, we build two ablation models to matching sentences in only a single direction: 1) “Only PQP\rightarrow Q” which only matches PP against QQ; 2) “Only PQP\leftarrow Q” which only matches QQ against PP. Table 1 shows the performance on the dev set. Comparing the two ablation models with the “Full Model”, we can observe that single direction matching hurts the performance for about 1 percent. Therefore, matching sentences in two directions is really necessary for acquiring better performance.

Third, we evaluate the effectiveness of different matching strategies. To this end, we construct four ablation models (w/o Full-Matching, w/o Maxpooling-Matching, w/o Attentive-Matching, w/o Max-Attentive-Matching) by eliminating a matching strategy at each time. Table 1 shows the performance on the dev set. We can see that eliminating any of the matching strategies would hurt the performance significantly.

3 Experiments on Paraphrase Identification

In this Sub-section, we compare our model with state-of-the-art models on the paraphrase identification task. We still experiment on the “Quora Question Pairs” dataset, and use the same dataset partition as Sub-section 4.2. This dataset is a brand-new dataset, and no previous results have been published yet. Therefore, we implemented three types of baseline models.

First, under the Siamese framework, we implement two baseline models: “Siamese-CNN” and “Siamese-LSTM”. Both of the two models encode two input sentences into sentence vectors with a neural network encoder, and make a decision based on the cosine similarity between the two sentence vectors. But they implement the sentence encoder with a CNN and a LSTM respectively. We design the CNN and the LSTM model according to the architectures in Wang et al. (2016c).

Second, based on the two baseline models, we implement two more baseline models “Multi-Perspective-CNN” and “Multi-Perspective-LSTM”. In these two models, we change the cosine similarity calculation layer with our multi-perspective cosine matching function in Eq.(3), and apply a fully-connected layer (with sigmoidsigmoid function on the top) to make the prediction.

Third, we re-implement the “L.D.C.” model proposed by Wang et al. (2016d), which is a model under the “matching-aggregation” framework and acquires the state-of-the-art performance on several tasks.

Table 2 shows the performances of all baseline models and our “BiMPM” model. We can see that “Multi-Perspective-CNN” (or “Multi-Perspective-LSTM”) works much better than “Siamese-CNN” (or “Siamese-LSTM”), which further indicates that our multi-perspective cosine matching function (Eq.(3)) is very effective for matching vectors. Our “BiMPM” model outperforms the “L.D.C.” model by more than two percent. Therefore, our model is very effective for the paraphrase identification task.

4 Experiments on Natural Language Inference

In this Sub-section, we evaluate our model on the natural language inference task over the SNLI dataset Bowman et al. (2015). We test four variations of our model on this dataset, where “Only PQP\rightarrow Q” and “Only PQP\leftarrow Q” are the single direction matching models described in Sub-section 4.2, “BiMPM” is our full model, and “BiMPM (Ensemble)” is an ensemble version of our “BiMPM” model. We design the ensemble model by simply averaging the probability distributions Peng et al. (2015b, 2017) of four “BiMPM” models, and each of the “BiMPM” model has the same architecture, but is initialized with a different seed.

Table 3 shows the performances of the state-of-the-art models and our models. First, we can see that “Only PQP\leftarrow Q” works significantly better than “Only PQP\rightarrow Q”, which tells us that, for natural language inference, matching the hypothesis against the premise is more effective than the other way around. Second, our “BiMPM” model works much better than “Only PQP\leftarrow Q”, which reveals that matching premise against the hypothesis can also bring some benefits. Finally, comparing our models with all the state-of-the-art models, we can observe that our single model “BiMPM” is on par with the state-of-the-art single models, and our ‘BiMPM (Ensemble)” works much better than “Chen et al. (2016) (Ensemble)”. Therefore, our models achieve the state-of-the-art performance in both single and ensemble scenarios for the natural language inference task.

5 Experiments on Answer Sentence Selection

In this Sub-section, we study the effectiveness of our model for answer sentence selection tasks. The answer sentence selection task is to rank a list of candidate answer sentences based on their similarities to the question, and the performance is measured by the mean average precision (MAP) and mean reciprocal rank (MRR). We experiment on two datasets: TREC-QA Wang et al. (2007) and WikiQA Yang et al. (2015). Experimental results of the state-of-the-art models Rao et al. (2016) pointed out that there are two versions of TREC-QA dataset: raw-version and clean-version. In this work, we utilized the clean-version. Therefore, we only compare with approaches reporting performance on this dataset. and our “BiMPM” model are listed in Table 4, where the performances are evaluated with the standard trec_eval-8.0 script http://trec.nist.gove/trec_eval/. We can see that the performance from our model is on par with the state-of-the-art models. Therefore, our model is also effective for answer sentence selection tasks.

Related Work

Natural language sentence matching (NLSM) has been studied for many years. Early approaches focused on designing hand-craft features to capture n-gram overlapping, word reordering and syntactic alignments phenomena Heilman and Smith (2010); Wang and Ittycheriah (2015). This kind of method can work well on a specific task or dataset, but it’s hard to generalize well to other tasks.

With the availability of large-scale annotated datasets Bowman et al. (2015), many deep learning models were proposed for NLSM. The first kind of framework is based the Siamese architecture Bromley et al. (1993), where sentences are encoded into sentence vectors based on some neural network encoders, and then the relationship between two sentences was decided solely based on the two sentence vectors Bowman et al. (2015); Yang et al. (2015); Tan et al. (2015). However, this kind of framework ignores the fact that the lower level interactive features between two sentences are indispensable. Therefore, many neural network models were proposed to match sentences from multiple level of granularity Yin et al. (2015); Wang and Jiang (2016); Wang et al. (2016d). Experimental results on many tasks have proofed that the new framework works significantly better than the previous methods. Our model also belongs to this framework, and we have shown its effectiveness in Section 4.

Conclusion

In this work, we propose a bilateral multi-perspective matching (BiMPM) model under the “matching-aggregation” framework. Different from the previous “matching-aggregation” approaches, our model matches sentences PP and QQ in two directions (PQP\rightarrow Q and PQP\leftarrow Q). And, in each individual direction, our model matches the two sentences from multiple perspectives. We evaluated our model on three tasks: paraphrase identification, natural language inference and answer sentence selection. Experimental results on standard benchmark datasets show that our model achieves the state-of-the-art performance on all tasks.

References