Neural Matching Models for Question Retrieval and Next Question Prediction in Conversation

Liu Yang, Hamed Zamani, Yongfeng Zhang, Jiafeng Guo, W. Bruce Croft

Introduction

Because neural network models can go beyond term matching similarities and remove the need for feature engineering, neural matching models have recently achieved state-of-the-art performance in a number of information retrieval tasks. However, how well these models generalize across different tasks is relatively unstudied.

In this paper, we focus on two question ranking tasks. The first one is question retrieval: retrieving similar questions in response to a specific question. This task is useful in question answering and community question answering (CQA) applications. For instance, finding similar questions could help to improve the question answering accuracy or can help to avoid asking duplicate questions in CQA websites. Although neural approaches have been widely applied to answer sentence selection (He and Lin, 2016; Severyn and Moschitti, ; Tan et al., ) and similar question identification (Wang et al., 2017), the effectiveness of deep learning architectures for question retrieval is relatively unstudied. Therefore, we study a set of neural networks that can retrieve similar questions to a given question.

The second task is relevant to conversation models. Building intelligent systems that can hold meaningful conversations with humans has been one of the long-term goals of artificial intelligence. Human-computer conversation plays a critical role in many popular mobile search systems, intelligent assistants, and chat bot systems such as Google Assistant, Microsoft Cortana, Amazon Echo, and Apple Siri. Traditional conversational systems are based on hand-designed logic and features with natural language templates, which usually work only for restricted and predictable conversational inputs (Nakano et al., 2000; Lemon et al., 2006; Young et al., 2013). With rich big data resources on the Web, enhanced GPU computational infrastructures, and large amounts of labels derived from crowdsourcing and online user behaviors, end-to-end deep learning methods have begun to show promising results on conversation response ranking and generation tasks (Shang et al., 2015; Yan et al., 2016; Bordes and Weston, 2016; Li et al., 2016a, b; Sordoni et al., 2015). Motivated by these developments, we focus on a new type of conversational response ranking problem as the second task in the paper: predicting the next question in a conversation. During real conversations, humans can not only generate reasonable responses but also predict the new questions that other speakers are likely to ask. Learning models that can predict questions in conversations could enable us to better understand user intents during the conversations. Proactive content recommendations could be made without explicit questions issued by users. Furthermore, pre-selected answer sets could be generated based on question prediction results as a cache mechanism to improve the efficiency and effectiveness of conversational question answering systems. Table 1 shows a number of motivating examples of predicting questions in conversations.

Our neural network architecture for both tasks is inspired by previous work (He and Lin, 2016; Severyn and Moschitti, ; Tan et al., ; Wang and Jiang, 2016; Yan et al., 2016) that achieves impressive performance in different tasks. The designed siamese neural network models the long dependency of terms using a long short term memory (LSTM) layer. It further takes advantage of multiple convolutional and max pooling layers for representation learning of sequences based on the output of the LSTM layer. The network outputs a real-valued score for each candidate question and all candidate questions are ranked based on their matching score computed by the network.

We evaluate our models for the question retrieval task using the recently released Quora dataset. Our experiments demonstrate that the proposed neural network model outperforms state-of-the-art non-neural question retrieval approaches. The experiments also validate the hypothesis that neural matching models can complement exact term matching approaches in the question retrieval task; hence, a combination of the two is more appropriate. For the next question prediction task, we trained our model on the chat logs extracted from Ubuntu-related chat rooms on the Freenode Internet Relay Chat (IRC) network (http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/). Our experiments suggest that neural matching models could perform well for both tasks, which demonstrates the potential of neural matching models with representation learning for new applications and scenarios.

The contributions of this work are as follows: (1) We introduce and formalize the new task of next question prediction in conversations. (2) We study the effectiveness of neural matching models for question retrieval and predicting questions in conversations. Experimental results show that neural matching models perform well for both tasks.

Neural Matching Model

Figure 1 shows the architecture of the LSTM-CNN-Match model for matching sequences. This model is an extension of the CDNN model proposed by Severyn and Moschitti (Severyn and Moschitti, ) that has also been explored in various applications such as answer sentence selection (He and Lin, 2016; Tan et al., ; Wang and Jiang, 2016; Zhou et al., 2016; Yan et al., 2016; Yang et al., 2016). Compared with CDNN, this model adopts a long short term memory (LSTM) layer for long term dependency modeling in sequences. The convolutional layers run on the latent representations produced by the LSTM layer, instead of the raw word embedding sequence. In the following, we describe the model in more detail.

2. LSTM for Long Term Dependency Modeling

We use an LSTM layer to process $\mathbf{Q}$ and $\mathbf{P}$ for modeling long term dependency information in the sentences. LSTM (Hochreiter and Schmidhuber, 1997) is an advanced variant of recurrent neural networks (RNN). It can overcome the vanishing / exploding gradient problem of simpler vanilla RNNs with its memory cell and gating mechanisms. Each LSTM cell consists of a memory cell that stores information over a long history and three gates that control the information flow into and out of the memory cell. Given an input sequence $\mathbf{Q}=(x_{0},x_{1},...,x_{t})$ , where $x_{t}$ denotes the word embedding at position $t$ , the LSTM outputs a new representation matrix $\bar{\mathbf{Q}}$ that captures contextual information seen before in addition to the word at position $t$ itself based on the equations below:
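For reference, a standard formulation of the LSTM update that is consistent with the gate and state symbols used in this section is given here. The weight matrices $W^{(\cdot)}$ and $U^{(\cdot)}$, the biases $b^{(\cdot)}$, and the candidate update $u_{t}$ are assumed notation for illustration, not taken from the text:

$$
\begin{aligned}
i_{t} &= \sigma\left(W^{(i)}x_{t}+U^{(i)}h_{t-1}+b^{(i)}\right)\\
f_{t} &= \sigma\left(W^{(f)}x_{t}+U^{(f)}h_{t-1}+b^{(f)}\right)\\
o_{t} &= \sigma\left(W^{(o)}x_{t}+U^{(o)}h_{t-1}+b^{(o)}\right)\\
u_{t} &= \tanh\left(W^{(u)}x_{t}+U^{(u)}h_{t-1}+b^{(u)}\right)\\
c_{t} &= i_{t}\odot u_{t}+f_{t}\odot c_{t-1}\\
h_{t} &= o_{t}\odot\tanh(c_{t})
\end{aligned}
$$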

where $i,f,o$ denote the input, forget and output gates, respectively. $c$ is the stored information in the memory cells and $h$ is the learned representation. Thus $h_{t}$ corresponds to the $t$ -th column of the new representation matrix $\bar{\mathbf{Q}}$ , which encodes the $t$ -th word in $\mathbf{Q}$ with its context information. We also tried a bidirectional LSTM (Bi-LSTM), but found that it does not improve performance and leads to lower training efficiency compared with LSTM. Thus we use only a one-directional LSTM in our model.

3. Convolutional and Max Pooling Layers

Given the hidden representations learned by the LSTM layer, we use convolutional layers with different filter sizes and max pooling layers with different window sizes to learn sequence representations for generating the matching score. The convolution operation transforms the original feature map to a new feature map by moving the filters and computing the dot products of the filters with the corresponding feature map patch. Each filter slides over the whole embedding vectors, but varies in how many words it covers. We set filter sizes to $ $ and use $128$ filters of each size in our model. We slide the filters without padding the edges and perform a narrow convolution (Kalchbrenner et al., 2014). We further feed the output of the convolutional layer to a rectified linear unit (ReLU) function, which is simply defined as $\max(0,\mathbf{x})$ , to add non-linearity. After that we apply a max pooling layer on the output of the ReLU function. Finally we use a fully connected layer with a softmax function to output the probability distribution over different labels.
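The narrow convolution, ReLU and max pooling steps can be illustrated with a minimal pure-Python sketch for a single filter. The function name, the toy inputs and the use of one global max over all positions (rather than pooling with a specific window size) are assumptions made for illustration, not the configuration used in the model:

```python
def narrow_conv_relu_maxpool(hidden, filt, bias=0.0):
    """Apply one filter to a sequence of hidden vectors.

    hidden: list of seq_len vectors (each a list of d floats), e.g. LSTM outputs.
    filt:   list of k weight vectors (each of length d); the filter covers k positions.
    Returns (pooled_value, feature_map), where feature_map holds the ReLU outputs.
    """
    k = len(filt)
    if len(hidden) < k:
        raise ValueError("sequence is shorter than the filter width")
    feature_map = []
    # Narrow convolution: no padding, so there are seq_len - k + 1 positions.
    for start in range(len(hidden) - k + 1):
        total = bias
        for j in range(k):
            # Dot product between one filter row and the aligned hidden vector.
            total += sum(w * x for w, x in zip(filt[j], hidden[start + j]))
        feature_map.append(max(0.0, total))  # ReLU non-linearity
    # Simplification: a single max over all positions (max-over-time pooling).
    return max(feature_map), feature_map


hidden_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical filter covering 2 positions
pooled, fmap = narrow_conv_relu_maxpool(hidden_states, bigram_filter)
```

With a sequence of length 4 and a filter covering 2 positions, the narrow convolution yields 3 feature values; in the full model, many such filters of several sizes are applied and their pooled outputs are concatenated.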

4. Loss Function and Training

We consider a pairwise learning setting during model training process. The training data consists of triples $(\mathbf{Q}_{i},\mathbf{P}_{i}^{+},\mathbf{P}_{i}^{-})$ where $\mathbf{P}_{i}^{+}$ and $\mathbf{P}_{i}^{-}$ respectively denote the positive and the negative candidate sequence for $\mathbf{Q}_{i}$ . The pairwise ranking-based hinge loss function is defined as:
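A standard form of this loss that is consistent with the symbols defined in this section is given here; the symbol $\mathcal{L}$ for the loss is assumed notation:

$$
\mathcal{L}(\theta)=\sum_{i=1}^{M}\max\left(0,\ \epsilon-S(\mathbf{Q}_{i},\mathbf{P}_{i}^{+})+S(\mathbf{Q}_{i},\mathbf{P}_{i}^{-})\right)+\lambda||\theta||^{2}_{2}
$$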

where $M$ is the number of triples in the training data. $\lambda||\theta||^{2}_{2}$ is the regularization term where $\lambda$ and $\theta$ respectively denote the regularization coefficient and the model parameters. $\epsilon$ denotes the margin in the hinge loss. $S(\cdot,\cdot)$ denotes the output matching score from the last layer of the LSTM-CNN-Match model. The parameters of the network are optimized using the Adam algorithm (Kingma and Ba, 2014).
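As a small worked illustration of the pairwise hinge loss with L2 regularization described in this section, the following sketch computes the loss from precomputed matching scores. The function name and the toy scores are hypothetical, and the sketch sums over triples without averaging, which is an assumption:

```python
def pairwise_hinge_loss(pos_scores, neg_scores, margin, l2_coeff=0.0, params=()):
    """Sum of max(0, margin - S(Q, P+) + S(Q, P-)) over triples, plus l2_coeff * ||params||^2."""
    if len(pos_scores) != len(neg_scores):
        raise ValueError("each triple needs one positive and one negative score")
    data_term = sum(
        max(0.0, margin - s_pos + s_neg)
        for s_pos, s_neg in zip(pos_scores, neg_scores)
    )
    reg_term = l2_coeff * sum(p * p for p in params)
    return data_term + reg_term


# First triple is ranked correctly with enough margin (no loss);
# second triple ranks the negative above the positive (loss 0.5 - 0.2 + 0.4).
loss = pairwise_hinge_loss([0.9, 0.2], [0.1, 0.4], margin=0.5)
loss_reg = pairwise_hinge_loss([0.9, 0.2], [0.1, 0.4], margin=0.5,
                               l2_coeff=0.1, params=[1.0, 2.0])
```

A triple contributes nothing to the loss once the positive candidate outscores the negative one by at least the margin, so training effort concentrates on the pairs that are still misordered or too close.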

Experiments

We use the publicly available datasets from Quora (https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs) and Ubuntu IRC chat logs (http://dataset.cs.mcgill.ca/ubuntu-corpus-1.0/) for the experiments. The Quora dataset consists of $404,340$ lines of question pairs. Each line contains the IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a similar question pair or not. To use this dataset for question retrieval evaluation, we conducted data sampling and pre-processing. There are $148,487$ similar question pairs in the Quora data, which form the positive question pairs. For each positive question pair, we randomly picked one of the two questions as the query question $\mathbf{Q}$ ; the other question is then the positive candidate question $\mathbf{P}^{+}$ for $\mathbf{Q}$ . We used negative sampling to construct the negative pairs following previous work (Wan et al., 2016). Specifically, for each query question $\mathbf{Q}$ , we first used it to retrieve the top $1000$ results from the whole question set using Lucene (http://lucene.apache.org/) with BM25. Then we randomly selected $4$ questions from them, excluding the known positive candidate question $\mathbf{P}^{+}$ , as the negative candidate questions. Finally, we randomly split the whole dataset into training, development and testing data with proportion $8:1:1$ . The statistics of the different partitions of the Quora data are presented in Table 2. Note that in some rare cases, the hit count for a query question returned by Lucene could be less than $4$ . In this case, the actual number of candidate questions for this query question could be less than $5$ .
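The negative sampling step for the Quora data can be sketched as follows. The function name and the representation of retrieved questions as a list of IDs are assumptions for illustration; the retrieval itself (Lucene with BM25) is not reproduced here:

```python
import random


def sample_negatives(retrieved_ids, positive_id, num_negatives=4, rng=None):
    """Pick up to num_negatives candidates from a retrieved list, excluding the positive.

    retrieved_ids: IDs of the top results returned for the query question.
    positive_id:   ID of the known similar question, which must not become a negative.
    """
    rng = rng or random.Random()
    pool = [qid for qid in retrieved_ids if qid != positive_id]
    # When the retrieval returns few hits, fewer negatives are available.
    k = min(num_negatives, len(pool))
    return rng.sample(pool, k)


negatives = sample_negatives(list(range(10)), positive_id=3, rng=random.Random(0))
few_hits = sample_negatives([3, 7], positive_id=3)
```

Drawing negatives from the retrieved results rather than from the whole collection yields candidates that already share terms with the query, which makes the ranking task harder than with uniformly sampled negatives.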

For the Ubuntu chat log data, we perform similar data sampling and pre-processing. We identify questions in dialogs by question marks. For each question $q^{*}$ in a dialog, we stochastically sample a pre-context size $c\in[2,C]$ , where $C$ is the maximum number of questions in the pre-context. In our experiments, we empirically set $C=6$ . We skip a question if there are fewer than $2$ previous questions or the question length is less than $3$ . We remove speaker IDs in candidate questions to ensure that different methods rank questions by matching actual question content instead of speaker IDs. Words that appear $5$ times or fewer are replaced by $<$UNK$>$ . Let $c^{\prime}=\min(c,t)$ , where $t$ is the total number of questions before $q^{*}$ . Then we generate the context for $q^{*}$ by merging the previous $c^{\prime}$ questions $\{q_{1},q_{2},\cdots,q_{c^{\prime}}\}$ with their responses. The true question response $q^{*}$ is thus the positive question candidate. We additionally randomly sample another $9$ negative question responses, excluding the known positive candidate question, following previous work (Lowe et al., 2015). Finally, we randomly split the whole dataset into training, development and testing data with proportion $8:1:1$ . The statistics of the different partitions of the Ubuntu chat log data are presented in Table 3.
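The context construction for a target question $q^{*}$ can be sketched as below. The function name, the representation of the dialog history as (question, response) pairs, the choice of the most recent $c^{\prime}$ questions, and the use of whitespace to merge them are all assumptions for illustration:

```python
import random


def build_context(prev_turns, max_context=6, min_context=2, rng=None):
    """Build the pre-context for a target question.

    prev_turns: list of (question, response_text) pairs before the target, oldest first.
    Returns the merged context string, or None when the question should be skipped.
    """
    rng = rng or random.Random()
    t = len(prev_turns)
    if t < min_context:
        return None  # fewer than min_context previous questions: skip
    c = rng.randint(min_context, max_context)  # sampled context size, bounds inclusive
    c_prime = min(c, t)
    recent = prev_turns[-c_prime:]  # assumption: keep the most recent c' questions
    return " ".join(f"{question} {response}".strip() for question, response in recent)


turns = [("q1?", "r1"), ("q2?", "r2"), ("q3?", "r3")]
skipped = build_context(turns[:1])
two_turn_context = build_context(turns, max_context=2)
sampled_context = build_context(turns, rng=random.Random(1))
```

Sampling the context size instead of fixing it exposes the model to histories of varying length during training.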

For data pre-processing, we performed tokenization and punctuation removal. We kept stopwords for neural models and removed them for the traditional retrieval models such as BM25 and QL. We used TensorFlow (https://www.tensorflow.org/) for the implementation of the neural matching models.

Word Embeddings. We use GloVe (Pennington et al., ) word embeddings, which are 300-dimensional word vectors trained on a large crawled corpus with 840 billion tokens. Embeddings for words not present in the pre-trained vocabulary are randomly initialized with values sampled from a uniform distribution U[-0.25,0.25], which follows the same setting as in (Severyn and Moschitti, ).
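The embedding initialization can be sketched as follows, assuming the pre-trained vectors have already been loaded into a dictionary; the function name and the dictionary-based table are hypothetical choices for illustration:

```python
import random


def build_embedding_table(vocab, pretrained, dim=300, low=-0.25, high=0.25, rng=None):
    """Map every vocabulary word to a vector.

    Words found in `pretrained` keep their pre-trained vector; all other words
    get a vector sampled uniformly from [low, high].
    """
    rng = rng or random.Random()
    table = {}
    for word in vocab:
        if word in pretrained:
            table[word] = list(pretrained[word])
        else:
            table[word] = [rng.uniform(low, high) for _ in range(dim)]
    return table


toy_pretrained = {"ubuntu": [0.1, 0.2, 0.3, 0.4]}  # toy 4-dimensional vectors
table = build_embedding_table(["ubuntu", "xorgconf"], toy_pretrained, dim=4,
                              rng=random.Random(0))
```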

Additional Word Overlap Features. As noted in previous work (Severyn and Moschitti, ; Yu et al., 2014), one weakness of models relying on distributional word embeddings is their inability to deal with cardinal numbers and proper nouns. This also affects matching question pairs or contexts with questions. Suppose we have two questions “What happened in US in 1776?” and “What happened in Japan in 1871?”. These two questions are likely to be assigned a high matching probability by neural matching models relying on word embedding input, since country names like “US” and “Japan” and numbers like “1776” and “1871” are close to each other in the word embedding space. However, these two questions represent two different question intents. To mitigate this issue, we follow the approach in (Severyn and Moschitti, ; Yu et al., 2014) and include additional word overlap features in the model. Specifically, we compute the word co-occurrence count and the IDF-weighted word co-occurrence between two sequences. Computing these simple word overlap features is straightforward. We combine the matching probability learned by neural matching models with these two simple word overlap features using a logistic regression layer to generate the final ranking scores of candidate questions.
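The two word overlap features can be sketched as follows. The exact definitions (for example, whether repeated words are counted once and whether the values are normalized) are not specified here, so this sketch assumes set-based overlap without normalization; the function name and the toy IDF values are hypothetical:

```python
def overlap_features(tokens_a, tokens_b, idf):
    """Return (co-occurrence count, IDF-weighted co-occurrence) for two token lists.

    Assumption: each shared word is counted once, and words missing from
    the idf table contribute zero weight.
    """
    shared = set(tokens_a) & set(tokens_b)
    count = len(shared)
    weighted = sum(idf.get(word, 0.0) for word in shared)
    return count, weighted


q1 = "what happened in us in 1776".split()
q2 = "what happened in japan in 1871".split()
toy_idf = {"what": 0.1, "in": 0.1, "happened": 1.0,
           "us": 2.0, "japan": 2.5, "1776": 4.0, "1871": 4.0}
count, weighted = overlap_features(q1, q2, toy_idf)
```

In this example the two questions share only frequent, low-IDF words, so the IDF-weighted overlap stays small even though the raw count is 3; this gives the final ranking layer a signal that the informative terms (country and year) do not match.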

Model Hyper-parameters. We tuned the hyper-parameters with grid search using the development set. For the LSTM-CNN-Match model in question retrieval, we set the learning rate to $0.002$ , batch size to $500$ , margin of the hinge loss to $0.5$ , filter sizes to $ $, and the number of filters of each size to $128$ . For the LSTM-CNN-Match model in question prediction in conversations, we set the learning rate to $0.002$ , batch size to $200$ , margin of the hinge loss to $0.3$ , filter sizes to $ $, and the number of filters of each size to $128$ .

2. Evaluation Metrics and Compared Methods

For the Quora data and Ubuntu chat log data, since there is only one positive candidate question for each query question or previous conversation context, we adopt mean reciprocal rank (MRR) and precision at the highest position (P@1) as the evaluation metrics. Note that in this case MRR is equivalent to MAP and P@1 is equivalent to R-Precision. For Ubuntu chat log data, since there are $10$ candidate questions for each context, we additionally report P@5 and Recall@5. We study the effectiveness of the following methods:

WordCount: This method computes the word co-occurrence count between the two sequences.

WordCountIDF: This method computes the word co-occurrence count weighted by IDF value between the two sequences.

VSM: This method computes the cosine similarity between the TF-IDF representation of the given two sequences.

BM25: This method computes the BM25 score between the two sequences, where we treat one of the sequences as the query and the other one as the document.

QL: This method computes the query likelihood (Ponte and Croft, 1998) score with Dirichlet prior smoothing between the language models of the two sequences.

TRLM: This method is the translation-based language model employed by Jeon et al. (Jeon et al., 2005) and Xue et al. (Xue et al., 2008). This method has been consistently reported as the state-of-the-art method for the question retrieval task (Zhou et al., 2013).

AvgWordEmbed: This method uses the average vector of word embeddings as the sequence representation; then the cosine similarity of sequence representations is used for the candidate question ranking.

CNN-Match: This is a degenerate version of the LSTM-CNN-Match model where we remove the LSTM layer in the model, which is similar to the CDNN model proposed by Severyn and Moschitti (Severyn and Moschitti, ).

LSTM-CNN-Match: The model presented in Section 2, which has been recently applied to other tasks, such as answer sentence selection (He and Lin, 2016; Tan et al., ; Wang and Jiang, 2016; Zhou et al., 2016; Yan et al., 2016).

Combined Model: We tried to combine scores of all baseline methods with neural matching models and trained a LambdaMART ranker for question ranking. This is to study whether combining learned features from basic retrieval models with neural models could lead to better retrieval performance.
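The MRR and P@1 metrics used to compare these methods can be computed as in the following sketch, which assumes exactly one positive candidate per query; the function names and the toy rankings are hypothetical:

```python
def reciprocal_rank(ranked_ids, positive_id):
    """Return 1 / rank of the positive candidate, or 0.0 if it is not in the list."""
    for rank, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id == positive_id:
            return 1.0 / rank
    return 0.0


def evaluate(rankings):
    """rankings: list of (ranked_ids, positive_id) pairs. Returns (MRR, P@1)."""
    n = len(rankings)
    mrr = sum(reciprocal_rank(ranked, pos) for ranked, pos in rankings) / n
    p_at_1 = sum(1.0 for ranked, pos in rankings if ranked and ranked[0] == pos) / n
    return mrr, p_at_1


toy_rankings = [
    (["a", "b", "c"], "a"),       # positive at rank 1
    (["a", "b", "c"], "b"),       # positive at rank 2
    (["a", "b", "c", "d"], "d"),  # positive at rank 4
]
mrr, p_at_1 = evaluate(toy_rankings)
```

With a single positive per query, the average precision of a ranking equals the reciprocal rank of that positive, which is why MRR coincides with MAP in this setting.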

3. Experimental Results on Question Retrieval

Table 4 shows the experimental results for the question retrieval task with the Quora dataset. We summarize our observations as follows: (1) LSTM-CNN-Match model outperforms all the baseline methods including basic retrieval models, translation model based methods and basic neural model/word embedding based methods. This shows the advantage of jointly modeling semantic match information through a neural matching model and basic word overlap information for the question retrieval task. (2) Comparing the performance of LSTM-CNN-Match model and CNN-Match model, we found that the retrieval performance will decrease if we remove the LSTM layer. This shows that modeling long term dependency in questions through LSTM is useful for boosting question search performance. (3) If we combine the learned matching score of neural models with the basic retrieval model scores, we can observe further gain over the baselines. Thus in practice the learning to rank framework is still useful for combining different features including both traditional IR model scores and the more recent neural model scores for a strong ranker for question search.

To get a better understanding of the effectiveness of the model, we checked the retrieved questions of each method. Jointly modeling term matching information with semantic matching information is important for the question retrieval task. Table 5 reports the retrieval results of different methods for the query question “What are some good anime movies?”. BM25 relying on term matching between question pairs ranked the correct similar candidate question “What are some of the best anime shows?” in a relatively low position and ranked “What are good scary movies?” in the first position. TRLM suffers from a similar problem. The neural matching model LSTM-CNN-Match ranked the correct similar question candidate in the first position, since it can capture the semantic similarity between “movies” and “shows” as well as “good” and “best”, which are missed by the term matching based retrieval models.

4. Experimental Results on Predicting Questions in Conversations

Table 6 shows the experimental results for predicting questions in conversations with the Ubuntu chat log dataset. For this task, the “Combined Model” performed the best for MRR and P@1. CNN-Match achieved the best performance for P@5 and Recall@5. We also found that LSTM-CNN-Match performed worse than CNN-Match for this task. Overall, neural matching models could improve the ranking effectiveness of finding questions given previous context over traditional retrieval models. Combining scores from neural matching models and traditional retrieval models could also be helpful. Our research represents an initial effort to understand the effectiveness of neural matching models for predicting questions in conversations. We find that this is a more challenging task compared with similar question finding for at least two reasons: 1) Unlike similar question pairs with close sequence lengths, a context is usually much longer than a candidate question in conversations. 2) The matching pattern between conversational context and candidate questions could be more complex, going beyond the semantic match or paraphrase found in question retrieval. To find more effective clues from context, more advanced model architectures like attention modeling in context should be considered. Sequence to sequence learning with an RNN Encoder-Decoder architecture (Sutskever et al., 2014; Cho et al., 2014; Shang et al., 2015) and memory networks (Sukhbaatar et al., ) could be promising directions to explore.

Related Work

The current research on question retrieval can be divided into two categories. The first group leveraged translation models to bridge the lexical gaps between questions. Jeon et al. (Jeon et al., 2005) proposed a method for learning word translation probabilities from question-question pairs collected based on similar answers in CQA. Xue et al. (Xue et al., 2008) proposed a retrieval model that combines a translation-based language model for the question part with a query likelihood approach for the answer part. The translation-based language model (TRLM) has been consistently reported as the state-of-the-art method for question retrieval (Zhou et al., 2013). Topic models have also been adopted for question retrieval (Yang et al., 2013). In recent years, a few works have built deep learning models with word embeddings for question retrieval (Zhou et al., ; Wang et al., ). Wang et al. (Wang et al., ) proposed a unified framework to simultaneously handle three problems in question retrieval: lexical gap, polysemy and word order. A high level feature embedded convolutional semantic model is proposed to learn the question embeddings.

The second research group has focused on improving question search with category information about questions. Cao et al. (Cao et al., 2009) proposed a language model with leaf category smoothing for questions in the same category. Zhou et al. (Zhou et al., 2013) proposed an efficient and effective retrieval model for question retrieval by leveraging user chosen categories. They achieved this by filtering some irrelevant historical questions under a range of leaf categories. Although considering category information can improve question retrieval performance, these methods cannot be applied to scenarios where the category information is not available. In many question answering and chatbot/dialogue systems, new questions issued by users have no explicit predefined category. Our work is closer to a general setting of question search where no category information is available.

2. Neural Conversation Models

In recent years, there has been growing interest in research on conversation response generation and ranking with deep learning and reinforcement learning (Shang et al., 2015; Yan et al., 2016; Bordes and Weston, 2016; Li et al., 2016a, b; Sordoni et al., 2015). Shang et al. (Shang et al., 2015) proposed the Neural Responding Machine (NRM), an RNN encoder-decoder framework for short text conversation, and showed that it outperformed retrieval-based methods and SMT-based methods for single round conversation. Sordoni et al. (Sordoni et al., 2015) proposed a neural network architecture for response generation that is both context-sensitive and data-driven, utilizing the Recurrent Neural Network Language Model architecture. Yan et al. (Yan et al., 2016) proposed a retrieval-based conversation system with the deep learning-to-respond schema through a deep neural network framework driven by web data. Li et al. (Li et al., 2016b) applied deep reinforcement learning to model future reward in chatbot dialogs towards building a neural conversational model based on the long-term success of dialogs. Bordes et al. (Bordes and Weston, 2016) proposed a testbed to break down the strengths and shortcomings of end-to-end dialog systems in goal-oriented applications. They showed that an end-to-end dialog system based on Memory Networks can reach promising performance and learn to perform non-trivial operations. Our work is relevant to neural conversational models, but we focus on finding questions given the previous conversational context.

3. Neural Ranking Models

A number of neural approaches have been proposed for ranking documents in response to a given query. These approaches can be generally divided into two groups: representation-focused and interaction-focused models (Guo et al., 2016). Representation-focused models independently learn a representation for each query and candidate document and then calculate the similarity between the two estimated representations via a similarity function. As an example, DSSM (Huang et al., 2013) is a feed forward neural network with a word hashing phase as the first layer to predict the click probability given a query string and a document title.

On the other hand, the interaction-focused models are designed based on the interactions between the query and the candidate document. For instance, DeepMatch (Lu and Li, 2013) is an interaction-focused model that maps each input to a sequence of terms and trains a feed-forward network to compute the matching score. These models have an opportunity to capture the interactions between query and document, while representation-focused models look at the inputs in isolation. Recently, Mitra et al. (Mitra et al., 2017) proposed to simultaneously learn local and distributional representations to capture both exact term matching and semantic term matching.

All the aforementioned models are trained based on either explicit relevance judgments or clickthrough data. More recently, Dehghani et al. (Dehghani et al., 2017) proposed to train neural ranking models when no supervision signal is available. They used an existing retrieval model, e.g., BM25 or query likelihood, to automatically generate large amounts of training data and proposed to use these generated data to train neural ranking models with weak supervision.

Conclusions and Future Work

In this paper, we studied the effectiveness of neural matching models for two tasks: retrieving similar questions and predicting questions in conversations. We showed that neural matching models significantly outperform all the baseline methods for the question retrieval task. Furthermore, when the neural matching model is combined with the basic term matching based retrieval models, we can achieve larger gains. For predicting questions in conversations, we observed that LSTM layers do not handle long question history (past questions) well, and thus a simpler neural matching model with no LSTM layer (CNN-Match) outperforms LSTM-CNN-Match on this task. This is a preliminary study in this area and there is still room to develop more advanced neural models to further improve the performance of matching conversational context with questions. For future work, we plan to continue the research on neural conversational models as a modern way for people to access information. Modeling context attention and incorporating external knowledge into neural conversation models for finding better candidate questions are also interesting future directions.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by NSF IIS-1160894, and in part by NSF grant #IIS-1419693. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.