Response Ranking with Deep Matching Networks and External Knowledge in Information-seeking Conversation Systems

Liu Yang, Minghui Qiu, Chen Qu, Jiafeng Guo, Yongfeng Zhang, W. Bruce Croft, Jun Huang, Haiqing Chen

Introduction

Personal assistant systems, such as Apple Siri, Google Now, Amazon Alexa, and Microsoft Cortana, are becoming ever more widely used. For example, there are over 100M installations of Google Now (Google, http://bit.ly/1wTckVs), 15M sales of Amazon Echo (GeekWire, http://bit.ly/2xfZAgX), and more than 141M monthly users of Microsoft Cortana (Windowscentral, http://bit.ly/2Dv6TVT). These systems, with either text-based or voice-based conversational interfaces, are capable of voice interaction, information search, question answering and voice control of smart devices. This trend has led to an interest in developing conversational search systems, where users can ask questions to seek information through conversational interactions. Research on speech and text-based conversational search has also recently attracted significant attention in the information retrieval (IR) community.

Existing approaches to building conversational systems include generation-based methods (Ritter et al., 2011; Shang et al., 2015) and retrieval-based methods (Ji et al., 2014; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017). Compared with generation-based methods, retrieval-based methods have the advantages of returning fluent and informative responses. Most work on retrieval-based conversational systems studies response ranking for single-turn conversation (Wang et al., 2013), which only considers a current utterance for selecting responses. Recently, several researchers have been studying multi-turn conversation (Yan et al., 2016a; Zhou et al., 2016; Wu et al., 2017; Yan et al., 2017), which considers the previous utterances of the current message as the conversation context to select responses by jointly modeling context information, current input utterance and response candidates. However, existing studies are still suffering from the following weaknesses:

(1) Most existing studies are on open domain chit-chat conversations or task / transaction oriented conversations. Most current work (Ritter et al., 2011; Shang et al., 2015; Ji et al., 2014; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017) looks at open domain chit-chat conversations, as in microblog data like Twitter and Weibo. There is some research on task oriented conversations (Young et al., 2010; Wen et al., 2017; Bordes et al., 2017), where there is a clear goal to be achieved through conversations between the human and the agent. However, the typical applications and data are related to completing transactions like ordering a restaurant or booking a flight ticket. Much less attention has been paid to information oriented conversations, which are referred to as information-seeking conversations in this paper. Information-seeking conversations, where the agent is trying to satisfy the information needs of the user through conversation interactions, are closely related to conversational search systems. More research is needed on response selection in information-seeking conversation systems.

(2) Lack of modeling external knowledge beyond the dialog utterances. Most research on response selection in conversation systems purely models the matching patterns between the user input message (with or without context) and the response candidates, ignoring external knowledge beyond the dialog utterances. Similar to Web search, information-seeking conversations could be associated with massive external data collections that contain rich knowledge that could be useful for response selection. This is especially critical for information-seeking conversations, since there may not be enough signals in the current dialog context and candidate responses to discriminate a good response from a bad one, given the wide range of topics for user information needs. An obvious research question is how to utilize external knowledge effectively for response ranking. This question has not been well studied, despite the potential benefits for the development of information-seeking conversation systems.

To address these research issues, we propose a learning framework on top of deep neural matching networks that leverages external knowledge for response ranking in information-seeking conversation systems. We study two different methods for integrating external knowledge into deep neural matching networks, as follows:

(1) Incorporating external knowledge via pseudo-relevance feedback. Pseudo-relevance feedback (PRF) has been proven effective in improving the performance of many retrieval models (Lavrenko and Croft, 2001; Lv and Zhai, 2009; Zamani et al., 2016; Zhai and Lafferty, 2001; Rocchio, 1971; Cao et al., 2008). The motivation of PRF is to assume a certain number of top-ranked documents from the initial retrieval run to be relevant and use these feedback documents to improve the original query representation. For conversation response ranking, many candidate responses are much shorter compared with conversation context, which could have negative impacts on deep neural matching models. Inspired by the key idea of PRF, we propose using the candidate response as a query to run a retrieval round on a large external collection. Then we extract useful information from the (pseudo) relevant feedback documents to enrich the original candidate response representation.

(2) Incorporating external knowledge via QA correspondence knowledge distillation. Previous neural ranking models enhanced the performance of retrieval models such as BM25 and QL, which mainly rely on lexical match information, via modeling semantic match patterns in text (Guo et al., 2016; Huang et al., 2013; Mitra et al., 2017). For response ranking in information-seeking conversations, the match patterns between candidate responses and conversation context can be quite different from the well studied lexical and semantic matching. Consider the sample utterance and response from the conversations in the Microsoft Answers community (https://answers.microsoft.com/) shown in Table 1. A Windows user posed a question about a Windows update failure on “restart install”. An expert replied with a response pointing to a potential cause, “Norton leftovers”. The match signals between the problem “restart install” and the cause “Norton leftovers” may not be captured by simple lexical and semantic matching. To derive such match patterns, we need to rely on external knowledge to distill QA correspondence information. We propose to extract the “correspondence” regularities between question and answer terms from retrieved external QA pairs. We define this type of match pattern as a “correspondence match”, which will be incorporated into deep matching networks as external knowledge to help response selection in information-seeking conversations.

We conduct extensive experiments with three information-seeking conversation data sets: the MSDialog data, which contains crawled customer service dialogs from the Microsoft Answers community; a popular benchmark data set, the Ubuntu Dialog Corpus (UDC) (Lowe et al., 2015); and a commercial customer service data set, AliMe, from Alibaba group. We compare our methods with various deep text matching models and the state-of-the-art baseline on response selection in multi-turn conversations. Our methods outperform all baseline methods regarding a variety of metrics.

In summary, our contributions are as follows:

(1) Focusing on information-seeking conversations and building a new benchmark data set. We target information-seeking conversations to push the boundaries of conversational search models. To this end, we create a new information-seeking conversation data set, MSDialog, on technical support dialogs of Microsoft products and release it to the research community. The MSDialog dataset can be downloaded from https://ciir.cs.umass.edu/downloads/msdialog. We also released our source code at https://github.com/yangliuy/NeuralResponseRanking.

(2) Integrating external knowledge into deep neural matching networks for response ranking. We propose a new response ranking paradigm for multi-turn conversations by incorporating external knowledge into the matching process of dialog context and candidate responses. Under this paradigm, we design two different methods with pseudo relevance feedback and QA correspondence knowledge distillation to integrate external knowledge into deep neural matching networks for response ranking.

(3) Extensive experimental evaluation on benchmark / commercial data sets and promising results. Experimental results with three different information-seeking conversation data sets show that our methods outperform various baseline methods including the state-of-the-art method on response selection in multi-turn conversations. We also perform analysis over different response types, model variations and ranking examples to provide insights.

Related Work

Our work is related to research on conversational search, neural conversational models and neural ranking models.

Conversational Search. Conversational search has received significant attention with the emergence of conversational devices in recent years. Radlinski and Craswell described the basic features of conversational search systems (Radlinski and Craswell, 2017). Thomas et al. (Thomas et al., 2017) released the Microsoft Information-Seeking Conversation (MISC) data set, which contains information-seeking conversations with a human intermediary, in a setup designed to mimic software agents such as Siri or Cortana. However, this data set is quite small (in terms of the number of dialogs) for the training of neural models. Based on state-of-the-art advances in machine reading, Kenter and de Rijke (Kenter and de Rijke, 2017) adopted a conversational search approach to question answering. Apart from conversational search models, researchers have also studied the medium of conversational search. Arguello et al. (Arguello et al., 2017) studied how the medium (e.g., voice interaction) affects user requests in conversational search. Spina et al. studied ways of presenting search results over speech-only channels to support conversational search (Spina et al., 2017; Trippas et al., 2015). Yang et al. (Yang et al., 2017) investigated predicting the next question that the user will ask given the past conversational context. Our research targets the response ranking of information-seeking conversations, with deep matching networks and integration of external knowledge.

Neural Conversational Models. In recent years there has been growing interest in research on conversation response generation and ranking with deep learning and reinforcement learning (Shang et al., 2015; Yan et al., 2016a; Yan et al., 2016b; Yan et al., 2017; Li et al., 2016a, b; Sordoni et al., 2015; Bordes et al., 2017). Existing work includes retrieval-based methods (Wu et al., 2017; Zhou et al., 2016; Yan et al., 2016a, 2017; Ji et al., 2014; Lowe et al., 2015) and generation-based methods (Shang et al., 2015; Tian et al., 2017; Ritter et al., 2011; Sordoni et al., 2015; Vinyals and Le, 2015; Li et al., 2016b; Bordes et al., 2017; Dhingra et al., 2017; Qiu et al., 2017). Sordoni et al. (Sordoni et al., 2015) proposed a neural network architecture for response generation that is both context-sensitive and data-driven, utilizing the Recurrent Neural Network Language Model architecture. Our work is a retrieval-based method. There is some research on multi-turn conversations with retrieval-based methods. Wu et al. (Wu et al., 2017) proposed a sequential matching network that matches a response with each utterance in the context on multiple levels of granularity to distill important matching information. The main difference between our work and theirs is that we consider external knowledge beyond the dialog context for multi-turn response selection. We show that incorporating external knowledge with pseudo-relevance feedback and QA correspondence knowledge distillation is important and effective for response selection.

Neural Ranking Models. Recently a number of neural ranking models have been proposed for information retrieval, question answering and conversation response ranking. These models can be classified into three categories (Guo et al., 2016). The first category is the representation focused models. These models first learn the representations of queries and documents separately and then calculate the similarity score of the learned representations with functions such as cosine, dot, bilinear or tensor layers. A typical example is the DSSM (Huang et al., 2013) model, which is a feed forward neural network with a word hashing phase as the first layer to predict the click probability given a query string and a document title. The second category is the interaction focused models, which build a query-document term pairwise interaction matrix to capture the exact matching and semantic matching information between the query-document pairs. The interaction matrix is then fed into deep neural networks, such as a CNN (Hu et al., 2014; Pang et al., 2016; Yu et al., 2018) or a term gating network with a histogram or value shared weighting mechanism (Guo et al., 2016; Yang et al., 2016), to generate the final ranking score. Finally, the neural ranking models in the third category combine the ideas of the representation focused and interaction focused models to jointly learn the lexical matching and semantic matching between queries and documents (Mitra et al., 2017; Yu et al., 2018). The deep matching networks used in our research belong to the interaction focused models because of their better performance on a variety of text matching tasks compared with representation focused models (Hu et al., 2014; Pang et al., 2016; Guo et al., 2016; Yang et al., 2016; Wu et al., 2017; Xiong et al., 2017). We study different ways to build the interaction matching matrices to capture the matching patterns in term spaces, sequence structures and external knowledge signals between dialog context utterances and response candidates.

Our Approach

The research problem of response ranking in information-seeking conversations is defined as follows. We are given an information-seeking conversation data set $\mathcal{D}=\{(\mathcal{U}_{i},\mathcal{R}_{i},\mathcal{Y}_{i})\}_{i=1}^{N}$, where $\mathcal{U}_{i}=\{u_{i}^{1},u_{i}^{2},\dots,u_{i}^{t-1},u_{i}^{t}\}$, in which $\{u_{i}^{1},u_{i}^{2},\dots,u_{i}^{t-1}\}$ is the dialog context and $u_{i}^{t}$ is the input utterance in the $t$-th turn. $\mathcal{R}_{i}=\{r_{i}^{k}\}_{k=1}^{M}$ and $\mathcal{Y}_{i}=\{y_{i}^{k}\}_{k=1}^{M}$ are the set of response candidates and the corresponding binary labels, where $y_{i}^{k}=1$ denotes that $r_{i}^{k}$ is a true response for $\mathcal{U}_{i}$, and $y_{i}^{k}=0$ otherwise. In order to integrate external knowledge, we are also given an external collection $\mathcal{E}$, which is related to the topics discussed in conversation $\mathcal{U}$. Our task is to learn a ranking model $f(\cdot)$ with $\mathcal{D}$ and $\mathcal{E}$. For any given $\mathcal{U}_{i}$, the model should be able to generate a ranking list for the candidate responses $\mathcal{R}_{i}$ with $f(\cdot)$. The external collection $\mathcal{E}$ could be any massive text corpus. In our paper, $\mathcal{E}$ consists of historical QA posts from the Stack Overflow data dump (https://stackoverflow.com/) for MSDialog, the AskUbuntu data dump (https://askubuntu.com/) for the Ubuntu Dialog Corpus, and product QA pairs for the AliMe data.
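To make the notation concrete, one possible in-memory representation of a single element $(\mathcal{U}_{i},\mathcal{R}_{i},\mathcal{Y}_{i})$ is sketched below. The class and function names are illustrative assumptions, and the toy overlap scorer merely stands in for the learned ranking model $f(\cdot)$.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConversationExample:
    """One element (U_i, R_i, Y_i) of the data set D (hypothetical container)."""
    utterances: List[str]   # u^1 ... u^t; the last entry is the input utterance
    candidates: List[str]   # response candidates r^1 ... r^M
    labels: List[int]       # y^k = 1 if r^k is a true response, else 0

    def __post_init__(self) -> None:
        if len(self.candidates) != len(self.labels):
            raise ValueError("each candidate needs exactly one label")

    @property
    def context(self) -> List[str]:
        return self.utterances[:-1]

    @property
    def input_utterance(self) -> str:
        return self.utterances[-1]


def rank_candidates(example: ConversationExample,
                    score_fn: Callable[[List[str], str], float]) -> List[str]:
    """Sort candidates by descending score f(U_i, r_i^k)."""
    return sorted(example.candidates,
                  key=lambda r: score_fn(example.utterances, r),
                  reverse=True)


def overlap_score(utterances: List[str], response: str) -> float:
    """Toy stand-in for the learned model f: count shared lowercase tokens."""
    context_tokens = set(" ".join(utterances).lower().split())
    return float(len(context_tokens & set(response.lower().split())))
```

Any scoring function with the same signature, including a trained matching network, could be passed as `score_fn`.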

3.2. Method Overview

In the following sections, we describe the proposed learning framework built on top of deep matching networks and external knowledge for response ranking in information-seeking conversations. A summary of key notations in this work is presented in Table 2. In general, there are three modules in our learning framework:

(1) Information retrieval (IR) module: Given the information-seeking conversation data $\mathcal{D}$ and the external QA text collection $\mathcal{E}$, this module retrieves a small relevant set of QA pairs $\mathcal{P}$ from $\mathcal{E}$, using the response candidates $\mathcal{R}$ as queries. These retrieved QA pairs $\mathcal{P}$ become the source of external knowledge.

(2) External knowledge extraction (KE) module: Given the retrieved QA pairs $\mathcal{P}$ from the IR module, this module will extract useful information as term distributions, term co-occurrence matrices or other forms as external knowledge.

(3) Deep matching network (DMN) module: This is the module to model the extracted external knowledge from $\mathcal{P}$ , dialog utterances $\mathcal{U}_{i}$ and the response candidate $r_{i}^{k}$ to learn the matching pattern, over which it will accumulate and predict a matching score $f(\mathcal{U}_{i},r_{i}^{k})$ for $\mathcal{U}_{i}$ and $r_{i}^{k}$ .

We explore two different implementations under this learning framework as follows: 1) Incorporating external knowledge into deep matching networks via pseudo-relevance feedback (DMN-PRF). The architecture of DMN-PRF model is presented in Figure 1. 2) Incorporating external knowledge via QA correspondence knowledge distillation (DMN-KD). The architecture of DMN-KD model is presented in Figure 2. We will present the details of these two models in Section 3.3 and Section 3.4.

3.3. Deep Matching Networks with Pseudo-Relevance Feedback

We adopt different QA text collections for different conversation data (e.g., Stack Overflow data for MSDialog, AskUbuntu for UDC). The statistics of these external collections are shown in Table 3. We downloaded the data dumps for Stack Overflow and AskUbuntu from archive.org (https://archive.org/download/stackexchange). We index the QA posts from the most recent two years of Stack Overflow and all the QA posts in AskUbuntu. Then we use the response candidate $r_{i}^{k}$ as the query to retrieve the top $P$ QA posts with BM25 as the source of external knowledge (in our experiments, we set $P=10$).
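The retrieval step can be illustrated with a small, self-contained BM25 scorer. This is a minimal sketch rather than the indexing setup used in the experiments: the function name, the values `k1=1.2` and `b=0.75`, and the particular IDF variant are assumptions, and the document list is assumed to be non-empty and already tokenized.

```python
import math
from collections import Counter


def bm25_rank(query_tokens, docs_tokens, top_p=10, k1=1.2, b=0.75):
    """Return indices of up to top_p documents, best first, by BM25 score."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scored = []
    for idx, doc in enumerate(docs_tokens):
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scored.append((score, idx))
    # Highest score first; ties are broken by original position.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Documents sharing no term with the query are dropped.
    return [idx for score, idx in scored[:top_p] if score > 0.0]
```

With the setting described above, the candidate response would be passed as `query_tokens` and `top_p` would be 10.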

3.3.2. Candidate Response Expansion

The motivation of Pseudo-Relevance Feedback (PRF) is to extract terms from the top-ranked documents in the first retrieval results to help discriminate relevant documents from irrelevant ones (Cao et al., 2008). The expansion terms are extracted either according to the term distributions (e.g., the most frequent terms) or according to term specificity (e.g., the terms with the maximal IDF weights) in the feedback documents. Given the retrieved top QA posts $\mathcal{P}$ from the previous step, we compute a language model $\theta=P(w|\mathcal{P})$ using $\mathcal{P}$. Then we extract the most frequent $W$ terms from $\theta$ as expansion terms for the response candidate $r_{i}^{k}$ and append them at the end of $r_{i}^{k}$ (in our experiments, we set $W=10$). For the query $r_{i}^{k}$, we perform several preprocessing steps including tokenization, punctuation removal and stop word removal. QA posts in both Stack Overflow and AskUbuntu have two fields: “Body” and “Title”. We choose to search the “Body” field since we found it more effective in experiments.
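The expansion step can be sketched as follows, assuming a maximum-likelihood estimate for $\theta$. The function name and the toy stop word list are hypothetical, and alphabetical tie-breaking is added only to make the output deterministic.

```python
from collections import Counter

# Toy stop word list; a real list would be much longer (assumption).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in", "it"}


def expand_response(response_tokens, feedback_docs, num_terms=10):
    """Append the num_terms most probable feedback terms to the response."""
    counts = Counter()
    for doc in feedback_docs:
        counts.update(tok for tok in doc if tok not in STOP_WORDS)
    total = sum(counts.values())
    # Maximum-likelihood language model theta = P(w | P).
    theta = {w: c / total for w, c in counts.items()} if total else {}
    # Most probable terms first; ties broken alphabetically.
    ranked = sorted(theta, key=lambda w: (-theta[w], w))
    return list(response_tokens) + ranked[:num_terms]
```

The expanded token list then replaces the original candidate response as input to the matching network.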

3.3.3. Interaction Matching Matrix

Specifically, in input channel one, $\forall i,j$, the element $m_{1,i,j}$ of $\mathbf{M}_{1}$ is defined by $m_{1,i,j}=\mathbf{e}_{r,i}^{T}\cdot\mathbf{e}_{u,j}$. $\mathbf{M}_{1}$ models the word pairwise similarity between $r_{i}^{k^{\prime}}$ and $u_{i}^{t}$ via the dot product similarity between the embedding representations.
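As a small illustration of the definition of $\mathbf{M}_{1}$, the matrix of all pairwise dot products is a single matrix multiplication between the two embedding sequences. The function name and array layout (one row per word) are assumptions made for this sketch.

```python
import numpy as np


def interaction_matrix(resp_emb, utt_emb):
    """Return M1 with M1[i, j] = e_{r,i}^T e_{u,j}.

    resp_emb has shape (l_r, d) and utt_emb has shape (l_u, d),
    with one embedding vector per row.
    """
    resp_emb = np.asarray(resp_emb, dtype=float)
    utt_emb = np.asarray(utt_emb, dtype=float)
    return resp_emb @ utt_emb.T
```

The result has one row per response word and one column per utterance word, which is the shape the CNN input channel expects under this layout.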

3.3.4. Convolution and Pooling Layers

The interaction matrices $\mathbf{M}_{1}$ and $\mathbf{M}_{2}$ are then fed into a CNN to learn high level matching patterns as features. The CNN alternates convolution and max-pooling operations over these input channels. Let $\mathbf{z}^{(l,k)}$ denote the output feature map of the $l$-th layer and $k$-th kernel. The model performs convolution and max-pooling operations according to the following equations.

Convolution. Let $r_{w}^{(l,k)}\times r_{h}^{(l,k)}$ denote the shape of the k-th convolution kernel in the $l$ -th layer, the convolution operation can be defined as:
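The display equation for this step is not present in this copy. A form consistent with the symbols defined in the surrounding text is sketched below; the index ranges and the sum over the input feature maps $k^{\prime}$ are assumptions.

```latex
\[
\mathbf{z}_{i,j}^{(l+1,k)}
  = \sigma\!\left(
      \sum_{k^{\prime}=0}^{K_{l}-1}
      \sum_{s=0}^{r_{w}^{(l,k)}-1}
      \sum_{t=0}^{r_{h}^{(l,k)}-1}
        \mathbf{w}_{s,t}^{(l+1,k)} \cdot \mathbf{z}_{i+s,\,j+t}^{(l,k^{\prime})}
      + b^{(l+1,k)}
    \right)
\]
```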

where $\sigma$ is the activation function ReLU, and $\mathbf{w}_{s,t}^{(l+1,k)}$ and $b^{(l+1,k)}$ are the parameters of the $k$ -th kernel on the $(l+1)$ -th layer to be learned. $K_{l}$ is the number of kernels on the $l$ -th layer.

Max Pooling. Let $p_{w}^{(l,k)}\times p_{h}^{(l,k)}$ denote the shape of the k-th pooling kernel in the $l$ -th layer, the max pooling operation can be defined as:
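The display equation is likewise missing here. Under the notation above, and assuming the pooling window slides over the feature map produced by the preceding convolution layer, one consistent form is:

```latex
\[
\mathbf{z}_{i,j}^{(l+1,k)}
  = \max_{0 \le s < p_{w}^{(l,k)}} \;
    \max_{0 \le t < p_{h}^{(l,k)}} \;
    \mathbf{z}_{i+s,\,j+t}^{(l,k)}
\]
```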

3.3.5. BiGRU Layer and MLP

Given the output feature representation vectors learned by CNN for utterance-response pairs $(r_{i}^{k^{\prime}},u_{i}^{t})$ , we add another BiGRU layer to model the dependency and temporal relationship of utterances in the conversation according to Equation 1 following the previous work (Wu et al., 2017). The output hidden states $\mathbf{H}_{c}=[\mathbf{h^{\prime}}_{1},\cdots,\mathbf{h^{\prime}}_{c}]$ will be concatenated as a vector and fed into a multi-layer perceptron (MLP) to calculate the final matching score $f(\mathcal{U}_{i},r_{i}^{k^{\prime}})$ as
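The scoring equation itself did not survive in this copy. A two-layer form consistent with the parameters and activation functions named in the next sentence, offered as a reconstruction under that assumption, is:

```latex
\[
f(\mathcal{U}_{i}, r_{i}^{k^{\prime}})
  = \sigma_{2}\!\left(
      \mathbf{w}_{2}^{T} \cdot
      \sigma_{1}\!\left(\mathbf{w}_{1}^{T}\mathbf{H}_{c} + \mathbf{b}_{1}\right)
      + \mathbf{b}_{2}
    \right)
\]
```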

where $\mathbf{w}_{1},\mathbf{w}_{2},\mathbf{b}_{1},\mathbf{b}_{2}$ are model parameters. $\sigma_{1}$ and $\sigma_{2}$ are tanh and softmax functions respectively.

3.3.6. Model Training

For model training, we consider a pairwise ranking learning setting. The training data consists of triples $(\mathcal{U}_{i},r_{i}^{k+},r_{i}^{k-})$ where $r_{i}^{k+}$ and $r_{i}^{k-}$ denote the positive and the negative response candidate for dialog utterances $\mathcal{U}_{i}$ . Let $\Theta$ denote all the parameters of our model. The pairwise ranking-based hinge loss function is defined as:
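The loss equation is absent from this copy. A standard pairwise hinge loss that matches the symbols explained in the following sentence ($I$, $\lambda$, $\epsilon$, $\Theta$) can be written as below; the exact arrangement of terms is an assumption.

```latex
\[
\mathcal{L}(\mathcal{D};\Theta)
  = \sum_{i=1}^{I}
      \max\!\left(0,\;
        \epsilon - f(\mathcal{U}_{i}, r_{i}^{k+}) + f(\mathcal{U}_{i}, r_{i}^{k-})
      \right)
    + \lambda \lVert \Theta \rVert_{2}^{2}
\]
```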

where $I$ is the total number of triples in the training data $\mathcal{D}$ . $\lambda||\Theta||^{2}_{2}$ is the regularization term where $\lambda$ denotes the regularization coefficient. $\epsilon$ denotes the margin in the hinge loss. The parameters of the deep matching network are optimized using back-propagation with Adam algorithm (Kingma and Ba, 2014).
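A minimal NumPy sketch of a pairwise hinge loss with L2 regularization is given below for illustration. The function name and the default values for the margin and the regularization coefficient are assumptions, since the text does not state them, and a real training loop would compute this inside an automatic differentiation framework.

```python
import numpy as np


def pairwise_hinge_loss(pos_scores, neg_scores, params, margin=1.0, reg=1e-4):
    """Sum of max(0, margin - f(U, r+) + f(U, r-)) plus reg * ||Theta||_2^2.

    pos_scores and neg_scores are aligned per training triple;
    params is an iterable of parameter arrays making up Theta.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    hinge = np.maximum(0.0, margin - pos + neg).sum()
    l2 = sum(float(np.sum(np.square(p))) for p in params)
    return float(hinge + reg * l2)
```

A triple contributes zero loss once the positive response outscores the negative one by at least the margin.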

3.4. Deep Matching Networks with QA Correspondence Knowledge Distillation

In addition to the DMN-PRF model presented in Section 3.3, we also propose another model for incorporating external knowledge into conversation response ranking via QA correspondence knowledge distillation, which is referred to as the DMN-KD model in this paper. The architecture of the DMN-KD model is presented in Figure 2. Compared with DMN-PRF, the main difference is that the CNN of DMN-KD runs on an additional input channel $\mathbf{M}_{3}$, denoted as blue matrices in Figure 2, which captures the correspondence matching patterns of utterance terms and response terms in relevant external QA pairs retrieved from $\mathcal{E}$. Specifically, we first use the response candidate $r_{i}^{k}$ as the query to retrieve a set of relevant QA pairs $\mathcal{P}$. (Note that we want QA pairs here instead of question posts or answer posts, since we would like to extract QA term co-occurrence information from these QA pairs.) Suppose $\mathcal{P}=\{\mathcal{Q},\mathcal{A}\}=\{(\mathbf{Q}_{1},\mathbf{A}_{1}),(\mathbf{Q}_{2},\mathbf{A}_{2}),\cdots,(\mathbf{Q}_{P},\mathbf{A}_{P})\}$, where $(\mathbf{Q}_{p},\mathbf{A}_{p})$ denotes the $p$-th QA pair. Given a response candidate $r_{i}^{k}$ and a dialog utterance $u_{i}^{t}$ in dialog $\mathcal{U}_{i}$, the model computes the term co-occurrence information as the Positive Pointwise Mutual Information (PPMI) of words of $r_{i}^{k}$ and $u_{i}^{t}$ in the retrieved QA pair set $\{\mathcal{Q},\mathcal{A}\}$. Let $[w_{r,1},w_{r,2},\cdots,w_{r,l_{r}}]$ and $[w_{u,1},w_{u,2},\cdots,w_{u,l_{u}}]$ denote the word sequences in $r_{i}^{k}$ and $u_{i}^{t}$. We construct a QA term correspondence matching matrix $\mathbf{M}_{3}$ as the third input channel of the CNN for $r_{i}^{k}$ and $u_{i}^{t}$ with the PPMI statistics from $\{\mathcal{Q},\mathcal{A}\}$. More specifically, $\forall i,j$, the element $m_{3,i,j}$ in $\mathbf{M}_{3}$ is computed as
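The display equation is missing from this copy. A PPMI form consistent with the surrounding description is sketched below; the way the joint probability is aggregated over the $P$ retrieved pairs is an assumption.

```latex
\[
m_{3,i,j}
  = \mathrm{PPMI}\!\left(w_{r,i}, w_{u,j} \mid \{\mathcal{Q},\mathcal{A}\}\right)
  = \max\!\left(0,\;
      \log \frac{\sum_{p^{\prime}=1}^{P}
                 p\!\left(w_{r,i} \in \mathbf{A}_{p^{\prime}},\,
                          w_{u,j} \in \mathbf{Q}_{p^{\prime}}
                          \mid \mathbf{Q}_{p^{\prime}}, \mathbf{A}_{p^{\prime}}\right)}
                {p(w_{r,i} \mid \mathcal{A}) \cdot p(w_{u,j} \mid \mathcal{Q})}
    \right)
\]
```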

where $w_{r,i}$ and $w_{u,j}$ denote the $i$-th word in the response candidate and the $j$-th word in the dialog utterance. The intuition is that the PPMI between $w_{r,i}$ and $w_{u,j}$ in the top retrieved relevant QA pair set $\{\mathcal{Q},\mathcal{A}\}$ could encode the correspondence matching patterns between $w_{r,i}$ and $w_{u,j}$ in external relevant QA pairs. Thus $\mathbf{M}_{3}$ is the extracted QA correspondence knowledge from the external collection $\mathcal{E}$ for $r_{i}^{k}$ and $u_{i}^{t}$. This correspondence matching knowledge captures relationships such as “(Problem Descriptions, Solutions)”, “(Symptoms, Causes)”, “(Information Request, Answers)”, etc. in the top ranked relevant QA pair set $\{\mathcal{Q},\mathcal{A}\}$. It helps the model better discriminate a good response candidate from a bad one given the dialog context utterances. To compute the co-occurrence count between $w_{r,i}$ and $w_{u,j}$, we count all word co-occurrences, treating $\mathbf{A}_{p}$ and $\mathbf{Q}_{p}$ as bags of words, as we found this setting to be more effective in experiments.
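The construction of $\mathbf{M}_{3}$ can be sketched as follows, treating each question and answer as a bag of words. The function name is hypothetical, and the normalization is an assumption made for this sketch: joint probabilities are normalized by the total number of (answer word, question word) pairings, and marginals by the total answer and question token counts.

```python
import math
from collections import Counter

import numpy as np


def ppmi_matrix(resp_tokens, utt_tokens, qa_pairs):
    """Return M3 with M3[i, j] = PPMI(w_{r,i}, w_{u,j}) over retrieved QA pairs.

    qa_pairs is a list of (question_tokens, answer_tokens) tuples.
    """
    joint = Counter()
    ans_counts, q_counts = Counter(), Counter()
    total_pairings = 0
    resp_vocab, utt_vocab = set(resp_tokens), set(utt_tokens)
    for q_tokens, a_tokens in qa_pairs:
        q_bag, a_bag = Counter(q_tokens), Counter(a_tokens)
        q_counts.update(q_bag)
        ans_counts.update(a_bag)
        total_pairings += len(q_tokens) * len(a_tokens)
        # Every answer occurrence co-occurs with every question occurrence.
        for w_r in resp_vocab.intersection(a_bag):
            for w_u in utt_vocab.intersection(q_bag):
                joint[(w_r, w_u)] += a_bag[w_r] * q_bag[w_u]
    n_ans = sum(ans_counts.values())
    n_q = sum(q_counts.values())
    m3 = np.zeros((len(resp_tokens), len(utt_tokens)))
    for i, w_r in enumerate(resp_tokens):
        for j, w_u in enumerate(utt_tokens):
            count = joint.get((w_r, w_u), 0)
            if count == 0:
                continue  # never co-occur: PPMI stays at zero
            p_joint = count / total_pairings
            p_r = ans_counts[w_r] / n_ans
            p_u = q_counts[w_u] / n_q
            m3[i, j] = max(0.0, math.log(p_joint / (p_r * p_u)))
    return m3
```

Word pairs that never co-occur in any retrieved QA pair keep a value of zero, so the channel is sparse for responses unrelated to the retrieved posts.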

Experiments

We evaluated our method with three data sets: Ubuntu Dialog Corpus (UDC), MSDialog, and AliMe data consisting of a set of customer service conversations in Chinese from Alibaba.

The Ubuntu Dialog Corpus (UDC) (Lowe et al., 2015) contains multi-turn technical support conversation data collected from the chat logs of the Freenode Internet Relay Chat (IRC) network. We used the data copy shared by Xu et al. (Xu et al., 2016), in which numbers, URLs and paths are replaced by special placeholders. It is also used in several previous related works (Wu et al., 2017). The data can be downloaded from https://www.dropbox.com/s/2fdn26rj6h9bpvl/ubuntu%20data.zip?dl=0. It consists of $1$ million context-response pairs for training, $0.5$ million pairs for validation and $0.5$ million pairs for testing. The statistics of this data are shown in Table 4. The positive response candidates in this data come from the true responses by humans, and the negative response candidates are randomly sampled.

4.1.2. MSDialog

In addition to UDC, we also crawled another technical support conversation data set from the Microsoft Answers community, which is a QA forum on topics about a variety of Microsoft products. We first crawled $35,536$ dialogs about $76$ different categories of Microsoft products including “Windows”, “IE”, “Office”, “Skype”, “Surface”, “Xbox”, etc. Note that some categories are more fine-grained, such as “Outlook_Calendar”, “Outlook_Contacts”, “Outlook_Email”, “Outlook_Messaging”, etc. Then we filtered out dialogs whose number of turns fell outside a preset range. After that we split the data into training/validation/testing partitions by time. Specifically, the training data contains $25,019$ dialogs from “2005-11-12” to “2017-08-20”. The validation data contains $4,654$ dialogs from “2017-08-21” to “2017-09-20”. The testing data contains $5,064$ dialogs from “2017-09-21” to “2017-10-04”.

The next step is to generate the dialog context and response candidates. For each dialog, we assigned the “User” label to the first participant, who posed the question leading to this information-seeking conversation, and the “Agent” label to the other participants who provided responses. The “Agent” in our data could be Microsoft customer service staff, a Microsoft MVP (Most Valuable Professional) or a user from the Microsoft Answers community. Then for each utterance $u_{i}^{t}$ by the “User”, we collected the previous $c$ utterances as the dialog context, where $c=\min(t-1,10)$ and $t-1$ is the total number of utterances before $u_{i}^{t}$. We consider the utterances by the user except the first utterance, since there is no dialog context associated with it. The true response by the “Agent” becomes the positive response candidate. For the negative response candidates, we adopted negative sampling to construct them following previous work (Wan et al., 2016; Lowe et al., 2015; Wu et al., 2017). For each dialog context, we first used the true response as the query to retrieve the top $1,000$ results from the whole response set of agents with BM25. Then we randomly sampled $9$ responses from them to construct the negative response candidates. The statistics of the MSDialog data are presented in Table 4. For data preprocessing, we performed tokenization and punctuation removal. Then we removed stop words and performed word stemming. For neural models, we also removed words that appear fewer than $5$ times in the whole corpus.
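The context windowing and negative sampling described above can be sketched as follows. Both function names are hypothetical, `ranked_pool` stands for the top $1,000$ BM25 results supplied by the caller, and excluding the true response from the pool before sampling is an assumption the text does not spell out.

```python
import random


def build_context(utterances, t, max_context=10):
    """Return the c = min(t - 1, max_context) utterances before the t-th one.

    Turns are 1-indexed, so utterances[t - 1] is the t-th utterance.
    """
    c = min(t - 1, max_context)
    return utterances[t - 1 - c:t - 1]


def sample_negatives(true_response, ranked_pool, num_neg=9, seed=0):
    """Randomly draw num_neg negatives from a BM25-ranked response pool."""
    rng = random.Random(seed)
    # Assumption: the true response must never be sampled as a negative.
    candidates = [r for r in ranked_pool if r != true_response]
    return rng.sample(candidates, min(num_neg, len(candidates)))
```

Each dialog context then has ten candidates in total: the true response plus the nine sampled negatives.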

4.1.3. AliMe Data

We collected the chat logs between customers and the chatbot AliMe from “2017-10-01” to “2017-10-20” at Alibaba. The chatbot is built on a question-to-question matching system (Li et al., 2017), where for each query it finds the most similar candidate question in a QA database and returns its answer as the reply. Interested readers can access AliMe Assist through the Taobao App, or the web version via https://consumerservice.taobao.com/online-help. It indexes all the questions in our QA database using Lucene (https://lucene.apache.org/core/). For each given query, it uses the TF-IDF ranking algorithm to retrieve candidates. To form our data set, we concatenated utterances within three turns to form a query (the majority, around $85\%$, of conversations in the data set are within 3 turns), and used the chatbot system to retrieve the top-K most similar candidate questions as candidate “responses” (we set K=15). A “response” here is a question in our system. We then asked a business analyst to annotate the candidate responses, where a “response” is labeled as positive if it matches the query, and negative otherwise. In all, we annotated 63,000 context-response pairs, of which we use 51,000 for training, 6,000 for testing, and 6,000 for validation, as shown in Table 4. Note that we have included human evaluation in the AliMe data. Furthermore, if the confidence score of answering a given user query is low, the system will prompt the three top related questions for users to choose from. We collected such user click logs as our external data, where we treat the clicked question as positive and the others as negative. We collected 510,000 clicked questions with answers from the click logs in total as the source of external knowledge.

4.2. Experimental Setup

We consider different types of baselines for comparison, including traditional retrieval models, deep text matching models and the state-of-the-art multi-turn conversation response ranking method, as follows:

BM25. This method uses the dialog context as the query to retrieve response candidates for response selection. We consider BM25 model (Robertson and Walker, 1994) as the retrieval model.

ARC-II. ARC-II is an interaction focused deep text matching architecture proposed by Hu et al. (Hu et al., 2014), which is built directly on the interaction matrix between the dialog context and response candidates. A CNN runs on the interaction matrix to learn the matching representation score.

MV-LSTM. MV-LSTM (Wan et al., 2016) is a neural text matching model that matches two sequences with multiple positional representations learned by a Bi-LSTM layer.

DRMM. DRMM (Guo et al., 2016) is a deep relevance matching model for ad-hoc retrieval. We implemented a variant of DRMM for short text matching. Specifically, the matching histogram is replaced by a top-k max pooling layer and the remaining part is the same as in the original model.

Duet. Duet (Mitra et al., 2017) is the state-of-the-art deep text matching model that jointly learns local lexical matching and global semantic matching between the two text sequences.

SMN. Sequential Matching Network (SMN) (Wu et al., 2017) is the state-of-the-art deep neural architecture for multi-turn conversation response selection. It matches a response candidate with each utterance in the context on multiple levels of granularity and then adopts a CNN to distill matching features. We used the TensorFlow (https://www.tensorflow.org/) implementation of SMN shared by the authors (Wu et al., 2017). The reported SMN results with the authors' code are on the raw data sets of UDC and MSDialog without any oversampling of negative training data.

We also consider a degenerated version of our model, denoted as DMN, where we do not incorporate external knowledge via pseudo-relevance feedback or QA correspondence knowledge distillation. Finally, we consider a baseline BM25-PRF, where we incorporate external knowledge into BM25 by matching conversation context with the expanded responses as in Section 3.3.2 using BM25 model.
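As a rough illustration of the lexical scoring behind the BM25 and BM25-PRF baselines, the sketch below ranks tokenized response candidates against a dialog context using one common BM25 variant. The values `k1=1.2` and `b=0.75`, the whitespace tokenization, the smoothed IDF form, and the helper name `bm25_scores` are assumptions made for illustration; the paper does not specify them. For BM25-PRF, the same scoring would presumably be applied to candidates whose token lists have been extended with expansion terms.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with a common BM25 variant."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores

# Toy example (made-up text): the dialog context acts as the query.
context = "how do i change the decimal separator in excel".split()
candidates = [
    "open regional settings and change the decimal separator".split(),
    "thanks for the quick reply".split(),
]
scores = bm25_scores(context, candidates)
ranking = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
```

Here `ranking` lists candidate indices from best to worst match; the first candidate shares several content words with the context and is therefore ranked above the second.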

2.2. Evaluation Methodology.

For the evaluation metrics, we adopted mean average precision (MAP), Recall@1, Recall@2, and Recall@5, following previous related work (Wu et al., 2017; Lowe et al., 2015). For UDC and MSDialog, MAP is equivalent to the mean reciprocal rank (MRR) since there is only one positive response candidate per dialog context. For AliMe data, each dialog context could have more than one positive response candidate.
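The metrics above can be sketched in a few lines. This is a minimal illustration that assumes each dialog context is represented by a list of 0/1 relevance labels ordered by descending model score, and that Recall@k is the fraction of positive candidates found in the top k; the function names are hypothetical. With a single positive candidate, average precision reduces to the reciprocal rank, which is why MAP equals MRR on UDC and MSDialog.

```python
def average_precision(ranked_labels):
    """ranked_labels: 0/1 relevance labels ordered by descending model score."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0

def recall_at_k(ranked_labels, k):
    """Fraction of all positive candidates that appear in the top k."""
    total = sum(ranked_labels)
    return sum(ranked_labels[:k]) / total if total else 0.0

def mean_average_precision(all_ranked_labels):
    """Average of per-context average precision."""
    return sum(average_precision(r) for r in all_ranked_labels) / len(all_ranked_labels)

# One positive at rank 3: average precision equals the reciprocal rank 1/3.
single_positive = [0, 0, 1, 0, 0]
# Two positives (as can happen in AliMe data), at ranks 1 and 4.
two_positives = [1, 0, 0, 1, 0]

ap_single = average_precision(single_positive)
ap_two = average_precision(two_positives)
map_score = mean_average_precision([single_positive, two_positives])
```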

2.3. Parameter Settings.

All models were implemented with TensorFlow and the MatchZoo toolkit (https://github.com/faneshion/MatchZoo). Hyper-parameters are tuned with the validation data. For the hyper-parameter settings of the DMN-KD and DMN-PRF models, we set the window size of the convolution and pooling kernels to $(3,3)$. The number of convolution kernels is $8$ for UDC and $2$ for MSDialog. The dimension of the hidden states of the BiGRU layer is set to $200$ for UDC and $100$ for MSDialog. The dropout rate is set to $0.3$ for UDC and $0.6$ for MSDialog. All models are trained on a single Nvidia Titan X GPU by stochastic gradient descent with the Adam algorithm (Kingma and Ba, 2014). The initial learning rate is $0.001$. The parameters of Adam, $\beta_{1}$ and $\beta_{2}$, are $0.9$ and $0.999$ respectively. The batch size is $200$ for UDC and $50$ for MSDialog. The maximum utterance length is $50$ for UDC and $90$ for MSDialog. The maximum conversation context length is set to $10$ following previous work (Wu et al., 2017). If a context has fewer than $10$ utterances, we pad it with zeros; otherwise we keep the most recent $10$ utterances. For DMN-PRF, we retrieved the top $10$ QA posts and extracted $10$ terms as response expansion terms. For DMN-KD, we retrieved the top $10$ question posts with accepted answers. For the word embeddings used in our experiments, we trained word embeddings with the Word2Vec tool (Mikolov et al., 2013) with the Skip-gram model using our training data. The max skip length between words and the number of negative examples are set to $5$ and $10$ respectively. The dimension of the word vectors is $200$. Word embeddings are initialized with these pre-trained word vectors and updated during the training process.
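The context truncation and zero padding described above might look like the following sketch. It assumes that utterances are lists of integer token ids, that overly long utterances are clipped to their first `max_len` tokens, and that padding is appended after the real content; the paper states only that zeros are padded and that the most recent utterances are kept, so these details and the name `fit_context` are illustrative assumptions.

```python
def fit_context(utterances, max_turns=10, max_len=50, pad_id=0):
    """Return exactly max_turns utterances, each with exactly max_len token ids."""
    recent = utterances[-max_turns:]  # keep only the most recent turns
    fixed = []
    for tokens in recent:
        clipped = tokens[:max_len]  # assumption: clip long utterances at the end
        fixed.append(clipped + [pad_id] * (max_len - len(clipped)))
    # Assumption: all-zero utterances are appended when the context is short.
    while len(fixed) < max_turns:
        fixed.append([pad_id] * max_len)
    return fixed

# A short context is padded up to max_turns utterances.
short_context = fit_context([[5, 6], [7]], max_turns=3, max_len=4)

# A context with 12 utterances keeps only the 10 most recent ones.
long_context = fit_context([[i] for i in range(1, 13)], max_turns=10, max_len=2)
```

With the defaults `max_turns=10` and `max_len=50`, the output shape matches the UDC setting reported above; `max_len=90` would correspond to MSDialog.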

3. Evaluation Results

We present evaluation results for the different methods on UDC and MSDialog in Table 5. We summarize our observations as follows: (1) The DMN-PRF model outperforms all the baseline methods, including traditional retrieval models, deep text matching models and the state-of-the-art SMN model, for response ranking on both conversation datasets. The results demonstrate that candidate response expansion with pseudo-relevance feedback can improve the ranking performance of responses in conversations. The main difference between the DMN-PRF model and the SMN model is the information extracted from retrieved feedback QA posts as external knowledge. This indicates the importance of modeling external knowledge with pseudo-relevance feedback beyond the dialog context for response selection. (2) The DMN-KD model also outperforms all the baseline methods on MSDialog and UDC. These results show that the extracted QA correspondence matching knowledge can help the model select better responses. The performances of DMN-KD and DMN-PRF are very close. (3) Comparing DMN-PRF and DMN-KD with the degenerated model DMN, we can see that incorporating external knowledge via either pseudo-relevance feedback or QA correspondence knowledge distillation improves the performance of the deep neural networks for response ranking by large margins. For example, the improvement of DMN-PRF over DMN on UDC is $4.83\%$ for MAP, $8.19\%$ for Recall@1, $5.11\%$ for Recall@2, and $1.60\%$ for Recall@5. The differences are statistically significant with $p<0.05$ as measured by the Student’s paired t-test.
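To make the significance test concrete, here is a minimal sketch of the paired t statistic computed over per-context scores of two systems evaluated on the same test contexts. The score lists are made-up toy values, not results from the paper, and the function name is hypothetical. Turning the statistic into a p-value requires the t distribution with $n-1$ degrees of freedom, which is left to a statistics library.

```python
import math

def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired per-context scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Unbiased sample variance of the paired differences.
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n), n - 1

# Hypothetical per-context reciprocal ranks for two systems (toy values only).
system_a = [1.0, 0.5, 1.0, 0.5]
system_b = [0.5, 0.5, 0.5, 0.25]
t_stat, dof = paired_t_statistic(system_a, system_b)
```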

3.2. Performance Comparison on AliMe Data

We further compare our models with the competing methods on AliMe data in Table 5. We find that: (1) our DMN model achieves comparable MAP to SMN but better Recall; (2) DMN-KD shows comparable or better results than all the baseline methods; (3) DMN-PRF significantly outperforms the other competing baselines, which shows the effectiveness of adding external pseudo-relevance feedback to the task; (4) both DMN-PRF and DMN-KD show better results than DMN, which demonstrates the importance of incorporating external knowledge via pseudo-relevance feedback and QA correspondence knowledge distillation.

3.3. Performance Comparison over Different Response Types

We conduct a fine-grained analysis of the performance of different models on different response types. We annotated the user intents in $10,020$ MSDialog utterances using Amazon Mechanical Turk (https://www.mturk.com/). We defined $12$ user intent types, including several types related to “questions” (original question, follow-up question, information request, clarifying question, etc.), “answers” (potential answer and further details), “gratitude” (expressing thanks, greetings) and “feedback” (positive feedback and negative feedback). Then we trained a Random Forest classifier with TF-IDF features and applied this classifier to predict the response candidate types in the testing data of MSDialog. The dialog contexts were grouped by the type of the true response candidate. Finally, we computed the average Recall@1 over the different groups. Figure 3 shows the results. We find that both DMN-KD and DMN-PRF improve over SMN for responses of type “questions”, “answers” and “gratitude”. This indicates that incorporating external knowledge with PRF or QA correspondence knowledge distillation can help the model select better responses, especially for QA-related responses. For responses of type “feedback”, DMN-KD and DMN-PRF achieve performance similar to SMN.
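The final grouping step can be sketched as follows, assuming each test context has already been reduced to a pair of (predicted type of the true response, whether the true response was ranked first). The function name and the example records are hypothetical and serve only to illustrate the computation; they are not data from the paper.

```python
from collections import defaultdict

def recall_at_1_by_type(examples):
    """examples: iterable of (response_type, hit), with hit = 1 if the true
    response candidate was ranked first for that dialog context, else 0."""
    groups = defaultdict(list)
    for response_type, hit in examples:
        groups[response_type].append(hit)
    # Average Recall@1 within each response-type group.
    return {t: sum(hits) / len(hits) for t, hits in groups.items()}

# Toy records (made up for illustration).
examples = [
    ("questions", 1), ("questions", 0),
    ("answers", 1), ("answers", 1), ("answers", 0), ("answers", 1),
    ("feedback", 0),
]
per_type = recall_at_1_by_type(examples)
```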

4. Model Ablation Analysis

We investigate the effectiveness of the different components of DMN-PRF and DMN-KD by removing them one by one from the original model on UDC and MSDialog data. We also study the effectiveness of different interaction types for $\mathbf{M1}/\mathbf{M2}/\mathbf{M3}$. Table 6 shows the results. We summarize our observations as follows: (1) For the interaction matrices, we find that the performance drops if we remove any one of $\mathbf{M1}/\mathbf{M2}$ for DMN-PRF or $\mathbf{M1}/\mathbf{M2}/\mathbf{M3}$ for DMN-KD. This indicates that word level interaction matching, sequence level interaction matching and external QA correspondence interaction matching are all useful for response selection in information-seeking conversation. (2) For interaction types, we find that dot product is the best setting on both UDC and MSDialog, except for the results of DMN-KD on MSDialog. The next best one is cosine similarity. Bilinear product is the worst, especially on MSDialog data. This is because bilinear product introduces a transformation matrix $\mathbf{A}$ as an additional model parameter, leading to higher model complexity. Thus the model is more likely to overfit the training data, especially for the relatively small MSDialog data. (3) If we leave only one channel in the interaction matrices, we find that $\mathbf{M1}$ is more powerful than $\mathbf{M2}$ for DMN-PRF. For DMN-KD, $\mathbf{M1}$ is also the best one, followed by $\mathbf{M2}$. $\mathbf{M3}$ is the weakest, but it still adds additional matching signals when it is combined with $\mathbf{M1}$ and $\mathbf{M2}$. The matching signals $\mathbf{M3}$ from the external collection can serve as supplementary features to the word embedding based matching matrix $\mathbf{M1}$ and the BiGRU representation based matching matrix $\mathbf{M2}$.
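The three interaction types compared above can be illustrated with a small sketch that fills an interaction matrix entry by entry from two lists of vectors. It is a simplified stand-in, not the model's implementation: the vectors are toy values, and the matrix `A` is fixed by hand here, whereas in the model it would be a learned parameter. The extra $d \times d$ entries of $\mathbf{A}$ are the additional parameters that the dot product and cosine similarity do not need.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def bilinear(u, v, A):
    """u^T A v; A adds d*d parameters (fixed here, learned in a real model)."""
    Av = [dot(row, v) for row in A]
    return dot(u, Av)

def interaction_matrix(context_vecs, response_vecs, score):
    """Entry (i, j) is the score between context vector i and response vector j."""
    return [[score(u, v) for v in response_vecs] for u in context_vecs]

# Toy 2-dimensional vectors (made up for illustration).
context_vecs = [[1.0, 2.0], [0.0, 1.0]]
response_vecs = [[3.0, 4.0]]
A = [[0.0, 1.0], [1.0, 0.0]]  # hypothetical transformation matrix

m_dot = interaction_matrix(context_vecs, response_vecs, dot)
m_cos = interaction_matrix(context_vecs, response_vecs, cosine)
m_bil = interaction_matrix(context_vecs, response_vecs, lambda u, v: bilinear(u, v, A))
```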

5. Impact of Conversation Context Length

We further analyze the impact of the conversation context length on the performance of our proposed DMN-KD and DMN-PRF models. As presented in Figure 4, the performance first increases and then decreases as the conversation context length grows. The reason for these trends is that the context length controls how many previous utterances in the dialog context are modeled by DMN-KD and DMN-PRF. If the context length is too small, there is not enough information for the model to learn the matching patterns between the context and response candidates. However, setting the context length too large also brings noise into the model, since the words in utterances from several turns earlier can be very different due to topic changes during conversations.

6. Case Study

We perform a case study in Table 7 on the top ranked responses produced by different methods, including SMN, DMN-KD and DMN-PRF. In this example, both DMN-KD and DMN-PRF produced correct top ranked responses. We checked the QA posts retrieved for the correct response candidate and found that “settings, regional, change, windows, separator, format, excel, panel, application” are the most frequent terms. Among them, “excel” is especially useful for promoting the rank of the correct response candidate, since this term appears multiple times in the dialog context but does not actually appear in the raw text of the correct response candidate. This gives an example of the effectiveness of incorporating external knowledge from the retrieved QA posts into response candidates.

Conclusions and Future Work

In this paper, we propose a learning framework based on deep matching networks to leverage external knowledge for response ranking in information-seeking conversation systems. We incorporate external knowledge into deep neural models with pseudo-relevance feedback and QA correspondence knowledge distillation. Extensive experiments on both open benchmarks and commercial data show our methods outperform various baselines including the state-of-the-art methods. We also perform analysis on different response types and model variations to provide insights on model applications. For future work, we plan to model user intent in information-seeking conversations and learn meaningful patterns from user intent dynamics to help response selection. Incorporating both structured and unstructured knowledge into deep matching networks for response ranking is also interesting to explore.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval and in part by NSF grant #IIS-1419693. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.