Applying Deep Learning to Answer Selection: A Study and An Open Task

Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, Bowen Zhou

Introduction

Natural language understanding based spoken dialog system has been a popular topic in the past years of artificial intelligence renaissance. Many of those influential systems include a question answering module, e.g. Apple’s Siri, IBM’s Watson and Amazon’s Echo. In this paper, we address the Question Answering (QA) module in those spoken QA systems. We treat the QA from a text matching and selection perspective. IBM’s Watson system is a classical example of the traditional way of doing Question Answering (QA). In this work we utilize a deep learning framework to accomplish the answer selection which is a key step in the QA task. Hence QA is studied from an answer matching and selection perspective. Given a question $q$ and an answer candidate pool $\{a_{1},a_{2},...,a_{s}\}$ for that question ( $s$ is the pool size), the goal is to find the best answer candidate $a_{k}$ , $1\leq k\leq s$ . If the selected answer $a_{k}$ is inside the ground truth set (one question could have more than one correct answer) of question $q$ , the question $q$ is considered to be answered correctly, otherwise it is answered incorrectly. From the definition, the QA problem can be regarded as a binary classification problem. For each question, for each answer candidate, it may be appropriate or not. In order to find the best pair, we need a metric to measure the matching degree of each QA pair so that the QA pair with highest metric value will be chosen.

The above definition is general. The only assumption made is that for every question there is an answer candidate pool. In practice, the pool can be easily built by using a general search engine like Google Search or an information retrieval software library like Apache Lucene.

We created a data set by collecting question and answer pairs from the internet. All these question and answer pairs are in the insurance domain. The construction of this insurance domain QA corpus was driven by the intense scientific and commercial interest in this domain. We released this corpus git clone https://github.com/shuzi/insuranceQA.git to create an open QA task, enabling other researchers to utilize it and supporting a fair comparison among different methods. The corpus consists of four parts: train, development, test1 and test2. Table 1 gives the data statistics. All experiments conducted in this paper are based on this corpus. To our best knowledge, it is the first time an insurance domain QA task has been released.

Our QA task requires specifying an answer candidate pool for each question in the development, test1 and test2 parts of the corpus. The released corpus contains totally 24 981 unique answers. It is possible to use the whole answer space as the candidate pool, so that each question must be compared with 24 981 answer candidates. However, this is impractical due to time consuming computations. In this paper, we set the pool size to be 500, so that it is both practical and still a challenging task. We put the ground truth answers into the pool and randomly sample negative answers from the answer space until the pool size reaches 500.

The technology described in this paper with the released data set and benchmark task is targeting potential applications like online customer service. Hence it is not supposed to handle question answering tasks that require reasoning, e.g. is tomorrow Tuesday? (answer depends on if today is Monday.) The rest of the paper is organized as follows: Sec. 2 describes the different architectures used this work; Sec. 3 provides the experimental setup details; experimental results and discussions are presented in Sec. 4; Sec. 5 contains related work and finally we draw conclusions in Sec. 6.

Model Description

In this section we describe the proposed deep learning framework and many variations based on that framework. However, the main idea of those different systems is the same: learn a distributed vector representation of a given question and its answer candidates and then use a similarity metric to measure the matching degree. We first developed two baseline systems for comparison.

The first baseline system is a bag-of-words model. Step one is to train a word embedding by . This word embedding provides the word vector for each token in the question and its candidate answer. From these, the baseline system produces the idf-weighted sum of word vectors for the question and for all of its answer candidates. This produces a vector representation for the question and each answer candidate. The last step is to calculate the cosine similarity between each question/candidate pair. The pair with highest cosine similarity is returned as the answer. The second baseline is an information retrieval (IR) baseline. The state-of-the-art weighted dependency model (WD) is used. The WD model employs a weighted combination of term-based and term proximity-based ranking features to score each candidate answer. Example features include counts of question bigrams in ordered and unordered windows of different sizes in each candidate answer, in addition to simple unigram counts. The basic idea is that important bigrams or unigrams in the question should receive higher weights when their frequencies are computed. Thus, the feature weights are assigned in accordance to the importance of the question unigrams or bigrams that they are defined over, where the importance factor is learned as part of the model training process. Row 1 and 2 (first column Idx) of Table 2 are the baseline system results.

2 CNNs-based System

In this paper, a QA framework based on Convolutional Neural Networks (CNN) is presented. As summarized in Chapter 11 of , a CNN leverages three important ideas that can help improve a machine learning system: sparse interaction, parameter sharing and equivariant representation. Sparse interaction contrasts with traditional neural networks where each output is interactive with each input. In a CNN, the filter size (or kernel size) is usually much smaller than the input size. As a result , the output is only interactive with a narrow window of the input. Parameter sharing refers to reusing the filter parameters in the convolution operations, while the element in the weight matrix of traditional neural network will be used only once to calculate the output. Equivariant representation is related to the idea of $k$ -MaxPooling which is usually combined with a CNN. In this paper we always set $k=1$ . So each filter of the CNN represents some feature, and after the convolution operation, the $1$ -MaxPooling value represents the highest degree that the input contains the feature. The position of that feature in the input is irrelevant due to the convolution. This property is very useful for many NLP applications. Below is an example to demonstrate our CNN implementation.

The left matrix $W$ is the input sentence. Each word is represented by a $3$ -dimensional word embedding vector and the input length is $4$ . The right matrix $F$ represents the filter. The 2-gram filter size is $3\times 2$ . The convolution output of the input $W$ and the filter $F$ is a $3$ -dim vector $O$ , assuming zero padding has been done so that only a narrow convolution is conducted.

After $1$ -MaxPooling, the maximum of the 3 values will be kept for the filter $F$ which indicates the highest degree that filter $F$ matches the input $W$ .

3 Training and Loss Function

Different architectures will be described later. However all those different architectures share the same training and testing mechanism. In this paper we minimize a ranking loss similar to . During training, for each training question $Q$ there is a positive answer $A^{+}$ (the ground truth). A training instance is constructed by pairing this $A^{+}$ with a negative answer $A^{-}$ (a wrong answer) sampled from the whole answer space. The deep learning framework generates vector representations for the question and the two candidates: $V_{Q}$ , $V_{A^{+}}$ and $V_{A^{-}}$ . The cosine similarities $cos(V_{Q},V_{A^{+}})$ and $cos(V_{Q},V_{A^{-}})$ are calculated and the distance between the two similarities is compared to a margin: $cos(V_{Q},V_{A^{+}})-cos(V_{Q},V_{A^{-}})<m$ . $m$ is the margin. When this condition is satisfied, the implication is that the vector space embedding either ranks the positive answer below the negative answer, or does not sufficiently rank the positive answer above the negative answer. If $cos(V_{Q},V_{A^{+}})-cos(V_{Q},V_{A^{-}})>=m$ there is no update to the parameters and a new negative example is sampled until the margin is less than $m$ (to reduce running time we set maximum 50 times in this paper). The hinge loss function is hence defined as follows:

For testing, we calculate the $cos(V_{Q},V_{candidate})$ between the question $Q$ and each answer candidate $V_{candidate}$ in the pool (size 500). The candidate answer with largest cosine similarity is selected.

4 Architectures

In this subsection we demonstrate several proposed architectures for this QA task. Figure 1 shows the Architecture I . Q is the input question provided as input to the first hidden layer HLQ. The hidden layer (HL) is defined as $z=tanh(Wx+B)$ . $W$ is the weight matrix; $B$ is the bias vector; $x$ is input; $z$ is the output of the activation function $tanh$ . The output then flows to the CNN layer CNNQ, applied to extract question side features. P is the MaxPooling layer (we always use $1$ -MaxPooling in this paper) and T is the $tanh$ layer. Similar to the question side, the answer A is processed by HLA and then features are extracted by CNNA . $1$ -MaxPooling P and $tanh$ layer T will function in the last step. The result is a vector representation for both question and answer. The final output is the cosine similarity between these vectors. Row 3 of Table 2 is the Architecture I result.

Figure 2 is the Architecture II . The main difference compared to Architecture I is that both question and answer sides share the same HL and CNN weights. Row 4 of Table 2 is the Architecture II result.

We also consider architectures with a hidden layer after the CNN. Figure 3 is the Architecture III in which another HLQ is added at the question side after the CNN and another HLA is added at the answer side after the CNN. Row 5 of Table 2 is the Architecture III result. Architecture IV, shown in Figure 4, is similar except the second HL of both question and answer share the same HLQA weights. The rows 6 and 7 of Table 2 are the Architecture IV results.

Figure 5 is the Architecture V where two layers of CNNQA are deployed. In section 2.2 we show the convolution output is a vector ( $3$ -dim in that example). This is true only for CNNs with a single filter. By applying multiple filters the result is a matrix. If there are 4 filters utilized for the example in section 2.2, the output is the following matrix.

Each row represents the output of one filter and each column represents a bigram of the input. This matrix is the input to the next CNNQA layer. For this second layer, every bigram is effectively one “word” and the previous filter’s output for that bigram is its word embedding. Row 11 of Table 2 is the result for Architecture V . Architecture VI in Figure 6 is similar to Architecture V except we utilize layer-wise supervision. After each CNNQA layer there is $1$ -MaxPooling and a $tanh$ layer so that the cost function can be calculated and back-propagation can be conducted. The result of Architecture VI is in row 12 of Table 2.

We have tried another three techniques to improve Architecture II in Figure 2 . First, the CNN filter quantity has been increased, see row 8 9 and 10 of Table 2. Second, the convolution operation has been augmented to include skip-bigrams. Consider the example in section 2.2, for the input and one filter in Eq. 1, the augmented convolution operation will not only produce Eq. 2 but also the following discontinuous convolution:

The $1$ -MaxPooling will still be applied to get the largest value among $[o_{1},o_{2},o_{3},o_{4},o_{5}]$ so that this filter is automatically adapted to match a bigram or skip-bigram feature. Rows 13 14 and 15 of Table 2 show the results. Third, we investigate the similarity metric. Until now, we have been using the cosine similarity which is widely adopted for vector space models. However, is cosine similarity the best option for this task? Table 3 is the results for similarity metric study. Some metrics include hyperparameters and experiments with various hyperparameters have been conducted. We propose two novel metrics (GESD and AESD) which demonstrate superior performance.

Experimental Setup

The deep learning framework in this paper has been built from scratch using Java. To improve speed, we adopt the HOGWILD approach . Each thread processes one training instance at one time and updates the weights of the neural networks. There is no locking in any thread. The word embedding (100 dimensions) is trained by word2vec and used for initialization. Word embeddings are also parameters and are optimized for the QA task. Stochastic Gradient Descent is the optimization strategy and the L2-norm is also added in the loss function. In this paper, the weight of the L2-norm is $0.0001$ , the learning rate is $0.01$ and margin $m$ is $0.009$ . Those hyperparameters are chosen based on previous experiences in using deep learning on this data and they are not very sensitive within reasonable range. The utilized computing resources for this work are enormous. We heavily occupy a Power 7 cluster which consists of 75 machines. Each machine has 32 physical cores and each core supports 2-4 hyperthreading. The HOGWILD approach will bring some randomness due to no locking. Even with locking, the thread scheduler would alter the order of examples between runs so that randomness would still exist. Therefore, for each row in Table 2 (except for row 1 2) and Table 3, 10 experiments have been conducted on the dev set and the run with best dev score is chosen to calculate the test scores.

Results and Discussions

In this section, detailed analysis on experimental results are given. From Table 2 and 3 the following conclusions can be made: (1) baseline 1 only utilizes word embeddings and baseline 2 is based on traditional term based features. Our proposed method can reach significantly better accuracy which demonstrates the superiority of deep learning approach; (2) using separate hidden layer (HL) or CNN layers for Q and A has worse performance compared to a shared HL or CNN layer (Table 2, row 3 vs. 4, row 5 vs. 6). This is reasonable because for a shared layer network, the corresponding elements in Q and A vector are guaranteed to represent the same CNN filter convolution result while for network with separate Q and A layers, there is no such constraint and the optimizer has to learn over a set of double sized parameters. Hence the optimizer faces greater difficulty; (3) adding a HL after the CNN degrades the performance (Table 2, row 4 vs. 6 and 7). This proves that CNN already captures useful features for QA matching and unnecessary mapping the features to another space makes no sense at all; (4) increasing the CNN filter quantity can capture more features which gives notable improvement (Table 2, row 4 vs. 8, 9 and 10); (5) two layers of CNN can represent a higher level of abstraction with wider range in the input. Hence going deeper by using two CNN layers improves the accuracy (Table 2, row 4 vs. 11); (6) effective learning in deep networks is often a difficult task. Layer-wise supervision can alleviate the problem (Table 2, row 11 vs. 12); (7) combining bigram and skip-bigram features brings gain on Test1 but not on Test2 (Table 2, row 4 vs. 13, row 8 vs. 14, row 9 vs. 15); (8) Table 3 shows that with the same model capacity, similarity metric plays an important role and the widely used cosine similarity is not the best choice for this task. The similarity in Table 3 can be categorized into three classes: $L1$ -norm based metric which is the semantic distance of Q and A summed from each coordinate axis; $L2$ -norm based metric which is the straight-line semantic distance of Q and A; inner product based metric which measures the angle between Q and A. We propose two new metrics that combine $L2$ -norm and inner product by multiplication (GESD Geometric mean of Euclidean and Sigmoid Dot product) and addition (AESD Arithmetic mean of Euclidean and Sigmoid Dot product). The proposed two metrics are the best among all compared metrics. Finally, in the bottom of Table 3 it is clear that with more filters, the proposed metric can achieve even better performance.

Related Work

Deep learning technology has been widely used in machine learning tasks, often demonstrating superior performance compared to traditional methods. Many of those applications focus on classification related tasks, e.g. on image recognition , on speech and on machine translation . This paper is based on many prior works on utilizing deep learning for NLP tasks: Gao et al. proposed a CNN based network which maps source-target document pairs to embedding vectors such that the distance between source documents and their corresponding interesting targets is minimized. Lu and Li propose a CNN based deep network for a short text matching task; Hu et al. also use several CNN based networks for sentence matching; Kalchbrenner et al. use a CNN for sentiment prediction and question classification; Kim uses a CNN in sentiment analysis; Zeng et al. use a CNN for relation classification; Socher et al. use a recursive network for paraphrase detection and parsing; Iyyer et al. propose a recursive network for factoid question answering; Weston et al. use a CNN for hashtag prediction; Yu et al. use a CNN for answer selection; Yin and Schütze use a bi-CNN for paraphrase identification. Our work follows the spirit of many previous work in the sense that we utilize CNN to map natural language sentences into embedding vectors so that the similarity can be calculated. However this paper has conducted extensive experiments over various architectures which are not included in previous work. Furthermore, we explored different similarity metrics, skip-bigram based convolution and layerwise supervision which have not been presented in previous work.

Conclusions

In this paper, the spoken question answering system is studied from an answer selection perspective by employing a deep learning framework. The proposed framework does not rely on any linguistic tool and can be easily adapted to different languages or domains. Our work serves as solid evidence that deep learning based QA is an encouraging research direction. The scientific contributions can be summarized as follows: (1) creating a new QA task in the insurance domain and releasing a new corpus so that different methods can be fairly compared; (2) proposing a general deep learning framework with several variants for the QA task and comparison experiments have been conducted; (3) utilizing novel techniques that bring improvements: multi-layer CNN with layer-wise supervision, augmented CNN with discontinuous convolution and novel similarity metric that combine both $L2$ -norm and inner product information; (4) the best scores in this paper are very promising: for this challenging task (select one answer from a pool with size 500), the top one accuracy of test corpus can reach up to 65.3%; (5) for researchers who want to proceed with this task, this paper provides valuable guidance: a shared layer structure should be adopted; no need to append a hidden layer after the CNN; two levels of CNN with layer-wise training improves accuracy; discontinuous convolution sometimes can help; the similarity metric plays a crucial role and the proposed metric is preferred and finally increasing the filter quantity brings improvement.