Exploring Question Understanding and Adaptation in Neural-Network-Based Question Answering

Junbei Zhang, Xiaodan Zhu, Qian Chen, Lirong Dai, Si Wei, Hui Jiang

Introduction

Enabling computers to understand given documents and answer questions about their content has recently attracted intensive interest, including but not limited to the efforts in (Richardson et al., 2013; Hermann et al., 2015; Hill et al., 2015; Rajpurkar et al., 2016; Nguyen et al., 2016; Berant et al., 2014). Many specific problems, such as machine comprehension and question answering, involve modeling such question-document pairs.

The recent availability of relatively large training datasets (see Section 2 for more details) has made it more feasible to train and estimate rather complex models in an end-to-end fashion for these problems, in which a whole model is fit directly with given question-answer tuples, and the resulting models have been shown to be rather effective.

In this paper, we take a closer look at modeling questions in such an end-to-end neural network framework, since we regard question understanding as important for such problems. We first introduce syntactic information to help encode questions. We then view and model different types of questions and the information shared among them as an adaptation problem and propose adaptation models for them. On the Stanford Question Answering Dataset (SQuAD), we show that these approaches can help attain better results over our competitive baselines.

Related Work

Recent advances in reading comprehension and question answering have been closely associated with the availability of various datasets. Richardson et al. (2013) released the MCTest data consisting of 500 short, fictional open-domain stories and 2000 questions. The CNN/Daily Mail dataset (Hermann et al., 2015) contains news articles for cloze-style machine comprehension, in which only entities are removed and tested for comprehension. Children’s Book Test (CBT) (Hill et al., 2015) leverages named entities, common nouns, verbs, and prepositions to test reading comprehension. The Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016) is a more recently released dataset, which consists of more than 100,000 questions for documents taken from Wikipedia across a wide range of topics. The question-answer pairs are annotated through crowdsourcing. Answers are spans of text marked in the original documents. In this paper, we use SQuAD to evaluate our models.

Many neural network models have been studied on the SQuAD task. Wang and Jiang (2016) proposed match-LSTM to associate documents and questions and adapted the pointer network (Vinyals et al., 2015) to determine the positions of the answer text spans. Yu et al. (2016) proposed a dynamic chunk reader to extract and rank a set of answer candidates. Yang et al. (2016) focused on word representation and presented a fine-grained gating mechanism to dynamically combine word-level and character-level representations based on the properties of words. Wang et al. (2016) proposed a multi-perspective context matching (MPCM) model, which matched an encoded document and question from multiple perspectives. Xiong et al. (2016) proposed a dynamic decoder and a highway maxout network to improve the effectiveness of the decoder. The bi-directional attention flow (BIDAF) model (Seo et al., 2016) used bi-directional attention to obtain a question-aware context representation.

In this paper, we introduce syntactic information to encode questions with a specific form of recursive neural networks (Zhu et al., 2015; Tai et al., 2015; Chen et al., 2016; Socher et al., 2011). More specifically, we explore a tree-structured LSTM (Zhu et al., 2015; Tai et al., 2015) which extends the linear-chain long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) to a recursive structure, which has the potential to capture long-distance interactions over the structures.

Different types of questions are often used to seek different types of information. For example, a "what" question could have very different properties from those of a "why" question, while the two may share information and need to be trained together instead of separately. We view this as an "adaptation" problem: different types of questions share a basic model but are still discriminated when needed. Specifically, we are motivated by the idea of the "i-vector" (Dehak et al., 2011) in speech recognition, where neural-network-based adaptation is performed among different groups of speakers; we focus instead on different types of questions here.

The Approach

Our baseline model is composed of the following typical components: word embedding, input encoder, alignment, aggregation, and prediction. Below we discuss these components in more detail.

The above word representation focuses on representing individual words, whereas the input encoder employs recurrent neural networks to obtain the representation of a word in its context. We use a bi-directional GRU (BiGRU) (Cho et al., 2014) for both documents and questions.

Each $\mathbf{U}_{ij}$ represents the similarity between a question word $\mathbf{Q}_{i}^{c}$ and a document word $\mathbf{D}_{j}^{c}$ .

With the attention weights computed, we obtain the encoding of the question for each document word $w_{j}$ as follows, which we call word-level Q-code in this paper:
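As a minimal sketch of the word-level Q-code, the NumPy code below computes a similarity matrix, normalizes each column over the question words, and takes the weighted sum of question encodings for every document word. The dot-product similarity, the function name `word_level_q_code`, and the toy dimensions are assumptions made for illustration; the exact similarity function may differ.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_level_q_code(Q_c, D_c):
    """Q_c: (N, d) encoded question words; D_c: (M, d) encoded document words."""
    U = Q_c @ D_c.T          # (N, M); dot-product similarity is an assumption
    A = softmax(U, axis=0)   # column j holds the attention a_j over question words
    Q_w = A.T @ Q_c          # (M, d): one question summary per document word
    return U, A, Q_w

rng = np.random.default_rng(0)
Q_c = rng.normal(size=(4, 6))   # toy question: N = 4 words, d = 6
D_c = rng.normal(size=(9, 6))   # toy document: M = 9 words
U, A, Q_w = word_level_q_code(Q_c, D_c)
```

Each row of `Q_w` is a convex combination of the question word encodings, so a document word that matches one question word strongly receives a Q-code close to that word's encoding.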

Question-based filtering. To better explore question understanding, we design a question-based filtering layer. As detailed later, different question representations can easily be incorporated into this layer, in addition to its use as a filter that finds key information in the document based on the question. The layer can be extended with more sophisticated question modeling.

In the basic form of question-based filtering, for each question word $w_{i}$, we find which words in the document are associated with it. Similar to $\mathbf{a}_{j}$ discussed above, we can obtain the attention weights on document words for each question word $w_{i}$:

where the pooling functions we use include max-pooling and mean-pooling. The document softly filtered by the corresponding question, $\mathbf{D}^{f}$, can then be calculated by:
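The filtering step can be sketched as follows, under stated assumptions: each question word attends over the document words, the per-question-word attention vectors are pooled across the question, and the pooled weights scale the document encodings. The renormalization of the pooled weights and the names `question_based_filter`, `B`, and `b_f` are illustrative assumptions, not details confirmed by the text.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def question_based_filter(U, D_c, pooling="max"):
    """U: (N, M) question-document similarities; D_c: (M, d) document encodings."""
    B = softmax(U, axis=1)  # row i: attention of question word i over document words
    if pooling == "max":
        pooled = B.max(axis=0)    # (M,)
    else:
        pooled = B.mean(axis=0)   # (M,)
    b_f = pooled / pooled.sum()   # renormalizing to sum to 1 is an assumption
    D_f = b_f[:, None] * D_c      # softly filtered document, shape (M, d)
    return b_f, D_f

rng = np.random.default_rng(1)
U = rng.normal(size=(4, 9))
D_c = rng.normal(size=(9, 6))
b_max, D_f_max = question_based_filter(U, D_c, pooling="max")
b_mean, D_f_mean = question_based_filter(U, D_c, pooling="mean")
```

Max-pooling keeps a document word if any single question word attends to it strongly, while mean-pooling favors words that are relevant to the question as a whole.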

Through concatenating the document representation $\mathbf{D}^{c}$ , word-level Q-code $\mathbf{Q}^{w}$ and question-filtered document $\mathbf{D}^{f}$ , we can finally obtain the alignment layer representation:

where " $\circ$ " stands for element-wise multiplication and " $-$ " is simply the vector subtraction.
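One plausible reading of the alignment layer is a feature-wise concatenation of the document encoding, the word-level Q-code, their element-wise product and difference, and the question-filtered document. The sketch below follows that reading; the exact arrangement of the terms, and in particular how $\mathbf{D}^{f}$ enters, is an assumption, and `alignment_layer` is a hypothetical name.

```python
import numpy as np

def alignment_layer(D_c, Q_w, D_f):
    """All inputs have shape (M, d); the output has shape (M, 5 * d)."""
    features = [
        D_c,          # document representation
        Q_w,          # word-level Q-code
        D_c * Q_w,    # element-wise multiplication
        D_c - Q_w,    # vector subtraction
        D_f,          # question-filtered document (placement is an assumption)
    ]
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(2)
D_c, Q_w, D_f = (rng.normal(size=(9, 6)) for _ in range(3))
I = alignment_layer(D_c, Q_w, D_f)
```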

After acquiring the local alignment representation, key information in the document and question has been collected, and the aggregation layer is then applied to find answers. We use three BiGRU layers to model the process that aggregates local information to make the global decision on the answer spans. We found that a residual architecture (He et al., 2016), as described in Figure 2, is very effective in this aggregation process:

The SQuAD QA task requires a span of text to answer a question. We use a pointer network (Vinyals et al., 2015) to predict the start and end positions of answers, as in (Wang and Jiang, 2016). Unlike their method, we use a two-directional prediction to obtain the positions. For one direction, we first predict the start position of the answer span and then the end position, which is implemented with the following equations:

where $\mathbf{I}^{3}$ is the inference layer output, $\mathbf{h}_{s+}$ is the hidden state of the first step, and all $\mathbf{W}$ are trainable matrices. We also perform the prediction in the other direction, predicting the end position first and then the start position:

We finally identify the span of an answer with the following equation:

We use mean-pooling here, as it is more effective on the development set than alternatives such as max-pooling.
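The final span selection can be illustrated with a short sketch. It assumes that the two directions each yield a start and an end distribution over document positions, that the two are combined by mean-pooling, and that the answer is the pair (s, e) with s ≤ e maximizing the product of the pooled probabilities. The function `select_span` and the toy distributions are hypothetical.

```python
import numpy as np

def select_span(p_s_fwd, p_e_fwd, p_s_bwd, p_e_bwd):
    """Each argument is a length-M probability vector over document positions."""
    p_s = np.mean([p_s_fwd, p_s_bwd], axis=0)   # mean-pooling over the two directions
    p_e = np.mean([p_e_fwd, p_e_bwd], axis=0)
    scores = np.triu(np.outer(p_s, p_e))        # zero out spans whose end precedes the start
    s, e = np.unravel_index(np.argmax(scores), scores.shape)
    return int(s), int(e)

span = select_span(
    p_s_fwd=[0.1, 0.6, 0.2, 0.1], p_e_fwd=[0.1, 0.1, 0.7, 0.1],
    p_s_bwd=[0.2, 0.5, 0.2, 0.1], p_e_bwd=[0.6, 0.05, 0.3, 0.05],
)
# The s <= e constraint matters when the unconstrained maximum is an invalid span.
constrained = select_span([0.2, 0.8], [0.7, 0.3], [0.2, 0.8], [0.7, 0.3])
```

In the second call the highest unconstrained product pairs start position 1 with end position 0, which is not a valid span, so the upper-triangular mask forces the choice among valid spans only.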

Question Understanding and Adaptation

The interplay of syntax and semantics in natural language questions is of interest for question representation. We attempt to incorporate syntactic information into the question representation with TreeLSTM (Zhu et al., 2015; Tai et al., 2015). In general, a TreeLSTM can perform semantic composition over given syntactic structures.

Unlike the chain-structured LSTM (Hochreiter and Schmidhuber, 1997), the TreeLSTM captures long-distance interaction on a tree. The update of a TreeLSTM node is described at a high level with Equation (22), and the detailed computation is described in (23–29). Specifically, the input of a TreeLSTM node is used to configure four gates: the input gate $\mathbf{i}_{t}$ , output gate $\mathbf{o}_{t}$ , and the two forget gates $\mathbf{f}_{t}^{L}$ for the left child input and $\mathbf{f}_{t}^{R}$ for the right. The memory cell $\mathbf{c}_{t}$ considers each child’s cell vector, $\mathbf{c}_{t-1}^{L}$ and $\mathbf{c}_{t-1}^{R}$ , which are gated by the left forget gate $\mathbf{f}_{t}^{L}$ and right forget gate $\mathbf{f}_{t}^{R}$ , respectively.

where $\sigma$ is the sigmoid function, $\circ$ is the element-wise multiplication of two vectors, and all $\mathbf{W}$ , $\mathbf{U}$ are trainable matrices.
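To make the gating concrete, the following sketch implements one binary TreeLSTM node update in NumPy, following the standard formulation of Zhu et al. (2015) and Tai et al. (2015): an input gate, an output gate, and separate forget gates for the left and right children. The parameter layout (one input matrix `W` and two child matrices `UL`, `UR` per gate, no bias terms) and the helper names are assumptions for illustration and may not match the exact parameterization used here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(d_in, d_h, rng):
    # One (W, UL, UR) triple per gate; "u" is the candidate cell input.
    return {
        gate: {
            "W": rng.normal(scale=0.1, size=(d_h, d_in)),
            "UL": rng.normal(scale=0.1, size=(d_h, d_h)),
            "UR": rng.normal(scale=0.1, size=(d_h, d_h)),
        }
        for gate in ("i", "fL", "fR", "o", "u")
    }

def tree_lstm_node(x, hL, cL, hR, cR, params):
    """Compose the left child (hL, cL) and right child (hR, cR) with node input x."""
    def pre(gate):
        p = params[gate]
        return p["W"] @ x + p["UL"] @ hL + p["UR"] @ hR

    i = sigmoid(pre("i"))        # input gate
    fL = sigmoid(pre("fL"))      # forget gate for the left child
    fR = sigmoid(pre("fR"))      # forget gate for the right child
    o = sigmoid(pre("o"))        # output gate
    u = np.tanh(pre("u"))        # candidate cell content
    c = fL * cL + fR * cR + i * u
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(3)
d_in, d_h = 5, 4
params = init_params(d_in, d_h, rng)
x = rng.normal(size=d_in)
hL, cL, hR, cR = (rng.normal(size=d_h) for _ in range(4))
h, c = tree_lstm_node(x, hL, cL, hR, cR, params)
```

Applying `tree_lstm_node` bottom-up over a binarized parse of the question yields a hidden state at the root that summarizes the whole question.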

where $\mathbf{I}_{new}$ is the new output of the alignment layer, and the function $repmat$ copies $\mathbf{Q}^{TL}$ $M$ times to match the shape of $\mathbf{I}$.
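The role of $repmat$ can be sketched with `np.tile`: the single question vector is repeated once per document position so that it can be joined with the alignment output. Joining by concatenation along the feature axis is an assumption here, as is the name `append_question_code`.

```python
import numpy as np

def append_question_code(I, Q_TL):
    """I: (M, d_I) alignment output; Q_TL: (d_q,) TreeLSTM question vector."""
    M = I.shape[0]
    tiled = np.tile(Q_TL, (M, 1))                  # (M, d_q): one copy per document word
    return np.concatenate([I, tiled], axis=-1)     # concatenation is an assumption

I = np.zeros((9, 30))
Q_TL = np.arange(4, dtype=float)
I_new = append_question_code(I, Q_TL)
```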

Question Adaptation

Questions by nature are often composed to fulfill different types of information needs. For example, a "when" question seeks a different type of information (i.e., temporal information) than a "why" question does. Different types of questions and the corresponding answers could potentially have different distributional regularities.

As discussed, different types of questions and their answers may share common regularities while also having distinct properties. We view this as an adaptation problem, in which different types of questions share a basic model but are still discriminated when needed. Specifically, we borrow ideas from speaker adaptation (Dehak et al., 2011) in speech recognition, where neural-network-based adaptation is performed among different groups of speakers.

Conceptually, we regard a type of question as analogous to a group of acoustically similar speakers. Specifically, we propose a question discriminative block, or simply a discriminative block (Figure 3), to perform question adaptation. The main idea is described below:

For each input question $\mathbf{x}$, we can decompose it into two parts: the cluster it belongs to (i.e., its question type) and its variation within the cluster. The information of the cluster is encoded in a vector $\mathbf{\bar{x}}^{c}$. In order to keep the calculation differentiable, we compute the weights of all the clusters based on the distances between $\mathbf{x}$ and each cluster center vector, instead of just choosing the closest cluster. Then the discriminative vector $\mathbf{\delta_{x}}$ with regard to these most relevant clusters is computed. All this information is combined to obtain the discriminative information. In order to keep the full information of the input, we also pass the input question $\mathbf{x}$, together with the acquired discriminative information, to a feed-forward layer to obtain a new representation $\mathbf{x^{\prime}}$ for the question.

More specifically, the adaptation algorithm contains two steps, adapting and updating, which are detailed as follows:

We set $\alpha$ to 50 to make sure that only the closest class has a high weight while the computation remains differentiable. Then we acquire a soft class-center vector $\mathbf{\bar{x}}^{c}$:

We then compute a discriminative vector $\mathbf{\delta_{x}}$ between the input question and the soft class-center vector:
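The adapting step can be sketched as below, under explicit assumptions: the cluster weights are a softmax over negative Euclidean distances scaled by $\alpha$, the soft class-center is the weighted sum of cluster centers, and the discriminative vector is the plain difference between the question and that soft center. The exact weighting function and any nonlinearity applied to the difference may differ; `adapt` and the toy centers are hypothetical.

```python
import numpy as np

def adapt(x, centers, alpha=50.0):
    """x: (d,) question vector; centers: (K, d) cluster center vectors."""
    dists = np.linalg.norm(centers - x, axis=1)   # (K,) distance to each center
    logits = -alpha * dists                       # large alpha sharpens the weights
    logits = logits - logits.max()                # numerical stability
    w = np.exp(logits)
    w = w / w.sum()                               # soft cluster weights
    x_c = w @ centers                             # soft class-center vector
    delta = x - x_c                               # plain difference is an assumption
    return w, x_c, delta

centers = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
x = np.array([0.9, 1.1])
w, x_c, delta = adapt(x, centers)
```

With $\alpha = 50$ the weights are nearly one-hot on the closest center, yet every center still receives a gradient through the softmax.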

Updating. The updating stage modifies the center vectors of the $K$ clusters so that each cluster fits a different type of question. The updating is performed according to the following formula:

In the equation, $\beta$ is an updating rate that controls the size of each update, and we set it to 0.01. When $\mathbf{x}$ is far away from the $k$-th cluster center $\mathbf{\bar{x}}_{k}$, $\text{w}_{k}^{a}$ is close to 0 and the $k$-th cluster center $\mathbf{\bar{x}}_{k}$ tends not to be updated. If $\mathbf{x}$ is instead close to the $j$-th cluster center $\mathbf{\bar{x}}_{j}$, $\text{w}_{j}^{a}$ is close to 1 and the centroid of the $j$-th cluster $\mathbf{\bar{x}}_{j}$ will be updated more aggressively using $\mathbf{x}$.
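An update rule consistent with this description is a weighted moving average, in which each center moves toward $\mathbf{x}$ by a fraction $\beta \text{w}_{k}^{a}$. The sketch below uses that form as an assumption; the name `update_centers` and the hard-coded weights are for illustration only.

```python
import numpy as np

def update_centers(x, centers, w, beta=0.01):
    """Move each center toward x in proportion to beta times its adaptation weight.

    Assumed form: new_center_k = (1 - beta * w_k) * center_k + beta * w_k * x.
    """
    step = beta * np.asarray(w)[:, None]        # (K, 1) per-cluster step size
    return (1.0 - step) * centers + step * x

centers = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
x = np.array([2.0, 2.0])
w = np.array([0.0, 1.0, 0.0])                   # x is assigned almost entirely to cluster 1
new_centers = update_centers(x, centers, w)
```

Centers with a weight near 0 are left essentially unchanged, while the center with a weight near 1 takes a step of size $\beta$ toward the question vector.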

Experiment Results

We test our models on the Stanford Question Answering Dataset (SQuAD) (Rajpurkar et al., 2016). The SQuAD dataset consists of more than 100,000 questions annotated by crowdsourcing workers on a selected set of Wikipedia articles, and the answer to each question is a span of text in the Wikipedia articles. The training data includes 87,599 instances and the validation set has 10,570 instances. The test data is hidden and kept by the organizer. The evaluation metrics of SQuAD are Exact Match (EM) and F1 score.
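For readers unfamiliar with the two metrics, the sketch below shows a simplified version of how they are typically computed for a single prediction and a single reference answer: both strings are normalized (lowercased, punctuation and articles removed), EM checks string equality, and F1 is the harmonic mean of token-level precision and recall. This is an illustrative approximation and not the official evaluation script, which additionally takes the maximum score over multiple reference answers.

```python
import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, strip punctuation, drop English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

em = exact_match("The Eiffel Tower", "eiffel tower.")
partial = f1_score("in Paris France", "Paris")
miss = f1_score("London", "Paris")
```

A prediction that contains the reference plus extra words gets full recall but reduced precision, which is why F1 is always at least as high as EM.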

We use pre-trained 300-D GloVe 840B vectors (Pennington et al., 2014) to initialize our word embeddings. Out-of-vocabulary (OOV) words are initialized randomly with Gaussian samples. The CharCNN filter lengths are 1, 3, and 5, each with 50 dimensions. All vectors, including word embeddings, are updated during training. The cluster number K in the discriminative block is 100. The Adam method (Kingma and Ba, 2014) is used for optimization, with the first momentum set to 0.9 and the second to 0.999. The initial learning rate is 0.0004 and the batch size is 32. We halve the learning rate when we meet a bad iteration, and the patience is 7. Early stopping is based on the EM and F1 scores on the validation set. All hidden states of GRUs and TreeLSTMs have 500 dimensions, while the word-level embedding $d_{w}$ has 300 dimensions. We set the maximum document length to 500 and drop question-document pairs beyond this length from the training set. The explicit question-type dimension $d_{ET}$ is 50. We apply dropout to the encoder layer and the aggregation layer with a dropout rate of 0.5.

Results

Table 1 shows the official leaderboard on the SQuAD test set at the time we submitted our system. Our model achieves a 68.73% EM score and a 77.39% F1 score, which ranks among the state-of-the-art single models (without model ensembling).

Table 2 shows the ablation performance of various Q-codes on the development set. Note that since the test set is hidden from us, we can only perform such an analysis on the development set. Our baseline model using no Q-code achieved EM and F1 scores of 68.00% and 77.36%, respectively. When we added the explicit question type T-code to the baseline model, the performance improved slightly to 68.16% (EM) and 77.58% (F1). We then used TreeLSTM to introduce syntactic parses for question representation and understanding (replacing the simple question type as the question understanding Q-code), which consistently shows further improvement. We further incorporated the soft adaptation. When the number of hidden question types ($K$) is set to 20, the performance improves to 68.73%/77.74% on EM and F1, respectively, which corresponds to the results of our model reported in Table 1. Furthermore, after submitting our result, we experimented with a larger value of $K$ and found that when $K=100$, we can achieve a better performance of 69.10%/78.38% on the development set.

Figure 4(a) shows the EM/F1 scores of different question types, while Figure 4(b) shows the distribution of question types on the development set. In Figure 4(a) we can see that the average EM/F1 of the "when" questions is the highest and that of the "why" questions is the lowest. From Figure 4(b) we can see that the "what" question is the major class.

Figure 5 shows the composition of the F1 score. Taking our best model as an example, we observed a 78.38% F1 score on the whole development set, which can be separated into two parts. The first part is where the F1 score equals 100%, which means an exact match; it accounts for 69.10% of the entire development set. The other part accounts for 30.90%, with an average F1 score of 30.03%. The latter can be further divided into two sub-parts. One is where the F1 score equals 0%, which means that the predicted answer is totally wrong; this part makes up 14.89% of the development set. The other accounts for 16.01% of the development set, with an average F1 score of 57.96%. From this analysis we can see that reducing the share of zero-F1 predictions (14.89%) is potentially an important direction for further improving the system.
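The decomposition can be checked with a few lines of arithmetic: the overall F1 is the share-weighted average of the F1 within each part, and the average F1 over the non-exact part follows from the two sub-parts. The short script below only reuses the percentages reported above.

```python
# (share of development set in %, average F1 in %) for each part reported above
parts = [
    (69.10, 100.0),   # exact matches
    (16.01, 57.96),   # partially correct answers
    (14.89, 0.0),     # completely wrong answers
]

total_share = sum(share for share, _ in parts)
overall_f1 = sum(share * f1 for share, f1 in parts) / 100.0

# Average F1 over the 30.90% of examples that are not exact matches.
non_exact_share = 16.01 + 14.89
non_exact_avg_f1 = (16.01 * 57.96 + 14.89 * 0.0) / non_exact_share
```

The weighted sum reproduces the reported 78.38% overall F1 to within rounding, and the non-exact average agrees with the reported 30.03%.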

Conclusions

Closely modelling questions could be of importance for question answering and machine reading. In this paper, we introduce syntactic information to help encode questions in neural networks. We view and model different types of questions and the information shared among them as an adaptation task and propose adaptation models for them. On the Stanford Question Answering Dataset (SQuAD), we show that these approaches can help attain better results over a competitive baseline.