Attention-over-Attention Neural Networks for Reading Comprehension

Yiming Cui, Zhipeng Chen, Si Wei, Shijin Wang, Ting Liu, Guoping Hu

Introduction

To read and comprehend the human languages are challenging tasks for the machines, which requires that the understanding of natural languages and the ability to do reasoning over various clues. Reading comprehension is a general problem in the real world, which aims to read and comprehend a given article or context, and answer the questions based on it. Recently, the cloze-style reading comprehension problem has become a popular task in the community. The cloze-style query Taylor (1953) is a problem that to fill in an appropriate word in the given sentences while taking the context information into account.

To teach the machine to do cloze-style reading comprehensions, large-scale training data is necessary for learning relationships between the given document and query. To create large-scale training data for neural networks, Hermann et al. (2015) released the CNN/Daily Mail news dataset, where the document is formed by the news articles and the queries are extracted from the summary of the news. Hill et al. (2015) released the Children’s Book Test dataset afterwards, where the training samples are generated from consecutive 20 sentences from books, and the query is formed by 21st sentence. Following these datasets, a vast variety of neural network approaches have been proposed (Kadlec et al., 2016; Cui et al., 2016; Chen et al., 2016; Dhingra et al., 2016; Sordoni et al., 2016; Trischler et al., 2016; Seo et al., 2016; Xiong et al., 2016), and most of them stem from the attention-based neural network Bahdanau et al. (2014), which has become a stereotype in most of the NLP tasks and is well-known by its capability of learning the “importance” distribution over the inputs.

In this paper, we present a novel neural network architecture, called attention-over-attention model. As we can understand the meaning literally, our model aims to place another attention mechanism over the existing document-level attention. Unlike the previous works, that are using heuristic merging functions Cui et al. (2016), or setting various pre-defined non-trainable terms Trischler et al. (2016), our model could automatically generate an “attended attention” over various document-level attentions, and make a mutual look not only from query-to-document but also document-to-query, which will benefit from the interactive information.

To sum up, the main contributions of our work are listed as follows.

To our knowledge, this is the first time that the mechanism of nesting another attention over the existing attentions is proposed, i.e. attention-over-attention mechanism.

Unlike the previous works on introducing complex architectures or many non-trainable hyper-parameters to the model, our model is much more simple but outperforms various state-of-the-art systems by a large margin.

We also propose an N-best re-ranking strategy to re-score the candidates in various aspects and further improve the performance.

The following of the paper will be organized as follows. In Section 2, we will give a brief introduction to the cloze-style reading comprehension task as well as related public datasets. Then the proposed attention-over-attention reader will be presented in detail in Section 3 and N-best re-ranking strategy in Section 4. The experimental results and analysis will be given in Section 5 and Section 6. Related work will be discussed in Section 7. Finally, we will give a conclusion of this paper and envisions on future work.

Cloze-style Reading Comprehension

In this section, we will give a brief introduction to the cloze-style reading comprehension task at the beginning. And then, several existing public datasets will be described in detail.

Formally, a general Cloze-style reading comprehension problem can be illustrated as a triple:

The triple consists of a document $\mathcal{D}$ , a query $\mathcal{Q}$ and the answer to the query $\mathcal{A}$ . Note that the answer is usually a single word in the document, which requires the human to exploit context information in both document and query. The type of the answer word varies from predicting a preposition given a fixed collocation to identifying a named entity from a factual illustration.

2 Existing Public Datasets

Large-scale training data is essential for training neural networks. Several public datasets for the cloze-style reading comprehension has been released. Here, we introduce two representative and widely-used datasets.

Hermann et al. (2015) have firstly published two datasets: CNN and Daily Mail news data The pre-processed CNN and Daily Mail datasets are available at http://cs.nyu.edu/~kcho/DMQA/. They construct these datasets with web-crawled CNN and Daily Mail news data. One of the characteristics of these datasets is that the news article is often associated with a summary. So they first regard the main body of the news article as the Document, and the Query is formed by the summary of the article, where one entity word is replaced by a special placeholder to indicate the missing word. The replaced entity word will be the Answer of the Query. Apart from releasing the dataset, they also proposed a methodology that anonymizes the named entity tokens in the data, and these tokens are also re-shuffle in each sample. The motivation is that the news articles are containing limited named entities, which are usually celebrities, and the world knowledge can be learned from the dataset. So this methodology aims to exploit general relationships between anonymized named entities within a single document rather than the common knowledge. The following research on these datasets showed that the entity word anonymization is not as effective as expected (Chen et al., 2016).

∙∙\bullet Children’s Book Test

There was also a dataset called the Children’s Book Test (CBTest) released by Hill et al. (2015), which is built on the children’s book story through Project Gutenberg The CBTest datasets are available at http://www.thespermwhale.com/jaseweston/babi/CBTest.tgz. Different from the CNN/Daily Mail datasets, there is no summary available in the children’s book. So they proposed another way to extract query from the original data. The document is composed of 20 consecutive sentences in the story, and the 21st sentence is regarded as the query, where one word is blanked with a special placeholder. In the CBTest datasets, there are four types of sub-datasets available which are classified by the part-of-speech and named entity tag of the answer word, containing Named Entities (NE), Common Nouns (CN), Verbs and Prepositions. In their studies, they have found that the answering of verbs and prepositions are relatively less dependent on the content of document, and the humans can even do preposition blank-filling without the presence of the document. The studies shown by Hill et al. (2015), answering verbs and prepositions are less dependent with the presence of document. Thus, most of the related works are focusing on solving NE and CN types.

Attention-over-Attention Reader

In this section, we will give a detailed introduction to the proposed Attention-over-Attention Reader (AoA Reader). Our model is primarily motivated by Kadlec et al., Kadlec et al. (2016), which aims to directly estimate the answer from the document-level attention instead of calculating blended representations of the document. As previous studies by Cui et al. (2016) showed that the further investigation of query representation is necessary, and it should be paid more attention to utilizing the information of query. In this paper, we propose a novel work that placing another attention over the primary attentions, to indicate the “importance” of each attentions.

Now, we will give a formal description of our proposed model. When a cloze-style training triple $\langle\mathcal{D},\mathcal{Q},\mathcal{A}\rangle$ is given, the proposed model will be constructed in the following steps.

We first transform every word in the document $\mathcal{D}$ and query $\mathcal{Q}$ into one-hot representations and then convert them into continuous representations with a shared embedding matrix $W_{e}$ . By sharing word embedding, both the document and query can participate in the learning of embedding and both of them will benefit from this mechanism. After that, we use two bi-directional RNNs to get contextual representations of the document and query individually, where the representation of each word is formed by concatenating the forward and backward hidden states. After making a trade-off between model performance and training complexity, we choose the Gated Recurrent Unit (GRU) Cho et al. (2014) as recurrent unit implementation.

∙∙\bullet Pair-wise Matching Score

After obtaining the contextual embeddings of the document $h_{doc}$ and query $h_{query}$ , we calculate a pair-wise matching matrix, which indicates the pair-wise matching degree of one document word and one query word. Formally, when given $i$ th word of the document and $j$ th word of query, we can compute a matching score by their dot product.

∙∙\bullet Individual Attentions

∙∙\bullet Attention-over-Attention

Different from Cui et al. (2016), instead of using naive heuristics (such as summing or averaging) to combine these individual attentions into a final attention, we introduce another attention mechanism to automatically decide the importance of each individual attention.

So far, we have obtained both query-to-document attention $\alpha$ and document-to-query attention $\beta$ . Our motivation is to exploit mutual information between the document and query. However, most of the previous works are only relying on query-to-document attention, that is, only calculate one document-level attention when considering the whole query.

Then we average all the $\beta(t)$ to get an averaged query-level attention $\beta$ . Note that, we do not apply another softmax to the $\beta$ , because averaging individual attentions do not break the normalizing condition.

∙∙\bullet Final Predictions

Following Kadlec et al. (2016), we use sum attention mechanism to get aggregated results. Note that the final output should be reflected in the vocabulary space $V$ , rather than document-level attention $|\mathcal{D}|$ , which will make a significant difference in the performance, though Kadlec et al. (2016) did not illustrate this clearly.

where $I(w,\mathcal{D})$ indicate the positions that word $w$ appears in the document $\mathcal{D}$ . As the training objectives, we seek to maximize the log-likelihood of the correct answer.

The proposed neural network architecture is depicted in Figure 1. Note that, as our model mainly adds limited steps of calculations to the AS Reader Kadlec et al. (2016) and does not employ any additional weights, the computational complexity is similar to the AS Reader.

N-best Re-ranking Strategy

Intuitively, when we do cloze-style reading comprehensions, we often refill the candidate into the blank of the query to double-check its appropriateness, fluency and grammar to see if the candidate we choose is the most suitable one. If we do find some problems in the candidate we choose, we will choose the second possible candidate and do some checking again.

To mimic the process of double-checking, we propose to use N-best re-ranking strategy after generating answers from our neural networks. The procedure can be illustrated as follows.

Instead of only picking the candidate that has the highest possibility as answer, we can also extract follow-up candidates in the decoding process, which forms an N-best list.

∙∙\bullet Refill Candidate into Query

As a characteristic of the cloze-style problem, each candidate can be refilled into the blank of the query to form a complete sentence. This allows us to check the candidate according to its context.

∙∙\bullet Feature Scoring

The candidate sentences can be scored in many aspects. In this paper, we exploit three features to score the N-best list.

Global N-gram LM: This is a fundamental metric in scoring sentence, which aims to evaluate its fluency. This model is trained on the document part of training data.

Local N-gram LM: Different from global LM, the local LM aims to explore the information with the given document, so the statistics are obtained from the test-time document. It should be noted that the local LM is trained sample-by-sample, it is not trained on the entire test set, which is not legal in the real test case. This model is useful when there are many unknown words in the test sample.

Word-class LM: Similar to global LM, the word-class LM is also trained on the document part of training data, but the words are converted to its word class ID. The word class can be obtained by using clustering methods. In this paper, we simply utilized the mkcls tool for generating 1000 word classes Josef Och (1999).

∙∙\bullet Weight Tuning

To tune the weights among these features, we adopt the K-best MIRA algorithm Cherry and Foster (2012) to automatically optimize the weights on the validation set, which is widely used in statistical machine translation tuning procedure.

∙∙\bullet Re-scoring and Re-ranking

After getting the weights of each feature, we calculate the weighted sum of each feature in the N-best sentences and then choose the candidate that has the lowest cost as the final answer.

Experiments

The general settings of our neural network model are listed below in detail.

Embedding Layer: The embedding weights are randomly initialized with the uniformed distribution in the interval $[-0.05,0.05]$ . For regularization purpose, we adopted $l_{2}$ -regularization to 0.0001 and dropout rate of 0.1 Srivastava et al. (2014). Also, it should be noted that we do not exploit any pre-trained embedding models.

Hidden Layer: Internal weights of GRUs are initialized with random orthogonal matrices Saxe et al. (2013).

Optimization: We adopted ADAM optimizer for weight updating Kingma and Ba (2014), with an initial learning rate of 0.001. As the GRU units still suffer from the gradient exploding issues, we set the gradient clipping threshold to 5 Pascanu et al. (2013). We used batched training strategy of 32 samples.

Dimensions of embedding and hidden layer for each task are listed in Table 3. In re-ranking step, we generate 5-best list from the baseline neural network model, as we did not observe a significant variance when changing the N-best list size. All language model features are trained on the training proportion of each dataset, with 8-gram word-based setting and Kneser-Ney smoothing Kneser and Ney (1995) trained by SRILM toolkit Stolcke (2002). The results are reported with the best model, which is selected by the performance of validation set. The ensemble model is made up of four best models, which are trained using different random seed. Implementation is done with Theano Theano Development Team (2016) and Keras Chollet (2015), and all models are trained on Tesla K40 GPU.

2 Overall Results

Our experiments are carried out on public datasets: CNN news datasets Hermann et al. (2015) and CBTest NE/CN datasets Hill et al. (2015). The statistics of these datasets are listed in Table 1, and the experimental results are given in Table 2.

As we can see that, our AoA Reader outperforms state-of-the-art systems by a large margin, where 2.3% and 2.0% absolute improvements over EpiReader in CBTest NE and CN test sets, which demonstrate the effectiveness of our model. Also by adding additional features in the re-ranking step, there is another significant boost 2.0% to 3.7% over AoA Reader in CBTest NE/CN test sets. We have also found that our single model could stay on par with the previous best ensemble system, and even we have an absolute improvement of 0.9% beyond the best ensemble model (Iterative Attention) in the CBTest NE validation set. When it comes to ensemble model, our AoA Reader also shows significant improvements over previous best ensemble models by a large margin and set up a new state-of-the-art system.

To investigate the effectiveness of employing attention-over-attention mechanism, we also compared our model to CAS Reader, which used pre-defined merging heuristics, such as sum or avg etc. Instead of using pre-defined merging heuristics, and letting the model explicitly learn the weights between individual attentions results in a significant boost in the performance, where 4.1% and 3.7% improvements can be made in CNN validation and test set against CAS Reader.

3 Effectiveness of Re-ranking Strategy

As we have seen that the re-ranking approach is effective in cloze-style reading comprehension task, we will give a detailed ablations in this section to show the contributions by each feature. To have a thorough investigation in the re-ranking step, we listed the detailed improvements while adding each feature mentioned in Section 4.

From the results in Table 4, we found that the NE and CN category both benefit a lot from the re-ranking features, but the proportions are quite different. Generally speaking, in NE category, the performance is mainly boosted by the $LM_{local}$ feature. However, on the contrary, the CN category benefits from $LM_{global}$ and $LM_{wc}$ rather than the $LM_{local}$ .

Also, we listed the weights of each feature in Table 5. The $LM_{global}$ and $LM_{wc}$ are all trained by training set, which can be seen as Global Feature. However, the $LM_{local}$ is only trained within the respective document part of test sample, which can be seen as Local Feature.

We calculated the ratio between the global and local features and found that the NE category is much more dependent on local features than CN category. Because it is much more likely to meet a new named entity than a common noun in the test phase, so adding the local LM provides much more information than that of common noun. However, on the contrary, answering common noun requires less local information, which can be learned in the training data relatively.

Quantitative Analysis

In this section, we will give a quantitative analysis to our AoA Reader. The following analyses are carried out on CBTest NE dataset. First, we investigate the relations between the length of the document and corresponding accuracy. The result is depicted in Figure 2.

As we can see that the AoA Reader shows consistent improvements over AS Reader on the different length of the document. Especially, when the length of document exceeds 700, the improvements become larger, indicating that the AoA Reader is more capable of handling long documents.

Furthermore, we also investigate if the model tends to choose a high-frequency candidate than a lower one, which is shown in Figure 3. Not surprisingly, we found that both models do a good job when the correct answer appears more frequent in the document than the other candidates. This is because that the correct answer that has the highest frequency among the candidates takes up over 40% of the test set (1071 out of 2500). But interestingly we have also found that, when the frequency rank of correct answer exceeds 7 (less frequent among candidates), these models also give a relatively high performance. Empirically, we think that these models tend to choose extreme cases in terms of candidate frequency (either too high or too low). One possible reason is that it is hard for the model to choose a candidate that has a neutral frequency as the correct answer, because of its ambiguity (neutral choices are hard to made).

Related Work

Cloze-style reading comprehension tasks have been widely investigated in recent studies. We will take a brief revisit to the related works.

Hermann et al. (2015) have proposed a method for obtaining large quantities of $\langle\mathcal{D},\mathcal{Q},\mathcal{A}\rangle$ triples through news articles and its summary. Along with the release of cloze-style reading comprehension dataset, they also proposed an attention-based neural network to handle this task. Experimental results showed that the proposed neural network is effective than traditional baselines.

Hill et al. (2015) released another dataset, which stems from the children’s books. Different from Hermann et al. (2015)’s work, the document and query are all generated from the raw story without any summary, which is much more general than previous work. To handle the reading comprehension task, they proposed a window-based memory network, and self-supervision heuristics is also applied to learn hard-attention.

Unlike previous works, that using blended representations of document and query to estimate the answer, Kadlec et al. (2016) proposed a simple model that directly pick the answer from the document, which is motivated by the Pointer Network Vinyals et al. (2015). A restriction of this model is that the answer should be a single word and appear in the document. Results on various public datasets showed that the proposed model is effective than previous works.

Liu et al. (2016) proposed to exploit reading comprehension models to other tasks. They first applied the reading comprehension model into Chinese zero pronoun resolution task with automatically generated large-scale pseudo training data. The experimental results on OntoNotes 5.0 data showed that their method significantly outperforms various state-of-the-art systems.

Our work is primarily inspired by Cui et al. (2016) and Kadlec et al. (2016) , where the latter model is widely applied to many follow-up works Sordoni et al. (2016); Trischler et al. (2016); Cui et al. (2016). Unlike the CAS Reader Cui et al. (2016), we do not assume any heuristics to our model, such as using merge functions: $sum$ , $avg$ etc. We used a mechanism called “attention-over-attention” to explicitly calculate the weights between different individual document-level attentions, and get the final attention by computing the weighted sum of them. Also, we find that our model is typically general and simple than the recently proposed model, and brings significant improvements over these cutting edge systems.

Conclusion

We present a novel neural architecture, called attention-over-attention reader, to tackle the cloze-style reading comprehension task. The proposed AoA Reader aims to compute the attentions not only for the document but also the query side, which will benefit from the mutual information. Then a weighted sum of attention is carried out to get an attended attention over the document for the final predictions. Among several public datasets, our model could give consistent and significant improvements over various state-of-the-art systems by a large margin.

The future work will be carried out in the following aspects. We believe that our model is general and may apply to other tasks as well, so firstly we are going to fully investigate the usage of this architecture in other tasks. Also, we are interested to see that if the machine really “comprehend” our language by utilizing neural networks approaches, but not only serve as a “document-level” language model. In this context, we are planning to investigate the problems that need comprehensive reasoning over several sentences.

Acknowledgments

We would like to thank all three anonymous reviewers for their thorough reviewing and providing thoughtful comments to improve our paper. This work was supported by the National 863 Leading Technology Research Project via grant 2015AA015409.