A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task

Danqi Chen, Jason Bolton, Christopher D. Manning

Introduction

Reading comprehension (RC) is the ability to read text, process it, and understand its meaning (https://en.wikipedia.org/wiki/Reading_comprehension). How to endow computers with this capacity has been an elusive challenge and a long-standing goal of Artificial Intelligence (e.g., [Norvig, 1978]). Genuine reading comprehension involves interpretation of the text and making complex inferences. Human reading comprehension is often tested by asking questions that require interpretive understanding of a passage, and the same approach has been suggested for testing computers [Burges, 2013].

In recent years, there have been several strands of work which attempt to collect human-labeled data for this task – in the form of document, question and answer triples – and to learn machine learning models directly from it [Richardson et al., 2013; Berant et al., 2014; Wang et al., 2015]. However, these datasets consist of only hundreds of documents, as the labeled examples usually require considerable expertise and careful design, making the annotation process quite expensive. The resulting scarcity of labeled examples prevents us from training powerful statistical models, such as deep learning models, and would seem to prevent a system from learning complex textual reasoning capacities.

Recently, researchers at DeepMind [Hermann et al., 2015] had the appealing, original idea of exploiting the fact that the abundant news articles of CNN and Daily Mail are accompanied by bullet point summaries in order to heuristically create large-scale supervised training data for the reading comprehension task. Figure 1 gives an example. Their idea is that a bullet point usually summarizes one or several aspects of the article. If the computer understands the content of the article, it should be able to infer the missing entity in the bullet point.

This is a clever way of creating supervised data cheaply and holds promise for making progress on training RC models; however, it is unclear what level of reading comprehension is actually needed to solve this somewhat artificial task and, indeed, what statistical models that do reasonably well on this task have actually learned.

In this paper, our aim is to provide an in-depth and thoughtful analysis of this dataset and what level of natural language understanding is needed to do well on it. We demonstrate that simple, carefully designed systems can obtain high, state-of-the-art accuracies of 73.6% and 76.6% on CNN and Daily Mail respectively. We do a careful hand-analysis of a small subset of the problems to provide data on their difficulty and what kinds of language understanding are needed to be successful, and we try to diagnose what is learned by the systems that we have built. We conclude that: (i) this dataset is easier than previously realized, (ii) straightforward, conventional NLP systems can do much better on it than previously suggested, (iii) the distributed representations of deep learning systems are very effective at recognizing paraphrases, (iv) partly because of the nature of the questions, current systems behave much more like single-sentence relation extraction systems than like text understanding systems that use larger discourse context, (v) the systems that we present here are close to the ceiling of performance for single-sentence and unambiguous cases of this dataset, and (vi) the prospects for getting the final 20% of questions correct appear poor, since most of them involve issues in the data preparation which undermine the chances of answering the question (coreference errors or anonymization of entities making understanding too difficult).

The Reading Comprehension Task

The RC datasets introduced in [Hermann et al., 2015] are made from articles on the news websites CNN and Daily Mail, utilizing articles and their bullet point summaries (the datasets are available at https://github.com/deepmind/rc-data). Figure 1 demonstrates an example (the original article can be found at http://www.cnn.com/2015/03/10/entertainment/feat-star-wars-gay-character/): it consists of a passage $p$, a question $q$ and an answer $a$, where the passage is a news article, the question is a cloze-style task, in which one of the article’s bullet points has had one entity replaced by a placeholder, and the answer is this questioned entity. The goal is to infer the missing entity (answer $a$) from all the possible entities which appear in the passage. A news article is usually associated with a few (e.g., 3–5) bullet points and each of them highlights one aspect of its content.

The text has been run through a Google NLP pipeline. It is tokenized, lowercased, and named entity recognition and coreference resolution have been run. For each coreference chain containing at least one named entity, all items in the chain are replaced by an @entity$n$ marker, for a distinct index $n$. Hermann et al. (2015) argue convincingly that such a strategy is necessary to ensure that systems approach this task by understanding the passage in front of them, rather than by using world knowledge or a language model to answer questions without needing to understand the passage. However, this also gives the task a somewhat artificial character. On the one hand, systems are greatly helped by entity recognition and coreference having already been performed; on the other, they suffer when either of these modules fails, as they do (in Figure 1, “the character” should probably be coreferent with @entity14; clearer examples of failure appear later on in our data analysis). Moreover, this inability to use world knowledge also makes it much more difficult for a human to do this task – occasionally it is very difficult or impossible for a human to determine the correct answer when presented with an item anonymized in this way.
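The anonymization step described above can be sketched as follows. This is a minimal illustration, not the Google NLP pipeline: the function name `anonymize`, the span format used for coreference chains, and the zero-based marker indices are all assumptions made for the example, and mentions are assumed not to overlap.

```python
def anonymize(tokens, chains):
    """Replace every mention in each coreference chain with an @entityN marker.

    tokens: list of lowercased tokens.
    chains: list of chains; each chain is a list of (start, end) token spans
            with end exclusive (a hypothetical format chosen for this sketch).
    """
    # Map each mention's start index to (end index, marker for its chain).
    span_at = {}
    for n, chain in enumerate(chains):
        for start, end in chain:
            span_at[start] = (end, f"@entity{n}")

    out, i = [], 0
    while i < len(tokens):
        if i in span_at:
            end, marker = span_at[i]
            out.append(marker)
            i = end  # skip the remaining tokens of this mention
        else:
            out.append(tokens[i])
            i += 1
    return out


tokens = "barack obama said he would visit paris".split()
chains = [[(0, 2), (3, 4)], [(6, 7)]]  # {barack obama, he} and {paris}
anonymized = anonymize(tokens, chains)
```

Because every mention in a chain collapses to the same marker, a coreference error upstream silently merges or splits entities in the anonymized text, which is the failure mode discussed above.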

The creation of the datasets benefits from the sheer volume of news articles available online, so they offer a large and realistic testing ground for statistical models. Table 1 provides some statistics on the two datasets: there are 380k and 879k training examples for CNN and Daily Mail respectively. The passages are around 30 sentences and 800 tokens on average, while each question contains around 12–14 tokens.

In the following sections, we seek to more deeply understand the nature of this dataset. We first build some straightforward systems in order to get a better idea of a lower-bound for the performance of current NLP systems. Then we turn to data analysis of a sample of the items to examine their nature and an upper bound on performance.

Our Systems

In this section, we describe two systems we implemented – a conventional entity-centric classifier and an end-to-end neural network. While Hermann et al. (2015) do provide several baselines for performance on the RC task, we suspect that their baselines are not that strong. They attempt to use a frame-semantic parser, and we feel that the poor coverage of that parser undermines the results, and is not representative of what a straightforward NLP system – based on standard approaches to factoid question answering and relation extraction developed over the last 15 years – can achieve. Indeed, their frame-semantic model is markedly inferior to another baseline they provide, a heuristic word distance model. At present just two papers are available presenting results on this RC task, both presenting neural network approaches: [Hermann et al., 2015] and [Hill et al., 2016]. While the latter is wrapped in the language of end-to-end memory networks, it actually presents a fairly simple window-based neural network classifier running on the CNN data. Its success again raises questions about the true nature and complexity of the RC task provided by this dataset, which we seek to clarify by building a simple attention-based neural net classifier.

Given the (passage, question, answer) triple $(p,q,a)$, $p=\{p_{1},\ldots,p_{m}\}$ and $q=\{q_{1},\ldots,q_{l}\}$ are sequences of tokens for the passage and question sentence, with $q$ containing exactly one “@placeholder” token. The goal is to infer the correct entity $a\in p\cap E$ that the placeholder corresponds to, where $E$ is the set of all abstract entity markers. Note that the correct answer entity must appear in the passage $p$.

We first build a conventional feature-based classifier, aiming to explore what features are effective for this task. This is similar in spirit to [Wang et al., 2015], which at present has very competitive performance on the MCTest RC dataset [Richardson et al., 2013]. The setup of this system is to design a feature vector $f_{p,q}(e)$ for each candidate entity $e$, and to learn a weight vector $\theta$ such that the correct answer $a$ is expected to rank higher than all other candidate entities:

$$\theta^{\intercal} f_{p,q}(a) > \theta^{\intercal} f_{p,q}(e), \quad \forall e \in E \cap p \setminus \{a\}$$

We employ the following feature templates:

Whether entity $e$ occurs in the passage.

Whether entity $e$ occurs in the question.

The frequency of entity $e$ in the passage.

The first position of occurrence of entity $e$ in the passage.

$n$-gram exact match: whether there is an exact match between the text surrounding the placeholder and the text surrounding entity $e$. We have features for all combinations of matching left and/or right one or two words.

Word distance: we align the placeholder with each occurrence of entity $e$, and compute the average minimum distance of each non-stop question word from the entity in the passage.

Sentence co-occurrence: whether entity $e$ co-occurs with another entity or verb that appears in the question, in some sentence of the passage.

Dependency parse match: we dependency parse both the question and all the sentences in the passage, and extract an indicator feature of whether $w\xrightarrow{r}\text{@placeholder}$ and $w\xrightarrow{r}e$ are both found; similar features are constructed for $\text{@placeholder}\xrightarrow{r}w$ and $e\xrightarrow{r}w$.
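As a rough illustration of how such templates turn into a ranking score, the sketch below computes a small subset of them (occurrence, frequency, first position, and one-word left/right exact match) and ranks candidates by a linear score. The function names and the hand-set weights are hypothetical; the actual system learns its ranker from data and uses the full template set.

```python
def entity_features(passage, question, entity):
    """Illustrative subset of the feature templates for one candidate entity."""
    positions = [i for i, tok in enumerate(passage) if tok == entity]
    ph = question.index("@placeholder")
    # Words immediately left/right of the placeholder (None at a boundary).
    q_left = question[ph - 1] if ph > 0 else None
    q_right = question[ph + 1] if ph + 1 < len(question) else None

    left_match = any(i > 0 and passage[i - 1] == q_left for i in positions)
    right_match = any(
        i + 1 < len(passage) and passage[i + 1] == q_right for i in positions
    )
    return [
        1.0 if positions else 0.0,                   # occurs in the passage
        1.0 if entity in question else 0.0,          # occurs in the question
        float(len(positions)),                       # frequency in the passage
        float(positions[0]) if positions else -1.0,  # first position
        1.0 if left_match else 0.0,                  # one-word left exact match
        1.0 if right_match else 0.0,                 # one-word right exact match
    ]


def rank(passage, question, theta):
    """Return the candidate entity with the highest linear score theta . f(e)."""
    candidates = sorted({t for t in passage if t.startswith("@entity")})

    def score(e):
        feats = entity_features(passage, question, e)
        return sum(w * x for w, x in zip(theta, feats))

    return max(candidates, key=score)


passage = "@entity1 met @entity2 in @entity3 . @entity2 praised the film".split()
question = "@placeholder praised the film".split()
theta = [0.0, -1.0, 0.5, 0.0, 1.0, 1.0]  # hand-set weights, for illustration only
best = rank(passage, question, theta)
```

Here the right-context match ("praised") and the higher frequency both favor `@entity2`, which mirrors the kind of local evidence these templates are designed to capture.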

End-to-end Neural Network

Our neural network system is based on the AttentiveReader model proposed by [Hermann et al., 2015]. The framework can be described in the following three steps (see Figure 2):

Using the output vector $\mathbf{o}$, the system outputs the most likely answer using:

$$a = \arg\max_{a \in p \cap E} W_{a}^{\intercal}\mathbf{o}$$

Finally, the system adds a softmax function on top of $W_{a}^{\intercal}\mathbf{o}$ and adopts a negative log-likelihood objective for training.
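The prediction and training objective can be illustrated with a small pure-Python sketch. The names `predict_and_loss`, `W`, and the toy vectors are assumptions made for the example; only the structure (scores of the form W_a · o, a softmax over candidate entities, and a negative log-likelihood loss) follows the description above.

```python
import math


def predict_and_loss(o, W, candidates, gold):
    """Score each candidate entity a by W_a . o, then apply softmax + NLL.

    o: output vector (list of floats).
    W: dict mapping entity marker -> weight vector W_a (hypothetical layout).
    candidates: entity markers that appear in the passage.
    gold: the correct entity marker (must be among the candidates).
    """
    logits = {a: sum(w * x for w, x in zip(W[a], o)) for a in candidates}
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    prediction = max(logits, key=logits.get)  # argmax over candidate entities
    nll = log_z - logits[gold]                # -log softmax(gold)
    return prediction, nll


o = [1.0, 0.0]
W = {"@entity1": [2.0, 0.0], "@entity2": [0.0, 1.0]}
prediction, nll = predict_and_loss(o, W, ["@entity1", "@entity2"], "@entity1")
```

Restricting `candidates` to entities in the passage keeps the softmax small, since it normalizes over a handful of markers rather than a full vocabulary.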

Our model basically follows the AttentiveReader. However, to our surprise, we observed nearly 7–10% improvement over the original AttentiveReader results on CNN and Daily Mail datasets (discussed in Sec. 4). Concretely, our model has the following differences:

We use a bilinear term, instead of a $\tanh$ layer to compute the relevance (attention) between question and contextual embeddings. The effectiveness of the simple bilinear attention function has been shown previously for neural machine translation by [Luong et al., 2015].

After obtaining the weighted contextual embeddings $\mathbf{o}$, we use $\mathbf{o}$ for direct prediction. In contrast, the original model in [Hermann et al., 2015] combined $\mathbf{o}$ and the question embedding $\mathbf{q}$ via another non-linear layer before making final predictions. We found that we could remove this layer without harming performance. We believe it is sufficient for the model to learn to return the entity to which it maximally gives attention.

The original model considers all the words from the vocabulary $\mathcal{V}$ in making predictions. We think this is unnecessary, and only predict among entities which appear in the passage.

Of these changes, only the first seems important; the other two just aim at keeping the model simple.
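To make the first difference concrete, here is a minimal sketch of bilinear attention in pure Python. It assumes the common form in which each contextual embedding p_i receives the score q · (W_s p_i), the scores are normalized with a softmax, and the output o is the attention-weighted sum of the contextual embeddings; the matrix name `Ws` and the toy inputs are illustrative, not taken from the paper.

```python
import math


def bilinear_attention(q, P, Ws):
    """Compute attention weights and the weighted contextual embedding.

    q:  question embedding, length h.
    P:  list of contextual embeddings p_i, each of length h.
    Ws: h x h bilinear matrix as a list of rows (hypothetical name).
    """
    h = len(q)
    # qW[c] = sum_r q[r] * Ws[r][c], computed once and reused for every p_i.
    qW = [sum(q[r] * Ws[r][c] for r in range(h)) for c in range(h)]
    scores = [sum(a * b for a, b in zip(qW, p)) for p in P]

    # Softmax over passage positions, with the maximum subtracted for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alpha = [e / z for e in exps]

    # Output vector: attention-weighted sum of the contextual embeddings.
    o = [sum(alpha[i] * P[i][k] for i in range(len(P))) for k in range(h)]
    return alpha, o


q = [1.0, 0.0]
P = [[1.0, 0.0], [0.0, 1.0]]
Ws = [[1.0, 0.0], [0.0, 1.0]]  # identity, so the score reduces to a dot product
alpha, o = bilinear_attention(q, P, Ws)
```

With `Ws` set to the identity the bilinear score collapses to a plain dot product; a learned `Ws` lets the model compare question and passage representations in a transformed space without an extra non-linearity.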

Window-based MemN2Ns [Hill et al., 2016].

Another recent neural network approach proposed by [Hill et al., 2016] is based on a memory network architecture [Weston et al., 2015]. We think it is highly similar in spirit to our model. The biggest difference is their way of encoding passages: they demonstrate that it is most effective to only use a 5-word context window when evaluating a candidate entity and they use a positional unigram approach to encode the contextual embeddings: if a window consists of 5 words $x_{1},\ldots,x_{5}$, then it is encoded as $\sum_{i=1}^{5}{E_{i}(x_{i})}$, resulting in 5 separate embedding matrices to learn. They encode the 5-word window surrounding the placeholder in a similar way and all other words in the question text are ignored. In addition, they simply use a dot product to compute the “relevance” between the question and a contextual embedding. This simple model nevertheless works well, showing the extent to which this RC task can be done by very local context matching.
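The positional unigram encoding and dot-product relevance can be sketched in a few lines. The embedding tables below are toy values built only for this example (a real model would learn five embedding matrices), and the function names are hypothetical.

```python
def encode_window(window, E):
    """Encode a 5-word window as the sum over positions i of E[i][x_i].

    window: list of 5 tokens.
    E: list of 5 position-specific tables, each mapping word -> vector.
    """
    dim = len(next(iter(E[0].values())))
    encoded = [0.0] * dim
    for i, word in enumerate(window):
        for k, value in enumerate(E[i][word]):
            encoded[k] += value
    return encoded


def relevance(u, v):
    """Dot product between a question-window and a passage-window encoding."""
    return sum(a * b for a, b in zip(u, v))


vocab = ["a", "b", "c", "d", "e"]
# Toy position-specific embeddings: E[i][w] = [position index, word index].
E = [{w: [float(i), float(j)] for j, w in enumerate(vocab)} for i in range(5)]

passage_window = encode_window(["a", "b", "c", "d", "e"], E)
question_window = encode_window(["a", "a", "a", "a", "a"], E)
score = relevance(question_window, passage_window)
```

Because each position has its own table, the same word contributes a different vector depending on where it sits relative to the candidate entity, which is what lets such a small window carry word-order information.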

Experiments

For training our conventional classifier, we use the implementation of LambdaMART [Wu et al., 2010] in the RankLib package (https://sourceforge.net/p/lemur/wiki/RankLib/). We use this ranking algorithm since our problem is naturally a ranking problem and forests of boosted decision trees have been very successful lately (as seen, e.g., in many recent Kaggle competitions). We do not use all the features of LambdaMART since we are only scoring 1/0 loss on the first ranked proposal, rather than using an IR-style metric to score ranked results. We use Stanford’s neural network dependency parser [Chen and Manning, 2014] to parse all our document and question text, and all other features can be extracted without additional tools.

For training our neural networks, we only keep the most frequent $|\mathcal{V}|=50\text{k}$ words (including entity and placeholder markers), and map all other words to an <unk> token. We choose word embedding size $d=100$, and use the 100-dimensional pre-trained GloVe word embeddings [Pennington et al., 2014] for initialization. The attention and output parameters are initialized from a uniform distribution between $(-0.01,0.01)$, and the GRU weights are initialized from a Gaussian distribution $\mathcal{N}(0,0.1)$.
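The vocabulary truncation described above can be sketched as follows. The function names are hypothetical, and whether the <unk> token counts toward the 50k budget is an assumption of this sketch (here it does).

```python
from collections import Counter


def build_vocab(token_lists, max_size):
    """Keep the most frequent words; index 0 is reserved for <unk>."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Assumption: <unk> occupies one of the max_size slots.
    kept = [word for word, _ in counts.most_common(max_size - 1)]
    return {word: idx for idx, word in enumerate(["<unk>"] + kept)}


def to_ids(tokens, vocab):
    """Map tokens to ids, sending out-of-vocabulary words to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]


corpus = [
    ["the", "film", "@entity1"],
    ["the", "@placeholder", "film", "rare"],
]
vocab = build_vocab(corpus, max_size=4)
ids = to_ids(["the", "rare", "film"], vocab)
```

Entity and placeholder markers go through the same frequency cut as ordinary words, so in a real run they survive simply because they are frequent.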

We use hidden size $h=128$ for CNN and 256 for Daily Mail. Optimization is carried out using vanilla stochastic gradient descent (SGD), with a fixed learning rate of 0.1. We sort all the examples by the length of their passages, and randomly sample a mini-batch of size 32 for each update. We also apply dropout with probability 0.2 to the embedding layer and gradient clipping when the norm of gradients exceeds 10.
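The optimization settings above can be illustrated with a small sketch. Two points are assumptions rather than details from the text: clipping is implemented by rescaling the gradient to norm 10, and length-sorted batching is read as cutting the sorted examples into contiguous batches that are then visited in random order. All function names are hypothetical.

```python
import math
import random


def clip_by_norm(grad, max_norm=10.0):
    """Rescale the gradient if its L2 norm exceeds max_norm (assumed scheme)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * max_norm / norm for g in grad]
    return list(grad)


def sgd_step(params, grad, lr=0.1):
    """One vanilla SGD update with a fixed learning rate, after clipping."""
    clipped = clip_by_norm(grad)
    return [p - lr * g for p, g in zip(params, clipped)]


def length_sorted_batches(passages, batch_size=32, seed=0):
    """Sort by passage length, cut into batches, and shuffle the batch order."""
    order = sorted(range(len(passages)), key=lambda i: len(passages[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.Random(seed).shuffle(batches)
    return batches


clipped = clip_by_norm([30.0, 40.0])            # norm 50 is rescaled to norm 10
updated = sgd_step([1.0, 1.0], [30.0, 40.0])
batches = length_sorted_batches([["w"] * n for n in range(70)])
```

Grouping passages of similar length keeps padding within a batch small, which matters when passages average around 800 tokens.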

Additionally, we think the original indices of entity markers are generated arbitrarily. We attempt to relabel the entity markers based on their first occurrence in the passage and question (the first occurring entity is relabeled as @entity1, the second as @entity2, and so on) and find that this step can make training converge faster as well as bring slight gains. We report both results (with and without relabeling) for future reference.
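A minimal sketch of this relabeling step is shown below. The function name is hypothetical, and scanning the passage before the question is an assumption about the order in which first occurrences are counted.

```python
def relabel(passage, question):
    """Renumber entity markers by order of first occurrence.

    Returns the relabeled passage, the relabeled question, and the mapping
    from old markers to new ones (needed to relabel the answer as well).
    """
    mapping = {}

    def remap(token):
        if token.startswith("@entity"):
            if token not in mapping:
                mapping[token] = f"@entity{len(mapping) + 1}"
            return mapping[token]
        return token  # ordinary words and @placeholder are left unchanged

    # Assumption: the passage is scanned first, then the question.
    new_passage = [remap(t) for t in passage]
    new_question = [remap(t) for t in question]
    return new_passage, new_question, mapping


passage = "@entity14 met @entity3 . @entity14 left".split()
question = "@placeholder met @entity3".split()
new_passage, new_question, mapping = relabel(passage, question)
answer = mapping["@entity14"]
```

After relabeling, low-numbered markers consistently denote early-mentioned entities, so the marker embeddings can carry positional information instead of arbitrary indices.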

All of our models are run on a single GPU (GeForce GTX TITAN X), with a runtime of roughly 3 hours per epoch for CNN, and 12 hours per epoch for Daily Mail. We run all the models up to 30 epochs and select the model that achieves the best accuracy on the development set.

We run our models 5 times independently with different random seeds and report average performance across the runs. We also report ensemble results which average the prediction probabilities of the 5 models.
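Averaging prediction probabilities across models can be sketched as follows. The function name and the toy distributions are illustrative; each dictionary stands for one model's softmax output over the candidate entities of a single example.

```python
def ensemble_predict(prob_dists):
    """Average per-model probabilities and return (best entity, averages).

    prob_dists: list of dicts mapping entity marker -> probability,
                one dict per model, all over the same candidate set.
    """
    entities = list(prob_dists[0])
    averaged = {
        e: sum(dist[e] for dist in prob_dists) / len(prob_dists)
        for e in entities
    }
    return max(averaged, key=averaged.get), averaged


model_outputs = [
    {"@entity1": 0.6, "@entity2": 0.4},
    {"@entity1": 0.3, "@entity2": 0.7},
    {"@entity1": 0.4, "@entity2": 0.6},
]
ensemble_answer, averaged = ensemble_predict(model_outputs)
```

Averaging probabilities, rather than taking a majority vote over the individual predictions, lets a confident model outweigh several uncertain ones.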

Main Results

Table 2 presents our main results. The conventional feature-based classifier obtains 67.9% accuracy on the CNN test set. Not only does this significantly outperform any of the symbolic approaches reported in [Hermann et al., 2015], it also outperforms all the neural network systems from their paper and the best single-system result reported so far from [Hill et al., 2016]. This suggests that the task might not be as difficult as suggested, and a simple feature set can cover many of the cases. Table 3 presents a feature ablation analysis of our entity-centric classifier on the development portion of the CNN dataset. It shows that $n$-gram match and frequency of entities are the two most important classes of features.

More dramatically, our single-model neural network surpasses the previous results by a large margin (over 5%). The relabeling process further improves the results by 0.6% and 0.9%, pushing up the state-of-the-art accuracies to 73.6% and 76.6% on the two datasets respectively. The ensembles of 5 models consistently bring further gains of 2–4%.

Concurrently with our paper, ?) and ?) also experiment on these two datasets and report competitive results. However, our model not only still outperforms theirs, but also appears to be structurally simpler. All these recent efforts converge to similar numbers, and we believe that they are approaching the ceiling performance of this task, as we will indicate in the next section.

Data Analysis

So far, we have good results via either of our systems. In this section, we aim to conduct an in-depth analysis and answer the following questions: (i) Since the dataset was created in an automatic and heuristic way, how many of the questions are trivial to answer, and how many are noisy and not answerable? (ii) What have these models learned? What are the prospects for further improving them? To study this, we randomly sampled 100 examples from the dev portion of the CNN dataset for analysis (see more details in Appendix A).

After carefully analyzing these 100 examples, we roughly classify them into the following categories (if an example satisfies more than one category, we classify it into the earliest one):

Exact match: The nearest words around the placeholder are also found in the passage surrounding an entity marker; the answer is self-evident.

Paraphrasing: The question text is entailed/rephrased by exactly one sentence in the passage, so the answer can definitely be identified from that sentence.

Partial clue: In many cases, even though we cannot find a complete semantic match between the question text and some sentence, we are still able to infer the answer through partial clues, such as some word/concept overlap.

Multiple sentences: It requires processing multiple sentences to infer the correct answer.

Coreference errors: It is unavoidable that there are many coreference errors in the dataset. This category includes those examples with critical coreference errors for the answer entity or key entities appearing in the question. Basically we treat this category as “not answerable”.

Ambiguous/hard: This category includes examples for which we think humans are not able to obtain the correct answer (confidently).

Table 5 provides our estimate of the percentage for each category, and Table 4 presents one representative example from each category. To our surprise, “coreference errors” and “ambiguous/hard” cases account for 25% of this sample set, based on our manual analysis, and this certainly will be a barrier for training models with an accuracy much above 75% (although, of course, a model can sometimes make a lucky guess). Additionally, only 2 examples require multiple sentences for inference – this is a lower rate than we expected and than Hermann et al. (2015) suggest. Therefore, we hypothesize that in most of the “answerable” cases, the goal is to identify the most relevant (single) sentence, and then to infer the answer based upon it.

Per-category Performance

Now, we further analyze the predictions of our two systems, based on the above categorization.

As seen in Table 6, we have the following observations: (i) The exact-match cases are quite simple and both systems get 100% correct. (ii) For the ambiguous/hard and entity-linking-error cases, as expected, both of the systems perform poorly. (iii) The two systems mainly differ in paraphrasing cases, and some of the “partial clue” cases. This clearly shows that neural networks are better at learning semantic matches involving paraphrasing or lexical variation between the two sentences. (iv) We believe that the neural-net system already achieves near-optimal performance on all the single-sentence and unambiguous cases. There does not seem to be much useful headroom for exploring more sophisticated natural language understanding approaches on this dataset.

Related Tasks

We briefly survey other tasks related to reading comprehension.

MCTest [Richardson et al., 2013] is an open-domain reading comprehension task, in the form of fictional short stories, accompanied by multiple-choice questions. It was carefully created using crowdsourcing, and aims at a 7-year-old reading comprehension level.

On the one hand, this dataset has a high demand on various reasoning capacities: over 50% of the questions require multiple sentences to answer and also the questions come in assorted categories (what, why, how, whose, which, etc). On the other hand, the full dataset has only 660 paragraphs in total (each paragraph is associated with 4 questions), which renders training statistical models (especially complex ones) very difficult.

Up to now, the best solutions [Sachan et al., 2015; Wang et al., 2015] still rely heavily on manually curated syntactic/semantic features, with the aid of additional knowledge (e.g., word embeddings, lexical/paragraph databases).

Children Book Test [Hill et al., 2016] was developed in a similar spirit to the CNN/Daily Mail datasets. It takes any consecutive 21 sentences from a children’s book – the first 20 sentences are used as the passage, and the goal is to infer a missing word in the 21st sentence (question and answer). The questions are also categorized by the type of the missing word: named entity, common noun, preposition or verb. According to the first study on this dataset [Hill et al., 2016], a language model (an $n$-gram model or a recurrent neural network) with local context is sufficient for predicting verbs or prepositions; however, for named entities or common nouns, it improves performance to scan through the whole paragraph to make predictions. So far, the best published results are reported by window-based memory networks.

bAbI [Weston et al., 2016] is a collection of artificial datasets, consisting of 20 different reasoning types. It encourages the development of models with the ability to chain reasoning, induction/deduction, etc., so that they can answer a question like “The football is in the playground” after reading a sequence of sentences “John is in the playground; Bob is in the office; John picked up the football; Bob went to the kitchen.” Various types of memory networks [Sukhbaatar et al., 2015; Kumar et al., 2016] have been shown to be effective on these tasks, and ?) show that vector space models based on extensive problem analysis can obtain near-perfect accuracies on all the categories. Despite these promising results, this dataset is limited to a small vocabulary (only 100–200 words) and simple language variations, so there is still a huge gap from real-world datasets that we need to fill in.

Conclusion

In this paper, we carefully examined the recent CNN/Daily Mail reading comprehension task. Our systems demonstrated state-of-the-art results, but more importantly, we performed a careful analysis of the dataset by hand.

Overall, we think the CNN/Daily Mail datasets are valuable datasets, which provide a promising avenue for training effective statistical models for reading comprehension tasks. Nevertheless, we argue that: (i) this dataset is still quite noisy due to its method of data creation and coreference errors; (ii) current neural networks have almost reached a performance ceiling on this dataset; and (iii) the required reasoning and inference level of this dataset is still quite simple.

As future work, we need to consider how we can utilize these datasets (and the models trained upon them) to help solve more complex RC reasoning tasks (with less annotated data).

Acknowledgments

We thank the anonymous reviewers for their thoughtful feedback. Stanford University gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Deep Exploration and Filtering of Text (DEFT) Program under Air Force Research Laboratory (AFRL) contract no. FA8750-13-2-0040. Any opinions, findings, and conclusion or recommendations expressed in this material are those of the authors and do not necessarily reflect the view of the DARPA, AFRL, or the US government.

References

Appendix A Samples and Labeled Categories from the CNN Dataset

For the analysis in Section 5, we uniformly sampled 100 examples from the development set of the CNN dataset. Table 8 provides a full index list of our samples and Table 7 presents our labeled categories.