The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations
Felix Hill, Antoine Bordes, Sumit Chopra, Jason Weston
Introduction
Humans do not interpret language in isolation. The context in which words and sentences are understood, whether a conversation, book chapter or road sign, plays an important role in human comprehension (Altmann & Steedman, 1988; Binder & Desai, 2011). In this work, we investigate how well statistical models can exploit such wider contexts to make predictions about natural language.
Our analysis is based on a new benchmark dataset (The Children’s Book Test or CBT) designed to test the role of memory and context in language processing and understanding. The test requires predictions about different types of missing words in children’s books, given both nearby words and a wider context from the book. Humans taking the test predict all types of word with similar levels of accuracy. However, they rely on the wider context to make accurate predictions about named entities or nouns, whereas it is unimportant when predicting higher-frequency verbs or prepositions.
As we show, state-of-the-art language modelling architectures, Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units, perform differently to humans on this task. They are excellent predictors of prepositions (on, at) and verbs (run, eat), but lag far behind humans when predicting nouns (ball, table) or named entities (Elvis, France). This is because their predictions are based almost exclusively on local contexts. In contrast, Memory Networks (Weston et al., 2015b) are one of a class of ‘contextual models’ that can interpret language at a given point in text conditioned directly on both local information and an explicit representation of the wider context. On the CBT, Memory Networks designed in a particular way can exploit this information to achieve markedly better prediction of named entities and nouns than conventional language models. This is important for applications that require coherent semantic processing and/or language generation, since nouns and entities typically encode much of the important semantic information in language.
However, not all contextual models reach this level of performance. We find the way in which wider context is represented in memory to be critical. If memories are encoded from a small window around important words in the context, there is an optimal size for memory representations between single words and entire sentences, that depends on the class of word to be predicted. We have nicknamed this effect the Goldilocks Principle after the well-known English fairytale (Hassall, 1904). In the case of Memory Networks, we also find that self-supervised training of the memory access mechanism yields a clear performance boost when predicting named entities, a class of word that has typically posed problems for neural language models. Indeed, we train a Memory Network with these design features to beat the best reported performance on the CNN QA test of entity prediction from news articles (Hermann et al., 2015).
The Children’s Book Test
The experiments in this paper are based on a new resource, the Children’s Book Test, designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg (https://www.gutenberg.org/). Using children’s books guarantees a clear narrative structure, which can make the role of context more salient. After allocating books to either training, validation or test sets, we formed example ‘questions’ (denoted $x$) from chapters in the book by enumerating 21 consecutive sentences.
In each question, the first 20 sentences form the context (denoted $S$), and a word (denoted $a$) is removed from the 21st sentence, which becomes the query (denoted $q$). Models must identify the answer word $a$ among a selection of 10 candidate answers (denoted $C$) appearing in the context sentences and the query. Thus, for a question-answer pair $(x, a)$: $x = (q, S, C)$; $S$ is an ordered list of sentences; $q$ is a sentence (an ordered list $q = q_1, \ldots, q_l$ of words) containing a missing word symbol; $C$ is a bag of unique words such that $a \in C$, its cardinality $|C|$ is 10 and every candidate word $w \in C$ is such that $w \in q \cup S$. An example question is given in Figure 1.
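The constraints on a question can be made concrete with a small data structure. The following is a minimal sketch, not the dataset's official loader: the class name `CBTQuestion`, the field names and the placeholder string `XXXXX` for the missing word are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

BLANK = "XXXXX"  # assumed placeholder token for the missing word


@dataclass
class CBTQuestion:
    context: List[List[str]]  # 20 tokenised context sentences
    query: List[str]          # 21st sentence, with one BLANK token
    candidates: List[str]     # 10 unique candidate words
    answer: str               # the word removed from the query

    def is_valid(self) -> bool:
        """Check the structural constraints stated in the text."""
        # Words of the context and of the original query (the answer was
        # removed from the query, so it is added back explicitly).
        seen = {w for sent in self.context for w in sent}
        seen |= set(self.query) | {self.answer}
        return (
            len(self.context) == 20
            and self.query.count(BLANK) == 1
            and len(self.candidates) == 10
            and len(set(self.candidates)) == 10
            and self.answer in self.candidates
            and all(c in seen for c in self.candidates)
        )
```

A loader could call `is_valid()` on each parsed question to catch malformed examples before training.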
For finer-grained analyses, we evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs and Prepositions (based on output from the POS tagger and named-entity-recogniser in the Stanford Core NLP Toolkit (Manning et al., 2014)). For a given question class, the nine incorrect candidates are selected at random from words in the context having the same type as the answer. The exact number of questions in the training, validation and test sets is shown in Table 1. Full details of the candidate selection algorithm (e.g. how candidates are selected if there are insufficient words of a given type in the context) can be found with the dataset, which can be downloaded from http://fb.ai/babi/.
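The basic candidate selection step can be sketched as follows. This is an illustrative sketch only: the function name `select_candidates` and the `(word, tag)` input format are assumptions, and the fallback used by the released dataset when too few words of the right type are available is not reproduced here (the sketch simply raises an error).

```python
import random
from typing import List, Tuple


def select_candidates(answer: str, answer_type: str,
                      tagged_context: List[Tuple[str, str]],
                      rng: random.Random,
                      n_distractors: int = 9) -> List[str]:
    """Pick distractors of the same type as the answer from the context."""
    # Unique context words sharing the answer's type, excluding the answer.
    pool = sorted({w for w, t in tagged_context
                   if t == answer_type and w != answer})
    if len(pool) < n_distractors:
        # The real dataset has a documented fallback; not modelled here.
        raise ValueError("not enough same-type words in the context")
    candidates = rng.sample(pool, n_distractors) + [answer]
    rng.shuffle(candidates)  # avoid a fixed position for the answer
    return candidates
```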
Classical language modelling evaluations are based on average perplexity across all words in a text. They therefore place proportionally more emphasis on accurate prediction of frequent words such as prepositions and articles than the less frequent words that transmit the bulk of the meaning in language (Baayen & Lieber, 1996). In contrast, because the CBT allows focused analyses on semantic content-bearing words, it should be a better proxy for how well a language model can lend semantic coherence to applications including machine translation, dialogue and question-answering systems.
There are clear parallels between the CBT and the Microsoft Research Sentence Completion Challenge (MSRCC) (Zweig & Burges, 2011), which is also based on Project Gutenberg (but not children’s books, specifically). A fundamental difference is that, where examples in the MSRCC are made of a single sentence, each query in the CBT comes with a wider context. This tests the sensitivity of language models to semantic coherence beyond sentence boundaries. The CBT is also larger than the MSRCC (10,000 vs 1,040 test questions), requires models to select from more candidates on each question (10 vs 5), covers missing words of different (POS) types and contains large training and validation sets that match the form of the test set.
There are also similarities between the CBT and the CNN/Daily Mail (CNN QA) dataset recently released by Hermann et al. (2015). This task requires models to identify missing entities from bullet-point summaries of online news articles. The CNN QA task therefore focuses more on paraphrasing parts of a text, rather than making inferences and predictions from contexts as in the CBT. It also differs in that all named entities in both questions and articles are anonymised so that models cannot apply knowledge that is not apparent from the article. We do not anonymise entities in the CBT, as we hope to incentivise models that can apply background knowledge and information from immediate and wider contexts to the language understanding problem (see Appendix D for a sense of how anonymisation changes the CBT). At the same time, the CBT can be used as a benchmark for general-purpose language models whose downstream application is semantically focused generation, prediction or correction. The CBT is also similar to the MCTest of machine comprehension (Richardson et al., 2013), in which children’s stories written by annotators are accompanied by four multiple-choice questions. However, it is very difficult to train statistical models only on MCTest because its training set consists of only 300 examples.
Studying Memory Representation with Memory Networks
Memory Networks (Weston et al., 2015b) have shown promising performance at various tasks such as reasoning on the bAbI tasks (Weston et al., 2015a) or language modelling (Sukhbaatar et al., 2015). Applying them on the CBT enables us to examine the impact of various ways of encoding context on their semantic processing ability over naturally occurring language.
Context sentences of $S$ are encoded into memories, denoted $m_i$, using a feature-map $\phi(s)$ mapping sequences of words $s \in S$ from the context to one-hot representations in $[0,1]^d$, where $d$ is typically the size of the word vocabulary. We considered several formats for storing the phrases $s$:
Lexical memory: Each word occupies a separate slot in the memory (each phrase is a single word and has only one non-zero feature). To encode word order, time features are added as embeddings indicating the index of each memory, following Sukhbaatar et al. (2015).
Window memory: Each phrase $s$ corresponds to a window of text from the context $S$ centred on an individual mention of a candidate $c$ in $S$. Hence, memory slots are filled using windows of words $\{w_{i-(b-1)/2}, \ldots, w_i, \ldots, w_{i+(b-1)/2}\}$, where $w_i \in C$ is an instance of one of the candidate words in the question (see Appendix E for discussion and analysis of using candidates in window representations and training). Note that the number of phrases $s$ is typically greater than the number of candidates $|C|$ since candidates can occur multiple times in $S$. The window size $b$ is tuned on the validation set. We experimented with encoding windows as a standard bag-of-words, or by having one dictionary per window position, where the latter performed best.
Sentential memory: This setting follows the original implementation of Memory Networks for the bAbI tasks where the phrases $s$ correspond to complete sentences of the context $S$. For the CBT, this means that each question yields exactly 20 memories. We also use Positional Encoding (PE) as introduced by Sukhbaatar et al. (2015) to encode the word positions.
The order of occurrence of memories is less important for sentential and window formats than for lexical memory. So, instead of using a full embedding for each time index, we simply use a scalar value which indicates the position in the passage, ranging from 1 to the number of memories. An additional parameter (tuned on the validation set) scales the importance of this feature. As we show in Appendix C, time features only gave a marginal performance boost in those cases.
For sentential and window memory formats, queries are encoded in a similar way to the memories: as a bag-of-words representation of the whole sentence and a window of size $b$ (the memory window size) centred around the missing word position respectively. For the lexical memory, memories are made of the words preceding the word to be predicted, whether these words come from the context or from the query, and the query embedding is set to a constant vector.
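The window memory format can be illustrated with a short sketch. It assumes the context has been flattened into a single token list and that windows running past either end are padded with a `<pad>` token; both choices, and the function names, are assumptions for illustration rather than details taken from the text.

```python
from typing import List, Set, Tuple

PAD = "<pad>"  # assumed padding token for windows at the context edges


def window_memories(context: List[str], candidates: Set[str],
                    b: int) -> List[Tuple[str, List[str]]]:
    """Return one (candidate, window) pair per candidate mention."""
    if b < 1 or b % 2 == 0:
        raise ValueError("window size b must be a positive odd integer")
    half = (b - 1) // 2
    padded = [PAD] * half + context + [PAD] * half
    memories = []
    for i, word in enumerate(context):
        if word in candidates:
            # padded[i : i + b] is centred on context[i]
            memories.append((word, padded[i:i + b]))
    return memories


def position_features(window: List[str]) -> List[Tuple[int, str]]:
    """Pair each word with its offset from the window centre, giving the
    'one dictionary per window position' style of feature."""
    half = (len(window) - 1) // 2
    return [(k - half, w) for k, w in enumerate(window)]
```

With the context `the cat sat on the mat near the cat`, candidates `{cat, mat}` and `b = 3`, this yields three memories for two candidates, since `cat` is mentioned twice.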
3.2 End-to-end Memory Networks
The MemN2N architecture, introduced by Sukhbaatar et al. (2015), allows for a direct training of Memory Networks through backpropagation, and consists of two main steps.
During training, the predicted answer $\hat{a}$ is used to minimise a standard cross-entropy loss with the true label $a$ against all other words in the dictionary (i.e. the candidates are not used in the training loss), and optimization is carried out using stochastic gradient descent (SGD). Extra experimental details and hyperparameters are given in Appendix A.
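A single-hop forward pass of this kind of model can be sketched in a few lines of NumPy. This is a minimal sketch in the spirit of Sukhbaatar et al. (2015), not the authors' Torch implementation: the matrix names `A`, `B` and `U`, their shapes, and the use of one matrix for both query and matching embeddings are assumptions made for illustration.

```python
import numpy as np


def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(z - z.max())
    return e / e.sum()


def memn2n_single_hop(query_bow, memory_bows, A, B, U):
    """One memory hop.

    query_bow:   (d,)   bag-of-words vector of the query
    memory_bows: (n, d) bag-of-words vectors of the n memories
    A, B, U:     (p, d) embedding matrices (assumed names and shapes)
    """
    q = A @ query_bow            # query embedding, shape (p,)
    keys = memory_bows @ A.T     # memory embeddings used for matching, (n, p)
    alpha = softmax(keys @ q)    # soft attention over the n memories
    values = memory_bows @ B.T   # memory embeddings used for output, (n, p)
    o = alpha @ values           # attention-weighted memory, shape (p,)
    word_probs = softmax(U.T @ (q + o))  # distribution over the d words
    return alpha, word_probs


def cross_entropy(word_probs: np.ndarray, answer_index: int) -> float:
    """Loss against the full dictionary, as used during training."""
    return float(-np.log(word_probs[answer_index]))
```

The first step retrieves supporting memories through the attention weights `alpha`; the second turns the combined query and memory representation into a prediction over the vocabulary.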
3.3 Self-supervision for Window Memories
After initial experiments, we observed that the capacity to execute multiple hops in accessing memories was only beneficial in the lexical memory model. We therefore also tried a simpler, single-hop Memory Network, i.e. using a single memory to answer, that exploits a stronger signal for learning memory access. A related approach was successfully applied by Bordes et al. (2015) to question answering about knowledge bases.
At test time, rather than use a hard selection as in eq (2), the model scores each candidate not only with its highest scoring memory but with the sum of the scores of all its corresponding windows after passing all scores through a softmax. That is, the score of a candidate is defined by the sum of the softmax weights $\alpha_i$ (as used in eq (1)) of the windows it appears in. This relaxes the effects of the max operation and allows for all windows associated with a candidate to contribute some information about that candidate. As shown in the ablation study in Appendix C, this results in slightly better performance on the CNN QA benchmark compared to hard selection at test time.
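The difference between hard selection and the summed softmax scoring used at test time can be seen in a small sketch. The function names and the input format (one raw match score and one candidate label per window memory) are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

import numpy as np


def hard_selection(match_scores: Sequence[float],
                   window_candidates: List[str]) -> str:
    """Pick the candidate of the single highest-scoring window."""
    return window_candidates[int(np.argmax(match_scores))]


def soft_candidate_scores(match_scores: Sequence[float],
                          window_candidates: List[str]) -> Dict[str, float]:
    """Sum the softmax weights of all windows belonging to each candidate."""
    z = np.asarray(match_scores, dtype=float)
    alpha = np.exp(z - z.max())
    alpha /= alpha.sum()
    totals: Dict[str, float] = defaultdict(float)
    for weight, cand in zip(alpha, window_candidates):
        totals[cand] += float(weight)
    return dict(totals)


# Three windows: one centred on "tom", two centred on "mary".
scores = [2.0, 1.5, 1.5]
owners = ["tom", "mary", "mary"]
soft = soft_candidate_scores(scores, owners)
```

In this toy case hard selection returns `tom`, whose single window has the highest score, while the summed scores favour `mary`, whose two windows together carry more attention mass.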
Note that self-supervised Memory Networks do not exploit any new label information beyond the training data. The approach can be understood as a way of achieving hard attention over memories, to contrast with the soft attention-style selection described in Section 3.2. Hard attention yields significant improvements in image captioning (Xu et al., 2015). However, where Xu et al. (2015) use the REINFORCE algorithm (Williams, 1992) to train through the max of eq (2), our self-supervision heuristic permits direct backpropagation.
Baseline and Comparison Models
In addition to memory network variants, we also applied many different types of language modelling and machine reading architectures to the CBT.
We implemented two simple baselines based on word frequencies. For the first, we selected the most frequent candidate in the entire training corpus. In the second, for a given question we selected the most frequent candidate in its context. In both cases we broke ties with a random choice.
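Both frequency baselines reduce to the same selection rule applied to different counts. A minimal sketch, in which the function name `most_frequent` and the use of `collections.Counter` for the counts are illustrative choices:

```python
import random
from collections import Counter
from typing import List


def most_frequent(candidates: List[str], counts: Counter,
                  rng: random.Random) -> str:
    """Return the candidate with the highest count, breaking ties randomly."""
    best = max(counts[c] for c in candidates)  # Counter gives 0 if unseen
    tied = [c for c in candidates if counts[c] == best]
    return rng.choice(tied)
```

Passing counts computed over the whole training corpus gives the first baseline; passing counts computed over the context of the current question gives the second.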
We also tried two more sophisticated ways to rank the candidates that do not require any learning on the training data. The first is the ‘sliding window’ baseline applied to the MCTest by Richardson et al. (2013). In this method, ten ‘windows’ of the query concatenated with each possible candidate are slid across the context word-by-word, overlapping with a different subsequence at each position. The overlap score at a given position is simply word-overlap weighted TFIDF-style based on frequencies in the context (to emphasize less frequent words). The chosen candidate corresponds to the window that achieves the maximum single overlap score for any position. Ties are broken randomly.
The second method is the word distance benchmark applied by Hermann et al. (2015). For a given instance $w_i$ of a candidate in the context, the query $q$ is ‘superimposed’ on the context so that the missing word lines up with $w_i$, defining a subsequence $s$ of the context. For each word $q_j$ in $q$, an alignment penalty is incurred, based on the distance to the nearest occurrence of $q_j$ in $s$ and capped at a maximum single penalty. The model predicts the candidate with the instance in the context that incurs the lowest total alignment penalty. We tuned the maximum single penalty on the validation data.
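The word distance idea can be sketched as follows. This is a hedged reading of the description above, not a reproduction of the exact benchmark: the per-word penalty is taken to be the distance to the nearest matching word in the aligned subsequence, capped at `max_penalty`; the default value of `max_penalty`, the tie-breaking rule (first instance wins) and the function names are assumptions.

```python
from typing import Collection, List, Optional


def alignment_penalty(context: List[str], query: List[str],
                      blank_idx: int, pos: int, max_penalty: int) -> int:
    """Total penalty when the query blank is aligned with context[pos]."""
    offset = pos - blank_idx  # context index lined up with query index 0
    total = 0
    for i, q_word in enumerate(query):
        if i == blank_idx:
            continue
        best = max_penalty  # cap applied when no nearby match exists
        for j in range(len(query)):
            k = offset + j
            if 0 <= k < len(context) and context[k] == q_word:
                best = min(best, abs(i - j))
        total += best
    return total


def predict_word_distance(context: List[str], query: List[str],
                          blank_idx: int, candidates: Collection[str],
                          max_penalty: int = 5) -> Optional[str]:
    """Return the candidate whose best instance has the lowest penalty."""
    best_cand, best_pen = None, None
    for pos, word in enumerate(context):
        if word in candidates:
            pen = alignment_penalty(context, query, blank_idx, pos, max_penalty)
            if best_pen is None or pen < best_pen:
                best_cand, best_pen = word, pen
    return best_cand
```

If no candidate occurs in the context, the sketch returns `None`; a full implementation would need a fallback for that case.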
4.2 N-gram Language Models
We trained an n-gram language model using the KenLM toolkit (Heafield et al., 2013). We used Kneser-Ney smoothing, and a window size of 5, which performed best on the validation set. We also compare with a variant of this language model with a cache (Kuhn & De Mori, 1990), where we linearly interpolate the n-gram model probabilities with unigram probabilities computed on the context.
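The cache variant amounts to a linear interpolation between two probabilities. In the sketch below, the n-gram probability is passed in as a plain number (no KenLM call is made), and the function name and the interpolation weight `lam` are hypothetical.

```python
from collections import Counter
from typing import List


def cache_interpolated_prob(word: str, ngram_prob: float,
                            context_tokens: List[str], lam: float) -> float:
    """Interpolate an n-gram probability with a context unigram probability.

    lam is the weight on the n-gram model; (1 - lam) goes to the cache.
    """
    counts = Counter(context_tokens)
    cache_prob = counts[word] / len(context_tokens) if context_tokens else 0.0
    return lam * ngram_prob + (1.0 - lam) * cache_prob
```

A word that is frequent in the current story receives a higher probability than the n-gram model alone would assign, which is how the cache helps with recurring character names.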
4.3 Supervised Embedding Models
We encode various parts of the question as the input passage: the entire context + query, just the query, a sub-sequence of the query defined by a window of at most $b$ words centred around the missing word, and a version (window + position) in which we use a different embedding matrix for encoding each position of the window. We tune the window size $b$ on the validation set.
4.4 Recurrent Language Models
We trained probabilistic RNN language models with LSTM activation units on the training stories (5.5M words of text) using minibatch SGD to minimise the negative log-likelihood of the next word. Hyper-parameters were tuned on the validation set. The best model used the same dimension for both the hidden layer and the word embeddings. When answering the questions in the CBT, we allow one variant of this model (context + query) to ‘burn in’ by reading the entire context followed by the query and another version to read only the query itself (and thus have no access to the context). Unlike the canonical language-modelling task, all models have access to the query words after the missing word (i.e. if $k$ is the position of the missing word, we rank candidate $c$ based on $p(q_1 \ldots q_{k-1}, c, q_{k+1} \ldots q_l)$ rather than simply $p(q_1 \ldots q_{k-1}, c)$).
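This ranking scheme can be sketched independently of any particular language model. In the sketch below, `log_prob` is a hypothetical caller-supplied function standing in for the trained LSTM: it takes a token sequence and returns its log-probability. The function name `rank_candidates` is also an illustrative choice.

```python
from typing import Callable, List, Sequence


def rank_candidates(query: Sequence[str], blank_idx: int,
                    candidates: Sequence[str],
                    log_prob: Callable[[List[str]], float]) -> List[str]:
    """Rank candidates by the score of the full query with the blank filled.

    Scoring the whole sentence, rather than only the prefix up to the
    blank, lets words after the missing position influence the ranking.
    """
    scored = []
    for cand in candidates:
        filled = list(query[:blank_idx]) + [cand] + list(query[blank_idx + 1:])
        scored.append((log_prob(filled), cand))
    scored.sort(reverse=True)  # highest log-probability first
    return [cand for _, cand in scored]
```

To condition on the wider context as well, the same function could be called with a `log_prob` that first reads the context sentences before scoring the filled query.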
Mikolov & Zweig (2012) previously observed performance boosts for recurrent language models by adding the capacity to jointly learn a document-level representation. We similarly apply a context-based recurrent model to our language-modelling tasks, but opt for the convolutional representation of the context applied by Rush et al. (2015) for summarisation. Our Contextual LSTM (CLSTM) learns a convolutional attention over windows of the context given the objective of predicting all words in the query. We tuned the window size on the validation set. As with the standard LSTM, we trained the CLSTM on the running-text of the CBT training set (rather than the structured query and context format used with the Memory Networks) since this proved much more effective, and we report results in the best setting for each method.
4.5 Human Performance
We recruited 15 native English speakers to attempt a randomly-selected 10% from each question type of the CBT, in two modes: either with question only or with question+context (shown to different annotators), giving 2000 answers in total. To our knowledge, this is the first time human performance has been quantified on a language modelling task based on different word types and context lengths.
4.6 Other Related Approaches
The idea of conditioning language models on extra-sentential context is not new. Access to document-level features can improve both classical language models (Mikolov & Zweig, 2012) and word embeddings (Huang et al., 2012). Unlike the present work, these studies did not explore different representation strategies for the wider context or their effect on interpreting and predicting specific word types.
The original Memory Networks (Weston et al., 2015b) used hard memory selection with additional labeled supervision for the memory access component, and were applied to question-answering tasks over knowledge bases or simulated worlds. Sukhbaatar et al. (2015) and Kumar et al. (2015) trained Memory Networks with RNN components end-to-end with soft memory access, and applied them to additional language tasks. The attention-based reading models of Hermann et al. (2015) also have many commonalities with Memory Networks, differing in word representation choices and attention procedures. Both Kumar et al. (2015) and Hermann et al. (2015) propose bidirectional RNNs as a way of representing previously read text. Our experiments in Section 5 provide a possible explanation for why this is an effective strategy for semantically-focused language processing: bidirectional RNNs naturally focus on small windows of text in a similar way to window-based Memory Networks.
Other recent papers have proposed RNN-like architectures with new ways of reading, storing and updating information to improve their capacity to learn algorithmic or syntactic patterns (Joulin & Mikolov, 2015; Dyer et al., 2015; Grefenstette et al., 2015). While we do not study these models in the present work, the CBT would be ideally suited for testing this class of model on semantically-focused language modelling.
Results
In general, there is a clear difference in model performance according to the type of word to be predicted. Our main results in Table 2 show conventional language models are very good at predicting prepositions and verbs, but less good at predicting named entities and nouns. Among these language models, and in keeping with established results, RNNs with LSTMs demonstrate a small gain on n-gram models across the board, except for named entities where the cache is beneficial. In fact, LSTM models are better than humans at predicting prepositions, which suggests that there are cases in which several of the candidate prepositions are ‘correct’, but annotators prefer the less frequent one. Even more surprisingly, when only local context (the query) is available, both LSTMs and n-gram models predict verbs more accurately than humans. This may be because the models are better attuned to the distribution of verbs in children’s books, whereas humans are unhelpfully influenced by their wider knowledge of all language styles (we did not require the human annotators to warm up by reading the 98 novels in the training data, but this might have led to a fairer comparison). When access to the full context is available, humans do predict verbs with slightly greater accuracy than RNNs.
The best performing Memory Networks predict common nouns and named entities more accurately than conventional language models. Clearly, in doing so, these models rely on access to the wider context (the supervised embedding model (query), which is equivalent to the memory network but with no contextual memory, performs poorly in this regard). The fact that LSTMs without attention perform similarly on nouns and named entities whether or not the context is available confirms that they do not effectively exploit this context. This may be a symptom of the difficulty of storing and retaining information across large numbers of time steps that has been previously observed in recurrent networks (see e.g. Bengio et al., 1994).
Not all memory networks that we trained exploited the context to achieve decent prediction of nouns and named entities. For instance, when each sentence in the context is stored as an ordered sequence of word embeddings (sentence mem + PE), performance is quite poor in general. Encoding the context as an unbroken sequence of individual words (lexical memory) works well for capturing prepositions and verbs, but is less effective with nouns and entities. In contrast, window memories centred around the candidate words are more useful than either word-level or sentence-level memories when predicting named entities and nouns.
The window-based Memory Network with self-supervision (in which a hard attention selection is made among window memories during training) outperforms all others at predicting named entities and common nouns. Examples of predictions made by this model for two CBT questions are shown in Figure 2. It is notable that this model is able to achieve the strongest performance with only a simple window-based strategy for representing questions.
5.1 News Article Question Answering
To examine how well our conclusions generalise to different machine reading tasks and language styles, we also tested the best-performing Memory Networks on the CNN QA task (Hermann et al., 2015). (The CNN QA dataset was released after our primary experiments were completed, hence we experiment only with one of the two large datasets released with that paper.) This dataset consists of 93k news articles from the CNN website, each coupled with a question derived from a bullet point summary accompanying the article, and a single-word answer. The answer is always a named entity, and all named entities in the article function as possible candidate answers.
As shown in Table 3, our window model without self-supervision achieves similar performance to the best approach proposed for the task by Hermann et al. (2015) when using an ensemble of MemNN models. Our use of an ensemble is an alternative way of replicating the application of dropout (Hinton et al., 2012) in the previous best approaches (Hermann et al., 2015) as ensemble averaging has similar effects to dropout (Wan et al., 2013). When self-supervision is added, the Memory Network greatly surpasses the state-of-the-art on this task. Finally, the last line of Table 3 (excluding co-occurrences) shows how an additional heuristic, removing from the candidate list all named entities already appearing in the bullet point summary, boosts performance even further.
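The co-occurrence heuristic is a simple filter on the candidate list. A minimal sketch follows; the function name, the `@entityN` token format in the example and the fallback to the full list when every candidate would be removed are assumptions, not details given in the text.

```python
from typing import List


def exclude_cooccurring(candidates: List[str],
                        question_tokens: List[str]) -> List[str]:
    """Drop candidate entities that already appear in the question."""
    present = set(question_tokens)
    kept = [c for c in candidates if c not in present]
    # Assumed fallback: keep the original list if nothing survives.
    return kept if kept else list(candidates)


question = "@entity3 beat XXXXX in the final".split()
remaining = exclude_cooccurring(["@entity1", "@entity3", "@entity7"], question)
```

Here `@entity3` is removed because it already appears in the summary sentence, leaving the other two entities as candidates.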
Some common principles may explain the strong performance of the best performing models on this task. The attentive/impatient reading models encode the articles using bidirectional RNNs (Graves et al., 2008). For each word in the article, the combined hidden state of such an RNN naturally focuses on a window-like chunk of surrounding text, much like the window-based memory network or the CLSTM. Together, these results therefore support the principle that the most informative representations of text correspond to sub-sentential chunks. Indeed, the observation that the most informative representations for neural language models correspond to small chunks of text is also consistent with recent work on neural machine translation, in which Luong et al. (2015) demonstrated improved performance by restricting their attention mechanism to small windows of the source sentence.
Given these commonalities in how the reading models and Memory Networks represent context, the advantage of the best-performing Memory Network instead seems to stem from how it accesses or retrieves this information; in particular, the hard attention and self-supervision. Jointly learning to access and use information is a difficult optimization. Self-supervision in particular makes effective Memory Network learning more tractable (see the appendix for an ablation study in which optional features of the memory network are removed).
Conclusion
We have presented the Children’s Book Test, a new semantic language modelling benchmark. The CBT measures how well models can use both local and wider contextual information to make predictions about different types of words in children’s stories. By separating the prediction of syntactic function words from more semantically informative terms, the CBT provides a robust proxy for how much language models can impact applications requiring a focus on semantic coherence.
We tested a wide range of models on the CBT, each with different ways of representing and retaining previously seen content. This enabled us to draw novel insights into the optimal strategies for representing and accessing semantic information in memory. One consistent finding was that memories that encode sub-sentential chunks (windows) of informative text seem to be most useful to neural nets when interpreting and modelling language. However, our results indicate that the most useful text chunk size depends on the modeling task (e.g. semantic content vs. syntactic function words). We showed that Memory Networks that adhere to this principle can be efficiently trained using a simple self-supervision to surpass all other methods for predicting named entities on both the CBT and the CNN QA benchmark, an independent test of machine reading.
The authors would like to thank Harsha Pentapelli and Manohar Paluri for helping to collect the human annotations and Gabriel Synnaeve for processing the QA CNN data.
References
Appendix A Experimental Details
The text of questions is lowercased for all Memory Networks as well as for all non-learning baselines. LSTM models use the raw text (although we also tried lowercasing, which made little difference). Hyperparameters of all learning models have been set using grid search on the validation set. The main hyperparameters are embedding dimension, learning rate, window size, number of hops and maximum memory size (which can be set to use all potential memories). All models were implemented using the Torch library (see torch.ch). For CBT, all models have been trained on all question types together. We did not experiment with word embeddings pre-trained on a bigger corpus.
Embedding model (context+query): , .
Embedding model (query): , .
Embedding model (window): , , .
Embedding model (window+position): , , .
LSTMs (query & context+query): , , layer, gradient clipping factor: , learning rate shrinking factor: .
Contextual LSTMs: , , layer, gradient clipping factor: , learning rate shrinking factor: .
MemNNs (lexical memory): , , , .
MemNNs (window memory): , , , , .
MemNNs (sentential memory + PE): , , , .
MemNNs (window memory + self-sup.): , , , .
MemNNs (window memory): , , , , .
MemNNs (window memory + self-sup.): , , , , .
MemNNs (window memory + ensemble): models with .
MemNNs (window memory + self-sup. + ensemble): models with .
Appendix B Results on CBT Validation Set
Appendix C Ablation Study on CNN QA
(Soft memory weighting: the softmax to select the best candidate in test as defined in Section 3.3)
Appendix D Effects of Anonymising Entities in CBT
To see the impact of the anonymisation of entities and words as done in CNN QA on the self-supervised Memory Networks on the CBT, we conducted an experiment where we replaced the mentions of the ten candidates in each question by anonymised placeholders in train, validation and test. The table above shows results on CBT test set in an anonymised setting (last row) compared to MemNNs in a non-anonymised setting (rows 2-5). Results indicate that this has a relatively low impact on named entities but a larger one on more syntactic tasks like prepositions or verbs.
Appendix E Candidates and Window Memories in CBT
In our main results in Table 2 the window memory is constructed as the set of windows over the candidates being considered for a given question. Training of MemNNs (window memory) is performed by making gradient steps for questions, with the true answer word as the target compared against all words in the dictionary as described in Sec. 3.2. Training of MemNNs (window memory + self-sup.) is performed by making gradient steps for questions, with the true answer word as the target compared against all other candidates as described in Sec. 3.3. As MemNNs (window memory + self-sup.) is the best performing method for named entities and common nouns, to see the impact of these choices we conducted some further experiments with variants of it.
Firstly, window memories do not have to be restricted to candidates; we could consider all possible windows. Note that this does not make any difference at evaluation time on CBT as one would still evaluate by multiple choice using the candidates, and those extra windows would not contribute to the scores of the candidates. However, this may make a difference to the weights if used at training time. We call this “all windows” in the experiments to follow.
Secondly, the self-supervision process does not have to rely on there being known candidates: all that is required is a positive label. In that case, we can perform gradient steps with the true answer word as the target compared against all words in the dictionary (as opposed to only candidates) as described in Sec. 3.2, while still using hard attention supervision as described in Sec. 3.3. We call this “all targets” in the experiments to follow.
Thirdly, one does not have to try to train on only the questions in CBT, but can treat the children’s books as a standard language modeling task. In that case, all targets and all windows must be used, as multiple choice questions have not been constructed for every single word (although indeed many of them are covered by the four word classes). We call this “LM” (for language modeling) in the experiments to follow.
Results with these alternatives are presented in Table 4; the new variants are the last three rows. Overall, the differing approaches have relatively little impact on the results, as all of them give better results on named entities and common nouns than training without self-supervision. However, we note that the use of all windows or LM rather than candidate windows does impact training and testing speed.