WikiReading: A Novel Large-scale Language Understanding Task over Wikipedia

Daniel Hewlett, Alexandre Lacoste, Llion Jones, Illia Polosukhin, Andrew Fandrianto, Jay Han, Matthew Kelcey, David Berthelot

Introduction

A growing amount of research in natural language understanding (NLU) explores end-to-end deep neural network (DNN) architectures for tasks such as text classification [Zhang et al. (2015], relation extraction [Nguyen and Grishman (2015], and question answering [Weston et al. (2015]. These models offer the potential to remove the intermediate steps traditionally involved in processing natural language data by operating on increasingly raw forms of text input, even unprocessed character or byte sequences. Furthermore, while these tasks are often studied in isolation, DNNs have the potential to combine multiple forms of reasoning within a single model.

Supervised training of DNNs often requires a large amount of high-quality training data. To this end, we introduce a novel prediction task and accompanying large-scale dataset with a range of sub-tasks combining text classification and information extraction. The dataset is made publicly-available at http://goo.gl/wikireading. The task, which we call WikiReading, is to predict textual values from the open knowledge base Wikidata [Vrandečić and Krötzsch (2014] given text from the corresponding articles on Wikipedia [Ayers et al. (2008]. Example instances are shown in Table 1, illustrating the variety of subject matter and sub-tasks. The dataset contains 18.87M instances across 867 sub-tasks, split roughly evenly between classification and extraction (see Section 2 for more details).

In addition to its diversity, the WikiReading dataset is also at least an order of magnitude larger than related NLU datasets. Many natural language datasets for question answering (QA), such as WikiQA [Yang et al. (2015], have only thousands of examples and are thus too small for training end-to-end models. ?) proposed a task similar to QA, predicting entities in news summaries from the text of the original news articles, and generated a News dataset with 1M instances. The bAbI dataset [Weston et al. (2015] requires multiple forms of reasoning, but is composed of synthetically generated documents. WikiQA and News only involve pointing to locations within the document, and text classification datasets often have small numbers of output classes. In contrast, WikiReading has a rich output space of millions of answers, making it a challenging benchmark for state-of-the-art DNN architectures for QA or text classification.

We implemented a large suite of recent models, and for the first time evaluate them on common grounds, placing the complexity of the task in context and illustrating the tradeoffs inherent in each approach. The highest score of 71.8% is achieved by a sequence to sequence model [Kalchbrenner and Blunsom (2013, Cho et al. (2014] operating on word-level input and output sequences, with special handing for out-of-vocabulary words.

WikiReading

We now provide background information relating to Wikidata, followed by a detailed description of the WikiReading prediction task and dataset.

Wikidata is a free collaborative knowledge base containing information about approximately 16M items [Vrandečić and Krötzsch (2014]. Knowledge related to each item is expressed in a set of statements, each consisting of a (property, value) tuple. For example, the item Paris might have associated statements asserting (instance of, city) or (country, France). Wikidata contains over 80M such statements across over 800 properties. Items may be linked to articles on Wikipedia.

2 Dataset

We constructed the WikiReading dataset from Wikidata and Wikipedia as follows: We consolidated all Wikidata statements with the same item and property into a single (item, property, answer) triple, where answer is a set of values. Replacing each item with the text of the linked Wikipedia article (discarding unlinked items) yields a dataset of 18.58M (document, property, answer) instances. Importantly, all elements in each instance are human-readable strings, making the task entirely textual. The only modification we made to these strings was to convert timestamps into a human-readable format (e.g., “4 July 1776”).

The WikiReading task, then, is to predict the answer string for each tuple given the document and property strings. This setup can be seen as similar to information extraction, or question answering where the property acts as a “question”. We assigned each instance randomly to either training (around 16.03M instances), validation (1.89M), and test (0.95M) sets following a 85/10/5 distribution.

3 Documents

The dataset contains 4.7M unique Wikipedia articles, meaning that roughly 80% of the English-language Wikipedia is represented. Multiple instances can share the same document, with a mean of 5.31 instances per article (median: 4, max: 879). The most common categories of documents are human, taxon, film, album, and human settlement, making up 48.8% of the documents and 9.1% of the instances. The mean and median document lengths are 489.2 and 203 words.

4 Properties

The dataset contains 867 unique properties, though the distribution of properties across instances is highly skewed: The top 20 properties cover 75% of the dataset, with 99% coverage achieved after 180 properties. We divide the properties broadly into two groups: Categorical properties, such as instance of, gender and country, require selecting between a relatively small number of possible answers, while relational properties, such as date of birth, parent, and capital, typically require extracting rare or totally unique answers from the document.

To quantify this difference, we compute the entropy of the answer distribution AA for each property pp, scaled to the $rangebydividingbytheentropyofauniformdistributionwiththesamenumberofvalues,i.e.,range by dividing by the entropy of a uniform distribution with the same number of values, i.e.,\hat{H}(p)=H(A_{p})/\log{|A_{p}|}.Propertiesthatrepresentessentiallyonetoonemappingsscorenear. Properties that represent essentially one-to-one mappings score near1.0,whileapropertywithjustasingleanswerwouldscore, while a property with just a single answer would score0.0$. Table 2 lists entropy values for a subset of properties, showing that the dataset contains a spectrum of sub-tasks. We label properties with an entropy less than 0.7 as categorical, and those with a higher entropy as relational. Categorical properties cover 56.7% of the instances in the dataset, with the remaining 43.3% being relational.

5 Answers

The distribution of properties described above has implications for the answer distribution. There are a relatively small number of very high frequency “head” answers, mostly for categorical properties, and a vast number of very low frequency “tail” answers, such as names and dates. At the extremes, the most frequent answer human accounts for almost 7% of the dataset, while 54.7% of the answers in the dataset are unique. There are some special categories of answers which are systematically related, in particular dates, which comprise 8.9% of the dataset (with 7.2% being unique). This distribution means that methods focused on either head or tail answers can each perform moderately well, but only a method that handles both types of answers can achieve maximum performance. Another consequence of the long tail of answers is that many (30.0%) of the answers in the test set never appear in the training set, meaning they must be read out of the document. An answer is present verbatim in the document for 45.6% of the instances.

Methods

Recently, neural network architectures for NLU have been shown to meet or exceed the performance of traditional methods [Zhang et al. (2015, Dai and Le (2015]. The move to deep neural networks also allows for new ways of combining the property and document, inspired by recent research in the field of question answering (with the property serving as a question). In sequential models such as Recurrent Neural Networks (RNNs), the question could be prepended to the document, allowing the model to “read” the document differently for each question [Hermann et al. (2015]. Alternatively, the question could be used to compute a form of attention [Bahdanau et al. (2014] over the document, to effectively focus the model on the most predictive words or phrases [Sukhbaatar et al. (2015, Hermann et al. (2015]. As this is currently an ongoing field of research, we implemented a range of recent models and for the first time compare them on common grounds. We now describe these methods, grouping them into broad categories by general approach and noting necessary modifications. Later, we introduce some novel variations of these models.

Perhaps the most straightforward approach to WikiReading is to consider it as a special case of document classification. To fit WikiReading into this framework, we consider each possible answer as a class label, and incorporate features based on the property so that the model can make different predictions for the same document. While the number of potential answers is too large to be practical (and unbounded in principle), a substantial portion of the dataset can be covered by a model with a tractable number of answers.

The most common approach to document classification is to fit a linear model (e.g., Logistic Regression) over bag of words (BoW) features. To serve as a baseline for our task, the linear model needs to make different predictions for the same Wikipedia article depending on the property. We enable this behavior by computing two NwN_{\operatorname{w}} element BoW vectors, one each for the document and property, and concatenating them into a single 2Nw2N_{\operatorname{w}} feature vector.

1.2 Neural Network Methods

This is the neural network version of the baseline method described in Section 3.1.1. Embeddings for words in the document and property are separately averaged. The concatenation of the resulting vectors forms the joint representation of size 2din2d_{\operatorname{in}}.

We explore a variant of the previous model where the document is encoded as a paragraph vector [Le and Mikolov (2014]. We apply the PV-DBOW variant that learns an embedding for a document by optimizing the prediction of its constituent words. These unsupervised document embeddings are treated as a fixed input to the supervised classifier, with no fine-tuning.

This model is a simplified version of the Deep LSTM Reader proposed by ?). In this model, an LSTM [Hochreiter and Schmidhuber (1997] reads the property and document sequences word-by-word and the final state is used as the joint representation. This is the simplest model that respects the order of the words in the document. In our implementation we use a single layer instead of two and a larger hidden size. More details on the architecture can be found in Section 4.1 and in Table 4.

This model, also presented in ?), uses an attention mechanism to better focus on the relevant part of the document for a given property. Specifically, Attentive Reader first generates a representation u\mathbf{u} of the property using the final state of an LSTM while a second LSTM is used to read the document and generate a representation zt\mathbf{z}_{t} for each word. Then, conditioned on the property encoding u\mathbf{u}, a normalized attention is computed over the document to produce a weighted average of the word representations zt\mathbf{z}_{t}, which is then used to generate the joint representation y\mathbf{y}. More precisely:

where W1W_{1}, W2W_{2}, and v\mathbf{v} are learned parameters.

Our implementation closely follows the End-to-End Memory Network proposed in ?). This model maps a property pp and a list of sentences x1,,xn\mathbf{x}_{1},\dots,\mathbf{x}_{n} to a joint representation y\mathbf{y} by attending over sentences in the document as follows: The input encoder II converts a sequence of words xi=(xi1,,xiLi)\mathbf{x}_{i}=(x_{i1},\dots,x_{iL_{i}}) into a vector using an embedding matrix (equation 2), where LiL_{i} is the length of sentence ii.Our final results use the position encoding method proposed by ?), which incorporates positional information in addition to word embeddings. The property is encoded with the embedding matrix UU (eqn. 3). Each sentence is encoded into two vectors, a memory vector (eqn. 4) and an output vector (eqn. 5), with embedding matrices MM and CC, respectively. The property encoding is used to compute a normalized attention vector over the memories (eqn. 6).Instead of the linearization method of ?), we applied an entropy regularizer for the softmax attention as described in ?). The joint representation is the sum of the output vectors weighted by this attention (eqn. 7).

2 Answer Extraction

Relational properties involve mappings between arbitrary entities (e.g., date of birth, mother, and author) and thus are less amenable to document classification. For these, approaches from information extraction (especially relation extraction) are much more appropriate. In general, these methods seek to identify a word or phrase in the text that stands in a particular relation to a (possibly implicit) subject. Section 5 contains a discussion of prior work applying NLP techniques involving entity recognition and syntactic parsing to this problem.

RNNs provide a natural fit for extraction, as they can predict a value at every position in a sequence, conditioned on the entire previous sequence. The most straightforward application to WikiReading is to predict the probability that a word at a given location is part of an answer. We test this approach using an RNN that operates on the sequence of words. At each time step, we use a sigmoid activation for estimating whether the current word is part of the answer or not. We refer to this model as the RNN Labeler and present it graphically in Figure 1a.

For training, we label all locations where any answer appears in the document with a 11, and other positions with a (similar to distant supervision [Mintz et al. (2009]). For multi-word answers, the word sequences in the document and answer must fully matchDates were matched semantically to increase recall.. Instances where no answer appears in the document are discarded for training. The cost function is the average cross-entropy for the outputs across the sequence. When performing inference on the test set, sequences of consecutive locations scoring above a threshold are chunked together as a single answer, and the top-scoring answer is recorded for submission.We chose an arbitrary threshold of 0.5 for chunking. The score of each chunk is obtained from the harmonic mean of the predicted probabilities of its elements.

3 Sequence to Sequence

Recently, sequence to sequence learning (or seq2seq) has shown promise for natural language tasks, especially machine translation [Cho et al. (2014]. These models combine two RNNs: an encoder, which transforms the input sequence into a vector representation, and a decoder, which converts the encoder vector into a sequence of output tokens, one token at a time. This makes them capable, in principle, of approximating any function mapping sequential inputs to sequential outputs. Importantly, they are the first model we consider that can perform any combination of answer classification and extraction.

This model resembles LSTM Reader augmented with a second RNN to decode the answer as a sequence of words. The embedding matrix is shared across the two RNNs but their state to state transition matrices are different (Figure 1b). This method extends the set of possible answers to any sequence of words from the document vocabulary.

3.2 Placeholder seq2seq

While Basic seq2seq already expands the expressiveness of LSTM Reader, it still has a limited vocabulary and thus is unable to generate some answers. As mentioned in Section 3.2, RNN Labeler can extract any sequence of words present in the document, even if some are OOV. We extend the basic seq2seq model to handle OOV words by adding placeholders to our vocabulary, increasing the vocabulary size from NwN_{\operatorname{w}} to Nw+NdocN_{\operatorname{w}}+N_{\operatorname{doc}}. Then, when an OOV word occurs in the document, it is replaced at random (without replacement). by one of these placeholders. We also replace the corresponding OOV words in the target output sequence by the same placeholder,The same OOV word may occur several times in the document. Our simplified approach will attribute a different placeholder for each of these and will use the first occurrence for the target answer. as shown in Figure 1c. ?) developed a similar procedure for dealing with rare words in machine translation, copying their locations into the output sequence for further processing.

This makes the input and output sequences a mixture of known words and placeholders, and allows the model to produce any answer the RNN Labeler can produce, in addition to the ones that the basic seq2seq model could already produce. This approach is comparable to entity anonymization used in ?), which replaces named entities with random ids, but simpler because we use word-level placeholders without entity recognition.

3.3 Basic Character seq2seq

Another way of handling rare words is to process the input and output text as sequences of characters or bytes. RNNs have shown some promise working with character-level input, including state-of-the-art performance on a Wikipedia text classification benchmark [Dai and Le (2015]. A model that outputs answers character by character can in principle generate any of the answers in the test set, a major advantage for WikiReading.

This model, shown in Figure 2, operates only on sequences of mixed-case characters. The property encoder RNN transforms the property, as a character sequence, into a fixed-length vector. This property encoding becomes the initial hidden state for the second layer of a two-layer document encoder RNN, which reads the document, again, character by character. Finally, the answer decoder RNN uses the final state of the previous RNN to decode the character sequence for the answer.

3.4 Character seq2seq with Pretraining

Unfortunately, at the character level the length of all sequences (documents, properties, and answers) is greatly increased. This adds more sequential steps to the RNN, requiring gradients to propagate further, and increasing the chance of an error during decoding. To address this issue in a classification context, ?) showed that initializing an LSTM classifier with weights from a language model (LM) improved its accuracy. Inspired by this result, we apply this principle to the character seq2seq model with a two-phase training process: In the first phase, we train a character-level LM on the input character sequences from the WikiReading training set (no new data is introduced). In the second phase, the weights from this LM are used to initialize the first layer of the encoder and the decoder (purple and green blocks in Figure 2). After initialization, training proceeds as in the basic character seq2seq model.

Experiments

We evaluated all methods from Section 3 on the full test set with a single scoring framework. An answer is correct when there is an exact string match between the predicted answer and the gold answer. However, as describe in Section 2.2, some answers are composed from a set of values (e.g. third example in Table 1). To handle this, we define the Mean F1 score as follows: For each instance, we compute the F1-score (harmonic mean of precision and recall) as a measure of the degree of overlap between the predicted answer set and the gold set for a given instance. The resulting per-instance F1 scores are then averaged to produce a single dataset-level score. This allows a method to obtain partial credit for an instance when it answers with at least one value from the golden set. In this paper, we only consider methods for answering with a single value, and most answers in the dataset are also composed of a single value, so this Mean F1 metric is closely related to accuracy. More precisely, a method using a single value as answer is bounded by a Mean F1 of 0.963.

We implemented all models in a single framework based on TensorFlow [Abadi et al. (2015] with shared pre-processing and comparable hyperparameters whenever possible. All documents are truncated to the first 300300 words except for Character seq2seq, which uses 400400 characters. The embedding matrix used to encode words in the document uses din=300d_{\operatorname{in}}=300 dimensions for the Nw=100,000N_{\operatorname{w}}=100,000 most frequent words. Similarly, answer classification over the Nans=50,000N_{\operatorname{ans}}=50,000 most frequent answers is performed using an answer representation of size dout=300d_{\operatorname{out}}=300.For models like Averaged Embedding and Paragraph Vector, the concatenation imposes a greater doutd_{\operatorname{out}}. The first 1010 words of the properties are embedded using the document embedding matrix. Following ?), RNNs in seq2seq models use a GRU cell with a hidden state size of 10241024. More details on parameters are reported in Table 4.

Optimization was performed with the Adam stochastic optimizerUsing β1=0.9\beta_{1}=0.9, β2=0.999\beta_{2}=0.999 and ϵ=108\epsilon=10^{-8}. [Kingma and Adam (2015] over mini-batches of 128128 samples. Gradient clipping When the norm of gradient g\mathbf{g} exceeds a threshold CC, it is scaled down i.e. ggmin(1,Cg)\mathbf{g}\leftarrow\mathbf{g}\cdot\min\left(1,\frac{C}{||\mathbf{g}||}\right). [Graves (2013] is used to prevent instability in training RNNs. We performed a search over 50 randomly-sampled hyperparameter configurations for the learning rate and gradient clip threshold, selecting the one with the highest Mean F1 on the validation set. Learning rate and clipping threshold are sampled uniformly, on a logarithmic scale, over the range [105,102][10^{-5},10^{-2}] and [103,101][10^{-3},10^{1}] respectively.

2 Results and Discussion

Results for all models on the held-out set of test instances are presented in Table 3. In addition to the overall Mean F1 scores, the model families differ significantly in Mean F1 upper bound, and their relative performance on the relational and categorical properties defined in Section 2.4. We also report scores for properties containing dates, a subset of relational properties, as a separate column since they have a distinct format and organization. For examples of model performance on individual properties, see Table 5.

As expected, all classifier models perform well for categorical properties, with more sophisticated classifiers generally outperforming simpler ones. The difference in precision reading ability between models that use broad document statistics, like Averaged Embeddings and Paragraph Vectors, and the RNN-based classifiers is revealed in the scores for relational and especially date properties. As shown in Table 5, this difference is magnified in situations that are more difficult for a classifier, such as relational properties or properties with fewer training examples, where Attentive Reader outperforms Averaged Embeddings by a wide margin. This model family also has a high upper bound, as perfect classification across the 50,00050,000 most frequent answers would yield a Mean F1 of 0.831. However, none of them approaches this limit. Part of the reason is that their accuracy for a given answer decreases quickly as the frequency of the answer in the training set decreases, as illustrated in Figure 3. As these models have to learn a separate weight vector for each answer as part of the softmax layer (see Section 3.1), this may suggest that they fail to generalize across answers effectively and thus require significant number of training examples per answer.

The only answer extraction model evaluated, RNN Labeler, shows a complementary set of strengths, performing better on relational properties than categorical ones. While the Mean F1 upper bound for this model is just 0.434 because it can only produce answers that are present verbatim in the document text, it manages to achieve most of this potential. The improvement on date properties over the classifier models demonstrates its ability to identify answers that are typically present in the document. We suspect that answer extraction may be simpler than answer classification because the model can learn robust patterns that indicate a location without needing to learn about each answer, as the classifier models must.

The sequence to sequence models show a greater degree of balance between relational and categorical properties, reaching performance consistent with classifiers on the categorical questions and with RNN Labeler on relational questions. Placeholder seq2seq can in principle produce any answer that RNN Labeler can, and the performance on relational properties is indeed similar. As shown in Table 5, Placeholder seq2seq performs especially well for properties where the answer typically contains rare words such as the name of a place or person. When the set of possible answer tokens is more constrained, such as in categorical or date properties, the Basic seq2seq often performs slightly better. Character seq2seq has the highest upper bound, limited to 0.963 only because it cannot produce an answer set with multiple elements. LM pretraining consistently improves the performance of the Character seq2seq model, especially for relational properties as shown in Table 5. The performance of the Character seq2seq, especially with LM pretraining, is a surprising result: It performs comparably to the word-level seq2seq models even though it must copy long character strings when doing extraction and has access to a smaller portion of the document. We found the character based models to be particularly sensitive to hyperparameters. However, using a pretrained language model reduced this issue and significantly accelerated training while improving the final score. We believe that further research on pretraining for character based models could improve this result.

Related Work

The goal of automatically extracting structured information from unstructured Wikipedia text was first advanced by ?). As Wikidata did not exist at that time, the authors relied on the structured infoboxes included in some Wikipedia articles for a relational representation of Wikipedia content. Wikidata is a cleaner data source, as the infobox data contains many slight variations in schema related to page formatting. Partially to get around this issue, the authors restrict their prediction model Kylin to 4 specific infobox classes, and only common attributes within each class.

A substantial body of work in relation extraction (RE) follows the distant supervision paradigm [Craven and Kumlien (1999], where sentences containing both arguments of a knowledge base (KB) triple are assumed to express the triple’s relation. Broadly, these models use these distant labels to identify syntactic features relating the subject and object entities in text that are indicative of the relation. ?) apply distant supervision to extracting Freebase triples [Bollacker et al. (2008] from Wikipedia text, analogous to the relational part of WikiReading. Extensions to distant supervision include explicitly modelling whether the relation is actually expressed in the sentence [Riedel et al. (2010], and jointly reasoning over larger sets of sentences and relations [Surdeanu et al. (2012]. Recently, ?) developed methods for reducing the number of distant supervision examples required by sharing information between relations.

Conclusion

We have demonstrated the complexity of the WikiReading task and its suitability as a benchmark to guide future development of DNN models for natural language understanding. After comparing a diverse array of models spanning classification and extraction, we conclude that end-to-end sequence to sequence models are the most promising. These models simultaneously learned to classify documents and copy arbitrary strings from them. In light of this finding, we suggest some focus areas for future research.

Our character-level model improved substantially after language model pretraining, suggesting that further training optimizations may yield continued gains. Document length poses a problem for RNN-based models, which might be addressed with convolutional neural networks that are easier to parallelize. Finally, we note that these models are not intrinsically limited to English, as they rely on little or no pre-processing with traditional NLP systems. This means that they should generalize effectively to other languages, which could be demonstrated by a multilingual version of WikiReading.

Acknowledgments

We thank Jonathan Berant for many helpful comments on early drafts of the paper, and Catherine Finegan-Dollak for an early implementation of RNN Labeler.

References