Language Models as Knowledge Bases?

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel

Introduction

Recently, pretrained high-capacity language models such as ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2018a) have become increasingly important in NLP. They are optimised to either predict the next word in a sequence or some masked word anywhere in a given sequence (e.g. “Dante was born in [Mask] in the year 1265.”). The parameters of these models appear to store vast amounts of linguistic knowledge (Peters et al., 2018b; Goldberg, 2019; Tenney et al., 2019) useful for downstream tasks. This knowledge is usually accessed either by conditioning on latent context representations produced by the original model or by using the original model weights to initialize a task-specific model which is then further fine-tuned. This type of knowledge transfer is crucial for current state-of-the-art results on a wide range of tasks.

In contrast, knowledge bases are effective solutions for accessing annotated gold-standard relational data by enabling queries such as (Dante, born-in, $\mathbf{X}$ ). However, in practice we often need to extract relational data from text or other modalities to populate these knowledge bases. This requires complex NLP pipelines involving entity extraction, coreference resolution, entity linking and relation extraction Surdeanu and Ji (2014)—components that often need supervised data and fixed schemas. Moreover, errors can easily propagate and accumulate throughout the pipeline. Instead, we could attempt to query neural language models for relational data by asking them to fill in masked tokens in sequences like “Dante was born in [Mask]”, as illustrated in Figure 1. In this setting, language models come with various attractive properties: they require no schema engineering, do not need human annotations, and they support an open set of queries.

Given the above qualities of language models as potential representations of relational knowledge, we are interested in the relational knowledge already present in pretrained off-the-shelf language models such as ELMo and BERT. How much relational knowledge do they store? How does this differ for different types of knowledge such as facts about entities, common sense, and general question answering? How does their performance without fine-tuning compare to symbolic knowledge bases automatically extracted from text? Beyond gathering a better general understanding of these models, we believe that answers to these questions can help us design better unsupervised knowledge representations that could transfer factual and commonsense knowledge reliably to downstream tasks such as commonsense (visual) question answering (Zellers et al., 2018; Talmor et al., 2019) or reinforcement learning (Branavan et al., 2011; Chevalier-Boisvert et al., 2018; Bahdanau et al., 2019; Luketina et al., 2019).

For the purpose of answering the above questions we introduce the LAMA (LAnguage Model Analysis) probe, consisting of a set of knowledge sources, each comprised of a set of facts. We define that a pretrained language model knows a fact (subject, relation, object) such as (Dante, born-in, Florence) if it can successfully predict masked objects in cloze sentences such as “Dante was born in ” expressing that fact. We test for a variety of types of knowledge: relations between entities stored in Wikidata, common sense relations between concepts from ConceptNet, and knowledge necessary to answer natural language questions in SQuAD. In the latter case we manually map a subset of SQuAD questions to cloze sentences.

Our investigation reveals that (i) the largest BERT model from Devlin et al. (2018b) (BERT-large) captures (accurate) relational knowledge comparable to that of a knowledge base extracted with an off-the-shelf relation extractor and an oracle-based entity linker from a corpus known to express the relevant knowledge, (ii) factual knowledge can be recovered surprisingly well from pretrained language models, however, for some relations (particularly $N$ -to- $M$ relations) performance is very poor, (iii) BERT-large consistently outperforms other language models in recovering factual and commonsense knowledge while at the same time being more robust to the phrasing of a query, and (iv) BERT-large achieves remarkable results for open-domain QA, reaching 57.1% precision@10 compared to 63.5% of a knowledge base constructed using a task-specific supervised relation extraction system .

Background

In this section we provide background on language models. Statistics for the models that we include in our investigation are summarized in Table 1.

Given an input sequence of tokens $\mathbf{w}=[w_{1},w_{2},\ldots,w_{N}]$ , unidirectional language models commonly assign a probability $p(\mathbf{w})$ to the sequence by factorizing it as follows

A common way to estimate this probability is using neural language models (Mikolov and Zweig, 2012; Melis et al., 2017; Bengio et al., 2003) with

fairseq-fconv: Instead of commonly used recurrent neural networks, Dauphin et al. (2017) use multiple layers of gated convolutions. We use the pretrained model in the fairseqhttps://github.com/pytorch/fairseq library in our study. It has been trained on the WikiText-103 corpus introduced by Merity et al. (2016).

Transformer-XL: Dai et al. (2019) introduce a large-scale language model based on the Transformer (Vaswani et al., 2017). Transformer-XL can take into account a longer history by caching previous outputs and by using relative instead of absolute positional encoding. It achieves a test perplexity of $18.3$ on the WikiText-103 corpus.

So far, we have looked at language models that predict the next word given a history of words. However, in many downstream applications we mostly care about having access to contextual representations of words, i.e., word representations that are a function of the entire context of a unit of text such as a sentence or paragraph, and not only conditioned on previous words. Formally, given an input sequence $\mathbf{w}=[w_{1},w_{2},\ldots,w_{N}]$ and a position $1\leq i\leq N$ , we want to estimate $p(w_{i})=p(w_{i}\,|\,w_{1},\ldots,w_{i-1},w_{i+1},\ldots,w_{N})$ using the left and right context of that word.

ELMo: To estimate this probability, Peters et al. (2018a) propose running a forward and backward LSTM (Hochreiter and Schmidhuber, 1997), resulting in $\overrightarrow{\mathbf{h}}_{i}$ and $\overleftarrow{\mathbf{h}}_{i}$ which consequently are used to calculate a forward and backward language model log-likelihood. Their model, ELMo, uses multiple layers of LSTMs and it has been pretrained on the Google Billion Word dataset. Another version of the model, ELMo 5.5B, has been trained on the English Wikipedia and monolingual news crawl data from WMT 2008-2012.

BERT: Instead of a standard language model objective, Devlin et al. (2018a) propose to sample positions in the input sequence randomly and to learn to fill the word at the masked position. To this end, they employ a Transformer architecture and train it on the BookCorpus (Zhu et al., 2015) as well as a crawl of English Wikipedia. In addition to this pseudo language model objective, they use an auxiliary binary classification objective to predict whether a particular sentence follows the given sequence of words.

Related Work

Many studies have investigated pretrained word representations, sentence representations, and language models. Existing work focuses on understanding linguistic and semantic properties of word representations or how well pretrained sentence representations and language models transfer linguistic knowledge to downstream tasks. In contrast, our investigation seeks to answer to what extent pretrained language models store factual and commonsense knowledge by comparing them with symbolic knowledge bases populated by traditional relation extraction approaches.

Baroni et al. (2014) present a systematic comparative analysis between neural word representation methods and more traditional count-based distributional semantic methods on lexical semantics tasks like semantic relatedness and concept categorization. They find that neural word representations outperform count-based distributional methods on the majority of the considered tasks. Hill et al. (2015) investigate to what degree word representations capture semantic meaning as measured by similarity between word pairs.

Marvin and Linzen (2018) assess the grammaticality of pretrained language models. Their dataset consists of sentence pairs with a grammatical and an ungrammatical sentence. While a good language model should assign higher probability to the grammatical sentence, they find that LSTMs do not learn syntax well.

Another line of work investigates the ability of pretrained sentence and language models to transfer knowledge to downstream natural language understanding tasks (Wang et al., 2018). While such an analysis sheds light on the transfer-learning abilities of pretrained models for understanding short pieces of text, it provides little insight into whether these models can compete with traditional approaches to representing knowledge like symbolic knowledge bases.

More recently, McCoy et al. (2019) found that for natural language inference, a model based on BERT learns to rely heavily on fallible syntactic heuristics instead of a deeper understanding of the natural language input. Peters et al. (2018b) found that lower layers in ELMo specialize on local syntactic relationships, while higher layers can learn to model long-range relationships. Similarly, Goldberg (2019) found that BERT captures English syntactic phenomena remarkably well. Tenney et al. (2019) investigate to what extent language models encode sentence structure for different syntactic and semantic phenomena and found that they excel for the former but only provide small improvements for tasks that fall into the latter category. While this provides insights into the linguistic knowledge of language models, it does not provide insights into their factual and commonsense knowledge.

Radford et al. (2018) introduce a pretrained language model based on the Transformer which they termed generative pretraining (GPTv1). The first version of GPT (Radford et al., 2018) has been trained on the Book Corpus (Zhu et al., 2015) containing 7000 books. The closest to our investigation is the work by Radford et al. (2019) which introduces GPTv2 and investigates how well their language model does zero-shot transfer to a range of downstream tasks. They find that GPTv2 achieves an $F_{1}$ of 55 for answering questions in CoQA (Reddy et al., 2018) and $4.1\%$ accuracy on the Natural Questions dataset (Kwiatkowski et al., 2019), in both cases without making use of annotated question-answer pairs or an information retrieval step. While these results are encouraging and hint at the ability of very large pretrained language models to memorize factual knowledge, the large GPTv2 model has not been made public and the publicly available small version achieves less than $1\%$ on Natural Questions (5.3 times worse than the large model). Thus, we decided to not include GPTv2 in our study. Similarly, we do not include GPTv1 in this study as it uses a limited lower-cased vocabulary, making it incompatible to the way we assess the other language models.

The LAMA Probe

We introduce the LAMA (LAnguage Model Analysis) probe to test the factual and commonsense knowledge in language models. It provides a set of knowledge sources which are composed of a corpus of facts. Facts are either subject-relation-object triples or question-answer pairs. Each fact is converted into a cloze statement which is used to query the language model for a missing token. We evaluate each model based on how highly it ranks the ground truth token against every other word in a fixed candidate vocabulary. This is similar to ranking-based metrics from the knowledge base completion literature (Bordes et al., 2013; Nickel et al., 2016). Our assumption is that models which rank ground truth tokens high for these cloze statements have more factual knowledge. We discuss each step in detail next and provide considerations on the probe below.

To assess the different language models in Section 2, we cover a variety of sources of factual and commonsense knowledge. For each source, we describe the origin of fact triples (or question-answer pairs), how we transform them into cloze templates, and to what extent aligned texts exist in Wikipedia that are known to express a particular fact. We use the latter information in supervised baselines that extract knowledge representations directly from the aligned text.

The Google-RE corpushttps://code.google.com/archive/p/relation-extraction-corpus/ contains ${\sim}60$ K facts manually extracted from Wikipedia. It covers five relations but we consider only three of them, namely “place of birth”, “date of birth” and “place of death”. We exclude the other two because they contain mainly multi-tokens objects that are not supported in our evaluation. We manually define a template for each considered relation, e.g., “[S] was born in [O]” for “place of birth”. Each fact in the Google-RE dataset is, by design, manually aligned to a short piece of Wikipedia text supporting it.

1.2 T-REx

The T-REx knowledge source is a subset of Wikidata triples. It is derived from the T-REx dataset Elsahar et al. (2018) and is much larger than Google-RE with a broader set of relations. We consider 41 Wikidata relations and subsample at most 1000 facts per relation. As with the Google-RE corpus, we manually define a template for each relation (see Table 3 for some examples). In contrast to the Google-RE knowledge source, T-REx facts were automatically aligned to Wikipedia and hence this alignment can be noisy. However, Elsahar et al. (2018) report an accuracy of 97.8% for the alignment technique over a test set.

1.3 ConceptNet

ConceptNet Speer and Havasi (2012) is a multi-lingual knowledge base, initially built on top of Open Mind Common Sense (OMCS) sentences. OMCS represents commonsense relationships between words and/or phrases. We consider facts from the English part of ConceptNet that have single-token objects covering 16 relations. For these ConceptNet triples, we find the OMCS sentence that contains both the subject and the object. We then mask the object within the sentence and use the sentence as template for querying language models. If there are several sentences for a triple, we pick one at random. Note that for this knowledge source there is no explicit alignment of facts to Wikipedia sentences.

1.4 SQuAD

SQuAD Rajpurkar et al. (2016) is a popular question answering dataset. We select a subset of 305 context-insensitive questions from the SQuAD development set with single token answers. We manually create cloze-style questions from these questions, e.g., rewriting “Who developed the theory of relativity?” as “The theory of relativity was developed by ”. For each question and answer pair, we know that the corresponding fact is expressed in Wikipedia since this is how SQuAD was created.

2 Models

We consider the following pretrained case-sensitive language models in our study (see Table 1): fairseq-fconv (Fs), Transformer-XL large (Txl), ELMo original (Eb), ELMo 5.5B (E5B), BERT-base (Bb) and BERT-large (Bl). We use the natural way of generating tokens for each model by following the definition of the training objective function.

Assume we want to compute the generation for the token at position $t$ . For unidirectional language models, we use the network output ( $\mathbf{h}_{t-1}$ ) just before the token to produce the output layer softmax. For ELMo we consider the output just before ( $\overrightarrow{\mathbf{h}}_{t-1}$ ) for the forward direction and just after ( $\overleftarrow{\mathbf{h}}_{t+1}$ ) for the backward direction. Following the loss definition in (Peters et al., 2018a), we average forward and backward probabilities from the corresponding softmax layers. For BERT, we mask the token at position $t$ , and we feed the output vector corresponding to the masked token ( $\mathbf{h}_{t}$ ) into the softmax layer. To allow a fair comparison, we let models generate over a unified vocabulary, which is the intersection of the vocabularies for all considered models ( ${\sim}21$ K case-sensitive tokens).

3 Baselines

To compare language models to canonical ways of using off-the-shelf systems for extracting symbolic knowledge and answering questions, we consider the following baselines.

Freq: For a subject and relation pair, this baseline ranks words based on how frequently they appear as objects for the given relation in the test data. It indicates the upper bound performance of a model that always predicts the same objects for a particular relation.

RE: For the relation-based knowledge sources, we consider the pretrained Relation Extraction (RE) model of Sorokin and Gurevych (2017). This model was trained on a subcorpus of Wikipedia annotated with Wikidata relations. It extracts relation triples from a given sentence using an LSTM-based encoder and an attention mechanism. Based on the alignment information from the knowledge sources, we provide the relation extractor with the sentences known to express the test facts. Using these datasets, RE constructs a knowledge graph of triples. At test time, we query this graph by finding the subject entity and then rank all objects in the correct relation based on the confidence scores returned by RE. We consider two versions of this procedure that differ in how the entity linking is implemented: REn makes use of a naïve entity linking solution based on exact string matching, while REo uses an oracle for entity linking in addition to string matching. In other words, assume we query for the object $o$ of a test subject-relation fact $(s,r,o)$ expressed in a sentence $x$ . If RE has extracted any triple $(s^{\prime},r,o^{\prime})$ from that sentence $x$ , $s^{\prime}$ will be linked to $s$ and $o^{\prime}$ to $o$ . In practice, this means RE can return the correct solution $o$ if any relation instance of the right type was extracted from $x$ , regardless of whether it has a wrong subject or object.

DrQA: Chen et al. (2017) introduce DrQA, a popular system for open-domain question answering. DrQA predicts answers to natural language questions using a two step pipeline. First, a TF/IDF information retrieval step is used to find relevant articles from a large store of documents (e.g. Wikipedia). On the retrieved top $k$ articles, a neural reading comprehension model then extracts answers. To avoid giving the language models a competitive advantage, we constrain the predictions of DrQA to single-token answers.

4 Metrics

We consider rank-based metrics and compute results per relation along with mean values across all relations. To account for multiple valid objects for a subject-relation pair (i.e., for N-M relations), we follow Bordes et al. (2013) and remove from the candidates when ranking at test time all other valid objects in the training data other than the one we test. We use the mean precision at k (P@k). For a given fact, this value is 1 if the object is ranked among the top k results, and 0 otherwise.

5 Considerations

There are several important design decisions we made when creating the LAMA probe. Below we give more detailed justifications for these decisions.

For each relation we manually define a template that queries for the object slot in that relation. One can expect that the choice of templates has an impact on the results, and this is indeed the case: for some relations we find both worse and better ways to query for the same information (with respect to a given model) by using an alternate template. We argue that this means we are measuring a lower bound for what language models know. We make this argument by analogy with traditional knowledge bases: they only have a single way of querying knowledge for a specific relation, namely by using the relation id of that relation, and this way is used to measure their accuracy. For example, if the relation ID is works-For and the user asks for is-working-for, the accuracy of the KG would be 0.

We only consider single token objects as our prediction targets. The reason we include this limitation is that multi-token decoding adds a number of additional tuneable parameters (beam size, candidate scoring weights, length normalization, n-gram repetition penalties, etc.) that obscure the knowledge we are trying to measure. Moreover, well-calibrated multi-token generation is still an active research area, particularly for bidirectional models (see e.g. Welleck et al. (2019)).

We choose to only query object slots in triples, as opposed to subject or relation slots. By including reverse relations (e.g. contains and contained-by) we can also query subject slots. We do not query relation slots for two reasons. First, surface form realisations of relations will span several tokens, and as we discussed above, this poses a technical challenge that is not in the scope of this work. Second, even if we could easily predict multi-token phrases, relations can generally be expressed with many different wordings, making it unclear what the gold standard pattern for a relation should be, and how to measure accuracy in this context.

The models that we considered are trained with different vocabularies. For instance, ELMo uses a list of $\sim$ 800K tokens while BERT considers only $\sim$ 30K tokens. The size of the vocabulary can influence the performance of a model for the LAMA probe. Specifically, the larger the vocabulary the harder it would be to rank the gold token at the top. For this reason we considered a common vocabulary of $\sim$ 21K case-sensitive tokens that are obtained from the intersection of the vocabularies for all considered models. To allow a fair comparison, we let every model rank only tokens in this joint vocabulary.

Results

We summarize the main results in Table 2, which shows the mean precision at one (P@1) for the different models across the set of corpora considered. In the remainder of this section, we discuss the results for each corpus in detail.

We query the LMs using a standard cloze template for each relation. The base and large versions of BERT both outperform all other models by a substantial margin. Furthermore, they obtain a $2.2$ and $2.9$ respective average accuracy improvement over the oracle-based RE baseline. This is particularly surprising given that with the gold-aligned Google-RE source we know for certain that the oracle RE baseline has seen at least one sentence expressing each test fact. Moreover, the RE baseline was given substantial help through an entity linking oracle.

It is worth pointing out that while BERT-large does better, this does not mean it does so for the right reasons. Although the aligned Google-RE sentences are likely in its training set (as they are part of Wikipedia and BERT has been trained on Wikipedia), it might not “understand” them to produce these results. Instead, it could have learned associations of objects with subjects from co-occurrence patterns.

The knowledge source derived from Google-RE contains relatively few facts and only three relations. Hence, we perform experiments on the larger set of facts and relations in T-REx. We find that results are generally consistent with Google-RE. Again, the performance of BERT in retrieving factual knowledge are close to the performance obtained by automatically building a knowledge base with an off-the-shelf relation extraction system and oracle-based entity linking. Broken down by relation type, the performance of BERT is very high for 1-to-1 relations (e.g., capital of) and low for N-to-M relations.

Note that a downstream model could learn to make use of knowledge in the output representations of a language model even if the correct answer is not ranked first but high enough (i.e. a hint about the correct answer can be extracted from the output representation). Figure 2 shows the mean P@k curves for the considered models. For BERT, the correct object is ranked among the top ten in around 60% of the cases and among the top 100 in 80% of the cases.

To further investigate why BERT achieves such strong results, we compute the Pearson correlation coefficient between the P@1 and a set of metrics that we report in Figure 8. We notice, for instance, that the number of times an object is mentioned in the training data positively correlates with performance while the same is not true for the subject of a relation. Furthermore, the log probability of a prediction is strongly positively correlated with P@1. Thus, when BERT has a high confidence in its prediction, it is often correct. Performance is also positively correlated with the cosine similarity between subject and object vectors, and slightly with the number of tokens in the subject.

Table 3 shows randomly picked examples for the generation of BERT-large for cloze template queries. We find that BERT-large generally predicts objects of the correct type, even when the predicted object itself is not correct.

To understand how the performance of a pretrained language model varies with different ways of querying for a particular fact, we analyze a maximum of 100 random facts per relation for which we randomly select 10 aligned sentences in Wikipedia from T-REx.We exclude all facts with less than 10 alignments. In each of the sentences, we mask the object of the fact, and ask the model to predict it. For several of our language models this also tests their ability to memorize and recall sentences from the training data since as the models have been trained on Wikipedia (see Table 1).

Figure 4 shows the average distribution of the rank for ten queries per fact. The two BERT models and ELMo 5.5B exhibit the lowest variability while ranking the correct object close to the top on average. Surprisingly, the performance of ELMo original is not far from BERT, even though this model did not see Wikipedia during training. Fairseq-fconv and Transformer-XL experience a higher variability in their predictions. Note that BERT and ELMo 5.5B have been trained on a larger portion of Wikipedia than fairseq-fconv and Transformer-XL and may have seen more sentences containing the test queries during training.

The results on the ConceptNet corpus are in line with those reported for retrieving factual knowledge in Google-RE and T-REx. The BERT-large model consistently achieves the best performance, and it is able to retrieve commonsense knowledge at a similar level to factual knowledge. The lower half of Table 3 shows generations by BERT-large for randomly sampled examples. Some of the concepts generated by the language models are surprisingly reasonable in addition to being syntactically correct.

Next we evaluate our system on open-domain cloze-style question answering and compare against the supervised DrQA model. Table 2 shows a performance gap between BERT-large and the DrQA open-domain QA system on our cloze SQuAD task. Again, note that the pretrained language model is completely unsupervised, it is not fine-tuned, and it has no access to a dedicated information retrieval system. Moreover, when comparing DrQA and BERT-large in terms of P@10, we find that gap is remarkably small (57.1 for BERT-large and 63.5 for DrQA).

Discussion and Conclusion

We presented a systematic analysis of the factual and commonsense knowledge in publicly available pretrained language models as is and found that BERT-large is able to recall such knowledge better than its competitors and at a level remarkably competitive with non-neural and supervised alternatives. Note that we did not compare the ability of the corresponding architectures and objectives to capture knowledge in a given body of text but rather focused on the knowledge present in the weights of existing pretrained models that are being used as starting points for many researchers’ work. Understanding which aspects of data our commonly-used models and learning algorithms are capturing is a crucial field of research and this paper complements the many studies focused on the learned linguistic properties of the data.

We found that it is non-trivial to extract a knowledge base from text that performs on par to directly using pretrained BERT-large. This is despite providing our relation extraction baseline with only data that is likely expressing target facts, thus reducing potential for false negatives, as well as using a generous entity-linking oracle. We suspected BERT might have an advantage due to the larger amount of data it has processed, so we added Wikitext-103 as additional data to the relation extraction system and observed no significant change in performance. This suggests that while relation extraction performance might be difficult to improve with more data, language models trained on ever growing corpora might become a viable alternative to traditional knowledge bases extracted from text in the future.

In addition to testing future pretrained language models using the LAMA probe, we are interested in quantifying the variance of recalling factual knowledge with respect to varying natural language templates. Moreover, assessing multi-token answers remains an open challenge for our evaluation setup.

Acknowledgments

We would like to thank the reviewers for their thoughtful comments and efforts towards improving our manuscript. In addition, we would like to acknowledge three frameworks that were used in our experiments: AllenNLPhttps://github.com/allenai/allennlp, Fairseqhttps://github.com/pytorch/fairseq and the Hugging Face PyTorch-Transformershttps://github.com/huggingface/pytorch-transformers library.