Commonsense Knowledge Mining from Pretrained Models

Joshua Feldman, Joe Davison, Alexander M. Rush

Introduction

Commonsense knowledge consists of facts about the world which are assumed to be widely known. For this reason, commonsense knowledge is rarely stated explicitly in natural language, making it challenging to infer this information without an enormous amount of data Gordon and Van Durme (2013). Some have even argued that machine learning models cannot learn common sense implicitly Davis and Marcus (2015).

One method for mitigating this issue is to augment models directly with commonsense knowledge bases Young et al. (2018), which typically contain high-quality information but have low coverage. These knowledge bases are represented as graphs, with nodes consisting of conceptual entities (e.g. dog, running away, excited) and pre-defined edges representing the nature of the relations between concepts (IsA, UsedFor, CapableOf, etc.). Commonsense knowledge base completion (CKBC) is a machine learning task motivated by the need to improve the coverage of these resources. In this formulation of the problem, one is supplied with a list of candidate entity-relation-entity triples, and the task is to distinguish which of the triples express valid commonsense knowledge and which are fictitious (Li et al., 2016).

Several approaches have been proposed for training models for commonsense knowledge base completion Li et al. (2016); Jastrzębski et al. (2018). Each of these approaches uses some sort of supervised training on a particular knowledge base, evaluating the model’s performance on a held-out test set from the same database. These works use relations from ConceptNet, a crowd-sourced database of structured commonsense knowledge, to train and validate their models Liu and Singh (2004). However, it has been shown that these methods generalize poorly to novel data Li et al. (2016); Jastrzębski et al. (2018). Jastrzębski et al. (2018) demonstrated that much of the data in the ConceptNet test set were simply rephrased relations from the training set, and that this train-test set leakage led to artificially inflated test performance metrics. This problem of train-test leakage is typical in knowledge base completion tasks (Toutanova et al., 2015; Dettmers et al., 2018).

Instead of training a predictive model on any specific database, we attempt to utilize the world knowledge of large language models to identify commonsense facts directly. By constructing a candidate piece of knowledge as a sentence, we can use a language model to approximate the likelihood of this text as a proxy for its truthfulness. In particular, we use a masked language model to estimate point-wise mutual information between entities in a possible relation, an approach that differs significantly from fine-tuning approaches used for other language modeling tasks. Since the weights of the model are fixed, our approach is not biased by the coverage of any one dataset. As we might expect, our method underperforms when compared to previous benchmarks on the ConceptNet common sense triples dataset Li et al. (2016), but demonstrates a superior ability to generalize when mining novel commonsense knowledge from Wikipedia.

Previous work by Schwartz et al. (2017) and Trinh and Le (2018) demonstrates a similar approach to using language models for tasks requiring commonsense, such as the Story Cloze Task and the Winograd Schema Challenge, respectively Mostafazadeh et al. (2016); Levesque et al. (2012). Bosselut et al. (2019) and Trinh and Le (2019) use unidirectional language models for CKBC, but their approach requires a supervised training step. Our approach differs in that we intentionally avoid training on any particular database, relying instead on the language model’s general world knowledge. Additionally, we use a bidirectional masked model which provides a more flexible framework for likelihood estimation and allows us to estimate point-wise mutual information. Although it is beyond the scope of this paper, it would be interesting to adapt the methods presented here for the related task of generating new commonsense knowledge Saito et al. (2018).

Method

We assume that heads and tails are arbitrary-length sequences of words in a vocabulary $\mathcal{V}$ so that $\mathbf{h}=\{h_{1},h_{2},\dots,h_{n}\}$ and $\mathbf{t}=\{t_{1},t_{2},\dots,t_{m}\}$. We further assume that we have a known set of possible relations $\mathcal{R}$ so that $r\in\mathcal{R}$.

The goal is to determine a function $f$ that maps a relational triple $x=(\mathbf{h},r,\mathbf{t})$ to a validity score. We propose decomposing $f(x)=\sigma(\tau(x))$ into two sub-components: a sentence generation function $\tau$, which maps a triple to a single sentence, and a scoring model $\sigma$, which then determines a validity score $y$.

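Our approach relies on two types of pretrained language models. Standard unidirectional models are typically represented as autoregressive probabilities:

$$P_{\text{coh}}(w_{1},\dots,w_{m})=\prod_{i=1}^{m}P_{\text{coh}}(w_{i}\mid w_{1},\dots,w_{i-1}).$$
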
Masked bidirectional models such as BERT, proposed by Devlin et al. (2018), instead model in both directions, training word representations conditioned both on future and past words. The masking allows any number of words in the sequence to be hidden. This setup provides an intuitive framework to evaluate the probability of any word in a sequence conditioned on the rest of the sequence,

$$P_{\text{cmp}}(w_{i}\mid w^{\prime}_{1:i-1},w^{\prime}_{i+1:m}),$$

where each $w^{\prime}\in\mathcal{V}\cup\{\kappa\}$ and $\kappa$ is a special token indicating a masked word.

We first consider methods for turning a triple such as (ferret, AtLocation, pet store) into a sentence such as “the ferret is in the pet store”. Our approach is to generate a set of candidate sentences via hand-crafted templates and select the best proposal according to a language model.

For each relation $r\in\mathcal{R}$, we hand-craft a set of sentence templates. For example, one template in our experiments for the relation AtLocation is “you are likely to find HEAD in TAIL”. For the above example, this would yield the sentence “You are likely to find ferret in pet store”.

Because these sentences are not always grammatically correct, such as in the above example, we apply a simple set of transformations. These consist of inserting articles before nouns, converting verbs into gerunds, and pluralizing nouns which follow numbers. See the supplementary materials for details and Table 1 for an example. We then enumerate a set of alternative sentences $\mathcal{S}=\{S_{1},\dots,S_{j}\}$ resulting from each template and from all combinations of transformations. This yields a set of candidate sentences for each data point. We then select the candidate sentence with the highest log-likelihood according to a pre-trained unidirectional language model $P_{\text{coh}}$.
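The enumerate-then-rank procedure can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: `article_variants` is a hypothetical stand-in for the full set of grammatical transformations, and `toy_log_likelihood` is a toy scorer that takes the place of the unidirectional language model. Only the two template strings are taken from the templates listed in Appendix B.

```python
from itertools import product
from typing import Callable, Iterable, List

# Two of the AtLocation templates; {0} is the head and {1} is the tail.
TEMPLATES = {
    "AtLocation": [
        "You are likely to find {0} in {1}",
        "Something you find in {1} is {0}",
    ],
}


def article_variants(phrase: str) -> List[str]:
    """Hypothetical stand-in for the grammatical transformations:
    the phrase unchanged, with an indefinite article, and with 'the'."""
    indefinite = "an" if phrase[0].lower() in "aeiou" else "a"
    return [phrase, f"{indefinite} {phrase}", f"the {phrase}"]


def candidate_sentences(head: str, relation: str, tail: str) -> List[str]:
    """Combine every template with every head/tail variant."""
    return [
        template.format(h, t)
        for template in TEMPLATES[relation]
        for h, t in product(article_variants(head), article_variants(tail))
    ]


def coherency_rank(candidates: Iterable[str],
                   log_likelihood: Callable[[str], float]) -> str:
    """Return the candidate with the highest score under the given scorer."""
    return max(candidates, key=log_likelihood)


def toy_log_likelihood(sentence: str) -> float:
    """Toy scorer (NOT a language model): counts a few fluent fragments."""
    fluent = ("find a ferret", "in the pet store")
    return float(sum(fragment in sentence for fragment in fluent))


candidates = candidate_sentences("ferret", "AtLocation", "pet store")
best = coherency_rank(candidates, toy_log_likelihood)
print(best)
```

In practice, `toy_log_likelihood` would be replaced by a function that returns the sentence log-likelihood under a pretrained unidirectional model.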

We refer to this method of generating a sentence from a triple as Coherency Ranking. Coherency Ranking operates under the assumption that natural, grammatical sentences will have a higher likelihood than ungrammatical or unnatural sentences. See an example subset of sentence candidates and their corresponding scores in Table 1. From a qualitative evaluation of the selected sentences, we find that this approach produces sentences of significantly higher quality than those generated by deterministic rules alone. We also perform an ablation study in our experiments demonstrating the effect of each component on CKBC performance.

Scoring Generated Triples

Assuming we have generated a proper sentence from a relational triple, we now need a way to score its validity with a pretrained model that considers the relationship between the relation entities. We therefore propose using the estimated point-wise mutual information (PMI) of the head $\mathbf{h}$ and tail $\mathbf{t}$ of a triple conditioned on the relation $r$, defined as,

$$\text{PMI}(\mathbf{t},\mathbf{h}|r)=\log p(\mathbf{t}|\mathbf{h},r)-\log p(\mathbf{t}|r).$$

We can estimate these scores by using a masked bidirectional language model, $P_{\text{cmp}}$. In the case where the tail is a single word, the model allows us to evaluate the conditional likelihood of a single triple component $p(\mathbf{t}|\mathbf{h},r)$ by computing $P_{\text{cmp}}(w_{i}=\mathbf{t}\mid w_{1:i-1},w_{i+1:m})$ for the tail word.

In practice, the tail might be realized as a $j$-word phrase. To handle this complexity, we use a greedy approximation of its probability. We first mask all of the tail words and compute the probability of each. We then find the word with highest probability $p_{k}$, substitute it back in, and repeat $j$ times. Finally, we calculate the total conditional likelihood of the tail by the product of these terms, $p(\mathbf{t}|\mathbf{h},r)=\prod_{k=1}^{j}p_{k}$.

The marginal $p(\mathbf{t}|r)$ is computed similarly, but in this case we mask the head throughout. For example, to compute the marginal tail probability for the sentence “You are likely to find a ferret in the pet store”, we mask both the head and the tail and then sequentially unmask the tail words only: “You are likely to find a $\kappa_{h1}$ in the $\kappa_{t1}\ \kappa_{t2}$”. If $\kappa_{t2}=\text{``store''}$ has a higher probability than $\kappa_{t1}=\text{``pet''}$, we unmask “store” and compute “You are likely to find a $\kappa_{h1}$ in the $\kappa_{t1}$ store”. The marginal likelihood $p(\mathbf{t}|r)$ is then the product of the two probabilities.
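The greedy estimate of both the conditional and the marginal can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used in the experiments: the masked model is abstracted as a callable with a hypothetical `model(tokens, position, word)` interface, `toy_model` returns made-up probabilities in place of BERT, and "the word with highest probability" is read as the still-masked position whose true word is currently most probable.

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"

# Assumed interface: model(tokens, position, word) returns the probability that
# `word` fills the masked `position`, given the other (possibly masked) tokens.
MaskedLM = Callable[[Sequence[str], int, str], float]


def greedy_log_prob(tokens: Sequence[str], span: Sequence[int],
                    model: MaskedLM, hidden: Sequence[int] = ()) -> float:
    """Greedy log-probability of the words at `span`.

    Positions in `hidden` stay masked throughout, which gives the marginal.
    """
    working = list(tokens)
    for i in list(span) + list(hidden):
        working[i] = MASK
    remaining = list(span)
    total = 0.0
    while remaining:
        # Probability of the true word at every still-masked span position.
        probs = {i: model(working, i, tokens[i]) for i in remaining}
        best = max(probs, key=probs.get)
        total += math.log(probs[best])
        working[best] = tokens[best]  # substitute the most probable word back
        remaining.remove(best)
    return total


def toy_model(tokens: Sequence[str], position: int, word: str) -> float:
    """Toy probabilities (NOT BERT), chosen only to exercise the loop."""
    if word == "store":
        return 0.2
    if "store" not in tokens:
        return 0.1
    return 0.5 if "ferret" in tokens else 0.25


sentence = "you are likely to find a ferret in the pet store".split()
head, tail = [6], [9, 10]
log_p_conditional = greedy_log_prob(sentence, tail, toy_model)
log_p_marginal = greedy_log_prob(sentence, tail, toy_model, hidden=head)
```

With these toy numbers, "store" is unmasked first in both cases, and hiding the head lowers the probability of "pet" in the second step.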

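The final score combines the marginal and conditional likelihoods by employing a weighted form of the point-wise mutual information,

$$\text{PMI}_{\lambda}(\mathbf{t},\mathbf{h}|r)=\lambda\log p(\mathbf{t}|\mathbf{h},r)-\log p(\mathbf{t}|r),$$
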
where $\lambda$ is treated as a hyperparameter. Although exact PMI is symmetrical, the approximate model itself is not. We therefore average $\text{PMI}_{\lambda}(\mathbf{t},\mathbf{h}|r)$ and $\text{PMI}_{\lambda}(\mathbf{h},\mathbf{t}|r)$ to reduce the variance of our estimates, computing the masked head values rather than the tail values in the latter.
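As a small worked sketch of the final score, the following Python combines the four log-likelihoods into the averaged weighted PMI. It assumes the weighted form $\lambda\log p(\mathbf{t}|\mathbf{h},r)-\log p(\mathbf{t}|r)$ and its head-side counterpart. The function names and the input probabilities are illustrative, not taken from the experiments.

```python
import math


def weighted_pmi(log_p_conditional: float, log_p_marginal: float,
                 lam: float) -> float:
    """Weighted PMI: lam * log p(a | b, r) - log p(a | r) (assumed form)."""
    return lam * log_p_conditional - log_p_marginal


def symmetric_score(log_p_t_given_h: float, log_p_t: float,
                    log_p_h_given_t: float, log_p_h: float,
                    lam: float) -> float:
    """Average of the tail-side and head-side weighted PMI estimates."""
    tail_side = weighted_pmi(log_p_t_given_h, log_p_t, lam)
    head_side = weighted_pmi(log_p_h_given_t, log_p_h, lam)
    return 0.5 * (tail_side + head_side)


# Made-up probabilities for illustration only.
score = symmetric_score(
    log_p_t_given_h=math.log(0.1), log_p_t=math.log(0.05),
    log_p_h_given_t=math.log(0.2), log_p_h=math.log(0.1),
    lam=1.0,
)
```

With $\lambda=1$ this reduces to the average of two ordinary PMI estimates. Larger values of $\lambda$ place more weight on the conditional likelihood.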

Experiments

To evaluate the Coherency Ranking approach we measure whether it can distinguish between valid and invalid triples. For our masked model, we use BERT-large Devlin et al. (2018). For sentence ranking, we use the GPT-2 117M LM Radford et al. (2019). The relation templates and grammar transformation rules which we use can be found in the supplementary materials.

We compare the proposed method to several baselines. Following Trinh and Le (2018), we evaluate a simple Concatenation method for generating sentences, splitting the relation $r$ into separate words and concatenating it with the head and tail. For the triple (ferret, AtLocation, pet store), the Concatenation approach would yield “ferret at location pet store”.
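The Concatenation baseline is simple enough to sketch in a few lines of Python. The function name and the regular expression used to split the camel-case relation name are assumptions made for illustration.

```python
import re


def concatenate_triple(head: str, relation: str, tail: str) -> str:
    """Split a camel-case relation name into lowercase words and join
    head, relation words, and tail with spaces."""
    # Runs of capitals (e.g. "URL") stay together; otherwise split at capitals.
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+", relation)
    return " ".join([head, *(word.lower() for word in words), tail])


sentence = concatenate_triple("ferret", "AtLocation", "pet store")
print(sentence)  # ferret at location pet store
```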

We also evaluate CKBC performance when we construct sentences by applying a single hand-crafted template. Since each triple is mapped to a sentence with a single template without any grammatical transformations, we refer to this as the Template method. Using the Template approach, (ferret, AtLocation, pet store) would become “You are likely to find ferret in pet store” using the template “you are likely to find HEAD in TAIL”.

Next, we extend the Template method by applying deterministic grammatical transformations, which we refer to as the Template + Grammar approach. Like the full approach, these transformations involve adding articles before nouns, converting verbs into gerunds, and pluralizing nouns following numbers. The Template + Grammar approach differs from Coherency Ranking in that all transformations are applied to every sentence instead of applying combinations of transformations and templates, which are then ranked by a language model. Returning to our example, the Template + Grammar method produces “You are likely to find a ferret in a pet store”. While this sentence is grammatical, applying this method to (star, AtLocation, outer space) yields “You are likely to find a star in an outer space”, which is incorrect.

We compare our results to the supervised models from the work of Jastrzębski et al. (2018) and the best performing model from Li et al. (2016). Jastrzębski et al. (2018) introduce Factorized and Prototypical models. The Factorized model embeds the head, relation, and tail in a vector space and then produces a score by taking a linear combination of the inner products between each pair of embeddings. The Prototypical model is similar, but does not include the inner product between head and tail. Li et al. (2016) evaluate a deep neural network (DNN) for CKBC. They concatenate embeddings for the head, relation, and tail, which they then feed through a multilayer perceptron with one hidden layer. All three models are trained on 100,000 ConceptNet triples.

Our experimental setup follows Li et al. (2016), evaluating our model with their test set (n = 2400) containing an equal number of valid and invalid triples. The valid triples are from the crowd-sourced Open Mind Common Sense (OMCS) entries in the ConceptNet 5 dataset Speer and Havasi (2012). Invalid triples are generated by replacing an element of a valid tuple with another randomly selected element.

We use our scoring method to classify each tuple as valid or invalid. To this end, we use our method to assign a score to each tuple and then group the resulting scores into two clusters. Instances in the cluster with the higher mean PMI are labeled as valid, and the remainder are labeled as invalid. We use expectation-maximization with a mixture of Gaussians to cluster. We also tune the PMI weight via grid search over 90 points from $\lambda\in[0.5,5.0]$, using the Akaike information criterion of the Gaussian mixture model for evaluation Akaike (1974).
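The clustering step can be sketched with a small, dependency-free EM routine for a two-component, one-dimensional Gaussian mixture. This is an illustrative sketch, not the code used in the experiments: the initialization, the fixed iteration count, the variance floor, and the example scores are all assumptions. The AIC is computed with five free parameters (two means, two variances, one mixing weight).

```python
import math
from typing import List, Sequence, Tuple

Params = Tuple[List[float], List[float], List[float]]  # weights, means, variances


def normal_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(xs: Sequence[float], iters: int = 100) -> Params:
    """Fit a two-component 1-D Gaussian mixture with plain EM."""
    grand_mean = sum(xs) / len(xs)
    spread = sum((x - grand_mean) ** 2 for x in xs) / len(xs) or 1.0
    weights, means, variances = [0.5, 0.5], [min(xs), max(xs)], [spread, spread]
    for _ in range(iters):
        # E-step: responsibility of each component for each score.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
            z = sum(p) or 1e-300
            resp.append([pk / z for pk in p])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(xs)
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a degenerate component
    return weights, means, variances


def label_valid(xs: Sequence[float], params: Params) -> List[bool]:
    """True for scores assigned to the component with the higher mean."""
    weights, means, variances = params
    hi = 0 if means[0] > means[1] else 1
    labels = []
    for x in xs:
        p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)]
        labels.append(p[hi] >= p[1 - hi])
    return labels


def aic(xs: Sequence[float], params: Params) -> float:
    """Akaike information criterion: 2 * (number of parameters) - 2 * log-likelihood."""
    weights, means, variances = params
    log_lik = sum(
        math.log(sum(weights[k] * normal_pdf(x, means[k], variances[k]) for k in range(2)))
        for x in xs
    )
    return 2 * 5 - 2 * log_lik


# Made-up PMI scores: five low (invalid-looking) and five high (valid-looking).
scores = [-2.1, -1.9, -2.0, -1.8, -2.2, 3.0, 3.2, 2.9, 3.1, 2.8]
params = fit_two_gaussians(scores)
labels = label_valid(scores, params)
```

For the grid search, one would recompute the scores for each candidate value of $\lambda$, refit the mixture, and compare the resulting `aic` values.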

Table 2 shows the full results. Our unsupervised approach achieves a test set F1 score of 78.8, comparable to the 79.4 F1 score found by the supervised Prototypical approach. The Factorized and DNN models significantly outperformed our approach with F1 scores of 89.2 and 89.0, respectively. Our grid search found an optimal $\lambda$ value of 1.65 for the Concatenation sentence generation model and 1.55 for the Coherency Ranking model. The Template and Template + Grammar methods found $\lambda$ values of 1.20 and 0.95, respectively.

Task 2: Mining Wikipedia

To assess the model’s ability to generalize to unseen data, we evaluate our unsupervised model in comparison to previous supervised methods on the task of mining commonsense knowledge from Wikipedia. In their evaluations, Li et al. (2016) curate a set of 1.7M triples across 10 relations by applying part-of-speech patterns to Wikipedia articles. We sample 300 triples from each relation. We apply our method to evaluate these 3000 triples. Using the approach described by Speer and Havasi (2012), and followed by Li et al. (2016) and Jastrzębski et al. (2018), two human annotators manually rate the 100 triples with the highest predicted score on a 0 to 4 scale: 0 (Doesn’t make sense), 1 (Not true), 2 (Opinion/Don’t know), 3 (Sometimes true), and 4 (Generally true). We tuned $\lambda$ by measuring the quality of the 100 triples with the highest predicted score across $\lambda\in\{1,2,\dots,9,10\}$.

The top 100 triples selected by our model were assigned a mean rating of 3.00 ($\lambda=4$) with a standard error of 0.11 under the Coherency Ranking approach, well exceeding the performance of current supervised methods (Table 2). Standard errors were calculated using 1000 bootstrap samples of the top 100 triples. The ratings assigned by the two human annotators had a 0.50 Pearson correlation and 0.23 kappa inter-annotator agreement. Rater disagreements occur most frequently when triples are ambiguous or difficult to interpret. Notably, if we bucket the five scores into just two categories of true and false, this disagreement rate drops by 50%. To give a sense of the types of commonsense knowledge our models struggle to capture, we report the top 100 most confident predictions that receive an average score below 3 in the supplementary material. Notably, some of the top 100 triples our model identified were indeed true, but would not be reasonably considered common sense (e.g. (vector bundle, HasProperty, manifold)). This suggests that our approach may be applicable to mining knowledge beyond common sense.
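A bootstrap standard error of the mean rating can be sketched as follows. The ratings below are hypothetical placeholders rather than the annotations collected for this task, and the function name and fixed seed are assumptions made for illustration.

```python
import random
import statistics
from typing import Sequence


def bootstrap_standard_error(ratings: Sequence[float],
                             n_resamples: int = 1000,
                             seed: int = 0) -> float:
    """Standard deviation of the mean over resamples drawn with replacement."""
    rng = random.Random(seed)
    resampled_means = [
        statistics.fmean(rng.choices(ratings, k=len(ratings)))
        for _ in range(n_resamples)
    ]
    return statistics.stdev(resampled_means)


# Hypothetical ratings on the 0 to 4 scale for 100 triples (NOT the real data).
ratings = [4] * 50 + [3] * 20 + [2] * 10 + [1] * 10 + [0] * 10
standard_error = bootstrap_standard_error(ratings)
```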

Analysis: Sentence Generation

In order to measure the impact of sentence generation on our model, we select a sample of 100 sentences and group the results by a) whether the sentence contained a grammatical error, and b) whether the sentence misrepresented the meaning of the triple. For example, the triple (golf, HasProperty, good) yields the sentence “golf is a good”, which is grammatically correct but conveys the wrong meaning. On both Wikipedia mining and CKBC, we find that misrepresenting meaning has an adverse impact on model performance. In CKBC, we also find that grammar has a high impact on the resulting F1 scores (Table 3). Future work could therefore focus on designing templates that more reliably encode a relation’s true meaning.

Conclusion

We introduce a robust unsupervised method for commonsense knowledge base completion using the world knowledge of pre-trained language models. We develop a method for expressing knowledge triples as sentences. Using a bidirectional masked language model on these sentences, we can then estimate the weighted point-wise mutual information of a triple as a proxy for its validity. Though our approach performs worse on a held-out test set developed by Li et al. (2016), it does so without any previous exposure to the ConceptNet database, ensuring that this performance is not biased. In the future, we hope to explore whether this approach can be extended to mining facts that are not commonsense and to generating new commonsense knowledge outside of any given database of candidate triples. We also see potential benefit in the development of a more expansive set of evaluation methods for commonsense knowledge mining, which would strengthen the validity of our conclusions.

Acknowledgments

This work was supported by NSF research award 1845664.

References

Appendix A Grammatical Transformations

In our experiments, we apply the following transformations to the head h\mathbf{h} and tail t\mathbf{t} of each relational triple before injecting them into the template.

If the first word is a noun or adjective, or if the first word is a verb and the second word is a noun or adjective, prepend an indefinite or definite article.

If the first word is an infinitive verb, convert it to a gerund (e.g. “jump” → “jumping”).

If the first word is a number, pluralize the following word (e.g. “two leg” → “two legs”).

We use the default settings in the spaCy Python library (https://spacy.io/) for identifying the part of speech. We also use pattern (https://www.clips.uantwerpen.be/pages/pattern) for conjugation and pluralization.
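The three rules above can be illustrated with a simplified Python sketch. To keep it self-contained, it does not call spaCy or pattern: part-of-speech tags are supplied by hand, the `VERB` tag stands in for the infinitive check, and `naive_gerund` and `naive_plural` are crude hypothetical stand-ins for proper conjugation and pluralization. The sketch returns every candidate surface form for a phrase rather than a single output.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (word, coarse tag: "NOUN", "ADJ", "VERB", "NUM", ...)


def naive_gerund(verb: str) -> str:
    """Crude stand-in for a conjugation library (no consonant doubling)."""
    if verb.endswith("e") and not verb.endswith("ee"):
        return verb[:-1] + "ing"
    return verb + "ing"


def naive_plural(noun: str) -> str:
    """Crude stand-in for a pluralization library."""
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    return noun + "s"


def transform(phrase: Tagged) -> List[str]:
    """Return the original phrase followed by each applicable transformation."""
    words = [word for word, _ in phrase]
    tags = [tag for _, tag in phrase]
    text = " ".join(words)
    candidates = [text]
    nominal = ("NOUN", "ADJ")
    # Rule 1: prepend an indefinite or definite article.
    if tags[0] in nominal or (
        tags[0] == "VERB" and len(tags) > 1 and tags[1] in nominal
    ):
        indefinite = "an" if text[0].lower() in "aeiou" else "a"
        candidates += [f"{indefinite} {text}", f"the {text}"]
    # Rule 2: convert a leading verb to a gerund.
    if tags[0] == "VERB":
        candidates.append(" ".join([naive_gerund(words[0]), *words[1:]]))
    # Rule 3: pluralize the word that follows a leading number.
    if tags[0] == "NUM" and len(words) > 1:
        candidates.append(" ".join([words[0], naive_plural(words[1]), *words[2:]]))
    return candidates


print(transform([("jump", "VERB")]))
print(transform([("two", "NUM"), ("leg", "NOUN")]))
print(transform([("ferret", "NOUN")]))
```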

Appendix B Hand Crafted Templates

We use the following hand-crafted templates for relations in the ConceptNet database. Each relation is mapped to a list of several templates. Here, {0} refers to the head entity and {1} refers to the tail.

"RelatedTo": [ "{0} is like {1}", "{1} is related to {0}", "{0} is related to {1}" ],
"ExternalURL": [ "{0} is described at the following URL {1}" ],
"FormOf": [ "{0} is a form of the word {1}" ],
"IsA": [ "{0} is {1}", "{0} is a type of {1}", "{0} are {1}", "{0} is a kind of {1}", "{0} is a {1}" ],
"NotIsA": [ "{0} is not {1}", "{0} is not a type of {1}", "{0} are not {1}", "{0} is not a kind of {1}", "{0} is not a {1}" ],
"PartOf": [ "{1} has {0}", "{0} is part of {1}", "{0} is a part of {1}" ],
"HasA": [ "{0} has {1}", "{0} contains {1}", "{0} have {1}" ],
"UsedFor": [ "{0} is used for {1}", "{0} is for {1}", "You can use {0} to {1}", "You can use {0} for {1}", "{0} are used to {1}", "{0} is used to {1}", "{0} can be used to {1}", "{0} can be used for {1}" ],
"CapableOf": [ "{0} can {1}", "An activity {0} can do is {1}", "{0} sometimes {1}", "{0} often {1}" ],
"AtLocation": [ "You are likely to find {0} in {1}", "You are likely to find {0} at {1}", "Something you find on {1} is {0}", "Something you find in {1} is {0}", "Something you find at {1} is {0}", "Somewhere {0} can be is {1}", "Something you find under {1} is {0}" ],
"Causes": [ "Sometimes {0} causes {1}", "Something that might happen as a consequence of {0} is {1}", "Sometimes {0} causes you to {1}", "The effect of {0} is {1}" ],
"HasSubevent": [ "Something you might do while {0} is {1}", "One of the things you do when you {0} is {1}", "Something that might happen while {0} is {1}", "Something that might happen when you {0} is {1}", "One of the things you do when you {1} is {0}", "Something that might happen when you {1} is {0}" ],
"HasFirstSubevent": [ "the first thing you do when you {0} is {1}" ],
"HasLastSubevent": [ "the last thing you do when you {0} is {1}" ],
"HasPrerequisite": [ "something you need to do before you {0} is {1}", "If you want to {0} then you should {1}", "{0} requires {1}" ],
"HasProperty": [ "{0} is {1}", "{0} are {1}", "{0} can be {1}" ],
"MotivatedByGoal": [ "You would {0} because you want to {1}", "You would {0} because you want {1}", "You would {0} because {1}" ],
"ObstructedBy": [ "{0} can be prevented by {1}" ],
"Desires": [ "{0} wants {1}", "{0} wants to {1}", "{0} like to {1}" ],
"CreatedBy": [ "{0} is created by {1}" ],
"Synonyms": [ "{0} and {1} are have similar meanings", "{0} and {1} are similar" ],
"Antonym": [ "{0} is the opposite of {1}" ],
"DistinctFrom": [ "it cannot be both {0} and {1}" ],
"DerivedFrom": [ "the word {0} is derived from the word {1}" ],
"SymbolOf": [ "{0} is a symbol of {1}" ],
"DefinedAs": [ "{0} is defined as {1}", "{0} is the {1}" ],
"Entails": [ "if {0} is happening, {1} is also happening" ],
"MannerOf": [ "{0} is a specific way of doing {1}" ],
"LocatedNear": [ "{0} is located near {1}" ],
"dbpedia": [ "{0} is conceptually related to {1}" ],
"SimlarTo": [ "{0} is similar to {1}" ],
"EtymologicallyRelatedTo": [ "the word {0} and the word {1} have the same origin" ],
"EtymologicallyDerivedFrom": [ "the word {0} comes from the word {1}" ],
"CausesDesire": [ "{0} makes people want {1}", "{0} would make you want to {1}" ],
"MadeOf": [ "{0} is made of {1}", "{0} can be made of {1}", "{0} are made of {1}" ],
"ReceivesAction": [ "{0} can be {1} ", "{0} is something that you can {1}", "{0} can receive {1}" ],
"InstanceOf": [ "{0} is an example of {1}" ],
"NotDesires": [ "{0} does not want {1}", "{0} doesn’t want to {1}", "{0} doesn’t want {1}" ],
"NotUsedFor": [ "{0} is not used for {1}" ],
"NotCapableOf": [ "{0} is not capable of {1}", "{0} do not {1}" ],
"NotHasProperty": [ "{0} does not have the property of {1}" ],
"NotMadeOf": [ "{0} is not made of {1}" ]

Appendix C Most Confident Mistakes