COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

Jena D. Hwang, Chandra Bhagavatula, Ronan Le Bras, Jeff Da, Keisuke Sakaguchi, Antoine Bosselut, Yejin Choi

Introduction

Commonsense understanding and reasoning remain long-standing challenges in general artificial intelligence. However, large-scale language models have brought tremendous progress in the sub-field of natural language processing. Such large-scale language models (Radford et al. 2018; Devlin et al. 2019; Brown et al. 2020) trained on extreme-scale data have been shown to effectively adapt to diverse downstream tasks, achieving significant performance gains across natural language benchmarks (Wang et al. 2019). Interestingly, as these models have grown larger (and trained on larger amounts of data), their benchmark performance has continued to improve (Raffel et al. 2019) despite limited conceptual improvements, leaving open questions regarding the source of these remarkable generalization properties.

Recent work has hypothesized that many of these performance gains could be a result of language models being able to memorize facts in their parameters during training (Roberts, Raffel, and Shazeer 2020) that can be leveraged at evaluation time. As a result, a new paradigm of language models as knowledge bases has emerged (Petroni et al. 2019). In this setting, language models are prompted with natural language prefixes or questions, and they express knowledge through language generation. The initial success of this paradigm for representing commonsense knowledge (Davison, Feldman, and Rush 2019; Tamborrino et al. 2020) has led to the optimistic claim that language models comprehensively encode commonsense knowledge, and remove the need for structured knowledge resources.

We take a more skeptical view of this capacity of language models – Does scaling up language models actually endow them with commonsense knowledge? While language models can successfully express certain types of knowledge, their best results are observed in narrowly specific conditions – we show (cf. §5) that they perform better when evaluated on knowledge bases that prioritize ontological relations and whose examples resemble language-like assertions (e.g., mango IsA fruit). This observation is supported by Brown et al. (2020)’s GPT-3 model, whose best few-shot performance on commonsense knowledge benchmarks comes on the PhysicalIQA (Bisk et al. 2020) and HellaSwag (Zellers et al. 2019) datasets. Consequently, the types of knowledge that can be directly accessed through the language model’s interface remain limited.

However, prior work has also shown that training language models on knowledge graph tuples leads them to learn to express their implicit knowledge directly (Bosselut et al. 2019), allowing them to provide commonsense knowledge on-demand. These adapted knowledge models have exhibited promising results on commonsense benchmarks compared with methods that require linking entities to knowledge graphs (Shwartz et al. 2020; Liu et al. 2020). Inspired by these successes, we propose a dual use for commonsense knowledge bases going forward: as static graphs that can be linked to for discrete knowledge access, and as resources for adapting language models to hypothesize commonsense knowledge about un-annotated entities and events.

With this second purpose in mind, we propose evaluating commonsense knowledge resources based on the complementary information they can bring to pretrained language models. We construct Atomic ${}^{20}_{20}$ , a new, high-quality knowledge graph with $1.33$ M commonsense knowledge tuples across $23$ commonsense relations. We compare Atomic ${}^{20}_{20}$ with respect to its coverage and accuracy in competition with other highly used CSKGs, such as ConceptNet (Speer, Chin, and Havasi 2017). Our results show that Atomic ${}^{20}_{20}$ is able to cover more correct facts about more diverse types of commonsense knowledge than any existing, publicly-available commonsense knowledge resource. However, our results also indicate that there remains a large amount of exclusivity between these KGs, highlighting the challenge of creating resources that cover the scale and diversity of general commonsense knowledge.

Furthermore, we formalize the COMET framework of Bosselut et al. (2019) across different seed language models and training knowledge graphs, and evaluate the commonsense knowledge hypothesized by these adapted knowledge models. Our empirical study yields two promising conclusions. First, it confirms that KG-adapted language models learn to express knowledge more precisely than naive language models trained only on language. And second, we show that Atomic ${}^{20}_{20}$ as a transfer resource leads to COMET models that achieve the largest increase over their seed language model (across all seed LMs) for the commonsense knowledge types it covers, validating the importance of constructing knowledge resources with examples of knowledge not readily found in language models.

Key Contributions: In summary, we make three key contributions in this paper. We present Atomic ${}^{20}_{20}$ —a new commonsense knowledge graph covering social, physical, and eventive aspects of everyday inferential knowledge (cf. §3). Next, we compare Atomic ${}^{20}_{20}$ with other prominent CSKBs head-to-head and show that our new symbolic knowledge graph is more accurate than any current CSKB (see Table 2) (cf. §4). Finally, we show that our new neural knowledge model COMET-Atomic ${}^{20}_{20}$ successfully transfers Atomic ${}^{20}_{20}$ ’s declarative knowledge to beat GPT-3, the largest pre-trained language model, in spite of using 400x fewer parameters (see Table 6) (cf. §5). This demonstrates the utility and importance of high-quality symbolic knowledge provided by Atomic ${}^{20}_{20}$ to generalize on commonsense information that LMs cannot expressively capture on their own (cf. §6).

Background

Large scale commonsense knowledge graphs are ubiquitous tools in natural language processing tasks as access to their facts allows models to learn to reason over commonsense knowledge to make predictions (Lin et al. 2019; Feng et al. 2020). In this work, we evaluate three existing knowledge graphs, ConceptNet, Atomic, and TransOMCS, on their coverage and precision relative to our new resource Atomic ${}^{20}_{20}$. We were unable to include Cyc (Lenat 1995) in our study due to the discontinuation of its research license and the cost of the commercial license (over $\$1$M). ConceptNet includes a subset of Cyc – OpenCyc.

The ConceptNet (v5.7) knowledge graph (Speer, Chin, and Havasi 2017) consists of 36 relations focusing mostly on taxonomic and lexical knowledge (e.g., RelatedTo, Synonym, IsA) and physical commonsense knowledge (e.g., MadeOf, PartOf). ConceptNet (v5.7) contains 3.4M entity-relation tuples (in English) collected by crowdsourcing and merged with existing knowledge databases from DBPedia, WordNet, Wiktionary, and OpenCyc. Since the knowledge is derived from human efforts, the accuracy of ConceptNet (v5.7) knowledge is fairly high, though the quality does vary depending on the sources of knowledge and relation types. However, as highlighted by Davis and Marcus (2015) and Sap et al. (2019), and shown in Figure 2, the coverage of ConceptNet (v5.7) is limited to mostly taxonomic, lexical, and object-centric physical commonsense knowledge. In fact, out of 3.4M tuples, 90% correspond to taxonomic (e.g., IsA) or lexical (e.g., Synonym, RelatedTo) knowledge, making the commonsense portion of ConceptNet (v5.7) relatively small.

The Atomic (Sap et al. 2019) knowledge graph consists of 880K tuples across 9 relations that cover social commonsense knowledge (e.g., X gets X’s car repaired xIntent to maintain the car), including dynamic aspects of events such as causes and effects, if-then conditional statements, and mental states. The Atomic dataset is collected and validated completely through crowdsourcing.

The TransOMCS (Zhang et al. 2020a) knowledge graph consists of 18.48M tuples that were automatically converted from syntactic parses of sentences from various web sources including Wikipedia, Yelp, and Reddit. The set of relations used for the mapping is copied from ConceptNet. Although TransOMCS is much larger than other commonsense knowledge graphs, the precision of the extracted knowledge is significantly lower than that of other resources (cf. §4), and TransOMCS performs poorly as an adaptation resource relative to other KGs (cf. §5).

For this work we have selected three large scale CSKGs that retain a closed class of relational types that are comparable to one another. Other commonsense KBs in existence such as Quasimodo (Romero et al. 2019) provide a wider variety of fine-grained relations.

Language Models as Knowledge Bases. Recent work hypothesizes that pretrained language models represent commonsense knowledge implicitly (Petroni et al. 2019; Roberts, Raffel, and Shazeer 2020). However, the results motivating these observations are often limited to narrowly scoped subsets of commonsense knowledge that primarily include taxonomic knowledge (e.g., mango IsA fruit) and that are often found explicitly stated in text. Commonsense facts, in contrast, are often implied (Gordon and Van Durme 2013), and as will be seen in our studies (cf. §4), state of the art neural models struggle to express implicit commonsense knowledge that involves complex relationships.

To overcome this limitation, Bosselut et al. (2019) take the best of both worlds between commonsense knowledge graphs and pretrained language models. The commonsense transformer, or COMET, adapts pretrained neural language models by training on example tuples from commonsense knowledge graphs. It takes a head/source phrase and a relation (e.g., take a nap Causes) and generates the tail/target phrase (e.g., have energy). Bosselut et al. (2019) show that COMET trained on the ConceptNet and Atomic knowledge graphs is able to adapt to generate novel (and valid) commonsense knowledge tuples.

Importantly, these neural knowledge models can produce commonsense knowledge on-demand for any head entity that can be expressed through language. This flexibility allows them to be used out-of-the-box, and they have been applied to new, previously unexplored tasks, such as sarcastic comment generation (Chakrabarty et al. 2020), therapy chatbots (Kearns et al. 2020), and automated story plot generation (Ammanabrolu et al. 2020). These contributions show that progress on knowledge models opens up new downstream applications that were challenging to model before.

We present Atomic ${}^{20}_{20}$ , a commonsense knowledge graph with $1.33$ M everyday inferential knowledge tuples about entities and events. Atomic ${}^{20}_{20}$ represents a large-scale commonsense repository of textual descriptions that encode both the social and the physical aspects of common human everyday experiences, collected with the aim of being complementary to commonsense knowledge encoded in current language models. Atomic ${}^{20}_{20}$ introduces $23$ commonsense relation types. They can be broadly classified into three categories: $9$ social-interaction commonsense relations, $7$ physical-entity commonsense relations, and $7$ event-centered commonsense relations concerning situations surrounding a given event of interest. The full inventory of Atomic ${}^{20}_{20}$ relations is listed in Table 1.

In terms of physical and event-centered commonsense, by far the two largest new relations in Atomic ${}^{20}_{20}$ are ObjectUse and HinderedBy. For ObjectUse, we focused on affordances of everyday objects such as “popcorn bucket” that may be used for “holding popcorn” or “storing things”. For HinderedBy, we explore the notion that many events in the real world can be defeasible (Lascarides and Asher 1991) by collecting hindrances to goals that may be useful for tasks such as counterfactual reasoning. For example, X’s desire to adopt a cat may be hindered by finding out that X is allergic to cats, which would require X to adjust future actions accordingly (say, opt for hypoallergenic options like tortoises).

In the case of ObjectUse, we collected over 130K everyday object-use pairs by asking crowdworkers for necessary objects and their uses for each event in Atomic ${}^{20}_{20}$ . For example, given “X eats popcorn” we elicited items such as “popcorn bucket” with their various expected uses. The number also reflects atypical usages gathered in a separate pass where workers were asked to provide creative or resourceful but feasible uses of the objects. Given “popcorn bucket”, for instance, one might “wear it as a hat” for, say, a costume party. For HinderedBy, we crowdsourced over 100K tuples of hindrances to existing Atomic ${}^{20}_{20}$ events, asking the workers to provide situations or events that might act as a deterrent should the event be considered an achievable goal (see Appendix A for further details). For social-interaction commonsense, we primarily incorporated tuples from Atomic, but also crowdsourced an additional 34K tuples using the same approach as Sap et al. (2019).

Atomic ${}^{20}_{20}$ also pulls commonsense tuples from the English subset of ConceptNet (v5.7) (latest version available; Speer, Chin, and Havasi 2017). A ConceptNet (v5.7) fact is considered English if both the head and tail concepts are marked with ‘/en/’ in the edge id. Of the $3.4$ M English tuples in ConceptNet (v5.7), a small subset of 172K tuples was selectively chosen to be integrated into Atomic ${}^{20}_{20}$ via elimination and crowdsourcing. This subset represents data carefully identified to reflect commonsense information dealing with qualitative human experiences. Among the eliminated data are tuples with edge weight $\leq 0.5$, dictionary or etymologically based knowledge (e.g., synonyms/antonyms, inflections), hyper/hyponymic lexical relationships such as IsA or InstanceOf, and relations based on lexical co-occurrence (e.g., RelatedTo or LocatedNear), which are easily recoverable from language models. ConceptNet 5.7 defines weight as “the strength with which this edge expresses this assertion”. A pilot crowdsource assessment step found any tuple with weight $\leq 0.5$ unreliable w.r.t. its validity. After selective removal of these relations and a post-processing step to ensure the removal of deterministic information such as geographic facts (e.g., “shenzhen” AtLocation “china”), tuples from each ConceptNet relation were examined for further splits or joins to align with the existing structure of Atomic ${}^{20}_{20}$. A random 10% of tuples from each selected relation were then put through crowdsourced validity testing (akin to the process described later in §4). Tuples that were directly incorporated without further edits passed with an acceptance rate of 93% or higher. A subset of relations (i.e., CapableOf, HasProperty, MotivatedByGoal) were put through additional crowdsourcing to weed out tuples that were either invalid or found to hold prejudiced descriptions of human entities.
In the end, only 5 relations (marked with an asterisk in Table 1) retain ConceptNet’s original meaning, with a few relations that are cognates in Atomic ${}^{20}_{20}$ (more details in Appendix A).
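The elimination criteria above can be sketched as a simple filter. This is a minimal illustration, not the authors' pipeline: the `Edge` record and its field names are hypothetical, the exclusion set lists only the relations named in the text, and counting `/en/` occurrences is an assumed approximation of "both head and tail are marked English".

```python
from dataclasses import dataclass

# Relations the text names as eliminated; the complete list is not given,
# so this set is illustrative only.
EXCLUDED_RELATIONS = {
    "Synonym", "Antonym", "IsA", "InstanceOf", "RelatedTo", "LocatedNear",
}


@dataclass(frozen=True)
class Edge:
    """Hypothetical record for one ConceptNet edge."""
    edge_id: str
    relation: str
    head: str
    tail: str
    weight: float


def is_english(edge: Edge) -> bool:
    # Assumption: an English head and an English tail each contribute
    # one '/en/' marker to the edge id.
    return edge.edge_id.count("/en/") >= 2


def keep(edge: Edge, min_weight: float = 0.5) -> bool:
    """Return True if the edge survives the elimination step."""
    return (
        is_english(edge)
        and edge.weight > min_weight  # weight <= 0.5 is eliminated
        and edge.relation not in EXCLUDED_RELATIONS
    )
```

The later steps described in the text (geographic-fact removal, relation splits and joins, and crowdsourced validation) require human judgment and are not modeled here.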

Symbolic Knowledge Graph Comparison

In this work, we compare our new Atomic ${}^{20}_{20}$ knowledge graph to three other prominent CSKGs: Atomic (Sap et al. 2019), ConceptNet (Li et al. 2016), and TransOMCS (Zhang et al. 2020a). Hereafter, as we focus on CSKGs, by ConceptNet we refer to the commonsense subset, unless specified otherwise. We measure the accuracy of tuples in each KG and compare the coverage of each CSKG w.r.t. other CSKGs head-to-head.

In order to assess the accuracy of the knowledge represented, 3K random instances were extracted from each of the knowledge graphs for a crowdsourced evaluation of the tuples.

Qualifying Crowdsource Workers. The evaluation was carried out through crowdsourcing on the Amazon Mechanical Turk platform. To ensure high-quality annotations, we qualified a pool of 173 workers through a paid qualification task that tested their ability to follow directions and provide reasonable answers to the qualification test. The qualification test contained 6 manually selected tuples from Atomic and ConceptNet, including both easy and tricky relations to annotate. A worker was qualified if they provided $100$% acceptable answers. Workers providing $5$ of $6$ correct answers were also accepted only when they provided a reasonable written substantiation for their incorrect choice. Workers were paid an average of $\$15$ per hour for their evaluations.

Human Evaluation Setup. Workers were presented with knowledge tuples in the form of (head, relation, tail) for annotation. To expedite the human assessment of the tuples, each relation (e.g., xWant or AtLocation) was translated into a human-friendly natural language form (e.g., “as a result, PersonX wants” and “located or found at/in/on”, respectively; cf. Appendix B). The workers were asked to rate the tuples along a 4-point Likert scale: always/often – the knowledge assertion presented is always or often true, sometimes/likely – it is sometimes or likely true, farfetched/never – it is false or farfetched at best, and invalid – it is invalid or makes no sense. Any tuple receiving one of the former two labels is counted as Accept and one of the latter two as Reject. The workers were also given a choice to opt out of assessment if the concepts were too unfamiliar for a fair evaluation (No Judgment). Each task (HIT) included 5 tuples of the same relation type, and each tuple was labeled by 3 workers. For the results, we take the majority vote among the 3 workers.
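The label aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the lowercase label strings and function names are hypothetical, and returning `None` for a three-way split is our own choice, since the text does not say how ties are resolved.

```python
from collections import Counter

ACCEPT_LABELS = {"always/often", "sometimes/likely"}
REJECT_LABELS = {"farfetched/never", "invalid"}


def to_verdict(label: str) -> str:
    """Collapse a 4-point Likert label (or an opt-out) into a verdict."""
    if label in ACCEPT_LABELS:
        return "Accept"
    if label in REJECT_LABELS:
        return "Reject"
    if label == "no judgment":
        return "No Judgment"
    raise ValueError(f"unknown label: {label!r}")


def majority_verdict(labels):
    """Majority vote over the workers' verdicts for one tuple.

    Returns None when no verdict has a strict majority (assumption:
    the source does not describe tie handling).
    """
    verdicts = [to_verdict(label) for label in labels]
    verdict, count = Counter(verdicts).most_common(1)[0]
    return verdict if count > len(verdicts) / 2 else None
```

For example, `majority_verdict(["always/often", "sometimes/likely", "invalid"])` yields `"Accept"`, because two of the three ratings fall in the Accept group.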

Results. Atomic ${}^{20}_{20}$ outperforms other KGs in crowdsourced accuracy as shown in Table 2. Overall inter-rater agreement, measured by Fleiss’ $\kappa$, is 0.46 (moderate agreement; Fleiss 1971). Atomic ties with ConceptNet with reasonably high accuracy, while TransOMCS lags behind others with far lower accuracy. We provide a per-relation breakdown of accuracies in Table 3.
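For readers unfamiliar with the agreement statistic, Fleiss’ $\kappa$ can be computed from a table of per-item category counts. The sketch below implements the standard formula; the input layout and function name are our own, and the text does not specify which label categories were used in the reported computation.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    Undefined (division by zero) if all ratings fall in one category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all ratings that fall into each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per item, then averaged over items.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With three raters and two categories, a table such as `[[3, 0], [0, 3], [2, 1], [1, 2]]` mixes unanimous and split items, which lowers the observed agreement relative to a fully unanimous table.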

Between Atomic ${}^{20}_{20}$ and Atomic, the variations in the assessed accuracies are not found to be statistically significant. Among the Atomic ${}^{20}_{20}$ and ConceptNet relations that represent exact matches (marked with * in Table 3), the differences are either not statistically significant or, when they are, Atomic ${}^{20}_{20}$ improves upon the associated facts, reflecting that the preprocessing stages of ConceptNet integration were helpful in improving the quality of these relations (§3). Among cognates in Atomic ${}^{20}_{20}$ and ConceptNet relations, two sets of relations fare significantly worse in Atomic ${}^{20}_{20}$ than in ConceptNet. In the case of ObjectUse/UsedFor, this is likely due to the fact that Atomic ${}^{20}_{20}$’s ObjectUse includes atypical affordances (cf. §3). In an annotation setting where workers are asked to evaluate the truth or likelihood of an assertion rather than feasibility of use, a portion of the atypical usages are seen as ‘farfetched’ and thus rejected. In the case of MadeUpOf/MadeOf, there may be some room for improvement for Atomic ${}^{20}_{20}$. Unlike Atomic ${}^{20}_{20}$’s HasSubEvent label, which successfully joins together ConceptNet’s Has(First/Last)Subevent labels for an improved accuracy, Atomic ${}^{20}_{20}$’s MadeUpOf, a union of MadeOf, PartOf, and a subset of HasA, does not seem to have resulted in improved quality. The rest of the Atomic ${}^{20}_{20}$ cognates see a significantly higher or similar accuracy in comparison to ConceptNet.

Coverage Assessment

We make a pairwise comparison between the CSKGs to assess their coverage with regards to the commonsense knowledge they contain. For a reliable head-to-head comparison, we map relations and tuples between various KGs.

Mapping Relations. Since Atomic ${}^{20}_{20}$ is built on existing Atomic relations, we primarily need to align relations between Atomic ${}^{20}_{20}$ and ConceptNet. We manually aligned them based on the definitions for the labels as supplied by the two graphs, then verified the resulting alignment by sampling at random approximately 20 instances per relation.

Mapping Tuples. In order to resolve syntactic differences in how the concepts are expressed in each of the KGs (e.g., Atomic’s “PersonX eats breakfast” vs. ConceptNet’s “eat breakfast”), we preprocess each of the head and tail concepts of each tuple in each KG in the following manner: (1) the concept is lowercased and stripped of extra spaces, punctuation, and stopwords; (2) any exact tuple duplicates within each KG are removed; and (3) remaining content words are lemmatized according to their POS category. For Atomic and Atomic ${}^{20}_{20}$ , an extra step is added to remove mentions of “PersonX”, “PersonY” and “PersonZ” if occurring at the beginning of a string, and to replace them with ‘person’ if they occur elsewhere (e.g., “PersonX greets PersonY”).
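The preprocessing steps above can be sketched as follows. This is a minimal illustration under several assumptions: the stopword list is a tiny placeholder (the text does not name the list used), the lemmatizer is a pluggable function that defaults to the identity (the text does not name the POS tagger or lemmatizer), and duplicates are removed once at the end rather than before lemmatization.

```python
import re
import string

# Placeholder stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "on", "at", "is", "be"}
LEADING_PERSON = re.compile(r"^person[xyz]\b\s*", re.IGNORECASE)
ANY_PERSON = re.compile(r"\bperson[xyz]\b", re.IGNORECASE)
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_concept(text, atomic_style=False, lemmatize=lambda tok: tok):
    """Normalize one head or tail concept string."""
    text = text.strip()
    if atomic_style:
        # Drop a leading PersonX/Y/Z; map later mentions to 'person'.
        text = LEADING_PERSON.sub("", text)
        text = ANY_PERSON.sub("person", text)
    text = text.lower().translate(PUNCT_TABLE)
    tokens = [tok for tok in text.split() if tok not in STOPWORDS]
    return " ".join(lemmatize(tok) for tok in tokens)


def normalize_tuples(tuples, atomic_style=False, lemmatize=lambda tok: tok):
    """Normalize (head, relation, tail) tuples and drop exact duplicates,
    keeping the first occurrence of each."""
    normalized = [
        (
            normalize_concept(head, atomic_style, lemmatize),
            relation,
            normalize_concept(tail, atomic_style, lemmatize),
        )
        for head, relation, tail in tuples
    ]
    return list(dict.fromkeys(normalized))
```

A real lemmatizer would be passed in through the `lemmatize` argument so that, for instance, “eats” and “eat” map to the same form and Atomic’s “PersonX eats breakfast” can match ConceptNet’s “eat breakfast”.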

Metrics. We use two metrics to evaluate the coverage of knowledge graphs. For each pair of CSKGs, we compute precision and recall with respect to a target KG. Coverage precision assesses the proportion of tuples in the source KG that are correct according to tuples in the target KG. Coverage recall reflects the proportion of tuples in the target KG that the tuples in the source KG successfully recalled.
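The two metrics can be sketched as set operations. This is a minimal illustration that assumes a source tuple counts as matched only when an identical normalized (head, relation, tail) tuple exists in the target after relation mapping; the text does not spell out the matching rule, and the function name is our own.

```python
def coverage(source, target):
    """Return (precision, recall) of a source KG against a target KG.

    Both arguments are iterables of normalized (head, relation, tail)
    tuples that already share a common relation inventory.
    """
    source, target = set(source), set(target)
    overlap = source & target
    precision = len(overlap) / len(source) if source else 0.0
    recall = len(overlap) / len(target) if target else 0.0
    return precision, recall
```

Under this reading, a source KG that contains every target tuple plus many others has recall 1.0 but low precision, so the two numbers are asymmetric between any pair of graphs.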

Results. Tables 4 and 5 show a pairwise coverage precision and recall assessment among the CSKGs. Atomic ${}^{20}_{20}$ shows the widest coverage: Atomic ${}^{20}_{20}$ is able to recall all of Atomic (as expected) and just under half of ConceptNet. There is very little overlap between Atomic and ConceptNet, which is unsurprising as all of Atomic’s knowledge is focused on social behaviors that ConceptNet does not cover, while ConceptNet leans on physical commonsense, which falls outside Atomic’s scope. Overall, TransOMCS intersects very little with any of the other three KGs.

Neural Knowledge Graph Comparison

Language models are powerful tools for representing knowledge, but their ability to serve as generative knowledge bases is limited by the fact they are directly trained to represent the distribution of language. Previous work shows knowledge graphs can help language models better transfer as knowledge engines (Bosselut et al. 2019) by re-training them on examples of structured knowledge. As a result, a new purpose for knowledge graphs is to be useful in helping language models generalize to hypothesizing knowledge tuples.

Experimental Setup. To evaluate whether knowledge graphs can help language models effectively transfer to knowledge models, we train different pretrained language models, described below, on the knowledge graphs described in Section 4:

GPT2 (Radford et al. 2019) is a Transformer (Vaswani et al. 2017) based language model. In our experiments, we use the largest GPT2 model, GPT2-XL, which has 1.5B parameters. We fine-tune GPT2-XL on each of our CSKGs to predict the tail of a tuple (e.g., wheat) given the head (e.g., bread) and a relation (e.g., MadeUpOf). The hyperparameter settings used for training are described in more detail in Appendix C. Additionally, we use GPT2-XL in a zero-shot setting as a baseline to measure the effect of transfer learning on knowledge graphs. For fair comparison, we manually convert each relation to an English language prompt, expecting the model to generate the tail of each tuple as output.

BART (Lewis et al. 2020) is a Bidirectional and Autoregressive Transformer, an adaptation from BERT (Devlin et al. 2019) that is better suited for natural language generation (e.g., translation, summarization). Additional training details are provided in Appendix C.

GPT-3 (Brown et al. 2020) is an autoregressive language model that has 175B parameters (over 100X more than GPT2-XL) and is trained on a corpus of web text. We use the GPT-3 API to prime the language model to generate the tail for a given prefix – (head, relation) pair. Thus, GPT-3 is evaluated in a few-shot setting. Additional details of our implementation are provided in Appendix C.
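The three ways of presenting a tuple to a model (fine-tuning, zero-shot prompting, and few-shot priming) can be sketched as string construction. Everything below is a hypothetical illustration: the serialization formats are our own assumptions rather than the formats used in the experiments, and the two English templates are borrowed from the annotation phrasings quoted in Section 4, which may differ from the actual zero-shot prompts.

```python
# Hypothetical relation-to-English templates (assumption: reused from the
# human-evaluation phrasings; the real zero-shot prompts are not given).
RELATION_PROMPTS = {
    "xWant": "as a result, PersonX wants",
    "AtLocation": "located or found at/in/on",
}


def fine_tuning_example(head, relation, tail):
    """Pair a (head, relation) source string with its tail target."""
    return {"source": f"{head} {relation}", "target": tail}


def zero_shot_prompt(head, relation):
    """Render a (head, relation) pair as an English prefix for which the
    language model is expected to generate the tail."""
    return f"{head} {RELATION_PROMPTS[relation]}"


def few_shot_prompt(examples, head, relation):
    """Prime a model with complete tuples, then the open (head, relation)
    pair whose tail should be generated."""
    lines = [f"{h} {r} {t}" for h, r, t in examples]
    lines.append(f"{head} {relation}")
    return "\n".join(lines)
```

For instance, `fine_tuning_example("bread", "MadeUpOf", "wheat")` pairs the source string `"bread MadeUpOf"` with the target `"wheat"`, mirroring the example given for GPT2-XL above.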

Evaluation Setup. To assess language-to-knowledge transfer capabilities, we evaluate how language models generalize to new, unseen entities, concepts, or events. We split each knowledge graph into training, validation, and test sets such that the heads of the knowledge tuples do not overlap between these sets. This adversarial split forces the language models to generalize the relationships they learn from training on the knowledge graphs to the entities learned during language pretraining. Also, to avoid overpopulating the validation and test sets with generic heads (e.g., “I”, “You”, “He”, “We”, and “They” collectively account for over 2.2M tuple heads in TransOMCS), we enforce that the head of any knowledge tuple in the dev and test sets is involved in at most $500$ tuples. Finally, we remove low-quality tuples from TransOMCS by imposing a confidence score of $\geq 0.5$ .
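The head-disjoint split can be sketched as follows. This is a minimal illustration: the split fractions, the random seed, and the choice to send over-represented heads to the training set are assumptions not stated in the text, and the TransOMCS confidence filter is not shown.

```python
import random
from collections import defaultdict


def split_by_head(tuples, dev_frac=0.1, test_frac=0.1,
                  max_per_head=500, seed=0):
    """Split (head, relation, tail) tuples so that no head appears in
    more than one of train, dev, and test."""
    by_head = defaultdict(list)
    for tup in tuples:
        by_head[tup[0]].append(tup)

    heads = sorted(by_head)
    random.Random(seed).shuffle(heads)
    n_dev = int(len(heads) * dev_frac)
    n_test = int(len(heads) * test_frac)

    splits = {"train": [], "dev": [], "test": []}
    for i, head in enumerate(heads):
        if i < n_dev:
            name = "dev"
        elif i < n_dev + n_test:
            name = "test"
        else:
            name = "train"
        group = by_head[head]
        # Generic heads with too many tuples are kept out of dev/test
        # (assumption: they are placed in train instead).
        if name != "train" and len(group) > max_per_head:
            name = "train"
        splits[name].extend(group)
    return splits
```

Because whole head groups are assigned to a single split, every head seen at evaluation time is unseen during fine-tuning, which is what makes the split adversarial.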

We score the tuples generated by these knowledge models using common evaluation metrics for text generation: BLEU (Papineni et al. 2002), ROUGE (Lin 2004), CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015), and BERT Score (Zhang et al. 2020b). For a subset of 5000 generated tuples from the test set of each knowledge graph, we also run the same human evaluation described in Section 4.

Results. We present our main results in Tables 6 and 7. First, we note the large divide between the zero-shot GPT2-XL model that produces commonsense knowledge without any fine-tuning and the two COMET models across the Atomic ${}^{20}_{20}$ , Atomic, and ConceptNet knowledge graphs (Table 6). This large gap indicates that language models can benefit from learning facts from commonsense knowledge graphs. They do not have the means to precisely express this knowledge directly from just pretraining on language. This observation is supported by the gaps between these models in the automatic evaluations (Table 7), as well. Additionally, human evaluation of GPT-3 (Table 6) shows a $\sim$12 point deficit compared to the performance of COMET(BART), in spite of GPT-3 (175B) having $\sim$430 times more parameters than COMET(BART) (406M). Similarly, we see a large gap in performance across all automated metrics in Table 7. The performance gap indicates that high-quality declarative knowledge is valuable even after the advent of extreme scale language models.
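As a quick check of the parameter ratios quoted in this section, using only the model sizes stated above (175B for GPT-3, 406M for COMET(BART), and 1.5B for GPT2-XL):

```latex
\[
\frac{175 \times 10^{9}}{406 \times 10^{6}} \approx 431,
\qquad
\frac{175 \times 10^{9}}{1.5 \times 10^{9}} \approx 117 .
\]
```

Both values agree with the rounded figures in the text: roughly 430 times more parameters than COMET(BART), and over 100X more than GPT2-XL.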

In addition to this main result, two particularly interesting observations emerge. First, we note that the gap between the zero-shot model and COMET is larger on the Atomic ${}^{20}_{20}$ and Atomic knowledge graphs, than on ConceptNet, supporting the reflection that Atomic ${}^{20}_{20}$ supports categories of knowledge that are more difficult to learn from pretraining. Second, the results on the human evaluation show that COMET models trained on TransOMCS are not able to generalize knowledge to new entities, implying that language models benefit more from accurate knowledge examples, which TransOMCS lacks (cf. §4).

Discussion

Our conclusions on whether pretrained language models already encode commonsense knowledge are mixed and hinge on the ambiguity of what it means to encode knowledge. Despite the conclusions of prior work (Petroni et al. 2019; Roberts, Raffel, and Shazeer 2020; Tamborrino et al. 2020), our results in Table 6 clearly show that language models fail to express large varieties of knowledge when prompted for it in a zero-shot manner. When converted to COMET models by training on a knowledge graph, their performance at hypothesizing knowledge tuples skyrockets – 47.9% absolute difference between COMET(BART) and GPT2-XL on Atomic ${}^{20}_{20}$ .

However, the evaluation tuples are adversarially selected to not include head entities that were in the training set. The model must generalize its learned representations of relations to entities it has not observed these relationships for during fine-tuning, meaning the representation of these entities is solely formulated from learning language. As a result, language models may still encode this knowledge in their parameters, even if they are not capable of expressing it directly. With this framing in mind, the COMET training paradigm proposed by Bosselut et al. (2019) can perhaps be viewed less as a means of learning knowledge from KGs, and more as a method of learning an interface for language models to hypothesize encoded knowledge through language generation. We look forward to future work in this space that attempts to disentangle these two ideas.

What considerations should be made when designing commonsense knowledge resources? Based on our results in Section 5, we outline desiderata for the design and development of future commonsense knowledge graphs. First, because certain types of knowledge are already encoded and expressible by pretrained language models, CSKG designers should focus on collecting examples and categories of knowledge that are less likely to be known by language models. For example, of the 378 test tuples evaluated by the GPT2-XL zero-shot model that contained the HinderedBy relation, only 1.3% were deemed plausible by human raters – jumping to 85% plausibility for COMET(BART) – pointing to an advantage in constructing Atomic ${}^{20}_{20}$ with this relationship in mind (see Appendix C for per-relation accuracy).

Second, commonsense knowledge resources should be designed with the goal of accuracy and relationship coverage. Because language models exhibit powerful adaptation (Brown et al. 2020), they can generalize many commonsense relationships as long as they have examples on which to train. Consequently, we should construct commonsense resources that encapsulate larger numbers of relations so the knowledge in pretrained language models can be grounded to a variety of relationships. However, language models also benefit from learning from precise examples. Being able to train on a large collection of examples from TransOMCS (see Appendix C) did not allow COMET models to generalize to unseen entities as these examples were not of sufficient quality (see Table 2). Resources should be carefully validated for the quality of their facts, an example set by Speer, Chin, and Havasi (2017) and Sap et al. (2019).

Conclusion

In this work, we formalize a use for commonsense knowledge graphs as transfer learning tools for pretrained language models. With this new purpose, we hypothesize that commonsense knowledge graphs should be designed to contain knowledge that is not already expressible by language models without difficulty (e.g., not taxonomic and lexical knowledge). Consequently, we propose Atomic ${}^{20}_{20}$ , a novel commonsense knowledge graph containing tuples whose relations are specifically selected to be challenging for pretrained language models to express. Our empirical studies demonstrate that Atomic ${}^{20}_{20}$ contains high-accuracy knowledge tuples across multiple novel relations not found in existing CSKGs or expressible by LMs. Furthermore, we show that Atomic ${}^{20}_{20}$ can be effectively used as a training set for adapting language models as knowledge models to generate high quality tuples on-demand.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable feedback. This research was supported in part by NSF (IIS-1524371), the National Science Foundation Graduate Research Fellowship under Grant No. DGE 1256082, DARPA CwC through ARO (W911NF15-1-0543), DARPA MCS program through NIWC Pacific (N66001-19-2-4031), and the Allen Institute for AI. Computations on beaker.org were supported in part by credits from Google Cloud. TPU machines for conducting experiments were provided by Google.

References

Appendix A Atomic 2020 Details

In this section we detail the relations in Atomic ${}^{20}_{20}$ . Figure 3 shows the hierarchical breakdown of the Atomic ${}^{20}_{20}$ relation labels. While there is no internal structure directly encoded for Atomic ${}^{20}_{20}$ relations, they fall into three natural categories based on their meaning: physical-entity, social-interaction and event-centered commonsense.

Physical-Entity Commonsense. Physical-entity commonsense deals with inferential knowledge about common entities and objects. Physical commonsense such as this is crucial for interacting with the world: it allows us to distinguish the dangerous (e.g., “fire can be painful”) from the harmless (e.g., “teddy bears are comforting”), manipulate objects for our use (e.g., “helmets protect head”), and solve problems (e.g., “how do I open this door?”). We identify seven relations under this category.

ObjectUse describes everyday affordances or uses of objects, and includes both typical and atypical uses. For example, “popcorn bucket” can typically be used to “hold popcorn” but it could also serve “as a hat” in atypical situations. The template used to collect object affordances is shown in Figure 5.

MadeUpOf and HasProperty, two property relations, denote the relationship between an entity and its composition or characteristics. MadeUpOf describes a part, portion or makeup of an entity. For example, “cake” can be MadeUpOf “eggs” (composition/ingredient) or “icing” (part/portion). Similarly, HasProperty usually describes entities’ general characteristics such as “rose” is “red,” or subjective attributes such as “thirst” is “uncomfortable.” In certain cases, the relation can also map to descriptors that speak to the substance or value of items such as “meat” has property of being “stored in the freezer” or “bike” is “powered by person’s legs.”

AtLocation is a spatial relation that describes the location in/on/at which an entity is likely to be found (e.g. “gambler” can be found in “casino,” “wrench” can be found in “garage”).

CapableOf is designed to describe abilities and capabilities of everyday living entities (e.g., humans, animals, insects) and natural entities that can exert a force (e.g. sun, storms). CapableOf includes general capabilities such as a “human” is capable of “thinking and reasoning” or “drinking coffee.” It also includes specialized capabilities such as a “surgeon” is capable of “operating on a patient.”

Desires and NotDesires are relations that deal with desires of sentient entities; e.g., “doctors” likely desire to “cure patient” but do not desire “malpractice suit.” Since desire relations are about cognitive states of sentient beings, they also provide a degree of commonsense about social-interaction. However, we point out that these relations indicate generic characterizations of animate entities rather than describing situationally-based cognitive mental states (e.g., X being ‘encouraged’ only applies to the event it is situated in). For this reason, we include these relations under physical-entity commonsense.

Social-Interaction Commonsense. Social-interaction relations comment on socially-triggered states and behaviors. Social commonsense is useful for gauging people’s intentions and purpose, and predicting situationally-relevant human reactions and behaviors. Following the definitions for Atomic relations (Sap et al. 2019), we identify a total of nine relations within this category.

Three mental state relations address the emotional or cognitive states of the participants in a given event. xIntent defines the likely intent or desire of an agent (X) behind the execution of an event. Given the head “X gives Y gifts,” an xIntent might be that X wanted “to be thoughtful.” Relations xReact and oReact define the emotional reactions on the part of X or other participants in an event. As a result of gift giving, X might feel “good about [one]self” and others (in this case, Y) might feel “appreciated.”

Five behavioral relations address the socially relevant responses to an event. xNeed describes a precondition for X achieving the event. For example, in order for X to give Y gifts, X must first “buy the presents.” xWant and oWant are postcondition desires on the part of X and others, respectively. As a result of X giving Y gifts, X may also desire “to hug [Y]” and Y may want to “open the gift.” xEffect and oEffect are social actions that may occur after the event: X may “get hugged” and Y may “blush” in response.

The last relation xAttr describes X’s persona or attribute as perceived by others given an event. In the gift giving example, X may be seen as “generous” or “giving.” In contrast, in an event such as “X steals a car,” X may be perceived as “evil.”

Event-Centered Commonsense. While social-interaction commonsense gauges human behaviors and mental states given an event, the event-centered commonsense provides intuitions about how common events are related to one another. Commonsense about event interaction is useful for understanding likely causes and effects of events in the world. This knowledge allows humans to strategize and explore the best solutions for their objectives, make contingency plans, and revise goals when circumstances deviate from expectation. There are seven relations that fall under this category.

We group three relations under force dynamics (for a discussion of force dynamics in the cognitive linguistic and lexical semantic literature, cf. Herskovits (2009); Landau and Jackendoff (1991); Talmy (1988)). This group conceptualizes dynamic interactions between events with regard to exerted causal forces and impelled actions. Causes specifically captures the causal relation between two events or entities – e.g. an “accident” can cause “injury.” Causes does have some overlap with behavioral relations such as xEffect in that they are postconditions of an event, but the postcondition in Causes is not socially triggered and can exist outside human control (e.g., “bad weather” causes “power outages”). HinderedBy introduces hindrances that obstruct the natural path to the achievement of a goal. For example, the event “X adopts a cat” can be obstructed if “X is allergic to cats.” xReason provides a post-fact explanation of the cause of an event (e.g., why one has to “walk” could be explained by “car has broken down”), which is related to, but distinct from, xIntent’s intentions (i.e., “X walks” because X wanted to “go home”). The template used to collect goal hindrances is shown in Figure 6.

Three relations provide reasoning about event scripts or sequences. isAfter and isBefore introduce events that can precede or follow an event, respectively. For example, “X is in a hurry to get to work” can happen after “X wakes up 15 minutes late” and before “X drives too fast.” These relations are distinguished from behavioral relations xNeed (pre-condition) and xEffect (post-condition) in that isAfter and isBefore are temporally situated without specific regard to the need or reaction of the person X. For example, “X pushes X’s luck” can happen before “X gets a broken nose” but getting a broken nose is not an action X may intentionally take after pushing one’s luck. Relation HasSubEvent provides the internal structure of an event, each tail denoting a step within the larger head event.

The last relation in the event-centered category, isFilledBy, provides a filler phrase for an event with a blank that is sensical and commonly acceptable for the event. For example, the blank in an event such as “X catches ___ in the act” can be commonly filled by entities such as a “cheater,” a “burglar,” or a “mouse.”

In this section, we detail the population of the Atomic ${}^{20}_{20}$ tuples (see Table 1 for counts per relation).

Social-Interaction Tuples. For social-interaction relations, we incorporated $877$ K tuples from Atomic, and crowdsourced an additional $34$ K tuples using the same approach as Sap et al. (2019). The rest of this section will refer to the head events in the social-interaction tuples as base events.

Crowdsourced Tuples. Tuples for relations ObjectUse, HinderedBy, isFilledBy, isBefore and isAfter were crowdsourced via Amazon Mechanical Turk. We paid an average of $15 an hour for our crowdsourcing efforts. We release all crowdsourcing templates as part of our codebase (http://anonymous).

For the collection of HinderedBy, we crowdsourced over 100K event-hindrance tuples by prompting the workers with base events from Atomic and eliciting reasons why one may not be able to achieve the event. In order to make the prompt events more worker-friendly, we processed the events as a desire (e.g., “X adopts a cat” $\rightarrow$ “X wants to adopt a cat”). To achieve this, we removed modal verbs, lemmatized the head verb of the sentence, and inserted a ‘want to’ phrase before the verb. We specifically elicited personal causes (e.g., “X is allergic to cats”), situational causes (e.g., “there are no pet stores nearby”), and social causes (e.g., “X’s landlord disallows pets”).
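The event-to-desire rewriting described above can be sketched as follows. This is only an illustrative approximation: the modal list, the small irregular-verb table, the suffix rules, and the assumption that the head verb directly follows the subject (after any modals) are ours, not the authors' pipeline, which presumably relied on a proper lemmatizer and parser.

```python
# Hypothetical helper names; a rough sketch of "remove modals, lemmatize the
# head verb, insert a 'want to' phrase", not the authors' implementation.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

# Tiny hand-written table of irregular third-person forms (assumption).
IRREGULAR = {"has": "have", "does": "do", "goes": "go", "is": "be"}


def lemmatize_verb(verb: str) -> str:
    """Naively map a third-person singular verb form to its base form."""
    if verb in IRREGULAR:
        return IRREGULAR[verb]
    if verb.endswith("ies") and len(verb) > 4:
        return verb[:-3] + "y"  # carries -> carry
    if verb.endswith(("ches", "shes", "sses", "xes", "zes")):
        return verb[:-2]  # watches -> watch
    if verb.endswith("s") and not verb.endswith("ss"):
        return verb[:-1]  # adopts -> adopt
    return verb


def to_desire(event: str) -> str:
    """Rewrite 'X <verb> ...' as 'X wants to <lemma> ...'."""
    subject, *rest = event.split()
    # Drop modal verbs that sit between the subject and the head verb.
    while rest and rest[0].lower() in MODALS:
        rest.pop(0)
    if not rest:
        return event  # nothing to rewrite
    head_verb = lemmatize_verb(rest[0])
    return " ".join([subject, "wants to", head_verb, *rest[1:]])


print(to_desire("X adopts a cat"))
```

The suffix rules will mis-handle many verbs (for example, past-tense or multi-word predicates), so a production version would need real morphological analysis.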

$33$ K isFilledBy tuples were collected by presenting workers with base events. The workers were asked to provide two (up to four) common objects or entities that will make sense in the sentence.

$46$ K tuples for isBefore and isAfter were collected together as sequences of events. Given a base event, the workers were asked to write a short 3-sentence story by providing a preceding and following event. The workers were given the option to opt out of writing a story if they felt that the event they were given didn’t make sense enough to create a story.

As discussed in the main text (§3), $130$ K ObjectUse tuples were crowdsourced by eliciting common objects and their uses for every event in the collected event sequences. For each event, a worker was asked to provide 2-4 common items that were needed during the displayed event. Atypical ObjectUse was collected in a second pass, where for each collected unique object, the workers were prompted with the object and asked to provide an atypical, creative or resourceful use for the item shown.

Integration of ConceptNet Tuples. The tuples for the remaining relations are populated through the integration of the commonsense portion of ConceptNet. As discussed in the main text, a select subset of ConceptNet (v5.7) tuples ( $172$ K) was integrated into Atomic ${}^{20}_{20}$ .

The primary challenge in integrating ConceptNet tuples into Atomic ${}^{20}_{20}$ was in identifying knowledge that is most likely to reflect commonsense information. ConceptNet (v5.7) contains tuples built on not only concept relationships directly sourced from human informants, but also on information pulled from other lexical sources such as WordNet (Miller 1995) and DBpedia (Auer et al. 2007), which automatically extracts knowledge from Wikipedia articles (Speer, Chin, and Havasi 2017). As a result, even those relations that are designed to primarily represent commonsense knowledge (i.e., the OMCS relations) include, among the mix, tuples that reflect factual or lexical co-occurrence knowledge. These examples deviate from the type of knowledge we would ideally consider as “commonsense,” i.e., qualitative experiential knowledge gained through subjective observation of and interaction with the world. Relations such as InstanceOf (“is instance/example of”) stand as a case in point (e.g., “tortilla” is an example of “flatbread” or “toffee” is an example of “candy”). While included within the OMCS relations, the encoded information can be hard to distinguish from the more accepted taxonomic relations such as IsA (“is a kind/type of”). In fact, ConceptNet (v5.7) recognizes the similarities between IsA and InstanceOf and has accordingly deprecated InstanceOf in favor of IsA; nevertheless, InstanceOf is still found in ConceptNet (v5.7). Relationships found in relations such as RelatedTo and DistinctFrom are too underspecified with regard to the meaning they represent, and for other relations such as LocatedNear, and negative forms such as NotCapableOf or NotHasProperty, the relationships amount to general lexical relationships.

Thus, the process of ConceptNet(v5.7) knowledge selection (described in §3) was judiciously guided by three competing priorities: when possible, we prioritized (1) qualitative commonsense over factual knowledge, (2) general knowledge over highly specific knowledge (e.g., personal names), and (3) meanings that are specific enough to be meaningfully categorized. Since the ideal route of verifying imported data via crowdsourcing can be resource-intensive, we opted for an approach whereby relations were first selected based on the data they represent; then tuples were pruned based on heuristics that leverage lexical and syntactic information of the concepts. As mentioned in the main text, $10$ % of the data selected for integration was validated by crowdworkers, yielding a greater than $93$ % acceptance rate. Three relations, namely HasProperty, CapableOf, and MotivatedByGoal, were sent for instance-by-instance crowdsourcing for the purpose of debiasing human-related descriptions, and subdividing semantically distinct elements within the category (e.g., MotivatedByGoal mapped to xIntent and xReason). The resulting ConceptNet-to-Atomic ${}^{20}_{20}$ relation mapping details are shown in Table 8.

Appendix B Symbolic Knowledge Graph Details

Human Readable Relation Templates. Since the KB relation labels are rather telegraphic on their own, we used human readable language forms (based on Atomic ${}^{20}_{20}$ and ConceptNet definitions) for prompt display in crowdsourced evaluations. The complete list is available in Table 9.

KB Accuracy & Coverage

In Table 2, what type of tuples generally end up with no judgment? Tuples receiving no judgment fall into three general categories: (1) either the head or the tail concept is too specialized for the workers to judge without consulting a reference (e.g., “klebsiella” is part of “bacteria,” “drug cocktail” made of “nucleoside reverse transcriptase inhibitor”); (2) concepts refer to highly specific entities or referents (e.g., “singh” capable of “bring key,” “falkland island islas malvinas” part of “argentina”); and (3) Reject candidates that workers have decided to hedge on (e.g., “dandelion” used for “love,” “democrat” desires “matter”). Such tuples are mostly found in TransOMCS, as evidenced by the high fraction of tuples that received No Judgment at less than half of Atomic ${}^{20}_{20}$ ’s Accept rate (see Table 2).

Accuracy reported in Table 2 for TransOMCS is based on the complete set. What do the numbers look like for the top 1% and top 10% of TransOMCS? The accuracies for the top 1% and top 10% are indeed higher than the reported values for the complete TransOMCS set. However, they still lag behind other KBs (Table 10).

Does the breakdown of accuracy ratings for each KB provide further insights? A closer look at the raw accuracy ratings shows an interesting emergent rating pattern across KBs (Table 2). For all KBs with the exception of TransOMCS, we observe that the majority of social-interaction Accept judgments originate from the sometimes/likely rating. However, such preference is not seen in the physical-entity tuples, which show a slightly higher tendency for the always/often rating. For event-centered tuples, Atomic ${}^{20}_{20}$ favors the sometimes/likely rating, while ConceptNet does not. TransOMCS shows the highest ratings for the sometimes/likely and invalid ratings, and the patterns are invariant across the board.

One additional point to mention is that Atomic ${}^{20}_{20}$ social-interaction and event-centered tuples proportionally contain more of the human-crowdsourced commonsense knowledge than the physical-entity category, which, with the sole exception of ObjectUse, includes tuples integrated from the ConceptNet graph. The observation that much of the knowledge in Atomic ${}^{20}_{20}$ is sometimes or likely true reflects our intentional efforts to deprioritize factual information in favor of qualitative commonsense knowledge. More importantly, it shows that most of the knowledge within the Atomic ${}^{20}_{20}$ graph can be, under the right circumstances, defeasible. That is, one can pose a likely hypothesis that a hindrance to “X writes stories” is that “X can’t read;” however, such a hypothesis can be defeated if we also know that X has written stories before. We find that such context-dependent ambiguities are of more compelling interest to us, as certainties may be better covered by language models.

Appendix C Neural Knowledge Graph Details

Dataset Split. Table 11 reports the number of tuples for each three-way split (train/dev/test) of each knowledge graph. The Atomic ${}^{20}_{20}$ split preserves the splits from Atomic and ConceptNet: any tuple in Atomic ${}^{20}_{20}$ that appears in the train (resp. dev, test) set of Atomic or ConceptNet belongs to the train (resp. dev, test) set of Atomic ${}^{20}_{20}$ . Overall, Atomic ${}^{20}_{20}$ provides over $50\%$ more tuples than the initial version, Atomic.
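The split-preservation rule can be illustrated with a short sketch. The function name and data layout below are hypothetical, and the text does not say how newly crowdsourced tuples (those absent from both source graphs) were assigned, so the sketch simply returns them separately rather than guessing.

```python
def inherit_splits(tuples, source_split):
    """Assign each tuple the split it had in its source graph.

    tuples: iterable of (head, relation, tail) triples.
    source_split: dict mapping a triple to "train", "dev" or "test",
        built from the original Atomic / ConceptNet splits (assumed layout).
    Returns (assigned, unassigned); unassigned holds tuples that appear in
    neither source split, whose handling is not specified in the text.
    """
    assigned = {"train": [], "dev": [], "test": []}
    unassigned = []
    for triple in tuples:
        split = source_split.get(triple)
        if split is None:
            unassigned.append(triple)
        else:
            assigned[split].append(triple)
    return assigned, unassigned


# Toy data for illustration only.
source = {
    ("X gives Y gifts", "xIntent", "to be thoughtful"): "train",
    ("accident", "Causes", "injury"): "test",
}
merged = list(source) + [("X adopts a cat", "HinderedBy", "X is allergic to cats")]
assigned, unassigned = inherit_splits(merged, source)
print({name: len(items) for name, items in assigned.items()}, len(unassigned))
```

Keeping inherited tuples in their original split avoids leaking test tuples of the source graphs into the training set of the merged graph.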

Details about GPT2-XL Training. GPT2-XL (Radford et al. 2019) is a transformer language model trained on 8 million webpages ( $\sim$ 40GB of text data). We finetune the language model on each commonsense knowledge graph by converting a tuple into a formatted text input – e.g., head relation [GEN] tail [SEP], where [GEN] and [SEP] are special delimiter tokens that indicate the start and end of the tail of a given relation for a given head entity / event. At inference time, the head and relation of a tuple are given as input and the model’s generation following the [GEN] token is recorded as its prediction of the tail entity. We finetuned GPT2-XL on each CSKG for one epoch, using a batch size of 32 and a learning rate of $5e-5$ on an Nvidia RTX-8000 GPU. The final trained models for each CSKG will be publicly released as part of our code release. In Figure 7, we include a few examples of COMET(GPT2-XL) trained on Atomic ${}^{20}_{20}$ .
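The tuple serialization can be sketched as below. We assume the head and relation precede [GEN] and the tail sits between [GEN] and [SEP], as the description of the delimiters suggests; the function names and the exact whitespace are our own choices, and no tokenizer or model call is shown.

```python
GEN, SEP = "[GEN]", "[SEP]"


def format_training_example(head: str, relation: str, tail: str) -> str:
    """Serialize a full tuple for finetuning (assumed layout)."""
    return f"{head} {relation} {GEN} {tail} {SEP}"


def format_inference_prompt(head: str, relation: str) -> str:
    """At inference time only the head and relation are given."""
    return f"{head} {relation} {GEN}"


def extract_tail(generated_text: str) -> str:
    """Keep what follows [GEN], cut at [SEP] if the model produced one."""
    after_gen = generated_text.split(GEN, 1)[1]
    return after_gen.split(SEP, 1)[0].strip()


example = format_training_example("X gives Y gifts", "xIntent", "to be thoughtful")
print(example)
print(extract_tail(example))
```

In practice the two delimiters would also need to be registered as special tokens in the tokenizer so that they are not split into subwords.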

Details about BART Training. BART (Lewis et al. 2020) is a denoising sequence-to-sequence pretrained language model. Similar to previous transformer-based language models (Devlin et al. 2019), BART’s pretraining objective is to recover its input, which is corrupted through various strategies such as token and span masking, and sentence permutation. For pretraining, BART uses a 160GB free-text dataset drawn from news, books, stories, and web texts. We used the BART-large version of the model (from HuggingFace’s implementation; Wolf et al. 2019), which has 24 layers, 1024-dimensional hidden states, 16 attention heads in its self-attention layers, and 406M total parameters. We set the maximum length to be 24 and the minimum length to be 1. For hyper-parameter search, we fine-tuned BART on each commonsense knowledge graph for one epoch with batch sizes {64, 32, 16}, learning rates {1e-3, 1e-5, 1e-7}, and three random seeds.
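For concreteness, the search space above can be enumerated as follows. This sketch assumes a full grid over the three factors (the text does not state whether every combination was run), and the seed values are placeholders because the actual seeds are not given.

```python
from itertools import product

BATCH_SIZES = (64, 32, 16)
LEARNING_RATES = (1e-3, 1e-5, 1e-7)
SEEDS = (0, 1, 2)  # placeholder values; the paper only says "three random seeds"


def search_grid():
    """Enumerate one-epoch finetuning configurations (assumed full grid)."""
    return [
        {"batch_size": bs, "learning_rate": lr, "seed": seed, "epochs": 1}
        for bs, lr, seed in product(BATCH_SIZES, LEARNING_RATES, SEEDS)
    ]


configs = search_grid()
print(len(configs))  # 3 batch sizes x 3 learning rates x 3 seeds
```

Under the full-grid assumption this amounts to 27 one-epoch runs per knowledge graph.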

Details about GPT-3 Evaluation. We evaluate GPT-3 (Brown et al. 2020) using OpenAI’s language completion API. Similar to zero-shot evaluation on GPT2-XL, we use templates to evaluate the ability of the language model to generate a tail given the head and relation. We use the same templates as GPT2-XL. We prime each relation with 5 examples of heads and tails, randomly selected from the training set. We ran 3 random seeds to select priming examples to avoid spelling mistakes and other fragments from data collection. We ran generation with temperature 0.4.
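The priming procedure can be sketched as a prompt builder. The template string, the function name, and most of the toy tuples below are hypothetical stand-ins (the actual templates are listed in the paper's tables, not here), and the completion API call itself is deliberately omitted.

```python
import random


def build_few_shot_prompt(train_tuples, relation, query_head, template, k=5, seed=0):
    """Return k primed examples followed by the open-ended query line."""
    rng = random.Random(seed)  # seeded so a priming set can be redrawn exactly
    pool = [(h, t) for h, r, t in train_tuples if r == relation]
    shots = rng.sample(pool, k)  # raises ValueError if fewer than k examples
    lines = [f"{template.format(head=h)} {t}" for h, t in shots]
    lines.append(template.format(head=query_head))  # tail is left to the model
    return "\n".join(lines)


# Toy training tuples for illustration only.
TRAIN = [
    ("popcorn bucket", "ObjectUse", "hold popcorn"),
    ("helmet", "ObjectUse", "protect head"),
    ("wrench", "ObjectUse", "tighten a bolt"),
    ("umbrella", "ObjectUse", "stay dry"),
    ("pen", "ObjectUse", "write a note"),
    ("cup", "ObjectUse", "drink water"),
    ("accident", "Causes", "injury"),
]
TEMPLATE = "{head} is used for"  # hypothetical template for ObjectUse

prompt = build_few_shot_prompt(TRAIN, "ObjectUse", "teddy bear", TEMPLATE)
print(prompt)
```

Drawing the priming set under several seeds, as described above, makes it possible to discard a set that happens to contain misspelled or fragmentary examples.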

Utility of Pre-trained Language Models. As a means for establishing a control for the utility of pre-trained models in commonsense tasks, we trained an un-pretrained BART model and performed human evaluation for Atomic ${}^{20}_{20}$ . We observe generation accuracy values of 54.9% for Accept, 44.9% for Reject, and 0.18% for No Judgment, which is a significant drop in performance compared to the results for COMET(BART) in Table 6. This indicates that pre-training does indeed provide a level of generalization necessary for commonsense tasks.

Additional Automated Evaluation. In order to have a direct comparison between automated and human evaluations, we report in Section 5 the automated metrics on the same test subsets that were used for human evaluation. For completeness, in this section, we provide the automated evaluation results on the full test sets (Table 12). These results confirm the findings of Section 5.

Appendix D Additional Reproducibility Items

All experiments were conducted on a cluster with 8 GPUs of type NVIDIA Quadro RTX 8000 with 48 GB of GDDR6 memory each. To allow replication of results, whenever possible, a default, fixed value was assigned to the random seed that initializes the pseudo-random number generator, as specified in the source code. The details of the experimentation of the models (i.e. GPT2-XL and BART), including their hyper-parameter settings, are described in Appendix C. All the data as well as the source code required for conducting experiments will be made publicly available upon publication.