GenericsKB: A Knowledge Base of Generic Statements
Sumithra Bhakthavatsalam, Chloe Anastasiades, Peter Clark
Introduction
While deep learning systems have achieved remarkable performance trained on general text, NLP researchers frequently seek out additional repositories of general/commonsense knowledge to boost performance further, e.g., Icarte et al. (2017); Wang et al. (2018); Yang et al. (2019); Peters et al. (2019); Liu et al. (2019); Paul and Frank (2019). However, there are only a limited number of repositories currently available, with ConceptNet Speer et al. (2017) and WordNet Fellbaum (1998) being popular choices. In this work we contribute a new resource, namely a large collection of contextualized generic sentences, as an additional source of general knowledge, and to help fill gaps in existing repositories. The resource, called GenericsKB, is the first to contain naturally occurring generic sentences, as opposed to extracted or crowdsourced triples, and thus is rich in high-quality, general, semantically complete statements.
Statements in GenericsKB were culled from over 1.7 billion sentences from three corpora. To collect statements, we first clean the source data, then filter it using linguistic rules to identify likely generics, then apply a BERT-based scoring step to distinguish generics that are meaningful on their own (avoiding generics with contextual meaning such as “Meals are on the third floor”). The resulting KB contains over 3.4M statements, each including metadata about its topic, surrounding context, and a confidence measure. Figure 1 illustrates some examples, as well as a full entry illustrating the metadata. We also create GenericsKB-Best (1M+ sentences), containing the best-quality generics in GenericsKB plus selected, synthesized generics from WordNet and ConceptNet.
We also report results using GenericsKB for two tasks, namely question-answering (using the OpenbookQA dataset Mihaylov et al. (2018)), and explanation generation (using the QASC dataset Khot et al. (2019)). Our goal is not to build a new model, but to see how an existing model’s performance changes when the GenericsKB corpus replaces a larger corpus for these tasks. We find that GenericsKB can sometimes produce higher question-answering scores, and always produces better quality explanations. This suggests that GenericsKB may also have value for other NLP tasks, either standalone or as an additional source of general knowledge to help train models. Finally, independent of deep learning, GenericsKB may be a valuable resource for those studying generics and their semantics in linguistics.
Related Work
A generic statement is one that makes a blanket statement about the members of a category, e.g., “Tigers are striped.” We also include near-universally quantified statements such as “Most tigers are striped” in GenericsKB, although their status as generics is sometimes disputed by semanticists. Because generics apply to many entities, they are particularly important for reasoning. Although generics are common in language, their semantics has been a topic of considerable debate in linguistics, e.g., Carlson and Pelletier (1995); Schubert and Pelletier (1989); Leslie (2015); Liebesman (2011); Schubert and Pelletier (1987); Leslie (2011). Rather than repeat that debate here, we note that our primary goal is to collect rather than interpret generics. We hope that our resource can contribute to the study of their semantics.
Several repositories of general knowledge are already available, but their characteristics and coverage differ from those of GenericsKB, e.g., Sap et al. (2019); Tandon et al. (2014); Van Durme et al. (2009). ConceptNet Speer et al. (2017) is perhaps the most used, containing approximately 1M English triples (excluding RelatedTo, Synonym, and [Lexical]FormOf links), or 34M triples total. ConceptNet triples can be rendered as short generics, thus covering just simple (typically three word) generic statements about 28 relationships. Similarly, WordNet taxonomic and meronymic links express short, specific relationships but leave most other kinds of general knowledge uncovered (compare with Figure 1). Triple stores, e.g., Clark and Harrison (2009), acquired from open information extraction Banko et al. (2007), contain larger and less constrained collections of knowledge, but typically with low precision Mishra et al. (2017), making it difficult to exploit them in practice. GenericsKB thus fills a gap in this space, containing naturally occurring generic statements that an author considered salient enough to write down.
Approach
To construct GenericsKB, sentences were selected from over 1.7B sentences in three corpora (Table 1): The Waterloo corpus is 280GB of English plain text, gathered by Charles Clarke (Univ. Waterloo) using a webcrawler in 2001 from .edu domains. It was made available to us and was previously used in Clark et al. (2016). SimpleWikipedia is a filtered scrape of SimpleWikipedia pages (simple.wikipedia.org). The ARC corpus is a collection of 14M science and general sentences, released as part of the ARC challenge Clark et al. (2018). GenericsKB was then assembled in the following three steps:
As the source corpora originated from web scrapes, they contain noise in various forms, such as blocks of code, non-English text, hyperlinks, and emails. The corpora were cleaned using the following:
Regular Expressions to capture frequently occurring lexical properties of noise.
Sentence and token length heuristics to filter out malformed sentences.
Text cleanup using the Fixes Text For You (ftfy) Python library, which fixes various encoding-related errors.
Language Detection using spaCy to filter out non-English text.
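The first two cleaning steps can be illustrated with a minimal sketch. The regular expressions and length bounds below are hypothetical placeholders, since the actual patterns and thresholds are not listed here; the ftfy and spaCy steps are left out so that the sketch depends only on the standard library.

```python
import re

# Hypothetical noise patterns; the actual regular expressions are not given in the text.
NOISE_PATTERNS = [
    re.compile(r"https?://\S+|www\.\S+"),         # hyperlinks
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email addresses
    re.compile(r"[{}<>;=]"),                      # crude signs of code or markup
]

# Hypothetical length bounds (whitespace-separated tokens and characters).
MIN_TOKENS, MAX_TOKENS, MAX_CHARS = 3, 40, 300


def looks_clean(sentence: str) -> bool:
    """Return True if the sentence passes the regex and length heuristics."""
    if any(pattern.search(sentence) for pattern in NOISE_PATTERNS):
        return False
    n_tokens = len(sentence.split())
    if not (MIN_TOKENS <= n_tokens <= MAX_TOKENS):
        return False
    return len(sentence) <= MAX_CHARS


def clean_corpus(sentences):
    """Yield whitespace-normalized sentences that survive the heuristics.

    Encoding repair (ftfy) and language detection (spaCy) would slot in
    here; they are omitted from this sketch.
    """
    for raw in sentences:
        sentence = " ".join(raw.split())
        if looks_clean(sentence):
            yield sentence
```

Applied to a small list of raw strings, `clean_corpus` keeps ordinary sentences and drops those containing links, email addresses, code-like characters, or too few tokens.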
3.2 Filtering
We next use a set of 27 hand-authored lexico-syntactic rules to identify standalone generic sentences, and reject others. For example, sentences that start with a bare plural (“Dogs are…”) are considered good candidates, while those starting with a determiner (“A man said…”) or containing a present participle (“A bear is running…”) are not. Similarly, sentences containing pronouns (“He said…”) are likely to have contextual rather than standalone meaning, and so are also rejected. A sample of the filtering rules is summarized in Figure 2, and the full list of rules is given in the Appendix. Given the size and redundancy of the initial corpus, these rules aim to filter the corpus aggressively to produce a set of high-quality candidates, rather than catch all possible standalone generics.
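As an illustration of how such rules compose, the sketch below implements a handful of the purely lexical tests as named predicates. The word lists are hypothetical stand-ins for the actual lists, and rules that need part-of-speech or dependency information (e.g., detecting present participles) are not covered.

```python
import re

# Hypothetical word lists; the actual lists are not reproduced in the text.
BAD_FIRST_WORDS = {"a", "an", "the", "this", "that", "these", "those"}
PRONOUNS = {"he", "she", "it", "they", "we", "i", "you", "him", "her", "them", "us"}
MODALS = {"would", "should", "could", "might", "may", "must"}


def words(sentence: str) -> list:
    """Lower-cased alphabetic words of the sentence."""
    return re.findall(r"[a-z']+", sentence.lower())


def has_no_bad_first_word(sentence: str) -> bool:
    first = words(sentence)[:1]
    return not first or first[0] not in BAD_FIRST_WORDS


# Each rule is a (name, predicate) pair; a sentence must satisfy every predicate.
RULES = [
    ("starts-with-capital", lambda s: s[:1].isupper()),
    ("ends-with-period", lambda s: s.endswith(".")),
    ("has-no-digits", lambda s: not any(ch.isdigit() for ch in s)),
    ("has-no-bad-first-word", has_no_bad_first_word),
    ("has-no-bad-pronouns", lambda s: not PRONOUNS.intersection(words(s))),
    ("has-no-modals", lambda s: not MODALS.intersection(words(s))),
]


def failed_rules(sentence: str) -> list:
    """Names of the rules the sentence fails, in rule order."""
    return [name for name, rule in RULES if not rule(sentence)]


def is_candidate(sentence: str) -> bool:
    """True if the sentence passes every rule in this (partial) rule set."""
    return not failed_rules(sentence)
```

Returning the names of the failed rules, rather than a bare boolean, makes it easy to inspect why a sentence was rejected when tuning an aggressive filter of this kind.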
3.3 Scoring
Finally, we train and apply a BERT classifier to score sentences by how well they describe a useful, general truth. To build the classifier, a random subset (size 10k) of the 3.4M candidate generics was labeled by crowdworkers as to whether they expressed a useful, general truth about the world (with options yes, no, unsure), guided by examples. Specifically, workers were asked to reject:
(1) Sentences which do not stand on their own, e.g., “Free parking is provided”
(2) Subjective and/or not useful statements, e.g., “Life is too serious, sometimes.”
(3) Vague statements, e.g., “All cats are essentially cats.”
(4) Statements about people and companies, e.g., “Apple makes lots of iPhones”
(5) Facts that are incorrect in isolation, e.g., “All maps are hand-drawn.”
Each fact was annotated twice and scores (yes/unsure/no = 1/0.5/0) averaged. The joint probability of agreement (i.e., that both annotators agreed) was 70.1% (approximately 1/3 of the agreed annotations being “yes”, 2/3 “no”), and Cohen’s Kappa was 0.52 (“moderate agreement”). The dataset was then split 70:10:20 into train:dev:test, and a BERT classifier was fine-tuned on the training set (we use the BERT-for-classification package provided by AllenNLP, https://allenai.github.io/allennlp-docs/api/allennlp.models.bert_for_classification.html). Each sentence is input simply as [CLS] sentence. The output is pooled, then run through a linear layer which outputs two logits representing the two classes (yes/no), followed by a softmax to obtain class probabilities. This classifier scored 83% on the held-out test set. The classifier was then used to score all 3.4M extracted generic sentences.
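The annotation statistics and the final softmax step can be made concrete with a short sketch. It assumes Cohen's Kappa is computed over the three raw labels (the text does not say whether "unsure" was treated as its own category), and the function names are illustrative only.

```python
import math
from collections import Counter

# Label-to-score mapping stated in the text: yes/unsure/no = 1/0.5/0.
LABEL_SCORE = {"yes": 1.0, "unsure": 0.5, "no": 0.0}


def averaged_score(label_a: str, label_b: str) -> float:
    """Average the two annotators' scores for one sentence."""
    return (LABEL_SCORE[label_a] + LABEL_SCORE[label_b]) / 2


def observed_agreement(labels_a, labels_b) -> float:
    """Joint probability of agreement: fraction of items labeled identically."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's Kappa: agreement corrected for chance agreement.

    Assumption: all three labels are treated as separate categories.
    """
    n = len(labels_a)
    p_o = observed_agreement(labels_a, labels_b)
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum(freq_a[c] * freq_b[c] for c in LABEL_SCORE) / (n * n)
    return (p_o - p_e) / (1 - p_e)


def softmax(logits):
    """Turn the classifier's two logits into class probabilities."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

For instance, a sentence labeled "yes" by one annotator and "unsure" by the other receives an averaged score of 0.75, and the probability of the "yes" class from `softmax` plays the role of the classifier's confidence score.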
3.4 GenericsKB and GenericsKB-Best
The final GenericsKB contains 3,433,000 sentences. We also create GenericsKB-Best, comprising GenericsKB generics whose score clears a threshold of 0.23 (by calibration, equivalent to an annotator score of 0.5, i.e., more likely good than bad), augmented with short generics synthesized from three other resources for all the terms (generic categories) in GenericsKB-Best: ConceptNet (isa, hasPart, locatedAt, usedFor); WordNet (isa, hasPart), using just the most frequent sense for each generic term; and the Aristo TupleKB (at https://allenai.org/data/tuple-kb). GenericsKB-Best contains 1,020,868 generics (774,621 from GenericsKB plus 246,247 synthesized).
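The assembly of GenericsKB-Best can be sketched as a threshold filter plus template-based rendering of triples. The templates, the function names, and the choice of a strict comparison against 0.23 are all assumptions made for illustration; the text does not specify how triples were rendered into sentences.

```python
SCORE_THRESHOLD = 0.23  # value from the text; the strict ">" below is an assumption

# Hypothetical sentence templates for the relations named in the text.
TEMPLATES = {
    "isa": "{subj} is {obj}.",
    "hasPart": "{subj} has {obj}.",
    "locatedAt": "{subj} is located at {obj}.",
    "usedFor": "{subj} is used for {obj}.",
}


def render_triple(subj: str, relation: str, obj: str) -> str:
    """Render a (subject, relation, object) triple as a short generic sentence."""
    sentence = TEMPLATES[relation].format(subj=subj, obj=obj)
    return sentence[0].upper() + sentence[1:]


def build_best(scored_generics, triples, terms):
    """Combine high-scoring generics with generics synthesized from triples.

    scored_generics: iterable of (sentence, classifier_score) pairs
    triples:         iterable of (subject, relation, object) tuples
    terms:           set of generic categories for which triples are rendered
    """
    kept = [sent for sent, score in scored_generics if score > SCORE_THRESHOLD]
    synthesized = [
        render_triple(subj, rel, obj)
        for subj, rel, obj in triples
        if rel in TEMPLATES and subj in terms
    ]
    return kept + synthesized
```

The reported sizes are consistent with this two-part construction: 774,621 retained generics plus 246,247 synthesized ones gives 1,020,868.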
Evaluation
For some initial indications of whether GenericsKB can be useful, we performed two experiments.
We evaluate using GenericsKB for a question-answering task, namely OpenbookQA Mihaylov et al. (2018), comparing it to using an alternative, large, publicly available corpus (QASC-17M, Khot et al. (2019)). For both, we use the BERT-MCQ QA system Khot et al. (2019). Note that our goal is to evaluate the corpora, not the QA system. The results are shown in Table 2, indicating that using the high-quality version GenericsKB-Best can, at least in this case, result in improved QA performance over using the original corpus, even though it is a fraction of the size.
4.2 Explanation Quality
We also experimented with using GenericsKB-Best to generate explanations for a (given) answer, where an explanation is a chain of two sentences drawn from the corpus. For example:
What can cause a forest fire? storms
because: Storms can produce lightning AND Lightning can start fires
Good explanations typically use generic sentences, reflecting the underlying formal structure of the explanation. This suggests that a corpus of generics may help in this task.
We test this hypothesis using the QASC dataset. We can do this because the BERT-MCQ system described earlier already finds candidate good chains as part of its retrieval step Khot et al. (2019) (specifically, it finds pairs of sentences from the corpus that maximally overlap the question, answer, and each other). We can thus collect these chains found using the original QASC-17M corpus, and using GenericsKB-Best, and compare quality.
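To make the overlap criterion concrete, the following brute-force sketch scores every pair of corpus sentences by word overlap with the question, with the answer, and with each other. It is only an illustration of the idea: the stopword list and the unweighted sum are assumptions, and the actual BERT-MCQ retrieval step is described in Khot et al. (2019), not reproduced here.

```python
import re
from itertools import combinations

# Hypothetical stopword list used to focus the overlap on content words.
STOPWORDS = {"a", "an", "the", "is", "are", "can", "what", "do", "of", "to", "and", "for", "on"}


def content_words(text: str) -> set:
    """Lower-cased words of the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def chain_score(question: str, answer: str, first: str, second: str) -> int:
    """Overlap of a two-sentence chain with the question, the answer, and itself."""
    q, a = content_words(question), content_words(answer)
    f, s = content_words(first), content_words(second)
    chain = f | s
    return len(chain & q) + len(chain & a) + len(f & s)


def best_chain(question: str, answer: str, corpus):
    """Return the highest-scoring pair of sentences (first found wins ties)."""
    return max(
        combinations(corpus, 2),
        key=lambda pair: chain_score(question, answer, *pair),
    )
```

Enumerating all pairs is quadratic in the corpus size, so a real system would first retrieve a small candidate set; the sketch is meant only to show what "maximally overlap the question, answer, and each other" could mean operationally.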
To evaluate these chains, we train a simple BERT-model using the QASC training data, which comes with a gold reasoning chain for every correct answer. We use the gold chains as examples of good chains, and BERT-MCQ-generated chains for incorrect answer options as examples of bad (invalid) chains. We can then use the trained model to evaluate the chains collected earlier.
The results are in Table 3, and indicate that substantially better explanations are generated with GenericsKB-Best. The same result was found using the OBQA dataset. In particular, because of the eclectic nature of the QASC-17M corpus, nonsensical explanations can often occur, e.g.:
What do vehicles transport? people
because: What to say what vehicle to use AND Now people say it’s time to move on.
compared with the GenericsKB-Best explanation:
What do vehicles transport? people
because: A vehicle is transport AND Transportation is used for moving people
Here, the QASC-17M explanation is nonsensical. Because GenericsKB is rich in stand-alone generics, the explanations produced with it are more often valid.
4.3 GenericsKB Quality
Finally we note that even with filtering, some (undesirable) contextual generics occasionally pass through. Examples include:
Democracy is four wolves and a lamb voting on what to have for lunch.
These examples exhibit ellipsis, vagueness, and metaphor, complicating their interpretation. Ideally, the scoring model would then score these low, but this may not always happen: recognizing contextuality often requires world knowledge. For example, consider distinguishing the good, standalone generic “Murder is illegal” from the contextual one “Parking is illegal”.
To evaluate the extent of this, two annotators independently annotated 100 random (GenericsKB) sentences from GenericsKB-Best as to whether they represented useful, general truths (the same criterion as in Section 3.3), and found 85% (averaged) met this criterion. This suggests that such problems are relatively uncommon.
Conclusion
With the growing use of deep learning in NLP, researchers have often sought out additional general knowledge resources to improve their systems. To help meet this need, as well as provide a general resource for linguistics, we have created GenericsKB, the first large-scale resource of naturally occurring generic statements, as well as an augmented subset GenericsKB-Best, including important metadata about each statement. While GenericsKB is not a replacement for a Web-scale corpus, we have shown it can assist in both question-answering and explanation construction for two existing datasets. These positive examples of utility suggest that GenericsKB has potential as a large, new resource of general knowledge for the community. GenericsKB is available at https://allenai.org/data/genericskb.
References
Appendix: Patterns for Identifying Generics
The following 27 rules are used to identify generic sentences, as well as help filter out those which are likely contextual, gibberish, or otherwise not stand-alone. Some rules use spaCy features for processing. To be retained, each sentence must pass the following tests:
is-short-enough: The sentence length is within the limit (100).
starts-with-capital: The first character is an upper-case character.
ends-with-period: The last character is a period.
has-at-least-one-token: The sentence contains at least one spaCy token.
has-no-bad-first-word: The first word is not in a list of bad-first-words (determiners, etc.)
has-no-bad-words: The sentence does not contain words in a badword list (e.g., copyright, licence, …)
has-no-bad-pronouns: The sentence does not contain personal pronouns (he, she, …)
has-no-negations: The sentence does not contain negations.
has-no-modals: The sentence does not contain modals (“would”, “should”,…).
first-word-is-not-verb: The first word of the sentence is not a verb.
first-word-is-not-conjunction: The first word is not a conjunction.
look-for-positive-quantifier-at-first-word: If the first word is a positive quantifier (“all”, “some”), note the quantifier and repeat the filter using the sentence without the quantifier.
has-acceptable-past-participle-root: The root verb is in the present passive, or is not a past participle.
noun-exists-before-root: There is a 'NOUN' token before the root.
key-concept-head-pos-tags-not-contradicted-by-wordnet: If WordNet disagrees about the POS of the key concept head, filter out this sentence.
has-no-digits: The sentence has no digits.
all-propn-exist-in-wordnet: All PROPN tokens exist in WordNet.
all-propn-have-acceptable-ne-labels: Any PROPN tokens have one of the following ent_type values: 'EVENT', 'GPE', 'LANGUAGE', 'LAW', 'LOC', 'WORK_OF_ART'. (These acceptable values were decided by the corresponding top level rules.)
and must not pass these tests:
scr.dot_dot_in_sentence: There is '..' in the sentence.
scr.www_in_sentence: There is 'www' in the sentence.
scr.com_in_sentence: There is '.com' in the sentence.
scr.many_hyphens_in_sentence: The sentence contains too many hyphens (threshold 2).
scr.sentence_does_not_end_with_period: The sentence does not end with a period.
remove-non-verb-roots: Remove any sentences with non-verbal roots (e.g., “A large tree.”).
remove-present-participle-roots: Reject sentences whose root verb is a present participle (“sitting”,…).
remove-first-word-roots: Reject sentences with a root that corresponds to the first word.
remove-past-tense-roots: Reject sentences with any past tense roots (“ate”,…).
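Two of the rule types above lend themselves to a string-level sketch: the scr.* rejection tests and the quantifier-stripping step. The function names are illustrative, the quantifier set contains only the two examples given in the rule, and treating the hyphen threshold as a strict "more than" comparison is an assumption.

```python
# Only the two example quantifiers mentioned in the rule; the full list is not given.
POSITIVE_QUANTIFIERS = {"all", "some"}


def strip_positive_quantifier(sentence: str):
    """Return (quantifier or None, sentence without a leading positive quantifier).

    The remaining sentence is re-capitalized so that it can be passed
    through the other filters again, as the rule describes.
    """
    first, _, rest = sentence.partition(" ")
    if first.lower() in POSITIVE_QUANTIFIERS and rest:
        return first.lower(), rest[0].upper() + rest[1:]
    return None, sentence


def scr_reject_reasons(sentence: str, max_hyphens: int = 2) -> list:
    """Names of the scr.* rejection tests that fire for this sentence.

    Assumption: "too many hyphens" means strictly more than max_hyphens.
    """
    reasons = []
    if ".." in sentence:
        reasons.append("scr.dot_dot_in_sentence")
    if "www" in sentence:
        reasons.append("scr.www_in_sentence")
    if ".com" in sentence:
        reasons.append("scr.com_in_sentence")
    if sentence.count("-") > max_hyphens:
        reasons.append("scr.many_hyphens_in_sentence")
    if not sentence.endswith("."):
        reasons.append("scr.sentence_does_not_end_with_period")
    return reasons
```

A sentence is kept only if `scr_reject_reasons` returns an empty list; the remaining rejection rules depend on the dependency root and would need a parser such as spaCy.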