Fusing Context Into Knowledge Graph for Commonsense Question Answering

Yichong Xu, Chenguang Zhu, Ruochen Xu, Yang Liu, Michael Zeng, Xuedong Huang

Introduction

One critical aspect of human intelligence is the ability to reason over everyday matters based on observation and knowledge. This capability is usually shared by most people as a foundation for communication and interaction with the world. Therefore, commonsense reasoning has emerged as an important task in natural language understanding, with various datasets and models proposed in this area (Ma et al., 2019; Talmor et al., 2018; Wang et al., 2020; Lv et al., 2020).

While massive pre-trained models (Devlin et al., 2018; Liu et al., 2019) are effective in language understanding, they lack modules to explicitly handle knowledge and commonsense. Also, structured data such as a knowledge graph represents commonsense much more efficiently than unstructured text. Therefore, multiple methods couple language models with various forms of knowledge graphs (KG) for commonsense reasoning, including knowledge bases (Sap et al., 2019; Yu et al., 2020b), relational paths (Lin et al., 2019), graph relation networks (Feng et al., 2020) and heterogeneous graphs (Lv et al., 2020). These methods combine the merits of language modeling and structural knowledge information and improve the performance of commonsense reasoning and question answering.

However, there is still a non-negligible gap between the performance of these models and humans. One reason is that, although a KG can encode topological information between the concepts, it lacks rich context information. For instance, for a graph node for the entity “Mona Lisa”, the graph depicts its relations to multiple other entities. But given this neighborhood information, it is still hard to infer that it is a painting. On the other hand, we can retrieve the precise definition of “Mona Lisa” from external sources, e.g. the definition of Mona Lisa in Wiktionary is “A painting by Leonardo da Vinci, widely considered as the most famous painting in history”. To represent structured data that can be seamlessly integrated into language models, we need to provide a panoramic view of each concept in the knowledge graph, including its neighboring concepts, relations to them, and a definitive description of it.

Thus, we propose the DEKCOR model, i.e. DEscriptive Knowledge for COmmonsense question answeRing, to tackle multiple choice commonsense questions. Given a question and a choice, we first extract the contained concepts. Then, we extract the edge between the question concept and the choice concept in ConceptNet (Speer et al., 2017). If such an edge does not exist, we compute a relevance score for each knowledge triple (subject, relation, object) containing the choice concept, and select the one with the highest score. Next, we retrieve the definition of these concepts from Wiktionary via multiple criteria of text matching. Finally, we feed the question, choice, selected triple and definitions into the language model ALBERT (Lan et al., 2019) to produce a score indicating how likely this choice is the correct answer.

We evaluate our model on CommonsenseQA (Talmor et al., 2018) and OpenBookQA (Mihaylov et al., 2018). On CommonsenseQA, it outperforms the previous state-of-the-art result by 1.2% (single model) and 3.8% (ensemble model) on the test set. On OpenBookQA, our model outperforms all baselines other than two large-scale models based on T5 (Raffel et al., 2019). We further conduct ablation studies to demonstrate the effectiveness of fusing context into the knowledge graph.

Related work

Several different approaches have been investigated for leveraging external knowledge sources to answer commonsense questions. Min et al. (2019) addresses open-domain QA by retrieving from a passage graph, where vertices are passages and edges represent relationships derived from external knowledge bases and co-occurrence. Sap et al. (2019) introduces the ATOMIC graph with 877k textual descriptions of inferential knowledge (e.g. if-then relations) to answer causal questions. Lin et al. (2019) projects questions and choices to the knowledge-based symbolic space as a schema graph and then scores them with a path-based LSTM. Feng et al. (2020) adopts the multi-hop graph relation network (MHGRN) to perform reasoning that unifies path-based methods and graph neural networks. Lv et al. (2020) proposes to extract evidence from both a structured knowledge base such as ConceptNet and Wikipedia text, and to conduct graph-based representation and inference for commonsense reasoning. Wang et al. (2020) employs GPT-2 to generate paths between concepts in a knowledge graph, which can dynamically provide multi-hop relations between any pair of concepts.

Several studies have utilized knowledge descriptions for different tasks. Yu et al. (2020a) uses description text from Wikipedia for knowledge-text co-pretraining. Xie et al. (2016) encodes the semantics of entity descriptions in knowledge graphs to improve the performance on knowledge graph completion and entity classification. Chen et al. (2018) co-trains the knowledge graph embeddings and entity description representation for cross-lingual entity alignment. Concurrent with our work, Chen et al. (2020) also inserts knowledge descriptions into commonsense question answering. Compared with our work, the method proposed in Chen et al. (2020) is much more complex, e.g. it involves training additional rankers on retrieved text, while our result outperforms Chen et al. (2020) on CommonsenseQA.

Method

Problem formulation. In this paper, we focus on the following QA task: given a commonsense question $q$ , select the correct answer from several choices $c_{1},...,c_{n}$ . In most cases, the question does not contain any mentions of the answer. Therefore, external knowledge sources can be used to provide additional information. We adopt ConceptNet (Speer et al., 2017) as our knowledge graph $G=(V,E)$ , which contains over 8 million entities as nodes and over 21 million relations as edges. In the following, we use triple to refer to two neighboring nodes and the edge connecting them, i.e. $(u\in V,p=(u,v)\in E,v\in V)$ , with $u$ being the subject, $p$ the relation, and $v$ the object.

Suppose the question mentions an entity $e_{q}\in V$ and the choice contains an entity $e_{c}\in V$ . (CommonsenseQA provides the question and choice entities. For OpenBookQA, we choose from the extracted entities that are most frequent in the retrieved facts; see the Appendix for details.) We then employ the KCR method (Lin, 2020) to select relation triples. If there is a direct edge $r$ from $e_{q}$ to $e_{c}$ in $G$ , we choose the triple ( $e_{q}$ , $r$ , $e_{c}$ ). Otherwise, we retrieve all the $N$ triples containing $e_{c}$ . Each triple $j$ is assigned a score $s_{j}$ , which is the product of its triple weight $w_{j}$ provided by ConceptNet and its relation type weight $t_{r_{j}}$ :

$$s_{j}=w_{j}\cdot t_{r_{j}}=w_{j}\cdot\frac{N}{N_{r_{j}}}$$

Here, $r_{j}$ is the relation type of the triple $j$ , and $N_{r_{j}}$ is the number of triples among the retrieved triples that have the relation type $r_{j}$ . In other words, this process favors rarer relation types. Finally, the triple with the highest score is chosen.
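The selection rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the relation type weight is $N/N_{r_{j}}$ (consistent with the statement that rarer relation types are favored), and the function name `select_triple` and the tuple layout `(subject, relation, object, weight)` are hypothetical.

```python
from collections import Counter


def select_triple(e_q, e_c, triples):
    """Pick one triple for a question entity e_q and a choice entity e_c.

    `triples` is a list of (subject, relation, object, weight) tuples,
    where `weight` stands in for the triple weight from ConceptNet.
    """
    # Case 1: a direct edge from e_q to e_c exists, so use it.
    for triple in triples:
        if triple[0] == e_q and triple[2] == e_c:
            return triple

    # Case 2: score every triple that contains e_c.
    candidates = [t for t in triples if e_c in (t[0], t[2])]
    if not candidates:
        return None

    n = len(candidates)                             # N
    rel_counts = Counter(t[1] for t in candidates)  # N_r for each relation type

    def score(triple):
        # s_j = w_j * N / N_{r_j} (assumed form of the relation type weight)
        return triple[3] * n / rel_counts[triple[1]]

    return max(candidates, key=score)


toy_triples = [
    ("bat", "CapableOf", "fly", 1.0),
    ("bird", "CapableOf", "fly", 2.0),
    ("fly", "Antonym", "walk", 1.5),
    ("plane", "CapableOf", "fly", 1.0),
]
print(select_triple("insect", "fly", toy_triples))
```

In the toy data, the single `Antonym` triple receives a larger relation type weight than any of the three `CapableOf` triples, which illustrates how the rule favors rare relation types.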

Contextual information

The retrieved entities and relations from the knowledge graph are described only by their surface forms. Without additional context, it is hard for the language model to understand their exact meaning, especially for proper nouns.

Therefore, we leverage large-scale online dictionaries to provide definitions as context. We use a dump of Wiktionary (https://www.wiktionary.org/), which includes definitions of 999,614 concepts. For every concept, we choose its first definition entry in Wiktionary as the description. For every question/choice concept, we find its closest match in Wiktionary by trying the following forms in order: i) original form; ii) lemma form by spaCy (Honnibal and Montani, 2017); iii) base word (last word). For example, the concept “taking notes” does not appear in its original form in Wiktionary, but its lemma form “take notes” does, and we get its description text: “To make a record of what one hears or observes for future reference”. In this way, we find descriptions of all entities in our experiments. The descriptions of the question and choice concepts are denoted by $d_{q}$ and $d_{c}$ , respectively.
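The three-step matching order can be sketched as below. This is only an illustration under stated assumptions: `wiktionary` is a hypothetical in-memory dictionary mapping a term to its list of definitions, and `lemmatize` is a caller-supplied function standing in for the spaCy lemmatizer, so the snippet stays self-contained.

```python
def find_description(concept, wiktionary, lemmatize):
    """Return (matched_form, first_definition), or None if no form matches.

    Forms are tried in order: original form, lemma form, last word.
    """
    forms = [concept, lemmatize(concept)]
    words = concept.split()
    if words:
        forms.append(words[-1])  # base word (last word)

    for form in forms:
        definitions = wiktionary.get(form)
        if definitions:
            # The first definition entry is used as the description.
            return form, definitions[0]
    return None


# Toy data; a real setup would load a Wiktionary dump instead.
toy_wiktionary = {
    "take notes": ["To make a record of what one hears or observes for future reference"],
    "notes": ["plural of note"],
}


def toy_lemmatize(text):
    # Hypothetical stand-in for a real lemmatizer.
    return {"taking notes": "take notes"}.get(text, text)


print(find_description("taking notes", toy_wiktionary, toy_lemmatize))
```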

Finally, we feed the question, choice, descriptions and the triple selected in the retrieval step above into the ALBERT model (Lan et al., 2019) in the following format: [CLS] $q$ $c$ [SEP] $e_{q}$ : $d_{q}$ [SEP] $e_{c}$ : $d_{c}$ [SEP] triple.
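As a rough illustration of this input layout, the sketch below assembles the sequence as a plain string. The function name `build_input` and the way the triple is verbalized (subject, relation, object joined by spaces) are assumptions, since the text does not specify how the triple is rendered. In practice a tokenizer would typically insert the special tokens itself.

```python
def build_input(q, c, e_q, d_q, e_c, d_c, triple):
    """Assemble: [CLS] q c [SEP] e_q: d_q [SEP] e_c: d_c [SEP] triple."""
    subject, relation, obj = triple
    triple_text = f"{subject} {relation} {obj}"  # assumed verbalization
    return (
        f"[CLS] {q} {c} "
        f"[SEP] {e_q}: {d_q} "
        f"[SEP] {e_c}: {d_c} "
        f"[SEP] {triple_text}"
    )


print(build_input(
    q="Where would you hang a famous painting?",
    c="museum",
    e_q="painting",
    d_q="An illustration or artwork done with the use of paint",
    e_c="museum",
    d_c="A building or institution dedicated to the display of objects",
    triple=("painting", "AtLocation", "museum"),
))
```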

Reasoning

On top of ALBERT, we leverage an attention-based weighted sum and a softmax layer to generate the relevance score for the question-choice pair. In detail, suppose the output representations of ALBERT are $({\bm{x}}_{0},...,{\bm{x}}_{m})$ , where ${\bm{x}}_{i}\in R^{d}$ . We compute a weighted sum of these embeddings based on attention:

$$q_{i}={\bm{u}}^{T}{\bm{x}}_{i},\quad \alpha_{i}=\mbox{softmax}(q_{i}),\quad {\bm{v}}=\sum_{i=0}^{m}\alpha_{i}{\bm{x}}_{i}$$

where ${\bm{u}}\in R^{d}$ is a parameter vector. The relevance score between the question and the choice is then $s=\mbox{softmax}({\bm{v}}^{T}{\bm{b}})$ , where ${\bm{b}}\in R^{d}$ is a parameter vector and the softmax is computed over all choices for the cross-entropy loss function.
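The scoring head can be sketched in plain Python as follows. This is a toy illustration, not the trained model: it assumes the attention weights are a softmax over ${\bm{u}}^{T}{\bm{x}}_{i}$ and that ${\bm{v}}$ is the resulting weighted sum of the token outputs, and all function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_pool(xs, u):
    """Weighted sum v of token vectors xs, with weights softmax(u^T x_i)."""
    alphas = softmax([dot(u, x) for x in xs])
    dim = len(xs[0])
    return [sum(a * x[k] for a, x in zip(alphas, xs)) for k in range(dim)]


def choice_probabilities(outputs_per_choice, u, b):
    """Softmax over choices of the logits v^T b, one logit per choice."""
    logits = [dot(attention_pool(xs, u), b) for xs in outputs_per_choice]
    return softmax(logits)


# Two choices, each with three 2-dimensional token outputs (toy numbers).
outputs = [
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]],
    [[0.2, 0.1], [0.3, 0.0], [0.1, 0.4]],
]
print(choice_probabilities(outputs, u=[0.1, -0.2], b=[1.0, 0.5]))
```

Because the final softmax runs across choices, the scores for one question sum to one, which is what the cross-entropy loss expects.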

The architecture of our model DEKCOR and the construction of its input are shown in Fig. 1.

Experiments

We evaluate our model on two benchmark datasets of multiple-choice questions for commonsense question answering: CommonsenseQA (Talmor et al., 2018) and OpenBookQA (Mihaylov et al., 2018). CommonsenseQA creates questions from ConceptNet entities and relations; OpenBookQA probes elementary science knowledge from a book of 1,326 facts. The statistics of the datasets are provided in Table 1. For OpenBookQA, we follow prior approaches (Wang et al., 2020) and append the top 5 retrieved facts provided by the Aristo team (Clark et al., 2019) to the input. We also pre-train our OpenBookQA model on CommonsenseQA’s training set, as we find that this boosts performance.

We compare our models with state-of-the-art baselines, all of which employ pre-trained models, including RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2019) and T5 (Raffel et al., 2019). Some of them adopt additional modules to process knowledge information. A detailed description of the baselines is in the Appendix.

Results

CommonsenseQA. Table 2 shows the accuracy on the test set of CommonsenseQA. For a fair comparison, we categorize the results into single models and ensemble models. Our ensemble model consists of 7 single models with different initialization random seeds, and its output is the majority of choices selected by these single models. More implementation details are shown in the Appendix.
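The majority-vote ensembling described above can be sketched as below. The text does not say how ties are broken, so this sketch assumes ties go to the label that appears first in the prediction list; the function name is hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the single-model predictions.

    Ties are resolved in favor of the label seen first (an assumption;
    the tie-breaking rule is not specified in the text).
    """
    counts = Counter(predictions)
    # max() returns the first maximal key in insertion order.
    return max(counts, key=counts.get)


# Predictions from 7 single models for one question (toy labels).
print(majority_vote(["A", "C", "A", "B", "A", "C", "A"]))
```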

Our proposed DEKCOR outperforms the previous state-of-the-art result by 1.2% (single model) and 3.8% (ensemble model). This demonstrates the effectiveness of the usage of knowledge description to provide context.

Furthermore, we notice two trends based on the results. First, the underlying pre-trained language model is important in commonsense QA quality. In general, we observe this order of accuracy among these language models: BERT $<$ RoBERTa $<$ XLNet $<$ ALBERT $<$ T5. Second, the additional knowledge module is critical to provide external information for reasoning. For example, RoBERTa+KEDGN outperforms the vanilla RoBERTa by 1.9%, and our model outperforms the vanilla ALBERT model by 6.8% in accuracy.

OpenBookQA. Table 3 shows the test set accuracy on OpenBookQA. All results are from single models. Note that the two best-performing models, i.e. UnifiedQA (Khashabi et al., 2020) and TTTTT (Raffel et al., 2019), are based on the T5 generation model, with 11B and 3B parameters respectively. Thus, they are computationally very expensive. Apart from these T5-based systems, DEKCOR achieves the best accuracy among all baselines.

Ablation study. Table 4 shows that using concept descriptions from Wiktionary and triples from ConceptNet improves the accuracy of DEKCOR on the dev set of CommonsenseQA by 2.7% and 4.4% respectively. We observe similar results on OpenBookQA. This demonstrates that additional context information helps to fuse the knowledge graph into language modeling for commonsense question answering.

Case Study. Table 5 shows two examples, from CommonsenseQA and OpenBookQA respectively. In the first example, without the additional description the model would not know relevant information about bats, such as the fact that they are insectivorous, leading to the wrong answer “eating bugs”. With the description, the model knows that bats eat bugs, so it chooses “laying eggs” as the answer. Similarly, for the second question, the “sharp teeth and very strong jaws” in the description hint that alligators are likely carnivorous, and reptiles are likely cold-blooded. The entity description leads to the correct answer of “eat gar”.

Conclusions

In this paper, we propose to fuse context information into knowledge graphs for commonsense question answering. As a knowledge graph often lacks descriptions for the contained entities and relations, we leverage Wiktionary to provide definitive text for each entity as additional input to the pre-trained language model ALBERT. The resulting DEKCOR model achieves state-of-the-art results on CommonsenseQA and, on OpenBookQA, the best accuracy among all baselines other than two large T5-based models. Ablation studies demonstrate the effectiveness of the proposed usage of knowledge description and knowledge triple information in commonsense question answering.

Acknowledgements

We thank the anonymous reviewers for their valuable comments.

References

Appendix A Implementation Details

Identification of $e_{q}$ and $e_{c}$ . CommonsenseQA specifies the question entity in each question and each answer choice is also an entity in ConceptNet. We use them as $e_{q}$ and $e_{c}$ . For OpenBookQA, we identify all ConceptNet entities in the question and answer text and count their number of occurrences in the retrieved text. For a triple $(e_{q},r,e_{c})$ , we define its weight as $n_{e_{q}}+n_{e_{c}}$ , where $n_{e}$ is the number of occurrences in retrieved text. The edge with the largest weight is picked. If no edge is found between question and answer entities, we use the answer entity with the most occurrences to find triples. For Wiktionary descriptions, we find descriptions for $e_{q}$ and $e_{c}$ with the most occurrences as well.
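The OpenBookQA entity selection above can be sketched as follows. This is an illustrative reading of the procedure, not the authors' code: `edges` is a hypothetical dictionary from an entity pair to its relation, `occurrences` holds the per-entity counts in the retrieved text, and the function returns either the chosen triple or the fallback answer entity.

```python
def select_openbookqa_entities(question_entities, answer_entities, edges, occurrences):
    """Return (triple, fallback_entity); exactly one is non-None when possible.

    A triple (e_q, r, e_c) is weighted by n_{e_q} + n_{e_c}. If no edge links
    any question entity to any answer entity, the answer entity with the most
    occurrences is returned so that triples containing it can be retrieved.
    """
    best_triple, best_weight = None, -1
    for e_q in question_entities:
        for e_c in answer_entities:
            relation = edges.get((e_q, e_c))
            if relation is None:
                continue
            weight = occurrences.get(e_q, 0) + occurrences.get(e_c, 0)
            if weight > best_weight:
                best_triple, best_weight = (e_q, relation, e_c), weight

    if best_triple is not None:
        return best_triple, None
    if not answer_entities:
        return None, None
    fallback = max(answer_entities, key=lambda e: occurrences.get(e, 0))
    return None, fallback


toy_occurrences = {"alligator": 3, "gar": 2, "teeth": 1}
toy_edges = {("alligator", "gar"): "Desires", ("teeth", "gar"): "PartOf"}
print(select_openbookqa_entities(["alligator", "teeth"], ["gar"], toy_edges, toy_occurrences))
```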

Using ConceptNet. Since ConceptNet contains a lot of weak relations, we only use the following relations for our triples: CausesDesire, HasProperty, CapableOf, PartOf, AtLocation, Desires, HasPrerequisite, HasSubevent, Antonym, Causes.
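A relation whitelist like the one above could be applied with a simple filter. The sketch below assumes triples are `(subject, relation, object)` tuples with relation names spelled as in the list; the names `ALLOWED_RELATIONS` and `filter_triples` are hypothetical.

```python
# Relation types kept for triples, as listed in the text.
ALLOWED_RELATIONS = frozenset({
    "CausesDesire", "HasProperty", "CapableOf", "PartOf", "AtLocation",
    "Desires", "HasPrerequisite", "HasSubevent", "Antonym", "Causes",
})


def filter_triples(triples):
    """Keep only triples whose relation type is in the whitelist."""
    return [t for t in triples if t[1] in ALLOWED_RELATIONS]


toy_triples = [
    ("bat", "CapableOf", "fly"),
    ("bat", "RelatedTo", "cave"),  # dropped: not in the whitelist
    ("bat", "AtLocation", "cave"),
]
print(filter_triples(toy_triples))
```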

Optimization. We use the AdamW (Loshchilov and Hutter, 2017) optimizer with a learning rate of 2e-5. The batch size is 8. We limit the maximum length of the input sequence to 192 tokens. The model is trained for 10 epochs. We use the Huggingface (Wolf et al., 2019) implementation for the ALBERT model.

Appendix B Baseline Methods

GraphReason (Lv et al., 2020) retrieves knowledge from both a structured knowledge base and plain text. PG-FULL (Wang et al., 2020) fine-tunes GPT-2 on ConceptNet to generate knowledgeable paths between knowledge graph concepts. UnifiedQA (Khashabi et al., 2020) pre-trains T5 on a variety of QA datasets for general QA tasks. MHGRN (Feng et al., 2020) adopts the multi-hop graph relation network to perform reasoning. HyKAS (Ma et al., 2019) employs an option comparison network to consume ConceptNet triples. ALBERT+KRD retrieves commonsense knowledge from Open Mind Common Sense and then uses a self-attention module to compute a weighted sum of these triple representations. BERT + Selection (Banerjee et al., 2019) improves the result on OpenBookQA via abductive information retrieval, information gain based re-ranking, passage selection and weighted scoring. ALBERT+KB also improves retrieval results on OpenBookQA by token-based and embedding-based retrieval. TTTTT (Raffel et al., 2019) fine-tunes the T5 language generation model on OpenBookQA.