Graph-Based Reasoning over Heterogeneous External Knowledge for Commonsense Question Answering

Shangwen Lv, Daya Guo, Jingjing Xu, Duyu Tang, Nan Duan, Ming Gong, Linjun Shou, Daxin Jiang, Guihong Cao, Songlin Hu

Introduction

Reasoning is an important and challenging task in artificial intelligence and natural language processing, which is “the process of drawing conclusions from the principles and evidence” (?). The “evidence” is the fuel and the “principle” is the machine that operates on the fuel to make predictions. The majority of studies typically only take the current datapoint as the input, in which case the important “evidence” of the datapoint from background knowledge is ignored.

In this work, we study commonsense question answering, a challenging task that requires machines to collect background knowledge and reason over it to answer questions. For example, the influential dataset CommonsenseQA (?) is built so that the answer choices share the same relation with the concept in the question, while annotators are asked to use their background knowledge to create questions for which only one choice is correct. Figure 1 shows an example that requires multiple external knowledge sources to make the correct prediction. The structured evidence from ConceptNet narrows the candidates to choices (A, C), while the evidence from Wikipedia narrows them to choices (C, E). Combining both kinds of evidence yields the correct answer (C).

Approaches have been proposed in recent years for extracting evidence and reasoning over it. Typically, they either generate evidence from human-annotated evidence (?) or extract evidence from a homogeneous knowledge source such as the structured knowledge base ConceptNet (?; ?; ?) or Wikipedia plain texts (?; ?; ?), but they fail to take advantage of both knowledge sources simultaneously. Structured knowledge sources contain valuable structural relations between concepts, which are beneficial for reasoning, but they suffer from low coverage. Plain texts provide abundant, high-coverage evidence that is complementary to the structured knowledge.

In this work, we study commonsense question answering by using automatically collected evidence from heterogeneous external knowledge. Our approach consists of two parts: knowledge extraction and graph-based reasoning. In the knowledge extraction part, we automatically extract graph paths from ConceptNet and sentences from Wikipedia. To better use the relational structure of the evidence, we construct graphs for both sources, including extracted graph paths from ConceptNet and triples derived from Wikipedia sentences by Semantic Role Labeling (SRL). In the graph-based reasoning part, we propose a graph-based approach to make better use of the graph information. We contribute by developing two graph-based modules, including (1) a graph-based contextual word representation learning module, which utilizes graph structural information to re-define the distance between words for learning better contextual word representations, and (2) a graph-based inference module, which first adopts Graph Convolutional Network (?) to encode neighbor information into the representations of nodes, followed by a graph attention mechanism for evidence aggregation.

We conduct experiments on the CommonsenseQA benchmark dataset. Results show that both the graph-based contextual representation learning module and the graph-based inference module boost the performance. We also demonstrate that incorporating both knowledge sources can bring further improvements. Our approach achieves the state-of-the-art accuracy (75.3%) on the CommonsenseQA dataset.

The contributions of this paper can be summarized as follows:

We introduce a graph-based approach to leverage evidence from heterogeneous knowledge sources for commonsense question answering.

We propose a graph-based contextual representation learning module and a graph-based inference module to make better use of the graph information for commonsense question answering.

Results show that our model achieves a new state-of-the-art performance on the CommonsenseQA dataset.

Task Definition and Dataset

This paper uses CommonsenseQA (?), an influential dataset for the commonsense question answering task, for experiments. Formally, given a natural language question $Q$ containing $m$ tokens $\{q_1, q_2, \cdots, q_m\}$ and $5$ choices $\{a_1, a_2, \cdots, a_5\}$, the goal is to distinguish the right answer from the wrong ones; accuracy is adopted as the metric. Annotators are required to use their background knowledge to write questions for which only one of the choices is correct, which makes the task more challenging. Because no evidence is provided with the questions, the model needs strong commonsense knowledge extraction and reasoning abilities to get the right results.

Approach Overview

In this section, we give an overview of our approach. As shown in Figure 2, our approach contains two parts: knowledge extraction and graph-based reasoning. In the knowledge extraction part, we extract knowledge from the structured knowledge base ConceptNet and from Wikipedia plain texts according to the given question and choices. We construct graphs to utilize the relational structures of both sources. In the graph-based reasoning part, we propose two graph-based modules, a graph-based contextual word representation learning module and a graph-based inference module, to infer final answers. We describe each part in detail in the following sections.

Knowledge Extraction

In this section, we provide the methods to extract evidence from ConceptNet and Wikipedia given the question and choices. Furthermore, we describe the details of constructing graphs for both sources.

ConceptNet is a large-scale commonsense knowledge base containing millions of nodes and relations. A triple in ConceptNet contains four parts: two nodes, one relation, and a relation weight. For each question and choice, we first identify their entities in the given ConceptNet graph. Then we search for the paths (fewer than 3 hops) from question entities to choice entities and merge the covered triples into a graph where nodes are triples and edges are the relations between triples. If two triples $s_i$ and $s_j$ contain the same entity, we add an edge from the previous triple $s_i$ to the next triple $s_j$. In order to obtain contextual word representations for ConceptNet nodes, we transfer each triple into a natural language sequence according to the relation template in ConceptNet. An example is shown in Figure 3. We denote the graph as Concept-Graph.
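The construction described above can be illustrated with a minimal sketch. The template strings, the function names, and the use of list order to decide which triple is "previous" are assumptions made for illustration; the actual ConceptNet templates and path ordering may differ.

```python
from itertools import combinations

# Hypothetical relation templates; ConceptNet's real template wording may differ.
TEMPLATES = {
    "HasA": "{head} has {tail}",
    "RelatedTo": "{head} is related to {tail}",
    "IsA": "{head} is a kind of {tail}",
}

def triple_to_sentence(triple):
    """Render a (head, relation, tail, weight) tuple as a natural language sentence."""
    head, relation, tail, _weight = triple
    return TEMPLATES[relation].format(head=head, tail=tail)

def build_concept_graph(triples):
    """Return edges (i, j), i < j, for every pair of triples that share an entity.

    Nodes are triple indices. The edge direction follows list order, which is
    an assumption standing in for the order in which triples occur on a path.
    """
    edges = []
    for (i, a), (j, b) in combinations(enumerate(triples), 2):
        if {a[0], a[2]} & {b[0], b[2]}:
            edges.append((i, j))
    return edges

triples = [
    ("people", "HasA", "eyes", 1.0),
    ("eyes", "RelatedTo", "cry", 1.0),
    ("cry", "IsA", "sound", 1.0),
]
sentences = [triple_to_sentence(t) for t in triples]
edges = build_concept_graph(triples)
```

Here the first two triples share "eyes" and the last two share "cry", so the sketch links them into a chain.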

Knowledge Extraction from Wikipedia

We extract 107M sentences from Wikipedia (version enwiki-20190301) with spaCy (https://spacy.io/) and adopt Elasticsearch (https://www.elastic.co/) to index the Wikipedia sentences. We first remove stopwords from the given question and choices, then concatenate the remaining words as a query to the Elasticsearch engine. The engine ranks all the Wikipedia sentences by their matching scores against the query. We select the top $K$ sentences as the Wikipedia evidence, with $K = 10$ in our experiments.
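As a rough illustration of the query construction step, the sketch below removes stopwords and concatenates the remaining words. The stopword list is a tiny made-up one, and `top_k` ranks by simple word overlap as a stand-in for the search engine's matching score; neither reflects the actual configuration used in the experiments.

```python
# A tiny illustrative stopword list; the list actually used is not specified.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "who", "and", "of", "to", "do", "don't", "have"}

def build_query(question, choice):
    """Drop stopwords from the question and choice, then join the remaining words."""
    words = (question + " " + choice).lower().replace("?", " ").split()
    return " ".join(w for w in words if w not in STOPWORDS)

def top_k(query, sentences, k=10):
    """Rank sentences by word overlap with the query (a stand-in for the engine's score)."""
    query_words = set(query.split())
    ranked = sorted(
        sentences,
        key=lambda s: len(query_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]

query = build_query("Animals who have hair and don't lay eggs are what?", "mammals")
evidence = top_k(query, ["Guitars have strings.", "Very few mammals lay eggs."], k=1)
```

In a real setup the ranking would come from the indexed search engine rather than from `top_k`.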

To discover the structural information in the Wikipedia evidence, we construct a graph for it. We use Semantic Role Labeling (SRL) to extract triples (subject, predicate, object) from each sentence. Both arguments and predicates are nodes in the graph. We add two edges, <subject, predicate> and <predicate, object>, to the graph. To enhance the connectivity of the graph, we remove stopwords and add an edge from node $a$ to node $b$ according to the following rules: (1) node $a$ is contained in node $b$ and the number of words in $a$ is more than 3; (2) node $a$ and node $b$ differ in only one word and the numbers of words in $a$ and $b$ are both more than 3. An example is shown in Figure 4. We denote the graph as Wiki-Graph.
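The two connectivity rules can be sketched as follows. Two interpretations here are assumptions: "contained in" is read as a contiguous run of words, and "only one different word" is read as equal length with exactly one differing position. Nodes are given as word lists with stopwords already removed.

```python
def contains(a, b):
    """True if the word list a occurs as a contiguous run inside the word list b."""
    n = len(a)
    return any(b[i:i + n] == a for i in range(len(b) - n + 1))

def one_word_apart(a, b):
    """True if a and b have the same length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1

def should_link(a, b, min_words=3):
    """Apply rules (1) and (2) to decide whether to add an edge from node a to node b."""
    rule1 = len(a) > min_words and a != b and contains(a, b)
    rule2 = len(a) > min_words and len(b) > min_words and one_word_apart(a, b)
    return rule1 or rule2
```

For instance, `should_link(["few", "mammals", "lay", "eggs"], ["very", "few", "mammals", "lay", "eggs", "today"])` holds under rule (1), while a two-word node is never linked because it fails the length threshold.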

Graph-Based Reasoning

In this section, we present the model architecture of graph-based reasoning over the extracted evidence, shown in Figure 5. Our graph-based model consists of two modules: a graph-based contextual representation learning module and a graph-based inference module. The first module learns better contextual word representations by using graph information to re-define the distance between words. The second module gets node representations via Graph Convolutional Network (?) by using neighbor information and aggregates graph representations to make final predictions.

It is well accepted that pre-trained models have a strong text understanding ability and have achieved state-of-the-art results on a variety of natural language processing tasks. We use XLNet (?) as the backbone here, which is a successful pre-trained model with the advantage of capturing long-distance dependencies. A simple way to get the representation of each word is to concatenate all the evidence as a single sequence and feed the raw input into XLNet. However, this would place words mentioned in different evidence sentences far apart, even though they are semantically related. Therefore, we use the graph structure to re-define the relative positions between evidence words. In this way, semantically related words have closer relative positions, and the internal relational structures in the evidence are used to obtain better contextual word representations.

Specifically, we develop an efficient way of using a topology sort algorithm to re-order the input evidence according to the constructed graphs. (We also tried to re-define the relative positions between two word tokens and obtain a position matrix according to the token distances in the graph. However, this consumes too much memory and cannot be executed efficiently.) For structured knowledge, ConceptNet triples are not represented as natural language, so we use the relation template provided by ConceptNet to transfer a triple into a natural language sentence. For example, “mammals HasA hair” will be transferred to “mammals has hair”. In this way, we get a set of sentences $S_T$ based on the triples in the extracted graph. We then get the re-ordered ConceptNet evidence $S_T'$ with the method shown in Algorithm 1. The output for Figure 3 is <“people has eyes”, “eyes is related to cry”, “people can do singing”, “cry is a kind of sound”, “singing requires sound”, “sound is related to playing guitar”>, which shortens the distances between triples that are more similar to each other. For Wikipedia sentences, we construct a sentence graph whose nodes are the evidence sentences $S$. For two sentences $s_i$ and $s_j$, if there is an edge <$p$, $q$> in Wiki-Graph where $p$ and $q$ are in $s_i$ and $s_j$ respectively, there will be an edge <$s_i$, $s_j$> in the sentence graph. We get a sorted evidence sequence $S'$ by the method in Algorithm 1. In Algorithm 1, the relations $R$ are a set of edges, and an edge $r$ = <$x$, $y$> denotes the edge from node $x$ to node $y$. The incident edges for $s_i$ are the edges from other nodes to the node $s_i$.
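To make the re-ordering idea concrete, here is a minimal sketch of a topological ordering over evidence sentences. It uses the textbook Kahn variant with a simple cycle fallback, which is an assumption: it is not claimed to reproduce the exact procedure or output of Algorithm 1. The function name and the tie-breaking rule are illustrative.

```python
from collections import deque

def topological_order(nodes, edges):
    """Order nodes so that each edge (x, y) places x before y where possible.

    Uses Kahn's algorithm. If a cycle blocks progress, the first unplaced node
    in input order is emitted to break the cycle (an illustrative choice).
    """
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for x, y in edges:
        children[x].append(y)
        indegree[y] += 1  # count incident (incoming) edges

    ready = deque(n for n in nodes if indegree[n] == 0)
    done, order = set(), []
    while len(order) < len(nodes):
        if not ready:  # cycle: fall back to the first unplaced node
            ready.append(next(n for n in nodes if n not in done))
        node = ready.popleft()
        if node in done:
            continue
        done.add(node)
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0 and child not in done:
                ready.append(child)
    return order

s1, s2, s3 = "people has eyes", "eyes is related to cry", "cry is a kind of sound"
sorted_evidence = topological_order([s3, s1, s2], [(s1, s2), (s2, s3)])
```

With the chain of edges above, the shuffled input is restored to an order in which connected sentences sit next to each other, which is the property the re-ordering relies on.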

Formally, the input of XLNet is the concatenation of the sorted ConceptNet evidence sentences $S_T'$, the sorted Wikipedia evidence sentences $S'$, the question $q$, and the choice $c$. The output of XLNet is the contextual word piece representations and the input representation <cls>. By transferring the extracted graphs into natural language texts, we fuse the two heterogeneous knowledge sources into the same representation space.

Graph-Based Inference Module

The XLNet-based model mentioned in the previous subsection provides effective word-level clues for making predictions. Beyond that, the graph provides more semantic-level information of evidence at a more abstract layer, such as the subject/object of a relation. A more desirable way is to aggregate evidence at the graph-level to make final predictions.

Specifically, we regard the two evidence graphs Concept-Graph and Wiki-Graph as one graph and adopt Graph Convolutional Networks (GCNs) (?) to obtain node representations by encoding graph-structural information.

To propagate information among evidence and reason over the graph, GCNs update node representations by pooling features of their adjacent nodes. Because relational GCNs usually over-parameterize the model (?; ?), we apply GCNs on the undirected graph.

The $i$-th node representation $h_i^0$ is obtained by averaging the hidden states of the corresponding evidence in the output of XLNet and reducing the dimension via a non-linear transformation:

$$h_i^0 = \sigma\left(W \cdot \frac{1}{|s_i|} \sum_{w_j \in s_i} h_{w_j}\right)$$

where $s_i = \{w_0, \cdots, w_t\}$ is the evidence corresponding to the $i$-th node, $h_{w_j}$ is the contextual token representation of XLNet for the token $w_j$, $W \in R^{d \times k}$ reduces the high dimension $d$ to the low dimension $k$, and $\sigma$ is an activation function.

In order to reason over the graph, we propagate information across evidence via two steps: aggregation and combination (?). The first step aggregates information from the neighbors of each node. The aggregated information $z_i^l$ for the $i$-th node can be formulated as Equation 2, where $N_i$ is the set of neighbors of the $i$-th node and $h_j^l$ is the $j$-th node representation at layer $l$. The representation $z_i^l$ contains the neighbor information for the $i$-th node at layer $l$, and we combine it with the transformed $i$-th node representation to get the updated node representation $h_i^{l+1}$:
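A form consistent with this description is sketched below. The layer-specific weight matrices $V^l$ and $W^l$, and the choice of averaging over neighbors, are assumptions introduced for illustration; only $z_i^l$, $N_i$, $h_j^l$, $h_i^{l+1}$ and the activation $\sigma$ come from the text.

```latex
\begin{align}
z_i^l &= \frac{1}{|N_i|} \sum_{j \in N_i} V^l h_j^l \\
h_i^{l+1} &= \sigma\left(W^l h_i^l + z_i^l\right)
\end{align}
```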

We utilize graph attention to aggregate graph-level representations to make the prediction. The graph representation is computed in the same way as multiplicative attention (?), where $h_i^L$ is the $i$-th node representation at the last layer, $h^c$ is the input representation <cls>, $\alpha_i$ is the importance of the $i$-th node, and $h^g$ is the graph representation:
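One common way to write multiplicative attention with these symbols is given below. The bilinear weight matrix $W_a$ and the softmax normalization are assumptions; the exact scoring function used in the model may differ.

```latex
\begin{align}
\alpha_i &= \frac{\exp\left((h^c)^\top W_a h_i^L\right)}{\sum_{j} \exp\left((h^c)^\top W_a h_j^L\right)} \\
h^g &= \sum_{i} \alpha_i h_i^L
\end{align}
```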

We concatenate the input representation $h^c$ with the graph representation $h^g$ as the input of a Multi-Layer Perceptron (MLP) to compute the confidence score $score(q, a)$. The probability of the answer candidate $a$ for the question $q$ can be computed as follows, where $A$ is the set of candidate answers:
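Assuming the scores are normalized with a softmax over the candidate set, as the training setup suggests, the probability can be written as:

```latex
p(a \mid q) = \frac{\exp\left(score(q, a)\right)}{\sum_{a' \in A} \exp\left(score(q, a')\right)}
```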

Finally, we select the answer with the highest confidence score as the predicted answer.
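The aggregation, combination and attention steps of this module can be illustrated with a small pure-Python sketch. The weight matrices `V`, `W` and `W_a`, the use of `tanh` as the activation, and the neighbor averaging are all assumptions chosen for illustration, not the model's actual parameterization.

```python
import math

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def gcn_layer(h, neighbors, V, W):
    """One aggregate-and-combine step over all nodes."""
    updated = []
    for i, h_i in enumerate(h):
        msgs = [matvec(V, h[j]) for j in neighbors[i]]
        # Aggregation: average the transformed neighbor vectors (zeros if isolated).
        z = [sum(m[d] for m in msgs) / len(msgs) if msgs else 0.0 for d in range(len(V))]
        # Combination: add the node's own transformed vector, then apply tanh.
        own = matvec(W, h_i)
        updated.append([math.tanh(o + a) for o, a in zip(own, z)])
    return updated

def graph_attention(h, h_c, W_a):
    """Softmax over bilinear scores h_c . (W_a h_i), then a weighted sum of node vectors."""
    logits = [sum(c * p for c, p in zip(h_c, matvec(W_a, h_i))) for h_i in h]
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    alphas = [e / total for e in exps]
    h_g = [sum(a * h_i[d] for a, h_i in zip(alphas, h)) for d in range(len(h[0]))]
    return alphas, h_g

identity = [[1.0, 0.0], [0.0, 1.0]]
h0 = [[1.0, 0.0], [0.0, 1.0]]  # two toy node vectors
h1 = gcn_layer(h0, [[1], [0]], identity, identity)  # nodes 0 and 1 are neighbors
alphas, h_g = graph_attention(h1, [1.0, 0.0], identity)
```

In this toy graph both nodes end up with the same representation after one layer, so the attention weights are equal; with learned weights, the graph vector `h_g` would then be concatenated with the <cls> representation and scored by an MLP.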

Experiments

In this section, we conduct experiments to prove the effectiveness of our proposed approach. To dig into our approach, we perform ablation studies to explore the different effects of heterogeneous knowledge sources and graph-based reasoning models. We study a case to show how our model can utilize the extracted evidence to get the right answer. We also show some error cases to point directions to improve our model.

The CommonsenseQA (?) dataset contains 12,102 examples, including 9,741 for training, 1,221 for development and 1,140 for test.

We select XLNet large cased (?) as the pre-trained model. We prepend “The answer is” to each choice to turn it into a sentence. The input format for each choice is “<evidence> <sep> question <sep> The answer is <choice> <cls>”. In total, we get 5 confidence scores, one per choice, and apply the softmax function to them to calculate the loss between the predictions and the ground truth. We adopt cross-entropy loss as our loss function. In our best model on the development dataset, we set the batch size to 4 and the learning rate to 5e-6. We set the maximum input length to 256. We use Adam (?) with $\beta_1$ = 0.9 and $\beta_2$ = 0.999 for optimization. We set the number of GCN layers to 1. We train our model for 2,800 steps (about one epoch) and obtain 79.3% on the development dataset and 75.3% on the blind test dataset.
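The input format and the loss can be sketched as below. The helper names are hypothetical, and the scores would in practice come from the model rather than being supplied by hand.

```python
import math

def build_input(evidence, question, choice):
    """Assemble the input string for one choice in the format described above."""
    return f"{evidence} <sep> {question} <sep> The answer is {choice} <cls>"

def cross_entropy(scores, gold_index):
    """Softmax over the per-choice scores, then negative log-likelihood of the gold choice."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[gold_index]

def predict(scores):
    """Return the index of the choice with the highest confidence score."""
    return max(range(len(scores)), key=lambda i: scores[i])

text = build_input("mammals has hair", "Animals who have hair are what?", "mammals")
loss = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], gold_index=2)
```

With five equal scores the loss equals log 5, the value expected from a uniform guess over five choices.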

Baselines

For the compared methods, we select models and classify them into 4 groups. Group 1: models without descriptions or papers, Group 2: models without extracted knowledge, Group 3: models with extracted structured knowledge and Group 4: models with extracted unstructured knowledge.

Group 1: models without description or papers. These models include SGN-lite, BECON (single), BECON (ensemble), CSR-KG and CSR-KG (AI2 IR).

Group 2: models without extracted knowledge, including BERT-large (?), XLNet-large (?) and RoBERTa (?). These models adopt pre-trained language models to finetune on the training data and make predictions directly on the test dataset without extracted knowledge.

Group 3: models with extracted structured knowledge, including KagNet (?), BERT + AMS (?) and BERT + CSPT. These models utilize the structured knowledge base ConceptNet to enhance the model's predictions. KagNet extracts schema graphs from ConceptNet and utilizes a hierarchical path-based attention mechanism to infer answers. BERT + AMS constructs a commonsense-related multi-choice question answering dataset according to ConceptNet and pre-trains on the generated dataset. BERT + CSPT first trains a generation model to generate synthetic data from ConceptNet, then finetunes RoBERTa on the synthetic data and the Open Mind Common Sense (OMCS) corpus.

Group 4: models with extracted unstructured knowledge, including CoS-E (?), HyKAS, BERT + OMCS, AristoBERTv7, DREAM, RoBERTa + KE, RoBERTa + IR and RoBERTa + CSPT. CoS-E (?) constructs human-annotated evidence for each question and generates evidence for test data. The HyKAS and BERT + OMCS models pre-train the BERT whole word masking model on the OMCS corpus. AristoBERTv7 utilizes information from the machine reading comprehension dataset RACE (?) and extracts evidence from text sources such as Wikipedia, SimpleWikipedia, etc. DREAM adopts XLNet-large as the baseline and extracts evidence from Wikipedia. RoBERTa + KE, RoBERTa + IR and RoBERTa + CSPT adopt RoBERTa as the baseline and utilize evidence from Wikipedia, a search engine and OMCS, respectively.

It should be noted that these methods utilize evidence from either structured or unstructured knowledge sources, failing to take advantage of both sources simultaneously. RoBERTa + CSPT adopts knowledge from ConceptNet and OMCS, but the model pre-trains on the sources without explicit reasoning over the evidence, which is different from our approach.

Experiment Results and Analysis

The results on the CommonsenseQA development dataset and blind test dataset are shown in Table 1. Our model achieves the best performance on both datasets. In the following comparisons, we focus on the results on the test dataset. Compared with the models in group 1, our model achieves more than 10% higher absolute accuracy. Compared with the models without extracted knowledge in group 2, our model also enjoys a 2.8% absolute gain over the strong baseline RoBERTa (ensemble). XLNet-large is our baseline model, and our approach obtains a 12.4% absolute improvement over it, which demonstrates the effectiveness of our approach. Compared with the models with extracted structured knowledge in group 3, our model extracts graph paths from ConceptNet for graph-based reasoning rather than for pre-training, and we also extract evidence from Wikipedia plain texts, which brings 13.1% and 5.7% gains over BERT + AMS and RoBERTa + CSPT respectively. Group 4 contains models that utilize unstructured knowledge such as Wikipedia or OMCS. Compared with these methods, we not only utilize Wikipedia to provide unstructured evidence but also construct graphs to capture the structural information. We also utilize the evidence from the structured knowledge base ConceptNet. Our model achieves a 3.2% absolute improvement over the best model in this group, RoBERTa + IR.

Ablation Study

In this section, we perform ablation studies on the development dataset to examine the effectiveness of the different components in our model (the dataset allows test results to be submitted no more often than every two weeks). We first explore the effect of the different components in graph-based reasoning. Then we examine the heterogeneous knowledge sources and their effects.

In the graph-based reasoning part, we examine the effect of the topology sort algorithm for learning contextual word representations and of graph inference with GCN and graph attention. We select XLNet + Evidence as the baseline. In the baseline, we simply concatenate all the evidence as input to XLNet and adopt the contextual representation for prediction. Adding topology sort yields a 1.9% gain over the baseline. This shows that the topology sort algorithm can fuse the graph structure information and change the relative positions between words for better contextual word representations. The graph inference module brings a 1.4% gain, showing that GCN can obtain proper node representations and that graph attention can aggregate both word and node representations to infer answers. Finally, adding topology sort and the graph inference module together yields a 3.5% improvement, showing that the two modules are complementary.

Then we perform ablation studies on knowledge sources to see the effectiveness of the ConceptNet and Wikipedia sources. The results are shown in Table 3, where “None” means that we adopt only the XLNet (?) large model as the baseline. When we add one knowledge source, the corresponding graph-based reasoning models are also added. From the results, we see that the structured knowledge source ConceptNet brings a 6.4% absolute improvement and the Wikipedia source brings a 4.6% absolute improvement. This shows the benefit of each source on its own. When combining ConceptNet and Wikipedia, we obtain a 9.4% absolute gain over the baseline. This shows that heterogeneous knowledge sources achieve better performance than a single source and that the different sources in our model are complementary to each other.

Case Study

In this section, we select a case to show that our model can utilize the heterogeneous knowledge sources to answer questions. As shown in Figure 6, the question is “Animals who have hair and don’t lay eggs are what?” and the answer is “mammals”. The first three nodes are from the ConceptNet evidence graph. We can see that “mammals is animals” and “mammals has hair” provide information about the relation between “mammals” and the two concepts “animals” and “hair”. More evidence is needed to show the relation between “lay eggs” and “mammals”. The last three nodes are from the Wikipedia evidence graph, and they provide the information that “very few mammals lay eggs”. This example also shows that both sources are necessary to infer the right answer.

Error Analysis

We randomly select 50 error examples from the development dataset and classify the reasons into three categories: lack of evidence, similar evidence and dataset noise. There are 10 examples that lack evidence. For example, the first example in Figure 7 extracts no triples from ConceptNet, and the evidence from Wikipedia does not contain enough information to get the right answer. This problem can be alleviated by utilizing more advanced extraction strategies and adding more knowledge sources. There are 38 examples for which enough evidence is extracted but the evidence is too similar to distinguish between choices. For example, the second example in Figure 7 has two choices, “injury” and “puncture wound”, and the evidence from both sources provides similar information. More evidence from other knowledge sources is needed to alleviate this problem. We also find 2 error examples that contain two identical choices (example ids: e5ad2184e37ae88b2bf46bf6bc0ed2f4, fa1f17ca535c7e875f4f58510dc2f430).

Related Work

Commonsense Reasoning Commonsense reasoning is a challenging direction since it requires reasoning over external knowledge besides the inputs to predict the right answer. Various downstream tasks have been released to address this problem, such as ATOMIC (?), Event2Mind (?), MCScript 2.0 (?), SWAG (?), HellaSWAG (?) and Story Cloze Test (?).

The recently proposed CommonsenseQA (?) dataset is derived from ConceptNet (?), and its choices have the same relation with the concept in the question. Recently, ? (?) explores adding human-written explanations to solve the problem. ? (?) extracts evidence from ConceptNet to study this problem. This paper focuses on automatically extracting evidence from heterogeneous external knowledge and reasoning over the extracted evidence to study this problem.

Knowledge Transfer in NLP Transfer learning has played a vital role in the NLP community. Language models pre-trained on large-scale unstructured data, such as ELMo (?), GPT (?), BERT (?), XLNet (?) and RoBERTa (?), have achieved significant improvements on many tasks. This paper uses XLNet (?) as the backbone and proposes an approach to study the commonsense question answering problem.

Graph Neural Networks for NLP Recently, Graph Neural Networks (GNNs) have been utilized widely in NLP. For example, ? (?) utilizes Graph Convolutional Networks (GCNs) to jointly extract entities and relations. ? (?) applies GNNs to relation extraction over pruned dependency trees and achieves remarkable improvements. GNNs have also been applied to multi-hop reading comprehension tasks (?; ?; ?). This paper utilizes GCN to represent graph nodes by using the graph structure information, followed by graph attention, which aggregates the graph representations to make the prediction.

Conclusion

In this work, we focus on the commonsense question answering task and select the CommonsenseQA (?) dataset as the testbed. We propose an approach consisting of knowledge extraction and graph-based reasoning. In the knowledge extraction part, we extract evidence from heterogeneous external knowledge, including the structured knowledge source ConceptNet and Wikipedia plain texts. In the graph-based reasoning part, we propose a graph-based approach consisting of a graph-based contextual word representation learning module and a graph-based inference module to select the right answer. Results show that our model achieves state-of-the-art performance on the CommonsenseQA (?) dataset.

Acknowledgement

Songlin Hu is the corresponding author. We thank the anonymous reviewers for providing valuable suggestions.

References