JointLK: Joint Reasoning with Language Models and Knowledge Graphs for Commonsense Question Answering

Yueqing Sun, Qi Shi, Le Qi, Yu Zhang

Introduction

Commonsense question answering (CSQA) requires systems to acquire different types of commonsense knowledge and reasoning skills, which come naturally to humans but remain challenging for machines Talmor et al. (2019). Recently, large pre-trained language models (LMs) have achieved remarkable success in many QA tasks and appear to use implicit (factual) knowledge encoded in their model parameters during fine-tuning Liu et al. (2019); Raffel et al. (2020). Nevertheless, commonsense knowledge is self-evident to humans and is rarely expressed explicitly in natural language Gunning (2018), which makes it difficult for LMs to learn commonsense knowledge from the pre-training text corpus alone.

A widely explored line of research designs graph neural networks (GNNs) Scarselli et al. (2008) to perform reasoning over explicit, structured commonsense knowledge from external knowledge bases Vrandečić and Krötzsch (2014); Speer et al. (2017). Related methods usually follow a retrieval-and-modeling paradigm. First, the knowledge subgraphs or paths related to a given question are retrieved by string matching or semantic similarity; such retrieved structured information indicates the relations between concepts or implies the process of multi-hop reasoning. Second, the retrieved subgraphs are modeled by a well-designed graph neural network module Lin et al. (2019); Feng et al. (2020); Yasunaga et al. (2021) to perform reasoning over knowledge graphs.

However, these approaches have two main issues. First, the retrieved knowledge subgraph contains many noisy nodes. Whether retrieval relies on simple string matching or on semantic matching, retrieving sufficient relevant knowledge inevitably brings in noisy knowledge graph nodes Lin et al. (2019); Yasunaga et al. (2021). In particular, as the hop count increases, the number of irrelevant nodes expands dramatically, increasing the burden on the model. In the example in Figure 1, some graph nodes such as “wood”, “burn”, and “gas”, although related to some entities in the question and choice, can mislead the global understanding of the question. Second, there are limited interactions between the language representation and the knowledge graph representation. Specifically, existing LM+KG methods Lin et al. (2019); Feng et al. (2020) model the question context and knowledge subgraphs in isolation with LMs and GNNs, and perform only one shallow interaction to fuse their representations at the output for prediction. We argue that the limited interaction between the two modalities is the main bottleneck that may prevent the model from understanding the complex question-knowledge relations necessary to answer the question correctly.

Based on the above considerations, we propose JointLK, a model that performs fine-grained modality fusion and multi-layer joint reasoning between the language model and the knowledge graph (see Figure 2). Specifically, given a question and retrieved subgraphs, JointLK first obtains the representations of the two modalities using an LM encoder and a GNN encoder, respectively. Then we design a joint reasoning module that generates fine-grained bidirectional attention maps between each question token and each KG node to fuse the information from each modality into the other. Guided by the attention generated in the interaction process, the dynamic pruning module deletes irrelevant nodes so that the model reasons along the correct knowledge path. Multiple JointLK layers are stacked to form a hierarchy that supports multi-step interactions and recursive pruning. In summary, our contributions are three-fold:

We propose JointLK, a novel model that supports multi-step joint reasoning between LM and KG. It uses dense bidirectional attention to simultaneously update query-aware knowledge graph representation and knowledge-aware query representation, bridging the gap between the two information modalities.

We design a dynamic graph pruning module that recursively removes irrelevant graph nodes at each JointLK layer to ensure that the model reasons correctly with complete and appropriate evidence.

Experimental results show that JointLK is superior to current LM+KG methods, and the refined evidence is interpretable. Furthermore, through the multi-layer fusion of these two modalities, JointLK exhibits strong performance over previous state-of-the-art LM+KG methods in performing complex reasoning, such as solving questions with negation and complex questions with more entities.

Related Work

Commonsense question answering is challenging because the required commonsense knowledge is rarely given in the context of questions and answer choices or encoded in the parameters of pre-trained LMs. Therefore, many works obtain the required knowledge from external sources (e.g., KGs, corpora) to augment CSQA models. Because structured knowledge and unstructured text questions are heterogeneous, there are currently two main lines of research. Some works Lv et al. (2020); Bian et al. (2021); Xu et al. (2021) unify the two modalities at the model input, for example by transforming structured knowledge into plain text through templates or by transforming the question context into structured graphs. However, the original structural or textual information is inevitably lost during the conversion. Other works Lin et al. (2019); Feng et al. (2020); Yan et al. (2021) use an LM and a GNN to model the two modalities separately and perform shallow interactions at a late stage of the model, such as attentive pooling or simple concatenation of the two modal representations. Although this approach retains the original information of the question context and KGs, the limited interaction restricts the flow of information between the two modalities, which is the point we mainly improve on.

Recently, QA-GNN Yasunaga et al. (2021) explicitly views the QA context as an additional node, connects it and KG to form a joint graph, and mutually updates their representations through graph-based message passing. However, it pools the representation of the question context into a single node, which limits the updating of the text representation and fine-grained interaction between LM and GNN. Compared with prior works, we retain the individual structure of both modalities, consider fine-grained interaction between any token in question and any entity in KG through dense bidirectional attention, and perform multi-step joint reasoning by stacking several interaction layers. Furthermore, we gradually prune the KG size in each stacked model layer under the guidance of attention weights generated in the interactions, making the reasoning path transparent and interpretable.

Methodology

In this section, we introduce the task definition (§ 3.1) and our JointLK model. The model framework is shown in Figure 2. JointLK takes the query and the retrieved knowledge subgraph as input, and outputs a real value as the correctness score of the answer. The model is mainly composed of four parts: a query encoder, a GNN layer, a joint reasoning module and a dynamic pruning module, of which the latter three form a stack of N identical layers. We use a pre-trained language model to learn the query representation (§ 3.2), and use the GNN layer to learn the graph representation (§ 3.3). The joint reasoning module receives the representations of these two modalities and then applies dense bidirectional attention to fuse information and update the representation of each token and node (§ 3.4). The LM-to-KG attention weights generated during reasoning represent the global importance of each node in the graph, so the dynamic pruning module prunes the graph layer by layer according to these weights and finally retains the most relevant nodes (§ 3.5). After N layers of iteration, the query representation and the trimmed graph representation are used to predict the answer (§ 3.6).

The CSQA task in this paper is a multiple-choice problem. Given a commonsense question $q$ and a set of answer choices $\{a_{1},a_{2},...,a_{n}\}$, our task is to measure the plausibility score between $q$ and each answer choice $a$, and then select the answer with the highest plausibility score. In general, questions do not contain any reference to answer choices, so the external knowledge graph provides the necessary background knowledge. We extract from the external KG a subgraph $g=(V,E)$ with the guidance of the question and choice. Here $V$ is a subset of entity nodes retrieved from the external KG, and $E\subseteq{V\times R\times V}$ is the set of edges that connect nodes in $V$, where $R$ is a set of relation types. We describe the detailed extraction process in Appendix A.

3.2 Query Encoder

We follow baselines to use pre-trained language models to encode the query $\{w_{i}\}_{i=1}^{M}$ (question and choice) into a sequence of vectors $\{q_{i}^{0}\}_{i=1}^{M}$ :

3.3 GNN Layer

The output representation $x_{i}^{l}$ is computed by

3.4 Joint Reasoning Module

To narrow the gap between query and knowledge graph features, we fuse them in the joint reasoning module through a dense bidirectional attention mechanism that connects the two encoding layers of the query and the knowledge graph and captures the fine-grained interplay between them.

The module takes the query and KG representations $\mathbf{Q}$ and $\mathbf{X}$ as inputs and outputs their updated versions. We denote the inputs to the joint reasoning module in the $l$-th fusion layer by $\mathbf{Q}^{l-1}=\{q_{i}^{l-1}\}_{i=1}^{M}$ and $\mathbf{\widetilde{X}}^{l}=\{\widetilde{x}_{i}^{l}\}_{i=1}^{|V|}$. Given $q_{i}^{l-1}$ and $\widetilde{x}_{i}^{l}$, an affinity matrix is first constructed via:

where $W_{S}^{T}$ is a learnable weight matrix, $\circ$ is elementwise multiplication, and [;] is vector concatenation across rows. We normalize $S_{ij}^{l}$ row-wise to derive KG-to-LM attention maps on query tokens conditioned on each entity in the KG as

and also normalize $S_{ij}^{l}$ column-wise to derive LM-to-KG attention maps on entities conditioned on each query token as

The attended representations are computed as follows:

where $\otimes$ represents matrix multiplication. The attended features are fused with the original features of the other modality by concatenation and then compressed to low-dimensional space by:

where $W_{Q},W_{X}$ are learnable weights. The updated query representation $\mathbf{Q}^{l}=\{q_{i}^{l}\}_{i=1}^{M}$ is then passed to the next, $(l+1)$-th, stacked JointLK layer to continue participating in joint reasoning, and the updated KG representation $\mathbf{\bar{X}}^{l}=\{\bar{x}_{i}^{l}\}_{i=1}^{|V|}$ is passed to the next module of the current JointLK layer for pruning.
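The data flow of this module can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the exact equations are not reproduced in this text, so the affinity form $S_{ij}=W_{S}^{T}[x_{i};q_{j};x_{i}\circ q_{j}]$, the shared hidden size `d` for both modalities, and the function and argument names (`joint_reasoning_layer`, `w_s`, `W_q`, `W_x`) are assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def joint_reasoning_layer(Q, X, w_s, W_q, W_x):
    """One dense bidirectional attention step (illustrative sketch only).

    Q: (M, d) query token representations.
    X: (V, d) KG node representations.
    w_s: (3d,) affinity weights; W_q, W_x: (2d, d) fusion weights.
    """
    V, M = X.shape[0], Q.shape[0]
    # Assumed affinity form: S[i, j] = w_s^T [x_i ; q_j ; x_i * q_j]
    Xe = np.repeat(X[:, None, :], M, axis=1)               # (V, M, d)
    Qe = np.repeat(Q[None, :, :], V, axis=0)               # (V, M, d)
    S = np.concatenate([Xe, Qe, Xe * Qe], axis=-1) @ w_s   # (V, M)

    kg_to_lm = softmax(S, axis=1)   # row-wise: each node attends over tokens
    lm_to_kg = softmax(S, axis=0)   # column-wise: each token attends over nodes

    Q_att = kg_to_lm @ Q            # (V, d): attended query features per node
    X_att = lm_to_kg.T @ X          # (M, d): attended KG features per token

    # Fuse attended features with the original features of the other modality,
    # then compress back to d dimensions.
    Q_new = np.concatenate([Q, X_att], axis=-1) @ W_q      # (M, d)
    X_new = np.concatenate([X, Q_att], axis=-1) @ W_x      # (V, d)
    return Q_new, X_new, lm_to_kg
```

Returning `lm_to_kg` alongside the updated representations reflects the fact that the pruning step of the same layer reuses these attention weights.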

3.5 Dynamic Pruning Module

In Equation 10, the LM-to-KG attention value indicates the importance of different nodes in the subgraph for question answering. Inspired by SAGPool Lee et al. (2019), under the guidance of the query, we retain relevant nodes and cut out irrelevant nodes according to the LM-to-KG attention. We define a hyperparameter, the retention ratio $K\in(0,1]$, which determines the number of nodes to be retained. We choose the top $\left\lceil K\cdot|V|\right\rceil$ nodes according to the value of the LM-to-KG attention:

where top-rank is a function that returns the indices of the top $\left\lceil K\cdot|V|\right\rceil$ values, $\cdot_{idx}$ is an indexing operation, and $Z_{mask}$ is the corresponding attention mask. Next, the subgraph is formed by pooling out the less essential entity nodes as:

where $\mathbf{\bar{X}}_{idx,:}^{l}$ is the row-wise indexed representation matrix of $\mathbf{\bar{X}}^{l}$, $\odot$ is the broadcasted elementwise product, and $\mathbf{\bar{A}}_{idx,idx}^{l}$ is the row-wise and column-wise indexed adjacency matrix. $\mathbf{X}^{l}=(x_{1}^{l},x_{2}^{l},\ldots,x_{\left\lceil K\cdot|V|\right\rceil}^{l})$, $\mathbf{A}^{l}$ and $\left\lceil K\cdot|V|\right\rceil$ are the representation matrix, the adjacency matrix and the number of graph nodes passed to the next JointLK layer.
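The pruning step described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: `node_scores` is assumed to be a single importance value per node derived from the LM-to-KG attention (for example, the attention mass a node receives summed over query tokens), and the function name `prune_graph` is hypothetical.

```python
import math
import numpy as np

def prune_graph(X_bar, A_bar, node_scores, K):
    """Keep the top ceil(K * |V|) nodes ranked by attention score (sketch).

    X_bar: (V, d) node representations after joint reasoning.
    A_bar: (V, V) adjacency matrix of the current subgraph.
    node_scores: (V,) assumed per-node importance from LM-to-KG attention.
    K: retention ratio in (0, 1].
    """
    num_nodes = X_bar.shape[0]
    keep = math.ceil(K * num_nodes)
    # Indices of the highest-scoring nodes, restored to their original order.
    idx = np.sort(np.argsort(-node_scores, kind="stable")[:keep])
    z_mask = node_scores[idx]                  # attention mask of kept nodes
    X_next = X_bar[idx, :] * z_mask[:, None]   # broadcasted elementwise product
    A_next = A_bar[np.ix_(idx, idx)]           # row- and column-wise indexing
    return X_next, A_next, idx
```

Scaling the kept rows by their attention values follows the SAGPool-style masking that the text refers to; because the scores enter the forward computation, they can receive gradients during training.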

3.6 Answer Prediction

After N layers of iteration, we finally obtain the query representation $\mathbf{Q}^{N}$ that fuses knowledge information and the graph representation $\mathbf{X}^{N}$ that fuses question information. We compute the score of $a$ being the correct answer as:

where $s$ is the mean pooling of $\mathbf{Q}^{N}$, and $g$ is the attention-based pooling of $\mathbf{X}^{N}$. We obtain the final probability by normalizing the scores of all question-choice pairs with softmax.
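The final normalization step can be illustrated with a short sketch. It assumes that each question-choice pair has already been assigned one scalar score by the model; the function names are hypothetical, and the scoring function itself is not reproduced here.

```python
import numpy as np

def choice_probabilities(scores):
    """Softmax over the scores of all answer choices of one question."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())   # shift for numerical stability
    return exp / exp.sum()

def predict(scores):
    """Return the index of the most plausible choice and all probabilities."""
    probs = choice_probabilities(scores)
    return int(np.argmax(probs)), probs
```

Since softmax is monotonic, the predicted choice is the one with the highest raw score; the normalization matters mainly for the cross-entropy training objective.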

Experimental Setup

We evaluate our model on two typical commonsense question answering datasets: CommonsenseQA Talmor et al. (2019) and OpenBookQA Mihaylov et al. (2018). CommonsenseQA is a 5-way multiple-choice question answering dataset that requires commonsense for reasoning and contains 12,102 questions. We experiment and report the accuracy on the in-house dev (IHdev) and test (IHtest) splits used by Lin et al. (2019), and report the accuracy of our final system on the official test set. OpenBookQA is a 4-way multiple-choice question answering dataset that requires reasoning with elementary science knowledge. It contains 5,957 questions along with an open book of scientific facts. We use the official data split.

4.2 Implementation Details

Following previous work Yasunaga et al. (2021), we use ConceptNet Speer et al. (2017), a commonsense knowledge graph, as our structured knowledge source for both of the above tasks. Given each query, we follow the preprocessing steps described in Feng et al. (2020) to retrieve the subgraph from ConceptNet, with a max hop size of 3 (see Appendix A for details). We use cross-entropy loss and the RAdam optimizer Liu et al. (2020). In training, we set the maximum input sequence length of the text encoder to 100 and the batch size to 128, and perform early stopping. We set the dimension (D = 200) and number of layers (N = 5) of our GNN module, with dropout rate 0.2 applied to each layer Srivastava et al. (2014). We use separate learning rates for the LM encoder and the graph encoder. We choose the LM encoder learning rate from $\{1\times{10}^{-5},\ 2\times{10}^{-5},\ 3\times{10}^{-5}\}$, and the graph encoder learning rate from $\{1\times{10}^{-3},\ 2\times{10}^{-3}\}$. Each model is trained on one GPU (Tesla V100-SXM2-16GB), which takes 20 hours on average.

4.3 Compared Methods

Although text corpora can provide knowledge complementary to knowledge graphs, our model focuses on improving the use of KGs and the joint reasoning between LM and KG, so we choose LM and LM+KG methods for comparison.

To investigate the role of KGs, we compare with the baseline model RoBERTa-large Liu et al. (2019) for CommonsenseQA, and with RoBERTa-large and AristoRoBERTa Clark et al. (2020) for OpenBookQA. The LM+KG methods share a similar high-level framework with ours, that is, an LM is used as the text encoder and a GNN or RN is used as the KG encoder, but they differ in how knowledge is used or reasoned over: (1) Relation Network (RN) Santoro et al. (2017), (2) RGCN Schlichtkrull et al. (2018), (3) GconAttn Wang et al. (2019), (4) KagNet Lin et al. (2019), (5) MHGRN Feng et al. (2020), and (6) QA-GNN Yasunaga et al. (2021). (1), (2) and (3) are relation-aware GNNs for KGs, and (4), (5) and (6) further model paths in KGs. For a fair comparison, we use the same LM for all compared methods.

Results and Analysis

The results on the CommonsenseQA in-house split and official test set are shown in Table 1 and Table 2. The results on the OpenBookQA test set and leaderboard are shown in Table 3 and Table 4. We can observe that JointLK performs best among all fine-tuned LMs and existing LM+KG models. On CommonsenseQA, our model’s test performance improves by 5.74% over fine-tuned LMs and 1.02% over the prior best LM+KG model, QA-GNN. On OpenBookQA, our model’s test performance improves by 6.52% over fine-tuned AristoRoBERTa, and 2.15% over QA-GNN. We also submit our best model to the leaderboards, and our JointLK (with the text encoder being RoBERTa-large) ranks first among comparable approaches. The improvement over the previous best models, MHGRN and QA-GNN, suggests the effectiveness of our proposed joint reasoning between LM and KG and of the dynamic pruning mechanism.

Note that we do not compare with the higher-ranking models on the leaderboard, such as UnifiedQA Khashabi et al. (2020) and ALBERT + DESC-KCR Xu et al. (2021), because they either use a stronger text encoder or use additional data resources, while our model focuses on improving the joint reasoning between LM and KG.

5.2 Ablation Studies

We further conduct in-depth analyses to investigate the effectiveness of different components in our model. We show the accuracy of JointLK on the CommonsenseQA IHdev set.

Impact of JointLK components We assess the impact of the joint reasoning module (§ 3.4) and the dynamic pruning module (§ 3.5), as shown in Table 5. Disabling the dynamic pruning module results in a 0.5% drop in performance, showing that some nodes in the subgraph are not conducive to reasoning. When we disable the joint reasoning module, the dynamic pruning module is also removed, because the latter depends on the attention values of the former. The results then drop significantly: $77.88\%\rightarrow 76.61\%$, suggesting that the joint reasoning between LM and KG is critical.

Impact of the Number of Stacked JointLK Layers We investigate the impact of the number of JointLK layers (shown in Figure 3 (a)). Adding layers continues to bring benefits up to $N=5$, but performance begins to drop when $N>5$. As the number of layers increases, the model shifts from underfitting to overfitting.

Impact of the Retention Ratio in Pruning The retention ratio $K$ is a hyperparameter of the dynamic pruning module. Since pruning is applied recursively in each stacked JointLK layer, the percentage of graph nodes that the model ultimately retains also depends on the number of JointLK layers: it is approximately $K^{N}$, where $N=5$. Experiments show that if the retention ratio is too high, there may be almost no pruning effect (for example, with $K=0.98$, about 90% of the nodes are retained in the last layer); if it is too low, useful nodes may be deleted. As shown in Figure 3 (b), with $N=5$ JointLK layers, $K=0.92$ (about 66% of the original nodes remain in the last layer) works best on the CommonsenseQA dev set.
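The relation between the per-layer retention ratio and the final graph size can be checked with a few lines of arithmetic. The sketch below is only an illustration with hypothetical helper names: `retained_fraction` evaluates $K^{N}$ directly, while `retained_nodes` applies the ceiling at every layer, as the pruning rule does, so its counts can be slightly larger than $K^{N}\cdot|V|$.

```python
import math

def retained_fraction(K, N):
    """Approximate fraction of nodes left after N pruning layers: K ** N."""
    return K ** N

def retained_nodes(num_nodes, K, N):
    """Node count after each of N layers when ceil(K * |V|) nodes are kept."""
    counts = []
    n = num_nodes
    for _ in range(N):
        n = math.ceil(K * n)
        counts.append(n)
    return counts

for K in (0.98, 0.92):
    print(K, round(retained_fraction(K, 5), 2))
```

With $N=5$, this gives roughly 0.90 for $K=0.98$ and 0.66 for $K=0.92$, matching the percentages quoted in the text.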

5.3 Quantitative Analysis

Considering the overall performance improvement of our model on these two datasets, we analyze whether the improvement is reflected in questions that require more complex reasoning, such as questions with negation and complex questions with more entities. We compare our model with the prior best LM+KG model, QA-GNN in Table 6.

Questions with negation Large LMs do well due to memorizing subject and filler co-occurrences but are easily distracted by elements like negation Zagoury et al. (2021). To investigate the reasoning ability of the model on negation, we retrieved 133 questions with negation terms (e.g., no, not, nothing, never, unlikely, don’t, doesn’t, didn’t, can’t, couldn’t) from the CommonsenseQA IHdev set. JointLK exhibits a big boost ( $\uparrow$ 3.00%) over QA-GNN, suggesting its strength in negation reasoning. The fine-grained joint inference of LM and GNN allows the model to pay attention to the semantic nuances of language expressions.
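Selecting such a subset amounts to a simple keyword filter. The following sketch is an assumption about how the selection could be done, not the authors' script: it uses only the example negation terms listed above, matches whole words so that, for instance, "notable" does not count as "not", and treats straight and curly apostrophes alike.

```python
import re

# Example negation terms listed in the text; the authors' full list may differ.
NEGATION_TERMS = {
    "no", "not", "nothing", "never", "unlikely",
    "don't", "doesn't", "didn't", "can't", "couldn't",
}

# A word, optionally followed by an apostrophe and more letters (e.g. "don't").
_TOKEN = re.compile(r"[a-z]+(?:'[a-z]+)?")

def has_negation(question):
    """Return True if the question contains one of the negation terms."""
    text = question.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(token in NEGATION_TERMS for token in _TOKEN.findall(text))

def select_negation_questions(questions):
    """Filter a list of question strings down to those with negation."""
    return [q for q in questions if has_negation(q)]
```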

Questions with fewer/more entities When the question contains many entities, the size and noise of the retrieved KG may limit the model’s performance because the model needs to understand the complex relationships between entities. According to statistics (see Appendix A), questions contain an average of 7 entities, so we divide the questions into two categories: those containing fewer entities ($\leq$ 7) and those containing more entities ($>$ 7). Compared with QA-GNN, JointLK has a bigger boost on questions with more entities ($\uparrow$ 2.01%) than on those with fewer entities ($\uparrow$ 0.96%), suggesting that our model can reduce the reasoning difficulty of complex questions because it can remove irrelevant nodes during reasoning.

5.4 Interpretability: A Case Study

We aim to interpret JointLK’s reasoning process by analyzing the pruning of the knowledge subgraph. Figure 4 shows an example from CommonsenseQA where our model correctly answers the question and finally retains reasonable reasoning paths by pruning the subgraph. The flow from (a) to (b) to (c) represents the recursive pruning of the subgraph according to the LM-to-KG attention weights at each GNN update layer. From (a) to (b), although the nodes wood and burn bridge the reasoning gap between the question entity and the answer entity, their semantics are very different from the question, so they are pruned. From (b) to (c), “$play\_guitar\stackrel{usedfor}{\longrightarrow}fun$” and “$fun\stackrel{relatedto}{\longrightarrow}gas\stackrel{relatedto}{\longrightarrow}singe$” are both reasonable, but the former is related to the semantics of the question, and the latter is not. Two paths are retained in (c): “$play\_guitar\stackrel{hassubevent}{\longrightarrow}take\_lessons\stackrel{hassubevent}{\longrightarrow}dance\stackrel{relatedto}{\longrightarrow}singing$” and “$play\_guitar\stackrel{relatedto}{\longrightarrow}action\stackrel{relatedto}{\longrightarrow}singer\stackrel{relatedto}{\longrightarrow}singing$”. These two paths describe two possible scenarios that support answering the question.

5.5 Error Analysis

To understand why our model fails in some cases, we randomly select 100 error cases and group them into several categories. There are three main types of errors, and we show some examples in Appendix C.

Missing important evidence (39/100) Although we can retrieve many nodes related to questions and choices from ConceptNet, the knowledge graph is incomplete, so essential evidence nodes may be missing from the reasoning paths needed to answer the question. For example, although “eating_dinner” will cause “sleepiness” or “indigestion”, knowledge such as “lactose intolerance causes indigestion” is essential to answer the question (Wikipedia: Lactose intolerance is a common condition caused by a decreased ability to digest lactose, a sugar found in dairy products.). However, ConceptNet either does not cover such knowledge or it is not retrieved.

Indistinguishable knowledge (25/100) Several choices may be plausible and difficult to distinguish, and which one is considered correct may vary from person to person. For example, “human” and “cat” may be at location “bed” or “comfortable chair”, and the knowledge provided by ConceptNet is the same for both. The model may choose bed because bed appears more frequently in the pre-training corpus.

Incomprehensible questions (23/100) This type of error often occurs when the question is particularly long and involves various events and changes in the characters’ emotions. It is difficult for the model to understand the scene described by the question. Some questions may require reasoning based on events, but the knowledge in ConceptNet is based more on entities and attributes.

The above three types of errors show that selecting complete, accurate, and context-sensitive knowledge is vital for more effective KG-augmented models.

Conclusion

In this work, we propose JointLK and provide a set of experiments showing that (i) interactive fusion of LM and KG can reduce the semantic gap between the two information modalities and make better use of the KG for joint reasoning with the LM, and (ii) the dynamic pruning module can recursively delete irrelevant subgraph nodes at each layer of JointLK to provide refined, appropriate evidence. Our results on CommonsenseQA and OpenBookQA demonstrate the superiority of JointLK over other methods using external knowledge and its strong performance in complex reasoning. In addition, our research results can be broadly extended to other tasks that require KGs as additional background knowledge to augment LMs, such as entity linking, KG completion and recommender systems.

Acknowledgements

We would like to thank the anonymous reviewers for their helpful comments. This work was supported by the Key Development Program of the Ministry of Science and Technology (No.2019YFF0303003), the National Natural Science Foundation of China (No.61976068) and "Hundreds, Millions" Engineering Science and Technology Major Special Project of Heilongjiang Province (No.2020ZX14A02).

Ethical Impact

This paper proposes a general approach to fuse language models and external knowledge graphs for commonsense reasoning. We worked within the purview of acceptable privacy practices and strictly followed the data usage policy. In all the experiments, we use public datasets in a manner consistent with their intended use. We neither introduce any social/ethical bias to the model nor amplify any bias in the data, so we do not foresee any direct social consequences or ethical issues.

References

Appendix A Extracting subgraph from External KG

We choose ConceptNet as the external knowledge base, and we follow the process of Feng et al. (2020) and Yasunaga et al. (2021) to retrieve the knowledge subgraph.

Given the question and choice, we identify the ConceptNet concepts mentioned in the question and in the choice, obtaining the node sets $V_{q}$ and $V_{a}$, which together form the initial node set $V_{q,a}$. For example, for the question “What do people typically do while playing guitar?” and the choice “singing”, $V_{q}$ = {guitar, people, play, play_guitar, playing, playing_guitar, typically} and $V_{a}$ = {singe, singing}. Then, in order to extract the subgraph related to the question and choice, we add the bridge entities on the 1-hop and 2-hop paths between any pair of entities in $V_{q,a}$, thus obtaining the retrieved entity set $V$.
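The bridge-entity step can be sketched as follows. This is a simplified illustration, not the preprocessing code of Feng et al. (2020) or Yasunaga et al. (2021): the graph is assumed to be given as an undirected neighbor dictionary, the name `retrieve_entities` is hypothetical, and only intermediate nodes of 2-hop paths are added, since a 1-hop path between two seed entities is a direct edge with no intermediate node.

```python
from itertools import combinations

def retrieve_entities(neighbors, seed_nodes):
    """Add bridge entities lying on 2-hop paths between pairs of seed nodes.

    neighbors: dict mapping each node to the set of its adjacent nodes
               (treated as undirected for this sketch).
    seed_nodes: the initial node set V_{q,a}.
    """
    seeds = set(seed_nodes)
    retrieved = set(seeds)
    for u, v in combinations(sorted(seeds), 2):
        # A node adjacent to both u and v lies on a 2-hop path u - b - v.
        retrieved |= neighbors.get(u, set()) & neighbors.get(v, set())
    return retrieved
```

On a toy graph where nodes `q1` and `a1` share a neighbor `b` and `q1` has another neighbor `x`, only `b` is added, because `x` does not connect the two seed nodes.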

There may be many nodes in $V$, especially when long questions contain many concepts. We follow the preprocessing method of Yasunaga et al. (2021): we concatenate each node with the question and choice, and calculate the relevance score of the node with a pre-trained LM. We retain only the top 200 scoring nodes. Note that this is a preprocessing step of the retrieval process, which differs from the dynamic pruning in Section 3.5: the former scores each node on its own, separately from the subgraph in which the node is located, while the latter prunes recursively while the subgraph representation is being updated.

Finally, we get the relation set $R$ by merging the relation types in ConceptNet and adding reverse relations. We retrieve all edges with relation types in $R$ between any two nodes in $V$. In addition, we add the question as a node $q$ to $V$, and add bidirectional edges between $q$ and $V_{q}$ and between $q$ and $V_{a}$. The relation types are shown in Table 7, and the statistics of the retrieved nodes are shown in Table 8.

Appendix B Node Initialization

For each entity in the subgraph, we need to obtain its feature representation. Following Feng et al. (2020), we first use templates to convert the knowledge triples in ConceptNet into sentences and feed them into BERT-Large, obtaining a sequence of token embeddings from the last layer. For each entity, we perform mean pooling over the tokens of the entity’s occurrences across all the sentences to form the initial embedding $x_{i}^{0}$.
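The pooling step can be sketched as below, assuming the last-layer token embeddings of every occurrence of an entity have already been extracted. The function name and the input layout (one array of token vectors per occurrence) are hypothetical, and the sketch averages over all tokens jointly; averaging per occurrence first would be an equally plausible reading of the description.

```python
import numpy as np

def init_entity_embedding(occurrence_token_vectors):
    """Mean-pool token embeddings over all occurrences of one entity.

    occurrence_token_vectors: list of arrays, each of shape (num_tokens, dim),
    holding the token embeddings of one occurrence of the entity.
    """
    tokens = np.concatenate(occurrence_token_vectors, axis=0)  # (total_tokens, dim)
    return tokens.mean(axis=0)                                 # (dim,)
```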

Appendix C Error Types and Examples

In Table 9, we present examples for each error type in the CommonsenseQA IHdev set. Because the average number of subgraph nodes corresponding to each case is about 100, we cannot list them all. Only some important nodes are shown here.