Learning Contextualized Knowledge Structures for Commonsense Reasoning

Jun Yan, Mrigank Raman, Aaron Chan, Tianyu Zhang, Ryan Rossi, Handong Zhao, Sungchul Kim, Nedim Lipka, Xiang Ren

Introduction

Commonsense reasoning (CSR) is essential for natural language understanding (NLU) systems to function effectively in the real world Apperly (2010). For example, to answer the question in Figure 1, one must already know that printing requires using paper. Yet, since commonsense knowledge is self-evident to humans, it is rarely stated in natural language Gunning (2018). This makes it hard for neural pre-trained language models (PLMs) (Devlin et al., 2019) to learn commonsense knowledge from corpora alone Marcus (2018).

Unlike raw text corpora, knowledge graphs (KGs) can provide structured commonsense facts (edges) of the form (concept1, relation, concept2) Speer et al. (2017). Hence, many recent CSR models augment the PLM with a KG, allowing such KG-augmented models to make predictions via multi-hop reasoning over the KG Lin et al. (2019); Bosselut and Choi (2019).

Despite the growing success of KG-augmented models, obtaining helpful KG facts for a given task instance remains challenging. Existing models assume that using either KG-extracted edges Lin et al. (2019); Ma et al. (2019); Feng et al. (2020); Yasunaga et al. (2021), PLM-generated edges (to address KG edge sparsity) Bosselut and Choi (2019), or a late fusion of both Wang et al. (2020) is sufficient. Both extraction and generation can produce unhelpful edges, so the model must decide which edges to focus on during reasoning. Since extracted and generated edges are derived from the same set of concepts (nodes), modeling the interactions between extracted and generated edges jointly within a shared KG structure could provide a stronger signal for identifying contextually relevant edges. However, current models do not leverage this information.

In response, we propose a new KG-augmented model: Hybrid Graph Network (HGN). Unlike prior models, HGN learns to jointly contextualize extracted and generated knowledge by reasoning over both within a unified graph structure. Given the task input (i.e., context) and an extracted KG subgraph, HGN is trained to generate embeddings for the subgraph’s missing edges to form a “hybrid” graph, then reason over the graph (to update model parameters) while filtering out context-irrelevant edges. HGN achieves this primarily through edge reweighting, which downweights irrelevant edges, and edge-weighted message passing, which attenuates irrelevant edges’ impact on reasoning.

Our extensive experiments demonstrate that HGN improves performance over all baselines across four CSR benchmarks. In particular, among comparable methods, HGN ranks first on the CommonsenseQA (Talmor et al., 2019) and OpenBookQA (Mihaylov et al., 2018) leaderboards. In addition, our user studies show that humans find HGN-filtered edges to be more valid and helpful than the heuristically extracted edges used in prior work.

Problem Statement

We consider CSR tasks, like question answering (QA), which can benefit from commonsense KGs. To solve CSR tasks, we focus on KG-augmented models, where a PLM is augmented with a commonsense KG. Given a CSR task, let $x$ be the task’s text input, $f$ be the model, and $f(x)$ be the model output. We denote a KG as $\mathcal{G}=(\mathcal{V},\mathcal{R},\mathcal{E})$ . $\mathcal{V}$ , $\mathcal{R}$ , and $\mathcal{E}$ are the sets of nodes (concepts), relations, and edges (facts), respectively, in the KG. An edge is a directed triple of the form $e=(h,r,t)\in\mathcal{E}$ , where $h\in\mathcal{V}$ is the head node, $t\in\mathcal{V}$ is the tail node, and $r\in\mathcal{R}$ is the relation between $h$ and $t$ . Let $[\cdot,\cdot]$ denote concatenation of text or vectors.

As illustrated in Figure 2, a KG-augmented model $f$ has three main components: text encoder $f_{\text{text}}$, graph encoder $f_{\text{graph}}$, and scoring function $f_{\text{score}}$. First, $\mathbf{s}=f_{\text{text}}(x;\bm{\uptheta}_{\text{text}})$ is the encoding of $x$, where $f_{\text{text}}$ is usually a Transformer PLM. Second, as supporting evidence, an $x$-specific graph $\mathcal{G}^{\prime}=(\mathcal{V}^{\prime},\mathcal{R}^{\prime},\mathcal{E}^{\prime})$ is constructed from $\mathcal{G}$ (Figure 1). Typically, this is done via heuristic extraction by selecting $\mathcal{V}^{\prime}\subseteq\mathcal{V}$ as the concepts mentioned in $x$, $\mathcal{R}^{\prime}\subseteq\mathcal{R}$ as the relations between concepts in $\mathcal{V}^{\prime}$, and $\mathcal{E}^{\prime}\subseteq\mathcal{E}$ as the edges involving $\mathcal{V}^{\prime}$ and $\mathcal{R}^{\prime}$. If $\mathcal{G}$ does not provide enough knowledge to build a good $\mathcal{G}^{\prime}$, then new edges are sometimes added to $\mathcal{G}^{\prime}$ using a PLM-based generator Wang et al. (2020). We call $\mathcal{G}^{\prime}$ the contextualized KG. $\mathbf{g}=f_{\text{graph}}(\mathcal{G}^{\prime},\mathbf{s};\bm{\uptheta}_{\text{graph}})$ is then the joint encoding of $\mathcal{G}^{\prime}$ and $\mathbf{s}$. Third, the model output is computed as $f(x)=f_{\text{score}}([\mathbf{s},\mathbf{g}];\bm{\uptheta}_{\text{score}})$, where $f_{\text{score}}$ is usually a multilayer perceptron (MLP). Existing KG-augmented models mainly differ in their design of $f_{\text{graph}}$, reasoning over the KG through message passing Schlichtkrull et al. (2018a); Feng et al. (2020); Yasunaga et al. (2021) or edge/path aggregation Lin et al. (2019); Bosselut and Choi (2019); Ma et al. (2019).
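The three-component composition above can be sketched in a few lines. This is only an illustrative skeleton: the toy encoders (`toy_text`, `toy_graph`, `toy_score`) are hypothetical stand-ins for a Transformer PLM, a graph encoder, and an MLP, and vectors are plain Python lists so that list concatenation plays the role of $[\mathbf{s},\mathbf{g}]$.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Edge = Tuple[str, str, str]  # (head, relation, tail)


def kg_augmented_score(
    x: str,
    subgraph: Sequence[Edge],
    f_text: Callable[[str], Vector],
    f_graph: Callable[[Sequence[Edge], Vector], Vector],
    f_score: Callable[[Vector], float],
) -> float:
    """Compose f(x) = f_score([s, g]) from the three components."""
    s = f_text(x)             # text encoding s
    g = f_graph(subgraph, s)  # joint encoding g of the contextualized KG and s
    return f_score(s + g)     # list concatenation stands in for [s, g]


# Toy stand-ins (assumptions for illustration, not real encoders).
def toy_text(x: str) -> Vector:
    return [float(len(x.split()))]


def toy_graph(edges: Sequence[Edge], s: Vector) -> Vector:
    return [float(len(edges))]


def toy_score(z: Vector) -> float:
    return sum(z)


score = kg_augmented_score(
    "a b c", [("paper", "UsedFor", "printing")], toy_text, toy_graph, toy_score
)
```

Passing the components as callables keeps the sketch agnostic to the choice of $f_{\text{graph}}$, which is where existing KG-augmented models mainly differ.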

While KG-augmented models can be applied to any CSR task involving KGs (e.g., natural language inference), we consider multi-choice QA in this work. Given a question $q$ and set of candidate answers $\{a_{i}\}$ , the QA model’s goal is to predict a plausibility score $\rho(q,a)$ for each $a\in\{a_{i}\}$ , so that the highest score is predicted for the correct answer. To use KG-augmented models for commonsense QA, we set $x=[q,a]$ and $\rho(q,a)=f(x)$ .
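As a minimal sketch of this setup, prediction amounts to scoring each concatenated input $[q,a]$ and returning the highest-scoring candidate. The whitespace separator used for text concatenation and the scorer `toy_f` are assumptions made for illustration only.

```python
from typing import Callable, Sequence


def predict_answer(q: str, answers: Sequence[str], f: Callable[[str], float]) -> str:
    """Return the candidate a that maximizes rho(q, a) = f([q, a])."""
    # Text concatenation [q, a]; the single-space separator is an assumption.
    scores = [f(q + " " + a) for a in answers]
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]


def toy_f(x: str) -> float:
    # Hypothetical scorer that rewards inputs mentioning "paper".
    return 1.0 if "paper" in x else 0.0


prediction = predict_answer("What do you need to print?", ["water", "paper", "sand"], toy_f)
```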

Hybrid Graph Network (HGN)

As illustrated in §2 and Figure 2, given question-answer pair $(q,a)$ for an instance of the multi-choice QA task, the KG-augmented QA model first obtains a $(q,a)$ -contextualized KG $\mathcal{G}^{\prime}$ via the full KG $\mathcal{G}$ . Edges in $\mathcal{G}^{\prime}$ can be extracted directly from $\mathcal{G}$ or generated using a PLM-based generator Wang et al. (2020); Bosselut et al. (2019). Then, the model transforms $(q,a)$ and $\mathcal{G}^{\prime}$ into text encoding $\mathbf{s}$ and graph encoding $\mathbf{g}$ , respectively. Finally, $\mathbf{s}$ and $\mathbf{g}$ are used to predict $(q,a)$ ’s plausibility.

However, a contextualized KG may have low knowledge recall or precision, hindering the QA model’s access to relevant knowledge. Low recall can stem from missing edges in $\mathcal{G}$ , low precision can be the result of bad annotations in $\mathcal{G}$ , and both can be caused by noisy edge extraction or generation when building $\mathcal{G}^{\prime}$ . HGN addresses these issues by reasoning over both extracted and generated edges within a unified graph structure. To improve recall, HGN generates new edges via a PLM-based generator, then initializes a hybrid contextualized KG containing both extracted and generated edges. Note that edge generation is generally $(q,a)$ -agnostic and may produce irrelevant edges that hurt knowledge precision. To improve precision, HGN learns to reweight edges in the hybrid graph and reason over the hybrid graph via edge-weighted message passing. This is akin to learning the hybrid graph’s structure and reduces the impact of irrelevant edges on reasoning. Additionally, to further encourage downweighting of noisy edges during reasoning, HGN is trained with entropy regularization on the learned edge weights.

The overall learning objective of HGN is defined as $\mathcal{L}=\mathcal{L}_{\text{task}}+\beta\mathcal{L}_{\text{edge}}$ , where $\mathcal{L}_{\text{task}}$ is the loss for the downstream task (in our work, QA), $\mathcal{L}_{\text{edge}}$ is the entropy regularization term for edge weights, and $\beta\geq 0$ is a loss weight hyperparameter. In the following subsections, we first explain how the contextualized KG $\mathcal{G}^{\prime}$ is constructed as a hybrid graph, including its node embeddings $\mathbf{V}$ , hybrid edge embeddings $\mathbf{E}$ , and adjacency matrix $\mathbf{A}^{0}$ (§3.2). Next, we show how HGN uses edge-weighted message passing to update $\mathbf{V}$ , $\mathbf{E}$ , and $\mathbf{A}^{0}$ for $L$ layers (Figure 3), yielding a refined adjacency matrix $\mathbf{A}^{L}$ of learned edge weights (§3.3). Finally, we describe how $\mathcal{L}_{\text{task}}$ is computed using $\mathbf{s}$ and $\mathbf{g}$ , while $\mathcal{L}_{\text{edge}}$ is calculated using $\mathbf{A}^{L}$ (§3.4).

3.2 Hybrid Graph Construction

The first step of retrieving knowledge from $\mathcal{G}$ is concept grounding, which involves identifying text spans in $(q,a)$ that match nodes in $\mathcal{V}$ . We define $\mathcal{V}^{\prime}$ as the set of all concepts mentioned in $(q,a)$ , where $\mathcal{V}^{\prime}_{q}=\{v_{i}\}_{i=1}^{n_{q}}$ and $\mathcal{V}^{\prime}_{a}=\{v_{i}\}_{i=1}^{n_{a}}$ are the question and answer concepts, respectively. Each node $v_{i}\in\mathcal{V}^{\prime}$ is represented by an embedding $\mathbf{v}_{i}\in\mathbf{V}$ , which can be initialized using BERT Devlin et al. (2019) or TransE Bordes et al. (2013).

Hybrid Edge Embeddings.

In $\mathcal{G}^{\prime}$ , we loosen the definition of an edge to be $e_{(i,j)}=(v_{i},v_{j})\in\mathcal{E}^{\prime}$ . We build fully-connected edges between question and answer nodes in $\mathcal{G}^{\prime}$ . The set of edges in $\mathcal{G}^{\prime}$ is thus defined as $\mathcal{E}^{\prime}=(\mathcal{V}^{\prime}_{q}\times\mathcal{V}^{\prime}_{a})\cup(\mathcal{V}^{\prime}_{a}\times\mathcal{V}^{\prime}_{q})$ . After concept grounding, we need an edge embedding $\mathbf{e}_{(i,j)}\in\mathbf{E}$ for each edge $e_{(i,j)}$ . Let $\mathbf{R}$ be the relation embeddings for all relations in $\mathcal{R}$ , obtained using TransE. Each extracted edge $(v_{i},r,v_{j})\in\mathcal{E}$ is thus initialized in $\mathcal{G}^{\prime}$ as $\mathbf{e}_{(i,j)}=\mathbf{r}\in\mathbf{R}$ . However, due to edge sparsity, many edges do not have labeled relations and cannot be initialized this way.
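The fully connected edge set $\mathcal{E}^{\prime}$ can be enumerated directly from the grounded concepts. The sketch below assumes concepts are represented as strings; the example concepts are illustrative and not taken from the datasets.

```python
from itertools import product
from typing import List, Set, Tuple


def hybrid_edge_set(q_nodes: List[str], a_nodes: List[str]) -> Set[Tuple[str, str]]:
    """Build E' = (V'_q x V'_a) U (V'_a x V'_q) as directed node pairs."""
    forward = set(product(q_nodes, a_nodes))   # question -> answer
    backward = set(product(a_nodes, q_nodes))  # answer -> question
    return forward | backward


edges = hybrid_edge_set(["print", "document"], ["paper"])
```

With $n_q$ question concepts and $n_a$ answer concepts that do not overlap, this yields $2\,n_q\,n_a$ directed edges, which is why most of them have no labeled relation in a sparse KG.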

Meanwhile, despite PLMs’ limitations in commonsense, they have shown some ability to encode commonsense knowledge (Davison et al., 2019; Petroni et al., 2019) and aid KG completion (Malaviya et al., 2019; Bosselut et al., 2019; Wang et al., 2020). Hence, we generate edge embeddings for all unlabeled edges by feeding each unlabeled edge into a GPT-2 Radford et al. (2019) based generator $f_{\text{gen}}(\cdot,\cdot)$ . This is further explained in the “Edge Embedding Generation” paragraph.

In summary, edge embeddings are computed in a hybrid way: (1) If there exists $r\in\mathcal{R}$ such that $(v_{i},r,v_{j})\in\mathcal{E}$ , then $\mathbf{e}_{(i,j)}=\mathbf{r}\in\mathbf{R}$ . (2) Otherwise, $\mathbf{e}_{(i,j)}=f_{\text{adapt}}(f_{\text{gen}}(v_{i},v_{j}))$ , where $f_{\text{adapt}}(\cdot)$ is an MLP used to transform $f_{\text{gen}}(v_{i},v_{j})$ into the same space as $\mathbf{r}$ .
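The two cases can be sketched as a single lookup with a fallback. This is a simplified illustration under stated assumptions: `kg_relation` keeps at most one relation per node pair, and `toy_gen` and `toy_adapt` are hypothetical stand-ins for the GPT-2 based generator $f_{\text{gen}}$ and the MLP $f_{\text{adapt}}$.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def hybrid_edge_embedding(
    v_i: str,
    v_j: str,
    kg_relation: Dict[Tuple[str, str], str],
    rel_emb: Dict[str, Vector],
    f_gen: Callable[[str, str], Vector],
    f_adapt: Callable[[Vector], Vector],
) -> Vector:
    """Case (1): reuse the relation embedding; case (2): generate, then adapt."""
    r = kg_relation.get((v_i, v_j))
    if r is not None:
        return rel_emb[r]            # extracted edge: e_(i,j) = r
    return f_adapt(f_gen(v_i, v_j))  # unlabeled edge: e_(i,j) = f_adapt(f_gen(v_i, v_j))


# Toy data and stand-ins (assumptions for illustration).
kg_relation = {("paper", "printing"): "UsedFor"}
rel_emb = {"UsedFor": [1.0, 0.0]}


def toy_gen(h: str, t: str) -> Vector:
    return [0.5, 0.5, 0.5]  # generator output lives in a different space


def toy_adapt(z: Vector) -> Vector:
    return z[:2]  # maps into the 2-d relation embedding space


extracted = hybrid_edge_embedding("paper", "printing", kg_relation, rel_emb, toy_gen, toy_adapt)
generated = hybrid_edge_embedding("printing", "paper", kg_relation, rel_emb, toy_gen, toy_adapt)
```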

Edge Embedding Generation.

Alternatively, we consider another edge generation approach proposed by Wang et al. (2020). Here, $f_{\text{gen}}(\cdot,\cdot)$ is trained to generate a relational path connecting $v_{i}$ to $v_{j}$ , then pool the path into an edge embedding. The rationale for this approach is that such paths have been shown to contain useful semantic information about the relation between $v_{i}$ and $v_{j}$ (Neelakantan et al., 2015; Das et al., 2017; Wang et al., 2020).

Adjacency Matrix.

Before edge generation, $\mathcal{G}^{\prime}$ has binary adjacency matrix $\mathbf{A}^{\text{extract}}$, where $\mathbf{A}^{\text{extract}}_{(i,j)}=1\Leftrightarrow\exists r,\text{s.t. }(v_{i},r,v_{j})\in\mathcal{E}$. After getting embeddings for all edges $(v_{i},v_{j})\in\mathcal{E}^{\prime}$, $\mathbf{A}^{\text{extract}}$ becomes $\mathbf{A}^{0}$, a denser binary adjacency matrix in which $\mathbf{A}^{0}_{(i,j)}=1\Leftrightarrow(v_{i},v_{j})\in\mathcal{E}^{\prime}$.
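To make the difference between the two matrices concrete, the following sketch builds both from node-pair sets. The node ordering and the toy edges are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def adjacency(nodes: List[str], edges: Iterable[Tuple[str, str]]) -> List[List[int]]:
    """Binary adjacency matrix with A[i][j] = 1 iff (v_i, v_j) is an edge."""
    index = {v: k for k, v in enumerate(nodes)}
    A = [[0] * len(nodes) for _ in nodes]
    for h, t in edges:
        A[index[h]][index[t]] = 1
    return A


nodes = ["print", "document", "paper"]  # two question concepts, one answer concept
extracted_pairs = [("paper", "print")]  # pairs with a labeled relation in the KG
hybrid_pairs = [
    ("print", "paper"), ("document", "paper"),
    ("paper", "print"), ("paper", "document"),
]

A_extract = adjacency(nodes, extracted_pairs)
A_0 = adjacency(nodes, hybrid_pairs)
```

Here `A_extract` has a single nonzero entry, while `A_0` marks every question-answer pair in both directions.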

3.3 Hybrid Graph Reasoning

The procedure described in §3.2 yields a hybrid graph containing unweighted edges between all question-answer node pairs. Constructing this hybrid graph may improve edge recall, but does not address precision. Some edges in the initial hybrid graph may be irrelevant to the question-answer pair, due to either noisy edge extraction or noisy edge generation. HGN is thus designed to downweight irrelevant edges by converting the unweighted graph into a weighted one, then learning to reweight all hybrid edges during reasoning (Figure 3).

Edge-Weighted Message Passing.

Following the general Graph Network (GN) formulation proposed by Battaglia et al. (2018), HGN’s graph reasoning module consists of layer-wise node-to-edge ( $v\rightarrow e$ ) and edge-to-node ( $e\rightarrow v$ ) message passing functions. However, we equip HGN with a modified version of GN’s edge-to-node message passing function, in which each edge’s weight is used to rescale information flow on that edge. Intuitively, an edge’s weight signifies the edge’s relevance for reasoning about the given task instance. We also use text encoding $\mathbf{s}$ as global context throughout message passing.

We use global edge attention (i.e., normalizing across $\mathcal{E}^{\prime}$ ) instead of local edge attention (i.e., normalizing across $N_{j}$ ) because local edge attention assumes at least one edge in $N_{j}$ is relevant, which may not be true. For example, given an irrelevant or incorrectly grounded concept, none of its edges will be helpful, and so all nodes in its neighborhood should be excluded from influencing the reasoning process. To demonstrate the advantage of global edge attention, we empirically compare our default HGN architecture to an HGN variant based on Graph Attention Network (GAT) (Velickovic et al., 2018), which uses local edge attention, in our experiments.
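The contrast between the two normalization schemes can be illustrated with a small sketch. It assumes that each edge already has a scalar attention logit and that $N_{j}$ denotes the incoming edges of node $v_{j}$; the logits and concept names are made up for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

EdgeKey = Tuple[str, str]  # (source node, target node)


def global_edge_attention(logits: Dict[EdgeKey, float]) -> Dict[EdgeKey, float]:
    """Softmax over all edges in E': the weights sum to 1 across the whole graph."""
    m = max(logits.values())
    exp = {e: math.exp(z - m) for e, z in logits.items()}
    total = sum(exp.values())
    return {e: v / total for e, v in exp.items()}


def local_edge_attention(logits: Dict[EdgeKey, float]) -> Dict[EdgeKey, float]:
    """Softmax within each target node's incoming edges (assumed to be N_j)."""
    by_target: Dict[str, Dict[EdgeKey, float]] = defaultdict(dict)
    for (i, j), z in logits.items():
        by_target[j][(i, j)] = z
    weights: Dict[EdgeKey, float] = {}
    for group in by_target.values():
        weights.update(global_edge_attention(group))  # normalize per neighborhood
    return weights


# "usually" plays the role of an incorrectly grounded concept with one low-scoring edge.
logits = {("print", "paper"): 4.0, ("document", "paper"): 3.0, ("print", "usually"): -4.0}
global_w = global_edge_attention(logits)
local_w = local_edge_attention(logits)
```

Under local normalization, the only edge into the irrelevant node receives weight 1 regardless of its logit, whereas global normalization lets that edge's weight fall close to zero.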

3.4 Learning Objective

After $L$ layers of message passing, we obtain node embeddings $\{\mathbf{h}^{L}_{i}\mid i:v_{i}\in\mathcal{V}^{\prime}\}$ and edge embeddings $\{\mathbf{h}^{L}_{(i,j)}\mid(i,j):(v_{i},v_{j})\in\mathcal{E}^{\prime}\}$ . Node embeddings are aggregated into $\mathbf{v}_{\text{agg}}$ via attentive pooling with $\mathbf{s}$ as the query vector. Edge embeddings are aggregated into $\mathbf{e}_{\text{agg}}$ via edge-weighted sum pooling. The final graph encoding is then given as $\mathbf{g}=[\mathbf{v}_{\text{agg}},\mathbf{e}_{\text{agg}}]$ . The probability of $a$ being the answer to $q$ is calculated as $\hat{\rho}(q,a)\propto\exp(\rho(q,a))$ , where $\rho(q,a)=f_{\text{score}}([\mathbf{s},\mathbf{g}];\bm{\uptheta}_{\text{score}})$ . We use cross-entropy loss for the QA classification task, so the loss for each $(q,a)$ with label $y$ is:
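Under one reading of these definitions (a softmax over the candidate answers of a question, with the loss taken as the negative log-probability of the correct candidate), the computation can be sketched as follows. The function names and the per-question grouping of scores are assumptions made for illustration.

```python
import math
from typing import List


def answer_probabilities(scores: List[float]) -> List[float]:
    """Normalize plausibility scores rho(q, a_i) so that rho_hat is proportional to exp(rho)."""
    m = max(scores)  # subtract the max for numerical stability
    exp = [math.exp(s - m) for s in scores]
    total = sum(exp)
    return [e / total for e in exp]


def task_loss(scores: List[float], correct_index: int) -> float:
    """Cross-entropy: negative log-probability assigned to the correct answer."""
    return -math.log(answer_probabilities(scores)[correct_index])


uniform_loss = task_loss([0.0, 0.0, 0.0, 0.0], 0)
confident_loss = task_loss([5.0, 0.0, 0.0], 0)
wrong_loss = task_loss([0.0, 5.0, 0.0], 0)
```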

Entropy Regularization.

To encourage the model to be decisive during edge reweighting, we use a regularization term to penalize non-discriminative edge weights. In an extreme case, a blind model will assign the same weight to all edges, degenerating $\mathcal{G}^{\prime}$ into an unweighted graph. This is a failure mode, since $\mathcal{G}^{\prime}$ is likely to contain mostly irrelevant edges, and we want the model to focus on the helpful edges. Therefore, via $\mathcal{L}_{\text{edge}}$ , we train the model to minimize the entropy of the edge weight distribution (i.e., make the distribution more skewed), in order to maximize the informativeness of the predicted edge weights. Lower entropy means the model has higher certainty about edges’ relevance to the given task instance, such that the model will discriminatively judge some edges as being much more relevant than others. $\mathcal{L}_{\text{edge}}$ is computed as:
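Assuming the learned edge weights in $\mathbf{A}^{L}$ are normalized to sum to one over $\mathcal{E}^{\prime}$ (as global edge attention implies), the entropy being minimized takes the standard form $-\sum_{(i,j)}\mathbf{A}^{L}_{(i,j)}\log\mathbf{A}^{L}_{(i,j)}$. The sketch below computes this quantity for a flat list of edge weights; the helper name `edge_entropy` is illustrative.

```python
import math
from typing import Iterable


def edge_entropy(weights: Iterable[float]) -> float:
    """Shannon entropy -sum(w * log w) of a normalized edge weight distribution."""
    # Zero-weight edges contribute nothing (the limit of w * log w as w -> 0 is 0).
    return -sum(w * math.log(w) for w in weights if w > 0.0)


blind = edge_entropy([0.25, 0.25, 0.25, 0.25])     # same weight on every edge
decisive = edge_entropy([0.97, 0.01, 0.01, 0.01])  # skewed toward one edge
```

The uniform ("blind") assignment attains the maximum entropy $\log 4$ for four edges, while the skewed assignment has much lower entropy, which is the behavior the regularizer rewards.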

Joint Learning.

We jointly optimize $\mathcal{L}_{\text{task}}$ and $\mathcal{L}_{\text{edge}}$ , so graph reasoning and structure can be jointly learned. The full learning objective is:
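Combining the per-instance terms with the weighting $\mathcal{L}=\mathcal{L}_{\text{task}}+\beta\mathcal{L}_{\text{edge}}$ introduced earlier, and assuming that $X_{\text{train}}$ consists of $(q,a)$ instances whose losses are summed (an average would differ only by a constant factor), the objective can be written as:

```latex
\mathcal{L}(\bm{\uptheta})=\sum_{(q,a)\in X_{\text{train}}}\Big[\mathcal{L}_{\text{task}}(q,a)+\beta\,\mathcal{L}_{\text{edge}}(q,a)\Big]
```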

where $\bm{\uptheta}=\{\bm{\uptheta}_{\text{text}},\bm{\uptheta}_{\text{graph}},\bm{\uptheta}_{\text{score}}\}$ is the set of all learnable parameters, and $X_{\text{train}}$ is the training set. We train our model end-to-end by minimizing $\mathcal{L}(\bm{\uptheta})$ with the RAdam (Liu et al., 2020) optimizer.

Experiments

We evaluate our proposed model on four multiple-choice commonsense QA datasets: CommonsenseQA (Talmor et al., 2019), CODAH (Chen et al., 2019), OpenBookQA (Mihaylov et al., 2018), and QASC (Khot et al., 2020) (details in Appendix §B). We use ConceptNet (Speer et al., 2017), a commonsense knowledge graph, as $\mathcal{G}$. For the text encoder $f_{\text{text}}$, we experiment with BERT-Base, BERT-Large (Devlin et al., 2019), and RoBERTa(-Large) (Liu et al., 2019) to validate our model’s effectiveness across different text encoders. For OpenBookQA and QASC, retrieving related facts from the provided corpus plays an important role in boosting the model’s performance. Therefore, we build our graph reasoning model on top of retrieval-augmented methods from the leaderboards: “AristoRoBERTa” (https://leaderboard.allenai.org/open_book_qa/submission/blcp1tu91i4gm0vf484g) for OpenBookQA and “RoBERTa (2-step IR)” (https://leaderboard.allenai.org/qasc/submission/bolaun0ghifmkohgvhr0) for QASC. In this way, we can study whether strong retrieval-augmented methods can still benefit from KG knowledge and our HGN framework.

4.2 Compared Methods

We compare our model with a series of KG-augmented methods and different graph encoders:

We consider seven models that only use extracted facts. RN (Santoro et al., 2017) builds the graph with the same node set as our method but with extracted edges only. The graph vector is calculated as $\mathbf{g}=\text{Pool}(\{\text{MLP}([\mathbf{v}_{i},\mathbf{e}_{(i,j)},\mathbf{v}_{j}])\mid(v_{i},v_{j})\in\mathcal{E}^{\prime}\})$. GN (Battaglia et al., 2018) presents a general formulation of GNNs. We instantiate it with the layerwise propagation rule defined in Equation 1. It differs from our HGN in that (1) it only considers extracted edges and (2) all edge weights are fixed to 1. MHGRN (Feng et al., 2020) generalizes GNNs with multi-hop message passing. GAT (Velickovic et al., 2018) adopts an attention mechanism to reweight edges locally in each node’s neighborhood. We implement it by replacing the graph edge attention with local edge attention and only considering $\mathcal{L}_{\text{task}}$ during training. RGCN (Schlichtkrull et al., 2018a) extends Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017) with relation-specific transition matrices during message passing. It operates on the same graph as RN. The graph vector is calculated as $\mathbf{g}=\text{Pool}(\{\mathbf{h}^{L}_{i}\mid v_{i}\in V\})$. GconAttn (Wang et al., 2019b) softly aligns the nodes in the question and answer and pools over all matching nodes to get $\mathbf{g}$. KagNet (Lin et al., 2019) uses an LSTM to encode relational paths between question and answer concepts and pools over the path embeddings for graph encoding.

Models Using Extracted and Generated Facts.

We consider two models that use both extracted facts and generated facts. RN + Link Prediction differs from RN by only considering the generated relation (predicted using TransE (Bordes et al., 2013)) between question and answer concepts. PathGenerator (Wang et al., 2020) learns a path generator from paths collected through random walks on the KG. The learned generator is used to generate paths connecting question and answer concepts. $\mathbf{g}$ is calculated as the concatenation of the pooled vector over the generated paths and the pooled vector over the extracted paths.

Our Model’s Variants.

As described in §3.2, the edge embedding can be computed either as a relation embedding or as a path embedding. We name these two variants HGN (w/ RelGen edges) and HGN (w/ PathGen edges), respectively.

4.3 Results

Tables 1, 3, and 4 show performance comparisons between our models and baseline models on CommonsenseQA, CODAH, OpenBookQA, and QASC. We find that models with stronger text encoders perform better (i.e., RoBERTa $>$ BERT-Large $>$ BERT-Base). For all text encoders, our HGN shows consistent improvement over baseline models on all datasets. The improvements over all baselines are statistically significant under most settings, demonstrating the effectiveness of HGN both with and without retrieved evidence.

We also submit our best model to the leaderboards for CommonsenseQA and OpenBookQA. For CommonsenseQA (Table 2), our HGN ranks first among comparable approaches and shows remarkable improvement over PathGenerator (Wang et al., 2020) and the LM Finetuning approach (ALBERT (Lan et al., 2020)). Higher-ranking models either use stronger text encoders or leverage additional data resources. Specifically, UnifiedQA (Khashabi et al., 2020) and T5-3B (Raffel et al., 2020) are based on T5. They have 11B and 3B parameters, respectively, making them impractical to fine-tune in an academic setting. ALBERT+DESC-KCR (Xu et al., 2020) and ALBERT+KD additionally use concept definitions from dictionaries. ALBERT+DESC-KCR and ALBERT+KCR leverage “question_concept” annotations, which are used during the construction of the CommonsenseQA dataset and allow the model to learn shortcuts that do not generalize to other datasets. ALBERT+KRD retrieves sentences from the OMCS corpus (Liu and Singh, 2004) as input. These methods are therefore not comparable with our model. For OpenBookQA (Table 5), our model ranks first among all models using AristoRoBERTa as the text encoder.

Training with Less Labeled Data.

Figures 4(a) and 4(b) show the results of our model and baselines when trained with different portions of the training data on CommonsenseQA and OpenBookQA. Our model achieves better test accuracy under all settings. On CommonsenseQA without retrieved evidence, the improvement over the knowledge-agnostic baseline (LM Finetuning) is generally larger with less training data, which suggests that incorporating external knowledge is helpful in the low-resource setting.

Study on More Model Variants.

To better understand the model design, we experiment with three variants of HGN (w/ RelGen edges) on CommonsenseQA and OpenBookQA. HGN w/o statement vector does not consider $\mathbf{s}$ in Equation 1, which isolates the graph encoder from the text encoder. HGN w/o $\mathcal{L}_{\text{edge}}$ does not consider the entropy regularization term and thus does not penalize non-discriminative edge weights. HGN w/o edge weights reasons over an unweighted graph with hybrid features, which means edge weights are all fixed to 1 during training. Figure 4(c) shows the results of the ablation study. “HGN” outperforms “HGN w/o $\mathcal{L}_{\text{edge}}$”, suggesting the usefulness of our proposed entropy regularization. Comparing “HGN w/o statement vector” with “HGN”, we find that accessing context information is also important for graph reasoning, which means information propagation and edge weight prediction should be conducted in a context-aware manner. “HGN” also improves over “HGN w/o edge weights”, indicating the effectiveness of conducting context-dependent pruning.

4.4 User Study on Learned Structures

To assess HGN’s ability to refine graph structure, we compare the graph structure before and after being processed by HGN. Specifically, we sample 30 questions with their answers from CommonsenseQA’s development set and ask 5 human annotators to evaluate the graph output by GN (with adjacency matrix $\mathbf{A}^{\text{extract}}$ and extracted facts only) and by HGN (with adjacency matrix $\mathbf{A}^{L}$). We manually binarize $\mathbf{A}^{L}$ by removing edges with weight lower than $0.01$.

Given a graph, for each edge (fact), annotators are asked to rate its validness and helpfulness. The validness score is rated as a binary value in a context-agnostic way: 0 (the fact does not make sense), 1 (the fact is generally true). The helpfulness score measures if the fact is helpful for solving the question and is rated on a 0 to 2 scale: 0 (the fact is unrelated to the question and answer), 1 (the fact is related but doesn’t directly lead to the answer), 2 (the fact directly leads to the answer). Note that the percentage of valid edges can be understood as the precision of graph edges. For a given instance, the number of valid edges is proportional to the recall of the edges. We also include another metric named “prune rate” calculated as: $1-\frac{\text{\# edges in binarized }\mathbf{A}^{L}}{\text{\# edges in }\mathbf{A}^{0}}$ , which measures the portion of edges assigned very low weights (softly pruned) during training and is only applicable to HGN.
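The binarization step and the prune rate can be sketched as follows, assuming $\mathbf{A}^{L}$ and $\mathbf{A}^{0}$ are given as nested lists and that an edge is kept when its weight is at least the threshold. The example weights are made up for illustration.

```python
from typing import List


def binarize(A_L: List[List[float]], threshold: float = 0.01) -> List[List[int]]:
    """Remove edges whose learned weight is lower than the threshold."""
    return [[1 if w >= threshold else 0 for w in row] for row in A_L]


def prune_rate(A_L: List[List[float]], A_0: List[List[int]], threshold: float = 0.01) -> float:
    """1 - (# edges in binarized A^L) / (# edges in A^0)."""
    kept = sum(map(sum, binarize(A_L, threshold)))
    total = sum(map(sum, A_0))
    return 1.0 - kept / total


A_0 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]  # four hybrid edges
A_L = [[0.0, 0.0, 0.6], [0.0, 0.0, 0.005], [0.39, 0.005, 0.0]]
rate = prune_rate(A_L, A_0)
```

In this toy case two of the four hybrid edges fall below the threshold, giving a prune rate of 0.5.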

The mean ratings for 30 pairs of (GN, HGN) graphs by 5 annotators are reported in Table 6. The Fleiss’ Kappa (Fleiss, 1971) is 0.51 (moderate agreement) for validness and 0.36 (fair agreement) for helpfulness. The graph refined by HGN has both more edges and denser valid edges compared to the extracted one. The refined graph also achieves a higher average helpfulness score. These all indicate that our HGN learns a superior graph structure with more helpful edges and fewer noisy edges, which improves over previous works that rely on extracted and static graphs. Detailed cases can be found in Appendix §C.
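For reference, Fleiss’ Kappa can be computed from a table of rating counts as in the sketch below. It follows the standard definition and assumes every item (here, presumably each rated edge) is rated by the same number of annotators; the count tables are illustrative and are not the study’s data.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """counts[i][j] = number of raters who assigned item i to category j."""
    N = len(counts)      # number of items
    n = sum(counts[0])   # raters per item, assumed constant across items
    k = len(counts[0])   # number of categories
    # Per-item observed agreement P_i and its mean.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / N
    # Category proportions p_j and chance agreement P_e.
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_e = sum(pj * pj for pj in p)
    return (P_bar - P_e) / (1.0 - P_e)


perfect = fleiss_kappa([[5, 0], [0, 5], [5, 0], [0, 5]])  # 5 raters always agree
split = fleiss_kappa([[3, 2], [2, 3]])                    # raters split on every item
```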

Related Work

Commonsense QA is challenging because the required commonsense knowledge is seldom given in the question-answer context or encoded in the PLM’s parameters. Thus, many works obtain this knowledge from external sources (e.g., KGs, corpora). While Lv et al. (2020) show that KGs and corpora can provide complementary knowledge, our paper focuses on improving the use of KG knowledge. KG knowledge can be acquired in different ways, either from KG-extracted edges Lin et al. (2019); Ma et al. (2019); Feng et al. (2020); Yasunaga et al. (2021), PLM-generated edges Bosselut and Choi (2019), or both Wang et al. (2020). KG-augmented models mainly differ in how they encode KG knowledge, using message passing Schlichtkrull et al. (2018a); Feng et al. (2020) or edge/path aggregation Lin et al. (2019); Bosselut and Choi (2019); Ma et al. (2019); Wang et al. (2020). The most relevant work to ours is Wang et al. (2020). The main difference is that they coarsely combine extracted and generated knowledge via late fusion, while HGN encodes both types of knowledge within a unified graph. Besides, they use RN to pool over a set of paths for graph encoding, while HGN reasons over the graph via message passing and edge reweighting.

Graph Structure Learning.

Instead of assuming a fixed graph structure, a number of graph models learn the graph structure with respect to the downstream task. Some models learn to discretely select edges for the graph (i.e., hard pruning). Kipf et al. (2018) and Franceschi et al. (2019) sample the graph structure from a predicted probabilistic distribution with differentiable approximations. Norcliffe-Brown et al. (2018) calculate the relatedness between any pair of nodes and only keep the top- $k$ strongest connections for each node to construct the edge set. Sun et al. (2019) start with a small graph and iteratively expand it with retrieving operations. Others learn to reweight edges in a fully connected graph (i.e., soft pruning). Jiang et al. (2019) and Yu et al. (2019) propose heuristics for regularizing edge weights. Hu et al. (2019) use the question embedding to help predict edge weights. Unlike other edge reweighting models, HGN operates over a hybrid graph of both extracted and generated edges, while updating edge weights with respect to node, edge, and text features.

Conclusion

In this paper, we propose HGN, a KG-augmented model for CSR. To address KG edge sparsity and noisy edge extraction/generation, HGN learns to jointly contextualize extracted and generated knowledge by reasoning over both within a unified graph structure. We justify HGN’s design by showing that HGN improves performance on various CSR benchmarks and in user studies. In future work, we plan to increase the graph’s relation expressiveness by incorporating open relations, and to make the edge extraction/generation process more dependent on the reasoning context.

Acknowledgments

This research is supported in part by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via Contract No. 2019-19051600007, the DARPA MCS program under Contract No. N660011924033 with the United States Office Of Naval Research, the Defense Advanced Research Projects Agency with award W911NF-19-20271, and NSF SMA 18-29268. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of ODNI, IARPA, or the U.S. Government. We would like to thank all the collaborators in USC INK research lab for their constructive feedback on the work. We would also like to thank the anonymous reviewers for their valuable comments.

References

Appendix A Implementation Details of Edge Embedding Generator (RelGen)

Here, we give a more detailed explanation of the PLM-based edge embedding generator $f_{\text{gen}}$ , introduced in the “Edge Embedding Generation” paragraph of §3.2.

Appendix B Details of Datasets

Below are descriptions of the four datasets used for the experiments presented in §4.

CommonsenseQA

(Talmor et al., 2019) is a multiple-choice QA dataset targeting commonsense. It is constructed based on the knowledge in ConceptNet. Since the test set of the official split (9741/1221/1140 for OFtrain/OFdev/OFtest) is not publicly available, we compare our models with baseline models on the in-house split (8500/1221/1241 for IHtrain/IHdev/IHtest; https://github.com/INK-USC/MHGRN/blob/master/data/csqa/inhouse_split_qids.txt) used by previous works (Lin et al., 2019; Feng et al., 2020; Wang et al., 2020).

CODAH

(Chen et al., 2019) contains 2801 sentence completion questions testing commonsense reasoning skills. We perform 5-fold cross validation using the official split.

OpenBookQA

(Mihaylov et al., 2018) is a multiple-choice QA dataset modeled after open-book exams. Besides 5957 elementary-level science questions (4957/500/500 for train/dev/test), it also provides an open book with 1326 core science facts. Solving the dataset requires combining facts from the open book with commonsense knowledge.

QASC

(Khot et al., 2020) is a QA dataset with questions about grade-school science. It has 9980 8-way multiple-choice questions (8134/926/920 train/dev/test), and comes with a corpus of 17M sentences. Since the official test set does not have labels, we create an in-house test split by moving a randomly sampled set of 920 questions from the training set to the test set. Solving questions in QASC requires retrieving facts from the corpus and composing them to produce an answer.

Appendix C Case Study

In addition to the experiments in §4, we present a case study here, which compares an HGN-generated graph with a KG-extracted graph used by GN. On the development set of CommonsenseQA, there are two dominant cases, and we show a representative instance of each. Figure 5(a) shows the first case, where HGN prunes edges from the extracted graph. Our HGN assigns the highest weights to the most helpful facts (book, AtLocation, house) and (telephone book, AtLocation, house). It also downweights the unhelpful fact (place, IsA, house) and the invalid fact (usually, RelatedTo, house). Figure 5(b) shows the second case, where newly generated facts are incorporated into reasoning. All generated facts that are kept by the model make sense in the context and help identify the answer. Both cases suggest that our model improves the quality of the contextualized knowledge graph compared to current methods that rely only on extracted facts.