Dynamic Neuro-Symbolic Knowledge Graph Construction for Zero-shot Commonsense Question Answering

Antoine Bosselut, Ronan Le Bras, Yejin Choi

Introduction

Understanding narratives requires reasoning about all the implicit, but trivially inferable, details of a situation based only on what is explicitly stated in text. A statement as simple as “they went to the club” instantly invokes a bank of commonsense expectations: they had to get dressed, they were going dancing, they likely had drinks, and so forth. These reasoning capabilities are missing in most existing neural language understanding models that learn task-specific representations without acquiring rich background knowledge about the social and physical world.

In response, recent work has investigated augmenting deep learning models with retrieval mechanisms over large-scale commonsense knowledge graphs (Mihaylov and Frank 2018; Bauer, Wang, and Bansal 2018; Paul and Frank 2019). However, these approaches assume an entity linking step between the written text and knowledge graph. By canonicalizing entities, they discard key context surrounding the input, and often retrieve semantically irrelevant knowledge (e.g., a “club” being a blunt weapon is irrelevant to the earlier situation).

In this paper, we propose to generate new knowledge that is contextually relevant instead of retrieving existing knowledge as is. Bosselut et al. (2019) recently introduced Commonsense Transformers (COMeT), a new framework for training neural representations of knowledge graphs. This new class of neural knowledge model provides a powerful representational tool for connecting commonsense knowledge to downstream task models. Because COMeT represents knowledge graphs neurally, it can generate commonsense inferences for any entity that can be encoded by the neural model (i.e., described with language). With no need to canonicalize context entities to link to a static knowledge graph, the knowledge model can be queried directly with complex compositional structures, and even full narrative contexts. This flexibility has led knowledge models to be used out of the box in a variety of settings requiring contextual knowledge, such as sarcastic comment generation (Chakrabarty et al. 2020), therapy chatbots (Kearns et al. 2020), and story plot generation (Ammanabrolu et al. 2020).

In this work, we use COMeT to construct context-relevant knowledge graphs that can be reasoned over for commonsense question answering. Given a raw context, COMeT generates commonsense inferences that provide world knowledge about the situation depicted in the context. These inferences can be used as additional context to score answer candidates or to generate additional inferences. By generating new inferences and connecting them to the raw context and answers, COMeT dynamically constructs a symbolic graph of commonsense knowledge. The raw context is the root node, answer choices are leaf nodes and generated commonsense inferences provide intermediate nodes between them, instantiating different reasoning paths between the context and answers. Using COMeT generated scores as factors weighting these paths, we propose new inference algorithms to reason over the generated graph and identify the most likely answers to questions about the situation.

We evaluate our approach in a zero-shot setting on the SocialIQa (Sap et al. 2019b) benchmark, a question answering dataset for evaluating social commonsense, and the StoryCS benchmark (Rashkin et al. 2018), a story understanding dataset. Empirical results show that our neuro-symbolic approach, COMeT - DynaGen, outperforms purely neural large-scale pretrained language models (Radford et al. 2018, 2019) and knowledge models that evaluate QA examples directly without dynamically generating an intermediate symbolic commonsense knowledge graph (i.e., reasoning with COMeT with no inference hops).

Dynamic Knowledge Graph Construction for Question Answering

Our approach uses a knowledge model, COMeT (Bosselut et al. 2019), to dynamically construct a context-relevant commonsense knowledge graph about a presented situation. COMeT is trained using transfer learning from large-scale pretrained language models (Radford et al. 2018) to knowledge graphs. When trained on the Atomic knowledge graph (Sap et al. 2019a), it learns to generate social commonsense inferences of situations depicted in text. Importantly, unlike static knowledge graphs (e.g., ConceptNet; Speer, Chin, and Havasi 2017), which require canonicalizing input entities to link to the graph, COMeT represents knowledge neurally, allowing it to generate commonsense for arbitrary input forms.

In Figure 1, for example, the context “Kai knew things were getting out of control and managed to keep his temper in check” is unlikely to be found in any existing knowledge graph. It describes a very specific situation. However, COMeT can parse this full context and generate commonsense knowledge about Kai’s reactions and motivations, such as “Kai stays calm” or “Kai wants to avoid trouble,” as downstream inferences. We exploit this generalization property of knowledge models to dynamically construct knowledge graphs for presented situations that can be reasoned over to answer commonsense questions about them.

Formally, we assume a dataset of examples, each with an associated context $c$ describing a situation, a question $q$ asked about that situation, and a set of $n$ possible answers $\mathcal{A}=\{a^{0},...,a^{n-1}\}$ to that question. Each answer is composed of multiple tokens $Y^{a}=\{y_{1},...,y_{|a|}\}$.

Generating Commonsense Inferences. We generate commonsense inferences for a situational context $c$ by concatenating the context with relation types from the Atomic knowledge graph and using COMeT to produce candidates $\mathcal{G}$. Each candidate $g\in\mathcal{G}$ is associated with a score $\phi_{g}$ that approximates the model’s confidence in the inference:

$$\phi_{g}=\frac{1}{|g|}\sum_{t=1}^{|g|}\log P(x_{t}|x_{<t},c,r) \tag{1}$$

where $x_{t}$ are the tokens of $g$, $|g|$ is the token length of $g$, $r$ is an arbitrary commonsense relation type for which COMeT can generate inferences, and:

$$P(x_{t}|x_{<t},c,r)=\text{COMeT}(c,r,x_{<t}) \tag{2}$$

where the tokens of $c$ and $r$ are concatenated with the tokens $x_{<t}$ to be input to COMeT. Any generation $g\in\mathcal{G}$ conditioned on $c$ can be seen as a 1-hop commonsense inference of $c$.
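As an illustration of the scoring just described, the following minimal sketch reads $\phi_{g}$ as a length-normalized conditional log-likelihood. The `NextTokenModel` interface and `toy_model` are hypothetical stand-ins for COMeT, not its actual API, and the probabilities are invented for the example.

```python
import math
from typing import Callable, Dict, Sequence

# Hypothetical interface (an assumption, not COMeT's API): given a context,
# a relation, and the tokens generated so far, return a probability
# distribution over the next token.
NextTokenModel = Callable[[str, str, Sequence[str]], Dict[str, float]]


def inference_score(model: NextTokenModel, context: str, relation: str,
                    tokens: Sequence[str]) -> float:
    """Average per-token log-probability of `tokens` given context and relation."""
    if not tokens:
        raise ValueError("an inference needs at least one token")
    total = 0.0
    for t, token in enumerate(tokens):
        # Condition on the context, the relation, and the preceding tokens.
        dist = model(context, relation, tokens[:t])
        total += math.log(dist[token])
    return total / len(tokens)


def toy_model(context: str, relation: str,
              prefix: Sequence[str]) -> Dict[str, float]:
    # Toy stand-in that ignores its inputs; a real knowledge model would not.
    return {"stays": 0.5, "calm": 0.25, "angry": 0.25}


score = inference_score(toy_model, "Kai kept his temper in check",
                        "xEffect", ["stays", "calm"])
```

Normalizing by the token count keeps longer inferences from being penalized simply for having more tokens.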

Knowledge Graph Reasoning

Being designed as a conditional language model, COMeT can also be used to score candidate commonsense inferences. We use this property to score answer candidates $a\in\mathcal{A}$ conditioned on the generated commonsense inferences $g\in\mathcal{G}$ that are connected to them. The scores from COMeT are used to initialize factor nodes between each generated commonsense inference (at all levels of the graph) and each answer choice. Using these scores, and scores between commonsense inferences (Eqs. 1, 3), as a set of factors, our generated knowledge graph implicitly encodes a factor graph that can be reasoned over to evaluate each answer candidate.

COMeT is originally trained to maximize the conditional log-likelihood of the tokens of a target entity $e_{2}$ from a knowledge graph tuple ($e_{1}$, $r$, $e_{2}$). As a result, the knowledge model can measure the log-likelihood of a candidate entity $e_{2}$ given a source entity $e_{1}$ and relation $r$. For a given example, we treat each answer candidate $a$ as an $e_{2}$ candidate for COMeT, map the parent nodes of $a$ (e.g., $g$ nodes) to be equivalent to $e_{1}$, and set the question $q$ as $r$, allowing COMeT to evaluate each answer candidate according to its implicit knowledge representations. For each answer $a\in\mathcal{A}$, we define a factor based on each token’s conditional log-likelihood as computed by COMeT:

$$\phi_{ga}=\frac{1}{|a|}\sum_{s=1}^{|a|}\log P(y_{s}|y_{<s},g,q) \tag{4}$$

where $y_{s}$ corresponds to the token in $a$ at time step $s$, $y_{<s}$ is all the tokens preceding $y_{s}$ in $a$, and $|a|$ is the total number of tokens making up $a$. In this way, for any QA example, we define a set of factor nodes $\phi_{ga}$ connecting the answer candidates $a\in\mathcal{A}$ to the commonsense inferences $g\in\mathcal{G}$ generated by COMeT about the situational context $c$.

Overcoming Answer Priors. Because certain answer candidates have a high probability of occurring for certain questions regardless of the context (e.g., happy is a common answer for questions about emotional reactions), we redefine $\phi_{ga}$ (Eq. 4) in terms of the point-wise mutual information between the commonsense path $g$ and answer $a$:

$$\phi_{ga}=\frac{1}{|a|}\sum_{s=1}^{|a|}\Big(\log P(y_{s}|y_{<s},g,q)-\log P(y_{s}|y_{<s},q)\Big) \tag{5}$$

where $\log P(y_{s}|y_{<s},q)$ is the log-likelihood of each token in the answer given only the question and previous answer tokens. We describe our approximation of this distribution in Appendix B.
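To make the effect of the answer prior concrete, here is a small sketch that subtracts question-only token log-likelihoods from inference-conditioned ones and averages over the answer tokens. The function name `pmi_factor` and all log-probabilities are hypothetical, chosen only to show how a frequent answer such as "happy" can lose its advantage.

```python
from typing import Sequence


def pmi_factor(cond_logps: Sequence[float],
               prior_logps: Sequence[float]) -> float:
    """Average per-token difference between two sets of log-likelihoods.

    cond_logps[s]  ~ log P(y_s | y_<s, g, q)  (inference and question)
    prior_logps[s] ~ log P(y_s | y_<s, q)     (question only)
    """
    if not cond_logps or len(cond_logps) != len(prior_logps):
        raise ValueError("expected two non-empty sequences of equal length")
    diffs = [c - p for c, p in zip(cond_logps, prior_logps)]
    return sum(diffs) / len(diffs)


# Hypothetical single-token answers: "happy" is likely with or without the
# inference, while "drained" only becomes likely once the inference is given.
happy = pmi_factor([-0.5], [-0.4])
drained = pmi_factor([-1.2], [-3.0])
```

With these invented numbers, "happy" has the higher raw conditional log-likelihood, but "drained" receives the higher factor once the prior is subtracted.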

Inference

Experimental Setup

We evaluate our approach in a zero-shot experimental setting. It is a well-studied phenomenon that neural methods trained on crowdsourced data often learn to shortcut reasoning to arrive at a correct answer (Gururangan et al. 2018; Li and Gauthier 2017). We use a zero-shot setting to simulate the model having to reason about situations it has never encountered before, forcing it to construct reasoning graphs from explicit knowledge it can generate (e.g., knowledge learned by COMeT), and precluding it from learning dataset-specific artifacts. As such, we do not use training data to update model parameters. Furthermore, any result presented on the test set does not have hyperparameters tuned on the development set.

We evaluate our method on two datasets: SocialIQa (Sap et al. 2019b) and StoryCS (Rashkin et al. 2018).

SocialIQa. The SocialIQa dataset evaluates a model’s ability to understand the social dynamics underlying situations described in short text snippets. Each example in the dataset consists of a context, a question about that context, and three multiple choice answers. An example from the dataset is shown in Figure 2(a). We outline pre-processing steps for the data in Appendix A.

StoryCS. The StoryCS dataset consists of short 5-sentence stories with annotated motivations and emotional responses whose labels are drawn from classical theories of psychology (e.g., Plutchik 1980). We map the emotion classification task to a QA task by posing an individual question for each emotion label (disgust, surprise, fear, anger, trust, anticipation, sadness, joy) that must be predicted for each example. We outline this procedure in Appendix B.

Experimental Settings

Prediction. To predict an answer on the SocialIQa dataset, we use Equation 9. Prediction for StoryCS is less straightforward, as the task is originally binary multi-label classification. To make a prediction, we treat $\phi_{a}$ (Eq. 8) for each label $j$ independently and select an answer based on whether $\phi_{a,j}$ is above a label-specific threshold, $\kappa^{j}$. To avoid violating the zero-shot setting (e.g., by tuning thresholds on the development set), we set each threshold to the score at the percentile of the CDF of the model's scores that corresponds to the frequency of the positive label (e.g., if the joy emotion is present for 20% of examples, we set the threshold at the score of the 20th percentile of the CDF). Thresholds are reported in Appendix Table 10 for each label.
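The threshold rule can be sketched as follows, reading the percentile literally as stated in the example above. The nearest-rank percentile definition, the function names, and the scores are assumptions made for illustration; the text does not specify how the empirical CDF is interpolated.

```python
import math
from typing import List, Sequence


def percentile_threshold(scores: Sequence[float], positive_rate: float) -> float:
    """Score at the `positive_rate` percentile (nearest rank) of the scores."""
    if not scores:
        raise ValueError("need at least one score")
    if not 0.0 < positive_rate <= 1.0:
        raise ValueError("positive_rate must be in (0, 1]")
    ordered = sorted(scores)
    rank = math.ceil(positive_rate * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]


def predict(scores: Sequence[float], threshold: float) -> List[bool]:
    # A label is predicted positive when its score is above the threshold.
    return [s > threshold for s in scores]


# Hypothetical scores for one label (e.g., joy) over ten examples.
joy_scores = [-3.0, -2.5, -2.0, -1.5, -1.0, -0.8, -0.6, -0.4, -0.2, -0.1]
kappa_joy = percentile_threshold(joy_scores, positive_rate=0.2)
joy_predictions = predict(joy_scores, kappa_joy)
```

No gold labels for individual examples are consulted here; only the overall frequency of the positive label is needed.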

SocialIQa Study

As baselines in the SocialIQa study, we use large-scale pretrained language models: GPT (Radford et al. 2018), GPT2-117M, GPT2-345M, and GPT2-762M (Radford et al. 2019). To adapt these language models optimally to the QA task, question-answer pairs are automatically converted to a templated form, a process we outline in Appendix B. We also report the results of a model, COMeT - Direct, that only uses $\phi^{0}_{a}$ to select answers (i.e., answers are evaluated with respect to the context with no dynamic graph construction). Additionally, we compare against the Self-Talk model of Shwartz et al. (2020), which queries pretrained language models to generate additional details about a presented situation and appends these to the original context. Finally, we report the result of supervised BERT (Devlin et al. 2018) and RoBERTa (Liu et al. 2019) models, and random and human baselines from Sap et al. (2019b).

Overall Performance. We report the main results of our SocialIQa study in Table 2. First, our approach achieves an absolute improvement of $\sim$10.2% over the top performing language model baseline, GPT2-762M, showing the importance of using knowledge models to represent commonsense. Additionally, our approach of dynamically constructing a knowledge graph on demand (COMeT - DynaGen) performs better than using the knowledge model to directly evaluate answers (COMeT - Direct) by $\sim$3.6%, highlighting the value in representing more complex reasoning paths. Finally, the improvement over Self-Talk demonstrates the benefit of using a structured graphical representation for reasoning compared to one that uses language models to generate additional situational context sentences for conditioning.

We note, however, that the state-of-the-art performance of the supervised BERT and RoBERTa models is significantly higher, meaning there is room for improvement in developing comparable zero-shot approaches to QA. One point of interest, though, is that BERT trained with only 5000 training examples (rather than the full 30k) reaches a performance (54%) close to that of COMeT - DynaGen, indicating that knowledge models and joint neuro-symbolic solutions are already promising in low-data regimes.

Qualitative Analysis. In Table 1, we present top reasoning paths from the graphs generated by COMeT - DynaGen. The strength of our approach can be seen in the first example, where the correct answer, drained, is more likely to be a feeling associated with wanting “to go home,” a post-condition in the graph generated by COMeT - DynaGen. In the original context, this condition is implicit. This benefit to leveraging graph reasoning is also seen in the second example, where Quinn’s foolishness is linked to “[getting] hurt.” We note that COMeT - Direct, RoBERTa-large, and GPT2-345M all answer this question incorrectly, reinforcing the importance of explicit reasoning graphs.

In the final two examples, we present uninteresting or failure cases. In the first, the model predicts that Alex will experience joy after reasoning through the path that he will be “happy,” which, while correct, is merely leveraging synonymy. In the final example, we show a case where the model selects an incorrect answer by reasoning through an incorrect path. By recognizing “Taylor wants to celebrate” as a likely post-condition of the context, the model selects an answer that is incorrect. An interesting secondary failure mode in this example is in the second path through the inference “Taylor wants to be home.” While this path selects the correct answer, it would not be considered explanatory by humans. In general, we find these cases to be more common in multi-sentence situations. The compositionality of the context makes it more challenging to generate directed inferences, and the factor nodes become less reliable in the graph. We observe that performance on multi-sentence contexts drops by $\sim$5%.

Graph Construction Algorithm. As the quality of the reasoning paths is essential to our approach, we investigate the effect of the inference generation algorithm. We evaluate the following candidate generation algorithms: argmax decoding, beam search with beam size $b=5,10$, and top-$k$ sampling (Fan, Lewis, and Dauphin 2018; Holtzman et al. 2018) with $k=5,10$. For each decoding method, we dynamically generate a graph using every candidate produced by the decoder (e.g., argmax decoding produces one candidate, top-10 sampling produces 10 candidates).
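To illustrate how the decoding strategy changes the number of candidates that enter the graph, the toy sketch below contrasts argmax selection with top-$k$ sampling. It is a deliberate simplification: it treats each complete inference as a single outcome of a hypothetical distribution `dist`, whereas real decoders operate token by token, and the function names are illustrative only.

```python
import random
from typing import Dict, List


def argmax_candidates(dist: Dict[str, float]) -> List[str]:
    """Argmax decoding yields a single candidate: the most probable one."""
    return [max(dist, key=dist.get)]


def top_k_candidates(dist: Dict[str, float], k: int,
                     rng: random.Random) -> List[str]:
    """Draw k candidates from the k most probable outcomes.

    Sampling is with replacement, so duplicates are possible.
    """
    top = sorted(dist.items(), key=lambda kv: kv[1], reverse=True)[:k]
    outcomes = [outcome for outcome, _ in top]
    weights = [prob for _, prob in top]
    return rng.choices(outcomes, weights=weights, k=k)


# Hypothetical distribution over whole inferences for one context/relation.
dist = {"calm": 0.4, "relieved": 0.3, "proud": 0.2, "angry": 0.1}
greedy = argmax_candidates(dist)
sampled = top_k_candidates(dist, k=2, rng=random.Random(0))
```

Every returned candidate would become a node in the generated graph, so larger $k$ yields a larger but potentially noisier graph.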

Our results in Table 3 show that the performance of COMeT - DynaGen is not dependent on the decoding strategy used to dynamically generate the commonsense knowledge graph. This result is promising as it shows that the reasoning procedure is robust to variability in the candidate generations (larger graphs will be less precise). However, it also shows that the approach has difficulty using richer dynamically-generated commonsense knowledge representations to answer questions correctly. These results point to the need for future work in developing algorithms that can aggregate larger sets of commonsense inference paths as more expansive knowledge graphs are constructed using more powerful knowledge models.

StoryCS Study

As with SocialIQa, we report the results of a random baseline, pretrained language models adapted to the task, and a model that only uses $\phi^{0}_{a}$ to select answers (COMeT - Direct). As supervised comparison models, we report the performance of several BERT-based models from Gaonkar et al. (2020) that are state-of-the-art for the task.

Overall Performance. Our results indicate that our zero-shot algorithm, COMeT - DynaGen, significantly outperforms other zero-shot baselines such as language models, including models with twice the number of parameters. Importantly, we again see consistent improvement from dynamically generating a contextual commonsense knowledge graph rather than directly evaluating the answer choices with COMeT - Direct. Our full approach yields higher precision, recall, and F1 than the COMeT - Direct baseline.

Qualitative Analysis. We once again see the benefit of generating a reasoning graph in Table 5. COMeT - DynaGen is able to select the two correct answers to “How does Daniel feel?” leveraging the path through the commonsense inference that “His Dad is helpful” to predict that Daniel is trusting, and the path through the commonsense inference “Daniel wants to try something new” to predict that Daniel is excited. However, there is still much room for improvement, as large-scale pretrained language models that are fine-tuned using supervised data perform considerably better on the task.

Few-shot Tuning. To evaluate the quality of the untuned thresholds from Section 4, which are set from the label distribution using the CDF of the model’s scores (CDF-label in Table 6), we also report the results of our approach using different strategies to set the thresholds $\kappa$. First, we explore the impact of tuning the $\kappa$ thresholds on varying amounts of the development set data: 4 examples, 10 examples, 20 examples, and 20% of the development data (the same amount used for validation in Rashkin et al. 2018). In each of these settings, we run a study with 5 different randomly selected sets of examples, and report the average performance. We also report the performance of using the 50th percentile score of the CDF as the threshold (CDF-50). In Table 6, we observe large recall gains from these tuning strategies at the expense of precision. However, tuning using merely 10 examples achieves higher F1 than the default strategy, showing the potential of relaxing to a few-shot setting when limited examples are available.
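One way to tune a threshold on a handful of labeled examples is sketched below. The text does not state the tuning objective, so selecting the threshold that maximizes F1 on the few available examples is an assumption, as are the function names and the toy data.

```python
from typing import List, Sequence, Tuple


def f1_score(preds: Sequence[bool], labels: Sequence[bool]) -> float:
    tp = sum(1 for p, l in zip(preds, labels) if p and l)
    fp = sum(1 for p, l in zip(preds, labels) if p and not l)
    fn = sum(1 for p, l in zip(preds, labels) if l and not p)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_threshold(scores: Sequence[float],
                   labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 on a few labeled examples.

    Candidate thresholds are the observed scores plus one value below the
    minimum; ties are broken in favor of the lowest threshold.
    """
    candidates: List[float] = [min(scores) - 1.0] + sorted(set(scores))
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        preds = [s > t for s in scores]
        f1 = f1_score(preds, labels)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Four hypothetical development examples for a single emotion label.
kappa, dev_f1 = tune_threshold([-2.0, -1.5, -1.0, -0.5],
                               [False, False, True, True])
```

With so few examples the selected threshold is noisy, which is consistent with averaging over several randomly selected example sets as described above.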

Related Work

Previous work has explored integrating reasoning over static knowledge graphs for question answering and story understanding. In general, these approaches extract knowledge tuples from the static KG by linking canonicalized entities to nodes and performing multi-hop inference along relation paths to form full tuples that can be encoded by a downstream neural architecture (Mihaylov and Frank 2018; Bauer, Wang, and Bansal 2018; Weissenborn, Kočiský, and Dyer 2017; Lin et al. 2019; Paul and Frank 2019). Similar to our approach of discovering reasoning chains between contexts and answers, Paul and Frank (2019) extract reasoning paths in ConceptNet between normalized entities from the context and answer candidates, but can only discover paths through nodes in the static knowledge graph. Finally, there exist works that also dynamically construct latent knowledge graphs (Das et al. 2019; Bosselut et al. 2018), but these works presuppose a fixed set of entities that can be KG nodes and then approximate graph edges with neural transformations. In contrast, our algorithm can generate arbitrary nodes, thereby constructing a unique graphical structure for any example.

Multi-hop Reading Comprehension. Similar in spirit to reasoning over knowledge graphs for question answering is work in multi-hop reading comprehension. Many datasets for learning to aggregate facts without graph structure have been released in recent years (Weston et al. 2016; Welbl, Stenetorp, and Riedel 2018; Yang et al. 2018; Talmor and Berant 2018). Approaches designed for these resources generally use large-scale neural networks to attend over supporting facts across text (Zhong et al. 2019; Dhingra et al. 2018). Most similar to our work are approaches that construct real-time entity mention graphs as neural reasoning paths (Cao, Aziz, and Titov 2018; Jiang et al. 2019; Jiang and Bansal 2019; Fan et al. 2019). Our approach differs from these models in that we generate relevant supporting information rather than mining it from accompanying documents and conduct our study in a zero-shot setting with no additional training.

Automatic Commonsense KG Construction. Multi-hop reasoning over commonsense inferences requires construction of knowledge resources, and recent approaches have investigated how to mine commonsense knowledge from deep learning models. Sap et al. (2019a) investigated whether LSTM models could generate new tuples for the Atomic knowledge graph. Similarly, Li et al. (2016) and Saito et al. (2018) explored whether neural models could be used to validate proposed knowledge rather than generating it. Jastrzębski et al. (2018) built on these approaches for evaluating novel commonsense knowledge mined from Wikipedia. More recent work mapped commonsense tuples to natural language with templates and used pretrained language models to validate them (Davison, Feldman, and Rush 2019; Petroni et al. 2019). Concurrently, other research has explored using pretrained language models and adapting them as generative knowledge graph constructors (Bosselut et al. 2019; Malaviya et al. 2019). In contrast to these works that augment static knowledge graphs, our approach focuses on constructing knowledge graphs on demand to provide context-dependent commonsense for downstream inference.

Conclusion

Our neuro-symbolic approach uses neural representations of large-scale commonsense knowledge graphs (COMeT) to generate contextual knowledge graphs on demand for zero-shot question answering. Our approach dynamically constructs a knowledge graph of commonsense inferences related to a presented context and uses it to evaluate answer options for a posed question. A novel inference algorithm reasons over the constructed graph to select the most likely answer to a question. Our approach shows promising results at answering questions without training on the end task on two datasets, SocialIQa and StoryCS, outperforming zero-shot pretrained language models. Finally, our analysis indicates that dynamically generating a contextualized commonsense knowledge graph for inference performs better than using vanilla knowledge models (COMeT - Direct) to directly answer questions.

Acknowledgments

We thank Maarten Sap, Hannah Rashkin, Vered Shwartz, and Chandra Bhagavatula for helpful feedback. This research was supported in part by NSF (IIS-1524371, IIS-1714566), DARPA under the CwC program through the ARO (W911NF-15-1-0543), DARPA under the MCS program through NIWC Pacific (N66001-19-2-4031), JD.com, and the Allen Institute for AI (AI2).

References

Appendix A Datasets and Preprocessing

We report statistics of the SocialIQa and StoryCS datasets in Table 7 below:

We use the original dataset splits proposed by the authors. We filter 2 and 7 examples from the SocialIQa development and test sets, respectively, that are spam.

Atomic and SocialIQa

During its curation, SocialIQa was seeded with Atomic triples. We address whether this could be a potential source of bias that benefits the approaches based on COMeT. In our analysis, we find there is minimal opportunity for data leakage between these resources.

First, the Atomic knowledge graph was designed with the expectation that neural models would be trained on it, using transfer learning to carry over knowledge from language. As a result, to evaluate transfer in this setting, the knowledge graph is split into a training, development, and test knowledge graph. These splits were made adversarially, meaning no head entities in the training knowledge graph are found in the evaluation knowledge graphs. The SocialIQa evaluation sets maintain this split in their design (SocialIQa training set seeded by Atomic training KG, etc.). As a result, no example in the SocialIQa evaluation sets is derived from a tuple in the Atomic training knowledge graph. Our COMeT implementation is only trained on the training portion of the Atomic knowledge graph, meaning our method does not learn from any examples used to design the SocialIQa evaluation sets. In our work, we do not use any examples from the SocialIQa training set.

Second, the SocialIQa authors state that crowdworkers heavily re-edited the Atomic triples to generate contexts, questions, and answers for each SocialIQa example. In any case, to evaluate whether unintentional overlap could still remain, we ran an analysis to recover close Atomic training tuples for each example in the SocialIQa development set. We removed stopwords from events, stemmed their tokens, and checked whether they could be recovered in the stemmed tokens of SocialIQa contexts. Among recovered Atomic events, we checked whether any of their associated tail entities were present in the answer choices of the SocialIQa example.

Using this matching scheme, we found an overlap for only $\sim$1.7% of examples (34/1954 examples in the development set), largely because stemming compresses events so that they appear to be a subset of a SocialIQa example. Furthermore, the COMeT - DynaGen and COMeT - Direct models would still perform better than all baselines on the SocialIQa development set with these examples removed. Finally, we note that this level of leakage falls far short of the 30% leakage identified in commonly-used QA datasets (Lewis, Stenetorp, and Riedel 2020).
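The matching scheme described above can be sketched as follows. The stopword list, the crude suffix-stripping stemmer, and the treatment of the `PersonX`/`PersonY` placeholders as stopwords are all assumptions made for illustration; the text does not name the stemmer or stopword list that was used.

```python
from typing import List, Set

# Hypothetical stopword list; Atomic's PersonX/PersonY placeholders are
# dropped as well (an assumption) so events can match named contexts.
STOPWORDS = {"a", "an", "the", "to", "of", "in", "and", "his", "her",
             "their", "is", "was", "personx", "persony"}


def crude_stem(token: str) -> str:
    """Toy suffix stripper standing in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token


def normalize(text: str) -> Set[str]:
    """Lowercase, strip punctuation, drop stopwords, and stem."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    return {crude_stem(t) for t in tokens if t and t not in STOPWORDS}


def overlaps(event: str, tails: List[str], context: str,
             answers: List[str]) -> bool:
    """True if the event is recoverable in the context and some tail
    entity is recoverable in one of the answer choices."""
    event_tokens = normalize(event)
    if not event_tokens or not event_tokens <= normalize(context):
        return False
    answer_sets = [normalize(a) for a in answers]
    for tail in tails:
        tail_tokens = normalize(tail)
        if tail_tokens and any(tail_tokens <= a for a in answer_sets):
            return True
    return False


hit = overlaps("PersonX goes to the club", ["to dance"],
               "Alex goes to the club with friends.",
               ["wants to dance", "go home", "sleep"])
miss = overlaps("PersonX goes to the club", ["to dance"],
                "Alex stayed home.", ["wants to dance", "go home", "sleep"])
```

Because stemming and stopword removal shrink events to a few tokens, subset matches of this kind can fire even when the event and context describe different situations, which is the compression effect noted above.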

Appendix B Additional Experimental Settings

We approximate the marginal distribution for the PMI calculation in Equation 5 using Equation 2, but set $c=$ “PersonX”. Every training example in the Atomic knowledge graph on which COMeT is trained begins with this token, so using it as the only token in the context essentially provides an output distribution that is only conditioned on the question $q$ .

Generation Processing

To ground the conditional distribution on which COMeT and GPT2 (for baselines) were trained, we process the data and generations in the following ways:

For language model baselines (i.e., the class of GPT2 models), we adapt the QA task as natural language statements to be evaluated by the language models. Question-answer pairs are automatically converted to a templated form. For example, a question such as “How does Alice feel after?” will be replaced by the template “Alice feels” and prepended to the answer. The resulting snippet is then concatenated to the context, and the language models score the answer words conditioned on the context and template. We record the perplexity of each statement and select the answer with the lowest perplexity. Table 9 provides the template for each question variety.

When converting generated inferences to contexts for answer scoring (Eq. 4), we add a prefix that is specific to the inference type to the generated tokens (e.g., happy $\Rightarrow$ Person is happy).

We prepend the following prefixes to COMeT-generated inferences when using them in Equation 4 to compute factor nodes between them and answer nodes:

For the StoryCS dataset, when scoring the answer text, we use formulations of the words that make up the classification label (e.g., disgust, surprise, fear, anger, trust, anticipation, sadness, joy $\Rightarrow$ disgusted, surprised, afraid, angry, trusting, excited, sad, happy). As question representations $q$ to give to COMeT, we use the relations from Atomic (Sap et al. 2019a) that correspond to reactions to events: xReact and oReact. We compute $\phi_{ga}$ (Eq. 4, 5) for each $q$ and average them.

For our main models and ablations, names that appear in contexts and answers are anonymized.
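The processing steps above can be sketched as follows. Only the single template quoted in the text is implemented, since the full template table is not reproduced here; the function names, the regular expression, and the example log-probabilities are hypothetical.

```python
import math
import re
from typing import Dict, List


def to_template(question: str) -> str:
    """Convert the one question form quoted in the text to its template."""
    match = re.fullmatch(r"How does (\w+) feel after\?", question)
    if match is None:
        raise ValueError("no template known for this question")
    return f"{match.group(1)} feels"


def build_statement(context: str, question: str, answer: str) -> str:
    # The template is prepended to the answer, then appended to the context.
    return f"{context} {to_template(question)} {answer}"


def perplexity(token_logps: List[float]) -> float:
    return math.exp(-sum(token_logps) / len(token_logps))


def pick_answer(answer_logps: Dict[str, List[float]]) -> str:
    """Select the answer whose scored tokens have the lowest perplexity."""
    return min(answer_logps, key=lambda a: perplexity(answer_logps[a]))


# StoryCS: label names are rewritten as the words that get scored.
LABEL_WORDS = {"disgust": "disgusted", "surprise": "surprised",
               "fear": "afraid", "anger": "angry", "trust": "trusting",
               "anticipation": "excited", "sadness": "sad", "joy": "happy"}


def storycs_factor(factor_by_relation: Dict[str, float]) -> float:
    """Average the factor computed with q = xReact and with q = oReact."""
    return (factor_by_relation["xReact"] + factor_by_relation["oReact"]) / 2


statement = build_statement("Alice won the race.",
                            "How does Alice feel after?", "proud")
choice = pick_answer({"proud": [-0.5, -0.2], "sad": [-2.0, -1.0]})
```

In this sketch the token log-probabilities would come from a language model scoring the answer words; here they are supplied directly so the selection logic can be read in isolation.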

Rules for pruning generation sets

We use the following rules to prune the set of commonsense inferences generated by COMeT as it constructs a graph of commonsense knowledge:

Any generation that is identical to a previous generation from the same inputs, but has added punctuation is pruned (e.g., to go to the mall vs. to go to the mall.)

Any generation that has the phrase “PersonY” for the following relations is removed: oEffect, oReact, oWant. These generations are untrustworthy as they are often impossible to resolve with an actual person in the context.

Any generation for the following relations that does not have a token that is a verb is removed: xEffect, oEffect.

In multiple candidate settings (i.e., beam search, top-$k$ sampling), if one of the candidates is “none,” we prune all candidates with lower scores.

For the StoryCS dataset, we only generate inferences along the following Atomic relations: xReact, oReact, xEffect, oEffect, xIntent. The logic for pruning xWant, oWant, xNeed, xAttr inferences is that emotional reactions for these dimensions could be irrelevant to the context. For example, the emotional reaction to getting into a car accident is different from needing to own a car to do this. Emotional reactions to the kept relations are more likely to be relevant to the original context.
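The first four pruning rules above can be sketched as a single filter. The `has_verb` callable is a hypothetical stand-in for a part-of-speech check, which the text does not specify, and duplicates are detected here by comparing generations after stripping trailing punctuation, keeping the first occurrence.

```python
import string
from typing import Callable, List, Tuple

Candidate = Tuple[str, float]  # (generated text, model score)

PERSON_Y_RELATIONS = {"oEffect", "oReact", "oWant"}
VERB_RELATIONS = {"xEffect", "oEffect"}


def prune(relation: str, candidates: List[Candidate],
          has_verb: Callable[[str], bool]) -> List[Candidate]:
    """Apply the pruning rules to the candidates of one (input, relation)."""
    none_scores = [s for text, s in candidates
                   if text.strip().lower() == "none"]
    cutoff = max(none_scores) if none_scores else None
    seen = set()
    kept = []
    for text, score in candidates:
        # Rule 1: drop repeats that differ only by trailing punctuation.
        key = text.strip().rstrip(string.punctuation).strip()
        if key in seen:
            continue
        seen.add(key)
        # Rule 2: unresolved "PersonY" mentions for other-oriented relations.
        if relation in PERSON_Y_RELATIONS and "PersonY" in text:
            continue
        # Rule 3: effect relations must contain a verb.
        if relation in VERB_RELATIONS and not has_verb(text):
            continue
        # Rule 4: drop anything scored below a "none" candidate.
        if cutoff is not None and score < cutoff:
            continue
        kept.append((text, score))
    return kept


def toy_has_verb(text: str) -> bool:
    # Toy verb lookup; a real system would use a part-of-speech tagger.
    return any(word in {"gets", "goes"} for word in text.split())


wants = prune("oWant", [("to go to the mall", -1.0),
                        ("to go to the mall.", -1.2),
                        ("to thank PersonY", -1.5),
                        ("none", -2.0),
                        ("to sleep", -3.0)], toy_has_verb)
effects = prune("xEffect", [("gets tired", -1.0), ("a headache", -1.5)],
                toy_has_verb)
```

Applied literally, the verb rule also removes a bare “none” candidate for the effect relations; whether that is intended is not stated in the text.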

Prediction thresholds

We set the following $\kappa$ thresholds to make positive predictions on the StoryCS dataset.