GreaseLM: Graph REASoning Enhanced Language Models for Question Answering
Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D. Manning, Jure Leskovec
Introduction
Question answering is a challenging task that requires complex reasoning over both the explicit constraints described in the textual context of the question and unstated, relevant knowledge about the world (i.e., knowledge about the domain of interest). Recently, large pretrained language models fine-tuned on QA datasets have become the dominant paradigm in NLP for question answering tasks (Khashabi et al., 2020). After pretraining on an extreme-scale collection of general text corpora, these language models learn to implicitly encode broad knowledge about the world, which they are able to leverage when fine-tuned on a domain-specific downstream QA task. However, despite the strong performance of this two-stage learning procedure on common benchmarks, these models struggle when given examples that are distributionally different from examples seen during fine-tuning (McCoy et al., 2019). Their learned behavior often relies on simple (at times spurious) patterns that offer shortcuts to an answer, rather than on robust, structured reasoning that effectively fuses the explicit information provided by the context and implicit external knowledge (Marcus, 2018).
On the other hand, massive knowledge graphs (KG), such as Freebase (Bollacker et al., 2008), Wikidata (Vrandečić & Krötzsch, 2014), ConceptNet (Speer et al., 2017), and Yago (Suchanek et al., 2007) capture such external knowledge explicitly using triplets that capture relationships between entities. Previous research has demonstrated the significant role KGs can play in structured reasoning and query answering (Ren et al., 2020; 2021; Ren & Leskovec, 2020). However, extending these reasoning advantages to general QA (where questions and answers are expressed in natural language and not easily mapped to strict logical queries) requires finding the right integration of knowledge from the KG with the information and constraints provided by the QA example. Prior methods propose various ways to leverage both modalities (i.e., expressive large language models and structured KGs) for improved reasoning (Mihaylov & Frank, 2018; Lin et al., 2019; Feng et al., 2020). However, these methods typically fuse the two modalities in a shallow and non-interactive manner, encoding both separately and fusing them at the output for a prediction, or using one to augment the input of the other. Consequently, previous methods demonstrate restricted capacity to exchange useful information between the two modalities. It remains an open question how to effectively fuse the KG and LM representations in a truly unified manner, where the two representations can interact in a non-shallow way to simulate structured, situational reasoning.
In this work, we present GreaseLM, a new model that enables fusion and exchange of information from both the LM and KG in multiple layers of its architecture (see Figure 1). Our proposed GreaseLM consists of an LM that takes as input the natural language context, as well as a graph neural network (GNN) that reasons over the KG. After each layer of the LM and GNN, we design an interactive scheme to bidirectionally transfer the information from each modality to the other through specially initialized interaction representations (i.e., interaction token for the LM; interaction node for the GNN). In such a way, all the tokens in the language context receive information from the KG entities through the interaction token and the KG entities indirectly interact with the tokens through the interaction node. By such a deep integration across all layers, GreaseLM enables joint reasoning over both the language context and the KG entities under a unified framework agnostic to the specific language model or graph neural network, so that both modalities can be contextualized by the other.
GreaseLM demonstrates significant performance gains across different LM architectures. We perform experiments on several standard QA benchmarks: CommonsenseQA, OpenbookQA and MedQA-USMLE, which require external knowledge across different domains (commonsense reasoning and medical reasoning) and use different KGs (ConceptNet and Disease Database). Across both domains, GreaseLM outperforms comparably-sized prior QA models, including strong fine-tuned LM baselines (by 5.5%, 6.6%, and 1.3%, respectively) and state-of-the-art KG+LM models (by 0.9%, 1.8%, and 0.5%, respectively) on the three competitive benchmarks. Furthermore, with the deep fusion of both modalities, GreaseLM exhibits strong performance over baselines on questions that exhibit textual nuance, such as resolving multiple constraints, negation, and hedges, and which require effective reasoning over both language context and KG.
Related Work
Integrating KG information has become a popular research area for improving neural QA systems. Some works explore using two-tower models to answer questions, where a graph representation of knowledge and language representation are fused with no interaction between them (Wang et al., 2019). Other works seek to use one modality to ground the other, such as using an encoded representation of a linked KG to augment the textual representation of a QA example (e.g., Knowledgeable Reader, Mihaylov & Frank, 2018; KagNet, Lin et al., 2019; KT-NET, Yang et al., 2019). Others reverse the flow of information and use a representation of the text (e.g., final layer of LM) to provide an augmentation to a graph reasoning model over an extracted KG for the example (e.g., MHGRN, Feng et al., 2020; Lv et al., 2020). In all of these settings, however, the interaction between both modalities is limited as information between them only flows one way.
More recent approaches explore deeper integrations of both modalities. Certain approaches learn to access implicit knowledge encoded in LMs (Bosselut et al., 2019; Petroni et al., 2019; Hwang et al., 2021) by training on structured KG data, and then use the LM to generate local KGs that can be used for QA (Wang et al., 2020; Bosselut et al., 2021). However, these approaches discard the static KG once they train the LM on its facts, losing important structure that can guide reasoning. More recently, QA-GNN (Yasunaga et al., 2021) proposed to jointly update the LM and GNN representations via message passing. However, they use a single pooled representation of the LM to seed the textual component of this joint structure, limiting the updates that can be made to the textual representation. In contrast to prior works, we propose to make individual token representations in the LM and node representations in the GNN mix for multiple layers, enabling representations of both modalities to reflect particularities of the other (e.g., knowledge grounds language; language nuances specify which knowledge is important). Simultaneously, we retain the individual structure of both modalities, which we demonstrate improves QA performance substantially (§5).
Additionally, some works explore integrating knowledge graphs with language models in the pretraining stage. However, much like for QA, the modality interaction is typically limited to knowledge feeding language (Zhang et al., 2019; Shen et al., 2020; Yu et al., 2020), rather than designing interactions across multiple layers. Sun et al. (2020)’s work is perhaps most similar, but they do not use the same interaction bottleneck, requiring high-precision entity mention spans for linking, and they limit expressivity through shared modality parameters for the LM and KG.
Proposed Approach: GreaseLM
In this work, we augment large-scale language models (Devlin et al., 2019; Liu et al., 2019; Lan et al., 2020; Liu et al., 2021) with graph reasoning modules over KGs. Our method, GreaseLM (depicted in Figure 1), consists of two stacked components: (1) a set of unimodal LM layers which learn an initial representation of the input tokens, and (2) a set of upper cross-modal GreaseLM layers which learn to jointly represent the language sequence and linked knowledge graph, allowing textual representations formed from the underlying LM layers and a graph representation of the KG to mix with one another. We denote the number of LM layers as $N$, and the number of GreaseLM layers as $M$. The total number of layers in our model is $N + M$.
Notation. In the task of multiple choice question answering (MCQA), a generic MCQA-type dataset consists of examples with a context paragraph $c$, a question $q$ and a candidate answer set $\mathcal{A}$, all expressed in text. In this work, we also assume access to an external knowledge graph (KG) $\mathcal{G}$ that provides background knowledge that is relevant to the content of the multiple choice questions.
First, we concatenate our context paragraph $c$, question $q$, and candidate answer $a$ with separator tokens to get our model input $[c; q; a]$ and tokenize the combined sequence into $\{w_1, \dots, w_T\}$. Second, we use the input sequence to retrieve a subgraph of the KG $\mathcal{G}$ (denoted $\mathcal{G}_{sub}$), which provides knowledge from the KG that is relevant to this QA example. We denote the set of nodes in $\mathcal{G}_{sub}$ as $\{e_1, \dots, e_J\}$.
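As a rough illustration of the input construction described above, the sketch below joins the context, the question, and one candidate answer with a separator. The `[SEP]` string and both helper names are assumptions made for illustration: the actual separator depends on the tokenizer of the chosen LM, and tokenization itself is omitted.

```python
from typing import List


def build_input(context: str, question: str, answer: str, sep: str = "[SEP]") -> str:
    """Join context, question, and one candidate answer with a separator token.

    Empty fields (e.g., a dataset without a context paragraph) are skipped.
    """
    parts = [p.strip() for p in (context, question, answer) if p.strip()]
    return f" {sep} ".join(parts)


def build_all_inputs(context: str, question: str, answers: List[str]) -> List[str]:
    """Build one model input per candidate answer."""
    return [build_input(context, question, a) for a in answers]


# Made-up example, not taken from any dataset.
inputs = build_all_inputs("", "Where would a bird build a nest?", ["tree", "garage"])
```

Each candidate answer yields its own input sequence, so a 5-way question produces five sequences that are scored independently.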
KG Retrieval. Given each QA context, we follow the procedure from Yasunaga et al. (2021) to retrieve the subgraph $\mathcal{G}_{sub}$ from $\mathcal{G}$. We describe this procedure in Appendix B.1. Each node in $\mathcal{G}_{sub}$ is assigned a type based on whether its corresponding entity was linked from the context $c$, question $q$, answer $a$, or as a neighbor to these nodes. In the rest of the paper, we use “KG” to refer to $\mathcal{G}_{sub}$.
Interaction Bottlenecks. In the cross-modal GreaseLM layers, information is fused between both modalities, for which we define a special interaction token $w_{int}$ and a special interaction node $e_{int}$ whose representations serve as the bottlenecks through which the two modalities interact (§3.3). We prepend $w_{int}$ to the token sequence and connect $e_{int}$ to all the linked nodes $\mathcal{V}_{linked}$ in $\mathcal{G}_{sub}$.
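The bookkeeping behind the two bottlenecks can be sketched as follows. This is a minimal sketch under stated assumptions: the surface forms `[INT]` and `<INT>`, the function name, and the representation of edges as untyped pairs are all hypothetical (the KG edges in the paper carry relation types, which are dropped here).

```python
from typing import Iterable, List, Tuple

INT_TOKEN = "[INT]"  # hypothetical surface form of the interaction token
INT_NODE = "<INT>"   # hypothetical identifier of the interaction node

Edge = Tuple[str, str]


def add_interaction_units(
    tokens: List[str], edges: List[Edge], linked_nodes: Iterable[str]
) -> Tuple[List[str], List[Edge]]:
    """Prepend the interaction token and attach the interaction node.

    The interaction node is connected only to entities linked from the text,
    not to every node of the retrieved subgraph.
    """
    new_tokens = [INT_TOKEN] + list(tokens)
    new_edges = list(edges) + [(INT_NODE, node) for node in sorted(linked_nodes)]
    return new_tokens, new_edges


tokens, edges = add_interaction_units(
    ["a", "bug", "hit", "the", "windshield"],
    [("bug", "insect"), ("insect", "windshield")],
    {"windshield", "bug"},
)
```

In this toy graph, `insect` is reachable from the interaction node only through a linked entity, which matches the idea of linked entities acting as the entry points into the subgraph.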
3.2 Language Pre-encoding
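In the unimodal encoding component, the token sequence $\{w_{int}, w_1, \dots, w_T\}$ is first embedded into input representations $\{h_{int}^{(0)}, h_1^{(0)}, \dots, h_T^{(0)}\}$, and each unimodal layer $\ell$ then computes
$$\{h_{int}^{(\ell)}, h_1^{(\ell)}, \dots, h_T^{(\ell)}\} = \text{LM-Layer}(\{h_{int}^{(\ell-1)}, h_1^{(\ell-1)}, \dots, h_T^{(\ell-1)}\}), \quad \ell = 1, \dots, N,$$
where $h_t^{(\ell)}$ is the representation of token $w_t$ at layer $\ell$ and LM-Layer($\cdot$) is a single LM encoder layer, whose parameters are initialized using a pretrained model (§4.1). We refer readers to Vaswani et al. (2017) for technical details of these layers.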
3.3 GreaseLM
GreaseLM uses a cross-modal fusion component to inject information from the KG into language representations and information from language into KG representations. The GreaseLM layer is designed to separately encode information from both modalities, and fuse their representations using the bottleneck of the special interaction token and node. It is comprised of three components: (1) a transformer LM encoder block which continues to encode the language context, (2) a GNN layer that reasons over KG entities and relations, and (3) a modality interaction layer that takes the unimodal representations of the interaction token and interaction node and exchanges information through them. We discuss these three components below.
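To make the third component more concrete, the sketch below mixes the interaction token and interaction node representations and splits the result back into the two modalities. It assumes the interaction layer is a small two-layer MLP with a ReLU applied to the concatenated vectors; that form, the dimensions, and the random weights are illustrative assumptions, and all function names are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def init_params(d_in: int, d_hidden: int, seed: int = 0) -> Dict[str, list]:
    """Random weights for a d_in -> d_hidden -> d_in MLP (illustrative only)."""
    rng = random.Random(seed)

    def mat(rows: int, cols: int) -> List[Vector]:
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"w1": mat(d_hidden, d_in), "b1": [0.0] * d_hidden,
            "w2": mat(d_in, d_hidden), "b2": [0.0] * d_in}


def modality_interaction(h_int: Vector, e_int: Vector, params: Dict[str, list]) -> Tuple[Vector, Vector]:
    """Concatenate both interaction vectors, mix them, and split them again."""
    joint = h_int + e_int  # list concatenation, length len(h_int) + len(e_int)
    mixed = linear(relu(linear(joint, params["w1"], params["b1"])), params["w2"], params["b2"])
    return mixed[: len(h_int)], mixed[len(h_int):]


h_int = [0.5, -1.0, 0.25]  # toy interaction-token vector
e_int = [1.0, 2.0]         # toy interaction-node vector
params = init_params(d_in=5, d_hidden=8)
new_h, new_e = modality_interaction(h_int, e_int, params)
```

Only the two interaction vectors pass through this step; every other token and node is updated by its own unimodal layer and sees the other modality indirectly in the next layer.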
Graph Representation. The GreaseLM layers also encode a representation of the local KG linked from the QA example. To represent the graph, we first compute initial node embeddings for the retrieved entities using pretrained KG embeddings for these nodes (§4.1). The initial embedding of the interaction node is initialized randomly.
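Each GNN layer updates the representation of a node $e_j$ as
$$\tilde{e}_j^{(\ell)} = f_n\Big(\sum_{e_s \in \mathcal{N}_{e_j} \cup \{e_j\}} \alpha_{sj} m_{sj}\Big) + e_j^{(\ell-1)},$$
where $\mathcal{N}_{e_j}$ represents the neighborhood of an arbitrary node $e_j$, $m_{sj}$ denotes the message one of its neighbors $e_s$ passes to $e_j$, $\alpha_{sj}$ is an attention weight that scales the message $m_{sj}$, and $f_n$ is a 2-layer MLP. The messages $m_{sj}$ between nodes allow entity information from a node to affect the model’s representation of its neighbors, and are computed in the following manner: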
where $f_q$ and $f_k$ are linear transformations and $u_s$, $u_j$, $r_{sj}$ are defined the same as above.
As discussed in the following paragraph, message passing between the interaction node $e_{int}$ and the nodes from the retrieved subgraph will allow information from text that $e_{int}$ receives from $w_{int}$ to propagate to the other nodes in the graph.
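The attention-weighted aggregation at the core of this message passing can be sketched in a few lines. This is a simplified illustration, not the full layer: the messages and raw attention scores are given as inputs, and the outer 2-layer MLP is left out. All names and the toy values are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Pair = Tuple[str, str]  # (source node, target node)


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_messages(
    target: str,
    sources: List[str],
    messages: Dict[Pair, Vector],
    scores: Dict[Pair, float],
) -> Vector:
    """Sum the incoming messages of `target`, weighted by normalized attention."""
    alphas = softmax([scores[(s, target)] for s in sources])
    dim = len(messages[(sources[0], target)])
    pooled = [0.0] * dim
    for alpha, source in zip(alphas, sources):
        for i, value in enumerate(messages[(source, target)]):
            pooled[i] += alpha * value
    return pooled


# Toy values: two neighbors with equal raw scores contribute equally.
messages = {("bug", "windshield"): [1.0, 0.0], ("<INT>", "windshield"): [0.0, 1.0]}
scores = {("bug", "windshield"): 0.0, ("<INT>", "windshield"): 0.0}
pooled = aggregate_messages("windshield", ["bug", "<INT>"], messages, scores)
```

Because the interaction node is just another source in this sum, whatever it has absorbed from the text reaches its neighbors through the same mechanism as ordinary entity messages.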
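For the MCQA task, given a question $q$ and an answer $a$ from all the candidates $\mathcal{A}$, we compute the probability of $a$ being the correct answer as $p(a \mid q, c) \propto \exp(\text{MLP}(h_{int}^{(N+M)}, e_{int}^{(M)}, g))$, where $h_{int}^{(N+M)}$ and $e_{int}^{(M)}$ are the final-layer representations of the interaction token and interaction node, and $g$ denotes attention-based pooling of $\{e_j^{(M)} \mid e_j \in \{e_1, \dots, e_J\}\}$ using $h_{int}^{(N+M)}$ as a query. We optimize the whole model end-to-end using the cross entropy loss. At inference time, we predict the most plausible answer as $\arg\max_{a \in \mathcal{A}} p(a \mid q, c)$.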
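The scoring step reduces to a softmax over one scalar score per candidate, a cross-entropy loss during training, and an argmax at inference. The sketch below assumes the per-candidate scores have already been produced by the model; the function names and the example scores are made up.

```python
import math
from typing import List


def answer_probs(scores: List[float]) -> List[float]:
    """Normalize per-candidate scores into a probability distribution."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores: List[float], gold: int) -> float:
    """Negative log-probability of the correct candidate."""
    return -math.log(answer_probs(scores)[gold])


def predict(scores: List[float]) -> int:
    """Index of the most plausible candidate."""
    return max(range(len(scores)), key=scores.__getitem__)


scores = [0.1, 2.0, -1.0, 0.3, 0.0]  # one made-up score per candidate (5-way)
probs = answer_probs(scores)
loss = cross_entropy(scores, gold=1)
best = predict(scores)
```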
Experimental Setup
We evaluate GreaseLM on three diverse multiple-choice question answering datasets across two domains: CommonsenseQA (Talmor et al., 2019) and OpenBookQA (Mihaylov et al., 2018) as commonsense reasoning benchmarks, and MedQA-USMLE (Jin et al., 2021) as a clinical QA task.
CommonsenseQA is a 5-way multiple-choice question answering dataset of 12,102 questions that require background commonsense knowledge beyond surface language understanding. We perform our experiments using the in-house data split of Lin et al. (2019) to compare to baseline methods.
OpenbookQA is a 4-way multiple-choice question answering dataset that tests elementary scientific knowledge. It contains 5,957 questions along with an open book of scientific facts. We use the official data splits from Mihaylov et al. (2018).
MedQA-USMLE is a 4-way multiple-choice question answering dataset, which requires biomedical and clinical knowledge. The questions are originally from practice tests for the United States Medical License Exams (USMLE). The dataset contains 12,723 questions. We use the original data splits from Jin et al. (2021).
We seed GreaseLM with RoBERTa-Large (Liu et al., 2019) for our experiments on CommonsenseQA, AristoRoBERTa (Clark et al., 2019) for our experiments on OpenbookQA, and SapBERT (Liu et al., 2021) for our experiments on MedQA-USMLE, demonstrating GreaseLM’s generality with respect to language model initializations. Hyperparameters for training these models can be found in Appendix Table 7.
Knowledge Graphs. We use ConceptNet (Speer et al., 2017), a general-domain knowledge graph, as our external knowledge source for both CommonsenseQA and OpenbookQA. It has 799,273 nodes and 2,487,810 edges in total. For MedQA-USMLE, we use a self-constructed knowledge graph that integrates the Disease Database portion of the Unified Medical Language System (UMLS; Bodenreider, 2004) and DrugBank (Wishart et al., 2018). The knowledge graph contains 9,958 nodes and 44,561 edges. Additional information about node initialization and hyperparameters for preprocessing these KGs can be found in Appendix B.2.
4.2 Baseline methods
To study the effect of using KGs as external knowledge sources, we compare our method with vanilla fine-tuned LMs, which are knowledge-agnostic. We fine-tune RoBERTa-Large (Liu et al., 2019) for CommonsenseQA, and AristoRoBERTa (Clark et al., 2019) for OpenbookQA. OpenbookQA provides an extra corpus of scientific facts in textual form; AristoRoBERTa is based on RoBERTa-Large, but uses the facts corresponding to each question, prepared by Clark et al. (2019), as an additional input along with the QA context. For MedQA-USMLE, we use a state-of-the-art biomedical language model, SapBERT (Liu et al., 2021), which is an augmentation of PubmedBERT (Gu et al., 2022) that is trained with entity disambiguation objectives to allow the model to better understand entity knowledge.
LM+KG models. We also evaluate GreaseLM’s ability to exploit its knowledge graph augmentation by comparing with existing LM+KG methods: (1) Relation Network (RN; Santoro et al., 2017), (2) RGCN (Schlichtkrull et al., 2018), (3) GconAttn (Wang et al., 2019), (4) KagNet (Lin et al., 2019), (5) MHGRN (Feng et al., 2020), and (6) QA-GNN (Yasunaga et al., 2021). QA-GNN is the existing top-performing model under this LM+KG paradigm. The key difference between GreaseLM and these baseline methods is that they do not fuse the representations of both modalities across multiple interaction layers, allowing the representation of both modalities to affect the other (§3.3). For fair comparison, we use the same LM to initialize these baselines as for our model.
Experimental Results
Our results in Tables 2 and 4 demonstrate a consistent improvement on the CommonsenseQA and OpenbookQA datasets. On CommonsenseQA, our model’s test performance improves by 5.5% over fine-tuned LMs and 0.9% over existing LM+KG models. On OpenbookQA, these improvements are magnified, with 6.4% over raw LMs, and 2.0% over the prior best LM+KG system, QA-GNN. The boost over QA-GNN suggests that GreaseLM’s multi-layer fusion component that passes information between the text and KG representations is more expressive than LM+KG methods which do not integrate such sustained interaction between both modalities. We also achieve results competitive with other systems on the leaderboard of OpenbookQA (Table 4), posting the third highest score. However, we note that the T5 (Raffel et al., 2020) and UnifiedQA (Khashabi et al., 2020) models are pretrained models with 8× and 30× more parameters, respectively, than our model. Among models with comparable parameter counts, GreaseLM achieves the highest score. An ablation study on different model components and hyperparameters is reported in Appendix C.1.
Quantitative Analysis. Given these overall performance improvements, we investigated whether GreaseLM’s improvements were reflected in questions that required more complex reasoning. Because we had no gold structures from these datasets to categorize the reasoning complexity of different questions, we defined three proxies: the number of prepositional phrases in the questions, the presence of negation terms, and the presence of hedging terms. We use the number of prepositional phrases as a proxy for the number of explicit reasoning constraints being set in the questions. For example, the CommonsenseQA question in Table 1, “A weasel has a thin body and short legs to easier burrow after prey in a what?” has three prepositional phrases: to easier burrow, after prey, in a what, which each provide an additional search constraint for the answer (n.b., in certain cases, the prepositional phrases do not provide constraints that are needed for selecting the correct answer). The presence of negation and hedging terms stratifies our evaluation to questions that have explicit negation mentions (e.g., no, never) and terms indicating uncertainty (e.g., sometimes; maybe).
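The negation and hedging proxies amount to a keyword match over each question. The sketch below is one possible way to stratify questions this way; the term lists are illustrative assumptions that extend the examples given in the text, the questions are made up, and counting prepositional phrases is omitted because it would require a syntactic parser.

```python
import re
from typing import Dict, List, Set

# Illustrative term lists; the exact lists used for the analysis are not given here.
NEGATION_TERMS: Set[str] = {"no", "not", "never", "none", "nothing"}
HEDGE_TERMS: Set[str] = {"sometimes", "maybe", "unlikely", "possibly", "might"}


def tokenize(question: str) -> List[str]:
    """Lowercase word tokens; whole-word matching avoids hits inside other words."""
    return re.findall(r"[a-z']+", question.lower())


def has_any(question: str, terms: Set[str]) -> bool:
    return any(token in terms for token in tokenize(question))


def stratify(questions: List[str]) -> Dict[str, List[str]]:
    """Group questions by the presence of negation or hedging terms."""
    groups: Dict[str, List[str]] = {"negation": [], "hedge": [], "neither": []}
    for question in questions:
        negated = has_any(question, NEGATION_TERMS)
        hedged = has_any(question, HEDGE_TERMS)
        if negated:
            groups["negation"].append(question)
        if hedged:
            groups["hedge"].append(question)
        if not negated and not hedged:
            groups["neither"].append(question)
    return groups


# Made-up questions, not taken from any dataset.
groups = stratify([
    "Where would you never keep ice cream?",
    "What is sometimes found in a kitchen drawer?",
    "Where do birds build nests?",
])
```

Accuracy can then be reported separately per group, which is how a stratified comparison between models would be obtained.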
Our results in Table 5 demonstrate that GreaseLM generally outperforms RoBERTa-Large and QA-GNN for both questions with negation terms and hedge terms, indicating GreaseLM handles contexts with nuanced constraints. Furthermore, we also note that GreaseLM performs better than the baselines across all questions with prepositional phrases, our measure for reasoning complexity. QA-GNN and GreaseLM perform comparably on questions with no prepositional phrases, but the increasing complexity of questions requires deeper cross-modal fusion between language and knowledge representations. While QA-GNN’s end fusion approach of initializing a node in the GNN from the LM’s final representation of the context is an effective approach, it compresses the language context to a single vector before allowing interaction with the KG, potentially limiting the cross-relationships between language and knowledge that can be captured (see example in Figure 2). Interestingly, we note that both GreaseLM and QA-GNN significantly outperform RoBERTa-Large even when no prepositional phrases are in the question. We hypothesize that some of these questions may require less reasoning, but require specific commonsense knowledge that RoBERTa may not have learned during pretraining (e.g., “What is a person considered a bully known for?”).
Qualitative Analysis. In Figure 2, we examine GreaseLM’s node-to-node attention weights induced by the GNN layers of the model, and analyze whether they reflect more expressive reasoning steps compared to QA-GNN. Figure 2 shows an example from the CommonsenseQA IH-dev set. In this example, GreaseLM correctly predicts that the answer is “airplane” while QA-GNN makes an incorrect prediction, “motor vehicle”. For both models, we perform Best First Search (BFS) on the retrieved KG subgraph to trace high attention weights from the interaction node (purple).
For GreaseLM, we observe that the attention by the interaction node increases on the “bug” entity in the intermediate GNN layers, but drops again by the final layer, resembling a suitable intuition surrounding the hedge term “unlikely”. Meanwhile, the attention on “windshield” consistently increases across all layers. For QA-GNN, the attention on “bug” increases over multiple layers. As “bug” is mentioned multiple times in the context, it may be well-represented in QA-GNN’s context node initialization, which is never reformulated by language representations, unlike in GreaseLM.
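Tracing high attention weights outward from the interaction node can be done with a priority queue. The sketch below is a generic best-first traversal under stated assumptions: attention is stored as a nested dictionary, the weights are invented for illustration, and the node names merely echo the example discussed above.

```python
import heapq
from typing import Dict, List

Attention = Dict[str, Dict[str, float]]  # node -> neighbor -> attention weight


def best_first_trace(attn: Attention, start: str, max_nodes: int = 5) -> List[str]:
    """Visit nodes in order of the attention weight on the edge that reached them."""
    visited: List[str] = []
    seen = {start}
    heap = [(0.0, start)]  # heapq is a min-heap, so weights are stored negated
    while heap and len(visited) < max_nodes:
        _, node = heapq.heappop(heap)
        visited.append(node)
        for neighbor, weight in attn.get(node, {}).items():
            if neighbor not in seen:
                seen.add(neighbor)
                heapq.heappush(heap, (-weight, neighbor))
    return visited


# Invented attention weights for illustration only.
attn = {
    "<INT>": {"bug": 0.2, "windshield": 0.7},
    "windshield": {"airplane": 0.6, "car": 0.3},
    "bug": {"insect": 0.1},
}
trace = best_first_trace(attn, "<INT>")
```

With these toy weights the traversal expands `windshield` and its neighbors before `bug`, since the frontier is always ordered by attention weight rather than by depth.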
Our reported results thus far demonstrate the viability of our method in the general commonsense reasoning domain. In this section, we explore whether GreaseLM could be adapted to other domains by evaluating on the MedQA-USMLE dataset. Our results in Table 6 demonstrate that GreaseLM outperforms state-of-the-art fine-tuned LMs (e.g., SapBERT; Liu et al., 2021) and a QA-GNN augmentation of SapBERT. Additionally, we note the improved performance over all classical methods and LM methods first reported in Jin et al. (2021). Additional results in Appendix C show that our approach is also agnostic to the language model used with improvements recorded by GreaseLM when it is seeded with other LMs, such as PubmedBERT (Gu et al., 2022), and BioBERT (Lee et al., 2020). While these results are promising as they suggest that GreaseLM is an effective augmentation of pretrained LMs for different domains and KGs (i.e., the medical domain with the DDB + Drugbank KG), there is still ample room for improvement on this task.
Conclusion
In this paper, we introduce GreaseLM, a new model that enables interactive fusion through joint information exchange between knowledge from language models and knowledge graphs. Experimental results demonstrate superior performance compared to prior KG+LM and LM-only baselines across standard datasets from multiple domains (commonsense and medical). Our analysis shows improved capability modeling questions exhibiting textual nuances, such as negation and hedging.
Acknowledgment
We thank Rok Sosic, Maria Brbic, Jordan Troutman, Rajas Bansal, and our anonymous reviewers for discussions and for providing feedback on our manuscript. We thank Xiaomeng Jin for help with data preprocessing. We also gratefully acknowledge the support of DARPA under Nos. HR00112190039 (TAMI), N660011924033 (MCS); ARO under Nos. W911NF-16-1-0342 (MURI), W911NF-16-1-0171 (DURIP); NSF under Nos. OAC-1835598 (CINES), OAC-1934578 (HDR), CCF-1918940 (Expeditions), IIS-2030477 (RAPID), NIH under No. R56LM013365; Stanford Data Science Initiative, Wu Tsai Neurosciences Institute, Chan Zuckerberg Biohub, Amazon, JPMorgan Chase, Docomo, Hitachi, Intel, JD.com, KDDI, Toshiba, NEC, and UnitedHealth Group. J. L. is a Chan Zuckerberg Biohub investigator. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding entities.
References
Appendix A Ethics Statement
We outline potential ethical issues with our work below. First, GreaseLM is a method to fuse language representations and knowledge graph representations for effective reasoning about textual situations. Consequently, GreaseLM could reflect many of the same biases and toxic behaviors exhibited by language models and knowledge graphs that are used to initialize it. For example, prior large-scale language models have been shown to encode biases about race, gender, and other demographic attributes (Sheng et al., 2020). Because GreaseLM is seeded with pretrained language models that often learn these patterns, it is possible to reflect them in open-world settings. Second, the ConceptNet knowledge graph (Speer et al., 2017) used in this work has been shown to encode stereotypes (Mehrabi et al., 2021), rather than completely clean commonsense knowledge. If GreaseLM were used outside these standard benchmarks in conjunction with ConceptNet as a KG, it might rely on unethical relationships in its knowledge resource to arrive at conclusions. Consequently, while GreaseLM could be used for applications outside these standard benchmarks, we would encourage implementers to use the same precautions they would apply to other language models and methods that use noisy knowledge sources.
Another source of ethical concern is the use of the MedQA-USMLE evaluation. While we find clinical reasoning using language models and knowledge graphs to be an interesting testbed for GreaseLM and for joint language and reasoning models in general, we do not encourage users to use these models for real world clinical prediction, particularly at these performance levels.
Appendix B Experimental Setup Details
Given each QA context, we follow the procedure from Yasunaga et al. (2021) to retrieve the subgraph $\mathcal{G}_{sub}$ from $\mathcal{G}$. First, we perform entity linking to $\mathcal{G}$ to retrieve an initial set of nodes $\mathcal{V}_{linked}$. Second, we add any bridge entities that are in a 2-hop path between any pair of linked entities in $\mathcal{V}_{linked}$ to get the set of retrieved entities $\mathcal{V}_{retrieved}$. Then we prune the set of nodes $\mathcal{V}_{retrieved}$ using a relevance score computed for each node. To compute the relevance score, we follow the procedure of Yasunaga et al. (2021): we concatenate the node name with the context of the QA example, and pass it through a pre-trained LM, using the output score of the node name as the relevance score. We only retain the 200 top-scoring nodes and prune the remaining ones. Finally, we retrieve all the edges that connect any two nodes in $\mathcal{V}_{sub}$, forming the retrieved subgraph $\mathcal{G}_{sub}$. Each node in $\mathcal{G}_{sub}$ is assigned a type according to whether its corresponding entity was linked from the context $c$, question $q$, answer $a$, or from a bridge path.
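The retrieval steps above can be sketched as a small function. This is a simplified sketch, not the referenced implementation: the KG is treated as an undirected set of untyped edges, entity linking is assumed to have been done already, and the LM-based relevance scores are passed in as a precomputed dictionary. All names and toy values are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def retrieve_subgraph(
    edges: List[Edge],
    linked: Iterable[str],
    relevance: Dict[str, float],
    max_nodes: int = 200,
) -> Tuple[Set[str], List[Edge]]:
    """Add bridge entities, prune by relevance, and keep the induced edges."""
    neighbors: Dict[str, Set[str]] = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    linked_set = set(linked)
    # A bridge entity sits in the middle of a 2-hop path between two linked entities.
    bridges = {
        node for node, adjacent in neighbors.items()
        if node not in linked_set and len(adjacent & linked_set) >= 2
    }
    candidates = linked_set | bridges

    # Keep only the highest-scoring nodes (unscored nodes default to 0.0).
    ranked = sorted(candidates, key=lambda n: relevance.get(n, 0.0), reverse=True)
    kept = set(ranked[:max_nodes])

    sub_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, sub_edges


# Toy graph with invented relevance scores.
toy_edges = [("bug", "windshield"), ("windshield", "airplane"),
             ("bug", "insect"), ("car", "windshield")]
toy_scores = {"bug": 0.9, "airplane": 0.8, "windshield": 0.5}
nodes, sub_edges = retrieve_subgraph(toy_edges, {"bug", "airplane"}, toy_scores)
```

Here `windshield` is kept as a bridge between the two linked entities, while `insect` and `car` are never added because each touches at most one linked entity.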
B.2 Graph Initialization
To compute initial node embeddings (§3.3) for entities retrieved in $\mathcal{G}_{sub}$ from ConceptNet, we follow the method of MHGRN (Feng et al., 2020). We convert knowledge triples in the KG into sentences using pre-defined templates for each relation. Then, these sentences are fed into a BERT-large LM to compute embeddings for each sentence. Finally, for all sentences containing an entity, we extract all token representations of the entity’s mention spans in these sentences, mean pool over these representations and project this mean-pooled representation.
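The pooling step of this initialization can be illustrated as follows. The sketch assumes the sentences have already been encoded, replaces the BERT-large outputs with hand-written toy vectors, represents mention spans as end-exclusive index pairs, and leaves out the final projection. All names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]
EncodedSentence = Tuple[List[Vector], Tuple[int, int]]  # (token vectors, mention span)


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / count for i in range(dim)]


def entity_embedding(encoded_sentences: List[EncodedSentence]) -> Vector:
    """Mean-pool every token vector that falls inside an entity mention span."""
    mention_vectors: List[Vector] = []
    for token_vectors, (start, end) in encoded_sentences:
        mention_vectors.extend(token_vectors[start:end])  # end is exclusive
    return mean_pool(mention_vectors)


# Toy token vectors standing in for LM outputs.
sentence_a = ([[0.0, 0.0], [2.0, 4.0], [4.0, 0.0]], (1, 3))  # mention covers tokens 1-2
sentence_b = ([[0.0, 2.0], [9.0, 9.0]], (0, 1))              # mention covers token 0
embedding = entity_embedding([sentence_a, sentence_b])
```

Pooling over all mention tokens across sentences, rather than per sentence, gives entities with longer surface forms proportionally more token vectors in the average.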
For MedQA-USMLE, node embeddings are initialized similarly using the pooled token output embeddings of the entity name from the SapBERT model (described in §4.2; Liu et al., 2021). For MedQA, 5% of examples do not yield a retrieved entity. In these cases, we represent the graph using a dummy node initialized with 0. In essence, GreaseLM backs off to only using LM representations as the graph propagates no information.
B.3 Hyperparameters
Appendix C Additional Experimental Results
In Table 8, we summarize an ablation study conducted using the CommonsenseQA IH-dev set.
Modality interaction. A key component of GreaseLM is the connection of the LM to the GNN via the modality interaction module (Eq. 11). If we remove modality interaction, the performance drops significantly, from 78.5% to 76.5% (approximately the performance of QA-GNN). Integrating the modality interaction in every other layer instead of consecutive layers also hurts performance. A possible explanation is that skipping layers could impede learning consistent representations across layers for both the LM and the GNN, a property which may be desirable given we initialize the model using a pretrained LM’s weights (e.g., RoBERTa). We also find that sharing parameters between modality interaction layers (Eq. 11) outperforms not sharing, possibly because our datasets are not very large (e.g., 10k for CommonsenseQA), and sharing parameters helps prevent overfitting.
Number of GreaseLM layers. We find that GreaseLM layers achieves the highest performance. However, both the results for and are relatively close to the top performance, indicating our method is not overly sensitive to this hyperparameter.
Graph connectivity. The interaction node $e_{int}$ is a key component of GreaseLM that bridges the interaction between the KG and the text. Selecting which nodes in the KG are directly connected to $e_{int}$ affects the rate at which information from different portions of the KG can reach the text representations. We find that connecting $e_{int}$ to KG nodes explicitly linked to the input text performs best. Connecting $e_{int}$ to all nodes in the subgraph (e.g., bridge entities) hurts performance (-0.9%), possibly because the interaction node is overloaded by having to attend to all nodes in the graph (up to 200). By connecting the interaction node only to linked entities, each linked entity serves as a filter for relevant information that reaches the interaction node.
KG node embedding initialization. Effectively initializing KG node representations is critical. When we initialize nodes randomly instead of using the BERT-based initialization method from Feng et al. (2020), the performance drops significantly (78.5% → 60.8%). While using standard KG embeddings (e.g., TransE; Bordes et al., 2013) recovers much of the performance drop (77.7%), we still find that using BERT-based entity embeddings performs best.
C.2 Effect of LM Initialization on GreaseLM
To evaluate whether our method is agnostic to the LM used to seed the GreaseLM layers, we replace the LMs we use in previous experiments (RoBERTa-large for CommonsenseQA and SapBERT for MedQA-USMLE) with RoBERTa-base for CommonsenseQA, and BioBERT and PubmedBERT for MedQA-USMLE. Across multiple LM initializations in two domains, our results demonstrate that GreaseLM can provide a consistent improvement for multiple LMs when used as a modality junction between KGs and language.