Open Domain Question Answering Using Early Fusion of Knowledge Bases and Text
Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, William W. Cohen
Introduction
Open domain Question Answering (QA) is the task of finding answers to questions posed in natural language. Historically, this required a specialized pipeline consisting of multiple machine-learned and hand-crafted modules (Ferrucci et al., 2010). Recently, the paradigm has shifted towards training end-to-end deep neural network models for the task (Chen et al., 2017; Liang et al., 2017; Raison et al., 2018; Talmor and Berant, 2018; Iyyer et al., 2017). Most existing models, however, answer questions using a single information source, usually either text from an encyclopedia, or a single knowledge base (KB).
Intuitively, the suitability of an information source for QA depends on both its coverage and the difficulty of extracting answers from it. A large text corpus has high coverage, but the information is expressed using many different text patterns. As a result, models which operate on these patterns (e.g. BiDAF (Seo et al., 2017)) do not generalize beyond their training domains (Wiese et al., 2017; Dhingra et al., 2018) or to novel types of reasoning (Welbl et al., 2018; Talmor and Berant, 2018). KBs, on the other hand, suffer from low coverage due to their inevitable incompleteness and restricted schema (Min et al., 2013), but are easier to extract answers from, since they are constructed precisely for the purpose of being queried.
In practice, some questions are best answered using text, while others are best answered using KBs. A natural question, then, is how to effectively combine both types of information. Surprisingly little prior work has looked at this problem. In this paper we focus on a scenario in which a large-scale KB (Bollacker et al., 2008; Auer et al., 2007) and a text corpus are available, but neither is sufficient alone for answering all questions.
A naïve option, in such a setting, is to take state-of-the-art QA systems developed for each source, and aggregate their predictions using some heuristic (Ferrucci et al., 2010; Baudiš, 2015). We call this approach late fusion, and show that it can be sub-optimal, as models have limited ability to aggregate evidence across the different sources (§ 5.4). Instead, we focus on an early fusion strategy, where a single model is trained to extract answers from a question subgraph (see Fig 1, left) containing relevant KB facts as well as text sentences. Early fusion allows more flexibility in combining information from multiple sources.
To enable early fusion, in this paper we propose a novel graph convolution based neural network, called GRAFT-Net (Graphs of Relations Among Facts and Text Networks), specifically designed to operate over heterogeneous graphs of KB facts and text sentences. We build upon recent work on graph representation learning (Kipf and Welling, 2016; Schlichtkrull et al., 2017), but propose two key modifications to adapt them for the task of QA. First, we propose heterogeneous update rules that handle KB nodes differently from the text nodes: for instance, LSTM-based updates are used to propagate information into and out of text nodes (§ 3.2). Second, we introduce a directed propagation method, inspired by personalized PageRank in IR (Haveliwala, 2002), which constrains the propagation of embeddings in the graph to follow paths starting from seed nodes linked to the question (§ 3.3). Empirically, we show that both these extensions are crucial for the task of QA. An overview of the model is shown in Figure 1.
We evaluate these methods on a new suite of benchmark tasks for testing QA models when both KB and text are present. Using WikiMovies (Miller et al., 2016) and WebQuestionsSP (Yih et al., 2016), we construct datasets with a varying amount of training supervision and KB completeness, and with a varying degree of question complexity. We report baselines for future comparison, including Key Value Memory Networks (Miller et al., 2016; Das et al., 2017c), and show that our proposed GRAFT-Nets have superior performance across a wide range of conditions (§ 5). We also show that GRAFT-Nets are competitive with the state-of-the-art methods developed specifically for text-only QA, and state-of-the-art methods developed for KB-only QA (§ 5.4). Source code and data are available at https://github.com/OceanskySun/GraftNet.
Task Setup
A knowledge base consists of a set of entities and a set of edges. Each edge is a (subject, relation, object) triplet which denotes that the relation holds between the subject and object entities. A text corpus is a set of documents, where each document is a sequence of words. We further assume that an (imperfect) entity linking system has been run on the collection of documents; its output is a set of links, each connecting an entity with a word at a particular position in a document, and we also refer to the set of all entity links within a given document. For entity mentions spanning multiple words, we include links to all the words in the mention.
The task is, given a natural language question, to extract its answers from the KB and the entity-linked text corpus. There may be multiple correct answers for a question. In this paper, we assume that the answers are entities from either the documents or the KB. We are interested in a wide range of settings, where the KB varies from highly incomplete to complete for answering the questions, and we will introduce datasets for testing our models under these settings.
To solve this task we proceed in two steps. First, we extract a subgraph which contains the answer to the question with high probability. The goal for this step is to ensure high recall for answers while producing a graph small enough to fit into GPU memory for gradient-based learning. Next, we use our proposed model GRAFT-Net to learn node representations in this subgraph, conditioned on the question, which are used to classify each node as being an answer or not. Training data for the second step is generated using distant supervision. The entire process mimics the search-and-read paradigm for text-based QA (Dhingra et al., 2017).
2.2 Question Subgraph Retrieval
We retrieve the subgraph using two parallel pipelines – one over the KB which returns a set of entities, and the other over the corpus which returns a set of documents. The retrieved entities and documents are then combined with entity links to produce a fully-connected graph.
To retrieve relevant entities from the KB we first perform entity linking on the question, producing a set of seed entities. Next we run the Personalized PageRank (PPR) method (Haveliwala, 2002) around these seeds to identify other entities which might be an answer to the question. The edge-weights around the seeds are distributed equally among all edges of the same type, and they are weighted such that edges relevant to the question receive a higher weight than those which are not. Specifically, we average word vectors to compute a relation vector from the surface form of the relation, and a question vector from the words in the question, and use cosine similarity between these as the edge weights. After running PPR we retain the top entities by PPR score, along with any edges between them, and add them to the question subgraph.
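The KB retrieval step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the restart probability of 0.2, the iteration count, the toy two-dimensional word vectors, splitting relation names on underscores, and clipping negative similarities to zero are all choices made for the example.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def avg_vector(words, word_vecs):
    """Average the vectors of the known words; zero vector if none are known."""
    vecs = [word_vecs[w] for w in words if w in word_vecs]
    dim = len(next(iter(word_vecs.values())))
    if not vecs:
        return [0.0] * dim
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def personalized_pagerank(triples, seeds, question, word_vecs, restart=0.2, iters=30):
    """Power-iteration PPR where each edge is weighted by question-relation similarity."""
    q_vec = avg_vector(question.lower().split(), word_vecs)
    out_edges = defaultdict(list)
    nodes = set(seeds)
    for s, r, o in triples:
        # Assumed surface form: relation name split on underscores.
        weight = max(cosine(avg_vector(r.split("_"), word_vecs), q_vec), 0.0)
        out_edges[s].append((o, weight))
        nodes.update((s, o))
    base = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(base)
    for _ in range(iters):
        nxt = {n: restart * base[n] for n in nodes}
        for s, edges in out_edges.items():
            total = sum(w for _, w in edges)
            if total == 0:
                continue
            for o, w in edges:
                nxt[o] += (1 - restart) * score[s] * w / total
        score = nxt
    return score

# Toy data (hypothetical): two facts about one seed entity.
word_vecs = {"directed": [1.0, 0.0], "by": [1.0, 0.1], "who": [0.5, 0.5],
             "starred": [0.0, 1.0], "actors": [0.0, 1.0]}
triples = [("Inception", "directed_by", "Nolan"),
           ("Inception", "starred_actors", "DiCaprio")]
scores = personalized_pagerank(triples, {"Inception"}, "who directed Inception", word_vecs)
top_entities = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy graph the entity reached through the relation closest to the question wording receives the higher score, which is the behaviour the similarity-weighted edges are meant to produce.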
Text Retrieval.
We use Wikipedia as the corpus and retrieve text at the sentence level, i.e. documents are defined along sentence boundaries (the term document will always refer to a sentence in the rest of this paper). We perform text retrieval in two steps: first we retrieve the top 5 most relevant Wikipedia articles, using the weighted bag-of-words model from DrQA (Chen et al., 2017); then we populate a Lucene (https://lucene.apache.org/) index with sentences from these articles, and retrieve the top-ranking ones based on the words in the question. For the sentence-retrieval step, we found it beneficial to include the title of the article as an additional field in the Lucene index. As most sentences in an article talk about the title entity, this helps in retrieving relevant sentences that do not explicitly mention the entity in the question. We add the retrieved documents, along with any entities linked to them, to the question subgraph.
The vertices of the final question subgraph consist of all the retrieved entities and documents. The edges are all KB relations among these entities, plus the entity links between documents and entities, which are labeled with a special “linking” relation. The set of all edge types in the subgraph is therefore the KB relation types together with this linking relation.
GRAFT-Nets
The question and its answers induce a labeling of the nodes in the question subgraph: a node is labeled positive if it is an answer to the question and negative otherwise. The task of QA then reduces to performing binary classification over the nodes of the graph. Several graph-propagation based models have been proposed in the literature which learn node representations and then perform classification of the nodes (Kipf and Welling, 2016; Schlichtkrull et al., 2017). Such models follow the standard gather-apply-scatter paradigm to learn the node representation with homogeneous updates, i.e. treating all neighbors equally.
The basic recipe for these models is as follows:
Initialize the node representations.
For each layer, update the representation of every node by applying a neural network layer to the representations gathered from its neighbours along incoming edges of each type.
Here the number of layers in the model corresponds to the maximum length of the paths along which information should be propagated in the graph. Once the propagation is complete the final layer representations are used to perform the desired task, for example link prediction in knowledge bases (Schlichtkrull et al., 2017).
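The basic recipe can be illustrated with a small sketch. It assumes plain Python lists as node states and a toy additive function standing in for the neural network layer; the names `propagate` and `phi` are hypothetical, and edge types are ignored, as in a homogeneous update.

```python
def propagate(init, edges, num_layers, phi):
    """Homogeneous gather-apply-scatter.

    init: dict mapping node -> list of floats (initial representation)
    edges: list of (source, relation, target) triples; relation is ignored here
    phi: function (previous_state, summed_neighbour_states) -> new_state
    """
    h = {v: list(x) for v, x in init.items()}
    dim = len(next(iter(h.values())))
    for _ in range(num_layers):
        # Gather: sum previous-layer states of neighbours along incoming edges.
        gathered = {v: [0.0] * dim for v in h}
        for src, _rel, dst in edges:
            for i, x in enumerate(h[src]):
                gathered[dst][i] += x
        # Apply: compute the new state of every node from the gathered messages.
        h = {v: phi(h[v], gathered[v]) for v in h}
    return h

# Toy stand-in for a neural layer: add the incoming messages to the old state.
add = lambda prev, msg: [p + m for p, m in zip(prev, msg)]

chain = [("a", "r", "b"), ("b", "r", "c")]
start = {"a": [1.0], "b": [0.0], "c": [0.0]}
after_one = propagate(start, chain, 1, add)
after_two = propagate(start, chain, 2, add)
```

On the chain a → b → c, information from `a` reaches `c` only with two layers, which shows why the number of layers bounds the length of the paths along which information can travel.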
However, there are two differences in our setting from previously studied graph-based classification tasks. The first difference is that, in our case, the graph consists of heterogeneous nodes. Some nodes in the graph correspond to KB entities which represent symbolic objects, whereas other nodes represent textual documents which are variable length sequences of words. The second difference is that we want to condition the representation of nodes in the graph on the natural language question . In §3.2 we introduce heterogeneous updates to address the first difference, and in §3.3 we introduce mechanisms for conditioning on the question (and its entities) for the second.
Here LSTM refers to a long short-term memory unit. Each row of a document's representation corresponds to the embedding of one word in the document at a given layer.
3.2 Heterogeneous Updates
Figure 2 shows the update rules for entities and documents, which we describe in detail here.
For each entity, consider the set of positions in documents which correspond to a mention of that entity. The update for entity nodes involves a single-layer feed-forward network (FFN) over the concatenation of four states, described below.
The first two terms correspond to the entity representation and question representation (details below), respectively, from the previous layer.
The third term aggregates the states from the entity neighbours of the current node, after scaling with an attention weight (described in the next section) and applying relation-specific transformations. Previous work on Relational-Graph Convolution Networks (Schlichtkrull et al., 2017) used a linear projection for these transformations. For a batched implementation, this results in matrices whose size grows with the batch size, which can be prohibitively large for large subgraphs (this is because we have to use adjacency matrices to aggregate embeddings from neighbours of all nodes simultaneously). Hence in this work we use relation vectors instead of matrices for these transformations, and compute the update along an edge from the relation vector and the neighbour's state, scaled by a PageRank score.
This PageRank score is used to control the propagation of embeddings along paths starting from the seed nodes, which we describe in detail in the next section. The memory complexity of the above depends on the number of facts in the subgraph.
The last term aggregates the states of all tokens that correspond to mentions of the entity among the documents in the subgraph. Note that the update depends on the positions of entities in their containing document.
Documents.
For each word position in a document, consider the set of all entities linked to the word at that position. The document update proceeds in two steps. First we aggregate over the entity states coming in at each position separately.
3.3 Conditioning on the Question
For the parts described thus far, the graph learner is largely agnostic of the question. We introduce dependence on question in two ways: by attention over relations, and by personalized propagation.
The initial question representation is computed by running an LSTM over the words in the question and extracting the final state from its output. In subsequent layers the question representation is updated from the representations of the seed entities mentioned in the question.
The attention weight in the third term of Eq. (1) is computed using the question embedding and the relation vectors, with a softmax normalization over all outgoing edges from a node. This ensures that embeddings are propagated more along edges relevant to the question.
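A minimal sketch of attention over relations follows. It assumes the score of each edge is the dot product between its relation vector and the question vector; this scoring function, the function name `relation_attention`, and the toy vectors are assumptions for illustration only.

```python
import math

def relation_attention(out_relations, rel_vecs, q_vec):
    """Softmax over the outgoing edges of one node.

    out_relations: relation names on the node's outgoing edges
    rel_vecs: dict mapping relation name -> relation vector
    q_vec: question representation (same dimension as the relation vectors)
    """
    logits = [sum(a * b for a, b in zip(rel_vecs[r], q_vec)) for r in out_relations]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: the question vector is aligned with the first relation.
rel_vecs = {"directed_by": [1.0, 0.0], "starred_actors": [0.0, 1.0]}
weights = relation_attention(["directed_by", "starred_actors"], rel_vecs, [2.0, 0.0])
```

Because the normalization is per node, several edges leaving different nodes can all receive high weight at once.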
Directed Propagation.
Many questions require multi-hop reasoning, which follows a path from a seed node mentioned in the question to the target answer node. To encourage such behaviour when propagating embeddings, we develop a technique inspired by personalized PageRank in IR (Haveliwala, 2002). The propagation starts at the seed entities mentioned in the question. In addition to the vector embeddings at the nodes, we also maintain scalar “PageRank” scores which measure the total weight of paths from a seed entity to the current node.
Notice that we reuse the attention weights when propagating PageRank, to ensure that nodes along paths relevant to the question receive a high weight. The PageRank score is used as a scaling factor when propagating embeddings along the edges in Eq. (2). In the first layer, the PageRank score will be zero for all entities except the seed entities, and hence propagation will only happen outward from these nodes. In the second layer, it will be non-zero for the seed entities and their 1-hop neighbors, and propagation will only happen along these edges. Figure 3 illustrates this process.
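The layer-by-layer spread of the PageRank scores can be sketched as below. The uniform initialization over the seeds, the mixing coefficient `lam` and its value 0.5, and the function name are assumptions made for this example; attention weights are passed in as a dictionary keyed by (source node, relation).

```python
def propagate_pagerank(seeds, edges, attention, num_layers, lam=0.5):
    """Return the list of PageRank score dicts, one per layer (index 0 = initial).

    edges: list of (source, relation, target) triples
    attention: dict mapping (source, relation) -> attention weight
    """
    nodes = set(seeds) | {n for s, _r, o in edges for n in (s, o)}
    pr = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    history = [pr]
    for _ in range(num_layers):
        # Keep part of the old score, and push the rest along attended edges.
        nxt = {v: (1 - lam) * pr[v] for v in nodes}
        for s, r, o in edges:
            nxt[o] += lam * attention[(s, r)] * pr[s]
        pr = nxt
        history.append(pr)
    return history

edges = [("seed", "r", "hop1"), ("hop1", "r", "hop2")]
attention = {("seed", "r"): 1.0, ("hop1", "r"): 1.0}
history = propagate_pagerank({"seed"}, edges, attention, num_layers=2)
```

Here `history[0]` is non-zero only at the seed, `history[1]` also covers its 1-hop neighbour, and the 2-hop node becomes reachable only in `history[2]`. Since the embedding update in a given layer is scaled by the scores from the previous layer, messages can flow only outward from the seeds.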
3.4 Answer Selection
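Each entity node is assigned an answer probability by applying the sigmoid function to a score computed from its final-layer representation. Training uses binary cross-entropy loss over these probabilities.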
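As an illustration of answer selection, the sketch below scores each node with a linear function of its final representation, applies a sigmoid, and averages the binary cross-entropy over nodes. The linear scoring form, the parameter names `w` and `b`, and the clipping constant `eps` are assumptions of the example.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def answer_probability(h_v, w, b):
    """Probability that a node is an answer, from its final-layer state h_v."""
    return sigmoid(sum(wi * hi for wi, hi in zip(w, h_v)) + b)

def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

p_zero = answer_probability([0.0, 0.0], [1.0, 1.0], 0.0)
loss = binary_cross_entropy([p_zero], [1])
```

Each node is classified independently, so the model can assign a high probability to several answers of the same question.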
3.5 Regularization via Fact Dropout
To encourage the model to learn a robust classifier, which exploits all available sources of information, we randomly drop edges from the graph during training with a fixed probability (the fact-dropout rate). We call this fact-dropout. It is usually easier to extract answers from the KB than from the documents, so the model tends to rely on the former, especially when the KB is complete. This method is similar to DropConnect (Wan et al., 2013).
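Fact dropout can be sketched in a few lines. The function name, the representation of facts as triples, and the injectable random generator are assumptions of this example; the dropout is meant to be applied to the subgraph edges during training only.

```python
import random

def fact_dropout(kb_edges, rate, rng=random):
    """Return a copy of kb_edges with each edge dropped independently with probability `rate`."""
    return [edge for edge in kb_edges if rng.random() >= rate]

facts = [("Inception", "directed_by", "Nolan"),
         ("Inception", "starred_actors", "DiCaprio"),
         ("Inception", "release_year", "2010")]
kept_all = fact_dropout(facts, 0.0)
kept_none = fact_dropout(facts, 1.0)
kept_some = fact_dropout(facts, 0.5, rng=random.Random(0))
```

Sampling a fresh subset of facts for every training batch prevents the classifier from always finding the answer through a KB edge, so it must also learn to use the linked text.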
Related Work
The work of Das et al. (2017c) attempts an early fusion strategy for QA over KB facts and text. Their approach is based on Key-Value Memory Networks (KV-MemNNs) (Miller et al., 2016) coupled with a universal schema (Riedel et al., 2013) to populate a memory module with representations of KB triples and text snippets independently. The key limitation for this model is that it ignores the rich relational structure between the facts and text snippets. Our graph-based method, on the other hand, explicitly uses this structure for the propagation of embeddings. We compare the two approaches in our experiments (§5), and show that GRAFT-Nets outperform KV-MemNNs over all tasks.
Non-deep learning approaches have also been attempted for QA over both text assertions and KB facts. Gardner and Krishnamurthy (2017) use traditional feature extraction methods of open-vocabulary semantic parsing for the task. Ryu et al. (2014) use a pipelined system aggregating evidence from both unstructured and semi-structured sources for open-domain QA.
Another line of work has looked at learning combined representations of KBs and text for relation extraction and Knowledge Base Completion (KBC) (Lao et al., 2012; Riedel et al., 2013; Toutanova et al., 2015; Verga et al., 2016; Das et al., 2017b; Han et al., 2016). The key difference in QA compared to KBC is that in QA the inference process on the knowledge source has to be conditioned on the question, so different questions induce different representations of the KB and warrant a different inference process. Furthermore, KBC operates under the fixed schema defined by the KB beforehand, whereas natural language questions might not adhere to this schema.
The GRAFT-Net model itself is motivated by the large body of work on graph representation learning (Scarselli et al., 2009; Li et al., 2016; Kipf and Welling, 2016; Atwood and Towsley, 2016; Schlichtkrull et al., 2017). Like most other graph-based models, GRAFT-Nets can also be viewed as an instantiation of the Message Passing Neural Network (MPNN) framework of Gilmer et al. (2017). GRAFT-Nets are also inductive representation learners like GraphSAGE (Hamilton et al., 2017), but operate on a heterogeneous mixture of nodes and use retrieval for getting a subgraph instead of random sampling. The recently proposed Walk-Steered Convolution model uses random walks for learning graph representations (Jiang et al., 2018). Our personalization technique also borrows from such random walk literature, but uses it to localize propagation of embeddings.
Tremendous progress on QA over KB has been made with deep learning based approaches like memory networks (Bordes et al., 2015; Jain, 2016) and reinforcement learning (Liang et al., 2017; Das et al., 2017a). But extending them with text, which is our main focus, is non-trivial. In another direction, there is also work on producing parsimonious graphical representations of textual data (Krause et al., 2016; Lu et al., 2017); however in this paper we use a simple sequential representation augmented with entity links to the KB which works well.
For QA over text only, a major focus has been on the task of reading comprehension (Seo et al., 2017; Gong and Bowman, 2017; Hu et al., 2017; Shen et al., 2017; Yu et al., 2018) since the introduction of SQuAD (Rajpurkar et al., 2016). These systems assume that the answer-containing passage is known a priori, but there has been progress when this assumption is relaxed (Chen et al., 2017; Raison et al., 2018; Dhingra et al., 2017; Wang et al., 2018, 2017; Watanabe et al., 2017). We work in the latter setting, where relevant information must be retrieved from large information sources, but we also incorporate KBs into this process.
Experiments & Results
WikiMovies-10K consists of 10K randomly sampled training questions from the WikiMovies dataset (Miller et al., 2016), along with the original test and validation sets. We sample the training questions to create a more difficult setting, since the original dataset has questions over only a small number of different relation types, which is unrealistic in our opinion. In § 5.4 we also compare to the existing state-of-the-art using the full training set.
We use the KB and text corpus constructed from Wikipedia released by Miller et al. (2016). For entity linking we use simple surface level matches, and retrieve the top entities around the seeds to create the question subgraph. We further add the top sentences (along with their article titles) to the subgraph using Lucene search over the text corpus. The overall answer recall in our constructed subgraphs is .
WebQuestionsSP (Yih et al., 2016) consists of natural language questions posed over Freebase entities, split up into training and test questions. We reserve a portion of the training questions for model development and early stopping. We use the entity linking outputs from S-MART (https://github.com/scottyih/STAGG) and retrieve entities from the neighbourhood around the question seeds in Freebase to populate the question subgraphs (some questions had no detected entities; these were ignored during training and considered as incorrect during evaluation). We further retrieve the top sentences from Wikipedia with the two-stage process described in §2. The overall recall of answers among the subgraphs is .
Table 1 shows the combined statistics of all the retrieved subgraphs for the questions in each dataset. These two datasets present varying levels of difficulty. While all questions in WikiMovies correspond to a single KB relation, for WebQuestionsSP the model needs to aggregate over two KB facts for a portion of the questions, and also requires reasoning over constraints for some of the questions (Liang et al., 2017). For maximum portability, QA systems need to be robust across several degrees of KB availability, since different domains might contain different amounts of structured data, and KB completeness may also vary over time. Hence, we construct additional datasets from each of the above two, with the number of KB facts downsampled to different fractions of the original to simulate settings where the KB is incomplete. We repeat the retrieval process for each sampled KB.
5.2 Compared Models
KV-KB is the Key Value Memory Networks model from Miller et al. (2016) and Das et al. (2017c), but using only the KB and ignoring the text. KV-EF (early fusion) is the same model with access to both KB and text as memories. For text we use a BiLSTM over the entire sentence as keys, and entity mentions as values. This re-implementation shows better performance on the text-only and KB-only WikiMovies tasks than the results reported previously (see Table 4). For all KV models we tuned the number of layers, the batch size, and the model dimension; we also use fact dropout regularization in the KB+Text setting, with a tuned rate. GN-KB is the GRAFT-Net model ignoring the text. GN-LF is a late fusion version of the GRAFT-Net model: we train two separate models, one using text only and the other using KB only, and then ensemble the two. For ensembles we take a weighted combination of the answer probabilities produced by the models, with the weights tuned on the dev set; for answers only in text or only in KB, we use the probability as is. GN-EF is our main GRAFT-Net model with early fusion. GN-EF+LF is an ensemble over the GN-EF and GN-LF models, with the same ensembling method as GN-LF. We report Hits@1, which is the accuracy of the top-predicted answer from the model, and the F1 score. To compute the F1 score we tune a threshold on the development set to select answers based on binary probabilities for each node in the subgraph.
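The two metrics can be sketched as follows, assuming each question comes with a dictionary of per-node answer probabilities and a set of gold answers. The function names, the candidate threshold grid, and the choice to sum per-question F1 when tuning are assumptions of the example.

```python
def hits_at_1(probs, answers):
    """1.0 if the highest-probability node is a gold answer, else 0.0."""
    top = max(probs, key=probs.get)
    return 1.0 if top in answers else 0.0

def f1_at_threshold(probs, answers, threshold):
    """F1 of the set of nodes whose probability reaches the threshold."""
    predicted = {v for v, p in probs.items() if p >= threshold}
    true_pos = len(predicted & answers)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(answers)
    return 2 * precision * recall / (precision + recall)

def tune_threshold(dev_examples, candidates):
    """Pick the single threshold with the best total F1 over (probs, answers) pairs."""
    return max(candidates,
               key=lambda t: sum(f1_at_threshold(p, a, t) for p, a in dev_examples))

probs = {"a": 0.9, "b": 0.6, "c": 0.1}
gold = {"a", "b"}
best = tune_threshold([(probs, gold)], [0.5, 0.8])
```

With a single global threshold, questions that have many answers and questions that have one answer share the same cut-off, which is why F1 can lag behind Hits@1.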
5.3 Main Results
Table 2 presents a comparison of the above models across all datasets. GRAFT-Nets (GN) show consistent improvement over KV-MemNNs on both datasets in all settings, including KB only (-KB), text only (-EF, Text Only column), and early fusion (-EF). Interestingly, we observe a larger relative gap between the Hits and F1 scores for the KV models than we do for our GN models. We believe this is because the attention for KV is normalized over the memories, which are KB facts (or text sentences): hence the model is unable to assign high probabilities to multiple facts at the same time. On the other hand, in GN, we normalize the attention over types of relations outgoing from a node, and hence can assign high weights to all the correct answers.
We also see a consistent improvement of early fusion over late fusion (-LF), and by ensembling them together we see the best performance across all the models. In Table 2 (right), we further show the improvement for KV-EF over KV-KB, and GN-LF and GN-EF over GN-KB, as the amount of KB is increased. This measures how effective these approaches are in utilizing text plus a KB. For KV-EF we see improvements when the KB is highly incomplete, but in the full KB setting, the performance of the fused approach is worse. A similar trend holds for GN-LF. On the other hand, GN-EF with text improves over the KB-only approach in all settings. As we would expect, though, the benefit of adding text decreases as the KB becomes more and more complete.
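For reference, the late-fusion ensembling used by GN-LF can be sketched as below. The convex combination with a single weight `kb_weight` is an assumption about the form of the weighted combination; the function name is hypothetical.

```python
def late_fusion(kb_probs, text_probs, kb_weight):
    """Combine answer probabilities from a KB-only and a text-only model.

    Candidates scored by both models get a weighted combination;
    candidates found by only one model keep that model's probability as is.
    """
    fused = {}
    for v in set(kb_probs) | set(text_probs):
        if v in kb_probs and v in text_probs:
            fused[v] = kb_weight * kb_probs[v] + (1 - kb_weight) * text_probs[v]
        elif v in kb_probs:
            fused[v] = kb_probs[v]
        else:
            fused[v] = text_probs[v]
    return fused

fused = late_fusion({"a": 0.8, "b": 0.4}, {"a": 0.4, "c": 0.7}, kb_weight=0.5)
```

The combination happens only at the level of final probabilities, so evidence that is split between a KB fact and a sentence cannot be joined, which is the limitation early fusion is designed to avoid.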
5.4 Comparison to Specialized Methods
In Table 4 we compare GRAFT-Nets to state-of-the-art models that are specifically designed and tuned for QA using either only KB or only text. For this experiment we use the full WikiMovies dataset to enable direct comparison to previously reported numbers. For DrQA (Chen et al., 2017), following the original paper, we restrict answer spans for WebQuestionsSP to match an entity in Freebase. In each case we also train GRAFT-Nets using only KB facts or only text sentences. In three out of the four cases, we find that GRAFT-Nets either match or outperform the existing state-of-the-art models. We emphasize that the latter have no mechanism for dealing with the fused setting.
The one exception is the KB-only case for WebQuestionsSP, where GRAFT-Net scores lower in F1 than Neural Symbolic Machines (Liang et al., 2017). Analysis suggested three explanations: (1) In the KB-only setting, the recall of subgraph retrieval limits overall performance. In an oracle setting where we ensure the answers are part of the subgraph, the F1 score increases. (2) We use the same probability threshold for all questions, even though the number of answers may vary significantly. Models which parse the query into a symbolic form do not suffer from this problem since answers are retrieved in a deterministic fashion. If we tune separate thresholds for each question, the F1 score improves. (3) GRAFT-Nets perform poorly in the few cases where there is a constraint involved in picking out the answer (for example, “who first voiced Meg in Family Guy”). If we ignore such constraints, and consider all entities with the same sequence of relations to the seed as correct, the F1 score improves. Heuristics such as those used by Yu et al. (2017) can be used to improve these cases. Figure 3 shows examples where GRAFT-Net fails to predict the correct answer set exactly.
5.5 Effect of Model Components
We tested a non-heterogeneous version of our model, where instead of using fine-grained entity linking information for updating the node representations (as in Eqs. 1, 3a), we aggregate the states of a document across all its positions into a single combined state and use this combined state for all updates. Without the heterogeneous update, all entities will receive the same update from a given document. Therefore, the model cannot disambiguate different entities mentioned in the same document. The result in Table 5 shows that this version is consistently worse than the heterogeneous model.
Conditioning on the Question.
We performed an ablation test on the directed propagation method and attention over relations. We observe that both components lead to better performance. Such effects are observed in both complete and incomplete KB scenarios, e.g. on the WebQuestionsSP dataset, as shown in Figure 4 (left).
Fact Dropout.
Figure 4 (right) compares the performance of the early fusion model as we vary the rate of fact dropout. Moderate levels of fact dropout improve performance on both datasets. The performance increases as the fact dropout rate increases until the model is unable to learn the inference chain from KB.
Conclusion
In this paper we investigate QA using text combined with an incomplete KB, a task which has received limited attention in the past. We introduce several benchmark problems for this task by modifying existing question-answering datasets, and discuss two broad approaches to solving this problem—“late fusion” and “early fusion”. We show that early fusion approaches perform better.
We also introduce a novel early-fusion model, called GRAFT-Net, for classifying nodes in a subgraph consisting of both KB entities and text documents. GRAFT-Net builds on recent advances in graph representation learning but includes several innovations which improve performance on this task. GRAFT-Net is a single model which achieves performance competitive with state-of-the-art methods in both text-only and KB-only settings, and outperforms baseline models when using text combined with an incomplete KB. Directions for future work include: (1) extending GRAFT-Nets to pick spans of text as answers, rather than only entities, and (2) improving the subgraph retrieval process.
Acknowledgments
Bhuwan Dhingra is supported by NSF under grants CCF-1414030 and IIS-1250956 and by grants from Google. Ruslan Salakhutdinov is supported in part by ONR grant N000141812861, Apple, and Nvidia NVAIL Award.