WorldTree: A Corpus of Explanation Graphs for Elementary Science Questions supporting Multi-Hop Inference

Peter A. Jansen, Elizabeth Wainwright, Steven Marmorstein, Clayton T. Morrison

Introduction

Question answering (QA) is a high-level natural language processing task that requires automatically providing answers to natural language questions. The approaches used to construct QA solvers vary depending on the questions and domain, from inference methods that attempt to construct answers from semantic, syntactic, or logical decompositions, to retrieval methods that work to identify passages of text likely to contain the answer in large corpora using statistical methods. Because of the difficulty of this task, overall QA task performance tends to be low, with generally between 20% and 80% of natural (non-artificially generated) questions answered correctly, depending on the questions, the domain, and the knowledge and inference requirements.

Standardized science exams have recently been proposed as a challenge task for question answering, as these questions have very challenging knowledge and inference requirements, but are expressed in simple-enough language that the linguistic challenges are likely surmountable in the near-term. They also provide a standardized comparison of modern inference techniques against human performance, with individual QA solvers generally answering between 40% and 50% of multiple choice science questions correctly [2015, 2016, 2016, 2017, 2017, inter alia], and top-performing ensemble models nearly reaching a passing grade of 60% on middle school ($8^{th}$ grade) science exams during a recent worldwide competition of 780 teams sponsored by the Allen Institute for AI.

One of the central shortcomings of question answering models is that while solvers are steadily increasing the proportion of questions they answer correctly, most solvers generally lack the capacity to provide human-readable explanations or justifications for why those answers are correct. The lack of this “explainable inference” capacity is seen as a limitation of current machine learning models in general (e.g. Ribeiro et al.), but the capacity is critical for domains such as science or medicine where user trust and detecting potentially costly errors are important. More than this, evidence from the cognitive and pedagogy literature suggests that explanations (when tutoring others) and self-explanations (when engaged in self-directed learning) are an important aspect of learning, helping humans better generalize the knowledge they have learned. This suggests that explainable methods of inference may not only be desirable for users, but may be a requirement for automated systems to have human-like generalization and inference capabilities.

Building QA solvers that generate explanations for their answers is a challenging task, requiring a number of inference capacities. Central among these is the idea of information aggregation, or the idea that explanations for a given question are rarely found in a contiguous passage of text, and as such inference methods must generally assemble many separate pieces of knowledge from different sources in order to arrive at a correct answer. Previous estimates suggest elementary science questions require an average of 4 pieces of knowledge to answer and explain those answers (here our analysis suggests this is closer to 6), but inference methods tend to have difficulty aggregating more than 2 pieces of knowledge from free-text together due to the semantic or contextual “drift” associated with this aggregation. Because of the difficulty in assembling training data for the information aggregation task, some have approached explanation generation as a distant supervision problem, with explanation quality modelled as a latent variable. While these techniques have had some success in constructing short explanations, semantic drift likely limits the viability of this technique for explanations requiring more than two pieces of information to be aggregated.

To address this, here we construct a large corpus of explanation graphs (see Figure 1) to serve as training data for explainable inference tasks. The contributions of this work are:

We construct a set of explanations for 1,680 standardized elementary science exam questions, represented as both free-text, and as lexically-overlapping “explanation graphs” that provide training data for inference models by detailing explicit connections between knowledge in different sentences of an explanation.

We provide an explanation-centered “tablestore”, a set of 62 semi-structured tables containing 4,950 rows that provide a substantial portion of the knowledge required to answer non-spatial, non-mathematical elementary science questions.

We provide an analysis of the knowledge growth and explanation overlap properties of this corpus, suggesting both requirements for inference algorithms to make use of explanation corpora, as well as methods of estimating the difficulty in constructing explanation corpora in other domains.

Related Work

In terms of question answering, the ability to provide compelling human-readable explanations for answers to questions has been proposed as a complementary metric to assess QA performance alongside the proportion of questions answered correctly. Jansen et al. developed a QA system for elementary science that answers questions by building and ranking explanation graphs built from aggregating multiple sentences read from free text corpora, including study guides and dictionaries. Because of the difficulty in constructing gold explanations to serve as training data, the explanations built with this system were constructed by modeling explanation quality as a latent variable machine learning problem. First, sentences were decomposed into sentence graphs based on clausal and prepositional boundaries, then assembled into multi-sentence “explanation graphs”. Questions were answered by ranking these candidate explanation graphs, using answer correctness as well as features that capture the connectivity of key-terms in the graphs as a proxy for explanation quality. Jansen et al. showed that it is possible to learn to generate high quality explanations for 60% of elementary science questions using this method, an increase of 15% over a baseline that retrieved single continuous passages of text as answer justifications. Critically, in their error analysis Jansen et al. found that for nearly half of the questions answered incorrectly, their system had successfully generated high-quality explanation graphs and ranked these highly, though they were not ultimately selected. They suggest that the process of building and ranking explanations would be aided by developing more expensive second-pass reranking processes that are able to better recognize the components and structure of high-quality explanations within a short list of candidates.

Knowledge bases of tables, or “table stores”, have recently been proposed as a semi-structured knowledge formalism for question answering that balances the cost of manually crafting highly-structured knowledge bases with the difficulties in acquiring this knowledge from free text. The methods for question answering over tables generally take the form of constructing chains of multiple table rows that lead from terms in the question to terms in the answer, while the tables themselves are generally either collected from the web, automatically generated by extracting relations from free text, or manually constructed.

At the collection end of the spectrum, Pasupat and Liang extract 2,108 HTML tables from Wikipedia, and propose a method of answering questions by reasoning over these tables using formal logic. They also introduce the WikiTableQuestions dataset, a set of 22,033 question-answer pairs (such as “Greece held its last Summer Olympics during which year?”) that can be answered using these tables. Demonstrating the ability for collection at scale, Sun et al. extract a total of 104 million tables from Wikipedia and the web, and develop a model that constructs relational chains between table rows using a deep-learning framework. Sun et al. note that the 99 million tables extracted from the web introduce more noise into the inference process than the high-quality tables from Wikipedia. Using their system and table store, Sun et al. demonstrate state-of-the-art performance on several benchmark datasets, including WebQuestions, a set of popular questions asked from the web designed to be answerable using the large structured knowledge graph Freebase (e.g. “What movies does Morgan Freeman star in?”).

In terms of automatic generation, though relations are often represented as $<subject,relation,argument>$ triples, Yin et al. create a large table containing 120M n-tuple relations using OpenIE, arguing that the extra expressivity afforded by these more detailed relations allows their system to answer more complex questions. Yin et al. use this to successfully reason over the WebQuestions dataset, as well as their own set of questions with more complex prepositional and adverbial constraints.

Elementary science exams contain a variety of complex and challenging inference problems, with nearly 70% of questions requiring some form of causal, process, or model-based reasoning to solve and produce an explanation for. In spite of these exams being taken by millions of students each year, elementary students tend not to be fast or voluminous readers by adult standards, making this a surprisingly low-resource domain for grade-appropriate study guides and other materials. The questions also tend to require world knowledge expressed in grade-appropriate language (like that bears have fur and that fur keeps animals warm) to solve. Because of these requirements and limitations, table stores for elementary science QA tend to be manually or semi-automatically constructed, and comparatively small.

Khashabi et al. provide the largest elementary science table store to date, containing approximately 5,000 manually-authored rows across 65 tables based on science curriculum topics obtained from study guides and a small corpus of questions. Khashabi et al. also augment their tablestore with 4 tables containing 2,600 automatically generated table rows using OpenIE triples. Reasoning is accomplished using an integer-linear programming algorithm to chain table rows, with Khashabi et al. reporting that an average of 2 table rows are used to answer each question. Evaluation on a small set of 129 science questions achieved passing performance (61%), with an ablation study showing that the bulk of their model’s performance was from the manually authored tables.

To help improve the quality of automatically generated tables, Dalvi et al. introduce an interactive tool for semi-automatic table generation that allows annotators to query patterns over large corpora. They demonstrate that this tool can improve the speed of knowledge generation by up to a factor of 4 over manual methods, while increasing the precision and utility of the tables up to sevenfold compared to completely automatic methods.

All of the above systems share the commonality that they work to connect (or aggregate) multiple pieces of knowledge that, through a variety of inference methods, move towards the goal of answering questions. Fried et al. report that information aggregation for QA is currently very challenging, with few methods able to combine more than two pieces of knowledge before succumbing to semantic drift, or the phenomenon of two pieces of knowledge being erroneously connected due to shared lexical overlap, incomplete word-sense disambiguation, or other noisy signals (e.g. erroneously aggregating a sentence about Apple computers to an inference when working to determine whether apples are a kind of fruit). In generating a corpus of natural-language explanations for 432 elementary science questions, Jansen et al. found that the average question requires aggregating 4 separate pieces of knowledge to explainably answer, with some questions requiring much longer explanations.

Though few QA solvers explicitly report the aggregation limits of their algorithms, Fried et al., Khashabi et al., and Jansen et al. appear to show limits or substantial decreases in performance after aggregating two pieces of knowledge. To the best of our knowledge, of systems that use information aggregation, only Jansen et al. explicitly rate the explanatory performance of the justifications from their model, with good explanations generated for only 60% of correctly answered questions. Taken together, all of this suggests that performance on information aggregation and explainable question answering is still far from human performance, and could substantially benefit from a large corpus of training data for these tasks.

Design Goals

We began with the following design goals:

Explanations should be represented at different levels of structure (explanation, then sentences, then relations within sentences). The knowledge links between explanation sentences should be explicit through lexical overlap, which can be used to form an “explanation graph” that describes how each sentence is linked in an explanation.

Sufficient knowledge should be present in explanations such that the answer could be arrived at with little extra domain or world knowledge; i.e., where possible, explanations should be targeted at the level of knowledge of a 5-year-old child, or lower (see below for a more detailed discussion of explanatory depth).

Where possible, knowledge should be re-used across explanations to facilitate automated analysis of knowledge use, and identifying common explanation patterns across questions.

The level of knowledge required to convincingly explain why an answer to a question is correct depends upon one’s familiarity with the domain of the question. For a domain expert (such as an elementary science teacher), a convincing explanation of why thick bark is the correct answer to “Which characteristic could best help a tree survive the heat of a forest fire?” might need only take the form of explaining that one of bark’s primary functions is to provide protection for the tree. In contrast, for a domain novice, such as an elementary science student, this explanation might need to be elaborated to include more knowledge to make this inference, such as that thicker things tend to provide more protection. Here we identify four coarse levels of increasing explanatory knowledge depth, shown in Table 1.

For training explainable inference systems, a high level of explanatory depth is likely required. As such, in this work we target authoring explanations between the levels of young child and first principles. Pragmatically, in spite of their ultimate utility for training inference systems, building explanations too close to first principles becomes laborious and challenging for annotators given the level of abstraction and the large amount of implicit world knowledge that must be enumerated, and we leave developing protocols and methods for building such detailed explanations for future work.

Explanation Authoring

We describe our representations, tools, and annotation process below.

We author explanation graphs for a corpus of 2,201 elementary science questions ($3^{rd}$ through $5^{th}$ grade) from the AI2 Science Questions V2 corpus, consisting of both standardized exam questions from 12 US states, as well as the separate AI2 Science Questions Mercury dataset, a set of questions licensed from a student assessment entity. Each question is a 4-way multiple choice question, and only those questions that do not involve diagram interpretation (a separate spatial task) are included. Approximately 20% of explanations required specialized domain knowledge (for example, spatial or mathematical knowledge) that did not easily lend itself to explanation using our formalism, resulting in a corpus of 1,680 questions and explanations.

2. Tables and Table Rows

Explanations for a given question consist of a set of sentences, each of which is on a single topic and centered around a particular kind of relation, such as water is a kind of liquid (a taxonomic relation), or melting means changing from a solid to a liquid through the addition of heat energy (a change relation).

Each explanation sentence is represented as a single row from a semi-structured table defined around a particular relation. Our tablestore includes 62 such tables, each centered around a particular relation such as taxonomy, meronymy, causality, changes, actions, requirements, or affordances, and a number of tables specified around specific properties, such as average lifespans of living things, the magnetic properties of materials, or the nominal durations of certain processes (like the Earth orbiting the Sun). The initial selection of table relations was drawn from a list of 21 common relations required for science explanations identified by Jansen et al. on a smaller corpus, and expanded as new knowledge types were identified. Subsets of example tables are included in Figure 3. Each explanation in this corpus contains an average of 6.3 rows.

In tabular representations, columns represent specific roles or arguments to a specific relation (such as X is when Y changes from A to B using mechanism C). In our tablestore we attempt to minimize the amount of information per cell, instead favouring tables with many columns that explicitly identify common roles, conditions, or other relations. This finer-grained structure eases the annotator’s cognitive load when authoring new rows, while also better compartmentalizing the relational knowledge in each row for inference algorithms. The tables in our tablestore contain between 2 and 16 content columns, as compared to 2 to 5 columns for the Aristo tablestore.

QA models use a variety of different representations for inference, from semantic roles and syntactic dependencies to discourse and embeddings. Following Khashabi et al., we make use of a specific form of table representation that includes “filler” columns, which allow each row to be directly read off as a stand-alone natural language sentence and to serve as input to any model. Examples of these filler columns can be seen in Figure 3.
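As a minimal sketch of how a row with filler columns might be read off as a sentence, consider the following. The column names, the `FILL` prefix used to mark filler columns, and the example row are assumptions made for illustration, not the actual schema of the tablestore.

```python
# Hypothetical row from a taxonomic ("kind of") table. The column names and
# the "FILL" prefix for filler columns are illustrative assumptions.
COLUMNS = ["HYPONYM", "FILL_1", "HYPERNYM"]
ROW = {"HYPONYM": "water", "FILL_1": "is a kind of", "HYPERNYM": "liquid"}


def row_to_sentence(row, columns):
    """Join the non-empty cells in column order to form a plain sentence."""
    cells = [row.get(col, "").strip() for col in columns]
    return " ".join(cell for cell in cells if cell)


def content_cells(row, columns):
    """Return only the non-empty content (non-filler) cells, keyed by column."""
    return {
        col: row[col]
        for col in columns
        if not col.startswith("FILL") and row.get(col, "").strip()
    }


sentence = row_to_sentence(ROW, COLUMNS)
arguments = content_cells(ROW, COLUMNS)
print(sentence)
print(arguments)
```

Under these assumptions, a model that wants free text can consume the joined sentence, while a model that wants relational arguments can read only the content columns.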

3. Explanation Graphs and Sentence Roles

Explanations for a given question here take the form of a list of sentences, where each sentence is a reference to a specific table row in the table store. To increase their utility for knowledge and inference analyses, we require that each sentence in an explanation be explicitly lexically connected (i.e. share words) with either the question, answer, or other sentences in the explanation. We call this lexically-connected set of sentences an explanation graph.
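The lexical-connection requirement can be illustrated with a small sketch. The stop-word list, the whitespace tokenizer, and the example question are simplifying assumptions; the annotation tool described in this paper may compute overlap differently (for example, with lemmatization).

```python
from itertools import combinations

# Small illustrative stop list (an assumption, not the corpus's own list).
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "from", "through", "kind", "means"}


def content_words(text):
    """Lowercase, strip punctuation, and drop stop words."""
    words = (w.strip(".,?!\"'()").lower() for w in text.split())
    return {w for w in words if w and w not in STOP_WORDS}


def build_explanation_graph(nodes):
    """nodes: dict of node id -> text. Returns {(id_a, id_b): shared words}."""
    words = {node_id: content_words(text) for node_id, text in nodes.items()}
    edges = {}
    for a, b in combinations(sorted(nodes), 2):
        shared = words[a] & words[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def unconnected(nodes, edges):
    """Node ids that share no words with any other node (gaps to fix)."""
    linked = {node_id for edge in edges for node_id in edge}
    return sorted(set(nodes) - linked)


# Hypothetical question and answer, with three explanation sentences.
NODES = {
    "question": "What is formed when ice is melting?",
    "answer": "liquid water",
    "row1": "melting means changing from a solid to a liquid through the addition of heat energy",
    "row2": "water is a kind of liquid",
    "row3": "ice is a kind of solid",
}

edges = build_explanation_graph(NODES)
gaps = unconnected(NODES, edges)
for pair, shared in sorted(edges.items()):
    print(pair, sorted(shared))
```

An empty result from `unconnected` corresponds to the requirement that every sentence link to the question, the answer, or another sentence.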

In our preliminary analysis, we observed that the sentences in our explanations can take on very different roles, and we hypothesize that differentiating these roles is likely important for inference algorithms. We identified four coarse roles, listed in Table 2, and described below:

Central: The central concept(s) that a question is testing, such as changes of state or the coupled relationship between kinetic energy and temperature.

Grounding: Sentences linking generic or abstract terms in a central sentence with specific instances of those terms in the question or answer. For example, for questions about changes of state, grounding sentences might identify specific instances of liquids (such as water) or gasses (such as water vapor).

Background: Extra information elaborating on the topic, but that (strictly speaking) isn’t required to arrive at the correct inference.

Lexical glue: Sentences that lexically link two concepts, such as “to add means to increase”, or “heating means adding heat”. This is an artificial category in our corpus, brought about by the need for explanation graphs to be explicitly lexically linked.

For each sentence in each authored explanation, we provide annotation indicating which of these four roles the sentence serves in that explanation.

4. Annotation Tool

To facilitate explanation authoring, we developed and iterated the web-based collaborative authoring tool shown in Figure 3. The tool displays a given question to the explanation author, and allows the author to progressively build an explanation graph for that question by querying the tablestore for relevant rows based on keyword searches, as well as past explanations that are likely to contain similar content or structure (increasing consistency across explanations, while reducing annotation time). A graphical visualization of the explanation graph helps the author quickly assess gaps in the explanation content to address by highlighting lexical overlap between sentences with coloured edges and labels. The tablestore takes the form of a shared Google Sheet (http://sheets.google.com) that the annotators populate, with each table represented as a separate tab on the sheet.

5. Procedure and Explanation Review

For a given question, annotators identified the central concept the question was testing, as well as the inference required to correctly answer the question, then began progressively constructing the explanation graph. Sentences in the graph were added by querying the tablestore based on keywords, which retrieved both single sentences/table rows, as well as entire explanations that had been previously annotated. If any knowledge required to build an explanation did not exist in the tablestore, this was added to an appropriate table, then added to the explanation.

New tables were regularly added, most commonly for property knowledge surrounding a particular topic (e.g. whether a particular material is recyclable). Because explanations are stored as lists of unique identifiers to table rows, tables and table rows could regularly be refactored, elaborated, or entirely reorganized without requiring existing explanations to be rewritten. We found this was critical for consistency and ensuring good organization throughout corpus construction.
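The benefit of storing explanations as lists of row identifiers can be sketched as follows. The identifier format, table names, dictionary layout, and role labels used here are hypothetical; only the indirection through unique row identifiers is taken from the description above.

```python
# Hypothetical tablestore: unique row id -> table name and sentence text.
tablestore = {
    "row-001": {"table": "KINDOF", "text": "water is a kind of liquid"},
    "row-002": {"table": "CHANGE", "text": "melting means changing from a solid to a liquid"},
}

# An explanation references rows by id only, each paired with a sentence role.
explanations = {
    "question-1": [("row-002", "CENTRAL"), ("row-001", "GROUNDING")],
}


def render(question_id):
    """Resolve an explanation's row references into (role, sentence) pairs."""
    return [
        (role, tablestore[row_id]["text"])
        for row_id, role in explanations[question_id]
    ]


before = render("question-1")

# Refactor the tablestore: move the row to a renamed table and elaborate its
# wording. The row id is unchanged, so the explanation itself is not edited.
tablestore["row-002"] = {
    "table": "CHANGE-STATE",
    "text": "melting means changing from a solid to a liquid through the addition of heat energy",
}

after = render("question-1")
print(before[0][1])
print(after[0][1])
```

In this sketch the stored explanation is identical before and after the refactoring, while the rendered text picks up the revised row.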

One of the central difficulties with evaluating explanation authoring is determining metrics for interannotator agreement, as many correct explanations are possible for a given question, and there are many different wordings that an annotator might choose to express a given piece of knowledge in the tablestore. Similarly, the borders between different levels of explanatory depth are fuzzy, suggesting that one annotator may express their explanation with more or less specificity than another.

To address these difficulties we included two methods to increase consistency. First, as a passive intervention during the explanation generation process, annotators are presented with existing explanations that can be drawn from to compose a new explanation, where these existing explanations share many of the same query terms being used to construct the new explanation. Second, as an active intervention, each explanation goes through four review passes to ensure consistency. The first two passes are completed by the original annotator, before checking a flag on the annotation tool signifying that the question is ready for external review. A second annotator then checks the question for completeness and consistency with existing explanations, and composes a list of suggested edits and revisions. The fourth and final pass is completed by the original annotator, who implements these suggested revisions. This review process is expensive, taking approximately one third of the total time required to annotate each question.

Each annotator required approximately 60 hours of initial training for this explanation authoring task. We found that most explanations could be constructed within 5-10 minutes, with the review process taking approximately 5 more minutes per question.

Explanation Corpus Properties

Here we characterize three properties of the explanation corpus as they relate to developing methods of explainable inference: knowledge frequency, explanation overlap, and tablestore growth.

The tables most frequently used to author explanations are shown in Table 3, broken down into three broad categories identified by Jansen et al.: retrieval types, inference-supporting types, and complex inference types. Because the design of this corpus is data driven (i.e., knowledge is generally added to a table because it is required in one or more explanations), we can calculate how frequently the rows in a given table are reused to obtain an approximate measure of the generality of that knowledge. (For compatibility, we do include several property tables from the Aristo tablestore, though a large proportion of rows from these tables are not actively used. Our tablestore includes 4,950 rows, 3,686 of which are actively used in at least one explanation.) On average, a given table row is used in 2.9 different explanations, with 1,535 rows used more than once, and 531 rows used 5 or more times. The most frequently reused row (“an animal is a kind of organism”) is used in 89 different explanations. Generic “change of state” knowledge (e.g. solids, liquids, and gasses) is also frequently reused, with each row in the StatesOfMatter table used in an average of 15.7 explanations. Usage statistics for other common tables are also provided in Table 3.
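Reuse statistics of this kind can be computed with a simple count over explanations. The sketch below uses a hypothetical toy corpus with made-up question and row identifiers; it shows the calculation only and does not reproduce the figures reported above.

```python
from collections import Counter

# Hypothetical toy corpus: question id -> row ids used in its explanation.
explanations = {
    "q1": ["r1", "r2", "r3"],
    "q2": ["r1", "r4"],
    "q3": ["r1", "r2", "r5"],
}
# All row ids in the toy tablestore; "r6" is never used in an explanation.
all_rows = {"r1", "r2", "r3", "r4", "r5", "r6"}


def row_usage(explanations):
    """Count, for each row, the number of explanations that use it."""
    counts = Counter()
    for rows in explanations.values():
        counts.update(set(rows))  # count a row at most once per explanation
    return counts


usage = row_usage(explanations)
rows_in_use = len(usage)
mean_uses = sum(usage.values()) / rows_in_use
rows_reused = sum(1 for count in usage.values() if count > 1)
most_reused = usage.most_common(1)[0]
unused_rows = all_rows - set(usage)

print(rows_in_use, mean_uses, rows_reused, most_reused, sorted(unused_rows))
```

The same counts, grouped by the table each row belongs to, would give per-table averages such as the one quoted for the StatesOfMatter table.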

2. Explanation Overlap

One might hypothesize that questions that require similar inferences to correctly answer may also contain some of the same knowledge in their explanations, with the amount of knowledge overlap dependent upon the similarity of the questions. We plan to explore using this overlap as a method of inference that can generate new explanations by editing, merging, or expanding known explanations from similar, known questions (see Jansen for an initial study). For this to be possible, an explanation corpus must reach a sufficient size that a large majority of questions have substantial overlap in their explanations.

Figure 5 shows the proportion of questions in the corpus that have 1 or more, 2 or more, 3 or more, etc., overlapping rows in their explanations with at least one other question in the corpus. (Though not included for space, the number of questions with N or more rows in common in their explanations increases linearly with the number of questions.) For this corpus, for a given question, on average there are 17 questions that have 1 or more overlapping rows in their explanation, 9 questions with 2 or more shared rows in their explanation, and 5 questions with 3 or more shared rows in their explanation. To ground this, Figure 4 shows a visualization of questions whose explanations have 2 or more overlapping rows. For a given level of overlapping explanation sentences, Figure 5 shows that the proportion of questions with that level of overlap increases logarithmically with the number of questions.
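A minimal sketch of the overlap calculation is shown below, assuming explanations are available as lists of row identifiers. The toy corpus and function names are hypothetical, and the pairwise comparison is quadratic in the number of questions, which is acceptable at the scale of a few thousand questions.

```python
# Hypothetical toy corpus: question id -> row ids used in its explanation.
explanations = {
    "q1": ["r1", "r2", "r3"],
    "q2": ["r1", "r4"],
    "q3": ["r1", "r2", "r5"],
    "q4": ["r6"],
}


def overlap_counts(explanations, min_shared):
    """For each question, count the other questions sharing >= min_shared rows."""
    row_sets = {q: set(rows) for q, rows in explanations.items()}
    counts = {}
    for q, rows in row_sets.items():
        counts[q] = sum(
            1
            for other, other_rows in row_sets.items()
            if other != q and len(rows & other_rows) >= min_shared
        )
    return counts


def proportion_with_overlap(explanations, min_shared):
    """Proportion of questions overlapping with at least one other question."""
    counts = overlap_counts(explanations, min_shared)
    return sum(1 for count in counts.values() if count > 0) / len(counts)


for n in (1, 2, 3):
    print(n, overlap_counts(explanations, n), proportion_with_overlap(explanations, n))
```

Repeating the proportion calculation on random subsets of increasing size is one way a curve like the one described for Figure 5 could be traced.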

This has two consequences. First, it allows us to estimate the size of corpus required to train hypothetical inference methods for the science exam domain capable of producing explanations. If a given inference method can work successfully with only minimal overlap (for example, 1 shared table row), then a training corpus of 500 explanations in this domain should be sufficient to answer 80% of questions. If an inference method requires 2 shared rows, the corpus requirements would increase to approximately 2,500 questions to answer 80% of questions. However, if an inference method requires 3 or more rows, this likely would not be possible without a corpus of at least 20,000 questions and explanations – a substantial undertaking. Second, because this relationship is strongly logarithmic, if it transfers to domains outside elementary science, it should be possible to estimate the corpus size requirements for those domains after authoring explanations for only a few hundred questions.
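The extrapolation described here can be sketched as fitting a logarithmic curve to a few measured (corpus size, proportion) points and inverting it for a target proportion. The points below are synthetic, generated from an arbitrary known curve purely to check the fitting code; they are not measurements from this corpus, and the function names are hypothetical.

```python
import math


def fit_log_curve(sizes, proportions):
    """Least-squares fit of p = a * ln(Q) + b; returns (a, b)."""
    xs = [math.log(q) for q in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(proportions) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, proportions))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def corpus_size_for(target, a, b):
    """Invert the fitted curve: Q = exp((target - b) / a)."""
    return math.exp((target - b) / a)


# Synthetic points from p = 0.1 * ln(Q) + 0.1 (an arbitrary curve, NOT data
# from the paper), used only to confirm the fit recovers known parameters.
sizes = [100, 200, 400, 800]
proportions = [0.1 * math.log(q) + 0.1 for q in sizes]

a, b = fit_log_curve(sizes, proportions)
needed = corpus_size_for(0.8, a, b)
print(a, b, needed)
```

With real measurements from a few hundred annotated questions in a new domain, the inverted curve would give a rough estimate of the corpus size needed to reach a chosen level of coverage, under the assumption that the logarithmic relationship transfers.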

3. Explanation Tablestore Growth

Finally, we examine the growth of the tablestore as it relates to the number of questions in the corpus. Figure 6 shows a Monte Carlo simulation of the number of unique tablestore rows required to author explanations for specific corpus sizes. This relationship is strongly correlated (R=0.99) with an exponential proportional decrease. Here, this exponential proportional decrease takes the form of $R=434-(-2.93/0.00054)\cdot(1-e^{-0.00054\cdot Q})$, where $R$ is the size of the tablestore in rows required to explainably answer $Q$ questions. For this elementary science corpus, this asymptotes at approximately 6,000 unique table rows, and 10,000 questions, providing an estimate of the upper-bound of knowledge required in this domain, and the number of unique questions that can be generated within the scope of the elementary science curriculum.
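The fitted curve can be evaluated directly, and the simulation behind it can be sketched as random subsampling of questions. The constants below are copied from the equation above; the constant and function names, the toy corpus, and the sampling details (number of trials, seeding) are assumptions, since the simulation procedure is not specified in detail here.

```python
import math
import random

# Constants copied from the fitted curve; the names are ours.
INTERCEPT = 434.0
INITIAL_SLOPE = 2.93  # dR/dQ at Q = 0 for this functional form
DECAY = 0.00054


def rows_required(num_questions):
    """Evaluate R = 434 - (-2.93/0.00054) * (1 - exp(-0.00054 * Q))."""
    return INTERCEPT - (-INITIAL_SLOPE / DECAY) * (1.0 - math.exp(-DECAY * num_questions))


# Limit of the curve as Q grows without bound.
asymptote = INTERCEPT + INITIAL_SLOPE / DECAY


def simulate_unique_rows(explanations, corpus_size, trials=100, seed=0):
    """Average number of unique rows over random question subsets of one size."""
    rng = random.Random(seed)
    question_ids = sorted(explanations)
    total = 0
    for _ in range(trials):
        sample = rng.sample(question_ids, corpus_size)
        total += len({row for q in sample for row in explanations[q]})
    return total / trials


# Hypothetical toy corpus: question id -> row ids used in its explanation.
toy = {"q1": ["r1", "r2", "r3"], "q2": ["r1", "r4"], "q3": ["r1", "r2", "r5"]}

print(round(rows_required(1680)), round(asymptote))
print(simulate_unique_rows(toy, 1), simulate_unique_rows(toy, 3))
```

Evaluated at the corpus size of 1,680 questions, the curve gives roughly 3,670 rows, close to the 3,686 rows reported as actively used, and its limit of roughly 5,860 rows matches the approximate 6,000-row asymptote stated above.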

The caveat to this estimate is that it estimates the knowledge required for elementary science exams as they currently exist, with the natural level of variation introduced by the test designers. Questions are naturally grounded in examples, such as “Which part of an oak tree is responsible for undertaking photosynthesis?” (Answer: the leaves). While the corpus often contains a number of variations of a given question that test the same curriculum topic and have similar explanations, many more variations on these questions are possible that ground the question in different examples, like orchids, peach trees, or other plants. As such, while we believe that these estimates likely cover the core knowledge of the domain, many times that knowledge would be required to make the explanation tablestore robust to small variations in the presentation of those existing exam questions, or to novel unseen questions.

Conclusion

We provide a corpus of explanation graphs for elementary science questions suitable for work in developing explainable methods of inference, and show that the knowledge frequency, explanation overlap, and tablestore growth properties of the corpus follow predictable relationships. This work is open source, with the corpus and generation tools available at http://www.cognitiveai.org/explanationbank.

Acknowledgements

We thank the Allen Institute for Artificial Intelligence for funding this work, Peter Clark at AI2 for thoughtful discussions, and Paul Hein for assistance constructing the annotation tool.