Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal

Introduction

Open book exams are a common mechanism for assessing human understanding of a subject, where test takers are allowed free access to a relevant book, study guide, or class notes when answering questions. In this context, the goal is not to evaluate memorization but a deeper understanding of the material and its application to new situations Jenkins (1995); Landsberger (1996). The application, in turn, often requires combining a fact in the book (e.g., metals conduct electricity) with additional common knowledge the test taker is expected to have acquired by this stage (e.g., a suit of armor is made of metal).

Motivated by this setting, we present a new kind of question answering dataset, OpenBookQA (the dataset and the code for the models are available at http://data.allenai.org/OpenBookQA), that consists of two parts: $\mathcal{Q}$, a set of 5957 multiple-choice questions, and $\mathcal{F}$, a set of 1326 diverse facts about elementary level science. $\mathcal{F}$ has three key characteristics of an ‘open book’: (a) it forms the basis for generating $\mathcal{Q}$; (b) it has been deemed central to scientific explanations Jansen et al. (2018); and (c) by itself, $\mathcal{F}$ is generally insufficient to answer questions in $\mathcal{Q}$. Faced with a question $q\in\mathcal{Q}$, a student or system $S$ is expected to retrieve a relevant fact $f\in\mathcal{F}$, and appeal to their own common knowledge, $\mathcal{K_{S}}$, when applying $f$ to answer $q$.

Figure 1 provides an example. Here, metals are thermal conductors is a core scientific fact available in $\mathcal{F}$ . One way to apply this fact to decide whether a steel spoon would let the most heat travel through is to appeal to common knowledge that steel is metallic and heat travels through thermal conductors. In general, the expected common knowledge is relatively simple (taxonomic facts, definitions, object properties, etc.); the difficulty lies in identifying it and meaningfully combining it with a core fact from $\mathcal{F}$ to answer the question.

OpenBookQA questions are challenging as they require multi-hop reasoning with partial context provided by $\mathcal{F}$. Specifically, unlike existing datasets for reading comprehension (RC), answering questions on the back of a textbook (TQA; only $\sim$5% of the TQA questions of Kembhavi et al. (2017) require additional common knowledge), as well as question answering over structured knowledge-bases (KBQA), the open book $\mathcal{F}$ that comes with OpenBookQA is not self-contained. A successful system must therefore go beyond the typical challenges such as paraphrase matching and coreference resolution, without benefiting from the canonicalized and complete information in KBQA.

Generating interesting open book questions is a difficult task. We used a multi-stage process starting with $\mathcal{F}$ , using crowd-sourcing to generate (noisy) questions based on $\mathcal{F}$ that probe novel situations, using an automatic filter to ensure hardness for retrieval and association based systems, using a crowd filter to ensure answerability by a lay person, and further using an expert filter to ensure higher quality in Dev and Test sets.

We evaluate a number of existing QA systems for science (without retraining) on OpenBookQA, finding that they perform surprisingly close to the random guessing baseline of 25%. Human performance, on the other hand, is close to 92%. (To avoid ambiguity in the term ‘human performance’, Section 3.2 describes the specific randomized model we use.)

Motivated by recent findings of gameability of NLP datasets Gururangan et al. (2018), we also develop and evaluate simple, attention-based, neural baselines including a plausible answer detector (which ignores the question text completely) and an odd-one-out solver. These highlight inevitable human bias in any crowdsourced dataset, increasing performance on OpenBookQA to 48%.

Building upon a recent neural model for incorporating external knowledge in the story cloze setting Mihaylov and Frank (2018), we propose a knowledge-aware neural baseline that can utilize both the open book $\mathcal{F}$ and common knowledge retrieved from sources such as ConceptNet Speer et al. (2017). While retrieving the most useful pieces of knowledge remains an open challenge, our ‘oracle’ experiments with the fact $f$ used while generating a question $q$ and an interpretation (by the question author) of the additional knowledge $k$ needed for $q$ provide valuable insight into the nature of this dataset: facts from the open book $\mathcal{F}$ are valuable (5% improvement) but not sufficient. Using both $f$ and $k$ increases the accuracy to 76%, but this is still far from human level performance, suggesting the need for non-trivial reasoning to combine these facts.

To encourage further research on this new task, for each Train and Dev question $q$, OpenBookQA also includes $f$ as an intermediate supervision signal, which may be viewed as a partial explanation for $q$. We leave closing the large gap to human performance as a challenge for the NLP community.

Related Work

By construction, answering OpenBookQA questions requires (i) some base science facts from a provided ‘open book’, (ii) broader understanding about the world (common or commonsense knowledge), and (iii) an ability to combine these facts (reasoning). This setup differs from several existing QA tasks, as summarized below.

Reading Comprehension (RC) datasets have been proposed as benchmarks to evaluate the ability of systems to understand a document by answering factoid-style questions over this document. These datasets have taken various forms: multiple-choice Richardson et al. (2013), cloze-style Hermann et al. (2015); Onishi et al. (2016); Hill et al. (2016), and span prediction Rajpurkar et al. (2016); Trischler et al. (2017); Joshi et al. (2017). However, analysis Chen et al. (2016); Sugawara et al. (2017) of these datasets has shown that many of the questions can be solved with context token matching Chen et al. (2017a); Weissenborn et al. (2017) or relatively simple paraphrasing.

To focus on the more challenging problem of reasoning across sentences, new datasets have been proposed for multi-step RC. QAngaroo (Welbl et al., 2018) uses a knowledge-base to identify entity pairs (s, o) with a known relation, r, which is also supported by a multi-hop path in a set of documents. It uses structured tuple queries (s, r, ?) and takes all the documents along the path as the input passage. NarrativeQA (Kociský et al., 2017) is an RC dataset that has been shown to require iterative reasoning about the narrative of a story. Similar to OpenBookQA, the questions were generated to ensure that the answer is not a direct match or paraphrase that can be retrieved with an IR approach. Most recently, Khashabi et al. (2018) proposed MultiRC, a multiple-choice RC dataset that is designed to require multi-sentence reasoning and can have multiple correct answers. Again, like most RC datasets, it is self-contained.

While many of the RC datasets could benefit from commonsense or background knowledge, they are designed to be self-contained, i.e., solvable by the document context alone. Datasets such as the Story Cloze Test Mostafazadeh et al. (2016), MCScript (SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge, https://competitions.codalab.org/competitions/17184), and ProPara Mishra et al. (2018) do require additional domain knowledge about everyday events, scripts, and processes, respectively. However, these datasets need domain-specific modeling of events, whereas OpenBookQA appeals to broad common knowledge cutting across a variety of types and topics.

Stasaski and Hearst (2017) explore the creation of multi-hop questions and propose generating stronger distractors for the multiple-choice setting. Their work, however, starts with structured knowledge, specifically a Biology ontology.

Lastly, many Science Question Answering datasets (e.g. Clark et al., 2016, 2018) have been released that need broad external knowledge to answer the questions. However, these questions are not associated with a core set of facts, i.e., an “open book” used to define these questions. As a result, the questions vary widely in style and complexity Clark et al. (2018). In contrast, OpenBookQA focuses on a more well-defined subset of science QA, appealing to one core fact from the open book and one (or few) relatively simple commonly known supporting facts.

OpenBookQA Dataset

The OpenBookQA dataset consists of about 6,000 4-way multiple-choice questions, each associated with one core fact from a “book” $\mathcal{F}$ of 1326 such facts, and an auxiliary set $\mathcal{K}$ of about 6000 additional facts. The questions were created via a multi-stage crowdsourcing and partial expert filtering process, discussed in Section 3.1.

The small “book” $\mathcal{F}$ consists of recurring science themes and principles, each of which can be (and here is) instantiated into multiple questions. For $\mathcal{F}$ , we use a subset of the WorldTree corpus which Jansen et al. (2018) have analyzed for sufficiency for elementary level science. The subset we use is taken from the 2287 WorldTree facts that were marked as “central” by the original authors in at least one explanation. We further filter them down to 1326 that appear general enough to be applicable to multiple situations.

OpenBookQA additionally requires broad common knowledge, which is expected to come from large corpora, such as ConceptNet, Wikipedia, or a corpus with 14M science-related sentences used by some existing baselines. The crowdsourcing process below also asks workers to mark a second fact, $k$ , needed for each question $q$ , in addition to $f$ . These second facts, unfortunately, were often incomplete, over-complete, or only distantly related to $q$ . We thus include in OpenBookQA the set $\mathcal{K}$ of such second facts only as auxiliary data for optional use. We emphasize that $\mathcal{K}$ should not be viewed as ‘gold’ additional facts, or as a substitute for broad common knowledge.

The overall question generation and filtering pipeline is summarized in Figure 2. Given the “book” $\mathcal{F}$ of core facts, the process proceeds as follows, starting with an empty question set $\mathcal{Q}$ and an empty ‘second facts’ set $\mathcal{K}$:

A crowd-worker $w$ is shown a random science fact $f$ from the set $\mathcal{F}$. (We used Amazon Mechanical Turk, with workers from North America and with a ‘masters’ level qualification.)

$w$ is asked to think of a second common fact, $k$ , that may be combined with $f$ to derive a new, valid assertion $s$ .

The Dev and Test splits were further filtered by an in-house expert to ensure higher quality.

3.2 Human Performance

To assess human accuracy on this dataset, we consider the following model: Each question $q\in\mathcal{Q}$ has some (unknown) human accuracy $p_{q}$, defined as the probability that a random human subject, chosen uniformly from a large pool $\mathcal{H}$, would answer $q$ correctly. Thus, we can think of this as defining a Bernoulli random variable, $X_{q}\sim B(p_{q})$, whose mean is (unknown) $p_{q}$. The average human accuracy on $\mathcal{Q}$ under this model is:

$$H(\mathcal{Q})=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}p_{q}$$

where $\{p_{q}\mid q\in\mathcal{Q}\}$ are unknown.

With $\mathcal{H}$ as the set of crowd-workers (cf. Footnote 5), step 6 of the above question generation process is equivalent to obtaining 5 independent samples, $X_{q,i},i\in I,|I|=5$ , from $B(p_{q})$ . We must, however, be careful when using this data to estimate $p_{q}$ , as the same 5 samples were used to decide whether $q$ makes it into the question set $\mathcal{Q}$ or not. For instance, if we had kept only those questions that all 5 workers answered correctly, it would clearly be inaccurate to claim that the human accuracy on $\mathcal{Q}$ is 100%. Nevertheless, it is possible to re-use the judgments from Step 6 to approximate $H(\mathcal{Q})$ with high confidence, without posing the questions to new workers.

Intuitively, if all questions in $\mathcal{Q}$ were difficult to answer (i.e., all $p_{q}$ were small), it would be unlikely that all $|\mathcal{Q}|$ questions would pass the test in Step 6. We can use the contrapositive of this observation to conclude that $p_{q}$ , on average, must have been high for $q\in\mathcal{Q}$ .

Formally, aggregating across all questions gives the following empirical estimate of $H(\mathcal{Q})$ :
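A natural estimator matching this description, stated here as an assumption rather than as the authors' exact expression (the symbol $\tilde{H}$ is illustrative notation), averages the five worker judgments for each question and then averages over all questions in $\mathcal{Q}$:

$$\tilde{H}(\mathcal{Q})=\frac{1}{|\mathcal{Q}|}\sum_{q\in\mathcal{Q}}\frac{1}{|I|}\sum_{i\in I}X_{q,i}$$

Under the model above, each inner average has mean $p_{q}$ when the samples are independent of the selection of $q$; the caveat raised earlier is precisely that selection into $\mathcal{Q}$ depends on the same samples, so this quantity should be read as an approximation to $H(\mathcal{Q})$.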

3.3 Question Set Analysis

OpenBookQA consists of 5957 questions, with 4957/500/500 in the Train/Dev/Test splits. (Overall, 8140 questions were collected, of which 2183 were discarded in crowdsourcing Step 7.) Table 1 summarizes some statistics about the full dataset. Each question has exactly four answer choices and one associated fact used in the creation process. We report the average length of questions, candidate choices, and associated facts, as well as how often the longest/shortest choice is the correct one.

We analyzed 100 questions in the Train set to capture the kind of common knowledge and reasoning needed. For each, we wrote down the additional common knowledge needed to answer this question in addition to the original science fact. In 21% of the cases, the crowdsourced question actually tests for a fact that doesn’t necessarily need the original science fact. For example, the question: “On a rainy day the clouds are (A) low (B) white (C) small (D) gray” was written based on the science fact “clouds produce rain” but doesn’t need this fact to answer it. We ignore such questions in our analysis. For the remaining questions, we categorized the additional facts into five high-level categories (and collapsed the remaining facts into a catch-all Others category) based on previous approaches on similar science questions Clark et al. (2018); Jansen et al. (2016):

Isa: Basic taxonomic facts such as isa(tree, living thing), isa(granite, rock).

Property: Properties of objects such as madeof(belt buckle, metal), has(mammals, four legs), contains(lemon juice, citric acid).

Definition: Definitions of objects that may be based on their appearance (tape is a plastic with markings), working mechanism (telescope is a device that uses mirrors to view objects), etc.

Causal: Causal facts such as causes(adding lemon juice to milk, milk to break down).

Basic: General scientific fact that did not fit above, e.g. squirrels eat nuts for food.

Table 2 presents the proportions of these facts in our analyzed question set. For each type of fact, we calculate the percentage of questions that need at least one such fact (shown as % Questions). We also calculate the overall percentage of each fact type across all the common knowledge facts (shown as % Facts). Most of our questions need simple facts such as $isa$ knowledge and properties of objects, further confirming the need for simple reasoning with common knowledge. Apart from these five major categories of facts, the catch-all Others category contains common-sense facts (e.g., it is dark at night), world knowledge (e.g., Japan is often hit by earthquakes) and lexical rewrites (e.g., ad infinitum means over and over). Of course, every question had lexical variations; we marked a lexical rewrite only when it was the only change to the core fact.

Most of our questions need simple facts that should be easily retrievable from any knowledge base or textual corpus. On average, each question needed 1.16 additional facts, ignoring any linguistic variations. Despite the simplicity of the knowledge needed for these questions, as we show empirically, most baseline approaches achieve a relatively low score on this dataset (even when the core fact is provided). We claim that this is because the reasoning needed to answer these questions is non-trivial. Table 3 shows a few questions with the associated facts and high-level reasoning needed to answer these questions. Assuming a model can extract the described relations (e.g. defn, contains), the QA system still needs to be able to chain these facts together, identify the resulting relation and verify its expression for each choice. In the extreme case (as shown in the last example), even though only one additional fact is needed to answer the question, the system must apply the core “general” science fact to a “specific” situation.

Baseline Models

We evaluate the performance of several baseline systems on the Dev and Test subsets of OpenBookQA. For each question, a solver receives 1 point towards this score if it chooses the correct answer, and $1/k$ if it reports a $k$-way tie that includes the correct answer. The “Guess All” baseline, which always outputs a 4-way tie, thus achieves a score of 25%, the same as the expected performance of a uniform random baseline.
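The scoring rule above can be sketched as follows. This is a minimal illustration, not the evaluation script released with the dataset; the function names `question_score` and `accuracy` and the toy answer labels are assumptions made for the example.

```python
from fractions import Fraction


def question_score(predicted, correct):
    """Score one question.

    `predicted` is the set of choices the solver reports (more than one
    element means a tie); `correct` is the gold choice label.
    Returns 1 for a unique correct pick, 1/k for a k-way tie that
    includes the correct answer, and 0 otherwise.
    """
    predicted = set(predicted)
    if correct not in predicted:
        return Fraction(0)
    return Fraction(1, len(predicted))


def accuracy(predictions, gold):
    """Average per-question score over a (hypothetical) question set."""
    total = sum(question_score(p, g) for p, g in zip(predictions, gold))
    return float(total / len(gold))


# "Guess All": always report a 4-way tie over toy labels A-D.
gold = ["A", "C", "B", "D"]
guess_all = [{"A", "B", "C", "D"}] * len(gold)
print(accuracy(guess_all, gold))  # 0.25
```

Using exact fractions keeps tie credit such as 1/3 from accumulating rounding error before the final average is taken.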

Since OpenBookQA is a set of elementary level science questions, one natural baseline category is existing systems that have proven to be effective on elementary- and middle-school level science exams. These pre-trained systems, however, rely only on their background knowledge and do not take the set $\mathcal{F}$ of core facts into account. Further, their knowledge sources and retrieval mechanism are close to those used by the IR solver that, by design, is guaranteed to fail on OpenBookQA. These two aspects place a natural limit on the effectiveness of these solvers on OpenBookQA, despite their excellent fit for the domain of multiple-choice science questions. We consider four such solvers.

PMI Clark et al. (2016) uses pointwise mutual information (PMI) to score each answer choice using statistics based on a corpus of 280 GB of plain text. It extracts unigrams, bigrams, trigrams, and skip-bigrams from the question $q$ and each answer choice $c_{i}$ . Each answer choice is scored based on the average PMI across all pairs of question and answer n-grams.
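To make the PMI scoring idea concrete, the following toy sketch scores answer choices by the average PMI over question-term and answer-term pairs. It is only an illustration under strong assumptions: it uses unigrams only, treats each small "document" (a set of tokens) as the co-occurrence window, and assigns 0 to pairs that never co-occur. The corpus, the function names, and these simplifications are not taken from the original solver.

```python
import math
from itertools import product


def pmi(x, y, docs):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) ), estimated from
    document-level co-occurrence counts over `docs` (a list of token sets)."""
    n = len(docs)
    count_x = sum(x in d for d in docs)
    count_y = sum(y in d for d in docs)
    count_xy = sum(x in d and y in d for d in docs)
    if count_xy == 0:
        # Assumption for this sketch: unseen pairs are uninformative (0),
        # rather than negative infinity.
        return 0.0
    return math.log((count_xy / n) / ((count_x / n) * (count_y / n)))


def score_choice(question_terms, choice_terms, docs):
    """Average PMI across all (question term, choice term) pairs."""
    pairs = list(product(question_terms, choice_terms))
    if not pairs:
        return 0.0
    return sum(pmi(x, y, docs) for x, y in pairs) / len(pairs)


# Toy corpus (made up for illustration).
docs = [
    {"steel", "conducts", "heat"},
    {"steel", "heat"},
    {"wood", "floats"},
    {"wood", "burns"},
    {"water", "flows"},
]
scores = {c: score_choice(["heat"], [c], docs) for c in ["steel", "wood"]}
best = max(scores, key=scores.get)
```

In this toy corpus "steel" co-occurs with "heat" more often than chance would predict, so it receives the higher score.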

TableILP Khashabi et al. (2016) is an Integer Linear Programming (ILP) based reasoning system designed for science questions. It operates over semi-structured relational tables of knowledge. It scores each answer choice based on the optimal (as defined by the ILP objective) “support graph” connecting the question to that answer through table rows. The small set of these knowledge tables, however, often results in missing knowledge, making TableILP not answer 24% of the OpenBookQA questions at all.

TupleInference Khot et al. (2017), also an ILP-based QA system, uses Open IE tuples Banko et al. (2007) as its semi-structured representation. It builds these subject-verb-object tuples on-the-fly by retrieving text for each question from a large corpus. It then defines an ILP program to combine evidence from multiple tuples.

DGEM Khot et al. (2018) is a neural entailment model that also uses Open IE to produce a semi-structured representation. We use the adaptation of this model to multiple-choice question answering proposed by Clark et al. (2018), which works as follows: (1) convert $q$ and each $c_{i}$ into a hypothesis, $h_{i}$ , and each retrieved fact into a premise $p_{j}$ ; and (2) return the answer choice with the highest entailment score, $\operatorname*{arg\,max}_{i}e(p_{j},h_{i})$ .

4.2 No Training; $\mathcal{F}$ and Extr. Knowledge

We also consider providing the set $\mathcal{F}$ of core facts to two existing solvers: the IR solver of Clark et al. (2016) (to assess how far simple word-overlap can get), and the TupleInference solver.

4.3 Trained Models, No Knowledge

For each training instance, we build a feature representation $\vec{f}$ by concatenating these vectors and train an $L2$ logistic regression classifier:

BiLSTM Max-Out Baselines.

Given the contextual representations for each token sequence, we experiment with three configurations for using these representations for QA:

(a) Plausible Answer Detector.

This baseline goes to the extreme of completely ignoring $q$ and trying to learn how plausible it is for $c_{i}$ to be the correct answer to some question in this domain. This captures the fact that certain choices like ‘a magical place’ or ‘flying cats’ are highly unlikely to be the correct answer to a science question without negation (which is the case for OpenBookQA).

(b) Odd-One-Out Solver.

It considers all 4 answer options jointly and selects the one that is least similar to the others. This captures bias in human authored questions arising from the fact that creating good quality incorrect answers is difficult. Workers generally start with the correct answer, and then come up with three incorrect ones. The latter often tend to be homogeneous or share other common properties (e.g., non-scientific terms) uncharacteristic of the correct answer.
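The intuition behind the odd-one-out idea can be illustrated without any training. The sketch below is not the attention-based neural model evaluated here; it is an assumed, simplified heuristic that picks the choice whose vector has the lowest mean cosine similarity to the other choices. The toy two-dimensional vectors stand in for learned encodings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def odd_one_out(vectors):
    """Return the index of the choice least similar, on average, to the rest."""

    def mean_similarity(i):
        sims = [cosine(vectors[i], v) for j, v in enumerate(vectors) if j != i]
        return sum(sims) / len(sims)

    return min(range(len(vectors)), key=mean_similarity)


# Three homogeneous distractors and one outlier (made-up encodings).
choice_vectors = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]]
picked = odd_one_out(choice_vectors)
```

A trained model can learn which kinds of dissimilarity are predictive; this heuristic only shows why homogeneous distractors leak information about the correct answer.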

(c) Question Match.

4.4 Trained Model with External Knowledge

Lastly, we implement a two stage model for incorporating external common knowledge, $K$ . The first module performs information retrieval on $K$ to select a fixed size subset of potentially relevant facts $K_{Q,C}$ for each instance in the dataset (see Appendix A). The second module is a neural network that takes ( $Q$ , $C$ , $K_{Q,C}$ ) as input to predict the answer $a_{q,c}$ to a question $Q$ from the set of choices $C$ .

Baseline Performance

The results for various baseline models are summarized in Table 4, grouped by method category. We make a few observations:

The second group shows that pre-trained state-of-the-art solvers for multiple-choice science questions perform poorly. One explanation is their correlation with the IR method used for question filtering, as mentioned in Section 4.1.

The third group of results suggests that adding $\mathcal{F}$ to pre-trained models has a mixed effect, improving TupleInference by 8.7% but not changing DGEM. (By design, IR with its default corpus gets 0% on OpenBookQA; hence we don’t consider the effect of adding $\mathcal{F}$ to IR, which appears artificially magnified.) Unlike DGEM, TupleInference relies on brittle word-overlap similarity measures very similar to the ones used by IR. Since IR (KB) gets 0% by design, TupleInference (KB) also has poor performance and adding $\mathcal{F}$ helps it find better support despite the brittle measures.

The fourth group demonstrates that carefully designed trainable neural models, even if simplistic and knowledge-free, can be surprisingly powerful. For example, the “plausible answer detector” can predict the correct answer with 49.6% accuracy without even looking at the question. The “odd-one-out” solver, by considering other answer choices, raises this to 50.2%. The “question match” solver, which simply compares the BiLSTM max-out encoding of the question with that of various answer choices, also achieves 50.2%. (This model also achieves the current best score, 33.87%, on the ARC Reasoning Challenge Clark et al. (2018). When adapted for the textual entailment task by comparing BiLSTM max-out encodings of premise and hypothesis, it achieves 85% on the SciTail dataset Khot et al. (2018).) Similar findings have been reported for several recent datasets Gururangan et al. (2018), making it imperative to perform such tests early.

Interestingly, all of these neural knowledge-free baselines simultaneously succeed on 34.4% of the Dev questions, and simultaneously fail on 23.6%. For Question Match and ESIM we also experiment with ELMo Peters et al. (2018), which improved their scores on Test by 0.4% and 1.8%, respectively.

The final group demonstrates the need for external knowledge and deeper reasoning. When the “oracle” science fact $f$ used by the question author is provided to the knowledge-enhanced reader, it improves over the knowledge-less models by about 5%. However, there is still a large gap, showing that the core fact is insufficient to answer the question. When we also include facts retrieved from WordNet Miller et al. (1990), the score improves by about 0.5%. Unlike the WordNet gain, adding ConceptNet Speer et al. (2017) introduces a distraction and reduces the score. This suggests that ConceptNet is either not a good source of knowledge for our task, or only a subset of its relations should be considered. Overall, external knowledge helps, although retrieving the right bits of knowledge remains difficult. In the last row of Table 4, we use the oracle core fact along with the question author’s interpretation of the additional fact $k$. This increases the scores substantially, to about 76%. This big jump shows that improved knowledge retrieval should help on this task. At the same time, we are still not close to the human performance level of 92%, for several reasons: (a) the additional fact needed can be subjective, as hinted at by our earlier analysis; (b) the authored facts $\mathcal{K}$ tend to be noisy (incomplete, over-complete, or only distantly related), also as mentioned earlier; and (c) even given the true gold facts, performing reliable “reasoning” to link them properly remains a challenge.

Sample predictions and analysis of questions from Dev are provided in Appendix D.

Conclusion

We present a new dataset, OpenBookQA, of about 6000 questions for open book question answering. The task focuses on the challenge of combining a corpus of provided science facts (open book) with external broad common knowledge. We show that this dataset requires simple common knowledge beyond the provided core facts, as well as multi-hop reasoning combining the two. While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task. We leave closing this gap for future research, and illustrate, via oracle-style experiments, the potential of better retrieval and reasoning on this task.

Acknowledgments

The authors would like to thank Lane Aasen for helping develop the infrastructure for the crowdsourcing task, and Madeleine van Zuylen for providing expert annotation for the Dev and Test questions.

Appendix A Knowledge Retrieval Module

This module is the first part of a two stage model for incorporating knowledge from an external source $K$ . For each instance $(q,C)$ in the dataset, where $q$ is a question and $C=\{c_{1},\ldots,c_{4}\}$ a set of answer choices, it performs information retrieval (IR) on $K$ to select a fixed size subset $K_{q,C}$ of potentially relevant facts. The second module is a neural network that takes $(q,C,K_{q,C})$ as input, and predicts the answer $a_{q,C}$ .
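As a rough illustration of the first stage, the sketch below ranks facts by word overlap with the question and its choices and keeps a fixed number of them. The overlap-based ranking, the tokenizer, the function names, and the `top_n` parameter are all assumptions made for this example; they are a stand-in for, not a description of, the retrieval scoring used in our experiments.

```python
import re


def tokens(text):
    """Lowercased alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(question, choices, facts, top_n=2):
    """Return the `top_n` facts sharing the most tokens with the question
    and its answer choices (ties keep the original fact order)."""
    query = tokens(question).union(*(tokens(c) for c in choices))
    ranked = sorted(facts, key=lambda f: len(tokens(f) & query), reverse=True)
    return ranked[:top_n]


# Toy knowledge source (made up for illustration).
facts = [
    "metals conduct electricity",
    "a suit of armor is made of metal",
    "clouds produce rain",
]
selected = retrieve(
    "Can a suit of armor conduct electricity?", ["yes", "no"], facts
)
```

No stop-word removal or term weighting is applied here, so function words such as "a" and "of" count toward the overlap; a practical retriever would normally down-weight them.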

For experimentation with knowledge, we consider the ‘open book’ set of facts $\mathcal{F}$ in conjunction with two sources of common knowledge: the Open Mind Common Sense Singh et al. (2002) part of ConceptNet Speer et al. (2017), and its WordNet Miller (1995) subset.

Appendix B Implementation and Training

Our neural models are implemented with AllenNLP (https://allennlp.org) Gardner et al. (2017) and PyTorch (https://pytorch.org) (Paszke et al., 2017). We use cross-entropy loss and the Adam optimizer Kingma and Ba (2015) with initial learning rate 0.001. For the neural models without external knowledge, we typically train the model with a maximum of 30 epochs and stop training early if the Dev set accuracy does not improve for 10 consecutive epochs. We also halve the learning rate if there is no Dev set improvement for 5 epochs. For the neural models with external knowledge, we typically train for 60 epochs with a patience of 20 epochs. For most of our neural models, we use $h=128$ as the LSTM hidden layer size. The embedding dropout rate is chosen from $\{0.1,0.2,0.5\}$, again based on the best Dev set performance.
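The early-stopping and learning-rate schedule for the knowledge-free models can be sketched in plain Python as below. This is a framework-independent illustration, not the AllenNLP configuration used in our experiments: the function name, the list of per-epoch Dev accuracies standing in for real training, and the choice to halve the rate at every multiple of 5 non-improving epochs are assumptions.

```python
def run_schedule(dev_accuracies, lr=0.001, max_epochs=30,
                 patience=10, lr_patience=5):
    """Replay a sequence of per-epoch Dev accuracies and report the best
    accuracy, the final learning rate, and the (epoch, lr) history."""
    best = float("-inf")
    epochs_since_best = 0
    history = []
    for epoch, acc in enumerate(dev_accuracies[:max_epochs], start=1):
        if acc > best:
            best, epochs_since_best = acc, 0
        else:
            epochs_since_best += 1
            # Assumed reading: halve after every 5 epochs without improvement.
            if epochs_since_best % lr_patience == 0:
                lr /= 2
        history.append((epoch, lr))
        if epochs_since_best >= patience:
            break  # early stopping
    return best, lr, history


# Made-up Dev curve: improves twice, then plateaus.
curve = [0.40, 0.45] + [0.44] * 20
best_acc, final_lr, history = run_schedule(curve)
```

For the models with external knowledge, the same sketch applies with `max_epochs=60` and `patience=20`.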

For each model configuration, we perform 5 experiments with different random seeds. For each run, we take the model with the best performance on Dev and evaluate on Test. We report the average accuracy for the best Dev score and the average of the corresponding Test score $\pm$ the standard deviation across the 5 random seeds.
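The reporting procedure can be sketched as follows. The accuracy values are made up for illustration, the function name is hypothetical, and the use of the sample standard deviation (`statistics.stdev`) rather than the population version is an assumption.

```python
from statistics import mean, stdev


def summarize(runs):
    """`runs` holds one list per random seed, each a list of
    (dev_accuracy, test_accuracy) pairs, one pair per epoch.

    For each seed, pick the epoch with the best Dev accuracy, then report
    the mean best Dev accuracy, the mean corresponding Test accuracy, and
    the standard deviation of those Test accuracies across seeds."""
    picked = [max(run, key=lambda pair: pair[0]) for run in runs]
    dev_scores = [dev for dev, _ in picked]
    test_scores = [test for _, test in picked]
    return mean(dev_scores), mean(test_scores), stdev(test_scores)


# Five seeds with two epochs each (illustrative numbers only).
runs = [
    [(50.0, 48.0), (52.0, 49.0)],
    [(53.0, 51.0), (51.0, 52.0)],
    [(49.0, 47.0), (54.0, 50.0)],
    [(52.0, 50.0), (52.5, 51.0)],
    [(51.0, 49.0), (50.0, 53.0)],
]
dev_mean, test_mean, test_std = summarize(runs)
```

Note that the Test score is taken at the epoch chosen on Dev, so a higher Test score at another epoch (as in the second and fifth toy runs) is deliberately ignored.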

The code for the models and the configuration files required for reproducing the results are available at http://data.allenai.org/OpenBookQA.

Appendix C Additional Experiments

We also perform experiments with the Question Match system on the Challenge (hard) set of the AI2 Reasoning Challenge or ARC Clark et al. (2018). We train several models with different LSTM hidden sizes (128, 256, 384 (best), 512), and dropout of the embedding layer (0.0 (best), 0.2, 0.5) on the questions from the Challenge Train set and take the model that has the highest accuracy on the Dev set. The resulting system scores 33.87% on the Challenge Test set, which is 2.17% higher than the previous best score by Zhang et al. (2018). The code and model configuration are available at https://github.com/allenai/ARC-Solvers.

C.2 Textual Entailment: SciTail

We perform textual entailment experiments on the Science enTailment dataset SciTail Khot et al. (2018). We change the Question Match model to a classic BiLSTM Max-Out Conneau et al. (2017) for textual entailment, by replacing the question $q$ and a choice $c_{i}$ with the premise $p$ and the hypothesis $h$, resp., and perform binary classification on the entailment labels (Entail, Neutral). We run experiments with BiLSTM encoders with LSTM hidden size of 384 and share the encoder parameters between the premise and the hypothesis. Without additional hyper-parameter tuning, this yields entailment accuracy scores of 87.9% and 85.4% on the Dev and Test sets, respectively.

Appendix D Success and Failure Examples

We give some examples of questions that were answered correctly/incorrectly by various groups of models. We include here the first three questions in each case.

We begin with three examples of questions that all neural models without external knowledge (namely Question Match, Plausible Answer, Odd-One-Out, and ESIM from the fourth group in Table 5) predicted correctly.

In these examples, we observe that the correct answer usually contains a word that is semantically closer (than words in other answer choices) to an important word from the question: pores to body; non-renewable (negative sentiment) to gone, finished (also negative sentiment); iron to magma (liquid rock).

D.2 Neural Baseline Failures, Oracle Success

Table 6 shows example questions (with the Oracle facts) from the Dev set that were predicted correctly by the $f+k$ Oracle model (405/500) but incorrectly by all of the 4 neural models without knowledge (69/405). In contrast to Table 5, a simple semantic similarity is insufficient. The questions require chaining of multiple facts in order to arrive at the correct answer.

D.3 Neural Baseline and Oracle Failures

42/500 questions in the Dev set were predicted incorrectly by all models without external knowledge, as well as by the Oracle $f+k$ model. In Table 7 we show 3 such questions. In all cases, the Oracle $f+k$ model made an incorrect prediction with confidence higher than 0.9.

As noted earlier, there are several broad reasons why even this so-called oracle model fails on certain questions in OpenBookQA. In some cases, the core fact $f$ associated with a question $q$ isn’t actually helpful in answering $q$ . In many other cases, the corresponding second fact $k$ is noisy, incomplete, or only distantly related to $q$ . Finally, even if $f$ and $k$ are sufficient to answer $q$ , it is quite possible for this simple model to be unable to perform the reasoning that’s necessary to combine these two pieces of textual information in order to arrive at the correct answer.

In the shown examples, the first question falls outside the domain of Science where most of the core facts come from. The scientific fact “( $f$ ) An example of collecting data is measuring” is transformed into a question related to the law and judicial domain of collecting data for a (court) case. This is an indication that the model trained on the Train set does not perform well on distant domains, even if the core facts are provided.

In the second question, we have an option all of these. Indeed, the selected answer seems the most relevant (a generalized version of the other two), but the model did not know that if we have an option all of these and all answers are plausible, it should decide if all answers are correct and not pick the “most likely” individual answer.

The third question again requires the model to select a special type of aggregate answer (“all but xyz”), but the related Oracle facts are pointing to a specific answer.