Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, Ashwin Kalyan

Introduction

A long-standing goal of AI systems is to act reliably and learn complex tasks efficiently like human beings. In the process of reliable decision making, humans follow an explicit chain-of-thought (CoT) reasoning process that is typically expressed as an explanation. However, machine learning models are trained mostly using a large number of input-output examples to perform a specific task. These black-box models only generate the final decision without reliably revealing the underlying reasoning process. Not surprisingly, it is unclear if they understand the task and can generalize even though they perform well on the benchmark. On the other hand, humans are able to learn from instructions or explanations from past experience and generalize them to novel and unseen problems. This helps them learn more quickly with fewer data. In this work, we explore if machines can be endowed with such reasoning abilities in the context of science-based question answering.

Recently, science problem solving benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of AI systems. To answer science questions, a model needs not only to understand multimodal contents but also to extract external knowledge to arrive at the correct answer. Since these tasks require domain-specific knowledge and explicit multi-hop reasoning, a model is not interpretable if it fails to provide explanations that reveal its reasoning process. However, current science question datasets mostly lack annotated explanations for the answers. Other science datasets do annotate explanations, but they are restricted to the text-only modality and limited to small data scales or a small set of topics. Therefore, we collect Science Question Answering (ScienceQA), a large-scale multi-choice dataset that contains multimodal science questions with explanations and features rich domain diversity.

ScienceQA is collected from elementary and high school science curricula, and contains 21,208 examples along with lectures and explanations. Different from existing datasets, ScienceQA has richer domain diversity from three different subjects: natural science, social science, and language science. A typical example consists of a question, multiple choices, multimodal contexts, a correct answer, as well as a lecture and an explanation. The lecture and explanation provide general external knowledge and specific reasons, respectively, for arriving at the correct answer.

Consider the thoughts one person might have when answering the question in Figure 1. One first recalls the knowledge regarding the definition of a force learned from textbooks: “A force is a push or a pull that … The direction of a push is … The direction of a pull is …”, then forms a line of reasoning: “The baby’s hand applies a force to the cabinet door. → This force causes the door to open. → The direction of this force is toward the baby’s hand.”, and finally arrives at the correct answer: “This force is a pull.” Following , we formulate the task to output a natural explanation alongside the predicted answer. In this paper, we train language models to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process to answer ScienceQA questions.

Our experiments show that current multimodal methods fail to achieve satisfactory performance on ScienceQA and do not generate correct explanations. Instead, we find that CoT can help large language models not only in the few-shot learning setting but also in the fine-tuning setting. When combined with CoT to generate the lecture and explanation, the fine-tuned UnifiedQA achieves an improvement of 3.99% as opposed to not using CoT in the fine-tuning stage. The few-shot GPT-3 model via chain-of-thought prompting can obtain 75.17% on ScienceQA with an improvement of 1.20% compared to the few-shot GPT-3 without CoT. Prompted with CoT, GPT-3 can generate reasonable explanations as evaluated by automated metrics, and promisingly, 65.2% of explanations meet the gold standard of human evaluations. We also investigate the upper bound for models to harness explanations by including them in the input. We find that doing so improves GPT-3’s few-shot performance by 18.96%, suggesting that explanations do aid models and are currently underutilized in the CoT framework. Further analysis shows that, like humans, language models benefit from explanations to learn with less data: UnifiedQA with CoT obtains the same results as UnifiedQA without CoT with only 40% of the training data.

To sum up, our contributions are three-fold: (a) To bridge the gap in existing datasets in the scientific domain, we build Science Question Answering (ScienceQA), a new dataset containing 21,208 multimodal science questions with rich domain diversity. To the best of our knowledge, ScienceQA is the first large-scale multimodal dataset that annotates lectures and explanations for the answers. (b) We show that CoT benefits large language models in both few-shot and fine-tuning learning by improving model performance and reliability via generating explanations. (c) We further explore the upper bound of GPT-3 and show that CoT helps language models learn from fewer data.

Related Work

Visual question answering. Since the task of visual question answering (VQA) was first proposed in , plenty of VQA datasets have been constructed to facilitate research. Although our ScienceQA dataset shares some features with VQA, there are several main differences between them. First, ScienceQA is more challenging than existing VQA datasets because it contains multimodal contexts and diverse topics in the scientific domain. In addition, most answers are annotated with lectures and explanations, which makes ScienceQA a suitable dataset for multi-modal question answering and multi-hop reasoning for AI systems. Inspired by the recent remarkable performance achieved for VQA, in this paper, we further extensively benchmark ScienceQA with a wide range of attention-based and Transformer-based methods.

Datasets for science problems. Science problem solving is a challenging task that requires an AI system not only to understand the multimodal information from the science curriculum but also to reason about how to answer the domain-specific questions. Current science problem datasets such as AI2D, DVQA, VLQA, and FOODWEBS have contributed to multimodal reasoning in the scientific domain. For example, a portion of VLQA contains multimodal questions on science subjects. These datasets, however, lack annotated explanations for the answers to reveal the reasoning steps. Some other datasets annotate the answers in the forms of supporting facts, entailment trees, explanation graphs, and reasoning chains. However, these datasets are restricted to the single text modality with small data scales and limited topics. Instead, our ScienceQA annotates the answers with grounded lectures and explanations. Besides, ScienceQA features a richer domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills.

Learning from explanations and few-shot learning. Explanations help humans understand a task better, and there have been several attempts to show the same for models. For example, the learning from instruction paradigm , where the task level explanation is provided in the form of instruction, improves model performance significantly. An example of learning from explanations in the scientific domain is proposed in where the model interprets demonstrative solutions to solve geometry problems. Recently, there has been a surge of interest in few-shot learning, where language models learn a specific task from a few examples . For instance, find that explanations in the format of the chain of thought can improve language models’ reasoning ability in few-shot learning. In this paper, we show that the chain of thought boosts the performance of large language models like UnifiedQA if the models generate explanations along with the answer in a fine-tuning way. Furthermore, a few-shot GPT-3 model via chain-of-thought prompting is able to improve the reasoning performance on ScienceQA and generate reasonable explanations.

Dataset

We collect ScienceQA, which is a multimodal multiple-choice science question dataset containing 21,208 examples. An example in ScienceQA is shown in Figure 1. Given the science question and multimodal contexts, the task is to select the correct answer from multiple options. Different from existing datasets , ScienceQA covers diverse topics across three subjects: natural science, social science, and language science. Moreover, most questions are annotated with grounded lectures and detailed explanations. The lecture provides general knowledge that introduces the background information for solving problems of a similar class. The explanation reveals a specific reason for the answer. To effectively answer the questions, a model often needs to be able to understand the multimodal content in the input and extract external knowledge, similar to how humans do. More importantly, the goal of ScienceQA is to aid development of a reliable model that is capable of generating a coherent chain of thought when arriving at the correct answer to reveal the multi-step reasoning process. For data collection details, see Appendix A.1.

Key statistics. We randomly split the dataset into training, validation, and test splits with a ratio of 60:20:20. The splits contain 12,726, 4,241, and 4,241 examples, respectively. Table 1 shows the main statistics of ScienceQA. ScienceQA has a large set of different questions, totaling up to 9,122. Out of the 21,208 questions in ScienceQA, 10,332 (48.7%) have an image context, 10,220 (48.2%) have a text context, and 6,532 (30.8%) have both. 83.9% of the questions are annotated with a lecture, while 90.5% of the questions feature an explanation. The cross-combination of these information sources diversifies the problem scenario: sometimes the model is given a lot of information from multiple sources, while at other times, the only source of information is the question itself. This level of complexity is very common in grade-level science exams.
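
The reported split sizes can be reproduced with simple arithmetic. The sketch below is illustrative only: the function name `split_sizes` is hypothetical, and the rounding rule (round the validation and test sizes down, give the remainder to training) is an assumption that happens to match the reported numbers, not a documented detail of the actual split procedure.

```python
def split_sizes(total: int, val_frac: float = 0.2, test_frac: float = 0.2) -> tuple[int, int, int]:
    """Return (train, val, test) sizes for a 60:20:20 style split.

    Assumption: validation and test sizes are rounded down and the
    training split receives whatever is left.
    """
    val = int(total * val_frac)
    test = int(total * test_frac)
    train = total - val - test
    return train, val, test


sizes = split_sizes(21208)
print(sizes)
```

Under this assumed rounding rule, the three sizes sum to the full 21,208 examples.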

Question analysis. ScienceQA has a diverse set of science questions. Figure 2 shows a distribution of the first four words in the question text. A large number of question lengths and formats highlight the diversity of ScienceQA. The question lengths range from 3 words to 141 words, and the questions in ScienceQA have an average length of 12.11 words. The question length distribution is visualized against other VQA datasets in Figure 3 (a). As shown in the diagram, ScienceQA’s distribution is flatter than other datasets, spanning more evenly across different question lengths.

Context analysis. Figure 3 (b) shows the number and percentage of questions with either an image context, a text context, or both. There are a total of 7,803 unique image contexts and 4,651 unique text contexts. 66.11% of the questions have at least one type of context information. The image context is in the format of diagrams or natural images, which visualize the critical scenario necessary for question answering or simply illustrate the question for better understanding. Similarly, the textual context can provide either semantically rich information or a simple hint to the question. Therefore, models need to be flexible and general to understand these diverse types of contexts.
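
The 66.11% figure follows from the counts in the key statistics by inclusion-exclusion: questions with an image context plus questions with a text context, minus those counted twice because they have both. A short check, with variable names chosen here for illustration:

```python
TOTAL = 21208       # all ScienceQA questions
with_image = 10332  # questions that have an image context
with_text = 10220   # questions that have a text context
with_both = 6532    # questions that have both context types

# Inclusion-exclusion: do not double count questions with both contexts.
with_any = with_image + with_text - with_both
share_any = round(100 * with_any / TOTAL, 2)

print(with_any, share_any)
```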

Domain diversity. Each ScienceQA question belongs to one of the three subjects: natural science, language science, and social science. Within each subject, questions are categorized first by the topic (Biology, Physics, Chemistry, etc.), then by the category (Plants, Cells, Animals, etc.), and finally by the specific skill (Classify fruits and vegetables as plant parts, Identify countries of Africa, etc.). ScienceQA has a total of 26 topics, 127 categories, and 379 skills. The treemap in Figure 4 visualizes the different subjects, topics, and categories and shows that ScienceQA questions are very diverse, spanning a wide range of domains.

3.2 Comparisons with Existing Datasets

Table 2 shows a comparison of ScienceQA and other science problem datasets. As shown in the table, ScienceQA is much larger than most other datasets. ScienceQA also has the largest set of images, spans across all 12 grades, contains the longest questions, and has the most diverse input sources. As opposed to limiting the subject to only natural science, ScienceQA also includes social science and language science, largely adding to the domain diversity of the dataset. Furthermore, most of the questions in ScienceQA are annotated with textual lectures (83.9%) and explanations (90.5%), which reveal the reasoning path to the correct answer. To the best of our knowledge, ScienceQA is the first large-scale multimodal science question dataset that annotates the answers with detailed lectures and explanations.

Baselines and Chain-of-Thought Models

In this section, we establish baselines and develop two chain-of-thought models on ScienceQA.

Heuristic baselines. The first heuristic baseline is random chance: we randomly select one from the multiple options. Each trial is completed on the whole test set, and we take three different trials for an average result. The second heuristic baseline is human performance. We post the task to Amazon Mechanical Turk and ask workers to answer ScienceQA questions. Only workers who obtain a high school or higher degree and pass the qualification examples are qualified for the study. Each worker needs to answer a set of 10 test questions, and each question is answered by three different workers. For more details of the human performance study, see Appendix B.2.
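
A minimal sketch of the random chance baseline, assuming each example is a dictionary with a `choices` list and an integer `answer` index (these field names and the fixed seed are illustrative assumptions, not the released code):

```python
import random


def random_chance_accuracy(examples, trials=3, seed=0):
    """Average accuracy of uniformly random guessing over several trials."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(trials):
        # One trial: guess an option index for every example in the set.
        correct = sum(
            rng.randrange(len(ex["choices"])) == ex["answer"] for ex in examples
        )
        accuracies.append(correct / len(examples))
    return sum(accuracies) / trials


# Toy data for illustration only.
toy_set = [
    {"choices": ["push", "pull"], "answer": 1},
    {"choices": ["solid", "liquid", "gas"], "answer": 0},
]
accuracy = random_chance_accuracy(toy_set)
```

Because the number of options varies per question, the expected accuracy of this baseline is the mean of one over the option count, not a single fixed value.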

Zero-shot and few-shot baselines. We establish the zero-shot baselines on top of UnifiedQA and GPT-3. The zero-shot setup follows the format of QCM→A, where the input is the concatenation of tokens of the question text (Q), the context text (C), and multiple options (M), while the output is to predict the answer (A) from the option set. For the visual context, we generate a caption for the image with a captioning model based on ViT and GPT-2. In the few-shot setting, we follow the standard prompting where in-context examples from the training set are concatenated before the test instance. These in-context examples serve as an instruction for the language model to adjust to the specific task in ScienceQA.
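
The QCM input format can be sketched as a simple string template. The field labels ("Question:", "Context:", "Options:", "Answer:"), the function name `build_qcm_input`, and the example question are assumptions made for illustration; the exact template is not specified in this section.

```python
LETTERS = "ABCDE"  # ScienceQA questions have between two and five options


def build_qcm_input(question: str, context: str, options: list[str]) -> str:
    """Concatenate question (Q), context (C), and options (M) into one input."""
    option_text = " ".join(f"({LETTERS[i]}) {opt}" for i, opt in enumerate(options))
    return (
        f"Question: {question}\n"
        f"Context: {context}\n"
        f"Options: {option_text}\n"
        "Answer:"
    )


# Made-up example; the context would hold the text hint and the image caption.
prompt = build_qcm_input(
    "Which type of force opens the cabinet door?",
    "A baby pulls on a cabinet door handle.",
    ["push", "pull"],
)
print(prompt)
```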

Fine-tuning baselines. We first consider the fine-tuning baselines from VQA models proposed in recent years. These VQA baselines take the question, the context, and choices as the textual input, take the image as the visual input, and predict the score distribution over choice candidates via a linear classifier. In addition, we build the fine-tuning baseline on top of the large language model UnifiedQA. UnifiedQA takes the textual information as the input and outputs the answer option. Similarly, the image is converted into a caption that provides the visual semantics for the language model.

4.2 Language Models with the Chain of Thought

A chain of thought refers to a coherent flow of sentences that reveals the premises and conclusion of a reasoning problem . A chain of thought clearly decomposes a multi-hop reasoning task into intermediate steps instead of solving the task in a black-box way. The chain of thought can be the step-by-step thought process before arriving at the final answer or explanations that come after the answer. The annotated lectures and explanations in ScienceQA serve as demonstrations of the chain of thought that mimics the multi-step reasoning steps of human beings. In this paper, we study if large language models can generate reasonable explanations as the chain of thought to reveal the thought process when answering ScienceQA questions. Further, we explore how the chain of thought can improve the reasoning ability of language models on ScienceQA in both few-shot and fine-tuning learning.

UnifiedQA with the chain of thought. UnifiedQA is a state-of-the-art model for multi-option question answering. The original architecture of UnifiedQA takes the question and options as the input and outputs a short phrase as the final answer. We make a format modification to develop UnifiedQA with the chain of thought (CoT), i.e., UnifiedQA is fine-tuned to generate a long sequence of text which consists of the answer followed by the lecture and explanation.

GPT-3 via chain-of-thought prompting. Recent research work has shown that GPT-3 can perform various tasks when provided in-context examples in a standard prompt. Taking multi-option question answering as an example, the standard prompt builds instructions using in-context examples with components of the question text, options, and the correct answer text. This style of few-shot learning enables the GPT-3 model to answer specific questions without parameter updates. Different from standard prompting, we build GPT-3 via chain-of-thought (CoT) prompting, as shown in Figure 5. To be specific, for each test problem $t$, we map the prompt instruction $I: \{I_i\}_n, I_t$ into a textual format, where $\{I_i\}_n$ refers to the instruction set of $n$-shot in-context examples from the training set, while $I_t$ denotes the test instruction. Instead of placing the explanation before the answer, we feed the instruction $I$ into the autoregressive language model GPT-3 to generate the answer $a$ followed by the lecture $lect$ and explanation $exp$: $M: \{I_i\}_n, I_t \rightarrow a, lect, exp$.
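
To make the mapping from in-context examples and a test instruction to a single prompt concrete, the sketch below assembles an n-shot chain-of-thought prompt in which each in-context example shows the answer first, followed by the lecture and explanation. The template wording ("The answer is ... BECAUSE: ..."), the dictionary field names, and the function names are illustrative assumptions; Figure 5 defines the actual format.

```python
LETTERS = "ABCDE"


def format_example(ex: dict, include_output: bool = True) -> str:
    """Render one instruction; in-context examples also show the output."""
    options = " ".join(f"({LETTERS[i]}) {o}" for i, o in enumerate(ex["choices"]))
    text = (
        f"Question: {ex['question']}\n"
        f"Context: {ex['context']}\n"
        f"Options: {options}\n"
        "Answer:"
    )
    if include_output:
        # Answer first, then lecture and explanation (the ALE ordering).
        text += (
            f" The answer is {LETTERS[ex['answer']]}."
            f" BECAUSE: {ex['lecture']} {ex['explanation']}"
        )
    return text


def build_cot_prompt(train_examples: list[dict], test_example: dict) -> str:
    """Concatenate n in-context examples, then the unanswered test instruction."""
    parts = [format_example(ex) for ex in train_examples]
    parts.append(format_example(test_example, include_output=False))
    return "\n\n".join(parts)
```

The prompt ends with an open "Answer:" slot, so the model is expected to continue with the answer and then the lecture and explanation, mirroring the in-context examples.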

Experiments

Evaluation metrics. The heuristics and VQA baselines treat our ScienceQA task as a multi-class classification problem with multiple options and are evaluated by accuracy. UnifiedQA and GPT-3 treat ScienceQA as a text generation problem, so the option most similar to the generated text is selected as the final prediction when evaluating question answering accuracy. The generated lectures and explanations are evaluated by automatic metrics and by human annotators.
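
Mapping generated text back to an option requires some similarity measure, which this section does not name. The sketch below uses the character-level ratio from Python's standard `difflib` module purely as a stand-in; the function name `pick_option` is hypothetical.

```python
from difflib import SequenceMatcher


def pick_option(generated: str, options: list[str]) -> int:
    """Return the index of the option most similar to the generated text.

    Assumption: similarity is the difflib ratio on lowercased strings;
    the measure used in the actual experiments may differ.
    """
    scores = [
        SequenceMatcher(None, generated.lower(), opt.lower()).ratio()
        for opt in options
    ]
    return max(range(len(options)), key=scores.__getitem__)


prediction = pick_option("Pull", ["push", "pull"])
```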

Implementation details. The VQA baselines are trained for a maximum of 50 epochs with a learning rate of 5e-5. We fine-tune UnifiedQA for 50k iterations and evaluate every 1k iterations. The training process is stopped following the early stopping strategy with a patience period of three evaluations. For GPT-3, we use the text-davinci-002 engine, which is the most capable model version suggested in the official documentation. More details can be found in Appendix B.1.
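
Early stopping with a patience of three evaluations can be sketched as follows. The function operates on a precomputed list of validation scores so that it stays self-contained; the name `early_stopping` and the strict-improvement criterion are assumptions, not details taken from the training code.

```python
def early_stopping(eval_scores, patience=3):
    """Return (best_eval, best_score, stopped_at) for a sequence of evaluations.

    Training stops once `patience` consecutive evaluations fail to
    improve on the best score seen so far.
    """
    best_score = float("-inf")
    best_eval = None
    stale = 0
    stopped_at = 0
    for stopped_at, score in enumerate(eval_scores, start=1):
        if score > best_score:
            best_score, best_eval, stale = score, stopped_at, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_eval, best_score, stopped_at


# Made-up validation accuracies, one per 1k iterations.
result = early_stopping([0.60, 0.65, 0.64, 0.63, 0.62, 0.70])
```

In this toy run the best score occurs at the second evaluation and training halts at the fifth, so the later improvement at the sixth evaluation is never reached. That is the usual trade-off of a short patience window.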

5.2 Results for Question Answering

Table 3 demonstrates the empirical results for Science Question Answering.

VQA baselines. We feed the VQA baseline models with the input of QCM format to predict answers A. Out of all the VQA models we benchmarked, VisualBERT performs the best on average (61.87%). Interestingly, Patch-TRM beats VisualBERT in natural science (NAT) and language science (LAN), and it also performs better in higher-grade questions (67.50% vs. 59.92%). However, in the subject of social science (SOC), VisualBERT outperforms Patch-TRM by a large margin (+22.39%). Such drastic changes in performance might imply that current VQA models do not generalize well to the challenging questions in ScienceQA.

Language models. We evaluate whether large-scale pretraining on text can help language models learn scientific knowledge and thus perform better on the ScienceQA task. For this purpose, we have tried two of the state-of-the-art pre-trained language models: UnifiedQA and GPT-3.

(i) UnifiedQA. The results show that without any supervised fine-tuning (zero-shot), UnifiedQA cannot beat any VQA baseline model, while the pretraining does help the model obtain some scientific knowledge to outperform the random baseline. When fine-tuned with the answer labels in ScienceQA, UnifiedQA$_{\text{BASE}}$ reports an accuracy of 70.12% on average. By further teaching the model to generate the answer along with lecture and explanation, the developed language model with chain-of-thought (UnifiedQA$_{\text{BASE}}$ (CoT)) brings additional improvements of +3.21% (QCM→AE) and +3.99% (QCM→ALE). These results show that generating the chain of thought along with the answer benefits the reasoning ability of language models.

(ii) GPT-3. The positive effect of pretraining is also shown by the surprisingly good results from GPT-3 in the same zero-shot setting as UnifiedQA. Without any fine-tuning, GPT-3 already reaches almost the best performance we can get. Interestingly, prompting GPT-3 with two training examples with only answers results in a negligible difference. However, if we prompt GPT-3 with chain-of-thought prompting (QCM→ALE), we obtain the state-of-the-art result so far (75.17%).

Human performance. Humans outperform all benchmarks consistently across question classes, context types, and grades, e.g., a 20.07% gap for questions with the image context (IMG) between humans and our best performing model. The gap is to be filled by future research on multimodal reasoning for scientific question answering.

5.3 Results for Generated Explanations

One prediction example of GPT-3 (CoT) is visualized in Figure 6. We can see that GPT-3 (CoT) predicts the correct answer and generates a reasonable lecture and explanation to mimic the human thought process. We further report automatic metrics (BLEU-1/4, ROUGE-L, and sentence Similarity) to evaluate the generated lectures and explanations, as shown in Table 4. The Similarity metric computes the cosine similarity of semantic embeddings between two sentences based on the Sentence-BERT network. The results show that UnifiedQA$_{\text{BASE}}$ (CoT) generates the most similar explanations to the given ones. However, it is commonly agreed that automatic evaluation of generated texts only provides a partial view and has to be complemented by a human study. By asking annotators to rate the relevance, correctness, and completeness of generated explanations, we find that the explanations generated by GPT-3 (CoT) conform best to human judgment.
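
For reference, the cosine similarity underlying the Similarity metric is the dot product of two embedding vectors divided by the product of their norms. The sketch below works on plain Python lists; the toy vectors stand in for Sentence-BERT embeddings, which are not computed here.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


# Toy vectors standing in for the embeddings of a generated and a gold sentence.
score = cosine_similarity([1.0, 1.0], [1.0, 0.0])
```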

5.4 Analysis

Blind studies. Blind studies are conducted on modified versions of the full model, Top-Down. The results achieved in blind studies of Q only and CI only are close to random chance, showing that the ScienceQA dataset is robust and reliable in distribution. The performance drops in Q+M only, Q+CT+M only, and Q+CI+M only indicate that all input components provide critical information for answering ScienceQA questions.

Prompt types. We study the effect of prompt types and visualize the comparison in Figure 7 (a). It shows that prompting the GPT-3 model with both lectures and explanations (QCM→ALE) results in the highest accuracy on average and the smallest variance. In contrast, prompting with only explanations (QCM→AE) gives the largest variance, resulting in a less stable model.

Number of in-context examples. In Figure 7 (b), we further investigate how different numbers of training examples encoded in prompts can affect the prediction accuracy. The QCM→ALE prompt type outperforms or performs comparably to the QCM→A type with all numbers of examples. We observe the peak performance of QCM→ALE with 2 training examples in the prompt. After that, the accuracy goes down as more training examples are added.

Dynamic sampling. In Table 5, instead of random sampling, we try to dynamically select the in-context examples to prompt with the same class as the test sample. However, slight differences in prediction accuracy are observed when comparing them to simple random sampling.
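
A minimal sketch of dynamic sampling, assuming that "class" is stored as a field on each example (here `subject`, though topic or grade would work the same way) and that the sampler falls back to the full training pool when too few same-class examples exist. Both the field name and the fallback are illustrative assumptions.

```python
import random


def sample_in_context(train_examples, test_example, n=2, key="subject", seed=0):
    """Pick n in-context examples that share the test example's class."""
    rng = random.Random(seed)
    same_class = [ex for ex in train_examples if ex[key] == test_example[key]]
    # Fall back to plain random sampling if the class is too small.
    pool = same_class if len(same_class) >= n else train_examples
    return rng.sample(pool, n)


# Toy training pool for illustration only.
train_pool = [
    {"id": 1, "subject": "natural science"},
    {"id": 2, "subject": "natural science"},
    {"id": 3, "subject": "social science"},
    {"id": 4, "subject": "natural science"},
]
picked = sample_in_context(train_pool, {"subject": "natural science"})
```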

Upper bound. We probe the upper bound of the GPT-3 accuracy by feeding the gold lecture and explanation in the test prompt. As reported in Table 6, QCME*→A outperforms the QCM→ALE baseline by 18.86% and QCMLE*→A outperforms QCM→ALE by 18.96%, indicating a potential improvement direction by generating correct explanations before answering science questions.

Positions of lectures and explanations. We study the performance of GPT-3 (CoT) in terms of different positions of lectures and explanations on 1,000 test examples. The results are shown in Table 7. There could be huge accuracy decreases if GPT-3 (CoT) predicts lectures and explanations before answers. It is mainly because if GPT-3 (CoT) is formalized to generate the long lecture and explanation first, there is a greater chance that it will stop generating the prediction early or use up the maximum token limits before obtaining the required answer.
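
The token-limit failure mode described above can be illustrated with a toy truncation. Whitespace tokens stand in for model tokens, the ten-token budget is arbitrary, and the output templates are made up; this shows the mechanism only and does not simulate GPT-3.

```python
def truncate(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens whitespace-separated tokens."""
    return " ".join(text.split()[:max_tokens])


answer = "The answer is A."
rationale = " ".join(["word"] * 20)  # stand-in for a long lecture and explanation

answer_first = truncate(f"{answer} BECAUSE: {rationale}", max_tokens=10)
rationale_first = truncate(f"{rationale} Therefore, {answer}", max_tokens=10)

print(answer in answer_first)     # True: the answer survives truncation
print(answer in rationale_first)  # False: the budget ran out before the answer
```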

CoT learns with fewer data. To study if the chain of thought helps language models learn more efficiently, we report the accuracies of UnifiedQA and UnifiedQA (CoT) fine-tuned on different sizes of the training set in Figure 8. UnifiedQA (CoT) benefits language models by learning the coherent reasoning path when answering questions, resulting in similar accuracy with fewer training examples.

Error analysis. GPT-3 via chain-of-thought prompting obtains promising results but still fails to answer a wide range of challenging questions in ScienceQA. See examples of failure cases in Appendix B.4. The failure cases can be classified into two types: (a) the model fails to understand the multimodal inputs and lacks domain-specific knowledge to arrive at the correct answer; (b) the model generates the wrong chain of thought with irrelevant, incorrect, or incomplete information.

Discussion and Conclusion

In this paper, we propose ScienceQA, a dataset that features 21,208 multi-option questions with multimodal contexts from the science curriculum. To the best of our knowledge, ScienceQA is the first large-scale multimodal science dataset where most questions are annotated with corresponding lectures and explanations. We establish various baselines, including recent VQA models and large language models on ScienceQA. We further study if language models can generate reasonable explanations and then benefit the reasoning ability. Experiments show that UnifiedQA with the chain of thought can achieve an improvement of 3.99% and few-shot GPT-3 via chain-of-thought (CoT) prompting can obtain a satisfactory accuracy of 75.17% on ScienceQA. 65.2% of the generated explanations from GPT-3 (CoT) meet the gold standard by human evaluations.

Acknowledgment

We would like to thank the anonymous reviewers for their valuable comments and suggestions. We would also like to thank Xiaodan Liang for insightful discussions on dataset collection. We thank our colleagues at The Allen Institute of AI (AI2), Jiasen Lu and Jungo Kasai for helpful discussions. The work does not relate to Liang Qiu’s position at Amazon Alexa.

Checklist

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

Did you describe the limitations of your work? [Yes] We present the error analysis in Section 5.4 and discuss the limitations of the work in Appendix B.4.

Did you discuss any potential negative societal impacts of your work? [Yes] We discussed the broader impacts in Appendix B.5.

Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results? [N/A]

Did you include complete proofs of all theoretical results? [N/A]

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] We included 100 data examples and the data visualizer tool in the supplemental material. The whole dataset and code will be available at https://scienceqa.github.io.

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] See Section 5.1 and Appendix B.1 for experimental details.

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [Yes] We reported the error bars for GPT-3 (CoT) experiments in Figure 7, where each experiment was repeated four times.

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We discussed compute resources in Appendix B.1.

If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators? [Yes] We collected the ScienceQA dataset from https://www.ixl.com/. The copyright belongs to IXL.

Did you mention the license of the assets? [Yes] ScienceQA is under the CC BY-NC-SA 4.0 license and is used for non-commercial research purposes.

Did you include any new assets either in the supplemental material or as a URL? [Yes] We included data examples and a visualizer tool in the supplemental material. The dataset will be available at https://scienceqa.github.io.

Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [Yes] The collected data does not contain personally identifiable information or offensive content.

If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] We included screenshots of the instructions in Appendix B.2 and B.3.

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes] We included the monetary compensation details in Appendix B.2 and B.3.

Appendix A Dataset Analysis

A.1 Data Collection

Questions in the ScienceQA dataset are sourced from open resources managed by IXL Learning, an online learning platform curated by experts in the field of K-12 education. The dataset includes problems that align with California Common Core Content Standards. To construct ScienceQA, we downloaded the original science problems and then extracted individual components (e.g. questions, hints, images, options, answers, lectures, and solutions) from them based on heuristic rules.

We manually removed invalid questions, such as questions that have only one choice, questions that contain faulty data, and questions that are duplicated, to comply with fair use and transformative use of the law. If there were multiple correct answers that applied, we kept only one correct answer. Also, we shuffled the answer options of each question to ensure the choices do not follow any specific pattern. To make the dataset easy to use, we then used semi-automated scripts to reformat the lectures and solutions. Therefore, special structures in the texts, such as tables and lists, are easily distinguishable from simple text passages. Similar to ImageNet, ReClor, and PMR datasets, ScienceQA is available for non-commercial research purposes only and the copyright belongs to the original authors. To ensure data quality, we developed a data exploration tool to review examples in the collected dataset, and incorrect annotations were further manually revised by experts. The tool can be accessed at https://scienceqa.github.io/explore.html.
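
Shuffling the options requires updating the index of the correct answer along with them. A minimal sketch, assuming each question stores a `choices` list and an integer answer index (hypothetical field layout, not the released preprocessing script):

```python
import random


def shuffle_options(choices: list[str], answer_index: int, rng: random.Random):
    """Shuffle the options and return them with the new answer index."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    # The correct option moved to wherever its old index landed in `order`.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


options = ["push", "pull", "lift"]
shuffled, new_index = shuffle_options(options, 1, random.Random(0))
```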

A.2 Question Statistics

Figure 9 (a) is a word cloud showing the most frequently appearing words in the question texts. Stop words that carry little semantic meaning, such as “what” or “and”, are removed to give a clearer view of the semantic range of ScienceQA. The diagram shows that ScienceQA covers a wide range of topics, with words from different topics showing up across the cloud.

Figures 9 (b) (c) (d) show the word clouds for each of the three subjects. We can observe from the word clouds that the words are well-matched to the subject themes. In natural science questions, words such as “trait”, “magnet”, and “force” appear frequently. Words such as “capital” and “state” show up frequently in social science questions, whereas words such as “dictionary” and “page” are common in language science questions.
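The word counts behind such word clouds can be approximated with a simple frequency table. The sketch below assumes a tiny hand-written stop-word list and a lowercase alphabetic tokenizer. Both are illustrative choices, not the preprocessing actually used for Figure 9.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"what", "which", "and", "the", "of", "is", "a", "on"}

def word_frequencies(questions, stop_words=STOP_WORDS):
    """Count non-stop-word tokens across a list of question strings."""
    counts = Counter()
    for question in questions:
        tokens = re.findall(r"[a-z]+", question.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts

questions = [
    "What is the capital of Texas?",
    "Which force acts on the magnet and the nail?",
]
counts = word_frequencies(questions)
```

Running the same function separately on the questions of each subject would give the per-subject counts.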

A.3 Choice Statistics

Table 8 shows the number of questions for each number of choices. Questions have a minimum of two options and a maximum of five options. Figure 10 shows the distribution of choice length in ScienceQA. Most choices are short, containing up to five words. However, the distribution has a long tail where about 5% of the choices contain more than 15 words. Hence, models need a high level of text understanding to handle such diversely distributed choices.

A.4 Subject Statistics

Figure 11 shows the question length distribution of each subject. The three subjects all feature long-tail distributions in terms of the number of question words. On average, social science questions are the shortest, while language science questions are the longest. Language science questions are distributed more evenly than other questions across different numbers of words. These features imply that the ScienceQA dataset is rich in compositional diversity.

A.5 Grade Statistics

The grade distribution is shown in Figure 12. The majority of questions come from the middle level curriculum (i.e., from grade 3 to grade 8) while around 10% are taken from the high school curriculum (i.e., from grade 9 to grade 12). These high school level questions are close to or at the difficulty level of the U.S. standardized tests for college admissions. Machine algorithms need to master a large amount of scientific knowledge and perform complex reasoning in order to perform well on ScienceQA.

Appendix B Experiments

Fine-tuning on the dataset. Fine-tuning baselines (VQA baselines and UnifiedQA) are trained on the training set, developed on the validation set, and evaluated on the test set.

Input sizes. For VQA baselines, we set the maximum number of input words or tokens as 100.

Batch sizes. We use batches of 64 and 4 for VQA baselines and fine-tuned UnifiedQA, respectively.

Newline character. For language models, the newline separators (\n) in the text are replaced with \\n when encoding the inputs because \n is normally used as a stop symbol, following the original works.
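A minimal sketch of this replacement is given below. The function name `escape_newlines` is an assumption for illustration. The idea is that each real newline character becomes the two-character sequence backslash + n, so the encoded input no longer contains the stop symbol.

```python
def escape_newlines(text: str) -> str:
    """Replace each newline character with the literal two characters '\\' and 'n'."""
    return text.replace("\n", "\\n")

raw = "Question: Which is a solid?\nOptions: (A) water (B) ice"
encoded = escape_newlines(raw)
```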

Captioning model. We use the tool at https://huggingface.co/nlpconnect/vit-gpt2-image-captioning to generate captions for the images in the dataset. The maximum length of generated captions is 16, the number of beams is 4, and the maximum number of output tokens is 512.

Compute resources. We use two GeForce RTX 3090 GPUs for fine-tuning VQA baselines and UnifiedQA on the dataset.

Questions without any context. For questions without any context, the context text is replaced with an empty string.

GPT-3. Following default settings, we set the temperature, frequency penalty, and presence penalty to 0.0, and the top probability to 1.0. All experiments for GPT-3 are run via the online API. Experiments in Figure 7 are repeated four times with the in-context examples listed in Table 9. Experiments in Tables 3, 5, 6, and 7 are conducted using the examples with the trial ID of 1.
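For reference, the decoding settings above can be collected into a single request payload. The sketch below only builds a dictionary and does not call any API. The key names follow the usual completion-API naming (`temperature`, `top_p`, `frequency_penalty`, `presence_penalty`), while the helper `build_request` and the example prompt and `max_tokens` value are assumptions for illustration.

```python
# Decoding settings as stated in the text; key names are assumed to match
# the completion API's parameter names.
GPT3_DECODING = {
    "temperature": 0.0,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "top_p": 1.0,
}

def build_request(prompt: str, max_tokens: int) -> dict:
    """Hypothetical helper: merge a prompt with the fixed decoding settings."""
    return {"prompt": prompt, "max_tokens": max_tokens, **GPT3_DECODING}

# Example values only; the output budget is left to the caller here.
request = build_request("Question: ...\\nAnswer:", max_tokens=64)
```

With the temperature at 0.0 the decoding is effectively greedy, so the variation across the four repeated runs comes from the choice of in-context examples.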

B.2 Human Performance Study

In order to understand how humans perform on ScienceQA questions, we used Amazon Mechanical Turk (AMT) to crowdsource answers to the test set. The interface of instructions and one example of a test question are shown in Figure 13. A total of 4,241 test questions were shuffled and split into 425 batches, with each batch having 10 questions (excluding the last one). For each batch, we also randomly added five training questions as exam examples. Each set of 15 questions was then assigned to 3 AMT workers. Only workers who correctly answered at least 4 of the 5 exam examples qualified for the human performance study. In other words, workers who failed the qualification exam were eliminated from the analysis. For each set of 15 questions, we provided the worker with $0.5 per HIT task. At the rate of 3 questions per minute, this amounts to $6.0 per hour.
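The batching and compensation figures above can be checked with a few lines of arithmetic. The variable names and the `is_qualified` helper are illustrative. The numbers are taken directly from the text.

```python
import math

NUM_TEST_QUESTIONS = 4241
BATCH_SIZE = 10
EXAM_QUESTIONS = 5
PAY_PER_HIT = 0.5            # dollars
QUESTIONS_PER_MINUTE = 3

num_batches = math.ceil(NUM_TEST_QUESTIONS / BATCH_SIZE)
last_batch_size = NUM_TEST_QUESTIONS - (num_batches - 1) * BATCH_SIZE

questions_per_hit = BATCH_SIZE + EXAM_QUESTIONS            # 15 questions
minutes_per_hit = questions_per_hit / QUESTIONS_PER_MINUTE
hourly_wage = PAY_PER_HIT * (60 / minutes_per_hit)

def is_qualified(num_exam_correct: int) -> bool:
    """A worker qualifies with at least 4 of the 5 exam questions correct."""
    return num_exam_correct >= 4
```

Here `num_batches` comes out to 425, with a single question left in the last batch, and `hourly_wage` to 6.0, matching the stated figures.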

B.3 Human Evaluation of Generated Explanations

We also evaluated the quality of predictions from GPT-3 (CoT) and UnifiedQA (CoT) by asking AMT workers to rate the model-generated explanations. The interface is shown in Figure 14. Each sample’s question text, contexts, choices, and answers were presented, along with the corresponding explanation generated by language models. The workers were asked to decide whether the proposed explanation is relevant (is related to the question), correct (gives a correct answer and explanation), and complete (fully explains the answer). Prediction outputs that contain textual explanations were grouped into batches of 10, each assigned to 3 workers for evaluation. For each batch, we provided the workers with a monetary compensation of $0.3. Finally, the human scores for each explanation were determined by taking a majority vote.
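The majority-vote aggregation can be sketched as below. It assumes that each of the three workers gives a yes/no judgment per criterion and that the vote is taken separately for each criterion. The text does not spell out this detail, so the data layout and function names are hypothetical.

```python
from collections import Counter

CRITERIA = ("relevant", "correct", "complete")

def majority_vote(votes):
    """Return the most common label; with 3 binary votes there are no ties."""
    return Counter(votes).most_common(1)[0][0]

def aggregate(ratings):
    """ratings: one dict per worker, mapping each criterion to True/False."""
    return {c: majority_vote([r[c] for r in ratings]) for c in CRITERIA}

# Illustrative ratings from three workers for one explanation.
ratings = [
    {"relevant": True, "correct": True, "complete": False},
    {"relevant": True, "correct": False, "complete": False},
    {"relevant": True, "correct": True, "complete": True},
]
scores = aggregate(ratings)
```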

B.4 Case Study and Limitations

Figure 15 shows three examples with correct answers and gold explanations predicted by GPT-3 via chain-of-thought prompting (CoT). We can see that GPT-3 (CoT) not only predicts the correct answers but also generates reasonable explanations, which follow the multi-hop reasoning process of human beings. This suggests that large language models like GPT-3 have great promise for implementing high-level reasoning abilities.

Figure 16 visualizes three more examples with predictions from GPT-3 (CoT). In these examples, GPT-3 (CoT) is able to predict the correct answers but fails to generate gold explanations. For example, GPT-3 (CoT) generates an irrelevant explanation because the context text does not include fine-grained visual information in the image (Figure 16). In another example shown in Figure 16, GPT-3 (CoT) fails to produce a coherent thought chain: it gives an incorrect example and an incorrect statement about a chemical change. The third example is given in Figure 16, where the generated explanation is just a repetition of the input question and the output answer, instead of following the complete thought chain to arrive at the final answer.

Four failure examples with wrong predicted answers are listed in Figure 17. We extract the image captions and feed them to the large language model as the visual content input. However, these captions lack fine-grained semantics and usually do not work well for diagrams, which results in two of the failure cases shown in Figure 17. Moreover, large language models still struggle to reason about questions that require them to understand complex and uncommon domain knowledge. For example, GPT-3 (CoT) cannot accurately understand the terminology of personification in language science (Figure 17) or the series of complex chemical changes that happen in the formation process of dinosaur fossils (Figure 17).

B.5 Broader Impacts

Societal impact. The ScienceQA dataset collects science questions sourced from textbooks and is proposed to diagnose the multimodal understanding and multi-hop reasoning abilities of AI systems. Due to the nature of the data sources, ScienceQA does not contain any user usage data or personally sensitive information such as gender and race. After careful examination of our dataset, to the best of our knowledge, we have not found any improper content, such as pornographic information, racial remarks, or harmful social bias. We adhere to the goal of AI for the common good, and any antisocial data points will be removed from the dataset based on feedback.

Potential usage. The proposed ScienceQA dataset and the methods designed in this paper are beneficial to both follow-up research work and real-world applications. ScienceQA provides a useful benchmark for multi-modal learning, multi-hop reasoning, and general artificial intelligence. In addition, ScienceQA will contribute to the development of K-12 education applications such as tutoring systems. Furthermore, the designed chain-of-thought methods investigate the ability of large language models to mimic the human thought process when reasoning about a challenging task.