Explanations from Large Language Models Make Small Reasoners Better

Shiyang Li, Jianshu Chen, Yelong Shen, Zhiyu Chen, Xinlu Zhang, Zekun Li, Hong Wang, Jing Qian, Baolin Peng, Yi Mao, Wenhu Chen, Xifeng Yan

cs.CL

Introduction

Large language models (LLM) have achieved impressive results with in-context learning: given only a few demonstrations in the prompt, they can solve unseen tasks without any parameter update (Brown et al., 2020; Thoppilan et al., 2022; Chowdhery et al., 2022; Wei et al., 2022a). Recent work shows that explanation-augmented prompts can elicit strong performance on various reasoning tasks (Wei et al., 2022b; Lampinen et al., 2022), such as math word problems (Cobbe et al., 2021), symbolic reasoning (Wei et al., 2022b), numerical reasoning (Zhou et al., 2022) and commonsense reasoning (Talmor et al., 2019). In addition, such prompts enable LLM to generate reasonable explanations that justify their reasoning outcomes.

In this paper, we consider the problem of leveraging these explanations elicited from LLM to improve the training of small reasoners. Small language models (SLM; we argue that small and large models are relative concepts: the same model can be small or large depending on the context) could be more favorable than LLM in many real situations due to their low cost in both storage and computation. Nevertheless, one important open question is how to close the performance gap with respect to LLM on complicated reasoning tasks, as observed in Zelikman et al. (2022), especially in few-shot settings (Li et al., 2019). Surprisingly, Hase et al. (2020) show that using human-annotated explanations does not improve performance compared to standard finetuning on T5 (Raffel et al., 2019). One possible reason is that many human-annotated explanations collected via crowdsourcing (Wiegreffe and Marasović, 2021) could be logically inconsistent and grammatically incorrect (Narang et al., 2020), which restricts the amount of available high-quality explanations. On the other hand, explanation-augmented prompts enable LLM to automatically generate decent explanations (Wiegreffe et al., 2021a), making them a plausible alternative for generating an arbitrary amount of explanations. Therefore, a key question is: can the explanations generated by LLM improve the reasoning capability of SLM?

In this paper, we show that explanations generated from LLM can consistently improve the reasoning capability of SLM. Our framework is shown in Figure 1. Specifically, we first use several examples with human-written explanations as demonstrations for LLM and then generate explanations for the training set. We systematically explore three approaches to generating explanations. The first approach generates explanations through chain of thought prompting; an explanation is adopted if the LLM prediction is correct and rejected otherwise (Zelikman et al., 2022). The second approach generates explanations by rationalization prompting conditioned on golden labels (Wiegreffe et al., 2021a). Intuitively, the first approach may generate higher-quality explanations than the second when the LLM predictions are correct, as incorrect explanations tend to generate incorrect predictions (Wei et al., 2022b). However, the first approach rejects explanations on problems with incorrect predictions, leaving those problems without explanations. The second approach, in contrast, explicitly conditions on golden labels and may still generate useful explanations on problems where chain of thought prompting cannot predict correctly. Therefore, we propose a third, hybrid approach: adopt explanations generated by chain of thought prompting if the LLM predictions are correct and use rationalization prompting otherwise. As we will show in section 5, all three explanation generation methods consistently and significantly improve over finetuning baselines without explanations, and our hybrid approach achieves the best results on two of three datasets.

We further adopt a multi-task (MT) learning framework, shown in Figure 2, to utilize the LLM-generated explanations since (1) it naturally allows training with partially generated explanations and (2) the self-rationalizing model (Wiegreffe et al., 2021b), where the golden label and the human-written explanation are linearly concatenated as the target, performs significantly worse than its MT counterpart (Hase et al., 2020). Interestingly, we find that even with the same MT approaches as Hase et al. (2020), i.e., MT-Re (Hase et al., 2020) and MT-Ra (Camburu et al., 2018), we can consistently and significantly improve over a strong T5 standard finetuning baseline using LLM-generated explanations. This is in stark contrast to the results in Hase et al. (2020), where finetuning T5 with MT-Re and MT-Ra only achieves on-par results using crowdsourced explanations. In addition, we propose MT-CoT, where the small language model is trained to jointly solve two tasks: (i) directly generating the answer and (ii) generating an explanation and then the answer, as shown in Figure 2 (c). Unlike MT-Re and MT-Ra, MT-CoT positions the answer after the explanation, hoping the model can learn to derive it from the explanation as in chain of thought (Wei et al., 2022b). Our results show that all three explanation generation approaches can improve the reasoning capability of small language models under the MT-Ra, MT-Re and MT-CoT setups. Moreover, MT-CoT achieves the best results over MT-Re and MT-Ra on two of three datasets. In addition, our method can outperform the standard finetuning baseline by up to 8.1% in accuracy and even perform better than finetuning/prompting a 60x larger GPT-3 model (175B) by up to 9.5% in accuracy on CommonsenseQA. Finally, as a side benefit, human evaluation further shows that our method can generate high-quality explanations to justify its predictions, moving towards the goal of explainable AI (Samek et al., 2019).

In a nutshell, we summarize our contributions as follows:

We show that multi-task learning with explanations from LLM can consistently and significantly improve strong T5 single-task fine-tuning baselines across various settings.

We propose a hybrid prompting approach to generating explanations from LLM, and MT-CoT to further improve our paradigm of learning with explanations from LLM.

We demonstrate that our method can perform better than finetuning/prompting a 60x larger GPT-3 model (175B) by up to 9.5% in accuracy on CommonsenseQA and generate high-quality explanations to justify its predictions towards the goal of explainable AI.

Related Work

Recently, a new learning paradigm, in-context learning, where several training examples are used as demonstrations for LLM without any parameter update, has shown promising results in various NLP tasks (Brown et al., 2020). Although promising, LLM still struggle with tasks requiring strong reasoning capability (Wei et al., 2022b). To enable better few-shot in-context learning of LLM for reasoning tasks, Wei et al. (2022b) propose chain of thought prompting, which provides intermediate reasoning steps as explanations in prompts before answers and has achieved state-of-the-art results in arithmetic, symbolic and common sense reasoning tasks. Zhou et al. (2022) further extend chain of thought prompting with least-to-most prompting, which decomposes a complex problem into a list of subproblems in natural language and then solves these subproblems sequentially in a recursive fashion. Kojima et al. (2022) move one step further and show that LLM are zero-shot reasoners by simply adding “Let’s think step by step” without any demonstration in prompts. Unlike these works, Lampinen et al. (2022) explore explanations-after-answers prompting for LLM, where answers are fed into LLM before their explanations in prompts, and also observe consistent gains.

There also exists work that utilizes explanations generated from LLM rather than focusing on their final predictions. Wiegreffe et al. (2021a) explore utilizing LLM to annotate explanations for existing datasets and propose a sample-then-filter paradigm with human annotations. Ye and Durrett (2022) propose to calibrate GPT-3 with a calibrator, as they find that GPT-3 tends to generate consistent but less factual explanations for textual reasoning tasks. However, none of these works explores whether these noisy explanations generated from LLM, without human-involved filtering, can be used to improve SLM reasoning capability. The closest work to ours is STaR (Zelikman et al., 2022). STaR begins by prompting a decent large language model, GPT-J with 6B parameters (Wang, 2021), via chain of thought prompting, possibly including answer hints, to generate explanations with incorrect answer rejection. After that, they use the filtered training datasets with explanations to finetune GPT-J as a teacher model and then use the teacher model to generate explanations for the training datasets to train a student GPT-J model, iterating in a self-training fashion until performance plateaus. However, STaR often requires dozens of iterations to converge, which is both time-consuming and compute-intensive for a large 6B model. What is worse, their method may not be applicable to smaller language models, e.g. GPT-2 (Radford et al., 2019), or to strong non-autoregressive models, e.g. T5, as these may not generate high-quality explanations with prompting. In addition, they only focus on chain of thought style prompting and finetuning, while our approach can improve SLM across model sizes, explanation generation methods and multi-task finetuning methods.

Learning with Explanations.

Learning with explanations has been commonly studied in robotics (Johnson, 1994) and computer vision (Hendricks et al., 2016). Recently, it has received increasing attention in NLP as well. Camburu et al. (2018) propose MT-Ra for the natural language inference task with an LSTM and do not observe gains over single-task finetuning. Narang et al. (2020) use the MT-Ra setup on both T5-base and T5-11B models but mainly focus on explanation generation. In contrast, Rajani et al. (2019) observe improvements with two-stage finetuning using human-annotated explanations for a common sense reasoning task, where the first stage trains a model for explanation generation with GPT (Radford et al., 2018) and the second stage uses explanations as input to train a classification model based on BERT (Devlin et al., 2019). However, Hase et al. (2020) find that both two-stage finetuning and multi-task learning with the MT-Re and MT-Ra setups only obtain results comparable to standard finetuning baselines on T5. We instead show that MT-Re, MT-Ra and our proposed MT-CoT with explanations from LLM can consistently and significantly outperform standard finetuning baselines without an accuracy-explanation trade-off (Jain et al., 2020).

Explanation Generation from LLM

Problem setup. Let $D=\{(x_{i},y_{i})\}^{N}$ be a dataset with $N$ training instances, where $x_{i}$ is a problem and $y_{i}$ is its answer. We also have a handful of human-written instances $E=\{(x^{p}_{i},e^{p}_{i},y^{p}_{i})\}^{M}$, where $e^{p}_{i}$ is a free-text explanation of why problem $x^{p}_{i}$ has $y^{p}_{i}$ as its answer, and $\{(x^{p}_{i},y^{p}_{i})\}^{M}\subset D$ with $M\ll N$ (we set $M=7$ in our experiments). Our goal is to fully leverage LLM, with $E$ as demonstrations for in-context learning, to generate an explanation $e_{i}$ for every $(x_{i},y_{i})$ with $1\leq i\leq N$, so that we can use these LLM-generated explanations to improve SLM reasoning capability.

A chain of thought is a series of intermediate reasoning steps produced before the answer to a problem, mimicking the deliberate human thinking process used for complicated reasoning tasks (Wei et al., 2022b). Chain of thought prompting provides intermediate reasoning steps as explanations before answers in prompts. Formally, for $1\leq i\leq N$, we first concatenate all instances in $E$ and $x_{i}$ as the prompt $\hat{p}_{i}=(x^{p}_{1},e^{p}_{1},y^{p}_{1},\ldots,x^{p}_{M},e^{p}_{M},y^{p}_{M},x_{i})$. We then feed the prompt $\hat{p}_{i}$ into LLM and greedily decode until a stop token is generated. After that, we parse the decoded sentence into an explanation part $\hat{e}_{i}$ and a prediction part $\hat{y}_{i}$. Intuitively, if $\hat{y}_{i}\neq y_{i}$, $\hat{e}_{i}$ may not be of high quality, as incorrect explanations tend to generate incorrect predictions (Wei et al., 2022b). Thus, we utilize Chain Of Thought prompting with incorrect answer rEjection (COTE) (Zelikman et al., 2022), adopting $e_{i}:=\hat{e}_{i}$ only if $\hat{y}_{i}=y_{i}$; otherwise, we reject $\hat{e}_{i}$ and set $e_{i}$ to none.
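The COTE procedure above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Q:`/`A:` layout and the `ANSWER_MARKER` string are hypothetical placeholders (the actual prompts are deferred to Appendix A), and the LLM call itself is left out, so `completion` stands for whatever text the model decoded greedily.

```python
from typing import List, Optional, Tuple

# One demonstration from E: (problem, explanation, answer).
Demo = Tuple[str, str, str]

# Hypothetical marker separating explanation and prediction in a completion.
ANSWER_MARKER = "Therefore, the answer is"


def build_cot_prompt(demos: List[Demo], problem: str) -> str:
    """Concatenate (x, e, y) demonstrations, then the new problem x_i."""
    blocks = [f"Q: {x}\nA: {e} {ANSWER_MARKER} {y}." for x, e, y in demos]
    blocks.append(f"Q: {problem}\nA:")
    return "\n\n".join(blocks)


def parse_completion(completion: str) -> Tuple[str, Optional[str]]:
    """Split decoded text into (explanation, prediction)."""
    explanation, sep, answer = completion.rpartition(ANSWER_MARKER)
    if not sep:  # marker missing: no prediction could be parsed
        return completion.strip(), None
    return explanation.strip(), answer.strip().rstrip(".").strip()


def cote(completion: str, gold: str) -> Optional[str]:
    """Keep the explanation only if the parsed prediction equals the gold answer."""
    explanation, predicted = parse_completion(completion)
    if predicted is not None and predicted == gold:
        return explanation
    return None  # rejected: e_i is set to none
```

Returning `None` for rejected instances keeps the "none" explanations explicit, so a later stage can decide how to treat them.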

RP.

Since COTE uses the answers in the original datasets to reject explanations with incorrect predictions, the rejected instances no longer have explanations. To alleviate this issue, an alternative is to apply Rationalization Prompting (RP) (Wiegreffe et al., 2021a) to generate explanations for every instance in the training set. Unlike COTE, RP provides explanations given golden answers. Specifically, for $1\leq i\leq N$, we concatenate all instances in $E$ and $(x_{i},y_{i})$ as the prompt $\bar{p}_{i}=(x^{p}_{1},y^{p}_{1},e^{p}_{1},\ldots,x^{p}_{M},y^{p}_{M},e^{p}_{M},x_{i},y_{i})$. We then feed the prompt $\bar{p}_{i}$ into LLM and greedily decode until a stop token is generated. The decoded sentence $\bar{e}_{i}$ is cast as the explanation $e_{i}$, i.e. $e_{i}:=\bar{e}_{i}$, without filtering.
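For comparison with chain of thought prompting, the RP prompt places each answer before its explanation and ends with the golden answer of the target problem. The sketch below illustrates only this ordering; the `Q:`, `A:` and `Explanation:` field labels are hypothetical and not taken from the prompts in Appendix A.

```python
from typing import List, Tuple

# One demonstration from E: (problem, explanation, answer).
Demo = Tuple[str, str, str]


def build_rp_prompt(demos: List[Demo], problem: str, gold: str) -> str:
    """Concatenate (x, y, e) demonstrations, then (x_i, y_i) for the target."""
    blocks = [f"Q: {x}\nA: {y}\nExplanation: {e}" for x, e, y in demos]
    # The golden answer is given, so the model only has to rationalize it.
    blocks.append(f"Q: {problem}\nA: {gold}\nExplanation:")
    return "\n\n".join(blocks)


def rp_explanation(completion: str) -> str:
    """RP keeps the decoded text as the explanation, without filtering."""
    return completion.strip()
```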

CROP.

COTE is likely to generate relatively high-quality explanations when LLM give correct predictions for the problems at hand, as incorrect explanations tend to generate incorrect predictions (Wei et al., 2022b). However, for problems with incorrect predictions, COTE casts their explanations as none. On the other hand, RP can generate explanations for every instance in the dataset, but we cannot easily assess their quality without human annotation. Therefore, we propose Chain of Thought with Rationalization PrOmpting backuP (CROP), which falls back to RP whenever COTE yields none as the explanation. Intuitively, if LLM cannot predict a problem correctly under chain of thought prompting, the problem may be difficult (Zelikman et al., 2022), and RP may provide a meaningful explanation as it can access the golden label during explanation generation.
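The CROP fallback rule amounts to a per-instance selection between the two explanation sources. A minimal sketch, assuming the COTE and RP explanations have already been generated and that rejected COTE explanations are represented as `None` (the function names and the source tag are illustrative only):

```python
from typing import List, Optional, Tuple


def crop(cote_explanation: Optional[str], rp_explanation: str) -> Tuple[str, str]:
    """Return (explanation, source): COTE if it was kept, else the RP backup."""
    if cote_explanation is not None:
        return cote_explanation, "COTE"
    return rp_explanation, "RP"


def crop_dataset(
    cote_explanations: List[Optional[str]], rp_explanations: List[str]
) -> List[Tuple[str, str]]:
    """Apply the fallback rule instance by instance over a training set."""
    if len(cote_explanations) != len(rp_explanations):
        raise ValueError("both lists must cover the same training instances")
    return [crop(c, r) for c, r in zip(cote_explanations, rp_explanations)]
```

Recording the source alongside each explanation makes it easy to check how many instances relied on the RP backup.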

Multi-task Learning with Explanations

In this section, we elaborate on how to utilize explanations generated from LLM to improve SLM reasoning capability with a multi-task learning framework. We detail three methods for multi-task learning with explanations below.

Multi-task Learning with Reasoning (MT-Re) is introduced by Hase et al. (2020) (see Figure 2 (a)). MT-Re is trained to directly generate predictions for the qta (question to answer) task, exactly as in standard finetuning without explanations, and to generate explanations without explicitly providing answers in the qtr (question to reason) task. The training objective of MT-Re mixes the loss $\mathcal{L_{\text{qta}}}$ for the qta task and the loss $\mathcal{L_{\text{qtr}}}$ for the qtr task:

$$\mathcal{L}=\alpha\mathcal{L_{\text{qta}}}+(1-\alpha)\mathcal{L_{\text{qtr}}}\qquad(1)$$

where $\alpha$ weights the $\mathcal{L_{\text{qta}}}$ and $\mathcal{L_{\text{qtr}}}$ losses and is tuned on the development set.
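As a small illustration of how $\alpha$ trades off the two tasks, the sketch below assumes the objective is the convex combination $\alpha\mathcal{L_{\text{qta}}}+(1-\alpha)\mathcal{L_{\text{qtr}}}$, which is consistent with the later remark that a small $\alpha$ gives the qtr loss more weight. The function name and the plain-float losses are illustrative; in practice the two terms would be per-batch tensor losses.

```python
def multitask_loss(loss_qta: float, loss_qtr: float, alpha: float) -> float:
    """Mix the qta and qtr losses; alpha is the weight on the qta task."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_qta + (1.0 - alpha) * loss_qtr


# alpha = 1 recovers single-task finetuning on qta only;
# smaller alpha shifts weight towards explanation generation (qtr).
print(multitask_loss(2.0, 4.0, alpha=1.0))
print(multitask_loss(2.0, 4.0, alpha=0.5))
```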

MT-Ra.

Multi-task Learning with Rationalization (MT-Ra) was first proposed by Camburu et al. (2018) for the natural language inference task using LSTM-based models (Hochreiter and Schmidhuber, 1997), and we adopt it with a more powerful T5 model for other reasoning tasks. As shown in Figure 2 (b), models are trained to generate predictions for the qta task, the same as in MT-Re, and are also trained to generate rationalizations for the qtr task. This differs from MT-Re in that MT-Ra allows explanations to be explicitly conditioned on predictions. For MT-Ra, we use the same training objective as Equation 1 and tune $\alpha$ on the development set.

MT-CoT.

MT-Re does not explicitly model interactions between explanations and answers during training, which may make it hard for models to capture their relations. While MT-Ra is explicitly trained to generate explanations conditioned on answers, it may still have difficulty understanding their causal relationship, as answers are never generated with explicit access to their explanations. To bridge this gap, we propose Multi-task Learning with Chain of Thought (MT-CoT), where models are trained to generate answers for the qta task and to generate a chain of thought for the qtr task, as shown in Figure 2 (c). For MT-CoT, we use the same training objective as Equation 1 and tune $\alpha$ on the development set.

In the MT-CoT training paradigm, models not only learn answers from the qta task but are also explicitly shown, in the qtr task, how answers are derived through intermediate reasoning steps that precede them. As we will show in the experiments, this training paradigm complements MT-Re and MT-Ra; it consistently improves small language model reasoning capability and also outperforms MT-Re and MT-Ra on two datasets.
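The difference between the three setups lies only in the qtr target: explanation alone (MT-Re), answer followed by explanation (MT-Ra), or explanation followed by answer (MT-CoT). The sketch below makes that ordering concrete. The `qta:`/`qtr:` input prefixes and the exact target templates are assumptions for illustration; the actual formats are those shown in Figure 2. Instances without an explanation simply contribute no qtr example here, which is one possible way to realize training with partially generated explanations.

```python
from typing import Dict, List, Optional


def make_examples(
    question: str, answer: str, explanation: Optional[str], setup: str
) -> List[Dict[str, str]]:
    """Build text-to-text training pairs for one instance under one MT setup."""
    # The qta task is identical across setups and to single-task finetuning.
    examples = [{"input": f"qta: {question}", "target": answer}]
    if explanation is None:
        return examples  # no explanation available: qta example only
    if setup == "MT-Re":
        target = explanation  # explanation without the answer
    elif setup == "MT-Ra":
        target = f"The answer is {answer} because {explanation}"  # answer first
    elif setup == "MT-CoT":
        target = f"{explanation} So the answer is {answer}."  # answer last
    else:
        raise ValueError(f"unknown setup: {setup}")
    examples.append({"input": f"qtr: {question}", "target": target})
    return examples
```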

Experiments

We evaluate our methods on three reasoning tasks.

(1) CommonsenseQA (Talmor et al., 2019) is a 5-way multi-choice question answering dataset that requires common sense reasoning, with 9741/1221/1140 questions in its training/development/test sets, respectively. Since its test set is not publicly available, we report results on its development set following previous work (Zelikman et al., 2022; Li et al., 2019).

(2) StrategyQA is a binary yes/no question answering dataset whose questions require implicit multi-hop reasoning steps that should be inferred using a strategy (Geva et al., 2021). It has 2290 training set and 490 test set questions. Since its test set is not publicly available, we use the split from its GitHub repository (https://github.com/eladsegal/strategyqa), where the original training set is randomly split into 90% for training and 10% for development. In our experiments, we report results on this GitHub development set and train on the GitHub training set without using the explanations from the original annotations.

(3) OpenbookQA is a 4-way multi-choice question answering dataset requiring open book facts with broad common knowledge and multi-hop reasoning (Mihaylov et al., 2018). It has 4957/500/500 questions for training/development/test set split, respectively and we report results on its test set.

We utilize the GPT-3 text-davinci-002 engine through the official OpenAI API (https://beta.openai.com/docs/models/gpt-3) to generate explanations with greedy decoding (temperature set to 0), following the in-context learning paradigm. For each dataset, we use the same 7-shot examples with human-written explanations for COTE, RP and CROP, which are detailed in section 3. We defer details of prompts to Appendix A.

Multi-task learning with explanations.

After obtaining explanations by COTE, RP and CROP, we utilize MT-Re, MT-Ra and MT-CoT, introduced in section 4, to train T5-based models with explanations. We implement the multi-task learning framework with the Huggingface transformers library (Wolf et al., 2020). As baselines, we use single-task finetuning (ST) without explanations. For a fair comparison with ST, we keep the hyper-parameters of multi-task learning the same as those of the corresponding ST, except for the weight $\alpha$, which we tune by grid search over $\{0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9\}$ on development sets. For instances whose explanation is none under COTE, we mask their loss for the qtr task. For both ST and multi-task finetuning, we directly generate predictions from the qta task for fair comparison.
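The two mechanics described above, masking the qtr loss for instances without explanations and selecting $\alpha$ by grid search on the development set, can be sketched as follows. This is a simplified illustration on plain Python floats: `dev_accuracy` is a hypothetical callback standing in for a full finetune-and-evaluate run, and averaging over the unmasked instances is an assumption about how the masked loss is normalized.

```python
from typing import Callable, Optional, Sequence, Tuple

# Grid of candidate weights from the experimental setup.
ALPHA_GRID = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)


def masked_mean_qtr_loss(
    losses: Sequence[float], explanations: Sequence[Optional[str]]
) -> float:
    """Average the qtr loss over instances that actually have an explanation."""
    kept = [loss for loss, e in zip(losses, explanations) if e is not None]
    return sum(kept) / len(kept) if kept else 0.0


def grid_search_alpha(
    dev_accuracy: Callable[[float], float],
    grid: Sequence[float] = ALPHA_GRID,
) -> Tuple[float, float]:
    """Return (best alpha, its development accuracy); ties keep the first alpha."""
    best_alpha, best_acc = grid[0], dev_accuracy(grid[0])
    for alpha in grid[1:]:
        acc = dev_accuracy(alpha)
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha, best_acc
```

In a real run, each call to `dev_accuracy` would train a model with the given `alpha` and score it on the development set, so the grid costs nine finetuning runs per setting.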

5.2 Main results

In this section, we compare results between multi-task learning with explanations and its single-task finetuning counterpart using full training data on three datasets introduced in section 5.1. Specifically, we generate explanations for each dataset with COTE, RP and CROP, and for each explanation generation method, we train T5-base model under MT-Re, MT-Ra and MT-CoT setups with 5 different runs in each setting. For single-task finetuning baseline, we only keep qta task by removing qtr task in multi-task learning setup. Results are summarized in Table 1.

All three multi-task learning setups, under all three explanation generation methods, consistently and significantly outperform single-task finetuning baselines, showing the effectiveness of utilizing explanations from LLM. However, MT-CoT and MT-Ra have 4 and 6 underlined results, respectively, while MT-Re does not have any. We hypothesize that this is because MT-CoT and MT-Ra explicitly mention answers through the phrase "the answer is" in the qtr task, making it easier for T5 to model relations between explanations and answers. Considering the best result for each dataset, two of three are obtained via CROP and the remaining one by COTE, showing that chain of thought prompting generates better explanations for SLM finetuning when its predictions are correct and that the RP backup can possibly further improve SLM reasoning capability. In addition, two of these three best results are obtained by MT-CoT, demonstrating that our MT-CoT method is a good candidate in the toolbox for improving SLM reasoning with explanations.

5.3 Few-shot learning results

Having shown the effectiveness of our method in full-training settings in section 5.2, we now explore whether explanations can improve SLM reasoning capability under few-shot settings. We conduct few-shot learning experiments on both the CommonsenseQA and OpenbookQA datasets with the best settings from section 5.2. Specifically, we choose MT-Ra finetuning with explanations generated by CROP for CommonsenseQA and MT-CoT finetuning with explanations generated by COTE for OpenbookQA. We conduct experiments with training sample sizes in $\{50,100,200,400\}$ for both datasets on the T5-base model; for each sample size, we randomly sample five data splits from the whole training set, and each data split has a single run. As in previous experiments, we use single-task finetuning as our baseline and tune $\alpha$ using grid search on development sets for the multi-task learning experiments. Besides accuracy, we also report the optimal $\alpha$ on development sets, denoted as $\alpha^{*}$. Intuitively, if $\alpha^{*}$ is small, the $\mathcal{L_{\text{qtr}}}$ loss has more weight in the multi-task training objective in Equation 1 and hence explanations are more important for correct prediction. We summarize our results in Table 2.

Multi-task learning with explanations (MT) consistently and significantly outperforms single-task finetuning baselines (ST). For the CommonsenseQA dataset, when training sample sizes are in $\{50,100,200\}$, MT improves over ST by about 6%-8% absolute accuracy. For the OpenbookQA dataset, when training sample sizes are in $\{100,200,400\}$, MT improves over ST by about 4%-6% absolute accuracy. More interestingly, $\alpha^{*}$ tends to be smaller when less training data is used, on both datasets. Intuitively, when training data is scarce, models may have difficulty learning from the limited problem and answer pairs alone and hence require a small $\alpha^{*}$ in the multi-task training objective in Equation 1, i.e. a larger weight on the $\mathcal{L_{\text{qtr}}}$ loss during multi-task learning. These consistent and significant gains show that our method not only improves results in full-training settings but is also very useful when training data is limited.

5.4 Results across model sizes

Previous experiments utilize T5-base model and we further explore if explanations can improve language model reasoning capability across model sizes. We conduct full-training set experiments for both CommonsenseQA and OpenbookQA datasets with best settings for each dataset in section 5.2 across $\{\text{T5-small},\text{T5-base},\text{T5-large},\text{T5-3B}\}$ . For T5-small and T5-base, we have five different runs for each setting and their average results are reported. For T5-large and T5-3B, we only report a single run due to their intensive computational cost. Results are summarized in Table 3.

MT consistently improves over its ST counterpart on both CommonsenseQA and OpenbookQA across model sizes from T5-small (60 million parameters) to T5-3B. For CommonsenseQA, MT improves over ST by about 0.7%-1.8% absolute accuracy, and for OpenbookQA, MT improves over ST by about 1.4%-3.0% absolute accuracy. Even for T5-3B, MT can improve over the strong ST baseline by 2% absolute accuracy. These consistent results show that our approach works on both small and relatively large models.

5.5 Comparison with Large Language Models

We further compare our method on T5-3B with state-of-the-art LLM. Specifically, we adopt GPT-J direct finetuning, its self-bootstrapping version (STaR) (Zelikman et al., 2022) and GPT-3 direct finetuning (Xu et al., 2021) as baseline methods with parameter updates on downstream tasks. We also adopt GPT-3 direct prompting (Brown et al., 2020), GPT-3 chain of thought prompting (Wei et al., 2022b) and GPT-3 explanations-after-answers prompting (Lampinen et al., 2022) as prompting baselines. These three prompting methods use the same set of demonstrations as for explanation generation in section 3, and we defer their prompts to Appendix A. Results are summarized in Table 4.

Our approach can outperform finetuning of the strong, 60x larger GPT-3 and various GPT-3 prompting methods on CommonsenseQA by up to about 9.5% absolute accuracy. Also, although STaR can outperform its GPT-J baseline with chain-of-thought style iterative finetuning, its result still trails our method by about 10% absolute accuracy on CommonsenseQA, even with double the parameter size and more compute during the iterative finetuning process. For OpenbookQA, our model underperforms GPT-3 direct prompting and explanations-after-answers prompting but still outperforms GPT-3 chain of thought prompting by 6% absolute accuracy. In short, our method achieves strong performance even compared with the 60x larger GPT-3.

5.6 Human evaluation on generated explanations

A side benefit of our model is to generate explanations towards more explainable AI to alleviate the notorious black box issue of deep neural networks (Koh and Liang, 2017). To evaluate quality of generated explanations from qtr task for our model, we conduct human evaluation since automatic metrics are not highly correlated with human assessment (Clinciu et al., 2021; Kayser et al., 2021).

Specifically, we perform a head-to-head explanation comparison on the CommonsenseQA dataset between T5-3B and GPT-3 175B few-shot explanations-after-answers prompting, since these models achieve close performance on this dataset, as shown in Table 4. The T5 model is trained with explanations generated by GPT-3, and we would like to know how the quality of its generated explanations compares to that of GPT-3, whose explanations have been shown to be of high quality in Wiegreffe et al. (2021a). Therefore, we randomly sample 100 examples that are predicted correctly by both GPT-3 and T5. For each example, we present a question, its ground truth answer and the two explanations generated by T5 and GPT-3, randomly shuffled as (a) and (b), to three different human annotators with advanced NLP backgrounds and ask them which explanation they prefer: (a), (b) or tie, similar to Wiegreffe et al. (2021a). Finally, we adopt a majority vote to decide the preference on each example if at least two annotators have the same preference; otherwise, we cast that example’s two explanations as tied. In addition, we report the agreement percentage across three levels. Level 0 means all three annotators have different preferences, level 1 means only two annotators have the same preference and level 2 means all three annotators have the same preference. Results are summarized in Table 5.
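The aggregation rule described above (majority vote with a tie fallback, plus the three agreement levels) can be written down compactly. The sketch below assumes each annotator vote is one of the strings `"a"`, `"b"` or `"tie"`; the function name and label encoding are illustrative.

```python
from collections import Counter
from typing import List, Tuple


def aggregate(votes: List[str]) -> Tuple[str, int]:
    """Return (final preference, agreement level) for three annotator votes."""
    if len(votes) != 3:
        raise ValueError("expected exactly three annotator votes")
    label, count = Counter(votes).most_common(1)[0]
    if count == 3:
        return label, 2  # level 2: all three annotators agree
    if count == 2:
        return label, 1  # level 1: exactly two annotators agree
    return "tie", 0      # level 0: all differ, so the example is cast as tied


print(aggregate(["a", "a", "b"]))    # ('a', 1)
print(aggregate(["a", "b", "tie"]))  # ('tie', 0)
```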

As expected, explanations generated by T5 are less preferred than those from GPT-3, but 58% (14%+44%) of the explanations still have better or competitive quality compared to GPT-3. In addition, more than 60% of the explanations show disagreement (7% in level 0 + 56% in level 1). Given that Wiegreffe et al. (2021a) find GPT-3 can generate competitive explanations even compared to human-written ones, we argue that this high disagreement arises because explanations generated by both T5 and GPT-3 are of high quality, making it hard for humans to choose. To verify this hypothesis, we select three examples of T5- and GPT-3-generated explanations used in our human evaluation experiments, as shown in Figure 3. Both T5 and GPT-3 can generate plausible explanations to justify their predictions, and even though T5 loses to GPT-3 in example (c), its explanation is still reasonably good. We also provide examples with incorrect predictions in Appendix B, some of which still have plausible predictions and explanations although they differ from the golden labels. These results demonstrate that explanations generated by our model are of high quality even compared with the strong, 60x larger GPT-3.

Conclusion

In this paper, we leverage explanations from LLM to improve small reasoners in a multi-task learning framework. Extensive experiments on multiple reasoning tasks show that our method can consistently and significantly outperform single-task finetuning baselines across explanation generation methods, multi-task learning setups, training sample sizes and small reasoner sizes, and can outperform finetuning/prompting a strong, 60x larger GPT-3 on CommonsenseQA by up to 9.5% in accuracy. In addition, according to human evaluation, our model can generate high-quality explanations even compared to the strong GPT-3, moving towards more explainable AI.

Limitations

Our approach requires a multi-task learning finetuning approach to integrate explanations into small language models and will require tuning weight $\alpha$ on a development set, which will require more compute during hyper-parameter tuning process. In addition, our work is constrained to textual reasoning problems, lacking more explorations in other reasoning tasks, e.g. symbolic reasoning and arithmetic reasoning, which we plan to leave as future work.

Ethics Statement

Our work is built on top of explanations generated from LLM, which have been observed to capture gender, race and religion biases (Brown et al., 2020; Lucy and Bamman, 2021; Abid et al., 2021). Generated explanations with these possible biases may be integrated into small models during finetuning process and be exposed when these small models generate explanations to justify their predictions. Therefore, our model could potentially share the same kinds of bias as the original LLM used for explanation generation. However, our multi-task learning framework naturally allows us to disable explanation generation and still enjoy performance gains by direct answer prediction without the risk of explicitly exposing these biases.

Acknowledgement

This research was sponsored in part by the DARPA PTG program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.

References

Appendix

Appendix A Prompt details

Here we provide the prompts used in our experiments. Our prompts on the CommonsenseQA and StrategyQA datasets are based on Zelikman et al. (2022) and Wei et al. (2022b), respectively. Explanations in the prompts for OpenbookQA are based on science facts in the OpenbookQA dataset GitHub repository (https://github.com/allenai/OpenBookQA).

Appendix B Explanation examples

Here we further provide three examples as shown in Figure 4, where both T5 and GPT-3 have incorrect predictions. We observe that in both example (b) and (c), T5 and GPT-3 have plausible predictions and explanations although their predictions are different from golden labels.