Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister

Introduction

Despite the impressive few-shot ability offered by large language models (LLMs) (Brown et al., 2020; Chowdhery et al., 2022; Thoppilan et al., 2022; Hoffmann et al., 2022; Smith et al., 2022b; Zhang et al., 2022), these models are challenging to deploy in real-world applications due to their sheer size. Serving a single 175 billion parameter LLM requires at least 350GB of GPU memory using specialized infrastructure (Zheng et al., 2022). To make matters worse, today’s state-of-the-art LLMs are composed of over 500B parameters (Chowdhery et al., 2022), requiring significantly more memory and compute. Such computational requirements are far beyond what most product teams can afford, especially for applications that require low-latency performance.
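As a rough sanity check on the memory figure above, the sketch below multiplies a parameter count by the bytes stored per parameter. The value of 2 bytes per parameter (16-bit weights) is an assumption that is not stated in the text, and the estimate covers weights only, ignoring activations and caches.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    bytes_per_param=2 assumes 16-bit weights; this is an illustrative
    assumption, not a figure taken from the paper.
    """
    return num_params * bytes_per_param / 1e9


llm_gb = weight_memory_gb(175e9)    # the 175B-parameter model from the text
small_gb = weight_memory_gb(770e6)  # a 770M-parameter model for comparison
print(f"175B model: {llm_gb:.0f} GB, 770M model: {small_gb:.2f} GB")
```

Under this assumption the 175B model needs 350 GB for its weights, which matches the lower bound quoted above.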

To circumvent these deployment challenges of large models, practitioners often choose to deploy smaller specialized models instead. These smaller models are trained using one of two common paradigms: finetuning or distillation. Finetuning updates a pretrained smaller model (e.g. BERT (Devlin et al., 2018) or T5 (Raffel et al., 2020)) using downstream human annotated data (Howard and Ruder, 2018). Distillation trains the same smaller models with labels generated by a larger LLM (Tang et al., 2019; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022). Unfortunately, these paradigms reduce model size at a cost: to achieve comparable performance to LLMs, finetuning requires expensive human labels, and distillation requires large amounts of unlabeled data which can be hard to obtain (Tang et al., 2019; Liang et al., 2020).

In this work, we introduce Distilling step-by-step, a new simple mechanism for training smaller models with less training data. Our mechanism reduces the amount of training data required for both finetuning and distillation of LLMs into smaller model sizes. Core to our mechanism is changing our perspective from viewing LLMs as a source of noisy labels to viewing them as agents that can reason: LLMs can produce natural language rationales justifying their predicted labels (Wei et al., 2022; Kojima et al., 2022). For example, when asked “Jesse’s room is 11 feet long and 15 feet wide. If she already has 16 square feet of carpet. How much more carpet does she need to cover the whole floor?”, an LLM can be prompted by the chain-of-thought (CoT) technique (Wei et al., 2022) to provide intermediate rationales “Area = length × width. Jesse’s room has 11 × 15 square feet.” that better connect the input to the final answer “(11 × 15) − 16”. These rationales can contain relevant task knowledge, such as “Area = length × width”, that may originally require much data for small task-specific models to learn. We thus utilize these extracted rationales as additional, richer information to train small models through a multi-task training setup, with both label prediction and rationale prediction tasks (Raffel et al., 2020; Narang et al., 2020).

Distilling step-by-step allows us to learn task-specific smaller models that outperform LLMs using over 500× fewer model parameters, and it does so with far fewer training examples compared to traditional finetuning or distillation (Figure 1). Our results show three promising empirical conclusions across 4 NLP benchmarks. First, compared to both finetuning and distillation, our resulting models achieve better performance with over 50% fewer training examples on average across datasets (and up to over 85% reduction). Second, our models outperform LLMs with much smaller model sizes (up to 2000× smaller), drastically reducing the computation cost required for model deployment. Third, we simultaneously reduce the model size as well as the amount of data required to outperform LLMs. We surpass the performance of 540B parameter LLMs using a 770M T5 model; this smaller model only uses 80% of a labeled dataset that would otherwise be required if using an existing finetuning method. When only unlabeled data is present, our small models still perform on par with or better than LLMs. We outperform 540B PaLM’s performance with only an 11B T5 model. We further show that when a smaller model performs worse than an LLM, Distilling step-by-step can more efficiently leverage additional unlabeled data to match the LLM performance compared to the standard distillation approach.

Related work

Our work distills task-specific knowledge of LLMs into smaller specialist models by leveraging the emergent reasoning capabilities of today’s LLMs. We draw on knowledge distillation research and methods that learn from both human-generated rationales and LLM-generated rationales.

Knowledge distillation has been successfully used to transfer knowledge from larger, more competent teacher models into smaller student models affordable for practical applications (Buciluǎ et al., 2006; Hinton et al., 2015; Beyer et al., 2022; West et al., 2021; Fu et al., 2023). It supports learning from limited labeled data, since the larger teacher model is often used to generate a training dataset with noisy pseudo labels (Chen et al., 2020; Iliopoulos et al., 2022; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022; Agrawal et al., 2022). The one limitation that knowledge distillation often faces is its reliance on large amounts of unlabelled data required to create a useful noisy training dataset. Although prior work has explored using data augmentation techniques to reduce this hunger for data (Tang et al., 2019; Liang et al., 2020; Srinivas and Fleuret, 2018; Milli et al., 2019), we propose an alternative approach: we reduce the need for large unlabeled data by distilling not just labels but also the teacher’s rationales.

Learning with human rationales.

While utilizing LLM-generated rationales is a new exciting area of investigation, using human-generated rationales has a rich history (Hase and Bansal, 2021). For instance, human rationales can be used to regularize model behavior (Ross et al., 2017); they can be used as additional inputs to guide a model’s predictions (Rajani et al., 2019); they can be used to improve overall model performance (Zaidan et al., 2007; Zhang et al., 2016; Camburu et al., 2018; Hancock et al., 2019; Pruthi et al., 2022); and they can be used as gold standard labels to make models more interpretable by generating similar rationales (Wiegreffe et al., 2021; Narang et al., 2020; Eisenstein et al., 2022). Unfortunately, human rationales are expensive.

Learning with LLM generated rationales.

Today’s LLMs are capable of explaining their predictions by generating high-quality reasoning steps (Wei et al., 2022; Kojima et al., 2022). These reasoning steps have been used to augment input prompts to LLMs, improving their few-shot or zero-shot performance (Wei et al., 2022; Kojima et al., 2022; Wang et al., 2022b); reasoning steps have also been used as additional finetuning data to “self-improve” LLMs (Zelikman et al., 2022; Huang et al., 2022). Unfortunately, regardless of how LLMs are improved, their large size limits their utility in most test-time applications.

By contrast, we leverage generated rationales as informative supervision to train smaller task-specific models, i.e. models that can be deployed without incurring large computation or memory costs. Several concurrent works have also proposed a similar idea to ours – that of using extracted rationales as supervision (Wang et al., 2022a; Ho et al., 2022; Magister et al., 2022; Li et al., 2023). Amongst them, PINTO (Wang et al., 2022a) relies on an LLM to generate rationales at test-time, and thus does not fully solve deployment challenges. Compared with Ho et al. (2022) and Magister et al. (2022), we go beyond their experiments to provide a granular study by varying training dataset size, exploring downstream model sizes, and demonstrating the effectiveness of our method on fully unlabeled datasets.

Distilling step-by-step

We propose a new paradigm, Distilling step-by-step, that leverages the ability of LLMs to reason about their predictions to train smaller models in a data-efficient way. Our overall framework is illustrated in Figure 2. Our paradigm has two simple steps: First, given an LLM and an unlabeled dataset, we prompt the LLM to generate output labels along with rationales to justify the labels. Rationales are natural language explanations that provide support for the model’s predicted label (see Figure 2). Second, we leverage these rationales in addition to the task labels to train smaller downstream models. Intuitively, rationales provide richer, more detailed information about why an input is mapped to a specific output label, and often contain relevant task knowledge that may be hard to infer solely from the original inputs.

Recent studies observe one intriguing emerging property of LLMs: their ability to generate rationales that support their predictions (Wei et al., 2022; Kojima et al., 2022). While these studies have largely focused on how to elicit such reasoning capability from LLMs (Nye et al., 2021; Wei et al., 2022; Kojima et al., 2022), we use the generated rationales to train smaller downstream models.
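To make the extraction step more concrete, the following is a minimal sketch of how a few-shot CoT prompt could be assembled and how a completion could be split into a rationale and a label. The `Q:`/`A:` layout, the answer marker string, the function names, the demonstration about the rug, and the hand-written completion are all illustrative assumptions; no LLM is called and none of this is the authors' actual prompt format.

```python
from typing import List, Tuple

# Assumed marker separating rationale from label; not taken from the paper.
ANSWER_MARKER = "So the answer is"


def build_cot_prompt(demos: List[Tuple[str, str, str]], question: str) -> str:
    """Format (question, rationale, label) demonstrations, then the new question."""
    blocks = [f"Q: {q}\nA: {r} {ANSWER_MARKER} {y}." for q, r, y in demos]
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


def parse_completion(completion: str) -> Tuple[str, str]:
    """Split a completion into (rationale, label) at the answer marker."""
    rationale, _, answer = completion.partition(ANSWER_MARKER)
    return rationale.strip(), answer.strip().rstrip(".").strip()


# Hypothetical demonstration written for this sketch.
demos = [(
    "A rug is 3 feet long and 2 feet wide. What is its area?",
    "Area = length x width. The rug has 3 x 2 = 6 square feet.",
    "6",
)]
question = ("Jesse's room is 11 feet long and 15 feet wide. If she already has "
            "16 square feet of carpet, how much more carpet does she need?")
prompt = build_cot_prompt(demos, question)

# Hand-written stand-in for what an LLM might return for this prompt.
completion = ("Area = length x width. Jesse's room has 11 x 15 square feet. "
              "So the answer is (11 x 15) - 16.")
rationale, label = parse_completion(completion)
print(rationale)
print(label)
```

The parsed pair plays the role of the rationale and label that are later used as supervision for the smaller model.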

3.2 Training smaller models with rationales

We first describe the current framework for learning task-specific models. With this framework in place, we extend it to incorporate rationales into the training process. Formally, we denote a dataset as $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N}$ where each $x_i$ represents an input and $y_i$ is the corresponding desired output label. While our framework supports inputs and outputs of any modality, our experiments limit $x$ and $y$ to be natural language. This text-to-text framework (Raffel et al., 2020) encompasses a variety of NLP tasks: classification, natural language inference, question answering and more.

The most common practice to train a task-specific model is to finetune a pretrained model with supervised data (Howard and Ruder, 2018). In the absence of human-annotated labels, task-specific distillation (Hinton et al., 2015; Tang et al., 2019) uses LLM teachers to generate noisy pseudo training labels $\hat{y}_i$ in place of $y_i$ (Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022).

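For both scenarios, the smaller model $f$ is trained to minimize the label prediction loss:

$$\mathcal{L}_{\text{label}} = \frac{1}{N}\sum_{i=1}^{N}\ell(f(x_i), \hat{y}_i), \tag{1}$$

where $\ell$ is the cross-entropy loss between the predicted and target tokens. For ease of exposition, $\hat{y}_i$ in Eq. 1 denotes either the human-annotated label $y_i$ in the standard finetuning case or the LLM-predicted label in the distillation case.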

Multi-task learning with rationales.

To create a more explicit connection between the inputs $x_i$ and the labels $\hat{y}_i$, we use extracted rationales $\hat{r}_i$ as additional supervision. There are several ways to incorporate rationales into the downstream model’s training process. One straightforward approach is to feed $\hat{r}_i$ as an additional input, as proposed by other concurrent research (Rajani et al., 2019; Wang et al., 2022a). In other words, the model $f(x_i,\hat{r}_i)\rightarrow\hat{y}_i$ is trained with both text and rationale $[x, r]$ as inputs, using the same token-level cross-entropy loss $\ell$ as for label prediction:

$$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N}\ell(f(x_i,\hat{r}_i), \hat{y}_i). \tag{2}$$

Unfortunately, this design requires an LLM to first generate a rationale before $f$ can make a prediction. The LLM is still necessary during deployment, limiting its deployability.

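In this work, instead of using rationales as additional model inputs, we frame learning with rationales as a multi-task problem. Specifically, we train the model $f(x_i)\rightarrow(\hat{y}_i,\hat{r}_i)$ to not only predict the task labels but also generate the corresponding rationales given the text inputs:

$$\mathcal{L} = \mathcal{L}_{\text{label}} + \lambda\,\mathcal{L}_{\text{rationale}}, \tag{3}$$

where $\mathcal{L}_{\text{label}}$ is the label prediction loss described above, $\lambda$ weights the two terms, and $\mathcal{L}_{\text{rationale}}$ is the rationale generation loss, computed with the same token-level cross-entropy $\ell$:

$$\mathcal{L}_{\text{rationale}} = \frac{1}{N}\sum_{i=1}^{N}\ell(f(x_i), \hat{r}_i). \tag{4}$$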

The rationale generation loss enables the model to learn to generate the intermediate reasoning steps for the prediction, and could therefore guide the model in better predicting the resultant label. This is our proposed Distilling step-by-step. Compared with Eq. 2, the rationale $\hat{r}_i$ is not required at test time, which removes the need for an LLM at test time.
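A small numerical sketch of the multi-task objective may help: it computes a token-level cross-entropy for the label target and for the rationale target of a single example and adds them with a weight. The function names, the toy probability tables, and the default weight `lam=1.0` are assumptions made for illustration; a real implementation would obtain the probabilities from the model's decoder.

```python
import math
from typing import Sequence


def token_cross_entropy(probs: Sequence[Sequence[float]],
                        target_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target tokens.

    probs[t][v] is the predicted probability of vocabulary item v at step t.
    """
    nll = -sum(math.log(p[t]) for p, t in zip(probs, target_ids))
    return nll / len(target_ids)


def multitask_loss(label_probs, label_ids, rationale_probs, rationale_ids,
                   lam: float = 1.0) -> float:
    """Label loss plus weighted rationale loss for one example.

    lam is an assumed default weight, not a value taken from the paper.
    """
    label_loss = token_cross_entropy(label_probs, label_ids)
    rationale_loss = token_cross_entropy(rationale_probs, rationale_ids)
    return label_loss + lam * rationale_loss


# Toy predictions over a two-word vocabulary (made up for this sketch).
label_probs = [[0.5, 0.5]]
rationale_probs = [[0.5, 0.5], [0.25, 0.75]]
loss = multitask_loss(label_probs, [0], rationale_probs, [1, 1])
print(f"combined loss: {loss:.4f}")
```

Setting `lam=0.0` recovers plain label prediction, which makes it easy to compare the two objectives in this toy setting.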

We prepend “task prefixes” ([label], [rationale]) to the input examples and train the smaller model to output $\hat{y}_i$ when [label] is provided and to produce $\hat{r}_i$ when [rationale] is provided (Raffel et al., 2020).
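The prefixing scheme can be sketched as a function that turns one (input, label, rationale) triple into two text-to-text training examples. The dictionary keys, the function name, and the single space between prefix and input are assumptions for illustration only.

```python
from typing import Dict, List


def make_multitask_examples(x: str, label: str, rationale: str) -> List[Dict[str, str]]:
    """Build one label-prediction and one rationale-generation example."""
    return [
        {"input": f"[label] {x}", "target": label},
        {"input": f"[rationale] {x}", "target": rationale},
    ]


examples = make_multitask_examples(
    x="Jesse's room is 11 feet long and 15 feet wide. ...",
    label="(11 x 15) - 16",
    rationale="Area = length x width. Jesse's room has 11 x 15 square feet.",
)
for ex in examples:
    print(ex["input"][:12], "->", ex["target"])
```

At test time only the `[label]` form of the input would be needed, so no rationale has to be generated to obtain a prediction.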

Experiments

We empirically validate the effectiveness of Distilling step-by-step. First, we show that when compared to standard finetuning and task distillation approaches, Distilling step-by-step achieves better performance with far fewer training examples, substantially improving the data efficiency of learning small task-specific models (Sec. 4.1). Second, we show that Distilling step-by-step surpasses the performance of LLMs with much smaller model sizes, drastically lowering the deployment cost compared to LLMs (Sec. 4.2). Third, we investigate the minimum resources required, w.r.t. both number of training examples and model size, for Distilling step-by-step to outperform LLMs. We show that Distilling step-by-step outperforms LLMs while using less data and a smaller model, simultaneously improving both data efficiency and deployability (Sec. 4.3). Finally, we perform ablation studies to understand the influence of different components and design choices in the Distilling step-by-step framework (Sec. 4.4).

In the experiments, we consider the 540B PaLM model (Chowdhery et al., 2022) as the LLM. For task-specific downstream models, we use T5 models (Raffel et al., 2020) where we initialize the models with pretrained weights obtained from publicly available sources (https://huggingface.co/). For CoT prompting, we follow Wei et al. (2022) when available, and curate our own examples for new datasets. We include more implementation details in Appendix A.1.

Datasets.

We conduct the experiments on 4 popular benchmark datasets across 3 different NLP tasks: e-SNLI (Camburu et al., 2018) and ANLI (Nie et al., 2020) for natural language inference; CQA (Talmor et al., 2019; Rajani et al., 2019) for commonsense question answering; SVAMP (Patel et al., 2021) for arithmetic math word problems. We include more dataset details in Appendix A.2.

4.1 Reducing training data

We compare Distilling step-by-step to the two most common methods for learning task-specific models: (1) standard finetuning when human-labeled examples are available, and (2) standard task distillation when only unlabeled examples are available. Specifically, standard finetuning refers to the prevailing pretrain-then-finetune paradigm that finetunes a model with ground-truth labels via standard label supervision (Howard and Ruder, 2018). On the other hand, when only unlabeled examples are available, standard task distillation learns the task-specific model by treating a teacher LLM’s predicted labels as ground truths (Hinton et al., 2015; Chen et al., 2020; Wang et al., 2021; Smith et al., 2022a; Arora et al., 2022).

In the following set of experiments, we fix the task-specific models to be 220M T5-Base models, and compare the task performances achieved by different methods under varying numbers of available training examples.

When finetuned with human-labeled examples, Figure 4 shows that Distilling step-by-step consistently achieves better performance than standard finetuning across varying numbers of labeled examples used. Furthermore, we see that Distilling step-by-step can achieve the same performance as standard finetuning with far fewer labeled examples. In particular, by using only 12.5% of the full e-SNLI dataset, Distilling step-by-step can outperform standard finetuning trained with 100% of the full dataset. Similarly, we achieve 75%, 25%, and 20% reduction in training examples required to outperform standard finetuning on ANLI, CQA, and SVAMP respectively.

Distilling step-by-step outperforms standard distillation with much less unlabeled examples.

When only unlabeled data is available, we compare Distilling step-by-step to standard task distillation. In Figure 5, we observe an overall similar trend to the finetuning setup. Specifically, we see that Distilling step-by-step outperforms standard task distillation on all 4 datasets under different numbers of unlabeled data used. We also see that Distilling step-by-step requires much less unlabeled data to outperform standard task distillation. For instance, we need only 12.5% of the full unlabeled dataset to outperform the performance achieved by standard task distillation using 100% of the training examples on the e-SNLI dataset.

4.2 Reducing model size

In the following set of experiments, we hold the training set size fixed (using 100% of the datasets), and compare varying sizes of small T5 models trained with Distilling step-by-step and standard approaches to LLMs. Specifically, we consider 3 different sizes of T5 models, i.e., 220M T5-Base, 770M T5-Large, and 11B T5-XXL. For LLMs, we include two baseline methods: (1) Few-shot CoT (Wei et al., 2022), and (2) PINTO tuning (Wang et al., 2022a). Few-shot CoT directly utilizes CoT demonstrations to prompt the 540B PaLM to generate intermediate steps before predicting the final labels without any further finetuning of the LLM. PINTO tuning refers to our extension of Wang et al. (2022a) to handle tasks beyond question-answering, which are not studied by Wang et al. (2022a). Here, we finetune a 220M T5-Base model on top of the outputs generated from the PaLM model, which can be viewed as a finetuning method for LLMs with additional parameters (Zhang et al., 2020; Lester et al., 2021).

We present the experimental results under the two broad scenarios of having access to labeled datasets or unlabeled datasets in Figure 6 and Figure 7, respectively. We plot each method by its deployed model size for prediction ($x$-axis) and its corresponding task performance ($y$-axis).

In Figure 6 and Figure 7 respectively, we see that Distilling step-by-step consistently improves over standard finetuning and standard distillation across all sizes of T5 models. The improvements are most pronounced on ANLI, where Distilling step-by-step outperforms standard finetuning and distillation by an average of 8% and 13% on task accuracy respectively.

Distilling step-by-step outperforms LLMs by using much smaller task-specific models.

In Figure 6, when human-labeled datasets are available, Distilling step-by-step can always outperform Few-shot CoT and PINTO tuning on all 4 datasets considered, by using much smaller T5 models. For instance, we can achieve better performance than the 540B PaLM model’s Few-shot CoT with a 220M (over 2000× smaller) T5 model on e-SNLI, 770M (over 700× smaller) T5 models on ANLI and SVAMP, and an 11B (over 45× smaller) T5 model on CQA. These results hold true even when the 540B PaLM model is further finetuned on available labeled data with PINTO tuning. (We note that PETuning methods may outperform PINTO tuning. However, they require massive resources in both training and deployment, which is not the focus of this work.)
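The size ratios quoted above follow directly from the parameter counts, as the short check below shows. The dictionary keys are only labels chosen for this sketch.

```python
PALM_PARAMS = 540e9  # 540B PaLM, as stated in the text

t5_params = {
    "T5-Base": 220e6,
    "T5-Large": 770e6,
    "T5-XXL": 11e9,
}

# How many times smaller each T5 model is than the 540B LLM.
ratios = {name: PALM_PARAMS / n for name, n in t5_params.items()}
for name, ratio in ratios.items():
    print(f"{name}: about {ratio:.0f}x smaller than 540B PaLM")

# Size of the largest T5 model as a fraction of the LLM.
xxl_fraction = t5_params["T5-XXL"] / PALM_PARAMS
print(f"T5-XXL is {xxl_fraction:.1%} of PaLM's size")
```

The computed ratios are roughly 2455×, 701×, and 49×, consistent with the "over 2000×", "over 700×", and "over 45×" figures in the text.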

In Figure 7, by only utilizing unlabeled examples, Distilling step-by-step also outperforms the teacher LLM on 3 out of 4 datasets. Specifically, Distilling step-by-step surpasses the 540B PaLM model’s Few-shot CoT performance by using an 11B T5 model, less than 3% of PaLM’s size. On SVAMP, where the distilled model underperforms, we hypothesize that the performance gap is due to the relatively small number of data points in the dataset (i.e., 800). In response, we propose to augment the dataset with additional unlabeled examples to close the performance gap, as shown next.

Unlabeled data augmentation further improves Distilling step-by-step.

We augment the SVAMP training set with unlabeled examples from the ASDiv dataset (Miao et al., 2020). The ASDiv dataset contains a total of 2,305 examples, where each example is a math word problem similar to the ones in SVAMP. In Figure 7 on SVAMP, we show the performance of Distilling step-by-step and standard task distillation using an 11B T5 model after augmenting the training set with ASDiv. We see that the data augmentation substantially improves the performance of both Distilling step-by-step and standard task distillation. However, even with the added unlabeled examples, standard task distillation still underperforms Few-shot CoT. On the other hand, Distilling step-by-step exploits the added examples much more efficiently and reaches the same performance level as Few-shot CoT, again using a T5 model of size less than 3% of the 540B PaLM.

4.3 Outperforming LLMs using minimum model size and least training data

Here, using the LLM’s performance as an anchor point, we explore the most efficient resource requirements, in terms of both number of training examples and deployed model size, that Distilling step-by-step and standard finetuning/distillation need to outperform the LLM. We present the results, again under the human-labeled setting and the unlabeled setting, in Figure 8 and Figure 9 respectively. We visualize the results by plotting different resultant models by (1) the number of training examples used ($x$-axis), (2) the final task performance achieved ($y$-axis), and (3) the size of the model (visualized by the size of the shaded area).

On all datasets in Figure 8, we see that Distilling step-by-step outperforms PaLM’s Few-shot CoT with much smaller T5 models using only a subset of the available training examples. Specifically, on e-SNLI, Distilling step-by-step can achieve better performance than Few-shot CoT with a model over 2000× smaller (220M T5) and only 0.1% of the full dataset. In Figure 9, where only unlabeled datasets are available, we observe the same trend: Distilling step-by-step can, most of the time, outperform Few-shot CoT with a smaller model as well as less data. For instance, on ANLI, Distilling step-by-step outperforms the LLM with a 45× smaller model and 50% of the full unlabeled set.

Standard finetuning and distillation require more data and larger model.

Finally, in Figure 8 and Figure 9, we see that standard finetuning and distillation often need either more data or larger models to match the LLM’s performance. For instance, on e-SNLI in Figure 8, we observe that Distilling step-by-step outperforms the LLM using only 0.1% of the dataset, while standard finetuning requires more data to match the performance. Furthermore, on ANLI in Figure 8, we observe that Distilling step-by-step can outperform PaLM using a 770M model with only 80% of the training set, while standard finetuning struggles to match the LLM even using the full dataset and thus requires a larger model to close the performance gap.

4.4 Further ablation studies

So far, we have focused on showing the effectiveness of Distilling step-by-step on reducing the training data required for finetuning or distilling smaller task-specific models. In this section, we perform further studies to understand the influence of different components in the Distilling step-by-step framework. Specifically, we study (1) how different LLMs, from which the rationales are extracted, affect the effectiveness of Distilling step-by-step, and (2) how the multi-task training approach compares to other potential design choices in training small task-specific models with LLM rationales. Here, we fix the small task-specific models to be 220M T5 models, and utilize 100% of the data on all datasets.

In addition to using 540B PaLM as the LLM, here we consider a relatively smaller LLM, the 20B GPT-NeoX model (Black et al., 2022), from which we extract rationales for Distilling step-by-step. In Table 1, we see that when coupled with LLMs of different sizes, Distilling step-by-step can still provide performance improvements compared to standard finetuning. However, the performance lift is smaller when rationales are extracted from the 20B GPT-NeoX model instead of from the 540B PaLM. This may be because the larger PaLM model provides higher-quality rationales that are more beneficial for learning the task.

Multi-task training is much more effective than single-task rationale and label joint prediction.

There are different possible ways to train task-specific models with LLM rationales as output supervision. One straightforward approach is to concatenate the rationale $\hat{r}_i$ and label $\hat{y}_i$ into a single sequence $[\hat{r}_i,\hat{y}_i]$ and treat the entire sequence as the target output in training small models, as considered in Magister et al. (2022); Ho et al. (2022):

$$\mathcal{L}_{\text{single}} = \frac{1}{N}\sum_{i=1}^{N}\ell(f(x_i), [\hat{r}_i,\hat{y}_i]), \tag{5}$$

where $\ell$ is the token-level cross-entropy loss.
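For contrast with the multi-task setup, the sketch below builds the concatenated single-task target for one example. The space separator and the whitespace-based token counts are assumptions used only to illustrate that the label appears at the end of a much longer target sequence.

```python
def single_task_target(rationale: str, label: str, sep: str = " ") -> str:
    """Concatenate rationale and label into one target sequence [r, y].

    The separator is an illustrative assumption.
    """
    return f"{rationale}{sep}{label}"


rationale = "Area = length x width. Jesse's room has 11 x 15 square feet."
label = "(11 x 15) - 16"
target = single_task_target(rationale, label)

# Rough whitespace token counts, only to illustrate decoding length.
single_task_len = len(target.split())
label_only_len = len(label.split())
print(f"single-task target: {single_task_len} tokens, label alone: {label_only_len} tokens")
```

In this formulation the model has to generate the rationale tokens before it reaches the label, whereas a label-only target can be decoded directly.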

In Table 2, we compare this single-task training approach to our proposed multi-task training approach for utilizing LLM rationales. We see that not only does multi-task training consistently lead to better performance, but single-task training with LLM rationales can at times lead to worse performance than standard finetuning, e.g., on ANLI and CQA. In fact, similar results have also been observed in Wiegreffe et al. (2021); Magister et al. (2022); Ho et al. (2022): simply treating rationale and label predictions as a single joint task may harm the model’s performance on label prediction. This validates our use of the multi-task training approach, and highlights the need to treat the rationales carefully so as to unleash their actual benefits.

Discussion

We propose Distilling step-by-step to extract rationales from LLMs as informative supervision in training small task-specific models. We show that Distilling step-by-step reduces the training dataset required to curate task-specific smaller models; it also reduces the model size required to achieve, and even surpass, the original LLM’s performance. Distilling step-by-step proposes a resource-efficient training-to-deployment paradigm compared to existing methods. Further studies demonstrate the generalizability and the design choices made in Distilling step-by-step. Finally, we discuss the limitations, future directions and ethics statement of our work below.

Limitations

There are a number of limitations with our approach. First, we require users to produce a few example demonstrations (roughly 10-shot for all tasks) in order to use the few-shot CoT (Wei et al., 2022) prompting mechanism. This limitation can be mitigated by using recent advances that suggest that rationales can be elicited without any user-annotated demonstrations (Kojima et al., 2022). Second, training task-specific models with rationales incurs a slight training-time computation overhead. However, at test time, our multi-task design naturally avoids the computation overhead since it allows one to only predict labels without generating the rationales. Finally, while we observe success using LLM rationales, there is evidence that LLMs exhibit limited reasoning capability on more complex reasoning and planning tasks (Valmeekam et al., 2022). Future work should characterize how rationale quality affects Distilling step-by-step.

Ethics statement

It is worth noting that the behavior of our downstream smaller models is subject to biases inherited from the larger teacher LLM. We envision that the same research progress in reducing anti-social behaviors in LLMs can also be applied to improve smaller language models.

Appendix A Experiment detail

We perform our experiments on cloud A100×16 GPU instances. We train the T5 models with the following hyperparameters, using publicly available packages from https://github.com/huggingface/transformers:

T5-Base (220M) and T5-Large (770M): We train the models with learning rate = $5\times 10^{-5}$, batch size = 64, max input length = 1024, for a maximum of 10,000 steps.

T5-XXL (11B): We train the models with learning rate = $5\times 10^{-5}$, batch size = 32, max input length = 1024, for a maximum of 4,000 steps.
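For readers who want these settings in machine-readable form, the hyperparameters above can be collected as follows. The `TrainConfig` class, its field names, and the dictionary keys are hypothetical names chosen for this sketch; they are not the argument names of any particular training library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters listed in the appendix (field names are illustrative)."""
    learning_rate: float
    batch_size: int
    max_input_length: int
    max_steps: int

    def max_examples_seen(self) -> int:
        """Upper bound on training examples processed (batch size x steps)."""
        return self.batch_size * self.max_steps


CONFIGS = {
    "t5-base": TrainConfig(5e-5, 64, 1024, 10_000),
    "t5-large": TrainConfig(5e-5, 64, 1024, 10_000),
    "t5-xxl": TrainConfig(5e-5, 32, 1024, 4_000),
}

for name, cfg in CONFIGS.items():
    print(name, cfg, cfg.max_examples_seen())
```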

We report all the results over 4 random runs, and include the standard error in the presented plots.
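As an illustration of how a standard error over a handful of runs can be computed, the sketch below uses the sample standard deviation divided by the square root of the number of runs. The accuracy values are made up for this example, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics
from typing import Sequence


def standard_error(values: Sequence[float]) -> float:
    """Standard error of the mean, using the sample standard deviation."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Made-up accuracies from 4 hypothetical runs; not results from the paper.
runs = [0.70, 0.72, 0.71, 0.73]
mean_acc = statistics.mean(runs)
se = standard_error(runs)
print(f"mean = {mean_acc:.3f}, standard error = {se:.4f}")
```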

A.2 Datasets

We provide more detailed descriptions on the datasets used in our experiments. We include the sources from which we obtain the datasets as well as their original sources released from the authors. We refer readers to these sources for their license or terms for use and/or distribution. To the best of our knowledge, the datasets used do not contain information that names or uniquely identifies individual people or offensive content.

e-SNLI: The dataset was originally released in (Camburu et al., 2018), and made publicly available at https://github.com/OanaMariaCamburu/e-SNLI. We obtain the dataset from https://huggingface.co/datasets/esnli.

ANLI: The dataset was originally released in (Nie et al., 2020), and made publicly available at https://github.com/facebookresearch/anli. We obtain the dataset from https://huggingface.co/datasets/anli. We use the R1 split in our experiments.

CQA: The dataset was originally released in (Talmor et al., 2019), and made publicly available at https://www.tau-nlp.sites.tau.ac.il/commonsenseqa. It was then augmented with human-labeled explanations by (Rajani et al., 2019), which is available at https://github.com/salesforce/cos-e. We obtain the dataset used in our experiments from https://huggingface.co/datasets/cos_e.

SVAMP: The dataset was originally released in (Patel et al., 2021). We obtain the dataset from https://github.com/arkilpatel/SVAMP.

ASDiv: The dataset was originally released in (Miao et al., 2020). We obtain the dataset from https://github.com/chaochun/nlu-asdiv-dataset.

For each dataset, we randomly subsample 10% of the original training set to serve as a validation set when a validation set is not originally provided. For CQA, we use the original validation set to serve as our test set since the ground-truth labels are not available for the original test set. We provide the dataset statistics in Table 3.
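The validation split described above can be sketched as a seeded random subsample. The function name, the fixed seed, and the rounding rule are assumptions for illustration; the text only states that 10% of the training set is sampled at random.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_validation(examples: Sequence[T], val_fraction: float = 0.1,
                           seed: int = 0) -> Tuple[List[T], List[T]]:
    """Randomly hold out a fraction of the training examples for validation."""
    rng = random.Random(seed)  # local RNG so the split is reproducible
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    n_val = round(len(examples) * val_fraction)
    val_idx = set(indices[:n_val])
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [examples[i] for i in sorted(val_idx)]
    return train, val


# Toy stand-in for a training set of 100 examples.
toy_examples = [f"example-{i}" for i in range(100)]
train_set, val_set = split_train_validation(toy_examples)
print(len(train_set), len(val_set))
```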