Specializing Smaller Language Models towards Multi-Step Reasoning

Yao Fu, Hao Peng, Litu Ou, Ashish Sabharwal, Tushar Khot

Introduction

Recently, the NLP community has been impressed by the strong abilities of large language models (Brown et al., 2020; Chowdhery et al., 2022). Wei et al. (2022a) discuss the emergent abilities of large language models: abilities that seem to exist only in large models (more than 100B parameters), but not in small models. A typical example (also the first discovered emergent ability) is performing multi-step reasoning on math word problems by chain-of-thought (CoT) prompting (Wei et al., 2022b), where the model generates a step-by-step reasoning chain to help reach the final answer. The existence of such abilities has a profound influence on the community: on the positive side, such abilities open countless opportunities for new research directions; on the negative side, very few organizations have the compute to even fine-tune 100B-scale models, making such abilities extremely hard to access. It would be ideal if smaller models could also obtain emergent abilities like math CoT reasoning, so that they can be accessed by a larger range of researchers and practitioners. However, preliminary results show that if the model scale is small (empirically less than 100B parameters), CoT exhibits a flat, sometimes even near-zero scaling curve (Wei et al., 2022a; Wei et al., 2022b). Later, the scaling curve of smaller models is partially improved in Chung et al. (2022), but it is still worse than that of large models. These results are rather pessimistic, since they suggest that increasing CoT performance for smaller models can be challenging. At the current stage, the community is eager to know to what extent such abilities can be further improved in smaller models.

This paper addresses the problem of CoT reasoning for smaller models by model specialization. Our hypothesis is that large models ( $\geq$ 100B) have strong modeling power, but that power is spread over a large spectrum of tasks. Small models ( $\leq$ 10B) have limited capacity, but if we concentrate that capacity on a target task, the model may still achieve decently improved performance. There exists promising preliminary work on smaller models’ chain-of-thought abilities, such as UL2 (Tay et al., 2022) and FlanT5 (Chung et al., 2022), but these models focus on generic abilities and, consequently, their (limited) power is not concentrated. In our experiments, we show that by paying the price of decreased abilities in generic tasks (specifically, we lose a large portion of accuracy on the BigBench Hard suite; Suzgun et al., 2022), we can lift the scaling curve of CoT reasoning on small FlanT5 models (250M, 760M, and 3B) by a large margin (an average +10 accuracy gain) on a suite of 4 math reasoning tasks (1 in-distribution and 3 out-of-distribution). This means that we can indeed move the model’s power from generic abilities and concentrate it on the target math CoT.

Our approach is to fine-tune an instruction-tuned model (FlanT5) by distilling chain-of-thought reasoning paths on the GSM8K data from a large teacher model (GPT-3.5 code-davinci-002; Chen et al., 2021), then perform model selection on the average performance over three held-out math reasoning datasets to ensure the model’s out-of-distribution generalization. Although distillation per se is a well-studied area, there are multiple caveats in our process, as we will demonstrate: (1). The teacher model code-davinci-002 and our student model FlanT5 use different tokenizers; we address the tokenizer alignment problem by dynamic programming. (2). Distillation induces different performance on an instruction-tuned checkpoint (in our case, FlanT5) and on the raw pretrained checkpoint (T5): specialized FlanT5 performs better, but specialized T5 achieves a larger accuracy gain. (3). At the late training stage, the model’s in-distribution and out-of-distribution (OOD) performance fluctuate differently, so if one wants better OOD generalization, model selection should be performed on held-out math datasets rather than on the validation portion of the tuning data. (4). Multiple tradeoffs happen during the distillation/specialization process: as we start distillation, on the BigBench Hard test suite (the measure of generic ability), the model immediately loses all its CoT prompting abilities and gradually loses a large portion (but not all) of its answer-only prompting abilities. The data format we use for tuning is also closely related to model ability: tuning on in-context examples enables both in-context and zero-shot performance, whereas tuning on zero-shot examples trades away the model’s in-context ability for increased zero-shot ability.

These findings deepen our understanding of language models’ chain-of-thought reasoning behavior in multiple aspects: (1). The previous hypothesis is that CoT has near-flat scaling curves at small scale; we show that we can lift the scaling curve by concentrating the model’s capacity on a target ability. This indicates that chain-of-thought might not be an emergent ability because, after specialization, smaller models’ scaling curves become log-linear, just like those of large models (Kaplan et al., 2020; Hoffmann et al., 2022). (2). Previous observations of LLM behaviors indicate complex tradeoffs and balances of model ability across multiple dimensions; we give a detailed description of how we move the model’s power from generic abilities to a target ability, clearly showing what can be gained at what cost. (3). Classical model selection theory selects the model on the validation portion of the same dataset; we instead select the model based on its performance on different math reasoning datasets, to prevent overfitting to a single dataset. We hope our practice and discoveries can serve as an example attempt towards strong specialized smaller models.

Background

Large Language Models’ Abilities. Large language models have significantly changed the research paradigm in NLP by showing strong abilities along multiple dimensions (Brown et al., 2020; Hoffmann et al., 2022; Chowdhery et al., 2022; Wei et al., 2022a). Currently, the new recipe for training LLMs is to first train a base model (e.g., GPT-3, PaLM, OPT), then elicit the abilities of the base model by instruction tuning (e.g., GPT-3 $\to$ InstructGPT, Ouyang et al., 2022; PaLM $\to$ FlanPaLM, Chung et al., 2022; OPT $\to$ OPT-IML, Iyer et al., 2022; also see Fig. 1A, steps 1 and 2). For the base model, Wei et al. (2022b) initially show that the chain-of-thought performance curve is near-zero if the model size is smaller than 100B. Later, Chung et al. (2022) updated this hypothesis by showing that CoT can be unlocked if CoT data is included as one particular type of instruction, but their model’s performance is not as good because its ability is spread over multiple dimensions. This work shows that CoT performance can be significantly lifted if we concentrate the model’s power on a target ability (Fig. 1A, step 3).

Specialized Language Models. Although modern language models show strong generic abilities in multiple directions, recent analysis (Fu et al., 2022) shows that models do have different focuses (e.g., code-davinci-002 for code and text-davinci-003 for text). Ability tradeoffs happen at all scales. For large models, such a tradeoff does not have to be all or nothing: code-davinci-002, although specialized for code, can still solve many text problems. Small models, due to their limited capacity, have to trade all generic abilities for one special ability. One example is GitHub Copilot, which is supposedly a small 12B model (Thakkar, 2022). The actual practice of specialization is simply finetuning: to specialize a model towards a target ability, one tunes the model on the related data, which is the practice of concurrent work on smaller models’ CoT ability (Magister et al., 2022; Shridhar et al., 2022; Ho et al., 2022). The problem here is how to generalize beyond the tuning data, as small models may simply overfit the tuning distribution and struggle to generalize when the distribution shifts (Liu et al., 2022; Si et al., 2022). So far, the community’s hypothesis about OOD generalization involves two important aspects: (1). model scale (Chowdhery et al., 2022); (2). instruction tuning (Chung et al., 2022), both of which we also study. These factors mark the differences between our work and the concurrent distillation work: we show how the model trades generic abilities for the target ability, and how model scale and instruction tuning help the model gain better in-distribution and OOD performance.

Distillation and Data Augmentation. Our approach of using data generated from code-davinci-002 to tune smaller FlanT5 models can be viewed as either distillation (Tan et al., 2019) or data augmentation (Li et al., 2022). We note that we merely use the generated data as a tool for model specialization, and the specialization data could also come from other sources such as human annotation. Our focus is to study the ability tradeoff during specialization, not to contribute directly to the distillation or data augmentation literature.

Most closely related works. There are two threads of most closely related work: (1). FlanT5 (Chung et al., 2022) and UL2 (Tay et al., 2022), which are the first works discussing smaller models’ CoT ability, but they focus on generic CoT while we trade generic ability for math CoT. (2). Language model self-improvement (Huang et al., 2022), which also uses CoT data augmentation, but only considers large models and does not show the tradeoff between model abilities. Here we focus on small models and clearly show the price of ability improvements.

Specializing Multi-Step Reasoning

Our objective is to study what it takes to improve smaller models’ chain-of-thought math reasoning. We use GSM8K (Cobbe et al., 2021) as our seed dataset because it is one of the datasets with the most diverse math reasoning problems, and we test the model’s performance on three additional math datasets (MultiArith, ASDiv, and SVAMP; Wei et al., 2022b) to show that the model generalizes to OOD data. We further use BigBench Hard to test the model’s generic reasoning ability, demonstrating the tradeoff between generic and target abilities. We use T5 (raw pretrained checkpoint) and FlanT5 (instruction-tuned checkpoint) as our base models, and use code-davinci-002 to generate the distillation/specialization data.

Distillation from Code-Davinci-002. Given a corpus of training questions, we use code-davinci-002 to generate 40 new CoT solutions for each question, then take the ones that lead to the correct answer as our training data. One solution consists of an answer and a chain of thought explaining the intermediate steps towards the answer. In addition to the standard finetuning setting, where one uses the question as the input and the [CoT, answer] pair as the output (Fig. 1 B4), we further consider three additional data formats: (1). in-context answer-only (Fig. 1 B1), where we do not use the CoT data (hence the name “answer-only”) and prepend 4 in-context examples before the question (hence the name “in-context”). We prepend the in-context examples because previous work shows that tuning with in-context examples improves the model’s in-context learning ability (Min et al., 2022). (2). in-context chain-of-thought (Fig. 1 B2), where we add CoT to both the in-context examples and the output. (3). zero-shot answer-only, where we directly input the question and output the answer. We include answer-only data because previous work shows that it improves performance. In our experiments, we will show that in-context data induces zero-shot ability, but zero-shot data sacrifices in-context learning ability. We note that there also exist techniques, such as adding a calculator (Cobbe et al., 2021) or self-consistency decoding (Wang et al., 2022), that can further improve performance. These techniques are orthogonal to the distillation we use and can be integrated into our work for better performance. Since our focus is the balance between the models’ special and generic abilities, we leave the integration of these orthogonal techniques to future work.
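The filtering step and the four data formats described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors’ pipeline: the `Solution` container, the function names, and the `Question: ... Answer: ...` template are all hypothetical, and answer correctness is assumed to be checked by exact string match.

```python
from dataclasses import dataclass


@dataclass
class Solution:
    cot: str     # chain-of-thought text sampled from the teacher
    answer: str  # final answer extracted from the sample


def keep_correct(samples, gold_answer):
    """Keep only sampled solutions whose final answer matches the gold answer."""
    return [s for s in samples if s.answer.strip() == gold_answer.strip()]


def render_target(solution, with_cot):
    """Output text: [CoT, answer] or answer only (hypothetical template)."""
    if with_cot:
        return f"{solution.cot} The answer is {solution.answer}."
    return solution.answer


def build_instance(question, solution, demos, in_context, with_cot):
    """Build one (input, output) pair in one of the four formats.

    demos is a list of (question, Solution) pairs used as in-context examples;
    it is ignored when in_context is False (the zero-shot formats).
    """
    prefix = ""
    if in_context:
        rendered = [
            f"Question: {q}\nAnswer: {render_target(s, with_cot)}" for q, s in demos
        ]
        prefix = "\n\n".join(rendered) + "\n\n"
    source = f"{prefix}Question: {question}\nAnswer:"
    return source, render_target(solution, with_cot)
```

Setting `in_context` and `with_cot` independently yields the four combinations: in-context answer-only, in-context CoT, zero-shot answer-only, and zero-shot CoT.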

In terms of training objectives, in the distillation literature, there are typically two types of distillation approaches: (1). sample matching, where one trains the student model on the data generated by the teacher. In our case, sample matching means we directly optimize the student’s likelihood on the data generated by code-davinci-002. (2). distribution matching, where one minimizes the KL divergence between the student’s output distribution (in our case, the per-step autoregressive distribution) and the teacher’s. Usually, distribution matching is shown to achieve faster convergence and better performance than sample matching, so we use distribution matching as our training objective. Distribution matching has an additional challenge in storing the distribution parameter: at each step, we need to store the whole distribution defined on the vocabulary $\mathcal{V}$ , so the size of the dataset is $|\mathcal{V}|$ times larger than sample matching. Yet the OpenAI API only grants access to the 5 most probable tokens at each decoding step, but not the probability distribution over the entire vocabulary. Although the per-step distribution only covers the top 5 tokens, most of the time their probability sum is close to 1, being a good enough approximation of the full vocabulary distribution. We set to zero the probabilities of tokens not in the top 5.
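To make the two objectives concrete, the sketch below computes both losses for a single decoding step. It rests on assumptions not stated in the text: the KL direction is taken as KL(teacher || student), which stays finite when the teacher assigns zero probability outside its top tokens, and the truncated teacher distribution is renormalized to sum to 1. All function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def top_k_teacher(top_logprobs, vocab_size):
    """Dense teacher distribution from a {token_id: logprob} dict.

    Tokens outside the returned top-k get probability zero; the remaining
    mass is renormalized (an assumption of this sketch).
    """
    probs = [0.0] * vocab_size
    for token_id, logprob in top_logprobs.items():
        probs[token_id] = math.exp(logprob)
    total = sum(probs)
    return [p / total for p in probs]


def distribution_matching_loss(teacher, student_logits):
    """Per-step KL(teacher || student); terms with zero teacher mass vanish."""
    student = softmax(student_logits)
    return sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0.0)


def sample_matching_loss(token_id, student_logits):
    """Per-step negative log-likelihood of the token the teacher generated."""
    return -math.log(softmax(student_logits)[token_id])
```

When the teacher distribution is one-hot, the two losses coincide, which is why falling back to a one-hot target at a given step reduces distribution matching to sample matching at that step.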

Aligning tokenizers by dynamic programming. One problem when matching the two distributions is the misalignment between the GPT tokenizer and the T5 tokenizer. We solve this problem by dynamic programming. Specifically, given two sequences of tokens $[\mathbf{s}_{1:L},\mathbf{t}_{1:N}]$ , our objective is to find an alignment that minimizes the total cost of editing one sequence into the other. Our dynamic program is a slight tweak of the textbook dynamic programming algorithms used in bioinformatics for sequence alignment (such as the Needleman–Wunsch algorithm; Needleman & Wunsch, 1970) and in signal processing (such as dynamic time warping; Senin, 2008). The recursion function is:

$$f(i,j)=\min\{f(i-1,j),\;f(i,j-1),\;f(i-1,j-1)\}+c(\mathbf{s}_{i},\mathbf{t}_{j})$$

where $f(i,j)$ denotes the total cost of aligning $\mathbf{s}_{1:i}$ and $\mathbf{t}_{1:j}$ , and $c(\mathbf{s}_{i},\mathbf{t}_{j})$ is the predefined string edit distance between tokens $\mathbf{s}_{i}$ and $\mathbf{t}_{j}$ . Our algorithm does not enforce one-to-one matching between tokens in the two sequences: one token in $\mathbf{s}$ might align with multiple tokens in $\mathbf{t}$ and vice versa. Fig. 1C gives an example alignment. If there exists a one-to-one mapping between a GPT token and a T5 token, we use the GPT distribution as the T5 distribution. If the mapping is not one-to-one, e.g., two T5 tokens map to one GPT token, or two GPT tokens map to one T5 token (Fig. 1 C, lower part), we do not use the corresponding GPT distribution and set the T5 distribution to be one-hot. We further note that aligning sequences generated by different tokenizers is a generic problem in contemporary NLP, yet we are not aware of any existing libraries that address it. We plan to release the implementation of our dynamic program and hope it can be useful for future research.
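A minimal sketch of such an alignment is given below. It is not the authors’ released implementation. It assumes a dynamic-time-warping-style recurrence in which each cell adds $c(\mathbf{s}_{i},\mathbf{t}_{j})$ to the cheapest of its three predecessors, and it assumes Levenshtein distance as the token-level cost; the function names and the example token strings are hypothetical.

```python
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def align(s, t, cost=edit_distance):
    """Return a monotone list of 0-based (i, j) pairs aligning token lists s and t.

    A token may appear in several pairs, so many-to-one alignments are allowed.
    """
    if not s or not t:
        return []
    inf = float("inf")
    f = [[inf] * (len(t) + 1) for _ in range(len(s) + 1)]
    f[0][0] = 0.0
    back = {}
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            # Cheapest predecessor: diagonal, up, or left.
            best, prev_cell = min(
                (f[i - 1][j - 1], (i - 1, j - 1)),
                (f[i - 1][j], (i - 1, j)),
                (f[i][j - 1], (i, j - 1)),
            )
            f[i][j] = best + cost(s[i - 1], t[j - 1])
            back[(i, j)] = prev_cell
    # Trace back from the final cell to the origin.
    path, cell = [], (len(s), len(t))
    while cell != (0, 0):
        path.append((cell[0] - 1, cell[1] - 1))
        cell = back[cell]
    return path[::-1]


def one_to_one(path):
    """Pairs in which both tokens take part in exactly one alignment."""
    count_i = Counter(i for i, _ in path)
    count_j = Counter(j for _, j in path)
    return [(i, j) for i, j in path if count_i[i] == 1 and count_j[j] == 1]
```

For example, aligning the made-up sequences `["The", " answer", " is", " 13"]` and `["The", "answer", "is", "1", "3"]` maps the last token of the first sequence to the last two tokens of the second. Under the rule described above, only the pairs returned by `one_to_one` would inherit the teacher distribution, and the remaining positions would fall back to one-hot targets.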

Experiments

The objective of the experiments is to see to what extent we can lift the scaling curve of smaller models’ math CoT performance, and at what price. We conduct model specialization on two model families: the raw pretrained checkpoints and their instruction-tuned counterparts (recall that the instruction-tuned checkpoints are generally more capable than the raw pretrained checkpoints, Fig. 1A). Specifically, we consider the raw pretrained T5 Base (250M)/Large (760M)/XL (3B)/XXL (11B), and the instruction-tuned FlanT5 models of the same sizes. In Sec. 4.1, we validate our main hypothesis that large models can perform well on a wide range of tasks, while smaller models’ ability can be moved from generic abilities to a specialized target ability. Specifically, we show that model specialization can indeed improve CoT math performance for FlanT5-Base/Large/XL/XXL, while paying the price of generic abilities, i.e., losing all CoT abilities on BigBench Hard and a large portion of answer-only (AO) abilities. In Sec. 4.2, we study the scaling behavior of smaller models and show how specialization lifts the scaling curve for both T5 and FlanT5. This modifies the previous belief that smaller models exhibit a flat scaling curve (Wei et al., 2022b); we show that their scaling curve becomes log-linear, not flat, after specialization. In Sec. 4.3, we show the dynamics and the generalization behavior of specialization: the model’s target performance increases gradually while its generic abilities decrease gradually during tuning, and there exist tradeoffs between in-distribution vs. OOD performance and between in-context vs. zero-shot performance.

We test the models’ math reasoning ability and generic ability and show their tradeoffs. For the math reasoning ability, we use the code-davinci-002 augmented GSM8K dataset (Cobbe et al., 2021) as our tuning dataset. GSM8K has 7K training questions. For each question, we ask the large model to generate 40 different solutions; taking the correct ones from the generations, we have 130K tuning data points in total. We test the model’s out-of-distribution performance on the MultiArith, ASDiv, and SVAMP datasets (collectively denoted as M-A-S; Wei et al., 2022b). None of the datasets has official train-dev-test splits, so we randomly sample 500 instances as the validation set and use the remaining instances (800 for GSM8K, 400 for MultiArith, 18K for ASDiv, 500 for SVAMP) as the test set. M-A-S and GSM8K all consist of primary-school-level arithmetic reasoning problems; they differ in the entities involved. For example, GSM8K may consider arithmetic reasoning on foods (e.g., 5 apples + 8 bananas = 13 fruits) and MultiArith may consider animals (e.g., 2 dogs + 3 cats = 5 animals). This type of out-of-distribution generalization is usually referred to as lexical-level compositional generalization (i.e., both are addition, but the lexicons are different; see Liu et al., 2022). For the generic ability, we use the BigBench Hard (BBH; Suzgun et al., 2022) test suite, a list of 26 challenging datasets testing the model’s reasoning abilities along multiple dimensions (e.g., date understanding, causal judgement, referential game, etc.). Because of its difficulty and wide coverage, BBH is an ideal benchmark for testing models’ generic ability.

For the baseline models, we consider generic large models and concurrent smaller distilled models, specifically: (1). Generic large models, ranked according to scale: code-davinci-002 (our teacher model, presumably at least 175B); LaMDA 137B (Thoppilan et al., 2022) and PaLM 60B (Chowdhery et al., 2022), both strong generic models for chain-of-thought reasoning; and UL2 (Tay et al., 2022), a 20B model with good CoT ability. We will show that specialized FlanT5 11B outperforms UL2 20B and comes close to PaLM 60B and LaMDA 137B on the target math reasoning task. (2). Concurrent works using knowledge distillation: Magister et al. (2022); Shridhar et al. (2022); Ho et al. (2022). We will show that our specialized FlanT5 clearly outperforms all of them on the distillation data (at the cost of BBH performance), mostly because we use an instruction-tuned checkpoint (FlanT5) as the base model rather than the raw pretrained checkpoint (T5).

Trading generic abilities for math CoT reasoning. The overall results are in Table 1. After tuning on the augmented GSM8K seed data, all FlanT5 models show improved math reasoning performance, with an average accuracy gain of approximately +10. We note that our smaller 3B model outperforms the current 11B and 6B distillation models on the GSM8K test set. Despite multiple confounders, including the size and the formats of the tuning data, we believe our 3B model performs better mostly because its base model is an instruction-tuned FlanT5 rather than the raw pretrained T5. Later we will show that the instruction-tuned checkpoint consistently outperforms the pretrained checkpoint after specialization (Sec. 4.2), showing the importance of the choice of base model. Also, although it does not perform as well as the teacher model code-davinci-002, our specialized 11B model improves to be on par with LaMDA 137B and slightly below PaLM 60B, showing that it is indeed possible to make smaller models experts at the particular math reasoning task. The price is also very clear: all specialized models suffer a performance drop on BigBench Hard. Specifically, they lose all CoT prompting abilities on BBH and a large portion of AO prompting performance. This observation validates our hypothesis: large models can perform well on a wide range of tasks (here PaLM 60B performs well on both math reasoning and BBH), whereas smaller models’ ability can be moved from generic tasks (BBH) to a specialized target ability (math reasoning), such that their performance on the target task can still match that of larger models. For example, the average performance on the four math datasets is 35.9 for LaMDA 137B vs. 40.8 for specialized FlanT5 11B.

4.2 Scaling Behavior of Smaller Models’ CoT Ability

Now we look at the scaling behavior of smaller models. We compare the scaling curves of: (1). the GPT family’s small variants (Ada, Babbage, Curie) and code-davinci-002; (2). raw pretrained T5 of different scales and their specialized versions; (3). instruction-tuned FlanT5 of different scales and their specialized versions. The results are shown in Fig. 2, where the x-axis denotes the model scale in terms of the number of parameters and the y-axis denotes the validation accuracy on the GSM8K dataset.

Smaller models have log-linear, but not flat, scaling curves. Initially, in the original CoT paper (Wei et al., 2022b) and the subsequent emergent abilities paper (Wei et al., 2022a), CoT prompting was believed to be an emergent property that only large models exhibit. Smaller models’ CoT performance (like that of the smaller GPT variants) was believed to follow a flat scaling curve: model performance does not improve with model scale, as shown in the left part of Fig. 2A. This belief was later updated by the FlanT5 paper (Chung et al., 2022), which shows that although the pretrained checkpoint does not have CoT ability, smaller models that have gone through instruction tuning can still exhibit CoT on generic tasks. Our work shows that training directly on CoT data can also lift the flat scaling curve of the raw T5 checkpoints (Fig. 2B) to be log-linear. In Fig. 2C, we consider specialization for the instruction-tuned FlanT5, and show that specialization significantly lifts the scaling curve of FlanT5 and that both curves are also log-linear. All the log-linear curves we observe in Fig. 2 mean that the chain-of-thought scaling behavior of smaller models is not flat, but log-linear. This further indicates that chain-of-thought may not be an emergent ability, which would be marked by a flat-then-phase-change curve; rather, it follows a log-linear curve just like large models (Kaplan et al., 2020; Hoffmann et al., 2022).
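Here “log-linear” means that accuracy grows roughly linearly in the logarithm of the parameter count, i.e., accuracy $\approx a+b\log_{10}N$ with slope $b>0$, whereas a flat curve corresponds to $b\approx 0$. The sketch below fits such a line by ordinary least squares; the function name is hypothetical, and any accuracies passed to it in an example would be illustrative placeholders rather than results from this paper.

```python
import math


def fit_log_linear(param_counts, accuracies):
    """Least-squares fit of accuracy = a + b * log10(param_count).

    Returns (a, b). A slope b near zero indicates a flat scaling curve;
    a clearly positive b indicates log-linear scaling.
    """
    xs = [math.log10(n) for n in param_counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracies) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

Calling `fit_log_linear` on the four model sizes used here (250M, 760M, 3B, 11B) with the corresponding validation accuracies gives a quick check of whether a curve is closer to flat or to log-linear.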

Instruction-tuned checkpoints perform better than raw pretrained checkpoints. Furthermore, comparing Fig. 2B and Fig. 2C, we see that specialized FlanT5 generally performs better than specialized T5 (though T5 has a larger performance gain). The exact validation performance is shown in Table 2. We also believe that, although multiple confounders exist, the major reason our performance in Table 1 (FlanT5 11B, GSM8K accuracy 27.1) is better than that of concurrent distillation methods (Magister22 T5 11B, accuracy 21.9) is that we use FlanT5 as our base model whereas they use the raw pretrained T5. The intuitive explanation is that instruction tuning elicits the model’s full ability, while raw pretrained models’ abilities are not fully released (conceptually, see Fig. 1A; also see Fu et al., 2022; Chung et al., 2022). So, for better performance, we recommend using instruction-tuned models in practice.

4.3 Specialization Process and Generalization Behaviors

Now we consider the specialization process. Intuitively, during finetuning, the model’s ability does not suddenly become the target ability; instead, it gradually moves from generic directions to the target. We save one checkpoint every 10K instances/updates, then evaluate the checkpoints on (1). in-distribution math performance (GSM8K); (2). out-of-distribution math performance (MultiArith, ASDiv, and SVAMP); (3). generic answer-only prompting performance (BBH-AO); (4). generic chain-of-thought prompting performance (BBH-CoT). We plot the model’s performance across the fine-tuning process in Fig. 3.

The dynamics of model specialization. At the beginning of specialization (Fig. 3 A1 at step 10K and Fig. 3 A2 at step 20K), the model immediately loses all BBH CoT ability (accuracy becomes 0) and a large portion of BBH AO ability (accuracy drops from about 0.3 to about 0.1). As tuning goes on (A1 epoch 1; A2 epochs 1 and 2), the model’s in-distribution performance (GSM8K) and out-of-distribution performance (MultiArith-ASDiv-SVAMP, M-A-S) gradually increase, meaning that the model can generalize to the three OOD datasets by tuning on GSM8K chain-of-thought data. At the later stage of tuning (A1 at epoch 2 and A2 at epoch 3), the model’s math performance fluctuates, and better in-distribution performance does not indicate better out-of-distribution performance. The models’ BBH-AO performance drops by a large portion and their BBH-CoT performance vanishes completely. Comparing A1 and A2, we also see that smaller models are more data-hungry than larger models (Kaplan et al., 2020): FlanT5 3B’s math performance plateaus at about 90K data points, whereas FlanT5 Base’s performance continues to increase until epoch 3 (each epoch has 130K data points).

In-distribution and out-of-distribution tradeoffs. Because both in-distribution and out-of-distribution performance fluctuate in Fig. 3 A, choosing the best in-distribution checkpoint does not necessarily yield the best out-of-distribution checkpoint. This observation is shown in Table 3: if we select the best model based on the GSM8K validation set, it cannot achieve the best validation performance in the M-A-S OOD setting. Yet choosing the best model based on the M-A-S validation performance leads to a smaller performance drop on GSM8K. Given this observation, in practice we recommend choosing the validation checkpoint according to the specific goal: if the goal is in-distribution generalization, use GSM8K; if the goal is OOD generalization, users may want to use their own validation set (in our case, the M-A-S datasets).
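The selection rule can be sketched as picking the checkpoint with the highest mean validation accuracy over whichever datasets match the goal. The function name and the layout of `val_scores` are hypothetical, and any numbers used with it in an example would be placeholders rather than values from Table 3.

```python
def select_checkpoint(val_scores, datasets):
    """Return the checkpoint step with the highest mean validation accuracy.

    val_scores maps a checkpoint step to {dataset_name: accuracy};
    datasets lists the names to average over, e.g. ["GSM8K"] for
    in-distribution selection or the three held-out sets for OOD selection.
    """
    def mean_accuracy(step):
        return sum(val_scores[step][name] for name in datasets) / len(datasets)

    return max(val_scores, key=mean_accuracy)
```

Calling the function twice on the same `val_scores`, once with `["GSM8K"]` and once with `["MultiArith", "ASDiv", "SVAMP"]`, can return different steps, which is exactly the situation described above.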

4.4 Further Design Choices Analysis

In this section, we study two more design choices discussed earlier: (1). using distribution matching vs. sample matching for distillation (recall that distribution matching minimizes the KL divergence between FlanT5’s per-step autoregressive distribution and GPT’s autoregressive distribution, whereas sample matching maximizes the likelihood of the reasoning paths generated by GPT); (2). the influence of data formats, and how in-context/zero-shot training data induces different behaviors in the specialized model.

Distribution matching gives faster convergence than sample matching. Fig. 3 B shows the training loss of distribution matching vs. sample matching. The model converges faster under distribution matching, and the corresponding loss is lower. In terms of validation performance, the two approaches do not differ substantially. Yet since distribution matching converges faster, in practice it may still be considered first, especially when the model becomes large and tuning becomes expensive.

In-context data preserves zero-shot ability; zero-shot data loses in-context ability. In Fig. 4 A, we tune the model with only in-context data (formats B1 and B2 in Fig. 1), then test the model’s in-context learning and zero-shot generalization performance during validation. In Fig. 4 B, we tune the model with only zero-shot data (no in-context examples prepended; formats B3 and B4 in Fig. 1), then test whether the model can still do in-context learning. As shown in Fig. 4 A, when tuned with in-context data, the model can do both in-context and zero-shot generalization during validation, even though it is not trained with zero-shot data. In comparison, in Fig. 4 B, when tuned with zero-shot data, the model’s zero-shot performance increases, but it gradually loses its in-context learning ability. This result aligns with empirical observations on other large models; for example, text-davinci-002 has better zero-shot performance than code-davinci-002, but worse in-context learning performance (Fu et al., 2022). This means that the model’s ability tradeoff happens not only between math and generic ability, but also between zero-shot and in-context learning ability. In practice, we recommend mixing the different data formats during tuning (this is why we mix the formats) to maintain a balance between in-context and zero-shot abilities, or adjusting the ratio of different formats according to the specific use case.

Conclusion

In this work, we study the problem of specializing smaller language models toward multi-step reasoning using chain-of-thought prompting. We show that it is indeed possible to move small models’ ability from generic directions and concentrate it on the target math reasoning task. After specialization, the model exhibits a log-linear scaling curve where performance increases smoothly as model scale increases. This corrects the previous hypothesis that small models have a flat scaling curve that does not increase with model scale. We show the importance of using instruction-tuned checkpoints as the base model, because their generalization performance is better than that of the raw pretrained checkpoints. Multiple tradeoffs happen during model specialization, including the loss of BBH performance, the balance between in-distribution and out-of-distribution generalization, and the balance between in-context learning and zero-shot generalization ability. We hope our practice and discoveries can serve as an important attempt towards specialized smaller models in the new research paradigm set by LLMs.