Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

Introduction

Large language models (LLMs) are extremely performant on a wide range of natural language tasks, but they require enormous amounts of compute to train (OpenAI, 2023; Anthropic, 2023). As such, there is growing interest in building strong moderate-sized models, such as LLaMA (Touvron et al., 2023a; b), MPT (MosaicML, 2023), and Falcon (Almazrouei et al., 2023), that allow for efficient inference and fine-tuning. These LLMs are available in varied sizes suited for different use cases, but training each individual model from scratch—even the smallest billion-parameter models—requires substantial computational resources that are cost-prohibitive for most organizations. In this work, we seek to address the following question:

Can we produce a smaller, general-purpose, and competitive LLM by leveraging existing pre-trained LLMs, while using much less compute than training one from scratch?

We explore structured pruning as a means to achieve this goal. Pruning is commonly viewed as a solution for compressing task-specific models (Han et al., 2016; Li et al., 2016; Lagunas et al., 2021; Xia et al., 2022; Kurtic et al., 2023), removing redundant parameters and accelerating inference without sacrificing task performance. However, for general-purpose LLMs, pruning inevitably results in performance degradation compared to the original models (Frantar & Alistarh, 2023; Sun et al., 2023; Ma et al., 2023), especially when no significant compute is invested post-pruning. In this work, we use pruning as an effective approach for developing smaller yet competitive LLMs that require only a fraction of the compute needed to train them from scratch.

We identify two key technical challenges in this problem. First, how can we decide on final pruned architectures that are strong in performance and efficient for inference? Existing structured pruning techniques for LLMs (Xia et al., 2022; Ma et al., 2023) do not specify targeted structures and lead to suboptimal pruned models in terms of performance and inference speed (Table 4 and Figure 8). Second, how can we continue pre-training the pruned model to reach desired performance? We observe that training using the original pre-training data leads to imbalanced rates of loss reduction across different domains, compared to when training such models from scratch. This indicates that the pruned model retains varying levels of knowledge for different domains (e.g., GitHub vs. C4) and simply using the pre-training domain proportion results in an inefficient use of data (Figure 6). To address these issues, we propose “LLM-shearing”, an algorithm consisting of the following two components:

We propose a novel pruning algorithm, dubbed targeted structured pruning, which prunes a source model to a specified target architecture. The target architecture is determined by leveraging the configurations of existing pre-trained models. Our pruning approach searches for substructures within the source model that maximally preserve performance while adhering to the given constraints.

We devise a dynamic batch loading algorithm that loads training data from each domain in proportion to its rate of loss reduction, thereby making an efficient use of the data and accelerating the overall performance improvement.

We demonstrate the efficacy of our proposed method by pruning a LLaMA2-7B model (Touvron et al., 2023b) into two smaller LLMs: Sheared-LLaMA-1.3B and Sheared-LLaMA-2.7B. Despite using only 50 billion tokens (i.e., 5% of OpenLLaMA’s pre-training budget) for pruning and continued pre-training, Sheared-LLaMA-1.3B and Sheared-LLaMA-2.7B outperform other popular LLMs at similar scales, including Pythia (Biderman et al., 2023), INCITE (TogetherAI, 2023b), and OpenLLaMA (Geng & Liu, 2023), on 11 representative downstream tasks (Figure 1; commonsense, reading comprehension, and world knowledge) and instruction tuning for open-ended generation. Additionally, the downstream performance trajectory suggests that further training the pruned model with more tokens would result in even greater gains. While we only conduct experiments with up to 7B parameter models, our LLM-shearing algorithm is highly generalizable and can be extended to large language models of any size in future work.

LLM-Shearing

Given an existing large model $\mathcal{M}_{S}$ (the source model), we study how to efficiently produce a smaller, strong model $\mathcal{M}_{T}$ (the target model). We consider this as a two-stage process: (1) Pruning $\mathcal{M}_{S}$ into $\mathcal{M}_{T}$. This reduces the number of parameters but inevitably incurs a performance drop. (2) Continuing to pre-train $\mathcal{M}_{T}$ with a standard language modeling objective to reach a target performance. While most recent efforts (Xia et al., 2022; Ma et al., 2023) focus on the former stage, we find the latter stage crucial for producing competitive general-purpose LLMs from structured pruning.

Structured pruning removes groups of model parameters to compress models and accelerate inference. However, existing structured pruning approaches often result in unconventional model configurations that deviate from popular architectures. For example, CoFiPruning (Xia et al., 2022) produces models with non-uniform layer configurations (e.g., different numbers of heads across layers), which incurs inference overhead compared to standard uniform layer configurations (Section 4.2).

In this work, we extend CoFiPruning to allow pruning the source model into any target configuration that we specify. We leverage the configurations of existing pre-trained models as the target architectures, based on the intuition that these configurations have already been well-optimized to balance model expressivity and inference efficiency. For example, we use the INCITE-Base-3B architecture (TogetherAI, 2023a) as the target structure when producing a $2.7$ B model.

Our method learns a set of pruning masks on model parameters at different granularities, from global ones like layers and hidden dimensions (which persist across all layers) to local ones like attention heads and intermediate dimensions. Assume that the source model $\mathcal{M}_{S}$ has $L_{\mathcal{S}}$ layers, with each layer consisting of one multi-head attention module (MHA) and one feed-forward network (FFN). $\mathcal{M}_{S}$ has a hidden state dimension of $d_{\mathcal{S}}$, $H_{\mathcal{S}}$ heads in each MHA, and an intermediate dimension of $m_{\mathcal{S}}$ in each FFN. We introduce one set of mask variables for each of these four granularities (layers, hidden dimensions, attention heads, and intermediate dimensions).

Each mask variable controls whether the associated substructure is pruned or retained. For example, we remove a layer if its corresponding $z^{\text{layer}}=0$ . Figure 2 illustrates an example of how the pruning masks control the pruned structures.
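To make the role of the masks concrete, the following is a minimal sketch of how masks at the four granularities could determine the shape of the pruned model. It assumes binary, hand-set masks on a toy model; the names `layer_mask`, `hidden_mask`, `head_mask`, `int_mask`, and `pruned_shape` are illustrative and are not taken from our implementation, where the masks are learned.

```python
# Toy illustration of pruning masks at several granularities.
# All names and sizes are hypothetical; real masks are learned, not hand-set.

def kept_indices(mask):
    """Return the indices whose mask value is non-zero (i.e. retained)."""
    return [i for i, z in enumerate(mask) if z != 0]

# Toy source model: 4 layers, hidden size 8, 4 heads and FFN size 6 per layer.
L_S, d_S, H_S, m_S = 4, 8, 4, 6

layer_mask = [1, 0, 1, 1]                            # global: one entry per layer
hidden_mask = [1, 1, 0, 1, 1, 0, 1, 1]               # global: shared by all layers
head_mask = [[1, 1, 0, 0] for _ in range(L_S)]       # local: one vector per layer
int_mask = [[1, 0, 1, 1, 0, 1] for _ in range(L_S)]  # local: one vector per layer

def pruned_shape(layer_mask, hidden_mask, head_mask, int_mask):
    """Summarize the architecture that remains after applying the masks."""
    layers = kept_indices(layer_mask)
    return {
        "layers": len(layers),
        "hidden": len(kept_indices(hidden_mask)),
        # Heads and FFN units are only counted for layers that survive.
        "heads_per_layer": [len(kept_indices(head_mask[l])) for l in layers],
        "int_per_layer": [len(kept_indices(int_mask[l])) for l in layers],
    }

shape = pruned_shape(layer_mask, hidden_mask, head_mask, int_mask)
```

In this toy setting every surviving layer keeps the same number of heads and intermediate units, which is the kind of uniform configuration that a target architecture prescribes.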

Pruning optimizes $\mathcal{L}(\theta,{z})$, the language modeling loss computed with the masked model weights, subject to constraints that tie the masks to the target architecture (the Lagrange multipliers used for these constraints are discussed in Appendix B). This objective will produce a pruned model with the target shape. Ideally, running this pruning algorithm on a large amount of data will directly produce a strong compact model. In practice, the pruning stage is expensive (roughly 5 $\times$ slower than standard LM training), and we find that the learned masks often converge fast. Therefore, in our experiments, we allocate only a limited budget for the pruning process. Following pruning, we finalize the pruned architecture by preserving the highest-scoring components associated with the mask variables in each substructure, and continue pre-training the pruned model with the language modeling objective. We refer to this second stage as continued pre-training.
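The finalization step can be pictured as a top-k selection over mask scores. The sketch below assumes that each component of a substructure (here, the attention heads of one layer) carries a scalar score and that the target count is known from the target architecture; the helper `keep_top_k` and the score values are hypothetical.

```python
def keep_top_k(scores, k):
    """Return the sorted indices of the k highest-scoring components."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be between 0 and len(scores)")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Hypothetical learned mask scores for six attention heads in one layer.
head_scores = [0.91, 0.02, 0.88, 0.10, 0.95, 0.40]
kept_heads = keep_top_k(head_scores, k=3)
```

The same selection would be applied independently to layers, hidden dimensions, and intermediate dimensions, each with its own target count.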

2.2 Dynamic Batch Loading

Continued pre-training on a large amount of data is crucial for recovering the pruned model performance. However, we observe a surprising finding in our preliminary experiments: continuing pre-training our pruned models on an existing pre-training dataset RedPajama (TogetherAI, 2023b; LLaMA’s pre-training dataset) reduces loss at different rates across domains compared to pre-training a model from scratch, which signifies an inefficient use of data.

To be more specific, we first fit a scaling law (Hoffmann et al., 2022; details in Appendix A) on the series of LLaMA2 models for each domain. Then we predict the loss that a hypothetical 2.7B LLaMA2 model would achieve if trained from scratch on the same data. We obtain these estimated reference losses across domains of the pre-training data and compare them to the losses of our pruned model after continued pre-training. As shown in Figure 6 (left), while our model’s loss on GitHub is better than the reference loss, it is significantly worse than the reference loss on C4. This observation indicates that pruning preserves a greater amount of knowledge in low-entropy and smaller domains (e.g., GitHub) compared to high-entropy and larger domains (e.g., C4). As demonstrated later in Section 4.1, simply reusing the original pre-training data distribution results in an inefficient use of data and worse downstream performance, even if the overall loss is seemingly low. (The LLaMA2 pre-training data is not public. We conducted the same analysis on LLaMA1 models and observed a similar phenomenon, indicating that this is a universal issue unrelated to specific pre-training data.)

Inspired by recent work (Xie et al., 2023), we propose dynamic batch loading, an efficient algorithm that adjusts domain proportions on the fly based on the model’s performance. The goal is to ensure the model reaches the reference loss at roughly the same time across all domains. To this end, the algorithm compares the model’s validation loss in each domain with the corresponding reference loss during training and increases the sampling weight of domains that lag behind.

We apply dynamic batch loading to both the pruning stage and the continued pre-training stage. For pruning, we use the original pre-training data’s domain weights as the initial weights $w_{0}$. For continued pre-training, we use the final weights from the pruning stage as $w_{0}$. Dynamic batch loading leverages reference losses on validation sets and adjusts the weights dynamically, so it adds minimal overhead to standard training. This is more efficient than the approach of Xie et al. (2023), which requires training both a reference model and a proxy model to learn domain weights before training.
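A weight adjustment of this kind can be sketched as follows. The exact update rule is not reproduced in this text, so the sketch assumes a multiplicative (exponentiated) update on the gap between the current validation loss and the reference loss, followed by renormalization; the function name `update_weights` and all numbers are hypothetical.

```python
import math

def update_weights(weights, losses, ref_losses):
    """Upweight domains whose validation loss is still above its reference.

    ASSUMPTION: a multiplicative (exponentiated) update followed by
    renormalization; the update rule itself is not given in the text above.
    """
    # Gap to the reference, clipped at zero so that domains already at or
    # below their reference loss are not pushed further.
    gaps = [max(l - r, 0.0) for l, r in zip(losses, ref_losses)]
    scaled = [w * math.exp(g) for w, g in zip(weights, gaps)]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical two-domain example: the first domain is already below its
# reference loss, the second one still lags behind.
w0 = [0.5, 0.5]
w1 = update_weights(w0, losses=[0.70, 2.30], ref_losses=[0.80, 2.00])
```

Calling such an update periodically, starting from $w_{0}$, only requires evaluating the current model on the small per-domain validation sets, which is why the overhead stays low.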

More broadly, dynamic batch loading has the potential to train an LLM to match reference losses of any model, by leveraging open-source pre-training datasets such as RedPajama, even when the reference model’s training data is unknown.

Choices of reference losses. By default, we use the loss predicted by the scaling law as the reference (denoted as scaling reference). We also experiment with an alternative where we directly use the source model’s domain validation loss as the reference (denoted as source reference). We show in Sections E.3 and E.4 that while both variants perform well, using the scaling reference leads to slightly better downstream results, especially on math and coding tasks. However, the source reference is a viable alternative when a series of source models at different scales is not available.

Experiments

We use the LLaMA2-7B model (Touvron et al., 2023b) as the source model throughout all of our main experiments (results on LLaMA1 models are in Section E.5). We then conduct structured pruning experiments to compress this model down to two smaller target sizes: 2.7B and 1.3B parameters. We compare to strong pre-trained language models of similar sizes, including OPT-1.3B (Zhang et al., 2022), Pythia-1.4B (Biderman et al., 2023), OPT-2.7B, Pythia-2.8B, INCITE-Base-3B (TogetherAI, 2023b), OpenLLaMA-3B-v1, and OpenLLaMA-3B-v2 (Geng & Liu, 2023). We use Pythia-1.4B as the target architecture for the 1.3B model, and INCITE-Base-3B as the target architecture for the 2.7B model. Table 8 summarizes model architecture details of all these models.

Data.

As the training data for LLaMA2 is not publicly accessible, we use RedPajama (TogetherAI, 2023b), which is a replicated pre-training dataset of the LLaMA1 models (Touvron et al., 2023a), for pruning and continued pre-training. This dataset encompasses training data from seven domains: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange. We construct a held-out validation set with 2 million tokens (equivalent to 500 sequences of 4,096 tokens) for each domain. We allocate 0.4 billion tokens for the pruning phase and 50 billion tokens for the continued pre-training process. Following the conventions of LLaMA2, we maintain a sequence length of 4,096 tokens. Table 1 provides a summary of the pre-training data used by our models and the baseline models.

Training.

Our implementation builds on the Composer package (MosaicML, 2021). We use a maximum of 16 Nvidia A100 GPUs (80GB) for all experiments (more details are in Appendix B).

Downstream task evaluation.

We use the lm-evaluation-harness package (Gao et al., 2021) to evaluate on an extensive suite of downstream tasks:

We follow Pythia and LLaMA2 to report the 0-shot accuracy of ARC easy (ARC-E; Clark et al., 2018), LAMBADA (Paperno et al., 2016), LogiQA (Liu et al., 2020), PIQA (Bisk et al., 2020), SciQ (Welbl et al., 2017), and WinoGrande (Sakaguchi et al., 2021).

We report accuracy on the tasks used by the Open LLM Leaderboard (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard), including 10-shot HellaSwag (Zellers et al., 2019), 25-shot ARC Challenge (ARC-C; Clark et al., 2018), and 5-shot MMLU (Hendrycks et al., 2021).

We also report exact match of 32-shot Natural Questions (NQ; Kwiatkowski et al., 2019) to measure the factual knowledge in the model.

Instruction tuning evaluation.

As training models to follow instructions has become a crucial application of LLMs (Ouyang et al., 2022; Taori et al., 2023), we evaluate our models on instruction tuning and fine-tune both Sheared-LLaMA and baseline models on 10,000 instruction-response pairs sampled from the ShareGPT dataset (https://sharegpt.com; we only use the first round in the multi-turn chat history). For evaluation, we sample another 1,000 instructions from ShareGPT, generate responses from our fine-tuned models and other baseline models, and use GPT-4 as an evaluator to compare the two responses (Dubois et al., 2023). We report the win rate of our model compared to the baseline model (more details in Appendix D).

3.2 Sheared-LLaMA Outperforms LMs of Equivalent Sizes

We demonstrate that, on both standard LM benchmarks and instruction tuning, Sheared-LLaMA significantly outperforms existing LLMs of similar sizes, while using only a fraction of the compute budget required to train those models from scratch.

In Table 2, we present the zero-shot and few-shot downstream task performance of both Sheared-LLaMA and existing pre-trained models of a similar size. Our experiments show that, even with a budget as limited as approximately 50B tokens for pruning and continued pre-training, Sheared-LLaMA models outperform existing models that have been pre-trained on significantly larger compute. To elaborate further, Sheared-LLaMA-1.3B outperforms both the OPT-1.3B and Pythia-1.4B models, which were originally pre-trained with 300B tokens. Similarly, Sheared-LLaMA-2.7B outperforms INCITE-Base-3B and OpenLLaMA-3B-v1, which were pre-trained on 800B and 1T RedPajama tokens respectively; Sheared-LLaMA-2.7B also surpasses OpenLLaMA-3B-v2, which was trained on 1T tokens from a mixture of RedPajama, RefinedWeb, and StarCoder.

Instruction tuning.

As shown in Figure 3, instruction-tuned Sheared-LLaMA achieves higher win rates compared to all the other pre-trained models at a comparable scale. This demonstrates that our 2.7B model can serve as a strong foundation for instruction tuning and has the capacity to generate long, coherent and informative responses (see examples in Appendix D).

Comparison to further pre-training an existing LM.

We examine whether pruning produces a better initialization for continued pre-training than an existing LLM of equivalent size. We continue pre-training an INCITE-Base-3B model on the original RedPajama data and compare it to Sheared-LLaMA-2.7B. Figure 4 shows that the INCITE-Base-3B model starts off with much higher accuracy, but its performance plateaus throughout continued pre-training. In contrast, Sheared-LLaMA starts at a lower accuracy but rapidly improves, eventually surpassing the INCITE-Base-3B model. This suggests that pruned models from a strong base model serve as a better initialization for continued pre-training. (In cases where the existing small model is competitive compared to the pruning source model, the small model may offer a better starting point than a pruned model. Intuitively, the larger the discrepancy in performance between the source model and the small model, the more advantage the pruned model has.) Please find more training details in Appendix F.

Analysis

We analyze the effectiveness of dynamic batch loading by examining its impact on three aspects: (1) the final LM loss across domains, (2) the data usage of each domain throughout training, and (3) the downstream task performance. All results in this section are based on Sheared-LLaMA-1.3B.

Dynamic batch loading is designed to balance the rate of loss reduction across domains, so that the losses reach the reference value at approximately the same time. In Figure 6, we plot the difference between the loss of our model (with both original and dynamic batch loading) and the reference loss, estimated by fitting a scaling function to a hypothetical 2.7B parameter LLaMA2 model. With the original batch loading, the loss differences vary dramatically across domains. For instance, the GitHub loss decreases below the reference value, while the C4 loss lags behind. In contrast, dynamic batch loading reduces losses evenly and shows very similar loss differences across domains, indicating a more efficient data use.

Data usage.

Table 3 compares the original data proportion of RedPajama and the domain data usage of our dynamic loading (Figure 7 shows the evolution of domain weights throughout the training). We see that dynamic batch loading increases the weights for the Book and C4 domains versus other domains—suggesting that they are more difficult to recover for a pruned model.

Downstream performance.

As shown in Figure 6, pruned models trained with dynamic batch loading achieve better downstream performance than when trained on the original RedPajama distribution. This suggests that the more balanced loss reduction from dynamic batch loading transfers to improved downstream capabilities.

4.2 Comparison to Other Pruning Approaches

We compare our LLM-shearing method to other pruning approaches and report validation perplexity, which serves as a strong indicator of overall model capabilities (Xia et al., 2023). Due to computational constraints, the following experiments control the total compute budget across compared methods rather than running each method to completion.

Previous works like Block Pruning (Lagunas et al., 2021) and CoFiPruning (Xia et al., 2022) were evaluated on BERT-scale LMs, and the final model architectures, though structured, usually have non-uniform layer configurations, e.g., different layers have different numbers of heads or different intermediate sizes. While bringing performance gains, non-uniformity also introduces training and inference overhead due to irregularities in model architectures. We experiment with both CoFiPruning and our targeted structured pruning. For a fair comparison, we use the same original data proportion for both approaches. As shown in Table 4, our targeted pruned models have a higher inference throughput compared to the non-uniformly pruned CoFiPruning model at the same sparsity, despite having slightly higher perplexity.

Comparison to LLM-Pruner (Ma et al., 2023).

We compare our pruning method to LLM-Pruner, a recent structured pruning method that produces uniform layer configurations, in Section E.2. We show that with the same budget and compression rate, our method achieves better perplexity.

4.3 Additional Analysis

Intuitively, allocating more compute to the pruning stage helps identify better subnetwork structures. We explore distributing data across the pruning and continued pre-training stages differently, within a fixed budget of 5B tokens. Table 5 shows that when controlling the total amount of tokens, increasing the pruning budget consistently improves perplexity. However, since pruning is more expensive than continued pre-training, we decide to allocate 0.4B tokens to pruning. Please refer to Appendix B for details on training throughputs.

Performance on math and coding tasks.

We also evaluate Sheared-LLaMA and baseline models on math and coding benchmarks in Section E.3. Sheared-LLaMA outperforms baselines trained on the same RedPajama data, but lags behind models trained on more ArXiv and GitHub data. This highlights a limitation of our work: the performance is bounded by the chosen reference loss. To improve on math and coding, a better initial data proportion (e.g., more GitHub) and better reference losses are needed, which we leave for future work.

Related Work

Structured pruning has been extensively studied as a model compression technique in computer vision and natural language processing, where task-specific models such as classifiers are often overparameterized and can be pruned significantly with minimal impact on performance (Han et al., 2016; Wen et al., 2016; Liu et al., 2017; Luo et al., 2017; Cai et al., 2019; Deng et al., 2020; Hou et al., 2020; Wang et al., 2020; Lagunas et al., 2021; Xia et al., 2022; Kurtic et al., 2023). Unstructured pruning (Frankle & Carbin, 2018; Li et al., 2020; Chen et al., 2020; Sanh et al., 2020) prunes individual weights instead of structured blocks. Though unstructured pruning usually achieves higher compression rates, it is not practical for model speedup.

In the era of LLMs, the prevalent NLP pipeline has shifted from task-specific models to general-purpose LMs, which leaves little room for redundancy. Unstructured pruning, semi-structured pruning (Frantar & Alistarh, 2023; Sun et al., 2023), and structured pruning (Ma et al., 2023) all lead to significant performance drops on LLMs even at modest sparsity. Notably, all the aforementioned works fix the original model parameters or tune them minimally. In our work, we see pruning as an initialization and consider it necessary to expend substantial compute to continually pre-train the model to recover performance.

Efficient pre-training approaches.

Orthogonal to our pruning approach, there is an extensive body of work on improving the efficiency of training LLMs. For example, quantization reduces the numeric precision of model weights and activations and speeds up training and inference (Dettmers et al., 2022; 2023; Xiao et al., 2023). Knowledge distillation (Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2020; Sun et al., 2020), which trains a smaller model on a larger model’s predictions, is shown to be effective for task-specific models (Xia et al., 2022). For pre-training LLMs, though distilling from a teacher model is shown to improve the quality of student models given the same number of training steps (Rae et al., 2021; Blakeney et al., 2022), it is less cost-effective than pruning and continued training due to the excessive inference cost incurred by the teacher model (Jha et al., 2023). More methods have been introduced to enhance the efficiency of training LMs, such as dynamic architectures (Gong et al., 2019; Zhang & He, 2020) and efficient optimizers (Chen et al., 2023; Liu et al., 2023). However, as indicated by Kaddour et al. (2023), the promised gains in training efficiency may not be consistently realized.

There are also data-based approaches to enhance training efficiency. Eliminating duplicated data is found to be effective (Lee et al., 2021). Various batch selection techniques propose to prioritize data based on criteria such as higher losses (Jiang et al., 2019) or a greater reducible loss (Mindermann et al., 2022). Xie et al. (2023) propose to optimize data mixtures by training a proxy model to estimate the optimal data weight of each domain.

Discussion

This work has two main limitations. First, the method relies heavily on the availability of open-source pre-training datasets and large language models. If the pre-training data does not cover a particular domain, the method is unlikely to recover performance well on that domain. Second, due to computational constraints, we only conducted experiments using a 7B parameter model. However, our method is highly generalizable and can be scaled up to larger models in future research.

Conclusion.

In this work, we propose using structured pruning as an efficient approach to producing competitive LLMs. Our approach consists of two stages, targeted structured pruning and continued pre-training, and we propose dynamic batch loading to improve the efficiency of using pre-training data. We train a series of competitive Sheared-LLaMA models with a fraction of compute compared to standard pre-training. Our results highlight a promising avenue to produce small LLMs with a low cost when strong large-scale models are available. As more capable LLMs and larger pre-training datasets emerge, our method can easily extend to these advances to produce improved small models.

Acknowledgements

We express our gratitude to Sadhika Malladi, Tanya Goyal, Ofir Press, Adithya Bhaskar, and the Princeton NLP group for reviewing the paper and providing helpful feedback. We also thank the engineering team at MosaicML for their invaluable assistance with implementation specifics using the Composer package. Mengzhou Xia is supported by a Bloomberg Data Science Ph.D. Fellowship, and Tianyu Gao is supported by an IBM PhD Fellowship. This research is also supported by Microsoft Azure credits through the “Accelerate Foundation Models Academic Research” Initiative.

References

Appendix A Reference Loss Predicted by Scaling Laws

The scaling law of language modeling is a function of model size $N$ and dataset size $D$:

$$L(N,D)=E+\frac{A}{N^{\alpha}}+\frac{B}{D^{\beta}}$$

where $E$ captures the loss for the true language distribution in an ideal generation process, and $A,\alpha,B,\beta$ are scaling factors related to model scale or data size. Models in the same model family are usually trained with the same number of tokens on the same data distribution. In this case, we need a minimum of three models to estimate the constant term $E+\frac{B}{D^{\beta}}$, $A$, and $\alpha$. If the models are trained with different numbers of tokens, we can estimate $E,A,\alpha,B,\beta$ with a minimum of $5$ models. Note that we estimate the scaling factors for each domain separately.

It is known that the LLaMA2 models have all been trained on the same 2T tokens (Touvron et al., 2023b). Therefore, we take the LLaMA2-7B, LLaMA2-13B and LLaMA2-70B checkpoints, evaluate them on the validation set of each domain, and fit the scaling factors with the corresponding losses. Given the limited data points for estimating the scaling law constants, we recognize that the projected loss of a hypothetical 2.7B LLaMA2 model may be biased compared to the true value. We present the predicted loss in Table 6.
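As an illustration of the three-model case, the sketch below fits $c + A/N^{\alpha}$ exactly through three (size, loss) points, where $c$ stands for the constant term that lumps together the irreducible loss and the data-dependent term when $D$ is fixed. It finds $\alpha$ by bisection on a ratio of loss differences, which is one possible fitting procedure and not necessarily the one we used; the function names and the synthetic constants are hypothetical and are not values from Table 6.

```python
def fit_three_point_scaling(sizes, losses, lo=1e-6, hi=5.0, iters=200):
    """Fit loss(N) = c + A / N**alpha through exactly three (N, loss) points.

    sizes must be increasing and losses decreasing. alpha is found by
    bisection on a ratio of loss differences, which cancels c and A.
    """
    (n1, n2, n3), (l1, l2, l3) = sizes, losses
    target = (l1 - l2) / (l2 - l3)

    def ratio(alpha):
        return (n1**-alpha - n2**-alpha) / (n2**-alpha - n3**-alpha)

    for _ in range(iters):
        mid = (lo + hi) / 2
        # ratio() increases with alpha, so move the bracket accordingly.
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    alpha = (lo + hi) / 2
    a = (l1 - l2) / (n1**-alpha - n2**-alpha)
    c = l1 - a * n1**-alpha
    return c, a, alpha

def predict(params, n):
    """Evaluate the fitted curve at model size n."""
    c, a, alpha = params
    return c + a / n**alpha

# Synthetic check with made-up constants (NOT values from the paper).
true_params = (1.5, 2.0, 0.3)
sizes = (7.0, 13.0, 70.0)  # model sizes in billions of parameters
losses = tuple(predict(true_params, n) for n in sizes)
fitted = fit_three_point_scaling(sizes, losses)
ref_loss_2p7b = predict(fitted, 2.7)
```

Because 2.7B lies well below the smallest fitted size (7B), the prediction is an extrapolation, which is one reason the projected reference loss may be biased.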

Appendix B Training Details

We present the hyperparameters used in our experiments in Table 7. We use fully sharded data parallel (Zhao et al., 2023) to train our models in parallel. We use FlashAttention V1 (Dao et al., 2022) to speed up training. We use a cosine learning rate scheduler and decay the learning rate to a minimum of $10\%$ of the peak value. We conduct preliminary experiments to determine the peak learning rate for learning the masking variables and Lagrange multipliers, and we find that a learning rate of $1.0$ works well for pruning. We do not tune any other hyperparameters. The throughput depends on the implementation, and we believe that our throughput can be further improved by adopting more recent optimizations such as FlashAttention V2 (Dao et al., 2022) and a more recent version of Composer.
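The learning rate schedule described above can be sketched as a plain function. This is a minimal illustration, assuming no warmup and a hypothetical peak learning rate; it is not the Composer scheduler used in our runs, and the name `cosine_lr` is illustrative.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_ratio=0.1):
    """Cosine decay from peak_lr at step 0 to min_ratio * peak_lr at the end.

    Warmup is omitted for brevity; min_ratio=0.1 mirrors the 10% floor above.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    min_lr = min_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Hypothetical peak learning rate and schedule length, for illustration only.
peak = 1e-4
lr_start = cosine_lr(0, 1000, peak)
lr_mid = cosine_lr(500, 1000, peak)
lr_end = cosine_lr(1000, 1000, peak)
```

With these settings the rate starts at the peak, passes through the midpoint between peak and floor halfway through training, and ends at 10% of the peak.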

Appendix C Model Configurations

In this section, we provide the model configurations for both our Sheared-LLaMA model and the baseline models, as illustrated in Table 8. Our design closely adheres to the architecture of Pythia-1.4B and INCITE-Base-3B, albeit with some nuanced distinctions. A noteworthy difference is found in the intermediate size of Sheared-LLaMA, which is a consequence of its lineage from LLaMA2-7B. Notably, LLaMA2-7B employs a GLU variant (Shazeer, 2020) within its feed-forward layer, comprising a gate matrix, an upward-projection matrix, and a downward-projection matrix. In contrast, other models employ the conventional double-matrix feed-forward layer structure. Furthermore, we acknowledge that the shearing algorithm will have to inherit the head dimension of the source model. Instead of explicitly specifying the number of heads based on existing language models, we set the target number of heads to be the target hidden dimension divided by the head dimension of the source model.
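The head-count rule in the last sentence amounts to a single division. The sketch below assumes a source model with hidden size 4096 and 32 heads (head dimension 128) and target hidden sizes of 2048 and 2560; these numbers are stated here as assumptions because Table 8 is not reproduced in this text, and the helper `target_num_heads` is hypothetical.

```python
def target_num_heads(target_hidden, source_hidden, source_heads):
    """Number of heads when the pruned model inherits the source head dimension."""
    head_dim, rem = divmod(source_hidden, source_heads)
    if rem:
        raise ValueError("source hidden size must be divisible by source heads")
    heads, rem = divmod(target_hidden, head_dim)
    if rem:
        raise ValueError("target hidden size must be a multiple of the head dimension")
    return heads

# Assumed sizes: source hidden 4096 with 32 heads (head dim 128),
# target hidden sizes 2048 and 2560.
heads_small = target_num_heads(2048, 4096, 32)
heads_large = target_num_heads(2560, 4096, 32)
```

A target hidden size that is not a multiple of the inherited head dimension would be rejected, which reflects the constraint that pruning cannot change the head dimension.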

Appendix D Instruction Tuning

During instruction tuning, the instruction is prepended with “You are a helpful assistant. Write a response that appropriately completes the request.” When evaluating the instruction tuning generations, Wang et al. (2023) observe that GPT models used as judges can change their preference when the presentation order of the two outputs is swapped. Therefore, we compare each output pair twice, swapping the presentation order of the two outputs, and report the average win rate of the two rounds to mitigate position bias.
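The order-swapped averaging can be sketched as below. The verdict labels, the choice to count a tie as half a win, and the function name `debiased_win_rate` are assumptions made for illustration; the verdict lists are made up.

```python
def debiased_win_rate(first_round, second_round):
    """Average our win rate over two judging rounds with swapped presentation order.

    Each round is a list of verdicts from our model's perspective:
    "win", "loss", or "tie". Counting a tie as half a win is an assumption.
    """
    score = {"win": 1.0, "tie": 0.5, "loss": 0.0}

    def rate(verdicts):
        return sum(score[v] for v in verdicts) / len(verdicts)

    return 0.5 * (rate(first_round) + rate(second_round))

# Made-up verdicts for four instructions, judged once with our output shown
# first and once with it shown second.
round_ours_first = ["win", "win", "loss", "win"]
round_ours_second = ["win", "loss", "loss", "tie"]
avg = debiased_win_rate(round_ours_first, round_ours_second)
```

If the judge favored whichever output appears first, the two rounds would disagree as in this toy example, and the average would sit between the two per-round rates.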

We randomly select an output generated by Sheared-LLaMA-1.3B and Sheared-LLaMA-2.7B in response to a given instruction, and present the generations in Table 10. Our findings demonstrate that, after instruction tuning, Sheared-LLaMA-2.7B consistently produces long, coherent, and informative outputs in response to the instruction.

Appendix E Additional Results

Figure 7 shows how the domain weights change throughout the training process and the final cumulative data usage of each domain. The trajectory shows that the domain weights stabilize after around $30\%$ of training. Unlike other domains, Wikipedia exhibits an anomalous spike in data loading early in training. The remaining domains demonstrate a steady, monotonic change in data loading over time, as expected.

E.2 Comparison to LLM-Pruner

Table 11 displays the model configurations for an LLM-Pruner pruned model (Ma et al., 2023) versus our pruned model. The model pruned with LLM-Pruner has an unconventional architecture where the intermediate size is smaller than the hidden size, largely because the algorithm does not support pruning the hidden dimension. When comparing LLM-Pruner and our method in continued pre-training, our model achieves lower perplexity than the LLM-Pruner model with a similar parameter count and the same amount of continued pre-training, demonstrating the effectiveness of targeted structured pruning.

E.3 Coding and Math Reasoning

We examine the math and coding abilities of our pruned models compared to other language models. We find that the math ability of existing 3B parameter models, including Sheared-LLaMA, is still far below that of larger models. We also find that Sheared-LLaMA’s coding ability lags behind models known to be trained on more code data, like Pythia-1.4B and OpenLLaMA-3B-v2. Sheared-LLaMA’s coding ability likely comes from the original LLaMA2 model, which is speculated to have been trained on more code data, and from the small amount of code data used in our pruning experiments.

E.4 Scaling Reference vs. Source Reference

Figure 10 compares the performance of Sheared-LLaMA when trained with the scaling reference and the source reference in dynamic batch loading. While both methods are effective in efficiently training the model, the scaling reference performs consistently (slightly) better in terms of downstream performance.

E.5 Pruning from LLaMA1 vs LLaMA2

In this section, we compare the performance of pruning from LLaMA1 and LLaMA2. Both models demonstrate strong downstream task performance, though not surprisingly, pruning from LLaMA2 yields a consistent advantage.

Appendix F Training Details for Continued Pre-training of INCITE-Base-3B

Before continuing pre-training the INCITE-Base-3B model, we conduct an initial grid search to evaluate various learning rates, including values of $1\times 10^{-4}$ , $5\times 10^{-5}$ , and $1\times 10^{-5}$ . Our initial results reveal that employing the first two learning rates resulted in a noticeable decline in model performance compared to the original model. Consequently, we opt to continue pre-training with a learning rate of $1\times 10^{-5}$ . The remaining hyperparameters remain consistent with those outlined in Table 7. It is worth noting that our choice of continued pre-training setup may not be optimal according to recent research (Gupta et al., 2023); however, it represents the best approach within our compute constraints.