A Simple and Effective Pruning Approach for Large Language Models

Mingjie Sun, Zhuang Liu, Anna Bair, J. Zico Kolter

Introduction

Large language models (Brown et al., 2020; OpenAI, 2023) have recently reshaped the field of NLP with their remarkable performance across a range of complex language benchmarks (Bommarito & Katz, 2022; Wei et al., 2022a; Bubeck et al., 2023). However, these models, with their billions of parameters, usually require significant computational resources. To democratize LLMs, considerable efforts have been taken to mitigate their high computational cost. Many of the notable advancements to date have centered on model quantization, a process where parameters are quantized into lower bit-level representations. The fast pace of LLM quantization research (Dettmers et al., 2022; Frantar et al., 2023a; Xiao et al., 2023; Ahmadian et al., 2023) has led to substantial resource savings for these models (Sheng et al., 2023; Lin et al., 2023).

Network pruning (LeCun et al., 1989; Hassibi et al., 1993; Han et al., 2015), on the other hand, shrinks network sizes by removing specific weights from the model – essentially setting them to zero. Along with quantization, it is often considered another popular approach for compressing neural networks. However, it has received relatively little focus in compressing LLMs. This seems to contradict the trend of model compression in the pre-LLM era, where both approaches have received large amounts of research effort. A quick review of existing pruning methods reveals a possible reason: they typically require retraining (Liu et al., 2019; Blalock et al., 2020), training from random initializations (Zhu & Gupta, 2017; Louizos et al., 2018; Gale et al., 2019) or even an extensive iterative process (Frankle & Michael, 2019; Renda et al., 2020). The sheer amount of computational resources required by LLMs limits these methods. A recent LLM pruning approach, SparseGPT (Frantar & Alistarh, 2023), does not require traditional retraining, but still demands a computationally intensive weight update process.

The argument concerning the need for retraining and weight update does not fully capture the challenges of pruning LLMs. One might reasonably expect to obtain a fairly high-performing initialization point for retraining using existing popular pruning methods. However, a recent study (Frantar & Alistarh, 2023) finds that magnitude pruning (Han et al., 2015), a well-established pruning approach, fails dramatically on LLMs even with relatively low levels of sparsity. Considering the past success of magnitude pruning on smaller networks, this result suggests that LLMs, despite having 100 to 1000 times more parameters, are substantially more difficult to prune directly.

In this work, we address this challenge by introducing a straightforward and effective approach, termed Wanda (Pruning by Weights and activations). This technique successfully prunes LLMs to high degrees of sparsity without any need for modifying the remaining weights. We are motivated by an observation from a recent study (Dettmers et al., 2022): a small subset of hidden state features are exceptionally large in magnitude, a property unique to LLMs. We find that augmenting the standard weight magnitude pruning metric with the input activations is surprisingly effective as a measure of weight importance. Specifically, we introduce a novel pruning metric, where each weight is evaluated by the product of its magnitude and the norm of the corresponding input activations, estimated using a small set of calibration data. Our method uses this metric to induce sparsity in pretrained LLMs by comparing weights locally within each output of linear layers and removing lower priority weights. Our approach is computationally efficient, able to be executed in a single forward pass, and requires minimal memory overhead.

We empirically evaluate Wanda on the widely adopted LLaMA (Touvron et al., 2023a) and LLaMA-2 (Touvron et al., 2023b) model families. Our results demonstrate Wanda can find efficient sparse networks from pretrained LLMs, without any retraining or weight update. Our approach Wanda outperforms the standard magnitude pruning by a large margin and also competes favorably with the prior best LLM pruning method (Frantar & Alistarh, 2023), while requiring a lower computational cost. We hope our work serves as a baseline for future work in this area, and encourages further exploration in understanding sparsity in LLMs.

Preliminaries

Magnitude Pruning (Han et al., 2015) is a standard pruning technique to induce sparsity in neural networks. It removes individual weights based on their magnitudes, where weights with magnitudes below a certain threshold are removed. In practice, this threshold is typically determined by comparing weights locally within each layer or globally across the whole network. Despite its simplicity, magnitude pruning has been used to find extremely sparse networks (Frankle & Michael, 2019) and now stands out as a strong baseline approach (Blalock et al., 2020) for neural network sparsification.

Emergent Large Magnitude Features have been observed in Transformer-based large language models. Dettmers et al. (2022) discover that once LLMs reach a certain scale (in practice, around 6B parameters), a small set of hidden state features emerges with significantly larger magnitudes than the remaining ones. These outlier features exhibit several intriguing characteristics. First, they have very large magnitudes, about 100 times larger than typical hidden state values. Second, they are usually sparse and exist in certain feature dimensions. Finally, these outlier features are essential for the predictive capability of LLMs: zeroing out these features at inference time results in significant degradation of language modeling performance.

Wanda: Pruning by Weights and Activations

In this section, we motivate and describe our pruning method, Wanda (Pruning by Weights and activations), which consists of two simple but essential components. First, we propose a novel pruning metric that incorporates both weights and input activations into the computation of weight importance. Second, we compare weights on a per-output basis instead of across the whole layer, which we find is crucial for pruning LLMs effectively. An overview of Wanda is shown in Figure 1.

A Motivating Example. Consider a neuron with two inputs and corresponding weights: $\mathbf{y}=\mathbf{w}_{1}\mathbf{x}_{1}+\mathbf{w}_{2}\mathbf{x}_{2}$ , where $|\mathbf{w}_{1}|\leq|\mathbf{w}_{2}|$ . Now suppose the goal is to select one weight for removal while incurring less change on the output. The standard approach of magnitude pruning would always remove weight $\mathbf{w}_{1}$ , which may be a good strategy if input features $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ have similar magnitudes. However, as recently observed in LLMs (Dettmers et al., 2022), the two input features can differ significantly in scale. For instance, it is possible that $|\mathbf{x}_{1}|\gg|\mathbf{x}_{2}|$ , and as a result, $|\mathbf{w}_{1}\mathbf{x}_{1}|\gg|\mathbf{w}_{2}\mathbf{x}_{2}|$ . In this case, we should remove weight $\mathbf{w}_{2}$ instead, because this removal clearly exerts a smaller influence on the neuron output $\mathbf{y}$ than removing weight $\mathbf{w}_{1}$ .

This motivating example with the simplest linear layer hints at a major limitation of magnitude pruning: it does not take into account input activations, which could play an equally important role as weight magnitudes in determining the neuron output. For pruning LLMs, this is especially critical considering the emergent large magnitude features found within them. Thus, as the first part of our method, we propose a pruning metric designed explicitly for LLMs to handle such a limitation, while also maintaining the simplicity of magnitude pruning.
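The two-input neuron above can be checked with a few lines of Python. This is only an illustrative sketch: the weight and input values are hypothetical numbers chosen so that $|\mathbf{w}_{1}|\leq|\mathbf{w}_{2}|$ and $|\mathbf{x}_{1}|\gg|\mathbf{x}_{2}|$, and the helper `output_change` is introduced here for illustration.

```python
# Hypothetical values for the two-input neuron y = w1*x1 + w2*x2.
w = [0.5, 1.0]      # |w1| <= |w2|
x = [100.0, 1.0]    # x1 plays the role of an outlier feature: |x1| >> |x2|


def output_change(weights, inputs, removed):
    """Absolute change in y when the weight at index `removed` is set to zero."""
    return abs(weights[removed] * inputs[removed])


# Magnitude pruning removes the weight with the smallest |w|.
magnitude_choice = min(range(len(w)), key=lambda k: abs(w[k]))

# An activation-aware rule removes the weight with the smallest |w * x|.
activation_aware_choice = min(range(len(w)), key=lambda k: abs(w[k] * x[k]))

print("magnitude pruning removes index", magnitude_choice,
      "-> output changes by", output_change(w, x, magnitude_choice))
print("activation-aware rule removes index", activation_aware_choice,
      "-> output changes by", output_change(w, x, activation_aware_choice))
```

With these numbers, magnitude pruning removes the first weight and shifts the output by 50, whereas the activation-aware choice removes the second weight and shifts it by only 1.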

Pruning Metric. Consider a linear layer with weight $\mathbf{W}$ of shape $(C_{\mathtt{out}},C_{\mathtt{in}})$ . For language models, this linear layer takes in input activations $\mathbf{X}$ with a shape of $(N\times L,C_{\mathtt{in}})$ , where $N$ and $L$ are batch and sequence dimensions respectively. For each individual weight, we propose to evaluate its importance by the product of its magnitude and the corresponding input feature norm. Specifically, the score for the current weight $\mathbf{W}_{ij}$ is defined by:

$$\mathbf{S}_{ij}=|\mathbf{W}_{ij}|\cdot\|\mathbf{X}_{j}\|_{2} \tag{1}$$

where $|\cdot|$ is the absolute value operator and $\|\mathbf{X}_{j}\|_{2}$ is the $\ell_{2}$ norm of the $j$ th input feature, aggregated across the $N\times L$ tokens.

This metric is interesting in several aspects. First, when the input channel of the considered weight has large magnitude features, the weight itself tends to be assigned a larger importance score even if it has a low magnitude. This tackles the problem we encounter in the motivating example. The effect can be seen in Figure 1, where weights corresponding to the large magnitude feature are more likely to be preserved with Wanda. Second, its computation is straightforward. Once we obtain the norm vector of input feature activations, the weight importance can be calculated using an element-wise product, with the norm vector broadcast across the rows of the weight matrix. Last, we find empirically that this metric is robust and can be easily estimated using a modest number of calibration samples, without access to the original training data.
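A minimal sketch of the metric described above (weight magnitude times the $\ell_{2}$ norm of the corresponding input feature) is given below. It uses plain Python lists to stay dependency-free; a practical version would presumably operate on tensors. The function names `feature_norms` and `wanda_scores` and the toy values of `X` and `W` are assumptions made for illustration.

```python
import math


def feature_norms(X):
    """L2 norm of each input channel j, taken over all N*L token rows of X."""
    c_in = len(X[0])
    return [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(c_in)]


def wanda_scores(W, X):
    """Score S[i][j] = |W[i][j]| * ||X_j||_2 for W of shape (C_out, C_in)."""
    norms = feature_norms(X)
    return [[abs(w_ij) * norms[j] for j, w_ij in enumerate(row)] for row in W]


# Toy calibration activations of shape (N*L, C_in) = (2, 3).
# Channel 0 carries large values; channel 1 carries small ones.
X = [[3.0, 0.1, 1.0],
     [4.0, 0.2, 0.0]]

# Toy weight matrix of shape (C_out, C_in) = (2, 3).
W = [[0.1, 2.0, -1.0],
     [-0.5, 1.0, 0.3]]

S = wanda_scores(W, X)
for row in S:
    print([round(s, 3) for s in row])
```

In the first row, the small weight 0.1 on the large-norm channel receives a higher score than the weight 2.0 on the small-norm channel, which is the behaviour the first point above describes.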

Comparison Group. Generally, in a pruning method, each weight is first assigned an importance score, such as the pruning metric we discussed above. These weights are then grouped into comparison groups where weights within each group are compared against one another. Within each comparison group, weights with lower importance scores are pruned. Most previous pruning methods default to comparing weights locally within each layer or globally across the whole network.

While layer-wise and whole-network comparisons have been the popular options, we find that pruning LLMs could benefit from a more localized grouping. In our method, we compare and remove weights on a per-output basis (per row in Figure 1), where weight importance scores are compared locally within each output neuron. Specifically, for a weight $\mathbf{W}_{ij}$ that connects input $j$ to output $i$ inside the linear layer, we define the comparison group for this weight as all weights connecting to output $i$ :

$$\mathbf{G}_{ij}=\{\mathbf{W}_{uv}\,|\,u=i\} \tag{2}$$

Under this comparison group, for a pre-defined sparsity ratio $s\%$ , we eliminate $s\%$ of the weights connected to each output. This practice may seem counter-intuitive, since we are basically pruning under a stricter sparsity pattern. However, we find that it is consistently better than layer-wise pruning for LLMs. Notably, this holds true not only for our proposed pruning metric (Equation 1) but also the standard magnitude metric. This shows that maintaining a balanced pruning ratio across output features is important for pruning LLMs effectively.
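The difference between the per-output and layer-wise comparison groups can be illustrated with a small sketch. The helper names `prune_per_output` and `prune_per_layer` are hypothetical, and the scores are simply set to the weight magnitudes (as if all input norms were 1) so that only the grouping differs.

```python
def prune_per_output(W, S, sparsity):
    """Zero the `sparsity` fraction of lowest-scoring weights in each row of W."""
    pruned = []
    for w_row, s_row in zip(W, S):
        n_prune = int(len(w_row) * sparsity)
        # Indices of the n_prune smallest scores within this output's row.
        drop = set(sorted(range(len(s_row)), key=lambda j: s_row[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


def prune_per_layer(W, S, sparsity):
    """Zero the lowest-scoring weights across the whole layer, ignoring rows."""
    flat = sorted((s, i, j) for i, row in enumerate(S) for j, s in enumerate(row))
    n_prune = int(len(flat) * sparsity)
    drop = {(i, j) for _, i, j in flat[:n_prune]}
    return [[0.0 if (i, j) in drop else w for j, w in enumerate(row)]
            for i, row in enumerate(W)]


# Toy layer whose first output row has uniformly smaller weights.
W = [[0.1, 0.2, 0.3, 0.4],
     [0.5, 0.6, 0.7, 0.8]]
S = [[abs(w) for w in row] for row in W]  # assumption: unit input norms

per_output = prune_per_output(W, S, 0.5)
per_layer = prune_per_layer(W, S, 0.5)
print(per_output)
print(per_layer)
```

At 50% sparsity, the per-output variant removes two weights from each row, while the layer-wise variant removes the entire first row. The sketch shows how layer-wise grouping can leave the pruning ratio unbalanced across output features.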

To see if the superiority of pruning per output over per layer holds true in general, we conduct additional experiments on pruning image classifiers. However, we do not observe a similar trend in image classification models, suggesting that our observations regarding pruning per output might be unique to LLMs. We hope this intriguing observation encourages practitioners to be more cautious in choosing the comparison group.

Procedure. Wanda can be implemented and integrated seamlessly within a single forward pass of the LLM, where feature norm statistics $\|\mathbf{X}_{j}\|_{2}$ are estimated with a set of calibration data. We provide the PyTorch code of our approach in Algorithm 1. Given a pretrained LLM, we compute our pruning metric from the initial to the final layers of the network. After pruning a preceding layer, the subsequent layer receives updated input activations, based on which its pruning metrics will be computed. A recent method for pruning LLMs, SparseGPT (Frantar & Alistarh, 2023), requires a sophisticated weight update procedure in an iterative pruning process, while Wanda does not involve any additional weight update.
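The layer-by-layer procedure can be sketched as follows. This is not the authors' Algorithm 1 but a dependency-free illustration under strong simplifying assumptions: the "model" is a toy stack of bias-free linear layers with no nonlinearities, attention, or residual connections, and the names `prune_layer`, `linear`, and `prune_model` are hypothetical.

```python
import math


def prune_layer(W, X, sparsity):
    """Prune one linear layer per output row using |W_ij| * ||X_j||_2."""
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(len(X[0]))]
    pruned = []
    for w_row in W:
        scores = [abs(w) * norms[j] for j, w in enumerate(w_row)]
        n_prune = int(len(w_row) * sparsity)
        drop = set(sorted(range(len(scores)), key=scores.__getitem__)[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


def linear(W, X):
    """Apply y = x W^T to every row x of X."""
    return [[sum(w * x for w, x in zip(w_row, x_row)) for w_row in W]
            for x_row in X]


def prune_model(layers, calib_X, sparsity):
    """Prune layers first to last; each layer sees inputs from the pruned prefix."""
    X, pruned_layers = calib_X, []
    for W in layers:
        W_pruned = prune_layer(W, X, sparsity)
        pruned_layers.append(W_pruned)
        X = linear(W_pruned, X)  # updated activations for the next layer
    return pruned_layers


# Toy two-layer model and calibration batch (hypothetical values).
layers = [[[1.0, -2.0], [0.5, 3.0]],
          [[2.0, 0.1], [-1.0, 4.0]]]
calib_X = [[10.0, 1.0], [-10.0, 2.0]]

pruned = prune_model(layers, calib_X, 0.5)
print(pruned)
```

The surviving weights keep their original values: only the mask is computed, and no weight update follows. The second layer's scores are computed from the outputs of the already-pruned first layer.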

Structured N:M Sparsity. While Wanda so far has been developed for unstructured sparsity, it extends naturally to structured N:M sparsity (Mishra et al., 2021). Structured N:M sparsity requires that at most N out of every M contiguous weights be non-zero. It can leverage NVIDIA’s sparse tensor cores to accelerate matrix multiplication in practice. To apply Wanda in this setting, we compare weights using the same metric among every M consecutive weights, for all weights connected to an output.
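A minimal sketch of the N:M variant for a single output row follows. The function name `prune_n_m` and the toy weights and scores are assumptions for illustration. The scores are taken as given and would come from the pruning metric in practice.

```python
def prune_n_m(w_row, s_row, n, m):
    """Keep the n highest-scoring weights in every group of m consecutive weights."""
    if len(w_row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = list(w_row)
    for start in range(0, len(w_row), m):
        group = range(start, start + m)
        keep = set(sorted(group, key=lambda j: s_row[j], reverse=True)[:n])
        for j in group:
            if j not in keep:
                out[j] = 0.0
    return out


# One output row with eight weights and hypothetical importance scores.
w_row = [0.9, -0.1, 0.4, 0.3, 0.2, 0.8, -0.7, 0.05]
s_row = [0.9, 2.0, 0.4, 0.3, 0.2, 0.8, 0.7, 1.0]

row_2_4 = prune_n_m(w_row, s_row, n=2, m=4)
print(row_2_4)
```

Each block of four consecutive weights ends up with exactly two non-zero entries. Because selection follows the scores, a small weight with a high score (for example, -0.1 in the first block) is kept.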

Remark. We discuss the connection between Wanda and a few existing works. SparseGPT formalizes the problem of pruning LLMs by solving a local layer-wise reconstruction problem, where their pruning metric and weight update procedure are inspired by Optimal Brain Surgeon (OBS) (Hassibi et al., 1993). The pruning metric in SparseGPT is:

$$\mathbf{S}_{ij}=\left[|\mathbf{W}|^{2}/\mathrm{diag}\left((\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})^{-1}\right)\right]_{ij} \tag{3}$$

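Here $\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I}$ in the denominator is the Hessian $\mathbf{H}$ for the layer-wise reconstruction problem and $\lambda$ is the Hessian dampening factor to avoid the collapse of inverse computation. With careful inspection, we observe that our metric in Equation 1 is similar to the above when $\lambda$ is 0 and only the diagonal elements of the Hessian matrix $\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I}$ are retained. Starting from the pruning metric in Equation 3, we show the exact reduction steps and corresponding reduction conditions as follows:

$$\left[|\mathbf{W}|^{2}/\mathrm{diag}\left((\mathbf{X}^{T}\mathbf{X}+\lambda\mathbf{I})^{-1}\right)\right]_{ij}\xrightarrow{\lambda=0}\left[|\mathbf{W}|^{2}/\mathrm{diag}\left((\mathbf{X}^{T}\mathbf{X})^{-1}\right)\right]_{ij}\xrightarrow{\text{diagonal approx.}}\left[|\mathbf{W}|^{2}/\left(\mathrm{diag}(\mathbf{X}^{T}\mathbf{X})\right)^{-1}\right]_{ij}=\left(|\mathbf{W}_{ij}|\cdot\|\mathbf{X}_{j}\|_{2}\right)^{2} \tag{4}$$

The last step uses the fact that the $j$ th diagonal entry of $\mathbf{X}^{T}\mathbf{X}$ is $\|\mathbf{X}_{j}\|_{2}^{2}$ .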

LeCun et al. (1989) set up a pioneering framework for neural network pruning named Optimal Brain Damage (OBD). It uses second-order information without off-diagonal elements in Hessians for faster approximation. Later, Optimal Brain Surgeon (OBS) built upon OBD, partly by taking into account the off-diagonal elements. Wanda can be seen as a renaissance of OBD: it may be viewed as applying a process similar to OBD to each neuron, with local output reconstruction as the objective function, whereas the original OBD uses the global training objective. This is analogous to the relationship between SparseGPT and OBS.
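The relationship between the SparseGPT-style metric and the Wanda metric discussed above can be checked numerically on a tiny example. The sketch below assumes $\lambda=0$ and two input channels, so that the Hessian $\mathbf{X}^{T}\mathbf{X}$ can be inverted in closed form. All function names and values are hypothetical.

```python
import math


def gram(X):
    """H = X^T X for X given as a list of token rows."""
    c = len(X[0])
    return [[sum(r[a] * r[b] for r in X) for b in range(c)] for a in range(c)]


def diag_of_inverse_2x2(H):
    """Diagonal of H^{-1} for a 2x2 matrix, via the closed-form inverse."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [H[1][1] / det, H[0][0] / det]


def obs_style_scores(W, H):
    """|W_ij|^2 / diag(H^{-1})_j, i.e. the SparseGPT-style score with lambda = 0."""
    d = diag_of_inverse_2x2(H)
    return [[w * w / d[j] for j, w in enumerate(row)] for row in W]


def wanda_scores(W, X):
    """|W_ij| * ||X_j||_2."""
    norms = [math.sqrt(sum(r[j] ** 2 for r in X)) for j in range(len(X[0]))]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in W]


X = [[3.0, 1.0], [4.0, -2.0]]   # two tokens, two input channels
W = [[0.5, -2.0]]               # one output neuron

H = gram(X)                                      # [[25, -5], [-5, 5]]
H_diag = [[H[0][0], 0.0], [0.0, H[1][1]]]        # keep diagonal entries only

full = obs_style_scores(W, H)                    # uses off-diagonal terms
reduced = obs_style_scores(W, H_diag)            # diagonal approximation
wanda_sq = [[s * s for s in row] for row in wanda_scores(W, X)]

print(full, reduced, wanda_sq)
```

With the diagonal approximation, the scores coincide with the squared Wanda scores (up to floating-point error). With the full Hessian they differ, because the off-diagonal entries contribute to the inverse.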

A comparison of LLM pruning methods can be found in Table 1. Computing the pruning metric of Wanda has a reduced time complexity compared to SparseGPT, because it does not involve inverse computation. Overall, our method Wanda (Pruning by Weights and activations) has several attractive properties as an approach for pruning LLMs:

It maintains the simplicity of magnitude pruning in the pre-LLM era, requiring no gradient computation via back-propagation or any second-order Hessian inverses, but is also highly effective in discovering sparse networks in pretrained LLMs.

Wanda can be done with a single forward pass of the LLM. At each layer, the pruned weights can be decided in one shot without an iterative procedure. In practice, computing the pruning metric of Wanda can be 300 times faster in pruning LLMs compared with SparseGPT.

Unlike SparseGPT, our approach entails no weight update on pruned networks, suggesting that LLMs have effective sparse sub-networks that are exact, instead of them merely existing in the neighborhood of the original weights.

Experiments

Models and Evaluation. We evaluate Wanda on the two most widely adopted LLM model families: LLaMA 7B/13B/30B/65B (Touvron et al., 2023a) and LLaMA-2 7B/13B/70B (Touvron et al., 2023b) (LLaMA-2 34B is not released). Results for prior LLM families can be found in Appendix B. We measure the performance of pruned models on zero-shot tasks and language modeling. For zero-shot evaluation, we use seven tasks from EleutherAI LM Harness (Gao et al., 2021). Following previous works on LLM compression (Xiao et al., 2023; Frantar & Alistarh, 2023), we evaluate the perplexity on the held-out WikiText (Merity et al., 2016) validation set.

Baselines. We compare Wanda with two prior pruning approaches. Magnitude pruning (Han et al., 2015) is a simple and strong baseline in which weights are discarded based on their magnitudes. SparseGPT (Frantar & Alistarh, 2023) is a second-order pruning method for LLMs, based on solving a layer-wise reconstruction problem. In Appendix C, we compare with additional pruning methods.

Both Wanda and SparseGPT require calibration data to estimate input statistics (see Table 1). To control for this factor, we use the exact same set of calibration data as SparseGPT, which consists of 128 sequences of context-length size sampled from the C4 training set (Raffel et al., 2020).

Sparsity. For all pruning methods, we focus on pruning the linear layers (skipping the first embedding layer and the final classification head), which account for around 99 $\%$ of the total LLM parameters. We impose a uniform sparsity for all linear layers. We evaluate three types of sparsity: unstructured sparsity, structured 4:8 and 2:4 sparsities. The magnitude pruning baseline is extended to structured N:M sparsity in a similar spirit to our method, as described in the previous section.

Comparison with Baselines. In Table 2, we show the mean zero-shot accuracies on 7 zero-shot tasks of pruned LLaMA and LLaMA-2 models. We refer the reader to Appendix D for task-wise performance. Across both unstructured and structured sparsities, Wanda outperforms the well-established magnitude pruning approach by a large margin, while also rivaling the previous best approach, SparseGPT. Given that no fine-tuning takes place, there is a noticeable gap between sparse pruned LLMs and the original dense LLMs. However, as the model size increases, this accuracy gap diminishes. Remarkably, unstructured 50 $\%$ sparse LLaMA-65B and LLaMA-2-70B are able to match the zero-shot accuracies of their dense counterparts.

Large Sparse vs. Small Dense. Some readers may be interested in how large sparse LLMs compare with small dense LLMs of similar parameter counts. For zero-shot performance, we find the trend differs across the types of sparsity. For unstructured sparsity, large sparse LLMs are often better than small dense LLMs on zero-shot performance: unstructured 50 $\%$ sparse LLaMA-65B (66.67 $\%$ ) outperforms dense LLaMA-30B (65.38 $\%$ ); unstructured 50 $\%$ sparse LLaMA-2-13B (60.83 $\%$ ) outperforms dense LLaMA-2-7B (59.71 $\%$ ). Intriguingly, this gap is much larger for few-shot tasks (see Appendix D). For structured sparsity, the trend is reversed: without any fine-tuning, large sparse LLMs have worse zero-shot performance than small dense LLMs in general.

Language Modeling

In Table 3, we report the perplexity of pruned LLaMA and LLaMA-2 models. For robustness analysis under random sampling of the calibration data, see Appendix D.

Without any weight update, Wanda outperforms the established pruning approach of magnitude pruning by a large margin. For instance, for LLaMA-7B, Wanda is able to find sparse networks with a perplexity of 7.26, significantly better than the magnitude pruning baseline (17.29). This result suggests that exact and effective sparse sub-networks exist for LLMs. For unstructured 50% sparsity, Wanda performs on par with the prior best approach, SparseGPT. We provide results for higher sparsity levels (60% and 80%) in Appendix D. The comparison between Wanda and SparseGPT is mixed for structured sparsity. On smaller models (e.g., 7B), SparseGPT outperforms Wanda on 2:4 sparsity. Wanda is more favorable for larger models, e.g., LLaMA-30B (2:4 and 4:8) and LLaMA-2-70B (2:4).

Speedup

Pruning Speed. The theoretical computational complexity of Wanda is lower than SparseGPT (Table 1). Here we compare their empirical pruning speed. Specifically, we measure the accumulated time for computing the pruning metric at each layer (excluding the forward pass process shared by both methods) on NVIDIA A6000 GPUs. Results are shown in Table 4. Wanda incurs negligible time overhead relative to SparseGPT. The fast speed of Wanda is particularly useful when pruning needs to be performed on a real-time basis, e.g., training sparse models from scratch (Evci et al., 2020) and finding the optimal sparsity (Jin et al., 2022).

Inference Speed. We evaluate the inference speedup for structured 2:4 sparsity on NVIDIA A6000 GPUs. Following the evaluation setup of Frantar & Alistarh (2023), we measure the latency of matrix multiplication in linear layers. We perform simulation analysis using the high-performance GEMM kernel in the NVIDIA CUTLASS library. Results for LLaMA-65B (batch size of 1) can be found in Table 5. Structured 2:4 sparsity is able to bring notable inference speedup (around 1.6 $\times$ ) for linear layers in LLMs. For end-to-end latency, we observe a speedup of 1.24 $\times$ on LLaMA-7B (251ms as compared to 312ms). Last, we emphasize that the inference speedup is not unique to our pruning method but is delivered by the inherent power of sparsity for speeding up computation.

Analysis

We study several aspects of Wanda to better understand its effectiveness in pruning LLMs. We use the LLaMA-7B model and prune to unstructured 50 $\%$ sparsity, unless otherwise specified.

Fine-tuning. We study how fine-tuning could recover the performance drop of pruned LLMs, as observed in the previous section. We investigate two strategies for fine-tuning LLMs: LoRA (Hu et al., 2021) fine-tuning and full parameter dense fine-tuning. Fine-tuning is conducted on C4 training dataset and the objective is the pre-training auto-regressive loss. The pruned mask is kept fixed during fine-tuning. We fine-tune pruned LLaMA-7B with all three types of sparsities: unstructured 50 $\%$ , structured 4:8 and 2:4. Table 6 summarizes the results for mean zero-shot accuracies and perplexity after fine-tuning Wanda pruned LLaMA-7B models. See Appendix D for task-wise performance.

LoRA Fine-tuning. We enforce a limited computational budget (1 GPU and 12 hours). The low rank ( $r=8$ ) adapter is applied on the query and value projection matrices in attention layers. For LLaMA-7B, LoRA introduces only around 0.06% additional parameters, leaving the total sparsity level still around 50%. With LoRA fine-tuning, we are able to restore the performance of pruned LLMs by a non-trivial amount. One notable instance is that LoRA fine-tuning improves the zero-shot performance of structured 2:4 sparse LLaMA-7B from 48.53 $\%$ to 54.46 $\%$ , outperforming the original unstructured 50 $\%$ sparse LLaMA-7B (54.21 $\%$ ).
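The figure of roughly 0.06% additional parameters can be reproduced with a back-of-the-envelope calculation. The architecture numbers below (hidden size 4096, 32 Transformer blocks, about 6.74 billion total parameters) are assumptions about LLaMA-7B that are not stated in the text above. Only the rank $r=8$ and the choice of query and value projections come from it.

```python
hidden_size = 4096        # assumed LLaMA-7B hidden dimension
num_layers = 32           # assumed number of Transformer blocks
rank = 8                  # LoRA rank r, from the text
adapted_per_layer = 2     # query and value projections, from the text
total_params = 6.74e9     # assumed approximate LLaMA-7B parameter count

# Each adapted (hidden x hidden) projection gains two low-rank factors:
# A of shape (rank, hidden) and B of shape (hidden, rank).
params_per_adapter = 2 * hidden_size * rank
lora_params = num_layers * adapted_per_layer * params_per_adapter
fraction = lora_params / total_params

print(f"{lora_params:,} LoRA parameters = {100 * fraction:.3f}% of the model")
```

Under these assumptions the adapters add about 4.2 million parameters, which is consistent with the "around 0.06%" quoted above and leaves the overall sparsity essentially unchanged.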

Full Parameter Fine-tuning. We conduct full parameter dense fine-tuning. We enforce a limited computational budget (4 GPUs and 3 days). Compared to LoRA fine-tuning, full parameter dense fine-tuning is able to mitigate the gap between pruned LLMs and dense LLMs even further. For unstructured 50 $\%$ sparsity, full parameter fine-tuning could improve pruned LLaMA-7B from 54.21 $\%$ to 58.15 $\%$ in terms of zero-shot accuracy, close to that of dense LLaMA-7B (59.99 $\%$ ).

Pruning Configuration. Wanda differs from previous methods in both the pruning metric and the comparison group. We conduct ablation experiments to better understand their impact. The three pruning metrics can be found in Table 1. SparseGPT adopts a local comparison group inside a layer, where weights connected to 128 consecutive input channels form a group. Wanda groups weights connected with a single output channel. Therefore, we ablate two blocksize options (128 and 1) and the input/output choice. For simplicity, we use (input/output, blocksize) to denote each local comparison group, e.g., (input, 1). For this experiment, we do not perform the weight update procedure in SparseGPT to focus on the pruning configuration.

The results are shown in Table 7. We refer the reader to Appendix A for analysis on image classifiers and Appendix D for analysis on previous LLMs. The default pruning configuration of Wanda delivers the best pruned model (perplexity 7.26). Interestingly, for the magnitude metric, comparing weights of the same input neuron (input, 1) yields a perplexity of 8.86, significantly better than other grouping options. All three methods also produce equivalent pruning results under this comparison group: the input is the same, so the weight ranking depends only on weight magnitude. This finding further highlights the importance of using a proper comparison group for pruning LLMs, even for the classical magnitude pruning approach.
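The (input/output, blocksize) notation and the equivalence under (input, 1) can be illustrated with a short sketch. The helpers `comparison_groups` and `prune_in_groups`, the toy weights, and the feature norms are hypothetical. The interpretation of a group as `blocksize` consecutive channels along the chosen axis is an assumption based on the description above.

```python
def comparison_groups(c_out, c_in, axis, blocksize):
    """Group weight indices (i, j) by blocks of `blocksize` channels along `axis`."""
    groups = {}
    for i in range(c_out):
        for j in range(c_in):
            key = i // blocksize if axis == "output" else j // blocksize
            groups.setdefault(key, []).append((i, j))
    return groups


def prune_in_groups(W, S, groups, sparsity):
    """Within each group, zero the `sparsity` fraction of lowest-scoring weights."""
    pruned = [list(row) for row in W]
    for members in groups.values():
        n_prune = int(len(members) * sparsity)
        for i, j in sorted(members, key=lambda ij: S[ij[0]][ij[1]])[:n_prune]:
            pruned[i][j] = 0.0
    return pruned


W = [[0.1, 0.5], [0.4, 0.3], [0.2, 0.9], [0.8, 0.6]]   # (C_out, C_in) = (4, 2)
norms = [10.0, 1.0]                                     # assumed feature norms

S_mag = [[abs(w) for w in row] for row in W]
S_wanda = [[abs(w) * norms[j] for j, w in enumerate(row)] for row in W]

g_in = comparison_groups(4, 2, "input", 1)     # (input, 1): one group per column
g_out = comparison_groups(4, 2, "output", 1)   # (output, 1): one group per row

same_under_input = (prune_in_groups(W, S_mag, g_in, 0.5)
                    == prune_in_groups(W, S_wanda, g_in, 0.5))
same_under_output = (prune_in_groups(W, S_mag, g_out, 0.5)
                     == prune_in_groups(W, S_wanda, g_out, 0.5))
print(same_under_input, same_under_output)
```

Within a single input channel the feature norm is a constant factor, so the magnitude and Wanda scores rank weights identically under (input, 1). Under (output, 1), the two metrics select different weights in this example.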

Robustness to Calibration Samples. We vary the number of calibration samples by selecting different sample sizes ranging between 1 and 256. Results are summarized in Figure 2. We see a clear difference in trend as the size of calibration data changes, where Wanda is much more robust when there are few calibration samples. Notably, even with a single sample, pruned networks obtained by Wanda have a perplexity of 7.66. This may be because input norm statistics $\|\mathbf{X}_{j}\|$ could be much easier to estimate than the full inverse Hessian $\mathbf{H}^{-1}$ of the local layer-wise reconstruction problem.

Weight Update. We characterize the conditions under which the weight update process in SparseGPT can improve the effectiveness of pruning LLMs. We experiment with two ways of applying weight update: sequential and iterative. A sequential update means that at each layer, the full pruned mask is first computed and weight update is performed on the remaining weights. An iterative update means that the pruning and weight update steps proceed iteratively within each layer. SparseGPT adopts an iterative update procedure every 128 input channels, as it was found to give more accurate results.

Effects of the weight update on magnitude pruning and Wanda are summarized in Table 8. We study these two pruning methods because they do not involve any weight update by default. An iterative update changes the comparison group for unstructured pruning, which we denote in the table as (input, 128). We make several interesting observations:

For all considered sparsities, weight update can improve magnitude pruning by a large margin.

For unstructured 50 $\%$ and 4:8 sparsities, weight update does not bring any improvement to Wanda.

For 2:4 sparsity, the improvement (from 11.53 to 10.89) is marginal. Note that the best 2:4 sparse model (10.89) we obtained here is better than that obtained by SparseGPT (11.00 in Table 3).

Last, we examine an extreme sparsity level (70 $\%$ ), where weight update can improve Wanda from 84.50 to 29.65. However, the best pruned model (29.65) lags far behind the dense LLaMA-7B (5.68).

Related Work

Network Pruning and Sparsity. Pruning is a popular technique for compressing neural networks through the elimination of weights, yielding sparse networks (LeCun et al., 1989; Hassibi et al., 1993). It can be broadly categorized into structured and unstructured approaches.

Structured pruning methods (Liu et al., 2017; Molchanov et al., 2019; Fan et al., 2020; Shen et al., 2022; Xia et al., 2022; Fang et al., 2023; Nova et al., 2023), sometimes referred to as activation pruning (Gale et al., 2019; Dhillon et al., 2018), remove entire structured components of a network, facilitating efficient GPU speedups. Some existing methods (Babaeizadeh et al., 2016; Dubey et al., 2018) have explored structured pruning based on activation statistics of neuron/filter output, e.g., percentage of zero activations (Hu et al., 2016) and activation mean (Molchanov et al., 2017). Recently, Ma et al. (2023) have studied structured pruning of LLMs. Bansal et al. (2023), Liu et al. (2023b) and Voita et al. (2023) have demonstrated the existence of prompt-dependent and task-specific sparsity in the structural components of LLMs, e.g., attention heads and MLP neurons.

Unstructured methods (Han et al., 2015; 2016; Paul et al., 2023; Hoang et al., 2023; Gadhikar et al., 2023; Liu et al., 2023a) like magnitude pruning operate at the individual weight level, maintaining performance even at higher sparsity levels. Existing pruning methods usually require either modifications to the training procedure (Sanh et al., 2020; Kusupati et al., 2020), retraining the pruned networks to regain accuracy (Liu et al., 2019; Zhou et al., 2023), or an even more computationally intensive iterative retraining process (Renda et al., 2020; Frankle et al., 2020). However, scaling these methods to LLMs with billions of parameters presents a challenge, as the required training process demands substantial computational resources (Hoffmann et al., 2022; Zhang et al., 2022).

Pruning with Limited Data. Most related to our approach is a recent line of work on pruning with limited data (Hubara et al., 2021; Frantar et al., 2022; Frantar & Alistarh, 2022; Kwon et al., 2022). Such methods require no modification to the original training procedure and also no retraining of the pruned networks on the full training dataset. The primary aim of these methods is to preserve performance during the pruning procedure, assuming access to a limited and small amount of data, also referred to as the calibration data. In order to mitigate the accuracy drop, a layer-wise reconstruction problem (Hubara et al., 2021) is solved to minimize the change of output evaluated on the calibration data. Existing solvers (Singh & Alistarh, 2020; Frantar et al., 2022) for the layer-wise reconstruction problem rely on heavy computation of second-order Hessian inverses, which do not scale to the large hidden state size of LLMs. SparseGPT (Frantar & Alistarh, 2023) develops an efficient weight update procedure for LLMs via synchronized second-order Hessian updates.

Emergent Properties of LLMs. Our work is also related to recent studies on the existence of large magnitude outlier features in large language models (Kovaleva et al., 2021; Bondarenko et al., 2021; Timkey & Schijndel, 2021; Luo et al., 2021; Puccetti et al., 2022; Wei et al., 2022b). Dettmers et al. (2022) demonstrate that when LLMs exceed a certain parameter scale (e.g., 6B), large magnitude features start to emerge and strongly affect all layers, which can be seen as an emergent property of LLMs (Dettmers et al., 2022; Wei et al., 2022a; Schaeffer et al., 2023). They also pinpoint these emerging features as the reason why existing quantization methods fail. This observation has spurred the development of various quantization schemes (Dettmers et al., 2022; Xiao et al., 2023; Lin et al., 2023; Dettmers et al., 2023; Behdin et al., 2023) tailored specifically for LLMs to handle outlier features. Our work extends this understanding, demonstrating that outlier features should also serve as pivotal indicators of which weights to prune in LLMs.

Conclusion

In this work, we propose a simple and effective method for pruning Large Language Models (LLMs). Inspired by the recent discovery of emergent large magnitude features in LLMs, our approach, termed Wanda (Pruning by Weights and activations), removes weights with the smallest magnitudes multiplied by the corresponding input activation norms, on a per-output basis. Without the need for any retraining or weight update procedures, Wanda is able to identify effective sparse networks within pretrained LLMs. We hope our work contributes to a better understanding of sparsity in LLMs. Last, considering the fast speed of pruning with Wanda, it would be interesting to investigate whether Wanda can be useful in the setting of sparse training (Evci et al., 2020; Peste et al., 2021; Kuznedelev et al., 2023; Benbaki et al., 2023; Frantar et al., 2023b), where pruning has to be conducted repeatedly and thus the pruning efficiency is critical.

Acknowledgments. We thank Yonghao Zhuang for valuable discussions. Mingjie Sun and Anna Bair were supported by funding from the Bosch Center for Artificial Intelligence.

References

Appendix A Image classifiers

We study how Wanda would perform against magnitude pruning on tasks where the latter has been widely used. We conduct a study on ImageNet-1K (Deng et al., 2009), a standard image classification task where magnitude pruning has been extensively studied (Gale et al., 2019; Blalock et al., 2020). We consider two modern vision architectures: ConvNeXt (Liu et al., 2022) and Vision Transformer (ViT) (Dosovitskiy et al., 2021). We choose these two architectures mainly for two reasons: first, as LLMs are based on Transformers, we would like to test if our observations on LLMs still hold on Transformers for other tasks; second, as we are evaluating on image classification, we are interested in examining how these pruning methods work on ConvNet models, with ConvNeXt being a representative architecture.

We use two ImageNet-1K pretrained models: ConvNeXt-B and DeiT-B, with top-1 accuracies of 83.8$\%$ and 81.8$\%$, respectively. We prune the linear layers only (for ConvNeXt, this includes the equivalent 1$\times$1 convolution layers). For calibration data, we sample 4096 images from the ImageNet training set. We observe that 4096 samples lead to a stable result for our pruning metric, beyond which we notice only a marginal effect. We report the accuracy of one-shot pruned models without any subsequent retraining.
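To illustrate why 1$\times$1 convolution layers can be treated as linear layers for pruning, the following minimal sketch applies the same weight matrix both ways on a toy feature map. The function names, the bias-free layers and the toy values are assumptions made for illustration and are not taken from our implementation.

```python
def linear(weight, x):
    """Bias-free linear layer: weight is (c_out, c_in), x has c_in entries."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def conv1x1(weight, fmap):
    """Bias-free 1x1 convolution on a (c_in, height, width) feature map."""
    c_in, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(row[c] * fmap[c][h][w] for c in range(c_in)) for w in range(width)]
         for h in range(height)]
        for row in weight
    ]


# Toy weight matrix of shape (c_out=2, c_in=3) and a 3-channel 2x2 feature map
weight = [[1.0, 0.0, -1.0],
          [0.5, 2.0, 0.0]]
fmap = [[[1.0, 2.0], [3.0, 4.0]],   # channel 0
        [[0.0, 1.0], [1.0, 0.0]],   # channel 1
        [[2.0, 2.0], [0.0, 1.0]]]   # channel 2

out = conv1x1(weight, fmap)

# The linear layer applied to the channel vector of each pixel gives the same values
for h in range(2):
    for w in range(2):
        pixel = [fmap[c][h][w] for c in range(3)]
        assert linear(weight, pixel) == [out[o][h][w] for o in range(2)]
```

Because both operations share one (c_out, c_in) weight matrix, a per-weight pruning mask computed for a linear layer carries over directly to a 1$\times$1 convolution.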

We first study whether pruning per output is superior to pruning per layer for image classifiers. In Figure 3, we compare the two for both the magnitude metric and the pruning metric of Wanda. We can see that for both ConvNeXt-B and DeiT-B, layer-wise pruning is slightly better than pruning per output. We then compare the pruning metric of Wanda and the magnitude metric under layer-wise pruning. Results are shown in Figure 4. Our novel pruning metric leads to better results than magnitude pruning, especially at high sparsities (e.g., 70$\%$ and 80$\%$).

Appendix B Wanda on previous LLMs

In addition to LLaMA and LLaMA-2, we experiment with three previous LLM model families: namely OPT (Zhang et al., 2022), BLOOM (Scao et al., 2022) and Pythia (Biderman et al., 2023).

Comparison with Baselines. For OPT and Pythia, we experiment with varying sparsity levels (10$\%$ to 50$\%$). We conduct additional evaluation on OPT and BLOOM models with various sizes. Results are shown in Table 9, Table 10 and Table 11 respectively. Our observations are as follows:

Unlike on LLaMA and LLaMA-2, the well-established magnitude pruning approach fails catastrophically on OPT-13B and Pythia-12B, even for low sparsity levels (e.g., 20$\%$). This result further highlights the limitations of magnitude pruning for LLMs, as discussed in Section 3.

Unlike magnitude pruning, Wanda successfully prunes these LLMs to much higher sparsities across various LLM model families, without any weight update on the kept weights. This result shows that LLMs have effective sub-networks that are exact, in the sense that the kept weights are left unchanged from the pretrained model. We hope this observation could contribute to a better understanding of sparsity in LLMs.

There are cases where Wanda slightly underperforms SparseGPT, especially for OPT models (see Table 10), suggesting that for OPT, there may be a tradeoff between pruning speed and pruning accuracy. However, the gap between SparseGPT and Wanda tends to get smaller as model sizes increase. This can be seen in Table 10 and Table 11.

At lower sparsities (e.g., 20$\%$), Table 9 indicates that the computationally intensive weight update process may be unnecessary, as Wanda yields comparable or slightly superior results.

Comparison Group. We test if our observation regarding pruning per output holds true for other LLM model families. We experiment on OPT (Zhang et al., 2022) and BLOOM (Scao et al., 2022). In Table 12 and Table 13, we provide results comparing pruning per layer and pruning per output for these two LLM model families. The pruning metric is fixed to be our proposed metric: $|\mathbf{W}_{ij}|\cdot\|\mathbf{X}_{j}\|$. We can see that our findings regarding the comparison group are not limited to LLaMA. For OPT and BLOOM model families, pruning per output consistently outperforms pruning per layer.
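The difference between the two comparison groups can be sketched on a toy linear layer. This is a minimal illustration and not our released implementation: the function names and toy values are hypothetical, and $\|\mathbf{X}_{j}\|$ is assumed here to be the $\ell_2$ norm of the $j$-th input feature over the calibration tokens.

```python
import math


def wanda_scores(W, X):
    """Score each weight as |W[i][j]| * ||X[:, j]||_2.

    W: weight rows, shape (out_features, in_features).
    X: calibration inputs, shape (num_tokens, in_features).
    """
    norms = [math.sqrt(sum(row[j] ** 2 for row in X)) for j in range(len(W[0]))]
    return [[abs(w) * norms[j] for j, w in enumerate(row)] for row in W]


def prune_per_output(W, scores, sparsity):
    """Zero the lowest-scoring fraction of weights within each output row."""
    pruned = []
    for w_row, s_row in zip(W, scores):
        k = int(len(w_row) * sparsity)
        drop = set(sorted(range(len(w_row)), key=lambda j: s_row[j])[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(w_row)])
    return pruned


def prune_per_layer(W, scores, sparsity):
    """Zero the lowest-scoring fraction of weights across the whole matrix."""
    rows, cols = len(W), len(W[0])
    flat = [(scores[i][j], i, j) for i in range(rows) for j in range(cols)]
    k = int(len(flat) * sparsity)
    drop = {(i, j) for _, i, j in sorted(flat)[:k]}
    return [[0.0 if (i, j) in drop else W[i][j] for j in range(cols)]
            for i in range(rows)]


# Toy layer: 2 outputs, 4 inputs; the last input feature has outlier activations
W = [[0.9, -0.8, 0.7, 0.1],
     [0.05, 0.04, -0.03, 0.02]]
X = [[1.0, 1.0, 1.0, 50.0],
     [1.0, -1.0, 1.0, 50.0]]

scores = wanda_scores(W, X)
per_output = prune_per_output(W, scores, 0.5)
per_layer = prune_per_layer(W, scores, 0.5)

print(per_output)  # [[0.9, 0.0, 0.0, 0.1], [0.05, 0.0, 0.0, 0.02]]
print(per_layer)   # [[0.9, -0.8, 0.0, 0.1], [0.0, 0.0, 0.0, 0.02]]
```

In this toy case, pruning per output keeps two weights for every output, whereas pruning per layer leaves the second output with a single weight. In both groups the small weights attached to the outlier input feature are kept, which magnitude pruning alone would remove.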

Appendix C Additional Baselines

We compare with several prior activation pruning methods. These approaches remove entire neurons in the network based on certain statistics of the neuron output: mean and standard deviation (Molchanov et al., 2017), correlation (Babaeizadeh et al., 2016) and mean squared norm (Dubey et al., 2018). We show the results of pruning LLaMA-7B in Table 14. We compute these output statistics using the calibration set and remove neurons with smaller values. We observe that these activation pruning methods are unable to prune LLMs effectively.
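The following minimal sketch illustrates how such output statistics rank neurons for removal. It covers only the mean, standard deviation and mean squared value; the correlation-based criterion is omitted. The function names, the use of the population standard deviation and the toy outputs are assumptions for illustration, and the sketch does not reproduce the exact procedures of the cited methods.

```python
import statistics


def neuron_statistics(outputs):
    """Per-neuron statistics over calibration samples.

    outputs: list of samples, each a list with one output value per neuron.
    """
    per_neuron = [list(values) for values in zip(*outputs)]
    return {
        "mean": [statistics.fmean(v) for v in per_neuron],
        "std": [statistics.pstdev(v) for v in per_neuron],
        "mean_sq": [statistics.fmean([x * x for x in v]) for v in per_neuron],
    }


def lowest_neurons(stat, num_remove):
    """Indices of the num_remove neurons with the smallest statistic."""
    order = sorted(range(len(stat)), key=lambda i: stat[i])
    return sorted(order[:num_remove])


# Toy outputs of 3 neurons on 4 calibration samples (illustrative values only)
outputs = [[0.1, 2.0, -1.0],
           [0.2, 2.0, 1.0],
           [0.0, 2.0, -1.0],
           [0.1, 2.0, 1.0]]

stats = neuron_statistics(outputs)
removed = {name: lowest_neurons(values, 1) for name, values in stats.items()}
print(removed)  # {'mean': [2], 'std': [1], 'mean_sq': [0]}
```

On this toy input each statistic selects a different neuron, and in every case an entire neuron is removed rather than individual weights.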

We also compare with several prior methods on pruning BERT (Devlin et al., 2018). In Table 15, we provide a summary of existing pruning methods, mostly for pruning BERT. A key distinction between these methods and our work is that they interleave pruning heavily with the fine-tuning process. Another difference is that BERT pruning methods focus on performance on a downstream task, rather than preserving the general performance of pretrained language models.

We adapt these prior methods for pruning LLMs, where the goal is to preserve the language modeling ability. Thus we use the pre-training auto-regressive loss to compute their pruning metrics. We evaluate two settings: one-shot pruning and one-shot pruning followed by fine-tuning. For one-shot pruning, we use the pruning metrics listed in Table 15 to prune LLMs. For the second setting, we fine-tune the pruned LLMs within a limited computational budget, i.e., one day. Results are summarized in Table 16. We observe that these pruning methods are not effective when adapted for pruning LLMs.

Appendix D Complementary Experimental Results

In this section, we supplement the main paper with additional experimental results. This includes robustness analysis under random seeds (Appendix D.1), evaluation at higher unstructured sparsity levels (Appendix D.2), few-shot results (Appendix D.3) and a detailed performance breakdown for zero-shot tasks (Appendix D.4 and Appendix D.5).

D.1 Robustness Analysis

In this part, we perform a robustness analysis of our results in Section 4.2. The result in Table 3 is evaluated under a fixed calibration set. Since both SparseGPT and Wanda require calibration data to estimate input statistics, we sample different calibration sets under 5 random seeds and evaluate these two pruning methods. In Table 17, we report the perplexity (mean and standard deviation) of pruned LLaMA models under 5 random seeds. In many cases, the variance across random seeds is lower for Wanda, suggesting that Wanda is more stable with variations in the calibration sets.
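The protocol above can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_calibration_set`, `robustness_summary` and the `prune_and_evaluate` callback are hypothetical names, the sample standard deviation is an assumed choice, and the stand-in metric below is a placeholder that is not a perplexity.

```python
import random
import statistics


def sample_calibration_set(corpus, num_samples, seed):
    """Draw a calibration set without replacement, determined by the seed."""
    rng = random.Random(seed)
    return rng.sample(corpus, num_samples)


def robustness_summary(corpus, num_samples, seeds, prune_and_evaluate):
    """Mean and sample standard deviation of a metric across seeds.

    prune_and_evaluate is a hypothetical callback: it takes a calibration
    set, prunes the model with it and returns a scalar such as perplexity.
    """
    scores = [
        prune_and_evaluate(sample_calibration_set(corpus, num_samples, seed))
        for seed in seeds
    ]
    return statistics.mean(scores), statistics.stdev(scores)


def stand_in_metric(calibration_set):
    """Placeholder so that the sketch runs; not a real evaluation."""
    return sum(calibration_set) / len(calibration_set)


# Integers stand in for calibration sequences in this toy usage
corpus = list(range(1000))
seeds = [0, 1, 2, 3, 4]
mean_score, std_score = robustness_summary(corpus, 8, seeds, stand_in_metric)
```

Fixing the seed makes each calibration set reproducible, so the reported spread reflects only the choice of calibration data.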

D.2 Higher Sparsity

In Section 4, we have evaluated unstructured pruning with a sparsity level of 50$\%$. This is to follow the evaluation setup of Frantar & Alistarh (2023). In this part, we evaluate on higher sparsity levels, i.e., 60$\%$ and 80$\%$. Results for these two sparsity levels are shown in Table 18 and Table 19 respectively. At 60$\%$ sparsity, Wanda remains competitive with SparseGPT. At 80$\%$ sparsity, SparseGPT is able to outperform Wanda, but the performance drop compared to the dense counterpart is significant. The best 80$\%$ sparse model (perplexity 25.86) underperforms the smallest dense model, LLaMA-7B (perplexity 5.68), by a large gap. This suggests that at extreme sparsity levels, it may be better to use a small dense model trained to convergence instead.

D.3 Few-shot results on MMLU

Our experiments in Section 4.1 focus on zero-shot evaluation. However, LLMs are also known for their ability to learn in context. In this part, we conduct additional evaluation on few-shot tasks. Specifically, we choose the Massive Multitask Language Understanding benchmark (MMLU) (Hendrycks et al., 2021). In alignment with the evaluation methodology of Touvron et al. (2023a), we perform 5-shot evaluation. In Table 20, we report the mean accuracies for both dense LLMs and sparse LLMs with unstructured 50$\%$ sparsity. In the few-shot setting, Wanda performs competitively with SparseGPT. Notably, large sparse LLMs surpass smaller dense counterparts, e.g., sparse LLaMA-13B/LLaMA-2-13B versus dense LLaMA-7B/LLaMA-2-7B. This trend is not observed with the standard magnitude pruning approach.

D.4 Fine-tuning

In Table 6 of Section 5, we report the mean zero-shot accuracies after fine-tuning Wanda-pruned LLaMA-7B models. In this part, we report the task-wise performance of these fine-tuned models. Results are summarized in Table 21. For per-task accuracies, most of the performance drop during pruning can be recovered through fine-tuning. Note that here we are performing fine-tuning under a limited computational budget (12 hours for LoRA fine-tuning and 3 days for full parameter fine-tuning). It remains to be seen if the gap between sparse pruned LLMs and the dense counterparts can be fully recovered given more computational budget.

D.5 Zero-Shot Tasks

For zero-shot results in Section 4.1, the 7 evaluated zero-shot tasks are: BoolQ (Clark et al., 2019), RTE (Wang et al., 2018), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2019), ARC Easy and Challenge (Clark et al., 2018), and OpenbookQA (Mihaylov et al., 2018). For reproducibility, we used commit df3da98 on the main branch. All tasks were evaluated on task version 0 except for BoolQ, where the evaluated version was 1. We show the task-wise performance in Tables 22, 23, 24, 25, 26 and 27.