A Survey on Model Compression for Large Language Models

Xunyu Zhu, Jian Li, Yong Liu, Can Ma, Weiping Wang

Introduction

Large Language Models (LLMs) Zhao et al. (2023); Huang and Chang (2023); Chang et al. (2023) consistently exhibit remarkable performance across various tasks. Nevertheless, their exceptional capabilities come with significant challenges stemming from their extensive size and computational requirements. For instance, the GPT-175B model Brown et al. (2020), with an impressive 175 billion parameters, demands a minimum of 320GB (using multiples of 1024) of storage in half-precision (FP16) format. Furthermore, deploying this model for inference necessitates at least five A100 GPUs, each featuring 80GB of memory, to efficiently manage operations. To tackle these issues, a prevalent approach known as model compression Deng et al. (2020); He et al. (2018) offers a solution. Model compression involves transforming a large, resource-intensive model into a compact version suitable for storage on constrained mobile devices. Additionally, it can involve optimizing the model for faster execution with minimal latency or achieving a balance between these objectives.
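The storage and GPU figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only: it counts weights alone (no activations, KV cache, or framework overhead), and it treats an 80GB A100 as 80 GiB, which is an assumption made for simplicity. The helper name `fp16_footprint_gib` is hypothetical.

```python
import math

def fp16_footprint_gib(num_params: int, bytes_per_param: int = 2) -> float:
    """Weight storage in GiB (multiples of 1024); weights only."""
    return num_params * bytes_per_param / 1024**3

params = 175_000_000_000                   # 175 billion parameters
weights_gib = fp16_footprint_gib(params)   # FP16 uses 2 bytes per parameter
gpus_needed = math.ceil(weights_gib / 80)  # assumes 80 GiB usable per GPU

print(f"{weights_gib:.0f} GiB of weights, at least {gpus_needed} GPUs")
```

Under these assumptions the weights alone occupy a little over 320 GiB, which cannot fit on four 80GB devices and therefore requires at least five.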

Apart from their technical aspects, LLMs have triggered discussions on environmental and ethical matters. These models pose significant challenges for engineers and researchers in developing nations, where limited resources can impede access to essential hardware for model execution Lin et al. (2023). Additionally, the substantial energy consumption of LLMs contributes to carbon emissions, underscoring the significance of sustainable practices in AI research. A promising solution to these challenges lies in utilizing model compression techniques, which have showcased the potential to reduce emissions without substantially compromising performance Luccioni et al. (2022). By implementing model compression, we can tackle environmental concerns, enhance accessibility, and promote inclusivity in LLM deployment.

In our paper, our primary objective is to illuminate the recent strides made in the domain of model compression techniques tailored specifically for LLMs. Our work entails an exhaustive survey of methodologies, metrics, and benchmarks, which we meticulously organize into an innovative taxonomy. As illustrated in Figure 1, our proposed taxonomy provides a structured framework for understanding the landscape of Model Compression methods for LLMs. This exploration encompasses a thorough examination of well-established techniques, including but not limited to pruning, knowledge distillation, quantization, and low-rank factorization. Furthermore, our study sheds light on prevailing challenges and offers a glimpse into potential future research trajectories in this evolving field. We advocate for collaborative efforts within the community to pave the way for an ecologically conscious, all-encompassing, and sustainable future for LLMs. Notably, our work stands as the inaugural survey specifically addressing the realm of model compression for LLMs.

Methods

1 Pruning

Pruning is a powerful technique for reducing the size or complexity of a model by removing unnecessary or redundant components LeCun et al. (1989); Han et al. (2015); Li et al. (2017). Many parameters in a model are redundant and have little or even no effect on its performance, so removing them directly causes only a small drop in performance. At the same time, pruning can make the model more storage-friendly Ardakani et al. (2019), memory-efficient Han et al. (2015); Yang et al. (2017), and computation-efficient Li et al. (2017). Pruning can be divided into Unstructured Pruning Zhang et al. (2018); Gordon et al. (2020) and Structured Pruning Anwar et al. (2017); Fang et al. (2023). The main difference between the two lies in the pruning targets and the resulting network structure. Structured pruning removes connections or hierarchical structures based on specific rules while preserving the overall network structure. Unstructured pruning, on the other hand, prunes individual parameters, resulting in an irregular sparse structure. Recent research efforts have been devoted to combining LLMs with pruning techniques, aiming to tackle the substantial size and computational costs associated with LLMs. In this section, we systematically categorize these works based on whether they employ structured or unstructured pruning strategies.

1.1 Unstructured Pruning

Unstructured pruning simplifies an LLM by removing specific parameters without considering its internal structure. This approach targets individual weights or neurons in the LLM, usually by applying a threshold and zeroing out the parameters that fall below it. However, this method disregards the overall LLM structure, resulting in an irregular sparse model composition. Such irregularity demands specialized compression techniques for efficient storage and computation of the pruned model. Unstructured pruning often involves substantial retraining to regain accuracy, which is especially expensive for LLMs. An innovative approach in this domain is SparseGPT Frantar and Alistarh (2023). It introduces a one-shot pruning strategy that doesn’t require retraining. The method frames pruning as an extensive sparse regression problem and solves it using an approximate sparse regression solver. SparseGPT achieves significant unstructured sparsity, even up to 60% on the largest GPT models like OPT-175B and BLOOM-176B, with minimal increase in perplexity. In contrast, Syed et al. propose an iterative pruning technique that fine-tunes the model during pruning with minimal training steps. Another advancement is LoRAPrune Zhang et al. (2023a), which combines parameter-efficient tuning (PEFT) methods with pruning to enhance performance on downstream tasks. It introduces a unique parameter importance criterion using values and gradients from Low-Rank Adaptation (LoRA) Hu et al. (2022). In response to the resource-intensive weight update process still required by SparseGPT, Wanda Sun et al. (2023) presents a new pruning metric. Wanda evaluates each weight based on the product of its magnitude and the norm of the corresponding input activations, approximated using a small calibration dataset. This metric is employed for local comparisons within linear layer outputs, enabling the removal of lower-priority weights from LLMs.
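To make the Wanda-style metric concrete, the following is a minimal NumPy sketch, not the authors' implementation. It scores each weight by its magnitude times the L2 norm of the matching input channel over a calibration batch, then drops the lowest-scoring weights within each output row. The function name `wanda_prune`, the tensor shapes, and the 50% sparsity level are assumptions chosen for illustration.

```python
import numpy as np

def wanda_prune(W, X, sparsity=0.5):
    """W: (out_features, in_features); X: (tokens, in_features) calibration inputs.

    Returns the pruned weights and the boolean keep-mask.
    """
    act_norm = np.linalg.norm(X, axis=0)        # L2 norm per input channel
    scores = np.abs(W) * act_norm               # |weight| * activation norm
    k = int(W.shape[1] * sparsity)              # weights to drop in each row
    drop = np.argsort(scores, axis=1)[:, :k]    # k lowest scores per output row
    mask = np.ones(W.shape, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
X = rng.normal(size=(16, 8))                    # stand-in for calibration data
W_pruned, keep = wanda_prune(W, X, sparsity=0.5)
```

Because the comparison is made per output row, every row ends up with the same number of surviving weights, and no weight update or retraining step is involved.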

1.2 Structured Pruning

Structured pruning simplifies an LLM by removing entire structural components, such as neurons, channels, or layers. This approach targets whole sets of weights at once, offering the advantage of reducing model complexity and memory usage while keeping the overall LLM structure intact. To explore the application and efficacy of structured pruning methods for LLMs, GUM Santacroce et al. (2023) analyzes several structured pruning methods applied to decoder-only LLMs on NLG tasks, and finds that established structured pruning methods do not take into account the distinctiveness of neurons, leaving behind excess redundancies. To solve this problem, GUM introduces a proof-of-concept method that maximizes both sensitivity and uniqueness by pruning network components based on their global movement and local uniqueness scores. LLM-Pruner Ma et al. (2023) takes a versatile approach to compressing LLMs while safeguarding their multi-task solving and language generation capabilities. LLM-Pruner also tackles the challenges that arise from the substantial training data used for LLMs, which can lead to significant data transfers and post-training model sizes. To overcome these challenges, LLM-Pruner incorporates a dependency detection algorithm to pinpoint interdependent structures within the model. It also implements an efficient importance estimation method that considers both first-order information and approximated Hessian information. This strategy aids in selecting optimal groups for pruning, thereby improving the compression process.
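The difference from unstructured pruning can be illustrated with a small sketch that removes whole hidden neurons from a two-layer feed-forward block. This is neither GUM nor LLM-Pruner: the importance score (the L2 norm of each neuron's incoming weights) is a simple stand-in assumed for illustration, and `prune_ffn_neurons` is a hypothetical helper. The sketch does show the kind of coupling that dependency detection has to respect: dropping neuron i removes row i of the first matrix and column i of the second.

```python
import numpy as np

def prune_ffn_neurons(W1, W2, keep_ratio=0.5):
    """W1: (hidden, d_in) up-projection; W2: (d_out, hidden) down-projection."""
    importance = np.linalg.norm(W1, axis=1)              # one score per hidden neuron
    n_keep = max(1, int(W1.shape[0] * keep_ratio))
    keep = np.sort(np.argsort(importance)[-n_keep:])     # indices of retained neurons
    # Coupled structures are removed together so the shapes stay consistent.
    return W1[keep, :], W2[:, keep]

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))
W1_small, W2_small = prune_ffn_neurons(W1, W2, keep_ratio=0.5)
```

The result is a pair of smaller dense matrices, so the pruned block runs on standard dense kernels without any sparse storage format.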

2 Knowledge Distillation

Knowledge Distillation (KD) Hinton et al. (2015); Kim and Rush (2016); Tung and Mori (2019) is a valuable machine learning technique aimed at improving model performance and generalization. It achieves this by transferring knowledge from a complex model, referred to as the teacher model, to a simpler counterpart known as the student model. The core idea behind KD involves transforming the comprehensive knowledge of the teacher model into a more streamlined and effective representation. In this section, we offer an overview of distillation methods that employ LLMs as teachers. We categorize these methods into two distinct groups: Black-box KD, in which only the teacher’s predictions are accessible, and White-box KD, in which the teacher’s parameters are also available for use. For a visual representation, Figure 2 provides a brief classification of knowledge distillation for LLMs.

2.1 White-box KD

In White-box KD, not only are the teacher LLM’s predictions accessible, but access to and utilization of the teacher LLM’s parameters are also allowed. This method enables the student LM to gain a deeper understanding of the teacher LLM’s internal structure and knowledge representations, often resulting in higher-level performance improvements. White-box KD is typically used to assist smaller student LMs in learning and replicating the knowledge and capabilities of larger, more powerful teacher LLMs Gou et al. (2021); Park et al. (2019); Zhao et al. (2022); Liu et al. (2021a). An illustrative example is MINILLM Gu et al. (2023), which delves into distillation from white-box generative LLMs. It observes a challenge with minimizing forward Kullback-Leibler divergence (KLD): doing so can lead the student to assign overly high probabilities to regions that are unlikely under the teacher’s distribution, causing improbable samples during free-run generation. To address this, MINILLM opts for minimizing reverse KLD. This approach prevents the student from overestimating low-probability regions within the teacher’s distribution, thereby refining the quality of generated samples. In contrast, GKD Agarwal et al. (2023) explores distillation from auto-regressive models, of which white-box generative LLMs are a subset. This method identifies two key issues: a distribution mismatch between output sequences during training and those generated by the student during deployment, and model under-specification, where the student model might lack the expressive power to match the teacher’s distribution. GKD handles the distribution mismatch by sampling output sequences from the student during training. It also tackles model under-specification by optimizing alternative divergences like reverse KL. To realize task-agnostic zero-shot evaluated distillation for LLMs without access to end-task finetuning data, TF-LLMD Jha et al. (2023) uses a truncated model with a subset of layers from the larger model for initialization, and trains the model on pretraining data using a language modeling objective.
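The contrast between forward and reverse KLD can be seen on a toy next-token distribution. The numbers below are invented purely for illustration and do not come from MINILLM or GKD. The teacher has two likely tokens and one unlikely token; one candidate student spreads its mass over all three tokens, while the other commits to a single likely token.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

teacher = np.array([0.49, 0.49, 0.02])         # third token is unlikely
student_spread = np.array([0.34, 0.33, 0.33])  # puts mass on the unlikely token
student_peaked = np.array([0.96, 0.02, 0.02])  # commits to one likely token

forward = {name: kl(teacher, s) for name, s in
           [("spread", student_spread), ("peaked", student_peaked)]}
reverse = {name: kl(s, teacher) for name, s in
           [("spread", student_spread), ("peaked", student_peaked)]}
```

In this toy case forward KLD scores the spread-out student as the better fit even though it gives a third of its mass to a token the teacher considers unlikely, whereas reverse KLD prefers the peaked student. This mirrors the over-estimation of low-probability regions described above.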

2.2 Black-box KD

In Black-box KD, only the predictions made by the teacher LLM are accessible. Recently, black-box KD has shown promising results in fine-tuning small models on the prompt-response pairs generated by LLM APIs Li et al. (2022); Ho et al. (2023); Hsieh et al. (2023). At the same time, recent research Wei et al. (2022a); Schaeffer et al. (2023); Zhao et al. (2023) underscores that, as model size is scaled up, LLMs like GPT-3 (175B parameters) and PaLM (540B parameters) showcase unique behaviors when compared to smaller models like BERT (330M parameters) and GPT-2 (1.5B parameters). These LLMs exhibit surprising capabilities, referred to as Emergent Abilities, when tackling intricate tasks. Emergent Abilities encompass several intriguing facets, including In-Context Learning (ICL) Dong et al. (2023); Wang et al. (2023b), Chain-of-Thought (CoT) Wei et al. (2022b); Wang et al. (2023c); Shi et al. (2023), and Instruction Following (IF) Ouyang et al. (2022); Brooks et al. (2023). In our paper, we further categorize black-box KD methods according to which Emergent Ability is utilized. Thus, we also refer to Black-box KD as EA-based KD. For a visual overview, refer to Figure 3, which provides a concise representation of the EA-based Knowledge Distillation concept.

ICL employs a structured natural language prompt that contains task descriptions and possibly a few task examples as demonstrations. Through these task examples, LLMs can grasp and perform new tasks without necessitating explicit gradient updates. The work by Huang et al. introduces ICL distillation, which transfers in-context few-shot learning and language modeling capabilities from LLMs to small language models (SLMs). This is accomplished by combining in-context learning objectives with traditional language modeling objectives. To achieve this, they explore ICL distillation under two few-shot learning paradigms: Meta In-context Tuning (Meta-ICT) and Multitask In-context Tuning (Multitask-ICT). In Meta-ICT, the language model undergoes meta-training across diverse tasks using in-context learning objectives. This equips it to adapt to unseen tasks through in-context learning, thereby extending its problem-solving capabilities. On the other hand, Multitask-ICT fine-tunes the model using ICL objectives and a handful of examples from target tasks. Subsequently, it employs in-context learning for making predictions on these tasks. Comparing the two paradigms, Multitask-ICT exhibits superior performance over Meta-ICT, but it demands greater computational resources during task adaptation.

CoT takes a different approach compared to ICL by incorporating intermediate reasoning steps, which can lead to the final output, into the prompts instead of using simple input-output pairs. MT-COT Li et al. (2022) aims to leverage the explanations produced by LLMs to enhance the training of smaller reasoners. It utilizes a multi-task learning framework to empower smaller models with strong reasoning capabilities alongside the ability to generate explanations. CoT Prompting Magister et al. (2023) explores the transferability of such reasoning capabilities to smaller models via knowledge distillation, and finds that there is a trade-off between model and dataset size on reasoning capabilities. Fine-tune-CoT Ho et al. (2023) takes a step further by generating multiple reasoning solutions from LLMs through stochastic sampling. This augmentation of training data aids student models in their learning process. SSLM Fu et al. (2023a) identifies a trade-off between the multi-dimensional capabilities of language models and proposes fine-tuning an instruction-tuned model. It distills CoT reasoning paths from a large teacher model to improve out-of-distribution generalization. Distilling Step-by-Step Hsieh et al. (2023) employs LLM rationales as additional guidance for training smaller models within a multi-task framework. SOCRATIC CoT Shridhar et al. (2023) trains two distilled models: a problem decomposer and a subproblem solver. The decomposer breaks down an original problem into a sequence of subproblems, while the subproblem solver handles solving these subproblems. For rationale faithfulness, SCOTT Wang et al. (2023a) employs contrastive decoding, which links each rationale to the answer. It encourages relevant rationales from the teacher. Additionally, the student is guided to engage in counterfactual reasoning and predict based on rationales that lead to different answers. In PaD Zhu et al. (2023a), student models are reinforced with program-aided reasoning and are aided in overcoming faulty reasoning steps through automated error checking. LMTWA Saha et al. (2023) explores the use of LLMs as teachers to enhance the performance of weaker agents through natural language explanations. Specifically, it introduces a student-teacher framework, investigating when and how the teacher should intervene with explanations. The study proposes a Theory of Mind approach for personalized and budget-conscious teaching, demonstrates the long-term impact of teacher explanations, and cautions about the potential negative effects of misaligned teachers intentionally misleading students.

IF endeavors to enhance the competence of language models in executing new tasks solely based on reading task descriptions, without relying on few-shot examples. By undergoing fine-tuning using an array of tasks expressed as instructions, language models showcase the capacity to accurately execute tasks described in previously unseen instructions. For instance, Lion Jiang et al. (2023) harnesses the adaptable nature of LLMs to improve student model performance. It prompts the LLM to identify and generate the “hard” instructions, which are then utilized to enhance the student model’s capabilities. This approach taps into the versatility of LLMs to guide the learning of student models in addressing complex instructions and tasks. LaMini-LM Wu et al. (2023a) addresses the challenge of resource-intensive language models, which demand substantial computational power and memory, rendering them inaccessible to many researchers and developers. To tackle this issue, LaMini-LM has developed an extensive collection of 2.58 million instructions, comprising both existing and newly generated instructions. These instructions are utilized in fine-tuning a diverse array of models, providing an effective solution to this problem.

3 Quantization

In the domain of model compression, quantization has emerged as a widely embraced technique to alleviate the storage and computational overhead of deep learning models Liu et al. (2021b); Gholami et al. (2022); Guo et al. (2020). While traditional representation employs floating-point numbers, quantization converts them to integers or other discrete forms. This transformation significantly reduces storage requirements and computational complexity. Although some precision loss is inherent, careful quantization techniques can achieve substantial model compression with only minimal accuracy degradation. Quantization can be categorized into two main approaches: quantization-aware training (QAT) Tailor et al. (2021); Kim et al. (2022); Ding et al. (2022), and post-training quantization (PTQ) Liu et al. (2021b); Nagel et al. (2020); Fang et al. (2020). The primary distinction between these approaches lies in when quantization is applied to compress the model. QAT employs quantization during the model’s training or fine-tuning process, whereas PTQ quantizes a model after it has completed its training. Recent research endeavors have harnessed quantization to compress LLMs, yielding impressive outcomes, and we classify these efforts into the same two approaches. Furthermore, Table 1 serves as a summarized reference for quantization methods applied to LLMs. The table classifies these works into 8-bit quantization and lower-bit quantization, based on the number of bits (precision) in the weights of the LLM.
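As a minimal illustration of the float-to-integer conversion described above, the sketch below applies symmetric round-to-nearest 8-bit quantization with a single per-tensor scale. It is a generic textbook scheme rather than any specific method surveyed here; the function names are hypothetical and the input is assumed to contain at least one non-zero value.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization; assumes w has a non-zero entry."""
    scale = float(np.max(np.abs(w))) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map the integers back to approximate floating-point values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
max_err = float(np.max(np.abs(w - dequantize(q, scale))))
```

The integer tensor takes a quarter of the FP32 storage, and the rounding error of each weight is bounded by about half the scale, which is the "inherent precision loss" traded for compression.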

3.1 Quantization-Aware Training

In QAT, the quantization objective is seamlessly integrated into the model’s training process. This approach enables the LLM to adapt to low-precision representations during training, enhancing its capacity to handle precision loss caused by quantization. This adaptation aims to preserve higher performance even after the quantization process. For instance, LLM-QAT Liu et al. (2023) delves into the challenges of acquiring training data for LLMs. Given that gathering training data for LLMs can be demanding, LLM-QAT proposes an innovative solution. It leverages generations produced by a pretrained model to achieve data-free distillation. This approach significantly aids in circumventing the data collection challenge. Additionally, LLM-QAT goes a step further by quantizing not only weights and activations but also key-value (KV) caches. This strategy aims to enhance throughput and support longer sequence dependencies. A noteworthy achievement of LLM-QAT is its ability to distill large LLaMA models with quantized weights and KV caches down to just 4 bits. This result demonstrates the feasibility of producing accurate 4-bit quantized LLMs. On the other hand, PEQA Kim et al. (2023a) and QLORA Dettmers et al. (2023a) both fall under the category of quantization-aware Parameter-Efficient Fine-Tuning (PEFT) techniques Liu et al. (2022); Ding et al. (2023); Fu et al. (2023b). These techniques focus on facilitating model compression and accelerating inference. PEQA employs a dual-stage process. In the first stage, each fully-connected layer’s parameter matrix is quantized into a matrix of low-bit integers and a scalar vector. In the second stage, fine-tuning occurs on the scalar vector for each specific downstream task. QLORA introduces innovative concepts like a new data type, double quantization, and paged optimizers. These ideas are aimed at conserving memory without compromising performance. QLORA enables large models to undergo fine-tuning on a single GPU while achieving state-of-the-art results on the Vicuna benchmark Chiang et al. (2023).
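The two-stage idea described for PEQA can be sketched as follows. This is a simplified illustration under several assumptions, not the published method: quantization is symmetric with one scale per output channel (no zero-points), the helper names are hypothetical, and the second stage replaces gradient-based fine-tuning with a closed-form least-squares refit of the scale vector, which is only a stand-in that keeps the example short and deterministic.

```python
import numpy as np

def decompose(W, bits=4):
    """Stage 1: W ~ scale[:, None] * Q with low-bit integers Q (rows must be non-zero)."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for symmetric 4-bit
    scale = np.abs(W).max(axis=1) / qmax          # one scale per output channel
    Q = np.clip(np.round(W / scale[:, None]), -qmax, qmax).astype(np.int8)
    return Q, scale

def refit_scales(Q, X, target):
    """Stage 2 stand-in: update only the scale vector; the integers stay frozen."""
    Z = X @ Q.T.astype(X.dtype)                   # layer outputs for unit scales
    return (Z * target).sum(axis=0) / ((Z * Z).sum(axis=0) + 1e-12)

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 16))
X = rng.normal(size=(32, 16))                     # stand-in for task inputs
target = X @ W.T                                  # outputs of the original layer

Q, scale = decompose(W)
tuned = refit_scales(Q, X, target)
err_before = np.linalg.norm((X @ Q.T) * scale - target)
err_after = np.linalg.norm((X @ Q.T) * tuned - target)
```

Only one value per output channel is adjusted in the second stage, which is why this style of tuning is parameter-efficient: the integer matrix, which holds almost all of the storage, never changes.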

3.2 Post-Training Quantization

PTQ involves quantizing the parameters of an LLM after the completion of its training phase. The primary objective of PTQ is to diminish the storage and computational complexity of the LLM without necessitating modifications to the LLM architecture or requiring a retraining process. PTQ’s key advantage is its simplicity and efficiency in achieving model compression. However, PTQ can introduce a certain degree of precision loss due to the quantization procedure. This method serves as a straightforward way to enhance the efficiency of an LLM without significant alterations or extensive training efforts.

In PTQ, certain approaches focus on quantizing only the weights of LLMs to enhance efficiency and reduce computational demands. Specifically, LUT-GEMM Park et al. (2022) optimizes matrix multiplications within LLMs using weight-only quantization and the BCQ format Rastegari et al. (2016), reducing latency by improving computational efficiency. LLM.int8() Dettmers et al. (2022) employs 8-bit quantization for matrix multiplication in LLM transformers, effectively halving GPU memory usage during inference while maintaining performance precision. This method employs vector-wise quantization and mixed-precision decomposition to handle outliers for efficient inference. Remarkably, LLM.int8() enables inference in models with up to 175 billion parameters without performance compromise. GPTQ Frantar et al. (2022) acknowledges that the methods mentioned above work well for low compression targets like 8-bit weights, but face challenges in maintaining accuracy at higher rates. To tackle these challenges, GPTQ proposes a novel layer-wise quantization technique based on approximate second-order information. The result is a bitwidth reduction to 3 or 4 bits per weight, with minimal accuracy loss compared to the uncompressed version. Dettmers and Zettlemoyer delve into the trade-off between model size and bit precision in LLMs concerning zero-shot performance by analyzing inference scaling laws. Their extensive experimentation across various LLM families reveals that 4-bit precision is nearly universally optimal for achieving the right balance between total model bits and zero-shot accuracy. AWQ Lin et al. (2023) finds that weights are not equally important for LLMs’ performance, and protecting only 1% of salient weights can greatly reduce quantization error. Building on this insight, AWQ employs an activation-aware approach by considering the significance of weight channels corresponding to larger activation magnitudes, which play a pivotal role in processing vital features. The approach incorporates a per-channel scaling technique to identify optimal scaling factors that minimize quantization errors while quantizing all weights. OWQ Lee et al. (2023) theoretically analyzes how activation outliers can amplify the error in weight quantization. Drawing insights from this analysis, OWQ introduces a mixed-precision quantization scheme, which applies higher precision to the weights that are most sensitive to quantization error caused by activation outliers. To further compress accurate LLMs to 3-4 bits per parameter while staying near-lossless, SpQR Dettmers et al. (2023b) identifies and isolates outlier weights, storing them in higher precision, and compresses all other weights to 3-4 bits. SqueezeLLM Kim et al. (2023b) incorporates sensitivity-based non-uniform quantization and Dense-and-Sparse decomposition to enable lossless compression at ultra-low precisions down to 3 bits. Specifically, sensitivity-based non-uniform quantization searches for the optimal bit precision assignment based on second-order information, and Dense-and-Sparse decomposition stores outliers and sensitive weight values in an efficient sparse format. Motivated by the insight that quantization benefits from incoherent weight and Hessian matrices, QuIP Chee et al. (2023) realizes 2-bit quantization of LLMs by combining an adaptive rounding procedure that minimizes a quadratic proxy objective with efficient pre- and post-processing that ensures weight and Hessian incoherence via multiplication by random orthogonal matrices. To achieve high precision at a lower bit width while remaining cost-efficient, Norm Tweaking Li et al. (2023a) rectifies the quantized activation distribution to match its float counterpart, which helps restore accuracy for LLMs. It includes calibration data generation and channel-wise distance constraints to update the weights of normalization layers for better generalization. For the purpose of enhancing the accuracy of weight-only quantization, SignRound Cheng et al. (2023) introduces a lightweight block-wise tuning approach using signed gradient descent.
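Several of the weight-only methods above share one idea: keep a small set of outlier weights in higher precision and quantize the rest to a few bits. The sketch below illustrates that idea in its simplest form and is not SpQR, SqueezeLLM, or OWQ. It selects outliers purely by magnitude and uses uniform per-tensor round-to-nearest for the dense part, both of which are assumptions made for illustration; the function names are hypothetical.

```python
import numpy as np

def rtn(w, bits):
    """Symmetric per-tensor round-to-nearest; assumes w has a non-zero entry."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def dense_sparse_quantize(W, bits=4, outlier_frac=0.01):
    """Quantize the bulk of W to low precision and keep the largest weights exact."""
    k = max(1, int(W.size * outlier_frac))
    thresh = np.partition(np.abs(W).ravel(), -k)[-k]   # k-th largest magnitude
    outliers = np.abs(W) >= thresh
    sparse = np.where(outliers, W, 0.0)                # stored in full precision
    dense = np.where(outliers, 0.0, W)                 # quantized to `bits`
    return rtn(dense, bits) + sparse, outliers

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W[0, 0], W[3, 5] = 50.0, -40.0                         # inject two large outliers

plain = rtn(W, 4)
mixed, outliers = dense_sparse_quantize(W, bits=4, outlier_frac=0.01)
mse_plain = float(np.mean((plain - W) ** 2))
mse_mixed = float(np.mean((mixed - W) ** 2))
```

With a single scale, the two injected outliers stretch the quantization grid so far that most ordinary weights round to zero. Pulling 1% of the weights out of the dense tensor restores a fine grid for the remaining 99%.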

Beyond the above works that quantize only the weights of LLMs, many PTQ works quantize both the weights and the activations of LLMs. ZeroQuant Yao et al. (2022) integrates a hardware-friendly quantization scheme, layer-by-layer knowledge distillation, and optimized quantization support to reduce weight and activation precision in Transformer-based models to INT8 with minimal accuracy impact. SmoothQuant Xiao et al. (2022) addresses the challenge of quantizing activations, which is often more complex due to the presence of outliers. Observing that different tokens exhibit similar variations across their channels, SmoothQuant introduces a per-channel scaling transformation that effectively smooths the magnitudes, rendering the model more amenable to quantization. By conducting a systematic examination of various quantization schemes, model families, and quantization bit precision, ZeroQuant-V2 Yao et al. (2023) finds that 1) activation quantization is generally more susceptible to degradation than weight quantization, with smaller models often outperforming larger models in terms of activation quantization, and 2) none of the current methods can achieve the original model quality for quantization with either INT4-weight or INT4-weight-and-INT8-activation. To solve these problems, ZeroQuant-V2 introduces a technique called Low Rank Compensation (LoRC), which employs low-rank matrix factorization on the quantization error matrix to enhance model quality recovery with a minimal increase in model size. Recognizing the complexity of quantizing activations in LLMs, RPTQ Yuan et al. (2023) sheds light on the challenge stemming from the uneven ranges across different channels, in addition to the presence of outliers. To address this, RPTQ strategically arranges channels into clusters for quantization, effectively mitigating the discrepancies in channel ranges. Moreover, it integrates the channel reordering into the layer norm operation and linear layer weights to minimize associated overhead. OliVe Guo et al. (2023) finds that outliers are important while the normal values next to them are not; it therefore adopts an outlier-victim pair (OVP) quantization that handles outlier values locally with low hardware overhead and high performance gains. Outlier Suppression+ Wei et al. (2023) extends this understanding by confirming that harmful outliers within activations exhibit an asymmetric distribution, predominantly concentrating in specific channels. It introduces a novel strategy involving channel-wise shifting and scaling operations to rectify the asymmetric presentation of outliers and mitigate the impact of problematic channels, and it quantitatively analyzes the optimal values for shifting and scaling, taking into account both the asymmetric nature of the outliers and the quantization errors stemming from weights in the next layers. ZeroQuant-FP Wu et al. (2023b) explores the applicability of floating-point (FP) quantization, specifically focusing on FP8 and FP4 formats. The study reveals that for LLMs, FP8 activation consistently outperforms its integer counterpart (INT8), while in terms of weight quantization, FP4 demonstrates comparable, if not superior, performance compared to INT4. To address the challenges arising from the divergence between weights and activations, ZeroQuant-FP mandates that all scaling factors be powers of 2 and confines the scaling factors within a single compute group. Notably, ZeroQuant-FP also integrates the Low Rank Compensation (LoRC) strategy to further enhance the effectiveness of its quantization approach. FPTQ Li et al. (2023b) combines the advantages of the two recipes W8A8 and W4A16 to design a novel W4A8 post-training quantization method for the available open-sourced LLMs. It combines fine-grained weight quantization with layerwise activation quantization strategies, which feature a novel logarithmic equalization for the most intractable layers, to eliminate the necessity for further fine-tuning. QuantEase Behdin et al. (2023) is a layer-wise quantization framework that involves distinct quantization processes for individual layers. The core challenge is framed as a discrete-structured non-convex optimization problem, for which effective solutions are derived through the application of Coordinate Descent (CD) techniques. OmniQuant Shao et al. (2023) consists of two components: Learnable Weight Clipping (LWC) and Learnable Equivalent Transformation (LET). These components are designed to fine-tune a range of quantization parameters effectively. OmniQuant operates within a differentiable framework, employing block-wise error minimization, and achieves strong performance across a variety of quantization configurations.
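The per-channel scaling used by SmoothQuant can be sketched as a mathematically equivalent rewrite of a linear layer: activations are divided by a per-input-channel factor and the matching weight rows are multiplied by it, so the product is unchanged while activation outliers shrink. The specific form of the factor below, with a migration-strength exponent `alpha`, follows our recollection of the SmoothQuant paper and should be treated as an assumption here; the function name and shapes are hypothetical.

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: (tokens, d_in) activations; W: (d_in, d_out) weights.

    Returns a rescaled pair whose matrix product equals X @ W.
    """
    act_max = np.abs(X).max(axis=0)                # range of each input channel
    w_max = np.abs(W).max(axis=1)                  # range of each weight row
    s = act_max ** alpha / w_max ** (1 - alpha)    # per-channel smoothing factor
    return X / s, W * s[:, None], s

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))
X[:, 3] *= 100.0                                   # one outlier activation channel
W = rng.normal(size=(8, 4))

X_s, W_s, s = smooth(X, W)
```

After smoothing, the outlier channel's range is shared between activations and weights, so both tensors can be quantized with a coarser grid than the raw activations would tolerate.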

4 Low-Rank Factorization

Low-Rank Factorization Cheng et al. (2017); Povey et al. (2018); Idelbayev and Carreira-Perpiñán (2020) is a model compression technique that aims to approximate a given weight matrix by decomposing it into two or more smaller matrices with significantly lower dimensions. The core idea behind low-rank factorization involves finding a factorization of a large weight matrix $W$ into two matrices $U$ and $V$ such that $W\approx UV$, where $U$ is an $m\times k$ matrix, and $V$ is a $k\times n$ matrix, with $k$ being much smaller than $m$ and $n$. The product of $U$ and $V$ approximates the original weight matrix, leading to a substantial reduction in the number of parameters and computational overhead. In the field of LLM research, low-rank factorization has been widely adopted to fine-tune LLMs efficiently, e.g., LoRA Hu et al. (2022) and its variants Valipour et al. (2023); Zhang et al. (2023b); Chavan et al. (2023). In contrast to those works, we focus here on works that use low-rank factorization to compress LLMs. TensorGPT Xu et al. (2023) stores large embeddings in a low-rank tensor format, reducing the space complexity of LLMs and making them available on edge devices. Specifically, TensorGPT efficiently compresses the embedding layer in LLMs using the Tensor-Train Decomposition (TTD). By treating each token embedding as a Matrix Product State (MPS), the embedding layer can be compressed by a factor of up to 38.40 times, while still maintaining or even improving the model’s performance compared to the original LLM.
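One standard way to obtain such a factorization $W\approx UV$ is a truncated singular value decomposition. The sketch below is a generic illustration and is not the Tensor-Train procedure used by TensorGPT; the function name and the matrix sizes are assumptions.

```python
import numpy as np

def low_rank_factorize(W, k):
    """Return U (m x k) and V (k x n) from a rank-k truncated SVD of W."""
    left, sing, right_t = np.linalg.svd(W, full_matrices=False)
    U = left[:, :k] * sing[:k]          # fold the singular values into U
    V = right_t[:k, :]
    return U, V

rng = np.random.default_rng(0)
m, n, k = 32, 24, 4
W = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))   # exactly rank k

U, V = low_rank_factorize(W, k)
params_dense = m * n                    # 768 entries
params_factored = k * (m + n)           # 224 entries
```

Here the matrix is constructed to have rank exactly $k$, so the factors reproduce it up to floating-point error while storing fewer than a third of the entries. For real weight matrices the truncation is only approximate, and the choice of $k$ trades accuracy against size.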

Metrics and Benchmarks

Inference efficiency of LLMs can be measured using various metrics, which capture different aspects of performance. These metrics are commonly presented alongside accuracy and zero-shot ability to comprehensively evaluate the LLM.

Number of Parameters Ma et al. (2023); Dasgupta et al. (2023) in an LLM refers to the total count of learnable weights or variables that the LLM needs to optimize during training. In LLMs, parameters represent the weights in the connections between neurons or attention layers. In general, the more parameters an LLM has, the more expressive it can be, but it also requires more computational resources and memory for both training and inference.

1.2 Model Size

Model Size Shridhar et al. (2023); Li et al. (2022); Magister et al. (2023) typically refers to the disk space or memory footprint required to store the entire LLM, including weights, biases, and other necessary components. The model size is closely related to the number of parameters, as more parameters usually lead to a larger model size. However, other factors, like the data type used to represent the parameters and model architecture, can also influence the overall size.
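The dependence of model size on parameter count and data type can be made concrete with a small calculation. The sketch below counts weights only and ignores any other stored components; the helper name and the byte-width table are assumptions introduced for illustration.

```python
# Bytes needed to store one parameter in each format (weights only).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def model_size_gib(num_params, dtype):
    """Approximate weight storage in GiB (multiples of 1024)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024 ** 3


# A 175-billion-parameter model stored in half precision.
size_fp16 = model_size_gib(175e9, "fp16")
# The same parameter count stored with 4-bit weights.
size_int4 = model_size_gib(175e9, "int4")
```

For 175 billion parameters this gives roughly 326 GiB in FP16, consistent with the figure of at least 320GB quoted in the Introduction, and about a quarter of that with 4-bit weights.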

1.3 Compression Ratio

Compression Ratio Frantar and Alistarh (2023); Tao et al. (2023) represents the ratio between the original size of the uncompressed LLM and the size of the compressed LLM. A higher compression ratio indicates a larger reduction in size; it reflects more efficient compression when the LLM's functionality and performance are preserved.
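As a brief illustration of this definition, the sketch below computes the compression ratio from two sizes. The function name and the example sizes (a hypothetical 7-billion-parameter model stored at 2 bytes and at 0.5 bytes per weight) are assumptions, not results from any cited work.

```python
def compression_ratio(original_size, compressed_size):
    """Ratio of the uncompressed size to the compressed size (same units)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return original_size / compressed_size


# Hypothetical example: 7e9 weights at 2 bytes each versus 0.5 bytes each.
original_bytes = 14_000_000_000
compressed_bytes = 3_500_000_000
ratio = compression_ratio(original_bytes, compressed_bytes)
```

The ratio alone says nothing about quality, which is why it is usually reported together with accuracy or perplexity on the benchmarks discussed later.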

1.4 Inference Time

Inference time (i.e., latency) Kurtic et al. (2023); Frantar et al. (2022) measures the time taken by the LLM to process and generate responses for input data during inference or prediction. Inference time is particularly crucial for real-world applications where the LLM needs to respond to user queries or process large amounts of data in real-time.
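One common way to measure latency in practice is to time repeated calls after a few warm-up runs and report a robust statistic such as the median. The sketch below assumes this protocol; `measure_latency` and the stand-in workload `dummy_generate` are hypothetical names, and a real evaluation would call the model's own generation routine instead.

```python
import statistics
import time


def measure_latency(fn, *args, warmup=2, repeats=5):
    """Median wall-clock time in seconds of fn(*args) over `repeats` runs."""
    for _ in range(warmup):
        fn(*args)                      # warm-up runs are not timed
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def dummy_generate(prompt):
    # Stand-in workload; replace with an actual model call.
    return sum(ord(c) for c in prompt * 1000)


latency_s = measure_latency(dummy_generate, "hello")
```

When the model runs on an accelerator with asynchronous execution, the device would also need to be synchronized before each clock reading, which this CPU-only sketch omits.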

1.5 Floating Point Operations (FLOPs)

FLOPs Dettmers and Zettlemoyer (2022); Yuan et al. (2023); Wei et al. (2023) measure the number of arithmetic operations involving floating-point numbers (typically 32-bit or 16-bit) that the LLM performs when processing input data. FLOPs provide a useful way to estimate the computational requirements of an LLM and to compare the efficiency of different LLMs or compression techniques.
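As a rough illustration of how FLOPs are estimated, the sketch below counts the operations of a single linear layer applied to one input vector, before and after replacing the $m\times n$ weight matrix with rank-$k$ factors. It assumes the common convention that one multiply-accumulate counts as two FLOPs and ignores biases; the function names and layer sizes are hypothetical.

```python
def dense_linear_flops(m, n):
    """FLOPs for multiplying an m x n weight matrix by one input vector."""
    return 2 * m * n                   # one multiply and one add per weight


def low_rank_linear_flops(m, n, k):
    """FLOPs when the weight is stored as U (m x k) times V (k x n)."""
    return 2 * k * n + 2 * m * k       # apply V first, then U


# Hypothetical layer dimensions and rank.
m, n, k = 4096, 4096, 64
flops_dense = dense_linear_flops(m, n)
flops_low_rank = low_rank_linear_flops(m, n, k)
reduction = flops_dense / flops_low_rank
```

Such counts are hardware-independent estimates; measured latency can deviate from them because of memory bandwidth and kernel efficiency.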

2 Benchmarks and Datasets

The main goal of these benchmarks and datasets is to measure the effectiveness, efficiency, and accuracy of compressed LLMs in comparison to their uncompressed counterparts. They typically consist of diverse tasks that cover a range of natural language processing challenges.

The majority of research evaluates compressed LLMs on well-established NLP benchmarks and datasets. For instance, GLUE Wang et al. (2019b) and SuperGLUE Wang et al. (2019a) are designed for evaluating the performance of language models on a wide range of natural language understanding (NLU) tasks. LAMBADA Paperno et al. (2016) is designed to evaluate the context-dependent understanding of language models. LAMA Petroni et al. (2019) and StrategyQA Geva et al. (2021) are both designed to evaluate the reasoning ability of language models. SQuAD Rajpurkar et al. (2016) is designed for machine reading comprehension (MRC) tasks.

2.2 BIG-Bench

BIG-Bench (BBH) Srivastava et al. (2022) is a benchmark suite designed for LMs, covering over 200 NLP tasks, e.g., text comprehension tasks, inference tasks, and mathematical reasoning tasks. The aim of BBH is to evaluate the performance of LMs across these various complex tasks. Compressed LLMs are evaluated on BBH to measure their general capability on real-world tasks, which provides a multi-dimensional perspective on model performance and efficiency.

2.3 Unseen Instructions Datasets

Unseen instructions datasets are used to evaluate the performance of LLMs when facing arbitrary tasks. There are two prominent datasets, i.e., Vicuna-Instructions Chiang et al. (2023) and User-Oriented-Instructions Wang et al. (2023d). The Vicuna-Instructions dataset, generated by GPT-4, comprises 80 intricate questions designed to challenge baseline models. It spans nine distinct categories: generic, knowledge-based, roleplay, commonsense, fermi, counterfactual, coding, mathematical, and writing tasks. The User-Oriented-Instructions dataset is a curated collection of 252 instructions. It takes inspiration from 71 user-oriented applications, including Grammarly, StackOverflow, and Overleaf, rather than being centered around extensively studied NLP tasks. These datasets gauge how well compressed LLMs handle and execute arbitrary tasks when faced with unseen instructions.

Challenges and Future Directions

The research on model compression techniques for LLMs is still in its early stages. Compressed LLMs, as demonstrated in prior studies Frantar and Alistarh (2023); Liu et al. (2023); Ho et al. (2023), continue to exhibit a significant performance gap when compared to their uncompressed counterparts. By delving into more advanced model compression methods tailored for LLMs, we have the potential to enhance the performance of these compressed LLMs.

0.2 Performance-Size Trade-offs

Prior research Magister et al. (2023); Dettmers and Zettlemoyer (2022) highlights the delicate balance between Large Language Model (LLM) performance and model size. Analyzing this trade-off allows for optimal performance within hardware constraints. However, current work lacks theoretical and empirical insights into this trade-off. Future LLM compression research should conduct comprehensive analyses to guide advanced techniques. Understanding the relationship between performance and size empowers researchers to develop tailored compression methods, navigating the design space effectively for efficient solutions.

0.3 Dynamic LLM Compression

Despite the advancements in current compression methods, they still rely on manual design to determine the compressed size and structure of LLMs. This often involves a trial-and-error approach based on input data or task requirements. This process becomes particularly challenging in scenarios like knowledge distillation, where several trials are necessary to find suitable student models within computational constraints. This manual effort presents a practical hindrance. A promising solution emerges in the integration of Neural Architecture Search (NAS) techniques Elsken et al. (2019); Zoph and Le (2016); Zhu et al. (2021, 2023b) into the realm of compressing LLMs. NAS holds the potential to reduce the dependence on human-designed architectures, potentially revolutionizing LLM compression for improved efficiency and effectiveness.

0.4 Explainability

Earlier research Stanton et al. (2021); Xu et al. (2021) has raised significant concerns regarding the explainability of compression techniques applied to Pre-trained Language Models (PLMs). Notably, these same challenges extend to LLM compression methods as well. For example, there is no explanation of why CoT-distillation enables SLMs to acquire CoT ability and achieve good performance in reasoning tasks. Consequently, the integration of explainable compression approaches emerges as a crucial necessity for the progression of LLM compression applications. Moreover, the adoption of explainable compression not only addresses the issue of interpretability but also simplifies the evaluation procedure for compressed models. This, in turn, enhances the reliability and predictability of the models throughout the production phase.

Conclusion

In this thorough survey, we’ve explored model compression techniques for large language models (LLMs). Our coverage spanned compression methods, evaluation metrics, and benchmark datasets. By diving into LLM compression, we’ve highlighted its challenges and opportunities. As LLM compression advances, there’s a clear call for research into advanced methodologies specifically for LLMs, unlocking their potential across applications. This survey aims to be a valuable reference, providing insights into the current landscape and promoting ongoing exploration of this pivotal topic.

Acknowledgments

The work of Jian Li is supported partially by Natural Science Foundation of China (No. 62106257), China Postdoctoral Science Foundation (No. 2023T160680), and Excellent Talents Program of Institute of Information Engineering, CAS. The work of Yong Liu is supported partially by Natural Science Foundation of China (No. 62076234), Beijing Outstanding Young Scientist Program (No. BJJWZYJH012019100020098), the Unicom Innovation Ecological Cooperation Plan, and the CCF-Huawei Populus Grove Fund.